
Preface

In 1991 two of us, Luc Massart and Bernard Vandeginste, discussed, during one of
our many meetings, the possibility and necessity of updating the book
Chemometrics: a textbook. Some of the newer techniques, such as partial least
squares and expert systems, were not included in that book which was written
some 15 years ago. Initially, we thought that we could bring it up to date with
relatively minor revision. We could not have been more wrong. Even during the
planning of the book we witnessed a rapid development in the application of
natural computing methods, multivariate calibration, method validation, etc.
When approaching colleagues to join the team of authors, it was clear from the
outset that the book would not be an overhaul of the previous one, but an almost
completely new book. When forming the team, we were particularly happy to be
joined by two industrial chemometricians: Dr. Paul Lewi from Janssen
Pharmaceutica and Dr. Sijmen de Jong from Unilever Research Laboratorium
Vlaardingen, each having a wealth of practical experience. We are grateful to
Janssen Pharmaceutica and Unilever Research Vlaardingen that they allowed
Paul, Sijmen and Bernard to spend some of their time on this project. The three
other authors belong to the Vrije Universiteit Brussel (Prof. An Smeyers-Verbeke
and Prof. D. Luc Massart) and the Katholieke Universiteit Nijmegen (Professor
Lutgarde Buydens), thus creating a team in which university and industry are
equally well represented. We hope that this has led to an equally good mix of
theory and application in the new book.
Much of the material presented in this book is based on the direct experience of
the authors. This would not have been possible without the hard work and input of
our colleagues, students and post-doctoral fellows. We sincerely want to acknowl-
edge each of them for their good research and contributions without which we
would not have been able to treat such a broad range of subjects. Some of them
read chapters or helped in other ways. We also owe thanks to the chemometrics
community and at the same time we have to offer apologies. We have had the
opportunity of collaborating with many colleagues and we have profited from the
research and publications of many others. Their ideas and work have made this
book possible and necessary. The size of the book shows that they have been very
productive. Even so, we have cited only a fraction of the literature and we have not
included the more sophisticated work. Our wish was to consolidate and therefore
to explain those methods that have become more or less accepted, also to
newcomers to chemometrics. Our apologies, therefore, to those we did not cite, or cited only in passing: it is not a reflection on the quality of their work.
Each chapter saw many versions which needed to be entered and re-entered in
the computer. Without the help of our secretaries, we would not have been able to
complete this work successfully. All versions were read and commented on by all
authors in a long series of team meetings. We will certainly retain special
memories of many of our two-day meetings, for instance the one organized by Paul
in the famous abbey of the regular canons of Premontre at Tongerlo, where we
could work in peace and quiet as so many before us have done.
Much of this work also had to be done at home, which took away precious time
from our families. Their love, understanding, patience and support were indispens-
able for us to carry on with the seemingly endless series of chapters to be drafted,
read or revised.
Chapter 28

Introduction to Part B

In the introduction to Part A we discussed the "arch of knowledge" [1] (see Fig.
28.1), which represents the cycle of acquiring new knowledge by experimentation
and the processing of the data obtained from the experiments. Part A focused
mainly on the first step of the arch: a proper design of the experiment based on the
hypothesis to be tested, evaluation and optimization of the experiments, with the
accent on univariate techniques. In Part B we concentrate on the second and third
steps of the arch, the transformation of data and results into information and the
combination of information into knowledge, with the emphasis on multivariate
techniques.
In order to obtain information from a series of experiments, we need to interpret
the data. Very often the first step in understanding the data is to visualise them in a
plot or a graph. This is particularly important when the data are complex in nature.
These plots help in discovering any structure that might be present in the data and
which can be related to a property of the objects studied. Because plots have to be
represented on paper or on a flat computer screen, data need to be projected and
compressed.
Analytical results are often represented in a data table, e.g., a table of the fatty
acid compositions of a set of olive oils. Such a table is called a two-way multi-
variate data table. Because some olive oils may originate from the same region and
others from a different one, the complete table has to be studied as a whole instead
of as a collection of individual samples, i.e., the results of each sample are interpreted
in the context of the results obtained for the other samples. For example, one may
ask for natural groupings of the samples in clusters with a common property,
namely a similar fatty acid composition. This is the objective of cluster analysis
(Chapter 30), which is one of the techniques of unsupervised pattern recognition.
The results of the clustering do not depend on the way the results have been
arranged in the table, i.e., the order of the objects (rows) or the order of the fatty
acids (columns). In fact, the order of the variables or objects has no particular
meaning.
In another experiment we might be interested in the monthly evolution of some
constituents present in the olive oil. Therefore, we decide to measure the total
amount of free fatty acids and the triacylglycerol composition in a set of olive oil
Fig. 28.1. The arch of knowledge: Hypothesis → (design) → Experiment → (data) → Information → (induction/analysis, intelligence) → Knowledge → (deduction/synthesis, creativity) → Hypothesis.

samples under fixed storage conditions. Each month a two-way table is obtained
— six in total after six months. We could decide to analyse all six tables
individually. However, this would not provide information on the effect of the
storage time and its relation to the origin of the oil. It is more informative to
consider all tables together. They form a so-called three-way table. The analysis of
such a table is discussed in Chapter 31. If, in addition, all olive samples are split
into portions which are stored under different conditions, e.g., open and closed
bottles, darkness and daylight, we obtain several three-way tables or in general a
multi-way data table.
Some analytical instruments produce a table of raw data which need to be
processed into the analytical result. Hyphenated measurement devices, such as
HPLC linked to a diode array detector (DAD), form an important class of such
instruments. In the particular case of HPLC-DAD, data tables are obtained con-
sisting of spectra measured at several elution times. The rows represent the spectra
and the columns are chromatograms detected at a particular wavelength. Conse-
quently, rows and columns of the data table have a physical meaning. Because the
data table X can be considered to be a product of a matrix C containing the
concentration profiles and a matrix S containing the pure (but often unknown)
spectra, we call such a table bilinear. The order of the rows in this data table
corresponds to the order of the elution of the compounds from the analytical
column. Each row corresponds to a particular elution time. Such bilinear data
tables are therefore called ordered data tables. Trilinear data tables are obtained
from LC-detectors which produce a matrix of data at any instance during the
elution, e.g., an excitation-emission spectrum as a function of time. Bilinear and
trilinear data tables are also measured when a chemical reaction is monitored over
time, e.g., the biomass in a fermenter by near-infrared spectroscopy. An example
of a non-bilinear two-way table is a table of MS-MS spectra or a 2D-NMR
spectrum. These tables cannot be represented as a product of row and column
spectra.
So far, we have discussed the structure of individual data tables: two-way to
multiway, which may be bilinear to multilinear. In some cases two or more tables
are obtained for a set of samples. The simplest situation is a two-way data table
associated with a vector of properties, e.g., a table with the fatty acid composition
of olive oils and a vector with the coded region of the oils. In this case we do not
only want to display the data table, but we want to derive a classification rule,
which classifies the collected oils in the right category. With this classification
rule, oils of unknown origin can be classified. This is the area of supervised pattern
recognition, which classically is based on multivariate statistics (Chapter 33) but
more recently neural nets have been introduced in this field (Chapter 44). The
technique of neural networks belongs to the area of so-called natural computation
methods. Genetic algorithms — another technique belonging to the family of
natural computation methods — were discussed in Part A. They are called natural
because the principle of the algorithms to some extent mimics the way biological
systems function. Originally, neural networks were considered to be a model for
the functioning of the brain. In the example above, the property is a discrete class
(e.g., region of origin, or healthy versus ill person), which is not necessarily a quantitative value.
However, in many cases the property may be a numerical value, e.g., a concent-
ration or a pharmacological activity (Chapter 37). Modelling a vector of properties
to a table of measurements is the area of multivariate calibration, e.g., by principal
components regression or by partial least squares, which are described in Chapter
36. Here the degree of complexity of the data sets is almost unlimited: several data
tables may be predictors for a table of properties, the relationship between the
tables may be non-linear and some or all tables may be multiway.
As indicated before, the columns and the rows of a bilinear or trilinear dataset
have a particular meaning, e.g., a spectrum and a chromatogram or the concent-
ration profiles of reactants and the reaction products in an equilibrium or kinetic
study. The resulting data table is made up by the product of the tables of these pure
factors, e.g., the table of the elution profiles of the pure compounds and the table of
the spectra of these compounds. One of the aims of a study of such a table is the
decomposition of the table into its pure spectra and pure elution profiles. This is
done by factor analysis (Chapter 34).
A special type of table is the contingency table, which has been introduced in
Chapter 16 of Part A. In Part B the 2x2 contingency table is extended to the general
case (Chapter 32) which can be analyzed in a multivariate way. The above
examples illustrate that the complexity of data and operations discussed in Part B
require advanced chemometric techniques. Although Part B can be studied inde-
pendently from Part A, we will implicitly assume a chemometrics background
equivalent to Part A, where a more informal treatment of some of the same topics
can be found. For instance, vectors and matrices are introduced for the first time in
Chapter 9 of Part A, but are treated in more depth in Chapter 29 of Part B. Principal
components analysis, which was introduced in Chapter 17, is discussed in more
detail in Chapter 31. These two chapters provide the basis for more advanced
techniques such as Procrustes analysis, canonical correlation analysis and partial
least squares discussed in Chapter 35. In Part B we also concentrate on a number of
important application areas of multivariate statistics: multivariate calibration
(Chapter 36), quantitative structure-activity relationships (QSAR, Chapter 37),
sensory analysis (Chapter 38) and pharmacokinetics (Chapter 39).
The success of data analysis depends on the quality of the data. Noise and other
instrumental factors may hide the information in the data. In some instances it is
possible to improve the quality of the data by a suitable preprocessing technique
such as signal filtering and signal restoration by deconvolution (Chapter 40). Prior
to signal enhancement and restoration it may be necessary to transform the data by,
e.g., a Fourier transform or wavelet transform. Both are discussed in Chapter 40. A
special type of filter is the Kalman filter which is particularly applicable to the
real-time modelling of systems of which the model parameters are time dependent.
For instance, the slope and intercept of a calibration line may be subject to a drift.
At each new measurement of a calibration standard, the Kalman filter updates the
calibration factors, taking into account the uncertainty in the calibration factors
and in the data. Because the Kalman filter is driven by the difference between a
new measurement and the predicted value of that measurement, the filter ignores
outlying measurements, caused by stochastic or systematic errors.
The last step of the arch of knowledge is the transformation of information into
knowledge. Based on this knowledge one is able to make a decision. For instance,
values of temperatures and pressures in different parts of a process have to be
interpreted by the operator in the control room of the plant, who may take a series
of actions. Which action to take is not always obvious and some guidance is
usually found in a manual. In analytical method development the same situation is
encountered. Guidance is required for instance to select a suitable stationary phase
in HPLC or to select the solvents that will make up the mobile phase. This type of
knowledge may be available in the form of "If Then" rules. Such rules can be
combined in a rule-based knowledge base, which is consulted by an expert system
(Chapter 43). Questions such as "What if?" can also be answered by models
developed in Operations Research (Chapter 42). For instance, the average time a
sample has to wait in the sample queue can be predicted by queuing theory for
various priority strategies.
After completion of one cycle of the arch of knowledge, we are back at the
starting point of the arch, where we should accept or reject the hypothesis. At this
stage a new cycle can be started based on the knowledge gained from all previous
ones. The chemometric techniques described in Part A and B aim to support the
scientist in running through these cycles in an efficient way.

References

1. D. Oldroyd, The Arch of Knowledge. Methuen, New York (1986).


Chapter 29

Vectors, Matrices and Operations on Matrices

This chapter is an extension and generalization of the material presented in Chapter 9. Here we deal with the calculus of vectors and matrices from the point of
view of the analysis of a two-way multivariate data table, as defined in Chapter 28.
Such data arise when several measurements are made simultaneously on each
object in a set [1]. Usually these raw data are collected in tables in which the rows
refer to the objects and the columns to the measurements. For example, one may
obtain physicochemical properties such as lipophilicity, electronegativity, mole-
cular volume, etc., on a number of chemical compounds. The resulting table is
called a measurement table. Note that the assignment of objects to rows and of
measurements to columns is only conventional. It arises from the fact that often
there are more objects than measurements, and that printing of such a table is more
convenient with the smallest number of columns. In a cross-tabulation each element
of the table represents a count, a mean value or some other summary statistic for
the various combinations of the categories of the two selected measurements. In
the above example, one may cross the categories of lipophilicity with the categories
of electronegativity (using appropriate intervals of the measurement scales). When
each cell of such a cross-tabulation contains the number of objects that belong to
the combined categories, this results in a contingency table or frequency table
which is discussed extensively in Chapter 32. In a more general cross-tabulation,
each cell of the table may refer, for example, to the average molecular volume that
has been observed for the combined categories of lipophilicity and electronegativity.
One of the aims of multivariate analysis is to reveal patterns in the data, whether
they are in the form of a measurement table or in that of a contingency table. In this
chapter we will refer to both of them by the more algebraic term 'matrix'. In what
follows we describe the basic properties of matrices and of operations that can be
applied to them. In many cases we will not provide proofs of the theorems that
underlie these properties, as these proofs can be found in textbooks on matrix
algebra (e.g. Gantmacher [2]). The algebraic part of this section is also treated
more extensively in textbooks on multivariate analysis (e.g. Dillon and Goldstein
[1], Giri [3], Cliff [4], Harris [5], Chatfield and Collins [6], Srivastava and Carter
[7], Anderson [8]).
29.1 Vector space

In accordance with Section 9.1, we represent a vector z as an ordered vertical arrangement of numbers. The transpose z^T then represents an ordered horizontal arrangement of the same numbers. The dimension of a vector is equal to the number of its elements, and a vector with dimension n will be referred to as an n-vector.
A set of p vectors (z_1, ..., z_p) with the same dimension n is linearly independent if the expression:

$$\sum_{i=1}^{p} c_i \mathbf{z}_i = \mathbf{0} \qquad (29.1)$$

holds only when all p coefficients c_i are zero. Otherwise, the p vectors are linearly dependent (see also Section 9.2.8).
The following three vectors z_1, z_2 and z_3 with dimension four are linearly independent:

$$\mathbf{z}_1 = \begin{bmatrix} -1 \\ 4 \\ 0 \\ 2 \end{bmatrix} \qquad \mathbf{z}_2 = \begin{bmatrix} 2 \\ 2 \\ 3 \\ -1 \end{bmatrix} \qquad \mathbf{z}_3 = \begin{bmatrix} -5 \\ 1 \\ -6 \\ 4 \end{bmatrix}$$

as it is not possible to find a set of coefficients c_1, c_2, c_3 which are not all equal to zero, and which satisfy the system of four equations:

$$\begin{aligned}
-c_1 + 2c_2 - 5c_3 &= 0 \\
4c_1 + 2c_2 + 1c_3 &= 0 \\
0c_1 + 3c_2 - 6c_3 &= 0 \\
2c_1 - 1c_2 + 4c_3 &= 0
\end{aligned}$$

In the case of linearly dependent vectors, each of them can be expressed as a linear combination of the others. For example, the last of the three vectors below can be expressed in the form z_3 = z_1 - 2 z_2:

$$\mathbf{z}_1 = \begin{bmatrix} -1 \\ 4 \\ 0 \\ 2 \end{bmatrix} \qquad \mathbf{z}_2 = \begin{bmatrix} 2 \\ 2 \\ 3 \\ -1 \end{bmatrix} \qquad \mathbf{z}_3 = \begin{bmatrix} -5 \\ 0 \\ -6 \\ 4 \end{bmatrix}$$
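As an aside added here (not part of the original text), linear (in)dependence can be checked numerically: the rank of the matrix whose columns are the vectors equals the number of linearly independent vectors among them. A minimal NumPy sketch for the two sets above:

    import numpy as np

    # Columns are z1, z2, z3 of the first (independent) example.
    Z_indep = np.array([[-1,  2, -5],
                        [ 4,  2,  1],
                        [ 0,  3, -6],
                        [ 2, -1,  4]])

    # Columns of the second (dependent) example, where z3 = z1 - 2*z2.
    Z_dep = np.array([[-1,  2, -5],
                      [ 4,  2,  0],
                      [ 0,  3, -6],
                      [ 2, -1,  4]])

    print(np.linalg.matrix_rank(Z_indep))  # 3: the three columns are independent
    print(np.linalg.matrix_rank(Z_dep))    # 2: one linear dependence among them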

A vector space spanned by a set of p vectors (z_1 ... z_p) with the same dimension n is the set of all vectors that are linear combinations of the p vectors that span the space [3]. A vector space satisfies the three requirements of an algebraic set which follow hereafter. (1) Any vector obtained by vector addition and scalar multiplication of the vectors that span the space also belongs to this space. This includes the null vector whose elements are all equal to zero. (2) Addition of the null vector to any vector of the space reproduces the vector. (3) For every vector that belongs to the space, another one can be found, such that vector addition of these two vectors produces the null vector.
A set of n vectors of dimension n which are linearly independent is called a basis of an n-dimensional vector space. There can be several bases of the same vector space.
The set of unit vectors of dimension n defines an n-dimensional rectangular (or Cartesian) coordinate space S^n. Such a coordinate space S^n can be thought of as being constructed from n base vectors of unit length which originate from a common point and which are mutually perpendicular. Hence, a coordinate space is a vector space which is used as a reference frame for representing other vector spaces. It is not uncommon that the dimension of a coordinate space (i.e. the number of mutually perpendicular base vectors of unit length) exceeds the dimension of the vector space that is embedded in it. In that case the latter is said to be a subspace of the former. For example, the usual basis of S^4 is:

$$\mathbf{u}_1 = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix} \qquad \mathbf{u}_2 = \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix} \qquad \mathbf{u}_3 = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix} \qquad \mathbf{u}_4 = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}$$

Any vector x in S^n can be uniquely expressed as a linear combination of the n basis vectors u_i:

$$\mathbf{x} = \sum_{i=1}^{n} x_i \mathbf{u}_i \qquad (29.2)$$

where the n vectors (u_1 ... u_n) form a basis of S^n. The n coefficients (x_1 ... x_n) are called the coordinates of the vector x in the basis (u_1 ... u_n).
For example, a vector x in S^4 may be expressed in the previously defined usual basis as the vector sum:

$$\mathbf{x} = 3 \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix} + (-4) \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix} + 0 \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix} + 2 \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix} = \begin{bmatrix} 3 \\ -4 \\ 0 \\ 2 \end{bmatrix}$$

where the coefficients 3, -4, 0 and 2 are the coordinates of x in the usual basis.
Another basis of S^4 can be defined by means of the set of vectors:

$$\begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix} \qquad \begin{bmatrix} 1 \\ -1 \\ -1 \\ 1 \end{bmatrix} \qquad \begin{bmatrix} -1 \\ 1 \\ -1 \\ 1 \end{bmatrix} \qquad \begin{bmatrix} -1 \\ -1 \\ 1 \\ 1 \end{bmatrix}$$

and the same vector x which we have defined above can be expressed in this particular basis by means of the vector sum:

$$\mathbf{x} = 0.25 \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix} + 2.25 \begin{bmatrix} 1 \\ -1 \\ -1 \\ 1 \end{bmatrix} + (-1.25) \begin{bmatrix} -1 \\ 1 \\ -1 \\ 1 \end{bmatrix} + 0.75 \begin{bmatrix} -1 \\ -1 \\ 1 \\ 1 \end{bmatrix}$$

where the coefficients 0.25, 2.25, -1.25 and 0.75 now represent the coordinates of x in this particular basis. From the above it follows that a vector can be expressed by different coordinates according to the particular basis that has been chosen. In multivariate data analysis one often changes the basis in order to highlight particular properties of the vectors that are represented in it. This automatically causes a change of their coordinates. A change of basis and its effect on the coordinates can be defined algebraically, as is shown in Chapters 31 and 32.
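As an added illustration (a sketch, not from the original text), the coordinates of x in a new basis follow from solving a small linear system whose coefficient matrix has the basis vectors as columns:

    import numpy as np

    # Columns of B are the alternative basis vectors of S^4 given above.
    B = np.array([[1,  1, -1, -1],
                  [1, -1,  1, -1],
                  [1, -1, -1,  1],
                  [1,  1,  1,  1]])
    x = np.array([3, -4, 0, 2])   # coordinates of x in the usual basis

    # Coordinates of x in the new basis solve B @ c = x.
    c = np.linalg.solve(B, x)
    print(c)                      # [ 0.25  2.25 -1.25  0.75]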

29.2 Geometrical properties of vectors

Every n-vector can be represented as a point in an n-dimensional coordinate space. The n elements of the vector are the coordinates along n basis vectors, such as defined in the previous section. The null vector 0 defines the origin of the coordinate space. Note that the origin together with an endpoint define a directed line segment or axis, which also represents a vector. Hence, there is an equivalence between points and axes, which can both be thought of as geometrical representations of vectors in coordinate space. (The concepts discussed here are extensions of those covered previously in Sections 9.2.4 to 9.2.5.)
In this and subsequent sections we will make frequent use of the scalar product (also called inner product) between two vectors x and y with the same dimension n, which is defined by:

$$\mathbf{x}^T \mathbf{y} = \sum_{i=1}^{n} x_i y_i \qquad (29.3)$$

By way of example, if x = [2, 1, 3, -1]^T and y = [4, 2, -1, 5]^T then we obtain that the scalar product of x with y equals:

$$\mathbf{x}^T \mathbf{y} = 2 \times 4 + 1 \times 2 + 3 \times (-1) + (-1) \times 5 = 2$$

Note that in Section 9.2.2.3 the dot product x · y is used as an equivalent notation for the scalar product x^T y.
In Euclidean space we define the squared distance from the origin of a point x by means of the scalar product of x with itself:

$$\mathbf{x}^T \mathbf{x} = \sum_{i=1}^{n} x_i^2 = \|\mathbf{x}\|^2 \qquad (29.4)$$

where ||x|| is to be read as the norm or length of vector x. Likewise, the squared distance between two points x and y is given by the expression:

$$(\mathbf{x} - \mathbf{y})^T (\mathbf{x} - \mathbf{y}) = \sum_{i=1}^{n} (x_i - y_i)^2 = \|\mathbf{x} - \mathbf{y}\|^2 \qquad (29.5)$$

where ||x - y|| is the norm or length of the vector x - y. Note that the expression of distance from the origin in eq. (29.4) can be derived from that of distance between two points in eq. (29.5) by replacing the vector y by the null vector 0:

$$\mathbf{x}^T \mathbf{x} = (\mathbf{x} - \mathbf{0})^T (\mathbf{x} - \mathbf{0}) = \|\mathbf{x} - \mathbf{0}\|^2 = \|\mathbf{x}\|^2$$

Given the vectors x = [2, 1, 3, -1]^T and y = [4, 2, -1, 5]^T, we derive the norms:

$$\|\mathbf{x}\|^2 = 2^2 + 1^2 + 3^2 + (-1)^2 = 15$$
$$\|\mathbf{y}\|^2 = 4^2 + 2^2 + (-1)^2 + 5^2 = 46$$
$$\|\mathbf{x} - \mathbf{y}\|^2 = (2-4)^2 + (1-2)^2 + (3-(-1))^2 + (-1-5)^2 = 57$$

It may be noted that some authors define the norm of a vector x as the square of ||x|| rather than ||x|| itself, e.g. Gantmacher [2].
Angular distance or angle between two points x and y, as seen from the origin of space, is derived from the definition of the scalar product in terms of the norms of the vectors:

$$\mathbf{x}^T \mathbf{y} = \sum_{i=1}^{n} x_i y_i = \|\mathbf{x}\| \|\mathbf{y}\| \cos\vartheta \qquad (29.6)$$

where ϑ represents the angular distance between the vectors x and y.
The geometrical interpretation of the scalar product of the vectors x and y is that of an arithmetic product of the length of y, i.e. ||y||, with the projection of x upon y, i.e. ||x|| cos ϑ (Fig. 29.1). From the expression in eq. (29.6) we derive that cos ϑ equals the scalar product of the normalized vectors x and y:

$$\cos\vartheta = \frac{\mathbf{x}^T \mathbf{y}}{\|\mathbf{x}\| \|\mathbf{y}\|} = (\mathbf{x}/\|\mathbf{x}\|)^T (\mathbf{y}/\|\mathbf{y}\|) \qquad (29.7)$$

Fig. 29.1. Geometrical interpretation of the scalar product x^T y as the projection of the vector x upon the vector y. The lengths of x and y are denoted by ||x|| and ||y||, respectively, and their angular separation is denoted by ϑ.

Note that normalization of an arbitrary vector x is obtained by dividing each of its elements by the norm ||x|| of the vector.
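The quantities derived in this section are easily reproduced numerically; the following NumPy sketch (added here as an illustration) uses the example vectors x = [2, 1, 3, -1]^T and y = [4, 2, -1, 5]^T:

    import numpy as np

    x = np.array([2, 1, 3, -1])
    y = np.array([4, 2, -1, 5])

    print(x @ y)                    # scalar product x^T y = 2
    print(x @ x, y @ y)             # squared norms: 15, 46
    print((x - y) @ (x - y))        # squared distance: 57

    cos_theta = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
    print(cos_theta)                          # about 0.0761
    print(np.degrees(np.arccos(cos_theta)))   # about 85.6 degrees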
The geometric properties of vectors can be combined into the triangle relationship, also called the cosine rule, which states that:

$$\|\mathbf{x} - \mathbf{y}\|^2 = \|\mathbf{x}\|^2 + \|\mathbf{y}\|^2 - 2 \|\mathbf{x}\| \|\mathbf{y}\| \cos\vartheta \qquad (29.8)$$

This relationship is of importance in multivariate data analysis as it relates distance between endpoints of two vectors to distances and angular distance from the origin of space. A geometrical interpretation is shown in Fig. 29.2.

Fig. 29.2. Distance ||x - y|| between two vectors x and y of length ||x|| and ||y||, separated by an angle ϑ.

Using the vectors x and y from our previous illustration, we derive that:

$$\cos\vartheta = \frac{2}{\sqrt{15}\sqrt{46}} = 0.0761$$

or:

$$\vartheta = 85.64 \text{ degrees}$$

One can define three special configurations of two vectors, namely parallel in the same direction, parallel in opposite directions, and orthogonal (or perpendicular). The three special configurations depend on the angular distances between the two vectors, being 0, 180 and 90 degrees respectively (Fig. 29.3).

Fig. 29.3. Three special configurations of two vectors x and y and their corresponding scalar product x^T y. For angular separations of 0, 180 and 90 degrees, respectively, x^T y = ||x|| ||y||, x^T y = -||x|| ||y|| and x^T y = 0.

More generally, two vectors x and y are orthogonal when their scalar product is zero:

$$\mathbf{x}^T \mathbf{y} = 0 \qquad (29.9)$$

It can be shown that the vectors x = [2, -1, 8, 0]^T and y = [10, 4, -2, 3]^T are orthogonal, since:

$$\mathbf{x}^T \mathbf{y} = 2 \times 10 + (-1) \times 4 + 8 \times (-2) + 0 \times 3 = 0$$

Hence cos ϑ equals 0, or equivalently ϑ equals 90 degrees.
Two orthogonal vectors are orthonormal when, in addition to orthogonality, the norms of these vectors are equal to one:

$$\|\mathbf{x}\| = \|\mathbf{y}\| = 1$$

or equivalently:

$$\mathbf{x}^T \mathbf{x} = \mathbf{y}^T \mathbf{y} = 1 \qquad (29.10)$$

The n basis vectors which define the basis of a coordinate space S^n are n mutually orthogonal and normalized vectors. Together they form a frame of reference axes for that space.
If we represent by x̄ and ȳ the arithmetic means of the elements of the vectors x and y:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \qquad \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i \qquad (29.11)$$

then we can relate the norms of the vectors (x - x̄) and (y - ȳ) to the standard deviations s_x and s_y of the elements in x and y (Section 2.1.4):

$$s_x^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n} \|\mathbf{x} - \bar{x}\|^2 \qquad s_y^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2 = \frac{1}{n} \|\mathbf{y} - \bar{y}\|^2 \qquad (29.12)$$

Note that in data analysis we divide by n in the definition of standard deviation rather than by the factor n - 1 which is customary in statistical inference. Likewise we can relate the product-moment (or Pearson) coefficient of correlation r (Section 8.3.1) to the scalar product of the vectors (x - x̄) and (y - ȳ):

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\left(\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2\right)^{1/2}} = \frac{(\mathbf{x} - \bar{x})^T (\mathbf{y} - \bar{y})}{\|\mathbf{x} - \bar{x}\| \|\mathbf{y} - \bar{y}\|} = \cos\varphi \qquad (29.13)$$

where φ is the angular distance between the vectors (x - x̄) and (y - ȳ).
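The identity in eq. (29.13) can be verified with a short NumPy sketch (an added illustration with arbitrary example data, not taken from the text):

    import numpy as np

    # Correlation as the cosine of the angle between the centered vectors.
    x = np.array([2.0, 1.0, 3.0, -1.0])
    y = np.array([4.0, 2.0, -1.0, 5.0])

    xc = x - x.mean()
    yc = y - y.mean()

    r_cos = (xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))
    r_np = np.corrcoef(x, y)[0, 1]
    print(r_cos, r_np)   # identical values: r equals cos(phi)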

29.3 Matrices

A matrix is defined as an ordered rectangular arrangement of scalars into horizontal rows and vertical columns (Section 9.3). On the one hand, one can consider a matrix X with n rows and p columns as an ordered array of p column vectors of dimension n, each of the form:

$$\mathbf{x}_j = [x_{1j}, ..., x_{nj}]^T \qquad \text{with } j = 1, ..., p$$

On the other hand, one can also regard the same matrix X as an ordered array of n row vectors of dimension p, each of the form:

$$\mathbf{x}_i = [x_{i1}, ..., x_{ip}] \qquad \text{with } i = 1, ..., n$$

In our notation, x_ij represents the element of matrix X at the crossing of row i and column j. The vector x_j defines a vector which contains the n elements of the jth column of X. The vector x_i refers to a vector which comprises the p elements of the ith row of X.
In the matrix X of the following example:

$$\mathbf{X} = \begin{bmatrix} 3 & 2 & -2 \\ 2 & 1 & 0 \\ 0 & -4 & -3 \\ -1 & 2 & 4 \end{bmatrix}$$

we denote the second column by means of the vector x_{j=2}:

$$\mathbf{x}_{j=2} = \begin{bmatrix} 2 \\ 1 \\ -4 \\ 2 \end{bmatrix}$$

and the third row by means of the vector x_{i=3}:

$$\mathbf{x}_{i=3} = [0, -4, -3]$$

Fig. 29.4. Schematic representation of a matrix X as a stack of horizontal rows x_i, and as an assembly of vertical columns x_j.

In the illustration of Fig. 29.4 we regard the matrix X as either built up from n horizontal rows x_i of dimension p, or as built up from p vertical columns x_j of dimension n. This exemplifies the duality of the interpretation of a matrix [9]. From a geometrical point of view, and according to the concept of duality, we can interpret a matrix with n rows and p columns either as a pattern of n points in a p-dimensional space, or as a pattern of p points in an n-dimensional space. The former defines a row-pattern P^n in column-space S^p, while the latter defines a column-pattern P^p in row-space S^n. The two patterns and spaces are called dual (or conjugate). The term dual space also possesses a specific meaning in another mathematical context which is distinct from the one which is implied here. The occasion for confusion, however, is slight.

Fig. 29.5. Geometrical interpretation of an n×p matrix X as either a row-pattern of n points P^n in p-dimensional column-space S^p (left panel) or as a column-pattern of p points P^p in n-dimensional row-space S^n (right panel). The p vectors u_j form a basis of S^p and the n vectors v_i form a basis of S^n.

In Fig. 29.5, the column-space S^p is represented as a p-dimensional coordinate space in which each row x_i of X defines a point with coordinates (x_{i1}, .., x_{ij}, .., x_{ip}) in an orthonormal basis (u_1, .., u_j, .., u_p) such that:

$$\mathbf{x}_i = x_{i1} \mathbf{u}_1 + ... + x_{ij} \mathbf{u}_j + ... + x_{ip} \mathbf{u}_p$$

where each basis vector u_j is the unit vector with a one in position j and zeros elsewhere. Each element x_ij can thus be reconstructed from the scalar product:

$$x_{ij} = \mathbf{x}_i^T \mathbf{u}_j = \mathbf{u}_j^T \mathbf{x}_i \qquad (29.14)$$

In the same Fig. 29.5, the row-space S^n is shown as an n-dimensional coordinate space in which each column x_j of X defines a point with coordinates (x_{1j}, .., x_{ij}, .., x_{nj}) in an orthonormal basis (v_1, .., v_i, .., v_n) such that:

$$\mathbf{x}_j = x_{1j} \mathbf{v}_1 + ... + x_{ij} \mathbf{v}_i + ... + x_{nj} \mathbf{v}_n$$

where each basis vector v_i is the unit vector with a one in position i and zeros elsewhere. Each element x_ij can also be reconstructed from the scalar product:

$$x_{ij} = \mathbf{v}_i^T \mathbf{x}_j = \mathbf{x}_j^T \mathbf{v}_i \qquad (29.15)$$
The dimension of a matrix X with n rows and p columns is n×p (pronounced n by p). Here X is referred to as an n×p matrix or as a matrix of dimension n×p.
A matrix is called square if the number of rows is equal to the number of columns. Such a matrix can be referred to as a p×p matrix or as a square matrix of dimension p.
The transpose of a matrix X is obtained by interchanging its rows and columns and is denoted by X^T. If X is an n×p matrix then X^T is a p×n matrix. In particular, we have that:

$$(\mathbf{X}^T)^T = \mathbf{X} \qquad (29.16)$$

The transpose of the matrix X in the previous example is given below:

$$\mathbf{X}^T = \begin{bmatrix} 3 & 2 & 0 & -1 \\ 2 & 1 & -4 & 2 \\ -2 & 0 & -3 & 4 \end{bmatrix}$$

A square matrix A is called symmetric if:

$$\mathbf{A}^T = \mathbf{A} \qquad (29.17)$$

In a square n×n matrix A, the main diagonal or principal diagonal consists of the elements a_ii for all i ranging from 1 to n. The latter are called the diagonal elements; all other elements are off-diagonal.
A diagonal matrix D is a square matrix in which all off-diagonal elements are zero, i.e.:

$$d_{ij} = 0 \quad \text{if } i \neq j \quad \text{with } i \text{ and } j = 1, ..., n$$

An identity matrix I is a diagonal matrix in which all diagonal elements are equal to unity and all off-diagonal elements are zero, i.e.:

$$i_{ii} = 1 \quad \text{and} \quad i_{ij} = 0 \ (i \neq j) \quad \text{with } i, j = 1, ..., n$$

The following 3×3 matrices illustrate a square, a diagonal and an identity matrix:

$$\mathbf{A} = \begin{bmatrix} 3 & 2 & 0 \\ 2 & 1 & -4 \\ -2 & 0 & -3 \end{bmatrix} \qquad \mathbf{D} = \begin{bmatrix} 3 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & -3 \end{bmatrix} \qquad \mathbf{I} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

Special matrices are the null matrix 0 in which all elements are zero, and the sum matrix 1 in which all elements are unity. In the case of 3×3 matrices we obtain:

$$\mathbf{0} = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix} \qquad \mathbf{1} = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}$$

29.4 Matrix product

Chapter 9 dealt with the basic operations of addition of two matrices with the same dimensions, of scalar multiplication of a matrix with a constant, and of arithmetic multiplication element-by-element of two matrices with the same dimensions. Here, we formalize the properties of the matrix product that have already been introduced in Section 9.3.2.3. If X is of dimension n×p and Y is of dimension p×q, then the product Z = XY is an n×q matrix, the elements of which are defined by:

$$z_{ik} = \sum_{j=1}^{p} x_{ij} y_{jk} \qquad \text{with } i = 1, ..., n \text{ and } k = 1, ..., q \qquad (29.18)$$

Note that the inner dimensions of X and Y must be equal. For this reason the operation is also called inner product, as the inner dimensions of the two terms vanish in the product. Any element of the product, say z_ik, can also be thought of as being the sum of the products of the corresponding elements of row i of X with those of column k of Y. Hence the descriptive name of rows-by-columns product. In terms of the scalar product (Section 29.2) we can write:

$$z_{ik} = \mathbf{x}_i^T \mathbf{y}_k \qquad (29.19)$$

Throughout the book, matrices are often subscripted with their corresponding dimensions in order to provide a check on the conformity of the inner dimensions of matrix products. For example, when a 4×3 matrix X is multiplied rows-by-columns with a 3×2 matrix Y, we obtain a 4×2 matrix Z:

$$\mathbf{X} = \begin{bmatrix} 3 & 2 & -1 \\ 0 & -5 & 4 \\ 2 & 1 & -2 \\ 4 & 3 & 0 \end{bmatrix} \qquad \mathbf{Y} = \begin{bmatrix} 3 & 8 \\ -2 & 4 \\ 4 & -6 \end{bmatrix} \qquad \mathbf{Z} = \mathbf{X}\mathbf{Y} = \begin{bmatrix} 1 & 38 \\ 26 & -44 \\ -4 & 32 \\ 6 & 44 \end{bmatrix}$$
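A quick NumPy check of this rows-by-columns product (an illustration added here, not part of the original text):

    import numpy as np

    X = np.array([[3,  2, -1],
                  [0, -5,  4],
                  [2,  1, -2],
                  [4,  3,  0]])      # 4x3
    Y = np.array([[ 3,  8],
                  [-2,  4],
                  [ 4, -6]])         # 3x2

    Z = X @ Y                        # 4x2 rows-by-columns product
    print(Z)
    # [[  1  38]
    #  [ 26 -44]
    #  [ -4  32]
    #  [  6  44]]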

In this theoretical chapter, however, we do not follow this convention of subscripted matrices for the sake of conciseness of notation. Instead, we will take care to indicate the dimensions of matrices in the accompanying text whenever this is appropriate.
The operation of matrix multiplication can be shown to be associative, meaning that X(YZ) = (XY)Z. But, it is not commutative, as in general we will have that XY ≠ YX. Matrix multiplication is distributive with respect to matrix addition, which implies that (X + Y)Z = XZ + YZ. When this expression is read from right to left, the process is called factoring-out [4].
Multiplication of an n×p matrix X with an identity matrix leaves the original matrix unchanged:

$$\mathbf{X}\mathbf{I}_p = \mathbf{I}_n \mathbf{X} = \mathbf{X} \qquad (29.20)$$

where I_p is the identity matrix with dimension p, and I_n is the identity matrix with dimension n. For example, by working out the rows-by-columns product one can easily verify that:

$$\begin{bmatrix} 3 & 4 \\ 2 & -1 \\ 0 & 3 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 3 & 4 \\ 2 & -1 \\ 0 & 3 \end{bmatrix} = \begin{bmatrix} 3 & 4 \\ 2 & -1 \\ 0 & 3 \end{bmatrix}$$

A matrix is orthogonal if the product with its own transpose produces a diagonal matrix. An orthogonal matrix of dimension n×p satisfies one or both of the following relationships:

$$\mathbf{X}\mathbf{X}^T = \mathbf{D}_n \quad \text{or} \quad \mathbf{X}^T\mathbf{X} = \mathbf{D}_p \qquad (29.21)$$

where D_n is a diagonal matrix of dimension n and D_p is a diagonal matrix of dimension p. In an orthogonal matrix we find that all rows or all columns of the matrix are mutually orthogonal as defined above in Section 29.2. In the former case we state that X is row-orthogonal, while in the latter case X is said to be column-orthogonal.
The following 3×3 matrix X can be shown to be column-orthogonal:

$$\mathbf{X} = \begin{bmatrix} 2.351 & 0.060 & 0.686 \\ 4.726 & 1.603 & -0.301 \\ -1.840 & 4.196 & 0.105 \end{bmatrix}$$

as can be seen by working out the matrix product:

$$\mathbf{X}^T\mathbf{X} = \begin{bmatrix} 31.248 & 0 & 0 \\ 0 & 20.180 & 0 \\ 0 & 0 & 0.572 \end{bmatrix}$$

A matrix is called orthonormal if additionally we obtain that:

$$\mathbf{X}\mathbf{X}^T = \mathbf{I}_n \quad \text{or} \quad \mathbf{X}^T\mathbf{X} = \mathbf{I}_p \qquad (29.22)$$

where I_n and I_p have been defined before. In an orthonormal matrix X, all row-vectors or all column-vectors are mutually orthonormal. In the former case, X is row-orthonormal, while in the latter case we state that X is column-orthonormal.
A square matrix U is orthonormal if we can write that:

$$\mathbf{U}\mathbf{U}^T = \mathbf{U}^T\mathbf{U} = \mathbf{I} \qquad (29.23)$$

where I is the identity matrix with the same dimension as U (or U^T).
The 3×3 matrix U shown below is both row- and column-orthonormal:

$$\mathbf{U} = \begin{bmatrix} 0.4205 & 0.0134 & 0.9072 \\ 0.8455 & 0.3569 & -0.3972 \\ -0.3291 & 0.9340 & 0.1388 \end{bmatrix}$$

as can be seen by working out the matrix products:

$$\mathbf{U}\mathbf{U}^T = \mathbf{U}^T\mathbf{U} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$
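This can be confirmed numerically; the following sketch (added here) reproduces eq. (29.23) for the matrix U above:

    import numpy as np

    U = np.array([[ 0.4205, 0.0134,  0.9072],
                  [ 0.8455, 0.3569, -0.3972],
                  [-0.3291, 0.9340,  0.1388]])

    # Both products are, to within rounding of the printed entries of U,
    # the 3x3 identity matrix.
    print(np.round(U @ U.T, 3))
    print(np.round(U.T @ U, 3))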
An important property of the matrix product is that the transpose of a product is equal to the product of the transposed terms in reverse order:

$$(\mathbf{X}\mathbf{Y})^T = \mathbf{Y}^T\mathbf{X}^T \qquad (29.24)$$

This property can be readily verified by means of an example:

$$(\mathbf{X}\mathbf{Y})^T = \begin{bmatrix} 1 & 38 \\ 26 & -44 \\ -4 & 32 \\ 6 & 44 \end{bmatrix}^T = \begin{bmatrix} 1 & 26 & -4 & 6 \\ 38 & -44 & 32 & 44 \end{bmatrix}$$

and

$$\mathbf{Y}^T\mathbf{X}^T = \begin{bmatrix} 3 & -2 & 4 \\ 8 & 4 & -6 \end{bmatrix} \begin{bmatrix} 3 & 0 & 2 & 4 \\ 2 & -5 & 1 & 3 \\ -1 & 4 & -2 & 0 \end{bmatrix} = \begin{bmatrix} 1 & 26 & -4 & 6 \\ 38 & -44 & 32 & 44 \end{bmatrix}$$

The trace of a square matrix A of dimension n is equal to the sum of the n elements on the main diagonal:

$$\text{tr}(\mathbf{A}) = \sum_{i=1}^{n} a_{ii} \qquad (29.25)$$

For example:

$$\text{tr}\begin{bmatrix} 4 & 2 & -1 \\ 2 & 8 & -4 \\ 1 & -4 & 3 \end{bmatrix} = 4 + 8 + 3 = 15$$

If X is of dimension n×p and if Y is of dimension p×n, then we can show that:

$$\text{tr}(\mathbf{X}\mathbf{Y}) = \text{tr}(\mathbf{Y}\mathbf{X}) \qquad (29.26)$$

In particular, we can prove that:

$$\text{tr}(\mathbf{X}^T\mathbf{X}) = \text{tr}(\mathbf{X}\mathbf{X}^T) = \sum_{i=1}^{n}\sum_{j=1}^{p} x_{ij}^2 = \sum_{j=1}^{p} \|\mathbf{x}_j\|^2 = \sum_{i=1}^{n} \|\mathbf{x}_i\|^2 \qquad (29.27)$$

where x_i represents the ith row and x_j denotes the jth column of the n×p matrix X. The proof follows from working out the products in the manner described above. This relationship is important in the case when Y equals X^T.
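A small NumPy illustration (added here) of eqs. (29.26) and (29.27), using the 4×3 matrix X from the earlier product example:

    import numpy as np

    X = np.array([[3,  2, -1],
                  [0, -5,  4],
                  [2,  1, -2],
                  [4,  3,  0]])

    # tr(X^T X) = tr(X X^T) = sum of all squared elements of X, eq. (29.27)
    print(np.trace(X.T @ X))       # 89
    print(np.trace(X @ X.T))       # 89
    print((X**2).sum())            # 89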
Matrix multiplication can be applied to vectors, if the latter are regarded as one-column matrices. This way, we can distinguish between four types of special matrix products, which are explained below and which are represented schematically in Fig. 29.6.

Fig. 29.6. Schematic illustration of four types of special matrix products: the matrix-by-vector product z = Xy, the vector-by-matrix product z^T = x^T Y, the outer product Z = xy^T and the scalar product z = x^T y, respectively from top to bottom.

(1) In the matrix-by-vector product one postmultiplies an n×p matrix X with a p-vector y which results in an n-vector z:

$$\mathbf{X}\mathbf{y} = \mathbf{z} \qquad (29.28)$$

For example:

$$\begin{bmatrix} 3 & 2 & -1 \\ 0 & -5 & 4 \\ 2 & 1 & -2 \\ 4 & 3 & 0 \end{bmatrix} \begin{bmatrix} 3 \\ -2 \\ 4 \end{bmatrix} = \begin{bmatrix} 1 \\ 26 \\ -4 \\ 6 \end{bmatrix}$$

(2) The vector-by-matrix product involves an n-vector x^T which premultiplies an n×p matrix Y to yield a p-vector z^T:

$$\mathbf{x}^T\mathbf{Y} = \mathbf{z}^T \qquad (29.29)$$

For example:

$$\begin{bmatrix} 3 & 2 & -1 \end{bmatrix} \begin{bmatrix} 3 & 2 \\ -2 & 1 \\ 4 & 3 \end{bmatrix} = \begin{bmatrix} 1 & 5 \end{bmatrix}$$

(3) The outer product results from premultiplying an n-vector x with a p-vector y^T, yielding an n×p matrix Z:

$$\mathbf{x}\mathbf{y}^T = \mathbf{Z} \qquad (29.30)$$

The outer product of two vectors can be thought of as the matrix product of a single-column matrix with a single-row matrix:

$$\begin{bmatrix} 2 \\ -5 \\ 1 \\ 3 \end{bmatrix} \begin{bmatrix} 3 & -2 & 4 \end{bmatrix} = \begin{bmatrix} 2 \times 3 & 2 \times (-2) & 2 \times 4 \\ (-5) \times 3 & (-5) \times (-2) & (-5) \times 4 \\ 1 \times 3 & 1 \times (-2) & 1 \times 4 \\ 3 \times 3 & 3 \times (-2) & 3 \times 4 \end{bmatrix} = \begin{bmatrix} 6 & -4 & 8 \\ -15 & 10 & -20 \\ 3 & -2 & 4 \\ 9 & -6 & 12 \end{bmatrix}$$

For the purpose of completeness, we also mention the vector product which is extensively used in physics and which is defined as:

$$\mathbf{x} \times \mathbf{y} = \mathbf{z} \qquad (29.31)$$

(read as x cross y) where x, y and z have the same dimension n. Geometrically, we can regard x and y as two vectors drawn from the origin of S^n and forming an angle (x, y) between them. The resulting vector product z is perpendicular to the plane formed by x and y, and has a length defined by:

$$\|\mathbf{z}\| = \|\mathbf{x}\| \|\mathbf{y}\| \sin(\mathbf{x}, \mathbf{y}) \qquad (29.32)$$

(4) In the scalar product, which we described in Section 29.2, one multiplies a vector x^T with another vector y of the same dimension, which produces a scalar z:

$$\mathbf{x}^T\mathbf{y} = z \qquad (29.33)$$

For example:

$$\begin{bmatrix} 3 & 0 & 2 & -1 \end{bmatrix} \begin{bmatrix} -1 \\ 4 \\ -2 \\ 0 \end{bmatrix} = 3 \times (-1) + 0 \times 4 + 2 \times (-2) + (-1) \times 0 = -7$$
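The four special products can be reproduced in NumPy as follows (an added illustration; the arrays repeat the examples given above):

    import numpy as np

    X  = np.array([[3, 2, -1], [0, -5, 4], [2, 1, -2], [4, 3, 0]])
    y  = np.array([3, -2, 4])
    Y2 = np.array([[3, 2], [-2, 1], [4, 3]])
    x  = np.array([2, -5, 1, 3])
    a  = np.array([3, 0, 2, -1])
    b  = np.array([-1, 4, -2, 0])

    print(X @ y)                      # (1) matrix-by-vector: [ 1 26 -4  6]
    print(np.array([3, 2, -1]) @ Y2)  # (2) vector-by-matrix: [1 5]
    print(np.outer(x, y))             # (3) outer product: a 4x3 matrix
    print(a @ b)                      # (4) scalar product: -7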

The product of a matrix with a diagonal matrix is used to multiply the rows or the columns of a matrix with given constants. If X is an n×p matrix and if D_n is a diagonal matrix of dimension n, we obtain a product Y in which the ith row y_i equals the ith row of X, i.e. x_i, multiplied by the ith element on the main diagonal of D_n:

$$\mathbf{Y} = \mathbf{D}_n \mathbf{X} \qquad (29.34)$$

and in particular for the ith row:

$$\mathbf{y}_i = d_{ii}\, \mathbf{x}_i \qquad \text{with } i = 1, ..., n$$

The matrix D_n effectively scales the rows of X. For example:

$$\begin{bmatrix} -2 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 3 & 0 \\ 0 & 0 & 0 & 2 \end{bmatrix} \begin{bmatrix} 3 & 2 & -1 \\ 0 & -5 & 4 \\ 2 & 1 & -2 \\ 4 & 3 & 0 \end{bmatrix} = \begin{bmatrix} -6 & -4 & 2 \\ 0 & -5 & 4 \\ 6 & 3 & -6 \\ 8 & 6 & 0 \end{bmatrix}$$

Likewise, if D_p is a diagonal matrix of dimension p we obtain a product Y in which the jth column y_j equals the jth column of X, i.e. x_j, multiplied by the jth element on the main diagonal of D_p:

$$\mathbf{Y} = \mathbf{X} \mathbf{D}_p \qquad (29.35)$$

and in particular for the jth column:

$$\mathbf{y}_j = d_{jj}\, \mathbf{x}_j \qquad \text{with } j = 1, ..., p$$

The matrix D_p effectively scales the columns of X. For example:

$$\begin{bmatrix} 3 & 2 & -1 \\ 0 & -5 & 4 \\ 2 & 1 & -2 \\ 4 & 3 & 0 \end{bmatrix} \begin{bmatrix} 2 & 0 & 0 \\ 0 & -1 & 0 \\ 0 & 0 & 3 \end{bmatrix} = \begin{bmatrix} 6 & -2 & -3 \\ 0 & 5 & 12 \\ 4 & -1 & -6 \\ 8 & -3 & 0 \end{bmatrix}$$

Pre- or postmultiplication with a diagonal matrix is useful in data analysis for scaling rows or columns of a matrix, e.g. such that after scaling all rows or all columns possess equal sums of squares.
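As an added sketch of this use of diagonal matrices (the scaling target chosen here, unit column sums of squares, is just one possible convention):

    import numpy as np

    X = np.array([[3,  2, -1],
                  [0, -5,  4],
                  [2,  1, -2],
                  [4,  3,  0]], dtype=float)

    # Postmultiplying by a diagonal matrix scales the columns, eq. (29.35).
    # The scale factors below give every column of Y a unit sum of squares.
    d = 1.0 / np.sqrt((X**2).sum(axis=0))
    Y = X @ np.diag(d)
    print((Y**2).sum(axis=0))   # [1. 1. 1.]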

29.5 Dimension and rank

It has been shown that the p columns of an n×p matrix X generate a pattern of p points in S^n which we call P^p. The dimension of this pattern is called rank and is indicated by r(P^p). It is equal to the number of linearly independent vectors from which all p columns of X can be constructed as linear combinations. Hence, the rank of P^p can be at most equal to p. Geometrically, the rank of P^p can be seen as the minimum number of dimensions that is required to represent the p points in the pattern together with the origin of space. Linear dependences among the p columns of X will cause coplanarity of some of the p vectors and hence reduce the minimum number of dimensions.
The same matrix X also generates a pattern of n points in S^p which we call P^n and which is generated by the n rows of X. The rank of P^n is denoted as r(P^n) and equals the number of linearly independent vectors from which all n rows of X can be produced as linear combinations. Hence, the rank of P^n can be at most equal to n. Using the same geometrical arguments as above, one can regard the rank of P^n as the minimum number of dimensions that is required to represent the n points in the pattern together with the origin of space. Linear dependences between the n rows of X will also reduce this minimum number of dimensions because of coplanarity of some of the corresponding n vectors.
It can be shown that the rank of P^p must be equal to that of P^n and, hence, that the rank of X is at most equal to the smaller of n and p [3]:

$$r(P^n) = r(P^p) = r(\mathbf{X}) \leq \min(n, p) \qquad (29.36)$$

where r(X) is the rank of the matrix X (see also Section 9.3.5). For example, the rank of a 4×3 matrix can be at most equal to 3. In the case when there are linear dependences among the rows or columns of the matrix, the rank can be even smaller.
An n×p matrix X with n > p is called singular if linear dependences exist between the columns of X, otherwise the matrix is called non-singular. In this case the rank of X equals p minus the number of linear dependences among the columns of X. If n < p, then X is singular if linear dependences exist between the rows of X, otherwise X is non-singular. In that case, the rank of X equals n minus the number of linear dependences among the rows of X. A matrix is said to be of full rank when X is non-singular or alternatively when r(X) equals the smaller of n or p.
Dimensions and rank of a matrix are distinct concepts. A matrix can have relatively large dimensions, say 100×50, but its rank can be small in comparison with its dimensions. This point can be made more clearly in geometrical terms. In a 100-dimensional row-space S^100, it is possible to represent the 50 columns of the matrix as 50 points, the coordinates of which are defined by the 100 elements in each of them. These 50 points form a pattern which we represent by P^50. It is clear
that the true dimension of this pattern of 50 points must be less than the number of coordinate axes of the space S^100 in which they are represented. In fact, it cannot be larger than 50. The true dimension of the pattern P^50 defines its rank. In an extreme case when all 50 points are located at the origin of S^100, the rank is zero. In another extreme situation we may obtain that all 50 points are collinear on a line through the origin, in which case the rank is one. All 50 columns may be coplanar in a plane that comprises the origin, which results in a rank of 2. In practical situations, we often find that the points form patterns of low rank, when the data are sufficiently filtered to eliminate random variation and artifacts. Multivariate data analysis capitalizes on this point, and in a subsequent section on eigenvectors we will deal with the algebra which allows us to find the true number of dimensions or rank of a pattern in space.
We can now define the rank of the column-pattern P^50 as the number of linearly independent columns or rank of X. If all 50 points are coplanar, then we can reconstruct each of the 50 columns by means of linear combinations of two independent ones. For example, if x_{j=1} and x_{j=2} are linearly independent then we must have 48 linear dependences among the 50 columns of X:

$$\begin{aligned}
\mathbf{x}_{j=3} &= c_{13}\, \mathbf{x}_{j=1} + c_{23}\, \mathbf{x}_{j=2} \\
\mathbf{x}_{j=4} &= c_{14}\, \mathbf{x}_{j=1} + c_{24}\, \mathbf{x}_{j=2} \\
&\;\;\vdots \\
\mathbf{x}_{j=50} &= c_{1,50}\, \mathbf{x}_{j=1} + c_{2,50}\, \mathbf{x}_{j=2}
\end{aligned}$$

where c_13, c_23, ... are the coefficients of the linear combinations. The rank of X and hence the rank of the column-pattern P^50 is thus equal to 50 - 48 = 2. In this case it appears that 48 of the 50 columns of X are redundant, and that a judicious choice of two of them could lead to a substantial reduction of the dimensionality of the data. The algebraic approach to this problem is explained in Section 29.6 on eigenvectors.
A similar argument can be developed for the dual representation of X, i.e. as a row-pattern of 100 points P^100 in a 50-dimensional column-space S^50. Here again, it is evident that the rank of P^100 can at most be equal to 50 (as it is embedded in a 50-dimensional space). This implies that in our illustration of a 100×50 matrix, we must of necessity have at least 100 - 50 = 50 linear dependences among the rows of X. In other words, we can eliminate 50 of the 100 points without affecting the rank of the row-pattern of points in S^50. Let us assume that this is obtained by reducing the 100×50 matrix X into a 50×50 matrix X'. The resulting pattern of the rows in X' now comprises only 50 points instead of 100 and is denoted here by the symbol P^{100-50}, which has the same rank as P^100. Since we previously assumed 48 linear dependences among the columns of X we must necessarily also have 48 additional linear dependences among the rows of X'. Hence the rank of P^{100-50} in S^50 is also equal to 2. Summarizing the results obtained in the dual spaces we can write that:

$$r(P^{100}) = r(P^{50}) = r(\mathbf{X}) = 2$$

We will attempt to clarify this difficult concept by means of an example. In the 4×3 matrix X we have an obvious linear dependence among the columns:

$$\mathbf{X} = \begin{bmatrix} 3 & 5 & 8 \\ 7 & 2 & 9 \\ 1 & 2 & 3 \\ 2 & -1 & 1 \end{bmatrix}$$

since x_{j=3} = x_{j=1} + x_{j=2}.
By simple algebra we can derive that there is also a linear dependence among the rows of X, namely:

$$\mathbf{x}_{i=4} = -\frac{11}{29}\, \mathbf{x}_{i=1} + \frac{13}{29}\, \mathbf{x}_{i=2}$$

Hence we can remove the fourth row of X without affecting its rank, which results in the 3×3 matrix X':

$$\mathbf{X}' = \begin{bmatrix} 3 & 5 & 8 \\ 7 & 2 & 9 \\ 1 & 2 & 3 \end{bmatrix}$$

Note that the linear dependence among the columns still persists. We now show that there remains a second linear dependence among the rows of X':

$$\mathbf{x}_{i=3} = \frac{12}{29}\, \mathbf{x}_{i=1} - \frac{1}{29}\, \mathbf{x}_{i=2}$$

Thus we have illustrated that the number of independent rows, the number of independent columns and the rank of the matrix are all identical. Hence, from geometrical considerations, we conclude that the ranks of the patterns in row- and column-space must also be equal. The above illustration is also rendered geometrically in Fig. 29.7.

Fig. 29.7. Illustration of a pattern of points with rank of 2. The pattern is represented by a matrix X with a linear dependence among its three columns. The rank is shown to be the smallest number of dimensions required to represent the pattern in column-space S^p and in row-space S^n.
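A numerical check of this example (added here; note that the fourth row of X is taken as reconstructed above):

    import numpy as np

    # The 4x3 matrix of the example (third column = first column + second column).
    X = np.array([[3,  5, 8],
                  [7,  2, 9],
                  [1,  2, 3],
                  [2, -1, 1]])

    print(np.linalg.matrix_rank(X))           # 2
    print(np.linalg.matrix_rank(X[:3, :]))    # 2: dropping the fourth row keeps the rank
    # The row dependences quoted above can be verified directly:
    print(np.allclose(X[3], -11/29 * X[0] + 13/29 * X[1]))   # True
    print(np.allclose(X[2],  12/29 * X[0] -  1/29 * X[1]))   # True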
The rank of a product of two matrices X and Y is equal to the smallest of the ranks of X and Y:

$$r(\mathbf{X}\mathbf{Y}) = \min(r(\mathbf{X}), r(\mathbf{Y})) \qquad (29.37)$$

This follows from the fact that the columns of XY are linear combinations of X and that the rows of XY are linear combinations of Y [3].
From the above property, it follows readily that:

$$r(\mathbf{X}\mathbf{X}^T) = r(\mathbf{X}^T\mathbf{X}) = r(\mathbf{X}) \qquad (29.38)$$

where the products XX^T and X^TX are of special interest in data analysis as will be explained in Section 29.7.

29.6 Eigenvectors and eigenvalues

A square matrix A of dimension p is said to be positive definite if:

$$\mathbf{x}^T \mathbf{A} \mathbf{x} > 0 \qquad (29.39)$$

for all non-trivial p-vectors x (i.e. vectors that are distinct from 0). The matrix A is said to be positive semi-definite if:

$$\mathbf{x}^T \mathbf{A} \mathbf{x} \geq 0 \quad \text{for all } \mathbf{x} \neq \mathbf{0} \qquad (29.40)$$

It can be shown that all symmetric matrices of the form X^TX and XX^T are positive semi-definite [2]. These cross-product matrices include the widely used dispersion matrices which can take the form of a variance-covariance or correlation matrix, among others (see Section 29.7).
An eigenvalue or characteristic root of a symmetric matrix A of dimension p is a root λ² of the characteristic equation:

$$|\mathbf{A} - \lambda^2 \mathbf{I}| = 0 \qquad (29.41)$$

where |A - λ²I| means the determinant of the matrix A - λ²I [2]. The determinant in this equation can be developed into a polynomial of degree p of which all p roots λ_k² are real. Additionally, if A is positive semi-definite then all roots are non-negative. Furthermore, it can be shown that the sum of the eigenvalues is equal to the trace of the symmetric matrix A:

$$\sum_{k=1}^{p} \lambda_k^2 = \text{tr}(\mathbf{A}) \qquad (29.42)$$

and that the product of the eigenvalues is equal to the determinant of the symmetric matrix A:

$$\prod_{k=1}^{p} \lambda_k^2 = |\mathbf{A}| \qquad (29.43)$$

By way of example we construct a positive semi-definite matrix A of dimensions 2×2 from which we propose to determine the characteristic roots. The square matrix A is derived as the product of a rectangular matrix X with its transpose in order to ensure symmetry and positive semi-definiteness:

$$\mathbf{A} = \mathbf{X}^T\mathbf{X} = \begin{bmatrix} 2 & -5 & 1 & 3 \\ -1 & 4 & -2 & 0 \end{bmatrix} \begin{bmatrix} 2 & -1 \\ -5 & 4 \\ 1 & -2 \\ 3 & 0 \end{bmatrix} = \begin{bmatrix} 39 & -24 \\ -24 & 21 \end{bmatrix}$$

from which follows the characteristic equation:

$$|\mathbf{A} - \lambda^2\mathbf{I}| = \begin{vmatrix} 39 - \lambda^2 & -24 \\ -24 & 21 - \lambda^2 \end{vmatrix} = 0$$

The determinant in this characteristic equation can be developed according to the methods described in Section 9.3.4:

$$|\mathbf{A} - \lambda^2\mathbf{I}| = (39 - \lambda^2)(21 - \lambda^2) - (-24)(-24) = 0$$

which leads to a quadratic equation in λ²:

$$(\lambda^2)^2 - (21 + 39)\lambda^2 + 39 \times 21 - 24 \times 24 = 0$$

or

$$(\lambda^2)^2 - 60\lambda^2 + 243 = 0$$

From the form of this equation we deduce that the characteristic equation has two positive roots:

$$\lambda_{1,2}^2 = 30 \pm (30^2 - 243)^{1/2} = 30 \pm 25.632$$

or

$$\lambda_1^2 = 55.632 \quad \text{and} \quad \lambda_2^2 = 4.368$$

It can be easily verified that:

$$\sum_{k=1}^{2} \lambda_k^2 = \text{tr}(\mathbf{A}) = 60 \qquad \prod_{k=1}^{2} \lambda_k^2 = |\mathbf{A}| = 243$$

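These characteristic roots can be checked with NumPy (an added illustration; note that the λ² of the text correspond to the ordinary eigenvalues returned by the software):

    import numpy as np

    X = np.array([[ 2, -1],
                  [-5,  4],
                  [ 1, -2],
                  [ 3,  0]])
    A = X.T @ X                           # [[39, -24], [-24, 21]]

    lam2 = np.linalg.eigvalsh(A)          # eigenvalues, ascending order
    print(lam2)                           # [ 4.368... 55.632...]
    print(lam2.sum(), np.trace(A))        # both 60
    print(lam2.prod(), np.linalg.det(A))  # both about 243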

If A is a symmetric positive definite matrix then we obtain that all eigenvalues are positive. As we have seen, this occurs when all columns (or rows) of the matrix A are linearly independent. Conversely, a linear dependence in the columns (or rows) of A will produce a zero eigenvalue. More generally, if A is symmetric and positive semi-definite of rank r < p, then A possesses r positive eigenvalues and (p - r) zero eigenvalues [6].
In the previous section we have seen that when A has the form of the product of a matrix X with its transpose, then the rank of A is the same as the rank of X. This can be easily demonstrated by means of a simplified illustration:

$$\mathbf{A} = \mathbf{X}^T\mathbf{X} = \begin{bmatrix} 1 & -3 & 2 & -1 \\ -2 & 6 & -4 & 2 \end{bmatrix} \begin{bmatrix} 1 & -2 \\ -3 & 6 \\ 2 & -4 \\ -1 & 2 \end{bmatrix} = \begin{bmatrix} 15 & -30 \\ -30 & 60 \end{bmatrix}$$

Note that there is a linear dependence in X which is transmitted to the matrix of cross-products A: the second column equals -2 times the first, both in X and in A.
The singularity of A can also be ascertained by inspection of the determinant |A|, which in this case equals zero. As a result, the characteristic equation has the form of a degenerated quadratic:

$$(\lambda^2)^2 - (15 + 60)\lambda^2 + 15 \times 60 - 30 \times 30 = 0$$

The last term of the characteristic equation is always equal to the determinant of A, which in this case equals zero. Hence we obtain:

$$(\lambda^2)^2 - 75\lambda^2 = 0$$

which leads to:

$$\lambda_1^2 = 75 \quad \text{and} \quad \lambda_2^2 = 0$$

with

$$\sum_{k=1}^{2} \lambda_k^2 = \text{tr}(\mathbf{A}) = 75 \quad \text{and} \quad \prod_{k=1}^{2} \lambda_k^2 = |\mathbf{A}| = 0$$


An eigenvector or characteristic vector is a nontrivial normalized vector v (distinct from 0) which satisfies the eigenvector relation:

$$(\mathbf{A} - \lambda^2\mathbf{I})\mathbf{v} = \mathbf{0} \quad \text{or} \quad \mathbf{A}\mathbf{v} = \lambda^2\mathbf{v} \qquad (29.44)$$

from which follows that:

$$\mathbf{v}^T\mathbf{A}\mathbf{v} = \lambda^2 \qquad (29.45)$$

because of the orthonormality condition:

$$\mathbf{v}^T\mathbf{v} = 1$$

We have seen above that a symmetric non-singular matrix of dimensions p×p has p positive eigenvalues which are roots of the characteristic equation. To each of these p eigenvalues λ_k² one can associate an eigenvector v_k. The p eigenvectors are normalized and mutually orthogonal. This leads us to the eigenvalue decomposition (EVD) of a symmetric non-singular matrix A:

$$\mathbf{V}^T\mathbf{A}\mathbf{V} = \mathbf{\Lambda}^2 \qquad (29.46)$$

with the orthonormality condition:

$$\mathbf{V}^T\mathbf{V} = \mathbf{V}\mathbf{V}^T = \mathbf{I}_p$$

where I_p is the p×p identity matrix and where Λ² is a p×p diagonal matrix in which the elements of the main diagonal are the p eigenvalues associated to the eigenvectors (columns) in the p×p matrix V. Because Λ² is a diagonal matrix, the decomposition is also called diagonalization of A. Algorithms for eigenvalue decomposition are discussed in Section 31.4. These are routinely used in the multivariate analysis of measurement tables and contingency tables (Chapters 31 and 32).
Because of the orthonormality condition we can rearrange the terms of the decomposition in eq. (29.46) into the expression:

$$\mathbf{A} = \mathbf{V}\mathbf{\Lambda}^2\mathbf{V}^T \qquad (29.47)$$

which is known as the spectral decomposition of A. The latter can also be expanded in the form:

$$\mathbf{A} = \lambda_1^2\, \mathbf{v}_1\mathbf{v}_1^T + ... + \lambda_k^2\, \mathbf{v}_k\mathbf{v}_k^T + ... + \lambda_p^2\, \mathbf{v}_p\mathbf{v}_p^T \qquad (29.48)$$

where v_k represents the kth column of V and where λ_k² is the eigenvalue associated to v_k.
By way of example, we propose to extract the eigenvectors from the symmetric matrix A of which the eigenvalues have been derived at the beginning of this section:

$$\mathbf{A} = \begin{bmatrix} 39 & -24 \\ -24 & 21 \end{bmatrix}$$

and which has been found to be positive definite with eigenvalues:

$$\lambda_1^2 = 55.632 \quad \text{and} \quad \lambda_2^2 = 4.368$$

The eigenvector relation for the first eigenvector can then be written as:

$$(\mathbf{A} - \lambda_1^2\mathbf{I})\mathbf{v}_1 = \begin{bmatrix} 39 - 55.632 & -24 \\ -24 & 21 - 55.632 \end{bmatrix} \begin{bmatrix} v_{11} \\ v_{21} \end{bmatrix} = \mathbf{0}$$

where v_11 and v_21 are the unknown elements of the eigenvector v_1 associated to λ_1². The determinant of the corresponding system of homogeneous linear equations equals zero:

$$|\mathbf{A} - \lambda_1^2\mathbf{I}| = (-16.632)(-34.632) - (-24)(-24) = 0$$

within the limits of precision of our calculation. Hence, we can solve for the unknowns v'_11 and v'_21, which are the non-normalized elements v_11 and v_21:

$$-16.632\, v'_{11} - 24\, v'_{21} = 0$$
$$-24\, v'_{11} - 34.632\, v'_{21} = 0$$

This leads to the solution:

$$v'_{11} = 1 \qquad v'_{21} = -16.632/24 = -24/34.632 = -0.693$$

Since the norm of v'_1 is defined as:

$$\|\mathbf{v}'_1\| = (v_{11}'^2 + v_{21}'^2)^{1/2} = (1 + 0.693^2)^{1/2} = 1.217$$

we derive the elements of the normalized eigenvector v_1 from:

$$v_{11} = v'_{11} / \|\mathbf{v}'_1\| = 1/1.217 = 0.822$$
$$v_{21} = v'_{21} / \|\mathbf{v}'_1\| = -0.693/1.217 = -0.569$$

or

$$\mathbf{v}_1 = [0.822, -0.569]^T$$
In order to compute the second eigenvector, we make use of the spectral decomposition of the matrix A:

$$\mathbf{A} - \lambda_1^2\, \mathbf{v}_1\mathbf{v}_1^T = \lambda_2^2\, \mathbf{v}_2\mathbf{v}_2^T$$

where the left-hand member is called the residual matrix (or deflated matrix) of A after extraction of the first eigenvector.
This reduces the problem to that of finding the eigenvector v_2 associated to λ_2² from the residual matrix:

$$\mathbf{A} - \lambda_1^2\, \mathbf{v}_1\mathbf{v}_1^T = \begin{bmatrix} 39 & -24 \\ -24 & 21 \end{bmatrix} - 55.632 \begin{bmatrix} 0.822 \\ -0.569 \end{bmatrix} \begin{bmatrix} 0.822 & -0.569 \end{bmatrix} = \begin{bmatrix} 1.417 & 2.045 \\ 2.045 & 2.951 \end{bmatrix}$$

We now have to solve the eigenvector relation:

$$(\mathbf{A} - \lambda_1^2\, \mathbf{v}_1\mathbf{v}_1^T - \lambda_2^2\mathbf{I})\mathbf{v}_2 = \begin{bmatrix} 1.417 - 4.368 & 2.045 \\ 2.045 & 2.951 - 4.368 \end{bmatrix} \begin{bmatrix} v_{12} \\ v_{22} \end{bmatrix} = \mathbf{0}$$

where v_12 and v_22 are the unknown elements of the eigenvector v_2 associated to λ_2². The determinant of the residual matrix is also zero:

$$|\mathbf{A} - \lambda_1^2\, \mathbf{v}_1\mathbf{v}_1^T - \lambda_2^2\mathbf{I}| = (-2.951)(-1.417) - 2.045 \times 2.045 = 0$$

within the limits of precision of our calculation.
Now, we can solve the system of homogeneous linear equations for the unknowns v'_12 and v'_22, which are the non-normalized elements of v_12 and v_22:

$$-2.951\, v'_{12} + 2.045\, v'_{22} = 0$$
$$2.045\, v'_{12} - 1.417\, v'_{22} = 0$$

from which we derive that:

$$v'_{12} = 1 \qquad v'_{22} = 2.951/2.045 = 2.045/1.417 = 1.443$$

After normalization we obtain the elements v_12 and v_22 of the eigenvector v_2 associated to λ_2²:

$$v_{12} = v'_{12} / \|\mathbf{v}'_2\| = 1/1.756 = 0.569$$
$$v_{22} = v'_{22} / \|\mathbf{v}'_2\| = 1.443/1.756 = 0.822$$

where

$$\|\mathbf{v}'_2\| = (1 + 1.443^2)^{1/2} = 1.756$$

It can be shown that the second eigenvector v_2 can also be computed directly from the original matrix A, rather than from the residual matrix A - λ_1² v_1v_1^T, by solving the relation:

$$(\mathbf{A} - \lambda_2^2\mathbf{I})\mathbf{v}_2 = \mathbf{0}$$

This follows from the orthogonality of the eigenvectors v_1 and v_2. We have preferred the residual matrix because this approach is used in iterative algorithms for the calculation of eigenvectors, as is explained in Section 31.4.
Finally, we arrange the eigenvectors column-wise into the matrix V, and the eigenvalues into the diagonal matrix Λ²:

$$\mathbf{V} = \begin{bmatrix} 0.822 & 0.569 \\ -0.569 & 0.822 \end{bmatrix} \qquad \mathbf{\Lambda}^2 = \begin{bmatrix} 55.632 & 0 \\ 0 & 4.368 \end{bmatrix}$$

From V and Λ² one can reconstruct the original matrix A by working out the consecutive matrix products:

$$\mathbf{V}\mathbf{\Lambda}^2\mathbf{V}^T = \begin{bmatrix} 0.822 & 0.569 \\ -0.569 & 0.822 \end{bmatrix} \begin{bmatrix} 55.632 & 0 \\ 0 & 4.368 \end{bmatrix} \begin{bmatrix} 0.822 & -0.569 \\ 0.569 & 0.822 \end{bmatrix} = \begin{bmatrix} 39 & -24 \\ -24 & 21 \end{bmatrix} = \mathbf{A}$$
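For completeness, an added sketch showing the same decomposition obtained numerically, together with a minimal power-method-with-deflation loop; the latter is an illustrative assumption in the spirit of the iterative algorithms referred to above (Section 31.4), not a verbatim rendering of them:

    import numpy as np

    A = np.array([[39.0, -24.0],
                  [-24.0, 21.0]])

    # Direct eigendecomposition (eigh returns eigenvalues in ascending order).
    lam2, V = np.linalg.eigh(A)
    print(lam2)                                      # [ 4.368... 55.632...]
    print(np.allclose(V @ np.diag(lam2) @ V.T, A))   # spectral decomposition, eq. (29.47)

    # Dominant eigenpair by repeated multiplication, then deflation by the
    # residual matrix A - lambda1^2 * v1 v1^T, as in the worked example above.
    def dominant_eigenpair(M, n_iter=200):
        v = np.ones(M.shape[0])
        for _ in range(n_iter):
            v = M @ v
            v /= np.linalg.norm(v)
        return v @ M @ v, v          # eigenvalue via eq. (29.45), and eigenvector

    l1, v1 = dominant_eigenpair(A)
    l2, v2 = dominant_eigenpair(A - l1 * np.outer(v1, v1))
    print(round(l1, 3), round(l2, 3))    # 55.632 4.368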

The 'paper-and-pencil' method of eigenvector decomposition can only be performed on small matrices, such as illustrated above. For matrices with larger dimensions one needs a computer, for which efficient algorithms have been designed (Section 31.4).
Thus far we have considered the eigenvalue decomposition of a symmetric matrix which is of full rank, i.e. which is positive definite. In the more general case of a symmetric positive semi-definite p×p matrix A we will obtain r positive eigenvalues where r ≤ p. In the general case we obtain a p×r matrix of eigenvectors V such that:

$$\mathbf{V}^T\mathbf{A}\mathbf{V} = \mathbf{\Lambda}^2 \quad \text{or} \quad \mathbf{A} = \mathbf{V}\mathbf{\Lambda}^2\mathbf{V}^T \qquad (29.49)$$

with the orthonormality condition:

$$\mathbf{V}^T\mathbf{V} = \mathbf{I}_r$$

where I_r represents the identity matrix of dimension r×r, and where Λ² is the diagonal matrix whose elements on the main diagonal are the r positive eigenvalues, with r ≤ p. As a general rule we cannot write that VV^T equals I_p, but the product possesses a property which is analogous to that of an identity matrix:

$$\mathbf{V}\mathbf{V}^T\mathbf{A} = \mathbf{A}\mathbf{V}\mathbf{V}^T = \mathbf{A} \qquad (29.50)$$
which holds for the matrix A from which V has been computed. This remarkable property is illustrated by means of the singular 2×2 matrix A which we have already introduced at the beginning of this section:

A = [[15, −30], [−30, 60]] = [0.447, −0.894]^T [75] [0.447, −0.894] = V Λ² V^T

and which possesses only one eigenvector and associated eigenvalue. By working out the matrix product of V with its transpose we obtain that:

V V^T = [[0.2, −0.4], [−0.4, 0.8]]

which is different from the 2×2 identity matrix. Nevertheless we find that:

A V V^T = [[15, −30], [−30, 60]] [[0.2, −0.4], [−0.4, 0.8]] = [[15, −30], [−30, 60]] = A

and

V V^T A = [[0.2, −0.4], [−0.4, 0.8]] [[15, −30], [−30, 60]] = [[15, −30], [−30, 60]] = A

The spectral decomposition of a symmetric matrix A can be rewritten in the form:

A = V Λ² V^T = V Λ Λ V^T = S S^T    (29.51)

where:

S = V Λ

and where Λ is a diagonal matrix whose main diagonal elements are equal to the square roots of those of Λ². This expression is also known as the Young-Householder factorization of a symmetric matrix [6].
The spectral decomposition can also be used to compute various powers of a symmetric matrix. For example, the square of a symmetric matrix A of dimension p can be computed by means of its spectral decomposition:

A² = A A = (V Λ² V^T)(V Λ² V^T) = V Λ⁴ V^T    (29.52)

where use is made of the orthonormality of the columns of V. This procedure generally applies to all real exponents and holds also for singular symmetric matrices (for which r < p):

A^s = V Λ^(2s) V^T    for all real exponents s    (29.53)
It should be noted that in the case of a singular matrix A, the dimensions of V and Λ² are p×r and r×r, respectively, where r is smaller than p. The expression in eq. (29.53) allows us to compute the generalized inverse, specifically the Moore-Penrose inverse, of a symmetric matrix A from the expression:

A⁺ = V Λ⁻² V^T    (29.54)

which holds even when A is singular. It provides a more general method for computing inverses, but requires that an algorithm for eigenvector extraction is available.
For the previously defined 2×2 matrix A we obtain the inverse from its eigenvalue decomposition, which has already been derived:

A = [[39, −24], [−24, 21]] = [[0.822, 0.569], [−0.569, 0.822]] [[55.632, 0], [0, 4.368]] [[0.822, −0.569], [0.569, 0.822]]

A⁻¹ = V Λ⁻² V^T

    = [[0.822, 0.569], [−0.569, 0.822]] [[0.017975, 0], [0, 0.22894]] [[0.822, −0.569], [0.569, 0.822]]

    = [[0.0863, 0.0987], [0.0987, 0.1605]]

It can be easily verified that:

A A⁻¹ = A⁻¹ A = I₂

within the numerical precision of our calculation.
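The short sketch below (added for illustration; the function name pinv_symmetric is ours) applies eq. (29.54) to both the full-rank matrix of the example and the singular matrix used earlier, keeping only the r positive eigenvalues.

import numpy as np

def pinv_symmetric(A, tol=1e-10):
    # Moore-Penrose inverse of a symmetric positive semi-definite matrix,
    # computed from its eigenvalue decomposition as in eq. (29.54).
    eigvals, V = np.linalg.eigh(A)
    keep = eigvals > tol                 # retain the r positive eigenvalues
    V_r = V[:, keep]                     # p x r matrix of eigenvectors
    return V_r @ np.diag(1.0 / eigvals[keep]) @ V_r.T

A = np.array([[39.0, -24.0], [-24.0, 21.0]])    # full rank
B = np.array([[15.0, -30.0], [-30.0, 60.0]])    # singular (rank 1)

print(np.round(pinv_symmetric(A), 4))            # approx. [[0.0863, 0.0987], [0.0987, 0.1605]]
print(np.allclose(pinv_symmetric(A) @ A, np.eye(2)))   # True for the non-singular case
print(np.allclose(B @ pinv_symmetric(B) @ B, B))        # Moore-Penrose property for the singular case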


From the spectral decomposition we can deduce that A and A^s always have the same rank, since the rank of a matrix is equal to the number of nonzero eigenvalues, which are also the elements on the main diagonal of Λ² (or Λ^(2s)).
A theorem, which we do not prove here, states that the nonzero eigenvalues of the product AB are identical to those of BA, where A is an n×p and where B is a p×n matrix [3]. This applies in particular to the eigenvalues of the matrices of cross-products XX^T and X^TX which are of special interest in data analysis, as they are related to dispersion matrices such as variance-covariance and correlation matrices. If X is an n×p matrix of rank r, then the product X^TX has r positive eigenvalues in Λ² and possesses r eigenvectors in V, since we have shown above that:

X^T X = V Λ² V^T    with V^T V = I_r and r ≤ p    (29.55)

The transposed product XX^T has the same r eigenvalues in Λ², but generally possesses r different eigenvectors in U:

X X^T = U Λ² U^T    with U^T U = I_r and r ≤ n    (29.56)

We have already mentioned that the determinant of a square symmetric p×p matrix A of the form X^TX can be expressed in terms of its p eigenvalues:

|A| = |X^T X| = ∏ λk²    (k = 1, ..., p)    (29.57)

This is an alternative way of computing determinants which, unlike the usual one
described in Section 9.3.4, allows for a geometrical interpretation.
As we have shown above (Section 29.3), an n×p matrix X can be interpreted geometrically as a pattern of points P^n in the space S^p. In the case of a (hyper)ellipsoidal pattern we find that the axes of symmetry coincide with the eigenvectors of X^TX and that the lengths of the semi-axes are equal to the square roots of the corresponding eigenvalues. This is the case when the data represented by X follow a multivariate normal distribution [3]. The product of the square roots of the eigenvalues can thus be regarded as the product of the lengths of the semi-axes of the ellipsoid. Hence, this product is proportional to the volume enclosed by the (hyper)ellipsoid. From eq. (29.57) it follows that the determinant |A| or |X^TX| can also be regarded as being proportional to the volume enclosed by an ellipsoidal pattern. In this case, the semi-axes of the (hyper)ellipsoid are also oriented in the direction of the eigenvectors of X^TX but their lengths are proportional to the corresponding eigenvalues. When p = 2 we obtain that the determinant is proportional to the area of the ellipse π λ1² λ2²; when p = 3 the determinant is proportional to the volume of the ellipsoid (4π/3) λ1² λ2² λ3², etc. In some applications it is required that this volume be minimal. For example, in an experimental design with design matrix X, D-optimality is achieved when the determinant, i.e. the volume of the ellipsoid, derived from (X^TX)⁻¹ is minimized (Section 24.2.1). This geometrical interpretation may clarify the concept of degeneracy, when two or more eigenvalues are identical. In that case, the (hyper)ellipsoid which encloses the pattern P^n can be seen to be spherical in two or more dimensions of S^p.
Singularity of the matrix A occurs when one or more of the eigenvalues are zero, such as occurs if linear dependences exist between the p rows or columns of A. From the geometrical interpretation it can be readily seen that the determinant of a singular matrix must be zero and that, under this condition, the volume of the pattern P^n has collapsed along one or more dimensions of S^p. Applications of eigenvalue decomposition of dispersion matrices are discussed in more detail in Chapter 31 from the perspective of data analysis.
Singular value decomposition (SVD) of a rectangular matrix X is a method which yields at the same time a diagonal matrix of singular values Λ and the two matrices of singular vectors U and V such that:

X = U Λ V^T    with U^T U = V^T V = I_r    (29.58)

where r is defined as above. This approach is also covered in greater detail in Chapter 31. We only state here that the singular vectors in U and V of X are identical to the eigenvectors of XX^T and X^TX, respectively, up to algebraic signs, and that the singular values in the diagonal matrix Λ are equal to the positive square roots of the corresponding eigenvalues in Λ².
It is important to realize that the eigenvectors in U and in V are computed
independently from one another, while the corresponding singular vectors are
computed simultaneously (such as in the NIPALS algorithm which is discussed in
Section 31.4.1). The difference between eigenvectors and singular vectors resides
only in a possible disagreement between their algebraic signs.

From the 4×2 matrix X of our previous illustration we already derived V and Λ² from the eigenvalue decomposition of the 2×2 cross-product matrix X^TX:

X = [[2, −1], [−5, 4], [1, −2], [3, 0]]

X^T X = V Λ² V^T

      = [[0.822, 0.569], [−0.569, 0.822]] [[55.632, 0], [0, 4.368]] [[0.822, −0.569], [0.569, 0.822]]

      = [[39, −24], [−24, 21]]

In a similar way we can derive the eigenvalue decomposition of the corresponding 4×4 cross-product matrix XX^T:

X X^T = U Λ² U^T

      = [[0.297, 0.152], [−0.856, 0.210], [0.263, −0.514], [0.331, 0.818]] [[55.632, 0], [0, 4.368]] [[0.297, −0.856, 0.263, 0.331], [0.152, 0.210, −0.514, 0.818]]

      = [[5, −14, 4, 6], [−14, 41, −13, −15], [4, −13, 5, 3], [6, −15, 3, 9]]

From these results we can now define the singular vector decomposition of the 4×2 data matrix X:

U Λ V^T = [[0.297, 0.152], [−0.856, 0.210], [0.263, −0.514], [0.331, 0.818]] [[7.459, 0], [0, 2.090]] [[0.822, −0.569], [0.569, 0.822]]

        = [[2, −1], [−5, 4], [1, −2], [3, 0]] = X

within the limits of the precision of our calculation.


Note that the algebraic signs of the columns in U and V are arbitrary as they
have been computed independently. In the above illustration, we have chosen the
signs such as to be in agreement with the theoretical result. This problem does not
occur in practical situations, when appropriate algorithms are used for singular
vector decomposition.
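A compact sketch of this relation (added here, not part of the original text) is given below; it computes the SVD of the 4×2 example matrix and compares it with the eigendecomposition of X^TX, up to the algebraic signs discussed above.

import numpy as np

X = np.array([[2.0, -1.0],
              [-5.0, 4.0],
              [1.0, -2.0],
              [3.0, 0.0]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(np.round(s, 3))                        # singular values, approx. [7.459, 2.090]
print(np.round(s**2, 3))                     # approx. [55.632, 4.368] = eigenvalues of X^T X
print(np.allclose(U @ np.diag(s) @ Vt, X))   # reconstruction X = U Lambda V^T

# Eigenvectors of X^T X agree with the rows of Vt up to sign.
eigvals, V = np.linalg.eigh(X.T @ X)
print(np.round(V[:, ::-1], 3))               # compare with Vt.T, possibly sign-flipped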

29.7 Statistical interpretation of matrices

The vector of column-means m_p of an n×p matrix X is defined as follows:

m_p^T = (1/n) 1_n^T X    or    m_j = (1/n) Σ_i x_ij    with j = 1, ..., p    (29.59)

where 1_n is the sum vector of dimension n, which is composed of n elements equal to 1 [1]. For example, when n equals four, we define the sum vector by means of:

1_4 = [1  1  1  1]^T

The elements of m_p are the coordinates of the column-centroid. Geometrically, this point corresponds with the center of mass of the row-pattern P^n of points formed in column-space S^p. In this chapter we assume that all points in a pattern are given the same mass. In data analysis it is possible to assign different masses to individual points, thus giving more weight to certain points than to others. This aspect will be further developed in Chapter 32.
In order to be consistent with the concept of the two dual data spaces, we must also define the vector of row-means m_n:

m_n = (1/p) X 1_p    or    m_i = (1/p) Σ_j x_ij    with i = 1, ..., n    (29.60)

where 1_p is the sum vector of dimension p.
The elements of m_n represent the coordinates of the row-centroid, which corresponds with the center of mass of the column-pattern P^p of points formed in row-space S^n.
The global mean m is a scalar defined by:

m = (1/(np)) 1_n^T X 1_p    or    m = (1/(np)) Σ_i Σ_j x_ij    (29.61)
Usually, the raw data in a matrix are preprocessed before being submitted to multivariate analysis. A common operation is reduction by the mean, or centering. Centering is a standard transformation of the data which is applied in principal components analysis (Section 31.3). Subtraction of the column-means from the elements in the corresponding columns of an n×p matrix X produces the matrix of deviations from column-means, or the column-centered matrix Y_p:

Y_p = X − 1_n m_p^T    (29.62)

or

y_ij = x_ij − m_j    with i = 1, ..., n and j = 1, ..., p

where the outer product has dimension n×p. Note that this outer product is required in order for the above expression to be consistent with matrix addition.
The geometrical interpretation of column-centering is a parallel translation of the pattern of points P^n representing the rows in column-space S^p, such that the center of mass coincides with the origin of space. Column-centering does not affect distances between points of P^n in S^p. It does change, however, distances of P^p in the dual space S^n. Figures 29.8a and b show the dual patterns in the dual spaces before and after column-centering. The geometrical interpretation of column-centering is more fully explained in Chapter 31.
Similarly as in eq. (29.62), subtraction of the row-means from the elements in the corresponding rows of the matrix X results in the matrix of deviations from row-means, or row-centered matrix Y_n:

Y_n = X − m_n 1_p^T    (29.63)

Fig. 29.8. (a) Pattern of points in column-space S^p (left panel) and in row-space S^n (right panel) before column-centering. (b) After column-centering, the pattern in S^p is translated such that the centroid coincides with the origin of space. Distances between points in S^p are conserved while those in S^n are not. (c) After column-standardization, distances between points in S^p and S^n are changed. Points in S^n are located on a (hyper)sphere centered around the origin of space.

or

y_ij = x_ij − m_i    with i = 1, ..., n and j = 1, ..., p

where the outer product has dimension n×p. Row-centering achieves a parallel translation of the pattern of points P^p representing the columns in row-space S^n, such that the center of mass coincides with the origin. It leaves distances in S^n unchanged, but affects distances in S^p.
Finally, we can center the matrix X simultaneously by rows and by columns, which yields the matrix of deviations from row- and column-means, or the double-centered matrix Y:

Y = X − 1_n m_p^T − m_n 1_p^T + m 1_n 1_p^T    (29.64)

or

y_ij = x_ij − m_j − m_i + m    with i = 1, ..., n and j = 1, ..., p

where all outer products are of dimension n×p.


For example, given a 4×3 matrix X:

X = [[1, 5, 3], [7, −1, 4], [−3, 9, 6], [1, 3, −2]]

we derive

(a) the vector of column-means m_p and the column-centered matrix Y_p:

m_p = [1.50  4.00  2.75]^T

Y_p = [[−0.5, 1, 0.25], [5.5, −5, 1.25], [−4.5, 5, 3.25], [−0.5, −1, −4.75]]

It can be verified that the sums of Y_p computed column-wise are zero.

(b) the vector of row-means m_n and the row-centered matrix Y_n:

m_n = [3.000  3.333  4.000  0.667]^T

Y_n = [[−2.000, 2.000, 0.000], [3.667, −4.333, 0.667], [−7.000, 5.000, 2.000], [0.333, 2.333, −2.667]]

It can be verified that the sums of Y_n computed row-wise are zero.

(c) the global mean m and the double-centered matrix Y:

m = 2.750

Y = [[−0.750, 0.750, 0.000], [4.917, −5.583, 0.667], [−5.750, 3.750, 2.000], [1.583, 1.083, −2.667]]

In the above matrix Y we obtain that the sums computed row- and column-wise are zero.
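The following short sketch (added for illustration, not part of the original text) reproduces the centering operations of eqs. (29.62)-(29.64) on this 4×3 example matrix.

import numpy as np

X = np.array([[1.0, 5.0, 3.0],
              [7.0, -1.0, 4.0],
              [-3.0, 9.0, 6.0],
              [1.0, 3.0, -2.0]])

m_p = X.mean(axis=0)        # column-means, approx. [1.50, 4.00, 2.75]
m_n = X.mean(axis=1)        # row-means, approx. [3.000, 3.333, 4.000, 0.667]
m = X.mean()                # global mean, 2.75

Yp = X - m_p                               # column-centered matrix, eq. (29.62)
Yn = X - m_n[:, np.newaxis]                # row-centered matrix, eq. (29.63)
Y = X - m_p - m_n[:, np.newaxis] + m       # double-centered matrix, eq. (29.64)

print(np.allclose(Yp.sum(axis=0), 0))      # column sums of Yp are zero
print(np.allclose(Yn.sum(axis=1), 0))      # row sums of Yn are zero
print(np.allclose(Y.sum(axis=0), 0) and np.allclose(Y.sum(axis=1), 0))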
A vector of column-standard deviations d_p provides for each column of X a measure of the spread of the elements around the corresponding column-mean:

d_p^T = ((1/n) 1_n^T Y_p²)^1/2    or    d_j = ((1/n) Σ_i y_ij²)^1/2    with j = 1, ..., p    (29.65)

where Y_p is the column-centered matrix obtained from X and where the power symbols indicate that all elements of Y_p are to be squared and that the square root of all elements in the vector-to-matrix product is to be taken. We divide by n rather than by n − 1 in the above definition of standard deviation. The former seems to be more appropriate in exploratory data analysis, where the emphasis lies in revealing geometrical structure in the data. The latter is preferred in confirmatory data analysis, where the focus is on estimating statistical properties from sampled data.
Each element d_j of the vector d_p contains the norm or length of the corresponding column y_j in Y_p, multiplied by a constant:

d_j = (1/√n) ||y_j||    since    ||y_j|| = (Σ_i y_ij²)^1/2
The geometrical interpretation of column-standard deviations is in terms of distances of the points representing the columns of Y_p from the origin of S^n (multiplied by the constant 1/√n). Division of each element of an n×p column-centered matrix Y_p by its corresponding column-standard deviation yields the matrix of column-standardized scores, or the column-standardized matrix Z_p:

Z_p = Y_p D_p⁻¹    (29.66)

or

z_ij = y_ij / d_j    with i = 1, ..., n and j = 1, ..., p

provided that d_j ≠ 0, and where D_p is the diagonal matrix of dimension p in which the main diagonal elements are equal to the elements in d_p. All columns in Z_p have mean zero and unit standard deviation. After column-standardization, the row-pattern P^n is centered about the origin of S^p. The points in the column-pattern P^p of p points in the dual space S^n are at equal distances (exactly √n) from the origin. Hence, after column-standardization, points in S^n are forced to lie on a (hyper)sphere around the origin. Note that angular distances in S^n after column-standardization are the same as those defined by column-centering (Fig. 29.8c). In S^p, distances after column-standardization are different from those defined by column-centering.
Similar definitions to those in eqs. (29.65) and (29.66) can be derived for the matrix of deviations from row-means Y_n and for the row-standardized matrix Z_n:

d_n = ((1/p) Y_n² 1_p)^1/2    or    d_i = ((1/p) Σ_j y_ij²)^1/2    with i = 1, ..., n    (29.67)

and

d_i = (1/√p) ||y_i||    since    ||y_i|| = (Σ_j y_ij²)^1/2

Z_n = D_n⁻¹ Y_n    (29.68)

or

z_ij = y_ij / d_i    with i = 1, ..., n and j = 1, ..., p

provided that d_i ≠ 0, and where D_n is the diagonal matrix of dimension n in which the main diagonal elements are equal to the elements in d_n.
Row-standardization forces points in S^p on a (hyper)sphere about the origin. In S^p, angular distances after row-standardization are the same as those defined by row-centering. Distances defined in S^n by row-standardization are different from those defined by row-centering. Note that it is not generally possible to obtain a double-standardized matrix, i.e. a matrix which has at the same time unit row- and unit column-standard deviations.
Using the same matrix X as in the illustration above we derive:
(a) the vector of column-standard deviations d_p and the column-standardized matrix Z_p from the column-centered matrix Y_p:

d_p = [3.571  3.606  2.947]^T

Z_p = [[−0.140, 0.277, 0.085], [1.540, −1.387, 0.424], [−1.260, 1.387, 1.103], [−0.140, −0.277, −1.612]]

(b) the vector of row-standard deviations d_n and the row-standardized matrix Z_n from the row-centered matrix Y_n:

d_n = [1.633  3.300  5.099  2.055]^T

Z_n = [[−1.225, 1.225, 0.000], [1.111, −1.313, 0.202], [−1.373, 0.981, 0.392], [0.162, 1.136, −1.298]]
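A brief sketch of these standardizations (added here; it is not from the original text) follows, using population standard deviations, i.e. division by n and p as the text prescribes.

import numpy as np

X = np.array([[1.0, 5.0, 3.0],
              [7.0, -1.0, 4.0],
              [-3.0, 9.0, 6.0],
              [1.0, 3.0, -2.0]])

Yp = X - X.mean(axis=0)                       # column-centered
d_p = np.sqrt((Yp**2).mean(axis=0))           # approx. [3.571, 3.606, 2.947]
Zp = Yp / d_p                                 # column-standardized matrix, eq. (29.66)

Yn = X - X.mean(axis=1, keepdims=True)        # row-centered
d_n = np.sqrt((Yn**2).mean(axis=1))           # approx. [1.633, 3.300, 5.099, 2.055]
Zn = Yn / d_n[:, np.newaxis]                  # row-standardized matrix, eq. (29.68)

print(np.round(Zp, 3))
print(np.round(Zn, 3))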
After preprocessing of a raw data matrix, one proceeds to extract the structural features from the corresponding patterns of points in the two dual spaces, as is explained in Chapters 31 and 32. These features are contained in the matrices of sums of squares and cross-products, or cross-product matrices for short, which result from multiplying a matrix X (or X^T) with its transpose:

S_n = X X^T    (29.69)

S_p = X^T X    (29.70)

where X is an n×p data matrix. The n×n matrix S_n is symmetric and contains all scalar products that can be formed from the rows of X. The p×p matrix S_p is also symmetric and contains all scalar products that can be formed from the columns of X. Using the previously defined properties of vectors we can derive from these cross-products the norms of the vectors, their angular distances and the distances between their endpoints. This shows that the structural information of the patterns in space is contained in the cross-product matrices.

From a previously derived property of the trace of a product of two matrices (eq. (29.26)) it follows that:

tr(S_n) = tr(X X^T) = tr(X^T X) = tr(S_p)    (29.71)

For the 4×3 matrix X in the previous illustration in this section, we derive the cross-product matrices:

S_n = [[35, 14, 60, 10], [14, 66, −6, −4], [60, −6, 126, 12], [10, −4, 12, 14]]

S_p = [[60, −26, 11], [−26, 116, 59], [11, 59, 65]]

for which holds that:

tr(S_n) = tr(S_p) = 241
A special form of cross-product matrix is the variance-covariance matrix (or covariance matrix for short) C_p, which is based on the column-centered matrix Y_p derived from an original matrix X:

C_p = (1/n) Y_p^T Y_p    (29.72)

The matrix C_p contains the variances of the columns of X on the main diagonal and the covariances between the columns in the off-diagonal positions (see also Section 9.3.2.4.4). The correlation matrix R_p is derived from the column-standardized matrix Z_p:

R_p = (1/n) Z_p^T Z_p    (29.73)

which contains the coefficients of correlation between the columns of X.
From Y_p and Z_p, which have been computed in the previous examples, we obtain the corresponding column-covariance matrix C_p and column-correlation matrix R_p:

C_p = [[12.750, −12.500, −1.375], [−12.500, 13.000, 3.750], [−1.375, 3.750, 8.688]]

R_p = [[1, −0.971, −0.131], [−0.971, 1, 0.353], [−0.131, 0.353, 1]]

Similar expressions to those in eqs. (29.72) and (29.73) can be derived for the variance-covariances and correlations between the rows of X:

C_n = (1/p) Y_n Y_n^T    (29.74)

R_n = (1/p) Z_n Z_n^T    (29.75)

where Y_n and Z_n now represent the row-centered and row-standardized matrices, respectively, as defined above. The row-covariance matrix C_n and the row-correlation matrix R_n are obtainable from the previously computed Y_n and Z_n:

C_n = [[2.667, −5.333, 8.000, 1.333], [−5.333, 10.889, −15.333, −3.556], [8.000, −15.333, 26.000, 1.333], [1.333, −3.556, 1.333, 4.222]]

R_n = [[1, −0.990, 0.961, 0.397], [−0.990, 1, −0.911, −0.524], [0.961, −0.911, 1, 0.127], [0.397, −0.524, 0.127, 1]]
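A small sketch of eqs. (29.72)-(29.75) is given below (added for illustration, not part of the original text); it divides by n and p as in the definitions above.

import numpy as np

X = np.array([[1.0, 5.0, 3.0],
              [7.0, -1.0, 4.0],
              [-3.0, 9.0, 6.0],
              [1.0, 3.0, -2.0]])
n, p = X.shape

Yp = X - X.mean(axis=0)
Cp = Yp.T @ Yp / n                          # column-covariance matrix, eq. (29.72)
Zp = Yp / np.sqrt((Yp**2).mean(axis=0))
Rp = Zp.T @ Zp / n                          # column-correlation matrix, eq. (29.73)

Yn = X - X.mean(axis=1, keepdims=True)
Cn = Yn @ Yn.T / p                          # row-covariance matrix, eq. (29.74)
Zn = Yn / np.sqrt((Yn**2).mean(axis=1, keepdims=True))
Rn = Zn @ Zn.T / p                          # row-correlation matrix, eq. (29.75)

print(np.round(Cp, 3))   # approx. [[12.75, -12.5, -1.375], ...]
print(np.round(Rp, 3))   # approx. [[1, -0.971, -0.131], ...]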

The eigenvectors extracted from the cross-product matrices or the singular


vectors derived from the data matrix play an important role in multivariate data
analysis. They account for a maximum of the variance in the data and they can be
likened to the principal axes (of inertia) through the patterns of points that
represent the rows and columns of the data matrix [10]. These have been called
latent variables [9], i.e. variables that are hidden in the data and whose linear
combinations account for the manifest variables that have been observed in order
to construct the data matrix. The meaning of latent variables is explained in detail
in Chapters 31 and 32 on the analysis of measurement tables and contingency
tables.

Fig. 29.9. (a) Geometrical representation of the image s in S^n of a vector v in S^p. To each point in S^n corresponds a vector in S^p, and vice versa. (b) Geometrical representation of the image l in S^p of a vector u in S^n. To each point in S^p corresponds a vector in S^n, and vice versa.

29.8 Geometrical interpretation of matrix products

The matrix-to-vector product can be interpreted geometrically as a projection of a pattern of points upon an axis. As we have seen in Section 29.4 on matrix products, if X is an n×p matrix and if v is a p vector, then the product of X with v produces the n vector s:

s = X v    (29.76)

In particular, we can express the ith element of s by means of the scalar product of the ith row of X with v:

s_i = x_i^T v    with i = 1, ..., n

The matrix X defines a pattern P^n of n points, e.g. x_i, in S^p which are projected perpendicularly upon the axis v. The result, however, is a point s in the dual space S^n. This can be understood as follows. The matrix X is of dimension n×p and the vector v has dimension p. The dimension of the product s is thus equal to n. This means that s can be represented as a point in S^n. The net result of the operation is that the axis v in S^p is imaged by the matrix X as a point s in the dual space S^n. For every axis v in S^p we will obtain an image s formed by X in the dual space. In this context, we use the word image when we refer to an operation by which a point or axis is transported into another space. The word projection is reserved for operations which map points or axes in the same space [11]. The imaging of v in S^p into s in S^n is represented geometrically in Fig. 29.9a. Note that the patterns of points P^n and P^p are represented schematically by elliptic envelopes.
In a similar way, we can think of X^T as representing a pattern P^p of p points, e.g. x_j, in S^n which can be projected upon an axis u and which results in a point l (not to be confounded with the sum vector 1) in the complementary (or dual) space S^p (Fig. 29.9b):

l = X^T u    (29.77)

In particular, we can express the jth element of l as the scalar product of the jth column of X with u:

l_j = x_j^T u    with j = 1, ..., p

Using the same argument as above, we can see that the product is of dimension p, since the matrix X^T has dimensions p×n and the vector u possesses dimension n. Hence, the vector can be imaged as a single point in the dual space S^p. We state that the vector u in S^n is imaged by the matrix X^T into the point l in the dual space S^p.

Fig. 29.10. Geometrical interpretation of multiple linear regression (MLR). The pattern of points in S^p representing a matrix X is projected upon a vector b, which is imaged in S^n by the point ŷ. The orientation of the vector b is determined such that the distance between ŷ and the given y is minimal.

In a general way, we can state that the projection of a pattern of points on an axis produces a point which is imaged in the dual space. The matrix-to-vector product can thus be seen as a device for passing from one space to another. This property of swapping between spaces provides a geometrical interpretation of many procedures in data analysis, such as multiple linear regression and principal components analysis, among many others [12] (see Chapters 10 and 17).
In multiple linear regression (MLR) we are given an n×p matrix X and an n vector y. The problem is to find an unknown p vector b such that the product ŷ of X with b is as close as possible to the original y using a least squares criterion:

ŷ = X b    (29.78)

where X describes n objects (rows) by means of p independent measurements (columns) and where y represents the dependent variable for each of the n objects. In particular, for the ith object we write:

ŷ_i = x_i^T b    with i = 1, ..., n

The least squares criterion states that the norm of the error between observed and predicted (dependent) measurements ||y − ŷ|| must be minimal. Note that the latter condition involves the minimization of a sum of squares, from which the unknown elements of the vector b can be determined, as is explained in Chapter 10.
The geometrical interpretation of MLR is given in Fig. 29.10. The n rows (objects) of X form a pattern P^n of points (represented by x_i) which is projected upon an (unknown) axis b. This causes the axis b in S^p to be imaged by X in the dual space S^n at the point ŷ. The vector of observed measurements y has dimension n and, hence, is also represented as a point in S^n. Is it possible then to define an axis b in S^p such that the predicted ŷ coincides with the observed y? Usually this will not be feasible. One may propose finding the best possible b such that ŷ comes as close to y as possible. A criterion for closeness is to ask for the distance between y and ŷ, which is equal to the norm ||y − ŷ||, to be as small as possible.
From the above we conclude that the product of a matrix with a vector can be
interpreted geometrically as an operation by which a pattern of points is projected
upon an axis. This projection produces an image of the axis at a point in the dual
space. The concept can be extended to the product of a matrix with another matrix.
In this case we can conceive of the latter as a set of axes, each of which produces
image points in the dual space. In the special case when this matrix has only two
columns, the product can be regarded as a projection of a pattern of points upon the
plane formed by the two axes. As a result one obtains two image points (one for
each axis that defines the plane of projection) in the dual space.
The least squares solution of MLR can be formally defined in terms of matrix products (Section 10.2):

b = (X^T X)⁻¹ X^T y    (29.79)

where X is the n×p matrix of independent variables, y is the n vector of dependent observations and b is the p vector of regression coefficients. This leads to an expression for the n vector of estimated dependent observations ŷ:

ŷ = X (X^T X)⁻¹ X^T y    (29.80)

where the term X(X^T X)⁻¹X^T is called an orthogonal projection operator. In geometrical terms, this operator projects from S^n to S^p and back to S^n. When applied to the dependent observations in y, it returns the predictions in ŷ that can be obtained by using the information contained in X. In other words, ŷ reproduces the part of the variation in y that can be predicted from X. The above expression is useful, for example, in the case when one has obtained p independent measurements from a number n_s of additional samples in the form of an n_s×p additional data matrix X_s. Predictions for the corresponding dependent observations ŷ_s are then derived from:

ŷ_s = X_s (X^T X)⁻¹ X^T y    (29.81)
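A minimal sketch of eqs. (29.79)-(29.81) is given below; it is added for illustration only, and the data it uses are made up solely to exercise the formulas.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))            # 10 objects, 3 independent variables
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=10)

b = np.linalg.solve(X.T @ X, X.T @ y)   # eq. (29.79), via the normal equations
y_hat = X @ b                           # eq. (29.80), orthogonal projection of y

X_new = rng.normal(size=(4, 3))         # additional samples
y_new_hat = X_new @ b                   # eq. (29.81), predictions for new objects

print(np.round(b, 3))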
In the general case we use the symbols U and V to represent projection matrices in S^n and S^p, each containing r projection vectors, and the symbols S and L to represent their images in the dual space:

S = X V    for a projection in S^p, producing an image in S^n    (29.82)

L = X^T U    for a projection in S^n, producing an image in S^p    (29.83)

Fig. 29.11. Geometrical interpretation of a rotation of S^p as a change of the frame of coordinate axes to new directions which are defined by the columns in the rotation matrix V (left panel). Likewise, a rotation of S^n can be interpreted as a change of the frame of coordinate axes to new directions which are defined by the columns in the rotation matrix U (right panel).

where X is an n×p matrix and where V and U are p×r and n×r projection matrices, respectively. The images S and L are n×r and p×r matrices, respectively. We have chosen the symbols S and L in accordance with conventional nomenclature used in multivariate data analysis, where projections of rows of a data matrix are called scores and where projections of columns are called loadings.
An orthogonal projection is obtained when the projection matrices V or U are orthonormal:

V^T V = U^T U = I_r    (29.84)

where I_r is the identity matrix of dimension r, i.e. the number of projection vectors in V and U.
Eigenvector projections are those in which the projection vectors u and v are
eigenvectors (or singular vectors) of the data matrix. They play an important role in
multivariate data analysis, especially in the search for meaningful structures in
patterns in low-dimensional space, as will be explained further in Chapters 31 and
32 on the analysis of measurement tables and general contingency tables.
A projection is called a rotation when the projection matrix U or V is non-singular and square. In a geometrical sense, one obtains in this case that the number of projection vectors equals the number of dimensions of the space. The result of a rotation is a change of the frame of coordinates (Fig. 29.11). A non-orthogonal rotation will result in an oblique frame of coordinate axes, and interdistances between points that represent rows and columns will generally not be conserved. Of special interest are orthogonal rotations, which are obtained by multiplying a matrix with an orthonormal rotation matrix. In the case of an n×p non-singular matrix X with n > p, we have:

S = X V    with V V^T = V^T V = I_p    (29.85)

where V is a p×p orthonormal rotation matrix and where S denotes the rotated n×p matrix X.
In the case of an n×p non-singular matrix X with n < p, we obtain:

L = X^T U    with U U^T = U^T U = I_n    (29.86)

where U is an n×n orthonormal rotation matrix and where L stands for the rotated p×n matrix X^T.
Orthogonal rotation produces a new orthogonal frame of reference axes which are defined by the column-vectors of U and V. The structural properties of the pattern of points, such as distances and angles, are conserved by an orthogonal rotation, as can be shown by working out the matrices of cross-products:

S S^T = X V V^T X^T = X X^T    (29.87)

or

L L^T = X^T U U^T X = X^T X    (29.88)

where use is made of the orthogonality of the rotation matrices.
After an orthogonal rotation one can also perform a backrotation toward the original frame of reference axes:

S V^T = X V V^T = X    (29.89)

or

L U^T = X^T U U^T = X^T    (29.90)

where V and U are orthonormal rotation matrices and where use is made of the same property of orthogonality as stated above.
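These properties are easily demonstrated numerically; the short sketch below (added here, not part of the original text) uses an arbitrary 30-degree rotation matrix as the orthonormal V.

import numpy as np

X = np.array([[2.0, -1.0], [-5.0, 4.0], [1.0, -2.0], [3.0, 0.0]])

theta = np.pi / 6                        # an arbitrary 30-degree rotation
V = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # orthonormal: V^T V = I

S = X @ V                                # rotated matrix, eq. (29.85)
print(np.allclose(S @ S.T, X @ X.T))     # cross-products conserved, eq. (29.87)
print(np.allclose(S @ V.T, X))           # backrotation recovers X, eq. (29.89)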

References

1. W.R. Dillon and M. Goldstein, Multivariate Analysis, Methods and Applications. Wiley, New
York, 1984.
2. F.R. Gantmacher, The Theory of Matrices. Vols. 1 and 2. Chelsea Publ., New York, 1977.
3. N.C. Giri, Multivariate Statistical Inference. Academic Press, New York, 1972.
4. N. Cliff, Analyzing Multivariate Data. Academic Press, San Diego, CA, 1987.
5. R.J. Harris, A Primer on Multivariate Statistics. Academic Press, New York, 1975.
6. C. Chatfield and A.J. Collins, Introduction to Multivariate Analysis. Chapman and Hall, Lon-
don, 1980.
7. M.S. Srivastava and E.M. Carter, An Introduction to Applied Multivariate Statistics. North
Holland, New York, 1983.
8. T.W. Anderson, An Introduction to Multivariate Statistical Analysis. Wiley, New York, 1984.
9. O.M. Kvalheim, Interpretation of direct latent-variable projection methods and their aims and
use in the analysis of multicomponent spectroscopic and chromatographic data. Chemom.
Intell. Lab. Syst., 4 (1988) 11-25.
10. P.E. Green and J.D. Carroll, Mathematical Tools for Applied Multivariate Analysis. Academic
Press, New York, 1976.
11. A. Gifi, Non-linear Multivariate Analysis. Wiley, Chichester, UK, 1990.
12. K. Beebe and B.R. Kowalski, An introduction to multivariate calibration and analysis. Anal.
Chem., 59 (1987) 1007A-1009A.

Chapter 30

Cluster analysis

30.1 Clusters

Clustering or cluster analysis is used to classify objects, characterized by the


values of a set of variables, into groups. It is therefore an alternative to principal
component analysis for describing the structure of a data table. Let us consider an
example.
About 600 iron meteorites have been found on earth. They have been analysed
for 13 inorganic elements, such as Ir, Ni, Ga, Ge, etc. One wonders if certain
meteorites have similar inorganic composition patterns. In other words, one would
like to classify the iron meteorites according to these inorganic composition
patterns. One can view the meteorites as the 600 objects of a data table, each object
being characterized by the concentration of 13 elements, the variables. This means
that one views the 600 objects as points (or vectors) in a 13-dimensional space. To
find groups one could obtain, as we learned in Chapter 17, a principal component
plot and consider those meteorites that are found close together as similar and try to
distinguish in this way clusters or groups of meteorites. Instead of proceeding in
this visual way, one can try to use more formal and therefore more objective
methods. Let us first attempt such a classification by using only two variables (for
instance, Ge and Ni). Fictitious concentrations of these two metals for a number of
meteorites (called A, B, ..., J) are shown in Fig. 30.1. A classification of these
meteorites permits one to distinguish first two clusters, namely ABDFG and
CEHIJ. On closer observation, one notes that the first such group can be divided
into two sub-groups, namely ABF and DG, and that in the second group one can
also discern two sub-groups, namely CEIJ and H.
There are two ways of representing these data by clustering. The first is depicted
by the tree, also called a dendrogram, of Fig. 30.2 and consists in the elaboration of
a hierarchical classification of meteorites. It is hierarchical because large groups
are divided into smaller ones (for instance, the group ABDFG splits into ABF and
DG). These are then split up again until eventually each group consists of only one
meteorite.
This type of classification is very often used in many areas of science. Figure
30.3 shows a very small part of the classification of plants. Individual species are

Fig. 30.1. Concentration of Ni and Ge for ten meteorites A to J.

Fig. 30.2. Hierarchical clustering for the meteorites described in Fig. 30.1.

grouped in genera, genera in families, etc. This classification was obtained histor-
ically by determining characteristics such as the number of cotyledons, the flower
formula, etc. More recently, botanists and scientists from other areas such as
bacteriology where classification is needed, have reviewed the classifications in
their respective fields by numerical taxonomy [1,2]. This consists of considering
the species as objects, characterized by certain variables (number of cotyledons,
etc.). The data table thus obtained is then subjected to clustering. Numerical
taxonomy has inspired other experimental scientists, such as chemists to apply
clustering techniques in their own field.
The other main possibility of representing clustered data is to make a table
containing different clusterings. A clustering is a partition into clusters. For the
example of Fig. 30.1, this could yield Table 30.1. Such a table does not necessarily
yield a complete hierarchy (e.g. in going from 6 to 7, objects J and I, separated for
clustering 6, are joined again for 7). Therefore, the presentation is called non-
hierarchical.
Classical books in the field are the already cited book by Sneath and Sokal [1]
and that by Everitt [3]. A more recent book has been written by Kaufman and

Fig. 30.3. Taxonomy of some plants (angiosperms; monocotyledones, dicotyledones; rosaceae, papilionaceae).

TABLE 30.1

A list of clusterings derived from Fig. 30.1 by non-hierarchical clustering

No. of clusters    Composition of the clusters

1      ABCDEFGHIJ
2      ABFDG - CEHIJ
3      ABF - DG - CEHIJ
4      ABF - DG - CEJI - H
6      A - BF - DG - CI - JE - H
7      A - BF - DG - C - E - JI - H
10     A - B - F - D - G - C - E - J - I - H

Rousseeuw [4]. Massart and Kaufman [5] and Bratchell [6] wrote specifically for
chemometricians. Massart and Kaufman's book contains many examples, relevant
to chemometrics, including the meteorite example [7]. More recent examples
concern classification, for instance according to structural descriptions for toxicity
testing [8] or in connection with combinatorial chemistry [9], according to chemical

composition for aerosol particles [10], Chinese teas [11] or mint species [12] and
according to physicochemical parameters for solvents [13]. The selection of
representative samples from a larger group for multivariate calibration (Chapter
36) by clustering was described by Naes [14].

30.2 Measures of (dis)similarity

30.2.1 Similarity and distance

To be able to cluster objects, one must measure their similarity. From our introduction, it is clear that "distance" may be such a measure. However, many types of similarity coefficients may be applied. While the terms similarity or dissimilarity have no unique definitions, the definition of distance is much clearer [6]. A dissimilarity between two objects i and i' is a distance if

D_ii' ≥ 0, where D_ii' = 0 if x_i = x_i'    (30.1)

(where x_i and x_i' are the row-vectors of the data table X with the measurements describing objects i and i')

D_ii' = D_i'i    (30.2)

D_ia + D_i'a ≥ D_ii'    (30.3)

Equation (30.1) shows that distances are zero or positive; eq. (30.2) that they are symmetric. Equation (30.3), where a is another object, is called the metric inequality. It states that the sum of the distances from any object a to the objects i and i' can never be smaller than the distance between i and i'.

30.2.2 Measures of (dis)similarity for continuous variables

30.2.2.1 Distances
The equation for the Euclidean distance between objects i and i' is

D_ii' = ( Σ_{j=1}^{m} (x_ij − x_i'j)² )^1/2    (30.4)

where m is the number of variables. The concept of Euclidean distance was introduced in Section 9.2.3. In vector notation this can be written as:

D²_ii' = (x_i − x_i')^T (x_i − x_i')

In some cases, one wants to give larger weights to some variables. This leads to the weighted Euclidean distance:

D_ii' = ( Σ_{j=1}^{m} w_j (x_ij − x_i'j)² )^1/2    with Σ_j w_j = 1    (30.5)

The standardized Euclidean distance is given by:

D_ii' = ( Σ_{j=1}^{m} [(x_ij − x_i'j)/s_j]² )^1/2    (30.6)

where s_j is the standard deviation of the values in the jth column of X:

s_j = ( (1/n) Σ_{i=1}^{n} (x_ij − x̄_j)² )^1/2    (30.7)

It can be shown that the standardized Euclidean distance is the Euclidean distance of the autoscaled values of X (see further Section 30.2.2.3). One should also note that in this context the standard deviation is obtained by dividing by n, instead of by (n − 1).
The Mahalanobis distance [15] is given by:

D²_ic = (x_i − x_c)^T C⁻¹ (x_i − x_c)    (30.8)

where C is the variance-covariance matrix of a cluster represented by x_c (e.g. x_c is the centroid of the cluster). It is therefore a distance between a group of objects and a single object i. The distance is corrected for correlation. Consider Fig. 30.4a; the distance between the centre C of the cluster and the objects A and B is the same in Euclidean distances but, since B is part of the group of objects outlined by the ellipse, while A is not, one would like a distance measure such that CA is larger than CB. The point B is "closer" to C than A because it is situated in the direction of the major axis of the ellipse while A is not: the objects situated within the ellipse have values of x1 and x2 that are strongly correlated. For A this will not be the case. It follows that the distance measure should take correlation (or covariance) into account.
In the same way, in Fig. 30.4b, clusters Gl and G2 are closer together than G3
and G4 although the Euclidean distances between the centres are the same. All
groups have the same shape and volume, but Gl and G2 overlap, while G3 and G4
do not. Gl and G2 are therefore more similar than G3 and G4 are.
Generalized distance summarizes eqs. (30.4) to (30.8). It is a weighted distance
of the general form:

Fig. 30.4. Mahalanobis distance: (a) object B is closer to centroid C of cluster G1 than object A; (b) the distance between clusters G1 and G2 is smaller than between G3 and G4.

D²_ii' = (x_i − x_i')^T W (x_i − x_i')    (30.9)

where W represents an m×m weighting matrix. Four particular cases of the generalized distance are mentioned below:
W = I defines ordinary Euclidean distances;
W = diag(w) produces weighted Euclidean distances;
W = diag(1/d²), where d represents the vector of column-standard deviations of X, yields standardized Euclidean distances;
W = C⁻¹, where C represents the variance-covariance matrix as defined in eq. (30.8), defines Mahalanobis distances.
Euclidean distances (ordinary or standardized) are used very often for clustering
purposes. This is not the case for Mahalanobis distance. An application of Mahal-
anobis distances can be found in Ref. [16].
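The sketch below (added for illustration; the function name generalized_distance is ours) computes these special cases of eq. (30.9) for the first two objects of Table 30.3.

import numpy as np

def generalized_distance(x_i, x_k, W):
    d = x_i - x_k
    return float(np.sqrt(d @ W @ d))           # eq. (30.9)

X = np.array([[100.0, 80.0, 70.0, 60.0],       # objects A-E of Table 30.3
              [80.0, 60.0, 50.0, 40.0],
              [80.0, 70.0, 40.0, 50.0],
              [40.0, 20.0, 20.0, 10.0],
              [50.0, 10.0, 20.0, 10.0]])
m = X.shape[1]

s = X.std(axis=0)                              # column-standard deviations (division by n)
C = np.cov(X, rowvar=False, bias=True)         # variance-covariance matrix

d_euclid = generalized_distance(X[0], X[1], np.eye(m))            # W = I
d_stand  = generalized_distance(X[0], X[1], np.diag(1.0 / s**2))  # W = diag(1/d^2)
d_mahal  = generalized_distance(X[0], X[1], np.linalg.inv(C))     # W = C^-1

print(round(d_euclid, 1))    # 40.0, as in Table 30.4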

30.2.2.2 Correlation coefficient


Another way of measuring the similarity between i and i' is to compute the correlation coefficient between the two row-vectors x_i and x_i'. The difference between using Euclidean distance and correlation is explained with the help of Fig. 30.5 and Table 30.2. In Chapter 9 it was shown that r equals the cosine of the angle between vectors. Consider the objects i, i' and i''. The Euclidean distance D_ii' in Fig. 30.5 is the same as D_ii''. However, the angle between x_i and x_i' is much smaller than between x_i and x_i'' and therefore the correlation coefficient is larger. How to choose between the two is not evident and requires chemical considerations. This is shown with the example of Table 30.2, which gives the retention indices of five substances on three gas chromatographic stationary phases (SFs). The question is which of these phases should be considered similar. The similarity measure to be chosen depends on the point of view of the analyst.

Fig. 30.5. The point i is equidistant to i' and i'' according to the Euclidean distances (D_ii' and D_ii'') but much closer to i' (cos θ_ii') than to i'' (cos θ_ii''), when a correlation-based similarity measure is applied.

One point of view might be that those SFs that have more or less the same over-all retention, i.e., the same polarity towards a variety of substances, are considered to be similar. In that case, SF3 is very dissimilar from both SF1 and SF2, while SF1 and SF2 are quite similar. The best way to express this is the Euclidean distance. D13 and D23 are then much higher than D12.
On the other hand, the analyst might not be interested in global retention indices. Indeed, by increasing the temperature for SF3, he would obtain similar retention indices as for the other two. He will then observe that the relative retention times, i.e. the retention times of the substances compared with each other, are the same for SF1 and SF3 and different from SF2. Chemically, this means that SF3 has a different polarity from SF1, but the same specific interactions. This is best expressed by using the correlation coefficient as the similarity measure. Indeed, r13 = 1, indicating complete similarity, while r12 and r23 are much lower. Since both r = 1 and r = −1 are considered to indicate absolute similarity and if, as with Euclidean distance, one would like the numerical value of the similarity measure to increase with increasing dissimilarity, one should use, for instance, 1 − |r|.

TABLE 30.2
Retention indices of five substances on three stationary phases in GLC

Stationary phase (SF)    1      2      3      4      5

1                        100    130    150    160    170
2                        120    110    170    150    145
3                        200    260    300    320    340

30.2.2.3 Scaling
In the meteorite example, the concentration of Ni is of the order of 50000 ppm and the Ga content of the order of 50 ppm. Small relative changes in the Ni content then have, of course, a much higher effect on the Euclidean distance than equally high relative changes of the Ga content. One might also consider two metals M and N, one ranging in concentration from 900 to 1100 ppm, the other from 500 to 1500 ppm. Concentration changes from one end of the range to the other in N would then be more important in the Euclidean distance than the same kind of change in M. It is probable that the person carrying out the classification will not agree with these numerical consequences and consider them as artefacts. Both problems can be solved by scaling the variables. The most usual way of doing this is using the z-transform, also called autoscaling (see also Chapter 3.3). One then determines

z_ij = (x_ij − x̄_j)/s_j    (30.10)

where x_ij is the value for object i of variable j, x̄_j is the mean for variable j, and s_j is the standard deviation for variable j. One then uses z in eq. (30.4), which is equivalent to applying the standardized Euclidean distance (eq. (30.6)) to the x-values.
Other possibilities are range scaling and logarithmic transformation. In range scaling one does not divide by s_j as in eq. (30.6), but by the range r_j of variable j:

z_ij = (x_ij − x̄_j)/r_j    (30.11)

If one wants z_ij expressed on a 0-1 scale, this becomes:

z_ij = (x_ij − x_min,j)/r_j    (30.12)

where x_min,j is the lowest value of x_j. The logarithmic transform, too, reduces variation between variables. Its effect is not to make absolute variation equal but to make variation comparable in the following sense. Suppose that variable 1 has a mean value of 100 and variable 2 a mean value of 10. Variable 1 varies between 50 and 150. If the variation is proportional to the mean values of the variables, then one expects variable 2 to vary between more or less 5 and 15. In absolute values the variation in variable 1 is therefore much larger. When one transforms variables 1 and 2 by taking their logarithms, the variation in the two transformed variables becomes comparable. Log-transformation to correct for heteroscedasticity in a regression context is described in Section 8.2.3.1. It also has the advantage that the scaling does not change when data are added. This is not so for eqs. (30.10) and (30.11), since one must then recompute x̄_j, r_j or s_j.
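A small sketch of these scaling options (added here for illustration; the data matrix is arbitrary and chosen only to have positive concentrations) follows.

import numpy as np

X = np.array([[50000.0, 50.0],
              [52000.0, 40.0],
              [48000.0, 65.0],
              [51000.0, 55.0]])

z_auto = (X - X.mean(axis=0)) / X.std(axis=0)       # eq. (30.10), autoscaling
r = X.max(axis=0) - X.min(axis=0)                   # column ranges
z_range = (X - X.mean(axis=0)) / r                  # eq. (30.11), range scaling
z_01 = (X - X.min(axis=0)) / r                      # eq. (30.12), 0-1 scale
z_log = np.log10(X)                                 # logarithmic transform

print(np.round(z_auto, 2))
print(np.round(z_01, 2))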

Scaling is a very important operation in multivariate data analysis and we will


treat the issues of scaling and normalisation in much more detail in Chapter 31. It
should be noted that scaling has no impact (except when the log transform is used)
on the correlation coefficient and that the Mahalanobis distance is also scale-
invariant because the C matrix contains covariance (related to correlation) and
variances (related to standard deviation).

30.2.3 Measures of (dis)similarity for other variables

30.2.3.1 Binary variables


Binary variables usually have values of 0 (for attribute absent) or 1 (for attribute present). The simplest type of similarity measure is the matching coefficient. For two objects i and i' and attribute j:

s_ii'j = 1 if x_ij = x_i'j

s_ii'j = 0 if x_ij ≠ x_i'j

The matching coefficient is the mean of the s-values for all m attributes:

S_ii' = (1/m) Σ_{j=1}^{m} s_ii'j    (30.13)

This means that one counts the number of attributes for which i and i' have the same value and divides this by the number of attributes.
The Jaccard similarity coefficient is slightly more complex. It considers that the simultaneous presence of an attribute in objects i and i' indicates similarity, but that the absence of the attribute has no meaning. Therefore:

s_ii'j = 1 if x_ij = x_i'j = 1

s_ii'j = 0 if x_ij ≠ x_i'j

s_ii'j is ignored if x_ij = x_i'j = 0

The Jaccard similarity coefficient is then computed with eq. (30.13), where m is now the number of attributes for which at least one of the two objects has a value of 1. This similarity measure is sometimes called the Tanimoto similarity. The Tanimoto similarity has been used in combinatorial chemistry to describe the similarity of compounds, e.g. based on the functional groups they have in common [9]. Unfortunately, the names of similarity coefficients are not standard, so that it can happen that the same name is given to different similarity measures or more than one name is given to a certain similarity measure. This is the case for the Tanimoto coefficient (see further).

The Hamming distance is given by:

d_ii'j = 1 if x_ij ≠ x_i'j

d_ii'j = 0 if x_ij = x_i'j

and

D_ii' = Σ_{j=1}^{m} d_ii'j

It can be shown [5] that the Hamming distance is a binary version of the city-block distance (Section 30.2.3.2).
Some authors use the Hamming distance as the equivalent of the Euclidean distance of binary data. In that case:

D_ii' = ( Σ_{j=1}^{m} d_ii'j )^1/2

The literature also mentions a normalized Hamming distance, which is then equal to either:

D_ii' = (1/m) Σ_{j=1}^{m} d_ii'j

or

D_ii' = ( (1/m) Σ_{j=1}^{m} d_ii'j )^1/2

The first of these two is also called the Tanimoto coefficient by some authors. It can be verified that, since distance = 1 − similarity, this distance is the complement of the simple matching coefficient. Clearly, confusion is possible and authors using a certain distance or similarity measure should always define it unambiguously.
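The following sketch (added for illustration, not part of the original text) computes the binary measures of this section for two arbitrary attribute vectors.

import numpy as np

x_i = np.array([1, 0, 1, 1, 0, 0, 1])
x_k = np.array([1, 1, 0, 1, 0, 0, 0])
m = len(x_i)

matching_coefficient = (x_i == x_k).sum() / m            # eq. (30.13)

both_present = ((x_i == 1) & (x_k == 1)).sum()
at_least_one = ((x_i == 1) | (x_k == 1)).sum()           # double zeros are ignored
jaccard = both_present / at_least_one                    # Jaccard (Tanimoto) similarity

hamming = (x_i != x_k).sum()                             # Hamming distance
normalized_hamming = hamming / m                         # complement of the matching coefficient

print(matching_coefficient, jaccard, hamming, normalized_hamming)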

30.2.3.2 Ordinal variables

For those variables that are measured on a scale of integer values consisting of more than two levels, one uses the Manhattan or city-block distance. This is also referred to as the L1-norm. It is given by:

D_ii' = Σ_{j=1}^{m} d_ii'j    with d_ii'j = |x_ij − x_i'j|    (30.14)

Fig. 30.6. D_ii' is the Euclidean distance between i and i'; d_ii'1 and d_ii'2 are the city-block distances between i and i' for variables x1 and x2 respectively. The city-block distance is d_ii'1 + d_ii'2.

Here, too, scaling can be required when the ranges of the variables are dissimilar. In this case, one divides the distances by the range r_j of variable j:

d_ii'j = |x_ij − x_i'j| / r_j

In this way one obtains d-values from 0 to 1. Then s_ii'j = 1 − d_ii'j.


Manhattan distances can be used also for continuous variables, but this is rarely done, because one prefers Euclidean distances in that case. Figure 30.6 compares the Euclidean and Manhattan distances for two variables. While the Euclidean distance between i and i' is measured along a straight line connecting the two points, the Manhattan distance is the sum of the distances parallel to the axes. The equations for both types of distances are very similar in appearance. In fact, they both belong to the Minkowski distances given by:

D_ii' = ( Σ_{j=1}^{m} |x_ij − x_i'j|^r )^{1/r}    (30.15)

The Manhattan distance is obtained for r = 1 and the Euclidean distance for r = 2. In this context the Euclidean distance is also referred to as the L2-norm.

30.2.3.3 Mixed variables

In some cases, one needs to combine variables of mixed types (binary, ordinal or continuous). The usual way to do this is to eliminate the effect of varying ranges by scaling. All variables are transformed so that they take values from 0 to 1, using range scaling for the continuous variables or the procedure for scaling described for ordinal variables in Section 30.2.3.2, while binary variables are expressed naturally on a 0-1 scale. The range scaled similarity for variables on an interval scale is obtained as

s_ii'j = 1 − |z_ij − z_i'j|

with z_ij and z_i'j as defined in eq. (30.12). Then one can determine the similarity of the objects i and i' by summing the range scaled similarities for all variables j. A distance measure can be obtained by computing the complements d_ii'j = 1 − s_ii'j, where s_ii'j is the range scaled similarity between objects i and i' for variable j, and summing these over all variables.

30.2.4 Similarity matrix

The similarities between all pairs of objects are measured using one of the
measures described earlier. This yields the similarity matrix or, if the distance is
used as measure of (dis)similarity, the distance matrix. It is a symmetrical nxn
matrix containing the similarities between each pair of objects. Let us suppose, for
example, that the meteorites A, B, C, D, and E in Table 30.3 have to be classified
and that the distance measure selected is Euclidean distance. Using eq. (30.4), one
obtains the similarity matrix in Table 30.4. Because the matrix is symmetrical,
only half of this matrix needs to be used.

TABLE 30.3

Example of a data matrix

System    Concentrations (arbitrary units)
          Metal a    Metal b    Metal c    Metal d

A         100        80         70         60
B         80         60         50         40
C         80         70         40         50
D         40         20         20         10
E         50         10         20         10

30.3 Clustering algorithms

30.3.1 Hierarchical methods

There is a wide variety of hierarchical algorithms available and it is impossible


to discuss all of them here. Therefore, we shall only explain the most typical ones,
namely the single linkage, the complete linkage and the average linkage methods.
In the similarity matrix, one seeks the two most similar objects, i.e., the objects for which S_ii' is largest. When using distance as the similarity measure, this means that one looks for the smallest D_ii' value. Let us suppose that it is D_qp, which means that of all the objects to be classified, q and p are the most similar. They are considered to form a new combined object p*. The similarity matrix is thereby reduced to (n − 1) × (n − 1). In average linkage, the similarities between the new object and the others are obtained by averaging the similarities of q and p with these other objects. For example, D_ip* = (D_iq + D_ip)/2. In single linkage, D_ip* is the distance between the object i and the nearest of the linked objects, i.e., it is set equal to the smallest of the two distances D_iq and D_ip: D_ip* = min(D_iq, D_ip). Complete linkage follows the opposite approach: D_ip* is the distance between i and the furthest object q or p. In other words, D_ip* = max(D_iq, D_ip).
At the same time, one starts constructing the dendrogram by linking together q and p at the level D_qp. This process is repeated until all objects are linked in one hierarchical classification system, which is represented by a dendrogram. This procedure can now be illustrated using the data of Tables 30.3 and 30.4. The smallest D is 14.1 (between D and E). D and E are combined first and yield the combined object D*. The successive reduced matrices obtained by average linkage are given in Table 30.5, those obtained by single linkage in Table 30.6 and those obtained by complete linkage in Table 30.7. The dendrograms are shown in Fig. 30.7. Clusters are then obtained by cutting the highest link(s). For instance, by breaking the highest link in Fig. 30.7a, one obtains the clusters (ABC) and (DE). Cutting the second highest links leads to the clustering (A) (BC) (DE). How many links to cut is not always evident (see also Section 30.3.4.2).
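A compact sketch of this procedure (added for illustration, not part of the original text) uses SciPy on the data of Table 30.3. Note that the simple averaging rule described above corresponds to SciPy's 'weighted' (WPGMA) linkage; SciPy's 'average' option is the unweighted (UPGMA) variant.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.array([[100, 80, 70, 60],      # objects A-E of Table 30.3
              [80, 60, 50, 40],
              [80, 70, 40, 50],
              [40, 20, 20, 10],
              [50, 10, 20, 10]], dtype=float)

D = pdist(X, metric='euclidean')      # condensed distance matrix (Table 30.4)

for method in ('single', 'complete', 'weighted'):
    Z = linkage(D, method=method)
    # the third column of Z holds the linkage levels, e.g. 14.1 for (D, E)
    print(method, np.round(Z[:, 2], 1))
    print(fcluster(Z, t=2, criterion='maxclust'))    # cut into two clusters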

TABLE 30.4
Similarity matrix (based on Euclidean distance) for the objects from Table 30.3

      A       B       C       D       E
A     0
B     40.0    0
C     38.7    17.3    0
D     110.4   70.7    78.1    0
E     111.4   72.1    80.6    14.1    0

Fig. 30.7. Dendrograms for the data of Tables 30.3-30.7: (a) average linkage; (b) single linkage; (c) complete linkage.

TABLE 30.5
Successive reduced matrices for the data of Table 30.4 obtained by average linkage

(a)
      A       B       C       D*
A     0
B     40.0    0
C     38.7    17.3    0
D*    110.9   71.4    79.3    0

D* is the object resulting from the combination of D and E.

(b)
      A       B*      D*
A     0
B*    39.3    0
D*    110.9   75.3    0

B* is the object resulting from the combination of B and C.

(c)
      A*      D*
A*    0
D*    93.1    0

A* is the object resulting from the combination of A and B*.

(d)
The last step consists in the junction of A* and D*. The resulting dendrogram is given in Fig. 30.7(a).

TABLE 30.6

Successive reduced matrices for the data of Table 30.4 obtained by single linkage

(a)
      A       B       C       D*
A     0
B     40.0    0
C     38.7    17.3    0
D*    110.4   70.7    78.1    0

(b)
      A       B*      D*
A     0
B*    38.7    0
D*    110.4   70.7    0

(c)
      A*      D*
A*    0
D*    70.7    0

(d)
The last step consists in the junction of A* and D*. The resulting dendrogram is given in Fig. 30.7(b).

We observe that, in this particular instance, the only noteworthy difference between the algorithms is the distance at which the last link is made (from 111.4 for complete linkage to 70.7 for single linkage). When larger data sets are studied, the differences may become more pronounced. In general, average linkage is preferred. In the average linkage mode, one may introduce a weighting of the objects when clusters of unequal size are linked. Both weighted and unweighted methods exist. Another method which gives good results (i.e., has been shown to give meaningful clusters) is known as Ward's method [17]. It is based on a heterogeneity criterion. This is defined as the sum of the squared distances of each member of a cluster to the centroid of that cluster. Elements or clusters are joined with as criterion that the sum of heterogeneities of all clusters should increase as little as possible.
Single linkage methods have a tendency to chain together ill-defined clusters (see Fig. 30.8). This eventually leads to the inclusion of rather different objects (A to X of Fig. 30.8) in the same long drawn-out cluster.

TABLE 30.7

Successive reduced matrices for the data of Table 30.4 obtained by complete linkage

(a)
      A       B       C       D*
A     0
B     40.0    0
C     38.7    17.3    0
D*    111.4   72.1    80.6    0

(b)
      A       B*      D*
A     0
B*    40.0    0
D*    111.4   80.6    0

(c)
      A*      D*
A*    0
D*    111.4   0

(d)
The last step consists in the junction of A* and D*. The resulting dendrogram is given in Fig. 30.7(c).

Fig. 30.8. Dissimilar objects A and X are chained together in a cluster obtained by single linkage.

For that reason one sometimes says that the single linkage method is space contracting. The complete linkage method leads to small, tight clusters and is space dilating. Average linkage and Ward's method are space conserving and seem, in general, to give the better results.

TABLE 30.8
Values characterizing the objects of Fig. 30.9

     x1    x2
A    45    24
B    24    43
C    14    23
D    64    52
E    36    121
F    56    140
G    20    148

TABLE 30.9
Euclidean distance between points in Fig. 30.9 (from Ref. [18])

     A    B    C    D    E    F    G
A    0
B    28   0
C    32   23   0
D    35   40   60   0
E    100  80   103  76   0
F    119  104  128  90   29   0
G    127  105  126  105  30   35   0

Single linkage has the advantage of mathematical simplicity, particularly when it is calculated using an operational research technique called the minimal spanning tree [18]. Although the computations seem to be very different from those in
Table 30.6, exactly the same results are obtained. To explain the method we need a
matrix with some more objects. The data matrix is given in Table 30.8 and the
resulting similarity matrix (Euclidean distances) in Table 30.9.
We may think of these objects as towns, the distances between which are given
in the table, and suppose that the seven towns must be connected to each other by
highways (or a production unit serving six clients using a pipeline). This must be
done in such a way that the total length of the highway is minimal. Two possible
configurations are given in Fig. 30.9. Clearly, (a) is a better solution than (b). Both
(a) and (b) are graphs that are part of the complete graph containing all possible
links and both are connected graphs (all of the nodes are linked directly or
indirectly to each other). These graphs are called trees and the tree for which the
sum of the values of the links is minimal is called the minimal spanning tree. This

Fig. 30.9. Examples of trees in a graph; (a) is the minimal spanning tree [18].

minimal spanning tree is also the optimal solution for the highway problem. The
terminology used in this chapter comes from graph theory. Graph theory is
described in Chapter 42.
Several algorithms can be used to find the minimal spanning tree. One of these
is Kruskal's algorithm [19] which can be stated as follows: add to the tree the edge
with the smallest value which does not form a cycle with the edges already part of
the tree. According to this algorithm, one selects first the smallest value in Table
30.9 (link BC, value 23). The next smallest value is 28 (link AB). The next smallest
values are 29 and 30 (links EF and EG). The next smallest value in the table is 32
(link AC). This would, however, close the cycle ABC and is therefore eliminated.
Instead, the next link that satisfies the conditions of Kruskal's algorithm is AD and
the last one is DE. The minimal spanning tree obtained in this way is that given in
Fig. 30.9(a).
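A minimal sketch of Kruskal's algorithm applied to the distances of Table 30.9 is given below; the union-find bookkeeping and the variable names are our own choices for illustration.

```python
# Distances of Table 30.9 written as (distance, town, town) edges.
edges = [(28, "A", "B"), (32, "A", "C"), (35, "A", "D"), (100, "A", "E"),
         (119, "A", "F"), (127, "A", "G"), (23, "B", "C"), (40, "B", "D"),
         (80, "B", "E"), (104, "B", "F"), (105, "B", "G"), (60, "C", "D"),
         (103, "C", "E"), (128, "C", "F"), (126, "C", "G"), (76, "D", "E"),
         (90, "D", "F"), (105, "D", "G"), (29, "E", "F"), (30, "E", "G"),
         (35, "F", "G")]

parent = {t: t for t in "ABCDEFG"}     # union-find table used to detect cycles

def find(t):
    """Return the representative of the subtree that currently contains t."""
    while parent[t] != t:
        t = parent[t]
    return t

tree = []
for dist, a, b in sorted(edges):       # examine the edges by increasing distance
    ra, rb = find(a), find(b)
    if ra != rb:                       # accept the edge only if it closes no cycle
        parent[ra] = rb
        tree.append((a, b, dist))

print(tree)   # BC, AB, EF, EG, AD, DE: the minimal spanning tree of Fig. 30.9(a)
```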
After careful inspection of this figure, one notes that two clusters can be
obtained in a formal way by breaking the longest edge (DE). When a more detailed
classification is needed, one breaks the second longest edge, and so on until the
desired number of classes is obtained (see Section 30.3.4.2). In the same way,
clusters were obtained from Fig. 30.7 by breaking first the link formed at the highest distance.
An example of an application is shown in Fig. 30.10. This concerns the
classification of 42 solvents based on three solvatochromic parameters (para-
meters that describe the interaction of the solvents with solutes) [13]. Different
methods were applied, among which was the average linkage method, the result of
which is shown in the figure. According to the method applied, several clusterings
can be found. For instance, the first cluster to split off from the majority of solvents
consists of solvents 36, 37, 38, 39, 40, 41, 42 (t-butanol, isopropanol, n-butanol,

Fig. 30.10. Hierarchical agglomerative classification of solvents according to solvent-solute and solvent-solvent interactions [13].

ethanol, methanol, ethylene glycol and water). This solvent class consists of amphiprotic solvents (alcohols and water). This is then split further into the monoalcohols on the one hand and ethylene glycol and water, which have a higher association ability, on the other. In this way, one can develop a detailed classification of the solvents. Another use of such a classification is to select different, representative objects. Snyder [20] used this to select a few solvents that would be different and representative of certain types of solvent-solute interactions. These solvents were
then used in a successful strategy for the optimization of mobile phases for liquid
chromatographic separation.
The hierarchical methods so far discussed are called agglomerative. Good
results can also be obtained with hierarchical divisive methods, i.e., methods that first divide the set of all objects into two so that two clusters result. Then each cluster is again divided into two, etc., until all objects are separated. These methods also lead
to a hierarchy. They present certain computational advantages [21,22].
Hierarchical methods are preferred when a visual representation of the cluster-
ing is wanted. When the number of objects is not too large, one may even compute
a clustering by hand using the minimum spanning tree.
One of the problems in the approaches described above and, in fact, also in those
described in the next sections, is that when objects are added to the data set, the

Fig. 30.11. The three-distance clustering method [23]. The new object A has to be classified. In node V_a it must be decided whether it fits better in the group of nodes represented by V_1, in the group of nodes represented by V_2, or does not fit in either of these groups.

whole clustering must be carried out again. A hierarchical procedure which avoids this problem has been proposed by Zupan [23,24] and is called the three-distance clustering method (3-DM). Let us suppose that a hierarchical clustering has already been obtained and define V_k as a node in the dendrogram, representing all the objects above it. Thus

v_k = (1/n_k) ( Σ_{i=1}^{n_k} x_{i1}, Σ_{i=1}^{n_k} x_{i2}, ..., Σ_{i=1}^{n_k} x_{im} )

where v_k is the mean vector of the n_k vectors representing the n_k objects i = 1, ..., n_k above it and m is the number of variables. For instance, in Fig. 30.11, V_1 is the mean of the three objects O_1, O_2, O_3.
Suppose now also that in an earlier stage one has decided that the new object A belongs to the group of objects above V_a rather than to the groups represented by the other nodes. We will now take one of three possible decisions:
(a) A belongs to V_1
(b) A belongs to V_2
(c) A does not belong to V_1 or V_2.
This depends on the similarity or distance between A, V_1 and V_2. One determines the similarities S_{A,V1}, S_{A,V2} and S_{V1,V2}. If S_{A,V1} is highest, then A belongs to V_1 and the same process as for V_a is repeated for V_1. If S_{A,V2} is highest, then A belongs to the group represented by V_2, and if S_{V1,V2} is highest, then A belongs to V_a but not to V_1 or V_2. A new branch is then started for A between V_a and V_{a-1}.

30.3.2 Non-hierarchical methods

Let us now cluster the objects of Table 30.8 with a non-hierarchical algorithm.
Instead of clustering by joining objects successively, one wants to determine


Fig. 30.12. Forgy's non-hierarchical classification method. A, ..., G are objects to be classified; *1, ..., *4 are successive centroids of clusters.

directly a K-clustering, by which is meant a classification into K clusters. We will apply this for 2 clusters. Of course, one is able to see that the correct 2-clustering is
(A,B,C,D) (E,F,G). In general, one uses m-dimensional data and it is then not
possible to visually observe clusters. In this section, we will also suppose that we
are not able to do this. To obtain 2 clusters, one selects 2 seed points among the
objects and classifies each of the objects with the nearest seed point. In this way, an
initial clustering is obtained. For the objects of Table 30.8, A and B are selected as
first seed points. In Fig. 30.12 it can be seen that this is not a good choice (A and E
would have been better), but it should be remembered that one is supposed to be
unable to observe this. D is nearest to A and C, E, F and G are nearest to B (Table
30.9). The initial clustering is therefore (A,D) (B,C,E,F,G).
For each of these clusters, one determines the centroid (the point with mean values of the variables x1 and x2 for each cluster). For cluster (A,D), the centroid (*1) is characterized by

x1 = (45 + 64) / 2 = 54.5
x2 = (24 + 52) / 2 = 38

and for cluster (B,C,E,F,G) the centroid (*2) is given by

x1 = (24 + 14 + 36 + 56 + 20) / 5 = 30
x2 = (43 + 23 + 121 + 140 + 148) / 5 = 95
The two centroids are shown in Fig. 30.12. In the next step, one reclassifies each
object according to whether it is nearest to *1 or *2. This now leads to the
clustering (A,B,C,D) (E,F,G). The whole procedure is then repeated: new centroids
are computed for the clusters (A,B,C,D) and (E,F,G). These new centroids are
situated in *3 and *4. Reclassification of the objects leads again to (A,B,C,D) (E,F,G). Since the new clustering is the same as the preceding one, this clustering
is considered definitive.
The method used here is called Forgy's method [25]. This is one of the K-center or K-centroid methods, another well-known variant of which is MacQueen's K-means method [26]. Forgy's method involves the following steps.
(1) Select an initial clustering.
(2) Determine the centroids of the clusters and the distance of each object to
these centroids.
(3) Locate each object in the cluster with the nearest centroid.
(4) Compute new cluster centroids and go to step (3). One continues to do this
until convergence occurs (i.e., until the same clustering is found in two
successive assignment steps).
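A minimal NumPy sketch of these steps, applied to the objects of Table 30.8 with the same (deliberately poor) seed points A and B as in the text, might look as follows; the function name and the convergence test are ours.

```python
import numpy as np

# Objects A-G of Table 30.8 (columns x1 and x2).
labels = list("ABCDEFG")
X = np.array([[45, 24], [24, 43], [14, 23], [64, 52],
              [36, 121], [56, 140], [20, 148]], dtype=float)

def forgy(X, centroids, max_iter=100):
    """Alternate assignment and centroid updates until the clustering repeats."""
    assignment = None
    for _ in range(max_iter):
        # distance of every object to every current centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assignment = d.argmin(axis=1)
        if np.array_equal(new_assignment, assignment):
            break                                  # convergence reached
        assignment = new_assignment
        # step (4): new centroids (assumes no cluster becomes empty)
        centroids = np.array([X[assignment == k].mean(axis=0)
                              for k in range(len(centroids))])
    return assignment, centroids

assignment, centroids = forgy(X, X[[0, 1]].copy())   # seed points A and B
for k in range(2):
    print(k, [labels[i] for i in np.where(assignment == k)[0]])
# expected clusters: (A, B, C, D) and (E, F, G)
```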
Instead of using centroids as the points around which the clusters are cons-
tructed, one can select some of the objects themselves. These are then called
centrotypes. Such a method might be preferred if one wants to select representative
objects: the centrotype object will be considered to be representative for the cluster
around it. Returning to the simple example of Fig. 30.12, suppose that one selects
objects A and E as centrotypes. Thus, B, C and D would be classified with A since
they are nearer to it than to E, and F and G would be clustered with E. This method
is based on an operations research model (Chapter 42), the so-called location
model. The points A-G might then be cities in which some central facilities must
be located. The criterion to select A and E as centrotypes is that the sum of the
distances from each town to the nearest facility is minimal when the facilities are
located in A and E. This means that A, B, C and D will then be served by the facility
located in A and E, F and G by the one located in E. An algorithm that allows one to do this was described in the clustering literature under the name MASLOC [27]. Numerical algorithms such as genetic algorithms or simulated annealing can also be applied (e.g. Ref. [11]).
Both methods described above belong to a class of methods that is also called
partitioning or optimization or partitioning-optimization techniques. They parti-
tion the set of objects into subsets according to some optimization criterion. Both
methods use representative elements, in one case an object of the set to be clustered
(the centrotype), in the other an object with real values for the variables that is not
necessarily (and usually not) part of the objects to be clustered (the centroid).
In general, one maximizes the between-cluster Euclidean distance or minimizes the within-cluster Euclidean distance or variance; these criteria amount to the same thing. As
described by Bratchell [6], one can partition total variation, represented by T, into
between-group (B) and within-group components (W).
T =B+W

Fig. 30.13. Agglomerative methods will first link A and B, so that meaningless clusters may result.
The non-hierarchical K=2 clustering will yield clusters I and II.

where T is the total sums of squares and products matrix, related to the variance-
covariance matrix, B is the same matrix for the centroids and W is obtained by
pooling the sums of squares and product matrices for the clusters.
One can also write
tr(T) = tr(B) + tr(W)
Since T and therefore also tr(T) is constant, minimizing tr(W) is equivalent to
maximizing tr(B). It can be shown that tr(B) is the sum of squared Euclidean
distances between the group centroids.
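This partition of the variation can be verified numerically for the 2-clustering (A,B,C,D) (E,F,G) of the Table 30.8 data; the short sketch below (our own illustration) computes T, B and W as sums of squares and products matrices and checks that tr(T) = tr(B) + tr(W).

```python
import numpy as np

# Objects of Table 30.8 and the 2-clustering (A,B,C,D) (E,F,G).
X = np.array([[45, 24], [24, 43], [14, 23], [64, 52],
              [36, 121], [56, 140], [20, 148]], dtype=float)
clusters = [np.array([0, 1, 2, 3]), np.array([4, 5, 6])]

m = X.mean(axis=0)                                   # overall centroid
T = (X - m).T @ (X - m)                              # total SSP matrix

W = sum((X[c] - X[c].mean(axis=0)).T @ (X[c] - X[c].mean(axis=0))
        for c in clusters)                           # pooled within-group SSP
B = sum(len(c) * np.outer(X[c].mean(axis=0) - m, X[c].mean(axis=0) - m)
        for c in clusters)                           # between-group SSP

print(np.allclose(T, B + W))                         # True: T = B + W
print(np.trace(T), np.trace(B) + np.trace(W))        # equal traces
```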
An advantage of non-hierarchical methods compared to hierarchical methods is
that one is not bound by earlier decisions. A simple example of how disastrous this
can be is given in Fig. 30.13 where an agglomerative hierarchical method would
start by linking A and B. On the other hand, the agglomerative methods allow
better visualization, although some visualization methods (e.g. Ref. [28]) have
been proposed for non-hierarchical methods.

30.3.3 Other methods

A group of methods quite often used is based on the idea of describing high local
densities of points. This can be done in different ways. One such way, mode
analysis [29], is described by Bratchell [6]. A graph theoretical method (see also
Chapter 42) by Jardine and Sibson [30] starts by considering each object as a node
and by linking those objects which are more similar than a certain threshold. If a
Euclidean distance is used, this means that only those nodes are linked for which
the distance is smaller than a chosen threshold distance. When this is done, one
determines the so-called maximal complete subgraphs. A complete subgraph is a

Fig. 30.14. Step 1 of the Jardine and Sibson method [30]. Objects less distant than Dj are linked.

set of nodes for which all the nodes are connected to each other. Maximal complete
subgraphs are then the largest (i.e. those containing most objects) of these comp-
lete subgraphs. In Fig. 30.14, they are (B, G, F, D, E, C), (A, B, C, F, G), (H, I, J,
K), etc. Each of these is considered as the kernel of a cluster. One now joins those
kernels that overlap to a large degree with, as criterion, the fact that they should
have at least a prespecified number of nodes in common (for instance, 3). Since
only the first two kernels satisfy this requirement, one considers A, B, C, D, E, F, G
as one cluster and H, I, J, K as another.
Another technique, originally derived from the potential methods described for
supervised pattern recognition (Chapter 33), was described by Coomans and
Massart [31]. A kernel or potential density function is constructed around each
object. In Fig. 30.15 A, B, C and D are objects around which a triangular potential
field is constructed (solid lines). The potentials in each point are summed (broken
line). One selects the point with highest summed potential (B) as cluster center and
measures the summed potential in the closest point. All points, such as A and C,
that can be reached from B along a potential path, which decreases continuously,
belong to the same summed potential hill and such objects are considered to be part
of the cluster. When the summed potential increases again at a certain object, or when there is a point midway between two objects with a lower potential, this means that one has started to climb a new hill and the object is therefore part of another cluster.
The method has the advantage that the form of the cluster is not important, while
most other methods select spherical or ellipsoid clusters. The disadvantage is that
the width of the potential field must be optimized.
All these methods and the methods of the preceding section have one charact-
eristic in common: an object may be part of only one cluster. Fuzzy clustering
applies other principles. It permits objects to be part of more than one cluster. This
leads to results such as those illustrated by Fig. 30.16. Each object i is given a value

Fig. 30.15. One-dimensional example of the potential method [31].

Fig. 30.16. Fuzzy clustering. Two fuzzy clusters (I and II) are obtained. For example, f_{A,I} = 1, f_{A,II} = 0, f_{B,I} = 0, f_{B,II} = 1, f_{C,I} = 0.47, f_{C,II} = 0.53.

f_{ik} for a membership function (see Chapter 19) in cluster k. The following relationships are defined for all i and k:

0 ≤ f_{ik} ≤ 1

and

Σ_{k=1}^{K} f_{ik} = 1

where K is the number of clusters. When f_{ik} = 1, it means that i unambiguously belongs to cluster k; otherwise, the larger f_{ik} is, the more i belongs to cluster k.
The assignment of the membership values is done by an optimization procedure.
Of the many criteria that have been described, probably the best known is that of
Ruspini [32]:

where δ is an empirical constant and d_{ii'} the distance between objects i and i'. The criterion is minimized and therefore requires the membership values of i and i' to be similar when their distance is small. Fuzzy clustering has been applied only to a very limited extent in chemo-
metrics. A good example concerning the classification of seeds from images is
found in Ref. [33].
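As an illustration, the sketch below implements the membership update of fuzzy c-means, the algorithm applied in Ref. [33]; this is a later and more widely used formulation than Ruspini's criterion, and the data, the number of clusters K = 2 and the fuzzifier m = 2 are arbitrary choices made only for this example.

```python
import numpy as np

def fuzzy_c_means(X, K=2, m=2.0, n_iter=100, seed=0):
    """Minimal fuzzy c-means: alternate membership and centroid updates."""
    rng = np.random.default_rng(seed)
    f = rng.random((len(X), K))
    f /= f.sum(axis=1, keepdims=True)          # each row of memberships sums to 1
    for _ in range(n_iter):
        w = f ** m
        centroids = (w.T @ X) / w.sum(axis=0)[:, None]
        # distance of every object to every cluster centre (small offset avoids 0/0)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2) + 1e-12
        # membership update: f_ik = 1 / sum_l (d_ik / d_il)^(2/(m-1))
        f = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))).sum(axis=2)
    return f, centroids

# Objects of Table 30.8; memberships near 0 or 1 indicate almost crisp assignments.
X = np.array([[45, 24], [24, 43], [14, 23], [64, 52],
              [36, 121], [56, 140], [20, 148]], dtype=float)
memberships, centres = fuzzy_c_means(X, K=2)
print(np.round(memberships, 2))
```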
As described in the Introduction to this volume (Chapter 28), neural networks
can be used to carry out certain tasks of supervised or unsupervised learning. In
particular, Kohonen mapping is related to clustering. It will be explained in more
detail in Chapter 44.

30.3.4 Selecting clusters

30.3.4.1 Measures for clustering tendency


Instead of carrying out the actual cluster analysis to find out whether there is
structure in the data, one might wonder if it is useful to do so and try to measure
clustering tendency. Hopkins' statistic and modifications of it have been described
in the literature [34,35]. The original procedure is based on the fact that if there is a
clustering tendency, distances between points and their nearest neighbour will tend
to be smaller than distances between randomly selected artificial points in the same
experimental domain and their nearest neighbour. The method consists of the
following steps (see also Fig. 30.17):
- select at random a small number (for example 5%) of the real data points;
- compute the distance d_i to the nearest neighbouring real data point for each selected data point i;
- generate at random an equal number of artificial points in the area studied;
- compute the distance u_j to the nearest real data point for each artificial point;
- determine H = Σ u_j / (Σ u_j + Σ d_i);
- perform a statistical test to determine whether H is significantly larger than 0.5.
If there is no structure, Σ u_j will be more or less equal to Σ d_i, and H will not be far from 0.5. If there is a clustering tendency, then Σ u_j > Σ d_i and H will be higher than 0.5.
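A sketch of this procedure for a two-dimensional data set is given below; the bounding box of the data as sampling window and the 5% sampling fraction follow the description above, but the implementation details (function name, random generator, test data) are our own.

```python
import numpy as np

def hopkins(X, frac=0.05, seed=0):
    """Hopkins statistic: H near 0.5 means no structure, H >> 0.5 suggests clustering."""
    rng = np.random.default_rng(seed)
    n = max(1, int(frac * len(X)))

    # d_i: distance of each marked real point to its nearest other real point
    marked = X[rng.choice(len(X), size=n, replace=False)]
    d = [np.sort(np.linalg.norm(X - p, axis=1))[1] for p in marked]

    # u_j: distance of each artificial uniform point to the nearest real point
    art = rng.uniform(X.min(axis=0), X.max(axis=0), size=(n, X.shape[1]))
    u = [np.linalg.norm(X - q, axis=1).min() for q in art]

    return sum(u) / (sum(u) + sum(d))

# A clearly clustered cloud should give a larger H than a uniform cloud.
rng = np.random.default_rng(1)
clustered = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(10, 1, (100, 2))])
uniform = rng.uniform(-3, 13, (200, 2))
print(hopkins(clustered), hopkins(uniform))
```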

Fig. 30.17. Square symbols are the actual objects and circled squares are the marked objects. Open cir-
cles are artificial points (adapted from Ref. [34]).

30.3.4.2 How many clusters?


In hierarchical clustering one can obtain any number of clusters K, 1 ≤ K ≤ n, by cutting the appropriate number of links in the dendrogram and this is true also for non-hierarchical clustering, with the difference that there K is defined a priori by the user. The question then arises which K-clustering is significant. To introduce the problem let us first consider a technique that was proposed for the non-hierarchical method MASLOC [27], which selects so-called robust clusters.
This is explained with Fig. 30.18. If one applies a non-hierarchical procedure, one asks for a certain number, K, of clusters. In the present case, one could ask for K = 1, 2, ..., 9 clusters and one would obtain the optimal list of K-clusters for each
K. Neither the 2- nor the 4-clustering are really important. The important result is
the 3-clustering. However, this is not known beforehand but can be derived if the
MASLOC method is used. The following reasoning is applied. In Fig. 30.18, it can
be seen that E, F and D are separated when one asks for the 2-clustering. When one
asks for the 3-clustering, one of the clusters obtained is EDF. Now, in going from
the 2-clustering to the 3-clustering, it could be expected that one of the two clusters
would fall apart to give three clusters. Since E, F and D are joined together in the
3-clustering, this is not the case and one concludes that something unnatural has
happened. The 2-clustering is mathematically optimal but not meaningful. When
going from the 3-clustering to higher ones, the same phenomenon is not encoun-
tered: the objects separated in this clustering are never joined again. The 3-clusters
are therefore meaningful or "robust".
Many procedures for selecting a K-clustering have been described, none of them really satisfying. The most often used one is to plot some characteristic, such


Fig. 30.18. Nine objects, A-I, are clustered. K = 2 and K = 3 clusterings are obtained. The 2-clustering is correct but not relevant. The 3-clustering is meaningful and robust.

as the similarity value at which links are cut or the optimization criterion of Section 30.3.3 as a function of K. Large steps in that function occur when a significant K-clustering is encountered. An information theoretical procedure is described in
[10]. Another approach to select clusters is to validate them using a supervised
pattern recognition technique (see Chapter 33), such as linear discriminant anal-
ysis (LDA) [36]. Cross-validation can be applied to decide whether the LDA does
separate the clusters, in which case they are considered to be significant.

30.3.5 Conclusion

The result of the clustering procedure depends on which procedure is applied and on the similarity measures used. Each gives a different view of the complex reality in the data set. It is therefore highly recommended that a clustering method is combined with a PCA or PLS display (see Chapters 17, 31 and 35) and, if
possible, that several clustering methods and several types of similarity are used.

References

1. P.H. A. Sneath and R.R. Sokal, Numerical Taxonomy. The Principles and Practice of Numeri-
cal Classification. Freeman, San Francisco, CA, 1973.
2. G.C. Mead, A.P. Norris and N. Bratchell, Differentiation of Staphylococcus aureus from
freshly slaughtered poultry and strains 'endemic' to processing plants by biochemical and
physiological tests. J. Appl. Bacteriol., 66 (1989) 153-159.
3. B.S. Everitt, Cluster Analysis. Heineman, London, 1981.
4. L. Kaufman and P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis.
Wiley, New York, 1990.
5. D.L. Massart and L. Kaufman, Interpretation of Analytical Data by the Use of Cluster Analy-
sis. Wiley, New York, 1983.
6. N. Bratchell, Cluster analysis. Chemom. Intell. Lab. Syst., 6 (1989) 105-125.
7. D.L. Massart, L. Kaufman and K.H. Esbensen, Hierarchical non-hierarchical clustering strat-
egy and application to classification of iron-meteorites according to their trace element pat-
terns. Anal. Chem., 54 (1982) 911-917.
8. R.G. Lawson and P.C. Jurs, Cluster analysis of acrylates to guide sampling for toxicity testing.
J. Chem. Inf. Comput. Sci., 30 (1990) 137-144.
9. R.D. Brown and Y.C. Martin, Use of structure-activity data to compare structure-based clus-
tering methods and descriptors for use in compound selection. J. Chem. Inf. Comput. Sci., 36
(1996) 572-584.
10. I. Bondarenko, H. Van Malderen, B. Treiger, P. Van Espen and R. Van Grieken, Hierarchical
cluster analysis with stopping rules built on Akaike's information criterion for aerosol particle
classification based on electron probe X-ray microanalysis. Chemom. Intell. Lab. Syst., 22
(1994) 87-95.
11. L.X. Sun, F. Xu, Y.Z. Liang, Y.L. Xie and R.Q. Yu, Cluster analysis by the K-means algorithm
and simulated annealing. Chemom. Intell. Lab. Syst., 25 (1994) 51-60.
12. E. Marengo, C. Baiocchi, M.C. Gennaro, P.L. Bertolo, S. Lanteri and W. Garrone, Classifica-
tion of essential mint oils of different geographic origin by applying pattern recognition meth-
ods to gas chromatographic data. Chemom. Intell. Lab. Syst., 11 (1991) 75-88.
13. A. de Juan, G. Fonrodona and E. Casassas, Solvent classification based on solvatochromic pa-
rameters: a comparison with the Snyder approach. Trends Anal. Chem., 16(1) (1997) 52-62.
14. T. Naes, The design of calibration in near infra-red reflectance analysis by clustering. J.
Chemom., 1 (1987) 129-134.
15. P.C. Mahalanobis, On the Generalized Distance in Statistics. Proc. Nat. Inst. Sci. (India), 12 (1936) 49-55.
16. P. Dagnelie and A. Merckx, Using generalized distances in classification of groups. Biom. J.,
33 (1991) 683-695.
17. J.H. Ward Jr., Hierarchical grouping to optimize an objective function. J. Am. Stat. Assoc, 58
(1963) 236-244.
18. D.L. Massart and L. Kaufman, Operations research in analytical chemistry. Anal. Chem., 47
(1975) 1244A-1253A.
19. J.B. Kruskal Jr., On the shortest spanning subtree of a graph and the traveling salesman prob-
lem. Proc. Am. Math. Soc, 7 (1956) 48-50.
20. L.R. Snyder, Classification of the solvent properties of common liquids. J. Chrom. Sci., 16
(1978) 223-234.
21. P. MacNaughton-Smith, W.T. Williams, M.B. Dale and H.T. Clifford. Computer J., 14 (1972)
162.

22. A. Thielemans, M.P. Derde and D.L. Massart, CLUE. Elsevier Scientific Software, Elsevier,
Amsterdam, 1985.
23. J. Zupan, A new approach to binary tree-based heuristics. Anal. Chim. Acta, 122 (1980)
337-346.
24. J. Zupan and D.L. Massart, Application of the three-distance clustering method in analytical
chemistry. Anal. Chem., 61 (1989) 2098-2102.
25. E.W. Forgy, Cluster analysis of multivariate data: efficiency versus interpretability of classifi-
cations. Biometrics, 21 (1965) 768.
26. J. MacQueen, Some methods for classification and analysis of multivariate observations. In: L.
Le Cam and J. Neyman (eds.). Proceedings 5th Berkeley Symposium on Mathematical Statis-
tics and Probability, University of California Press, Berkeley, CA, 1967, pp. 281-297.
27. D.L. Massart, F. Plastria and L. Kaufman, Non-hierarchical clustering with MASLOC. Pattern
Recogn., 16 (1983) 507-516.
28. P.J. Rousseeuw, Silhouettes: a graphical aid to the interpretation and validation of cluster anal-
ysis. J. Comput. Applied. Math., 20 (1987) 53-65.
29. D. Wishart, Mode analysis. In: A.J. Cole (ed.). Numerical Taxonomy. Academic Press, Lon-
don, 1969, pp. 282-308.
30. N. Jardine and R. Sibson, The construction of hierarchic and non-hierarchic classifications.
Computer J., 11 (1968) 177-184.
31. D. Coomans and D.L. Massart, Potential methods in pattern recognition. Part 2. CLUPOT: an
unsupervised pattern recognition technique. Anal. Chim. Acta., 133 (1981) 225-239.
32. E. Ruspini, Numerical methods for fuzzy clustering. Inf. Sci., 2 (1970) 319-350.
33. Y. Chtioui, D. Bertrand, D. Barba and Y. Dattee, Application of fuzzy C-means clustering for
seed discrimination by artificial vision. Chemom. Intell. Lab. Syst., 38 (1997) 75-87.
34. B. Hopkins, A new method for determining the type of distribution of plant individuals. Ann.
Bot., 18 (1954) 213-227.
35. J. Jurs and R.G. Lawson, New index for clustering tendency and its application to chemical
problems. Chemom. Intell. Lab. Syst., 10 (1991) 81-83.
36. E. Marengo and R. Todeschini, Linear discriminant hierarchical clustering: a modeling and
cross-validable divisive clustering method. Chemom. Intell. Lab. Syst., 19 (1993) 43-51.

Additional reading

G.N. Lance and W.T. Williams, A general theory of classificatory sorting strategies. 1. Hierarchical
systems, Comput. J., 9 (1967) 373-380.
P. Willett and V. Winterman, A comparison of some measures for the determination of inter-molecu-
lar structural similarity. Measures of inter-molecular structural similarity. Quant. Struct.- Act.
Relat., 5 (1986) 18-25.

Chapter 31

Analysis of Measurement Tables

Introduction

Measurement tables are the raw data that result from measurements on a set of
objects. For the sake of simplicity we restrict our arguments to measurements
obtained by means of instruments on inert objects, although they equally apply to
sensory observations and to living subjects. By convention, a measurement table is
organized such that its rows correspond to objects (e.g. chemical substances) and
that its columns refer to measurements (e.g. physicochemical parameters). Here
we adopt the point of view that objects are described in the table by means of the
measurements performed upon them. Objects and measurements will also be
referred to in a more general sense as row-variables and column-variables.
A heterogeneous table contains measurements that are defined with different
units; for example, chromatographic retention time in seconds, molecular volume
in nm³, biological activity in mg substance per kg body weight, partition coefficient
(dimensionless), etc. In a homogeneous table, all measurements are expressed in
the same unit, such as retention times obtained with various chromatographic
methods in seconds, or biological activities from a battery of tests in mg substance
per kg body weight, etc.
A special type of homogeneous measurements is found in a compositional table
which describes chemical samples by means of the relative concentrations of their
components. By definition, relative concentrations in each row of a compositional
table add up to unity or to 100%. Such a table is said to be closed with respect to the
rows. In general, closure of a table results when its rows or columns add up to a
constant value. This operation is only applicable to homogeneous tables. Yet
another type of homogeneous table arises when the rows or columns can be
ordered according to a physical parameter, such as in a table of spectroscopic
absorptions by chemical samples obtained at different wavelengths.
Multivariate analysis of these different types of measurements (heterogeneous,
homogeneous, compositional, ordered) may require special approaches for each of
them. For example, compositional tables that are closed with respect to the rows,
require a different type of analysis than heterogeneous tables where the columns
are defined with different units. The basic approach of principal components
analysis, however, can be very helpful in understanding the overall structure of the
data, i.e. its dimensionality, correlations between measurements, clustering of
objects, specificities of objects for certain measurements, patterns of objects
correlating with external parameters, and much more. The objective of this chapter
is to discuss principal components analysis in relation to measurement tables. An
introduction to principal components analysis has already been presented in
Chapter 17, the notation of which is followed here as closely as possible, with two
exceptions which will be indicated where they first occur.
A measurement table is different from a contingency table. The latter results
from counting the number of objects that belong simultaneously to various cat-
egories of two measurements (e.g. molar refractivity and partition coefficient of
chemical compounds). It is also called a two-way table or cross-tabulation, as the
total number of objects is split up in two ways according to the two measurements
that are crossed with one another. The analysis of contingency tables is dealt with
specifically in Chapter 32.
Most of the algebra of vectors and matrices that is used in this chapter has been
explained in Chapters 9 and 29. Small discrepancies between the tabulated values
in the examples and their exact values may arise from rounding of intermediate
results.

31.1 Principal components analysis

A first introduction to principal components analysis (PCA) has been given in Chapter 17. Here, we present the method from a more general point of view, which
encompasses several variants of PCA. Basically, all these variants have in com-
mon that they produce linear combinations of the original columns in a measure-
ment table. These linear combinations represent a kind of abstract measurements
or factors that are better descriptors for structure or pattern in the data than the
original measurements [1]. The former are also referred to as latent variables [2],
while the latter are called manifest variables. Often one finds that a few of these
abstract measurements account for a large proportion of the variation in the data. In
that case one can study structure and pattern in a reduced space which is possibly
two- or three-dimensional.
Historically, a distinction has been made between PCA of column-variables and
that of row-variables. These are referred to as R-mode or Q-mode PCA, respect-
ively. The modem approach is to consider both analyses as dual and to unify the
two views (of rows and columns) into a single display, which is called biplot and
which will be discussed in greater detail later on.

31.1.1 Singular vectors and singular values

Let us suppose that we have a measurement table X with n rows and p columns. Each element x_ij of the table then represents the value of the jth measurement on the ith object, where i ranges from 1 to n and where j ranges from 1 to p. In contrast with the notation in Chapter 17, we use the symbol p instead of m to denote the number of columns in a data table in order to avoid a conflict with the symbol m which we reserve for denoting means, later on in this chapter.
An important theorem of matrix algebra, called singular value decomposition
(SVD), states that any nxp table X can be written as the matrix product of three
terms U, A and V:

X = U Λ V^T   or   Λ = U^T X V        (31.1a)

where U is an n×r column-orthonormal matrix, where V is a p×r column-orthonormal matrix and where Λ is an r×r diagonal matrix. The dimension r can be at most equal to the smaller of the dimensions n or p. The elements on the diagonal of Λ are defined as positive values, while the values in the off-diagonal positions are zero. An equivalent notation of SVD using the column-vectors of U and V, and using the diagonal elements of Λ, is in the form:

X = λ_1 u_1 v_1^T + λ_2 u_2 v_2^T + ... + λ_r u_r v_r^T = Σ_{k=1}^{r} λ_k u_k v_k^T        (31.1b)

Orthonormality implies that:

U^T U = V^T V = I_r        (31.2)

where I_r is the r×r identity matrix. This means that the columns in U and V are normalized to unit sums of squares and that their sums of cross-products are zero. Vectors that possess these properties are said to be normalized and orthogonal, hence orthonormal.
It can be proved that the decomposition is always possible and that the solution
is unique (except for the algebraic signs of the columns of U and V) [3]. Singular
value decomposition of a rectangular table is an extension of the classical work of
Eckart and Young [4] on the decomposition of matrices. The decomposition of X
into U, V and Λ is illustrated below using a 4×3 data table which has been adapted from Van Borm [5]. (A similar example has been used for the introduction to PCA in Chapter 17.)
The data in Table 31.1 represent the concentrations of the trace elements Na, Cl and Si (columns) in atmospheric samples that have been collected at the prevailing wind directions of 0, 90, 180 and 270 degrees (rows).

TABLE 31.1
Concentrations of trace elements in atmospheric samples at prevailing wind directions

       Na     Cl     Si
0      0.212  0.399  0.190
90     0.072  0.133  0.155
180    0.036  0.063  0.213
270    0.078  0.141  0.273

The SVD of the table of concentrations of trace elements in atmospheric samples yields (in accordance with Section 17.6):

            | 0.753   0.618 |
            | 0.343  -0.127 |   | 0.626  0     |   | 0.371  0.690   0.622 |
U Λ V^T  =  | 0.302  -0.567 | × | 0      0.214 | × | 0.280  0.556  -0.783 |
            | 0.473  -0.529 |

            | 0.212  0.399  0.190 |
            | 0.072  0.133  0.155 |
         =  | 0.036  0.063  0.213 |  =  X
            | 0.078  0.141  0.273 |

with the orthonormality condition for the singular vectors of the wind directions:

          |  0.753   0.343   0.302   0.473 |   | 0.753   0.618 |   | 1  0 |
U^T U  =  |  0.618  -0.127  -0.567  -0.529 | × | 0.343  -0.127 | = | 0  1 |  =  I_2
                                               | 0.302  -0.567 |
                                               | 0.473  -0.529 |

and with the orthonormality condition for the singular vectors of the trace elements:

          | 0.371  0.690   0.622 |   | 0.371   0.280 |   | 1  0 |
V^T V  =  | 0.280  0.556  -0.783 | × | 0.690   0.556 | = | 0  1 |  =  I_2
                                     | 0.622  -0.783 |

Note that, although X has dimensions 4x3, only 2 nonzero eigenvalues can be
extracted. This is the result of a linear dependency among the columns of X which
takes the form:
x^^i =0.5245x^.^2 +0.01437x^.^3
91

or, equivalently, in terms of the concentrations of Na, Cl and Si:

[Na] = 0.5245 [Cl] + 0.01437 [Si]

The above relation holds within the limited precision of our calculations of the singular values, which in the present example is about four significant digits.
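The decomposition can be checked with a few lines of NumPy; the sketch below (not part of the original text) computes the SVD of the data of Table 31.1 and verifies the rank and the orthonormality conditions. The signs of the columns of U and V returned by the routine may differ from those printed above.

```python
import numpy as np

# Data of Table 31.1: rows = wind directions (0, 90, 180, 270); columns = Na, Cl, Si.
X = np.array([[0.212, 0.399, 0.190],
              [0.072, 0.133, 0.155],
              [0.036, 0.063, 0.213],
              [0.078, 0.141, 0.273]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(np.round(s, 3))                         # approx. [0.626, 0.214, 0]: rank 2
print(np.allclose(U @ np.diag(s) @ Vt, X))    # U Lambda V^T restores X
print(np.allclose(U.T @ U, np.eye(3)),        # columns of U are orthonormal
      np.allclose(Vt @ Vt.T, np.eye(3)))      # columns of V are orthonormal
```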
In the context of SVD, U is called the matrix of row-singular vectors, V the matrix of column-singular vectors and Λ denotes the diagonal matrix of associated singular values. U and V are also referred to as the matrix of left-singular vectors and the matrix of right-singular vectors, respectively. As we will discuss in more detail in Section 31.2 on geometrical interpretation, the kth column of U defines the kth singular vector in row-space S^n. Similarly, the kth column of V defines the kth singular vector in column-space S^p. Since singular vectors in U and V are orthonormal, they define an orthogonal system of basis vectors in each of the dual spaces S^n and S^p. According to the previous definition in Section 9.2.1, we define row-space S^n as the coordinate space in which the p columns of an n×p matrix X can be represented as a pattern P^p of p points. Similarly, column-space S^p is the coordinate space in which the n rows of X can be represented as a pattern P^n of n points. These concepts are explained in greater detail in Chapter 29. Note that corresponding columns of U, V and Λ refer to the same singular vectors. The matrix of singular values is denoted here by the symbol Λ instead of the symbol W, which is used in Chapter 17, in order to maintain consistency with the notation Λ² introduced in Chapter 29 for the matrix of eigenvalues.

31.1.2 Eigenvectors and eigenvalues

The number of singular vectors r is at most equal to the smallest of the number of rows n or the number of columns p of the data table X. For the sake of simplicity we will assume here that p is smaller than n, which is most often the case with measurement tables. Hence, we can state here that r is at most equal to p, or equivalently that r ≤ p < n. The value of r is called the rank of X. It represents the number of independent measurements in X. Independent measurements are those that cannot be expressed as a linear combination or weighted sum of the other variables.
Using eqs. (31.1) and (31.2) we can readily prove that:

C_n = X X^T = U Λ V^T V Λ U^T = U Λ² U^T
C_p = X^T X = V Λ U^T U Λ V^T = V Λ² V^T        (31.3)

where C_p is the p×p square symmetric matrix of cross-products of the columns of X, and where C_n is the n×n square symmetric matrix of cross-products of the rows of X.

Equation (31.3) defines the eigenvalue decomposition (EVD), also referred to as spectral decomposition, of a square symmetric matrix. The orthonormal matrices U and V are the same as those defined above with SVD, apart from the algebraic sign of the columns. As pointed out already in Section 17.6.1, the diagonal matrix Λ² can be derived from Λ simply by squaring the elements on the main diagonal of Λ.
The decompositions of C_n and C_p into U, V and Λ² are illustrated in the following example, which makes use of the previously computed results. For the cross-products between wind directions, we obtain:

             | 0.753   0.618 |
             | 0.343  -0.127 |   | 0.392  0     |   | 0.753   0.343   0.302   0.473 |
U Λ² U^T  =  | 0.302  -0.567 | × | 0      0.046 | × | 0.618  -0.127  -0.567  -0.529 |
             | 0.473  -0.529 |

             | 0.240  0.098  0.073  0.125 |
             | 0.098  0.047  0.044  0.067 |
          =  | 0.073  0.044  0.051  0.070 |  =  C_n
             | 0.125  0.067  0.070  0.100 |

For the cross-products between trace elements we find that:

             | 0.371   0.280 |   | 0.392  0     |   | 0.371  0.690   0.622 |
V Λ² V^T  =  | 0.690   0.556 | × | 0      0.046 | × | 0.280  0.556  -0.783 |
             | 0.622  -0.783 |

             | 0.058  0.107  0.080 |
          =  | 0.107  0.200  0.148 |  =  C_p
             | 0.080  0.148  0.180 |

In the context of EVD, U is called the matrix of row-eigenvectors, V the matrix of column-eigenvectors and Λ² the diagonal matrix of (associated) eigenvalues. Singular vectors and eigenvectors are identical, up to an algebraic sign, and the associated eigenvalues are the squares of the corresponding singular values. Eigenvectors in U and V are computed independently from the cross-product matrices C_n and C_p, and hence their algebraic sign is arbitrary. Singular vectors, however, are computed simultaneously from the data matrix X, and hence they always appear with the appropriate algebraic sign. If the sign of a particular

column of U happens to be reversed as a result of the calculation, then the sign of the corresponding column of V will be reversed also automatically.
An alternative way of defining EVD is as follows:

U^T C_n U = V^T C_p V = Λ²        (31.4)

which can be obtained by multiplying both sides of eq. (31.3) with U and U^T, or V and V^T, and applying the orthogonality conditions (eq. (31.2)). These operations show that the square symmetric matrix C_n is diagonalized by the orthonormal matrix U into the diagonal matrix Λ². Diagonalization is the operation by which a square matrix is transformed into a diagonal one. Likewise, the square symmetric matrix C_p is diagonalized by the orthonormal matrix V into the same diagonal matrix Λ².
If u_1 and v_1 represent the eigenvectors corresponding to the largest eigenvalue λ_1², then according to eq. (31.4) we have:

u_1^T C_n u_1 = v_1^T C_p v_1 = λ_1²        (31.5a)

where u_1 and v_1 are such that λ_1² is maximal [6]. By virtue of the orthonormality of eigenvectors we can rewrite the above expressions in the forms:

C_n u_1 = λ_1² u_1   or   C_p v_1 = λ_1² v_1        (31.5b)

The same applies to the other eigenvectors u_2 and v_2, etc., with additional constraints of orthonormality of u_1, u_2, etc. and of v_1, v_2, etc. By analogy with eq. (31.5b) it follows that the r eigenvalues in Λ² must satisfy the system of linear homogeneous equations:

(C_n − λ_k² I_n) u_k = 0,   with k = 1, ..., r        (31.6a)

or

(C_p − λ_k² I_p) v_k = 0

where I_n represents the n×n identity matrix and where I_p is the p×p identity matrix [7].
For the sake of completeness we mention here an alternative definition of eigenvalue decomposition in terms of a constrained maximization problem which can be solved by the method of Lagrange multipliers:

max [u_1^T C_n u_1 − λ_1² (u_1^T u_1 − 1)]        (31.6b)

or

max [v_1^T C_p v_1 − λ_1² (v_1^T v_1 − 1)]

where λ_1² represents the Lagrange multiplier associated with the first eigenvectors u_1 and v_1. The above Lagrange expressions define that the sum of squares of the projections of X^T and X upon the normalized vectors u_1 and v_1 must be maximal. Differentiation of these expressions with respect to the unknown u_1 and v_1, and equating the results to zero, leads to the expressions in eq. (31.5b). Similar Lagrange expressions as in eq. (31.6b) can be derived for the other eigenvectors u_2 and v_2, etc., with additional constraints of orthonormality of u_1, u_2, etc. and of v_1, v_2, etc. Differentiation with respect to the unknown eigenvectors then leads to the system of linear homogeneous equations of eq. (31.6a).
A necessary condition for obtaining non-trivial solutions for u_k and v_k in eq. (31.6a) is that the determinants of the coefficient matrices must be zero, which results in the so-called characteristic equation:

|C_n − λ_k² I_n| = 0   with k = 1, ..., r        (31.7)

or

|C_p − λ_k² I_p| = 0

The determinants can be developed into a polynomial equation of degree r, of which the r positive roots are the eigenvalues λ_k², where r ≤ p, and provided that p ≤ n as we have assumed above. (The derivation of these properties is given in Section 29.6.) More practical methods for computing the eigenvalues will be discussed in Section 31.4 on algorithms.
A property of EVD is that the traces of C_n, C_p and Λ² are the same and that they are equal to the global sum of squares c of the elements of X:

Σ_{k=1}^{r} λ_k² = tr(Λ²) = tr(C_n) = tr(C_p) = Σ_{i=1}^{n} Σ_{j=1}^{p} x_ij² = c        (31.8)

where trace (tr) means the sum of the elements on the main diagonal of a square matrix. Note that the global sum of squares c is denoted by the symbol SS_t in other chapters, where it is referred to as the total sum of squares.
From the numerical results in the previous example we find that the global sum of squares of the concentrations of all trace elements in all wind directions amounts to:

tr(Λ²) = 0.392 + 0.046 = 0.438

tr(C_n) = 0.240 + 0.047 + 0.051 + 0.100 = 0.438

tr(C_p) = 0.058 + 0.200 + 0.180 = 0.438
Each eigenvector u_k or v_k contributes an amount λ_k² to the global sum of squares c of X. Hence, eigenvectors can be ranked according to their contributions to c. From now on we assume that the columns in U and V are arranged in decreasing order of their contributions.
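The same relations can be checked numerically by diagonalizing the cross-product matrices directly; the sketch below (our own illustration) compares the eigenvalues of C_p with the squared singular values of X and verifies eq. (31.8). Note that np.linalg.eigh returns the eigenvalues in increasing order, so they are reversed here.

```python
import numpy as np

X = np.array([[0.212, 0.399, 0.190],
              [0.072, 0.133, 0.155],
              [0.036, 0.063, 0.213],
              [0.078, 0.141, 0.273]])

Cn = X @ X.T                                   # 4 x 4 cross-products (wind directions)
Cp = X.T @ X                                   # 3 x 3 cross-products (trace elements)

eigenvalues = np.linalg.eigh(Cp)[0][::-1]      # largest eigenvalue first
s = np.linalg.svd(X, compute_uv=False)

print(np.round(eigenvalues, 3))                # approx. [0.392, 0.046, 0]
print(np.allclose(eigenvalues, s**2))          # eigenvalues = squared singular values
print(np.round([np.trace(Cn), np.trace(Cp), np.sum(s**2)], 3))   # all about 0.438
```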

31.1.3 Latent vectors and latent values

In order to avoid confusion we will in this chapter consistently use the term
latent vector or latent variable as being synonymous with singular vector and with
eigenvector. Similarly, we will use the term latent value when referring to singular
value or to the square root of eigenvalue. In the literature, latent vectors are also
indicated by means of the term factor or component. The process of extracting
latent vectors from a matrix of cross-products according to eqs. (31.3) or (31.4) is
called factorization. In a broad sense, factor analysis designates a class of methods of multivariate data analysis which involve factorization, although, historically, the name factor analysis was first proposed for the analysis of tables of
correlation coefficients [8]. This subject is treated extensively in Chapter 34.

31.1.4 Scores and loadings

We have seen above that the r columns of U represent r orthonormal vectors in row-space S^n. Hence, the r columns of U can be regarded as a basis of an r-dimensional subspace S^r of S^n. Similarly, the r columns of V can be regarded as a basis of an r-dimensional subspace S^r of column-space S^p. We will refer to S^r as the factor space which is embedded in the dual spaces S^n and S^p. Note that r ≤ p ≤ n according to our convention. The concepts of row-, column-, and factor-spaces will be more fully developed in the next section.
We define the coordinates for the n rows of X in factor space S^r by means of:

S = U Λ^α        (31.9)

where the factor scaling coefficient α usually takes the value 0, 0.5 or 1. The matrix S is an n×r orthogonal matrix, called the score matrix, where r ≤ p ≤ n, according to our previous convention. Orthogonality of S follows from the orthonormality of U as expressed in eq. (31.2):

S^T S = Λ^α U^T U Λ^α = Λ^{2α}        (31.10)

where Λ^{2α} is a diagonal matrix obtained by raising the diagonal elements of Λ to the power 2α.

In the same way, we define the coordinates for the p columns of X in factor-space S^r:

L = V Λ^β        (31.11)

where the factor scaling coefficient β is usually assigned the value 0, 0.5 or 1. The matrix L is a p×r orthogonal matrix, called the loading matrix. Orthogonality of L can be derived from the orthonormality of V, using eq. (31.2):

L^T L = Λ^β V^T V Λ^β = Λ^{2β}        (31.12)

where Λ^{2β} is a diagonal matrix obtained by raising the diagonal elements of Λ to the power 2β.
A traditional notation in chemometrics for SVD defines scores and loadings by means of the symbols T and P such that X = T P^T, which is equivalent to X = U Λ V^T, where T = U Λ and P = V. This notation corresponds with the case α = 1 and β = 0, which is the most frequently used combination of factor scaling coefficients in chemometrics.
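In this T, P notation the scores and loadings of the Table 31.1 example are obtained in a few lines; the sketch below (our own illustration) uses the asymmetric choice α = 1, β = 0, and the columns may again differ in sign from the printed tables.

```python
import numpy as np

X = np.array([[0.212, 0.399, 0.190],
              [0.072, 0.133, 0.155],
              [0.036, 0.063, 0.213],
              [0.078, 0.141, 0.273]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)
r = 2                                   # number of nonzero singular values
T = U[:, :r] * s[:r]                    # scores  S = U Lambda  (alpha = 1)
P = Vt[:r].T                            # loadings L = V        (beta = 0)

print(np.round(T, 3))                   # coordinates of the wind directions
print(np.round(P, 3))                   # coordinates of the trace elements
print(np.allclose(T @ P.T, X, atol=1e-3))   # S L^T reconstructs X (alpha + beta = 1)
```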

31.1.5 Principal components

In the special case where α = 1 we can express the score matrix S in the form:

S = U Λ = X V        (31.13)

as follows from eqs. (31.1) and (31.2).
Each element of S can be regarded as a scalar product:

s_ik = x_i^T v_k = Σ_{j=1}^{p} x_ij v_jk   with i = 1, ..., n and k = 1, ..., r        (31.14a)

where the vector x_i represents the ith row of X and where the vector v_k represents the kth column of V.
Each column of S represents a row-principal component of X and can be interpreted as a linear combination of the columns of X using the elements of V as weighting coefficients:

s_k = X v_k = Σ_{j=1}^{p} x_j v_jk   with k = 1, ..., r        (31.14b)

In the case of α = 1 we compute the scores S for the wind directions in the current example of concentrations of trace elements in atmospheric samples as follows:

          | 0.753   0.618 |
          | 0.343  -0.127 |   | 0.626  0     |
S = U Λ = | 0.302  -0.567 | × | 0      0.214 |
          | 0.473  -0.529 |

          | 0.212  0.399  0.190 |   | 0.371   0.280 |   | 0.472   0.132 |
          | 0.072  0.133  0.155 |   | 0.690   0.556 |   | 0.215  -0.027 |
  = X V = | 0.036  0.063  0.213 | × | 0.622  -0.783 | = | 0.189  -0.122 |
          | 0.078  0.141  0.273 |                       | 0.296  -0.113 |

In Fig. 31.1a these scores are used as the coordinates of the four wind directions in 2-dimensional factor-space. From this so-called score plot one observes a large degree of association between the wind directions of 90, 180 and 270 degrees, while the one at 0 degrees stands out from the others.
When α = 0.5 we obtain somewhat different values for the scores in S:

              | 0.753   0.618 |                     | 0.596   0.286 |
              | 0.343  -0.127 |   | 0.791  0     |  | 0.271  -0.059 |
S = U Λ^1/2 = | 0.302  -0.567 | × | 0      0.462 | = | 0.239  -0.262 |
              | 0.473  -0.529 |                     | 0.374  -0.244 |

Finally, when α = 0 then S is equal to U according to eq. (31.9).
In the same way, we consider the special case where β = 1, which yields for the loading matrix L the expression:

L = V Λ = X^T U        (31.15)

by virtue of eqs. (31.1) and (31.2).
Each element of L can be written as a scalar product:

l_jk = x_j^T u_k = Σ_{i=1}^{n} x_ij u_ik   with j = 1, ..., p and k = 1, ..., r        (31.16a)

where the vector x_j represents the jth column of X and where the vector u_k represents the kth column of U.
Each column of L represents also a column-principal component of X and can be regarded as a linear combination of the rows of X using the elements of U as weighting coefficients:

l_k = X^T u_k = Σ_{i=1}^{n} x_i u_ik   with k = 1, ..., r        (31.16b)

Fig. 31.1. (a) Score plot in which the distances between representations of rows (wind directions) are reproduced. The factor scaling coefficient α equals 1. Data are listed in Table 31.1. (b) Loading plot in which the distances between representations of columns (trace elements) are preserved. The factor scaling coefficient β equals 1. Data are defined in Table 31.1.

The columns of the loading matrix L contain the principal components of X in column-space S^p.
In the case of β = 1, one calculates the loadings L of the three trace elements of the current example as follows:

          | 0.371   0.280 |   | 0.626  0     |
L = V Λ = | 0.690   0.556 | × | 0      0.214 |
          | 0.622  -0.783 |

                                             | 0.753   0.618 |
            | 0.212  0.072  0.036  0.078 |   | 0.343  -0.127 |   | 0.232   0.060 |
  = X^T U = | 0.399  0.133  0.063  0.141 | × | 0.302  -0.567 | = | 0.432   0.119 |
            | 0.190  0.155  0.213  0.273 |   | 0.473  -0.529 |   | 0.390  -0.168 |

These loadings have been used for the construction of the so-called loading plot in Fig. 31.1b which shows the positions of the three trace elements in 2-dimensional factor-space. The elements Na and Cl are clearly related, while Si takes a position of its own in this plot.
When β = 0.5 we obtain:

              | 0.371   0.280 |   | 0.791  0     |   | 0.292   0.129 |
L = V Λ^1/2 = | 0.690   0.556 | × | 0      0.462 | = | 0.546   0.257 |
              | 0.622  -0.783 |                      | 0.492  -0.362 |

Finally, when β = 0 then L is equal to V as follows from eq. (31.11).


Summarizing, we have obtained two sets of principal components S and L (for a particular α and β) from one and the same data table X. In the next section on the geometrical interpretation of principal components, we will show in a more general way that S contains the coordinates of the rows of X in factor space and that L contains the coordinates of the columns of X in the same factor space. Sometimes one refers to U and V as the principal components of X. This is also legitimate when one refers to the special case where α = β = 0, as can be readily seen from eqs. (31.9) and (31.11).

31.1.6 Transition formulae

The relationships between scores, loadings and latent vectors can be written in a compact way by means of the so-called transition formulae:

S = X V
U = S Λ^{-1}     (from eq. (31.13))        (31.17)

L = X^T U
V = L Λ^{-1}     (from eq. (31.15))

In this way, given U one can compute V, and vice versa. These transition formulae are the basis for the calculation of U and V by the so-called NIPALS algorithm which will be explained in Section 31.4.

31.1.7 Reconstructions

In the previous subsection, we have described S and L as containing the coordinates of the rows and columns of a data table in factor-space. Below we show that, in some cases, it is possible to graphically reconstruct the data table and the two cross-product matrices derived from it. It is not possible, however, to reconstruct at the same time the data and all the cross-products, as will be seen. We distinguish between three types of reconstructions.
In the case where α equals 1 we can reconstruct the diagonalized cross-product matrix Λ²:

S^T S = Λ U^T U Λ = Λ²        (31.18)

and the cross-product matrix C_n:

S S^T = U Λ Λ U^T = U Λ² U^T = C_n        (31.19)

using eqs. (31.2), (31.3) and (31.13).
In the current example using α = 1, we can reconstruct the sums of cross-products C_n between the wind directions from their scores S:

        | 0.472   0.132 |
        | 0.215  -0.027 |   | 0.472   0.215   0.189   0.296 |
S S^T = | 0.189  -0.122 | × | 0.132  -0.027  -0.122  -0.113 |
        | 0.296  -0.113 |

        | 0.240  0.098  0.073  0.125 |
        | 0.098  0.047  0.044  0.067 |
      = | 0.073  0.044  0.051  0.070 |  =  C_n
        | 0.125  0.067  0.070  0.100 |

In the case where β equals 1 we can reconstruct the diagonalized cross-product matrix Λ²:

L^T L = Λ V^T V Λ = Λ²        (31.20)

and the cross-product matrix C_p:

L L^T = V Λ Λ V^T = V Λ² V^T = C_p        (31.21)

using eqs. (31.2), (31.3) and (31.15). Note that the diagonal matrix Λ is invariant under transposition.
In the current example using β = 1, we reconstruct the sums of cross-products C_p between the trace elements from their loadings L:

        | 0.232   0.060 |
        | 0.432   0.119 |   | 0.232  0.432   0.390 |
L L^T = | 0.390  -0.168 | × | 0.060  0.119  -0.168 |

        | 0.058  0.107  0.080 |
      = | 0.107  0.201  0.148 |  =  C_p
        | 0.080  0.148  0.180 |

A very special case arises when α + β equals 1. If we form the product of the score matrix S with the transpose of the loading matrix L, then we obtain the original measurement table X:

S L^T = U Λ^α Λ^β V^T = U Λ V^T = X        (31.22)

by making use of eqs. (31.1), (31.9) and (31.11) where α + β = 1.

This implies that the scalar product of a score vector s_i with a loading vector l_j allows us to reconstruct the value x_ij in the table X. Written in full, this is equivalent to:

Σ_{k=1}^{r} s_ik l_jk = x_ij        (31.23)

where the vector s_i^T is the ith row of S and where the vector l_j^T is the jth row of L.
Summarizing, we find that, depending on the choice of α and β, we are able to reconstruct different features of the data in factor-space by means of the latent vectors. On the one hand, if α = 1 then we can reproduce the cross-products C_n between the rows of the table. On the other hand, if β equals 1 then we are able to reproduce the cross-products C_p between columns of the table. Clearly, we can have both α = 1 and β = 1 and reproduce cross-products between rows as well as between columns. In the following section we will explain that cross-products can be related to distances between the geometrical representations of the corresponding rows or columns.
If we want at the same time to reconstruct the data in the table X we must have that α + β = 1. A frequently used compromise is α = 1 and β = 0, where we sacrifice distances between columns for distances between rows, while still satisfying the constraint α + β = 1 for reconstruction of the data. This is an asymmetrical choice for α and β. Another compromise is to have α = 0.5 and β = 0.5 which allows for an approximate reconstruction of distances together with the reconstruction of the data. This is a symmetrical choice for α and β.
An important aspect of latent vectors analysis is the number of latent vectors that are retained. So far, we have assumed that all latent vectors are involved in the reconstruction of the data table (eq. (31.1)) and the matrices of cross-products (eq. (31.3)). In practical situations, however, we only retain the most significant latent vectors, i.e. those that contribute a significant part to the global sum of squares c (eq. (31.8)).
If we only include the first r* latent variables we have to redefine our relationships between data, latent vectors and latent values:

X = U* Λ* V*^T + E = X* + E        (31.24)

where U*, Λ* and V* represent the matrices formed from the first r* columns of U and V and from the first r* rows and columns of Λ, where E is the residual data matrix, and where X* refers to the partially reconstructed data matrix. The matrix X* can be regarded as a least squares approximation of the original matrix X by means of a reduced number of latent variables r* out of a total of r [4].
If we retain only the first factor in the current example of concentrations of trace elements in atmospheric samples, we obtain:

                  | 0.753 |
                  | 0.343 |
X* = U* Λ* V*^T = | 0.302 | × (0.626) × (0.371  0.690  0.622)
                  | 0.473 |

     | 0.175  0.325  0.294 |   | 0.212  0.399  0.190 |   |  0.037   0.074  -0.104 |
     | 0.080  0.148  0.134 |   | 0.072  0.133  0.155 |   | -0.008  -0.015   0.021 |
  =  | 0.070  0.130  0.118 | = | 0.036  0.063  0.213 | − | -0.034  -0.068   0.095 |
     | 0.110  0.204  0.184 |   | 0.078  0.141  0.273 |   | -0.032  -0.063   0.089 |

  = X − E

The residual matrix E contains the contributions of the remaining second factor to the concentrations of the trace elements at the various wind directions.
If we disregard E, we have to rewrite the reconstruction formulae (eqs. (31.19),
(31.21) and (31.22)) accordingly:

C_n* = S* S*^T    with α = 1

C_p* = L* L*^T    with β = 1    (31.25)

X* = S* L*^T      with α + β = 1
A measure for the goodness of the reconstruction is provided by the relative
contribution γ of the retained latent vectors to the global sum of squares c (eq.
(31.8)):

γ = Σ_{k=1}^{r*} λ_k^2 / Σ_{k=1}^{r} λ_k^2    (31.26)

In our example, we find that γ = 0.895 for r* = 1. We will discuss various
methods which can guide the choice of the number of relevant latent vectors r* in
Section 31.5.
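As a minimal numerical sketch (not part of the original text), the truncated reconstruction and the contribution γ can be checked with numpy; the variable names and the use of numpy's SVD routine are our own choices, not the book's notation or code:

```python
import numpy as np

# Table 31.1: concentrations of Na, Cl and Si at wind directions 0, 90, 180 and 270 degrees
X = np.array([[0.212, 0.399, 0.190],
              [0.072, 0.133, 0.155],
              [0.036, 0.063, 0.213],
              [0.078, 0.141, 0.273]])

# Singular value decomposition X = U Lambda V^T
U, lam, Vt = np.linalg.svd(X, full_matrices=False)

# Partial reconstruction X* with the first r* = 1 latent vector (eq. 31.24)
r_star = 1
X_star = U[:, :r_star] @ np.diag(lam[:r_star]) @ Vt[:r_star, :]
E = X - X_star                                  # residual data matrix

# Relative contribution of the retained latent vector(s) (eq. 31.26)
gamma = np.sum(lam[:r_star] ** 2) / np.sum(lam ** 2)

print(round(lam[0], 3))                         # first latent value, ~0.626
print(np.round(X_star, 3))                      # rank-1 approximation of Table 31.1
print(round(gamma, 3))                          # ~0.895
```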

31.2 Geometrical interpretation

31.2.1 Line of closest fit

In the previous section we have developed principal components analysis
(PCA) from the fundamental theorem of singular value decomposition (SVD). In
particular we have shown by means of eq. (31.1) how an n×p rectangular data
matrix X can be decomposed into an n×r orthonormal matrix of row-latent vectors
U, a p×r orthonormal matrix of column-latent vectors V and an r×r diagonal
matrix of latent values Λ. Now we focus on the geometrical interpretation of this
algebraic decomposition.
In Chapter 29 we introduced the concept of the two dual data spaces. Each of the
n rows of the data table X can be represented as a point in the p-dimensional
column-space S^p. In Fig. 31.2a we have represented the n rows of X by means of
the row-pattern P^n. The curved contour represents an equiprobability envelope,
e.g. a curve that encloses 99% of the points. In the case of multinormally
distributed data this envelope takes the form of an ellipsoid. For convenience we
have only represented two of the p dimensions of S^p, which is in reality a
multidimensional space rather than a two-dimensional one. One must also imagine
the equiprobability envelope as an ellipsoidal (hyper)surface rather than the
elliptical curve in the figure. The assumption that the data are distributed in a
multinormal way is seldom fulfilled in practice, and the patterns of points often
possess more complex structure than is shown in our illustrations. In Fig. 31.2a the
centroid or center of mass of the pattern of points appears at the origin of the space,
but in the general case this need not be so.
Similarly, Fig. 31.2b shows the column-pattern P^p of the p columns of the data
table X by means of an elliptical envelope in the dual n-dimensional row-space S^n.
The ellipses should be interpreted as (hyper)ellipsoidal equiprobability envelopes
of multinormal data. In practice the data are rarely multinormal and the centroid (or
center of mass) of the pattern does not generally appear at the origin of space. An
essential feature is that the equiprobability envelopes are similarly shaped in Figs.
31.2a and b. The reason for this will become apparent below. Note that in the
previous section we have assumed by convention that n exceeds p, but this is not
reflected in Figs. 31.2a and b.
In Fig. 31.2a we have represented the ith row x_i of the data table X as a point of
the row-pattern P^n in column-space S^p. The additional axes v_1 and v_2 correspond
with the columns of V which are the column-latent vectors of X. They define the
orientation of the latent vectors in column-space S^p. In the case of a symmetrical
pattern such as in Fig. 31.2, one can interpret the latent vectors as the axes of
symmetry or principal axes of the elliptic equiprobability envelopes. In the special
case of multinormally distributed data, v_1 and v_2 appear as the major and minor

Fig. 31.2. Geometrical example of the duality of data space and the concept of a common factor space.
(a) Representation of n rows (circles) of a data table X in a space S^p spanned by p columns. The pattern
P^n is shown in the form of an equiprobability ellipse. The latent vectors V define the orientations of the
principal axes of inertia of the row-pattern. (b) Representation of p columns (squares) of a data table X
in a space S^n spanned by n rows. The pattern P^p is shown in the form of an equiprobability ellipse. The
latent vectors U define the orientations of the principal axes of inertia of the column-pattern. (c) Result
of rotation of the original column-space S^p toward the factor-space S^r spanned by r latent vectors. The
original data table X is transformed into the score matrix S and the geometric representation is called a
score plot. (d) Result of rotation of the original row-space S^n toward the factor-space S^r spanned by r
latent vectors. The original data table X^T is transformed into the loading table L and the geometric rep-
resentation is referred to as a loading plot. (e) Superposition of the score and loading plot into a biplot.

axes of symmetry of the ellipse. In the general case, v_1 and v_2 are called axes of
inertia, because they account for a maximal part of the global sum of squares of the
data which we have denoted by the symbol c in eq. (31.8). Note that inertia of a
pattern of points can be related to the sum of squares of their coordinates, if we
assume that the masses of the points are all equal. If we project a particular row x_i
upon v_1, then from eq. (31.13) we deduce that the projection is at a distance s_i1 from
the origin:

x_i^T v_1 = s_i1

or more generally for all the rows:

X v_1 = s_1    (31.27)

Since this latent vector is defined as the vector for which the sum of squares of
the projections is maximum (eq. (31.5)), we can interpret v_1 as an axis of maximal
inertia:

s_1^T s_1 = v_1^T X^T X v_1 = v_1^T C_p v_1 = λ_1^2    (31.28)

Hence λ_1^2 is the fraction of the total sum of squares (or inertia) c of the data X that is
accounted for by v_1. The sum of squares (or inertia) of the projections upon a
certain axis is also proportional to the variance of these projections, when the mean
value (or sum) of these projections is zero. In data analysis we can assign different
masses (or weights) to individual points. This is the case in correspondence factor
analysis which is explained in Chapter 32, but for the moment we assume that all
masses are identical and equal to one.
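The projection property of eqs. (31.27) and (31.28) can be verified numerically; the following is only an illustrative sketch with numpy (the sign of v_1 returned by the SVD routine may be flipped, which does not affect the sums of squares):

```python
import numpy as np

X = np.array([[0.212, 0.399, 0.190],
              [0.072, 0.133, 0.155],
              [0.036, 0.063, 0.213],
              [0.078, 0.141, 0.273]])

U, lam, Vt = np.linalg.svd(X, full_matrices=False)
v1 = Vt[0]                       # first column-latent vector (its sign is arbitrary)
s1 = X @ v1                      # projections of the rows upon v1 (eq. 31.27)

# s1'.s1 = v1' X'X v1 = lambda_1^2 (eq. 31.28)
print(round(s1 @ s1, 3), round(v1 @ X.T @ X @ v1, 3), round(lam[0] ** 2, 3))
```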
We now consider a subspace of S^p which is orthogonal to v_1 and we repeat the
argument. This leads to v_2, and in the multidimensional case to all r columns in V.
By the geometrical construction, all r latent vectors are mutually orthogonal, and r
is equal to the number of dimensions of the pattern of points represented by X. This
number r is the rank of X and cannot exceed the number of columns p in X and, in
our case, is smaller than the number of rows in X (because we assume that n is
larger than p).
One of the earliest interpretations of latent vectors is that of lines of closest fit
[9]. Indeed, if the inertia along v_1 is maximal, then the inertia from all other
directions perpendicular to v_1 must be minimal. This is similar to the regression
criterion in orthogonal least squares regression which minimizes the sum of
squared deviations which are perpendicular to the regression line (Section 8.2.11).
In ordinary least squares regression one minimizes the sum of squared deviations
from the regression line in the direction of the dependent measurement, which
assumes that the independent measurement is without error. Similarly, the plane
formed by v_1 and v_2 is a plane of closest fit, in the sense that the sum of squared
deviations perpendicular to the plane is minimal. Since latent vectors v_k contribute

in decreasing order to the global sum of squares (as measured by their contri-
butions λ_k^2), we obtain increasingly better approximations to the data structure X
by means of the (hyper)planes of closest fit which are defined by V. Usually we
will not include all r latent vectors in such a regression model. If a large fraction of
the total inertia is accounted for by the first r* latent variables, where r* is much
smaller than p, then we have achieved a substantial dimension reduction. The
question of how many latent variables should be included is discussed in Section
31.5.
The same geometrical considerations can be applied to the dual representation
of the column-pattern P^p in row-space S^n (Fig. 31.2b). Here u_1 is the major axis of
symmetry of the equiprobability envelope. The projection of the jth column x_j of X
upon u_1 is at a distance from the origin given by:

x_j^T u_1 = l_j1

or more generally from eq. (31.15) for all the columns:

X^T u_1 = l_1    (31.29)

Since the latent vector maximizes the sum of squares of the projections (eq.
(31.5)) we can also interpret u_1 as an axis of inertia:

l_1^T l_1 = u_1^T X X^T u_1 = u_1^T C_n u_1 = λ_1^2    (31.30)

Note that the inertia λ_1^2 loaded by u_1 in S^n is the same as that loaded by v_1 in S^p.
For this reason, we must consider u_1 and v_1 as two different expressions of one and
the same latent vector. The former is developed in S^n while the latter is constructed
in S^p.
It is important to note that the score vector s_1 results from the projection of X
upon v_1. While v_1 is a latent vector in S^p (because it has p elements), s_1 is a vector in
S^n (because it possesses n elements). In Chapter 29 we have shown that the pattern
P^n in S^p, when projected upon the axis represented by v_1, produces an image
thereof in the dual space S^n at the point represented by s_1. If we normalize s_1 (by
division with the corresponding latent value λ_1) we obtain the latent vector u_1.
Similarly, the loading vector l_1 results from the projection of the pattern P^p in S^n
upon u_1. While the latent vector u_1 is in S^n (because of its n elements), l_1 is in S^p
(because it has p elements). By the same argument as employed above, we find that
the pattern P^p in S^n, when projected upon the axis represented by u_1, produces an
image thereof in the dual space S^p at the point represented by l_1. Normalization of l_1
(by division with λ_1) yields the latent vector v_1. This is the geometrical basis for the
transition formulae which have been discussed before (eq. (31.17)) and for the
NIPALS algorithm which is used for the calculation of singular vectors and which
is explained in Section 31.4.

Once we have obtained the projections S and L of X upon the latent vectors V
and U, we can do away with the original data spaces S^p and S^n. Since V and U are
orthonormal vectors that span the space of latent vectors S^r, each row i and each
column j of X is now represented as a point in S^r, as shown in Figs. 31.2c and d. The
coordinate axes of this space S^r are oriented along the axes of inertia of the
equiprobability envelopes. In the case of multinormally distributed data the semi-
axes of the (hyper)ellipsoids have lengths equal to the corresponding latent values
in Λ. The lengths of these semi-axes can also be interpreted as the standard
deviations of the corresponding latent vectors. Since the scores and the loadings
are associated with the same latent values, we obtain that the semi-axes of the
ellipses in Figs. 31.2a and b are identical. The geometrical interpretation of PCA,
as expressed by eqs. (31.13) and (31.15), can thus be seen as an orthogonal rotation
of the original data spaces S^p and S^n into S^r. The rotation is called orthogonal
because the rotation matrices V and U are orthonormal (eq. (31.2)). Orthogonal
rotations do not change Euclidean distances between the points in the patterns. In
particular, the total inertia of the pattern is invariant under orthogonal rotation, a
property which has been defined already in eq. (31.8). Fig. 31.2c is traditionally
called a score plot, while Fig. 31.2d is called a loading plot.
Since U and V express one and the same set of latent vectors, one can
superimpose the score plot and the loading plot into a single display as shown in
Fig. 31.2e. Such a display was called a biplot (Section 17.4), as it represents two
entities (rows and columns of X) in a single plot [10]. The biplot plays an
important role in the graphic display of the results of PCA. A fundamental property
of PCA is that it obviates the need for two dual data spaces and that instead of these
it produces a single space of latent variables.

31.2.2 Distances

It has been shown in the previous section that there is an infinity of decomp-
ositions of a data table X into scores S and loadings L depending upon the choice of
the factor scaling coefficients α and β (eqs. (31.9) and (31.11)). The most current
settings of α and β, however, are 0, 0.5 and 1. It has been noted that when α equals
1, distances between the representations of the rows are preserved in the score plot.
In the case when β equals 1, distances between the representations of the columns
are preserved in the loading plot. In the score plot of Fig. 31.3a we represent the
Euclidean distance between rows i and i′ by d_ii′, while their angular distance, as
seen from the origin, is denoted by ϑ_ii′. Euclidean distances from the origin of the
rows i and i′ are indicated as d_i and d_i′ respectively. As shown in Section 9.2.3, this
means in algebraic terms that:

d_i^2 = s_i^T s_i = x_i^T x_i = c_ii    (31.31)

where the vectors s_i^T and s_i′^T are the i and i′-th rows of the score matrix S, where the
vectors x_i^T and x_i′^T are the i and i′-th rows of the matrix X and where C_n is the matrix
of cross-products between rows of X (eq. (31.3)). The above relations can also be
derived from eq. (31.13) and by taking account of the orthonormality of singular
vectors.
In particular, c_ii and c_i′i′ represent the sums of squared elements of the row-
vectors x_i and x_i′. Consequently, d_i and d_i′ are the norms (or lengths) of the ith and
i′-th rows, respectively. The relation that links the three expressions of eq. (31.31)
together is called the triangle equation or cosine rule:

d_ii′^2 = d_i^2 + d_i′^2 - 2 d_i d_i′ cos ϑ_ii′    (31.32a)

and from eq. (31.31) we can also derive that:

d_ii′^2 = c_ii + c_i′i′ - 2 c_ii′    (31.32b)

where c_ii′ is the sum of cross-products x_i^T x_i′ between the i and i′-th rows of X. The
term cos ϑ_ii′ is a measure for the association (or similarity) between the rows i and
i′. From the score plot in Fig. 31.3a we learn that it represents the angular distance
between the points i and i′ as seen from the origin of space. This measure of
association is often used to define similarities between objects in cluster analysis
(Chapter 30).
In the current example we can calculate the distances of the wind directions
from the origin either from the data X, from the scores S or from the sums of
cross-products C_n. In the case of wind direction 0 we obtain:

d_{i=1}^2 = s_{i=1}^T s_{i=1} = (0.472^2 + 0.132^2) = 0.240
         = x_{i=1}^T x_{i=1} = (0.212^2 + 0.399^2 + 0.190^2) = 0.240
         = c_{ii=11} = 0.240

or

d_{i=1} = 0.490

The distances between wind directions can be calculated from the sums of cross-
products in C_n. In the case of wind directions 0 and 90° we derive:

d_{ii′=12}^2 = c_{ii=11} + c_{i′i′=22} - 2 c_{ii′=12} = 0.240 + 0.047 - 2(0.098) = 0.092

or

d_{ii′=12} = 0.303
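The three equivalent routes to these distances can be checked with a short numpy sketch (again only an illustration of ours, not the book's code; the array names are arbitrary):

```python
import numpy as np

X = np.array([[0.212, 0.399, 0.190],
              [0.072, 0.133, 0.155],
              [0.036, 0.063, 0.213],
              [0.078, 0.141, 0.273]])

U, lam, Vt = np.linalg.svd(X, full_matrices=False)
S = U * lam                      # scores with alpha = 1: distances between rows are preserved
Cn = X @ X.T                     # cross-products between the rows (eq. 31.3)

# Squared distance of wind direction 0 from the origin, three equivalent ways (eq. 31.31)
print(round(S[0] @ S[0], 3), round(X[0] @ X[0], 3), round(Cn[0, 0], 3))   # all ~0.240

# Distance between wind directions 0 and 90 degrees (eq. 31.32b)
print(round(np.sqrt(Cn[0, 0] + Cn[1, 1] - 2 * Cn[0, 1]), 3))              # ~0.303
```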

Fig. 31.3. (a,b) Reproduction of distances D and angular distances ϑ in a score plot (α = 1) or loading
plot (β = 1) in the common factor-space S^r. (c,d) Unipolar axis through the representation of a row or
column and through the origin 0 of space. Reproduction of the data X is obtained by perpendicular
projection of the column- or row-pattern upon the unipolar axis (α + β = 1). (e,f) Bipolar axis through
the representation of two rows or two columns. Reproduction of differences (contrasts) in the data X is
obtained by perpendicular projection of the column- or row-pattern upon the bipolar axis (α + β = 1).

The distances are reported in the score plot of Fig. 31.1a.


By analogy we obtain for the distances from the origin d_j and d_j′ of the columns
j and j′ and for the distance d_jj′ between these columns:

d_jj′^2 = (l_j - l_j′)^T (l_j - l_j′) = (x_j - x_j′)^T (x_j - x_j′)
d_j^2 = l_j^T l_j = x_j^T x_j = c_jj    (31.33)

where the vectors l_j^T and l_j′^T are the j and j′-th rows of the loading matrix L, where
the vectors x_j and x_j′ are the j and j′-th columns of the matrix X and where C_p
represents the matrix of cross-products between the columns of X (eq. (31.3)). In
particular, c_jj and c_j′j′ represent the sums of squared elements of the jth and j′-th
columns of X respectively. The above relations can also be derived from eq.
(31.15) and by taking account of the orthonormality of singular vectors.
Here d_j and d_j′ are called the norms (or lengths) of the j and j′-th columns of X
respectively. The triangle equation or cosine rule links the above expressions in eq.
(31.33) together:

d_jj′^2 = d_j^2 + d_j′^2 - 2 d_j d_j′ cos ϑ_jj′    (31.34a)

and from eq. (31.33) we can also derive that:

d_jj′^2 = c_jj + c_j′j′ - 2 c_jj′    (31.34b)

where c_jj′ is the sum of cross-products x_j^T x_j′ between the j and j′-th column-
variables of X. The term cos ϑ_jj′ is a measure of association (or similarity)
between these column-variables. From Fig. 31.3b we can see that it can be ob-
tained from the angular distance between the points j and j′ as seen from the origin
of space. This measure of association is closely related to the coefficient of correl-
ation between measurements (Section 8.3).
From the current example we can calculate the distances from the origin of the
trace elements, either from the data X, from the loadings L (with β = 1) or from the
sums of cross-products C_p. In the case of Na:

d_{j=1}^2 = l_{j=1}^T l_{j=1} = (0.232^2 + 0.060^2) = 0.057
         = x_{j=1}^T x_{j=1} = (0.212^2 + 0.072^2 + 0.036^2 + 0.078^2) = 0.058
         = c_{jj=11} = 0.058

or

d_{j=1} = 0.240

The distances between trace elements are calculated from the sums of cross-
products in C_p. In the case of Na and Cl:

d_{jj′=12}^2 = c_{jj=11} + c_{j′j′=22} - 2 c_{jj′=12} = 0.058 + 0.200 - 2(0.107) = 0.044

or

d_{jj′=12} = 0.208
The distances are reported in the loading plot of Fig. 31.1b.
In the following section on preprocessing of the data we will show that
column-centering of X leads to an interpretation of the sums of squares and
cross-products in C_p in terms of the variances-covariances of the columns of X.
Furthermore, cos ϑ_jj′ then becomes the coefficient of correlation between these
columns.
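A short sketch (our own illustration, not from the text) shows this link between cos ϑ and the correlation coefficient for the column-centered atmospheric data:

```python
import numpy as np

X = np.array([[0.212, 0.399, 0.190],
              [0.072, 0.133, 0.155],
              [0.036, 0.063, 0.213],
              [0.078, 0.141, 0.273]])

Z = X - X.mean(axis=0)           # column-centering (eq. 31.42)
Cp = Z.T @ Z                     # cross-products between the centered columns

# cos(theta) between the columns Na and Cl, from the cosine rule ...
cos_theta = Cp[0, 1] / np.sqrt(Cp[0, 0] * Cp[1, 1])
# ... coincides with the Pearson correlation between the two columns
print(round(cos_theta, 3), round(np.corrcoef(X[:, 0], X[:, 1])[0, 1], 3))
```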

31.2.3 Unipolar axes

The interpretation of biplots is made easier by the construction of axes in them.
These axes are used in the same way as in a bivariate Cartesian diagram. Perpend-
icular projection of the points in the diagrams upon a coordinate axis allows us to
determine (or reconstruct) the values in the table.
We consider the special biplot in which both rows and columns are represented
in a single display of latent variables subjected to the constraint that α + β equals 1.
As we have seen above, this constraint allows us to reconstruct the original data X
from which the latent variables U, V and the latent values Λ have been computed
(eq. (31.22)).
Algebraically, the reconstruction of the values of X has been defined by the
matrix product of the scores S with the transpose of the loadings L (eq. (31.22)).
Geometrically, one reconstructs the value x_ij by perpendicular projection of the
point represented by l_j upon the axis represented by s_i as shown in Fig. 31.3c:

X = S L^T

or

s_i^T l_j = l_j^T s_i = x_ij    with i = 1, ..., n and j = 1, ..., p    (31.35)

The same result can be obtained by perpendicular projection of a point s_i upon
an axis l_j as shown in Fig. 31.3d.
We call the axes through the origin and through either s_i or l_j unipolar axes,
because these axes are defined by a single variable or pole, as opposed to bipolar
axes which are discussed below. Each row and each column of a table X may thus
be seen as a pole through which unipolar axes can be drawn. Perpendicular
projection of the row-pattern P^n upon a unipolar axis through column j reproduces
x_j. Likewise, perpendicular projection of the column-pattern P^p upon a unipolar
axis through row i reproduces x_i. In practical applications of biplots one may add
tick marks to unipolar axes in order to facilitate their reading.

Figure 31.4 shows the biplot of the trace elements and wind directions for the
case when α = β = 0.5. Since here we have that α + β equals 1, we can reconstruct
the values in the columns of the data table X by means of perpendicular projections
upon unipolar axes. In Fig. 31.4a we have drawn a unipolar axis through Cl.
Perpendicular projection of the four wind directions upon this axis reconstructs the
order of the concentrations of Cl at the four wind directions as listed in Table 31.1.
Now we have established a way which leads back from the graphic display to the
tabulated data. This interpretation of the biplot emphasizes the one-to-one relation-
ship between the data and the plot. Such a relationship is also inherent in the
ordinary bivariate (or Cartesian) diagram.
The distance from the origin d_i or d_j is a measure of the information contained in
the corresponding row or column. If the distance is relatively large, in comparison
to others, then the corresponding row or column can be seen to contribute more
information (inertia, variance) to the result of the analysis. In the case when the
distance is zero, then the row or column carries no information (inertia, variance)
since all its elements are zero.
In the case when not all r latent variables have been retained, the reproduction
will be approximate and we have to rewrite eq. (31.35) in the form:

s_i*^T l_j* = l_j*^T s_i* = x_ij*    (31.36)

where the asterisk denotes that only the first r* latent variables have been included
in the reconstruction of the data X, and where r* < r.

31.2.4 Bipolar axes

A bipolar axis is defined by two rows s_i and s_i′ in Fig. 31.3e. If we project a
column l_j upon this bipolar axis, we reconstruct the difference between two
elements in X. This follows readily from eq. (31.22):

(s_i - s_i′)^T l_j = s_i^T l_j - s_i′^T l_j = x_ij - x_i′j    (31.37)

This bipolar axis defines a contrast between the row-variables i and i′. Note that a
contrast involves the difference of two quantities.
By virtue of the symmetry between scores and loadings, we can also construct
bipolar axes through two columns l_j and l_j′ such as is shown in Fig. 31.3f. When we
project a row s_i upon this bipolar axis we construct a difference between two
elements in X. The proof follows readily from eq. (31.22):

s_i^T (l_j - l_j′) = s_i^T l_j - s_i^T l_j′ = x_ij - x_ij′    (31.38)

This bipolar axis defines a contrast between the column-variables j and j′.
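The reproduction of values and contrasts by unipolar and bipolar axes can be verified algebraically with a small numpy sketch (illustrative only, not the book's code; we take α = β = 0.5 so that α + β = 1):

```python
import numpy as np

X = np.array([[0.212, 0.399, 0.190],
              [0.072, 0.133, 0.155],
              [0.036, 0.063, 0.213],
              [0.078, 0.141, 0.273]])

U, lam, Vt = np.linalg.svd(X, full_matrices=False)
S = U * lam ** 0.5               # scores,   alpha = 0.5 (eq. 31.9)
L = Vt.T * lam ** 0.5            # loadings, beta  = 0.5 (eq. 31.11)

# Unipolar axis (eq. 31.35): since alpha + beta = 1, s_i . l_j reproduces x_ij
print(round(S[0] @ L[1], 3), X[0, 1])                                # Cl at 0 degrees

# Bipolar axes (eqs. 31.37 and 31.38): projections reproduce contrasts
print(round((S[0] - S[1]) @ L[2], 3), round(X[0, 2] - X[1, 2], 3))   # Si, 0 vs 90 degrees
print(round(S[3] @ (L[1] - L[2]), 3), round(X[3, 1] - X[3, 2], 3))   # Cl vs Si at 270 degrees
```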
In Fig. 31.4b we have constructed a bipolar axis through Cl and Si. Perpend-
icular projections of the wind directions upon this axis are in the same order as the


Fig. 31.4. (a) Biplot in which the concentrations of an atmospheric trace element (Cl) are recon-
structed by perpendicular projection upon a unipolar axis. (b) Biplot in which the differences (con-
trasts) between two atmospheric trace elements (Cl, Si) are reproduced by perpendicular projection
upon a bipolar axis.

differences of the Si and Cl concentrations in Table 31.1. Similar geometric
operations allow us to reconstruct the values and the contrasts (differences) in the
rows of the data table X. It is also possible to add tick marks to the axes such that
the values and contrasts can be read off immediately. This is done by calibration of
the axes, specifically by relating the distances on the axes from an arbitrary origin
to the values in the table by means of linear regression.
It is important to realize that the number of contrasts increases with the square of
the number of variables (n or p). Some contrasts, however, are more important than
others and many of them are linearly dependent. To evaluate the importance of a
contrast we have to examine the distances d_ii′ and d_jj′ (Figs. 31.3a and b). If these
are large in comparison to others, then the corresponding contrasts are more import-
ant. Of course, a distance of zero means that the contrasts are non-existent, i.e. the
corresponding rows or columns in X are identical or differ only by a constant value.

31.3 Preprocessing

Preprocessing is the operation which precedes the extraction of latent vectors
from the data. It is an operation which is carried out on all the elements of an
original data table X and which produces a transformed data table Z. We will
discuss six common methods of preprocessing, including the trivial case in which
the original data are left unchanged. The effects of each of these six types of
preprocessing will be illustrated numerically by means of the small 4×3 data table
from the study of trace elements in atmospheric samples which has been used in
previous sections (Table 31.1). The various effects of the transformations can be
observed from the two summary statistics (mean and norm). These statistics
include the vector of column-means m_p and the vector of column-norms d_p of the
transformed data table Z:

m_j = (1/n) Σ_{i=1}^{n} z_ij    with j = 1, ..., p
                                                      (31.39a)
d_j^2 = (1/n) Σ_{i=1}^{n} z_ij^2

and the vector of row-means m_n and the vector of row-norms d_n:

m_i = (1/p) Σ_{j=1}^{p} z_ij    with i = 1, ..., n
                                                      (31.39b)
d_i^2 = (1/p) Σ_{j=1}^{p} z_ij^2

In our notation m_i represents an element of the vector m_n and m_j is an element of the
vector m_p. The same convention applies to the elements d_i, d_j and the
corresponding vectors d_n, d_p. The global mean m and the global norm d are defined
as follows:

m = (1/np) Σ_{i=1}^{n} Σ_{j=1}^{p} z_ij
                                                      (31.40)
d^2 = (1/np) Σ_{i=1}^{n} Σ_{j=1}^{p} z_ij^2

Note that a norm is defined here as a root mean square rather than as a sum of
squares, because of the division by n and p. This greatly simplifies our notation.
The vector of column-means m_p defines the coordinates of the centroid (or
center of mass) of the row-pattern P^n that represents the rows in column-space S^p.
Similarly, the vector of row-means m_n defines the coordinates of the center of mass
of the column-pattern P^p that represents the columns in row-space S^n. If the
column-means are zero, then the centroid will coincide with the origin of S^p and the
data are said to be column-centered. If both row- and column-means are zero then
the centroids are coincident with the origin of both S^n and S^p. In this case, the data
are double-centered (i.e. centered with respect to both rows and columns). In this
chapter we assume that all points possess unit mass (or weight), although one can
extend the definitions to variable masses as is explained in Chapter 32.
The vector of column-norms d_p of the table defines the weighted distances of the
points in P^p from the origin of space in S^n. Similarly, the vector of row-norms d_n
represents the weighted distances of the points in P^n from the origin of space in S^p.
If all column-norms are equal then the pattern P^p will appear on a (hyper)sphere
about the origin of S^n. In this case the data are said to be column-normalized. If, in
addition, the center of mass of P^n is at the origin of S^p, then we deal with
column-standardized data (i.e. column-centered and column-normalized). Note
that the norm is also proportional to the length of the vectors, i.e. the distance of its
endpoint from the origin of space. These concepts are explained in Chapter 29.
By way of graphical example of the various algebraic and geometrical concepts
that are introduced in this chapter, we will make use of a measurement table
adapted from Walczak et al. [11]. Table 31.2 describes 23 substituted chalcones in
terms of eight chromatographic retention times. Chalcone molecules are consti-
tuted of two phenyl rings joined by a chain of three carbon atoms which carries a
double bond and a ketone function. Substitutions have been made on each of the
phenyl rings at the para-positions with respect to the chain. The substituents are
CF3, F, H, methyl, ethyl, i-propyl, t-butyl, methoxy, dimethylamine, phenyl and
NO2. Not all combinations two-by-two of these substituents are represented in the

TABLE 31.2
Chromatographic retention times of 23 doubly substituted chalcones, as determined by 8 HPLC chromato-
graphic methods using heptane as the mobile phase. The methods differ by addition of 0.5 percent of a
chemical modifier.

X-Y CH2Cl2 THF Dioxan Ethanol Propanol Octanol DMSO DMF Mean

H-CH3 2.02 1.78 1.72 0.62 0.57 0.78 2.98 2.85 1.66
H-tBu 2.99 2.24 2.30 0.64 0.59 0.92 2.20 1.53 1.68
H-iPr 3.00 2.24 2.29 0.66 0.62 0.96 2.26 1.62 1.71
H-H 2.98 2.39 2.31 0.78 0.76 1.17 3.00 2.80 2.02
F-H 3.56 2.94 2.81 0.98 0.92 1.40 4.66 4.15 2.68
H-F 2.82 2.38 2.30 0.83 0.76 1.13 3.74 3.40 2.17
H-Et 3.12 2.34 2.43 0.70 0.68 1.08 2.38 1.90 1.83
H-Me 3.28 2.52 2.57 0.81 0.76 1.22 2.67 2.25 2.01
F-Me 3.87 3.00 3.06 0.91 0.87 1.38 4.33 3.22 2.58
F-F 3.47 2.85 2.92 1.05 0.96 1.39 6.24 4.98 2.98
Me-Phe 5.86 4.59 4.43 0.96 0.93 1.68 4.84 3.82 3.39
MeO-Me 9.00 6.62 6.68 1.78 1.58 2.92 6.81 5.62 5.13
Me-MeO 9.12 6.76 6.74 1.75 1.63 3.01 6.59 5.76 5.17
F-MeO 10.34 8.22 7.59 2.03 2.02 3.48 11.90 10.50 7.01
H-NO2 6.76 5.60 5.92 1.80 1.66 2.68 16.85 12.82 6.76
MeO-Phe 14.58 10.72 10.68 2.19 2.08 4.08 14.39 11.41 8.77
F-NO2 10.15 8.11 7.10 2.68 2.37 3.68 27.62 20.24 10.24
NO2-Me 12.15 9.82 9.42 2.50 2.30 3.95 18.04 15.28 9.18
NO2-H 11.93 9.81 9.40 2.68 2.59 4.27 24.38 18.94 10.50
MeO-MeO 22.19 16.08 15.79 3.44 3.48 6.85 21.27 17.12 13.28
MeO-NO2 15.94 12.67 12.74 3.36 3.08 5.48 41.17 26.35 15.10
NO2-F 14.32 12.43 12.01 3.52 3.20 5.04 39.34 21.80 13.96
NMe2-NO2 26.07 19.28 18.93 4.24 3.92 8.78 42.38 26.48 18.76

Mean 8.67 6.76 6.61 1.78 1.67 2.93 13.48 9.78 6.46

measurement table. Retention times have been observed by high-performance
liquid chromatography (HPLC) using heptane as the mobile phase with additions
of 0.5% of one of eight different modifiers: CH2Cl2, tetrahydrofuran (THF),
dioxane, ethanol, propanol, octanol, dimethylsulfoxide (DMSO) and dimethyl-
formamide (DMF). We expect to discover some structure in the set of methods
(measurements) as well as in the set of substituted compounds (objects). Further-
more, we want to investigate the specific interactions of substituents for one or
more of the modifiers.
Biplots constructed from this table are shown in Figs. 31.5 to 31.11. The
horizontal and vertical axes of these biplots represent scores and loadings of the
first two latent variables as defined by eqs. (31.9) and (31.11), using the compromise
α = β = 0.5. This particular choice of α and β ensures that projections on unipolar
and bipolar axes can be carried out, while compromising on the representation of
distances between rows and between columns.
The relative inertias contributed by the two latent vectors are indicated along-
side the horizontal and vertical axes. In some cases we have reflected one or two of
these axes such as to make the individual biplots comparable with each other.
Rows of Table 31.2 refer to the 23 substituted chalcones which are represented on
the biplots by means of graded circles. Areas of the circles are proportional to the
mean retention times of the corresponding compounds as obtained with the eight
chromatographic methods. Here large areas of circles correspond with compounds
whose average retention times are large. Columns of the table refer to the chroma-
tographic methods (heptane plus 0.5% of one of the different modifiers). They are
represented by means of graded squares on the biplot. Areas of these squares are
proportional to the mean retention times produced with the corresponding methods
by the 23 substituted chalcones. Large areas of squares correspond with methods
that on the average tend to produce large retention times in the 23 compounds. The
areas of circles and squares provide an indication of the average size or importance
of the objects and measurements described by the table, i.e. average retention time
of compounds and average retention time with chromatographic methods.

31.3.1 No transformation

The case in which the original data are left unchanged can be simply defined by
means of:

z_ij = x_ij    with i = 1, ..., n and j = 1, ..., p    (31.41)

Generally when no preprocessing is applied, we will find that all means and
norms are distinct as in Table 31.3.

TABLE 31.3

Atmospheric data from Table 31.1, after no transformation. The means m and norms d have been computed
column-wise, row-wise and globally over all elements of the table

      Na       Cl       Si       m_n      d_n
0     0.2120   0.3990   0.1900   0.2670   0.2830
90    0.0720   0.1330   0.1550   0.1200   0.1250
180   0.0360   0.0630   0.2130   0.1040   0.1299
270   0.0780   0.1410   0.2730   0.1640   0.1830

m_p   0.0995   0.1840   0.2078   0.1638   —
d_p   0.1199   0.2240   0.2121   —        0.1911



Fig. 31.5. Biplot of 23 substituted chalcones (circles) and 8 chromatographic methods (squares) as de-
scribed by their retention times in Table 31.2, after no transformation of the data. Areas of circles and
squares are related to the mean retention times of the corresponding compounds and methods, such as
they appear in the margins of the table.

Figure 31.5 shows the biplot of the 23 substituted chalcones and eight chrom-
atographic methods as described by the retention times in Table 31.2. The first
latent variable accounts for 97% of the inertia, while the second one carries about
3%. A small fraction of the inertia (less than 0.3%) is distributed over the other
latent variables. The origin of the space is represented by a small cross and appears
at the far left of the biplot. The two centroids (of compounds and of methods) are
manifestly at a large distance from the origin. Note that the circles and squares are
ordered by increasing areas along the horizontal axis. We conclude from this that
the first latent variable represents a strong size component. It accounts for the size of
compounds and measurements as appears from the marginal means of Table 31.2.

31.3.2 Column-centering

Column-centering is a customary form of preprocessing in principal comp-
onents analysis (Section 17.6.1). It involves the subtraction of the corresponding
column-means m_j from each element of the table X:

TABLE 31.4

Atmospheric data from Table 31.1, after column-centering.

      Na        Cl        Si        m_n       d_n
0     0.1125    0.2150   -0.0178    0.1032    0.1405
90   -0.0275   -0.0510   -0.0528   -0.0437    0.0452
180  -0.0635   -0.1210    0.0053   -0.0598    0.0790
270  -0.0215   -0.0430    0.0653    0.0003    0.0468

m_p   0         0         0         0         —
d_p   0.0669    0.1278    0.0430    —         0.0869

z_ij = x_ij - m_j    with i = 1, ..., n and j = 1, ..., p    (31.42)

with

m_j = (1/n) Σ_{i=1}^{n} x_ij

After this transformation we find that the column-means m_p are zero as shown in
Table 31.4.
In this case the column-norms d_p are called column-standard deviations. The
squares of these numbers are the column-variances, whose sum represents the
global variance in the data. Note that the column-variances are heterogeneous,
which means that they are very different from each other.
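A minimal numpy sketch (ours, not the book's) reproduces the column-centered values and summary statistics of Table 31.4:

```python
import numpy as np

X = np.array([[0.212, 0.399, 0.190],
              [0.072, 0.133, 0.155],
              [0.036, 0.063, 0.213],
              [0.078, 0.141, 0.273]])

Z = X - X.mean(axis=0)                       # column-centering (eq. 31.42)

print(np.round(Z, 4))                        # reproduces the body of Table 31.4
print(np.round(np.sqrt((Z ** 2).mean(axis=0)), 4))   # column-norms d_p, ~[0.0669, 0.1278, 0.0430]
```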
The column-centered biplot of the chromatographic data is shown in Fig. 31.6.
The first and second latent variables contribute 94% and 5% to the total inertia of
the data (or total variance in this case). About 0.5% is distributed over the residual
latent variables. Here the centroid of compounds coincides with the origin of space
(indicated by the small cross near the middle of the plot). The centroid of the
methods is not at the origin, since all methods are projected at the right side of the
origin on Fig. 31.6. This is indicative of a strong size component, which is also
evidenced by the ordering of the compounds according to increasing average
retention time. This observation can also be made by looking at the areas of the
circles which increase from left to right.
Inspection of the positions of the chromatographic methods on the biplot of Fig.
31.6 shows that the three alcohols (ethanol, propanol and octanol) contribute
relatively little to the global inertia (variance) of the data, since their distances from
the origin are small in comparison to those of DMSO, DMF, THF, dioxane and
methylenedichloride (CH2Cl2). Judging from the angular distances, as seen from
the origin, we find rather strong correlations between DMSO and DMF, on the one

Fig. 31.6. Biplot of chromatographic retention times in Table 31.2, after column-centering of the data.
Two unipolar axes and one bipolar axis have been drawn through the representations of the methods
DMSO and methylenedichloride (CH2Cl2). The projections of three selected compounds are indicated
by dashed lines. The values read off from the unipolar axes reproduce the retention times in the corre-
sponding columns. The values on the bipolar axis reproduce the differences between retention times.

hand, and between dioxane, THF and methylenedichloride, on the other hand. The
correlation between the three alcohols cannot be established accurately on this plot
because of their proximity to the origin.
There are two outstanding poles on this biplot. DMSO and methylenedichloride are
at a large distance from the origin and from one another. These poles are the most
likely candidates for the construction of unipolar axes. As has been explained in
the previous section, perpendicular projection of points (representing comp-
ounds) upon a unipolar axis (representing a method) leads to a reproduction of the
data in Table 31.2. In this case we have to substitute the untransformed value x_ij in
eq. (31.35) by z_ij of eq. (31.42):

s_i^T l_j = z_ij = x_ij - m_j    (31.43)

Since m_j is a constant for the given unipolar axis through the jth method, we
obtain that the projections on this axis are equal to x_ij minus a constant.
The unipolar axis through the origin and DMSO reproduces rather well the data
in the corresponding column of Table 31.2. By perpendicular projection of the

center of the circles we can read off the approximate retention times on this
unipolar axis. For example, the compound with substituents F-NO2 projects at
about a value of 28 which compares well with the tabulated value of 27.62. The
DMSO axis separates compounds with NO2 and methoxy substituents from the
others. The unipolar axis through methylenedichloride also reproduces the data in
the corresponding column of Table 31.2. This additive seems to selectively delay
the elution of compounds with dimethylamine and some methoxy substituents.
Finally, we have constructed a bipolar axis through DMSO and methylene-
dichloride. By perpendicular projection of the centers of the circles upon this bipolar
axis we obtain the differences in retention times obtained respectively with DMSO
and with methylenedichloride. Using a similar reasoning as developed above for
unipolar axes we can perform a substitution of eq. (31.42) in eq. (31.38) which leads
to:
s_i^T (l_j - l_j′) = z_ij - z_ij′ = x_ij - x_ij′ - (m_j - m_j′)    (31.44)

This shows that the bipolar axis reproduces the difference x_ij - x_ij′ between
methods j and j′ of the table minus a constant (m_j - m_j′). This bipolar axis defines a
clear contrast between the NO2-substituted compounds and the others.
For example, projection of the compound with substituents dimethylamine-NO2
defines a DMSO-methylenedichloride contrast of about 16 (Fig. 31.6), where the
real difference is 42.38 - 26.07 or 16.31 (Table 31.2). Graphical estimation of the
same contrast for the MeO-MeO compound yields a value of -1, which compares
well with the difference between tabulated data of 21.27 - 22.19 or -0.92.

31.3.3 Column-standardization

Column-standardization is the most widely used transformation. It is performed
by division of each element of a column-centered table by its corresponding
column-standard deviation (i.e. the square root of the column-variance):

z_ij = (x_ij - m_j) / d_j    with i = 1, ..., n and j = 1, ..., p    (31.45)

with

d_j^2 = (1/n) Σ_{i=1}^{n} (x_ij - m_j)^2

In the context of data analysis we divide by n rather than by (n - 1) in the
calculation of the variance. This procedure is also called autoscaling. It can be
verified in Table 31.5 how these transformed data are derived from those of Table
31.4.
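The following sketch (again our own illustration, not part of the text) performs the autoscaling of eq. (31.45) and reproduces Table 31.5:

```python
import numpy as np

X = np.array([[0.212, 0.399, 0.190],
              [0.072, 0.133, 0.155],
              [0.036, 0.063, 0.213],
              [0.078, 0.141, 0.273]])

# Autoscaling (eq. 31.45): centre each column and divide by its standard deviation,
# computed with division by n (ddof=0), as in the text
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=0)

print(np.round(Z, 3))                                 # reproduces the body of Table 31.5
print(np.round(np.sqrt((Z ** 2).mean(axis=0)), 3))    # column-norms are all equal to 1
```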

TABLE 31.5

Atmospheric data from Table 31.1, after column-standardization.

      Na       Cl       Si       m_n      d_n
0     1.681    1.683   -0.413    0.984    1.394
90   -0.411   -0.399   -1.228   -0.679    0.782
180  -0.949   -0.947    0.122   -0.591    0.777
270  -0.321   -0.337    1.519    0.287    0.917

m_p   0        0        0        0        —
d_p   1        1        1        —        1

In the corresponding column-standardized biplot of Fig. 31.7 we find all
representations of the eight chromatographic methods more or less at the same
distance from the origin of space. The circle is distorted because of the large
difference between the contributions of the first and second latent variables (95
and 4%) and the choice of α = β = 0.5 which has been made at the outset. The
combined effect is an apparent dilation of the vertical axis.
The distances between compounds in Fig. 31.7 are not notably affected by the
transformation in comparison with the previous Fig. 31.6. This biplot allows one
to perceive more easily the correlations between measurements. Three clusters are
now put in evidence, namely (1) DMSO and DMF, (2) ethanol and propanol, (3)
octanol, dioxane, THF and methylenedichloride. The line segments drawn from
the origin have been added to emphasize these groupings. Unipolar axes could
have been defined here in the same way as in Fig. 31.6. Bipolar axes on the
column-standardized biplot, however, cannot be interpreted directly in terms of
the original data in X.

31.3.4 Log column-centering

The transformation by log column-centering consists of taking logarithms
followed by column-centering. The choice of the base of the logarithms has no
effect on the interpretation of the result, but decimal logs will be used throughout:

y_ij = log(x_ij)    with i = 1, ..., n and j = 1, ..., p    (31.46)

z_ij = y_ij - m_j

with

m_j = (1/n) Σ_{i=1}^{n} y_ij


Fig. 31.7. Biplot of chromatographic retention times in Table 31.2, after column-standardization of
the data. Unipolar semi-axes have been drawn through all points representing methods. The particular
arrangement of the methods is indicative for the presence of a strong size component in the data.

In this case it is required that the original data in X are strictly positive. The effect
of the transformation appears from Table 31.6. Column-means are zero, while
column-standard deviations tend to be more homogeneous than in the case of
simple column-centering in Table 31.4, as can be seen by inspecting the corresp-
onding values for Na and Cl.
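A short sketch (ours, not the book's) reproduces Table 31.6 and shows the more homogeneous column-norms after the log transformation:

```python
import numpy as np

X = np.array([[0.212, 0.399, 0.190],
              [0.072, 0.133, 0.155],
              [0.036, 0.063, 0.213],
              [0.078, 0.141, 0.273]])

Y = np.log10(X)                              # decimal logs; X must be strictly positive (eq. 31.46)
Z = Y - Y.mean(axis=0)                       # column-centering of the log data

print(np.round(Z, 4))                                 # reproduces the body of Table 31.6
print(np.round(np.sqrt((Z ** 2).mean(axis=0)), 4))    # column-norms, more homogeneous for Na and Cl
```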
In the log column-centered biplot of Fig. 31.8 one observes that the centroid of
the compounds coincides with the origin. Also, the chromatographic methods are
at a more uniform distance from the origin than is the case with simple
column-centering in Fig. 31.6. The effect of log column-centering is to reduce the
heterogeneity of the variances. The result is close to that of column-standard-
ization in Fig. 31.7. While the logarithmic function reduces the effect of large
values it enhances that of the smaller ones. In the chromatographic application this
is an advantage, as appears from the widening of the cluster of compounds with the
substituents CF3, F, H, methyl, ethyl, i-propyl, t-butyl on the left side of Fig. 31.7.
With log column-centering we obtain unipolar axes by substituting eq. (31.46)
in eq. (31.35):

s_i^T l_j = z_ij = y_ij - m_j = log(x_ij) - m_j    (31.47)

TABLE 31.6

Atmospheric data from Table 31.1, after log column-centering.

      Na        Cl        Si        m_n       d_n
0     0.4183    0.4326   -0.0297    0.2738    0.3479
90   -0.0507   -0.0445   -0.1181   -0.0711    0.0785
180  -0.3517   -0.3690    0.0200   -0.2336    0.2945
270  -0.0159   -0.0191    0.1278    0.0309    0.0751

m_p   0         0         0         0         —
d_p   0.2746    0.2853    0.0888    —         0.2343

where m_j is the logarithm of the geometric mean of the jth column in Table 31.2,
which is a constant for this unipolar axis. Hence, the unipolar axis through DMSO
reproduces the logarithms of the data in the corresponding column of the table.
Note that the unipolar axis in Fig. 31.8 has been constructed with logarithmic
subdivisions. A similar construction applies to the unipolar axis through methylene-
dichloride.
The bipolar axis through DMSO and methylenedichloride defines logarithms of
the ratio of the data in the corresponding columns. By substitution of eq. (31.46) in
eq. (31.38) and after rearrangement we obtain that:

s_i^T (l_j - l_j′) = z_ij - z_ij′ = y_ij - y_ij′ - (m_j - m_j′)
                = log(x_ij / x_ij′) - (m_j - m_j′)    (31.48)

Consequently, the bipolar axis represents the (log) ratio of columns j and j′ in
the original data table up to the term m_j - m_j′ which is a constant for the given
bipolar axis. Note that the axis which represents the ratio of retention times with
DMSO/methylenedichloride is also divided according to a logarithmic scale.
In Fig. 31.8 we determine the DMSO/methylenedichloride ratio of the di-
methylamine-NO2 substituted compound approximately at 1.6, where the com-
puted ratio from Table 31.2 is 42.38/26.07 or 1.63.

31.3.5 Log double-centering

Preprocessing by log double-centering consists of first taking logarithms, and
then centering the data both by rows and by columns:

y_ij = log(x_ij)    with i = 1, ..., n and j = 1, ..., p    (31.49)

z_ij = y_ij - m_i - m_j + m


Fig. 31.8. Biplot of chromatographic retention times in Table 31.2, after log column-centering of the
data. The values on the bipolar axis reproduce the (log) ratios between retention times in the two corre-
sponding columns.

with

m_i = (1/p) Σ_{j=1}^{p} y_ij,   m_j = (1/n) Σ_{i=1}^{n} y_ij   and   m = (1/np) Σ_{i=1}^{n} Σ_{j=1}^{p} y_ij

It is assumed that the original data in X are strictly positive. As is evident from
Table 31.7 both the row-means m_n and the column-means m_p of the transformed
table Z are equal to zero.
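The transformation of eq. (31.49) can be written in a few lines of numpy (an illustrative sketch of ours, not the book's code), reproducing Table 31.7:

```python
import numpy as np

X = np.array([[0.212, 0.399, 0.190],
              [0.072, 0.133, 0.155],
              [0.036, 0.063, 0.213],
              [0.078, 0.141, 0.273]])

Y = np.log10(X)
# Log double-centering (eq. 31.49): subtract row- and column-means, add back the global mean
Z = Y - Y.mean(axis=1, keepdims=True) - Y.mean(axis=0, keepdims=True) + Y.mean()

print(np.round(Z, 4))                                         # reproduces the body of Table 31.7
print(np.round(Z.mean(axis=1), 4), np.round(Z.mean(axis=0), 4))   # row- and column-means are zero
```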
The biplot of Fig. 31.9 shows that both the centroids of the compounds and of
the methods coincide with the origin (the small cross in the middle of the plot). The
first two latent variables account for 83 and 14% of the inertia, respectively. Three
percent of the inertia is carried by higher order latent variables. In this biplot we
can only make interpretations of the bipolar axes directly in terms of the original
data in X. Three prominent poles appear on this biplot: DMSO, methylene-
dichloride and ethylalcohol. They are called poles because they are at a large
distance from the origin and from one another. They are also representative for the
three clusters that have been identified already on the column-standardized biplot
in Fig. 31.7.

TABLE 31.7

Atmospheric data from Table 31.1, after log double-centering

      Na        Cl        Si        m_n    d_n
0     0.1446    0.1589   -0.3034    0      0.2146
90    0.0204    0.0266   -0.0470    0      0.0333
180  -0.1181   -0.1354    0.2536    0      0.1794
270  -0.0468   -0.0500    0.0969    0      0.0685

m_p   0         0         0         0      —
d_p   0.0968    0.1082    0.2049    —      0.1450

A bipolar axis through columns j and j′ can be interpreted in the same way as in
the log column-centered case (eq. (31.48)) since the terms m_i and m cancel out in
the difference. The first (close to horizontal) axis between DMSO and ethanol
represents the (log) ratios of the corresponding retention times. They can be read
off by vertical projection of the compounds on this scale. Note that the scale is
divided logarithmically. In the same way, one can read off the (log) ratios of
methylenedichloride and ethanol from the second (close to vertical) axis on Fig.
31.9. Graphical estimation of these contrasts for the dimethylamine-NO2 substituted
chalcone produces 9.5 on the DMSO/ethanol axis and 6.2 on the methylenedichloride/
ethanol axis of Fig. 31.9. The exact ratios from Table 31.2 are 10.00 and 6.14,
respectively.
The first bipolar axis (DMSO/ethanol) accounts for the contrast between
compounds with NO2 substitutions and those without. Compounds with a NO2
substituent systematically obtain higher scores on this bipolar axis than others. The
second bipolar axis (methylenedichloride/ethanol) seems to produce an order of
the substituents according to their electronic properties. To emphasize this point
we have reproduced the log double-centered biplot again in Fig. 31.10. The dashed
line near the middle separates the class of NO2 substituted chalcones from the other
compounds. Further, we have joined substituents by line segments according to the
sequence CF3, F, H, methyl, ethyl, i-propyl, t-butyl, methoxy, phenyl and di-
methylamine. The electronic properties of these substituents vary progressively
from electron acceptors to electron donors [11] in accordance with their scores on
the second bipolar axis.
The size component which may be strongly present (as in this chromatographic
application) is eliminated by the operation of double-centering. Hence, double-
centered latent variables only express contrasts. In column-centered biplots one
may find that one latent variable expresses mainly size and the others mainly
contrasts. In general, none of the latter is a pure component of size or of contrasts.
If we want to see size and some contrasts represented in a biplot, column-centering


Fig. 31.9. Biplot of chromatographic retention times in Table 31.2, after log double-centering. Two bi-
polar axes have been drawn through the representation of the methods DMSO, methylenedichloride
(CH2Cl2) and ethanol.

is indicated. If we only want to see contrasts, then double-centering is the method


of choice.
Compounds and chromatographic methods that are far away from the origin of
the biplot possess a high degree of interaction (producing large contrasts). Those
close to the center have little interaction (producing small contrasts). Compounds
and methods that are at a distance from the origin and in the same direction possess
a positive interaction with one another (they attract each other). Those that are
opposite with respect to the origin show a negative interaction with each other
(they repel each other). Interaction of compounds with methods (and vice-versa)
can be interpreted on the biplot by analogy with mechanical forces of attraction
(for positive interaction) and repulsion (for negative interaction). The mechanical
analogy also illustrates that interaction is mutual. If a compound is attracted (or
repelled) by a chromatographic modifier, then that modifier is also attracted (or
repelled) by the compound. This property of interactions finds an analogy in
Newton's third law of the forces. In a chromatographic context, attraction is to be
interpreted as an interaction between compound and modified stationary phase
which results in a greater elution time of the compound. Similarly, repulsion leads
to a shorter elution time of the compound.

Fig. 31.10. Same biplot of chromatographic retention times as in Fig. 31.9. The line segments connect
compounds that share a common substituent. The horizontal contrast reflects the presence or absence
of a NO2 substituent. The vertical contrast expresses the electronegativity of the substituents.

One can also state that the log double-centered biplot shows interactions
between the rows and columns of the table. In the context of analysis of variance
(ANOVA), interaction is the variance that remains in the data after removal of the
main effects produced by the rows and columns of the table [12]. This is precisely
the effect of double-centering (eq. (31.49)).
Sometimes it is claimed that the double-centered biplot of latent variables 1 and
2 is identical to the column-centered biplot of latent variables 2 and 3. This is only
the case when the first latent variable coincides with the main diagonal of the data
space (i.e. the line that makes equal angles with all coordinate axes). In the present
application of chromatographic data this is certainly not the case and the results are
different. Note that projection of the compounds upon the main diagonal produces
the size component.
The transformation by log double-centering has received various names among
which spectral mapping [13], logarithmic analysis [14], saturated RC association
model [15], log-bilinear model [16] and spectral map analysis or SMA for short
[17].

31.3.6 Double-closure

A matrix is said to be closed when the sums of the elements of each row or
column are equal to a constant, for example, unity. This is the case with a table of
compositional data where each row or column contains the relative concentrations
of various components of a sample. Such compositional data are closed to 100%.
Centering is a special case where the rows or columns of a table are closed with
respect to zero. Here we only consider closure with respect to unity. A data table
becomes column-closed by dividing each element with the corresponding column-
sum. A table is row-closed when each element has been divided by its corresp-
onding row-sum. By analogy with double-centering, double-closure involves the
division of each element of a table by its corresponding row- and column-sum and
multiplication by the global sum:

z_ij = x_ij x_++ / (x_i+ x_+j)    with i = 1, ..., n and j = 1, ..., p    (31.50)

with

x_i+ = Σ_{j=1}^{p} x_ij,   x_+j = Σ_{i=1}^{n} x_ij   and   x_++ = Σ_{i=1}^{n} Σ_{j=1}^{p} x_ij

where x_i+ and x_+j are the elements of the vector of row-sums and the vector of
column-sums, respectively, and x_++ is the global sum of the elements in the table X.
Note that eq. (31.50) can also be written as:

z_ij = x_ij m / (m_i m_j)    (31.51)

where m_i, m_j, m represent the row-means, column-means and global mean of the
elements of X.
Double-closure is only applicable to data that are homogeneous, i.e. when
measurements are expressed in the same unit. The data must also be non-negative.
If these requirements are satisfied, then the row- and column-sums of the table can
be thought to express the size or importance of the items that are represented by the
corresponding rows and columns. Double-closure is the basic transformation of
correspondence factor analysis (CFA), which is applicable to contingency tables
and, by extension, to homogeneous data tables [18]. Data in contingency tables
represent counts, while data in homogeneous tables need only be defined in the
same unit. Although CFA will be discussed in greater detail in Chapter 32 on the
multivariate analysis of contingency tables, it is presented here for comparison
with the other methods of PCA which have been discussed above.

If the original data represent counts or frequencies one defines:

E(x_ij) = x_i+ x_+j / x_++ = m_i m_j / m    (31.52)

where E(x_ij) is called the expected value of x_ij under the condition of fixed marginal
totals. (See Chapter 26 on 2×2 contingency tables.) For this reason the operation
of double-closure can also be considered as a division by expected values. It
follows that the transformed data Z in eqs. (31.50) or (31.51) can be regarded as
ratios of X to their expected values E(X).
In CFA the means m_n, m_p, m and norms d_n, d_p, d are computed by weighted
sums and weighted sums of squares:

m_i = Σ_{j=1}^{p} w_j z_ij = 1    with i = 1, ..., n

m_j = Σ_{i=1}^{n} w_i z_ij = 1    with j = 1, ..., p    (31.53)

m = Σ_{i=1}^{n} Σ_{j=1}^{p} w_i w_j z_ij = 1

d_i^2 = Σ_{j=1}^{p} w_j z_ij^2

d_j^2 = Σ_{i=1}^{n} w_i z_ij^2

d^2 = Σ_{i=1}^{n} Σ_{j=1}^{p} w_i w_j z_ij^2

where the vectors of weight coefficients w_n and w_p are defined from the marginal
sums:

w_i = x_i+ / x_++   and   w_j = x_+j / x_++    (31.54)

TABLE 31.8

Atmospheric data from Table 31.1, after double-closure. The weights w are proportional to the row- and
column-sums of the original data table. They are normalized to unit sum.

      Na        Cl        Si        w_n       m_n    d_n
0     0.3067    0.3299   -0.4391    0.4075    0      0.3759
90   -0.0126   -0.0136    0.0181    0.1832    0      0.0155
180  -0.4303   -0.4609    0.6143    0.1587    0      0.5259
270  -0.2173   -0.2349    0.3121    0.2503    0      0.2672

w_p   0.2025    0.3744    0.4229    1         —      —
m_p   0         0         0         —         0      —
d_p   0.2821    0.3032    0.4036    —         —      0.3454

When all weight coefficients w_n and w_p are constant then we have the special case:

w_i = 1/n   and   w_j = 1/p    (31.55)

which leads to the usual definition of means and norms of eq. (31.39).
The effect of double-closure is shown in Table 31.8. For convenience, we have
subtracted a constant value of one from all the elements of Z in order to emphasize
the analogy of the results with those obtained by log double-centering in Table
31.7. The marginal means in the table are average values for the relative deviations
from expectations and thus must be zero.
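A small numpy sketch (ours, not part of the text) applies eq. (31.51) to the atmospheric data and reproduces Table 31.8, including the CFA weights of eq. (31.54):

```python
import numpy as np

X = np.array([[0.212, 0.399, 0.190],
              [0.072, 0.133, 0.155],
              [0.036, 0.063, 0.213],
              [0.078, 0.141, 0.273]])

m_n = X.mean(axis=1, keepdims=True)          # row-means
m_p = X.mean(axis=0, keepdims=True)          # column-means
m = X.mean()                                 # global mean

Z = X * m / (m_n * m_p)                      # double-closure (eq. 31.51): X divided by its expected values
print(np.round(Z - 1, 4))                    # reproduces the body of Table 31.8 (one has been subtracted)

# CFA weight coefficients (eq. 31.54), proportional to the marginal sums
w_n = X.sum(axis=1) / X.sum()
w_p = X.sum(axis=0) / X.sum()
print(np.round(w_n, 4), np.round(w_p, 4))    # ~[0.4075, 0.1832, 0.1587, 0.2503] and ~[0.2025, 0.3744, 0.4229]
```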
The analysis of Table 31.2 by CFA is shown in Fig. 31.11. As can be seen, the
result is very similar to that obtained by log double-centering in Figs. 31.9 and 31.10.
The first latent variable expresses a contrast between NO2 substituted chalcones
and the others. The second latent variable seems to be related to the electronic
properties of the substituents. The contributions of the two latent variables to the
total inertia amount to 96%. The double-closed biplot of Fig. 31.11 does not allow a direct
interpretation of unipolar and bipolar axes in terms of the original data X. The
other rules of interpretation are similar to those of the log double-centered biplot in
the previous subsection. Compounds and methods that seem to have moved away
from the center and in the same directions possess a positive interaction (attraction).
Those that moved in opposite directions show a negative interaction (repulsion).
There is a close analogy between double-closure (eq. (31.51)) and log double-
centering (eq. (31.49)) which can be rewritten as:

z_ij = (y_ij + m) - (m_i + m_j)    (31.56)
with
133

6%

• B-tBu P JDIOKAN^

PROPANOLa a ^ - ^
ETHAIfOl

90%

Fig. 31.11. Biplot of chromatographic retention times in Table 31.2, resulting from correspondence
factor analysis, i.e. after double-closure of the data. The line segments have been added to emphasize
contrasts in the same way as in Fig. 31.10.

ytj = iogixij) with / = 1,..., n andy = 1,..., p

where m^, m^ and m now represent the means of the logarithmically transformed
data Y.
It has been proved that, in the limiting case when the contrasts in the data are
small, the two expressions produce equivalent results [14]. The main difference
between the methods resides in the quantity that is analyzed. In CFA one analyses
the global distance of chi-square of the data [18] as explained in Section 32.5. In
log double-centered PCA or SMA one analyses the global interaction in the data
[17]. Formally, distances of chi-square and interactions are defined in the same
way as weighted sums of squares of the transformed data. Only the type of
transformation of the data differs from one method to another. (See Goodman [15]
for a thorough review of the two approaches.) Contrasts in CFA are expressed in
terms of distances of chi-square, while in SMA they can be interpreted as (log)
ratios. While the former is only applicable to homogeneous data, the latter lends
itself also to heterogeneous data, since differences between units are cancelled out
by the combined operations of logarithms and centering.

31.4 Algorithms

A large variety of algorithms is available for the extraction of latent vectors
from rectangular data tables and from square symmetric matrices. We only discuss
very briefly a few of these.
The NIPALS algorithm [19] is applied to a rectangular data table and produces
row- and column-latent vectors sequentially in decreasing order of their (associated)
latent values. A better performing, but also more complex, algorithm for rectangular
data tables is the Golub-Reinsch [20] singular value decomposition, which prod-
uces all row- and column-latent vectors at once. The counterpart of NIPALS for
square symmetric matrices is the power algorithm of Hotelling [21], which returns
row- or column-latent vectors sequentially, in decreasing order of their (associated)
latent values. More demanding alternatives are the Jacobi [22] and Householder-
QR [23] algorithms which produce all row- or column-latent vectors at once.
The choice of a particular method depends on the size and shape (tall or wide) of
the data table, the available computer resources and programming languages. If
one is only interested in the first two latent vectors (e.g. for the construction of a
biplot) one may take advantage of the iterative algorithms (NIPALS or power)
since they can be interrupted at any time. The latter are also of particular interest
with matrix-oriented computer notations such as APL [24], MATLAB [25] and
SAS/IML [26]. Algorithms are also available in libraries of Fortran functions for
solving problems in linear algebra such as LINPACK [27], in EISPACK [28]
which is dedicated to the solution of eigenvalue problems and in the Pascal library
of numerical recipes [29].

31.4.1 Singular value decomposition

The 'non-linear iterative partial least square' or NIPALS algorithm has been
designed by H. Wold [19] for the solution of a broad class of problems involving
relationships between several data tables. For a discussion of partial least squares
(PLS) the reader is referred to Chapter 35. In the particular case where NIPALS is
applied to a single rectangular data table one obtains the row- and column-singular
vectors one after the other, in decreasing order of their corresponding singular
values. The original concept of the NIPALS approach is attributed to Fisher [30].
In NIPALS one starts with an initial vector t with n arbitrarily chosen values
(Fig. 31.12). In a first step, the matrix product of the transpose of the n×p table X
with the n-vector t is formed, producing the p elements of vector w. Note that in the
traditional NIPALS notation, w has a different meaning than that of a weighting
vector which has been used in Section 31.3.6. In a second step, the elements of the
p-vector w are normalized to unit sum of squares. This prevents values from
becoming too small or too large for the purpose of numerical computation.

Fig. 31.12. Dance-step diagram, illustrating a cycle of the iterative NIPALS algorithm. Step 1 multi-
plies the score vector t with the data table X, which produces the weight vector w. Step 2 normalizes w
to unit sum of squares. In step 3, X is multiplied by w, yielding an updated t.

The third step involves the multiplication of X with w, which produces updated values
for t. The complete cycle of the NIPALS algorithm can be represented by means of
the operations, given an initial vector t:
w = X^T t

w ⇒ w / (w^T w)^{1/2}    (31.57)

t = X w
which is equivalent to the transition formulae of eq. (31.17). The three steps of the
cycle are represented in Fig. 31.12 which shows the transitions in the form of a
dance step diagram [31]. The cycle is repeated until convergence of w or t, when
changes between current and previous values are within a predefined tolerance.
At this stage one can derive the first normalized row-latent vector u_1
and column-latent vector v_1 from the final w and t:

u_1 = t / (t^T t)^{1/2}    (31.58)

v_1 = w / (w^T w)^{1/2}

(In practice, the normalization of w can be omitted and one can write v_1 = w.)

The corresponding latent value λ_1 is then defined by means of:

λ_1 = u_1^T X v_1    (31.59)

in accordance with eq. (31.1).
A crucial operation in the NIPALS algorithm is the calculation of the residual
data matrix which is independent of the contributions by the first singular vector.
This can be produced by the instruction:
X - λ_1 u_1 v_1^T ⇒ X    (31.60)
in which the new X represents the residual data matrix. Geometrically, one may
consider the residual data matrix X as the projections of the corresponding patterns
of points onto (dual) subspaces which are orthogonal to u_1 and v_1. If all elements of
the residual table are zero (or near zero within given tolerance limits), then we have
exhausted the data matrix X and no further latent vectors can be extracted. Other-
wise, the algorithm (eq. (31.57)) is repeated, yielding the singular vectors u_2 and v_2
together with the (associated) singular value λ_2. The way in which the residual
matrix has been defined (eq. (31.60)) ensures that u_2 is orthogonal to u_1 and that v_2
is orthogonal to v_1. A new residual data matrix is computed and tested for being
zero. If not, a third singular vector is extracted and so on, until complete exhaustion
of X. At the end we have decomposed the data table X into r orthogonal compo-
nents (where r is at most equal to p, which is assumed to be smaller than n):
X = Σ_{k=1}^{r} λ_k u_k v_k^T = U Λ V^T    (31.61)

By construction, all singular vectors in U and V are normalized and mutually
orthogonal.
The NIPALS algorithm is easy to program, particularly with a matrix-oriented
computer notation, and is highly efficient when only a few latent vectors are
required, such as for the construction of a two-dimensional biplot. It is also suitable
for implementation in personal or portable computers with limited hardware
resources.
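
As an illustration, the NIPALS cycle of eq. (31.57), together with the deflation step of eq. (31.60), can be sketched in Python/NumPy as follows. The starting vector, the tolerance and the comparison with a library SVD are arbitrary choices made for this sketch.

    import numpy as np

    def nipals(X, n_factors=2, tol=1e-10, max_iter=1000):
        # Returns U (n x r), latent values (r,) and V (p x r), such that
        # X is approximately equal to U diag(latent) V^T.
        X = X.astype(float).copy()
        U, V, lam = [], [], []
        for _ in range(n_factors):
            t = X[:, 0].copy()                    # arbitrary initial t
            for _ in range(max_iter):
                w = X.T @ t                       # step 1: w = X^T t
                w /= np.sqrt(w @ w)               # step 2: normalize w
                t_new = X @ w                     # step 3: t = X w
                if np.sqrt(np.sum((t_new - t) ** 2)) < tol:
                    t = t_new
                    break
                t = t_new
            u = t / np.sqrt(t @ t)                # eq. (31.58)
            l1 = u @ X @ w                        # eq. (31.59): latent value
            X -= l1 * np.outer(u, w)              # eq. (31.60): deflation
            U.append(u); V.append(w); lam.append(l1)
        return np.column_stack(U), np.array(lam), np.column_stack(V)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10, 4))
    X -= X.mean(axis=0)                           # column-centering
    U, lam, V = nipals(X, n_factors=2)
    print(lam)
    print(np.linalg.svd(X, compute_uv=False)[:2])  # should agree (up to sign)
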
A more efficient but also more demanding algorithm is the Golub-Reinsch [20]
singular value decomposition. This is a non-iterative method. Its first step is a
so-called Householder transformation [32] of a rectangular data table X which
produces a bidiagonal matrix, i.e. a matrix in which all elements are zero
except those on the principal diagonal and the diagonal immediately below (or
above) (Fig. 31.13a). The second step reduces the bidiagonal matrix into a
diagonal matrix A by means of the so-called QR transformation [33]. The elements
of this principal diagonal are the singular values. At the end of the transformations,
the Golub-Reinsch algorithm also delivers the row- and column-singular vectors
in U and V.

[Fig. 31.13: (a) Golub-Reinsch (Householder step, then QR step); (b) Jacobi; (c) Householder-QR (Householder step, then QR step).]

Fig. 31.13. Schematic example of three common algorithms for singular value and eigenvalue decom-
position.

31.4.2 Eigenvalue decomposition


The power algorithm [21] is the simplest iterative method for the calculation of
latent vectors and latent values from a square symmetric matrix. In contrast to
NIPALS, which produces an orthogonal decomposition of a rectangular data table
X, the power algorithm decomposes a square symmetric matrix of cross-products
X^T X which we denote by C_p. Note that C_p is called the column-variance-
covariance matrix when the data in X are column-centered.
In the power algorithm one first computes the matrix product of C_p with an
initial vector of p random numbers v, yielding the vector w:

w = C_p v    (31.62a)

The result is then normalized, which produces an updated vector v:

v = w / (w^T w)^{1/2}    (31.62b)

The normalization step prevents the elements in v from becoming either too large
or too small during the numerical computation. The two operations above define
the cycle of the powering algorithm, which can be iterated until convergence of the
elements in the vector v within a predefined tolerance. It can be easily shown that
after n iterations the resulting vector w can be expressed as:

w ∝ C_p^n v    (31.63)

where the symbol C_p^n represents the nth power of the matrix C_p, i.e. the matrix
product of n terms equal to C_p. The name of the method is derived from this
property. The normalized vector v_1 which results from the iterative procedure is
the first dominant eigenvector of C_p. Its associated eigenvalue λ_1² follows from eq.
(31.5a):

λ_1² = v_1^T C_p v_1    (31.64)

A key operation in the power algorithm is the calculation of the deflated
cross-product matrix which is independent of the contribution by the first eigen-
vector. This is achieved by means of the instruction:

C_p - λ_1² v_1 v_1^T ⇒ C_p    (31.65)

in which the original matrix C_p is replaced by the residual matrix.


The geometrical equivalent of the deflation operation is a projection of a pattern
of points into a subspace which is orthogonal to v_1. If all elements of the residual

matrix are zero (or near zero within a specified tolerance), then we have exhausted
the cross-product matrix C_p. Otherwise, the algorithm (eq. (31.62)) is repeated,
yielding the second eigenvector v_2 and its associated eigenvalue λ_2². Because of
the definition of the residual matrix (eq. (31.65)), we obtain that v_2 is orthogonal to
v_1. A new residual matrix is computed and the process is repeated until complete
exhaustion of C_p. At the end we have decomposed the cross-product matrix C_p into
r orthogonal components (where r is at most equal to p):

C_p = Σ_{k=1}^{r} λ_k² v_k v_k^T = V Λ² V^T    (31.66)

Once we have computed the matrix of column-eigenvectors V we can derive the
corresponding row-eigenvectors U using eq. (31.13):

U = X V Λ^{-1}    (31.67)
By construction, all eigenvectors are normalized and mutually orthogonal.
As we have remarked before, singular vectors and eigenvectors are identical (up
to an algebraic sign) and have been called latent vectors in the context of data
analysis. Similarly, singular values are the square roots of the eigenvalues and
have been called latent values. The eigenvalues are the contributions to the trace of
the matrix of cross-products, or global sum of squares, in conformity with eq. (31.8).
Although the results of NIPALS are equivalent to those of the power algorithm, the
latter converges more rapidly. It is, however, numerically less stable when results
are computed with finite precision. In practice, use of the power algorithm is
advantageous with a matrix-oriented computer notation, when hardware resources
are limited and when only a few latent vectors are required (as for the construction
of a two-dimensional biplot). In all other cases, one should make use of more
powerful and efficient algorithms, such as those described below.
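
A corresponding sketch of the power algorithm (eqs. (31.62) to (31.67)) in Python/NumPy is given below; the random starting vector and the tolerance are again arbitrary choices.

    import numpy as np

    def power_eig(Cp, n_factors=2, tol=1e-12, max_iter=1000):
        # Eigenvalue decomposition of a square symmetric matrix by power
        # iteration (eq. 31.62) with deflation (eq. 31.65).
        Cp = Cp.astype(float).copy()
        rng = np.random.default_rng(1)
        V, eigvals = [], []
        for _ in range(n_factors):
            v = rng.normal(size=Cp.shape[0])       # initial random vector
            for _ in range(max_iter):
                w = Cp @ v                         # eq. (31.62a)
                v_new = w / np.sqrt(w @ w)         # eq. (31.62b)
                if np.sqrt(np.sum((v_new - v) ** 2)) < tol:
                    v = v_new
                    break
                v = v_new
            l2 = v @ Cp @ v                        # eq. (31.64): eigenvalue
            Cp -= l2 * np.outer(v, v)              # eq. (31.65): deflation
            V.append(v); eigvals.append(l2)
        return np.array(eigvals), np.column_stack(V)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(12, 5)); X -= X.mean(axis=0)
    eigvals, V = power_eig(X.T @ X, n_factors=2)
    U = X @ V / np.sqrt(eigvals)                   # eq. (31.67): U = X V Lambda^-1
    print(np.sqrt(eigvals))                        # latent values
    print(np.linalg.svd(X, compute_uv=False)[:2])  # should agree
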
A non-iterative algorithm for the diagonalization of a symmetric square matrix
(C) is attributed to Jacobi [22]. Basically, this method begins with considering the
2×2 submatrix of C formed by the two symmetrical off-diagonal elements with the
largest absolute value and their two corresponding diagonal elements. It is possible to
find a simple 2×2 transformation matrix which diagonalizes this 2×2 symmetrical
submatrix within the larger symmetrical matrix (Fig. 31.13b). Of course this
affects the other off-diagonal elements which in general still remain different from
zero. The procedure is repeated by selecting the next off-diagonal element with
the largest absolute value, which is again zeroed by means of an appropriate
orthogonal transformation matrix. This process, when continued, converges to a
state where all off-diagonal elements of the square symmetric matrix are zero
(within given limits of tolerance). After convergence, the diagonalized matrix Λ²
contains the eigenvalues. The product of the successive transformation matrices

produces the matrix of corresponding eigenvectors V. A detailed description of
this algorithm has been given by Ralston and Wilf [34].
An even more efficient, but also more complex, algorithm for the diagonal-
ization of a square matrix is the Householder-QR eigenvalue decomposition (Fig.
31.13c). It is similar in structure to the Golub-Reinsch singular value decompo-
sition which we have mentioned above (Fig. 31.13a). The first part of this
high-performance algorithm applies a Householder [32] transformation which
converts a square matrix (C) into a tridiagonal matrix, i.e. a matrix in which all
elements are equal to zero except those on the main diagonal and the diagonals
below and above. In a second part, this tridiagonal matrix is transformed into a
diagonal one by means of the QR transformation [33,35] which can be further
optimized for efficiency [36]. After diagonalization one finds the eigenvalues on
the main diagonal of the transformed square matrix Λ². As a result of the
transformations one also obtains the corresponding eigenvectors V. Note that this
algorithm can also be applied to non-symmetrical square matrices for the solution
of general eigenproblems such as arise in linear discriminant analysis and canon-
ical correlation analysis [37]. (The latter are also discussed in Chapters 33 and 35.)
Computer programs for the Householder-QR algorithm are available from the
function libraries mentioned above.
A comparison of the performance of the three algorithms for eigenvalue decom-
position has been made on a PC (IBM AT) equipped with a mathematical co-
processor [38]. The results which are displayed in Fig. 31.14 show that the
Householder-QR algorithm outperforms Jacobi's by a factor of about 4 and is
superior to the power method by a factor of about 20. The time for diagonalization
of a square symmetric matrix required by Householder-QR increases with the
power 2.6 of the dimension of the matrix.
Usually it is assumed that the number of rows of a data table X exceeds the
number of columns. In the opposite case, the calculation of the column-eigen-
vectors V from the matrix X^T X may place high demands on computer time and
memory. In the so-called kernel algorithm this inconvenience is greatly reduced
[39]. The basic idea here is to first compute the row-eigenvectors U and eigen-
values Λ² of the matrix X X^T, and then to derive the corresponding column-eigen-
vectors V from the expression X^T U Λ^{-1} (according to eq. (31.17b)).
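
The kernel idea can be sketched as follows for a wide table; the dimensions and the tolerance are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(8, 500))            # n = 8 rows, p = 500 columns
    X -= X.mean(axis=0)

    eigvals, U = np.linalg.eigh(X @ X.T)     # small n x n eigenvalue problem
    order = np.argsort(eigvals)[::-1]        # decreasing order
    eigvals, U = eigvals[order], U[:, order]

    r = int(np.sum(eigvals > 1e-10))         # number of non-trivial eigenvalues
    Lam = np.sqrt(eigvals[:r])               # latent values
    V = X.T @ U[:, :r] / Lam                 # column-eigenvectors: X^T U Lambda^-1

    print(np.allclose(X, (U[:, :r] * Lam) @ V.T))   # X = U Lambda V^T
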

31.5 Validation

A question that often arises in multivariate data analysis is how many meaning-
ful eigenvectors should be retained, especially when the objective is to reduce the
dimensionality of the data. It is assumed that, initially, eigenvectors contribute
only structural information, which is also referred to as systematic information.


Fig. 31.14. Performance of three computer algorithms for eigenvalue decomposition as a function of
the dimension of the input matrix. The horizontal and vertical axes are scaled logarithmically.
Execution time is proportional to a power 2.6 of the dimension.

With increasing number of eigenvectors, however, noise is progressively contaminating
the eigenvectors and eventually only pure noise may be carried by the
higher-order eigenvectors. The problem then is to define the number of eigen-
vectors which account for a maximum of structure, while carrying a minimum of
noise. Since one does not always have access to replicated data, there may be uncertainty
about the extent of the random noise in the data and, hence, the problem is by no
means trivial.
The many solutions that have been proposed thus far have been reviewed
recently by Deane [40]. Some empirical methods of validation are also discussed
in Chapter 36. In this section we only discuss three representative approaches, i.e.
empirical, statistical and by internal validation. (There are several variants of these
approaches which we cannot discuss here.) It is not always clear which method
should be preferred under given circumstances [37]. It may be useful to apply

several methods of validation in parallel and to be prepared for the possibility of
conflicting outcomes. None of these approaches of internal validation, however,
are perfect substitutes for replicated data and external validation. As is often the
case, the proof of the pudding is in the eating.
Each of the three approaches will be applied in this section to the transformed
retention times of the 23 chalcones with eight chromatographic elution methods in
Table 31.2. The transformation is defined by the successive operations of logar-
ithms, double-centering and global normalization which is typical for the method
of spectral map analysis (SMA):
y_ij = log(x_ij)    with i = 1, ..., n and j = 1, ..., p    (31.68)

z_ij = (y_ij - m_i - m_j + m) / d

with

m_i = (1/p) Σ_j y_ij,    m_j = (1/n) Σ_i y_ij,    m = (1/(np)) Σ_i Σ_j y_ij

d = [ (1/(np)) Σ_i Σ_j (y_ij - m_i - m_j + m)² ]^{1/2}

The rank of the transformed table of chromatographic retention times Z is equal to
seven.
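
The transformation of eq. (31.68) can be written compactly in Python/NumPy. The table below is hypothetical (it is not Table 31.2); the global norm d is computed here as the root mean square of the double-centered values, in line with the definition given above.

    import numpy as np

    def spectral_map_transform(X):
        Y = np.log(X)                           # x_ij must be strictly positive
        mi = Y.mean(axis=1, keepdims=True)      # row means m_i
        mj = Y.mean(axis=0, keepdims=True)      # column means m_j
        m = Y.mean()                            # global mean m
        D = Y - mi - mj + m                     # double-centering
        d = np.sqrt((D ** 2).mean())            # global normalization
        return D / d

    X = np.array([[2.3, 4.1,  8.8],             # hypothetical positive data
                  [1.9, 3.7,  7.4],
                  [5.0, 9.2, 20.1]])
    Z = spectral_map_transform(X)
    print(Z.round(3))
    print(np.linalg.matrix_rank(Z))             # double-centering reduces the rank
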

31.5.1 Scree-plot

This empirical test is based on the so-called Scree-plot which represents the
residual variance as a function of the number of eigenvectors that have been
extracted [42]. The residual variance V(r*) after extraction of the r*-th eigenvector is defined by:

V(r*) = Σ_{k=r*+1}^{r} λ_k²    with 1 ≤ r* ≤ r    (31.69)

where r is the number of nontrivial (non-zero) eigenvalues.




Fig. 31.15. Scree-plot, representing the residual variance V* as a function of the number of factors r*
that has been extracted. The diagram is based on a factor analysis of Table 31.2 after log dou-
ble-centering. A break point occurs after the second factor, which suggests the presence of only two
structural factors, the residual factors being attributed to noise and artefacts in the data.

It is assumed that the structural eigenvectors explain successively less variance
in the data. The error eigenvalues, however, when they account for random errors
in the data, should be equal. In practice, one expects that the curve on the
Scree-plot levels off at a point r* when the structural information in the data is
nearly exhausted. This point determines the number of structural eigenvectors. In
Fig. 31.15 we present the Scree-plot for the 23x8 table of transformed chroma-
tographic retention times. From the plot we observe that the residual variance
levels off after the second eigenvector. Hence, we conclude from this evidence that
the structural pattern in the data is two-dimensional and that the five residual
dimensions contribute mostly noise.
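
The residual variances of eq. (31.69) follow directly from the eigenvalues; the short sketch below reproduces the V(r*) column of Table 31.9.

    import numpy as np

    # Eigenvalues (latent values squared) from Table 31.9.
    eig = np.array([0.8286, 0.1366, 0.0229, 0.0066, 0.0031, 0.0015, 0.0008])

    for r_star in range(1, len(eig) + 1):
        V = eig[r_star:].sum()                  # eq. (31.69)
        print(f"r* = {r_star}   V(r*) = {V:.4f}")
    # Plotting V(r*) against r* gives the Scree-plot of Fig. 31.15; the curve
    # levels off after r* = 2.
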

31.5.2 Malinowski's F-test

This is a statistical test designed by Malinowski [43] which compares the
variance contributed by a structural eigenvector with that of the error eigenvectors.
Let us suppose that λ_{r*}² is the variance contributed by the last structural eigenvector.

TABLE 31.9

Validation results obtained from factor analysis of Table 31.2, containing the retention times of 23
chalcones in 8 chromatographic methods, after log double-centering and global normalization. The results
are used in Malinowski's F-test and in cross-validation by PRESS.

r*    r-r*    λ_{r*}²    V(r*)     F(r*)    PRESS(r*)    W(r*)
1     6       0.8286     0.1715    29.0     45.181       0.4936
2     5       0.1366     0.0349    19.6     22.301       1.0600
3     4       0.0229     0.0120     7.6     23.639       1.6128
4     3       0.0066     0.0054     3.7     38.126       1.2486
5     2       0.0031     0.0023     2.7     47.603       1.1104
6     1       0.0015     0.0008     1.9     52.856       1.0472
7     0       0.0008     0          —       55.351       —

This structural variance should be larger than the average V̄(r*) of the variances
contributed by the r - r* error eigenvectors that follow:

V̄(r*) = (1/(r - r*)) Σ_{k=r*+1}^{r} λ_k² = V(r*)/(r - r*)    with 1 ≤ r* < r    (31.70)

Hence, the number of structural eigenvectors is the largest r* for which Malinowski's
F-ratio is still significant at a predefined level of probability α (say 0.05):

F(r*) = λ_{r*}² / V̄(r*)    with 1 ≤ r* < r    (31.71)

which is to be tested with 1 and (r - r*) degrees of freedom. Note that this test
assumes that the errors in the data are homogeneous and normally distributed.
In Table 31.9 we represent the results of Malinowski's F-test as computed from
the eigenvalues of the transformed retention times in Table 31.2. The second
eigenvector produces an F-value which still exceeds the critical F-statistic (6.6
with 1 and 5 degrees of freedom) at the 0.05 level of probability. Hence, from this
evidence we conclude again that there are two, possibly three, structural eigen-
vectors in this data set.
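
The F-ratios of eq. (31.71) can be recomputed from the eigenvalues of Table 31.9 with a few lines of Python; the critical value quoted in the comment is the tabulated F(0.05; 1, 5) mentioned above.

    import numpy as np

    eig = np.array([0.8286, 0.1366, 0.0229, 0.0066, 0.0031, 0.0015, 0.0008])
    r = len(eig)

    for r_star in range(1, r):                  # 1 <= r* < r
        V_res = eig[r_star:].sum()              # V(r*), eq. (31.69)
        V_bar = V_res / (r - r_star)            # average error variance, eq. (31.70)
        F = eig[r_star - 1] / V_bar             # eq. (31.71)
        print(f"r* = {r_star}   F = {F:5.1f}   df = (1, {r - r_star})")
    # Compare each F with the critical value of F(1, r - r*) at the chosen
    # significance level, e.g. 6.6 for 1 and 5 degrees of freedom at alpha = 0.05.
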

31.5.3 Cross-validation

The method of cross-validation is based on internal validation, which means
that one predicts each element in the data set from the results of an analysis of the
remaining ones. This can be done by leaving out each element in turn, which in the
case of an n×p table would require n×p analyses. Wold [44] has implemented a
scheme for leaving out groups of elements at the same time, which reduces the

Fig. 31.16. Two of the four masks that are used in the calculation of a PRESS value in cross-validation.
The remaining two masks are obtained by a parallel shift of the diagonal lines that represent data
points to be deleted.

number of analyses to four. In this scheme, elements to be left out are located on
equally spaced diagonals (pseudodiagonals) such as illustrated by the masks of
Fig. 31.16. Actually, there are four such masks, but only two are shown here. This
scheme ensures that not more and not less than about one fourth of each row and
column are left out at the same time. The elements on the diagonals are predicted
from the remaining ones. Then the diagonals are shifted one element to the right
and the procedure is repeated. After four such analyses we have predicted all
elements of the table. The predicted residual error sum of squares or PRESS for r*
eigenvectors is given by:

PRESS(r*) = Σ_i Σ_j (z_ij - z*_ij)²    (31.72)

where z*_ij is the value predicted for z_ij using r* eigenvectors [44]. For a general
discussion of PRESS the reader is referred to Section 10.3.4 and to Chapter 36.
As more structural eigenvectors are included we expect PRESS to decrease up to
a point when the structural information is exhausted. From this point on we expect
PRESS to increase again as increasingly more error eigenvectors are included. In
order to determine the transition point r* one can compare PRESS(r*+1) with the
previously obtained PRESS(r*). The number of structural eigenvectors r* is reached
when the ratio:

W(r*) = PRESS(r* + 1) / PRESS(r*)    with 1 ≤ r* < r    (31.73)

becomes larger than 1 [45]. Note that the symbols V and W in this section represent
the traditional notation which differs from our standardized nomenclature in this
chapter.

In a more recent variant of cross-validation one replaces PRESS(r*) in the
denominator by the residual sum of squares RSS(r*):

RSS(r*) = Σ_i Σ_j (z_ij - ẑ_ij)²    (31.74)

where ẑ_ij is the reconstructed value of z_ij using r* eigenvectors and without leaving
any element out. The results of cross-validation of the transformed retention times
of Table 31.2 are compiled in the last two columns of Table 31.9. They point
toward the presence of two structural factors.
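
The decision rule of eq. (31.73) is easily applied to the PRESS values of Table 31.9, as the following sketch shows.

    import numpy as np

    press = np.array([45.181, 22.301, 23.639, 38.126, 47.603, 52.856, 55.351])

    W = press[1:] / press[:-1]          # W(r*) = PRESS(r* + 1) / PRESS(r*)
    for r_star, w in enumerate(W, start=1):
        print(f"r* = {r_star}   W(r*) = {w:.4f}")

    r_struct = int(np.argmax(W > 1.0)) + 1      # first r* for which W > 1
    print("number of structural eigenvectors:", r_struct)   # 2
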

31.6 Principal coordinates analysis

31.6.1 Distances defined from data

Principal coordinates analysis (PCoA) is applied to distance tables rather than
to original data tables, as is the case with principal components analysis (PCA).
We consider an n×n table D of distances between the n row-items of an n×p
data table X. Distances can be derived from the data by means of various functions,
depending upon the nature of the data and the objective of the analysis. Each of
these functions defines a particular metric (or yardstick), and the graphical result of
a multivariate analysis may largely depend on the particular choice of distance
function.
The squared Euclidean distance (also called Pythagorean distance) has been
defined in Section 9.2.3:

d²_ii' = (x_i - x_i')^T (x_i - x_i') = Σ_j (x_ij - x_i'j)²    with i, i' = 1, ..., n    (31.75)

where the vectors x_i and x_i' represent the ith and i'-th rows of X.
It can be regarded as a special case of the squared weighted Euclidean distance
(Section 30.2.2.1). A property of (weighted) Euclidean distance functions is that
the distances between row-items D are invariant under column-centering of the
table X:

d'²_ii' = Σ_j w_j ((x_ij - m_j) - (x_i'j - m_j))² = Σ_j w_j (x_ij - x_i'j)² = d²_ii'    (31.76)

with m_j = (1/n) Σ_i x_ij and Σ_j w_j = 1



Geometrically, column-centering of X is equivalent to a translation of the origin of
column-space toward the centroid of the points which represent the rows of the
data table X. Hence, the operation of column-centering leaves distances between
the row-points unchanged.
The squared generalized distance is a weighted distance of the general form:

d²_ii' = (x_i - x_i')^T W (x_i - x_i')    (31.77)

where W represents a p×p weighting matrix. Depending upon the definition of W,
the generalized distance encompasses the ordinary and weighted Euclidean dis-
tances as well as the Mahalanobis distance (Section 30.2.2.1).
The Minkowski distance defines a class of distance functions which are char-
acterized by the parameter r (Section 30.2.3.2):

d_ii' = (Σ_j |x_ij - x_i'j|^r)^{1/r}    with r ≥ 1    (31.78)

In the case of r = 2 we obtain the ordinary Euclidean distance of eq. (31.75), which
is also called the L2-norm. In the case of r = 1 we derive the city-block distance
(also called Hamming-, taxi- or Manhattan-distance), which is also referred to as
the L1-norm:

d_ii' = Σ_j |x_ij - x_i'j|    (31.79)

The squared Chi-square distance is appropriate for the analysis of contingency
tables (when the data represent counts) and for cross-tabulations (when the data
represent parts of a whole):

d²_ii' = Σ_j (1/x_+j) (x_ij/x_i+ - x_i'j/x_i'+)²    (31.80)

where x_i+, x_+j and x_++ represent the ith row-sum, the jth column-sum and the global sum
over all rows and columns, respectively.
The Chi-square distance can be seen as a weighted Euclidean distance on the
transformed data:
z_ij = x_ij / E(x_ij)    (31.81)

where the expected value E(x_ij) has been defined in Section 16.2.3 as:

E(x_ij) = x_i+ x_+j / x_++    (31.82)

In this respect, the weight coefficients are proportional to the column-sums. Distances
of Chi-square form the basis of correspondence factor analysis (CFA) which
is discussed in Chapter 32.
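
The squared Chi-square distances between the rows of a table of counts can be computed as follows; the weighting by the reciprocal column-sums corresponds to the definition of eq. (31.80), and the small table is hypothetical.

    import numpy as np

    def chi2_distances(X):
        rs = X.sum(axis=1, keepdims=True)       # row-sums x_i+
        cs = X.sum(axis=0)                      # column-sums x_+j
        P = X / rs                              # row profiles x_ij / x_i+
        n = X.shape[0]
        D2 = np.zeros((n, n))
        for i in range(n):
            for k in range(n):
                D2[i, k] = np.sum((P[i] - P[k]) ** 2 / cs)   # eq. (31.80)
        return D2

    X = np.array([[10., 20., 30.],              # hypothetical counts
                  [15., 15., 10.],
                  [ 5., 25., 40.]])
    print(chi2_distances(X).round(4))
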

31.6.2 Distances derived from comparisons of pairs

Suppose we have a set of n items which are to be compared two-by-two
according to the presence or absence of p characteristics. The similarity between
any pair i, i' can be defined, for example, by means of Gower's similarity index
[46] which was defined in Chapter 30.
Gower recommends transforming the similarities S into distances D by means
of:

d_ii' = (1 - s_ii')^{1/2}    (31.83)
There are many alternative indices of similarity, but Gower's is considered to be
fairly general and can be related to several other similarity indices [47].

31.6.3 Eigenvalue decomposition

In Section 31.2.2 we have derived the Euclidean distances D between two items
i and i' from their corresponding variances and covariances C.
The inverse relationship has been established by Young and Householder [48] and
is also referred to as Gower's formula because of the extensive use he has made of
it [49]:

c_ii' = -(1/2) (d²_ii' - d̄²_i - d̄²_i' + d̄²)    (31.84)

where d̄²_i, d̄²_i' and d̄² refer to the mean of row i, the mean of column i', and the
global mean over all rows and columns, respectively, of the squared distance
matrix D². The variance-covariance matrix C equals the double-centered squared
distance matrix D² (multiplied by the constant -0.5).
Once the n×n variance-covariance matrix C has been derived one can apply
eigenvalue decomposition (EVD) as explained in Section 31.4.2. In this case we
obtain:

C = U Λ² U^T    with U^T U = I

where U represents the n×r matrix of orthonormal factor scores, Λ² is the
r×r diagonal matrix of eigenvalues and r is the rank of C.

Factor scores for each of the n objects with respect to the r computed factors are
then defined in the usual way by means of the n×r orthogonal factor score matrix S:

S = U Λ
These factor scores can be used for the construction of a score plot in which each
object is represented as a point in the plane of the first two dominant factors.
A close analogy exists between PCoA and PCA, the difference lying in the
source of the data. In the former they appear as a square distance table, while in the
latter they are defined as a rectangular measurement table. The result of PCoA also
serves as a starting point for multidimensional scaling (MDS) which attempts to
reproduce distances as closely as possible in a low-dimensional space. In this
context PCoA is also referred to as classical metric scaling. In MDS, one mini-
mizes the stress between observed and reconstructed distances, while in PCA one
maximizes the variance reproduced by successive factors.
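
Classical metric scaling can be sketched in a few lines of Python/NumPy: the squared distances are double-centered according to eq. (31.84), the resulting matrix C is subjected to eigenvalue decomposition, and the factor scores are S = U Λ. The random data serve only to verify that, for Euclidean distances, PCoA reproduces the latent values of the centered data table.

    import numpy as np

    def pcoa(D):
        D2 = D ** 2
        n = D2.shape[0]
        J = np.eye(n) - np.ones((n, n)) / n
        C = -0.5 * J @ D2 @ J                   # eq. (31.84), in matrix form
        eigvals, U = np.linalg.eigh(C)
        order = np.argsort(eigvals)[::-1]
        eigvals, U = eigvals[order], U[:, order]
        keep = eigvals > 1e-10                  # retain positive eigenvalues
        return U[:, keep] * np.sqrt(eigvals[keep])   # factor scores S = U Lambda

    rng = np.random.default_rng(0)
    X = rng.normal(size=(6, 3))
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    S = pcoa(D)
    Xc = X - X.mean(axis=0)
    print(np.sqrt((S ** 2).sum(axis=0)).round(4))          # latent values from PCoA
    print(np.linalg.svd(Xc, compute_uv=False).round(4))    # latent values from PCA
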

31.7 Non-linear principal components analysis

Non-linear PCA can be obtained in many different ways. Some methods make
use of higher order terms of the data (e.g. squares, cross-products), non-linear
transformations (e.g. logarithms), metrics that differ from the usual Euclidean one
(e.g. city-block distance) or specialized applications of neural networks [50]. The
objective of these methods is to increase the amount of variance in the data that is
explained by the first two or three components of the analysis. We only provide a
brief outline of the various approaches, with the exception of neural networks for
which the reader is referred to Chapter 44.

31.7.1 Extensions of the data by higher order terms

One approach is to extend the columns of a measurement table by means of their
powers and cross-products. An example of such non-linear PCA is discussed in
Section 37.2.1 in an application of QSAR, where biological activity was known to
be related to the hydrophobic constant by means of a quadratic function. In this
case it made sense to add the square of a particular column to the original
measurement table. This procedure, however, tends to increase the redundancy in
the data.
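
A minimal sketch of this extension is given below; the random table and the choice of added terms (squares and pairwise cross-products) are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 3))                          # original n x p table

    squares = X ** 2
    cross = np.column_stack([X[:, i] * X[:, j]
                             for i in range(3) for j in range(i + 1, 3)])
    X_ext = np.column_stack([X, squares, cross])          # extended table

    Xc = X_ext - X_ext.mean(axis=0)                       # column-centering
    latent = np.linalg.svd(Xc, compute_uv=False)          # ordinary (linear) PCA
    print(X_ext.shape, latent[:3].round(3))
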

31.7.2 Non-linear transformations of the data

Another approach is to transform the original columns of a measurement table
by suitable non-linear functions in order to obtain some desirable effect. This is

done, for example, in the HOMALS program (Homogeneity and Alternating Least
Squares) [51] where non-linear transformations are applied in order to maximize
the homogeneity of the data. In this context, homogeneity is inversely related to the
within-variance of the rows of the data table. For example, a table in which the
values in each individual row are constant (but different from zero) possesses
maximum homogeneity [52]. The HOMALS approach is most suitable in the case
of categorical data, i.e. when the column-items represent categories of one or more
variables such as age, gender, etc. Correspondence factor analysis (CFA) which is
discussed in Chapter 32 may be regarded as a special case of HOMALS.
The logarithmic transformation prior to column- or double-centered PCA (Section
31.3) can be considered as a special case of non-linear PCA. The procedure
tends to make the row- and column-variances more homogeneous, and allows us to
interpret the resulting biplots in terms of log ratios.

31.7.3 Non-linear PCA biplot

The non-linear PCA biplot approach makes use of distance metrics that differ
from the ordinary and weighted Euclidean metrics. We have already discussed the
linear biplot resulting from PCA of a column-centered measurement table (Section
31.3.3). In this case, row-items are represented as a pattern of points and column-
items are depicted by means of unipolar axes, i.e. straight lines through the origin
of the plot (which is at the same time the centroid of the pattern). The original data
in a column of the table can also be read off by means of perpendicular projection
of the row-points upon the unipolar axes and by interpolating between equidistant
markers (tick marks). In the non-linear case, the axes are curved and the markers
on them are no longer equidistant (Fig. 31.17).
The theory of the non-linear PCA biplot has been developed by Gower [49] and
can be described as follows. We first assume that a column-centered measurement
table X is decomposed by means of classical (or linear) PCA into a matrix of factor
scores S and a matrix of factor loadings L:
X = U Λ V^T = S L^T

with

S = U Λ    and    L = V

This corresponds with a choice of factor scaling coefficients α = 1 and β = 0, as
defined in Section 31.1.4. Note that classical PCA implicitly assumes a Euclidean
metric as defined above. Let us consider the jth coordinate axis of column-space,
which is defined by a p-vector of unit length e_j of the form:

e_j = (0 0 ... 1 ... 0 0)

Fig. 31.17. (a) In a classical PCA biplot, data values x_ij can be estimated by means of perpendicular
projection of the ith row-point upon a unipolar axis which represents the jth column-item of the data
table X. In this case the axis is a straight line through the origin (represented by a small cross). (b) In a
non-linear PCA biplot, the jth column-item traces out a curvilinear trajectory. The data value x_ij is now
estimated by defining the shortest distance between the ith row-point and the jth trajectory.

which has zeroes in all p positions except at the jth one. The projection of e_j onto
factor-space yields the coordinates of the jth column-item of X, which are con-
tained in the jth row of the factor loading matrix L. Instead of e_j we may also
project the vector k_j e_j, where k_j is a variable coefficient ranging from -∞ to +∞. In
that case, we obtain a series of points the locus of which is the unipolar axis
through the origin of space and through the representation of the jth column-item
of X. In classical PCA, the origin of space is also the centroid of the row-items of
the column-centered table X. This projection method can also be used to define
markers along the jth unipolar axis, by choosing appropriate values of the

coefficient k_j (e.g. ..., -10, -5, 0, 5, 10, ...). If we set k_j equal to the successive values x_ij
in the jth column of X we obtain the exact projections of the row-items (i = 1, ..., n)
along the jth axis. These projections usually differ from the orthogonal projections
in the plane of the biplot when the contribution of the first two factors is less than
100% of the variance in X.
The same idea can be developed in the case of a non-Euclidean metric such as
the city-block metric or L1-norm (Section 31.6.1). Here we find that the traject-
ories, traced out by the variable coefficient k_j, are curvilinear rather than linear.
Markers between equidistant values on the original scales of the columns of X are
usually not equidistant on the corresponding curvilinear trajectories of the non-
linear biplot (Fig. 31.17b). Although the curvilinear trajectories intersect at the
origin of space, the latter does not necessarily coincide with the centroid of the
row-points of X. We briefly describe here the basic steps of the algorithm and we
refer to the original work of Gower [53,54] for a formal proof.
Given the original n×p measurement table X, one derives the table of n×n
distances D between the n row-items, using a particular distance function φ:

d²_ii' = Σ_j φ(x_ij, x_i'j)
Using D as input we apply principal coordinates analysis (PCoA) which we dis-
cussed in the previous section. This produces the n×r factor score matrix S. The
next step is to define a variable point along the jth coordinate axis, by means of the
coefficient k_j, and to compute its distances d_i(k_j) from all n row-points:

d_i²(k_j) = Σ_j' φ(x_ij', k_j δ_jj')    for i = 1, ..., n    (31.85)

where δ_jj' represents Kronecker's delta which equals 1 if j = j' and 0 otherwise. As
indicated above, one may either assign a set of equidistant values to k_j, or the n
values x_ij in the jth column of X.
The r coordinates of the variable point which traces out the trajectory of the jth
column-item in the r-dimensional biplot are compiled in the r-vector s_j. The
elements of the latter can be estimated by means of linear regression of the n×r
factor scores S upon an n-vector f_j which is defined as:

f_ij = -(1/2) (d_i²(k_j) - d̄_i² + (1/2) d̄²)    (31.86)
where the terms of the right-hand member have been defined above.
The result of the linear regression can be expressed in the form:

s_j = (S^T S)^{-1} S^T f_j    (31.87)



Note that d̄_i² and d̄² represent the row- and global means of the squared distance
matrix D², respectively. These need only be computed once and for all. The n
distances d_i(k_j) of the variable point to the n row-points, however, are to be re-
evaluated for every column-item of X and for every marker which is to appear in
the corresponding trajectory in the biplot. Usually, the range of k_j is limited
between the minimum and maximum value in the jth column of X or somewhat
beyond (say 10% of the range on either side).

31.8 Three-way principal components analysis

Multiway and particularly three-way analysis of data has become an important
subject in chemometrics. This is the result of the development of hyphenated
detection methods (such as in combined chromatography-spectrometry) and yields
three-way data structures the ways of which are defined by samples, retention
times and wavelengths. In multivariate process analysis, three-way data are obtain-
ed from various batches, quality measures and times of observation [55]. In image
analysis, the three modes are formed by the horizontal and vertical coordinates of
the pixels within a frame and the successive frames that have been recorded. In this
rapidly developing field one already finds an extensive body of literature and only
a brief outline can be given here. For a more comprehensive reading and a
discussion of practical applications we refer to the reviews by Geladi [56], Smilde
[57] and Henrion [58].
In the present terminology, a three-way data array is defined by n rows, p
columns and q layers, with indices i, j and k, respectively.

31.8.1 Unfolding

A straightforward approach to three-way PCA, which avoids some of the
inherent problems of three-way analysis, is to unfold the three-way data array into
a two-way array, i.e. a rectangular data table, which can then be processed
according to the methods outlined in previous sections. The operation of unfolding
can be likened to spreading out of a stack of continuous forms (such as is used with
mechanical pin-fed printers). Unfolding of a three-way data array, however, can be
done in three different manners as shown in Fig. 31.18. Theoretically, an n×p×q
array will give rise to an n×(p×q), p×(n×q) or q×(n×p) rectangular data table.
Application of PCA to each of these three unfoldings will generally produce a
different result. In particular, one obtains three different biplots for one and the
same three-way table, which may be undesirable. Strictly speaking, unfolding
avoids the three-way problem by reducing it to the analysis of a two-way table.
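
The three unfoldings can be obtained with simple reshaping operations, as the following sketch shows for a small toy array.

    import numpy as np

    n, p, q = 4, 3, 2
    X = np.arange(n * p * q, dtype=float).reshape(n, p, q)   # toy n x p x q array

    X_rows   = X.reshape(n, p * q)                     # n x (p.q) unfolding
    X_cols   = X.transpose(1, 0, 2).reshape(p, n * q)  # p x (n.q) unfolding
    X_layers = X.transpose(2, 0, 1).reshape(q, n * p)  # q x (n.p) unfolding

    print(X_rows.shape, X_cols.shape, X_layers.shape)
    # Each unfolded table can be submitted to ordinary two-way PCA, but the
    # three analyses will in general give different results.
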

Fig. 31.18. Three ways of unfolding a three-way table into a rectangular matrix.

31.8.2 The Tucker3 model

The so-called Tucker3 model is defined by the decomposition of a three-way
table X into a three-way core matrix Z and three two-way loading matrices A, B, C
(one for each mode):

x_ijk = Σ_f Σ_g Σ_h a_if b_jg c_kh z_fgh + e_ijk    (31.88)

where e_ijk represents the residual error term.
In this way, an n×p×q table X is decomposed into an r×s×t core matrix Z and
the n×r, p×s, q×t loading matrices A, B, C for the row-, column- and layer-items
of X. The loading matrices are column-wise orthonormal, which means that:

A^T A = I_r,    B^T B = I_s    and    C^T C = I_t

where I_r, I_s and I_t represent the unit matrices of dimensions r, s and t, respectively.
The number of factors r, s and t, assigned to each mode, is generally different.
They are chosen such as to be less than the dimensions of the original three-way
table in order to achieve a considerable amount of data reduction. The elements of
Z represent the magnitude of the factors and the extent of their interaction [57].
Computationally, the core matrix Z and the loading matrices A, B and C are
derived such as to minimize the sum of squared residuals.
By analogy with singular vector decomposition (SVD) and ordinary PCA, one
can also define the Tucker3 model in an extended matrix notation:

X = A ·Z· B + E    (31.89)

The extended matrix notation is represented here by the three dots that surround
the core matrix. A graphical example of the Tucker3 model is rendered in Fig.
31.19.


Fig. 31.19. Tucker3 or core matrix decomposition of an n×p×q three-way table X. The matrix Z repre-
sents the r×s×t core matrix. A, B and C are the n×r, p×s and q×t loading matrices of the row-, column-
and layer-items of X, respectively.
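
The reconstruction step of eq. (31.88) can be expressed compactly with an einsum; the matrices below are random and serve only to show the dimensions (fitting the core and loadings requires an alternating least squares procedure which is not shown here).

    import numpy as np

    n, p, q = 5, 4, 3           # sizes of the three modes
    r, s, t = 2, 2, 2           # numbers of factors per mode

    rng = np.random.default_rng(0)
    A = rng.normal(size=(n, r))      # row loadings
    B = rng.normal(size=(p, s))      # column loadings
    C = rng.normal(size=(q, t))      # layer loadings
    Z = rng.normal(size=(r, s, t))   # core array

    # x_ijk = sum_f sum_g sum_h a_if b_jg c_kh z_fgh (eq. 31.88, without error term)
    X_hat = np.einsum('if,jg,kh,fgh->ijk', A, B, C, Z)
    print(X_hat.shape)               # (n, p, q)
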

31.8.3 The PARAFAC model

The decomposition of a three-way table X can also be defined by means of the
PARAFAC model in terms of three two-way loading matrices (one for each mode)
A, B, C:

x_ijk = Σ_{f=1}^{r} a_if b_jf c_kf + e_ijk    (31.90)

where e_ijk represents a residual error term. This decomposition is also referred to as
the trilinear model.
In this case, the n×p×q three-way table X is decomposed into the n×r, p×r, q×r
loading matrices A, B, C for the row-, column- and layer-items of X.
In the PARAFAC model, the three loading matrices A, B, C are not necessarily
orthogonal [56]. The solution of the PARAFAC model, however, is unique and
does not suffer from the indeterminacy that arises in principal components and
factor analysis.
In contrast to the Tucker3 model, described above, the number of factors in each
mode is identical. It is chosen to be much smaller than the original dimensions of
the data table in order to achieve a considerable reduction of the data. The elements
of the loading matrices A, B, C are computed such as to minimize the sum of
squared residuals.
The PARAFAC model can also be defined by means of an extended matrix
notation:

X = A ·I· B    (31.91)

where I represents an r×r×r identity array (with ones on the superdiagonal and
zeroes in the off-diagonal positions). The decomposition is represented schematically
in Fig. 31.20. It can be seen from eqs. (31.89) and (31.91) that the PARAFAC
model is a special case of the Tucker3 model where Z = I and r = s = t.
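
The trilinear reconstruction of eq. (31.90), and its equivalence with a Tucker3 model having an identity core, can be verified with the following sketch (random loadings, shown only for the dimensions).

    import numpy as np

    n, p, q, r = 5, 4, 3, 2
    rng = np.random.default_rng(0)
    A = rng.normal(size=(n, r))
    B = rng.normal(size=(p, r))
    C = rng.normal(size=(q, r))

    # x_ijk = sum_f a_if b_jf c_kf   (eq. 31.90, without error term)
    X_hat = np.einsum('if,jf,kf->ijk', A, B, C)

    I3 = np.zeros((r, r, r))         # r x r x r identity array
    for f in range(r):
        I3[f, f, f] = 1.0
    X_hat2 = np.einsum('if,jg,kh,fgh->ijk', A, B, C, I3)   # Tucker3 with Z = I
    print(np.allclose(X_hat, X_hat2))                       # True
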

31.9 PCA and cluster analysis

Cluster analysis (which is covered extensively in Chapter 30) can be performed
on the factor scores of a data table using a reduced number of factors (Section
31.1.4) rather than on the data table itself. This way, one can apply cluster analysis
on the structural information only, while disregarding the noise or artefacts in the
data. The number of structural factors may be determined by means of internal


Fig. 31.20. PARAFAC decomposition of an n×p×q three-way table X. The core matrix I is an r×r×r
three-way identity matrix. A, B and C represent the n×r, p×r and q×r loading matrices of the row-,
column- and layer-items of X, respectively.

validation (Section 31.5). If only two or three structural factors are present in the
data, it can be helpful to provide the clustering tree in addition to the plot of factor
scores in order to better comprehend the latent structure. It is also possible to
represent the cluster structure graphically on the score plot, for example, by
grouping those points that cluster together at a certain level in the clustering tree
[59].
Although cluster analysis is based on the distances or similarities between a set
of objects, its graphic representation in the form of a clustering tree has no metric
property. The separation between any two leaves (i.e. terminal objects) in the tree
usually bears no relation to their actual distance as defined by the data table or on
the score plot with a reduced number of factors. In fact, given n objects one can
easily show that there are 2^(n-1) equivalent hierarchical clustering trees, which can be
obtained by reversals of the branches of the tree. In some special cases it is
possible, however, to derive a unique or canonical representation of the objects in a
clustering tree, by relating the order of appearance in the tree to their correspond-
ing positions on the score plot. This is relevant, for example, in so-called phylo-
genetic applications, where the distance or similarity between any two objects
reflects the time that elapsed after their evolutionary divergence. Such applications
arise in biological taxonomy, in the study of proteins, in the classification of DNA
sequences, etc. In these cases, one often finds that the objects (species, proteins,
DNA strings) in a score plot are arranged approximately on a circle around the
origin, at least when a two-factor solution is capable of reproducing the hereditary

or ancestral structure adequately. If the center of the score plot represents the
primeval ancestor, then one expects all living present-day species, proteins or
DNA strings to be more or less at an equal distance in evolutionary space. There
may be discrepancies, however, because biological clocks may tick at different
rates in different branches of the evolutionary tree. The positions of the objects are
defined unambiguously on the score plot, apart from reflections about the factor
axes. Hence, given a particular orientation of the axes, one may rearrange the
clustering tree such that the order of its leaves corresponds optimally with the order
of the objects in the score plot [60]. This then leads to a canonical representation of
the clustering tree which is based on a preliminary PCA of the data.

References

1. I.T. Joliffe, Principal Components Analysis. Springer, New York, 1986.


2. O.M. Kvalheim, Interpretation of direct latent-variable projection methods and their aims and
use in the analysis of multicomponent spectroscopic and chromatographic data. Chemom.
Intell. Lab. Syst., 4 (1988) 11-25.
3. J. Mandel, Use of the singular value decomposition in regression analysis. Am. Statistician, 36
(1982) 15-24.
4. C. Eckart and G. Young, The approximation of one matrix by another of lower rank.
Psychometrika, 1 (1936)211-218.
5. W. Van Borm, Source Apportionment of Atmospheric Particles by Electron Probe X-ray
Microanalysis and Receptor Models. Doctoral Thesis, University of Antwerp, 1989.
6. F.R. Gantmacher, The Theory of Matrices. Vols. 1 and 2. Chelsea Publ., New York, 1977.
7. G.H. Dunteman, Principal Components Analysis. Sage Publications, Newbury Park, CA, 1989.
8. L.L. Thurstone, Multiple-factor Analysis. A Development and Expansion of the Vectors of
Mind. Univ. Chicago Press, Chicago, 1947.
9. K. Pearson, On lines and planes of closest fit to systems of points in space. Phil. Mag., 2 (6th
Series) (1901), 559-572.
10. K.R. Gabriel, The biplot graphic display of matrices with applications to principal components
analysis. Biometrika, 58 (1971) 453-467.
11. B. Walczak, L. Morin-Allory, M. Chretien, M. Lafosse and M. Dreux, Factor analysis and ex-
periment design in high-performance liquid chromatography. III. Influence of mobile phase
modifications on the selectivity of chalcones on a diol stationary phase. Chemom. Intell. Lab.
Syst., 1 (1986)79-90.
12. H.F. Gollob, A statistical model which combines features of factor analytic and analysis of
variance techniques. Psychometrika, 33 (1968) 73-111.
13. P.J. Lewi, Spectral mapping, a technique for classifying biological activity profiles of chemical
compounds. Arzneim. Forsch. (Drug Res.), 26 (1976) 1295-1300.
14. J.B. Kasmierczak, Analyse logarithmique. Deux exemples d'application. Revue de Statistique
Appliquee, 33 (1985) 13-24.
15. L. A. Goodman, Some useful extensions of the usual correspondence analysis approach and the
usual log-linear models approach in the analysis of contingency tables. Int. Statistical Rev., 54
(1986)243-309.
16. Y. Escoufier and S. Junca, Least squares approximation of frequencies or their logarithms. Int.
Statistical Rev., 54 (1986) 279-283.

17. P.J. Lewi, Spectral map analysis. Analysis of contrasts, especially from log-ratios. Chemom.
Intell. Lab. Syst., 5 (1989) 105-116.
18. J.P. Benzecri, L'analyse des Donnees. Vol II. L'analyse des Correspondances. Dunod, Paris,
1973.
19. H. Wold, Soft modelling by latent variables: the non-linear iterative partial least squares
(NIPALS) algorithm. In: Perspectives in Probability and Statistics, J. Gani (Ed.). Academic
Press, London, 1975, pp. 117-142.
20. G.H. Golub and C. Reinsch, Singular value decomposition and least squares solutions. Numer.
Math., 14(1970)403-420.
21. H. Hotelling, Analysis of a complex of statistical variables into principal components. J. Educ.
Psychol., 24 (1933) 417-441.
22. C.G.J. Jacobi, Ueber ein leichtes Verfahren die in der Theorie der Saekularstoerungen vor-
kommender Gleichungen numerisch aufzuloesen. J. Reine Angew. Math., 30 (1846) 51-95.
23. B.N. Parlett, The Symmetric Eigenvalue Problem. Prentice-Hall, Englewood Cliffs, NJ, 1980.
24. K.E. Iverson, Notation as a tool of thought (1979 Turing award lecture). CACM, 23 (1980)
444-465.
25. MATLAB, Engineering Computation Software, The Math Works Inc., Natick, 1992.
26. SAS/IML (Interactive Matrix Language) Software: User and Reference Manual. Version 6, 1st
Ed., SAS Institute Inc., Cary, NC, 1990.
27. J.J. Dongarra, C.B. Moler, J.R. Bunch and G.W. Stewart, LINPACK. SIAM, Philadelphia, 1979.
28. B.T. Smith, Matrix Eigensystem Routines — EISPACK Guide. 2nd edn., Lecture Notes in
Computer Science, Vol. 6. Springer, New York, 1976.
29. W.H. Press, B.P. Flannery, S.A. Teukolsky and W.T. Vetterling, Numerical Recipes. The Art
of Scientific Computing. Cambridge Univ. Press, Cambridge, UK, 1986, pp. 335-363.
30. R. Fisher and W. A. MacKenzie, Studies in crop variation II. The manurial response of different
potato varieties, J. Agricult. Sci., 13 (1923) 311-320.
31. P.J. Lewi, B. Vekemans and L.M. Gypen, Partial least squares (PLS) for the prediction of
real-life performance from laboratory results. In: Scientific Computing and Automation (Eu-
rope) 1990. E.J. Karjalainen (Ed.). Elsevier, Amsterdam, 1990, pp. 199-210.
32. A.S. Householder, The Theory of Matrices in Numerical Analysis. Blaisdell, New York, 1964.
33. J.G.F. Francis, The QR transformation. Parts I and II. Computer J., 6 (1958) 378-392.
34. A. Ralston and H.S. Wilf, Mathematical Methods for Digital Computers. Wiley, New York,
1960.
35. H. Rutishauser, Solution of eigenvalue problems with the LR-transformation. Appl. Math. Ser.
Nat. Bur. Stand., 49 (1958) 47-81.
36. J.H. Wilkinson, Global convergence of tridiagonal QR algorithm with origin shift. Algorithms
Applic, 1 (1968) 409-420.
37. W.W. Cooley and P.P. Lohnes, Multivariate Procedures for the Behavioral Sciences. Wiley,
New York, 1962.
38. A. Leysen, Systematic study of algorithms for eigenvector extraction on PC. Thesis, Kath. Ind.
Hogeschool der Kempen, Geel, Belgium, 1989.
39. S. Rannar, F. Lindgren, P. Geladi and S. Wold, A PLS kernel algorithm for data sets with many
variables and fewer objects. Part I: theory and algorithm. J. Chemom., 8 (1994) 11-125.
40. J.M. Deane, Data reduction using principal components analysis. In: Multivariate Pattern Rec-
ognition in Chemometrics, R. Brereton (Ed.). Chapter 5, Elsevier, Amsterdam, 1992, pp.
125-165.
41. R.G. Brereton, Chemometrics, Applications of Mathematics and Statistics to Laboratory Sys-
tems. Ellis Horwood, New York, 1990, p. 221.

42. R.B. Cattell, The Scree test for the number of factors. Multivariate Behavioral Res., 1 (1966)
245-276.
43. E.R. Malinowski, Statistical F-tests for abstract factor analysis and target testing. J. Chemom.,
3(1988)49-60.
44. S. Wold, Cross-validatory estimation of the number of components in factor and principal
components models. Technometrics, 20 (1978) 397-405.
45. M.A. Sharaf, D.L. Illman and B.R. Kowalski, Chemometrics. Wiley, New York, 1986, p. 255.
46. J.C. Gower, A general coefficient of similarity and some of its properties. Biometrics, 27
(1971)857-874.
47. I.E. Frank and R. Todeschini, The Data Analysis Handbook. Elsevier, Amsterdam, 1994.
48. G. Young and A.S. Householder, Discussion of a set of points in terms of their mutual dis-
tances. Psychometrika, 3 (1938) 19-22.
49. J.C. Gower, Some distance properties of latent root and vector methods used in multivariate
analysis. Biometrika, 53 (1966) 325-328.
50. G.E. Hinton, How neural networks learn from experience. Sci. Am., Sept. (1992) 145-151.
51. A. Gifi, Non-linear Multivariate Analysis. Wiley, Chichester, UK, 1990.
52. J.P. van de Geer, Multivariate Analysis of Categorical Data: Applications. Sage Publications,
Newbury Park, CA, 1993.
53. J.C. Gower and S.A. Harding, Non-linear biplots. Biometrika, 75 (1988) 445-455.
54. J.C. Gower, Adding a point to vector diagrams in multivariate analysis. Biometrika, 55 (1968)
582-585.
55. T. Kourti and J.F. MacGregor, Process analysis, monitoring and diagnosis, using multivariate
projection methods. Chemom. Intell. Lab. Syst., 28 (1995) 3-21.
56. P. Geladi, Analysis of multi-way (multi-mode) data. Chemom. Intell. Lab. Syst., 7 (1989)
11-30.
57. A.K. Smilde, Three-way analysis. Problems and prospects. Chemom. Intell. Lab. Syst., 15
(1992) 143-157.
58. R. Henrion, N-way principal component analysis. Theory, algorithms and applications.
Chemom. Intell. Lab. Syst., 25 (1994) 1-23.
59. P.N. Nyambi, J. Nkengasong, P. Lewi, K. Andries, W. Janssens, K. Fransen, L. Heyndrickx, P.
Piot and G. van der Groen, Multivariate analysis of human immunodeficiency virus type 1 neu-
tralization data. J. Virol., 70 (1996) 6235-6243.
60. P.J. Lewi, H. Moereels and D. Adriaensen, Combination of dendrograms with plots of latent
variables, Application of G-protein coupled receptor sequences. Chemom. Intell. Lab. Syst., 16
(1992) 145-154.

Chapter 32

Analysis of Contingency Tables

32.1 Contingency table

Table 32.1 describes 30 persons who have been observed to use one of four
available therapeutic compounds for the treatment of one of three possible disorders.
The four compounds in this measurement table are the benzodiazepine tranquil-
lizers Clonazepam (C), Diazepam (D), Lorazepam (L) and Triazolam (T). The
three disorders are anxiety (A), epilepsy (E) and sleep disturbance (S). In this
example, both measurements (compounds and disorders) are defined on nominal
scales. Measurements can also be defined on ordinal scales, or on interval and ratio
scales in which case they need to be subdivided in discrete and non-overlapping
categories.
Each column of a measurement table can be expanded into an indicator table.
The rows of an indicator table refer to the same objects and in the same order as in
the measurement table. The columns of the indicator table represent non-over-
lapping categories of the selected measurement. Table 32.1 has been expanded
into the indicator Table 32.2 for compounds and into the indicator Table 32.3 for
disorders. In the indicator table for compounds, a value of one in a particular row is
recorded if a person has used the corresponding compound. In the indicator table
for disorders, one in a particular row indicates that the person has been treated for
the corresponding disorder. All other elements of the row are set to zero. Note that
the order of the columns in the indicator tables is not relevant.
A contingency table X can be constructed by means of the matrix product of two
indicator tables:
X = I^T J    (32.1)
where I is the indicator table for the first measurement and J is the indicator matrix
for the second measurement. The rows of I and J must refer to the same objects and
in the same order. The dimensions of a contingency table are defined by the
number of categories of the two measurements.
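
The construction of eq. (32.1) is easily verified in Python/NumPy; for brevity only the first eight records of Table 32.1 are used here.

    import numpy as np

    compounds = ['C', 'D', 'L', 'T']
    disorders = ['A', 'E', 'S']

    # First eight records (compound, disorder) of Table 32.1.
    records = [('L', 'A'), ('D', 'E'), ('C', 'S'), ('T', 'S'),
               ('L', 'A'), ('D', 'A'), ('C', 'E'), ('C', 'S')]

    I = np.array([[1 if c == cat else 0 for cat in compounds] for c, _ in records])
    J = np.array([[1 if d == cat else 0 for cat in disorders] for _, d in records])

    X = I.T @ J                 # eq. (32.1): rows = compounds, columns = disorders
    print(X)
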
Each element x_ij in an n×p contingency table X represents the number of objects
that can be associated simultaneously with category i of the first measurement and
with category j of the second measurement. The element x_ij can be interpreted as
162

TABLE 32.1
Measurement table describing the use of four compounds (C, D, L, T) for the treatment of three disorders
(A, E, S) by 30 persons. For explanation of symbols, see text.

# Compound Disorder

1 L A
2 D E
3 C S
4 T S
5 L A
6 D A
7 C E
8 C S
9 D A
10 C E
11 D S
12 L A
13 D S
14 C E
15 C S
16 D A
17 L S
18 D E
19 L S
20 L A
21 C E
22 T S
23 D A
24 T S
25 D E
26 T A
27 T S
28 C E
29 D E
30 D A

TABLE 32.2
Indicator table for compounds, from Table 32.1

# Compound

C D L T

1 0 0 1 0
2 0 1 0 0
3 1 0 0 0
4 0 0 0 1
5 0 0 1 0
6 0 1 0 0
7 1 0 0 0
8 1 0 0 0
9 0 1 0 0
10 1 0 0 0
11 0 1 0 0
12 0 0 1 0
13 0 1 0 0
14 1 0 0 0
15 1 0 0 0
16 0 1 0 0
17 0 0 1 0
18 0 1 0 0
19 0 0 1 0
20 0 0 1 0
21 1 0 0 0
22 0 0 0 1
23 0 1 0 0
24 0 0 0 1
25 0 1 0 0
26 0 0 0 1
27 0 0 0 1
28 1 0 0 0
29 0 1 0 0
30 0 1 0 0

All 8 11 6 5

TABLE 32.3
Indicator table for disorders, from Table 32.1

# Disorder

A E S

1 1 0 0
2 0 1 0
3 0 0 1
4 0 0 1
5 1 0 0
6 1 0 0
7 0 1 0
8 0 0 1
9 1 0 0
10 0 1 0
11 0 0 1
12 1 0 0
13 0 0 1
14 0 1 0
15 0 0 1
16 1 0 0
17 0 0 1
18 0 1 0
19 0 0 1
20 1 0 0
21 0 1 0
22 0 0 1
23 1 0 0
24 0 0 1
25 0 1 0
26 1 0 0
27 0 0 1
28 0 1 0
29 0 1 0
30 1 0 0
All 10 9 11

TABLE 32.4
Contingency table X derived from the indicator Tables 32.2 and 32.3

Anxiety (A) Epilepsy (E) Sleep (S) Sum


Clonazepam (C) 0 5 3 8
Diazepam (D) 5 4 2 11
Lorazepam (L) 4 0 2 6
Triazolam (T) 1 0 4 5

Sum 10 9 11 30

the number of occurrences of i contingent (or conditional) upon j, or, equivalently,
as the number of instances of j contingent upon i. The 4×3 contingency Table 32.4
is the result of multiplying the transpose of the 30×4 indicator Table 32.2 for
compounds with the 30×3 indicator Table 32.3 for disorders.
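As a quick check of eq. (32.1), the matrix product of the two indicator tables can be formed directly in NumPy. The sketch below is an added illustration (not part of the original text) and rebuilds Table 32.4 from the data of Table 32.1.

```python
import numpy as np

# Compound and disorder used by each of the 30 persons (Table 32.1).
compounds = list("LDCTLDCCDCDLDCCDLDLLCTDTDTTCDD")
disorders = list("AESSAAESAESASESASESAESASEASEEA")

I = np.array([[c == k for k in "CDLT"] for c in compounds], dtype=int)  # 30x4, Table 32.2
J = np.array([[d == k for k in "AES"] for d in disorders], dtype=int)   # 30x3, Table 32.3

X = I.T @ J        # eq. (32.1): X = I^T J, the 4x3 contingency table of Table 32.4
print(X)
print(X.sum(axis=1), X.sum(axis=0), X.sum())   # marginal and global sums
```

The row sums (8, 11, 6, 5), column sums (10, 9, 11) and global sum (30) agree with the margins of Table 32.4.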
In this chapter we deal explicitly with contingency tables that result from
combining two measurements and which are generally referred to as two-way
contingency tables. From this point of view, this chapter can be regarded as a generalization
of Chapter 16 to the case of more than two categories for each of the two
measurements. It is possible to cross multiple measurements, which then results in
a multi-way contingency table. For example, when three measurements with
respectively n, p and q categories are crossed with one another, this results in an
nxpxq three-way contingency table. Recently there has been a growing interest in
the analysis of this type of data structure. In this chapter, however, we only deal
with two-way contingency tables.
Summation row-wise (horizontally) of the elements of a contingency table
produces the vector of row-sums with elements x_{i+}. Summation column-wise
(vertically) yields the vector of column-sums with elements x_{+j}. The global sum is
denoted by x_{++}. These marginal sums are defined as follows:

x_{i+} = \sum_{j=1}^{p} x_{ij}    with i = 1, ..., n     (32.2)

x_{+j} = \sum_{i=1}^{n} x_{ij}    with j = 1, ..., p

x_{++} = \sum_{i=1}^{n} \sum_{j=1}^{p} x_{ij}

32.2 Chi-square statistic

Each element x_{ij} of a contingency table X can be thought of as a random variate.
Under the assumption that all marginal sums are fixed, we can derive the expected
values E(x_{ij}) for each of the random variates x_{ij} [1]:

E(x_{ij}) = \frac{x_{i+} x_{+j}}{x_{++}}    with i = 1, ..., n and j = 1, ..., p     (32.3)

Table 32.5 presents the expected values of the elements in the contingency
Table 32.4. Note that the marginal sums in the two tables are the same. There are,
however, large discrepancies between the observed and the expected values. Small
discrepancies between the tabulated values of our illustrations and their exact
values may arise due to rounding of intermediate results.
For example, the observed value for Clonazepam in anxiety has been recorded
as 0 in Table 32.4. The corresponding expected value in Table 32.5 has been
computed from eq. (32.3) as follows:
E(x_{11}) = \frac{x_{1+} x_{+1}}{x_{++}} = \frac{8 \times 10}{30} = 2.67

where x_{11} is the element in the first row and first column of the table, where x_{1+} and x_{+1}
are the corresponding marginal sums, and where x_{++} is the global sum.
The generalization of Pearson's chi-square statistic χ² for 2×2 contingency
tables, which has been discussed in Section 16.2.3, can be written as:

\chi^2 = \sum_{i=1}^{n} \sum_{j=1}^{p} \frac{\left( x_{ij} - E(x_{ij}) \right)^2}{E(x_{ij})}
       = \sum_{i=1}^{n} \sum_{j=1}^{p} \frac{\left( x_{ij} - \frac{x_{i+} x_{+j}}{x_{++}} \right)^2}{\frac{x_{i+} x_{+j}}{x_{++}}}     (32.4)

with (n - 1)(p - 1) degrees of freedom.


TABLE 32.5
Table of expected values E(X) computed from Table 32.4

Anxiety Epilepsy Sleep Sum


Clonazepam 2.67 2.40 2.93 8
Diazepam 3.67 3.30 4.03 11
Lorazepam 2.00 1.80 2.20 6
Triazolam 1.67 1.50 1.83 5

Sum 10 9 11 30

The chi-square statistic is a measure for the global deviation of the observed
values in a contingency table from their expected values. The number of degrees of
freedom of the chi-square statistic defines the number of elements of the table that
can be varied independently of each other. Since n row-sums and p column-sums
are held fixed and since both add up to the same global sum, the number of degrees
of freedom of an n×p contingency table is np minus (n + p - 1), or (n - 1)(p - 1). The
chi-square statistic can be tested for significance using tabulated critical values of
the statistic for a fixed level of significance (e.g. 0.05, 0.01 or 0.001) and for the
specified degrees of freedom.
In the case of the 4x3 contingency Table 32.4 we obtain a chi-square value of
15.3 with 6 degrees of freedom, which is significant at the 0.05 level of probability
as it exceeds the critical value of 12.6. The probability corresponding with a
chi-square value exceeding 15.3 equals 0.02. From this result we may be led to
conclude that there are significant differences between the prescription patterns of
the four compounds, as well as between the treatment patterns of the three
disorders. One should be cautious, however, in drawing such a conclusion,
because of three considerations. First, the statistical evidence of this positive
outcome is not overwhelming, as it leaves a two percent chance of being false.
Secondly, the chi-square test is not quite appropriate in this case, since all but two
cell frequencies are smaller than five and three of them are even zero. Finally, a
statistically significant result does not necessarily mean that the differences between
compounds and disorders are of practical relevance. These considerations should
not preclude, however, an exploratory analysis of the data in order to search for
meaningful correlations and trends that might be the object of more detailed
studies in the future. In the case of a negative outcome of the chi-square test, one
might have assumed that there are no differences between the drugs and between
the disorders. This conclusion also might be erroneous in view of the lack of power
that is inherent to a small sample (30 persons in this example). In this case, an
exploratory analysis may still yield relevant associations and tendencies which can
guide the design of future studies.
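The numbers quoted in this section are easily reproduced. The following sketch is an added illustration; it assumes NumPy and SciPy are available (SciPy only for the tail probability of the chi-square distribution).

```python
import numpy as np
from scipy.stats import chi2

X = np.array([[0, 5, 3],    # Clonazepam
              [5, 4, 2],    # Diazepam
              [4, 0, 2],    # Lorazepam
              [1, 0, 4]])   # Triazolam; columns: Anxiety, Epilepsy, Sleep

row = X.sum(axis=1, keepdims=True)    # x_i+
col = X.sum(axis=0, keepdims=True)    # x_+j
tot = X.sum()                         # x_++

E = row * col / tot                   # expected values, eq. (32.3) and Table 32.5
chi2_stat = ((X - E) ** 2 / E).sum()  # eq. (32.4), about 15.3
dof = (X.shape[0] - 1) * (X.shape[1] - 1)   # 6 degrees of freedom
p_value = chi2.sf(chi2_stat, dof)           # about 0.02

print(np.round(E, 2))
print(round(chi2_stat, 1), dof, round(p_value, 2))
```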

32.3 Closure

In the literature we encounter three common transformations of the contingency
table. These can be classified according to the type of closure that is involved. By
closure we mean the operation of dividing each element in a row or column of a
table by its corresponding marginal sum. We reserve the word closure for the
specific operation where the elements in a row or column of the table are reduced
to unit sum. This way, we distinguish between closure and normalization, as the
latter implies an operation which reduces the elements of a table to unit sums of
squares. In a strict sense, closure applies only to tables with non-negative elements.

Fig. 32.1. Row-profiles representing the data in the rows of Table 32.4.

32.3.1 Row-closure

Comparison between rows of a contingency table X is made easier after
dividing each element of the table by its corresponding row-sum. This operation is
called row-closure as it forces all rows of the table to possess the same unit sum.
After closure, the rows of the table are called row-profiles. These can be represented
in the form of stacked histograms such as shown in Fig. 32.1.
The average or expected row-profile is obtained by dividing the marginal row in
the original table by the global sum. The matrix F of deviations of row-closed
profiles from their expected values is defined by:

f_{ij} = \frac{x_{ij}}{x_{i+}} - \frac{x_{+j}}{x_{++}}    with i = 1, ..., n and j = 1, ..., p     (32.5)

under the assumption of fixed marginal sums.

32.3.2 Column-closure

Comparison between columns of a contingency table X is also made easier after
dividing each element of the table by its corresponding column-sum. This operation
is referred to as column-closure as it makes all columns of the table possess the
same unit sum. After closure, the columns of the table are called column-profiles.
These can also be visualized in the form of stacked histograms as is shown in Fig.
32.2.

Fig. 32.2. Column-profiles representing the data in the columns of Table 32.4.

The average or expected column-profile results from dividing the marginal
column by the global sum. The matrix G of deviations of column-profiles from
their expected values is given by:

g_{ij} = \frac{x_{ij}}{x_{+j}} - \frac{x_{i+}}{x_{++}}    with i = 1, ..., n and j = 1, ..., p     (32.6)

assuming fixed marginal sums.

32.3.3 Double-closure

Double-closure is the joint operation of dividing each element of the contingency
table X by the product of its corresponding row- and column-sums. The result is
multiplied by the grand sum in order to obtain a dimensionless quantity. In this
context the term dimensionless indicates a certain symmetry in the notation. If x
were to have a physical dimension, then the expressions involving x would appear
as dimensionless. In our case, x represents counts and, strictly speaking, is di-
mensionless itself. Subsequently, the result is transformed into a matrix Z of
deviations of double-closed data from their expected values:

z_{ij} = \frac{x_{ij}}{E(x_{ij})} - 1 = \frac{x_{ij} x_{++}}{x_{i+} x_{+j}} - 1     (32.7)

assuming fixed marginal sums.



TABLE 32.6
Table of double-closed data Z, with weighted means, weighted sums of squares and weight coefficients
added in the margins, from Table 32.4

Z                Anxiety    Epilepsy    Sleep      m_i^w     c_i^w     w_i

Clonazepam       -1.000      1.083       0.023     0.000     0.686     0.267
Diazepam          0.364      0.212      -0.504     0.000     0.151     0.367
Lorazepam         1.000     -1.000      -0.091     0.000     0.636     0.200
Triazolam        -0.400     -1.000       1.182     0.000     0.865     0.167

m_j^w             0.000      0.000       0.000     0.000
c_j^w             0.542      0.696       0.328               0.510
w_j               0.333      0.300       0.367                         1.000

The effect of these operations is shown in Table 32.6 using the data of our
example in Table 32.4. For example, in the case of Clonazepam in anxiety, the
observed value x_{11} has been shown to be equal to 0 in Table 32.4, and the
corresponding expected value has been derived to be 2.67 in Table 32.5. The
corresponding deviation of the double-closed value from its expected value is then
computed as:

z_{11} = \frac{x_{11}}{E(x_{11})} - 1 = \frac{0}{2.67} - 1 = -1.000

It is important to realize that closure may reduce the rank of the data matrix by
one. This is the case with row-closure when n>p, and with column-closure when
n < p. It is always the case with double-closure. This reduction of the rank by one is
the result of a linear dependence between the rows or columns of the table that
results from closure of the data matrix.
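The three transformations can be written in a few lines of array arithmetic. The sketch below is an added illustration; it reproduces the double-closed values of Table 32.6 and the rank reduction mentioned above.

```python
import numpy as np

X = np.array([[0, 5, 3], [5, 4, 2], [4, 0, 2], [1, 0, 4]], dtype=float)
r = X.sum(axis=1, keepdims=True)    # x_i+
c = X.sum(axis=0, keepdims=True)    # x_+j
t = X.sum()                         # x_++

F = X / r - c / t                   # row-closure, eq. (32.5)
G = X / c - r / t                   # column-closure, eq. (32.6)
Z = X * t / (r * c) - 1             # double-closure, eq. (32.7)

print(np.round(Z, 3))               # body of Table 32.6, e.g. z_11 = -1.000
print(np.linalg.matrix_rank(Z))     # 2: closure lowers the rank of the 4x3 table by one
```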

32.4 Weighted metric

A weighted Euclidean metric is defined by the weighted scalar product:

x^T W y = \sum_{i=1}^{n} w_i x_i y_i    with \sum_{i=1}^{n} w_i = 1     (32.8)

where x and y are vectors with the same dimension n and where W is a diagonal
matrix of dimension n:

W = diag(w)

in which the elements of the weight vector w are positive constants which add to
unity. In what follows we will omit the predicate 'Euclidean' which is intended
implicitly throughout this chapter.
A weighted metric is uniquely defined by the metric matrix W. The usual metric
is but a special case of the weighted metric, which is obtained when W equals the
identity matrix I (divided by the dimension n). In the usual metric we assign equal
weights to all the axes of coordinate space. This is the metric that is used in daily
life for measuring distances, lengths, widths, heights, etc. Weighted metrics play
an important role in multivariate data analysis when different weights are assigned
to objects and measurements according to their size or importance. This is the case
in correspondence factor analysis which will be discussed in Section 32.6.
It has been shown in Chapter 29 that the set of vectors of the same dimension
defines a multidimensional space S in which the vectors can be represented as
points (or as directed line segments). If this space is equipped with a weighted
metric defined by W, it will be denoted by the symbol S_w. The squared weighted
distance between two points representing the vectors x and y in S_w is defined by the
weighted scalar product:

\|x - y\|_w^2 = (x - y)^T W (x - y) = \sum_{i=1}^{n} w_i (x_i - y_i)^2     (32.9)

where \|x - y\|_w refers to the weighted norm or length of the vector x - y in the
weighted metric defined by W.
Distances in S_w are different from those in the usual space S. A weighted space
S_w can be represented graphically by means of stretched coordinate axes [2]. The
latter result when the basis vectors of the space are scaled by means of the
corresponding quantities in \sqrt{w}, where the vector w contains the main diagonal
elements of W. Figure 32.3 shows that a circle is deformed into an ellipse if one
passes from the usual coordinate axes in the usual metric I to stretched coordinate axes
in the weighted metric W. In this example, the horizontal axis in S_w is stretched by
a factor \sqrt{1.6} = 1.26 and the vertical axis is shrunk by a factor \sqrt{0.4} = 0.63.
A similar effect in S_w is observed on angular distances (or angles), which are
also different from those in the unweighted space S:

\cos \vartheta_w = \frac{x^T W y}{\|x\|_w \, \|y\|_w}     (32.10)

with

\|x\|_w^2 = x^T W x   and   \|y\|_w^2 = y^T W y     (32.11)

where \vartheta_w is the angular distance and where \|x\|_w and \|y\|_w are the weighted norms or
[Fig. 32.3 appears here: panel (a) uses the metric matrix W = (1/2) diag(1, 1); panel (b) uses W = (1/2) diag(1.6, 0.4).]

Fig. 32.3. Effect of a weighted metric on distances. (a) Representation of a circle in the space S defined
by the usual Euclidean metric. (b) Representation of the same circle in the space S_w defined by a
weighted Euclidean metric. The metric is defined by the metric matrix W.

lengths of the vectors x and y as measured from the origin of S_w. Two vectors x and
y are said to be orthogonal in the weighted metric defined by W if and only if:

x^T W y = 0     (32.12)

Generally, two vectors that are orthogonal in S will be oblique in S_w, unless the
vectors are parallel to the coordinate axes. This is illustrated in Fig. 32.4. Further-
more, if x and y are orthogonal vectors in S, then the vectors W^{-1/2} x and W^{-1/2} y are
orthogonal in S_w. This follows from the definition of orthogonality in the metric W
(eq. (32.12)):

x^T W^{-1/2} W W^{-1/2} y = x^T y = 0

In Section 29.3 it has been shown that a matrix generates two dual spaces: a
row-space S^n in which the p columns of the matrix are represented as a pattern P^p,
and a column-space S^p in which the n rows are represented as a pattern P^n. Separate
weighted metrics for row-space and column-space can be defined by the corre-
sponding metric matrices W_n and W_p. This results in the complementary
weighted spaces S_w^n and S_w^p, each of which can be represented by stretched
coordinate axes using the stretching factors in \sqrt{w_n} and \sqrt{w_p}, where the vectors
w_n and w_p contain the main diagonal elements of W_n and W_p.

Fig. 32.4. Effect of a weighted metric on angular distances. (a) Representation of two line segments that
are perpendicular in the space S defined by the usual Euclidean metric. (b) Representation of the same
two line segments in the space S_w defined by a weighted Euclidean metric. The metric matrix W is the
same as in Fig. 32.3b.

Weight coefficients can be used to define weighted measures of location and
spread. In particular, we define the weighted mean m^w of a vector x by means of:

m^w = x^T w = x^T W 1 = \sum_{i=1}^{n} w_i x_i     (32.13)

and the weighted variance c^w by means of:

c^w = (x - m^w 1)^T W (x - m^w 1) = (x^T - m^w 1^T)^2 W 1 = \sum_{i=1}^{n} w_i (x_i - m^w)^2     (32.14)
where 1 represents a sum vector with n elements equal to one, and where the
superscript w indicates that the measures are taken in a weighted metric. The above
expression of the weighted variance is a special case of a weighted sum of squares,
which arises when the elements of the vector have been previously reduced by
their mean value. Note that the weight coefficients are normalized to unit sum. In
this notation, x^2 results in squaring each element of x. The expression x^2 W 1
produces the weighted sum of elements in x.
The usual expressions for the mean and the variance result when the weight
coefficients are constant and defined as:

w_i = \frac{1}{n}   and   w_j = \frac{1}{p}    with i = 1, ..., n and j = 1, ..., p

In the case of an n×p matrix X we can define two sets of weighted measures of
location and spread, one for each of the two dual spaces S_w^n and S_w^p, in terms of
weighted means and weighted variances:

m_n^w = X W_p 1_p = X w_p    or    m_i^w = \sum_{j=1}^{p} w_j x_{ij}
                                                                               (32.15)
m_p^w = X^T W_n 1_n = X^T w_n    or    m_j^w = \sum_{i=1}^{n} w_i x_{ij}

c_n^w = (X - m_n^w 1_p^T)^2 W_p 1_p    or    c_i^w = \sum_{j=1}^{p} w_j (x_{ij} - m_i^w)^2
                                                                               (32.16)
c_p^w = (X^T - m_p^w 1_n^T)^2 W_n 1_n    or    c_j^w = \sum_{i=1}^{n} w_i (x_{ij} - m_j^w)^2

where 1_n and 1_p represent sum vectors with n and p elements equal to one, resp-
ectively. In this notation, X^2 results in squaring each element of X. The expression
X W_p 1_p produces a weighted summation over the columns of X, and X^T W_n 1_n
performs a weighted summation over the rows of X.
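As an added illustration of eqs. (32.8)-(32.16), the weighted measures can be written as plain weighted sums; the vectors, matrix and weights below are arbitrary choices made only to keep the sketch self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=5), rng.normal(size=5)
w = np.array([0.1, 0.3, 0.2, 0.25, 0.15])      # weight coefficients adding up to one
W = np.diag(w)                                 # metric matrix W = diag(w)

m_w = w @ x                                    # weighted mean, eq. (32.13)
c_w = w @ (x - m_w) ** 2                       # weighted variance, eq. (32.14)
dist_w = np.sqrt(w @ (x - y) ** 2)             # weighted distance, eq. (32.9)
cos_w = (x @ W @ y) / np.sqrt((x @ W @ x) * (y @ W @ y))   # eqs. (32.10)-(32.11)

# Weighted means and variances of the rows and columns of a matrix, eqs. (32.15)-(32.16)
X = rng.normal(size=(4, 3))
w_n, w_p = np.full(4, 1 / 4), np.full(3, 1 / 3)
m_rows, m_cols = X @ w_p, X.T @ w_n
c_rows = (X - m_rows[:, None]) ** 2 @ w_p
c_cols = ((X - m_cols[None, :]) ** 2).T @ w_n
```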
One may assign greater weight (or mass) to objects and measurements that
should have a greater influence in the analysis than the others. In regression
problems one may define weights to be inversely proportional to the variance of
the objects, such as to lend more influence to those that have been measured with
greater precision. In correspondence factor analysis (CFA) the weights are defined
from the marginal sums of the table, as will be shown in Section 32.6. The result of
an exploratory analysis depends to some extent on the choice of the weight
coefficients in w_n and w_p. Sometimes the effect is only slight, for example, when
the observed data in a contingency table are close to their expected values. In other
cases, when the discrepancies between observed and expected values are large, the
effect of variable weighting can be prominent. As a general rule, one will select the
weights according to the nature of the data and the objective of the analysis.

32.5 Distance of chi-square

The global distance of chi-square δ of a contingency table X is derived from the
chi-square statistic as follows:

\delta^2 = \frac{\chi^2}{x_{++}} = \sum_{i=1}^{n} \sum_{j=1}^{p} \frac{\left( x_{ij} - \frac{x_{i+} x_{+j}}{x_{++}} \right)^2}{x_{i+} x_{+j}}     (32.17)

which is a dimensionless quantity. We refer to the quantity δ² as the global inter-
action between the rows and columns of the contingency table. The concept of
distance of chi-square is important in the multivariate analysis of contingency
tables. As we will see in Section 32.6 on correspondence factor analysis (CFA),
results are often converted into a metric diagram in which distances between
representations of rows and distances between representations of columns are
related to chi-square values. Later in this section, we will show that the global
distance of chi-square is equal to the global weighted sum of squares of the
transformed contingency table.
There are three ways by which the global distance of chi-square can be
meaningfully rewritten as a weighted sum. These correspond with the three
different ways of closing the data in the original contingency table X, such as has
been described above in Section 32.3. The metric matrices W_n and W_p have to be
defined differently for each of the three cases.
In the case of the contingency Table 32.4 we obtained a chi-square of 15.3.
Taking into account that the global sum equals 30, this produces a global inter-
action of 15.3/30 = 0.510. The square root of this value is the global distance of
chi-square which is equal to 0.714.
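These two numbers can be verified with a short added sketch:

```python
import numpy as np

X = np.array([[0, 5, 3], [5, 4, 2], [4, 0, 2], [1, 0, 4]], dtype=float)
r, c, t = X.sum(1, keepdims=True), X.sum(0, keepdims=True), X.sum()

E = r * c / t
delta2 = ((X - E) ** 2 / E).sum() / t     # global interaction, eq. (32.17): about 0.510
print(round(delta2, 3), round(float(np.sqrt(delta2)), 3))   # 0.51 and 0.714
```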

32.5.1 Row-closure

The terms in the expression of the global distance of chi-square δ can be
rearranged into:

\delta^2 = \sum_{i=1}^{n} \frac{x_{i+}}{x_{++}} \sum_{j=1}^{p} \frac{x_{++}}{x_{+j}} f_{ij}^2 = \sum_{i=1}^{n} w_i \sum_{j=1}^{p} w_j f_{ij}^2     (32.18)

with

w_i = \frac{x_{i+}}{x_{++}}   and   w_j = \frac{x_{++}}{x_{+j}}

where F is the matrix of deviations of row-closed profiles, cf. eq. (32.5):

f_{ij} = \frac{x_{ij}}{x_{i+}} - \frac{x_{+j}}{x_{++}}    with i = 1, ..., n and j = 1, ..., p

In S_w^p the row-profiles are centred about the origin, as can be shown by working
out the expression for the weighted column-means m_j^w:

m_j^w = \sum_{i=1}^{n} w_i f_{ij} = \sum_{i=1}^{n} \frac{x_{i+}}{x_{++}} \left( \frac{x_{ij}}{x_{i+}} - \frac{x_{+j}}{x_{++}} \right) = 0    with j = 1, ..., p

Weighted sums of squares c_i^w of the row-profiles in F around the origin of S_w^p
can be expressed as distances of chi-square:

c_i^w = \sum_{j=1}^{p} w_j f_{ij}^2 = \sum_{j=1}^{p} \frac{\left( \frac{x_{ij}}{x_{i+}} - \frac{x_{+j}}{x_{++}} \right)^2}{\frac{x_{+j}}{x_{++}}} = \delta_i^2    with i = 1, ..., n     (32.19)

The distance δ_i is a measure of the difference of the ith row-closed profile with
respect to the average or expected row-profile.
The distance δ_{ii'} between two row-profiles i and i' can also be expressed
formally as a distance of chi-square:

\sum_{j=1}^{p} w_j (f_{ij} - f_{i'j})^2 = \sum_{j=1}^{p} \frac{\left( \frac{x_{ij}}{x_{i+}} - \frac{x_{i'j}}{x_{i'+}} \right)^2}{\frac{x_{+j}}{x_{++}}} = \delta_{ii'}^2    with i, i' = 1, ..., n     (32.20)

32.5.2 Column-closure

Another rearrangement of the terms in the expression of the global distance of
chi-square δ leads to:

\delta^2 = \sum_{j=1}^{p} \frac{x_{+j}}{x_{++}} \sum_{i=1}^{n} \frac{x_{++}}{x_{i+}} g_{ij}^2 = \sum_{j=1}^{p} w_j \sum_{i=1}^{n} w_i g_{ij}^2     (32.21)

with

w_i = \frac{x_{++}}{x_{i+}}   and   w_j = \frac{x_{+j}}{x_{++}}

where G stands for the matrix of deviations of column-closed profiles, cf. eq.
(32.6):

g_{ij} = \frac{x_{ij}}{x_{+j}} - \frac{x_{i+}}{x_{++}}    with i = 1, ..., n and j = 1, ..., p

In this case we find that the pattern of column-profiles in S_w^n is centred about the
origin, as can be seen by computing the weighted row-means m_i^w:

m_i^w = \sum_{j=1}^{p} w_j g_{ij} = \sum_{j=1}^{p} \frac{x_{+j}}{x_{++}} \left( \frac{x_{ij}}{x_{+j}} - \frac{x_{i+}}{x_{++}} \right) = 0    with i = 1, ..., n

The weighted sums of squares c_j^w of the points around the origin of S_w^n can be
expressed as distances of chi-square:

c_j^w = \sum_{i=1}^{n} w_i g_{ij}^2 = \sum_{i=1}^{n} \frac{\left( \frac{x_{ij}}{x_{+j}} - \frac{x_{i+}}{x_{++}} \right)^2}{\frac{x_{i+}}{x_{++}}} = \delta_j^2    with j = 1, ..., p     (32.22)

The distance δ_j is a measure of the difference of the jth column-profile from the
average or expected column-profile.
The distance δ_{jj'} between two column-closed profiles j and j' can also be defined
as a distance of chi-square:

\sum_{i=1}^{n} w_i (g_{ij} - g_{ij'})^2 = \sum_{i=1}^{n} \frac{\left( \frac{x_{ij}}{x_{+j}} - \frac{x_{ij'}}{x_{+j'}} \right)^2}{\frac{x_{i+}}{x_{++}}} = \delta_{jj'}^2    with j, j' = 1, ..., p     (32.23)

32.5.3 Double-closure

A last rearrangement of terms in the expression of the global distance of
chi-square δ produces:

\delta^2 = \sum_{i=1}^{n} \frac{x_{i+}}{x_{++}} \sum_{j=1}^{p} \frac{x_{+j}}{x_{++}} z_{ij}^2 = \sum_{i=1}^{n} w_i \sum_{j=1}^{p} w_j z_{ij}^2     (32.24)

with

w_i = \frac{x_{i+}}{x_{++}}   and   w_j = \frac{x_{+j}}{x_{++}}

and where Z is the matrix of deviations of the double-closed data, cf. eq. (32.7):

z_{ij} = \frac{x_{ij} x_{++}}{x_{i+} x_{+j}} - 1    with i = 1, ..., n and j = 1, ..., p

The pattern of points produced by Z is centred in both dual spaces S_w^n and S_w^p,
since the weighted row- and column-means m_i^w and m_j^w are zero:

m_j^w = \sum_{i=1}^{n} w_i z_{ij} = \sum_{i=1}^{n} \frac{x_{i+}}{x_{++}} \left( \frac{x_{ij} x_{++}}{x_{i+} x_{+j}} - 1 \right) = 0    with j = 1, ..., p

m_i^w = \sum_{j=1}^{p} w_j z_{ij} = \sum_{j=1}^{p} \frac{x_{+j}}{x_{++}} \left( \frac{x_{ij} x_{++}}{x_{i+} x_{+j}} - 1 \right) = 0    with i = 1, ..., n

The global weighted mean of Z can also be shown to be zero:

m^w = \sum_{i=1}^{n} \sum_{j=1}^{p} w_i w_j z_{ij} = 0

Weighted sums of squares c_j^w and c_i^w around the origins of the two spaces can
be interpreted as distances of chi-square:

c_j^w = \sum_{i=1}^{n} w_i z_{ij}^2 = \delta_j^2    with j = 1, ..., p
                                                                    (32.25)
c_i^w = \sum_{j=1}^{p} w_j z_{ij}^2 = \delta_i^2    with i = 1, ..., n

where δ_i and δ_j have been defined above in terms of the elements and marginal
sums of X. These distances are measures for the differences of the row- or column-
closed profiles from their average or expected profiles. The symbols c_i^w and c_j^w
denote the weighted sums of squares of row i and column j in the transformed
matrix Z.
The values of Z in Table 32.6 already reveal some peculiar aspects of the data.
One should compare these values with zero, i.e. the value that would be obtained if
the observed values were equal to their expected ones. The larger the deviation
from zero, the stronger is the interaction between the corresponding row and
column. Positive deviations point toward a positive interaction. For example,
Triazolam shows a strong positive interaction with sleep (1.182), which means that
the compound Triazolam is particularly indicated for the treatment of sleepless-
ness. Likewise, a large negative value reflects a strong negative interaction. For
example, Triazolam also presents a strong negative interaction with epilepsy
(-1.000), which means that Triazolam is rarely prescribed for the treatment of
epilepsy. In the following section we will refer to interaction by the name of
correspondence. Note that the mean and mean square values in Table 32.6 are the
result of weighted averages.
Using the deviations Z of the double-closed data from their expected values and
the column-weights w_p in Table 32.6, we compute the distance of chi-square from
the origin of Triazolam (i = 4) using eq. (32.25):

\delta_{i=4}^2 = \sum_{j=1}^{3} w_j z_{4j}^2 = 0.333(-0.400)^2 + 0.300(-1.000)^2 + 0.367(1.182)^2 = 0.865

or

\delta_{i=4} = 0.931

Similarly, we compute the distance of chi-square from the origin for epilepsy
(j = 2), using the same deviations Z and the row-weights w_n:

\delta_{j=2}^2 = \sum_{i=1}^{4} w_i z_{i2}^2 = 0.267(1.083)^2 + 0.367(0.212)^2 + 0.200(-1.000)^2 + 0.167(-1.000)^2 = 0.696

or

\delta_{j=2} = 0.834
Distances between points in S_w^p and S_w^n can also be defined formally as distances
of chi-square:

c_{ii'}^w = \sum_{j=1}^{p} w_j (z_{ij} - z_{i'j})^2 = \delta_{ii'}^2    with i, i' = 1, ..., n
                                                                        (32.26)
c_{jj'}^w = \sum_{i=1}^{n} w_i (z_{ij} - z_{ij'})^2 = \delta_{jj'}^2    with j, j' = 1, ..., p

where δ_{ii'} and δ_{jj'} have been defined before in terms of the elements and marginal
sums of X. We use the symbols c_{ii'}^w and c_{jj'}^w to denote the weighted sums of
squared differences between rows i and i' and between columns j and j' in the transformed
matrix Z.

Using the deviations Z and the column-weights w_p in Table 32.6, we compute
the distance of chi-square between Lorazepam (i = 3) and Triazolam (i = 4) from
eq. (32.26):

\delta_{ii'=34}^2 = \sum_{j=1}^{3} w_j (z_{3j} - z_{4j})^2 = 0.333(1.000 + 0.400)^2 + 0.300(-1.000 + 1.000)^2 + 0.367(-0.091 - 1.182)^2 = 1.248

or

\delta_{ii'=34} = 1.117

In a similar way we derive the distance of chi-square between anxiety (j = 1) and
epilepsy (j = 2) using the same deviations Z and the row-weights w_n:

\delta_{jj'=12}^2 = \sum_{i=1}^{4} w_i (z_{i1} - z_{i2})^2 = 0.267(-1.000 - 1.083)^2 + 0.367(0.364 - 0.212)^2 + 0.200(1.000 + 1.000)^2 + 0.167(-0.400 + 1.000)^2 = 2.025

or

\delta_{jj'=12} = 1.423
The global weighted sum of squares c^w of the transformed data Z can be shown
to be equal to the global interaction δ² between rows and columns:

c^w = \sum_{i=1}^{n} w_i \sum_{j=1}^{p} w_j z_{ij}^2 = \delta^2     (32.27)

One finds an illustration of the measures of location and spread in the margins of
Table 32.6. The geometric properties of row-closed and column-closed profiles are
summarized in Fig. 32.5.
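The distances worked out above follow directly from Z and the two weight vectors. The sketch below is an added illustration using NumPy; small differences with the hand calculations are due to rounding of Z to three decimals in Table 32.6.

```python
import numpy as np

X = np.array([[0, 5, 3], [5, 4, 2], [4, 0, 2], [1, 0, 4]], dtype=float)
r, c, t = X.sum(1, keepdims=True), X.sum(0, keepdims=True), X.sum()
Z = X * t / (r * c) - 1                          # eq. (32.7)
w_n, w_p = (r / t).ravel(), (c / t).ravel()      # weights for double-closure

d_rows = np.sqrt(Z ** 2 @ w_p)                   # eq. (32.25): about [0.83 0.39 0.80 0.93]
d_cols = np.sqrt(w_n @ Z ** 2)                   # about [0.74 0.83 0.57]
d_LT = np.sqrt(w_p @ (Z[2] - Z[3]) ** 2)         # eq. (32.26), Lorazepam-Triazolam: 1.117
d_AE = np.sqrt(w_n @ (Z[:, 0] - Z[:, 1]) ** 2)   # anxiety-epilepsy: 1.423
print(np.round(d_rows, 2), np.round(d_cols, 2), round(d_LT, 3), round(d_AE, 3))
```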
In the following section on the analysis of contingency tables we will interpret the
distances of chi-square in terms of contrasts. In the present context we use the word
contrast in the sense of difference (see also Section 31.2.4). For example, we will
show that the distance of chi-square from the origin δ_i can be related to the amount
of contrast contained in row i of the data table, with respect to what can be
expected. Similarly, the distance δ_j can be associated with the amount of contrast in
column j, relative to what can be expected. In a geometrical sense, one will find
rows and columns with large contrasts at a relatively large distance from the origin
of S_w^p and S_w^n, respectively. The distance of chi-square δ_{ii'} then represents the
amount of contrast between rows i and i' with respect to the difference between
their expected values. Similarly, the distance of chi-square δ_{jj'} indicates the amount
of contrast between columns j and j' in comparison to the difference between their
expected values. In geometrical terms, one will find in S_w^p rows with large contrast
between them at a relatively large distance from one another. Likewise, columns
with large contrast between them will also be found in S_w^n at a relatively large
distance from one another.

Note that double-closure yields the same results as those produced by row-
closure in S_w^p and by column-closure in S_w^n. From an algorithmic point of view,
double-closure is the more attractive transformation, although row- and column-
closure possess a strong didactic appeal.

Fig. 32.5. (a) Geometric representation of row-closed profiles in the weighted column-space S_w^p. The
row-profiles are shown as an elliptic pattern P^n. The figure shows the distance of a row-profile i from
the origin and the distance between two row-profiles i and i'. (b) Geometric representation of col-
umn-closed profiles in the weighted row-space S_w^n. The column-profiles are shown as an elliptic pat-
tern P^p. The figure shows the distance of a column-profile j from the origin and the distance between
two column-profiles j and j'.

32.6 Correspondence factor analysis

32.6.1 Historical background

The origin of the analysis of contingency tables can be traced back to the
definition by Pearson in 1900 of the chi-square statistic as a measure for goodness
of fit [3]. The partition of the chi-square into main effects and interaction has been
described around 1950 by Lancaster [4]. One of the first practical applications of
multivariate analysis to contingency tables is due to Hill [5] who developed the
method of reciprocal averaging. This method is used for the ordination of plant
species from tabulated surveys carried out in various plots of land yielding a
species x plots matrix. The result allows one to order species according to a latent
vector which can be ascribed to environmental characteristics, such as density of
the soil, abundance of sunlight, thickness of cover, etc.
Reciprocal averaging is considered as the forerunner of correspondence factor
analysis (CFA) which has been developed by Benzecri [6,7] and a group of French
statisticians [8]. It is often also referred to as correspondence analysis (CA) for
short. Originally, the term correspondence means association between rows and
columns of a table, i.e. the number of joint occurrences of the categories represented
by the rows and columns of a contingency table. The French term 'analyse des
correspondances' points to the analysis of the interaction between rows and
columns. The French school also used a geometrical approach for the inter-
pretation of its results, especially by means of the biplot which has been designed
by Gabriel [9]. In the discussion of measurement tables in Chapter 31, we have
defined biplots as the joint representation of rows and columns of a table in a
low-dimensional space spanned by latent vectors. The concept of latent vectors
will be generalized to the case of weighted metrics in the following section.
Initially, CFA has been applied in linguistic studies. Later on, applications have
been described in many diverse fields of the empirical sciences. The method of
CFA can be extended to the analysis of tables with non-negative elements, not
necessarily counts, provided that these have been recorded with a common unit of
measurement. Such a table is called homogeneous, as opposed to a heterogeneous
table in which the rows or columns are defined with different units. Heterogeneous
tables can be analyzed by means of CFA, after subdivision of the measurements
into discrete categories (e.g., in the case of lipophilicity: strongly lipophilic,
weakly lipophilic or hydrophilic, strongly hydrophilic). CFA has played a decisive
role in the development of multivariate data analysis, especially by means of its
graphic aspect. It also has stimulated interest in related areas such as principal
components analysis and cluster analysis.
The position of CFA, as well as that of its related methods, is that of a
preliminary step to statistical inference. In this respect, it has been regarded as an

inductive approach [2]. This means that, first, hypotheses have to be formulated
from an analysis of empirical evidence. After this exploratory phase, the newly
formulated hypotheses have to be tested more rigorously using appropriate
statistical methods of inference.
Correspondence factor analysis can be described in three steps. First, one
applies a transformation to the data which involves one of the three types of closure
that have been described in the previous section. This step also defines two vectors
of weight coefficients, one for each of the two dual spaces. The second step
comprises a generalization of the usual singular value decomposition (SVD) or
eigenvalue decomposition (EVD) to the case of weighted metrics. In the third and
last step, one constructs a biplot for the geometrical representation of the rows and
columns in a low-dimensional space of latent vectors.

32.6.2 Generalized singular value decomposition

We assume that Z is a transformed n×p contingency table (e.g. by means of
row-, column- or double-closure) with associated metrics defined by W_n and W_p.
Generalized SVD of Z is defined by means of:

Z = A \mu B^T     (32.28)

where μ is an r×r diagonal matrix of singular values and where r is the rank of Z.
The matrices A and B are the orthonormal matrices of generalized singular vectors,
respectively with dimensions n×r and p×r in the weighted metrics W_n and W_p.
Orthonormality in the weighted metrics implies that:

A^T W_n A = B^T W_p B = I_r     (32.29)

where I_r is the identity matrix of dimension r.
It can be proved that the above generalized SVD of the matrix Z can be derived
from the usual SVD of the matrix W_n^{1/2} Z W_p^{1/2}:

W_n^{1/2} Z W_p^{1/2} = U \Lambda V^T     (32.30)

where Λ is a diagonal matrix of dimension r, and where U and V are the usual
orthonormal matrices of dimensions n×r and p×r, which contain the singular
vectors of W_n^{1/2} Z W_p^{1/2}:

U^T U = V^T V = I_r

and where I_r and r are defined as above.
The results of applying these operations to the double-closed data in Table 32.6
are shown in Table 32.7. The analysis yielded two latent vectors with associated
singular values of 0.567 and 0.433.
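The computation can be traced with a small added sketch (assuming NumPy); it should return the singular values 0.567 and 0.433 and, up to an arbitrary sign per column, the vectors of Tables 32.7 and 32.8.

```python
import numpy as np

X = np.array([[0, 5, 3], [5, 4, 2], [4, 0, 2], [1, 0, 4]], dtype=float)
r, c, t = X.sum(1, keepdims=True), X.sum(0, keepdims=True), X.sum()
Z = X * t / (r * c) - 1
w_n, w_p = (r / t).ravel(), (c / t).ravel()

# Usual SVD of W_n^(1/2) Z W_p^(1/2), eq. (32.30); keep the two non-trivial components.
U, lam, Vt = np.linalg.svd(np.diag(np.sqrt(w_n)) @ Z @ np.diag(np.sqrt(w_p)))
U, V, lam = U[:, :2], Vt.T[:, :2], lam[:2]

A = U / np.sqrt(w_n)[:, None]    # generalized singular vectors, eq. (32.31)
B = V / np.sqrt(w_p)[:, None]

print(np.round(lam, 3))          # about [0.567 0.433]
print(np.round(A, 3))            # compare with Table 32.8 (columns a1, a2)
print(np.round(B, 3))            # compare with Table 32.8 (columns b1, b2)
```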

TABLE 32.7

Usual singular vectors extracted from the transformed Z in Table 32.6, i.e. from W_n^{1/2} Z W_p^{1/2}

Compound        u1       u2          Disorder     v1       v2

Clonazepam     0.750    0.113        Anxiety    -0.644   -0.502
Diazepam      -0.025   -0.540        Epilepsy    0.762   -0.346
Lorazepam     -0.619   -0.148        Sleep      -0.075    0.792
Triazolam     -0.232    0.820

In Table 32.7 we observe a contrast (in the sense of difference) along the first
row-singular vector u1 between Clonazepam (0.750) and Lorazepam (-0.619).
Similarly we observe a contrast along the first column-singular vector v1 between
epilepsy (0.762) and anxiety (-0.644). If we combine these two observations then
we find that the first singular vector (expressed by both u1 and v1) is dominated by
the positive correspondence between Clonazepam and epilepsy and between
Lorazepam and anxiety. Equivalently, the observations lead to a negative corres-
pondence between Clonazepam and anxiety, and between Lorazepam and epilepsy.
In a similar way we can interpret the second singular vector (expressed by both u2
and v2) in terms of positive correspondences between Triazolam and sleep and
between Diazepam and anxiety.
From eq. (32.30) we can derive the solution for the generalized SVD problem of
Z:

A = W_n^{-1/2} U
B = W_p^{-1/2} V     (32.31)
\mu = \Lambda

where U, V and Λ have been defined above in eq. (32.30).


The results of generalized SVD, when applied to the 4×3 contingency table, are
presented in Table 32.8. The generalized singular vectors can be interpreted in the
same way as we have done for the usual ones in Table 32.7. The only difference is
that the elements of A and B contain the coordinates of the four compounds and of
the three disorders in a two-dimensional diagram with a weighted metric defined
by w_n and w_p, the coefficients of which are shown in Table 32.6. Triazolam
obtained the smallest weight (0.167) of the four compounds and Diazepam is given
the largest weight (0.367). This amounts to a factor of about 2 between the largest
and the smallest weights. The weights of the disorders are more homogeneous as
they range between 0.300 and 0.367. The heterogeneity among the weights

TABLE 32.8
Generalized singular vectors extracted from Z in Table 32.6

Compound        a1       a2          Disorder     b1       b2
Clonazepam 1.452 0.220 Anxiety -1.115 -0.870
Diazepam -0.042 -0.892 Epilepsy 1.390 -0.632
Lorazepam -1.385 -0.331 Sleep -0.124 1.308
Triazolam -0.569 2.010

assigned to the compounds is responsible for a distortion of the contrasts in Table
32.8 when compared to those of Table 32.7. This can be understood as follows.
The elements of U and A are more or less proportional to one another with a
proportionality constant of about 2. The same is true for the elements of V and B.
For example, in the case of Clonazepam we find that the values a_11, a_12 (1.452,
0.220) are about twice the corresponding values u_11, u_12 (0.750, 0.113). Similarly,
in the case of epilepsy, we obtain that the values b_21, b_22 (1.390, -0.632) are about
twice the corresponding values v_21, v_22 (0.762, -0.346). Some values, however, do
not conform to this proportionality. For example, in the case of Triazolam we
observe that a_42 (2.010) is more than twice the corresponding value u_42 (0.820) and
in the case of anxiety we have that b_12 (-0.870) is less than twice the corresponding
v_12 (-0.502). These distortions are the result of variable weights of the compounds
in the generalized SVD. In a sense the interpretation of A and B is fairer than that of
U and V, since the former accounts for the differences in importance of the
compounds (and to a lesser extent of the disorders). Compounds that are more
often prescribed have a greater influence on the result of the analysis than those
that are less frequently used.
One can define transition formulae for the two sets of generalized latent vectors
in A and B (see also Section 31.1.6):

A = Z W_p B \Lambda^{-1}
                                   (32.32)
B = Z^T W_n A \Lambda^{-1}

These transition formulae express one set of generalized latent vectors (A or B)
in terms of the other set (B or A). They follow readily from the definition of the
generalized SVD problem which has been stated above.
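A short numerical check of eq. (32.32), added here as an illustration:

```python
import numpy as np

X = np.array([[0, 5, 3], [5, 4, 2], [4, 0, 2], [1, 0, 4]], dtype=float)
r, c, t = X.sum(1, keepdims=True), X.sum(0, keepdims=True), X.sum()
Z = X * t / (r * c) - 1
Wn, Wp = np.diag((r / t).ravel()), np.diag((c / t).ravel())

U, lam, Vt = np.linalg.svd(np.sqrt(Wn) @ Z @ np.sqrt(Wp))
U, V, lam = U[:, :2], Vt.T[:, :2], lam[:2]
A, B = np.linalg.inv(np.sqrt(Wn)) @ U, np.linalg.inv(np.sqrt(Wp)) @ V

# Transition formulae: each set of generalized latent vectors follows from the other.
assert np.allclose(A, Z @ Wp @ B @ np.diag(1 / lam))
assert np.allclose(B, Z.T @ Wn @ A @ np.diag(1 / lam))
```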
Generalized EVD can be developed along similar lines. By analogy with the
analysis of measurement tables in Chapter 31, we distinguish between EVD on
rows and EVD on columns. In the row-problem we compute the matrix of
generalized row-eigenvectors A and the diagonal matrix of eigenvalues \mu^2 from:

Z W_p Z^T = A \mu^2 A^T    with A^T W_n A = I_r     (32.33)

From the solution of the usual EVD problem (Section 31.4.2) follows that:

W_n^{1/2} Z W_p Z^T W_n^{1/2} = U \Lambda^2 U^T    with U^T U = I_r     (32.34)

which yields:

A = W_n^{-1/2} U   and   \mu^2 = \Lambda^2

In the column-problem we compute the matrix of generalized column-eigen-
vectors B and the diagonal matrix of eigenvalues \mu^2 from:

Z^T W_n Z = B \mu^2 B^T    with B^T W_p B = I_r     (32.35)

From the solution of the usual EVD problem follows also:

W_p^{1/2} Z^T W_n Z W_p^{1/2} = V \Lambda^2 V^T    with V^T V = I_r     (32.36)

which yields:

B = W_p^{-1/2} V   and   \mu^2 = \Lambda^2

Note that the eigenvalues in Λ² are the same in both the row- and column-problems.
The rank r of Z is also the rank of the weighted cross-product matrices Z^T W_n Z
and Z W_p Z^T. Singular vectors and eigenvectors are the same, and it is appropriate to
refer to them by the more general term of latent vectors [10]. Singular values are the
square roots of the corresponding eigenvalues. The latter express the contribution of
the associated latent vectors to the global interaction between rows and columns.
The trace of Λ² is equal to the traces of the weighted cross-product matrices,
which in turn are equal to the global weighted sum of squares c^w or global
interaction δ² (eq. (32.27)):

tr(\Lambda^2) = tr(W_p^{1/2} Z^T W_n Z W_p^{1/2}) = tr(W_n^{1/2} Z W_p Z^T W_n^{1/2}) = c^w = \delta^2     (32.37)



32.6.3 Biplots

In CFA we can derive biplots for each of the three types of transformed
contingency tables which we have discussed in Section 32.3 (i.e., by means of
row-, column- and double-closure). These three transformations produce, respect-
ively, the deviations (from expected values) of the row-closed profiles F, of the
column-closed profiles G and of the double-closed data Z. It should be recalled
that each of these transformations is associated with a different metric as defined
by W_n and W_p. Because of this, the generalized singular vectors A and B will be
different also. The usual latent vectors U, V and the matrix of singular values Λ,
however, are identical in all three cases, as will be shown below. Note that the
usual singular vectors U and V are extracted from the matrix W_n^{1/2} Z W_p^{1/2}.
In what follows we restrict the discussion of biplots to the case of double-closed
data Z as defined by the elements:

z_{ij} = \frac{x_{ij} x_{++}}{x_{i+} x_{+j}} - 1    with i = 1, ..., n and j = 1, ..., p     (32.38)

using the metric matrices W_n and W_p with main diagonal elements:

w_i = \frac{x_{i+}}{x_{++}}   and   w_j = \frac{x_{+j}}{x_{++}}

All other cases can be readily derived by analogy. In the case of row-closed
profiles we have to perform the following substitutions:

z_{ij} = f_{ij} = \frac{x_{ij}}{x_{i+}} - \frac{x_{+j}}{x_{++}}     (32.39)

w_i = \frac{x_{i+}}{x_{++}}   and   w_j = \frac{x_{++}}{x_{+j}}

With column-closed profiles we substitute:

z_{ij} = g_{ij} = \frac{x_{ij}}{x_{+j}} - \frac{x_{i+}}{x_{++}}     (32.40)

w_i = \frac{x_{++}}{x_{i+}}   and   w_j = \frac{x_{+j}}{x_{++}}
It can be proved that the matrix W_n^{1/2} Z W_p^{1/2} is the same in all cases. Hence, the
decomposition U Λ V^T is also the same, and since the decomposition is unique we
obtain that U, V and Λ are the same for all three cases. The matrices A and B,
however, depend on the type of closure that has been applied to the data.
From the latent vectors and singular values one can compute the n×r general-
ized score matrix S and the p×r generalized loading matrix L. These matrices
contain the coordinates of the rows and columns in the space spanned by the latent
vectors:

S = A \Lambda^\alpha     (32.41a)

and

L = B \Lambda^\beta     (32.41b)
where the factor scaling coefficients α and β may take the values 0, 0.5 and 1. In
Chapter 31 we have amply discussed the implications of the various choices of α
and β for the interpretation of the biplots. The main considerations are reviewed
below in the context of CFA:
α = 1
for reproduction of distances from the origin δ_i and distances δ_{ii'} between the
row-profiles i and i',
β = 1
for reproduction of distances from the origin δ_j and distances δ_{jj'} between the
column-profiles j and j',
α + β = 1
for reproduction of the values z_{ij} by means of perpendicular projections upon
unipolar axes; also for reproduction of the differences (or contrasts) z_{ij} - z_{i'j} and
z_{ij} - z_{ij'} by means of perpendicular projections upon bipolar axes.
Unipolar and bipolar axes have been discussed in Section 31.2. Briefly, a
unipolar axis is defined by the origin and the representation of a row or column. A
bipolar axis is drawn through the representations of two rows or through the
representations of two columns. Projections upon unipolar axes reproduce the
values in the transformed data table. Projections upon bipolar axes reproduce the
contrasts (i.e. differences) between values in the data table.
In the case when α = 1 we obtain:

S S^T = A \Lambda \Lambda A^T = A \Lambda^2 A^T = W_n^{-1/2} U \Lambda^2 U^T W_n^{-1/2} = Z W_p Z^T     (32.42)

as follows from eq. (32.41a). In particular we can write:

s_i^T s_i = \|s_i\|^2 = \sum_{k=1}^{r} s_{ik}^2 = \sum_{j=1}^{p} w_j z_{ij}^2 = \delta_i^2    with i = 1, ..., n     (32.43)

which shows that the distances of the rows on the biplot are equal to the
distances of chi-square in S_w^p which we have derived in Section 32.5.
In the case of β = 1 we can show that:

L L^T = B \Lambda \Lambda B^T = B \Lambda^2 B^T = W_p^{-1/2} V \Lambda^2 V^T W_p^{-1/2} = Z^T W_n Z     (32.44)

as follows from eq. (32.41b). In particular we have:

l_j^T l_j = \|l_j\|^2 = \sum_{k=1}^{r} l_{jk}^2 = \sum_{i=1}^{n} w_i z_{ij}^2 = \delta_j^2    with j = 1, ..., p     (32.45)

from which we conclude that distances of the columns on the biplot are equal
to the distances of chi-square in S_w^n which we have already derived in Section 32.5.
In the special case of α + β = 1 we can state:

S L^T = A \Lambda^\alpha \Lambda^\beta B^T = A \Lambda B^T = W_n^{-1/2} U \Lambda V^T W_p^{-1/2} = Z     (32.46)

which follows from eq. (32.41).
In particular we obtain:

s_i^T l_j = \sum_{k=1}^{r} s_{ik} l_{jk} = z_{ij}    with i = 1, ..., n and j = 1, ..., p     (32.47)

which shows that the projection of point s_i upon a unipolar axis l_j reproduces
the value z_{ij}:

s_i^T (l_j - l_{j'}) = \sum_{k=1}^{r} s_{ik} (l_{jk} - l_{j'k}) = z_{ij} - z_{ij'}    with i = 1, ..., n and j, j' = 1, ..., p     (32.48)

which defines the projection of point s_i upon the bipolar axis l_j - l_{j'}, resulting in
the contrast z_{ij} - z_{ij'}.

TABLE 32.9

Generalized scores and loadings computed from Z in Table 32.6

Compound        s1       s2          Disorder     l1       l2
Clonazepam 0.823 0.095 Anxiety -0.632 -0.378
Diazepam -0.024 -0.388 Epilepsy 0.788 -0.275
Lorazepam -0.785 -0.144 Sleep -0.070 0.568
Triazolam -0.322 0.873

By similar arguments we derive that:

(s_i - s_{i'})^T l_j = \sum_{k=1}^{r} (s_{ik} - s_{i'k}) l_{jk} = z_{ij} - z_{i'j}    with i, i' = 1, ..., n and j = 1, ..., p     (32.49)

which defines the projection of point l_j upon the bipolar axis s_i - s_{i'}.
The generalized scores and loadings from the 4×3 contingency table are given in
Table 32.9 for the particular choice of α = β = 1. A plot of these generalized scores
and loadings is shown in Fig. 32.6 using the coordinates of the points which are
listed in Table 32.9. Distances from the origin and distances between points have
been computed according to the expressions given in Section 32.5. In this example
we find two distinct eigenvectors with eigenvalues of 0.322 and 0.188, respect-
ively. This is compatible with the maximal rank of a 4×3 data matrix after
double-closure. It should be recalled that double-closure always reduces the rank
of the data matrix by one. The sum of the eigenvalues (0.510) is equal to the global
sum of squares (interaction) which is shown in Table 32.6.
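The scores and loadings of Table 32.9, as well as the distance and reconstruction properties of eqs. (32.43)-(32.46), can be verified with the added sketch below (signs of the latent vectors are arbitrary).

```python
import numpy as np

X = np.array([[0, 5, 3], [5, 4, 2], [4, 0, 2], [1, 0, 4]], dtype=float)
r, c, t = X.sum(1, keepdims=True), X.sum(0, keepdims=True), X.sum()
Z = X * t / (r * c) - 1
w_n, w_p = (r / t).ravel(), (c / t).ravel()

U, lam, Vt = np.linalg.svd(np.diag(np.sqrt(w_n)) @ Z @ np.diag(np.sqrt(w_p)))
U, V, lam = U[:, :2], Vt.T[:, :2], lam[:2]
A, B = U / np.sqrt(w_n)[:, None], V / np.sqrt(w_p)[:, None]

S, L = A * lam, B * lam                       # eq. (32.41) with alpha = beta = 1 (Table 32.9)
print(np.round(lam ** 2, 3))                  # eigenvalues: about [0.322 0.188]
print(np.round(np.sqrt((S ** 2).sum(1)), 3))  # row distances, eq. (32.43)
print(np.round(np.sqrt((L ** 2).sum(1)), 3))  # column distances, eq. (32.45)

S2, L2 = A * lam ** 0.5, B * lam ** 0.5       # alpha = beta = 0.5, so alpha + beta = 1
assert np.allclose(S2 @ L2.T, Z)              # projections reproduce Z, eq. (32.46)
```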
The two plots can be superimposed into a biplot as shown in Fig. 32.7. Such a
biplot reveals the correspondences between the rows and columns of the cont-
ingency table. The compound Triazolam is specific for the treatment of sleep
disturbances. Anxiety is treated preferentially by both Lorazepam and Diazepam.
The latter is also used for treating epilepsy. Clonazepam is specifically used with
epilepsy. Note that distances between compounds and disorders are not to be
considered. This would be a serious error of interpretation. A positive correspondence
between a compound and a disorder is evidenced by relatively large distances from
the origin and a common orientation (e.g. sleep disturbance and Triazolam). A
negative correspondence is manifest in the case of relatively large distances from
the origin and opposite orientations (e.g. sleep disturbance and Diazepam).
The geometrical reconstruction of the values in Z by perpendicular projection of
points upon axes can be justified algebraically by the matrix product of the scores S
with the transpose of the loadings L according to eqs. (32.41) and (32.46):

Fig. 32.6. (a) Generalized score plot derived by correspondence factor analysis (CFA) from Table
32.4. The figure shows the distance of Triazolam from the origin, and the distance between Triazolam
and Lorazepam. (b) Generalized loading plot derived by CFA from Table 32.4. The figure shows the
distance of epilepsy from the origin, and the distance between epilepsy and anxiety.

S L^T = A \Lambda^\alpha \Lambda^\beta B^T = A \Lambda B^T = Z     (32.50)

provided that α + β = 1.
The most common choices for the exponents in CFA are α = β = 1 and α = β = 0.5.
The choices α = 1, β = 0 and α = 0, β = 1 will produce biplots that are

Fig. 32.7. CFA biplot resulting from the superposition of the score and loading plots of Figs. 32.6a and
b. The coordinates of the compounds and the disorders are contained in Table 32.9.

multidimensional extensions of the triangular (also called barycentric) diagrams
that are used for representing ternary mixtures.
The reconstruction Z* of the transformed contingency table Z in a reduced space
of latent vectors follows from:
S^* L^{*T} = Z^*     (32.51)
where S* and L* are obtained from the first r* columns of S and L, with r* < r. It is
assumed that the columns of S and L are arranged in decreasing order of their
corresponding singular values. The r*-dimensional subspace of dominant latent
vectors can be interpreted as a (hyper)plane of closest fit to the pattern of points
represented in the r-dimensional space of latent vectors. This is analogous to the
interpretation of the results of principal components analysis (PCA) which has
been discussed in Chapters 17 and 31. The goodness of the fit can be judged from
the relative contribution γ of the first r* latent vectors to the global interaction,
which can be expressed in the form:

\gamma = \sum_{k=1}^{r^*} \lambda_k^2 \Big/ \sum_{k=1}^{r} \lambda_k^2     (32.52)

CFA can also be defined as an expansion of a contingency table X using the
generalized latent vectors in A, B and the singular values in Λ:

x_{ij} = \frac{x_{i+} x_{+j}}{x_{++}} \left( 1 + \sum_{k=1}^{r} \lambda_k a_{ik} b_{jk} \right)    with i = 1, ..., n and j = 1, ..., p     (32.53)

as can be derived from eqs. (32.28) and (32.38).
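A numerical check of eq. (32.53) for the 4×3 example (an added sketch):

```python
import numpy as np

X = np.array([[0, 5, 3], [5, 4, 2], [4, 0, 2], [1, 0, 4]], dtype=float)
r, c, t = X.sum(1, keepdims=True), X.sum(0, keepdims=True), X.sum()
Z = X * t / (r * c) - 1
w_n, w_p = (r / t).ravel(), (c / t).ravel()

U, lam, Vt = np.linalg.svd(np.diag(np.sqrt(w_n)) @ Z @ np.diag(np.sqrt(w_p)))
A = (U / np.sqrt(w_n)[:, None])[:, :2]
B = (Vt.T / np.sqrt(w_p)[:, None])[:, :2]
lam = lam[:2]

X_rebuilt = (r * c / t) * (1 + A @ np.diag(lam) @ B.T)   # eq. (32.53)
assert np.allclose(X_rebuilt, X)
```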

32.6.4 Application

CFA is applied to the contingency Table 32.10 which has been adapted from a
retrospective study of US doctorates in chemistry and in other fields awarded to
men and women [11]. The rows of this table refer to consecutive time intervals
from 1920 up to 1989. The columns indicate non-overlapping categories of
doctorates. The marginal sums provide the totals by time interval and by category
of doctorate. These will be used for the assignment of weights to the rows and
columns of the table. Note that the data from 1920 up to 1959 have been grouped
into intervals of 10 or 5 years in order to level out statistical fluctuations due to
small counts, especially in the category of women doctorates in chemistry. This
pooling of rows does not affect the subsequent analysis by CFA. Indeed, rows (or
columns) that are similar can be grouped together, provided that the corresponding
weights are also added together. This is called the principle of distributional
equivalence of CFA [2].
Casual inspection of the table reveals that the total number of doctorates has
reached a peak value of 33727 in 1973 which has only been surpassed in the last
year of the time series. Peak values are also observed in the 1973 category of men
in other fields and in the 1970 category of men in chemistry. In both categories
related to women, no such peak values are observed as the annual counts seem to
increase more or less steadily over the whole period.
A more comprehensive analysis is obtained by CFA, the results of which are
summarized in Tables 32.11 and 32.12. The first column of these tables contains
the diagonal elements of the metric matrices W_n and W_p. These normalized weight
coefficients are proportional to the marginal sums of the original contingency
Table 32.10 and sum to unity. The following three columns form the matrix of
generalized scores S and the matrix of generalized loadings L. These matrices
satisfy the relations (eq. (32.41)):

S = A \Lambda   and   L = B \Lambda

which follows from the generalized SVD (eq. (32.50)):

Z = A \Lambda B^T   with   A^T W_n A = B^T W_p B = I_r

where:

TABLE 32.10

Doctorates awarded in the US between 1920 and 1989 [11]

Year         Women in     Men in       Women in       Men in         Total
             chemistry    chemistry    other fields   other fields

1920-1929 142 1782 1674 8291 11889


1930-1939 256 3707 3507 18116 25586
1940-1949 221 5105 3871 21358 30555
1950-1954 222 4950 3370 30102 38644
1955-1959 228 4840 4388 34714 44170
1960 45 1060 1045 7850 10000
1961 66 1162 1185 9692 12105
1962 66 1163 1185 9692 12106
1963 66 1163 1186 9691 12106
1964 93 1427 1702 13865 17087
1965 94 1427 1702 13865 17088
1966 94 1427 1702 13865 17088
1967 120 1644 2295 16236 20295
1968 139 1643 2761 18291 22834
1969 146 1801 3225 20562 25734
1970 180 2043 3764 23449 29436
1971 172 2032 4403 25165 31772
1972 202 1809 5080 25910 33001
1973 179 1670 5903 25975 33727
1974 174 1618 6241 24967 33000
1975 192 1570 7001 24150 32913
1976 189 1434 7487 23813 32923
1977 180 1390 7665 22437 31672
1978 196 1349 8117 21188 30850
1979 220 1347 8701 20932 31200
1980 255 1283 9132 20312 30982
1981 235 1376 9637 20071 31319
1982 271 1407 9786 19584 31048
1983 297 1462 10188 19243 31190
1984 320 1445 10340 19148 31253
1985 362 1474 10337 19028 31201
1986 396 1507 10848 19019 31770
1987 406 1568 10964 19340 32278
1988 429 1589 11361 20077 33456
1989 497 1474 12013 20335 34319

Total 7350 65148 203766 680333 956597



TABLE 32.11
Weights, scores, distances, contributions and precisions from CFA applied to Table 32.10

i            w_i      s1       s2       s3       δ_i      γ_i      π_i

1920-1929 0.012 -0.96 0.95 0.11 1.35 0.023 0.999


1930-1939 0.027 -0.98 0.85 0.05 1.29 0.045 0.999
1940-1949 0.032 -1.18 1.10 -0.11 1.62 0.084 0.995
1950-1954 0.040 -1.35 0.37 -0.03 1.40 0.079 1.000
1955-1959 0.046 -1.17 0.15 -0.03 1.18 0.064 1.000
1960 0.010 -1.11 0.11 -0.06 1.12 0.013 0.998
1961 0.013 -1.12 -0.05 0.02 1.12 0.016 1.000
1962 0.013 -1.12 -0.05 0.02 1.12 0.016 1.000
1963 0.013 -1.12 -0.05 0.02 1.12 0.016 1.000
1964 0.018 -1.05 -0.23 0.04 1.07 0.021 0.998
1965 0.018 -1.05 -0.23 0.04 1.07 0.021 0.998
1966 0.018 -1.05 -0.23 0.04 1.07 0.021 0.998
1967 0.021 -0.92 -0.21 0.05 0.94 0.019 0.998
1968 0.024 -0.81 -0.31 0.06 0.87 0.018 0.995
1969 0.027 -0.77 -0.33 0.04 0.83 0.019 0.998
1970 0.031 -0.74 -0.32 0.06 0.81 0.020 0.995
1971 0.033 -0.63 -0.36 0.02 0.73 0.018 0.999
1972 0.034 -0.45 -0.43 0.05 0.63 0.014 0.994
1973 0.035 -0.25 -0.44 -0.00 0.51 0.009 1.000
1974 0.034 -0.13 -0.39 -0.03 0.41 0.006 0.996
1975 0.034 0.08 -0.32 -0.03 0.33 0.004 0.990
1976 0.034 0.22 -0.32 -0.05 0.39 0.005 0.983
1977 0.033 0.34 -0.26 -0.08 0.44 0.006 0.971
1978 0.032 0.53 -0.18 -0.08 0.56 0.010 0.980
1979 0.033 0.67 -0.12 -0.07 0.68 0.015 0.990
1980 0.032 0.82 -0.07 -0.04 0.82 0.022 0.998
1981 0.033 0.91 0.01 -0.10 0.92 0.028 0.990
1982 0.032 0.98 0.07 -0.06 0.98 0.031 0.996
1983 0.033 1.07 0.14 -0.05 1.08 0.038 0.998
1984 0.033 1.12 0.15 -0.02 1.13 0.041 1.000
1985 0.033 1.12 0.18 0.04 1.14 0.042 0.999
1986 0.033 1.21 0.23 0.06 1.24 0.051 0.998
1987 0.034 1.19 0.24 0.06 1.22 0.050 0.998
1988 0.035 1.20 0.23 0.07 1.22 0.052 0.996
1989 0.036 1.32 0.22 0.14 1.34 0.065 0.990

TABLE 32.12
Weights, loadings, distances, contributions and precisions from CFA applied to Table 32.10

j                          w_j      l1       l2       l3       δ_j      γ_j      π_j

Women in chemistry 0.008 0.99 0.72 0.66 1.39 0.015 0.770


Men in chemistry 0.068 -1.46 1.20 -0.03 1.89 0.244 1.000
Women in other fields 0.213 1.69 0.18 -0.02 1.70 0.617 1.000
Men in other fields 0.711 -0.38 -0.18 0.00 0.42 0.124 1.000

z_{ij} = \frac{1}{\delta} \left( \frac{x_{ij} x_{++}}{x_{i+} x_{+j}} - 1 \right)    with i = 1, ..., n and j = 1, ..., p     (32.54)

and where r is the rank of Z. The rank r is at most equal to the smaller of n - 1 and
p - 1. Hence, in the present application to a 35×4 contingency table we can at most
obtain three distinct latent vectors. For practical reasons we have divided the
transformed data (eq. (32.7)) by the global distance of chi-square δ (eq. (32.24)).
This type of normalization ensures that the global interaction in the data equals
unity:

\sum_{i=1}^{n} w_i \sum_{j=1}^{p} w_j z_{ij}^2 = 1

The three latent vectors account for respectively 86, 13 and 1% of the interaction.
The next two columns in Tables 32.11 and 32.12 show the distances δ_i and δ_j of rows
and columns from the origin of space and their contributions γ_i and γ_j to the
interaction:

\delta_i^2 = \sum_{k=1}^{r} s_{ik}^2   and   \gamma_i = w_i \delta_i^2    with i = 1, ..., n
                                                                          (32.55)
\delta_j^2 = \sum_{k=1}^{r} l_{jk}^2   and   \gamma_j = w_j \delta_j^2    with j = 1, ..., p

The final column in Tables 32.11 and 32.12 lists the precisions π_i and π_j with
which the rows and columns are represented in the plane spanned by the first two
latent vectors:

\pi_i = \sum_{k=1}^{2} s_{ik}^2 \Big/ \delta_i^2    with i = 1, ..., n
                                                                        (32.56)
\pi_j = \sum_{k=1}^{2} l_{jk}^2 \Big/ \delta_j^2    with j = 1, ..., p

Fig. 32.8. CFA biplot computed from the data in Table 32.10. Circles represent years and squares
identify the four educational categories. The centre of the plot is represented by a small cross. The co-
ordinates of the years and the categories are contained in Tables 32.11 and 32.12. Factor scaling coef-
ficients were defined as α = β = 1.

Figure 32.8 shows the biplot constructed from the first two columns of the
scores matrix S and from the loadings matrix L (Tables 32.11 and 32.12). This biplot
corresponds with the exponents α = 1 and β = 1 in the definition of scores and
loadings (eq. (32.41)). It is meant to reconstruct distances between rows and
loadings (eq. (39.41)). It is meant to reconstruct distances between rows and
between columns. The rows and columns are represented by circles and squares
respectively. Circles are connected in the order of the consecutive time intervals.
The horizontal and vertical axes of this biplot are in the direction of the first and
second latent vectors which account respectively for 86 and 13% of the interaction
between rows and columns. Only 1% of the interaction is in the direction perpen-
dicular to the plane of the plot. The origin of the frame of coordinates is indicated

by a small cross (+) near the centre of the biplot. As can be seen from the precisions
π_i in Table 32.11, all years are well-represented in the plane of the biplot. The
precisions π_j in Table 32.12, however, show that the category of women in
chemistry is not so well-represented in the biplot as its precision amounts only to
0.770.
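For completeness, the quantities listed in Tables 32.11 and 32.12 (weights, scores or loadings, distances, contributions and precisions) can be produced by one small routine. The sketch below is an added illustration and is run on the 4×3 example so that it stays self-contained; the same call on Table 32.10, with Z additionally divided by the global distance of chi-square, would reproduce the tabulated columns up to sign and rounding.

```python
import numpy as np

def cfa(X, ndim=2, normalize=False):
    """Correspondence factor analysis of a table of counts X (alpha = beta = 1)."""
    X = np.asarray(X, dtype=float)
    r, c, t = X.sum(1, keepdims=True), X.sum(0, keepdims=True), X.sum()
    w_n, w_p = (r / t).ravel(), (c / t).ravel()
    Z = X * t / (r * c) - 1                        # eq. (32.7)
    if normalize:                                  # eq. (32.54): unit global interaction
        Z = Z / np.sqrt(np.sum(w_n[:, None] * w_p[None, :] * Z ** 2))
    U, lam, Vt = np.linalg.svd(np.sqrt(w_n)[:, None] * Z * np.sqrt(w_p)[None, :])
    k = min(X.shape) - 1                           # rank after double-closure
    S = (U[:, :k] / np.sqrt(w_n)[:, None]) * lam[:k]      # generalized scores
    L = (Vt.T[:, :k] / np.sqrt(w_p)[:, None]) * lam[:k]   # generalized loadings
    d_r, d_c = np.sqrt((S ** 2).sum(1)), np.sqrt((L ** 2).sum(1))     # distances
    g_r, g_c = w_n * d_r ** 2, w_p * d_c ** 2                         # contributions, eq. (32.55)
    p_r = (S[:, :ndim] ** 2).sum(1) / d_r ** 2                        # precisions, eq. (32.56)
    p_c = (L[:, :ndim] ** 2).sum(1) / d_c ** 2
    return dict(w_n=w_n, w_p=w_p, S=S, L=L, d_r=d_r, d_c=d_c,
                g_r=g_r, g_c=g_c, p_r=p_r, p_c=p_c)

result = cfa([[0, 5, 3], [5, 4, 2], [4, 0, 2], [1, 0, 4]], normalize=True)
print(np.round(result["S"], 2), np.round(result["p_r"], 3))
```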
It is readily evident that the most dominant (horizontal) latent variable reflects a
contrast between women and men. From 1966 onwards there appears a sustained
increase of the proportion of women doctorates. The increase is most prominent
during the seventies as evidenced by the relatively large distances between adjacent
time intervals from 1971 to 1980. The second (vertical) latent variable is defined
by a contrast between chemistry and the other fields. Contrast is to be understood
here as a difference between profiles. Initially there is a progressive decrease in the
share of chemistry doctorates when compared to those in other fields. Around
1973, however, the trend reverses and the proportion of chemical degrees rises
slowly, but steadily.
The correspondences between rows and columns are evidenced by the positions
of the circles and squares with respect to the origin of the plot. Those points that are
closest to the origin are most similar to the average profile. Those that are further
away show specific differences with respect to the average profile. Circles and
squares that moved toward the border of the biplot, and in the same direction,
possess a positive correspondence. They seem to attract each other. Those that
moved in the opposite direction demonstrate a negative correspondence. They
seem to repel each other. For example, the category of chemistry doctorates
awarded to men exhibits a positive correspondence with the early years, and at the
same time a negative correspondence with the more recent years. A mechanical
analogy of forces of attraction and repulsion between a circle and a square is
appropriate here. One should refrain, however, from judging distances between
circles and squares. As stated before, their closeness is not a measure of
their correspondence, only the distances from the origin and their angular distance
matter.
In the case when one of the two measurements of the contingency table is
divided into ordered categories, one can construct a so-called thermometer plot. On
this plot we represent the ordered measurement along the horizontal axis and the
scores of the dominant latent vectors along the vertical axis. The solid line in Fig.
32.9 displays the prominent features of the first latent vector which, in the context
of our illustration, is called the women/men factor. It clearly indicates a sustained
progress of the share of women doctorates from 1966 onwards. The dashed line
corresponds with the second latent vector which can be labelled as the chemistry/
other fields factor. This line shows initially a decline of the share of chemistry and
a slow but steady recovery from 1973 onwards. The successive decline and rise are
responsible for the horseshoe-like appearance of the pattern of points representing

Fig. 32.9. Thermometer plot representing the scores of the first and second component of a CFA ap-
plied to Table 32.10. The solid line denotes the first component which accounts for the women/men
contrast in the data. The broken line corresponds with the second component which reveals a contrast
between chemistry and other fields.

the rows in the biplot of Fig. 32.8. This phenomenon is also called the Guttman
effect [2]. The effect is due to the fact that a linearly varying series of scores or
loadings on factor 1 (e.g. 1, 2, 3) can be uncorrelated with, or orthogonal to, a parabolic
series on factor 2 (e.g. 1, 0, 1).
For the purpose of comparison, we also discuss briefly the biplot constructed
from the CFA using the exponents α = 0.5 and β = 0.5 (Fig. 32.10). Such a display
is meant to reconstruct the values in the transformed contingency table Z by
projections of points representing rows upon axes representing columns (or vice
versa):

z_ij = Σ_k s_ik l_jk    cf. eq. (32.46)

where:

S = A Λ^{1/2} and L = B Λ^{1/2}    cf. eq. (32.41)

This type of biplot does not reproduce distances or angles accurately, especially
when λ_1 and λ_2 are very distinct. It can be shown that the distortion of distances that
occurs in the vertical direction is proportional to the square root of λ_1/λ_2. The
advantage of this type of biplot is that it allows us to construct bipolar axes for any

Fig. 32.10. CFA biplot computed from the data in Table 32.10. Factor scaling coefficients were defined
as α = β = 0.5. This definition allows us to draw bipolar axes through the four educational categories,
showing the contrast between women and men (horizontally) and between chemistry and other
fields (vertically).

pair of categories. For example, the more horizontal axes represent the contrasts
between women and men. The more vertical axes display contrasts between chem-
istry and the other fields. By perpendicular projection of the circles upon any
bipolar axis one obtains a relative ordering of the time intervals according to a
particular contrast. In the case of the women/men contrast in other fields (the
slightly positively sloping axis) we first note a somewhat stable situation until
1966, when the contrast evolves rapidly in favour of women doctorates. Similarly,
the chemistry/other fields contrast in women (the more vertical axis at the right)
shows a strong regression of chemistry until about 1975. In later years the situation
seems to have stabilized.
Both types of symmetric displays, exhibited in Figs. 32.8 and 32.10, have their
merits. They are called symmetric because they produce equal variances in the
scores and in the loadings. In the case when α = β = 1, we obtain that the variances
along the horizontal and vertical axes are equal to the eigenvalues λ² associated with
the dominant latent vectors. In the other case, when α = β = 0.5, the variances are
found to be equal to the singular values λ.

32.7 Log-linear model

32.7.1 Historical introduction

The log-linear model (LLM) is closely related to correspondence factor analysis


(CFA). Both methods pursue the same objective, i.e. the analysis of the association
(or correspondence) between the rows and columns of a contingency table. In CFA
this can be obtained by means of double-closure of the data; in LLM this is
achieved by means of double-centring of the logarithmic data.
According to Andersen [12] early applications of LLM are attributed to the
Danish sociologist Rasch in 1963 and to Andersen himself. Later on, the approach
has been described under many different names, such as spectral map analysis
[13,14] in studies of drug specificity, as logarithmic analysis in the French
statistical literature [15] and as the saturated RC association model [16]. The term
log-bilinear model has been used by Escoufier and Junca [17]. In Chapter 31 on the
analysis of measurement tables we have described the method under the name of
log double-centred principal components analysis.
Here we develop the method specifically from the point of view of contingency
tables and within the context of weighted metrics. We will show that LLM differs
only from CFA in the type of preprocessing that is applied to the contingency table.
The results of both approaches are often similar when there are no extreme
contrasts in the data.

32.7.2 Algorithm

We assume that X is a contingency table in which all elements are positive. We


are also given a vector of weights (or masses) w_n for the rows and another set of
weights w_p for the columns of the table. For convenience, we assume that all
weight coefficients are normalized to unit sum. These weights can be defined
proportionally to the marginal sums of the table, although in LLM this is not a strict
requirement. One may assign the weights according to the objective of the
analysis, i.e. by giving a more prominent weight to the more relevant rows or
columns than to the others, or by assigning identical weights to rows, to columns or
to both rows and columns. The weight vectors define the weighted metrics of the
row- and column-spaces:
W_n = diag(w_n) and W_p = diag(w_p)
Preprocessing of the contingency table involves the taking of (natural) logarithms:
y_ij = log(x_ij)    with i = 1, ..., n and j = 1, ..., p
followed by double centring:

z_ij = y_ij − y_i. − y_.j + y_..


where y_i., y_.j and y_.. represent the row-, column- and global means in the weighted
metrics W_n and W_p:

y_i. = Y W_p 1_p = Y w_p    or    y_i. = Σ_j w_j y_ij

y_.j = Y^T W_n 1_n = Y^T w_n    or    y_.j = Σ_i w_i y_ij                    (32.57)

y_.. = 1_n^T W_n Y W_p 1_p = w_n^T Y w_p    or    y_.. = Σ_i Σ_j w_i w_j y_ij

where the sum vectors 1_n and 1_p have been defined before in eqs. (32.15) and
(32.16).
From this point on, the analysis is identical to that of CFA. Briefly, this involves
a generalized singular vector decomposition (SVD) of Z using the metrics W„ and
W^, such that:
Z = AAB'^ cf. eq. (32.50)
where A is the rxr diagonal matrix of singular values, A is the nxr matrix of
generalized latent vectors for rows, B is the pxr matrix of generalized latent
vectors for columns, and where r is the rank of Z. The generalized latent vectors A
and B are orthogonal in the weighted metrics W^ and W^, which implies that:
A^W^A = B^W^B = I,
For the same reason as for double-closure, double-centring always reduces the
rank of the data matrix by one, as a result of the introduction of a linear dependence
among the rows and columns of the data table.
Biplots can be constructed in the same way as in the case of CFA, by defining
scores S and loadings L, cf. eq. (32.41):
S = A Λ^α
L = B Λ^β
The choice of α = β = 1 will reproduce distances between rows and between
columns of Z. The choice α = β = 0.5 allows one to reconstruct the values and
contrasts of Z by perpendicular projections of points, representing rows, upon axes
representing columns (or vice versa). With these two choices for α and β, the
analysis is symmetrical with respect to rows and columns.
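As an illustration of the algorithm just described, the sketch below carries out the weighted log double-centring and the generalized SVD with numpy and returns the biplot coordinates S = A Λ^α and L = B Λ^β. It is only an outline under the stated assumptions (all elements of X positive, weights normalized to unit sum); the function name and argument names are ours, not part of any published implementation.

import numpy as np

def llm_biplot_coordinates(X, wn=None, wp=None, alpha=1.0, beta=1.0):
    # Weighted log double-centring of a contingency table X followed by a
    # generalized SVD; returns biplot scores S and loadings L.
    n, p = X.shape
    wn = np.full(n, 1.0 / n) if wn is None else wn / wn.sum()
    wp = np.full(p, 1.0 / p) if wp is None else wp / wp.sum()

    Y = np.log(X)                              # all elements of X must be positive
    yi = Y @ wp                                # weighted row means
    yj = Y.T @ wn                              # weighted column means
    yg = wn @ Y @ wp                           # weighted global mean
    Z = Y - yi[:, None] - yj[None, :] + yg     # double-centred log table

    # generalized SVD in the metrics Wn, Wp: Z = A Lambda B^T
    U, lam, Vt = np.linalg.svd(np.sqrt(wn)[:, None] * Z * np.sqrt(wp)[None, :],
                               full_matrices=False)
    A = U / np.sqrt(wn)[:, None]
    B = Vt.T / np.sqrt(wp)[:, None]

    S = A * lam ** alpha                       # scores,   S = A Lambda^alpha
    L = B * lam ** beta                        # loadings, L = B Lambda^beta
    return S, L, lam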

The conventions of the LLM biplot are similar to those defined for the CFA
biplot. Circles represent the row-categories and squares denote the column-
categories of the contingency table. When the areas of circles and squares are
made proportional to the marginal sums of the table, then the areas confer a
notion of size or importance of the row- and column-categories that are represented.
The positions of the circles and squares from the origin (which appears as a small
cross near the centre) express the degree of difference or contrast of the corres-
ponding rows and columns with respect to their average profiles. Rows or columns
that are represented close to the origin show little contrast with their average
profiles. Those that are further away will have greater contrasts in their profiles.
The direction in which circles and squares move from the origin, in the same or in
opposite directions, is an expression of their positive or negative correspondence.
The way interaction is expressed in the LLM biplot is identical to that of the CFA
biplot. We can also define bipolar axes through two selected row-categories or
through two selected column-categories which form the poles of bipolar axes.
Additionally, these axes can be calibrated according to the ratios of the two
categories that define the axes. Perpendicular projection of the centres of the
circles upon any bipolar axis produces an (approximate) reading of the corres-
ponding ratios, as has been explained in Chapter 31. The approximation depends
on the precision with which the points are represented in the plane of the plot and is
limited by the visual interpolation on logarithmic scales.
LLM can also be defined as an expansion of a contingency table X using the
generalized latent vectors in A and B and the singular values in A by analogy with
eq. (32.53):

x_ij = (x_i. x_.j / x_..) exp(Σ_k λ_k a_ik b_jk)    with i = 1, ..., n and j = 1, ..., p    (32.58)

where x_i., x_.j and x_.. are the geometric row-, column- and global means of X. If the
contrasts in the data are small, then the exponents in the above expression will be
small also. In that case we can further expand the exponential function, which
produces:

x_ij ≈ (x_i. x_.j / x_..)(1 + Σ_k λ_k a_ik b_jk)

which is similar to the expansion in eq. (32.53) obtained with CFA.

32.7.3 Application

The LLM approach described above has been applied to the 35×4 contingency
Table 32.10 after some modification. In this case, we have replaced the pooled data
in the first five rows by the corresponding average annual values. Further, columns
1 and 2 have been combined with columns 3 and 4 in order to produce the
categories of women in all fields and of men in all fields. The reason for the change
is our objective to produce an analysis of the ratios between chemistry and all
fields rather than between chemistry and the other fields.
In this analysis, weight coefficients for rows and for columns have been defined
as constants. They could have been made proportional to the marginal sums of
Table 32.10, but this would have down-weighted the influence of the earlier years, which
we wished to avoid in this application. As with CFA, this analysis yields three
latent vectors which contribute respectively 89, 10 and 1% to the interaction in the
data. The numerical results of this analysis are very similar to those in Table 32.11
and, therefore, are not reproduced here. The only notable discrepancies are in the
precision of the representation of the early years up to 1972, which is less than in
the previous application, and in the precision of the representation of the category
of women chemists which is better than in the previous analysis by CFA (0.960 vs
0.770).
The overall interpretation of the LLM biplot of Fig. 32.11 is the same as
obtained with the CFA biplot of Fig. 32.10. The first (horizontal) latent variable
seems to be associated primarily with the women/men contrast, while the second
(vertical) latent variable is mostly associated with the chemistry/all fields contrast.
Thermometer plots, which represent the scores of the various time intervals as a
function of time, are similar to those in Fig. 32.9. They are not reproduced here as
they point to the same remarkable events, i.e. the sustained rise of the proportion of
women since 1966 and the recovery of the share of chemistry in 1973.
In Fig. 32.11 the women/men ratio in 1980 is estimated visually to be 0.190 in
chemistry and 0.430 in all fields. The exact ratios as computed from Table 32.10
are 0.198 and 0.435 respectively. The chemistry/all fields ratio in 1970 is visually
estimated at 0.090 for men and at 0.041 for women. The exact values from Table
32.10 are 0.080 and 0.046 respectively.
The question may be asked why there should be different methods when they
provide more or less the same kind of information. One answer is that not all
contingency tables will yield results that agree as well as in the case we have
studied here. When there are very large contrasts, the results of CFA and LLM tend
to disagree. This occurs when there are several zeroes in the table, which in LLM
have to be replaced by small positive values. In that case it may pay off to analyse
the same contingency table with different methods, each of which may reveal
different aspects of the multivariate patterns [18]. Furthermore, LLM can be

Fig. 32.11. Log-linear model (LLM) biplot computed from the data in Table 32.10. Conventions are
the same as in Fig. 32.10. The areas of circles (representing years) and of squares (representing catego-
ries) are made proportional to the row- and column-totals in Table 32.10.

applied to tables of heterogeneous data, i.e. data that have been recorded with
different units, while CFA can only be applied to tables of homogeneous data,
unless the heterogeneous scales of measurement are subdivided into discrete
categories.

References

1. B.S. Everitt, The Analysis of Contingency Tables. Chapman and Hall, London, 1977.
2. M.J. Greenacre, Theory and Applications of Correspondence Analysis. Academic Press, Lon-
don, 1984.
3. A. Agresti, Categorical Data Analysis. Wiley, New York, 1990.
4. M.G. Kendall and A. Stuart, The Advanced Theory of Statistics. Vol. II. Ch. Griffin, London,
1960.
5. M.O. Hill, Reciprocal averaging: an eigenvector method of ordination. J. Ecol., 61 (1973)
237-249.
6. J.-P. Benzécri, L'Analyse des Données. Vol. II, L'Analyse des Correspondances. Dunod, Paris,
1973.
7. J.-P. Benzécri, Histoire et Préhistoire de l'Analyse des Données. Dunod, Paris, 1982.

8. M.O. Hill, Correspondence analysis: a neglected multivariate method. Appl. Statist., 23 (1974)
340-355.
9. K.R. Gabriel, The biplot graphic display of matrices with application to principal components
analysis. Biometrika, 58 (1971) 453-467.
10. O.M. Kvalheim, Interpretation of direct latent-variable projection methods and their aims and
use in the analysis of multicomponent spectroscopic and chromatographic data. Chemom.
Intell. Lab. Syst., 4 (1988) 11-25.
11. K.G. Everett and W.S. Deloach, Chemistry doctorates awarded to women in the United States,
A historical perspective. J. Chem. Educ., 68 (1991) 545-547.
12. E.B. Andersen, Discussion of paper by L.A. Goodman. Int. Stat. Review, 54 (1986) 271-272.
13. P.J. Lewi, Spectral mapping, a technique for classifying biological activity profiles of chemical
compounds. Arzneim. Forsch. (Drug Res.), 26 (1976) 1295-1300.
14. P.J. Lewi, Spectral Map Analysis, Analysis of contrasts especially from log-ratios. Chemom.
Intell. Lab. Syst., 5 (1989) 105-116.
15. J.B. Kasmierczak, Analyse logarithmique. Deux exemples d'application. Rev. Statist.
Appliquée, 33 (1985) 13-24.
16. L.A. Goodman, Some useful extensions of the usual correspondence analysis approach and the
usual log-linear models approach in the analysis of contingency tables. Int. Stat. Review, 54
(1986) 243-309.
17. Y. Escoufier and S. Junca, Least squares approximation of frequencies or their logarithms. Int.
Stat. Rev., 54 (1986) 279-283.
18. A. Thielemans, P.J. Lewi and D.L. Massart, Similarities and differences among multivariate
display techniques by Belgian Cancer Mortality Distribution data. Chemom. Intell. Lab. Syst.,
3 (1988) 277-300.

Chapter 33

Supervised Pattern Recognition

33.1 Supervised and unsupervised pattern recognition

In Section 17.9 a method, called linear discriminant analysis, was introduced


and applied to derive a rule which would discriminate between wines from three
origins. To start with, about 100 wine samples of known origin were analyzed: 8
variables (or features) were determined for each. One wanted to know if, in some
way, these results could be used to derive a procedure to determine the origin of
new samples. In pattern recognition terminology, this question was rephrased as:
use the learning or training objects (i.e. those with known origin) to derive a
classification rule which allows one to classify new objects of unknown origin into one
of the three known classes, based on the values of the features of the new object. This
is called supervised pattern recognition or supervised learning.
Mathematically, this means that one needs to assign portions of an 8-dimensional
space to the three classes. A new sample is then assigned to the class which
occupies the portion of space in which the sample is located. Supervised pattern
recognition is distinct from unsupervised pattern recognition. In the latter one
applies essentially clustering methods (Chapter 30) to classify objects into classes
that are not known beforehand. In supervised pattern recognition, one knows the
classes and has to decide in which of those an object should be classified.
Supervised pattern recognition techniques essentially consist of the following
steps.

1. Selection of a training or learning set. This consists of objects of known


classification for which a certain number of variables are measured.
2. Feature selection, i.e. the selection of variables that are meaningful for the
classification and elimination of those that have no discriminating (or, for
certain techniques, no modelling power). This step is discussed further in
Section 33.3.
3. Derivation of a classification rule, using the training set. This is the subject
of Section 33.2.
4. Validation of the classification rule, using an independent test set. This is
described in more detail in Section 33.4.

Many books have been published about pattern recognition; one of these is
directed towards chemometrics [1].

33.2 Derivation of classification rules

33.2.1 Types of classification rules

There are many types of pattern recognition which essentially differ in the way
they define classification rules. In this section, we will describe some of the
approaches, which we will then develop further in the following sections. We will
not try to develop a classification of pattern recognition methods but merely
indicate some characteristics of the methods that are found most often in the
chemometric literature and some differences between those methods.
A first distinction which is often made is that between methods focusing on
discrimination and those that are directed towards modelling classes. Most methods
explicitly or implicitly try to find a boundary between classes. Some methods such
as linear discriminant analysis (LDA, Sections 33.2.2 and 33.2.3) are designed to
find explicit boundaries between classes while the k-nearest neighbours (k-NN,
Section 33.2.4) method does this implicitly. Methods such as SIMCA (Section
33.2.7) put the emphasis more on similarity within a class than on discrimination
between classes. Such methods are sometimes called disjoint class modelling
methods. While the discrimination oriented methods build models based on all the
classes concerned in the discrimination, the disjoint class modelling methods
model each class separately.
LDA and several other supervised methods focus on finding optimal boundaries
between classes: their first goal is to discriminate. In Section 17.9 we explained
also with an example from clinical chemistry that canonical variates can be
determined and plotted to discriminate between classes and that this is a way of
performing LDA. The example concerned the thyroid gland. People whose thyroid
gland functions normally are called euthyroid (EU) and patients whose thyroid
gland is too active or not active enough are called, respectively, hyperthyroid
(HYPER) or hypothyroid (HYPO). Clinicians want to make a distinction between
the three classes. This can be done using five chemical determinations, among
which are, e.g., serum thyroxine and thyroid-stimulating hormone. This is clearly a
multivariate (five-dimensional) situation and LDA is applied. Allocation of patients
to one of the classes can be achieved in a more formal way by determining the
centroid of each group and drawing a boundary half-way between the two centroids
[2]. This is shown in Fig. 33.1.
Figure 33.1 is typical of many situations in clinical chemistry: it shows a tight
normal group (the EU group) and spreading out from it much more disperse


Fig. 33.1. Canonical variate plot for three classes with different thyroid status. The boundaries are ob-
tained by linear discriminant analysis [2].

abnormal groups (the HYPO and HYPER groups). In fact, this kind of picture is
also found in many non-clinical situations (see, for instance, the air pollution
situation of Fig. 17.10). If one now determines boundaries as described in the
preceding paragraph, namely by situating linear boundaries half-way between the
centroids of adjacent classes, some patients of the more disperse abnormal classes
are classified as members of the more condensed class.
This classification problem can then be solved better by developing more
suitable boundaries. For instance, using so-called quadratic discriminant analysis
(QDA) (Section 33.2.3) or density methods (Section 33.2.5) leads to the boundaries
of Fig. 33.2 and Fig. 33.3, respectively [3,4]. Other procedures that develop
irregular boundaries are the nearest neighbour methods (Section 33.2.4) and neural
nets (Section 33.2.9).
One of the problems with discrimination-oriented methods is that we need to
classify each object in one of the given classes. It is, however, quite possible that an

Fig. 33.2. As Fig. 33.1 but with boundaries obtained by quadratic discriminant analysis [3].

object should not be classified in any of these classes. Returning to the wine
example, where a classification between wines from three given origins was
wanted, we must realize that the sample submitted for classification may, in
reality, belong to none of the given origins, but to a fourth one. However, using the
discrimination-oriented methods, we will classify it necessarily in one of the three
given regions. A different approach to supervised pattern recognition can then be
useful. This consists of making a separate model of each class. Objects which fit
the model for a class are considered part of it and objects which do not fit are
classified as non-members. In discrimination terms, we could say that the class
model discriminates between membership and non-membership of a certain class.
In statistical terms, we can state that these methods perform outlier tests.
The conceptually simplest model, which for reasons explained later is called
UNEQ, is based on the multivariate normal distribution. Suppose we have carried

Fig. 33.3. As Fig. 33.1 but with boundaries obtained by a density method [4].

out two tests, x_1 and x_2, with which we want to describe the health of a patient. Only
healthy patients are investigated. In Fig. 33.4, we could say that the ellipse
describing the 95% confidence limit for a bivariate normal distribution can be
considered as a model of the class of healthy patients. Those within its limits are
considered healthy and those outside would be considered non-members of the
healthy class. The bivariate normal distribution is therefore a model of the healthy
class.
In three dimensions (Fig. 33.5), the model takes the shape of an ellipsoid and in
m dimensions, we must imagine an m-dimensional hyperellipsoid. In the figure,
two classes are considered and we now observe that four situations can be
encountered when classifying an object, namely:
(a) the object is part of class K,
(b) the object is part of class L,
(c) the object is not a member of class K or L: it is an outlier, and

Fig. 33.4. Ninety-five percent confidence limit for a bivariate distribution as class envelope.

Fig. 33.5. Class envelopes in three dimensions as derived from the three-variate normal distribution.

(d) if K and L overlap the object can be a member of both classes K and L: it is
situated in a region of doubt.
UNEQ is applied only when the number of variables is relatively low. For more
variables, one does not work with the original variables, but rather with latent
variables. A latent variable model is built for each class separately. The best known
such method is SIMCA.
We also make a distinction between parametric and non-parametric techniques.
In the parametric techniques such as linear discriminant analysis, UNEQ and
SIMCA, statistical parameters of the distribution of the objects are used in the
derivation of the decision function (almost always a multivariate normal distribution

is assumed). The most important disadvantage of parametric methods is that statistical
requirements must be fulfilled for the methods to be applied correctly. The non-
parametric methods such as nearest neighbours (Section 33.2.4), density methods
(Section 33.2.5) and neural networks (Section 33.2.9 and Chapter 44) are not
explicitly based on distribution statistics. The most important advantage of the
parametric methods is that probabilities of correct classification can be more easily
estimated than with most non-parametric methods.

33.2.2 Canonical variates and linear discriminant analysis

LDA is the best studied method of pattern recognition. It was originally proposed
by Fisher [2] and is applied very often in chemometrics. Applications can be found
for instance in the classification of Eucalyptus oils based on gas-chromatographic
data [6], the automatic recognition of substance classes from GC/MS [7], the
recognition of tablets and capsules with different dosages with the use of NIR
spectra [8] and in the already cited clinical chemical example (see Section 33.1). It
appears that there are several ways of deriving essentially the same methodology.
This may be confusing and, following a short article by Fearn [9], we will try to
explain the different approaches. A detailed overview is found in the book by
McLachlan [10].
Let us first consider two classes K and L in a bivariate space (x_1, x_2). Figure
33.6a shows the objects in this space. In Fig. 33.6b bivariate probability ellipses
are drawn representing the normal (bivariate) probability distributions to which
the objects belong. Since there are two classes, there are two such ellipses.
Basically, an object will be classified in the class for which it has the highest
probability. In Fig. 33.6b, object A is classified in class K because it has a (much)
higher probability in K than in L. In Fig. 33.6c, an additional ellipse is drawn for
each class. These ellipses both represent the same probability level in their
respective classes; they touch in point O half-way between the two class centres.
Line a is the tangent to the two ellipses in point O. Any point to the left of it has a
higher probability to belong to K and to the right it is more probable that it belongs
to L. Line a can be used as a boundary, separating K from L.
In practice, we would prefer an algebraic way to define the boundary. For this
purpose, we define line d, perpendicular to a. One can project any object or point
on that line. In Fig. 33.6c this is done for point A. The location of A on d is given by
its score on d. This score is given by:

D = w_0 + w_1x_1 + w_2x_2    (33.1)

When working with standardized data w_0 = 0. The coefficients w_1 and w_2 are
derived in a way described later, such that D = 0 in point O, D > 0 for objects
belonging to L and D < 0 for objects of K. This then is the classification rule.

Fig. 33.6. (a) Two classes K and L to be discriminated; (b) confidence limits around the centroids of K
and L; (c) the iso-probability confidence limits touch in O; a is a line tangential to both ellipses; d is the
optimal discriminating direction; A is an object.

It is observed that D as defined by eq. (33.1) is a latent variable, in the same way
as a principal component. We can consider LDA, as was the case for principal
components analysis, as a feature reduction method. Let us therefore again consider
the two-dimensional space of Fig. 33.6. For feature reduction, we need to determine
a one-dimensional space (a line) on which the points will be projected from higher,
here two-dimensional space. However, while principal components analysis selects
a direction which retains maximal structure in a lower dimension among the data,
LDA selects a direction which achieves maximum separation among the given
classes. The latent variable obtained in this way is a linear combination of the
original variables. This function is called the canonical variate. When there are k
classes, one can determine k − 1 canonical variates. In Fig. 33.1, two canonical
variates are plotted against one another for three overlapping classes. A new
sample can be allocated by determining its location in the figure. The second way
of introducing LDA, discovered by Fisher, is therefore to rotate through O a line

Fig. 33.7. A univariate classification problem.

until the optimal discriminating direction is found (d in Fig. 33.6c). This rotation is
determined by the values of w_1 and w_2 in eq. (33.1).
These weights depend on several characteristics of the data. To understand
which ones, let us first consider the univariate case (Fig. 33.7). Two classes, K and
L, have to be distinguished using a single variable, jCj. It is clear that the discrim-
ination will be better when the distance between Xj^^ and x^j (i.e. the mean values,
or centroids, of jc, for classes K and L) is large and the width of the distributions is
small or, in other words, when the ratio of the squared difference between means to
the variance of the distributions is large. Analytical chemists would be tempted to
say that the resolution should be as large as possible.
When we consider the multivariate situation, it is again evident that the discrim-
inating power of the combined variables will be good when the centroids of the two
sets of objects are sufficiently distant from each other and when the clusters are
tight or dense. In mathematical terms this means that the between-class variance is
large compared with the within-class variances.
In the method of linear discriminant analysis, one therefore seeks a linear
function of the variables, D, which maximizes the ratio between both variances.
Geometrically, this means that we look for a line through the cloud of points, such
that the projections of the points of the two groups are separated as much as
possible. The approach is comparable to principal components, where one seeks a
line that explains best the variation in the data (see Chapter 17). The principal
component line and the discriminant function often more or less coincide (as is the
case in Fig. 33.8a) but this is not necessarily so, as shown in Fig. 33.8b.
Generalizing eq. (33.1) to m variables, we can write:
D = w^T x + w_0    (33.2)
where it can be shown [10] that the weights w are determined for a two-class
discrimination as
w^T = (x̄_1 − x̄_2)^T S^{−1}    (33.3)
and

Fig. 33.8. Situation where principal component (PC) and linear discriminant function (DF) are essen-
tially the same (a) and very different (b).

w_0 = −(1/2)(x̄_1 − x̄_2)^T S^{−1} (x̄_1 + x̄_2)    (33.4)

In eqs. (33.3) and (33.4), x̄_1 and x̄_2 are the sample mean vectors that describe the
location of the centroids in m-dimensional space and S is the pooled sample
variance-covariance matrix of the training sets of the two classes.
The use of a pooled variance-covariance matrix implies that the variance-
covariance matrices for both populations are assumed to be the same. The conse-
quences of this are discussed in Section 33.2.3.

Example:
A simple two-dimensional example concerns the data from Table 33.1 and Fig.
33.9. The pooled variance-covariance matrix is obtained as [K^T K + L^T L]/(n_1 + n_2
− 2), i.e. by first computing for each class the centred sum of squares (for the
diagonal elements) and the cross-products between variables (for the other

Fig. 33.9. LDA applied to the data of Table 33.1; n is a new object to be classified.

elements), then summing the two matrices and dividing each element by n_1 + n_2 − k
(here 10 + 10 − 2 = 18). As an example we compute the cross-term s_12 (which is
equal to s_21). This calculation is performed in Table 33.2. In the same way we can
compute the diagonal elements, yielding

S = [ 2.78   3.78
      3.78  10.56 ]

and

S^{−1} = [  0.70  −0.25
           −0.25   0.18 ]

Since

(x̄_1 − x̄_2) = [ (6 − 11) ]  =  [ −5 ]
               [ (9 − 7)  ]     [  2 ]

TABLE 33.1
Example data set for linear discriminant analysis

Class 1                          Class 2
Object    x_1    x_2             Object    x_1    x_2
1          8     15              11         11     11
2          7     12              12          9      5
3          8     11              13         11      8
4          5     11              14         12      6
5          7      9              15         13     10
6          4      8              16         14     12
7          6      8              17         10      7
8          4      5              18         12      4
9          5      6              19         10      5
10         6      5              20          8      2
mean       6      9              mean       11      7

TABLE 33.2
Computation of the cross-product term in the pooled variance-covariance matrix for the data of Table 33.1

Class 1                          Class 2
(8 − 6)(15 − 9) = 12             (11 − 11)(11 − 7) = 0
(7 − 6)(12 − 9) = 3              (9 − 11)(5 − 7) = 4
(8 − 6)(11 − 9) = 4              (11 − 11)(8 − 7) = 0
(5 − 6)(11 − 9) = −2             (12 − 11)(6 − 7) = −1
(7 − 6)(9 − 9) = 0               (13 − 11)(10 − 7) = 6
(4 − 6)(8 − 9) = 2               (14 − 11)(12 − 7) = 15
(6 − 6)(8 − 9) = 0               (10 − 11)(7 − 7) = 0
(4 − 6)(5 − 9) = 8               (12 − 11)(4 − 7) = −3
(5 − 6)(6 − 9) = 3               (10 − 11)(5 − 7) = 2
(6 − 6)(5 − 9) = 0               (8 − 11)(2 − 7) = 15

Σ class 1 = 30                   Σ class 2 = 38

Degrees of freedom: 10 + 10 − 2 = 18.
Cross-product term: (30 + 38)/18 = 3.78

w_1 = −4.00
w_2 = 1.62
w_0 = 21.08
and
D = 21.08 − 4.00 x_1 + 1.62 x_2
We can now classify a new object n. Consider an object with x_1 = 9 and x_2 = 13. For
this object

D = 21.08 − 4.00 × 9 + 1.62 × 13 = 6.14
Since D > 0, it is classified as belonging to class 1.
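The worked example can be checked with a few lines of numpy; the sketch below implements eqs. (33.2)-(33.4) for the data of Table 33.1. The variable names are ours, and the numbers obtained should agree with those above (w_1 ≈ −4.00, w_2 ≈ 1.62, w_0 ≈ 21.08, D ≈ 6.14).

import numpy as np

# Data of Table 33.1
X1 = np.array([[8, 15], [7, 12], [8, 11], [5, 11], [7, 9],
               [4, 8], [6, 8], [4, 5], [5, 6], [6, 5]], float)       # class 1
X2 = np.array([[11, 11], [9, 5], [11, 8], [12, 6], [13, 10],
               [14, 12], [10, 7], [12, 4], [10, 5], [8, 2]], float)  # class 2

m1, m2 = X1.mean(0), X2.mean(0)
# pooled variance-covariance matrix on the centred data (cf. Table 33.2)
S = ((X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)) / (len(X1) + len(X2) - 2)

w = np.linalg.solve(S, m1 - m2)      # w = S^-1 (x1_bar - x2_bar), eq. (33.3)
w0 = -0.5 * w @ (m1 + m2)            # eq. (33.4)

x_new = np.array([9.0, 13.0])
D = w0 + w @ x_new                   # eq. (33.2); approximately 6.14
print("class 1" if D > 0 else "class 2")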

For two classes, Fisher arrived at similar results for the equations given above
by considering LDA as a regression problem. A response variable y, indicating class
membership, is introduced: y = −1 for all objects belonging to class K and y = +1
for all objects belonging to class L. We then obtain the regression equation for y =
y(x_1, x_2). It is shown that, when there is an equal number of objects in K and L, the
same w values are obtained. If the number is not the same, then w_1 and w_2 are still
the same but w_0 changes.
So far, we have described only situations with two classes. The method can also
be applied to K classes. It is then sometimes called descriptive linear discriminant
analysis. In this case the weight vectors can be shown to be the eigenvectors of the
matrix:
A = W^{−1} B    (33.5)
where W is the within-group sum of squares and cross-products matrix and B is the
between-groups sum of squares and cross-products matrix. As described in [11]
this leads to non-symmetrical eigenvalue problems.
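For more than two classes the weight vectors can be obtained numerically as sketched below: the within-group (W) and between-group (B) sums of squares and cross-products matrices are accumulated and the leading eigenvectors of W^{-1}B are extracted (eq. (33.5)). This is only an illustrative numpy sketch; the function name canonical_variates and the choice to keep the eigenvectors of the k − 1 largest eigenvalues are ours.

import numpy as np

def canonical_variates(X, y):
    # Descriptive LDA: eigenvectors of W^-1 B, cf. eq. (33.5).
    classes = np.unique(y)
    grand_mean = X.mean(axis=0)
    m = X.shape[1]
    W = np.zeros((m, m))                 # within-group SSCP matrix
    B = np.zeros((m, m))                 # between-group SSCP matrix
    for c in classes:
        Xc = X[y == c]
        d = Xc - Xc.mean(axis=0)
        W += d.T @ d
        g = (Xc.mean(axis=0) - grand_mean)[:, None]
        B += len(Xc) * (g @ g.T)
    evals, evecs = np.linalg.eig(np.linalg.solve(W, B))   # non-symmetric problem
    order = np.argsort(evals.real)[::-1]
    k = len(classes) - 1                 # at most k - 1 canonical variates
    return evecs.real[:, order[:k]]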

33.2.3 Quadratic discriminant analysis and related methods

There is still another approach to explain LDA, namely by considering the


Mahalanobis distance (see Chapter 30) to a class. All these approaches lead to the
same result. The Mahalanobis distance is the distance to the centre of a class taking
correlation into account and is the same for all points on the same probability
ellipse. For equally probable classes, i.e. classes with the same number of training
objects, a smaller Mahalanobis distance to class K than to class L, means that the
probability that the object belongs to class K is larger than that it belongs to L.

The Mahalanobis distance representation will help us to have a more general


look at discriminant analysis. The multivariate normal distribution for m variables
and class K can be described by

f(x) = (2π)^{−m/2} |Γ_K|^{−1/2} exp(−D²_MK / 2)    (33.6)

with D²_MK the Mahalanobis distance to class K:

D²_MK = (x − μ_K)^T Γ_K^{−1} (x − μ_K)    (33.7)

where μ_K and Γ_K are the population mean vector and variance-covariance matrix
of K respectively. They can be estimated by the sample parameters x̄_K and C_K.
From these equations, one can derive the following classification rule: classify
object u of unknown class in the class K for which D²_MK,u is minimal, given

D²_MK,u = (x_u − x̄_K)^T C_K^{−1} (x_u − x̄_K)    (33.8)

where x_u is the vector of x values describing object u. This equation is applied


when the a priori probability of the classes is the same. When this is not so, an
additional term has to be added.
When all C_K are considered equal, this means that they can be replaced by S, the
pooled variance-covariance matrix, which is the case for linear discriminant
analysis. The discrimination boundaries then are linear and D²_MK,u is given by

D²_MK,u = (x_u − x̄_K)^T S^{−1} (x_u − x̄_K)    (33.9)

Friedman [12] introduced a Bayesian approach; the Bayes equation is given in


Chapter 16. In the present context, a Bayesian approach can be described as
finding a classification rule that minimizes the risk of misclassification, given the
prior probabilities of belonging to a given class. These prior probabilities are
estimated from the fraction of each class in the pooled sample:

π_K = n_K / N

where π_K is the prior probability that an object belongs to class K, n_K is the number
of objects in the training set for class K and N is the total number of objects in the
training set. One computes D²_MK,u as

D²_MK,u = (x_u − x̄_K)^T C_K^{−1} (x_u − x̄_K) + ln|C_K| − 2 ln π_K    (33.10)

and classifies u in the class for which this value is smallest.
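A direct transcription of this rule into numpy might look as follows. It is a sketch only: class_data is assumed to be a dictionary mapping class labels to training matrices, the class covariance C_K is estimated with np.cov, and the prior is taken as n_K/N as in the text.

import numpy as np

def qda_classify(x_u, class_data):
    # Quadratic discriminant rule of eq. (33.10): assign x_u to the class with
    # the smallest (x-m)^T C^-1 (x-m) + ln|C| - 2 ln(prior).
    N = sum(len(X) for X in class_data.values())
    best_label, best_value = None, np.inf
    for label, X in class_data.items():
        m = X.mean(axis=0)
        C = np.cov(X, rowvar=False)          # class variance-covariance matrix
        prior = len(X) / N                   # pi_K = n_K / N
        diff = x_u - m
        d2 = diff @ np.linalg.solve(C, diff) \
             + np.log(np.linalg.det(C)) - 2 * np.log(prior)
        if d2 < best_value:
            best_label, best_value = label, d2
    return best_label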




Fig. 33.10. Situations with unequal variance-covariance: (a) unequal variance, (b) unequal co-
variance.

Equation (33.10) is applied in what is called quadratic discriminant analysis


(QDA). The equations can be shown to describe a quadratic boundary separating
the regions where D²_MK,u is minimal for the classes considered.
As stated earlier, LDA requires that the variance-covariance matrices of the
classes being considered can be pooled. This is only so when these matrices can be
considered to be equal, in the same way that variances can only be pooled, when
they are considered equal (see Section 2.1.4.4). Equal variance-covariance means
that the 95% confidence ellipsoids have an equal volume (variance) and orientation
in space (covariance). Figure 33.10 illustrates situations of unequal variance or
covariance. Clearly, Fig. 33.1 displays unequal variance-covariance, so that one
must expect that QDA gives better classification, as is indeed the case (Fig. 33.2).
When the number of objects is smaller than the number of variables m, the
variance-covariance matrix is singular. Clearly, this problem is more severe for
QDA (which requires m < n_K) than for LDA, where the variance-covariance
matrix is pooled and therefore the number of objects N is the sum of all objects

over all classes. It follows that both QDA and LDA have advantages: QDA is less
subject to constraints in the distribution of objects in space and LDA requires fewer
objects than QDA. Friedman [12] has also shown that regularised discriminant
analysis (RDA), a form of discriminant analysis intermediate between QDA and
LDA, has advantages compared to both: it is less subject to constraints without
requiring more objects. The method has been used in chemometrics, e.g. for the
classification of seagrass [13] or pharmaceutical preparations [14].

33.2.4 The k-nearest neighbour method

A mathematically very simple classification procedure is the nearest neighbour


method. In this method one computes the distance between an unknown object u
and each of the objects of the training set. Usually one employs the Euclidean
distance D (see Section 30.2.2.1) but for strongly correlated variables, one should
prefer correlation based measures (Section 30.2.2.2). If the training set consists of
n objects, then n distances are calculated and the lowest of these is selected. If this
is D_ul, where u represents the unknown and l an object from learning class L, then
one classifies u in group L. A three-dimensional example is given in Fig. 33.11.
Object u is closest to an object of the class L and is therefore considered to be a
member of that class.
In a more sophisticated version of this technique, called the k-nearest neighbour
method (k-NN method), one selects the k nearest objects to u and applies a majority
rule: u is classified in the group to which the majority of the k objects belong.
Figure 33.12 gives an example of a 3-NN method. One selects the three nearest
neighbours (A, B and C) to the unknown u. Since A and B belong to L, one

Fig. 33.11. 1-NN classification of the unknown u.



Fig. 33.12. 3-NN classification of the unknown u.

classifies u in category L. The choice of k is determined by optimization: one


determines the prediction ability with different values of k. Usually it is found that
small values of k (3 or 5) are to be preferred.
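A minimal k-NN classifier along these lines takes only a few lines of numpy, as sketched below; a simple majority vote is used and the function and variable names are ours.

import numpy as np

def knn_classify(x_u, X_train, y_train, k=3):
    # k-nearest neighbours with Euclidean distances and a majority vote.
    d = np.sqrt(((X_train - x_u) ** 2).sum(axis=1))   # distances to all training objects
    nearest = np.argsort(d)[:k]                       # indices of the k nearest neighbours
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                  # most frequent class among the k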
The method has several advantages, the first being its mathematical simplicity,
which does not prevent it from yielding classification results as good and often
better than the much more complex methods discussed in other sections of this
chapter. Moreover, it is free from statistical assumptions, such as normality of the
distribution of the variables.
This does not mean that the method is not subject to any problem. One such
problem is that the method is sensitive to gross inequalities in the number of
objects in each class. Figure 33.13 gives an example. The unknown is classified
into the class with largest membership, because in the zone of overlap between
classes more of its members are present. In fact, the unknown is closer to the centre
of the other class, so that its classification is at least doubtful. This can be overcome
by not using a simple majority criterion but an alternative one, such as "classify the
object in the larger class K if for k = 10 at least 9 neighbours (out of 10) belong to K;
otherwise classify the test object in the smaller class L". The selection of k and the
alternative criterion value should be determined by optimization [15].

Fig. 33.13. A situation which necessitates classification of the unknown u by alternative A:-NN criteria.

The nearest neighbour method is often applied, with, in view of its simplicity,
surprisingly good results. An example where k-NN performs well in a comparison
with neural networks and SIMCA (see further) can be found in [16].

33.2.5 Density methods

In density or kernel methods one imagines a potential field around the objects of
the learning set. For this reason these methods have also been called potential
methods. A variant for clustering was described in Section 30.3.3. One starts with
the selection of a potential function. Many functions can be used for this purpose,
but for practical reasons it is recommended that a simple one such as a triangular or
a Gaussian function is selected. The function is characterized by its width. This is
important for its smoothing behaviour (see below). Figure 33.14 shows a Gaussian
function for a class K in a one-dimensional space. The cumulative potential
function is determined by adding the heights of the individual potential functions
in each position along the x axis. The figure shows that the cumulative function
constitutes a continuous line which is never zero within a class. This is done
separately for each class.

Fig. 33.14. Density estimate for a test set using normal potential functions (univariate case).

Fig. 33.15. Classification of an unknown object u. f(K) and f(L) indicate the potential functions for
classes K and L.

By dividing the cumulative potential function of a class by the number of


samples contributing to it, one obtains the (mean) potential function of the class. In
this way, the potential function assumes a probabilistic character and, therefore,
the density method permits probabilistic classification.
The classification of a new object u into one of the given classes is determined
by the value of the potential function for that class in u. It is classified into the class
which has the largest value. A one-dimensional example is given in Fig. 33.15.
Object u is considered to belong to K, because at the location of u the potential
value of K is larger than that of L. The boundary between two classes is given by
those positions where the potentials caused by these two classes have the same
value. The boundaries can assume irregular values as shown in Fig. 33.3.
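The sketch below illustrates this classification step with a Gaussian potential function; the width argument plays the role of the smoothing parameter discussed next. The function name and the dictionary representation of the classes are ours; this is not the ALLOC implementation referred to below, only a minimal illustration.

import numpy as np

def potential_classify(x_u, class_data, width=1.0):
    # Place a Gaussian potential of the given width on every training object,
    # average per class, and assign x_u to the class with the largest mean
    # potential at the position of x_u.
    best_label, best_value = None, -np.inf
    for label, X in class_data.items():
        d2 = ((X - x_u) ** 2).sum(axis=1)                   # squared distances to class members
        potential = np.exp(-d2 / (2 * width ** 2)).mean()   # mean (probabilistic) potential
        if potential > best_value:
            best_label, best_value = label, potential
    return best_label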
One of the disadvantages of the method is that one must determine the smoothing
parameter by optimisation. When the smoothing parameter is too small (Fig.
33.16a) many potential functions of a learning class do not overlap with each other,
so that the continuous surface of Fig. 33.15 is not obtained. A new object u may
then have a low membership value for a class (here class K) although it clearly
belongs to that class. An excessive smoothing parameter leads to a too flat surface
(Fig. 33.16b), so that discrimination becomes less clear. The major task of the

Fig. 33.16. Influence of the smoothing parameters on the potential surfaces of classes which are (a) too
small and (b) too large.

learning procedure is then to select the most suitable value of the smoothing
parameter.
Advantages of these methods are that no a priori assumptions about distrib-
utions are necessary and that probabilistic decisions can be taken more easily than
with k-NN. In chemometrics, the method was introduced under the name ALLOC
[17, 18]. The methodology was described in detail in a book by Coomans and
Broeckaert [19]. The method was developed further by Forina and coworkers
[20,21].

33.2.6 Classification trees

In Section 18.4, we explained that inductive expert systems can be applied for
classification purposes and we refer to that section for further information and
example references. It should be pointed out that the method is essentially uni-
variate. Indeed, one selects a splitting point on one of the variables, such that it
achieves the "best" discrimination, the "best" being determined by, e.g., an
entropy function. Several references are given in Chapter 18. A comparison with
other methods can be found, for instance, in an article by Mulholland et al. [22].
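To make the idea of a univariate split concrete, the sketch below scans the candidate splitting points on a single variable and retains the one that minimizes the weighted entropy of the two resulting subsets. It is only an illustration of the principle, not the algorithm of any particular inductive expert system; the function names are ours.

import numpy as np

def entropy(y):
    # Shannon entropy of a vector of class labels.
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_split(x, y):
    # Best threshold on a single variable x: minimize the weighted entropy
    # of the two subsets created by the split.
    best_t, best_h = None, np.inf
    for t in np.unique(x)[:-1]:             # candidate splitting points
        left, right = y[x <= t], y[x > t]
        h = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        if h < best_h:
            best_t, best_h = t, h
    return best_t, best_h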
Additionally, Breiman et al. [23] developed a methodology known as classifi-
cation and regression trees (CART), in which the data set is split repeatedly and a
binary tree is grown. The way the tree is built, leads to the selection of boundaries
parallel to certain variable axes. With highly correlated data, this is not necessarily
the best solution and non-linear methods or methods based on latent variables have
been proposed to perform the splitting. A combination between PLS (as a feature
reduction method — see Sections 33.2.8 and 33.3) and CART was described by

Yeh and Spiegelman [24]. Very good results were also obtained by using simple
neural networks of the type described in Section 33.2.9 to derive a decision rule at
each branching of the tree [25]. Classification trees have been used relatively
rarely in chemometrics, but it seems that in general [26] their performance is
comparable to that of the best pattern recognition methods.

33.2.7 UNEQ, SIMCA and related methods

As explained in Section 33.2.1, one can prefer to consider each class separately
and to perform outlier tests to decide whether a new object belongs to a certain
class or not. The earliest approaches, introduced in chemometrics, were called
SIMCA (soft independent modelling of class analogy) [27] and UNEQ [28].
UNEQ can be applied when only a few variables must be considered. It is based
on the Mahalanobis distance from the centroid of the class. When this distance
exceeds a critical distance, the object is an outlier and therefore not part of the
class. Since for each class one uses its own covariance matrix, it is somewhat
related to QDA (Section 33.2.3). The situation described here is very similar to that
discussed for multivariate quality control in Chapter 20. In eq. (20.10) the original
variables are used. This equation can therefore also be used for UNEQ. For
convenience it is repeated here.
D²_i = (x_i − x̄_K)^T C^{−1} (x_i − x̄_K)    (33.11)
where D²_i is the squared Mahalanobis distance between object i and the centroid
x̄_K of the class K, and C is the variance-covariance matrix of the n training objects
defining class K (see also eq. (33.7)). When D²_i becomes too large for a certain
object, this means that it is no longer considered to be part of the class. The
Mahalanobis distance follows the Hotelling T²-distribution. The critical value T²_crit is
defined as:

T²_crit = [ m(n − 1)(n + 1) / (n(n − m)) ] F_(α; m, n−m)    (33.12)
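A minimal sketch of the UNEQ membership test is given below, assuming numpy and scipy are available and that the number of training objects n exceeds the number of variables m; the function name and the default α of 0.05 are ours.

import numpy as np
from scipy.stats import f

def uneq_member(x_i, X_class, alpha=0.05):
    # Accept x_i as a member of the class when its squared Mahalanobis
    # distance (eq. 33.11) stays below T2_crit (eq. 33.12).
    n, m = X_class.shape
    centre = X_class.mean(axis=0)
    C = np.cov(X_class, rowvar=False)            # class variance-covariance matrix
    diff = x_i - centre
    d2 = diff @ np.linalg.solve(C, diff)         # eq. (33.11)
    t2_crit = m * (n - 1) * (n + 1) / (n * (n - m)) \
              * f.ppf(1 - alpha, m, n - m)       # eq. (33.12)
    return d2 <= t2_crit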
UNEQ requires a multivariate normal distribution and can be applied only when
the ratio objects/variables is sufficiently high (e.g. 3). When the ratio is less, as also
explained in Chapter 20, one can measure distances in the principal component
(PC)-space instead of the original space and in the pattern recognition context this
is usually either necessary or preferable. In SIMCA one applies latent variables
instead of the original variables. It, too, can be viewed as a variant of the quadratic
discriminant rule. SIMCA, the original version of which [27] was published in
1976, starts by determining the number of principal components or eigenvectors
needed to describe the structure of the training class (Section 31.5). Usually
Fig. 33.17. SIMCA: (a) step 1 in a 1 PC model; (b) step 1 in a 2 PC model; (c) step 2 in a 1 PC model; (d)
step 2 in a 2 PC model.

cross-validation (Section 31.5.3) is preferred. Let us call the number of eigen-


vectors retained r*. If r* = 1, this means that (see Fig. 33.17) all data are considered
to be modelled by a one-dimensional model (a line), for r* = 2, by a two-
dimensional model (a plane), etc. The residuals of the training class towards such a
model are assumed to follow a normal distribution with a residual standard
deviation

s = √[ Σ_i Σ_j e²_ij / ((r − r*)(n − r* − 1)) ]    (33.13)

The residuals from the model can be computed from the scores on the non-
retained eigenvectors, i.e. the scores t_ij on the eigenvectors r* + 1 to r, with r = min(n − 1, m). Then

s = √[ Σ_i Σ_{j=r*+1…r} t²_ij / ((r − r*)(n − r* − 1)) ]    (33.14)

If care is not taken about the way s is obtained, SIMCA has a tendency to exclude
more objects from the training class than necessary. The s-value should be deter-
mined by cross-validation. Each object in the training set is then predicted using
the r*-dimensional PCA model obtained from the other (n − 1) training set objects.
The (residual) scores obtained in this way for each object are used in eq. (33.14)
[30].
A confidence limit is obtained by defining a critical value of the (Euclidean)
distance towards the model. This is given by

s_crit = s √F_crit    (33.15)

F_crit is the tabulated one-sided value for (r − r*) and (r − r*)(n − r* − 1) degrees
of freedom. The value s_crit is used to determine the boundary (the cylinder) around the
PC1 line in Fig. 33.17c and the planes around the PC1, PC2 plane in Fig. 33.17d.
Objects with s < s_crit belong to class K, otherwise they do not. To predict whether a
new object x_new belongs to class K one verifies whether it falls within the cylinder
(for a one-dimensional model), between the limiting planes (for a two-dimensional
model), etc. Suppose the following r*-dimensional PC model was obtained:

X_K = T_K L_K^T + E_K    (33.16)

with X_K the centred X-matrix for class K, T_K the (un-normed) score matrix (n×r*)
(T_K = U_K Λ_K, where U_K is the normed score matrix and Λ_K the singular value
matrix), L_K the loading matrix (m×r*) and E_K the matrix of residuals (n×m).
For a new object x_new one first determines the scores using eq. (33.17):

t_new^T = (x_new − x̄_K)^T L_K    (33.17)

The Euclidean distance from the model is then obtained, similarly to eq. (33.14),
as:

s_new = √[ Σ_{j=r*+1…r} t²_new,j / (r − r*) ]    (33.18)

If s_new < s_crit, then the new object belongs to class K, otherwise it does not. A
discussion concerning the number of degrees of freedom can be found in [31]. This
article also compares SIMCA with several other methods.
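The sketch below outlines the distance-to-model computation of eqs. (33.16)-(33.18) for a single class, using a plain (resubstitution) PCA of the training class instead of the cross-validated estimate of s recommended above; the function and variable names are ours, and the residual of the new object is computed from the part of the centred vector not explained by the retained loadings.

import numpy as np

def simca_distance(X_K, x_new, r_star):
    # Distance of a new object to an r*-dimensional PCA class model.
    n, m = X_K.shape
    centre = X_K.mean(axis=0)
    Xc = X_K - centre
    U, sv, Vt = np.linalg.svd(Xc, full_matrices=False)
    r = min(n - 1, m)
    T = U[:, :r] * sv[:r]                    # scores of the training objects
    L = Vt[:r_star].T                        # retained loadings (m x r*)

    # residual standard deviation of the class, in the spirit of eq. (33.14)
    s = np.sqrt((T[:, r_star:r] ** 2).sum() / ((r - r_star) * (n - r_star - 1)))

    # distance of the new object to the model, cf. eqs. (33.17)-(33.18)
    t_new = (x_new - centre) @ L             # retained scores, eq. (33.17)
    resid = (x_new - centre) - L @ t_new     # part not reproduced by the model
    s_new = np.sqrt((resid ** 2).sum() / (r - r_star))
    return s_new, s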
A useful tool in the interpretation of SIMCA is the so-called Coomans plot [32].
It is applied to the discrimination of two classes (Fig. 33.18). The distance from the
model for class 1 is plotted against that from model 2. On both axes, one indicates
the critical distances. In this way, one defines four zones: class 1, class 2, overlap
of class 1 and 2 and neither class 1 nor class 2. By plotting objects in this plot, their
classification is immediately clear. It is also easy to visualize how certain a
classification is. In Fig. 33.18, object a is very clearly within class 1, object b is on
the border of that class but is not close to class 2 and object c clearly belongs to
neither class.
The first versions of SIMCA stop here: it is considered that a new object belongs
to the class if it fits the r*-dimensional PC model. However, one can also consider


Fig. 33.18. The Coomans plot. Object a belongs to class 1, object b is a borderline class 1 object, object
c is an outlier towards the two classes.

that objects that fit the PC-model but, in that model, are far from the members of
the training class, are also outliers. Therefore, a second step was added to the
original version of SIMCA by closing the tube or the box (Fig. 33.17). This was
originally done by treating each PC in a univariate way. The limits were situated
on each of the r* PCs:

t_max = max(t_K) + 0.5 s_t

and

t_min = min(t_K) − 0.5 s_t    (33.19)

where max(t_K) is the largest among the scores of the training objects of class K on
the PC considered and s_t is the standard deviation of the scores along that PC.
SIMCA has inspired several related methods, such as DASCO [33] and CLASSY
[34,35]. The latter has elements of the potential methods and SIMCA, while the
former starts with the extraction of principal components, as in SIMCA, but then
follows a quadratic discriminant rule.
SIMCA has been applied very often and with much success in chemometrics.
Examples are food authentication [36], or pharmaceutical identifications such as
the recognition of excipients from their near infrared spectra [37] or of blister-
packed tablets using near infrared spectra [38]. Environmental applications have
been published by many authors [39,40], for instance by Kvalheim et al. [41,42].
Lavine et al. [43] apply it to fuel spills from high-speed gas chromatograms, and
compare SIMCA with DASCO and RDA. Chemometricians often consider
SIMCA as the preferred supervised pattern recognition method for all situations.
However, this is not evident. When the accent is on discrimination, discrimination
oriented methods should be used. The testing procedure underlying SIMCA and
other outlier tests has the disadvantage that one has to set a confidence level, a. If
the data are normally distributed, a% (e.g. 5%) of objects belonging to the class
will be considered as not belonging to it. When applying discriminating methods,
such as LDA, to a discrimination oriented classification, this misclassification
problem can be avoided.
As explained already, SIMCA can be applied as an outlier test, similarly to the
multivariate QC tests referred to earlier. Fearn et al. [44] have described certain
properties of SIMCA in this respect and compared it with some alternatives.

33.2.8 Partial least squares

In Section 33.2.2 we showed how LDA classification can be described as a
regression problem with class variables. As a regression model, LDA is subject to
the problems described in Chapter 10. For instance, the number of variables should
not exceed the number of objects. One solution is to apply feature selection or

reduction (see Section 33.3) and the other is to apply methods such as partial least
squares (PLS - see Chapter 35) [45]. When there are two classes to be discriminated,
PLS1 is applied, which means that there is one dependent variable y, which for each
object has a value 0 or 1. When there are more classes, PLS2 is applied. The
dependent variable then becomes a vector of class variables, one for each class, with
a value of 1 for the class to which the object belongs and zeros for all the others.
Suppose that there are 4 classes and that a certain object belongs to class 2; then for
that object y_i = [0 1 0 0]. One might be tempted to use PLS1, with a single
dependent variable that can take the values 1, 2, 3 and 4. However, this would
imply an ordered relationship between the four classes, such that the distance
between class 3 and 1 is twice that between class 2 and 1.
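The dummy coding can be made concrete with a short sketch. The example below assumes that the scikit-learn package is available; the decision rule (assign the object to the class with the largest predicted y) is one convenient choice used here only for illustration, and all names are ours.

```python
# Sketch of PLS2 applied to a 4-class problem: Y holds one 0/1 column per class
# (a 1 in the column of the object's own class, zeros elsewhere).
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(1)
n_vars = 8
classes = np.repeat(np.arange(4), 10)                  # labels 0..3, 10 objects each
X = rng.normal(size=(40, n_vars)) + classes[:, None]   # crude class separation
Y = np.zeros((40, 4))
Y[np.arange(40), classes] = 1.0                        # dummy class variables

pls = PLSRegression(n_components=3)
pls.fit(X, Y)
y_pred = pls.predict(X)                 # continuous values, one column per class
assigned = y_pred.argmax(axis=1)        # pick the class with the largest response
print((assigned == classes).mean())     # recognition rate on the training set
```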

33.2.9 Neural networks

A more recently introduced technique, at least in the field of chemometrics, is
the use of neural networks. The methodology will be described in detail in Chapter
44. In this chapter, we will only give a short and very introductory description to be
able to contrast the technique with the others described earlier. A typical artificial
neuron is shown in Fig. 33.19. The isolated neuron of this figure performs a
two-stage process to transform a set of inputs into a response or output. In a pattern
recognition context, these inputs would be the values for the variables (in this
example, limited to only 2, x1 and x2) and the response would be a class variable,
for instance y = 1 for class K and y = 0 for class L.
The inputs, x1 and x2, are linked to the neuron by weights. These weights are
determined by training the neuron with a set of training objects, but we will
consider in this chapter that this has already been done. In the first stage a weighted
sum of the x-values is made, Z = w1 x1 + w2 x2.


Fig. 33.19. An artificial neuron. The inputs are weighted and summed according to Z = w1 x1 + w2 x2;
Z is transformed by comparison with T and leads to a 0/1 value for y.

Fig. 33.20. Output of the artificial neuron with values w1 = 1, w2 = 2, T = 1.

In the second stage, Z is transformed with the aid of a transfer function. For
instance, it can be compared to a threshold value. If Z > T, then y = 1 and if Z < T,
y = 0. This is the simplest possible type of neuron, used here for didactic purposes
and not because it is the configuration to be recommended. Let us suppose that for
this isolated neuron w1 = 1, w2 = 2 and T = 1. The line in Fig. 33.20 then gives the
values of x1 and x2 for which Z = T. All combinations of x1 and x2 on and above the
line will yield Z > T and therefore lead to an output y1 = 1 (i.e. the object belongs to
class K), all combinations below it to y1 = 0. The procedure described here is equivalent to a
method called the linear learning machine, which was one of the first supervised
pattern recognition methods to be applied in chemometrics. It is further explained,
including the training phase, in Chapter 44.
Neurons are not used alone, but in networks in which they constitute layers. In
Fig. 33.21 a two-layer network is shown. In the first layer two neurons are linked
each to two inputs, x1 and x2. The upper one is the one we already described, the
lower one has w1 = 2, w2 = 1 and also T = 1. It is easy to understand that for this
neuron, the output y2 is 1 on and above line b in Fig. 33.22a and 0 below it. The
outputs of the neurons now serve as inputs to a third neuron, constituting a second
layer. Both have weight 0.5 and T for this neuron is 0.75. The output of this
neuron is 1 if Z = 0.5 y1 + 0.5 y2 > 0.75 and 0 otherwise. Since y1 and y2 have as
possible values 0 and 1, the condition Z > 0.75 is fulfilled only when both are
equal to 1, i.e. in the dashed area of Fig. 33.22b. The boundary obtained is now no
longer straight, but consists of two pieces. This network is only a simple demon-
stration network. Real networks have many more nodes and transfer functions are
usually non-linear and it will be intuitively clear that boundaries of a very complex
nature can be developed. How to do this, and applications of supervised pattern
recognition are described in detail in Chapter 44 but it should be stated here that
excellent results can be obtained.


Fig. 33.21. A two-layer neural network.


Fig. 33.22. (a) Intermediate (y1 and y2) outputs of the neural network of Fig. 33.21; (b) final output of
the neural network.
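A few lines of code make the behaviour of this small network concrete. The sketch below is our own illustration (the test points are arbitrary); it implements the two threshold neurons of layer 1 and the combining neuron of layer 2.

```python
# Two-layer threshold network of the worked example: layer-1 neurons with
# weights (1, 2) and (2, 1), both with T = 1, feed a layer-2 neuron with weights
# 0.5, 0.5 and T = 0.75. Only points above both lines give a final output of 1.
import numpy as np

def neuron(x, w, T):
    """Weighted sum followed by a hard threshold."""
    return 1 if np.dot(w, x) > T else 0

def two_layer_network(x):
    y1 = neuron(x, np.array([1.0, 2.0]), 1.0)   # upper neuron of layer 1 (line a)
    y2 = neuron(x, np.array([2.0, 1.0]), 1.0)   # lower neuron of layer 1 (line b)
    return neuron(np.array([y1, y2]), np.array([0.5, 0.5]), 0.75)   # layer 2

for x in ([0.2, 0.2], [0.0, 0.8], [0.8, 0.0], [0.5, 0.5]):
    print(x, two_layer_network(np.array(x)))
# [0.2, 0.2] -> 0 (below both lines), [0.0, 0.8] -> 0 (above line a only),
# [0.8, 0.0] -> 0 (above line b only), [0.5, 0.5] -> 1 (above both lines)
```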

The similarity in approach to LDA (Section 33.2.2) and PLS (Section 33.2.8)
should be pointed out. Neural classification networks are related to neural regression
networks in the same way that PLS can be applied both for regression and
classification and that LDA can be described as a regression application. This can
be generalized: all regression methods can be applied in pattern recognition. One
must expect, for instance, that methods such as ACE and MARS (see Chapter 11)
will be used for this purpose in chemometrics.

33.3 Feature selection and reduction

One can — and sometimes must — reduce the number of features. One way is to
combine the original variables in a smaller number of latent variables such as
principal components or PLS functions. This is called feature reduction.
The combination of PCA and LDA is often applied, in particular for ill-posed
data (data where the number of variables exceeds the number of objects), e.g. Ref.
[46]. One first extracts a certain number of principal components, deleting the
higher-order ones and thereby reducing to some degree the noise and then carries
out the LDA. One should however be careful not to eliminate too many PCs, since
in this way information important for the discrimination might be lost. A method in
which both are merged in one step and which sometimes yields better results than
the two-step procedure is reflected discriminant analysis. The Fourier transform is
also sometimes used [14], and this is also the case for the wavelet transform (see
Chapter 40) [13,16]. In that case, the information is included in the first few
Fourier coefficients or in a restricted number of wavelet coefficients.
In feature selection one selects from the m variables a subset of variables that
seem to be the most discriminating. Feature selection therefore constitutes a means
of choosing sets of optimally discriminating variables and, if these variables are
the results of analytical tests, this consists, in fact, of the selection of an optimal
combination of analytical tests or procedures.
One way of selecting discriminating features is to compare the means and the
variances of the different variables. Variables with widely different means for the
classes and small intraclass variance should be of value and, for a binary discrim-
ination, one therefore selects those variables for which the expression

(x̄_iK - x̄_iL) / (s_iK + s_iL)    (33.20)

where x̄_iK and x̄_iL are the means of variable i in classes K and L, and s_iK and s_iL
the corresponding standard deviations.

is maximal. It should be noted that, in this way, we select the individually best
variables. As the correlation between variables is not taken into account, this
means that one does not necessarily select the best combination of variables.
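A per-variable version of this criterion is easy to compute. The sketch below is our own illustration on simulated data; the absolute value of the mean difference is taken so that the sign of the difference does not matter.

```python
# Rank variables with the criterion of eq. (33.20): difference of the class means
# divided by the sum of the class standard deviations, computed per variable.
# This ranks variables individually and ignores their correlation.
import numpy as np

def discrimination_weights(X_K, X_L):
    """One weight per variable for a binary discrimination between K and L."""
    mean_K, mean_L = X_K.mean(axis=0), X_L.mean(axis=0)
    s_K, s_L = X_K.std(axis=0, ddof=1), X_L.std(axis=0, ddof=1)
    return np.abs(mean_K - mean_L) / (s_K + s_L)

rng = np.random.default_rng(2)
X_K = rng.normal(loc=[0, 0, 0, 2], scale=1.0, size=(25, 4))
X_L = rng.normal(loc=[0, 0, 1, 0], scale=1.0, size=(25, 4))
w = discrimination_weights(X_K, X_L)
print(np.argsort(w)[::-1])   # variables ordered from most to least discriminating
```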
Most of the supervised pattern recognition procedures permit the carrying out of
stepwise selection, i.e. the selection first of the most important feature, then, of the
second most important, etc. One way to do this is by prediction using e.g.
cross-validation (see next section), i.e. we first select the variable that best
classifies objects of known classification but that are not part of the training set,
then the variable that most improves the classification already obtained with the
first selected variable, etc. The result for the linear discriminant analysis of the
EU/HYPER classification of Section 33.2.1 is that with all 5 or 4 variables a
selectivity of 91.4% is obtained and for 3 or 2 variables 88.6% [2]. Selectivity is
used here as the measure of classification success; it is applied in the sense of
Chapter 16 because of the medical context. Selectivity is the number of true negatives
divided by the sum of true negatives (the EU-cases that are classified as such) and
false positives (the EU-cases that are classified wrongly as HYPER). Of course,
one should also consider the sensitivity (Chapter 16), which was done in the
article, but, for simplicity's sake, will not be discussed here. For the HYPER/EU
discrimination, the successive elimination of two or more variables leads to the
expected result that, since there is less information, the classification is less
successful. On the other hand, for the HYPO/EU discrimination, a less evident
result is obtained. Five variables yield a selectivity of 80.0%, 4 of 83.0%, 3 of
86.7% and 2 of 96.7%. A smaller number of tests leads to an improvement in the
classification results. One concludes that the variables eliminated either have no
relevance to the discrimination considered, and therefore only add noise, or else
that the information present in the eliminated variables was redundant or correlated
with respect to the retained variables.
Another approach requires the use of Wilks' lambda. This is a measure of the
quality of the separation, computed as the determinant of the pooled within-class
covariance matrix divided by the determinant of the covariance matrix for the
whole set of samples. The smaller this value, the better the separation; one selects
variables in a stepwise way by including those that achieve the largest decrease of the criterion.
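The criterion as defined here is straightforward to compute. The sketch below follows the covariance-based definition given in the text and uses simulated data; it is an illustration, not a complete stepwise selection procedure.

```python
# Wilks' lambda: determinant of the pooled within-class covariance matrix
# divided by the determinant of the covariance matrix of the whole sample set
# (smaller values indicate a better separation).
import numpy as np

def wilks_lambda(X, labels):
    classes = np.unique(labels)
    n, m = X.shape
    W = np.zeros((m, m))
    for c in classes:
        Xc = X[labels == c] - X[labels == c].mean(axis=0)
        W += Xc.T @ Xc
    W /= (n - len(classes))                  # pooled within-class covariance
    T = np.cov(X, rowvar=False)              # covariance of the whole set
    return np.linalg.det(W) / np.linalg.det(T)

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (20, 3)), rng.normal(1.5, 1, (20, 3))])
labels = np.array([0] * 20 + [1] * 20)
print(wilks_lambda(X, labels))   # well below 1 for separated classes
```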
In SIMCA, we can determine the modelling power of the variables, i.e. we
measure the importance of the variables in modelling the class. Moreover, it is
possible to determine the discriminating power, i.e. which variables are important
to discriminate two classes. The variables with both low discriminating and
modelling power are deleted. This is more a variable elimination procedure than a
selection procedure: we do not try to select the minimum number of features that
will lead to the best classification (or prediction rate), but rather eliminate those
that carry no information at all.
It should be stressed here that feature selection is not only a data manipulation
operation, but may have economic consequences. For instance, one could decide
on the basis of the results described above to reduce the number of different tests
for a EU/HYPO discrimination problem to only two. A less straightforward
problem with which the decision maker is confronted is to decide how many tests
to carry out for a EU/HYPER discrimination. One loses some 3% in selectivity by
eliminating one test. The decision maker must then compare the economic benefit
of carrying out one test less with the loss contained in a somewhat smaller
diagnostic success. In fact, he carries out a cost-benefit analysis. This is only one of
the many instances where an analytical (or clinical) chemist may be confronted
with such a situation.

33.4 Validation of classification rules

In the training or learning step, one develops a decision model (a rule) which
allows classification of the unknown samples to be carried out. The decision model
of Fig. 33.6 consists of line a and the classification rule is that objects to the right of
it are assigned to class L and objects to the left to class K. Once a decision rule has
been obtained, it is still necessary to demonstrate that it is a good one. This can be
done by observing how successful it is at classifying known samples (test set). One
distinguishes between recognition and prediction ability. The recognition (or
classification) ability is characterized by the percentage of the members of the
training set that are correctly classified. The prediction ability is determined by the
percentage of the members of the test set correctly classified by using the decision
functions or classification rules developed during the training step. When one only
determines the recognition ability, there is a risk that one will be deceived into
taking an overoptimistic view of the classification result. It is therefore also
necessary to verify the prediction ability. Both recognition and prediction ability
are usually expressed as (correct) classification rate, although other possibilities
exist (see Section 33.3).
The situation is very similar to that in regression (see Section 10.3.4), where we
validated the regression model by looking how well it modelled the objects that
were included in the calibration set (goodness- or lack-of-fit) and new samples
(prediction performance). This analogy should not surprise since regression and
classification are both modelling methods. Validation in pattern recognition is
therefore similar to validation in multivariate calibration and the reader should
refer to Chapter 36 for more details.
The ideal situation is when there are enough samples available to create separate
(independent) training and test sets. When this is not possible, an artifice is
necessary. The prediction ability is determined by developing the decision model
on a part of the training set only and using the other part as a mock test set. Often
this is repeated a few times until all training samples have been used as test
samples. If several objects at a time are considered as test samples, this is called a
resampling or (internal) cross-validation method, k-fold cross-validation or jackknife
method; when only one sample at a time is removed from the training set, it is
called a leave-one-out procedure. If the training set consists of 20 objects, a
jackknife method could be carried out as follows. We first delete objects 1-6 from
the training set and develop the classification rules with the remaining objects
7-20. Then, we consider 1-6 as the test set, classify them with the rules obtained
on objects 7-20 and note how many objects were classified correctly. The whole
procedure is then repeated after replacing first objects 1-6 in the training set but
deleting 7-12. The latter objects are classified using the classification rule developed
on a training set consisting of objects 1-6 and 13-20. Finally, a training set

consisting of 1-12 is used and objects 13-20 serve as test set and are classified.
The percentage of successes on the three runs together is then called the prediction
ability.
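The segment-wise procedure just described is easy to automate. The sketch below is our own illustration: the nearest-mean rule is only a placeholder classifier (any method with a training and a prediction step can be substituted), and the segment sizes follow the 6/6/8 split used in the text.

```python
# Segment-wise (jackknife / k-fold) estimation of the prediction ability:
# each segment is predicted in turn by a model built on the remaining objects.
import numpy as np

def nearest_mean_classify(X_train, y_train, X_test):
    means = {c: X_train[y_train == c].mean(axis=0) for c in np.unique(y_train)}
    d = {c: np.linalg.norm(X_test - m, axis=1) for c, m in means.items()}
    classes = np.array(list(d.keys()))
    return classes[np.argmin(np.column_stack(list(d.values())), axis=1)]

def prediction_ability(X, y, segments):
    correct = 0
    for test_idx in segments:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        pred = nearest_mean_classify(X[train_idx], y[train_idx], X[test_idx])
        correct += (pred == y[test_idx]).sum()
    return correct / len(y)

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (10, 4)), rng.normal(1.2, 1, (10, 4))])
y = np.array([0] * 10 + [1] * 10)
segments = [np.arange(0, 6), np.arange(6, 12), np.arange(12, 20)]  # 6/6/8 as in the text
print(prediction_ability(X, y, segments))
```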
In general, it is found that prediction ability is somewhat less good than
recognition ability. If the prediction and the recognition ability are substantially
different, this means that the decision rules depend too much on the actual objects
in the training set: the solution obtained is not stable and should therefore not be
trusted.
Many other subjects are important to achieve successful pattern recognition. To
name only two, it should be investigated to what extent outliers are present,
because these can have a profound influence on the quality of a model and to what
extent clusters occur in a class (e.g. using the index of clustering tendency of
Section 30.4.1). When clusters occur, we must wonder whether we should not
consider two (or more) classes instead of a single class. These problems also affect
multivariate calibration (Chapter 36) and we have discussed them to a somewhat
greater extent in that chapter.

References

1. R.G. Brereton, ed., Multivariate Pattern Recognition in Chemometrics. Elsevier, Amsterdam, 1992.
2. D. Coomans, M. Jonckheer, D.L. Massart, I. Broeckaert and P. Blockx, The application of lin-
ear discriminant analysis in the diagnosis of thyroid diseases. Anal. Chim. Acta, 103 (1978)
409-415.
3. D. Coomans, I. Broeckaert, M. Jonckheer and D.L. Massart, Comparison of multivariate dis-
crimination techniques for clinical data — Application to the thyroid functional state. Meth. In-
form. Med., 22 (1983) 93-101.
4. D. Coomans, I. Broeckaert and D.L. Massart, Potential methods in pattern recognition. Part 4,
Anal. Chim. Acta 132 (1981) 69-74.
5. R. Fisher, The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7
(1936) 179-188.
6. P.J. Dunlop, CM. Bignell, J.F. Jackson, D.B. Hibbert, Chemometric analysis of gas chromato-
graphic data of oils from Eucalyptus species. Chemom. Intell. Lab. Systems, 30 (1995) 59-67.
7. K. Varmuza, F. Stangl, H. Lohninger and W. Werther, Automatic recognition of substance
classes from data obtained by gas chromatography, mass spectrometry. Lab. Automation Inf.
Manage., 31 (1996) 221-224.
8. A. Candolfi, W. Wu, S. Heuerding and D.L. Massart, Comparison of classification approaches
applied to NIR-spectra of clinical study lots. J. Pharm. Biomed. Anal., 16 (1998) 1329-1347.
9. T. Fearn, Discriminant analysis. NIR News, 4 (5) (1993) 4-5.
10. G. McLachlan, Discriminant Analysis and Statistical Pattern Recognition. Wiley, New York,
1992.
11. M.G. Kendall and A.S. Stuart, The Advanced Theory of Statistics, Vol. 3. Ch. Griffin, London,
1968.
12. J.H. Friedman, Regularized discriminant analysis. J. Am. Stat. Assoc. 84 (1989) 165-175.

13. Y. Mallet, D. Coomans and O. de Vel, Recent developments in discriminant analysis on high
dimensional spectral data. Chemom. Intell. Lab. Systems, 35 (1996) 157-173.
14. W. Wu, Y. Mallet, B. Walczak, W. Penninckx, D.L. Massart, S. Heuerding and F. Erni, Com-
parison of regularized discriminant analysis, linear discriminant analysis and quadratic
discriminant analysis, applied to NIR data. Anal. Chim. Acta, 329 (1996) 257-265.
15. D. Coomans and D.L. Massart, Alternative K-nearest neighbour rules in supervised pattern
recognition. Part 2. Probabilistic classification on the basis of the kNN method modified for di-
rect density estimation. Anal. Chim. Acta, 138 (1982) 153-165.
16. E.R. Collantes, R. Duta, W.J. Welsh, W.L. Zielinski and J. Brower, Reprocessing of HPLC
trace impurity patterns by wavelet packets for pharmaceutical finger printing using artificial
neural networks. Anal. Chem. 69 (1997) 1392-1397.
17. J.D.F. Habbema, Some useful extensions of the standard model for probabilistic supervised
pattern recognition. Anal. Chim. Acta, 150 (1983) 1-10.
18. D. Coomans, M.P. Derde, I. Broeckaert and D.L. Massart, Potential methods in pattern recog-
nition. Anal. Chim. Acta, 133 (1981) 241-250.
19. D. Coomans and I. Broeckaert, Potential Pattern Recognition in Chemical and Medical Deci-
sion Making. Wiley, Chichester, 1986.
20. M. Forina, C. Armanino, R. Leardi and G. Drava, A class-modelling technique based on poten-
tial functions. J. Chemom., 5 (1991) 435-453.
21. M. Forina, S. Lanteri and C. Armanino, Chemometrics in Food Chemistry. Topics Curr.
Chem., 141 (1987) 93-143.
22. M. Mulholland, D.B. Hibbert, P.R. Haddad and P. Parslov, A comparison of classification in
artificial intelligence, induction versus a self-organising neural networks. Chemom. Intell.
Lab. Systems, 30 (1995) 117-128.
23. L. Breiman, J.H. Friedman, R.A. Olshen and C.J. Stone, Classification and Regression Trees.
Wadsworth, Pacific Grove, CA, 1984.
24. C.H. Yeh and C.H. Spiegelman, Partial least squares and classification and regression trees.
Chemom. and Intell. Lab. Systems, 22 (1994) 17-23.
25. A. Sankar and R. Mammone, A fast learning algorithm for tree neural networks. In: Proc. 1990
Conf on Information Sciences and Systems, Princeton, NJ, 1990, pp. 638-642.
26. D.H. Coomans and O. Y. de Vel, Pattern analysis and classification, in J. Einax (ed). The Hand-
book of Environmental Chemistry, Vol. 2, Part G. Springer, 1995, pp. 279-324.
27. S. Wold, Pattern recognition by means of disjoint principal components models. Pattern
Recogn., 8 (1976) 127-139.
28. M.P. Derde and D.L. Massart, UNEQ: a disjoint modelling technique for pattern recognition
based on normal distribution. Anal. Chim. Acta, 184 (1986) 33-51.
29. I. Frank and J.M. Friedman, Classification: oldtimers and newcomers. J. Chemom. 3 (1989)
463-475.
30. R. De Maesschalck, A. Candolfi, D.L. Massart and S. Heuerding, Decision criteria for SIMCA
applied to Near Infrared data. Chemom. Intell. Lab. Syst., in prep.
31. H. Van der Voet and P.M. Coenegracht, The evaluation of probabilistic classification methods,
Part 2. Comparison of SIMCA, ALLOC, CLASSY and LDA. Anal. Chim. Acta, 209 (1988)
1-27.
32. D. Coomans, I. Broeckaert, M.P. Derde, A. Tassin, D.L. Massart and S. Wold, Use of a micro-
computer for the definition of multivariate confidence regions in medical diagnosis based on
clinical laboratory profiles. Comp. Biomed. Res., 17 (1984) 1-14.
33. I. Frank, DASCO: a new classification method. Chemom. Intell. Lab. Syst., 4 (1988) 215-222.

34. H. Van Der Voet, P.M.J. Coenegracht and J.B. Hemel, New probabilistic versions of the Simca
and Classy classification methods. Part 1. Theoretical description. Anal. Chim. Acta, 192
(1987) 63-75.
35. H. van der Voet and D.A. Doornbos, The improvement of SIMCA classification by using ker-
nel density estimation. Part 1. Anal. Chim. Acta, 161 (1984), 115-123; Part 2. Anal. Chim.
Acta, 161 (1984) 125-134.
36. M. Forina, G. Drava and G. Contarini, Feature selection and validation of SIMCA models: a
case study with a typical Italian cheese. Analusis 21 (1993) 133-147.
37. A. Candolfi, R. De Maesschalk, D.L. Massart, P.A. Hailey and A.C.E. Harrington, Identifica-
tion of pharmaceutical excipients using NIR spectroscopy and SIMCA. J. Pharm. Biomed.
Anal., in prep.
38. M.A. Dempster, B.F. MacDonald, P.J. Gemperline and N.R. Boyer, A near-infrared reflect-
ance analysis method for the non invasive identification of film-coated and non-film-coated,
blister-packed tablets. Anal. Chim. Acta, 310 (1995) 43-51.
39. D. Scott, W. Dunn and S. Emery, Pattern recognition classification and identification of trace
organic pollutants in ambient air from mass spectra. J. Res. Natl. Bur. Stand., 93 (1988)
281-283.
40. E. Saaksjarvi, M. Khaligi and P. Minkkinen, Waste water pollution modeling in the southern
area of Lake Saimaa, Finland, by the SIMCA pattern recognition method. Chemom. Intell. Lab.
Systems, 7 (1989) 171-180.
41. O.M. Kvalheim, K. Øygard and O. Grahl-Nielsen, SIMCA multivariate data analysis of blue
mussel components in environmental pollution studies. Anal. Chim. Acta, 150 (1983) 145-152.
42. B.G.J. Massart, O.M. Kvalheim, F.O. Libnau, K.I. Ugland, K. Tjessem and K. Bryne, Projec-
tive ordination by SIMCA: a dynamic strategy for cost-efficient environmental monitoring
around offshore installations. Aquatic Sci., 58 (1996) 121-138.
43. B.K. Lavine, H. Mayfield, P.R. Kromann and A. Faruque, Source identification of under-
ground fuel spills by pattern recognition analysis of high-speed gas chromatograms. Anal.
Chem., 67 (1995) 3846-3852.
44. B. Mertens, M. Thompson and T. Fearn, Principal component outlier detection and SIMCA: a
synthesis. Analyst 119 (1994) 2777-2784.
45. L. Ståhle and S. Wold, Partial least squares analysis with cross-validation for the two-class prob-
lem: a Monte Carlo study. J. Chemometrics, 1 (1987) 185-196.
46. A.F.M. Nierop, A.C. Tas, J. Van der Greef, Reflected discriminant analysis. Chemom. Intell.
Lab. Syst., 25 (1994) 249-263.

Chapter 34

Curve and Mixture Resolution by Factor Analysis and


Related Techniques

34.1 Abstract and true factors

In Chapter 31 we stated that any data matrix can be decomposed into a product
of two other matrices, the score and loading matrix. In some instances another
decomposition is possible, e.g. into a product of a concentration matrix and a
spectrum matrix. These two matrices have a physical meaning. In this chapter we
explain how a loading or a score matrix can be transformed into matrices to which
a physical meaning can be attributed. We introduce the subject with an example
from environmental chemistry and one from liquid chromatography.
Let us suppose that dust particles have been collected in the air above a city and
that the amounts of p constituents, e.g. Si, Al, Ca, ..., Pb have been determined in
these samples. The elemental compositions obtained for n (e.g. 100) samples, taken
over a grid of sampling points, can be arranged in a data matrix X (Fig. 34.1). Each
row of the table represents the elemental composition of one of the samples. A
column represents the amount of one of the elements found in the sample set. Let
us further suppose that there are two main sources of dust in the neighbourhood of
the sampled area, and that the particles originating from each source have a
specific concentration pattern for the elements Si to Pb. These concentration
patterns are described by the vectors Sj and S2. For instance the dust in the air may
originate from a power station and from an incinerator, having each a specific
concentration pattern, sj = [Si^, Al^, Ca^,... PbJ with A: = 1,2.
Obviously, each sample in the sampled area contains particles from each source,
but in a varying proportion. Some of the samples mainly contain particles from the
power station and less from the incinerator. Other samples may contain an equal
amount of particles of each source. In general, one can say that the composition x_i
of any sample i of dust is a linear combination of the two source patterns s_1 and s_2,
given by: x_i = c_i1 s_1 + c_i2 s_2. In this expression c_i1 gives the contribution of the first
source and c_i2 the contribution of the second dust source in sample i. For all n
samples these contributions can be arranged in an n×2 matrix C, giving X = C S^T,
where S is the p×2 matrix of the source patterns. If the concentration patterns of the

[Fig. 34.1: the matrix of measurements (samples × variables Si, Al, Ca, ..., Pb) is decomposed by factor analysis (FA) into concentrations and factors, and by PCA into PC-scores and PC loadings.]

Fig. 34.1. The principle of factor analysis.

elements (Si to Pb) of the two dust sources are known, the relative contribution C
of each source in the samples is estimated by solving the equation X = C S^T by
multiple linear regression (see Chapter 10), provided that the concentration patterns
are sufficiently different and that the number of sources nc is less than or equal to
the number of measured elements p: C = X S (S^T S)^-1.
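When S is known, this regression step takes only a few lines. The sketch below is our own illustration; the element patterns and sample compositions are invented purely to show the calculation.

```python
# Estimate the source contributions C when the source patterns S are known,
# by solving X = C S^T in the least-squares sense: C = X S (S^T S)^-1.
import numpy as np

p = 5                                            # number of elements (Si, Al, ..., Pb)
s1 = np.array([0.30, 0.20, 0.25, 0.15, 0.10])    # pattern of source 1 (assumed)
s2 = np.array([0.05, 0.40, 0.10, 0.15, 0.30])    # pattern of source 2 (assumed)
S = np.column_stack([s1, s2])                    # p x 2 matrix of source patterns

rng = np.random.default_rng(5)
C_true = rng.uniform(0, 1, size=(100, 2))        # contributions of both sources
X = C_true @ S.T + rng.normal(0, 0.005, (100, p))   # measured compositions + noise

C_hat = X @ S @ np.linalg.inv(S.T @ S)           # multiple linear regression
print(np.abs(C_hat - C_true).max())              # small if the model holds
```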
Matters become more complex when the concentration patterns of the sources
are not known. It becomes even more complicated when the number and the origin
of the potential sources of the dust are unknown. In this case the number of sources
and the concentration patterns of each source have to be estimated from the
measured data table X. This operation is called factor analysis. In the terminology
of factor analysis, the two sources of dust in our example are called factors, and the
concentration patterns of the compounds in each source are called factor loadings.
In Chapters 17 and 31 we explained that a matrix X can be decomposed by SVD
into a product of three matrices: the two matrices of singular vectors U and V and a
diagonal matrix of singular values Λ such that:

X = U Λ V^T    (34.1)
n×p   n×p   p×p   p×p

where n is the number of samples (rows), and p is the number of variables
(columns). Often the decomposition is given as a product of only two matrices:

X = T V^T    (34.2)
n×p   n×p   p×p

where T (= UΛ) is the matrix of scores proportional to their 'size' and V is the
loading matrix. The columns of V are colloquially denoted as the principal compo-
nents of X. Each row of the original data matrix X can be represented as a linear
combination of all PCs in the loading matrix. The multiplicative factors (or
regressors) of that linear combination for a particular row x_i of X are given by the
corresponding row t_i in the score matrix: x_i = t_i V^T.
X can be reconstructed within the measurement error E by taking the first nc
PCs:

X = T* V*^T + E
n×p   n×nc   nc×p   n×p

The symbol * means that the first nc significant columns of V and T are retained.
The number of significant principal components, nc, which is the pseudo-rank of
X, is usually unknown. Methods for estimating the number of components are
discussed in Section 31.5.
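The truncated reconstruction can be illustrated with a short sketch (our own, on simulated data whose pseudo-rank is 2 by construction):

```python
# Reconstruct X from the first nc principal components, X ~ T* V*^T + E.
import numpy as np

rng = np.random.default_rng(6)
C = rng.uniform(0, 1, (50, 2))                    # two underlying factors
S = rng.uniform(0, 1, (2, 12))                    # their (non-negative) profiles
X = C @ S + rng.normal(0, 1e-3, (50, 12))         # data of pseudo-rank 2

U, sv, Vt = np.linalg.svd(X, full_matrices=False)
nc = 2
T_star = U[:, :nc] * sv[:nc]                      # scores T* = U* times singular values
V_star = Vt[:nc].T                                # loadings V* (p x nc)
X_hat = T_star @ V_star.T
print(np.linalg.norm(X - X_hat))                  # of the order of the noise
```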
In the case of city air pollution caused by two sources of dust, one expects to
find two significant PCs: PC1 and PC2, which are the first two rows of V^T. One
could conclude that this decomposition yielded the requested information. Each
row of X is represented by a linear combination of two source profiles PC1 and
PC2. The question arises then whether these profiles represent the true
concentration profiles of the sources. Unfortunately, the answer is no! In Chapter
29, we explained that PC1 and PC2 are obtained under some specific constraints,
i.e., PC1 is calculated under the constraint that it describes the maximum variance
of X. The second loading vector PC2 also describes the maximum variance in X but
now in the direction orthogonal to PC1. These constraints are not necessarily valid
for the true factors. It is very improbable that the true source profiles are ortho-
gonal. Therefore the PCs are called abstract factors. The aim of factor analysis is to
transform these abstract factors into real factors. PCA is a purely mathematical
operation, using no other information than that the rows in X can be described by a
linear combination of a number of linearly independent vectors. However factor
analysis usually requires the formulation of additional constraints to find a solution.
These constraints are defined by the characteristics of the system being investi-
gated. In this example, a constraint could be that all concentrations should be
non-negative.
Before going into more detail, a second example is given. Suppose that during
the elution of two compounds from an HPLC, one measures n (=15) UV-visible


Fig. 34.2. UV-visible spectra of mixtures of fluoranthene and chrysene (see Fig. 34.3 for the pure
spectra).

spectra at p (=20) wavelengths. Because of the Lambert-Beer law, all measured
spectra are linear combinations of the two pure spectra. Together they form a
15x20 data matrix. For example the UV-visible spectra of mixtures of two
polycyclic aromatic hydrocarbons (PAH) given in Fig. 34.2 are linear combi-
nations of the pure spectra shown in Fig. 34.3. These mixture spectra define a data
matrix X, which can be written as the product of a 15×2 concentration matrix C
with the 2×20 matrix S^T of the pure spectra:

X = C S^T    (34.3)
n×p   n×nc   nc×p

where n (= 15) is the number of mixture spectra, nc (= 2) is the number of species,
and p (= 20) is the number of wavelengths.
The rows of X are mixture spectra and the columns are chromatograms at the
p = 20 wavelengths. Here, columns as well as rows are linear combinations of pure
factors, in this example pure row factors, being the pure spectra, and pure column
factors, being the pure elution profiles.
Any data matrix can be considered in two spaces: the column or variable space
(here, wavelength space) in which a row (here, spectrum) is a vector in the
multidimensional space defined by the column variables (here, wavelengths), and
the row space (here, retention time space) in which a column (here, chromatogram)
is a vector in the multidimensional space defined by the row variables (here,
elution times). This duality of the multivariate spaces has been discussed in more
detail in Chapter 29. Depending on the chosen space, the PCs of the data matrix


Fig. 34.3. UV-visible spectra of two polyaromatic hydrocarbons (PAHs), fluoranthene and chrysene.

have a different meaning. In wavelength space the eigenvectors of X^T X represent
abstract spectra, and in retention time space the eigenvectors of X X^T are abstract
chromatograms. Irrespective of the chosen space, by decomposing matrix X with a
PCA as many significant principal components should be found as there are
chemical species in the mixtures. The decomposition in the wavelength space, for
a system with two compounds is given by:

X = T* V*^T + E    (34.4)
n×p   n×2   2×p   n×p

By decomposing the HPLC data matrix of spectra shown in Fig. 34.2 according
to eq. (34.4), a matrix V* is obtained containing the two significant columns of V.
Evidently the loading plots shown in Fig. 34.4 do not represent the two pure
spectra, though each mixture spectrum can be represented as a linear combination
of these two PCs. Therefore, these two PCs are called abstract spectra. Equations
(34.3) and (34.4) show a decomposition of the data matrix X in two ways: the first
is a decomposition in real factors, a product of a matrix S^ of the spectra with a
matrix C of concentration profiles, and the second is a decomposition in abstract
factors T* and V*^T. By factor analysis one transforms V*^T in eq. (34.4) into S^T in eq.
(34.3). The score matrix T* gives the location of the spectra in the space defined by
the two principal components. Figure 34.5 shows a scores plot thus obtained with a
clear structure (curve). The cause of this structure is explained in Section 34.2.1.


Fig. 34.4. The two first principal components of the data matrix of the spectra given in Fig. 34.2.


Fig. 34.5. Score plot (PC1 score vs PC2 score) of the mixture spectra given in Fig. 34.2.

The decomposition into elution profiles and spectra may also be represented as:

X^T = S C^T
p×n   p×2   2×n

The corresponding decomposition by a principal components analysis gives:

X^T = P* Q*^T + E
p×n   p×nc   nc×n   p×n

where the rows in Q*^T now represent abstract elution profiles. It should be noted
that the score matrix P* has another meaning than T* in eq. (34.4). It represents
here the location of the chromatograms in the factor space Q*. It should also be
noted that Q is equivalent to U in eq. (34.1) and that P is equivalent to VΛ in eq.
(34.1). Here, too, one wants to transform abstract elution profiles Q*^T into real
elution profiles C^T by factor analysis. The result of a PCA carried out in this
retention time space is given in Fig. 34.6. The two first PCs clearly have the
appearance of elution profiles, but are not the true elution profiles. Because elution
profiles have a much smoother appearance than spectra, which may have a very
irregular form, abstract elution profiles are sometimes easier to interpret than
abstract spectra. For instance, one can easily derive the positions of the peak
maxima and also distinguish significant PCs from those which represent noise.
The scores plot (Fig. 34.7) in which the chromatograms are plotted in the space
defined by the wavelengths is less easily interpreted than the corresponding scores
plot of the spectra shown in Fig. 34.5. The plot in Fig. 34.7 does not reveal any
structure because the consecutive chromatograms in X^T follow the irregular pattern
of the absorptivity coefficients as a function of the wavelength. Therefore, if the
aim of factor analysis is to transform PCs into real factors (by one of the methods
explained in this chapter) one prefers the retention time space, because this yields
loadings which are the easiest to interpret. On the other hand, if the aim of the
analysis is to detect structure in the scores plot, the wavelength space is preferred.
In this space regions of pure spectra (selective parts of the chromatograms), or
regions where only binary mixtures are present, are more easily detected. The
above considerations form the basis of the HELP procedure explained in Section
34.3.3.
In some cases a principal components analysis of a spectroscopic- chromato-
graphic data-set detects only one significant PC. This indicates that only one
chemical species is present and that the chromatographic peak is pure. However,
by the presence of noise and artifacts, such as a drifting baseline or a nonlinear
response, conclusions on peak purity may be wrong. Because the peak purity
assessment is the first step in the detection and identification of an impurity by
factor analysis, we give some attention to this subject in this chapter.


Fig. 34.6. The two first principal components of a LC-DAD data set in the retention time space.


Fig. 34.7. Score plot (PC1 score vs PC2 score) of chromatograms in the retention time space.

Basically, we make a distinction between methods which are carried out in the
space defined by the original variables (Section 34.4) or in the space defined by the
principal components. A second distinction we can make is between full-rank
methods (Section 34.2), which consider the whole matrix X, and evolutionary
methods (Section 34.3) which analyse successive sub-matrices of X, taking into
account the fact that the rows of X follow a certain order. A third distinction we
make is between general methods of factor analysis which are applicable to any
data matrix X, and specific methods which make use of specific properties of the
pure factors.

34.2 Full-rank methods

34.2.1 A qualitative approach

Before going into detail about various specific methods to estimate pure factors,
we qualitatively describe how pure factors can be derived from the principal
components. This is illustrated with a data matrix of the two-component HPLC
example discussed in the previous section. When discussing the scores plot (Fig.
34.5) we mentioned that the scores showed some structure. From the origin, two
straight lines depart which are connected with a curved line. In Chapter 17 we
explained that these straight lines coincide with the pure spectra present in the pure
elution time zones. The distance from the origin is a measure for the 'size' of the
spectrum. The curved part represents the zone where two compounds co-elute.
Going through the curve starting from the origin we find pure spectra of compound
1 in increasing concentrations, then mixtures of compounds 1 and 2, followed by
the spectra of compound 2 in decreasing concentration, back to the origin. The
angle between the two lines is a measure for the correlation between the two pure
spectra. If the spectra are uncorrelated (i.e. very dissimilar) the two lines are
orthogonal. At high correlations the angle becomes very small. The two lines
define the directions of the pure factors in the PC1-PC2 space. In this simplified
situation no factor analysis is needed to find the pure factors. From the scores plot
we also observe that the pure factors have an angle α1 and α2 respectively with PC1
and PC2. Conversely, one could also find pure factors by rotating PC1 over an angle
α1 and PC2 over an angle α2. Finding these angles is the purpose of factor analysis.
When pure spectra are present in the data set, finding these angles is quite
straightforward. Therefore, several factor analysis methods aim at finding the
purest rows (spectra) or purest columns (wavelengths) in the data set, which is
discussed in Section 34.4. When no pure row or column is available for one of the
factors, we cannot directly derive the rotation angles from the scores plot, because
the straight line segments are missing. In this case we need to make some

assumptions about the pure factors in order to estimate the rotation angles.
Obvious assumptions in chromatography are the non-negativity of the absorbances
and the concentration profiles. If no constraints can be formulated to estimate the
rotation angles, one must rely on abstract rotation procedures. An example of this
type of rotation is the Varimax method of Kaiser [1] which is explained in Section
34.2.3.

34.2.2 Factor rotations

A row of a data matrix can be interpreted as a point in the space defined by its
column variables.

X =
x1  y1
x2  y2
...
xn  yn

For instance, the first row of the matrix X defines a point with the coordinates (x1,
y1) in the space defined by the two orthogonal axes x^T = (1 0) and y^T = (0 1). Factor
rotation means that one rotates the original axes x^T = (1 0) and y^T = (0 1) over a
certain angle θ. With orthogonal rotation in two-dimensional space both axes are
rotated over the same angle. The distance between the points remains unchanged.

Fig. 34.8. Orthogonal rotation of the (x,y) axes into (k,l) axes.

Rotated axes are characterized by their position in the original space, given by the
vectors k^T = [cosθ -sinθ] and l^T = [sinθ cosθ] (see Fig. 34.8). In PCA or FA, these
axes fulfil specific constraints (see Chapter 17). For instance, in PCA the direction
of k is the direction of the maximum variance of all points projected on this axis. A
possible constraint in FA is maximum simplicity of k, which is explained in
Section 34.2.3. The new axes (k,l) define another basis of the same space. The
position of the vector [x_i y_i] is now [k_i l_i] relative to these axes.
Factor rotation involves the calculation of [k_i l_i] from [x_i y_i], given a rotation
angle with respect to the original axes. Suppose that after rotation the matrix X is
transformed into a matrix F:

F =
k1  l1
k2  l2
...
kn  ln

Then the following relationship exists between [k_i l_i] and [x_i y_i]:

k_i = x_i cosθ - y_i sinθ
l_i = x_i sinθ + y_i cosθ

or in matrix notation (for all vectors i):

k1  l1     x1  y1
k2  l2  =  x2  y2     cosθ   sinθ
...        ...       -sinθ   cosθ
kn  ln     xn  yn

which gives:

F = X R    (34.4)

with

R =  cosθ   sinθ
    -sinθ   cosθ

Columns and rows of R are orthogonal with a norm equal to one. Therefore, R
defines a rotation, for which R R^T = R^T R = I.
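These relations are easily verified numerically; the sketch below is our own illustration (the angle and the data points are arbitrary).

```python
# Numerical check of F = X R: rotating the axes over an angle theta transforms
# each row (x_i, y_i) into (k_i, l_i) as derived above.
import numpy as np

theta = np.radians(30.0)                      # an arbitrary angle for illustration
R = np.array([[np.cos(theta),  np.sin(theta)],
              [-np.sin(theta), np.cos(theta)]])

X = np.array([[1.0, 0.0],
              [0.5, 0.5],
              [0.0, 2.0]])
F = X @ R                                     # rows are (k_i, l_i)
print(F)
print(R @ R.T)                                # identity: R is an orthogonal rotation
```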

The aim of factor analysis is to calculate a rotation matrix R which rotates the
abstract factors (V) (principal components) into interpretable factors. The various
algorithms for factor analysis differ in the criterion to calculate the rotation matrix
R. Two classes of rotation methods can be distinguished: (i) rotation procedures
based on general criteria which are not specific for the domain of the data and (ii)
rotation procedures which use specific properties of the factors (e.g. non-negativity).

34.2.3 The Varimax rotation

In the previous section we have seen that axes defined by the column variables
can be rotated. It is also possible to rotate the principal components. Instead of
rotating the axes which define the column space of X, we rotate here the significant
PCs in the sub-space defined by V*^:

F =V*'r R (34.5)
ncxp ncxp pxp

The columns of V* are the abstract factors of X which should be rotated into real
factors. The matrix V*^T is rotated by means of an orthogonal rotation matrix R, so
that the resulting matrix F = V*^T R fulfils a given criterion. The criterion in
Varimax rotation is that the rows of F obtain maximal simplicity, which is usually
denoted as the requirement that F has a maximum row simplicity. The idea behind
this criterion is that 'real' factors should be easily interpretable which is the case
when the loadings of the factor are grouped over only a few variables. For instance
the vector f_1^T = [0 0 0 0.5 0.8 0.33] may be easier to interpret than the vector f_2^T =
[0.1 0.3 0.1 0.4 0.4 0.75]. It is more likely that the simple vector is a pure factor
than the less simple one. Returning to the air pollution example, the simple vector
f_1^T may represent the concentration profile of one of the pollution sources which
mainly contains the three last constituents.
A measure for the simplicity of a vector f_i is the variance of the squares of its p
elements, which should be maximized [2]:

Simp = var(f_i^2) = (1/p) Σ_{j=1}^{p} (f_ij^2 - mean(f_i^2))^2

To illustrate the concept of simplicity of a vector we calculate the simplicity of the
two vectors f_1 and f_2:
f_1      f_2      f_1^2    f_2^2
0        0.1      0        0.01
0        0.3      0        0.09
0        0.1      0        0.01
0.5      0.4      0.25     0.16
0.8      0.4      0.64     0.16
0.33     0.75     0.11     0.56

var(f_1^2) = 0.053        var(f_2^2) = 0.035

The simplicity of a matrix is the sum of the simplicities of its rows. Thus the
simplicity of a matrix F with the rows f_1^T and f_2^T as given before is 0.088. f_1 and f_2
are called varivectors.
By a varimax rotation the matrix V*^T is rotated by means of an orthogonal
rotation matrix R, so that the simplicity of the resulting matrix F = V*^T R is
maximal. Several algorithms have been proposed [2] for the calculation of R. Let
us suppose that a PCA of X with four variables yields the following two significant
principal components: PC1: [0.1023 -0.767 0.153 0.614] and PC2: [0.438 0.501
0.626 0.407]. The simplicity of the matrix with the rows PC1 and PC2 is equal to
0.0676. A varimax rotation of this matrix yields the varivectors f_1^T = [-0.167
-0.916 -0.233 0.269] and f_2^T = [0.417 -0.030 0.601 0.685] at an angle of 35
degrees with the PCs, and a corresponding simplicity equal to 0.1486. One can see
that varivector f_1 is mainly directed along variable 2, which is probably a pure
factor. The other variables (1, 3 and 4) load high on varivector f_2. Therefore, they
belong to the second pure factor. Anyway, because the varivectors are simpler than
the original PCs, one can safely conclude that they resemble the pure factors better.
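The numbers in this small example can be checked with a few lines of code. The brute-force scan over rotation angles below is used purely for illustration; it is not the varimax algorithm itself.

```python
# Simplicity of a vector = variance of its squared elements; scan rotation
# angles of the (PC1, PC2) pair to locate the simplest rotated vectors.
import numpy as np

def simplicity(f):
    f2 = f ** 2
    return np.mean((f2 - f2.mean()) ** 2)

PC1 = np.array([0.1023, -0.767, 0.153, 0.614])
PC2 = np.array([0.438, 0.501, 0.626, 0.407])
print(simplicity(PC1) + simplicity(PC2))          # about 0.068, as in the text

best = max(
    (simplicity(np.cos(a) * PC1 - np.sin(a) * PC2)
     + simplicity(np.sin(a) * PC1 + np.cos(a) * PC2), np.degrees(a))
    for a in np.radians(np.arange(0.0, 90.0, 0.1))
)
print(best)   # simplicity around 0.149 at an angle close to 35 degrees
```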
Several applications of varimax rotation in analytical chemistry have been
reported. As an example the varimax rotation is applied on the HPLC data table of

PAHs introduced in Section 34.1. A PCA applied on the transpose of this data
matrix yields abstract 'chromatograms' which are not the pure elution profiles.
These PCs are not simple as they show several minima and/or maxima coinciding
with the positions of the pure elution profiles (see Fig. 34.6). By a varimax rotation
it is possible to transform these PCs into vectors with a larger simplicity (grouped
variables and other variables near to zero). When the chromatographic resolution
is fairly good, these simple vectors coincide with the pure factors, here the elution
profiles of the species in the mixture (see Fig. 34.9). Several variants of the
varimax rotation, which differ in the way the rotated vectors are normalized, have
been reviewed by Forina et al. [2].

Fig. 34.9. Varimax rotated principal components given in Fig. 34.6.

34.2.4 Factor rotation by target transformation factor analysis (TTFA)

In the previous section, factors were searched by rotating the abstract factors
into pure factors, obeying a number of constraints. In some cases, however, one
may have a collection of candidate pure factors, e.g., a set of UV-visible or Mass
spectra of chemical compounds. Having measured a data matrix of mixture spectra
one could investigate whether compounds present in the mixture match with
compounds available in the data base of pure spectra. In that situation one could
first estimate the pure spectra from the mixture spectra and thereafter compare the
obtained spectra with the spectra in the data base. Alternatively, one could identify
the pure spectra by solving the equation X = C S^T for C, where S is the set of candidate
spectra to be tested and X contains the mixture spectra. All non-zero rows of C
indicate the presence of the spectrum in the corresponding row in S. This method
fails when S does not contain all pure spectra present in the mixtures. Moreover,
this procedure becomes unpractical when the number of candidate spectra is very
large and the whole data base has to be checked. Furthermore, when the spectra to
be checked are quite similar, the calculation of (S^T S)^-1 becomes unstable, leading
to large errors in C and the indication of wrong spectra [3].
By target transformation factor analysis (TTFA), each candidate spectrum,
called a target, is tested individually on its presence in the mixtures [4,5]. Here,
targets are tested in the space defined by the significant principal components of
the data matrix. Therefore, TTFA begins with a PCA of the data matrix X of the
measured spectra. The principle of TTFA can be explained in an algebraic as well
as a geometrical way. We start with the algebraic approach. Because, according to
eq. (34.4), any row of X can be written as

x_i = t_i* V*^T + e_i
1×p   1×nc   nc×p   1×p

each mixture spectrum is a linear combination of the nc significant eigenvectors.
Equally, the pure spectra are linear combinations of the first nc PCs. A target

spectrum taken from the library can be tested on this property. If the test passes, the
spectrum or target may be one of the pure factors. How this is done is explained
below.
The first step is to calculate the scores t_in of the target spectrum, in, to be tested
by solving the equation:

in = t_in V*^T + e
1×p   1×nc   nc×p   1×p

giving

t_in = in V* (V*^T V*)^-1 = in V*

These scores give the linear combination of the PCs that provides the best
estimation (in a least squares sense) of the target spectrum. How good that
estimation is can be evaluated by calculating the sum of squares of the residuals
between the re-estimated target or output target (from its scores) and the input
target. The output target out is equal to t_in V*^T. The overall expression for TTFA
therefore becomes:

out = in V* V*^T    (34.6)
If the difference between out and in (||out - in||) can be explained by the variance
of the noise, the test passes and the target is possibly one of the pure factors.
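The projection test of eq. (34.6) is easy to try out on simulated data. The sketch below is our own illustration (the mixture data, the candidate spectra and the use of a relative residual as acceptance measure are assumptions made for the example).

```python
# TTFA test out = in V* V*^T: a candidate (target) spectrum is projected onto
# the space of the significant PCs; the agreement between input and output
# target tells whether it can be one of the pure factors.
import numpy as np

rng = np.random.default_rng(7)
w = np.linspace(0, 1, 40)
pure = np.array([np.exp(-((w - c) ** 2) / 0.01) for c in (0.3, 0.6)])  # two pure spectra
C = rng.uniform(0, 1, (15, 2))                       # concentration profiles
X = C @ pure + rng.normal(0, 1e-3, (15, 40))         # mixture spectra

U, sv, Vt = np.linalg.svd(X, full_matrices=False)
V_star = Vt[:2].T                                    # significant loadings (p x nc)

def ttfa_test(target, V_star):
    out = (target @ V_star) @ V_star.T               # out = in V* V*^T, eq. (34.6)
    return np.linalg.norm(out - target) / np.linalg.norm(target)

good_target = pure[0]
bad_target = np.exp(-((w - 0.8) ** 2) / 0.01)        # a spectrum not in the mixtures
print(ttfa_test(good_target, V_star))                # small: target accepted
print(ttfa_test(bad_target, V_star))                 # large: target rejected
```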


Fig. 34.10. Simulated LC-DAD data-set of the separation of three PAHs (spectra 4, 5 and 6 in Fig.
34.11) (individual profiles and the sum).

[Fig. 34.11: the six candidate (target) spectra and the corresponding TTFA results discussed in the text.]

Otherwise, the test fails. In principle there are no limitations on the number of
compounds in the mixture.
Let us illustrate the method with an LC-UV-visible data set of three overlapping
peaks (Fig. 34.10). The TTFA results obtained with six candidate spectra, including
the pure spectra, are given in Fig. 34.11. The assignment of the right spectra by an
evaluation of the correlation coefficients between input and output target is
obvious.
Strasters et al. [6] applied this method to data sets obtained by HPLC-DAD and
concluded that the pure factors can be very well recovered even at very low
chromatographic resolutions (R_s < 0.02). A critical step in PCA-based methods is
the determination of the number of significant PCs, which indicates the number of
pure factors which have to be searched.
A geometrical representation of target FA is as follows. Suppose a data set with
spectra of a two-component system, measured at p wavelengths. All spectra are
located on a plane defined by the two PCs, PC1 and PC2 (see Fig. 34.12 where p = 3).
An input target in is a vector in the original p-dimensional space. A target (e.g. in1)
which lies in the plane PC1-PC2 is a linear combination of the two eigenvectors.
Other targets (e.g. in2) are not. Whether a target lies in the plane defined by the PCs
is tested by projecting the target on the plane (or space) defined by the eigen-
vectors. For a target which lies in the plane defined by the eigenvectors the input

Fig. 34.12. Geometrical representation of TTFA: target in1 is accepted; target in2 is rejected.

and output targets are the same. For any other target (e.g. in2), the projected target
out2 differs from the original one. If that difference is significantly greater than the
noise, then this target is not one of the factors. Figure 34.12 also shows that the
projected vector out2 is obtained by rotating one of the PCs. It also illustrates that
another way to express the difference between the input target and output target is
the angle between these two vectors. This angle is related to the correlation
coefficient between the two vectors (Chapter 9).

34.2.5 Curve resolution based methods

In previous methods no pre-knowledge of the factors was used to estimate the
pure factors. However, in many situations such pre-knowledge is available. For
instance, all factors are non-negative and all rows of the data matrix are non-
negative linear combinations of the pure factors. These properties can be exploited
to estimate the pure factors. One of the earliest approaches is curve resolution,
developed by Lawton and Sylvestre [7], which was applied on two-component
systems. Later on, several adaptations have been proposed to solve more complex
systems [8-10].

34.2.5.1 Curve Resolution of two-factor systems


In their fundamental paper on curve resolution of two-component systems,
Lawton and Sylvestre [7] studied a data matrix of spectra recorded during the
elution of two constituents. One can decide either to estimate the pure spectra (and
derive from them the concentration profiles) or the pure elution profiles (and
derive from them the spectra) by factor analysis. Curve resolution, as developed by
Lawton and Sylvestre, is based on the evaluation of the scores in the PC-space.
Because the scores of the spectra in the PC-space defined by the wavelengths have
a clearer structure (e.g. a line or a curve) than the scores of the elution profiles in
the PC-space defined by the elution times, curve resolution usually estimates pure
spectra. Thereafter, the pure elution profiles are estimated from the estimated pure
spectra. Because no information on the specific order of the spectra is used, curve
resolution is also applicable when the sequence of the spectra is not in a specific
order.
As explained before, the scores of the spectra can be plotted in the space defined
by the two principal components of the data matrix. The appearance of the scores
plot depends on the way the rows (spectra) and the columns have been normalized.
If the spectra are not normalized, all spectra are situated in a plane (see Fig. 34.5).
From the origin two straight lines depart, which are connected by a curved line. We
have already explained that the straight line segments correspond with the pure
spectra which are located in the wings of the elution bands (selective retention time


Fig. 34.13. Score plot (t1 vs t2) of the spectra given in Fig. 34.2 after normalisation. The points A and B
are the purest spectra in the data set. The points A' and B' are the spectra at the boundaries of the
non-negativity constraint.

zones). The curved line segment corresponds with the mixture spectra. When no
pure spectra are present, no straight lines are observed making it very difficult to
extract the pure spectra from this scores plot. Such plots are the basis of the
HELP-method to assess peak purity in liquid chromatography, which is discussed
in Section 34.3.3. When the spectra are normalized to a sum equal to one, the
scores plot given in Fig. 34.5 becomes much simpler. All spectra are situated on a
straight line (see Fig. 34.13). This is caused by the fact that by normalization of the
rows the dimensionality of the data is reduced by one (because the spectra are not
mean centred, we still find two significant PCs). The spectra on this line are
arranged in a certain order. At both ends are the two spectra with the smallest
contamination from the other compound. These are the purest spectrum (A) and
the purest spectrum (B). In fact, these two spectra are the two purest rows in X
which are the best available estimators of the two pure spectra. These estimates can
be improved. First we know that the normalized pure spectra are located on the line
passing through the line segment (A-B) of the mixture spectra. Thus the pure
spectra should be found somewhere on the extrapolated ends of the line. The
question arises how far one should extrapolate. At this point one needs constraints
to bound the solution. The first constraint is that the absorbances are non-negative.
At a given critical point (say A') located on the line, this constraint may be
violated. This point has to be found. Let us explain how this is done with an
example.

Suppose that 15 spectra have been measured at 20 wavelengths and that after
normalization on a sum equal to one they are compiled in a data matrix. Suppose
also that by PCA the following score and loading matrices are obtained:

T =
0.340   0.073
0.339   0.071
0.339   0.068
0.338   0.064
0.336   0.058
0.333   0.049
0.330   0.035
0.324   0.027
0.317  -0.008
0.309  -0.038
0.300  -0.072
0.289  -0.107
0.279  -0.140
0.271  -0.169
0.265  -0.191

V =
0.372   0.160
0.609   0.391
0.148  -0.208
0.218  -0.450
0.314  -0.583
0.402  -0.070
0.088  -0.230
0.101  -0.307
0.058  -0.063
0.101  -0.005
0.100   0.023
0.206   0.144
0.139   0.110
0.239   0.212
0.016  -0.007
0.010  -0.0108
0.0035  0.0006
0.002  -0.0009
0.0004 -0.001
0       0
                                            (34.7)

The scores and the loadings are plotted in Figs. 34.13 and 34.4. The fact that all absorbances of the pure spectra s1 and s2 should be non-negative means that the linear combination of the two PCs, s = t1 v1 + t2 v2, should be non-negative for each element s_k of s, given by:

t1 v1k + t2 v2k ≥ 0   for all k = 1 to p   (34.8)

Because the elements of v1 all have equal sign, contrary to the elements of v2 which have mixed signs, the t2/t1 values satisfying eq. (34.8) are found on the interval [m,n], where m is equal to

m = -min_k (v1k/v2k)   (v2k > 0)   (34.9)

and n is equal to

n = -max_k (v1k/v2k)   (v2k < 0)

with t1 > 0.
The boundaries, m and n, define two lines in the (v1, v2) plane: the line t2 = m t1 and the line t2 = n t1. The intersection of these two lines with the line defined by the mixture spectra gives two points A' and B' which are the estimates of the pure spectra. The intervals (A-A') and (B-B') define the solution bands between which the pure spectra are situated.
For the example discussed here, the points A' and B' are found as follows. First, we calculate m and n by dividing all elements of v1 by the corresponding elements of v2, which gives:

(v11/v21; v12/v22; v13/v23; ...; v1p/v2p) =

(2.32; 1.55; -0.711; -0.484; -0.538; -5.74; -0.382; -0.329; -0.92; -20.2; 4.34; 1.43; 1.26; 1.13; -2.28; -0.93; 5.8; -2.0; -0.4; -)

The value 1.13 is the smallest positive value and -0.329 has the smallest absolute value of all negative numbers. Therefore, m = -1.13 and n = 0.329, leading to:

-1.13 ≤ t2/t1 ≤ 0.329

or, the m-line is t2 = -1.13 t1 and the n-line is t2 = 0.329 t1.


In Fig. 34.13 the points A' and B' are situated at the intersections of the m- and n-lines with the line defined by the mixture spectra. The scores are respectively equal to (0.3620, 0.1187) and (0.2415, -0.273). According to eq. (34.8) the spectra corresponding to these scores are:

s1 = 0.2415 v1 - 0.273 v2

s2 = 0.3620 v1 + 0.1187 v2
which are plotted in Fig. 34.14.
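As an illustration of this calculation (not taken from the original reference), the bounds m and n and the boundary spectra can be computed with a few lines of numpy. The loading matrix V below contains the values of eq. (34.7), the boundary scores are those quoted above, and all variable names are my own.

```python
import numpy as np

# Loading matrix V (v1, v2) from eq. (34.7); the all-zero last row is omitted
# to avoid division by zero when forming the ratios v1k/v2k.
V = np.array([
    [0.372, 0.160], [0.609, 0.391], [0.148, -0.208], [0.218, -0.450],
    [0.314, -0.583], [0.402, -0.070], [0.088, -0.230], [0.101, -0.307],
    [0.058, -0.063], [0.101, -0.005], [0.100, 0.023],  [0.206, 0.144],
    [0.139, 0.110],  [0.239, 0.212],  [0.016, -0.007], [0.010, -0.0108],
    [0.0035, 0.0006], [0.002, -0.0009], [0.0004, -0.001]])

ratio = V[:, 0] / V[:, 1]               # v1k/v2k, cf. the list of ratios in the text
m = -np.min(ratio[V[:, 1] > 0])         # eq. (34.9): m is about -1.13
n = -np.max(ratio[V[:, 1] < 0])         # n is about 0.329
print(round(m, 2), round(n, 3))

# Boundary spectra at the quoted scores of B' and A' (cf. Fig. 34.14): s = t1*v1 + t2*v2
s1 = 0.2415 * V[:, 0] - 0.2730 * V[:, 1]
s2 = 0.3620 * V[:, 0] + 0.1187 * V[:, 1]
print(bool(np.all(s1 > -1e-3)), bool(np.all(s2 > -1e-3)))  # non-negativity holds (to rounding)
```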
It should be borne in mind that the two spectra given in Fig. 34.14 are estimates
of the pure spectra, which exactly fulfil the constraints. Two other estimates of the
pure spectra are the purest mixture spectra A and B. When plotted in the same
figure (see Figs. 34.15 and 34.16) a good impression is obtained of the remaining

Fig. 34.14. Estimated pure spectra at the points A' and B', with respectively the scores (t_A'1, t_A'2) and (t_B'1, t_B'2).


Fig. 34.15. Solution band for the spectrum of fluoranthene.

uncertainty (or solution bands) in the spectra. The widths of the solution bands
depend on the similarity of the spectra, and on the concentration ratios spanned by
the mixtures. In HPLC with DAD these concentration ratios depend on the

Fig. 34.16. Solution band for the spectrum of chrysene.

chromatographic overlap and on the concentration ratio of the two components. The effect of these two factors has been discussed in the literature [11].
The solution bands can be somewhat narrowed by taking into consideration a second constraint, namely that the concentrations obtained by solving eq. (34.3) should be non-negative. By combining eqs. (34.3) and (34.4), it follows that each measured spectrum i can be represented as:

x_i = c_i1 s1 + c_i2 s2 = c_i1 (t_A'1 v1 + t_A'2 v2) + c_i2 (t_B'1 v1 + t_B'2 v2)

(t_A'1, t_A'2) and (t_B'1, t_B'2) are the scores of the two spectra meeting the non-negativity criterion. Equally, each mixture spectrum is a linear combination of the two first principal components:

x_i = t_i1 v1 + t_i2 v2

Therefore, for all spectra i = 1,...,n the scores t_i1 and t_i2 are given by:

t_i1 = c_i1 t_A'1 + c_i2 t_B'1   and   t_i2 = c_i1 t_A'2 + c_i2 t_B'2

By solving these two equations for c_i1 and c_i2 and by applying the condition that c_i1 ≥ 0 and c_i2 ≥ 0, we find [7] that the scores of the first pure spectrum (t_A'1 and t_A'2) should fulfil the condition:

Fig. 34.17. Non-negativity constraints (m- and n-line) applied on the scores of the normalized chromatograms.

t_A'2 / t_A'1 ≥ max_i (t_i2 / t_i1) = m'

and that the scores of the second pure spectrum (t_B'1 and t_B'2) should fulfil the condition:

t_B'2 / t_B'1 ≤ min_i (t_i2 / t_i1) = n'

If these conditions are not fulfilled by the scores (t_A'1, t_A'2) and (t_B'1, t_B'2), the pure spectra should be adapted by taking the intersections of the line defined by the mixture spectra with the lines t2 = m' t1 and t2 = n' t1. In our example, the above conditions are met and therefore the solution bands are not affected.
We can also estimate the pure column factors (here elution profiles). Therefore,
we take the transpose of X (rows are chromatograms), normalize all rows to a sum
equal to one, and calculate the PCs. These PCs represent abstract chromatograms
as is shown in Fig. 34.6. The scores of the chromatograms are again situated on a
straight line (Fig. 34.17). The purest elution profiles are at the two ends of the line
segment. By applying eq. (34.9) one obtains the m- and n-lines defined by the
constraint that the elution profiles should be non-negative (see Fig. 34.17):

Fig. 34.18. Elution profiles estimated by curve resolution.

t2 = -0.5103 t1 (m-line)

t2 = 0.1909 t1 (n-line)

The intersection of these two lines with the line defined by the mixture chromato-
grams gives the scores of the 'pure' elution profiles (0.360, 0.070) and (0.325,
-0.166) resulting in the chromatograms given in Fig. 34.18. Knowing the pure
elution profiles, the spectra are easily calculated by solving eq. (34.3).

34.2.5.2 Curve resolution of three-factor systems

Having derived a solution for two-component systems, we could try and extend
this solution to three-component systems. A PCA of a data set of spectra of three-component mixtures yields three significant eigenvectors and a score matrix with three scores for each spectrum. Therefore, the spectra are located in a three-dimensional space defined by the eigenvectors. For the same reason as explained for the two-component system, by normalization the ternary spectra are found on a surface with one dimension less than the number of compounds, in this case a plane.
If the ternary mixtures span the whole concentration range, i.e. from 0 to 100% for each component, some of the mixtures contain only one or two compounds.
As explained before, the binary mixtures should be found on straight lines. In this

Fig. 34.19. Three-component spectra (normalized) projected in the space defined by the two first PCs.

case, three straight lines should be found representing the binary mixtures of
compounds (A,B), (A,C) and (B,C), respectively. These three lines define a
triangle. Ternary mixtures are located inside the triangle. The corners of the triangle represent the pure spectra, and on each side are the binary mixtures (see Fig. 34.19). The problem of finding the pure spectra (or pure factors) in a three-component (or three-factor) system is to first estimate the position of the lines on which the binary spectra are situated, followed by an estimate of the intersections (corner points). If no spectra of binary mixtures are in the data set, these lines are not directly observed and should be estimated. In a similar way as for the two-component case, these lines and corner points have to be found by
applying constraints on the spectra (non-negativity of the absorbances) and on the
resulting concentrations. Details of this method are found in Ref. [9]. Clearly, the
factor analysis of three-component systems by curve resolution is much more
complex than it is for the two-component case. In fact the generalisation of the
method to systems with more than three components appears to be impractical.

34.2.6 Factor rotation by iterative target transformation factor analysis (ITTFA)

Iterative target transformation factor analysis (ITTFA) is an extension of TTFA and has been introduced by Hopke et al. [12] in environmetrics and by Gemperline [13,14] and Vandeginste et al. [15] in chromatography. The idea behind ITTFA is

Fig. 34.20. ITTFA: projection of the input vector in1 in the PC-space gives out1. A new input target in2 is obtained by adapting out1 to specific constraints. in2 projected in the PC-space gives out2.

that, in the absence of good candidate targets to be tested by TTFA, one defines an
initial target, which is gradually improved until the TTFA test passes.
The first step of ITTFA is to project any input target on the eigenvector space (Fig. 34.20) defined by the data matrix of the mixtures, for example in1 into out1. Next, the projected vector is inspected for its likelihood to be a pure factor. By examining out1, we find that some loadings do not fulfil specific requirements, e.g. if the factor should represent an elution profile, then all values should be non-negative and no shoulders or secondary maxima should be present. Instead of rotating out1 in the PC1-PC2 plane, we adapt the loadings of this vector in such a way that it obeys all imposed constraints; i.e. all negative values are replaced by zeros and, in the case of chromatography for instance, all secondary peaks are cut away. This step is schematically shown in Fig. 34.21. The consequence of this adaptation is that the vector out1 is lifted from the plane and rotated over a small angle, giving in2. By this procedure, we hope to obtain a result that is closer to one of the true factors than in1 was.
In the following step the vector in2 is considered as a new target, which in its turn is submitted to a TTFA. As a result, a second projected output target out2 is obtained. The overall result is a rotation of out1 into out2, as is schematically shown in Fig. 34.20. The procedure of inspection and adaptation of the loadings is then repeated, until the difference between output and input targets converges to zero,
indicating that a stable solution is obtained. This solution is one of the pure factors.
To find a second factor, the procedure is repeated with a new initial target. The
main condition for success is that good constraints can be formulated for adapting
the targets and that good initial targets can be found.

Fig. 34.21. Example of the adaptation of projected targets.

In summary ITTFA comprises the following steps:


(1) Calculate the significant PCs from the data matrix;
(2) Choose an initial target ini;
(3) Project in1 in the space defined by the eigenvectors by applying eq. (34.6). An output target out1 is obtained;
(4) Evaluate the correlation between the input and output target; if this correl-
ation is larger than a specified value, the procedure converges to a factor.
Repeat the procedure with another initial target, until all pure factors are
estimated;
(5) Otherwise adapt the projected target, by applying constraints. This gives a
new target to be tested. Return to step 3.
Let us explain the above steps of ITTFA in more detail for data sets obtained in
HPLC with DAD.
(1) Taking into consideration the fact that initial targets have to be formulated
and that the projected targets have to be inspected and adapted, we estimate the
pure elution profiles instead of the pure spectra. Therefore, the PCA is carried out
in the elution time space and the resulting principal components represent abstract
elution profiles (see Fig. 34.6). The factors we are looking for are the elution
profiles of the different compounds. The first abstract factor closely resembles one
of the elution profiles. The other one also has a smooth shape, but with negative
values because it is orthogonal to the first one.
(2) Several procedures have been proposed to select a target. The first one is the
so-called needle search [13,14,16]. This procedure starts the ITTFA with as many
needle targets (i.e., all elements of the target are zero except one which has a value
equal to one) as there are variables. Each needle is projected into the space defined
by the eigenvectors once, and the correlation between input and output target is
evaluated. The nc (number of significant eigenvectors) best targets are retained to
start the iterations. A second procedure is to apply first a VARIMAX rotation on
the eigenvectors (see Section 34.2.3). Because elution profiles are present in
specific windows of the data matrix and because they are almost orthogonal, a
VARIMAX yields good initial profiles for each factor, indicating at least the
position of the peaks.
(3) Targets are adapted by replacing all negative values by zero. Furthermore,
all secondary peaks or shoulders are removed. Some typical examples are given in
Fig. 34.21.
Because all elution profiles are normalized to a sum equal to one, the resulting
factors are also normalized. The overall result of ITTFA is therefore a set of peaks
with equal peak area or height. Depending on the kind of information wanted,
either the spectra or elution profiles, ITTFA is concluded by a terminating step.
The spectra are obtained by solving eq. (34.3), where X is the original and
unnormalized data set and C contains the normalized elution profiles. After the
spectra are obtained, we can denormalize the elution profiles by multiplying them by the norms of the estimated spectra. This is easily seen from the fact that X = C K S, where C and S are respectively the elution profiles and spectra, normalized to a sum equal to one, and K is a diagonal matrix with the norms of the spectra on its diagonal. From C and X we obtain KS. The norms of the spectra KS are equal to the diagonal elements of K. Multiplication of C by K in its turn gives CK, the denormalized elution profiles.
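The loop described above can be sketched compactly in code. The sketch below is a simplified illustration, not the implementation of refs. [13-15]: it assumes the data matrix has elution times as rows and wavelengths as columns (so the left singular vectors are abstract elution profiles), uses needle targets, and applies only the non-negativity and sum-to-one constraints (removal of secondary peaks is omitted); all names are illustrative.

```python
import numpy as np

def ittfa(X, n_comp, max_iter=500, tol=1e-10):
    """ITTFA sketch: X has elution times as rows and wavelengths as columns.
    Returns estimated elution profiles C and spectra S (up to normalization)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    P = U[:, :n_comp]                           # abstract elution profiles (orthonormal)
    # needle search: keep the n_comp needle targets best represented in the PC space
    starts = np.argsort(np.sum(P ** 2, axis=1))[-n_comp:]
    profiles = []
    for idx in starts:
        target = np.zeros(X.shape[0]); target[idx] = 1.0
        for _ in range(max_iter):
            out = P @ (P.T @ target)            # project the target into the PC space
            out = np.clip(out, 0.0, None)       # constraint: non-negative profile
            if out.sum() > 0:
                out = out / out.sum()           # keep each profile normalized to sum one
            if np.linalg.norm(out - target) < tol:
                break
            target = out                        # the adapted target becomes the new input
        profiles.append(out)                    # (removal of secondary peaks omitted)
    C = np.array(profiles).T                    # times x components
    S = np.linalg.lstsq(C, X, rcond=None)[0]    # spectra from eq. (34.3)
    return C, S
```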
An example of the results obtained by ITTFA is given in Figs. 34.22 and 34.23.
The solid lines in the two figures are the estimated spectra and elution profiles. The
dotted lines are the true spectra and profiles. The ability to estimate spectra and
elution profiles by ITTFA depends on the similarity of the spectra, the overlap of
the elution profiles and the relative intensity of the signal. By simulation, Strasters
et al. [6] evaluated the quality of the spectra as a function of the peak overlap and
spectrum similarity. The quality of the spectra was expressed as the correlation
coefficient between the true and estimated spectrum. As can be seen from Fig. 34.24, spectra of good quality are obtained for a chromatographic resolution Rs > 0.45. The quality of the elution profiles, expressed as the accuracy of the peak
areas, has been evaluated by Vandeginste et al. [11].

Fig. 34.23. Individual elution profiles and the total profile estimated by ITTFA compared to the true
profiles.


Fig. 34.24. Quality (correlation coefficient between true and estimated spectra) of the spectra, esti-
mated by ITTFA depending on similarity of the spectra and the chromatographic resolution.

One of the early applications of ITTFA was developed by Hopke et al. in the area of geochemistry [12]. By measuring the elemental composition along a section crossing two lava beds, Hopke was able to derive the elemental composition of
the two lava bed sources. This application is very similar to the air dust case given
in the introduction of this chapter (see Section 34.1).

34.3 Evolutionary and local rank methods

Methods for evolutionary rank analysis are explained and discussed in this
section. The different approaches of evolutionary rank analysis have in common
that the two-way data structure is analyzed piece-wise to locally reveal the
presence of the analytes. A reference to the review of evolutionary methods by
Toft et al. is included in the additional recommended reading list at the end of this
chapter.

34.3.1 Evolving factor analysis (EFA)

As previously discussed, data sets can in principle be factor-analyzed column-wise or row-wise. The fact that the rows follow a certain logical sequence may be
exploited for finding the pure row factors. We illustrate this on a four-component
chromatogram given in Fig. 34.25. The compounds are present in well-defined
time windows, e.g. compound A in window t1-t5, compound B in window t2-t6, compound C in window t3-t7 and compound D in window t4-t8. The ordering provides additional information and this can be exploited to estimate the pure spectra from which the respective concentration profiles are estimated. Therefore, sub-matrices are formed by adding rows to an initial top sub-matrix T0, top down, or by adding rows to a bottom sub-matrix B0, bottom up. In our example the
rank of these matrices increases from one to four as is schematically shown in Fig.
34.25. By analysing these ranks as a function of the number of added rows, time
windows are derived where one, two, three, etc. significant PCs, are present.
Such an analysis of the rank as a function of the number of rows of X included in
the principal components analysis is the principle of evolving principal comp-
onents analysis (EPCA), which is the first step of evolving factor analysis (EFA)
developed by Maeder et al. [17-19]. As explained in Chapter 17, an eigenvalue is
associated with a principal component. This eigenvalue expresses the amount of
variance described by the eigenvector. By determining the eigenvalues or
variances associated with each PC for each sub-matrix and plotting these
eigenvalues (or the log of the eigenvalues) as a function of the number of rows ni
included in the sub-matrix, a typical plot, dependent on the studied system is
obtained. Figure 34.26 shows the plot obtained for a HPLC-DAD data set. This

Fig. 34.25. Time windows in which compounds are present in a composite 4-component peak, with the ranks of the data matrices formed by adding rows to a top matrix T0 (top down) or to a bottom matrix B0 (bottom up).

figure can be interpreted as follows. For each sub-matrix a new significant eigenvector appears each time a new compound is introduced in the spectra, thus at t = t1, t2, t3 and t4. This is observed from an increase of the eigenvalue. From this plot it is not yet possible to derive the compound windows, as this plot indicates the appearance of a new compound but not its disappearance. Therefore, a second EPCA is carried out, but now in the reversed order, i.e. one starts with the initial matrix B0 and adds rows bottom up. A similar plot (see Fig. 34.27) is now obtained but in the reversed order. New factors appear at t = t8, t7, t6 and t5. In the forward direction this means that components disappear at t = t5, t6, t7 and t8. Because the widths of elution bands in HPLC in a narrow region of retention times are more or less equal, the compound which appears first in the spectra should also be the first one to disappear. Therefore, the compound windows are found by connecting the line indicating the first appearing compound in Fig. 34.26 with the last appearing compound in Fig. 34.27. Both figures can be combined in a single figure (see Fig. 34.28), from which the concentration windows can be reconstructed as indicated.
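The forward and backward eigenvalue curves of Figs. 34.26 and 34.27 are straightforward to generate. The sketch below is a generic illustration (names are mine, not the algorithm of refs. [17-19]): it simply repeats an SVD on growing sub-matrices taken from the top or from the bottom of X.

```python
import numpy as np

def evolving_eigenvalues(X, n_keep=4, forward=True):
    """Evolving PCA sketch: eigenvalues of growing sub-matrices of X (rows = spectra),
    built top-down (forward=True, cf. Fig. 34.26) or bottom-up (cf. Fig. 34.27)."""
    n = X.shape[0]
    ev = np.full((n - 1, n_keep), np.nan)
    for j, size in enumerate(range(2, n + 1)):
        Xsub = X[:size] if forward else X[n - size:]
        s = np.linalg.svd(Xsub, compute_uv=False)
        k = min(n_keep, s.size)
        ev[j, :k] = (s ** 2)[:k]         # eigenvalues of the sub-matrix
    return ev                            # plot log(ev) versus the number of added rows
```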
Tauler and Casassas [20] applied this technique to reconstruct the concentration

Fig. 34.26. Eigenvalues when adding rows to T0 (forward evolving PCA).


Fig. 34.27. Eigenvalues when adding rows to B0 (backward evolving PCA).



Fig. 34.28. Reconstructed concentration profiles from the combination of Figs. 34.26 and 34.27.

Fig. 34.29. Construction of a sub-matrix containing only the zero rows of c_i.

profiles in equilibria studies. One could consider these bell-shaped profiles of the
eigenvalues as being a first rough estimate of the concentration profiles C in eq.
(34.3). It is now possible to calculate the pure spectra S in an iterative way by
solving eq. (34.3) first for S (by a least-squares method). Because C is a first approximation, the spectra have negative values, which can be set to zero. In a next
step the corrected spectra are used to solve eq. (34.3) for C. These steps are
repeated until S and C are stable. This is the principle of alternating regression, also
called alternating least squares, introduced by Karjalainen [21]. This iterative
resolution method has been successfully applied to many different analytical data,
including chromatographic examples (see recommended additional reading).
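A minimal sketch of this alternating regression step, assuming an initial estimate C0 of the concentration profiles (e.g. the bell-shaped eigenvalue profiles) is available; the function below is illustrative and omits a convergence test.

```python
import numpy as np

def alternating_regression(X, C0, n_iter=100):
    """Alternating least squares sketch: X ~ C S, starting from an initial estimate C0
    of the concentration profiles (e.g. the bell-shaped eigenvalue profiles of EFA)."""
    C = C0.copy()
    for _ in range(n_iter):
        S = np.clip(np.linalg.lstsq(C, X, rcond=None)[0], 0.0, None)        # spectra
        C = np.clip(np.linalg.lstsq(S.T, X.T, rcond=None)[0].T, 0.0, None)  # profiles
    return C, S
```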
An alternative and faster method estimates the pure spectra in a single step. The compound windows derived from an EPCA are used to calculate a rotation matrix R by which the PCs are transformed into the pure spectra: X = C S^T = T* R R^-1 V*^T. Consequently, C = T* R, where T* is the score matrix of X. Focusing on a particular component, c_i (the ith column of the matrix C), one can write c_i = T* r_i, where r_i is the ith column of R. Because compound i is not present in the shaded areas of c_i (see Fig. 34.29), the values of c_i in these areas are zero. This allows us to calculate the rotation vector r_i. Therefore, all zero rows of c_i are combined into a new column vector, c_i^0, and the corresponding rows of T* are combined into a new matrix T*^0. As a result one obtains:

c_i^0 = T*^0 r_i = 0   (34.10)

The procedure is schematically shown in Fig. 34.29. Equation (34.10) represents a homogeneous system of equations with a trivial solution r_i = 0. Because component i is absent in the concentration vector, this component does not contribute to the matrix T*^0. As a consequence the rank of T*^0 is one less than its number of columns. A non-trivial solution therefore can be calculated. The value of one element of r_i is arbitrarily chosen and the other elements are calculated by a simple regression [17]. Because the solution depends on the initially chosen value, the size (scale) of the true factors remains undetermined. By repeating this procedure for all columns c_i (i = 1 to nc), one obtains all columns of R, the entire rotation matrix.
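In code, the non-trivial solution of eq. (34.10) can be obtained, for example, as the (approximate) null-space vector of the sub-matrix T*^0, taken from its SVD, rather than by the simple regression of ref. [17]; the following sketch uses that route and illustrative names.

```python
import numpy as np

def rotation_vector(T_star, zero_rows):
    """Solve eq. (34.10): find r (up to scale) such that T*[zero_rows] @ r = 0.
    The last right singular vector of the sub-matrix spans its approximate null space."""
    _, _, Vt = np.linalg.svd(T_star[zero_rows, :])
    return Vt[-1]

# Example of use (zero_windows would come from the EPCA compound windows):
# R = np.column_stack([rotation_vector(T_star, w) for w in zero_windows])
# C = T_star @ R          # rotated scores: the concentration profiles, up to scale
```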

34.3.2 Fixed-size window evolving factor analysis (FSWEFA)

When the rows of a data matrix follow a certain pattern, e.g. the appearance and disappearance of compounds as a function of time, a fixed-size window EFA is applicable. This is the case, for instance, for data sets generated by hyphenated measurement techniques such as HPLC with DAD. Fixed-size window EFA [22] can be applied for detecting the presence of minor compounds (< 1%) and for the resolution of a data set into its components (pure spectra and elution profiles).

Fig. 34.30. Principle of fixed-size window evolving factor analysis.

Fig. 34.31. Eigenvalue plot obtained by fixed-size moving window FA for a pure peak.

Contrary to EFA, which calculates a PCA of a sub data matrix to which rows are
added, in fixed-size window EFA a small window of rows is selected which is
moved over the data set (see Fig. 34.30). Typically, a window of seven consecutive
spectra is used. At each new position of the window a PCA is calculated and the
eigenvalues associated with each PC are recalculated and are plotted as a function
of the position of the window. This yields a number of eigenvalue-lines. Figure
34.31 shows the eigenvalue-lines obtained for a simulated pure LC-DAD peak. In
the baseline zones (null spectra) all eigenvalue-lines are noisy horizontal lines. In
the selective retention time regions (one component present) the eigenvalue-line
associated with the first PC follows the appearance and disappearance of the

Fig. 34.32. Detectability of a minor compound by fixed-size window evolving factor analysis.

compound. The other eigenvalue-lines remain noisy horizontal lines. From the
variations observed for eigenvalue-lines representing noise one can derive a
critical value to decide on the significance of an eigenvalue. Figure 34.32 demon-
strates the capabilities of this method for the detection of a minor compound which
is partly separated from the main constituent. Impurities as low as 1% are detected
when chromatographic resolution > 0.3.
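A sketch of the moving-window calculation, assuming a window of seven consecutive spectra as mentioned above (the names are illustrative):

```python
import numpy as np

def fswefa(X, window=7):
    """Fixed-size window EFA sketch: slide a window of consecutive spectra over X
    (rows = spectra) and record the eigenvalues at every window position."""
    n = X.shape[0]
    ev = np.zeros((n - window + 1, window))
    for i in range(n - window + 1):
        s = np.linalg.svd(X[i:i + window], compute_uv=False)
        ev[i, :s.size] = s ** 2          # eigenvalue-lines, cf. Fig. 34.31
    return ev                            # plot log10(ev) against window position
```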
Eigenstructure tracking analysis (ETA), developed by Toft and Kvalheim [23], is very similar to FSWEFA. Here, one starts initially with a window of size 2 and the size is increased by one until a singular value representing noise is obtained, i.e. the window size is one higher than the number of co-eluting compounds.

34.3.3 Heuristic evolving latent projections (HELP)

Heuristic evolving latent projection [24], which was introduced in Chapter 17,
concentrates on the evaluation of the results of a local principal components
analysis (LPCA), performed on a part of the data matrix. Both the scores and the loadings obtained by a local PCA are inspected for the presence of structures
which indicate selective retention time regions (pure spectra) and selective wave-
lengths. Because the data are not pre-treated (normalization or any other type of
pretreatment) the score plots obtained are very similar to the scores plots previously
discussed in Section 34.2.1. Straight lines departing from the origin in the scores
plot indicate pure spectra. HELP can be conducted in the wavelength space and in
the retention time space. Purity checks and the search for selective retention time

Fig. 34.33. Simulated three-component system: (a) spectra (b) elution profiles and (c) scores plot ob-
tained by a global PCA with HELP.

regions are usually conducted in the wavelength space. Selective wavelengths are
found in the retention time space, and if present yield directly the pure elution
profiles.
The method also provides what is called a data-scope, which zooms in on a
particular part of the data set. The functioning of the data-scope is illustrated with a
simulated three-component system given in Figs. 34.33a and b. The scores plot
(Fig. 34.33c) obtained by a global PCA in wavelength space shows the usual line
structures. In this case the data-scope technique is applied to evaluate the purity of
the up-slope and down-slope elution zones of the peak. Therefore, data-scope
performs a local PCA on the up-slope and down-slope regions of the data.
In Fig. 34.34 the first three local PCs with the associated eigenvalues are given
for four retention time regions: (a) a zero component region (Fig. 34.34a), (b)
selective chromatographic regions at the up-slope side (Fig. 34.34b), (c) and at the

Fig. 34.34. The three first principal components obtained by a local PCA: (a) zero component region, (b) up-slope selective region, (c) down-slope selective region, (d) three-component region. The spectra included in the local PCA are indicated in the score plot and in the chromatogram.

down-slope side (Fig. 34.34c), and a three-component region in the centre of the
peak profile (Fig. 34.34d). The selected regions of the data set are indicated in the
scores plot (wavelength space) and in the chromatogram. The loading patterns of
the three PCs obtained for the suspected selective chromatographic regions,
clearly show structure in the first principal component (Fig. 34.34 b,c). The
significance of the second principal component should be confirmed by separating
the variation originating from this component from the variation due to
experimental or instrumental errors. This is done by comparing the second eigenvalue in the suspected selective region with the first eigenvalue in the zero-component region (see Table 34.1). In the first instance, the rank of the
investigated local region was established by means of an F-test. For the up-slope


Fig. 34.34 continued, (b) Up-slope selective region.

TABLE 34.1
Local rank analysis of suspected selective regions in a LC-DAD data set (see Fig. 34.34)

Eigenvalues

PC   Zero-compound region   Suspected selective region (up-slope)   Suspected selective region (down-slope)
1    0.01033                0.48437                                 0.54828
2    0.00969                0.0090                                  0.0081
3    0.00860                0.00481                                 0.00569

Fig. 34.34 continued, (c) Down-slope selective region

region F = 0.01033/0.00900 = 1.148 and for the down-slope region F = 0.01033/0.00881 = 1.172. The number of degrees of freedom is equal to the number of rows
included in the local rank analysis times the number of wavelengths. Both values
are below the critical value, indicating that the two selected regions are pure. As
the number of spectra used in the zero-component region increases, the critical F
value tends towards 1 and one risks setting the detection limit too high. To avoid
that problem, a non-parametric test was introduced [24]. On the other hand, when
F exceeds the critical F-value, one should be careful in concluding that this
variation is due to a chemical compound. By the occurrence of artifacts, such as
non-linearities, baseline drift and heteroscedasticity of the noise, the second
principal component may falsely show structure [25,26] and lead to a false positive

Fig. 34.34 continued, (d) Three-component region.

identification of a component. This may seriously complicate the estimation of pure factors when applying the methods explained in this chapter. This aspect is
discussed in more detail in Section 34.6. To overcome these problems, the HELP
technique includes a procedure to detect artifacts. For instance, baseline points
should be located in the origin of the scores plot. If this is not the case for a matrix
consisting of baseline spectra taken from the left and right hand side of the peak of
interest, a drift may be present. The effectiveness of a baseline correction (e.g.
subtraction of the baseline) can be monitored by repeating the local PCA.

34.4 Pure column (or row) techniques

Pure variables are fully selective for one of the factors. This means that only one pure factor contributes to the values of that variable. When the pure variables or selective wavelengths for each factor are known, the pure spectra can be calculated in a straightforward manner from the mixture spectra by solving:

X = P S^T   (34.11)

where X is the matrix of the mixture spectra, S is the matrix of the unknown pure spectra and P is a sub-matrix of X which contains the nc pure columns (one selective column per factor). The pure spectra are then estimated by a least squares procedure:

S^T = (P^T P)^-1 P^T X
The pure variable technique can be applied in the column space (wavelength) as
well as in the row space (time). When applied in the column space, a pure column
is one of the column factors. In LC-DAD this is the elution profile of the compound
which contains that selective wavelength in its spectrum. When applied in the row
space, a pure row is a pure spectrum measured in a zone where only one compound
elutes.

34.4.1 The variance diagram (VARDIA) technique

In Section 34.2 we explained that factor analysis consists of a rotation of the principal components of the data matrix under certain constraints. When the objects in the data matrix are ordered, i.e. the compounds are present in certain row-windows, then the rotation matrix can be calculated in a straightforward way. For non-ordered spectra with three or fewer components, solution bands for the pure factors are obtained by curve resolution, which starts with looking for the purest spectra (i.e. rows) in the data matrix. In this section we discuss the VARDIA [27,28] technique which yields clusters of pure variables (columns) for a certain pure factor.
The principle of the VARDIA technique is schematically shown in Fig. 34.35. Let us suppose that v1 and v2 are the two first principal components of a data matrix with p variables. The plane defined by v1 and v2 is positioned in the original space of the p variables (see Fig. 34.35). For simplicity only the variables x1, x2 and x3 are included in the figure. The angle between the axes x1 and x2 is small because these two variables are correlated, whereas the axis x3 is almost orthogonal to x1 and x2, indicating a very low correlation. The loadings v1 and v2 are the projections of all variables on these principal components. A high loading for a variable on v1 means

Fig. 34.35. Principle of the VARDIA technique demonstrated on a two-component system. f is rotated in the v1-v2 plane. The variables x1, x2 and x3 are projected on f during the rotation. In this figure f is oriented in the direction of x1 and x2.

that the angle of this variable axis with v1 is small. Conversely, a small loading for the variable on v1 means that the angle with this variable is large. Instead of projecting the variables on the principal components, one can also project them on any other vector f located in the space defined by the PCs and which makes an angle Γ with v1. Vector f is defined as:

f = v1 cos Γ + v2 sin Γ

The loading of variable j on this vector is:

f_j = v1j cos Γ + v2j sin Γ   (34.12)

The larger the loading f_j of variable j on f, the smaller the angle is between f and this variable. This means that f lies in the direction of that variable. f may lie in the direction of several correlated variables (x1 and x2 in Fig. 34.35), or a cluster of variables. Because correlated variables are assumed to belong to the same pure factor, f could be one of the pure factors. Thus by rotating f in the space defined by the principal components and observing the loadings during that rotation, one may find directions where variables cluster in the multivariate space. Such a cluster belongs to a pure factor. This is illustrated with the LC-DAD data discussed in Section 34.2.5.1. The two principal components found for this data matrix are given in eq. (34.7). By way of illustration, the loadings of the variables 2 and 15 on f are

TABLE 34.2

The loadings of variables 2 and 15 during the rotation of f (see eq. (34.7) for the values of PC1 and PC2)

Rotation angle Variable 2 Variable 15

10 0.668 0.014
20 0.706 0.013
30 0.723 0.010
40 0.718 0.008
50 0.691 0.005
60 0.643 0.002
70 0.576 -0.001
80 0.491 -0.004
90 0.391 -0.007
100 0.279 -0.010
110 0.159 -0.012
120 0.034 -0.014
130 -0.092 -0.016
140 -0.215 -0.017
150 -0.332 -0.017
160 -0.438 -0.017
170 -0.532 -0.017
180 -0.609 -0.016
190 -0.668 -0.014
200 -0.706 -0.013
210 -0.723 -0.010
220 -0.718 -0.008
230 -0.691 -0.005
240 -0.643 -0.002
250 -0.576 0.001
260 -0.491 0.004
270 -0.391 0.007
280 -0.279 0.010
290 -0.159 0.012
300 -0.034 0.014
310 0.092 0.016
320 0.215 0.017
330 0.332 0.017
340 0.438 0.017
350 0.532 0.017
360 0.609 0.016

calculated as a function of the angle Γ of f with v1 in steps of 10 degrees according to eq. (34.12) and are tabulated in Table 34.2. From this table one can see that the highest correlation between f and variable 2 is found when f has an angle of 30 degrees (and in the negative direction, 210 degrees) with the first principal component. By the same reasoning we find that f is maximally correlated with variable 15 when f has an angle of 340 degrees with PC1.
The next question is about the purity of these variables. Therefore, we need to define a criterion for purity. Because the loading of variable 2 at 30 degrees (f2,30) is equal to 0.7229, which is larger than f15,340 = 0.017 at 340 degrees, one may conclude that variable 2 is probably purer than variable 15. A measure for the purity of a variable is obtained [27] by comparing the loading f_j of variable j on f with the length of the vector obtained by projecting variable j on the v1-v2 plane. The length of variable j projected in the PC-space is equal to √(v1j^2 + v2j^2). If

f_j ≥ √(v1j^2 + v2j^2) cos(β/2)

one concludes that the variable j is pure. The cosine term introduces a small projection window from -β/2 to +β/2. Windig [27] recommends a value of 10 degrees for β.
In order to decide on the purity of the factor f as a whole, the purity of the loadings of all variables should be checked. At 30 degrees, where f = 0.866 v1 + 0.5 v2, the variables 2 and 12 (see Table 34.3) fulfil the purity criterion. The sum of squares of all variable loadings fulfilling the purity criterion is plotted in a so-called variance diagram. At 30 degrees, this is the value (f2,30^2 + f12,30^2). This is the principle of the VARDIA technique in which var(Γ) is plotted as a function of Γ, with

var(Γ) = Σ f_j^2   (34.13)

for all f_j for which f_j ≥ √(v1j^2 + v2j^2) cos(β/2), with f_j = v1j cos Γ + v2j sin Γ (Γ is the rotation angle).
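A compact sketch of eqs. (34.12) and (34.13), assuming a two-component system so that only the first two loading vectors are needed; with the V of eq. (34.7) it reproduces clusters of pure variables around 30 and 300 degrees (the names below are mine, not Windig's implementation).

```python
import numpy as np

def vardia(V, beta_deg=10.0, step_deg=10.0):
    """Variance diagram sketch: V is the (p x 2) matrix of loadings (v1, v2).
    For each rotation angle, sum the squared loadings of the variables that pass
    the purity criterion (projection window of beta_deg degrees)."""
    cos_half = np.cos(np.radians(beta_deg / 2.0))
    length = np.sqrt(V[:, 0] ** 2 + V[:, 1] ** 2)        # projected length of each variable
    angles = np.arange(0.0, 360.0, step_deg)
    var = np.zeros_like(angles)
    for i, g in enumerate(np.radians(angles)):
        f = V[:, 0] * np.cos(g) + V[:, 1] * np.sin(g)    # eq. (34.12)
        pure = f >= length * cos_half                    # purity criterion
        var[i] = np.sum(f[pure] ** 2)                    # eq. (34.13)
    return angles, var
```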
The variance diagram obtained for the example discussed before is quite simple.
Clusters of pure variables are found at 30 degrees (var = 0.5853) and at 300 degrees
(var = 0.4868) (see Fig. 34.36). The distance from the centre of the diagram to each
point is proportional to the variance value. Neighbouring points are connected by
solid lines. All values were scaled in such a way that the highest variance is full
scale. As can be seen from Fig. 34.36, two clusters of pure variables are found.

TABLE 34.3
VARDIA calculation at 30 degrees

v1        v2         f_30       √(v1^2+v2^2)·cos(5π/180)   f_30^2
0.372     0.160      0.402      0.403                      -
0.601     0.391      0.721*     0.523                      0.523
0.148     -0.208     0.024      0.254                      -
0.218     -0.450     -0.036     -0.498                     -
0.314     -0.583     -0.019     -0.660                     -
0.402     -0.070     0.313      0.406                      -
0.088     -0.232     -0.085     -0.333                     -
0.101     -0.307     -0.066     -0.322                     -
0.058     -0.063     0.019      0.085                      -
0.101     -0.005     0.085      0.101                      -
0.100     0.023      0.098      0.102                      -
0.206     0.144      0.250*     0.250                      0.063
0.139     0.110      0.175      0.177                      -
0.239     0.212      0.313      0.318                      -
0.016     -0.007     0.010      0.017                      -
0.010     -0.0108    0.003      0.015                      -
0.0035    0.0006     0.003      0.004                      -
0.002     -0.009     -0.003     0.009                      -
0.0004    -0.001     -0.0002    0.001                      -

VAR(30)                                                    0.586

* Loadings for which f_30 ≥ √(v1^2+v2^2)·cos(5π/180).

The spectra at 30 degrees and 300 degrees (see Figs. 34.37 and 34.38) are in good
agreement with the pure spectra given in Fig. 34.3. This demonstrates that this
procedure yields quite a good estimate of the pure spectra. The rotation of a factor
in the space defined by two eigenvectors is straightforward and, therefore, the
method is well applicable to solve two-component systems. Also three-component
systems can be solved after all rows (spectra) have been normalized.
Windig [27,28] also proposed procedures to handle more complex systems with
more than two components. Therefore, several variance diagrams have to be com-
bined. Instead of observing the variance of the loadings of a factor being rotated in
the space defined by the eigenvectors, one can also plot the loadings, and guide
iteratively the rotation to a factor that obeys the constraints imposed by the system


Fig. 34.36. Variance diagram obtained for the data in Fig. 34.2.


Fig. 34.37. Spectrum (f) at an angle of 30 degrees with v1 (in the v1-v2 plane).

Fig. 34.38. Spectrum (f) at an angle of 300 degrees with v1 (in the v1-v2 plane).

under investigation, such as non-negativity of signals and concentrations. The difficulty is to design a procedure that converges to the right factors in a few steps.

34.4.2 Simplisma

Windig et al. [29] proposed an elegant method, called SIMPLISMA, to find the pure variables, which does not require a principal components analysis. It is based on the evaluation of the relative standard deviation (s_j / x̄_j) of the columns j of X. This yields a so-called standard deviation spectrum. A large relative standard deviation indicates a high purity of that column. For instance, variable 1 in Table 34.4 has the same value (0.2) for the two pure factors and is therefore very impure. By mixing the two pure factors no variation is observed in the value of that variable. On the other hand, variable 2 is pure for factor 1 and varies between 0.1 and 0.4 in the mixtures. In order to avoid that wavelengths with a low mean intensity obtain a high purity value, the relative standard deviation is truncated by introducing a small offset value δ, leading to the following expression for the purity p_j of a variable (column j):

p_j = s_j / (x̄_j + δ)

The purity plotted as a function of the wavelength j gives a purity spectrum. The calculation of a purity spectrum is illustrated by the following example. Suppose

TABLE 34.4
Determination of pure variables from mixture spectra

Mixture ratio       Spectra
                    1       2       3       4       5       6

0.5/0.5             0.2     0.25    0.45    0.35    0.45    0.40
0.2/0.8             0.2     0.10    0.30    0.32    0.48    0.64
0.4/0.6             0.2     0.20    0.40    0.34    0.46    0.48
0.6/0.4             0.2     0.30    0.50    0.36    0.44    0.32
0.8/0.2             0.2     0.40    0.60    0.38    0.42    0.16

Stand. dev.         0       0.118   0.118   0.0224  0.0224  0.1789
Average             0.2     0.25    0.45    0.35    0.45    0.40
Rel. stand. dev.    0       0.4472  0.2622  0.0640  0.0498  0.4473

Pure 1              0.2     0.5     0.7     0.4     0.4     0.0
Pure 2              0.2     0.0     0.2     0.3     0.5     0.8

Estimated
Pure 1              0.4     1.0     1.4     0.8     0.8     0
Pure 2              0.25    0       0.25    0.375   0.625   1.0

that spectra have been obtained of a number of mixtures (see Table 34.4). A plot of
the relative standard deviation spectrum together with the pure spectra (see Fig.
34.39) shows that a high relative standard deviation coincides with a pure variable
and that a small relative standard deviation coincides with an impure variable
(impure in the sense of non selective). Pure variables either belong to the same
spectrum, e.g. because it contains several selective wavelengths, or belong to
spectra from different compounds. This has to be sorted out by considering the
correlation between the pure variable columns of the data matrix. If the pure
variables are not positively correlated (here variable 2 and 6) one can also conclude
that these pure variables belong to different compounds. In this example the
correlation between columns 2 and 6 is -1.0 which indicates that the pure variables
belong to different compounds. In practice, however, one has to follow a more
complex procedure to find the successive pure variables belonging to different
species and to determine the number of compounds [29].

Fig. 34.39. Mixture spectra and the corresponding relative standard deviation spectrum.

Once the pure variables have been identified, the data set can be resolved into
the pure spectra by solving eq. (34.11). For the mixture spectra in Table 34.4, this
gives:

[0.2  0.25  0.45  0.35  0.45  0.40]     [0.25  0.40]
[0.2  0.10  0.30  0.32  0.48  0.64]     [0.10  0.64]
[0.2  0.20  0.40  0.34  0.46  0.48]  =  [0.20  0.48]  S^T
[0.2  0.30  0.50  0.36  0.44  0.32]     [0.30  0.32]
[0.2  0.40  0.60  0.38  0.42  0.16]     [0.40  0.16]

giving the results:

s1^T = [0.4  1.0  1.4  0.8  0.8  0]   and   s2^T = [0.25  0.0  0.25  0.375  0.625  1.0].
The estimated spectra are the pure spectra except for a normalization factor. As
we indicated before, the pure(st) columns of an HPLC-DAD data matrix are an

estimate of the pure column factors. These are the elution profiles which are the least contaminated by the other compounds.
An alternative procedure is to evaluate the purity of the rows i instead of the purity of the columns j:

p_i = s_i / (x̄_i + δ)

The rows with the highest purities are estimates of the row factors, i.e. the purest spectra from the data set, which are refined afterwards by alternating regression.
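The purity spectrum and the resolution step of eq. (34.11) can be reproduced for the data of Table 34.4 with a short script; the offset δ and the hard-coded pure-variable indices below are illustrative choices, not part of the published procedure.

```python
import numpy as np

# Mixture spectra of Table 34.4 (5 mixtures x 6 variables)
X = np.array([[0.2, 0.25, 0.45, 0.35, 0.45, 0.40],
              [0.2, 0.10, 0.30, 0.32, 0.48, 0.64],
              [0.2, 0.20, 0.40, 0.34, 0.46, 0.48],
              [0.2, 0.30, 0.50, 0.36, 0.44, 0.32],
              [0.2, 0.40, 0.60, 0.38, 0.42, 0.16]])

delta = 0.01                                                 # small illustrative offset
purity = X.std(axis=0, ddof=1) / (X.mean(axis=0) + delta)    # purity spectrum p_j
print(purity.round(3))          # variables 2 and 6 stand out (and are anti-correlated)

P = X[:, [1, 5]]                                             # pure columns 2 and 6 (0-based)
S_T = np.linalg.lstsq(P, X, rcond=None)[0]                   # eq. (34.11): X = P S^T
print(S_T.round(3))   # rows ~ [0.4 1 1.4 0.8 0.8 0] and [0.25 0 0.25 0.375 0.625 1]
```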

34.4.3 Orthogonal projection approach (OPA)

The orthogonal projection approach (OPA) [30] is an iterative procedure to find the pure or purest spectra (rows) in a data matrix. In HPLC, a pure spectrum
coincides with a zone in the retention time where only one solute elutes. OPA can
also be applied to find the pure or purest chromatograms (columns) in a data
matrix. A pure chromatogram indicates a selective wavelength, which is a pure
variable. We illustrate the method for the problem of finding the purest spectra.
The procedure for finding pure chromatograms, or selective wavelengths is fully
equivalent by applying the procedure on the transpose of the data matrix.
A basic assumption of OPA is that the purest spectra are mutually more
dissimilar than the corresponding mixture spectra. Therefore, OPA uses a dis-
similarity criterion to find the number of components and the corresponding purest
spectra. Spectra are sequentially selected, taking into account their dissimilarity.
The dissimilarity d_i of spectrum i is defined as the determinant of a dispersion matrix Y_i. In general, matrices Y_i consist of one or more reference spectra and the spectrum measured at the ith elution time:

d_i = det(Y_i^T Y_i)   (34.14)

A dissimilarity plot is then obtained by plotting the dissimilarity values d_i as a function of the retention time i. Initially, each matrix Y_i consists of two columns: the reference spectrum, which is the mean (average) spectrum (normalised to unit length) of matrix X, and the spectrum at the ith retention time. The spectrum with the highest dissimilarity value is the least correlated with the mean spectrum, and it is the first spectrum selected, x_s1. Then, the mean spectrum is replaced by x_s1 as reference in the matrices Y_i (Y_i = [x_s1 x_i]), and a second dissimilarity plot is obtained by applying eq. (34.14). The spectrum most dissimilar to x_s1 is selected (x_s2) and added to the matrices Y_i. Therefore, for the determination of the third dissimilarity plot Y_i contains three columns [x_s1 x_s2 x_i], i.e., two reference spectra and the spectrum at the ith retention time.


Fig. 34.40. Normalized concentration profiles of a minor and main compound for a system with 0.2%
of prednisone; chromatographic resolution is 0.8.

In summary, the selection procedure consists of three steps: (1) compare each
spectrum in X with all spectra already selected by applying eq. (34.14). Initially,
when no spectrum has been selected, the spectra are compared with the average
spectrum of matrix X; (2) plot of the dissimilarity values as a function of the
retention time (dissimilarity plot) and (3) select the spectrum with the highest
dissimilarity value by including it as a reference in matrix Y^. The selection of the
spectra is finished when the dissimilarity plot shows a random pattern. It is
considered that there are as many compounds as there are spectra. Once the purest
spectra are available, the data matrix X can be resolved into its spectra and elution
profiles by Alternating Regression explained in Section 34.3.1.
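A sketch of this selection loop is given below. It is an illustration only: the stopping rule used here (a crude drop in the maximum dissimilarity) replaces the visual inspection of the dissimilarity plot described in the text, and all names are mine.

```python
import numpy as np

def opa_select(X, max_comp=10, drop=1e-3):
    """OPA sketch: rows of X are spectra. Select the purest spectra with the
    dissimilarity criterion of eq. (34.14); stop when the maximum dissimilarity
    collapses (a crude stand-in for inspecting the dissimilarity plot)."""
    unit = lambda v: v / np.linalg.norm(v)
    refs = [unit(X.mean(axis=0))]             # initial reference: normalized mean spectrum
    selected, first_max = [], None
    for _ in range(max_comp):
        d = np.array([np.linalg.det(np.column_stack(refs + [unit(x)]).T
                                    @ np.column_stack(refs + [unit(x)])) for x in X])
        i = int(np.argmax(d))                 # spectrum most dissimilar to the references
        if first_max is None:
            first_max = d[i]
        elif d[i] < drop * first_max:
            break                             # remaining dissimilarities are noise-like
        selected.append(i)
        refs = [unit(X[j]) for j in selected] # selected spectra replace the mean spectrum
    return selected
```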
By way of illustration, let us consider the separation of 0.2% prednisone in
etrocortysone eluting with a chromatographic resolution equal to 0.8 [30] (Fig.
34.40). The dissimilarity of each spectrum with respect to the mean spectrum is
plotted in Fig. 34.41a. Two clearly differentiated peaks with maxima around times
46 and 63 indicate the presence of at least two compounds. In this case, the


Fig. 34.41. Dissimilarity of each spectrum with respect to (a) the mean spectrum, (b) spectrum at time
46 and (c) spectra at times 46 and 63, for the system of Fig. 34.40.

dissimilarity of the spectrum at time 46 is slightly higher than the one of the
spectrum at time 63 and it is the first spectrum selected. Each spectrum is then
compared with the spectrum at time 46 and the dissimilarity is plotted versus time
(Fig. 34.41b). The spectrum at time 63 has the highest dissimilarity value and,
therefore, it is the second spectrum selected. The procedure is continued by
calculating the dissimilarity of all spectra left with respect to the two already
selected spectra, which is plotted in Fig. 34.41c. As one can see, the dissimilarity values are about 1000 times smaller than the smallest one obtained so far.
Moreover, no peak is observed in the plot. This leads to the conclusion that no third
component is present in the data. A comparison of the performance of FWSEFA,
SIMPLISMA and OPA on a real data set of LC-FTIR spectra containing three
complex clusters of co-eluting compounds is given in Ref. [31]. An alternative
method, key-set factor analysis, which looks for a set of purest rows, called key-set,
has been developed by Malinowski [32].

34.5 Quantitative methods for factor analysis

The aim of all the foregoing methods of factor analysis is to decompose a data-set into physically meaningful factors, for instance pure spectra from a
HPLC-DAD data-set. After those factors have been obtained, quantitation should
be possible by calculating the contribution of each factor in the rows of the data
matrix. By ITTFA (see Section 34.2.6) for example, one estimates the elution
profiles of each individual compound. However, for quantitation the peak areas
have to be correlated to the concentration by a calibration step. This is particularly
important when using a diode array detector because the response factors (absorp-
tivity) may considerably vary with the compound considered. Some methods of
factor analysis require the presence of a pure variable for each factor. In that case
quantitation becomes straightforward and does not need a multivariate approach
because full selectivity is available.
In this section we focus on methods for the quantitation of a compound in the
presence of an unknown interference without the requirement that this interference
should be identified first or its spectrum should be estimated. Hyphenated methods
are the main application domain. The methods we discuss are generalized rank
annihilation method (GRAM) and residual bilinearization (RBL).

34.5.1 Generalized rank annihilation factor analysis (GRAFA)

In 1978, Ho et al. [33] published an algorithm for rank annihilation factor analysis. The procedure requires two bilinear data sets, a calibration standard set X_s and a sample set X_u. The calibration set is obtained by measuring a standard mixture which contains known amounts of the analytes of interest. The sample set contains the measurements of the sample in which the analytes have to be quantified. Let us assume that we are only interested in one analyte. By a PCA we obtain the rank R_u of the data matrix X_u, which is theoretically equal to 1 + n_i, where n_i is the number of interfering compounds. Because the calibration set contains only one compound, its rank R_s is equal to one.
In the next step, the rank is calculated of the difference matrix X_d = X_u - kX_s. For any value of k, the rank of X_d is equal to 1 + n_i, except for the case where k is exactly equal to the contribution of the analyte to the signal. In this case the rank of X_d is R_u - 1. Thus the concentration of the analyte in the unknown sample can be found by determining the k-value for which the rank of X_d is equal to R_u - 1. The amount of the analyte in the sample is then equal to k c_s, where c_s is the concentration of the analyte in the standard solution. In order to find this k-value Ho et al. proposed an iterative procedure which plots the eigenvalue of the least significant PC of X_d as a function of k. This eigenvalue becomes minimal when k exactly compensates the signal of the analyte in the sample. For other k-values the signal is under- or

Fig. 34.42. RAFA plot of the least significant eigenvalue as a function of k (see text for an explanation of k).

overcompensated, which results in a higher value of the eigenvalue. An example of such a plot is given in Fig. 34.42.
When several analytes have to be determined, this procedure needs to be repeated for each analyte. Because this algorithm requires that a PCA is calculated for each considered value of k, RAFA is computationally intensive. Sanchez and
Kowalski [34] introduced generalized rank annihilation factor analysis (GRAFA).

More than one analyte can be quantified simultaneously in the presence of interfering compounds. The required measurements are identical to RAFA: a data matrix X_u of the unknown sample and a calibration matrix X_s with the analytes.
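The original RAFA idea can be sketched as a simple scan over k, computing at each step the eigenvalue of the least significant principal component of X_u - k X_s. This is only an illustration of the plot in Fig. 34.42, not the generalized eigenproblem formulation of GRAFA [34], and the names are mine.

```python
import numpy as np

def rafa_scan(Xu, Xs, k_grid):
    """RAFA sketch: for each trial k, the eigenvalue of the least significant PC of
    Xd = Xu - k*Xs; its minimum over k (cf. Fig. 34.42) locates the k at which the
    analyte signal is exactly annihilated."""
    k_grid = np.asarray(k_grid, dtype=float)
    rank_u = np.linalg.matrix_rank(Xu)               # 1 + number of interferents
    ev = []
    for k in k_grid:
        s = np.linalg.svd(Xu - k * Xs, compute_uv=False)
        ev.append(s[rank_u - 1] ** 2)                # least significant 'chemical' eigenvalue
    ev = np.asarray(ev)
    return k_grid[int(np.argmin(ev))], ev            # analyte amount = k_best * c_s
```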

34.5.2 Residual bilinearization (RBL)

In order to apply residual bilinearization [35] at least two data sets are needed: X_u, which is the data set measured for the unknown sample, and X_s, which is the data matrix of a calibration standard containing the analyte of interest. In the absence of interferences these two data matrices are related to each other as follows:

X_u = b X_s + R   (34.15)

b is a coefficient which relates the concentration of the analyte in the unknown sample to the concentration in the calibration standard, where c_u = b c_s. R is a residual matrix which contains the measurement error. Its rows represent null spectra. However, in the presence of other (interfering) compounds, the residual matrix R is not random, but contains structure. Therefore the rank of R is greater than zero. A PCA of R, after retaining the significant PCs, gives:

R = T* V*^T + E   (34.16)

By combining eqs. (34.15) and (34.16) we obtain:

X_u = b X_s + T* V*^T + E   (34.17)

By RBL the regression coefficient b is calculated by minimizing the sum of squares of the elements in E. Because the rank of R in eq. (34.16) is unknown, the estimation of b from eq. (34.17) should be repeated for an increasing number of principal components included in V*^T.
Schematically the procedure proceeds as follows:
(1) Start with an initial estimate b_0 of b;
(2) Calculate R = X_u - b_0 X_s;
(3) Determine the rank of R and decompose R into T* V*^T + E;
(4) Obtain a new estimate b_1 of b by solving X_u = b X_s + T* V*^T, in which T* V*^T is the result of step 3. This yields: b_1 = (X_s^T X_s)^-1 X_s^T (X_u - T* V*^T);
(5) Repeat steps (2) to (4) after substituting b_0 by b_1.
The iteration process is stopped after b has converged to a constant value. If na analytes are quantified simultaneously, data matrices of standard samples are measured for each analyte separately. These matrices X_s1, X_s2, ..., X_s,na are collected in a three-way data matrix X_s of size n x p x na, where n is the number of spectra in X_s1, ..., X_s,na, p is the number of wavelengths and na is the number of analytes. The basic equation for this multicomponent system is given by:
X_u = X_s b + R   (34.18)

where X_s is the three-way matrix of calibration data and b is a vector of regression coefficients related to the unknown concentrations by c_u,j = c_s,j b_j for each analyte j. How to perform matrix operations on a three-way table is discussed in Section 31.17. The procedure is then continued in a similar way as for the one-component case. Eq. (34.18) is solved for b iteratively by substituting R = T* V*^T + E as explained before. Because the concentrations c_s are known, the three-way data matrix X_s measured for the standard samples can be directly resolved into its elution profiles and spectra by Parafac [36], explained in Section 31.8.3. References to other methods for the decomposition of three-way multicomponent profiles are included in the list of additional recommended reading.
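For a single analyte the iteration of steps (1)-(5) can be sketched as follows; b is treated as a scalar here, so the matrix expression in step (4) reduces to a simple projection, the number of interference components is assumed known, and all names are illustrative.

```python
import numpy as np

def rbl(Xu, Xs, n_interf, n_iter=50):
    """Residual bilinearization sketch for one analyte: iterate between the scalar b
    in Xu = b*Xs + T*V*^T + E and a PCA model of the structured residual R."""
    b = np.sum(Xu * Xs) / np.sum(Xs * Xs)                       # initial estimate b0 (step 1)
    for _ in range(n_iter):
        R = Xu - b * Xs                                         # step 2: residual matrix
        U, s, Vt = np.linalg.svd(R, full_matrices=False)
        TV = (U[:, :n_interf] * s[:n_interf]) @ Vt[:n_interf]   # step 3: T* V*^T
        b = np.sum((Xu - TV) * Xs) / np.sum(Xs * Xs)            # step 4: update b
    return b                                                    # c_u = b * c_s
```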

34.5.3 Discussion

In order to apply RBL or GRAFA successfully some attention has to be paid to the quality of the data. Like any other multivariate technique, the results obtained
by RBL and GRAFA are affected by non-linearity of the data and heteroscedast-
icity of the noise. By both phenomena the rank of the data matrix is higher than the
number of species present in the sample. This has been demonstrated on the PCA
results obtained for an anthracene standard solution eluted and detected by three
different brands of diode array detectors [37]. In all three cases significant second
eigenvalues were obtained and structure is seen in the second principal component.
A particular problem with GRAFA and RBL is the reproducibility of the retention data. The retention time axes should be perfectly synchronized. Small shifts of one time interval (so that the i-th spectrum in X_u corresponds with the (i+1)-th spectrum in X_s) already introduce major errors (> 5%) when the chromatographic
resolution is less than 0.6. The results of an extensive study on the influence of
these factors on the accuracy of the results obtained by GRAFA and RBL have
been reported in Ref. [37]. Although some practical applications have been
reported [38,39], the lack of robustness of RBL and GRAFA due to artifacts
mentioned above has limited their widespread application in chromatography.

34.6 Application of factor analysis for peak purity check in HPLC

In pharmaceutical analysis the detection of impurities under a chromatographic


peak is a major issue. An important step forward in the assessment of peak purity
was the introduction of hyphenated techniques. When selecting a method to
perform a purity check, one has the choice between a global method which
considers a whole peak cluster (from the start to the end of the peak), and
evolutionary methods, which consider a window of the peak cluster, which is

usually moved over the cluster. All global methods, except PCA, usually apply a
stepwise approach, e.g. SIMPLISMA, OPA and HELP. HELP is a very versatile
tool for a visual inspection and exploration of the data.
Several complications can be present, such as heteroscedastic noise, sloping
baseline, large scan time and non-linear absorbance [40]. This may lead to the
overestimation of the number of existing compounds. The presence of heteroscedastic noise and non-linearities has an important effect on all PCA-based methods, such as EFA and FSWEFA. Non-zero and sloping baselines have a
critical effect in SIMPLISMA, HELP and FSWEFA. In any case it is better to
correct for the baseline prior to the application of any multivariate technique. Baseline correction can be done by subtracting a linear interpolation of the noise spectra before and after the peak (see the sketch at the end of this section), or by row-centring the data [40]. Most analytical
instruments have a restricted linear range and outside that range Beer's law no
longer holds. Non-linear absorbance indicates the presence of more compounds in
all the approaches discussed in this chapter. In some cases it is possible to detect a
characteristic profile indicating the presence of non-linearities. In any case the best
remedy is to keep the signal within the linear range. A non-linearity may also be
introduced because the DAD needs about 10-50 ms to measure a whole spectrum.
During that time the concentration of the eluting compound(s) may change signific-
antly. The most sensitive methods for the detection of small amounts of impurities
eluting at low chromatographic resolutions, OPA and HELP, are also the ones
most affected by these non-linearities. If the scan time is known, a partial correction
is possible. EFA, FSWEFA and ETA, which belong to the family of evolutionary methods, perform somewhat less well for purity checking. They may also falsely flag impurities due to the heteroscedasticity of the noise and non-linearity of the signal.
For a more detailed discussion we refer to Ref. [40].
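The linear baseline correction mentioned earlier in this section can be sketched as follows. This is only an illustrative NumPy fragment, not the procedure of Ref. [40]; the choice of the first and last few spectra as the pre- and post-peak noise regions is an assumption made for the example.

import numpy as np

def subtract_linear_baseline(X, n_pre=3, n_post=3):
    # X: (n x p) matrix of spectra (rows) recorded across the peak cluster.
    # The first n_pre and last n_post rows are assumed to contain only noise.
    n = X.shape[0]
    pre = X[:n_pre].mean(axis=0)        # average noise spectrum before the peak
    post = X[-n_post:].mean(axis=0)     # average noise spectrum after the peak
    frac = np.linspace(0.0, 1.0, n)[:, None]
    baseline = (1.0 - frac) * pre + frac * post   # linear interpolation in time
    return X - baseline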

34.7 Guidance for the selection of a factor analysis method

The first step in analysing a data table is to determine how many pure factors
have to be estimated. Basically, there are two approaches which we recommend.
One starts with a PCA, the other with either OPA or SIMPLISMA. PCA yields the number of factors and the significant principal components, which are abstract factors. OPA yields the number of factors and the purest rows (or columns)
(factors) in the data table. If we suspect a certain order in the spectra, we
preferentially apply evolutionary techniques such as FSWEFA or HELP to detect
pure zones, or zones with two or more components.
Depending on the way the analysis was started, either the abstract factors found
by a PCA or the purest rows found by OPA, should be transformed into pure
factors. If no constraints can be formulated on the pure factors, the purest rows

(spectra) found by OPA cannot be improved. In contrast, a PCA can either be followed by a Varimax rotation or by the construction of a variance diagram, which yields
factors with the greatest simplicity. If constraints can be formulated on the pure
factors, a PCA can be followed by a curve resolution under the condition that only
two compounds are present. OPA (or SIMPLISMA or FSWEFA) can be followed
by alternating regression to iteratively estimate the pure row-factors (spectra) and
pure column-factors (elution profiles). In a similar way, the varimax and vardia
factors can be improved by alternating regression. The success of ITTFA in
finding pure factors depends on its convergence to a pure factor by a stepwise
application of constraints on the solution, which has been demonstrated on elution
profiles. However, it then requires a PCA in the retention time space.
Although the decomposition of a data table yields the elution profiles of the
individual compounds, a calibration step is still required to transform peak areas
into concentrations. Essentially we can follow two approaches. The first one is to
start with a decomposition of the peak cluster by one of the techniques described
before, followed by the integration of the peak of the analyte. By comparing the
peak area with those obtained for a number of standards we obtain the amount. One
should realize that the decomposition step is necessary because the interfering
compound is unknown. The second approach is to directly calibrate the method by
RAFA, RBL or GRAFA or to decompose the three-way table by Parafac. A serious
problem with these methods is that the data sets measured for the sample and for
the standard solution should be perfectly synchronized.

References

1. H.F. Kaiser, The varimax criterion for analytic rotation in factor analysis. Psychometrika, 23
(1958) 187-200.
2. M. Forina, C. Armanino, S. Lanteri and R. Leardi, Methods of Varimax rotation in factor anal-
ysis with applications in clinical and food chemistry. J. Chemom., 3 (1988) 115-125.
3. J.K. Strasters, H. A.H. Billiet, L. de Galan, B.G.M. Vandeginste and G. Kateman, Evaluation of
peak-recognition techniques in liquid chromatography with photodiode array detection. J.
Chromatog., 385 (1987) 181-200.
4. E.R. Malinowski and D. Howery, Factor Analysis in Chemistry. Wiley, New York, 1980.
5. P.K. Hopke, Target transformation factor analysis. Chemom. Intell. Lab. Syst., 6 (1989) 7-19.
6. J.K. Strasters, H.A.H. Billiet, L. de Galan, B.G.M. Vandeginste and G. Kateman, Reliability of
iterative target transformation factor analysis when using multiwavelength detection for peak
tracking in liquid-chromatographic separation. Anal. Chem., 60 (1988) 2745-2751.
7. W.H. Lawton and E.A. Sylvestre, Self modeling curve resolution. Technometrics, 13 (1971)
617-633.
8. B.G.M. Vandeginste, R. Essers, T. Bosman, J. Reijnen and G. Kateman, Three-component
curve resolution in HPLC with multiwavelength diode array detection. Anal. Chem., 57 (1985)
971-985.
9. O.S. Borgen and B.R. Kowalski, An extension of the multivariate component-resolution
method to three components. Anal. Chim. Acta, 174 (1985) 1-26.

10. A. Meister, Estimation of component spectra by the principal components method. Anal.
Chim. Acta, 161 (1984) 149-161.
11. B.G.M. Vandeginste, F. Leyten, M. Gerritsen, J.W. Noor, G. Kateman and J. Frank, Evaluation
of curve resolution and iterative target transformation factor analysis in quantitative analysis by
liquid chromatography. J. Chemom., 1 (1987) 57-71.
12. P.K. Hopke, D.J. Alpert and B.A. Roscoe, FANTASIA — A program for target transformation
factor analysis to apportion sources in environmental samples. Comput. Chem., 7 (1983)
149-155.
13. P.J. Gemperline, A priori estimates of the elution profiles of the pure components in over-
lapped liquid chromatography peaks using target factor analysis. J. Chem. Inf. Comput. Sci., 24
(1984) 206-212.
14. P.J. Gemperline, Target transformation factor analysis with linear inequality constraints ap-
plied to spectroscopic-chromatographic data. Anal. Chem., 58 (1986) 2656-2663.
15. B.G.M. Vandeginste, W.Derks and G. Kateman, Multicomponent self modelling curve resolu-
tion in high performance liquid chromatography by iterative target transformation analysis.
Anal. Chim. Acta, 173 (1985) 253-264.
16. A. de Juan, B. van den Bogaert, F. Cuesta Sanchez and D.L. Massart, Application of the needle
algorithm for exploratory analysis and resolution of HPLC-DAD data. Chemom. Intell. Lab.
Syst., 33 (1996) 133-145.
17. M. Maeder, Evolving factor analysis for the resolution of overlapping chromatographic peaks.
Anal. Chem., 59 (1987) 527-530.
18. H. Gampp, M. Maeder, C.J. Meyer and A.D. Zuberbuhler, Calculation of equilibrium con-
stants from multiwavelength spectroscopic data. Ill Model-free analysis of spectrophotometric
and ESR titrations. Talanta, 32 (1985) 1133-1139.
19. M. Maeder and A.D. Zuberbuhler, The resolution of overlapping chromatographic peaks by
evolving factor analysis. Anal. Chim. Acta, 181 (1986) 287-291.
20. R.Tauler and E. Casassas, Application of principal component analysis to the study of multiple
equilibria systems — Study of Copper(II) salicylate monoethanolamine, diethanolamine and
triethanolamine systems. Anal. Chim. Acta, 223 (1989) 257-268.
21. E.J. Karjalainen, Spectrum reconstruction in GC/MS. The robustness of the solution found
with alternating regression, in: E.J. Karjalainen (Ed.), Scientific Computing and Automation
(Europe). Elsevier, Amsterdam, 1990, pp. 477-488.
22. H.R. Keller and D.L. Massart, Peak purity control in liquid-chromatography with photodiode
array detection by fixed size moving window evolving factor analysis. Anal. Chim. Acta, 246
(1991) 379-390.
23. J. Toft and O.M. Kvalheim, Eigenstructure tracking analysis for revealing noise patterns and
local rank in instrumental profiles: application to transmittance and absorbance IR spectros-
copy. Chemom. Intell. Lab. Syst., 19 (1993) 65-73.
24. O.M. Kvalheim and Y.-Z. Liang, Heuristic evolving latent projections — resolving 2-way
multicomponent data. 1. Selectivity, latent projective graph, datascope, local rank and unique
resolution. Anal. Chem., 64 (1992) 936-946.
25. M.J.P. Gerritsen, H. Tanis, B.G.M. Vandeginste and G. Kateman, Generalized rank annihila-
tion factor analysis, iterative target transformation factor analysis and residual bilinearization
for the quantitative analysis of data from liquid-chromatography with photodiode array detec-
tion. Anal. Chem., 64 (1992) 2042-2056.
26. H.R. Keller and D.L. Massart, Artifacts in evolving factor analysis-based methods for peak pu-
rity control in liquid-chromatography with diode array detection. Anal. Chim. Acta, 263 (1992)
21-28.

27. W. Windig and H.L.C. Meuzelaar, Nonsupervised numerical component extraction from py-
rolysis mass spectra of complex mixtures. Anal. Chem., 56 (1984) 2297-2303.
28. W. Windig and J. Guilment, Interactive self-modeling mixture analysis. Anal. Chem., 63
(1991) 1425-1432.
29. W. Windig, C.E. Heckler, FA. Agblevor and R.J. Evans, Self-modeling mixture analysis of
categorized pyrolysis mass-spectral data with the Simplisma approach. Chemom. Intell. Lab.
Syst., 14 (1992) 195-207.
30. F.C. Sanchez, J. Toft, B. van den Bogaert and D.L. Massart, Orthogonal projection approach
applied to peak purity assessment. Anal. Chem., 68 (1996) 79-85.
31. F. C. Sanchez, T. Hancewicz, B.G.M. Vandeginste and D.L. Massart, Resolution of complex
liquid chromatography Fourier transform infrared spectroscopy data. Anal. Chem., 69 (1997)
1477-1484.
32. E.R. Malinowski, Obtaining the key set of typical vectors by factor analysis and subsequent
isolation of component spectra. Anal. Chim. Acta, 134 (1982) 129-137.
33. C.N. Ho, G.D. Christian and E.R. Davidson, Application of the method of rank annihilation to
quantitative analysis of multicomponent fluorescence data from the video fluorometer. Anal.
Chem., 52 (1980) 1108-1113.
34. E. Sanchez and B.R. Kowalski, Generalized rank annihilation factor analysis. Anal. Chem., 58
(1986) 496-499.
35. J. Ohman, P. Geladi and S. Wold, Residual bilinearization. Part I Theory and algorithms. J.
Chemom., 4 (1990) 79-90.
36. A.K. Smilde, Three-way analysis. Problems and prospects. Chemom. Intell. Lab. Syst., 15
(1992) 143-157.
37. M.J.P. Gerritsen, N.M. Faber, M. van Rijn, B.G.M. Vandeginste and G. Kateman, Realistic
simulations of high-performance liquid-chromatographic ultraviolet data for the evaluation of
multivariate techniques. Chemom. Intell. Lab. Syst., 2 (1992) 257-268.
38. E. Sanchez, L.S. Ramos and B.R. Kowalski, Generalized rank annihilation method. I. Applica-
tion to liquid chromatography-diode array ultraviolet detection data. J. Chromatog., 385
(1987) 151-164.
39. L.S. Ramos, E. Sanchez and B.R. Kowalski, Generalized rank annihilation method. II Analysis
of bimodal chromatographic data. J. Chromatog., 385 (1987) 165-180.
40. F. C. Sanchez, B. van den Bogaert, S.C. Rutan and D.L. Massart, Multivariate peak purity ap-
proaches. Chemom. Intell. Lab. Syst., 34 (1996) 139-171.

Additional recommended reading

Books
E.R. Malinowski and D.G. Howery, Factor Analysis in Chemistry, 2nd Edn. Wiley, New York, 1992.
R. Coppi and S. Bolasco (Eds.), Multiway Data Analysis. North-Holland, Amsterdam, 1989.

Articles
Target transformation factor analysis:
P.K. Hopke, Tutorial: Target transformation factor analysis. Chemom. Intell. Lab. Syst., 6 (1989)
7-19.

Rank annihilation factor analysis:


C.N. Ho, G.D. Christian and E.R. Davidson, Application of the method of rank annihilation to fluo-
rescent multicomponent mixtures of polynuclear aromatic hydrocarbons. Anal. Chem., 52
(1980) 1071-1079.
J. Ohman, P. Geladi and S. Wold, Residual bilinearization, part 2: Application to HPLC-diode array
data and comparison with rank annihilation factor analysis. J. Chemom., 4 (1990) 135-146.
Evolutionary methods:
H.R. Keller and D.L. Massart, Evolving factor analysis. Chemom. Intell. Lab. Syst., 12 (1992)
209-224.
F. Cuesta Sanchez, M.S. Khots, D.L. Massart and J.O. De Beer, Algorithm for the assessment of peak
purity in liquid chromatography with photodiode-array detection. Anal. Chim. Acta, 285 (1994)
181-192.
J. Toft, Tutorial: Evolutionary rank analysis applied to multidetectional chromatographic structures.
Chemom. Intell. Lab. Syst., 29 (1995) 189-212.
Three-way methods:
B. Grung and O.M. Kvalheim, Detection and quantitation of embedded minor analytes in three-way
multicomponent profiles by evolving projections and internal rank annihilation. Chemom.
Intell. Lab. Syst., 29 (1995) 213-221.
B. Grung and O.M. Kvalheim, Rank mapping of three-way multicomponent profiles. Chemom.
Intell. Lab. Syst., 29 (1995) 223-232.
R. Tauler, A.K. Smilde and B.R. Kowalski, Selectivity, local rank, three-way data analysis and ambi-
guity in multivariate curve resolution. J. Chemom., 9 (1995) 31-58.
Simplisma:
W. Windig and J. Guilment, Interactive self-modeling mixture analysis. Anal. Chem., 63 (1991)
1425-1432.
W. Windig and D.A. Stephenson, Self-modeling mixture analysis of second-derivative near-infrared
spectral data using the Simplisma approach. Anal. Chem., 64 (1992) 2735-2742.
Alternating least squares method:
R. Tauler, A.K. Smilde, J.M. Henshaw, L.W. Burgess and B.R. Kowalski, Multicomponent determi-
nation of chlorinated hydrocarbons using a reaction-based chemical sensor. 2 Chemical
speciation using multivariate curve resolution. Anal. Chem., 66 (1994) 3337-3344.
R. Tauler, A. Izquierdo-Ridorsa, R. Gargallo and E. Casassas, Application of a new multivariate
curve-resolution procedure to the simultaneous analysis of several spectroscopic titrations of
the copper(II)-polyinosinic acid system. Chemom. Intell. Lab. Syst., 27 (1995) 163-174.
S. Lacorte, D. Barcelo and R. Tauler, Determination of traces of herbicide mixtures in water by
on-line solid-phase extraction followed by liquid chromatography with diode-array detection
and multivariate self-modelling curve resolution. J. Chromatog. A 697 (1995) 345-355.

Chapter 35

Relations between measurement tables

35.1 Introduction

Studying the relationship between two or more sets of variables is one of the
main activities in data analysis. This chapter mainly deals with modelling the
linear relationship between two sets of multivariate data. One set holds the
dependent variables (or responses), the other set holds the independent variables
(or predictors). However, we will also consider cases where such a distinction
cannot be made and the two data sets have the same status. Each set is in the usual
objects × measurements format. There is a choice of techniques for estimating the
model, all closely related to multiple linear regression (see Chapter 10). Roughly,
the model found can be used in two ways. One usage is for a better understanding
of the system under investigation by an interpretation of the model results. The
other usage is for the future prediction of the dependent variable from new
measurements on the predictor variables. Examples of problems amenable to such
multivariate modelling are legion, e.g. relating chemical composition to
spectroscopic or chromatographic measurements in analytical chemistry, studying
the effect of structural properties of chemical compounds, e.g. drug molecules, on
functional behaviour in pharmacology or molecular biology, linking flavour
composition and sensory properties in food research or modelling the relation
between process conditions and product properties in manufacturing.
The present chapter provides an overview of the wide range of techniques that
are available to tackle the problem of relating two sets of multivariate data.
Different techniques meet specific objectives: simply identifying strong
correlations, matching two multi-dimensional point configurations, analyzing the
effects of experimental factors on a set of responses, multivariate calibration,
predictive modelling, etc. It is important to distinguish the properties of these
techniques in order to make a balanced choice.
As an example consider the data presented in Tables 35.1-35.4. These tables are
extracted from a much larger data base obtained in an international cooperative
study on the sensory aspects of olive oils [1]. Table 35.1 gives the mean scores for
16 samples of olive oil with respect to six appearance attributes given by a Dutch
sensory panel. Table 35.2 gives similar scores for the same samples as judged by a

TABLE 35.1
Olive oils: mean scores for appearance attributes from Dutch sensory panel

Sample   ID    Yellow   Green   Brown   Glossy   Transp   Syrup

 1       G11    21.4     73.4    10.1    79.7     75.2     50.3
 2       G12    23.4     66.3     9.8    77.8     68.7     51.7
 3       G12    32.7     53.5     8.7    82.3     83.2     45.4
 4       G13    30.2     58.3    12.2    81.1     77.1     47.8
 5       G22    51.8     32.5     8.0    72.4     65.3     46.5
 6       I31    40.7     42.9    20.1    67.7     63.5     52.2
 7       I32    53.8     30.4    11.5    77.8     77.3     45.2
 8       I32    26.4     66.5    14.2    78.7     74.6     51.8
 9       I33    65.7     12.1    10.3    81.6     79.6     48.3
10       I42    45.0     31.9    28.4    75.7     72.9     52.8
11       S51    70.9     12.2    10.8    87.7     88.1     44.5
12       S52    73.5      9.7     8.3    89.9     89.7     42.3
13       S53    68.1     12.0    10.8    78.4     75.1     46.4
14       S61    67.6     13.9    11.9    84.6     83.8     48.5
15       S62    71.4     10.6    10.8    88.1     88.5     46.7
16       S63    71.4     10.0    11.4    89.5     88.5     47.2

Mean            50.9     33.5    12.3    80.8     78.2     48.0
Std. dev.       19.5     23.5     5.1     6.2      8.3      3.1

British panel. Note that the sensory attributes are to some extent different. Table
35.3 gives some information on the country of origin and the state of ripeness of
the olives. Finally, Table 35.4 gives some physico-chemical data on the same
samples that are related to the quality indices of olive oils: acid and peroxide level,
UV absorbance at 232 nm and 270 nm, and the difference in absorbance at
wavelength 270 nm and the average absorbance at 266 nm and 274 nm.
Given these tables of multivariate data one might be interested in various
relationships. For example, do the two panels have a similar perception of the
different olive oils (Tables 35.1 and 35.2)? Are the oils more or less similarly
scattered in the two multidimensional spaces formed by the Dutch and by the
British attributes? How are the two sets of sensory attributes related? Does the

TABLE 35.2
Olive oils: mean scores of appearance attributes from British sensory panel

Sample   ID    Bright   Depth   Yellow   Brown   Green

 1       G11    33.2     76.8    24.4     50.9    56.8
 2       G12    40.9     76.7    28.3     39.4    61.4
 3       G12    44.1     70.0    33.6     35.9    52.4
 4       G13    51.4     65.0    37.1     28.3    52.1
 5       G22    63.6     47.2    58.1     17.9    36.9
 6       I31    42.4     67.3    41.6     41.1    34.7
 7       I32    60.6     51.1    58.0     20.3    33.5
 8       I32    71.7     42.7    69.9     17.7    21.6
 9       I33    41.7     74.7    28.1     42.8    51.9
10       I42    48.3     68.7    44.7     57.4    16.4
11       S51    78.6     34.3    82.5      9.4    18.7
12       S52    84.8     25.0    85.9      3.1    16.2
13       S53    85.3     26.3    86.7      2.3    17.9
14       S61    81.4     34.5    80.2      8.3    18.2
15       S62    88.4     27.7    87.4      4.7    14.7
16       S63    88.4     29.7    86.8      3.4    16.3

Mean            62.8     51.1    58.3     23.9    32.5
Std. dev.       19.8     19.9    24.4     18.5    17.2

country of origin or the state of ripeness affect the sensory characteristics (Tables
35.1 and 35.3)? Can we possibly predict the sensory properties from the physico-
chemical measurements (Tables 35.1 and 35.4)?
An important aspect of all methods to be discussed concerns the choice of the
model complexity, i.e., choosing the right number of factors. This is especially
relevant if the relations are developed for predictive purposes. Building validated
predictive models for quantitative relations based on multiple predictors is known
as multivariate calibration. The latter subject is of such importance in chemo-
metrics that it will be treated separately in the next chapter (Chapter 36). The
techniques considered in this chapter comprise Procrustes analysis (Section 35.2),
canonical correlation analysis (Section 35.3), multivariate linear regression

TABLE 35.3

Country of origin and state of ripeness of 16 olive oils. The last 4 columns contain the same information in the form of a coded design matrix

Sample   ID    Country   Ripeness    Greece   Spain   Unripe   Overripe

 1       G11   Greece    unripe         1       0        1        0
 2       G12   Greece    normal         1       0        0        0
 3       G12   Greece    normal         1       0        0        0
 4       G13   Greece    overripe       1       0        0        1
 5       G22   Greece    normal         1       0        0        0
 6       I31   Italy     unripe         0       0        1        0
 7       I32   Italy     normal         0       0        0        0
 8       I32   Italy     normal         0       0        0        0
 9       I33   Italy     overripe       0       0        0        1
10       I42   Italy     normal         0       0        0        0
11       S51   Spain     unripe         0       1        1        0
12       S52   Spain     normal         0       1        0        0
13       S53   Spain     overripe       0       1        0        1
14       S61   Spain     unripe         0       1        1        0
15       S62   Spain     normal         0       1        0        0
16       S63   Spain     overripe       0       1        0        1

(Section 35.4), reduced rank regression (Section 35.5), principal components


regression (Section 35.6), partial least squares regression (Section 35.7) and
continuum regression methods (Section 35.8).

35.2 Procrustes analysis

35.2.1 Introduction

Procrustes analysis is a method for relating two sets of multivariate


observations, say X and Y. For example, one may wish to compare the results in
Table 35.1 and Table 35.2 in order to find out to what extent the results from both
panels agree, e.g., regarding the similarity of certain olive oils and the dissimilarity
of others. Procrustes analysis has a strong geometric interpretation. The

TABLE 35.4
Physico-chemical quality parameters of the 16 olive oils

Sample   ID    Acidity   Peroxide   K232    K270     DK

 1       G11    0.73      12.70     1.900   0.139    0.003
 2       G12    0.19      12.30     1.678   0.116   -0.004
 3       G12    0.26      10.30     1.629   0.116   -0.005
 4       G13    0.67      13.70     1.701   0.168   -0.002
 5       G22    0.52      11.20     1.539   0.119   -0.001
 6       I31    0.26      18.70     2.117   0.142    0.001
 7       I32    0.24      15.30     1.891   0.116    0.000
 8       I32    0.30      18.50     1.908   0.125    0.001
 9       I33    0.35      15.60     1.824   0.104    0.000
10       I42    0.19      19.40     2.222   0.158   -0.003
11       S51    0.15      10.50     1.522   0.116   -0.004
12       S52    0.16       8.14     1.527   0.103   -0.002
13       S53    0.27      12.50     1.555   0.096   -0.002
14       S61    0.16      11.00     1.573   0.094   -0.003
15       S62    0.24      10.80     1.331   0.085   -0.003
16       S63    0.30      11.40     1.415   0.093   -0.004

Mean            0.31      13.25     1.709   0.118   -0.002
Std. dev.       0.18       3.35     0.249   0.024    0.002

observations (objects) are envisioned as points in a high-dimensional variable


space. The objective is to find the transformation such that the configuration of
points in X-space best matches the corresponding point configuration in Y-space.
Not all transformations are allowed: the internal configuration of the objects
should be preserved. Procrustes analysis treats the two data sets symmetrically:
there is no essential difference between either transforming Y to match X or
applying the reverse transformation to X so that it best matches Y. One may also
apply a transformation to each so that they meet halfway. In the sequel we consider
the transformation of X to the target Y. We will assume that X and Y have the same
number of variables. If this condition is not met one is at liberty to add the required
number of columns, with zeros as entries, to the smaller data set (so-called "zero
padding").
Fig. 35.1. The stages of Procrustes analysis.

We will explain the mechanics of Procrustes analysis by optimal matching of


the two stellar configurations Great Bear and Little Bear. For ease of presentation
we work with the 2D-configuration as we see it from the earth (Fig. 35.1a) and we
ignore that the actual configuration is 3-dimensional. First X and Y are column
mean-centred, so that their centroids m_x = X^T 1/n and m_y = Y^T 1/n are moved to the
origin (Fig. 35.1b, translation step). This column centering is an admissible
transformation since it does not alter the distances between objects within each

data set. The next step is a reflection (Fig. 35.1c). Again this is a transformation
which leaves distances between objects unaltered. The following step is a rotation,
which changes the orientation, but not the internal structure of the configuration
(Fig. 35.1d). When this best match is found one is at liberty to rotate all config-
urations equally. This will not affect the match but it may yield an overall
orientation that is more appealing (Fig. 35.1e). Finally, by taking the mean position
for each star, one obtains an average configuration, often called the consensus, that
is representative for the two separate configurations (Fig. 35.1f).
The major problem is to find the rotation/reflection which gives the best match
between the two centered configurations. Mathematically, rotations and reflec-
tions are both described by orthogonal transformations (see Section 29.8). These
are linear transformations with an orthonormal matrix (see Section 29.4), i.e. a
square matrix R satisfying R^T R = RR^T = I, or R^T = R^-1. When its determinant is positive R represents a pure rotation; when the determinant is negative R also
involves a reflection.
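As a small illustration (not taken from the text), a 2D rotation over an arbitrary angle and a reflection about the x-axis are both orthonormal but differ in the sign of their determinant:

import numpy as np

theta = np.deg2rad(30.0)                             # arbitrary rotation angle
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
reflection = np.array([[1.0, 0.0],
                       [0.0, -1.0]])                 # mirror about the x-axis

for name, R in (("rotation", rotation), ("reflection", reflection)):
    assert np.allclose(R.T @ R, np.eye(2))           # orthonormal: R^T R = I
    print(name, np.linalg.det(R))                    # +1 for rotation, -1 for reflection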
The best match is defined as the one which minimizes the sum of squared
distances between the transformed X-objects and the corresponding objects in the
target configuration given by Y. The Procrustes problem then is equivalent to
minimizing the sum of squares of the deviations matrix E = Y - XR, assuming both
X and Y have been column-mean centered. This looks like a straightforward
least-squares regression problem, Y = XR + E, but it is not since R is restricted to
be an orthogonal rotation/reflection matrix.
Using a shorthand notation for a matrix sum of squares, ||E||^2 = Σ_i Σ_j e_ij^2 = tr(E^T E),
we may state the Procrustes optimization problem as:

min_R || Y - XR ||^2   subject to   R^T R = RR^T = I (35.1)

Using elementary properties of the trace of a matrix (viz. tr(A + B) = tr(A) + tr(B)
and tr(AB) = tr(BA), see Section 29.4) we may write:
|| Y - XR ||^2 = tr((Y - XR)^T (Y - XR)) = tr(Y^T Y - R^T X^T Y - Y^T XR + R^T X^T XR)
= tr(Y^T Y) - 2 tr(Y^T XR) + tr(R^T X^T XR) (35.2)
The first term on the right-hand side represents the total sum of squares of Y, that
obviously does not depend on R. Likewise, the last term represents the total sum of
squares of the transformed X-configuration, viz. XR. Since the rotation/reflection
given by R does not affect the distance of an object from the origin, the total sum of
squares is invariant under the orthogonal transformation R. (This also follows
from tr(R^T X^T XR) = tr(X^T XRR^T) = tr(X^T X I) = tr(X^T X).) The only term then in eq. (35.2) that depends on R is tr(Y^T XR), which we must seek to maximize.

Let the SVD of Y^T X be given by Y^T X = QDW^T, with D the diagonal matrix of singular values. The properties of the singular value decomposition (SVD, Section 29.6) tell us that, among all possible orthonormal matrices, Q and W are the ones that maximize tr(Q^T Y^T XW). Since tr(Q^T Y^T XW) = tr(Y^T XWQ^T), it follows that R = WQ^T is the rotation/reflection which maximizes tr(Y^T XR), and hence minimizes the squared distance ||Y - XR||^2 (eq. 35.2) between X and Y. Given this optimal Procrustes rotation applied to X, one may compute an average configuration Z as (Y + XR)/2. Usually, this is followed by a principal component analysis (Section 31.1) of the average Z. The rotation matrix V, obtained as the matrix of eigenvectors of Z^T Z, is then applied to each of Y, XR and Z.
It must be emphasized that Procrustes analysis is not a regression technique. It
only involves the allowed operations of translation, rotation and reflection which
preserve distances between objects. Regression allows any linear transformation;
there is no normality or orthogonality restriction to the columns of the matrix B
transforming X. Because such restrictions are released in a regression setting, Y = XB will fit Y more closely than the Procrustes match Y = XR (see Section 35.3).

35.2.2 Algorithm

Summarizing, the Procrustes matching problem for two configurations X and Y


can be solved with the following algorithm:
Column-centring:  X <- X - 1_n m_x^T and Y <- Y - 1_n m_y^T
SVD:              WDQ^T <- SVD(X^T Y)
Rotate X:         X* <- XWQ^T
Average:          Z <- (Y + X*)/2
PCA of Z:         V <- EIG(Z^T Z)
Final rotation:   Z* <- ZV, Y* <- YV, X* <- X*V
For more details, also for the case of determining individual isotropic scaling
factors, see Refs. [2,3].
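A direct transcription of these steps into NumPy might look as follows. It is a sketch only: the function name is ours, the zero padding of the narrower table is included for convenience, and the isotropic scaling of Refs. [2,3] is not implemented.

import numpy as np

def procrustes(X, Y):
    # Match the configuration X to the target Y; both are (n x p) tables.
    # Returns the rotated X, Y and the consensus Z, all expressed in the
    # principal axes of the consensus.
    p = max(X.shape[1], Y.shape[1])                  # zero padding if needed
    X = np.hstack([X, np.zeros((X.shape[0], p - X.shape[1]))])
    Y = np.hstack([Y, np.zeros((Y.shape[0], p - Y.shape[1]))])
    X = X - X.mean(axis=0)                           # translation step
    Y = Y - Y.mean(axis=0)
    W, d, Qt = np.linalg.svd(X.T @ Y)                # SVD(X^T Y) = W D Q^T
    X_rot = X @ W @ Qt                               # rotation/reflection R = W Q^T
    Z = (Y + X_rot) / 2.0                            # consensus configuration
    evals, V = np.linalg.eigh(Z.T @ Z)               # PCA of Z
    V = V[:, np.argsort(evals)[::-1]]                # order by decreasing eigenvalue
    return X_rot @ V, Y @ V, Z @ V                   # final overall rotation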

35.2.3 Discussion

In the field of structural chemistry Procrustes analysis has been used to compare
the geometries of similar structures from different studies: e.g. the conformations
of a molecule crystallized in two crystalline forms, or the 3D-configuration of the
main chain of two proteins, or the geometries of differently substituted com-
pounds. Figure 35.2a shows the 3D-configuration of two comparable amino-acid
sequences taken from related protein fragments obtained from different structural
studies. After Procrustes matching the two structures are overlaid (Fig. 35.2b)
revealing both the close overall similarity and the local deviations.

Fig. 35.2. Comparable fragments of different protein fragments, before (a) and after (b) matching.

Let us now compare the sensory data of Table 35.1 (Dutch panel, X) and Table
35.2 (British panel, Y) using Procrustes analysis. Since the British panel has one
variable less than the Dutch panel we add a column of zeros to ensure that the two
data sets have the same size (16x6). This so-called 'zero padding' is allowed since
it does not affect the distances between the objects. Another form of data pre-
processing that we apply is the overall scaling of the British scores which have a
larger total variance. After shrinking these data by a factor 0.73, both data sets have
the same total variance (= 1038). Figure 35.3 shows the object scores of the
combined data, obtained after matching, projected on the principal axes of the

Fig. 35.3. Scatter plot of 16 olive oils scored by two sensory panels (Dutch panel: lower case; British
panel: upper case). The combined data are shown after Procrustes matching and projection onto the
principal plane of the average configuration.

consensus configuration. This scatter plot clearly shows that the relative positions
of the 16 olive oil samples by the two panels compare very well. It is also clear that
samples from different countries, especially Greece and Spain, are well separated.
As to the interpretation of the various attributes one may plot the correlations of the
individual attributes with the two principal dimensions (Fig. 35.4). This plot
shows, for example, that indeed the attributes 'yellow' and 'green' (and to a lesser
extent 'brown') are similarly perceived by the two panels. It also shows that the
second dimension is better represented by the attributes from the Dutch panel.
Practically all attributes lie close to the circle of radius 1. This means that a large
fraction of their variance is explained by dimensions 1 and 2, indicating that there
is little correlation with the higher dimensions. Hence, the common two-
dimensional representation provides an adequate summary of the data of both
panels.
Procrustes analysis has been generalized in two ways. One extension is that
more than two data sets may be considered. In that case the algorithm is iterative.
One then must rotate, in turn, each data set to the average of the other data sets. The
cycle must be repeated until the fit no longer improves. Procrustes analysis of
many data sets has been applied mostly in the field of sensory data analysis [4].
Another extension is the application of individual scaling to the various data sets in
order to improve the match. Mathematically, it amounts to multiplying all entries
in a data set by the same scalar. Geometrically, it amounts to an expansion (or

Fig. 35.4. Correlations of sensory attributes with principal axes of average configuration after Pro-
crustes matching (Dutch panel: lower case; British panel: upper case).

shrinkage) that is isotropic, i.e. equal in all directions. While it does affect the
absolute distances between objects, it does not affect their relative distances nor
the angles involving three objects. In Chapter 38 we give an example of such a
Generalized Procrustes Analysis (GPA) where it is used to find a 'best average'
configuration of different food products coming from individual assessments by
different panellists. Procrustes analysis has been little exploited in typical chemo-
metrical areas, since it is less common that one only wants to investigate the
similarity of two comparable data tables. One is more often interested in a
predictive regression-type linear relation between two data tables. The data tables
involved then may be of quite different origin, e.g. sensory and chemical.

35.3 Canonical correlation analysis

35.3.1 Introduction

Canonical Correlation Analysis (CCA) is perhaps the oldest truly multivariate


method for studying the relation between two measurement tables X and Y [5]. It
generalizes the concept of squared multiple correlation or coefficient of deter-
mination, R^2. In Chapter 10 on multiple linear regression we found that R^2 is a measure for the linear association between a univariate y and a multivariate X. This R^2 tells how much of the variance of y is explained by X: R^2 = ŷ^T ŷ / y^T y = ||ŷ||^2 / ||y||^2.
Now, we extend this notion to a set of response variables collected in the multi-
variate data set Y.

TABLE 35.5
Two small bivariate data sets.

A. Raw data

          X               Y
   x1      x2      y1      y2

    8      11      10       8
    8       8       9       7
    9      12       6      11
   10      10       1      14
    7       9      13       3
    0       2      10       7
    9      11       5      12
    5       2      11       2

B. Correlation matrix

         x1      x2      y1      y2

x1      1       0.86   -0.55    0.53
x2      0.86    1      -0.46    0.62
y1     -0.55   -0.46    1      -0.93
y2      0.53    0.62   -0.93    1

For example, let us take a look at the data of Table 35.5a. This table shows two
very simple data sets, X and Y, each containing only two variables. Is there a
relationship between the two data sets? Looking at the matrix of correlation
coefficients (Table 35.5b) we find that the so-called intra-set (or within-set)
correlations are strong:
r(x1,x2) = +0.86 and r(y1,y2) = -0.93.
The inter-set (or between-set) correlations, however, are relatively low:
r(x1,y1) = -0.55, r(x1,y2) = +0.53, r(x2,y1) = -0.46, r(x2,y2) = +0.62.
The squared inter-set correlation coefficients vary from 0.21 to 0.38. Thus, only
some 20% to 40% of the variance of the individual variables can be explained by
one of the variables from the other data set. At a first glance these low inter-set
correlations do not indicate a strong relation between the two data tables. In

principle, however, these figures can be improved if we consider multiple re-


gression of each variable on all variables from the other data set. We then find the
following squared multiple correlation coefficients (R^2-values):

R^2(x1|y1,y2) = 0.31, R^2(x2|y1,y2) = 0.48, R^2(y1|x1,x2) = 0.30, R^2(y2|x1,x2) = 0.39

Here, the notation R^2(y1|x1,x2) stands for the squared multiple correlation coefficient (or coefficient of determination) of the multiple regression of y1 on x1 and x2. The improvement is quite modest, suggesting once more that there is only a
weak (linear) relation between the two sets of data.
We can go one step further, however. Each of the above multiple regression
relations is between a single variable (response) of one data set and a linear
combination of the variables (predictors) from the other set. Instead, one may
consider the 'multiple-multiple' correlation, i.e. the correlation of a linear com-
bination from one set with a linear combination of the other set. Such linear
combinations of the original variables are variously called factors, components,
latent variables, canonical variables or canonical variates (also see Chapters 9,17,
29, and 31).
As an example, let us construct a factor u = Yq from the Y-variables, defined through its weight vector q. For example, let q = [0.8, 0.6]^T, i.e. u = 0.8 y1 + 0.6 y2. A vector of factor scores u can then be calculated by applying these weights, 0.8 and 0.6, to the columns y1 and y2, respectively, in Table 35.5 and adding the results. Multiple regression of this newly formed variable u on x1 and x2 yields R^2 = 0.66. This is already a considerable improvement. The natural question then is: which linear combination of Y-variables yields the highest R^2 when regressed on the X-variables in a multiple regression? Canonical correlation analysis answers this question.
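This figure can be checked directly from the raw data of Table 35.5. The short NumPy fragment below, which regresses u on x1 and x2 with an intercept, should reproduce R^2 of about 0.66; it is an illustrative check, not part of the original text.

import numpy as np

X = np.array([[8, 11], [8, 8], [9, 12], [10, 10],
              [7, 9], [0, 2], [9, 11], [5, 2]], dtype=float)
Y = np.array([[10, 8], [9, 7], [6, 11], [1, 14],
              [13, 3], [10, 7], [5, 12], [11, 2]], dtype=float)

u = Y @ np.array([0.8, 0.6])                     # trial factor u = 0.8*y1 + 0.6*y2
Xi = np.column_stack([np.ones(len(X)), X])       # intercept plus x1, x2
coef, *_ = np.linalg.lstsq(Xi, u, rcond=None)
res = u - Xi @ coef
r2 = 1.0 - (res @ res) / ((u - u.mean()) @ (u - u.mean()))
print(round(r2, 2))                              # about 0.66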
The particular linear combinations of the X- and Y-variables achieving the maximum correlation are the so-called first canonical variables, say t_1 = Xw_1 and u_1 = Yq_1. The vectors of coefficients w_1 and q_1 in these linear combinations are the canonical weights for the X-variables and Y-variables, respectively. For the data of Table 35.5 they are found to be w_1 = [0.583, -0.561]^T and q_1 = [0.737, 0.731]^T. The correlation between these first canonical variables is called the first canonical correlation, ρ_1. This maximum correlation turns out to be quite high: ρ_1 = 0.95 (R^2 = 0.90), indicating a strong relation between the first canonical dimensions of X and Y.
The next pair of canonical variates, t_2 and u_2, also has maximum correlation ρ_2, subject, however, to the condition that this second pair should be uncorrelated with the first pair, i.e. t_1^T t_2 = u_1^T u_2 = 0. For the example at hand, this second canonical correlation is much lower: ρ_2 = 0.55 (R^2 = 0.31). For larger data sets, the analysis
goes on with extracting additional pairs of canonical variables, orthogonal to the
previous ones, until the data table with the smaller number of variables has been

transformed completely into equally many orthogonal canonical variables. It turns


out that canonical variables belonging to different dimensions are also uncorrelated across the two sets, i.e. t_i^T u_j = 0 for i ≠ j.
Thus, we see that CCA forms a canonical analysis, namely a decomposition of
each data set into a set of mutually orthogonal components. A similar type of
decomposition is at the heart of many types of multivariate analysis, e.g. PCA and
PLS. Under the assumption of multivariate normality for both populations the
canonical correlations can be tested for significance [6]. Retaining only the
significant canonical correlations may allow for a considerable dimension
reduction.

35.3.2 Algorithm

Computationally, canonical correlation analysis can be implemented using the


following steps, where it is assumed that the data X and Y are mean-centered.
X = U_X S_X V_X^T            <SVD of X>                                     (35.5a)

Y = U_Y S_Y V_Y^T            <SVD of Y>                                     (35.5b)

R = U_X^T U_Y = W*DQ*^T      <SVD of inter-set PC correlation matrix>       (35.6)

T = (n-1)^1/2 U_X W*         <X canonical scores>                           (35.7a)

U = (n-1)^1/2 U_Y Q*         <Y canonical scores>                           (35.7b)

W = (n-1)^1/2 V_X S_X^-1 W*  <X canonical weights>                          (35.8a)

Q = (n-1)^1/2 V_Y S_Y^-1 Q*  <Y canonical weights>                          (35.8b)

T = XW                       <X canonical scores>                           (35.9a)

U = YQ                       <Y canonical scores>                           (35.9b)

P = X^T T (T^T T)^-1         <X canonical structure>                        (35.10a)

C = Y^T U (U^T U)^-1         <Y canonical structure>                        (35.10b)
Equations 35.5a and b represent the singular value decomposition of the original
data tables, giving the new sets, U_X and U_Y, of unit-length orthogonal (orthonormal) variables. From these the matrix R is calculated as U_X^T U_Y (eq. 35.6). R is the correlation matrix between the principal components of X and those of Y, because of the equivalence of PCs and (left) singular vectors. Singular value decomposition of R yields the canonical weight vectors W* and Q* applicable to U_X and U_Y, respectively. The singular values obtained are equal to the canonical

correlations ρ_a. Instead of a single SVD of R one may apply a spectral decomposition (Section 29.6) of RR^T, giving eigenvectors W*, and a spectral decomposition of R^T R, giving eigenvectors Q*, the eigenvalues corresponding to the squared canonical correlations. The canonical variables are now obtained as in eqs. (35.7a,b). The factor (n-1)^1/2 is included to ensure that the canonical variables have unit variance. Back-transformation to the centred X- and Y-variables yields the sets of canonical weights collected in matrices W and Q, respectively (eqs. 35.8a,b). Applying these weights to the original variables again yields the canonical variables (eqs. 35.9a,b). Regressing the X-variables and Y-variables on their corresponding canonical variables gives the loading matrices P and C (eqs. 35.10a,b) which appear in the canonical decomposition: X = TP^T = T(T^T T)^-1 T^T X and Y = UC^T = U(U^T U)^-1 U^T Y. The loadings, defining the original mean-centred variables in terms of the orthogonal canonical variables, are better suited for interpretation than the weights. Each row of P (or C) corresponds to a variable and tells how much each canonical variable contributes to (or "loads" on) this variable. In case the X-variables and the Y-variables are also scaled to unit variance, P and C contain the intra-set correlations between the original variables and the canonical variables (so-called structure correlations; see Table 35.6).
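Eqs. (35.5)-(35.10) translate almost literally into NumPy. The sketch below is an illustration of these equations (economy-size SVDs, mean-centred input of full column rank assumed); applied to the mean-centred data of Table 35.5 it should return canonical correlations of approximately 0.95 and 0.55.

import numpy as np

def cca(X, Y):
    # Canonical correlation analysis following eqs. (35.5)-(35.10).
    # X (n x p) and Y (n x q) must be column mean-centred and of full column rank.
    n = X.shape[0]
    Ux, Sx, Vxt = np.linalg.svd(X, full_matrices=False)      # eq. (35.5a)
    Uy, Sy, Vyt = np.linalg.svd(Y, full_matrices=False)      # eq. (35.5b)
    Wstar, rho, Qstar_t = np.linalg.svd(Ux.T @ Uy)           # eq. (35.6)
    Qstar = Qstar_t.T
    T = np.sqrt(n - 1) * Ux @ Wstar                          # eq. (35.7a)
    U = np.sqrt(n - 1) * Uy @ Qstar                          # eq. (35.7b)
    W = np.sqrt(n - 1) * Vxt.T @ np.diag(1.0 / Sx) @ Wstar   # eq. (35.8a)
    Q = np.sqrt(n - 1) * Vyt.T @ np.diag(1.0 / Sy) @ Qstar   # eq. (35.8b)
    P = X.T @ T @ np.linalg.inv(T.T @ T)                     # eq. (35.10a)
    C = Y.T @ U @ np.linalg.inv(U.T @ U)                     # eq. (35.10b)
    return rho, T, U, W, Q, P, C

# applied to the data of Table 35.5:
X = np.array([[8, 11], [8, 8], [9, 12], [10, 10],
              [7, 9], [0, 2], [9, 11], [5, 2]], dtype=float)
Y = np.array([[10, 8], [9, 7], [6, 11], [1, 14],
              [13, 3], [10, 7], [5, 12], [11, 2]], dtype=float)
rho = cca(X - X.mean(axis=0), Y - Y.mean(axis=0))[0]
print(np.round(rho, 2))                                      # approximately [0.95 0.55]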
It should be appreciated that canonical correlation analysis, as the name implies,
is about correlation not about variance. The first step in the algorithm is to move
from the original data matrices X and Y, to their singular vectors, U_X and U_Y,
respectively. The singular values, or the variances of the PCs of X and Y, play no
role.

35.3.3 Discussion

Let us take a closer look at the analysis of the data of Table 35.5. In Table 35.6
we summarize the correlations of the canonical variates and also their correlations
with the original variables. The high value of the first canonical correlation
(ρ_1 = 0.95) suggests a strong relationship between the two data sets. However, the canonical variables t_1 and u_1 are only strongly related to each other, not with the original variables (Table 35.6). On the other hand, the second set of canonical variates, t_2 and u_2, are strongly related to their original variables, but not to each other (ρ_2 = 0.55). Thus, the analysis yields a pair of strongly linked, but un-
interesting factors and a pair of more interesting factors, which are weakly related,
however.
A major limitation to the value of CCA thus already has become apparent in the
example shown. There is no guarantee that the most important canonical variable
t_1 (or u_1) is highly correlated to any of the individual variables of X (or Y). It is possible then for the first canonical variable t_1 of X to be strongly correlated with u_1, yet to have very little predictive value for Y. In terms of principal components

TABLE 35.6

Canonical structure: correlations between the original variables (x, y) and their canonical variates (t, u).

              X                     Y
        t1        t2          u1        u2

x1     0.0476    0.9989      0.0451    0.5522
x2     0.5477    0.8367      0.5187    0.4625
y1    -0.0042   -0.5528     -0.0045   -1.0000
y2     0.3532    0.5129      0.3729    0.9279
t1     1.0000    0.0000      0.9470    0.0000
t2     0.0000    1.0000      0.0000    0.5528
u1     0.9470    0.0000      1.0000    0.0000
u2     0.0000    0.5528      0.0000    1.0000

(Chapter 17): only the minor principal components of X and Y happen to be highly
correlated. It is questionable whether such a high correlation is then of much
interest. This dilemma of choosing between high correlation and large variance
presents a major problem when analyzing the relation between measurements
tables. The regression techniques treated further on in this chapter address this
dilemma in different ways. A second limitation of CCA is that it cannot deal in a
meaningful way with data tables in 'landscape mode', i.e. wide data tables having
more variables than objects. This severely limits the importance of CCA as a
general tool for multivariate data analysis in chemometrics, e.g. when X represents
spectral data.
As the name implies CCA analyses correlations. It is therefore insensitive to any
rescaling of the original variables. This advantage is not shared with most other
techniques discussed in this chapter. As in Procrustes analysis X and Y play
entirely equivalent roles in canonical correlation analysis: there is no distinction in
terms of dependent variables (or responses) versus independent variables (or
predictors, regressors, etc.). This situation is fairly uncommon. Usually, the X and
Y data are of a different nature and one is interested in understanding one set of
data, say Y, in the light of the information contained in the other data set, X. Rather
than exploring correlations in a symmetric X<->Y relation, one is searching for an
asymmetric regression relation X->Y explaining the dependent Y-variables from
the predictor X-variables. Thus, the symmetrical nature of CCA limits its practical
importance. In the following sections we will discuss various asymmetric re-
gression methods where the goal is to fit the matrix of dependent variables Y by
linear combination(s) of the predictor variables X.

35.4 Multivariate least squares regression

35.4.1 Introduction
In this section we will distinguish multivariate regression from multiple
regression. The former deals with a multivariate response (Y), the latter with the
use of multiple predictors (X). When studying the relation between two multi-
variate data sets via regression analysis we are therefore dealing with multivariate
multiple regression. Perhaps the simplest approach to studying the relation
between two multivariate data sets X and Y is to perform for each individual
univariate variable y_k (k = 1, ..., m) a separate multiple (i.e. two or more predictor variables) regression on the X-variables. The obvious advantage is that the whole
analysis can be done with standard multiple regression programs. A drawback of
this approach of many isolated regressions is that it does not exploit the multi-
variate nature of Y, viz. the interdependence of the y-variables. Genuine multi-
variate analysis of a data table Y in relation to a data table X should be more than
just a collection of univariate analyses of the individual columns of Y!
One might suspect that fitting all Y-variables simultaneously, i.e. in one overall
multivariate regression, might make a difference for the regression model. This is
not the case, however. To see this, let us state the multivariate (i.e. two or more
dependent variables) regression model as:
[y_1, y_2, ..., y_m] = X [b_1, b_2, ..., b_m] + [e_1, e_2, ..., e_m] (35.11)

which shows explicitly the various responses, y_k (k = 1, 2, ..., m), as well as the vector of regression coefficients, b_k, and the residual vector, e_k, corresponding to
each response. This model may be written more compactly as:
Y = XB + E (35.12)
where Y is the nxm data set of responses, X the nxp data set of regressors, B the
pxm matrix of regression coefficients and E the nxm error matrix. Each column of
Y, B and E corresponds to one of the m responses, each column of X and each row
of B to one of the p predictor variables and each row of Y, X, and E to one of the n
observations.
The total residual sum of squares, taken over all elements of E, achieves its
minimum when each column e_k separately has minimum sum of squares. The latter
occurs if each (univariate) column of Y is fitted by X in the least-squares way.
Consequently, the least-squares minimization of E is obtained if each separate
dependent variable is fitted by multiple regression on X. In other words: the
multivariate regression analysis is essentially identical to a set of univariate
regressions. Thus, from a methodological point of view nothing new is added and
we may refer to Chapter 10 for a more thorough discussion of theory and
application of multiple regression.

35.4.2 Algorithm

The solution for the regression parameters can be adapted in a straightforward


manner from eq. (10.6), viz. b = (X^T X)^-1 X^T y, giving:

B = (X^T X)^-1 X^T Y (35.13)

Ŷ = XB = X (X^T X)^-1 X^T Y (35.14)

In eqs. (35.13) and (35.14) X may include a column of ones when an intercept has to be fitted for each response, giving a (p+1 x m) matrix B. Otherwise, X and Y are supposed to be mean-centred, and the (p x m) matrix B does not contain a row of intercepts.
The geometric meaning of eq. (35.14) is that the best fit is obtained by projecting all responses orthogonally onto the space defined by the columns of X, using the orthogonal projection matrix X(X^T X)^-1 X^T (see Section 29.8).
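In NumPy, eqs. (35.13) and (35.14) can be written directly. The sketch below assumes that X already contains a column of ones if intercepts are required; the closing comment illustrates the point made above that the multivariate solution coincides with the separate univariate regressions.

import numpy as np

def multivariate_ols(X, Y):
    # Multivariate least squares regression, eqs. (35.13)-(35.14).
    # X: (n x p) predictors (include a column of ones for intercepts).
    # Y: (n x m) responses.
    B = np.linalg.solve(X.T @ X, X.T @ Y)    # B = (X^T X)^-1 X^T Y
    Y_fit = X @ B                            # projection of Y onto the columns of X
    return B, Y_fit

# the same B is obtained column by column, i.e. by m separate multiple
# regressions of the individual responses y_k on X:
#   B[:, k] == np.linalg.solve(X.T @ X, X.T @ Y[:, k])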

35.4.3 Discussion

A major drawback of the approach is felt when the number of dependent


variables, m, is large. In that case there is an equally large number of separate
analyses and the combined results may be hard to summarize. When the number of
predictor variables, p, is very large, e.g. when X represents spectral intensities at
many wavelengths, there is also a problem. In that case X^X cannot be inverted and
there is no unique solution for B. Both in the case of large m or large p some kind of
dimension reduction is called for. We will therefore not discuss the multivariate
regression approach further, since this chapter focuses on truly multivariate
methods, taking the joint variation of variables into account.
All other methods discussed in this chapter provide such a dimension reduction.
They search for the most "interesting" directions in F-space and/or "interesting"
directions in X-space that are linearly related. They differ in the optimizing
criterion that is used to discover those interesting directions.

35.5 Reduced rank regression

35.5.1 Introduction

Reduced rank regression (RRR), also known as redundancy analysis (or PCA
on Instrumental Variables), is the combination of multivariate least squares
regression and dimension reduction [7]. The idea is that more often than not the
dependent Y-variables will be correlated. A principal component analysis of Y might indicate that A (A << m) PCs may explain Y adequately. Thus, a full set of m

separate multiple regressions as in unconstrained multivariate regression (Section


35.4) contains a fair amount of redundancy. To illustrate this we may look for A
particular linear combinations of X-variables that explain most of the total
variation contained in Y. For simplicity let us start with A = 1. When the Y-variables have equal variance, this boils down to finding a single component in X-space, say t_1 = Xw_1, that maximizes the average R^2. This average R^2 is the mean of the individual R^2-values resulting from all regressions of the individual Y-variables on the X-component t_1. Since now all Y-variables are estimated by the same regressor t_1, all fitted Y-variables are proportional to this predictor and, consequently, they are all perfectly correlated. In other words, the rank of the fitted Y-matrix is, of necessity, 1. Hence the name reduced rank regression.
Of course, this rank-1 restriction may severely affect the quality of the fit when the effective dimensionality A of Y is larger, 1 < A < m. Thus, we may look for a second linear combination of X-variables, t_2 = Xw_2, orthogonal to t_1, such that the multivariate regression of Y on t_1 and t_2 further maximizes the amount of variance explained. This process may be continued until Y can be sufficiently well approximated by regression on a limited set of X-components, T = [t_1, t_2, ..., t_A]. Since each Y-variable is fitted by a linear combination of the A X-components, each X-component itself being a linear combination of the predictor variables, the Y-variables can finally be expressed as a linear combination of the X-variables. It should be noted that when the same number of X-components is used as there are Y-variables, i.e. A = m, we can no longer speak of reduced rank regression. The
solution then has become entirely equivalent with unconstrained multivariate
regression.
The question of how many components to include in the final model forms a
rather general problem that also occurs with the other techniques discussed in this
chapter. We will discuss this important issue in the chapter on multivariate
calibration.
An alternative and illuminating explanation of reduced rank regression is
through a principal component analysis of Ŷ, the set of fitted Y-variables resulting
from an unrestricted multivariate multiple regression. This interpretation reveals
the two least-squares approximations involved: projection (regression) of Y onto
X, followed by a further projection (PCA) onto a lower dimensional subspace.

35.5.2 Algorithm

The interpretation also suggests the following simple computational


implementation of reduced rank regression.
Step 1. Multivariate least squares regression of Y on X (compare Section 35.4):
Ŷ = X(X^T X)^-1 X^T Y (35.15)

Equation (35.15) represents the projection of each Y-variable onto the space spanned by the X-variables, i.e. each Y-variable is replaced by its fit from multiple regression on X.
Step 2. Next one applies an SVD (or PCA) to the centred fit Ŷ, denoted as Ŷ* (= Ŷ - 1_n m_Ŷ^T):

Ŷ* = USV^T (35.16)
Step 3. Dimension (rank) reduction by retaining only the A major components to approximate Ŷ*. This gives the RRR fit:

Ŷ*_[A] = U_[A] S_[A] V_[A]^T (35.17)

Step 4. The RRR model coefficients are then found by a multivariate linear regression of the RRR fit, Ŷ_[A] = Ŷ*_[A] + 1_n m_Ŷ^T, on the original X, which should have a column of ones:

B_RRR = (X^T X)^-1 X^T (Ŷ*_[A] + 1_n m_Ŷ^T) (35.18)
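Steps 1-4 can be sketched in NumPy as follows. This is an illustration of the algorithm above, not a validated implementation; X is assumed to contain a column of ones, and the number of components A is chosen by the user.

import numpy as np

def reduced_rank_regression(X, Y, A):
    # Reduced rank regression of Y (n x m) on X (n x p, including a column
    # of ones), following steps 1-4 above; A is the number of components kept.
    B_ols = np.linalg.solve(X.T @ X, X.T @ Y)          # step 1, eq. (35.15)
    Y_fit = X @ B_ols
    m_fit = Y_fit.mean(axis=0)                         # column means of the fit
    U, s, Vt = np.linalg.svd(Y_fit - m_fit, full_matrices=False)   # step 2, eq. (35.16)
    Y_fit_A = U[:, :A] * s[:A] @ Vt[:A, :]             # step 3, eq. (35.17)
    B_rrr = np.linalg.solve(X.T @ X, X.T @ (Y_fit_A + m_fit))      # step 4, eq. (35.18)
    return B_rrr

# with A equal to the number of Y-variables the result coincides with the
# unconstrained multivariate least-squares solution of Section 35.4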

35.5.3 Discussion

A major difference between reduced rank regression and canonical correlation


analysis or Procrustes analysis is that RRR is a regression technique, with different
roles for Y and X. It is an appropriate method for the simultaneous prediction of
many correlated Y-variables from a common set of X-variables through a few X-components. Since reduced rank regression involves a PCA of Ŷ, its solution depends on the choice of scale for the Y-variables. It does not depend on the scaling
of the X-variables. The reduction to a few factors may help to prevent overfitting
and in this manner it stabilizes the estimation of the regression coefficients.
However, the most important factor determining the robustness of any regression
solution is the design of the regressor data. When the X-variables are highly
correlated we still have no guarantee that unstable minor X-factors are avoided in
the regression. In that case, and certainly when X is not of full rank, one may
consider basing the regression on all but the smallest principal components of X.
The ill-conditioning problem does not occur in the following example.

35.5.4 Example

Let us try to relate the (standardized) sensory data in Table 35.1 to the explan-
atory variables in Table 35.3. Essentially, this is an analysis-of-variance problem.
We try to explain the effects of two qualitative factors, viz. Country and Ripeness,
on the sensory responses. Each factor has three levels: Country = {Greece, Italy,

Spain} and Ripeness = {Unripe, Normal, Overripe}. Since not all combinations
out of the complete 3x3 block design are duplicated, there is some unbalance
making the design only nearly orthogonal. We treat this multivariate ANOVA
problem as a regression problem, coding the regressors as indicated in Table 35.3
and omitting the Italy column and Normal column to avoid ill-conditioning of X.
Some of the results are collected in Table 35.7. Table 35.7a shows that some
sensory attributes can be fitted rather well by the RRR model, especially 'yellow'
and 'green' (R^2 ≈ 0.75), whereas for instance 'brown' and 'syrup' do much worse (R^2 ≈ 0.40). These fits are based on the first two PCs of the least-squares fit (Ŷ). The PCA on the OLS predictions showed the 2-dimensional approximation to be very good, accounting for 99.2% of the total variation of Ŷ. The table shows the PC
weights of the (fitted) sensory variables. Particularly the attributes 'brown', and to
a lesser extent 'syrup', stand out as being different and being the main contributors
to the second dimension.

TABLE 35.7
(a) Basic results of the reduced rank regression analysis. The columns PC1 and PC2 give the weights of the
PCA model of Ŷ (OLS fitted Y). The columns R² (in %) show how well Ŷ and Y are fitted by the first two
principal components of Ŷ.

Y-variable     PC1(Ŷ)    PC2(Ŷ)    R²(Ŷ)    R²(Y)
Yellow          0.50     -0.35      99.9      77
Transp          0.44     +0.06      99.2      53
Glossy          0.44     +0.24      99.9      58
Green          -0.47     +0.40      99.6      73
Brown          -0.16     -0.73      98.2      41
Syrup          -0.34     -0.35      97.3      41
Overall R²:     80.8      18.4      99.2      57

(b) The columns PC1 and PC2 give the X-weights of the PCA model of Ŷ (OLS fitted Y). The columns R²
(in %) show how well the X variables are fitted by the first two principal components of Ŷ.

X-variable      PC1       PC2       R²
Intercept      -0.97     -0.86      -
Greece         -0.17     +1.91      99
Spain          +3.23     +1.03      94
Unripe         -0.88     -0.35       4
Overripe       +0.14     -0.13       7

The two principal axes can also be defined as linear combinations of the
explanatory variables. This is given in Table 35.7b. The larger coefficients for the
Country variables when regressing the PCs on the four predictor variables show
that the country of origin is strongly related to the most predictive principal
dimensions and that the state of ripeness is not. This also appears from the fact that
the Country variables can be fitted very well (high R²) by the first two PCs, in
contrast to the low R² values for the ripeness variables. In other words, the country
of origin is the dominant factor affecting the appearance of olive oils whereas the
state of ripeness has little effect. The first predictive dimension mainly represents a
contrast between the olive oils of Spanish origin versus non-Spanish origin and to a
much lesser extent a contrast between unripe versus the (over)ripe olives. The
second predictive factor, which is mostly used to fit the 'brown' sensory attribute,
represents a contrast between Italy and Greece, with Spain in the middle. Fig. 35.5
summarizes the relationships between samples (objects), predictor variables and
dependent variables. The objects are plotted as standardized scores (first two
columns of (n - 1)^(1/2) U), the variables as loading vectors, taken from X^T U and Ŷ^T U,
respectively, scaled to fit on the graph. For a thorough treatment of biplotting the
results of rank-reduced multivariate regression models, see Ref. [8].
By combining the coefficients in the two parts of the table one can express each
sensory attribute in terms of the explanatory factors. Note that the above regression

Fig. 35.5. Biplot of reduced rank regression model showing objects, predictors and responses.
329

model is defined in terms of binary regressor variables, which indicate the presence or absence of a condition. Italian olive oils, for example, are defined as
not Greek and not Spanish, and the variables indicating the country of origin,
'Greece' and 'Spain', are both set to 0. For example:
Yellow = 0.50*PC1 - 0.35*PC2
       = 0.50*(-0.97 - 0.17*'Greece' + ...) - 0.35*(-0.86 + 1.91*'Greece' + ...)
       = -0.22 - 0.74*'Greece' + 1.25*'Spain' - 0.26*'Unripe' + 0.19*'Overripe'
For an unripe Spanish olive oil this works out as: Yellow = -0.22 - 0.74*0 +
1.25*1 - 0.26*1 + 0.19*0 = 0.77. Since the sensory data were standardized one
needs to multiply by the standard deviation (19.5) and to add the average (50.9) to
arrive at a prediction in original units, viz. 50.9 + 0.77*19.5 ≈ 66.
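The back-transformation is easily checked numerically, e.g. with a few lines of Python (the coefficients are those quoted above):

yellow_std = -0.22 - 0.74*0 + 1.25*1 - 0.26*1 + 0.19*0   # standardized prediction = 0.77
yellow = 50.9 + yellow_std*19.5                           # back to original units, about 66
print(round(yellow_std, 2), round(yellow))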

35.6 Principal components regression

35.6.1 Introduction

In principal components regression (PCR) first a principal component analysis
(Chapters 17 and 31) is performed on X, then the Y-variables are regressed on these
PCs of X. PCR also combines the two steps of regression and dimension reduction.
Compared with reduced rank regression the order of these two basic steps is
reversed. The major difference, however, is that the dimension reduction pertains
to the predictor set X, and not to the dependent variables. In PCR, therefore, the
definition of the X-components is determined prior to the regression analysis, the
Y-variables not playing a role at this stage. As in the other approaches PCR
modelling proceeds factor by factor, the number of factors A to be determined by
some model validation procedure (Chapter 36 on Multivariate Calibration).

35.6.2 Algorithm

The computational implementation of principal components regression is very straightforward.
Step 1. First, carry out an SVD (or PCA) on centered X:
X = U S V^T
Step 2. Multivariate least squares regression of Y on the major A principal
components, using either the unit-norm singular vectors U[A], or the principal
components T[A] = X V[A] = U[A] S[A]:

Ŷ[A] = U[A] U[A]^T Y = T[A] (T[A]^T T[A])^-1 T[A]^T Y
The equation represents the projection of each Y-variable onto the space spanned
by the first A PCs of X.
Step 4. The PCR model coefficient matrix, B_PCR (p×m), can be obtained in a variety
of equivalent ways:
B_PCR = (X^T X)^-1 X^T Ŷ[A] = V[A] S[A]^-1 U[A]^T Y = V[A] (T[A]^T T[A])^-1 T[A]^T Y
The vector of intercepts is obtained as: b_0 = (m_Y^T - m_X^T B_PCR)^T.
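A minimal NumPy sketch of these steps, under the assumption that X and Y are supplied as raw (uncentred) data matrices and that A components are retained, might read:

import numpy as np

def pcr(X, Y, A):
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    # Step 1: SVD (PCA) of the centred X
    U, s, Vt = np.linalg.svd(X - mx, full_matrices=False)
    # Step 2: regress the centred Y on the first A principal components T = U S
    T = U[:, :A] * s[:A]
    B_pcr = Vt[:A, :].T @ np.linalg.lstsq(T, Y - my, rcond=None)[0]
    b0 = my - mx @ B_pcr          # vector of intercepts
    return B_pcr, b0

Predictions for a new row x* then follow as b0 + x* @ B_pcr.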

35.6.3 Discussion

The PCR approach has many attractive features. First of all there is the aspect of
a prior dimension reduction of the data set (measurement table) X. Using PCA this
is done in such a way as to maintain the maximum amount of information. The
neglected minor components are supposed to contain noise that is in no way
relevant for the relation with Y. Another advantage is that the principal compo-
nents are orthogonal (uncorrelated). This greatly simplifies the multiple regression
of the y-variables, allowing the effect of the individual principal components to be
assessed independently. The chief advantage is that the major principal compo-
nents have, by definition, large variance. This leads to a stable regression as the
variance of an estimated regression coefficient is inversely proportional to the
variance of the regressor (s_b^2 = s_e^2/Σ(x_i - x̄)^2; see Section 8.2.4.1).
The orthogonality of the principal components has the advantage that the effects
of the various PCs are estimated independently: multiple regression becomes
equivalent to a sequence of separate regressions of the response(s) on the
individual PCs. The fact that the X-components are chosen on the basis of
representing X rather than Y does not only have advantages. It also gives rise to a
major concern. What if a minor component happens to be important for the
regression? And what is the use of a major principal component if it is not related to
Y? The answer to the latter question is simple: it is of little use but it does not harm
either. The problem of discarding minor X-components that possibly are highly
correlated to Y is more severe. One way to address this problem is to include the
minor components in the regression if they are really needed. That is, one should
go on adding principal components in the regression model until Y is fitted well,
provided such a model also passes the (cross-)validation procedure (Section
36.10). Another strategy that is gaining popularity is to enter the principal compo-
nents in a different order than the standard order of descending variance (PCI,
PC2,...). Rather than this top-down procedure, one may apply variable selection:
one starts with the principal component that is most highly correlated with the
331

responses, then moves to the PC with the second highest correlation, etc. The only
thing that is needed is to compute the correlation coefficients of the PCs with each
response. For a univariate y the PCs may then be ranked according to their
descending (squared) correlation coefficient. By applying this forward selection
procedure one ensures that highly correlating PCs are not overlooked. For multi-
variate response data Y one should compute for each PC an average index of its
importance for all Y-variables together, e.g. the average squared correlation or the
total variance explained (||Ŷ||^2 = ||u_a^T Y||^2) by that PC. Comparative studies have
shown that the latter method of PCR frequently performs better, i.e. it gives good
predictive models with fewer components [9].
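As an illustration of this forward-selection variant for a single response y, one might rank the principal components as in the following sketch (centring is done inside the function; the function name is ours):

import numpy as np

def rank_pcs_by_correlation(X, y):
    # score vectors of all principal components of the centred X
    U, s, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
    T = U * s
    yc = y - y.mean()
    # squared correlation of each PC score vector with the response
    r2 = np.array([np.corrcoef(T[:, a], yc)[0, 1]**2 for a in range(T.shape[1])])
    return np.argsort(-r2)        # PC indices, most highly correlated first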
Since principal components regression starts with a PCA of X, its solution
depends on the particular scaling chosen for the X-variables. It does not depend on
the scaling of the Y-variables. When the maximum number of factors is used the
regression model becomes equivalent to multivariate regression. There is no
special multivariate version of principal component regression: each Y-variable is
separately regressed on the set of X-components. One might also consider
regressing the major PCs of Y or of Ŷ (Eq. 35.14) on the PCs of X.

35.7 Partial least squares regression

The purpose of Partial Least Squares (PLS) regression is to find a small number
A of relevant factors that (i) are predictive for Y and (ii) utilize X efficiently. The
method effectively achieves a canonical decomposition of X in a set of orthogonal
factors which are used for fitting Y. In this respect PLS is comparable with CCA,
RRR and PCR, the difference being that the factors are chosen according to yet
another criterion.
We have seen that PCR and RRR form two extremes, with CCA somewhere in
between. RRR emphasizes the fit of Y (criterion i). Thus, in RRR the X-
components t_a preferably should correlate highly with the original Y-variables.
Whether X itself can be reconstructed ('back-fitted') from such components t_a is of
no concern in RRR. With standard PCR, i.e. top-down PCR, the emphasis is
initially more on the X-side (criterion ii) than on the Y-side. CCA emphasizes the
importance of correlation; whether the canonical variates t and u account for much
variance in each respective data set is immaterial. Ideally, of course, one would
like to have the best of all three worlds, i.e. when the major principal components
of X (as in PCR) and the major principal components of Y (as in RRR) happen to
be very similar to the major canonical variables (as in CCA). Is there a way to
combine these three desiderata — summary of X, summary of Y and a strong link
between the two — into a single criterion and to use this as a basis for a
compromise method? The PLS method attempts to do just that.

PLS has been introduced in the chemometrics literature as an algorithm with the
claim that it finds simultaneously important and related components of X and of Y.
Hence the alternative explanation of the acronym PLS: Projection to Latent
Structure. The PLS factors can loosely be seen as modified principal components.
The deviation from the PCA factors is needed to improve the correlation at the cost
of some decrease in the variance of the factors. The PLS algorithm effectively
mixes two PCA computations, one for X and one for Y, using the NIPALS
algorithm. It is assumed that X and Y have been column-centred as usual. The
basic NIPALS algorithm can best be demonstrated as an easy way to calculate the
singular vectors of a matrix, viz. via the simple iterative sequence (see Section
31.4.1):
t = Xw   (35.19)
w ∝ X^T t   (35.20)
for X, and
u = Yq   (35.21)
q ∝ Y^T u   (35.22)
for Y. The ∝-symbol is used here to imply that the resultant vector has to be
normalized, i.e. w^T w = q^T q = 1. In eq. (35.19) t represents the regression coeffi-
cients of the rows of X regressed on w. Likewise, w in eq. (35.20) is proportional to
the vector of regression coefficients obtained by regressing each column (variable)
of X on the score vector t. This iterative process of criss-cross regressions is
graphically illustrated in Fig. 35.6.
Iterating eq. (35.19) and eq. (35.20) leads w to converge to the first eigenvector
of X^T X. One may easily verify this by substituting eq. (35.19) into eq. (35.20),
which yields w ∝ X^T t ∝ X^T X w, the defining relation for an eigenvector. Similarly,
t is proportional to an eigenvector of X X^T. It can be shown that the eigenvectors w
and t are the dominant eigenvectors, i.e. the ones corresponding to the largest
eigenvalue. Thus, w and (normalized) t form the first pair of singular vectors of X.
Likewise, q and (normalized) u are the dominant eigenvectors of Y^T Y and Y Y^T,

Fig. 35.6. Principle of SVD/NIPALS algorithm.



Fig. 35.7. Principle of PLS/NIPALS algorithm.

respectively, or the first pair of singular vectors of Y. Once this first pair of singular
vectors is determined one extracts this dimension by fitting t to X (or u to Y) and
proceeding with the matrix Ej (or Fj) of residuals. Using the residual matrix Ej (or
Fj) and the basic NIPALS algorithm one may find the pair of dominant singular
vectors which in fact is the second pair of singular vectors of the starting matrix X
(or Y). The process is repeated until the starting matrix is fully depleted.
Instead of separately calculating the principal components for each data set, the
two iterative sequences are interspersed in the PLS-NIPALS algorithm (see Fig.
35.7):
w ∝ X^T u   (35.23)
t = Xw   (35.19)
q ∝ Y^T t   (35.24)
u = Yq   (35.21)
One starts the iterative process by picking some column of Y for u and then
repeating the above steps cyclically until convergence. Upon convergence we have
w ∝ X^T u ∝ X^T Y q ∝ X^T Y Y^T t ∝ X^T Y Y^T X w. Thus, w is an eigenvector of X^T Y Y^T X
and, similarly, q is an eigenvector of Y^T X X^T Y [10]. These matrices are the two
symmetric matrix products, viz. (X^T Y)(X^T Y)^T and (X^T Y)^T(X^T Y), based on the
same cross-product matrix (X^T Y). Apart from a factor (n - 1), the latter matrix is
equal to the matrix of inter-set covariances. Another interpretation of the weight
vectors w and q in PLS is therefore as the first pair of singular vectors of the
covariance matrix X^T Y. As we found in Chapter 29 this first pair of singular
vectors forms the unique pair of normalized weight vectors that maximizes the
expression w^T(X^T Y)q = (Xw)^T(Yq) = t^T u. Up to a factor (n - 1), the latter inner
product equals the covariance of the two score vectors t = Xw and u = Yq. This
then leads to the following important interpretation of the PLS factors: t = Xw and
u = Yq are chosen to maximize their covariance [10,11].
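This interpretation also gives a direct, non-iterative way to obtain the first pair of weight vectors, sketched below for column-centred X and Y (a sketch only, not the full PLS algorithm, which deflates and repeats):

import numpy as np

def first_pls_weights(X, Y):
    # first pair of singular vectors of the cross-product (covariance) matrix X^T Y
    U, s, Vt = np.linalg.svd(X.T @ Y, full_matrices=False)
    w, q = U[:, 0], Vt[0, :]
    t, u = X @ w, Y @ q          # score vectors with maximal covariance
    return w, q, t, u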

Let us take a closer look at this covariance criterion. A covariance involves three
terms (see Section 8.3):

cov(t,u) = s_t s_u r_tu   (35.25)

or, taking the square,

cov(t,u)^2 = var(t) var(u) r_tu^2   (35.26)

Thus, the PLS covariance criterion capitalizes on precisely the three links that
connect two sets of data via their latent factors:
(i) the X-factor t should have appreciable variance, var(t);
(ii) similarly, the Y-factor u should have a large variance, var(u), and
(iii) the two factors t and u should be strongly related also (high r_tu).
Of the three aspects inherent to the covariance criterion (35.26), CCA just con-
siders the so-called inner relation between t and u as expressed by r_tu, RRR
entirely neglects the var(t) aspect, whereas PCR emphasizes this var(t) compo-
nent. One might maintain that PLS forms a well-balanced compromise between
the methods treated thus far. PLS neither emphasizes one aspect of the X↔Y
relation unduly, nor does it completely neglect any.
The covariance criterion as such suggests a symmetrical situation, X and Y
playing equivalent roles. In fact, up to here, there is little difference with Procrustes
analysis which also utilizes the singular vectors of the covariance matrix (Section
35.2). The difference is that in PLS the first X-factor, say t_1, is now used as a
regressor to fit both the X-block and the Y-block:
X (= E_0) = t_1 p_1^T + E_1   (35.27a)

Y (= F_0) = t_1 c_1^T + F_1   (35.27b)

Here, the loading vector p_1 contains the coefficients of the separate univariate
regressions of the individual X-variables on t_1. The jth element of p_1, p_1j, represents
the regression coefficient of x_j regressed on t_1: p_1j = x_j^T t_1 / (t_1^T t_1). The full vector of
loadings becomes p_1 = E_0^T t_1 / (t_1^T t_1). Similarly, c_1 contains the regression
coefficients relating t_1 to the Y-variables: c_1 = F_0^T t_1 / (t_1^T t_1).
The residuals of these regressions are collected in residual matrices E_1 and F_1:
E_1 = E_0 - t_1 p_1^T   (35.28a)
F_1 = F_0 - t_1 c_1^T   (35.28b)
A second PLS factor t_2 is extracted in a similar way maximizing the covariance of
linear combinations of the residual matrices E_1 and F_1. Subsequently, E_1 and F_1 are
regressed on t_2, yielding new residual matrices E_2 and F_2 from which a third PLS
factor t_3 is computed, and so on. If one does not limit the number of factors, the
process automatically stops when the X-matrix has been fully depleted. This
occurs when the number of factors A equals the rank of X, i.e. A = min(n - 1, p). As
for PCR, such a full-rank PLS model is entirely equivalent to multivariate
regression on the original X-variables.
The PLS algorithm is relatively fast because it only involves simple matrix
multiplications. Eigenvalue/eigenvector analysis or matrix inversions are not
needed. The determination of how many factors to take is a major decision. Just as
for the other methods the 'right' number of components can be determined by
assessing the predictive ability of models of increasing dimensionality. This is
more fully discussed in Section 36.5 on validation.
Let us now consider a new set of values measured for the various X-variables,
collected in a supplementary row vector x*. From this we want to derive a row
vector ŷ* of expected Y-values using the predictive PLS model. To do this, the
same sequence of operations is followed transforming x* into a set of factor scores
{t*_1, t*_2, ..., t*_A} pertaining to this new observation. From these t*-scores ŷ* can be
estimated using the loadings C.
Prediction starts by equating ŷ*_0 to the mean (m_Y) for the training data and
removing the mean m_X from x* giving e*_0:
ŷ*_0 = m_Y

e*_0 = x* - m_X

Then we compute the score of the new observation x* on the first PLS dimension
and from that we calculate an updated prediction (ŷ*_1) and we remove the first
dimension from e*_0 giving e*_1:
t*_1 = e*_0 w_1

ŷ*_1 = ŷ*_0 + t*_1 c_1^T

e*_1 = e*_0 - t*_1 p_1^T
This sequence is repeated for dimension 2:

e*_2 = e*_1 - t*_2 p_2^T

and so on.

Alternatively, one may obtain predicted values directly as

ŷ* = m_Y + (x* - m_X)^T B_PLS

using the matrix of regression coefficients B_PLS, as estimated by the PLS method. It
may be shown that a closed expression for these coefficients can be obtained from
the weights and loadings matrices [12]:
B_PLS = W(P^T W)^-1 C^T

35.7.2 NIPALS-PLS Algorithm

Here we summarize the steps needed to compute the PLS model:
1. E = X - 1 m_X^T <column centering of X>
2. F = Y - 1 m_Y^T <column centering of Y>
3. E ← E*diag(1/std(E)) <autoscaling (optional)>
4. F ← F*diag(1/std(F)) <autoscaling (optional)>
5. ssqX = sum_of_squares(E)
6. ssqY = sum_of_squares(F)
7. for a = 1 to A
8.   Set u = first column of F
9.   Iterate until convergence
10.    w = E^T u, w ← w/(w^T w)^1/2 <X factor weights>
11.    t = Ew <X factor scores>
12.    c = F^T t/(t^T t) <Y factor loadings>
13.    u = Fc/(c^T c) <Y factor scores>
14.  End
15.  p = E^T t/(t^T t) <X factor loadings>
16.  E ← E - tp^T <update residual X>
17.  F ← F - tc^T <update residual Y>
18.  r2x = (t^T t)(p^T p)/ssqX <X sum of squares accounted for by t>
19.  r2y = (t^T t)(c^T c)/ssqY <Y sum of squares accounted for by t>
20.  Save results for current factor w, t, c, p, u, r2x, r2y into W, T, C, P, U, R2X, R2Y
21. End
22. B_PLS = W(P^T W)^-1 C^T
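A compact NumPy translation of this listing might look as follows; it is a sketch only (the optional autoscaling of lines 3-4 is omitted, and the normalization ||w|| = 1 of line 10 is used):

import numpy as np

def nipals_pls(X, Y, A, max_iter=500, tol=1e-10):
    E = X - X.mean(axis=0)                    # column centering of X
    F = Y - Y.mean(axis=0)                    # column centering of Y
    n, p = E.shape
    m = F.shape[1]
    W = np.zeros((p, A)); P = np.zeros((p, A))
    T = np.zeros((n, A)); Umat = np.zeros((n, A))
    C = np.zeros((m, A))
    for a in range(A):
        u = F[:, 0].copy()                    # start with some column of F
        for _ in range(max_iter):
            w = E.T @ u
            w /= np.sqrt(w @ w)               # X factor weights, |w| = 1
            t = E @ w                         # X factor scores
            c = F.T @ t / (t @ t)             # Y factor loadings
            u_new = F @ c / (c @ c)           # Y factor scores
            if np.linalg.norm(u_new - u) < tol * np.linalg.norm(u_new):
                u = u_new
                break
            u = u_new
        p_a = E.T @ t / (t @ t)               # X factor loadings
        E = E - np.outer(t, p_a)              # update residual X
        F = F - np.outer(t, c)                # update residual Y
        W[:, a], T[:, a], C[:, a], P[:, a], Umat[:, a] = w, t, c, p_a, u
    B_pls = W @ np.linalg.solve(P.T @ W, C.T)  # B_PLS = W (P^T W)^-1 C^T
    return B_pls, W, T, C, P, Umat

Predicted values for a new sample x* then follow as Y.mean(axis=0) + (x* - X.mean(axis=0)) @ B_pls.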
Slightly different implementations of the above PLS-NIPALS algorithm exist.
They mostly differ in the chosen normalization of w, t or p (here ||w|| = 1). This is
not an important issue, but it may be a cause of confusion when comparing results
from different (software) implementations. That the normalization is of no real
importance can be seen as follows. Let us say we choose to multiply the weight

vector w by some scalar constant. Then, the length of the score vector t increases
by the same factor. In contrast, the lengths of the loading vectors p and c decrease
by that factor. Since the loadings appear in combination with scores (e.g. tp^T, tc^T,
(t^T t)(p^T p), and (t^T t)(c^T c)) or with weights (W(P^T W)^-1 C^T), the chosen multipli-
cation factor has no effect.
A slight alternative to the above NIPALS algorithm is replacing the iteration
loop (lines 8-14) by:

8'. q ← first eigenvector of (F^T E E^T F)
9'. u = Fq
10. w = E^T u, w ← w/(w^T w)^1/2
11. t = Ew
12. c = F^T t/(t^T t)

This formulation is attractive since it gives the weight vector q accurately and
(usually) very fast. The latter holds true since more often than not the number of
response variables (m) is not that large, hence the matrix F^T E E^T F (m×m) is usually
quite small. This is especially true when Y is univariate. One may verify that in this
situation Y and F become vectors, say y and f, hence the loading 'vector'
c = F^T t/(t^T t) becomes a scalar, c = f^T t/(t^T t), and so do the 'matrix' F^T E E^T F
(= f^T E E^T f) and its eigenvector: q = 1. The NIPALS procedure also becomes very
efficient when there is a single response y as it requires only one passage through
the iteration steps (lines 9-14). The distinction between multivariate and
univariate Y is an important one and the two situations are distinguished by the
notation PLS2 and PLS1, respectively. In this chapter, which is about relating two
multivariate data tables, the emphasis is on PLS2, and the case of univariate y is not
considered here. PLS1 regression is treated in Chapter 36.
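A sketch of the non-iterative replacement of lines 8-14 described above, in the same notation (E and F are the current residual matrices; the function name is ours):

import numpy as np

def pls_factor_via_eigenvector(E, F):
    # dominant eigenvector of the small m x m matrix F^T E E^T F
    M = F.T @ E @ E.T @ F
    eigvals, eigvecs = np.linalg.eigh(M)       # M is symmetric
    q = eigvecs[:, -1]                         # eigenvector of the largest eigenvalue
    u = F @ q
    w = E.T @ u
    w /= np.sqrt(w @ w)
    t = E @ w
    c = F.T @ t / (t @ t)
    return w, t, c, u, q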

35.7.3 Discussion

One may wonder why it is necessary to take the variance of the predictors t_a into
account, when the goal, after all, is not to fit X but to fit and predict Y. There is a
good reason, however, to prefer a fit based on a large-variance component t above
an equally good fit based on a component of small variance. This ties in with the
intuitive appeal that a regression relation should preferably be based on a regressor
varying over as wide a range as possible. Or, more precisely, the variance of an
estimated regression parameter is inversely proportional to the variance of the
regressor (Sections 8.4 and 10.4).
As an example we try to model the relation between the sensory data of Table
35.1 and the instrumental measurements of Table 35.4. The PLS analysis results
are shown in Table 35.8. The first PLS dimension loads about equally high on


Fig. 35.8. Plot of inner relation (u versus t) for PLS dimension 1.

peroxide, K232 and K270 and much less on DK and Acidity. As to the Y-variables,
all variables are about equally well approximated from t_1. Thus the first PLS factor
(t_1) tries to model some average overall appearance feature (u_1) of the Y-block.
There is a good linear relation between the first X-factor and the first Y-factor (Fig.
35.8). One notices that this dimension separates the Spanish olive oils from the
other nationalities. The second dimension captures a different feature. It chiefly
loads on acidity and a little bit on peroxide and DK. This second PLS factor t_2 loads
mainly on the 'brown' variable and slightly on 'green'. A scatter plot of u_2 versus t_2
again reveals a fairly good correlation (0.69) and, this time, a contrast between the
Greek and the non-Greek oils (Fig. 35.9). While a plot of u-scores versus t-scores
shows the strength of the linear relation, a scatter plot of the t-scores for the first
few dimensions gives a view of the arrangements of the objects in multivariate
space. From Fig. 35.10 one concludes, for example, that there is a weak clustering
of objects according to the country of origin. Figure 35.11 shows the pattern of
loadings of the predictors and the dependent variables. One observes that the first
PLS factor is associated with a contrast between glossy, transparency and yellow-
ness on the one hand and green, syrupy and brown on the other. One also sees that
the UV measurements K232 and K270 are most relevant for this dimension, or that
K232 is more associated with 'syrupy' and 'brown', whereas K270 and DK are
more associated with the attribute 'green'. The various projections obtained via
PLS regression such as the ones shown in Figs. 35.8-11 are a powerful tool in
interpreting the structure and the relationship of the two datasets.


Fig. 35.9. Plot of inner relation (u versus t) for PLS dimension 2.


Fig. 35.10. Scatter plot of samples in first two PLS dimensions.




Fig. 35.11. Loadings of the two dominant PLS factors on the X-variables (lower case) and Y-variables
(upper case).

35.7.4 Alternative PLS algorithms

The algorithm described above is the classical PLS algorithm as developed by Wold and Martens in the early 1980s. There is a choice of different algorithms all
deriving the PLS model. Most of these algorithms aim at increasing the efficiency
of the algorithm when the number of objects (n) or the number of variables (p) is
very large. Rather than working with the X- and Y-matrices (or their residual
offspring) these algorithms work with cross-product matrices of smaller
dimension. An overview and comparison of these algorithms has been given in
Ref. [13].
The SIMPLS algorithm [14] was devised as a statistically inspired modification
of the PLS algorithm and led to a PLS model that is slightly different from the
classical PLS/NIPALS algorithm when the response is multivariate (PLS2).
SIMPLS solves the covariance optimization in the same way as CCA solves the
correlation optimization problem and PCA solves the variance optimization
problem, viz. as a set of constrained optimization problems. An outstanding
feature of SIMPLS is that the weight matrix, called R to avoid confusion with the
NIPALS/PLS weight matrix W, applies to the original data. The restriction is that
any newly formed PLS factor t_a = X r_a should be orthogonal to its predecessors
t_b = X r_b, 1 ≤ b < a. This appears to be equivalent to constraining the new weight

TABLE 35.8
Summary of PLS results

                        X                                                  Y
                  PLS dimension                                      PLS dimension
              (1)    (2)    (3)    (4)    (5)                (1)    (2)    (3)    (4)    (5)
Weights
Acidity       -21     77     11     53    -24   Yellow        39    -29    -66      5     66
Peroxide      -53    -44    -20     61     31   Green        -36     40     55    -16    -40
K232          -56    -22      2    -25    -75   Brown        -39    -79     47     25     16
K270          -51     16     59    -36     47   Glossy        44     11     13    -12     22
DK            -30     35    -76    -38     21   Transpar      41     -2      6    -22     11
                                                Syrupy       -42    -32     -1     91    -55
Loadings
Acidity      -246    814    107    497   -241   Yellow       375   -199   -337     43    541
Peroxide     -507   -368   -174    659    316   Green       -343    266    280   -136   -328
K232         -545   -271    -30   -361   -756   Brown       -377   -526    243    209    139
K270         -488    128    625   -292    471   Glossy       423     76     70   -102    188
DK           -394    349   -755   -349    218   Transpar     395    -14     31   -181     90
                                                Syrupy      -405   -218     -9    734   -456
Scores
G11         -1969   2504   -556   -356   -332              -1595   1662    524    253   -701
G12           724   -708    620    -54   -105              -2075    782   1726   1103  -2969
G12          1205   -221   1106    -55   -385                319   1204   1482   -722  -1085
G13         -1533   1725   1708    282    323               -797   1138    389   -359    837
G22           333   1483     26    130     64               -675    594   -620    643   -204
I31         -2617   -878   -499   -254    176              -3220   -957   -491    464   -280
I32          -843   -539   -760   -206    -36                283    449   -729  -1422    701
I32         -1797   -554   -965    260    350              -2018    606    764   -138  -1191
I33          -615   -106  -1024    390   -275                867   -162  -1498   -258   1040
I42          2676  -1905   1171     62   -189              -2643  -2423    138    338   1041
S51          1411   -459    647   -387    296               2328   -278   -311   -725    997
S52          1770    144   -218   -985    115               3172    235   -453  -1179    987
S53          1029    -73   -548    267     -5                694   -495   -750    285    283
S61          1542   -513   -281   -129   -134               1158   -869   -112    701   -269
S62          2215     62   -505    391    218               2110   -721     40    522    164
S63          1822     39     77    645    -81               2093   -766    -99    493    646
% VAF
Acidity      17.6     95   95.8   99.7    100   Yellow      40.9   45.6   53.3   53.3   54.9
Peroxide     74.8   90.6   92.7   99.5    100   Green       34.3   42.6   47.9   48.2   48.8
K232         86.2   94.8   94.8   96.9    100   Brown       41.3   73.7   77.7   78.4   78.5
K270         69.1     71   97.5   98.8    100   Glossy      51.9   52.6   52.9   53.1   53.2
DK             45   59.3   97.8   99.7    100   Transpar    45.4   45.4   45.5     46   46.1
                                                Syrupy      47.6   53.2   53.2   61.6   62.7

Average      58.5   82.1   95.7   98.9    100   Average     43.6   52.2   55.1   56.8   57.4

vector r_a to be orthogonal to prior loading vectors p_b, since t_a^T t_b = r_a^T X^T t_b = r_a^T p_b (t_b^T t_b). In this way one avoids the deflation step of X for each separate dimension. It
turns out that NIPALS/PLS2 does not truly maximize the covariance criterion after
the first dimension whereas SIMPLS does. This leads to a slight difference
between the NIPALS/PLS2 model and the SIMPLS model. From a practical point
of view the difference is negligible. Once the weight matrix R is found, one can
immediately find the regression model.

35.8 Continuum regression methods

We have seen that PLS regression (covariance criterion) forms a compromise between ordinary least squares regression (OLS, correlation criterion) and princi-
pal components regression (variance criterion). This has inspired Stone and
Brooks [15] to devise a method in such a way that a continuum of models can be
generated embracing OLS, PLS and PCR. To this end the PLS covariance cri-
terion, cov(t,y) = s_t s_y r_ty, is modified into a criterion T = s_t^(α/(1-α)) s_y r_ty. (For
simplicity, we assume a univariate response, in which case u = y.) For α = 0 the
variance of t does not play a role, since s_t^0 = 1. Maximizing T is equivalent to
maximizing the correlation r_ty, and α = 0 corresponds to ordinary least squares
regression. For α = 1/2, T = s_t s_y r_ty = cov(t,y), which is the PLS criterion. For
α → 1 we have that α/(1 - α) → ∞ and T is dominated by the s_t component, i.e. the
variance of t should be maximal. Hence, α = 1 corresponds to PCR. By varying α
between 0 and 1 we can generate a continuum of regression models, including OLS
(α = 0), PLS (α = 1/2) and PCR (α = 1) as special cases. One hopes that by
optimizing the additional parameter α through cross-validation better models may
be obtained. The method has been aptly called continuum regression (CR). When
Y is multivariate the method is called joint continuum regression (JCR) [16]. The
limit α = 0 then corresponds to reduced rank regression. Since the algorithmic
implementation of (joint) continuum regression is rather cumbersome we discuss
two simpler alternatives.
Principal covariates regression (PCovR) is a technique that recently has been
put forward as a more flexible alternative to PLS regression [17]. Like CCA, RRR,
PCR and PLS it extracts factors t from X that are used to estimate Y. These factors
are chosen by a weighted least-squares criterion, viz. to fit both Y and X. By
requiring the factors to be predictive not only for Y but also to represent X
adequately, one introduces a preference towards the directions of the stable
principal components of X.
The easiest way to envision PCovR is as a PCA of the supermatrix Z obtained
by adjoining Ŷ and X, Z = [Ŷ|X]. Here, Ŷ is the least squares fit of Y by X as given
by eq. (35.14). By including a relative weighting of the two matrices Ŷ and X
constituting Z = [(1 - α)Ŷ | αX], one may, to some extent, control the outcome of the
analysis. If the importance of X is downgraded (α → 0), Ŷ dominates and the
analysis becomes entirely equivalent to RRR. In the opposite case (α → 1), X will
dominate and the analysis becomes equivalent to PCR. When Ŷ and X are scaled to
equal "size", i.e. sum of squares, they are weighted equally and the results
resemble those of PLS.
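A minimal sketch of this construction (alpha between 0 and 1 controls the RRR-PCR balance; X and Y are assumed column-centred; only the factor scores are returned, the subsequent regression steps being as before):

import numpy as np

def pcovr_factors(X, Y, A, alpha=0.5):
    # least squares fit of Y by X (cf. eq. 35.14)
    Yhat = X @ np.linalg.lstsq(X, Y, rcond=None)[0]
    # PCA of the adjoined, weighted supermatrix Z = [(1 - alpha) Yhat | alpha X]
    Z = np.hstack([(1 - alpha) * Yhat, alpha * X])
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return U[:, :A] * s[:A]       # the A principal covariates (factor scores)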
The main advantage of PCovR lies in its flexibility of being able to move
between the two extremes. It releases the data analyst from making a choice
between the various techniques, letting the data determine the best predictive
model. One might also consider it an advantage of PCovR compared to PLS that it
employs the familiar least-squares criterion. PCovR also allows for a nice graph-
ical presentation of the dependent and independent variables in conjunction
through a biplot of the augmented Z data table. Figures 35.12-14 show such
biplots for the sensory and physico-chemical data. Figure 35.12 shows the RRR
extreme, optimally displaying the fitted sensory variables (total Y variance
accounted for by two dimensions, VAF = 54%, i.e. 95% of the maximum attain-
able percentage 57%) and the explanatory variables (VAF = 77%) projected on
this PCs-of-Ŷ plane. Figure 35.13 shows the PCR extreme, optimally displaying
the explanatory variables (VAF = 82%) and the sensory variables (VAF = 51%, or
89% of maximum) projected on this PCs-of-X plane. Figure 35.14 forms the
PCovR compromise, accounting for 53% of the total Y-variance (94% of max-
imum) and 81% of the total X-variance. These results and the accompanying
graphs show once again that the difference between the various methods is


Fig. 35.12. Scores, X-loadings and Y-loadings for the two dominant RRR dimensions.


Fig. 35.13. Scores, X-loadings and Y-loadings for the two dominant PCR dimensions.


Fig. 35.14. Scores, X-loadings and Y-loadings for the two dominant PCovR dimensions.

marginal in this case. The reason is that the principal components of X and those of
Y are already highly correlated, i.e. both sets of data can be explained by the same
latent variables.
Finally, another alternative to continuum regression has been put forward by
Wise and de Jong [18]. Their continuum power-PLS (CP-PLS) method modifies
the matrix X = U S V^T into X_γ = U S^γ V^T, i.e. the singular values are raised to a
certain power γ = α/(1 - α) and a modified ('powered') predictor matrix is
constructed. Then one applies PLS regression and the results are back-transformed
to the original X matrix. Again, by changing α from 0 to 1 (or γ from 0 to ∞) the
model changes from RRR via PLS to PCR.
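The 'powering' step itself is easily sketched (gamma = alpha/(1 - alpha); the subsequent PLS regression and back-transformation to the original X are not shown):

import numpy as np

def powered_X(X, gamma):
    # raise the singular values of X to the power gamma (the CP-PLS modification)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(s**gamma) @ Vt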

35.9 Concluding remarks

Different techniques have been presented that can be used to relate two
measurement tables. Are we now in a position to answer the obvious question:
which one is to be preferred? As expected, there is no clear-cut answer to this
question. It all depends on the objective of the comparison and the nature of the
data. Some general conclusions can be drawn, however. Procrustes analysis stands
apart from the other techniques in that it is not a regression technique. Its main use
lies in the comparison of two or more data sets which are supposed to be essentially
similar apart from basic transformations such as translation, rotation and
reflection. Procrustes analysis finds those transformations needed to match the sets
of data. Multivariate regression is only useful when both the number of dependent
variables and the number of regressors are not too large. It does not provide an
interpretation in terms of latent variables. All other regression techniques dis-
cussed apply some form of dimension reduction and become essentially equivalent
to multivariate regression when they are carried through to the maximum number
of factors. Canonical correlation analysis is not to be recommended when the
interest lies in understanding or predicting Y from X. However, when the aim is to
discover dimensions in Y and X that are very 'similar', and this in itself is of
substantial interest, then CCA can be a very useful tool. Reduced rank regression is
especially useful when the Y data are highly collinear and the X data are not.
Hence it is especially useful when X is designed and Y is multivariate. It is of little
use when the Y data are nearly orthogonal. Principal component regression is to be
preferred when the X data are highly collinear. It provides little advantage when
the predictor variables are nearly orthogonal. Partial least squares should do well
under most circumstances. When the X data are orthogonal, e.g. when coming
from a designed experiment, PLS becomes equivalent to reduced rank regression.
Principal covariates regression is in many ways quite comparable to PLS. It adds
some additional flexibility, covering RRR and PCR as extreme cases, as does

continuum regression. When we are in the lucky circumstance that the major
dimensions of X and Y are strongly related, then it will be hard to miss this by any
technique: all will reveal this strong relation. At the other extreme, when there
is no relation whatsoever between the two data tables, all techniques will arrive at
the same negative conclusion.
The similarities and differences between the regression methods have been
investigated in various instances (see e.g. [19]). The relation between CCA, PLS,
SIMPLS and RRR has been discussed by Burnham, Viveros and MacGregor [20].
When X is an orthogonal design matrix PLS2 becomes equivalent with RRR [21].
Breiman and Friedman [22] have devised a method, called Curds-and-Whey, that
uses the canonical variables and combines ideas from canonical correlation anal-
ysis and ridge regression (Section 10.6). A thorough discussion of various multi-
variate regression techniques and their application to QSAR problems is given by
Jonathan and Stone [23]. A good collection of papers on the applicability of PLS
regression to QSAR problems can be found in Ref. [24]. Total Least Squares (TLS,
[25]) is a technique popular with numerical analysts and engineers. It is a multi-
variate generalization of errors-in-variables regression or orthogonal distance
regression (cf. Section 8.2.11), where measurement errors in both X and Y are
explicitly taken into account.

References

1. C. Peri, A. Garrido and A. Lopez, eds., The sensory and nutritional quality of virgin olive oil.
Grasas y Aceites, 45 (Symposium Issue) (1994) 1-81.
2. J.C. Gower, Generalized Procrustes analysis. Psychometrika, 40 (1975) 33-51.
3. J.M.F. ten Berge, Orthogonal Procrustes rotation for two or more matrices. Psychometrika, 42
(1977) 267-276.
4. G.B. Dijksterhuis, Procrustes analysis in sensory research, Ch. 7 in: Multivariate analysis of
data in sensory science (T. Naes and E. Risvik, eds), Elsevier, Amsterdam (1996).
5. H. Hotelling, Relationships between two sets of variates. Biometrika, 28 (1936) 321-327.
6. T.W. Anderson, Introduction to Multivariate Statistical Analysis, 2nd ed., Wiley, New York
(1984).
7. P.T. Davies and M.K.S. Tso, Procedures for reduced-rank regression, Appl. Stat., 31 (1982)
244-255.
8. C.J.F. ter Braak, Interpreting canonical correlation analysis through biplots of structure corre-
lations and weights. Psychometrika, 55 (1990) 519-531.
9. Y.L. Xie and J.H. Kalivas, Evaluation of principal component selection methods to form a
global prediction model by principal component regression. Anal. Chim. Acta (1997) 348,
19-27.
10. A. Hoskuldsson, PLS regression methods. J. Chemom., 2 (1988) 211-228.
11. I.E. Frank, Intermediate least-squares regression method. Chemom. Intell. Lab. Syst., 1 (1987)
233-242.
12. H. Martens and T. Naes. Multivariate Calibration. Wiley, Chichester, 1989.

13. S. de Jong, Comparison of PLS algorithms. Chemom. Intell. Lab. Syst. (1998)
14. S. de Jong, SIMPLS: an alternative approach to partial least squares regression. Chemom.
Intell. Lab. Syst., 18 (1993) 251-263.
15. M. Stone and R.J. Brooks, Continuum regression; cross-validated sequentially constructed
prediction embracing ordinary least squares, partial least squares, and principal component re-
gression. J. Roy. Stat. Soc. B52 (1990) 237-269.
16. R.J. Brooks and M. Stone, Joint Continuum Regression for multiple predictands. J. Am. Stat.
Assoc, 89 (1994) 1374-1377.
17. S. de Jong and H.A.L. Kiers, Principal covariates regression. Chemom. Intell. Lab. Syst., 14
(1992) 155-164.
18. S. de Jong and B.M. Wise, J. Chemom., 12, (1998).
19. I.E. Frank and J.H. Friedman, A statistical view of some chemometrics regression tools.
Technometrics, 35 (1993) 109-135.
20. A. Burnham, R. Viveros and J.F. MacGregor, Frameworks for latent variable multivariate regres-
sion. J. Chemom., 10 (1996) 31-46.
21. Ø. Langsrud and T. Næs, On the structure of PLS in orthogonal designs. J. Chemom., 9 (1995)
483-487.
22. L. Breiman and J.H. Friedman, Predicting multivariate responses in multiple linear regression.
J. Roy. Stat. Soc. B59 (1997) 3-37.
23. M. Stone and P. Jonathan, Statistical thinking and technique for QSAR and related studies. J.
Chemom. 7 (1993) 455-475; 8 (1994) 1-20, 303.
24. S. Wold, in: QSAR: Chemometric Methods in Molecular Design, ed. H. van de Waterbeemd,
VCH, Weinheim, 1995 (especially Ch 3.2, Ch 4.4, Ch 5.1, Ch 5.2).
25. S. Van Huffel and J. Vandewalle, The Total Least Squares Problem: Computational Aspects
and Analysis. SIAM, Philadelphia, PA, 1991.

Additional recommended reading


R. Gittins, Canonical Analysis. A Review with Applications in Ecology. Springer-Verlag, Berlin,
1985.
M. Tenenhaus, La Regression PLS — Theorie et Pratique. Editions Technip, Paris, 1998.

Chapter 36

Multivariate calibration

36.1 Introduction

Multivariate calibration is the collective term used for the development of a quantitative model for the reliable prediction of properties of interest (y_1, y_2, ..., y_q)
from a number of predictor variables (x_1, x_2, ..., x_p). As an example, one may think
of the spectroscopic analysis of a mixture in order to measure the concentration of
one or more of its constituents. The goal of calibration, whether multivariate or not,
is to replace a measurement of the property of interest by one that is cheaper, or
faster, or better accessible, yet sufficiently accurate. Developing the calibration
model includes stating the objective of the study, designing the experiment,
choosing the type of model, estimating its parameters and the final stage of
assessing the precision of the predictions.
Multivariate calibration is set apart from univariate calibration because it
involves more than one predictor variable (p > 1). This gives rise to new oppor-
tunities, e.g. using a complete spectrum as a predictor rather than the signal at a
single 'best' wavelength. Using the entire spectral information may, in principle,
lead to better predictions. However, it also opens the possibility that essentially
non-informative spectral regions are included in the model through chance correl-
ations in the calibration set. Having more than one predictor admits the estimation
of more than one dependent property (q > 1) and correction for undesired co-
variates (interferences). In fact, the latter prospect is a major motivation for doing
the multivariate measurements. In univariate calibration it is not possible to correct
for interferences without additional information. With additional information, i.e.
with multivariate data, the opportunity arises to separate the information relevant
for the respective properties from non-relevant variation or random noise.
The set of possible dependent properties and independent predictor variables,
i.e. the number of possible applications of predictive modelling, is virtually
boundless. A major application is in analytical chemistry, specifically the develop-
ment and application of quantitative predictive calibration models, e.g. for the
simultaneous determination of the concentrations of various analytes in a multi-
component mixture where one may choose from a large arsenal of spectroscopic
methods (e.g. UV, IR, NIR, XRF, NMR). The emerging field of process analysis,

i.e. the analysis of chemical systems or processes using multiple sensors, depends
heavily on the applicability of multivariate calibration models for the quantitative
monitoring of the systems or processes of interest. Particularly, the application of
near-infrared spectroscopy to analyze samples needing little or no pretreatment has
found wide-spread use in the chemical and food industries [1]. The technique is
also used to characterize product properties related to composition, e.g. octane
number of gasolines, iodine value of fats and oils or crystallinity of polymers.
Applications outside the field of analytical chemistry are, for example, the pre-
diction of biochemical or pharmacological properties from structural parameters
(QSAR) [2], the understanding of sensory profiles from physico-chemical data in
food research [3], and the modelling of environmental data [4].
The ultimate goal of multivariate calibration is the indirect determination of a
property of interest (y) by measuring predictor variables (X) only. Therefore, an
adequate description of the calibration data is not sufficient: the model should be
generalizable to future observations. The optimum extent to which this is possible
has to be assessed carefully: when the calibration model chosen is too simple
(underfitting) systematic errors are introduced; when it is too complex (overfitting)
large random errors may result (cf. Section 10.3.4).
In many applications the goal of predictive modelling is not a detailed under-
standing of the relation between dependent and independent variables. Ability to
interpret the model, therefore, is not a requirement per se. This should not preclude
the exploitation of available background knowledge on the problem at hand during
calibration modelling. A model that can be sensibly interpreted certainly adds
value and confidence to the calibration result.
Section 36.2 deals with the types of regression models that are commonly used
in multivariate calibration. The treatment of the subject will mostly focus on
multi-component analysis, i.e. predicting the analyte concentrations in a mixture
from some multi-channel responses. We will gradually add complexity to the
analytical chemical problem (unknown pure spectra, interferents, less samples
than channels, etc) which automatically calls for different calibration methods.
These topics are better addressed after modelling aspects have been discussed.
Section 36.3 deals with the important issue of validation and assessing the predic-
tive performance. Section 36.4 treats procedures preceding the actual modelling
stage, viz. design and data pretreatment and outlier detection. We conclude in
Section 36.5 with briefly mentioning some recent developments, such as wave-
length selection and calibration standardization. We will also discuss non-linear
modelling as an extension to the mainstream linear calibration setting.

36.2 Calibration methods

Let us assume that we have collected a set of calibration data (X, Y), where the
matrix X (n×p) contains the p > 1 predictor variables (columns) measured for each
of n samples (rows). The data matrix Y (n×q) contains the q variables which
depend on the X-data. The general model in calibration reads
Y = f(X; θ) + E_Y   (36.1)
where f is some multivariate function, θ is a vector of parameters and E_Y is the
error. We will mainly discuss linear calibration functions, i.e. f is a linear function
of the parameters. The case of non-linear regression modelling is treated for the
univariate case in Chapter 11. Non-linear multivariate modelling is briefly dealt
with in the concluding Section 36.5 on recent developments.
There are several good reasons to focus on linear models. Theory may indicate
that a linear relation is to be expected, e.g. Lambert-Beer's law of the linear
relationship between concentration and absorbance. Even when a linear relation
does not hold strictly it can be a sufficiently good local approximation. Finally, one
may try and find a transformation of the individual variables (e.g. a logarithmic
transformation), in order to obtain an acceptable linear model for the transformed
variables. Thus, we simplify eq. (36.1) to

Y = X B + E_Y   (36.2)
(n×q) (n×p) (p×q) (n×q)

where B is the matrix of regression coefficients. Each column of B contains the regression coefficients pertaining to the corresponding Y-variable, i.e. column of
Y.
So far, we have deliberately refrained from using the term 'response'. In
conventional statistical theory on calibration, the Y-variables are regarded as
random variables depending on the X-variables. The dependent Y-variables res-
pond to a change in the independent X-variables, not the other way around. In the
traditional controlled calibration experiment the X-variables are designed. They
are under full control and can be set accurately at the values prescribed by the
experimental design, e.g. calibration standards of known composition can be
prepared. In the setting of multi-component analysis, the measured response Y is
the collection of n spectra measured at q wavelengths, which depend on the
concentrations of the p absorbing analytes in the n different calibration standards.
The multivariate regression model B then would consist of the molar absorbances
for each analyte at each wavelength. In other words B would contain the pure
spectra. This is the classical approach to multicomponent analysis (cf. Section
10.7).

In an informal way one may summarize eq. (36.2) as:
spectrum = function (composition)
Given the pure spectra, or more generally given the estimated model B, it is an
easy task to predict a spectrum y_new knowing the composition x_new of a new sample.
However, our preoccupation is with the opposite application: given a newly
measured spectrum y_new, what is the most likely mixture composition x_new; and
how precise is the estimate? Thus, eq. (36.2) is necessary for a proper estimation of
the parameters B, but we have to invert the relation y = f(x) = xB into, say, x = g(y)
for the purpose of making future predictions about x (concentration) given y
(spectrum). We will treat this case of controlled calibration using classical least
squares (CLS) estimation in Section 36.2.1.
Often, it is not quite feasible to control the calibration variables at will. When
the process under study is complex, e.g. a sewage system, it is impossible to
produce realistic samples that are representative of the process and at the same
time optimally designed for calibration. Often, one may at best collect repre-
sentative samples from the population of interest and measure both the dependent
properties Y and the predictor variables X. In that case, both Y and X are random,
and one may just as well model the concentrations X, given the observed Y. This
case oi natural calibration (also known as random calibration) is compatible with
the linear regression model
X = Y B + E_X   (36.3)
It should be understood that B in eq. (36.3) differs from that in eq. (36.2), as we
shall not introduce new symbols for every possible new model. One may recapi-
tulate eq. (36.3) loosely as
composition = function (spectrum)
The model of eq. (36.3) has the considerable advantage that X, the quantity of
interest, now is treated as depending on Y. Given the model, it can be estimated
directly from Y, which is precisely what is required in future application. For this
reason one has also applied model (36.3) to the controlled calibration situation.
This case of inverse calibration via Inverse Least Squares (ILS) estimation will be
treated in Section 36.2.3 and has been treated in Section 8.2.6 for the case of simple
straight line regression.
We will see that CLS and ILS calibration modelling have limited applicability,
especially when dealing with complex situations, such as highly correlated pre-
dictors (spectra), presence of chemical or physical interferents (uncontrolled and
undesired covariates that affect the measurements), fewer samples than variables,
etc. More recently, methods such as principal components regression (PCR,
Section 17.8) and partial least squares regression (PLS, Section 35.7) have been

used to deal with such conditions. The main feature of these methods is that the
spectral data are projected onto a low-dimensional subspace of the q wavelengths
(predictors) retaining nearly all relevant information. These methods are discussed
in Sections 36.2.4 and 36.2.5. Finally, in Section 36.2.6, we briefly discuss some
other approaches to the multivariate linear calibration problem.

36.2.1 Classical least squares

From now on, we adopt a notation that reflects the chemical nature of the data,
rather than the statistical nature. Let us assume one attempts to analyze a solution
containing p components using UV-VIS transmission spectroscopy. There are n
calibration samples ('standards'), hence n spectra. The spectra are recorded at q
wavelengths ('sensors'), digitized and collected in an n×q matrix S. The infor-
mation on the known concentrations of the chemical constituents in the calibration
set is stored in an n×p matrix C. Each column of C contains the concentrations of
one of the p analytes, each row the concentrations of the analytes for a particular
calibration standard.
The calibration standards must have been well-chosen. In the simplest case it
suffices to measure the spectra of the respective single-component solutions. This
presumes that pure samples can be made, that the whole range of concentrations is
relevant and that one is convinced that the relation is linear over the entire
concentration range. Normally, not all of these conditions will be fulfilled, al-
though it may apply, for example, to the case of re-calibrating an instrument for a
system that has already been amply studied. Usually, we would have to set up a
design for measuring a number of mixtures spanning the range of interest.
Assuming Lambert-Beer's law to hold, we may write

S = C K + E_S   (36.4)
(n×q) (n×p) (p×q) (n×q)

which states that every spectrum (row of S) is a concentration-weighted summation (row of C) of the pure spectra (rows of K). Alternatively, each column j of
S, s_j, containing the intensities at wavelength j for the n calibration samples, is a
linear combination of the columns of C, the weights being given by column j of K.
Setting Y = S, X = C and B = K, shows that eq. (36.4) is the equivalent of the
multivariate linear statistical model of eq. (36.2). The pure spectra K take the role
of regression coefficients, which can be estimated by classical least-squares re-
gression of the measured calibration spectra (S) on the known calibration matrix C
(see Chapter 10 on multiple linear regression for univariate y and Section 35.2 on
multiple regression for multivariate Y):
K̂ = (C^T C)^-1 C^T S   (36.5)

TABLE 36.1
Concentrations and digitized spectra (×100) of four calibration samples (compare Fig. 36.1a)

           Analyte               Spectral channel
Sample    1    2    3      5   15   25   35   45   55   65   75   85   95

1       100   10   10     97  104  110  106   95   83   74    6   47   38
2        10  100   10     45   60   79   92  117  112  103   77   55   41
3        10   10  100     29   39   50   70   89  113  111   87   59   40
4        30   30   30     44   52   60   68   76   79   72   54   40   30

Note that eq. (36.5) is a collection of many univariate multiple regression models:
for each wavelength j the multiple regression of the corresponding spectral 'chan-
nel', i.e. s_j, on the concentration matrix C yields a vector of regression coefficients,
k_j (the jth column of K̂). For K to be estimable C^T C must be invertible, i.e. the
number of calibration standards should at least be as large as the number of
analytes. It is clearly not possible to obtain, directly or indirectly, say 3 pure
spectra from recording the spectra of just 1 or 2 standards of known composition.
In practice, the condition n > p, or more precisely rank(C) = p, is hardly a restriction.
As an example, we give in Table 36.1 data on calibration spectra at 10
wavelengths of 4 calibration standards for 3 analytes. The corresponding spectra
are shown in Fig. 36.1a at a 10-fold higher resolution. The pure spectra calculated
according to eq. (36.5), using all 100 wavelengths, are displayed in Fig. 36.1b.
In many cases, one may measure spectra of solutions of the pure components
directly, and the above estimation procedure is not needed. For the further dev-
elopment of the theory of multicomponent analysis we will therefore abandon the
hat-notation in K. Given the pure spectra, i.e. given K (p×q), one may try and
estimate the vector of concentrations c_new (p×1) of a new sample from its measured
spectrum s_new (q×1):

s_new^T = c_new^T K   (36.6)

Transposing both sides to column vectors gives a set of q linear equations in p unknowns
s_new = K^T c_new   (36.7)
(q×1) (q×p) (p×1)

The concentration vector c_new can be estimated by least-squares regression of the q absorbances (s_new) on those of the pure components (rows of K), giving


Fig. 36.1. (a) Spectra of four different calibration standards with three analytes of known composition;
(b) Extracted pure spectra for the three analytes; (c) Regression weights for estimating concentrations
of the three analytes; (d) New spectrum resolved into three spectral contributions from pure analytes.

ĉ_new = (K K^T)^-1 K s_new   (36.8)

or

ĉ_new^T = s_new^T B_CLS   (36.9)

Here, B_CLS = K^T(K K^T)^-1 is the final q×p matrix of regression coefficients for
converting a spectral measurement into concentration estimates. For K K^T (p×p) to
be invertible K should be of full rank. A first requirement for this is that p < q, i.e.
the number of chemical components should not exceed the number of wave-
lengths. Furthermore, the set of pure spectra in K should be independent, i.e. no
pure spectrum may be an exact linear combination of the other pure spectra.
Elaborating on the example we show in Fig. 36.1c the three regression models
(regression coefficients in columns of B^LS) ^^^ the three analytes. Each of these

models has the property that it gives an estimate close to zero when applied to the
pure spectra of the other two analytes. Figure 36.1d gives the (simulated) spectrum
of a supposedly unknown mixture and its reconstruction in terms of concentra-
tion-weighted pure spectra. The estimated composition (34.7, 16.0, 64.1) reflects
the true composition (35.0, 15.0, 65.0) well. The greatest error occurs for analyte 2,
which is clearly overestimated at the cost of analytes 1 and 3. Such a result is in line
with the spectral characteristics of the analytes, viz. the fact that the spectrum of
analyte 2 strongly overlaps with those of 1 and 3 (see Fig. 36.1b).
The attraction of the CLS approach is that the whole spectral domain is used for
estimating each constituent. Using redundant information has an effect equivalent
to replicated measurement and signal averaging, hence it improves the precision of
the concentration estimate. This is true if the entire spectral region contains
information with respect to the constituents, which need not be the case. A further
improvement in precision is possible by taking the error structure of S into account
using Generalized Least Squares (GLS) regression [5,6]. The error of the spectral
signal is not necessarily constant over the range of wavelengths, e.g. it may
increase with the signal itself. In this situation of heteroscedasticity a more
efficient estimator than the one given by eq. (36.9) for c_new is

ĉ_new = (K V^-1 K^T)^-1 K V^-1 s_new    (36.10)

where V (q×q) is the error covariance. It may be estimated from the matrix of residual spectra E_S:

E_S = S - C K    (36.11)

V = E_S^T E_S / (n - 1)    (36.12)

This GLS estimator is akin to inverse variance-weighted regression discussed in


Section 8.2.3. Again there is a limitation: V can be inverted only when the number
of calibration samples is larger than the number of predictor variables, i.e. spectral
wavelengths. Thus, one either has to work with a limited set of selected wave-
lengths or one must apply other solutions which have been proposed for tackling
this problem [5].
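As a minimal sketch (our own NumPy formulation, not a prescribed implementation), the ordinary and generalized least-squares concentration estimates of eqs. (36.8) and (36.10)-(36.12) could be coded as:

    import numpy as np

    def cls_concentrations(K, s_new):
        # Ordinary least-squares estimate, eq. (36.8)
        return np.linalg.solve(K @ K.T, K @ s_new)

    def gls_concentrations(K, S, C, s_new):
        # Error covariance from the calibration residuals, eqs. (36.11)-(36.12);
        # inverting V requires more calibration samples than wavelengths
        E = S - C @ K
        V = E.T @ E / (S.shape[0] - 1)
        Vinv = np.linalg.inv(V)
        # Generalized least-squares estimate, eq. (36.10)
        return np.linalg.solve(K @ Vinv @ K.T, K @ Vinv @ s_new)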
The CLS method hinges on accurately modelling the calibration spectra as a
weighted sum of the spectral contributions of the individual analytes. For this to
work the concentrations of all the constituents in the calibration set have to be
known. The implication is that constituents not of direct interest should be
modelled as well and their concentrations should be under control in the cali-
bration experiment. Unexpected constituents, physical interferents, non-linearities
of the spectral responses or interaction between the various components all
invalidate the simple additive, linear model underlying controlled calibration and
classical least squares estimation.

36.2.2 Inverse least squares

In inverse calibration one models the properties of interest as a function of the


predictors, e.g. analyte concentrations as a function of the spectrum. This reverses
the causal relationship between spectrum and chemical composition and it is
geared towards the future goal of estimating the concentrations from newly
measured spectra. Thus, we write

C = S P + E_C    (36.13)

where P (q×p) is a matrix of regression coefficients. Each column of P contains a set of q weights for a particular analyte. Multiplying the q weights of the jth column of P with the corresponding spectral intensities and summing the results yields a concentration estimate for the jth analyte.
The P-matrix is chosen to fit best, in a least-squares sense, the concentrations in
the calibration data. This is called inverse regression, since usually we fit a random
variable prone to error (y) by something we know and control exactly (x). The least-squares estimate of P is given by

P̂ = (S^T S)^-1 S^T C    (36.14)
It is only possible to invert S^T S (q×q) if the matrix of calibration spectra S is of full rank q. As a consequence, the number of calibration samples should at least be as large as the number of wavelengths, n ≥ q. This precludes the use of the full
spectrum and poses the problem of selecting the 'best' wavelengths. The variable
selection methods discussed in Chapter 10 (forward selection, stepwise regression,
all possible subsets regression; cf. Section 10.3.3) or genetic algorithms (Chapter
27) can be used to determine a good set of wavelengths. A major concern when
making this selection is the correlation between the absorbances at the various
wavelengths for the calibration set. Highly dependent predictors inflate the vari-
ance of the regression coefficients P, leading to unreliable predictions (cf. Section
10.5), particularly if the prediction samples move out of the region covered by the
calibration samples. The PCR and PLS methods provide a way out for the problem
of wavelength selection and the associated collinearity problem.
The advantage of the inverse calibration approach is that we do not have to
know all the information on possible constituents, analytes of interest and inter-
ferents alike. Nor do we need pure spectra, or enough calibration standards to
determine those. The columns of C (and P) only refer to the analytes of interest.
Thus, the method can work in principle when unknown chemical interferents are
present. It is of utmost importance then that such interferents are present in the
calibration samples. A good prediction model can only be derived from calibration
data that are representative for the samples to be measured in the future.

With this proviso a new unknown sample with the spectrum s_new can be directly translated into a concentration estimate:

ĉ_new^T = s_new^T B_ILS    (36.15)

where the matrix of prediction vectors B_ILS (q×p) has already been established in the calibration step:

B_ILS = P̂ = (S^T S)^-1 S^T C    (36.16)

It should be noted that B_ILS is just a collection of separate multiple regression models. Each column b_j of B_ILS is associated with a particular analyte and follows from a multiple regression of the corresponding column c_j of C on the predictor matrix of spectra S.
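A minimal sketch of such an ILS calibration on a subset of wavelengths might look as follows (NumPy; the index set 'selected' is assumed to come from one of the variable selection methods mentioned above, and the function names are ours):

    import numpy as np

    def ils_fit(S, C, selected):
        # Regress the concentrations on the selected spectral channels,
        # eq. (36.14); requires more samples than selected wavelengths.
        Ssel = S[:, selected]
        P = np.linalg.lstsq(Ssel, C, rcond=None)[0]   # B_ILS over the subset
        return P

    def ils_predict(s_new, P, selected):
        # Concentration estimate for a new spectrum, eq. (36.15)
        return s_new[selected] @ P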

36.2.3 Principal Components Regression

The application of principal components regression (PCR) to multivariate


calibration introduces a new element, viz. data compression through the con-
struction of a small set of new orthogonal components or factors. Henceforth, we
will mainly use the term 'factor' rather than 'component' in order to avoid
confusion with the chemical components of a mixture. The factors play an
intermediary role as regressors in the calibration process. In PCR the factors are
obtained as the principal components (PCs) from a principal component analysis
(PC A) of the predictor data, i.e. the calibration spectra S (nxp). In Chapters 17 and
31 we saw that any data matrix can be decomposed ('factored') into a product of
(object) score vectors T(nxr) and (variable) loadings P(pxr). The number of
columns in T and P is equal to the rank r of the matrix S, usually the smaller of n or
p. It is customary and advisable to do this factoring on the data after column-
centering. This allows one to write the mean-centered spectra SQ as:

= t,pj +t2 p j +... + t , p ; (36.17)


where !„ is a nxl vector of ones and s (pxl) is the average spectrum over all
calibration spectra. The PC score vectors t^ (a = 1,2,.., r), i.e. the columns of T, are
arranged in order of diminishing importance, i.e. descending variance. We will
choose to normalize the loadings p^ so that the importance of each dimension is
reflected in the length (norm) of the corresponding score vector. This corresponds
to the choice a = 1 and (3 = 0, as discussed in Chapter 31, when factoring a data
matrix. The scores T are obtained as weighted sums of the original columns of SQ.
The weights are given by the eigenvectors W (qxr) of the cross-product matrix
S J SQ associated with the r positive eigenvalues:

T = S_0 W    (36.18)
In PCA, the weights W and the normalized loadings P are identical (cf. Chapters 31 and 35). For any other data decomposition, e.g. PLS, weights and loadings
differ.
The main idea of PCA is to approximate the original data, spectra in our case, by
a small number (A < r) of factors, discarding the minor factors:

S ≈ 1_n s̄^T + T* P*^T    (36.19)

The suffix * in T* (n×A) and P* (q×A) indicates that only the first A columns of T
and P are used, A being much smaller than n and q. In principal component
regression we use the PC scores as regressors for the concentrations. Thus, we
apply inverse calibration of the property of interest on the selected set of factor
scores:
C = 1_n c̄^T + T* Q* + E_C    (36.20)

The regression coefficients Q* follow immediately by least-squares estimation

Q* = (T*^T T*)^-1 T*^T C_0    (36.21)

where C_0 = C - 1_n c̄^T is the matrix of concentrations, expressed as deviations from the mean composition c̄. Substituting eq. (36.18) into eq. (36.20) gives

C = 1_n c̄^T + S_0 W* Q*    (36.22)

Prediction of a new sample with spectrum s_new is now straightforward

ĉ_new^T = c̄^T + (s_new - s̄)^T B_PCR    (36.23)

where

B_PCR = W* Q*    (36.24)
Notice that the matrix Q* represents a multivariate regression model. Each column
of Q* contains a multiple regression model of the (mean-centered) concentrations
of an analyte on the orthogonal regressors in T*. Because of this orthogonality,
T*^T T* is diagonal and its inversion is trivial (viz. inverting the individual diagonal elements). Thus, the calculation of Q* and B_PCR is a simple matter once the PCA, giving T and W, is done. One should also appreciate that eq. (36.23) has the form of a multiple regression model, with the regression coefficients B = B_PCR having been computed via the detour of PCR, since S^T S itself is not invertible.
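For a single analyte the whole PCR calculation can be sketched in a few lines of NumPy (our own formulation, via the singular value decomposition of the centered spectra; function names are ours):

    import numpy as np

    def pcr_fit(S, c, A):
        # Mean-center spectra and concentrations
        s_mean, c_mean = S.mean(axis=0), c.mean()
        S0, c0 = S - s_mean, c - c_mean
        # PCA of the centered spectra; keep the first A factors
        U, d, Vt = np.linalg.svd(S0, full_matrices=False)
        W = Vt[:A].T                                  # weights/loadings (q x A)
        T = S0 @ W                                    # factor scores, eq. (36.18)
        q_coef = np.linalg.solve(T.T @ T, T.T @ c0)   # eq. (36.21); T'T is diagonal
        b = W @ q_coef                                # b_PCR, eq. (36.24)
        return s_mean, c_mean, b

    def pcr_predict(s_new, s_mean, c_mean, b):
        # Prediction for a new (row of) spectrum(s), eq. (36.23)
        return c_mean + (s_new - s_mean) @ b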
PCR combines aspects of both CLS and ILS. In common with ILS it is based on
the direct calibration of the property of interest from the multivariate predictor,
irrespective of the direction of the causal relation. Contrary to ILS and in common
with CLS it can use all predictor information even when there are many more
Fig. 36.2. Raw data: near-infrared spectra of a food product with different X-content.

predictors than samples. There are no problems of collinearity since the regressors
in T* are uncorrelated. There is no problem of having too many variables (q > n),
since we have made the step from the q original variables, possibly hundreds of
wavelengths, to just a few PCs (dimension reduction). There is no problem of
selecting predictor variables: all are represented in the PCs via the weights in W*
that define T*. The only problem is choosing A, the number of PCs to be used for
the calibration model. This important decision is given separate attention in
Section 36.3.
As an example we give in Fig. 36.2 a set of near-infrared calibration spectra of a
food product along with the varying content of a major component that we must
call X for proprietary reasons. The food product is measured as such in reflection
mode. We see that there is little variation in the spectra. Figure 36.3 gives the
mean spectrum (A) and the 'spectrum' of standard deviations (B). Comparing A
and B, we observe bands at roughly the same wavelengths, e.g. at wavelengths
1450 and 1880, indicating that the variation in intensity is largest at the peaks.
Figure 36.4 shows the loadings for the major 4 PCs (cf. Section 17.6.2). Notice
that a loadings vector has an entry for each wavelength, so it can be drawn as a
spectrum-like graph. One observes that the loadings plot of PC1 more or less reflects the standard deviation spectrum (Fig. 36.3b), and that practically all individual loadings are positive. The implication is that PC1 gives positive weight

Fig. 36.3. Summary of spectral information: (a) average spectrum, (b) standard deviation spectrum.

to each wavelength and that the resulting PC1 scores more or less reflect the total intensity of the spectra (cf. Section 31.3 where we also found that the first PC
represents overall size when the measurements are positively correlated). The
loadings of PC2, PC3, etc. show more complex patterns. Negative parts enter since
the different loadings spectra are orthogonal, i.e. the inner products of the loading
vectors are zero. One may also interpret the loadings spectra of the higher PCs in
terms of contrasts between different spectral regions. Plots such as Fig. 36.4 are
helpful in interpreting the nature of the PCs from a spectroscopic point of view.
They also can aid in detecting artefacts in the data. For example, a spike in one of
the spectra will show up as a spike in one or more of the loadings plots at the
'faulty' wavelength.
Figure 36.5 shows the scores plots for PC2 v. PC1 (A) and PC4 v. PC3 (B). Such
plots are useful in indicating a possible clustering of samples in subsets or the
presence of influential observations. Again, a spectrum with a spike may show up
as an outlier for that sample in one of the scores plots. If outliers are indicated, one
should try and identify the cause of the outlying behaviour. Only when a satis-
factory explanation is found can the outlier be safely omitted. In practice, one will

Fig. 36.4. Spectral loadings for the first four principal components.

also remove the outlier solely on the basis of its extreme deviation. The most
difficult cases arise when suspect outliers are only mildly deviating and there is no
good explanation for the outlying behaviour. In our example, there are no such
remarkable outlying features in the score plots. Robust methods for PCA and PCR
have been developed which are less vulnerable to deviating observations (cf.
Section 36.4.3).
Finally, one may plot the X-content against any of the PC scores (Fig. 36.6). In
this case we observe a relationship of the amount of X with PC2 and with PC3. The
amount of spectral variance explained by the PCs is shown in Fig. 36.7. It would
appear that the first four PCs account for practically all variation (99.2 %). Thus, a
model with A=4 PCs will capture most of the spectral variation and, hopefully,
most of the correlation with the X-content. The estimated model (cf. eq. 36.20) is

50.3 + 1.53 t_1 + 10.7 t_2 - 19.1 t_3 + 1.2 t_4    (36.25)



Fig. 36.5. Score plot of the samples in PC space: (a) scatterplot of PC2-scores vs. PCl-scores, (b)
scatterplot of PC4-scores vs. PC3-scores.

where the constant term represents the average X-content in the calibration set, 50.3, and the regression coefficients correspond to the slopes in Fig. 36.6. If we multiply the weights (= loadings) given in Fig. 36.4 with these corresponding slopes and add up the result we arrive at the regression model shown in Fig. 36.8. This is the graphical version of eq. (36.24). It gives the weights b_PCR in the model ĉ = c̄ 1_n + S_0 b_PCR, which describes how to convert spectral measurements into best fitting estimates of the X-content, based on 4 PCs as an
intermediary. Figure 36.9 gives the fitted versus the observed X-content.
We chose the number of PCs in the PCR calibration model rather casually. It is,
however, one of the most consequential decisions to be made during modelling.
One should take great care not to overfit, i.e. using too many PCs. When all PCs
are used one can fit exactly all measured X-contents in the calibration set. Perfect
as it may look, it is disastrous for future prediction. All random errors in the
calibration set and all interfering phenomena have been described exactly for the
calibration set and have become part of the 'predictive' model. However, all one
needs is a description of the systematic variation in the calibration data, not the

Fig. 36.6. Scatter plot of dependent variable (X-content) versus the first four principal components.

random variation. As a model becomes more complex, e.g. using more PCs in PCR
calibration, the noise term starts to dominate the added systematic contribution.
Procedures to establish the optimum complexity, i.e. choosing the best-predictive
calibration model from a range of increasingly complex models, are given in
Section 36.3.
Thus far we have discussed the traditional method of applying PCR, where the
PCs are chosen in order of decreasing variance (top-down approach). Another
approach is to calculate the correlation of each PC with y and to enter the PCs in
order of their decreasing squared correlation. Since the PC scores are uncorrelated,
this is equivalent to applying multiple regression with variable selection on the set
of PCs using forward selection (cf. Section 10.3.3). Several studies, e.g. Refs.
[7,8], have shown that, with respect to prediction accuracy, this correlation-PCR
approach performs no worse than, and often better than, the traditional top-down approach. Correlation-PCR often gives simpler models with a smaller number of PCs
than top-down PCR. The added advantage of such parsimonious models is that
they are easier to interpret.
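A minimal sketch of the ordering step in correlation-PCR (our own NumPy formulation) is:

    import numpy as np

    def correlation_pcr_order(S, y):
        # All PC scores of the mean-centered spectra
        S0 = S - S.mean(axis=0)
        y0 = y - y.mean()
        U, d, Vt = np.linalg.svd(S0, full_matrices=False)
        T = U * d
        # Squared correlation of every PC score vector with y
        r2 = (T.T @ y0) ** 2 / (np.sum(T ** 2, axis=0) * np.sum(y0 ** 2))
        # PCs ranked by decreasing squared correlation instead of variance
        return np.argsort(r2)[::-1]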

Fig. 36.7. Percentage variance of X-content explained by the principal components from spectral data.
Individual percentages (bars) are shown as well as cumulative percentages (circles).


Fig. 36.8. Regression coefficients obtained from PCR model.
Fig. 36.9. Final calibration showing fitted versus measured values.

36.2.4 Partial least squares regression

The procedure for PLS calibration is very similar to PCR calibration. It also
proceeds via a small set of orthogonal factors constructed from the predictor
variables. The main difference with PCR lies in the way the factors are determined.
This has been discussed in Chapter 35. In PLS regression the factor is not solely
determined by the spread (variance) of the predictor data; the correlation of the
PLS factor with the variable to be predicted also plays a role. A high-variance
factor that is not at all correlated to the dependent property is given less weight by
PLS from the outset. In PCR, such a factor at first seems important on account of its
high variance. However, it is downweighted in the final model through the small
regression coefficient q_a for that factor (see, for example, the low coefficient for PC1 in the PCR regression model of eq. (36.25)). Since in PLS regression such
factors are avoided, PLS models often have less factors than PCR models built on
the same data. This aspect of parsimony is sometimes seen as an advantage of PLS
over traditional PCR. Correlation-PCR behaves like PLS in this respect. Another
difference between PLS and PCR is that the loadings P, even when normalized to
unit length, differ from the weights W. Qualitatively, however, e.g. with regard to
the sign pattern, the difference often is not large. Usually one plots the loadings
since they have a simpler interpretation. In Chapter 35 we showed that the loadings
P are the regression coefficients obtained from regressing the predictors (spectral
signals at each wavelength) on the PLS factor scores. One may consider the
loadings (columns of P) as abstract spectra from which the measured spectra can
be reconstructed. A high loading p_jk implies that the spectral region around the jth wavelength has a strong contribution from the kth factor.
PLS calibration of a multicomponent system can be performed in two different
ways. One may do a separate regression for each analyte. Such univariate (in y) regressions are called PLS1 regressions. One may also model the various analytes
collectively in one and the same multivariate PLS2 regression model. Which of
these two approaches should be chosen? The use of PLS2 regression has a few
advantages. Firstly, there is one common set of PLS factors T for all analytes. This
simplifies interpretation and enables a simultaneous graphical inspection. Secondly,
when the analyte concentrations are strongly correlated one may expect on theo-
retical grounds that the PLS2 model is more robust than separate PLS1 models.
This is especially true when the Y matrix is closed, as for compositional data.
Finally, when the number of analytes is large the development of a single PLS2
model is done much more quickly than the development of many separate PLS1 models. Practical experience, however, indicates that PLS1 calibration usually performs equally well or better in terms of predictive accuracy. Thus, when the ultimate requirement of the calibration study is to enable the best possible predictions, a separate PLS1 regression for each analyte is advised.
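A minimal sketch of a PLS1 calibration for one analyte, using the classical NIPALS-type deflation discussed in Chapter 35 (our own NumPy formulation, not the only possible algorithm; function names are ours), is:

    import numpy as np

    def pls1_fit(X, y, A):
        x_mean, y_mean = X.mean(axis=0), y.mean()
        E, f = X - x_mean, y - y_mean
        Ws, Ps, qs = [], [], []
        for _ in range(A):
            w = E.T @ f
            w /= np.linalg.norm(w)           # weight vector
            t = E @ w                        # factor scores
            tt = t @ t
            p = E.T @ t / tt                 # loadings
            qa = f @ t / tt                  # inner regression coefficient
            E -= np.outer(t, p)              # deflate the spectra
            f -= qa * t                      # deflate the response
            Ws.append(w); Ps.append(p); qs.append(qa)
        W, P, q_vec = np.array(Ws).T, np.array(Ps).T, np.array(qs)
        # Regression vector expressed in the original wavelength space
        b = W @ np.linalg.solve(P.T @ W, q_vec)
        return x_mean, y_mean, b

    def pls1_predict(x_new, x_mean, y_mean, b):
        return y_mean + (x_new - x_mean) @ b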

36.2.5 Other linear methods

There are many more methods that are well suited for calibration of collinear
data. Latent root regression [9,10], principal covariates regression [11], total least
squares [12], and reduced rank regression [13] are related to PCR and PLS in that a
calibration model is based on orthogonal factors derived from the predictors
(spectra). However, a comparative evaluation of these techniques has not yet been
published. Continuum regression [14,15] provides a continuum of models which
can be viewed as interpolations between three 'anchor' models, viz. PCR, PLS
(more specifically SIMPLS, Section 35.7.4), and MLR (or reduced rank re-
gression). It still has to be established whether there is a need for a continuum of
models filling the gaps between three anchors. Ridge regression is a technique that
has been specially devised for dealing with multi-collinear data (cf. Section 10.6). It replaces the OLS regression estimator (X^T X)^-1 X^T y by (X^T X + kI)^-1 X^T y. The addition of the constant k, the ridge parameter, to the diagonal of X^T X has the effect of
artificially decreasing the dependence among the X-predictors. As a result it leads
to regression coefficients that are slightly biased towards the low side (shrinkage
property). This small systematic error is, however, more than compensated by the
beneficial effect of the greatly reduced variance of the estimates. The value of k
can be identified from the graph of the regression parameters as a function of k,
namely as the value where the regression parameters start to stabilize. In a
comparative simulation study it has been shown that ridge regression performs as
well as PCR or PLS, all of them outperforming MLR with forward variable
selection [16]. Other comparative studies also have shown that often the results
obtained with various prediction methods are very similar. When each method is
applied carefully it turns out that there is no overall superior technique. Thus, it is
better to get well acquainted with one or two methods and to apply that method in a
professional way, rather than applying different techniques which are known only
superficially. Even multiple linear regression (MLR, Chapter 10) has regained
some of its lost respectability in the field of multivariate calibration. In combina-
tion with modern wavelength selection procedures (Section 36.5.1) and a safeguard against overfitting, well-performing models can be derived that are based on a
small number of wavelengths.
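As an illustration, the ridge estimator quoted above can be sketched in a few lines of NumPy (our own formulation; the choice of k is left to a ridge trace or cross-validation):

    import numpy as np

    def ridge_coefficients(X, y, k):
        # b = (X'X + kI)^-1 X'y; k > 0 is the ridge parameter
        p = X.shape[1]
        return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)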

Generalized standard addition method


In Section 8.2.8 we have discussed the standard addition method as a means to
quantitate an analyte in the presence of unknown matrix effects (cf. Section 13.9).
While the matrix effect is corrected for, the presence of other analytes may still
interfere with the analysis. The method can be generalized, however, to the
simultaneous analysis of p analytes. Multiple standard additions are applied in
order to determine the analytes of interest using many (q > p) analytical sensors. It

is assumed that the responses are a linear function of the analyte concentrations.
The calibration model reads
R = (1_n c_0^T + ΔC) K + E    (36.26)

where R (n×q) is the set of responses measured on the calibration samples according to the 'design' matrix ΔC (n×p), containing the added concentrations. The unknown concentration is given by the vector c_0 (p×1) and K (p×q) is the unknown matrix of sensitivities of the q responses with respect to the p different analytes. One may eliminate the unknown concentration vector c_0 by subtracting the response vector r_0 corresponding to the original sample, without standard additions, from each row of the response matrix R, giving ΔR, and at the same time subtracting the unknown c_0^T from the matrix (1_n c_0^T + ΔC). The sensitivities K can be estimated from this reduced system of equations using multiple linear regression of the corrected responses (ΔR) on the multiple standard additions (ΔC). Given the estimate

K̂ = ((ΔC)^T (ΔC))^-1 (ΔC)^T ΔR

one may estimate the unknown concentration as:

ĉ_0 = (K̂ K̂^T)^-1 K̂ r_0    (36.27)
More efficient estimation methods exist than the simple method described here
[17]. The generalized standard addition method (GSAM) shares the strong points (e.g. correction for interferences) and weak points (e.g. error amplification because
of the extrapolation involved) of the simple standard addition method [18].
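A minimal sketch of this GSAM calculation (our own NumPy formulation; more efficient estimators exist, as noted above) is:

    import numpy as np

    def gsam(R, dC, r0):
        # R (n x q): responses after the standard additions dC (n x p);
        # r0 (q,): response of the original sample without additions.
        dR = R - r0                                  # corrected responses
        K = np.linalg.solve(dC.T @ dC, dC.T @ dR)    # estimated sensitivities (p x q)
        c0 = np.linalg.solve(K @ K.T, K @ r0)        # unknown concentrations, eq. (36.27)
        return K, c0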

36.3 Validation

Many choices have to be made before a calibration model can be developed.


First one has to choose between the methods discussed in Section 36.2. Each of
these methods involves the choice of the dimensionality A, a 'meta-parameter' that
may greatly affect the predictive performance of the calibration model. In ILS
calibration one has the problem of selecting a limited number of predictive
wavelengths. If one chooses the method of Brown, Denham and Spiegelman for
wavelength selection (cf. Section 36.5.1) one still has to choose a confidence level α which governs the number of wavelength channels. A way to make such choices
is to build the models from the calibration or training set and see which of the
models gives the best predictions on a new set of data, the test set. The obvious
criterion for assessment is the average size of the prediction error, as expressed by
the prediction error sum-of-squares, PRESS ([19], Section 10.3.4), or the root-
mean-square (rms) value of the prediction error, RMSPE.

PRESS = Σ_i (ĉ_i - c_i)^2    (36.28)

RMSPE = (PRESS/n_t)^(1/2)    (36.29)

The summation in eq. (36.28) extends over all n_t samples in the test set. Among the
various model options one chooses the one having minimum PRESS. It is still
common usage to regard the model thus chosen as a validated model and RMSPE
as a proper estimate of future prediction errors. However, the procedure described
above only helped to choose a final model and the prediction error RMSPE
estimated in this manner will be optimistically biased. The so-called test set played
the role of a second training set for estimating the meta parameter(s). It is therefore
better to refer to this second dataset as the monitoring set, used to fine tune the
metaparameter(s). Given the chosen model one should establish its performance
by testing the model on a new independent set of data. The RMSPE value obtained
for this new set of data, truly the test set, gives a fair impression of the predictive
capability of the model provided that the training data, monitoring data and test
data are randomly sampled from the same population. Should the test result be
disappointing and lead to amendments of the model, this new calibration model
should be tested against wholly new data, etc.
In practice the strategy of employing separate sets of calibration (training) and
validation (test) data often cannot be applied as it requires a large number of
samples. It is quite common that the number of samples is limited and resort is
taken to resampling methods or internal cross-validation (cf. Sections 10.3.4 and
33.4). Here, one sets aside part of the data and builds a model with the rest of the
data. The data not included are then used for assessing the PRESS dependence.
This process is repeated for various splits of the original data such that all samples
have been left out once. The PRESS values are accumulated over the various splits.
At the end one chooses the model having minimum overall PRESS. Perhaps, the
most common approach is to perform n calibration steps leaving out one obser-
vation at a time (leave-one-out procedure, LOO). As an example, Fig. 36.10 shows
the cross-validation RMSPE result for (traditional) PCR and PLS models of the
X-content. For the PLS model the minimum RMSPE (1.8 %) occurs at a dimen-
sionality A = 5. For PCR the minimum lies at A = 10 (RMSPE = 2.0%). This
minimum is very shallow and one might trade in a few factors for a simpler, and
probably more robust, model with about the same prediction error (A = 8, RMSPE
= 2.2%). In all, the model choice in this example is fairly clear-cut: PLS regression
with 5 factors. If one is willing to accept a somewhat larger prediction error
(RMSPE = 2.5%) a parsimonious 3-factor PLS model suffices. Since one knows
that the minimum RMSPE value is optimistically biased it is good advice to prefer
simpler models with slightly higher RMSPE values. The one-factor model for PCR
performs no better than a zero-factor model, i.e. using no spectral information at

Fig. 36.10. Prediction error (RMSPE) as a function of model complexity (number of factors) obtained
from leave-one-out cross-validation using PCR (o) and PLS (*) regression.

all. Apparently, the first PC-factor captures an important source of spectral vari-
ation that has no predictive value.
Leaving out one object at a time represents only a small perturbation to the data
when the number (n) of observations is not too low. The popular LOO procedure
has a tendency to lead to overfitting, giving models that have too many factors and
a RMSPE that is optimistically biased. Another approach is k-fold cross-validation
where one applies k calibration steps (5 < k < 15), each time setting a different
subset of (approximately) n/k samples aside. For example, with a total of 58
samples one may form 8 subsets (2 subsets of 8 samples and 6 of 7), each subset
tested with a model derived from the remaining 50 or 51 samples. In principle, one may repeat this k-fold cross-validation a number of times using a different splitting
[20].
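A minimal sketch of such a k-fold cross-validation of RMSPE as a function of model complexity (our own NumPy formulation; 'fit' and 'predict' stand for any calibration routine with the calling convention of the earlier pcr_fit/pcr_predict or pls1_fit/pls1_predict sketches) is:

    import numpy as np

    def cv_rmspe(X, y, A_max, k, fit, predict, seed=0):
        n = len(y)
        idx = np.random.default_rng(seed).permutation(n)
        folds = np.array_split(idx, k)
        press = np.zeros(A_max + 1)
        for test in folds:
            train = np.setdiff1d(idx, test)
            # Zero-factor model: predict the training mean
            press[0] += np.sum((y[test] - y[train].mean()) ** 2)
            for A in range(1, A_max + 1):
                model = fit(X[train], y[train], A)
                y_hat = predict(X[test], *model)
                press[A] += np.sum((y[test] - y_hat) ** 2)
        # RMSPE per model complexity; each sample is left out exactly once
        return np.sqrt(press / n)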
Van der Voet [21] advocates the use of a randomization test (cf. Section 12.3) to
choose among different models. Under the hypothesis of equivalent prediction
performance of two models, A and B, the errors obtained with these two models
come from one and the same distribution. It is then allowed to exchange the
observed errors, e_Ai and e_Bi, for the ith sample that are associated with the two
models. In the randomization test this is actually done in half of the cases. For each
object / the two residuals are swapped or not, each with a probability 0.5. Thus, for
all objects in the calibration set about half will retain the original residuals, for the
other half they are exchanged. One now computes the error sum of squares for each
of the two sets of residuals, and from that the ratio F = SSE_A/SSE_B. Repeating the
process some 100-200 times yields a distribution of such F-ratios, which serves as
a reference distribution for the actually observed F-ratio. When for instance the
observed ratio lies in the extreme higher tail of the simulated distribution one may

be confident that model B is significantly better than model A. This randomization


test is generally applicable to choosing among different models, hence it can be
used to determine e.g. the lowest complexity of PCR or PLS models that is
significantly superior to simpler models.
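A minimal sketch of this randomization test (our own NumPy formulation) is:

    import numpy as np

    def randomization_test(eA, eB, n_perm=200, seed=0):
        # eA, eB: prediction errors of models A and B for the same samples
        rng = np.random.default_rng(seed)
        f_obs = np.sum(eA ** 2) / np.sum(eB ** 2)
        f_ref = np.empty(n_perm)
        for i in range(n_perm):
            swap = rng.random(len(eA)) < 0.5          # swap each pair with prob. 0.5
            a = np.where(swap, eB, eA)
            b = np.where(swap, eA, eB)
            f_ref[i] = np.sum(a ** 2) / np.sum(b ** 2)
        # Fraction of the reference distribution at least as extreme as f_obs;
        # a small value suggests model B predicts significantly better than A
        p_value = np.mean(f_ref >= f_obs)
        return f_obs, p_value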
Another approach to model validation is the application of the bootstrap. Here,
we only give a short and qualitative description. For more details about the
bootstrap methodology in general, see [22]; for a recent application to multivariate calibration, see [23]. The analogy with cross-validation is that the observed data are
resampled many times. The difference is that one does not split the data into two
subsets, but applies resampling with replacement. As an example, suppose there
are 100 objects in the calibration set. This is regarded as a population. One may
draw 100 samples from this population with replacement, i.e. some objects will not
be selected at all, many only once, some twice or more. With this artificial data set
one builds a model. This process is repeated many times, hence it is computer
intensive. For each model one computes a quantity of interest, e.g. a prediction
error. From the distribution of these prediction errors one may derive an average
value as well as a measure of its uncertainty. By monitoring the average prediction
error as a function of the number of PCR or PLS factors one may determine an
optimal model complexity. This average prediction error can be used to compute
confidence limits around future predictions.
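One possible variant of this bootstrap scheme, using the objects not drawn in a resample as an internal test set, can be sketched as follows (our own NumPy formulation):

    import numpy as np

    def bootstrap_rmspe(X, y, A, fit, predict, n_boot=200, seed=0):
        rng = np.random.default_rng(seed)
        n, errors = len(y), []
        for _ in range(n_boot):
            sample = rng.integers(0, n, size=n)        # draw n objects with replacement
            oob = np.setdiff1d(np.arange(n), sample)   # objects not drawn this round
            if len(oob) == 0:
                continue
            model = fit(X[sample], y[sample], A)
            resid = y[oob] - predict(X[oob], *model)
            errors.append(np.sqrt(np.mean(resid ** 2)))
        # Average prediction error and a measure of its uncertainty
        return np.mean(errors), np.std(errors)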

36.4 Other aspects

36.4.1 Calibration design

It should be appreciated that classical theory of design of experiments (Chapters


21-26), based on the linear model estimated by least squares regression, cannot be
applied directly to the problem of multivariate calibration. One reason is that one
may not know precisely which (interfering) factors are at play, let alone that one
could control these. Another reason is that one may regard the spectral space as the
space to position the chemical samples in. The number of dimensions (wave-
lengths) is generally much larger than the number of experimental units (chemical
samples) and the linear model cannot be estimated by ordinary least squares
regression.
There are two points of view to take into account when setting up a training set
for developing a predictive multivariate calibration model. One viewpoint is that
the calibration set should be representative for the population for which future
predictions are to be made. This will generally lead to a distribution of objects in
experimental space that has a higher density towards the center, tailing out to the
boundaries. Another consideration is that it is better to spread the samples more or

less evenly over the experimental region. One way to generate such a space filling
design is by applying the Kennard-Stone algorithm (cf. Section 24.4.2). Generally,
this will put less weight on the center of the experimental region and more on the
extremes which should give more precise estimates. Fearn [24] gives an interesting
discussion on the two choices, the main message being that when the number of
calibration objects is limited the latter choice may be preferable. One is not always
in a position to choose the samples. In that case it is never wise to discard samples
from a given calibration set for the sole sake of making the distribution of
calibration objects in the region of interest more uniform.
When a linear model does not hold over the full range of calibration samples
there are two options. One may apply a nonlinear regression method to the
complete set of data (cf. Section 36.5.3) or one may split the experimental region
into smaller subregions and estimate a separate linear model for each. Naes and
Isaksson [25] use fuzzy clustering for splitting the training sets into smaller subsets
with improved linearity in each group.

36.4.2 Data pretreatment

Data pretreatment is an important issue. Proper preprocessing of the data can be


very instrumental in developing better predictive models. Although it is true that
the modelling process in multivariate calibration may accommodate inter-
ferences and irrelevant artefacts, careful data preprocessing often turns out to be
more effective [26]. General guidelines on how to preprocess data are hard to give
since this depends very much on the specific application at hand (e.g. which type of
data? spectroscopic? which techniques?) and on the nature of the samples in
question. It goes without saying that any pretreatment of the data has to be applied
in an identical manner to the calibration data, the test data and future new data.
A very basic form of pretreatment is (column) mean centering. It corresponds to
modelling the variation around the mean, i.e. the deviation from the mean response
is directly related to deviations from the mean for the predictors. Mean centering is
so common that it is often not even considered as a form of data pretreatment.
Without further action this may give the variables with a high variance an undue
influence on the model, except for MLR which is not sensitive to such scaling.
Autoscaling is a form of pretreatment that is recommendable when the predictor
variables are of a different nature and not measured on the same scale. Stan-
dardizing all variables to the same variance can be seen as a democratic manoeuvre
giving all variables equal chance to influence the model.
With spectroscopic data a popular form of data pretreatment is to correct for
varying baseline slopes by regressing each spectrum against wavelength number
and continuing the calibration with residuals. Quadratic regression has also been
used as a means for detrending spectra. Pre-smoothing may be applied to the

spectral data to get rid of uncontrolled random noise. Various options are available
to implement such smoothing (cf. Chapter 40): moving window (box car) aver-
aging, Fourier filtering, Savitzky-Golay smoothing. A side effect of smoothing is
that the spectral resolution may be lost.
A very common pre-treatment of spectral data is to convert spectra to first- (or
second)- derivative form [27]. This has the effect of removing any offset and
constant slope (or curvature). Applying second derivatives has the advantage of
sharpening peaks and resolving overlapping bands to some extent, although it also
introduces spurious satellite peaks. Derivatization in general amplifies the noise in
the data (Section 40.5.5). As a remedy to the latter drawback one may carry out the
derivatization in combination with some degree of smoothing, for example, em-
ploying Savitzky-Golay filtering (Section 40.5.2.3).
An effective preprocessing method is the use of standard normal variates
(SNV). This type of standardization boils down to considering each spectrum x^ as
a set of q observations and calculating their z-scores:
z_i = (x_i - x̄) / s    (36.30)

It has the effect of removing an overall offset by subtracting the mean spectral reading x̄ and it corrects for differences affecting the overall variation. In various
settings it has been found to be an effective preprocessing method.
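Applied to a spectral data matrix with one spectrum per row, the SNV transform of eq. (36.30) can be sketched as follows (our own NumPy formulation):

    import numpy as np

    def snv(X):
        # Center and scale every spectrum by its own mean and standard deviation
        mean = X.mean(axis=1, keepdims=True)
        std = X.std(axis=1, ddof=1, keepdims=True)
        return (X - mean) / std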
Another popular form of data pre-processing with near-infrared data is the
application of the Multiplicative Scatter Correction (MSC, [28]). It is well known
that particle size distribution of non-homogeneous powders has an overall effect
on the spectrum, raising all intensities as the average particle size increases.
Individual spectra x_i are approximated by a general offset plus a multiple of a reference spectrum, z:

x_i = a_i + b_i z + e_i    (36.31)

The offset a_i and the multiplication constant b_i are estimated by simple linear regression of the ith individual spectrum on the reference spectrum z. For the latter one may take the average of all spectra. The deviation e_i from this fit carries the unique information. This deviation, after division by the multiplication constant, is
used in the subsequent multivariate calibration. For the above correction it is not
mandatory to use the entire spectral region. In fact, it is better to compute the offset
and the slope from those parts of the wavelength range that contain no relevant
chemical information. However, this requires spectroscopic knowledge that is not
always available.
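A minimal sketch of MSC as described above (our own NumPy formulation; the mean spectrum serves as reference and the full wavelength range is used) is:

    import numpy as np

    def msc(X, z=None):
        if z is None:
            z = X.mean(axis=0)                 # reference spectrum
        Xc = np.empty_like(X, dtype=float)
        for i, x in enumerate(X):
            b, a = np.polyfit(z, x, 1)         # multiplication constant and offset, eq. (36.31)
            Xc[i] = (x - a - b * z) / b        # residual divided by the slope
        return Xc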
A special type of data pre-treatment is the transformation of data into a smaller
number of new variables. Principal components analysis is a natural example and
we have treated it in Section 36.2.3 as PCR. Another way to summarize a spectrum
in a few terms is through Fourier analysis. McClure [29] has shown how a NIR

spectrum recorded at 1700 channels can be well approximated by a Fourier series


comprising only 100 terms. The error amounted to less than 0.01 %, yet the spectra
can be compressed and stored in 6% of the original space. A calibration model
based on some 10 selected Fourier terms proved to give results that were superior
to using the full raw data.

36.4.3 Outliers

The presence of outliers may have a detrimental effect on the quality of a


calibration model. Therefore, the identification of outliers is an important part of
the modelling process. Outliers can come in different guises. One speaks of high
leverage observations when the predictor data for a calibration object deviate
strongly from the rest. Such outliers in X-space may fit well to the model ('good' outliers) or not ('bad' outliers). When the predictor data are not abnormal for an object, but the object fits poorly to the model, then one speaks of a high residual observation (outlier in the y-direction). Another class is formed by the influential
observations. These are observations that have a demonstrably large impact on the
model estimates. When such observations are discarded from the calibration set a
significantly different model with different predictions is obtained.
How do we identify outliers and how should we treat them once found?
Basically there are two approaches: either apply diagnostics for detecting outliers
or use robust estimation methods. Which diagnostics to use depends on the
regression method employed, but some tools are universally applicable. It is
always useful to make residual plots. These may reveal strongly deviating observations or some remaining structure that should not appear in a good residual plot. A common way of inspecting the data is to draw a PCA score plot which may show samples deviating from the bulk of the samples. Formally, one can compute the Mahalanobis distance or the leverage (cf. Sections 8.2.6 and 20.7), and use that as an indication of outlying behaviour. Similarly, a loading plot may reveal wavelengths
which are showing deviating behaviour. One may also use the PLS scores and
loadings during the modelling stage.
With MLR one can use such measures as Cook's distance (cf. Section 10.9) or
variance inflation factor (cf. Section 10.5) for influential observations or variables.
A complicating factor with the diagnosis of outliers is the phenomenon of masking.
This refers to the fact that an individual observation may not be recognized as
outlying, because it is part of a cluster of outlying objects. Only when the complete
cluster of deviating outliers is removed from the calibration set, one recognizes
their severe influence on the calibration model [30].
Identification of outliers is not a straightforward process. Even when obser-
vations have been diagnosed as outlying, one should not automatically discard
these, certainly not when the evidence is not overwhelming. Ideally, one should

always try to find additional physical or chemical evidence that something is


'wrong' with such samples before deciding to remove them.
An alternative to the detection and removal of outliers is to employ robust
regression methods (cf. Sections 12.1.5 and 12.3). With such methods outlying
objects are automatically identified and downweighted in the regression model-
ling. Robust modifications of popular multivariate regression procedures have
been reported for PCR [31] and PLS [32]. Walczak [33] has described a method,
the evolution program (EP), where a clean subset of the data is generated and the
remaining observations are tested relative to this clean subset. As the name implies
the method is based on the idea of natural evolution similar to the better-known
genetic algorithm. The EP approach makes it possible to build robust models in the presence of multiple multivariate outliers and can be applied usefully in combination with PCR and PLS regression. Other robust methods for identifying multivariate outliers are given in e.g. Refs. [34,35].

36.5 New developments

36.5,1 Feature selection

Brown et al. [36] published an interesting, simple method for selecting wave-
lengths in NIR calibration. The method boils down to ranking the wavelengths in
order of diminishing squared correlation (R^2) with the analyte concentration. The model is then built using the first m wavelengths, j = 1, ..., m, from this ordered list, as a weighted average of the m simple regression models corresponding to these wavelengths. The idea is to minimize the confidence interval for future predictions. As m increases, so does the total signal-to-noise ratio (the sum of the F-ratios or t-values for the m separate, simple regressions), but so does the model complexity. The optimum number of wavelengths m is established by minimizing the ratio of χ^2(m;α) to this total signal-to-noise ratio, where χ^2(m;α) is the critical value of a chi-square distribution with m degrees of freedom at the chosen level of confidence. Having found the selection
of wavelengths one may proceed using any of the aforementioned regression
methods, e.g. ILS or GLS.
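The ranking step of this procedure can be sketched as follows (our own NumPy formulation; the weighting of the m simple regression models and the chi-square stopping rule are not reproduced here):

    import numpy as np

    def rank_wavelengths(S, c):
        # Squared correlation of every spectral channel with the concentration
        S0 = S - S.mean(axis=0)
        c0 = c - c.mean()
        r2 = (S0.T @ c0) ** 2 / (np.sum(S0 ** 2, axis=0) * np.sum(c0 ** 2))
        # Wavelengths in order of decreasing R^2
        return np.argsort(r2)[::-1]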
Many other new developments are under way in this area of variable selection.
Wavelength selection can also be done using forward selection using covariance
rather than correlation as a criterion (Intermediate Least Squares [37,38]). One
may see this as the PLS analogue of forward selection in MLR. Recently, genetic algorithms (cf. Chapter 27) have also been applied to the problem of finding
small sets of predictive wavelengths among the legion of candidate wavelengths
[39]. The challenge with the application of such methods is not to fall in the trap of
overfitting. Another major problem is that of chance correlations: with the large set

of predictors encountered in spectral data it is not at all unlikely to select wave-


lengths which have no real predictive power but happen to contribute to the overall
correlation with the response in the calibration set. An alternative approach to
variable selection is the elimination of so-called uninformative variables. These
are variables that have no better predictive power than artificial random variables
added to the data [40].

36.5.2 Transfer of calibration models

The development of a calibration model is a time consuming process. Not only


have the samples to be prepared and measured, but the modelling itself, including
data pre-processing, outlier detection, estimation and validation, is not an autom-
ated procedure. Once the model is there, changes may occur in the instrumentation
or other conditions (temperature, humidity) that require recalibration. Another
situation is where a model has been set up for one instrument in a central location
and one would like to distribute this model to other instruments within the
organization without having to repeat the entire calibration process for all these
individual instruments. One wonders whether it is possible to translate the model
from one instrument (old or parent or master, A) to the others (new or children or
slaves, B).
Several approaches have been investigated recently to achieve this multivariate
calibration transfer. All of these require that a small set of transfer samples is
measured on all instruments involved. Usually, this is a small subset of the larger
calibration set that has been measured on the parent instrument A. Let Z indicate
the set of spectra for the transfer set, X the full set of spectra measured on the parent
instrument and a suffix A or B the instrument on which the spectra were obtained.
The oldest approach to the calibration transfer problem is to apply the calibration
model, b_A, developed for the parent instrument A using a large calibration set (X_A), to the spectra of the transfer set obtained on each instrument, i.e. Z_A and Z_B. One then regresses the predictions ŷ_A (= Z_A b_A) obtained for the parent instrument on those for the child instrument ŷ_B (= Z_B b_A), giving

ŷ_A = a + b ŷ_B + e    (36.32)

This yields an estimate for the bias (intercept) a and slope b needed to correct predictions ŷ_B from the new (child) instrument that are based on the old (parent) calibration model, b_A. The virtue of this approach is its simplicity: one does not
need to investigate in any detail how the two sets of spectra compare, only the two
sets of predictions obtained from them are related. The assumption is that the same
type of correction applies to all future prediction samples. Variations in conditions
that may have a different effect on different samples cannot be corrected for in this
manner.

All other approaches try and relate the child spectra to the parent spectra. In the
patented method of Shenk and Westerhaus [41], in its simplest form, one first
applies a wavelength correction and then a correction for the absorbance. Each
wavelength channel i of the parent instrument is linked to a nearby wavelength channel j(i) in the child instrument, namely the one to which it is maximally correlated. Then, for each pair of wavelengths, i for the parent and j(i) for the child, a simple linear regression is carried out, linking the pair of measured absorbances

z_A,i = a_i + b_i z_B,j(i)    (36.33)

In this way the child spectrum is transformed into a spectrum as if measured on the
parent instrument. In a more refined implementation one establishes the highest
correlating wavelength channel through quadratic interpolation and, subsequently,
the corresponding intensity at this non-observed channel through linear inter-
polation. In this way a complete spectrum measured on the child instrument can be
transformed into an estimate of the spectrum as if it were measured on the parent
instrument. The calibration model developed for the parent instrument may be
applied without further ado to this spectrum. The drawback of this approach is that
it is essentially univariate. It cannot deal with complex differences between dis-
similar instruments.
In the direct standardization introduced by Wang et al. [42] one finds the
transformation needed to transfer spectra from the child instrument to the parent
instrument using a multivariate calibration model for the transformation matrix:
Z^ = ZgF. The transformation matrix F (qxq) translates spectra Zg that are
actually measured on the child instrument B into spectra Z^ that appear as if they
were measured on instrument A. Predictions are then obtained by applying the old
calibration model b^ to these simulated spectra Z^:

y^=Z^h^=Z^Fh^, (36.34)

giving

be = Fb^ (36.35)

as the transferred calibration model that applies directly to spectra measured on


instrument B. Either PCR or PLS2 regression has been used for establishing the
proper transformation, F. Notice that for each channel of the estimated spectrum
the full spectrum of instrument B is used. In piecewise direct standardization
(PDS) one uses for each frequency (column of F) only the local information of

neighbouring wavelengths in the transfer spectra Z_B, employing a window of wavelengths (columns of Z_B) centered around the wavelength (column of Z_A) of current interest. In mathematical terms: one imposes a band structure on the transformation matrix F. The span of the neighbourhood region and the number of PCs have to be optimized via cross-validation. Applications of this PDS technique have proved successful [43].
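A minimal sketch of (full-spectrum) direct standardization, here simply via a pseudo-inverse rather than the PCR or PLS2 step used in practice (our own NumPy formulation), is:

    import numpy as np

    def direct_standardization(Z_A, Z_B):
        # F maps child spectra onto parent spectra: Z_A is approximated by Z_B @ F
        return np.linalg.pinv(Z_B) @ Z_A

    def transfer_model(F, b_A):
        # Transferred calibration model for the child instrument, eq. (36.35)
        return F @ b_A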
The choice of the transfer subset is critical to the success of calibration transfer.
The transfer samples should span the region of interest and can be chosen on the
basis of extreme PC or PLS factor scores. Improved results can be obtained by a
better coverage of the calibration range, for example, by using a formal design
algorithm such as that of Kennard and Stone (Section 24.4.2) for the selection of
the transfer set. Forina [44] applies PLS regression both for estimating the calib-
ration model on the one instrument and for modelling the relation between the two
sets of spectra. Alternative methods or suggestions for improving existing methods
continue to be reported. Good reviews on the theory and practice of transferring of
calibration models are found in Refs. [45,46].

36.5.3 Non-linear methods

In recent years there has been much activity to devise methods for multivariate
calibration that take non-linearities into account. Artificial neural networks (Chap-
ter 44) are well suited for modelling non-linear behaviour and they have been
applied with success in the field of multivariate calibration [47,48]. A drawback of
neural net models is that interpretation and visualization of the model are difficult.
Several non-linear variants of PCR and PLS regression have been proposed.
Conceptually, the simplest approach towards introducing non-linearity in the
regression model is to augment the set of predictor variables (x_1, x_2, ...) with their respective squared terms (x_1^2, x_2^2, ...) and, optionally, their possible cross-product terms (x_1x_2, ...). Since the number of predictors grows appreciably, PCR or PLS
regression is called for. A non-linear variant of PLS employing splines for the
inner relation between y and the t-scores has been proposed that has some analogy
with neural nets. However, in multivariate calibration this splines-PLS approach
[49] has not yet met with success. In fact, using a quadratic regression model using
factor scores from PC A can be just as effective [50]. One may employ linear PLS
regression as a first step and then proceed with these PLS scores in a quadratically
extended regression model (LQ-PLS, [51]). Locally weighted regression (LWR),
an approach combining elements of PCA or PLS, weighted regression and local
modelling, has been more successful [52]. In this approach one starts with a
transformation of the spectra into a few PC scores. The spectrum of any new
sample is transformed to the same PC space and a small set of similar spectra from
the calibration set is determined using the Mahalanobis distance as a criterion for

similarity. Multiple linear regression is then used to relate the response y to the PC
scores for this small local set and this is used as an interpolating model to estimate
the unknown response for the new sample. The number of PC dimensions and the number of neighbours are determined through cross-validation. More elaborate
extensions of this approach not only take spectral similarity but also the estimated
chemical similarity into account [53]. An interesting study comparing a variety of
modern non-linear methods is given in Ref. [54].

References

1. K.I. Hildrum, T. Isaksson, T. Naes and A. Tandberg, Near Infra-red Spectroscopy Bridging the
Gap between Data Analysis and NIR Applications. Ellis Horwood, New York, 1992.
2. H. van de Waterbeemd, ed., QSAR: Chemometric Methods in Molecular Design. VCH,
Weinheim, 1995.
3. T. Naes and E. Risvik (Editors), Multivariate Analysis of Data in Sensory Science, Data Han-
dling in Science and Technology Series. Elsevier, Amsterdam, 1996
4. P.K. Hopke and X.-H. Song, The chemical mass balance as a multivariate calibration problem.
Chemom. Intell. Lab. Syst., 37 (1997) 5-14.
5. P.J. Brown, Measurement, Calibration and Regression. Clarendon Press, Oxford, 1993.
6. T. Naes, Progress in multivariate calibration, pp. 52-60, in Ref. [1].
7. J.M. Sutter, J.H. Kalivas and P.M. Lang, Which principal components to utilize for principal
component regression. J. Chemometr., 6 (1992) 217-225.
8. Y.L. Xie and J.H. Kalivas, Evaluation of principal component selection methods to form a
global prediction model by principal component regression. Anal. Chim. Acta, 348 (1997)
19-27.
9. R.F. Gunst and R.L. Mason, Regression Analysis and its Application: A Data-Oriented Ap-
proach. Marcel Dekker, New York, 1980.
10. E. Vigneau, D. Bertrand and E.M. Qannari, Application of latent root regression for calibration
in near-infrared spectroscopy. Comparison with principal component regression and partial
least squares. Chemometr. Intell. Lab. Syst., 35 (1996) 231-238.
11. S. de Jong and H.A.L. Kiers, Principal covariates regression. Chemom. Intell. Lab. Syst., 14
(1992) 155-164.
12. S. Van Huffel and J. Vandewalle, The Total Least Squares Problem: Computational Aspects
and Analysis. SIAM, Philadelphia, PA, 1991.
13. P.T. Davies and M.K.S. Tso, Procedures for reduced-rank regression. Appl. Stat., 31 (1982)
244-255.
14. M. Stone and R.J. Brooks, Continuum regression; cross-validated sequentially constructed
prediction embracing ordinary least squares, partial least squares, and principal component re-
gression. J. Roy. Stat. Soc. B52 (1990) 237-269.
15. R.J. Brooks and M. Stone, Joint continuum regression for multiple predictands. J. Am. Stat.
Assoc., 89 (1994) 1374-1377.
16. I.E. Frank and J.H. Friedman, A statistical view of some chemometrics regression tools.
Technometrics, 35 (1993) 109-135.
17. R. Sundberg, Interplay between chemistry and statistics, with special reference to calibration
and the generalized standard addition method. Chemom. Intell. Lab. Syst., 4 (1988) 299-305.

18. M.A. Sharaf, D.L. Illman and B.R. Kowalski, Chemometrics. Wiley, New York, 1986.
19. D.M. Allen, The relationship between variable selection and data augmentation and a method
for prediction. Technometrics, 16 (1974) 125-127.
20. M. Forina, G. Drava, R. Boggia, S. Lanteri and P. Conti, Validation procedures in near-infrared
spectrometry. Anal. Chim. Acta, 295 (1994) 109-118.
21. H. van der Voet, Comparing the predictive accuracy of models using a simple randomization
test. Chemom. Intell. Lab. Syst., 25 (1994) 313-323.
22. B. Efron and R. Tibshirani, An Introduction to the Bootstrap. Wiley, New York, 1993.
23. R. Wehrens and W. van der Linden, Bootstrapping principal regression models. J. Chemom.,
11 (1997) 157-172.
24. T. Fearn, Flat or natural? A note on the choice of calibration samples, pp. 61-66 in Ref. [1].
25. T. Naes and T. Isaksson, Splitting of calibration data by cluster analysis. J. Chemometr., 5
(1991) 49-65.
26. O.E. de Noord, The influence of data preprocessing on the robustness and parsimony of
multivariate calibration models. Chemom. Intell. Lab. Syst., 23 (1994) 65-70.
27. W.R. Hruschka, Data analysis: wavelength selection methods, pp. 35-55 in: P.C. Williams and
K. Norris, eds. Near-infrared Reflectance Spectroscopy. Am. Cereal Assoc., St. Paul, MN, 1987.
28. P. Geladi, D. McDougall and H. Martens, Linearization and scatter-correction for near-
infrared reflectance spectra of meat. Appl. Spectrosc, 39 (1985) 491-500.
29. F. McClure, Analysis using Fourier transforms, in: Handbook of Near-Infrared Analysis, D. A.
Burns and E.W. Ciurczak, eds. Dekker, New York, pp. 181-224 (1992).
30. A.C. Atkinson, Masking unmasked. Biometrika, 73 (1986) 533-541.
31. B. Walczak and D.L. Massart, Robust PCR as outliers detection tool. Chemom. Intell. Lab.
Syst., 27 (1995) 41-54.
32. I.N. Wakeling and H.J.H. MacFie, A robust PLS procedure. J. Chemom., 6 (1992) 189-198.
33. B. Walczak, Outlier detection in bilinear calibration. Chemom. Intell. Lab. Syst., 29 (1995)
63-73.
34. A. Singh, Outliers and robust procedures in some chemometric applications. Chemom. Intell.
Lab. Syst., 33 (1996) 75-100.
35. A.S. Hadi, A modification of a method for the detection of outliers in multivariate samples. J.
Roy. Stat. Soc. B56 (1994) 393-396.
36. P.J. Brown, C.H. Spiegelman and M.C. Denham, Chemometrics and spectral frequency selec-
tion. Phil. Trans. R. Soc. Ser. A, 337 (1991) 311-322.
37. I.E. Frank, Intermediate least squares regression method. Chemom. Intell. Lab. Syst., 1 (1987)
232-242.
38. A. Hoskuldsson, The H-principle in modelling with applications to chemometrics. Chemom.
Intell. Lab. Syst., 14 (1992) 139-153.
39. D. Jouan-Rimbaud, D.L. Massart, R. Leardi, et al., Genetic algorithms as a tool for wavelength
selection in multivariate calibration. Anal. Chem., 67 (1995) 4295-4301.
40. V. Centner, D.L. Massart, O.E. de Noord, S. de Jong, B.G.M. Vandeginste, and C. Sterna,
Elimination of uninformative variables for multivariate calibration. Anal. Chem., 68 (1996)
3851-3858.
41. J.S. Shenk and M.O. Westerhaus, US Patent No. 4866644, Sept. 12, 1989.
42. Y.D. Wang, D.J. Veltkamp and B.R. Kowalski, Multivariate instrument standardization. Anal.
Chem., 63 (1991) 2750-2756.
43. Z.Y. Wang, T. Dean and B.R. Kowalski, Additive background correction in multivariate in-
strument standardization. Anal. Chem., 67 (1995) 2379-2385.

44. M. Forina, G. Drava, C. Armanino, et al., Transfer of calibration function in near-infrared spec-
troscopy. Chemom. Intell. Lab. Syst., 27 (1995) 189-203.
45. O.E. de Noord, Multivariate calibration standardization. Chemom. Intell. Lab. Syst., 25
(1995) 85-97.
46. E. Bouveresse and D.L. Massart, Standardisation of near-infrared spectrometric instruments.
Vib. Spectrosc., 11 (1996) 3-15.
47. B.J. Wythoff, Backpropagation neural networks — a tutorial. Chemom. Intell. Lab. Syst., 18
(1993) 115-155.
48. C. Borggaard and H.H. Thodberg, Optimal minimal neural interpretation of spectra. Anal.
Chem., 64 (1992) 545-551.
49. S. Wold, Non-linear partial least squares modelling. II. Spline inner relation. Chemom. Intell.
Lab. Syst., 14 (1992) 71-84.
50. S.D. Oman, T. Naes and A. Zube, Detecting and adjusting for nonlinearities in calibration of
near-infrared data using principal components. J. Chemom., 7 (1993) 195-212.
51. S. Wold, N. Kettaneh-Wold and B. Skagerberg, Nonlinear PLS modeling. Chemom. Intell.
Lab. Syst., 7 (1989) 53-65.
52. T. Naes, T. Isaksson and B. Kowalski, Locally weighted regression in NIR analysis. Anal.
Chem., 62 (1990) 664-673.
53. Z.Y. Wang, T. Isaksson and B.R. Kowalski, New approach for distance measurement in lo-
cally weighted regression. Anal. Chem., 66 (1994) 249-260.
54. S. Sekulic, B.R. Kowalski, Z.Y. Wang, et al., Nonlinear multivariate calibration methods in
analytical chemistry. Anal. Chem., 65 (1993) A835-A845.

Additional recommended reading

Books
H. Martens and T. Naes, Multivariate Calibration, Wiley, Chichester, 1989.
J.H. Kalivas and P.M. Lang, Mathematical Analysis of Spectral Orthogonality. Dekker, New York,
1994.

Articles
K.R. Beebe and B.R. Kowalski, An introduction to multivariate calibration and analysis. Anal.
Chem., 59 (1987) 1007A-1017A.
B.R. Kowalski and M.B. Seasholtz, Recent developments in multivariate calibration. J. Chemo-
metrics, 5 (1990) 129-145.
H. Martens and T. Naes, Multivariate Calibration by Data Compression, Chapter 4 in Ref. [1].
P.J. Brown, Multivariate Calibration. J. Roy. Stat. Soc., B44 (1982) 287-321.
K. Faber and B.R. Kowalski, Propagation of measurement errors for the validation of predictions ob-
tained by principal component regression and partial least squares. J. Chemom., 11 (1997)
181-238.
D.M. Haaland, Multivariate Calibration Methods Applied to the Quantitative Analysis of Infrared
Spectra, Chapter 1 in "Computer-Enhanced Analytical Spectroscopy, Volume 3", edited by
P.C. Jurs. Plenum Press, New York, 1992.

Chapter 37

Quantitative Structure-Activity Relationships (QSAR)

37.1 Extrathermodynamic methods

Today it is a well-accepted view that the biological and therapeutic activity of a
drug depends upon its physicochemical and conformational (or steric) properties.
The former include the ability of a molecule to cross membranes of cells or to be
taken up by fatty material, the distribution of electric charges within the molecule,
its capacity to make hydrogen bonds with other molecules, etc. The latter relate to
the nature of the atoms that make up the molecule, the distances between atoms
and the angles formed by the chemical bonds between them. The search for
quantitative relations between chemical structure and biological activity is the
subject of Quantitative Structure-Activity Relationships (QSAR), the purpose of
which is to explain why a given drug produces its particular effect, and ultimately
to predict the effect of newly synthesized chemical compounds [1,2]. Quantitative
structure-property relationships have been known for a long time in organic
chemistry, such as the regular increase in boiling temperature in homologous
series of alkanes (e.g. methane, ethane, propane, etc.) as a function of the number
of carbon atoms. It was natural, therefore, to suspect that similar regularities exist
between chemical structure and biological activity.
The first observation of a quantitative structure-activity relationship was made
independently by Meyer and Overton around 1900 [3,4]. They found that the
potency of anaesthetics depended upon their lipophilicity, i.e. their tendency to
dissolve in oil rather than in water. This discovery is generally regarded as the
starting point of QSAR methodology. It pointed toward a physicochemical inter-
action between a drug and the biological materials of the organism in which the
drug is to exert its desired effect. A crucial element of QSAR is the concept of
biological receptor, which emerged around the same time that Meyer and Overton
made their historical observation. The concept states that a drug molecule must
interact in a highly specific way with certain proteins that play a key role in the
production of the desired effect. For example, if a pathological condition is due to
an abnormally high activity of a certain enzyme or receptor protein, then one may
attempt to block this enzyme or receptor by means of a chemical compound that
binds specifically to it. This is the case with naloxone which is a specific antagonist

of the opioid receptor and which is used as an antidote for poisoning by morphine
and related narcotics.
The stereospecificity of the drug-receptor interaction was discovered in 1894
by Fischer in his study of the cleavage of glycosides by yeast [5]. He devised the
still famous 'lock and key' paradigm. A drug acts as a key which can turn the
receptor on or off, and which must satisfy narrow constraints (although the key
may be flexible and the lock may appear to be somewhat plastic). The actual term
receptor was coined in 1913 by Ehrlich, the discoverer of salvarsan (arsphen-
amine) which was regarded as the 'magic bullet' for the treatment of syphilis [6].
The part of the drug molecule that interacts with the receptor is called the
pharmacophore. Compounds that share the same pharmacophore are considered to
be similar with respect to the biological activity that is triggered by their receptor.
In practice, however, the therapeutic benefit may vary considerably within a series
of similar compounds due to differences in absorption, secretion, metabolization,
uptake by fatty tissues, transport through membranes, toxicity, etc. The design of
drugs is both an art and a science, but rational approaches, such as QSAR, play an
ever increasing role in the generation of new lead compounds and the optimization
of existing leads.
The foundation of QSAR as a practical tool of drug design took place around
1964 with the introduction of two so-called extrathermodynamic methods. One of
these methods is based on so-called linear free energy relationships (LFER)
between biological activities of structurally related (congeneric) drugs and the
physicochemical properties of the chemical substituents on a common parent
molecule. For example, benzoic acid may be regarded as a parent compound which
can be substituted at the ortho, meta and para positions by chlorine, bromine,
amine, hydroxyl, acetyl, etc. The latter are referred to as the substituent groups.
The ortho, meta and para locations with respect to the carboxyl group in the parent
benzoic acid molecule are called the substituent positions. This approach is also
referred to as Hansch analysis. Another extrathermodynamic method is based on
the additivity of the contributions to the biological activity by various substituent
groups at multiple substituent positions, and is called Free-Wilson analysis. Both
the Hansch and Free-Wilson methods make use of multiple linear regression.
These approaches allow us to determine the combination of substituents that
provide maximal activity in a series of structurally related molecules. These
developments have been extensively reviewed by Martin [7] and Kubinyi [8].
Multivariate chemometric techniques have subsequently broadened the arsenal
of tools that can be applied in QSAR. These include, among others, Multivariate
ANOVA [9], Simplex optimization (Section 26.2.2), cluster analysis (Chapter 30)
and various factor analytic methods such as principal components analysis (Chapter
31), discriminant analysis (Section 33.2.2) and canonical correlation analysis
(Section 35.3). An advantage of multivariate methods is that they can be applied in

supervised and unsupervised modes. They can handle series of structurally unrelated
compounds and some of them can be used in the case when multiple biological
activities are obtained. As we will see later on, a drug may act on a wide variety of
receptors and hence may provide a typical spectrum of biological activities.
Spectral map analysis is a factor analytic method which makes use of this property
(Section 31.3.5).
Partial Least Squares (PLS) regression (Section 35.7) is one of the more recent
advances in QSAR which has led to the now widely accepted method of Compara-
tive Molecular Field Analysis (CoMFA). This method makes use of local physico-
chemical properties such as charge, potential and steric fields that can be determined
on a three-dimensional grid that is laid over the chemical structures. The determina-
tion of steric conformation, by means of X-ray crystallography or NMR spectro-
scopy, and the quantum mechanical calculation of charge and potential fields are
now performed routinely on medium-sized molecules [10]. Modern optimization
and prediction techniques such as neural networks (Chapter 44) also have found
their way into QSAR.
Much attention is devoted today to automatic searching of large libraries of
chemical compounds in order to find interesting chemical structures that possess
some similarity to known lead compounds. Current emphasis is on searching of
three-dimensional structures within libraries of hundreds of thousands of com-
pounds [11]. Finally, our knowledge of proteins, specifically of enzymes and
receptors, has greatly increased by the use of techniques from biotechnology. It is
now possible in many cases to isolate and clone these macromolecules in pure
form and to determine their primary amino acid sequence, as well as their exact
three-dimensional structure. This has led to so-called rational drug design by
which drugs are designed by means of computer modelling rather than by the
empirical (serendipitous) method of trial-and-error.
In the light of these considerations one may regard QSAR as a multidisciplinary
field of chemometrics. Statistics, optimization, pattern recognition and information
technology converge in QSAR with chemistry, physicochemistry, biology, bio-
chemistry and biotechnology. The start of QSAR coincides with the development
of linear free energy relationships (LFER) between biological activities and the
physicochemical properties of molecules. This development is generally referred
to as the extrathermodynamic approach in QSAR for reasons that will become
apparent later on in this section. A vast amount of literature has appeared on the
subject since 1964. The content of this section has been largely inspired by a recent
review of the subject by Kubinyi [8].
As we have remarked above, biological activity is the result of the binding of a
drug D onto a receptor protein (or enzyme) P which results in the drug-receptor
complex DP. The strength of the binding and, hence, the magnitude of the effect
can be expressed by means of the change in Gibbs' free energy ΔG between the

free D and P on the one hand and the bound DP on the other hand. According to
classical thermodynamics, the free energy release ΔG is made up from an enthalpic
and an entropic term [12]:

ΔG = ΔH - TΔS (37.1)

where ΔH is the change in enthalpy, ΔS is the change in entropy and T represents
absolute temperature. The above relation shows that the strength of the binding
between drug and receptor increases with the release of enthalpy and with the gain
in entropy (i.e. decrease of order) in the drug-receptor complex.
The free energy ΔG can also be related to the equilibrium constant K between
the free and the bound state of drug and receptor:

ΔG = -2.303 RT log K (37.2)

where R represents the gas constant. The relation shows that if the reaction between
drug and receptor proceeds in the direction of the formation of the drug-receptor
complex (K > 1), then a decrease of free energy will occur (ΔG < 0).
Interactions of a drug with a receptor (or enzyme) are of a reversible non-
covalent nature. The free energy released by drug-receptor interactions is smaller
by one to two orders of magnitude than that of a covalent bond. The interactions
take place mainly by means of electrostatic, hydrophobic and dispersion forces.
Electrostatic interactions between charged groups are enthalpic. Hydrogen bonding
is a weak electrostatic interaction which takes place between electron donating and
electron accepting groups, such as amine, ketone, hydroxyl groups, etc. Hydro-
phobic interactions are mostly entropic. They can displace water molecules that
are bound to the drug molecule or to the receptor surface. This diminishes the
ordering of the water molecules and, hence, increases entropy, which in turn
lowers the free energy of the drug-receptor complex. Dispersion forces are
attractive when a drug molecule closely approaches the receptor surface. They are
repulsive when the van der Waals radii of drug and receptor molecules tend to
overlap. The nature of dispersion forces is mainly enthalpic.
The question that now arises is how to define biological (or therapeutic) activity
of a drug. There are good reasons for defining biological activity as the logarithm
of the reciprocal of the effective dose (or concentration) that is needed in order to
produce a well-defined biological effect. This is in accordance with Weber-
Fechner's psychophysical law which states that a biological response is an approx-
imately linear function of the logarithm of the physical stimulus [13]. Furthermore,
experimental observations also show that the doses or concentrations that are
tolerated by living organisms follow a log-normal distribution. The reciprocal of
the effective dose is required, since the more active or potent drugs require a lower

dose in order to produce the desired effect. Usually, an effective dose is defined as
the dose which produces the required effect in half of the samples or subjects that
have been tested. In classical pharmacology on living subjects this produces the
so-called median effective dose (or ED₅₀). In biochemical pharmacology on isolated
receptors results are often reported in the form of median inhibitory concentrations
(or IC₅₀). If we combine the previous observations with eq. (37.2), we obtain an
expression which relates the biological activity (log 1/C) of a drug to the equi-
librium constant (K) of the drug-receptor interaction:

log 1/C = a log K (37.3)

where a denotes a proportionality constant.


The next step in the development of the extrathermodynamic approach was to
find a suitable expression for the equilibrium constant in terms of physicochemical
and conformational (steric) properties of the drug. Use was made of a physico-
chemical interpretation of the dissociation constants of substituted aromatic acids
in terms of the electronic properties of the substituents. This approach had already
been introduced by Hammett in 1940 [14]. The Hammett equation relates the
dissociation constant K of a substituted benzoic acid (e.g. meta-chlorobenzoic
acid) to the so-called Hammett electronic parameter σ:

log K = log K₀ + ρσ (37.4)


where K₀ represents the dissociation constant of unsubstituted benzoic acid and
where ρ is a proportionality constant. A distinction is made between σₘ and σₚ,
which take different values depending on the meta (m) or para (p) position of the
substituent on the parent benzoic acid molecule.
On the analogy of the physicochemical relation, one was led to define a
biological Hammett equation which related the equilibrium constant of the drug-
receptor complex to the electronic σ parameters of the substituents (e.g. chlorine,
bromine, methyl, ethyl, hydroxyl, carboxyl, acetyl, etc.) of the drug molecule.
Since the equilibrium constant of a drug-receptor complex is reflected by the
biological activity, this led to the first extrathermodynamic relationship in QSAR:
log 1/C = b₀ + b₁σ (37.5)
where b₀ and b₁ are coefficients that can be derived by means of linear regression
(Chapter 8). For the historical reasons discussed above, the relation is also referred
to as a linear free energy relationship (LFER).
A fundamental assumption of the approach is that contributions to σ from
several substituent groups on the same parent compound are additive. The additivity
assumption also holds for the more general Hansch model that will be discussed
below.

37.1.1 Hansch analysis

It soon became apparent that the biological Hammett equation produced an


unsatisfactory fit between biological activity of a set of chemical analogs and the
experimentally obtained electronic σ values of the substituent groups. Hansch
[15,16] first proposed to extend the equation with a term which accounted for the
hydrophobic interaction. The latter can be characterized by determining the partition
of the drug between oil (octanol) and water. The ratio of the concentrations of the
drug in the two phases at equilibrium is called the partition coefficient P and the
decimal logarithm of P is referred to as the lipophilicity. Lipophilicity measures
the tendency of a drug to dissolve into fatty material. It is not exactly the same as
hydrophobicity, which measures the ability of a drug to displace or expel water
molecules at a binding site. Lipophilicity, however, is often a good approximation
to hydrophobicity. This led to the Hansch equation, which reads in its simplest
form:

log 1/C = b₀ + b₁σ + b₂ log P (37.6)


or when electrostatic interactions can be neglected:
log 1/C = b₀ + b₁ log P (37.7)

where b₀, b₁, b₂ are coefficients that can be determined by means of multiple linear
regression (see Chapter 10).
The Hansch model can be expressed in a general form:
y = Xb + e (37.8)
where X represents the matrix of independent parameters (extended by a unit
column-vector in order to account for the constant term), y is the vector of observed
biological activities, e contains the residuals between observed and computed
activities, and b is the vector of regression coefficients.
The above Hansch equations are also generally referred to as linear free energy
relationships (LFER) as they are derived from the free energy concept of the
drug-receptor complex. They also assume that biological activity is linearly
related to the electronic and lipophilic contributions of the various substituents on
the parent molecule.
A typical Hansch analysis has been applied to the 50% inhibitory concentrations
(IC₅₀) of oxidative phosphorylation of 11 doubly substituted salicylanilides (Table
37.1) as reported by Williamson and Metcalf [17]. Multiple linear regression leads
to the following model:
log 1/IC₅₀ = -2.190 + 0.708 σ + 0.348 log P

TABLE 37.1

Lipophilicity (log P), Hammett electronic parameter (σ) and inhibitory concentration (IC₅₀) for oxidative
phosphorylation of 11 doubly substituted salicylanilides [17]. The two substitution positions are labeled A
and B.
#    A                     B               log P   σ       IC₅₀

1    5-Cl-3-(4-Cl-C₆H₄)    4'-Cl           4.68    0.463   2.512
2    5-Cl                  2'-Cl-4'-NO₂    2.12    1.724   2.042
3    5-Cl-3-C₆H₅           3',4'-Cl₂       4.79    0.836   1.549
4    5-Cl-3-C₆H₅           2',5'-Cl₂       4.55    0.836   1.445
5    5-Cl-3-C₆H₅           2',4',5'-Cl₃    5.48    1.063   0.646
6    5-Cl-3-C₆H₅           3'-Cl-4'-NO₂    4.36    1.879   0.617
7    5-Cl-3-(4-Cl-C₆H₄)    2'-Cl-5'-NO₂    4.98    1.173   0.263
8    5-Cl-3-(4-Cl-C₆H₄)    2'-Cl-4'-CN     4.58    1.091   0.219
9    5-Cl-3-(4-Cl-C₆H₄)    2'-Cl-5'-CF₃    5.93    0.878   0.200
10   5-Cl-3-(4-Cl-C₆H₄)    2',4',5'-Cl₃    6.41    1.063   0.173
11   5-Cl-3-t-But          2'-Cl-4'-NO₂    4.00    1.527   0.170

The residual standard deviation of regression (s) equals 0.346, the coefficient
of determination (R²) is 0.548 and the F-statistic amounts to 4.840 with 2 and 8
degrees of freedom (df) which is significant at the 0.05 level of significance (p).
All coefficients of the regression are significant at the 0.05 level of probability,
which lends evidence to the importance of both electrostatic (a) and hydrophobic
interactions (log P). The coefficient of determination (R²) is small and the error
term is relatively large. This suggests that predicted IC₅₀ values are expected to
deviate by more than a factor two from the observed values. Inspection of the
residuals indicates that there may be two outliers (compounds 8 and 11) in this data
set that do not fit well to the proposed Hansch model.
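
The regression reported above can be checked with a few lines of code. The sketch below simply re-fits the model of eq. (37.6) to the values of Table 37.1 by ordinary least squares; small differences with the quoted coefficients may arise from rounding of the tabulated data.

```python
import numpy as np

# Table 37.1: log P, sigma and IC50 of the 11 salicylanilides
log_P = np.array([4.68, 2.12, 4.79, 4.55, 5.48, 4.36, 4.98, 4.58, 5.93, 6.41, 4.00])
sigma = np.array([0.463, 1.724, 0.836, 0.836, 1.063, 1.879, 1.173, 1.091, 0.878, 1.063, 1.527])
ic50  = np.array([2.512, 2.042, 1.549, 1.445, 0.646, 0.617, 0.263, 0.219, 0.200, 0.173, 0.170])
y = -np.log10(ic50)                                  # biological activity log 1/IC50

X = np.column_stack([np.ones_like(y), sigma, log_P]) # unit column for the constant term
b = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ b
s  = np.sqrt(np.sum(resid**2) / (len(y) - X.shape[1]))   # residual standard deviation
R2 = 1 - np.sum(resid**2) / np.sum((y - y.mean())**2)    # coefficient of determination
print(b, s, R2)   # compare with the coefficients, s and R^2 quoted above
```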
Frequently, the relationship between biological activity and log P is curved and
shows a maximum [18]. In that case, quadratic and non-linear Hansch models have
been proposed [19]. The parabolic model is defined as:
log 1/C = b₀ + b₁ log P + b₂ (log P)² (37.9)
which reaches a maximum at (log P)₀ = -b₁/(2b₂).

The bilinear model takes the form:


log 1/C = b₀ + b₁ log P + b₂ log (b₃P + 1) (37.10)
or equivalently:
log 1/C = b₀ + b₁ log P + b₂ log (b₃ 10^(log P) + 1)
which obtains a maximum at log P₀ = log[-b₁/(b₃(b₁ + b₂))].
Note that the lipophilicity parameter log P is defined as a decimal logarithm.
The parabolic equation is only non-linear in the variable log P, but is linear in the
coefficients. Hence, it can be solved by multiple linear regression (see Section
10.8). The bilinear equation, however, is non-linear in both the variable P and the
coefficients, and can only be solved by means of non-linear regression techniques
(see Chapter 11). It is approximately linear with a positive slope (b₁) for small
values of log P, while it is also approximately linear with a negative slope (b₁ + b₂)
for large values of log P. The term bilinear is used in this context to indicate that the
QSAR model can be resolved into two linear relations for small and for large
values of P, respectively. This definition differs from the one which has been
introduced in the context of principal components analysis in Chapter 17.
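
The distinction between the two models can be made concrete with a small sketch: the parabolic model of eq. (37.9) is linear in its coefficients and can be fitted by ordinary least squares, whereas the bilinear model of eq. (37.10) requires a non-linear least-squares routine. The data below are purely hypothetical placeholders, and the starting values for the non-linear fit are assumptions that may need adjustment.

```python
import numpy as np
from scipy.optimize import curve_fit

# hypothetical log P and log 1/C values (placeholders, not data from the text)
logP = np.array([2.4, 2.9, 3.4, 3.9, 4.4, 4.9, 5.4, 5.9, 6.4])
y    = np.array([0.8, 1.3, 1.7, 2.2, 2.5, 2.6, 2.6, 2.5, 1.8])

# parabolic model (37.9): linear in b0, b1, b2, so multiple linear regression suffices
X = np.column_stack([np.ones_like(logP), logP, logP**2])
b0, b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]
print("parabolic optimum (log P)0 =", -b1 / (2 * b2))

# bilinear model (37.10): non-linear in b3, fitted by non-linear least squares
def bilinear(logP, b0, b1, b2, b3):
    return b0 + b1 * logP + b2 * np.log10(b3 * 10**logP + 1)

p, _ = curve_fit(bilinear, logP, y, p0=[0.0, 1.0, -2.0, 1e-4],
                 bounds=([-np.inf, -np.inf, -np.inf, 1e-12], np.inf))
# optimum log P0 = log[-b1/(b3(b1 + b2))], valid when b1 > 0 and b1 + b2 < 0
print("bilinear optimum log P0 =", np.log10(-p[1] / (p[3] * (p[1] + p[2]))))
```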
A non-linear Hansch model has been applied to the bactericidal concentrations
(C) of 17 doubly substituted phenols (Table 37.2) which have been reported by
Klarmann et al. [20]. By means of multiple linear regression we obtain the
parabolic Hansch model of eq. (37.9):
log 1/C = -3.474 + 2.298 log P - 0.225 (log P)²
s = 0.178, R² = 0.933, F = 97.1 with df = 2 and 14, p < 0.0001.
The coefficients of the regression are all highly significant (p < 0.0001) and the fit
of the model to the observed data is shown in Fig. 37.1. Using non-linear regression
we obtain the bilinear Hansch model for the bactericidal activities of the 17 phenol
analogs (Table 37.2):
log 1/C = -1.552 + 1.011 log P - 3.331 log (0.00244P + 1)
s = 0.179
In this case, the bilinear model of eq. (37.10) fits the data as well as the parabolic
one. A possible reason for the lack of improvement is that the part of the model
which accounts for the higher values of log P is not well covered by the data (Fig.
37.1). The parabolic model yields an optimal value for log P of 5.10 while the
optimum of the bilinear model is found at 5.18.
Hansch analysis marked the breakthrough of QSAR. The method was soon
extended with additional parameters with the aim of improving the fit between
biological and physicochemical data and for the prediction of drugs with optimal

TABLE 37.2
Lipophilicity (log P) and bactericidal concentrations (C) of 17 doubly substituted phenols [20]

#    R           R'          log P   log 1/C

1    H           Cl          2.39    0.81
2    Methyl      Cl          2.89    1.34
3    Ethyl       Cl          3.39    1.73
4    Propyl      Cl          3.89    2.26
5    Butyl       Cl          4.39    2.52
6    Amyl        Cl          4.89    2.63
7    sec-Amyl    Cl          4.69    2.23
8    Cyclohexyl  Cl          4.90    2.25
9    Heptyl      Cl          5.89    2.51
10   Octyl       Cl          6.39    1.83
11   Cl          H           2.15    0.50
12   Cl          Methyl      2.65    0.91
13   Cl          Ethyl       3.15    1.35
14   Cl          Propyl      3.65    1.86
15   Cl          Butyl       4.15    2.20
16   Cl          Amyl        4.65    2.23
17   Cl          tert-Amyl   4.33    2.00

activity. Measures of lipophilicity are mostly derived from partition coefficients P


and chromatographic retention times [21]. Rekker [22] derived a method for
computing lipophilicity 'de novo', i.e. from first principles as opposed to experi-
mentally, which improved on the experimental method used by Hansch, especially
in the case of aliphatic compounds. Electronic parameters include Hammett's σₘ
and σₚ which we have already discussed, the (inductive) field and resonance
parameters F and R which have been derived from spectroscopic measurements by
Swain and Lupton [23], the dipole moment and various quantum mechanical
properties which relate to the distribution of electrons in the molecule. The Swain
and Lupton parameters F and R have been shown to be linear combinations of
Hammett's σ derived from meta and para substituents of benzoic acid. It is claimed


Fig. 37.1. Quadratic Hansch model fitted to the bactericidal activities (log 1/C) of 10 doubly substi-
tuted phenols in Table 37.2 as a function of lipophilicity (log P) [20].

that the field and resonance parameters are less correlated than Hammett's
electronic parameters. Steric parameters account for the geometric properties of
the compounds. Among these we only cite the steric bulk factor Eₛ of Taft [24], the
molecular weight MW and the five STERIMOL variables which were proposed by
Verloop [25] to describe the size and shape of the molecules. Molar refractivity
MR appears to be a parameter which correlates with lipophilicity and steric
parameters. Connectivity indices have been introduced by Kier and Hall [26] for
the description of the topological (graph) structure of the molecules. A connec-
tivity index is a single number which expresses how atoms are arranged in a
molecule. There are several different indices, each of which expresses a particular
feature of the molecule, such as its degree of branchedness, the presence of double
and triple bonds, of non-carbon atoms (apart from hydrogen), of aromatic rings,
etc. Finally indicator variables can be included into the Hansch equation in order to
account for the presence or absence of specific chemical functions. The use of
indicator variables will be discussed more amply in the following section which
deals with Free-Wilson analysis.
Extensive data bases are now available which list lipophilicity, molar refrac-
tivity, electronic and steric values for a wide collection of substituents [27,28].
Starting from a particular parent compound one computes the value of a physico-
chemical parameter (e.g. lipophilicity) for a given drug by adding the contributions

of all the substituents that have been introduced on the parent compound. Hence,
for a particular analog derived from an unsubstituted parent compound we obtain:

log P = Σᵢ (log P)ᵢ (37.11)

σ = Σᵢ σᵢ (37.12)

where the summation index i extends over all substituents on the parent molecule.
Computed values of log P are often represented by the symbol π, in order to
distinguish them from the empirical values which have been derived from partition
coefficients or from chromatographic retention times.
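
The additive computation of eqs. (37.11) and (37.12) amounts to a simple summation over substituent contributions, as in the sketch below. The numerical values in the dictionary are rough illustrations only, not entries from the compilations of refs. [27,28].

```python
# illustrative substituent contributions: pi (hydrophobic) and sigma (electronic);
# the numbers are placeholders for the tabulated values of refs. [27,28]
substituent_values = {
    "Cl":  {"pi": 0.71, "sigma": 0.37},
    "CH3": {"pi": 0.56, "sigma": -0.07},
    "NO2": {"pi": -0.28, "sigma": 0.71},
}

def additive_parameters(substituents, parent_log_p=0.0):
    """Sum the substituent contributions for one analog (eqs. 37.11 and 37.12)."""
    pi = parent_log_p + sum(substituent_values[s]["pi"] for s in substituents)
    sigma = sum(substituent_values[s]["sigma"] for s in substituents)
    return pi, sigma

print(additive_parameters(["Cl", "CH3"]))   # computed pi and summed sigma for a Cl,CH3-analog
```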
A difficulty with Hansch analysis is to decide which parameters and functions
of parameters to include in the regression equation. This problem of selection of
predictor variables has been discussed in Section 10.3.3. Another problem is due to
the high correlations between groups of physicochemical parameters. This is the
multicollinearity problem which leads to large variances in the coefficients of the
regression equations and, hence, to unreliable predictions (see Section 10.5). It can
be remedied by means of multivariate techniques such as principal components
regression and partial least squares regression, applications of which are discussed
below.
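
The principal components regression remedy mentioned here can be pictured with a short sketch: the centred predictor matrix is compressed into a few orthogonal score vectors and the activity is regressed on those scores rather than on the correlated original parameters. This is a generic outline, not the specific algorithm of Section 35.6.

```python
import numpy as np

def pcr_fit(X, y, n_components):
    """Principal components regression: regress y on the first PC scores of X."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T                   # loadings of the retained components
    T = Xc @ P                                # orthogonal scores: no multicollinearity
    q = np.linalg.lstsq(T, y - y_mean, rcond=None)[0]
    b = P @ q                                 # coefficients in terms of the original variables
    return b, y_mean - x_mean @ b             # slope vector and intercept

# usage: b, b0 = pcr_fit(X, y, n_components=3); y_pred = X_new @ b + b0
```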
The fundamental assumption of Hansch analysis is that substituent values are
additive. This implies that substituents are mutually independent, i.e. that the
effect of a substitution group at one position in the parent molecule is independent
of substitution groups at other positions. The assumption of additivity is violated,
for example, when hydrogen bonding occurs between two adjacent substituents.

37.1.2 Free-Wilson analysis

The second extrathermodynamic method that we discuss here differs from


Hansch analysis by the fact that it does not involve experimentally derived
substitution constants (such as a, log P, MR, etc.). The method was originally
developed by Free and Wilson [29] and has been simplified by Fujita and Ban [30].
The subject has been extensively reviewed by Martin [7] and by Kubinyi [8]. The
method is also called the 'de novo' approach, as it is derived from first principles
rather than from empirical observations. The underlying idea of Free-Wilson
analysis is that a particular substituent group at a specific substitution site on the
molecule contributes a fixed amount to the biological activity (log 1/C). This can
be formulated in the form of the linear relationship:

log 1/C = b₀ + Σⱼ Σᵢ bᵢⱼ xᵢⱼ (37.13)



where b₀ represents the biological activity of the reference compound. The index j
ranges between 1 and the number of substitution positions p, while the index i
varies between 1 and the number of possible substituents nⱼ at the position j,
minus 1. The coefficient bᵢⱼ in eq. (37.13) denotes the contribution to the activity by
the ith substituent group at the jth substitution position on the reference compound,
and xᵢⱼ is the corresponding indicator variable. The latter equals one if the ith
substitution group is present at the jth position, and zero otherwise. Note that there
is a distinction between the parent compound in Hansch analysis and the reference
compound in Free-Wilson analysis. The parent compound is by definition the
unsubstituted compound (with hydrogen atoms at each substitution site), while the
reference compound can be chosen arbitrarily from the set of compounds. The
substituent groups that appear in the reference compound do not appear in the
Free-Wilson model that has been defined above.
The number n of independent variables xᵢⱼ is defined by:

n = Σⱼ (nⱼ - 1) (37.14)

Hence, the number of compounds must at least be equal to n.


Table 37.3 shows the complete table of eight indicator variables for 10 triply
substituted tetracyclines [31] that have been tested for bacteriostatic activity (1/Z),
which is defined here as the ratio of the number of colonies grown with a
substituted and with the unsubstituted tetracycline. In this application we have
three substitution positions, labelled U, V and W. The number of substituents at the
three sites equals 2,3 and 3, respectively. Arbitrarily, we chose the compound with
substituents H, NO2 and NO2 at the sites U, V and W as the reference compound.
This leads to a reduction of the number of indicator variables from eight to five, as
shown in Table 37.4. The solution of the Free-Wilson model can be obtained
directly by means of multiple regression:

log 1/Z = -0.603 - 0.364 I(CH₃) + 0.060 I(Cl) + 0.026 I(Br) + 1.129 I(NH₂)
          + 0.480 I(CH₃CONH)
(s = 0.347, R² = 0.832, F = 3.96 with df = 5 and 4, p = 0.10).
In the above expression the indicator variable I(X) takes the value 0 or 1,
depending upon the absence or presence of the substituent X in a particular
compound. The overall result of the regression is not significant at the 0.05 level of
probability. This may be due to the unfavorable proportion of the number of
compounds to the number of parameters in the regression equation (10 to 6). Only
the indicator variable for substituent NH2 at position W in the tetracycline molecule
reaches significance (p = 0.02). This can be confirmed by looking at Table 37.4

TABLE 37.3
Complete indicator table and bacteriostatic activities of 10 triply substituted tetracyclines. The three substi-
tution positions are labeled U, V and W [31].


#    U           V                 W                          1/Z
     H    CH₃    NO₂   Cl    Br    NO₂   NH₂   CH₃CONH

1 1 0 1 0 0 1 0 0 0.60
2 1 0 0 1 0 1 0 0 0.21
3 1 0 0 0 1 1 0 0 0.15
4 1 0 0 1 0 0 1 0 5.25
5 1 0 0 0 1 0 1 0 3.20
6 1 0 1 0 0 0 1 0 2.75
7 0 1 1 0 0 0 1 0 1.60
8 0 1 1 0 0 0 0 1 0.15
9 0 1 0 0 1 0 1 0 1.40
10 0 1 0 0 1 0 0 1 0.75

which shows that all compounds with the NH2 substituent at W possess high
activity, while the remaining compounds reached only low activities. It is not
surprising, therefore, that the Free-Wilson equation indicates that the W site in the
molecule is the most critical one for obtaining bacteriostatic activity. A Free-Wilson
analysis can also be performed by means of analysis of variance (ANOVA) instead
of multiple regression (Chapter 6). The interpretation of the results obtained by the
two methods is essentially the same.
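
The Free-Wilson regression on the reduced indicator table can be reproduced directly; the sketch below types in the indicator matrix and activities of Table 37.4 and solves eq. (37.13) by least squares. Small deviations from the quoted coefficients may result from rounding of the tabulated activities.

```python
import numpy as np

# Table 37.4: I(CH3 at U), I(Cl at V), I(Br at V), I(NH2 at W), I(CH3CONH at W)
X_ind = np.array([
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 1, 0],
    [1, 0, 0, 1, 0],
    [1, 0, 0, 0, 1],
    [1, 0, 1, 1, 0],
    [1, 0, 1, 0, 1],
])
inv_Z = np.array([0.60, 0.21, 0.15, 5.25, 3.20, 2.75, 1.60, 0.15, 1.40, 0.75])
y = np.log10(inv_Z)                              # bacteriostatic activity log 1/Z

X = np.column_stack([np.ones(len(y)), X_ind])    # constant term = reference compound
b = np.linalg.lstsq(X, y, rcond=None)[0]
print(b)   # b0 and the five substituent contributions, cf. the equation above
```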
In a broad sense, one may include the Free-Wilson equation within the class of
linear free energy relationships (LFER). It is also subjected to the assumption of
additivity of the contributions to the biological activity by substituent groups at
different substitution sites. The assumption requires, for example, that there is no
hydrogen bonding interaction between the various substitution groups.
The interpretation of the result of a Free-Wilson analysis is somewhat different
from that of a Hansch analysis. The coefficients in the Hansch model represent
absolute contributions to the biological activity of a compound from the various

TABLE 37.4
Reduced indicator table and bacteriostatic activities 1/Z of 10 triply substituted tetracyclines, as derived
from Table 37.3. The compound with substitution groups H, NO₂ and NO₂ at the positions U, V and W is
taken as the reference compound.

#    U       V             W                    1/Z
     CH₃     Cl    Br      NH₂   CH₃CONH

1 0 0 0 0 0 0.60
2 0 1 0 0 0 0.21
3 0 0 1 0 0 0.15
4 0 1 0 1 0 5.25
5 0 0 1 1 0 3.20
6 0 0 0 1 0 2.75
7 1 0 0 1 0 1.60
8 1 0 0 0 1 0.15
9 1 0 1 1 0 1.40
10 1 0 1 0 1 0.75

types of interactions with the receptor (lipophilic, electronic, steric, etc.). As we


have indicated above, non-significant parameters can be dropped in the Hansch
equation. In the Free-Wilson model, one estimates relative contributions of the
various substituents with respect to an arbitrarily chosen reference compound. In
our illustration we find that the contribution of NH₂ at site W exceeds the
contribution of CH₃CONH at the same site by an amount of 1.129 - 0.480 = 0.649.
We can also state that any compound with NH₂ at site W is on average 10^0.649 or
4.46 times more potent than a comparable compound with CH₃CONH at that site
(all other substituent groups at sites U and V being the same). In order to make such
interpretations, one cannot drop non-significant terms from the Free-Wilson
model. In the extreme case when a particular coefficient of the regression equation
is zero, this would only imply that the contribution of the corresponding sub-
stituent group is equal to the contribution of the substituent group at the same site
in the reference compound [8].
The Free-Wilson analysis provides more site-specific information than a Hansch
analysis. It is recommended to carry out a Free-Wilson analysis first in order to
obtain an idea of the importance of the substituent groups and of the sensitivity of
the substitution sites. This type of analysis can be regarded as being qualitative, as
it points to the important pharmacophores in the molecule. The information thus
obtained may guide the selection of the appropriate physicochemical, topological

and indicator variables that are to be included in the Hansch model. The latter
approach can be considered as quantitative, as it makes use of experimentally or
computationally derived parameters, such as π, σ, Eₛ, etc. The goodness of fit of
the Free-Wilson model to the observed biological activities also indicates whether
the additivity assumption is supported by the data. A comparison between the
Hansch and Free-Wilson models has been made by Craig [32].

37.2 Principal components models

Attention has already been drawn to correlations that occur between various
physicochemical and conformational descriptors of QSAR. Since Hansch analysis
involves multiple linear regression, these correlations may lead to unreliable
predictions of biological activity. In the design of chemical synthesis one also
wishes to sample the space of the variables in a uniform way such as to produce the
greatest possible variety of compounds. For these reasons, use has been made of
so-called Craig plots [33] in which compounds are plotted according to two
selected variables (e.g. a and log P). Such plots have been useful for the optimi-
zation of biological activity by means of sequential Simplex optimization (Section
26.2) [34] or response surface methodology (Section 24.5).
Due to the rapid proliferation of physicochemical, quantum mechanical and
conformational variables, the need was soon felt to apply multivariate methods and
pattern recognition techniques in QSAR. Cluster analysis was introduced by
Kowalski [35] in QSAR as a tool for studying the correlation structure of the
variables and for establishing similarities between the compounds. Factor analysis
has been used at an early stage in QSAR by Franke [36] for analyzing the
correlation structure of the variables (see also Chapter 34). This was followed by
applications of principal components analysis and biplot representations of
rectangular (compounds x properties) tables (see Chapters 17 and 31). Spectral
map analysis (SMA) is one such technique which has been introduced by Lewi
[37] for the study of drug-test specificities in pharmacological screening (Section
31.3.5). Principal components analysis, cluster analysis, multivariate regression
involving multiple dependent variables and other multivariate techniques have
been integrated by Mager [9] into structure-activity relationships in combination
with multivariate bioassay. This approach, which is called rational empiricism,
holds the middle between purely rational and purely empirical drug design.
In the following sections we propose typical methods of unsupervised learning
and pattern recognition, the aim of which is to detect patterns in chemical,
physicochemical and biological data, rather than to make predictions of biological
activity. These inductive methods are useful in generating hypotheses and models
which are to be verified (or falsified) by statistical inference. Cluster analysis has

initially been used in QSAR for the detection of patterns in physicochemical and
biological data [38]. These clusters can often be displayed visually in a two- or
three-dimensional plot of principal components.

37.2.1 Principal components analysis

A table of correlations between seven physicochemical substituent parameters


for 90 chemical substituent groups has been reported by Hansch et al. [39]. The
parameters include lipophilicity (log P), molar refractivity (MR), molecular weight
(MW), Hammett's electronic parameters (σₘ and σₚ), and the field and resonance
parameters of Swain and Lupton (F and R).
Figure 37.2 displays the result of applying principal components analysis
(PCA) to the correlations in Table 37.5 [40]. The PCA has been applied to the
column-standardized data, as explained in Section 31.3.3. The horizontal and
vertical axes of the loading plot represent the first two components, which account
for 46 and 31%, respectively, of the correlations in the data. A third component,
representing about 20% of the correlations, is indicated by means of variable
shading of the symbols. The line segments that join the representations of the
parameters have been obtained by single linkage clustering [38]. The graphical
result shows a clear distinction between the four electronic parameters (σₘ, σₚ, F
and R), on the one hand, and the other parameters (log P, MR and MW), on the
other hand. As has been mentioned before, the two Hammett σ parameters are
more closely correlated than the field and resonance parameters F and R of Swain
and Lupton. As the third component is mostly determined by F and R, the actual
distance between them is larger than appears from the two-component diagram of
Fig. 37.2. Molar refractivity (MR) has already been characterized as being related
to lipophilicity (such as log P) and steric parameters (such as MW), and this is
confirmed by the principal components analysis.
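
A loading plot such as Fig. 37.2 can be obtained, up to sign and scaling conventions, from the correlation matrix alone. The sketch below performs an eigendecomposition of the symmetric matrix built from Table 37.5; it is one possible route, not necessarily the exact procedure used for the published figure.

```python
import numpy as np

labels = ["log P", "sigma_m", "sigma_p", "F", "R", "MR", "MW"]
# upper triangle of the correlation matrix of Table 37.5
R_up = np.array([
    [1.000, -0.278, -0.153, -0.323,  0.049,  0.442, 0.364],
    [0.000,  1.000,  0.884,  0.959,  0.468, -0.119, 0.232],
    [0.000,  0.000,  1.000,  0.716,  0.827, -0.041, 0.263],
    [0.000,  0.000,  0.000,  1.000,  0.200, -0.153, 0.187],
    [0.000,  0.000,  0.000,  0.000,  1.000,  0.067, 0.221],
    [0.000,  0.000,  0.000,  0.000,  0.000,  1.000, 0.829],
    [0.000,  0.000,  0.000,  0.000,  0.000,  0.000, 1.000],
])
R = R_up + R_up.T - np.eye(7)                    # fill in the lower triangle

eigval, eigvec = np.linalg.eigh(R)               # eigenvalues in ascending order
order = np.argsort(eigval)[::-1]
eigval, eigvec = eigval[order], eigvec[:, order]
loadings = eigvec * np.sqrt(np.clip(eigval, 0, None))
for name, (l1, l2) in zip(labels, loadings[:, :2]):
    print(f"{name:8s}  PC1 = {l1:6.2f}   PC2 = {l2:6.2f}")
print("fraction of correlation explained:", eigval[:2] / eigval.sum())
```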
The method of PCA can be used in QSAR as a preliminary step to Hansch
analysis in order to determine the relevant parameters that must be entered into the
equation. Principal components are by definition uncorrelated and, hence, do not
pose the problem of multicollinearity. Instead of defining a Hansch model in terms
of the original physicochemical parameters, it is often more appropriate to use
principal components regression (PCR) which has been discussed in Section 35.6.
An alternative approach is by means of partial least squares (PLS) regression,
which will be more amply covered below (Section 37.4).
Principal components analysis can also be used in the case when the compounds
are characterized by multiple activities instead of a single one, as required by the
Hansch or Free-Wilson models. This leads to the multivariate bioassay analysis
which has been developed by Mager [9]. By way of illustration we consider the
physicochemical and biological data reported by Schmutz [41] on six oxazepines

TABLE 37.5
Correlations between 7 physicochemical substituent parameters obtained from 90 substituent groups [39]

        log P    σₘ       σₚ       F        R        MR       MW

log P   1.000   -0.278   -0.153   -0.323    0.049    0.442    0.364
σₘ               1.000    0.884    0.959    0.468   -0.119    0.232
σₚ                        1.000    0.716    0.827   -0.041    0.263
F                                  1.000    0.200   -0.153    0.187
R                                           1.000    0.067    0.221
MR                                                   1.000    0.829
MW                                                            1.000


Fig. 37.2. Principal components loading plot of 7 physicochemical substituent parameters, as ob-
tained from the correlations in Table 37.5 [39,40]. The horizontal and vertical axes account for 46 and
31%, respectively, of the correlations. Most of the residual correlation is along the
plane of the diagram. The line segments define clusters of parameters that have been computed by
means of cluster analysis.

and six thiazepines. These compounds are classified as neuroleptics (major tranquil-
lizers) which are designed for the treatment of psychosis. The common chemical
structure is shown in the insert of Table 37.6, where the dummy element X
represents either O or S. The oxazepines and thiazepines are also substituted at the
position R in one of the two phenyl rings by means of 6 different substituent

TABLE 37.6

Physicochemical and biological parameters obtained from 6 oxazepine (X = O) and 6 thiazepine (X = S)
neuroleptics. These include Hammett's electronic parameter (σₘ), lipophilicity (log P) and its squared
value, and the activities (log 1/ED₅₀) in two pharmacological tests in rats [41].

#    X    R         σₘ       log P    (log P)²   Apomorphine   Catalepsy

1    O    CH₃      -0.070    0.487    0.237      -0.117        -0.332
2    O    SCH₃      0.151    0.597    0.356       0.069        -0.550
3    O    Cl        0.373    0.744    0.554       0.707        -0.033
4    O    CN        0.564   -0.322    0.104       0.680         0.428
5    O    SO₂CH₃    0.595   -1.279    1.636       0.680        -1.176
6    O    NO₂       0.706    0.340    0.116       0.760         0.836
7    S    CH₃      -0.116    0.524    0.275      -0.755        -1.176
8    S    SCH₃      0.151    0.597    0.356      -0.170        -1.094
9    S    Cl        0.373    0.744    0.554       0.122        -0.305
10   S    CN        0.558   -0.285    0.081       0.707         0.184
11   S    SO₂CH₃    0.600   -1.242    1.543       0.707        -1.012
12   S    NO₂       0.706    0.376    0.141       1.239        -0.360

groups. The substituent parameters o^, log P and the square of log P are shown in
Table 37.6 together with the biological activities (log 1/ED₅₀) in pharmacological
tests which measure the ability to inhibit the stereotyped effects produced by
apomorphine and the ability to produce catalepsy (abnormal waxy postures) in
rats. These two tests are considered to be good animal models for testing neuro-
leptic effects of compounds.
The biplot in Fig. 37.3 has been constructed from the factor scores of the 12
compounds and the factor loadings of the five physicochemical and biological
variables [42,43]. (The biplot graphic technique is explained in Section 31.2.) It is


Fig. 37.3. Principal components biplot showing the positions of 6 substituted oxazepine (O) and 6 sub-
stituted thiazepine (S) neuroleptics with respect to three physicochemical parameters and two biologi-
cal activities [41,43]. The data are shown in Table 37.6. The thiazepine analogs are represented by
means of filled symbols. The horizontal and vertical components represent 50 and 39%, respectively,
of the variance in the data.

the result of a principal component analysis of the column-standardized data in


Table 37.6. The cosines of the angular distances between the dashed line segments
represent the correlations between the corresponding parameters. The horizontal
and vertical axes account for 50 and 39%, respectively, of the variance in the data.
As a consequence, the parameters are displayed upon or close to the unit circle
around the origin of the plot. The activities of the catalepsy and apomorphine tests
are shown to be weakly correlated. These tests seem to express different types of
neuroleptic activity. The electronic parameter σₘ correlates strongly with the
apomorphine test and less with catalepsy. Log P has little influence on catalepsy
and is negatively associated with inhibition of apomorphine. Conversely (log P)²
has little association with apomorphine activity, but correlates negatively with
catalepsy. Hence, a model for cataleptic effects would include σₘ and (log P)². A
model for apomorphine inhibition would have to consider σₘ and log P. All
compounds appear to be arranged within a narrow band on the biplot, with the
exception of those that are substituted by SO₂CH₃ at the R position.
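
A biplot of the kind shown in Fig. 37.3 can be sketched by column-standardizing the data of Table 37.6 and taking a singular value decomposition. The split of the singular values between scores and loadings below (alpha = 0.5) is one common convention, not necessarily the one used for the published figure.

```python
import numpy as np

def pca_biplot_coordinates(X, alpha=0.5):
    """Score and loading coordinates for a PCA biplot of column-standardized data."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # column standardization
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    scores   = U[:, :2] * s[:2] ** alpha               # compound markers
    loadings = Vt[:2].T * s[:2] ** (1 - alpha)         # variable markers
    explained = s**2 / np.sum(s**2)
    return scores, loadings, explained[:2]

# usage: X is the 12 x 5 matrix of Table 37.6 (sigma_m, log P, (log P)^2,
# apomorphine, catalepsy); scores, loadings, var = pca_biplot_coordinates(X)
```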

Looking at the positions of the individual R-substituents on the plot we observe


an ordering within a tight band by increasing electronegativity: CH₃, SCH₃, Cl, CN
and NO₂. This is the order that has been imposed by the σ electronic parameter.
Overall, the distinction between oxazepines and thiazepines appears to be small.
The SO₂CH₃ substituent forms a marked exception and is highly influential
because of an extremely low log P value and low cataleptic effect. Sometimes,
however, exceptions to the general rule may be more interesting than the cases that
comply with it. Neuroleptic compounds which inhibit the stereotyped effects of
apomorphine without producing cataleptic effects are highly sought after. Unfortun-
ately, the SO₂CH₃ analogs would be poor candidates for further testing because of
their low lipophilicity (log P) which would probably lead to their rapid elimination
from the body and hence to a short duration of action.

37.2.2 Spectral map analysis

For a detailed description of spectral map analysis (SMA), the reader is referred
to Section 31.3.5. The method has been designed specifically for the study of
drug-receptor interactions [37,44]. The interpretation of the resulting spectral map
is different from that of the usual principal components biplot. The former is
symmetric with respect to rows and columns, while the latter is not. In particular,
the spectral map displays interactions between compounds and receptors. It shows
which compounds are most specific for which receptors (or tests) and vice versa.
This property will be illustrated by means of an analysis of data reporting on the
binding affinities of various opioid analgesics to various opioid receptors [45,46].
In contrast with the previous approach, this application is not based on extra-
thermodynamic properties, but is derived entirely from biological activity spectra.
Binding studies are performed on isolated receptors that are labeled by means of
a specific radioactive marker. Compounds that are tested may compete with the
specific marker at the binding site of the receptor. The fraction of the marker that is
displaced by the test compound is a measure of the affinity of the latter for that
receptor. Usually, an active compound will show a spectrum of affinities for
several related receptors. The aim of the multivariate analysis is to map the
spectrum of affinities, hence the name spectral mapping. At the time when the
report was published, pharmacologists had identified two opioid receptors which
were labeled μ and δ. The former binds specifically to opioid analgesics such as
morphine and its synthetic derivatives, the latter binds to derivatives of the
so-called endogenous opioid peptides, in particular to enkephalins and endorphins.
Strong evidence for another opioid receptor, named κ, was available and this had
been already identified in brain tissues. The data in Table 37.7 report on the results
of 26 narcotic agonists and antagonists in four binding assays. The four radioactive
markers comprised dihydromorphine (DHM, a μ-agonist), an enkephalin analog

TABLE 37.7
Inhibition constants (Kᵢ) of 26 narcotic analgesic agonists and antagonists as obtained in 4 receptor binding
tests (EKC, DHM, DADLE and NAL) [45].

# EKC DHM DADLE NAL

1 Morphine 696.0 1.3 26.1 3.5


2 Phenazocine 9.1 0.2 2.6 0.8
3 Levorphanol 47.0 0.2 4.5 0.7
4 Etorphine 4.8 0.1 0.9 0.2
5 Dihydromorphine 6012.0 1.0 16.6 1.3
6 D-Ala, D-Leu-Enkephalin 113.0 8.3 1.3 5.7
7 Met-Enkephalin 66.0 3.6 2.8 4.6
8 Leu-Enkephalin 120.0 8.1 3.1 11.6
9 Leu-Enkephalin, Arg-Phe 4430.0 4.0 3.0 4.2
10 FK-33824 6329.0 0.4 4.9 1.0
11 LY-127623 3924.0 0.7 2.5 1.2
12 Beta-Endorphin 1835.0 0.3 0.3 0.8
13 SKF 10074 241.0 1.4 5.0 1.1
14 Ketazocine 7.3 7.9 6.6 12.7
15 Ethylketocyclazocine 4.7 4.1 7.6 7.5
16 MR 2034 5.7 0.3 0.5 0.2
17 Buprenorphine 2.0 3.1 3.6 1.6
18 Lofentanil 2.7 0.7 1.2 0.4
19 Butorphanol 7.7 0.3 1.6 0.3
20 Cyclazocine 5.0 0.2 1.1 0.2
21 Levallorphan 4.0 0.2 3.3 0.7
22 Nalbuphine 38.0 2.4 21.2 2.4
23 Pentazocine 116.0 3.7 19.8 13.5
24 Nalorphine 55.0 0.7 13.9 3.7
25 Naloxone 11.1 1.4 16.5 1.0
26 Naltrexone 9.8 0.1 5.0 0.5

(DADLE, a δ-agonist), a ketazocine derivative (EKC, a κ-agonist) and the opioid


antagonist naloxone (NAL). The data are expressed as inhibition constants of the
drug-receptor interaction, Kᵢ, which are proportional to the corresponding 50%
inhibitory concentrations, IC₅₀. The four receptors are identified by squares on the
spectral map of Fig. 37.4. By convention, the areas of the squares are made


Fig. 37.4. Spectral map of the 26 opioid agonists and antagonists in 4 receptor binding tests, as de-
scribed by Table 37.7 [45, 46]. Circles refer to the compounds. Squares represent the binding tests.
Areas of circles and squares are proportional to the marginal mean affinities in the table. The lines that
join the three poles (DHM, DADLE and EKC) of the map represent axes of contrast between the μ-, δ-
and κ-opioid receptors. The horizontal and vertical components represent 18 and 79%, respectively,
of the interaction in the data.

proportional to the average affinities of the 26 compounds for the corresponding


receptors. The horizontal and vertical components represent 18 and 79%, respec-
tively, of the interaction in the data. The term interaction is defined here as the
variance that remains after double-centering of the data (Section 31.3.5).
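
The double-centering that defines this interaction term can be written down in a few lines. The sketch below is only an outline of the procedure of Section 31.3.5 (logarithms, double-centering, singular value decomposition) and omits the weighting and scaling refinements of the full spectral map algorithm; the sign convention -log Kᵢ is chosen so that larger values mean higher affinity.

```python
import numpy as np

def spectral_map(K_i):
    """Rough spectral-map coordinates from a compounds x receptors table of Ki values."""
    Y = -np.log10(K_i)                                 # log potencies (larger = higher affinity)
    row_mean = Y.mean(axis=1, keepdims=True)
    col_mean = Y.mean(axis=0, keepdims=True)
    interaction = Y - row_mean - col_mean + Y.mean()   # double-centered 'interaction'
    U, s, Vt = np.linalg.svd(interaction, full_matrices=False)
    compound_coords = U[:, :2] * s[:2]                 # circles on the map
    receptor_coords = Vt[:2].T * s[:2]                 # squares on the map
    explained = s**2 / np.sum(s**2)
    return compound_coords, receptor_coords, explained[:2]
```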
The spectral map shows three distinct poles of specificity. These are respect-
ively the μ-receptor (DHM), the δ-receptor (DADLE) and the κ-receptor (EKC).
The naloxone receptor (NAL) appears to be strongly correlated with the μ-receptor
(DHM) and, hence, provides little additional information. In spectral map analysis,
correlation between variables, as well as similarity between compounds, is evi-
denced by the proximity of their corresponding symbols. The lines drawn through
the three poles of the map represent bipolar axes of contrast. A contrast is defined
in this context as a log ratio or, equivalently, as a difference of logarithms. For
example, the horizontal axis through the μ- and δ-receptors defines the μ/δ
contrast. Compounds that project on the right side of this axis bind more specifi-
cally to the μ-receptor, while those that project on the left side possess more

specificity for the δ-receptor. The larger the distance between two poles, the
greater is the contrast between the corresponding receptors. The areas of the circles
are proportional to the average affinity of the corresponding compounds for the
four labeled receptors.
It appears from the spectral map that the κ-receptor is a highly specific receptor
which produces strong contrasts in binding affinities of opioid analgesics. The
contrast is most evident in ketazocine, ethylketocyclazocine and buprenorphine, which
possess much more affinity for the κ-receptor than for the two others. The contrast
is also strong with dihydromorphine, beta-endorphin, an enkephalin analog and
two experimental compounds (LY and FK) which have little or no affinity for the
κ-receptor.
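For readers who wish to see the essential computations behind such a map, the following sketch (in Python, with purely illustrative variable names and synthetic input; the row and column weighting of a full spectral map analysis is omitted) shows the log transformation, the double-centering and the extraction of the dominant components of the remaining interaction.

    import numpy as np

    def spectral_map(K):
        # K: (n_compounds, n_tests) table of inhibition constants Ki.
        L = np.log10(1.0 / K)                   # log potencies
        L = L - L.mean(axis=1, keepdims=True)   # remove compound (row) means
        L = L - L.mean(axis=0, keepdims=True)   # remove test (column) means (double-centering)
        U, s, Vt = np.linalg.svd(L, full_matrices=False)
        compounds = U[:, :2] * s[:2]            # coordinates of the compounds (circles)
        tests = Vt[:2, :].T                     # coordinates of the binding tests (squares)
        interaction = s**2 / np.sum(s**2)       # fraction of the interaction per component
        return compounds, tests, interaction

Applied to the log Ki values of Table 37.7, such a computation should reproduce, up to scaling and weighting, the configuration of Fig. 37.4.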

37.2.3 Correspondence factor analysis

Correspondence factor analysis (CFA) is most appropriate when the data
represent counts of contingencies, or when there are numerous true zeroes in the
table (i.e. when zero means complete absence of a contingency, rather than a small
quantity which has been rounded to zero [47]). A detailed description of the
method is found in Section 32.3.6.
Correspondence factor analysis has been applied to activity spectra obtained
from a screening test on rats (Table 37.8). One of the objectives of the test was to
differentiate between morphinomimetics (opioid agonists) and neuroleptics [48].
The intensity of the six observations has been scored on a scale from 0 (absence) to 6
(very pronounced). These include prostration, temperature increase and decrease,
pupil diameter increase and decrease, and palpebral (eye) opening decrease. The
objective of the analysis was to find criteria by which a differential classification of
screening compounds could be made on the basis of these six observations. Since the
data consist of scores and since numerous true zeroes appear in the table, CFA is
the method of choice [49]. The resulting biplot is presented in Fig. 37.5, the
horizontal and vertical axes of which account for 40 and 31%, respectively, of the
interactions in the data. Squares represent the observations, circles refer to com-
pounds. Areas of circles and squares are proportional to the marginal sums of rows
and columns in the table. The interpretation of this biplot is analogous to that of the
spectral map. The distance between two circles is related to a contrast between the
corresponding compounds (Section 32.5.3). On the biplot, we have marked neuro-
leptics by filled symbols and morphinomimetics by open symbols. It appears that
neuroleptics and morphinomimetics can be differentiated on the basis of these six
observations. Only spiperone forms an exception, as this specific neuroleptic is
wrongly classified with the morphinomimetics. The distance between two squares
is indicative of high contrasting power between the corresponding observations.

TABLE 37.8
Data obtained from a screening test on rats, differentiating between morphinomimetic (opioid analgesic)
and neuroleptic compounds [48]. The six observations are scored on a 6-point scale ranging from absent (0)
to highly pronounced (6). Compounds 1 to 13 are morphinomimetics; compounds 14 to 26 are neuroleptics.

#  Compound   Prostration   Temperat. increase   Temperat. decrease   Pupil diam. increase   Pupil diam. decrease   Palpebral decrease   Total

1 Propoxyphene 0 3 1 3 0 0 7
2 Morphine 0 0 4 2 0 2 8
3 Methadone 0 0 6 6 0 3 15
4 Pethidine 0 1 4 6 0 2 13
5 Codeine 0 4 0 3 0 1 8
6 Dextromoramide 0 0 6 4 0 3 13
7 Anileridine 0 2 3 4 0 2 11
8 Phenoperidine 0 2 5 2 0 3 12
9 Etonitazine 0 0 6 6 0 0 12
10 Phenazocine 0 2 4 4 0 2 12
11 Piritramide 0 2 4 4 0 3 13
12 Fentanyl 0 1 6 6 0 3 16
13 Bezitramide 0 0 2 2 0 1 5
14 Chlorpromazine 6 0 6 5 0 6 23
15 Perphenazine 6 0 6 0 0 6 18
16 Haloperidol 0 0 6 0 3 6 15
17 Trifluoperazine 6 0 3 0 0 6 15
18 Chlorprothixene 5 0 6 6 0 6 23
19 Spirilene 0 0 3 0 0 5 8
20 Pimozide 0 0 5 0 1 5 11
21 Pipamperone 2 0 6 0 1 6 15
22 Sulpiride 0 0 0 0 1 2 3
23 Benperidol 4 0 6 6 0 6 22
24 Spiperone 0 3 0 3 0 5 11
25 Oxypertine 6 0 6 0 1 6 19
26 Thioridazine 1 0 6 6 0 6 19

Total 36 20 110 78 7 96 347



Fig. 37.5. Biplot obtained from correspondence factor analysis of the data in Table 37.8 [43]. Circles
refer to compounds. Squares relate to observations. Areas of circles and squares are proportional to the
marginal sums of the rows and columns in the table. The horizontal and vertical components represent
40 and 31%, respectively, of the interaction in the data.

Temperature decrease and increase of pupil diameter point toward morphinomimetic
activity. Prostration, pupil diameter decrease and palpebral opening
decrease are predictive of neuroleptic activity. Temperature decrease is the least
contrasting observation, as it appears to be close to the origin of the biplot, which is
marked by a small cross. The origin is the neutral or average point and is devoid
of any contrast. As can be seen from the table, most compounds score highly for
temperature decrease and, hence, its differentiating ability is low. Three obser-
vations stand out as prominent poles of the plot, namely temperature increase,
prostration and pupil diameter increase. This might suggest that the screening test
could be simplified by including only these three highly specific observations.
Unfortunately, these observations are also highly insensitive, as the average scores
of the 26 compounds on them are very low.
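The core computations of CFA can be sketched as follows; this is a minimal Python version of the standard correspondence analysis recipe, it ignores the refinements discussed in Section 32.3.6, and all names are illustrative.

    import numpy as np

    def correspondence_analysis(N):
        # N: table of non-negative scores or counts (e.g. Table 37.8 without its margins).
        P = N / N.sum()                                      # correspondence matrix
        r, c = P.sum(axis=1), P.sum(axis=0)                  # row and column masses
        S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))   # standardized residuals
        U, s, Vt = np.linalg.svd(S, full_matrices=False)
        rows = (U * s) / np.sqrt(r)[:, None]                 # compound (row) coordinates
        cols = (Vt.T * s) / np.sqrt(c)[:, None]              # observation (column) coordinates
        inertia = s**2 / np.sum(s**2)                        # share of the interaction per axis
        return rows, cols, inertia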

37.3 Canonical variate models

While principal components models are used mostly in an unsupervised or
exploratory mode, models based on canonical variates are often applied in a
supervised way for the prediction of biological activities from chemical, physico-
chemical or other biological parameters. In this section we discuss briefly the
methods of linear discriminant analysis (LDA) and canonical correlation analysis
(CCA). Although there has been an early awareness of these methods in QSAR
[7,50], they have not been widely accepted. More recently they have been super-
seded by the successful introduction of partial least squares analysis (PLS) in
QSAR. Nevertheless, the early pattern recognition techniques paved the way
for the introduction of modern chemometric approaches.

37.3.1 Linear discriminant analysis

Linear regression, such as used in the Hansch and Free-Wilson methods, seeks
to predict the actual value of biological activities from chemical, physicochemical,
quantum mechanical and biological parameters of a set of compounds. In linear
discriminant analysis (LDA) one only attempts to predict the class membership of
the compounds from the above mentioned parameters (Section 17.9). The discrim-
inant equation is obtained from a learning or training set of compounds with
known class membership and applied to newly synthesized compounds for the
prediction of their classification. In the case of two classes (e.g. morphinomimetics
and neuroleptics) the discriminant equation can be obtained directly by means of
linear regression (Chapter 10) or principal components regression (Section 35.6).
In the general case of LDA with more than two classes one makes use of canonical
variate analysis (Section 33.2.2). The result can be displayed graphically in the
form of a biplot, which allows one to determine visually which parameters discriminate
between which classes of compounds.
Linear discriminant analyses have been carried out using the extrathermodynamic
parameters that normally appear in Hansch equations (log P and its
squared value, σ, Es, etc.). The method can also be applied to biological spectra in
order to discriminate between pharmacological classes of compounds that are
submitted to screening tests. One of the assumptions of LDA is that the classes of
compounds are normally distributed in the multivariate predictor space and that
their multivariate equidensity (hyper)ellipsoids are similarly oriented and shaped.
This is a stringent assumption which, in the strictest form, is seldom satisfied in
practice. Another problem of LDA occurs when a highly discriminating parameter
contributes little to the total variance in the data. These drawbacks can be allevi-
ated by applying partial least squares (PLS) regression which is the subject of
Section 35.7.
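A minimal sketch of such a discriminant analysis is given below; it assumes that scikit-learn is available and uses randomly generated stand-ins for the descriptor matrix and the class labels, so the numbers themselves carry no chemical meaning.

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 4))                  # stand-in for log P, (log P)^2, sigma, Es
    y = np.repeat(["morphinomimetic", "neuroleptic"], 20)

    lda = LinearDiscriminantAnalysis().fit(X, y)  # trained on a learning set of known classes
    print(lda.predict(X[:3]))                     # predicted class membership of 'new' compounds
    scores = lda.transform(X)                     # canonical variate scores for a graphical display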

37.3.2 Canonical correlation analysis

Canonical correlation analysis (CCA) is applicable in QSAR when a set of
compounds is described, on the one hand, by a table of chemical and physico-
chemical parameters (X), such as occurs in Hansch or Free-Wilson models, and,
on the other hand, by a table of multiple biological observations (Y). The method
produces several pairs of canonical axes (see Section 35.3 for a more thorough
discussion of CCA). Each pair of axes can be interpreted as a direction in the space
of the chemical and physicochemical parameters which correlates maximally with
a conjugated direction in the space of the biological variables. A pair of canonical
axes is characterized by its canonical correlation coefficient, which is equal to the
product-moment correlation between the projections of the compounds on the
corresponding canonical axes [51]. The result of CCA can be displayed graphi-
cally in the form of two conjugated biplots. Each biplot is spanned by the dominant
canonical axes, such that one can visually observe which chemical or physico-
chemical parameters correlate best with which biological variables [40]. Canonical
correlation analysis has also been used for correlating different experimental
settings for the same set of compounds, such as obtained in animal pharmacology,
in biochemical binding studies and in clinical observations on humans.
A drawback of the method is that highly correlating canonical variables may
contribute little to the variance in the data. A similar remark has been made with
respect to linear discriminant analysis. Furthermore, CCA does not possess a
direction of prediction as it is symmetrical with respect to X and Y. For these
reasons it is now replaced by two-block or multi-block partial least squares
analysis (PLS), which bears some similarity with CCA without having its short-
comings.
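The following sketch illustrates the mechanics of a canonical correlation computation with scikit-learn, again on random stand-in data; it is not meant to reproduce any of the studies cited above.

    import numpy as np
    from sklearn.cross_decomposition import CCA

    rng = np.random.default_rng(1)
    X = rng.normal(size=(30, 5))       # chemical and physicochemical parameters
    Y = rng.normal(size=(30, 3))       # multiple biological observations

    cca = CCA(n_components=2).fit(X, Y)
    Xc, Yc = cca.transform(X, Y)
    for a in range(2):                 # canonical correlation of each pair of conjugated axes
        print(np.corrcoef(Xc[:, a], Yc[:, a])[0, 1])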

37.4 Partial least squares models

Partial least squares analysis (PLS) has rapidly found applications in QSAR
since its introduction by Wold et al. [52]. The reader is referred to Sections 35.7
and 36.2.4 for a thorough discussion of PLS.

37.4.1 PLS regression and CoMFA

The PLS regression model can be expressed in the form:

y = Xb + e    (37.15)
where X represents the column-centered matrix of independent physicochemical
variables, y is the centered vector of dependent biological activities, b contains the
coefficients of regression and e is the vector of residuals.

As an illustration of PLS regression (PLS1) we reconsider the inhibitory
potencies of oxidative phosphorylation of 11 doubly substituted salicylanilides
[17] in Table 37.1. An extended Hansch model is defined by the linear free energy
relation:

log 1/IC50 = b0 + b1 σ + b2 log P + b3 (log P)²    (37.16)

The most predictive PLS regression model for these data makes use of two
PLS-components:

log 1/IC50 = -1.772 + 0.678 σ + 0.168 log P + 0.020 (log P)²

s = 0.313, R² = 0.537

The selection of the number of PLS-components to be included in the model was
done according to the PRESS criterion (Section 36.3). Note that the result is
comparable to the one which we obtained earlier by means of the simple Hansch
analysis (Section 37.1.1). Hence, in this case, there is no obvious benefit in including
a quadratic term of log P in the model.
It has been shown that PLS regression fits the observed activities more closely than
principal components regression does [53]. The method is non-iterative and, hence,
relatively fast, even in the case of very large matrices.
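A schematic PLS1 computation of the type discussed above might look as follows; the data are synthetic stand-ins for Table 37.1, scikit-learn is assumed to be available, and in practice the number of components would be chosen by cross-validation (PRESS).

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    rng = np.random.default_rng(2)
    logP = rng.uniform(2, 6, size=11)
    sigma = rng.uniform(0, 1, size=11)
    X = np.column_stack([sigma, logP, logP**2])             # descriptors as in eq. (37.16)
    y = 0.7 * sigma + 0.17 * logP + rng.normal(0, 0.1, 11)  # synthetic stand-in for log 1/IC50

    pls = PLSRegression(n_components=2)                     # two PLS components, as in the text
    pls.fit(X, y)
    print(pls.coef_)                                        # regression coefficients
    print(pls.predict(X[:3]))                               # fitted activities for three compounds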
The most prominent application of PLS regression is in comparative molecular
field analysis (CoMFA) developed by Cramer et al. [54]. While the previous
approaches deal with global properties of the whole compound or at best with
those of substituent groups, CoMFA is based on electrostatic, hydrophobic and
steric properties (called fields) at individual points of the molecules. One must,
therefore, imagine that a fine three-dimensional grid is laid over the individual
molecules and that the properties are computed at each point of the grid. This
produces for each molecule a large vector of independent parameters, the dimension
of which is theoretically equal to the number of properties times the number of grid
points. The latter amounts usually to several thousands of points, depending on the
fineness of the grid. CoMFA makes use of PLS regression to compute a relation
between the molecular fields of the compounds X and their biological activities y.
It can be understood that the independent matrix X is extremely wide, as the
number of variables exceeds the number of compounds (often by a factor as large
as 100). This causes the matrix X to be highly multicollinear. For these reasons,
PLS regression appears to be the method of choice for this application. We briefly
outline the procedure used in CoMFA below.
A group of compounds is selected which possess a common three-dimensional
pharmacophore, i.e. a configuration of chemical groups that presumably interact
with the target receptor or enzyme. The compounds are then aligned in space by
translation and rotation, for example on the basis of their similarity with a known

active compound which serves as a common template. A similar alignment
procedure is applied in Procrustes analysis, in which several data tables are related
to one another (Section 35.2). After alignment, the 3-D structures of the molecules
conform maximally with one another, such that they can be overlaid. In the next
step, a three-dimensional grid is spanned over the set of aligned and overlaid
molecules. A typical grid of 5 × 2 × 2 nm³ will theoretically generate 2 500 000
lattice points at a grid spacing of 0.02 nm. In the following step, one computes the
electrostatic, hydrophobic and steric fields at each of the lattice points. It has been
mentioned already that the electrostatic and steric fields are of an enthalpic nature,
while the hydrophobic field is entropic. Hence, these three properties give a
comprehensive description of the local extrathermodynamic properties of the
molecule. It is assumed that some of these local fields of the compounds interact
with specific (unknown) chemical groups of the target receptor or enzyme. PLS
regression is then applied in order to determine which fields at which lattice points
are responsible for the observed biological activities of the molecules. Once this
relationship is obtained, predictions of biological activity can be obtained for
newly synthesized compounds.
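The size of the resulting descriptor matrix can be appreciated from a small, purely illustrative calculation based on the grid quoted above.

    # Lattice points for a 5 x 2 x 2 nm box sampled every 0.02 nm, as quoted above.
    nx, ny, nz = round(5 / 0.02), round(2 / 0.02), round(2 / 0.02)
    n_points = nx * ny * nz
    n_fields = 3                        # electrostatic, steric and hydrophobic fields
    print(n_points)                     # 2 500 000 lattice points
    print(n_fields * n_points)          # number of columns of X contributed per molecule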
The same assumptions apply to CoMFA as to ordinary Hansch analysis. These
are additivity of effects and the availability of structurally similar (congeneric)
molecules. The method does not account for pharmacokinetic effects, such as
distribution, elimination, transport and metabolization. A prospective drug may
appear to bind well to the receptor or enzyme, but may not reach the target site due
to undesirable pharmacokinetic properties [8].

37.4.2 Two-block PLS and indirect QSAR

The methods based on relationships between chemical and physicochemical
descriptors, on the one hand, and biological activities, on the other hand, usually
involve a series of structurally related compounds. A more general approach
makes use of biological descriptors which can be applied to a wider variety of
compounds. The underlying idea is that the binding activities of compounds to
proteins (such as enzymes, antibodies and receptors) are indirectly related to their
electronic, hydrophobic and steric interactions. The proteins 'sniff out' the molec-
ular fields of the compounds, so to speak, and the binding that results depends on
the binding energies that are released. If a compound is screened in a battery of
closely related proteins, then the spectrum of binding activities provides an
indirect chemical and physicochemical description of the molecular structure.
Often, it is more economical to determine binding spectra in vitro from bio-
chemical assays than to perform experiments in vivo on animals. It is feasible to
screen large libraries of compounds in robotized batteries of binding assays, using
standard panels of related proteins [55]. Given a test set of compounds, it is

possible to derive a relationship between their already established standard binding
spectra and a particular pharmacological activity which is the object of investi-
gation, such as substance P-like activity in rats. The relationship can then be used
to predict the pharmacological activity of newly synthesized compounds from
their standard binding spectra. The fundamental assumption of this indirect QSAR
approach is that the panel of proteins is capable of recognizing the pharmacophore
which is responsible for the particular pharmacological effect. The panel of
proteins must possess affinity for the molecules, and yet must have enough
specificity in order to detect differences among them. If the assumption is met,
then the binding spectra can be very useful, as they reflect only those molecular
fields that are relevant to the interactions of proteins with the pharmacophore.
PLS regression appears to be the method of choice in the case when only a single
biological activity is to be predicted (Section 35.7). In the case of multiple
activities it is possible to predict the pharmacological spectrum from the biochemical
binding spectrum by means of two-block PLS (or PLS2). The latter can
be regarded as a generalization of PLS regression (or PLS1) in which the dependent
vector y is replaced by the matrix of dependent activities Y. The two-block PLS
model can be expressed by analogy with the PLS regression model (Section
37.4.1):
Y = XB + E (37.17)
where B is now the matrix of regression coefficients and E represents the matrix of
residuals.
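A schematic two-block PLS computation might look as follows; the arrays are random stand-ins for the X and Y blocks of Tables 37.10 and 37.9, scikit-learn is assumed to be available, and the log double-centering corresponds to the pretreatment described further on.

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    rng = np.random.default_rng(3)
    X = -np.log10(rng.lognormal(size=(17, 4)))   # stand-in for log 1/Ki binding spectra
    Y = -np.log10(rng.lognormal(size=(17, 5)))   # stand-in for log 1/ED50 spectra

    for M in (X, Y):                             # double-centering, as in spectral map analysis
        M -= M.mean(axis=1, keepdims=True)
        M -= M.mean(axis=0, keepdims=True)

    pls2 = PLSRegression(n_components=2)         # PLS2: the dependent block Y is a matrix
    pls2.fit(X, Y)
    print(pls2.predict(X).shape)                 # (17, 5) predicted pharmacological spectra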
An illustration of indirect QSAR is the prediction of the in-vivo pharma-
cological spectra of neuroleptics from their spectra of in-vitro binding to radio-
actively labeled receptors [56]. The pharmacological spectra are defined by the
ED50 (mg/kg) values obtained in five observations of the ATN test in rats [48]. The
combined ATN test includes inhibition of agitation and stereotypy induced by
apomorphine (A), inhibition of seizures and tremors provoked by tryptamine (T)
and prevention of mortality resulting from a high dose of norepinephrine (N). The
results obtained from 17 reference neuroleptics represent the dependent Y block
(Table 37.9). The biochemical binding spectra of the same 17 neuroleptics are
defined by their inhibition constants Ki (nM) of receptors in the brain that have
been radioactively labeled by four specific ligands [57]. The latter comprise
haloperidol and apomorphine which are specific for dopamine receptors, spiperone
which is specific for serotonin receptors and the compound WB4101 which is
specific for norepinephrine (alpha-adrenergic) receptors. (Remember that Ki values
are proportional to IC50 values, the proportionality constant being the same for
each receptor.) The data constitute the independent X block (Table 37.10). The
preprocessing steps that have been applied to the X and Y blocks are the same as
those of spectral map analysis. First, logarithms have been taken of the reciprocals

TABLE 37.9
Pharmacological results (ED50 in mg/kg) obtained in rats from 17 reference neuroleptic compounds in the
combined ATN test [48].

Agitation Stereotypy Seizures Tremors Mortality

Benperidol 0.01 0.01 0.33 0.33 1.34


Chlorpromazine 0.25 0.29 0.88 0.58 1.54
Chlorprothixene 0.17 0.38 0.51 0.77 0.51
Clothiapine 0.10 0.13 0.38 1.54 3.10
Clozapine 6.15 10.70 3.07 4.66 14.10
Droperidol 0.01 0.01 0.51 0.67 0.77
Fluphenazine 0.06 0.06 0.58 1.77 2.33
Haloperidol 0.02 0.02 0.77 5.00 5.00
Penfluridol 0.22 0.33 160.00 160.00 160.00
Perphenazine 0.04 0.04 0.51 1.34 3.08
Pimozide 0.05 0.05 40.00 40.00 40.00
Pipamperone 3.07 5.35 0.58 2.03 6.15
Promazine 3.07 4.66 9.33 18.70 4.06
Sulpiride 21.40 21.40 320.00 320.00 1280.00
Thioridazine 4.06 5.35 10.70 28.30 1.16
Thiotixene 0.22 0.22 21.40 320.00 4.06
Trifluoperazine 0.04 0.06 1.77 4.67 40.00

of the ED50 and Ki values, and the resulting activities were double-centered. The
biplots of Figs. 37.6 and 37.7 are constructed from the first two PLS components,
and display the pharmacological and biochemical spectra, respectively. The
reading rules of these biplots are the same as those of the spectral map which have
been explained above (Section 37.2.2). Briefly, circles represent neuroleptic
compounds and squares refer to the tests. Areas of circles are proportional to the
potencies of the compounds, while areas of squares are proportional to the
sensitivities of the tests. Potencies and sensitivities have been defined from the
marginal totals of the data tables. Compounds and tests that possess mutual
specificity are found at a distance and in the same direction with respect to the
center of the plot.
The biplots show which binding tests are predictive for which pharmacological
tests. Binding to the haloperidol and apomorphine labeled receptors corresponds
with inhibition of agitation and stereotypy in rats, binding to the spiperone receptor

TABLE 37.10

Biochemical results (Ki, in nM) obtained in binding assays of radioactively labeled brain receptors [57]
from the same 17 neuroleptics shown in Table 37.9.

Haloperidol Apomorphine Spiperone WB4101

Benperidol 0.4 0.2 6.6 2.3


Chlorpromazine 49.6 4.5 20.2 1.7
Chlorprothixene 11.0 5.6 3.3 1.0
Clothiapine 15.7 5.6 6.0 14.4
Clozapine 156.0 56.0 15.7 7.3
Droperidol 0.8 0.4 4.1 0.8
Fluphenazine 6.2 2.2 32.8 8.9
Haloperidol 1.2 1.3 48.0 3.1
Penfluridol 9.9 8.9 232.0 363.0
Perphenazine 3.9 3.6 33.0 91.3
Pimozide 1.2 1.4 32.8 41.0
Pipamperone 124.0 16.0 5.0 46.0
Promazine 99.0 38.0 328.0 2.5
Sulpiride 31.3 44.7 26043.0 1000.0
Thioridazine 15.7 8.9 36.0 3.2
Thiotixene 2.5 2.2 9.2 12.9
Trifluoperazine 3.9 2.2 41.2 20.4

corresponds with inhibition of seizures and tremors, and, finally, binding to the
alpha-adrenergic receptor (labeled by WB4101) matches with prevention of mor-
tality. These triangular patterns indicate a high specificity of the in-vivo tests for
three receptors (dopamine, serotonin and norepinephrine) that are thought to be
involved in psychotic manifestations such as hallucinations, delusions, mania,
extreme agitation, anxiety, etc. These three receptors form distinct poles on the
pharmacological and biochemical biplots of Figs. 37.6 and 37.7. There is also
agreement between the positions of the compounds on the two biplots. Haloperidol
and pimozide appear as specific inhibitors of the dopamine receptor, promazine
and thioridazine are shown to be specific blockers of the norepinephrine receptor,
and, finally, pipamperone is the only compound that exhibits high specificity for
the serotonin receptor. A comparison of the two biplots (Figs. 37.6 and 37.7)
reveals which binding assays are predictive for which pharmacological effects. It
also shows that activity spectra in animals can be reliably predicted from binding
profiles. The biplots can also be analyzed from a chemical point of view by

Fig. 37.6. PLS biplot obtained from the pharmacological data in Table 37.9, after log double-centering
and analysis by two-block PLS [56]. Circles represent 17 reference neuroleptic compounds, squares
denote tests. Areas of circles and squares are proportional to the potencies of the compounds and the
sensitivities of the tests, respectively. Reproduced with permission of E.J. Karjalainen.


Fig. 37.7. PLS biplot derived from the biochemical binding data in Table 37.10, after log double-
centering and analysis by two-block PLS [56]. The reading rules are the same as those indicated with
Fig. 37.6. Reproduced with permission of E.J. Karjalainen.

correlating structural properties with positions on the maps. It appears that com-
pounds attracted by the dopamine pole are chiefly of the butyrophenone and
diphenylbutyl type, while those attracted by the adrenergic pole belong to the
phenothiazine class. As with PLS regression, one may determine the optimal
number of components to be included for obtaining the most reliable predictions,
using cross-validation and minimization of PRESS (Section 36.3).
Two-block PLS can be extended to multi-block PLS for the prediction of drug
activities from several independent predictor blocks [58]. For example, one may
attempt to predict clinical effects (the ultimate goal of drug design) from all
available information (pharmacological, biochemical, pharmacokinetic, toxicol-
ogical, etc.). Multi-block PLS has been used to predict clinical scores for various
antischizophrenic effects [59] from in-vivo and in-vitro data [56]. The analysis
showed that only a single component of clinical effects could be reliably predicted
from animal and receptor studies, while all higher order clinical and biological
components showed only weak correlations.

37.5 Other approaches

In this chapter we have only addressed a selected number of topics and for lack
of space we have left out many others. Cluster analysis has played a larger role in
QSAR than appears from our overview. This technique is an established QSAR
tool in recognition or classification of known patterns [38,60] as well as for
cognition or detection of novel patterns [61].
Neural networks have been introduced in QSAR for non-linear Hansch analyses.
The Perceptron, which is generally considered a forerunner of neural networks,
was developed by the Russian school of Rastrigin and coworkers [62] within
the context of QSAR. The learning machine is another prototype of neural network
which has been introduced in QSAR by Jurs et al. [63] for the discrimination
between different types of compounds on the basis of their properties.
Decision trees for optimal drug design strategies have been proposed by Topliss
[64] and by Purcell et al. [65].
The determination of the three-dimensional conformation of molecules is an
important aspect of QSAR, which can be obtained from X-ray crystallography [66],
NMR spectroscopy or, in the case of small molecular fragments, by quantum-
mechanical calculations [67,68].
A topic of current interest is the study of receptor proteins and enzymes, for which
databases with crystallographic information are now available. Computer modelling
of the active sites of receptors and enzymes is an important tool in rational drug
design. Principal components and cluster analysis can be applied to the primary

sequences of these proteins in order to derive classifications and phylogenetic
relationships [69].
Of great importance is the searching of very large data bases (with several
thousands of compounds) for molecular structures that bear similarity to a known
lead compound. The standard approach is to start from connectivity tables which
describe the two-dimensional (graph) structure of the molecules in an unambiguous
way. In the case of a molecule with n atoms, the n × n connectivity table defines the
type of each atom (on the main diagonal) and the type of bond between each pair of
atoms (on the off-diagonal positions). Similarity indices between compounds can
be computed by various techniques, ranging from global connectivity indices [70]
to atom-to-atom mapping [11]. Alternatively, molecular data bases may be searched
for compounds that are as widely dissimilar as possible, in order to form standard
panels of screening compounds. Principal components analysis has been used to
reduce a large number of structural indices to a small set of independent descriptors
that can be more easily submitted to pattern recognition techniques [71].
As we have stated in the introduction to this chapter and as appears from this
overview, a wide variety of chemometric methods converges in QSAR, which
plays a key role in the design of novel and improved drugs.

References

1. A. Burger (Ed.), Medicinal Chemistry. Wiley, New York, 1970.


2. E.J. Ariens (Ed.), Drug Design, Vols I-X. Academic Press, New York, 1971-1980.
3. H. Meyer, Zur Theorie der Alkoholnarkose I. Welche Eigenschaft der Anesthetica bedingt ihre
narkotische Wirkung? Arch. Exp. Pathol. Physiol., 42 (1899) 109-118.
4. E. Overton, Osmotic properties of cells in their bearing on toxicology and pharmacy. Zeitschr.
Physik. Chem., 22 (1897) 189-209.
5. E. Fischer, Einfluss der Configuration auf die Wirkung der Enzyme. Ber. Dtsch. Chem. Ges.,
27(1894)2985-2993.
6. Dorland's Illustrated Medical Dictionary, 26th edn. Saunders, Philadelphia, PA, 1981.
7. Y.C. Martin, Quantitative Drug Design. A Critical Introduction. Marcel Dekker, New York, 1978.
8. H. Kubinyi, QSAR: Hansch Analysis and Related Approaches. VCH, Weinheim, 1993.
9. P.P. Mager, The Masca model of pharmacochemistry: II. Rational empiricisms in the multi-
variate analysis of opioids. In: Drug Design, (E.J. Ariens, Ed.), Vol. X. Academic Press, New
York, 1980, pp. 343-401.
10. J. W. McFarland, Comparative Molecular Field Analysis (CoMFA) of anticoccidial triazines. J.
Med. Chem., 35 (1992) 2543-2550.
11. C. Pepperrell, Three-Dimensional Chemical Similarity Searching. Research Studies Press (J.
Wiley), Taunton, UK, 1994.
12. A. Miklavc, D. Kocjan, J. Mavri, J. Koller and D. Hadzi, On the fundamental difference in ther-
modynamics of agonist and antagonist interactions with P-adrenergic receptors and the mecha-
nism of entropy-driven binding. Biochem. Pharmacol., 40 (1990) 663-669.
13. G.T. Fechner, Elemente der Psychophysik, 1907. Reprinted in: Elements of Psychophysik
(D.H. Davis, Ed.). Holt, Rinehart and Winston, New York, 1966.

14. L.P. Hammett, Physical Organic Chemistry. McGraw-Hill, New York, 1940.
15. C. Hansch and T. Fujita, ρ-σ-π Analysis. A method for the correlation of biological activity and
chemical structure. J. Am. Chem. Soc, 86 (1964) 1616-1626.
16. C. Hansch and W.J. Dunn III, Linear relationships between lipophilic character and biological
activity of drugs. J. Pharmaceut. Sci., 61 (1972) 1-19.
17. R.L. Williamson and R.L. Metcalf, Salicylanilides: a new group of active uncouplers of oxida-
tive phosphorylation. Science, 158 (1967) 1694-1695.
18. C. Hansch and J.M. Clayton, Lipophilic character and biological activity of drugs II: The para-
bolic case. J. Pharmaceut. Sci., 62 (1973) 1-21.
19. J.W. McFarland, On the parabolic relationship between drug potency and hydrophobicity. J.
Med. Chem., 13 (1970) 1192-1196.
20. E. Klarmann, V.A. Shtemov and L.W. Gates, The alkyl derivatives of halogen phenols and
their bactericidal action. I. Chlorophenols. J. Am. Chem. Soc, 55 (1933) 2576-2589.
21. L. Buydens, D.L. Massart and P. Geerlings, Gas chromatographic behaviour and pharmacol-
ogical activity of neuroleptica. Anal. Chim. Acta, 174 (1985) 237-244.
22. R.F. Rekker and R. Mannhold, Calculation of Lipophilicity. The Hydrophobic Fragmental
Constant Approach. VCH, Weinheim, Germany, 1992.
23. C.G. Swain and E.C. Lupton, Field and resonance components of substituent effects. J. Am.
Chem. Soc, 90 (1968) 4328-4337.
24. R.W. Taft, Separation of polar, steric and resonance effects in reactivity. In: Steric Effects in
Organic Chemistry (M.S. Newman, Ed.). Wiley, New York, 1956, pp. 556-575.
25. A. Verloop, The STERIMOL Approach to Drug Design. Marcel Dekker, New York, 1987.
26. L.B. Kier and L.H. Hall, Molecular Connectivity in Structure-Activity Analysis. Wiley, New
York, 1986.
27. C. Hansch and A. Leo, Substituent Constants for Correlation Analysis in Chemistry and Biol-
ogy. Wiley, New York, 1979.
28. G. Grassy, R. Lahana and Oxford Molecular Ltd., TSAR, Tools for Structure-Activity Rela-
tionships, User's Guide, Issue 3. Proprietary Software developed by Oxford Molecular Ltd.,
Oxford, UK, 1993.
29. S.M. Free and J.W. Wilson, A mathematical contribution to structure-activity relationships. J.
Med. Chem., 7 (1964) 395-399.
30. T. Fujita and T. Ban, Structure-activity study of phenethylamines as substrates of biosynthetic
enzymes of sympathetic transmitters. J. Med. Chem., 14 (1971) 148-152.
31. J.L. Spencer, J.J. Hlavka, J. Petisi, H.M. Krazinski and J.H. Boothe, 6-Deoxytetracyclines, V.
7,9-Disubstituted Products. J. Med. Chem., 6 (1963) 405-407.
32. P.N. Craig, Comparison of the Hansch and Free-Wilson approaches to structure-activity cor-
relation. In: Biological Correlations — The Hansch Approach (R.F. Gould, Ed.). Advances in
Chemistry Series, No. 114. American Chemical Society, Washington DC, 1972, pp. 115-129.
33. P.N. Craig, Interdependence between physical parameters and selection of substituent groups
for correlation studies. J. Med. Chem., 14 (1971) 680-684.
34. F. Darvas, Application of the sequential simplex method in designing drug analogs. J. Med.
Chem., 17 (1974) 799-804.
35. B.R. Kowalski and C.F. Bender, The application of pattern recognition to screening prospec-
tive anticancer drugs. J. Am. Chem. Soc, 96 (1974) 916-918.
36. R. Franke, S. Dove and R. Kuehne, Hydrophobicity and hydrophobic interactions, 1. On the
physical nature of aromatic hydrophobic substituent constants. Eur. J. Med. Chem., 14 (1979)
363-374.

37. P.J. Lewi, Spectral mapping, a technique for classifying biological activity profiles of chemical
compounds. Arzneim.-Forsch. (Drug Res.), 26 (1976) 1295-1300.
38. B.R. Kowalski and C.F. Bender, Pattern recognition. A powerful approach to interpreting
chemical data. J. Am. Chem. Soc, 94 (1972) 5632-5639.
39. C. Hansch, Quantitative approaches to pharmacological structure-activity relationships. In:
Structure-Activity Relationships (C.J. Cavallito, Ed.), Vol. 1. Pergamon, Oxford, 1973, pp.
75-165.
40. P.J. Lewi, Multivariate data analysis in structure-activity relationships, In: Drug Design (E.J.
Ariens, Ed.), Vol. X. Academic Press, New York, 1980.
41. J. Schmutz, Neuroleptic piperazinyl-dibenzo-azepines. Arzneim.-Forsch. (Drug Res.), 25
(1975)712-719.
42. K.R. Gabriel, The biplot graphical display of matrices with application to principal component
analysis. Biometrika, 58 (1971) 453-467.
43. P.J. Lewi, Multidimensional data representation in medicinal chemistry. In: Chemometrics.
Mathematics and Statistics in Chemistry (B.R. Kowalski, Ed.), Reidel, Dordrecht, 1984, pp.
351-376.
44. P.J. Lewi, Spectral mapping of drug-test specificities. In: Advanced Computer-Assisted Tech-
niques in Drug-Discovery (H. van de Waterbeemd, Ed.). VCH, Weinheim, Germany, 1994, pp.
219-253.
45. P.L. Wood, S.E. Charleson, D. Lane and R.L. Hudgin, Multiple opiate receptors: differential
binding of mu, kappa and delta agonists. Neuropharmacology, 20 (1981) 1215-1220.
46. G. Calomme and P.J. Lewi, Multivariate analysis of structure-activity data. Spectral map of
opioid narcotics in receptor binding. Actual. Chim. Therap., S.ll (1984) 121-126.
47. J.-P. Benzécri, Analyse des Données. Analyse des Correspondances. Dunod, Paris, 1973.
48. C.J.E. Niemegeers and P.A.J. Janssen, A systematic study of the pharmacology of DA-
antagonists. Life Sci., 24 (1979) 2201-2216.
49. M.J. Greenacre, Correspondence Analysis in Practice. Academic Press, London, 1993.
50. R. Franke, Optimierungs-Methoden in der Wirkstoff-Forschung. Quantitative Struktur-
Wirkungs-Analyse. Akademie-Verlag, Berlin, 1980.
51. T.W. Anderson, Introduction to Multivariate Statistical Analysis. Wiley, New York, 1984.
52. S. Wold, C. Albano, W.J. Dunn III, K. Esbensen, S. Hellberg, E. Johansson and M. Sjostrom,
Pattern recognition: finding and using patterns in multivariate data. In: Food Research and Data
Analysis (H. Martens and H. Russwurm Jr., Eds.). Applied Science, London, 1983, p. 147.
53. S. de Jong, PLS fits closer than PCR. J. Chemom., 7 (1993) 551-557.
54. R.D. Cramer III, D.E. Patterson and J.D. Bunce, Comparative Molecular Field Analysis
(CoMFA). 1. Effect of shape on binding of steroids to carrier proteins. J. Am. Chem. Soc, 110
(1988)5959-5967.
55. L.M. Kauvar, D.L. Higgins, H.O. Villar, J.R. Sportman, A. Engqvist-Goldstein, R. Bukar, K.E.
Bauer, H. Dilley and D.M. Rocke, Predicting ligand binding to proteins by affinity finger print-
ing. Chem. Biol., 2 (1995) 107-118.
56. P.J. Lewi, B. Vekemans and L.M. Gypen, Partial least squares (PLS) for the prediction of
real-life performance from laboratory results. In: Scientific Computing and Automation (Eu-
rope) (E.J. Karjalainen, Ed.). Elsevier, Amsterdam, 1990, pp. 199-209.
57. J.E. Leysen, Review of neuroleptic receptors: specificity and multiplicity of in-vitro binding
related to pharmacological activity. In: Clinical pharmacology in psychiatry (E. Usdin, S. Dahl,
L.F. Gram and O. Lingjaerde, Eds.). MacMillan, London, 1982, pp. 35-62.
58. L.E. Wangen and B.R. Kowalski, A multiblock partial least squares algorithm for investigating
complex chemical systems, J. Chemom., 3 (1988) 3-10.

59. J. Bobon, D.P. Bobon, A. Pinchard, J. Collard, T.A. Ban, R. De Buck, H. Hippius, P.A. Lam-
bert and O.A. Vinar, A new comparative physiognomy of neuroleptics: a collaborative clinical
report. Acta Psych. Belg., 72 (1972) 542-554.
60. C. Hansch, S.H. Unger and A.B. Forsythe, Strategy in drug design. Cluster analysis as an aid in
the selection of substituents. J. Med. Chem., 16 (1973) 1212-1222.
61. D.L. Massart and L. Kaufman, The Interpretation of Analytical Chemical Data by the Use of
Cluster Analysis. Wiley, New York, 1983.
62. S.A. Hiller, V.E. Golender, A.B. Rosenblit, L.A. Rastrigin and A.B. Glaz, Cybernetic methods
of drug design. 1. Statement of the problem — The Perceptron approach. Comp. Biomed. Res.,
6(1972)411-421.
63. P.C. Jurs, B.R. Kowalski and T.L. Isenhour, Computerized learning machines applied to chem-
ical problems. Molecular formula determination from low resolution mass spectrometry. Anal.
Chem., 41 (1969)21-27.
64. J.G. Topliss, Application of operational schemes for analog synthesis in drug design. J. Med.
Chem., 15 (1972) 1001-1011.
65. W.P. Purcell, G.E. Bass and J.M. Clayton, Strategy of drug design: a guide to biological activ-
ity. Wiley, New York, 1973.
66. J.P. Tollenaere, H. Moereels and L.A. Raymaekers, Atlas of Three-Dimensional Structure of
Drugs. Elsevier, Amsterdam, 1979.
67. L.B. Kier, Molecular Orbital Theory in Drug Research. Academic Press, New York, 1971.
68. P.S. Portoghese, In: Molecular and Quantum Pharmacology (E.D. Bergman and B. Pullman,
Eds.). Reidel, Dordrecht, 1974, pp. 352-353.
69. P.J. Lewi and H. Moereels, Receptor mapping and phylogenetic clustering. In: Advanced
Computer-Assisted techniques in Drug Discovery (H. van de Waterbeemd, Ed.). VCH,
Weinheim, Germany, 1994, pp. 131-162.
70. L.B. Kier and L.H. Hall, Molecular Connectivity in Structure-Activity Analysis. Wiley, New
York, 1986.
71. S.C. Basak, V.R. Magnuson, G.J. Niems and R.R. Regal, Determining structural similarity of
chemicals using graph-theoretic indices. Discr. Appl. Math., 19 (1988) 17-44.

Chapter 38

Analysis of Sensory Data

38.1 Introduction

The determination and analysis of sensory properties play an important role in
the development of new consumer products. Particularly in the food industry,
sensory analysis has become an indispensable tool in research, development,
marketing and quality control. The discipline of sensory analysis covers a wide
spectrum of subjects: physiology of sensory perception, psychology of human
behaviour, flavour chemistry, physics of emulsion break-up and flavour release,
testing methodology, consumer research, statistical data analysis. Not all of these
aspects are of direct interest for the chemometrician. In this chapter we will cover a
few topics in the analysis of sensory data. General introductory books are e.g. Refs.
[1-3].
There are four main types of data that frequently occur in sensory analysis:
pair-wise differences, attribute profiling, time-intensity recordings and preference
data. We will discuss in what situations such data arise and how they can be
analyzed. Especially the analysis of profiling data and the comparison of such data
with chemical information calls for a multivariate approach. Here, we can apply
some of the techniques treated before, particularly those of Chapters 35 and 36.

38.2 Difference tests

38.2.1 Triangle test


Let us consider a product developer who is trying to improve the taste of an
existing product. The first question one could ask (and should ask) before continuing
with the new product is: does the new product taste different from the old
product? If trained panellists cannot establish a significant difference, it is hardly
justifiable to do consumer tests, let alone launch the product on the market. A
standard overall difference test is the triangle test (Fig. 38.1). In such a test one
presents three samples, in no particular order, which should be tasted. Two out of
the three samples are identical (e.g. the existing product, as a control) and the task
is to identify the odd sample (the new product). If enough panellists correctly

Which is the odd sample?

Fig. 38.1. Triangle test: two similar products and one different product are presented; the assessor has
to indicate the product that is different.

recognize the dissimilar sample one knows that a sensory difference exists. Table
38.1 gives the critical values for different panel sizes and different significance
levels. For example, with 27 panellists (n = 27) one concludes that there is a
difference (at the a = 5% significance level) when at least 14 correct responses are
obtained.
One should be aware that the conclusion of 'no significant difference detected'
may not be interpreted as 'no difference present'. If the panel is small the triangle
test has low power, i.e. there is a high probability that real differences may go
unnoticed (see Sections 4.7 and 5.3). Therefore, a sizeable panel of at least 25
assessors is recommended. When the assessors have been selected and trained for
their task the panel size may be somewhat smaller (at least 20). In general,
economizing on the number of assessors for sensory tests may be a false economy:
choosing a small inexpensive panel may result in great losses due to missing real
opportunities for introducing improved products with increased market share.
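The critical values of Table 38.1 follow from a one-sided binomial test, the probability of a correct answer under pure guessing being 1/3. A small sketch (assuming SciPy is available) reproduces, for example, the value of 14 quoted above for n = 27 and α = 5%; the same calculation with a guessing probability of 1/2 yields the duo-trio values of Table 38.2 discussed in the next section.

    from scipy.stats import binom

    def triangle_critical(n, alpha=0.05):
        # Smallest x with P(X >= x) <= alpha when X ~ Binomial(n, 1/3) (pure guessing).
        for x in range(n + 1):
            if binom.sf(x - 1, n, 1 / 3) <= alpha:
                return x

    print(triangle_critical(27))        # 14 correct answers needed at the 5% level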

38.2.2 Duo-trio test

In the duo-trio test one presents to each panellist (or 'subject') an identified
reference sample, followed by two coded samples, one of which matches the
reference sample (Fig. 38.2). The subjects are asked to indicate which of the two
coded samples matches the reference. If enough correct replies are obtained the
two coded samples are perceived as different. Table 38.2 gives the critical values

Which is equal to X: A or B?

Fig. 38.2. Duo-trio test: two different products are presented; the assessor has to indicate which of
these two is similar to a third product.

TABLE 38.1
Table of critical values for the triangle test for differences

     α(%)               α(%)               α(%)               α(%)
n   10  5  1  0.1   n   10  5  1  0.1   n   10  5  1  0.1   n   10  5  1  0.1

1 26 13 14 15 17 51 22 24 26 29 76 32 33 36 39
2 27 13 14 16 18 52 23 23 26 29 77 32 34 36 40
3 3 3 28 14 15 16 18 53 53 24 27 30 78 32 34 37 40
4 4 4 29 14 15 17 19 54 23 25 27 30 79 33 34 37 41
5 4 4 5 30 14 15 17 19 55 24 25 28 30 80 33 35 38 41

6 5 5 6 31 15 16 18 20 56 24 26 28 31 81 33 35 38 41
7 5 5 6 7 32 15 16 18 20 57 25 26 28 31 82 34 35 38 42
8 5 6 7 8 33 15 17 18 21 58 25 26 29 32 83 34 36 39 42
9 6 6 7 8 34 16 17 19 21 59 25 27 29 32 84 35 36 39 43
10 6 7 8 9 35 16 17 19 22 60 26 27 30 33 85 35 37 40 43

11 7 7 8 10 36 17 18 20 22 61 26 27 30 33 86 35 37 40 44
12 7 8 9 10 37 17 18 20 22 62 26 28 30 33 87 36 37 40 44
13 8 8 9 11 38 17 19 21 23 63 27 28 31 34 88 36 38 41 44
14 8 9 10 11 39 18 19 21 23 64 27 29 31 34 89 36 38 41 45
15 8 9 10 12 40 18 19 21 24 65 28 29 32 35 90 37 38 42 45

16 9 9 11 12 41 19 20 22 24 66 28 29 32 35 91 37 39 42 46
17 9 10 11 13 42 19 20 22 25 67 28 30 33 36 92 37 39 42 46
18 10 10 12 13 43 19 20 23 25 68 29 30 33 36 93 38 40 43 46
19 10 11 12 14 44 20 21 23 26 69 29 31 33 36 94 38 40 43 47
20 10 11 13 14 45 20 21 24 26 70 29 31 34 37 95 39 40 44 47

21 11 12 13 15 46 20 22 24 27 71 30 31 34 37 96 39 41 44 48

22 11 12 14 15 47 21 22 24 27 72 30 32 34 38 97 39 41 44 48

23 12 12 14 16 48 21 22 25 27 73 31 32 35 38 98 40 41 45 48
24 12 13 15 16 49 22 23 25 28 74 31 32 35 39 99 40 42 45 49
25 12 13 15 17 50 22 23 26 28 75 31 33 36 39 100 40 42 46 49

TABLE 38.2
Table of critical values for the duo-trio test

     α(%)               α(%)               α(%)               α(%)
n   10  5  1  0.1   n   10  5  1  0.1   n   10  5  1  0.1   n   10  5  1  0.1

1 26 17 18 20 22 51 31 32 35 37 76 45 46 49 52
2 27 18 19 20 22 52 32 33 35 38 77 45 47 50 53
3 28 18 19 21 23 53 32 33 36 39 78 46 47 50 54
4 4 29 19 20 22 24 54 33 34 36 39 79 46 48 51 54
5 5 5 30 20 20 22 24 55 33 35 37 40 80 47 48 51 55

6 6 6 31 20 21 23 25 56 34 35 38 40 81 47 49 52 55
7 6 7 7 32 21 22 24 26 57 34 36 38 41 82 48 49 52 56

8 7 7 8 33 21 22 24 26 58 35 36 39 42 83 48 50 53 56

9 7 8 9 34 22 23 25 27 59 35 37 39 42 84 49 51 54 57

10 8 9 10 10 35 22 23 25 27 60 36 37 40 43 85 49 51 54 58

11 9 9 10 11 36 23 24 26 28 61 37 38 41 43 86 50 52 55 58

12 9 10 11 12 37 23 24 27 29 62 37 38 41 44 87 50 52 55 59

13 10 10 12 13 38 24 25 27 29 63 38 39 42 45 88 51 53 56 59

14 10 11 12 13 39 24 26 28 30 64 38 40 42 45 89 52 53 56 60

15 11 12 13 14 40 25 26 28 31 65 39 40 43 46 90 52 54 57 61

16 12 12 14 15 41 26 27 29 31 66 39 41 43 46 91 53 54 58 61

17 12 13 14 16 42 26 27 29 32 67 40 41 44 47 92 53 55 58 62

18 13 13 15 16 43 27 28 30 32 68 40 42 45 48 93 54 55 59 62

19 13 14 15 17 44 27 28 31 33 69 41 42 45 48 94 54 56 59 63

20 14 15 16 18 45 28 29 31 34 70 41 43 46 49 95 55 57 60 63

21 14 15 17 18 46 28 30 32 34 71 42 43 46 49 96 55 57 60 64

22 15 16 17 19 47 29 30 32 35 72 42 44 47 50 97 56 58 61 65

23 16 16 18 20 48 29 31 33 36 73 43 45 47 51 98 56 58 61 65

24 16 17 19 20 49 30 31 34 36 74 44 45 48 51 99 57 59 62 66

25 17 18 19 21 50 31 32 34 37 75 44 46 49 52 100 57 59 63 66

for various panel sizes and significance levels. For example, with a panel of n = 30
panellists one would need at least 20 correct answers to conclude that there is a
significant (a = 5%) difference.

38.2.3 Paired comparisons

In paired comparison tests two different samples are presented and one asks
which of the two samples has 'most' of the sensory property of interest, e.g. which
of two products has the sweetest taste (Fig. 38.3). The pairs are presented in
random order to each assessor and preferably tested twice, reversing the pre-
sentation order on the second tasting session. Fairly large numbers (>30) of test
subjects are required. If there are more than two samples to be tested, one may
compare all possible pairs ('round robin'). Since the number of possible pairs
grows rapidly with the number of different products this is only practical for sets of
three to six products. By combining the information of all paired comparisons for
all panellists one may determine a rank order of the products and determine
significant differences. For example, in a paired comparison one compares three
food products: (A) the usual freeze-dried form, (B) a new freeze-dried product, (C)
the new product, not freeze-dried. Each of the three pairs is tested twice by 13
panellists in two different presentation orders, A-B, B-A, A-C, C-A, B-C, C-B.
The results are given in Table 38.3.
The results of such multiple paired comparison tests are usually analyzed with
Friedman's rank sum test [4] or with more sophisticated methods, e.g. the one
using the Bradley-Terry model [5]. A good introduction to the theory and appli-
cations of paired comparison tests is David [6]. Since Friedman's rank sum test is
based on less restrictive ordering assumptions, it is a robust alternative to two-way
analysis of variance, which rests upon the normality assumption. For each panellist
(and presentation) the three products are scored, i.e. a product gets a score of 1, 2 or 3
when it is preferred twice, once or not at all, respectively. The rank scores are
summed for each product i. One then tests the hypothesis that this result could be
obtained under the null hypothesis that there is no difference between the three
products and that the ranks were assigned randomly. Friedman's test statistic for
this reads
Which of the two samples is the sweetest? A or B?

Fig. 38.3. Paired comparison test: two different products are presented and the assessor has to indicate
the one that has most of a specified attribute.

TABLE 38.3

Results of pairwise comparison of three products in two presentation orders by 13 panellists.

Paired comparison Frequency Ranking

A B C

A>B, A>C, B>C    7    1  2  3
A>B, A>C, B<C    6    1  3  2
A<B, A>C, B>C    5    2  1  3
A>B, A<C, B>C    2    2  2  2
A<B, A>C, B<C    1    2  2  2
A>B, A<C, B<C    4    2  3  1
A<B, A<C, B>C    1    3  2  1

T = 12 [Σ Ri² / {b t (t + 1)}] - 3 b (t + 1)    (38.1)

where t is the number of treatments (or products), b is the number of blocks
(panellist × session) and Ri is the sum of the ranks for the ith product (treatment;
i = 1, 2, 3). In our example, t = 3, b = 13 × 2 = 26, and the rank sums for products A, B
and C are R1 = 7×1 + 6×1 + 5×2 + 2×2 + 1×2 + 4×2 + 1×3 = 40, R2 = 57, and R3 =
59, respectively. This gives T = 12{(40² + 57² + 59²)/(26×3×4)} - (3×26×4) = 8.38.
This result should be compared to a chi-square distribution with (t - 1) = 2 degrees
of freedom. Since 8.38 > 5.99 (the 5% critical level of a χ² distribution with 2 df;
see Table 5.4), we conclude from this test that there exists a difference between the
three products. The rank sums in Friedman's analysis provide a scale on which to
position the different samples. Large differences on this scale indicate a significant
product effect. In the example a difference between A on the one hand and B and C
on the other is clearly indicated, the difference between B and C being negligible.
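The computation of eq. (38.1) for the data of Table 38.3 is easily verified with a few lines of code (a minimal sketch; the rank sums are those derived above).

    # Friedman's rank sum statistic (eq. 38.1) for the data of Table 38.3.
    t, b = 3, 26                        # products and blocks (13 panellists x 2 sessions)
    R = [40, 57, 59]                    # rank sums of products A, B and C
    T = 12 * sum(r * r for r in R) / (b * t * (t + 1)) - 3 * b * (t + 1)
    print(round(T, 2))                  # 8.38, to be compared with 5.99 (chi-square, 2 df, 5%)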
Notice that in comparisons such as these sometimes slight inconsistencies in the
results can be obtained. In two cases A was considered better than B, and B better
than C, yet C was judged superior to A! This inconsistency or non-transitivity is
known as Simpson's or de Condorcet's paradox. In this particular case it can
perfectly well be attributed to random variation. Assessors who are not sure about
their conclusion are forced to make a choice, which then can only be a random
guess. It is possible, however, to obtain results which are conflicting and
statistically significant at the same time: A < B and B < C, but C < A. This situation
may occur when the attribute to be assessed in the comparisons is open to different
interpretations. Actually, this is a case of multicriteria decision making (see
Chapter 26) and it may be impossible to rank the three products unambiguously

along a single scale. One way out of this situation is to reconsider the testing
design, improve the instructions to the assessors, rephrase the question asked in a
way less prone to misinterpretation, etc. Another way out is to adapt the analysis of
the data allowing for additional dimensions. In the next section we show how the
paired comparison of a set of products (objects, items) can give the basic input for
such a multidimensional scaling.

38.3 Multidimensional scaling

Let us consider an experiment in which n products are compared. One judges
each pair of products and gives a score for the dissimilarity of the two products. For
example, one might give a score of 1, 2, 3, 4 or 5 to indicate "No", "Little",
"Intermediate", "Much", or "Very Much" dissimilarity. The more two products
differ, the larger will be their dissimilarity score. The dissimilarity can therefore be
interpreted as a distance. All pairwise comparisons may be collected into an n × n
symmetric dissimilarity or distance matrix. Such a matrix of pairwise distances can
also be computed when one knows the coordinates of the products in some space.
The problem is now reversed: knowing the distances (dissimilarities) can we find a
configuration of the products in low-dimensional space such that it reproduces the
given distances to a good approximation? Multidimensional scaling (MDS) is the
collective name for a set of methods and algorithms to find this low-dimensional
configuration from a distance matrix.
A simple example may serve to introduce the subject. Table 38.4 gives the
distances between a few major European cities measured by the approximate flight

TABLE 38.4

Table of approximate flying times (hours) between nine major European cities

From/To:       At    Lo    Ma    Mu    Pa    Ro    St    Vl    Wa

Athens         0     5     4.5   3     4     1.5   6     4.5   4
London         5     0     2.5   2     1.5   2.5   2.5   1     2.5
Madrid         4.5   2.5   0     2.5   2     2.5   6.5   2.5   5.5
Munich         3     2     2.5   0     1.5   1.5   3.5   1.5   1.5
Paris          4     1.5   2     1.5   0     2     3.5   1     2.5
Rome           1.5   2.5   2.5   1.5   2     0     5     3     2
Stockholm      6     2.5   6.5   3.5   3.5   5     0     2     1.5
Vlaardingen    4.5   1     2.5   1.5   1     3     2     0     2
Warsaw         4     2.5   5.5   1.5   2.5   2     1.5   2     0

time. In so-called classical or metric MDS one applies Principal Coordinate
Analysis, introduced in Section 31.6. This is equivalent to finding the eigenvectors
of the double-centered squared distance matrix Δ, with elements δij:

δij = -½ (d²ij - d²i. - d².j + d²..)    (38.2)

Here, dij is the dissimilarity of object i and object j, d²i. is the mean of the squared
dissimilarities in row i, d².j the mean in column j, and d².. the overall mean of the
squared dissimilarities. When
the solution is found one hopes that a sufficiently good approximation can be
obtained in a low-dimensional space, preferably a two-dimensional space for easy
graphical display. Figure 38.4 shows the map obtained by displaying the scores for
each city along the principal axes 1 and 2. One should be aware that flying times
are only an approximate indication of the true distance, since the average flying
speed may differ between different connections. Yet, one notices that the config-
uration of the cities is quite close to reality. One may well imagine that a similar
analysis for cities from different continents would give a poor representation in
two dimensions. Here, a three-dimensional representation would be much more
adequate. With sensory data the best dimensionality of the solution cannot be
guessed beforehand and is itself an analysis result of major interest. In practice one
must carry out the MDS analysis for different choices of the dimensionality. The
final choice is guided by goodness-of-fit criteria, interpretability of the solution
and a desire to keep the results as simple as possible.
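The principal coordinate computation behind classical MDS can be sketched in a few lines of Python; the distance matrix D could, for instance, hold the flying times of Table 38.4 (the function name is purely illustrative).

    import numpy as np

    def classical_mds(D, n_dim=2):
        # Principal coordinate analysis of a symmetric distance matrix D (eq. 38.2).
        n = D.shape[0]
        J = np.eye(n) - np.ones((n, n)) / n              # centering matrix
        B = -0.5 * J @ (D ** 2) @ J                      # double-centered squared distances
        w, V = np.linalg.eigh(B)                         # eigenvalues in ascending order
        keep = np.argsort(w)[::-1][:n_dim]               # retain the largest eigenvalues
        return V[:, keep] * np.sqrt(np.clip(w[keep], 0, None))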


Fig. 38.4. Two-dimensional map of European cities derived from the flight time data of Table 38.4 us-
ing classical (metric) MDS.


Fig. 38.5. Two-dimensional map of European cities obtained from non-metric MDS (using ranked
distances) applied to the flight time data of Table 38.4.

Even when the input data are much less accurate a reasonable configuration can
be recovered by MDS. For example, replacing the distances by their rank numbers
and then applying non-metric MDS gives the configuration displayed in Fig. 38.5,
which still represents the actual distance configuration quite well.
In non-metric MDS the analysis takes into account the measurement level of the
raw data (nominal, ordinal, interval or ratio scale; see Section 2.1.2). This is most
relevant for sensory testing where often the scale of scores is not well-defined and
the differences derived may not represent Euclidean distances. For this reason one
may rank-order the distances and analyze the rank numbers with, for example, the
popular method and algorithm for non-metric MDS that is due to Kruskal [7]. Here
one defines a non-linear loss function, called STRESS, which is to be minimized:

\mathrm{STRESS} = \sum_i \sum_j \left( \delta_{ij} - f(d_{ij}) \right)^2 \Big/ \sum_i \sum_j f(d_{ij})^2     (38.3)

Minimizing this function is equivalent to finding a low-dimensional configuration


of points which has Euclidean object-to-object distances δij as close as possible to
some transformation f(·) of the original distances or dissimilarities dij. Thus, the
model distances δij are not necessarily fitted to the original dij, as in classical MDS,
but to some admissible transformation of the measured distances. For example,
when the transformation is a general monotonic transformation it preserves the

ordering of the dissimilarities, which may still be in accordance with the measure-
ment level (ordinal). There are many extensions and adaptations to multidi-
mensional scaling methods [8], which are beyond the scope of this book. In the
computer science and pattern recognition literature one finds closely related
procedures under the name of non-linear mapping [9].
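
In practice one rarely programs the iterative STRESS minimization oneself. As a
hedged illustration of how the rank-based analysis of Fig. 38.5 might be carried
out with standard software (this sketch assumes that scikit-learn and SciPy are
available and that D is the distance matrix defined in the previous sketch):

import numpy as np
from scipy.stats import rankdata
from sklearn.manifold import MDS

# Replace the flying times by their rank numbers, as in Fig. 38.5; only the
# ordering of the dissimilarities is retained.
iu = np.triu_indices_from(D, k=1)
R = np.zeros_like(D)
R[iu] = rankdata(D[iu])
R = R + R.T

# Non-metric MDS minimizes a STRESS-type loss function, cf. eq. (38.3).
mds = MDS(n_components=2, metric=False, dissimilarity='precomputed',
          random_state=0)
coords = mds.fit_transform(R)
print(round(mds.stress_, 3))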

Example
Twenty-four different types of bread were assessed visually by 12 assessors.
The samples differed in colour, size, shape, aeration and filling. The assessors
were asked to indicate on a line scale the difference in appearance between
photographs of two samples. A tick mark on the left side of the line scale indicates
no difference, a tick mark on the right end of the line scale a large difference. The
distances of the tick marks from the left end were measured for each of the pairs
and for each panellist. The distances were averaged over all panellists and the
average distances obtained were ranked and analyzed using Kruskal's non-metric
multidimensional scaling technique. A 4-dimensional configuration satisfactorily
approximated the original rank-distances. Figure 38.6 shows a 3D-graph of the
MDS analysis. The graph reveals a clear clustering into six groups. These could be
labelled as white bread, wheat bread, currant bread and pastry bread, and two
intermediate forms. The largest perceived differences are associated with a dif-
ference in colour (dim 1), filling (dim 2) and aeration (dim 3). The other
dimensions, related to size and shape differences, played a minor role in discrimi-
nating the various products.

Fig. 38.6. Three-dimensional configuration according to a non-metric multidimensional scaling applied to 24 types of bread differing in appearance as assessed by 12 panellists.

38.4 The analysis of Quantitative Descriptive Analysis profile data

Quantitative Descriptive Analysis (QDA) is the term used to designate the


characterization of samples in a reproducible way. The QDA technique is a much
used tool in sensory analysis. QDA is quantitative in the sense that reproducible
scores for sensory attributes can be obtained that are useful in describing the
samples tested. In principle, our interest is not in the scores of individual panellists,
but in the panel scores as a whole, hence we take the mean score of the panel as the
reading of the sensory 'instrument'. (Of course, individual behaviour of panellists
is of interest in monitoring the quality of individual panellists during training and
later as a control.) The panellists are chosen because of their expert sensing
abilities which enables them to perceive small differences between samples. They
are in no way chosen to represent some target group of consumers. The task of a
QDA panel is to 'measure', to describe. It should be stressed that a QDA panel
records, it gives no judgement on the acceptability of a product. The latter should
be done by larger consumer panels which are selected to represent the target
population, not on account of their skills in sensory perception. The main usage of
a QDA panel is to provide the product developer with a reliable instrument for
characterizing the sensory properties of a certain type of products.
The panel consists of carefully selected subjects that have gone through a
thorough training. Aspects considered in the selection of QDA panellists are: good
tasting abilities, odour identification, odour and taste recall, creative use of
language, fitting in the group, no product aversion, availability, general health. The
training stage of a QDA panel largely consists of generating a long list of attribute
names that cover all perceptible characteristics of the product category concerned.
The next task is to find agreement about the definition of the attributes in panel
discussions. Then one seeks to reduce the set of attributes by grouping similar
attributes into homogeneous clusters. Finally, one tests for the repeatability of each
of the attributes as used by the individual panellists. Even after reducing the
number of attributes originally generated the resulting set of attributes can be quite
large: from 10 up to some 100 attributes.
In many ways one may consider a well-trained QDA panel as an analytical
instrument that measures the intensity of certain sensory attributes. For example, it
has a certain precision that can be measured, it may show drift, it may require
readjustment and its application is bound to strict protocols. A common way to
record the intensity is by marking an unstructured line segment that has a left
anchor ("absent") and right anchor ("very strong"). Usually the score is directly
fed into the computer and recorded as a number between 0 ("absent") and 100
("very strong"). A typical way to represent the results of a sensory panel session is
in the form of profiles. Figure 38.7 shows the panel-average profiles for two
products. Here, the scores on different attributes are plotted in one graph with the
432

Attribute
Appearance Pale fl • Yellow
Smooth o' * * Coarse
Spreading slippery jy Sticky
Smooth "'"C^ Brittle
Fine ^B > • Coarse
Mouthfeel Quick • \ * Slow
Thin P \ * Thick
Fine O ^* it Coarse
Flavour None Very Intense
None Very Salty
T"\
None 6 ^ ^ * Off-Flavour

0 5 10 15 20 25 30 35
Score

Fig. 38.7. Example of average sensory profile for two food products as obtained in a QDA panel. For
five attributes (*) the difference is statistically significant (95% confidence level).

attributes arranged along one axis and the other axis providing the scale for the
scores. Attribute scores belonging to the same product are joined, thus giving a
multivariate profile of the sensory characteristics of the product.
The analysis of QDA panel data is usually done by means of analysis of
variance (see Chapter 6) on the individual attributes. With the individual data it is
arguable whether analysis of variance is a proper method with its assumption of
normality. In a two-way ANOVA lay-out with "Products" and "Panellist" as main
factors one can test the main effects against the Product*Panellist interaction (see
Section 6.8.2). The latter interaction can be tested against the pure error when the
testing has been repeated in different sessions. Ideally, one would like this inter-
action to be non-significant. Absence of a significant Product*Panellist interaction
means that the panellists have similar views on the relative ordering of the
products. A practical consequence is that it is less of a problem when not all
panellists are present at each tasting session. With panel mean scores the assump-
tion of normality is much more easily satisfied. For such panel average scores one
may for example test for a difference between products when the samples have
been tasted at two or more sessions. In Fig. 38.7 the attributes for which there is a
significant difference between the two products have been indicated.
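
For readers who wish to program such a test, a two-way lay-out of this kind can be
set up as sketched below. The fragment is only an illustration: it assumes a
long-format table with hypothetical column names (score, product, panellist), uses
the statsmodels package, and recomputes the F-ratio of the Product effect against
the Product*Panellist interaction by hand, as discussed above.

import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# One row per individual tasting; 'qda_scores.csv' and the column names are
# hypothetical placeholders for the panel data.
scores = pd.read_csv('qda_scores.csv')

# Two-way ANOVA with Products and Panellists as main factors plus their
# interaction; replicate sessions provide the pure error (the residual).
model = smf.ols('score ~ C(product) * C(panellist)', data=scores).fit()
table = anova_lm(model)
print(table)

# Test the Product main effect against the Product*Panellist interaction
# rather than against the pure error (cf. Section 6.8.2).
ms_product = table.loc['C(product)', 'sum_sq'] / table.loc['C(product)', 'df']
ms_interaction = (table.loc['C(product):C(panellist)', 'sum_sq'] /
                  table.loc['C(product):C(panellist)', 'df'])
print(ms_product / ms_interaction)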
When there are many samples and many attributes the comparison of profiles
becomes cumbersome, whether graphically or by means of analysis of variance on
all the attributes. In that case, PCA in combination with a biplot (see Sections 17.4
and 31.2) can be a most effective tool for the exploration of the data. However, it
does not allow for hypothesis testing. Figure 38.8 shows a biplot of the panel-
average QDA results of 16 olive oils and 7 appearance attributes. The biplot of the

Fig. 38.8. Biplot of panel-average QDA profiles (attributes) for 16 olive oil products.

column-centered data shows the approximate position of the products and the
attributes in sensory space. The 2D-approximation of Fig. 38.8 is a fairly good one,
since it accounts for 83% of the variance. Noteworthy in the biplot is the presence
of one distinct sample, G13, characterized by a particularly low transparency,
perhaps due to the presence of "particles". The plot, which is based on a factoring
using α = 0 and β = 1 (see Section 31.2), also gives a geometrical view on the
correlation between the various attributes. For example, the attributes "brown" and
"green" are highly correlated and negatively correlated with "yellow". The
attributes "transparency" and "particles" are also strongly negatively correlated.
Instead of using the principal component axes one might define two factors (see
Section 34.2), obtained by a slight rotation over 20 degrees, which are associated
with the particles/transparency contrast and the green/yellow contrast, respec-
tively. The attributes "syrup" and "glossy" are not well represented in this
2-dimensional projection.
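
For completeness, a biplot of this type can be computed from a singular value
decomposition of the column-centered data table. The sketch below is illustrative
only: the array X (products in the rows, attributes in the columns) and the file
name are placeholders, and the factoring with α = 0 and β = 1 is the one referred
to above.

import numpy as np

# Panel-average QDA scores: 16 olive oil products (rows) x 7 appearance
# attributes (columns); the file name is a hypothetical placeholder.
X = np.loadtxt('olive_qda.txt')

Xc = X - X.mean(axis=0)                     # column-centering
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Factoring with alpha = 0, beta = 1: products are displayed as U and the
# attributes as V*diag(s), so that angles between attribute vectors reflect
# their correlations (cf. Section 31.2).
product_marks = U[:, :2]
attribute_marks = (Vt.T * s)[:, :2]

explained = (s[:2] ** 2).sum() / (s ** 2).sum()
print('variance accounted for by the 2D biplot: %.0f%%' % (100 * explained))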

38.5 Comparison of two or more sensory data sets

When many data sets for the same set of products (objects) are available it is of
interest to look for the common information and to analyze the individual
deviations. When the panellists in a sensory panel test a set of food products one
might be interested in the answer to many questions. How are the products
positioned, on the average, in sensory space? Are there regions which are not well

covered, which may provide opportunities for new products? How are the
attributes related? Are some panellists deviating significantly from the majority
signalling a need for retraining? When comparing different panels, is there a
consensus among the panels? Can one compare the results of different panels
across (cultural, culinary) borders? After all, descriptive panels should be more or
less objective! Can one compare the results of panels (panellists) which have tested
the same samples, but have used different attributes? Can this then be used to
interrelate the different sets of attributes?
A powerful technique which allows one to answer such questions is Generalized
Procrustes Analysis (GPA). This is a generalization of the Procrustes rotation
method to the case of more than two data sets. As explained in Chapter 36
Procrustes analysis applies three basic operations to each data set with the
objective to optimize their similarity, i.e. to reduce their distance. Each data set can
be seen as defining a configuration of its rows (objects, food samples, products) in
a space defined by the columns (sensory attributes) of that data set. In geometrical
terms the (squared) distance between two data sets equals the sum over the squared
distances between the two positions (one for data set XA and one for XB) for each
object.
The first operation in Procrustes analysis is to shift the center of gravity of each
configuration of data points to the origin. This is a geometric way of saying that
each attribute is mean-centered. In the next step the data configurations are rotated
and possibly reflected to further reduce the distances between corresponding
objects. When more than one data set is involved each data set is rotated in turn to
the mean configuration of the other data sets. In the third step the data sets are
shrunk or stretched by a scaling factor in order to increase their match. For each
data set its scaling factor applies equally to all its attributes, thus the data
configuration of objects shrinks or expands by an equal amount in any direction.
The process of alternating rotations and scaling is repeated until convergence to
stable and optimally close configurations.
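
The alternating scheme can be summarized in a few lines of code. The sketch
below is a simplified illustration (it is not the exact algorithm used to produce the
figures in this chapter): it assumes a list of equally sized, already matched data
matrices and applies centering, Procrustes rotation towards the mean configuration
and least-squares isotropic scaling until the configurations stop moving.

import numpy as np

def gpa(datasets, n_iter=100, tol=1e-10):
    """Simplified generalized Procrustes analysis (illustrative sketch)."""
    # Step 1: shift the centre of gravity of each configuration to the origin.
    X = [A - A.mean(axis=0) for A in datasets]
    previous = np.inf
    for _ in range(n_iter):
        M = np.mean(X, axis=0)                       # current mean configuration
        for i, A in enumerate(X):
            # Step 2: rotate (and possibly reflect) A towards the mean
            # configuration (orthogonal Procrustes via the SVD).
            U, _, Vt = np.linalg.svd(A.T @ M)
            A = A @ (U @ Vt)
            # Step 3: shrink or stretch A by a single scaling factor.
            c = np.trace(A.T @ M) / np.trace(A.T @ A)
            X[i] = c * A
        residual = sum(((A - M) ** 2).sum() for A in X)
        if abs(previous - residual) < tol:
            break
        previous = residual
    return X, np.mean(X, axis=0)                     # transformed sets, consensus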
We now give a qualitative discussion of the usage and interpretation of the GPA
method to sensory data. The most common application in sensory analysis of GPA
is the comparison of the test results from different panellists. The interest may
simply be to find the 'best' average configuration of the samples. For example, Fig.
38.9 shows the results of a Procrustes analysis on 7 cheese products measured in
triplicate by a QDA panel of 8 panellists using 18 attributes describing odour,
flavour, appearance and texture. The graph shows the average position of the 7
cheeses after optimal Procrustes transformation of the 8x3 = 24 different data sets.
This final average GPA configuration is often referred to as the consensus
configuration. The term 'consensus' is somewhat misleading as the final GPA
configuration is merely the result of an averaging process, it is not the result of
some group discussion between the panellists! In principle, the 7 products span a

Fig. 38.9. Consensus plot showing the relative position of 7 cheese products (A-G) as assessed by a panel in 3 sessions. Triangles for each product indicate the three sessions. Differences between the three sessions are much smaller than differences between products.

6-dimensional space. In this case a two-dimensional projection onto the first two
principal components of the GPA average configuration is a good enough
approximation, accounting for 83% of the total variation. The triangles in the plot
around each product indicate the average positions obtained at the three sessions.
Clearly, the difference among the products exceeds the within-product between-
session variability. Therefore, an interpretation of the results with regard to the
products can be meaningful. Conclusions which can be drawn from the graph are,
for example: A takes a somewhat isolated position, E and F are close, so are C and
G, B is an 'average' product, the lower right area is empty, and so on. For the
product developer who has a background knowledge about the various products
such a graphical summary of the sensory properties can be a useful aid in his work.
For an interpretation of the principal axes one may draw a correlation plot. This
is a plot of the loadings (correlations) of the individual attributes with the principal
axis scores. Figure 38.10 shows an example of an international collaborative study
involving panels from five different institutes [11]. The aim was to assess the
degree of cross-cultural differences in the sensory perception of coffee. The
panels characterized eight brands of coffee, each with an independently developed
list of attributes. The correlation plot reveals that attributes with the same or a
similar name are in general positioned close together. This lends credit to the
'objectivity' of the QDA technique. Attributes which are close to the circle of
radius 1 are well represented by the 2-dimensional space of the first two principal
axes. Thus, the correlation plot and the reference to the common configuration is
helpful in judging the relations between the various attributes within and between
different panels. One may also try to label each principal axis with a name that is

Fig. 38.10. Correlations of attributes with consensus principal axes (panels: Denmark, Poland, Germany, France, GB/ICO).

suggested by attributes which are highly (positively or negatively) correlated with


that axis. This is not always an easy task. Sometimes it is easier to distinguish main
factors that are rotated with respect to the principal axes.
One may also analyze which of the individual sets (i.e. panellists) are close to
the mean and which are more deviant. For this analysis one determines the
residuals for all products and attributes between the mean configuration and the
individual data sets after the optimal Procrustes transformation. One then strings
out each individual residual data set in the form of a long row vector. These row
vectors are collected into a matrix where each row is now associated with a
panellist. Performing a PCA on this matrix shows in a score plot the relative
position of the individual panellist as a deviation from the mean. Figure 38.11
shows this plot for the 8 panellists of the cheese study. It reveals panellist 8 as
being furthest removed from the rest. This panellist perhaps needs additional
training.
It is not strictly required to use the same attributes in each data set. This allows
the comparison of independent QDA results obtained by different laboratories or
development departments in collaborative studies. Also within a single panel,
individual panellists may work with 'personal' lists of attributes. When the sensory
attributes are chosen freely by the individual panellist one speaks of Free Choice
Profiling. When each panellist uses such a personal list of attributes, it is likely that

Fig. 38.11. Deviations of the 8 panellists from the consensus, based on a PCA of the residuals × panellist table, where the residuals comprise all products, sessions and attributes.

the number of variables differs from panellist to panellist. In that case it is


convenient to add dummy columns filled with zeros so that all panellists have data
sets of the same, maximum, size. This so-called zero-padding does not affect the
analysis.
So far, the nature of the variables was the same for all data sets, viz. sensory
attributes. This is not strictly required. One may also analyze sets of data referring
to different types of data (processing conditions, composition, instrumental
measurements, sensory variables). However, regression-type methods are better
suited for linking such diverse data sets, as explained in the next section.

38.6 Linking sensory data to instrumental data

The objective of relating sensory measurements to instrumental measurements


is twofold. A first objective can be that it may help a better understanding of the
sensory attributes. One should realize that such a goal usually can only be met
partly since sensory perception is a highly complex process. The instrumental
measurements too may be the result of complex processes. For example, the force
recorded with an Instron instrument when compressing a food sample depends in
an intricate way on its flow behaviour and breaking properties, which themselves
are determined by the sample's internal structure. A second goal of relating the two
types of measurements is that instrumental measurements may eventually replace
the sensory panel. The driving force behind this second objective is that instru-
mental measurements are cheaper. Not much success has been achieved in this area,
due to the complexity of human sensory perception.

When relating instrumental measurements to sensory data one should focus on


QDA-type data. Hedonic (or 'liking') scores and preference data are generally not
well suited for comparisons with instrumental measurements, since there usually
will not be a linear relationship. A simple example is saltiness, let us say, of a soup.
A QDA panel can be used to 'measure' saltiness as a function of salt concentration.
Over a small concentration range the response may be approximately linear. At
higher concentrations the response may flatten off and in analogy with an
analytical instrument one may consider that the panel is then performing outside its
linear range. With preference testing the nature of the non-linearity is quite
different. One does not measure the saltiness per se, but the condition that is best
liked. Liking scores will show an optimum at some intermediate level of saltiness,
so that the salty taste is neither too weak nor too strong.
A table of correlations between the variables from the instrumental set and
variables from the sensory set may reveal some strong one-to-one relations.
However, with a battery of sensory attributes on the one hand and a set of
instrumental variables on the other hand it is better to adopt a multivariate
approach, i.e. to look at many variables at the same time taking their inter-
correlations into account. An intermediate approach is to develop separate
multiple regression models for each sensory attribute as a linear function of the
physical/chemical predictor variables.

Example
Beilken et al. [12] have applied a number of instrumental measuring methods to
assess the mechanical strength of 12 different meat patties. In all, 20 different
physical/chemical properties were measured. The products were tasted twice by 12
panellists divided over 4 sessions in which 6 products were evaluated for 9 textural
attributes (rubberiness, chewiness, juiciness, etc.). Beilken et al. [12] subjected the
two sets of data, viz. the instrumental data and the sensory data, to separate
principal component analyses. The relation between the two data sets, mechanical
measurements versus sensory attributes, was studied by their intercorrelations.
Although useful information can be derived from such bivariate indicators, a truly
multivariate regression analysis may give a simpler overall picture of the relation.
In recent years the application of techniques such as PLS regression to link the
block of sensory variables to the block of predictor variables has become popular.
PLS regression is well suited to data sets with relatively few objects and many
highly correlated variables. It provides an analysis in terms of a few latent
variables that often allows a meaningful interpretation and an effective graphical
summary. When we analyze the data of Beilken et al. with PLS2 regression (see
Section 35.7) a two-dimensional model is found to account for 65-90% of the
variance of the sensory attributes, with the exception of the attributes juicy and
greasy which cannot be modelled well with this set of explanatory variables.
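
The mechanics of such a PLS2 analysis can be indicated with a short script. The
fragment below is purely illustrative (scikit-learn is assumed, and the file names
are placeholders for the 12 x 20 instrumental table and the 12 x 9 panel-average
sensory table, which are not reproduced here).

import numpy as np
from sklearn.cross_decomposition import PLSRegression

X = np.loadtxt('instrumental.txt')       # 12 products x 20 instrumental variables
Y = np.loadtxt('sensory.txt')            # 12 products x 9 sensory attributes

# Two-dimensional PLS2 model; scikit-learn autoscales the variables by default.
pls = PLSRegression(n_components=2)
pls.fit(X, Y)

t_scores = pls.x_scores_                 # product scores (cf. Fig. 38.12)
p_loadings = pls.x_loadings_             # instrumental loadings (cf. Fig. 38.13)
q_loadings = pls.y_loadings_             # sensory loadings (cf. Fig. 38.14)

# Fraction of the variance of each sensory attribute accounted for by the model.
Y_hat = pls.predict(X)
r2 = 1 - ((Y - Y_hat) ** 2).sum(axis=0) / ((Y - Y.mean(axis=0)) ** 2).sum(axis=0)
print(np.round(r2, 2))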

Fig. 38.12. Scores of products (meat patties) on the first two PLS dimensions.

Figure 38.12 shows the position of the twelve meat patties in the space of the first
two PLS dimensions. Such plots reveal the similarity of certain products (e.g. C
and D, or E and G) or the extreme position of some products (e.g. A or I or L).
Figure 38.13 shows the loadings of the instrumental variables on these PLS factors
and Fig. 38.14 the loadings of the sensory attributes. The plot of the products in the

Fig. 38.13. Loadings of predictor (instrumental) variables on the first two PLS dimensions.

Fig. 38.14. Loadings of dependent (sensory) variables on the first two PLS dimensions.

space of the PLS dimensions has a fair resemblance to a PCA scores plot based on
the sensory variables only. As a consequence, the PLS loading plot of the sensory
variables (Fig. 38.14) gives a similar picture as a PCA loading plot would give.
The associations between the variables within each set are immediately apparent
from Fig. 38.13 or Fig. 38.14. For example, "hardness", "hrdxchv", "R-punch"
and "CP-peak" are all highly correlated and indicate the firmness of the product.
As another example, "tensile" strength is positively correlated with the amount of
protein and negatively correlated with "moisture" and "fat". It would require an
exhaustive inspection of the 20x9 correlation table to obtain similar conclusions
about the relationship between variables of the two sets as the conclusions derived
from simply comparing or overlaying Figs. 38.13 and 38.14.

38.7 Temporal aspects of perception

In the foregoing we loosely talked about the intensity of a sensory attribute for a
given sample, as if the assessors perceive a single (scalar) response. In reality,
perception is a dynamic process, and a very complex one. For example, when a
food product is taken in the mouth, the product disintegrates, emulsions are
broken, flavours are released and transported from the mouth to the olfactory
(smell) receptors in the nose. The measurement of these processes, analyzing and
interpreting the results and, eventually, their control is of importance to the food

Fig. 38.15. Example of time-intensity (TI) curves.

manufacturer. There are many ways to study the temporal aspects of sensory
perception. Experimentally many methods have been developed to measure so-
called time-intensity curves or TI-curves. A currently popular methodology is to use
a slide-wire potentiometer or a computer mouse and to feed the data directly into
the computer. The sort of curves obtained is shown in Fig. 38.15.
Typically, one may characterize such a curve by a number of parameters, such
as time-to-maximum-intensity, maximum intensity, time of decay, total area. The
way to average such curves over panellists in order to derive a panel-average
TI-curve is not trivial. Geometrical averaging in both the intensity and time
direction may help to best preserve shape. Separate analysis of variance of the
characteristic parameters of the average curves can be used to assess the
differences between the products. One may also try to fit a parametric function to
each individual curve, for example, a combination of two exponential functions
(see Chapters 11 and 39). The curves are then characterized by their best-fitting
parameters, and these are compared in a subsequent analysis. Another method is to
leave the curves as they are and to analyze the whole set of curves by PCA. It
would seem natural to consider each curve as a multivariate observation and the
intensities at equidistant points in time as the variables. Since there is a natural zero
point (t = 0, I = 0) in these measurements it makes sense in this case not to center
each curve around its mean intensity, but to analyze the raw intensity data. Also,
since the different maximum intensities may be related to concentration of the
bitter component it would be imprudent to scale the curves to a common standard
(e.g. maximum, mean intensity, area). However, one might consider a log trans-
formation allowing for the nature (ratio scale) of the intensity scale.
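
Such a non-centered PCA is nothing more than a singular value decomposition of
the raw intensity matrix. The sketch below is illustrative only: the array of
curves (one row per panellist, intensities at equidistant time points) is assumed
to have been read from a file with a placeholder name.

import numpy as np

# TI-curves for one product: panellists in the rows, equidistant time points in
# the columns; raw (uncentered) intensities.
curves = np.loadtxt('ti_curves.txt')

# Non-centered PCA: SVD of the raw data, without subtracting column means.
U, s, Vt = np.linalg.svd(curves, full_matrices=False)

loading_pc1 = Vt[0]           # resembles the average curve (cf. Fig. 38.16)
loading_pc2 = Vt[1]           # contrast that modifies the shape (cf. Fig. 38.17)
scores = U[:, :2] * s[:2]     # panellist scores (cf. Fig. 38.18)

# Rotated combinations PC1 + PC2 and PC1 - PC2, interpreted in the text as
# 'fast' and 'slow' perception curves.
fast_curve = loading_pc1 + loading_pc2
slow_curve = loading_pc1 - loading_pc2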
Figures 38.16 to 38.18 show the result of such an uncentered PCA applied to a
set of TI-curves obtained by 9 panellists [13]. The perceived intensity of bitterness

Fig. 38.16. Loading curves (PC1, non-centered PCA) for 4 bitter solutions.

of four solutions, caffeine and tetrahop at two concentration levels, was recorded
as a function of time. The analysis is applied separately to each of four bitter
solutions. The loading plots for PC1 and PC2 are shown in Figs. 38.16 and 38.17.
Notice that PC1 (Fig. 38.16) has little structure: it represents an equal weighting
over most of the time axis. In fact the PC1 loading plot very much resembles the
average curve for each product. This is a common outcome with non-centered PCA.
The loading plot for PC2 (Fig. 38.17) has a more distinct structure. Since it has a
negative part it does not represent a particular type of intensity curve. PC2 affects
the shape of the curve. One notices in Fig. 38.17 a distinct difference in shape
between the two tetrahop solutions and the two caffeine solutions. This interpretation
of PC1 as a size component and PC2 as a contrast component is a familiar phenomenon in
principal component analysis of data (see Chapter 31). A different interpretation is
obtained if one considers a rotation of the PCs. Rotation of PC1 and PC2 does give
more interpretable curves: PC1 + PC2 gives a curve that rises steeply and decays
deeply, representing a fast perception, whereas PC1 - PC2 gives a curve that starts
to rise much more slowly, reaching its maximum much later, with a longer lasting
perception. The score plot of the panellists in the space of PC1 and PC2 is shown in
Fig. 38.18, for one of the four products. A high score along PC1, e.g. panellist 6,
implies that the panellist gave overall high intensity scores. A high score on PC2

Fig. 38.17. Loading curves (PC2, non-centered PCA) for 4 bitter solutions.

Fig. 38.18. Score plot (PC2 vs. PC1) based on non-centered PCA of TI-curves from 9 panellists for a bitter (caffeine) solution.

(e.g. panellist 9) implies a TI-curve with a relatively fast rise and early peak, in
contrast to a low (negative) score on PC2 (e.g. panellist 1), implying a TI-curve
with a relatively slow rise and late peak.
There have also been attempts to describe the temporal aspects of perception
from first principles, the model including the effects of adaptation and integration
of perceived stimuli. The parameters in the specific analytical model derived were
estimated using non-linear regression [14]. Another recent development is to
describe each individual TI-curve, fi(t), i = 1, 2, ..., n, as derived from a prototype
curve, S(t). Each individual TI-curve can be obtained from the prototype curve by
shrinking or stretching the (horizontal) time axis and the (vertical) intensity axis,
i.e. fi(t) = ai S(bi t). The least squares fit is found in an iterative procedure,
alternately adapting the parameter sets {ai, bi} for i = 1, 2, ..., n and the shape of the
prototype curve [15].
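
The alternating least squares idea behind this prototype model can be sketched as
follows. This is a rough illustration of the approach, not the published algorithm
of ref. [15]: the time grid t and the matrix F of individual TI-curves are assumed
to be given, and the stretch factor bi is found by a simple grid search.

import numpy as np

def fit_prototype(t, F, n_iter=20):
    """Fit f_i(t) ~ a_i * S(b_i * t) by alternating least squares (sketch)."""
    n = F.shape[0]
    S = F.mean(axis=0)                       # initial guess for the prototype
    a, b = np.ones(n), np.ones(n)
    b_grid = np.linspace(0.5, 2.0, 151)      # candidate time-axis stretch factors
    for _ in range(n_iter):
        # Update (a_i, b_i) for each individual curve: b_i by grid search,
        # a_i in closed form for a given b_i.
        for i in range(n):
            best = (np.inf, 1.0, 1.0)
            for bi in b_grid:
                Sb = np.interp(bi * t, t, S)             # S(b_i * t) on the grid
                ai = (F[i] @ Sb) / (Sb @ Sb)
                err = ((F[i] - ai * Sb) ** 2).sum()
                if err < best[0]:
                    best = (err, ai, bi)
            _, a[i], b[i] = best
        # Update the prototype by averaging the back-transformed curves,
        # S(u) ~ f_i(u / b_i) / a_i.
        S = np.mean([np.interp(t / b[i], t, F[i]) / a[i] for i in range(n)], axis=0)
    return a, b, S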

38.8 Product formulation

An important task of the food technologist is to optimize the ingredients


(composition) or the processing conditions of a food product in order to achieve
maximum acceptability. In practice this has often to be done under constraints of
cost restriction and limited ranges of composition or processing conditions. Tech-
niques such as response surface methodology (Chapter 24) and mixture designs
(Chapter 25) are effective in formulation optimization. It is very often the case that
the sensory perception of a product is not a simple linear function of the ingredients
involved. A logarithmic function (Weber-Fechner law) or a power-law (Stevens
function) often describes the relation between perceived intensity (I) and
concentration (c):
I = a log(c)          Weber-Fechner law     (38.4)
I = a c^n             Stevens' law          (38.5)
The acceptance is generally a non-linear function of perceived intensity. A simple
example is the salt level in a soup which clearly has a level of maximum accept-
ability between too weak and too salty a taste.
The experimental designs discussed in Chapters 24-26 for optimization can be
used also for finding the product composition or processing condition that is
optimal in terms of sensory properties. In particular, central composite designs and
mixture designs are much used. The analysis of the sensory response is usually in
the form of a fully quadratic function of the experimental factors. The sensory
response itself may be the mean score of a panel of trained panellists. One may
consider such a trained panel as a sensitive instrument to measure the perceived
intensity useful in describing the sensory characteristics of a food product.
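
For two factors, fitting such a fully quadratic response surface amounts to an
ordinary least squares estimation of six coefficients. The sketch below is a
minimal illustration (the design file and variable names are hypothetical
placeholders; the design points would come from, e.g., a central composite design).

import numpy as np

# Columns: factor levels x1, x2 (e.g. % corn syrup and % fat) and the panel
# mean score y of one sensory attribute at each design point.
design = np.loadtxt('design.txt')
x1, x2, y = design[:, 0], design[:, 1], design[:, 2]

# Fully quadratic model:
# y = b0 + b1*x1 + b2*x2 + b11*x1^2 + b22*x2^2 + b12*x1*x2
M = np.column_stack([np.ones_like(x1), x1, x2, x1 ** 2, x2 ** 2, x1 * x2])
b, *_ = np.linalg.lstsq(M, y, rcond=None)

# Predicted response over a grid, e.g. to draw contour plots as in Fig. 38.19.
g1, g2 = np.meshgrid(np.linspace(x1.min(), x1.max(), 50),
                     np.linspace(x2.min(), x2.max(), 50))
G = np.column_stack([np.ones(g1.size), g1.ravel(), g2.ravel(),
                     g1.ravel() ** 2, g2.ravel() ** 2, g1.ravel() * g2.ravel()])
surface = (G @ b).reshape(g1.shape)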

Example
Figure 38.19 shows the contour plots of the foaming behaviour, uniformity of
air cells and the sweetness of a whipped topping based on peanut milk with varying
corn syrup and fat concentrations [16]. Clearly, fat is the most important variable
determining foam (Fig. 38.19A), whereas corn syrup concentration determines
sweetness (Fig. 38.19C). It is rather the rule than the exception that more than one
sensory attribute is needed to describe the sensory characteristics of a product. An
effective way to make a final choice is to overlay the contour plots associated with
the response surfaces for the various responses. If one indicates in each contour plot
which regions are preferred, then in the overlay a window region of products with
acceptable properties is left (see Fig. 38.19D and Sections 24.5 and 26.4). In the

Fig. 38.19. Contour plots of foam (A), uniformity of air cells (B) and sweetness (C) as a (fully quadratic) function of the levels of fat and corn syrup. An overlay plot (D) shows the region of overall acceptability.

case of this example products with >135 g fat and <175 g corn syrup per kg would
give acceptable foaming behaviour, uniformity of air cells and sweetness. One
might also construct a composite desirability function and optimize this composite
response (see Section 26.4 on multi-criteria decision making). Sometimes, the best
settings for the various responses are conflicting and there is no region where all
properties are acceptable. In that case one has either to relax the acceptability limits
for the various responses or to change the product concept.

References

1. M. Meilgaard, G. V. Civille and B.T. Carr, Sensory Evaluation Techniques, 2nd Edition. CRC
Press, Boca Raton, FL, 1987.
2. H.J.H. MacFie, Data Analysis in Flavour Research: Achievements, Needs and Perspectives,
in: Flavour Science and Technology, M. Martens, G.A. Dalen and H. Russwurm Jr (Editors).
Wiley, 1987.
3. H. Stone and J.L. Sidel, Sensory Evaluation Practices, 2nd ed. Academic Press, San Diego,
1993.
4. S. Siegel. Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, Tokyo, 1956.
5. R.A. Bradley, Science, statistics, and paired comparisons. Biometrics, 32 (1976) 213-232.
6. H. A. David, The Method of Paired Comparisons. Charles Griffin, London, 1963.
7. J.B. Kruskal, Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothe-
sis. Psychometrika, 29 (1964) 1-27, 115-129.
8. S.S. Schiffman, M.L. Reynolds and F.W. Young, Introduction to Multi-dimensional Scaling:
Theory, Methods and Applications. Academic Press, New York, 1981.
9. J.W. Sammon, A nonlinear mapping for data structure analysis. IEEE Trans. Comput., C-18
(1969) 401-409.
10. ABC Executive Flight Planner, June 1994. Reed Telepublishing, Dunstable, 1994.
11. S. de Jong, J. Heidema and H.C.M. van der Knaap, Generalized Procrustes analysis of coffee
brands tested by five European sensory panels. Food Qual. Pref., 9 (1998) 111-114.
12. S. Beilken, L.M. Eadie, I. Griffiths, P.N. Jones and P.V. Harris, Assessment of the textural
quality of meat patties. J. Food Sci., 56 (1991) 1465-1470.
13. G. Dijksterhuis, Principal component analysis of Tl-curves: three methods compared. Food
Qual. Pref., 5 (1994) 121-127.
14. P. Overbosch and S. de Jong, A theoretical model for perceived intensity in human taste and
smell. Physiol. Behav., 45 (1989) 607-613.
15. G. Dijksterhuis and P. Eilers, Modelling time-intensity curves using prototype curves. Food
Qual. Pref., 8 (1997) 131-140.
16. A. Abdullah, A.V.A. Resurreccion and L.R. Beuchat, Formulation and evaluation of a peanut
milk based whipped topping using response surface methodology. Lebensm. Wiss. Technol.,
26 (1993) 162-166.

Additional recommended reading

G.E. Arteaga, E. Li-Chan, M.C. Vazquez-Aretaga and S. Nakai, Systematic experimental designs for
product formula optimization. Trends in Food Science Technol., 5 (1994) 243-253.
J.B. Kruskal and M. Wish, Multidimensional Scaling. SAGE, Newberry Park, CA, 1991.

P. Lea, T. Naes and M. Rødbotten, Analysis of Variance for Sensory Data. Wiley, London, 1997.
D. H. Lyon, M.A. Francombe, T.A. Hasdell and K. Lawson, Guidelines for Sensory Analysis in Prod-
uct Development and Quality Control. Chapman and Hall, London, 1990.
H. Martens and H. Russwurm, eds., Food Research and Data Analysis. Applied Science, London,
1983.
M. O'Mahony, Sensory Evaluation of Foods. Statistical Methods and Procedures. Marcel Dekker,
New York, 1986
T. Naes and E. Risvik (Editors), Multivariate Analysis of Data in Sensory Science. Data Handling in
Science and Technology Series, Elsevier, Amsterdam, 1996
J.R. Piggott (Editor), Sensory Analysis of Foods. Elsevier, London, 1984.
J.R. Piggott, Statistical Procedures in Food Research. Elsevier, London, 1987.

Chapter 39

Pharmacokinetic Models

Introduction

Rescigno and Segre [1], in one of the first significant text books on the subject,
defined pharmacokinetics as "the study of drug concentrations in different tissues
and the construction of models suited to interpret such phenomena". We can
distinguish four aspects of pharmacokinetics, namely absorption, distribution,
biotransformation and excretion of a drug. The term drug includes here any
naturally occurring or synthetic chemical compound.
Although it is highly desirable to deliver a drug to the site where it must exert its
effect, this is not generally possible except in those cases where the site of
treatment is readily accessible (such as the skin). When topical (local) treatment is
precluded, the drug must be carried to the target site via the circulatory or
lymphatic system. To this effect it must first be absorbed by the blood of the central
circulatory system. Absorption is extremely rapid and complete when a drug is
delivered directly to the circulation by means of an intravenous injection. Slower
and occasionally less efficient ways of absorption are via the mucous membranes,
by oral intake, by injection into muscle or fatty tissues, or by means of patches
which are applied to the skin. Pharmaceutical technology continuously tries to
explore novel ways of administration in order to deliver a drug under the most
optimal conditions. Important considerations in this respect are the solubility of the
compound, the speed of release, and the susceptibility to biological degradation.
Once the substance has been absorbed, it is distributed by the blood towards the
various tissues of the body. If the drug is highly lipophilic, it may be rapidly
deposited into fatty tissues and slowly released thereafter. Some drugs are readily
adsorbed to membranes of cells, especially those of the red blood cells, others may
adsorb onto circulating proteins. The plasma of blood, however, must carry the
drug to the target tissue. If the treatment of a disease calls for a rapid and burstlike
delivery to the target, then it is undesirable to have a large fraction of the drug
stored away in various buffers. On the other hand, when a slow and progressive
release is required, temporal storage into fatty tissues may be an advantage. Some
tissues possess membranes which are highly selective for the passage of specific
chemical substances. The blood-brain barrier and the placenta are among these.

Excretion is the process by which a substance leaves the body. The most
common ways are via the kidneys and via the gut. Renal excretion is favored by
water-soluble compounds that can be filtered (passively by the glomeruli) or
secreted (actively by the tubuli) and that are collected into urine. Fecal excretion is
followed by more lipid substances that are excreted from the liver into the bile,
which is collected in the gut and passed out by the feces. Other routes of excretion
are available through the skin and the lungs.
As long as a compound circulates in the body it is at risk of being degraded by
enzymes. This process is called biotransformation or metabolism. Many of the
biotransformation processes occur in the liver which contains a wide variety of
enzymes of the cytochrome P-450 family, all geared towards the breakdown of
compounds that are foreign to the body (so-called xenobiotics). Metabolic break-
down may also occur in plasma by circulating enzymes or even in the vicinity of
the target site. The pathway of biodegradation can be very complex and often
results in several metabolites, each of which is distributed throughout the body and
excreted in its own particular way. Some metabolites are neutral, others may be
harmful, still others may possess a therapeutic effect. In some cases, a prodrug is
delivered into the system which is inert by itself but which is metabolized near the
target site into the active compound. Such a strategy is highly effective and least
harmful to the system. In all these instances where biotransformation plays a
relevant role, it will be necessary to study the kinetics of the major and most
reactive metabolites in addition to those of the original compound.
Pharmacokinetics is closely related to pharmacodynamics, which is a recent
development of great importance to the design of medicines. The former attempts
to model and predict the amount of substance that can be expected at the target site
at a certain time after administration. The latter studies the relationship between
the amount delivered and the observable effect that follows. In some cases the
observable effect can be related directly to the amount of drug delivered at the
target site [2]. In many cases, however, this relationship is highly complex and
requires extensive modeling and calculation. In this text we will mainly focus on
the subject of pharmacokinetics which can be approached from two sides. The first
approach is the classical one and is based on so-called compartmental models. It
requires certain assumptions which will be explained later on. The second one is
non-compartmental and avoids the assumptions of compartmental analysis.
Although the methods that are discussed in this chapter deal explicitly with the
disposition of drugs in animals and humans, their scope is much wider. In general,
these methods can be applied to study the transport of substances within parts of a
system provided that these transports can be described by zero or first order
kinetics. This applies, for example, when the rate of change of the amount in one
part of the system depends linearly on the amounts present in all the various parts
of the system. Applications are found commonly in first order chemical reactions,

uptake and washout of tracers, radioactive and fluorescent decay, etc. In a broader
context, kinetic models find application in population dynamics, ecology, meteor-
ology, epidemiology, etc. Non-linear models are of special importance in these
fields. Although their formulation is often only slightly more involved than their
linear counterparts, the complexity of their solutions has only recently been
explored. They lie at the origin of fractal geometry and deterministic chaos. Fractal
geometry arises when a system does not evolve along smooth paths, such as the
stream lines in laminar flow, but follows highly complex patterns, such as the
eddies in turbulent flow. The latter system appears to behave chaotically, yet obeys
constraints which force it to remain under the influence of so-called attractors. The
dimensionality of an attractor in a seemingly chaotic system is characterized by a
fractional number rather than by an integer one, hence the name fractal [3].

39.1 Compartmental analysis

The origin of compartmental analysis is usually credited to Teorell [4], al-


though some of the concepts had already been developed by physicists and
physiologists in the eighteenth century in relation to the transfer of heat and the
washout of tracer substances. Compartmental analysis can be applied when two
major assumptions are satisfied. First, one must be able to decompose the system
or organism into identifiable compartments which must be sufficiently homo-
geneous. The latter means that a substance is rapidly and uniformly mixed within
each compartment. Very often, however, a compartment is to be considered as a
lumped system, which groups apparently heterogeneous parts, for example all of
the body except the blood compartment. The degree of lumping depends on the
properties of the compounds and on the level of detail of the study. The second
assumption requires that transport between compartments is governed by first
order differential equations with constant coefficients of the general type:

\frac{dX_i}{dt} = \sum_{j \neq i} k_{ji} X_j - \left( \sum_{j \neq i} k_{ij} \right) X_i     (39.1)

where Xi is the amount of drug within compartment i at time t, kij is the transfer
constant for transport from compartment i to compartment j (in reciprocal time
units) and where n represents the number of compartments. An illustration of
transfer constants in the case of a two-compartment system is given in Fig. 39.1.
The equation above states that the rate of change of the content in any compartment i equals the
sum of all inflows from the other compartments minus the sum of all outflows
towards the other compartments. Such a relation is also referred to as a mass
balance differential equation, which is well-known in chemical engineering. In the

Fig. 39.1. Two-compartment open model composed of a central (plasma) compartment and a target (skin) compartment. This model assumes that a drug is delivered rapidly into plasma, from which it is either exchanging with the target organ or eliminated by excretion or metabolism.

case of n compartments, the model is defined by n linear differential equations in n


variables (Xi) with constant coefficients kij. The latter are the model parameters
which usually have to be estimated from observations of the concentrations of the
substance in one or more compartments of the system at various times after the
start of the experiment. The way by which these parameters are computed is
explained below. These computed coefficients can be used in model simulations.
They are most meaningful, however, when they can be used for the quantification
of biological and physiological phenomena, such as the rate of clearance of the
substance from the body via various routes of excretion and metabolism.
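
To make the structure of eq. (39.1) concrete, the two-compartment open model of
Fig. 39.1 can be simulated by numerical integration. The sketch below is purely
illustrative (SciPy is assumed to be available and the transfer constants are
arbitrary values chosen for the example, not estimates from data).

import numpy as np
from scipy.integrate import solve_ivp

# Transfer constants (1/h) for the model of Fig. 39.1: exchange between plasma
# and skin (k12, k21) and elimination from plasma (k1e). Arbitrary values.
k12, k21, k1e = 0.5, 0.3, 0.2
D = 10.0                                  # intravenous dose (mg)

def mass_balance(t, X):
    # Eq. (39.1) written out for two compartments: X = [X1 (plasma), X2 (skin)].
    X1, X2 = X
    dX1 = k21 * X2 - (k12 + k1e) * X1     # inflow from skin minus both outflows
    dX2 = k12 * X1 - k21 * X2             # inflow from plasma minus return flow
    return [dX1, dX2]

sol = solve_ivp(mass_balance, (0.0, 24.0), [D, 0.0],
                t_eval=np.linspace(0.0, 24.0, 97))
X_plasma, X_skin = sol.y
X_eliminated = D - X_plasma - X_skin      # amount collected in the elimination pool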
In Fig. 39.1 we have represented a simple two-compartment open model for a
substance which is injected intravenously. The model applies when the drug is
mainly confined to the central compartment from which it is also directly excreted.
Such a model is called 'open' as it allows for elimination of a substance from one
or more of the compartments. It is also assumed here that a fraction of the drug
penetrates the layers of the skin and that its biological or therapeutic effect is
related to the concentration in this compartment. Such a crude model may be
applicable, for example, to water-soluble antibiotics. Figure 39.2 shows a com-
partmental model for the same drug but on a more detailed level. Here we take into
account that the drug is specifically excreted in the kidneys, that it is metabolized
in the liver and accumulates in adipose (fatty) tissues. This type of compartmental
model is called mammillary, which is characteristic for a central compartment
which exchanges with all peripheral ones. In a catenary model, compartments are
coupled one after the other in a chain-like fashion, as in a cascade of radioactive
decay of isotopes. An even more detailed compartmental model is given by Fig.
39.3. In this case, the drug is administered orally. Therefore, it must be absorbed
from the gut lumen into the gut tissue from where it is carried by the hepatic

Fig. 39.2. Multicompartment model which, in addition to Fig. 39.1, takes into account that the drug is buffered in adipose (fatty) tissue, excreted by the kidneys and metabolized in the liver.

circulation to the liver. In this case, a large fraction of the drug may be metabolized
during its first pass through the liver. Some of the drug (and its metabolites) is
returned to the gut lumen via the excretion in the bile. The substances are excreted
by the gut into the feces or can be re-entered into the system through the gut tissue.
The remaining part is exchanged between the liver and the plasma fluid. From the
plasma, some part of the drug can be excreted through the kidneys, it can be stored
in adipose tissues or it can be adsorbed on the membranes of red cells. Eventually,
the drug penetrates the skin where it must produce its therapeutic benefits. This is
an example of a mixed mammillary-catenary model [5].
An additional problem arises when the exchange processes are rate-limited.
This may be caused by enzymes that become saturated when all their active sites
are occupied by the drug, or it may be due to adsorbing proteins that have a limited
binding capacity. In such cases, one obtains a type of Michaelis-Menten kinetics of
the form:
V = \frac{dX}{dt} = \frac{V_{max} X}{K_m + X}     (39.2)
where V is the rate of the process, Vmax represents the maximal rate when the
process is completely saturated, and Km denotes the Michaelis constant. A rate-
limited process introduces a non-linearity which is incompatible with the usual

Fig. 39.3. Multicompartment model which, in addition to Fig. 39.2, models absorption of a drug via the gut, excretion via the bile and adsorption to the membranes of red cells.

methods of compartmental analysis. In that case, one cannot produce analytical


solutions for the time course of the concentrations in the various compartments. In
some cases one may assume that the amounts X are much smaller than Km in which
case the relation reduces to a first order linear term. Alternatively, if one can
assume that X is much larger than Km, then this leads to a zero order term, which
means that the rate is constant throughout. In some elementary cases, one can
linearize the model by means of an appropriate transformation, as will be explain-
ed in the last section. If none of these applies, one must resort to non-linear
methods of analysis, which are outside the scope of this chapter (see Chapter 11).
In the following subsections we will deal systematically with the basic aspects of
linear compartmental systems.
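
Writing out the two limiting cases mentioned above makes the transition explicit
(a small clarification added here, following directly from eq. (39.2)):

\frac{dX}{dt} = \frac{V_{max} X}{K_m + X} \approx \frac{V_{max}}{K_m} X \quad \text{for } X \ll K_m \quad \text{(first order, linear)}

\frac{dX}{dt} = \frac{V_{max} X}{K_m + X} \approx V_{max} \quad \text{for } X \gg K_m \quad \text{(zero order, constant rate)}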

39.1.1 One-compartment open model for intravenous administration

Figure 39.4a represents schematically the intravenous administration of a dose


D into a central compartment from which the amount of drug Xp is eliminated with
a transfer constant kpe. (The subscript p refers to plasma, which is most often used
as the central compartment and which exchanges a substance with all other
compartments.) We assume that mixing with blood of the dose D, which is rapidly
injected into a vein, is almost instantaneous. By taking blood samples at regular
time intervals one can determine the time course of the plasma concentration Cp in
the central compartment. This is also illustrated in Fig. 39.4b. The initial con-
centration Cp(0) at the time of injection can be determined by extrapolation (as will
be indicated below). The elimination pool is a hypothetical compartment in which
the excreted drug is collected. At any time the amount excreted Xe must be equal to
the initial dose D minus the content of the plasma compartment Xp, hence:

X_e = D - X_p     (39.3)
After a sufficiently long time, most or all of the dose D will have been excreted.
The linear model for this system is described by the mass balance equation:

Fig. 39.4. (a) One-compartment open model with single-dose intravenous injection of a dose D. The transfer constant of elimination (excretion and metabolism) is kpe. (b) Time course of the plasma concentration Cp and of the contents in the elimination pool Xe.

\frac{dX_p}{dt} = -k_{pe} X_p     (39.4)

with the initial condition that Xp(0) = D.


The differential equation can be solved straightforwardly, yielding an expo-
nential function with respect to time:
X_p(t) = D e^{-k_{pe} t}     (39.5)

and also
X_e(t) = D (1 - e^{-k_{pe} t})

Note that the solution is in terms of amounts Xp rather than observed plasma
concentrations Cp. The conversion from amounts to concentrations is defined by
means of:

C_p(t) = \frac{X_p(t)}{V_p} = \frac{D}{V_p} e^{-k_{pe} t} = C_p(0) e^{-k_{pe} t}     (39.6)

where Vp represents the volume of distribution of the drug in plasma. The volume
of distribution should not be confounded with the physical plasma volume (which
is about four liters in adult humans). The former may exceed the latter by several
orders of magnitude. This occurs when there is a large binding affinity of the drug
molecules with membranes of red blood cells or with circulating proteins. In that
case, only a minute fraction of the drug is recovered in the plasma, the larger part
being buffered reversibly on membranes and proteins. We can derive the volume
of distribution Vp from the dose D and from the observed plasma concentration at
time zero Cp(0):

V_p = \frac{D}{C_p(0)}     (39.7)
The half-life time t1/2 of a drug is an important pharmacokinetic parameter. In this
simple model it can be obtained immediately from the solution in eq. (39.6):

\frac{C_p(0)}{2} = C_p(0) e^{-k_{pe} t_{1/2}}     (39.8)

from which we derive that:

t_{1/2} = \frac{\ln 2}{k_{pe}} = \frac{0.693}{k_{pe}}     (39.9)

In a one-compartment model, the half-life time t1/2 is inversely proportional to the


transfer constant of elimination kpe. The longer it takes for a drug to be removed
from the system, the longer is the period during which it can exert its therapeutic
activity. A convenient way to estimate the half-life of a substance is to replot the
observed plasma concentrations (Fig. 39.5a) on a decimal logarithmic scale versus
time (Fig. 39.5b). Fitting a straight line on this semilogarithmic plot produces the
extrapolated plasma concentration Cp(0) from the vertical intercept of the line. The
transfer constant of elimination is determined from the slope of the line:

\log C_p(t) = \log C_p(0) - 0.4343 k_{pe} t = \log B - s_p t     (39.10)


with intercept log B = log Cp(0) and slope sp = 0.4343 kpe, which follows directly
from the solution of the model defined by eq. (39.6). (Decimal logs are assumed
throughout this chapter, unless specified otherwise.) The half-life time t1/2 of the
drug can also be estimated graphically from the semilogarithmic plot of Fig. 39.5b
by determining the abscissa of the point on the fitted line which corresponds with
an ordinate of Cp(0)/2.
The concentration-time curve can be integrated numerically and yields the
so-called area under the curve (AUC):

AUC = \int_0^{\infty} C_p(t)\, dt     (39.11)

which, in the case of a one-compartment model, can be worked out analytically:

AUC = \frac{C_p(0)}{k_{pe}} = \frac{D}{V_p k_{pe}}     (39.12)

The AUC is a measure of bioavailability, i.e. the amount of substance in the central
compartment that is available to the organism. It takes a maximal value under
intravenous administration, and is usually less after oral administration or par-
enteral injection (such as under the skin or in muscle). In the latter cases, losses
occur in the gut and at the injection sites. The definition also shows that for a
constant dose D, the area under the curve varies inversely with the rate of elimina-
tion kpe and with the volume of distribution Vp. Figure 39.6 illustrates schemat-
ically the different cases that can be obtained by varying the volume of distribution
Vp and the rate of elimination kpe, both on linear and semilogarithmic diagrams.
These diagrams show that the slope (time course) of the curves is governed by the
rate of elimination and that the elevation (amplitude) of the curve is determined by the
volume of distribution.

Fig. 39.5. (a) Plot of plasma concentration Cp (µg l^-1) versus time t. Cp(0) is the extrapolated initial concentration at time zero. At the half-life time t1/2 the plasma concentration is half that of Cp(0). (b) Semilogarithmic plot of plasma concentration Cp (µg l^-1) versus time t. The intercept B of the fitted line is the plasma concentration Cp at time 0. The slope sp is proportional to the transfer constant of elimination kpe.

Fig. 39.6. (a) Time courses of plasma concentration Cp in the one-compartment open model for intravenous injection, with different contingencies for the transfer constant of elimination kpe and the volume of distribution Vp. (b) Time courses of plasma concentration Cp as in panel (a) on semilogarithmic plots.

From the AUC and knowing the dose, one can immediately derive from eq.
(39.12) an important pharmacokinetic parameter, which is the clearance Clp of the
drug from the plasma:

Cl_p = \frac{D}{AUC} = V_p k_{pe}     (39.13)

Clearance is defined as the fraction of the volume of distribution Vp that is cleared


of the drug per unit of time. In the case of elimination from the kidneys, the
clearance provides a measure for the effectiveness of renal elimination with
respect to the drug under study.

Example
We consider the following synthetic plasma concentrations which are sup-
posedly obtained after intravenous injection of 10 mg of a drug:

t (min)         2     5    10    20    30    60    90   120   240   360   480   720
C_p (µg l^-1)  48.1  47.5  43.9  40.4  36.2  23.8  17.3  13.1   2.9   1.0   0.3   0.1

The data are also represented in Fig. 39.5a and have been replotted semi-
logarithmically in Fig. 39.5b. Least squares linear regression of log C_p with respect
to time t has been performed on the first nine data points. The last three points have
been discarded as the corresponding concentration values are assumed to be close
to the quantitation limit of the detection system and, hence, are endowed with a
large relative error. We obtained the values of 1.701 and 0.005117 for the intercept
log B and slope s_p, respectively. From these we derive the following pharmaco-
kinetic quantities:

Initial plasma concentration:      C_p(0) = B = 50.3 µg l^-1
Rate constant of elimination:      k_pe = s_p/0.4343 = 0.0118 min^-1
Half-life time:                    t_1/2 = 0.693/k_pe = 58.8 min
Plasma volume of distribution:     V_p = D/B = 199 l
Area under the curve:              AUC = D/(V_p k_pe) = 4.268 mg min l^-1
Plasma clearance:                  Cl_p = D/AUC = 2.34 l min^-1

The synthetic data have been derived from a theoretical one-compartment model
with the following settings of the parameters:

Dose:                              D = 10 mg
Plasma volume of distribution:     V_p = 200 l
Initial plasma concentration:      C_p(0) = D/V_p = 50 µg l^-1
Half-life time:                    t_1/2 = 60 min
Rate constant of elimination:      k_pe = 0.693/t_1/2 = 0.01155 min^-1
Area under the curve:              AUC = 4.329 mg min l^-1
Plasma clearance:                  Cl_p = 2.31 l min^-1

The synthetic data have been obtained by adding random noise with a standard
deviation of about 0.4 µg l^-1 to the theoretical plasma concentrations. As can be
seen, the agreement between the estimated and the computed values is fair.
Estimates tend to deteriorate rapidly, however, with increasing experimental error.
This phenomenon is intrinsic to compartmental models, the solution of which
always involves exponential functions.
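The estimation procedure of this example is easy to reproduce numerically. The sketch below is a minimal illustration in Python (assuming NumPy is available; the variable names are ours, not the chapter's): it fits the semilogarithmic line by least squares on the first nine points and derives k_pe, t_1/2, V_p, AUC and Cl_p as above.

import numpy as np

# Synthetic plasma concentrations after a 10 mg intravenous bolus (see text)
t  = np.array([2, 5, 10, 20, 30, 60, 90, 120, 240, 360, 480, 720], float)  # min
Cp = np.array([48.1, 47.5, 43.9, 40.4, 36.2, 23.8, 17.3, 13.1,
               2.9, 1.0, 0.3, 0.1])                                        # µg/l
D = 10_000.0                                                               # µg

# Least squares fit of log10(Cp) versus t on the first nine points only;
# the last three are near the quantitation limit and are discarded.
slope, intercept = np.polyfit(t[:9], np.log10(Cp[:9]), 1)

s_p    = -slope                # slope of the semilogarithmic plot
B      = 10.0 ** intercept     # extrapolated C_p(0), µg/l
k_pe   = s_p / 0.4343          # transfer constant of elimination, 1/min
t_half = 0.693 / k_pe          # half-life, min
V_p    = D / B                 # volume of distribution, l
AUC    = D / (V_p * k_pe)      # µg min/l
Cl_p   = D / AUC               # plasma clearance, l/min

print(f"B = {B:.1f} µg/l, k_pe = {k_pe:.4f} 1/min, t1/2 = {t_half:.1f} min")
print(f"V_p = {V_p:.0f} l, AUC = {AUC/1000:.3f} mg min/l, Cl_p = {Cl_p:.2f} l/min")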
461

39.1.2 Two-compartment catenary model for extravascular administration

This model is representative for the conditions described in the previous


section, except for the mode of administration which can be oral, rectal or
parenteral by means of injection into muscle, fat, under the skin, etc. (Fig. 39.7). In
addition to the central plasma compartment, the model involves an absorption
compartment to which the drug is rapidly delivered. This may be to the gut in the
case of tablets, syrups and suppositories or into adipose, muscle or skin tissues in
the case of injections. The transport from the absorption site to the central
compartment is assumed to be one-way and governed by the transfer constant k_ap
(Fig. 39.7a). The linear differential model for this problem can be defined in the
following way:

Fig. 39.7. (a) Two-compartment catenary model for extravascular (oral or parenteral) administration
of a single dose D which is completely absorbed. The transfer constant of absorption is k_ap. (b) Time
courses of the amount in the extravascular compartment X_a, the concentration in the plasma compart-
ment C_p and the content in the elimination pool X_e.
dX_a/dt = −k_ap X_a

dX_p/dt = k_ap X_a − k_pe X_p    (39.14)

X_e = D − (X_a + X_p)

with the initial conditions:

X_a(0) = D and X_p(0) = 0

where X_a, X_p and X_e represent the quantities in the absorption and central plasma
compartments, and in the elimination pool, respectively.
The analytic solution of the system of differential equations in eq. (39.14) can
be written as follows:

X_p(t) = D k_ap/(k_ap − k_pe) · (exp(−k_pe t) − exp(−k_ap t))    (39.15)

The plasma function can be rewritten in terms of concentrations C_p rather than
amounts of substance X_p:

C_p(t) = (D/V_p) · k_ap/(k_ap − k_pe) · (exp(−k_pe t) − exp(−k_ap t))    (39.16)

The content X_a in the absorption compartment varies with time according to the
solution of the one-compartment model which has been described in the previous
section. This time course is represented by a single negative exponential function.
Initially, the content is equal to the dose D and at a sufficiently long time after
administration, the content becomes zero. The concentration C_p in the central
compartment is initially zero, reaches a maximum and eventually becomes zero
again. The content X_e of the elimination pool starts from 0 and asymptotically
approaches the dose D at prolonged times (Fig. 39.7b). The solution for the central
plasma compartment gives rise to three distinct cases, depending on the difference
between the transfer constants for absorption and elimination.

Case 1
When k_ap > k_pe, the time course of the plasma concentration C_p is dominated by
the rate of elimination, which is the slower of the two. This is desirable when a drug
must be delivered as rapidly as possible to the plasma compartment, for example in
the relief of acute pain. At sufficiently large times following administration, the

transient effect of absorption will have decayed such that one obtains approx-
imately from eq. (39.16):

C_p(t) ≈ (D/V_p) · k_ap/(k_ap − k_pe) · exp(−k_pe t)    (39.17)

Case 2
If k_ap < k_pe, the time course of C_p is dominated by the slower rate of absorption.
This is desirable when a drug is to be delivered over a prolonged period of time, for
example in the relief of chronic pain. When a sufficiently large period of time has
elapsed, the transient effect of elimination has decayed and the solution for the
plasma compartment in eq. (39.16) becomes approximately:

C_p(t) ≈ (D/V_p) · k_ap/(k_pe − k_ap) · exp(−k_ap t)    (39.18)

Case 3
In the theoretical situation when k_ap = k_pe the solution for C_p degenerates, since
both the numerator and denominator vanish and the expression becomes indeter-
minate. The limiting solution must then be derived from a rearrangement of the
expression for C_p in eq. (39.16):

C_p(t) = (D/V_p) lim_{k_ap→k_pe} [k_ap (exp(−k_pe t) − exp(−k_ap t))/(k_ap − k_pe)]    (39.19)

Differentiation of the numerator and denominator with respect to k_ap leads to:

C_p(t) = (D/V_p) k_pe t exp(−k_pe t)    (39.20)
For practical purposes we will assume here that k_ap > k_pe. A solution of the
two-compartment model can be derived by means of non-linear regression
between experimental determinations of the plasma concentration C_p and the
corresponding sample times t, using the exponential relation which has been
derived above. This yields the transfer constants k_ap, k_pe and the coefficient
D k_ap/(V_p(k_ap − k_pe)), from which, knowing D, the volume of distribution V_p can be obtained.
An alternative graphical solution makes use of the biphasic exponential nature
of the plasma concentration function in eq. (39.16). At larger time values, when the
effect of absorption has decayed, the function behaves approximately as mono-
exponential. Under these conditions, and after replotting the concentration data on
a (decimal) logarithmic scale, one obtains a straight line for the later part of the
curve (Fig. 39.8a). This line represents the β-phase of the plasma concentration
and is denoted by C_p^β:

[Fig. 39.8: panel (a) semilogarithmic plot with fitted β-phase line log C_p^β(t) = 1.745 − 0.005166 t; panel (b) semilogarithmic plot of the residuals with fitted α-phase line log C_p^α(t) = 1.745 − 0.0621 t.]

Fig. 39.8. (a) Semilogarithmic plot of plasma concentration C_p (µg l^-1) versus time t. The straight line
is fitted to the later part of the curve (slow β-phase), with the exception of points that fall below the
quantitation limit. The intercept B of the fitted line is the extrapolated plasma concentration C_p^β that
would have been obtained at time 0 with an intravenous injection. The slope s_β is proportional to the
transfer constant of elimination k_pe. (b) Semilogarithmic plot of the residual plasma concentration C_p^α
(µg l^-1) versus time t, on an expanded time scale. The straight line is fitted to the first part of the resid-
ual curve (fast α-phase), with the exception of points whose residuals fall below the quantitation limit.
The intercept B of the fitted line is the same as that in panel (a). The slope s_α is proportional to the trans-
fer constant of absorption from the extravascular compartment.

log C_p^β(t) = log[D k_ap/(V_p(k_ap − k_pe))] − 0.4343 k_pe t = log B − s_β t    (39.21)

with intercept log B and slope s_β = 0.4343 k_pe. From the slope one obtains the
transfer constant of elimination k_pe. The next step is to compute the residuals
between the observed plasma concentrations and the computed β-phase values,
which are again replotted on a logarithmic concentration scale (Fig. 39.8b). The
resulting line represents the absorption phase or α-phase of the plasma
concentration:

log C_p^α(t) = log(C_p^β(t) − C_p(t)) = log[D k_ap/(V_p(k_ap − k_pe))] − 0.4343 k_ap t
             = log B − s_α t    (39.22)

with the same intercept log B as derived above and slope s_α = 0.4343 k_ap, which
yields the absorption transfer constant k_ap. The procedure by which a sum of
exponential functions is resolved into its component terms is called curve peeling.
It allows one to include empirical knowledge in the analysis, for instance about the
number of exponential functions and a possible time lag. A comparison of curve peeling with
non-linear regression is presented in Section 39.1.6.
Knowing the transfer constants k_ap, k_pe, the dose D and the intercept B, we can
compute the volume of distribution V_p from the expression for the intercept (eq. (39.21)):

D k_ap/(V_p(k_ap − k_pe)) = B    (39.23)
The various effects produced by varying k_ap, k_pe and V_p are illustrated in Fig. 39.9.
In practice, the half-life time of the drug in the plasma compartment is derived
from the β-phase, and is therefore denoted as t_1/2^β:

t_1/2^β = 0.693/k_pe    (39.24)

Likewise, the half-life time of the drug in the extravascular compartment t_1/2^α is
given by:

t_1/2^α = 0.693/k_ap    (39.25)

The area under the curve AUC is obtained by integrating the plasma con-
centration function between times 0 and infinity. This integral can be obtained
analytically from eq. (39.16):


Fig. 39.9. Time courses of plasma concentration C_p in a one-compartment model for extravascular ad-
ministration, with different contingencies of (a) the transfer constant of absorption k_ap, (b) the transfer
constant of elimination k_pe and (c) the volume of distribution V_p.

AUC = ∫_0^∞ C_p(t) dt = D/(V_p k_pe)    (39.26)

and has the same form as in the case of the intravenous one-compartment model
developed in the previous section. This is not surprising, however, since we know
that the AUC is a measure of the bio-availability of the substance to the plasma
compartment. In this model, all of the initially administered dose D must enter the

central compartment and, hence, the complete dose has an opportunity of


exercising its effect. The situation would be different if a fraction of the dose were
metabolised or excreted before entering the central plasma compartment. Such a
situation will be discussed later on. The AUC is an important pharmacokinetic
measure in the study of bio-equivalence for different modes of administration of a
drug. The modes can be considered to be biologically equivalent when their
AUCs are the same.
The clearance of the drug from the plasma is then defined from the dose D and
the AUC:

Cl_p = D/AUC = V_p k_pe    (39.27)
An important pharmacokinetic parameter is the time of appearance of the
maximum t_max of the plasma concentration. This can be derived by setting the first
derivative of the plasma concentration function in eq. (39.16) equal to zero and
solving for t, which yields:

t_max = ln(k_ap/k_pe)/(k_ap − k_pe)    (39.28)

where ln means natural logarithm (base e). From this expression, one can see that
the time t_max only depends on the transfer constants of absorption k_ap and elimination
k_pe. Substitution of this expression in the plasma concentration function yields the
value of the peak concentration C_p(t_max). Knowledge of the peak concentration is
important since a drug usually has a minimum threshold for therapeutic effect and
a maximum level above which toxic effects become manifest.
Differentiation of the plasma concentration function in eq. (39.16) at time zero
also yields the initial rate of change of C_p:

C_p'(0) = (dC_p/dt)_{t=0} = (D/V_p) k_ap    (39.29)

This shows that the initial steepness of the plasma curve depends, for a given dose
D, on the volume of distribution V_p and on the transfer constant of absorption k_ap.
The various characteristics of the plasma concentration curve are also schem-
atically displayed in Fig. 39.9 for different contingencies of k_ap, k_pe and V_p.

Example
The following synthetic plasma concentrations are representative for the time
course of the plasma concentration C_p after a single oral administration of a dose of
10 mg:

t (min)         2     5    10    20    30    60    90   120   240   360   480   720
C_p (µg l^-1)  12.3  23.6  35.5  40.7  37.1  26.9  19.8  14.0   3.1   1.1   0.1   0.1
(the early points are dominated by the α-phase, the points from 30 to 240 min by the β-phase, and the last points by noise)

The semilogarithmic plot of the data is given in Fig. 39.8a. On this plot we can
readily identify the linear β-phase of the plasma concentrations between 30 and
240 minutes. The last three points (at 360, 480 and 720 minutes) have been
discarded because the corresponding plasma values are supposed to be close to the
quantitation limit of the detection system.
A least squares linear regression has been applied to the data pertaining to the
β-phase, yielding the values of 1.745 and 0.005166 for the intercept log B and
slope s_β, respectively. Using these results, we can compute the extrapolated plasma
concentrations C_p^β between 0 and 20 minutes. From the latter, we subtract the
observed concentrations C_p, which yields the concentrations of the α-phase C_p^α:

t (min):            0     2     5    10    20
C_p^β (µg l^-1):  55.5  54.2  52.3  49.3  43.8
C_p (µg l^-1):       0  12.3  23.6  35.5  40.7
C_p^α (µg l^-1):  55.5  41.9  28.7  13.8   3.1
The semilogarithmic plot of the α-phase is shown in Fig. 39.8b. Through these
data we force a least squares linear regression through the vertical intercept, i.e. the
point with coordinates (0, 55.5). The latter is achieved by assigning a very large
weight to this single point, for example 100 times the weight of each of the other
points. This guarantees that the regression line passes through the vertical
intercept of the semilogarithmic plot, as required by the theoretical consid-
erations above. The resulting intercept log B and slope s_α are 1.745 and 0.0621,
respectively.
The pharmacokinetic parameters of the model are then readily derived from the
defining equations and the results of the regression:
Initial plasma conc., β-phase:           C_p^β(0) = B = 55.5 µg l^-1
Transfer constant of elimination:        k_pe = s_β/0.4343 = 0.0119 min^-1
Half-life time, β-phase:                 t_1/2^β = 0.693/k_pe = 58.3 min
Transfer constant of absorption:         k_ap = s_α/0.4343 = 0.143 min^-1
Plasma volume of distribution:           V_p = k_ap D/[B (k_ap − k_pe)] = 196 l
Half-life time, α-phase:                 t_1/2^α = 0.693/k_ap = 4.85 min
Area under the curve:                    AUC = D/(V_p k_pe) = 4.281 mg min l^-1
Plasma clearance:                        Cl_p = D/AUC = 2.34 l min^-1
Time of peak plasma conc.:               t_max = ln(k_ap/k_pe)/(k_ap − k_pe) = 19.0 min
Initial rate of change of plasma conc.:  C_p'(0) = k_ap D/V_p = 6.37 µg min^-1 l^-1

The synthetic data have been computed from a theoretical two-compartment
catenary model with the following parameters:

Dose:                                    D = 10 mg
Plasma volume of distribution:           V_p = 200 l
Half-life time of absorption:            t_1/2^α = 5 min
Transfer constant of absorption:         k_ap = 0.693/t_1/2^α = 0.1386 min^-1
Half-life time of elimination:           t_1/2^β = 60 min
Transfer constant of elimination:        k_pe = 0.693/t_1/2^β = 0.01155 min^-1
Area under the curve:                    AUC = 4.329 mg min l^-1
Plasma clearance:                        Cl_p = 2.31 l min^-1
The agreement between the theoretical and the 'experimental' results is still fair.
This is due, however, to the rather moderate noise which has been superimposed
on the theoretical plasma concentration values; the standard deviation of the
random noise amounted to about 0.4 µg l^-1. The success of a curve peeling
operation also depends on the ratio of the transfer constants k_ap and k_pe, which
should differ by at least about one order of magnitude. Errors tend to accumulate in the
successive steps of a curve peeling process. As a consequence, the
transfer constant of the faster α-phase is determined less accurately than that
of the slower β-phase. In general, the fitting of sums of exponential functions
is an ill-conditioned problem, the solution of which rapidly deteriorates with
increasing noise and with decreasing distinction between the transfer constants.
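The curve peeling procedure of this example can be sketched numerically as follows. This is a minimal illustration in Python (NumPy assumed; the weighting of the intercept point and the variable names are our own choices):

import numpy as np

t  = np.array([2, 5, 10, 20, 30, 60, 90, 120, 240], float)             # min
Cp = np.array([12.3, 23.6, 35.5, 40.7, 37.1, 26.9, 19.8, 14.0, 3.1])   # µg/l
D  = 10_000.0                                                          # µg

# 1. Fit the slow beta-phase on the linear tail (30-240 min) of the semilog plot.
tail = t >= 30
s_beta, logB = np.polyfit(t[tail], np.log10(Cp[tail]), 1)
s_beta, B = -s_beta, 10.0 ** logB

# 2. Peel: subtract the observed points from the extrapolated beta-line
#    over the early part of the curve to obtain the alpha-phase residuals.
head = t <= 20
Cp_beta  = B * 10.0 ** (-s_beta * t[head])
Cp_alpha = Cp_beta - Cp[head]

# 3. Fit the alpha-phase, forcing the line through (0, log B) by giving the
#    intercept point a large weight, as done in the text.
ta = np.concatenate(([0.0], t[head]))
ya = np.log10(np.concatenate(([B], Cp_alpha)))
w  = np.concatenate(([100.0], np.ones(head.sum())))
s_alpha, _ = np.polyfit(ta, ya, 1, w=w)
s_alpha = -s_alpha

k_pe, k_ap = s_beta / 0.4343, s_alpha / 0.4343
V_p = k_ap * D / (B * (k_ap - k_pe))
print(f"k_pe = {k_pe:.4f} 1/min, k_ap = {k_ap:.3f} 1/min, V_p = {V_p:.0f} l")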

39.1.3 Two-compartment catenary model for extravascular administration with incomplete absorption

The effect of incomplete absorption is that only a fraction F_a of a single dose D
is made available to the central plasma compartment. The solution of the previous
model needs, therefore, to be modified by replacing the term D by F_a D. Conse-
quently, the area under the curve AUC_a under incomplete extravascular absorption
will be smaller than the maximal AUC that results from complete absorption. The
latter, as we have seen, is equal to the AUC obtained from a single intravenous
injection, which we denote by AUC_iv. These considerations can be summarized as
follows:

F_a = AUC_a/AUC_iv    (39.30)

If an oral or parenteral plasma concentration curve is available together with an
intravenous one from the same subject(s), then one can derive the absorption
fraction F_a from these.

39.1.4 One-compartment open model for continuous intravenous infusion


In the previous discussion of the one- and two-compartment models we have
loaded the system with a single dose D at time zero, and subsequently we observed
its transient response until a steady state was reached. It has been shown that an
analysis of the response in the central plasma compartment allows one to estimate the
transfer constants of the system. Once the transfer constants have been established,
it is possible to study the behaviour of the model with different types of input
functions. The case when the input is delivered at a constant rate during a certain
time interval is of special importance. It applies when a drug is delivered by
continuous intravenous infusion. We assume that an amount D of a drug is
delivered during the time of infusion τ at a constant rate k_rp (Fig. 39.10). The first
part of the mass balance differential equation for this one-compartment open
system, for times t between 0 and τ, is given by:

dX_p/dt = k_rp − k_pe X_p    (39.31)

with the initial condition that X_p(0) = 0.


For times t greater than τ, the equation reduces to:

dX_p/dt = −k_pe X_p    (39.32)

X_e = k_rp τ − X_p

with the condition that the same result X_p(τ) is obtained from both parts of the
equation.
The solution of the first part of the model is obtained by straightforward
integration:

X_p(t) = (k_rp/k_pe)(1 − exp(−k_pe t))    (39.33)

or equivalently:


Fig. 39.10. (a) One-compartment open model for continuous intravenous infusion of a dose D from a
reservoir. The rate of infusion is k_rp. (b) Time courses of the content in the reservoir X_r, of the plasma
concentration C_p and of the content of the elimination pool X_e.

C_p(t) = k_rp/(V_p k_pe) · (1 − exp(−k_pe t))    (39.34)

where V_p is the volume of distribution of plasma.
If the infusion is maintained for a sufficiently long time, one obtains the
condition for a steady-state plasma concentration which no longer changes with
time. The condition follows immediately from eq. (39.34), by setting t to infinity:

C_ss = lim_{t→∞} C_p(t) = k_rp/(V_p k_pe) = k_rp/Cl_p    (39.35)

Note that the steady-state plasma concentration C_ss varies proportionally with the
rate of infusion k_rp and inversely with the plasma clearance Cl_p, the definition of

which has been given by eq. (39.13). This result allows us to rewrite the solution of
the model in eq. (39.34) in the form:

C_p(t) = C_ss (1 − exp(−k_pe t))    (39.36)

or equivalently:

log(1 − C_p(t)/C_ss) = −0.4343 k_pe t    (39.37)

Hence, the slope of the semilogarithmic plot of 1 − C_p(t)/C_ss versus time t yields the
transfer constant of elimination k_pe. From the known rate constant of infusion k_rp,
the transfer constant of elimination k_pe and a graphical estimate of C_ss, one can then
derive the plasma volume of distribution V_p, using the steady-state condition which
has been derived above.
The transfer constant of elimination k_pe has already been shown to be related to
the half-life time of the drug in the plasma compartment (eq. (39.9)):

t_1/2 = 0.693/k_pe    (39.38)

After substitution of this expression in the solution of the model in eq. (39.36) we
obtain:

C_p(t) = C_ss (1 − exp(−0.693 t/t_1/2))    (39.39)

which shows that the steady-state plasma concentration C_ss is only obtained when
the time of infusion τ is several times the half-life time t_1/2.
When infusion stops at time τ, the plasma concentration C_p(τ) decays exp-
onentially according to the solution of the one-compartment open model in eq.
(39.6):

C_p(t) = C_p(τ) exp(−k_pe (t − τ))    (39.40)

From a semilogarithmic plot of C_p(t) versus t − τ, one can again estimate the
transfer constant of elimination k_pe. From the known values of C_p(τ), k_rp and k_pe one
also obtains a new estimate of V_p.
In practice, one will seek to obtain an estimate of the elimination constant k_pe
and the plasma volume of distribution V_p by means of a single intravenous
injection. These pharmacokinetic parameters are then used in the determination of
the required dose D in the reservoir and the input rate constant k_rp (i.e. the drip rate
or the pump flow) in order to obtain an optimal steady-state plasma concentration.

Example
We make use of the previously defined parameters of the one-compartment
open system (eq. (39.6)).

Transfer constant of elimination:   k_pe = 0.01155 min^-1
Plasma volume of distribution:      V_p = 200 l

In addition we assume that a dose D of 10 mg is delivered at a constant rate over a
time span of 60 minutes.

Duration of infusion:               τ = 60 min
Rate of infusion:                   k_rp = 10,000/60 = 166.7 µg min^-1

From the solution of the continuous administration model we can now derive
the plasma concentration at the end of the infusion τ:

C_p(τ) = 166.7/(200 × 0.01155) × (1 − exp(−0.01155 × 60)) = 72.16 × (1 − 0.5) = 36.1 µg l^-1

This should be compared with the initial plasma concentration of 50 µg l^-1
which is obtained when the same dose D is injected as a single bolus. The final
concentration C_p(τ) is also still far from the steady-state concentration C_ss (72.2 µg
l^-1), since the duration of the infusion τ is only equal to once the half-life time of the
drug t_1/2 (60 min).
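A short numerical check of this calculation, and of the decay after the infusion is stopped, can be written as follows (a Python sketch under the chapter's parameter values; the function name is ours):

import numpy as np

k_pe = 0.01155      # 1/min, transfer constant of elimination
V_p  = 200.0        # l, plasma volume of distribution
D    = 10_000.0     # µg, dose in the reservoir
tau  = 60.0         # min, duration of infusion
k_rp = D / tau      # µg/min, constant rate of infusion

C_ss = k_rp / (V_p * k_pe)   # steady-state concentration, µg/l

def C_p(t):
    """Plasma concentration (µg/l) during and after a constant-rate infusion."""
    t = np.asarray(t, dtype=float)
    during = C_ss * (1.0 - np.exp(-k_pe * t))            # eq. (39.34)
    C_tau  = C_ss * (1.0 - np.exp(-k_pe * tau))          # level reached at t = tau
    after  = C_tau * np.exp(-k_pe * (t - tau))           # eq. (39.40)
    return np.where(t <= tau, during, after)

print(f"C_ss = {C_ss:.1f} µg/l, C_p(tau) = {C_p(tau):.1f} µg/l")
print(f"after 4 half-lives of infusion C_p would reach {C_ss*(1 - 2**-4):.1f} µg/l")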

39.1.5 One-compartment open model for repeated intravenous administration

We now consider the case where a single dose D is injected repeatedly at
constant time intervals θ. Again we consider the simplest possible situation, which
is the one-compartment open model with a transfer constant for elimination k_pe
and a volume of distribution V_p (Fig. 39.11). Administration is supposed to be
intravenous, which implies rapid distribution of the drug in the plasma
compartment. Over the course of the study, one will observe peaks and troughs in
the plasma concentration C_p. Peaks C_p^max occur at times 0, θ, 2θ, 3θ, etc. Troughs
C_p^min are observed at times θ, 2θ, 3θ, etc. The time nθ marks the end of the nth
cycle and the beginning of the (n+1)th one. Since our system behaves in a linear
fashion, the concentration C_p at any time t is simply the sum of the con-
centration–time courses produced by all preceding injections. In particular, we
obtain for the peaks and troughs:


Fig. 39.11. (a) One-compartment open model for repeated intravenous injection of the same dose D at
constant intervals θ. (b) Time course of the plasma concentration C_p with peaks C_p^max and troughs
C_p^min at multiples of the time interval θ. The peaks and troughs tend asymptotically toward their
steady-state values.

C_p^max(nθ) = (D/V_p)(1 + exp(−k_pe θ) + exp(−2 k_pe θ) + ... + exp(−n k_pe θ))

C_p^min(nθ) = (D/V_p)(exp(−k_pe θ) + exp(−2 k_pe θ) + ... + exp(−n k_pe θ))    (39.41)

which follows from a consideration of Fig. 39.11 and from the general solution of
the one-compartment open model in eq. (39.6).
It is readily observed that the expression within brackets for the peak
concentration C_p^max forms a geometric progression in which the first term equals
one and with a common ratio of exp(−k_pe θ). (The common ratio is the factor which
results from dividing a term of the progression by its preceding one.) For the sum
of the finite progression in eq. (39.41) we obtain after n cycles:

C_p^max(nθ) = (D/V_p) (1 − exp(−(n+1) k_pe θ))/(1 − exp(−k_pe θ))    (39.42)

In the case of a sufficiently large number of cycles n we thus find that the peak
and trough values approach their respective steady-state values C_ss^max and C_ss^min:

C_ss^max = (D/V_p) · 1/(1 − exp(−k_pe θ))

C_ss^min = C_ss^max − D/V_p    (39.43)

Usually, one has obtained an estimate for the elimination constant k_pe and the
distribution volume V_p from a single intravenous injection. These pharmacokinetic
parameters, together with the interval between administrations θ and the
single dose D, then allow us to compute the steady-state peak and trough values.
The criterion for an optimal dose regimen depends on the minimum therapeutic
concentration (which must be exceeded by C_ss^min) and on the maximum safe
concentration (which may not be exceeded by C_ss^max). It is readily seen that the
magnitudes of the peak and trough values of C_p are inverse functions of the
elimination constant k_pe, the interval between successive dosings θ and the
distribution volume V_p. They are also directly proportional to the single dose D.
In the limit, when the interval θ between administrations becomes extremely
small in comparison with the elimination half-life 0.693/k_pe, the steady-state
solutions are reduced to those already derived for a continuous intravenous
infusion (eq. (39.35)):

C_ss^max ≈ C_ss^min ≈ D/(V_p k_pe θ) = k_rp/(V_p k_pe)    (39.44)

since the rate of input k_rp can be written in the form:

k_rp = D/θ    (39.45)

This property can be derived by means of a series expansion of the exponential
function in eq. (39.43) and by neglecting higher order terms in θ.

Example
We consider again the pharmacokinetic parameters of the one-compartment
model for a single intravenous injection (eq. (39.6)).

Transfer constant of elimination:   k_pe = 0.01155 min^-1
Plasma volume of distribution:      V_p = 200 l
Single dose:                        D = 10 mg
Interval between dosings:           θ = 60 min

Note that the half-life time of the drug in the plasma compartment is equal to
0.693/k_pe which amounts, in this case, to exactly 60 min.
From the previously derived formulas we compute the steady-state peak and
trough concentrations in plasma:

C_ss^max = (10,000/200) × 1/(1 − exp(−0.01155 × 60)) = 50/(1 − 0.5) = 100 µg l^-1

C_ss^min = 100 − 10,000/200 = 50 µg l^-1

In this special case, when the time between dosings is equal to the half-life time of
the drug, we can deduce that the minimum (steady-state) plasma concentration
with repeated dosing is equal to the peak concentration obtained from a single
dose. Under this condition, the corresponding maximum (steady-state) concen-
tration is twice the minimum one.
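The build-up toward these steady-state levels can be sketched by direct superposition of single-dose curves, which is how the model was motivated above. The following Python fragment (NumPy assumed; the names are ours) accumulates the contributions of successive injections and compares the peaks and troughs with the steady-state formulas:

import numpy as np

k_pe, V_p = 0.01155, 200.0        # 1/min, l
D, theta  = 10_000.0, 60.0        # µg, min

# Superposition: the injection given at time j*theta contributes (D/V_p)*exp(-k_pe*(t - j*theta))
def peak_trough(n):
    """Peak just after, and trough just before, the injection at time n*theta."""
    ages_peak   = theta * np.arange(0, n + 1)      # ages of doses 0..n at time n*theta
    ages_trough = theta * np.arange(1, n + 1)      # dose at n*theta not yet given
    peak   = (D / V_p) * np.exp(-k_pe * ages_peak).sum()
    trough = (D / V_p) * np.exp(-k_pe * ages_trough).sum()
    return peak, trough

C_ss_max = (D / V_p) / (1.0 - np.exp(-k_pe * theta))   # eq. (39.43)
C_ss_min = C_ss_max - D / V_p

for n in (1, 3, 10):
    pk, tr = peak_trough(n)
    print(f"after {n:2d} cycles: peak {pk:6.1f}, trough {tr:6.1f} µg/l")
print(f"steady state:   peak {C_ss_max:6.1f}, trough {C_ss_min:6.1f} µg/l")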
39.1.6 Two-compartment mammillary model for intravenous administration
using Laplace transform

This model is an extension of the one-compartment model for intravenous
injection (Section 39.1.1), which is now provided with a peripheral buffering
compartment that exchanges with the central plasma compartment. Elimination
occurs via the central compartment (Fig. 39.12). The model requires the estimation
of the plasma volume of distribution V_p and three transfer constants, namely k_pe for
elimination and k_pb, k_bp for exchange between the plasma and buffer compart-
ments. These four parameters can be obtained from the mass balance differential
equations:
dX_p/dt = −(k_pb + k_pe) X_p + k_bp X_b

dX_b/dt = k_pb X_p − k_bp X_b    (39.46)

with

X_e = D − (X_p + X_b)    (39.47)

where X_p, X_b and X_e are the amounts of drug in the plasma and buffer compartments
and in the elimination pool, respectively, and where D represents the single intra-
venous dose.
These equations form a set of first-order linear differential equations with
constant coefficients and with initial conditions:


Fig. 39.12. (a) Two-compartment mammillary model for single intravenous injection of a dose D. The
buffer compartment exchanges with plasma with transfer constants k_pb and k_bp. (b) Time courses of the
plasma concentration C_p, the content in the buffer compartment X_b and in the elimination pool X_e.

X_p(0) = D and X_b(0) = 0

While the simple linear models in the previous sections have been solved by
straightforward integration, the present model (and more complicated ones) is
more conveniently solved by means of the Laplace transform.
In this section, we will outline only those properties of the Laplace transform that
are directly relevant to the solution of systems of linear differential equations with
constant coefficients. A more extensive coverage can be found, for example, in the
textbook by Franklin [6].
The Laplace transform of a time-dependent variable X(t) is denoted by Lap X(t)
or x(s) and is defined by means of the definite integral over the positive time
domain:

Lap X(t) = ∫_0^∞ exp(−st) X(t) dt = x(s)    (39.48)

Since the integral is over time t, the resulting transform no longer depends on t, but
instead is a function of the variable s which is introduced in the operand. Hence, the
Laplace transform maps the function X(t) from the time domain into the s-domain.
For this reason we will use the symbol x(s) when referring to Lap X(t). To some
extent, the variable s can be compared with the one which appears in the Fourier
transform of periodic functions of time t (Section 40.3). While the Fourier domain
can be associated with frequency, there is no obvious physical analogy for the
Laplace domain. The Laplace transform plays an important role in the study of
linear systems that often arise in mechanical, electrical and chemical kinetic
systems. In particular, their interest lies in the transformation of linear differential
equations with respect to time t into equations that only involve simple functions of
s, such as polynomials, rational functions, etc. The latter are solved easily and the
results can be transformed back to the original time domain.
The Laplace transform of a first-order derivative is defined consistently with eq.
(39.48) by means of the integral:

Lap (dX(t)/dt) = ∫_0^∞ exp(−st) (dX(t)/dt) dt    (39.49)

which can be integrated by parts:

Lap (dX(t)/dt) = [exp(−st) X(t)]_0^∞ − (−s) ∫_0^∞ exp(−st) X(t) dt = −X(0) + s Lap X(t) = s x(s) − X(0)    (39.50)

provided that X is a bounded function of t.
The inverse Laplace transform is denoted by Lap^-1:

Lap^-1 x(s) = X(t)    (39.51)

Inverse Laplace transforms have been tabulated for most analytical functions,
including power, exponential, trigonometric, hyperbolic and other functions. In
this context we require only the inverse Laplace transform which yields a simple
exponential:

Lap^-1 [1/(s + k)] = exp(−kt)    (39.52)

When we apply these two properties of the Laplace transform to the differential
equations of our pharmacokinetic model in eq. (39.46), we obtain:

s x_p − D = −(k_pb + k_pe) x_p + k_bp x_b
s x_b = k_pb x_p − k_bp x_b    (39.53)

where x_p and x_b are functions of s.
After rearrangement of eq. (39.53) we obtain a system of two linear equations in
the two unknowns x_p and x_b:

(s + k_pb + k_pe) x_p − k_bp x_b = D
−k_pb x_p + (s + k_bp) x_b = 0    (39.54)

The solution for x_p in eq. (39.54) can be written explicitly in the form:

x_p(s) = D (s + k_bp)/(s^2 + s(α + β) + αβ) = D (s + k_bp)/((s + α)(s + β))    (39.55)

where α and β are called the hybrid transfer constants. The latter are related to the
physical transfer constants by means of their sum and product:

α + β = k_pb + k_bp + k_pe
αβ = k_pe k_bp    (39.56)

If the transfer constants k_pb, k_bp and k_pe are known, then the hybrid transfer
constants α and β are the roots of the quadratic equation:

γ^2 − γ(k_pb + k_bp + k_pe) + k_pe k_bp = 0    (39.57)

We can derive the plasma concentration function C_p in the s-domain from:

c_p(s) = x_p(s)/V_p = (D/V_p)(s + k_bp)/((s + α)(s + β)) = A_p/(s + α) + B_p/(s + β)    (39.58)

where the coefficients A_p and B_p are obtained by working out the right-hand part of
the expression in eq. (39.58) and by equating the corresponding terms of the
numerators in its right- and left-hand members. This yields the solution for A_p and
B_p as functions of α and β:

A_p = (D/V_p)(α − k_bp)/(α − β)   and   B_p = (D/V_p)(k_bp − β)/(α − β)    (39.59)

The plasma concentration function C_p in the time domain is obtained by applying
the inverse Laplace transform to the two rational functions in the expression for c_p
in eq. (39.58):

C_p(t) = Lap^-1 c_p(s) = A_p exp(−αt) + B_p exp(−βt)    (39.60)

where the coefficients A_p, B_p, α, β and V_p have to be estimated from the observed
concentration–time data.
From the initial conditions we must have that:

C_p(0) = A_p + B_p = D/V_p    (39.61)

which is satisfied by the above results for A_p and B_p.
This relationship allows us to derive the volume of distribution V_p:

V_p = D/(A_p + B_p)    (39.62)

provided that A_p and B_p can be determined.


A similar solution can be derived for the content X_b in the buffer compartment:

X_b(t) = A_b exp(−αt) + B_b exp(−βt)    (39.63)

with the constraint that A_b + B_b = 0 since X_b(0) must be zero.


Usually, the buffer compartment is not accessible and, consequently, the
absolute amount X_b cannot be determined experimentally. For this reason, we
will only focus our discussion on the plasma concentration C_p. It is important to
know, however, that the time course of the contents in the two compartments is the
sum of two exponentials, which have the same positive hybrid transfer constants α
and β. The coefficients A and B, however, depend on the particular compartment.
This statement can be generalized to mammillary systems with a large number of
compartments that exchange with a central compartment. The solutions for each of
n compartments in a mammillary model are sums of n exponential functions,
having the same n positive hybrid transfer constants, but with n different
coefficients for each particular compartment. (We will return to this property of
linear compartmental systems during the discussion of multi-compartment models
in Section 39.1.7.)
We now turn our attention to the graphical determination of the various para-
meters of our two-compartment model, i.e. the plasma volume of distribution V_p,

the three transfer constants k_pb, k_bp, k_pe and their derived quantities, such as the
plasma clearance Cl_p and the bio-availability AUC. By convention, we assume that
β is the smaller of the two hybrid transfer constants (β < α). If the difference is
marked (e.g. by one decade) then we can clearly distinguish between the two
exponentially decaying phases of the plasma concentration curve. The α-phase is
the fastest and disappears rapidly after administration; the β-phase is slower
and dominates at longer times after administration. Under this circumstance it is
possible to apply curve peeling on a semilogarithmic diagram, in the same way as
has been done in the case of the two-compartment model with oral or parenteral
administration defined by eq. (39.16).
First, we determine the β-phase function on a semilog plot by means of linear
regression on the later part of the concentration curve for which the time course is
most linear (Fig. 39.13a). This yields:

log C_p^β(t) = log B_p − 0.4343 β t    (39.64)

This yields numerical values for the intercept log B_p and the slope 0.4343 β, which
have been defined analytically before.
The next step requires the determination of the residual concentrations C_p^α by
subtracting C_p^β from the observed C_p. The resulting α-phase function is again
obtained by means of linear regression on the earlier part of the time course of the
logarithmic plasma concentration (Fig. 39.13b):

log C_p^α(t) = log(C_p(t) − C_p^β(t)) = log A_p − 0.4343 α t    (39.65)

This way, we obtain numerical values for the intercept log A_p and the slope 0.4343 α,
which have been defined formally before.
Since we have now determined the hybrid constants α and β, we can derive V_p
and k_bp from the coefficients A_p and B_p which resulted from our graphical analysis.
The ratio of A_p to B_p produces (according to eq. (39.59)):

A_p/B_p = (α − k_bp)/(k_bp − β)    (39.66)

from which the transfer constant k_bp from the buffer to the plasma compartment
can be solved:

k_bp = (A_p β + B_p α)/(A_p + B_p)    (39.67)

Using this result we now obtain the plasma distribution volume V_p from either A_p
or B_p (eq. (39.59)):

V_p = (D/A_p)(α − k_bp)/(α − β) = (D/B_p)(k_bp − β)/(α − β)    (39.68)

Finally, we derive the remaining transfer constants k_pb and k_pe from the
previously defined sum and product of the hybrid rate constants α and β (eq.
(39.56)):

Fig. 39.13. (a) Semilogarithmic plot of the plasma concentration C_p (µg l^-1) versus time t. The straight
line is fitted to the later part of the curve (slow β-phase) with the exception of points that fall below the
quantitation limit. The intercept B_p of the extrapolated plasma concentration C_p^β appears as a coeffi-
cient in the solution of the model. The slope is proportional to the hybrid transfer constant β, which is
itself a function of the transfer constants k_pe, k_pb and k_bp of the model. (b) Semilogarithmic plot of the
residual plasma concentration C_p^α (µg l^-1) on an expanded time scale t. The straight line is fitted to the
first part of the residual curve (fast α-phase), with the exception of points whose residuals fall below
the quantitation level. The intercept A_p appears as a coefficient in the solution of the model. The slope
is proportional to the hybrid transfer constant α, which is itself a function of the transfer constants of
the pharmacokinetic model. (c) Semilogarithmic plot of the content X_b (µg) in the buffer compartment.
The rate constants of the exponential functions are the hybrid transfer constants α and β. The coeffi-
cients have been computed by means of the γ-method (see text), using the results of curve peeling rep-
resented in panels (a) and (b).

Fig. 39.13 (continued).



k_pe = αβ/k_bp
k_pb = α + β − k_bp − k_pe    (39.69)

The area under the plasma concentration curve AUC results from integration of
the sum of exponentials in eq. (39.60) between zero and infinity:

AUC = ∫_0^∞ (A_p exp(−αt) + B_p exp(−βt)) dt = A_p/α + B_p/β = D/(V_p k_pe)    (39.70)

which is, once again, the maximal bio-availability of the drug. This expression has
also been obtained in the case of the one-compartment models for intravenous and
for complete extravascular absorption (Sections 39.1.1 and 39.1.2).
The plasma clearance Cl_p follows in the usual way from the dose D and the
bio-availability AUC (eq. (39.27)):

Cl_p = D/AUC = V_p k_pe

Example
We assume that the following synthetic plasma concentrations C_p have been
obtained after a single intravenous administration at different times t:

t (min)         2     5    10    20    30    60    90   120   240   360   480   720
C_p (µg l^-1)  45.7  38.4  31.5  22.3  15.7  11.5   8.9   9.0   5.4   4.2   2.3   1.3
(the first points are dominated by the fast α-phase, the later points by the slow β-phase)

These data are also shown in the semilogarithmic plot in Fig. 39.13a, which
clearly shows two distinct phases. A straight line has been fitted by least-squares
regression through the data starting from the observation at 90 minutes down to the
last one. This yields the values of 1.086 and −0.001380 for the intercept log B_p and
slope s_β, respectively. From these results we have computed the extrapolated
β-phase values C_p^β between 2 and 60 minutes. These have been subtracted from the
experimental C_p values in order to yield the α-phase concentrations C_p^α:

t (min)             2     5    10    20    30    60
C_p (µg l^-1)     45.7  38.4  31.5  22.3  15.7  11.5
C_p^β (µg l^-1)   12.2  12.0  11.8  11.5  11.1  10.1
C_p^α (µg l^-1)   33.5  26.4  19.7  10.8   4.6   1.4

The residual α-phase concentrations C_p^α are shown in the semilogarithmic plot
of Fig. 39.13b. Least-squares linear regression of log C_p^α upon time produced
1.524 and −0.02408 for the intercept log A_p and the slope s_α, respectively.
From the curve peeling operation we thus obtained the following intercepts,
hybrid transfer constants and half-life times of the α- and β-phases:

A_p = 33.4 µg l^-1    α = s_α/0.4343 = 0.0555 min^-1    t_1/2^α = 0.693/α = 12.5 min
B_p = 12.2 µg l^-1    β = s_β/0.4343 = 0.00318 min^-1   t_1/2^β = 0.693/β = 218 min

The ratio of the intercepts A_p to B_p indicates that the amplitude or peak value of
the α-phase is 2.7 times larger than that of the β-phase, while the ratio of α to β
(or t_1/2^β to t_1/2^α) shows that the α-phase decays at a rate which is 17 times larger than
that of the β-phase.
Using the formulas derived above we can now compute the pharmacokinetic
properties of the model:

Transfer constant buffer to plasma:      k_bp = (A_p β + B_p α)/(A_p + B_p) = 0.0172 min^-1
Transfer constant of elimination:        k_pe = αβ/k_bp = 0.0103 min^-1
Transfer constant plasma to buffer:      k_pb = α + β − k_bp − k_pe = 0.0312 min^-1
Plasma volume of distribution:           V_p = D/(A_p + B_p) = 219 l
Area under the curve:                    AUC = (A_p/α) + (B_p/β) = 4.442 mg min l^-1
Plasma clearance:                        Cl_p = D/AUC = 2.25 l min^-1

The synthetic data have been derived from a theoretical model with the
following parameters:

Dose:                                    D = 10 mg
Plasma volume of distribution:           V_p = 200 l
Transfer constant of elimination:        k_pe = 0.01155 min^-1
Transfer constant buffer to plasma:      k_bp = 0.020 min^-1
Transfer constant plasma to buffer:      k_pb = 0.040 min^-1
Area under the curve:                    AUC = 4.329 mg min l^-1
Plasma clearance:                        Cl_p = 2.31 l min^-1
Fast phase hybrid transfer constant:     α = 0.06816 min^-1
Slow phase hybrid transfer constant:     β = 0.003390 min^-1

A random noise with a standard deviation of 0.4 µg l^-1 has been added to the
theoretical values in order to produce a realistic example. The specifications of the
model are in part the same as those used for the one-compartment models which
have been discussed above. The major distinction between this model and the

former ones is the addition of the buffer compartment which exchanges with the
plasma compartment. A comparison between the various results (Figs. 39.5b,
39.8a and 39.13a) clearly shows that the buffer compartment produces a signif-
icant delay on the time course of the plasma concentration.
At the beginning of this section we have assumed that the hybrid rate constant α
should be larger than β by at least about one order of magnitude. In the case when α
is close to β, residual plasma concentrations, after peeling off the β-phase, may
become very small and may even take negative values. Hence, the relative error on
the semilogarithmic plot, which leads to the determination of α, will become very
prominent. As a consequence, β will be estimated more accurately than α, A_p and
B_p. Unfortunately, there is no really good alternative. One may perform a non-linear
regression (see Chapter 11) on the plasma concentrations C_p using the graphical
determinations of α, β, A_p and B_p as initial estimates for the parameters of the sum
of exponential functions of time t; a sketch of this approach is given below. This will
tend to spread out the estimation error over the four parameters, which are computed
concurrently by the non-linear regression algorithm. In practical applications, however,
it may be preferable to obtain the best possible estimate of β, even at the expense of the
accuracy of the other parameters. Indeed, the rate constant of the β-phase determines the
overall half-life of the substance in the system that is being modelled. In repetitive drug
administration, for example, the half-life of the slowest phase (β in this case)
determines the schedule of administration (once daily, twice daily, etc.). The
determination of residues in tissues after discontinuation of medication is also
governed by the slowest decaying phase.
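Such a refinement is illustrated with a minimal Python sketch below (assuming NumPy and SciPy are available; scipy.optimize.curve_fit is used here as a generic non-linear least squares routine, not as the chapter's own algorithm). The curve-peeling estimates of A_p, α, B_p and β serve as starting values.

import numpy as np
from scipy.optimize import curve_fit

# Observed plasma concentrations after intravenous injection (see the example above)
t  = np.array([2, 5, 10, 20, 30, 60, 90, 120, 240, 360, 480, 720], float)
Cp = np.array([45.7, 38.4, 31.5, 22.3, 15.7, 11.5, 8.9, 9.0, 5.4, 4.2, 2.3, 1.3])

def biexponential(t, A_p, alpha, B_p, beta):
    """Sum of two exponentials, eq. (39.60)."""
    return A_p * np.exp(-alpha * t) + B_p * np.exp(-beta * t)

# Starting values taken from the curve peeling results
p0 = [33.4, 0.0555, 12.2, 0.00318]
popt, pcov = curve_fit(biexponential, t, Cp, p0=p0)
A_p, alpha, B_p, beta = popt
perr = np.sqrt(np.diag(pcov))     # approximate standard errors of the estimates

D = 10_000.0                                   # µg
k_bp = (A_p * beta + B_p * alpha) / (A_p + B_p)
k_pe = alpha * beta / k_bp
V_p  = D / (A_p + B_p)
print("A_p, alpha, B_p, beta =", np.round(popt, 5))
print("standard errors       =", np.round(perr, 5))
print(f"k_bp = {k_bp:.4f}, k_pe = {k_pe:.4f} 1/min, V_p = {V_p:.0f} l")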
To conclude this section on two-compartment models, we note that the hybrid
constants α and β in the exponential functions are eigenvalues of the matrix of
coefficients of the system of linear differential equations:

K = | k_pb + k_pe    −k_bp |
    |    −k_pb        k_bp |    (39.71)

If α and β are referred to by the general symbol γ, we must have that the following
determinant Δ is zero:

Δ = |K − γI| = 0    (39.72)

where I is the 2 × 2 identity matrix. Working out the determinant leads to the
characteristic equation of the system (Section 29.6):

γ^2 − γ(k_pb + k_bp + k_pe) + k_pe k_bp = 0    (39.73)


The exponential functions, the weighted sums of which determine the time courses
in the various compartments, can thus be regarded as eigenfunctions of the phar-
macokinetic system [7]. The complete solution of this model in terms of the
eigenvalues of the matrix of transfer constants is discussed below in Section
39.1.7.2.
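The eigenvalue property is easy to verify numerically. The short Python check below (NumPy assumed) builds the matrix K of eq. (39.71) with the theoretical transfer constants of the example and recovers the hybrid constants α and β:

import numpy as np

k_pe, k_bp, k_pb = 0.01155, 0.020, 0.040   # 1/min, theoretical values of the example

# Matrix of coefficients of the linear differential equations, eq. (39.71)
K = np.array([[k_pb + k_pe, -k_bp],
              [-k_pb,        k_bp]])

gammas = np.sort(np.linalg.eigvals(K).real)[::-1]   # eigenvalues, largest first
alpha, beta = gammas
print(f"alpha = {alpha:.5f} 1/min, beta = {beta:.5f} 1/min")
# alpha ~ 0.06816 and beta ~ 0.00339, in agreement with the hybrid constants quoted above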

39.1.7 Multi-compartment models

39.1.7.1 The convolution method


Often, it is required to predict the time course of the plasma concentration from
a model with oral administration or with continuous infusion, when only data from
a single intravenous injection are available. In this case, the Laplace transform can
be very useful, as will be shown from the following illustration.
In the catenary model of Fig. 39.14a we have a reservoir, absorption and plasma
compartments and an elimination pool. The time-dependent contents in these
compartments are labelled X_r, X_a, X_p and X_e, respectively. Such a model can be
transformed in the s-domain into the form of a diagram in which each node
represents a compartment, and where each connecting block contains the transfer
function of the passage from one node to another. As shown in Fig. 39.14b, the


Fig. 39.14. (a) Catenary compartmental model representing a reservoir (r), absorption (a) and plasma
(p) compartments and the elimination (e) pool. The contents X_r, X_a, X_p and X_e are functions of time t.
(b) The same catenary model represented in the form of a flow diagram using the Laplace transforms
x_r, x_a and x_p in the s-domain. The nodes of the flow diagram represent the compartments, the boxes con-
tain the transfer functions between compartments [1]. (c) Flow diagram of the lumped system con-
sisting of the reservoir (r), and the absorption (a) and plasma (p) compartments. The lumped transfer
function is the product of all the transfer functions in the individual links.

transfer function from the reservoir to the absorption compartment is 1/(s + k_ap),
from the absorption to the plasma compartment is k_ap/(s + k_pe), and from the plasma
to the elimination pool is k_pe/s. Note that the transfer constant from an emitting
input node and to a receiving output node appears systematically in the numerator
and denominator of the connecting transfer function, respectively. This makes it
rather easy to model linear systems of the catenary, mammillary and mixed types.
Parts of the model in the Laplace domain can be lumped by multiplying the transfer
functions that appear between the input and output nodes. In Fig. 39.14c we have
lumped the absorption and plasma compartments. The resulting Laplace transform
of the output x_p(s) is then related to the input x_r(s) by means of the transfer function
g(s):

x_p(s) = g(s) x_r(s)    (39.74)
where

g(s) = k_ap/((s + k_ap)(s + k_pe))    (39.75)

This model can now be solved for various inputs to the absorption compartment.
In the case of rapid administration of a dose D to the absorption compartment
(such as the gut, skin, muscle, etc.), the Laplace transform of the reservoir function
is given by:

x_r(s) = D    (39.76)

In this case we obtain the simplest possible expression for the plasma function in
the s-domain:

x_p(s) = g(s) D = D k_ap/((s + k_ap)(s + k_pe))    (39.77)

The inverse transform X_p(t) in the time domain can be obtained by means of the
method of indeterminate coefficients, which was presented above in Section
39.1.6. In this case the solution is the same as the one which was derived by
conventional methods in Section 39.1.2 (eq. (39.16)). The solution of the two-
compartment model in the Laplace domain (eq. (39.77)) can now be used in the
analysis of more complex systems, as will be shown below.
When the administration is continuous, for example by oral infusion at a
constant rate k_rp, the input function is given by:

x_r(s) = k_rp/s    (39.78)

and the resulting plasma function becomes:

x_p(s) = g(s) k_rp/s = k_rp k_ap/(s (s + k_ap)(s + k_pe))    (39.79)

The inverse Laplace transform can be obtained again by means of the method of
indeterminate coefficients. In this case the coefficients A, B and C must be solved
by equating the corresponding terms in the numerators of the left- and right-hand
parts of the expression:

k_rp k_ap/(s (s + k_ap)(s + k_pe)) = A/s + B/(s + k_ap) + C/(s + k_pe)    (39.80)

The inverse transform of the plasma function is then given by:

X_p(t) = A + B exp(−k_ap t) + C exp(−k_pe t)    (39.81)

After substitution of the values of A, B and C we finally obtain the plasma function
X_p(t) for the two-compartment open system with continuous oral administration:

X_p(t) = k_rp [ (1 − exp(−k_pe t))/k_pe − (exp(−k_pe t) − exp(−k_ap t))/(k_ap − k_pe) ]    (39.82)

From the above solution we can now easily determine the steady-state plasma
content X_ss after a sufficiently long time t:

X_ss = k_rp/k_pe    (39.83)
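The indeterminate coefficients A, B and C of eq. (39.80) can also be obtained numerically as a partial-fraction expansion. The sketch below (Python with SciPy; scipy.signal.residue is used here as a generic partial-fraction routine, and the parameter values are illustrative choices of our own) expands x_p(s) and rebuilds X_p(t) from the resulting terms:

import numpy as np
from scipy.signal import residue

k_ap, k_pe, k_rp = 0.1386, 0.01155, 100.0   # illustrative values (1/min, 1/min, µg/min)

# x_p(s) = k_rp * k_ap / (s (s + k_ap)(s + k_pe))  -> numerator and denominator polynomials
num = [k_rp * k_ap]
den = np.polymul([1.0, 0.0], np.polymul([1.0, k_ap], [1.0, k_pe]))

r, p, _ = residue(num, den)     # residues r_i and poles p_i: x_p(s) = sum r_i/(s - p_i)

def X_p(t):
    """Inverse transform: sum of r_i * exp(p_i * t), cf. eq. (39.81)."""
    t = np.asarray(t, dtype=float)
    return np.real(sum(ri * np.exp(pi * t) for ri, pi in zip(r, p)))

print("residues (A, B, C up to ordering):", np.round(np.real(r), 2))
print("X_p at large t:", float(X_p(5000.0)), " steady state k_rp/k_pe =", k_rp / k_pe)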

If X_r(t) and X_p(t) are the input and output functions in the time domain (for
example, the contents in the reservoir and in the plasma compartment), then X_p(t) is
the convolution of X_r(t) with G(t), the inverse Laplace transform of the transfer
function between input and output:

X_p(t) = ∫_0^t G(τ) X_r(t − τ) dτ = ∫_0^t G(t − τ) X_r(τ) dτ = G(t) * X_r(t)    (39.84)

where the symbol * means convolution, and where τ denotes the integration
variable.

In the s-domain, convolution is simply the product of the Laplace transforms:

x_p(s) = g(s) x_r(s)    (39.85)

By means of numerical convolution one can obtain X_p(t) directly from sampled
values of G(t) and X_r(t) at regular intervals of time t. Similarly, numerical decon-
volution yields X_r(t) from sampled values of G(t) and X_p(t). The numerical method
of convolution and deconvolution has been worked out in detail by Rescigno and
Segre [1]. These procedures are discussed more generally in Chapter 40 on signal
processing in the context of the Fourier transform.
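As a rough numerical illustration of eq. (39.84) (a Python sketch with NumPy; the sampling interval, the square-pulse input and the unit-impulse response G(t) chosen below are our own assumptions, not taken from the text), a sampled convolution reproduces the plasma time course for an arbitrary input:

import numpy as np

k_ap, k_pe = 0.1386, 0.01155          # 1/min
dt = 1.0                              # min, sampling interval
t  = np.arange(0, 720, dt)

# Unit-impulse response of the lumped absorption-plasma system,
# i.e. the inverse transform of g(s) = k_ap / ((s + k_ap)(s + k_pe))
G = k_ap / (k_ap - k_pe) * (np.exp(-k_pe * t) - np.exp(-k_ap * t))

# Example input: constant delivery of 166.7 µg/min during the first 60 min
X_r = np.where(t < 60, 166.7, 0.0)

# Numerical convolution; the factor dt approximates the integral of eq. (39.84)
X_p = np.convolve(G, X_r)[: len(t)] * dt

i_max = int(np.argmax(X_p))
print(f"peak content {X_p[i_max]:.0f} µg at t = {t[i_max]:.0f} min")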

39.1.7.2 The γ-method


Thus far we have only considered relatively simple linear pharmacokinetic
models. A general solution for the case of n compartments can be derived from the
matrix K of coefficients of the linear differential equations:

A:,J ^21 ^nl

""^12 ^22 ~^n2


K= (39.86)

~^\n ~^2n ^nn

in which an off-diagonal element k_ij represents the transfer constant from
compartment i to compartment j, and in which a diagonal element k_ii represents the
sum of the transfer constants from compartment i to all n − 1 others, including the
elimination pool. If compartment i is not connected to, say, compartment j, then the
corresponding element k_ij is zero. The index 1 is reserved here to denote the plasma
compartment.
In the previous section we found that the hybrid transfer constants of a two-
compartment model are eigenvalues of the transfer constant matrix K. This can be
generalized to the multi-compartment model. Hence the characteristic equation
can be written by means of the determinant Δ:

Δ = |K − γI| = 0    (39.87)

where I is the n × n identity matrix.
In the general case there will be n roots γ_j, which are the eigenvalues of the
transfer matrix K. Each of the eigenvalues defines a particular phase of the time
course of the contents in the n compartments of the model. The eigenvalues are the
hybrid transfer constants which appear in the exponents of the exponential
functions. For example, for the ith compartment we obtain the general solution:

X_i(t) = Σ_j G_ij exp(−γ_j t)    (39.88)

where G_ij is the coefficient of the jth phase in the ith compartment.
The coefficient G_ij can be determined from the minors of the determinant Δ, as
shown by Rescigno and Segre [1]:

G_ij = X_1(0) [(−1)^i Δ_1i / Δ']_{γ = γ_j}    (39.89)

where Δ_1i is the minor of Δ, which is obtained by crossing out row 1 and column i,
and where Δ' denotes the derivative (dΔ/dγ) of Δ with respect to γ.
This general approach for solving linear pharmacokinetic problems is referred
to as the γ-method. It is a generalization of the approach by means of the Laplace
transform, which has been applied in the previous Section 39.1.6 to the case of a
two-compartment model.
The theoretical solution from the γ-method allows us to study the behaviour of
the model, provided that the transfer constants in K are known. In the reverse
problem, one must estimate the transfer constants in K from an observed plasma
concentration curve C_p(t). In this case, we may determine the hybrid transfer
constants γ_j and the associated intercepts of the plasma concentration curve G_1j by
means of graphical curve peeling or by non-linear regression techniques. Using
these experimental results we must then seek to compute the transfer constants
from systems of equations which relate the hybrid rate constants γ_j and the
associated intercepts G_1j to the transfer constants in K. We may also insert the
general solution into the differential equations, their derivatives (at time 0) and their
integrals (between 0 and infinity) in order to obtain useful relationships between
the hybrid transfer constants γ_j, the intercepts G_ij and the model parameters in K.
The calculations can be done by computer programs such as PROC MODEL in
SAS [8]. Not all linear pharmacokinetic models are computable, however, and
criteria for computability have been described [9]. Some models may be in-
determinate, yielding an infinity of solutions, while others may have no solution.
By way of illustration, we apply the γ-method to the two-compartment
mammillary model for intravenous administration which we have already seen in
Section 39.1.6. The matrix K of transfer constants for this case is defined by means
of:

K = | k_pb + k_pe    −k_bp |
    |    −k_pb        k_bp |    (39.90)

and the corresponding characteristic equation can be written in the form:

Δ = |K − γI| = | k_pb + k_pe − γ      −k_bp    | = 0    (39.91)
               |     −k_pb          k_bp − γ  |

The eigenvalues γ of the model are the roots α and β of the quadratic equation:

γ^2 − γ(k_pb + k_bp + k_pe) + k_bp k_pe = 0    (39.92)

the sum and product of which are easily derived:

γ_1 + γ_2 = α + β = k_bp + k_pb + k_pe
γ_1 γ_2 = αβ = k_bp k_pe    (39.93)
The general solution for the plasma compartment is now expressed as follows:

C_p(t) = G_11 exp(−γ_1 t) + G_12 exp(−γ_2 t) = A_p exp(−αt) + B_p exp(−βt)    (39.94)

with

G_11 = A_p = (D/V_p) [(−1) Δ_11 / Δ']_{γ = γ_1 = α}    (39.95)

and

G_12 = B_p = (D/V_p) [(−1) Δ_11 / Δ']_{γ = γ_2 = β}

which can be developed into:

A_p = (D/V_p)(α − k_bp)/(2α − (k_pb + k_bp + k_pe)) = (D/V_p)(α − k_bp)/(α − β)
                                                                                (39.96)
B_p = (D/V_p)(β − k_bp)/(2β − (k_pb + k_bp + k_pe)) = (D/V_p)(k_bp − β)/(α − β)

using the expression in eq. (39.93) for the sum of the roots α + β in terms of the
transfer constants. The expressions for A_p and B_p are exactly the same as those
obtained in Section 39.1.6.
In a similar way, we can derive a general solution for the contents in the buffer
compartment:

X_b(t) = G_21 exp(−γ_1 t) + G_22 exp(−γ_2 t) = A_b exp(−αt) + B_b exp(−βt)    (39.97)

with

G_21 = A_b = D [Δ_12 / Δ']_{γ = γ_1 = α} = −D k_pb/(α − β)    (39.98)

and

G_22 = B_b = D [Δ_12 / Δ']_{γ = γ_2 = β} = D k_pb/(α − β)

The time course of X_b in the buffer compartment is represented in Fig. 39.13c
using the theoretical values for D, k_pb, α and β of the two-compartment model of
Section 39.1.6. At the time of peak concentration, at 46.3 min, about half of the
initial dose D, i.e. 5.02 mg, is stored in the buffer compartment.
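This behaviour of the buffer compartment can be reproduced with a few lines of Python (NumPy assumed; the parameter values are the theoretical ones quoted above), evaluating eqs. (39.97)–(39.98) and locating the maximum on a fine time grid:

import numpy as np

D = 10_000.0                        # µg
k_pb = 0.040                        # 1/min, plasma to buffer
alpha, beta = 0.06816, 0.003390     # 1/min, hybrid transfer constants

# Coefficients of the buffer compartment, eq. (39.98)
A_b = -D * k_pb / (alpha - beta)
B_b = -A_b

t = np.arange(0.0, 720.0, 0.1)
X_b = A_b * np.exp(-alpha * t) + B_b * np.exp(-beta * t)   # eq. (39.97)

i = int(np.argmax(X_b))
print(f"peak buffer content {X_b[i]/1000:.2f} mg at t = {t[i]:.1f} min")
# about 5.0 mg near t = 46 min, i.e. roughly half of the administered dose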

39.2 Non-compartmental analysis

Non-compartmental analysis is a statistical moment theory which has been


derived from chemical engineering and has been applied more recently to tracer
and drug kinetics [10,11]. Although this approach does not assume that the system
can be compartmentalized, it is not free from model assumptions. In fact, its
assumptions may be even more stringent than those pertaining to compartmental
analysis. In particular, statistical moment theory assumes that transit times of
molecules inside a system follow a stochastic distribution, where the nature of the
distribution depends on the structure of the system. (See Section 3.2 for a
discussion of the moments of a distribution.) For example, if a tracer substance is
injected into a body, the tracer particles are offered a choice of different paths
within the circulatory system. Some will complete a full cycle through the body in
a short time, while others may have been delayed during their transit through
various organs. The results of non-compartmental analysis also depend critically
on the accuracy of the method of measurement, as will be explained more fully in
the example.
The simplest non-compartmental parameter that can be obtained from the time
course of the plasma concentration is its area under the curve AUC (see also
Section 39.1.1):

AUC = ∫_0^∞ C_p(t) dt    (39.99)



Fig. 39.15. Area under a plasma concentration curve AUC as the sum of a truncated and an extrapo-
lated part. The former is obtained by numerical integration (e.g. trapezium rule) between times 0 and
T; the latter is computed from the parameters of a least squares fit to the exponentially decaying part of
the curve (β-phase).

This parameter can be obtained by numerical integration, for example using the
trapezium rule, between time 0 and the time T when the last plasma sample has
been taken. The remaining tail of the curve (between T and infinity) must be
estimated from an exponential model of the slowest descending part of the
observed plasma curve (β-phase), as shown in Fig. 39.15. The area under the curve
AUC can thus be decomposed into a truncated and an extrapolated part:

AUC = AUC_{0,T} + AUC_{T,\infty}
    = \int_0^T C_p(t)\,dt + \int_T^\infty B\,e^{-\beta t}\,dt \qquad (39.100)

with

AUC_{0,T} = \int_0^T C_p(t)\,dt
          \approx \sum_{i=1}^{n-1} \tfrac{1}{2}\,(C_{p,i} + C_{p,i+1})(t_{i+1} - t_i)

and

AUC_{T,\infty} = \int_T^\infty B\,e^{-\beta t}\,dt = \frac{B}{\beta}\,e^{-\beta T} \qquad (39.101)
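In practice this decomposition is computed directly from the sampled plasma curve. The following minimal Python sketch (the function name and argument layout are ours, chosen purely for illustration) applies the trapezium rule to the observed points and eq. (39.101) to the extrapolated tail:

import numpy as np

def auc_total(t, cp, B, beta):
    """AUC as the sum of a truncated and an extrapolated part (eq. 39.100)."""
    auc_0T = np.trapz(cp, t)                       # trapezium rule between 0 and T
    auc_Tinf = (B / beta) * np.exp(-beta * t[-1])  # eq. (39.101), T = last sample time
    return auc_0T + auc_Tinf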

The mean residence time MRT of the drug in plasma can be expressed in the
form:

MRT = \frac{\int_0^\infty t\,C_p(t)\,dt}{\int_0^\infty C_p(t)\,dt} = \frac{AUMC}{AUC} \qquad (39.102)

where AUMC is called the area under the moment curve and is defined as the
integral of t C_p(t) between time t ranging from 0 to infinity. The expression for
MRT in eq. (39.102) defines the mean of the continuous variable t, using the
corresponding plasma concentration C_p(t) as a weighting function. From this point
of view, the denominator can be regarded as a normalizing constant which ensures
that the integral of the weighting function over the whole time domain equals
unity. This is consistent with the definition of the mean of a discrete variable in
terms of the first moment of its probability density function, as was explained in
Section 3.2. For this reason, the graph of t C_p(t) is called the (first) moment curve.
The quantity AUMC can be decomposed into an observed and an extrapolated
part:

AUMC = AUMC_{0,T} + AUMC_{T,\infty}
     = \int_0^T t\,C_p(t)\,dt + \int_T^\infty t\,B\,e^{-\beta t}\,dt \qquad (39.103)

The extrapolated area can be expressed analytically by means of integration by
parts between times T and infinity:

AUMC_{T,\infty} = \int_T^\infty t\,B\,e^{-\beta t}\,dt
              = \frac{B}{\beta}\left(T + \frac{1}{\beta}\right) e^{-\beta T} \qquad (39.104)
The truncated part of the integral can be obtained by numerical integration (e.g. by
means of the trapezium rule) of the function t C_p(t) between times 0 and T. The
mean residence time MRT is an important pharmacokinetic parameter, especially
when a substantial fraction of the drug is excreted or metabolized during its first
pass through an organ, such as the liver.
In the case of a one-compartment open model with single-dose intravenous
administration, the mean residence time is simply the inverse of the elimination
transfer constant k_pe, since according to the above definition we obtain:

MRT_{iv} = \frac{\int_0^\infty t\,C_p(t)\,dt}{\int_0^\infty C_p(t)\,dt}
         = \frac{D\int_0^\infty t\,e^{-k_{pe}t}\,dt}{D\int_0^\infty e^{-k_{pe}t}\,dt}
         = \frac{1}{k_{pe}} \qquad (39.105)

since

\int_0^\infty t\,e^{-k_{pe}t}\,dt = \frac{1}{k_{pe}^2} \qquad\text{and}\qquad \int_0^\infty e^{-k_{pe}t}\,dt = \frac{1}{k_{pe}}

The mean residence time MRT can thus be defined as the time it takes for a
single intravenous dose to be reduced to 36.8% in a one-compartment open model.
This follows from the property derived above in eq. (39.105) and from eq. (39.5):

D\,e^{-k_{pe}\,MRT_{iv}} = D\,e^{-1} = 0.368\,D \qquad (39.106)

since

k_{pe}\,MRT_{iv} = 1

This relationship shows the analogy with the previously derived half-life time t_{1/2}
(in Section 39.1.1), which is the time required for a single intravenous dose to be
halved in a one-compartment open system. Since we already derived in eq. (39.9)
that:

t_{1/2} = \frac{0.693}{k_{pe}}

we can relate the two parameters by means of:

t_{1/2} = 0.693\,MRT_{iv} \qquad (39.107)
When a single-dose intravenous and an oral (or other extravascular) plasma
curve are both available from the same subject(s) one can define the mean
absorption time MAT by means of the mean residence time obtained from the
intravenous curve MRT_iv and the extravascular curve MRT:

MAT = MRT − MRT_iv \qquad (39.108)

The parameter MAT is representative of the time that the drug remains unabsorbed
[9].
From the mean residence time MRT, the area under the curve AUC and the
administered dose D, one can derive the steady-state volume of distribution V_ss of
the drug in plasma:

V_{ss} = \frac{D \cdot MRT}{AUC} \qquad (39.109)

In the special case of a one-compartment open model, it can easily be shown that
the steady-state volume V_ss is identical to the plasma volume V_p which has been
defined before in Section 39.1.1 (eq. (39.12)):

V_{ss} = \frac{D}{AUC\;k_{pe}} = V_p \qquad (39.110)

since in this case MRT is the reciprocal of the transfer constant of elimination k_pe.
In the general case, however, V_ss is independent of the elimination of the drug and
only depends on the non-compartmental parameters MRT and AUC, for a given
dose D. In this respect, one can state that the plasma clearance Cl_p is also a
non-compartmental parameter which, for a given dose D, depends only on the area
under the curve AUC (cf. eq. (39.13)):

Cl_p = \frac{D}{AUC}

Other non-compartmental parameters that are easily obtainable from a plasma
concentration curve are the time of appearance of the maximum t_max and the peak
concentration value C_p(t_max).
The variance of the residence times VRT is derived from the area under the
second moment of the plasma concentration curve AUSC:

VRT = \frac{\int_0^\infty (t - MRT)^2\,C_p(t)\,dt}{\int_0^\infty C_p(t)\,dt} = \frac{AUSC}{AUC} \qquad (39.111)

VRT and the second moment function (t − MRT)² C_p(t) are related in the same way
as MRT and the (first) moment function t C_p(t) in eq. (39.102).
The quantity AUSC can be decomposed into an observed and an extrapolated
part:

AUSC = AUSC_{0,T} + AUSC_{T,\infty}
     = \int_0^T (t - MRT)^2\,C_p(t)\,dt + \int_T^\infty (t - MRT)^2\,B\,e^{-\beta t}\,dt \qquad (39.112)

where the mean residence time MRT and the area under the curve AUC have
already been defined. An analytical expression can be derived for the extrapolated
part of AUSC:

AUSC_{T,\infty} = \int_T^\infty (t - MRT)^2\,B\,e^{-\beta t}\,dt
  = \frac{B}{\beta}\left[\frac{1}{\beta^2} + \left(T - MRT + \frac{1}{\beta}\right)^2\right] e^{-\beta T} \qquad (39.113)

as can be derived by means of integration by parts between times T and infinity.
Application of the trapezium rule to the function (t − MRT)² C_p(t) between times 0
and T yields the truncated part of the integral. Finally, we can define the standard
deviation of the retention times SRT as the square root of the variance VRT.
The quantities AUMC and AUSC can be regarded as the first and second
statistical moments of the plasma concentration curve. These two moments have
an equivalent in descriptive statistics, where they define the mean and variance,
respectively, in the case of a stochastic distribution of frequencies (Section 3.2).
From the above considerations it appears that the statistical moment method
strongly depends on numerical integration of the plasma concentration curve C_p(t)
and its product with t and (t − MRT)². Multiplication by t and (t − MRT)² tends to
amplify the errors in the plasma concentration C_p(t) at larger values of t. As a
consequence, the estimation of the statistical moments critically depends on the
precision of the measurement process that is used in the determination of the
plasma concentration values. This contrasts with compartmental analysis, where
the parameters of the model are estimated by means of least squares regression.

Example
We reconsider the data used previously in Section 39.1.2 in the discussion of the
two-compartment system for extravascular administration (e.g. oral, subcutaneous,
intramuscular). The data are truncated at 120 minutes in order to obtain a
realistic case. It is recalled that these data have been synthesized from a theoretical
model and that random noise with a standard deviation of about 0.4 µg l⁻¹ has been
superimposed.

t (min)        0     2      5      10     20     30     60     90     120
C_p (µg l⁻¹)   0     12.3   23.6   35.5   40.7   37.1   26.9   19.8   14.0

These data can be integrated numerically between times 0 and 120 minutes. The
remaining part between 120 minutes and infinity must be extrapolated from the
downslope of the curve (β-phase) which can be modelled by means of the
exponential function:

C_p(t) = B\,e^{-\beta t} = 55.5\,e^{-0.0119\,t}

the coefficients of which have been determined graphically in Section 39.1.2 from
the experimental data. We now provide the results that can be derived numerically
from the data by means of the statistical moments model. The exact values
provided by the theoretical model in Section 39.1.2 are added within parentheses.

Dose: D = 10 mg
Intercept of β-phase: B = 55.5 µg l⁻¹
Hybrid transfer constant of β-phase: β = 0.0119 min⁻¹
Last sample time: T = 120 min

Truncated area under the curve, by numerical integration:

AUC_{0,T} = 3.151 mg min l⁻¹

Extrapolated area under the curve:

AUC_{T,∞} = (B/β) e^{−βT} = 1.118 mg min l⁻¹

Area under the curve:

AUC = AUC_{0,T} + AUC_{T,∞} = 4.270 mg min l⁻¹ (3.968)

Truncated area under the moment curve, by numerical integration:

AUMC_{0,T} = 160.7 mg min² l⁻¹

Extrapolated area under the moment curve:

AUMC_{T,∞} = (B/β)(T + 1/β) e^{−βT} = 228.2 mg min² l⁻¹

Area under the moment curve:

AUMC = AUMC_{0,T} + AUMC_{T,∞} = 388.9 mg min² l⁻¹ (372.2)

Mean residence time:

MRT = AUMC/AUC = 91.1 min (93.8)

Mean absorption time:

MAT = MRT − MRT_iv = 7.1 min (7.2)

since in this case we find that:

MRT_iv = 1/k_pe = 1/β = 84.0 min (86.6)

Truncated area under the second moment curve, by numerical integration:

AUSC_{0,T} = 8472 mg min³ l⁻¹

Extrapolated area under the second moment curve:

AUSC_{T,∞} = (B/β)[1/β² + (T − MRT + 1/β)²] e^{−βT} = 22168 mg min³ l⁻¹

Area under the second moment curve:

AUSC = AUSC_{0,T} + AUSC_{T,∞} = 30640 mg min³ l⁻¹ (29948)

Variance of residence times:

VRT = AUSC/AUC = 7176 min² (7548)

Standard deviation of residence times:

SRT = VRT^{1/2} = 84.7 min (86.9)

Steady-state volume of distribution:

V_ss = D·MRT/AUC = 213 l (236)

Plasma clearance:

Cl_p = D/AUC = 2.34 l min⁻¹ (2.52)
It can readily be seen from this example that the contributions of the extrapolated
areas to the total areas are relatively more important for the higher order moments.
In this example, the contributions are 28, 61 and 72% for AUC, AUMC and
AUSC, respectively. Because of this effect, the applicability of the statistical
moment theory is somewhat limited by the precision with which plasma
concentrations can be observed. The method also requires a careful design of the
sampling process, such that both the peak and the downslope of the curve are
sufficiently covered.
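For readers who wish to check these figures, the short Python sketch below recomputes the statistical moments, assuming the data table and the graphically determined values of B and β given above; the variable names are ours and purely illustrative:

import numpy as np

# data of the example (Section 39.1.2), truncated at 120 min
t  = np.array([0, 2, 5, 10, 20, 30, 60, 90, 120.0])                  # min
cp = np.array([0, 12.3, 23.6, 35.5, 40.7, 37.1, 26.9, 19.8, 14.0])   # microgram/l

D, B, beta, T = 10.0, 55.5, 0.0119, 120.0   # dose (mg) and beta-phase parameters

# truncated parts by the trapezium rule, extrapolated parts from the formulas above
auc  = np.trapz(cp, t)     + (B / beta) * np.exp(-beta * T)
aumc = np.trapz(t * cp, t) + (B / beta) * (T + 1 / beta) * np.exp(-beta * T)
mrt  = aumc / auc
ausc = np.trapz((t - mrt) ** 2 * cp, t) \
       + (B / beta) * (1 / beta ** 2 + (T - mrt + 1 / beta) ** 2) * np.exp(-beta * T)

print(auc / 1000)               # ~4.27 mg min/l   (cp is in microgram/l)
print(aumc / 1000, mrt)         # ~389 mg min^2/l and ~91 min
print(ausc / auc)               # VRT ~ 7176 min^2
print(D * mrt / (auc / 1000))   # steady-state volume of distribution, ~213 l
print(D / (auc / 1000))         # plasma clearance, ~2.34 l/min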

39.3 Compartment models versus non-compartmental analysis

Compartmental analysis is the most widely used method of analysis for systems
that can be modeled by means of linear differential equations with constant
coefficients. The assumption of linearity can be tested in pharmacokinetic studies,
for example by comparing the plasma concentration curves obtained at different
dose levels. If the curves are found to be reasonably parallel, then the assumption
of linearity holds over the dose range that has been studied. The advantage of linear

Fig. 39.16. Paradigm for the fitting of sums of exponentials from a compartmental model (c) to ob-
served concentration data (o) as contrasted by the results of statistical moment analysis (s). (After
Thom [13].)

models is that they can be conveniently analysed by means of the Laplace


transform and methods that are derived from it, such as convolution and the
γ-method, which were discussed above.
As we have shown above, pharmacokinetic compartmental analysis requires
estimates of the transfer constants k_bp, k_pe and the volume of distribution V_p from
experimental plasma concentration curves. This involves the fitting of sums of
exponential functions, the number of which is generally equal to the number of
compartments in the system (not including the reservoir and the elimination pool).
The non-robustness of this problem is well-known. Relatively small errors may
have a large effect on the estimated intercept and slope values. The graphical
method of curve peeling on semilogarithmic plots is subject to error propagation
from the slower phases to the faster ones and should not be recommended in the
case when more than three compartments are to be fitted and when the hybrid rate
constants (slopes of the regression lines) are not clearly distinct. In that case,
non-linear regression can be combined with curve peeling, where the latter then
serves as a means for providing initial estimates of the slopes and intercepts.
The alternative to compartmental analysis is statistical moment analysis. We
have already indicated that the results of this approach strongly depend on the
accuracy of the measurement process, especially for the estimation of the higher
order moments. In view of the limitations of both methods, compartmental and
statistical, it is recommended that both approaches be applied in parallel, whenever
possible. Each method may contribute information that is not provided by the
other. The result of compartmental analysis may fit closely to the data using a
model that is inadequate [12]. Statistical moment theory may provide a model
which is closer to reality, although being less accurate. The latter point has been
made in paradigmatic form by Thom [13] and is represented in Fig. 39.16.

Fig. 39.17. Schematic illustration of Michaelis-Menten kinetics in the absence of an inhibitor (solid line) and in the presence of a competitive inhibitor (dashed line). (a) Plot of initial rate (or velocity) V against amount (or concentration) of substrate X. Note that the two curves tend to the same horizontal asymptote for large values of X. (b) Lineweaver-Burk linearized plot of 1/V against 1/X. Note that the two lines intersect at a common intercept on the vertical axis.

39.4 Linearization of non-linear models

A classical non-linear model of chemical kinetics is defined by the Michaelis-
Menten equation for rate-limited reactions, which has already been mentioned in
Section 39.1.1:

V = -\frac{dX}{dt} = \frac{V_{max}\,X}{K_m + X} \qquad (39.114)

where V represents the rate (or velocity) of the process and X represents the amount
(or concentration) of the substrate.
The parameters of this model are the maximal rate of the process V_max and the
saturation constant K_m. The latter can also be defined as the amount of substrate
which produces a rate V which is half the maximal rate V_max, as can be verified by
substituting K_m for X in the above relation.
Usually, one plots the initial rate V against the initial amount X, which produces
a hyperbolic curve, such as shown in Fig. 39.17a. The rate and amount at time 0 are
larger than those at any later time. Hence, the effect of experimental error and of
possible deviation from the proposed model are minimal when the initial values
are used. The Michaelis-Menten equation can be linearized by taking reciprocals
on both sides of eq. (39.114) (Section 8.2.13), which leads to the so-called
Lineweaver-Burk form:

\frac{1}{V} = \frac{K_m}{V_{max}}\,\frac{1}{X} + \frac{1}{V_{max}} \qquad (39.115)

slope = K_m/V_max
intercept = 1/V_max

The parameters V_max and K_m can be obtained from the intercept and slope of the
linear relationship between 1/V and 1/X as shown in Fig. 39.17b.
Least-squares linear regression (Chapter 8) of 1/V upon 1/X, although a simple
solution to the problem, suffers from a lack of robustness which is induced by the
reciprocal transformation of V and X. Serious heteroscedasticity (or lack of
homogeneity of variance) occurs in the differences between observed and
computed values of 1/V. Indeed, the variances of these residuals tend to increase
with large values of 1/X. (It should be noted that the variance of 1/V is approximately
equal to the variance of V divided by the fourth power of V, where V
becomes vanishingly small when X tends to zero.) This could be remedied by
means of weighted linear regression, in which each point is assigned a weight
which is inversely proportional to the corresponding variance. (Heteroscedasticity
and weighted regression have been explained in Section 8.2.3.) Unfortunately, the
distribution of 1/V at the various levels of 1/X is usually unknown. It has been
recommended, therefore, to use distribution-free methods of linear regression
[14]. A robust procedure is to compute the intercept and slope of the linear
relationship by means of the single median regression method (Section 12.1.5.1).
Another linearized form of the Michaelis-Menten equation is defined by:

\frac{X}{V} = \frac{1}{V_{max}}\,X + \frac{K_m}{V_{max}} \qquad (39.116)

slope = 1/V_max
intercept = K_m/V_max
This variant can be derived from the Lineweaver-Burk form in eq. (39.115) by
multiplying both sides by X. From a statistical point of view, it does not seem to
have an advantage over the Lineweaver-Burk form [14]. The latter variant, how-
ever, can be more easily extended to more complex systems of substrate-enzyme
reactions, as will be shown below.
In the case of competitive inhibition, the substrate is displaced by a substance
which has greater affinity for the enzyme (or receptor protein) than its natural
substrate. For example, a competitive inhibitor (or antagonist) will try to occupy
the binding sites such that the enzyme is prevented from exerting its normal
activity on the substrate. It is assumed here that the binding between inhibitor and

enzyme is reversible. The inhibiting activity depends on the amount (or concentration)
Y of the competing substance and on the inhibition constant K_i of the
inhibitor-enzyme complex. The potency of a competitive inhibitor is inversely
proportional to its K_i value. The relationship between the rate of the enzymatic
reaction V and the amount of substrate X can be expressed in a linearized form:

\frac{1}{V} = \left(1 + \frac{Y}{K_i}\right)\frac{K_m}{V_{max}}\,\frac{1}{X} + \frac{1}{V_{max}} \qquad (39.117)

slope = (1 + Y/K_i) K_m/V_max
intercept = 1/V_max
Note the close analogy with the Lineweaver-Burk form of the simple Michaelis-
Menten equation. In a diagram representing 1/V against 1/X one obtains a line
which has the same intercept as in the simple case. The slope, however, is larger by
a factor (1 + Y/K_i) as shown in Fig. 39.17b. Usually, one first determines V_max and
K_m in the absence of a competitive inhibitor (Y = 0), as described above. Subsequently,
one obtains K_i from a new set of experiments in which the initial rate V
is determined for various levels of X in the presence of a fixed amount of inhibitor
Y. The slope of the new line can be obtained by means of robust regression.
The cases of non-competitive inhibition and even more complex non-linear
reaction kinetics will not be discussed further here.

Example
We consider the initial velocities V, observed with different substrate
concentrations X in a rate-limited enzymatic reaction [15]:

X (mM)        1.25    1.67    2.50    5.00    10.0    20.0
V (mg/min)    0.101   0.130   0.156   0.250   0.303   0.345

Ordinary least squares regression of 1/V upon 1/X produces a slope of 9.32 and
an intercept of 2.36. From these we derive the parameters of the simple Michaelis-
Menten reaction (eq. (39.116)):

V_max = 0.424 mg/min
K_m = 3.95 mM
We now consider the case of a competitive inhibitor which has been added to
the above reaction at the fixed concentration of 40 mM [15]. The following initial
velocities of the competitively inhibited Michaelis-Menten process are observed
at the same substrate concentrations as above:

X (mM)        1.25    1.67    2.50    5.00    10.0    20.0
V (mg/min)    0.061   0.074   0.106   0.169   0.227   0.313

Ordinary least squares regression of 1/V upon 1/X yields a slope of 17.7 and an
intercept of 2.46. Using the previously derived values for V_max and K_m, and setting
Y equal to 40, we can derive the inhibition constant of the competitively inhibited
Michaelis-Menten reaction (eq. (39.117)):

K_i = 44.4 mM
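These numbers are easily reproduced. A small Python sketch of the two ordinary least squares fits follows; it uses numpy's polyfit, and the variable names are our own choices for illustration:

import numpy as np

X  = np.array([1.25, 1.67, 2.50, 5.00, 10.0, 20.0])        # substrate (mM)
V0 = np.array([0.101, 0.130, 0.156, 0.250, 0.303, 0.345])  # no inhibitor (mg/min)
Vi = np.array([0.061, 0.074, 0.106, 0.169, 0.227, 0.313])  # with 40 mM inhibitor
Y  = 40.0                                                  # inhibitor concentration (mM)

# Lineweaver-Burk plot: regress 1/V on 1/X
slope0, icpt0 = np.polyfit(1 / X, 1 / V0, 1)   # ~9.32 and ~2.36
Vmax = 1 / icpt0                               # ~0.424 mg/min
Km = slope0 / icpt0                            # ~3.95 mM

# competitive inhibition (eq. 39.117): slope = (1 + Y/Ki) Km/Vmax
slope1, icpt1 = np.polyfit(1 / X, 1 / Vi, 1)   # ~17.7 and ~2.46
Ki = Y / (slope1 / (Km / Vmax) - 1)            # ~44.4 mM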

Non-linear models, such as described by the Michaelis-Menten equation, can


sometimes be linearized by a suitable transformation of the variables. In that case
they are called intrinsically linear (Section 11.2.1) and are amenable to ordinary
linear regression. This way, the use of non-linear regression can be obviated. As
we have pointed out, the price for this convenience may have to be paid in the form
of a serious violation of the requirement for homoscedasticity, in which case one
must resort to non-parametric methods of regression (Section 12.1.5).

References
1. A. Rescigno and G. Segre, Drug and Tracer Kinetics. Blaisdell, Waltham, MA, 1966.
2. P.J. Lewi, J.J.P. Heykants, F.T.N. Allewijn, J.G.H. Dony and P.A.J. Janssen, On the distribu-
tion and metabolism of neuroleptic drugs. Part I: Pharmacokinetics of haloperidol. Drug. Res.,
20 (1970)943-948.
3. J. Gleick, Chaos: Making a New Science. Viking, New York, 1987.
4. T. Teorell, Kinetics of distribution of substances administered to the body. I. The extravascular
modes of administration. Arch. Int. Pharmacodyn., 57 (1937) 205-225. II. The intravascular
modes of administration. Arch. Int. Pharmacodyn., 57 (1937) 226-240.
5. F.F. Farris, R.L. Dedrick, P.V. Allen and J.C. Smith, Physiological model for the pharmaco-
kinetics of methyl mercury in the growing rat. Toxicol. Appl. Pharmacol., 119 (1993) 74-90.
6. P. Franklin, An Introduction to Fourier Methods and the Laplace Transformation. Dover, New
York, 1958.
7. P.J. Lewi, Pharmacokinetics and eigenvector decomposition. Arch. Int. Pharmacodyn., 215
(1975)283-300.
8. G.S. Klonicki, Using SAS/ETS Software for Analysis of Pharmacokinetic Data. SAS Institute
Inc., Cary, NC, 1993.
9. P.L. Williams, Structural identifiability of pharmacokinetic models — compartments and ex-
perimental design. J. Vet. Pharmacol. Therap., 13 (1990) 121-131.
10. P.R. Mayer and R.K. Brazell, Application of statistical moment theory to pharmacokinetics. J.
Clin. Pharmacology, 28 (1988) 481-483.
11. J. Powers, Statistical analysis of pharmacokinetic data. J. Vet. Pharmacol. Therap., 13 (1990)
113-120.
12. K. Zierler, A critique of compartmental analysis. Ann. Rev. Biophys. Bioeng., 10 (1981)
531-562.
13. R. Thom, Structural Stability and Morphogenesis. Benjamin-Cummings, Reading, MA, 1975.

14. I.A. Nimmo and G.L. Atkins, The statistical analysis of non-normal (real?) data. TIBS (Trends
in Biochemical Sciences), 4 (1979) 236-239.
15. R.J. Tallarida and R.B. Murray, Manual of Pharmacological Calculations. Springer, New
York, 1981, pp. 35-37.

Additional recommended reading:


F. Belpaire and M. Bogaert, Fundamentals of Pharmacokinetics, (in Dutch). University of Gent,
Heymans Institute, Gent, 1978.
M. Berman, The formulation and testing of models. Ann. N.Y. Acad. Sci., 108 (1963) 182-194.
A. Rescigno, Mathematical foundations of linear kinetics. In: Pharmacokinetics: Mathematical and
Statistical Approaches to Metabolism and Distribution of Chemicals and Drugs. (J. Eisenfeld
and M. Witten, Eds.), North-Holland, Amsterdam, 1988.
J.G. Wagner, Biopharmaceutics and Relevant Pharmacokinetics. Drug Intelligence Publ., Hamilton,
IL, 1971.

Chapter 40

Signal Processing

40.1 Signal domains

When measuring a signal, one records the magnitude of the output or the
response of a measurement device as a function of an independent variable. For
instance, in chromatography the signal of a Flame Ionization Detector (FID) is
measured as a function of time. In spectrometry the signal of a photomultiplier or
diode array is measured as a function of the wavelength. In a potentiometric
titration the potential of an electrode is measured as a function of the added volume
of titrant.
Time, wavelength and added volume in the above-mentioned examples are the
domains of the measurement. A chromatogram is measured in the time domain,
whereas a spectrum is measured in the wavelength domain. Usually, signals in
these domains are directly translated into chemical information. In spectrometry
for example peak positions are calculated in the wavelength domain and in
chromatography they are calculated in the time domain. Signals in these domains
are directly interpretable in terms of the identity or amount of chemical substances
in the sample.
In some cases one may consider a measurement domain which is complementary
to one of the domains mentioned before. By 'complementary' we mean that each
value in the complementary domain contains information on all variables in the
other domain. In this terminology, the wavelength and frequency of the radiation
emitted by a light source in spectrometry are not complementary domains. One
frequency corresponds to a single wavelength (by the relation v = \/X) and not a
range of wavelengths. On the contrary, each individual point of an interferogram
measured with a Fourier Transform Infra Red (FTIR) spectrometer at a certain
displacement 5 of one of the mirrors in the Michelson interferometer contains
information on all wavelengths in the IR spectral domain, i.e. on the whole IR
spectrum. One can switch between two complementary domains by a math-
ematical operation. The Fourier transform (FT), which is the main subject of this
chapter, is such a mathematical operation. We illustrate this with the way an FTIR
spectrophotometer works [1]. In FTIR spectroscopy, the radiation from the light
source is split into two beams by a beam-splitter. Behind the splitter, the two beams

Fig. 40.1. Mirror assembly of a FTIR Michelson interferometer (source, beam splitter, fixed mirror, movable mirror and detector).

are recombined by a mirror before reaching the detector (Fig. 40.1). If the path
lengths of both beams are equal, the intensity of the light reaching the detector is
equal to the intensity of the light source before splitting. However, when the path
lengths of both beams are unequal the measured intensity is lower, because some
frequencies of the light are out-of-phase. This is shown in Fig. 40.2 for light
consisting of a single frequency. For a path length difference δ equal to zero, the
intensity is maximal, for δ = λ/2 the intensity is zero and for δ = λ/5 the intensity is
in between. By slowly moving a mirror, the path length difference between the two
beams is changed as a function of the position of the mirror. As a result, the light

Fig. 40.2. Interference of two beams with the same frequency (wavelength). The path-length difference between beams (a) and (b) is zero, between (a) and (c) λ/2, between (a) and (d) λ/5; (a+b), (a+c) and (a+d) are the amplitudes after recombination of the beams.

intensity reaching the detector varies according to a pattern which depends on the
path length difference and absorbance of a sample placed in one of the beams. At
each path length difference or displacement δ, frequencies coinciding with δ = (2k
+ 1)λ/2 are out-of-phase, others for which δ = kλ are in phase and others for which
kλ < δ < (2k + 1)λ/2 are partly out-of-phase. Frequencies of the light source which
are out-of-phase interfere destructively and frequencies which are in-phase interfere
constructively. At a given displacement specific frequencies are constructively
combined and others are destructively combined. The intensity measured at a
given displacement of the mirror is the sum of all destructively and constructively
combined frequencies. Thus the signal measured at each position of the mirror,
contains information on all wavelengths (or wavenumbers) emitted by the light
source. This is fundamentally different from ordinary IR spectrometry where each
measured signal (transmittance) corresponds to a single wavelength. When the
light intensity is measured as a function of the mirror displacement an interfero-
gram is obtained. Each individual point of the interferogram contains information
about the entire spectrum (thus all wavelengths or wavenumbers emitted by the
light source). It can be shown [1] that the interferogram, which is the amplitude
I(δ) as a function of the displacement δ, is the Fourier transform of the spectrum in
the wavelength domain. It represents the spectrum in the frequency domain. In
order to obtain an interpretable spectrum (the transmittance as a function of
wavenumber or wavelength) the interferogram needs to be back transformed by a
Fourier transform of I(δ).
It may thus be necessary to calculate the Fourier transform of the measured
signal to return to the domain of interpretation, here wavelength or wavenumber.
In FTIR the signal is measured in the displacement domain δ and transformed to
the wavelength or wavenumber domain by a Fourier transform. Because the
wavelength domain is the Fourier transform of the displacement domain, and vice
versa, we say that the spectrum is measured in the Fourier domain.
Instead of considering a spectrum as a signal which is measured as a function of
wavelength, in this chapter we consider it measured as a function of time. The
frequency scale in the Fourier domain is given in cycles (ν) per second (Hz) or
radians (ω) per second (s⁻¹). These units are related by 1 Hz = 2π s⁻¹.

40.2 Types of signal processing

Transforms are important in signal processing. An important objective of signal


processing is to improve the signal-to-noise ratio of a signal. This can be done in
the time domain and in the frequency domain. Signals are composed of a deter-
ministic part, which carries the chemical information and a stochastic or random
part which is caused by deficiencies of the instrumentation, e.g. shot noise

generated by a photomultiplier and noise generated by a flame ionization detector


in chromatography. The presence of such noise hampers the interpretation of the
deterministic part of the signal. Therefore, techniques are required to improve the
signal-to-noise ratio. This is called signal enhancement. One way to achieve this is
by improving the signal before digitization by electronic devices which process the
analog signal during the measurement. The discussion of these devices is beyond
the scope of this book. Nowadays, signals are digitally stored in computer memory
and processed afterwards. A simple way to improve the signal-to-noise ratio is by
averaging the signal over a number of data points, e.g. five consecutive data points
are averaged and the central point is replaced by that mean value. This method and
others are discussed in more detail in Section 40.5.2. However, more complex
methods may be necessary to enhance the signal-to-noise ratio. Noise is associated
with fluctuations that are fast compared to those of the underlying signal, i.e. the
frequency of noise is higher than the frequency of the signal. Therefore, frequencies
associated with noise can be distinguished from the frequencies associated with
the signal. It is, therefore, possible to selectively eliminate noise in the frequency
domain. Fourier transforms and other processing techniques can be used to achieve
this. This is described in Section 40.5.3.
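The simple averaging mentioned above is easily written down. The following Python sketch is a minimal illustration only (the window width of five points and the function name are our own choices), and is not a substitute for the methods of Section 40.5.2:

import numpy as np

def moving_average(y, width=5):
    """Replace each point by the mean of a window of `width` consecutive points.

    A crude smoother; the end points where the window does not fit are left unchanged.
    """
    y = np.asarray(y, dtype=float)
    out = y.copy()
    half = width // 2
    for i in range(half, len(y) - half):
        out[i] = y[i - half:i + half + 1].mean()
    return out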
Signals obtained from measurement devices such as analytical instruments may
be distorted. In spectrometry, for instance, the true spectrum can only be obtained
by using monochromatic light. Spectrometers, however, emit light with a certain
band width. As a consequence, the observed peaks are broadened causing a lower
resolution between adjacent peaks. In some cases this broadening of the signal can
be removed by applying signal restoration or deconvolution (Section 40.6).
In this chapter we discuss techniques both for signal enhancement and signal
restoration. Techniques to model or to reconstruct the deterministic part of a digital
signal in the presence of noise are discussed in Chapter 41.

40.3 The Fourier transform

40.3.1 Time and frequency domain

Before discussing the Fourier transform, we will first look in some more detail
at the time and frequency domain. As we will see later on, a FT consists of the
decomposition of a signal in a series of sines and cosines. We consider first a signal
which varies with time according to a sum of two sine functions (Fig. 40.3). Each
sine function is characterized by its amplitude A and its period T, which corresponds
to the time required to run through one cycle (2π radians) of the sine function. In
this example the frequencies are 1 and 3 Hz. The frequency of a sine function can
be expressed in two ways: the radial frequency ω (radians per second), which is


Fig. 40.3. A composite sine function (solid line) which is the sum of two sine functions of 1 Hz (dotted
line) and 3 Hz (dashed line).

equal to 2π/T, and the number of cycles per second, ν, which is equal to 1/T Hz. For
all t which are a multiple of period T, the amplitude of the sine is zero. The function
describing a sum of two sines is given by:

f(t) = A₁ sin(ω₁t) + A₂ sin(ω₂t), or

f(t) = A₁ sin(2πt/T₁) + A₂ sin(2πt/T₂), or

f(t) = A₁ sin(2πν₁t) + A₂ sin(2πν₂t)

Although the sum signal shown in Fig. 40.3 consists of two frequencies ν₁ =
1/T₁ = 1 Hz and ν₂ = 1/T₂ = 3 Hz, this is hardly visible in the time domain. By
switching from the time domain to the frequency domain, i.e. by measuring the
amplitude of the sine (and cosine) functions present in the signal as a function of
the frequency, two distinct signals are found, a pulse at ω₁ = 2πν₁ = 2π and one at
ω₂ = 2πν₂ = 6π (Fig. 40.4). The height of the pulses corresponds to the amplitudes
A₁ and A₂ of the two sine functions.
The radial frequency ω of a periodic function is positive or negative, depending
on the direction of the rotation of the unit vector (see Fig. 40.5). ω is positive in the
counter-clockwise direction and negative in the clockwise direction. From Fig.
40.5a one can see that the amplitudes (A₋ω) of a sine at a negative frequency, −ω,
with an amplitude A, are opposite to the values A₊ω of a sine function at a positive
frequency, ω, i.e. A₋ω = A sin(−ωt) = −A sin(ωt) = −A₊ω. This is a property of an
antisymmetric function. A cosine function is a symmetric function because A₋ω =
A cos(−ωt) = A cos(ωt) = A₊ω (Fig. 40.5b). Thus, positive as well as negative

Fig. 40.4. The signal of Fig. 40.3 represented in the frequency domain.

Fig. 40.5. (a) Sine function with a radial frequency +ω and −ω. (b) Cosine function with a radial frequency +ω and −ω. (c) Representation of (a) in the frequency domain. (d) Representation of (b) in the frequency domain.

frequencies, ω, may be considered. A cosine function generates two pulses of
equal sign at ω and −ω in the frequency domain (Fig. 40.5c). On the other hand, a
sine function generates two pulses of opposite sign at these two frequencies.
Figures 40.3 and 40.4 are two representations of the same signal, each providing
sine function generates two pulses of opposite sign at these two frequencies.
Figures 40.3 and 40.4 are two representations of the same signal, each providing
specific information. In the time domain information is available on the overall

amplitude of the signal, whereas in the frequency domain explicit information is


obtained on the frequencies present in the signal. In general, each continuous
signal can be decomposed in a sum of an infinite number of sine and cosine
functions, each with a specific frequency ν and amplitude. In practice, a signal in
the time domain is measured during a finite time. When transforming a signal
which has been measured during a time t, only sines (and cosines) are considered
with a period T, which exactly fits a number of times in the measurement time. As a
result the Fourier transform of a finite continuous signal is a discrete function.

40.3.2 The Fourier transform of a continuous signal

Let us consider the antisymmetric block shaped signal f(t) shown in Fig. 40.6a,
with the values −1 and +1, measured during a finite time, T_m. In many textbooks the
measuring time is defined as 2T_m for practical reasons. In this chapter we also
adopt this convention. Usually the origin of the time axis is shifted to the centre of
the measurement time, which is then defined from −T_m to +T_m. As a first approximation
this block signal f(t) can be represented by a sine function with a period
equal to 2T_m, or a frequency ν equal to 1/(2T_m), i.e. f(t) ≈ 1·sin(2πt/2T_m). As can be
seen from Fig. 40.6b, this approximation is imperfect and is improved by adding a
second sine function. Because the period T of this sine function should exactly fit a
number of times in the measurement time 2T_m, only the frequencies 1/(2T_m),
2/(2T_m), 3/(2T_m), ..., n/(2T_m) are considered. In this particular case the fit cannot be
improved by adding a sine with a frequency equal to 2/(2T_m). However, including a
sine with a frequency equal to 3/(2T_m) and an amplitude 1/3 yields the next best
approximation, which is:

f(t) ≈ 1·sin(2πt/2T_m) + 1/3·sin(2πt·3/2T_m)

The approximation is still imperfect and can be improved by adding a third sine
function, now with a frequency 5/(2T_m) and an amplitude 1/5, i.e.

f(t) ≈ 1·sin(2πt/2T_m) + 1/3·sin(2πt·3/2T_m) + 1/5·sin(2πt·5/2T_m)

The process of adding sine functions can be continued, giving the following
expression:

f(t) ≈ sin(2πt/2T_m) + 1/3·sin(3(2π)t/2T_m) + 1/5·sin(5(2π)t/2T_m) + ...
       + 1/(2n + 1)·sin((2n + 1)2πt/2T_m)

for all integer n → ∞.
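The successive approximations above are easily generated numerically. The following Python sketch (the function name is ours, and T_m is set to 1 s purely for illustration) evaluates the partial sums with the unit amplitudes used in the text:

import numpy as np

def block_approx(t, Tm, n_terms):
    """Partial sum of odd harmonics: sin(2*pi*t/2Tm) + 1/3 sin(3*2*pi*t/2Tm) + ..."""
    f = np.zeros_like(t, dtype=float)
    for i in range(n_terms):
        k = 2 * i + 1                                  # 1, 3, 5, ...
        f += np.sin(k * 2 * np.pi * t / (2 * Tm)) / k
    return f

t = np.array([-0.5])                                   # a quarter period before the origin
for n in (1, 2, 3, 100):
    print(n, round(float(block_approx(t, 1.0, n)[0]), 3))
# with these unit amplitudes the partial sums tend towards -pi/4 ~ -0.785 at t = -0.5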



Fig. 40.6. (a) Block signal measured from −T_m to +T_m. (b) Signal (a) approximated by a sum of one, two and three sine functions. (c) Signal (a) represented in the frequency domain.

By plotting the amplitudes 1, 0, 1/3, 0, 1/5, ..., 1/(2n + 1) at the frequencies ν₁ =
1/(2T_m), ν₂ = 2/(2T_m), ..., ν_n = n/(2T_m), we obtain the Fourier transform of the
antisymmetric block signal (Fig. 40.6c). For 2T_m equal to 1 s the frequency ν₁ is 1
Hz, ν₂ = 2 Hz and so on. However, in order to indicate that these frequencies
belong to a sine function and not to a cosine function we also need to calculate the
amplitudes at the negative frequencies −ν₁, −ν₂, ..., −ν_n, which are −1, 0, −1/3, 0,
−1/5, ..., −1/(2n + 1). The interval (Δν) between two successive frequencies in the
plot is equal to (n + 1)/2T_m − n/2T_m = 1/2T_m. This leads to the important property
that both the lowest frequency and the frequency interval in the Fourier domain are
equal to the reciprocal of the measurement time. There is no upper limit to the
frequencies which are considered. Because the block signal in Fig. 40.6a is
antisymmetric, no cosine functions were needed to fit the block signal. However,
in general one needs to include cosine functions as well as sine functions. Therefore,
the Fourier transform consists of two parts, the amplitudes of the sine functions
and the amplitudes of the cosine functions, which are called Fourier coefficients.
Because sine and cosine functions have a zero mean, one should first subtract the
mean value from the signal before calculating the FT. In fact the mean value
corresponds to a cosine with a zero frequency. This term is called the DC term in
analogy to electronics where DC means Direct Current.
Summarizing, any signal f(t), measured from time t = −T_m to t = +T_m, can be
decomposed in a sum of sine and cosine functions as follows:

f(t) = \sum_{n=0}^{\infty}\left[A_n\cos\!\left(\frac{n2\pi t}{2T_m}\right) + B_n\sin\!\left(\frac{n2\pi t}{2T_m}\right)\right]
     = \sum_{n=0}^{\infty}\left[A_n\cos\!\left(\frac{n\pi t}{T_m}\right) + B_n\sin\!\left(\frac{n\pi t}{T_m}\right)\right]

Because 2\pi n/(2T_m) = n\omega = \omega_n

f(t) = \sum_{n=0}^{+\infty}\left[A_n\cos(\omega_n t) + B_n\sin(\omega_n t)\right]
     = \sum_{n=-\infty}^{0}\left[A_n\cos(\omega_n t) + B_n\sin(\omega_n t)\right]

or

f(t) = \frac{1}{2}\sum_{n=-\infty}^{+\infty}\left[A_n\cos(\omega_n t) + B_n\sin(\omega_n t)\right] \qquad (40.1)

The Fourier coefficient at zero frequency (n = 0) is equal to (A₀/2)cos(0·2πt/2T_m) =
(A₀/2)cos(0) = A₀/2. Equation (40.1) is the expression for what is called the
backward or inverse Fourier transform, because it reconstructs the signal f(t) from its
Fourier coefficients A_n and B_n (n = −∞ to ∞). In principle, the values of A_n and B_n
can be calculated by fitting eq. (40.1) to the signal by a least squares regression.
However, the coefficients A_n and B_n are calculated directly [2] by solving the
following two equations, which represent the forward Fourier transform:

A_n = \frac{1}{2T_m}\int_{-T_m}^{+T_m} f(t)\cos\!\left(\frac{n\pi t}{T_m}\right) dt \qquad (n = -\infty, \ldots, \infty)
\qquad (40.2)
B_n = \frac{1}{2T_m}\int_{-T_m}^{+T_m} f(t)\sin\!\left(\frac{n\pi t}{T_m}\right) dt \qquad (n = -\infty, \ldots, \infty)

A relationship, known as Euler's formula, exists between a complex number
[x + jy] (x is the real part, y is the imaginary part of the complex number (j = √−1))
and a sine and cosine function. Many authors and textbooks prefer the complex
number notation for its compactness and convenience. By substituting the Euler
equations cos(t) = (e^{jt} + e^{-jt})/2 and sin(t) = (e^{jt} − e^{-jt})/2j in eq. (40.1), a compact
complex number notation for the Fourier transform is obtained as follows:

f(t) = \frac{1}{2}\sum_{n=-\infty}^{+\infty}\left[A_n\,\frac{e^{jn\pi t/T_m} + e^{-jn\pi t/T_m}}{2}
       + B_n\,\frac{e^{jn\pi t/T_m} - e^{-jn\pi t/T_m}}{2j}\right]

Because j² = −1, A_n = A_{−n} and B_n = −B_{−n} (in analogy to A₊ω = A₋ω and B₊ω = −B₋ω):

f(t) = \frac{1}{2}\sum_{n=-\infty}^{+\infty}(A_n - jB_n)\,e^{jn\pi t/T_m} \qquad (40.3)

The amplitudes (A_n) of the cosines, therefore, represent the real part of the FT.
They are the real Fourier coefficients. The amplitudes (B_n) of the sines are the
imaginary part of the FT, which are the imaginary Fourier coefficients. Recalling
the fact that the antisymmetric block signal, given in Fig. 40.6a, could be described
with a sum of sine functions, means that its FT only consists of imaginary Fourier
coefficients. A symmetric signal only contains real Fourier coefficients. As a
consequence, setting the imaginary coefficients equal to zero symmetrizes the
signal obtained after inverse transform.
Usually, the Fourier coefficients A_n and B_n in eq. (40.2) are combined in a single
complex number [A_n − jB_n]:

A_n - jB_n = \frac{1}{2T_m}\int_{-T_m}^{+T_m} f(t)\left[\cos\!\left(\frac{n\pi t}{T_m}\right) - j\sin\!\left(\frac{n\pi t}{T_m}\right)\right] dt
\qquad (40.4)
           = \frac{1}{2T_m}\int_{-T_m}^{+T_m} f(t)\,e^{-jn\pi t/T_m}\,dt

Equations (40.3) and (40.4) are called the Fourier transform pair. Equation
(40.3) represents the transform from the frequency domain back to the time
domain, and eq. (40.4) is the forward transform from the time domain to the
frequency domain. A closer look at eqs. (40.3) and (40.4) reveals that the forward
and backward Fourier transforms are equivalent, except for the sign in the exponent.
The backward transform is a summation because the frequency domain is discrete
for finite measurement times. However, for infinite measurement times this summation
becomes an integral.
Summarizing, two complementary representations of a signal have been derived:
f(t) in the time domain and [A_n − jB_n] in the frequency domain. The imaginary
Fourier coefficients, B_n, represent the frequencies of the sine functions and the real
Fourier coefficients, A_n, represent the frequencies of the cosine functions. In the
remainder of this chapter we use the shorthand notation F(n) for [A_n − jB_n].
Many signals, such as a Gaussian peak, have a symmetric clock-shaped form
about their maximum. If the origin of the time axis is chosen in the centre of
symmetry the FT contains only real Fourier coefficients. If, however, the origin of
the measurement axis is put at the start of the peak, the function loses its
symmetry about the origin, resulting in a FT with real and imaginary coefficients.
This indicates that the location of the origin of the data influences the Fourier
transform.
The Fourier coefficients can be combined in a so-called power spectrum which
is defined as P_n = √(A_n² + B_n²). One should realize that because the contributions
of A_n and B_n are implicitly included in P_n, one cannot back-transform the power
spectrum to the time domain. However, the power spectrum gives the total
amplitude of the frequencies present in the signal.
The Fourier transform may be considered as a special case of the general data
transform

F(\nu) = \int f(t)\,K(\nu, t)\,dt

K(ν,t) is called the transform kernel. For the Fourier transform the kernel is the
complex exponential e^{−jνt}. Other transforms (see Section 40.8) are the Hadamard,
wavelet and the Laplace transforms [4].

40.3.3 Derivation of the Fourier transform of a sine

In this section we derive the Fourier transform of a sine function f(t) = sin(ωt) by
solving eq. (40.4). Because cos(x) = (e^{jx} + e^{-jx})/2 and sin(x) = (e^{jx} − e^{-jx})/2j, we can
express sin(ωt) in the complex number notation: sin(ωt) = (e^{jωt} − e^{-jωt})/2j.
The Fourier transform of sin(ωt) is:

A_n - jB_n = \frac{1}{2T}\int_{-T}^{+T} \sin(\omega t)\,e^{-jn\pi t/T}\,dt
\qquad (40.5)
           = \frac{1}{2T}\int_{-T}^{+T} \frac{e^{j\omega t} - e^{-j\omega t}}{2j}\,e^{-jn\pi t/T}\,dt

We work out the first term in eq. (40.5):

\frac{1}{2T}\int_{-T}^{+T} \frac{e^{j(\omega - n\pi/T)t}}{2j}\,dt

Because \int e^{x}\,dx = e^{x} and \int e^{-x}\,dx = -e^{-x}, this term is equal to:

\frac{1}{2T}\,\frac{1}{2j}\left[\frac{e^{j(\omega - n\pi/T)t}}{j(\omega - n\pi/T)}\right]_{-T}^{+T}
 = -\frac{j}{2}\,\frac{\sin(\omega T - n\pi)}{\omega T - n\pi}

In the same way we find that the second term is equal to:

\frac{j}{2}\,\frac{\sin(\omega T + n\pi)}{\omega T + n\pi}

This gives the following expression for the Fourier transform of a sine function:

A_n - jB_n = -\frac{j}{2}\left[\frac{\sin(\omega T - n\pi)}{\omega T - n\pi} - \frac{\sin(\omega T + n\pi)}{\omega T + n\pi}\right] \qquad (40.6)

Only measurement times which are infinite or which contain a whole number of
periods of ω are allowed (e.g. ω = 2πm/2T). Consider a finite measurement time
2T equal to 2πm/ω. For ω_n = 2πn/2T eq. (40.6) becomes:

A_n - jB_n = -\frac{j}{2}\left[\frac{\sin((m - n)\pi)}{(m - n)\pi} - \frac{\sin((m + n)\pi)}{(m + n)\pi}\right]

All sin((m − n)π) are zero for all integer (m − n), except for n = m and for n = −m, for
which respectively the first and the second term of the above equation are equal to
1. As a result the FT is:

A_m - jB_m = -\frac{j}{2}(1 - 0) = 0 - \frac{j}{2}

and

A_{-m} - jB_{-m} = -\frac{j}{2}(0 - 1) = 0 + \frac{j}{2}

The frequency corresponding to n = m is ω_m = 2πm/2T = ω and to n = −m is ω_{−m} =
−2πm/2T = −ω. The FT is thus imaginary, which is in agreement with the fact that a
sine is an antisymmetric function. The factor 1/2 is introduced because positive
and negative frequencies are considered. Namely, f(t) = sin(ωt) = 1/2 sin(ωt)
− 1/2 sin(−ωt).
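This result is easily verified numerically. The sketch below (in Python, with the arbitrary choices m = 3 and T = 1 s, which are ours) approximates the integral in eq. (40.4) by the trapezium rule:

import numpy as np

m, T = 3, 1.0                       # omega = 2*pi*m/2T fits exactly in [-T, T]
omega = 2 * np.pi * m / (2 * T)
t = np.linspace(-T, T, 20001)

def coeff(n):
    # numerical approximation of (1/2T) * integral of sin(omega t) exp(-j n pi t / T)
    return np.trapz(np.sin(omega * t) * np.exp(-1j * n * np.pi * t / T), t) / (2 * T)

print(np.round(coeff(m), 3))        # ~ -0.5j  at  +omega
print(np.round(coeff(-m), 3))       # ~ +0.5j  at  -omega
print(np.round(coeff(m + 1), 3))    # ~ 0      at other frequencies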

40.3.4 The discrete Fourier transformation

Modern signal processing in analytical chemistry is usually performed by
computer. Therefore, signals are digitized by taking uniformly spaced samples
from the continuous signal, which is measured over a finite time.
Assume a signal f(t) which is measured during a time 2T_m, starting at a time t₀
and ending at a time (t₀ + 2T_m). In accordance with the notation used when
discussing continuous signals, we shift the origin to the centre of the measurement
time and consider the signal as being measured from a time −T_m to +T_m. For the
compactness of the expressions we also require that the origin of the data coincides
with one of the data points. Therefore, the number of data should be an odd number
2N + 1. The sampling interval Δt is then defined by Δt = 2T_m/2N. As a result, the
digitizing process can be represented as: f(t) ⇒ f(kΔt) with k = −N to +N. The value
k = −N corresponds to −T_m and the value k = N to T_m. This digitization procedure is
schematically shown in Figs. 40.7a and 40.7b. The new variable k varies from −N
to +N.

Fig. 40.7. (a) A continuous signal f(t) measured from t₀ = 0 to t₀ + 2T_m = 2 seconds. (b) Same signal after digitization, f(kΔt) as a function of k.

The expressions for the forward and backward Fourier transforms of a data
array of 2N + 1 data points with the origin in the centre point are [3]:
Forward:

F(n) = \frac{1}{2N+1}\sum_{k=-N}^{k=+N} f(k\Delta t)\,e^{-jn2\pi k/(2N+1)} \qquad (40.7)

for n = −n_max to n_max; n_max corresponds to the maximum frequency which is in the
data. The value of n_max is derived in the next section.
Backward:

f(k\Delta t) = \sum_{n=-N}^{n=+N} F(n)\,e^{jn2\pi k/(2N+1)} \qquad (40.8)

for k = −N, ..., +N.
The frequency associated with F(n) is ν_n. This frequency is equal to n
times the basis frequency, which is equal to 1/(2T_m) (the frequency of a sine or
cosine whose period exactly fits the measurement time). Thus ν_n = n/(2T_m) = n/(2NΔt).
It should be noted that in the literature one may find other conventions for the
normalization factor used in front of the integral and summation signs.
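A direct implementation of this transform pair is a useful reference point; in routine work one would use a fast Fourier transform routine instead. The Python sketch below follows eqs. (40.7) and (40.8) literally, with the samples stored in the order k = −N, ..., +N; the function names are our own:

import numpy as np

def forward_ft(f):
    """Eq. (40.7): f is sampled at k = -N ... +N (odd length, origin in the centre)."""
    f = np.asarray(f, dtype=float)
    N = (len(f) - 1) // 2
    k = np.arange(-N, N + 1)
    return np.array([np.sum(f * np.exp(-2j * np.pi * n * k / (2 * N + 1)))
                     for n in range(-N, N + 1)]) / (2 * N + 1)

def backward_ft(F):
    """Eq. (40.8): reconstructs f(k*dt) from the coefficients F(n), n = -N ... +N."""
    F = np.asarray(F)
    N = (len(F) - 1) // 2
    n = np.arange(-N, N + 1)
    return np.array([np.sum(F * np.exp(2j * np.pi * n * k / (2 * N + 1)))
                     for k in range(-N, N + 1)]).real   # imaginary parts vanish for real signals

With this normalization the backward transform recovers the original samples exactly, which can be checked on any short data array, for example the five points used in the next section.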

40.3.5 Frequency range and resolution

As mentioned before, the smallest observable frequency (ν_min) in a continuous
signal is the reciprocal of the measurement time (1/2T_m). Because only those
frequencies are considered which exactly fit in the measurement time, all frequencies
should be a multiple of ν_min, namely n/2T_m with n = −∞ to +∞. As a result
the Fourier transform of a continuous signal is discrete in the frequency domain,

with an interval (n + 1)/(2T_m) − n/(2T_m) = 1/(2T_m). This frequency interval is called
the resolution. Thus the measurement time defines the lowest observable frequency
and the resolution. Both are equal to 1/2T_m. In the case of a continuous
signal there is no upper limit for the observable frequencies. For an infinite
measurement time the Fourier transform also becomes a continuous signal because
1/2T_m → 0.
On the contrary, when transforming a discrete signal, the observable frequencies
are not unlimited. The reason lies in the fact that a sine or cosine function should be
sampled at least twice per period in order to be uniquely defined. This sampling
frequency is called the Nyquist frequency. This is illustrated in Fig. 40.8 showing a
signal measured in one second and digitized in 4 data points. Clearly, these 4 data
points can be fitted in two ways: with a sine with a period equal to 1 s or with a sine
with a period equal to 1/3 s. However, the Fourier transform always yields the
lowest frequencies with which the data can be fitted, in this case, the sine with a
period equal to 1 s. The Nyquist frequency is the upper limit (ν_max = 1/(2Δt)) of the
frequencies that can be observed. An increase of the maximally observable frequency
in the FT can only be achieved by increasing the sampling rate (smaller Δt
of the digitization process). From the values of ν_min and ν_max, one can derive that for
a signal digitized in (2N + 1) data points, N frequencies are observed, namely:

[(ν_max − ν_min)/resolution] + 1 = (1/2Δt − 1/2T_m)/(1/2T_m) + 1 = 2T_m/2Δt − 1 + 1 = N

The same N frequencies are observed in the negative frequency domain as well.
In summary, the Fourier transform of a continuous signal digitized in 2N + 1
data points returns N real Fourier coefficients, N imaginary Fourier coefficients
and the average signal, also called the DC term, i.e. in total 2N + 1 points. The
relationship between the scales in both domains is shown in Fig. 40.9.

Fig. 40.8. (a) Sine function sampled at the Nyquist frequency (2 points per period). (b) An under-sampled sine function.

Fig. 40.9. Relationship between measurement time (2T_m), digitization interval and the maximum and minimal observable frequencies in the Fourier domain.

TABLE 40.1
Signal measured at five time points with Δt = 0.5 s and 2T_m = 2 s

Original time scale    Time scale with origin        k       f(kΔt)
(seconds)              shifted to the centre

t₁ = 0                 t₁ = −1                       −2      f(−1) = 2
t₂ = 0.5               t₂ = −0.5                     −1      f(−0.5) = 3
t₃ = 1                 t₃ = 0                         0      f(0) = 4
t₄ = 1.5               t₄ = 0.5                       1      f(0.5) = 4
t₅ = 2                 t₅ = 1                         2      f(1) = 2

By way of illustration we calculate the FT of the discrete signal listed in Table
40.1. The origin of the time domain has been placed in the centre of the data.
The forward transform is calculated as follows (eq. (40.7)):

Mean value is: 3; N = 2; 2N + 1 = 5

ν_min = 1/2T_m = 0.5 Hz
Δν = 1/2T_m = 0.5 Hz
ν_max = 1/2Δt = 1/1 = 1 Hz

1. n = 0: ν₀ = 0 Hz

F(0) = 1/5 Σ_{k=−2}^{+2} f(kΔt) e⁰ = 1/5 (2 + 3 + 4 + 4 + 2) = 3

2. n = 1: ν₁ = 1/2T_m = 0.5 Hz

F(1) = 1/5 Σ_{k=−2}^{+2} f(kΔt) e^{−j2πk/5}
= 1/5 (2e^{j4π/5} + 3e^{j2π/5} + 4e⁰ + 4e^{−j2π/5} + 2e^{−j4π/5})
= 1/5 (2cos(4π/5) + 2j sin(4π/5) + 3cos(2π/5) + 3j sin(2π/5) + 4 + 4cos(2π/5)
  − 4j sin(2π/5) + 2cos(4π/5) − 2j sin(4π/5))
= 1/5 (−1.618 + 1.1756j + 0.9271 + 2.8532j + 4 + 1.2361 − 3.8042j − 1.618
  − 1.1756j)
= 1/5 (2.93 − 0.95j) = 0.58 − 0.19j

3. n = 2: ν₂ = ν₁ + Δν = 1 Hz = 1/2Δt = ν_max

F(2) = 1/5 Σ_{k=−2}^{+2} f(kΔt) e^{−j4πk/5}
= 1/5 (2e^{j8π/5} + 3e^{j4π/5} + 4e⁰ + 4e^{−j4π/5} + 2e^{−j8π/5})
= 1/5 (2cos(8π/5) + 2j sin(8π/5) + 3cos(4π/5) + 3j sin(4π/5) + 4 + 4cos(4π/5)
  − 4j sin(4π/5) + 2cos(8π/5) − 2j sin(8π/5))
= 1/5 (0.618 − 1.9021j − 2.4271 + 1.7634j + 4 − 3.2361 − 2.3511j + 0.618
  + 1.902j)
= 1/5 (−0.427 − 0.588j) = −0.085 − 0.12j

At n = 2 the maximally observable frequency is reached, and the calculation can be
stopped. The results are summarized in Fig. 40.10.

Fig. 40.10. Fourier transform of the data points listed in Table 40.1.
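The same coefficients are obtained with a standard FFT routine, provided the centred data array is first rotated so that the k = 0 point comes first. A short Python check of the example above (using numpy; the rotation step is our own bookkeeping choice):

import numpy as np

f = np.array([2.0, 3.0, 4.0, 4.0, 2.0])      # f(k*dt) for k = -2 ... +2 (Table 40.1)

# numpy's fft expects the first sample at k = 0, so rotate the centred array
F = np.fft.fft(np.roll(f, -2)) / 5           # divide by 2N + 1 = 5 as in eq. (40.7)
print(np.round(F[0], 3))   # 3, the mean (DC term)
print(np.round(F[1], 3))   # ~0.585 - 0.190j, i.e. F(1) above
print(np.round(F[2], 3))   # ~-0.085 - 0.118j, i.e. F(2) above
# F[3] and F[4] hold the coefficients at the negative frequencies -v2 and -v1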

40.3.6 Sampling

In the previous section we have seen that the highest observable frequency
present in a discrete signal depends on the sampling interval (Δt) and is equal to
1/2Δt, the Nyquist frequency. This has two important consequences. The minimally
required digitizing rate (sampling points per second) in order to retain all
information in a continuous signal is defined by its maximum frequency. Secondly,
the Fourier transform is disturbed if the continuous signal contains frequencies
higher than the Nyquist frequency. This disturbance is called aliasing or folding.
The principle of folding is illustrated in Fig. 40.11. In this figure two sine functions
A and B are shown which have been sampled at the same sampling rate of 16 Hz.
Signal A (8 Hz) is sampled with a rate which exactly fits the Nyquist frequency
(ν_max), namely 2 data points per period. Signal B (11 Hz) is under-sampled as it
requires a sampling rate of minimally 22 Hz. The frequency of signal B is 3 Hz (= δν)
higher than the maximally observable frequency. This does not mean that for
signal B no frequencies are observed in the frequency domain. Indeed, from Fig.
40.11 one can see that the data points of signal B can also be fitted with a 5 Hz sine,
which is 3 Hz (= δν) lower than ν_max (= 8 Hz). As a consequence, if a signal
contains a frequency which is δν higher than the Nyquist frequency, false frequen-

Fig. 40.11. Aliasing or folding. (a) Sine of 8 Hz sampled at 16 Hz (Nyquist frequency). (b) Sine of 11 Hz sampled at 16 Hz (under-sampled). (c) A sine of 5 Hz fitted through the data points of signal (b).

cies are observed at a frequency which is δν lower than the Nyquist frequency. One
should always be aware of the possible presence of 'false' frequencies by aliasing.
This can be easily checked by changing the sampling rate. True frequencies remain
unaffected whereas 'aliased' frequencies shift to other values. The Nyquist frequency
defines the minimally required sampling rate of analytical signals. As an
example we take a chromatographic peak with a Gaussian shape given by y(x) =
a exp(−x²/2s²) (a Gaussian function with a standard deviation equal to s located in the
origin, x = 0). The FT of this function is [5]:

y(ν) = a √(2πs) exp(−ν²/(2/s²))

From this equation it follows that the amplitudes of the frequencies present in a
Gaussian peak are normally distributed about a frequency equal to zero with a
standard deviation inversely proportional to the standard deviation s of the
Gaussian peak. As a consequence, the maximal frequency present in a Gaussian
peak is approximately equal to (0 + 3(1/s)). In order to be able to observe that
frequency, a sampling rate of at least 6/s is required, or 6 data points per standard
deviation of the original signal. As the width of the base of a Gaussian peak is
approximately 6 times the standard deviation, the sampling rate should be at least
36 points over the whole peak. In practice higher sampling rates are applied in
order to avoid aliasing of high noise frequencies and to allow signal processing

Fig. 40.12. (a) FT (real coefficients) of a Gaussian peak located in the origin of the measurements (256 data points). Solid line: w½ = 20; dashed line: w½ = 5 and corresponding maximal frequencies. (b) FT (real and imaginary coefficients) of the same peak shifted by 50 data points.

(see Section 40.5). Figure 40.12a gives the real Fourier coefficients of two
Gaussian peaks centered in the origin of the data and half-height widths
respectively equal to 20 and 5 data points. As one can read from the figure, the
maximum frequency in the narrow peak (dashed line in Fig. 40.12) is about 4 times
higher than the maximum frequency in the wider peak (solid line in Fig. 40.12).
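The folding of the 11 Hz sine of Fig. 40.11 onto an apparent 5 Hz sine can be checked directly. A small Python sketch, with the sampling parameters chosen as in that figure (the one-second record length is our own choice):

import numpy as np

fs = 16.0                                    # sampling rate (Hz); Nyquist frequency = 8 Hz
t = np.arange(0, 1, 1 / fs)                  # one second of samples

under_sampled = np.sin(2 * np.pi * 11 * t)   # 11 Hz, above the Nyquist frequency
alias = np.sin(2 * np.pi * 5 * t)            # 5 Hz = 16 - 11 Hz

# at the sample points the 11 Hz sine coincides with a sign-inverted 5 Hz sine
print(np.allclose(under_sampled, -alias, atol=1e-12))   # True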

40.3.7 Zero filling and resolution

In Section 40.3.5 we concluded that the resolution (Δν) in the frequency


spectrum is equal to the reciprocal of the measurement time. The longer the
measurement time in the time domain, the better the resolution is in the frequency
domain. The opposite is also true: the longer the measurement time in the frequency
domain (e.g. in FTIR or FT NMR), the better is the separation of the peaks in the
spectrum after the back-transform to the wavelength or chemical shift domain.

Fig. 40.13. Exponentially decaying pulse NMR signal.

In FTIR and FT NMR the amplitude of the measured signal is an exponentially


decaying function (Fig. 40.13). One could conclude that in this case continuing the
measurement makes little sense. Stopping early, however, shortens the measurement time and
would, therefore, limit the resolution of the spectrum after the back-transform. It
is, therefore, common practice to artificially extend the measurement time by
adding zeros behind the measured signal, and to consider these zeros as measure-
ments. This is called zero filling. The effect of zero filling is illustrated on the
inverse Fourier transform of an exponentially decaying FT NMR signal before
(Fig. 40.14a) and after zero filling (Fig. 40.14b). One should take into account that
if the last measured data point is not close to zero, zero filling introduces false
frequencies because of the introduction of a stepwise change of the signal. This is
avoided by extrapolating the signal to zero following an exponential function,
called apodization.
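The effect on the frequency axis is easily demonstrated. The Python sketch below (the decay constant, the 9 Hz oscillation and the fourfold zero padding are arbitrary choices of ours) pads a decaying signal with zeros before transforming it with numpy's FFT:

import numpy as np

dt = 0.01                                              # sampling interval (s)
t = np.arange(0, 2.0, dt)                              # 2 s of measurement
fid = np.exp(-t / 0.3) * np.cos(2 * np.pi * 9 * t)     # decaying 9 Hz oscillation

spec = np.abs(np.fft.rfft(fid))
spec_zf = np.abs(np.fft.rfft(np.concatenate([fid, np.zeros(3 * len(fid))])))

print(len(spec), len(spec_zf))   # 101 vs 401 points over the same 0-50 Hz span
# frequency spacing: 0.5 Hz before vs 0.125 Hz after zero filling
print(np.fft.rfftfreq(len(fid), dt)[1], np.fft.rfftfreq(4 * len(fid), dt)[1])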

40.3.8 Periodicity and symmetry

In Section 40.3.4 we have shown that the FT of a discrete signal consisting of 2N
+ 1 data points comprises N real and N imaginary Fourier coefficients (positive
frequencies) and the average value (zero frequency). We also indicated that N real
and N imaginary Fourier coefficients can be defined in the negative frequency
domain. In Section 40.3.1 we explained that the FT of signals which are symmetrical
about t = 0 in the time domain contain only real Fourier coefficients,

Fig. 40.14. Effect of zero filling on the back transform of the pulse NMR signal given in Fig. 40.13: (a) before zero filling, (b) after zero filling.

whereas signals which are antisymmetric about t = 0 in the time domain
contain only imaginary Fourier coefficients (sines). This property of symmetry
may be applied to obtain a transform which contains only real Fourier coefficients.
For example, when a spectrum encoded in 2N + 1 = 512 + 1 data points is
artificially mirrored into the negative wavelength domain, the spectrum is symmetrical
about the origin. The FT of this spectrum consists of 512 real and 512
imaginary Fourier coefficients. However, because the signal is symmetric all
imaginary coefficients are zero, reducing the FT to 512 real coefficients (plus one
for the mean), of which 256 coefficients are at negative frequencies and 256 coefficients
at positive frequencies.

40.3.9 Shift and phase

A shift or translation of f(t) by t0 results in a modulation of the Fourier
coefficients by exp(-jωt0). Without shift, f(t) is transformed into F(ω). After a shift
by t0, f(t - t0) is transformed into exp(-jωt0)F(ω), which results in a modulation
(by a cosine or sine wave) of the Fourier coefficients. The frequency of the
modulation depends on t0, the magnitude of the shift. The back transform has the
same property: F(n - n0) is back-transformed to f(t)exp(j2πn0t/2N). A shift by N
data points, therefore, results in f(kΔt)exp(j2πNk/2N) = f(kΔt)exp(jπk) = f(kΔt)
(-1)^k. This property is often applied to shift the origin of the Fourier domain to the
centre of its 2N + 1 data array when the software by default places its origin at the
first data point.
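A quick NumPy check of this last property (the array length and peak position are arbitrary): multiplying the signal by (-1)^k before the transform places the zero frequency in the centre of the coefficient array, exactly as rotating the transform by half the array length does:

import numpy as np

n = 64
x = np.exp(-0.5 * ((np.arange(n) - 20) / 3.0) ** 2)   # a peak somewhere in the array

shifted = np.fft.fft(x * (-1.0) ** np.arange(n))      # multiply by (-1)^k, then transform
centred = np.fft.fftshift(np.fft.fft(x))              # transform, then rotate by n/2 points

print(np.allclose(shifted, centred))                  # True: both place zero frequency mid-array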
In Section 40.3.6 we mentioned that the Fourier transform of a Gaussian peak
positioned in the origin is also a Gaussian function. In practice, however, peaks are
located at some distance from the origin. The real output, therefore, is a damped
cosine wave and the imaginary output is a damped sine wave (see Fig. 40.12b).
The frequency of the oscillation is proportional to the distance of the peak from the
origin. The functional form of the damping envelope is defined by the peak shape and is
a Gaussian for a Gaussian peak. Inversely, the frequency of the oscillations of the
Fourier coefficients contains information on the peak position.
The phase spectrum Φ(n) is defined as Φ(n) = arctan(A(n)/B(n)). One can prove
that for a symmetrical peak the ratio of the real and imaginary coefficients is
constant, which means that all cosine and sine functions are in phase. It is
important to note that the Fourier coefficients A(n) and B(n) can be regenerated
from the power spectrum P(n) using the phase information. Phase information can
be applied to distinguish frequencies corresponding to the signal from those of the
noise, because the phases of the noise frequencies oscillate randomly.

40.3.10 Distributivity and scaling

The Fourier transform is distributive over summation, which means that the
Fourier transform of the sum of two individual signals is equal to the sum of the Fourier
transforms of the two individual signals: F[f1(t) + f2(t)] = F[f1(t)] + F[f2(t)].
The enhancement of the signal-to-noise ratio (or filtering) in the Fourier domain
is based on that property. If one assumes that the noise n(t) is additive to the signal
s(t), the measured signal m(t) is equal to s(t) + n(t). Therefore, F[m(t)] = F[s(t)] +
F[n(t)], or

M(v) = S(v) + N(v)

Assuming that the Fourier transformed spectra S(v) and N(v) contribute at specific
frequencies, the true signal, s(t), can be recovered from M(v) after elimination of
N(v). This is called filtering (see further Section 40.5.3).
The Fourier transform is not distributive over multiplication:

F[f1(t) f2(t)] ≠ F[f1(t)] F[f2(t)]

It is also easy to show that for a scalar a: F[a f(t)] = a F[f(t)].

40.3.11 The fast Fourier transform

As explained before, the FT can be calculated by fitting the signal with all
allowed sine and cosine functions. This is a laborious operation as this requires the
calculation of two parameters (the amplitude of the sine and cosine function) for
each considered frequency. For a discrete signal of 1024 data points, this requires
the calculation of 1024 parameters by linear regression and the calculation of the
inverse of a 1024 by 1024 matrix.
The FT could also be calculated by directly solving eq. (40.7). For each
frequency (2N + 1 values) we have to add 2N + 1 terms (the number of data points),
each the result of a multiplication of a complex number with a real value.
The number of complex multiplications and additions is, therefore, proportional to
(2N + 1)^2. Even for fast computers this is a considerable task. Therefore, so-called
fast Fourier transform (FFT) algorithms have been developed, originally by
Cooley and Tukey [6], which are available in many software packages. The
number of operations in the FFT is proportional to (2N + 1)log2(2N + 1), permitting
considerable savings of calculation time. The calculation of a signal digitized over
1024 points now requires about 10^4 operations instead of 10^6, which is about 100 times
faster. A condition for applying the FFT algorithm is that the number of points is a
power of 2. The principle of the FFT algorithm can be found in many textbooks
(see additional recommended reading).
Because the FFT algorithm requires the number of data points to be a power of
2, it follows that the signal in the time domain has to be extrapolated (e.g. by zero
filling) or cut off to meet that requirement. This has consequences for the resolution
in the frequency domain as this virtually expands or shortens the measurement
time.
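Assuming a radix-2 FFT as described here, a signal of arbitrary length would first be zero filled (or cut) to a power of two; a small Python sketch (note that modern FFT routines, including numpy.fft, also accept lengths that are not powers of two):

import numpy as np

def next_power_of_two(m):
    # smallest power of 2 that is >= m
    return 1 << (m - 1).bit_length()

signal = np.random.randn(1000)                       # 1000 points, not a power of 2
n_fft = next_power_of_two(len(signal))               # 1024
padded = np.pad(signal, (0, n_fft - len(signal)))    # zero filling up to 1024 points

spectrum = np.fft.fft(padded)
print(len(signal), n_fft, len(spectrum))             # 1000 1024 1024
# the padding virtually lengthens the measurement time and so changes the frequency spacing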

40.4 Convolution

As a rule, a measurement is an imperfect representation of reality. Noise and
other sources of blur may degrade the signal. In the particular case of spectrometry a
major source of degradation is peak broadening caused by the limited bandwidth
of a monochromator. When a spectrophotometer is tuned at a wavelength λ0,
neighbouring wavelengths also reach the detector, each with a certain intensity.
The profile of these intensities as a function of the wavelength is called the slit
function, h(λ). An example of a slit function is given in Fig. 40.15. This slit
function is also called a convolution function. Under certain conditions, the shape
of the slit function is a triangle symmetrical about λ0. The width at half-height is
called the spectral band-width. When measuring a "true" absorbance peak with a
half-height width not very much larger than the spectral band-width, the observed

Fig. 40.15. (a) Slit function (point-spread function) h(λ) for a spectrometer tuned at 304 nm. (b) f(λ) is
the true absorbance spectrum of the sample.

peak shape is disturbed. The mechanism behind this disturbance is called con-
volution. Convolution also occurs when filtering a signal with an electronic filter
or with a digital filter, as explained in Section 40.5.2.
By way of illustration the spectrometry example is worked out. Two functions
are involved in the process, the signal f(λ) and the convolution function h(λ). Both
functions should be measured in the same domain and should be digitized with the
same interval and at the same λ-values. Let us further-
more assume that the spectrum f(λ) and the convolution function h(λ) have a simple
triangular shape but with a different half-height width.
Let us suppose that the true spectrum f(λ) (absorbance ×100) is known (Fig.
40.15b):

λ1 = 301 nm => f(301) = 0
λ2 = 302 nm => f(302) = 0
λ3 = 303 nm => f(303) = 0
λ4 = 304 nm => f(304) = 5
λ5 = 305 nm => f(305) = 10
λ6 = 306 nm => f(306) = 5
λ7 = 307 nm => f(307) = 0

For all other wavelengths the absorbance is equal to zero.



Let us also consider a convolution function h(λ), also called point-spread
function. This function represents the profile of the intensity of the light reaching
the detector when the spectrometer is tuned at a wavelength λ0. If we assume that the slit function has
the triangular form given in Fig. 40.15a and that the spectrometer has been tuned at
a wavelength equal to 304 nm, then for a particular slit width the radiation reaching
the detector is composed of the following contributions:
25% comes from λ = 304, or a relative intensity of 0.25
18.8% comes from λ = 303 and 305, a relative intensity of 0.188
12.5% comes from λ = 302 and 306, a relative intensity of 0.125
6.2% comes from λ = 301 and 307, a relative intensity of 0.062
The relative intensities sum up to 1. None of the other wavelengths reaches the
detector.
When tuning the spectrometer at another wavelength, the centre of the con-
volution function is moved to that wavelength. If we encode the convolution
function relative to the set point h(0), then we obtain the following discrete values
(normalized to a sum = 1):
h(0) = 0.25
h(-1) = h(+1) = 0.188
h(-2) = h(+2) = 0.125
h(-3) = h(+3) = 0.062
h(-4) = h(+4) = 0
All values from h(-4) to h(-∞), and from h(+4) to h(+∞), are zero.
Let us now calculate the signal g(304) which is measured when the spectrometer
is tuned at λ = 304 nm:

g(304) = 0.25 f(304) + 0.188 f(305) + 0.188 f(303) + 0.125 f(306) + 0.125 f(302) +
0.062 f(307) + 0.062 f(301) = h(0) f(304) + h(1) f(305) + h(-1) f(303) +
h(2) f(306) + h(-2) f(302) + h(3) f(307) + h(-3) f(301)

Because h(0) = h(304-304), h(1) = h(305-304), h(2) = h(306-304), h(-1) =
h(303-304) and so on, the general expression for g(x) can be written in the
following compact notation:

g(x) = Σ f(y) h(y - x)    (40.9)

where the sum runs over y, for all x and y for which f(y) and h(y - x) are defined.



Fig. 40.16. Measured absorbance spectrum for the system shown in Fig. 40.15.

A shorthand notation for eq. (40.9) is

g(x) = f(x) * h(x)    (40.10)

where * is the symbol for convolution.
Extension of the convolution to the wavelengths 301 to 307 nm yields the
measured spectrum g(x) shown in Fig. 40.16. The broadening of the signal is
clearly visible. One should note that signals measured in the frequency domain
may also be a convolution of two signals. For instance, the periodic exponentially
decaying signal shown in Fig. 40.13 is a convolution of a sine function with an
exponential function.
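The worked example above can be verified in a few lines of NumPy (np.convolve evaluates a sum of the same form as eq. (40.9); because h is symmetric the two expressions are identical):

import numpy as np

f = np.array([0, 0, 0, 5, 10, 5, 0], dtype=float)               # true spectrum at 301..307 nm
h = np.array([0.062, 0.125, 0.188, 0.25, 0.188, 0.125, 0.062])  # slit (point-spread) function

# np.convolve evaluates sum_y f(y) h(x - y); for the symmetric h this equals eq. (40.9)
g = np.convolve(f, h, mode="same")
for wavelength, value in zip(range(301, 308), g):
    print(wavelength, round(value, 3))
# g(304) = 0.25*5 + 0.188*10 + 0.125*5 = 3.755, in agreement with the calculation above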
An important aspect of convolution is its translation into the frequency domain
and vice versa. This translation is known as the convolution theorem [7], which
states that:
- Convolution in the time domain is equivalent to a multiplication in the fre-
quency domain:
g(x) = f(x) * h(x) <=> G(v) = F(v) H(v)
- Convolution in the frequency domain is equivalent to a multiplication in the
time domain:
G(v) = F(v) * H(v) <=> g(x) = f(x) h(x)
From the convolution theorem it follows that the convolution of the two
triangles in our example can also be calculated in the Fourier domain, according to
the following scheme:
(1) Calculate F(v), the FT of the signal f(t)
(2) Calculate H(v), the FT of the point-spread function h(t)
(3) Calculate G(v) = F(v)H(v): the real (Re) and imaginary (Im) transform
coefficients are multiplied according to the multiplication rule of two
complex numbers:

Fig. 40.17. Convolution in the time domain of f(t) with h(t) carried out as a multiplication in the Fou-
rier domain. (a) A triangular signal (w1/2 = 3 data points) and its FT. (b) A triangular slit function h(t)
(w1/2 = 5 data points) and its FT. (c) Multiplication of the FT of (a) with that of (b). (d) The inverse FT of
(c).

Re(G(v)) = Re(F(v))Re(H(v)) - Im(F(v))Im(H(v))

Im(G(v)) = -(Re(F(v))Im(H(v)) + Im(F(v))Re(H(v)))

(4) Back-transform G(v) to g(t)


These four steps are illustrated in Fig. 40.17, where two triangles (arrays of 32 data
points) are convoluted via the Fourier domain. Because one should multiply
Fourier coefficients at corresponding frequencies, the signal and the point-spread
function should be digitized with the same time interval. Special precautions are
needed to avoid numerical errors, the discussion of which is beyond the scope of
this text. However, one should know that when f(t) and h(t) are digitized into
sampled arrays of size A and B respectively, both f(t) and h(t) should be
extended with zeros to a size of at least A + B. If (A + B) is not a power of two, more
zeros should be appended in order to use the fast Fourier transform.
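Assuming NumPy's FFT routines (whose sign conventions may differ slightly from the coefficient definitions above), the four-step scheme, including the zero filling to at least A + B points, can be sketched as follows; the triangle values are arbitrary:

import numpy as np

f = np.array([0, 1, 2, 3, 2, 1, 0], dtype=float)       # triangular signal
h = np.array([1, 2, 3, 2, 1], dtype=float) / 9.0       # triangular point-spread function

n = len(f) + len(h)                                    # zero fill to at least A + B points
F = np.fft.rfft(f, n)                                  # step (1): FT of the signal
H = np.fft.rfft(h, n)                                  # step (2): FT of the point-spread function
G = F * H                                              # step (3): complex multiplication
g = np.fft.irfft(G, n)                                 # step (4): back-transform

print(np.allclose(g[:len(f) + len(h) - 1], np.convolve(f, h)))   # True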

40.5 Signal processing

40.5.1 Characterization of noise

As said before, there are two main applications of Fourier transforms: the
enhancement of signals and the restoration of the deterministic part of a signal.
Signal enhancement is an operation for the reduction of the noise leading to an
improved signal-to-noise ratio. By signal restoration deformations of the signal
introduced by imperfections in the measurement device are corrected. These two
operations can be executed in both domains, the time and frequency domain.
Ideally, any procedure for signal enhancement should be preceded by a charac-
terization of the noise and the deterministic part of the signal. Spectrum (a) in Fig.
40.18 is the power spectrum of "white noise" which contains all frequencies with
approximately the same power. Examples of white noise are shot noise in photo-
multiplier tubes and thermal noise occurring in resistors. In spectrum (b), the
power (and thus the magnitude of the Fourier coefficients) is inversely prop-
ortional to the frequency (amplitude ∝ 1/v). This type of noise is often called 1/f


Fig. 40.18. Noise characterisation in the frequency domain. The power spectrum |F(v)| of three types
of noise. (a) White noise. (b) Flicker or 1/f noise. (c) Interference noise.

Fig. 40.19. Baseline noise of a UV-Vis detector in HPLC.

noise (f is the frequency) and is caused by slow fluctuations of ambient conditions
(temperature and humidity), power supply, vibrations, quality of chemicals, etc.
This type of noise is very common in analytical equipment. An example is the
power spectrum (Fig. 40.20) of the noise of a UV-Vis detector in HPLC (Fig.
40.19). In some cases the power spectrum may have peaks at some specific
frequencies (see Fig. 40.18c). A very common source of this type of noise is the 50
Hz periodic interference of the power line. In reality most noise is a combination of
the noise types described. Spectral analysis is a useful tool in assessing the
frequency characteristics of these types of signal.
We first discuss signal enhancement in the time domain, which does not require
a transform to the frequency domain. It is noted that all discrete signals should be
sampled at uniform intervals.

40.5.2 Signal enhancement in the time domain

In many instances the quality of the signal has to be improved before the
chemical information can be derived from it. One of the possible improvements is
the reduction of the noise. In principle there are two options, the enhancement of
the analog signal by electronic devices (hardware), e.g. an electronic filter, and the


Fig. 40.20. The power spectrum of the baseline noise given in Fig. 40.19.

manipulation of the signal after digitization by computer, a so-called digital filter.
Analytical equipment usually contains hardware to obtain a satisfactory signal-
to-noise ratio. For example, in AAS the radiation of the light source is modulated
by a light chopper to remove the noise introduced by the flame. The frequency of
this chopper is locked into a lock-in amplifier, which passes only signals with a
frequency equal to the frequency of the chopper. As a result, noise with frequencies
other than the chopper frequency is eliminated, including the 1/f noise. An
important advantage of digital filters over analog filters is their greater flexibility.
When the original data points are stored in computer memory, the enhancement
operation can be repeated under different conditions without the need to remeasure
the signal. Finally, for the processing of a given data point, data points
measured later are also available, which is intrinsically impossible when processing an
analog signal. This advantage becomes clear in Section 40.5.2.3. Many instrument
manufacturers apply digital devices in their equipment, or supply software to be
operated from the PC for post-run data processing of the signal by the user, e.g. a
digital smoothing step. To ensure a correct application, the principles and limitations
of smoothing and filtering techniques are explained in the following sections.


Fig. 40.21. Averaging of 100 scans of a Gaussian peak.

40.5.2.1 Time averaging


Some instruments scan the measurement range very rapidly and with a great
stability. NMR is such a technique. In this case the signal-to-noise ratio can be
improved by repeating the scans (e.g. N times) and adding the corresponding data
points. As a result the magnitude of the deterministic part of the signal is multiplied
by N, and the standard deviation of the noise by a factor √N (see Chapter 3).
Consequently, the signal-to-noise ratio improves by a factor √N.
There are clear limitations to this technique. First of all, the repeated scans must
be sampled at exactly the same time values. Furthermore, one should be aware that
in order to obtain an improvement of the signal-to-noise ratio by a factor √2, a
doubling of the measurement time is required. The limiting factor then becomes
the stability of the signal. The effect of averaging 100 scans of a Gaussian peak
(standard deviation of the noise is 10% of the peak maximum) is demonstrated in
Fig. 40.21. The signal-to-noise ratio is improved without deformation of the
signal.
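A small simulation of this √N behaviour (peak shape, noise level and number of scans are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(-3, 3, 200)
peak = np.exp(-0.5 * t ** 2)                               # deterministic Gaussian signal

scans = peak + rng.normal(0.0, 0.1, size=(100, t.size))    # 100 scans with 10% noise
mean_scan = scans.mean(axis=0)                             # accumulate and average

noise_single = (scans[0] - peak).std()
noise_avg = (mean_scan - peak).std()
print(noise_single / noise_avg)                            # close to sqrt(100) = 10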

40.5.2.2 Smoothing by moving average


In the accumulation process explained in the previous section, data points
collected during several scans and measured at corresponding time values are
added. One could also consider accumulating the values of a number of data points
in a small segment or window within the same scan. This is the principle of smoothing,
which is explained in more detail below.

The simplest form of smoothing consists of using a moving average, which has
been introduced in Chapter 7 for quality control. A window with an odd number of
data points is defined. The values in that window are averaged, and the central
point is replaced by that value. Thereafter, the window is shifted one data point by
dropping the last one in the window and including the next (un-smoothed)
measurement. The averaging process is repeated until all data points have been
averaged. For the smoothing of a given data point, data points are used which were
measured after this data point, which is not possible for analog filters. The
expression for the moving average of point i is:

g(i) = [1/(2m + 1)] Σ f(i + j), with the sum running from j = -m to j = +m

where g(i) is the smoothed data point i, f(i + j) is the original data point i + j, and (2m + 1) is
the size of the smoothing window.
The process of moving averaging (and, as we will see later, the process of
smoothing in general) can be represented as a convolution. Consider, therefore,
the data points f(1), f(2), ..., f(n). The moving average of point f(5) using a
smoothing window of 5 data points is calculated as follows: the data points f(3) to
f(7) are multiplied by 1, all other data points by 0, and the results are added and
divided by 5:

g(5) = [0 + 0 + f(3) + f(4) + f(5) + f(6) + f(7) + 0 + ...]/5

If we consider the zeros and ones to form a function h(t), and if we position the
origin of that function, h(0), in the centre of the defined window, then
h(-2) = h(-1) = h(0) = h(+1) = h(+2) = 1 and h(m) = 0 for all other m, giving

g(5) = (1/5) Σ f(m) h(m - 5)

or, in general,

g(t) = Σ f(m) h(m - t)/NORM    (40.11)

for all m for which f(m) and h(m - t) are defined. NORM is a factor to keep the
integral of the signal constant.

Equation (40.11) corresponds exactly to the expression of a convolution (eq.
(40.9)) introduced earlier, demonstrating that the mechanism of smoothing indeed
is equivalent to that of convolution. Thus eq. (40.11) can be rewritten as:
g(t) = f(t) * h(t) (40.12)
As a consequence, a moving average in the time domain is a multiplication in
the Fourier domain, namely:
G(v) = F(v) H(v) (40.13)
where H(v) is the Fourier transform of the smoothing function. This operation is
called filtering and H(v) is the filter function. Filtering is further discussed in
Section 40.5.3. For the moment it is important to realize that smoothing in the time
domain has its complementary operation in the frequency domain and vice versa.
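A sketch of eq. (40.11) with a block-shaped h(t), written directly as a convolution (window size, peak shape and noise level are arbitrary):

import numpy as np

rng = np.random.default_rng(1)
t = np.arange(100)
signal = np.exp(-0.5 * ((t - 50) / 8.0) ** 2) + rng.normal(0, 0.05, t.size)

window = 5
h = np.ones(window)                            # block-shaped smoothing function
norm = h.sum()                                 # NORM in eq. (40.11)
smoothed = np.convolve(signal, h, mode="same") / norm

print(signal.std(), smoothed.std())            # the noisy fluctuations are reduced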
Besides the improvement of the signal-to-noise ratio, smoothing also introduces
two less desired effects, which are illustrated on a Gaussian peak. In Fig. 40.22a we
show the effect of the smoothing of this peak with increasingly larger smoothing

Fig. 40.22. Distortion (h/h0) of a Gaussian peak for various window sizes (indicated within parenthe-
ses). (a) Moving average. (b) Polynomial smoothing.

windows. As one can see, the peak becomes lower and broader, but remains
symmetrical. The peak remains symmetric because of the symmetry of the filter
about its central point. Analog filters are by definition asymmetric because they
can only include data points at the left side of the data point being smoothed (see
also Section 40.5.2.4 on exponential smoothing). The applicability of moving
averaging and smoothing in general depends on the degree of deformation assoc-
iated with a certain improvement of the signal-to-noise ratio. Figure 40.22a shows
the effect of a moving averaging applied on a Gaussian peak to which white
random noise is added, as a function of the window size. Clearly, for window sizes
larger than half the half-height peak width, the peak is broadened and the intensity
drops. As a result adjacent peaks may be less resolved. Peak areas, however,
remain unaffected. From Fig. 40.22a one can derive the maximally applicable
window size (number of data points) to avoid peak deformation, given the scan
speed and the digitisation rate. Suppose, for instance, that the half-height width of
the narrowest peak in an IR spectrum is about 10 cm~\ the digitization rate is 10 Hz
and the scan rate is 2 cm"^ per sec. Hence, the half-height width is digitized in 50
data points. The largest smoothing window, therefore, introducing minor disturb-
ances is 25 data points. If the noise is white the signal-to-noise ratio is improved by
a factor 5.
Another unwanted effect of smoothing is the alteration of the frequency charac-
teristics of the noise. This calls for caution. Because low frequencies present in the
noise are not removed, the improvement of the signal-to-noise ratio may be
limited. This is illustrated in Fig. 40.23 where one can see that after smoothing
with a 25 point window low-frequency noise is left. In Section 40.5.3 filtering


Fig. 40.23. Polynomial smoothing (noise = N(0,3%)): 5-point; 17-point; 25-point smoothing window
and the noise left after smoothing.

methods in the frequency domain are discussed, which are capable of removing
specific frequencies from the noise.

40.5.2.3 Polynomial smoothing


The convolution or smoothing function, h(t), used in moving averaging is a
simple block function. However, one could try and derive somewhat more complex
convolution functions giving a better signal-to-noise ratio with less deformation of
the underlying deterministic signal.
Let us consider a smoothing window with 5 data points. Polynomial smoothing
would then consist of fitting a polynomial model through the 5 data points by linear
regression. In this case a polynomial with a degree equal to zero (horizontal line)
up to 4 (no degrees of freedom left) can be chosen. In the latter case the model
exactly fits the data points (no residuals) and has no effect on the noise and signal.
The fit of a model with a degree equal to zero is equivalent to the moving average.
Clearly, one should try and find an optimal value for the degree of the polynomial,
given the size of the window, the shape of the deterministic signal and the
characteristics of the noise. Unfortunately no hard rules are available for that
purpose. Therefore, several polynomial models and window sizes should be tried
out.
The smoothing procedure consists of replacing the central data point in the
window by the value obtained from the model, and repeating the fit procedure by
shifting the window one data point until the whole signal has been scanned. For a
signal digitized over 1000 data points, 996 regressions over 5 points have to be
calculated. This would be very impractical and computing intensive. Savitzky and
Golay [8] derived convolutes h(t) for each combination of degree of the poly-
nomial and size of the window. The effect of a convolution of a signal with these
convolutes is the same as fitting the signal with the corresponding polynomial in a
moving window. For instance, for a quadratic model and a 5-point window the
convolutes are (see Table 40.2):

h(-2) = -3; h(-1) = 12; h(0) = 17; h(+1) = 12; h(+2) = -3; else h(t) = 0

In order to keep the average signal amplitude unaffected a scaling factor NORM
is introduced, which is the sum of all convolutes, here 35. The smoothing
procedure is now

g(t) = Σ f(m) h(m - t) / Σ h(m)

for all m for which f(m) is defined and h(m - t) ≠ 0.
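Using the quadratic 5-point convolutes quoted above, the procedure can be sketched as follows; scipy.signal.savgol_filter gives the same result away from the edges (the test signal and noise level are arbitrary):

import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(2)
t = np.arange(100)
y = np.exp(-0.5 * ((t - 50) / 6.0) ** 2) + rng.normal(0, 0.03, t.size)

h = np.array([-3, 12, 17, 12, -3], dtype=float)      # quadratic (and cubic) 5-point convolutes
smoothed = np.convolve(y, h, mode="same") / h.sum()  # NORM = 35

reference = savgol_filter(y, window_length=5, polyorder=2)
print(np.allclose(smoothed[2:-2], reference[2:-2]))  # True: identical away from the edges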


The effect of a 5-point, 17-point and 25-point quadratic smoothing of a
Gaussian peak with 0.3% noise is shown in Fig. 40.22b. Peaks are distorted as

TABLE 40.2
Convolutes for quadratic and cubic smoothing (adapted from Refs. [8,10])

Points 25 23 21 19 17 15 13 11 9 7 5

-12 -253
-11 -138 -42
-10 -33 -21 -171
-09 62 -2 -76 -136
-08 147 15 9 -51 -21
-07 222 30 84 24 -6 -78
-06 287 43 149 89 7 -13 -11
-05 343 54 204 144 18 42 0 -36
-04 387 63 249 189 27 87 9 9 -21
-03 422 70 284 224 34 122 16 44 14 -2
-02 447 75 309 249 39 147 21 69 39 3 -3
-01 462 78 324 264 42 162 24 84 54 6 12
00 467 79 329 269 43 167 25 89 59 7 17
01 462 78 324 264 42 162 24 84 54 6 12
02 447 75 309 249 39 147 21 69 39 3 -3
03 422 70 284 224 34 122 16 44 14 -2
04 387 63 249 189 27 87 9 9 -21
05 343 54 204 144 18 42 0 -36
06 287 43 149 89 7 -13 -11
07 222 30 84 24 -6 -78
08 147 15 9 -51 -21
09 62 -2 -76 -136
10 -33 -21 -171
11 -138 -42
12 -253
NORM 5175 805 3059 2261 323 1105 143 429 231 21 35

well, but to a lesser extent than for moving averaging. For a Gaussian peak the
window size of a quadratic polynomial smoothing should now be less than 1.5
times the half-height width compared to the value of 0.5 found for moving
averaging. The effect on the noise reduction and frequencies left in the noise is
comparable to the moving average filter. We refer to the work of Enke [9] for a
detailed discussion of peak deformation versus signal-to-noise improvement under


Fig. 40.24. Polynomial smoothing: window of 7 data points fitted with polynomials of degrees 0,1,2,
3 and 4.

different circumstances. Generally, polynomial smoothing is preferred over


moving averaging, because larger windows are allowed before the signal is
deformed. The convolutes h{t), adapted according to Steinier [10], are tabulated in
Table 40.2 for several degrees of the smoothing polynomial and window sizes.
Polynomial models of an even degree 2n and the next odd degree 2n + 1 have the same value for
the central point (Fig. 40.24). Therefore, the same convolutes and the same
smoothing results are found for a quadratic and cubic polynomial fit. In the same
way as for moving averaging the polynomial smoothing can be represented in the
frequency domain as a multiplication (eq. (40.13)). This aspect is further discussed
in Section 40.5.4.

40.5.2.4 Exponential smoothing


The principle of exponential averaging has been introduced in Chapters 7 and
20, and is given by the following equation:

x̂i = (1 - λ)xi + λx̂i-1    0 ≤ λ ≤ 1    (40.14)

which states that a data point i is smoothed by taking a weighted average of the
measurement at time i and the smoothed previous data point at time i - 1. By tuning
the weight λ, the smoothing can be varied from no smoothing at all (for λ = 0) to
obtaining a constant value x̂1 (for λ = 1), which remains unchanged regardless of
the value of new incoming measurements. The effect of the exponential smoothing
process is thus defined by a single parameter. Moreover, the smoothed data points
can easily be calculated by hand. This simplicity and versatility are the reasons for
the popularity of the filter. Exponential averaging is mainly applied to smooth time
series or stochastic processes in which one wants to detect a drift or a deviation
from the stationary state. The filter is less appropriate to smooth analytical signals

TABLE 40.3
Example of exponential smoothing according to x̂i = λx̂i-1 + (1 - λ)xi (λ = 0.2)

i     xi    x̂i = (1 - λ)xi + λx̂i-1
1     3     0.8(3) + 0.2(0) = 2.4
2     7     0.8(7) + 0.2(2.4) = 6.08
3     4     0.8(4) + 0.2(6.08) = 4.42
4     6     0.8(6) + 0.2(4.42) = 5.68
5     1     0.8(1) + 0.2(5.68) = 1.94
6     8     0.8(8) + 0.2(1.94) = 6.79
7     5     0.8(5) + 0.2(6.79) = 5.36
8     2     0.8(2) + 0.2(5.36) = 2.67
9     9     0.8(9) + 0.2(2.67) = 7.73
10    5     0.8(5) + 0.2(7.73) = 5.55

(e.g. spectra or chromatograms), as it introduces asymmetric deformations of the


signal. The algorithm and its simplicity are demonstrated in Table 40.3. The
measurements are given in the first column. In the second column the smoothed
data points are given for λ = 0.2.
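Eq. (40.14) translates into a short loop; with λ = 0.2 and the measurements of Table 40.3 it reproduces the second column of that table:

def exponential_smooth(x, lam, start=0.0):
    # eq. (40.14): smoothed_i = (1 - lam) * x_i + lam * smoothed_(i-1)
    smoothed, previous = [], start
    for value in x:
        previous = (1.0 - lam) * value + lam * previous
        smoothed.append(previous)
    return smoothed

data = [3, 7, 4, 6, 1, 8, 5, 2, 9, 5]
print([round(v, 2) for v in exponential_smooth(data, lam=0.2)])
# [2.4, 6.08, 4.42, 5.68, 1.94, 6.79, 5.36, 2.67, 7.73, 5.55]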
The exponential shape of the filter follows directly by elaborating eq. (40.14)
for a few consecutive data points (see Table 40.4). From this table we can see that a
smoothed data point at time i is the average of all data points measured before,
weighted with an exponentially decaying weight λ^d (λ < 1), with d the distance of
that data point from the measurement to be smoothed. Such shapes are also found
for electronic filters with a given time constant. The effect of exponential
smoothing is visualized in the plot of the xi and x̂i values (Fig. 40.25) listed in

TABLE 40.4
The exponential shape of the moving average filter (x̂i = (1 - λ)xi + λx̂i-1)

i    Measurement    Smoothed value
1    x1    x̂1 = (1-λ)x1
2    x2    x̂2 = (1-λ)x2 + λx̂1 = (1-λ)x2 + λ(1-λ)x1
3    x3    x̂3 = (1-λ)x3 + λx̂2 = (1-λ)x3 + λ(1-λ)x2 + λ²(1-λ)x1
4    x4    x̂4 = (1-λ)x4 + λx̂3 = (1-λ)x4 + λ(1-λ)x3 + λ²(1-λ)x2 + λ³(1-λ)x1
5    x5    x̂5 = (1-λ)x5 + λx̂4 = (1-λ)x5 + λ(1-λ)x4 + λ²(1-λ)x3 + λ³(1-λ)x2 + λ⁴(1-λ)x1


Fig. 40.25. Effect of exponential smoothing on the data points listed in Table 40.3 (solid line: original
data; dotted line: smoothed data).


Fig. 40.26. Effect of exponential smoothing (λ = 0.6) on a Gaussian peak (w1/2 = 6 data points) (solid
line: original data; bold line: smoothed data).

Table 40.3. As one can see, the filter introduces a slower response to stepwise
changes of the signal, as if it were measured with an instrument with a large
response time. Because fluctuations are smoothed, the standard deviation of the
signal is decreased, in this example from 2.58 to 1.95. A Gaussian peak is broadened
and becomes asymmetric by exponential smoothing (Fig. 40.26).

40.5.3 Signal enhancement in the frequency domain

Instead of smoothing the data directly in the domain in which they were
acquired, the signal-to-noise ratio can also be improved by transforming the signal
to the frequency domain and eliminating noise frequencies present in the measure-
ments, after which one returns to the original domain. For instance, the power
spectrum of the noise of a flame ionization detector (Fig. 40.27) reveals the
presence of two dominant frequencies, namely at 2 and at 10 Hz. By substituting
all Fourier coefficients at frequencies higher than 5 Hz by zero, all high frequency
noise is eliminated after back transforming to the time domain. This operation is
called filtering, and because in this particular case low frequencies are retained,
this filter is called a low-pass filter. Equally, one could define a high-pass filter by
setting all low-frequency values equal to zero.
Mathematically this operation can be described by the same equation (eq.
(40.13)) as derived for polynomial smoothing, namely:
G(v) = F(v)H(v)
where H(v) is the filter function, which is now defined in the frequency domain.
Often used filter functions are:
Low-pass filter: H(v) = 1 for all v < v0, else H(v) = 0
High-pass filter: H(v) = 1 for all v > v0, else H(v) = 0


Fig. 40.27. Power spectrum of the noise of a flame ionization detector.



v0 is called the cut-off frequency. H(v) is referred to in this context as the filter transfer
function.
Many other filter functions can be designed, e.g. an exponential or a trapezoidal
function, or a band-pass filter. As a rule exponential and trapezoidal filters perform
better than cut-off filters, because an abrupt truncation of the Fourier coefficients
may introduce artifacts, such as the annoying appearance of periodicities in the
signal. The problem of choosing filter shapes is discussed in more detail by Lam
and Isenhour [11], with references to a more thorough mathematical treatment of
the subject. The expression for a band-pass filter is: H(v) = 1 for vmin < v < vmax, else
H(v) = 0. This filter is particularly useful for removing periodic disturbances of the
signal.
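A sketch of such a low-pass filter (the cut-off index, peak width and noise level are arbitrary choices):

import numpy as np

rng = np.random.default_rng(3)
t = np.arange(256)
signal = np.exp(-0.5 * ((t - 128) / 10.0) ** 2)
measured = signal + rng.normal(0, 0.05, t.size)

spectrum = np.fft.rfft(measured)
cutoff = 20                                  # v0: keep only the first 20 frequency points
spectrum[cutoff:] = 0.0                      # H(v) = 1 for v < v0, else 0
filtered = np.fft.irfft(spectrum, n=t.size)

print((measured - signal).std(), (filtered - signal).std())
# the residual noise after filtering is clearly smaller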
The effect of a low-pass filter applied on a Gaussian peak is shown in Fig. 40.28
for two cut-off frequencies. The lower the cut-off frequency of the filter, the more
noise is removed. However, this increasingly affects the high frequencies present
in the signal itself, causing a deformation. On the other hand, the higher the cut-off
frequency the more high-frequency noise is left in the signal. Thus the choice of
the cut-off frequency is often a compromise between the noise one wants to
eliminate and the deformation of the signal one can accept. The more the fre-
quencies of noise and signal are similar, the more difficult it becomes to improve

Fig. 40.28. Effect of a low-pass filter. (a) Original Gaussian signal. (b) FT of (a). (c) Signal (a) filtered
with v0 = 10. (d) Signal (a) filtered with v0 = 20.

the signal-to-noise ratio. For this reason it is very difficult to eliminate 1/f noise,
because its power increases with decreasing frequency.

40.5.4 Smoothing and filtering: a comparison

Filtering and smoothing are related and are in fact complementary. Filtering is
more complicated because it involves a forward and a backward Fourier transform.
However, in the frequency domain the noise and signal frequencies are distin-
guished, allowing the design of a filter that is tailor-made for these frequency
characteristics.
Polynomial smoothing is more or less a trial and error operation. It gives an
improvement of the signal-to-noise ratio but the best smoothing function has to be
empirically found and there are no hard rules to do so. However, because of its
computational simplicity, polynomial smoothing is the preferred method of many
instrument manufacturers. By calculating the Fourier transform of the smoothing
convolutes derived by Savitzky and Golay one can see that polynomial smoothing
is equivalent to low-pass filtering. Figure 40.29 shows the Fourier transforms of
the 5-point, 9-point, 17-point and 25-point second-order convolutes given in Table
40.2 (the frequency scale is arbitrarily based on 1024 data points sampled at

Fig. 40.29. Fourier spectrum of second-order Savitzky-Golay convolutes. (a) 5-point. (b) 9-point. (c)
17-point. (d) 25-point (arrows indicate cut-off frequencies).

1 Hz). Another feature of polynomial smoothing is that smoothing and differ-
entiation (first and second derivative) can be combined in a single step, which is
explained in Section 40.5.5.

40.5.5 The derivative of a signal

Signals are differentiated for several purposes. Many software packages for
chromatography and spectrometry offer routines for determining the peak position
and for finding the up-slope and down-slope integration limits of a peak. These
algorithms are based on the calculation of the first- or second-derivative. In NIRA
small differences between spectra are magnified by taking the first or second
derivative of the spectra. Baseline drifts are eliminated as well. The simplest
procedure to calculate a derivative is by taking the difference between two
successive data points. However, by this procedure the noise is magnified by
several orders of magnitude leading to unacceptable results. Therefore, the
calculation of a derivative is usually linked to a smoothing procedure. In principle
one could smooth the data first. This requires a double sweep through the data, the
first one to smooth the data and the second one to calculate the derivative.
However, the smoothing and differentiation can be combined into a single step. To
explain this, we recall the way Savitzky and Golay derived the smoothing
convolutes by moving a window over the data and fitting a polynomial through the
data in the window. The central point in the window is replaced by the value of the
polynomial. Instead, one may replace it by the value of the first or second
derivative of that polynomial in that point. Savitzky and Golay [8] published
convolutes (corrected later on by Steinier [10]) for that operation (see Table 40.5).
This procedure is the recommended method for the calculation of derivatives. Fig.
40.30 gives the second-derivative of two noisy overlapped Gaussian peaks,
obtained with a quadratic 7-point smoothed derivative. The two negative regions
(shaded areas in the figure) reveal the presence of two peaks.
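Using the 7-point quadratic second-derivative convolutes of Table 40.5 (5, 0, -3, -4, -3, 0, 5; NORM = 42), a smoothed second derivative of two overlapping peaks can be sketched as follows (peak positions, widths and noise level are arbitrary):

import numpy as np

rng = np.random.default_rng(4)
x = np.arange(120)
# two overlapping Gaussian peaks plus a little noise
y = (np.exp(-0.5 * ((x - 55) / 7.0) ** 2)
     + 0.8 * np.exp(-0.5 * ((x - 72) / 7.0) ** 2)
     + rng.normal(0, 0.005, x.size))

c = np.array([5, 0, -3, -4, -3, 0, 5], dtype=float)   # 7-point quadratic 2nd-derivative convolutes
second_deriv = np.convolve(y, c, mode="same") / 42.0  # divide by NORM

left = int(np.argmin(second_deriv[:64]))
right = 64 + int(np.argmin(second_deriv[64:]))
print(left, right)   # the two minima of the negative regions lie near the true peak positions (55 and 72)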

40.5.6 Data compression by a Fourier transform

Sets of spectroscopic data (IR, MS, NMR, UV-Vis) or other data are often
subjected to one of the multivariate methods discussed in this book. One of the
issues in this type of calculation is the reduction of the number of variables by
selecting a set of variables to be included in the data analysis. The opinion is
gaining support that a selection of variables prior to the data analysis improves the
results. For instance, variables which are little or not correlated to the property to
be modeled are disregarded. Another approach is to compress all variables in a few
features, e.g. by a principal components analysis (see Section 31.1). This is called

TABLE 40.5
Convolutes for the calculation of the smoothed second derivative (adapted from Ref. [8])

Points 25 23 21 19 17 15 13 11 9 7 5

-12 92
-11 69 77
-10 48 56 190
-09 29 37 133 51
-08 12 20 82 34 40
-07 -3 5 37 19 25 91
-06 -16 -8 -2 6 12 52 22
-05 -27 -19 -35 -5 1 19 11 15
-04 -36 -28 -62 -14 -8 -8 2 6 28
-03 -43 -35 -83 -21 -15 -29 -5 -1 7 5
-02 -48 -40 -98 -26 -20 -48 -10 -6 -8 0 2
-01 -51 -43 -107 -29 -23 -53 -13 -9 -17 -3 -1
00 -52 -44 -110 -30 -24 -56 -14 -10 -20 -4 -2
01 -51 -43 -107 -29 -23 -53 -13 -9 -17 -3 -1
02 -48 -40 -98 -26 -20 -48 -10 -6 -8 0 2
03 -43 -35 -83 -21 -15 -29 -5 -1 7 5
04 -36 -28 -62 -14 -8 -8 2 6 28
05 -27 -19 -35 -5 1 19 11 15
06 -16 -8 -2 6 12 52 22
07 -3 5 37 19 25 91
08 12 20 82 34 40
09 29 37 133 51
10 48 56 190
11 69 77
12 92

NORM 26910 17710 33649 6783 3876 6188 1001 429 462 42 7

feature reduction. Data may also be compressed by a Fourier transform or by one


of the transforms discussed later in this chapter. This compression consists of
taking the FT of the data and retaining the first n relevant Fourier coefficients. If
the data are symmetrically mirrored about the first data point, the FT only consists


Fig. 40.30. Smoothed second-derivative (window: 7 data points, second-order) according to
Savitzky-Golay.

of real coefficients which facilitates the calculations (see Section 40.3.8). Figure
40.31 shows a spectrum of 512 data points, which is reconstructed from, resp-
ectively, the first 2, 4, 8, ..., 256 Fourier coefficients. The effect is more or less
comparable to wavelength selection by deleting data points at regular intervals. As
a consequence one loses high-frequency information. When the rows of a data
table are replaced by the first n relevant Fourier coefficients, the properties of a
data table are retained. For instance, the Fourier coefficients of the rows of a
two-way data table of mixture spectra remain additive (distributivity property).
Similarities and dissimilarities between rows (the objects) are retained as well,
allowing the application of pattern recognition [12] and other multivariate
operations [13,14].
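A sketch of this type of compression (the simulated spectrum and the number of retained coefficients are arbitrary choices):

import numpy as np

rng = np.random.default_rng(5)
x = np.arange(512)
spectrum = (np.exp(-0.5 * ((x - 150) / 15.0) ** 2)
            + 0.6 * np.exp(-0.5 * ((x - 320) / 30.0) ** 2)
            + rng.normal(0, 0.01, x.size))

coeffs = np.fft.rfft(spectrum)
n_keep = 16
compressed = np.zeros_like(coeffs)
compressed[:n_keep] = coeffs[:n_keep]                 # retain only the first n coefficients
reconstructed = np.fft.irfft(compressed, n=x.size)

print(np.sqrt(np.mean((reconstructed - spectrum) ** 2)))  # reconstruction error with 16 of 257 coefficients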


Fig. 40.31. Data compression by a Fourier transform, (a) A spectrum measured at 512 wavelengths;
(b) spectrum after reconstruction with 2, 4,..., 256 Fourier coefficients.

40.6 Deconvolution by Fourier transform

In Section 40.4 we mentioned that the distortion introduced by instruments can
be modeled by a convolution. Moreover, we demonstrated that noise filtering by
either an analog or digital filter is a convolution process. In some cases the
distortion introduced by the measuring device may damage the signal so much that
the analytical information wanted cannot be derived from the signal. For instance
in chromatography the peak broadening introduced during the elution process may
cause peak overlap and hamper an accurate determination of the peak area.
Therefore, one may want to mathematically remove the damage. This process of
signal restoration is known as deconvolution or inverse filtering. Deconvolution is
the inverse operation of convolution. While convolution is mathematically straight-
forward, deconvolution is more complicated. It requires an operation in the Fourier

domain and a careful design of the inverse filter. The basic deconvolution algorithm
follows directly from eq. (40.13):

F(v) = G(v)/H(v)

where G(v) is the Fourier transform of the damaged signal, F(v) is the FT of the
recovered signal and H(v) is the FT of the point-spread function. The back transform
of F(v) gives f(x). Thus a deconvolution requires the following three steps (a
minimal numerical sketch is given below):
(1) Calculate the FT of the measured signal and of the point-spread function to
obtain respectively G(v) and H(v).
(2) Divide G(v) by H(v) at corresponding frequency values (according to the
rules for the division of two complex numbers), which gives F(v).
(3) Back-transform F(v), by which the undamaged signal f(x) is estimated.
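The sketch below runs the three steps on a simulated, noise-free broadened peak (the peak and point-spread widths are arbitrary); as discussed next, in the presence of noise the division must be combined with a low-pass filter:

import numpy as np

t = np.arange(256)
f_true = np.exp(-0.5 * ((t - 100) / 5.0) ** 2)            # 'true' narrow peak
h = np.exp(-0.5 * ((t - 128) / 1.5) ** 2)                 # known point-spread function
h /= h.sum()
g = np.fft.irfft(np.fft.rfft(f_true) * np.fft.rfft(h), n=t.size)   # simulated broadened signal

G = np.fft.rfft(g)                           # step (1): FT of the measured signal
H = np.fft.rfft(h)                           #           and of the point-spread function
F = G / H                                    # step (2): complex division, frequency by frequency
f_recovered = np.fft.irfft(F, n=t.size)      # step (3): back-transform

print(np.allclose(f_recovered, f_true))      # True in this noise-free case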
The effect of deconvolution applied on a noise-free Gaussian peak is shown in
Fig. 40.32a. Unfortunately, as can be seen in Fig. 40.32c, a deconvolution carried
out in the presence of noise (s = 1% of the signal maximum) leads to no useful result
at all. This is caused by the fact that two different kinds of damage are present


Fig. 40.32. Deconvolution (result in solid line) of a Gaussian peak (dashed line) for peak broadening
((w1/2)psf/(w1/2)G = 1). (a) Without noise. (b) With coloured noise (N(0,1%), Tx = 1.5): inverse filter in
combination with a low-pass filter. (c) With coloured noise (N(0,1%), Tx = 1.5): inverse filter without
low-pass filter.

simultaneously, namely signal broadening and noise. The model for the damaged
signal, therefore, needs to be expanded to the following expression:

g(x) = f(x) * h(x) + n(x)

The Fourier transform G(v) of g(x) is given by:

G(v) = F(v) H(v) + N(v)

Thus the result of applying the inverse filter is:

G(v)/H(v) = F(v) + N(v)/H(v)

The unacceptable results of Fig. 40.32c are caused by the term N(v)/H(v), which is
large for high v values (i.e. high frequencies). Indeed, H(v) approaches zero for high
frequencies whereas the value of N(v) does not. The influence of the noise can be
limited by combining the inverse filter with a low-pass noise filter, which removes
all frequencies larger than a threshold value v0. In this way one can avoid that the
term N(v)/H(v) inflates to large values (Fig. 40.32b). We observe that the overall
procedure consists of two contradictory operations: one which sharpens the signal
by removing the broadening effect of the measuring device, and one which increases
broadening because noise has to be removed. Consequently, the broadening effect
of the measuring device can only be partially removed. In Section 40.7 we discuss
other approaches such as the Maximum Entropy method and Maximum Likeli-
hood method, which are less sensitive to noise.
An essential condition for performing signal restoration by deconvolution is the
knowledge of the point-spread function (psf) h(x). In some instances h(x) can be
postulated or be determined experimentally by measuring a narrow signal having a
bandwidth which is at least 10 times narrower than the width of the point-spread
function. The effect of deconvolution is very well demonstrated by the recovery of
two overlapping peaks from a composite profile (see Fig. 40.33). The half-height
width of the psf was 1.25 times the peak width for the peak systems (a) and (b). For
the peak system (c) the half-height width of the psf and signal were equal.
It is still possible to enhance the resolution when the point-spread function
is unknown. For instance, the resolution is improved by subtracting the second
derivative g''(x) from the measured signal g(x). Thus the signal is restored by ag(x)
- (1 - a)g''(x) with 0 < a < 1. This algorithm is called pseudo-deconvolution.
Because the second derivative of any bell-shaped peak is negative between the
two inflection points (where the second derivative is zero) and positive elsewhere, the
subtraction makes the top higher and narrows the wings, which results in a better
resolution (see Fig. 40.30). Pseudo-deconvolution methods can correct for symmetric


Fig. 40.33. Restoration of two overlapping peaks by deconvolution. Dashed line: measured data. Solid
line: after restoration. Dotted line: difference between true and restored signals.

point-spread functions. However, for asymmetric point-spread functions,
e.g. the broadening introduced by a slow detector response, pseudo-deconvolution
is not applicable.

40.7 Other deconvolution methods

In previous sections a signal was enhanced or restored by applying a processing


technique to the signal in order to remove damage from the data. Examples of
damage are noise and peak broadening. We have also seen that removing noise and
restoring peak broadening are two opposite operations. Filtering introduces broad-
ening whereas peak sharpening introduces noise. In Section 40.6 we mentioned

that the calculation of f(x) from the measured spectrum g(x) by solving g(x) =
f(x)*h(x) by deconvolution is a compromise between restoration and filtering.
Because of a lack of hard rules, the selection of the noise filter introduces some
arbitrariness into the procedure.
Another class of methods such as Maximum Entropy, Maximum Likelihood
and Least Squares Estimation, do not attempt to undo damage which is already in
the data. The data themselves remain untouched. Instead, information in the data is
reconstructed by repeatedly taking revised trial data f(x) (e.g. a spectrum or
chromatogram), which are damaged as they would have been measured by the
original instrument. This requires that the damaging process which causes the
broadening of the measured peaks is known. Thus an estimate ĝ(x) is calculated
from a trial spectrum f(x), which is convoluted with a supposedly known point-
spread function h(x). The residuals e(x) = g(x) - ĝ(x) are inspected and compared
with the noise n(x). Criteria to evaluate these residuals are Maximum Entropy (see
Section 40.7.2) and Maximum Likelihood (Section 40.7.1).

40.7.1 Maximum Likelihood

The principle of Maximum Likelihood is that the spectrum f(x) is calculated
which has the highest probability to yield the observed spectrum g(x) after convolution
with h(x). Therefore, assumptions about the noise n(x) are made. For instance, the
noise ni in each data point i is random and additive with a normal or any other
distribution (e.g. Poisson, skewed, exponential, ...) and a standard deviation si. In
the case of a normal distribution the residual ei = gi - ĝi = gi - (f*h)i in each data point
should be normally distributed with a standard deviation si. The probability that
(f*h)i represents the measurement gi is then given by the conditional probability
density function P(gi|f):

P(gi|f) = [1/(si √(2π))] exp[-(gi - (f*h)i)² / (2si²)]

Under the assumption that the noise in point i is uncorrelated with the noise in
point j, the likelihood that (f*h)i, for all measurements i, represents the measured
set g1, g2, ..., gn is the product of all probabilities:

L = Π [1/(si √(2π))] exp[-(gi - (f*h)i)² / (2si²)]    (40.15)

where the product runs over i = 1 to n.
This likelihood function has to be maximized for the parameters in f. The maxim-
ization is to be done under a set of constraints. An important constraint is the
knowledge of the peak-shapes. We assume that f is composed of many individual

Fig. 40.34. Signal restoration by a Maximum Likelihood approach.

peaks of known shape. However, we make no assumption about the number and
position of the peaks. Because f is non-linear and contains many parameters to be
estimated, the solution of eq. (40.15) is not straightforward and should be calcul-
ated in an iterative way by a sequential optimisation strategy. Figure 40.34 shows
the kind of resolution improvement one obtains. Under the normality assumption
the Maximum Likelihood and the least squares criteria are equivalent. Thus one
can also minimize Σei² by a sequential optimization strategy [15].

40.7.2 Maximum Entropy

Before going into detail about the meaning of entropy and maximum entropy, the
effect of applying this principle for signal enhancement is shown in Fig. 40.35. As
can be seen, the effect is a drastic improvement of the signal-to-noise ratio and an
enhancement of the resolution. This effect is thus comparable to what is achieved
by the Maximum Likelihood procedure and by inverse filtering. However, the
Maximum Entropy technique apparently does improve the resolution of the signal
without increasing noise.
In physical chemistry, entropy has been introduced as a measure of disorder or
lack of structure. For instance the entropy of a solid is lower than for a fluid,
because the molecules are more ordered in a solid than in a fluid. In terms of
probability it means also that in solids the probability distribution of finding a
molecule at a given position is narrower than for fluids. This illustrates that
entropy has to do with probability distributions and thus with uncertainty. One of
the earliest definitions of entropy is the Shannon entropy which is equivalent to the
definition of Shannon's uncertainty (see Chapter 18). By way of illustration we


Fig. 40.35. Signal restoration by the Maximum Entropy approach.

consider two histograms of 20 analytical results, one obtained with a precise


method and one obtained with a less precise method (see Table 40.6). On average
the method yields a result equal to 100, in the range of 85 to 115.
According to Shannon, the uncertainty of the two methods can be expressed by
means of:

H = -Σ pi log2 pi

where the sum runs over the classes i, and pi is the probability to find a value in class i.



TABLE 40.6
Shannon's uncertainty for two probability distributions (a broad and a narrow distribution)

Intervals         85-90   90-95   95-100  100-105  105-110  110-115
Distribution 1      2       3       5        5        3        2
Probability        0.10    0.15    0.25     0.25     0.15     0.10
-log2 pi           3.32    2.737   2.00     2.00     2.737    3.32
-pi log2 pi        0.332   0.411   0.50     0.50     0.411    0.332     H = 2.486
Distribution 2      0       2       8        8        2        0
Probability        0       0.10    0.40     0.40     0.10     0
-log2 pi           -       3.323   1.322    1.322    3.323    -
-pi log2 pi        0       0.332   0.529    0.529    0.332    0         H = 1.722

Application of this equation to the probability distributions given in Table 40.6


shows that H for the less precise method is larger than for the more precise method.
Uniform distributions represent the highest form of uncertainty and disorder.
Therefore, they have the largest entropy.
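The two H values of Table 40.6 can be checked directly; the class counts below are those of the table:

import math

def shannon_entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

broad = [2, 3, 5, 5, 3, 2]      # less precise method (distribution 1)
narrow = [0, 2, 8, 8, 2, 0]     # more precise method (distribution 2)
print(round(shannon_entropy(broad), 2), round(shannon_entropy(narrow), 2))
# 2.49 and 1.72, in agreement with the H values of 2.486 and 1.722 in Table 40.6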
We now apply the same principle to calculate the entropy of a spectrum (or any
other signal). The entropy, 5 of a spectrum given by the vector y is defined as

S = -I.Pi\og^i with Pi = \yi\/Byi\

The entropy of a noise spectrum, with equal probability of measuring a certain
amplitude at each wavelength, is maximal. When structure is added to the spectrum
the entropy decreases. Noise is associated with disorder, whereas structure means
more order. In order to get a feeling for the meaning of entropy, we calculated the
entropy of some typical spectra: a noise spectrum, the same noisy spectrum to
which we added a spike, a noise free spectrum with one spike and one with two
spikes (Table 40.7). As one can see, noise has the highest entropy, whereas a single
spike has no entropy at all. It represents the highest degree of order.
As indicated before, the maximum entropy approach does not process the
measurements themselves. Instead, it reconstructs the data by repeatedly taking
revised trial data (e.g. a spectrum or chromatogram), which are artificially corrupted
with measurement noise and blur. This corrupted trial spectrum is thereafter
compared with the measured spectrum by a χ²-test. From all accepted spectra the
maximum entropy approach selects that spectrum, f, with minimal structure (which
is equivalent to maximum entropy). The maximum entropy approach applied for
noise elimination consists of the following steps:

TABLE 40.7
Entropy of noise, noise plus a spike at i = 5, a spike at i = 5, and two spikes at i = 5 and 6

       Noise                       Noise + spike                Single spike     Two spikes
i      y      pi     -pi log2 pi   y      pi     -pi log2 pi    y     pi         y    pi    -pi log2 pi
1      0.43   0.141  0.398         0.43   0.095  0.324          0     0          0    0     0
2     -0.41   0.135  0.418        -0.41   0.091  0.317          0     0          0    0     0
3      0.34   0.112  0.361         0.34   0.075  0.280          0     0          0    0     0
4      0.49   0.161  0.424         0.49   0.108  0.347          0     0          0    0     0
5      0.01   0.003  0.025         1.50   0.331  0.531          1.5   1          1    0.5   0.5
6      0.25   0.082  0.296         0.25   0.055  0.230          0     0          1    0.5   0.5
7     -0.42   0.138  0.394        -0.42   0.093  0.320          0     0          0    0     0
8     -0.04   0.013  0.081        -0.04   0.009  0.060          0     0          0    0     0
9     -0.28   0.092  0.317        -0.28   0.062  0.250          0     0          0    0     0
10    -0.37   0.122  0.370        -0.37   0.082  0.297          0     0          0    0     0

Entropy               H = 3.084                   H = 2.956                 H = 0            H = 1.0

(1) Start the procedure with a trial spectrum f1.
If no prior knowledge is available on the spectrum, one starts the iteration
process with a structure-less noise spectrum. Indeed, in that case there is no
evidence to assume a particular structure beforehand. However, prior knowledge
may justify introducing some extra structure.
(2) Calculate the variance of the differences, d, between the measurements and
the trial spectrum f1.
(3) Test whether the variance of these differences (sd²) is significantly different
from the variance of the measurement noise (sn²) by a χ²-test: χ² = (n - 1)sd²/sn² (n
data points). For large n, χ²crit ≈ n.
(4) If the trial spectrum is significantly different from the measured spectrum,
the trial spectrum is adapted into f2 = f1 + Δf, whereafter the cycle is repeated from
step (2) (see e.g. [16] for the derivation of Δf) until the spectrum meets the χ²
criterion.
(5) By repeating steps (1) to (4) with several 'noise' spectra, a set of spectra is
obtained which meet the χ² criterion. All these spectra are marked as 'feasible'
spectra.
(6) Finally, from the set of 'feasible' spectra the spectrum is selected with the
maximum entropy.

The maximum entropy method thus consists of maximizing the entropy under
the χ² constraint. An algorithm to maximize entropy is the so-called Cambridge
algorithm [16].
When the maximum entropy approach is used for signal restoration a step has to
be included between steps (1) and (2) in which the trial spectrum is first convoluted
(see Section 40.4) with the point-spread function before calculating and testing the
differences with the measured spectrum. The entropy of the trial spectrum before
convolution is evaluated as usual.

40.8 Other transforms

40.8.1 The Hadamard transform

In Section 40.3.2 we mentioned that the Fourier coefficients An and Bn can be
calculated by fitting eq. (40.1) to the signal f(t) by a least squares regression. This
fit is represented in matrix notation as given in Fig. 40.36. The vector X
represents the measurements, whereas A and B are vectors with respectively the
real and imaginary Fourier coefficients. The columns of the two matrices are the
sine and cosine functions with increasing frequency. These sines and cosines
constitute a base of orthogonal functions. This representation also shows the
resemblance of the FT to PCA. The measurement vector, which initially contains N
features (e.g. wavelengths), is reduced to a vector with n < N features by a
projection on a smaller orthogonal sub-space defined by the n columns in the
transform matrix. In PCA these n columns are the n principal components and in the
FT these columns are sines and cosines.
transform matrix. In PCA these n columns are the n principal components and in
FT these columns are sines and cosines. Depending on the properties of these
columns, the scores have a specific meaning, which in the FT are the Fourier
coefficients. In theory, any base of orthogonal functions can be selected to
transform the data. A base which is related to the cosine and sine functions is a
series of orthogonal block signals with increasing frequency (Fig. 40.37). Any
signal can be decomposed in a series of block functions, which is called the

Fig. 40.36. Matrix representation of a Fourier transform.



Fig. 40.37. A base of block signals.

Fig. 40.38. Spectrum given in Fig. 40.31 reconstructed with 2, 4, ..., 256 Hadamard coefficients.

Hadamard transform [17]. For example the IR spectrum (512 data points) shown
in Fig. 40.31a is reconstructed by the first 2, 4, 8, ... 256 Hadamard coefficients
(Fig. 40.38). In analogy to spectrometers which directly measure in the Fourier
domain, there are also spectrometers which directly measure in the Hadamard
domain. Fourier and Hadamard spectrometers are called non-dispersive. The
advantage of these spectrometers is that all radiation reaches the detector whereas
in dispersive instruments (using a monochromator) radiation of a certain wave-
length (and thus with a lower intensity) sequentially reaches the detector.
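As an illustration of this kind of reconstruction, the sketch below keeps only the first k Hadamard coefficients of a signal and transforms back. It relies on SciPy's natural-ordered Hadamard matrix, so the ordering of the retained coefficients, the test signal and the cut-off k are assumptions for demonstration only.

import numpy as np
from scipy.linalg import hadamard

def hadamard_reconstruct(signal, k):
    """Keep the first k Hadamard coefficients and transform back."""
    n = signal.size                    # must be a power of two
    H = hadamard(n)                    # orthogonal matrix of +1/-1 block functions
    coef = H @ signal / n              # forward Hadamard transform
    coef[k:] = 0.0                     # discard all but the first k coefficients
    return H.T @ coef                  # inverse transform (H is symmetric, H H = n I)

x = np.sin(np.linspace(0, 4 * np.pi, 512))   # stand-in for a 512-point spectrum
approx = hadamard_reconstruct(x, 32)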

40.8.2 The time-frequency Fourier transform

A common feature of the Fourier and Hadamard transforms is that they describe
an overall property of the signal in the measurement range, t = 0 to t = T.
However, one may be interested in local features of the signal. For instance, it may
well be that at the beginning of the signal the frequencies are much higher than at
the end of the signal, as Fig. 40.39 shows. This is certainly true when the signal
contains noise and peaks with different peak widths. In Fig. 40.39 there are regions
with a high, low and intermediate frequency. One way to detect these local features
is by calculating the FT in a moving window of size T_w, and to observe the

Fig. 40.39. Signal with local frequency features.



Fig. 40.40. The moving window FT principle.

evolution of the Fourier coefficients as a function of the position of the moving
window. In this example when the centre of the window coincides with the
position of one of the peaks, the low frequency components are dominant, whereas
in an area of noise the high frequencies become dominant. This means that peaks
are detected by monitoring the Fourier coefficients as a function of the position of
the moving window. The procedure of the moving FT is schematically shown in
Fig. 40.40. At each position i of the window (size = T_w) a filter function h(i) is
defined by which the signal f(t) is multiplied before the FT is calculated. In general,
this is expressed as follows:

F(ν,a) = F[f(t) h(a)]

where F is the symbol for the FT and a refers to the filter transfer function h(a). For
each a, n Fourier coefficients are obtained, which can be arranged in a matrix of A
and B coefficients:

A1,0  ...  A1,n   B1,0  ...  B1,n
A2,0  ...  A2,n   B2,0  ...  B2,n
 .          .      .          .
Aa,0  ...  Aa,n   Ba,0  ...  Ba,n

The columns of this matrix contain the time information (amplitudes at a specific
frequency as a function of time) and the rows the frequency information.
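A minimal sketch of this moving-window (short-time) Fourier transform is given below; the Hanning window, the window length and the step size are arbitrary illustrative choices, not values prescribed by the text.

import numpy as np

def moving_window_ft(f, window_len=64, step=16):
    """Return a matrix of Fourier coefficients, one row per window position."""
    h = np.hanning(window_len)                 # filter/window function h
    rows = []
    for start in range(0, f.size - window_len + 1, step):
        segment = f[start:start + window_len] * h
        rows.append(np.fft.rfft(segment))      # complex coefficients A + iB
    # rows follow the window position (time); columns follow the frequency
    return np.array(rows)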

40.8.3 The wavelet transform

Another transform which provides time and frequency information is the


wavelet transform (WT). By analogy with the Fourier transform, the WT
decomposes a signal into a set of basis functions, called a wavelet basis. In FT the
basis functions are the cosine and sine function. The wavelet basis is also a
function, called the analyzing wavelet. Frequently applied analyzing wavelets are
the Morlet and Daubechies wavelets [18], of which the Haar wavelet is a specific
member (Fig. 40.41). A series of wavelets is generated by stretching and shifting
the wavelet over the data. The shift b is called a translation and the stretching or
widening of the basis wavelet with a factor a is called a dilation. Suppose for
instance that the analyzing wavelet is a function h(t). A series of wavelets h_a,b(t) is
then generated by introducing a translation b and a dilation a, according to

h_a,b(t) = (1/√a) h((t - b)/a)

A series of Morlet wavelets for various dilation values is shown in Fig. 40.42. In
a similar way as for the FT, where only frequencies are considered which fit an
exact number of times in the measurement time, here only dilation values are
considered which are stretched by a factor of two. A wavelet transform consists of
fitting the measurements with a basis of wavelets as shown in Fig. 40.42, which are
generated by stretching and shifting a mother wavelet h(t). The narrowest wavelet
(level a¹) is shifted in small steps, whereas a broad wavelet is shifted in bigger
steps. The shift b is usually a multiple of the dilation value. The fitting of this
basis of wavelets to the data yields the wavelet transform coefficients. Coefficients
associated with narrow wavelets describe the local features in a signal, whereas the
broad wavelets describe the smooth features in the signal.

Fig. 40.41. The Haar wavelet (a) and three Daubechies wavelets (b-d).

Fig. 40.42. A family of Morlet wavelets with various dilation values.

To transform measurements available in a discrete form a discrete wavelet
transform (DWT) is applied. A condition is that the number of data points is equal to 2^n. In
the discrete wavelet transform the analyzing wavelet is represented by a number of
coefficients, called wavelet filter coefficients. For instance, the first member
(smallest dilation a and shift b = 0) of the Haar family of wavelets is characterized
by two coefficients c1 = 1 and c2 = 1. The next one with a dilation 2a is
characterized by four coefficients: c1 = 1, c2 = 1, c3 = 1 and c4 = 1. Generally,
wavelet member n is characterized by 2^n coefficients. The widest wavelet
considered is the one for which 2^n is equal to N, the number of measurements. The
value of n defines the level of the wavelet. For instance, for n = 2 the level-2
wavelet is obtained. For each level, a transform matrix is defined in which the
wavelet filter coefficients are arranged in a specific way. For a signal containing
eight data points (arranged in an 8x1 column vector) and level 1, the transform
matrix has the following form:
matrix has the following form:

c1  c2  0   0   0   0   0   0
0   0   c1  c2  0   0   0   0
0   0   0   0   c1  c2  0   0
0   0   0   0   0   0   c1  c2

Multiplication of this 4x8 transformation matrix with the 8x1 column vector of
the signal results in 4 wavelet transform coefficients, or N/2 coefficients for a data
vector of length N. For c1 = c2 = c3 = c4 = 1, these wavelet transform coefficients are
equivalent to the moving average of the signal over 4 data points. Consequently,

these wavelet filter coefficients define a low-pass filter (see Section 40.5.3), and
the resulting wavelet transform coefficients contain the 'smooth' information of
the signal. For this reason this set of wavelet filter coefficients is called the
approximation coefficients and the resulting transform coefficients are the a-compo-
nents. The transform matrix containing the approximation coefficients is denoted
as the G-matrix. In the above example with 8 data points, the highest possible
transform level is level 3 (8 non-zero coefficients). The result of this transform is
the average of the signal. The level zero (1 non-zero coefficient) returns the signal
itself.
Besides this first set of coefficients, a second set of filter coefficients is defined
which is the equivalent of a high-pass filter (see Section 40.5.3) and describes the
detail in the signal. The high-pass filter uses the same set of wavelet coefficients,
but with alternating signs and in reversed order. These coefficients are arranged in
the H-matrix. The H-matrix for the level two transform of a signal with length 8 is:

c2  -c1   0    0    0    0    0    0
0    0    c2  -c1   0    0    0    0
0    0    0    0    c2  -c1   0    0
0    0    0    0    0    0    c2  -c1

The coefficients in the H-matrix are the detail coefficients. The output of the
H-matrix are the d-components. With N/2 detail components and N/2 approx-
imation components, we are able to reconstruct a signal of length N.
The discrete wavelet transform can be represented in a vector-matrix notation
as:

a = W^T f    (40.16)

where a contains N wavelet transform coefficients, W is an NxN orthogonal
matrix consisting of the approximation and detail coefficients associated with a
particular wavelet and f is a vector with the data. The action of this matrix is to
perform two related convolutions, one with a low-pass filter G and one with a
high-pass filter H. The output of G is referred to as the smooth information and the
output of H may be regarded as the detail information. By way of illustration we
consider a discrete sample of 16 points, taken from Walczak [19], f
= [0 0.2079 0.4067 0.5878 0.7431 0.8660 0.9511 0.9945 0.9945 0.9511 0.8660
0.7431 0.5878 0.4067 0.2079 0.0], which is fitted with a Haar wavelet at level a¹.
We first define the 16x16 matrix W of the wavelet filter coefficients equal to:

1   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0
0   0   1   1   0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   1   1   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   1   1   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   1   1   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   1   1   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   1   1   0   0
0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   1
1  -1   0   0   0   0   0   0   0   0   0   0   0   0   0   0
0   0   1  -1   0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   1  -1   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   1  -1   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   1  -1   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   1  -1   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   1  -1   0   0
0   0   0   0   0   0   0   0   0   0   0   0   0   0   1  -1

Rows 1-8 are the approximation filter coefficients and rows 9-16 represent the
detail filter coefficients. At each next row the two coefficients are moved two
positions (shift b equal to 2). This procedure is schematically shown in Fig. 40.43
for a signal consisting of 8 data points. Once W has been defined, the a¹ wavelet
transform coefficients are found by solving eq. (40.16), which gives:
a¹ = √(1/2) W f
   = √(1/2) W [0.0000 0.2079 0.4067 0.5878 0.7431 0.8660 0.9511 0.9945 0.9945 0.9511 0.8660 0.7431 0.5878 0.4067 0.2079 0.0000]^T
   = [0.1470 0.7032 1.1378 1.3757 1.3757 1.1378 0.7032 0.1470 -0.1470 -0.1281 -0.0869 -0.0307 0.0307 0.0869 0.1281 0.1470]^T

The factor √(1/2) is introduced to keep the intensity of the signal unchanged. The
first eight wavelet transform coefficients are the a or smooth components. The last eight
coefficients are the d or detail components. In the next step, the level-2 components
are calculated by applying the transformation matrix corresponding to the a² level
to the original signal.

Fig. 40.43. Waveforms for the discrete wavelet transform using the Haar wavelet for an 8-point long
signal with the scheme of Mallat's pyramid algorithm for calculating the wavelet transform coefficients.

This a² transformation matrix contains 4 wavelet filter coefficients, which is a
doubling of the width of the wavelet. This wavelet is shifted four positions instead
of two at the previous level. In our example this leads to a transform matrix with
four approximation rows and four detail rows with 16 elements. Multiplication of
this matrix with the 16-point data vector results in a vector with four a and four d
components.
The level-2 coefficients a² are obtained by multiplying the 16-point data vector f
with √(1/4) times the 8x16 level-2 transform matrix:

1   1   1   1   0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   1   1   1   1   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   1   1   1   1   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   1   1   1   1
1   1  -1  -1   0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   1   1  -1  -1   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   1   1  -1  -1   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   1   1  -1  -1

which gives

a² = [0.6012 1.7773 1.7773 0.6012 -0.3933 -0.1682 0.1682 0.3933]^T

The same result is obtained by multiplying the vector of a coefficients obtained in
the previous step with an 8x8 a¹-level transform matrix:

         | 1   1   0   0   0   0   0   0 |   | 0.1470 |   |  0.6012 |
         | 0   0   1   1   0   0   0   0 |   | 0.7032 |   |  1.7773 |
         | 0   0   0   0   1   1   0   0 |   | 1.1378 |   |  1.7773 |
√(1/2) x | 0   0   0   0   0   0   1   1 | x | 1.3757 | = |  0.6012 |
         | 1  -1   0   0   0   0   0   0 |   | 1.3757 |   | -0.3933 |
         | 0   0   1  -1   0   0   0   0 |   | 1.1378 |   | -0.1682 |
         | 0   0   0   0   1  -1   0   0 |   | 0.7032 |   |  0.1682 |
         | 0   0   0   0   0   0   1  -1 |   | 0.1470 |   |  0.3933 |

This is the principle of the pyramidal algorithm developed by Mallat [20],
which is computationally more efficient. Continuing the calculations according to
this algorithm, the four a components are input to a 4x4 a¹-level transformation
matrix, giving the level-3 components:

         | 1   1   0   0 |   | 0.6012 |   |  1.6819 |
√(1/2) x | 0   0   1   1 | x | 1.7773 | = |  1.6818 |
         | 1  -1   0   0 |   | 1.7773 |   | -0.8316 |
         | 0   0   1  -1 |   | 0.6012 |   |  0.8316 |

and finally the level-4 coefficients (a⁴) are calculated according to:

√(1/2) x | 1   1 | x | 1.6819 | = | 2.3785 |
         | 1  -1 |   | 1.6818 |   | 0.0000 |
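The complete calculation above can be condensed into a short sketch of the Haar pyramid algorithm. The NumPy implementation and the function name are ours, not Mallat's original code, but the numbers it produces match the worked example.

import numpy as np

def haar_pyramid(f):
    """Return (approximations, details) per level for a signal of length 2**n."""
    a = np.asarray(f, dtype=float)
    approximations, details = [], []
    while a.size > 1:
        even, odd = a[0::2], a[1::2]
        d = (even - odd) / np.sqrt(2)      # detail (high-pass) components
        a = (even + odd) / np.sqrt(2)      # approximation (low-pass) components
        approximations.append(a)
        details.append(d)
    return approximations, details

f = [0, 0.2079, 0.4067, 0.5878, 0.7431, 0.8660, 0.9511, 0.9945,
     0.9945, 0.9511, 0.8660, 0.7431, 0.5878, 0.4067, 0.2079, 0.0]
a, d = haar_pyramid(f)
# a[0] gives the level-1 smooth components [0.1470, 0.7032, ...];
# a[3] gives the level-4 coefficient [2.3785] and d[3] gives [0.0000]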

Having a closer look at the pyramid algorithm in Fig. 40.43, we observe that it
sequentially analyses the approximation coefficients. When we also analyze the
detail coefficients in the same way as the approximations, a second branch of
decompositions is opened. This generalization of the discrete wavelet transform is
called the wavelet packet transform (WPT). Further explanation of the wavelet
packet transform and its comparison with the DWT can be found in [19] and [21].
The final results of the DWT applied on the 16 data points are presented in Fig.
40.44. The difference with the FT is very well demonstrated in Fig. 40.45, where
we see that the narrow, low-level wavelets describe the locally fast fluctuations in
the signal and the broad, high-level wavelets the slow fluctuations. An obvious
application of WT is to denoise spectra. By replacing specific WT coefficients by
zero, we can selectively remove noise from distinct areas in the signal without
disturbing other areas [22].

Fig. 40.44. Wavelet decomposition of a 16-point signal (see text for the explanation).

Fig. 40.45. Wavelet decomposition of a signal with local features.

Mittermayr et al. [23] compared the wavelet filters to Fourier filters and to
polynomial smoothers such as the Savitzky-Golay filters.
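A possible sketch of such wavelet denoising, assuming the third-party PyWavelets package (pywt) is available, is given below; the choice of the Daubechies-4 wavelet, the decomposition level and the threshold value are illustrative only and not taken from the references.

import numpy as np
import pywt

def wavelet_denoise(signal, wavelet='db4', level=4, threshold=0.1):
    coeffs = pywt.wavedec(signal, wavelet, level=level)        # DWT decomposition
    # keep the approximation (coeffs[0]); shrink the detail coefficients towards zero
    denoised = [coeffs[0]] + [pywt.threshold(d, threshold, mode='soft')
                              for d in coeffs[1:]]
    return pywt.waverec(denoised, wavelet)                      # reconstruct the signal

rng = np.random.default_rng(0)
x = np.exp(-0.5 * ((np.arange(512) - 256) / 10.0) ** 2)         # noise-free peak
y = wavelet_denoise(x + 0.05 * rng.standard_normal(512))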
Wavelets have been applied to analyze signals arising from several areas, such as
acoustics [24], image processing [22], seismics [25] and analytical signals [23,26,
27]. Another obvious application is to use wavelets to detect peaks in a noisy
signal. Each sudden change of the signal by the appearance of a peak results in a

wavelet coefficient at that position [27]. Recently, it has been shown that signals
can be compressed to a fairly small number of coefficients without much loss of
information. Bos et al. [26] applied this property to compress IR spectra by a factor
of 20 prior to a classification by a neural net. Feature reduction by wavelet
transform for multivariate calibration has been studied by Jouan-Rimbaud et al.
[28].

References

1. F.C. Strong III, How the Fourier transform infrared spectrophotometer works. J. Chem. Educ.,
56 (1979) 681-684.
2. R. Brereton, Tutorial: Fourier transforms. Use, theory and applications to spectroscopic and re-
lated data. Chemom. Intell. Lab. Syst., 1 (1986) 17-31.
3. R.N. Bracewell, The Fourier Transform and its Applications. 2nd rev. ed., McGraw-Hill, New
York, 1986.
4. G. Doetsch, Anleitung zum praktischen Gebrauch der Laplace-Transformation und der Z-
Transformation. R. Oldenbourg, München, 1989.
5. K. Schmidt-Rohr and H.W. Spiess, Multidimensional Solid-state NMR and Polymers. Aca-
demic Press, London 1994, pp. 141.
6. J.W. Cooley and J.W. Tukey, An algorithm for the machine calculation of complex Fourier se-
ries. Math. Comput., 19 (1965) 297-301.
7. E.O. Brigham, The Fast Fourier Transform. Prentice-Hall, Englewood Cliffs, NJ, 1974.
8. A. Savitzky and M.J.E. Golay, Smoothing and differentiating of data by simplified least-
squares procedures. Anal. Chem., 36 (1964) 1627-1639.
9. C.G. Enke and T.A. Nieman, Signal-to-noise ratio enhancement by least-squares polynomial
smoothing. Anal. Chem., 48 (1976) 705A-712A.
10. J. Steinier, Y. Termonia and J. Deltour, Comments on smoothing and differentiation of data by
simplified least squares procedure. Anal. Chem., 44 (1972) 1906-1909.
11. B. Lam and T.L. Isenhour, Equivalent width criterion for determining frequency domain cut-
offs in Fourier transform smoothing. Anal. Chem., 53 (1981) 1179-1182.
12. W. Wu, B. Walczak, W. Penninckx and D.L. Massart, Feature reduction by Fourier transform
in pattern recognition of NIR data. Anal. Chim. Acta, 331 (1996) 75-83.
13. L. Pasti, D. Jouan-Rimbaud, D.L. Massart and O.E. de Noord, Application of Fourier trans-
form to multivariate calibration of near infrared data. Anal. Chim. Acta, 364 (1998) 253-263.
14. W.F. McClure, A. Hamid, E.G. Giesbrecht and W.W. Weeks, Fourier analysis enhances NIR
diffuse reflectance spectroscopy. Appl. Spectrosc., 38 (1988) 322-329.
15. L.K. DeNoyer and J.G. Dodd, Maximum Likelihood deconvolution for spectroscopy and chro-
matography. Am. Lab., 23 (1991) D24-H24.
16. E.D. Laue, M.R. Mayger, J. Skilling and J. Staunton, Reconstruction of phase-sensitive two-
dimensional NMR-spectra by maximum entropy. J. Magn. Reson., 68 (1986) 14-29.
17. S.A. Dyer, Tutorial: Hadamard transform spectrometry. Chemom. Intell. Lab. Syst., 12 (1991)
101-115.
18. I. Daubechies, Orthonormal bases of compactly supported wavelets. Comm. Pure Appl. Math.,
41 (1988) 909-996.
19. B. Walczak and D.L. Massart, Tutorial: Noise suppression and signal compression using the
wavelet packet transform. Chemom. Intell. Lab. Syst., 36 (1997) 81-94.

20. I. Daubechies, S. Mallat and A.S. Willsky, Special issue on wavelet transforms and multireso-
lution signal analysis. IEEE Trans. Info Theory, 38 (1992) 529-531.
21. B. Walczak, B. van den Bogaert and D.L. Massart, Application of wavelet packet transform in
pattern recognition of near-IR data. Anal. Chem., 68 (1996) 1742-1747.
22. S.G. Nikolov, H. Hutter and M. Grasserbauer, De-noising of SIMS images via wavelet shrink-
age. Chemom. Intell. Lab. Syst., 34 (1996) 263-273.
23. C.R. Mittermayr, S.G. Nikolov, H. Hutter and M. Grasserbauer, Wavelet denoising of Gaus-
sian peaks: a comparative study. Chemom. Intell. Lab. Syst., 34 (1996) 187-202.
24. R. Kronland-Martinet, J. Morlet and A. Grossmann, Analysis of sound patterns through
wavelet transforms. Int. J. Pattern Recogn. Artif. Intell., 1 (1987) 273-302.
25. P. Goupillaud, A. Grossmann and J. Morlet, Cycle-octave and related transforms in seismic
signal analysis. Geoexploration, 23 (1984) 85-102.
26. M. Bos and J.A.M. Vrielink, The wavelet transform for pre-processing IR spectra in the identi-
fication of mono- and di-substituted benzenes. Chemom. Intell. Lab. Syst., 23 (1994) 115-122.
27. M. Bos and E. Hoogendam, Wavelet transform for the evaluation of peak intensities in flow-
injection analysis. Anal. Chim. Acta, 267 (1992) 73-80.
28. D. Jouan-Rimbaud, B. Walczak, R.J. Poppi, O.E. de Noord and D.L. Massart, Application of
wavelet transform to extract the relevant component from spectral data for multivariate calibra-
tion. Anal. Chem., 69 (1997) 4317-4323.

Additional recommended reading

C.K. Chui, Introduction to Wavelets. Academic Press, Boston, 1991.


D.N. Rutledge (Ed.), Signal Treatment and Signal Analysis in NMR. Elsevier, Amsterdam, 1996.
B.K. Alsberg, A.M. Woodward and D.B. Kell, An introduction to wavelet transforms for chemomet-
ricians: a time-frequency approach. Chemom. Intell. Lab. Syst., 37 (1997) 215-239.
F. Dondi, A. Betti, L. Pasti, M.C. Pietrogrande and A. Felinger, Fourier analysis of multicomponent
chromatograms — application to experimental chromatograms. Anal. Chem., 65 (1993)
2209-2222.

Chapter 41

Kalman Filtering

41.1 Introduction

Linear regression and non-linear regression methods to fit a model through a


number of data points have been discussed in Chapters 8, 10 and 11. In these
regression methods all data points are collected first followed by the estimation of
the parameters of a postulated model. The validity of the model is checked by a
statistical evaluation of the residuals. Generally, the same weight is attributed to all
data points, unless a weighting procedure is applied to the residuals, e.g., according
to the inverse of the variance of the experimental error (see Section 8.2.3.2).
In this chapter we discuss an alternative method of estimating the parameters of
a model. There are two main differences from the regression methods discussed so
far. First, the parameters are estimated during the collection of the data. Each time
a new data point is measured, the parameters of the model are updated. This
procedure is called recursive regression. Because at the beginning of the measure-
ment-estimation process, the model is based on a few observations only, the
estimated model parameters are imprecise. However, they are improved as the
process of data collection and updating proceeds. During the progress of the
measurement-estimation sequence, more data points are measured leading to
more precise estimates. The second difference is that the values of the model
parameters are allowed to vary during the measurement-estimation process. An
example is the change of the concentrations during a kinetics experiment, which is
monitored by UV-Vis spectrometry. Multicomponent analysis is traditionally
carried out by measuring the absorbance of a sample at a number of wavelengths
(at least equal to the number of components which are analyzed) and calculating
the unknown concentrations (see Chapter 10). These concentrations are the para-
meters of the model. The basic assumption is that during the measurement of the
absorbances of the sample at the selected wavelengths, the concentrations of the
compounds in the mixture do not vary. However, if the measurements are carried
out during a kinetics experiment, the concentrations may vary in time, and as a
result the composition of the sample varies during the measurements. In this case,
we cannot simply estimate the unknown concentrations by multiple linear regres-
sion as explained in Chapter 10. In order to estimate the concentrations of the

compounds as a function of time during the data acquisition, two models are
needed, the Lambert-Beer model and the kinetics model. The Lambert-Beer
model relates the measurements (absorbances) to the concentrations. This model is
called the measurement model (A = f(a,c)). The kinetics model describes the way
the concentrations vary as a function of time and is called the system model (c =
f(k,t)). In this particular instance, the system model is an exponential function in
which the reaction rate k is the regression parameter. The terms 'systems' and
'states' are associated with the Kalman filter. The system here is the chemical
reaction and its state at a given time is the set of concentrations of the compounds at
that time. The output of the system model is the state of the system. The measure-
ment and system models fully describe the behaviour of the system and are
referred to as the state-space model. Thus the system and measurement models are
connected. The parameters of the measurement model (in our example, the conc-
entrations of the reactants and the reaction product) are the dependent variables of
the system model, in which the reaction rate is the regression parameter and time is
the independent variable. Later on we explain that system models may also contain
a stochastic part. It should be noted that this dual definition implies that two sets of
parameters are estimated simultaneously, the parameters of the measurement
model and the parameters of the systems model.
In this chapter we introduce the Kalman filter, with which it is possible to
estimate the actual values of the parameters of a state-space model, e.g., the rate of
a chemical reaction from the evolution of the concentrations of a reaction product
and the reactants as a function of time which in turn are estimated from the
measured absorbances. Let us consider an experiment where a flow-through cell is
connected to a reaction vessel, in which the reaction takes place. During the
reaction, which may be fast with respect to the scan speed of the spectrometer, a
spectrum is measured. At t = 0 the measurement of the spectrum is started, e.g., at
320 nm in steps of 2 nm per second. Every second (2 nm) estimates of the
concentrations of the compounds are updated by the Kalman filter. When the end
of the spectral range (e.g. 600 nm) is reached, we may continue the measurement
process as long as the reaction takes place, by reversing the scan of the spectrometer.
As mentioned before, the confidence in the estimates of the model parameters
improves during the measurement-estimation process. Therefore, we want to
prevent a new but deviating measurement from influencing the estimated parameters
too much. In the Kalman filter this is implemented by attributing a lower weight to
new measurements.
An important property of a Kalman filter is that during the measurement and
estimation process, regions of the measurement range can be identified where the
model is invalid. This allows us to take steps to avoid these measurements
affecting the accuracy of the estimated parameters. Such a filter is called the
adaptive Kalman filter. An increasing number of applications of the Kalman filter

has been published, taking advantage of the formulation of a systems model which
describes the dynamic behaviour of the parameters in the measurement model and
exploiting the adaptive properties of the filter.
In this chapter we discuss the principles of the Kalman filter with reference to a
few examples from analytical chemistry. The discussion is divided into three parts.
First, recursive regression is applied to estimate the parameters of a measurement
equation without considering a systems equation. In the second part a systems
equation is introduced making it necessary to extend the recursive regression to a
Kalman filter, and finally the adaptive Kalman filter is discussed. In the concluding
section, the features of the Kalman filter are demonstrated on a few applications.

41.2 Recursive regression of a straight line

Before we introduce the Kalman filter, we reformulate the least-squares algor-


ithm discussed in Chapter 8 in a recursive way. By way of illustration, we consider
a simple straight line model which is estimated by recursive regression. Firstly, the
measurement model has to be specified, which describes the relationship between
the independent variable x, e.g., the concentrations of a series of standard solu-
tions, and the dependent variable, y, the measured response. If we assume a straight
line model, any response y_i is described by:

y_i = b0 + b1 x_i + e_i

b0 and b1 are the estimates of the true regression parameters β0 and β1, calculated
by linear regression, and e_i is the contribution of measurement noise to the response.
In matrix notation the model becomes:

y_i = x_i^T b + e_i

where x_i is a [2x1] column vector,

x_i^T = [1  x_i]

and b is a [2x1] column vector,

b^T = [b0  b1]
A recursive algorithm which estimates PQ and pj has the following general form:
New estimate = Previous estimate + Correction
578

After each new observation, the estimates of the model parameters are updated
(= new estimate of the parameters). In all equations below we treat the general case
of a measurement model with p parameters. For the straight line model p = 2. An
estimate of the parameters b based on j - 1 measurements is indicated by b(j - 1).
Let us assume that the parameters are recursively estimated and that an estimate b(j
- 1) of the model parameters is available from j - 1 measurements. The next
measurement y(j) is then performed at x(j), followed by the updating of the model
parameters to b(j).
The first step of the algorithm is to calculate the innovation (I), which is the
difference between the measured response y(j) and the predicted response ŷ(j) at x(j). Therefore, the
last estimate b(j - 1) of the parameters is substituted in the measurement model in
order to forecast the response ŷ(j), which is measured at x(j):

ŷ(j) = b0(j - 1) + b1(j - 1) x(j), or
ŷ(j) = x^T(j) b(j - 1)    (41.1)

The innovation I(j) (not to be confused with the identity matrix I) is the difference
between measured and predicted response at x(j). Thus I(j) = y(j) - ŷ(j).
The value of the innovation is used to update the estimates of the parameters of
the model as follows:
b(j) = b(j - 1) + k(j) I(j)    (41.2)
(px1)  (px1)    (px1) (1x1)

Equation (41.2) is the first equation of the recursive algorithm. k(j) is a px1 vector,
called the gain vector. Looking in more detail at this gain vector, we see that it
weights the innovation. For k equal to the null vector, b(j) = b(j - 1), leaving the
model parameters unadapted. The larger k, the more weight is attributed to the
innovation and, as a consequence, to the last observation. Therefore, one may
intuitively understand that the gain vector depends on the confidence we have in
the estimated parameters. As usual this confidence is expressed in the variance-
covariance matrix of the parameters. Because this confidence varies during the
estimation process, it is indicated by P(j), the variance-covariance matrix of the
estimated parameters after j measurements.

One expects that during the measurement-prediction cycle the confidence in the
parameters improves. Thus, the variance-covariance matrix needs also to be
updated in each measurement-prediction cycle. This is done as follows [1]:

P(j) = P(j - 1) - k(j) x^T(j) P(j - 1)    (41.3)
(pxp)  (pxp)    (px1) (1xp)  (pxp)

Equation (41.3) is the second equation of the recursive filter, which expresses
the fact that the propagation of the measurement error depends on the design (X) of
the observations. Once the fitting process is complete the square root of the
diagonal elements of P give the standard deviations of the parameter estimates.
Key factor in the recursive algorithm is the gain vector k, which controls the
updating of the parameters as well as the updating of the variance-covariance
matrix P. The gain vector k is most affected by the variance of the experimental
error r(j) of the new observation y(j) and the uncertainty P(j - 1) in the parameters
b(j - 1). When the elements of P(j - 1) are much larger than r(j), the gain factor is
large, otherwise it is small. After each new measurement j the gain vector is
updated according to eq. (41.4):

k(j) = P(j - 1) x(j) (x^T(j) P(j - 1) x(j) + r(j))^-1    (41.4)
(px1)  (pxp)  (px1)  (1xp)   (pxp)  (px1)   (1x1)

The expression x^T(j) P(j - 1) x(j) in eq. (41.4) represents the variance of the
prediction, ŷ(j), at the value x(j) of the independent variable, given the uncertainty
in the regression parameters P(j). This expression is equivalent to eq. (10.9) for
ordinary least squares regression. The term r(j) is the variance of the experimental
error in the response y(j). How to select the value of r(j) and its influence on the
final result are discussed later. The expression between parentheses is a scalar.
Therefore, the recursive least squares method does not require the inversion of a
matrix. When inspecting eqs. (41.3) and (41.4), we can see that the variance-
covariance matrix only depends on the design of the experiments given by x and on
the variance of the experimental error given by r, which is in accordance with the
ordinary least-squares procedure.
Typically, a recursive algorithm needs initial estimates for b(0) and P(0) to start
the iteration process and an estimate of r(j) for all j during the measurement-
estimation process. When no prior information on the regression parameters is
available b(0) is usually set equal to zero. In many textbooks [1], it is recom-
mended to choose for P(0) a diagonal matrix with very large diagonal elements,
expressing the large uncertainty in the chosen starting values b(0). As explained
before, r(j) represents the variance of the experimental error in observation y(j).
Although the choice of P(0) and r(j) introduces some arbitrariness in the calcu-
lation, we show in an example later on that the gain vector (k) which fully controls
the updating of the estimates of the parameters is fairly insensitive to the chosen
values of r(j) and P(0). However, the obtained final value of P depends on a correct
estimation of r(j). Only when the value of r(j) is equal to the variance of the
experimental error, does P converge to the variance-covariance matrix of the
estimated parameters. Otherwise, no meaning can be attributed to P.
In summary, after each new measurement a cycle of the algorithm starts with the
calculation of the new gain vector (eq. (41.4)). With this gain vector the variance-

Fig. 41.1. Evolution of the innovation during the recursive estimation process (iteration steps) (see
Table 41.1).

covariance matrix (eq. (41.3)) is updated and the estimates of the parameters are
updated (eq. (41.2)). By monitoring the innovation sequence, the stability of the
iteration process can be followed. Initially, the innovation shows large fluctu-
ations, which gradually fade out to the level expected from the measurement noise
(Fig. 41.1). An innovation which fails to converge to the level of the experimental
error indicates that the estimation process is not completed and more observations
are required. However, P(0) also influences the convergence rate of the estimation
process as we show with the calibration example discussed below.
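The cycle of eqs. (41.1)-(41.4) can be condensed into a few lines of code. The sketch below is our own minimal implementation, using the initialisation of Table 41.1 (diagonal of P(0) equal to 1000, r = 25 10^-6) and illustrative variable names.

import numpy as np

def recursive_straight_line(x, y, r=25e-6, p0=1000.0):
    b = np.zeros(2)                          # b(0) = [intercept, slope]
    P = np.array([[p0, 1.0], [1.0, p0]])     # P(0)
    for xj, yj in zip(x, y):
        h = np.array([1.0, xj])              # design vector of the new observation
        innovation = yj - h @ b              # I(j) = y(j) - yhat(j), eq. (41.1)
        k = P @ h / (h @ P @ h + r)          # gain vector, eq. (41.4)
        P = P - np.outer(k, h) @ P           # variance-covariance update, eq. (41.3)
        b = b + k * innovation               # parameter update, eq. (41.2)
    return b, P

x = 0.1 * np.arange(1, 11)
y = [0.101, 0.196, 0.302, 0.403, 0.499, 0.608, 0.703, 0.801, 0.897, 1.010]
b, P = recursive_straight_line(x, y)
# b is approximately [-0.0014, 1.0060], as in the last row of Table 41.1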
By way of illustration, the regression parameters of a straight line with slope = 1
and intercept = 0 are recursively estimated. The results are presented in Table 41.1.
For each step of the estimation cycle, we included the values of the innovation,
variance-covariance matrix, gain vector and estimated parameters. The variance
of the experimental error of all observations y is 25 10^-6 absorbance units, which
corresponds to r = 25 10^-6 au for all j. The recursive estimation is started with a
high value (10^3) on the diagonal elements of P and a low value (1) on its
off-diagonal elements.
The sequence of the innovation, gain vector, variance-covariance matrix and
estimated parameters of the calibration lines is shown in Figs. 41.1-41.4. We can
clearly see that after four measurements the innovation is stabilized at the
measurement error, which is 0.005 absorbance units. The gain vector decreases
monotonously and the estimates of the two parameters stabilize after four measure-
ments. It should be remarked that the design of the measurements fully defines the
variance-covariance matrix and the gain vector in eqs. (41.3) and (41.4), as is the
case in ordinary regression. Thus, once the design of the experiments is chosen

TABLE 41.1

Recursive estimation of the parameters b0 and b1 of a straight line (see text for symbols; y are the measurements).
Initialisation: b^T(0) = [0 0]; r = 25 10^-6; P1,1(0) = P2,2(0) = 1000; P1,2(0) = P2,1(0) = 1

j    x(j)   y(j)    ŷ(j)    I(j)      b(j)       k(j)        P(j)
1    1      0.101   0       0.101     0.09999    0.99000      9.8990       -98.99
     0.1                              0.01010    0.09998    -98.99         989.9
2    1      0.196   0.102   0.094     0.0060    -1.0000       1.37 10^-4   -8.16 10^-4
     0.2                              0.9500     9.99995     -8.09 10^-4    5.249 10^-3
3    1      0.302   0.291   0.011    -0.0024    -0.7307       6.0 10^-5    -2.62 10^-4
     0.3                              1.0072     5.2029      -2.61 10^-4    1.303 10^-3
4    1      0.403   0.401  -0.002    -0.0032    -0.5254       3.73 10^-5   -1.26 10^-4
     0.4                              1.0138     3.0754      -1.26 10^-4    5.06 10^-4
5    1      0.499   0.504  -0.005    -0.0012    -0.4085       2.68 10^-5   -7.41 10^-5
     0.5                              1.0042     2.0236      -7.4 10^-5     2.49 10^-4
6    1      0.608   0.601   0.007    -0.0035    -0.334        2.10 10^-5   -4.89 10^-5
     0.6                              1.0138     1.433       -4.88 10^-5    1.41 10^-4
7    1      0.703   0.706  -0.003    -0.0025    -0.2838       1.7 10^-5    -3.5 10^-5
     0.7                              1.0104     1.0694      -3.5 10^-5     8.78 10^-5
8    1      0.801   0.806  -0.005    -0.0014    -0.2468       1.5 10^-5    -2.6 10^-5
     0.8                              1.0064     0.8290      -2.6 10^-5     5.84 10^-5
9    1      0.897   0.904  -0.007    -0.0002    -0.2185       1.27 10^-5   -2.02 10^-5
     0.9                              1.0015     0.6618      -2.02 10^-5    4.08 10^-5
10   1      1.010   1.002   0.0082   -0.0014    -0.1962       1.13 10^-5   -1.62 10^-5
     1.0                              1.0060     0.5408      -1.62 10^-5    2.97 10^-5

(x(j),j = 1, ..., n), one can predict how the variance-covariance matrix behaves
during the iterative process and decide on the number of experiments or decide on
the design itself. This is further explained in Section 41.3. The relative insensitivity
of the estimation process for the initial value of P is illustrated by repeating the
calculation with the diagonal elements P(l,l) and P(2,2) of P set equal to 100
instead of equal to 1000 (see Table 41.2). As can be seen from Table 41.2 the gain
vector, the variance-covariance matrix and the estimated regression parameters
rapidly converge to the same values. Also using unrealistically high values for the
experimental error (e.g. r(j) = 1) does not affect the convergence of the gain factors
too much as long as the diagonal elements of P remain high. However, we also

Fig. 41.2. Gain factor during the recursive estimation process (see Table 41.1).

Fig. 41.3. Evolution of the diagonal elements of the variance-covariance matrix (P) during the estima-
tion process (see Table 41.1).

observe that P no longer represents the variance-covariance matrix of the


regression parameters. If we start with a high r(j) value with respect to the diagonal
elements of P(0) (e.g. 1:100), assuming a large experimental error compared to the
confidence in the model parameters, the convergence is slow. This is indicated by
comparing the innovation sequence for the ratio r(j) to P(0) equal to 1:1000 and
1:100 in Table 41.2. In recursive regression, new observations steadily receive a
lower weight, even when the variance of the experimental error is constant
(homoscedastic). Consequently, the estimated regression parameters are generally
not exactly equal to the values obtained by ordinary least squares (OLS).

Fig. 41.4. Evolution of the model parameters during the iteration process (see Table 41.1).

This straight line example exemplifies the fact that by exploiting the recursive
approach alternative calibration practices are possible. The standard practice in
calibration is to measure calibration standards prior to the analysis of the unknown
samples and to construct a calibration line. Thereafter one starts the analysis of a
sample batch. To check whether the analysis system performs well, each batch of
samples is preceded by a check sample with a known concentration. The results of
the check sample are plotted in a control chart (see Chapter 7). When the value
exceeds the warning or action limits, one may decide to re-establish the calibration
line by remeasuring a new set of calibration standards. This procedure disregards
the knowledge on the calibration factors already available from the previous
calibration run. By applying the recursive algorithm, it is in principle sufficient to
remeasure a limited set of calibration standards and to update the estimates of the
calibration factors already available from the previous calibration step. However,
as can be seen from the example given in Table 41.1, this would mean that the
weight attributed to the new measurements steadily decreases (small gain vector).
This leads to the undesirable situation where the calibration factors are hardly
updated. However, it should be borne in mind that the reason to decide to
recalibrate was that during the measurement of the unknown samples, the calibration
factors may have changed. As a result, the uncertainty in the calibration factors was
judged to be unacceptably large, requiring recalibration. Expressed in terms of a
system, the system state (here slope and intercept) has not been observed for some
time. Because this system state varies in time (in an unknown deterministic or
stochastic way), the uncertainty about the system state increases and the state has
to be observed again. This uncertainty is taken into account in the Kalman filter by
the system equation, which is discussed in the next section. One of the methods to
584

TABLE 41.2

Influence of the initial values of the diagonal elements of the variance-covariance matrix P(0) and of the variance of
the experimental error on the gain vector and the innovation sequence (see Table 41.1 for the experimental values, y)

         P1,1(0) = P2,2(0) = 1000                              P1,1(0) = P2,2(0) = 100
         r = 25 10^-6           r = 1                          r = 25 10^-6           r = 1
j        k1     k2     I        k1     k2     I                k1     k2     I        k1     k2     I
1        0.99   0.10   0.101    0.98   0.10   0.101            0.99   0.11   0.101    0.98   0.11   0.101
2       -1.0   10.0    0.094   -0.75   8.31   0.094           -1.0    9.99   0.094    0.00   3.33   0.095
3       -0.73   5.20   0.011   -0.62   4.76   0.035           -0.66   4.99   0.011   -0.33   3.31   0.105
4       -0.53   3.08   0.002   -0.48   2.94   0.011           -0.50   2.99   0.002   -0.37   2.48   0.069
5       -0.41   2.02  -0.005   -0.39   1.98   0.0005          -0.40   1.99  -0.045   -0.34   1.80   0.034
6       -0.33   1.43   0.007   -0.33   1.42  -0.010           -0.33   1.42   0.007   -0.30   1.34   0.034
7       -0.28   1.07  -0.003   -0.28   1.07  -0.001           -0.28   1.07  -0.003   -0.27   1.03   0.016
8       -0.24   0.83  -0.005   -0.25   0.83  -0.003           -0.25   0.83  -0.004   -0.24   0.81   0.009
9       -0.22   0.66  -0.007   -0.22   0.66  -0.006           -0.22   0.66  -0.007   -0.21   0.65   0.003
10      -0.19   0.54   0.008   -0.20   0.54   0.009           -0.20   0.54   0.008   -0.19   0.54   0.016

j        P1,1      P2,2         P1,1      P2,2                 P1,1      P2,2         P1,1      P2,2
1        9.9       989.9        9.9       989.9                0.988     98.8         1.96      98.8
2        1.4 E-4   5.2 E-3      4.2       166                  1.2 E-4   4.9 E-3      1.96      65.5
3        6.0 E-5   1.3 E-3      2.23      47.5                 5.7 E-5   1.2 E-3      1.64      32.8
4        3.7 E-5   5.0 E-4      1.47      19.6                 3.7 E-5   4.9 E-4      1.27      16.5
5        2.7 E-5   2.5 E-4      1.09      9.89                 2.7 E-5   2.5 E-4      1.01      9.01
6        2.1 E-5   1.4 E-4      0.86      5.67                 2.1 E-5   1.4 E-4      0.82      5.37
7        1.7 E-5   8.8 E-5      0.71      3.56                 1.8 E-5   8.8 E-5      0.69      3.43
8        1.5 E-5   5.8 E-5      0.60      2.37                 1.5 E-5   5.9 E-5      0.59      2.31
9        1.3 E-5   4.1 E-5      0.52      1.66                 1.3 E-5   4.1 E-5      0.52      1.63
10       1.1 E-5   3.0 E-5      0.47      1.21                 1.1 E-5   3.0 E-5      0.46      1.19

assess this uncertainty as a function of time is by monitoring (e.g., in a Shewhart


control chart) the value of the calibration factors obtained by a traditional calib-
ration procedure, and modelling them by, e.g., an autoregressive model (see
Section 20.4). However, because analytical chemists are unfamiliar with this type
of filter, and the recursive approach is not simple, Kalman filters are hardly applied
in calibration.

41.3 Recursive multicomponent analysis

At this point we introduce the formal notation, which is commonly used in


literature, and which is further used throughout this chapter. In the new notation we
replace the parameter vector b in the calibration example by a vector x, which is
called the state vector. In the multicomponent kinetic system the state vector x
contains the concentrations of the compounds in the reaction mixture at a given
time. Thus x is the vector which is estimated by the filter. The response of the
measurement device, e.g., the absorbance at a given wavelength, is denoted by z.
The absorptivities at a given wavelength, which relate the measured absorbance to
the concentrations of the compounds in the mixture, or the design matrix in the
calibration experiment (x in eq. (41.3)), are denoted by h^T.
Because z(j) = ẑ(j) + v(j) and ẑ(j) = h^T(j) x(j - 1), the measurement equation given
by eq. (41.1) is transformed into:

z(j) = h^T(j) x(j - 1) + v(j)    for j = 0, 1, 2, ...    (41.5)
(1x1)  (1xp)   (px1)    (1x1)

where v(j) includes the random measurement error (not to be confounded with r(j),
which is the variance of the random measurement error) and the systematic error
caused by the bias in the model parameters, x(j - 1). The equations for the variance-
covariance update and Kalman gain update are adapted accordingly. The overall
sequence of the equations is given in Table 41.3. It should be noted that here the
state vector x is time-invariant.

TABLE 41.3
Kalman filter algorithm equations for time-invariant system states

Initialisation: x(0), P(0)

Kalman gain update
k(j) = P(j - 1) h(j) (h^T(j) P(j - 1) h(j) + r(j))^-1    (41.6)
(px1)  (pxp)  (px1)  (1xp)   (pxp)  (px1)    (1x1)

Variance-covariance update
P(j) = P(j - 1) - k(j) h^T(j) P(j - 1)    (41.7)
(pxp)  (pxp)    (px1) (1xp)  (pxp)

Measurement z(j)

State update
x(j) = x(j - 1) + k(j) (z(j) - h^T(j) x(j - 1))    (41.8)
(px1)  (px1)    (px1)  (1x1)  (1xp)  (px1)

TABLE 41.4
Absorptivities of Cl2 and Br2 in chloroform [2]

Wavenumber        Absorptivities (x 10^3)          Absorbance of sample
(cm^-1/10^3)      Cl2          Br2
22                4.5          168                 0.0341
24                8.4          211                 0.0429
26                20.0         158                 0.0335
28                56           30                  0.0117
30                100          4.7                 0.0110
32                71           5.3                 0.0080

By way of illustration we work out the two-component analysis of chlorine-


bromine by UV-Vis spectrometry. In Table 41.4 the absorptivities of Cl2 and Br2 in
chloroform at six wavenumbers are given [2], with the absorbances measured for a
supposedly unknown sample with concentrations c_Cl2 = 0.1 and c_Br2 = 0.2. Con-
centrations of CI2 and Br2 are estimated by recursive regression. With this example
we want to demonstrate that the evolution of the estimations of the unknown
concentration depends on the design of the experiments, here the set of wave-
lengths and the order in which these wavelengths are selected. A measure for the
precision of the estimation process is the variance-covariance matrix P. As said
before, this precision can be evaluated before any measurement has been carried
out. The two following wavelength sequences are evaluated: sequence A, 22-32
10^3 cm^-1 in ascending order, and sequence B, 22, 32, 24, 30, 26, 28 10^3 cm^-1. The
measurement equation for the multicomponent system is given by:

ẑ(j) = h^T(j) x(j - 1)

where ẑ(j) is the predicted absorbance at the jth measurement, using the latest
estimated concentrations x(j - 1) obtained after the (j - 1)th measurement. h^T(j)
contains the absorptivities of Cl2 and Br2 at the wavelength chosen for the jth
measurement.
Step 1. Initialisation (j = 0)

P(0) = | 1000    1 |        x(0) = | 0 |
       |    1 1000 |               | 0 |

r(j) = 1 for all j


Step 2. Update of the Kalman gain vector k(1) and variance-covariance matrix
P(1)

k(1) = P(0) h(1) (h^T(1) P(0) h(1) + 1)^-1
P(1) = P(0) - k(1) h^T(1) P(0)

This gives for design A:

k(1) = | 1000    1 | | 0.0045 | ( [0.0045  0.168] | 1000    1 | | 0.0045 | + 1 )^-1 = | 0.1596 |
       |    1 1000 | | 0.168  |                   |    1 1000 | | 0.168  |            | 5.744  |

P(1) = | 1000    1 | - | 0.1596 | [0.0045  0.168] | 1000    1 | = | 999.2  -25.8  |
       |    1 1000 |   | 5.744  |                 |    1 1000 |   | -25.8   34.88 |

Step 3. Predict the value of the first measurement ẑ(1)

ẑ(1) = [0.0045  0.168] | 0 | = 0
                       | 0 |

Step 4. Obtain the first measurement: z(1) = 0.0341.
Step 5. Update the predicted concentrations: x(1) = x(0) + k(1) (z(1) - ẑ(1))

x(1) = | 0 | + | 0.16 | (0.0341 - 0) = | 0.0054 |
       | 0 |   | 5.7  |                | 0.196  |

Step 6. Return to step 2.
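In compact form, the same recursive cycle can be sketched as follows for design A; the code and variable names are ours (a sketch, not the authors' implementation), with the initialisation of step 1 and the data of Table 41.4.

import numpy as np

H = np.array([[4.5, 168], [8.4, 211], [20.0, 158],     # absorptivities of Cl2 and Br2
              [56, 30], [100, 4.7], [71, 5.3]]) * 1e-3  # table values are x 10^3
z = np.array([0.0341, 0.0429, 0.0335, 0.0117, 0.0110, 0.0080])  # measured absorbances

x = np.zeros(2)                               # x(0): concentrations of Cl2 and Br2
P = np.array([[1000.0, 1.0], [1.0, 1000.0]])  # P(0)
r = 1.0
for h, zj in zip(H, z):
    k = P @ h / (h @ P @ h + r)               # Kalman gain, eq. (41.6)
    P = P - np.outer(k, h) @ P                # covariance update, eq. (41.7)
    x = x + k * (zj - h @ x)                  # state update, eq. (41.8)
# x converges towards [0.096, 0.198], cf. Table 41.6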


These steps are summarized in Tables 41.5 and 41.6. The concentration estimates
should be compared with the true values 0.1 and 0.2 respectively. For design B the
results listed in Table 41.7 are obtained. From both designs a number of interesting
conclusions follow.
(1) The set of selected wavelengths (i.e. the experimental design) affects the
variance-covariance matrix, and thus the precision of the results. For
example, the set 22, 24 and 26 (Table 41.5) gives a less precise result than
the set 22, 32 and 24 (Table 41.7). The best set of wavelengths can be
derived in the same way as for multiple linear regression, i.e. the determinant
of the dispersion matrix (h^T h), which contains the absorptivities,
should be maximized.

TABLE 41.5

Calculated gain vector k and variance-covariance matrix P; P1,1(0) = P2,2(0) = 1000; P1,2(0) = P2,1(0) = 1; r = 1

Step   Wavelength   k(n)                 P(n)
                    k1       k2          P1,1    P2,2    P1,2 = P2,1
1      22            0.16    5.74        999.2   34.88   -25.88
2      24            1.16    2.82        995.8   14.73   -34.12
3      26            9.37    1.061       859.7   12.98   -49.53
4      28           13.17   -0.67        245.0   11.4    -18.11
5      30            7.11   -0.52         71.4   10.5     -5.6
6      32            3.71   -0.25         52.6   10.4     -4.3

TABLE 41.6
Estimated concentrations (see Table 41.5 for the starting conditions)

Step   Wavelength   x1 (Cl2)   x2 (Br2)
1      22           0.0054     0.196
2      24           0.0072     0.2004
3      26           0.023      0.2022
4      28           0.0802     0.1993
5      30           0.0947     0.1982
6      32           0.0959     0.198

TABLE 41.7
Calculated gain vector, variance-covariance matrix and estimated concentrations (see Table 41.5 for the
starting conditions)

Step   Wavelength   k(n)                 P(n)                             x(n)
                    k1       k2          P1,1    P2,2    P1,2 = P2,1      x1       x2
1      22            0.16    5.744       999.2   34.88   -25.88           0.0054   0.196
2      32           11.76   -0.27        166.2   34.43    -6.42           0.0825   0.194
3      24            0.016   2.859       166.2   13.81    -6.54           0.0825   0.198
4      30            6.24   -0.22         62.6   13.68    -2.86           0.0938   0.197
5      26            0.59    1.560        62.1   10.4     -4.1            0.0941   0.198
6      28            2.82    0.069        52.2   10.4     -4.3            0.0955   0.198

TABLE 41.8
Concentration estimation with the optimal set of wavelengths (see Table 41.5 for the starting conditions)

Step   Wavelength   k(n)                 P(n)                             x(n)
                    k1       k2          P1,1    P2,2    P1,2 = P2,1      x1       x2
1      22            0.16    5.744       999.2   34.88   -25.88           0.0054   0.196
2      32           11.76   -0.27        166.2   34.43    -6.42           0.0825   0.1942
3      30            6.244  -0.18         62.6   34.34    -3.42           0.0940   0.1939

(2) From the evolution of P in design B (Table 41.7), one can conclude that the
measurement at wavenumber 24 10^3 cm^-1 does not really improve the
estimates already available after the sequence 22, 32. Equally, the measure-
ment at 26 10^3 cm^-1 does not improve the estimates already available after
the sequence 22, 32, 24 and 30 10^3 cm^-1. This means that these wavelengths
do not contain new information. Therefore, a possibly optimal set of wave-
numbers is 22, 32 and 30. Inclusion of a fourth wavelength, namely at 28 10^3
cm^-1, probably does not improve the estimates of the concentrations already
available, since the value of P converged to a stable value. To confirm this
conclusion, the recursive regression was repeated for the set of wave-
lengths 22, 32 and 30 10^3 cm^-1 (see Table 41.8). Thyssen et al. [3] devel-
oped an algorithm for the selection of the best set of m wavelengths out of
n. Instead of having to calculate 10^12 determinants to find the best set of six
wavelengths out of 300, the recursive approach only needs to evaluate a
rather straightforward equation 420 times. The influence of the measure-
ment sequence on the speed of convergence is well demonstrated for the
four-component analysis (Fig. 41.5) of a mixture of aniline, azobenzene,
nitrobenzene and azoxybenzene [3]. In the forward scan mode a quick
convergence is attained, whereas in the backward scan mode, convergence is
slower. Using an optimized sequence, convergence is complete after less
than seven measurements (Fig. 41.6). Other methods for wavelength
selection are discussed in Chapters 10 and 27.

41.4 System equations

When discussing the calibration and multicomponent analysis examples in


previous sections, we mentioned that the parameters to be estimated are not
necessarily constant but may vary in time. This variation is taken into account by

" K
M •
1 .6 " K

X
K
M
K M
X
1.2 «"" K
•". X * X X

X Z

0.8 r> *-

rzi
AA _
xa^ie""
-4

1 1 1 1 1 1 1 1 1- > <
14 28 42 56 70 —> k

0.8 +

0.4 \

78 —> k

Fig. 41.5. Multicomponent analysis (aniline (jci), azobenzene fe), nitrobenzene fe) and azoxy-
benzene (JC4)) by recursive estimation (a) forward run of the monochromator (b) backward run (k indi-
cates the sequence number of the estimates; solid lines are the concentration estimates; dotted lines are
the measurements z).
591

Fig. 41.6. Multicomponent analysis (see Fig. 41.5) with an optimized wavelength sequence.

the system equation. As explained before, the system equation describes how the
system changes in time. In the kinetics example the system equation describes the
change of the concentrations as a function of time. This is a deterministic system
equation. The random fluctuation of the slope and intercept of a straight line can be
described by a stochastic model, e.g., an autoregressive model (see Chapter 20).
Any unmodelled system fluctuations left are included in the system noise, w(j).
The system equation is usually expressed in the following way:

x(j) = F(j, j - 1) x(j - 1) + w(j)    (41.9)

where F(j, j - 1) is the system transition matrix, which describes how the system
state changes from time t_{j-1} to time t_j. The vector w(j) consists of the noise
contributions to each of the system states. These are system fluctuations which are
not modelled by the system transition matrix. The parameters of the measurement
equation, the h-vector and system transition matrix for the kinetics and calibration
model are defined in Table 41.9. In the next two sections we derive the system
equations for a kinetics and a calibration experiment. System state equations are
not easy to derive. Their form depends on the particular system under
consideration and no general guidance can be given for their derivation.

TABLE 41.9
Definition of the state parameters (x), h-vector and transition matrix (F) for two systems

                         State parameters (x)          h-vector                            Transition matrix (F)
1. Calibration           Slope and intercept of the    [1 c] where c is the concentra-     time constant of the
                         calibration line at time t    tion of the calibration standard    variations of slope
                                                       measured at time t                  and intercept
2. 1st-order kinetics    Concentrations of A and B     Absorbance coefficients of A        1st-order reaction rate
   A -> B, monitored     at time t                     and B at the wavelength of the
   by UV-Vis                                           reading at time t

41.4.1 System equation for a kinetics experiment

Let us assume that a kinetics experiment is carried out and we want to follow the
concentrations of component B which is formed from A by the reaction: A -> B.
For a first-order reaction, the concentrations of A (= x1) and B (= x2) as a function
of time are described by two differential equations:

dx1/dt = -k1 x1
dx2/dt = k1 x1

which can be rewritten in the following recursive form:

x1(t + 1) = (1 - k1) x1(t) + w1(t)    (41.10)
x2(t + 1) = k1 x1(t) + x2(t) + w2(t)
w indicates the error due to the discretization of the differential equations. These
two equations describe the concentrations of A and B as a function of time, or in
other words, they describe the state of the system, which in vector notation becomes:

x(t + 1) = F x(t) + w(t)    (41.11)

with

x^T(t + 1) = [x1(t + 1)  x2(t + 1)]
w^T(t + 1) = [w1(t + 1)  w2(t + 1)]

and the transition matrix equal to:

F = | 1 - k1   0 |
    |   k1     1 |
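As a small illustration, the transition matrix of eq. (41.11) can be used to propagate the concentrations step by step. The value k1 = 0.1 anticipates the example of Section 41.5.2; the omission of system noise is our simplification.

import numpy as np

k1 = 0.1
F = np.array([[1 - k1, 0.0],
              [k1,     1.0]])        # transition matrix of eq. (41.11)
x = np.array([1.0, 0.0])             # initial state: [A] = 1, [B] = 0
states = [x]
for _ in range(20):
    x = F @ x                        # x(t+1) = F x(t), no system noise added
    states.append(x)
# states[t] approximates [exp(-k1 t), 1 - exp(-k1 t)] for small k1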

41.4.2 System equation of a calibration line with drift

In this section we derive a system equation which describes a drifting calibration


line. Let us suppose that the intercept x1(j + 1) at a time j + 1 is equal to x1(j) at a
time j augmented by a value α(j), which is the drift. By adding a non-zero system
noise w_α to the drift, we express the fact that the drift itself is also time dependent.
This leads to the following equations [5,6]:

x1(j + 1) = x1(j) + α(j)
α(j + 1) = α(j) + w_α(j + 1)

which is transformed into the following matrix notation:

| x1(j + 1) |   | 1  1 | | x1(j) |   |     0      |
| α(j + 1)  | = | 0  1 | | α(j)  | + | w_α(j + 1) |        (41.12)

A similar equation can be derived for the slope, where β is the drift parameter:

| x2(j + 1) |   | 1  1 | | x2(j) |   |     0      |
| β(j + 1)  | = | 0  1 | | β(j)  | + | w_β(j + 1) |        (41.13)

Equations (41.12) and (41.13) can now be combined in a single system model
which describes the drift in both parameters:

| x1(j + 1) |   | 1  0  1  0 | | x1(j) |   |     0      |
| x2(j + 1) | = | 0  1  0  1 | | x2(j) | + |     0      |        (41.14)
| α(j + 1)  |   | 0  0  1  0 | | α(j)  |   | w_α(j + 1) |
| β(j + 1)  |   | 0  0  0  1 | | β(j)  |   | w_β(j + 1) |

or x(j + 1) = F x(j) + w(j + 1) with

    | 1  0  1  0 |            | x1 |
F = | 0  1  0  1 |    and x = | x2 |
    | 0  0  1  0 |            | α  |
    | 0  0  0  1 |            | β  |

F describes how the system state changes from time t_j to t_{j+1}.
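A brief sketch of one propagation step of this drift model is given below; the numerical values of the state and of the system noise are arbitrary illustrations, not values from the text.

import numpy as np

F = np.array([[1, 0, 1, 0],     # intercept picks up the drift term alpha
              [0, 1, 0, 1],     # slope picks up the drift term beta
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)

x = np.array([0.01, 1.00, 0.001, -0.0005])   # [intercept, slope, alpha, beta]
w = np.array([0.0, 0.0, 1e-4, 1e-4])         # system noise enters via the drift terms
x_next = F @ x + w                            # x(j+1) = F x(j) + w(j+1), eq. (41.14)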

For the time-invariant calibration model discussed in Section 41.2, eq. (41.14)
reduces to:

| x1(j + 1) |   | 1  0 | | x1(j) |
| x2(j + 1) | = | 0  1 | | x2(j) |

where x1 = intercept and x2 = slope.

41.5 The Kalman filter

41.5.1 Theory

In Sections 41.2 and 41.3 we applied a recursive procedure to estimate the


model parameters of time-invariant systems. After each new measurement, the
model parameters were updated. The updating procedure for time-variant systems
consists of two steps. In the first step the system state x(j - 1) at time t_{j-1} is
extrapolated to the state x(j) at time t_j by applying the system equation (eq. (41.15)
in Table 41.10). At time t_j a new measurement is carried out and the result is used to
TABLE 41.10
Kalman filter algorithm equations

Initialisation: x(0), P(0)

State extrapolation: system equation
x(j|j - 1) = F(j) x(j - 1|j - 1) + w(j)        (41.15)

Covariance extrapolation
P(j|j - 1) = F(j) P(j - 1|j - 1) F^T(j) + Q(j - 1)        (41.16)

New measurement: z(j)

Measurement equation
z(j) = h^T(j) x(j|j - 1) + v(j)    for j = 0, 1, 2, ...

Covariance update
P(j|j) = P(j|j - 1) - k(j) h^T(j) P(j|j - 1)        (41.17)

Kalman gain update
k(j) = P(j|j - 1) h(j) (h^T(j) P(j|j - 1) h(j) + r(j))^-1        (41.18)

State update
x(j|j) = x(j|j - 1) + k(j) (z(j) - h^T(j) x(j|j - 1))        (41.19)

update the state x(j) (eq. (41.19) in Table 41.10). In order to make a distinction
between state extrapolations by the system equation and state updates when
making a new observation, a double index (j|j - 1) is used. The index (j|j - 1)
indicates the best estimate at time t_j, based on measurements obtained up to and
including those obtained at point t_{j-1}.
Equations (41.15) and (41.19) for the extrapolation and update of system states
form the so-called state-space model. The solution of the state-space model has
been derived by Kalman and is known as the Kalman filter. Assumptions are that
the measurement noise v(j) and the system noise w(;) are random and independent,
normally distributed, white and uncorrected. This leads to the general formulation
of a Kalman filter given in Table 41.10. Equations (41.15) and (41.19) account for
the time dependence of the system. Eq. (41.15) is the system equation which tells
us how the system behaves in time (here inj units). Equation (41.16) expresses
how the uncertainty in the system state grows as a function of time (here inj units)
if no observations would be made. Q(/ - 1) is the variance-covariance matrix of
the system noise which contains the variance of w.
The algorithm is initialized in the same way as for a time-invariant system. The
sequence of the estimations is as follows:
Cycle 1
1. Obtain initial values for x(0|0), k(0) and P(0|0)
2. Extrapolate the system state (eq. (41.15)) to x(1|0)
3. Calculate the associated uncertainty (eq. (41.16)) P(1|0)
4. Perform measurement no. 1
5. Update the gain vector k(1) (eq. (41.18)) using P(1|0)
6. Update the estimate x(1|0) to x(1|1) (eq. (41.19)) and the associated
uncertainty P(1|1) (eq. (41.17))
Cycle 2
1. Extrapolate the estimate of the system state to x(2|1) and the associated
uncertainty P(2|1)
2. Perform measurement no. 2
3. Update the gain vector k(2) using P(2|1)
4. Update (filter) the estimate of the system state to x(2|2) and the associated
uncertainty P(2|2)
and so on.
In the next section, this cycle is demonstrated on the kinetics example intro-
duced in Sections 41.1 and 41.4.
Time-invariant systems can also be solved by the equations given in Table
41.10. In that case, F in eq. (41.15) is substituted by the identity matrix. The system
state, x(j), of time-invariant systems converges to a constant value after a few
cycles of the filter, as was observed in the calibration example. The system state,
x(j), of time-variant systems is obtained as a function of j, for example the
concentrations of the reactants and reaction products in a kinetic experiment
monitored by a spectrometric multicomponent analysis.
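
The cycle described above translates almost directly into code. The following sketch (our own Python wording, not taken from the cited references) performs one extrapolation/update cycle of Table 41.10 for a scalar measurement z(j):

    import numpy as np

    def kalman_step(x, P, z, F, h, Q, r):
        """One cycle of Table 41.10 for a scalar measurement z."""
        # state and covariance extrapolation, eqs. (41.15) and (41.16); w is taken as zero mean
        x_pred = F @ x
        P_pred = F @ P @ F.T + Q
        # Kalman gain, eq. (41.18)
        k = P_pred @ h / (h @ P_pred @ h + r)
        # state and covariance update, eqs. (41.19) and (41.17)
        innovation = z - h @ x_pred
        x_upd = x_pred + k * innovation
        P_upd = P_pred - np.outer(k, h) @ P_pred
        return x_upd, P_upd, innovation

Calling kalman_step once per new measurement, with F set to the identity matrix and Q to zero, reproduces the recursive estimation of a time-invariant system.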

41.5.2 Kalman filter of a kinetics model

Equation (41.11) represents the (deterministic) system equation which describes
how the concentrations vary in time. In order to estimate the concentrations of the
two compounds as a function of time during the reaction, the absorbance of the
mixture is measured as a function of wavelength and time. Let us suppose that the
pure spectra (absorptivities) of the compounds A and B are known and that at a
time t the spectrometer is set at a wavelength giving the absorptivities h^T(t). The
system and measurement equations can now be solved by the Kalman filter given
in Table 41.10. By way of illustration we work out a simplified example of a
reaction with a true reaction rate constant equal to k1 = 0.1 min⁻¹ and an initial
concentration x1(0) = 1. The concentrations are measured spectrophotometrically,
1 minute after the start of the reaction and thereafter every 5 minutes. Each time a new
measurement is performed, the last estimate of the concentration of A is updated. By
substituting that concentration in the system equation x1(t) = x1(0) exp(-k1 t) we
obtain an update of the reaction rate k1. With this new value the concentration of A
is extrapolated to the point in time that a new measurement is made. The results for
three cycles of the Kalman filter are given in Table 41.11 and in Fig. 41.7. The

"c ik 1?
(0 I i
•G
(Q
g> 1
O
C
o 08
(Q

C
8 0.6
oC
o
0.4

0.2
o o o o
0 _l I I l_ _J I \ L_

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
• time

Fig. 41.7. Concentrations of the reactant A (reaction A → B) as a function of time (dotted line) (cA = 1,
cB = 0); • state updates (after a new measurement), ○ state extrapolations to the next measurement
(see Table 41.11 for Kalman filter settings).

TABLE 41.11
The prediction of concentrations and reaction rate constant by a Kalman filter (x1(0) = 1, k1 = 0.1 min⁻¹)

Time  Concentrations                Wave-     Absorptivities   Absorbance     Estimate of the concentration of A
      continuous     discrete       length j  A      B         z      ẑ       state update    state extrapolation 1)
      A      B       A      B                                                 (41.19)         (41.15)

0     1      0       1      0                                                 x(0|0) 1.00
1     0.90   0.10    0.90   0.10    → 1       10     1         9.1    10      x(1|1) 0.91     x(1|0) 0.90
2     0.82   0.18    0.81   0.19                                                              0.82
3     0.74   0.28    0.73   0.27                                                              0.74
4     0.67   0.33    0.66   0.34                                                              0.67
5     0.61   0.39    0.59   0.41    → 2       5      2         3.8    3.87    x(2|2) 0.63     x(2|1) 0.62
6     0.55   0.45    0.53   0.47                                                              0.57
7     0.50   0.50    0.48   0.52                                                              0.52
8     0.45   0.55    0.43   0.57                                                              0.48
9     0.41   0.59    0.39   0.61                                                              0.43
10    0.37   0.63    0.35   0.65    → 3       2      5         3.97   3.81    x(3|3) 0.38     x(3|2) 0.40
11    0.33   0.67    0.31   0.69                                                              0.34
12    0.29   0.71    0.28   0.72                                                              0.31
13    0.27   0.73    0.25   0.75                                                              0.28
14    0.25   0.75    0.23   0.77                                                              0.26
15    0.22   0.78    0.21   0.79    → 4       7      2         3.10   3.15    x(4|4) 0.16     x(4|3) 0.23
16    0.20   0.80    0.18   0.82                                                              0.15
17    0.18   0.82    0.17   0.83                                                              0.13
18    0.16   0.84    0.15   0.85                                                              0.12
19    0.15   0.85    0.14   0.86                                                              0.11
20                                                                                            0.10

1) State extrapolations in italic are obtained by substituting the last estimate of k1, equal to -ln(x(j|j))/t, in the
system equation (41.15).

dashed line in Fig. 41.7 shows the true evolution of the concentration of the
reactant A. For the sake of simplicity we have not included the covariance
extrapolation (eq. (41.16)) in the calculations. Although more cycles are needed
for a convincing demonstration that a Kalman filter can follow time-varying states,
the example clearly shows its principle.
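
The extrapolation rule quoted in the footnote of Table 41.11 can be verified with a few lines of Python; the numbers below are read directly from the table.

    import numpy as np

    # (t, x(j|j), next measurement time) read from Table 41.11
    for t, x_upd, t_next in [(1, 0.91, 5), (5, 0.63, 10), (10, 0.38, 15)]:
        k1 = -np.log(x_upd) / t                        # last estimate of the rate constant
        x_extrap = x_upd * np.exp(-k1 * (t_next - t))  # extrapolation to the next measurement
        print(f"t = {t:2d} min: k1 = {k1:.3f} min-1, x1 extrapolated to t = {t_next}: {x_extrap:.2f}")
    # prints 0.62, 0.40 and 0.23, the state extrapolations listed in Table 41.11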

41.5.3 Kalman filtering of a calibration line with drift

The measurement model of the time-invariant calibration system (eq. (41.5))
should now be expanded in the following way:

    z(j) = h^T(j) x(j - 1) + v(j)   for j = 0, 1, 2, ...        (41.20)

where

    h^T(j) = [1  c(j)  0  0]

c(j) is the concentration of the analyte in the jth calibration sample

    x^T(j - 1) = [b0  b1  α  β]

The model contains four parameters, the slope and intercept of the calibration line
and two drift parameters α and β. All four parameters are estimated by applying
the algorithm given in Table 41.10. Details of this procedure are given in Ref. [5].

41.6 Adaptive Kalman filtering

In previous sections we demonstrated that a major advantage of Kalman


filtering over ordinary least squares (OLS) procedures is that it can handle time-
varying systems, e.g., a drifting calibration parameter and drifting background. In
this section another feature of Kalman filters is demonstrated, namely its capability
of handling incomplete measurement models or model errors. An example of an
incomplete measurement model is multicomponent analysis in the presence of an
unknown interfering compound. If the identity of this interference is unknown, one
cannot select wavenumbers where only the analytes of interest absorb. Therefore,
solution by OLS may lead to large errors in the estimated concentrations. The
occurrence of such errors may be detected by inspecting the difference between the
measured absorbance for the sample and the absorbance estimated from the
predicted concentrations (see Chapter 10). However, inspection of PRESS does
not provide information on which wavelengths are not selective. One finds that the
result is wrong without an indication on how to correct the error. Another type of
model error is a curving calibration line which is modelled by a straight line model.
The size and pattern of the residuals usually indicate that there is a model error (see
Chapter 8), after which the calculations may be repeated with another model, e.g.,
a quadratic curve.
The recursive property of the Kalman filter allows the detection of such model
deviations, and offers the possibility of disregarding the measurements in the
region where the model is invalid. This filter is the so-called adaptive Kalman
filter.

41.6.1 Evaluation of the innovation

Before we can apply an adaptive filter, we should define a criterion to judge the
validity of the model to describe the measurements. Such a criterion can be based
on the innovation defined in Section 41.2. The concept of innovation, i, has been
introduced as a measure of how well the filter predicts new observations:

    i(j) = z(j) - h^T(j) x(j - 1) = z(j) - ẑ(j)

where z(j) is the jth measurement, x(j - 1) is the estimate of the model parameters
after j - 1 observations, and h^T(j) is the design vector.
Thus i(j) is a measure of the predictive ability of the model. For the calibration
example discussed in Section 41.2, x(j - 1) contains the slope and intercept of the
straight line, and h^T(j) is equal to [1 c(j)] with c(j) the concentration of the
calibration standard for the jth calibration measurement. For the multicomponent
analysis (MCA), x(j - 1) contains the estimated concentrations of the analytes after
j - 1 observations, and h^T(j) contains the absorptivities of the analytes at wave-
length j.
It can be shown [4] that the innovations of a correct filter model applied on data
with Gaussian noise follow a Gaussian distribution with a mean value equal to
zero and a standard deviation equal to the experimental error. A model error means
that the design vector h in the measurement equation is not adequate. If, for
instance, in the calibration example the model was quadratic, h^T should be [1 c(j)
c(j)²] instead of [1 c(j)]. In the MCA example h^T(j) is wrong if the absorptivities of
some absorbing species are not included. Any error in the design vector h^T appears
as a non-zero mean of the innovation [4]. One also expects the sequence of the
innovations to be random and uncorrelated. This can be checked by an investigation
of the autocorrelation function (see Section 20.3) of the innovation.
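
In practice such a check amounts to verifying that the recorded innovation sequence has a mean close to zero and a negligible autocorrelation. A minimal Python sketch (our own helper, not part of any cited software):

    import numpy as np

    def innovation_diagnostics(innovations):
        """Mean, standard deviation and lag-1 autocorrelation of an innovation sequence.
        For a correct model the mean should be near zero and the lag-1 autocorrelation negligible."""
        i = np.asarray(innovations, dtype=float)
        d = i - i.mean()
        lag1 = np.sum(d[:-1] * d[1:]) / np.sum(d * d)
        return i.mean(), i.std(ddof=1), lag1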

41.6.2 The adaptive Kalman filter model

The principle of adaptive filtering is based on evaluating the innovation
calculated at each new observation and comparing this value to the theoretically
expected value. If the (absolute) value of the innovation is larger than a given
criterion, the observation is disregarded and not used to update the estimates of the
parameters. Alternatively, one can eliminate the influence of that observation by
artificially increasing its measurement variance r(j), which effectively attributes a
low weight to the observation.
For a time-invariant system, the expected standard deviation of the innovation
consists of two parts: the measurement variance (r(j)) and the variance due to the
uncertainty in the parameters (P(j)), given by [4]:
600

    σ_i(j) = [r(j) + h^T(j) P(j - 1) h(j)]^(1/2)                (41.21)

As explained before, the second term in the above equation is the variance of the
response, ẑ(j), predicted by the model at the value h(j) of the independent variable,
given the uncertainty in the regression parameters P(j - 1) obtained so far. This
equation reflects the fact that the fluctuations of the innovation are larger at the
beginning of the filtering procedure, when the states are not well known, and
converge to the standard deviation of the experimental error when the confidence in
the estimated parameters becomes high (small P). Therefore, it is more difficult to
detect model errors at the beginning of the estimation sequence than later on.
Rejection and acceptance of a new measurement is then based on the following
criterion:

    if |i(j)| > 3 σ_i(j):  reject
    otherwise:             accept

Adaptation of the Kalman filter may then simply consist of ignoring the rejected
observations without affecting the parameter estimates and covariances. When
using eq. (41.21) a complication arises at the beginning of the recursive estimation
because the value of P depends on the initially chosen values P(0) and thus is not a
good measure of the uncertainty in the parameters. When large values are chosen
for the diagonal terms of P in order to speed up the convergence (high P means a
large gain factor k and thus a large influence of the last observation), eq. (41.21)
overestimates the variance of the innovation, until P becomes independent of P(0).
For short runs one can evaluate the sequence of the innovation and look for regions
with significantly larger values, or compare the innovation with r(j). By way of
illustration we apply the latter procedure to solve the multicomponent system
discussed in Section 41.3 after adding an interfering compound which augments
the absorbance at 26×10³ cm⁻¹ by 0.01 a.u. and at 28×10³ cm⁻¹ by 0.015 a.u. First
we apply the non-adaptive Kalman filter to all measurements. The estimation then
proceeds as shown in Table 41.12.
The above example illustrates the self-adapting capacity of the Kalman filter.
The large interferences introduced at the wavelengths 26 and 28×10³ cm⁻¹ have not
really influenced the end result. At wavelengths 26 and 28×10³ cm⁻¹ the innovation
is large due to the interferent. At 30×10³ cm⁻¹ the innovation is high because the
concentration estimates obtained in the foregoing step are poor. However, the
observation at 30×10³ cm⁻¹ is unaffected, so that the concentration estimates are
restored towards the true values. In contrast, the OLS estimates obtained for the
above example are inaccurate (x1 = 0.148 and x2 = 0.217), demonstrating the
sensitivity of OLS to model errors.

TABLE 41.12
Non-adaptive Kalman filter with interference at 26 and 28×10³ cm⁻¹ (see Table 41.5 for the starting conditions)

Step j   Wavelength      x1 (Cl2)   x2 (Br2)   Absorbance                  Innovation
         (×10³ cm⁻¹)                           measured 1)  estimated 2)

0                        0          0
1        22              0.054      0.196      0.0341       0              0.0341
2        24              0.072      0.2004     0.0429       0.0414         0.0015
3        26              0.169      0.211      0.0435       0.0331         0.0104
4        28              0.313      0.204      0.0267       0.0158         0.0110
5        30              0.099      0.215      0.0110       0.0409         -0.030
6        32              0.098      0.215      0.0080       0.0082         -0.0002

1) Values taken from Table 41.4.
2) Calculated with the absorptivity coefficients from Table 41.4.

The estimation sequence when the Kalman filter is adapted is given in Table
41.13. This illustrates well the adaptive procedure which has been followed. At 26
×10³ cm⁻¹ the new measurement is 0.0104 absorbance units higher than expected
from the last available concentration estimates (x1 = 0.072 and x2 = 0.2004). This
deviation is clearly larger than the value 0.005 expected from the measurement
noise. Therefore, the observation is disregarded and the next measurement at 28
×10³ cm⁻¹ is predicted with the last accepted concentration estimates. Again, the
difference between predicted and measured absorbance (= innovation) cannot be
explained from the noise and the observation is disregarded as well. At 30×10³ cm⁻¹
the predicted absorbance using the concentration estimates from the second step is
back within the expectations; P and k can be updated, leading to new concentration
estimates x1 = 0.098 and x2 = 0.1995. Thereafter, the estimation process is
continued in the normal way. The effect of this whole procedure is that the two
measurements corrupted by the presence of an interferent have been eliminated,
after which the measurement-filtering process is continued.
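
The rejection rule can be added to the measurement update with a few lines of code. The Python sketch below (our own naming) implements the 3σ criterion with σ_i taken from eq. (41.21); in the worked example above the innovation was instead compared directly with the measurement noise, but the structure is the same.

    import numpy as np

    def adaptive_update(x, P, z, h, r):
        """Measurement update that skips observations whose innovation exceeds 3 sigma_i."""
        innovation = z - h @ x
        sigma_i = np.sqrt(r + h @ P @ h)       # eq. (41.21)
        if abs(innovation) > 3 * sigma_i:
            return x, P, False                 # reject: x and P are left unchanged
        k = P @ h / (h @ P @ h + r)            # eq. (41.18)
        x = x + k * innovation                 # eq. (41.19)
        P = P - np.outer(k, h) @ P             # eq. (41.17)
        return x, P, True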

41.7 Applications

One of the earliest applications of the Kalman filter in analytical chemistry was
multicomponent analysis by UV-Vis spectrometry of time and wavelength inde-
pendent concentrations, which was discussed by several authors [7-10]. Initially,
the spectral range was scanned in the upward and downward mode, but later on

TABLE 41.13
Adaptive Kalman filter with interference at 26 and 28×10³ cm⁻¹ (see Table 41.5 for the starting conditions)

Step j   Wavelength      x1 (Cl2)   x2 (Br2)   Absorbance                  Innovation
         (×10³ cm⁻¹)                           measured 1)  estimated 2)

0                        0          0
1        22              0.054      0.196      0.0341       0              0.0341
2        24              0.072      0.2004     0.0429       0.0414         0.0015
3        26              -          -          0.0435       0.0331         0.0104
4        28              -          -          0.0267       0.0100         0.0166
5        30              0.0980     0.1995     0.0110       0.00814        -0.002
6        32              0.0979     0.1995     0.0080       0.00801        -0.00001

1) Values taken from Table 41.4.
2) Calculated with the absorptivity coefficients from Table 41.4.

optimal sequences were derived for faster convergence to the result [3]. The
measurement model can be adapted to include contributions from linear drift or
background [6,11]. This requires an accurate model for the background or drift. If
the background response is not precisely described by the model the Kalman filter
fails to estimate accurate concentrations. Rutan [12] applied an adaptive Kalman
filter in these instances with success. In HPLC with diode array detection, three-
dimensional data are available. The processing of such data by multivariate
statistics has been the subject of many chemometric studies, which are discussed in
Chapter 34. Under the restriction that the spectra of the analytes should be
available, accurate concentrations can be obtained by Kalman filtering in the
presence of unknown interferences [13].
One of the earliest reports of a Kalman filter which includes a system equation is
due to Seelig and Blount [14] in the area of anodic stripping voltammetry. Five
system states — potential, concentration, potential sweep rate and the slope and
intercept of the non-Faradaic current — were predicted from a measurement
model based on the measurement of potential and current. Later on the same approach
was applied in polarography. Similar to spectroscopic and chromatographic appli-
cations, overlapping voltammograms can be resolved by a Kalman filter [15]. A large
number of applications of Kalman filters in kinetic analysis has been reported
[16,17] and the performance has been compared with conventional non-linear
regression. In most cases the accuracy and precision of the results obtained from
the two methods were comparable. The Kalman filter is specifically superior for
detecting and correcting model errors.

The Kalman filter is particularly well suited to monitor the dynamic behaviour
of processes. The measurement procedure itself can be considered to be a system
which is observed through the measurement of check samples. One can set up
system equations, e.g., a system equation which describes the fluctuations of the
calibration factors. Only a few applications exploiting this capability of a Kalman
filter have been reported. One of the difficulties is a lack of system models which
describe the dynamic behaviour of an analytical system. Thijssen and coworkers
[17] demonstrated the potential of this approach by designing a Kalman filter for
the monitoring of the calibration factors. They assembled a so-called self-calibrating
Flow Injection Analyzer for the determination of chloride in water. The software
of the instrument included a system model by which the uncertainty of the
calibration factors was evaluated during the measurement of the unknown samples.
When this uncertainty exceeded a certain threshold the instrument decided to
update the calibration factors by remeasuring one of the calibration standards.
Thijssen [18] also designed an automatic titrator which controlled the addition of
the titrant by a Kalman filter. After each addition the equivalence point (the state of
the system) was estimated during the titration.

References

1. D. Graupe, Identification of Systems. Krieger, New York, NY, 1976.


2. Landolt-Börnstein, Zahlenwerte und Funktionen. Teil 3, Atom- und Molekularphysik.
Springer, Berlin, 1951.
3. P.C. Thijssen, L.J.P. Vogels, H.C. Smit and G. Kateman, Optimal selection of wavelengths in
spectrophotometric multicomponent analysis using recursive least squares. Z. Anal. Chem.,
320 (1985) 531-540.
4. A. Gelb (Ed.), Applied Optimal Estimation. MIT Press, Cambridge, MA, 1974.
5. G. Kateman and L. Buydens, Quality Control in Analytical Chemistry, 2nd Edn. Wiley, New
York, 1993.
6. P.C. Thijssen, S.M. Wolfrum, G. Kateman and H.C. Smit, A Kalman filter for calibration, eval-
uation of unknown samples and quality control in drifting systems: Part 1. Theory and simula-
tions. Anal. Chim. Acta, 156 (1984) 87-101.
7. H.N.J. Poulisse, Multicomponent analysis computations based on Kalman Filtering. Anal.
Chim. Acta, 112 (1979) 361-374.
8. C.B.M. Didden and H.N.J. Poulisse. On the determination of the number of components from
simulated spectra using Kalman filtering. Anal. Lett., 13 (1980) 921-935.
9. T.F. Brown and S.D. Brown, Resolution of overlapped electrochemical peaks with the use of
the Kalman filter. Anal. Chem., 53 (1981) 1410-1417.
10. S.C. Rutan and S.D. Brown, Pulsed photoacoustic spectroscopy and spectral deconvolution
with the Kalman filter for determination of metal complexation parameters. Anal. Chem., 55
(1983) 1707-1710.
11. P.C. Thijssen, A Kalman filter for calibration, evaluation of unknown samples and quality con-
trol in drifting systems: Part 2. Optimal designs. Anal. Chim. Acta, 162 (1984) 253-262.

12. S.C. Rutan, E. Bouveresse, K.N. Andrew, P.J. Worsfold and D.L. Massart, Correction for drift
in multivariate systems using the Kalman filter. Chemom. Intell. Lab. Syst., 35 (1996)
199-211.
13. J. Chen and S.C. Rutan, Identification and quantification of overlapped peaks in liquid chroma-
tography with UV diode array detection using an adaptive Kalman filter. Anal. Chim. Acta,
335(1996) 1-10.
14. P. Seelig and H. Blount, Kalman Filter applied to Anodic Stripping Voltametry: Theory. Anal.
Chem., 48 (1976) 252-258.
15. C.A. Scolari and S.D. Brown, Multicomponent determination in flow-injection systems with
square-wave voltammetric detection using the Kalman filter. Anal. Chim. Acta, 178 (1985)
239-246.
16. B.M. Quencer, Multicomponent kinetic determinations with the extended Kalman filter. Diss.
Abstr. Int. B 54 (1994) 5121-5122.
17. M. Gui and S.C. Rutan, Determination of initial concentration of analyte by kinetic detection of
the intermediate product in consecutive first-order reactions using an extended Kalman filter.
Anal. Chim. Acta, 66 (1994) 1513-1519.
18. P.C. Thijssen, L.T.M. Prop, G. Kateman and H.C. Smit, A Kalman filter for calibration, evalu-
ation of unknown samples and quality control in drifting systems. Part 4. Flow Injection Analy-
sis. Anal. Chim. Acta, 174 (1985) 27-40.
19. P.C. Thijssen, N.J.M.L. Janssen, G. Kateman and H.C. Smit, Kalman filter applied to setpoint
control in continuous titrations. Anal. Chim. Acta, 177 (1985) 57-69.

Recommended additional reading

S.C. Rutan, Recursive parameter estimation. J. Chemom., 4 (1990) 103-121.


S.C. Rutan, Adaptive Kalman filtering. Anal. Chem., 63 (1991) 1103A-1109A.
S.C. Rutan, Fast on-line digital filtering. Chemom. Intell. Lab. Syst., 6 (1989) 191-201.
D. Wienke, T. Vijn and L. Buydens, Quality self-monitoring of intelligent analyzers and sensor based
on an extended Kalman filter: an application to graphite furnace atomic absorption spectros-
copy. Anal. Chem., 66 (1994) 841-849.
S.D. Brown, Rapid parameter estimation with incomplete chemical calibration models. Chemom.
Intell. Lab. Syst., 10 (1991) 87-105.

Chapter 42

Applications of Operations Research

42.1 An overview

Ackoff and Sasieni [1] defined operations research (OR) as "the application of
scientific method by interdisciplinary teams to problems involving the control of
organized (man-machine) systems so as to provide solutions which best serve the
purposes of the organization as a whole".
Operations research consists of a collection of mathematical techniques. Some
of these are linear programming, integer programming, queuing theory, dynamic
programming, graph theory, game theory, multicriteria decision making, and
simulation. They often are optimization techniques and are characterized by their
combinatorial character: their aim is to find an optimal combination. Typical
problems that can be solved are:
(1) allocation,
(2) inventory,
(3) replacement,
(4) queuing,
(5) sequencing and combination,
(6) routing,
(7) competition, and
(8) search.
Several, but not all, of these mathematical methods (e.g. multicriteria decision
making, Chapter 26) or problems (the non-hierarchical clustering methods of
Chapter 30, which can be treated as allocation models) have been treated earlier. In
this chapter, we will briefly discuss the methods that are relevant to chemo-
metricians and have not been treated in earlier chapters yet.

42.2 Linear programming

Suppose that a manufacturer prepares a food product by adding two oils (A and
B) of different sources to other ingredients. His purpose is to optimize the quality

of the product and at the same time minimize cost. The quality parameters are the
amount of vitamin A (y1) and the amount of polyunsaturated fatty acids (y2). The
cost of a unit amount of oil A is 40, that of B is 25. In this introductory example, we
will suppose that the cost of the other ingredients is negligible and does not have to
be taken into account. Moreover, we suppose that the volume remains constant by
adaptation of the other ingredients. The cost or objective function to be minimized
is therefore

    z = 40 x1 + 25 x2                                           (42.1)

where x1 is the amount in grams of oil A per litre of product and x2 the amount of oil
B. An optimal product contains at least 65 vitamin A units and 40 polyunsaturated
units per litre (all numbers in this section have been chosen for mathematical
convenience and are not related to real values). More vitamin A or polyunsaturated
fatty acids are not considered to have added benefit. Suppose now that oil A
contains 30 units of vitamin A per litre and 10 units of polyunsaturated fatty acids
and oil B contains 15 units of vitamin A and 25 units of polyunsaturated fatty acid.
One can then write the following set of constraints.

    y1 = 30 x1 + 15 x2 ≥ 65
    y2 = 10 x1 + 25 x2 ≥ 40                                     (42.2)
    x1 ≥ 0,  x2 ≥ 0

The line y1 = 30 x1 + 15 x2 = 65 is shown in Fig. 42.1. All points on that line or
above satisfy the constraint 30 x1 + 15 x2 ≥ 65. Similarly, all points lying above line
y2 = 10 x1 + 25 x2 = 40 satisfy the second constraint of eq. (42.2), while the last
constraints limit the acceptable solutions to positive or zero values for x1 and x2.
The acceptable region is the shaded area of Fig. 42.1.
We can now determine which pairs of (x1, x2) values yield a particular z. In Fig.
42.1 line z1 shows all values for which z = 50. These (x1, x2) do not belong to the
acceptable area. However, we can now draw parallel lines until we meet the
acceptable area. This happens in point B with line z2. The coordinates of this point
are obtained by solving the set of simultaneous equations

    30 x1 + 15 x2 = 65
    10 x1 + 25 x2 = 40

This yields x1 = 41/24, x2 = 11/12 and z = 91.2.
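
For readers who want to reproduce this result, the example can be solved with any linear programming routine; the Python sketch below uses scipy.optimize.linprog (not the graphical or Simplex procedure described in the text) after rewriting the ≥-constraints as ≤-constraints:

    from scipy.optimize import linprog

    c = [40, 25]                    # objective: minimize 40 x1 + 25 x2
    A_ub = [[-30, -15],             # 30 x1 + 15 x2 >= 65  ->  -30 x1 - 15 x2 <= -65
            [-10, -25]]             # 10 x1 + 25 x2 >= 40  ->  -10 x1 - 25 x2 <= -40
    b_ub = [-65, -40]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
    print(res.x, res.fun)           # approximately [1.708, 0.917] and 91.25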



Fig. 42.1. Linear programming: oil example.

Let us look at another example (Fig. 42.2) [2]. A laboratory must carry out
routine determinations of a certain substance and uses two methods, A and B, to do
this. With method A, one technician can carry out 10 determinations per day, with
method B 20 determinations per day. There are only 3 instruments available for
method B and there are 5 technicians in the laboratory. The first method needs no
sophisticated instruments and is cheaper. It costs 100 units per determination while
method B costs 300 units per determination. The available daily budget is 14000
units. How should the technicians be divided over the two available methods, so
that as many determinations as possible are carried out? Let the number of
technicians working with method A be x1 and with method B x2, and the total
number of determinations z; then, the objective function to be maximized is given
by:

    z = 10 x1 + 20 x2                                           (42.3)



Fig. 42.2. Linear programming: laboratory technicians example.

The constraints are:

    y1 = x2 ≤ 3
    y2 = x1 + x2 ≤ 5
    y3 = (10 × 100) x1 + (20 × 300) x2 ≤ 14000                  (42.4)
    x1 ≥ 0,  x2 ≥ 0

The optimal result obtained in this way is x1 = 3.2, x2 = 1.8, z = 68. We observe that
in both cases
- the set of acceptable solutions is convex, i.e. whatever two points one
chooses from the set, the line connecting them lies completely within the do-
main defined by the set;
- the optimal solution is one of the corner points of the convex set.
It can be shown that this can be generalized to the case of more than two variables.
The standard solution of a linear programming problem is then to define the corner
points of the convex set and to select the one that yields the best value for the
objective function. This is called the Simplex method.
The second example illustrates a difficulty that can occur, namely that the optimal
solution requires 1.8 technicians working with method B, while one needs an
integer number. This can be solved by letting one technician work full time and
another four days out of five with this instrument. When this is not practical, the
solution is not feasible and one should then apply a related method, called integer
programming, in which only integer values are allowed for the variables. When
only binary values are allowed, this is called binary programming.
Problems resembling the first example, but much more complex, are often
studied in industry. For instance in the agro-food industry linear programming is a
current tool to optimize the blending of raw materials (e.g. oils) in order to obtain
the wanted composition (amount of saturated, monounsaturated and polyunsaturated
fatty acids) or property of the final product at the best possible price. Here linear
programming is applied repeatedly, each time the price of raw materials is
adapted by changing markets.
Integer programming has been applied by De Vries [3] (a short English-
language description can be found in [2]) for the determination of the optimal
configuration of equipment in a clinical laboratory and by De Clercq et al. [4] for
the selection of optimal probes for GLC. From a data set with retention indices for
68 substances on 25 columns, sets of p probes (substances) (p = 1, 2, ..., 20) were
selected, such that the probes give the best characterization of the
columns. This type of application would nowadays probably be carried out with
genetic algorithms (see Chapter 27).
The fact that only linear objective functions are possible, limits the applicability
of the methodology in chemometrics. Quadratic or non-linear programming are
possible however. The former has been applied in the agro-food industry for the
determination of the composition of an unknown fat blend from its fatty acid
profile and the fatty acid profiles of all possible pure oils [5]. The solution of this
problem is searched under constraints of the number of oils allowed in the solution,
a minimal or maximal content, or a content range. This problem can be solved by
quadratic programming. The objective function is to minimize the squared differ-
ences between the calculated and actual fatty acid composition of the oil blend. An
attractive feature of the programming approach to solve this type of problems is
that it provides several solutions in a decreasing order of value of the objective
function. All these methods together are also known as mathematical programming.

42.3 Queuing problems

In several chapters we discussed how the quality of the analytical result defines
the amount of information which is obtained on a sampled system. Obvious quality
criteria are accuracy and precision. An equally important criterion is the analysis
time. This is particularly true when dynamic systems are analyzed. For instance a
relationship exists between the measurability and the sampling rate, analysis time
and precision (see Chapter 20). The monitoring of environmental and chemical
processes is a typical example where the management of the analysis time is

important. In this chapter we will focus on the analysis time. The time between the
arrival of the sample and the reporting of the analytical result is usually sub-
stantially longer than the net analysis time. Delays may be caused by a congestion
of the laboratory or analysis station or by managerial policies, e.g. priorities
between samples and waiting until a batch of a certain size is available for analysis.
A branch of Operations Research is the study of queues and the influence of
scheduling policies on the formation of queues. Queues in waiting rooms, for
instance, or the occupation of beds in hospitals, telephone and computer networks
have been extensively studied by queuing theory. In contrast, only a few
studies have been conducted on queues and delays in analytical laboratories
[6-10]. Despite the fact that several operational parameters can be registered with
modern Laboratory Information Management Systems (LIMS), laboratory activ-
ities are apparently too complex to be described by models from queuing theory.
For the same reason the alternative approach by simulation (see Section 42.4) is
not a real management tool for decision support in analytical laboratory manage-
ment. On the other hand simulation techniques proved to be a useful tool for the
scheduling of robots.
In this section waiting and queues are discussed in order to provide some basic
understanding of general queuing behaviour, in particular in analytical lab-
oratories. This should allow a qualitative forecast of the effect of managerial
decisions.

42.3.1 Queuing and Waiting

No queues would be formed if no new samples are submitted during the time
that the analyst is busy with the analysis of the previous sample. If all analysis
times were equally long and if each new sample arrived exactly after the previous
analysis is finished, the analytical facility could be utilized up to 100%. On the
other hand, if samples always arrive before the analysis of the previous sample is
completed, more samples arrive than can be analyzed, causing the queue to grow
indefinitely long. Mathematically this means that:

    n_q = 0,  w = 0      when λ/μ < 1

and

    n_q = ∞,  w = ∞      when λ/μ > 1

with n_q the number of samples waiting in queue, w the waiting time in queue, λ the
number of samples submitted per unit of time (e.g. a day), and μ the number of
samples which can be analyzed per unit of time.

Because

    λ = 1/IAT, where IAT is the mean interarrival time, and
    μ = 1/AT, where AT is the mean analysis time,

    λ/μ = AT/IAT = ρ

ρ is called the utilization factor of the facility or service station.
In reality, the queue size (n_q) and waiting time (w) do not behave as a zero-
infinity step function at ρ = 1. Also at lower utilization factors (ρ < 1) queues are
formed. This queuing is caused by the fact that when analysis times and arrival
times are distributed around a mean value, incidentally a new sample may arrive
before the previous analysis is finished. Moreover, the queue length behaves as a
time series which fluctuates about a mean value with a certain standard deviation.
For instance, the average lengths of the queues formed in a particular laboratory for
spectroscopic analysis by IR, ¹H NMR, MS and ¹³C NMR are respectively 12, 39,
14 and 17 samples and the sample queues are Gaussian distributed (see Fig. 42.3).
This is caused by the fluctuations in both the arrivals of the samples and the
analysis times.
According to queuing theory the average waiting time (w) grows exponentially
with increasing utilization factor (ρ) and asymptotically approaches infinity
when ρ goes to 100% (see Fig. 42.4). Figure 42.4 shows the waiting time for the
simplest queuing system, consisting of one server, independent analysis and
arrival processes, Poisson distributed (see Section 15.3) arrivals (number of
samples per day) and exponentially distributed analysis times. In the jargon of
queuing theory such a system is denoted by M/M/1, where the two Ms indicate the
arrival and analysis processes respectively (M = Markov process) and '1' is the
number of servers.
The number of arrivals follows a Poisson distribution when samples are sub-
mitted independently from each other, which is generally valid when the samples
are submitted by several customers. The probability of n arrivals in a time interval t
is given by:

    P(n) = e^(-λt) (λt)^n / n!                                  (42.5)

where λt is the average number of arrivals during the interval t. For instance the
number of samples (samples/day) submitted to the spectroscopic department men-
tioned earlier [9] can be modelled by Poisson distributions with means 2.8, 7.7,
2.06 and 2.5 samples per day (Fig. 42.5).
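
Equation (42.5) is easily evaluated for the sample loads mentioned above. A small Python sketch:

    import math

    def p_arrivals(n, lam_t):
        """Probability of n arrivals in an interval with mean number lam_t, eq. (42.5)."""
        return math.exp(-lam_t) * lam_t**n / math.factorial(n)

    # e.g. a station receiving on average 2.8 samples per day
    print([round(p_arrivals(n, 2.8), 3) for n in range(6)])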

Fig. 42.3. Time series of the observed queue lengths (n) in a department for structural analysis, with
their corresponding histograms fitted with a Gaussian distribution.

Fig. 42.4. The ratio between the average waiting time (w) and the average analysis time (AT) as a func-
tion of the utilization factor (ρ) for a system with exponentially distributed interarrival times and anal-
ysis times (M/M/1 system).

Fig. 42.5. The distribution of the probability that n samples arrive per day, observed in a department
for structural analysis (panels: IR, ¹H NMR, MS, ¹³C NMR). (│) observed; (•) Poisson distribution with the corresponding mean.

The following relationships fully describe an M/M/1 system:
- the average queue length (n_q), which is the number of samples in queue, exclud-
ing the one which is being analysed:

    n_q = ρ² / (1 - ρ)

- the average waiting time (w) in queue (excluding the analysis time of the sam-
ple):

    w = AT ρ / (1 - ρ)
Queuing models also describe the distribution of the waiting times, though only
for relatively simple queuing systems. Waiting times in an M/M/1 system are
exponentially distributed. The probability of a waiting time shorter than a given
w_max is given by (see Fig. 42.6):

    P(w < w_max) = 1 - ρ e^(-(1-ρ) w_max / AT) = 1 - ρ e^(-ρ w_max / w)

It means that a large part of the samples (65% for ρ = 0.7) waits for less than the
mean waiting time (w_max = w). On the other hand, there is a significant probability
(35% for ρ = 0.7) that a sample has to wait longer than this average. It also means
that when the laboratory management wants to guarantee a certain maximal turn-
around time (e.g. 95% of the samples within w_max), the mean waiting time should
be 27% of w_max (ρ = 0.7). Figure 42.7 shows the waiting time distributions of the
samples in the IR, ¹H NMR, ¹³C NMR and MS departments mentioned before.
The probability of finding k samples in a queue is:

    P(n = k) = ρ^k (1 - ρ)                                      (42.6)

Fig. 42.6. Probability that the waiting time is smaller than t (t given in units relative to the average anal-
ysis time).

Fig. 42.7. Histograms and cumulative distributions of the delays (waiting time + analysis time) in a de-
partment for structural analysis. (│) Observed values. (•) Cumulative distribution. (•) Fit with a theo-
retical model (not discussed in this chapter).


The observed Gaussian distribution of the queue lengths in the spectroscopic
laboratory (see Fig. 42.3) is not in agreement with the theoretical distribution given
by eq. (42.6). This indicates that an analysis station cannot be modelled by the
simple M/M/1 model. We will return to this point later.
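
The M/M/1 expressions given above are easily combined into a small calculator. The Python sketch below (our own wording) reproduces the numbers quoted in the text for ρ = 0.7:

    import numpy as np

    def mm1_metrics(rho, AT):
        """Average queue length, average waiting time and P(w < w_max) for an M/M/1 station."""
        nq = rho**2 / (1 - rho)                  # average number of samples waiting
        w = AT * rho / (1 - rho)                 # average waiting time in queue
        p_below = lambda w_max: 1 - rho * np.exp(-(1 - rho) * w_max / AT)
        return nq, w, p_below

    nq, w, p_below = mm1_metrics(rho=0.7, AT=1.0)
    print(nq, w)          # 1.63 samples in queue, mean wait of 2.33 analysis times
    print(p_below(w))     # about 0.65: 65% of the samples wait less than the mean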
Another interesting feature is the dynamic behaviour of a queue, which is
expressed by the time constant of the queue. As explained in Chapter 20 the time
constant of a time series can be determined from the autocorrelation function,
which gives the autocorrelation r(τ) for observations separated by a time difference
τ. The autocorrelation plots and time constants (T) of the queues shown in Fig. 42.3
are given in Fig. 42.8. In general, the time constant of a queue increases with an
increasing utilization factor [9], which means that periods of congestion may be
relatively long. For instance, for an M/M/1 system T = AT ρ/(1 - ρ)², giving T ≈ 8
AT for ρ = 0.7. The behaviour of a queue depends mainly on three factors: the
arrival pattern of the samples, the analysis time and the applied managerial rules.

Fig. 42.8. Autocorrelograms of the queue lengths observed in a department for structural analysis.

As it is impossible to discuss them all, a number of cases has been compiled in
Table 42.1 with reference to relevant literature in which the cases are discussed by
queuing theory or are simulated (see Section 42.4).
An interesting aspect of a queuing system is the distribution of the 'idle' and
'busy' time. Idle time is the time during which no samples are waiting or being
analyzed. The time that samples are analyzed is called the busy time. For a system
with a utilization factor of 70%, analysts are waiting for samples during 30% of
their time, which may very much disturb any laboratory manager. However,
matters become more acceptable when we consider that 30% overhead activities
are more the rule than the exception. On the other hand, for an M/M/1 system one
can derive that the average duration of the busy time is quite long and that the idle
time is split up into many relatively short periods. For instance, for a system with a
utilization factor of 70% and an analysis time of 1 h the mean busy time is
AT/(1 - ρ) = 3.3 h and the mean idle time is AT/ρ = 1.4 h. Therefore, analysts will
not be able to schedule all overhead activities during the time that no samples are
waiting, but will be forced to interrupt the analysis process in favour of overhead
activities. Because overhead activities compete with the analysis process, queues
are never empty, which is demonstrated by the queues given in Fig. 42.3. Thus the

TABLE 42.1
Classification of queuing systems

Sample arrival:
1. Single arrivals [11]: a. uniform; b. Poisson; c. other distribution
2. Batch arrivals [12]: a. uniform; b. Poisson; c. other
3. Batch size [12]: a. fixed; b. distribution
4. No arrival if n > n_crit

Analysis time / analysis mode:
1. Analysis time (single) [11]: a. uniform; b. exponential; c. other distribution; d. interrupted (illness, failure) [9]; e. depends on batch size
2. Batch size [12]: a. fixed; b. distributed; c. minimal size

Management rules:
1. Queue discipline [11]: a. first-in-first-out (FIFO); b. last-in-first-out (LIFO); c. priority to easy samples (expected analysis time) [10]; d. random
2. Priority rules [9]: a. absolute classes; b. depends on waiting
3. Overhead scheduling [9]: a. random interruption; b. scheduled interruption; c. interruption if n < n_crit

fact that queues are never empty does not necessarily indicate an oversaturated
system.

42.3.2 Application in analytical laboratory management

The overview given in Table 42.1 demonstrates that queues and waiting can
only be studied by queuing theory in a limited number of cases. Specifically the
queuing systems that are of interest to the analytical chemist are too complex.
However the behaviour of simple queuing systems provides a good qualitative
insight in the queuing processes occurring in the laboratory. An alternative approach
which has been applied extensively in other fields, is to simulate queuing systems
and to support the decision making by the simulation of the effect of the decision.
This is the subject of Section 42.4. Before this, we will summarize a number of
rules of thumb relevant to the laboratory manager, who may control the queuing
process by controlling the input, the analytical process and the resources.
(1) Input control: Maximum delays may be controlled by monitoring the work
load (n_q·AT). When n_q·AT > w_max, or when n_q exceeds a critical value (n_crit),
customers are requested to refrain from submitting samples. Too high a frequency
of such warnings indicates insufficient resources to achieve the desired w_max.

(2) Priorities do not influence the overall average delay, because
w = a·w1 + (1 - a)·w2, where a is the fraction of samples with a high priority. The
values of w1 and w2 depend on a, the kind of priority (see Table 40.2) and ρ [10].
(3) The effect of collecting batches depends on the shortening of the analysis
time by batch analysis and the time needed to collect the batches.
(4) By automation one can remove the variation of the analysis time or shorten
the analysis time. Although the variation of the analysis time causes half of the
delay, a reduction of the analysis time is more important. This is also true if, by
reducing the analysis time, the utilization factor would remain the same (and thus
n_q) because more samples are submitted. Since ρ = AT/IAT, any measure to
shorten the analysis time will have a quadratic effect on the absolute delay
(because w = AT²/(IAT - AT)). As a consequence the benefit of duplicate
analyses (detection of gross errors) and frequent recalibration should be balanced
against the negative effect on the delay.
(5) By preference, overhead activities should be scheduled in regular blocks,
e.g. at the end of the day [10].
(6) For fixed resources (costs), the sampling rate in process control can be
increased to maximal utilization (ρ = 1) of the available resources without penalty
only if samples are taken at regular time intervals (no variation) and there is no
variation in the analysis time (automated measuring device). In other situations an
optimal sampling rate will be found where the measurability is maximal. Recalling
the fertilizer example discussed in Chapter 20 we can derive the optimal sampling
rate for the N determination by an ion-selective electrode, by substituting the
analysis time (10 min) by the delay in the measurability equation. However,
considering that in process control it is always preferable to analyze the last
submitted sample (by eventually skipping the analysis of the waiting samples,
because they do not contain additional information), it is obvious that a last-in-
first-out (LIFO) strategy should be chosen.

42.4 Discrete event simulation

In Section 42.3 we have discussed that queuing theory may provide a good
qualitative picture of the behaviour of queues in an analytical laboratory. However
the analytical process is too complex to obtain good quantitative predictions. As
this was also true for queuing problems in other fields, another branch of Operations
Research, called Discrete Event Simulation emerged. The basic principle of
discrete event simulation is to generate sample arrivals. Each sample is chara-
cterized by a number of descriptors, e.g. one of those descriptors is the analysis
time. In the jargon of simulation software, a sample is an object, with a number of
attributes (e.g. analysis time) and associated values (e.g. 30 min). Other objects are
e.g. instruments and analysts. A possible attribute is a list of the analytical

procedures which can be carried out by the analyst or instrument. The more one
wants to describe the reality in detail, the more attribute-value pairs are added to
the objects. For example, if one wants to include down times of the instrument or
absence due to illness, such attributes have to be added to the object.
An event takes place when the state of the laboratory changes. Examples of
events are:
- a sample arrival: this introduces a state change because the sample joins the
queue;
- an analysis is ready: this introduces a state change because the instrument and
analyst become idle.
With each event a number of actions is associated. For example when a sample
arrives, the following actions are taken:
- if instrument and analyst are idle and if all other conditions are met (batch
size), start the analysis. This implies that the next event 'analysis is ready' is
generated, the status of the instrument and analyst is switched to 'busy'. Gen-
erate the time of next arrival.
- otherwise: register the arrival time in the queue; augment the queue size by
one. Generate the time of the next arrival.
As events generate other events, the simulation keeps going from event to event
until some terminating conditions are met (e.g. the end of the simulation time, or
the maximum number of samples has been generated). As one can see a specific
programming environment, called object oriented programming [17], is required
to develop a simulation model, consisting of object-attribute-value (O-A-V) triplets
and rules (see also Section 43.4.2). The little research that has been conducted on
the simulation of laboratory systems [9,10,13-15] was primarily focused on the
demonstration that it is possible to develop a validated simulation model that
exhibits the same behaviour in terms of queues, delays as in reality. Next, such a
validated model is interrogated with the question "What if?". For instance, what if:
- priorities are changed?
- resources are modified?
- minimal batch size is increased?
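
To give an impression of what such an event mechanism looks like in code, the following Python sketch (our own construction, not one of the simulators cited in this section) simulates a single analysis station with exponentially distributed interarrival and analysis times; it only handles the events 'sample arrival', 'start of an analysis' and 'analysis is ready':

    import heapq, random

    # A toy discrete event simulation of one analysis station (M/M/1 situation).
    random.seed(1)
    IAT, AT, N = 1.0, 0.7, 1000            # mean interarrival time, mean analysis time, samples
    events = [(random.expovariate(1 / IAT), "arrival")]
    queue, busy, waits = [], False, []

    while events and len(waits) < N:
        t, kind = heapq.heappop(events)
        if kind == "arrival":
            queue.append(t)                # the sample joins the queue
            heapq.heappush(events, (t + random.expovariate(1 / IAT), "arrival"))
            if not busy:
                heapq.heappush(events, (t, "start"))
        elif kind == "start" and queue and not busy:
            busy = True
            waits.append(t - queue.pop(0))             # waiting time in queue
            heapq.heappush(events, (t + random.expovariate(1 / AT), "ready"))
        elif kind == "ready":
            busy = False                   # instrument and analyst become idle
            if queue:
                heapq.heappush(events, (t, "start"))

    print(sum(waits) / len(waits))         # should approach AT*rho/(1-rho) = 1.63

The simulated mean waiting time fluctuates around the theoretical M/M/1 value, which illustrates why a large number of simulated arrivals is needed before the results stabilize.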
In Fig. 42.9 we show the simulation results obtained by Janse [8] for a municipal
laboratory for the quality assurance of drinking water. Simulated delays are in
good agreement with the real delays in the laboratory. Unfortunately, the develop-
ment of this simulation model took several man years which is prohibitive for a
widespread application. Therefore one needs a simulator (or empty shell) with
predefined objects and rules by which a laboratory manager would be capable to
develop a specific model of his laboratory. Ideally such a simulator should be
linked to or be integrated with the laboratory information management system in
order to extract directly the attribute values.
Fig. 42.9. Simulated and observed delays in a municipal laboratory for the quality assurance of drinking water (after Janse [8]).

Klaessens [14-17] developed a 'laboratory simulator', written in SIMULA,
which by a question-answering session assembles the simulation model. SIMULA
[18] is a programming environment dedicated to the simulation of queuing systems.
KEE [19] offers a graphics-driven discrete event simulator, in which the objects
are represented by icons which can be connected into a logical network (e.g. a
production line for the manufacturing of electronic devices). Although KEE has
proven its potential in many areas, no examples are known of analytical lab-
oratories simulated in KEE.
From the few but interesting simulation studies of analytical laboratories, the
following conclusions can be drawn.
- A condition for discrete event simulation to become a relevant tool for the
laboratory manager is the availability of an easy to use simulator with a user
friendly user interface.
- Simulators are a special kind of expert systems and should be treated as such.
They should be used to support the decision making process but not to re-
place the human creativity.
- Simulators should be used in a cyclic fashion in order to improve the reliabil-
ity of the forecast. By simulating small changes in the laboratory organisation
and by comparing the forecasts with the real effects, the simulator is refined
until the forecasts are within a certain confidence band.
- A particular problem is the number of events that should be simulated before
the results are stabilized about a mean value. This problem is comparable to
the question of how many runs are required to simulate a Gaussian distribu-
tion within a certain precision. Experience shows that at least 1000 sample ar-
rivals should be simulated to obtain reliable simulation results. The sample
load (samples/day) therefore determines the time horizon of the simulation,
which for low sample loads may be as long as several years. It means also that
in practice many laboratories never reach a stationary state which makes
forecasting difficult. However, one may assume that on the average the best
long term decision will also be the best in the short run. One should be careful
to tune a simulator based on results obtained before equilibrium is reached.
- Experimental design techniques (see Chapters 21 to 26) should be used to
study the relevant factors and interactions.

42.5 A shortest path problem

A network or graph consists of a set of points (nodes) connected by lines (edges
or links). These links can be one-way (one can go from point A to B, but not vice
versa) or two-way. When the edges are characterized by values, it is called a
weighted graph. In the usual economic problems for which one applies networks,
these values are cost, time, or distance.

Graphs have been applied in chemometrics to describe molecules. Indeed,


molecules can be represented as nodes (the atoms) connected by bonds (the edges).
Because in the present chapter, we want to emphasize the optimization aspect, we
refer the reader to the literature e.g. Refs. [20-23].
In this section therefore, a routing problem is considered in which one must go
from one node (the origin) to another (the terminal node). There are many ways by
which this is possible and the routing problem consists in finding a path that
minimizes the sum of the values of the edges that constitute the path. This is easily
understood if one supposes that the nodes are towns and the values of the edges
between neighbouring towns are the distances between the towns. The routing
problem then consists in choosing the shortest way of going from one town to
another. This type of problem is called a shortest path or minimal path problem. It
is applied here to the optimization of chromatographic separation schemes for
multicomponent samples and more particularly to the ion-exchange separation of
samples containing several different ions [24].
In this type of application, one usually employs more or less rapid and clear-cut
separation steps. This can be explained best by considering the simplest possible
case, namely the separation of three ions, A, B, and C. The original situation is that
the three ions have been brought together (not separated) on a chromatographic
column and the final situation should be that they are eluted and separated from
each other. These two situations constitute the initial and the terminal nodes of the
network. They are denoted by ABC// and //A/B/C. The elements which remain on
the column are given to the left of symbol // and the symbol / means that the ions to
the left and to the right of it are separated. There are many ways in which one can
go from situation ABC// to situation //A/B/C, as shown in the network in Fig.
42.10.

Step 1
There are two possibilities, namely:
(a) one can elute one element and retain the other two on the column. This leads
to nodes AB//C, AC//B, BC//A; or
(b) one can elute two elements and retain the other. This leads to nodes A//BC,
B//AC and C//AB.

Step 2
(a) Following step 1a, one of the two remaining ions is eluted. For example, if in
the first step A was eluted, one now elutes B or C. In step 2a one can reach the
situations A//B/C, B//A/C or C//A/B.
(b) Following step 1b, two ions are eluted together and are therefore not
separated; they have to be adsorbed first on another column. In the meantime, the
single ion, left on the first column, can be eluted. Two ions are then adsorbed on a

Fig. 42.10. A shortest path problem. See text for symbols.

column and one is eluted. The different possibilities are AB//C, AC//B, BC//A, i.e.
the situations reached also after step 1a. From there one proceeds to step 2a.

Step 3
This follows step 2a. Only one ion remains on the column. It is now eluted, so
that the terminal node is reached.
These different possibilities and their relationships can be depicted as a directed
graph (Fig. 42.10). Directed graphs are graphs in which each edge has a specific
direction. To find the shortest path one has to give values to the edges of the graph.
As the problem is to find the procedure that permits to carry out the separation in
the shortest time possible, these values should be the times necessary to carry out
the steps symbolized by the links in the graph or a value proportional to the time.
We shall not detail the manner in which these times were derived. Essentially,
there are three possibilities:
(a) If the separation depicted by a particular link is possible, the time is
considered to be equal to the distribution coefficient of the ion which is the slowest
to be eluted in this step (the distribution coefficient as defined in ion exchange
chromatography is proportional to the elution time).
(b) If the separation depicted by a particular link is not possible, a very high
value (900) is given to that link.
(c) If the link contains a transfer from one column to another, a value corres-
ponding to the estimated time needed for this transfer is given.

One may question whether the application of graph theory is really necessary as
no doubt a separation such as that described above can be investigated easily
without it. However, when more ions are to be separated, the number of nodes
grows very rapidly.
The calculation of the number of nodes is rather complicated. It is simpler in the
special case where transfers from one column to another are not allowed (all ions
eluted from the column must be completely separated from all the other ions). In
this instance, and for three elements, the following nodes should be considered:
ABC//, A//B/C, AB//C, AC//B, BC//A, B//A/C, C//A/B, and //A/B/C. If one
considers only the stationary phase (to the left of //), one notes that all the
combinations of zero, one, two, and three ions out of three are present.
Calling n the total number of ions and p (0 ≤ p ≤ n) the number of ions in a
particular combination taken from these n, the total number of combinations is
equal to

\sum_{p=0}^{n} C_n^p

where C_n^p is the symbol used for the number of combinations of p elements out of a
set of n. This sum can be shown to be equal to 2^n. In this particular instance, a separation
scheme for eight elements would contain 256 nodes. For the general case (transfers
allowed), no less than 17008 nodes would have to be considered! Even for 4
elements, 38 nodes are obtained and it becomes difficult to consider all of the
possibilities without using graph theory. The shortest (cheapest) path in a graph
can be found, for example, with a very simple algorithm. Let us suppose that one
has to construct a highway from town a1 to a town a11 (Fig. 42.10). There are
several possible layouts, which are determined by the towns through which one
must pass, and these must be selected from a2 to a10. The values of the links in the
resulting graph are given by the estimated costs. The value 900 is a way to indicate
that this link is impossible or undesirable. The problem is, of course, to find the
cheapest route. A value λ1 = 0 is assigned to town (node) a1, and the value of all the
nodes ak directly linked to a1 is computed by using the equation λk = λ1 + l(a1, ak),
where l(a1, ak) is the length of edge (a1, ak). In this way, we assign the values 3, 5,
900, 0, 900 and 900 to the nodes a2, a3, a4, a5, a6 and a7, respectively. This
procedure is repeated for the nodes am linked directly to one of the nodes ak by
using the equation λm = λk + l(ak, am). We continue to do this until a value has been
assigned to each of the nodes in the graph. We would, for example, assign the
values 15, 1800, 900 and 902 to a8, a9, a10 and a11, respectively.
In the first stage, we assigned a possible value to all of the nodes, but not
necessarily the lowest possible value. For example, the value λ7 of node a7 is now
900. The value 900 is artificially high, indicating that for practical reasons it is

impossible to go from a1 to a7. In the highway example, this could mean a
mountain ridge and, in the ion-exchange case, it could be a separation that cannot
be carried out in a reasonable time. This value is derived here from edge (a1, a7)
using the equation λ7 = λ1 + l(a1, a7). Town a7, however, can also be reached from
town a2. The value of λ7 is then given by λ2 + l(a2, a7) and is equal to 13; this
replaces the original value 900. In this way, all of the nodes are checked until one
knows that each town is reached in the cheapest possible way. The optimal path is
then found by retracing the steps that led to the final value for λ11. In the graph in
Fig. 42.11, this is the path a1, a5, a8, a11 with a total value of 6.
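For readers who prefer to see the procedure as code, the following Python sketch computes the cheapest route with a standard shortest-path routine (Dijkstra's algorithm); with non-negative weights it gives the same answer as the labelling scheme described above. The edge values below are an assumption: they are chosen so that the values quoted in the text (13 for a7, and 6 for the path a1-a5-a8-a11) are reproduced, but they are not the complete set of weights of Fig. 42.10.

import heapq

# Hypothetical directed graph: edge values play the role of the l(a_i, a_j)
# weights in the text; 900 marks an impossible or undesirable step.
GRAPH = {
    "a1": {"a2": 3, "a3": 5, "a4": 900, "a5": 0, "a6": 900, "a7": 900},
    "a2": {"a7": 10, "a8": 12},   # a7 reachable via a2: 3 + 10 = 13, the improvement mentioned in the text
    "a3": {"a9": 1795},
    "a4": {"a10": 0},
    "a5": {"a8": 2},
    "a6": {"a10": 900},
    "a7": {"a11": 2},
    "a8": {"a11": 4},
    "a9": {"a11": 900},
    "a10": {"a11": 2},
    "a11": {},
}

def shortest_path(graph, start, goal):
    """Return (total value, path) for the cheapest route from start to goal."""
    queue = [(0, start, [start])]          # (value lambda, node, path so far)
    best = {start: 0}
    while queue:
        value, node, path = heapq.heappop(queue)
        if node == goal:
            return value, path
        for neighbour, weight in graph[node].items():
            new_value = value + weight
            if new_value < best.get(neighbour, float("inf")):
                best[neighbour] = new_value
                heapq.heappush(queue, (new_value, neighbour, path + [neighbour]))
    return float("inf"), []

value, path = shortest_path(GRAPH, "a1", "a11")
print(value, path)    # 6 ['a1', 'a5', 'a8', 'a11'] with these illustrative weights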
The graph in Fig. 42.10 is the graph obtained for the separation of Ca, Co, and
Th on a cation-exchange column. One arrives at this graph by replacing nodes
a1-a11 by CaThCo//, Ca//Th/Co, Th//Ca/Co, Co//Ca/Th, Ca Th//Co, Ca Co//Th, Th
Co//Ca, Ca//Th/Co, Co//Ca/Th, Th//Ca/Co, and //Ca/Th/Co, respectively. The
weights of the edges are distribution coefficients obtained from the literature. The
conclusion in this particular instance is that one must first reach situation a5, i.e. Ca
Th//Co, meaning that one must first elute Co. As the weight of the edge is 0, this
means that at least one solvent has been described in the literature that permits one
to elute Co with a distribution coefficient of 0 without eluting Ca and Th. The
following steps are the elution of Th (distribution coefficient = 2) and Ca (distrib-
ution coefficient = 4). As in Section 42.2, we should remark here that at least some
of the graph theoretic applications can also be carried out with genetic algorithms.

References

1. R.I. Ackoff and M.W. Sasieni, Fundamentals of Operations Research. Wiley, New York, 1968.
2. D.L. Massart, A. Dijkstra and L. Kaufman, Evaluation and Optimization of Laboratory
Methods and Analytical Procedures. Elsevier, Amsterdam, 1978.
3. T. De Vries, Het klinisch-chemisch laboratorium in economisch perspectief. Stenfert-Kroese,
Leiden, 1974.
4. H. De Clercq, M. Despontin, L. Kaufman and D.L. Massart, The selection of representative
substances by operations research techniques. J. Chromatogr., 122 (1976) 535-551.
5. S. de Jong and Th.J. R. de Jonge, Computer assisted fat blend recognition using regression
analysis and mathematical programming. Fat Sci. Technol., 93 (1991) 532-536.
6. I.K. Vaananen, S. Kivirikko, J. Koskennieni, J. Koskimies, A. Relander, Laboratory Simula-
tor: a study of laboratory activities by a simulation method. Meth. Inform. Med., 13 (1974)
158-168.
7. B. Schmidt, Rechnergesteuerte Strategien zur Probenverteilung in einem klinisch-chemischen
Labor. Z. Anal. Chem., 287 (1977) 157-160.
8. T. A.H.M. Janse, G. Kateman, Enhancement of the performance of analytical laboratories by a
digital simulation approach. Anal. Chim. Acta, 159 (1984), 181-189.
9. B.G.M. Vandeginste, Strategies in molecular spectroscopic analysis with application of
queueing theory and digital simulation. Anal. Chim. Acta, 112 (1979) 253-275.

10. B.G.M. Vandeginste, Digital simulation of the effect of dispatching rules on the performance
of a routine laboratory for structural analysis. Anal. Chim. Acta, 122 (1980) 435-454.
11. L. Kleinrock, Queueing Systems, Vol. I, Theory. Wiley, New York, 1976.
12. J.G. Vollenbroek and B.G.M. Vandeginste, Some considerations on batch arrival and batch
analysis in analytical laboratories. Anal. Chim. Acta, 133 (1981) 85-97.
13. B. Van de Wijdeven, J. Lakeman, J. Klaessens, B. Vandeginste and G. Kateman, Digital simu-
lation as an aid to sample scheduling in a routine laboratory for liquid chromatography. Anal.
Chim. Acta, 184 (1986) 151-164.
14. J. Klaessens, T. Saris, B. Vandeginste and G. Kateman, Expert system for knowledge-based
modelling of analytical laboratories as a tool for laboratory management. J. Chemometrics, 2
(1988)49-65.
15. J. Klaessens, J. Sanders, B. Vandeginste and G. Kateman, LABGEN, expert system for knowl-
edge based modelling of analytical laboratories. Part 2, Application to a laboratory for quality
control. Anal. Chim. Acta, 222 (1989) 19-34.
16. J. Klaessens, B. Vandeginste and G. Kateman, Towards Computer-supported management of
analytical laboratories. Anal. Chim. Acta, 223 (1989) 205-221.
17. J. Klaessens, L. Van Beysterveldt, T. Saris, B. Vandeginste and G. Kateman, Labgen, expert
system for knowledge-based modelling of analytical laboratories. Part 1. Laboratory organisa-
tion. Anal. Chim. Acta, 222 (1989) 1-17.
18. G.M. Birtwistle, O.J. Dahl, B. Myhrhaug and K. Nygaard, SIMULA Begin. Petrocelli/Charter,
New York, 1973.
19. C.A. Copenhaver, Knowledge Engineering Environment (KEE) from Intellicorp, Packag.
Software Rep., 15 (1989) 5-20.
20. M.J. Randic, On the recognition of identical graphs representing molecular topology. J. Chem.
Physics, 60 (1974) 3920-3928.
21. A.T. Balaban (ed.), Chemical Applications of Graph Theory. Academic Press, London, 1976.
22. V. Kvasnicka and J. Pospichal, An improved version of the constructive enumeration of mo-
lecular graphs with described sequence of valence states. Chemom. Intell. Lab. Systems, 18
(1993) 171-181.
23. L.B. Kier and L.H. Hall, Molecular Connectivity in Structure-Activity Analysis. Research
Studies Press, Letchworth, U.K., 1986.
24. D.L. Massart, C. Janssens, L. Kaufman and R. Smits, Application of the theory of graphs to the
optimalisation of chromatographic separation schemes for multicomponent samples. Anal.
Chem., 44 (1972) 2390-2399.

Recommended additional reading

L. Kleinrock, Queueing Systems, Vol. II: Computer Applications. Wiley, New York, 1976.
H.M. Wagner, Principles of Operations Research, With Applications to Managerial Decisions.
Prentice/Hall, Englewood Cliffs, NJ, 1975.
S.P. Sanoff, D. Poilevey, Integrated information processing for production scheduling and control.
Comput. Integr. Manuf., 4 (1991) 164-175.

Chapter 43

Artificial Intelligence: Expert and Knowledge Based Systems

43.1 Artificial intelligence and expert systems

To perform tasks in a chemical laboratory, various skills and expertise are required. A considerable amount of experience and intelligence is necessary to
apply correctly the theoretical knowledge. A good example concerns the use of
chemometrical techniques that are described in previous chapters of this book for
the analysis and validation of chemical data. Even when we think (or hope) that we
have described these techniques and their application possibilities and limitations
to a great extent, it remains difficult to select a suitable technique for a specific
problem at hand. This aspect requires a considerable amount of intelligence and
expertise. A specialized branch of science, artificial intelligence (AI) investigates
to what extent it is possible to implement this expertise in a computer. It is a
relatively new science whose research object is "intelligence". Although intelli-
gent behaviour is recognizable by most people, it is at this moment impossible to
give a clear definition of intelligence. One of several definitions that have been
offered is: "All active structures and processes during a reasoning process". From
the beginning of computer science, a small group of people has investigated this
topic in order to implement in the computer some kind of intelligent behaviour.
There are, however, so many different aspects to be studied that various sub-
disciplines emerged. Natural language systems, robotics, vision, automatic
learning, neural networks and expert systems are examples of topics studied in AI.
The first useful results relevant to this book were obtained with expert systems.
These are computer programs that emulate an expert's behaviour. This implies that
they can help solve problems at an expert level. In the last ten years some successes
have been achieved in chemistry with these software products [1-6]. They can be
considered as a new generation of tools for the chemical laboratory. When
integrated in existing methodology and data processing techniques as described in
previous chapters, they can provide substantial benefits. The possibility to
incorporate experience in expert systems (see Section 43.2) is the major reason for
their success.

As an example, consider the automation efforts for chemical laboratories in the
last decades. Chemical laboratories of today are equipped with instruments that, in
principle, can run automatically for 24 hours a day. This results in a higher
productivity, since more samples can be analysed with an equal technical effort.
Decisions about the analysis itself, how many and which samples must be analysed
with what method or technique, etc., are still the responsibility of the laboratory
personnel. Since experience can be incorporated into expert systems, they can
provide significant benefits as decision-supporting tools. Therefore, the main ideas
of expert systems and their development are explained in this chapter. More
detailed information can be found in the numerous textbooks on expert systems
[7-10].

43.2 Expert systems

Expert systems are computer programs that are intended to support tasks at an
expert level. What makes them different from other programs that compute
complicated formulas by means of ingenious algorithms? Certainly, these
programs are necessary and valuable tools for the expert; they make it possible to
carry out computations instantaneously and reliably. When making decisions in a
specialized area there is, however, another important aspect, namely, experience.
During the decision-making process, the expert — often without realizing it
explicitly — makes choices or excludes possibilities to reach conclusions
efficiently. His or her major guides are qualitative rules of thumb, optimized
during his/her own learning period and experience with his/her job. A simple
example will best illustrate the concept of heuristics.
Classification of French wines according to their region of origin is a problem
that can be tackled in different ways. Many studies have been published that apply different pattern recognition techniques to chemical and sensory characteristics of the wines. All these techniques perform well to very well, but they all require many
measurements. However, this strategy is totally inefficient and unnecessary when
some typical prior knowledge is available. When the sample is offered in its
original bottle, the shape and colour of the bottle contain at least part of the
required information. To reason with this kind of prior knowledge is difficult in
classical pattern recognition techniques. The inclusion of qualitative rules, also
called heuristic rules, would certainly enhance the performance of the classical
techniques.
Sometimes these heuristics are an extension of the theory and sometimes they
are experience-based rules, with no apparent theoretical justification, but they
simply work most of the time. One way of explaining human reasoning that has
influenced the expert system research is that these heuristic rules of thumb are

implicit in the expert's mind as a kind of decision tree that he or she is able to
consult and evaluate instantaneously. This is what defines an expert. An expert
system is then the implementation of a part of this decision tree in a computer
program. By chaining the heuristic rules, the expert system can emulate a part of
the decision capacity of the expert. Expert systems are not simulations of human
intelligence in general. They are simply practical programs that use heuristics to
solve specific problems. Such programs can obviously be useful as decision
support tools. Since they are developed to give advice at expert level, ideally they
should:
- be able to explain their own reasoning process, answer queries such as 'why'
or 'how' certain conclusions have been reached or 'why' certain conclusions
have 'not' been drawn;
- be easily modifiable; since by definition expert knowledge is dynamic, ex-
pert systems must allow easy implementation of changes;
- allow a certain degree of inexactness; since they are based on heuristics
which work most of the time, but not in 100% of the cases, a strategy to cope
with uncertainty should be available.
The last feature brings us to what is considered to be one major disadvantage of
expert systems. Since they are acting on an expert level, with a certain level of uncertainty, it is impossible to evaluate them in an exact way, as can be done with algorithmic programs. These programs should work correctly and reliably in 100%
of all cases. Test methods are available to evaluate them. Expert systems, however,
are based on heuristics, which fail occasionally. The problem is then how to
evaluate expert systems. What is acceptable? The only reasonable way is to
compare their performance with that of human experts. Even when they are
performing well, eliminating the occurrence of a failure is not possible. This is
especially a drawback when expert systems are to be integrated in an automatic
environment. One way to alleviate this problem is to include a strong
meta-knowledge component in the expert system, e.g., knowledge about its
boundaries and limitations in order to recognize cases where failures are likely to
occur.

43.3 Structure of expert systems

The typical structure of an expert system is shown in Fig. 43.1. Three basic
components are present in all expert systems: the knowledge base, the inference
engine and the interaction module (user interface). The knowledge base is the heart
of the expert system. It contains the necessary expert knowledge and experience to
act as a decision support. Only if this is correct and complete enough can the expert system produce meaningful and useful conclusions and advice. The inference

Fig. 43.1. General structure of an expert system: knowledge base, inference engine and interaction module.

engine contains the strategy to use this knowledge to reach a conclusion. This
separation between the knowledge in the knowledge base and the inference engine
is an important characteristic of expert systems. An advantage is that changes made in the knowledge base will not corrupt the active component, the inference
engine. Another spin-off of this separated architecture is the possibility of multiple
uses of the inference engine for different knowledge bases. Expert-system shells
are commercially available software programs that contain all elements of Fig.
43.1, except that the knowledge base is empty and must be filled to create specific
applications. The interaction module takes care of communication between the
user and the expert system. Explanation facilities are an important part of the
interface.

43.4 Knowledge representation

The knowledge representation must allow an efficient implementation of all the
knowledge that is necessary for the problem area; at the same time it must be
efficiently usable by the inference strategy. These are often conflicting demands.
Finding the right balance is a research subject for AI specialists. Because expert

systems will serve as human decision support tools they must incorporate all the
knowledge of the domain in a natural way. The focus is changed from computer
efficiency aspects towards the programmer's or user's comfort. It is thanks to
research in this field that the focus of modern computer tools and languages has
shifted more to the user than to the machine.
In general, a representation method for expert systems should fulfil as well as
possible the following criteria:
- incorporate qualitative knowledge in a natural way,
- allow an efficient use of this knowledge by the inference engine,
- allow structuring of the knowledge,
- allow representation of meta-knowledge: knowledge about the system itself,
so that it is able to recognize its own boundaries.
The two most important representation schemes are the rule-based scheme and
the frame-based or object-oriented scheme.

43.4.1 Rule-based knowledge representation

The most popular representation scheme in expert systems is the rule-based
scheme. In a rule-based system the knowledge consists of a number of variables,
also called attributes, to which a number of possible values are assigned. The rules
are the functions that relate the different attributes with each other. A rule base
consists of a number of "If... Then..." rules. The "IF" part contains the conditions
that must be satisfied for the actions or conclusions in the "THEN" part to be valid.
As an example, suppose we want to express in a rule that a compound is unstable in
an alkaline solution if it contains an ester function. In a semi-formal way the rule
can be written:
IF the molecule contains an ester-function
THEN it is unstable in an alkaline solution
The syntax for implementing this in an actual expert system is, of course, quite
diverse and it is beyond the scope of this book to describe this in more detail. As an
example the above-mentioned rule is translated into a realistic syntax. This starts with the definition of the attributes and their possible values. In this example two
attributes must be defined:
Attribute Possible values
ester-function [yes, no]
alkaline stability [stable, unstable]
IF ester-function = 'yes'
THEN alkaline stability = 'unstable'

In what follows we will use the semi-formal way of describing rules, since there
are multiple equivalent ways to implement them formally. The rules may consist of
one or more conditions in the IF part and one or more conclusions in the THEN
part. The collection of all the rules in the knowledge base constitutes the expertise
present in the system. Typically a medium-sized expert system contains a few hundred rules. Rules allow a flexible and modular representation of this
expertise. It is important that each rule represents a single step in the reasoning
process. In this case rules can be seen as independent chunks of knowledge that can
be changed or modified easily without affecting the other rules. When the know-
ledge domain becomes complex, this condition is almost impossible to fulfil.
Larger applications require the possibility of structuring the rules in small rule sets
that are easier to manipulate.
Rules seemingly have the same format as "IF ... THEN ..." statements in any other
conventional computer language. The major difference is that the latter statements
are constructed to be executed sequentially and always in the same order, whereas
expert system rules are meant as little independent pieces of knowledge. It is the
task of the inference engine to recognize the applicable rules. This may be different
in different situations. There is no preset order in which the rules must be executed.
Clarity of the rule base is an essential characteristic because it must be possible to follow the system's reasoning and to check it for reasoning errors. The structuring of rules into rule
sets favours comprehensibility and allows a more efficient consultation of the
system. Because of the natural resemblance to real expertise, rule-based expert
systems are the most popular. Many of the earlier developed systems are pure
rule-based systems.
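In software, such rules are usually held as data rather than as hard-coded IF statements, so that individual rules can be added, changed or removed without touching the program that uses them. A minimal sketch in Python, using the ester rule above (the dictionary layout is an illustration, not a standard expert-system format):

# Each rule is held as data: a set of IF conditions (attribute -> required value)
# and a set of THEN conclusions (attribute -> value).
RULES = [
    {"name": "ester rule",
     "if":   {"ester-function": "yes"},
     "then": {"alkaline stability": "unstable"}},
]

def applicable(rule, facts):
    """A rule may fire when every condition in its IF part holds in the facts."""
    return all(facts.get(attr) == value for attr, value in rule["if"].items())

facts = {"ester-function": "yes"}
for rule in RULES:
    if applicable(rule, facts):
        facts.update(rule["then"])          # fire the rule: add its conclusions
print(facts)    # {'ester-function': 'yes', 'alkaline stability': 'unstable'}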

43.4.2 Frame-based knowledge representation

Frame-like structures can be used to represent the facts, objects and concepts. In
this context frames must be interpreted as a software structure (a frame) in which
all characteristics of an object or a concept are described. The simplest examples of frame-like structures are the so-called "object-attribute-value" triplets. Examples
of such triplets to describe a chromatographic column are:
Column 1 - Tradename - Lichrosorb
Column 1 - Functionality - C8
Column 2 - Tradename - µ-Bondapak
Column 2 - Functionality - C18

Frames can be seen as structures where all relevant information about an object
or a concept is collected. As an example the relevant information about a column in
a chromatographic method can be represented by a dedicated general frame,
COLUMN. Separate columns can be represented by so-called instantiations of this
frame. Instantiations are copies of the general frame that contain the characteristics
of a specific object, in this example the separate columns.

COLUMN
Tradename: [Spherisorb, Hypersil, Lichrosorb,...]
Functionality: [C8,C18,...]
Particle size: [Real number] (µm)
Column length: [Integer number] (cm)
Batch number: [Integer number]
Internal diameter: [Real number] (mm)

COLUMN 1
Tradename: Lichrosorb
Functionality: C8
Column length: 30 cm
Particle size: 4.0 µm
Batch number: 2347645
Internal diameter: 4.6 mm

COLUMN 2
Tradename: µ-Bondapak
Functionality: C18
Column length: 30 cm
Particle size: 5.0 µm
Batch number: 3459863
Internal diameter: 4.0 mm
In these frames all specific columns that are relevant for the reasoning process of
the expert system can be described in a structured and comprehensive way. The
frame-based and rule-based knowledge representations are both required to rep-
resent expertise in a natural way. Therefore, in most expert systems a combination
of rule-based and frame-based knowledge representation is used. The rule base
together with the factual and descriptive knowledge by means of, e.g., frames
constitute the knowledge base of the expert system.
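In a general-purpose language the COLUMN frame and its instantiations could be mimicked, for instance, with a Python dataclass; the field names below follow the frame, while the code itself is only an illustrative sketch:

from dataclasses import dataclass

@dataclass
class Column:
    """General frame COLUMN: each attribute is a slot of the frame."""
    tradename: str
    functionality: str
    particle_size_um: float
    column_length_cm: int
    batch_number: int
    internal_diameter_mm: float

# Instantiations: copies of the general frame filled with the characteristics
# of specific objects (the two columns described above).
column1 = Column("Lichrosorb", "C8", 4.0, 30, 2347645, 4.6)
column2 = Column("µ-Bondapak", "C18", 5.0, 30, 3459863, 4.0)

print(column1.functionality, column2.functionality)    # C8 C18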

43.5 The inference engine

43.5.1 Rule-based inferencing

The inference engine is the active component of the expert system. It contains a
strategy to use the knowledge, present in the knowledge base, to draw conclusions.
In a rule-based expert system its major task is to recognize the applicable rules and
how they must be combined in order to derive new knowledge that eventually
leads to the conclusion. Suppose the following two rules are present in a
knowledge base:

IF a molecule contains a phenyl group
THEN it is UV-active
and
IF a molecule is UV-active
THEN a UV-detector can be used
These two rules can be combined into:
IF a molecule contains a phenyl group
THEN a UV-detector can be used.
In this example this combination is straightforward. One has to be aware, however,
that a medium-sized expert system easily contains several hundred rules. In addition, several rules can be valid at the same time. The inference engine should also have a strategy for deciding on priorities (conflict resolution).
For the combination of rules two alternative strategies are possible: forward
chaining (data-driven) and backward chaining (goal-driven). A forward chaining
strategy first checks all available factual knowledge for a given problem. It then
compares this with the IF parts of the rules and tries to derive new knowledge that
can be used to activate other rules. A backward, goal-oriented strategy starts from a
possible solution of the problem and tries to establish and prove all necessary
conditions for this solution. If this is not successful, another possibility is tried out,
until a solution is found.
An example will clarify this. Suppose we have a small expert system that can
give advice on a suitable solvent and the rule base consists of the following six
rules, here represented in a semi-formal way:
1. IF the polarity of the compound = not polar
THEN the solvent is methanol
2. IF the stability in an alkaline solution of the compound = not stable
THEN the solvent is methanol
3. IF the polarity of the compound = polar
AND
IF the stability in an alkaline solution of the compound = stable
THEN the solvent is an alkaline solution
4. IF no functional groups are present
THEN the polarity of the compound = not polar
5. IF an ester-function is present
THEN the stability in an alkaline solution of the compound = not stable

6. IF a carboxylic functional group is present
THEN the polarity of the compound = polar

This is obviously not a complete rule base and, moreover, we have over-simplified the chemical knowledge, but for clarity we will restrict the example to these six rules. The knowledge base is summarized in Fig. 43.2.
We will simulate a forward and a backward inference engine to find the solvent
for acetylsalicylic acid. The goal-driven or backward strategy starts with a possible
solution. In our case the two possible solvents are methanol and an alkaline
solution. The inference engine searches the rule base for all rules that decide on the
solvent. Rules 1, 2 and 3 are selected on this basis. At this point the inference
engine must decide which one to investigate first. The simplest strategy is to
investigate the rules in order of appearance in the knowledge base. In our case this
means that the inference engine selects Rule 1. In more sophisticated expert
systems priority rules can be defined that decide on the order of the rules or rule
sets that must be investigated. In our example we will first investigate Rule 1. The
THEN part of this rule is valid if the compound is not polar. The IF part of this rule
now becomes a new subgoal (SG1: is the polarity of the compound = not polar?)
that is treated in a similar way. The inference engine now tries to prove this subgoal
by searching other rules that conclude in the THEN part on the polarity of a
compound. This is the case for Rules 4 and 6. Rule 4 is first selected. The IF part of

Fig. 43.2. Representation of a small example knowledge base to select a solvent (see text for further explanation).

Rule 4 is now the new subgoal (SG2: are there functional groups present?). No
rules are found in this rule base that conclude on the presence of functional groups.
At this point the inference engine checks whether this information is available in
the factual database (e.g. the frames). In our example we assume that there is no
such factual database and the working memory is still empty. Only in this case
does the inference engine ask the user (through the interaction module) for
information about this point. When the user confirms the presence of functional
groups in acetylsalicylic acid the inference engine realizes that Rule 4 is not valid.
Rule 6 is now investigated as the second and last possibility to conclude on the
polarity of the compound. The compound is polar when a carboxylic functional
group is present. The presence of a carboxylic function becomes subgoal 3. Since
no rules or databases are available to decide on this, the user is again asked for
information on this topic. Because acetylsalicylic acid contains a carboxylic
function, the inference engine can establish the polarity of the compound: The
polarity of acetylsalicylic acid = polar. Based on this conclusion SG1 must be
rejected. The inference engine is now back to Rule 1 and has to conclude that the
path followed was unsuccessful. The process just described is called backtracking.
Rule 2 can now be investigated in the same way. SG4 is now the instability of the
compound in an alkaline solution. Rule 5 is selected because it concludes in the
THEN part on the stability of compounds in an alkaline solution. The presence of
an ester function becomes SG5. Since no rules are available that conclude on this
issue, the user is again asked for information. Through the interaction module the
user now confirms the presence of an ester function in acetylsalicylic acid. This
proves SG5 which in turn establishes SG4. We are now back to Rule 2, which can
be executed. The second path turned out to be successful. The expert system has
come to a conclusion and advises methanol as a suitable solvent. In the same way
Rule 3 is investigated. The reader can verify that the inference engine can reject this third path without further information from the user.
It is already clear from this small example that aspects such as conflict resol-
ution can have an important influence on the performance of the search strategy.
A forward chaining inference engine first searches for rules that make use of
characteristics of the molecule. This is the case for Rules 4, 5 and 6. The inference
engine checks whether the conditions of the IF parts of these rules can be proven
from the knowledge already present in the descriptive part of the knowledge base,
e.g. in the frames. In our example there is no such information. In this case the
expert system queries the user for information on functional groups, ester
functions, and carboxylic acid functions. Rule 4 is selected first (order of appear-
ance). Because there are indeed functional groups present in acetylsalicylic acid,
this rule cannot be executed and the inference engine continues with Rule 5. This
rule can be executed because the user has confirmed the presence of an ester-
function in acetylsalicylic acid. The stability in an alkaline solution has now been

established as: 'not stable'. The inference engine now looks for rules that use in the
IF part the stability in an alkaline solution of the compound = 'not stable'. This is
the case for Rule 2 and this rule can be executed, leading to the conclusion, namely
to advise methanol as solvent for acetylsalicylic acid. The inference engine
continues with Rule 6. Based on the information of the user that a carboxylic group
is present, the polarity of the compound can be established as 'polar'. The
inference engine looks for rules that use the fact that the compound is polar in the
IF part. This is the case for Rule 3. This rule can, however, not be executed since
the second IF statement of the rule is not true.
Both strategies show advantages and drawbacks. When most information of the bottom-line rules (here Rules 4, 5 and 6) is present in the knowledge base or can be obtained automatically, e.g. by a database search, then forward chaining is the natural way to proceed. When, however, much information must be queried from the user in this way, the use of the expert system becomes irritating and the system gives the impression of asking for many facts that are seemingly not used. Backward
chaining avoids this drawback. Information is asked only where necessary for the
reasoning process. The consultation of such a system creates a closer relationship
with the user since following the reasoning path of the system is easier, and
questions come in a logical sequence.
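A forward-chaining pass over the six example rules can be sketched in a few lines of Python. The rules and the facts for acetylsalicylic acid follow the text above; the control strategy (keep scanning the rules until nothing new can be derived) is deliberately simplified and leaves out conflict resolution and user interaction:

# Rules 1-6 of the example, written as (conditions, conclusion) pairs.
RULES = [
    ({"polarity": "not polar"},                ("solvent", "methanol")),
    ({"alkaline stability": "not stable"},     ("solvent", "methanol")),
    ({"polarity": "polar",
      "alkaline stability": "stable"},         ("solvent", "alkaline solution")),
    ({"functional groups": "no"},              ("polarity", "not polar")),
    ({"ester function": "yes"},                ("alkaline stability", "not stable")),
    ({"carboxylic group": "yes"},              ("polarity", "polar")),
]

# Facts the user would supply for acetylsalicylic acid.
facts = {"functional groups": "yes", "ester function": "yes", "carboxylic group": "yes"}

def forward_chain(rules, facts):
    """Fire applicable rules until no new fact can be derived (data-driven)."""
    derived = True
    while derived:
        derived = False
        for conditions, (attr, value) in rules:
            if all(facts.get(a) == v for a, v in conditions.items()):
                if facts.get(attr) != value:
                    facts[attr] = value
                    derived = True
    return facts

print(forward_chain(RULES, facts)["solvent"])    # methanol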

43.5.2 Frame-based inferencing

43.5.2.1 Inheritance
A typical feature of expert systems that support frames is inheritance. Frames
can be organized in a hierarchical structure. They can inherit properties (attributes)
from frames that are higher in the hierarchy. The latter are therefore called parent frames and the former child frames. There are many varieties of the inheritance
principle. Frames can have only one parent frame (simple inheritance) or may have
multiple parent frames (multiple inheritance). All attributes can be inherited (full
inheritance) or only a few, selected by the knowledge engineer, may be inherited
(partial inheritance) by the child frames. An example of a simple inheritance
organization of frames is shown in Table 43.1. The frame 'Organic Compound' is
the parent frame. The frames 'Ester' and 'Acids' are child frames of 'Organic
Compound'. A typical example of inheritance is instantiation. The frame Acetic
acid is a child of 'Acids' and, since no extra attributes are added, it is also an
instantiation.
The hierarchical structure of the frames defines the relation between them. This
hierarchical structure is static, however, and cannot be easily modified. Inheritance
is therefore an elegant way to represent taxonomical relations or well-established
relations between objects. It is less suited for variables or ill-defined relations
between objects.

TABLE 43.1

An example of inheritance

Organic Compound
Name
Molecular formula
Molecular weight

Ester (child of 'Organic Compound')
Name
Molecular formula
Molecular weight
Stability in alkaline solution

Acids (child of 'Organic Compound')
Name
Molecular formula
Molecular weight
pKa value

Acetic Acid
(child of 'Acids')
Name: acetic acid
Molecular formula: C2H4O2
Molecular weight: 60
pKa value: 4.7
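In an object-oriented language the parent/child relations of Table 43.1 map directly onto class inheritance. A minimal, purely illustrative sketch in Python:

class OrganicCompound:
    """Parent frame: attributes shared by all organic compounds."""
    def __init__(self, name, molecular_formula, molecular_weight):
        self.name = name
        self.molecular_formula = molecular_formula
        self.molecular_weight = molecular_weight

class Ester(OrganicCompound):
    """Child frame: inherits the parent attributes and adds one of its own."""
    def __init__(self, name, molecular_formula, molecular_weight, alkaline_stability):
        super().__init__(name, molecular_formula, molecular_weight)
        self.alkaline_stability = alkaline_stability

class Acid(OrganicCompound):
    """Child frame with the extra attribute pKa."""
    def __init__(self, name, molecular_formula, molecular_weight, pka):
        super().__init__(name, molecular_formula, molecular_weight)
        self.pka = pka

# Instantiation: a specific object of the child frame 'Acid'.
acetic_acid = Acid("acetic acid", "C2H4O2", 60, 4.7)
print(acetic_acid.molecular_weight, acetic_acid.pka)    # 60 4.7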

43.5.2.2 Object-oriented programming techniques


Another relatively recent feature of frame-based expert systems is the use of
object-oriented programming techniques. The frames are considered as independ-
ent entities that can exchange information by means of messages that they can send
to each other. Which frames can send/receive messages to/from which frames
must be defined beforehand by procedures that must be part of each frame. Apart
from attributes, procedures are thus defined for each frame. They define what kind
of messages can be sent to which other frames and what kind of messages can be
received by which frames. A message can be the request to change an attribute, or
to activate another procedure of the frame. Procedures can involve external
programs to be executed.
Similar to attributes, procedures can be inherited by child frames when inheri-
tance is supported. Object-oriented programming enormously enlarges the flex-
ibility of frame-based systems. A disadvantage is, however, that it often results in
unclear and intractable reasoning structures. Table 43.2 shows the small rule base
of Fig. 43.2, translated in an object-oriented system.

TABLE 43.2
An example of object-oriented programming

Sample
    polarity [yes/no]
    alkaline solution stability [yes/no]

    Procedure 'find polarity'
        Call external program 'polarity.exe'
        Assign value1 to SAMPLE/'polarity'
        Send message to SOLVENT/procedure 'start' = value1

    Procedure 'find alkaline solution stability'
        Call external program 'alkaline solution stability.exe'
        Assign value2 to SAMPLE/'alkaline solution stability'
        Send message to SOLVENT/procedure 'start' = value2

Solvent
    Methanol [yes/no]
    Alkaline solution [yes/no]

    Procedure 'start'
        Send message to SAMPLE: activate procedure 'find polarity'
        Send message to SAMPLE: activate procedure 'find alkaline solution stability'
        if value1 = 'yes'
        and value2 = 'yes'
            SOLVENT/'Alkaline solution' = 'yes'
        elsif SOLVENT/'Methanol' = 'yes'
        endif
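The message-passing behaviour of Table 43.2 can be imitated with ordinary method calls between two objects. In the sketch below the external programs are replaced by trivial stand-in functions, so the code only illustrates the control flow, not a real implementation:

class Sample:
    """Frame SAMPLE: holds the attributes and the 'find ...' procedures."""
    def __init__(self, has_carboxylic_group, has_ester_function):
        self.has_carboxylic_group = has_carboxylic_group
        self.has_ester_function = has_ester_function

    def find_polarity(self):
        # Stand-in for the external program 'polarity.exe'
        return "yes" if self.has_carboxylic_group else "no"

    def find_alkaline_stability(self):
        # Stand-in for the external program 'alkaline solution stability.exe'
        return "no" if self.has_ester_function else "yes"

class Solvent:
    """Frame SOLVENT: procedure 'start' sends messages to SAMPLE and decides."""
    def start(self, sample):
        value1 = sample.find_polarity()              # message to SAMPLE
        value2 = sample.find_alkaline_stability()    # message to SAMPLE
        if value1 == "yes" and value2 == "yes":
            return "alkaline solution"
        return "methanol"

# Acetylsalicylic acid: carboxylic group and ester function present.
print(Solvent().start(Sample(has_carboxylic_group=True, has_ester_function=True)))    # methanol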

43.5.3 Reasoning with uncertainty

Because the rules or procedures in expert systems are heuristic they are often
not well-defined in a logical sense. Nevertheless, they are used to draw conc-
lusions. A conclusion can be uncertain because the truth of the rules deriving it
cannot be established with 100% certainty or because the facts or evidence on
which the rule is based are uncertain. Some measure of reliability of the obtained
conclusions is therefore useful. There are different approaches used in expert
systems to model uncertainty. They can be divided into methods that are based on

Bayesian probability theory and methods that are based on fuzzy-set theory. The
principles of both theories are explained in Chapter 16 and Chapter 19, respect-
ively. Both approaches have advantages and disadvantages for the use in expert
systems and it must be emphasized that none of the methods developed up to now is satisfactory [7,11].
The application of Bayesian theory requires the knowledge of probabilities of
all occurring situations. This is in practice impossible. In one of the earliest
successful expert systems, MYCIN, a medical diagnosis system [12], this problem
has been circumvented by assigning two numbers to each rule, not to be interpreted as probabilities but as likelihoods: the measure of belief (MB, reflecting the strength of the rule) and the measure of disbelief (MD). These two measures are
combined in a certainty factor (CF). There are different ways, loosely based on
probability theory, to combine these measures when rules must be chained to
arrive at the conclusion. Instead of assigning a certainty factor to a rule or
statement, a fuzzy measure can be assigned that is based on a function, sometimes
called a credibility function in expert system terminology.
All these methods claim to represent the intuitive way an expert deals with
uncertainty. Whether this is true remains an open question. No method has yet
been evaluated thoroughly. Modelling uncertainty to obtain a reasonable relia-
bility measure for the conclusions remains one of the major unsolved issues in
expert system technology. Therefore, it is important that in the expert system a
mechanism is provided to define its boundaries, within which it is reasonably safe
to accept the conclusions of the expert system.
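As a concrete illustration, the MYCIN-style arithmetic is often summarized as CF = MB - MD, with a rule's CF attenuated by the certainty of its evidence and with two positive CFs for the same conclusion combined as CF1 + CF2(1 - CF1). The Python sketch below implements that arithmetic; it is one of several published variants and is shown only to fix ideas:

def certainty_factor(mb, md):
    """Certainty factor of a rule or statement: belief minus disbelief."""
    return mb - md

def propagate(cf_rule, cf_evidence):
    """Attenuate a rule's certainty by the (non-negative) certainty of its evidence."""
    return cf_rule * max(0.0, cf_evidence)

def combine(cf1, cf2):
    """Combine two positive certainty factors supporting the same conclusion."""
    return cf1 + cf2 * (1.0 - cf1)

cf_rule = certainty_factor(mb=0.8, md=0.1)             # 0.7
cf_conclusion = propagate(cf_rule, cf_evidence=0.9)    # 0.63
print(combine(cf_conclusion, 0.5))                     # about 0.815: two independent supporting rules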

43.6 The interaction module

The interaction module takes care of the communication with the user. It has
essentially the same function as the user interface of other software programs. All
queries to and from the expert system are passed through this module. Although it
has nothing to do with the quality of the advice, it often determines the degree of
acceptance of the system. Most expert systems provide an interface for both the programmer and the user. The programmer's interface can be a knowledge base editor that allows the introduction of the rules in the correct format in an efficient way. Sometimes an automatic debugger for syntax errors is also provided. The possibility to retrieve or forward information from or to the user is always provided. Expert systems can use question/answer dialogues in natural-like language, menu-driven interfaces or graphical-style interactions. The
preferred interface is mainly application-dependent.
Another essential component, different from classical software, of the inter-
action module is the explanation facility subsystem. The simplest provision is the

explanation of a question or concept. This is in most cases easily implemented by
means of a text file where some difficult concepts are explained. The explanation
system only shows the proper section of this file. If we consider the small example
expert system of the previous section, we could add explanations for the concept:
functional groups.
In most expert systems two types of questions from the user are supported:
WHY is certain information required and HOW are conclusions reached. In
today's expert systems the answer to the first question is simply the rule that the
inference engine is trying to fire. If in our example the user asks WHY the expert
system asks whether there are functional groups in the compound, the answer
would be a reproduction of Rule 4. The answer to the HOW question is the
sequence of rules needed to reach the conclusion. In some cases this is sufficient
information. Occasionally, however, the user is actually asking for more
theoretical background knowledge of the system. To date, this is not possible,
since expert systems are not simulations of the expert's intelligence but rather
practical programs that reproduce (part of) the heuristics of the expert.
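A HOW answer is typically no more than a replay of the rules that were recorded while the conclusion was derived. A minimal sketch of such a trace, reusing the rule format of the earlier example (illustrative only):

def forward_chain_with_trace(rules, facts):
    """Forward chaining that also records which rules fired, in order."""
    trace, derived = [], True
    while derived:
        derived = False
        for name, conditions, (attr, value) in rules:
            if all(facts.get(a) == v for a, v in conditions.items()) and facts.get(attr) != value:
                facts[attr] = value
                trace.append(name)          # remember the rule for the HOW explanation
                derived = True
    return facts, trace

RULES = [
    ("Rule 5", {"ester function": "yes"}, ("alkaline stability", "not stable")),
    ("Rule 2", {"alkaline stability": "not stable"}, ("solvent", "methanol")),
]
facts, trace = forward_chain_with_trace(RULES, {"ester function": "yes"})
print("HOW:", " -> ".join(trace))    # HOW: Rule 5 -> Rule 2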

43.7 Tools

The implementation of an expert system can be done using an AI language such
as PROLOG or LISP (or a conventional language such as PASCAL, FORTRAN).
This is, however, a tedious and time-consuming task. Another approach is to use
dedicated tools for building expert systems. These tools facilitate the knowledge
acquisition and the implementation. Tools for implementation are commercially
available in a large price range. Tools for knowledge acquisition are still in the
research phase.
Implementation tools differ in the facilities they offer. A major selection
criterion is the knowledge representation possibilities the tools offer. Another
important issue is the access to external programs. This is an important aspect for
chemical knowledge bases since many chemical problems involve a large
computational aspect that is best solved by an external, algorithmic program.
Expert system shells represent the lower end of the range of tools. Shells are
tools that offer only one knowledge representation strategy with only one
inferencing strategy (or a restricted second one). Shells typically support the
rule-based knowledge representation. The user interface is highly standardized
and does not allow user-tailored output. Explanation facilities are restricted to the
HOW and WHY type. Access to externals is also standardized to classical packages such as Lotus 1-2-3, Excel or dBase. Due to the absence of flexibility,
shells are not suitable for large projects. They are, however, very useful for small
test projects or for projects where the expert and the knowledge engineer is one and
the same person. Shells run on PC-type hardware.

Other, more advanced tools offer possibilities for combining different knowledge representations and inferencing strategies. They typically offer rules combined with frames, and different inheritance possibilities. The possibility of
object-oriented programming is not always present. The user interface is not
standardized and can be user tailored. The interface to standard packages is usually
available.
Knowledge engineering environments offer extended possibilities to combine various knowledge representations as well as inferencing strategies. The user interface includes graphical possibilities. External programs can be integrated
using the underlying language of the tool. Knowledge environments require at
least workstations as hardware.

43.8 Development of an expert system

Expert systems for solving practical problems, also in chemistry, started to appear in the eighties. During this period much experience has been acquired
through the expected and unexpected problems that arose during such projects.
Until now there are only a few commercially available expert systems and this is
not likely to change in the near future. This implies that expert systems will be
mostly in-house developments. The different steps to consider are:
1. analysis of the application area,
2. definition of the knowledge domain and intended users,
3. definition of knowledge sources,
4. selection of hard- and software,
5. knowledge acquisition,
6. implementation/prototyping,
7. testing/validating,
8. maintenance.

43.8.1 Analysis of the application area

Since the development of an expert system is a substantial project, only applications with high returns are worth considering. The benefits and the risks must be well evaluated. The pay-off of application areas relates to the required speed and/or the required reliability of the task. Complicated tasks
requiring high level expertise and that are time-critical in their execution are good
candidates. It is more likely that the human experts make errors at these tasks. A
decision support system is then a valuable tool. Examples are diagnostics in a
nuclear reactor [13] or for medical diagnosis [14]. Another pay-off of the devel-
opment of an expert system concerns the detection of unknown knowledge gaps.
During the knowledge acquisition phase it is necessary to make all heuristic rules

explicit. This process often leads to the conclusion that for some situations no
heuristic rules or experience are available. The awareness of these knowledge gaps
can prevent unforeseen difficulties. There may, of course, also be other reasons that justify an expert system project. Obtaining experience with this kind of project is often a reason, or the system may be useful for demonstration purposes.

43.8.2 Definition of knowledge domain, sources and tools

The next few steps are very similar to those required in any software project.
One of the first stages is the clear definition of the knowledge domain. It must be
clear which problems the expert system must solve. It is at this stage not the
intention to define how this can be done. Clarity and specificity must be the major
guides here. Fuzziness at this stage will, more than in classical software projects,
have to be paid for later when different interpretations cause misunderstandings.
Equally important is the clear definition of the end user(s). An expert system set up
as decision support tool for professionals is totally different from an expert system
that can be used as a training support for less professional people.
The next step is to make sure that the whole knowledge area is covered by
appropriate expertise. In most cases the knowledge source is a person with
extensive experience in the problem area. The expert will also be responsible for
the contents of the knowledge base and the quality of advice. Usually, literature
references will serve as a secondary source to find basic or background knowledge.
Using literature puts fewer demands on the experts. There are situations where
literature can be the main source of knowledge. The knowledge domain must then
be stable for a longer period. Still there has to be an expert who is responsible for
extracting the relevant information from the literature.
After the knowledge domain, the end users and the experts have been defined, a
choice must be made for the software environment for the development of the
expert system. In addition, the hardware for the development process as well as for
the delivery system must be identified. Real application expert systems always
require at least a mid-sized tool (see Section 43.7).
If possible, the best procedure for selecting a tool is to implement a smaller test
knowledge base similar to the final knowledge base in a few different tools. This
test knowledge base should be small enough to be implemented quickly in
different tools. On the other hand, it must be specific enough to highlight the
differences between the tools.

43.8.3 Knowledge acquisition

The knowledge base of an expert system contains the relevant knowledge about
the specific area of the expert system. For this all necessary heuristics must be

made explicit. Two persons play a pivotal role in this step: the expert and the
knowledge engineer. The expert is responsible for the contents of the expert
system while the knowledge engineer is responsible for the implementation of the
knowledge in the selected tools. Knowledge acquisition is the bottleneck of the
development stage. Here expert and knowledge engineer together try to unravel
the decision path of the problem solving. All the factual knowledge, the decision
rules and rules of thumb must be explicitly defined. This process is quite tedious
because experts usually have much trouble in defining why and how certain
conclusions are reached. The expert often uses many assumptions that are obvious
to him or her. Yet these have to be stated explicitly since nothing is obvious for the
computer. It is the task of the knowledge engineer to assist the expert in making all
necessary knowledge explicit. Up to now numerous knowledge acquisition
techniques have been applied, ranging from interviews of the expert by the
knowledge engineer to closely observing the expert in his daily practice. Research
is still going on to support this step and to automate it as far as possible.
The final aim is to construct a formalized representation of the decision process.
Decision trees and structured system analysis are possibilities. Some types of
expert systems can derive their own rules from examples. These are described in
Chapters 18 and 33.

43.8.4 Implementation

After knowledge acquisition has been completed, the next step is implement-
ation. Since in practice explicit knowledge emerges piecewise, it is useful to start
implementation as soon as some structured knowledge is available. In doing so, the
knowledge engineer has the opportunity to become acquainted with the tool. The
expert on the other hand uses the prototype to check whether no misunderstandings
exist. This tends to greatly enhance the confidence in the project and to reduce
misunderstandings between the knowledge engineer and the expert. An important
point that should always be borne in mind during implementation is the future
maintainability of the system.

43.8.5 Testing, validation and evaluation

The testing phase is important in expert system development. The practical
applicability of the expert system will largely depend on this phase. Testing expert
systems is different from normal software engineering in a number of ways. First,
it is difficult to test exhaustively the full code and all possible paths the reasoning
process may follow. Secondly, the nature of expert systems poses some typical
problems. Due to their heuristic nature the correctness of the results cannot be
easily verified. A certain degree of errors may be acceptable and, moreover, an

answer, different from the expert's, may still be acceptable. Because of these facts
it is not to be expected that general test procedures will become available. The
practical question that must be answered is whether the expert system is useful in
the working environment. One procedure to test an expert system is a validation
followed by an evaluation of the expert system. Validation involves verifying
whether the system responds in the way the expert has in mind. Evaluation
involves verifying whether the expert system meets the requirements of the
intended end users. Validation can be performed with a number of carefully
selected test cases. Evaluation must be performed in real situations. The users must
apply the system in their daily work and check the practical usefulness.

43.8.6 Maintenance

Once an expert system is installed in a laboratory, it is important to follow its performance over time. Hidden errors may be detected by longer use of the system.
Keeping the system up to date is also important. New knowledge and experience
which can improve the performance of the system should be incorporated. The
maintenance of expert systems should be foreseen in the project management.
Knowledge about the structure and knowledge engineering strategy must be saved
as long as the expert system is working.

43.9 Conclusion

Expert systems were investigated most intensively during the eighties, and it
appeared that the problems for their practical usability were greater than expected.
In particular, the time-consuming acquisition phase, the absence of a general test
procedure and the lack of a satisfactory method to reason with uncertain know-
ledge prohibited their real breakthrough. Another typical feature of expert systems
that limits their distribution compared with classical software is the fact that they
contain very subjective and site-specific knowledge. The utility of this kind of
knowledge on a larger scale is, of course, limited. All these problems caused a
decline in the development of expert system applications. The phrase 'expert
system' became almost taboo. Nevertheless, some expert systems have been
developed that are useful on a large scale, particularly in the medical domain. In
the chemical domain there are also a few famous expert systems, such as DENDRAL [15] and, for organic synthesis, LHASA [1]. LHASA uses a retrosynthetic approach: starting from the
molecule to be synthesized, it gives different possibilities for how that molecule
can be synthesized in one step from smaller molecules. The user then selects one of
the possibilities and subsequently LHASA continues with the molecules involved

in the selected reaction. The process continues until basic starting molecules such
as, e.g., ethanol are reached. LHASA is still undergoing development to make it
more efficient to use [16]. Other applications can be found in spectroscopy,
chromatography and AAS; for an overview, see Refs [2-4,17-22].
Two trends can be observed in the last few years. First, small-scale expert
systems are being incorporated in chemical instrumentation under the name
decision-support systems. Secondly, in many companies site-specific knowledge
is being incorporated in computer programs. For example, the strategy for
validation of the analytical methods is a crucial issue, especially for accreditation
and GLP reasons. The development of a satisfactory strategy for a specific
laboratory requires much expertise. In many companies programs are being
developed that contain this strategy. While these are often programmed in
conventional programming languages, the knowledge acquisition phase is the
same as for expert systems [23].

References

1. E.J. Corey, A.K. Long and S.D. Rubenstein, Computer-assisted analysis in organic synthesis.
Science, 228 (1985) 408-418.
2. L.M.C. Buydens and P. Schoenmakers (eds.), Intelligent Software for Chemical Analysis.
Elsevier, Amsterdam, 1993.
3. M. Peris, An overview of recent expert system applications in analytical chemistry. Crit. Rev.
Anal. Chem., 26 (4) (1996) 219-237.
4. C.H. Bryant, A. Adam, D.R. Taylor and R.C. Rowe, A review of expert systems for chroma-
tography. Anal. Chim. Acta, 297 (1994) 317-347.
5. M. Cadisch and E. Pretsch, SpecTool: a knowledge-based hypermedia system for interpreting
molecular spectra. Fres. J. Anal. Chem., 344 (1992) 173-177.
6. W. Penninckx, P. Vankeerberghen, D.L. Massart and J. Smeyers-Verbeke, A knowledge-based
computer system for the detection of matrix interferences: atomic absorption spectrometric
methods. J. Anal. Atom. Spectrom. incl. Atomic Absorption Spectrom. Updates, 10 (3) (1995)
207-214.
7. F. Hayes-Roth, D.A. Waterman and D.B. Lenat (eds.), Building Expert Systems. Addison-
Wesley, London, 1983.
8. R. Forsyth and R. Rada, Machine Learning: Applications in Expert Systems and Information
Retrieval. Ellis Horwood Series in Artificial Intelligence. Ellis Horwood, Wiley, Chichester,
1986.
9. P.G. Raeth, Expert systems; a software methodology for modern applications. IEEE Computer
Society Press Reprint Collection, IEEE Computer Society Press, Los Alamitos, CA, USA,
1990.
10. A. Hart, Knowledge Acquisition for Expert Systems. Kogan Page, London, 1986.
11. P. Walley, Measures of uncertainty in expert systems. Artif. Intell., 83 (1) (1996) 1-58.
12. E.H. Shortliffe, Computer-based Medical Consultation: MYCIN. Elsevier, New York, 1976.
13. H.Y. Chung, I.S. Park, S.K. Hur, S.W. Cheon, H.G. Kim, S.H. Chang, Diagnostic strategies for
primary-side systems in nuclear power plants. Elect. Power Energy Syst., 14 (4) (1994)
284-297.

14. K.P. Adlassnig, H. Leitich and G. Kolarz, On the applicability of diagnostic criteria for the diag-
nosis of rheumatoid arthritis in an expert system. Expert Syst. Applic, 13 (1) (1997) 73-79.
15. B.G. Buchanan and E.A. Feigenbaum, Dendral and Meta-Dendral: their application dimension. Artif. Intell., 11 (1978) 5-24.
16. M.A. Ott and J.H. Noordik, Long-range strategies in the LHASA program: the Quinone
Diels-Alder transform. J. Chem. Inf. Comput. Sci., 37 (1997) 98-108.
17. P. Van Keerbergen, J. Smeyers-Verbeke and D.L. Massart, Decision support system for run
suitability checking and explorative method validation in electrothermal atomic absorption
spectrometry. J. Anal. Atomic Spectrom. incl. Atomic Spectrom. Updates, 11 (2) (1996)
149-158.
18. J.E. Matek and G.F. Luger, An expert system controller for gas chromatography automation.
Instrum. Sci. Technol., 25 (2) (1997) 107-120.
19. M.E. Elyashberg, Y. Karasev, E.R. Martirosian, H. Thiele and H. Somberg, Expert systems as
a tool for the molecular structure elucidation by spectral methods, strategies of solution to the
problem. Anal. Chim. Acta, 348 (1997) 443-464.
20. M. De Smet, G. Musch, A. Peeters, L. Buydens and D.L. Massart, Expert systems for the selec-
tion of HPLC methods for the analysis of drugs. J. Chromatogr., 485 (1989) 237-253.
21. P.J. Schoenmakers, A. Peeters and R.J. Lynch, Optimization of chromatographic methods by a
combination of optimization software and expert systems. J. Chromatogr., 506 (1990)
169-184.
22. M. Mulholland, N. Walker, J.A. van Leeuwen, L. Buydens, H. Hindriks and P.J.
Schoenmakers, Expert systems for method development and validation in HPLC. Microchim.
Acta, 2 (1991) 493-503.
23. J. Sandberg, B. Wielinga and G. Schreiber, Methods and techniques for knowledge management: what has knowledge engineering to offer? Expert Syst. Applic., 13 (1) (1997) 73-79.

Additional Reading

H.J. Luinge, A knowledge-based system for structure analysis from infrared and mass spectral data.
Trends Anal. Chem., 9 (1990) 66-69.
M. Otto, Fuzzy expert systems. Trends Anal. Chem., 9 (1990) 69-72.
J. Klaessens, L. Van Beysterveldt, T. Saris, B. Vandeginste and G. Kateman, LABGEN, expert sys-
tem for knowledge-based modelling of analytical laboratories. Part I, Laboratory organization.
Anal. Chim. Acta, 222 (1989) 1-17.
K. Janssens and P. van Espen, Implementation of an expert system for the qualitative interpretation of
X-ray fluorescence spectra. Anal. Chim. Acta, 184 (1991) 117-132.
B.A. Hohne and T.H. Pierce, (eds.). Expert-system Applications in Chemistry. ACS Symp. Ser., Vol.
306, ACS, Washington, DC, 1989.

Chapter 44

Artificial Neural Networks

44.1 Introduction

In Chapter 43 the incorporation of expertise and experience in data analysis by means of expert systems is described. The knowledge acquisition bottleneck and
the brittleness of domain expertise are, however, the major drawbacks in the
development of expert systems. This has stimulated research on alternative
techniques. Artificial neural networks (ANN) were first developed as a model of
the human brain structure. The computerized version turned out to be suitable for
performing tasks that are considered to be difficult to solve by classical techniques.
Research on artificial neural networks has developed along two major
pathways. First, there are efforts to mimic the nervous system as closely as
possible by the artificial networks. The goal of this research is to better understand
the functioning of the nervous system. Another school of research is more
interested in artificial neural networks as an interesting computing formalism. The
focus of the efforts here lies in applying and adapting the concept of artificial
neural networks to solve and study practical computational problems. The aim is to
exploit this new concept as much as possible, not to better understand the neuro-
physiological functioning. For chemistry and more specifically for chemometrics,
the results of the second approach are of interest. In this book, therefore, only the
second approach to neural networks is considered. In what follows whenever
neuron or neural network are mentioned, the artificial variant is meant.
Many different types of networks have been developed. They all consist of
small units, neurons, that are interconnected. The local behaviour of these units
determines the overall behaviour of the network. The most common is the multilayer feed-forward (MLF) network. Recently, other networks such as the Kohonen, radial basis function and ART networks have raised interest in the chemical application area. In this chapter we focus on the MLF networks. The principles of some of the other networks are explained and we also discuss how these networks relate to other algorithms described elsewhere in this book.

44.2 Historical overview

The beginning of the research into artificial neural networks is often considered
to be 1943 when McCulloch and Pitts published their paper on the functioning of
the nervous system [1]. They tried to explain it by small units that are based on
mathematical logic and that are interconnected. These units are abstractions of the
biological neurons, and their connections. In 1949 Hebb [2] tried to explain some
psychological results by means of a learning law for biological synapses. With the
introduction of computers it was possible to develop and test artificial neural
networks. The first artificial neural networks on computer were developed by
Rosenblatt (the perceptron) [3] and by Widrow [4] (the ADALINE: ADAptive
LINear Element). The main difference between them is the learning rule that they
use. These simple networks were able to learn and perform some simple tasks. This
success stimulated research in the field; the book by Nilsson on linear learning
machines [5] summarizes most of the work of this early period. The possibilities of
artificial neural networks were believed to be tremendous, and gave rise to
unrealistically high expectations by the public at large. At that time (1969) Papert
and Minsky showed [6] that many of these expectations could not be fulfilled by
the perceptron. Their book had a very negative impact on the research in the field:
funding became difficult and publication of papers declined. Apart from some
enthusiastic researchers who continued their efforts in the field, research stopped
for many years. Some of the investigations continued under the heading adaptive
signal processing or pattern recognition. In chemistry the best known example is
the linear learning machine which was a popular pattern recognition method (see
also Chapter 33).
In 1986 the second breakthrough came with the publication of a book by Rumelhart in which a learning strategy, back-propagation, developed earlier by Werbos, was proposed [7,8]. This new learning rule allowed the construction of networks that were able to overcome the problems of the perceptron-based networks: they were now able to solve more complicated (non-linear) problems. This breakthrough caused renewed interest and, to date, research is still increasing with encouraging results.

44.3 The basic unit — the neuron

The structural unit of artificial neural networks is the neuron, an abstraction of the biological neuron; a typical biological neuron is shown in Fig. 44.1. Biological
neurons consist of a cell body from which many branches (dendrites and axon)
grow in various directions. Impulses (external or from other neurons) are received
through the dendrites. In the cell body, these signals are sifted and integrated.

Fig. 44.1. A biological neuron. (Adapted from Ref. [80]).

When a certain threshold impulse is reached a suitable response signal is delivered.
The axon transports the response signal to many other neurons through a complex
branching system. The contact points between different nerve cells, through which
the signals are propagated are called synapses. The efficiency of the propagation of
signals from one cell to another is influenced by many factors (chemical and
electrical) and is called the synaptic strength. It is important to note that this basic
process is the same for all neurons. This suggests that the connecting network
structure determines the overall behaviour of a nervous system. This basic process
is represented by the artificial neuron as shown in Fig. 44.2. The incoming signals (the input) are passed to the neuron body, where they are weighted and summed; they are then transformed, by passing through the transfer function (or output function), into the outgoing signal (the output of the neuron). The propagation of the signal is determined by the connections between the neurons and by their associated weights. These weights represent the synaptic strengths in the biological neuron.

Fig. 44.2. An artificial neuron; x1...xp are the incoming signals, w1...wp are the corresponding weight factors and F is the transfer function.
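As an illustrative sketch (not part of the original text), the behaviour of a single artificial neuron can be written in a few lines of Python; the sigmoidal transfer function used here is only one of the possible choices for F, and all names are hypothetical:

import numpy as np

def neuron_output(x, w, theta=0.0):
    """Output of a single artificial neuron (Fig. 44.2): the inputs are
    weighted and summed, the offset theta is subtracted and the result
    is passed through a sigmoidal transfer function F."""
    net = np.dot(w, x) - theta            # weighted, summed input
    return 1.0 / (1.0 + np.exp(-net))     # sigmoidal transfer function

# Example with two inputs, two weights and an offset
print(neuron_output(x=np.array([0.5, 1.0]),
                    w=np.array([0.8, -0.4]),
                    theta=0.1))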
Different types of networks have been developed. The connection pattern is an
important differentiating factor, since this determines the information flow
through the network.
The weights associated with each connection are also essential since, together
with the connection pattern, they determine the signal propagation through the
network. They are real numbers that determine the strength of the connectivity
between the two connected neurons. Each signal that is transported through the
connection is multiplied by the associated weight of the connection. Weights can
be positive or negative. The appropriate setting of the weights is essential for the
proper functioning of the network. They determine the relationship between the
input and the eventual output of the network. Therefore, they are considered as the
distributed knowledge content of the network. Finding the proper weight setting is
achieved in a training phase in which the weights are adapted according to a
so-called learning rule. The training of the network is a major phase of the
developing process of artificial neural networks. Different networks require
different training procedures, the two major variants being supervised and
unsupervised training (see also Chapters 30 and 33).

44.4 The linear learning machine and the perceptron network

44.4.1 Principle

The perceptron-like linear networks were the first networks that were developed
[3,4]. They are described in an intuitive way in Chapter 33. In this section we
explain their working principle as an introduction to that of the more advanced
MLF networks. We explain the principle of these early networks by means of the
Linear Learning Machine (LLM) since it is the best known example in chemistry.
The LLM is a classical supervised pattern recognizer. It tries to find boundaries
between classes. In Fig. 44.3 an example of two classes A and B is shown in two
dimensions (x1 and x2). Each possible object is thus defined by its values on x1 and x2. The boundary between the two classes A and B is defined by the line, L, described by eq. (44.1):

w1 x1 + w2 x2 - θ = 0   or   w1 x1 + w2 x2 = θ                (44.1)

where w1 and w2 are called the weights and θ is the offset. In neural network terminology it is called the bias. The offset or bias is added for generality. If no offset (or bias) is provided the line is forced to go through the origin.

Fig. 44.3. Two classes in two dimensions and a possible boundary line.

The line L divides the two-dimensional space into two regions (corresponding to the two classes). All points or objects (combinations of x1 and x2) below the line yield a value for the function, which we will call NET(x1,x2) (eq. (44.2)), larger (smaller) than θ; the points above the line yield a value of NET(x1,x2) smaller (larger) than θ; the points on the line satisfy eq. (44.1) and thus yield the value θ.

NET(x1i, x2i) = w1 x1i + w2 x2i = x_i^T w

where

x_i^T = [x1i x2i]        w^T = [w1 w2]                        (44.2)

Object i is classified as
Class A, if: NET(x1i, x2i) > θ or NET(x1i, x2i) - θ > 0
Class B, if: NET(x1i, x2i) < θ or NET(x1i, x2i) - θ < 0

Another approach to eq. (44.2) is to add an extra dimension to the object vector x_i, on which all objects have the same value. Usually 1 is taken for this extra term. The θ term can then be included in the weight vector, w'^T = [w1 w2 -θ]. This is the same procedure as for MLR where an extra column of ones is added to the X-matrix to accommodate for the intercept (Chapter 10). The objects are then characterized by the vector x_i'^T = [x1i x2i 1]. Equation 44.2 can then be written:

NET(x_i) = x_i'^T w'

where

x_i'^T = [x1i x2i 1]     w'^T = [w1 w2 -θ]                    (44.3)

Object i is classified as
Class A if: NET(x_i) > 0
Class B if: NET(x_i) < 0
The classification of objects is based on the threshold value, θ, of NET(x), also called the bias. The procedure can be described by means of a transfer function, F (Fig. 44.4a). The weighted sum of the input values of x is transmitted through a transfer function, a threshold step function, also called a hard delimiter function. Figure 44.4b shows the threshold function. The classification is based on the output value of the threshold function (+1 for class A; -1 for class B). When the input to the threshold function (i.e. NET(x_i)) exceeds the threshold, the output value, y, of the function is 1, otherwise it is -1.

Fig. 44.4. (a) Schematic representation of the LLM. F represents the threshold transfer function: F = sign(NET(x)). (b) On the left: the threshold function for NET(x) = w1 x1 + w2 x2; on the right for NET(x) = w1 x1 + w2 x2 - θ; see text for explanation.
Instead of such a hard delimiter function it is possible to use other transfer functions, such as the threshold logic, also called the semi-linear function (Fig. 44.5a). In this function there is a region where the output value of the transfer function is linearly related to the input value with a slope, α. The first point, A, of this region is: x1 w1 + x2 w2 = NET(x1,x2) = θ. The endpoint, B, of the linear region is reached when α(x1 w1 + x2 w2 - θ) = 1, or x1 w1 + x2 w2 = NET(x1,x2) = θ + 1/α. The width of the interval between A and B is thus 1/α. It is, moreover, possible to use non-linear functions. The sigmoidal transfer function (Fig. 44.5b) is the most widely used transfer function in the more advanced MLF networks. It will be discussed in more detail in Section 44.5.
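The three transfer functions mentioned above can be sketched in Python as follows (a hedged illustration; the parameter names alpha, beta and theta correspond to the symbols α, β and θ in the text):

import numpy as np

def hard_delimiter(net):
    """Threshold step function: +1 above the threshold, -1 below."""
    return np.where(net > 0, 1.0, -1.0)

def semi_linear(net, alpha=1.0, theta=0.0):
    """Threshold logic (semi-linear) function of Fig. 44.5a: linear with
    slope alpha between theta and theta + 1/alpha, clipped to [0, 1]."""
    return np.clip(alpha * (net - theta), 0.0, 1.0)

def sigmoid(net, beta=1.0, theta=0.0):
    """Sigmoidal transfer function of Fig. 44.5b with slope beta and offset theta."""
    return 1.0 / (1.0 + np.exp(-beta * (net - theta)))

net = np.linspace(-5.0, 5.0, 11)
print(hard_delimiter(net))
print(semi_linear(net, alpha=0.5))
print(sigmoid(net))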
Fig. 44.5. (a) The semi-linear function, F = max(0, min(1, α(NET(x) - θ))), with NET(x) = w1 x1 + w2 x2, and (b) the sigmoid function, F = 1/(1 + exp(-β(NET(x) - θ))).

44.4.2 Learning strategy

The weights, as described in the previous section, determine the position of the
boundary that the LLM draws between the classes. The strategy to find these
weights is at the heart of the LLM. This procedure is called the learning rule [7,8].
It is a supervised strategy and is based on the learning rule that Hebb suggested for
biological neurons [2]. Initially the weights are set randomly. A training set with a
number of objects with known classification is presented to the classifier. At each
presentation the weights are updated with an amount Aw, determined by the
learning rule. The most important variant, used in networks, is the delta rule
developed by Widrow and Hoff (eq. (44.4)) [4]. In this learning strategy the actual output of the neuron is adapted with a term based on δ, the error, i.e. the difference between the actual output and the desired or target output of the neuron for a specific object (eq. (44.4)).

Δw ∝ x_i δ_i

δ_i = d_i - a_i                                               (44.4)

where d_i is the desired output and a_i is the actual output.


In the linear learning machine this rule is applied as follows:
1. present an object i with input vector x_i and apply eq. (44.2);
2. if the classification is correct, the weights are left unchanged;
3. if the classification is wrong, the weights are updated according to eq. (44.4); δ is taken such that the updated weights yield the current output but with an opposite sign, thus yielding the correct classification for object i (see eq. (44.5));
4. go to step 1.
In the LLM the (scalar) desired output value (x_i^T w_new) is defined as the negative of the actual (wrong) output value:

x_i^T w_new = -x_i^T w_old                                    (44.5)

The trivial solution (w_new = -w_old) is not interesting since it defines the same boundary line. A non-trivial solution is found by the following procedure:

Δw ∝ δ_i x_i

or

Δw = η (δ_i / (x_i^T x_i)) x_i          (0 < η ≤ 1)

with Δw = w_new - w_old, and by using eq. (44.5)

w_new = w_old - (2η x_i^T w_old / (x_i^T x_i)) x_i            (44.6)
TABLE 44.1

Example of LLM procedure

Initial weight vector w0^T = [1, -1, 0]; η = 1

Object   x_i^T        Class   Result x_i^T w   OK?   New w^T

Cycle 1
1        (1, 2, 1)    A       -1               Yes   -
2        (3, 2, 1)    A       +1               No    w1^T = (0.57, -1.28, -0.14)
3        (-1, 0, 1)   B       -0.71            No    w2^T = (-0.14, -1.26, 0.57)
4        (-1, 2, 1)   B       -1.86            No    w3^T = (-0.76, -0.05, 1.19)

Cycle 2
1        (1, 2, 1)    A       +0.33            No    w4^T = (-0.87, -0.27, 1.08)
2        (3, 2, 1)    A       -2.07            Yes   -
3        (-1, 0, 1)   B       +1.95            Yes   -
4        (-1, 2, 1)   B       +1.41            Yes   -

Cycle 3
1        (1, 2, 1)    A       -0.33            Yes   -
2        (3, 2, 1)    A       -2.07            Yes   -
3        (-1, 0, 1)   B       +1.95            Yes   -
4        (-1, 2, 1)   B       +1.41            Yes   -

All objects are correctly classified after three cycles; final w^T = (-0.87, -0.27, 1.08)


When η is set to 1 the new weights yield the opposite NET-value for the object i. This mirroring effect can sometimes yield undesirably large weight changes; therefore η is usually set to a value < 1. In Table 44.1 a calculation on a two-class example is given, and in Fig. 44.6 the boundaries found by the LLM are shown. As an example, the weight adaptation in the first cycle, caused by the wrong classification of the second object, can be calculated using eq. (44.6). With x_i^T = [3 2 1] and w_old^T = [1 -1 0], so that x_i^T w_old = 1 and x_i^T x_i = 14:

w_new = [1 -1 0]^T - (2 × 1 × 1/14) [3 2 1]^T = [0.57 -1.28 -0.14]^T

Fig. 44.6. The successive boundaries (w0 to w4) as found by the LLM during the training procedure described in Table 44.1. The crosses indicate the positions of the objects (1 to 4); their class membership (A or B) is given between parentheses.
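The complete LLM training procedure of this section can be condensed into a short Python sketch; it reproduces the weight sequence of Table 44.1 (the sign convention of that table is used, class A corresponding to a negative NET value; function and variable names are illustrative):

import numpy as np

def llm_train(X, signs, w, eta=1.0, max_cycles=20):
    """Linear learning machine trained with the error-correction rule of eq. (44.6).
    X     : one row per object, the last column equal to 1 for the bias term
    signs : desired sign of x'w for each object (+1 or -1)
    w     : initial weight vector"""
    for cycle in range(max_cycles):
        all_correct = True
        for x, s in zip(X, signs):
            net = x @ w
            if np.sign(net) != s:                          # wrong classification
                w = w - 2.0 * eta * (net / (x @ x)) * x    # eq. (44.6)
                all_correct = False
        if all_correct:
            break
    return w, cycle + 1

# Data of Table 44.1 (class A corresponds here to a negative NET value)
X = np.array([[1, 2, 1], [3, 2, 1], [-1, 0, 1], [-1, 2, 1]], dtype=float)
signs = np.array([-1, -1, 1, 1])                           # classes A, A, B, B
w, n_cycles = llm_train(X, signs, w=np.array([1.0, -1.0, 0.0]))
print(np.round(w, 2), n_cycles)   # approx. [-0.87 -0.27 1.08] after 3 cycles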

44.4.3 Limitations

In Fig. 44.7 an example of a classification problem is shown that cannot be solved by the simple perceptron-like networks. It is known as the "exclusive or" (XOR) problem. No single boundary can be found that yields a correct classification in the two classes A and B for all objects.

Fig. 44.7. The XOR classification problem (see also Table 44.2).

TABLE 44.2

A two-layered network of Fig. 44.8a to solve the XOR problem

Class   x1   x2   hu1 = sign(x1 + x2 - 0.5)   hu2 = sign(x1 + x2 - 1.5)   Final output = sign(hu1 - hu2 - 0.5)

B       0    0    -1                          -1                          -1
A       0    1    +1                          -1                          +1
A       1    0    +1                          -1                          +1
B       1    1    +1                          +1                          -1

The fact that perceptron-like networks
were only able to solve simple linearly separable problems was criticized first by
Minsky and Papert [6] and marked the end of the first period of interest in neural
network research. To solve the XOR problem it must be possible to find two
boundary lines. For this an additional layer of neurons is necessary (see Fig.
44.8a). The two additional neurons in the hidden layer of neurons each determine
one boundary line. By combining these lines it is possible to obtain the correct
classification. In Fig. 44.8a an example of a combination of weights and thresholds
is given that yields the correct classification. The first hidden unit determines
'line 1' and the other one determines 'line 2' in Fig. 44.8b. In Table 44.2 the result of applying the different hard delimiter transfer functions of the two hidden units and the output unit is summarized. The output values of the transfer functions of the two hidden units, hu1 and hu2, are given in the third and fourth column. The two objects of class A yield the same results for hu1 and hu2. The objects of class B still yield different results. In Fig. 44.8c the output values of the hidden units for the different objects are plotted. It is the output unit that determines a third line, 'line 3' in Fig. 44.8c, that yields the correct classification (see Table 44.2).

Fig. 44.8. (a) The structure of the neural network for solving the XOR classification problem of Fig. 44.7. (b) The two boundary lines as defined by the hidden units in the input space (x1, x2). (c) Representation of the objects in the space defined by the output values of the two hidden units (hu1, hu2) and the boundary line defined in this space by the output unit. The two objects of class A are at the same location.
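The two-layer solution of Table 44.2 can be verified directly; the following short Python sketch simply evaluates the fixed weights and thresholds of Fig. 44.8a (it performs no training):

def sign(v):
    """Hard delimiter: +1 if the argument is positive, otherwise -1."""
    return 1 if v > 0 else -1

def xor_net(x1, x2):
    """Two-layer network of Fig. 44.8a with the weights and thresholds of Table 44.2."""
    hu1 = sign(x1 + x2 - 0.5)        # hidden unit 1, boundary 'line 1'
    hu2 = sign(x1 + x2 - 1.5)        # hidden unit 2, boundary 'line 2'
    return sign(hu1 - hu2 - 0.5)     # output unit, boundary 'line 3'

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, xor_net(x1, x2))   # -1 = class B, +1 = class A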

44.5 Multilayer feed forward (MLF) networks

44.5.1 Introduction

Unfortunately no learning rule was available that could train two-layered networks. The major problem was that, while it is straightforward to define the
target output value for the output neuron, it is difficult to define the desired output
value of the hidden neuron layers. When Rumelhart [7,8] published the back-
propagation learning rule in 1986 to train these networks, MLF networks became
the most popular and most widely used artificial neural networks in many appli-
cation fields, including chemistry [9-11]. Sometimes they are called back-
propagation networks after the learning rule. Besides the additional layer, an
important difference with the earlier described perceptron-like networks is the use
of the sigmoidal transfer function. The introduction of this transfer function is
essential for the back-propagation learning rule and it is also responsible for the
non-linear properties of the MLF networks (see Sections 44.5.4 and 44.5.5).
As an extension of perceptron-like networks MLF networks can be used for
non-linear classification tasks. They can however also be used to model complex
non-linear relationships between two related series of data: descriptor or independent variables (X matrix) and their associated response or dependent variables (Y matrix). Used as such they are an alternative to other numerical non-linear
methods. Each row of the X-data table corresponds to an input or descriptor
pattern. The corresponding row in the Y matrix is the associated desired output or
solution pattern. A detailed description can be found in Refs. [9,10,12-18].

44.5.2 Structure

The general structure is shown in Fig. 44.9. The units are ordered in layers.
There are three types of layers: the input layer, the output layer and the hidden
layer(s). All units from one layer are connected to all units of the following layer.
The network receives the input signals through the input layer. Information is then
passed to the hidden layer(s) and finally to the output layer that produces the
response of the network. There may be zero, one or more hidden layers. Networks
with one hidden layer make up the vast majority of the networks. The number of
units in the input layer is determined by p, the number of variables in the (n x p) matrix X. The number of units in the output layer is determined by q, the number of variables in the (n x q) matrix Y, the solution pattern.
Fig. 44.9. General structure of an MLF network. (a) Without bias and (b) with a bias neuron.

Just as in the perceptron-like networks, an additional column of 'ones' is added to the X matrix to accommodate for the offset or bias. This is sometimes explicitly
depicted in the structure (see Fig. 44.9b). Notice that an offset term is also provided
between the hidden layer and the output layer.
The choice of h, the number of units in the hidden layers, or hidden units, is
discussed in Section 44.5.8.

44.5.3 Signal propagation

The signal propagation in the MLF networks is similar to that of the perceptron-like networks, described in Section 44.4.1. For each object, each unit in
the input layer is fed with one variable of the X matrix and each unit in the output
layer is intended to provide one variable of the Y table. The values of the input
units are passed unchanged to each unit of the hidden layer. The propagation of the
signal from there on can be summarized in three steps.
1. Each hidden unit, j, receives the signals from the p units of the previous layer, the input layer. From these signals the net input is calculated:

NET_j(x_i) = x_i^T w_j

x_i^T = [x_i1 ... x_ip 1]        w_j^T = [w_j1 ... w_jp -θ]

w_j is the weight vector associated with the hidden unit j, including θ, the bias term. x_i is the input vector, including an extra 'one' to accommodate for the bias, that characterizes object i. Since the input units are pass-through units, x_i is also the input vector for each of the hidden units.
2. The net input, NET_j, is then passed to the transfer function that transforms it into the output signal of the unit. Different transfer functions may be used, the most common non-linear one being the sigmoidal function (Fig. 44.5b):

output_j = o_j = sf(NET_j) = 1 / (1 + exp(-NET_j))            (44.8)
3. The output units receive the weighted output signals of the h hidden units. The weighted sums are calculated and passed through the transfer function to yield the final output of the network:

NET_j(o_i) = o_i^T w_j

w_j^T = [w_j1 ... w_jh -θ]

o_i^T is the input vector to the output units. It contains the output signals of the h
hidden units plus an additional 'one' to take the bias term for the output units into
account. The whole signal propagation is shown in Fig. 44.10. As can be seen the
signals are only forwarded from one layer to the next layer. There is no signal
propagation within a layer or to a previous layer. This explains the term feed-
forward networks.
Fig. 44.10. An example of signal propagation in an MLF network with transfer function 1/(1 + exp(-NET(x))).
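The three propagation steps can be condensed into a small Python sketch of the forward pass through an MLF network with one hidden layer (an illustration only; the weight matrices, their sizes and the random example values are hypothetical):

import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def mlf_forward(x, W_hidden, W_output, linear_output=False):
    """Forward signal propagation in an MLF network with one hidden layer.
    x        : input pattern of length p
    W_hidden : (h x (p+1)) weights of the hidden units, last column = -theta (bias)
    W_output : (q x (h+1)) weights of the output units, last column = -theta (bias)"""
    x_aug = np.append(x, 1.0)                 # extra 'one' for the bias term
    o_hidden = sigmoid(W_hidden @ x_aug)      # steps 1 and 2
    o_aug = np.append(o_hidden, 1.0)          # bias for the output units
    net_out = W_output @ o_aug                # step 3
    return net_out if linear_output else sigmoid(net_out)

# Example: p = 3 input units, h = 2 hidden units, q = 1 output unit
rng = np.random.default_rng(0)
W_hidden = rng.uniform(-0.3, 0.3, size=(2, 4))
W_output = rng.uniform(-0.3, 0.3, size=(1, 3))
print(mlf_forward(np.array([0.2, 0.7, 1.0]), W_hidden, W_output))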

44.5.4 The transfer function

44.5.4.1 Role of the transfer function


The sigmoidal transfer function is the most common one in MLF networks. This
function has been developed as an adaptation of the hard delimiter transfer function of the perceptron-like networks. The back-propagation learning rule (see Section 44.5.5) requires a transfer function for which a derivative exists over the whole domain. The transfer function has two important roles in the signal propagation process. It 'squashes' the output value of the units between 0 and 1, preventing it from becoming infinitely large, and it is also responsible for the non-linear modelling properties which make MLF networks so useful.

44.5.4.2 Transfer function of the output units


There are basically two alternatives for the transfer function of the output units,
the sigmoidal function and the linear function. The sigmoidal function is preferred
in (supervised) classification tasks, because of its squashing property. The desired
output of the network in a classification task is typically 1 or 0 (belonging or not to
a certain class). The shape of the sigmoid ensures that the output values are squashed into the range [0, 1]. When the MLF network is used to model
non-linear relationships this squashing of the output values is not necessary and is
even harmful for the performance of the network. For modelling tasks a simple
linear transfer function for the output units performs best.

44.5.4.3 Transfer function in the hidden units


The transfer function of the hidden units in MLF networks is always a sigmoid or related function. As can be seen in Fig. 44.5b, θ represents the offset and has the same function as in the simple perceptron-like networks; β determines the slope of the transfer function and is often omitted since it can implicitly be adjusted by the weights. The main function of the transfer function is
modelling the non-linearities in the data. In Fig. 44.11 it can be seen that there are
five different response regions in the sigmoidal function:

Fig. 44.11. Different response regions in a sigmoid function. For explanation, see text.
Fig. 44.12. (a) and (b): the output value (y) of the MLF network as a function of the input value (x) with two different weight settings. (c) and (d): the contour plots of the output value of an MLF network in the specified input space with two different weight settings.

1. NET_j < A: the response approximates 0
2. A < NET_j < B: the response is non-linear, convex
3. B < NET_j < C: the response varies almost linearly with NET_j
4. C < NET_j < D: the response is non-linear, concave
5. NET_j > D: the response is maximal (approximates 1)
The region from A to D is called the dynamic range. The regions 2 and 4 constitute the most important difference with the hard delimiter transfer function in perceptron networks. These regions, rather than the near-linear region 3, are most important since they assure the non-linear response properties of the network. It may
Fig. 44.12 continued (b).

seem surprising at first sight that such a simple non-linear unit suffices to model
successfully complex non-linear relationships. One must not forget however that it is
the proper combination (weight setting) of several of those sigmoid functions that
assures the overall modelling power. In Fig. 44.12a an example of a combination of
two sigmoidal functions of two hidden units is given so that a Gaussian-like
relationship between input and output is modelled. Changing the weights yields Fig.
44.12b. It is easy to imagine that in such a way complex relationships can be
modelled with more sigmoids. In this view an MLF network can be seen as a kind of
automatic spline fitter (Chapter 11). It is automatic because the combination of basic
functions and knot setting is done automatically during the weight and bias setting
in the training phase by means of the learning rule (Section 44.5.5).
Fig. 44.12 continued (c).

When the MLF is used for classification its non-linear properties are also
important. In Fig. 44.12c the contour map of the output of a neural network with
two hidden units is shown. It shows clearly that non-linear boundaries are
obtained. Totally different boundaries are obtained by varying the weights, as
shown in Fig. 44.12d. For modelling as well as for classification tasks, the
appropriate number of transfer functions (i.e. the number of hidden units) thus
depends essentially on the complexity of the relationship to be modelled and must
be determined empirically for each problem. Other functions, such as the tangens
hyperbolicus function (Fig. 44.13a) are also sometimes used. In Ref. [19] the
authors came to the conclusion that in most cases a sigmoidal function describes
non-linearities sufficiently well. Only in the presence of periodicities in the data
Fig. 44.12 continued (d).

this type of function is not adequate. In that case a sinusoidal transfer function (Fig.
44.13b) is necessary. In a special kind of MLF networks the so-called radial basis
function is used as transfer function. We describe them in Section 44.6.

44.5.5 Learning rule

In MLF networks the proper setting of the weights is optimized by supervised training. The weights are optimized by means of a number of example input patterns together with their associated desired output pattern. During the training session the weights are adapted according to the learning rule.
Fig. 44.13. The tangens hyperbolicus and the sine function.

The most common learning algorithm is the back-propagation learning rule [7,8]. The weight updates are, as in the delta rule, based on the difference between
the actual and the desired output of the network. The weight updating can be done
after each training example that is offered to the network or it can be done after all
training examples have been seen once. The two procedures generally give comparable results.
Since the former one is applied most, this procedure is explained here. It can be
summarized as follows:
1. Initialize the weights with small random values in a range around 0 (e.g. -0.3
to +0.3).
2. For each output unit, j, calculate the output value with the current weight
setting and the error, Ej, based on the difference between this value and the target or
desired output value.
E_j = 1/2 (d_j - o_j)^2                                       (44.9)

E_j is determined by the weights (through o_j, which is a function of NET_j, see eq. (44.8)). Note that this error is in fact the same as the error term used in a usual least squares procedure.
3. Carry out the weight adaptation of the output neurons:

Δw_hj = η δ_j o_h                                             (44.10)

w_hj is the weight between the hth hidden unit and the jth output unit; η is the learning rate, a positive constant between 0 and 1; o_h is the output of the hth hidden unit and δ_j is a term based on the error. In the back-propagation strategy the weight adaptation is made in the direction that minimizes the error. The back-propagation algorithm is thus essentially a gradient based optimization method. Therefore, the gradient of the error as a function of the weights must be calculated. In Fig. 44.14 this is shown for one of the weights. For the output units this gradient minimization yields eq. (44.11). This explains why the transfer function must have a derivative over the whole response domain.

δ_j = (d_j - o_j) d(sf(NET_j))/d(NET_j) = (d_j - o_j) sf(NET_j) [1 - sf(NET_j)]       (44.11)

sf(NET_j) represents the value of the sigmoidal function for NET_j.


Fig. 44.14. An example of an error surface of an MLF network as a function of one weight value. The gradient determines the change of the weight in the next iteration.

4. Calculate the adaptation of the weights to the hidden layer. The desired output, and thus the error, is not known directly. In the back-propagation strategy it is calculated from the accumulated errors of the output neurons. It can be shown (e.g. Refs. [14] and [15]) that this yields eq. (44.12), in which k runs over the units in the output layer. All weight adaptations can be calculated using eq. (44.10) and eq. (44.12). The error is said to be back-propagated through the network.

δ_j = d(sf(NET_j))/d(NET_j) Σ_k δ_k w_jk = sf(NET_j) [1 - sf(NET_j)] Σ_k δ_k w_jk       (44.12)

5. Repeat this process for all input patterns. One iteration or epoch is defined as
one weight correction for all examples of the training set.
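The five steps above can be combined into a compact Python sketch of back-propagation training for a network with one hidden layer and sigmoidal units throughout; it is a bare illustration of eqs. (44.9)-(44.12) (no momentum term, no monitoring set), and, as discussed in Section 44.5.7.2, a training run may occasionally end in a local minimum:

import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def train_mlf(X, Y, h, eta=0.5, n_iterations=2000, seed=0):
    """Back-propagation training of an MLF network with one hidden layer.
    X (n x p) holds the input patterns, Y (n x q) the desired output patterns."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    q = Y.shape[1]
    Wh = rng.uniform(-0.3, 0.3, size=(h, p + 1))    # hidden weights incl. bias
    Wo = rng.uniform(-0.3, 0.3, size=(q, h + 1))    # output weights incl. bias
    for _ in range(n_iterations):
        for i in rng.permutation(n):                # randomized pattern order
            x = np.append(X[i], 1.0)
            o_h = sigmoid(Wh @ x)
            o_h_aug = np.append(o_h, 1.0)
            o = sigmoid(Wo @ o_h_aug)
            delta_o = (Y[i] - o) * o * (1.0 - o)                   # eq. (44.11)
            delta_h = o_h * (1.0 - o_h) * (Wo[:, :h].T @ delta_o)  # eq. (44.12)
            Wo += eta * np.outer(delta_o, o_h_aug)                 # eq. (44.10)
            Wh += eta * np.outer(delta_h, x)
    return Wh, Wo

# Small demonstration: the XOR problem of Fig. 44.7 with three hidden units
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)
Wh, Wo = train_mlf(X, Y, h=3, eta=0.8, n_iterations=5000)
for x in X:
    o_h = np.append(sigmoid(Wh @ np.append(x, 1.0)), 1.0)
    print(x, np.round(sigmoid(Wo @ o_h), 2))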

44.5.6 Learning rate and momentum term

The learning rate, η, in eq. (44.10) is important for the performance of the training procedure. If it is small, the convergence of the weights to an optimum may be slow and there is a danger of getting stuck at a local optimum. If the learning rate is high, the system may oscillate. An appropriate value of the learning rate depends on the transfer function of the output units. For a sigmoidal transfer function, η values between 0.5 and 1 are used. For a linear transfer function much smaller values (0.001 < η < 0.1) are applied. To limit the danger of oscillation, eq. (44.10) can be adapted as follows:

Δw_hj(n) = η δ_j o_h + α Δw_hj(n - 1)                         (44.13)

In eq. (44.13) the term Δw_hj(n) is the current weight change and Δw_hj(n - 1) is the weight change from the previous learning cycle; α is called the momentum. It represents the proportion of the weight change of the previous learning cycle that is taken into account to determine the step size of the weight in the present learning cycle. Its value is usually set between 0.6 and 0.8. The influence of the momentum term is less important than that of the learning rate. The optimal values of α and η depend on the problem under study and can be found by means of an optimization procedure, although this is usually not done and only some values of η and α are tried.
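A hedged Python sketch of the momentum-augmented weight adaptation of eq. (44.13); the argument grad_term stands for the δ·o product of eq. (44.10), and the numerical values are purely illustrative:

import numpy as np

def momentum_update(w, grad_term, delta_w_prev, eta=0.5, alpha=0.7):
    """One weight adaptation with a momentum term, eq. (44.13)."""
    delta_w = eta * grad_term + alpha * delta_w_prev
    return w + delta_w, delta_w

w = np.zeros(3)
delta_w_prev = np.zeros(3)
w, delta_w_prev = momentum_update(w, grad_term=np.array([0.10, -0.20, 0.05]),
                                  delta_w_prev=delta_w_prev)
print(w)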

44.5.7 Training and testing an MLF network

During a supervised training session the weights of the network are optimized,
using a training set. Since the weight adaptations are done by means of the training
set, it should contain a sufficient number of relevant examples for the relationship
to be modelled. The more hidden units, the more examples are necessary to train
the network. The number of weights to be optimized increases indeed drastically
with the number of units in the network:
number of weights = ph + qh = h(p + q)
where p is the number of input units, h the number of hidden units and q the number
of output units.
As described in Section 44.5.5, the weights are adapted along the gradient that
minimizes the error in the training set, using the back-propagation strategy. One
iteration is not sufficient to reach the minimum in the error surface. Care must be
taken that the sequence of input patterns is randomized at each iteration, otherwise
bias can be introduced. Several (50 to 5000) iterations are typically required to
reach the minimum.

44.5.7.1 Network performance


During the training session the performance of the network must be monitored.
Different performance criteria are possible, but usually the normalized standard
error, NSE, is used.

NSE = (1/(n q)) Σ_i Σ_j (d_ij - a_ij)^2                       (44.14)

q is the number of output units, n is the number of examples in the training set, d is
the desired output value and a is the actual output value. The NSE is calculated
after each iteration of the training session. The NSE is not always the best
performance criterion. One badly predicted pattern (e.g. an outlier) may influence
strongly this performance criterion. It also gives no indication on which or how
many patterns are well or badly predicted. Small errors, close to 0 for almost all
training examples and a few badly predicted examples will yield the same NSE as
for average errors for all examples. For classification purposes the NSE is not the
ideal criterion since it is more important that the output values are above or below a
certain threshold (e.g. 0.9 or 0.1) than that they are exactly 0 or 1.
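The NSE of eq. (44.14) — here taken, as reconstructed above, as the sum of squared differences divided by the number of patterns times the number of output units — can be computed as follows (an illustrative sketch with hypothetical data):

import numpy as np

def nse(D, A):
    """Normalized standard error, eq. (44.14): sum of squared differences
    between desired (D) and actual (A) outputs, divided by n*q."""
    n, q = D.shape
    return np.sum((D - A) ** 2) / (n * q)

D = np.array([[1.0], [0.0], [1.0]])   # desired output values
A = np.array([[0.9], [0.2], [0.7]])   # actual output values
print(nse(D, A))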
The NSE does give, however, a good general idea of the performance of the
network and is therefore commonly used during the training session. In Fig. 44.15a
a performance curve is shown. The NSE decreases monotonically with the number
Fig. 44.15. (a) Performance curve of an MLF network for a training set. (b) Performance curve of an MLF network with too high a learning rate.

of iterations. Training with too high a learning rate, η, yields the performance behaviour of Fig. 44.15b.
When the network must be used to predict future unknown examples it is not
good practice however to judge the performance of the network based on the
training set only. Together with the NSE of the training set the NSE of an
independent set, the monitoring set, must always be calculated. The NSE value for
the monitoring set is usually somewhat larger than the NSE for the training set. In
Fig. 44.16a an ideal performance behaviour of a network is shown.
A typical performance behaviour is shown in Fig. 44.16b. The increase of the
NSE for the monitoring set is a phenomenon that is called overtraining. This
phenomenon can be compared to fitting a curve with a polynomial of a too high
order or with a PCR or PLS model with too many latent variables. It is caused by
the fact that after a certain number of iterations, the noise present in the training set
is modelled by the network. The network acts then as a memory, able to recall
Fig. 44.16. (a) Ideal performance curve of an MLF network. The solid line represents the training set; the dashed line represents the monitoring set. (b) Overtraining of the network begins at point A. (c) Paralysed network. (d) Performance curve of an MLF network for which the training and control samples represent different models.

exactly the training examples but losing its ability to predict. This is harmful for the predictive performance of the network. When the NSE of the monitoring set is acceptable it suffices to stop the training at point A. Overtraining occurs sooner when the number of hidden units is too large for the problem complexity or when the number of training examples is too small to train all the weights adequately. The remedy then is to use a network with a smaller number of units.
In Fig. 44.16c the performance behaviour of a paralysed network is shown. The
normal decrease of the NSE stops too soon and the NSE remains too high to be
acceptable. The paralysation of the network occurs when the weighted input value,
NET_j, is situated in the response regions 1 or 5 of the transfer function for too many units (Fig. 44.11). As long as NET_j is smaller than A or larger than D, the response does not change when NET_j (i.e. the weights) changes. A remedy is to increase the learning rate, η, or to explicitly decrease the slope, β, of the transfer function, so that the input values fall again inside the dynamic region. Retraining the network with a different weight initialization is another option. This performance behaviour is also observed when too few hidden units are used. The network is then not able to model the relationship properly. Increasing the number of hidden units is the remedy in this case.
The performance behaviour shown in Fig. 44.16d is caused by the fact that the
monitoring set and the training set represent different relationships, or by outliers that are present in the monitoring set but not in the training set.
When not enough examples are available to make an independent monitoring
set, the cross-validation procedure can be applied (see Chapter 10). The data set is
split into C different parts and each part is used once as monitoring set. The
network is trained and tested C times. The results of the C test sessions give an
indication of the performance of the network. It is strongly advised to validate the
network that has been trained by the above procedure with a second independent
test set (see Section 44.5.10).

44.5.7.2 Local minima


The back-propagation strategy is a steepest gradient method, a local opti-
mization technique. Therefore, it also suffers from the major drawback of these
methods, namely that it can become locked in a local optimum. Many variants
have been developed to overcome this drawback [20-24]. None of these, however, really solves the problem.
A way to obtain an idea of the robustness of the obtained solution is to retrain
the network with a different weight initialization. The results of the different
training sessions can be used to define a range around the performance curve as
shown in Fig. 44.17. This procedure can also be used to compare different
networks [20].

44.5.8 Determining the number of hidden units

From the previous sections it is clear that it is important to use a network with a
suitable number of hidden units. When too few hidden units are used, the
relationship cannot be modelled properly and the network shows poor
performance. Too large a number of hidden units causes severe overtraining. The
suitable number of hidden units depends on the problem complexity and on the
number of training examples that are available. It must be determined empirically.
There are basically three approaches for this:
Fig. 44.17. Determining the number of hidden units for an MLF network. The whiskers represent the range of the error with different random weight initializations or by cross-validation.

- Train and test a network with a certain number of hidden units, based on an educated guess. When the NSE of the network is acceptable and no severe overtraining occurs, the network is suitable.
- A second approach is to train different networks with a different number of hidden units. Preferably, each network is trained several times with a different weight initialization. From a plot as in Fig. 44.17 it is then straightforward to select a suitable number of hidden units. This approach is certainly the best but it involves the training and testing of many networks and is thus a time-consuming procedure.
- Another approach that has been used is the so-called pruning procedure [26,27]. One starts with a network with a large number of hidden units. During the training phase the weight changes of all units are monitored. Those units or connections whose weights remain low are removed and the training is continued. This procedure has not been very successful for MLF networks and is not often applied.

44.5.9 Data preprocessing

44.5.9.1 Scaling
So far, we have not considered the nature of the data in the X or Y matrix.
However, as with all other data handling techniques, preprocessing of the data may
be desirable in some cases. The reasons for scaling are essentially the same as
described in earlier chapters. Methods that are often used for continuous values are
autoscaling and range scaling. The data can be scaled by columns or by rows.
Scaling is often applied if the variables have different ranges of values. Variables
with large values can dominate the model. Another reason for scaling in neural
networks is the fact that large input values for a certain unit, especially together
with large weights, may cause the NET input of that unit to fall in the tail of the
sigmoid transfer function. In this region the derivative is small and, as explained in
Section 44.5.5, weight corrections according to the back-propagation rule are
proportional to the derivative of the transfer function (see eqs. (44.10 and 44.11)).
This may then lead to paralysation of the network. Scaling of the input brings the
net input within the appropriate range of the sigmoid transfer function.
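Column-wise autoscaling and range scaling, the two methods mentioned above, can be written as short Python functions (an illustrative sketch; the small data matrix is hypothetical):

import numpy as np

def autoscale(X):
    """Column-wise autoscaling: zero mean and unit standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def range_scale(X, low=0.0, high=1.0):
    """Column-wise range scaling to the interval [low, high]."""
    xmin, xmax = X.min(axis=0), X.max(axis=0)
    return low + (X - xmin) * (high - low) / (xmax - xmin)

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 900.0]])
print(autoscale(X))
print(range_scale(X))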

44.5.9.2 Variable selection and reduction


The number of variables in the X and Y matrix determines the number of input
and output units respectively. The total number of units determines the total
number of weights that have to be optimized during the training session. When
there is a large number of input variables, variable selection and/or reduction may be necessary. This is for example the case when the input consists of a spectrum of 1000 wavelengths. Preprocessing the data with PCA seems attractive; it is
however not always the appropriate approach. Neural networks, unlike some other
techniques, do not suffer from correlations in the data. In this way the network can
extract relevant information from these correlations which is not possible with
other techniques. Other methods have been proposed to perform the variable
selection (e.g. genetic algorithms and simulated annealing, see Chapter 27).

44.5.10 Validation of MLF networks

Validation of neural networks is usually based on an independent test set. Note that this test set should be different from the monitoring set as described in Section
44.5.7. If sufficient samples are not available cross-validation is applied [28]. The
presence of local minima causes an additional problem. Retraining the network
with different weight initializations is an important diagnostic tool for this
purpose. It must be noted however that retraining the network with different initial
weights yields different final weight settings, thus different networks. The next
problem is to select the best network among those. One may decide to use the
680

network that yields the lowest NSE for the test set. As mentioned before, however,
the NSE does not cover all performance aspects. It is good practice to consider also
other performance criteria at this stage of the development. The distribution of the
errors is certainly important. Another important performance criterion is the
robustness to noise on the input values. Derks et al. [29] proposed an empirical
procedure to test this robustness. It is based on the gradually increasing addition of
noise to the input data. Networks whose performance is less sensitive to this noise
injection are to be preferred over sensitive networks. Gemperline studied the
robustness of MLF networks in comparison with PLS [30]. The more performance
criteria used to validate the model the higher the probability that a good impression
is obtained of the network's predictive value for samples that lie within the range
of the training set. Since no theoretical knowledge is available about the model, it
is dangerous to extrapolate with MLF networks. Before using the network for
prediction it should be checked whether the input falls within the training set range.

44.5.11 Aspects of use

MLF networks are very powerful but should be applied with caution. They are powerful because practical experience has shown that they can model difficult non-linear relationships in an easy way, often better and faster than alternative techniques.
There are however many pitfalls in the use of MLF networks.
- Since they are used for complex relationships, about which only little prior knowledge is available, it is very hard to be sure that the training, monitoring and test examples are representative of the data.
- Usually many local minima are present in the error surface. It is impossible to guarantee that the overall optimum is reached.
- The validation of the obtained result is another problem. It is not yet possible to obtain a confidence interval around the obtained output. Research on this topic is going on [21,28].
- It is not possible to extract theoretical information on the model in a direct way. It is to be considered as an empirical model builder, such as a spline or polynomial fitting procedure.

44.5.12 Chemical applications

Especially in the last few years, the number of applications of neural networks has grown exponentially. One reason for this is undoubtedly the fact that in many applications neural networks outperform the traditional (linear) techniques. The large number of samples that are needed to train neural networks certainly remains a serious bottleneck. The validation of the results is a further issue for concern.

TABLE 44.3
Examples of chemical applications

- Pattern recognition [31-38]
- Spectrum interpretation [39-45]
- Quality control/process control [46-49]
- Structure elucidation [50,51]
- Non-linear calibration and modelling [52-58]
- Quantitative structure-activity relationships [11,59]
- Signal processing [60,61]

Even more than in traditional techniques, care must be taken not to extrapolate.
When the training data are not evenly distributed, even interpolation can cause
problems [20].
The applications concern classification as well as quantification problems. In
Table 44.3 some examples are given in both problem domains.

44.6 Radial basis function networks

44.6.1 Structure

Radial basis function networks (RBF) are a variant of three-layer feed forward networks (see Fig. 44.18). They contain a pass-through input layer, a hidden layer and an output layer. A different approach for modelling the data is used. The transfer function in the hidden layer of RBF networks is called the kernel or basis function. For a detailed description the reader is referred to Refs. [62,63]. Each node in the hidden layer thus contains such a kernel function. The main
difference between the transfer function in MLF and the kernel function in RBF is
that the latter (usually a Gaussian function) defines an ellipsoid in the input space.
Whereas basically the MLF network divides the input space into regions via
hyperplanes (see e.g. Figs. 44.12c and d), RBF networks divide the input space
into hyperspheres by means of the kernel function with specified widths and
centres. This can be compared with the density or potential methods in pattern
recognition (see Section 33.2.5).
The output of a hidden unit, in the case of a Gaussian kernel function, is defined as:

output_j = o_j(x) = exp(-||x - c_j||^2 / b_j^2)               (44.13)

||x - c_j|| is the Euclidean distance (other distance measures are also possible) between the input vector, x, and c_j, the centroid of the Gaussian kernel function.
Fig. 44.18. An example of an RBF network.

The parameter b_j represents the width of the Gaussian function. The centroids, c_j, and the widths, b_j, of all the hidden units together define the so-called activation space, where the Gaussian function has a value larger than a given threshold value. Nodes containing a kernel with a centroid close (in comparison with the width, b_j) to the input pattern will be predominantly activated. This is in contrast to distant objects, which cause low output values of the hidden units. This activation space can thus be regularized by the width factor and the centroid position, providing a means to adapt the local behaviour of the network.
The output of these hidden nodes, o_i, is then forwarded to all output nodes through weighted connections. The output y_j of these nodes consists of a linear combination of the kernel functions:

y_j = Σ_i w_ji o_i(x)

where w_ji represents the weight of the connection between hidden unit i and output unit j.
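The forward pass through an RBF network can be sketched as follows; the Gaussian kernel is written with the exponent -||x - c_j||^2 / b_j^2 as reconstructed above, and the centres, widths and weights in the example are hypothetical:

import numpy as np

def rbf_forward(x, centres, widths, W):
    """Output of an RBF network: Gaussian kernel outputs of the hidden units
    followed by a linear combination in the output units.
    centres : (h x p) kernel centroids c_j
    widths  : (h,) width parameters b_j
    W       : (q x h) weights between hidden and output layer"""
    dist = np.linalg.norm(centres - x, axis=1)       # ||x - c_j||
    o_hidden = np.exp(-(dist / widths) ** 2)         # Gaussian kernels
    return W @ o_hidden                              # linear output units

centres = np.array([[0.0, 0.0], [1.0, 1.0]])
widths = np.array([0.5, 0.5])
W = np.array([[1.0, -1.0]])
print(rbf_forward(np.array([0.2, 0.1]), centres, widths, W))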

44.6.2 Training

The centroid positions, c_j, of the kernel functions and their widths, b_j, determine
the output of the hidden units. The centroid positions are fixed at the initialization
of the network. The widths and the weights are obtained by supervised training.
Initializing and training these positions and widths is a critical issue in constructing
RBF networks. The supervised network training is generally performed by error
back-propagation by means of the generalized delta learning rule (Section 44.5.5).
The Gaussian kernels are continuously differentiable functions and are therefore
suited for this algorithm.

Approaches for the initialization and training of the position of the centres, the widths and the weights are summarized below:
1. For the distribution of the position of the centres the following methods or combinations of methods are used [29]:
- random distribution within the range of interest;
- random selection of input patterns;
- maximal coverage of the range of interest, e.g. with the Kennard and Stone algorithm (see Chapter 24);
- selection of representative input patterns, defined by means of the output of a Kohonen network applied to the input patterns (see Section 44.7);
- selection of representative input patterns, defined by means of clustering methods, e.g. k-means [65] (see Section 30.3.2), genetic algorithms or simulated annealing [64] (see Chapter 27).
2. The widths, b, of the kernel functions are obtained by:
- the supervised training procedure (error back-propagation);
- fixing them at the initialization phase based on expertise and prior knowledge
on the data structure.
3. The weights, w^p are trained with the usual error back-propagation strategy.
It is generally advised to use an abundant number of kernels on random positions
which are then successively pruned during training.

44.6.3 An example

In this section the local modelling capability of RBF networks is demonstrated on the logical operators AND and OR. The examples are illustrative of the structure and the physical meaning of the weights of the RBF network. Two datasets of
four training objects each, have been created according to the logical operators
AND and OR. In Fig. 44.19 the datasets and the network topology are graphically
depicted. The AND-operator yields a positive output on inputs of identical sign,
whereas the OR-operator responds positively on any positive input.
The centroids represent the positions of Gaussian kernels and are in this example
positioned in the same place as the objects in the input space. The width factors do
not change during training. The weights of each kernel function are obtained by
training.
In Fig. 44.20a a grid of data within the range [-1...1] has been propagated
through the OR network model to depict the activation space. It can be seen that the
OR-function has been trained properly, i.e. the output is high or low at the correct
positions. Where no input data were available the network yields zero as output
value. In Fig. 44.20b the same grid of data has been propagated through an OR
network model obtained with a broader width parameter. It can be seen that the
network still yields high and low values at the positions of the input
Fig. 44.19. The logical operators AND and OR and their RBF implementation.

patterns, but it interpolates in a smoother way. It follows from the introduction of this section that the concept of RBF is based on local approximation of data by means of kernel functions. MLF networks try to model data by means of constructing non-linear hyperplanes in the input space. In Fig. 44.21 the predictions of an AND-model obtained by MLF and RBF are shown. It can clearly be seen that the output of both networks is the same in the neighbourhood of the input objects. However, the output differs for the positions in between the input objects. The RBF network yields zero output for the interpolated grid positions, while the output of the MLF network is different from zero. The results can be influenced by the width parameters of the kernels. Which is better should be evaluated by means of independent test sets.

44.6.4 Applications

In chemical practice, problems are far more complex than the example problems described in the previous section. To be able to apply RBF networks properly, it is useful to have a considerable amount of prior knowledge about the data, especially for the distribution of the centroid positions and widths [29]. Although this might be considered a disadvantage, at least the weights have a physical or chemical meaning, allowing a better understanding of the model. RBF networks have been applied in process control [66,67]. A combination of RBF and PLS (RBF-PLS), as a flexible non-linear regression technique, has also been proposed [68].

44.7 Kohonen networks

44.7.1 Structure

Kohonen networks belong to the class of self-organizing maps. In contrast to MLF and RBF networks, they are designed for unsupervised pattern recognition
tasks. The Kohonen network consists of one layer of neurons, ordered in a
low-dimensional map such as a one-dimensional array or a two-dimensional
matrix (see Fig. 44.22). Each neuron or unit contains a weight vector of the same
dimension as the input patterns. To train the network, the unsupervised Kohonen
learning rule is applied (see Section 44.7.2). After training the individual weight
vectors are oriented in such a way that the structure of the input space (the
topology) is represented as well as possible in the resulting map (see further). Each
object or input pattern is assigned to (or mapped on) the neuron with the most
similar weight vector. The goal of the Kohonen network is to map similar objects on the same or neighbouring neurons. The interpretation of the Kohonen map is described in Section 44.7.3. The Kohonen mapping is primarily used for classification purposes.

Fig. 44.22. Three commonly used Kohonen network structures. (a) One-dimensional array; (b) two-dimensional rectangular network (each unit, apart from the borderline units, has 8 neighbours) and (c) two-dimensional hexagonal network (each unit, apart from the borderline units, has 6 neighbours). (Reprinted with permission from Ref. [70]).

44.7.2 Training

The training process of a Kohonen network consists of a competitive learning procedure and can be summarized as follows:
- Initialize the weight vectors of all units with random values.
- Calculate a specified similarity measure, D, between each weight vector and
a randomly chosen input pattern, x.
- The unit in the Kohonen map that is most similar to the input vector is de-
clared as the winning unit and is activated (i.e. its output is set to 1). The out-
put of a Kohonen unit is typically 0 (not activated) or 1 (activated).
- The weights of the winning unit are adapted according to:

  w(t+1) = w(t) + η (x - w(t))

  where η is the learning rate and w(t) is the weight vector at iteration t.
- The weights of the units in the close vicinity of the winning unit are also adapted, according to:

  w(t+1) = w(t) + η N(t,r) (x - w(t))
N(t,r) is a predefined neighbourhood function in which r represents the distance in
the map between the considered unit and the winning unit. Different neighbour-
hood functions can be used. The principle is always that units closer to the winning
unit are adapted most. Some common functions are shown in Fig. 44.23. Due to
this aspect of the learning procedure, the connectivity (i.e. the number of neigh-
bours for each unit) of the network has an influence on its performance. Networks
with a different number of neighbours for the units are shown in Fig. 44.22.
The above steps are repeated for all input patterns to complete one iteration or
cycle. A training procedure consists of several such cycles. During the training
process the neighbourhood definition is usually narrowed down, allowing
convergence of the network (see Fig. 44.24). As a result of the training procedure
the weight vector of the winning unit and its neighbours move gradually towards
the applied input vector, as shown in Fig. 44.25. In this way, similar input vectors
will map in the same region of the Kohonen map. Due to the unsupervised nature
of the Kohonen algorithm, it cannot be checked which units are associated with
specific clusters in the input patterns. However, inspection of all weight vectors
provides information concerning topological aspects of the input pattern space, i.e.
the position of the input patterns relative to each other.
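A minimal sketch of this training loop is given below (Python; the map size, the Gaussian neighbourhood, the linear shrinking of its width and the learning-rate decay are illustrative assumptions, with the Euclidean distance used as similarity measure):

```python
import numpy as np

def train_kohonen(X, n_rows=6, n_cols=6, n_cycles=50, eta=0.5, sigma0=3.0, seed=0):
    """Competitive Kohonen training: random initial weights, Euclidean similarity,
    and a Gaussian neighbourhood N(t, r) that is narrowed down during training."""
    rng = np.random.default_rng(seed)
    n_units = n_rows * n_cols
    W = rng.uniform(X.min(axis=0), X.max(axis=0), size=(n_units, X.shape[1]))
    # Map coordinates of each unit, used for the distance r to the winning unit.
    coords = np.array([(i, j) for i in range(n_rows) for j in range(n_cols)], float)

    for t in range(n_cycles):
        sigma = sigma0 * (1.0 - t / n_cycles) + 0.5          # shrinking neighbourhood
        for x in X[rng.permutation(len(X))]:                 # one cycle = all input patterns
            winner = np.argmin(((W - x) ** 2).sum(axis=1))   # most similar weight vector
            r2 = ((coords - coords[winner]) ** 2).sum(axis=1)
            N = np.exp(-r2 / (2.0 * sigma ** 2))             # neighbourhood function N(t, r)
            W += eta * N[:, None] * (x - W)                  # w(t+1) = w(t) + eta N(t,r) (x - w(t))
        eta *= 0.98                                          # slow learning-rate decay (assumption)
    return W, coords
```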


Fig. 44.23. Some common neighbourhood functions used in Kohonen networks. (a) A block function, (b) a triangular function, (c) a Gaussian-bell function and (d) a Mexican-hat shaped function. In each of the diagrams the winning unit is situated at the centre of the abscissa. The horizontal axis represents the distance, r, to the winning unit. The vertical axis represents the value of the neighbourhood function. (Reprinted with permission from [70]).

Fig. 44.24. Example of the gradual narrowing of the neighbourhood function during the training process. Light grey: definition of the neighbourhood at stage 1; middle grey: at stage 2; dark grey: at stage 3. O is a unit and • is the winning unit. (Adapted from Ref. [70]).

Fig. 44.25. Example of the weight update of the winning unit in a Kohonen network. (Reprinted with
permission from Ref. [70]).

44.7.3 Interpretation of the Kohonen map

After the training procedure, the weight vectors of the units are fixed and the map is ready to be interpreted. There are different possibilities to interpret the trained weights, depending on the purpose for which the network is used. In this section some possibilities are described.
The output-activity map. A trained Kohonen network yields for a given input object, x_i, one winning unit, whose weight vector, w_j, is closest (as defined by the criterion used in the learning procedure) to x_i. However, x_i may be close to the weight vectors, w_k, of other units as well. The output y_j of the units of the map can also be defined as:

y_j = D(w_j, x_i)

where D is the similarity measure used in the training procedure. This results in a map as in Fig. 44.26a. This map allows the inspection of regions (neighbouring neurons) whose weight vectors are similar to a given input x_i. Note that each input x_i yields a different output activity map.
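A small sketch of such an output-activity map (Python; it assumes a weight matrix W with one row per unit, map dimensions n_rows x n_cols in row-major order, and the negative Euclidean distance as a stand-in for the similarity measure D):

```python
import numpy as np

def output_activity_map(x_i, W, n_rows, n_cols):
    """Output of every unit for one input object x_i (cf. Fig. 44.26a).
    The negative Euclidean distance to the unit's weight vector is used as
    similarity, so larger values indicate a more similar weight vector."""
    d = np.sqrt(((W - x_i) ** 2).sum(axis=1))
    return (-d).reshape(n_rows, n_cols)   # assumes row-major ordering of the units
```

Applying the function to each input object in turn gives one activity map per object, as noted above.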
The counting map (Fig. 44.26b). This map is obtained by counting, for each unit of the Kohonen network, the number of training objects for which the unit is the winning one. This map provides insight into the number of clusters that are present in the dataset. When, for example, all input patterns are assigned to two distinct regions (sets of neighbouring neurons) in the map, it can be concluded that there are two clusters in the dataset.
Fig. 44.26. Some possibilities for analysing a two-dimensional Kohonen map (Reprinted with permission from Ref. [70]). (a) Grey-encoded output activity map for a given training example. Dark areas in the map indicate a high similarity between the weight vector of the unit and the input object. (b) A counting map: dark areas indicate a large number of training examples that map on the unit. Units on which no training examples map are indicated white. (c) Feature map indicating units on which training examples map with (+) or without (-) a certain feature. (X) indicates that different labels are assigned to the unit. (d) Feature map with class identification A and B as labels. (X) indicates multiple class labels.

The feature map. This map can be obtained when labels can be assigned to the training objects. Labels may consist of a known classification, or the presence versus absence of certain features. For each training object, x_i, the winning unit is determined and to this unit the label of x_i is assigned. When this is completed for all training objects, each unit in the map is labelled with zero, one or more labels (see Figs. 44.26c and 44.26d). When one unit carries more than one label, an overlap between classes is present. In this way the method can be compared with fuzzy clustering techniques (see Chapter 30). A detailed description can be found in references [69,70].
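The counting map and the feature map can be assembled along the following lines (Python sketch; the weight matrix W, the map dimensions and the Euclidean distance criterion are assumptions consistent with the training sketch in Section 44.7.2):

```python
import numpy as np

def winning_unit(x, W):
    """Index of the unit whose weight vector is closest (Euclidean) to x."""
    return int(np.argmin(((W - x) ** 2).sum(axis=1)))

def counting_map(X, W, n_rows, n_cols):
    """Number of training objects that map on each unit (cf. Fig. 44.26b)."""
    counts = np.zeros((n_rows, n_cols), dtype=int)
    for x in X:
        k = winning_unit(x, W)
        counts[k // n_cols, k % n_cols] += 1
    return counts

def feature_map(X, labels, W, n_rows, n_cols):
    """Labels collected on each unit (cf. Figs. 44.26c and 44.26d); a unit with
    more than one label indicates overlapping classes."""
    lab = [[set() for _ in range(n_cols)] for _ in range(n_rows)]
    for x, c in zip(X, labels):
        k = winning_unit(x, W)
        lab[k // n_cols][k % n_cols].add(c)
    return lab
```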

44.7.4 Applications

Due to the Kohonen learning algorithm, the individual weight vectors in the
Kohonen map are arranged and oriented in such a way that the structure of the
input space, i.e. the topology, is preserved as well as possible in the resulting low-dimensional map. Therefore, the Kohonen map is said to be a topology-preserving mapping technique. It is primarily used for the examination of data for
which no or little prior knowledge is available.
The choice of the size and dimensionality of the network depends on the problem type. When too few units are used for classification purposes, different classes will fall together on the same unit(s). Too many units will not result in clustering of the data: all training objects will tend to map on different units. A rule of thumb is to take a number of units equal to twice the number of expected classes. The Kohonen network has been applied, e.g., for mapping IR spectra [71], to map surface molecular properties of organic molecules [72] and for classification [73] (e.g. of SIMS images [74] or plant seeds [75]). Another interesting application is in the design of RBF networks (Section 44.6). The Kohonen map can best be compared with techniques such as non-linear mapping or multi-dimensional scaling (see Chapter 38). Reference [76] is a review of the use of Kohonen neural networks in analytical chemistry.

44.8 Adaptive resonance theory networks

44.8.1 Introduction

The Adaptive Resonance Theory (ART) networks are also designed for un-
supervised pattern recognition. There exists a variant (ARTMAP) that is meant for
supervised pattern recognition, but we will consider here the more common
unsupervised variant. For a detailed description see Ref. [14]. Grossberg designed
the adaptive resonance theory to overcome a general drawback of classifiers (e.g.
MLF networks), i.e. the fact that once the training procedure (supervised or
unsupervised) is completed the weights are fixed. The network or classifier is
designed for a certain well-defined classification task. The network is trained by
means of a set of representative training examples. This is an acceptable procedure
only when the classification problem is well defined and only as long as it remains
stable. Unfortunately this is not often the case in real situations. It may happen that
a new class of objects appears that was not represented in the training set. The only
remedy in that case is to retrain the network including training objects of the new
class. It is also possible that existing classes tend to change in time, e.g. drifting
away. Here too, the proper action is to retrain the network with new representative
training objects. This is what Grossberg called the stability-plasticity dilemma:
how can a system be adaptive to relevant new input and at the same time be stable
to irrelevant input changes? In response to this, he and others developed the
adaptive resonance theory. The essence of ART is pattern matching. When a
familiar pattern is presented to the network (i.e. satisfies the limiting conditions of
the previous examples) the network recognizes the input and it also incorporates
the new information of the new input by adapting the weights. When, on the other
hand, a novel input is presented, i.e. not satisfying the limiting conditions of the
previous examples, the structure is adapted and the novel input is identified as the
first representative of a new class. In this way ART is stable enough to preserve
past learning but remains adaptable to incorporate new information when it
appears.

44.8.2 Structure

ART networks consist of units that contain a weight vector of the same
dimension as the input patterns. Each unit is meant to represent one class or cluster
in the input patterns. The structure of the ART network is such that the number of
units is larger than the expected number of classes. The units in excess are dummy
units that can be taken into use when a new input pattern shows up that does not
belong to any of the already learned classes.
There exist many different types of ART. The variant ART1 is the original Grossberg algorithm; it allows only binary input vectors. ART2 also allows continuous input. It is this basic variant that we will describe. The structure of ART networks is hard to visualize. It is in fact a theory that is better explained by means of the sequence of training steps that it follows.

44.8.3 Training

Training of an ART network can be summarized as follows:


- Initialize the weights of all units (in a (p x c) matrix W) with a fixed value. The parameter p is the length of the weight vector and c is the total number of units. Usually the fixed value 1/√p is used for the initial weights, such that the length of each weight vector is scaled to unity.
- The input patterns are scaled to unit length.
- The first input vector is copied into the weight vector of the first unit, which now becomes an active unit.
- For the next input vector, x_i, the similarity, ρ_ik, with the weight vector, w_k, of each active unit k is calculated:

  ρ_ik = x_i^T w_k

Because x_i as well as w_k are normalized, ρ_ik represents the cosine or correlation coefficient between the two vectors. In a variant of ART, Fuzzy ART, a fuzzy similarity measure is used instead of the cosine similarity measure [14].
- The active unit with the highest similarity is declared as the winning unit.

- The similarity between x_i and the winning unit is compared with a threshold value, ρ*, in the range from zero to one. When ρ_ik < ρ*, the input pattern, x_i, is not considered to fall into the existing class. It is decided that a so-called novelty is detected and the input vector is copied into one of the unused dummy units. Otherwise the input pattern, x_i, is considered to fall into the existing class (to resonate with it). A large ρ* will result in many novelties, and thus in many small clusters. A small ρ* results in few novelties and thus in a few large clusters.
- When the resonance step succeeds, the weight vector of the winning unit is changed. It adapts itself a little towards the new input pattern x_i, belonging to the same class, according to:

  w(t+1) = η x_i + (1 - η) w(t)
  w(t+1) = w(t+1) / ||w(t+1)||

Usually the learning rate, η, is chosen between 0 and 1. In this step the network incorporates the new information present in the input object by moving the centroid of the class a little towards the new input pattern x_i. This step is intended to make the network flexible enough when clusters are changing in time.
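A minimal sketch of this training procedure is given below (Python; a simplified ART-like scheme in which new units are appended instead of being drawn from a pool of pre-allocated dummy units, with the cosine of unit-length vectors as similarity and a single vigilance threshold ρ*):

```python
import numpy as np

def train_art(X, rho_star=0.9, eta=0.2):
    """Simplified ART-like clustering: inputs scaled to unit length, cosine
    similarity, vigilance threshold rho_star, and the update
    w(t+1) = eta*x + (1-eta)*w(t), renormalized to unit length."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # scale input patterns to unit length
    weights = [X[0].copy()]                            # first input becomes the first active unit
    assignment = [0]
    for x in X[1:]:
        sims = np.array([w @ x for w in weights])      # cosine similarities (vectors are normalized)
        k = int(np.argmax(sims))                       # winning active unit
        if sims[k] < rho_star:                         # novelty detected: open a new unit
            weights.append(x.copy())
            assignment.append(len(weights) - 1)
        else:                                          # resonance: adapt the winning unit
            w_new = eta * x + (1.0 - eta) * weights[k]
            weights[k] = w_new / np.linalg.norm(w_new)
            assignment.append(k)
    return np.array(weights), assignment
```

With a large rho_star this sketch yields many small clusters and with a small rho_star a few large ones, in line with the behaviour described above.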

44.8.4 Application

From the previous discussion, it is clear that ART networks are more applicable for pattern recognition than for quantitative applications. In Fig. 44.27 it is shown how an ART network functions with different values of ρ*.

Fig. 44.27. The influence of different values of ρ* on the classification performance of the ART network. (a) Large ρ* value; (b) small ρ* value.

ART networks have been applied to classify process control data [77], for classification of UV/VIS/NIR spectra and of
airborne particles [78]. They have also been applied for QSAR [79]. There is not, however, much expertise available yet on the performance of these networks in real situations. The general experience is that the networks are very sensitive to noise. It is difficult to select a proper threshold value for the similarity measure. Moreover, experience shows that within one network different threshold values are in fact necessary for different classes. Strategies that also use an adaptive threshold (e.g. based on the different class properties) may be more successful.

References

1. W.S. McCulloch and W. Pitts, A logical calculus of the ideas immanent in nervous activity.
Bull. Math. Biophys., 5 (1943) 115-133.
2. D.O. Hebb, The Organization of Behavior. Wiley, New York, 1949.
3. F. Rosenblatt, The perceptron: a probabilistic model for information storage and organization
in the brain. Psycholog. Rev., 65 (1958) 386-408.
4. B. Widrow and M.E. Hoff, Adaptive switching circuits. In: IRE WESCON Convention Re-
cord, New York, 1960, p. 96-104.
5. N. Nilsson, Learning Machines, McGraw-Hill, New York, 1965.
6. M.L. Minsky and S.A. Papert, Perceptrons. MIT Press, Cambridge, MA, 1969.
7. D.E. Rumelhart, G.E. Hinton and R.J. Williams, Learning internal representations by error
propagation. In: Parallel Distributed Processing, Explorations in the Microstructure of Cogni-
tion, Vol. 1: Foundations, D.E. Rumelhart and J.L. McClelland (eds.), MIT Press, Cambridge,
MA, 1986, pp. 318-362.
8. P. Werbos, Beyond regression: new tools for prediction and analysis in the behavioral sciences.
PhD Thesis, Harvard, Cambridge, MA, 1974.
9. B. Wythoff, Backpropagation neural networks. Chemom. Intell. Lab. Syst., 18 (1993)
115-155.
10. J. Zupan and J. Gasteiger, Neural Networks for Chemists. An Introduction. VCH
Verlagsgesellschaft, Weinheim, 1993.
11. J. Devillers (ed.), Neural Networks in QSAR and Drug Design. Academic Press, London,
1996.
12. R. Beale and T. Jackson, Neural Computing: An Introduction. Institute of Physics Publishing,
Bristol, 1992.
13. J.L. McClelland and D.E. Rumelhart, Parallel Distributed Processing, Vol. 1. MIT Press, Lon-
don, 1988.
14. J.A. Freeman and D.M. Skapura, Neural Networks, Algorithms, Applications and Pro-
gramming Techniques. Addison-Wesley, Reading, MA, 1991.
15. R. Hecht-Nielsen, Neurocomputing. Addison-Wesley, Reading, MA, 1991.
16. P.D. Wasserman, Neural Computing: Theory and Practice. Van Nostrand Reinhold, New
York, 1989.
17. J. Smits, W.J. Melssen, L.M.C. Buydens and G. Kateman, Using artificial neural networks for
solving chemical problems. Chemom. Intell. Lab. Syst., 22 (1994) 165-189.
18. D. Svozil, Introduction to multi-layer feed-forward neural networks. Chemom. Intell. Lab.
Syst., 39 (1997) 43-62.

19. W.J. Melssen and L.M.C. Buydens, Aspects of multi-layer feed-forward neural networks in-
fluencing the quality of the fit of univariate non-linear relationships. Anal. Proc, 32 (1995)
53-56.
20. E.P.P.A. Derks, Aspects of artificial networks and experimental noise. PhD thesis, University
of Nijmegen, the Netherlands, Chapter 2, 1997.
21. R. Tibshirani, A comparison of some error estimates for neural network models. Neural Com-
putation, 8 (1995) 152-163.
22. B. Walczak, Neural networks with robust backpropagation learning algorithm. Anal. Chim.
Acta, 322 (1996) 21-30.
23. P. Deveka and L. Achenie, On the use of quasi-Newton based training of a feedforward neural
network for time series forecasting. J. Intell. Fuzzy Syst., 3 (1995) 287-294.
24. M. Norgaard, Neural network based system identification toolbox. Technical report. Institute
of Automation, Technical University, Denmark, 1995.
25. E.P.P.A. Derks, Aspects of artificial networks and experimental noise. PhD thesis, Nijmegen,
the Netherlands, Chapter 3, 1997.
26. G. Castellano, A.M. Fanelli and M. Pelillo, An iterative pruning algorithm for feedforward
neural networks. IEEE Trans. Neural Networks, 8 (1997) 519-531.
27. J. Zhang, J.-H. Jiang, P. Liu, Y.-Z. Liang and R.-Q. Yu, Multivariate nonlinear modelling of
fluorescence data by neural network with hidden node pruning algorithm. Anal. Chim. Acta,
344 (1997) 29-40.
28. E.P.P.A. Derks, M.L.M. Beckers, W.J. Melssen and L.M.C. Buydens, A parallel
cross-validation procedure for artificial neural networks. Computers Chem., 20 (1995)
439-448.
29. E.P.P.A. Derks, M.S. Sanchez Pastor and L.M.C. Buydens, A robustness analysis for MLF and
RBF neural networks models. Chemom. Intell. Lab. Syst., 34 (1996) 299-301.
30. P.J. Gemperline, Rugged Spectroscopic calibration. Chemom. Intell. Lab. Syst., 39 (1997)
29-42.
31. G.J. Salter, M. Lazzari, L. Giansante, R. Goodacre, A. Jones, G. Surrichio, D.B. Kell and B.
Bianchi, Determination of the geographical origin of Italian extra-virgin olive oil using pyroly-
sis-mass spectrometry and neural networks. J. Anal. Appl. Pyrol., 40 (1997) 159-170.
32. J.R.M. Smits, L.W. Breedveld, M.W.J. Derksen, G. Kateman, H.W. Balfoort, J. Snoek and
J.W. Hofstraat, Pattern classification with artificial neural networks: classification of Algae,
based upon flow cytometry data. Anal. Chim. Acta, 258 (1992) 11-25.
33. H. Chan, A. Butler, D.M. Falk and M.S. Freund, Artificial neural network processing of strip-
ping analysis responses for identifying and quantifying heavy metals in the presence of
intermetallic compound formation. Anal. Chem., 69 (1997) 2373-2378.
34. C.W. McCarrick, D.T. Ohmer, L.A. Gilliland, P.A. Edwards and H.T. Mayfield, Fuel identifi-
cation by neural network analysis of the responses of vapour-sensitive sensor arrays. Anal.
Chem., 68 (1996) 4264-4269.
35. M. Glick and G.M. Hieftje, Classification of alloys with an artificial neural network and
multivariate calibration of Glow-Discharge emission spectra. Appl. Spectrosc, 45 (1991)
1706-1716.
36. R. Goodacre, D.B. Kell and G. Bianchi, Rapid identification of species using pyrolysis mass
spectrometry and artificial neural networks of Propionibacterium acnes isolated from dogs. J.
Appl. Bacteriol., 76 (1994) 124-134.
37. W. Werther, H. Lohninger, F. Stand and K. Varmuza, Classification of mass spectra. A com-
parison of yes/no classification methods for the recognition of simple structural properties.
Chemom. Intell. Lab. Syst., 22 (1994) 63-67.

38. J.M. Andrews and S.H. Lieberman, Neural network approach to qualitative identifications of
fuels from laser induced fluorescence spectra. Anal. Chim. Acta, 285 (1994) 237-246.
39. U. Hare, J. Brian and J.H. Prestegard, Application of neural networks to automated assignment
of NMR spectra of proteins. J. Biomolec. NMR, 4 (1994) 35-46.
40. T. Visser, H.J. Luinge and J.H. van der Maas, Recognition of visual characteristics of infrared
spectra by artificial neural networks and partial least squares regression. Anal. Chim. Acta, 296
(1994).
41. H.J. Luinge, E.D. Leussink, and T. Visser, Trace-level identity confirmation from infrared
spectra by library searching and artificial neural networks. Anal. Chim. Acta, 345 (1997)
173-184.
42. C. Affolter and J.T. Clerc, Prediction of infrared spectra from chemical structures of organic
compounds using neural networks. Chemom. Intell. Lab. Syst., 21 (1993) 151-157.
43. J.R.M. Smits, P. Schoenmakers, A. Stehmann, F. Sijstermans and G. Kateman, Interpretation
of Infrared spectra with modular neural-network systems. Chemom. Intell. Lab. Syst., 18
(1993) 27-39.
44. M.E. Munk, M.S. Madison and E.W. Robb, Neural network models for infrared spectrum in-
terpretation. Microchim. Acta, 2 (1991) 505-524.
45. W. Wu and D.L. Massart, Artificial neural networks in classification of Nir spectral data: selec-
tion of the input. Chemom. Intell. Lab. Syst., 35 (1996) 127-135.
46. C. Hoskins and D.M. Himmelblau, Process Control via artificial Neural networks and rein-
forced learning. Computers Chem. Eng., 16 (1992) 241-251.
47. C. Hoskins and D.M. Himmelblau, Fault diagnosis in complex chemical plants using artificial
neural networks. AIChE J, 37 (1991) 137-141.
48. M. Bhat and T.J. McAvoy, Use of neural nets for dynamic modelling and control of chemical
process systems. Computers Chem. Eng., 15 (1990) 573-578.
49. C. Puebla, Industrial process control of chemical reactions using spectroscopic data and neural
networks. Chemom. Intell. Lab. Syst., 26 (1994) 27-35.
50. N. Qian and T.J. Sejnowski, Predicting the secondary structure of globular proteins using neu-
ral network models. J. Molec. Biol., 202 (1988) 568-584.
51. M. Vieth, A. Kolinsky, J. Skolnicek, and A. Sikorski, Prediction of protein secondary structure
by neural networks, encoding short and long range patterns of amino acid packing. Acta
Biochim. Pol., 39 (1992) 369-392.
52. R. Wehrens and W.E. van der Linden, Calibration of an array of voltammetric microelectrodes.
Anal. Chim. Acta, 334 (1996) 93-100.
53. J.R.M. Smits, W.J. Meissen, G.J. Daalmans and G. Kateman, Using molecular representations
in combination with neural networks. A case study: prediction of the HPLC retention index.
Computers Chem., 18 (1994) 157-172.
54. A. Bos, M. Bos and W.E. Van der Linden, Artificial neural networks as a multivariate calibra-
tion tool: modelling the iron-chromium-nickel system in X-ray fluorescence spectra. Anal. Chim.
Acta, 277 (1993) 289-295.
55. M.N. Tib and R. Narayanaswamy, Multichannel calibration technique for optical-fibre chemi-
cal sensor using artificial neural network. Sensors Actuators, B39 (1997) 365-370.
56. H.M. Wei, L.S. Wang, B.G. Zhang, C.J. Liu and J.X. Feng, An application of artificial neural
networks. Simultaneous determination of the concentration of sulfur dioxide and relative hu-
midity with a single coated piezoelectric crystal. Anal. Chem., 69 (1997) 699-702.
57. H.J. Miao, M. H. Yu and S.X. Hu, Artificial neural networks aided deconvolving overlapped
peaks in chromatograms. J. Chromatogr., A, 749 (1996) 5-11.

58. R.M. Lopes Marques, P.J. Schoenmakers, C.B. Lucasius and L.M.C. Buydens, Modelling
chromatographic behavior as a function of pH and solvent composition in RPLC.
Chromatographia, 36 (1993) 83-95.
59. A.P. de Weyer, L.M.C. Buydens, G. Kateman and H.M. Heuvel, Neural networks used as a soft
modelling technique for quantitative description of the inner relation between physical proper-
ties and mechanical properties of polyethylene terephthalate yarns. Chemom. Intell. Lab.
Syst., 16 (1992) 77-82.
60. E.P.P.A. Derks, B.A. Pauly, J. Jonkers, E.A.H. Timmermans and L.M.C. Buydens, Adaptive
noise cancellation on inductively coupled plasma spectroscopy. Chemom. Intell. Lab. Syst.,
(1998) in press.
61. A. Cichocki and R. Unbehauen, Neural Networks for Optimization and Signal Processing.
Wiley, New York, 1993.
62. J. Park and I.W. Sandberg, Universal approximation using radial basis function networks. Neu-
ral Computation, 3 (1991) 246-257.
63. J. Moody and C.J. Darken, Fast learning in networks of locally-tuned processing units. Neural
Computation, 1 (1989) 281-294.
64. B. Carse and T.C. Fogarty, Fast evolutionary learning of minimal radial basis function neural
networks using a genetic algorithm. Lecture Notes in Computer Science, 1143, (1996) 1-22.
65. L. Kiernan, J.D. Mason and K. Warwick, Robust initialisation of Gaussian radial basis function
networks using partitioned k-means clustering. Electron. Lett., 32 (1996) 671-672.
66. W. Luo, M.N. Karim, A.J. Morris and E.B. Martin, Control relevant identification of a pH
waste water neutralisation process using adaptive radial basis function networks. Computers
Chem. Eng., 20 (1996) S1017.
67. B. Walczak and D.L. Massart, Application of radial basis functions-partial least squares to
non-linear pattern recognition problems: diagnosis of process faults. Anal. Chim. Acta, 331
(1996) 187-193.
68. B. Walczak and D.L. Massart, The radial basis functions-partial least squares approach as a
flexible non-linear regression technique. Anal. Chim. Acta, 331 (1996) 177-185.
69. T. Kohonen, Self-Organization and Associative Memory. Springer-Verlag, Heidelberg, 1989.
70. W.J. Melssen, J.R.M. Smits, L.M.C. Buydens and G. Kateman, Using artificial neural net-
works for solving chemical problems. II. Kohonen self-organizing feature maps and Hopfield
networks. Chemom. Intell. Lab. Syst., 23 (1994) 267-291.
71. W.J. Melssen, J.R.M. Smits, G.H. Rolf and G. Kateman, Two-dimensional mapping of IR
spectra using a parallel implemented self-organising feature map. Chemom. Intell. Lab. Syst.,
18 (1993) 195-204.
72. S. Anzali, G. Barnickel, M. Krug, J. Sadowski, M. Wagener, J. Gasteiger and J. Polanski, The
comparison of geometric and electronic properties of molecular surfaces by neural networks:
Application to the analysis of corticosteroid-binding globulin activity of steroids. J. Com-
puter-Aided Molec. Design, 10 (1996) 521-534.
73. X.H. Song and P.K. Hopke, Kohonen neural network as a pattern-recognition method, based
on weight interpretation. Anal. Chim. Acta, 334 (1996) 57-66.
74. M. Walkenstein, H. Hutter, C. Mittermayr, W. Schiesser and M. Grasserbauer, Classification
of SIMS images using a Kohonen network. Anal. Chem., 69 (1997) 777-782.
75. R. Goodacre, J. Pygall and D.B. Kell, Plant seed classification using pyrolysis mass spectrome-
try with unsupervised learning: the application of auto-associative and Kohonen artificial neu-
ral networks. Chemom. Intell. Lab. Syst., 33 (1996) 69-83.
76. J. Zupan, M. Novic and I. Ruisanchez, Kohonen and counterpropagation networks in analytical
chemistry. Chemom. Intell. Lab. Syst., 38 (1997) 1-23.

77. D. Wienke and L.M.C. Buydens, Adaptive resonance theory neural networks — the ART of
real-time pattern recognition in chemical process monitoring? Trends Anal. Chem., 99 (1995)
1-8.
78. Y. Xie, P.K. Hopke and D. Wienke, Airborne particle classification with a combination of
chemical composition and shape index utilizing an adaptive resonance artificial neural net-
work. Environ. Sci. Technol., 28 (1994) 1399-1407.
79. D. Domine, D. Wienke, J. Devillers and L.M.C. Buydens, A new nonlinear neural mapping
technique for visual exploration of QSAR data. In: Neural Networks in QSAR and Drug De-
sign, J. Devillers (ed.). Academic Press, London, 1996, p. 223-253.
80. S.H. Barondes, Geestesziekten en Moleculen. De Wetenschappelijke Bibliotheek van Natuur
& Techniek, 1993.

Subject Index

a priori probability, 221 average linkage, 69, 70, 72


absorption, 449 axis of inertia, 106
abstract chromatogram, 247, 266 axis of symmetry, 104
abstract factor, 243, 245
abstract spectrum, 247
ACE, 235 back-propagation learning rule, 662, 671
activation space, 682 back-propagation network, 662
ADALINE, 650 backrotation, 56
adaptive Kalman filter, 576, 598, 599 backtracking, 636
adaptive resonance theory network, 692 backward chaining, 634
additivity, 393, 397 backward Fourier transform, 516
administration, way of, 449 bacteriostatic activity, 394
aerosol particles, 60 basis function, 681
agglomerative hierarchical classification, 75, 79 basis vector, 9, 14, 91
aliasing, 524 Bayes equation, 221
algebraic set, 9 Bayesian approach, 221
ALLOC, 227 Bayesian probability theory, 640
alpha-phase, 469, 481 beta-phase, 463,481,494
alternating least squares, 278, 296 between-class variance, 216
alternating regression, 278, 296 bias, 654
analog filter, 537 bidiagonal matrix, 136
analytical laboratory management, 617 bilinear model, 390
analyzing wavelet, 566 binary programming, 609
angular distance, 11, 47 binary variables, 65
ANOVA, 129,384,395 binding affinity, 402
antibiotics, 452 binding assay, 411
anti-symmetric function, 511 binding capacity, 453
apodization, 527 bioavailability, 457, 466,484
approximation coefficient, 568 bio-equivalence, 467
area under the curve, 457,493 biological activity, 149, 383
area under the moment curve, 495 biotransformation, 449
area under the second moment, 497 biplot, 112,153,183,187, 190, 203,405,408,
ART network, 649 413
artefact, 156 bipolar axis, 113, 188
artificial intelligence, 627 bivariate distribution, 212
artificial neural network, 649 bivariate normal distribution, 211
artificial neuron, 233 bootstrap, 371
autoscaling, 64, 122 busy time, 616

calibration, 115 classification tree, 227


canonical analysis, 320 CLASSY, 232
canonical axis, 409 clearance, 452,459,481, 484
canonical correlation, 319 clinical chemistry, 208, 213
canonical correlation analysis, 409 clinical laboratory, 609
canonical variable, 319 closure, 87, 167
canonical variate, 213, 319 cluster analysis, 57, 156, 384, 397,416
canonical variate model, 408 clustering, 57, 207
canonical weight, 319 — algorithm, 69
Cartesian coordinate space, 9 — average linkage, 69,70, 72
Cartesian diagram, 112 — complete linkage, 69, 70, 72
catenary model, 452, 487 — density method, 209, 211, 213, 225
centered matrix, 45 — non-hierarchical, 59, 83
centering, 43 — seed points, 77
centre of mass, 42 — single linkage, 398
centroid, 77, 116, 120 — tendency, 82
centrotypes, 78 — Ward's method, 71,72
certainty factor, 640 column space, 246
chalcone, 116 column-centered biplot, 120
characteristic equation, 31, 490 column-centered matrix, 43
characteristic root, 31 column-closure, 168
characteristic vector, 33 column-eigenvector, 92
child frame, 637 column-mean, 42
Chinese teas, 60 column-orthogonal, 21
chi-square distance, 147 column-orthonormal, 21
chi-square statistic, 166, 182 column-pattern, 16, 104
chromatography, 118, 622 column-principal component, 97
circulating protein, 456 column-profile, 168
city-block distance, 66, 149 column-singular vector, 91
city-block metric, 152 column-standard deviation, 46
class envelope, 212 column-standardization, 122
class membership, 408 column-standardized biplot, 123
classical least squares, 352, 353 column-standardized matrix, 47
classical metric scaling, 149 column-sum, 165
classification, 57, 669 column-variable, 87
— agglomerative hierarchical, 75, 79 combinatorial chemistry, 59, 65
— divisive hierarchical, 75 commutative, 20
— fuzzy clustering, 80-82 comparative molecular field analysis, 385,
— hierarchical, 57, 69 410
— /:-nearest neighbour method, 208, 223 compartment, 451
— non-hierarchical, 58 compartmental analysis, 451
— non-hierarchical, Forgy's method, 77 competitive inhibition, 503
— and regression trees, 227 complete linkage, 69, 70, 72
classification rule, 207, 208 compositional data, 130

compositional table, 87 dance step diagram, 135


concentration-time curve, 457 DASCO, 232
confirmatory data analysis, 46 data analysis, 14, 30, 40
conflict resolution, 634, 636 data compression, 550
conformation, 387 data-scope, 281
conjugated biplot, 409 Daubechies wavelet, 566
connected graphs, 73 de novo approach, 391, 393
connection pattern, 652 decision-making process, 628
connectivity indices, 392 decision tree, 416, 629
consensus, 313 deconvolution, 490, 510, 553, 556
contingency table, 3, 7, 161 deflated cross-product matrix, 138
continuous signal, 513 deflated matrix, 35
continuum power-PLS, 345 deflation, 138
continuum regression, 367 degeneracy, 40
contrast, 113, 115, 127, 180, 181, 203, 204, 405 delta rule, 656
controlled calibration, 351 dendrogram, 57, 69,70-72, 83
convolution, 489, 530, 533 density method, 209, 211, 213, 225
convolution function, 530 derivative of a signal, 550
convolution theorem, 533 descriptive linear discriminant analysis, 220
Cook's distance, 374 detail coefficient, 568
Coomans plot, 231 determinant, 31, 34, 39, 490
coordinate space, 9 determinant, minor of, 491
core matrix, 154 deterministic chaos, 451
correlation, 63, 401 deviations
correlation coefficient, 14, 62, 65, 111 — of column-profiles, 169
correlation matrix, 49 — of double-closed data, 169
correlation-PCR, 364 — of row-closed profile, 168
correspondence, 182 diagonal matrix, 19
correspondence factor analysis, 130, 174, 182, diagonalization, 34, 93, 139, 140
405 differential equation, 462
cosine rule, 12, 111 differential equations, system of, 477
cost-benefit analysis, 237 digital filter, 537
counting map, 690 dilation, 566
Craig plot, 397 dimension, 8
credibility function, 640 dimension reduction, 107
cross-product matrice, 48 direct standardization, 377
cross-tabulation, 7, 88, 147 directed graphs, 623
cross-validation, 144, 229, 236, 369, 416 discrete event simulation, 618
— internal, 238 discrete Fourier transform, 519
— k-fold, 238 discrete wavelet transform, 567
curve peeling, 465, 491, 501 distributivity, 529
curve resolution, 243, 260, 267 discriminant function, 216, 217
curvilinear trajectories, 152 discriminating power, 237
cut-off frequency, 548 discrimination, 208

disjoint class modelling, 208 eigenvalue decomposition, 33, 92, 148, 183
dispersion matrix, 31 eigenvector, 33, 228, 230
dissimilarity plot, 295 eigenvector projection, 55
dissociation constant, 387 electron acceptor, 127
distance, 43, 45, 47, 60, 108, 176-178 electron donor, 127
— between two points, 11 electrostatic interaction, 386
— of chi-square, 133, 175 elimination pool, 455
— city block, 66, 149 ellipsoid, 39
— Cook, 374 enthalpy, 386
— Euclidean,60, 62,63, 67, 108, 146,230, entropy, 386
231 environmental applications, 232
— from the origin, 11 enzyme, 383, 411
— function, 152 epoch, 673
— generalized, 61 equilibrium constant, 386
— Hamming, 66, 147 equiprobability envelope, 104, 107
— Mahalanobis, 61, 62, 65, 147, 220, 221, error eigenvector, 143
228, 274 Euclidean distance, 60, 62, 63, 67, 108, 146,
— Manhattan, 67, 147 230,231
— Minkowski, 67 euroleptic effect, 400
— matrix, 68 evolution program, 375
— metrics, 150 evolutionary method, 251
— Pythagorean, 146 evolutionary rank analysis, 274
— standardized Euclidean, 61, 62, 64 evolving factor analysis, 274
— taxi, 147 evolving principal components analysis, 274
distribution, 449 exclusive or (XOR) problem, 659
— volume of, 456 excretion, 449, 452
distributional equivalence, 193 exhaustion, 136
D-optimality, 40 expected value, 147
double-closure, 130, 169 experimental design, 40
drift, 593, 598 expert system, 4, 627
drug-receptor complex, 385 — if...then... rules, 631
drug-receptor interaction, 402 — inductive, 227
drug-test specificity, 397 — inference engine, 629, 633
dual representation, 28 — inheritance, 637
dual space, 174 — shell, 630, 641
duality, 16 explanation facility, 640
duo-trio test, 422 exploratory analysis, 46, 167
dynamic programming, 605 exponential function, 203, 462
exponential smoothing, 544
earning machine, 416 extended matrix, 155
edges, 621 extrathermodynamic method, 384
eigenfunction, 486 extravascular administration, 461,469
eigenstructure tracking, 280
eigenvalue, 31, 186, 486, 490, 492 factor, 244, 319

factor analysis, 243, 244, 302, 397 — time-frequency, 564


— evolving, 274 fractal geometry, 451
— fixed-size window evolving, 278 frame, 14, 633
— full-rank method, 251 free choice profiling, 436
— iterative target transformation, 268 Free-Wilson analysis, 384, 393
— key-set, 297 frequency domain, 509, 510, 547
— local rank methods, 274 fuel spill, 232
— residual bilinearization, 300 full-rank method, 251
— target transformation, 256 fuzzy clustering, 80, 81, 82
factor loading, 244 fuzzy-set theory, 640
factor rotation, 252
factor scaling coefficient, 95, 150, 188 gain vector, 578, 579
factor space, 95, 99 game theory, 605
factorization, 95 gamma-method, 490,491
fast Fourier transform, 530 generalized column-eigenvectors, 186
feature map, 691 generalized distance, 61
feature reduction, 215, 236 generalized eigenvalue decomposition, 185
feature selection, 207, 236, 375 generalized inverse, 38
field and resonance parameters, 392 generalized least squares, 356
filter function, 540, 547 generalized loading, 188
filter transfer function, 548 generalized Procrustes analysis, 317, 434
filtering, 529, 540, 547, 549 generalized rank annihilation factor analysis,
— cut-off frequency, 548 298
— low-pass, 547, 549 generalized row-eigenvectors, 185
— high-pass, 547 generalized score, 188
— inverse, 553, 555 generalized singular value decomposition, 183
finite progression, 474 generalized standard addition method, 367,
fixed-size window evolving factor analysis, 278 368
folding, 524 genetic algorithm, 78, 625
food authentication, 232 geometrical structure, 46
Forgy's method, 78 Gibbs' free energy, 385
Forgy's non-hierarchical classification method, global interaction, 133
77 global sum, 165
forward chaining, 634 global sum of squares, 94
forward chaining inference engine, 636 global weighted mean, 178
forward Fourier transform, 516 Golub-Reinsch, 134
Fourier coefficient, 236, 515 goodness of fit, 182
— imaginary, 516 Gower's similarity index, 148
— real, 516 GRAFA, 298
Fourier filtering, 373 graph, 73, 621
Fourier transform, 236, 478, 507, 510, 513, 550 graph theory, 79, 605
— discrete, 519 graphical determination, 480
— inverse, 516 graphical solution, 463
— pair, 517 GSAM, 367, 368

Guttman effect, 199 incomplete absorption, 469


indicator table, 161
Haar wavelet, 566 indicator variable, 392
Hadamard transform, 562, 564 indirect QSAR, 412
half-life time, 456 inertia, 107, 113
Hammett electronic parameter, 387 inference engine, 629, 633
Hammett equation, 387 inheritance, 637
Hamming distance, 66, 147 inhibition constant, 403
hard delimiter, 655 integer programming, 605, 609
Hansch analysis, 384 integration by parts, 498
Hansch equation, 388 interaction, 128, 175, 179, 182, 196
Hansch model, 410 interaction module, 629, 640
HELP, 280 interferogram, 509
heterogeneity, 124 internal cross-validation method, 238
heterogeneous data, 133, 205 intravenous administration, 455,476,491
heterogeneous table, 182 — repeated, 473
heteroscedasticity, 64, 503 inverse calibration, 352
heuristic evolving latent projection, 280 inverse filtering, 553, 555
heuristics, 628 inverse Fourier transform, 516
hierarchical classification, 57 inverse Laplace transform, 478, 489
hierarchical methods, 69 iterative target transformation factor analysis,
high-pass filter, 547 268
H-matrix, 568 ITTFA, 268
HOMALS program, 150
homogeneity, 150 Jaccard similarity coefficient, 65
homogeneous data, 133, 205 jackknife method, 238
homogeneous linear equation, 34 Jacobi algorithm, 134
homogeneous table, 87, 182
Hopkin's statistic, 82 Kalman filter, 575, 576, 594, 596
Hotelling T²-distribution, 228 — innovation, 578, 599
Householder-QR algorithm, 134 K-center, 78
Householder transformation, 136 K-centroid methods, 78
hybrid constants, 486 K-clustering, 77, 83, 84
hybrid transport constant, 479 Kennard-Stone algorithm, 372, 378
hydrogen bonding, 386, 395 kernel, 80
hydrophobic interaction, 386, 388 kernel function, 681
hydrophobicity, 388 kernel method, 225
key-set factor analysis, 297
IC50 value, 412 kinetics, 592, 596
identity matrix, 19 kinetics model, 576
ill-conditioned problem, 469 k-nearest neighbour method, 208, 223
image, 52 knowledge acquisition, 643
image analysis, 153 knowledge base, 629
imaginary Fourier coefficient, 516 knowledge engineer, 644

knowledge environment, 642 loading, 55


knowledge representation, 630 loading matrix, 155
Kohonen learning rule, 687 loading plot, 99
Kohonen mapping, 82 locally weighted regression, 378
Kohonen network, 649, 687 location, 174
Kronecker's delta, 152 location model, 78
Kruskal's algorithm, 74 lock and key paradigm, 384
log column-centered biplot, 124
L1-norm, 66, 152 log column-centered biplot, 124
L2-norm, 67 log double-centering, 125
laboratory management, 610 log ratio, 404
laboratory simulator, 621 logarithmic analysis, 129
Lagrange multipliers, 93 logarithmic function, 124
Laplace domain, 488 logarithmic transformation, 64
Laplace transform, 477,491 logarithms, 133
— inverse, 478, 489 log-bilinear model, 129, 201
latent value, 95 log-linear model, 201
latent variable, 50, 212, 215, 228, 319 lumping, 451
latent vector, 95, 182, 183
lead compounds, 384 MacQueen's K-means method, 78
learning object, 207 Mahalanobis distance, 61, 62, 65, 147, 220,
learning rate, 673 221,228,374
learning rule, 652, 656, 670 majority rule, 223
least-squares criterion, 53 Malinowski's F-ratio, 144
least-squares linear regression, 503 mammillary model, 452
leave-one-out procedure, 238 Manhattan distance, 67,147
leverage, 374 manifest variable, 50
limiting solution, 463 marginal sum, 131, 165, 166, 405
line of closest fit, 104, 106 MARS, 235
linear algebra, 134 MASLOC, 78, 83
linear combination, 91, 96 mass, 174
linear discriminant analysis, 84, 208, 209, 212, mass balance differential equation, 451
213,236,408 mass balance equation, 470, 476
linear free energy relationship, 384 mass weight, 106
linear independence, 27, 32 matching coefficient, 65, 66
linear learning machine, 234, 653 mathematical programming, 609
linear programming, 605 matrix, 7, 15
linear regression, 387, 388, 390, 460, 468 — inner product, 10, 20
linear transfer function, 666 — outer product, 25, 43
linearity, 500 matrix addition, 43
linearization, 502 matrix multiplication, 20
linearly independent, 8 matrix-by-vector product, 23
Lineweaver-Burk form, 502 maximal complete subgraph, 79
lipophilicity, 388 maximum entropy method, 555, 558

maximum likelihood method, 555, 557 multivariate calibration, 60, 239, 349
mean absorption time, 496 multivariate normal distribution, 40, 212, 221,
measure of belief, 640 228
measure of disbelief, 640 multivariate quality control, 232
measure of (dis)similarity, 60, 65 multivariate regression, 323
measurement equation, 577 multivariate statistics, 4
measurement model, 576 multi-way contingency table, 165
measurement table, 7, 87 multi-way data table, 2
mechanical analogy, 128 MYCIN, 640
membrane, 456
metabolism, 452 natural calibration, 352
metabolite, 450 near infrared spectra, 232
meta-knowledge, 629, 631 nearest neighbour method, 209, 223
meteorites, 57 nearest neighbours, 213
metric matrix, 171 needle search, 271
metric MDS, 428 network, 621
Michaelis constant, 453 neural network, 82, 149, 209, 213, 225, 233,
Michaelis-Menten equation, 502 378,385,416,649
Michaelis-Menten kinetics, 453 — application of, 680
minimal path, 622 — ART, 649
minimal spanning tree, 73, 74 — back-propagation learning rule, 662, 671
Minkowski distance, 67 — backtracking, 636
mint species, 60 — backward chaining, 634
mixed variables, 67 — hidden layer, 660, 662
mixture design, 444 — hidden unit, 660, 677
mode analysis, 79 — input layer, 662
modelling power, 237 — multi-layer-feed-forward network, 649
molar refractivity, 392 — output layer, 662
molecular field, 410 — performance behaviour of, 674, 677
momentum, 673 — performance criterion, 680
monitoring set, 675 — radial basis function, 649, 670, 681
Moore-Penrose inverse, 38 — validation of, 679
Morlet wavelet, 566 neuron, 650
moving average, 538 NIPALS, 134, 139,332
multicompartment model, 487 NIPALS algorithm, 40, 107
multicomponent analysis, 575 NIR spectra, 213
multicriteria decision making, 426, 605 node, 621
multidimensional scaling, 149, 427 noise, 143, 535
multidimensional space, 16 nominal scale, 161
multinormal, 104 non-compartmental analysis, 493, 500
multiple linear regression, 53 non-hierarchical classification, 58
multiple inheritance, 637 non-hierarchical clustering, 59, 83
multiplicative scatter correction, 373 non-hierarchical methods, 76
multivariate bioassay, 397 non-linear Hansch model, 389

non-linear mapping, 430 parent frame, 637


non-linear model, 502 partial least squares, 232, 331
non-linear PCA, 150 partial least squares analysis, 408
non-linear PLS, 378 partial least squares model, 409
non-linear programming, 609 partial least squares regression, 366
non-linear regression, 390, 486, 491 partition coefficient, 388
non-linear transformation, 149 partitioning-optimization techniques, 78
non-metric MDS, 429 pattern recognition, 397
non-parametric regression, 505 — supervised, 207, 652
non-singular, 27 — unsupervised, 207, 687
norm, 11, 111 peak concentration, 467
normal probability distribution, 213 peak purity, 301
normalization, 12, 138, 167 perceptron, 650
normalized Hamming distance, 66 periodicity, 527
normalized standard error, 674 pharmaceutical identification, 232
null matrix, 19 pharmaceutical preparations, 223
null vector, 9 pharmacodynamics, 450
numerical integration, 494 pharmacokinetic effect, 411
numerical taxonomy, 58 pharmacokinetics, 449
Nyquist frequency, 521, 524 pharmacological activity, 412
pharmacophore, 384, 396
object-attribute-value triplets, 619, 632 phase, 528
object-oriented programming technique, 638 phase spectrum, 529
one-compartment open model, 455,473, 474, phylogenetic application, 157
495 physical dimension, 169
operations research, 4, 605 physicochemical parameter, 393
optimization, 622 physicochemical property, 387
ordinal variable, 66 piecewise direct standardization, 377
orthogonal decomposition, 138 plane of closest fit, 106
orthogonal projection, 55, 295 PLS
orthogonal projection operator, 54 — continuum power, 345
orthogonal rotation, 55, 108, 252 — two-block, 411
orthogonal rotation matrix, 255 PLS 1,233
orthogonal vector, 12 PLS2, 233
orthonormal vector, 14 point-spread function, 531, 532, 554
outlier test, 210, 232 poles of specificity, 404
outlier, 239, 374 polynomial smoothing, 542
output-activity map, 690 pooled variance-covariance matrix, 217, 219,
overtraining, 675 221
positive definite, 30
paired comparison test, 425 positive semi-definite, 30
parabolic model, 389 potential density function, 80
PARAFAC model, 156, 301 potential function, 225, 226
paralysed network, 676 potential method, 80, 81, 225

powering algorithm, 134, 138 queuing theory, 605, 609, 610


power spectrum, 517, 536
precisions, 196 radial basis function network, 649, 670, 681
prediction ability, 238, 239 RAFA, 298
preprocessing, 48, 112, 115, 201 random calibration, 352
PRESS, 368 random noise, 485
principal axis, 104 random variation, 28
principal component, 215, 228 range scaling, 64, 67
principal components analysis, 57, 88, 192, 201, rank, 27, 170, 196,202
384, 398 rank annihilation factor analysis, 298
principal components regression, 329, 358 rate-limited exchange, 453
principal coordinates analysis, 146,428 rational drug design, 385
principal covariates regression, 367 RBL, 300
principal diagonal, 19 real Fourier coefficient, 516
prior probabilities, 221 receptor, 383, 396, 402, 411
probability, a priori, 221 reciprocal averaging, 182
probability density function, 495 recognition ability, 238, 239
Procrustes analysis, 310, 411 reconstructed data matrix, 102
projection, 51, 52, 151 reconstruction, 100
projection to latent structure, 332 recursive multicomponent analysis, 585
pruning, 678 recursive regression, 575, 577
pseudo-deconvolution, 555 reduced rank regression, 324, 367
pseudodiagonals, 145 redundancy analysis, 324
pure variable, 286 reflected discriminant analysis, 236
purest column, 251 regression, 64, 220, 238
purest row, 251 regression line, 106
purity spectrum, 292 regularised discriminant analysis, 223
pyramidal algorithm, 571 relative contribution, 103, 192
Pythagorean distance, 146 repeated intravenous administration, 473
representative objects, 78
QDA, 228 representative samples, 60
Q-mode, 88 repulsion, 128
QR transformation, 136, 140 resampling, 238
quadratic discriminant analysis, 209, 210, 220, reservoir, 472
228 residual data matrix, 102, 136
quadratic programming, 609 residual error, 155
quantitation limit, 460 residual error sum of squares, 145
quantitative descriptive analysis, 431 residual matrix, 35
quantitative structure-activity relationships, 383 residual variance, 142
queue length, 614 resolution, 520, 521,526
queueing response surface, 397
— idle time, 616 response surface methodology, 444
— M/M/1 system, 611 retention time, 116
— utilization factor, 611 ridge regression, 367

right-singular vector, 91 Shannon entropy, 558


R-mode, 88 shift, 528
RMSPE, 368 shortest path problem, 621
robust clusters, 83 sigmoidal transfer function, 655, 662
robust regression, 504 signal domains, 507
rotation, 55 signal, derivative of, 550
rotation angle, 251, 253 signal enhancement, 510, 535, 536, 547
rotation matrix, 254, 286 signal processing, 507, 509, 535
rotation, Varimax, 254 signal restoration, 510, 535
row simplicity, 254 signal-to-noise ratio, 509, 535
row space, 246 SIMCA, 212, 225, 228, 237
row-centered matrix, 43 similarity, 60, 148
row-closure, 168 similarity coefficient, 60
row-eigenvector, 92 similarity index, Gower's, 148
row-mean, 43 similarity matrix, 68
row-orthogonal, 21 simple inheritance, 637
row-orthonormal, 21 simple neural network, 228
row-pattern, 16, 104 Simplex method, 608
row-principal component, 96 Simplex optimization, 384, 397
row-profile, 168 Simplisma, 292
rows-by-columns product, 20 SIMPLS, 340, 367
row-singular vector, 91 SIMULA, 621
row-standardized matrix, 47 simulated annealing, 78
row-sum, 165 simulation, 605, 610
row-variable, 87 simulation model, 619
rule set, 632 single linkage, 69, 70, 71, 72
rule-based scheme, 631 singular value, 186
singular value decomposition, 40, 89, 183
sampling, 500, 524 singular vector decomposition, 202
saturated RC association model, 129, 201 sinusoidal transfer function, 670
scalar multiplication, 9 size component, 119
scalar product, 10, 11,51 slit function, 530
scaling, 64, 67, 529 smoothing, 538, 549
score, 55, 213 — polynomial, 542
score plot, 97, 157 — Savitzky-Golay, 373
Scree-plot, 142 smoothing parameter, 226, 227
selecting clusters, 82 soft independent modelling of class analogy,
selectivity, 237 228
self-organizing map, 687 solubility, 449
semi-axis, 40 solvent, 60, 74, 75
semilogarithmic diagram, 481 specificity, 413
semilogarithmic plot, 457,468, 501 spectral decomposition, 34, 92
sensitivity, 237, 413 spectral map, 129,404
sensory analysis, 421 spectral map analysis, 129, 201, 402

speed of release, 449 Tanimoto similarity, 65


spread, 174 target, 256
stability-plasticity dilemma, 692 target transformation factor analysis, 256
stacked histogram, 168 taxi-distance, 147
standard deviation of the retention time, 498 taxonomy, 157
standard deviation spectrum, 292 test set, 238, 239
standard normal variates, 373 therapeutic activity, 383
standardized Euclidean distance, 61, 62, 64 thermometer plot, 198
state vector, 585 three-dimensional conformation, 416
state-space model, 576, 595 three-distance clustering method, 76
statistical moment, 498 three-factor system, 267
statistical moment theory, 493, 501 three-way analysis, 153
steady-state plasma concentration, 471 three-way data, 153
steady-state solution, 475 three-way data table, 2
steady-state volume of distribution, 496 three-way matrix, 301
stepwise variable selection, 236 threshold logic, 655
steric bulk, 392 threshold step function, 655
steric conformation, 385 time averaging, 538
STERIMOL variable, 392 time of appearance of the maximum, 467
stochastic distribution, 498 time domain, 507, 509, 510, 536
STRESS, 429 time-frequency Fourier transform, 564
stretched coordinate axis, 171 time-intensity curves, 441
structural eigenvector, 143 time-invariant system, 585, 595
structural features, 48 tolerance, 135, 139
structural information, 140 top-down PCR, 364
structure correlation, 321 topology preserving mapping technique, 692
subspace, 9, 192, 136 total least squares, 367
substituent parameter, 398 toxicity testing, 59
substitution constant, 393 trace, 22, 49
sum matrix, 19 trace elements, 89
sum vector, 42 training object, 233
supervised learning, 207, 652 training objects, 207
supervised pattern recognition, 207 trajectory, 153
Swain and Lupton parameter, 391 transfer constant, 451
symmetric function, 511 transfer function, 234, 651, 655
symmetry, 527 transfer function, role of, 665
system equation, 589, 592, 593 transfer of calibration models, 376
system model, 576 transform kernel, 517
system noise, 591 transformation, 43
system transition matrix, 591 transition formula, 100, 185
systematic information, 140 translation, 43, 45, 566
systems equation, 577 transport, 450, 451
triangle relationship, 12
Tanimoto coefficient, 65, 66 triangle test, 421

tridiagonal matrix, 140 wavelength domain, 507


trilinear model, 156 wavelength selection, 589
true factors, 243 wavelet
TTFA, 256 — Daubechies, 566
Tucker3 model, 154 — Haar, 566
two-block PLS, 411 — Morlet, 566
two-compartment catenary model, 461, 469 wavelet basis, 566
two-compartment mammillary model, 476,491 wavelet coefficient, 236
two-component analysis, 586 wavelet filter coefficient, 567
two-factor system, 260 wavelet packet transform, 571
two-layer network, 234 wavelet transform, 566
two-way table, 1,7,88, 153 — coefficient, 566
— discrete, 567
uncertainty, 639 Weber-Fechner's psychophysical law, 386
UNEQ, 210, 212, 228 weight, 652
unfolding, 153 weight coefficient, 131, 173, 201
unipolar axis, 112, 150, 188 weighted cross-products matrix, 186
unit vector, 9 weighted distance, 116, 171
unsupervised learning, 397, 652 weighted Euclidean distance, 61, 62
unsupervised pattern recognition, 207, 687 weighted Euclidean metric, 170
utilization factor, 611 weighted linear regression, 503
weighted mean, 173
validation, 141, 157,207,238 weighted norm, 171
— of neural networks, 679 weighted scalar product, 170
van der Waals radius, 386 weighted sum, 131
VARDIA, 286 weighted sum of squares, 131, 173
variable space, 246 weighted variance, 173
variance diagram, 286, 289 weighting function, 495
variance inflation factor, 374 white noise, 535
variance of the residence time, 497 Wilks' lambda, 237
variance-covariance matrix, 49, 578 wind direction, 89
Varimax rotation, 254 within-class variance, 216
varivector, 255 within-variance, 150
vector, 8
vector addition, 9
xenobiotics, 450
vector product, 25
vector space, 8
vector-by-matrix product, 23 Young-Householder factorization, 38
volume of distribution, 456
z-transform, 64
Ward's method, 71,72 zero filling, 526, 527
