Analytica Chimica Acta, 161 (1984) 135-148

Elsevier Science Publishers B.V., Amsterdam - Printed in The Netherlands

COMPUTER-ASSISTED MULTICOMPONENT SPECTRAL ANALYSIS WITH FUZZY DATA SETS

T. BLAFFERT
Philips GmbH Forschungslaboratorium Hamburg, Vogt-Kölln-Straße 30, D-2000
Hamburg 54 (West Germany)
(Received 20th December 1983)

SUMMARY

A new approach to the interpretation of spectra with “fuzzy sets” is described. A com-
puter program CIF (Compound Identification with Fuzzy sets) is applied. This program is
capable of finding components in a mixture by comparing the sample spectrum with
reference spectra in a library. The applications discussed involve the interpretation of
infrared spectra. The problems of spectral library search are discussed, an elementary
introduction to fuzzy set theory is given, and applications to spectral library search are
demonstrated.

The identification of compounds and the determination of their concentration
in a multicomponent unknown sample is a basic problem in analytical
chemistry. Various wet chemical or spectroscopic identification methods
can be applied to obtain an answer from one experiment or from the com-
bination of various measurements. Experience and expertise are necessary to
correlate the data with a chemical structure. Such work is now frequently
supported by a computer, which can calculate mathematical models or
search reference libraries much more quickly than a human being.
Spectroscopic methods are well adapted to being supported by a com-
puter because the output of the instrument is readily digitized and stored in
computer memory. Many data points have to be reduced to position, height,
width and shape of a few spectral lines, a typical problem of data processing.
After this step, the reduced data are evaluated according to a mathematical
model (as with n.m.r.) or, if this is unsatisfactory, compared with reference
lines in a reference library (e.g., infrared, x-ray diffraction). In this article, a
new approach to the latter method will be given, the identification proce-
dure being treated as a pattern recognition problem.

Compound identification as a pattern recognition problem


The above-mentioned steps in computerized compound identification can
be formulated as a common pattern recognition scheme [1]:

spectral input data → feature extraction → feature vector or set → classification → decision

In feature extraction, the lines of a spectrum (peaks) with their position,
height and width are measured from the input data. In classification, the
procedure is to search in a reference library, compare references and un-
known sample spectra with respect to the features selected, and to reach a
decision on the compound(s) present.
As an example, an infrared spectrum of a mixture of benzyl alcohol,
butan-2-one, carbon tetrachloride and hexane is shown in Fig. 1, where an
automated peak-search program found the positions of the measured peaks.
The spectra of all the components are plotted in Fig. 2. The position scale is
instrument-dependent (wavenumbers for infrared spectra, 2θ for x-ray
diffractometry, time for gas or liquid chromatography) but arbitrary for the
present purpose. In Fig. 1, the peaks found are marked by lines. The posi-
tions and heights of these lines are compared with the lines in reference
spectra in the classification step to establish the sample components.
Because line positions are most relevant, the introduction to fuzzy sets is
described in terms of this feature, but it is also applicable to intensities and
other features.
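The feature-extraction step can be illustrated with a minimal sketch. It assumes a spectrum sampled at discrete positions and treats every local maximum above a threshold as a peak; the function name `find_peaks`, the threshold `min_height` and the sample values are hypothetical and are not taken from the CIF program, whose peak-search algorithm is not described here.

```python
def find_peaks(positions, heights, min_height=0.0):
    """Return (position, height) pairs for local maxima above min_height.

    Illustrative only: a real peak search would also handle noise,
    baselines, shoulders and line widths.
    """
    peaks = []
    for i in range(1, len(heights) - 1):
        rises = heights[i] > heights[i - 1]
        falls = heights[i] >= heights[i + 1]
        if rises and falls and heights[i] > min_height:
            peaks.append((positions[i], heights[i]))
    return peaks


# Hypothetical sampled spectrum (position scale arbitrary, heights in %)
positions = [800, 810, 820, 830, 840, 850, 860]
heights = [2.0, 5.0, 30.0, 6.0, 3.0, 12.0, 4.0]

peaks = find_peaks(positions, heights, min_height=10.0)
print(peaks)
```

The resulting list of positions and heights is the feature set passed on to the classification step.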

Fig. 1. Infrared spectrum of a mixture of benzyl alcohol, butan-2-one, carbon tetrachloride
and hexane, plotted in wavenumbers vs. % absorption. The peaks are located by
an automated peak search program and marked by lines in the diagram.

Fig. 2. Infrared spectra of the components in the sample used for Fig. 1: (A) benzyl alcohol;
(B) butan-2-one; (C) carbon tetrachloride; (D) hexane.

FUZZY SET APPROACH

Comparison of spectra formulated in conventional set theory and fuzzy set theory

After all measured peaks have been located by a peak search, the spectrum
is characterized by a collection of features like position (2θ, wavenumber,
time, etc.), height and width. It is important to treat these features as a set
of features rather than a feature vector because there is no relation between
a spectral line and a certain dimension. In Fig. 3A, the members of the set
are illustrated by single lines. The reference spectrum is also a set of refer-
ence lines (Fig. 3B). The goodness of fit of the sets even in a multicom-
ponent sample must be considered, i.e., the number of lines of the reference
spectrum that are present in the measured set. In conventional set theory,
the answer to this basic problem is given by counting the number of mem-
bers of the sets in the intersection of the two sets; this is the conventional
set power of the intersection. The advantage of the intersection operation
will be fully understood if the comparison of a reference with a multicomponent
sample is considered. A multicomponent set of lines is the union of
the constituent sets, and the intersection operation extracts the lines belonging
to the reference (see below). Unfortunately, the result of a conventional
set operation on the lines in Fig. 3 is unsatisfactory because the true line
positions are distorted by small random variations and therefore the intersect set
(not shown) contains no lines at all.

Fig. 3. Fuzzy set connections. Comparison of measured set of lines (A) and reference set
of lines (B) using conventional intersection. There are no lines in the intersect set.
Membership of the set is indicated by MEMB.
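The failure of the conventional intersection can be reproduced in a few lines. The line positions below are invented for illustration; they are assumed to differ from the reference values only by small random shifts.

```python
# Hypothetical line positions (wavenumbers); the measured values are
# shifted slightly with respect to the reference values.
measured = {1028.4, 1455.1, 1716.9, 2959.7}
reference = {1028.0, 1455.0, 1717.0}

# Conventional (crisp) intersection: only exactly equal positions survive.
common = measured & reference
crisp_power = len(common)

print(crisp_power)  # no line of the reference is found in the measurement
```

Although every reference line has an obvious partner in the measured set, the crisp set power of the intersection is zero.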
These random variations are taken into account in the present search
method by introducing the theory of fuzzy sets. In fuzzy set theory, it is not
only valid to say that a spectrum line does or does not belong to a set but
also to say that a spectrum line belongs to a set to a certain degree. A refer-
ence line which has a slightly different position from a measured line does
not abruptly disappear in the intersection; it is still present, but with a lower
degree of membership. This point is illustrated in Fig. 4, and the exact
mathematical formulations are presented in the next sections.

Fig. 4. Fuzzy set connections. Comparison of a fuzzed set of measured lines (A) and a
reference set of lines (B) using the fuzzy intersection. The intersect set (C) reflects all
lines, but some of them have a degree of membership smaller than 1. The peaks in A
represent position distribution, not intensity distribution.

Elementary fuzzy set theory


A big advantage of the fuzzy set theory, which was introduced by Zadeh
[2], arises from the continued use of conventional concepts and nomenclature
of set theory. Overviews of fuzzy set theory have been given by
Zadeh [3] and Kandel and Byatt [4]; here, only the concepts relevant to a
spectral library search are considered. The fuzzy set operations are illustrated
in Fig. 5.
The range of all spectral values possibly measured by the specific instrument
is denoted by $X$. A fuzzy set $A$ of $X = \{x\}$ is characterized by a
membership function $f_A: X \to [0, 1]$. The value $f_A(x)$ represents the grade of
membership for each point $x$ in the fuzzy set $A$. The membership function
$f_A$ can be defined objectively by probability distributions, but it can also be
used to incorporate personal subjective experience. If $f_A$ assumes only the two
values 0 and 1, it reduces to the characteristic function of conventional set
theory, which is $f_A(x) = 1$ for $x \in A$ and $f_A(x) = 0$ when $x$ is not a member of
set $A$.
Set $A$ is contained in set $B$ if $A \subset B \Leftrightarrow f_A(x) \le f_B(x)$ for all $x \in X$ (Fig. 5A).
The intersection of two fuzzy sets $A$ and $B$ in $X$ is defined by the membership
function of $A \cap B$, given by

$$f_{A \cap B}(x) = \min[f_A(x), f_B(x)] \tag{1}$$

as shown in Fig. 5(B). Similarly, the union $A \cup B$ is defined by

$$f_{A \cup B}(x) = \max[f_A(x), f_B(x)] \tag{2}$$

as shown in Fig. 5(C). The power of a fuzzy set $A$, for a finite population $X$,
is

$$N_A = \sum_{x \in X} f_A(x) \tag{3}$$

(see Fig. 5D). The last four definitions are consistent with conventional set
theory if $f$ is limited to the dual set $\{0; 1\}$.

Fig. 5. Fuzzy set connections: illustration of fuzzy set operations.

An important relationship exists between the containment and the intersection:
if a (reference) set $A$ is contained in a (measured unknown) set $B$,
then the equation $A \cap B = A$ holds, and the fuzzy set power of the intersection
set is equal to the power of $A$. Thus the fuzzy set power of an
intersection is a measure of containment of a (reference) set $A$ within the
(measured) set $B$. The containment measure can be normalized by
$C_{AB} = N_{A \cap B}/N_A$.
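The four definitions (containment, intersection, union and power) can be written out directly for a finite population. The sketch below assumes that a fuzzy set is stored as a dictionary mapping each point of X to its grade of membership; the function names and the example sets are illustrative only.

```python
def f_intersection(fa, fb):
    """Membership function of the fuzzy intersection: pointwise minimum (Eqn. 1)."""
    return {x: min(fa[x], fb[x]) for x in fa}


def f_union(fa, fb):
    """Membership function of the fuzzy union: pointwise maximum (Eqn. 2)."""
    return {x: max(fa[x], fb[x]) for x in fa}


def power(fa):
    """Fuzzy set power: sum of all grades of membership (Eqn. 3)."""
    return sum(fa.values())


def is_contained(fa, fb):
    """A is contained in B if f_A(x) <= f_B(x) for all x."""
    return all(fa[x] <= fb[x] for x in fa)


def containment(fa, fb):
    """Normalized containment measure C_AB = N_(A and B) / N_A."""
    return power(f_intersection(fa, fb)) / power(fa)


# Two example fuzzy sets on the finite population X = {1, 2, 3, 4}
A = {1: 0.0, 2: 0.5, 3: 1.0, 4: 0.0}
B = {1: 0.2, 2: 0.8, 3: 1.0, 4: 0.4}

print(is_contained(A, B), containment(A, B), containment(B, A))
```

Because A is contained in B, the intersection reproduces A and the containment measure of A in B is 1; the reverse measure is smaller than 1.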

Fuzzing of the spectrum


The peak-search step enables the $x$ values (line positions) which are contained
in the measurement to be identified. This usually produces a list of
$n$ peaks, and the spectrum can be described as a union $S$ of single lines $S_j$ by

$$S = \bigcup_{j=1}^{n} S_j \tag{4}$$

with $f_{S_j}(x) = 1$ if $x$ is the location of the $j$th peak but $f_{S_j}(x) = 0$ otherwise.
So far, $S$ is a conventional set (Fig. 3A). Of course, a measured spectrum
is not exact because random errors introduce uncertainty about which
features are really related to the measured values; not only a spectral line
with the exact measured value, but also lines in its neighbourhood may in
fact be the same. One solution is to define a “hard window”; this is an
interval of $m$ points around a measured line. All lines in this interval may
then belong to the spectrum. This can be expressed by $S = \bigcup_{j=1}^{n} S_j$, with
$f_{S_j}(x) = 1$ if $x \in [x_j - m, x_j + m]$, $x_j$ being the location of the $j$th peak, but
$f_{S_j}(x) = 0$ otherwise. Figure 6 demonstrates the hard window approach.
Various search programs use this hard window. The disadvantage in this
concept is that a small variation around $x_j - m$ or $x_j + m$ can make a large
difference to belonging or not belonging to a set. This disadvantage can be
avoided by using a continuous grade of membership between 0 and 1. A
reference line with exactly the same $x$ as the measured line has a membership
value of 1. The membership value diminishes with increasing distance
from the measured line, and if the distance is very large, the membership
value becomes zero. The shape of the membership function might be Gaussian
or Lorentzian, but it is not necessary to choose statistical functions.
The degree of membership expresses at least a subjective estimate of uncertainty,
which is established by the experience of the operator. Nevertheless, a
Gaussian distribution could be a good choice when the uncertainty derives
from a randomly distributed measurement error.

Fig. 6. Fuzzy set connections: comparison of spectra using the hard window approach.
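The abrupt behaviour of the hard window is easy to demonstrate. In the sketch below the window half-width `m`, the peak position and the two test positions are hypothetical values chosen only to sit just inside and just outside the window.

```python
def hard_window_membership(x, peak_positions, m):
    """Return 1.0 if x lies within m of any measured peak, otherwise 0.0."""
    return 1.0 if any(abs(x - xj) <= m for xj in peak_positions) else 0.0


peaks = [1000.0]   # one measured line (hypothetical position)
m = 5.0            # window half-width (hypothetical value)

inside = hard_window_membership(1004.9, peaks, m)
outside = hard_window_membership(1005.1, peaks, m)

print(inside, outside)
```

A difference of 0.2 units in the position of a reference line switches the membership from full to none, which is the discontinuity that a continuous grade of membership avoids.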
The procedure of assigning a broad membership function to a sharply
defined feature (i.e., a line of a spectrum) is called fuzzing. The fuzzed
spectrum is the fuzzy union of the fuzzed lines

$$\tilde{S} = \bigcup_{j=1}^{n} \tilde{S}_j \tag{5}$$

with $f_{\tilde{S}_j}(x) = \exp[-(x_j - x)^2/2\sigma^2]$ if $x \in [x_j - \alpha\sigma, x_j + \alpha\sigma]$, $x_j$ being the location
of the $j$th peak; otherwise $f_{\tilde{S}_j}(x) = 0$. This is equivalent to

$$f_{\tilde{S}}(x) = \max_j [f_{\tilde{S}_j}(x)] \tag{6}$$

Figure 4A shows a spectrum fuzzed according to Eqn. 5. The Gaussian distributions
in the fuzzed spectrum must be distinguished from a line profile;
they do not represent an intensity distribution but a position-dependent
error distribution.
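Fuzzing with a truncated Gaussian membership function and the max union can be sketched as follows. The values chosen for σ and for the cut-off factor α (here `sigma=2.0` and `alpha=3.0`) and the peak positions are assumptions made for the example; suitable values are not specified in the text.

```python
import math


def fuzzed_line(x, xj, sigma, alpha=3.0):
    """Membership of position x in the fuzzed line located at xj.

    Gaussian inside the interval [xj - alpha*sigma, xj + alpha*sigma],
    zero outside it.
    """
    if abs(x - xj) > alpha * sigma:
        return 0.0
    return math.exp(-((xj - x) ** 2) / (2.0 * sigma ** 2))


def fuzzed_spectrum(x, peak_positions, sigma, alpha=3.0):
    """Membership of x in the fuzzed spectrum: maximum over all fuzzed lines."""
    return max(
        (fuzzed_line(x, xj, sigma, alpha) for xj in peak_positions),
        default=0.0,
    )


peaks = [1000.0, 1010.0]   # hypothetical measured line positions
sigma = 2.0                # hypothetical position uncertainty

for x in (1000.0, 1002.0, 1100.0):
    print(x, fuzzed_spectrum(x, peaks, sigma))
```

A position that coincides with a measured line receives the membership 1, a nearby position a value between 0 and 1, and a distant position the value 0.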

Fuzzy sets compared with linear operators


Superficially, the fuzziness related to Eqn. 6 seems to be similar to a convolution
of an initial δ-signal with a Gaussian convolution kernel. However,
there are mathematical and semantic differences between the two operations.
The convolution

$$\hat{f}(x) = \sum_j f(x - j)\, h(j) \tag{7}$$

with kernel $h$ is a linear operation, whereas the fuzzy min/max operations are nonlinear.
A convolution, as shown in Fig. 7, yields functional values larger than 1,
resulting from contributions of more than one initial spectral line. In contrast,
the “max” union operator implies a decision on which spectral line
contributes to a membership value.

Fig. 7. Fuzzy set connections: (A) spectral lines; (B) convoluted set.
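The difference between the two operations shows up as soon as two lines lie close together. The sketch below evaluates both at a point midway between two hypothetical lines; convolving δ-lines with a Gaussian kernel amounts to summing the kernel values, whereas the fuzzy union takes their maximum. The positions and σ are invented for the example.

```python
import math


def gauss(d, sigma):
    """Unnormalized Gaussian with value 1 at distance d = 0."""
    return math.exp(-(d ** 2) / (2.0 * sigma ** 2))


lines = [1000.0, 1001.0]   # two closely spaced lines (hypothetical)
sigma = 2.0
x = 1000.5                 # point midway between the two lines

# Convolution of delta-lines with the kernel: contributions add up.
conv_value = sum(gauss(x - xj, sigma) for xj in lines)

# Fuzzy union: only the nearest line decides the membership value.
union_value = max(gauss(x - xj, sigma) for xj in lines)

print(conv_value, union_value)
```

The summed value exceeds 1 and therefore cannot be read as a grade of membership, while the max value stays within [0, 1].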
Another method of determining the similarity between sample and reference
spectra is to calculate the variance

$$s^2 = \sum_j (x_j - x_{jR})^2/(n - 1) \tag{8}$$

Pairs of related lines $x_j$, $x_{jR}$ are needed, which have to be selected by additional
rules. The problem is that the $j$th reference line does not necessarily
correspond to the $j$th line in the spectrum because missing lines in the spectrum,
extra lines from additional components, unresolved doublets, etc.,
distort the order of lines in a list. For this reason, sets are compared rather than vectors.

Comparison by fuzzy intersection and fuzzy power


Fuzzy intersection, as defined by Eqn. 1, provides a formulation of the
comparison step between fuzzed measurement and reference, where the
latter is an unfuzzed set denoted by

$$R^k = \bigcup_{j=1}^{n_k} R_j^k \tag{9}$$

with $f_{R_j^k}(x) = 1$ if $x$ is the location of the $j$th line of reference $k$, but is otherwise
equal to zero. Here, $k$ is a reference number in the reference library, and
$n_k$ is the number of lines in reference $k$. The application of the intersection
operator leads to a result set

$$\tilde{S} \cap R^k = \tilde{S} \cap \Big( \bigcup_{j=1}^{n_k} R_j^k \Big) = \bigcup_{j=1}^{n_k} (\tilde{S} \cap R_j^k) \tag{10}$$

with $f_{\tilde{S} \cap R^k}(x) = \min[f_{\tilde{S}}(x), f_{R^k}(x)]$, which equals $f_{\tilde{S}}(x)$ if $x$ is the location
of the $j$th line of reference $k$ but is otherwise equal to zero.
The remaining problem is to find a score value which gives the confidence
limit between the intersection set and the intrinsic reference set. This value is
given by the fuzzy set power (Eqn. 3). It must be assumed that the support
(spectrum range) consists only of a finite number of elements, but this is always true in
computerized spectral analyses. In the special case of intersection (Eqn. 10),
this power is

$$N_{\tilde{S} \cap R^k} = \sum_{j=1}^{n_k} f_{\tilde{S}}(x_j^k) \tag{11}$$

where $x_j^k$ is the location of the $j$th line of reference $k$.

This power or, alternatively, the containment measure ($C_{AB} = N_{A \cap B}/N_A$)
is the final number representing the goodness of fit between a reference and
a measured spectrum. Sorting after all power computations gives a list of
best-fitting references.
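The comparison and scoring steps can be summarized in a short sketch: the measured lines are fuzzed, the membership of each reference line in the fuzzed spectrum is summed to give the fuzzy power, and the references are sorted by that power. The library contents, reference names, σ and α below are invented for illustration and do not reproduce the CIF implementation.

```python
import math


def fuzzed_membership(x, measured_peaks, sigma, alpha=3.0):
    """Grade of membership of position x in the fuzzed measured spectrum."""
    best = 0.0
    for xj in measured_peaks:
        d = abs(x - xj)
        if d <= alpha * sigma:
            best = max(best, math.exp(-(d ** 2) / (2.0 * sigma ** 2)))
    return best


def fuzzy_power(measured_peaks, reference_lines, sigma):
    """Power of the intersection of the fuzzed spectrum with one reference."""
    return sum(fuzzed_membership(xr, measured_peaks, sigma) for xr in reference_lines)


def rank_library(measured_peaks, library, sigma):
    """Return (name, power, containment) tuples sorted by decreasing power."""
    scores = []
    for name, lines in library.items():
        n = fuzzy_power(measured_peaks, lines, sigma)
        scores.append((name, n, n / len(lines)))
    return sorted(scores, key=lambda s: s[1], reverse=True)


measured = [1000.0, 1200.0, 1500.0]          # hypothetical peak table
library = {                                  # hypothetical reference lines
    "ref_a": [1001.0, 1200.0],
    "ref_b": [1000.0, 1350.0],
    "ref_c": [900.0, 1350.0],
}

ranking = rank_library(measured, library, sigma=2.0)
for name, n, c in ranking:
    print(name, round(n, 3), round(c, 3))
```

Since every line of an unfuzzed reference has the membership 1, the power of the reference set equals its number of lines, which is why the containment measure is obtained here by dividing by `len(lines)`.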

Spectra obtained from multicomponent samples


Clearly the described method can be used if the spectrum is obtained from
a mixture of different compounds (Fig. 8). The method indicates which
sample and reference lines are related. Constituent compounds are not lost
in this comparison, as can happen with an algorithm in which identified lines
are removed from the peak table. The only drawback is that extra compounds
may be found whose line combinations arise from the mixing process. They can be eliminated
by additional use of intensity information (see below).
Lines which are overlapped by a strong line from a different component
or hidden in the background noise, can also be examined with the fuzzy set
procedure. Missing lines will decrease the set power by 1 and lower the posi-
tion of a correct reference pattern in the score list. Of course this gives a
worse result, but the correct pattern is still in the list and has not totally
disappeared.
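Both properties (no loss of constituents that share a line, and a graceful penalty for a missing line) can be checked with a small example. The two components, their shared line at 1200 and the parameter values are hypothetical.

```python
import math


def membership(x, peaks, sigma=2.0, alpha=3.0):
    """Grade of membership of x in the fuzzed set of measured peaks."""
    values = [
        math.exp(-((x - xj) ** 2) / (2.0 * sigma ** 2))
        for xj in peaks
        if abs(x - xj) <= alpha * sigma
    ]
    return max(values, default=0.0)


def power(peaks, reference, sigma=2.0):
    """Fuzzy power of the intersection of the fuzzed peaks with a reference."""
    return sum(membership(xr, peaks, sigma) for xr in reference)


comp_a = [1000.0, 1200.0, 1500.0]   # hypothetical component A
comp_b = [1100.0, 1200.0, 1700.0]   # hypothetical component B, shares 1200

# The mixture spectrum is the union of the constituent line sets.
mixture = sorted(set(comp_a) | set(comp_b))

power_a = power(mixture, comp_a)
power_b = power(mixture, comp_b)

# Same mixture, but the 1700 line of component B is hidden or missing.
mixture_missing = [x for x in mixture if x != 1700.0]
power_b_missing = power(mixture_missing, comp_b)

print(power_a, power_b, power_b_missing)
```

No line is removed from the peak table after a match, so the shared line counts for both components; the missing line lowers the power of component B by 1 without removing it from the score list.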

Fig. 8. Comparison of reference spectra with a two-component unknown sample: (A) the
measured spectrum; (B) and (C) two reference sets after a fuzzy intersection.

Extension of the fuzzy set concept to line intensities


The discussion of spectral interpretation with fuzzy sets has so far been
limited to line positions. A mathematical extension of this concept to line
intensities or even other (new) features is easy. The physical and instrumental
influences should be considered and included in the construction of the
membership function; only carefully selected parameters guarantee a good
discrimination between reference candidates, and therefore an accurate result
from the identification procedure.
As an example, Fig. 9 shows the match between a spectrum and a refer-
ence on positions and line intensities. The one-dimensional axis of peak
positions is replaced by a two-dimensional plane, where each point denotes
a peak with a certain position and a certain intensity. A well defined line,
like a reference line, is marked by a single stroke (Fig. 9B); this means that
this line element is contained in the reference set. The measured lines are,
analogously to the one-dimensional case, fuzzed in the x- and i-directions
(Fig. 9A). Random variations in the intensity determine the breadth of
the i-fuzzing. All one-dimensional fuzzy relationships still hold, except that
the variable x must be replaced by a vector x = (x, i).
Including intensities in the compound identification gives rise to the
problem of concentration. This means that line intensities, which are given
relative to the most intense line in a reference spectrum, are shifted simultaneously
towards lower values in a multicomponent sample as a result of
lower concentration. For example, a 100% line of a reference can occur as a
50% line in a measured spectrum. Such systematic errors can be corrected by
transforming the relative intensities to a logarithmic scale and shifting all
intensity values by the same amount until the maximum fuzzy set power is
obtained.
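The logarithmic shift can be sketched as a one-parameter search. The example below assumes a two-dimensional Gaussian membership function in position and log10 intensity, without a cut-off, and a coarse grid of trial shifts; the widths `sigma_x` and `sigma_i`, the grid and the line data are all assumptions made for illustration.

```python
import math


def membership_2d(x, i, measured, sigma_x, sigma_i):
    """Membership of the point (x, i) in the set of measured lines fuzzed
    in position x and in log10 intensity i."""
    best = 0.0
    for xj, ij in measured:
        exponent = ((x - xj) ** 2) / (2.0 * sigma_x ** 2) \
            + ((i - ij) ** 2) / (2.0 * sigma_i ** 2)
        best = max(best, math.exp(-exponent))
    return best


def power_2d(measured, reference, shift, sigma_x=2.0, sigma_i=0.1):
    """Fuzzy power after shifting all reference log-intensities by `shift`."""
    return sum(
        membership_2d(x, i + shift, measured, sigma_x, sigma_i)
        for x, i in reference
    )


def best_shift(measured, reference, shifts):
    """Trial shift giving the maximum fuzzy set power."""
    return max(shifts, key=lambda s: power_2d(measured, reference, s))


def to_log(lines):
    """Convert (position, intensity in %) pairs to (position, log10 intensity)."""
    return [(x, math.log10(i)) for x, i in lines]


reference = to_log([(1000.0, 100.0), (1200.0, 50.0)])   # relative intensities
measured = to_log([(1000.0, 50.0), (1200.0, 25.0)])     # same lines at half strength

shifts = [k / 10 for k in range(-5, 1)]                  # -0.5 ... 0.0
shift = best_shift(measured, reference, shifts)
print(shift, 10 ** shift)
```

On the logarithmic scale a common concentration factor becomes a common additive shift, so a single parameter aligns all lines of a reference at once; the antilog of the best shift gives a rough estimate of the relative concentration.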

Fig. 9. Line comparison with fuzzy sets including line intensities. A fuzzed set (A)
intersected by a reference (B) yields set (C).

COMPONENT IDENTIFICATION WITH FUZZY SETS

The component identification selects those reference spectra from the
score list that are actually in the sample. This is done by combining the
reference spectra which score highest in the match according to their pre-
dicted concentrations and comparing the result with the original sample
spectrum. In this step, intensity variations related to overlapping lines from
different compounds are evaluated. All references in the combination with
the highest power are considered as identified.
The identification of components can also be modelled with fuzzy set
concepts. The reference spectra are combined by unifying them (Eqn. 2);
the united set is then fuzzed and intersected with the unfuzzed measured
spectrum, and the fuzzy power is computed (Fig. 10).
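The identification step can be sketched in a simplified form that uses line positions only. Candidate references are united, the united set is fuzzed, and its intersection with the unfuzzed measured lines is scored by the fuzzy power. The fixed combination size, the library and the parameter values are assumptions; the weighting by predicted concentrations described above is omitted.

```python
import math
from itertools import combinations


def membership(x, lines, sigma=2.0, alpha=3.0):
    """Grade of membership of x in the fuzzed set of lines."""
    values = [
        math.exp(-((x - xj) ** 2) / (2.0 * sigma ** 2))
        for xj in lines
        if abs(x - xj) <= alpha * sigma
    ]
    return max(values, default=0.0)


def combination_power(measured, references, sigma=2.0):
    """Unite the references, fuzz the union, intersect with the unfuzzed
    measured lines and return the fuzzy power of the intersection."""
    united = sorted(set().union(*references))
    return sum(membership(x, united, sigma) for x in measured)


def best_combination(measured, library, size, sigma=2.0):
    """Combination of `size` references with the highest fuzzy power."""
    return max(
        combinations(sorted(library), size),
        key=lambda names: combination_power(
            measured, [library[n] for n in names], sigma
        ),
    )


library = {                       # hypothetical top-scoring references
    "a": [1000.0, 1200.0],
    "b": [1100.0, 1700.0],
    "c": [1000.0, 1350.0],
}
measured = [1000.0, 1100.0, 1200.0, 1700.0]

best = best_combination(measured, library, size=2)
print(best, combination_power(measured, [library[n] for n in best]))
```

Note that the roles are reversed with respect to the search step: here the combined references are fuzzed and the measured lines are sharp, so the power counts how many measured lines the combination explains.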
Table 1 shows the list of compounds found in the library search on the
sample used for Fig. 1, and scored in the fuzzy match. The score indicates
the number of fitting lines, and the I (%) value gives an estimate of the relative
concentrations. The compounds belonging to the four best possible combinations
are marked with a, b, c, d; all compounds with an “a” belong to the
best combination, all compounds with a “b” to the second best, etc. A list
with the combinations printed together is presented in Table 2.

Fig. 10. Comparison of the measured set (D) with the fuzzed combination (C) of reference
spectra (A, B) by a fuzzy intersection (E).

TABLE 1

Score table of the fuzzy match and the component identification [CIF Version 2.1 (IR)]
for the sample mixture of Fig. 1. All 4 components are correctly identified in the best
combination (a). In the remaining combinations (b, c, d), the hexane is replaced by
another similar compound

No.   Score   Comb.   I (%)   Chemical name           Reference no.

 1     9.8    abcd      36    Benzyl alcohol          0022
 2     5.1    a         72    Hexane                  0067
 3     4.8    abcd      91    Butan-2-one             1701
 4     4.8              44    Toluene                 0019
 5     4.7              76    Pentane                 0208
 6     4.7              91    2-Methylbutane          0003
 7     4.7              91    2-Methylbutane (a)      0772
 8     4.6    b         83    Nujol                   0781
 9     4.5              91    Nonane                  0070
10     4.4    d         91    Octane                  0069
11     4.2    c         83    Heptane                 0068
12     3.9              55    2-Methyl-2-butene (a)   0215
13     3.7              58    2-Methyl-2-butene       0044
14     3.5              97    o-Xylene                1755
15     2.8              50    ABS                     0656
16     2.0    abcd      95    Carbon tetrachloride    1764
17     1.6              95    Hexadecane              0075
18     1.4             100    2,2-Dimethylbutane      0035
19     1.4              63    2-Methyl-1-butene       0214
20     1.1              44    Benzyl benzoate         1233

(a) Spectrum recorded with higher resolution.

CONCLUSION

In x-ray diffractometry, several search programs are available for the
JCPDS reference library [5-9]. The present work was based on the
SANDMAN search/match/identify program, which uses variable scoring
rather than fixed window scoring [5, 10]. The parts of SANDMAN particular
to x-ray diffractometry were incorporated into the new CIF program,
where expression in terms of fuzzy set formulation makes adjustment of
parameters associated with the search/match straightforward. This increased
the reliability of CIF.
The CIF program, initially applied to x-ray powder diffraction, was
adapted to infrared spectroscopy. Changes had to be made for the conversion
of the peak positions and intensities (wavenumbers and % transmittance)
to an internally consistent format, and to produce the output in the
appropriate scales.

TABLE 2

List of best combinations [CIF Version 2.1 (IR)] for the sample mixture of Fig. 1;
compounds belonging to a combination are printed together

No.   Score   Comb.   I (%)   Chemical name           Reference no.

Score of combination a = 25.5
 1     9.8    abcd      36    Benzyl alcohol          0022
 2     5.1    a         72    Hexane                  0067
 3     4.8    abcd      91    Butan-2-one             1701
16     2.0    abcd      95    Carbon tetrachloride    1764

Score of combination b = 25.3
 1     9.8    abcd      36    Benzyl alcohol          0022
 3     4.8    abcd      91    Butan-2-one             1701
 8     4.6    b         83    Nujol                   0781
16     2.0    abcd      95    Carbon tetrachloride    1764

Score of combination c = 24.4
 1     9.8    abcd      36    Benzyl alcohol          0022
 3     4.8    abcd      91    Butan-2-one             1701
11     4.2    c         83    Heptane                 0068
16     2.0    abcd      95    Carbon tetrachloride    1764

Score of combination d = 24.2
 1     9.8    abcd      36    Benzyl alcohol          0022
 3     4.8    abcd      91    Butan-2-one             1701
10     4.4    d         91    Octane                  0069
16     2.0    abcd      95    Carbon tetrachloride    1764

The fuzzy set theory offers a new approach to spectrum library search
problems. Fuzzy sets with a two-dimensional support can be used to describe
membership of spectrum lines including positions and heights in the match
and identify steps. The method was implemented in the CIF program, which
is capable of resolving multicomponent samples in infrared spectroscopy as
well as x-ray powder diffractometry. The usage of fuzzy set theory does not
give the answer to problems caused by experimental errors, but supports the
integration of chemical and physical knowledge into computerized inter-
pretation of spectra in an illustrative manner.

The author thanks W. J. Dallas for helpful discussions and P. Smit for
his advice on crystallographic problems and the preparation of samples for
the program tests. The author also appreciates the work and helpful comments
of R. Jenkins and W. N. Schreiner, which provided a basis for the CIF
implementation with the SANDMAN search/match/identify program. This
work was sponsored by the German Federal Ministry for Research and Technology,
grant number 08 IT 15227.

REFERENCES

1 K. S. Fu (Ed.), Digital Pattern Recognition, Springer-Verlag, New York, 1980.
2 L. A. Zadeh, Inf. Control, 8 (1965) 338.
3 L. A. Zadeh, Fuzzy Sets and Their Applications to Cognitive and Decision Processes,
Academic Press, New York, 1975.
4 A. Kandel and W. J. Byatt, Proc. IEEE, 66 (1978) 1619.
5 W. N. Schreiner, C. Surdukowski and R. Jenkins, J. Appl. Cryst., 15 (1982) 513.
6 G. G. Johnson and V. Vand, Ind. Eng. Chem., 59 (1967) 19.
7 L. K. Frevel, C. E. Adams and L. R. Ruhberg, J. Appl. Cryst., 9 (1975) 199.
8 M. C. Nichols and R. C. Basinger, American Crystallographic Meeting, Berkeley, CA,
1974.
9 L. Tian-Hui, Z. Sai-Zhu, C. Li-Jun and C. Xin-Xing, J. Appl. Cryst., 16 (1983) 150.
10 W. N. Schreiner, C. Surdukowski and R. Jenkins, J. Appl. Cryst., 15 (1982) 524.
