procedures for the selection of variables under univariate regression models. The
paper of J. L. Schmidhammer illustrates the use of finite intersection tests
proposed by Krishnaiah for the selection of variables under univariate and
multivariate regression models. A. K. Jain and B. Chandrasekaran discuss the role
which the relationship between the number of measurements and the number of
training patterns plays at various stages in the design of pattern classifiers and
mention the guidelines provided by research to date. In some situations, the
discriminating ability of various procedures for discrimination between the popu-
lations may actually decrease as the number of variables increases. Apart from
this, it is also advantageous on cost and computational grounds to deal with a
small number of important variables. Motivated by these considerations, the
paper by W. Schaafsma and the second paper by P. R. Krishnaiah deal with
methods of selection of important variables in the area of discriminant analysis.
Krishnaiah first reviews certain methods of the selection of the original variables
for discrimination between several multivariate populations. Then he discusses
various methods of selecting a small number of important discriminant functions.
The models, examples, applications, and references from diverse sources con-
tained in these articles by statisticians, engineers, computer scientists, and scien-
tists from other disciplines, should make this volume a valuable aid to all those
interested in classification and the analysis of data and pattern structure in the
presence of uncertainty.
We wish to thank Professors T. Cover, S. Das Gupta, K. S. Fu, J. A. Hartigan,
V. Kovalevsky, C. R. Rao, B. K. Sinha and J. Van Ryzin for serving on the
editorial board. Thanks are also due to Professors J. Bailey, R. Banerji,
B. Chandrasekaran, S. K. Chatterjee, R. A. Cole, A. K. Jain, K. G. Jöreskog,
J. Lemmer, S. Levinson, J. M. S. Prewitt, A. Rudnicky, J. van Ness, and
M. Wish for reviewing various papers. We are grateful to the authors and
North-Holland Publishing Company for their excellent cooperation in bringing
out this volume.
P. R. Krishnaiah
L. N. Kanal
Table of Contents
Preface v
Table of Contents xi
Contributors xxi
1. Introduction 1
2. Time domain classification methods 5
3. Discriminant analysis in the frequency domain 11
4. Statistical characterization of patterns 26
5. An application to seismic discrimination 33
6. Discussion 42
Acknowledgment 43
References 43
1. Introduction 47
2. The univariate case 49
3. Multivariate case: Σ known 54
4. Multivariate case: Σ unknown 56
5. Multivariate case: μ1 and μ2 known 58
References 60
Introduction 61
Statistics of classification into one of two multivariate normal
populations with a common covariance matrix 62
1. Introduction 101
2. Bayesian allocation 101
3. Multivariate normal allocation 106
4. Bayesian separation 109
5. Allocatory-separatory compromises 111
6. Semi-Bayesian multivariate normal applications 112
7. Semi-Bayesian sample reuse selection and allocation 118
8. Other areas 119
References 120
1. Introduction 121
2. Preliminaries 122
3. Classification into one of two growth curves 123
4. Bayesian classification of growth curves 125
5. Arbitrary p.d. Σ 125
6. Rao's simple structure 132
References 136
1. Introduction 139
2. A procedure for partial and forced classification based on ranks of discriminant scores 144
3. Robust discriminant functions 153
4. Nonparametric discriminant functions 159
References 167
1. Introduction 169
2. Logistic discrimination: Two groups 170
3. Maximum likelihood estimation 175
4. An example: The preoperative prediction of postoperative deep vein thrombosis 180
5. Developments of logistic discrimination: Extensions 182
6. Logistic discrimination: Three or more groups 187
7. Discussion: Recent work 189
References 191
References 196
1. Introduction 199
2. Classification approach 201
3. Mixture approach 202
4. Efficiency of the mixture approach 204
5. Unequal covariance matrices 205
6. Unknown number of subpopulations 206
7. Partial classification of sample 206
References 207
Ch. 10. Graphical Techniques for Multivariate Data and for Clustering 209
J. M. Chambers and B. Kleiner
1. Introduction 267
2. Notation and definitions 268
3. Algorithms 270
Acknowledgment 282
References 282
1. Introduction 347
2. Intrinsic dimensionality for representation 348
3. Intrinsic dimensionality for classification 353
References 359
1. Introduction 361
2. Syntactic pattern recognition 362
3. Artificial intelligence 371
4. Relaxation 379
Acknowledgment 381
References 381
1. Introduction 383
2. Pixel based models 383
3. Region based models 393
4. Discussion 394
Acknowledgment 395
References 395
1. Introduction 399
2. Review of the literature on texture models 400
1. Introduction 417
2. Review of stochastic languages 417
3. Application to communication and coding 423
4. Application to syntactic pattern recognition 427
5. Application to error-correcting parsing 430
6. Stochastic tree grammars and languages 433
7. Application of stochastic tree grammars to texture modelling 441
8. Conclusions and remarks 446
References 447
0. Introduction 451
1. Representations and interpretations 451
2. Laws and uses of similarity 460
3. Conclusion 475
References 476
0. Introduction 479
1. Requirements for a class of decision rules 480
2. Class of logical decision rules 483
3. Method of predicting object's perspectiveness 486
4. Algorithm of predicting the value of quantitative feature 487
5. Automatic grouping of objects 488
6. Method of dynamic prediction 490
References 491
1. Introduction 501
2. Electrocardiology 502
3. Detection 505
4. Typification 513
5. Boundary recognition 517
6. Feature selection and classification 520
7. Data reduction 523
8. Discussion 524
References 524
1. Introduction 527
2. Models for waveform analysis: SDL and FDL 529
3. The HEARSAY speech understanding system 535
4. Analysis of medical waveforms using WAPSYS 537
5. Concluding discussion 546
References 548
1. Introduction 549
2. Acoustic processors 551
3. Linguistic decoder 551
4. Markov source modeling of speech processes 552
5. Viterbi linguistic decoding 558
6. Stack linguistic decoding 560
7. Automatic estimation of Markov source parameters from data 562
8. Parameter estimation from insufficient data 564
9. A measure of difficulty for finite state recognition tasks 569
10. Experimental results 570
Acknowledgment 572
References 573
1. Introduction 575
2. A radar as an information-gathering device 575
3. Signature 576
4. Coherence 577
5. Polarization 577
6. Frequency diversity 577
7. Pulse sequences 578
8. Decisions and decision errors 579
9. Algorithm implementation 579
10. Classifier design 580
1. Introduction 595
2. Experiments on the automation of the WBCD 596
3. Developments in the commercial field 603
4. Conclusions 606
References 607
Ch. 28. Pattern Recognition Techniques for Remote Sensing Applications 609
P. H. Swain
1. Introduction 621
2. OCR problem characterization 622
3. Applications 623
4. Transducers 628
5. Character acquisition 631
6. Character classification 634
7. Context 639
8. Error/reject rates 643
Acknowledgment 647
Bibliography 647
1. Introduction 651
2. Methods for oil data analysis 652
3. Computational models for oil identification 663
4. Summary of oil identification research 668
References 669
1. Introduction 673
2. Formulation of chemical problems in terms of pattern
recognition 675
3. Historical development of pattern recognition in chemistry 677
4. Types of chemical data and useful preprocessing methods 677
5. Pattern recognition methods used 682
6. Some selected chemical applications 685
7. Problems of current concern 689
8. Present research directions 693
9. Conclusions and prognosis 694
References 695
Ch. 32. Covariance Matrix Representation and Object-Predicate
Symmetry 699
T. Kaminuma, S. Tomita and S. Watanabe
1. Introduction 721
2. Variation in a single sample 724
3. Homogeneity and heterogeneity of covariance matrices 726
4. Size and shape 728
5. Significance tests in morphometrics 729
6. Comparing two or more groups 730
7. Morphometrics and ecology 738
8. Growth-free canonical variates 738
9. Applications in taxonomy 743
References 743
1. Introduction 747
2. Moment structure models: A review 751
3. A simple general model 757
4. Parameter identification 760
Ch. 35. Use of Distance Measures, Information Measures and Error Bounds in
Feature Evaluation 773
M. Ben-Bassat
1. Introduction 793
2. The monotonicity of the Bayes risk 796
3. The arbitrary relation between probability of error and measurement subset 800
References 803
1. Introduction 805
2. Preliminaries 806
3. Forward selection procedure 806
4. Stepwise regression 809
5. Backward elimination procedure 811
6. Overall F test and methods based on all possible regressions 814
7. Finite intersection tests 817
References 819
1. Introduction 821
2. The multivariate F distribution 821
3. The finite intersection test: a simultaneous procedure in the univariate case 823
4. The finite intersection test: a simultaneous procedure in the multivariate case 826
5. A univariate example 828
6. A multivariate example 830
References 833
1. Introduction 835
2. Classification performance 836
3. K-nearest neighbor procedures 849
4. Error estimation 850
5. Conclusions 851
References 852
1. Introduction 857
2. Illustrating the phenomenon when dealing with Aim 1 in the case k = 2 860
3. One particular rule for selecting variables 864
4. Dealing with Aim 3 in the case k = 2, m0 = 1 868
5. Dealing with Aim 4 in the case k = 2, m0 = 1 872
6. Incorporating a selection of variables technique when dealing with Aim 3 or Aim 4 in the case k = 2, m0 = 1 875
7. Concluding remarks and acknowledgment 877
Appendix A 878
References 881
1. Introduction 883
2. Tests on discriminant functions using conditional distributions for two populations 883
3. Tests on discriminant functions for several populations using conditional distributions 885
4. Tests for the number of important discriminant functions 886
References 891
Discriminant Analysis for Time Series
R. H. Shumway

1. Introduction
[Fig. 1. Short period seismic recordings of typical earthquakes and explosions, with event dates, epicenters, and magnitudes (e.g., 71/06/06, 49.?N, 77.7E, 5.5; 72/04/05, 41.9N, 84.5E, USGS).]
and consider the classification problem for finite dimensional random vectors.
This reduces the problem to one that is covered very well in standard multivariate
references such as [3, 37, 59, 75]. In these approaches one usually assigns a
multivariate normal observation to each of the q classification categories on the
basis of a Bayes or likelihood based rule, which usually ensures that some
combination of misclassification probabilities will be minimized.
In the case of q categories the vector x is regarded as belonging to a
T-dimensional Euclidean space which has been partitioned into q disjoint regions
$E_1, E_2, \ldots, E_q$, such that if x falls in region $E_i$, we assign x to population i. If x has
a probability density of the form $p_i(x)$ when x belongs to population i, the
total probability of misclassification can be written as

$$P_e = \sum_{i=1}^{q} \pi_i \sum_{j \neq i} P(j \mid i) \qquad (1.2)$$
and accepting $H_2$ otherwise. The advantage here is that the rule is independent of
the prior probabilities and has the property that for a fixed misclassification
probability of one kind the error of the other kind is minimized. That is, fixing
$P(1|2)$ yields a minimum $P(2|1)$, and fixing $P(2|1)$ yields a rule which minimizes
$P(1|2)$. It is obvious from (1.3) and (1.4) that $K = \pi_2/\pi_1$ is the Bayes rule when
the prior probabilities are $\pi_1$ and $\pi_2$.
The discussion can be made more concrete by considering the classical problem
of detecting a signal in noise. Suppose that $H_1$ denotes the signal present
hypothesis, and that the signal is absent under $H_2$. Then $P(1|2)$ denotes the false
alarm probability, which might be set at some prespecified level, say 0.001. In this
case it follows that $P(2|1)$, the missed signal probability, is minimized or,
equivalently, that $P(1|1)$, the signal detection probability, is maximized. As
another example the seismic data in Fig. 1 can be analyzed by identifying the
earthquakes with $H_1$ and the explosions with $H_2$. Since the interest is in detecting
an explosion, presumably identified as a violation of an underground nuclear test
ban treaty, $P(2|1)$ can be identified as the false alarm probability, whereas $P(2|2)$
is a measure of the explosion detection probability.
The above criteria have been applied mainly to the problem of classifying
multivariate vectors, where the dimensionality T was fairly small and an adequate
learning population was available for estimating the unknown parameters. This
will not generally be the case for time series data, where T can be very large
relative to the number of elements likely to be found in the learning population.
For example, the earthquakes and explosions in Fig. 1 are sampled at T = 256
points, and the potential learning populations contain only 40 earthquakes and 26
explosions respectively. In this case the computations required to calculate the
discriminant function and to numerically evaluate its performance will always
involve inverting a 256 × 256 covariance matrix. The estimation of the parameters
in the learning sets will be difficult because of the fact that when the dimension T
exceeds the number of elements in the learning population, the sample covariance
matrices are not of full rank.
The difficulties inherent in time domain computations can be alleviated consid-
erably by applying spectral approximations suggested by the properties of the
discrete Fourier transform (DFT) of a stationary process. If the covariance
functions are assumed to come from stationary error or noise processes, then the
involved matrix operations can be replaced by simple ones involving the spectra
and DFT's of the data and mean value functions. The use of spectral approxima-
tions has been fairly standard, beginning with the work of Whittle [81], and
continuing with the work of Brillinger [18] and Hannan [43]. The approximations
used here depend on those developed by Wahba [78], Liggett [56], and Shumway
and Unger [72]. Spectral approximations to the likelihood function have been
used by a number of authors for solving a great variety of problems (see [6, 19,
20, 22, 27, 28, 29, 56, 73]).
We begin by reviewing the standard approach to the problem of discriminating
between two normal processes with unequal means or covariance functions. The
two cases lead separately to linear or quadratic discriminant functions which can
be approximated by spectral methods if the covariance function is stationary. The
discriminant functions and their performance characteristics are approximated by
frequency domain methods, leading to simple and easily computed expressions. A
section is included which discusses methods for (1) determining whether the
group patterns are better modelled as by differing in the mean values or
covariances (spectra), and (2) the estimation of means and spectra from a learning
population. Finally an example shows the application of the techniques to the
problem of discriminating between short period seismic recordings of earthquakes
and explosions.
2. Time domain classification methods

$T \times T$ covariance matrix
under hypothesis $H_j$, $j = 1, 2, \ldots, q$. Writing the covariance function in terms of the
time difference $t - u$ indicates that under $H_j$ one may represent x as a signal plus
a stationary zero mean noise process, i.e.,

$$x = \mu_j + n_j \qquad (2.2)$$

where $\mu_j$ denotes a fixed signal and $n_j = (n_j(0), n_j(1), \ldots, n_j(T-1))'$ has mean 0
and stationary covariance matrix $R_j$. One may obtain a stochastic signal model by
choosing $\mu_j = 0$ in (2.2) and regarding $n_j$ as formed by adding a zero-mean
stationary stochastic signal $s_j = (s_j(0), s_j(1), \ldots, s_j(T-1))'$, depending on the par-
ticular hypothesis $H_j$, to a noise n which does not depend on j. This leads to the
stochastic signal model

$$x = s_j + n, \qquad (2.3)$$

where the covariance matrix of x can be represented as the sum of the signal and
noise covariance matrices if the signal and noise processes are assumed to be
uncorrelated. The simple case of detecting either a fixed or stochastic signal s
imbedded in Gaussian noise follows by taking $q = 2$, $\mu_1 = s$, and $\mu_2 = 0$ in (2.2)
for a deterministic signal model and $s_1 = s$, $s_2 = 0$ in (2.3) for the stochastic signal
case. It follows that the general model which regards x as normal with mean $\mu_j$
and covariance matrix $R_j$ under $H_j$ subsumes the standard cases of interest in
signal detection theory.
For the multivariate normal or Gaussian case the probability density appearing
in the likelihood or Bayes approaches takes the form
The basic approach and equations following from (2.4) can be found in the
standard references [3, 37, 75].
It should be noted that the results in this section assume that the mean value
and covariance parameters are known exactly. This idyllic assumption is almost
never satisfied in practice, and one must find ways of estimating the initial means
and covariances for each of the populations. Such problems are considered in
Section 5. Since the case of q = 2 predominates in applications and is less
involved, we cover that case first in the following sections. For example, the
problem of discriminating between the earthquake records ($H_1$) and the explosion
records ($H_2$) falls in this category.
In the unequal means case for $q = 2$, assume first that the covariance matrices
are equal, i.e. $R_1 = R_2 = R$, and note that the usual Neyman-Pearson criterion (1.4)
implies that one should accept $H_1$ if the linear discriminant function
$$P(1|1) = P(2|2) = \Phi(\tfrac{1}{2}\delta_T) \qquad (2.11)$$
The covariance matrix of the vector white noise process n is $R = \sigma_n^2 I_T$, where $I_T$
denotes the $T \times T$ identity matrix and $\sigma_n^2 = E(n^2(t))$ is the noise power. The
discriminant function in this case is obtained by taking $\mu_1 = s$, $\mu_2 = 0$ with R as
given above. In (2.6) note that

$$d_L(x) = \sigma_n^{-2}\sum_{t=0}^{T-1} s(t)x(t) - \tfrac{1}{2}\sigma_n^{-2}\sum_{t=0}^{T-1} s^2(t) \qquad (2.12)$$
is the filter resulting from simply matching the theoretical signal to the observed
series. The performance of this matched filter depends on the distance measure
(2.6) which becomes

$$\delta^2 = \sigma_n^{-2}\sum_{t=0}^{T-1} s^2(t) \qquad (2.13)$$

and is just the signal to noise ratio. From (2.10) and (2.11) it is easy to see that
$P(1|2)$, the false alarm probability, gets small as the signal to noise ratio
increases, whereas the signal detection probability $P(1|1)$ increases towards a
limiting value of one.
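As a numerical illustration of (2.12), (2.13), and (2.11), the following minimal sketch (Python with NumPy and SciPy; the function names and the known-signal, white-noise setting are illustrative assumptions, not code from the original text) computes the matched filter discriminant and the resulting detection probability:

```python
import numpy as np
from scipy.stats import norm

def matched_filter_discriminant(x, s, sigma_n2):
    """Linear discriminant (2.12) for a known signal s in white noise with
    noise power sigma_n2; accept H1 (signal present) when the value
    exceeds the threshold K (K = 0 for equal priors)."""
    return (np.dot(s, x) - 0.5 * np.dot(s, s)) / sigma_n2

def detection_probability(s, sigma_n2):
    """P(1|1) = P(2|2) = Phi(delta_T / 2) as in (2.11), where
    delta_T^2 = sum(s^2) / sigma_n2 is the signal to noise ratio (2.13)."""
    delta = np.sqrt(np.dot(s, s) / sigma_n2)
    return norm.cdf(0.5 * delta)
```

For a fixed signal the error rates improve monotonically as the signal to noise ratio (2.13) grows, in line with the discussion above.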
For the case of discriminating among more than two populations differing only
in the mean values, it is convenient to define the intermediate term
for $j = 1, 2, \ldots, q$, with the Bayes rule (1.3) implying that one should classify x into
population l whenever
for $j = 1, 2, \ldots, q$, $j \neq l$. If the error probabilities for the multiple group case are
needed, note that the random variable $u_{lj}(x)$ (with $u_{jl}(x) = -u_{lj}(x)$) is normal with
mean $\tfrac{1}{2}\delta_{ljT}^2$ under $H_l$ and mean $-\tfrac{1}{2}\delta_{ljT}^2$ under $H_j$, with variance $\delta_{ljT}^2$ under both
hypotheses, where

$$\delta_{ljT}^2 = (\mu_l - \mu_j)'R^{-1}(\mu_l - \mu_j). \qquad (2.16)$$

Unfortunately the $u_{lj}(x)$ and $u_{lj'}(x)$ are correlated for $j \neq j'$, and the regions for
determining the probability of correctly classifying x into $H_l$ are rather involved.
However, defining $K_{lj} = \ln \pi_j - \ln \pi_l$, we may use Bonferroni's inequality to write
When the prior probabilities are equal, $K_{lj} = 0$ and we have an expression which
depends strictly on the distance measures.
which is the sum of a quadratic and a linear form. The probability distribution of
this discriminant is very involved, and since one can often deal with the case
where a signal has only a stochastic part, it is convenient to let $\mu_1 = \mu_2 = 0$ and
work with the purely quadratic discriminant
with the rule being to accept $H_1$ if $d_Q(x) > K$. The distribution of $d_Q(x)$ under
hypothesis $H_j$ is basically a linear combination of single degree of freedom
chi-squared random variables where the coefficients are the eigenvalues of the
matrix $R_j(R_2^{-1} - R_1^{-1})$ for $j = 1, 2$. Even though these coefficients may be either
positive or negative, there are numerical methods for calculating the critical
points (cf. [26]). In the case where T is moderately large, the normal approxima-
tion may be reasonable, and it is useful to note that the means and variances of
$d_Q(x)$ under $H_j$, $j = 1, 2$ are
where tr denotes trace.
A special case of interest is that of detecting a white Gaussian signal in a white
Gaussian noise process, for which we may take $R_2 = \sigma_n^2 I_T$ and $R_1 = (\sigma_s^2 + \sigma_n^2)I_T$,
where $\sigma_s^2 = E(s^2(t))$ is the average signal power and $\sigma_n^2$ is the average noise power
as before. Then

$$d_Q(x) = \left((\sigma_n^2)^{-1} - (\sigma_s^2 + \sigma_n^2)^{-1}\right)\sum_{t=0}^{T-1} x^2(t) \qquad (2.22)$$

$$r = \sigma_s^2/\sigma_n^2 \qquad (2.25)$$
is the signal to noise ratio. Later it will be evident that a narrow band version of
this detector is a reasonable approximation in the general case.
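A minimal sketch of the white signal detector (2.22), assuming the two variances are known (illustrative names, not from the original text):

```python
import numpy as np

def quadratic_white_detector(x, sigma_s2, sigma_n2):
    """Pure quadratic discriminant (2.22) for a white Gaussian signal in
    white Gaussian noise; since the coefficient is positive, the rule
    d_Q(x) > K is equivalent to thresholding the total power sum(x**2)."""
    coef = 1.0 / sigma_n2 - 1.0 / (sigma_s2 + sigma_n2)
    return coef * np.sum(x ** 2)
```

Its performance is governed entirely by the signal to noise ratio r of (2.25).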
The distributional complications associated with working with the discrimi-
nants (2.18) and (2.19) lead to considering the possibility of applying linear
procedures in these complicated cases. In particular, the problem of determining
the $T \times 1$ vector b such that the decision rule accepts $H_1$ when $b'x > K$ was
considered first by Kullback [53], who showed that when both the means and
covariances differed, the solution that minimized $P(1|2)$ for a given value of
$P(2|1)$ was of the form
$j = 1, 2$ are the variances of the discriminant under $H_j$. Giri [37] has presented a
convenient method for searching over the admissible linear solutions, noting that
(2.27) implies that one may write
and restrict the search to weights $w_1$ and $w_2$ such that $w_1 + w_2 = 1$ for $w_1, w_2 > 0$;
$w_1 - w_2 = 1$ for $w_1 > 0$, $w_2 < 0$; and $w_2 - w_1 = 1$ for $w_1 < 0$, $w_2 > 0$, as shown in
Fig. 2.
Then, choosing $w_1$ and $w_2$ in Fig. 2 leads to b by solving (2.26) as long as the
positive definite condition holds. This leads, in turn, to the two error probabilities
given in (2.29) and (2.30). We do not discuss more detailed procedures for finding
$w_1$ and $w_2$ to minimize one error probability for a given value of the other, since
the spectral approximations will enable a very rapid scan over the values of $w_1$
and $w_2$ with an accompanying simple method for determining whether the
positive definite condition holds.
In order to obtain the multiple group discriminant function corresponding to
the unequal means, unequal covariances case, simply substitute (2.4) into (1.3) to
Fig. 2. Values of the weights $w_1$ and $w_2$ for which an admissible linear discriminant may exist in the
unequal covariance matrix case.
obtain the analog of (2.15). In this case one classifies x into population l if
with $g_j(x)$ the linear term given in (2.14). Taking the means $\mu_j = 0$ in (2.32) gives
$g_j(x) = 0$, leading to the pure quadratic discriminant.
3. Discriminant analysis in the frequency domain

While the time domain approach of the previous section can lead to rather
cumbersome matrix calculations, this is not the primary reason for considering
the more easily computed spectral approximations. The motivation for the
frequency domain approach stems mainly from the convenient theoretical proper-
ties of the discrete Fourier transform of a weakly stationary process, namely, that
the random variables produced are nearly uncorrelated, with variances approxi-
mately equal to the power spectrum. In this case the estimation and hypothesis
testing problems are formulated in terms of sample spectral densities with simple
approximate distributions, and one avoids the more difficult sampling character-
istics of the autocorrelation functions observed in time domain computations. In
a practical phenomenological context, the power spectrum usually turns out to be
an essential component in any overall model for a physical system. The purpose
of this section is to present some of the spectral approximations which make
discriminant analysis in the frequency domain such an attractive alternate proce-
dure. The results allow one to replace complicated expressions involving matrix
inverses by simple sums involving discrete Fourier transforms (DFT's) and
spectral density functions.
The power spectrum $f(\cdot)$ of the stationary process is defined by the usual Fourier
representation. It is assumed that

$$0 < m \leq f(\lambda) \leq M < \infty \qquad (3.3)$$

and that

$$\sum_{t=-\infty}^{\infty} |t|^{1+\alpha}|r(t)| < \infty \qquad (3.4)$$

holds for some $\alpha$, $0 < \alpha < 1$. This condition from Liggett [56] is used to justify the
approximation to $R^{-1}$.
In order to develop a reasonable approximation to the covariance function
$r(t-u)$, note that (3.2) suggests the form

$$r(t-u) = T^{-1}\sum_{k=0}^{T-1}\exp\{i\lambda_k(t-u)\}f(\lambda_k), \qquad (3.5)$$

where $\lambda_k = 2\pi k T^{-1}$, provided that

$$\theta = \sum_{t=-\infty}^{\infty}|t||r(t)| < \infty. \qquad (3.7)$$

The corresponding approximation to the inverse is

$$r^{-1}(t-u) = T^{-1}\sum_{k=0}^{T-1}\exp\{i\lambda_k(t-u)\}f^{-1}(\lambda_k). \qquad (3.8)$$

The DFT of the data is defined as

$$X^\wedge(k) = T^{-1/2}\sum_{t=0}^{T-1}x(t)\exp\{-i\lambda_k t\}, \qquad (3.9)$$

and the approximate covariance structure of the transformed variables is

$$\operatorname{cov}\left(X^\wedge(k), X^\wedge(l)\right) = \begin{cases} f(\lambda_k) + O(T^{-1}), & k = l,\\ O(T^{-1}), & k \neq l, \end{cases} \qquad (3.10)$$

where $M(\lambda)$ is defined through

$$(2\pi)^{-1}\int_{-\pi}^{\pi}\exp\{i\lambda\tau\}\,dM(\lambda). \qquad (3.12)$$
are of the form $D_T = \{T^{-1}\mu_j'R^{-1}\mu_k,\ j, k = 1, 2, \ldots, q\}$, and appear in the discrimi-
nant functions in Subsection 2.1. Then it can be shown, as in [5] or [43], that

$$\lim_{T\to\infty} D_T = D$$

where

$$D = (2\pi)^{-1}\int_{-\pi}^{\pi} f^{-1}(\lambda)\,dM(\lambda). \qquad (3.14)$$
The corresponding approximation based on (3.8) is

$$\hat D_T = T^{-1}\sum_{k=0}^{T-1} f^{-1}(\lambda_k)\,M^\wedge(k)M^\wedge(k)^* \qquad (3.15)$$

where $M^\wedge(k) = (M_1^\wedge(k), M_2^\wedge(k), \ldots, M_q^\wedge(k))'$ with

$$M_j^\wedge(k) = T^{-1/2}\sum_{t=0}^{T-1}\mu_j(t)\exp\{-i\lambda_k t\}, \qquad (3.16)$$

the DFT of the mean value function. (The notation $\wedge$ will also be used for the
$$\lim_{T\to\infty}\hat D_T = D,$$

so that the approximation and the true value have the same limit. Furthermore, it
can be shown (see [68]) that the absolute difference between two corresponding
elements of $D_T$ and $\hat D_T$ is $O(T^{-1})$, implying that the approximation is reasonable
for finite values of T.
A number of other approximations will be introduced in the following sections
which reduce involved matrix operations like those in $D_T$ to simple sums like $\hat D_T$.
The complex exponentials appearing in the DFT can be regarded as coefficients
in a linear transformation which reduces the matrix R approximately to a
diagonal matrix with the spectral ordinates appearing down the diagonal. Work
of Fuller [31] and Davies [25] can be consulted for more detailed analyses from
this point of view.
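The computational saving can be made concrete with a small sketch comparing the exact quadratic form $T^{-1}\mu'R^{-1}\mu$ with its spectral approximation; this is an illustration under an assumed known covariance function and spectrum (names are hypothetical), not code from the original text:

```python
import numpy as np
from scipy.linalg import toeplitz

def quadratic_form_exact(mu, r):
    """T^{-1} mu' R^{-1} mu with R the T-by-T Toeplitz matrix built from
    the covariance values r(0), ..., r(T-1); requires a matrix solve."""
    R = toeplitz(r)
    return mu @ np.linalg.solve(R, mu) / len(mu)

def quadratic_form_spectral(mu, f):
    """Approximation in the spirit of (3.8): a simple sum of squared DFT
    amplitudes weighted inversely by the spectrum at lambda_k."""
    T = len(mu)
    M = np.fft.fft(mu) / np.sqrt(T)     # DFT normalized as in (3.9)
    return np.sum(np.abs(M) ** 2 / f) / T
```

The two values agree up to $O(T^{-1})$ for stationary covariances satisfying (3.4), while the second form avoids the $T \times T$ inversion entirely.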
$$d_L(x) = g_1(x) - g_2(x) \qquad (3.17)$$

with
for $\lambda_k = 2\pi k T^{-1}$, with $M_j^\wedge(k)$, $X^\wedge(k)$ the DFT's of $\mu_j(\cdot)$ and $x(\cdot)$ respectively.
The resulting expression for (3.18) contains only simple sums depending on the
DFT's of the mean value and data series, weighted inversely by the spectral
values. Furthermore, note that $T^{-1}d_L(x)$ is normally distributed with mean
$\pm\tfrac{1}{2}T^{-1}\delta_T^2$ under $H_1$ and $H_2$ respectively, where
$$\delta_T^2 = (\mu_1 - \mu_2)'R^{-1}(\mu_1 - \mu_2) \approx \sum_{k=0}^{T-1}\frac{|M_1^\wedge(k) - M_2^\wedge(k)|^2}{f(\lambda_k)} = \hat\delta_T^2. \qquad (3.20)$$
Note that for $q = 2$ one may use the limiting results for $D_T$ and $\hat D_T$ to write

$$\lim_{T\to\infty} T^{-1}\delta_T^2 = \delta^2 = a'Da \qquad (3.21)$$

with $a = (1, -1, 0, \ldots, 0)'$. Again, the approximations are good for finite T, as we
have

$$T^{-1}|\delta_T^2 - \hat\delta_T^2| = O(T^{-1}).$$
The variance of the approximate discriminant $T^{-1}\hat d_L(x)$ is
It can be shown by methods analogous to those used by Liggett [56] (see also [68]
and [72]), that
holding for finite T. This justifies the use of the approximate discriminant
function $T^{-1}\hat d_L(x)$ as being normal with approximate means $\tfrac{1}{2}T^{-1}\hat\delta_T^2$ and
$-\tfrac{1}{2}T^{-1}\hat\delta_T^2$ under $H_1$ and $H_2$ respectively. The variance is approximately $T^{-2}\hat\delta_T^2$
under both hypotheses. The error probabilities $P(1|2)$ and $P(2|1)$ can be ap-
proximated by replacing $\delta_T^2$ in (2.7) and (2.8) by $\hat\delta_T^2$, and we note that, for $K = 0$,
the argument in (2.10) and (2.11) is approximately $\delta_T \approx T^{1/2}\delta$ where $\delta^2$ is given in
(3.21). These are the same approximate error probabilities that would be obtained
from $T^{-1}d_L(x)$, and it is clear that both error rates go to zero as $T \to \infty$.
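A minimal sketch of the frequency domain linear discriminant and its error rate, assuming the mean-value DFTs of (3.16) and a common spectrum are available (function names are illustrative):

```python
import numpy as np
from scipy.stats import norm

def freq_linear_discriminant(x, M1, M2, f):
    """Approximate linear discriminant built from DFTs as in (3.17)-(3.18).
    M1, M2 are the mean-value DFTs of (3.16); f holds the common spectrum
    at the DFT frequencies. Accept H1 when the value exceeds K = 0."""
    T = len(x)
    X = np.fft.fft(x) / np.sqrt(T)
    num = (np.conj(M1 - M2) * X).real
    return np.sum(num / f - (np.abs(M1) ** 2 - np.abs(M2) ** 2) / (2.0 * f))

def common_error_probability(M1, M2, f):
    """P(1|2) = P(2|1) for K = 0, from the estimated distance (3.20)."""
    delta2 = np.sum(np.abs(M1 - M2) ** 2 / f)
    return 1.0 - norm.cdf(0.5 * np.sqrt(delta2))
```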
The special case of detecting a deterministic signal $s(t)$ in noise $n(t)$ is of
interest in its own right, and denoting the noise autocorrelation and spectrum by
$r_n(\cdot)$ and $f_n(\cdot)$ respectively, the approximate discriminant becomes
$$\hat d_L(x) = \sum_{k=0}^{T-1}\frac{S^\wedge(k)^* X^\wedge(k)}{f_n(\lambda_k)} - \tfrac{1}{2}\hat\delta^2 \qquad (3.22)$$

where

$$\hat\delta^2 = \sum_{k=0}^{T-1}\frac{|S^\wedge(k)|^2}{f_n(\lambda_k)} \qquad (3.23)$$
determines the approximate performance of the filter in terms of the signal to
noise ratios summed over the separate frequencies. The linear discriminant
function (3.22) is equivalent to the usual signal processing technique which
matches the prewhitened data against a prewhitened version of the signal. One
may first prewhiten both DFT's by dividing by $f_n^{1/2}(\lambda_k)$ at each frequency. The
inverse DFT's, say $x_w(t)$ and $s_w(t)$, are the prewhitened data and signal processes
respectively, and they can be applied as in matched filtering. This yields

$$\hat d_L(x) = \sum_{t=0}^{T-1} s_w(t)x_w(t) - \tfrac{1}{2}\sum_{t=0}^{T-1} s_w^2(t)$$
$$\delta_{ljT}^2 = (\mu_l - \mu_j)'R^{-1}(\mu_l - \mu_j) \approx \sum_{k=0}^{T-1}\frac{|M_l^\wedge(k) - M_j^\wedge(k)|^2}{f(\lambda_k)} \qquad (3.25)$$

for $l, j = 1, 2, \ldots, q$, and $l \neq j$.
and $H_2$. Central limit results for the test statistic (3.26) were considered by Capon
[21] and Rubin [68] who proved essentially that
and variance

$$\hat\sigma_j^2 = 2T^{-2}\sum_{k=0}^{T-1}\left(f_2^{-1}(\lambda_k) - f_1^{-1}(\lambda_k)\right)^2 f_j^2(\lambda_k) \qquad (3.30)$$

under $H_j$, $j = 1, 2$. These approximations to the true means and variances of
$T^{-1}d_Q(x)$ given in (2.20) and (2.21) satisfy

$$|\hat m_j - E_j T^{-1}d_Q(x)| = O(T^{-1})$$

and

$$T|\hat\sigma_j^2 - \operatorname{var}_j T^{-1}d_Q(x)| = O(T^{-1}),$$

implying that they are reasonable for finite T.
Since the decision rule is to accept $H_1$ for $d_Q(x) > K$, the error probabilities
$P(1|2)$ and $P(2|1)$ may be approximated as

$$P(1|2) = 1 - \Phi\left(\frac{K - \hat m_2}{\hat\sigma_2}\right) \qquad (3.31)$$

and

$$P(2|1) = \Phi\left(\frac{K - \hat m_1}{\hat\sigma_1}\right), \qquad (3.32)$$

where the arguments are proportional to $\hat m_j/\hat\sigma_j \sim T^{1/2}m_j/\gamma_j$. Grenander [42] has
derived other asymptotic expressions for the error rates which are related directly
to the spectral distribution of the eigenvalues of $R_1R_2^{-1}$.
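A minimal sketch of the normal approximation to the quadratic detector's error rates, assuming the two spectra are known at the DFT frequencies (illustrative names; the threshold K is taken on the scale of $T^{-1}d_Q(x)$):

```python
import numpy as np
from scipy.stats import norm

def quadratic_error_approx(f1, f2, K, T):
    """Approximate means and standard deviations of T^{-1} d_Q(x) from the
    two spectra, as in (3.30), with error rates as in (3.31)-(3.32)."""
    diff = 1.0 / f2 - 1.0 / f1
    m1 = np.sum(diff * f1) / T                     # mean under H1
    m2 = np.sum(diff * f2) / T                     # mean under H2
    s1 = np.sqrt(2.0 * np.sum((diff * f1) ** 2)) / T
    s2 = np.sqrt(2.0 * np.sum((diff * f2) ** 2)) / T
    p12 = 1.0 - norm.cdf((K - m2) / s2)            # P(1|2), cf. (3.31)
    p21 = norm.cdf((K - m1) / s1)                  # P(2|1), cf. (3.32)
    return p12, p21
```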
Freiberger [29] (see also [30]) used the continuous analog of $T^{-1}d_Q(x)$, say
under $H_j$, $j = 1, 2$. Several options are available for calculating the critical values
associated with the approximate quadratic discriminant function (3.35). Box [13]
has given a simple expression for computing probabilities involving a linear
combination of negative exponentials when the coefficients may be both positive
and negative. The cdf corresponding to (3.35) in this case is of the form
where $G_2(\cdot)$ denotes the cdf of a chi-square random variable with two degrees of
freedom and the coefficients are defined by

$$r_m = f_s(\lambda_m)/f_n(\lambda_m), \qquad (3.39)$$

where $ML \approx T$ and $\lambda_m = \pi[(2m+1)L - 1]M^{-1}$. Then $M^{-1}\hat d_Q(x)$ can be used,
with the errors under the normal approximation determined by making the
appropriate replacements in (3.29)-(3.32). The choice of M follows from smooth-
ness conditions imposed on the spectrum which are given by Liggett [56].
In the case where both the mean value and covariance functions are different,
linear procedures of the form $b'x > K$ may be considered, and one may develop
the approximate version in terms of the DFT's of b and x, leading to accepting $H_1$
when $T^{-1}\hat d_{LQ}(x) > K$, where

$$\hat d_{LQ}(x) = \sum_{k=0}^{T-1} B^\wedge(k)^* X^\wedge(k) \qquad (3.42)$$

with

$$B^\wedge(k) = \frac{M_1^\wedge(k) - M_2^\wedge(k)}{w_1 f_1(\lambda_k) + w_2 f_2(\lambda_k)} \qquad (3.43)$$
for $-\pi \leq \lambda \leq \pi$. The whole range of admissible linear discriminants for this case
can be very quickly searched over the values of $w_1$ and $w_2$ shown in Fig. 2,
satisfying the positive definite condition (3.47). The performance characteristics
(3.44) and (3.45) can be evaluated for each $w_1$ and $w_2$, with the approximate
threshold $\hat K$ following from the conditions (2.27) and (2.28) as

$$\hat K = T^{-1}\sum_{k=0}^{T-1} B^\wedge(k)^* M_1^\wedge(k) - w_1\hat\delta_1^2
       = T^{-1}\sum_{k=0}^{T-1} B^\wedge(k)^* M_2^\wedge(k) + w_2\hat\delta_2^2. \qquad (3.48)$$
The justification (see [72]) for these approximations involves showing that the
means and variances of $T^{-1}\hat d_{LQ}(x)$ satisfy limiting conditions which are the same
as those satisfied by the exact time domain version $T^{-1}b'x$.
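A sketch of the rapid scan described above, stepping along the three admissible edges of Fig. 2 and keeping weights for which the weighted spectrum stays positive (an illustration under assumed known spectra and mean DFTs, not the original computation):

```python
import numpy as np

def scan_admissible_weights(M1, M2, f1, f2, n_grid=99):
    """Scan the admissible weight edges of Fig. 2 for the linear
    discriminant of (3.42)-(3.43); returns (w1, w2, delta1^2, delta2^2)
    for each retained weight pair."""
    edges = [lambda t: (t, 1.0 - t),        # w1 + w2 = 1, both positive
             lambda t: (1.0 - t, -t),       # w1 - w2 = 1, w2 negative
             lambda t: (-t, 1.0 - t)]       # w2 - w1 = 1, w1 negative
    out = []
    for edge in edges:
        for t in np.linspace(0.01, 0.99, n_grid):
            w1, w2 = edge(t)
            fw = w1 * f1 + w2 * f2          # weighted spectrum
            if np.any(fw <= 0.0):           # positivity condition fails
                continue
            B = (M1 - M2) / fw              # filter transform, cf. (3.43)
            d1 = np.sum(np.abs(B) ** 2 * f1)   # variance term under H1
            d2 = np.sum(np.abs(B) ** 2 * f2)   # variance term under H2
            out.append((w1, w2, d1, d2))
    return out
```

One error rate can then be fixed and the other minimized over the retained grid points, using the performance expressions (3.44)-(3.45).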
For the case of discriminating among q groups one may develop an approxima-
tion to $u_{lj}(x)$ as given in (2.31), noting that

$$\lim_{T\to\infty} T^{-1}\ln\det R_j = \lim_{T\to\infty} T^{-1}\sum_{k=0}^{T-1}\ln f_j(\lambda_k) \qquad (3.49)$$

where

$$\lim_{T\to\infty} T^{-1}\sum_{k=0}^{T-1}\ln f_j(\lambda_k) = (2\pi)^{-1}\int_{-\pi}^{\pi}\ln f_j(\lambda)\,d\lambda \qquad (3.50)$$

(see [56], or [40, p. 65]). This means that one may use the discriminant $T^{-1}\hat u_{lj}(x)$,
accepting $H_l$ if

$$\hat g_j(x) = g_j(x) - \tfrac{1}{2}\sum_{k=0}^{T-1}\ln f_j(\lambda_k) - \tfrac{1}{2}\sum_{k=0}^{T-1}\frac{|X^\wedge(k)|^2}{f_j(\lambda_k)} \qquad (3.52)$$

$$\hat m_{lj} = -\tfrac{1}{2}T^{-1}\sum_{k=0}^{T-1}\left[\ln\frac{f_j(\lambda_k)}{f_l(\lambda_k)} + \left(f_j^{-1}(\lambda_k) - f_l^{-1}(\lambda_k)\right)f_l(\lambda_k)\right] \qquad (3.53)$$
3.3. Multivariate extensions
In some cases it may be necessary to handle multivariate time series data. For
example, a number of sensors can be employed to monitor a phenomenon, with
the possibility that population differences may be characterized by the sensor
cross correlation structure. Possible applications occur in processing EEG data
from multiple leads attached to the same subject where cross spectral parameters
are assumed to be important.
For a multivariate time series (see also [81-83] and [43]) we imagine a
collection of p sensors $x_1(t), x_2(t), \ldots, x_p(t)$ sampled at the time points $t =
0, 1, \ldots, T-1$. The vector series $x(t) = (x_1(t), x_2(t), \ldots, x_p(t))'$ is assumed to have
a $p \times 1$ mean value function $\mu_j(t) = (\mu_{j1}(t), \mu_{j2}(t), \ldots, \mu_{jp}(t))'$ and a $p \times p$ matrix
valued covariance function
The DFT's for the observed vector series $x(t)$ and $\mu_j(t)$, $j = 1, \ldots, q$ are defined as
in (3.9) and (3.16); we denote the transforms by $X^\wedge(k)$ and $M_j^\wedge(k)$ respectively.
The sampled data will be represented generically by the $p \times T$ matrix $X =
(x(0), x(1), \ldots, x(T-1))$.
In order to discriminate between two hypotheses
$j = 1, \ldots, q$ with $F^{-1}(\cdot)$ denoting the inverse of the spectral matrix $F(\cdot)$. The
performance in this case can be approximated by using

$$\hat\delta_T^2 = \sum_{k=0}^{T-1}\left(M_1^\wedge(k) - M_2^\wedge(k)\right)^* F^{-1}(\lambda_k)\left(M_1^\wedge(k) - M_2^\wedge(k)\right). \qquad (3.58)$$
For the case where the covariance matrices are unequal but $q = 2$, the pure
quadratic discriminant (2.18) becomes

$$\hat d_Q(x) = \sum_{k=0}^{T-1} X^\wedge(k)^*\left(F_2^{-1}(\lambda_k) - F_1^{-1}(\lambda_k)\right)X^\wedge(k), \qquad (3.59)$$

and the normal approximation can again be used with $T^{-1}\hat d_Q(x)$ assumed to
have means

$$\hat m_j = T^{-1}\sum_{k=0}^{T-1}\operatorname{tr}\left\{\left(F_2^{-1}(\lambda_k) - F_1^{-1}(\lambda_k)\right)F_j(\lambda_k)\right\} \qquad (3.60)$$

and variances

$$\hat\sigma_j^2 = 2T^{-2}\sum_{k=0}^{T-1}\operatorname{tr}\left[\left(F_2^{-1}(\lambda_k) - F_1^{-1}(\lambda_k)\right)F_j(\lambda_k)\right]^2 \qquad (3.61)$$
under $H_j$, $j = 1, 2$. The approximate linear discriminant function becomes

$$\sum_{k=0}^{T-1} B^\wedge(k)^* X^\wedge(k) \qquad (3.62)$$

with

$$B^\wedge(k) = \left(w_1F_1(\lambda_k) + w_2F_2(\lambda_k)\right)^{-1}\left(M_1^\wedge(k) - M_2^\wedge(k)\right). \qquad (3.63)$$

The weighted spectrum is restricted to values of $w_1$ and $w_2$ for which the matrix
is positive definite (see Fig. 2). The two error probabilities are approximated as in
(3.44) and (3.45) with

$$\hat\delta_j^2 = T^{-1}\sum_{k=0}^{T-1} B^\wedge(k)^* F_j(\lambda_k)B^\wedge(k). \qquad (3.64)$$
We note that Azari [9] has given the details which establish the validity of these
approximations in the multivariate case. Generally it is required that the cross
correlations satisfy condition (3.4) and that $0 < m \leq \det F_j(\lambda) \leq M < \infty$ for all $\lambda$.
The multiple population case just uses the function $\hat g_j(x)$ in the same way that
(3.24) uses its univariate version, and the performance depends on the
obvious generalization of (3.25). Another approach (see [18, pp. 390-391]) which
might be useful in the multiple group case is to choose the discriminant or
transformation vector $\beta^\wedge(k)$ at each frequency by maximizing the ratio of the
between-group power to the within-group power of the transformed observations
in the learning populations. In this approach the discriminant vector $\beta^\wedge(k)$ is the
characteristic vector corresponding to the largest root $\nu_k$ of the determinantal
equation

$$\left(B_k - \nu_k E_k\right)\beta^\wedge(k) = 0, \qquad (3.65)$$
where $\lambda = (\lambda_1, \ldots, \lambda_r)'$ is a vector of frequencies restricted to $|\lambda_1|, |\lambda_2|, \ldots, |\lambda_r| \leq
\pi$, and $F_j(\lambda)$ is the $p \times p$ spectral density matrix under $H_j$. The DFT is defined as

$$X^\wedge(k) = T_1^{-1/2}\cdots T_r^{-1/2}\sum_{t_1=0}^{T_1-1}\cdots\sum_{t_r=0}^{T_r-1} x(t)\exp\{-i\lambda_k' t\} \qquad (3.67)$$
4. Statistical characterization of patterns

for $j = 1, \ldots, q$, $l = 1, \ldots, N_j$, with the sample series assumed to be observed at the
points $t = 0, 1, \ldots, T-1$. The error series $n_{jl}(t)$ are assumed to be zero-mean
stationary normal processes with an autocorrelation function of the form
would imply that the linear discriminant function would be nearly optimum with
a corresponding simplification in the form and performance as described in
earlier sections. If significant differences are apparent in the group covariance
functions or spectra, one may need to look at some tests for equality of the group
spectra.
Since the discriminant functions in the previous section have all been applied in
the frequency domain, it is convenient to rely on an approach for testing these
hypotheses on a frequency by frequency basis. These follow by taking DFT's on
both sides of (4.1) to obtain the frequency domain model
where the complex normal variables $N_{jl}^\wedge(k)$ now have the approximate covariance
structure
are defined as in the usual case. The F test for equality of the q means at a given
frequency $\lambda_k$ follows by using the ratio of the mean between power component to
the mean error power component in Table 1, which yields

$$F_{2(q-1),\,2n} = \frac{P_{B_k}}{P_{E_k}}\cdot\frac{n}{q-1}, \qquad (4.7)$$
where

$$n_j = N_j - 1 \qquad (4.8)$$

and

$$n = \sum_j n_j = \sum_j N_j - q. \qquad (4.9)$$
If the test is performed over a band of L frequencies centered at $\lambda_m$, say $\lambda_{m+k}$,
$k = -\tfrac{1}{2}(L-1), \ldots, 0, \ldots, \tfrac{1}{2}(L-1)$, the components of power in Table 1 are sim-
ply smoothed over the frequencies, and the degrees of freedom are replaced by
$2L(q-1)$ and $2Ln$ respectively. We will refer frequently in the sequel to
smoothing of various power components, and will mean by this in most cases an
average of the form (3.40) with $A(k) = 1$ or $L^{-1}$. It is useful to plot the F statistic
(4.7) as a function of frequency in order to determine which frequencies dis-
criminate between the group means.
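A minimal sketch of the per-frequency F statistic (4.7), assuming each group is supplied as a matrix of replicate series (the layout and names are illustrative assumptions):

```python
import numpy as np

def anopow_f_statistic(groups):
    """F statistic (4.7) at each DFT frequency for testing equality of the
    q group means; 'groups' is a list of N_j-by-T arrays of series."""
    q = len(groups)
    N = [g.shape[0] for g in groups]
    n = sum(N) - q                                          # cf. (4.9)
    D = [np.fft.fft(g, axis=1) / np.sqrt(g.shape[1]) for g in groups]
    grand = sum(d.sum(axis=0) for d in D) / sum(N)          # overall mean
    PB = sum(Nj * np.abs(d.mean(axis=0) - grand) ** 2
             for Nj, d in zip(N, D)) / (q - 1)              # between power
    PE = sum((np.abs(d - d.mean(axis=0)) ** 2).sum(axis=0)
             for d in D) / n                                # error power
    return PB / PE      # compare against F with 2(q-1) and 2n df
```

Plotting the returned array against frequency gives the diagnostic display described above.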
In the multivariate case the vector process $x_{jl}(t) = (x_{jl1}(t), \ldots, x_{jlp}(t))'$ is trans-
formed to the frequency domain, and we obtain the equality of means test in
terms of the $p \times p$ between groups spectral matrix

$$B(\lambda_k) = \sum_{j=1}^{q} N_j\left(\bar X_{j\cdot}^\wedge(k) - \bar X_{\cdot\cdot}^\wedge(k)\right)\left(\bar X_{j\cdot}^\wedge(k) - \bar X_{\cdot\cdot}^\wedge(k)\right)^*. \qquad (4.11)$$
Application of the likelihood ratio test leads to rejecting the hypothesis when
is less than some critical value. Khatri [49] and Hannan [43] give the usual
approximation

Table 1
Analysis of power for equality of means test at frequency $\lambda_k = 2\pi kT^{-1}$

Source    Degrees of freedom    Power
Between   $2(q-1)$              $P_{B_k}$
Error     $2n$                  $P_{E_k}$
with n as in (4.9) and $G_f(\cdot)$ the cdf of a chi-squared distribution with f degrees of
freedom as before. One term will generally be sufficient since smoothing over L
points simply replaces f by $f' = 2L(q-1)p$ and $\nu$ by $\nu' = L(2n + q - 1) - p$, so
that the remainder is automatically made small by increasing bandwidth. The case
of $q = 2$ populations specializes to a version of Hotelling's $T^2$, as shown in [37],
where

$$T^2(\lambda_k) = \frac{N_1N_2}{N_1+N_2}\left(\bar X_{1\cdot}^\wedge(k) - \bar X_{2\cdot}^\wedge(k)\right)^* \hat F^{-1}(\lambda_k)\left(\bar X_{1\cdot}^\wedge(k) - \bar X_{2\cdot}^\wedge(k)\right) \qquad (4.16)$$

$$K = \frac{np}{n - p + 1}F_{2p,\,2(n-p+1);\,\alpha} \qquad (4.19)$$

where $F_{f_1,f_2;\alpha}$ denotes the upper $\alpha$ critical value from an F distribution with $f_1$ and
$f_2$ degrees of freedom. Again it is informative to plot Hotelling's $T^2$ as a function
of frequency.
and we notice that it is essentially a complex sample variance. For $q = 2$, the ratio

$$L'(\lambda_k) = \frac{n^{np}\prod_{j=1}^{q}\left[\det n_j\hat F_{jT}(\lambda_k)\right]^{n_j}}{\prod_{j=1}^{q} n_j^{n_jp}\left[\det n\hat F_T(\lambda_k)\right]^{n}} \qquad (4.22)$$

is less than some critical value, where

$$\hat F_T(\lambda_k) = n^{-1}\sum_{j=1}^{q} n_j\hat F_{jT}(\lambda_k) \qquad (4.23)$$
is the pooled estimator of the spectrum. Krishnaiah et al. [51] have given the hth
moment of $L'(\lambda_k)$ and calculated 95% critical points for $p = 3, 4$, using a Pearson
type I approximation. For reasonably large samples involving smoothed spectral
estimators, the first term in the complex version of the usual chi-squared series
may be sufficiently accurate, and we note that the critical values can also be
determined from
$$\hat f_{jT}(\lambda_k) = N_j^{-1}\sum_{l=1}^{N_j}|X_{jl}^\wedge(k)|^2, \qquad (4.28)$$

for $j = 1, \ldots, q$, and by its smoothed version

$$\hat f_{jT}(\lambda_m) = (LN_j)^{-1}\sum_{k=-(L-1)/2}^{(L-1)/2}\sum_{l=1}^{N_j}|X_{jl}^\wedge(m+k)|^2, \qquad (4.29)$$

with L chosen in accordance with the usual resolution and
bandwidth stability considerations. The smoothing introduces stability into com-
putations involving the approximate discriminant functions in Section 3. For
example, the occurrence of a zero value for $\hat f_{jT}(\lambda)$ is to be avoided at all costs,
since it will induce a large spurious contribution to the discriminant function over
those frequencies. Frequently there will be bands where the group spectra and
mean value transforms are near zero, and we may have a situation where
assumption (3.3) is nearly violated. In order to keep the frequencies where the
spectra are nearly zero from exerting an unnatural effect on the computations,
one can either replace the spectrum in question by some small non-zero value or
simply assume that the numerator is zero over intervals where the denominator
spectrum is null, and sum the test statistic over a reduced number of frequencies.
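A sketch of a smoothed group spectral estimator with the zero-value guard just described (an illustration of (4.28)-(4.29) under assumed replicate data; the floor fraction is a hypothetical default):

```python
import numpy as np

def smoothed_group_spectrum(X, L, floor_frac=0.001):
    """Estimate a group spectrum in the spirit of (4.28)-(4.29): average
    the squared DFT amplitudes over the N_j replicate series, smooth over
    L adjacent frequencies, and replace near-zero values by a small
    percentage of the observed maximum, as recommended above."""
    N, T = X.shape
    P = np.abs(np.fft.fft(X, axis=1) / np.sqrt(T)) ** 2   # periodograms
    f = P.mean(axis=0)                                    # replicate average
    pad = np.concatenate([f[-L:], f, f[:L]])              # circular padding
    f = np.convolve(pad, np.ones(L) / L, mode='same')[L:L + T]
    return np.maximum(f, floor_frac * f.max())            # zero guard
```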
The use of the autoregressive estimator for the group spectra may be preferred
if the original series satisfy low order autoregressions which differ between the
groups. This requires that one determines the appropriate order by some reason-
able criterion such as has been given by Akaike [1, 2]. Anderson [7] has given a
procedure for fitting autoregressive models to the replicated data in each of the
groups, and this will imply specific forms for the spectral densities $f_j(\lambda)$,
$j = 1, \ldots, q$. One might even consider the autoregressive moving average (ARMA)
model, which can be developed using the model identification techniques of Box
and Jenkins [14]. The Akaike information theory criterion (AIC) can help one to
choose a final version, and then one can use the techniques summarized by
Anderson [6, 7] to fit a final model in either the time or frequency domain. In the
case where one is interested in detecting a simple autoregressive stochastic signal
imbedded in white noise, the estimation techniques developed in [62] could be
employed.
The linear discriminant function can be evaluated by examining its perfor-
mance on samples not included in the learning population or by constructing
estimators for $P(1|2)$ and $P(2|1)$, using one of the conventional techniques
described in [59] or [37]. One possibility is to use the conventional expressions
(2.7) and (2.8) with the approximation

$$\hat\delta_T^2 = \sum_{k=0}^{T-1}\frac{|\bar X_{1\cdot}^\wedge(k) - \bar X_{2\cdot}^\wedge(k)|^2}{\hat f_T(\lambda_k)} \qquad (4.30)$$

for $\delta_T^2$. Another possibility is to use the sample means and variances of the
discriminant functions evaluated over a population not used for estimating the
original mean values and spectra. For example, let $d_L(x_{jl})$ be the value of a
discriminant function for the sample $x_{jl}$, where the discriminant function was
derived from a learning sample not containing $x_{jl}$. Then the sample means

$$\bar d_j = N_j^{-1}\sum_{l} d_L(x_{jl}) \qquad (4.31)$$

and variances

$$\hat\sigma_{d_j}^2 = (N_j - 1)^{-1}\sum_{l}\left(d_L(x_{jl}) - \bar d_j\right)^2 \qquad (4.32)$$
can be used in the usual expressions for the two errors, given in this case by
If the entire sample has been used as a learning set, it is conventional to evaluate
the means and variances of discriminants which were computed after discarding
the observation in question during the parameter estimation stage.
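A minimal sketch of the held-out evaluation of (4.31)-(4.32), assuming the discriminant values for the two test populations are already available as arrays (names are illustrative):

```python
import numpy as np
from scipy.stats import norm

def holdout_error_estimates(d1, d2):
    """Estimate the two error probabilities from discriminant values on
    held-out series: sample means (4.31) and variances (4.32) plugged
    into the usual normal expressions, threshold K = 0. d1 and d2 hold
    d_L values for populations 1 and 2 respectively."""
    m1, m2 = np.mean(d1), np.mean(d2)
    s1 = np.std(d1, ddof=1)               # (4.32), with the N_j - 1 divisor
    s2 = np.std(d2, ddof=1)
    p21 = norm.cdf((0.0 - m1) / s1)       # P(2|1): miss rate
    p12 = 1.0 - norm.cdf((0.0 - m2) / s2) # P(1|2): false alarm
    return p21, p12
```

When the discriminants are recomputed with each observation deleted in turn, this is the leave-one-out evaluation described above.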
5. An application to seismic discrimination

The methods and procedures of the previous sections can be clarified and
illustrated by applying them to a reasonably complete data set. We have chosen a
collection of 40 presumed earthquakes and 26 presumed nuclear explosions
recorded at the Large Aperture Seismic Array (LASA) in Montana. The traces are
short period beams (averages) for events located between 40 and 60 degrees
latitude and between 70 and 90 degrees longitude. Fig. 1 (see Section 1) shows ten
typical members of each of the two populations, and it should be noted that there
appear to be different features which suggest the possibility that these populations
can be separated using a discriminant function. One might mention the energy in
the earthquake records which tends to appear in the latter part or coda of the
waveform. This is often due to a 'depth phase' (pP) which arrives later for the
deeper events.
One may proceed in accordance with the recommendations of Section 4 by first
investigating the possibility that the mean value functions are different. Fig. 3
shows the mean value functions $\bar x_{1\cdot}(t)$ and $\bar x_{2\cdot}(t)$ computed from the full
populations of earthquakes ($H_1$) and explosions ($H_2$) respectively. It should be
noted that all of the recordings were standardized so that one-half of the peak to
peak amplitude of the largest cycle on each trace was unity. The sample means
indicate that both populations seem to contain fixed deterministic components
with a large excursion appearing in the explosion population. The mean values
show the impulsive nature of the explosion population, in contrast with the more
energetic behavior of the entire earthquake waveform. The coda of the earth-
quake, say between 5 and 10 seconds after the start, seems to have more power
than the explosion, due again to the impulsive nature of the explosive source.
In order to characterize these qualitative signal differences over frequency, one
may perform the analysis of power described in Subsection 4.1. The ANOPOW
components shown in Table 1 of that subsection are plotted as a function of
frequency in Fig. 4. For this population there are $N_1 = 40$ earthquakes and
$N_2 = 26$ explosions with the sampling rate of 10 points per second, yielding a
folding frequency of 5 Hz. The power components tend to peak in the neighbor-
hood of 1 Hz with little or no activity beyond 3 Hz. The between power compo-
nent indicates roughly that the discriminating power ought to lie between 0.5 and
Fig. 3. Sample means for earthquake (EQ) and explosion (EX) populations (256 points at 10 samples
per second).
Fig. 4. Analysis of power resulting from testing equality of earthquake and explosion means.
Discriminant analysis for time series 35
1.5 Hz, although the within or error component is also large over these frequen-
cies. All components are reasonably stable, so that no smoothing was introduced,
and there are 129 frequency points running from zero to the folding frequency.
A clear indication as to which frequencies ought to discriminate between the
mean values is provided by plotting the F-statistic (4.7) which is the ratio of the
average between group power to the average within group power. For the case
$q = 2$, this may be written in the form
with $\hat f_T(\lambda_k)$ the pooled estimator of the error spectrum defined in (4.27). This
exhibits the F-statistic as a sort of signal to noise ratio expressed as a function of
frequency, since the values in (5.1) are essentially the components of the estimated
distance $\hat\delta_T^2$ in equation (4.30). Fig. 5 shows that the discriminating frequencies
tend to be distributed over a somewhat broader band, with values exceeding the
0.01 critical value occurring between 1.0 and 2.5 Hz. This indicates that whatever
discriminating power there is in the mean value functions, is concentrated in a
relatively high frequency band, with the lower frequency component masked by
the large error power contribution. It is interesting to look in more detail at the
error spectra to determine whether the assumption that the theoretical spectra of
the two groups are equal over all bands is reasonable. The group spectra $\hat f_{1T}(\lambda)$
and $\hat f_{2T}(\lambda)$ defined in (4.20) are plotted in Fig. 6. First of all note that the error
Fig. 5. F-statistics for testing equality of earthquake and explosion means as a function of frequency
(129 frequencies, 2 and 128 df).
Fig. 6. Error spectra of earthquake and explosion populations (129 frequencies, EQ 78 df, EX
50 df).
spectra appear to be roughly equal in the higher frequency bands (1.2 to 3.0 Hz),
where we have just indicated that significant differences in the mean value
functions are present. Using the test for equality of the error spectra given in
(4.21) leads to comparing the spectral ratios with an F-statistic, where we
associate 78 degrees of freedom with the larger earthquake spectrum and 50
degrees of freedom with the smaller explosion spectrum. The ratio exceeds the
critical value (4.8 at the 0.01 level) fairly consistently over the band ranging from
0.0 to 1.2 Hz, implying that the signals have significantly different spectra over the
band.
The tentative conclusions that follow from the preceding tests are that stochas-
tic or spectral differences characterize the low frequency range whereas determin-
istic or mean differences predominate at the higher frequencies. It should be
noted that these conclusions tend to support the near optimality of classical long
period discriminants, such as surface-wave body-wave magnitudes as in [23] or
short period discriminants such as complexity or spectral ratios as in [8] (see also
[11]). For example, a classical and reliable discriminant exists when the short
period P-wave arrival represented in Fig. 1 can be compared with a long period
(low frequency) surface wave observed on a separate system, recording frequen-
cies between 0.0 and 1.0 Hz. The surface wave magnitude, measured basically as
the logarithm of the amplitude divided by the period, is roughly a measure of
power in the low frequency band. The surface wave magnitude $M_s$ is combined
with the body wave magnitude $M_b$ measured from the maximum cycle on the
short period data traces which we have scaled to unity for all events. Hence the
Discriminant analysis for time series 37
which is just a matching of the mean difference function with the data vector. The
linear matched filter (LMF) would be optimal for the case where both spectra are
constant over all frequencies (white), and the signal difference is a known
deterministic function (see also (2.12)).
Several different versions of the quadratic filter will be considered. The most
general version, which is appropriate in the case where there are presumed
differences in both the means and spectra, is defined in (3.51) and (3.52). This is
proportional, for the case $q = 2$, to

$$\tfrac{1}{2}\sum_{k=0}^{T-1}|X^\wedge(k)|^2\left[\hat f_{2T}^{-1}(\lambda_k) - \hat f_{1T}^{-1}(\lambda_k)\right]
+ \sum_{k=0}^{T-1}\left[M_1^\wedge(k)^*\hat f_{1T}^{-1}(\lambda_k) - M_2^\wedge(k)^*\hat f_{2T}^{-1}(\lambda_k)\right]X^\wedge(k), \qquad (5.4)$$

and will be referred to as the quadratic detection filter (QDF).
and we have

$$\hat d_Q(x) = \sum_{k=-(L-1)/2}^{(L-1)/2}|X^\wedge(m+k)|^2\left(\hat f_2^{-1}(\lambda_m) - \hat f_1^{-1}(\lambda_m)\right), \qquad (5.5)$$

where L is chosen so that the frequency band runs from 0.4 to 0.8 Hz. For 10
samples per second and 256 points the primary frequencies are of the form
$f_n = 10n/256$ cycles per second for $n = 0, \ldots, 128$, so that taking $L = 11$ in (5.5)
produces the desired bandwidth, where we center (5.5) on the value $m = 15$,
corresponding to 0.6 Hz.
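A minimal sketch of the spectral ratio statistic (5.5) with this band choice, assuming the two estimated spectra are supplied as arrays at the DFT frequencies (names are illustrative):

```python
import numpy as np

def spectral_ratio_statistic(x, f1, f2, m=15, L=11):
    """Spectral ratio statistic (5.5): band-limited power contrast centered
    at frequency index m. With 10 samples per second and T = 256, m = 15
    corresponds to 0.6 Hz and L = 11 spans roughly 0.4 to 0.8 Hz. f1 and
    f2 are the estimated earthquake and explosion spectra."""
    T = len(x)
    X = np.fft.fft(x) / np.sqrt(T)
    h = (L - 1) // 2
    band = np.arange(m - h, m + h + 1)
    return np.sum(np.abs(X[band]) ** 2) * (1.0 / f2[m] - 1.0 / f1[m])
```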
For completeness, the classical complexity discriminant, defined as the mean
square coda of the series, say

$$C = (200)^{-1}\sum_{t=57}^{256} x^2(t), \qquad (5.6)$$

will be calculated. This is closely related to the quadratic detector (2.22) which is
optimal for detecting a Gaussian white signal in white noise.
The unknown spectral and mean value parameters were estimated using the
entire population of 40 earthquakes (EQ) and 26 explosions (EX) as a learning
sample, with $\hat f_{jT}(\cdot)$ determined from (4.29) ($L = 3$) and $M_j^\wedge(k)$ estimated by the
sample means $\bar X_{j\cdot}^\wedge(k)$, $j = 1, 2$. When spectral values were zero, they were replaced
by a small constant percentage of the observed maximum over the entire
frequency band. The estimated spectral values for the spectral ratio (SR) detector
(5.5) were calculated using a smoothed version of (4.28) with $L = 11$, i.e.,

$$\hat f_j(\lambda_m) = (11N_j)^{-1}\sum_{k=-5}^{5}\sum_{l=1}^{N_j}|X_{jl}^\wedge(m+k)|^2,$$
given by

$$\hat\sigma_j^2 = \sum_{k=0}^{T-1}\hat f_j(\lambda_k)\,|M_1^\wedge(k) - M_2^\wedge(k)|^2$$
Fig. 7. Output of linear detection filter (LDF) and linear matched filter (LMF), applied to full suite of
events.
parameters in Table 2. The disparity between the predicted and observed sample
performance is due to the increase in the sample variances caused by several
extreme observations, clearly visible in Fig. 7. In order to evaluate the perfor-
mance under other conditions, a subpopulation of 23 earthquakes and 15 explo-
sions was drawn to serve as a hypothetical learning sample. The mean value and
spectra from this small learning sample did not differ substantially from the
initial values evaluated over the complete population. Furthermore, the estimated
filter parameters did not change substantially from those given in Table 2 when
they were evaluated over either the learning or test populations. For further
details, see [71].
The different classes of quadratic detectors were also applied to the full suite of
earthquakes and explosions. The threshold value for the spectral ratio (SR)
detector (5.5) was determined for a specified false alarm probability of 0.01, using
the chi-squared distribution with $2L = 22$ degrees of freedom and the average
spectral values mentioned earlier. The predicted detection probability for that
false alarm probability was 0.99. The generalized quadratic filter (QDF) and
complexity thresholds were estimated using the observed empirical values for the
discriminants. Fig. 8 shows the performance of the quadratic filter (QDF)
compared with the linear detection filter (LDF), and we note that the perfor-
mance of the quadratic filter is an improvement with only one false alarm and 24
out of 26 explosions detected. Note that the variance of the quadratic filter output
increases substantially for the earthquake population whereas the linear filter
outputs have approximately equal variances under $H_1$ and $H_2$.
The empirical false alarm and signal detection probabilities for all of the
methods based on the proportion correctly and incorrectly classified in the
sample, are shown in Table 3. If the empirical false alarm and signal detection
probabilities are denoted by $P_S(2|1)$ and $P_S(2|2) = 1 - P_S(1|2)$, we may express the
overall error probability, for equal prior probabilities $\pi_1 = \pi_2 = \tfrac{1}{2}$, as

$$P_S = \tfrac{1}{2}\left[P_S(2|1) + 1 - P_S(2|2)\right].$$
Table 2
Theoretical and sample parameters, and predicted false alarm and signal detec-
tion probabilities for linear detection (LDF) and linear matched (LMF) filters

                       Theoretical           Sample$^a$
                       LDF       LMF         LDF       LMF
Means        EQ        1.0       0.01        0.9      -0.03
             EX      -15.9      -0.59      -15.9       0.59
Std. dev.    EQ        3.1       0.20        4.6       0.32
             EX        2.7       0.11        8.9       0.28
$P(2|1)^b$             0.001     0.052       0.02      0.18
$P(2|2)^c$             0.997     0.992       0.80      0.83
Fig. 8. Output of quadratic detection filter (QDF) and linear detection filter (LDF), designed and
applied to full suite of events. Quadratic threshold at 0.1 estimated visually.
be made, as in Table 3. We note that the quadratic filter (QDF) performs best
with an overall error rate of 0.07, with the linear detection filter (LDF) running a
close second. Since the distribution theory for the LDF is formulated easily in
terms of the normal distribution, it might be chosen in the absence of any clear
superiority for the quadratic detector. It seems to be clear, however, that some
improvement can be expected by combining the spectral and mean value informa-
tion from a short period recording in a more nearly optimal manner. It should be
Table 3
Sample empirical error and detection probabilities for all methods

                              $P_S(2|1)^a$   $P_S(2|2)^b$   $P_S$
Linear detection (LDF)        0.05           0.85           0.10
Linear matched (LMF)          0.20           0.88           0.16
Quadratic detection (QDF)     0.05           0.92           0.07
Spectral ratio (SR)           0.20           0.92           0.14
Complexity (C)                0.10           0.85           0.13
noted that the above analysis is predicated on the assumption that the surface
wave was not detected on the long period recording instrument. If the surface
wave magnitude can be measured, the classical $M_s$-$M_b$ discriminant may be
superior to any of those described above.
The use of autoregressive techniques for discrimination using short period data
has been investigated by Tjøstheim [76] using a similar population containing 45
earthquakes and 40 explosions recorded at NORSAR (Norwegian Seismic Array).
It was noted that the coda or complexity portions could be modelled as third
order autoregressions. Then, displaying the first two autoregressive coefficients
for earthquakes and explosions led to a separation between the two classes, which
was comparable to that achieved using the classical complexity and spectral ratio
methods described here.
6. Discussion
Acknowledgment
References
[1] Akaike, H. (1974). A new look at the statistical model identification. I E E E Trans. Automat.
Control 19, 716-723.
[2] Akaike, H. (1977). On entropy maximization principle. In: P. R. Krishnaiah, ed., Proc. Symp.
Applications Statistics, 27-47. North-Holland, Amsterdam.
[3] Anderson, T. W. (1958). An Introduction to Multivariate Statistical Analysis. Wiley, New York.
[4] Anderson, T. W. and Bahadur, R. R. (1962). Classification into two multivariate normal populations with different covariance matrices. Ann. Math. Statist. 33, 420-431.
[5] Anderson, T. W. (1971). The Statistical Analysis of Time Series. Wiley, New York.
[6] Anderson, T. W. (1977). Estimation for autoregressive moving average models in the time and
frequency domain. Ann. Statist. 5, 842-865.
[7] Anderson, T. W. (1978). Repeated measurements on autoregressive processes. J. Amer. Statist.
Assoc. 73, 371-378.
[8] Anglin, F. M. (1971). Discrimination of earthquakes and explosions using short period seismic
array data. Nature 233, 51-52.
[9] Azari, R. (1975). Information theoretic properties of some spectral approximations in stationary time series. Dissertation. George Washington University, Washington.
[10] Bloomfield, P. (1976). Fourier Analysis of Time Series: An Introduction. Holt, Rinehart and
Winston, New York.
[11] Booker, A. and Mitronovas, W. (1964). An application of statistical discrimination to classify
seismic events. Bull. Seismological Soc. America 54, 961-977.
[12] Borpujari, A. S. (1977). An empirical Bayes approach for estimating the mean of N stationary
time series. J. Amer. Statist. Assoc. 72, 397-402.
[13] Box, G. E. P. (1954). Some theorems on quadratic forms applied in the study of analysis of
variance problems. Part I: Effect of inequality of variance in the one-way classification. Ann.
Math. Statist. 25, 290-302.
[14] Box, G. E. P. and Jenkins, G. M. (1970). Time Series Analysis, Forecasting and Control.
Holden-Day, San Francisco.
[15] Bricker, P. D., Gnanadesikan, R., Mathews, M. V., Pruzansky, S., Tukey, P. A., Wachter, K. W.,
and Warner, J. L. (1971). Statistical techniques for talker identification. Bell Syst. Tech. J. 50,
1427-1454.
[16] Brillinger, D. R. (1973). The analysis of time series collected in an experimental design. In: P. R.
Krishnaiah, ed., Multivariate Analysis-III. Academic Press, New York.
[17] Brillinger, D. R. (1974). Fourier analysis of stationary processes. Proc. IEEE 62, 1623-1643.
[18] Brillinger, D. R. (1975). Time Series: Data Analysis and Theory. Holt, Rinehart and Winston,
New York.
[19] Brillinger, D. R. (1978). Comparative aspects of the study of ordinary time series and of point
processes. In: P. R. Krishnaiah, ed., Developments in Statistics, Vol. 1, 33-133. Academic Press,
New York.
[20] Brillinger, D. R. (1979). Analysis of variance and problems under time series models. In: P. R. Krishnaiah, ed., Handbook of Statistics, Vol. 1, 237-278. North-Holland, Amsterdam.
[21] Capon, J. (1965). Hilbert space methods for detection theory and pattern recognition. IEEE Trans. Inform. Theory 11, 247-259.
[22] Capon, J. (1965). An asymptotic simultaneous diagonalization procedure for pattern recogni-
tion. J. Informat. Control 8, 264-281.
[23] Capon, J., Greenfield, R. J., and Lacoss, R. T. (1969). Long-period signal processing results for
the large aperture seismic array. Geophysics 34, 305-329.
[24] Davenport, W. B. and Root, W. L. (1958). An Introduction to the Theory of Random Signals and
Noise. McGraw-Hill, New York.
[25] Davies, R. B. (1973). Asymptotic inference in stationary Gaussian time series. Adv. in Appl.
Probab. 5, 469-497.
[26] Davies, R. B. (1973). Numerical integration of a characteristic function. Biometrika 60, 415-417.
[27] Dunsmuir, W. (1979). A central limit theorem for parameter estimation in stationary vector time
series and its application to models for a signal observed with noise. Ann. Statist. 7, 490-506.
[28] Dunsmuir, W. and Hannan, E. J. (1976). Vector linear time series models. J. Appl. Probab. 10,
130-145.
[29] Freiberger, W. F. (1963). An approximate method in signal detection. Quart. Appl. Math. 20,
373-378.
[30] Freiberger, W. F. and Grenander, U. (1959). Approximate distributions of noise power measurements. Quart. Appl. Math. 17, 271-284.
[31] Fuller, W. A. (1976). Introduction to Statistical Time Series. Wiley, New York.
[32] Gersch, W. (1977). Discrimination between stationary Gaussian time series, large sample results,
Tech. Rept. No. 30. Dept. of Statistics, Stanford University, Palo Alto.
[33] Gersch, W. and Yonemoto, J., (1977). Automatic classification of multivariate EEG, using an
amount of information measure and the eigenvalues of parametric time series model features.
Comput. Biomed. Res. 10, 297-316.
[34] Gersch, W., Martinelli, F., Yonemoto, J., Lew, M. D., and McEwan, J. A. (1979). Automatic
classification of electroencephalograms: Kullback-Leibler nearest neighbor rules. Science 205,
193-195.
[35] Gevins, A. S., Yeager, C. L., Diamond, S. L., Spire, J., Zeitlin, G., and Gevins, A. (1975).
Automated analysis of the electrical activity of the human brain (EEG): A progress report. Proc.
IEEE 63, 1382-1399.
[36] Giri, N. (1965). On the complex analogues of T² and R² tests. Ann. Math. Statist. 36, 664-670.
[37] Giri, N. C. (1977). Multivariate Statistical Inference. Academic Press, New York.
[38] Goodman, N. R. (1963). Statistical analysis based on a certain multivariate complex Gaussian
distribution. Ann. Math. Statist. 34, 152-177.
[39] Grenander, U. (1950). Stochastic processes and statistical inference. Ark. Mat. 1 (17) 195-277.
[40] Grenander, U. and Szegö, G. (1958). Toeplitz Forms and their Applications. University of California Press, Berkeley.
[41] Grenander, U. (1965). On the estimation of regression coefficients in the case of an autocorre-
lated disturbance. Ann. Math. Statist. 25, 252-272.
[42] Grenander, U. (1974). Large sample discrimination between two Gaussian processes with
different spectra. Ann. Statist. 2, 347-352.
[43] Hannan, E. J. (1970). Multiple Time Series. Wiley, New York.
[44] Helstrom, C. W. (1968). Statistical Theory of Signal Detection. Pergamon Press, Oxford.
[45] Huang, T. S., Schreiber, W. F., and Tretiak, O. J. (1971). Image processing. Proc. IEEE 59,
1586-1609.
[46] Jenkins, G. M. and Watts, D. G. (1968). Spectral Analysis and its Applications. Holden-Day, San Francisco.
[47] Kadota, T. T. (1965). Optimum reception of binary sure and Gaussian signals. Bell System Tech. J. 44, 1621-1658.
[48] Kadota, T. T. and Shepp, L. A. (1967). On the best finite set of linear observables for
discrimination between two Gaussian signals. IEEE Trans. Inform. Theory 13, 278-284.
[49] Khatri, C. G. (1965). Classical statistical analysis based on a certain multivariate complex
Gaussian distribution. Ann. Math. Statist. 36, 115-119.
[50] Krishnaiah, P. R. (1976). Some recent developments on complex multivariate distributions. J.
Multivariate Anal. 6, 1-30.
[51] Krishnaiah, P. R., Lee, J. C., and Chang, T. C. (1976). The distribution of likelihood ratio
statistics for tests of certain covariance structures of complex multivariate normal populations.
Biometrika 63, 543-549.
[52] Krishnaiah, P. R. and Lee, J. C. (1979). Likelihood ratio tests for mean vectors and covariance
matrices. In: P. R. Krishnaiah, ed., Handbook of Statistics, Vol. 1, 513-570. North-Holland,
Amsterdam.
[53] Kullback, S. (1959). Information Theory and Statistics. Smith, Gloucester, MA.
[54] Lagakos, S. W. (1973). Bounds on the diagonalizability of the finite Fourier transforms of
stationary time series. Tech. Rept. No. 2, Dept. of Computer Sciences, State University of New
York at Buffalo, Amherst.
[55] Larimore, W. E. (1977). Statistical inference on random fields. Proc. IEEE 65, Special Issue on
Multidimensional Systems, 961-970.
[56] Liggett, W. S. (1971). On the asymptotic optimality of spectral analysis for testing hypotheses
about time series. Ann. Math. Statist. 42, 1348-1358.
[57] Markel, J. D. and Gray, A. H., Jr. (1976). Linear Prediction of Speech. Springer, Berlin.
[58] Meisel, W. S. (1972). Computer Oriented Approaches to Pattern Recognition. Academic Press,
New York.
[59] Morrison, D. E. (1976). Multivariate Statistical Methods. McGraw-Hill, New York.
[60] Otnes, R. K. and Enochson, L. (1978). Applied Time Series Analysis. Wiley, New York.
[61] Pagano, M. (1970). Some asymptotic properties of a two-dimensional periodogram, Tech. Rept.
No. 146, Dept. of Statistics, The Johns Hopkins University, Baltimore.
[62] Pagano, M. (1974). Estimation of models of autoregressive signal plus white noise. Ann. Statist.
2, 99-108.
[63] Parzen, E. (1962). Extraction and detection problems and reproducing kernel Hilbert spaces. In:
E. Parzen, ed., Time Series Papers (1967), 492-519. Holden-Day, San Francisco.
[64] Parzen, E. (1959). Statistical inference on time series by Hilbert space methods I. In: E. Parzen,
ed., Time Series Papers (1967), 251-382. Holden-Day, San Francisco.
[65] Pinsker, M. S. (1964). Information and Information Stability of Random Variables and Processes.
Holden-Day, San Francisco.
[66] Root, W. L. (1962). Singular measures in detection theory. In: M. Rosenblatt, ed., Time Series
Analysis, Symposium, 292-315. Wiley, New York.
[67] Rosenfeld, A. and Weszka, J. S. (1976). Picture recognition. In: K. S. Fu, ed., Digital Pattern
Recognition, 135-166. Springer, Berlin.
[68] Rubin, G. E. (1977). On the quadratic classification problem for zero mean stationary time
series. Dissertation. George Washington Univ., Washington.
[69] Selin, I. (1965). Detection Theory. Princeton Univ. Press, Princeton.
[70] Shaman, P. (1975). An approximate inverse for the covariance matrix of moving average and
autoregressive processes. Ann. Statist. 3, 532-538.
[71] Shumway, R. H. and Blandford, R. (1974). An examination of some new and classical short
period discriminants. Tech. Rept. No. TR-74-10, Seismic Data Analysis Center, Alexandria,
U.S.A.
[72] Shumway, R. H. and Unger, A. N. (1974). Linear discriminant functions for stationary time
series. J. Amer. Statist. Assoc. 69, 948-956.
[73] Shumway, R. H. (1971). On detecting a signal in N stationarily correlated noise series.
Technometrics 13, 499-519.
[74] Shumway, R. H. (1970). Applied regression and analysis of variance for stationary time series. J.
Amer. Statist. Assoc. 65, 1527-1546.
[75] Srivastava, M. S. and Khatri, C. G. (1979). An Introduction to Multivariate Statistics. North-Hol-
land, New York.
[76] Tjøstheim, D. (1975). Autoregressive representation of seismic P-wave signals with an application to the problem of short period discriminants. Geophys. J. Roy. Astron. Soc. 43, 269-291.
[77] Van Trees, H. L. (1968). Detection Estimation and Modulation Theory, Parts I, II. Wiley, New
York.
[78] Wahba, G. (1968). On the distribution of some statistics useful in the analysis of jointly stationary time series. Ann. Math. Statist. 38, 1849-1862.
[79] Welch, P. D. and Wimpress, R. S. (1961). Two multivariate statistical computer programs and
their application to the vowel recognition problem. J. Acoust. Soc. Amer. 33, 426-434.
[80] Whalen, A. D. (1971). Detection of Signals in Noise. Academic Press, New York.
[81] Whittle, P. (1951). Hypothesis Testing in Time Series Analysis. Almqvist and Wiksell, Uppsala.
[82] Whittle, P. (1953). The analysis of multiple stationary time series. J. Roy. Statist. Soc. Ser. B 15,
125-139.
[83] Whittle, P. (1963). Stochastic processes in several dimensions. Bull. Inst. Internat. Statist. 40,
974-994.
[84] Wolf, J. J. (1976). Speech recognition and understanding. In: K. S. Fu, ed., Digital Pattern
Recognition, 167-203. Springer, Berlin.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 47-60

Optimum Rules for Classification*

Somesh Das Gupta
1. Introduction
Let $w$ denote an experimental unit drawn randomly from a population $\pi$. The classification problem in its standard form is to devise rules so as to identify $\pi$ with one of the two given 'distinct' populations $\pi_1$ and $\pi_2$. A set of $p$ real-valued measurements $X: p \times 1$ is observed on $w$, and it is believed that the distributions of $X$ in those two populations are different. In this paper we shall assume that $X \sim N_p(\mu, \Sigma)$.
Let $\mu_i$ denote the mean of $X$ in the population $\pi_i$ $(i = 1, 2)$, where $\mu_1 \ne \mu_2$. The classification problem is to find 'good' rules for deciding whether $\mu = \mu_1$ or $\mu = \mu_2$. When all the parameters $\mu_1$, $\mu_2$ and $\Sigma$ are known, Wald's decision theory [17] may be used to derive the minimal complete class of decision rules for the zero-one loss function. It is given by the following, except for sets of measure zero [2]:
The rule $\varphi_k$ decides $\mu = \mu_1$ iff
$$(\mu_1 - \mu_2)'\Sigma^{-1}\bigl[X - \tfrac{1}{2}(\mu_1 + \mu_2)\bigr] \ge \log k.$$
It can be proved [2] that the rule $\varphi^0$ is the only admissible minimax rule.
However, in practice all the parameters are not known, and in order to differentiate the two populations random (training) samples from both the populations are obtained. It may be remarked that if either of $\mu_1$ and $\mu_2$ is known it is not necessary to draw samples from both the populations.
Let $\theta$ stand for $(\mu, \mu_1, \mu_2, \Sigma)$, and let $\Theta_1$, $\Theta_2$ and $\Omega$ denote the corresponding parameter sets defined in (1.2)-(1.4).
*Supported by a grant from the Mathematics Division, U.S. Army Research Office, Durham, NC,
Grant No. DAAG-29-0038.
Following [7] a set of heuristic rules (called plug-in rules) may be devised by first choosing some good estimates of the unknown parameters and replacing the unknown parameters in $\varphi_k$ by their respective estimates. We shall call such a rule $\varphi_k^P$ when the standard estimates are used.
Let $X_{i1}, \ldots, X_{in_i}$ denote the $X$-observations of the training sample from $\pi_i$ $(i = 1, 2)$. Define (assume $n_1 + n_2 - 2 > 0$)
$$\bar X_i = \sum_{j=1}^{n_i} X_{ij}/n_i \quad (i = 1, 2), \tag{1.5}$$
$$S = \Bigl[\sum_{j=1}^{n_1}(X_{1j} - \bar X_1)(X_{1j} - \bar X_1)' + \sum_{j=1}^{n_2}(X_{2j} - \bar X_2)(X_{2j} - \bar X_2)'\Bigr]\Big/(n_1 + n_2 - 2).$$
When all the parameters are unknown, Fisher's plug-in rules are given by the following:
The rule $\varphi_k^P$ decides $\mu = \mu_1$ iff
$$(\bar X_1 - \bar X_2)'S^{-1}\bigl[X - \tfrac{1}{2}(\bar X_1 + \bar X_2)\bigr] \ge \log k.$$
Using the likelihood-ratio principle Anderson [1] proposed the following rules when $(\mu_1, \mu_2, \Sigma)$ lies in $\Omega$ given by (1.4): the rule $\psi_\lambda$ decides $\mu = \mu_1$ iff the likelihood-ratio inequality (1.7) holds.
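The displayed inequalities of the original are partly lost to the scan; the sketch below therefore assumes the classical form of Fisher's plug-in rule shown above, with the function name and the choice $k = 1$ as illustrative assumptions:

```python
import numpy as np

def plug_in_rule(x, X1, X2, k=1.0):
    """Fisher's plug-in rule: decide mu = mu_1 iff
    (X1bar - X2bar)' S^{-1} [x - (X1bar + X2bar)/2] >= log k,
    with S the pooled covariance matrix of the two training samples."""
    n1, n2 = len(X1), len(X2)
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    A1 = (X1 - m1).T @ (X1 - m1)
    A2 = (X2 - m2).T @ (X2 - m2)
    S = (A1 + A2) / (n1 + n2 - 2)            # pooled estimate of Sigma
    score = (m1 - m2) @ np.linalg.solve(S, x - 0.5 * (m1 + m2))
    return 1 if score >= np.log(k) else 2

rng = np.random.default_rng(1)
X1 = rng.normal([1, 1], 1.0, size=(30, 2))    # training sample from pi_1
X2 = rng.normal([-1, -1], 1.0, size=(30, 2))  # training sample from pi_2
print(plug_in_rule(np.array([0.8, 0.5]), X1, X2))  # -> 1 (typically)
```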
2.1. $p = 1$, $\sigma^2$ is known
Without any loss of generality we shall assume that $\sigma^2 = 1$. Let $\varphi = (\varphi_1, \varphi_2)$ stand for a decision rule, where $\varphi_i$ is the probability of deciding $\mu = \mu_i$ given the observations. We shall consider only the rules based on the sufficient statistics $X$, $\bar X_1$ and $\bar X_2$.
First we shall make an orthogonal transformation as follows. Define $U_1$, $U_2$, $U_3$ as in (2.1)-(2.3), where the $k_i$'s are chosen so that $\mathrm{var}(U_i) = 1$, $i = 1, 2, 3$. Note that the $U_i$'s are independently distributed. Let $E(U_i) = \nu_i$. Then $U_i \sim N(\nu_i, 1)$.
In terms of $(\nu_1, \nu_2, \nu_3)$ the sets $\Theta_1$ and $\Theta_2$, as defined in (1.2)-(1.4), are transformed into sets $\Omega_1$ and $\Omega_2$, where $c = k_2/k_1 > 0$ (the $k_i$'s are chosen to be positive). Note that $c > 1$.
A class of Bayes rules decides $(\nu_1, \nu_2, \nu_3) \in \Omega_1$ iff
$$(U_1 - c_1)(U_2 - c_2) \le 0, \tag{2.7}$$
where $c_1$ and $c_2$ are functions of $\nu_0$, $\beta$, $\gamma$ and $c$. Conversely, given $c_1$ and $c_2$ it is possible to choose $\beta$, $\gamma$, and $\nu_0$ appropriately. Another class of Bayes rules may be obtained from the following prior distributions: the probability that $(\nu_1, \nu_2) \in \Omega_i$ is $\xi_i$, and given that $\nu_1 = \nu$ and $\nu_2 = (-1)^i c\nu$ the distribution of $\nu$ is $N(0, \tau^2)$. The unique (a.e.) Bayes rule against the above prior distribution decides $(\nu_1, \nu_2, \nu_3) \in \Omega_1$ iff
$$U_1 U_2 \le k, \tag{2.8}$$
where $k$ is a function of $\xi_1$, $\xi_2$ and $c$. Different types of Bayes rules are given by Das Gupta and Bhattacharya [3].
Now consider the rule which decides $(\nu_1, \nu_2, \nu_3) \in \Omega_1$ iff
$$U_1 U_2 \le 0. \tag{2.9}$$
Thus the above rule is the same as $\varphi^0$, defined in (1.8). The rule $\varphi^0$ is the unique Bayes rule against the prior $\xi(\tfrac{1}{2}, \tfrac{1}{2}, \nu_0)$ for any $\nu_0 > 0$. Moreover, the risk of the rule $\varphi^0$ is constant over the four-point set $(\nu_0, c\nu_0)$, $(-\nu_0, -c\nu_0)$, $(-\nu_0, c\nu_0)$, $(\nu_0, -c\nu_0)$. Hence $\varphi^0$ is an admissible minimax rule, and moreover the supremum of the risk of $\varphi^0$ is equal to $\tfrac{1}{2}$.
However, $\varphi^0$ is not the unique minimax rule (leaving aside the trivial rule $\varphi_1 = \varphi_2 = \tfrac{1}{2}$). To see this, transform $(U_1, U_2)$ to $(V_1, V_2)$ by an orthogonal transformation $L$ such that $(EV_1, EV_2)$ is proportional to $(1, -d_1)$ and $(1, d_2)$ for $(\nu_1, \nu_2) \in \Omega_1$ and $(\nu_1, \nu_2) \in \Omega_2$, respectively, with $d_1 > 0$, $d_2 > 0$. Let $\psi$ be the rule which decides $(\nu_1, \nu_2) \in \Omega_1$ iff $V_1V_2 \le 0$. It can easily be seen (or, see [6]) that the supremum of the risk of $\psi$ is $\tfrac{1}{2}$. Note that there are many such orthogonal transformations $L$ which will satisfy the desired property for $(EV_1, EV_2)$. It may be shown that neither of the rules $\varphi^0$ and $\psi$ dominates the other. However, the characterization of the class of all admissible minimax rules is not known.
Now, instead of the zero-one loss function consider a loss function which takes the value 0 for correct decisions and equals $l(|\mu_1 - \mu_2|)$ for any incorrect decision, where $l$ is a positive-valued, bounded, continuous function such that $l(\Delta) \to 0$ as $\Delta \to 0$. Das Gupta and Bhattacharya [3] have shown that $\varphi^0$ is the unique minimax rule (and Bayes admissible) for the above loss function when $n_1 = n_2$.
It is clear that neither of $\varphi_P^0$ and $\varphi^0$ dominates the other. It is believed that $\varphi_P^0$ is also admissible.
$$= \int\!\!\int \bigl[\varphi_2(u_1,u_2)\,n(u_1;\nu)\,n(u_2;-c\nu) + \varphi_2(u_1,u_2)\,n(u_1;-\nu)\,n(u_2;c\nu) + (1-\varphi_2(u_1,u_2))\,n(u_1;-\nu)\,n(u_2;-c\nu) + (1-\varphi_2(u_1,u_2))\,n(u_1;\nu)\,n(u_2;c\nu)\bigr]\,du_1\,du_2, \tag{2.14}$$
This is equivalent to
$$E_{\delta_1,\delta_2}\psi(Y_1,Y_2) = \tfrac{1}{2}\bigl[E_{\delta_1,\delta_2}\psi(Y_1,Y_2) + E_{-\delta_1,\delta_2}\psi(Y_1,Y_2)\bigr].$$
Thus $\psi^*$ is the uniformly most powerful invariant similar test. The above result is due to Schaafsma [12].
where $Y_1$ and $Y_2$ are given in (2.15) and (2.16), $S$ is given in (1.5), and $t_{n_1+n_2-2,\alpha}$ is the upper $100\alpha\%$ point of the Student's $t$ distribution with $n_1 + n_2 - 2$ degrees of freedom. However, it is very likely that this test is not admissible.
It follows from [9] that the rule $\psi_{LR}$ is a (unique) Bayes rule. We shall give a sketch of the prior distribution against which $\psi_{LR}$ is unique Bayes. Consider $U_1$, $U_2$, $U_3$ as defined in (2.1)-(2.3). Then the $U_i$'s are independently distributed, and $U_i \sim N(\nu_i, \sigma^2)$. Moreover, under $\theta \in \Theta_i$ (i.e. $(\nu_1, \nu_2, \nu_3) \in \Omega_i$) we have $\nu_1 = \nu$, $\nu_2 = (-1)^i c\nu$, $\nu \ne 0$. The prior distribution is given as follows.
(i) $P(\theta \in \Theta_i) = \xi_i$, $i = 1, 2$.
(ii) Given $\theta \in \Theta_i$, the conditional distribution of $(\nu, \nu_3, \sigma^2)$ is derived from the following:
(iia) Given $\sigma^2 = (1 + \tau^2)^{-1}$, the conditional distribution of $(\nu/\sigma^2, \nu_3/\sigma^2)$ is the same as that of $(\tau V, \tau V_3)$, where $V$ and $V_3$ are independently distributed with $V \sim N(0, (1+\tau^2)/(1+c^2))$ and $V_3 \sim N(0, 1+\tau^2)$.
(iib) The density of $\tau$ is proportional to $(1 + \tau^2)^{-(m+1)/2}$.
Without any loss of generality we shall assume that $\Sigma = I_p$. First we shall derive a class of Bayes rules and obtain an admissible minimax rule. Define $U_1$, $U_2$, $U_3$ and $k_1$, $k_2$ as in (2.1)-(2.3), except that the $U_i$'s are now $p \times 1$ vectors and $U_i \sim N_p(\nu_i, I_p)$. Correspondingly redefine the sets $\Omega_i$ as follows:
$$\Omega_i = \{(\nu_1, \nu_2, \nu_3): \nu_1 = \nu,\ \nu_2 = (-1)^i c\nu \ne 0,\ \nu, \nu_3 \in R^p\}. \tag{3.1}$$
To see that $\varphi^0$ is minimax, note that its risk is constant over the set
$$\{(\nu_1, \nu_2, \nu_3): \nu_1 = \nu,\ \nu_2 = -c\nu,\ \nu'\nu = \Delta^2\} \cup \{(\nu_1, \nu_2, \nu_3): \nu_1 = \nu,\ \nu_2 = c\nu,\ \nu'\nu = \Delta^2\}. \tag{3.4}$$
Das Gupta [4] has also shown that the rule $\varphi^0$ is the unique (a.e.) minimax when the loss for any correct decision is zero and the loss for deciding $\mu = \mu_i$ incorrectly is as in Section 2. The problem remains invariant under the transformations
$$(U_1, U_2, U_3) \to (U_1, U_2, U_3 + b) \tag{3.6}$$
for all $b \in R^p$. Clearly, $(U_1, U_2)$ is a set of maximal invariants. A rule $\varphi$ is called orthogonally-invariant if $\varphi(U_1, U_2) = \varphi(\Gamma U_1, \Gamma U_2)$ for every orthogonal matrix $\Gamma$.
For a rule $\varphi \in C^*$ let $G_1(\varphi; \Delta^2)$ and $G_2(\varphi; \Delta^2)$ be the error probabilities when
$\mu = \mu_1$ and $\mu = \mu_2$, respectively. Rao [15] has posed the problem of minimizing
$$\bigl[(X - \bar X_1) - (1 + 1/n_1)(X - \bar X_2)\bigr]'\bigl[(X - \bar X_1) - (1 + 1/n_1)(X - \bar X_2)\bigr] - b\bigl[(1 + \cdots$$
First we shall show that a likelihood-ratio rule $\psi_{LR}$ is unique (a.e.) Bayes and hence is admissible (for the zero-one loss function). Note that $U_1$, $U_2$, $U_3$ and $S$ are sufficient statistics in this case, where the $U_i$'s (in $p \times 1$ vector notation) are given by (2.1)-(2.3) and $S$ is given by (1.5). Here $U_i \sim N_p(\nu_i, \Sigma)$. We now consider the following prior distribution.
(i) $P(\theta \in \Theta_i) = \xi_i$.
(ii) Given $\theta \in \Theta_i$ (i.e., $\nu_1 = \nu$, $\nu_2 = (-1)^i c\nu$), the conditional distribution of $(\nu, \nu_3, \Sigma)$ is derived from the following:
(iia) Given $\Sigma^{-1} = I_p + \tau\tau'$ ($\tau: p \times 1$), the conditional distribution of $(\Sigma^{-1}\nu, \Sigma^{-1}\nu_3)$ is the same as the distribution of $(\tau V, \tau V_3)$, where $V$ and $V_3$ are independently normally distributed.
(iib) The density of $\tau$ is proportional to $(1 + \tau'\tau)^{-(m+1)/2}$, where $m > p - 1$.
Following a simplified version of the results of Kiefer and Schwartz [9] it can be shown that a unique (a.e.) Bayes rule against the above prior distribution accepts $\mu = \mu_1$ if (1.7) holds, where $\lambda$ is a function of the $\xi_i$'s; conversely, given $\lambda$ the constants $\xi_i$ can be appropriately chosen.
Das Gupta [4] has considered a class $C^{**}$ of rules invariant under the transformations (4.1). Define
$$m_{ij} = U_i'S^{-1}U_j/m. \tag{4.2}$$
When $\nu_1 = \nu$, $\nu_2 = (-1)^i c\nu$, $\nu'\Sigma^{-1}\nu = \Delta^2$, the joint density of $(m_{11}, m_{12}, m_{22})$ is given by [14]
$$\cdots \times \sum_{j=0}^{\infty} g_j\,(\tfrac{1}{2}\Delta^2)^j\,h_j(m_{11}, m_{12}, m_{22}), \tag{4.3}$$
where
$$h_j(m_{11}, m_{12}, m_{22}) = \frac{\bigl(m_{11} + 2(-1)^i c\,m_{12} + c^2 m_{22} + (1 + c^2)|M|\bigr)^j}{|I_2 + M|^{(1/2)(m+2)+j}}. \tag{4.4}$$
$$m_{12} < 0 \tag{4.7}$$
for any positive $j$ if $x < 0$. The relation (4.7) is the same as (1.7) for $\lambda = 1$. It now follows easily that the rule $\varphi_1$ is admissible and minimax in $C^{**}$ [4]. Das Gupta [4] has also shown that $\varphi_1$ is the unique (a.e.) minimax in $C^{**}$ if the loss for any correct decision is zero and the loss for deciding $\mu = \mu_i$ incorrectly is
$$\cdots + (1 + 1/n_1)(X - \bar X_2)(X - \bar X_2)' - 2(X - \bar X_1)(X - \bar X_2)'\bigr]. \tag{4.11}$$
It is not clear why Rao imposed the similarity condition even after restricting to the class $C^*$. One may directly consider the class of rules invariant under (4.1) and try to minimize (3.10) subject to the condition that $G_i(\varphi; 0)$ is equal to a specified constant. Using (4.3) it can be found that the optimum rule decides $\mu = \mu_1$ iff
As in (2.29) a similar region for $\Theta_1$ may be constructed for this case also. It is given by the following.
In this case the plug-in rules are given by the following: Decide $\mu = \mu_1$ if
Without loss of generality we may assume that $\mu_1 = 0$ and $\mu_2' = (1, 0, \ldots, 0)$. Then the problem is invariant under the transformations $x \to Lx$ with
$$L = \begin{pmatrix} 1 & b' \\ 0 & L_{22} \end{pmatrix}.$$
It can be seen that a set of maximal invariants is given by $(X_{1.2},\ X_{(2)}'A_{22}^{-1}X_{(2)},\ A_{11.2})$, where
The relation (5.9) is the same as (5.1) for a suitable choice of $\lambda$, and as (5.3) for $\lambda = 1$. The above region is not similar for $\mu = \mu_1$. Such a similar region may be constructed using
References
Classification Statistics

Minoru Siotani
1. Introduction
Notations
$\alpha_T(c)$ ($\beta_T(c)$) is the PMC when $x$ is wrongly classified to the first (the second) population $\Pi_1$ ($\Pi_2$) by a classification rule with cut-off point $c$ based on the statistic $T$.
$\Phi(x)$ and $\phi(x)$ are respectively the c.d.f. and p.d.f. of $N(0, 1)$.
$\mathcal{L}(T \mid \Pi)$ is the distribution law of a statistic $T$ when $\Pi$ is the underlying population.
Let $\Pi_i: N_p(\xi_i, \Sigma)$, $i = 1, 2$, be the two $p$-variate normal populations into one of which we wish to classify a vector $x$. When all parameters are known, an optimum classification rule (Bayes rule, minimax rule, etc.) is based on the statistic
$$U_0 = (\xi_1 - \xi_2)'\Sigma^{-1}x.$$
Let $\bar x_i$ denote the sample mean vectors and $S$ the sample pooled covariance matrix with divisor $n = N_1 + N_2 - 2$. Fisher (1936) and Wald (1944) suggested the plug-in LDF given by
The distributions of $W$ and $F$ have been studied by many authors, but here we are concerned with large sample approximations to and asymptotic expansions of them.
Wald (1944) showed that the limiting distribution of $W$ as $N_i \to \infty$, $i = 1, 2$, is the same as the distribution of $U_0$, i.e.
where $w_i = E(U_0 \mid \Pi_i) = (\xi_1 - \xi_2)'\Sigma^{-1}\xi_i$, $i = 1, 2$. Thus for sufficiently large samples, we can use $W$ and $F$ as if we knew the populations exactly. Hence the PMC's of the $W$-rule and the $F$-rule with a cut-off point $c$ are approximately equal to
$$\beta_W(c) \approx \Phi\bigl(-(c + \tfrac{1}{2}\Delta^2)/\Delta\bigr), \tag{2.7}$$
$$\alpha_W(c) \approx \Phi\bigl((c - \tfrac{1}{2}\Delta^2)/\Delta\bigr). \tag{2.8}$$
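A minimal numerical sketch of these large-sample approximations (the function name and the example values $\Delta = 2$, $c = 0$ are illustrative assumptions):

```python
from statistics import NormalDist

Phi = NormalDist().cdf  # standard normal c.d.f.

def approx_pmc(delta, c=0.0):
    """Large-sample approximations (2.7)-(2.8) to the PMC's of the
    W-rule with cut-off point c, for Mahalanobis distance delta."""
    beta = Phi(-(c + 0.5 * delta**2) / delta)   # x from Pi_2 misclassified
    alpha = Phi((c - 0.5 * delta**2) / delta)   # x from Pi_1 misclassified
    return alpha, beta

print(approx_pmc(delta=2.0))  # both roughly Phi(-1) = 0.159 at c = 0
```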
$$E\bigl[\tfrac{1}{2}\{\alpha_W(0 \mid \bar x_1, \bar x_2, S) + \beta_W(0 \mid \bar x_1, \bar x_2, S)\}\bigr] = \Phi(-\Delta/2) + m\{\cdots\}, \tag{2.9}$$
where $m = (N_1 + N_2)/8N_1N_2$, $H_{-1}(\Delta) = \Phi(-\Delta)/\phi(\Delta)$, and the $H_\nu(\Delta)$ are the Hermite polynomials of degree $\nu$. Bowker (1961) showed that $W$ can be represented as a function of two independent $2 \times 2$ Wishart matrices, one of which is noncentral. Bowker and Sitgreaves (1961) used this representation to obtain an asymptotic expansion for the c.d.f. of $W$ when $N_1 = N_2 = N$.
Okamoto (1963, 1968) considered a more general case of sample sizes and derived asymptotic expansions for
$$P_1(u; \Delta) = P\Bigl\{\frac{W - \tfrac{1}{2}\Delta^2}{\Delta} \le u \,\Big|\, \Pi_1\Bigr\}, \qquad P_2(u; \Delta) = P\Bigl\{\frac{W + \tfrac{1}{2}\Delta^2}{\Delta} \le u \,\Big|\, \Pi_2\Bigr\} \tag{2.11}$$
and also for the PMC of the $W$-rule with cut-off point 0:
$$P_1(u; \Delta) = \Phi(u) + \Bigl[\Bigl\{\frac{h_1}{N_1} + \frac{h_2}{N_2} + \frac{h_3}{n}\Bigr\} + \Bigl\{\frac{g_{11}}{N_1^2} + \frac{g_{12}}{N_1N_2} + \frac{g_{22}}{N_2^2} + \frac{g_{13}}{N_1 n} + \frac{g_{23}}{N_2 n} + \frac{g_{33}}{n^2}\Bigr\}\Bigr]\phi(u) + O_3. \tag{2.13}$$
$$\cdots + 6(p-8)(u+\Delta)\Delta u^3 + 2(p-10)\Delta^3 u - 3\Delta^4 u + (p^2 - 18p + 65)u^3 + (p^2 - 16p + 54)(2u + \Delta)\Delta u - 2(p-4)\Delta^3 - 3(p-3)(p-5)u - 2(p-3)(p-4)\Delta\},$$
$$g_{13} = \frac{1}{8\Delta^2}\{(2u+\Delta)^2u^5 + 2(5p-23)u^5 + 2(p-21)\Delta u^4 - (3p+10)\Delta^2u^3 - p\Delta^3u^2 + 6(p^2-9p+16)u^3 - 4(p^2+4p-18)\Delta u^2 - (2p^2-7p-11)\Delta^2 u + p\Delta^3 - 6(p-1)(p-3)u + 6(p-1)\Delta\},$$
$$g_{23} = \frac{1}{8\Delta^2}\{(u+\Delta)^2(2u+\Delta)^2u^3 + 2(5p-23)u^5 + 2(11p-51)\Delta u^4 + \cdots\},$$
$$\alpha_W(0) = \Phi(-\Delta/2) + \frac{a_1}{N_1} + \frac{a_2}{N_2} + \frac{a_3}{n} + \Bigl\{\frac{b_{11}}{N_1^2} + \frac{b_{12}}{N_1N_2} + \frac{b_{22}}{N_2^2} + \frac{b_{13}}{N_1 n} + \frac{b_{23}}{N_2 n} + \frac{b_{33}}{n^2}\Bigr\} + O_3, \tag{2.16}$$
$$\beta_W(0) = \text{the expression obtained by interchanging } N_1 \text{ and } N_2 \text{ in } \alpha_W(0), \tag{2.17}$$
where $a_i = h_i(-\Delta/2; \Delta)\phi(-\Delta/2)$, $b_{ij} = g_{ij}(-\Delta/2; \Delta)\phi(-\Delta/2)$. (The third order terms are too long to present here.)
Tables of $a_i$ and $b_{ij}$ were given by Okamoto (1963, 1968) for $p = 1, 2, 3, 5, 7, 10, 20, 50$ and $\Delta = 1, 2, 3, 4, 6, 8$. Siotani and Wang (1975, 1977) prepared the tables of $c_{ijk}$, the coefficients of the third order terms in (2.16), as well as $a_i$ and $b_{ij}$, for $p = 1(1)20, 25, 30(10)50$ and $\Delta = 0.4(0.2)2.0(0.4)4.0, 3.0, 5.0(1.0)8.0$.
Lachenbruch and Mickey (1968) considered the estimation of $\alpha_W(0)$ and $\beta_W(0)$ based on (2.16) and (2.17) by replacing $\Delta^2$ with its estimate
$$D^2 = (\bar x_1 - \bar x_2)'S^{-1}(\bar x_1 - \bar x_2) \quad\text{or}\quad D^{*2} = (n - p - 1)D^2/n.$$
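A sketch of this plug-in idea, keeping only the leading term $\Phi(-D/2)$ of (2.16) (the helper name and the simulated data are illustrative assumptions):

```python
import numpy as np
from statistics import NormalDist

def estimated_error_rate(X1, X2):
    """Estimate the leading term Phi(-D/2) of the expansion (2.16),
    with Delta^2 replaced by D^2 = (x1bar - x2bar)' S^{-1} (x1bar - x2bar)
    or its bias-adjusted version D*^2 = (n - p - 1) D^2 / n."""
    n1, n2 = len(X1), len(X2)
    p = X1.shape[1]
    n = n1 + n2 - 2
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S = ((X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)) / n
    d = m1 - m2
    D2 = d @ np.linalg.solve(S, d)
    D2_star = (n - p - 1) * D2 / n
    Phi = NormalDist().cdf
    return Phi(-np.sqrt(D2) / 2), Phi(-np.sqrt(D2_star) / 2)

rng = np.random.default_rng(2)
X1 = rng.normal([1, 0, 0], 1.0, size=(50, 3))
X2 = rng.normal([-1, 0, 0], 1.0, size=(50, 3))
print(estimated_error_rate(X1, X2))  # true rate here is Phi(-1) = 0.159
```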
Anderson (1973a) derived asymptotic expansions for the c.d.f.'s of the Studentized $W$, i.e. $(W - \tfrac{1}{2}D^2)/D$ and $(W + \tfrac{1}{2}D^2)/D$ under $\Pi_1$ and $\Pi_2$, respectively:
$$P\Bigl\{\frac{W - \tfrac{1}{2}D^2}{D} \le u \,\Big|\, \Pi_1\Bigr\} = \Phi(u) + \frac{1}{n}\Bigl\{\cdots(1+k)\cdots - \bigl(p - \tfrac{1}{4} + \tfrac{1}{2}k\bigr)u - \tfrac{1}{4}u^3\Bigr\}\phi(u) + O(n^{-2}), \tag{2.18}$$
with a corresponding expansion for
$$P\Bigl\{\frac{W + \tfrac{1}{2}D^2}{D} \le u \,\Big|\, \Pi_2\Bigr\}, \tag{2.19}$$
where $k = \lim_{n\to\infty}(N_2/N_1)$. Using (2.18) and (2.19), Anderson (1973b) discussed
the comparison of the expansions for the c.d.f.'s, densities, and the first two moments of $W$ itself and the Studentized $W$. In particular, when $k = 1$ and $u = -\Delta/2$,
(2.20)
He also considered how one chooses the cut-off point $c = u\Delta + \tfrac{1}{2}\Delta^2$ for $W$ to achieve a given PMC $P_0$ when $\Pi_1$ is true.
When $p = 1$ and all the parameters are known, the rule $U_0 \gtrless 0$ is equivalent to
$$\bigl[x - \tfrac{1}{2}(\xi_1 + \xi_2)\bigr](\xi_1 - \xi_2)/\sigma^2 \gtrless 0 \quad\text{or}\quad x - \tfrac{1}{2}(\xi_1 + \xi_2) \lessgtr 0$$
for $\xi_1 < \xi_2$. Friedman (1965) considered the plug-in rule $x \gtrless \tfrac{1}{2}(\bar x_1 + \bar x_2)$, and compared their PMC's with approximations for large samples from $\Pi_1$ and $\Pi_2$.
The conditional distribution of $W$ given $\bar x_1$, $\bar x_2$, and $S$ is obviously normal, and the conditional PMC of the rule $W \gtrless 0$ is written as
$$\Phi\Bigl(-\frac{\{\tfrac{1}{2}(\bar x_1 + \bar x_2) - \xi_2\}'S^{-1}(\bar x_1 - \bar x_2)}{\{(\bar x_1 - \bar x_2)'S^{-1}\Sigma S^{-1}(\bar x_1 - \bar x_2)\}^{1/2}}\Bigr). \tag{2.22}$$
When all the parameters are known, we have the optimum PMC given by $\Phi(-\Delta/2)$.
The problem of estimating one or more of those PMC's and the study of the effectiveness and comparison of the estimators have been considered by many authors.
$$E\{Q_D\} = B + O_3 = \Phi(-\Delta/2) + \phi(-\Delta/2)\bigl[\,\cdots\,\bigr] + O_3, \tag{2.25}$$
where
$$b_3' = b_{23} - \frac{1}{1024\Delta}\bigl\{\Delta^6 - 4(3p+7)\Delta^4 + 16(2p^2 + 8p + 5)\Delta^2 + 64(p-1)(p-3)\bigr\},$$
$$\mathrm{Bias}(Q_D) = A - B + O_3 = \cdots + \frac{1}{16N_2\Delta}\bigl\{\Delta^2 - 4(p-1)\bigr\} + \frac{\Delta}{4n}(p-1) + \cdots, \tag{2.28}$$
$$\sigma^2 = \tfrac{1}{4}\,\phi^2(-\Delta/2)\bigl[1/N_1 + 1/N_2\bigr], \tag{2.29}$$
but if the term equal to $O_2$ is taken into account, $M_1$ does not have a normal distribution. He also proved that $(M_1 + M_2)/2$, where $M_2 = \beta_W(0 \mid K)$, has a normal distribution with mean
and variance
$$\tau^2 = \tfrac{1}{16}\,\phi^2(-\Delta/2)\Bigl[\Bigl(\frac{1}{N_1} + \frac{1}{N_2}\Bigr)\{\Delta^4 + 16(p-1)\}\cdots\Bigr].$$
$$\mathrm{AMSE}(Q_D) = \tfrac{1}{2}\phi^2(-\Delta/2)\Bigl[\cdots + \frac{1}{2N_2\Delta^2}\{\Delta^4 + 4(p-2)\Delta^2 + 16(p-1)(p-2)\} + \frac{1}{N_2}(\Delta^2 - 2p)\frac{1}{n} + \frac{1}{4N_1 n}\{\Delta^4 - (3p+14)\Delta^2 + 4p(8p-1)\} + \frac{1}{8\Delta^2 n}\bigl(\Delta^4 - 2(p+4)\Delta^2 + 8p\bigr)\Bigr] \tag{2.32}$$
and the expansions for $\mathrm{AMSE}(Q_t) - \mathrm{AMSE}(Q_D)$, $t \ne D$, were listed for several estimators, which are all $O_2$, so that they have the same leading term of the first order
$$\phi^2(-\Delta/2)\bigl\{(1/N_1) + (\Delta^2/8n)\bigr\}.$$
where $W(y)$ is the Wald $W$ for a randomly chosen member $y$ from the initial $N_1$ observations from $\Pi_1$. McLachlan (1976) derived the asymptotic bias of $R_1$ by evaluating an asymptotic expansion of $E(q_1)$, which is the unconditional expectation of $R_1$, as
$$E(q_1) = E\{\Phi(-D/2)\} + \phi(-\Delta/2)\,\frac{\Delta}{8}\Bigl[\frac{\cdots}{N_1} + \frac{(12 - \Delta^2)}{n} + \cdots\Bigr] + O_2. \tag{2.34}$$
$$\mathrm{Bias}(R_1) = \phi(-\Delta/2)\Bigl[\cdots\{\Delta^2 + 4(p-1)\} + \frac{\Delta}{4n}(p-1)\Bigr] + O_2. \tag{2.35}$$
$$P_1 = P\bigl\{(W - \tfrac{1}{2}D^2)/D < c \mid K, \Pi_1\bigr\} = \Phi\Bigl(\frac{cD + (\bar x_1 - \xi_1)'S^{-1}(\bar x_1 - \bar x_2)}{\gamma}\Bigr), \tag{2.36}$$
where $\gamma = \{(\bar x_1 - \bar x_2)'S^{-1}\Sigma S^{-1}(\bar x_1 - \bar x_2)\}^{1/2}$. This corresponds to the conditional PMC for an observation from $\Pi_1$. He showed that the distribution of $P_1$ is asymptotically normal with mean $\tilde\mu_1$ and variance
$$\sigma^2 = \frac{1}{n}\,\phi^2(c)\,b(c), \tag{2.38}$$
where
$$a(c) = \tfrac{1}{2}(p-1)(1+k) - \bigl(p - \tfrac{1}{4} + \tfrac{1}{2}k\bigr)c - \tfrac{1}{4}c^3.$$
It is noted that $\tilde\mu_1$ is the same expression as the first two terms of (2.18) with $c = u$. Using this asymptotic distribution of $P_1$, he discussed the confidence statement on $P_1$ by determining the cut-off point $c$ so that
$$c = c_0 + \frac{h_1}{n^{1/2}} + \frac{h_2}{n} + \frac{h_3}{n^{3/2}} + O(n^{-2}), \tag{2.40}$$
where $c_0 = \Phi^{-1}(M)$ and where, on writing $a(c_0)$ and $b(c_0)$ as $a_0$ and $b_0$, respectively,
$$h_1 = z b_0^{1/2},$$
$$h_2 = a_0 + \tfrac{1}{2}z^2 c_0(b_0 - 1),$$
$$h_3 = z b_0^{1/2}\bigl[c_0 a_0 + \tfrac{1}{8}c_0(z c_0 - 4a_0) + \tfrac{1}{2}z^2 b_0(c_0^2 - 1) + \tfrac{3}{4}c_0^2(1 - z^2) + \tfrac{1}{8}z^2 + p - \tfrac{1}{4} + \tfrac{1}{2}k\bigr].$$
The cut-off points $c_1$ and $c_2$ are determined separately for the upper bounds $M_1$ and $M_2$ on the conditional PMC's $P_1^*$ and $P_2^*$ and specified confidence levels $\alpha_1$ and $\alpha_2$, respectively. Since
$$P_1^* = \alpha_W(c_1, c_2 \mid K) = P\Bigl\{\frac{W - \tfrac{1}{2}D^2}{D} < c_1,\ \frac{W + \tfrac{1}{2}D^2}{D} < -c_2 \,\Big|\, K, \Pi_1\Bigr\} \le P\Bigl\{\frac{W - \tfrac{1}{2}D^2}{D} < c_1 \,\Big|\, K, \Pi_1\Bigr\} = P_1,$$
we obviously have
if we use the $c_1$ given by the formula (2.40) for $M = M_1$ and $\alpha = \alpha_1$. Similarly, we have
$$\mathrm{Var}\bigl(\alpha_W(0 \mid K)\bigr) = (2\pi)^{-1/2}\exp\{\cdots\}\bigl\{\cdots\bigr\} + O_2, \tag{2.48}$$
(2.49)
where
$$d_1 = \frac{1}{2N_1}\bigl[(p-1)\{\cdots\} + \tfrac{1}{2}(1-\gamma)\Delta\bigr] + \frac{1}{2N_2\gamma}\bigl[(p-1)\{\cdots\} + \tfrac{1}{2}(1-\gamma)\gamma^2\Delta\bigr] + \frac{1}{2N_2(1-\gamma)}\bigl[-\tfrac{1}{2}(p-1)(1-\gamma) + \tfrac{1}{2}(1-\gamma)^3\Delta\bigr], \tag{2.50}$$
$$d_2 = \frac{1}{2N_1}\bigl[(p-1)\{\cdots\} - 1 + \tfrac{1}{2}(1+\gamma)\Delta\bigr] + \frac{1}{2N_2(1-\gamma)}\bigl[-\tfrac{1}{2}(p-1)(\gamma-3) + \tfrac{1}{2}(1+\gamma)(1-\gamma)^2\Delta\bigr] + \frac{1}{2N_2\gamma}\bigl[(p-1)\{\cdots\}\bigr], \tag{2.51}$$
$$\cdots\frac{n + N_1\cdots}{N_1}\,(x - \bar x_2)'S^{-1}(x - \cdots \tag{2.52}$$
see Anderson (1958). When the cut-off point $c$ is 1, this LR rule reduces to the maximum likelihood (ML) rule $Z \gtrless 0$, where
and also for the PMC's of the $Z$-rule, i.e., $\alpha_Z(0)$ and $\beta_Z(0)$, in forms similar to those for the $W$-statistic and the $W$-rule, i.e., to (2.13) and (2.16), respectively. The coefficients $h_i$, $g_{ij}$, $a_i$, $b_{ij}$, and $c_{ijk}$ there are denoted with $*$ here.
For $P_1^*(u; \Delta)$, the first term is $\Phi(u)$ and
$$h_1^* = \frac{1}{2\Delta^2}\{u^3 - \Delta u^2 + (p-3)u + \Delta\},$$
$$h_2^* = \frac{1}{2\Delta^2}\{u^3 - \Delta u^2 + (p - 3 - \Delta^2)u + \Delta(\Delta^2 + 1)\},$$
$$g_{11}^* = \frac{1}{8\Delta^4}\{(u-\Delta)^2u^5 + (2p-17)u^5 - 2(p-13)\Delta u^4 - \cdots\Delta^2u^3 + \cdots\},$$
$$g_{12}^* = \frac{1}{4\Delta^4}\{(u - 2\Delta)u^6 + \Delta^3(2u - \Delta)u^3 + (2p-17)u^5 - 2(p-13)\Delta u^4 - (p+4)\Delta^2u^3 + (p-8)\Delta^3u^2 + 3\Delta^4u + (p^2-18p+65)u^3 + 6(2p-11)\Delta u^2 + \cdots\},$$
$$g_{22}^* = \frac{1}{8\Delta^4}\{(u+\Delta)^2(u-\Delta)^4u + (2p-17)u^5 - 2(p-13)\Delta u^4 - 2(p-1)\Delta^2u^3 + 2(p-8)\Delta^3u^2 + 7\Delta^4u - 2\Delta^5 + \cdots\},$$
$$g_{13}^* = \frac{1}{8\Delta^2}\{(u-\Delta)(2u-\Delta)^2u^4 + 2(5p-23)u^5 - 12(p-6)\Delta u^4 + \cdots - 6(p-1)(p-3)u + 2(p-1)(p-4)\Delta\},$$
$$g_{23}^* = \frac{1}{8\Delta^2}\{(u+\Delta)(u-\Delta)^2(2u-\Delta)^2u^2 + 2(5p-23)u^5 + \cdots\}.$$
It is noted that $h_3^* = h_3^*(u; \Delta) = h_3(u; -\Delta)$ and $g_{33}^* = g_{33}^*(u; \Delta) = g_{33}(u; -\Delta)$. As in the case of $W$, $P_2^*(u; \Delta)$ can be obtained from the relation (2.56); hence $\beta_Z(0)$ is the $\alpha_Z(0)$ with the interchange of $N_1$ and $N_2$. Thus the first term in the expansion for $\alpha_Z(0)$ is $1 - \Phi(\Delta/2) = \Phi(-\Delta/2)$, and the coefficients $a_i^*$, $b_{ij}^*$, and $c_{ijk}^*$ are obtained respectively from $a_i^* = -h_i^*(\tfrac{1}{2}\Delta; \Delta)$, $b_{ij}^* = -g_{ij}^*(\tfrac{1}{2}\Delta; \Delta)$, and the corresponding relation for $c_{ijk}^*$.
Tables of $a_i^*$ and $b_{ij}^*$ were prepared by Memon and Okamoto (1971) for $p = 1, 2, 3, 5, 7, 10, 20, 50$ and $\Delta = 1, 2, 3, 4, 6, 8$. Siotani and Wang (1975, 1977) prepared the tables of $c_{ijk}$ and $c_{ijk}^*$ and made a comparison of the $W$-rule and the $Z$-rule with respect to the PMC.
As in the case of $W$, an asymptotic expansion for the distribution of the Studentized $Z$ was given by Fujikoshi and Kanazawa (1976), which may be rewritten as
$$P\Bigl\{\frac{Z + D^2}{2D} \le u \,\Big|\, \Pi_1\Bigr\} = \Phi(u) - \phi(u)\Bigl[\frac{1}{2N_1\Delta}\bigl(\Delta u - u^2 + p - 1\bigr) + \frac{1}{2N_2\Delta}\bigl\{(u - \Delta)^2 + (p-1)\bigr\} + \frac{1}{4n}(u^2 + 4p - 3)u\Bigr] + O_2. \tag{2.58}$$
The case when $x$ comes from $\Pi_2$ can be treated by using a relation similar to (2.56).
For the estimation of $\alpha_Z(0)$, $\beta_Z(0)$ or $\alpha_Z(0 \mid K)$, $\beta_Z(0 \mid K)$, we may make an investigation similar to the one mentioned in the last section for the estimators of the PMC's of the $W$-rule.
Siotani (1980) obtained the asymptotic expansions for the conditional distributions of $(Z - (-1)^i\Delta^2)/2\Delta$ and of the Studentized $Z$, i.e.
$$Z_{Si} = (Z - (-1)^i D^2)/2D,$$
given $K$ when $x$ comes from $\Pi_i$, $i = 1, 2$. From those results, the following large sample approximations to the distributions of conditional PMC's were derived: If the second order term $O_2$ with respect to $(N_1^{-1}, N_2^{-1}, n^{-1})$ is ignored, then the conditional PMC $\alpha_Z(c \mid K)$ has a normal distribution with mean
$$\tilde\mu_1 = \cdots + \frac{1}{16N_1\Delta}\bigl\{c^3 + \cdots + (4p - 4 - \Delta^2)\Delta^4\bigr\} + \frac{1}{16N_2\Delta}\bigl\{c^3 + \Delta^2 c^2 + (4p - 12 - 5\Delta^2)\Delta^2 c + (4p - 4 + 3\Delta^2)\Delta^4\bigr\} \tag{2.59}$$
and variance
$$\tilde\sigma^2 = \cdots\frac{(c + \Delta^2)^2}{2\Delta} + \frac{\cdots}{4N_2\Delta^4}, \tag{2.60}$$
with mean
$$\mu_0^* = \Phi(-\Delta/2) + \Bigl[\frac{1}{16N_1\Delta}\{4(p-1) - \Delta^2\} + \frac{1}{16N_2\Delta}\{4(p-1) + 3\Delta^2\} + \frac{1}{4n}(p-1)\Delta\Bigr]\phi(-\Delta/2) \tag{2.61}$$
and variance
For the Studentized form, we consider the classification rule: assign $x$ to $\Pi_1$ if $Z_{S1} < c$ and to $\Pi_2$ otherwise. Then if the term equal to $O_2$ is ignored, the distribution of the conditional PMC $\alpha_{Z_{S1}}(c \mid K)$ is $N(m_1^*, \tau_1^{*2})$, where
For the asymptotic distributions of $\beta_Z(c \mid K)$, $\beta_Z(0 \mid K)$, and $\beta_{Z_{S2}}(c \mid K)$, they are all normal with means and variances calculated from $(\tilde\mu_1, \tilde\sigma^2)$ in (2.59) and (2.60), $(\mu_0^*, \sigma_0^{*2})$ in (2.61) and (2.62), and $(m_1^*, \tau_1^{*2})$ in (2.63) and (2.64), respectively, by interchanging $N_1$ and $N_2$.
where $x$ is a subvector of $p$ discriminators, $y$ a subvector of $q$ covariates, $E(y \mid \Pi_i) = \eta_i$, and $\Sigma$ is a $(p+q) \times (p+q)$ covariance matrix with the partition corresponding to the partition of the random vector. They introduced $x^* = x - By$, where
$$\bar x_i^* = \bar x_i - B\bar y_i, \quad i = 1, 2,$$
$$S\ \bigl((p+q)\times(p+q)\bigr) = \begin{pmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix},$$
$$B\ (p \times q) = S_{12}S_{22}^{-1}; \qquad S_{11.2} = S_{11} - S_{12}S_{22}^{-1}S_{21}.$$
$Z^{**}$ was first mentioned by Memon (1968). When $q = 0$, both $Z^*$ and $Z^{**}$ are equal to $Z$.
The limiting distributions of $W^*$, $Z^*$, and $Z^{**}$ as $N_1, N_2, n \to \infty$ ($\lim r_i = $ const.) are as follows:
Since these $Q_i^{(j)}(u; \Delta^*)$ become the corresponding probability functions in the non-covariate case when $q = 0$, they are written in the form
where the $P_i^{(j)}(u; \Delta^*)$ are independent of $q$ and hence are the parts due to the discriminators, while the $L_i^{(j)}(u; \Delta^*)$ are due to the covariates. From those results, asymptotic expansions for the PMC's of the $W^*$, $Z^*$, and $Z^{**}$ rules with cut-off point 0 are given by
(2.76)
where
$$b_{13}^{(1)} = \frac{1}{128\Delta^*}\{\Delta^{*4} + 4(3p-4)\Delta^{*2} + 48(p-1)\},$$
$$b_{23}^{(1)} = \frac{1}{128\Delta^*}\{\Delta^{*4} - 4p\Delta^{*2} - 16(p-1)\},$$
$$b_{33}^{(1)} = \cdots\{(2p+q)(\Delta^{*2} + 4) - 16\}, \qquad \cdots$$
$$b_{13}^{(2)} = b_{13}^{(3)} = \frac{1}{128\Delta^*}\{-\Delta^{*4} + 4p\Delta^{*2} + 16(p-1)\},$$
$$b_{23}^{(2)} = b_{23}^{(3)} = \frac{1}{128\Delta^*}\{3\Delta^{*4} + 4(p-4)\Delta^{*2} + 16(p-1)\}.$$
It is noted here that the asymptotic expansions for $\alpha_{Z^*}(0)$ and $\alpha_{Z^{**}}(0)$ are equal if the terms of the third order are negligible. McGee (1976) and Kanazawa, McGee, and Siotani (1979) made a comparison of those covariate classification rules on the basis of the PMC's thus obtained.
Tables of the coefficients $a$'s, $b$'s as well as $c$'s in the expansions for $\alpha_{W^*}(0)$ and $\alpha_{Z^{**}}(0)$ were given by McGee (1976).
As in the non-covariate case, asymptotic expansions for the c.d.f.'s of the Studentized $W^*$, $Z^*$, and $Z^{**}$ are available. Kanazawa and Fujikoshi (1977) gave the formula
$$P\bigl\{(W^* - \tfrac{1}{2}D^{*2})/D^* \le u \mid \Pi_1\bigr\} = \Phi(u) + \cdots, \tag{2.78}$$
which may be compared with Anderson's formula (2.18). Fujikoshi and Kanazawa (1976) obtained the formulae for the Studentized $Z^*$ and $Z^{**}$, which are expressed as
$$P\Bigl\{\frac{Z^* + D^{*2}}{2D^*} \le u \,\Big|\, \Pi_1\Bigr\} = \bigl[\text{right hand side of (2.58) with } \Delta^* \text{ instead of } \Delta\bigr] - \frac{q}{2n}\bigl(u + \cdots\bigr) + O_2, \tag{2.79}$$
$$P\Bigl\{\frac{Z^{**} + D^{*2}}{2D^*} \le u \,\Big|\, \Pi_1\Bigr\} = \bigl[\text{right hand side of (2.58) with } \Delta^* \text{ instead of } \Delta\bigr] - \frac{qu}{n} + O_2. \tag{2.80}$$
(2.83)
$$\Delta_n(Z^*) = \cdots + \frac{1}{2N_2\Delta^*}\bigl\{(p-1) + \cdots\bigr\} + \frac{1}{4n}(u^2 + 4p - 3)u + \frac{qu}{n}, \tag{2.85}$$
where $u_\varepsilon$ is the upper $100\varepsilon\%$ point of $N(0, 1)$.
$$\cdots + \log\bigl(|\Sigma_1|/|\Sigma_2|\bigr). \tag{3.1}$$
If $\xi_1 = \xi_2 = \xi_0$, then this reduces to a quadratic form. A covariance matrix whose rows are successive cyclic shifts of $(\rho_1, \rho_2, \rho_3, \rho_4, \cdots)$ is said to have the circular structure (Han, 1970). An optimum rule in the class of rules based on linear functions of $x$ is studied by Kullback (1952, 1958), Clunies-Ross and Riffenburgh (1960), Anderson and Bahadur (1962), and Banerjee and Marcus (1965). There are other studies on optimum rules when parameters are known or unknown, but only a few approximations to or asymptotic expansions for the distributions of classification statistics are known.
Okamoto (1961) considered the plug-in version of $U_2$, apart from a constant, and
$$Q = (x - \bar x)'\bigl(S_{01}^{-1} - S_{02}^{-1}\bigr)(x - \bar x), \quad\text{when } \xi_0 \text{ is unknown}, \tag{3.7}$$
where
$$\bar x = \frac{1}{N_1 + N_2}\bigl(N_1\bar x_1 + N_2\bar x_2\bigr), \quad\text{the sample grand mean vector},$$
$$S_{0i} = \frac{1}{N_i}\sum_{\alpha=1}^{N_i}\bigl(x_\alpha^{(i)} - \bar x\bigr)\bigl(x_\alpha^{(i)} - \bar x\bigr)', \quad i = 1, 2.$$
for an appropriate choice of a value of $s = 0, 1, \ldots, q$, where the $l_j$'s ($l_1 \ge l_2 \ge \cdots \ge l_p$) are the roots of $|S_{02} - lS_{01}| = 0$ and $z_j$ is the $j$th component of $z = \Gamma(x - \xi_0)$, $\Gamma$ being a nonsingular matrix such that $\Gamma'S_{01}\Gamma = I_p$, $\Gamma'S_{02}\Gamma = \mathrm{diag}(l_1, l_2, \ldots, l_p)$. Okamoto (1961) gave an asymptotic expansion for the distribution of $Q_s$ only for $q = 1$, i.e., $Q_1 = \{1 - (1/l_1)\}z_1^2$, in the following form:
$$P(2 \mid 1) = P\{Q_1 > k(l_1) \mid \Pi_1\} = \cdots,$$
where $n = N_1$, $\lambda_1$ is the largest root of $|\Sigma_2 - \lambda\Sigma_1| = 0$, $k(\lambda_1)$ is the cut-off point for the Bayes rule or minimax rule when $q = 1$ in the reduced form, which is a function of $\lambda_1$, and $k(l_1)$ is obtained by replacing $\lambda_1$ by $l_1$ in $k(\lambda_1)$.
$$B = \frac{(\lambda - 1)^2}{2\lambda^2}\Bigl\{\cdots + c^2\bigl(K - \lambda K'\bigr)^2\Bigr\}\Big/(\lambda - 1)\cdots, \tag{3.10}$$
$$V_1 = \cdots\,(x - \xi_2)'\Sigma^{-1}(x - \xi_2), \tag{3.13}$$
$$\cdots\bigl[(y + a\theta)'\Sigma^{-1}(y + a\theta) - a(a+1)\Delta^2\bigr]\cdots + \frac{\cdots}{N} + O_2. \tag{3.17}$$
$$\Gamma_2(v) = P\{V_1 \le v \mid \Pi_2\} = \cdots$$
$$b_1(d) = a_1(d),$$
$$b_2(d) = -p(a+1)^2 d - 4a(a+1)^3\Delta^2 d^2, \qquad b_3(d) = 2(a+1)^4\Delta^2 d^2,$$
$$b_4(d) = -pa^2 d - 4a^2(a+1)^2\Delta^2 d^2, \qquad b_5(d) = 2a^2(a+1)^2\Delta^2 d^2.$$
Han (1970) considered the distribution of the plug-in version of $U_1$ of (3.1), apart from the term $\log|\Sigma_1|/|\Sigma_2|$, when the $\Sigma_i$ have the circular structure given in (3.5). In this case, there exists an orthogonal matrix $H$ with $(j, k)$-element $h_{jk}$ such that $H'\Sigma_iH = \mathrm{diag}(\sigma_{i1}^2, \sigma_{i2}^2, \ldots, \sigma_{ip}^2)$ (cf. Wise (1955)). Since $U_1$ is invariant
$$V_2 = \sum_{j=1}^{p}\Bigl(\frac{1}{\sigma_{2j}^2} - \frac{1}{\sigma_{1j}^2}\Bigr)\Bigl(x_j - \frac{\xi_{2j}/\sigma_{2j}^2 - \xi_{1j}/\sigma_{1j}^2}{1/\sigma_{2j}^2 - 1/\sigma_{1j}^2}\Bigr)^2, \tag{3.20}$$
where $x_j$ and $\xi_{ij}$ are the $j$th components of $x$ and $\xi_i$, respectively. If $\sigma_{1j}^2 > \sigma_{2j}^2$ for all $j$, or equivalently $\Sigma_1 - \Sigma_2$ is positive definite, then when $x$ comes from $\Pi_i$, $i = 1$ or 2, $V_2$ is distributed as the sum $\sum_j \pi_{ij}^2\chi_1^2(\eta_{ij}^2)$, where $\chi_1^2(\eta_{ij}^2)$ is a noncentral chi-square variable with 1 d.f. and noncentrality parameter $\eta_{ij}^2 = m_{ij}^2/\pi_{ij}^2$, and
$$m_{ij} = \Bigl(\frac{1}{\sigma_{2j}^2} - \frac{1}{\sigma_{1j}^2}\Bigr)^{1/2}\Bigl(\frac{\xi_{2j}/\sigma_{2j}^2 - \xi_{1j}/\sigma_{1j}^2}{1/\sigma_{2j}^2 - 1/\sigma_{1j}^2} - \xi_{ij}\Bigr), \tag{3.21}$$
$$\pi_{ij}^2 = \sigma_{ij}^2\bigl(1/\sigma_{2j}^2 - 1/\sigma_{1j}^2\bigr), \tag{3.22}$$
$$a_i = \frac{\sum_j \pi_{ij}^4 + 2\sum_j \pi_{ij}^2 m_{ij}^2}{\sum_j \pi_{ij}^2 + \sum_j m_{ij}^2}, \qquad \rho_i = \frac{\bigl(\sum_j \pi_{ij}^2 + \sum_j m_{ij}^2\bigr)^2}{\sum_j \pi_{ij}^4 + 2\sum_j \pi_{ij}^2 m_{ij}^2}. \tag{3.23}$$
When the $\xi_i$ are unknown and estimated by the sample means $\bar x_i$, the plug-in statistic is
$$V_3 = \sum_j\Bigl(\frac{1}{\sigma_{2j}^2} - \frac{1}{\sigma_{1j}^2}\Bigr)\Bigl(x_j - \frac{\bar x_{2j}/\sigma_{2j}^2 - \bar x_{1j}/\sigma_{1j}^2}{1/\sigma_{2j}^2 - 1/\sigma_{1j}^2}\Bigr)^2 \qquad (\Sigma_1 > \Sigma_2), \tag{3.24}$$
under the assumption that $\Sigma_1 > \Sigma_2$, where $s_{ij}^2 = \sum_\alpha (x_{ij\alpha} - \bar x_{ij})^2/n_i$, $n_i = N_i - 1$, $N_i$ being the sample sizes. It is noted here that, without loss of generality, we may let $\xi_1 = 0$, $\Sigma_1 = I_p$, $\xi_2 = \xi_0 = (\xi_{01}, \xi_{02}, \ldots, \xi_{0p})$, and $\Sigma_2 = \Sigma_0 =$
$\mathrm{diag}(\sigma_{01}^2, \sigma_{02}^2, \ldots, \sigma_{0p}^2)$. The results were given in the following forms:
$$F_2(v) = P\{V_4 \le v \mid \Pi_2\} = \sum_{j=1}^{p}\bigl\{L_{\nu j}(v) + r_{1j}L_{\nu j}^{(1)}(v) + r_{2j}L_{\nu j}^{(2)}(v) + \cdots\bigr\},$$
where $G_{\nu j}(v)$ is the c.d.f. of a noncentral chi-square distribution with $\nu$ d.f. and noncentrality parameter $\xi_{0j}^2/\sigma_{0j}^4 a_j^2$, $G_{\nu j}^{(k)}(v)$ is the $k$th derivative of $G_{\nu j}(v)$, $L_{\nu j}(v)$ is the c.d.f. of a noncentral chi-square distribution with $\nu$ d.f. and noncentrality parameter $\xi_{0j}^2/\sigma_{0j}^2\alpha_j^2$, $L_{\nu j}^{(k)}(v)$ is the $k$th derivative of $L_{\nu j}(v)$, $\alpha_j = (1/\sigma_{0j}^2) - 1$, and the coefficients $a_j$, $b_j$, $b_j'$, $c_j$, $d_j$ and $d_j'$ are further explicit rational functions of $\xi_{0j}$, $\sigma_{0j}^2$ and $\alpha_j$.
and
$$r_{1j} = \frac{1}{N_1 a_j} + \frac{1}{N_2\sigma_{0j}^4 a_j} + \frac{A_j^2}{n_1} + \frac{\sigma_{0j}^8 B_j^2}{n_2},$$
$$r_{2j} = \frac{2\xi_{0j}^2}{N_1 a_j^2} + \frac{2\xi_{0j}^2}{N_2\sigma_{0j}^4 a_j^2} + \frac{1}{n_1}\bigl(A_j^2 + 2\sigma_{0j}^2 A_j + C_j^2\bigr) + \cdots,$$
$$r_{3j} = \bigl(4\sigma_{0j}^2 C_j + 2C_j a_j\bigr)/n_1 + \bigl(4\sigma_{0j}^4 D_j - 2\sigma_{0j}^4 D_j B_j\bigr)/n_2, \qquad r_{4j} = C_j^2/n_1 + \sigma_{0j}^8 D_j^2/n_2,$$
with the notations $A_j$, $B_j$, $C_j$ and $D_j$ standing for further explicit functions of $\xi_{0j}$, $\sigma_{0j}^2$ and $\alpha_j$.
Kullback (1952, 1958) suggested a rule based on the linear statistic which maximizes the divergence $J(1, 2)$ between $N_p(\xi_1, \Sigma_1)$ and $N_p(\xi_2, \Sigma_2)$. Matusita (1967) considered a minimum distance rule based on the distance
The description given here is not a general explanation of the classification problem but a short note on the main topic of this chapter in the non-normal and discrete cases.
possible patterns of zeros and ones. We call each unique pattern a state, and with each state a probability is associated. In the case of two groups $\Pi_1$ and $\Pi_2$, we denote the probability distributions of $x$ in $\Pi_1$ and $\Pi_2$ by $p(x)$ and $q(x)$, respectively. More specifically, we have $(p_1, p_2, \ldots, p_s)$ and $(q_1, q_2, \ldots, q_s)$ in $\Pi_1$ and $\Pi_2$, respectively, where $p_i$ is the probability of the $i$th state in $\Pi_1$ and $q_i$ is the probability of the $i$th state in $\Pi_2$. If $p(x)$ and $q(x)$, or $p_i$, $i = 1, \ldots, s$ and $q_i$, $i = 1, \ldots, s$, are known, then the optimal classification procedure is based on the likelihood ratio $p(x)/q(x)$ or equivalently on
(4.2)
where
Similarly
(4.4)
where
(the symmetric Kullback-Leibler information measure) is small, and if the $x_i$'s are not highly interdependent, then $L(x)$ is approximately normally distributed in $\Pi_1$ and $\Pi_2$ with means $\mu_1$ and $\mu_2$ and variances $\sigma_1^2$ and $\sigma_2^2$, respectively. He also showed that $\sigma_1$ and $\sigma_2$, under some conditions on $p$ and $q$, may be approximated by $J^{1/2} = (\mu_1 - \mu_2)^{1/2}$. It should be noted that $\mu_1 > 0$ and $\mu_2 < 0$ unless $p$ and $q$ are identical distributions. The PMC's associated with a cut-off point $c$ are then
$$\alpha_L(c) = \sum_{\{x:\,L(x) \le c\}} p(x) \approx \Phi\Bigl(\frac{c - \mu_1}{J^{1/2}}\Bigr), \tag{4.6}$$
$$\beta_L(c) = \sum_{\{x:\,L(x) > c\}} q(x) \approx \Phi\Bigl(\frac{-c + \mu_2}{J^{1/2}}\Bigr). \tag{4.7}$$
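A numerical sketch of this normal approximation for two given state distributions (the example probabilities below are illustrative assumptions):

```python
from math import log, sqrt
from statistics import NormalDist

Phi = NormalDist().cdf

# State probabilities of the patterns under Pi_1 (p) and Pi_2 (q).
p = [0.40, 0.30, 0.20, 0.10]
q = [0.10, 0.20, 0.30, 0.40]

mu1 = sum(pi * log(pi / qi) for pi, qi in zip(p, q))   # E{L(x) | Pi_1} > 0
mu2 = sum(qi * log(pi / qi) for pi, qi in zip(p, q))   # E{L(x) | Pi_2} < 0
J = mu1 - mu2                                          # symmetric K-L divergence

c = 0.0
alpha = Phi((c - mu1) / sqrt(J))    # approximation (4.6)
beta = Phi((-c + mu2) / sqrt(J))    # approximation (4.7)
print(mu1, mu2, alpha, beta)
```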
Solomon (1961) used the representations (4.2) and (4.4) to assess the loss of information incurred by the approximations to $p(x)$ and $q(x)$, by exploring the PMC's of classification procedures using test-item dichotomous response data.
Moore (1973) discussed and evaluated five procedures for classification with binary variables, among them the plug-in versions of the first and the second order approximations to the Bahadur models (4.2) and (4.4), i.e., the first approximations
$$p^{(1)}(x) = \prod_{i=1}^{p} \alpha_i^{x_i}(1 - \alpha_i)^{1 - x_i}, \qquad q^{(1)}(x) = \prod_{i=1}^{p} \beta_i^{x_i}(1 - \beta_i)^{1 - x_i}. \tag{4.8}$$
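A minimal sketch of the plug-in rule based on these first order (independence) approximations, with the marginal probabilities estimated from training samples (the function names and data are illustrative assumptions):

```python
import numpy as np

def fit_marginals(X, eps=1e-3):
    """Estimate alpha_i = P(x_i = 1) from a binary training sample,
    clipped away from 0 and 1 to keep the log-likelihood finite."""
    return np.clip(X.mean(axis=0), eps, 1 - eps)

def classify(x, alpha, beta):
    """Classify into Pi_1 iff log p^(1)(x) >= log q^(1)(x), cf. (4.8)."""
    lp = np.sum(x * np.log(alpha) + (1 - x) * np.log(1 - alpha))
    lq = np.sum(x * np.log(beta) + (1 - x) * np.log(1 - beta))
    return 1 if lp >= lq else 2

rng = np.random.default_rng(3)
X1 = (rng.random((100, 5)) < [0.8, 0.7, 0.6, 0.3, 0.2]).astype(int)
X2 = (rng.random((100, 5)) < [0.2, 0.3, 0.4, 0.7, 0.8]).astype(int)
alpha, beta = fit_marginals(X1), fit_marginals(X2)
print(classify(np.array([1, 1, 1, 0, 0]), alpha, beta))  # -> 1 (typically)
```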
They discussed the estimation of the $\alpha_j$ and $\beta_j$ under some constraints. Using these estimates, we have the plug-in classification rule: classify $x$ into $\Pi_1$ if $h(\hat\alpha_1, x) \ge h(\hat\alpha_2, x)$, which is equivalent, when all estimators are included, to the rule: classify $x$ into $\Pi_1$ if $n_1(x)/n_1 \ge n_2(x)/n_2$, where the $n_i$ are the sizes of samples independently taken from $\Pi_i$, and $n_i(x)$ are the frequencies of state $x$ in $\Pi_i$.
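A sketch of this frequency-based multinomial rule (the tuple encoding of states and the `Counter` bookkeeping are implementation choices, not notation from the chapter):

```python
from collections import Counter

def multinomial_rule(x, sample1, sample2):
    """Classify state x into Pi_1 iff n1(x)/n1 >= n2(x)/n2, where
    n_i(x) is the frequency of state x in the sample from Pi_i."""
    c1, c2 = Counter(sample1), Counter(sample2)
    return 1 if c1[x] / len(sample1) >= c2[x] / len(sample2) else 2

sample1 = [(1, 1), (1, 1), (1, 0), (0, 1)]   # states observed from Pi_1
sample2 = [(0, 0), (0, 0), (0, 1), (1, 0)]   # states observed from Pi_2
print(multinomial_rule((1, 1), sample1, sample2))  # -> 1
```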
There are many ways to represent binary data. Cox (1972) gave a brief overview of the properties and problems of various methods used in multivariate binary distributions. Among them is the representation
$$P(x) = \frac{1}{s}\sum_{r \in S(x)} d_r\,\psi_r(x), \tag{4.15}$$
where $S(x)$ is the set of all the $s$ state points $x$, and the coefficients $d_r$ are, using the orthogonality of the $\psi_r(x)$, evaluated by
$$d_r = E\{\psi_r(x)\}. \tag{4.17}$$
Classify $x$ into $\Pi_1$ ($\Pi_2$) if
$$\sum_{r \in S(x)} d_{1,r}\psi_r(x) > \sum_{r \in S(x)} d_{2,r}\psi_r(x) \quad (<), \tag{4.18}$$
and randomly otherwise, where the sets $\{d_{1,r}\}$ and $\{d_{2,r}\}$ are associated with $\Pi_1$ and $\Pi_2$, respectively. If all the parameters are to be estimated from available independent samples, the plug-in rule is simply the rule given by (4.18) with the $\{d_{i,r}\}$ replaced by their estimates.
where $(f_1, f_2, \ldots, f_s)$ and $(g_1, g_2, \ldots, g_s)$ are the state probabilities corresponding to the two distributions $F$ and $G$, respectively. Suppose that independent samples of sizes $n_1$ and $n_2$ from $\Pi_1$ and $\Pi_2$ are available. We wish to classify a new sample of size $n_0$ into either $\Pi_1$ or $\Pi_2$. Let $S_1$, $S_2$, and $S_0$ be the empirical distributions formed on the basis of these independent samples. Then Matusita's sample-based rule is: classify the new sample into $\Pi_1$ ($\Pi_2$) if the distance of $S_0$ from $S_1$ is not greater (greater) than that from $S_2$. He obtained lower bounds for the PCC and an approximate value of the PCC when sample sizes are large.
Dillon and Goldstein (1978) considered a modification for the case of $n_0 = 1$.
But Glick (1972, 1973) showed rigorously that the difference has an upper bound which diminishes to zero exponentially as the sample size $n \to \infty$, and also that $P\{\text{actual} = \text{optimum}\} \to 1$ exponentially. He also gave a proof of the proposition that the expected excess of the apparent non-error rate (Smith's resubstitution or reallocation estimator of PMC) over the optimum non-error rate has an upper bound proportional to $n^{-1/2}a^n$ where $a < 1$. Based on Glick's work, Goldstein and Rabinowitz (1975) discussed a sample-based procedure for selecting an optimum subset of variables.
Glick's results contain a generalization of the results obtained by Cochran and Hopkins (1961), and Hills (1966).
4.4. Classification when both continuous and discrete variables are involved
Chang and Afifi (1974) suggested a method suitable for one binary and $p$ continuous variables, based on the location model. An extension to the case of $q$ binary and $p$ continuous variables was proposed by Krzanowski (1975) under Olkin and Tate's (1961) location model and LR classification, and its plug-in version was considered. He also discussed the conditions for success or failure in the performance of the LDF.
References
[1] Anderson, J. A. (1972). Separate sample logistic discrimination. Biometrika 59, 19-36.
[2] Anderson, T. W. (1951). Classification by multivariate analysis. Psychometrika 16, 31-50.
[3] Anderson, T. W. (1958). An Introduction to Multivariate Statistical Analysis. Wiley, New York.
[4] Anderson, T. W. (1973a). An asymptotic expansion of the distribution of the studentized
classification statistic. Ann. Statist. 1, 964-972.
[5] Anderson, T. W. (1973b). Asymptotic evaluation of the probabilities of misclassification by linear discriminant functions. In: T. Cacoullos, ed., Discriminant Analysis and Applications, 17-35. Academic Press, New York.
[6] Anderson, T. W. and Bahadur, R. R. (1962). Classification into two multivariate normal
distributions with different covariance matrices. Ann. Math. Statist. 33, 420-431.
[7] Bahadur, R. R. (1961). On classification based on response to N dichotomous items. In: H.
Solomon, ed., Studies in Item Analysis and Prediction, 169-176. Stanford Univ. Press, Stanford.
[8] Banerjee, K. and Marcus, L. F. (1965). Bounds in a minimax classification procedure. Bio-
metrika 52, 653-654.
[9] Bartlett, M. S. and Please, N. W. (1963). Discrimination in the case of zero mean differences.
Biometrika 50, 17-21.
[10] Bowker, A. H. (1961). A representation of Hotelling's T² and Anderson's classification statistic W in terms of simple statistics. In: H. Solomon, ed., Studies in Item Analysis and Prediction, 285-292. Stanford Univ. Press, Stanford.
[11] Bowker, A. H. and Sitgreaves, R. (1961). An asymptotic expansion for the distribution function
of the W-classification statistic. In: H. Solomon, ed., Studies in Item Analysis and Prediction,
293-310. Stanford Univ. Press, Stanford.
[12] Chang, P. C. and Afifi, A. A. (1974). Classification based on dichotomous and continuous
variables. J. Amer. Statist. Assoc. 69, 336-339.
[13] Clunies-Ross, C. W. and Riffenburgh, R. H. (1960). Geometry and linear discrimination.
Biometrika 47, 185-189.
[14] Cochran, W. G. (1964). Comparison of two methods of handling covariates in discriminatory
analysis. Ann. Inst. Statist. Math. 16, 43-53.
[15] Cochran, W. G. and Bliss, C. I. (1946). Discriminant functions with covariance. Ann. Math.
Statist. 19, 151-176.
[16] Cochran, W. G. and Hopkins, C. E. (1961). Some classification problems with multivariate
qualitative data. Biometrics 17, 10-32.
[17] Cooper, P. W. (1962a). The hyperplane in pattern recognition. Cybernetica 5, 215-238.
[18] Cooper, P. W. (1962b). The hypersphere in pattern recognition. Information and Control 5, 324-346.
[19] Cooper, P. W. (1963). Statistical classification with quadratic forms. Biometrika 50, 439-448.
[20] Cooper, P. W. (1965). Quadratic discriminant functions in pattern recognition. I E E E Trans.
Inform. Theory 11, 313-315.
[21] Cox, D. R. (1966). Some procedures associated with the logistic qualitative response curve. In:
F. N. David, ed., Research Papers in Statistics: Festschrift for J. Neyman, 55-71. Wiley, New
York.
[22] Cox, D. R. (1972). The analysis of multivariate binary data. Appl. Statist. 21, 113-120.
[23] Das Gupta, S. (1965). Optimum classification rules for classification into two multivariate
normal populations. Ann. Math. Statist. 36, 1174-1184.
[24] Das Gupta, S. (1973). Theories and methods in classification: A review. In: T. Cacoullos, ed.,
Discriminant Analysis and Applications, 77-137. Academic Press, New York.
[25] Day, N. E. and Kerridge, D. F. (1967). A general maximum likelihood discriminant. Biometrics
23, 313-323.
[26] Dillon, W. R. and Goldstein, M. (1978). On the performance of some multinomial classification
rules. J. Amer. Statist. Assoc. 73, 305-313.
[27] Elfving, G. (1961). An expansion principle for distribution functions with applications to
Student's statistic and the one-dimensional classification statistic. In: H. Solomon, ed., Studies in
Item Analysis and Prediction, 276-284. Stanford Univ. Press, Stanford.
[28] Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Ann. Eugenics
7, 179-188.
[29] Friedman, H. D. (1965). On the expected error in the probability of misclassification. Proc.
IEEE 53, 658-659.
[30] Fujikoshi, Y. and Kanazawa, M. (1976). The ML classification statistic in covariate discriminant
analysis and its asymptotic expansions. Essays in Probability and Statistics (Ogawa Volume),
305-320. Shinko-Tsusho, Tokyo.
[31] Gilbert, E. S. (1968). On discrimination using qualitative variables. J. Amer. Statist. Assoc. 63,
1399-1412.
[32] Glick, N. (1972). Sample-based classification procedures derived from density estimators. J.
Amer. Statist. Assoc. 67, 116-122.
[33] Glick, N. (1973). Sample-based multinomial classification. Biometrics 29, 241-256.
[34] Goldstein, M. (1976). An approximate test for comparative discriminatory power. Multiv.
Behav. Res. 11, 157-163.
[35] Goldstein, M. (1977), A two-group classification procedure for multivariate dichotomous
responses. Multiv. Behav. Res. 12, 335-346.
[36] Goldstein, M. and Rabinowitz, M. (1975). Selection of variates for the two-group multinomial
classification problem. J. Amer. Statist. Assoc. 70, 776-781.
[37] Han, C. P. (1968). A note on discrimination in the case of unequal covariance matrices.
Biometrika 55, 586-587.
[38] Han, C. P. (1969). Distribution of discriminant function when covariance matrices are propor-
tional. Ann. Math. Statist. 40, 979-985.
[39] Han, C. P. (1970). Distribution of discriminant function in circular models. Ann. Inst. Statist. Math. 22, 117-125.
[40] Hills, M. (1966). Allocation rules and their error rates. J. Roy. Statist. Soc. Set. B 28, 1-31.
[41] Hills, M. (1967). Discrimination and allocation with discrete data. Appl. Statist. 16, 237-250.
[42] Hill, G. W. and Davis, A. W. (1968). Generalized asymptotic expansions of Cornish-Fisher
type. Ann. Math. Statist. 39, 1264-1273.
[43] John, S. (1960). On some classification problems, I, II. Sankhy~ 22, 301-308,309-316.
[44] John, S. (1963). On classification by the statistics R and Z. Ann. Inst. Statist. Math. 14,
237-246.
[45] Johnson, N. L. (1949). Systems of frequency curves generated by methods of translation. Biometrika 36, 149-176.
[46] Kanazawa, M. (1979). The asymptotic cut-off point and comparison of error probabilities in
covariate discriminant analysis. J. Japan Statist. Soc. 9, 7-17.
[47] Kanazawa, M. and Fujikoshi, Y. (1977). The distribution of the Studentized classification
statistics W* in covariate discriminant analysis. J. Japan Statist. Soc. 7, 81-88.
[48] Kanazawa, M., McGee, R. I., and Siotani, M. (1979). Comparison of the three procedures in
covariate discriminant analysis. Unpublished paper.
[49] Kronmal, R. and Tarter, M. (1968). The estimation of probability densities and cumulatives by
Fourier series methods. J. Amer. Statist. Assoc. 63, 925-952.
[50] Krzanowski, W. J. (1975). Discrimination and classification using both binary and continuous
variables. J. Amer. Statist. Assoc. 70, 782-790.
[51] Krzanowski, W. J. (1976). Canonical representation of the location model for discrimination or
classification. J. Amer. Statist. Assoc. 71, 845-848.
[52] Krzanowski, W. J. (1977). The performance of Fisher's linear discriminant function under non-optimal conditions. Technometrics 19, 191-200.
[53] Kullback, S. (1952). An application of information theory to multivariate analysis, I. Ann. Math.
Statist. 23, 88-102.
[54] Kullback, S. (1959). Information Theory and Statistics. Wiley, New York.
[55] Kudo, A. (1959). The classificatory problem viewed as a two-decision problem. Mem. Fac. Sci. Kyushu Univ. Ser. A 13, 96-125.
[56] Kudo, A. (1960). The classificatory problem viewed as a two-decision problem, II. Mem. Fac. Sci. Kyushu Univ. Ser. A 14, 63-83.
[57] Lachenbruch, P. A., Sneeringer, C., and Revo, L. T. (1973). Robustness of the linear and quadratic discriminant functions to certain types of non-normality. Comm. Statist. 1, 39-56.
[58] Lachenbruch, P. A. (1966). Discriminant analysis when the initial samples are misclassified.
Technometrics 8, 657-662.
[59] Lachenbruch, P. A. and Mickey, M. R. (1968). Estimation of error rates in discriminant analysis.
Technometrics 10, 1-11.
[60] Linhart, H. (1961). Zur Wahl von Variablen in der Trennanalyse. Metrika 4, 126-139.
[61] Matusita, K. (1956). Decision rule, based on the distance, for the classification problem. Ann.
Inst. Statist. Math. 8, 67-77.
[62] Matusita, K. (1967). Classification based on distance in multivariate Gaussian cases. Proc. Fifth Berkeley Symp. Math. Statist. Prob. 1, 299-304.
[63] Martin, D. C. and Bradley, R. A. (1972). Probability models, estimation and classification for
multivariate dichotomous populations. Biometrics 28, 203-222.
[64] McGee, R. I. (1976). Comparison of the W* and Z* procedures in covariate discriminant analysis. Dissertation submitted in partial fulfillment of Ph.D. requirements. Kansas State Univ.
[65] McLachlan, G. J. (1972). Asymptotic results for discriminant analysis when the initial samples
are misclassified. Technometrics 14, 415-422.
[66] McLachlan, G. J. (1973). An asymptotic expansion of the expectation of the estimated error rate
in discriminant analysis. Austral. J. Statist. 15, 210-214.
[67] McLachlan, G. J. (1974a). The asymptotic distributions of the conditional error rate and risk in
discriminant analysis. Biometrika 61, 131-135.
[68] McLachlan, G. J. (1974b). An asymptotically unbiased technique for estimating the error rates in discriminant analysis. Biometrics 30, 239-249.
[69] McLachlan, G. J. (1974c). Estimation of the errors of misclassification on the criterion of
asymptotic mean square error. Technometrics 16, 255-260.
[70] McLachlan, G. J. (1974d). The relationship in terms of asymptotic mean square error between the separate problems of estimating each of the three types of error rate of the linear discriminant function. Technometrics 16, 569-575.
[71] McLachlan, G. J. (1976). The bias of the apparent error rate in discriminant analysis. Biometrika
63, 239-244.
[72] McLachlan, G. J. (1977). Constrained sample discrimination with the Studentized classification
statistic W. Comm. Statist. A--Theory Methods 6, 575-583.
[73] Memon, A. Z. (1968). Z statistic in discriminant analysis. Ph.D. Dissertation, Iowa State Univ.
[74] Memon, A. Z. and Okamoto, M. (1970). The classification statistic W* in covariate discriminant
analysis. Ann. Math. Statist. 41, 1491-1499.
[75] Memon, A. Z. and Okamoto, M. (1971). Asymptotic expansion of the distribution of the Z
statistic in discriminant analysis. J. Multivariate Anal. 1, 294-307.
[76] Moore, II, D. H. (1973). Evaluation of five discrimination procedures for binary variables. J.
Amer. Statist. Assoc. 68, 399-404.
[77] Okamoto, M. (1961). Discrimination for variance matrices. Osaka Math. J. 13, 1-39.
[78] Okamoto, M. (1963). An asymptotic expansion for the distribution of the linear discriminant
function. Ann. Math. Statist. 34, 1286-1301.
[79] Okamoto, M. (1968). Correction to "An asymptotic expansion for the distribution of the linear
discriminant function". Ann. Math. Statist. 39, 1358-1359.
[80] Olkin, I. and Tate, R. F. (1961). Multivariate correlation models with mixed discrete and
continuous variables. Ann. Math. Statist. 32, 448-465.
[81] Ott, J. and Kronmal, R. A. (1976). Some classification procedures for multivariate binary data
using orthogonal functions. J. Amer. Statist. Assoc. 71, 391-399.
[82] Rao, C. R. (1954). A general theory of discrimination when the information about alternative
population distributions is based on samples. Ann. Math. Statist. 25, 651-670.
[83] Patnaik, D. B. (1949). The non-central X 2 and F distributions and their applications. Biometrika
36, 202-232.
[84] Siotani, M. (1980). Asymptotic approximations to the conditional distributions of the classifi-
cation statistic Z and its Studentized form Z*. Tamkang. J. Math. 11, 19-32.
[85] Siotani, M. and Wang, R. H. (1975). Further expansion formulae for error rates and comparison
of the W- and Z-procedures in discriminant analysis. Tech. Rept. No. 33, Dept. Statist., Kansas
State Univ., Manhattan.
[86] Siotani, M. and Wang, R. H. (1977). Asymptotic expansions for error rates and comparison of
the W-procedure and the Z-procedure in discriminant analysis. In: P. R. Krishnaiah, ed.,
Multivariate Analysis IV, 523-545. North-Holland, Amsterdam.
[87] Smith, C. A. B. (1947). Some examples of discrimination. Ann. Eugenics 13, 272-282.
[88] Solomon, H. (1961). Classification procedures based on dichotomous response vectors. In: H.
Solomon, ed., Studies in Item Analysis and Prediction, 177-186. Stanford Univ. Press, Stanford.
[89] Teichroew, D. and Sitgreaves, R. (1961). Computation of an empirical sampling distribution for
the W-classification statistic. In: H. Solomon, ed., Studies in Item Analysis and Prediction,
252-275. Stanford Univ. Press, Stanford.
[90] Wald, A. (1944). On a statistical problem arising in the classification of an individual into one of
two groups. Ann. Math. Statist. 15, 145-162.
[91] Wise, J. (1955). The autocorrelation function and the spectral density function. Biometrika 42,
151-159.
[92] Zhezhel, Yu., N. (1968). The efficiency of a linear discriminant function for arbitrary distribu-
tions. Engrg. Cybernetics 6, 107-111.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 101-120
Bayesian Discrimination*
Seymour Geisser
1. Introduction
2. Bayesian allocation
$$L(X \mid \theta, \pi) = \prod_{i=1}^{k} L(X_i \mid \theta_i, \pi_i) \tag{2.1}$$
where X represents the set of all the data samples X_1,...,X_k, often referred to as the training sample. Hence the posterior density, when it exists, is

and θ_i^c is the complement of θ_i, θ_i ∪ θ_i^c = θ. We then calculate the posterior probability that z belongs to π_i,

$$\Pr\{z \in \pi_i \mid X, \psi, q\} \propto q_i f(z \mid X, \psi, \pi_i) \tag{2.6}$$
where q stands for (q_1,...,q_k). For allocation purposes we may choose to assign z to that π_i for which (2.6) is a maximum, if we ignore the differential costs of misclassification. We could also divide the observation space of Z into sets of regions R_1,...,R_k, where R_i is the set of regions for which u_i(z) = q_i f(z | X, ψ, π_i) is maximal, and use these as allocating regions for future observations. We may also compute 'classification errors,' based on the predictive distributions, which are in a sense a measure of the discriminatory power of the variables or characteristics. If we let Pr{π_j | π_i} represent the predictive probability that z has
where π_i^c stands for all the populations with the exception of π_i. Then the predictive probability of a misclassification is

$$\sum_{i=1}^{k} q_i \Pr\{\pi_i^c \mid \pi_i\} = 1 - \sum_{i=1}^{k} q_i \Pr\{\pi_i \mid \pi_i\}. \tag{2.10}$$

$$L(q_1,\dots,q_{k-1}) \propto \prod_{j=1}^{k} q_j^{N_j}. \tag{2.11}$$
If we assume that the prior probability density of the q_i's is of the Dirichlet form

$$g(q_1,\dots,q_{k-1}) \propto \prod_{j=1}^{k} q_j^{\nu_j}, \tag{2.12}$$

$$p(q_1,\dots,q_{k-1} \mid N_1,\dots,N_k) \propto \prod_{j=1}^{k} q_j^{N_j+\nu_j}. \tag{2.13}$$
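Since (2.11)-(2.13) are a standard Dirichlet-multinomial conjugate update, the posterior mean of each q_j is available in closed form. A minimal sketch, assuming the exponent convention of (2.12), so that the posterior (2.13) is a Dirichlet distribution with parameters N_j + ν_j + 1:

```python
import numpy as np

def dirichlet_posterior_mean(counts, nu):
    """Posterior mean of (q_1,...,q_k) under prior (2.12) and likelihood
    (2.11): the posterior (2.13) has exponents N_j + nu_j, i.e. it is a
    Dirichlet with parameters N_j + nu_j + 1."""
    counts = np.asarray(counts, dtype=float)
    nu = np.asarray(nu, dtype=float)
    alpha = counts + nu + 1.0
    return alpha / alpha.sum()

# Example: three populations with training sizes 20, 35, 45, flat prior
print(dirichlet_posterior_mean([20, 35, 45], [0, 0, 0]))
```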
Further
In the second situation we assume that the N_i's were chosen and are not random variables. This is tantamount to assuming that N_i = 0 for all i as regards the posterior distribution of the q_i's, resulting in
where
It is to be noted that while the joint density of z_1,...,z_n given θ_{i_1},...,θ_{i_n} factorizes into ∏_{j=1}^n f(z_j | θ_{i_j}, π_{i_j}), this will not generally be true for the predictive density; i.e.,

Hence the results of a joint allocation will in principle differ from the previous type, which we may refer to as a marginal allocation, although perhaps not too often in practice.
It is sometimes convenient to write

$$\Pr\{z_1 \in \pi_{i_1},\dots,z_n \in \pi_{i_n} \mid X, \psi, q\} = \Pr\{\delta_1 \in \pi_1,\dots,\delta_k \in \pi_k \mid X, \psi, q\}, \tag{2.22}$$

where δ_i represents the set of n_i observations assumed from π_i and ∑_{i=1}^k n_i = n, since the set of observations z_1,...,z_n is apportioned among the k populations such that n_i belong to π_i. The reason for using (2.22) is that under certain conditions we do have a useful factorization such that

$$\Pr\{\delta_1 \in \pi_1,\dots,\delta_k \in \pi_k \mid X, \psi, q\} = \prod_{j=1}^{k} \Pr\{\delta_j \in \pi_j \mid X, \psi, q\}; \tag{2.23}$$
i.e., a mixture of joint predictive densities with z assumed from π_i. Further,
This same result can also be obtained from the product of the likelihoods and the
prior density,
and finally
$$f(z \mid \bar{x}_i, S, \pi_i) \propto \left(\frac{N_i}{N_i+1}\right)^{p/2}\left[1 + \frac{N_i(\bar{x}_i - z)' S^{-1} (\bar{x}_i - z)}{(N_i+1)(N-k)}\right]^{-(N-k+1)/2}, \tag{3.2}$$
the predictive density of the observation to be allocated. This is then inserted into either (2.6), (2.16) or (2.17), depending on the circumstances involving q, and is appropriate for allocating a single new vector observation z_1.
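The allocation step itself is mechanical once (3.2) is available: evaluate the predictive kernel at z for each population and normalize as in (2.6). A minimal sketch, assuming known q_i and the Student-type kernel reconstructed above; constants common to all populations cancel in the ratios and are dropped:

```python
import numpy as np

def predictive_kernel(z, xbar_i, S, N_i, N, k):
    """Kernel of (3.2) for population pi_i under the pooled covariance S;
    factors shared by all populations (e.g. |S|) are omitted."""
    d = xbar_i - z
    q = N_i * d @ np.linalg.solve(S, d) / ((N_i + 1.0) * (N - k))
    return (N_i / (N_i + 1.0)) ** (len(z) / 2.0) * (1.0 + q) ** (-(N - k + 1.0) / 2.0)

def allocate(z, xbars, S, Ns, q):
    """Posterior membership probabilities (2.6) and the maximizing index."""
    k, N = len(xbars), sum(Ns)
    u = np.array([q[i] * predictive_kernel(z, xbars[i], S, Ns[i], N, k)
                  for i in range(k)])
    return u / u.sum(), int(np.argmax(u))
```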
We now assume that we need to jointly allocate n new vector observations z_1,...,z_n. Letting, as in (2.22), δ_i represent the set of n_i observations assumed from π_i, with n = ∑_{i=1}^k n_i, we obtain (3.3), where Ω_i = I + N_i^{-1} e_i e_i' and e_i' = (1,...,1) of dimension n_i. Hence

$$\Pr\{\delta_1 \in \pi_1,\dots,\delta_k \in \pi_k \mid X, q\} \propto \Big(\prod_{i=1}^{k} q_i^{n_i}\Big) f(\delta_1,\dots,\delta_k \mid X, \pi_1,\dots,\pi_k), \tag{3.4}$$

where again, if the q_i's are unknown, appropriate substitutes can be found in (2.16) or what follows it.
The observations may in many instances be obtained sequentially, and for compelling reasons allocations (diagnoses) may need to be made as soon as possible. Let z^{(n-1)} = (z_1,...,z_{n-1}) and let Σ' stand for the sum over all assignments of z_1,...,z_{n-1} to δ_1,...,δ_k with z_n always assigned to δ_i, summed over all partitions of n such that ∑_{j=1}^k n_j = n, n_j ≥ 0, j ≠ i, and n_i ≥ 1. Then

for n = 2, 3,....
A second case that is also easily managed is the unequal covariance matrix situation. Here π_i is represented by a N(μ_i, Σ_i) distribution, i = 1,...,k. Using the same training sample notation as previously and a similar convenient unobtrusive reference prior

$$g(\mu_1,\dots,\mu_k, \Sigma_1^{-1},\dots,\Sigma_k^{-1}) \propto \prod_{i=1}^{k} |\Sigma_i|^{(p+1)/2}, \tag{3.6}$$
we obtain

$$f(z \mid \bar{x}_i, S_i, \pi_i) \propto \left(\frac{N_i}{N_i+1}\right)^{p/2} |S_i|^{-1/2}\left[1 + \frac{N_i(\bar{x}_i - z)' S_i^{-1} (\bar{x}_i - z)}{N_i+1}\right]^{-N_i/2}, \tag{3.7}$$
the predictive density of the observation to be allocated. This is then inserted into the appropriate formula as previously, to calculate the posterior probability of z belonging to π_i.
For the joint classification of z_1,...,z_n we obtain, as in (2.15), by assigning z_1,...,z_n to π_1,...,π_k,

$$\Pr\{\delta_1 \in \pi_1,\dots,\delta_k \in \pi_k \mid X, q\} \propto \prod_{i=1}^{k} q_i^{n_i}\, d(\delta_i \mid \bar{x}_i e_i', \Omega_i, S_i, N_i-1, n_i, p), \tag{3.8}$$

where d(·|·) represents the determinantal density (Geisser, 1966),

$$d(Y \mid \Delta, \Omega, A, M, m, p) = \frac{(2\pi)^{-pm/2} K(p, M)\, |MA|^{M/2}\, |\Omega|^{p/2}}{K(p, M+m)\, |MA + (Y-\Delta)\Omega(Y-\Delta)'|^{(M+m)/2}}, \tag{3.9}$$

$$K^{-1}(p,\nu) = 2^{p\nu/2}\, \pi^{p(p-1)/4} \prod_{j=1}^{p} \Gamma\!\left(\frac{\nu+1-j}{2}\right).$$
For sequential allocation we obtain, for n = 2, 3,...,

$$\Pr\{z_n \in \pi_i \mid X, z^{(n-1)}, q\} \propto {\sum}' \prod_{j=1}^{k} q_j^{n_j}\, d(\delta_j \mid \bar{x}_j e_j', \Omega_j, S_j, N_j-1, n_j, p). \tag{3.10}$$

The material in this and the previous section is derived from Geisser (1964, 1965, 1966), Geisser and Cornfield (1963), and Geisser and Desu (1968, 1973).
4. Bayesian separation
μ_1,...,μ_k are linearly independent. The solution then is the set of k − v linear discriminants given by

$$z' \Sigma^{-1} \Delta P \tag{4.3}$$

$$z' \Sigma^{-1} \Delta P_{(r)} \tag{4.4}$$

where P_{(r)} = (P_1,...,P_r) are the r column vectors associated with the r largest non-zero roots δ_j of Δ'Σ^{-1}Δ.
The focus here is on the estimation of c. In particular, if we are dealing with two populations, then (4.4) is equivalent to (4.5). If we make the multivariate normal assumptions of Section 2 and also use the same prior density for Σ^{-1}, μ_1, and μ_2, then we obtain the result of (4.6). Hence one may obtain for k populations that the estimator for z'Σ^{-1}(μ_i − μ_j) is z'S^{-1}(x̄_i − x̄_j), and generate the estimator, using x̄ for μ,

$$z' S^{-1} \hat\Delta \hat P \tag{4.7}$$

of the set of linear discriminants, where Δ̂ and P̂ are obtained from the solution
5. Allocatory-separatory compromises
class of separatory discriminants, selecting the one which minimizes the total error of classification with respect to the predictive distributions (Enis and Geisser, 1974). This then would be an all-purpose discriminant having both good separatory and allocatory properties. This approach modifies the optimal Bayesian allocatory discriminant, which is

Then minimize

where f(w | X, π_i) is the predictive density of W derived from the predictive density of Z, given π_i.
This provides us with a Bayesian discriminant of a stipulated form that is optimal with respect to error rates. This compromises the form of the discriminant with an allocatory requirement.

In the multivariate normal case with equal covariance matrices, interest has generally focussed on the linear discriminant V:

$$V \begin{cases} \ge \log r & \text{assigns } z \text{ to } \pi_1, \\ < \log r & \text{assigns } z \text{ to } \pi_2. \end{cases} \tag{6.4}$$
$$Q = (\bar{x}_1 - \bar{x}_2)' S^{-1} (\bar{x}_1 - \bar{x}_2), \tag{6.10}$$

$$E(e_1) \approx \Phi\!\left(\frac{\log r - \tfrac{1}{2}(pc^{-1}+Q)}{Q^{1/2}}\right), \tag{6.11}$$

$$E(e_1) \approx \Phi\!\left(\frac{\log r - \tfrac{1}{2}Q}{Q^{1/2}}\right) = \beta_1, \tag{6.12}$$

$$P(e_1) \equiv \Pr[e_1 \le b] \approx 1 - F_{p+cQ+1}\!\left(\frac{4c(p+cQ)\left(\Phi^{-1}(b)\right)^2}{(cQ)^2}\right), \tag{6.13}$$
and β_1 and β_2 are random variables that are functions of μ_1, μ_2 and Σ. Hence we have defined β_1 and β_2 as functions of the random variables μ_1, μ_2, and Σ for fixed values of x̄_1, x̄_2 and S, which differs from the sampling interpretation where β_1 and β_2 are considered either as functions of the fixed parameters μ_1, μ_2, and Σ obtained from the unconditional sampling distribution of V in terms of the random variables x̄_1, x̄_2, and S, or defined as functions of the random variables x̄_1, x̄_2, and S. Although the exact posterior distribution of β_i, jointly or marginally (Geisser, 1967), can easily be found, a convenient and rather good approximation is obtained as
$$A_2 = (\nu - p + \tfrac{1}{2})^{1/2}\, Q^{-1/2}(\log r + \tfrac{1}{2}Q), \qquad B_2 = \frac{(\log r + \tfrac{1}{2}Q)^2}{2\nu Q}. \tag{6.22}$$
where f(V | μ_1, μ_2, Σ, π_1) represents the conditional density of V. Hence

$$W(z) \begin{cases} \ge 0, & \text{assign } z \text{ to } \pi_1, \\ < 0, & \text{assign } z \text{ to } \pi_2, \end{cases} \tag{6.26}$$
where a' = [a_1,...,a_p] is a nonnull vector and b is an arbitrary scalar, such that for variations in a and b the total predictive probability of correct allocation is maximized. The solution obtained by Enis and Geisser (1974) is termed the optimal predictive linear discriminant, and

$$K_1 = \left[(\nu+1)(\nu+p-1)\right]^{1/2}.$$
First we note that for purely separatory purposes the constant is irrelevant and again we obtain Fisher's linear discriminant function. For allocatory purposes the constant b_0 is relevant and may yield a rather slight error rate improvement over V or V + p(N_2^{-1} − N_1^{-1}). But note that W will be globally optimal iff RK_2 = K_1, since this is equivalent to r(z), the optimal posterior discriminant, which under these circumstances and appropriate h(r) becomes linear as well. If q_1 = q_2 and N_1 = N_2, then all methods mentioned thus far essentially yield V.
When Σ_1 ≠ Σ_2, the optimal discriminant is the quadratic

is also very nearly achieved as the posterior expectation of U. Enis and Geisser (1970) showed that

where

$$h(p, N_1, N_2) = \tfrac{1}{2}\sum_{i=1}^{2}\sum_{j=1}^{p}(-1)^i\left\{\log(N_i - 1) + \frac{1}{N_i - 1} - \psi\!\left(\tfrac{1}{2}(N_i - j)\right)\right\} \tag{6.33}$$

that will simplify it. One could attempt to derive the optimal predictive quadratic discriminant, but in general this is quite difficult to obtain.
where the expectation is over g(ω), and X_i represents the set of observations from π_i,

$$f(X_1, X_2 \mid \omega, \pi_1, \pi_2) = \int p(\theta \mid \omega) \prod_{i=1}^{2}\prod_{j=1}^{N_i} f(x_{ij} \mid \theta_i, \omega, \pi_i)\, d\theta. \tag{7.3}$$
This full Bayesian approach requires a body of prior knowledge that is often unavailable, and it may be highly sensitive to some of these assumptions.

We shall present here only one of a series of data analytic techniques given by Geisser (1980) which selects a single ω = ω* to be used for allocation rather than the Bayesian averaging. It is a technique which combines Bayesian, frequentist, and sample reuse procedures.

Let

$$L(\omega) = \prod_{i=1}^{2}\prod_{j=1}^{N_i} f(x_{ij} \mid X_{(ij)}, \omega, \pi_i) \tag{7.4}$$

be the product of reused predictive densities, where X_{(ij)} is the set of observations X with x_{ij} deleted, and f is of the same form as (7.2); i.e., x_{ij} replaces z and X_{(ij)}
replaces X. Then ω* is chosen to attain

$$\max_{\omega}\, g(\omega) L(\omega),$$

and the π_1 and π_2 specified by ω* are used in an allocatory or separatory mode. As an example, suppose ω = ω_1 specifies that π_i is N(μ_i, Σ) and ω = ω_2 specifies that π_i is N(μ_i, Σ_i), respectively.
Under ω_1,

$$L(\omega_1) = \prod_{i=1}^{2}\prod_{j=1}^{N_i} f(x_{ij} \mid \bar{x}_{i(j)}, S_{(ij)}, N_i-1, N-1, \omega_1, \pi_i), \tag{7.5}$$

where the density f is given by (3.2) with z, x̄_i, S, N_i, and N replaced by x_{ij}, x̄_{i(j)}, S_{(ij)}, N_i − 1 and N − 1, respectively; x̄_{i(j)} and S_{(ij)} being the sample mean and pooled covariance matrix with x_{ij} deleted.
Under ω_2,

$$L(\omega_2) = \prod_{i=1}^{2}\prod_{j=1}^{N_i} f(x_{ij} \mid \bar{x}_{i(j)}, S_{i(j)}, N_i-1, \omega_2, \pi_i), \tag{7.6}$$

where the density f is given by (3.7) with z, x̄_i, S_i and N_i replaced by x_{ij}, x̄_{i(j)}, S_{i(j)}, and N_i − 1, respectively, S_{i(j)} being the sample covariance matrix calculated from X_i with x_{ij} deleted. The choice of ω* now rests with the larger of g(ω_1)L(ω_1) and g(ω_2)L(ω_2).
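The sample reuse computation of (7.4)-(7.6) is a leave-one-out loop over the training data. A sketch of the comparison between ω_1 (pooled Σ) and ω_2 (separate Σ_i), with the simplifying assumption that each deleted-point predictive density is approximated by a normal density with leave-one-out estimates rather than the exact Student forms (3.2) and (3.7); a flat g(ω) is assumed and each N_i must exceed p + 1:

```python
import numpy as np
from scipy.stats import multivariate_normal

def loo_log_score(samples, pooled):
    """log L(omega): sum of log leave-one-out predictive densities (7.4),
    each approximated by a normal density with deleted-point estimates."""
    total = 0.0
    for i, X in enumerate(samples):                 # X has shape (N_i, p)
        for j in range(X.shape[0]):
            Xi = np.delete(X, j, axis=0)            # X_i with x_ij deleted
            mean = Xi.mean(axis=0)
            if pooled:                              # omega_1: common Sigma
                parts = [Xi - mean] + [Y - Y.mean(axis=0)
                                       for k, Y in enumerate(samples) if k != i]
                R = np.vstack(parts)
                cov = R.T @ R / (R.shape[0] - len(samples))
            else:                                   # omega_2: separate Sigma_i
                cov = np.cov(Xi, rowvar=False)
            total += multivariate_normal.logpdf(X[j], mean=mean, cov=cov)
    return total

# choose omega*: pooled iff loo_log_score(samples, True) is the larger score
```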
8. Other areas
Most of the current work in separatory discriminants has been linear mainly
because of convenience and ease of interpretation. However, it would be desirable
to consider other functional discriminants as there are situations where the
natural discriminants are quadratic.
There is also another useful model wherein the so-called populations or labels have some underlying continuous distribution, but one can only observe whether π is in a set S_i, where S_1,...,S_k exhaust the range of π; see, for example, Marshall and Olkin (1968). In the previous case π_i was synonymous with S_i, and the distribution only involved the discrete probabilities q_i. However, this case involves more structure and requires a more delicate Bayesian analysis. Work in this area is currently in progress.
References
Classification of Growth Curves

Jack C. Lee
1. Introduction
$$Y = X\tau A + \varepsilon, \tag{1.1}$$

where τ is unknown, X and A are known matrices of ranks m < p and r < N, respectively. Further, the columns of ε are independent p-variate normal with mean vector 0 and common covariance matrix Σ, i.e., G(Y | τ, Σ) = N(Y; XτA, Σ ⊗ I_N), where ⊗ denotes the Kronecker product and G(·) the cumulative distribution function (c.d.f.).
Several examples of growth curve applications for the model (1.1) were given by Potthoff and Roy (1964). We will only indicate two of them here.

(i) N individuals, all subject to the same conditions, are each observed at p points in time t_1,...,t_p. The p observations on a given individual are not independent, but rather are assumed to be multivariate normal with unknown covariance matrix Σ. The observations of different individuals are assumed to be independent. The growth curve is assumed to be a polynomial in time of degree m − 1, so that the expected value of the measurement of any individual at time t is τ_0 + τ_1 t + ··· + τ_{m−1} t^{m−1}. The matrix A is 1 × N and contains all 1's, τ = (τ_0,...,τ_{m−1})', and the element in the jth row and cth column of X is t_j^{c−1}.
(ii) There are r groups of individuals with n_j individuals in the jth group, and with each group being subjected to a different treatment. Individuals in all groups are measured at the same points in time and are assumed to have the same covariance matrix Σ. The growth curve associated with the jth group is τ_{0j} + τ_{1j} t + ··· + τ_{m−1,j} t^{m−1}. The matrix A will contain r rows, and will consist of n_1 columns of (1,0,...,0)', n_2 columns of (0,1,0,...,0)',..., and n_r columns of (0,...,0,1)'.
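The design matrices of examples (i) and (ii) are easy to build explicitly. A small sketch (the function names are ours, not Potthoff and Roy's) that constructs X from the observation times and A from the group sizes:

```python
import numpy as np

def make_X(times, m):
    """p x m within-individual design: row j is (1, t_j, ..., t_j**(m-1))."""
    return np.vander(np.asarray(times, dtype=float), m, increasing=True)

def make_A(group_sizes):
    """r x N between-individual design of example (ii): n_j columns equal to
    the jth unit vector; with a single group (r = 1) this is a row of ones."""
    r = len(group_sizes)
    cols = [np.eye(r)[:, [j]].repeat(n, axis=1) for j, n in enumerate(group_sizes)]
    return np.hstack(cols)

# Example: quadratic growth (m = 3) at four time points, two treatment groups
X = make_X([1, 2, 3, 4], 3)   # 4 x 3
A = make_A([5, 7])            # 2 x 12
```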
2. Preliminaries
LEMMA 2.1. Let Y_{p×N}, X_{p×m}, τ_{m×r}, A_{r×N} be such that the ranks of X and A are m < p and r < N, respectively. Then

$$|(Y - X\tau A)(Y - X\tau A)'|^{-N/2} = \cdots$$

where

$$B = (X'X)^{-1}X', \qquad S = Y(I - A'(AA')^{-1}A)Y',$$
$$\hat\tau = (X'S^{-1}X)^{-1}X'S^{-1}YA'(AA')^{-1}, \tag{2.2}$$
$$G^{-1} = (AA')^{-1} + T_2'(Z'SZ)^{-1}T_2, \qquad T_2 = Z'YA'(AA')^{-1}.$$
For a proof of the above lemma, the reader is referred to Geisser (1970).

We shall say that L, with dimension m × r, is distributed as D_{m,r}(·; B, A, Σ, N) (see Geisser (1966)) if it has probability density function (p.d.f.)

$$g(L) = \frac{C_{m,\nu}}{C_{m,N}} \cdot \frac{\pi^{-rm/2}\,|\Sigma|^{\nu/2}\,|A|^{m/2}}{|\Sigma + (L-B)A(L-B)'|^{N/2}}, \tag{2.3}$$

where ν = N − r and

$$C_{m,p}^{-1} = \pi^{m(m-1)/4} \prod_{j=1}^{m} \Gamma\!\left(\tfrac{1}{2}(p+1-j)\right).$$
We note that the general determinantal distribution D_{m,r}(·) includes the multivariate T distribution, T(·; B, Σ, N), as a special case when r = 1, A = (N − m)^{-1}. Some properties of D_{m,r}(·) were given in Lee and Geisser (1972). In the sequel we will use h(·) for the prior probability density functions (p.d.f.) of parameters, g(·) for probability density functions other than prior, and G(·) for the c.d.f., even though they stand for different functional forms in different circumstances.
3. Classification into one of two growth curves

In this section we consider the special case where c = 2 and q_1, q_2 are known. Following Welch (1939) we classify V into Π_1 if

$$g_1(V)/g_2(V) > q_2/q_1, \tag{3.1}$$

where

$$g_i(V) = (2\pi)^{-pK/2} |\Sigma_i|^{-K/2} \exp\{-\tfrac{1}{2}\operatorname{tr} \Sigma_i^{-1}(V - X\tau_iF_i)(V - X\tau_iF_i)'\}. \tag{3.2}$$
$$\hat\Sigma_i = N_i^{-1}(Y_i - X\hat\tau_iA_i)(Y_i - X\hat\tau_iA_i)', \tag{3.7}$$

$$\hat\Sigma_i = N_i^{-1}\{XBS_iB'X' + ZDY_iY_i'D'Z'\}, \tag{3.9}$$

where

$$D = (Z'Z)^{-1}Z'.$$

Of course, when Σ_1 = Σ_2 = Σ, the m.l.e. of the common covariance matrix is
for i = 1,...,c; q = (q_1,...,q_c). The observation V is then classified into Π_i for which (4.1) is a maximum. For the prior density of parameters we follow Geisser (1970) in assuming a situation where there is little or no prior knowledge regarding the parameters. In the next two sections we discuss Bayesian classification of growth curves for the two different covariance matrices.
5. Arbitrary p.d. Σ

and hence is irrelevant. The irrelevant constant will be absorbed by the proportionality sign ∝ and hence will be omitted from now on.

The posterior expectation of τ_i is τ̂_i as given in (3.5), and the posterior expectation of Σ_i is
$$A = \operatorname{diag}(A_1, A_2, \dots, A_c), \tag{5.9}$$
where τ̂, S and G are defined in (2.2) and W(·) stands for the Wishart distribution. The posterior expectation of Σ is

$$E(\Sigma \mid Y) = (N - p - 1)^{-1}\{(Y - X\hat\tau A)(Y - X\hat\tau A)' + (N - m - rc - 1)^{-1} X(X'S^{-1}X)^{-1}X'[\operatorname{tr} G^{-1}AA']\}. \tag{5.12}$$

We also note that τ̂ is the m.l.e. as well as the posterior expectation of τ, and the m.l.e. of Σ is
$$\tilde{A}_i = \operatorname{diag}(A_1,\dots,A_{i-1},A_{i+1},\dots,A_c,A_i) = (A_i^*, F_i^*), \tag{5.15}$$
$$F_i^{*\prime} = (0,0,\dots,0,F_i'),$$
where Q_i*, G_i* are defined in the same way as Q_i, G_i given in (5.4) except that Y_i is
$$G(V \mid \tau, Y, \Pi_i) = D_{p,K}(\cdot;\, X\tau_iF_i,\, I,\, (Y - X\tau A)(Y - X\tau A)',\, N + K). \tag{5.21}$$

With the prior h(τ, Σ_1^{-1},...,Σ_c^{-1}) ∝ ∏_{j=1}^c |Σ_j|^{(p+1)/2}, we have the posterior density (5.22), where S_j, τ̂_j, G_j are defined in the same way as S, τ̂, G given in (2.2) except that Y is replaced by Y_j and A by A_j. We note that (5.22) is a product of c determinantal distribution kernels.
We now consider the special case where c = 2, r = 1. From (5.22) it is easy to see that the posterior distribution of Σ_i^{-1} for a given τ is

$$\cdots$$

$$E(\tau \mid Y) = \tau^*(0), \qquad E(\Sigma_i \mid Y) \approx (N_i - p - 1)^{-1}(Y_i - X\tau^*(0)A_i)(Y_i - X\tau^*(0)A_i)'. \tag{5.30}$$
where τ_{0j} is defined in the same way as τ̂ given in (2.2) with S replaced by Σ_j, Y by Y_j and A by A_j. We note that (5.33) is a product of c multivariate normal kernels.

For the rest of this case we consider the situation where c = 2, r = 1. It can be shown that the posterior distribution of τ is
5.6. Σ_i unknown, τ_i unknown

With the prior h(Σ_i^{-1}) ∝ |Σ_i|^{(p+1)/2}, we have the posterior density

which implies

$$G(\tau_i \mid \Sigma_i, Y_i) = N(\cdot;\, \tau_{0i},\, (X'\Sigma_i^{-1}X)^{-1} \otimes (A_iA_i')^{-1}), \tag{5.41}$$

where τ_{0i} is defined in the same way as τ̂ given in (2.2) with S replaced by Σ_i, Y by Y_i, and A by A_i.
The predictive density satisfies

$$g(V \mid \Sigma_i^{-1}, Y_i, \Pi_i) = g(V_1 \mid V_2, \Sigma_i^{-1}, Y_i, \Pi_i)\, g(V_2 \mid \Sigma_i^{-1}, Y_i, \Pi_i), \tag{5.42}$$

where

$$G(V_1 \mid V_2, \Sigma_i^{-1}, \Pi_i) = N(\cdot;\, \tilde Q_i,\, (X'\Sigma_i^{-1}X)^{-1} \otimes M_i^{-1}),$$
$$G(V_2 \mid \Sigma_i^{-1}, Y_i, \Pi_i) = N(\cdot;\, 0,\, Z'\Sigma_iZ \otimes I_K), \tag{5.43}$$

and Q̃_i is defined in the same way as Q_i given in (5.4) except that S_i is replaced by Σ_i.
$$K_i = \prod_{j=1}^{m} \frac{\Gamma[\tfrac{1}{2}(N_i+K+1-r-j)]}{\Gamma[\tfrac{1}{2}(N_i+1-r-j)]}\ \prod_{j=1}^{p-m} \frac{\Gamma[\tfrac{1}{2}(N_i+K+1-j)]}{\Gamma[\tfrac{1}{2}(N_i+1-j)]}, \tag{6.2}$$

where

$$S = \sum_{i=1}^{c} S_i, \qquad Y = (Y_1,\dots,Y_c), \qquad N = \sum_{i=1}^{c} N_i.$$
Hence
6.4. Σ_i = XΓ_iX' + ZΘ_iZ', Γ_i, Θ_i unknown, τ_i = τ unknown

As in Section 5.4 we will consider the case where c = 2, r = 1. With the convenient prior

and

$$f_1(t) = \kappa\, t^{N_1/2-1}\left[\alpha + (T_{12} - T_{11})'A(T_{12} - T_{11})\right]^{-(N-m)/2}|A_*|^{-1/2},$$
$$\tilde\Sigma = (N - m)^{-1}\{\alpha + (T_{12} - T_{11})'A(T_{12} - T_{11})\}A_*^{-1},$$
$$\tilde T_1 = T_{11} + \Omega_*^{-1}(BS_2B')^{-1}(T_{12} - T_{11}), \tag{6.11}$$
$$\alpha \equiv t(A_1A_1')^{-1} + (A_2A_2')^{-1},$$
$$E(\tau \mid Y) = \int_0^{\infty} \tilde T_1(t)\, f_1(t)\, dt,$$
$$E(\Theta_i \mid Y) = (N_i - p + m - 1)^{-1}\, DY_iY_i'D',$$

and

$$E(\Gamma_i \mid Y) = (N_i - m - 1)^{-1}\{(BY_i - E(\tau \mid Y)A_i)(BY_i - E(\tau \mid Y)A_i)' + A_iA_i'[\operatorname{cov}(\tau \mid Y)]\}.$$

A reasonable approximation for the posterior distribution of τ is

$$\cdots$$
The predictive density of V is given by (6.13) for j ≠ i, where the constant (6.14) is a ratio of gamma functions analogous to (6.2), and

$$B_i = BS_iB' + (BV - T_{1i}E)\,M_i\,(BV - T_{1i}E)'.$$
$$g(V \mid Y_i, \tau_i, \Pi_i) \propto \kappa_i\, |(BY_i - \tau_iA_i)(BY_i - \tau_iA_i)'|^{N_i/2}\, |DY_iY_i'D'|^{N_i/2}$$
$$\times\, |(B(Y_i, V) - \tau_i(A_i, E))(B(Y_i, V) - \tau_i(A_i, E))'|^{-(N_i+K)/2}\, |D(Y_iY_i' + VV')D'|^{-(N_i+K)/2}. \tag{6.16}$$
References
Anderson, T. W. (1958). An Introduction to Multivariate Statistical Analysis. Wiley, New York.
Geisser, S. (1964). Posterior odds for multivariate normal classifications. J. Roy. Statist. Soc. Ser. B,
26, 69-76.
Geisser, S. (1966). Predictive discrimination. In: P. R. Krishnaiah, ed., Multivariate Analysis, 149-163.
Academic Press, New York.
Geisser, S. (1970). Bayesian analysis of growth curves. Sankhya, Ser. A 32, 53-64.
Geisser, S. and Desu, M. M. (1968). Predictive zero-mean uniform discrimination. Biometrika 55,
519- 524.
Khatri, C. G. (1966). A note on a MANOVA model applied to problems in growth curve. Ann. Inst. Statist. Math. 18, 75-86.
Krishnaiah, P. R. (1969). Simultaneous test procedures under general MANOVA models. In: P. R. Krishnaiah, ed., Multivariate Analysis II, 121-143. Academic Press, New York.
Lee, J. C. (1975). A note on equal-mean discrimination. Comm. Statist. 4, 251-254.
Lee, J. C. (1977). Bayesian classification of data from growth curves. South African Statist. J. 11, 155-166.
Lee, J. C. and Geisser, S. (1972). Growth curve prediction. Sankhyā, Ser. A 34, 393-412.
Lee, J. C. and Geisser, S. (1975). Applications of growth curve prediction. Sankhyā, Ser. A 37, 239-256.
Leung, C. Y. (1980). Discriminant analysis and testing problems based on a general regression model.
Ph.D. Thesis, University of Toronto.
Nagel, P. J. A. and de Waal, D. J. (1979). Bayesian classification, estimation and prediction of growth curves. South African Statist. J. 13, 127-137.
Potthoff, R. F. and Roy, S. N. (1964). A generalized multivariate analysis of variance model useful
especially for growth curve problems. Biometrika 51, 313-326.
Rao, C. R. (1965). The theory of least squares when the parameters are stochastic and its application
to the analysis of growth curves. Biometrika 52, 447-458.
Rao, C. R. (1966). Covariance adjustment and related problems in multivariate analysis. In: P. R.
Krishnaiah, ed., Multivariate Analysis II, 87-103. Academic Press, New York.
Rao, C. R. (1967). Least squares theory using an estimated dispersion matrix and its application to
measurement of signals. Proc. Fifth Berkeley Symp. Math. Statist. and Probability 1, 355-372.
Welch, B. L. (1939). Note on discriminant functions. Biometrika 31, 218-220.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 139-168
Nonparametric Classification
James D. Broffitt
1. Introduction
1.1. Preliminaries
Consider the problem of discriminating between two populations π_1 and π_2. An element is selected at random from a mixture of π_1 and π_2, and p variates are measured on this element, which we arrange in a vector z. Based on z we must decide if the element is a member of π_1 or π_2. For convenience we shall talk about classifying z rather than the element which gave rise to z. We may then state the problem as deciding between z being a member of π_1 and z being a member of π_2.
We may think of any decision rule as a partition of the p-dimensional space of the random vector Z into two sets, K_1 and K_2. (We shall follow the usual convention that z denotes an observed value of the random vector Z.) If z ∈ K_1, then z is classified into π_1; otherwise z is classified into π_2. Thus any decision rule may be defined by specifying the set K_1. Corresponding to each decision rule are the probabilities of classification: P(j|i) is the probability of classifying an element from π_i into π_j. If i ≠ j, then P(j|i) is a probability of misclassification; otherwise P(j|i) is a probability of correct classification. Let the p variates have pdf f_1(·) in π_1 and f_2(·) in π_2. We may then write

$$P(j \mid i) = P\{Z \in K_j \mid Z \sim f_i\},$$

where Z ∼ f_i means that Z has pdf f_i(·).
The objective is to determine a decision rule which minimizes some function of the misclassification probabilities. Such a function may be obtained by considering the costs of misclassification. Let C(j|i) denote the loss (cost) incurred when an element from π_i is classified into π_j. We restrict C(j|i) to satisfy: C(j|i) = 0 if i = j, and C(j|i) > 0 if i ≠ j. Then the risk (expected loss) corresponding to a decision rule is

$$q_1 C(2 \mid 1) P(2 \mid 1) + q_2 C(1 \mid 2) P(1 \mid 2),$$

where q_i = P{Z ∼ f_i} is the prior probability that Z is a member of π_i. The
decision rule that minimizes this risk (see Anderson (1958, Chapter 6)) is specified by

$$K_1 = \{z\colon f_1(z)/f_2(z) \ge k\}, \qquad k = \frac{q_2\, C(1 \mid 2)}{q_1\, C(2 \mid 1)}.$$

Thus if we know the prior probabilities, misclassification costs, and the pdf's f_1(·) and f_2(·), we classify z into π_1 if f_1(z)/f_2(z) ≥ k, and into π_2 otherwise.
In many practical problems it may be difficult to assess the misclassification costs as well as the prior probabilities. In this case a different approach may be possible. Suppose we are able to specify the value that we would like to have for the ratio P(2|1)/P(1|2). For example, if we desire a decision rule that misclassifies elements from π_2 twice as often as elements from π_1, then we would let P(2|1)/P(1|2) = ½. In general let γ be the value assigned to P(2|1)/P(1|2). Then among all decision rules that satisfy our constraint P(2|1) = γP(1|2), we want the one that minimizes P(2|1), or equivalently, minimizes P(1|2). The solution is given by

$$K_1 = \{z\colon f_1(z)/f_2(z) \ge k\},$$

where k is selected to satisfy P(2|1) = γP(1|2). In Section 2.4 we will show how to apply discriminant functions in such a way that, at least asymptotically, P(2|1)/P(1|2) = γ. These procedures allow the experimenter to have some control over the balance between the two misclassification probabilities.
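When the densities can be evaluated, the cutoff k of the constrained rule can be calibrated numerically. A sketch, assuming Monte Carlo samples from each population are available to estimate P(2|1) and P(1|2); the grid search over observed likelihood ratios is our simplification:

```python
import numpy as np

def calibrate_k(f1, f2, sample1, sample2, gamma):
    """Choose k in the rule 'classify into pi_1 iff f1(z)/f2(z) >= k' so that
    the Monte Carlo estimates satisfy P(2|1) ~= gamma * P(1|2)."""
    lr1 = np.array([f1(z) / f2(z) for z in sample1])   # ratios under pi_1
    lr2 = np.array([f1(z) / f2(z) for z in sample2])   # ratios under pi_2
    best, best_gap = None, np.inf
    for k in np.unique(np.concatenate([lr1, lr2])):
        p21 = np.mean(lr1 < k)        # z from pi_1 misclassified into pi_2
        p12 = np.mean(lr2 >= k)       # z from pi_2 misclassified into pi_1
        gap = abs(p21 - gamma * p12)
        if gap < best_gap:
            best, best_gap = k, gap
    return best
```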
In order to apply the above decision rules it is necessary to compare f_1(z) to f_2(z). In practical problems f_1(·) and f_2(·) are not fully known and must be estimated from sample data. Our decision rules will then be based on f̂_1(z)/f̂_2(z), where f̂_i(z) is an estimate of f_i(z). That is, we replace the set K_1 with

$$K_1^* = \{z\colon \hat f_1(z)/\hat f_2(z) \ge k\},$$

which defines the sample based rule: classify z into π_1 if and only if z ∈ K_1*. There are different levels of prior knowledge that we may have about f_i(·). If we assume that f_i(·) is a normal density with unknown mean vector and dispersion matrix, then we could obtain f̂_i(·) by replacing the unknown parameters in f_i(·) with their sample estimates. This would be termed a parametric solution since we assumed very specific forms for f_1(·) and f_2(·), and consequently needed to estimate only a few parameters. On the other hand, if we have very little knowledge of f_i(·), then we might use a nonparametric density estimate to obtain f̂_i(·). In either case we must have observations from π_1 and π_2 in order to form estimates of f_1(·) and f_2(·).
Two methods of sampling are used in practice. The method used is generally dictated by the practical problem rather than a preference choice. In the first method we draw random samples separately from π_1 and π_2. Let x_1,...,x_{n_1} be the observation vectors for n_1 randomly selected elements from π_1, and similarly y_1,...,y_{n_2} are the observation vectors for n_2 elements from π_2. The x's and y's are called training samples and may be used to estimate f_1(·) and f_2(·), respectively. The second sampling method occurs when we must sample from the mixture of π_1 and π_2 rather than taking samples from the individual populations. Let n be the number of observations taken from the mixture. If these observations can be correctly classified, then those from π_1 are the x's and those from π_2 are the y's. In this case, n_1 and n_2 are random and n_i/n may be used as an estimate of q_i. In the remainder of this section we will concentrate on methods of classification based on training samples.
$$\bar{x} = \sum_{i=1}^{n_1} x_i/n_1, \qquad \bar{y} = \sum_{i=1}^{n_2} y_i/n_2,$$

$$S_1 = \sum_{i=1}^{n_1} (x_i - \bar{x})(x_i - \bar{x})'/(n_1 - 1), \qquad S_2 = \sum_{i=1}^{n_2} (y_i - \bar{y})(y_i - \bar{y})'/(n_2 - 1),$$

and

$$S = [(n_1 - 1)S_1 + (n_2 - 1)S_2]/(n_1 + n_2 - 2).$$
$$D_L(z) = [z - \tfrac{1}{2}(\bar{x} + \bar{y})]' S^{-1} (\bar{x} - \bar{y}).$$

R. A. Fisher was the first to suggest using D_L(z) for purposes of classification. Since D_L(z) is linear in the components of z, it is called the linear discriminant function or LDF. The procedure is to compute x̄, ȳ, and S from the training samples, then compute D_L(z) and compare it to ln k. If D_L(z) ≥ ln k we classify z into π_1; otherwise we classify z into π_2.
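A direct transcription of this procedure, assuming the training samples are stored as the rows of two arrays:

```python
import numpy as np

def ldf_classify(z, x, y, k=1.0):
    """Fisher's LDF rule: compute D_L(z) from training samples x (from pi_1)
    and y (from pi_2), then compare with ln k."""
    n1, n2 = len(x), len(y)
    xbar, ybar = x.mean(axis=0), y.mean(axis=0)
    S = ((n1 - 1) * np.cov(x, rowvar=False) +
         (n2 - 1) * np.cov(y, rowvar=False)) / (n1 + n2 - 2)
    DL = (z - 0.5 * (xbar + ybar)) @ np.linalg.solve(S, xbar - ybar)
    return 1 if DL >= np.log(k) else 2   # population label
```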
Notice that in a sample based classification rule the classification probabilities P(j|i) are computed with respect to the joint distribution of Z, X_1,...,X_{n_1}, Y_1,...,Y_{n_2}. For example, with the above rule based on D_L,

where
A decision rule is equivalent to a partition of the space of Z into sets K_1,...,K_m, such that z is classified into π_t if and only if z ∈ K_t. Then the decision rule that
That is, classify z into that population which corresponds to the smallest of the m values:

If there is a tie between two or more populations for this minimum value, then z may be classified into any one of those populations involved in the tie. (Note that K_1*,...,K_m* is not a partition in the strict sense because they do not include these tie producing z's.) In the special case where C(j|i) = 1 for i ≠ j, K_i* may be written in the simplified form

If f̂_i(z) is a consistent estimator of f_i(z), then the sample based rule is asymptotically optimal. Thus the two population problem may be extended to m populations without difficulty.
To enhance exposition we generally restrict our discussion to the two popula-
tion problem, referring to the m population problem when it is unclear how a two
population procedure extends to the m population case.
data. Outliers are particularly hard to spot in multivariate data since we cannot
plot points in more than two or three dimensions. It is because of the persistence
of nonnormal data that we are interested in nonparametric classification.
In Section 2 we shall present a method which uses ranks of the discriminant scores to classify z. A problem often encountered with using D_L and D_Q on nonnormal data is the imbalance in the resulting misclassification probabilities. For example, when using D_L with highly skewed distributions or distributions that have quite different dispersion matrices, it is not uncommon to obtain misclassification probabilities such as P(2|1) = 0.08 and P(1|2) = 0.39. The experimenter may find this an undesirable feature. The rank method affords the experimenter some control over the ratio P(2|1)/P(1|2). These ranks may also be used to define partial classification rules where P(2|1) and P(1|2) are bounded by prespecified values. The rank procedure is universal since it may be used in conjunction with virtually any discriminant function including D_L, D_Q, and others discussed in Sections 3 and 4. In Sections 3 and 4 we shall review robust and nonparametric discriminant functions.
2.1. Introduction
In this section we shall present a method of ranking discriminant scores. Rules
for partial and forced classification will then be defined in terms of these ranks.
We are not actually developing any new discriminant functions but rather a
nonparametric method of applying discriminant functions to classify observa-
tions. Its use in partial classification, where highly questionable observations need
not be classified, permits us to place upper b o u n d s on the misclassification
probabilities. In forced classification we will be able to control, at least asymptoti-
cally, the balance between the two misclassification probabilities. These results
are valid regardless of the discriminant function being used and the distributions
of 7r1 and ~r2, provided the discriminant functions have continuous distributions.
Of course, to obtain an efficient procedure we should choose a discriminant
function which is suitable for discriminating between ~r~ and 7r2. As we shall see,
choice of the discriminant function may be made after examining the data,
without disturbing the nonparametric nature of the procedure. This is particularly
appealing in partial classification since we may pick the discriminant function
after a preliminary look at the data and still maintain prespecified bounds on the
misclassification probabilities. Thus the rank method provides us with an oppor-
tunity to adaptively select discriminant functions.
that 'large' values of the discriminant score D(z) indicate classification of z into π_1. That is, classify z into π_1 if D(z) ≥ c and into π_2 otherwise. For this rule the misclassification probabilities are given by

Also if c_1 < c_2, then A_1 ∩ A_2 = ∅ and these upper bounds are achieved, i.e., P(2|1) = α_1 and P(1|2) = α_2. The condition c_1 < c_2 occurs when g_1(·) and g_2(·) have a substantial overlap. It is in this situation that we recommend partial classification. If g_1(·) and g_2(·) are well separated (see Fig. 3) so that c_1 > c > c_2, then we would be foolish to use partial classification, since the forced rule, which classifies z into π_1 if and only if D(z) ≥ c, has misclassification probabilities which satisfy the specified conditions P(2|1) ≤ α_1 and P(1|2) ≤ α_2, and never fails to classify z.
However, the difficulty here lies in finding reasonable estimates ĉ_1, ĉ_2. Even if we had estimators ĉ_1 and ĉ_2, and used the partial classification rule (2.1) with the events A_1 and A_2 given in (2.4), there is no guarantee that P(2|1) ≤ α_1 and P(1|2) ≤ α_2.
The rank procedure will determine two events, A_1 and A_2, whose use in the partial classification rule (2.1) will ensure that P(2|1) ≤ α_1 and P(1|2) ≤ α_2. In this procedure two ranks are assigned to z. One rank measures its closeness to the training sample of x's and the other measures its closeness to the y's.

First we assume that z is from π_1 and accordingly we consider the two samples x_1,...,x_{n_1}, z and y_1,...,y_{n_2} of sizes n_1 + 1 and n_2 from π_1 and π_2 respectively. Based on these samples we determine a discriminant function D_x(·), designed so that discriminant scores for observations from π_1 will tend to be large and observations from π_2 will tend to produce small values for D_x(·). An essential requirement in the determination of D_x(·) is that the sample of n_1 + 1 observations from π_1 must be treated symmetrically, and the n_2 observations from π_2 must be treated symmetrically. That is, D_x(·) must be a symmetric function of x_1,...,x_{n_1}, z and a symmetric function of y_1,...,y_{n_2}. To fix this idea we use the following notation, which emphasizes the dependence of D_x(·) on the observations:

$$D_x(\cdot \mid x_{i_1},\dots,x_{i_{n_1+1}};\, y_{j_1},\dots,y_{j_{n_2}}) = D_x(\cdot \mid x_1,\dots,x_{n_1+1};\, y_1,\dots,y_{n_2}). \tag{2.5}$$
Fig. 3. g_1 and g_2 well separated.
In a previous paper (Broffitt, Randles, and Hogg, 1976) it was shown that when Z ∼ f_1, then R_x(Z) has a discrete uniform distribution over the integers 1,...,n_1 + 1, that is,

$$P[R_x(Z) = i] = \frac{1}{n_1+1}, \qquad i = 1,\dots,n_1+1.$$

This result is independent of the distributions of π_1 and π_2 provided D_x(Z) has a continuous distribution. It follows that
just another x. We must not know which point is actually z (or at least we must not use the knowledge that we know which point is z) in determining D_x(·). If we do not violate (2.5) and a corresponding equation for D_y(·), then R_x(Z) and R_y(Z) will have uniform distributions and the misclassification probabilities will satisfy the upper bound restrictions. This opportunity to 'legally' inspect the data before selecting a discriminant function may be used to great advantage. We should be able to pick reasonably accurate models for the distributions of π_1 and π_2 and thereby improve the efficiency of our analysis without disturbing its nonparametric nature.
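A minimal sketch of the two rank computations, assuming a generic user-supplied scoring function `discriminant(u, xs, ys)` that returns large values for π_1-like points and treats the rows of xs symmetrically and the rows of ys symmetrically, as (2.5) requires; ties are ignored on the assumption of continuous scores:

```python
import numpy as np

def rank_pair(z, x, y, discriminant):
    """R_x(z): rank of z's score among the n1 + 1 scores of the x-sample
    augmented by z (z treated as just another x). R_y(z): the analogue
    with z placed among the y's."""
    xz = np.vstack([x, z])                      # z placed with the x's
    sx = np.array([discriminant(u, xz, y) for u in xz])
    Rx = 1 + int(np.sum(sx[:-1] < sx[-1]))      # low rank: z scores low for pi_1

    yz = np.vstack([y, z])                      # z placed with the y's
    sy = np.array([discriminant(u, x, yz) for u in yz])
    Ry = 1 + int(np.sum(sy[:-1] > sy[-1]))      # low rank: z scores high, unlike pi_2
    return Rx, Ry
```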
Notice that in using this rule we do not need to specify prior probabilities or misclassification costs. This rule produces misclassification probabilities which asymptotically satisfy P(2|1)/P(1|2) = γ. A heuristic argument for this result was given by Randles et al. (1978). In the special case where γ = 1, rule (2.7) classifies z into that population which corresponds to the larger p-value. This is an intuitively appealing result since the p-values measure the affinity of z for the respective populations,
and in general

so that P[Ā_i | Z ∼ f_i] = α_i. We also note that these upper bounds will be sharp if

$$P[A_1 \cap A_2 \cap A_3 \mid Z \sim f_i] = 0,$$
Two methods for determining A_1, A_2, and A_3 will be discussed. In the first we consider the populations pairwise, and for each of these pairs, z is assigned two ranks as in Section 2.3. Let R_{12}(z) and R_{21}(z) be the two ranks assigned to z using populations π_1 and π_2. The first rank, R_{12}(z), is obtained by ranking z among the x's, so that large values of R_{12}(z) indicate that z looks more like an x than a y. Also R_{21}(z) is obtained by ranking z among the y's. Similarly we may obtain R_{13}(z), R_{31}(z), R_{23}(z), and R_{32}(z). The two subscripts for these ranks indicate the training samples used to construct the discriminant function, and the first subscript denotes the training sample among which z is ranked. We could then define the events A_1, A_2, and A_3 as
$$P[ZMC \mid Z \sim f_1] \le P[\bar A_1 \mid Z \sim f_1] \le P[R_{12}(Z) \le a_1 \mid Z \sim f_1] + P[R_{13}(Z) \le a_1 \mid Z \sim f_1] = \frac{2a_1}{n_1+1}. \tag{2.11}$$

Thus to satisfy the bounds P[ZMC | Z ∼ f_i] ≤ α_i we must let a_i = [(n_i + 1)α_i/2], where [·] denotes the greatest integer function. As an example, suppose n_1 = n_2 = n_3 = 18, and we desire α_1 = α_2 = α_3 = 0.10; then a_1 = a_2 = a_3 = 0. In this case A_1 ∩ A_2 ∩ A_3 always occurs and we would always fail to classify z. This undesirable result seems to be somewhat characteristic for the events defined in (2.10).
This situation may be improved with a second method of defining A_1, A_2, and A_3. First we combine the training samples from π_2 and π_3, and consider the two population problem where one population is π_1 and the other is a mixture of π_2 and π_3. We determine two ranks for z as in Section 2.3. Let R_1(z) be the rank of z among the x's and R̃_1(z) the rank of z among the combined samples of y's and w's. Large values of R_1(z) indicate that z looks more like an x than a y or w, and large values of R̃_1(z) indicate that z looks more like a y or w than an x. In an analogous manner we determine R_2(z), R̃_2(z), R_3(z), and R̃_3(z). The ranks R_2(z) and R̃_2(z) are obtained from the two population problem where π_2 is one population and the other is a mixture of π_1 and π_3, and so on. We can then define the events A_1, A_2, and A_3 as

$$A_1 = [R_1(Z) > a_1], \qquad A_2 = [R_2(Z) > a_2], \qquad A_3 = [R_3(Z) > a_3]. \tag{2.12}$$
$$P[ZMC \mid Z \sim f_i] \le P[\bar A_i \mid Z \sim f_i] = P[R_i(Z) \le a_i \mid Z \sim f_i] = \frac{a_i}{n_i+1},$$

where the last equality follows from the uniform distribution of R_i(Z). Thus we can satisfy the bound P[ZMC | Z ∼ f_i] ≤ α_i by selecting a_i = [(n_i + 1)α_i]. Notice that this is the same quantity used for a_i in the two population problem of Section 2.3. In fact, if we extend definition (2.12) to the general m population problem, then we should still select a_i = [(n_i + 1)α_i].
For the partial rule (2.8), the events (2.12) seem to be a better choice than those in (2.10). We should emphasize that the ranks used in (2.12) are based on discriminant functions constructed to separate a group of m populations into one population and a mixture of the remaining m − 1 populations. We may not be able to find such a discriminant function in a simple form such as D_L and D_Q. We must be prepared to consider a variety of different types of discriminant functions, including those based on nonparametric density estimation.
$$P_1(z) = R_1(z)/(n_1+1),$$

since R_1(z) already takes into account the closeness of z to both π_2 and π_3. In general P_i(z) = R_i(z)/(n_i + 1). Both sets of p-values seem reasonable and either may be used. The first set of p-values may be more advantageous since they will usually be based on simpler discriminant functions. In either case, the larger P_i(z) is, the more it appears that z is a member of π_i.
Thus the set of z's that satisfy f(z) = const is an ellipsoid in p dimensions with center at μ and shape determined by V. If moments exist for the distribution in (3.2), the mean is μ and the dispersion matrix is σ²V, where σ² is a constant depending on h(·).
Suppose we have a random sample x_1,...,x_n from an elliptically symmetric distribution (3.2). One possible estimate of μ is x̄, which is a special case of the weighted average (w_1x_1 + ··· + w_nx_n)/∑_{i=1}^n w_i, where each weight equals one. If h(·) is such that we are likely to observe outliers, then x̄ may be a poor estimate, since one extreme observation can pull x̄ away from the center of the data. In this case we can improve our estimate by giving extreme observations smaller weights and thus decrease the influence of outliers should they exist.
Huber (1964) developed the theory of robust M-estimators for univariate distributions, and Maronna (1976) extended this research to the multivariate case. Maronna's estimators μ̂ and V̂ are the solutions of the equations

$$\frac{1}{n}\sum_{i=1}^{n} w_1(s_i)(x_i - \hat\mu) = 0, \qquad \frac{1}{n}\sum_{i=1}^{n} w_2(s_i^2)(x_i - \hat\mu)(x_i - \hat\mu)' = \hat V, \tag{3.3}$$

where s_i² = (x_i − μ̂)'V̂^{-1}(x_i − μ̂). One choice of weight functions is

$$w_1(s) = \begin{cases} 1, & s \le k, \\ k/s, & s > k, \end{cases} \qquad w_2(s^2) = \begin{cases} 1, & s^2 \le k^2, \\ k^2/s^2, & s^2 > k^2. \end{cases}$$
The quantity k² should be chosen so that all but the extreme observations receive weights of one or almost one. We suggest trying k² = p + k_0(2p)^{1/2}, where k_0 is a constant less than 2.0. Many other weight functions are possible. Maronna gives sufficient conditions on weight functions for which the solutions to (3.3) are consistent and asymptotically normal. By consistency we mean that (μ̂, V̂) → (μ_0, V_0) a.s., where (μ_0, V_0) is the solution of

$$E\{w_1(((x - \mu_0)'V_0^{-1}(x - \mu_0))^{1/2})(x - \mu_0)\} = 0$$

and

$$E\{w_2((x - \mu_0)'V_0^{-1}(x - \mu_0))(x - \mu_0)(x - \mu_0)'\} = V_0.$$

$$w_1(s) = -\frac{1}{s}\,\frac{h'(s)}{h(s)} \quad\text{and}\quad w_2(s^2) = w_1(s),$$
Thus we are simply using the usual linear and quadratic discriminant functions with different estimates for the location and scale parameters.

Of course we suggest using the rank procedure (Section 2) in conjunction with these robust discriminant functions in order to classify z. The use of robust estimates seems particularly appropriate when the rank procedure is being applied. For example, in the determination of R_x(z), z is placed with the x's, and if z really is from π_2, then z may act as an outlier in this sample and distort the estimates x̄ and S_x. Consequently z may receive a higher rank than if robust estimates had been used. This of course would be an undesirable feature since, when z belongs to π_2, we want R_x(z) to be as small as possible. In the process of placing z with each training sample we can artificially create outliers, and since robust estimates provide a degree of protection against outliers, these estimates are very attractive when the rank procedure is being used.
As an illustration of the use of robust estimates consider the data displayed in Fig. 4. Here we have m = 2 and p = 2. The ×'s denote observations from π_1 and o's denote observations from π_2. Each sample of ten observations contains one very conspicuous outlier. Each of the four lines represents a different discriminant function, and in each case the region above the line corresponds to classification into π_1 and the region below the line corresponds to classification into π_2. Line L is a plot of D_L(z) = 0 where x̄, ȳ, and S are computed using all 20 observations. In order to assess the effect of the two outliers, we trimmed (deleted) them from the data sets, recomputed the sample means and dispersion matrix, and replotted D_L(z) = 0. The result was line LT. Since we will observe outliers infrequently, our main concern should be to correctly classify the 'inliers'. Thus line LT is preferred as a separator between π_1 and π_2. The difference between lines L and LT dramatically illustrates the damage that outliers can cause. Line H is a plot of D_HL(z) = 0 where D_HL(·) was computed using all 20 observations, and line HT is a plot of D_HL(z) = 0 where D_HL(·) was based on the trimmed sample of 18 observations. We notice that LT and HT are nearly identical; that is, when outliers are not present there is very little difference in the results based on robust
Fig. 4. Training samples from π_1 (×) and π_2 (o), each with one outlier, and the discriminant lines L, LT, H, and HT.
and nonrobust estimates. Lines L and H are quite different, however, showing that the robust estimates did partially adjust for the outliers. While we would prefer line H to be closer to LT, in terms of classifications H will perform more like LT than will L. Lines H and HT were obtained by iteratively solving the three equations:

$$\hat\mu_1 = \sum w_1(s_{1i})x_i \Big/ \sum w_1(s_{1i}), \qquad \hat\mu_2 = \sum w_1(s_{2i})y_i \Big/ \sum w_1(s_{2i}), \tag{3.6}$$
$$\hat V = \Big[\sum w_1^2(s_{1i})(x_i - \hat\mu_1)(x_i - \hat\mu_1)' + \sum w_1^2(s_{2i})(y_i - \hat\mu_2)(y_i - \hat\mu_2)'\Big]\Big/(n_1 + n_2 - 2),$$

where w_1(·) is the Huber weight function given above with k_0 = 1, s_{1i}² = (x_i − μ̂_1)'V̂^{-1}(x_i − μ̂_1), and s_{2i}² = (y_i − μ̂_2)'V̂^{-1}(y_i − μ̂_2). Thus we used a pooled V̂ in computing the weights throughout the iteration process. The final weight for the x outlier was w_1(s_1) = 0.33, and the final weight for the y outlier was w_1(s_2) = 0.25.
For the remaining 18 observations the final weights were all ones. We could move line H closer to LT by decreasing k_0 and thereby decreasing the weights of the outliers. There generally do not appear to be good guidelines for choosing k_0, but in a real problem we could try several values of k_0 and choose that value corresponding to the smallest estimates of misclassification probabilities. Clearly the choice of k_0 is an important practical problem which needs further development.
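A sketch of the iteration (3.6), with Huber weights recomputed from the pooled dispersion estimate at every pass; squaring the weights in the dispersion update and the fixed iteration count are our choices, offered as one reasonable reading of (3.6) rather than a prescription:

```python
import numpy as np

def huber_w(s, k):
    """Huber weight: 1 inside the threshold, k/s outside."""
    return np.where(s <= k, 1.0, k / np.maximum(s, 1e-12))

def mahal(u, V):
    """Row-wise sqrt of u V^{-1} u'."""
    return np.sqrt(np.sum(u * np.linalg.solve(V, u.T).T, axis=1))

def robust_estimates(x, y, k0=1.0, iters=100):
    """Iterate (3.6): Huber-weighted means and a pooled weighted dispersion."""
    p = x.shape[1]
    k = np.sqrt(p + k0 * np.sqrt(2.0 * p))   # so s^2 <= p + k0*sqrt(2p) gets weight 1
    m1, m2, V = x.mean(axis=0), y.mean(axis=0), np.eye(p)
    for _ in range(iters):
        w1 = huber_w(mahal(x - m1, V), k)
        w2 = huber_w(mahal(y - m2, V), k)
        m1 = w1 @ x / w1.sum()
        m2 = w2 @ y / w2.sum()
        r1 = (x - m1) * w1[:, None]          # rows scaled, so r1'r1 uses w1^2
        r2 = (y - m2) * w2[:, None]
        V = (r1.T @ r1 + r2.T @ r2) / (len(x) + len(y) - 2)
    return m1, m2, V
```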
It may be easily verified that this inequality is equivalent to D_L(z) ≥ 0. Notice that the ratio (3.7) can be rewritten as

$$\frac{1}{n_1}\sum_{i=1}^{n_1} \frac{\beta'x_i - \alpha}{(\beta'S\beta)^{1/2}} + \frac{1}{n_2}\sum_{i=1}^{n_2} \frac{\alpha - \beta'y_i}{(\beta'S\beta)^{1/2}}. \tag{3.8}$$

The quantity (β'x_i − α)/(β'Sβ)^{1/2} is a standardized measure of distance between the projection of x_i and the point α. We may think of α as the point of separation between the two classification regions. That is, β'z ≥ α corresponds to classification into π_1, etc. An x for which β'x > α is correctly classified and accordingly contributes a positive increment to (3.8). If β'x < α, then x would be misclassified and accordingly a negative value is added to (3.8). A similar type of statement holds for each y. Fig. 5 should help clarify the idea of comparing projections of x and y in the β direction to a point α. In this figure β'x > α and β'y < α.
Consider now the effect of an extreme observation, which could be either an x or a y. Such an observation can have a disproportionate influence in determining the β that maximizes (3.8). To obtain a robust discriminant function we should bound the contribution of each observation, replacing the summands in (3.8) by a bounded function τ(·) and maximizing instead

$$T(\alpha, \beta) = \frac{1}{n_1}\sum_{i=1}^{n_1} \tau\!\left(\frac{\beta'x_i - \alpha}{(\beta'S\beta)^{1/2}}\right) + \frac{1}{n_2}\sum_{i=1}^{n_2} \tau\!\left(\frac{\alpha - \beta'y_i}{(\beta'S\beta)^{1/2}}\right), \tag{3.9}$$

Fig. 5. Projections of points x and y upon the linear space generated by β.
$$\tau_1(d) = \begin{cases} -k, & d < -k, \\ d, & -k \le d \le k, \\ k, & d > k, \end{cases} \qquad \tau_2(d) = \begin{cases} -1, & d < -k, \\ \sin(\pi d/2k), & -k \le d \le k, \\ 1, & d > k. \end{cases}$$
The quantity T(α, β) will be maximized with respect to α and β using a computer algorithm. Since τ_2(·) is everywhere differentiable, it is smoother than τ_1(·), and consequently τ_2(·) may be easier to maximize with an algorithm. The constant k should be picked so that only the extreme observations produce d's that are larger than k in magnitude. We suggest using k ≤ 2. In using the rank procedure for classifications we may try several values of k and pick the one that seems most appropriate. For example, to determine R_x(z) we consider the two samples x_1,...,x_{n_1}, z and y_1,...,y_{n_2}. Provided that we treat the n_1 + 1 x's symmetrically and also the n_2 y's, we may do anything to determine a good discriminant or ranking function. Thus we may try several k's and also several τ's before deciding which ones to use. The selection of these items should be viewed as just one step in the computation of the discriminant function.

A Monte Carlo study of some robust discriminant functions similar to those discussed above was reported by Randles et al. (1978).
4.1. Introduction
We have developed the rank procedure (Section 2) which forms the basis for
classification rules. Use of this procedure requires a discriminant or ranking
function. If π_1 and π_2 have normal distributions, then we would use D_L(·) or D_Q(·) for the ranking function. Also the robust discriminant functions developed
in Section 3 are appropriate for elliptically symmetric distributions which produce
outliers. In order to accommodate problems with nonelliptically symmetric data,
4.3. Density estimation

A nonparametric discriminant function applicable when all p variates are continuous is based on density estimation. Let x_1,...,x_n be univariate sample observations on a continuous random variable X with pdf f(·). How can we estimate f(z), the density of X at the point z? Let N(z) be the number of sample x's in the interval [z − h, z + h]; then N(z)/n is an estimate of P{z − h ≤ X ≤ z + h}. If we divide this by the length of the interval, 2h, we obtain an estimate of f(z), that is,

$$\hat f(z) = N(z)/2hn.$$

Now define

$$K(u) = \begin{cases} \tfrac{1}{2}, & |u| \le 1, \\ 0, & |u| > 1; \end{cases}$$

then N(z) = 2∑_{i=1}^n K((z − x_i)/h) and

$$\hat f(z) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{z - x_i}{h}\right).$$

More general kernels K(·) may be used; a popular choice is the standard normal kernel

$$K(u) = \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}.$$
Cacoullos (1966) extended the idea of kernel estimates to the multivariate case. Let x_1,...,x_n be sample observations on a continuous p-variate random variable X with pdf f(·). If x_i' = (x_{i1},...,x_{ip}) and z' = (z_1,...,z_p), then

$$\hat f(z) = \frac{1}{n h_1 \cdots h_p}\sum_{i=1}^{n} K\!\left(\frac{z_1 - x_{i1}}{h_1},\dots,\frac{z_p - x_{ip}}{h_p}\right).$$

Applied separately to the two training samples, such estimates f̂_1(z) and f̂_2(z) are consistent estimates of f_1(z) and f_2(z), and the classification of z may be based on f̂_1(z)/f̂_2(z).
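A sketch of kernel density classification with a product normal kernel; the normal-reference bandwidths are a crude default of ours, and in practice h would be tuned:

```python
import numpy as np

def kde(z, data, h):
    """Product normal-kernel density estimate at the point z."""
    u = (z - data) / h                                   # shape (n, p)
    p = data.shape[1]
    kern = np.exp(-0.5 * (u * u).sum(axis=1)) / (2 * np.pi) ** (p / 2)
    return kern.mean() / np.prod(h)

def kde_classify(z, x, y, k=1.0):
    """Classify z into pi_1 iff fhat_1(z)/fhat_2(z) >= k."""
    hx = 1.06 * x.std(axis=0, ddof=1) * len(x) ** (-1.0 / (x.shape[1] + 4))
    hy = 1.06 * y.std(axis=0, ddof=1) * len(y) ** (-1.0 / (y.shape[1] + 4))
    return 1 if kde(z, x, hx) >= k * kde(z, y, hy) else 2
```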
$$B = \{z\colon (z - \bar{y})'S_2^{-1}(z - \bar{y}) - (z - \bar{x})'S_1^{-1}(z - \bar{x}) + \ln(|S_2|/|S_1|) = 2\ln k\}.$$
As f_1(·) and f_2(·) become more irregularly shaped, the set B may become more complex. However, if f_1(·) and f_2(·) are not too complex, then a hyperplane or quadratic surface may be an adequate substitute for B, at least for classification purposes. The hyperplane {z: β'z = α} splits the space of Z into two halfplanes {z: β'z ≥ α} and {z: β'z < α}. If we could find values of β and α, say β* and α*, so that {z: β*'z ≥ α*} is 'similar' to K_1, then we could use the rule: classify z into π_1 if and only if β*'z ≥ α*. In general this rule is not optimal, but it may be reasonable if its misclassification probabilities are not much larger than those of the optimal rule. Of course we would always choose the optimal rule if f_1(·) and f_2(·) are known. In practice, however, we must estimate the sets K_1 and {z: β*'z ≥ α*} from sample data. Estimating the best hyperplane requires estimation of p + 1 quantities (β* and α*), whereas determining K_1 requires estimation of two entire distributions. Since in general it is statistically more efficient to estimate fewer parameters, it is plausible that the rule corresponding to {z: β*'z ≥ α*} may produce smaller misclassification probabilities than the rule based on K̂_1, when the sample sizes are small or moderate.
Consider now the m population classification problem. Let K = (K_1,...,K_m) be a partition of the space of Z which corresponds to the rule that classifies z into π_i if z ∈ K_i. We shall use the symbol K to denote both the partition and its corresponding rule. Let r(K) be the probability of correctly classifying an observation drawn from the mixture of π_1,...,π_m when rule K is used, that is,

$$r(K) = \sum_{i=1}^{m} q_i P(i \mid i).$$
Finally we let C be the class of partitions from which we shall choose our rule. If K⁺ ∈ C and

then K⁺ is a best decision rule among those in class C. If C is the class of all partitions, then K⁺ is the unrestricted best rule since it achieves the largest probability of correct classification. If m = 2 and C is the class of partitions K = (K_1, K_2), where K_1 and K_2 are complementary halfplanes, then the best partition K⁺ is a split of the space of Z into two halfplanes (K_1⁺, K_2⁺) for which the probability of correct classification is a maximum. So rather than considering the class of all decision rules, we may restrict our attention to those rules in a special class C and seek the best rule within this class.
Following Glick (1969), let r̂(K) be the sample based counting estimate of r(K). That is, r̂(K) equals the proportion of training sample observations which are correctly classified by K. If K̂ ∈ C satisfies

i.e., among all rules in C, K̂ correctly classifies the largest proportion of training sample observations, then K̂ is called the best of class rule. Thus we are picking that rule within C, K̂, which maximizes our estimate of r(K). In this sense K̂ is an estimate of K⁺. For certain classes C Glick showed that the rule K̂ is asymptotically equivalent to K⁺. In particular, if m = 2 and C is the class of complementary halfplanes, then

$$r(\hat K) \to r(K^+) \quad \text{a.s.},$$

i.e., with probability one the best of class rule K̂ is asymptotically optimal within the class C.
As an example suppose p and m are both two, and C is the class of complementary halfplanes. Any line in the two-dimensional plane divides the training sample observations into two sets. Observations on one side of the line are classified as x's, and those on the other side as y's. Then we must find that line which separates the observations so that the number correctly classified is a maximum. To illustrate this idea, we use the symbols × and o to represent the training sample observations corresponding to π_1 and π_2 respectively, and consider the example shown in Fig. 6. Notice that the best of class rule is not unique, since there are several lines labelled A, B, and C for which the empirical probability of correct classification is a maximum, in this case 16/19.
When p is larger than 2, determining the best separating hyperplane presents computational difficulties. We can no longer rely on a visual determination of the best line but must use some sort of computer algorithm. This problem along with some of its variants was studied by Wang (1979). The computations generally become quite difficult for large values of n_1, n_2, or p.
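For p = 2 the best-of-class line can still be found by brute force, since an optimal counting rule may be taken to pass through a pair of training points. A sketch (exhaustive, and only practical for small samples):

```python
import numpy as np
from itertools import combinations

def best_of_class_line(x, y):
    """Search lines through pairs of training points; rhat(K) is the counting
    estimate, the fraction correctly classified. Boundary points count as pi_1."""
    pts = np.vstack([x, y])
    labels = np.array([1] * len(x) + [2] * len(y))
    best, best_r = None, -1.0
    for i, j in combinations(range(len(pts)), 2):
        d = pts[j] - pts[i]
        normal = np.array([-d[1], d[0]])     # normal of the line through pts[i], pts[j]
        for b, a in ((normal, normal @ pts[i]), (-normal, -(normal @ pts[i]))):
            rhat = np.mean((pts @ b >= a) == (labels == 1))
            if rhat > best_r:
                best_r, best = rhat, (b, a)
    return best, best_r                       # classify z into pi_1 iff b @ z >= a
```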
Fig. 6. Three best-of-class separating lines A, B, and C for the two training samples.
and c' = (c_1,...,c_p). Rather than using the original data x_1,...,y_{n_2}, z we shall base our discriminant functions on the corresponding rank data a_1,...,b_{n_2}, c.
In particular, for the linear and quadratic discriminant functions we would use

$$D_{RL}(c) = [c - \tfrac{1}{2}(\bar{a} + \bar{b})]' U^{-1} (\bar{a} - \bar{b})$$

and

$$D_{RQ}(c) = (c - \bar{b})'U_2^{-1}(c - \bar{b}) - (c - \bar{a})'U_1^{-1}(c - \bar{a}) + \ln(|U_2|/|U_1|),$$

where

$$\bar{a} = \sum_{i=1}^{n_1} a_i/n_1, \qquad \bar{b} = \sum_{i=1}^{n_2} b_i/n_2,$$

$$U_1 = \sum_{i=1}^{n_1} (a_i - \bar{a})(a_i - \bar{a})'/(n_1 - 1), \qquad U_2 = \sum_{i=1}^{n_2} (b_i - \bar{b})(b_i - \bar{b})'/(n_2 - 1),$$

and

$$U = [(n_1 - 1)U_1 + (n_2 - 1)U_2]/(n_1 + n_2 - 2).$$
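The rank vectors a_i, b_i and c are taken here to be the componentwise ranks within the combined sample x_1,...,x_{n_1}, y_1,...,y_{n_2}, z, in the spirit of the Conover and Iman rank transform; that construction is our assumption. A sketch of D_RL computed on the rank data:

```python
import numpy as np
from scipy.stats import rankdata

def rank_transform(x, y, z):
    """Componentwise ranks of the combined data, yielding a_i, b_i and c."""
    combined = np.vstack([x, y, z[None, :]])
    R = np.apply_along_axis(rankdata, 0, combined)   # ranks within each variate
    n1 = len(x)
    return R[:n1], R[n1:-1], R[-1]

def DRL(a, b, c):
    """Rank analogue of the linear discriminant function."""
    n1, n2 = len(a), len(b)
    abar, bbar = a.mean(axis=0), b.mean(axis=0)
    U = ((n1 - 1) * np.cov(a, rowvar=False) +
         (n2 - 1) * np.cov(b, rowvar=False)) / (n1 + n2 - 2)
    return (c - 0.5 * (abar + bbar)) @ np.linalg.solve(U, abar - bbar)
```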
Fig. 7. Line R, computed from the rank data, for the data of Fig. 4.

It appears that line R will be a much better classifier than line L but possibly not quite as good as line H.
Both lines H and R adjust for outliers, but in different ways. To compute line
H we leave the outliers in their original positions but give them small weights in
the analysis. In computing line R we first move the outliers inward so they are on
the fringe of the data set and then give them full weight in the analysis. We
cannot say which method is better. That depends on the distributions of π_1 and π_2, i.e., in some cases line H may be better and in others line R may be better. We
note however that as the outliers move further from the main body of data, the
computation of line H will assign increasingly smaller weights to the outliers,
whereas line R will not change. That is, line R will remain the same but line H
will move closer to line LT.
Conover and Iman (1978) did a Monte Carlo study comparing classification rules based on D_RL and D_RQ to those based on D_L, D_Q, and a variety of nonparametric density estimates. For the distributions they simulated and the sample sizes used, their general conclusion was that D_RL and D_RQ are nearly as good as D_L and D_Q when π_1 and π_2 have normal distributions, but with nonnormal distributions D_RL is generally better than either D_L or D_Q but not quite as good as D_RQ, although the differences between D_RL and D_RQ seemed to be very slight.
Finally we should note that there have been other proposals for classification rules based on rank data and distance functions. Since we will not review them in detail, the interested reader should consult the references cited. Stoller (1954) considered a univariate classification rule which was later generalized by Glick to the best of class rules (Section 4.4). The univariate problem was also considered by Hudimoto (1964) and Govindarajulu and Gupta (1977). Hudimoto worked with the two population problem, while the latter paper considered the general multipopulation setup. In both cases it is assumed that a sample of n₀ z's, all from the same population, is to be classified. Classification rules based on the ranks of the combined samples are defined. Hudimoto derives bounds on the misclassification probabilities. Govindarajulu and Gupta show that the probability of correct classification converges to one as n₀, n₁,...,n_M approach infinity. Chatterjee (1973) has multivariate samples of sizes n₀, n₁, and n₂ from distributions F₀, F₁, and F₂, where F₀ is a mixture of F₁ and F₂ (i.e., F₀ = θF₁ + (1 − θ)F₂). Thus some of the n₀ z's are from π₁ and some are from π₂. Rather than classifying the z's, the problem is to classify the mixing parameter as either large, small, or intermediate. He defines a decision rule based on the combined sample ranks and shows consistency. Das Gupta (1964) considered the multivariate multipopulation problem where a sample of z's all from the same population is to be classified. His classification rule is based on a comparison of a measure of distance between the empirical distribution function of the z's and the empirical distribution of each of the other samples. He also considers a univariate two-population problem similar to that of Hudimoto. For both problems he shows that his classification rules are consistent; i.e., the probabilities of correct classification converge to one as the sample sizes approach infinity.
References
Logistic Discrimination
J. A. Anderson
1. Introduction
Discriminant methods are used in broadly two ways, firstly to summarise and
describe group differences and secondly to allocate new individuals to groups.
This chapter is chiefly concerned with the latter problem, of which medical
diagnosis is a prime example.
Three of the chief attractions of logistic discrimination are: (i) Few distribu-
tional assumptions are made. (ii) It is applicable with either continuous or
discrete predictor variables, or both. (iii) It is very easy to use--once the
parameters have been estimated, the allocation of a fresh individual requires only
the calculation of a linear function. There are many other methods that have been
suggested for statistical discrimination. One possible categorisation is by the level
of assumptions made about the functional forms of the underlying likelihoods.
Three classes are probably sufficient: (i) fully distributional, (ii) partially distribu-
tional and (iii) distribution-free. Thus, suppose that it is required to discriminate between k groups (H₁,...,H_k) on the basis of random variables xᵀ = (x₁, x₂,...,x_p). The likelihood of x in H_s, L(x|H_s) (s = 1,...,k), may be assumed to have a fully specified functional form except for some parameters to estimate. This is the fully distributional approach (i), the classical example being the assumption of multivariate normality (Welch, 1939; Anderson, T. W., 1958). An example in the partially distributional class (ii) is the logistic discrimination (Anderson, J. A., 1972) where ln{L(x|H_s)/L(x|H_t)} is taken to be linear in the (x_j) or simple functions of them. Here only the likelihood ratios are modelled; the
remaining aspects of the likelihoods are estimated only if required. Distribution-
free methods of discrimination have been described, for example, by Aitchison
and Aitken (1976), and Habbema, Hermans and van den Broek (1974). The basic
idea here is that distribution-free estimates of the likelihoods are found, perhaps
using kernel or nearest neighbour methods. The above is intended not as a review
of the literature but rather to give examples of the three classes of discriminant
method.
The importance of the model (2.1) has many facets. Perhaps the first is that it
gives the posterior probabilities a simple form:
Pr(H₁|x) = exp(β₀′ + ln K + βᵀx)/{1 + exp(β₀′ + ln K + βᵀx)}.  (2.2)

Here n(x) = n₁(x) + n₂(x) is fixed for all x. Now it follows from (2.2) that the decision about discrimination rests solely on the linear function

l(x) = β₀ + βᵀx.
The likelihood, L_c, may now be written down. It is the logistic form of the conditional probabilities in (2.4) and (2.5) which gives its name to this approach to discrimination. We see that L_c in (2.6) is a function of the β_j (j = 0, 1,...,p) and hence the maximum likelihood estimators of these parameters are derived by an iterative optimisation procedure. Note that β₀ is estimable but that β₀′ = β₀ − ln(Π₁/Π₂) is estimable only if Π₁ is known or estimable separately. Technical and practical details of the iterative procedure will be given in the next section.
This is different from L_c but it can be shown that the β_j (j = 0, 1,...,p) are estimated as in the x-conditional sampling case by maximising L_c in (2.6). To see this, note that

L(x|H_s) = Pr(H_s|x)L(x)/Π_s,   s = 1, 2.

Hence
Now the functional form of the likelihood ratio has been assumed in (2.1) but specifically no further assumptions have been made. This implies that L(x), the marginal distribution of x, is left unspecified. Under separate sampling the likelihood is

L_S = ∏_{s=1}^{2} ∏_x {L(x|H_s)}^{n_s(x)}.  (2.9)
It is not at all obvious that the estimation scheme above (maximising L_c in (2.6)) yields maximum likelihood estimates of the β_j's in this case. However, the result does follow. Anderson (1972) gave a proof for discrete random variables and gave a brief discussion of the continuous variate case. A rather different approach will be sketched here following Anderson (1979). That paper was concerned with the same likelihood structure as here, but additionally there were n₃ sample points from the compound distribution

θ₁L(x|H₁) + (1 − θ₁)L(x|H₂)

for some unknown proportion θ₁. The following discussion is derived from Anderson's (1979) results with n₃ = 0.
Suppose f(x) = L(x|H₂); then, by (2.1), L(x|H₁) = e^{β₀′+βᵀx}f(x), and (2.9) becomes

L_S = ∏_x {e^{β₀′+βᵀx}f(x)}^{n₁(x)} {f(x)}^{n₂(x)}.  (2.12)
This expression for L_S involves the parameters β₀′, β and the function f(x). To proceed with the estimation, some structure for f(·) is necessary. The discrete case is most straightforward and will be dealt with first. Thus suppose that the sample space for x is discrete; then the values of f(x) may be taken to be multinomial probabilities. We may attempt to estimate all these as multinomial probabilities without imposing any further structure. Note that if the sample from H₁ is not available, the maximum likelihood estimates of the {f(x)} are f̂(x) = n₂(x)/n₂ for all x. In the more interesting case where we also have n₁ sample points from H₁, the maximisation of L_S is subject to the constraints
Σ_x f(x)e^{β₀′+βᵀx} = 1,  (2.13)

Σ_x f(x) = 1.  (2.14)
Maximising (2.12) over the {f(x)} subject to (2.13) and (2.14) leads to the function L*, where

L* = ∏_x { n₁e^{β₀′+βᵀx}/(n₁e^{β₀′+βᵀx} + n₂) }^{n₁(x)} { n₂/(n₁e^{β₀′+βᵀx} + n₂) }^{n₂(x)}
   = ∏_x {p₁(x)}^{n₁(x)} {p₂(x)}^{n₂(x)},  (2.17)

where p₁(x) and p₂(x) are as given in (2.4) and (2.5) with β₀ = β₀′ + ln(n₁/n₂). It follows that the maximum likelihood estimates of β and β₀ (and hence β₀′) are obtained by maximising L* in (2.17), or equivalently by maximising L_c in (2.6).
If some or all of the x variates are continuous, the above treatment does not hold. Anderson (1979) discusses this case briefly. It is suggested that β₀′ and β may still be estimated by maximising L* or equivalently L_c. The simplest justification is based on the subdivision of each continuous variate to make it discrete. Subject to the condition (2.1) holding, the above approach is applicable with perhaps some information loss due to the range subdivision.

To retain the 'continuous' nature of the x variates introduces some difficulties in the context of maximum likelihood estimation. Any attempt to maximise the likelihood (2.12) with respect to β₀′, β, and the functional form f(·) quickly founders as the likelihood is not bounded above with respect to f(·), even though it is constrained as a density. Good and Gaskins (1971) discuss this in a simpler context with the aid of Dirac delta functions. They showed how to 'bound' the likelihood by introducing smoothness constraints on the function f(·).
2.4. Discussion

Provided that the likelihood ratio satisfies (2.1), the parameters β may be estimated by maximising L_c, irrespective of (i) which of these sampling plans has been taken and (ii) the nature of the x-variates, discrete or continuous. The role of β₀′ is a little different in the three sampling plans considered here, and care is required. In particular, for use in discrimination, it has been seen in (2.2) that β₀′ + ln(Π₁/Π₂) is required. This is estimated automatically in x-conditional and mixture sampling, but in separate sampling β₀′ is estimated and Π₁ must be supplied (or estimated) additionally. This is not normally a severe requirement.

Suppose that β₀ and β have been estimated corresponding to proportions Π₁, Π₂ for H₁, H₂. It is now required to use the discrimination system in a context where sample points are drawn from the mixture of H₁ and H₂ in the proportions Π̃₁ and Π̃₂. Anderson (1972) showed that the appropriate posterior probability for H₁ is obtained by adding ln{(Π̃₁/Π̃₂)/(Π₁/Π₂)} to the constant term β₀.
It has been seen that a key feature of logistic discrimination is that the parameters β₀ and β required for the posterior probabilities, Pr(H_s|x), s = 1, 2, are estimated by the same maximum likelihood procedure for continuous and discrete data, for different underlying families of distributions and for different sampling plans. In all these cases it is necessary to maximise

ln L_c = Σ_x {n₁(x) ln p₁(x) + n₂(x) ln p₂(x)},

where the summation is over all x-values, but clearly only x-values with n(x) > 0 need to be included. The first derivatives are

∂ln L_c/∂β_j = Σ_x {n₁(x) − n(x)p₁(x)}x_j,   j = 0, 1,...,p,  (3.1)

with x₀ ≡ 1, and the second derivatives are equally accessible:

∂²ln L_c/(∂β_j∂β_l) = −Σ_x n(x)p₁(x)p₂(x)x_j x_l,   j, l = 1,...,p.  (3.2)
The only extra information required to start the iterative optimisation procedure, whether Newton or quasi-Newton, is the starting values. Cox (1966) suggested taking linear approximations to p_i(x) (i = 1, 2) and obtaining initial estimates of β₀ and β by weighted least squares. However, Anderson (1972) suggested taking zero as the starting value for all p + 1 logistic parameters. This has worked well in practice, and can be recommended with confidence. Albert (1978) recently showed that except in two special circumstances, which can be readily recognized, the likelihood function L_c has a unique maximum attained for finite β. Hence the procedure of starting from a fixed 'neutral' value may be expected to converge rapidly to the maximum. Special computer programs are available in Fortran and Algol 60 to effect this estimation, and they can be obtained from the author.
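The following is a minimal Newton–Raphson sketch of this estimation (Python; the function name and interface are ours, not a program of Anderson's). It uses the derivatives (3.1) and (3.2) and the zero starting value recommended above; if the data are completely separated the maximum is at infinity and the iteration will diverge, a situation discussed below:

```python
import numpy as np

def logistic_mle(X, y, max_iter=25, tol=1e-8):
    """Newton-Raphson maximisation of L_c for two groups.
    X: n x p matrix of observations, y: n-vector with 1 for H1 and 0 for H2.
    Starts from zero for all p + 1 parameters, as Anderson suggests."""
    Z = np.column_stack([np.ones(len(X)), X])        # prepend constant for beta_0
    beta = np.zeros(Z.shape[1])
    for _ in range(max_iter):
        p1 = 1.0 / (1.0 + np.exp(-Z @ beta))         # Pr(H1 | x) under current beta
        grad = Z.T @ (y - p1)                        # first derivatives, cf. (3.1)
        hess = -(Z * (p1 * (1 - p1))[:, None]).T @ Z # second derivatives, cf. (3.2)
        step = np.linalg.solve(hess, -grad)          # Newton step
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta
```

Because the Hessian is negative definite away from degenerate configurations, the iteration from the 'neutral' zero start typically converges in a handful of steps.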
cov(β̂_j, β̂_l) = I^{jl}, where I⁻¹ = (I^{jl}), for all j and l. This result is obvious except for separate sampling, where the likelihood L_S (2.12) is subject to the constraints (2.13) and (2.14). However, Anderson (1972) showed that the above asymptotic matrix is appropriate in this case also, although the variance of β̂₀, the maximum likelihood estimator of β₀, has a further error of o(1/n) introduced.
where the products are over the sample points from H₁ and H₂, respectively. As k → +∞, L_c(k) → 1. Hence there is a maximum of L_c at β₀ = kγ₀, β = kγ as k → +∞, that is, at infinity. Further, there are equivalent points at infinity giving the same upper bound of 1 for L_c for all hyperplanes which give complete separation. These non-unique maxima suggest that β₀ and β cannot be estimated with much precision, although some bounds may be placed on them. However, from the discrimination point of view, quite a good discriminant function may ensue for any separating hyperplane. It will after all have good discriminant properties on the sample.
It is easy to test whether a particular data configuration exhibits complete separation. If there is a separating hyperplane, the maximum of the likelihood is at infinity, and the iterative procedure must at some stage find a β₀^{(m)} and β^{(m)} such that the plane

l^{(m)}(x) = β₀^{(m)} + β^{(m)ᵀ}x = 0

completely separates the points from H₁ and H₂. The values of l^{(m)}(x) are required in the iterative procedure, so it is quick and simple to check whether l^{(m)}(x) gives complete separation at each stage of the iteration. If so, the iteration stops with an appropriate message. Day and Kerridge (1967) and Anderson (1972) give full details of complete separation.
The second kind of problem of data configuration has zero marginal proportions occurring with discrete data. Again the maximum for L_c is at infinity. For this case Anderson (1974) suggested the ad hoc estimates

β̂₀ = β̂₀^{(−1)} + ln(q̂₁₁/q̂₂₁),

β̂₁ = ln{(p̂₁₁q̂₂₁)/(q̂₁₁p̂₂₁)},

β̂_j = β̂_j^{(−1)},   j = 2,...,p,

where p̂_{i1} is the estimate of Pr(x₁ = 1|H_i), i = 1, 2, and q̂_{i1} = 1 − p̂_{i1}. The estimate p̂_{i1} = (a_{i1} + 1)/(n_i + 2) is suggested, where a_{i1} is the number of sample points observed in H_i with x₁ = 1, and n_i is the total number of sample points from H_i, i = 1, 2. Further, β̂₀^{(−1)} and β̂^{(−1)} are the maximum likelihood estimates of the logistic parameters as derived in this section, but omitting x₁. This approach may be extended easily if more than one variable has troublesome zero marginal proportions.
We must decide which of these models is 'best' and whether it is 'better' than the 'null' model which indicates that none of the potential predictor variables is useful:

M₀ = n₁^{n₁}n₂^{n₂}/nⁿ,

where n = n₁ + n₂. Then x_{(1)} is selected as the 'best' single predictor variable if its maximised log-likelihood M_{(1)} satisfies

M_{(1)} ≥ M_j,   j = 1,...,p.  (3.6)
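A sketch of this first step follows, reusing logistic_mle from the earlier sketch (again, the names and interface are ours): each predictor is fitted alone, its maximised log-likelihood is recorded, and the winner can be compared against ln M₀ for the 'null' model:

```python
import numpy as np

def first_stepwise_variable(X, y):
    """First step of a stepwise scheme: fit a single-variable logistic model
    for each predictor and pick the largest maximised log-likelihood, cf. (3.6).
    Returns the chosen column, its log-likelihood, and ln M_0 for comparison."""
    y = np.asarray(y, dtype=float)
    n1 = y.sum()
    n2 = len(y) - n1
    n = n1 + n2
    log_M0 = n1 * np.log(n1 / n) + n2 * np.log(n2 / n)   # ln M_0 for the null model

    def loglik(beta, col):
        p1 = 1.0 / (1.0 + np.exp(-(beta[0] + beta[1] * col)))
        return np.sum(y * np.log(p1) + (1 - y) * np.log(1 - p1))

    scores = [loglik(logistic_mle(X[:, [j]], y), X[:, j]) for j in range(X.shape[1])]
    best = int(np.argmax(scores))
    return best, scores[best], log_M0
```

Later steps would proceed in the same way, refitting the current predictor set plus each remaining candidate and comparing maximised log-likelihoods.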
The logistic parameters for the five above variables in the predictor set were estimated as part of the stepwise system, giving the estimate of l(x) shown in (4.3). The estimates of the β-parameters in this equation display their dependence on the scale of measurement of the (x_j). For example, the mean of x₃ in the deep vein thrombosis group is 412 while that of x₄ is 9 in the same group. Using the method given in Subsection 3.2, the standard errors of the logistic parameters in (4.3) were estimated to be 2.39, 0.0028, 0.12, 0.03, 0.023 and 0.76, respectively.
The values of l̂(x) were calculated for each of the 124 patients in the study and plotted in Fig. 1, distinguishing between the two groups. Clearly patients with and without deep vein thrombosis are well separated, but there is inevitably a 'grey' zone where group membership is equivocal. In this context there is a preoperative decision to be made: whether to use anti-coagulant therapy or not. This decision depends not only on the relative likelihood of the two groups but also on the costs of the various (state-of-the-world, action) pairs. Bearing in mind the potential gains and losses mentioned at the beginning of this section, a cut-off point of −2.5 was taken, the idea being that patients whose l̂ values are greater than −2.5 should be given preoperative anti-coagulant therapy. No other patient would be given this treatment.
It can be seen from the figure that some quite extreme values of l̂(x) have been found, as low as −8 and as high as 6.5. The corresponding posterior probabilities for H₂ and H₁ are not convincing, as they are very high at 0.9997 and 0.9985, respectively. In other studies even more extreme estimates of the posterior probabilities have been noted. This is to some extent caused by the relatively small number of patients from H₁, but the phenomenon of extreme odds occurs in many similar techniques where estimated parameters are 'plugged' into an expression involving the true values of parameters. Aitchison and Dunsmore (1975) discuss this 'estimative' approach and prefer their 'predictive' method. Unfortunately this is limited largely to multivariate normal distributions. There is no predictive treatment for multivariate discrete distributions. Hence the numerical values of the posterior probabilities obtained from logistic discrimination should be treated with caution, particularly with small samples. However, the ordering of probabilities is unlikely to be changed by a better method of
[Fig. 1. Prognostic index l̂ for patients at risk from deep vein thrombosis (DVT). Sample values: o, patients with DVT; ×, patients with no DVT.]
estimation (if one existed), so some monotone transformation of the index values l̂(x) should give satisfactory estimates of the posterior probabilities. The index values should be interpreted in this light.

Note that here the sampling was from the mixture of the two distributions. Hence the β̂₀ given by the maximum likelihood procedure of Subsection 3.1 required no adjustment for use in the estimated discriminant function or index l̂(x). This would have also been true for x-conditional sampling, but if separate sampling had been employed, ln{(Π₁n₂)/(Π₂n₁)} would have been added to the estimate β̂₀ emerging from the maximum likelihood procedure. As discussed in Subsection 2.4, Π₁ is the proportion of sample points from H₁ in the mixed population in which discrimination is to take place.
The need to derive discriminant functions based on continuous and discrete
variables is common in medical situations, as here, and gives logistic methods
particular importance.
The decision system suggested for the management of postoperative deep vein thrombosis has been tried in practice and has given very satisfactory results, reducing the incidence of the thrombosis from 16% (20/124) here to 3% (3/100) in a recently completed study. Papers are in preparation to report these findings in detail.
Over the last few years certain developments and extensions of logistic dis-
crimination have emerged.
in Sections 2 and 3. In practice, if the number of basic variates p is at all large, say greater than 4 or 5, this approach would result in far too many parameters to estimate in an iterative procedure. Thus for p = 5 and 10, the number of parameters is 21 and 66, respectively.
Anderson (1975) discussed this area in some detail and suggested some approximations for the quadratic form xᵀΛx which enable the estimation to proceed. The simplest of these was to take a rank one approximation Λ = λ₁l₁l₁ᵀ, visualised in terms of the largest eigenvalue (λ₁) and corresponding eigenvector (l₁) of Λ. In this case, the probability of H₁, given x, is

p₁(x) = e^q/(1 + e^q),  (5.2)

where q is the resulting quadratic logit. This is no longer linear in the parameters, but Anderson (1975) showed that the likelihood could be maximised to give estimates of β₀, β₁,...,β_p and l₁,...,l_p, where p₁(x) is given by (5.2). Because of the non-linearity of q, a different iterative procedure from that of Section 3 is required, but using the quasi-Newton methods this is straightforward. Clearly the discriminant function based on Subsections 5.2 and 5.3 has a parabolic boundary if it is agreed to allocate all x such that p₁(x) ≥ γ to H₁, for some γ. Anderson (1975) demonstrated this by means of an example.
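As a small illustration, the posterior under the rank-one model can be evaluated as below (Python; the exact form of q, namely q = β₀ + βᵀx + λ₁(l₁ᵀx)², is our reading of the rank-one approximation rather than a formula quoted from Anderson (1975)):

```python
import numpy as np

def quadratic_logit_posterior(x, beta0, beta, lam, l1):
    """Posterior Pr(H1 | x) under a rank-one quadratic logistic model:
    q = beta0 + beta'x + lam * (l1'x)**2, cf. (5.2). The form of q is an
    assumption based on the approximation Lambda = lam * l1 l1'."""
    q = beta0 + beta @ x + lam * (l1 @ x) ** 2
    return np.exp(q) / (1.0 + np.exp(q))
```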
The need for quadratic discriminant functions is by no means restricted to situations where all the underlying variates are continuous. For example, if the variates are binary and the first-order interactions on the log-linear scale are not equal in the two groups, then again the log-likelihood ratio satisfies (5.1), provided the higher order interactions are the same in the two groups. Usually any log-likelihood ratio that may be thought to be linear in x and satisfy (2.1) may be generalised to be quadratic in x and satisfy (5.1). Note that the basic variables may be transformed before commencing the quadratic logistic procedures.
E(θ̂) = θ + b(θ) + e,  (5.5)

where bᵀ(θ) = [b₁(θ),...,b_r(θ)] and e is a vector whose components are all o(1/n). Here

b_t(θ) = ½ Σ_{i,j,k=1}^{r} I^{ti}I^{jk} { 2E( ∂ln L/∂θ_j · ∂²ln L/(∂θ_i∂θ_k) ) + E( ∂³ln L/(∂θ_i∂θ_j∂θ_k) ) }   (t = 1, 2,...,r).  (5.6)

A term of the form E(∂³ln L/(∂θ_i^a∂θ_j^b∂θ_k^c)) may be estimated by its observed value calculated at θ̂, where a, b, c ≥ 0 and a + b + c = 3. Similarly, for independent identically distributed sample points,

E( ∂ln L/∂θ_j · ∂²ln L/(∂θ_i∂θ_k) )

may be estimated by

Σ_{u=1}^{n} ∂ln l_u/∂θ_j · ∂²ln l_u/(∂θ_i∂θ_k).
L = ∏_x ∏_{s=1}^{2} {p_s(x)L(x)}^{n_s(x)},

as in (2.8). The probabilities {p_s(x)} are as given in (2.4) and (2.5). Although there are two sets of parameters, the {β_j} and the {L(x)}, to estimate, the
maximum likelihood estimation of the {β_j} and the bias correction proceed independently of the terms in L(x), as the likelihood factorises. Hence, as in Subsection 2.2, the maximum likelihood estimation is based on L_c given in (2.6). The first and second derivatives ∂ln L_c/∂β_j and ∂²ln L_c/(∂β_j∂β_l) are given in (3.1) and (3.2). Hence it can be proved (Anderson and Richardson, 1979) that

E( ∂ln L/∂β_j · ∂²ln L/(∂β_k∂β_l) ) = 0,   for all j, k, l.  (5.8)

Thus the bias corrected estimators of the {β_j} may be calculated from (5.7) with the simplification that one set of expectations is all zero.
The situation for separate sampling logistic discrimination is not so straightforward. It was seen in Subsection 2.3 that the estimation of the {β_j} does not separate easily from that of the quantities {f(x)} introduced there, largely because of the constraints (2.13) and (2.14). Strictly, new results for bias correction in the presence of constraints are required, but because other results for mixture and x-conditional sampling have carried over to the separate sampling case, Anderson and Richardson (1979) suggested using the above bias corrections in this case also. They investigated the properties of the bias corrected estimators using simulation studies, and concluded that worthwhile improvements to the maximum likelihood estimates could be obtained provided that the smaller of n₁ and n₂ (the sample sizes from the two groups) was not too small.
separate samples are taken from H₁, H₂ and H₃ of sizes n₁, n₂ and n₃, respectively. Extending the notation of Section 2, suppose that at x, n_s(x) sample points are observed from H_s (s = 1, 2, 3). The likelihood is given by

L_mix = ∏_x ∏_{s=1}^{3} {L(x|H_s)}^{n_s(x)}.  (5.9)
Substituting this result for f(x) into (5.10) implies that β₀′, β, and θ₁ may be estimated by maximising the function L*_mix given in (5.13), where

P(x) = n₁*e^{β₀′+βᵀx}/(n₁*e^{β₀′+βᵀx} + n₂*)  (5.14)

and

Q(x) = 1 − P(x).

The expression in (5.13) clearly displays a compound logistic distribution, which gives its name to this approach.
The function L*_mix may be maximised using one of the quasi-Newton procedures referred to in Section 3. A Fortran program has been written to do this and is available.
Note that L*_mix contains only p + 2 parameters, so the update facility has introduced only one extra parameter, a small cost for the additional power. It follows from the form of n_s* that if there are no sample points from H₃ (n₃(x) = 0 for all x), the functional form of L*_mix reduces to that of L_c in (2.6).

Although the above results hold strictly for discrete random variables, Anderson (1979) used arguments similar to those in Subsection 2.3 to justify their use with continuous random variables or combinations of discrete and continuous random variables in the separate sampling case. In the mixture or x-conditional case, these results may be justified directly for continuous and/or discrete random variables.
The above procedure for updating logistic discriminant functions, using information from sample points of uncertain provenance, gives logistic discrimination a considerable advantage over most of its rivals. For example, if multivariate normal distributions are assumed, the iterative procedures for discriminant updating involve O(p²) parameters. If no distributional assumptions are made, it is difficult to incorporate information from the mixed sample. Thus there is no extension of Fisher's linear discriminant function to cover this case. However, Murray and Titterington (1978) have recently extended the kernel method to provide an alternative to the method of logistic compounds in some circumstances.

The emphasis here has been on discriminant function updating, but the methods derived here can be used quite generally where the fundamental problem is to estimate the mixing proportion θ₁. Given samples from the two distributions and the mixture, maximisation of L*_mix then gives estimates of θ₁ and the logistic parameters. The logistic approach outlined here is particularly appropriate if there is only weak information available about the underlying likelihoods; it provides a partially distributional approach.

For simplicity all the results on logistic discriminant functions have been given so far in terms of two groups. There is no difficulty in extending at once all the previous results to discrimination between k groups, H₁, H₂,...,H_k. An outline of the methods is given here.
Denote the likelihood of the observations x given H_s by L(x|H_s), s = 1,...,k. The equivalent of the fundamental assumption (2.1) on the linearity of the log-likelihood ratio is

ln{L(x|H_s)/L(x|H_k)} = β₀s + β_sᵀx,   s = 1,...,k − 1,  (6.1)

where β_sᵀ = (β₁s, β₂s,...,β_ps). Note that this implies that the log-likelihood ratio has this form for any pair of likelihoods. As in previous sections, the linearity in (6.1) is not necessarily in the basic variables; transforms of these may be taken. It has been shown in Section 2 that the assumptions embodied in (6.1) are likely to hold for a wide range of underlying distributions. It follows that

Pr(H_s|x) = e^{z_s} / Σ_{t=1}^{k} e^{z_t},   s = 1,...,k,  (6.2)

where

z_s = β₀s′ + ln K_s + β_sᵀx,   s = 1,...,k − 1,   and   z_k = 0.  (6.3)
Thus if the β's are known or have been estimated, the decision about the allocation of a sample point x requires little computing, as it depends solely on the linear forms z_s in x, s = 1,...,k − 1.
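A minimal sketch of this allocation step follows (Python; the names and interface are ours, with beta0 holding the k − 1 constants and B the coefficient vectors):

```python
import numpy as np

def allocate(x, beta0, B):
    """Allocate x to one of k groups under the logistic model (6.2).
    beta0: length k-1 vector of constants; B: (k-1) x p matrix of beta_s'.
    Returns the chosen group index (0-based) and the posterior probabilities."""
    z = np.append(beta0 + B @ x, 0.0)   # z_s for s = 1..k-1, with z_k = 0
    post = np.exp(z - z.max())          # subtract max for numerical stability
    post /= post.sum()                  # Pr(H_s | x), cf. (6.2)
    return int(np.argmax(post)), post
```

Since the denominator in (6.2) is common to all groups, allocating by the largest posterior is the same as allocating by the largest z_s.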
Following the notation in Section 2, suppose that n_s(x) sample points are noted at x from H_s (s = 1,...,k). Then under x-conditional sampling the likelihood is

L_c^{(k)} = ∏_x ∏_{s=1}^{k} {Pr(H_s|x)}^{n_s(x)}  (6.4)

or

L_c^{(k)} = ∏_x ∏_{s=1}^{k} { e^{z_s} / Σ_{t=1}^{k} e^{z_t} }^{n_s(x)},  (6.5)

substituting from (6.2). This displays L_c^{(k)} as a function of the {β₀s′} and {β_s} alone. Hence L_c^{(k)} may be maximised to give maximum likelihood estimates of the β-parameters. Nothing further is required for discrimination, but note that if estimates of the {β₀s} are required, extra information about the {K_s} or {Π_s} is required.
Arguing as in Subsection 2.2, it follows that the above procedure also yields the maximum likelihood estimates of the β-parameters under mixture sampling, where the basic random variable is (x, H) (H = H₁, H₂,... or H_k). Here n_s/n gives an estimate of Π_s, so β₀s is estimable without extra information (s = 1,...,k − 1). Note n_s = Σ_x n_s(x) and n = Σ_{s=1}^{k} n_s.

The separate sampling case is more complicated, but it can be proved (Anderson, 1972 and 1979) that for discrete random variables maximisation of L_c^{(k)} again gives estimates of the {β₀s′} and {β_s}.
However, the role of the constants differs: the β₀s′ are estimable here directly, but for discrimination β₀s′ + ln(Π_s/Π_k) is required, as in (6.3), s = 1,...,k − 1. Thus for discrimination the {Π_s} must be estimated separately. If some of the variables are continuous under separate sampling, it is suggested that the above approach of maximising L_c^{(k)} is still valid. The full justification for this is still awaited, but it is certainly approximately valid (Anderson, 1979).
The likelihood L_c^{(k)} in (6.5) is maximised iteratively using a quasi-Newton or a Newton-Raphson procedure along the lines of Section 3. Anderson (1972, 1979) gives full details of this, and Fortran programmes are available.

Complete separation may occur with k groups, but again these data configurations can be easily identified in the course of the iterative maximum likelihood procedure. Hence, although the maximum of the likelihood is achieved at a point at infinity in the parameter space, the situation is recognised and the iterations stopped before time has been wasted. In this case the estimates of the parameters are unreliable but good discriminant functions emerge (Anderson, 1972).

Zero marginal sample proportions cause the same difficulties with k groups as with two. The ad hoc method suggested by Anderson (1974) may be used here also.
The ideas of quadratic logistic discriminators can be applied immediately in the k group case, as discussed by Anderson (1975). Equally, compound logistic methods may be used (i) to update discriminant functions for k groups using data points of uncertain origin and (ii) to estimate the mixing proportions of k groups (Anderson, 1979).

In short, there is no additional theoretical problem if the number of groups is greater than two. Note, however, that some constraints on the dimensionality (p) of a problem may be introduced because of the number of parameters to be estimated. In the k group case, there are (k − 1) × (p + 1) parameters, and clearly this number must be kept within the operational limits of the optimisation procedure.
Logistic methods have been described here from the standpoint of their application in discrimination. However, implicit in the assumptions (2.1) and (6.1) are the models (2.2) and (6.2) for the conditional probability of H_s given x. If the {H_s} are now thought of as representing levels of a variable of interest, say y, then the methods discussed here may be used to investigate aspects of the relationship between y and x. This is because (2.2) and (6.2) now model the conditional distribution of y given x as logit regressions. These ideas have been used and developed in various contexts.
References
Aitchison, J. and Aitken, C. G. G. (1976). Multivariate binary discrimination by the kernel method.
Biometrika 63, 413-20.
Aitchison, J. and Dunsmore, I. R. (1975). Statistical Prediction Analysis. Cambridge University Press,
Cambridge.
Albert, A. (1978). Quelques apports nouveaux à l'analyse discriminante. Ph.D. Thesis, Faculté des Sciences, Université de Liège.
Albert, A. and Anderson, J. A. (1981). Probit and logistic discriminant functions. Comm. Statist.--
Theory Methods 10, 641-657.
Anderson, J. A. (1972). Separate sample logistic discrimination. Biometrika 59, 19-35.
Anderson, J. A. (1973). Logistic discrimination with medical applications. In: T. Cacoullos, ed.,
Discriminant Analysis and Applications, pp. 1-15. Academic Press, New York.
Anderson, J. A. (1974). Diagnosis by logistic discriminant function: Further practical problems and
results. Appl. Statist. 23, 397-404.
Anderson, J. A. (1975). Quadratic logistic discrimination. Biometrika 62, 149-54.
Anderson, J. A. (1979). Multivariate logistic compounds. Biometrika 66, 17-26.
Anderson, J. A. and Richardson, S. C. (1979). Logistic discrimination and bias correction in maximum
likelihood estimation. Technometrics 21, 71-8.
Anderson, T. W. (1958). An Introduction to Multivariate Statistical Analysis, p. 133. Wiley, New York.
Barnett, V. and Lewis, T. (1978). Outliers in Statistical Data. Wiley, Chichester.
Bartlett, M. S. (1947). Multivariate analysis. J. Roy. Statist. Soc. Suppl. 6, 169-73.
Breslow, N. and Powers, W. (1978). Are there two logistic regressions for retrospective studies?
Biometrics 34, 100-5.
Clayton, J. K., Anderson, J. A. and McNicol, G. P. (1976). Preoperative prediction of
postoperative deep vein thrombosis. British Med. J. 2, 910-2.
Cox, D. R. (1966). Some procedures associated with the logistic qualitative response curve. In: F. N. David, ed., Research Papers in Statistics: Festschrift for J. Neyman, pp. 55-71. Wiley, New York.
Cox, D. R. (1972). Regression models and life tables (with discussion). J. Roy. Statist. Soc. Ser. B. 34,
187-220.
Cox, D. R. and Hinkley, D. V. (1974). Theoretical Statistics, p. 309. Chapman and Hall, London.
Day, N. E. and Kerridge, D. F. (1967). A general maximum likelihood discriminant. Biometrics 23,
313-23.
Farewell, V. T. (1979). Some results on the estimation of logistic models based on retrospective data.
Biometrika 66, 27-32.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Ann. Eugen. Lond. 7,
179-88.
Gardner, M. J. and Barker, D. J. P. (1975). A case study in techniques of allocation. Biometrics 31,
931-42.
Gill, P. E. and Murray, W. (1972). Quasi-Newton methods for unconstrained optimisation. J. Inst.
Math. Appl. 9, 91-108.
Good, I. J. and Gaskins, R. A. (1971). Non-parametric roughness penalties for probability densities.
Biometrika 58, 255-77.
Habbema, J. D., Hermans, J. and van den Broek, K. (1974). A stepwise discriminant analysis program
using density estimation. In: G. Bruckmann, ed., Compstat 1974, pp. 101-110. Physica, Vienna.
Mantel, N. (1973). Synthetic retrospective studies and related topics. Biometrics 29, 479-86.
Murray, G. D. and Titterington, D. M. (1978). Estimation problems with data from a mixture. Appl.
Statist. 27, 325-34.
Prentice, R. (1976). Use of the logistic model in retrospective studies. Biometrics 32, 599-606.
Prentice, R. and Breslow, N. (1978). Retrospective studies and failure time models. Biometrika 65,
153-8.
Rao, C. R. (1965). Linear Statistical Inference and its Applications, p. 414. Wiley, New York.
Truett, J., Cornfield, J. and Kannel, W. (1967). A multivariate analysis of the risk of coronary heart disease in Framingham. J. Chron. Dis. 20, 511-24.
Welch, B. L. (1939). Note on discriminant functions. Biometrika 31, 218-20.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 193-197
for some function g, are termed nearest neighbor rules, while rules which can be put in the form (1) for some g are called k-local.

The probability of error for a rule given the data and attached random variables is given by

L_n = P[θ̂ ≠ θ | D_n],

where

D_n = ((X₁, θ₁, Z₁),...,(X_n, θ_n, Z_n)).

The frequency interpretation of L_n is that a large number of new observations, whose states are estimated with the rule and the given data, will produce a frequency of errors equal to the value of L_n. (Each of these new observations will have a new independent Z attached to it but the Z₁,...,Z_n stay fixed with the data.) The random variable L_n is important then because it measures the future performance of the rule with the given data.
Most of the results dealing with nearest neighbor rules are of the asymptotic variety, that is, results concerned with where L_n converges to and how it converges as n tends to infinity. If the limiting behavior of L_n compares favorably to L*, the Bayes probability of error (the smallest possible probability of error if one knew the distribution of (X, θ)), then one has some hope that the rule will at least perform well with large amounts of data. For the k-nearest neighbor rule with fixed k the first result of this type, and certainly the best known, is that of Cover and Hart (1967) who showed that

E L_n → L  (3)

when P[θ = i|X = x] has an almost everywhere continuous version, 1 ≤ i ≤ M. In (3) L is a constant satisfying, for k = 1,

L* ≤ L ≤ 2L*(1 − L*) ≤ 2L*.  (4)
For arbitrary k the "2" in (4) is replaced by a_k, where a_k ↓ 1 as k → ∞. For these same assumptions it is also known that

L_n → L in probability  (5)

(Wagner, 1971), with convergence in (5) actually being with probability one for k = 1 (Fritz, 1975).
If k is allowed to vary with n, then Stone (1977) showed that for any distribution of (X, θ),

L_n → L* in probability  (6)

if k = k_n → ∞ and k_n/n → 0. If in addition k_n/log n → ∞, then (6) holds with the convergence being with probability one (Devroye, 1981a).
In view of Stone's result, it might be expected that the asymptotic results for the k-nearest neighbor rule with k fixed are also distribution-free, that is, that no conditions on the distribution of (X, θ) are needed for (5). In fact, using Stone's way of breaking ties, Devroye (1981b) has shown exactly that. Moreover, the constant L for the general case, which is the same as Cover and Hart's for their assumptions on the distribution of (X, θ), continues to obey the inequality (4).
As intellectually satisfying as these results are, one is still faced with the finite sample situation. You have data D_n and your immediate need is for a reliable estimate of L_n for your chosen rule. You may even wish to examine the data and then pick the rule. In this case reliable estimates of L_n for each rule may guide you in your choice. If one is using a local rule, then a natural estimate is the deleted estimate of L_n given by

L̂_n = (1/n) Σ_{i=1}^{n} I_{[θ̂_i ≠ θ_i]},

where θ̂_i is the estimate of θ_i from X_i, Z_i, and D_n with (X_i, θ_i, Z_i) deleted. This definition requires, of course, that k ≤ n − 1. Deleted estimates are not easy to compute but, in cases like the k-nearest neighbor rule, the computation is reasonable and the intuitively appealing use of the data can be taken advantage of. Rogers and Wagner (1977) have shown that for all distributions of (X, θ) and any k-local rule

E(L̂_n − L_n)² ≤ (2k + 1/4)/n + 2k(2k + 1/4)^{1/2}/n^{3/2} + k²/n².  (7)
References

Cover, T. and Hart, P. (1967). Nearest neighbor pattern classification. IEEE Trans. Inform. Theory 13, 21-27.
Devroye, L. (1981a). On the almost everywhere convergence of nonparametric regression function estimates. Ann. Statist. 9, 1310-1319.
Devroye, L. P. (1981b). On the inequality of Cover and Hart in nearest neighbor discrimination. IEEE Trans. Pattern Analysis Machine Intelligence 3, 75-78.
Devroye, L. P. and Wagner, T. J. (1979a). Distribution-free inequalities for the deleted and holdout error estimates. IEEE Trans. Inform. Theory 25, 202-207.
Devroye, L. P. and Wagner, T. J. (1979b). Distribution-free performance bounds with the resubstitution error estimate. IEEE Trans. Inform. Theory 25, 208-210.
Fix, E. and Hodges, J. (1951). Discriminatory analysis: Nonparametric discrimination: consistency properties. Rept. No. 4, USAF School of Aviation Medicine, Randolph Field, TX.
Fritz, J. (1975). Distribution-free exponential error bound for nearest neighbor pattern classification. IEEE Trans. Inform. Theory 21, 552-557.
Penrod, C. S. and Wagner, T. J. (1979). Risk estimation for nonparametric discrimination and estimation rules: A simulation study. IEEE Trans. Inform. Theory [to appear].
Ritter, G. L., Woodruff, H. B., Lowry, S. R., and Isenhour, T. L. (1975). An algorithm for a selective nearest neighbor rule. IEEE Trans. Inform. Theory 21, 665-669.
Rogers, W. H. and Wagner, T. J. (1977). A finite sample distribution-free performance bound for local discrimination rules. Ann. Statist. 6, 506-514.
Stone, C. J. (1977). Consistent nonparametric regression. Ann. Statist. 5, 595-645.
Wagner, T. J. (1971). Convergence of the nearest neighbor rule. IEEE Trans. Inform. Theory 17, 566-571.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 199-208

Maximum Likelihood Approaches to Cluster Analysis

G. J. McLachlan
1. Introduction
*This work was completed while the author was on leave with the Department of Statistics at
Stanford University, and was supported in part by ONR contract N00014-76-C-0475.
More recently, Bryant and Williamson [4] extended Marriott's results and showed that the method may be expected to give asymptotically biased results quite generally.

A related approach is the mixture maximum likelihood method considered by Day [5] and Wolfe [34], among many others. With this approach x₁,...,x_n are assumed to be a random sample of size n from a mixture of H₁,...,H_k in the proportions (e₁,...,e_k)′ = e. Hence the likelihood (1.2) can be formed; the estimates of θ and e obtained by maximizing (1.2) are denoted by θ̂ and ê respectively. Each x_j can then be classified on the basis of the estimated posterior probabilities P̂_ij (i = 1,...,k) formed by replacing θ and e with θ̂ and ê in

P_ij = P{x_j ∈ H_i | x_j},

and x_j is assigned to H_g if

P̂_gj ≥ P̂_ij,   i = 1,...,k.
It can then be seen that the mixture approach is equivalent to the classification procedure with the additional assumption that γ₁,...,γ_n is an (unobservable) random sample from a probability distribution with mass e_i at i (i = 1,...,k). It appears to avoid the asymptotic biases associated with the classification procedure, where at each step in the iterative process of computing the maximum likelihood estimates each x_j is assigned outright to a particular subpopulation according to the estimate for γ_j. By contrast, the mixture approach does not insist on definite membership of any subpopulation; rather it gives an estimated probability of membership of each subpopulation.

Note that another approach to this problem is to proceed further and adopt a Bayesian procedure in which all parameters are random variables (see [2] and [32]).
A common assumption in practice is to adopt the normality model (1.3). In the following sections we consider these approaches under the normality model (1.3), which is assumed to hold through to Section 5, where the condition of a common covariance matrix is relaxed to cover the general case of unequal covariance matrices.
2. Classification approach
where μ̂_i and Σ̂ are the ordinary maximum likelihood estimates of μ_i and Σ for a sample of normal observations classified according to γ̂. Hence the solution can be computed iteratively [17, 30]. Starting with some initial clustering γ, the μ_i and Σ are estimated accordingly and then used to give a new estimate of γ on the basis of (2.1), equivalent to allocating each observation to the nearest cluster centre in terms of the estimated Mahalanobis distance. Each step in the iterative process yields a value of the likelihood not less than that at the previous step, and the iterations may be continued until no observation changes clusters. Various starting values should be taken in an attempt to locate the global solution. It will be seen in the next section that the likelihood equations under the mixture approach can be easily modified to be applicable also under the classification approach. There are other procedures for finding the solution under the classification approach; for example, the Mahalanobis distance version of MacQueen's [20] k-means procedure, where the μ̂_i and Σ̂ are re-estimated after each observation is allocated, rather than waiting until after all the observations have been allocated.
For the classification approach applied under the normality model (1.3), Scott and Symons [31] showed that γ̂ corresponds to the partition which minimizes the determinant of the pooled within-subpopulations sum of squares matrix

W = Σ_{i=1}^{k} W_i,

where

W_i = Σ_{q=1}^{n_i} (x_q^{(i)} − x̄_i)(x_q^{(i)} − x̄_i)′

and x_q^{(i)} denotes the q-th of the n_i observations assigned to H_i, with mean x̄_i.

n log|W| − 2 Σ_{i=1}^{k} n_i log n_i
3. Mixture approach

for r = 1,...,k; that is, a₁ = 0 and b₁ = 0. The maximum likelihood estimates are evaluated from the equations (3.1)-(3.3), which can be solved iteratively by substituting some initial values for the estimates into the right-hand side of (3.1)-(3.3) to produce new estimates on the left-hand side, which are then substituted into the right-hand side, and so on. These iterative estimates can be identified with those obtained by directly applying the so-called EM algorithm of Dempster et al. [6], which shows that the estimates will converge to a local maximum irrespective of the starting point.
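As an illustration of this iterative scheme, the following minimal EM sketch (Python; the names and the crude starting values are ours) fits a two-component p-variate normal mixture with a common covariance matrix, i.e., model (1.3) with k = 2; as noted below, several starting values should be tried in practice:

```python
import numpy as np

def em_two_normals(X, e1=0.5, n_iter=100):
    """EM for a mixture of two p-variate normals with common covariance.
    Returns the proportions, means, covariance, and posterior memberships."""
    n, p = X.shape
    mu1, mu2 = X[: n // 2].mean(axis=0), X[n // 2 :].mean(axis=0)  # crude start
    Sigma = np.cov(X, rowvar=False)
    e = np.array([e1, 1 - e1])
    for _ in range(n_iter):
        # E-step: estimated posterior probabilities of membership of H_1.
        # The normal density constant cancels since Sigma is common.
        Sinv = np.linalg.inv(Sigma)
        def dens(mu):
            d = X - mu
            return np.exp(-0.5 * np.einsum("ij,jk,ik->i", d, Sinv, d))
        w1, w2 = e[0] * dens(mu1), e[1] * dens(mu2)
        P = w1 / (w1 + w2)
        # M-step: update proportions, means and the common covariance.
        e = np.array([P.mean(), 1 - P.mean()])
        mu1 = (P[:, None] * X).sum(0) / P.sum()
        mu2 = ((1 - P)[:, None] * X).sum(0) / (1 - P).sum()
        d1, d2 = X - mu1, X - mu2
        Sigma = (np.einsum("i,ij,ik->jk", P, d1, d1)
                 + np.einsum("i,ij,ik->jk", 1 - P, d2, d2)) / n
    return e, mu1, mu2, Sigma, P
```

Each observation x_j would then be assigned to the component with the larger posterior probability P̂_ij, in line with the mixture approach described above.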
Here

m = e₁μ₁ + e₂μ₂

and

V = Σ + e₁e₂(μ₁ − μ₂)(μ₁ − μ₂)′

are the mean and covariance matrix of the mixture distribution; a and b denote a₂ and b₂ with their subscripts suppressed, since k = 2 only. The maximum likelihood equations now can be written as

m̂ = Σ_{j=1}^{n} x_j/n,  (3.4)
Only values of â and b̂ are needed in solving the above equations, as m̂ and V̂ are given explicitly.
To obtain suitable initial values of a and b, it is suggested for various bivariate subsets of the variables to plot the data points and draw a line which divides the data into two groups which have a scatter that appears normal (see, for example, [28] and [12]). Estimates of a and b can be formed on the basis of this subdivision, proceeding as if the observations were correctly classified. There appears to be no difficulty in locating the global maximum for p = 1 and p = 2, but for p ≥ 3 there are problems with multiple maxima, particularly for small values (less than two, say) of the Mahalanobis distance between H₁ and H₂, when n is not large [5]. Also, it is well known [5, 16] that maximum likelihood estimates based on a mixture of normal distributions are very poor unless n is very large (for example, n > 500). However, Ganesalingam and McLachlan [11] found that although the maximum likelihood estimates â and b̂ may not be very reliable for small n, it appears that the proportions in which the components of â and b̂ occur are such that the resulting discriminant function, â′x + b̂, may still provide reasonable separation between the subpopulations.
Note that the same set of equations here can be used as follows to compute the estimates μ̂_i, Σ̂, and γ̂ under the classification approach. At a given step, γ̂_j is put equal to that g for which P̂_gj ≥ P̂_ij (i = 1,...,k), where, in the P̂_ij, the log(ê_i/ê_g) term is omitted. Then on the next step the μ̂_i and Σ̂ are computed from (3.1)-(3.3) in which, for each j, P̂_ij is replaced by 1 (i = g) and 0 (i ≠ g). The transformed equations (3.4)-(3.7) for k = 2 are also applicable to the classification approach with the above modifications; that is, the term corresponding to ê_i in (3.6) is given by n_i/n (i = 1, 2), while there is no term corresponding to log(ê₂/ê₁) in (3.7).
A simulation study undertaken by Ganesalingam and McLachlan [13] for k = 2
suggests that overall the mixture approach performs quite favourably relative to
the classification approach even where mixture sampling does not apply. The
apparent slight superiority of the latter approach for samples with subpopulations
represented in approximately equal numbers is more than offset by its inferior
performance for disparate representations.
e = {E(R) − R₀}/{E(R_M) − R₀},  (4.1)

where E(R_M) and E(R) denote the unconditional error rates of the mixture and classical procedures respectively, applied to an unclassified observation subsequent to the initial sample, and R₀ denotes their common limiting value as n → ∞. The asymptotic relative efficiency was obtained by evaluating the numerator and denominator of (4.1) up to and including terms of order 1/n. The multivariate analogue of this problem was considered independently by O'Neill [28]. By definition the asymptotic relative efficiency does not depend on n, and O'Neill [28] showed that it also does not depend on p for equal prior probabilities, e₁ = 0.5. The asymptotic values of e are displayed in Table 1 as percentages for selected combinations of Δ, e₁, p, and n; the corresponding values of e obtained from simulation are extracted from [11] and listed below in parentheses. It can be seen that the asymptotic relative efficiency does not give a reliable guide as to the true relative efficiency when n is small, particularly for Δ = 1. This is not surprising since the asymptotic theory of maximum likelihood for this problem requires n to be very large before it applies [5, 16]. Further simulation studies by Ganesalingam and McLachlan [11] in the univariate case indicate that the asymptotic relative efficiency gives reliable predictions at least for n ≥ 100 and Δ ≥ 2.
Table 1
Asymptotic versus simulation results for the relative efficiency of the mixture approach

         p = 1, n = 20         p = 2, n = 20         p = 3, n = 40
Δ       e₁ = 0.25  e₁ = 0.50   e₁ = 0.25  e₁ = 0.50   e₁ = 0.25  e₁ = 0.50
1        0.25       0.51        0.34       0.51        0.42       0.51
        (33.01)    (25.12)     (46.71)    (63.11)     (25.00)    (43.39)
2        7.29      10.08        9.36      10.08       10.51      10.08
        (22.05)    (17.74)     (25.73)    (16.26)     (16.28)    (14.51)
3       31.41      35.92       35.13      35.92       36.78      35.92
        (19.57)    (23.54)     (43.91)    (29.63)     (29.01)    (23.46)
The simulated values for the relative efficiency in Table 1 suggest that for the
mixture approach to perform comparably with the classical discrimination proce-
dure it needs to be based on about two to five times the number of initial
observations, depending on the combination of the parameters.
for selected values t₁,...,t_h of t in some small interval (c, d), c < 0 < d, where

φ(t) = Σ_{i=1}^{2} e_i exp(μ_i t + ½σ_i²t²)
We now consider the situation where the classification of some of the observations in the sample is initially known. This information can be easily incorporated into the maximum likelihood procedures for the classification and mixture approaches. If an x_j is known to come from, say, H_r, then under the former approach γ_j = r always in the associated iterative process, while under the latter P̂_ij is set equal to 1 (i = r) and 0 (i ≠ r) in all the iterations. In those situations where there are sufficient data of known classification to form a reliable discrimination rule, the unclassified data can be clustered simply according to this rule and, for the classification approach, the results of McLachlan [24, 25] suggest this may be preferable unless the unclassified data are in approximately the same proportions from each subpopulation. With the mixture approach a more efficient clustering of the unclassified observations should be obtained by simultaneously
References
of original and derived variables. There are in addition some displays specific to
each of the analyses mentioned, but these will not be discussed here.
One technique of multivariate analysis which does generate important new graphical displays is cluster analysis. Here a set of objects (which may be either the observations or the variables) is assigned to clusters, such that objects within each cluster are relatively close together or similar. Frequently, the clustering is hierarchical; for example, by successively merging clusters. The need to understand the process of clustering and its relation to the underlying data leads to a set of graphical displays, discussed in Section 3.
Standard graphical techniques, such as scatter plots, time-series plots and histograms, are directly useful only for one or two variables. Even the techniques for three-dimensional data apply only to a minority of problems, since most interesting multivariate data sets will have more than three variables.
One must keep in mind that we are trying to use a fundamentally two-dimensional plot to represent data which have intrinsically more than two dimensions. No single multivariate plot is likely to convey all the relevant information for any nontrivial data set. For effective data analysis, one needs to try several techniques and to integrate the graphical displays with model-fitting, data transformation and other data-analytic techniques. In addition, the suitability of specific methods for a given set of data depends on the number of observations, number of variables and other properties of the data.
It is useful to group most multivariate plotting methods into two classes:
- extensions of scatter plots, and
- symbolic plots.
In the first class the actual plots are two-variable scatter plots. A set of these is
generated to represent higher-dimensional data, directly or through derived vari-
ables. In the second class the data values are not used as coordinates in scatter
plots but as parameters to select or draw graphical symbols. The two classes are
not mutually exclusive. Symbols may usefully enhance scatter plots in some cases.
relationship for most pairs of companies, along with some outlying years which
deviate from the general pattern. Some relationships are substantially stronger
than others, e.g., CP versus UAL is very weak.
In addition to such general criteria, one should also be ready to use information
about the specific data at hand in selecting graphical displays. For our example,
the set of 15 companies naturally formed four subsets: railroads, domestic and
international airlines and the conglomerate. Picking one company from each
subset gives us a display requiring only 6 scatter plots (Fig. 2). We see somewhat
more detail in the relationships; for example, we can see that two years have
unusually high returns for TWA, accounting for part of the departure from the
positive relation.
To apply scatter plots for larger numbers of variables, one may select either a
subset of variables or a subset of variable pairs, generating a more reasonable
number of plots. Essentially any variable-selection technique could be used,
depending on the goals of the analysis (e.g., subsets defined by regressing one
important variable on all others). Pairs can be selected by looking, say, at
properties of their joint distribution, such as comparing ordinary pairwise correla-
tions with robust correlations, and then looking at the pairs of variables where the
two correlations differ substantially. Conversely, one may look at all scatter plots
within one subset. Fig. 3 shows the scatter plots for the railroads.
If one is willing to sacrifice the direct interpretability of the original variables,
plots may be made for derived variables. Any of the techniques of multivariate
analysis (Gnanadesikan, 1977; Kruskal and Wish, 1978) could be used to derive a
smaller set of variables to plot. Examples are:
(i) principal component analysis;
(ii) factor analysis;
(iii) multidimensional scaling;
(iv) canonical analysis;
(v) discriminant analysis.
The first three methods define derived variables intended to represent as much as possible of the internal variation in the data. Note that multidimensional scaling may also be used when the data is originally a set of similarities among the n observations. For canonical analysis, the original variables are divided into subsets, and derived variables are found within the subsets that are highly correlated with each other. Discriminant analysis takes a partitioning of the observations into groups, and looks for derived variables that are good discriminators among the groups, i.e. predict well some contrast among the groups.

In general, the result of the analysis is some set of new variables of which we use the first k to represent the original variables. Graphical presentation may be more effective if we can choose k ≪ p. One is tempted, of course, to choose k = 2. Unless the data support this, the temptation should be resisted.
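A minimal sketch of the most common case, plotting the first two principal components, is given below (Python with matplotlib; the function name is ours). The proportion of variance explained is reported so the reader can judge whether k = 2 is really supported:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_first_two_pcs(X, labels=None):
    """Scatter plot of the first two principal components of the data
    matrix X (observations in rows); a common way to derive two plotting
    variables from p > 2 original variables."""
    Xc = X - X.mean(axis=0)                       # centre the variables
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:2].T                        # first two principal components
    explained = (s[:2] ** 2).sum() / (s ** 2).sum()
    plt.scatter(scores[:, 0], scores[:, 1])
    if labels is not None:                        # e.g., years, for time-indexed data
        for lab, (u, v) in zip(labels, scores):
            plt.annotate(str(lab), (u, v))
    plt.xlabel("First principal component")
    plt.ylabel("Second principal component")
    plt.title("First two PCs explain %.0f%% of variance" % (100 * explained))
    plt.show()
```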
Fig. 4 shows the pairwise scatter plot of the first two principal components of the transportation data. Table 1 shows the linear combinations of the companies defining the first and second principal components. The first principal component gives positive weight to all companies, and may be interpreted as a
[Fig. 3. Scatter plots for the railroads.]
[Fig. 4. Scatter plot of the second versus the first principal component of the transportation data, with individual years marked.]
Table 1

        TNW    FN     CP     CO     NFW    MIS    SX     SR     SCI
1st     0.24   0.20   0.18   0.13   0.10   0.11   0.11   0.17   0.16
2nd    -0.53  -0.24  -0.13  -0.07  -0.08  -0.18  -0.13  -0.23  -0.23
measure of overall behavior, with relatively higher weights for the airlines reflecting their greater variability. The plot shows that 1954, 1958, 1963 and 1971 have all been exceptionally good years. The second principal component contrasts TNW, NWA and the railroads with the other airlines but seems harder to interpret.

In any event, if a small set of derived variables is a good representation of the original variables, plots of this set may be helpful. Drawbacks are the difficulty of explaining the analysis leading to the plots and, more fundamentally, the danger that the transformation has obscured rather than enhanced some essential information, reducing the usefulness of these plots as diagnostic aids.
The procedure is essentially the same for continuously varying symbols, except that there is no need to use only a discrete set of values. One must decide what range of symbol values is wanted; for example, whether the shortest line should be of zero length. Notice that line length gives more graphical impact to large values than to small ones. Such biases are a significant danger with many symbolic plots. They can be somewhat relieved by plotting several times with changes in the variables (e.g., −x_ij instead of x_ij).
The two most common approaches to choosing symbols are to have either a
one-parameter symbol correspond to each data value or a p-parameter symbol
correspond to each observation. The former method, which we call a symbolic
matrix, usually amounts to taking the printed form of the data matrix and
replacing each entry by a symbol. One may choose any set of symbols, but usually
the choice should be made so that the symbols are obviously ordered. For
example, printed characters, possibly overstruck, will give a varying grey scale.
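A minimal sketch of the symbolic-matrix idea, assuming a numeric data matrix and an ordered four-symbol 'grey scale' (the particular symbols are our choice, not the authors'):

```python
import numpy as np

def symbolic_matrix(X, symbols=".+#$"):
    """Replace each entry of X by one of a few ordered symbols (a coarse grey scale)."""
    lo, hi = X.min(), X.max()
    k = len(symbols)
    levels = np.minimum(((X - lo) / (hi - lo) * k).astype(int), k - 1)
    return ["".join(symbols[v] for v in row) for row in levels]

rng = np.random.default_rng(1)
X = rng.normal(size=(15, 25))              # 15 companies by 25 years, hypothetical
for label, line in zip(["C%02d" % i for i in range(15)], symbolic_matrix(X)):
    print(label, line)
```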
Fig. 5 shows a symbolic matrix plot of the transportation data, using a
four-level set of symbols. We have ordered the companies, using an ordering
implied by a clustering algorithm (to be described in Subsection 3.1). Four
symbols are used, from "-" to an overstruck "$". The representation
of the data is rather crude. However, one can see that certain years are particu-
larly high (1958) or low (1953) and, with closer scrutiny, that there is some
grouping as expected (for example, airlines did relatively better than railroads in
the mid 1960's and worse in 1972-1974). The symbolic matrix is not really
competitive with other methods, however, until the size of the data matrix is
substantially larger than our example.
Advantages of the symbolic matrix are its simplicity and compactness. Data
matrices of quite large size can be represented legibly by this method. The
technique has been exploited extensively by Bertin (1967), particularly through a
mechanical display which allows the user to permute rows or columns of the
matrix to detect structure. Disadvantages of the method are its relatively coarse
parameter values and perhaps the difficulty of perceiving the overall relation
among observations.

Fig. 5. Symbolic matrix (one row per company, one column per year, each entry printed as one of four ordered symbols).
Five of the methods which associate a symbol with each observation are the
following.
- Profiles, which represent each observation by p vertical bars for p variables, each
bar having height proportional to the corresponding variable. The profile refers to
the top of the bars; sometimes the profile is shown as a connected line.
- Stars or polygons, which represent each variable as a value along equally spaced
radii from a common center. The points on the radii are usually connected in a
polygon.
- Faces, which represent each variable by features of a cartoon face (Chernoff,
1973). Such features as the shape of the face, the curve of the mouth, the position
and shape of the eyes can be used as parameters.
- Curve plots (Andrews, 1972), where each observation is mapped into a curve
which is a linear combination, defined by (x_i1,..., x_ip), of a set of basis curves
(usually trigonometric).
- Trees, which represent each variable as the length of a branch of a tree whose
structure is determined by applying a hierarchical clustering algorithm to the
variables (Kleiner and Hartigan, 1980).
Having defined the symbols for each observation one is at liberty to plot them
in any suitable arrangement. For all but the curve plots the usual practice is to
plot the symbols separately, say in an array on the page. Curve plots are usually
superimposed.
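For the curve plots, Andrews' construction maps each observation onto a trigonometric series; a sketch, assuming the usual basis ordering x1/√2, sin t, cos t, sin 2t, cos 2t, ...:

```python
import numpy as np
import matplotlib.pyplot as plt

def andrews_curve(x, t):
    """f(t) = x1/sqrt(2) + x2*sin(t) + x3*cos(t) + x4*sin(2t) + ..."""
    f = np.full_like(t, x[0] / np.sqrt(2.0))
    for j, xj in enumerate(x[1:], start=1):
        k = (j + 1) // 2
        f += xj * (np.sin(k * t) if j % 2 == 1 else np.cos(k * t))
    return f

t = np.linspace(-np.pi, np.pi, 200)
X = np.random.default_rng(2).normal(size=(10, 5))   # 10 observations, p = 5
for x in X:                                          # curve plots are superimposed
    plt.plot(t, andrews_curve(x, t))
plt.show()
```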
Among the methods mentioned, no single method is entirely best, but we feel
that star or polygon plots combine a reasonably distinctive appearance with
computational simplicity and ease of interpretation. Profiles are not so easy to
compare as a general shape. Faces are memorable but they are more complex to
draw and one must be careful in assigning variables to parameters and in
choosing parameter ranges. Curve plots are effective when p is large but become
cluttered when there are many observations. Trees provide additional informa-
tion, using a non-subjective clustering of the variables, and are vivid symbols, but
require considerable initial computation. Faces and curves to some extent disguise
the data in the sense that individual data values may not be directly comparable
from the plot.
Polygon plots are simple to construct and to interpret. Given p variables each
symbol will have p radii, usually spanning either a full or a half circle. The angle
between the horizontal and the j-th radius is

    θ_j = 2π(j − 1)/p

for j = 1,...,p when the radii span the full circle (for the half circle the angles are
equally spaced over [0, π]). The full circle is the more compact form and tends to give more
distinct symbols.
The i-th polygon consists of p points, the j-th point lying a distance along the j-th
radius proportional to the data value x_ij. A simple technique is to arrange for the
n values of each variable to be scaled to the range [0, 1], to put the center of the
symbol at the plotting origin and to let the maximum radius be 1. Then the point
corresponding to x_ij has plotting coordinates

    P_ij = (x_ij cos θ_j, x_ij sin θ_j).
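Putting the angles and the plotting coordinates together, a sketch that draws one polygon symbol (full-circle layout by default; all names are ours):

```python
import numpy as np
import matplotlib.pyplot as plt

def polygon_symbol(x, full_circle=True):
    """Vertices of a star/polygon symbol for one observation x scaled to [0, 1]."""
    p = len(x)
    span = 2 * np.pi if full_circle else np.pi
    # theta_j = 2*pi*(j-1)/p for the full circle; equally spaced over [0, pi] otherwise
    theta = np.linspace(0.0, span, p, endpoint=not full_circle)
    return x * np.cos(theta), x * np.sin(theta)

x = np.array([0.9, 0.4, 0.7, 0.2, 0.6])      # one observation, already scaled to [0, 1]
px, py = polygon_symbol(x)
plt.plot(np.append(px, px[0]), np.append(py, py[0]))  # close the polygon
plt.gca().set_aspect("equal")
plt.show()
```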
Fig. 6b. Labelled polygon plot.
angle and width of the eyebrows. In this example, the faces display the grouping
of companies less clearly than some of the other methods.
Tree plots (Figs. 6d and 6e) are constructed by first hierarchically clustering the
years. The resulting dendrogram (see Subsection 3.1 for further discussion of the
dendrogram) serves as the template for the tree plots in the sense that the tree
symbols all have the same topology as the dendrogram and the angles between
'branches' of the symbols are a function of the distances at which the correspond-
ing subclusters were merged. Each symbol gets its individual shape by making the
length of each outermost limb proportional to the size of the variable it represents
and setting the length of each inner limb equal to the average of all outermost
limbs whose path to the bottom of the tree passes through this inner limb. For
more details see Kleiner and Hartigan (1980). Figs. 6d and 6e show that the
extremely good years 1954, 58, 63 and 71 are grouped on the lower left, the good
years 1961, 64-66 and 75-76 on the lower right and the extremely disappointing
years 1957, 69 and 73-74 just above the extremely good years. Fig. 6e clearly
indicates that the airlines (especially the international ones) do extremely well in
good years and on closer inspection, extremely poorly in the bad years.
The plots in Fig. 6 are generally able to contrast companies, showing that the
shapes of the symbols representing companies within a group (e.g., the railroads)
are more similar than for companies from different groups. The polygon plots
and the trees especially, also point out that the groups differ in variability as well.
The airlines, particularly the three international airlines, have much greater
year-to-year variability. The features seen here were also visible in the symbolic
matrix although with less detail available.
Fig. 6d. The results of hierarchically clustering the years and a labelled tree plot.
2.3. Summary
All the methods described here can be useful, but it is worth repeating that
none gives a completely adequate picture of the data. Scatter plots are natural to
interpret and have fewer problems of scaling or of distorting the range of data
values, but they can only be used indirectly for a large number of variables and
integrating the pairwise plots is difficult. Symbolic plots are overly dependent on
several arbitrary choices; for example, the mapping of data values onto the
parameter interval and the ordering of both variables and observations. Some
(like faces) also treat the different variables in a highly unsymmetric way. Some
(particularly curve plots) are hard to use when the number of observations is
large.
These problems point to the need for care in interpreting the plots. Several
different methods should be used to get a good look at a difficult data set. Other
things being equal, simple methods, such as pairwise scatter plots, the symbolic
matrix and polygon plots, are easier to use and to explain and offer fewer hidden
distortions. The previously mentioned drawbacks in some of these methods
should be kept in mind, however.
Fig. 6e. Tree plots.
with one cluster containing all the objects. (One can also view this as a splitting
process, in the opposite direction.)
By contrast nonhierarchical clustering produces a single partition of the objects.
Naturally, each step of the hierarchical clustering defines a nonhierarchical
clustering, but typically without a rule for stopping at a 'best' clustering. In fact
the popularity of hierarchical methods stems from a combination of the computational
difficulty of choosing a 'best' partition and the advantage to the user of studying
a number of possible partitions.
We consider the graphical techniques for hierarchical and nonhierarchical
clustering separately. The latter can be applied to any partition derived from the
former.
For simplicity, we will always speak of measures of dissimilarity, dij, between
objects i and j. The entire discussion, however, could equally well be phrased in
terms of similarities. Speaking of objects being 'close' for example, implies small
dissimilarity or large similarity. Also, when specific notation is required, we will
tend to write formulas as if the objects were the n observations, although only
interchanging subscripts is needed to talk about clustering variables.
If we start with a set of multivariate data, the most common measure of
dissimilarity between observations would be some measure of the distance be-
tween two observations, regarded as points in p-dimensional space. Conversely, to
cluster variables one may begin with a measure of correlation as indicating the
similarity of variables. A third class of clustering applications may arise when
measures of similarity or distance arise as the original data.
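Both starting points are direct to compute with standard tools; a sketch assuming a data matrix X with the n observations as rows (data and names are ours):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.random.default_rng(3).normal(size=(15, 25))   # hypothetical objects x variables

# Dissimilarity between observations: distance in p-dimensional space.
D_obs = squareform(pdist(X, metric="euclidean"))

# Dissimilarity between variables: one minus correlation, so that highly
# correlated variables are 'close'.
D_var = 1.0 - np.corrcoef(X, rowvar=False)
```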
Notice that some reordering of the original objects is necessary to avoid lines
crossing in the cluster tree. The ordering is not unique; at each node one could
flip the right and left subtree. There are various mechanisms for choosing a
unique ordering of the objects. We shall assume this to be done by the clustering
algorithm and will not discuss it further.
A cluster tree plot allows one to gain a number of insights into the clustering.
One may look for subsets which are clearly defined by the clustering. These are
indicated by clusters which join together at a relatively low distance. In Fig. 7 all
the railroads form such a cluster. On the other hand, the grouping of TWA, NWA
and PN (the international airlines) forms at a larger distance (3.0 as opposed to 1.4),
indicating that their cluster is not as tight. The cluster of the domestic airlines
(EAL, AMR and UAL) is nearly as tight as the railroads. The conglomerate,
TNW, is merged with the combined railroad/domestic airline cluster, indicating
that it shares some characteristics of each, based on the data.
By 'cutting' the tree at any level of dissimilarity, one obtains a partition of the
objects. In the example of Fig. 7, we could obtain a partition into 6 subsets by
cutting at level 2.0. This defines the clusters of railroads and domestic airlines,
while TNW and the three international airlines form four one-object clusters. On
substantive grounds one would prefer to group TWA, NWA and PN into one
cluster, giving four clusters (even though this does not strictly represent cutting
the tree).
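In current software the clustering and the 'cut' can be reproduced in a few lines; a generic SciPy sketch on synthetic data, not the computation behind Fig. 7:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.random.default_rng(4).normal(size=(15, 25))   # hypothetical data
Z = linkage(pdist(X), method="complete")             # complete linkage, as in Fig. 7
labels = fcluster(Z, t=2.0, criterion="distance")    # 'cut' the tree at level 2.0
print(labels)                                         # cluster membership per object
```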
Fig. 7. Dendrogram of complete linkage on Euclidean distances (basic form).

Many variations of the basic cluster plot exist. One may put the objects some
fixed distance below the value for the cluster they join (as in Fig. 8), rather than
at 0. This makes the members of a cluster easier to see, but obscures distances
slightly. Instead of the horizontal and vertical lines marking each step, one can
join the successive plotting positions by a single, oblique line (Fig. 9) or, better,
combine this with the previous modification (Fig. 10) to somewhat reduce the
chance of lines touching. For modest-sized problems, the original method seems
the clearest. For larger problems, however, it may be difficult to associate objects
with long vertical lines, in which case the variation of Fig. 10 may be preferable.
If the cluster tree is produced on a line printer or printing terminal, rather than
on a graphic device, a number of further variations may be imposed. Because of
the limited resolution, the successive lines do not represent the dissimilarities but
simply the merging step. The dissimilarity may be printed beside the tree (Fig.
11). This still obscures the graphical evidence of cluster tightness (compare Fig.
7).
The printed tree may be further reduced in size, simply by omitting some of the
lower merges. Fig. 12 shows this variation, in which only the mergers from steps 7
through 14 of Fig. 11 are shown. This display, called a squashed tree by Gross
(1975), takes advantage of the tendency of people to be mainly interested in
obtaining a small number of clusters; there is little point in generating, say, more
than n/2 subsets of n objects. The squashed tree plots also may print two disjoint
clusters side by side, rather than indicating the merging times consecutively for
the entire tree. Another variation is the icicle plot (Kruskal and Landwehr, 1980),
which uses the object labels to fill in vertically, starting from the step at which the
object first joins a cluster. Fig. 13 shows an icicle plot of the transportation data.
Notice that the company codes are repeated within each icicle.
[Fig. 11. Printed cluster tree, with the dissimilarity at each merging step printed beside the tree.]

[Fig. 12. Squashed tree: only the mergers from steps 7 through 14 of Fig. 11 are shown.]

[Fig. 13. Icicle plot of the transportation data.]
3.2. Plotting distances
The concept of distance is natural and central to many applications of
clustering, because distances (or dissimilarities) between objects are inputs to
many clustering algorithms. In this subsection we will discuss several ways of
using plots of distances to assess appropriateness, tightness and separations of
clusters and to gain insight into how individual objects in a cluster differ from the
'average' cluster behavior. We will discuss three sets of distances:
(i) Distances between pairs of individual objects (Subsection 3.2.1),
(ii) Distances between pairs of cluster centroids (Subsection 3.2.2),
(iii) Distances between cluster centroids and individual objects (Subsection
3.2.3).
Fig. 14. Shaded representation of the Euclidean distance matrix between companies.
Cohen et al. (1977) describe a method for plotting distances between objects,
which can be used to identify the presence and composition of clusters without
going through any clustering algorithm. The distances between the objects are
plotted in groups which consist of the distances between each object and its
nearest neighbor, the distances between each object and its second nearest
neighbor, and so on.
In the diagram they plot, in the j-th column, the n values d_i(j) against their
median, where n is the number of objects and d_i(j) is the j-th smallest value
among d_i1, d_i2,..., d_i,i−1, d_i,i+1,..., d_in. Thus the first column of the plot (j = 1) displays
the empirical distribution of the n nearest neighbor distances against their
median. The second column shows the n second nearest neighbor distances, and
so on. A diagram of this type, together with output identifying i and j for each
point, can be helpful in detecting certain types of clusters and outliers.
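A sketch of this diagram as we read it: the j-th column consists of the n j-th nearest neighbor distances, plotted above the median of that column (synthetic data; plotting details are ours):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform

D = squareform(pdist(np.random.default_rng(5).normal(size=(15, 25))))
n = D.shape[0]
# row i of d holds the sorted distances from object i to the other n-1 objects
d = np.sort(D + np.diag(np.full(n, np.inf)), axis=1)[:, : n - 1]
medians = np.median(d, axis=0)                # one median per column j
for j in range(n - 1):
    plt.plot(np.full(n, medians[j]), d[:, j], "o")
plt.xlabel("median of j-th nearest neighbor distances")
plt.ylabel("distance to j-th nearest neighbor")
plt.show()
```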
The distances between the 15 transportation companies are displayed in Fig.
15. They show a clearly separated group in the lower left hand corner (all data
points with y < 1.5). All distances in this group except one (AMR-UAL) are
among the 8 railroads, indicating that they form a relatively tight and well-sep-
arated cluster. Furthermore, the two largest distances in all but the last column
involve two of the international air carriers (TWA and NWA); the largest four
distances in all but the last two columns include all three international air carriers
TWA, NWA and PN. This suggests that the international airlines cluster consists
of three objects, which are quite distant among themselves but even farther from
all other objects.
Fig. 15. Diagram of i-th closest neighbor distances vs. their median. The circles enclosing a star denote TWA and NWA, the circles enclosing a triangle denote PN.
are equal to the true distances). They also plot the 'diameters' of the clusters in
order to compare them to the distances between the clusters.
Taxometric maps contain large amounts of information but are rather cumber-
some to construct and seem hard to interpret. A similar, but much simpler,
diagram is described by Fowlkes et al. (1976). They begin by cutting the
dendrogram at a dissimilarity level which will create a desired number of
(hopefully well separated) clusters. The resulting clusters are represented by
circles with diameters equal to the diameters of the clusters. The interpretation of
the circle diameter depends on the algorithm and metric used; for the compact
method on Euclidean distances, for instance, the diameter is equal to the
maximum distance between any two objects within the cluster. Finally the circles
are connected by horizontal and vertical lines whose lengths are equal to the
distances between the corresponding clusters.
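The quantities behind such a diagram are simple to compute for the compact (complete-linkage) method on Euclidean distances; the memberships below are hypothetical stand-ins:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

D = squareform(pdist(np.random.default_rng(6).normal(size=(15, 25))))
clusters = {"railroads": [1, 2, 3, 4, 5, 6, 7, 8], "domestic": [9, 10, 11]}

def diameter(members):
    # largest within-cluster distance = circle diameter for the compact method
    return max(D[i][j] for i in members for j in members if i < j)

def between(a, b):
    # complete-linkage distance between two clusters
    return max(D[i][j] for i in a for j in b)

for name, members in clusters.items():
    print(name, "diameter:", round(diameter(members), 2))
print("between:", round(between(*clusters.values()), 2))
```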
Fig. 16. Diameters and distances of clusters obtained by cutting the dendrogram in Fig. 7 at level 2.0.
Fig. 16 shows the plot resulting from cutting the dendrogram in Fig. 7 at level
2.0. Each circle is labeled by the objects it contains. There are 6 clusters, 4 of
which consist of only one object each. The distance between the centers of the two
circles is 2.07, i.e. the distance between the clusters represented by these two
circles. The distance between TNW and the middle of the line connecting the
two circle centers is 2.43, i.e. the distance between TNW and the union of the two
circles. This plots shows two rather tight clusters which are reasonably close
together but are quite far away from the other objects. Note that, besides the
diameters of the circles, only the distances along the vertical and horizontal lines
matter. Therefore the fact that NWA lies very close to the circles does not mean
that it is close to them in the metric used in the clustering algorithm.
Another method of displaying inter-cluster distances is a plot of the distances
between a first cluster and all remaining clusters in the first column, the distances
between a second cluster and the remaining ones in the second column, and so
on. This often reveals isolated clusters and clusters so close to each other that the
question arises whether they should be separate clusters at all.
Fig. 17 shows distances between the cluster centroids of cluster 1 (consisting of
TNW), cluster 2 (containing the 8 railroads), cluster 3 (domestic airlines) and
cluster 4 (international air carriers). The plot is not very illuminating; it shows
that clusters 1 and 4 are somewhat farther apart from all other clusters than 2
and 3.
[Fig. 17. Distances between cluster centroids, by cluster number, for the 15 transportation companies.]
Fig. 18. Distances between cluster centroids and individual objects (in fractions of returns).
a particular cluster are plotted directly above the corresponding cluster number
while the other objects are somewhat set off to the right. We can see that cluster 1
(consisting of only one object) is very well separated, that clusters 2 and 3 are
tight and reasonably isolated (cluster 3 somewhat less so than cluster 2), while
cluster 4 has one outside object which is closer to its center than any of its own
objects.
Due to the relatively small size of this data set, all possible distances could be
reasonably plotted on the same page. This might not be the case for larger data
sets; there one usually only plots the objects closest to the cluster centers. Often it
is also advantageous to use different scales for different clusters and to code each
object simply by the cluster to which it belongs.

Fig. 19. Sets of fractional returns vs. cluster number for the years 1954-57.
Plots displaying the behavior of a given set of clusters and their objects for each
individual variable are described next, followed by descriptions of the behavior of
individual clusters compared to the overall center, or of objects compared to their
cluster center with respect to all variables.
The former include separate diagrams for each variable where the values of
each object are plotted against the number of the cluster containing that object
(Fig. 19), showing the location of the different clusters for each variable and the
spread within each cluster. It also pinpoints outlying values with respect to a
single variable. A variation of this is to first subtract the respective cluster center
from each object before plotting. This allows comparing spreads within clusters
and detecting outliers more easily, but loses the information about cluster levels.
Fig. 19 shows the returns of the 15 transportation companies for the years
1954-57. Generally the median of all values in a cluster is denoted by a star; here
this has only been necessary for cluster 2. A dotted line is drawn at level 0. The
returns have been very good in 1954; highest for the international carriers in
cluster 4, followed by cluster 3, followed by cluster 2. In 1955-57 the reverse is
true, and the returns grow steadily worse until in 1957 none of the 15 companies
has a positive return. Note that cluster 1 does better than the other clusters in
1955-57.

Fig. 20. Deviations of object SR from the center of cluster 2 for all years 1953-77 (in fractions of returns).
In order to see how individual cluster centers relate to an overall center, Dunn
and Landwehr (1980) have suggested plotting the differences between a cluster
center and the overall center for each variable; to see the relations between an
object and its cluster center they plot the differences between that object and its
cluster center.
Fig. 20 shows a plot of this type for the object SR (a railroad). It shows the
deviation from the median of cluster 2 for each of the 25 variables from bottom to
top. Interesting here is the very cyclical behavior of the deviations, a phenomenon
which has also been observed for other railroads.
If the variables are measured in different units, it is advisable to standardize the
differences by dividing by a measure of scale within the cluster.
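A sketch of such a deviation plot for one object, with the standardization just mentioned (median-based center and scale are our choices, not necessarily the authors'):

```python
import numpy as np
import matplotlib.pyplot as plt

X = np.random.default_rng(7).normal(size=(15, 25))      # objects x variables, hypothetical
members = [1, 2, 3, 4, 5, 6, 7, 8]                      # hypothetical cluster (the 'railroads')
center = np.median(X[members], axis=0)                  # cluster center, variable by variable
scale = np.median(np.abs(X[members] - center), axis=0)  # within-cluster scale per variable

i = 8                                                   # the object to examine (e.g. SR)
dev = (X[i] - center) / scale                           # standardized deviations
plt.plot(dev, np.arange(X.shape[1]))                    # variables from bottom to top
plt.axvline(0.0, linestyle="dotted")                    # the dotted line at level 0
plt.show()
```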
3.4. Sensitivity analyses
Given a tentative set of clusters it is important to be able to assess the
reliability of these clusters.
The idea of dropping variables is a natural and appealing tool to use in this
context. One drops one (or small groups of) variables at a time from the study,
reapplies the clustering algorithm and checks which clusters are still intact and
which ones are not. This not only shows the effect of minor changes in the data
on the resulting clusters but also enables the user to assess the effects of including
or excluding certain variables (Gnanadesikan et al., 1977).
Fig. 21 shows overall distances between the center of cluster 2 and the other
cluster centers when no variables are left out (denoted 5-year group 0), when the
first 5 years (1953-57) are left out (denoted 5-year group 1), the second group of
5 consecutive years is left out (5-year group 2), and so on. The distances have to
be normalized to be comparable. The square root of the number of variables
involved seems to be a reasonable choice for the normalizing constant, because
under the assumptions of i.i.d. normal variables the squared distances will follow
a χ² distribution with the number of degrees of freedom equal to the number of
variables.
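The drop-a-group computation with this normalization might look as follows; the memberships and the 5-year grouping mirror the example but are hypothetical here:

```python
import numpy as np

X = np.random.default_rng(8).normal(size=(15, 25))        # 15 companies x 25 years
clusters = {2: [1, 2, 3, 4, 5, 6, 7, 8], 3: [9, 10, 11]}  # hypothetical memberships

def normalized_distance(cols, a, b):
    """Centroid distance over the kept columns, divided by sqrt(#variables)
    so that distances based on different numbers of variables are comparable."""
    ca = X[np.ix_(clusters[a], cols)].mean(axis=0)
    cb = X[np.ix_(clusters[b], cols)].mean(axis=0)
    return np.linalg.norm(ca - cb) / np.sqrt(len(cols))

all_cols = list(range(25))
print("group 0:", normalized_distance(all_cols, 2, 3))    # no variables left out
for g in range(5):                                        # drop each 5-year chunk in turn
    kept = [c for c in all_cols if not (5 * g <= c < 5 * (g + 1))]
    print("group %d:" % (g + 1), normalized_distance(kept, 2, 3))
```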
Fig. 21 suggests that cluster 2 might be somewhat better separated if the first
five years were excluded and might be less well separated from cluster 1 if the
second five years (1958-62) were left out.
Fig. 22 shows the normalized distances between the centroid of cluster 3
(EAL, AMR, UAL) and all other objects with no years left out (group 0) and
consecutive chunks of 5 years left out. As in Fig. 20, objects not in cluster 3 have
been offset somewhat to the right. The plot suggests that dropping the second 5
years would decrease tightness and separation of cluster 3 and that cluster 3
would be somewhat tighter, but not more separated, if the last 5 years were not
included.

Fig. 21. Normalized distances between the centroid of cluster 2 and the other cluster centers when 0, the first 5, the second 5 etc. years are left out.
An enhancement of the ordinary dendrogram designed to give the viewer more
information about each merger is due to Rohlf (1970). He not only denotes each
merger by a horizontal line such as in Fig. 7 but for each merger also plots the
distances between all possible pairs of objects between the two clusters merged.
Therefore at the merging of MIS with {SX, SR} one would plot the distances
MIS-SX and MIS-SR; at the merging of the 8 railroads with the 3 domestic
airlines one would plot the distances between all pairs with exactly one railroad
and one domestic airline. These distance plots will give some indication about the
validity of the level of a particular merge.

Fig. 22. Normalized distances between the centroid of cluster 3 and all objects when 0, the first 5, the second 5 years etc. are left out.
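The distances plotted at a single merger under Rohlf's enhancement are just the cross-cluster pairs; a sketch with hypothetical indices for the MIS and {SX, SR} example:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

D = squareform(pdist(np.random.default_rng(9).normal(size=(15, 25))))
A, B = [6], [7, 8]                 # hypothetical indices for MIS and {SX, SR}
# at this merger one would plot every distance between a member of A and a member of B
cross = [D[i][j] for i in A for j in B]
print(np.round(cross, 2))
```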
References

Cluster Analysis Software*

R. K. Blashfield, M. S. Aldenderfer and L. C. Morey
The amount and diversity of cluster analysis software has grown almost as
rapidly as the number of publications which describe its use (Blashfield and
Aldenderfer, 1978a). New methods (and the programs which implement them) are
proposed continually, and no end to this process of innovation is in sight (Sneath
and Sokal, 1973). No reliable estimate of the total amount of clustering software
in use today has ever been made because cluster analysis has spread to innumer-
able scientific disciplines and subdisciplines, making any attempt at a comprehen-
sive review futile. An indirect measure of the abundance of clustering software
can be found in a study by Blashfield (1976b). Selecting the members of the
Classification Society (North American Branch) as a sample, Blashfield sent each
member a questionnaire, asking what cluster software each currently used. The
fifty-three respondents listed fifty different programs and packages. On the basis
of this result he suggested that there may be as many different programs in
existence to perform cluster analysis as there are users.
Three reasons can be identified why so much software has been developed.
(1) The process of creating groups of entities--classification--is a fundamental
human activity which forms the basis for most scientific progress (Hempel, 1952).
However, there are many competing philosophies as to how groups should be
constructed and how they should be defined. Consequently, a wide variety of
logical, statistical, mathematical, and heuristic methods has been applied to the
problem of creating groups. Cluster analysis has been strongly affected by this
diversity of thought, and at least seven major families of clustering methods have
been developed. These are: (1) hierarchical agglomerative; (2) hierarchical divi-
sive; (3) iterative partitioning; (4) mode searching; (5) factor analytic; (6) clump-
ing; and (7) graph theoretic methods (Anderberg, 1973; Bailey, 1975; Everitt,
1974). Each of these families represents a different perspective on the creation of
*The material in this chapter was partially collected while the first author was supported by NSF Grant DCR No. 74-20007.
groups. The results obtained when different methods are applied to the same data
set can be widely divergent. This diversity in approach to clustering makes it
likely that a number of programs will be written in order to meet the alternative
philosophies of classification.
(2) Different sciences have different analytical and methodological needs.
Although each of the families of clustering methods can be used by any discipline,
certain methods have been found to be particularly useful in certain sciences
(Clifford and Stephenson, 1975). Thus hierarchical agglomerative methods are
most frequently used in the biological sciences (Sneath and Sokal, 1973), and
factor analysis variants are popular in psychology (Lorr, 1966; Overall and Klett,
1972; Skinner, 1977). At first glance this state of affairs would seem to work
against the development of new programs, but it is, in fact, a cause of software
proliferation. For instance, a new type of classification problem may arise in a
science for which existing software is poorly structured, or a new important
variation on a traditional cluster method may be proposed, or researchers in one
science may become aware of a clustering technique which has proved useful in
another science. All of these instances are likely to cause the creation of new
software or the major revision of old software programs.
(3) Finally, the tendency to write new clustering programs is greatly facilitated
because most cluster analysis methods are relatively easy to program (Anderberg,
1973). Many methods (and their variants) are not based on sophisticated statisti-
cal or mathematical models and thus they do not require considerable expertise to
implement. In fact, most clustering methods are no more than heuristics, or 'rules
of thumb,' which have reasonably straightforward rules of group formation
(Hartigan, 1975a).
Clustering software comes in a variety of forms, ranging from the simple,
100-line FORTRAN programs to packages containing many thousands of state-
ments. The software reviewed in this paper is divided into five categories: (1)
collections of subroutines and algorithms; (2) general statistical packages which
include clustering methods; (3) cluster analysis packages; (4) simple programs
which perform one type of clustering; and (5) special purpose clustering pro-
grams, including novel methods, graphics, and other aids to cluster interpretation.
Since a comprehensive review of clustering software is effectively impossible,
some selection is necessary. The decision was made to emphasize the software
programs shown to be most popular by Blashfield (1976b). In addition, other
software was included which concerns recent advances that are likely to be
popular and/or which represent striking alternatives to the most commonly used
programs. Most of the software discussed in the paper was developed in the
United States, Great Britain or Australia. The software developed in Europe has
not been fully sampled (e.g., Spath, 1975).
The remainder of this chapter will be organized as follows: First, there will be a
short discussion of the five major categories of software and the particular
programs included under each category (Section 1). Next, a reasonably detailed
discussion will follow concerning software programs which emphasize hierarchical
methods (Section 2) and those which contain iterative partitioning methods
(Section 3). Fourth, the special purpose programs will be described (Section 4).
Finally, there will be a section on usability which concerns the users manuals and
error handling of the programs (Section 5).
Six packages devoted to clustering and related methods are reviewed. They are
CLUSTAN IC (Wishart, 1978), NT-SYS (Rohlf, Kishpaugh and Kirk, 1974),
CLUS (Rubin and Friedman, 1967), TAXON (Milne, 1976; Williams and Lance,
1977), BC-TRY (Tryon and Bailey, 1970) and CLAB (Shapiro, 1977). In many
ways, packages devoted to cluster analysis represent the ultimate of both flexibil-
ity and user convenience. These packages combine many of the advantages of a
general statistical package (data screening, file manipulation, and data transfor-
mation), with features of interest to users of cluster analysis (a wide choice of
similarity measures, cluster diagnostics, and graphics). Novices will have more
difficulty learning to use these packages than general statistical packages such as
SAS or SPSS, but these packages are not overly difficult. Experienced users find
these packages important because they often contain many different or hard-to-
find options which may be particularly suited to a specific problem or data set.
Most of these packages (except CLUS and CLAB) are maintained by private or
commercial organizations which are responsible for their development and distri-
bution.
Of the six cluster analysis packages CLUSTAN is the most versatile, containing
the widest range of clustering methods and similarity measures. Also worth noting
about CLUSTAN is that versions are now available which allow CLUSTAN to
handle SAS or SPSS files. NT-SYS was designed for users in the biological
sciences and relies heavily on the ideas proposed by Sokal and Sneath (1963;
Sneath and Sokal, 1973). CLUS is an iterative partitioning program which has
many options concerning this type of clustering method. TAXON is an Australian
program which emphasizes developments by Lance and Williams (1967a, 1967b;
Clifford and Stephenson, 1975; Williams and Lance, 1977). Hierarchical divisive
methods and the use of statistics from information theory are particularly salient
in TAXON. The next package, CLAB, is the only one which is interactive. It was
designed for use on a DEC-10 computer. Finally, the BC-TRY system is a factor
analysis oriented package derived from the work on clustering by the psychologist
Tryon (1939; Tryon and Bailey, 1970).
clustering packages and six simple programs. Of these, twelve contained at least
one hierarchical method. The six remaining programs (BC-TRY, BUILDUP,
CLUS, HOWD, ISODATA, and MIKCA) will not be discussed further in this
section.
Hierarchical methods of cluster analysis are, by far, the most commonly used
techniques in the clustering literature (Sneath and Sokal, 1973; Blashfield and
Aldenderfer, 1978a). Any software which intends to appeal to a wide range of
users should include a number of the hierarchical clustering methods as well as a
reasonable sampling of similarity measures.
Attempts have been made to identify characteristic ways in which hierarchical
cluster analysis methods differ. Bailey (1975), for example, proposed 12 'criteria'
to be used in selecting a method; similarly, Sneath and Sokal (1973) discussed 8
'options' from which to choose. In Subsections 2.1-2.4 the dimensions crucial to
the understanding of hierarchical methods are treated.
other (Sneath and Sokal, 1973). Nonetheless, some authors feel that this is the
only method which meets the mathematical criteria to be satisfied by an accept-
able clustering method (Jardine and Sibson, 1968, 1971). All programs discussed
here, with the exception of HGROUP, OSIRIS, and SAS, contain the single
linkage method as an option.
Single linkage clustering is related to the formation of minimum spanning trees
(Zahn, 1971). For users interested in minimum spanning trees, CLAB contains a
number of intriguing options. CLUSTAN and NT-SYS also will permit the
formation of minimum spanning trees.
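For readers working outside these packages, a minimum spanning tree is obtainable directly from a distance matrix with current tools; a generic SciPy sketch, not tied to any of the reviewed programs:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

D = squareform(pdist(np.random.default_rng(10).normal(size=(15, 5))))
mst = minimum_spanning_tree(D)             # sparse matrix holding the tree's edges
rows, cols = mst.nonzero()
for i, j, w in zip(rows, cols, mst.data):  # single-linkage merges follow these edges
    print(i, j, round(w, 3))
```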
(b) Complete linkage clustering. This method is the logical opposite of single
linkage clustering. Rather than an entity joining with the closest member of a
cluster, complete linkage requires that an entity be within a certain level of
similarity with the most distant entity, or in effect, with all members of that
cluster (Sokal and Michener, 1958). This method thus tends to form compact,
hyperspherical clusters composed of highly similar entities. Only HART and
HGROUP do not include an option for complete linkage clustering.
(c) Average linkage clustering. Sokal and Michener (1958) introduced average
linkage clustering as a compromise between the 'conservative' complete linkage
method and the 'liberal' single linkage method. Although more than one average
linkage technique exists, one particular variation has attracted the most use:
unweighted pairwise group mean averaging (abbreviated UPGMA) (Sneath and
Sokal, 1973). Of the twelve programs, UPGMA was available in eight. The
other four programs (SAS, OSIRIS, IMSL, JCLUST) were designed in response
to the Johnson (1967) article which does not consider average linkage because
UPGMA is not invariant under monotonic transformation of the similarity
measure. However, since average linkage is the dominant clustering method, its
absence in programs intended for general users such as SAS, OSIRIS and IMSL
clearly limits the wide usefulness of these programs.
(d) Ward's method. This method was designed to optimize an objective func-
tion, the minimum variance within clusters (Ward, 1963). The method has proved
popular in the social sciences but not in the biological sciences. Despite the over
220 citations to Ward's article in a wide range of social sciences, this method has
been frequently overlooked in software. Only three of the twelve programs,
ANDER, CLUSTAN, and HGROUP, have incorporated Ward's method. In this
regard, it is worth noting that one clustering package, CLUSTAN, has suggested
that Ward's method be used as the default option in the choice of linkage form.
Ward's method has also been shown to yield relatively good solutions in Monte
Carlo comparisons of cluster analysis methods (Gross, 1972; Kuiper and Fisher,
1975; Blashfield, 1976a; Mojena, 1977).
In conclusion about linkage forms, CLUSTAN is the best program. It has eight
hierarchical agglomerative methods and two hierarchical divisive methods.
ANDER, TAXON, and NT-SYS are close with seven methods each. BMDP and
CLAB contain three linkage forms, but, in BMDP, these are available only if the
user wishes to cluster variables. If the user wants to cluster entities (as is usually the
case), BMDP forces the user to cluster with average linkage. SAS, OSIRIS, IMSL,
JCLUSTER, HGROUP and HART are clearly the most limited in terms of their
variety of linkage forms. None of these last six programs is recommended for use
as general cluster analysis software.
submission of a similarity matrix is an option for all other programs, with the
exception of HGROUP. Finally, BMDP and OSIRIS both have options which
permit clustering of variables rather than entities.
2.4. Dendrograms
A final major dimension associated with hierarchical methods concerns the use
of dendrograms or 'trees' which graphically represent the results of hierarchical
clustering. For users in the biological sciences, trees are the major output
necessary for interpretation. Most programs, except SAS, OSIRIS, JCLUSTER
and HGROUP, print trees as either standard or optional output. CLAB, NT-SYS
and CLUSTAN have the most easily interpreted trees. CLUSTAN permits the
user to request horizontal or vertical trees. BMDP has an option which allows the
tree to be printed over the similarity matrix.
Other graphics which can be used to display hierarchical clustering results are
skyline plots and shaded similarity matrices. Skyline plots are somewhat like
vertical trees depicted by a series of bars, while the shaded similarity matrices
(Ling, 1973) are reorganized similarity matrices in which similarity values are
replaced by dots of different darkness. Both of these graphics become awkward to
visually interpret when the number of entities is moderately large (N > 75). SAS,
OSIRIS and JCLUST print skyline plots rather than trees. BMDP permits shaded
similarity matrices as an option in addition to trees.
The remaining six programs, including the general statistical packages of SAS
and OSIRIS, have limited versatility, and there is no particular reason why a user
should seek out these programs for performing hierarchical cluster analysis.
The iterative partitioning methods comprise the second major family of cluster
analysis techniques. In general, these methods assign entities to the nearest
cluster, compute the new cluster centroids and reassign entities. These alterations
are performed until no object changes cluster membership. Conceptually, this
approach circumvents a serious drawback with hierarchical methods. Hierarchical
techniques require the formation of a similarity matrix which has N(N−1)/2
unique values. Since the amount of computer memory needed grows as the square
of N, the hierarchical methods are limited by the size of the data matrix. For
instance, 400 entities would require storage for 79 800 unique similarity values.
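The assign-recompute-reassign cycle is easy to state in code; a minimal sketch of such an iterative partitioning pass with arbitrarily chosen seed points (all details are ours, not those of any program reviewed here):

```python
import numpy as np

def iterative_partition(X, seeds, max_iter=100):
    """Assign each entity to the nearest centroid, recompute centroids,
    and repeat until no entity changes cluster membership."""
    centroids = seeds.copy()
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                                  # no membership changed: done
        labels = new_labels
        for k in range(len(centroids)):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    return labels, centroids

rng = np.random.default_rng(11)
X = rng.normal(size=(400, 4))                         # 400 entities, no similarity matrix needed
seeds = X[rng.choice(len(X), size=3, replace=False)]  # K = 3 arbitrary data points as seeds
labels, centroids = iterative_partition(X, seeds)
```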
As iterative partitioning methods do not require the storage of a similarity
matrix, they have the potential of handling distinctly larger data sets. However,
these methods are subject to a different limitation. The optimal way to perform
iterative analysis would be to form all possible partitions of the data set.
Unfortunately, this approach requires an enormous number of iterations. When
there are over fifty entities, an exhaustive approach becomes unfeasible. As a
result, the authors of programs which perform iterative partitioning have
created procedures which sample a small subset of the possible partitions. The
heuristic procedures for choosing likely partitions are plausible, yet quite varied in
approach. Thus, large differences exist both between and within the options of
programs which perform iterative analysis.
Ten of the eighteen clustering programs have an iterative partitioning method
of clustering. The ten programs are ANDER, BC-TRY, BMDP, CLAB, CLUS,
CLUSTAN, HART, HOWD, ISODATA, and MIKCA. These programs are more
varied than were the hierarchical clustering programs. In fact, it is impossible,
with the exception of two programs (ANDER and CLUSTAN), to set the options
in such a way that these programs will yield identical solutions to the same data
sets. Hence a discussion of these methods will also follow an analysis of major
dimensions in these methods. Anderberg (1973) contains a helpful discussion of
iterative partitioning methods.
partition is the option which has the most effect on an iterative method. Some
methods use initial estimates of cluster centroids as the basis for the initial
partition. For example, ANDER and BMDP start with the specification of
centroid estimates, called 'seed points,' which can be user specified or can be an
arbitrarily chosen set of K actual data points (where K is the number of clusters).
In the first pass through the data, the entities are assigned to the cluster with the
nearest centroid. MIKCA, on the other hand, starts by analyzing three different
sets of randomly chosen seed points for the set which seems most likely to lead to
an efficient solution.
A second type of initial partitioning involves the specification of the first
cluster assignment. With this procedure the centroid of each cluster is defined as
its multivariate mean of entities within a cluster. This can be accomplished in a
variety of ways. ANDER, apart from the seed point method, allows the user to
specify the initial partition. CLUSTAN may be started either with a user specified
partition or by pseudorandom assignment, where every Kth element is assigned
to the same cluster. ISODATA selects initial cluster centroids that are relatively
distant from the centroid of the entire data set. There are four starting options
permitted by CLUS: (1) randomly chosen partition; (2) user-specified partition;
(3) a partition where the first N/K entities are assigned to the first cluster, the
second N/K to the second cluster, and so on; and (4) a partition that the package
chooses " b y its own method." BC-TRY allows a user-specified starting partition,
as well as a seed point procedure. The BC-TRY seed points are unique in that
they are found by dividing Q-dimensional space (as found in multiple group
factor analysis) into 2^Q equal segments. The centroids of these segments
define the initial seed points for the iterative clustering process.
There are also three programs which form clusters over a range of K. HOWD
and BMDP, for example, superimpose a hierarchical divisive algorithm onto an
iterative procedure. The initial partition is formed by dividing the data set at the
mean of the variable with the largest variance. After a stable solution is found
using an iterative K-means procedure, these programs search for the variable
within the two clusters that now has the largest variance. Subdividing at the mean
of that variable, the next K-means pass results in a three cluster solution. This
process is repeated until an upper limit on the number of clusters is reached.
CLUSTAN works in the opposite direction, superimposing a hierarchical
agglomerative procedure. First, an initial cluster solution is obtained for the
maximum number of clusters (K_max). The two closest clusters are merged, and
iterations are performed to find K_max − 1 clusters. This process repeats until a
lower limit of K is reached.
This dimension refers to the manner in which the programs aid in determining
the number of clusters that exist in the data set. In ANDER, CLUS and MIKCA
the number of clusters (K) must be specified by the user and hence is fixed. The
remaining packages contain procedures which allow K to vary in some manner.
Again the different programs take different approaches to this problem.
CLUSTAN agglomeratively collapses clusters across a user-specified range while
HOWD and BMDP use a divisive procedure to form a range for K. BC-TRY and
ISODATA provide procedures for 'splitting' and 'merging' clusters. ISODATA is
quite flexible in this regard as it permits the user to specify the limits on both the
diameter and/or the size of the cluster. If clusters are too close, ISODATA may
merge them; if a cluster is too heterogeneous, it may be split. In the same way,
clusters that are too large may be split and clusters that are too small may be
assigned to an outlier group.
3.5. Cost
Another factor to consider when performing iterative partitioning cluster
analysis is cost. This is particularly important because cost puts a limitation on
the tremendous number of calculations that would be required to exhaustively
test all possible partitions of the data. The various programs which perform
iterative analysis all attempt to efficiently find an optimal solution. However, the
programs differ drastically in terms of cost (Blashfield and Aldenderfer, 1978b).
CLUS is by far the most expensive program to run. This is a result of (1) the
hill-climbing passes, which require computation of the criterion statistic with each
possible move, and (2) the fact that it will either restart the partitioning analysis
or 'force' movement to avoid the problem of 'local maxima.'
3.7. Conclusion
Comparison of the programs that perform iterative analyses is more difficult
than those which perform hierarchical analyses. CLUS is clearly the most
versatile of the packages, as it has the widest range of options for type of pass, the
choice of a starting partition, etc. Unfortunately it is also by far the most
expensive of the iterative partitioning programs. The more cost efficient programs
such as BMDP, CLUSTAN, and HART do not have quite as much versatility.
However, these three programs do have a good range of choices concerning
starting partitions and output information. HOWD, ISODATA, and BC-TRY all
have distinctly unique properties which have no direct analogs in the other
packages. ANDER, CLAB, and MIKCA are relatively efficient programs which
represent the major options existent in current thinking about iterative analysis,
but are not as complete in terms of output as other packages. In sum, the
particular needs of the user will dictate the software of preference for performing
iterative partitioning analysis.
4.1. Validation
4.3. Graphics
Another area of current interest in clustering methods is graphical representa-
tion. Since cluster analysis methods are simply heuristics, one solution is to
present visual representations of the multivariate data and let the human re-
searcher decide.
For instance, Chernoff (1973; Chernoff and Rizvi, 1975) proposed a method of
representing multidimensional data using computer drawn faces. He wrote a
program utilizing a Calcomp plotter which will create these faces (Chernoff,
1971). Turner and Tidmore (1977) made this approach more usable by writing a
program which draws the Chernoff faces on a line printer. CLAB also can create
the faces as clustering output.
Another interesting set of available graphical techniques is the comparative
univariate histograms, multivariate histograms, joining trees (not dendrograms)
and dimensional boxes. These are illustrated and discussed by Hartigan (1975b).
Another graphical procedure suggested by this author is the use of 'sleeves' to
visually represent clusters which occur in multivariate data gathered over time
(Hartigan, 1978b).
Up to this point the focus has been on the features of cluster analysis programs
which relate to the various methods being performed. In this respect the entire
chapter has been devoted to just one issue: versatility of options. The aim of this
section is to provide the reader with information concerning the general features
of the programs that contribute to their ease of use in research.
Obviously, the initial problem any user faces is learning how to use this
program. In this respect a user manual is of primary importance because it serves
as the basic source of information about the program. However, not all programs
are accompanied by a manual. Of the eighteen programs under discussion the six
simple programs except MIKCA do not have manuals. In addition, ANDER,
BC-TRY and HART are primarily explained in their books although the latter
two do have independent manuals. Those programs which do have manuals vary
in clarity and informativeness.
The packages which include cluster analysis as part of a more general coverage
of statistical routines (e.g., BMDP, OSIRIS, and SAS) have manuals which are
sold to potential users. In addition, CLUSTAN now makes its manual commer-
cially available. These manuals contain information about the structural features
of the program, control cards, computational algorithms, etc.
Another method of manual distribution is to have the manual available as a
printout from the master tape which contains the source listing of the program
(e.g., NT-SYS and earlier versions of CLUSTAN). This method is convenient in
that manual access is relatively easy. However, in order to keep the printout to a
reasonable size, these manuals tend not to have some user oriented features which
add to clarity (e.g., sample job runs).
All of the manuals provide some basic information about using the program.
This includes the specification of the format for the control cards, a listing of the
available options, and basic references which describe the methods which are in
the program. However, there are other valuable aspects which have been left out
in some instances. For example, the specification of standard or suggested
options, a listing of control cards for an example run, and a listing of the output
generated by the example run are all useful in providing concrete examples for the
user to follow. CLAB, BMDP, OSIRIS, CLUSTAN, and SAS provide all of these
features (note: the new TAXON manual was not available at the time this chapter
was written). Of the remaining manuals, MIKCA contains an example of the
output generated by the program. The manuals for CLUS, NT-SYS and BC-TRY
have sections describing the structure and interpretation of output.
Another shortcoming that is found in some manuals is a failure to describe the
error messages generated by the program. In this regard the OSIRIS manual is
exemplary. It has an entire section devoted to a description of the error messages
generated by the different procedures. In this way it gives the user some idea of
what action will be needed to correct the error. The newer CLUSTAN manual
(CLUSTAN 1C) also has a separate error description section, a large advance
over its previous documentation. The SAS manual has some descriptions of errors
and suggestions on how to find them. Another important user-oriented feature is
how clear and jargon-free the introductory sections of the manual are for novices
who are unfamiliar with the program. BC-TRY, BMDP and OSIRIS have
sections of the manual written especially for novices. BMDP and OSIRIS also
have clear and concise descriptions of the logic of the various clustering
techniques available in the packages. Of the manuals available, only four
(OSIRIS, BMDP, CLUSTAN and SAS) have indexes, another important aspect.
Most manuals are clearly deficient in the statistical documentation of the
various procedures which are used within a program. For example, Blashfield
(1977) noted that three packages generated considerably different solutions when
apparently the same clustering techniques were used. He had to calculate the
clustering steps by hand in order to learn why the programs found different
results. These calculations were necessary since the manuals did not provide
sufficient detail about the definitions of the Euclidean distance. The best manual
in this regard is the recent CLUSTAN manual. It provides an entire chapter
describing each similarity measure available, the calculation formulae, and
pertinent comments concerning each measure. There is a need for other manuals to
follow this example.
Yet another problem with the user manuals is in their use of jargon. For
example, the complete linkage algorithm has been variously called the diameter
method, the furthest neighbor method, and the maximum method by different
manuals. The unfortunate result of this use of jargon is that the user is confused
by the idiosyncratic use of these terms. Again, the CLUSTAN manual is probably
best in this regard, as it provides synonymous names for some of its methods.
In sum, the manuals vary a great deal in usability and comprehensiveness. The
manuals for CLUSTAN (version 1C2), BMDP and CLAB are the clearest.
Nonetheless, the three are limited either by terseness or their use of jargon. Nearly
all of the manuals can be found to be lacking in some respect.
The last major area of concern is the facility of the packages for error handling.
Ideally, a package should have sufficient internal checks so that (a) the common
user errors will be detected by the package, (b) the user will be explicitly told in
English (as opposed to being told in computerese) what the error is, and (c) the
user will be told what steps are probably needed to fix the error. A less desirable
error message is generated by the FORTRAN environment of the user's computer
system. Such errors generally require the user to have a knowledge of FORTRAN
and may even require the user to access the source code for the package if he/she
is to decode the error. An even worse response to an error occurs when the
program does not detect an error and executes without notifying the user that an
error (or probable error) did occur. The last response to an error is particularly
serious because the package in fact may generate a solution which is gibberish but
the user will assume the solution is valid.
In order to check the error handling facility, thirteen of the programs were run
on a standard data set, and four errors were intentionally made in the control
cards. The thirteen programs examined included all except TAXON, OSIRIS,
SAS, IMSL, and CLAB. The analysis primarily was based on versions of the
programs available in 1977.
The standard error messages from most programs were FORTRAN error
messages. For instance, a common error message for an error involving control
card transposition was the message "IHC215I--CONVERT--ILLEGAL DECIMAL
CHARACTER." The user who is familiar with FORTRAN at IBM
installations will recognize that this error message means that the program
encountered a character, such as an alphabetical letter, which it did not expect.
Thus this error will suggest to the sophisticated user that a control card is not in
its correct sequence. However, for users who are not familiar with FORTRAN,
the error message will have no obvious meaning, and they will be forced to
turn to a computer consultant in order to correct the error.
Some programs, such as BMDP, CLUS, CLUSTAN 1C, and MIKCA, usually
generated error messages in the language of the program. In most instances, the
errors were not considered 'fatal,' and the programs attempted to generate some
type of cluster solution after printing the error message.
A few unusual responses to error conditions were noted. For example, error
conditions were found under which CLUS and NT-SYS generated large volumes
of printed output which had no useful purpose. In one error condition CLUS
noted that the covariance matrix was singular and created an error message telling
the user this. However, the error apparently was not fatal, so the message was
repeated for 5000 lines until the program exceeded the maximum number of lines
as specified by the user.
6. Discussion
The problems associated with cluster analysis software are many. First, there is
a very large number of methods and programs for performing cluster analysis. A
conservative estimate would place the number of clustering methods in excess of
100. Different researchers have attempted to resolve the issue of which method is
best for cluster analysis but the conclusions have been equivocal and conflicting.
As a result, there is no easy way to determine which methods should be
incorporated in cluster analysis software.
A second problem concerns the diversity of the audience of cluster analysis
users. The consumers of this software are research scientists from different
disciplines with decidedly different needs. Similarity measures popular in
biochemistry are often very different from those used in psychology. With the explosion
of interest in cluster analysis in the last decade, a program author must face the
difficult problem of making the program sufficiently general to meet the needs of
the wide range of users, and yet still keep the program reasonably small so that it
will not be too expensive or cumbersome to use.
A third problem associated with this software is the relative lack of usability of
this collection as compared to the more general statistical packages. Some of the
larger programs, especially CLUSTAN and BMDP, are not bad in this respect.
However, six of the programs discussed do not have anything remotely resem-
bling a user manual, and the error handling of most programs was not overly
clear. Perhaps if the popular statistical packages such as SAS and SPSS add
cluster analysis to their repertoire, usability will be less of an issue.
In conclusion, the software for cluster analysis displays marked heterogeneity.
The popular programs vary in terms of which clustering methods they contain
and how usable they are. The choice of software will be dictated primarily by the
needs of the consumer. The aim of this chapter has been to provide sufficient
information about these programs so that the reader can make a thoughtful
choice.
References
Anderberg, M. R. (1973). Cluster Analysis for Applications. Academic Press, New York.
Bailey, K. D. (1975). Cluster analysis. In: D. Heise, ed., Sociological Methodology. Jossey-Bass, San
Francisco.
Bell, P. A. and Korey, J. L. (1975). QUICLSTR: A FORTRAN program for hierarchical cluster
analysis with large numbers of subjects. Behav. Res. Methods Instrumen. 7, 575.
Blashfield, R. K. (1976a). Mixture model test of cluster analysis: Accuracy of four hierarchical
agglomerative methods. Psych. Bull. 83, 377-378.
Blashfield, R. K. (1976b). Questionnaire on cluster analysis software. Class. Soc. Bull. 83, 25-42.
Blashfield, R. K. (1977). The equivalence of three statistical packages for performing hierarchical
cluster analysis. Psychometrika 42, 429-431.
Blashfield, R. K. and Aldenderfer, M. S. (1978a). The literature on cluster analysis. Multivar. Behav.
Res. 13, 271-295.
Blashfield, R. K., and Aldenderfer, M. S. (1978b). Computer programs for performing iterative
partitioning cluster analysis. Appl. Psych. Measure 2, 533-541.
Carlson, K. A. (1972). A method for identifying homogeneous classes. Multivar. Behav. Res. 7,
483-488.
Carmichael, J. W. and Sneath, P. H. A. (1969). Taxometric maps. Systematic Zool. 18, 402-415.
Cattell, R. B., Coulter, M. A. and Tsujioka, B. (1966). The taxometric recognition of types and
functional emergents. In: R. B. Cattell, ed., Handbook of Multivariate Experimental Psychology,
287-312. Rand-McNally, Chicago.
Chernoff, H. (1971). The use of faces to represent points in N-dimensional space graphically. Tech.
Rept. No. 71. Department of Statistics, Stanford University, Stanford.
Chernoff, H. (1973). Using faces to represent points in K-dimensional space graphically. J. Amer.
Statist. Assoc. 68, 361-368.
Chernoff, H. and Rizvi, M. H. (1975). Effect of classification error of random permutations of features
in representing multivariate data by faces. J. Amer. Statist. Assoc. 70, 548-554.
Clifford, H. T. and Stephenson, W. (1975). An Introduction to Numerical Classification. Academic
Press, New York.
Coleman, J. S. (1970). Clustering in N dimensions by use of a system of forces. J. Math. Soc. 1, 1-47.
Cormack, R. M. (1971). A review of classification. J. Roy. Statist. Soc. 134, 321-367.
Dallal, G. E. (1975). A user's guide to J. A. Hartigan's clustering algorithms. Yale University, New
Haven.
Defays, D. (1977). An efficient algorithm for a complete link method. Comput. J. 20, 364-366.
Dubes, R. and Jain, A. K. (1977). Models and methods in cluster validity. Tech. Rept. JR-77-05.
Department of Computer Science, Michigan State University, East Lansing.
Everitt, B. D. (1974). Cluster Analysis. Halstead Press, London.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Ann. Eugenics 7,
179-188.
Fleiss, J. L. and Zubin, J. (1969). On the methods and theory of clustering. Multivar. Behav. Res. 4,
235-250.
Gordon, A. J. and Henderson, J. T. (1977). An algorithm for Euclidean sum of squares classification.
Biometrics 33, 355-362.
Gross, A. L. (1972). A Monte Carlo study of the accuracy of a hierarchical grouping procedure.
Multivar. Behav. Res. 7, 379-389.
Hall, D. J. and Khanna, D. (1977). The ISODATA method computation for the relative perception of
similarities and differences in complex and real data. In: K. Enslein, A. Ralston and H. S. Wilf,
eds., Statistical Methods for Digital Computers, Vol. 3. Wiley, New York.
Hartigan, J. (1975a). Clustering Algorithms. Wiley, New York.
Hartigan, J. (1975b). Printer graphics for clustering. J. Statist. Comput. Simulation 4, 187-213.
Hartigan, J. (1978a). Asymptotic distributions for clustering criteria. Ann. Statist. 6, 117-131.
Hartigan, J. (1978b). Graphical techniques in clustering: Sleeves. Paper presented at the Classification
Society Meetings in Clemson, SC.
Hempel, C. G. (1952). Problems of Concept and Theory Formation in the Social Sciences, Language and
Human Rights. University of Pennsylvania Press, Philadelphia.
Huizinga, D. (1978). MODES: A natural or mode seeking cluster analysis algorithm. Tech. Rept. No.
78-11. Behavioral Research Institute, Boulder, CO.
IMSL (1977). IMSL Reference Manual, Library 1, Ed. 6, Vols. 1 and 2. Houston, TX.
Jardine, N. and Sibson, R. (1968). A model for taxonomy. Math. Biosci. 2, 465-482.
Jardine, N. and Sibson, R. (1971). Mathematical Taxonomy. Wiley, New York.
Johnson, S. C. (1967). Hierarchical clustering schemes. Psychometrika 32, 241-254.
Kuiper, F. K. and Fisher, L. (1975). A Monte Carlo comparison of six clustering procedures.
Biometrics 31, 777-783.
Lance, G. N. and Williams, W. T. (1965). Computer program for monothetic classification (associa-
tion analysis). Comput. J. 8, 246-249.
Lance, G. N. and Williams, W. T. (1967a). A general theory of classificatory sorting strategies. I.
Hierarchical systems. Comput. J. 9, 373-380.
Lance, G. N. and Williams, W. T. (1967b). A general theory of classificatory sorting strategies. II.
Cluster systems. Comput. J. 10, 271-277.
Lennington, R. K. and Rossbach, M. E. (1978). CLASSY--An adaptive maximum likelihood
clustering algorithm. Paper presented at the Classification Society Meetings at Clemson, SC.
Levinsohn, J. R. and Funk, S. G. (1974). CLUSTER: A hierarchical clustering program for large data
sets (n > 100). Research Memo. No. 40. Thurstone Psychometric Laboratory, University of North
Carolina, Chapel Hill, NC.
Ling, R. F. (1972). On the theory and construction of K-clusters. Comput. J. 15, 326-332.
Ling, R. F. (1973). A computer generated aid for cluster analysis. Comm. ACM 16, 355-361.
Lorr, M. (1966). Explorations in Typing Psychotics. Pergamon, New York.
Lorr, M. and Radhakrishnan, B. K. (1967). A comparison of two methods of cluster analysis. Educ.
Psych. Measure 27, 47-53.
McQuitty, L. L. and Koch, V. L. (1975). A method for hierarchical clustering of a matrix of a
thousand by a thousand. Educ. Psych. Measure 35, 239-254.
Milligan, G. W. (1978). An examination of the effects of error perturbation of constructed data on
fifteen clustering algorithms. Unpublished Ph.D. Thesis, Ohio State University, Columbus.
Milligan, G. W. (1979). Further results on true cluster recovery: Robust recovery with the K-means
algorithms. Paper presented at the Classification Society Meetings in Gainesville, FL.
Milne, P. W. (1976). The Canberra programs and their accession. In: W. T. Williams, ed., Pattern
Analysis in Agricultural Science, 116-123. Elsevier, Amsterdam.
Mojena, R. (1977). Hierarchical grouping methods and stopping rules--An evaluation. Comput. J. 20,
359-363.
Olson, C. L. (1976). On choosing a test statistic in multivariate analysis of variance. Psych. Bull. 83,
579-586.
Overall, J. E. and Klett, C. J. (1972). Applied Multivariate Analysis. McGraw-Hill, New York.
Revelle, W. (1978). ICLUST: A cluster analytic approach to exploratory and confirmatory scale
construction. Behav. Res. Methods Instrumen. 10, 739-742.
Rohlf, F. J. (1977). Computational efficiency of agglomerative clustering algorithms. Tech. Rept.
RC-6831. IBM Watson Research Center.
Rohlf, F. J., Kishpaugh, J. and Kirk, D. (1974). NT-SYS user's manual. State University of New York
at Stonybrook, Stonybrook.
Rubin, J. and Friedman, H. (1967). CLUS: A cluster analysis and taxonomy system, grouping and
classifying data. IBM Corporation, New York.
Sale, A. H. J. (1971). Algorithm 65: An improved clustering algorithm. Comput. J. 14, 104-106.
Shapiro, M. (1977). C-LAB: An on-line clustering laboratory. Tech. Rept. Division of Computer
Research and Technology, National Institute of Mental Health, Washington, DC.
Sibson, R. (1973). SLINK-An optimally efficient algorithm for single-link cluster methods. Comput.
J. 16, 30-34.
Skinner, H. A. (1977). The eyes that fix you: A model for classification research. Canad. Psych. Rev.
18, 142-151.
Sneath, P. H. A. (1957). The application of computers to taxonomy. J. Gen. Microbiol. 17, 201-226.
Sneath, P. H. A. and Sokal, R. R. (1973). Numerical Taxonomy. Freeman, San Francisco.
Sokal, R. R. and Michener, C. D. (1958). A statistical method for evaluating systematic relationships.
Kansas Univ. Sci. Bull. 38, 1409-1438.
Sokal, R. R. and Rohlf, F. J. (1962). The comparison of dendrograms by objective methods. Taxon
11, 33-40.
Sokal, R. R. and Sneath, P. H. A. (1963). Principles of Numerical Taxonomy. Freeman, San Francisco.
Späth, H. (1975). Cluster-Analyse Algorithmen. R. Oldenbourg, Munich.
Tryon, R. C. (1939). Cluster Analysis. Edwards Brothers, Ann Arbor.
Tryon, R. C. and Bailey, D. E. (1970). Cluster Analysis. McGraw-Hill, New York.
Turner, D. W. and Tidmore, F. E. (1977). Clustering and Chernoff-type faces. Statistical Computing
Section Proceedings of the American Statistical Association, 372-377.
Veldman, D. J. (1967). FORTRAN Programming for the Behavioral Sciences. Holt, Rinehart and
Winston, New York.
Ward, J. H. (1963). Hierarchical grouping to optimize an objective function. J. Amer. Statist. Assoc.
58, 236-244.
Ward, J. H. and Hook, M. E. (1963). Application of a hierarchical grouping procedure to a problem of
grouping profiles. Educ. Psych. Measure 32, 301-305.
Whallon, R. (1971). A computer program for monothetic subdivisive classification in archaeology.
Tech. Rept. No. 1. University of Michigan Museum of Anthropology, Ann Arbor.
Whallon, R. (1972). A new approach to pottery typology. Amer. Antiquity 37, 13-34.
Williams, W. T., Lambert, J. M. and Lance, G. N. (1966). Multivariate methods in plant ecology. V.
Similarity analyses and information analysis. J. Ecology 54, 427-446.
Williams, W. T., and Lance, G. N. (1977). Hierarchical classification methods. In: K. Enslein, A.
Ralston and H. Wilf, eds., Statistical Methods for Digital Computers. Wiley, New York.
Wishart, D. (1969). Mode analysis: A generalization of nearest neighbor which reduces chaining
effects. In: A. J. Cole, ed., Numerical Taxonomy. Academic Press, London.
Wishart, D. (1978). CLUSTAN 1C user's manual. Program Library Unit of Edinburgh University,
Edinburgh.
Wolfe, J. H. (1970). Pattern clustering by multivariate mixture analysis. Multivar. Behav. Res. 5,
329-350.
Zahn, C. T. (1971). Graph theoretical methods for dissecting and describing Gestalt clusters. IEEE
Trans. Comput. 2, 68-86.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2 (1982) 267-284.
© North-Holland Publishing Company

Single-link Clustering Algorithms

F. James Rohlf
1. Introduction
The present paper is concerned with computational algorithms for the single-link
clustering method which is one of the oldest methods of cluster analysis. It was
first developed by Florek et al. (1951a, b) and then independently by McQuitty
(1957) and Sneath (1957). This clustering method is also known by many other
names (e.g., minimum method, nearest neighbor method, and the connectedness
method) due both to the fact that it has been reinvented in different application
areas and to the fact that there exist many very different computational
algorithms corresponding to the single-link clustering model. Often this identity
has gone unnoticed since new clustering methods are not always compared with
existing ones. Jardine and Sibson (1971) point out quite correctly that one must
distinguish between clustering methods (models) and the various computational
algorithms which enable one to actually determine the clusters for a particular
method. Different clustering methods imply different definitions of what con-
stitutes a 'cluster' and should thus be expected to give different results for many
data sets.
Since we are concerned here only with algorithms, the interested reader is
referred to Sneath and Sokal (1973) and Hartigan (1975) for general discussions
of some of the important properties of the single-link clustering method. They
contrast this clustering method with other related methods such as the complete
link and the various forms of average link clustering methods. Fisher and van
Ness (1971) and van Ness (1973) summarize some of the more important
mathematical properties of a variety of clustering methods including the single-link
method. The book by Jardine and Sibson (1971) considers some of the more
abstract topological properties of the single-link clustering method and its gener-
alizations.
In the account given below a variety of algorithms are presented to serve as a
convenient source of algorithms for the single-link method. It is hoped that
presenting these diverse algorithms together will also lead to a further understand-
ing of the single-link clustering method. While the algorithms differ considerably
in terms of their computational efficiency (O(n log n) versus O(n^5)), even the least
efficient algorithm may sometimes be useful for small data sets. The less
time-efficient algorithms are simpler to program in languages such as FORTRAN. They
also may require much less computer storage.
Table 1
A dissimilarity matrix for 6 objects
i    1    2    3    4    5    6
1 0.0 6.8 2.6 3.0 3.5 7.0
2 6.8 0.0 4.5 9.8 4.9 0.8
3 2.6 4.5 0.0 5.4 1.2 5.2
4 3.0 9.8 5.4 0.0 6.3 9.9
5 3.5 4.9 1.2 6.3 0.0 6.1
6 7.0 0.8 5.2 9.9 6.1 0.0
Table 2
Single-link hierarchical clustering scheme for the dissimilarity matrix given in Table 1
x   Δ_x   Clusterings C_x
0   0.0   {1}, {2}, {3}, {4}, {5}, {6}
1   0.8   {1}, {2,6}, {3}, {4}, {5}
2   1.2   {1}, {2,6}, {3,5}, {4}
3   2.6   {1,3,5}, {2,6}, {4}
4   3.0   {1,3,5,4}, {2,6}
5   4.5   {1,3,5,4,2,6}
2" '
°1
i I
6
h
Fig. 1. Single-link dendrogram for the hierarchical clustering scheme given in Table 2.
3. Algorithms
The different algorithms presented below are classified into five different types
of approach.
ALGORITHM 1
(a) Set the clustering level Δ to an initial value: Δ_0 → Δ. [Δ_0 = min{d_ij; i ≠ j} is
the smallest value of interest.]
(b) Define a connection matrix A = (a_ij) such that a_ij = 0 if d_ij > Δ and a_ij = 1
if d_ij ≤ Δ.
(c) Raise the matrix A to a power m such that A^m = A^(m+1). [Then the (i, j)th
element of A^m equals unity if and only if the ith and jth objects belong to the
same connected subgraph (single-link cluster). All other elements are equal to
zero.]
(d) Repeat steps (b) and (c) for a larger value of Δ. [One could increment Δ by
a fixed amount as suggested by van Groenewoud and Ihm (1974) or (in order to
obtain nonredundant solutions) one can use the smallest value of d_ij such that i
and j have not yet been placed into the same cluster. At most n − 1 distinct Δ
values are required.]
The straightforward implementation of such an algorithm implies considerable
computational effort, since the multiplication of two matrices requires O(n^3) effort.
ALGORITHM 2
(a) Initialize: 0 → P_i (i = 1,...,n), 0 → s, and 1 → i. [P_i will contain the index of
the cluster to which object i belongs.]
(b) If P_i ≠ 0, then: go to step (f) [object i has already been clustered]; else:
continue.
(c) If there exists a d_ij ≤ Δ (j = i+1,...,n), then: go to step (d) [object i is
connected to at least one other object]; else: go to step (f).
(d) A new cluster containing more than a single object has been found:
s + 1 → s, s → P_i.
(e) While there exists an object k (k = i+1,...,n) not yet assigned to cluster s
such that d_jk ≤ Δ for some j with P_j = s (j = 1,...,n; j ≠ k): s → P_k. [This step
finds all objects connected to the present members of cluster s by repeatedly
searching the dissimilarity matrix and adding new objects to the cluster.]
(f) If i < n, then: i + 1 → i and go to step (b); else: stop.
Table 3
Ultrametric distance matrix for the hierarchical clustering scheme given in Table 2
i 1 2 3 4 5 6
1 0.0 4.5 2.6 3.0 2.6 4.5
2 4.5 0.0 4.5 4.5 4.5 0.8
3 2.6 4.5 0.0 3.0 1.2 4.5
4 3.0 4.5 3.0 0.0 3.0 4.5
5 2.6 4.5 1.2 3.0 0.0 4.5
6 4.5 0.8 4.5 4.5 4.5 0.0
properties only for the single-link method. Given an HCS, the ultrametric distance
u_ij between all pairs of objects i and j is defined as follows. Let u_ij = Δ_x, where x
is the smallest integer such that in clustering C_x objects i and j are in the same set
(0 ≤ x ≤ n − 1). Then U = (u_ij) is a matrix of ultrametric distances, that is, u_ii = 0
and 0 ≤ u_ij ≤ max{u_ik, u_jk} for all triples of objects i, j, and k. This is a stronger
condition than the usual metric condition, since max{u_ik, u_jk} ≤ u_ik + u_jk. This
relationship between clusters and ultrametrics is discussed in Jardine et al. (1967),
Johnson (1967), and Hartigan (1967). The ultrametric distance matrix for the
hierarchical clustering scheme given in Table 2 is shown in Table 3.
The single-link clusters for a given data set can readily be determined from an
ultrametric matrix (objects i and j belong to the same cluster at level Δ if and
only if u_ij ≤ Δ). Thus another algorithmic approach to single-link clustering is to
transform D to U, and then recover the single-link clusters from U.
The following single-link ultrametric transformation algorithm is a special case
of Jardine and Sibson's (1968) algorithm for their B_k (fine) clustering methods
(B_1 corresponds to the single-link method).
ALGORITHM 3
(a) Consider all possible triplets of distinct objects, i, j, and k. For each such
triplet of objects determine the largest, d', and second largest, d", dissimilarities
between them. If d' > d'', then: replace the d' value in the dissimilarity matrix
with d''; else [d' = d'']: leave the dissimilarities unchanged.
(b) Repeat step (a) with the updated dissimilarity matrix until no dissimilarity
values are changed. [At completion the dissimilarity matrix will have been
transformed into an ultrametric matrix corresponding to the single-link clustering
method.]
The effort for this algorithm is O(mn^3), where m is the number of repetitions of
step (a). Cole and Wishart (1970) state that the value of m is usually between 3
and 5. While this algorithm requires considerable effort, it is more efficient than
Algorithm 1 or Algorithm 2. Cole and Wishart (1970) present a number of
improvements to this algorithm which reduce needless checks of triplets of objects
which ultimately require no adjustment in their dissimilarity values. Another
important innovation is first sorting the dissimilarities (O(n^2 log n) effort) and
then considering the triplets in sorted order. The FORTRAN computer program
KDEND is available for this algorithm (Cole and Wishart, 1970).
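The basic triplet reduction, without the Cole-Wishart refinements, can be sketched as follows (our code; the in-place list-of-lists representation is an assumption):

from itertools import combinations

def ultrametric_by_triplets(d):
    # Step (a): for every triplet, pull the largest dissimilarity down to
    # the second largest; step (b): repeat until nothing changes.
    n = len(d)
    changed = True
    while changed:
        changed = False
        for i, j, k in combinations(range(n), 3):
            pairs = sorted([(i, j), (i, k), (j, k)],
                           key=lambda p: d[p[0]][p[1]])
            (a1, a2), (b1, b2) = pairs[1], pairs[2]  # 2nd largest, largest
            if d[b1][b2] > d[a1][a2]:
                d[b1][b2] = d[b2][b1] = d[a1][a2]
                changed = True
    return d    # now the single-link ultrametric matrix U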
Rohlf (1973a) proposed another algorithm for B_k (fine) clustering in which the
elements in the dissimilarity matrix are initially sorted, but sets of objects (triplets
in the case of B_1) are not explicitly considered. For the single-link method (B_1) it
reduces to the following algorithm.
ALGORITHM 4
(a) Sort the elements of the upper half portion, excluding diagonal elements, of
the dissimilarity matrix D in ascending order into array L. Clear the n by n matrix
U. Let C = C_0 be the weakest clustering, where each cluster C_k contains only the
single object k. 0 → l.
(b) Set l + 1 → l and let L_l = d_ij [the next, larger, dissimilarity value from the
sorted array].
(c) If u_ij has already been defined, then: go to step (b); else: continue.
(d) Let C_a and C_b represent the clusters to which objects i and j belong,
respectively. Then d_ij → u_cd for all objects c ∈ C_a and d ∈ C_b. Update C:
C_a ∪ C_b → C_a and then remove C_b from C.
(e) If all elements of U have not been defined (i.e., C consists of more than a
single cluster), then: go to step (b); else: stop.
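In code, the algorithm might be sketched as follows (ours; the cluster bookkeeping with Python sets is an assumption):

def ultrametric_by_sorting(d):
    # Step (a): sort the upper-triangle dissimilarities and start from the
    # weakest clustering C_0 in which every object is its own cluster.
    n = len(d)
    edges = sorted((d[i][j], i, j)
                   for i in range(n) for j in range(i + 1, n))
    u = [[0.0] * n for _ in range(n)]
    defined = [[i == j for j in range(n)] for i in range(n)]
    clusters = {k: {k} for k in range(n)}
    member = list(range(n))                # cluster label of each object
    for dij, i, j in edges:                # steps (b) and (c)
        if defined[i][j]:
            continue
        a, b = member[i], member[j]
        for c in clusters[a]:              # step (d): d_ij -> u_cd for all
            for e in clusters[b]:          # c in C_a and d in C_b
                u[c][e] = u[e][c] = dij
                defined[c][e] = defined[e][c] = True
        for e in clusters[b]:
            member[e] = a
        clusters[a] |= clusters[b]
        del clusters[b]
        if len(clusters) == 1:             # step (e): all of U defined
            break
    return u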
ALGORITHM 5
(a) Clear sets P and Q. Form a set S consisting of all objects i such that D_i ≠ 0,
where D_i is the estimated density around object i (the number of objects within
radius ρ of object i). [P will contain objects whose neighborhoods are to be
considered, and Q will contain the objects in the current cluster.]
If one plots the estimated densities D_j for the neighborhood of each object j as
the j's are added to cluster Q, they will rise to a maximum (thereby indicating
which objects are closest to the estimated mode of the cluster) and then fall off as
we move towards a valley. When the set P is empty, a valley has been reached
which has zero density. A new starting object which does not belong to a mode
that has already been found is then selected at step (b). This is repeated until all
modes have been found for the given radius ρ. The procedure can then be
repeated with a new (larger) value for ρ.
If the criterion that a cluster is a 'dense' region of space separated by a gap of
zero density from other such clusters is used, then the above algorithm will find
the single-link clusters corresponding to a clustering level Δ = ρ. Note: other
definitions of density (e.g., by using a Gaussian kernel) of the p.d.f. do not, in
general, lead to single-link clusters (however, Hartigan, 1977, shows that
asymptotically the clusters formed using these two definitions of 'density' converge). It
should also be pointed out that the usual approach to density estimation
clustering does not involve defining a cluster based on a minimum threshold of
density (density contour clustering) as proposed by Shaffer et al. (1979) but uses
valleys to separate modes (density gradient clustering). Katz and Rohlf (1973)
defined two objects to belong to the same cluster if and only if the path from each
point following the steepest gradient in the p.d.f. leads to the same peak in the
p.d.f. Kittler (1976, 1979) used an intuitive assessment of the depth of the valley
in order to decide whether two modes are sufficiently distinct. It is not clear from
this algorithm how certain one can be that the depths of the valleys displayed on the
plots reflect the depths of the valleys between adjacent modes.
The principal effort required by these algorithms is in computing the density
around each object. If the number of objects within a fixed radius of each object
is found by a direct search (O(n^2)), then the effort can be O(n^3) for each value of
ρ (the exact effort depends upon the size of the neighborhoods and the method
used to update the sets P and S).
clustering methods (actually these are, of course, algorithms). They give the
following algorithm for the single-link method.
ALGORITHM 6
(a) Let C = C_0 be the weakest clustering, where each cluster C_k contains only a
single object k. [The dissimilarity matrix will be interpreted as a matrix of
dissimilarities between the corresponding clusters.]
(b) Find a pair (not necessarily unique) of clusters (C_a, C_b) which are least
dissimilar (i.e., d_ab is minimal). d_ab → Δ.
(c) C_a ∪ C_b → C_a, and delete C_b from C. The Δ value is saved as the clustering
level at which the new cluster was formed.
(d) Repeat steps (b) and (c) for all pairs of clusters (if any) with the same
minimal dissimilarity (i.e., allow for ties).
(e) Recalculate the dissimilarity between the new cluster C_a and each of the
other clusters C_j. The dissimilarity is computed as min{d_ik : i ∈ C_a, k ∈ C_j}.
(f) Repeat steps (b) through (e) until C consists of only a single cluster of
cardinality n.
This general algorithm can be implemented in many ways. One simple scheme
uses an n by n matrix of dissimilarities between all pairs of objects. When a pair
of objects (e.g., i and j) is merged, the corresponding rows and columns i and j are
deleted from the matrix and a new row and column i' corresponding to the
resulting cluster is added. Thus in step (c) the row and column dimension of the
matrix is decreased by one each time there is a merger. Johnson (1967) called the
single-link clustering method the 'minimum method' due to the fact that the min
function is used in step (e).
A c by c matrix of dissimilarities among the c currently existing clusters must
be examined in order to find the smallest dissimilarity for each of the n − 1
clustering levels. The value of c is initially n and is reduced by unity as the
clusters merge. Therefore the total number of dissimilarity coefficients which need
to be considered is O(n^3).
A number of computer programs exist for this algorithm (e.g., JOIN, Hartigan,
1975; Johnson, 1967; CLASS and CENTBEV, Lance and Williams, 1967; CLSTR,
Anderberg, 1973).
The computational effort can be reduced considerably by using the fact that
when updating the dissimilarity matrix in step (e), the dissimilarity between two
objects (or clusters) is not affected by mergers which only involve other clusters.
Thus the entire matrix need not be searched in step (b), but rather only the few
'local' changes made as a result of the last fusion. In particular, when two clusters
C_a and C_b are merged, a list of the nearest neighbors of each cluster must be
updated only for those clusters which were nearest neighbors of either cluster a or
cluster b. These can conveniently be located in step (e) as described by Rohlf
(1977).
The algorithm can also be improved by locating all mutually closest pairs of
clusters in step (b), rather than just the one pair of clusters with minimum
dissimilarity. All such pairs can then be merged simultaneously in step (c).
Clusters C_a and C_b are mutually closest if cluster C_a has its least dissimilarity with
cluster C_b and cluster C_b has its least dissimilarity with cluster C_a. There must be
at least one such pair. For the dissimilarity matrix given in Table 1, two mutually
closest pairs (3-5 and 2-6) would be found during the first pass through step (b).
If there are many such pairs found each time through step (b), then the effort will
approach O(n^2). However, if the final dendrogram is very asymmetrical, only a
few mutually close pairs of points will be found each time and the effort will
remain close to O(n^3). Modifications of this algorithm are also available such that
the average effort is expected to remain very close to O(n^2) even for such data
sets (see, for example, Rohlf, 1977).
Even more efficient algorithms, such as the one developed by Sibson (1973), result
from the fact that only local changes in the reduced dissimilarity matrix result
from the merging of two clusters. Given an initial single-link clustering for m < n
objects, Sibson (1973) developed a method for determining the changes required
in a dendrogram in order to correctly add an additional object. This makes it
possible to start with a dendrogram consisting of only a single object and build
the final dendrogram recursively by adding the remaining n − 1 objects one at a
time in an arbitrary order. His algorithm is as follows.
ALGORITHM 7
Table 4
Pointer representation of the hierarchical clustering scheme given in Table 2

i     1    2    3    4    5    6
H_i   5    6    5    6    4
A_i   2.6  0.8  1.2  4.5  3.0
dissimilarity matrix need not be placed in fast access storage during execution,
giving the algorithm a considerable advantage when n is very large. Sibson (1973)
furnishes a FORTRAN computer program, SLINK1, for the above algorithm and
another, SLINK2, for transforming the pointer representation into a more
convenient 'packed' form (see below).
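The recursion itself is spelled out in Sibson (1973); the following is our sketch of its insertion step (the arrays Pi and Lam stand in for Sibson's pointer arrays; treat the code as an illustration, not a transcription of SLINK1):

import math

def slink_pointer_representation(d):
    # After object t is inserted, Pi[i] names the 'last' object of the
    # cluster that i next joins and Lam[i] the level of that merger.
    n = len(d)
    Pi = [0] * n
    Lam = [0.0] * n
    M = [0.0] * n
    for t in range(n):
        Pi[t], Lam[t] = t, math.inf
        for i in range(t):
            M[i] = d[i][t]
        for i in range(t):
            if Lam[i] >= M[i]:
                M[Pi[i]] = min(M[Pi[i]], Lam[i])
                Lam[i], Pi[i] = M[i], t
            else:
                M[Pi[i]] = min(M[Pi[i]], M[i])
        for i in range(t):
            if Lam[i] >= Lam[Pi[i]]:
                Pi[i] = t
    return Pi, Lam

Only one row of the dissimilarity matrix is needed per insertion, which is why the matrix need not reside in fast access storage.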
Williams et al. (1966) and Anderberg (1973) give the following simple
algorithm (which is very similar in its approach to Algorithm 4).
ALGORITHM 8
(a) Sort the entries in the upper triangular portion, excluding diagonals, of the
dissimilarity matrix into an array L. 0 → l. Let C = C_0 (the weakest clustering).
(b) l + 1 → l, let L_l = d_ij. [The next dissimilarity value from the sorted list.]
(c) If i and j already belong to the same cluster, then: go to step (b); else: merge
the two clusters to form a new cluster at clustering level Δ = d_ij.
(d) Repeat steps (b) and (c) until C consists of a single cluster which contains all n
objects.
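With a union-find structure replacing the explicit cluster list, this algorithm can be sketched as follows (our code, not from Williams et al. or Anderberg):

def single_link_by_sorted_edges(d):
    n = len(d)
    parent = list(range(n))
    def find(x):                            # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    merges = []
    # steps (a) and (b): scan dissimilarities in ascending order
    for dij, i, j in sorted((d[i][j], i, j)
                            for i in range(n) for j in range(i + 1, n)):
        ri, rj = find(i), find(j)
        if ri != rj:                        # step (c): merge at level d_ij
            parent[rj] = ri
            merges.append((i, j, dij))
            if len(merges) == n - 1:        # step (d): one cluster remains
                break
    return merges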
ALGORITHM 9
(a) Let H be a list of the edges in the MST which are of length less than or
equal to a given value of Δ.
(b) Let C = C_0 be the weakest clustering. (Each cluster C_k contains only a single
object.)
(c) Let (i, j) be the next edge (of length d_ij) from list H (the order in which
they are considered is arbitrary).
(d) Let C_a and C_b be the clusters to which objects i and j belong. Set
C_a ∪ C_b → C_a, and then delete C_b from C.
(e) Repeat steps (c) through (d) until list H is empty. C will then contain the
single-link clusters at level Δ.
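A sketch of this algorithm (ours; the union-find helper is an implementation choice) is given below; applied with Δ = 2.6 to the MST of Table 5, which follows, it yields the level-2.6 clusters of Table 2:

def clusters_from_mst(mst_edges, n, delta):
    # Step (a): keep only MST edges of length <= delta, then merge the
    # endpoint clusters of each surviving edge (steps (c) and (d)).
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, j, dij in mst_edges:
        if dij <= delta:
            parent[find(j)] = find(i)
    groups = {}
    for k in range(n):
        groups.setdefault(find(k), []).append(k)
    return list(groups.values())

# The MST of Table 5 in 0-based indices:
# clusters_from_mst([(0,2,2.6), (2,4,1.2), (0,3,3.0), (2,1,4.5), (1,5,0.8)],
#                   6, 2.6)  returns  [[0, 2, 4], [1, 5], [3]],
# i.e. {1,3,5}, {2,6}, {4} in the 1-based labels of Table 2.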
Table 5
Minimum spanning tree for the dissimilarity matrix given in Table 1

i     1    3    1    3    2
j     3    5    4    2    6
d_ij  2.6  1.2  3.0  4.5  0.8
Table 6
Packed representation of the hierarchical clustering scheme given in Table 2

i  1    3    5    4    2    6
   2.6  1.2  3.0  4.5  0.8
Rohlf (1973b) gives a FORTRAN subroutine, DEND, which creates a simple line printer
plot of the dendrogram from this packed representation with only O(n) effort.
Rohlf (1974, 1975) gives a FORTRAN subroutine for plotting a dendrogram which
requires slightly more effort since it centers the 'stems' under the branches.
Many papers have been published which are concerned with efficient
algorithms for constructing the MST's themselves. Recent work has been concerned
with efforts to reduce the computational effort below O(n^2). Clearly this is only
possible if some proportion of the n(n − 1)/2 dissimilarities can be ignored as
being irrelevant. The larger dissimilarities are not of interest, since the MST
contains only edges connecting near neighbors. When only a small proportion of
the n(n − 1)/2 dissimilarities are defined (the undefined coefficients are considered
infinite), then the following algorithm by Yao (1975) is useful.
ALGORITHM 10
(a) For each object i partition the remaining objects into k ordered sets
S_1, S_2,...,S_k so that d_ij ≤ d_il if j ∈ S_a, l ∈ S_b, and a < b. [This can be done by
repeatedly applying the linear median-finding algorithm of Blum et al. (1973).]
The number of sets, k, is taken as log n.
(b) Sollin's algorithm (a multi-fragment MST algorithm; see Berge and
Ghouila-Houri, 1962) is then applied, but with the dissimilarities in the higher
indexed sets considered only after those in the lower indexed sets have been
exhausted.
Therefore the most efficient of such algorithms must expend O(n^2) effort (Sibson,
1973). All of the dissimilarities must be examined because the analyses make no
assumptions about the metric properties of the space, and it is not possible to
infer the magnitude of any d_ij by knowing any or all of the other dissimilarities.
In most applications of single-link cluster analysis this is unnecessary, since in fact
most dissimilarity coefficients used allow such inferences. In the case of a metric
space, knowing d_12 and d_13 allows one to set an upper bound on the magnitude of
d_23, since d_23 ≤ d_12 + d_13. What is needed, however, is a lower bound on the
dissimilarity between two objects, since if one could infer that d_ij was greater than
Δ, then one could skip over the dissimilarity between objects i and j.
If the dissimilarity coefficients were computed from a p variable by n object
matrix X, such that x_ki can be interpreted as the coordinate of a point
(corresponding to object i) along axis k, then more time-efficient algorithms are
possible. The choice of a dissimilarity coefficient for these algorithms is limited
only by the restriction that the coefficient can be computed from, and bounded
below by, the differences between objects along the separate coordinate axes.
ALGORITHM 11
(a) Let i denote the object for which the nearest neighbor is required.
(b) Initialize: create a table T giving the objects in rank order for each of the p
variables separately and an array R which will contain the dissimilarity from the
nearest neighbor to i in the kth dimension. A table of inverted lists, L, should
also be prepared so that one can directly look up the position of an object in T for
a given dimension. 0 → R_k, 1 → l_k, 1 → r_k (for k = 1 to p). [Pointers l_k and r_k point to
an object to the left and to the right of i (if such exist) in the sorted list for
dimension k.]
(c) Initialize: 1 → k, ∞ → d_min. [k is the dimension being searched over;
m will be the nearest neighbor of object i at distance d_min.]
(d) Find the location a in T of object i for dimension k: L_ki → a.
(e) Find the two objects on either side of i in the kth dimension: T_{k,a−l_k} →
b, T_{k,a+r_k} → c (if a − l_k < 1 or a + r_k > n, then b or c, respectively, is undefined
and dissimilarities to such undefined objects are set to ∞).
(f) If min(d_ib, d_ic) ≥ d_min, then: go to step (j) [objects b and c are both outside
of the search radius]; else: continue. [A nearer neighbor has been found.]
(g) Update radius of spiral: min{|x_ki − x_kb|, |x_ki − x_kc|} → R_k.
(h) If d_ib < d_ic, then: b → m, d_ib → d_min, l_k + 1 → l_k; else: c → m, d_ic → d_min,
r_k + 1 → r_k.
(i) If k < p, then: k + 1 → k, go to step (d); else: continue. [A new nearest
neighbor candidate has been found.]
(j) Compute the dissimilarity coefficient based on the differences in array R. If it is
larger than d_min, then: stop [object m is the nearest neighbor]; else: go to step (c).
Steps (c) through (j) are repeated until it is impossible for there to be a point
closer to i than m (d_im = d_min). The search 'spirals' outward from point i due to
step (h). O(n log n) effort is spent during the initialization step sorting the data
and building the tables T and L of pointers, but for a given analysis one builds
the tables only once.
Two other general approaches have been suggested for finding nearest neigh-
bors which also avoid the computation of dissimilarities between all pairs of
objects. Bentley (1975) developed a multidimensional binary search tree, the 'k-d tree'
(which refers to a k-dimensional tree), which can be fitted to a given data set so that
nearest neighbors tend to be on nearby nodes of the binary tree. As in the spiral
search algorithm, dissimilarities are computed during construction of the tree only
for objects which are likely to be nearest neighbors. The second approach is to
partition the p-dimensional space into cells and then compute dissimilarities only
among points in the same or adjacent cells. The optimal approach among these
three depends upon the configuration of the points in the p-dimensional space.
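As a present-day aside (not part of the original chapter), k-d trees are now standard library fare; assuming SciPy is available, an all-nearest-neighbor query over a coordinate matrix might read:

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 3))       # hypothetical n = 1000, p = 3 data

tree = cKDTree(X)                        # preprocessing: build the k-d tree
dist, idx = tree.query(X, k=2)           # k = 2: each point itself plus
nn_dist, nn_idx = dist[:, 1], idx[:, 1]  # its nearest distinct neighbor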
The application of k-d trees to the construction of MST's is described in
Bentley and Friedman (1978) (one can easily compute single-link clusters from a
MST as described above). They give a detailed presentation of the algorithm and
the results of a simulation study showing that the average running time is
O(n log n) for the spherical normal distribution. There is, however, considerable
overhead at the preprocessing stage so that the algorithm represents an improve-
ment in total running time only for larger values of n. For p = 2 the break-even
point with respect to a simple MST algorithm such as that of Prim (1957) is at
n = 250. For p = 8 the break-even point is at n = 340 (Bentley and Friedman,
1978). Execution times for this algorithm are especially favorable for data sets
consisting of points which are uniformly distributed along the orthogonal coordi-
nate axes, while data sets with a few isolated clusters exemplify worst case
situations.
Many authors have suggested a preprocessing step in which the coordinate
space is partitioned into cells to facilitate searching (e.g., Yuval, 1975, 1976;
Bentley and Shamos, 1976; and Rabin, 1976). Given a fixed threshold of
dissimilarity, δ, Rabin (1976) developed a method for partitioning a p-dimensional
space into 2^p systems of p-dimensional cells of width 2δ on each side such
that all points at distance d ≤ δ would be found together in the same cell at least
once. Thus dissimilarity values need only be computed for pairs of points
contained in a common cell. Rabin (1976) also proposed that δ be estimated from
a random sample of points. Rohlf (1977, 1978) adapted this method to the
computation of MST's and single-link clusterings. In simulations based on
samples from p-dimensional spherical normal distributions, it was found that this
algorithm had average running times which increased more slowly than O(n log n).
The break-even point with respect to a simple MST algorithm such as Prim (1957)
for p = 2 was only n = 150, but for p = 4 it was n = 350. Thus the apparent
advantage of the use of cells versus k-d trees seems to decrease as p increases.
Rohlf (1977) showed that additional preprocessing of the data could improve the
performance of using cells. The partitioning of the space could be based, for
example, on only the k most important variables (in the sense of having the
largest variances). Or the data could be projected onto the first k principal
components. If the k dimensions used for the partitioning of the data into cells
explain most of the variance in the data and k is much smaller than p, then the
running time should be reduced considerably. The most favorable data configura-
tion for this approach would be a large number of points uniformly
distributed in a low-dimensional subspace. As in the case of k-d trees the worst
case is for well-separated clusters of objects. From the point of view of applica-
tions to cluster analysis, these algorithms work best on data in which there are in
fact no clear clusters. A description of these algorithms for the computation of
MST's and single-link clusters is given in Rohlf (1977, 1978).
For the special case in which d_ij is the Euclidean distance in the plane (p = 2),
Shamos (1975) and Shamos and Hoey (1975) show that Voronoi diagrams can be
used to reduce the computational effort in finding the nearest neighbors of an
object. They describe an algorithm for finding a MST which in the worst case
requires computational effort of only O(n log n). Unfortunately, this approach
has not been extended to other dissimilarity coefficients or even to more than
p = 2 dimensions.
This is an important area which needs further work. It appears that only
through the use of such techniques will it be feasible to cluster very large
(n > 1000) data sets.
Acknowledgment
This paper represents contribution No. 332 from the Program in Ecology and
Evolution at the State University of New York at Stony Brook. It was supported
in part by grant DEB77-24611 from the National Science Foundation. I wish to
thank Sally Howe who critically read a draft of this paper and made many
valuable suggestions.
References
Anderberg, M. R. (1973). Cluster Analysis for Applications. Academic Press, New York.
Bentley, J. L. (1975). Multidimensional binary search trees used for associative searching. Comm.
ACM 18, 509-517.
Bentley, J. L. and Friedman, J. H. (1978). Fast algorithms for constructing minimal spanning trees in
coordinate spaces. IEEE Trans. Comput. 27, 97-105.
Bentley, J. L. and Shamos, M. I. (1976). Divide-and-conquer in multidimensional space. Proc. Eighth
ACM Symp. on the Theory of Computing, 220-230.
Berge, C. (1966). The Theory of Graphs and its Applications. Methuen, London.
Berge, C. and Ghouila-Houri, A. (1962). Programming, Games and Transportation Networks. Methuen,
London.
Blum, M., Floyd, R. W., Pratt, V. R., Rivest, R. L. and Tarjan, R. E. (1973). Time bounds for
selection. J. Comput. System Sci. 7, 448-461.
Cole, A. J. and Wishart, D. (1970). An improved algorithm for the Jardine-Sibson
method of generating overlapping clusters. Comput. J. 13, 156-163.
Davies, R. G. (1971). Computer Programming in Quantitative Biology. Academic Press, New York.
Dijkstra, E. W. (1959). A note on two problems in connection with graphs. Numer. Math. 1, 269-271.
Fisher, L. and van Ness, J. W. (1971). Admissible clustering procedures. Biometrika 58, 91-104.
Florek, K., Lukaszewicz, J., Perkal, J., Steinhaus, H. and Zubrzycki, S. (1951a). Sur la liaison et la
division des points d'un ensemble fini. Colloq. Math. 2, 282-285.
Florek, K., Lukaszewicz, J., Perkal, J., Steinhaus, H. and Zubrzycki, S. (1951b). Taksonomia
Wroclawska. Przegl. Antropol. 17, 193-211.
Gower, J. C. and Ross, G. J. S. (1969). Minimum spanning trees and single-linkage cluster analysis.
Applied Statistics 18, 54-64.
van Groenewoud, H. and Ihm, P. (1974). A cluster analysis based on graph theory. Vegetatio 29,
115-120.
Hartigan, J. A. (1967). Representation of similarity matrices by trees. J. Amer. Stat. Assoc. 62,
1140-1158.
Hartigan, J. A. (1975). Clustering Algorithms. Wiley, New York.
Hartigan, J. A. (1977). Distributional problems in clustering. In: J. van Ryzin, ed., Classification and
Clustering, 45-71. Academic Press, New York.
Jardine, C. J., Jardine, N. and Sibson, R. (1967). The structure and construction of taxonomic
hierarchies. Math. Biosci. 1, 173-179.
Jardine, N. and Sibson, R. (1968). The construction of hierarchic and non-hierarchic classifications.
Comput. J. 11, 177-184.
Jardine, N. and Sibson, R. (1971). Mathematical Taxonomy. Wiley, New York.
Johnson, S. C. (1967). Hierarchical clustering schemes. Psychometrika 32, 241-254.
Katz, J. O. and Rohlf, F. J. (1973). Function-point cluster analysis. Systematic Zool. 22, 295-301.
Kittler, J. (1976). A locally sensitive method for cluster analysis. Pattern Recognition 8, 23-33.
Kittler, J. (1979). Comments on 'single-link characteristics of a mode-seeking clustering algorithm'.
Pattern Recognition 11, 71-73.
Kruskal, J. B. Jr. (1956). On the shortest spanning subtree of a graph and the traveling salesman
problem. Proc. Amer. Math. Soc. 7, 48-50.
Lance, G. N. and Williams, W. T. (1967). A general theory of classificatory sorting strategies. I.
Hierarchical systems. Comput. J. 9, 373-380.
McQuitty, L. L. (1957). Elementary linkage analysis for isolating orthogonal and oblique types and
typal relevancies. Educ. Psychol. Meas. 17, 207-222.
van Ness, J. W. (1973). Admissible clustering procedures. Biometrika 60, 422-424.
Prim, R. C. (1957). Shortest connection networks and some generalizations. Bell System Tech. J. 36,
1389-1401.
Rabin, M. O. (1976). Probabilistic algorithms. In: J. F. Traub, ed., Algorithms and Complexity, 21-39.
Academic Press, New York.
Rohlf, F. J. (1973a). A new approach to the computation of the Jardine-Sibson Bk clusters. Comput.
J. 18, 164-168.
Rohlf, F. J. (1973b). Algorithm 76, Hierarchical clustering using the minimum spanning tree. Comput.
J. 16, 93-95.
Rohlf, F. J. (1974). Algorithm 81, dendrogram plot. Comput. J. 17, 89-91.
Rohlf, F. J. (1975). Note on Algorithm 81, dendrogram plot. Comput. J. 18, 90-92.
Jan de Leeuw and Willem Heiser
It is difficult to give a precise definition of MDS, because some people use the
term for a very specific class of techniques while others use it in a much more
general sense. Consequently it makes sense to distinguish between MDS in the
broad sense and MDS in the narrow sense. MDS in the broad sense includes
various forms of cluster analysis and of linear multivariate analysis; MDS in the
narrow sense represents dissimilarity data in a low-dimensional space.
People who prefer the broad-sense definition want to emphasize the close
relationships of clustering and scaling techniques. Of course this does not imply
that they are not aware of the important differences. Clustering techniques fit a
non-dimensional discrete structure to dissimilarity data, narrow-sense M D S fits a
continuous dimensional structure. But both types of technique can be formalized
as representing distance-like data in a particular metric space by minimizing some
kind of loss function. The difference, then, is in the choice of the target metric
space, the structure of the two problems is very much alike. The paper pioneering
this point of view is Hartigan's (1967). In fact Hartigan proceeds the other way
around: he takes clustering as the starting point and lets clustering in the broad
sense include narrow-sense MDS. The same starting point and the same order are
more or less apparent in the important review papers of Cormack (1971) and
Sibson (1972). An influential broad-sense paper that uses narrow-sense M D S as
the starting point is that of Carroll (1976). In fact Carroll discusses techniques
that explicitly combine aspects of clustering and narrow-sense MDS, and that find
mixed discrete/continuous representations. In a very recent paper Carroll and
Arabie (1980) propose a useful taxonomy of MDS data and methods which is
very broad indeed. Investigators who are less interested in formal similarities of
methods and more interested in substantial differences in models have naturally
emphasized the choice of the space, and consequently the differences between
clustering and narrow-sense MDS. Of course detailed comparison of the two
classes of techniques already presupposes a common framework. This is obvious
in the important papers of Shepard (1974) and Shepard and Arabie (1979), who
present the two classes of methods essentially as complementing each other. If the
emphasis shifts more towards a theory of similarity judgments, the discrete and
continuous representations tend to become rivals. In a brilliant paper Tversky
(1977) has attacked narrow-sense MDS as a realistic model for psychological
similarity, and Sattath and Tversky (1977) have presented the free or additive tree
as a more realistic alternative. Krumhansl (1978) has defended dimensional
models.
We are not interested, in this paper, in psychological theories of similarity, i.e.
in narrow-sense MDS as a miniature psychological theory. We are also not
interested in discussing the many forms of cluster analysis and nondimensional
scaling. Consequently we propose a definition of scaling which is quite broad
(although not as broad as the one suggested by Carroll's taxonomy), and after
proposing this definition we quickly specialize to MDS. From that point on there
is no more need to distinguish between narrow and broad MDS. Our definition is
inspired by the definition of Kruskal (1977), and by the discussion in Kruskal and
Wish (1978), and Cliff (1973). We agree with these authors that MDS should be
further classified, and that the most important distinctions are between metric
and nonmetric MDS, and between two-way and three-way MDS. Other criteria
are choice of metric, choice of loss function, and choice of algorithm, but these
seem to be more technical and less essential.
A number of additional comments are in order here. We have called d(x, y) the
distance of x and y, but we have not assumed that the function d satisfies the
usual axioms for a metric. In the same way we have defined disparities as
arbitrary functions on I × I, while in most applications disparities (and dissimi-
larities) are distance-like too. For ease of reference we briefly mention the metric
axioms here. For (X, d) we must have the following:
(M1) d(x, y) ≥ 0, and d(x, y) = 0 if and only if x = y;
(M2) d(x, y) = d(y, x) for all x and y;
(M3) d(x, z) ≤ d(x, y) + d(y, z) for all x, y, and z.
In most applications our target space (X, d) does satisfy these axioms, but in
some (M1) must be replaced by the weaker condition
(M1') d(x, y) ≥ 0, and d(x, x) = 0,
so that distinct points may be at zero distance.
Because of the very many possibilities we do not impose a specific set of axioms
on (X, d) and/or (I, δ); we simply have to remember that they usually are
distance-like.
The expression "approximately equal to" in our definitions has not been
defined rigorously. As we mentioned previously, the usual practice in scaling is to
define a real-valued loss function, and to construct the mapping of I into X in such
a way that this loss function is minimized. In this sense a scaling problem is
simply a minimization problem. One way in which scaling procedures differ is
that they use different loss functions to fit the same structure. Another way in
which they differ is that they use different algorithms to minimize the same loss
function. We shall use these technical distinctions in the paper in our taxonomy
of specific multidimensional scaling procedures.
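For concreteness (the notation is ours, not a definition adopted uniformly in the literature reviewed here), a typical least squares loss function for metric MDS is

    σ(X) = Σ_{i<j} w_ij (δ_ij − d_ij(X))²,

where the δ_ij are the disparities, the d_ij(X) are the distances in the configuration X, and the w_ij are optional weights; the scaling procedure then minimizes σ over all configurations X.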
It is now very simple to define MDS: a (p-dimensional) multidimensional
scaling problem is a scaling problem in which X is R^p, the space of all p-tuples of
real numbers. Compare the definition given by Kruskal (1977, p. 296): "We
define multidimensional scaling to mean any method for constructing a configuration
of points in low-dimensional space from interpoint distances which have
been corrupted by random error, or from rank order information about the
corrupted distances." This is quite close to our definition. The most important
difference is that Kruskal refers to 'corrupted distances' and 'random error'. We
do not use these notions, because it seems unnecessary and maybe even harmful
to commit ourselves to a more or less specific stochastic model in which we
assume the existence of a 'true value.' This point is also discussed by Sibson
(1972). Moreover, not all types of deviations we meet in applications of MDS can
reasonably be described as 'random error.' This is also emphasized by Guttman
(1971).
We have now specified the space X, but not yet the metric d. The most familiar
choice is, of course, the ordinary Euclidean metric. This has been used in at least
90% of all MDS applications, but already quite early in MDS history people were
investigating alternatives and generalizations. In the first place the Euclidean
metric is a member of the family of power metrics, Attneave (1950) found
empirically that another power metric gave a better description of his data, and
Restle (1959) proposed a qualitative theory of similarity which leads directly to
Attneave's 'city block' metric. The power metrics themselves are special cases of
general Minkovski metrics, they are also special cases of the general additive/dif-
ference metrics investigated axiomatically by Tversky and Krantz (1970). On the
other hand Euclidean space is also a member of the class of spaces with a
projective (or Cayley-Klein) metric, of which hyperbolic and elliptic space are
other familiar examples. In Luneborg (1947) a theory of binocular visual space
was discussed, based on the assumption of a hyperbolic metric, and this theory
has always fascinated people who are active in MDS. It is clear, consequently,
that choice of metric is another important criterion which can be used to classify
MDS techniques.
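Because the choice among these metrics recurs below, a small computational illustration may help. The following sketch is our own (Python with numpy; the name power_metric is hypothetical), not something from the literature just reviewed:

```python
import numpy as np

def power_metric(x, y, r=2.0):
    """Power (Minkovski) metric between two points.

    r = 2 gives the Euclidean metric, r = 1 Attneave's 'city block'
    metric, and r = np.inf the sup-metric (a metric requires r >= 1).
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    if np.isinf(r):
        return np.abs(x - y).max()
    return (np.abs(x - y) ** r).sum() ** (1.0 / r)
```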
the points by using a theorem due to Young and Householder (1938). The
Thurstonian methods were systematized by Torgerson in his thesis of 1951; the
main results were published in [127]. Messick and Abelson (1956) contributed a
better method to estimate the additive constant, and Torgerson summarizes the
Thurstonian era in MDS in Chapter 11 of his book (1958).
The first criticisms of the Thurstonian approach are in an unpublished disserta-
tion of Rowan in 1954. He pointed out that we can always choose the additive
constant in such a way that the distances are Euclidean. Consequently the
Thurstonian procedures tend to represent non-Euclidean relationships in
Euclidean space, which is confusing, and makes it impossible to decide if the
psychological distances are Euclidean or not. Rowan's work is discussed by
Messick (1956). He points out that non-Euclidean data lead to a large additive
constant, and a large additive constant leads to a large number of dimensions. As
long as the Thurstonian procedures find that a small number of dimensions is
sufficient to represent the distances, everything is all right. In the meantime a
more interesting answer to Rowan's objections was in the making. In another
unpublished dissertation Mellinger applied Torgerson's procedures to colour
measurement. This was in 1956. He found six dimensions, while he expected to
find only two or three. Helm (1960) replicated Mellinger's work, and found not
less than twelve dimensions. But Helm made the important observation that this
is mainly due to the fact that large distances tend to be underestimated. If he
transformed the distances exponentially he found only two dimensions, and if
he transformed Mellinger's data exponentially he found three. Consequently a
large number of dimensions is not necessarily due to non-Euclidean distances, it
can also be due to a nonlinear relationship between dissimilarities and distances.
In the meantime the important work of Attneave (1950) had been published.
His study properly belongs to 'multidimensional psychophysics', in which the
projections of the stimuli on the dimensions are given, and the question is merely
how the subjects 'compute' the dissimilarities. In the notation of the previous
section the mapping φ is given but the distance d is not, while in MDS the
distance d is given and the mapping φ is not. In this paper Attneave pointed out
that other distance measures may fit the data better than the Euclidean distance,
and he also compared direct judgments of similarity with identification errors in
paired associate learning. He found that the two measures of similarity had a
nonlinear but monotonic relationship. In his 1955 dissertation Roger Shepard
also studied errors in paired associate learning. The theoretical model is
explained in [117], the mathematical treatment is in [116], and the experimental
results are in [118]. Shepard found that to a fairly close approximation, distance is
the negative logarithm of choice probability, which agrees with a choice theory
analysis of similarity judgments by Luce (1961, 1963). Compare also [68]. On the
other hand Shepard also found systematic deviations from this 'rational distance
function', and concluded that the only thing everybody seemed to agree on was
that the function was monotonic.
Also around 1950 Coombs began to develop his theory of data. The main
components of this theory are the classification of data into the four basic
loss function, and he discussed how such functions could be minimized. Early
systematizations of this approach are from Roskam (1968) and Young (1972).
Torgerson (1965) reported closely related work that he had been doing, and made
some thoughtful (and prophetic) comments about the usefulness of the new
nonmetric procedures. Guttman (1968) contributed a long and complicated paper
which introduced some useful notation and terminology, contributed some inter-
esting mathematical insights, but, unfortunately, also a great deal of confusion. It
is obvious now that Kruskal's discussion of his minimization method was rather
too compact for most psychometricians at that time. The confusion is discussed,
and also illustrated, in [87].
The main contribution to MDS since the Shepard-Kruskal 'computer revolution'
is undoubtedly the paper by Carroll and Chang (1970) on three-way MDS.
It follows the by now familiar pattern: a not necessarily new model is presented
together with an efficient algorithm and an effective computer program, and
with some convincing examples. A recent paper, following the same strategy,
integrates two- and three-way metric and nonmetric MDS in a single program
[125].
d²(x, y) = (x − y)'(x − y),

or, equivalently,

d(x, y) = ‖x − y‖,

where ‖·‖ is the Euclidean norm. The metric two-way Euclidean MDS problem is
to find a mapping φ of I into Rᵖ such that δ(i, j) is approximately equal to
‖φ(i) − φ(j)‖. In this section we are interested in the conditions under which this
problem can be solved exactly, or, to put it differently, under which conditions
(I, δ) can be imbedded in (Rᵖ, d). This problem is also studied in classical
distance geometry. There is some early work on the subject by Gauss, Dirichlet,
and Hermite, but the first systematic contribution is the very first paper of Arthur
Cayley (1841). Cayley's approach was generalized by Menger (1928) in a fundamental
series of papers. Cayley and Menger used determinants to solve the
imbedding problem; an alternative formulation in terms of quadratic forms was
suggested by Fréchet (1935) and worked out by Schoenberg (1935). The same
PROOF. Suppose x₁, …, xₙ are p-vectors such that δ²(i, j) = d²(xᵢ, xⱼ). Define …
In fact Schoenberg (1935) and Young and Householder (1938) prove a slightly
different version of the theorem. They place the origin in one of the points, while
our version places the origin in the centroid of the points, which is considerably
more elegant from a data analysis point of view. Our formulation is due to
Tucker, Green, and Abelson (cf. [128, pp. 254-259]). Theorem 1 can be sharpened
by defining (I, δ) to be irreducibly imbeddable in Euclidean p-space if it is
imbeddable, but not imbeddable in Euclidean (p − 1)-space. A slight rewording
of the proof shows that (I, δ) is irreducibly imbeddable if and only if B is psd
and rank(B) = p. A complete solution of the general imbedding problem, in
which I can be infinite, was given by Menger (1928). We quote his result without
proof; for an excellent discussion we refer to [10, Chapter IV].
PROOF. Consider λ(α), the minimum of x'B(α)x over all x satisfying x'x = 1
and x'e = 0. Clearly λ(α) is continuous. By hypothesis λ(0) < 0, and the proof of
Theorem 5 shows that λ(α) > 0 for α > α₀. Consequently λ(α) = 0 for some α
between 0 and α₀.
THEOREM 7. Suppose (I, δ) is a semimetric space, and suppose I is finite with n
elements. Then there is an α such that I with the semimetric [δ²(i, j) + α(1 − δ^{ij})]^{1/2},
where δ^{ij} is the Kronecker delta, can be imbedded in Euclidean (n − 2)-space.
PROOF. In this case B(α) = B₀ + ½αJ, and λ(α) = λ(0) + ½α. Thus if α = −2λ(0),
then λ(α) = 0, and B(α) is psd of rank n − 2.
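The proof of Theorem 7 is constructive, and translates into a few lines of code. The sketch below is our own illustration (numpy assumed; additive_constant is a hypothetical helper name): it computes α = −2λ(0) from the smallest eigenvalue of the double-centered matrix B₀.

```python
import numpy as np

def additive_constant(delta):
    """Smallest alpha (as in the proof of Theorem 7) such that the
    semimetric sqrt(delta**2 + alpha), off-diagonal, is Euclidean."""
    n = delta.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n     # centering projector
    B0 = -0.5 * J @ (delta ** 2) @ J        # double-centered matrix
    lam0 = np.linalg.eigvalsh(B0)[0]        # smallest eigenvalue, lambda(0)
    return max(0.0, -2.0 * lam0)            # alpha = -2 * lambda(0)
```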
THEOREM 8. Suppose (I, δ) is a semimetric space, and suppose I is finite with n
elements. Then there is a semimetric δ̄ such that δ̄(i, j) ≤ δ̄(i', j') if and only if
δ(i, j) ≤ δ(i', j') for all i ≠ j and i' ≠ j', and such that (I, δ̄) can be imbedded in
Euclidean (n − 2)-space.
It was already pointed out by Lingoes (1971) that the mapping constructed in
the proof of Theorems 7 and 8 may lead to δ̄(i, j) = 0 for some i ≠ j. This means
that in general the conclusion of Theorem 8 cannot be strengthened to δ̄(i, j) ≤
δ̄(i', j') if and only if δ(i, j) ≤ δ(i', j') for all i, j, i', j'. The precise conditions
under which such a strengthening is possible were investigated by Holman (1972).
His work is based partly on unpublished work of the distance geometer L. M.
Kelly. In the first place a semimetric space (I, δ) is called an ultrametric space if

δ(i, j) ≤ max{δ(i, k), δ(k, j)}

for all i, j, k. This ultrametric inequality, which obviously implies the triangle
inequality, is interesting from a data analysis point of view, because it is well
known that a semimetric space can be imbedded in a hierarchical tree structure if
and only if the semimetric satisfies the ultrametric inequality (cf. for example
[62]). Hierarchical tree structures are very important in clustering and classifi-
cation literature [25]. We now give the two theorems proved by Holman.
THEOREM 9. Suppose (I, δ) is an ultrametric space, and suppose I is finite with n
elements. Then (I, δ) can be imbedded in Euclidean (n − 1)-space, but not in
Euclidean (n − 2)-space.
Because the ultrametric inequality remains true if we transform the δᵢⱼ monotonically,
this also implies that no strictly monotone transformation of the δᵢⱼ can
be imbedded in Euclidean (n − 2)-space.
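As an aside, the ultrametric inequality for a finite dissimilarity matrix can be verified by brute force. A minimal sketch of our own (numpy; a symmetric matrix with zero diagonal is assumed):

```python
import numpy as np
from itertools import combinations

def is_ultrametric(delta, tol=1e-12):
    """Check delta[i, k] <= max(delta[i, j], delta[j, k]) for all triples."""
    n = delta.shape[0]
    for i, j, k in combinations(range(n), 3):
        # test the inequality with each element of the triple in the middle
        for a, b, c in ((i, j, k), (j, i, k), (i, k, j)):
            if delta[a, c] > max(delta[a, b], delta[b, c]) + tol:
                return False
    return True
```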
H(θ) = H₀ + Σ{θᵢⱼAᵢⱼ | (i, j) ∈ Π},

B(θ) = B₀ + Σ{θᵢⱼTᵢⱼ | (i, j) ∈ Π},

where B₀ = −½JH₀J and Tᵢⱼ = −½JAᵢⱼJ as usual. The following theorem is rather
trivial but is given because it has computational consequences.
THEOREM 12. Suppose (I, δ) is a semimetric space, with I finite. Then there
exists a semimetric δ̄ such that for all i, j, i', j', δ(i, j) ≤ δ(i', j') implies δ̄(i, j) ≤
δ̄(i', j'), and such that (I, δ̄) can be imbedded in Euclidean p-space, if and only if
there exist nonnegative numbers θᵢⱼ such that B(θ) is psd, and rank(B(θ)) ≤ p.
We can use Theorem 2 from the previous subsection to find solutions to the general
imbedding problem. The space problem is somewhat more complicated. The most
interesting case is the one in which the elements of I × I are ordered. We want to
study the conditions under which this order can be considered to be induced by a
convex metric. This topic was studied by Beals and Krantz (1967); their treatment
was improved by Krantz (1968), explained for psychologists by Beals, Krantz,
and Tversky (1968), and generalized by Lew (1975). Minimality and symmetry
can easily be defined in terms of the order; they are obviously necessary
conditions for the representation of the order by any metric, convex or not. A
number of 'technical' assumptions are also needed, which state that the space is
continuous, without holes. They are usually untestable on empirical data. The
most important assumption is based on an ordinal characterization of between-
ness. Suppose i₁, i₂ and i₃ are three distinct objects in I. Following Beals and
Krantz we define (i₁i₂i₃) if and only if
(a) (i₁, i'₂) ≤ (i₁, i₂) and (i'₂, i₃) ≥ (i₂, i₃) imply (i₁, i'₂) ≥ (i₂, i₃);
(b) if the conditions of (a) hold, and (i₁, i'₂) = (i₂, i₃), then (i₁, i'₂) = (i₁, i₂) and
(i'₂, i₃) = (i₁, i₃).
Moreover we define [i₁i₂i₃] if and only if both (i₁i₂i₃) and (i₃i₂i₁). The basic
assumption is that if 0 < (i₁, i'₂) < (i₁, i₃) then there is an i₂ in I such that
(i₁, i₂) = (i₁, i'₂) and (i₁i₂i₃). The conditions can be illustrated by drawing 'iso-
similarity contours' as in [5]. Under the assumptions we have stated there exists a
metric d on I such that
(a) (i, j) ≤ (i', j') if and only if d(i, j) ≤ d(i', j');
(b) (i₁i₂i₃) if and only if d(i₁, i₂) + d(i₂, i₃) = d(i₁, i₃).
To get to Euclidean space we can use Theorem 3 or 4. Completeness and external
convexity can easily be defined in terms of the order relation and the betweenness
relation. Because the convex metric constructed by Beals and Krantz is unique up
to scale we can simply use it to test the Euclidean four point property (or the
weaker versions of the property discussed by Blumenthal (1975)). A more direct
approach is also possible. Blumenthal (1938, pp. 10-13) discusses an axiomatization
of three-dimensional Euclidean geometry due to the Italian geometer M.
Pieri. The only undefined elements are 'points', the only primitive relation is 'i₁ is
equally distant from i₂ and i₃'. This axiomatization, published in 1908, can easily
be generalized to p-dimensional space.
Holman's Theorem 9 can be interpreted as a negative result, which shows that
ultrametrics cannot be represented in low-dimensional Euclidean spaces. It is well
known that ultrametric and tree distances are closely related to 'city block' or
l₁-distances. In this sense the counterexamples presented by Lew (1978) generalize
Holman's results. Lew proves that the p-dimensional 'city block' spaces l₁ᵖ
cannot be imbedded into finite-dimensional Euclidean space, and that there is no
monotone transform of the metric which makes such an imbedding possible. On
the other hand it has been shown by Schoenberg (1937, 1938) that if δ(i, j) is the
l₁-metric, then {δ(i, j)}^γ, with 0 < γ ≤ ½, can be imbedded into l₂, the natural
infinite-dimensional generalization of Euclidean space. Lew (1978) also presents
other interesting results based on Schoenberg's metric transform theory.
LΛ⁻¹K'BₖKΛ⁻¹L' is diagonal for each k, which is possible if and only if the
Λ⁻¹K'BₖKΛ⁻¹ commute. This is true if and only if (c) is true.
d²(φ(i), φ(j)) = Σₛ₌₁ᵖ (xᵢₛ − xⱼₛ + zₛ)²,
where z is called the 'slide vector'. A related 'jet stream' model appears in [40]
where Gower also has an interesting 'cyclone' model. Gower (1977), and
Constantine and Gower (1978) also discuss MDS techniques which decompose a
matrix in its symmetric and antisymmetric parts, and then compute the singular
value decomposition of both parts. Baker (cf. [3]) has proposed
d²(φ(i), φ(j)) = Σₛ₌₁ᵖ wᵢₛ(xᵢₛ − xⱼₛ)².
especially if we consider the fact that Shepard and Luce extend their model by
supposing that d(φ(i), φ(j)) = −ln ηᵢⱼ. For recent extensions of the Shepard-Luce
models we refer to work by Nakatani (1972), and Townsend (1978). Models of
this form can be used to find maximum likelihood estimates of the quasi-
symmetry parameters, and in the complete specification of the MDS-coordinates.
De Leeuw and Heiser (1979) propose a variety of probability models and
computational techniques for these 'discrete interaction matrices'.
The proof is easy matrix algebra. In a principal component context this model
has been discussed by Carroll, Green, and Carmone (1976). It has been extended
to three-way PCA by Carroll and Pruzansky (1977). Various applications, in
which for example the matrix Y is an ANOVA-type design matrix, are also
discussed in these papers. A more complicated class of restrictions uses xᵢ = Tyᵢ in
combination with T diagonal. This can be used to fit simplexes and circumplexes.
Cases in which X is partially restricted and partially free can be used to build
MDS versions of common factor models.
If T is diagonal and Y is binary (zero-one) a special interpretation of the
Euclidean distance model is possible. Suppose P = {1, …, p}. If S is a subset of P
and t is the p-vector with the diagonal elements of T, then we can define

σ(S) = Σ{tₛ | s ∈ S}.

If S₁, …, Sₙ are the subsets of P defined by

Sᵢ = {s | yᵢₛ = 1},

then

δ²(i, j) = σ(Sᵢ △ Sⱼ),

with △ the symmetric difference [2], or

δ²(i, j) = Σₛ₌₁ᵖ tₛ(yᵢₛ − yⱼₛ)²,
[128]. This approach has been generalized by Tversky (1966), whose work is also
discussed in [5], and is improved and generalized in [134]. Now the Iₛ do not have
to be sets of real numbers anymore; they do not even have to be ordered sets. We
suppose that there are real-valued functions φₛ on Iₛ, increasing functions
χₛ: R → R, and an increasing function F: R → R, such that

δ(i, j) = F(Σₛ₌₁ᵖ χₛ(|φₛ(iₛ) − φₛ(jₛ)|)).
This is called the additive difference model. Tversky gives necessary and sufficient
conditions in terms of the dimensions of the product structure and the order
relation on I × I which must be satisfied for an additive difference representation.
It is also proved that the φₛ are interval scales, and the χₛ are interval scales with a
common unit. Of course an additive difference model does not necessarily define
a metric. The additive difference representation is said to be compatible with a
metric with additive segments if the representation satisfies the assumptions of
Beals and Krantz (1967), i.e., if the order on I × I also defines a convex metric.
Krantz and Tversky (1970) prove the very satisfactory result that compatibility
implies that there is an r ≥ 1 such that

δʳ(i, j) = Σₛ₌₁ᵖ |φₛ(iₛ) − φₛ(jₛ)|ʳ.
In other words, the only additive difference models compatible with a convex
metric are the power metrics. Tests of the additive difference theory have been
carried out by Tversky and Krantz (1969), Wender (1971), Krantz and Tversky
(1975), and Schönemann (1978).
The additive difference approach supposes that the objects have a product structure.
If only the metric, or an order on I × I, is given, we have to follow a different route. We can use results of
Andalafte and Blumenthal (1964) that characterize Banach spaces in the class of
complete convex metric spaces, by using ordinal properties of the metric only. A
Minkovski space is a finite-dimensional Banach space, in which power metrics can
be characterized by using homogeneity. Another possibility is to characterize
Minkovski spaces in the class of straight G-spaces, as in [13, pp. 144-163], by
using the theory of parallels. In both cases we simply have to add some
qualitative axioms to the ones given by Beals and Krantz (1967); the additional
axioms are 'testable' and not 'technical' in the sense of Beals, Krantz, and
Tversky (1968). Both Andalafte and Blumenthal (1964) and Busemann (1955)
list additional simple qualitative properties which characterize Euclidean space in
the class of Minkovski spaces.
THEOREM 17. The finite semimetric space (I, δ) can be imbedded in S_{p,ρ} if and
only if
(a) δ(i, j) ≤ πρ,
(b) the matrix B with elements bᵢⱼ = cos(δ(i, j)/ρ) is psd, and
(c) rank(B) ≤ p + 1.
For p-dimensional hyperbolic space H_{p,ρ} the situation is even more like
Euclidean space. We define the matrix H with elements hᵢⱼ = cosh(δ(i, j)/ρ), and
the matrix B by bᵢⱼ = 1 − (hᵢⱼh‥ / hᵢ.h.ⱼ), where dots are averages again.

THEOREM 18. The finite semimetric space (I, δ) can be imbedded in H_{p,ρ} if and
only if B is positive semi-definite, and rank(B) ≤ p.
Congruence orders of S_{p,ρ} and H_{p,ρ}, i.e. the analogues of Theorem 2, are
studied in Blumenthal (1953). The theorem is also true in these spaces; they also
have congruence order p + 3. For elliptic space the situation is more complicated.
For imbedding results and congruence order results we refer to Blumenthal [10,
Chapters IX-XI], and the more recent review of Seidel [115]. Because of the
developments of non-Euclidean geometry at the end of the previous century and
in the beginning of this century we know many ways to solve the space problem
for these geometries. Not all of them are useful for our purposes, however.
Busemann discusses many metric characterizations. The most convenient one is
given in the following Theorem [13, p. 331].
THEOREM 19. If each bisector B(a, a') (i.e. the locus of points x with xa = xa',
where xa denotes the distance from x to a) of a G-space R contains with any two
points x, y at least one segment T(x, y), then the space is Euclidean, hyperbolic, or
spherical of dimension greater than 1.
Again it is easy to add the flatness of the bisectors to the axioms of Beals and
Krantz (1967) for G-spaces.
over all X in R^{np}. Suppose λ₁ ≥ … ≥ λₙ are the ordered eigenvalues of B, let
λ̄ₛ = min(0, λₛ), and let kₛ be an eigenvector corresponding with λₛ; the kₛ
corresponding with equal eigenvalues are chosen to be orthogonal.

THEOREM 20.

min{σ₁(X) | X ∈ R^{np}} = Σₛ₌₁ᵖ λ̄ₛ² + Σₛ₌ₚ₊₁ⁿ λₛ²,

and the minimum is attained for xₛ = {max(0, λₛ)}^{1/2}kₛ, s = 1, …, p.
A proof is given, for example, by Keller (1962). This theorem justifies the
classical scaling method explained most extensively by Torgerson (1958). Observe
that the last columns of X are equal to zero if B does not have p positive
eigenvalues. This metric scaling procedure has three major advantages compared
with other competing ones. In the first place we know how to compute eigenvec-
tors and eigenvalues precisely and efficiently. In the second place the solutions
are nested in the sense that the solution for q dimensions is contained in the
solution for p dimensions if q < p. And finally, we are sure that we find the global
minimum of the loss function, and not merely a local minimum. The disadvantage
is that the procedure of computing B only makes sense in the Euclidean case. We
can use Theorems 17 and 18 to construct metric scaling procedures for spherical
and hyperbolic geometry, but they use another definition of B. If we cannot
transform to scalar products, as in the Minkovski case, then these procedures
cannot be used. Another disadvantage in the Euclidean and hyperbolic case is
that the elements of B are not independent if the elements of E are independent,
which makes unweighted least squares look bad. Moreover, the method loses
much of its computational appeal if there are missing data, while this does not
bother other methods.
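For concreteness, here is a minimal sketch of the classical procedure that Theorem 20 justifies (our own illustration in Python/numpy, not Torgerson's original implementation): double-center the squared dissimilarities and keep the p largest nonnegative eigenvalues.

```python
import numpy as np

def classical_scaling(delta, p=2):
    """Classical (Torgerson) scaling: B = -0.5 * J * delta**2 * J,
    then x_s = sqrt(max(0, lambda_s)) * k_s as in Theorem 20."""
    n = delta.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (delta ** 2) @ J
    lam, K = np.linalg.eigh(B)        # eigenvalues in ascending order
    lam, K = lam[::-1], K[:, ::-1]    # reorder: largest first
    return K[:, :p] * np.sqrt(np.maximum(lam[:p], 0.0))
```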
Carroll, Green, and Carmone (1976) have already pointed out that the simple
scaling procedure can be generalized if we have linear restrictions of the form
X = YT', with Y'Y = I and Y known. The loss function σ₁(X) becomes σ₁(S) =
tr(B − YSY')², with S = T'T. If we define S̃ = Y'BY, then we can write σ₁(S) =
tr(B − YS̃Y')² + tr(S̃ − S)², which we can minimize by minimizing tr(S̃ − T'T)²
over T, using the least squares result of Keller (1962) again, as we did in Theorem
20. The asymmetric 'slide vector' model can also be fitted by using the same
matrix methods.
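A sketch of this restricted fit, under the stated assumption Y'Y = I (again our own illustration; the function name is hypothetical):

```python
import numpy as np

def restricted_classical_scaling(B, Y, p=2):
    """Fit X = Y T' by factoring the small projected matrix Y'BY."""
    S = Y.T @ B @ Y                         # S-tilde = Y'BY
    lam, K = np.linalg.eigh(S)
    lam, K = lam[::-1], K[:, ::-1]          # largest eigenvalues first
    Tt = K[:, :p] * np.sqrt(np.maximum(lam[:p], 0.0))  # T' (q x p)
    return Y @ Tt                           # configuration X = Y T'
```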
σ₁(X, α) = tr{B(α) − XX'}²

over all X in R^{np}, and over α. Define the minimum of σ₁(X, α) over X for fixed α
as σ̃(α). Then, in the same way as in Theorem 20,

σ̃(α) = Σₛ₌₁ᵖ λ̄ₛ(α)² + Σₛ₌ₚ₊₁ⁿ λₛ(α)².
This is a function of the single real parameter α, which can be minimized
efficiently in a number of ways. It is clear that the approach generalizes without
further complications to any problem in which we have a one-parameter family of
matrices B(α). By using the theory of lambda-matrices [81] we can in all cases
construct efficient algorithms. This fact was used by Critchley (1978) in his
over X and C₁, …, Cₘ. In the IDIOSCAL case there are no further restrictions on
over both X and Y. Because the Bₖ are symmetric and the Cₖ are symmetric too,
we expect X and Y to converge to the same value. Splitting can be used to
generalize ALS from multivariate multilinear problems to multivariate polynomial
problems in many cases, but the precise conditions under which splitting
works have not been established. In the three-way case it is easy to show that
using splitting is closely related to using the Gauss-Newton method. In the
Gauss-Newton method we minimize the approximation over Δ, and then set the
new X equal to the old X plus Δ. This gives the same iterates as ALS applied to
σ(X, Y; Cₖ). Of course convergence of the ALS procedures is no problem; they
converge more or less by definition. Often convergence is painfully slow, however.
Ramsay (1975) has shown that we can usually accelerate convergence considerably
by choosing a suitable relaxation factor in the ALS iterations. Another
disadvantage of ALS in this context is that the procedure may converge to a Cₖ
which is not psd, or to a Wₖ which is not nonnegative.
We can compute very good initial estimates for our iterative procedures by
using Theorems 13 and 14. The proof of Theorem 13 gives us IDIOSCAL
estimates of X and Cₖ; Theorem 14 says that we can find INDSCAL estimates if
we diagonalize these Cₖ. A number of these two-step procedures have been
proposed, the first one in [112], the most straightforward one in [34]. This last
paper also has the necessary references. Another two-step procedure proposed by
De Leeuw is used to construct the initial configuration in ALSCAL. It is
described in [139]. It is clear that we can construct nonmetric versions of
three-way MDS by combining the results in this section with those from the
previous section. Carroll and Chang (unpublished) have experimented with nonmetric
INDSCAL, called NINDSCAL, while Richard Sands (in press) has a
nonmetric version of IDIOSCAL which is comparable to ALSCAL.
3.2.1. Two-way MDS
For metric two-way MDS we can also consider the loss function

σ₂(X) = Σᵢ₌₁ⁿ Σⱼ₌₁ⁿ (hᵢⱼ − dᵢⱼ²(X))²,

with hᵢⱼ = δ²(i, j), as before, and dᵢⱼ²(X) = (xᵢ − xⱼ)'(xᵢ − xⱼ). This loss function
was proposed by Obenchain (1971) and Hayashi (1974), but efficient algorithms
to minimize functions like this in several different MDS situations were proposed
by Takane, Young, and De Leeuw (1977). The current version of the ALSCAL
algorithm (see [140]) can handle all kinds of metric/nonmetric two/three way
data structures, using the basic two-step alternating least squares methodology of
Young, De Leeuw, and Takane (in press). The interesting problem is how we
must minimize σ₂(X) over X. In ALSCAL a single coordinate is changed at a
time, the other coordinates are fixed at their current values, and we cycle through the
coordinates. Of course σ₂ is a quartic in each coordinate; we minimize over the
coordinate by solving a cubic. The procedure may not be very appealing at first
sight but is surprisingly efficient.
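To make the cubic substep concrete, the following sketch updates a single coordinate; it is our own reconstruction of the idea, not the published ALSCAL code, and assumes numpy and a matrix H of squared dissimilarities hᵢⱼ.

```python
import numpy as np

def update_coordinate(X, H, a, s):
    """Minimize SSTRESS over the single coordinate X[a, s]; the loss is
    a quartic in that coordinate, so its derivative is a cubic."""
    n = X.shape[0]
    js = [j for j in range(n) if j != a]
    b = X[js, s]                                  # fixed coordinates, dim s
    c = ((X[a] - X[js]) ** 2).sum(axis=1) - (X[a, s] - b) ** 2
    r = H[a, js] - c                              # targets for (t - b)**2
    # stationarity: sum_j [(t - b_j)**3 - r_j (t - b_j)] = 0, a cubic in t
    coef = [len(js),
            -3.0 * b.sum(),
            3.0 * (b ** 2).sum() - r.sum(),
            -(b ** 3).sum() + (r * b).sum()]
    roots = np.roots(coef)
    real = roots[np.abs(roots.imag) < 1e-8].real  # a real root always exists
    loss = lambda t: ((r - (t - b) ** 2) ** 2).sum()
    X[a, s] = min(real, key=loss)
    return X
```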
The loss function σ₂(X) is called SSTRESS by Takane et al.; the loss function
σ₁(X) is called STRAIN by Carroll. There is an interesting relationship between
STRAIN and SSTRESS which explains why the initial configuration routines for
ALSCAL work as well as they do.
bᵢⱼ(h) = −½(hᵢⱼ − hᵢ. − h.ⱼ + h..),

min{σ₂(X) | X ∈ R^{np}}.
THEOREM 22. Suppose that a matrix with all off-diagonal disparities equal is
admissible. Suppose in addition that n and p are such that an n × p matrix Y exists
with Σᵢyᵢₛ = 0 for each s, Σᵢyᵢₛ² = n/p for each s, Σᵢyᵢₛyᵢₜ = 0 for all s ≠ t, and Σₛyᵢₛ² = 1
for all i. Then

min{σ₂(X, d̂) | X, d̂} ≤ 1 − (p/(p + 1))(n/(n − 1)).
3.2.2. Three-way MDS
The only three-way MDS program based on squared distance loss is ALSCAL.
We do not discuss the algorithm here because the principles are obvious from the
previous section. There are two substeps: the first one is the optimal scaling step,
which finds new disparities for a given configuration; the second one changes the
configuration by the cubic equation algorithm of ALSCAL, and the weights for
the individuals by linear regression techniques. There is an interesting modification
which fits naturally into the ALSCAL framework, although it has not been
implemented in the current versions. We have seen that the inner product
algorithms can give estimates of Cₖ = TₖTₖ' that are not psd. The same thing is
true for ALSCAL, but if we minimize the loss over Tₖ instead of over Cₖ we do
not have this problem, and the minimization can be carried out 'one variable at a
time' by using the cubic equation solver again. This has the additional advantage
that we can easily incorporate rank restrictions on the Cₖ. If we require that the
Tₖ are p × 1, for example, we can fit the 'personal compensatory model' mentioned
by Coombs (1964, p. 199), and by Roskam (1968, Chapter IV).
An important question in constructing MDS loss functions is how they should
be normalized. This is discussed in general terms in [78], and for ALSCAL in
[125] and [140]. MacCallum (1977) studies the effect of different normalizations in
a three-way situation empirically. Other Monte Carlo studies have been carried
out by MacCallum and Cornelius (1977), who study metric recovery by ALSCAL,
and by MacCallum (unpublished), who compares ALSCAL and INDSCAL recovery
(in terms of mean squared error). It seems that metric INDSCAL often
gives better results than nonmetric ALSCAL, even for nonmetric data. This may
be due to the difference in metric/nonmetric, but also to the difference between
scalar product and squared distance loss. Takane, Young, and De Leeuw (1977)
compare CPU-times of ALSCAL and INDSCAL. The fact that ALSCAL is much
faster seems to be due almost completely to the better initial configuration, cf.
[34].
σ₃(X) = Σᵢ₌₁ⁿ Σⱼ₌₁ⁿ wᵢⱼ(δᵢⱼ − dᵢⱼ(X))²,

where we have written δᵢⱼ for δ(i, j) and where we have introduced nonnegative
weights wᵢⱼ. For wᵢⱼ ≡ 1 this loss function is STRESS, introduced by Kruskal [73,
74]. The Guttman-Lingoes-Roskam programs are also based on loss functions of
this form. Using σ₃ seems somewhat more direct than using σ₂ or σ₁; moreover,
both σ₂ and σ₁ do not make much sense if the distances are non-Euclidean. A
possible disadvantage of σ₃ is that it is somewhat less smooth, and that computer
programs that minimize σ₃ usually converge more slowly than programs minimizing
σ₁ or σ₂. Moreover, the classical Young-Householder-Torgerson starting
point works better for σ₂ and σ₁, which has possibly some consequences for the
frequency of local minima. There are as yet, however, no detailed comparisons of
the three types of loss functions.
We have introduced the weights wᵢⱼ for various reasons. If there is information
about the variability of the δᵢⱼ, we usually prefer weighted least squares for
statistical reasons; if there is a large number of independent identically distributed
replications, then weighted least squares gives efficient estimates and the minimum
of σ₃ has a chi-square distribution. Another reason for using weights is that
we can compare STRESS and SSTRESS more easily. It is obvious that if
δᵢⱼ ≈ dᵢⱼ(X) and if we choose wᵢⱼ = 4δᵢⱼ², then σ₃(X) ≈ σ₂(X). Thus, if a good fit is
possible, we can imitate the behaviour of σ₂ by using σ₃ with suitable weights.
Ramsay (1977, 1978) has proposed the loss function

σ₄(X) = Σᵢ₌₁ⁿ Σⱼ₌₁ⁿ (ln δᵢⱼ − ln dᵢⱼ(X))²,

which makes sense for log-normally distributed dissimilarities. Again, if δᵢⱼ ≈
dᵢⱼ(X) and if we choose wᵢⱼ = 1/δᵢⱼ², we find σ₃(X) ≈ σ₄(X).
The algorithms for minimizing σ₃(X) proposed by Kruskal [73, 74], Roskam
(1968), Guttman (1968), and Lingoes and Roskam (1973) are gradient methods. They
are consequently of the form

X^{(τ+1)} = X^{(τ)} − α_τ∇σ₃(X^{(τ)}),

where the index τ is the iteration number, ∇σ₃ is the gradient, and α_τ > 0 is the
step-size. Kruskal (1977) discusses in detail how he chooses his step-sizes in
MDSCAL and KYST; the same approach with some minor modifications is
adopted in the MINISSA programs of Lingoes and Roskam (1973). KYST [80]
also fits the other power metrics, but for powers other than two there are both
computational and interpretational difficulties. The city-block (power = 1) and
sup-metric (power = ∞) are easy to interpret but very difficult to fit because of the
serious discontinuities of the gradient and the multitude of local minima. The
intermediate cases are easier to fit but difficult to interpret. We prefer a somewhat
different approach to step-size. Consider the Euclidean case first. The cross
product term

ρ(X) = Σᵢ₌₁ⁿ Σⱼ₌₁ⁿ wᵢⱼδᵢⱼdᵢⱼ(X)

plays a central role in this approach.
De Leeuw and Heiser (1980) proved the following global convergence theorem.

… converges to zero.

In [30] there is a similar, but less general, result for general Minkovski metrics.
In that case the computations in an iteration are also considerably more complicated.
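The flavour of the convergence result is easiest to see in the unweighted Euclidean case, where every iteration is a Guttman transform. The toy implementation below is our own (numpy; it is not the SMACOF or MINISSA programs themselves, and it ignores weights and restrictions):

```python
import numpy as np

def smacof(delta, p=2, iters=100, seed=0):
    """Minimize STRESS by repeated Guttman transforms (w_ij = 1).
    delta: symmetric numpy array of dissimilarities (floats)."""
    n = delta.shape[0]
    X = np.random.default_rng(seed).standard_normal((n, p))
    for _ in range(iters):
        diff = X[:, None, :] - X[None, :, :]
        D = np.sqrt((diff ** 2).sum(axis=-1))      # current distances
        ratio = np.divide(delta, D, out=np.zeros_like(delta, dtype=float),
                          where=D > 0)
        B = -ratio
        B[np.diag_indices(n)] = ratio.sum(axis=1)  # row sums on the diagonal
        X = B @ X / n                              # Guttman transform
    return X
```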
X^{(τ+1)} = P_Ω(Γ(X^{(τ)})),

where Γ is the Guttman transform and P_Ω is the metric projection on Ω in the
metric defined by V. Three-way MDS methods are special cases of this general
model, because we can use an nm × nm supermatrix of weights W, with only the
m diagonal submatrices of order n nonzero. The configurations can be collected
in an nm × p supermatrix X whose m submatrices of order n × p must satisfy the
restrictions Xₖ = YTₖ.
References
[4] Beals, R. and Krantz, D. H. (1967). Metrics and geodesics induced by order relations. Math. Z.
101, 285-298.
[5] Beals, R., Krantz, D. H. and Tversky, A. (1968). Foundations of multidimensional scaling.
Psychol. Rev. 75, 127-142.
[6] Bennett, J. F. and Hays, W. L. (1960). Multidimensional unfolding, determining the dimen-
sionality of ranked preference data. Psychometrika 25, 27-43.
[7] Bentler, P. M. and Weeks, D. G. (1978). Restricted multidimensional scaling. J. Math. Psychol.
17, 138-151.
[8] Bloxom, B. (1978). Constrained multidimensional scaling in N-spaces. Psychometrika 43,
397-408.
[9] Blumenthal, L. M. (1938). Distance geometries. Univ. Missouri Studies 13 (2).
[10] Blumenthal, L. M. (1953). Theory and Applications of Distance Geometry. Clarendon Press,
Oxford.
[11] Blumenthal, L. M. (1975). Four point properties and norm postulates. In: L. M. Kelly, ed., The
Geometry of Metric and Linear Spaces. Lecture Notes in Mathematics 490. Springer, Berlin.
[12] Buneman, P. (1971). The recovery of trees from measures of dissimilarity. In: R. F. Hodson,
D. G. Kendall and P. Tautu, eds., Mathematics in the Archeological and Historical Sciences.
University of Edinburgh Press, Edinburgh.
[13] Busemann, H. (1955). The Geometry of Geodesics. Academic Press, New York.
[14] Busemann, H. (1970). Recent Synthetic Differential Geometry. Springer, Berlin.
[15] Carroll, J. D. (1976). Spatial, non-spatial and hybrid models for scaling. Psychometrika 41,
439-463.
[16] Carroll, J. D. and Arabie, P. (1980). Multidimensional scaling. Ann. Rev. Psychol. 31, 607-649.
[17] Carroll, J. D. and Chang, J. J. (1970). Analysis of individual differences in multidimensional
scaling via an N-way generalization of 'Eckart-Young' decomposition. Psychometrika 35,
283-319.
[18] Carroll, J. D., Green, P. E. and Carmone, F. J. (1976). CANDELINC: a new method for
multidimensional analysis with constrained solutions. Paper presented at International Congress
of Psychology, Paris.
[19] Carroll, J. D. and Pruzansky, S. (1977). MULTILINC: multiway CANDELINC. Paper
presented at American Psychological Association Meeting, San Francisco.
[20] Carroll, J. D. and Wish, M. (1974). Models and methods for three-way multidimensional
scaling. In: Contemporary Developments in Mathematical Psychology. Freeman, San Francisco.
[21] Cayley, A. (1841). On a theorem in the geometry of position. Cambridge Math. J. 2, 267-271.
[22] Cliff, N. (1973). Scaling. Ann. Rev. Psychol. 24, 473-506.
[23] Constantine, A. G. and Gower, J. C. (1978). Graphical representation of asymmetric matrices.
Appl. Statist. 27, 297-304.
[24] Coombs, C. H. (1964). A Theory of Data. Wiley, New York.
[25] Cormack, R. M. (1971). A review of classification. J. Roy. Statist. Soc. Ser. A 134, 321-367.
[26] Critchley, F. (1978). Multidimensional scaling: a critique and an alternative. In: L. C. A.
Corsten and J. Hermans, eds., COMPSTAT 1978. Physica Verlag, Vienna.
[27] Cross, D. V. (1965). Metric properties of multidimensional stimulus generalization. In: J. R.
Barra et al., eds., Stimulus Generalization. Stanford University Press, Stanford.
[28] Cunningham, J. P. (1978). Free trees and bidirectional trees as representations of psychological
distance. J. Math. Psychol. 17, 165-188.
[29] De Leeuw, J. (1970). The Euclidean distance model. Tech. Rept. RN 02-70. Department of
Datatheory, University of Leiden.
[30] De Leeuw, J. (1977). Applications of convex analysis to multidimensional scaling. In: J. C.
Lingoes, ed., Progress in Statistics. North-Holland, Amsterdam.
[31] De Leeuw, J. and Heiser, W. (1977). Convergence of correction matrix algorithms for
multidimensional scaling. In: Geometric Representations of Relational Data. Mathesis Press,
Ann Arbor.
[32] De Leeuw, J. and Heiser, W. (1979). Maximum likelihood multidimensional scaling of
interaction data. Department of Datatheory, University of Leiden.
[33] De Leeuw, J. and Heiser, W. (1980). Multidimensional scaling with restrictions on the
configuration. In: P. R. Krishnaiah, ed., Multivariate Analysis, Vol. V. North-Holland,
Amsterdam.
[34] De Leeuw, J. and Pruzansky, S. (1978). A new computational method to fit the weighted
Euclidean model. Psychometrika 43, 479-490.
[35] Eisler, H. (1973). The algebraic and statistical tractability of the city block metric. Brit. J.
Math. Statist. Psychol. 26, 212-218.
[36] Fisher, R. A. (1922). The systematic location of genes by means of cross-over ratios. American
Naturalist 56, 406-411.
[37] Fréchet, M. (1935). Sur la définition axiomatique d'une classe d'espaces distanciés vectoriellement
applicables sur l'espace de Hilbert. Ann. Math. 36, 705-718.
[38] Gold, E. M. (1973). Metric unfolding: data requirements for unique solution and clarification
of Schönemann's algorithm. Psychometrika 38, 555-569.
[39] Goldmeier, E. (1937). Über Ähnlichkeit bei gesehenen Figuren. Psychol. Forschung 21, 146-208.
[40] Gower, J. C. (1977). The analysis of asymmetry and orthogonality. In: J. R. Barra et al., eds.,
Progress in Statistics. North-Holland, Amsterdam.
[41] Guttman, L. (1941). The quantification of a class of attributes: a theory and method of scale
construction. In: P. Horst, ed., The Prediction of Personal Adjustment. Social Science Research
Council, New York.
[42] Guttman, L. (1944). A basis for scaling qualitative data. Amer. Sociol. Rev. 9, 139-150.
[43] Guttman, L. (1946). An approach for quantifying paired comparisons and rank order. Ann.
Math. Statist. 17, 144-163.
[44] Guttman, L. (1950). The principal components of scale analysis. In: S. A. Stouffer, ed.,
Measurement and Prediction. Princeton University Press, Princeton.
[45] Guttman, L. (1957). Introduction to facet design and analysis. Paper presented at Fifteenth Int.
Congress Psychol., Brussels.
[46] Guttman, L. (1959). Metricizing rank-ordered or unordered data for a linear factor analysis.
Sankhya 21, 257-268.
[47] Guttman, L. (1968). A general nonmetric technique for finding the smallest coordinate space
for a configuration of points. Psychometrika 33, 469-506.
[48] Guttman, L. (1971). Measurement as structural theory. Psychometrika 36, 329-347.
[49] Haberman, S. (1974). The Analysis of Frequency Data. University of Chicago Press, Chicago.
[50] Harshman, R. A. (1970). Foundations of the PARAFAC procedure: models and conditions for
an explanatory multi-modal factor analysis. Department of Phonetics, UCLA.
[51] Harshman, R. A. (1972). PARAFAC2: mathematical and technical notes. Working papers in
phonetics No. 22, UCLA.
[52] Hartigan, J. A. (1967). Representation of similarity matrices by trees. J. Amer. Statist. Assoc.
62, 1140-1158.
[53] Hayashi, C. (1974). Minimum dimension analysis MDA: one of the methods of multidimen-
sional quantification. Behaviormetrika 1, 1-24.
[54] Hays, W. L. and Bennett, J. F. (1961). Multidimensional unfolding: determining configuration
from complete rank order preference data. Psychometrika 26, 221-238.
[55] Heiser, W. and De Leeuw, J. (1977). How to use SMACOF-I. Department of Datatheory,
University of Leiden.
[56] Heiser, W. and De Leeuw, J. (1979). Metric multidimensional unfolding. MDN, Bulletin VVS
4, 26-50.
[57] Heiser, W. and De Leeuw, J. (1979). How to use SMACOF-III. Department of Datatheory,
University of Leiden.
[58] Helm, C. E. (1959). A multidimensional ratio scaling analysis of color relations. E.T.S.,
Princeton.
[59] Holman, E. W. (1972). The relation between hierarchical and Euclidean models for psychologi-
cal distances. Psychometrika 37, 417-423.
[60] Indow, T. (1975). An application of MDS to study binocular visual space. In: US-Japan
Seminar on MDS, La Jolla.
[61] Ireland, C. T., Ku, H. H. and Kullback, S. (1969). Symmetry and marginal homogeneity in an
r × r contingency table. J. Amer. Statist. Assoc. 64, 1323-1341.
[62] Johnson, S. C. (1967). Hierarchical clustering schemes. Psychometrika 32, 241-254.
[63] Jöreskog, K. G. (1978). Structural analysis of covariance and correlation matrices. Psycho-
metrika 43, 443-477.
[64] Keller, J. B. (1962). Factorization of matrices by least squares. Biometrika 49, 239-242.
[65] Kelly, J. B. (1968). Products of zero-one matrices. Can. J. Math. 20, 298-329.
[66] Kelly, J. B. (1975). Hypermetric spaces. In: L. M. Kelly, ed., The Geometry of Metric and Linear
Spaces. Lecture Notes in Mathematics 490. Springer, Berlin.
[67] Klingberg, F. L. (1941). Studies in measurement of the relations between sovereign states.
Psychometrika 6, 335-352.
[68] Krantz, D. H. (1967). Rational distance functions for multidimensional scaling. J. Math.
Psychol. 4, 226-245.
[69] Krantz, D. H. (1968). A survey of measurement theory. In: G. B. Dantzig and A. F. Veinott,
eds., Mathematics of the Decision Sciences. American Mathematical Society, Providence.
[70] Krantz, D. H. and Tversky, A. (1975). Similarity of rectangles: an analysis of subjective
dimensions. J. Math. Psychol. 12, 4-34.
[71] Kroonenberg, P. M. and De Leeuw, J. (1977). TUCKALS2: a principal component analysis of
three mode data. Tech. Rept. RN 01-77. Department of Datatheory, University of Leiden.
[72] Krumhansl, C. L. (1978). Concerning the applicability of geometric models to similarity data:
the interrelationship between similarity and spatial density. Psychol. Rev. 85, 445-463.
[73] Kruskal, J. B. (1964). Multidimensional scaling by optimizing goodness of fit to a nonmetric
hypothesis. Psychometrika 29, 1-27.
[74] Kruskal, J. B. (1964). Nonmetric multidimensional scaling: a numerical method. Psychometrika
29, 28-42.
[75] Kruskal, J. B. (1976). More factors than subjects, tests, and treatments: an indeterminacy
theorem for canonical decomposition and individual differences scaling. Psychometrika 41,
281-293.
[76] Kruskal, J. B. (1977). Trilinear decomposition of three-way arrays: rank and uniqueness in
arithmetic complexity and in statistical models. Linear Algebra Appl. 18, 95-138.
[77] Kruskal, J. B. (1977). Multidimensional scaling and other methods for discovering structure.
In: Statistical Methods for Digital Computers. Wiley, New York.
[78] Kruskal, J. B. and Carroll, J. D. (1969). Geometric models and badness of fit functions. In:
P. R. Krishnaiah, ed., Multivariate Analysis', Vol. H. Academic Press, New York.
[79] Kruskal, J. B. and Wish, M. (1978). Multidimensional Scaling. Sage Publications, Beverly Hills.
[80] Kruskal, J. B., Young, F. W. and Seery, J. B. (1977). How to use KYST-2, a very flexible
program to do multidimensional scaling and unfolding. Bell Laboratories, Murray Hill.
[81] Lancaster, P. (1977). A review of numerical methods for eigenvalue problems nonlinear in the
parameter. In: Numerik und Anwendungen von Eigenwertaufgaben und Verzweigungsproblemen.
Internat. Ser. Numer. Math. 38. Birkhauser, Basel.
[82] Landahl, H. D. (1945). Neural mechanisms for the concepts of difference and similarity. Bull.
Math. Biophysics 7, 83-88.
[83] Lew, J. S. (1975). Preorder relations and pseudoconvex metrics. Amer. J. Math. 97, 344-363.
[84] Lew, J. S. (1978). Some counterexamples in multidimensional scaling. J. Math. Psychol. 17,
247-254.
[85] Lindman, H. and Caelli, T. (1978). Constant curvature Riemannian scaling. J. Math. Psychol.
17, 89-109.
[86] Lingoes, J. C. (1971). Some boundary conditions for a monotone analysis of symmetric
matrices. Psychometrika 36, 195-203.
[87] Lingoes, J. C. and Roskam, E. E. (1973). A mathematical and empirical analysis of two
multidimensional scaling algorithms. Psychometrika 38, monograph supplement.
[88] Luce, R. D. (1961). A choice theory analysis of similarity judgements. Psychometrika 26,
151-163.
[89] Luce, R. D. (1963). Detection and recognition. In: R. D. Luce, R. R. Bush and E. Galanter,
eds., Handbook of Mathematical Psychology, Vol. I. Wiley, New York.
[90] Luneburg, R. K. (1947). Mathematical Analysis of Binocular Vision. Princeton University Press,
Princeton.
[91] MacCallum, R. C. (1977). Effects of conditionality on INDSCAL and ALSCAL weights.
Psychometrika 42, 297-305.
[92] MacCallum, R. C. and Cornelius III, E. T. (1977). A Monte Carlo investigation of recovery of
structure by ALSCAL. Psychometrika 42, 401-428.
[93] Menger, K. (1928). Untersuchungen über allgemeine Metrik. Math. Ann. 100, 75-163.
[94] Messick, S. J. (1956). Some recent theoretical developments in multidimensional scaling. Ed.
Psychol. Meas. 16, 82-100.
[95] Messick, S. J. and Abelson, R. P. (1956). The additive constant problem in multidimensional
scaling. Psychometrika 21, 1-15.
[96] Nakatani, L. H. (1972). Confusion-choice model for multidimensional psychophysics. J. Math.
Psychol. 9, 104-127.
[97] Obenchain, R. L. (1971). Squared distance scaling as an alternative to principal components
analysis. Bell Laboratories, Holmdell.
[98] Pieszko, H. (1975). Multidimensional scaling in Riemannian space. J. Math. Psychol. 12,
449-477.
[99] Ramsay, J. O. (1975). Solving implicit equations in psychometric data analysis. Psychometrika
40, 337-360.
[100] Ramsay, J. O. (1977). Maximum likelihood estimation in multidimensional scaling. Psycho-
metrika 42, 241-266.
[101] Ramsay, J. O. (1978). Confidence regions for multidimensional scaling analysis. Psychometrika
43, 145-160.
[102] Restle, F. (1959). A metric and an ordering on sets. Psychometrika 24, 207-220.
[103] Richardson, M. W. (1938). Multidimensional psychophysics. Psychol. Bull. 35, 659-660.
[104] Roskam, E. E. (1968). Metric analysis of ordinal data in psychology. VAM, Voorschoten, The
Netherlands.
[105] Rutishauser, H. (1970). Simultaneous iteration method for symmetric matrices. Numer. Math.
16, 205-223.
[106] Saito, T. (1978). An alternative procedure to the additive constant problem in metric multidi-
mensional scaling. Psychometrika 43, 193-201.
[107] Sattath, S. and Tversky, A. (1977). Additive similarity trees. Psychometrika 42, 319-345.
[108] Schoenberg, I. J. (1935). Remarks to Maurice Fréchet's article "Sur la définition axiomatique
d'une classe d'espaces distanciés vectoriellement applicables sur l'espace de Hilbert." Ann.
Math. 36, 724-732.
[109] Schoenberg, I. J. (1937). On certain metric spaces arising from Euclidean space by a change of
metric and their imbedding in Hilbert space. Ann. Math. 38, 787-793.
[110] Schoenberg, I. J. (1938). Metric spaces and positive definite functions. Trans. Amer. Math. Soc.
44, 522-536.
[111] Schönemann, P. H. (1970). On metric multidimensional unfolding. Psychometrika 35, 349-366.
[112] Schönemann, P. H. (1972). An algebraic solution for a class of subjective metrics models.
Psychometrika 37, 441-451.
[113] Schönemann, P. H. (1977). Similarity of rectangles. J. Math. Psychol. 16, 161-165.
[114] Seidel, J. J. (1955). Angles and distances in n-dimensional Euclidean and non-Euclidean
geometry. Parts I, II, III. Indag. Math. 17, 329-335, 336-340, 535-541.
[115] Seidel, J. J. (1975). Metric problems in elliptic geometry. In: L. M. Kelly, ed., The Geometry of
Metric and Linear Spaces. Lecture Notes in Mathematics 490. Springer, Berlin.
[116] Shepard, R. N. (1957). Stimulus and response generalization: a stochastic model relating
generalization to distance in psychological space. Psychometrika 22, 325-345.
[117] Shepard, R. N. (1958). Stimulus and response generalization: tests of a model relating
generalization to distance in psychological space. J. Exp. Psychol. 55, 509-523.
[118] Shepard, R. N. (1958). Stimulus and response generalization: deduction of the generalization
gradient from a trace model. Psychol. Rev. 65, 242-256.
[119] Shepard, R. N. (1962). The analysis of proximities: multidimensional scaling with an unknown
distance function. Parts I, II. Psychometrika 27, 125-140, 219-246.
[120] Shepard, R. N. (1966). Metric structures in ordinal data. J. Math. Psychol. 3, 287-315.
[121] Shepard, R. N. (1974). Representation of structure in similarity data: problems and prospects.
Psychometrika 39, 373-421.
[122] Shepard, R. N. and Arabie, P. (1979). Additive clustering: representation of similarities as
combinations of discrete overlapping properties. Psychol. Rev. 86, 87-123.
[123] Sibson, R. (1972). Order-invariant methods for data analysis. J. Roy. Statist. Soc. Ser. B 34,
311-349.
[124] Stumpf, C. (1880). Tonpsychologie, Vols. I and II. Teubner, Leipzig.
[125] Takane, Y., Young, F. W. and De Leeuw, J. (1977). Nonmetric individual differences in
multidimensional scaling: an alternating least squares method with optimal scaling features.
Psychometrika 42, 7-67.
[126] Taussky, O. (1949). A recurring theorem on determinants. Amer. Math. Monthly 56, 672-676.
[127] Torgerson, W. (1952). Multidimensional scaling: I. Theory and methods. Psychometrika 17,
401-419.
[128] Torgerson, W. (1958). Theory and Methods of Scaling. Wiley, New York.
[129] Torgerson, W. (1965). Multidimensional scaling of similarity. Psychometrika 30, 379-393.
[130] Townsend, J. T. (1978). A clarification of some current multiplicative confusion models.
J. Math. Psychol. 18, 25-38.
[131] Tversky, A. (1966). The dimensional representation and the metric structure of similarity data.
Michigan Math. Psychol. Program.
[132] Tversky, A. (1977). Features of similarity. Psychol. Rev. 84, 327-352.
[133] Tversky, A. and Krantz, D. H. (1969). Similarity of schematic faces: a test of interdimensional
additivity. Perception and Psyehophysics 5, 124-128.
[134] Tversky, A. and Krantz, D. H. (1970). The dimensional representation and the metric structure
of similarity data. J. Math. Psychol. 7, 572-596.
[135] Valentine, J. E. (1969). Hyperbolic spaces and quadratic forms. Proc. Amer. Math. Soc. 37,
607-610.
[136] Wender, K. (1971). A test of independence of dimensions in multidimensional scaling.
Perception and Psychophysics 10, 30-32.
[137] Young, F. W. (1972). A model for polynomial conjoint analysis algorithms. In: R. N. Shepard,
A. K. Romney and S. B. Nerlove, eds., Multidimensional Scaling: Theory and Applications in the
Social Sciences, Vol. I. Seminar Press, New York.
[138] Young, F. W., De Leeuw, J. and Takane, Y. (1980). Quantifying qualitative data. In:
E. Lantermann and H. Feger, eds., Similarity and Choice. Huber, Bern.
[139] Young, F. W., Takane, Y. and Lewyckyj, R. (1978). Three notes on ALSCAL. Psychometrika
43, 433-435.
[140] Young, F. W. and Lewyckyj, R. (1979). ALSCAL-4 User's guide. Data analysis and theory
associates, Carrboro, NC.
[141] Young, G. and Householder, A. S. (1938). Discussion of a set of points in terms of their mutual
distances. Psychometrika 3, 19-22.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
© North-Holland Publishing Company (1982) 317-345

14

Multidimensional Scaling and its Applications

Myron Wish and J. Douglas Carroll
collection procedures relevant for MDS are provided in Shepard (1972), Wish
(1972), and Kruskal and Wish (1978).
Proximity is taken to be a primitive relation on pairs of stimuli, which is
assumed to be orderable. Thus it is defined on at least what Stevens (1951) has
called an ordinal scale (the usual assumption in the so-called nonmetric MDS
methods). In some cases it may be assumed to be measurable on an interval scale
(the standard assumption for metric MDS) or even on a ratio scale; see Ekman
(1963).
Since proximities can be regarded as distance-like numbers, they should
roughly satisfy the general conditions for a metric space (see Carroll and Wish,
1974a, b and de Leeuw, this volume); that is, the distance axioms of positivity
(dⱼₖ ≥ dⱼⱼ = 0), symmetry (dⱼₖ = dₖⱼ), and the triangle inequality (dⱼₗ ≤ dⱼₖ + dₖₗ)
should be satisfied in an ordinal sense. If the data are dissimilarities, δⱼₖ, the first
two conditions become δⱼₖ ≥ δⱼⱼ = δₖₖ and δⱼₖ = δₖⱼ for all j, k.
In practice, approximate satisfaction of positivity and symmetry is about the
best that one can hope for. Furthermore, the third condition is untestable if the
proximities are only on an ordinal or interval scale, since by adding a suitably
large constant to each data value the proximities can be made to satisfy the
triangle inequality. One could, however, establish a rough criterion based on how
large that constant must be, relative to the variance of the proximities, to
guarantee satisfaction of something analogous to the triangle inequality. When,
however, the violations are severe, it is essential to transform the data in some
way to make it more appropriate for MDS (e.g., symmetrizing the matrix,
normalizing rows and columns, computing some measure of profile distance; see
Kruskal and Wish, 1978), or to use a non-spatial type of data analysis procedure
(see Cunningham, 1978; Carroll, 1976; Tversky, 1977; Sattath and Tversky, 1977;
Shepard and Arabie, 1979; and Arabie and Carroll, 1980).
the complete method of triads or the method of successive intervals) and put
through one of the Thurstone-type unidimensional scaling procedures (see
Torgerson, 1958) to produce interval scale measures called comparative distances.
Since ratio scale distances were needed, the problem of estimating an additive
constant arose. The simplest estimation procedure was to add to each data value
the smallest constant guaranteeing satisfaction of the triangle inequality. Once
ratio scale distances were obtained they were converted to scalar products around
an origin placed at the centroid of the configuration. The conversion from
Euclidean distances to scalar products was effected by double centering the
matrix whose general entry is −½d²ⱼₖ. Double centering is equivalent to taking out
both row and column effects in analysis of variance, leaving 'interaction numbers'
which in this case are estimated scalar products. The matrix of estimated scalar
products S = XX' can be viewed as analogous to a covariance matrix, and
methods closely related to factor analysis and principal components analysis can
be used to solve for the dimensionality and configuration (based on the eigenval-
ues and eigenvectors of S).
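The additive-constant step mentioned above is also easy to make explicit. A brute-force sketch of our own (Python/numpy; triangle_constant is a hypothetical name) finds the smallest constant whose addition to every off-diagonal value guarantees the triangle inequality:

```python
import numpy as np
from itertools import permutations

def triangle_constant(delta):
    """Smallest c with (delta[j,l]+c) <= (delta[j,k]+c) + (delta[k,l]+c)
    for all triples, i.e. the largest triangle-inequality violation."""
    c = 0.0
    for j, k, l in permutations(range(delta.shape[0]), 3):
        c = max(c, delta[j, l] - delta[j, k] - delta[k, l])
    return c
```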
stress = {Σⱼ<ₖ(dⱼₖ − d̂ⱼₖ)² / Σⱼ<ₖd²ⱼₖ}^{1/2},    (1)

where the dⱼₖ's are distances in the underlying metric space, and the d̂ⱼₖ's are
related to proximities by a monotonic function (nonincreasing if the proximities
are similarities, nondecreasing if the data are dissimilarities). This formula is
referred to as stress form 1 (or SFORM1). Another formula, stress form 2, differs
only in the normalization factor; that is, the denominator involves a summation
of (dⱼₖ − d̄)² rather than d²ⱼₖ, where d̄ is the mean of the dⱼₖ's. Least squares
monotone regression (see van Eeden, 1957; Miles, 1959; Barton and Mallows,
1961; and Bartholomew, 1959) was incorporated in the algorithm for finding the
best monotonic function; that is, the function minimizing stress. Another im-
portant generalization introduced by Kruskal allowed specific non-Euclidean (Lp)
metrics to be fit.
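Given a configuration's distances and the disparities produced by monotone regression, both normalizations are immediate to compute; the sketch below is ours (numpy; d and dhat are condensed vectors over the pairs j < k):

```python
import numpy as np

def stress(d, dhat, form=1):
    """Kruskal's stress, form 1 or form 2 (different denominators)."""
    num = ((d - dhat) ** 2).sum()
    den = (d ** 2).sum() if form == 1 else ((d - d.mean()) ** 2).sum()
    return np.sqrt(num / den)
```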
The first step in the procedure was to generate a set of coordinates for stimuli
in a specified dimensionality by some random, arbitrary, or rational method.
Given a particular metric in the underlying space (usually, but not necessarily,
Euclidean) and a particular dimensionality, the objective was rigorously defined
as that of minimizing stress over the class of all R-dimensional spaces and over
the class of all monotonic functions. This involved computation of the partial
derivatives of stress with respect to the total set of coordinates (all stimuli in all
dimensions) to determine the direction and relative distance each point should
be moved on that iteration. When the components of the gradient were suffi-
ciently small to indicate convergence, the process was terminated.
Kruskal's algorithm was originally implemented in a computer program called
MDSCAL that went through several versions. More recently several features of
another MDS program called TORSCA (Young and Torgerson, 1967) were
combined with those of MDSCAL to produce an improved program called KYST
(Kruskal, Young, and Seery, 1973). (The name, 'KYST,' is based on a combina-
tion of four contributors to MDS methodology--Kruskal, Young, Shepard, and
Torgerson.) Another series of two-way MDS procedures under the general name
of Smallest Space Analysis, or SSA, was independently developed by Guttman
and Lingoes (see Guttman, 1968; Lingoes, 1972).
Although the MDSCAL and KYST programs are usually thought of as being
nonmetric, they do also allow for metric MDS (as well as a compendium of
options for different types of analyses). This amounts to replacing the monotone
function with some specified metric function, e.g. a polynomial of some degree.
These approaches are metric in that interval scale properties of the data are used.
1.3. Dimensionality
A difficult task confronting the user of KYST or any other MDS program is to
decide how many dimensions are appropriate for the data. In most cases the
answer is not straightforward since there is a balance between goodness of fit,
interpretability, and parsimony of data representation. Stress (or another com-
parable measure) can be made as low as desired by using a sufficiently large
number of dimensions, but this would provide limited data reduction and would
greatly complicate interpretation of results. In contrast, a two-dimensional solu-
tion greatly facilitates comprehension of the space, but if the stress is too high, the
configuration may misrepresent the overall data structure or omit important
features. In this regard it is difficult to say how low stress should be on an
absolute basis since it depends on so many considerations--for example the
number of stimuli, the amount of 'noise', the distribution of proximity values, the
analysis type (e.g., metric vs. nonmetric, Euclidean vs. non-Euclidean), etc.
The most prevalent procedure used for deciding how many dimensions are
needed is to look at a plot showing the stress values for solutions in several
dimensionalities. (To avoid the possibility of a local minimum in one or more of
these dimensionalities, it is desirable to use more than one starting configuration,
and perhaps to do both metric and nonmetric analyses of the data.) Under ideal
circumstances there is a sharp bend or elbow in the plot. However, unless there
are physical or other clearly defined attributes associated with the dimensions,
there is unlikely to be a prominent inflection point.
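A minimal sketch of this elbow procedure, assuming scikit-learn's MDS routine is available (an external implementation; its stress_ attribute reports the raw sum of squared residuals rather than Kruskal's normalized stress form 1, so the values are comparable only within a single data set):

import numpy as np
from sklearn.manifold import MDS

# delta: an n x n symmetric dissimilarity matrix; here built from invented data
rng = np.random.default_rng(1)
pts = rng.normal(size=(20, 3))
delta = np.sqrt(((pts[:, None] - pts[None]) ** 2).sum(-1))

for R in range(1, 6):
    best = np.inf
    for seed in range(5):   # several starting configurations to dodge local minima
        mds = MDS(n_components=R, metric=False, dissimilarity='precomputed',
                  n_init=1, random_state=seed)
        mds.fit(delta)
        best = min(best, mds.stress_)
    print(R, best)          # plot these values against R and look for an elbow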
Another technique for assessing dimensionality is to refer the stress values to
charts based on Monte Carlo simulations (Wagenaar and Padmos, 1971; Isaac
and Poor, 1974; Spence and Graef, 1974). These charts, which are based on
sampling distributions, can be helpful in some instances, but several strong
assumptions about the data are involved as well as other limitations (see Kruskal
and Wish, 1978).
The size and distribution of residuals $(\hat{d}_{jk} - d_{jk})$ are also relevant to decisions
about dimensionality. The presence of large outliers or a systematic pattern of
residuals can be grounds for additional dimensions. In some instances inspection
of the residuals can suggest what those dimensions might be, or suggest other
structure in the proximity matrix.
Two additional criteria that can be brought to bear are replicability and
interpretability of results. In addition to analysing other sets of comparable data,
replicability can be assessed by splitting the data in several ways, by a
'jackknife'-type approach (Tukey, 1977), or by various types of 'sensitivity'
analyses.
Interpretability depends to some extent on the prior knowledge and intuitions
of the experimenter, but can also be aided by regression analyses and procedures
[Fig. 1 appears here: a plot of stress (form 1), from about 0.10 to 0.40, against dimensionality, 1 through 5.]
Fig. 1. STRESS (form 1) for Morse code confusion data (Rothkopf, 1957) in one to five dimensions.
Fig. 2. Two-dimensional MDSCAL solution for Morse code confusion data, as interpreted by
Shepard (1963).
figure. The vertical dimension might also be interpreted as total duration of the
signal (dashes are three times as long as dots, and the interval between compo-
nents has the same duration as a dot).
The plot of proximities (percentages of 'same' judgments for pairs) vs. inter-
point distances (in the multidimensional space) is shown in Fig. 3. The nonlinear-
ity of the best-fitting monotonic function is striking, and is in accord with the
exponential relationship expected by Shepard for such 'confusions' data. Also of
note are the rather large discrepancies from a perfect monotonically decreasing
function. Thus, although the number of components and the relative number of
dots and dashes are the two most prominent attributes along which Morse code
signals are confused, the large residual error suggests the possibility of additional
structure latent in these data. In this regard, further analyses of these and other
similar data (Wish, 1967a,b) showed that the higher dimensions reflect the
particular sequence of components. For example, a three-dimensional solution,
which had a stress of 0.13, had an additional dimension, separating signals that
begin with a dash from those whose first component is a dot.
[Fig. 3 appears here: a scatter plot of percentage of 'same' responses (vertical axis) against interpoint distance (horizontal axis).]
Fig. 3. Relationship between percentage of 'same' responses for stimulus pairs and their distance in
the two-dimensional configuration.
This was one of the first applications of MDS showing that metric information
could be obtained from ordinal data (see Shepard, 1966). Besides revealing the
fundamental dimensions of these data, the study made a theoretical contribution
in showing that the function relating stimulus confusions to distance was ex-
ponential. It should be pointed out, however, that the dimensions in this applica-
tion and most others depend on the kind of judgment or psychological process
involved, as well as on the stimulus domain. Thus, somewhat different dimensions
were obtained for data using other experimental tasks (see Shepard, 1963).
The next application of MDS is from a pilot study by Wish and Carroll (in
association with W. Kluver and J. Martin of Experiments in Art and Technology,
Inc.). The main task for the 14 subjects participating in the study (see Kruskal and
Wish, 1978) was to give pairwise ratings of relatedness among 22 societal
problems. In addition, they evaluated these problems on 15 attributes; for
example, 'not at all important vs. very important,' 'affects very few people vs.
affects most people,' etc. The data were collected in 1972; other problems, such as
energy, would undoubtedly be included if the study were done today.
The relatedness judgments were regarded as proximities and analyzed by the
KYST procedure. Analyses in one through six dimensions gave stress values
Table 1
Multiple regression of mean bipolar scale ratings on dimensions of societal problems

                                          Mult.   Principal axes          Rotated axes
Rating scales                             corr.   (1)    (2)    (3)       (1)    (2)    (3)
(1)  Very important                       0.49    0.54   0.76   0.35      0.54  -0.53   0.66
(2)  Very interested                      0.36    0.91   0.39   0.16      0.91  -0.42   0.03
(3)  Affects me a great deal              0.84*   0.99   0.04   0.10      0.99  -0.08  -0.08
(4)  Affects most people                  0.80*   0.94   0.34   0.05      0.94   0.33   0.11
(5)  Action urgently needed               0.30    0.56   0.42  -0.71      0.56  -0.06   0.83
(6)  Economic problem                     0.52    0.33   0.94   0.07      0.33  -0.81   0.49
(7)  Moral problem                        0.52    0.17  -0.12   0.98     -0.17  -0.33  -0.93
(8)  Political problem                    0.30    0.36  -0.68  -0.63      0.36   0.90   0.26
(9)  Organizational problem               0.64**  0.32  -0.47  -0.82      0.32   0.79   0.52
(10) Technological problem                0.77*   0.61   0.42  -0.67      0.61   0.07   0.79
(11) Responsibility of
     federal government                   0.44   -0.03   0.74   0.68      0.03  -0.96  -0.28
(12) Responsibility of
     local government                     0.77    0.26  -0.86  -0.43     -0.26   0.97   0.00
(13) Responsibility of
     non-profit institutions              0.41    0.59   0.44  -0.68      0.59  -0.09   0.80
(14) Responsibility of profit-
     making institutions                  0.60**  0.56   0.77   0.32      0.56   0.83   0.06
(15) Responsibility of people
     directly affected                    0.39    0.30  -0.86  -0.41      0.30   0.95   0.01
[Fig. 4 appears here: two planes of the three-dimensional solution, (a) the dimension 1 by dimension 3 plane and (b) the dimension 2 by dimension 3 plane, locating the 22 societal problems (e.g., war, inflation, crime, violence, racism, poverty, unemployment, urban decay, pollution, overpopulation, drug and alcohol abuse, consumer exploitation, misuses of technology, failures in welfare, neglect of public transportation, substandard housing, inadequate health care, deterioration of public education, and inequities in the judicial system), together with vectors for rating scales such as 'local government responsibility' and 'technological problem'.]
Fig. 4. Three-dimensional KYST solution for data on perceived relatedness among 22 societal
problems. Vectors for rating scales are based on multiple regression analyses.
Other scales have high regression weights on the second and third dimensions
(economic vs. not economic problem, moral vs. not moral problem), but their
multiple correlations are too low for these scales to be useful for interpretational
purposes. Two other rating scales having reasonably high multiple correlations
are 'responsibility (vs. not responsibility) of local government' and 'technological
(vs. not technological) problem.' Since neither of these lines up closely with
dimension 2 or 3, a rotation was needed to provide a satisfactory interpretation of
these coordinate axes.
This is displayed geometrically in Fig. 4, which shows two planes from the
three-dimensional solution for the relatedness data, along with vectors for some of
the scales. The location of these vectors is based on the normalized regression
weights, or direction cosines, in the first three columns of the table. In this
orientation mean ratings on the scales correlate as highly as possible with stimulus
projections on the associated vectors.
Since two scales already provide a good definition of the first dimension, no
rotation is required for its interpretation. Labelling of the other two dimensions
could be enhanced considerably, however, by rotating the axes to closer con-
gruence with the two vectors shown in Fig. 4(b). This cannot be done perfectly
with an orthogonal rotation since the two vectors are not exactly at right angles to
one another.
The last three columns of Table 1 show the regression weights for the rotated
dimensions. The fourth column is the same as the first since dimension 1 was not
involved in the rotation. In this orientation, however, the other two dimensions
can be more reasonably interpreted as 'responsibility vs. not responsibility of
local government' and 'technological vs. not technological problem.' Since multi-
ple correlations are only in the 0.70's for these scales (they are unaffected by the
rotation), the names for dimensions are still only suggestive. Further studies
including additional scales and societal problems, as well as a large sample of
subjects, would be needed to arrive at a compelling determination of the dimen-
sions of societal problems.
Everything discussed so far has assumed a two-way data matrix and a quadratic
model for squared distances. Since the squared distances can be converted to
scalar products (as in classical MDS), two-way MDS can also be thought of as
involving bilinear models for derived scalar products. We now consider the
extension to three-way arrays and trilinear models.
Probably the simplest generalization of the bilinear model for two-way MDS is
one given by (2):

$z_{ijk} \approx \hat{z}_{ijk} = \sum_{r=1}^{R} a_{ir} b_{jr} c_{kr}.$   (2)
In this equation $a_{ir}$, $b_{jr}$, and $c_{kr}$ are elements of the matrices A, B, and C having
I, J, and K rows, respectively. (All three matrices have R columns.) If A
corresponds to subjects and B to stimuli, then C might be associated, for example,
with different rating scales, experimental methods, occasions, etc.
The least squares solution for this trilinear model can be obtained by extending
the Eckart-Young (1936), or singular-value decomposition to the three-way case.
This entails use of what Wold (1966) has called a NIPALS (for nonlinear iterative
partial least squares) procedure, or what has more recently (de Leeuw, 1977) come
to be called an ALS (for alternating least squares) procedure. This involves
converting the overall nonlinear least squares problem (of finding matrices A, B,
and C that optimally fit the Z array in a least squares sense) into a series of linear
least squares problems that are solvable by standard methods.
This is done by successively fixing two of the matrices (say B and C) and
solving for the least squares estimate of the third with these fixed values; then
fixing two others (say A and C) and solving for the remaining one and so on
around the iterative loop until no further improvement occurs. Each iteration of
the process necessarily improves the fit; i.e., decreases the sum of squared
residuals between Z and Z, ( = the three-way array containing the £ijk'S). When
the process has converged (and it can be proved that it will converge), the
resulting matrices constitute at least locally optimal least squares estimates of A,
B, and C. This process has been called CANDECOMP, for canonical decomposi-
tion of N-way tables.
In the case described we have N = 3, but the extension to higher-way tables is
straightforward (Carroll and Chang, 1970; Carroll and Wish, 1974a). Thus, in
higher-way methods all but one way of the data table is fixed at any time, and the
remaining one is estimated by the NIPALS-ALS procedure. A fruitful appli-
cation of N-way CANDECOMP has been for the analysis of N-way contingency
tables in terms of the Lazarsfeld (Lazarsfeld and Henry, 1968) latent class model
(see Carroll, Pruzansky, and Green, 1980).
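A compact numpy sketch of the alternating least squares loop for the three-way case follows (function names are ours, and production code would add a convergence test on the fit; this is an illustration of the general scheme, not the CANDECOMP program itself):

import numpy as np

def khatri_rao(B, C):
    # columnwise Kronecker product, shape (J*K) x R
    return (B[:, None, :] * C[None, :, :]).reshape(-1, B.shape[1])

def unfold(Z, mode):
    # mode-n unfolding of a three-way array into a matrix
    return np.moveaxis(Z, mode, 0).reshape(Z.shape[mode], -1)

def candecomp_als(Z, R, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    I, J, K = Z.shape
    A = rng.normal(size=(I, R))
    B = rng.normal(size=(J, R)); C = rng.normal(size=(K, R))
    for _ in range(iters):
        # each update fixes two matrices and solves an ordinary linear least
        # squares problem for the third, so the fit can never get worse
        A = unfold(Z, 0) @ np.linalg.pinv(khatri_rao(B, C).T)
        B = unfold(Z, 1) @ np.linalg.pinv(khatri_rao(A, C).T)
        C = unfold(Z, 2) @ np.linalg.pinv(khatri_rao(A, B).T)
    Zhat = np.einsum('ir,jr,kr->ijk', A, B, C)   # the fitted three-way array
    return A, B, C, Zhat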
value, or applying any other linear function with negative slope.) Fig. 5 schema-
tizes both the input data array and output matrices for INDSCAL.
The model for INDSCAL is summarized in the following equations:

$\delta_{jk}^{(i)} \approx F_i\!\left( d_{jk}^{(i)} \right),$   (3)

where

$d_{jk}^{(i)} = \left[ \sum_{r=1}^{R} w_{ir} \left( x_{jr} - x_{kr} \right)^2 \right]^{1/2}.$   (4)
[Fig. 5 appears here; panel (a) shows the three-way input array assembled from one proximity matrix per subject (or other data source), and panel (b) shows the two output matrices: the group stimulus space (objects by dimensions, coordinates $x_{jr}$) and the subject space (subjects by dimensions, weights $w_{ir}$).]
Fig. 5. Schematic representation of INDSCAL input (a) and output (b). The input data are two or
more square symmetric matrices, one for each subject or other data source. The data values, $\delta_{jk}^{(i)}$,
indicate the dissimilarity between stimulus j and stimulus k for subject i. The set of two-way matrices
defines the rectangular solid or three-way data array, shown at the top. The output from INDSCAL
consists of a matrix of stimulus coordinates (bottom left) and a matrix of subject weights (lower right)
on the R dimensions.
The $F_i$'s will generally be considered to be either linear, in the metric case, or
monotonic in the nonmetric case. It is important to note, however, that a different
$F_i$ is assumed for each individual. Thus INDSCAL generalizes two-way MDS by
substituting a weighted Euclidean metric for the ordinary (unweighted) Euclidean
metric. Different patterns of weights are allowed for each individual or other data
source. The possibility of zero weights means that some dimensions may be
totally irrelevant to a subject. Thus, INDSCAL allows for a large degree of
variation among subjects within the context of a shared multidimensional space.
A more geometric interpretation of the INDSCAL model is provided by (5) and (6):

$x_{jr}^{(i)} = \sqrt{w_{ir}}\, x_{jr},$   (5)

$d_{jk}^{(i)} = \left[ \sum_{r=1}^{R} \left( x_{jr}^{(i)} - x_{kr}^{(i)} \right)^2 \right]^{1/2}.$   (6)

By substituting (5) into (6) we get (4). Thus, (5) and (6) together provide an
alternative interpretation of the weighted generalization of the ordinary Euclidean
metric defined in (4). They express the INDSCAL model in terms of a particu-
larly simple class of transformations of the common space, followed by computa-
tion of the ordinary Euclidean metric. The class of transformations can be
described algebraically as linear transformations, with the transformation matrix
constrained to be diagonal (sometimes called a strain or 'stretch' transformation).
More simply, this amounts to simply rescaling each dimension by the square root
of that particular subject's weight for that dimension. Geometrically this can be
thought of as differentially stretching or shrinking each dimension by a factor
proportional to the square root of the associated weight.
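By way of illustration, here is a minimal numpy sketch (function names ours, data invented) checking numerically that rescaling each dimension by the square root of a subject's weight and then taking ordinary Euclidean distances reproduces the weighted metric of (4):

import numpy as np

def private_space(X, w_i):
    # X: group stimulus space (n stimuli x R dims); w_i: one subject's weights
    return X * np.sqrt(w_i)              # stretch/shrink each dimension

def weighted_distances(X, w_i):
    Y = private_space(X, w_i)
    diff = Y[:, None, :] - Y[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))  # ordinary Euclidean metric in Y

X = np.array([[0., 0.], [1., 0.], [0., 2.]])
w = np.array([4.0, 0.25])                # this subject stresses dim 1, shrinks dim 2
D = weighted_distances(X, w)
d_direct = np.sqrt((w * (X[0] - X[2]) ** 2).sum())   # eq. (4) computed directly
assert np.isclose(D[0, 2], d_direct)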
The weights can be plotted in a second space, which is generally referred to as
the 'subject space,' since in most psychological applications these points corre-
spond to different individuals. The coordinates of the subject space are contained
in an I × R matrix, as shown in the bottom right of Fig. 5. Although in principle
these weights should all be positive (or zero), some of the weights turn out
occasionally to be slightly negative. Near-zero weights are likely to represent
random statistical fluctuation. Large negative weights signal inappropriateness of
the model, too many dimensions, or perhaps a selection of the wrong option (e.g.,
treating the data as dissimilarities when they should be similarities).
[Fig. 6(a) and 6(b) appear here. Panel (a): the group stimulus space for the 12 nations (USA, Russia, Japan, France, Yugoslavia, Israel, China, Cuba, Brazil, Egypt, India, Congo), with dimension 1 running from pro-communist to pro-Western and dimension 2 from underdeveloped to economically developed. Panel (b): the subject space of dimension weights, in which a cluster of 'doves' is marked.]
Fig. 6. Group stimulus space (a) and subject space (b) from an INDSCAL analysis of perceived
similarities among 12 nations. Private perceptual spaces for subjects 17 (c) and 4 (d) were obtained by
applying the square roots of their respective weights to the dimensions of the group stimulus space.
[Fig. 6(c) and 6(d) appear here: the private perceptual spaces for subjects 17 and 4, each a differentially stretched version of the group stimulus space.]
Fig. 6 (continued).
or in matrix notation

$\hat{B}_i = X W_i X'.$   (8)

The correspondence with (2) can be seen by substituting $w_{ir}$ for $a_{ir}$, $x_{jr}$ for $b_{jr}$, and
$x_{kr}$ for $c_{kr}$. This is a symmetric form of CANDECOMP in which the second and
third ways of the table correspond to the same set of entities, and therefore
$J = K = n$, the number of stimuli. Furthermore, $\hat{B}_i$ is an $n \times n$ matrix of derived
scalar products, and $W_i$ is an $R \times R$ diagonal matrix.
Table 2
Interpersonal relations and situational contexts used to generate the factorial set of hypothetical
communication episodes
Interpersonal relations
(A) Bitter enemies
(B) Business partners
(C) Casual acquaintances
(D) Husband and wife
(E) Marine sergeant and private
(F) Parent and teenager
(G) Political rivals
(H) Supervisor and employee
Situational contexts
(1) Attempting to work out a compromise when their goals are strongly opposed
(2) Pooling their knowledge and skills to solve a difficult problem
(3) Discussing a controversial social issue on which their opinions differ
(4) Talking to each other at a large social gathering
(5) Expressing anxiety about a national crisis that is affecting them personally
(6) Working for a common goal with one person directing the other
(7) Blaming one another for a serious error that was made
(8) Having a brief exchange about a minor technical detail
Table 3
Dimension weights for 14 bipolar scales based on
an INDSCAL analysis of 64 hypothetical communication episodes

Rating scales                            Dim. 1   Dim. 2   Dim. 3   Dim. 4   Dim. 5
Very cooperative vs.
  very competitive                        0.96*    0.03     0.00     0.03     0.03
Very friendly vs.
  very hostile                            0.96*   -0.10     0.04     0.10     0.03
No conflict vs.
  constant conflict                       0.88*    0.20     0.03    -0.04    -0.03
Very intense vs.
  very superficial                        0.11     0.92*    0.04    -0.02     0.05
Completely engrossed vs.
  uninterested and uninvolved            -0.08     0.85*    0.04     0.07     0.14
Very emotional vs.
  very unemotional                        0.33     0.73*    0.02    -0.02     0.05
Very personal vs.
  very impersonal                        -0.08     0.70*    0.00     0.50     0.05
Very different roles and behavior vs.
  very similar roles and behavior         0.03     0.02     0.93*    0.00     0.03
One totally dominates the other vs.
  each treats the other as an equal       0.18     0.04     0.92*    0.07     0.01
Very autocratic vs.
  very democratic                         0.21     0.00     0.91*    0.08     0.02
Very formal vs.
  very informal                           0.10     0.00     0.09     0.92*    0.02
Very reserved and cautious vs.
  very frank and open                     0.19     0.01     0.02     0.85*    0.08
Very task oriented vs.
  not at all task oriented               -0.02     0.12     0.05     0.05     0.91*
Very productive vs.
  very unproductive                       0.39    -0.08    -0.02     0.07     0.78*

*Weight ≥ 0.70
each dimension. Details about the study as a whole and these supplementary
analyses can be found in Wish and Kaplan (1977).
$\text{stress}_i = \left[ \sum_{j<k} \left( d_{jk}^{(i)} - \hat{d}_{jk}^{(i)} \right)^2 \Big/ \text{Norm}_i \right]^{1/2}$   (9)

The two versions of $\text{Norm}_i$ are analogous to those used in stress forms 1 and 2.
Suppose, for example, that the subjects are defined by a factorial design (e.g.,
using demographic factors such as age, sex, and socioeconomic status), and
that stimuli are defined factorially in terms of color, size, and shape. In a
CANDELINC analysis the subject and/or stimulus parameters could be required
to satisfy any set of linear constraints desired by the experimenter or data analyst.
Thus one or both sets could be constrained to be perfectly decomposable by an
additive (or main effects only) ANOVA-type model. Alternatively, specified
interaction terms could be included allowing for all two-way, but no three-way
interactions; or certain one-degree-of-freedom contrasts (partitioning main ef-
fects, interactions, or both) could be allowed.
One important advantage of imposing such constraints is to enhance the
experimenter's ability to extrapolate results to new stimuli, subjects, etc. Another
benefit is the potential for comparing alternative models (as in analysis of
variance or conjoint analysis) and providing a parsimonious representation of the
data. Since CANDELINC entails a decomposition of a much smaller array, it
also saves considerable computer time and costs.
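A sketch of this idea follows (our own function names, assuming orthonormal bases for the design spaces and reusing the hypothetical candecomp_als routine from the earlier sketch): the array is projected onto the design spaces, CANDECOMP is fit to the much smaller core, and the factors are expanded back so that they satisfy the linear constraints exactly.

import numpy as np
# assumes the candecomp_als() routine from the CANDECOMP sketch above

def candelinc(Z, R, P1, P2, P3, **kw):
    # orthonormal bases for the column spaces of the design matrices
    Q1, _ = np.linalg.qr(P1); Q2, _ = np.linalg.qr(P2); Q3, _ = np.linalg.qr(P3)
    # project the large I x J x K array onto a small p1 x p2 x p3 core array
    core = np.einsum('ijk,ia,jb,kc->abc', Z, Q1, Q2, Q3)
    # fit CANDECOMP to the core only -- this is where the savings come from
    At, Bt, Ct, _ = candecomp_als(core, R, **kw)
    # expanding back guarantees, e.g., that the columns of the first factor
    # lie in the column space of P1, i.e. the constraints hold exactly
    return Q1 @ At, Q2 @ Bt, Q3 @ Ct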
$\hat{s}_{jk}^{(i)} = \sum_{r=1}^{R} \sum_{r'=1}^{R} x_{jr}\, c_{rr'}^{(i)}\, x_{kr'},$   (11)

or in matrix notation

$S_i = X C_i X'$   (12)

$C_i = \sum_{h=1}^{H} a_{ih} G_h$   (13)
computational method for fitting this model to data. This model, which is called
PARAFAC-2 (for parallel factors, model 2), assumes that $C_i$ can be decomposed
as

$C_i = D_i R D_i$   (14)
There are several recent trends in MDS methodology that offer bright prospects
for the future. There are also arid areas that should become more fertile in the
coming years.
One recent trend that appears almost to represent a zeitgeist involves the fusion
of continuous and discrete models for the analysis of proximities. In this regard,
Carroll and Pruzansky (1975) have recently developed hybrid models that incor-
porate dimensional and tree-structure components (see also Carroll, 1976). Pre-
liminary applications have shown the potential for a more comprehensive data
representation than would be possible with a spatial or clustering model alone.
Until recently nonsymmetric data were generally simply symmetrized (e.g., by
averaging the j, k and k, j cells). Recently, however, models and methods have
been proposed for dealing directly with nonsymmetries. A procedure developed
by Young (1975) called ASYMSCAL and a nonsymmetric version of INDSCAL
(DeSarbo and Carroll, 1979) offer potentially useful perspectives for tackling this
ubiquitous problem. Other interesting developments related to the analysis of
nonsymmetric proximities data have been made by Tobler (1976), Chino (1978),
and Gower (1977, 1978). A particularly innovative approach to handling and
graphically displaying general patterns of nonsymmetries is the DEDICOM
approach of Harshman (1980).
CANDELINC represents a general approach that could be considerably devel-
oped and expanded in the future. In this regard it should be pointed out that
linear constraints can be applied to the parameters of the other three-way
(and higher-way) models by methods very similar to the procedure used in
CANDELINC. In fact, Carroll, Pruzansky, and Kruskal (1980) discuss an explicit
generalization of CANDELINC to Tucker's three-mode factor analysis and
scaling models. In the future other kinds of constraints could be allowed, or the
constraints could apply differentially to specified dimensions. Work along these
lines has recently been done by Bentler and Weeks (1978), Bloxom (1978), de
Leeuw and Heiser (1979) and Noma and Johnson (1979). Perhaps the most
342 Myron Wish and J. Douglas Carroll
encouraging aspect of this approach is the attempt to integrate MDS with other
areas of statistics such as analysis of variance. The possibility of comparing a
range of models may also increase the value of MDS for theory construction and
testing in various fields.
Approaches to the analysis of highly nonlinear data, which have been developed
more generally in multivariate statistics (e.g., McDonald, 1962; Shepard and
Carroll, 1966; Gnanadesikan and Wilk, 1969), will undoubtedly be proposed
within the context of MDS. Likewise, the important work by Ramsay (1978) and
Takane (1981) on the development of maximum likelihood methods for estimat-
ing MDS parameters offers another example of the introduction of general
statistical methodology within the context of MDS. (These maximum likelihood
methods also allow for constraints of various kinds on the solutions.) This is
particularly encouraging since MDS has been primarily used in the past as a
purely descriptive rather than as an inferential tool.
Although there have been attempts, using Monte Carlo methods, to provide
objective indices for assessing dimensionality, considerable work remains to be
done toward the development of distribution theory within the context of
multidimensional scaling. Likewise the need to develop a more solid base in
statistical inference is a sine qua non for the future.
There are clearly numerous other unsolved problems and unexplored frontiers
in multidimensional scaling, such as identifiability and uniqueness of various
models, efficiency of numerical algorithms, and the development of diagnostics to
aid naive as well as sophisticated users. Hopefully, the broad range of challenges
and the growing awareness of these needs will motivate important breakthroughs
in multidimensional scaling.
References
Klingberg, F. L. (1941). Studies in measurement of the relations among sovereign states. Psycho-
metrika 6, 335-352.
Kruskal, J. B. (1964a). Multidimensional scaling by optimizing goodness of fit to a nonmetric
hypothesis. Psychometrika 29, 1-27.
Kruskal, J. B. (1964b). Nonmetric multidimensional scaling: A numerical method. Psychometrika 29,
115-129.
Kruskal, J. B. and Wish, M. (1978). Multidimensional Scaling. Sage Publications, Beverly Hills.
Kruskal, J. B., Young, F. W. and Seery, J. B. (1973). How to use KYST, a very flexible program to do
multidimensional scaling and unfolding. Tech. Rept., Bell Telephone Laboratories, Murray Hill.
Lazarsfeld, P. F. and Henry, N. W. (1968). Latent Structure Analysis. Houghton Mifflin, Boston.
Lingoes, J. C. (1972). A general survey of the Guttman-Lingoes nonmetric program series. In: R. N.
Shepard, A. K. Romney and S. Nerlove, eds., Multidimensional Scaling: Theory and Applications in
Behavioral Sciences, Vol. I: Theory, 49-68. Seminar Press, New York.
McDonald, R. P. (1962). A general approach to nonlinear factor analysis. Psychometrika 27, 397-415.
Miles, R. E. (1959). The complete amalgamation into blocks, by weighted means, of a finite set of real
numbers. Biometrika 46, 317-327.
Noma, E. and Johnson, I. (1979). Constrained nonmetric multidimensional scaling. Tech. Rept.
MMPP 1979-4. Ann Arbor: University of Michigan Math. Psychol. Program.
Ramsay, J. O. (1978). Confidence regions for multidimensional scaling analysis. Psychometrika 43,
145-160.
Richardson, M. W. (1938). Multidimensional psychophysics. Psychological Bulletin 35, 659-660.
Rosenberg, S., Nelson, C. and Vivekananthan, P. S. (1968). A multidimensional approach to the
structure of personality impressions. J. Personality and Social Psychology 9, 283-294.
Rothkopf, E. Z. (1957). A measure of stimulus similarity and errors in some paired-associate learning
tasks. J. Experimental Psychology 53, 94-101.
Sattath, S. and Tversky, A. (1977). Additive similarity trees. Psychometrika 42, 319-345.
Shepard, R. N. (1962a). Analysis of proximities: Multidimensional scaling with an unknown distance
function. I. Psychometrika 27, 125-140.
Shepard, R. N. (1962b). Analysis of proximities: Multidimensional scaling with an unknown distance
function. II. Psychometrika 27, 219-246.
Shepard, R. N. (1963). Analysis of proximities as a technique for the study of information processing
in man. Human Factors 5, 33-48.
Shepard, R. N. (1966). Metric structures in ordinal data. J. Math. Psych. 3, 297-315.
Shepard, R. N. (1972). A taxonomy of principal types of data and of multidimensional methods for
their analysis. In: R. N. Shepard, A. K. Romney and S. Nerlove, eds., Multidimensional Scaling:
Theory and Applications in the Behavioral Sciences, Vol. I, 21-47. Seminar Press, New York.
Shepard, R. N. and Arabie, P. (1979). Additive clustering: Representation of similarities as combina-
tions of discrete overlapping properties. Psychol. Rev. 86, 87-123.
Shepard, R. N. and Carroll, J. D. (1966). Parametric representations of nonlinear data structures. In:
P. R. Krishnaiah, ed., Multivariate Analysis, Vol. 2, 561-592. Academic Press, New York.
Spence, I. and Graef, J. (1974). The determination of the underlying dimensionality of an empirically
obtained matrix of proximities. Multivariate Behav. Res. 9, 331-342.
Stevens, S. S. (1951). Mathematics, measurement and psychophysics. In: S. S. Stevens, ed., Handbook
of Experimental Psychology. Wiley, New York.
Takane, Y. (1981). Multidimensional successive categories scaling: A maximum likelihood method.
Psychometrika [in press].
Takane, Y., Young, F. W., and de Leeuw, J. (1977). Nonmetric individual differences multidimen-
sional scaling: An alternating least squares method with optimal scaling features. Psychometrika 42,
7-67.
Tobler, W. (1976). Spatial interaction patterns. J. Environ. Syst. 6, 271-301.
Torgerson, W. S. (1958). Theory and Methods of Scaling. Wiley, New York.
Tucker, L. R. (1964). The extension of factor analysis to three-dimensional matrices. In: N. Fredriksen
and H. Gulliksen, eds., Contributions to Mathematical Psychology. Holt, Rinehart and Winston, New
York.
Tucker, L. R. (1972). Relations between multidimensional scaling and three-mode factor analysis.
Psychometrika 37, 3-27.
Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley, Reading.
Tversky, A. (1977). Features of similarity. Psychol. Rev. 84, 327-352.
van Eeden, C. (1957). Note on two methods for estimating ordered parameters of probability
distributions. Proceedings Akademie van Wetenschappen, Ser. A 60, 128-136.
Wagenaar, W. A. and Padmos, P. (1971). Quantitative interpretation of stress in Kruskal's multidimen-
sional scaling technique. British J. Math. Statist. Psych. 24, 101-110.
Wish, M. (1967a). A structural theory for the perception of Morse code signals and related rhythmic
patterns. Center for Research on Language and Language Behavior, University of Michigan.
Wish, M. (1967b). A model for the perception of Morse code like signals. Human Factors 9, 529-540.
Wish, M. (1972). Notes on the variety, appropriateness, and choice of proximity measures. Unpub-
lished manuscript, Bell Telephone Laboratories, Murray Hill.
Wish, M. and Carroll, J. D. (1974). Applications of INDSCAL to studies of human perception and
judgment. In: E. C. Carterette and M. P. Friedman, eds., Handbook of Perception. 449-491.
Academic Press, New York.
Wish, M., Deutsch, M. and Biener, L. (1970). Differences in conceptual structures of nations: An
exploratory study. J. Personality, Social Psychology 16, 361-373.
Wish, M. and Kaplan, S. J. (1977). Toward an implicit theory of interpersonal communication.
Sociometry 40, 234-246.
Wold, H. (1966). Estimation of principal components and related models by iterative least squares. In:
P. R. Krishnaiah, ed., Multivariate Analysis. Academic Press, New York.
Young, F. W. (1975). An asymmetric Euclidean model for multiprocess asymmetric data. Presented at
U.S.-Japan Seminar Multidimensional Scaling, University of California, San Diego, La Jolla.
Young, G. and Householder, A. S. (1938). Discussion of a set of points in terms of their mutual
distances. Psychometrika 3, 19-22.
Young, F. W. and Torgerson, W. S. (1967). TORSCA, a FORTRAN IV program for Shepard-Kruskal
multidimensional scaling analysis. Behav. Sci. 12, 498.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 347-360

Intrinsic Dimensionality Extraction

Keinosuke Fukunaga
1. Introduction
$x(t) = a\, e^{-(t-m)^2/(2\sigma^2)}$   (1)
where a and m are two random variables that govern this process. Assume that
our observations are formed by taking n time samples of each x(t). Observe that
knowledge of the parameters a and m allows a complete representation of the n
observed time samples as well as the process. This implies that the intrinsic
dimensionality is two. However, conventional basis function approaches, such as
the Karhunen-Loève expansion or the Fourier transform, will require signifi-
cantly more than two terms to approximate the above process. This is due to the
nonlinear nature of x(t). In contrast to the psychology example, absolute values,
as opposed to relative values, are necessary in general. Furthermore, the observa-
tions of x(t) naturally provide N vectors in an n-dimensional space. We seek to
transform the vectors from the n-dimensional space to an m-dimensional space
with m < n such that the pairwise Euclidean distances are 'approximately' the same.
*This work was supported in part by the National Science Foundation under Grant ECS-80-05482.
same. The intrinsic dimensionality is defined as the minimum m for which the
above constraint holds.
The previous two problems are concerned with representation. We desire a low
dimensional space that preserves the structure of the distribution of samples. The
representation problem will be discussed in more detail in Section 2. For the
classification of samples, the preservation of distribution structure is no longer
required. Instead, the degree of overlap (Bayes error) among different classes (or
categories) should be preserved. In this case the intrinsic dimensionality becomes
the lowest dimensionality we can obtain without changing the overlap among
classes. This problem will be discussed in Section 3.
In this article, we limit our discussion to the case where original samples are
given as vectors in an n-dimensional space. Therefore, when a random process
x(t) is considered, it should be converted to a random vector with n time-
sampling as
$\mathbf{x} = \left[ x(t_1), x(t_2), \ldots, x(t_n) \right]^T.$   (2)
local subset of data should indicate the dimensionality which is close to the
intrinsic dimensionality. The local eigenvectors will also give the basis vectors for
the local distributions. The eigenvalue approach is not the only way to estimate
local dimensionality, and there are other alternatives. One example is the maxi-
mum likelihood estimation technique proposed by Trunk in which each local
distribution is statistically tested to determine the most likely number of parame-
ters in the data generating process [7].
Local dimensionalities with unlimited noise-free data may be found by reduc-
ing the size of the local regions until a limiting dimensionality is reached. In
practice, however, some factors such as a limited data set and noise complicate
this procedure.
(1) Dominant eigenvalues. As is seen in Figs. 2 and 3, surface convolutions and
noise tend to enlarge the eigenvalues along insignificant eigenvectors. Therefore,
instead of computing the pure mathematical rank of the covariance matrix, the
dominant eigenvalues should be chosen with a properly selected threshold. The
threshold value affects the estimation of the dimensionality directly, and is very
subjective. It is advisable to try several threshold values and compare the results.
(2) Sample size. It is known that in order to insure the nonsingularity of the
covariance matrix, the number of samples must be larger than the dimension n.
However, the necessary sample size to detect the intrinsic dimensionality $n_0$ is
some number larger than $n_0$, not necessarily larger than n. In addition, our
[Fig. 2 appears here: local subsets of data along a curved one-dimensional structure in a two-dimensional space.]
Fig. 2. Local subsets of data.
[Fig. 3 appears here: a one-dimensional data structure smeared by noise.]
concern is the total number of significant eigenvalues, not the accuracy of their
estimate. So the required sample size may be much smaller than the one needed
for estimating eigenvalues. Experience shows that the local sample size may only
be two or three times the number of dominant eigenvalues, $n_0$, regardless of n.
For example, with $n_0 = 3$ only 6 or 9 samples are required even for $n = 100$. A
technique is available to calculate the eigenvalues with sample size smaller than n,
without computing an n × n covariance matrix [8].
(3) Effect of noise. The addition of noise has the effect of constraining the
minimum size of the local regions. This is shown in Fig. 3, where a one
dimensional line is smeared by a noise such that the number of the dominant
eigenvalues in small local regions may be increased to two. As mentioned earlier,
the choice of large regions may include several surface convolutions, leading to an
overestimate of the intrinsic dimensionality. On the other hand, a small local
region will pick up the eigenvalues due to the noise component. Therefore, an
engineering compromise on the size of local regions is necessary. This problem
may be corrected by using a data filter, which will be described later, to eliminate
the noise.
Since the choice of threshold value and local region size depends on the data, it
is desirable to have an interactive computer system to provide operator flexibility.
One possible algorithm is given as follows.
(1) Selection of the size of local regions. Although the size of local regions could
be adjusted region-by-region, it might be more convenient to fix the size at the
beginning of the program. The proper size would be chosen after studying the
relationship between the size and the resulting dimensionality around the sample
nearest to the sample mean vector of the entire sample set. The size is specified by
either the radius of a hypersphere or the number of samples in a local region. An
operator should be able to adjust the size whenever needed.
(2) Selection of local centers. This problem is essentially a search for 'good' local
regions in a high dimensional space, and involves many compromises such as
local region size, amount of overlap, and the methodology used to find local
centers. One of the possibilities is to start by using the first sample of a sample list
as a local center. The samples of the local region around the sample are used to
determine the local dimensionality and then removed from the sample list. The
same procedure is repeated until the list becomes empty. Statistics of local
Intrinsic dimensionality extraction 351
$\hat{\nabla} p(X) = \frac{k(X)}{Nv} \cdot \frac{n+2}{h^2} \left[ \frac{1}{k(X)} \sum_{\|X_i - X\| \le h} (X_i - X) \right]$   (3)
where v is the volume of the hypersphere around X with radius h and k(X) is the
number of samples in the hypersphere. Eq. (3) indicates that $\hat{\nabla} p(X)$ is propor-
tional to $k(X)/Nv$, a density estimate, and $\sum (X_i - X)/k$, the sample mean-shift
from X. When the estimate of a density gradient is normalized by a density
estimate as $\hat{\nabla} p(X)/\hat{p}(X)$, each sample is moved in proportion to the sample
mean-shift. Fig. 4 shows the experimental result for a simple two-dimensional
example, where the above operation was repeatedly applied until no sample
moves were observed.
[Fig. 4 appears here: two scatter plots, the noisy input data (a) and the filtered output (b).]
Fig. 4. A data filter. (a) The input of the filter; (b) the output of the filter.
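A minimal numpy sketch of such a data filter follows (names ours; this is a 'blurring' variant in which every sample is replaced outright by the mean of the samples within radius h of it, which corresponds to moving each sample by the full mean shift):

import numpy as np

def mean_shift_filter(X, h, max_iter=100):
    # repeatedly replace each sample by the mean of the samples inside the
    # hypersphere of radius h around it, until no sample moves; noise is
    # squeezed onto the underlying low-dimensional structure
    Y = X.copy()
    for _ in range(max_iter):
        D = np.sqrt(((Y[:, None] - Y[None]) ** 2).sum(-1))
        inside = D <= h                      # neighbors within radius h (self included)
        Ynew = (inside[..., None] * Y[None]).sum(axis=1) / inside.sum(axis=1, keepdims=True)
        if np.allclose(Ynew, Y, atol=1e-8):  # no sample moved: stop
            break
        Y = Ynew
    return Y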
Table 1
Intrinsic dimensionality of Gaussian pulse

No. of samples           Range of              Intrinsic dimensionality using
used in local regions    hypersphere radius    10%    1%
5                        0.81-0.83             1      1
10                       1.60-2.43             2      3
15                       1.92-2.15             2      3
100                      4.60                  4      6
time-sampled) vectors were generated with the two parameters a and m uniformly
distributed as $1 \le a \le 3$ and $0.2 \le m \le 0.8$. The size of the local region was
specified by requiring that k points be contained in each local region. Table 1
contains the results that were generated for k = 5, 10, and 15 using 20 local
regions. The intrinsic dimensionality was determined by counting the number of
eigenvalues larger than t% of the largest eigenvalue in each local region. Two
values of t were used, 10% and 1%. For each combination of k and t the most
frequent number of sufficiently large eigenvalues was called the intrinsic dimen-
sionality. Referring to Table 1 it can be seen that either 2 or 3 is indicated as the
intrinsic dimensionality. To provide a comparison the entire data set was trans-
formed using the Karhunen-Loève expansion. This corresponds to one 'local'
cluster with 100 samples. Using the same criteria the dimensionality is 4 for
t = 10% and 6 for t = 1%.
computed for each m. One then chooses a value of m that has a small stress and
for which an increase in m does not significantly reduce the stress. Shepard and
Carroll used a similar procedure for a different criterion called continuity [5].
In the engineering literature, Sammon minimized the stress criterion for m = 2.
His purpose was to display data on a cathode-ray tube in his interactive computer
system OLPARS [11]. Calvert and Young added to the stress criterion a criterion to
measure the overlap among different classes [12]. Thus, they tried to find a
mapping which preserves the class separability as well as the structure of the data
distribution.
All these algorithms use the point locations, $Y_1, \ldots, Y_N$, as the variables to be
optimized, and the optimization process is iterative in nature. The iterative
optimization of $N \times m$ variables is very time consuming even on present com-
puters.
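A compact sketch of such an iteration, for Sammon's version of the stress criterion, is given below (names ours; a fixed step size stands in for the step-size heuristics of the original, and the data matrix D of original interpoint distances is assumed given):

import numpy as np

def sammon(D, m=2, iters=300, lr=0.3, seed=0):
    # minimizes sum_(i<j) (D_ij - d_ij)^2 / D_ij (up to a constant factor)
    # by gradient descent on all N*m coordinates simultaneously
    rng = np.random.default_rng(seed)
    N = D.shape[0]
    Y = rng.normal(scale=0.1, size=(N, m))       # random starting configuration
    c = D[np.triu_indices(N, k=1)].sum()         # normalizing constant
    for _ in range(iters):
        diff = Y[:, None] - Y[None]              # (N, N, m) coordinate differences
        d = np.sqrt((diff ** 2).sum(-1)) + np.eye(N)   # avoid divide-by-zero
        w = (D - d) / (D + np.eye(N)) / d        # per-pair gradient weights
        np.fill_diagonal(w, 0.0)
        grad = -(2.0 / c) * (w[..., None] * diff).sum(axis=1)
        Y -= lr * grad                           # move against the gradient
    return Y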
Some attempts have been made to develop noniterative mapping algorithms.
Olsen and Fukunaga specified the intrinsic dimensionality $n_0$ by determining
dominant local eigenvalues and selecting the set of $n_0$ eigenvectors for each local
region. These $n_0$ dimensional subspaces were transformed into a common sub-
space [13]. Also, Koontz and Fukunaga found a mapping function of interpoint
distances which preserves the class separability as well as the structure of the data
distribution [14].
[Fig. 5 appears here: a schematic of the optimum mapping for classification, with components $y_i(X) = \eta_i(X)$.]
Fig. 5. The optimum mapping for classification.
ered necessary for a criterion, since these transformations should not change the
class separability. After a whitening transformation ($A^T K_0 A = I$) and a coordi-
nate shift,

$J = f(\eta_1, \eta_2, K_0) = f(0, \mu, I)$   (5)

where $\mu = A^T(\eta_2 - \eta_1)$. Further application of any unitary transformation does
not change the criterion value. Therefore, it may be stated that the criterion value
of (5) depends only on the length of the vector $\mu$, that is,

$J = g(d_{12}^2)$   (6)

where

$d_{12}^2 = \mu^T \mu = (\eta_2 - \eta_1)^T K_0^{-1} (\eta_2 - \eta_1).$   (7)
$\frac{\partial g}{\partial d_{12}^2}\, \delta d_{12}^2 = 0.$   (8)
Eq. (8) reveals that, regardless of the selection of f, the optimization of f is to find
Y(X) which maximizes $d_{12}^2 = (\eta_2 - \eta_1)^T K_0^{-1} (\eta_2 - \eta_1)$. Although $\partial g / \partial d_{12}^2 = 0$
gives another optimum condition of f, this is an undesired solution, since the
solution does not depend on the functional form of Y(X).
EXAMPLE 3.1. One example of a functional form for $f(\eta_1, \eta_2, K_0)$, which is
commonly used in discriminant analysis [22], is

$J = \operatorname{tr} S_w^{-1} S_b = \frac{\pi_1 \pi_2 d_{12}^2}{1 - \pi_1 \pi_2 d_{12}^2}.$   (9)
EXAMPLE 3.4. The following are examples in which the criterion values
depend on the selected coordinates:

$J = \frac{\operatorname{tr} S_w}{\operatorname{tr} S_m}, \qquad J = \operatorname{tr} S_w + \mu \operatorname{tr} S_m \quad (\mu:\ \text{Lagrange multiplier}).$   (12)
$J = g(\mu_2^T \mu_2, \ldots, \mu_j^T \mu_k, \ldots, \mu_M^T \mu_M)$   (13)

$\sum_{j=2}^{M} \sum_{k=2}^{M} \left. \frac{\partial g}{\partial (\mu_j^T \mu_k)}\, \delta(\mu_j^T \mu_k) \right|_{Y(X) = Y^*(X)} = 0$   (14)
where $Y^*(X)$ is the optimum Y(X). This is equivalent to optimizing the criter-
ion J',

$J' = \sum_{j=2}^{M} \sum_{k=2}^{M} a_{jk}\, \mu_j^T \mu_k = \operatorname{tr} B^T H^T M^T K_0^{-1} M H B$   (15)
where

$M = \left[ \sqrt{\pi_1}\,(\eta_1 - \bar{\eta}), \ldots, \sqrt{\pi_M}\,(\eta_M - \bar{\eta}) \right],$   (16)

$H = \begin{bmatrix} -1/\pi_1 & \cdots & -1/\pi_1 \\ 1/\pi_2 & & 0 \\ & \ddots & \\ 0 & & 1/\pi_M \end{bmatrix},$   (17)

$BB^T = [a_{jk}].$   (18)
where $\lambda_{jk}$ is the cost of choosing class k given class j occurs. Eq. (19) may be
rewritten as

Since the first term in (21) is independent of Y(X), minimizing (21) with respect
to Y(X) is equivalent to maximizing the second term in (21) or to maximizing
(15) with $HB = \Lambda$.

Thus, it may be concluded that the optimization of $f(\eta_1, \ldots, \eta_M, K_0)$ is
equivalent to the minimization of the mean-square-error of the Bayes risk
estimate as in (19). And the selection of a criterion function of the form (13)
corresponds essentially to the selection of a cost matrix $\Lambda$ in the Bayes risk
estimate.
$\lambda_{jk} = \begin{cases} 1/\pi_j & \text{for } j = k, \\ 0 & \text{for } j \neq k. \end{cases}$   (23)
Although this result still does not aid in the estimation of the posterior
probability functions, there is a way to estimate the optimum value of f without
computing the posterior probabilities. Thus, when a data set is given, and a
feature set is proposed, one can estimate f*, the criterion value for the optimum
feature set, and compare it with the criterion value for the proposed feature set. If
both are close, the proposed feature set may be acceptable. If not, a better feature
set should be found.
The jth component of $\eta_i^*$ is given by (24).
$E\{\xi_j(X)\xi_i(X)\}$ is the asymptotic error between class j and class i due to the
nearest neighbor classification [21]. Therefore, the classification errors of the
nearest neighbor rule can be estimated in the original X-space, and thus
$f(\eta_1^*, \ldots, \eta_M^*, K_0^*)$ can be computed through (24), (25), and the given functional
form of f.
$y_k(X) = \sum_{j=1}^{L} a_{jk}\, \varphi_j(X) \qquad \text{or} \qquad Y(X) = A^T \Phi(X).$   (26)
Using (21), the problem becomes to maximize the following criterion with respect
to A:

$J = \operatorname{tr} \Lambda^T (A^T D)^T (A^T S_0 A)^{-1} (A^T D) \Lambda = \operatorname{tr} (A^T S_0 A)^{-1} (A^T D \Lambda \Lambda^T D^T A)$   (27)

where D and $S_0$ are the matrices in the $\Phi$-space which correspond to M of (16)
and $K_0$ in the Y-space. D and $S_0$ become $A^T D$ and $A^T S_0 A$ respectively after the
transformation of (26).
The optimum solution of (26) specifies that the column vectors of A should be
the eigenvectors of $S_0^{-1} D \Lambda \Lambda^T D^T$ with nonzero eigenvalues [22]. Since the rank of
D is M − 1, there are M − 1 nonzero eigenvalues leading to M − 1 columns in A.
The subspace spanned by these M − 1 eigenvectors is identical to the subspace
References
[1] Shepard, R. N. (1962). The analysis of proximities: multidimensional scaling with an unknown
distance function, I. Psychometrika 27, 125-140.
[2] Shepard, R. N. (1962). The analysis of proximities: multidimensional scaling with an unknown
distance function, II. Psychometrika, 27, 219-245.
[3] Kruskal, J. B. (1964). Multidimensional scaling by optimizing goodness of fit to a nonmetric
hypothesis. Psychometrika 29, 1-28.
[4] Kruskal, J. B. (1964). Nonmetric multidimensional scaling: a numerical method. Psychometrika
29, 115-129.
[5] Shepard, R. N. and Carroll, J. D. (1966). Parametric representations of nonlinear data structures.
In: P. R. Krishnaiah, ed., Multivariate Analysis. Academic Press, New York.
[6] Fukunaga, K. and Olsen, D. R. (1971). An algorithm for finding the intrinsic dimensionality of
data. IEEE Trans. Comput. 20, 176-183.
[7] Trunk, G. V. (1968). Statistical estimation of the intrinsic dimensionality of data collections.
Inform. and Control. 12, 508-525.
[8] McLaughlin, J. A. and Raviv, J. (1968). Nth order autocorrelation in pattern recognition.
Inform. and Control 12, 121-142.
[9] Fukunaga, K. and Hostetler, L. D. (1975). The estimation of the gradient of a density function
with application in pattern recognition. IEEE Trans. Inform. Theory 21, 32-40.
[10] Bennett, R. S. (1969). The intrinsic dimensionality of signal collections. IEEE Trans. Inform.
Theory 15, 517-525.
[11] Sammon, Jr., J. W. (1969). A nonlinear mapping algorithm for data structure analysis. IEEE
Trans. Comput. 18, 401-409.
[12] Calvert, T. W. and Young, T. Y. (1969). Randomly generated nonlinear transformations for
pattern recognition. IEEE Trans. Systems Sci. Cybernet. 5, 266-273.
[13] Olsen, D. R. and Fukunaga, K. (1973). Representation of nonlinear data surfaces, IEEE Trans.
Comput. 22, 915-922.
[14] Koontz, W. L. G. and Fukunaga, K. (1972). A nonlinear feature extraction algorithm using
distance transformation. IEEE Trans. Comput. 21, 56-63.
[15] Devijver, P. A. (1973). Relationships between statistical risks and the least-mean-square-error
design criterion in pattern recognition. Proc. First Internat. Joint Conf. Pattern Recognition, 139-
148, Washington, DC.
[16] Otsu, N. (1972). An optimal nonlinear transformation based on variance criterion for pattern
recognition--I. Its derivation. Bull. Electrotechnical Laboratory 36, 815-830.
[17] Otsu, N. (1973). An optimal nonlinear transformation based on variance criterion for pattern
recognition--II. Its properties and experimental confirmation. Bull. Electrotechnical Laboratory
37, 283-295.
[18] Fukunaga, K. and Ando, S. (1977). On a nonlinear feature extraction. IEEE Trans. Inform.
Theory 23, 453-459.
360 Keinosuke Fukunaga
[19] Fukunaga, K. and Short, R. D. (1978). Nonlinear feature extraction with a general criterion
function. IEEE Trans. Inform. Theory 24, 600-607.
[20] Fukunaga, K. and Short, R. D. (1978). A class of feature extraction criteria and its relation to
the Bayes risk estimate. IEEE Trans. Inform. Theory 26, 59-65.
[21] Cover, T. M. and Hart, P. E. (1967). Nearest neighbor pattern classification. IEEE Trans.
Inform. Theory 13, 21-27.
[22] Fukunaga, K. (1972). Introduction to Statistical Pattern Recognition. Academic Press, New York.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 361-382

Structural Methods in Image Analysis and Recognition

L. N. Kanal, B. A. Lambird and D. Lavine
1. Introduction
The field of image processing has grown enormously in recent years. Among
the many areas of application are remote sensing for natural resource evaluation,
industrial parts inspection, cartographic feature extraction, and x-ray analysis. In
these applications considerable information about the probable contents of
images is often available. The amount and complexity of this information have
prevented full use of this knowledge. Many approaches to the structuring of
image knowledge have been studied; in this chapter we provide an elementary
introduction to some of the major approaches.
A wide variety of techniques has been developed for image analysis. One
common approach consists of image segmentation using some type of similarity
criterion for grouping areas within an image followed by measurement of result-
ing region properties such as shape and texture. Finally, these measurements are
used to classify the regions into types by computing the similarity of these
measurements to those of a set of training regions. Statistical methods such as
discriminant analysis and Bayesian classifiers have been used for this classifica-
tion step. This type of approach has not been adequate to handle many problems
in which objects are formed from substructures, and the regions surrounding an
object are important for the identification of the object.
The primary goal of structural pattern recognition procedures has been the
recognition of objects in an image. In some tasks such as industrial parts
inspection, there is often only one object present, and the identity of the object is
known a priori. The problem in this situation is to determine the position of the
object and perform visual inspection tasks automatically. For remote sensing
applications, many objects may be present in the image. The types of objects
present may, in some cases, not be known during algorithm development.
Structural methods are appealing because they allow the designer or user of a
pattern recognition system to employ a somewhat intuitive description of an
object as the basis for a recognition scheme. Unfortunately, these methods can be
time-consuming to implement and use. In spite of such drawbacks, there are
many situations in which structural methods appear necessary, especially when
there is considerable variation in factors such as image size, location, shape, and
color. Such variations frequently arise in remote sensing. These variations are a
result of factors such as lighting, viewing angle, type of sensor, seasonal changes
in crops, water content of the soil, and elevation.
A fundamental part of many structural recognition systems is a search space.
This space may be explicitly stored in a computer or implicitly stored and
dynamically generated as in a grammar. Often measures of merit are defined on
those parts of the search space which have been examined. These measures may
provide a notion of distance between parts of the search space already searched
and the goal of the search. In addition they are often used to direct the next step
in the search. There may be more than one goal and belief in the goal, once
reached, may be complete or probabilistic.
Numerous types of search spaces and problem representations may be found in
the literature. Grammars, stochastic grammars (see the chapter by K. S. Fu),
production rule systems, predicate calculus systems, semantic nets, state spaces,
and AND/OR graphs (see the chapter by G. C. Stockman), are a few of the more
common types of problem representations. Within each of the above representa-
tions, various subclasses of representations have been implemented. In addition,
more than one search algorithm has been defined for each of these search spaces.
Furthermore, many structural recognition systems are quite large and their
performance is strongly affected by the problem domain, the extent of expert
domain knowledge employed, and general system design strategies.
[A diagram of a pattern built from the primitives a and b appears here.]
and tail respectively, and the production rules indicate that W may be replaced by
bab, T by babab, and S by abWbTbWb.
The notation

$\rho\alpha\delta \underset{G}{\Rightarrow} \rho\beta\delta$
The division of grammars into these four categories is useful in the design of
procedures for finding derivations. Efficient procedures exist for handling regular
and context-free grammars, while the complexity of context-sensitive and unre-
stricted grammars in general makes them infeasible to handle. Context-free
grammars do not appear to be powerful enough to handle some applications.
New types of grammars representing a compromise between context-free and
context-sensitive grammars have been designed to alleviate these problems. Some
of those types of grammars, such as programmed grammars or indexed grammars
have been used to describe non-context-free constructions in programming lan-
guages.
Grammars can also be classified in a different way. A grammar is deterministic
if at each step in a derivation there is only one possible action that can be taken,
i.e. there are no alternatives. A grammar is stochastic if probabilities are assigned
to each alternative, indicating the likelihood of that alternative. A grammar is
nondeterministic if there are a finite number of choices at any step, and
probabilities are not used.
Stochastic grammars have received considerable attention in recent years (Fu,
1982). By using such grammars for parsing it is possible to attach a probability to
each derivation of a string. Furthermore, the probabilities attached to rewriting
rules can be used to guide the search for derivations of a string. In any step of a
derivation, the search procedure merely applies that rewrite rule which has the
highest probability among all rewrite rules which are applicable. Other applicable
rules may be tried, in order of decreasing probability, at this point in the
derivation. The probability attached to a derivation is the product of the
probabilities of the rewrite rules used in the derivation. By using the probabilities
attached to the rewrite rules to guide the search for derivations we are attempting
to apply the best rewrite rule at each point in the search for derivations.
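A toy sketch of generating a sentence while accumulating its derivation probability follows (the rules echo the grammar of Fig. 5(a) below; the attached probabilities are invented for illustration):

import random

# each nonterminal maps to a list of (probability, right-hand side) alternatives
RULES = {
    'S': [(0.7, ['d', 'A', 'b', 'c']), (0.3, ['d', 'S'])],
    'A': [(0.6, ['a', 'A']), (0.4, ['c'])],
}

def derive(symbol='S'):
    # expand nonterminals by sampling a rule; the probability of the whole
    # derivation is the product of the probabilities of the rules applied
    if symbol not in RULES:                  # terminal symbol
        return [symbol], 1.0
    r, acc = random.random(), 0.0
    for p, rhs in RULES[symbol]:
        acc += p
        if r <= acc:
            out, prob = [], p
            for s in rhs:
                sub, subp = derive(s)
                out += sub; prob *= subp
            return out, prob

sentence, prob = derive()
print(''.join(sentence), prob)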
A strong motivation for the use of stochastic grammars lies in the relationship
between probabilities on rewrite rules and probabilities on sentences. If we view
the objects we are attempting to recognize as being generated by some type of
probabilistic mechanism, then each object corresponds to a sentence and so a
probability can be attached to each sentence. Under certain conditions, it is
possible to find, for such a probability distribution, a stochastic grammar which
yields the same probability distribution on sentences.
The previous sections described grammars that only allow concatenation as the
relationship between subpatterns. This section will briefly touch on grammars
that are much more complicated but allow complex relations (Fu, 1974; Kanal,
1974; Gonzalez and Thomason, 1978). These grammars are based on graph
theoretical principles (Harary, 1969).
Tree grammars have productions where the terminals and non-terminals are
trees. However, since trees can be represented by lists or strings, tree grammars
can be written in string grammar form. Fig. 3 shows an example of a tree
grammar and a derivation. Tree grammars have been used in the detection of
texture (Lu and Fu, 1978) and in analysis of bubble chamber photographs (Fu
and Bhargava, 1973). Texture can be defined in a hierarchical way, i.e., as
subpatterns occurring repeatedly in a highly specified manner. This definition
leads naturally to a tree grammar, where small regions of some grey level are the
primitive patterns.
Web grammars use webs as the subpatterns, where webs are undirected, labeled
graphs. Web grammars can describe more general patterns than tree or string
grammars. However, the increased flexibility requires much more complicated
production rules. In the string or tree grammar case it was easy to see how to
substitute the new subpattern. In the web grammar case, it is necessary to specify
how to connect the new subpattern. For example, suppose α is to be replaced by
β (Fig. 4(a)) in the web pattern shown. There is more than one possible way to do
this substitution, as demonstrated. As a result, web productions are written as
triples (α, β, f), where f is a function that specifies how to join the nodes of
subweb β to the neighbors of each node of the removed subweb α.
Web grammars have been used in the classification of LANDSAT data (Brayer
and Fu, 1976). Their graph model for the scene is shown in Fig. 4(b). The graph is
not a tree because of the presence of the relationships 'surround', 'near', and
'range'.
Fig. 3. An example of a tree grammar and derivation: (a) production rules; (b) derivation of a pattern.
Plex grammars have even more complicated production rules. Plex structures
have an arbitrary number of attaching points for joining to other structures. The
production rules describe the connectivity by providing lists of labeled concatena-
tion points. Plex grammars have been used to describe chemical structures (Feder,
1971).
In the above discussion it should be noted that the more complicated the
primitives are, the more complicated the form of the production rules. However,
with the more complicated grammars, the number of production rules in the
grammar needed to describe a given pattern may be a great deal less. This
tradeoff of primitive type versus grammar complexity becomes important in the
recognition process discussed next.
[Figure: an example of a web grammar G_w = (N, Σ, P, S), with N = {S} and Σ = {a, b, c}; P consists of triples whose embedding function f(a, S) = {b, c} specifies how the nodes of the new subweb are joined to the neighbors of the replaced node, shown together with a sample derivation.]
Recognition of a pattern with respect to a grammar is accomplished by
syntax analysis, or finding the derivation. If the process begins with the start
symbol and attempts to find the derivation by progressively expanding the
non-terminals (usually done leftmost first), then the process is called top-down. If
the parse begins with the terminal symbols and progressively attempts to replace
them with the left-hand side of the production rules, then the parse is called
bottom-up. Since at each step a wrong choice can be made, if more than one
choice is possible, the parsing process can be inefficient.
Top-down parsing is 'goal-oriented'. Suppose the start symbol S has the
production S → P1 P2 ⋯ Pn. If P1 is a terminal symbol, then the pattern
must start with P1; if P1 is a non-terminal, then P1 becomes a subgoal and the P1
productions are examined. This process continues until all the Pi's are recognized.
If at any point a Pi cannot be recognized, then an alternative S production is
examined. Fig. 5(a) shows a grammar, a pattern string to be recognized, and the
top-down parsing process.
Top-down parsing techniques are conceptually simple since they can use the
structure of the grammar to direct the parsing. In addition, since the desired
primitive is known at each step, primitive detection can be tailored to the desired
primitive.
Fig. 5(a). Grammar G = ({S, A}, {a, b, c, d}, P, S), with production rules
S → dAbc, S → dS, A → aA, A → c.
Top-down parse of the string ddacbc:
1. S (initially)
2. dS (by S → dS)
3. ddAbc (by S → dAbc)
4. ddaAbc (by A → aA)
5. ddacbc (by A → c)
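The top-down process just illustrated can be rendered as a small backtracking recognizer. The following Python sketch implements the Fig. 5(a) grammar; it is an illustration of the idea, not code from the text.

# Backtracking top-down recognition for the grammar of Fig. 5(a):
# S -> dAbc | dS,  A -> aA | c.
PRODUCTIONS = {
    'S': [['d', 'A', 'b', 'c'], ['d', 'S']],
    'A': [['a', 'A'], ['c']],
}

def parse(goal, s, pos=0):
    # Return the set of end positions to which `goal` can expand from s[pos].
    if goal not in PRODUCTIONS:                         # terminal symbol
        return {pos + 1} if s[pos:pos + 1] == goal else set()
    ends = set()
    for rhs in PRODUCTIONS[goal]:                       # try each alternative
        positions = {pos}
        for symbol in rhs:                              # expand left to right
            positions = {e for p in positions for e in parse(symbol, s, p)}
        ends |= positions
    return ends

def recognize(s):
    return len(s) in parse('S', s)                      # all input consumed?

print(recognize('ddacbc'))   # True:  S => dS => ddAbc => ddaAbc => ddacbc
print(recognize('dacbd'))    # False: no derivation exists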
Bottom-up parsers, on the other hand, require all the primitives to be detected
before parsing. Fig. 5(b) shows an example of bottom-up parsing. At each step
some part of the pattern string is replaced by the left side of a production rule
until the start symbol is reached.
A general comparison of bottom-up and top-down methods is difficult (Fu,
1974). Present top-down parsers can accept a wider class of grammars but are
less efficient. Many efficient special-purpose bottom-up parsing techniques are
available for various restricted grammars.
The one-directional analysis used by bottom-up parsers or top-down parsers
causes problems in difficult applications such as cartographic feature recognition.
Bottom-up methods do not take advantage of a priori knowledge during segmen-
tation, while strictly top-down methods can inefficiently generate hypotheses that
are consistent with the model but in no way related to a given instance of data. A
method which combines the bottom-up and top-down methodologies is presented
in (Stockman, 1977; Stockman and Kanal, 1982). This method is able to proceed
in a bi-directional manner: model-directed and data-confirmed. The analysis is
not confined to a canonical scan of the data such as row-by-row and left-to-right.
Multiple or ambiguous interpretations are allowed and are developed on a
best-first basis. This ambiguity is permissible at both the segmentation level and
the structural analysis level. The algorithm has subsequently been extended to
allow parallel development of the interpretations (Kanal and Kumar, 1981).
The non-directional parsing is accomplished by combining artificial intelligence
problem-solving techniques with formal language theory. The problem-reduction
representation (PRR) approach of artificial intelligence subdivides the original
problem into a set of subproblems which are in turn subdivided. The subdivision
continues until primitive subproblems are reached whose solutions are trivial.
Informally, each production in the grammar can be thought of as a subproblem,
the terminal symbols represent primitive subproblems, and start productions
represent the original problem. This representation can then be searched for an
optimal solution using various techniques. For more details see (Nilsson, 1980;
Stockman, 1977; Stockman and Kanal, 1982).
Stockman applied this procedure to one-dimensional waveforms and experi-
mented with carotid pulse waves. The system (WAPSYS) successfully recognized
the pulse waves and automatically extracted features. This work is now being
extended to images (Lambird, 1982).
2.5. Attributed grammars

In an attributed grammar, for each symbol V ∈ (N ∪ Σ) there exists a finite set
of attributes A(V), where each attribute a of A(V) has a set of possible values
D_a. The production rules, P, have two parts: the syntactic part, with the form
V0 → V1 V2 ⋯ Vm, where V0 ∈ N and Vi ∈ V for 1 ≤ i ≤ m, and the semantic
part, which is a set of functions or procedures that show how the attributes of
each Vi are generated from those of the other Vj in the production.
Attributes in an attributed grammar may be used for directing parsing and for
assigning a measure of merit to a parse. Owing to the lack of restrictions on the
semantic part of an attributed grammar, no general statements can be made
concerning the efficiency of using attributes for directing parsing. For simple
semantic rules attributes have proven useful for directing parsing in various
applications.
In image processing applications attributes may represent object properties
such as length. A semantic rule may in this case assign to the left-hand side of a
production rule the sum of the lengths of the right-hand side attributes.
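As a concrete illustration of such a semantic rule, the sketch below propagates a hypothetical 'length' attribute from the right-hand side of a production to its left-hand side; the symbols and attribute names are assumptions for illustration only.

# Semantic part of a production LINE -> SEGMENT SEGMENT: the 'length'
# attribute of the left-hand side is the sum of the right-hand-side
# lengths.  All names here are illustrative.
def apply_production(lhs_symbol, rhs_nodes, semantic_rule):
    return {'symbol': lhs_symbol,
            'children': rhs_nodes,
            'attributes': semantic_rule(rhs_nodes)}

sum_lengths = lambda rhs: {'length': sum(n['attributes']['length'] for n in rhs)}

seg1 = {'symbol': 'SEGMENT', 'children': [], 'attributes': {'length': 3.0}}
seg2 = {'symbol': 'SEGMENT', 'children': [], 'attributes': {'length': 4.5}}
line = apply_production('LINE', [seg1, seg2], sum_lengths)
print(line['attributes'])   # {'length': 7.5}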
2.6. Error correcting parsing
3. Artificial intelligence
3.1. Frames
A frame (Winston, 1977) is a scheme for representing knowledge. It represents
a stereotyped situation such as a prototype of a house or a car. Several types of
information may be contained in a frame. Generally there is a description of
those properties which are to be found in all the observed variants of the
prototype. In addition, a set of initially empty areas called slots are present. These
slots contain information describing a specific instance of the type of object
represented by the prototype. For a frame representing an image, the slots may
contain a description of a specific object, such as a house, within the image. The
frame may also contain information on how the frame is to be used, what it can
expect to see next, and what to do if it does not see what it expects to see.
A slot in a frame may contain another frame. Thus frames can be linked
together to represent complex relationships between frames. Spatial relationships
such as adjacency and part-whole relationships can be represented using frames.
Because of the lack of restrictions on the type of information and procedures
which may be stored within frames, it is difficult to draw any conclusions about
the value of frame systems in general. Frames, as well as closely related systems
such as semantic nets, have been used heavily in recent years, and further
development of such systems is likely to continue.
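As a concrete illustration, a frame can be sketched as a simple data structure with named slots; the slot names and the 'house' prototype below are illustrative assumptions, not drawn from any particular frame system.

# A frame holds a stereotyped prototype plus slots describing a
# specific instance; a slot may itself contain another frame, which is
# how frames are linked to represent relationships.
class Frame:
    def __init__(self, prototype, **slots):
        self.prototype = prototype
        self.slots = dict(slots)

    def fill(self, slot, value):
        self.slots[slot] = value            # value may itself be a Frame

house = Frame('house', roof=None, walls=None)
house.fill('walls', Frame('wall', made_of='brick'))
print(house.slots['walls'].slots['made_of'])   # brick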
Fig. 6. Representation of the statement: entity-1 is made of brick.
Fig. 7. The network that represents the production rule: "a road that goes into a mountain is a
tunnel".
[Figure: semantic network describing a scene, with nodes 'Docked', 'Ship', and 'Coastline' (I = instance; LD = location descriptor).]
in the road following procedure is used as a correction vector for the location of
the next image feature point to be detected. The experimental results indicate
high accuracy in road point verification and extensions of the method appear
promising.
The representation of knowledge in KMS can take several forms. In all forms,
the user must supply a set of attributes, the range of possible values for these
attributes, and a set of relationships among the attributes. The relationships can
be given in four forms:
(1) Production rules,
(2) User-defined mathematical formulas,
(3) Bayesian inferences,
(4) Feature description.
In the KMS system measures of belief can be attached to each attribute on the
right-hand side of a production rule. A rule is provided by the system for
combining these measures of belief into a measure of belief for the left hand side.
In all cases, attributes refer to variables whose values can be computed or
assigned by the user. For example, we may have an attribute bridge, which has a
value determined by two other attributes, (1) water and (2) a strip of pavement of
specified dimension over water. In each of the first three types of relationships we
may assign some measure of belief or probability to the attributes water and strip
of pavement over water. Each system will then compute a number which may be a
probability, measure of belief, etc., for the entity bridge.
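The text does not spell out the combination rule KMS actually provides, so the sketch below substitutes a simple product rule as a stand-in, only to show the flavor of combining right-hand-side beliefs into a belief for the left-hand side; the attribute names and values are illustrative.

# Stand-in belief combination (a product rule, NOT necessarily the KMS
# rule): belief in 'bridge' from beliefs in its right-hand-side attributes.
from math import prod

beliefs = {'water': 0.9, 'pavement_strip_over_water': 0.8}

def combine(rhs_beliefs):
    return prod(rhs_beliefs.values())

print(combine(beliefs))   # 0.72 -> belief assigned to 'bridge'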
Production rules were described in Section 3.2. The user-defined mathematical
formulas are functions, such as regression functions, mapping values of a set of
attributes into the value of another attribute. Bayesian inferences refer to applica-
tions of Bayes rule to calculate the value of an attribute using the values of other
attributes. Finally, feature descriptions consist of a set of attributes describing
another attribute together with probabilities of the presence of these attributes in
the attribute being described. A hypothesis and test inference mechanism is
available with the feature descriptions.
At present, in all but the feature description form, KMS seeks to assign
values to attributes by exhaustive top-down directed search through the knowl-
edge base using an inference mechanism. This is made possible by the KMS
restriction that the relationships between attributes be ordered in a hierarchical
fashion. Thus for each attribute A, one can find a set of attribute relations and an
ordering of those relations such that the system can start with user-supplied
attribute values and apply these relations in order to compute the value of
attribute A. The KMS control structure could be modified to allow for alternate
types of search, though this requires modification of the code and not a simple set
of user commands to the system.
The goal of the KMS subsystem DESCRIPTION is to provide an explanation
of a set of data using a minimum number of explanatory factors. Hypotheses are
formed based on an initial set of data. These hypotheses are then used to generate
new questions whose answers can be used to test the hypotheses. This process is
iterated until an acceptable explanation has been found.
The mathematical basis (Stoffel, 1974) of the definition of a minimal explana-
tion is found in the theory of minimal covers of sets. Let S, S_1, …, S_k be sets. A
minimal cover of S is defined to be a subset {S_{i_1}, …, S_{i_m}} of {S_1, …, S_k} such
that S ⊆ S_{i_1} ∪ ⋯ ∪ S_{i_m} and, for any other subset {S_{j_1}, …, S_{j_n}} of {S_1, …, S_k},
S_{j_1} ∪ ⋯ ∪ S_{j_n} ⊇ S implies n ≥ m. In our application, each S_i may be viewed as an
explanation of the elements of S_i, and we are interested in explaining all the
elements of S using the minimum number of explanations. The goal of the
description section of KMS is to determine such minimal explanations, though
the minimality of the cover produced by KMS has not been proven. The worst-
case running time of existing minimal cover algorithms is exponential, though
experience with KMS has indicated that actual running times are short.
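For illustration, the standard greedy approximation to minimal set cover, which runs quickly at the cost of a possibly non-minimal answer, can be sketched as follows; this is not the KMS algorithm itself, and the sets are hypothetical.

# Greedy set cover: repeatedly pick the candidate set that explains the
# most still-unexplained elements of S.
def greedy_cover(S, candidates):
    uncovered, cover = set(S), []
    while uncovered:
        best = max(candidates, key=lambda c: len(uncovered & candidates[c]))
        if not uncovered & candidates[best]:
            raise ValueError('S cannot be covered by the given sets')
        cover.append(best)
        uncovered -= candidates[best]
    return cover

S = {1, 2, 3, 4, 5}
candidates = {'S1': {1, 2, 3}, 'S2': {3, 4}, 'S3': {4, 5}, 'S4': {1}}
print(greedy_cover(S, candidates))   # ['S1', 'S3']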
4. Relaxation
4.4. Relaxation summary
Although much success has been claimed with relaxation methods, it is not
clear that relaxation would perform better than the top-down methods presented
Acknowledgment
The preparation of this article was supported in part by NSF grant ECS-7822159
to the Laboratory for Pattern Analysis and in part by L.N.K. Corporation,
Silver Spring, MD.
References
Ballard, D. H., Brown, C. M., and Feldman, J. A. (1978). An approach to knowledge-directed image
analysis. In: A. R. Hanson and E. M. Riseman, eds., Computer Vision Systems, 271-281. Academic
Press, New York.
Brayer, J. and Fu, K. (1976). Application of a web grammar model to an ERTS picture. Proc. Third
Internat. Joint Conference on Pattern Recognition, 405-410. Coronado, CA.
Davis, L. and Henderson, T. (1981). Hierarchical constraint processes for shape analysis. IEEE Trans.
Pattern Analysis and Machine Intelligence 3, 265-277.
Davis, L. and Rosenfeld, A. (1978). Cooperative processes for waveform parsing. In: A. Hanson and
E. Riseman, eds., Computer Vision Systems. Academic Press, New York.
Feder, J. (1971). Plex languages. Inform. Sci. 3, 225-241.
Fu, K. (1974). Syntactic Methods in Pattern Recognition. Academic Press, New York.
Fu, K. (1982). Syntactic Pattern Recognition and Applications. Prentice-Hall, Englewood Cliffs, NJ.
Fu, K. and Bhargava, B. (1973). Tree systems for syntactic pattern recognition. IEEE Trans. Comput.
22 (12) 1087-1099.
Fu, K. and Booth, T. (1975). Grammatical inference: introduction and survey--Part I. IEEE Trans.
Systems Man Cybernet. 5 (1) 95-111.
Fu, K. and Booth, T. (1975). Grammatical inference: introduction and survey--Part II. IEEE Trans.
Systems Man Cybernet. 5 (4) 409-423.
Gonzalez, R. C. and Thomason, M. G. (1978). Syntactic Pattern Recognition--An Introduction.
Addison-Wesley, Reading, MA.
Harary, F. (1969). Graph Theory, Addison-Wesley, Reading, MA.
Hummel, R. and Zucker, S. (1980). On the foundations of relaxation labeling processes. Computer
Vision and Graphics Laboratory, Tech. Rept. TR-80-7. McGill University, Montreal.
Kanal, L. (1974). Patterns in pattern recognition: 1968-1974. IEEE Trans. Inform. Theory 20 (6)
697-722.
Kanal, L. and Kumar, V. (1981). Parallel implementations of a structural analysis algorithm. Proc.
Conf. Pattern Recognition and Image Processing, 452-458. Dallas.
Knuth, D. E. (1968). Semantics of context-free languages. Math. Systems Theory 2, 127-142.
Lambird, B. A. and Kanal, L. N. (1982). Syntactic analysis of images. In progress, Dept. of Computer
Science, University of Maryland, College Park, MD.
Lu, S. and Fu, K. (1978). A syntactic approach to texture analysis. Comput. Graphics Image Process. 7
(3) 303-330.
Levine, M. D. and Shaheen, S. I. (1981). A modular computer vision system for picture segmentation
and interpretation. IEEE Trans. Pattern Analysis and Machine Intelligence 3, 540-556.
Marr, D. (1977). Artificial intelligence--A personal view. Artificial Intelligence 9 (1) 37-48.
Nilsson, N. (1980). Principles of Artificial Intelligence. Tioga, Palo Alto, CA, 2nd ed.
Ostroff, B., Lambird, B., Lavine, D. and Kanal, L. (1982). HRPS--Hierarchical relaxation parsing
system. Tech. Note, Lab. for Pattern Analysis, Department of Computer Science, University of
Maryland, College Park, MD.
Reggia, J. A. (1981). Knowledge-based decision support systems: development through KMS. Ph.D.
Thesis, University of Maryland; ibid., Tech. Rept. TR. 1121, Comput. Sci. Center, Univ. of
Maryland, College Park, MD.
Rosenfeld, A., Hummel, R. and Zucker, S. (1976). Scene labelling by relaxation operations. IEEE
Trans. Systems Man Cybernet. 6, 420-433.
Shaw, A. (1968). The formal description and parsing of pictures. Stanford Linear Accelerator Center,
Rept. SLAC-84, Stanford University, Stanford, CA.
Shaw, A. (1969). A formal picture description scheme as a basis for picture processing systems.
Inform. Contr. 14, 9-52.
Shaw, A. (1970). Parsing of graph-representable pictures. J. ACM 17, 453-481.
Stockman, G. C. (1977). A problem-reduction approach to the linguistic analysis of waveforms. Tech.
Rept. TR-538, University of Maryland.
Stockman, G. C. (1978). Toward automatic extraction of cartographic features. U.S. Army Engineer
Topographic Laboratory, Rept. No. ETL-0153, Fort Belvoir, VA.
Stockman, G. C. and Kanal, L. N. (1982). A problem reduction approach to the linguistic analysis of
waveforms. IEEE Trans. Pattern Analysis and Machine Intelligence, to appear.
Stockman, G. C., Lambird, B. A., Lavine, D. and Kanal, L. N. (1981). Knowledge-based image
analysis, U.S. Army Engineer Topographic Laboratory, Rept. ETL-0258, Fort Belvoir, VA.
Stoffel, J. C. (1974). A classifier design technique for discrete variable pattern recognition problems.
IEEE Trans. Comput. 23, 428-444.
Tenenbaum, J. M., Fischler, M. A. and Wolf, H. C. (1978). A scene-analysis approach to remote
sensing. Stanford Research Institute Internat. Tech. Note 173.
Waltz, D. (1975). Understanding line drawings of scenes with shadows. In: Winston, ed., The
Psychology of Computer Vision. McGraw-Hill, New York.
Winston, P. H. (1977). Artificial Intelligence. Addison-Wesley, Reading, MA.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 383-397
Image Models

Narendra Ahuja and Azriel Rosenfeld

1. Introduction
Time series analysis [10] has been extensively used [38, 60, 61] to model
relationships among gray levels of a given pixel and those preceding it in the scan
of a texture. An image is raster scanned to provide a series of gray level
fluctuations, which is treated as a stochastic process evolving in 'time'. The future
course of the process is presumed to be predictable from information about its
past.
Before summarizing the models, we review some of the commonly used
notation in time series.
Let Z̃_i = Z_i − μ. Let {a_t} be a series of outputs of a white noise source, with
mean zero and variance σ_a². Let B be the 'backward' shift operator, such that
BZ̃_t = Z̃_{t−1} and hence B^m Z̃_t = Z̃_{t−m}; and let ∇ be the 'backward'
difference operator, such that
∇Z̃_t = Z̃_t − Z̃_{t−1} = (1 − B)Z̃_t.
Autoregressive model (AR)
In this model the current Z-value depends on the previous p Z-values and on
the current noise term:
Z̃_t = φ_1 Z̃_{t−1} + φ_2 Z̃_{t−2} + ⋯ + φ_p Z̃_{t−p} + a_t.
If we let φ(B) = 1 − φ_1 B − ⋯ − φ_p B^p, this can be written compactly as
φ(B) Z̃_t = a_t.
This process can be repeated to eventually yield an expression for Z̃_t as an infinite
series in the a's.
The moving average model allows a finite number q of previous a-values in the
expression for Z̃_t. This explicitly treats the series as being observations on linearly
filtered Gaussian noise.
Letting
dependence gives rise to more complex parameter estimation problems [9, 12].
Interestingly, a frequency domain treatment makes parameter estimation in
bilateral representation much easier [13].
2.2.1. Global models
Global models treat an entire image as the realization of a random field.
Different image features may be modeled by a random field, and the field may be
specified in different ways. An important model for height fields has been used by
oceanographers [31-33, 49] interested in the patterns formed by waves on the
ocean surface. Longuet-Higgins [31-33] treats the ocean surface as a random field
satisfying the following assumptions:
(a) the wave spectrum contains a single narrow band of frequencies, and
(b) the wave energy is being received from a large number of different sources
whose phases are random.
Considering such a random field, he obtains [32] the statistical distribution of
wave heights and derives relations between the root mean square wave height, the
mean height of the highest p% of the waves, and the most likely height of the
largest wave in a given interval of time.
In subsequent papers [31, 32], Longuet-Higgins obtains a set of statistical
relations among parameters describing (a) a random moving surface [31], and (b)
a Gaussian isotropic surface [32].
Some of his results are:
(1) The probability distribution of the surface elevation, and the magnitude and
orientation of the gradient.
(2) The average number of zero crossings per unit distance along an arbitrarily
placed line transect.
(3) The average contour length per unit area.
(4) The average density of maxima and minima.
(5) The probability distribution of the heights of maxima and minima.
All results are expressed in terms of the two-dimensional energy spectrum up to
a finite order only. The converse of the problem is also studied and solved. That
is, given certain statistical properties of the surface, a convergent sequence of
approximations to the energy spectrum is determined.
The analogy between this work and image processing, and the significance of
the results obtained therein, is obvious. Fortunately the assumptions made are
also acceptable for images.
Panda [47] uses this approach to analyze background regions selected from
Forward Looking InfraRed (FLIR) imagery. He derives expressions for (a)
density of border points and (b) average number of connected components along
a row of the thresholded image. Panda [46] also uses the same model to predict
operators. For example, for a first order Markov field, decorrelation may be
achieved by using a Laplacian operator [50]. The whitened field estimate of the
independent identically distributed noise process will only identify the spatial
operator in terms of the autocorrelation function, which is not unique. Thus, the
white noise probability density and spatial filter do not, in general, make up a
complete set of descriptors [51]. To generate a texture, the procedure can be
reversed by generating a white noise image having the computed statistics, and
then applying the inverse of the whitening filter.
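A minimal sketch of this reversed procedure, assuming for illustration that the whitening filter is the 4-neighbor Laplacian mentioned above and that its inverse is applied in the frequency domain (this is not the cited authors' code):

import numpy as np

n = 64
noise = np.random.randn(n, n)                 # white noise image

lap = np.zeros((n, n))
lap[0, 0] = 4.1                               # small offset keeps the filter
lap[0, 1] = lap[1, 0] = lap[0, -1] = lap[-1, 0] = -1.0   # invertible

H = np.fft.fft2(lap)                          # transfer function of the filter
texture = np.real(np.fft.ifft2(np.fft.fft2(noise) / H))  # inverse whitening
print(texture.shape, round(texture.std(), 3))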
Several authors describe models for the Earth's surface. Freiberger and
Grenander [16] reason that the Earth's surface is too irregular to be represented
by an analytic function having a small number of free parameters. Nevertheless,
landscapes possess strong continuity properties. They suggest using stochastic
processes derived from physical principles. Mandelbrot [35] uses a Poisson-Brown
surface to give a first approximation to the Earth's relief. The Earth's surface is
assumed to have been formed by the superimposition of very many, very small
cliffs along straight faults. The positions of the faults and the heights of the cliffs
are assumed random and independent. The irregularity predicted by the model is
excessive. Mandelbrot suggests that the surface could be made to resemble some
terrain more closely by introducing anisotropy into ridge directions. Mandelbrot's
model is often used in computer graphics to generate artificial terrain scenes.
Adler [2] presents a theoretical treatment of Brownian sheets and relates them to
the rather esoteric mathematical concept of Hausdorff dimension.
Recursive solutions based upon differential (difference) equations are common
in one-dimensional signal processing. This approach has been generalized to
two dimensions. Jain [29] investigates the applicability of three kinds of random
fields to the image modeling problem, each characterized by a different class of
partial differential equations (PDE's). A digital shape is defined by a finite
difference approximation of a PDE. The class of hyperbolic PDE's is shown to
provide more general causal models than autoregressive moving average models.
For a given spectral density (or covariance) function, parabolic PDE's can
provide causal, semicausal, or even noncausal representations. Elliptical PDE's
provide noncausal models that represent two-dimensional discrete Markov fields.
They can be used to model both isotropic and nonisotropic imagery. Jain argues
that the PDE model is based upon a well-established mathematical theory.
Furthermore, there exists a considerable body of computer software for numerical
solutions. The PDE model also obviates the need for spectral factorization, thus
eliminating the restriction of separable covariance function. System identification
techniques may be used for choosing a PDE model for a given class of images.
In the absence of any knowledge or assumption about the global process
underlying a given image, models of the joint gray level probability density and
its derivative properties may be used. Among models of the joint density for
pixels in a window, the multivariate normal has been the one most commonly
used because of its tractability. However, it has been found to have limited
applicability. Hunt [25, 26] points out that stationary Gaussian models are based
upon an oversimplification. Consider the vector F of the picture points obtained
R(τ_1, τ_2) = σ² exp[−α_1|τ_1| − α_2|τ_2|].
If φ_mn = φ_m0 φ_0n,
then the process becomes a multiplicative process in which the influence of rows
and columns on the autocorrelation is separable. Thus
ρ_ij = ρ_i0 ρ_0j.
Tou et al. consider fitting a model to a given texture. The choice among the
autoregressive, moving average and mixed models, as well as the choice of the
order of the process, is made by comparing the behavior of some observed
statistical property, e.g., the autocorrelation function, with that predicted by each
of the different models. The values of the model parameters are determined so as
to minimize, say, the least square error in fit. In a subsequent paper Tou and
Chang [61] use the maximum likelihood principle to optimize the values of the
parameters, in order to obtain a refinement of the preliminary model as suggested
by the autocorrelation function.
A bilateral dependence in two dimensions is more complex than in the
one-dimensional case discussed earlier. Once again, a simpler, unilateral model
may be obtained by making a point depend on the points in the rows above it, as
well as on those to its left in its own row. Whittle [63] gives the following reasons
in recommending working with the original two-dimensional model:
(1) The dependence on a finite number of lattice neighbors, for example a finite
autoregression in two dimensions, may not always have a unilateral representa-
tion that is also a finite autoregression.
(2) The real usefulness of the unilateral representation is that it suggests a
simplifying change of parameters. For most two-dimensional models, however,
the appropriate transformation, even if evident, is so complicated that nothing is
gained by performing it. It may be pointed out that frequency domain analysis
for parameter estimation [13] may prove useful here too.
Two-dimensional Markov random fields have been investigated for represent-
ing textures. A wide sense Markov field representation aims at obtaining linear
dependence of a pixel property, say its gray level, on the gray levels of certain
other pixels so as to minimize the mean square error between the actual and the
estimated values. This requires that the error terms of various pixels be uncorre-
lated random variables. A strict sense Markov field representation involves
specification of the probability distribution of the gray level given the gray levels
of certain other pixels. Although processes of both these types have been
investigated, more experimental work has been done on the former.
Woods [65] shows that the strict sense Markov field differs from a wide sense
field only in that the error variables in the former have a specific correlation
structure, whereas the errors in the latter are uncorrelated. He points out
restrictions on the strict sense Markov field representation under which it yields a
model for non-Markovian processes. The condition under which a general non-
causal Markov dependence reduces to a causal one is also specified.
Abend et al. [1] introduce Markov meshes to model dependence of a pixel on a
certain immediate neighborhood. Using Markov chain methods on the sequences
of pixels from various neighborhoods, they show that in many cases a causal
dependence translates into a noncausal dependence. For example, the dependence
of a pixel on its west, northwest and north neighbors translates into dependence
on all eight neighbors. Interestingly, the causal neighborhood that results in a
4-neighbor noncausal dependence is not known in their formulation, although in
the Gauss Markov formulation of Woods [65] such an explicit dependence is
allowed. In this sense Woods' definition of a Markov field is more general than
the Markov meshes of Abend et al. [1].
Hassner and Sklansky [21] also discuss a Markov random field model for
images. They present an algorithm to generate a texture from an initial random
configuration. The Markov random field is characterized by a set of independent
parameters that specify a consistent collection of nearest neighbor conditional
probabilities.
Deguchi and Morishita [14] use a noncausal model for the dependence of a
pixel on its neighborhood. The coefficients of linear dependence are determined
by minimizing the mean square estimation error. The resulting two-dimensional
estimator characterizes the texture. They use such a characterization for classifica-
tion and for segmentation of images consisting of more than one textural region.
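A minimal sketch of such least-squares estimation, assuming for illustration a 4-neighbor noncausal neighborhood (the text does not fix the neighborhood used):

import numpy as np

def neighbor_coefficients(img):
    # Regress each interior pixel on its north, south, west, east neighbors.
    center = img[1:-1, 1:-1].ravel()
    neighbors = np.stack([img[:-2, 1:-1].ravel(),    # north
                          img[2:, 1:-1].ravel(),     # south
                          img[1:-1, :-2].ravel(),    # west
                          img[1:-1, 2:].ravel()],    # east
                         axis=1)
    coef, *_ = np.linalg.lstsq(neighbors, center, rcond=None)
    return coef      # the estimator that characterizes the texture

img = np.random.rand(32, 32)                 # stand-in for a texture sample
print(neighbor_coefficients(img))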
Jain and Angel [27] use a 4-neighbor autoregression to model a given, not
necessarily separable, autocorrelation function. They obtain values of the autore-
gression coefficients in terms of the autocorrelation function. However, their
representation involves error terms that are uncorrelated with each other and with
the non-noisy pixel gray level values. As pointed out by Panda and Kak [45], the
two assumptions about the error terms are incompatible for Markov random
fields [65]. Jain and Angel [27] point out that a 4-neighbor Markov dependence
can represent a large number of physical processes such as steady state diffusion,
random walks, birth and death processes, etc. They also propose 8-neighbor [27]
and 5-neighbor (the 8 neighbors excluding the northeast, east, and southeast
neighbors) [27, 28] models.
Wong [64] discusses characterization of second order random fields (having
finite first and second moments) from the point of view of their possible use in
representing images. He considers various properties of a two-dimensional ran-
dom field and their implications in terms of its second-order properties. Some of
the results he obtains are as follows:
(1) There is no continuous Gaussian random field of two dimensions (or higher
dimensions) which is both homogeneous and Markov (degree 1).
(2) If the covariance function is invariant under translation as well as rotation,
then it can only depend upon the Euclidean distance. The second-order properties
of such fields (Wong calls them homogeneous) are characterizable in terms of a
single one-dimensional spectral distribution.
Wong generalizes his notion of homogeneity to include random fields that are
not homogeneous, but can be easily transformed into homogeneous fields. Even
this generalized class of fields is no more complicated than a one-dimensional
stationary process.
Lu and Fu [34] identify repetitive subpatterns in some highly regular textures
from Brodatz [11] and design a local descriptor of the subpattern in an enumer-
ative way by generating each of the pixels in the window individually. The
subpattern description is done by specifying a grammar whose productions
generate a window in several steps. For example, starting from the top left corner
rows may be generated by a series of productions, while other productions will
generate individual pixels within the rows. The grammar used may also be
stochastic.
These models use the notion of a structural primitive. Both the shapes of the
primitives and the rules to generate the textures from the primitives may be
specified statistically.
Matheron [36] and Serra and Verchery [58] propose a model that views a binary
texture as produced by a set of translations of a structural element. All locations
of the structural elements such that the entire element lies within the foreground
of the texture are identified. Note that there may be (narrow) regions which
cannot be covered by any placement of the structural element, as all possible
arrangements of the element that cover a given region may not lie completely
within the foreground. Thus only an 'eroded' version of the image can be spanned
by the structural element which is used as the representation of the original
image. Textural properties can be obtained by appropriately parameterizing the
structure element. For a structural element consisting of two pixels at distance d,
the eroded image represents the autocovariance function of the original image at
distance d. More complicated structural elements would provide a generahzed
autocovariance function which has more structural information. Matheron and
Serra show how the generalized covariance function can be used to obtain various
texture features.
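The erosion step can be sketched as follows, here with the two-pixel structural element at horizontal distance d mentioned above; the wrap-around border handling via np.roll is a simplification for illustration.

import numpy as np

def erode(binary, offsets):
    # Keep position p only if p + o is foreground for every offset o.
    out = np.ones_like(binary, dtype=bool)
    for dy, dx in offsets:
        out &= np.roll(np.roll(binary, -dy, axis=0), -dx, axis=1)
    return out

texture = np.random.rand(16, 16) > 0.5       # stand-in binary texture
d = 3
eroded = erode(texture, [(0, 0), (0, d)])    # two pixels, distance d apart
print(eroded.mean())   # surviving fraction, related to autocovariance at d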
Zucker [67] views a real texture as a distortion of an ideal texture. The ideal
texture is a spatial layout of cellular primitives along a regular or semiregular
tessellation. Randomness is introduced by distorting the primitives using certain
transformations.
Yokoyama and Haralick [66] describe a growth process to synthesize textures.
Their method consists of the following steps:
(a) Mark some of the pixels in a clean image as seeds.
(b) The seeds grow into curves called skeletons.
(c) The skeletons thicken to become regions.
(d) The pixels in the regions thus obtained are transformed into gray levels in
the desired range.
(e) A probabilistic transformation is applied, if desired, to modify the gray level
cooccurrence probability in the final image.
The distribution processes in (a) and the growth processes in (b) and (c) can be
deterministic or random. The dependence of the properties of the images gener-
ated on the nature of the underlying operations is not obtained.
A class of models called mosaic models, based upon random, planar pattern
generation processes, have been considered by Ahuja [3-6], Ahuja and Rosenfeld
[7] and Schachter, Davis, and Rosenfeld [56]. Schachter and Ahuja [55] describe a
set of random processes that produce a variety of piecewise uniform random
planar patterns having regions of different shapes and relative placements. These
patterns are analyzed for various geometrical and topological properties of the
components, and for the pixel correlation properties in terms of the model
parameters [3-6]. Given an image and various feature values measured on it, the
relations obtained above are used to select the appropriate model.
The syntactic model of Lu and Fu [34] discussed earlier can also be interpreted
as a region based model, if the subpattern windows are viewed as the primitive
regions. Although the model used by Nahi and Jahanshahi [43], and Nahi and
Lopez-Mora [44], discussed earlier, is pixel based, the function y carries informa-
tion about the borders of various regions. Thus, under the constraint that all
regions except the background are convex, the model can also be interpreted as a
region based model.
4. Discussion
Region based models can act as pixel based models. For the case of images on
grids this is easy to see. Consider a subpattern that consists of a single pixel. The
region shapes are thus trivially specified. It is obvious that the region characteris-
tics and their relative placement rules can be designed so as to mimic the pixel
and joint pixel properties of a pixel based model, since both have control over the
same set of primitives and can incorporate the same types of interaction. On the
other hand if we are dealing with images that are structured, i.e. that have planar
clusters of pixels such that pixels within a cluster are related in a different way
than pixels across clusters, then we must make such a provision in the model
definition. Such a facility is unavailable in pixel based models, whereas the use of
regions as primitives serves exactly this purpose. The pixel based models are
acceptable for images where there are no well-defined spatial regional primitives.
Region based models appear to be more appropriate for representing many
natural textures, which do usually consist of regions.
Many texture studies are basically technique oriented and describe texture
feature detection and classification schemes which are not based upon any
underlying model of the texture. We do not discuss these here; see [19, 41, 59] for
several examples and more references.
Acknowledgment
The support of the U.S. Air Force Office of Scientific Research under Grant
AFOSR-77-3271 to the University of Maryland and of Joint Services Electronics
Program (U. S. Army, Navy and Air Force) under Contract N00014-79-C-0424 to
the University of Illinois is gratefully acknowledged as is the help of Kathryn
Riley and Chris Jewell in preparing this paper.
References
[1] Abend, K., Harley, T. J. and Kanal, L. N. (1965). Classification of binary random patterns.
IEEE Trans. Inform. Theory 11, 538-544.
[2] Adler, R. J. (1978). Some erratic patterns generated by the planar Wiener process. Suppl. Adv.
Appl. Probab. 10, 22-27.
[3] Ahuja, N. (1979). Mosaic models for image analysis and synthesis. Ph.D. dissertation, Depart-
ment of Computer Science, University of Maryland, College Park, MD.
[4] Ahuja, N. (1981). Mosaic models for images, I: Geometric properties of components in cell
structure mosaics. Inform. Sci. 23, 69-104.
[5] Ahuja, N. (1981). Mosaic models for images, II: Geometric properties of components in
coverage mosaics. Inform. Sci. 23, 159-200.
[6] Ahuja, N. (1981). Mosaic models for images, III: Spatial correlation in mosaics. Inform. Sci. 24,
43-69.
[7] Ahuja, N. and Rosenfeld, A. (1981). Mosaic models for textures. IEEE Trans. Pattern Analysis
and Machine Intelligence 3, 1-11.
[8] Angel, E. and Jain, A. K. (1978). Frame-to-frame restoration of diffusion images. IEEE Trans.
Automat. Control 23, 850-855.
[9] Bartlett, M. S. (1975). The Statistical Analysis of Spatial Pattern. Wiley, New York.
[10] Box, G. E. P. and Jenkins, G. M. (1976). Time Series Analysis. Holden-Day, San Francisco.
[11] Brodatz, P. (1966). Textures: A Photographic Album for Artists and Designers. Dover, New York.
[12] Brook, D. (1964). On the distinction between the conditional probability and joint probability
approaches in the specification of nearest-neighbor systems. Biometrika 51, 481-483.
[13] Chellappa, R. and Ahuja, N. (1979). Statistical inference theory applied to image modeling.
Tech. Rept. TR-745, Department of Computer Science, University of Maryland, College Park,
MD.
[14] Deguchi, K. and Morishita, I. (1976). Texture characterization and texture-based image parti-
tioning using two-dimensional linear estimation techniques. U.S.-Japan Cooperative Science
Program Seminar on Image Processing in Remote Sensing, Washington, DC.
[15] Franks, L. E. (1966). A model for the random video process. Bell System Tech. J. 45, 609-630.
[16] Freiberger, W. and Grenander, U. (1976). Surface patterns in theoretical geography. Report 41,
Department of Applied Mathematics, Brown University, Providence, RI.
[17] Gagalowicz, A. (1978). Analysis of texture using a stochastic model. Proc. Fourth Internat. Joint
Conf. Pattern Recognition, 541-544.
[18] Habibi, A. (1972). Two dimensional Bayesian estimate of images. Proc. IEEE 60, 878-883.
[19] Haralick, R. M. (1978). Statistical and structural approaches to texture. Proc. Fourth Internat.
Joint Conf. Pattern Recognition, 45-69.
[20] Haralick, R. M., Shanmugam, K. and Dinstein, I. (1973). Textural features for image classifica-
tion. IEEE Trans. Systems Man Cybernet. 3, 610-621.
[21] Hassner, M. and Sklansky, J. (1978). Markov random field models of digitized image texture.
Proc. Fourth Internat. Joint Conf. Pattern Recognition, 538-540.
[22] Hawkins, J. K. (1970). Textural properties for pattern recognition. In: B. S. Lipkin and A.
Rosenfeld, eds., Picture Processing and Psychopictorics, 347-370. Academic Press, New York.
[23] Huang, T. S. (1965). The subjective effect of two-dimensional pictorial noise. IEEE Trans.
Inform. Theory 11, 43-53.
[24] Huijbregts, C. (1975). Regionalized variables and quantitative analysis of spatial data. In: J.
Davis and M. McCullagh, eds., Display and Analysis of Spatial Data, 38-51. Wiley, New York.
[25] Hunt, B. R. (1977). Bayesian methods in nonlinear digital image restoration. IEEE Trans.
Comput. 26, 219-229.
[26] Hunt, B. R. and Cannon, T. M. (1976). Nonstationary assumptions for Gaussian models of
images. IEEE Trans. Systems Man Cybernet. 6, 876-882.
[27] Jain, A. K. and Angel, E. (1974). Image restoration, modelling, and reduction of dimensionality.
IEEE Trans. Comput. 23, 470-476.
[28] Jain, A. K. (1977). A semi-causal model for recursive filtering of two-dimensional images. IEEE
Trans. Comput. 26, 343-350.
[29] Jain, A. K. (1977). Partial differential equations and finite-difference methods in image
processing, Part 1: Image representation. J. Optim. Theory Appl. 23, 65-91.
[30] Kretzmer, E. R. (1952). Statistics of television signals. Bell System Tech. J. 31, 751-763.
[31] Longuet-Higgins, M. S. (1957). The statistical analysis of a random moving surface. Phil. Trans.
Roy. Soc. London Ser. A 249, 321-387.
[32] Longuet-Higgins, M. S. (1957). Statistical properties of an isotropic random surface. Phil. Trans.
Roy. Soc. London Ser. A 250, 151-171.
[33] Longuet-Higgins, M. S. (1952). On the statistical distribution of the heights of sea waves. J.
Marine Res. 11, 245-266.
[34] Lu, S. Y. and Fu, K. S. (1978). A syntactic approach to texture analysis. Comput. Graphics
Image Process. 7, 303-330.
[35] Mandelbrot, B. (1977). Fractals--Form, Chance, and Dimension. Freeman, San Francisco.
[36] Matheron, G. (1967). Elements pour une Theorie des Milieux Poreux. Masson, Paris.
[37] Matheron, G. (1971). The theory of regionalized variables and its applications. Les Cahiers du
Centre de Morphologie Math. de Fontainebleau 5.
[38] McCormick, B. H. and Jayaramamurthy, S. N. (1974). Time series model for texture synthesis.
Internat. J. Comput. Inform. Sci. 3, 329-343.
[39] McCormick, B. H. and Jayaramamurthy, S. N. (1975). A decision theory method for the analysis
of texture. Internat. J. Comput. Inform. Sci. 4, 1-38.
[40] Michalski, R. S. and McCormick, B. H. (1971). Interval generalization of switching theory. Proc.
Third Annual Houston Conf. Computing System Science, 213-226. Houston, TX.
[41] Mitchell, O. R., Myers, C. R. and Boyne, W. (1977). A max-min measure for image texture
analysis. IEEE Trans. Comput. 26, 408-414.
[42] Muerle, J. L. (1970). Some thoughts on texture discrimination by computer. In: B. S. Lipkin and
A. Rosenfeld, eds., Picture Processing and Psychopictorics, 371-379. Academic Press, New York.
[43] Nahi, N. E. and Jahanshahi, M. H. (1977). Image boundary estimation. IEEE Trans. Comput.
26, 772-781.
[44] Nahi, N. E. and Lopez-Mora, S. (1978). Estimation detection of object boundaries in noisy
images. IEEE Trans. Automat. Control 23, 834-845.
[45] Panda, D. P. and Kak, A. C. (1977). Recursive least squares smoothing of noise in images. IEEE
Trans. Acoust. Speech Signal Process. 25, 520-524.
[46] Panda, D. P. and Dubitzki, T. (1979). Statistical analysis of some edge operators. Comput.
Graphics Image Process. 9, 313-348.
[47] Panda, D. P. (1978). Statistical properties of thresholded images. Comput. Graphics Image
Process. 8, 334-354.
[48] Pickett, R. M. (1970). Visual analysis of texture in the detection and recognition of objects. In:
B. S. Lipkin and A. Rosenfeld, eds., Picture Processing and Psychopictorics, 289-308. Academic
Press, New York.
[49] Pierson, W. J. (1952). A unified mathematical theory for the analysis, propagation, and
refraction of storm generated surface waves. Department of Meteorology, New York University,
New York.
[50] Pratt, W. K. and Faugeras, O. D. (1978). Development and evaluation of stochastic-based visual
texture features. Proc. Fourth Internat. Joint Conf. Pattern Recognition, 545-548.
[51] Pratt, W. K., Faugeras, O. D., and Gagalowicz, A. (1978). Visual discrimination of stochastic
texture fields. IEEE Trans. Systems Man Cybernet. 8, 796-804.
[52] Read, J. S. and Jayaramamurthy, S. N. (1972). Automatic generation of texture feature
detectors. IEEE Trans. Comput. 21, 803-812.
[53] Rosenfeld, A. and Kak, A. C. (1976). Digital Picture Processing. Academic Press, New York.
[54] Rosenfeld, A. and Lipkin, B. S. (1970). Texture synthesis. In: B. S. Lipkin and A. Rosenfeld,
eds., Picture Processing and Psychopictorics, 309-322. Academic Press, New York.
[55] Schachter, B. and Ahuja, N. (1979). Random pattern generation processes. Comput. Graphics
Image Process. 10, 95-114.
[56] Schachter, B. J., Davis, L. S. and Rosenfeld, A. (1978). Random mosaic models for textures.
IEEE Trans. Systems Man Cybernet. 8, 694-702.
[57] Schachter, B. J. (1980). Long crested wave models for Gaussian fields. Comput. Graphics Image
Process. 12, 187-201.
[58] Serra, J. and Verchery, G. (1973). Mathematical morphology applied to fibre composite
materials. Film Science and Technology 6, 141-158.
[59] Thompson, W. B. (1977). Texture boundary analysis. IEEE Trans. Comput. 26, 272-276.
[60] Tou, J. T. and Chang, Y. S. (1976). An approach to texture pattern analysis and recognition.
Proc. IEEE Conf. Decision and Control, 398-403.
[61] Tou, J. T., Kao, D. B., and Chang, Y. S. (1976). Pictorial texture analysis and synthesis. Proc.
Third Internat. Joint Conf. Pattern Recognition.
[62] Trussel, H. J. and Kruger, R. P. (1978). Comments on 'Nonstationary assumptions for Gaussian
models of images'. IEEE Trans. Systems Man Cybernet. 8, 579-582.
[63] Whittle, P. (1954). On stationary processes in the plane. Biometrika 41, 434-449.
[64] Wong, E. (1968). Two-dimensional random fields and representations of images. SIAM J. Appl.
Math. 16, 756-770.
[65] Woods, J. W. (1972). Two-dimensional discrete Markovian fields. IEEE Trans. Inform. Theory
18, 232-240.
[66] Yokoyama, R. and Haralick, R. M. (1978). Texture synthesis using a growth model. Comput.
Graphics Image Process. 8, 369-381.
[67] Zucker, S. (1976). Toward a model of texture. Comput. Graphics Image Process. 5, 190-202.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 399-415

Image Texture Survey*

Robert M. Haralick
1. Introduction
*A full text of this paper by the same author appeared in Proc. IEEE 67 (5) (1979) 786-804, under
the title "Statistical and Structural Approaches to Texture." Reprinted with permission. ©1979 IEEE.
There have been eight statistical approaches to the measurement and char-
acterization of image texture: autocorrelation functions, optical transforms, dig-
ital transforms, textural edgeness, structural elements, spatial gray tone co-
occurrence probabilities, gray tone run lengths, and auto-regressive models. An
early review of some of these approaches is given by Hawkins (1970). The first
three of these approaches are related in that they all measure spatial frequency
directly or indirectly. Spatial frequency is related to texture because fine textures
are rich in high spatial frequencies while coarse textures are rich in low spatial
frequencies.
An alternative to viewing texture as spatial frequency distribution is to view
texture as amount of edge per unit area. Coarse textures have a small number of
edges per unit area. Fine textures have a high number of edges per unit area.
The structural element approach of Serra (1974) and Matheron (1967) uses a
matching procedure to detect the spatial regularity of shapes called structural
elements in a binary image. When the structural elements themselves are single
resolution cells, the information provided by this approach is the autocorrelation
function of the binary image. By using larger and more complex shapes, a more
generalized autocorrelation can be computed.
The gray tone spatial dependence approach characterizes texture by the co-
occurrence of its gray tones. Coarse textures are those for which the distribution
changes only slightly with distance and fine textures are those for which the
distribution changes rapidly with distance.
The gray level run length approach characterizes coarse textures as having
many pixels in a constant gray tone run and fine textures as having few pixels in a
constant gray tone run.
The auto-regressive model is a way to use linear estimates of a pixel's gray tone
given the gray tones in a neighborhood containing it in order to characterize
texture. For coarse textures, the coefficients will all be similar. For fine textures,
the coefficients will have wide variation.
The power of the spatial frequency approach to texture is the familiarity we
have with these concepts. However, one of the inherent problems is in regard to
gray tone calibration of the image. The procedures are not invariant under even a
linear translation of gray tone. To compensate for this, probability quantizing can
be employed. But the price paid for the invariance of the quantized images under
monotonic gray tone transformations is the resulting loss of gray tone precision in
the quantized image. Weszka, Dyer and Rosenfeld (1976) compare the effective-
ness of some of these techniques for terrain classification. They conclude that
spatial frequency approaches perform significantly more poorly than the other ap-
proaches.
The power of the structural element approach is that it emphasizes the shape
aspects of the tonal primitives. Its weakness is that it can only do so for binary
images.
The power of the co-occurrence approach is that it characterizes the spatial
inter-relationships of the gray tones in a textural pattern and can do so in a way
that is invariant under monotonic gray tone transformations. Its weakness is that
it does not capture the shape aspects of the tonal primitives. Hence, it is not likely
to work well for textures composed of large-area primitives.
The power of the auto-regressive linear estimator approach is that it is easy to
use the estimator in a mode which synthesizes textures from any initially given
linear estimator. In this sense, the auto-regressive approach is sufficient to capture
everything about a texture. Its weakness is that the textures it can characterize are
likely to consist mostly of micro-textures.
For an image I(u, v) defined over a rectangle of sides L_x and L_y, the normalized
autocorrelation function for shift (x, y) is
ρ(x, y) = { [1/((L_x − |x|)(L_y − |y|))] ∬ I(u, v) I(u + x, v + y) du dv } /
          { [1/(L_x L_y)] ∬ I²(u, v) du dv }.
If the tonal primitives on the image are relatively large, then the autocorrelation
will drop off slowly with distance. If the tonal primitives are small, then the
autocorrelation will drop off quickly with distance. To the extent that the tonal
primitives are spatially periodic, the autocorrelation function will drop off and
rise again in a periodic manner. The relationship between the autocorrelation
function and the power spectral density function is well known: they are Fourier
transforms of one another (Yaglom, 1962).
The tonal primitive in the autocorrelation model is the gray tone. The spatial
organization is characterized by the correlation coefficient which is a measure of
the linear dependence one pixel has on another.
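A discrete counterpart of the normalized autocorrelation above might be sketched as follows, for non-negative shifts (x, y); the random test image is purely illustrative.

import numpy as np

def autocorrelation(I, x, y):
    Lx, Ly = I.shape
    overlap = I[:Lx - x, :Ly - y] * I[x:, y:]        # I(u,v) * I(u+x, v+y)
    num = overlap.sum() / ((Lx - x) * (Ly - y))      # normalized numerator
    den = (I ** 2).sum() / (Lx * Ly)                 # normalized energy
    return num / den

img = np.random.rand(64, 64)
print(autocorrelation(img, 0, 0))   # 1.0 at zero shift
print(autocorrelation(img, 5, 3))   # smaller than 1 away from zero shift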
An experiment was carried out by Kaizer (1955) to see if the autocorrelation
function had any relationship to the texture which photointerpreters see in
images. He used a series of seven aerial photographs of an Arctic region and
determined the autocorrelation function of the images with a spatial correlator
which worked in a manner similar to the one envisioned in our thought experi-
ment. Kaizer assumed the autocorrelation function was circularly symmetric and
computed it only as a function of radial distance. Then for each image, he found
the distance d such that the autocorrelation function ρ at d took the value
1/e: ρ(d) = 1/e.
Kaizer then asked 20 subjects to rank the seven images on a scale from fine
detail to coarse detail. He correlated the rankings with the distances correspond-
ing to the ( 1 / e ) t h value of the autocorrelation function. He found a correlation
coefficient of 0.99. This established that at least for his data set, the autocorrela-
tion function and the subjects were measuring the same kind of textural features.
Kaizer noticed, however, that even though there was a high degree of correla-
tion between ρ⁻¹(1/e) and subject rankings, some subjects put first what ρ⁻¹(1/e)
put fifth. Upon further investigation, he discovered that a relatively flat back-
ground (indicative of high frequency or fine texture) can be interpreted as a fine
textured or coarse textured area. This phenomenon is not unusual and actually
points out a fundamental characteristic of texture: it cannot be analyzed without
a reference frame of tonal primitive being stated or implied. For any smooth gray
tone surface there exists a scale such that when the surface is examined, it has no
texture. Then as resolution increases, it takes on a fine texture and then a coarse
texture. In Kaizer's situation, the resolution of his spatial correlator was not good
enough to pick up the fine texture which some of his subjects saw in an area that
had a weak but fine texture.
correct classification rate was obtained using only spectral information. This rate
increased to 98.5% when textural information was also included in the analysis.
These researchers reported no significant difference in the classification accuracy
as a function of which transform was employed.
Pratt (1978) and Pratt, Faugeras and Gagalowitz (1978) suggest measuring
texture by the coefficients of the linear filter required to decorrelate an image and
by the first four moments of the gray level distribution of the decorrelated image.
They have shown promising preliminary results.
The linear dependence which one image pixel has on another is well known and
can be measured by the autocorrelation function. This linear dependence is
exploited by the autoregression texture characterization and synthesis model
developed by McCormick and Jayaramamurthy (1974) to synthesize textures.
McCormick and Jayaramamurthy used the Box and Jenkins (1970) time series
seasonal analysis method to estimate the parameters of a given texture. These
estimated parameters and a given set of starting values were then used to
illustrate that the synthesized texture was close in appearance to the given texture.
Deguchi and Morishita (1978), Tou, Kao and Chang (1976) and Tou and Chang
(1976) used similar techniques.
The autoregressive model for texture synthesis begins with a randomly gener-
ated noise image. Then, given any sequence of K synthesized gray level values in
its immediately past neighborhood, the next gray level value can be synthesized as
a linear combination of those values plus a linear combination of the previous L
random noise values. The coefficients of these linear combinations are the
parameters of the model. Texture analysis work based on this model requires the
identification of these coefficient values from a given texture image.
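A minimal sketch of this synthesis model in raster order, with illustrative (assumed) values for K, L and the coefficients:

import numpy as np

K, L, n = 2, 2, 64 * 64
phi = np.array([0.6, 0.3])        # weights on the K previous gray levels
theta = np.array([0.5, 0.25])     # weights on the L previous noise values
a = np.random.randn(n)            # driving white noise sequence

g = np.zeros(n)
for t in range(n):
    past_g = g[max(0, t - K):t][::-1]     # most recent synthesized values first
    past_a = a[max(0, t - L):t][::-1]     # most recent noise values first
    g[t] = phi[:len(past_g)] @ past_g + theta[:len(past_a)] @ past_a + a[t]

texture = g.reshape(64, 64)       # unravel the raster scan into an image
print(round(texture.mean(), 3), round(texture.std(), 3))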
Primitives and their placement rules can be used to describe textures (Rosenfeld
and Lipkin, 1970). The identification and location of a particular primitive in an
image may be probabilistically related to the identification and distribution of
primitives in its neighborhood.
Carlucci (1972) suggests a texture model using primitives of line segments, open
polygons and closed polygons in which the placement rules are given syntactically
in a graph-like language. Zucker (1976a, 1976b) conceives of a real texture as
the distortion of an ideal texture. Zucker's model, however, is more of a
competence-based model than a performance model. Lu and Fu (1978) and Tsai
and Fu (1978) use a syntactic approach to texture.
In the remainder of this section, we discuss some structural-statistical ap-
proaches to texture models. The approach is structural in the sense that primitives
are explicitly defined. The approach is statistical in that the spatial interaction, or
lack of it, between primitives is measured by probabilities.
We classify textures as weak textures or strong textures. Weak textures
are those which have weak spatial interaction between primitives. To distinguish
between them, it may be sufficient to determine only the frequency with which the
various kinds of primitive occur in some local neighborhood. Hence, weak texture
measures account for many of the statistical textural features. Strong textures are
those which have non-random spatial interactions. To distinguish between them, it
may be sufficient to determine only, for each pair of primitives, the frequency
with which the primitives co-occur in a specified spatial relationship. Thus, our
discussion will center on the variety of ways in which primitives can be defined
and the ways in which spatial relationships between primitives can be defined.
3.1. Primitives
A primitive is a connected set of resolution cells characterized by a list of
attributes. The simplest primitive is the pixel with its gray tone attribute.
Sometimes it is useful to work with primitives which are maximally connected sets
of resolution cells having a particular property. An example of such a primitive is
a maximally connected set of pixels all having the same gray tone or all having
the same edge direction.
Gray tones and local properties are not the only attributes which primitives
may have. Other attributes include measures of the shape of a connected region and
the homogeneity of its local property. For example, a connected set of resolution cells
can be associated with its length, the elongation of its shape, or the variance of its
local property.
Tsuji and Tomita (1973) and Tomita, Yachida, and Tsuji (1973) describe a
structural approach to weak texture measures. First a scene is segmented into
atomic regions based on some tonal property such as constant gray tone. These
regions are the primitives. Associated with each primitive is a list of properties
such as size and shape. Then they make a histogram of the size or shape
property over all primitives in the scene. If the scene can be decomposed into two
or more regions of homogeneous texture, the histogram will be multi-modal. If
this is the case, each primitive in the scene can be tagged with the mode in the
histogram to which it belongs. A region growing/cleaning process on the tagged
primitives yields the homogeneous textural region segmentation.
If the initial histogram modes overlap too much, a complete segmentation may
not result. In this case, the entire process can be repeated within each of the
homogeneous texture region segments found so far. If each of the homogeneous
texture regions consists of mixtures of more than one type of primitive, then the
procedure may not work at all. In this case, the technique of co-occurrence of
primitive properties would have to be used.
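The histogram-mode tagging step can be sketched as follows (the smoothing width and the local-maximum rule for modes are assumptions, and the region growing/cleaning pass is omitted):

    import numpy as np

    def tag_primitives_by_mode(sizes, bins=64, smooth=5):
        """Tag each primitive with the histogram mode nearest to its size property."""
        hist, edges = np.histogram(sizes, bins=bins)
        h = np.convolve(hist, np.ones(smooth) / smooth, mode="same")  # smoothed histogram
        # interior local maxima of the smoothed histogram are taken as the modes
        modes = [i for i in range(1, bins - 1) if h[i] >= h[i - 1] and h[i] > h[i + 1]]
        centers = (edges[:-1] + edges[1:]) / 2
        mode_vals = centers[modes]
        # each primitive receives the label of its nearest mode
        return np.argmin(np.abs(sizes[:, None] - mode_vals[None, :]), axis=1)

    sizes = np.concatenate([np.random.normal(10, 1, 200), np.random.normal(40, 3, 200)])
    labels = tag_primitives_by_mode(sizes)   # two well-separated modes -> two tags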
Zucker, Rosenfeld and Davis (1975) used a form of this technique by filtering a
scene with a spot detector. Non-maxima pixels on the filtered scene were thrown
out. If a scene has many different homogeneous texture regions, the histogram of
the relative max spot detector filtered scene will be multi-modal. Tagging the
maxima with the modes to which they belong, followed by region growing/cleaning,
then produces the segmented scene.
The idea of the constant gray level regions of Tsuji and Tomita or the spots of
Zucker, Rosenfeld, and Davis can be generalized to regions which are peaks, pits,
ridges, ravines, hillsides, passes, breaks, flats and slopes (Toriwaki and Fukumura,
1978; Peucker and Douglas, 1975). In fact, the possibilities are numerous
enough that investigators will need a long period of experimentation before
the possibilities are exhausted and understood. The next three subsections
review in greater detail some specific approaches and suggest some generaliza-
tions.
An edge image can first be computed by any one of the gradient neighborhood operators. For some specified window
centered on a given pixel, the distribution of gradient magnitudes can then be
determined. The mean of this distribution is the amount of edge per unit area
associated with the given pixel. The image in which each pixel's value is edge per
unit area is actually a defocussed gradient image. Triendl (1972) used a de-
focussed Laplacian image. Sutton and Hall (1972) used such a measure for the
automatic classification of pulmonary disease in chest X-rays.
Ohlander (1975) used such a measure to aid him in segmenting textured scenes.
Rosenfeld (1975) gives an example where the computation of gradient direction
on a defocussed gradient image is an appropriate feature for the direction of
texture gradient. Hsu (1977) used a variety of gradient-like measures.
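A minimal rendering of the edge-per-unit-area measure described above (simple first differences for the gradient and a 15 × 15 averaging window are assumptions):

    import numpy as np
    from scipy.ndimage import uniform_filter

    def edge_per_unit_area(image, window=15):
        """Mean gradient magnitude in a window: a 'defocussed gradient' image."""
        gr, gc = np.gradient(image.astype(float))      # row and column derivatives
        magnitude = np.hypot(gr, gc)                   # gradient magnitude per pixel
        return uniform_filter(magnitude, size=window)  # average over the window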
The depth of a minimum is the difference between the value of the minimum and the lowest adjacent maximum.
The width of a maximum is the distance between its two adjacent minima. The
width of a minimum is the distance between its two adjacent maxima.
Two-dimensional extrema are more complicated than one-dimensional extrema.
One way of finding extrema in the full two-dimensional sense is by the iterated
use of some recursive neighborhood operators propagating extrema values in an
appropriate way. Maximally connected areas of relative extrema may be areas of
single pixels or may be plateaus of many pixels. We can mark each pixel in a
relative extrema region of size N with the value h, indicating that it is part of a
relative extremum having height h, or mark it with the value h/N, indicating its
contribution to the relative extrema area. Alternatively, we can mark the most
centrally located pixel in the relative extrema region with the value h. Pixels not
marked can be given the value 0. Then for any specified window centered on a
given pixel, we can add up the values of all pixels in the window. This sum
divided by the window size is the average height of extrema in the area.
Alternatively we could set h to 1 and the sum would be the number of relative
extrema per unit area to be associated with the given pixel.
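For the h = 1 case the measure reduces to marking and averaging (a sketch assuming strict 3 × 3 relative maxima; note that plateaus are counted whole here rather than once, as the text prescribes):

    import numpy as np
    from scipy.ndimage import maximum_filter, uniform_filter

    def extrema_per_unit_area(image, window=15):
        """Mark 3x3 relative maxima with 1, others 0, then average in a window."""
        img = image.astype(float)
        is_max = img == maximum_filter(img, size=3)   # pixel equals its 3x3 neighborhood max
        return uniform_filter(is_max.astype(float), size=window)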
Going beyond the simple counting of relative extrema, we can associate
properties to each relative extremum. For example, given a relative maximum, we
can determine the set of all pixels reachable only by the given relative maximum
and not by any other relative maximum by monotonically decreasing paths. This
set of reachable pixels is a connected region and forms a mountain. Its border
pixels may be relative minima or saddle pixels.
The relative height of the mountain is the difference between its relative
maximum and the highest of its exterior border pixels. Its size is the number of
pixels which constitute it. Its shape can be characterized by features such as
elongation, circularity, and symmetric axis. Elongation can be defined as the ratio
of the larger to the smaller eigenvalue of the 2 × 2 second moment matrix obtained
from the (x, y) coordinates of the border pixels (Bachi, 1973; Frolov, 1975).
Circularity can be defined as the ratio of the standard deviation to the mean of
the radii from the region's center to its border (Haralick, 1975). The symmetric
axis feature can be determined by thinning the region down to its skeleton and
counting the number of pixels in the skeleton. For regions which are elongated it
may be important to measure the direction of the elongation or the direction of
the symmetric axis.
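Both features follow directly from the border coordinates (a sketch assuming the border is given as an array of (row, column) pairs and that the region is non-degenerate):

    import numpy as np

    def elongation(border):
        """Ratio of larger to smaller eigenvalue of the 2x2 second moment matrix."""
        xy = border - border.mean(axis=0)     # center the border coordinates
        m = xy.T @ xy / len(xy)               # 2x2 second moment matrix
        ev = np.linalg.eigvalsh(m)            # eigenvalues in ascending order
        return ev[1] / ev[0]

    def circularity(border):
        """Ratio of standard deviation to mean of radii from center to border."""
        radii = np.linalg.norm(border - border.mean(axis=0), axis=1)
        return radii.std() / radii.mean()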
P(t_1, t_2) is just the relative frequency with which two primitives occur in a
specified spatial relationship in the image, one primitive having property t_1 and
the other primitive having property t_2.
Zucker (1974) suggests that some textures may be characterized by the frequency
distribution of the number of primitives any primitive has related to it. This
probability p(k) is defined by

p(k) = #{ q ∈ Q | #S(q) = k } / #Q

where Q is the set of primitives in the image and S(q) is the set of primitives related to q.
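Both P(t_1, t_2) and p(k) reduce to counting once the primitives and the spatial relation are fixed (a sketch; representing the relation as an explicit list of ordered index pairs is an assumption):

    from collections import Counter

    def cooccurrence(props, related_pairs):
        """P(t1,t2): relative frequency of property pairs over related primitives."""
        counts = Counter((props[i], props[j]) for i, j in related_pairs)
        total = sum(counts.values())
        return {pair: c / total for pair, c in counts.items()}

    def neighbor_count_distribution(n_primitives, related_pairs):
        """p(k): fraction of primitives having exactly k related primitives."""
        degree = Counter()
        for i, _ in related_pairs:            # count outgoing relations per primitive
            degree[i] += 1
        hist = Counter(degree.get(q, 0) for q in range(n_primitives))
        return {k: v / n_primitives for k, v in hist.items()}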
4. Conclusion
References
Bachi, R. (1973). Geostatistical analysis of territories. Proc. 39th Session-Bull. Internat. Statist.
Institute. Vienna.
Bajcsy, R. (1972). Computer identification of textured visual scenes. Stanford Univ., Palo Alto, CA.
Bajcsy, R. (1973). Computer description of textured surfaces. Third Internat. Joint Conf. on Artificial
Intelligence, 572-578. Stanford, CA.
Bajcsy, R. and L. Lieberman (1974). Computer description of real outdoor scenes. Proc. Second
Internat. Joint Conference on Pattern Recognition, 174-179. Copenhagen, Denmark.
Bajcsy, R. and L. Lieberman (1976). Texture gradient as a depth cue. Comput. Graphics Image Process.
5 (1) 52-67.
Bartels, P., G. Bahr and G. Wied (1969). Cell recognition from line scan transition probability profiles.
Acta Cytol. 13, 210-217.
Bartels, P. H. and G. L. Wied (1975). Extraction and evaluation of information from digitized cell
images. In: Mammalian Cells: Probes and Problems, 15-28. U.S. NTIS Technical Information
Center, Springfield, VA.
Box, G. E. P. and G. M. Jenkins (1970). Time Series Analysis. Holden-Day, San Francisco, CA.
Carlton, S. G. and O. Mitchell (1977). Image segmentation using texture and grey level. In: Pattern
Recognition and Image Processing Conference, 387-391. Troy, NY.
Carlucci, L. (1972). A formal system for texture languages. Pattern Recognition 4, 53-72.
Chen, P. and T. Pavlidis (1978). Segmentation by texture using a co-occurrence matrix and a
split-and-merge algorithm. Tech. Rept. 237. Princeton Univ., Princeton, NJ.
Chien, Y. P. and K. S. Fu (1974). Recognition of x-ray picture patterns. IEEE Trans. Systems Man
Cybernet. 4 (2) 145-156.
Conners, R. W. and Ch. A. Harlow (1976). Some theoretical considerations concerning texture
analysis of radiographic images. Proc. 1976 IEEE Conf. on Decision and Control. Clearwater Beach,
FL.
Cutrona, L. J., E. N. Leith, C. J. Palermo and L. J. Porcello (1969). Optical data processing and
filtering systems. IRE Trans. Inform. Theory 15 (6) 386-400.
Darling, E. M. and R. D. Joseph (1968). Pattern recognition from satellite altitudes. IEEE Trans. Systems Sci. Cybernet. 4, 38-47.
Davis, L., S. Johns and J. K. Aggarwal (1978). Texture analysis using generalized co-occurrence
matrices. Pattern Recognition and Image Processing Conf. Chicago, IL.
Deguchi, K. and I. Morishita (1978). Texture characterization and texture-based image partitioning
using two-dimensional linear estimation techniques. IEEE Trans. Comput. 27 (8) 739-745.
Dyer, C. and A. Rosenfeld (1976). Fourier texture features: suppression of aperture effects. IEEE
Trans. Systems Man Cybernet. 6 (10) 703-706.
Ehrich, R. and J. P. Foith (1976). Representation of random waveforms by relational trees. IEEE
Trans. Comput. 26 (7) 726-736.
Ehrich, R. and J. P. Foith (1978). Topology and semantics of intensity arrays. In: Hanson and
Riseman, eds., Computer Vision Systems. Academic Press, New York.
Frolov, Y. S. (1975). Measuring the shape of geographical phenomena: a history of the issue. Soviet
Geography: Review and Translation 16 (10) 676-687.
Galloway, M. M. (1975). Texture analysis using gray level run lengths. Comput. Graphics and Image
Process. 4, 172-179.
Gilbert, E. (1962). Random subdivisions of space into crystals. Ann. Math. Statist. 33, 958-972.
Goodman, J. W. (1968). Introduction to Fourier Optics. McGraw-Hill, New York.
Gramenopoulos, N. (1973). Terrain type recognition using ERTS-1 MSS images. Symp. on Significant
Results Obtained from the Earth Resources Technology Satellite, NASA SP-327, pp. 1229-1241.
Haralick, R. M. (1971). A texture-context feature extraction algorithm for remotely sensed imagery.
Proc. 1971 IEEE Decision and Control Conf., 650-657. Gainesville, FL.
Haralick, R. M. (1975). A textural transform for images. Proc. IEEE Conf. on Computer Graphics,
Pattern Recognition and Data Structure. Beverly Hills, CA.
Haralick, R. M. and K. Shanmugam (1973). Computer classification of reservoir sandstones. IEEE
Trans. Geosci. Electronics 11 (4) 171-177.
Haralick, R. M. and K. Shanmugam (1974). Combined spectral and spatial processing of ERTS
imagery data. J. Remote Sensing of the Environment 3, 3-13.
Haralick, R. M. and K. Shanmugam (1973). Textural features for image classification. IEEE Trans.
Systems Man Cybernet. 3 (6) 610-621.
Hawkins, J. K. (1970). Textural properties for pattern recognition. In: B. S. Lipkin and A. Rosenfeld,
eds., Picture Processing and Psychopictorics. Academic Press, New York.
Horning, R. J. and J. A. Smith (1973). Application of Fourier analysis to multi-spectral/spatial
recognition. Management and Utilization of Remote Sensing Data ASP Symp. Sioux Falls, SD.
Hsu, S. (1977). A texture-tone analysis for automated landuse mapping with panchromatic images.
Proc. Amer. Soc. for Photogrammetry, 203-215.
Julesz, B. (1962). Visual pattern discrimination. IRE Trans. Inform. Theory 8 (2) 84-92.
Julesz, B. (1975). Experiments in visual perception of texture. Sci. Amer. 232, 34-43.
Kaizer, H. (1955). A quantification of textures on aerial photographs. Tech. Note 121, AD 69484.
Boston Univ. Res. Labs.
Kirvida, L. (1976). Texture measurements for the automatic classification of imagery. IEEE Trans.
Electromagnetic Compatibility 18 (1) 38-42.
Landeweerd, G. H. and E. S. Gelsema (1978). The use of nuclear texture parameters in the automatic
analysis of leukocytes. Pattern Recognition 10, 57-61.
Lantuejoul, C. (1978). Grain dependence test in a polycrystalline ceramic. In: J. L. Chermant, ed.,
Quantitative Analysis of Microstructures in Materials Science, Biology and Medicine, 40-50. Riederer,
Stuttgart.
Ledley, R. S. (1972). Texture problems in biomedical pattern recognition. Proc. 1972 IEEE Conf. on
Decision and Control and the 11th Symp. on Adaptive Processes. New Orleans, LA.
Lendaris, G. and G. Stanley (1969). Diffraction pattern sampling for automatic pattern recognition.
SPIE Pattern Recognition Studies Seminar Proceedings, 127-154.
Lendaris, G. G. and G. L. Stanley (1970). Diffraction pattern sampling for automatic pattern
recognition. Proc. IEEE 58 (2) 198-216.
Lu, S. Y. and K. S. Fu (1978). A syntactic approach to texture analysis. Comput. Graphics and Image
Process. 7, 303-330.
Maleson, J., C. Brown and J. Feldman (1977). Understanding natural texture. University of Rochester,
Rochester, NY.
Matheron, G. (1967). Eléments pour une Théorie des Milieux Poreux. Masson, Paris.
Matheron, G. (1975). Random Sets and Integral Geometry. Wiley, New York.
McCormick, B. H. and S. N. Jayaramamurthy (1974). Time series model for texture synthesis. Internat.
J. Comput. Informat. Sci. 3 (4) 329-343.
Miles, R. (1969). Random polygons determined by random lines in the plane. Proc. Nat. Acad. Sci.
U.S.A. 52, 901-907; 1157-1160.
Miles, R. (1970). On the homogeneous planar Poisson point process. Math. Biosci. 6, 85-127.
Mitchell, O., Ch. Myers and W. Boyne (1977). A max-min measure for image texture analysis. IEEE
Trans. Comput. 25 (4) 408-414.
Mitchell, O. R. and S. G. Carlton (1977). Image segmentation using a local extrema texture measure.
Pattern Recognition 10, 205-210.
Muller, W. (1974). The Leitz texture-analyzing system. Lab. Appl. Microscopy, Sci. and Tech. Inform.,
Suppl. 1, 4, 101-136. Wetzlar, West Germany.
Muller, W. and W. Hunn (1974). Texture Analyzer System. Industrial Res., 49-54.
Ohlander, R. (1975). Analysis of natural scenes. Ph.D. dissertation. Carnegie-Mellon Univ., Pitts-
burgh, PA.
Peucker, T. and D. Douglas (1975). Detection of surface-specific points by local parallel processing of
discrete terrain elevation data. Comput. Graphics and Image Process. 4 (4) 375-387.
Pratt, W. K. (1978). Image feature extraction. Digital Image Processing, 471-513.
Pratt, W. K., O. D. Faugeras and A. Gagalowicz (1978). Visual discrimination of stochastic texture
fields. IEEE Trans. Systems Man Cybernet. 8 (11) 796-804.
Pressman, N. J. (1976a). Markovian analysis of cervical cell images. J. Histochem. Cytochem. 24 (1)
138-144.
Pressman, N. J. (1976b). Optical texture analysis for automated cytology and histology: A Markovian
approach. Lawrence Livermore Lab. Rept. UCRL-52155. Livermore, CA.
Preston, K. (1972). Coherent Optical Computers. McGraw-Hill, New York.
Rosenfeld, A. (1975). A note on automatic detection of texture gradients. IEEE Trans. Comput. 23 (10)
988-991.
Rosenfeld, A. and B. S. Lipkin, eds. (1970). Picture Processing and Psychopictorics. Academic Press,
New York.
Rosenfeld, A. and M. Thurston (1971). Edge and curve detection for visual scene analysis. IEEE
Trans. Comput. 20 (5) 562-569.
Rosenfeld, A. and E. Troy (1970). Visual texture analysis. Univ. of Maryland, College Park, MD;
Tech. Rept. 70-116; ibid., Conf. Record Symp. on Feature Extraction and Selection in Pattern
Recognition, IEEE Publ. 70C-51C (1970) 115-124. Argonne, IL.
Schacter, B. J., A. Rosenfeld and L. S. Davis (1978). Random mosaic models for textures. IEEE
Trans. Systems Man Cybernet. 8 (9) 694-702.
Serra, J. (1974). Theoretical bases of the Leitz texture analysis system. Leitz Sci. Tech. Inform., Suppl. 1
(4) 125-136. Wetzlar, Germany.
Serra, J. (1978). One, two, three, ..., infinity. In: J. L. Chermant, ed., Quantitative Analysis of
Microstructures in Materials Science, Biology and Medicine, 9-24. Riederer, Stuttgart.
Serra, J. and G. Verchery (1973). Mathematical morphology applied to fibre composite materials. Film
Sci. Technol. 6, 141-158.
Shulman, A. R. (1970). Optical Data Processing. Wiley, New York.
Sutton, R. and E. Hall (1972). Texture measures for automatic classification of pulmonary disease.
IEEE Trans. Comput. 21 (1) 667-676.
Swanlund, G. D. (1971). Design requirements for texture measurements. Proc. Two Dimensional
Digital Signal Processing Conf. Univ. of Missouri, Columbia, MO.
Switzer, P. (1967). Reconstructing patterns from sample data. Ann. Math. Statist. 38, 138-154.
Tomita, F., M. Yachida and S. Tsuji (1973). Detection of homogeneous regions by structural analysis.
Proc. Third Internat. Joint Conf. on Artificial Intelligence, 564-571. Stanford Univ., Stanford, CA.
Toriwaki, J. and T. Fukumura (1978). Extraction of structural information from grey pictures.
Comput. Graphics and Image Process. 7 (1) 30-51.
Tou, J. T. and Y. S. Chang (1976). An approach to texture pattern analysis and recognition. Proc.
1976 IEEE Conf. on Decision and Control. Clearwater Beach, FL.
Tou, J. T. and Y. S. Chang (1977). Picture understanding by machine via textural feature extraction.
Proc. 1977 IEEE Conf. on Pattern Recognition and Image Processing. Troy, NY.
Tou, J. T., D. B. Kao and Y. S. Chang (1976). Pictorial texture analysis and synthesis. Third Internat.
Joint Conf. on Pattern Recognition. Coronado, CA.
Triendl, E. E. (1972). Automatic terrain mapping by texture recognition. Proc. Eighth Internat. Symp.
on Remote Sensing of Environment. Environmental Research Institute of Michigan, Ann Arbor, MI.
Tsai, W. H. and K. S. Fu (1978). Image segmentation and recognition by texture discrimination: A
syntactic approach. Fourth Internat. Joint Conf. on Pattern Recognition. Tokyo, Japan.
Tsuji, S. and F. Tomita (1973). A structural analyzer for a class of textures. Comput. Graphics and
Image Process. 2, 216-231.
Watson, G. S. (1975). Texture analysis. Geological Society of America Memoir 142, 367-391.
Weszka, J., C. Dyer and A. Rosenfeld (1976). A comparative study of texture measures for terrain
classification. IEEE Trans. Systems Man Cybernet. 6 (4) 269-285.
Wied, G., G. Bahr and P. Bartels (1970). Automatic analysis of cell images. In: Wied and Bahr, eds.,
Automated Cell Identification and Cell Sorting. Academic Press, New York.
Yaglom, A. M. (1962). Theory of Stationary Random Functions. Prentice-Hall, Englewood Cliffs, NJ.
Zucker, S. W. (1976a). Toward a model of texture. Comput. Graphics and Image Process. 5, 190-202.
Zucker, S. W. (1976b). On the structure of texture. Perception 5, 419-436.
Zucker, S. (1974). On the foundations of texture: A transformational approach. Tech. Rept. TR-331,
Univ. of Maryland, College Park, MD.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 417-449

Applications of Stochastic Languages

K. S. Fu

1. Introduction
Here P_s is a set of stochastic productions of the form

α_i →^{p_ij} β_ij,  j = 1,…,n_i;  i = 1,…,k  (1)

where p_ij is the probability associated with the application of this stochastic
production,

0 < p_ij ≤ 1  and  Σ_{j=1}^{n_i} p_ij = 1.³  (2)

If η is obtained from ξ by one application of a production α_i →^{p_ij} β_ij, we write
ξ →^{p_ij} η⁴ and say that ξ directly generates η with probability p_ij. If there exists a
sequence of strings ω_1,…,ω_{n+1} such that

ξ = ω_1,  η = ω_{n+1},  ω_i →^{p_i} ω_{i+1},  i = 1,…,n,

then we say that ξ generates η with probability p = ∏_{i=1}^{n} p_i and denote this

²In a more general formulation, instead of a single start symbol, a start symbol (probability)
distribution can be used.
³If these two conditions are not satisfied, p_ij will be denoted as the weight w_ij and P_s the set of
weighted productions. Consequently, the grammar and the language generated are called weighted
grammar and weighted language, respectively [12].
⁴In running text we will set indices superior and inferior to arrows instead of indices centered above
or below the arrow.
derivation by

ξ ⇒^{p} η.

The probability associated with this derivation is equal to the product of the
probabilities associated with the sequence of stochastic productions used in
the derivation. It is clear that ⇒^{p} is the reflexive and transitive closure of the
relation →^{p}.
The stochastic language generated by G_s is

L(G_s) = { (x, p(x)) | x ∈ V_T^*,  S ⇒^{p_j} x,  j = 1,…,k,  p(x) = Σ_{j=1}^{k} p_j }  (3)

where k is the number of all distinctly different derivations of x from S and p_j
is the probability associated with the jth distinctive derivation of x.
In general, a stochastic language L(G_s) is characterized by (L, p) where L is a
language and p is a probability distribution defined over L. The language L of a
stochastic language L(G_s) = (L, p) is called the characteristic language of L(G_s).
Since the productions of a stochastic grammar are exactly the same as those of
the non-randomized grammar except for the assignment of the probability
distribution, the language L generated by a stochastic grammar is the same as that
generated by its non-randomized version.
Consider the stochastic finite-state grammar

G_s = (V_N, V_T, P_s, S)

where

V_N = {S, A, B},  V_T = {0, 1}

and

P_s:  S →^{1} 1A,
      A →^{0.8} 0B,  A →^{0.2} 1,  B →^{0.3} 0,  B →^{0.7} 1S.

A typical derivation is

S → 1A → 10B → 100,  with  p(100) = 1 × 0.8 × 0.3 = 0.24.

Summing over all the strings of L(G_s) (see Table 1),

Σ_{x ∈ L(G_s)} p(x) = 0.2 + 0.24 + Σ_{n=1}^{∞} (0.2 + 0.24)(0.56)^n = 1.
Table 1
String generated x     p(x)
11                     0.2
100                    0.24
(101)^n 11             0.2 × (0.56)^n
(101)^n 100            0.24 × (0.56)^n
Within each class of productions having the same premise A_i ∈ V_N, the probabilities satisfy Σ_r p_ir = 1.
DEFINITION 2.4. For each A_i, i = 1,…,k, define the k-argument generating function

f_i(s_1, s_2,…,s_k) = Σ_{j=1}^{n_i} p_ij s_1^{μ_1(i,j)} ⋯ s_k^{μ_k(i,j)}

where μ_l(i,j) is the number of occurrences of A_l in β_ij.
For example, for the productions

P_s:  A_1 →^{p_11} aA_1A_2,  A_1 →^{p_12} b,  A_2 →^{p_21} aA_2A_2,  A_2 →^{p_22} aa,

the level generating functions are

F_0(s_1, s_2) = s_1,
F_1(s_1, s_2) = f_1(s_1, s_2) = p_11 s_1 s_2 + p_12,
F_2(s_1, s_2) = F_1(f_1(s_1, s_2), f_2(s_1, s_2)) = p_11 f_1(s_1, s_2) f_2(s_1, s_2) + p_12
             = p_11^2 p_21 s_1 s_2^3 + p_11^2 p_22 s_1 s_2 + p_11 p_12 p_21 s_2^2 + p_11 p_12 p_22 + p_12.
After examining the previous example we can express F_j(s_1,…,s_k) as

F_j(s_1,…,s_k) = G_j(s_1,…,s_k) + K_j

where G_j(s_1,…,s_k) denotes the polynomial in s_1,…,s_k without the
constant term. The constant term K_j corresponds to the probability of all
the strings x ∈ L(G_s) that can be derived in j or fewer levels. This leads to the
following theorem.
THEOREM. A stochastic context-free grammar G_s is consistent if and only if

lim_{j→∞} K_j = 1.  (7)
Note that if the above limit is not equal to 1, there is a finite probability that a
generation process may never terminate. Thus, the probability measure defined
over L will be less than 1 and, consequently, G_s will not be consistent. On the
other hand, if the limit is equal to 1, then there exists no such infinite (non-
terminating) generation process, since the limit represents the probability of all the
strings which are generated by applications of a finite number of productions.
Consequently, G_s is consistent. The problem of testing the consistency of a given
stochastic context-free grammar can be solved by using the following testing
procedure developed for branching processes.
DEFINITION. e_ij = ∂f_i(s_1,…,s_k)/∂s_j |_{s_1 = ⋯ = s_k = 1}.  (8)

DEFINITION 2.9. The first moment matrix E of the generation process corre-
sponding to a stochastic context-free grammar G_s is defined as

E = [e_ij]_{1 ≤ i, j ≤ k}.

THEOREM 2.10. For a given stochastic context-free grammar G_s, order the eigen-
values or characteristic roots ρ_1,…,ρ_k of the first moment matrix according to
the descending order of their magnitudes, that is, |ρ_1| ≥ |ρ_2| ≥ ⋯ ≥ |ρ_k|. Then G_s
is consistent if |ρ_1| < 1, and is not consistent if |ρ_1| > 1.
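For the two-nonterminal example above, the test of Theorem 2.10 amounts to a spectral-radius computation (a sketch, with illustrative values for p_11 and p_21):

    import numpy as np

    def is_consistent(E):
        """Theorem 2.10: consistent if the largest eigenvalue magnitude is below 1."""
        return np.max(np.abs(np.linalg.eigvals(E))) < 1

    # From f1 = p11*s1*s2 + p12 and f2 = p21*s2^2 + p22, the derivatives at
    # s1 = s2 = 1 give e11 = p11, e12 = p11, e21 = 0, e22 = 2*p21.
    p11, p21 = 0.4, 0.3
    E = np.array([[p11, p11],
                  [0.0, 2 * p21]])
    print(is_consistent(E))   # True: the spectral radius max(0.4, 0.6) < 1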
Let u = (u_1, u_2,…,u_n) be a binary input sequence. Then the conditional probability of a binary sequence v =
(v_1, v_2,…,v_n) at the output of the channel, given that the input sequence is
u, is

P(v|u) = P(s_O = v_1 | s_I = u_1) ⋯ P(s_O = v_n | s_I = u_n)

where the u_i's are codewords for the alphabet V_T, and the U_i's are disjoint subsets of
{0, 1}^n such that U_1 ∪ ⋯ ∪ U_l = {0, 1}^n. A lexicographical decoder D_0 is a map-
ping from {0, 1}^n into V_T, defined by

D_0(v) = σ_i  if v ∈ U_i.
Designs of the code as well as its encoders and decoders are studied extensively in
information theory (e.g., [26]). We shall present in the following a decoding
scheme which in a certain sense is superior to the lexicographical decoding
scheme.
The encoder, binary symmetric channel N and the decoder D_0 can be combined
and represented by a mutational channel M characterized by the transition
probabilities Q_B(β = σ_j | α = σ_i), for every i, j = 1,2,…,l. Here α and β are
respectively the input and output of the mutational channel M, and
Q_B(β = σ_j | α = σ_i) is the probability of observing a symbol σ_j at the output of M
when symbol σ_i is the true symbol at the input of M.
Note that Σ_j Q_B(β = σ_j | α = σ_i) = 1 for all i.
As illustrated in Fig. 1, a syntactic decoder D is introduced at the output of the
mutational channel, i.e., at the output of the lexicographical decoder D_0, before
the messages are conveyed to the user. Suppose the source grammar G generates a
sentence w = c_1 c_2 ⋯ c_N ∈ L(G). The symbols c_1,…,c_N are fed into the encoder C_B
sequentially, such that a sequence of binary codewords C_B(c_1),…,C_B(c_N) comes out
of the encoder and is modulated and transmitted over the noisy channel N. Let
u_i = C_B(c_i), i = 1,…,N, and let v_i be the binary sequence of length n coming out of
the channel N corresponding to the transmitted codeword u_i. These binary
sequences v_1,…,v_N are decoded according to (12), so that the lexicographical
decoder outputs a string of symbols in V_T, z = a_1 a_2 ⋯ a_N, where a_i = D_0(v_i), i =
1,2,…,N. We can symbolically describe this coding scheme by z = a_1 a_2 ⋯ a_N =
D_0(v_1) D_0(v_2) ⋯ D_0(v_N). An optimal block code B is one for which the average
probability of a symbol error

P(e|B) = (1/l) Σ_{i=1}^{l} Q_B(β ≠ σ_i | α = σ_i)  (14)

is minimized.
However, it is immediately obvious that the above optimal block code B is
based on the performance of the code for individual symbols. For linguistic
messages we may take advantage of our knowledge about the syntactic structure
among the transmitted symbols to improve the efficiency of the communication
system. Since the utility of the communicated linguistic messages depends mostly
on the correctness of individual sentences, the average probability of obtaining a
correct sentence for the source grammar will be considered to be a significant
factor in evaluating the efficiency of the communication system.
Let z = a_1 a_2 ⋯ a_N be the decoded sequence of symbols from the lexicographi-
cal decoder D_0. The syntactic decoder D is a mapping from V_T^* into L(G), which
is optimal according to the maximum-likelihood criterion: D(z) = w* = c_1* c_2* ⋯ c_N* ∈ L(G)
such that

∏_{i=1}^{N} Q(β = a_i | α = c_i*) ≥ ∏_{i=1}^{N} Q(β = a_i | α = c_i)  (16)

for every w = c_1 c_2 ⋯ c_N ∈ L(G).
Extend the definition of d(σ_j, a_i) to the case of two strings z = a_1 a_2 ⋯ a_N and
w = c_1 c_2 ⋯ c_N which are both of length N, N = 2,3,…. Then

d(z = a_1 a_2 ⋯ a_N; w = c_1 c_2 ⋯ c_N) = Σ_{i=1}^{N} d(a_i, c_i).  (18)
(Figure: chromosome boundaries and their string representations — (a) median string
representation: cbbbabbbbdbbbbabbbcbbbabbbbdbbbbabbb; (c) acrocentric string
representation: cadbbbbbbabbbbbcbbbbbabbbbbbda.)

V_T = {a, b, c, d}
and P_s:

S →^{1} AA,       R →^{p_3} RE,    W →^{p_3} WE,
A →^{1} cb,       R →^{p_4} HDJ,   W →^{p_4} d,
B →^{p_1} FBE,    D →^{p_1} E,     F →^{1} b,
B →^{p_2} HDJ,    D →^{p_2} d,     E →^{1} b,
B →^{p_3} RE,     D →^{p_3} FG,    H →^{1} a,
B →^{p_3} FL,     D →^{p_3} WE,    J →^{1} a,
L →^{p_3} FL,     G →^{p_3} FG,
L →^{p_4} HDJ,    G →^{p_4} d,

with

p_1 + p_2 + 2p_3 = 1,  p_3 + p_4 = 1.
By associating probabilities with the strings, we can impose a probabilistic
structure on the language to describe noisy patterns. The probability distribution
characterizing the patterns in a class can be interpreted as the probability of
occurrence of the strings representing the patterns of that class. An input string x
is parsed with respect to each class grammar G_i, and the class for which p(x|G_i)
is maximum is selected.

Fig. 3. Maximum-likelihood syntactic pattern recognition system.
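For stochastic finite-state grammars, p(x|G_i) can be computed by following the derivations directly, so the system of Fig. 3 can be outlined as below (a sketch: the right-linear encoding, the second grammar and the class names are illustrative assumptions, not part of the text):

    def string_prob(x, G, nt="S"):
        """p(x | G) for a right-linear stochastic grammar: sum over derivations."""
        if x == "":
            return 0.0
        total = 0.0
        for p, rhs in G[nt]:
            if rhs[0] != x[0]:
                continue
            if len(rhs) == 1:                 # terminating production
                total += p if len(x) == 1 else 0.0
            else:                             # rhs = terminal followed by a nonterminal
                total += p * string_prob(x[1:], G, rhs[1])
        return total

    G1 = {"S": [(1.0, "1A")], "A": [(0.8, "0B"), (0.2, "1")],
          "B": [(0.3, "0"), (0.7, "1S")]}
    G2 = {"S": [(0.5, "0S"), (0.5, "1")]}     # a hypothetical second class

    def classify(x, grammars):
        """Maximum-likelihood decision: the grammar giving x the highest probability."""
        return max(grammars, key=lambda name: string_prob(x, grammars[name]))

    print(classify("100", {"class 1": G1, "class 2": G2}))   # "class 1": p = 0.24 vs 0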
DEFINITION 5.1. The deformation probabilities associated with the substitu-
tion, insertion and deletion transformations T_S, T_I, T_D are defined as follows:

(1) xay ⊢ xby  with probability q_S(b|a)  (substitution, T_S),
(2) xay ⊢ xbay  with probability q_I(b|a)  (insertion, T_I),
(3) xay ⊢ xy  with probability q_D(a)  (deletion, T_D),
(4) x ⊢ xa  with probability q_I(a)  (insertion at the end, T_I),

where

Σ_{b ∈ Σ, b ≠ a} q_S(b|a) + q_D(a) + Σ_{b ∈ Σ} q_I(b|a) + q(a) = 1

for all a ∈ Σ, where q(a) is the probability that no error occurs on terminal a [18,
33].
Let L(G_s) be a given stochastic context-free language and let y be an erroneous
string, y ∉ L(G_s). The maximum-likelihood error-correcting parsing algorithm
searches for a string x, x ∈ L(G_s), such that

q(y|x) p(x) = max { q(y|z) p(z) | z ∈ L(G_s) }.
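The role of the deformation probabilities can be made concrete with a dynamic program over alignments (a sketch which maximizes over single transformation sequences and omits end-insertions, rule (4), for brevity; a full error-correcting parser would maximize q(y|x)p(x) over all x ∈ L(G_s)):

    import numpy as np

    def max_deformation_prob(x, y, qs, qi, qd, q):
        """Probability of the best transformation of x into y using the
        substitution (qs), insertion (qi), deletion (qd) and no-error (q)
        probabilities of Definition 5.1."""
        n, m = len(x), len(y)
        D = np.zeros((n + 1, m + 1))
        D[0, 0] = 1.0
        for i in range(n + 1):
            for j in range(m + 1):
                if i < n:                                   # delete x[i]
                    D[i + 1, j] = max(D[i + 1, j], D[i, j] * qd(x[i]))
                if i < n and j < m:                         # x[i] observed as y[j]
                    keep = q(x[i]) if x[i] == y[j] else qs(y[j], x[i])
                    D[i + 1, j + 1] = max(D[i + 1, j + 1], D[i, j] * keep)
                if i < n and j < m:                         # insert y[j] before x[i]
                    D[i, j + 1] = max(D[i, j + 1], D[i, j] * qi(y[j], x[i]))
        return D[n, m]

    # e.g. max_deformation_prob("100", "110", qs=lambda b, a: 0.05,
    #          qi=lambda b, a: 0.02, qd=lambda a: 0.03, q=lambda a: 0.90)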
Using Bayes rule, for a given string y with y ∈ L(G_i) for all i, 1 ≤ i ≤ K, the a
posteriori probability that y is in class j can be computed as

P(C_j | y) = q(y|C_j) P(C_j) / Σ_{i=1}^{K} q(y|C_i) P(C_i)

where P(C_i) is the a priori probability of class C_i.
The Bayes decision rule classifies y into class C_j if

P(C_j | y) = max_i { P(C_i | y) | 1 ≤ i ≤ K }.
In this section we present some major results on stochastic tree languages [4].
If there exists a sequence of trees t_0, t_1,…,t_m such that

α = t_0,  β = t_m,  t_{i−1} →^{p_i} t_i,  i = 1,…,m,

then we say that α generates β with probability p = ∏_{i=1}^{m} p_i and denote this
derivation by α ⊢^{p} β or α ⇒^{p} β.
The stochastic tree language generated by G_s is

L(G_s) = { (t, p(t)) | t ∈ T_{V_T},  S ⇒^{p_j} t,  j = 1,…,k,  p(t) = Σ_{j=1}^{k} p_j }

where T_{V_T} is the set of all trees over V_T, k is the number of all distinctly different
derivations of t from S and p_j is the probability associated with the jth distinct
derivation of t from S.
Each production of a stochastic tree grammar is of one of the forms

X_0 →^{p} x(X_1 ⋯ X_{r(x)})   or   X_0 →^{q} x,

i.e., a non-terminal X_0 is either expanded, with probability p, into a node labelled
x ∈ V_T with descendants X_1,…,X_{r(x)}, or terminated, with probability q, as a single
node labelled x.
LEMMA 6.6. Given a stochastic tree grammar G_s = (V, r, P, S) over ⟨V_T, r⟩, one
can effectively construct a simple stochastic tree grammar G_s′ = (V′, r′, P′, S′) over
⟨V_T, r⟩ which is equivalent to G_s.

Note that the application of a single rule φ_i →^{p} ψ_i in G_s simulates the applica-
tion of all the corresponding rules of P′: productions (1′), (2′),…,(5′) in P′ are the
result of production (1) in P, and productions (6′), (7′), (8′) are due to production
(2) in P.
A simple derivation in G_s is

S →^{p} +SX →^{q} +XX,   i.e.,   S ⊢^{pq} +XX.
Note that if the tree φ_i on the left-hand side of a production rule is a single
symbol of the alphabet V, we will have no contracting production rules in our
grammar. Consider productions of the form

X_0 →^{p} x(X_1 ⋯ X_{r(x)})

where x ∈ V_T and X_0, X_1,…,X_{r(x)} are non-terminal symbols contained in V − V_T.
Obviously,

p(x(t_1 ⋯ t_n)) = p(x) p(t_1) ⋯ p(t_n).

The function h forms a string in V_{T_0}^* obtained from a tree t by writing the
frontier of t. Note that the frontier is obtained by writing in order the images (labels) of
all end points of the tree t. The corresponding set of string productions is

P = { X_0 →^{p} X_1 ⋯ X_n x | X_0 →^{p} x(X_1 ⋯ X_n) ∈ P, x ∈ V_T, n > 0 }
    ∪ { X_0 →^{p} x | X_0 →^{p} x ∈ P, x ∈ V_{T_0} }.
A stochastic tree grammar G_s is consistent if

Σ_{t ∈ L(G_s)} p(t) = 1

where
- t is a tree generated by G_s,
- L(G_s) is the set of trees generated by G_s,
- p(t) is the probability of the generation of tree t.
The set of consistency conditions for a stochastic tree grammar G_s is the set of
conditions which the probability assignments associated with the set of stochastic
tree productions in G_s must satisfy in order that G_s be a consistent stochastic tree
grammar. The consistency conditions of stochastic context-free grammars have
been discussed in Section 2. Since non-terminals in an intermediate generating
tree appear only at its frontier, they can be considered to be causing further
branching. Thus, if only the frontier of an intermediate tree is considered at each level
of branching then, due to Theorem 6.11, the consistency conditions for stochastic
tree grammars are exactly the same as those for stochastic context-free grammars,
and the tree generating mechanism can be modelled by a generalized branching
process [23].
Let P = Γ_{A_1} ∪ Γ_{A_2} ∪ ⋯ ∪ Γ_{A_K} be the partition of P into equivalence classes such
that two productions are in the same class if and only if they have the same
premise (i.e., the same left-hand-side non-terminal). For each Γ_{A_j} define the condi-
tional probabilities {p(t|A_j)} as the probability that the production rule A_j → t,
where t is a tree, will be applied to the non-terminal symbol A_j, where Σ_{Γ_{A_j}} p(t|A_j)
= 1.
Let μ_i(t) denote the number of times the variable A_i appears in the frontier of
the tree t of the production A_j → t.
g_1(s_1, s_2, s_3, s_4) = p($(AB)|S) s_2 s_3 = s_2 s_3,
g_2(s_1, s_2, s_3, s_4) = p(a(AB)|A) s_2 s_3 + p(a|A) = p s_2 s_3 + (1 − p),
g_3(s_1, s_2, s_3, s_4) = p(b(C)|B) s_4 + p(b|B) = q s_4 + (1 − q),
g_4(s_1, s_2, s_3, s_4) = p(a|C) = 1.0.
DEFINITION 6.15. The ith level generating function F_i(s_1, s_2,…,s_K) is de-
fined recursively as

F_0(s_1,…,s_K) = s_1,
F_i(s_1,…,s_K) = F_{i−1}(g_1(s_1,…,s_K),…,g_K(s_1,…,s_K)).

Let C_i denote the constant term of F_i. Then G_s is consistent if and only if

lim_{i→∞} C_i = 1.

PROOF. If the above limit is not equal to 1, this means that there is a finite
probability that the generation process enters a generating sequence that has a
finite probability of never terminating. Thus, the probability measure defined
upon L(G_s) will always be less than 1 and G_s will not be consistent. On the other
hand, if the limit is 1, this means that no such infinite generation sequence exists,
since the limit represents the probability measure of all trees that are generated by
the application of a finite number of production rules. Consequently G_s is
consistent.
The first moment matrix of the tree generation process is E = [e_ij]_{1 ≤ i, j ≤ K}.
EXAMPLE 6.19. In this example the consistency conditions for the stochastic tree
grammar G_s of Example 6.9 are found (the language is verified directly in part (a)),
and thus the consistency criterion is verified.
(a) The set of trees generated by G_s is as follows:
Tree t                      Probability of generation p(t)
$(a b)                      (1 − p)(1 − q)
$(a b(a))                   (1 − p)q
$(a(a b) b)                 p(1 − p)(1 − q)^2
$(a(a b(a)) b(a))           p(1 − p)q^2
… etc.

(The trees are written here in linear parenthesized form.)
In all of the above trees, production (1) is always applied. If production (2) is
applied (n − 1) times, there will be one 'A' and n 'B's in the frontier of the tree so
obtained. Production (3) is then applied when production (2) is no longer
needed. Of the n 'B's in the frontier, any one, two, three, or all n of them may have
production (4) applied; to the rest of the 'B's production (5) is applied. Production
(6) always follows production (4).
Thus we have

Σ_{t ∈ L(G_s)} p(t) = (1 − p)[(1 − q) + q] + (1 − p)p[(1 − q) + q]^2 + ⋯
                      + (1 − p)p^{n−1}[(1 − q) + q]^n + ⋯.

Note that the power of p in the above terms shows the number of times
production (2) has been applied before applying production (3). Since
(1 − q) + q = 1,

Σ_{t ∈ L(G_s)} p(t) = (1 − p) + (1 − p)p + ⋯ + (1 − p)p^{n−1} + ⋯
                    = (1 − p)[1 + p + p^2 + ⋯ + p^{n−1} + ⋯]
                    = (1 − p) × 1/(1 − p) = 1   (if p < 1).
(b) The first moment matrix is

E = [ 0  1  1  0 ]
    [ 0  p  p  0 ]
    [ 0  0  0  q ]
    [ 0  0  0  0 ]

with rows and columns ordered as (S, A, B, C); its eigenvalues are 0, p, 0, 0, so the
eigenvalue criterion confirms consistency for p < 1.
Fig. 4. Two tree structures for texture modeling: (a) Structure A, (b) Structure B.
"]i i i
iH~i~'i !i=
=i i ~ - :
iii~il: ! :I!! g
i 71
Fig. 5b. Texture pattern: D68, woodgrain.
(Figure: windowed subpatterns A1, A2, A5 and D1, D2, D5 of the texture patterns.)
G68 = (V, r, P, S) where V = {S, A, B, 0, 1}, V_T = {0, 1}, r(0) = r(1) = {0, 1, 2, 3},
and P is

S →^{0.50} 0(A S A),  S →^{0.05} 0(A A),    S →^{0.09} 0(B S A),  S →^{0.09} 1(B S A),
S →^{0.09} 0(A S B),  S →^{0.09} 1(B S B),  S →^{0.09} 1(A S B),
A →^{0.90} 0(A),      A →^{0.05} 0(B),      A →^{0.05} 0,
B →^{0.85} 1(B),      B →^{0.10} 1(A),      B →^{0.05} 1.
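Sampling from such a grammar is a direct recursion on the productions (a sketch; the tuple encoding of G68 and the depth limit that truncates long derivations are assumptions, not part of the model):

    import random

    # G68 transcribed as: nonterminal -> [(probability, root label, child nonterminals)]
    P = {"S": [(0.50, "0", "ASA"), (0.05, "0", "AA"), (0.09, "0", "BSA"),
               (0.09, "1", "BSA"), (0.09, "0", "ASB"), (0.09, "1", "BSB"),
               (0.09, "1", "ASB")],
         "A": [(0.90, "0", "A"), (0.05, "0", "B"), (0.05, "0", "")],
         "B": [(0.85, "1", "B"), (0.10, "1", "A"), (0.05, "1", "")]}

    def sample(nt, depth=0, limit=40):
        """Sample one tree, returned as (label, [subtrees]); derivations deeper
        than the limit are clipped (an expedient for the sketch only)."""
        probs, rules = zip(*[(p, (lab, kids)) for p, lab, kids in P[nt]])
        lab, kids = random.choices(rules, weights=probs)[0]
        if depth >= limit:
            kids = ""
        return (lab, [sample(k, depth + 1, limit) for k in kids])

    print(sample("S"))   # one random 0/1-labelled tree of the woodgrain model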
Stochastic string languages were first introduced, and some of their applications
to coding, pattern recognition, and language analysis were briefly described in this
paper. With probabilistic information about the process under study, maximum-
likelihood and Bayes decision rules can be directly applied to the coding/decod-
ing and analysis of a linguistic source and to the classification of noisy and distorted
linguistic patterns. Stochastic finite-state and context-free languages are easier to
analyze than stochastic context-sensitive languages; however, their power for
describing complex processes is smaller. The consistency problem of
References
[1] Aho, A. V. and Peterson, T. G. (1972). A minimum distance error-correcting parser for
context-free languages. SIAM J. Comput. 4.
[2] Aho, A. V. and Ullman, J. D. (1972). Theory of Parsing, Translation and Compiling, Vol. 1. (Vol.
2: 1973). Prentice-Hall, Englewood Cliffs.
[3] Bahl, L. R. and Jelinek, F. (1975). Decoding for channels with insertion, deletion and
substitutions with applications to speech recognition. IEEE Trans. Inform. Theory 21, 4.
[4] Bhargava, B. K. and Fu, K. S. (1974). Stochastic tree system for syntactic pattern recognition.
Proc. 12th Annual Allerton Conf. on Comm., Control and Comput., Monticello, IL, U.S.A.
[5] Booth, T. L. (1974). Design of minimal expected processing time finite-state transducers. Proc.
IFIP Congress 74. North-Holland, Amsterdam.
[6] Booth, T. L. (1969). Probability representation of formal languages. IEEE 10th Annual Symp.
Switching and Automata Theory.
[7] Brayer, J. M. and Fu, K. S. (1977). A note on the k-tail method of tree grammar inference.
IEEE Trans. Systems Man Cybernet. 7 (4) 293-299.
[8] Brodatz, P. (1966). Textures. Dover, New York.
[9] Chomsky, N. (1956). Three models for the description of language. IRE Trans. Inform. Theory
2, 113-124.
[10] Fu, K. S. (1968). Sequential Methods in Pattern Recognition and Machine Learning. Academic
Press, New York.
448 K. S. Fu
[11] Fu, K. S. (1972). On syntactic pattern recognition and stochastic languages. In: S. Watanabe,
ed., Frontiers of Pattern Recognition. Academic Press, New York.
[12] Fu, K. S. and Huang, T. (1972). Stochastic grammars and languages. Internat. J. Comput.
Inform. Sci. 1 (2) 135-170.
[13] Fu, K. S. (1973). Stochastic languages for picture analysis. Comput. Graphics and Image
Processing 2 (4) 433-453.
[14] Fu, K. S. and Bhargava, B. K. (1973). Tree systems for syntactic pattern recognition. IEEE
Trans. Comput. 22, 1087-1099.
[15] Fu, K. S. (1974). Syntactic Methods in Pattern Recognition. Academic Press, New York.
[16] Fu, K. S. and Booth, T. L. (1975). Grammatical inference: Introduction and survey, I-II, IEEE
Trans. Systems Man Cybernet. 5; 95-111,409-423.
[17] Fu, K. S. (1976). Tree languages and syntactic pattern recognition. In: C. H. Chen, ed., Pattern
Recognition and Artificial Intelligence. Academic Press, New York.
[18] Fu, K. S. (1981). Syntactic Pattern Recognition and Applications. Prentice-Hall, Englewood Cliffs.
[19] Fung, L. W. and Fu, K. S. (1975). Maximum-likelihood syntactic decoding. IEEE Trans.
Inform. Theory 21.
[20] Fung, L. W. and Fu, K. S. (1976). An error-correcting syntactic decoder for computer networks.
Internat. J. Comput. Inform. Sci. 5 (1).
[21] Grenander, U. (1969). Foundation of pattern analysis. Quart. Appl. Math. 27, 1-55.
[22] Haralick, R. M., Shammugam, K. and Dinstein, I. (1973). Texture features for image classifica-
tion. IEEE Trans. Systems Man Cybernet. 3.
[23] Harris, T. E. (1963). The Theory of Branching Processes. Springer, Berlin.
[24] Hellman, M. E. (1973). Joint source and channel encoding. Stanford Electronics Lab., Stanford
University.
[25] Hutchins, S. E. (1970). Stochastic sources for context-free languages. Ph.D. Dissertation,
University of California, San Diego, CA.
[26] Jelinek, F. (1968). Probabilistic Information Theory. McGraw-Hill, New York.
[27] Keng, J. and Fu, K. S. (1976). A syntax-directed method for land use classification of LANDSAT
images. Proc. Symp. Current Math. Problems in Image Sci. Monterey, CA, U.S.A.
[28] Lafrance, J. E. (1971). Syntax-directed error-recovery for compilers. Rept. No. 459, Dept. of
Comput. Sci., University of Illinois, IL, U.S.A.
[29] Lee, H. C. and Fu, K. S. (1972). A stochastic syntax analysis procedure and its application to
pattern recognition. IEEE Trans. Comput. 21, 660-666.
[30] Li, R. Y. and Fu, K. S. (1976). Tree system approach to LANDSAT data interpretation. Proc.
Symp. Machine Processing of Remotely Sensed Data, Lafayette, IN, U.S.A.
[31] Lipkin, B. S. and Rosenfeld, A., eds. (1970). Picture Processing and Psychopictorics, 289-381.
Academic Press, New York.
[32] Lipton, R. J. and Snyder, L. (1974). On the optimal parsing of speech. Res. Rept. No. 37, Dept.
of Comput. Sci., Yale University.
[33] Lu, S. Y. and Fu, K. S. (1977). Stochastic error-correcting syntax analysis for recognition of
noisy patterns. IEEE Trans. Comput. 26, 1268-1276.
[34] Lu, S. Y. and Fu, K. S. (1978). A syntactic approach to texture analysis. Comput. Graphics and
Image Processing 7.
[35] Lu, S. Y. and Fu, K. S. (1979). Stochastic tree grammar inference for texture synthesis and
discrimination. Comput. Graphics and Image Processing 8, 234-245.
[36] Moayer, B. and Fu, K. S. (1976). A tree system approach for fingerprint pattern recognition.
IEEE Trans. Comput. 25 (3) 262-274.
[37] Persoon, E. and Fu, K. S. (1975). Sequential classification of strings generated by SCFG's.
lnternat. J. Comput. Inform. Sci. 4 (3) 205-217.
[38] Smith, W. B. (1970). Error detection in formal languages. J. Comput. System Sci.
[39] Souza, C. R. and Scholtz, R. A. (1969). Syntactical decoders and backtracking S-grammars.
ALOHA System Rept. A69-9, University of Hawaii.
[40] Suppes, P. (1970). Probabilistic grammars for natural languages. Synthese 22, 95-116.
[41] Tanaka, E. and Fu, K. S. (1976). Error-correcting parsers for formal languages. Tech. Rept.
EE-76-7, Purdue University.
[42] Weszka, J. S., Dyer, C. R. and Rosenfeld, A. (1976). A comparative study of texture measures
for terrain classification. IEEE Trans. Systems Man Cybernet. 6.
[43] Zucker, S. W. (1976). Toward a model of texture, Comput. Graphics and Image Processing 5,
190-202.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 451-477

A Unifying Viewpoint on Pattern Recognition*

J. C. Simon, E. Backer and J. Sallentin

0. Introduction
Information
Information is a loose concept, which is made of two parts: a representation
and one or more interpretations. Later we speak of couples or pairs of 'representa-
tion-interpretation'.
*An earlier version of this article was published in the journal Signal Processing, Volume 2
(1980), pp. 5-22, under the title 'A Structural Approach of Pattern Recognition'.
Representation
A representation is the material support of information: for example, a string
of letters or bits, or the result of a measurement such as an image. Let such a value
be represented by a string of italic letters:

X = (x_1, x_2,…,x_n).
Interpretation
A representation may have many interpretations:
- trivial: the nature of the element of rank j in the representation string;
- an identification: the 'name' of the object represented by X. Such an interpreta-
tion is the most frequent in Pattern Recognition (PR) for a representation of an
object measured by some physical sensors;
- a property, such as a truth value or an assertion, a term, an expression;
- an action: a program is represented by a string of 'instructions', themselves
translated into a string of bits, interpreted by a computer as actions;
- in practice, many others, as witnessed by man in everyday life.
Of course one may ask why we choose to call 'interpretation' the result of
a process on a representation. We believe that in our field of PR, related to
understanding and linguistics, this is appropriate, as this frame of concepts was
advocated quite early in linguistics [2].
We call the semantics of a representation the set of interpretations which may be
found from this representation. Again we underline that this set may be infinite
and/or ill defined. But after all, the set of properties of a mathematical object
such as a number may also be infinite; to demonstrate a theorem we only use a
finite set of properties.
Thus 'information', which is the couple of a representation and of one (or
more) interpretation, may be quite ill defined. This we see in everyday life: a
representation is understood quite differently by different people, especially if they
belong to different cultures.
Identification
Let Ω be the set of names, Ω = {ω_1, ω_2,…,ω_p}. An identification is a mapping
E from the representation space into the set of names:

E: X → Ω.
PR operators or programs
Of course such a mapping is only a mathematical description. It has to be
implemented in a constructive way. A PR operator or algorithm effectively does
the task of giving a name when the input data is a representation. The PR specialists
look for such algorithms and implement them on computer systems.
Finding such efficient PR algorithms and programs is the main goal of the
Pattern Recognition field.
Interpretations of an object
An identification is not the only interpretation sought for the representation of
an object:
(i) A feature is the result of a partial identification. For example, a phoneme in a
spoken word, a segment in a letter, a contour or a texture in an image. Sometimes
the term initial level feature is used instead of representation. It points out the
fact that such representations are obtained through physical sensors from the
outside universe. Thus a representation is already an interpretation of the outside
world.
(ii) A fuzzy identification is sometimes preferred to identification by yes or
no. It may be defined as a 'multimapping' of X into Ω × F,
Similarity, distance
Let X,Y, Z be entities that we wish to compare. Note that they are not always
of the same nature. Later on they may be taken as objects, classes or concepts.
Similarity (resemblance)
Dissimilarity (dissemblance)
1.2. Remarks
(1) Usually an identification may be described as a multilevel process. Let us
take the example of written word identification.
From the initial representation level, for instance the pixel level of an image, a
first group of interpretations is obtained. They result in the identification of a
certain number of 'features', such as segments, curves, crossings, extremities, etc.
From this level the letters are found. Then at another level the word is identified
from the letters.
Thus, starting from a representation level, an identification process allows access to
an interpretation level, which then becomes the new representation level.
Such a scheme is more general than the 'historical' PR scheme: feature
identification followed by a 'classifier'. It is now commonly said that image and
speech recognition are such multilevel processes and may be described as an
interactive, competitive system of procedures, either inductive (data driven) or
deductive (concept driven) [3].
assumed that such an order is a basic property of the data. Let δ(j, h) be such a
dissimilarity measure. It is clear that it does not always satisfy the triangular
inequality (1.7).
Such a homeomorphism should map inf δ on 0 and should add a constant to all
the values of δ(j, h), such that (1.6) is verified for the whole table.
Similar properties may be found for a dissimilarity measure.
We deal now with the properties of an infinite representation space X, into
which E is embedded. Two basic properties have to be examined for such a space:
(a) is it a metric space?
(b) has it a density measure?
One may challenge that the Hausdorff condition is verified everywhere, espe-
cially with the finite precision of the measurements and of the computations.
However, it seems quite reasonable to grant the Hausdorff property to an infinite
representation space, in other words to assume that two distinct points have
separate neighbourhoods. Thus from now on we assume that a representa-
tion space X is a metric space. The problem is of course to find the metric.
We have seen how an experimental table of dissimilarities on E may be
transformed into a table of distances without changing the order on the couples of
points. This table of distances provides a sampling of the general distance measure
on X. The problem is to find a distance algorithm which agrees with the
measured distance table. It is a generalization problem, of which there are many
in PR.
Density measure
Let us call (intentionally) μ(X) a density measure at X ∈ X, a function of X
taking its values in ℝ⁺.
Many efforts are made in 'statistical' PR to build up such a density from an
experimental distribution of representations X of objects. This density is used as a
similarity measure between an object X and a class.
These efforts are along two lines: either the objects are labeled or not labeled.
Labeled. (1) Probability densities are obtained through various statistical tech-
niques (parametric) and also by interpolating techniques (non-parametric) [8].
(2) k-Nearest Neighbours (k-NN) [9], [10], [11]. Note also the Shared k-NN
approach of Jarvis [12] and the Mutual NN of Gowda [13]. (A sketch of the
k-NN density estimate follows below.)
(3) Potential or inertia functions [1].
(4) Fuzzy belonging [14].
All of these measures are interpreted and used as a similarity between an object
and a concept, such as a class (also called aggregate, taxon, OTU (operational
taxonomic unit), fuzzy set) [1].
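A minimal form of the k-NN density estimate reads as follows (a sketch assuming a Euclidean metric and the usual ball-volume normalization):

    import numpy as np
    from math import gamma, pi

    def knn_density(X, samples, k=5):
        """k-NN density at X: k / (N * volume of the ball reaching the kth neighbour)."""
        d = np.sort(np.linalg.norm(samples - X, axis=1))[k - 1]   # kth smallest distance
        dim = samples.shape[1]
        volume = pi ** (dim / 2) / gamma(dim / 2 + 1) * d ** dim  # dim-ball of radius d
        return k / (len(samples) * volume)

    samples = np.random.default_rng(0).normal(size=(500, 2))
    print(knn_density(np.zeros(2), samples))   # largest near the cloud's center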
Remark
It should be clear that the hypothesis that a representation space is a metric
space is a 'strong' one, even if it is the most frequent. Some finite data cannot be
(Fig. 1: the operations L ∪ L′, L ⊕ L′ and L ∩ L′.)
These laws on P make this set a distributive lattice, sometimes also called a
semi-ring or σ-algebra. The distributivity means that

L ∪ (L′ ∩ L″) = (L ∪ L′) ∩ (L ∪ L″),   L ∩ (L′ ∪ L″) = (L ∩ L′) ∪ (L ∩ L″).

Terms. We will call a term a sentence naming an object. The set of terms is a
generalization of the set Ω of names. The languages of terms are regular, i.e. may
be recognized by a finite automaton [15].
Logical expressions. The languages of different logics have been well defined,
for instance cf. [15]. A sentence of such a language is called an expression. It is
formed with terms and formulas. The formulas are built up with logical connec-
tives, according to a syntax and a number of axioms. An essential point of these
logical languages is that an expression may be interpreted as 'true' or 'false'.
In the classical sentential logic the basic connectives are 'and' ∧ and 'or' ∨. But many
others have been proposed, with different syntax and semantics. They have
made it possible to propose other logics, such as the predicate logics (extensions of the
classical sentential logic), the modal, the intuitionistic, the fuzzy, and the quantum logics.
Let us simply point out that the expressions of the sentential logic form an
algebra, called a boolean algebra, isomorphic to an algebra of sets; thus we
obtain the same structure as that of a distributive lattice, where the connectives
'and' and 'or' play respectively the same role as the operations 'intersection' and
'union' for sets [15].
By suppressing some axioms and introducing new connectives, other logics are
formed in which the distributivity of the lattice structure is no longer certain.
tions of the world of objects. Of course the formalization is a lot more difficult.
The languages of different logics have in fact been proposed to model the natural
language.
Knowing f(X; A) and f(X; B), what are the values of f(X; A ∪ B) and f(X; A ∩ B)?
(Fig. 2: the PR problem — representation space → interpretation space: a homomorphism?)
The answer generally found by the users is to use on f two homomorphisms, i.e.
to find two laws on f such that ⊕ is an additive law, homomorphic to ∪:

f(X; A ∪ B) = f(X; A) ⊕ f(X; B).
The range of f
f takes its values in the real domain ℝ, but its interval of variation or range R is
usually a part of ℝ. For example: {0, 1}, two values only, the range of
the characteristic function of a set; [0, 1], the range of probability and of fuzzy
belonging; ℝ⁺, the range of distances; etc.

Semi-ring
Let us now recall the definition of a semi-ring. It is a structure Z on a set
(range) R,

Z = ⟨R, ⊕, ·, 0, 1⟩.
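This structure is small enough to state directly in code (a sketch; the three instances are illustrative choices matching the ranges mentioned above):

    from dataclasses import dataclass
    from typing import Callable

    @dataclass(frozen=True)
    class SemiRing:
        add: Callable[[float, float], float]  # additive law, neutral element `zero`
        mul: Callable[[float, float], float]  # multiplicative law, neutral element `one`
        zero: float
        one: float

    # probabilities: range [0, 1] under (+, x)
    probability = SemiRing(lambda a, b: a + b, lambda a, b: a * b, 0.0, 1.0)
    # fuzzy belonging: range [0, 1] under (SUP, INF)
    fuzzy = SemiRing(max, min, 0.0, 1.0)
    # distances: range R+ under (min, +); none of these laws has inverses
    distance = SemiRing(min, lambda a, b: a + b, float("inf"), 0.0)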
Remark
The difference between a group and a semi-group comes from the fact that no
inverse is defined for an element x: no −x for addition, no x⁻¹ for
multiplication, as would always be defined in a group. We will show that this has some
important consequences.
Idempotent operations
An operation is called idempotent if, applied to the same element, it gives as a
result the element itself. For instance, if ∪ and ∩ are respectively the set
union and intersection, A ∪ A = A and A ∩ A = A; they are idempotent. For the
symmetric difference Δ, on the contrary,

A Δ A = ∅;

Δ is not idempotent.
Let us suppose that (2.1) and/or (2.2) are satisfied; if ⊖ is the inverse operation on f,

f ⊖ f = 0   or   f × f⁻¹ = 1.
Let us come back to the range [0, 1]; other operations may be taken. Let ⊥ be a
law such that ⊥ is a contracting law; then formulas similar to (2.4) and (2.5) may be used.
2.2.1. Hierarchies
The representation space is either a finite set E or a metric space X.
A hierarchy is a finite set H belonging to the set of parts of E or X, with the
following conditions:

{X} ∈ H ⊂ P (or 𝒫),
E (or X) ∈ H,
if h_i, h_j ∈ H, then either h_i ∩ h_j = ∅
or h_i ∩ h_j ≠ ∅, and then either h_i ⊇ h_j or h_i ⊆ h_j.  (2.10)

A hierarchy is a semi-lattice, indeed a tree, in which the elements are obtained
by the operation ∪. Given two elements h_i, h_j of H, there always exists an
element h called the least upper bound (l.u.b.), the smallest set with
the property h_i ⊆ h and h_j ⊆ h. If h_i ∩ h_j ≠ ∅, it is clear that h_i = h or h_j = h.
Ultrametric distances. Let us define measures compatible with the above struc-
ture.
If h_i ∩ h_j ≠ ∅, then h_i and h_j are elements of a chain ordered by ⊆:

X ⊆ h_1 ⊆ ⋯ ⊆ h_i ⊆ ⋯ ⊆ h_j.  (2.11)
h"
X1 X2 X3
Fig. 3
If h_i ∩ h_j = ∅, let h be the l.u.b. of h_i and h_j. Then λ(h_i), λ(h_j) < λ(h). Let X
and X′ be leaves of the hierarchy (tree). The distance between X and X′ is
δ(X, X′) = λ(h), where h is the l.u.b. of X and X′. Let X_1, X_2, X_3 be three leaves of
the hierarchy, h be the l.u.b. of X_1, X_2, and h′ be the l.u.b. of X_1, X_3 and of X_2,
X_3; then

δ(X_1, X_3) = δ(X_2, X_3) = λ(h′) ≥ δ(X_1, X_2) = λ(h).  (2.13)

From (2.13) every triangle X_1, X_2, X_3 is isosceles with a base smaller than the two
equal sides (see Fig. 3).
It is shown that such a proposition is equivalent to the following relation,
proper to ultrametric distances:

δ(X_1, X_3) ≤ max[δ(X_1, X_2), δ(X_2, X_3)].  (2.14)
(Fig. 4: a dendrogram; the level λ(h) of a node h defines the ultrametric distance.)
C → (C_1,…,C_i,…,C_k),  (2.16)
A = (A_1,…,A_i,…,A_k),  (2.17)
where the range of f is [0, 1] and the interpretations form a distributive lattice.
Such formulas are currently used in fuzzy set formulations for object-concept
similarity, for concept-concept similarity and object-object similarity [30, 31, 32].
In probability
Object-concept,
Concept-concept,
Object-object.
Similar formulas are used in fuzzy set formulations; for examples see [14,
Chapter 2].
Concept-concept,

F(L, L′) = (1/N) Σ_X INF[f(X; L), f(X; L′)].  (2.32)

Object-object,

δ(X, X′) = ½ Σ_{i=1}^{m} (SUP[f(X; L_i), f(X′; L_i)] − INF[f(X; L_i), f(X′; L_i)]).  (2.33)
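Formulas (2.32) and (2.33) translate directly into vector operations (a sketch assuming the membership values f(X; L) are stored as numeric arrays):

    import numpy as np

    def concept_similarity(fL, fLp):
        """(2.32): mean over objects X of INF of the two membership vectors."""
        return np.minimum(fL, fLp).mean()

    def object_dissimilarity(fX, fXp):
        """(2.33): half the summed SUP - INF gaps over the concepts L_i;
        note SUP - INF equals |fX - fXp| elementwise."""
        return 0.5 * np.sum(np.maximum(fX, fXp) - np.minimum(fX, fXp))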
Remarks
(1) The operation SUP is often interchanged with the 'average' (1/N)Σ. However,
the average is not an associative operation and some care should be taken to
preserve the homomorphic properties.
(2) Note that in probability the conditional Bayes error for two classes is
written as
Van der Pyl [27] studies H and proposes to use, instead of (2.7),
Feature evaluations
Many measures have been proposed for the evaluation of features; we would
rather refer to [17] for the evaluation of the PR operators which detect the existence of
the features: Mutual Information, Quadratic Entropy, Bhattacharyya and Bayesian
distances, Patrick-Fisher, Bayes error, Kolmogorov's, etc. For examples see
[28].
Among these, Mutual Information has some interesting properties. Let A, B be
two operators to be evaluated; let Ω be the 'ideal operator' given by the training
set. Knowing I(A; Ω) and I(B; Ω), what can be said if A and B are in series or
in parallel?
Series
Parallel
Remarks
We consider logics different from the 'classical' (Boolean) logic, but in all of these logics no quantifiers such as 'there exists' or 'for all' are used; thus modal logic is not envisioned here.

All these logics are sentential logics. Their propositions form a lattice under the operations ∨, ∧, but this lattice is not always distributive. For ease of reading, let us recall some useful definitions of sentential logics.
If p ∧ q = p, then ¬p ∧ ¬q = ¬q.   (2.42)

(Double negation)
¬¬p = p.   (2.44)

(Excluded middle)
p ∨ ¬p = 1.   (2.45)

(5) No contradiction
p ∧ ¬p = 0.   (2.46)

(6) Distributivity

(7) Pseudomodularity
The different sentential logics are distinguished from one another according to which of the above properties they satisfy.
Table 1
Properties of sentential logics

Logics                          Properties (1)-(7)
Classical                       X X X X X X X
Quantum                         X X X X X X
Fuzzy                           X X X X X
Non-distributive fuzzy          X X X X
Intuitionist                    X X X X X
Non-distributive intuitionist   X X X X
Distributivity
THEOREM OF STONE. All the distributive logics (property (6)) are homomorphic to
a distributive lattice of subsets of a set.
Thus for the distributive logics we are again in the situation of Subsection 2.1. Let us consider a semi-ring Z having an additive law ⊥ and a multiplicative law ∗; then
Negation
(1) If (2.44) is verified, a corresponding law on f has to be found. For instance, if the range is [0, 1], one may take f(X; ¬p) = 1 − f(X; p).
(2) If (2.46) is verified, and if the range is [0, 1], then we should have

f(X; ¬p) = { 0 if f(X; p) > 0,
             1 otherwise.   (2.52)
Fuzzy logic does not verify this last relation, but it is an essential property of the intuitionistic logic. On the other hand, it is clear that the intuitionistic logic does not verify (2.44), the double negation property.
The only logic which verifies both the double negation (2.44) and the no contradiction (2.46) is the classical logic. But then the only admissible rings on f are the first two, leading to only two values for f.
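The contrast between the two negations can be made explicit with a short Python sketch (membership values invented; the conjunction on [0, 1] is taken as min, in line with the INF-based formulas above).

```python
# Fuzzy negation verifies double negation (2.44) but not (2.46);
# the intuitionistic negation (2.52) does the opposite.
def fuzzy_not(v):
    return 1.0 - v

def intuitionistic_not(v):
    return 0.0 if v > 0.0 else 1.0

v = 0.4
print(fuzzy_not(fuzzy_not(v)))                    # 0.4 : (2.44) holds
print(min(v, fuzzy_not(v)))                       # 0.4 != 0 : (2.46) fails
print(min(v, intuitionistic_not(v)))              # 0.0 : (2.46) holds
print(intuitionistic_not(intuitionistic_not(v)))  # 1.0 != 0.4 : (2.44) fails
```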
The interest of logics other than the classical one now appears clearly. Classical logic gave birth to quantum logic; fuzzy and intuitionist logics to non-distributive ones. Examples may be found in the 'real Universe', where distributivity is not verified.
Idempotence
An essential property of the basic connectives ∨ and ∧ is idempotence. But in modeling natural language, which describes the real world, we find that idempotence should not always be verified by such connectives.
For example, the repetition of a proposition is not always equivalent to the proposition itself:

"This is a man"

and its repetition do not carry the same information. Then, by (2.55), f(τ(X); p) is the idempotent: "Repeating once more will not change our knowledge on X" [29].
The propositions p verifying (2.55) are special in the language; sometimes they
are called 'observable'. They form a lattice, which is not always distributive.
3. Conclusion
References
[1] Simon, J. C. (1978). Some current topics in clustering in relation with pattern recognition. Proc.
Third Internat. Conf. on Pattern Recognition, Coronado, pp. 19-29.
[2] Saussure, F. de (1972). Cours de Linguistique Générale. Payot, Paris.
[3] Haralick, R. M. (1978). Scene matching problems. Proc. 1978 NATO ASI on Image Processing, Bonas.
[4] De Mori, R. (1978). Recent advances in automatic speech recognition. Proc. Fourth Internat.
Conf. on Pattern Recognition, Kyoto, pp. 106-124.
[5] Duda, R. O. and Hart, P. E. (1973). Pattern Classification and Scene Analysis. Wiley, New York.
[6] Diday, E. and Simon, J. C. (1976). Cluster Analysis. In: Fu, K. S., ed., Digital Pattern
Recognition. Springer, Berlin.
[7] Hu, S. T. (1966). Introduction to General Topology. Holden Day, San Francisco.
[8] Kanal, L. (1974). Patterns in pattern recognition, 1968-1974. IEEE Trans. Inform. Theory 20 (6)
697-722.
[9] Cover, T. M. and Wagner, T. J. (1976). Topics in statistical recognition. In: Fu, K. S., ed.,
Digital Pattern Recognition. Springer, Berlin.
[10] Devijver, P. A. (1977). Reconnaissance des formes par la méthode des plus proches voisins. Thèse de Doct., Univ. Paris VI, Paris.
[11] Cover, T. M. and Hart, P. E. (1967). Nearest neighbour pattern classification. IEEE Trans. Inform. Theory 13, 21-26.
[12] Jarvis, R. A. (1978). Shared near neighbour maximal spanning tree for cluster analysis. Proc.
Third Internat. Conf. on Pattern Recognition, Coronado, pp. 308-313.
[13] Gowda, K. C. and Krishna, G. (1978). Agglomerative clustering using the concept of mutual
nearest neighbourhood. Pattern Recognition 10, 105-112.
[14] Backer, E. (1978). Cluster Analysis by Optimal Decomposition of Induced Fuzzy Sets. Delft
University Press, Delft.
[15] Lyndon, R. C. (1964). Notes on Logic. Van Nostrand Mathematical Studies 6. Van Nostrand,
New York.
[16] Sabah, G. (1977). Sur la compréhension d'histoire en langage naturel. Thèse de Doct., Univ. Paris VI, Paris.
[17] Simon, J. C. (1975). Recent progress to a formal approach of pattern recognition and scene
analysis. Pattern Recognition 7, 117-124.
[18] Kampé de Fériet, J. (1977). Les deux points de vue de l'information: information à priori, information à posteriori. Colloques du CNRS 276, Paris.
[19] Diday, E. (1973). The dynamic cluster algorithm and optimisation in non-hierarchical clustering. Proc. Fifth IFIP Conf., Rome.
[20] Miranker, W. L. and Simon, J. C. (1975). Un modèle continu de l'algorithme des nuées dynamiques. C.R. Acad. Sci. Paris Sér. A 281, 585-588.
[21] Hansen, P. and Delattre, M. (1978). Complete link cluster analysis by graph coloring. J. Amer. Statist. Assoc. 73, 397-403.
[22] Simon, J. C. and Diday, E. (1972). Classification automatique. C.R. Acad. Sci. Paris Sér. A 275, 1003.
[23] Ruspini, E. (1969). A new approach to clustering. Inform. Control 15, 22-32.
[24] Bezdek, J. C. (1973). Fuzzy mathematics in pattern classification. PhD Thesis, Cornell Univer-
sity, Ithaca.
[25] Carnap, R. and Bar Hillel, Y. (1953). Semantic information. British J. Phil. Sci. 4, 147-157.
[26] Kampé de Fériet, J. (1973). La Théorie Généralisée de l'Information et de la Mesure Subjective de l'Information. Lecture Notes in Mathematics 398. Springer, Berlin.
[27] Van der Pyl, T. (1976). Axiomatique de l'information. C.R. Acad. Sci. Paris Sér. A 282.
[28] Backer, E. and Jain, A. K. (1976). On feature ordering in practice and some finite sample effects. Proc. Third Internat. Conf. on Pattern Recognition, Coronado, pp. 45-49.
[29] Sallentin, J. (1979). Représentation d'observation dans le contexte de la théorie de l'information. Thèse de Doct., Univ. Paris VI, Paris.
[30] Zadeh, L. A. (1971). Similarity relations and fuzzy orderings. Inform. Sci. 3, 177-200.
[31] Zadeh, L. A. (1977). Fuzzy sets and their application to pattern classification and clustering
analysis. In: van Ryzin, J., ed., Classification and Clustering 251-299. Acad. Press, New York.
[32] Zadeh, L. A. (1978). PRUF, a meaning representation language for natural languages. Internat.
J. Man-Mach. Stud. 10, 395-460.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 479-491

Logical Functions in the Problems of Empirical Prediction

G. S. Lbov
0. Introduction
¹Groups of scales are considered, intended for measuring features of three types: quantitative features (the scales of intervals, of relations, and the absolute scale); serial features (the scales of order, partial order, and rank); and nominal features (the name scale).
Special attention is paid in the paper to algorithms for solving the above types of problems in the case of empirical tables characterized either by a great number of features (n = 30-150) and a small sample size (N ≈ n), or by heterogeneity of the features (gaps are possible in the tables in both cases). Tables for which at least one of these properties holds we call RP-tables. Such tables are common in complex studies in medicine, sociology, and geology. The known methods for solving prediction problems are mainly designed for the case when the features X₁, …, X_j, …, X_n are either Boolean [1] or quantitative (for example, the classical problem of restoring the function x₀ = f(x) assumes all the features to be quantitative). Applying the known algorithms [11, 13], which use hypotheses of 'closeness', 'compactness', etc., to heterogeneous features runs into a methodological difficulty: when calculating the similarity between two vectors one has to deal with components which are the results of measuring non-comparable quantities. Therefore, in the present paper a class of logical and linear-logical decision rules is proposed for solving prediction problems with heterogeneous features. In addition, from theoretical studies [6, 10] it follows that the given class of rules has the least measure of complexity compared to the known classes (e.g. the class of potential functions, the class of polynomials), which allows one to construct reliable decision rules for a feature space of large dimension and a small training sample (a small number of objects). This fact also makes it advisable to use the given class of rules when the features X₁, …, X_n are measured in the same scale. In medical, geological and sociological research, problems often arise in which both of these circumstances occur.
To solve such problems the basic empirical hypothesis is used: the given subset of objects of a set A is believed to be randomly chosen, i.e. a statistical set-up of the problem is considered. All the prediction methods construct a decision rule having the maximal prediction quality for the objects of the given subset of A. Only under this hypothesis will such a decision rule also possess the best predictive capacity for the rest of the objects of the set A.
(4) Enable the construction of simple optimization procedures for the search for the best rule.
(5) Contain rules which can be technically realized in a simple way.
(6) Permit the construction of algorithms that can operate with gaps present in empirical tables.
Consider now each requirement in more detail.
Let us introduce the statistical criteria which have been used to determine the complexity of a class of decision rules. Let distributions {p(ω, x) = p(ω)p_ω(x)} be given on a set of features X₁, …, X_n (ω is a pattern number; x is a point of the n-dimensional feature space; p(ω) is the a priori probability of pattern ω; ω = 1, …, K; K is the number of patterns). One may say that the strategy of nature c ∈ Θ is assigned by the set of probabilities c = {p(ω, x)} (Θ is the set of strategies of nature). Denote by V the set of all possible samples of size N. A particular empirical table {x_ij} is an element of this set: v ∈ V. The table v is brought into correspondence with a recognition rule f from some class of decision rules 𝔉. This rule is chosen in accordance with a training algorithm Q, i.e. Q(v) = f. The operator Q is either a procedure for evaluating the parameters of the distributions {p(ω, x)}, or an optimization procedure choosing the rule f from the class 𝔉 by minimizing a recognition quality criterion (e.g. the number of recognition errors).
A decision rule is a mapping which assigns to a point x of the feature space a pattern number, i.e. f(x) = ω. For a fixed decision rule f and a fixed strategy of nature c the probability of classification error P_f can be obtained. For each strategy of nature c there is a distribution of probabilities p(v) on the set of samples. Hence the decision rule f will be chosen randomly from the class in accordance with some distribution p(f). Note that in the general case p(f) ≥ p(v), since the algorithm Q may choose one and the same decision rule f for a certain subset of samples V_i ⊆ V. The error probability P_f will be a random quantity with a certain distribution of probabilities φ(P_f). With an increase in the sample size (N → ∞) the quantity P_f converges and the distribution φ(P_f) degenerates into the δ-function.
For the fixed distribution φ(P_f) one can calculate bounds P'_γ ≤ P''_γ such that Pr{P_f ∈ [P'_γ, P''_γ]} ≥ γ. The magnitude γ is close to 1. For a fixed value of γ, from the given inequality one can obtain the value P''_γ. The quantity ε = P''_γ − P'_γ shows the extent of the deviation of the sampling rule from the optimal rule in the sense of error probability, at the fixed strategy of nature c, the fixed size of training sample N, and the fixed class of decision rules. Since the strategy of nature c is unknown, we consider ε* = max_{c∈Θ} ε. The magnitude ε* at a fixed sample size depends on the chosen class of decision rules.
DEFINITION 1.1. Of two classes of rules, 𝔉₁ and 𝔉₂, the latter is more complicated than the former if the value ε₂* for 𝔉₂ is greater than ε₁* for 𝔉₁.
ε ≈ ((ln|𝔉| − ln(1 − γ)) / 2N)^{1/2}.
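To give a feel for this bound, here is a small Python sketch (it assumes the reconstruction of the formula above is read correctly; the class sizes, γ and N are invented).

```python
from math import log, sqrt

def eps(card_F, gamma, N):
    # deviation bound for a finite class of decision rules
    return sqrt((log(card_F) - log(1.0 - gamma)) / (2.0 * N))

print(eps(10**4, 0.95, 100))    # a modest class: small deviation
print(eps(10**30, 0.95, 100))   # a huge class: the guarantee degrades
```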
Let us formulate a class of logical decision rules for recognition. For the sake of simplicity, consider two patterns {ω, ω̄}.
Let us introduce the necessary definitions and notation. As elementary statements T_j we take expressions X_j(a) = x, X_j(a) ≠ x for the name scale, and X_j(a) ≤ x, X_j(a) > x for the order scale, the scale of intervals, and the relations scale, where X_j is the feature, x is its value and a is the name of an object.
A conjunction of elementary statements S = T₁ ∧ T₂ ∧ ⋯ ∧ T_ℓ (ℓ ≤ n) is called a statement S. By the length of a statement we mean the number of elementary statements involved. We say that a statement S is satisfied on an object if each elementary statement involved in S is true on this object.

DEFINITION 2.1. Two elementary statements T_{j₁} and T_{j₂} are called equivalent if they are satisfied on the same set of objects of A = A_ω ∪ A_ω̄.
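As an illustration (the function names and the object encoding are ours, not the paper's), here is a Python sketch of elementary statements on heterogeneous scales and of checking a conjunction S on an object with a gap:

```python
def T_name(feature, value):            # name scale: X_j(a) = x
    return lambda obj: obj[feature] == value

def T_order(feature, threshold):       # order/interval/relation: X_j(a) <= x
    return lambda obj: obj[feature] is not None and obj[feature] <= threshold

def satisfied(S, obj):
    """S (a list of elementary statements) is satisfied on an object iff
    every elementary statement in S is true on it."""
    return all(T(obj) for T in S)

obj = {"color": "red", "age": 42, "weight": None}      # a gap in 'weight'
S = [T_name("color", "red"), T_order("age", 50.0)]     # length 2
print(satisfied(S, obj))                               # True
```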
For the class of linear rules [2] at N < n, |𝔉| = 2·2¹⁰⁰ ≈ 10³⁰. Thus, the class under consideration belongs to the class of the simplest decision rules. This property is of primary importance in the case of a small sample and a large dimension of the feature space.
(2) From the theoretical viewpoint the class 𝔉₂ has an important asymptotic property: in [7] it was proved that with an increase in the sample size and in the number M the decision rule f(R) tends to the optimal Bayes rule.
(3) The number and composition of the objects on which this or that elementary statement is satisfied do not depend on the permissible transformations of the scales. Therefore, any algorithm for constructing the best rule which takes account only of the number and composition of objects when sampling elementary statements is invariant to the above transformations.
(4) Statements represented as conjunctions of values and conditional intervals of features are easily interpretable, and the decision rule can be technically realized in a simple way as a threshold device.
(5) Descriptions of known recognition algorithms, as a rule, do not show how a vector with omitted values of some features is to be used in training. When constructing the logical rule, such a realization is excluded only from those conjunctions that involve a feature with a missing value.
(6) Simultaneously with choosing logical regularities, the problem of reducing the number of features of the initial system is solved. Indeed, in the solution of applied problems it is, as a rule, sufficient to use only a small number of features of the initial system to construct a tree R. In addition, the decision rule enables an 'individual' approach: for each object to be recognized its own subset of features is used. The above properties are significant for solving problems where both the recognition error and the cost of measuring the features should be minimized. Thus it is shown that the class of logical decision rules 𝔉₂ satisfies the above formulated requirements for a class of decision rules for recognition.
Together with the classes of decision rules 𝔉₁ and 𝔉₂ we also considered classes 𝔉₃ and 𝔉₄, which can also be referred to the class of logical decision rules. At present the properties of the classes 𝔉₃ and 𝔉₄ are being studied.
If elementary statements of the form Σ_{j=1}^{m} a_j X_j(a) ≥ α₀, Σ_{j=1}^{m} a_j X_j(a) < α₀ (m ≤ n₁, where n₁ is the number of quantitative features) are added to the above types of elementary statements, then we obtain the class of linear-logical decision rules 𝔉₃, given as a decision tree.
If we use conjunctions chosen from some set D = {S₁, …, S_t, …, S_ξ} instead of the elementary statements of the above tree, then we get the class of decision rules 𝔉₄. For this purpose let us bring a new system of features Y₁, …, Y_t, …, Y_ξ into correspondence with the set D in such a way that, if for an object a the statement S_t is fulfilled, then Y_t equals 1, otherwise it equals 0. Then the empirical tables can be rewritten in the form {x_ij^ω} → {y_it^ω} and {x_ij^ω̄} → {y_it^ω̄}. The decision rule in the form of a decision tree is constructed on the new features. From the set of all possible statements the set D involves only informative statements, which are called regularities. By a regularity characterizing the pattern ω we mean a statement S for which P_S^ω ≥ δ and P_S^ω̄ ≤ β (e.g. δ = 0.6, β = 0.02). For a small fixed β the magnitude δ is chosen experimentally: starting with δ = 1, this quantity is decreased with a certain step (e.g. Δδ = 0.05) until a small number of regularities emerges (usually this number is assigned to be 5-10 per pattern). When solving applied problems, the length of an informative statement (the number of elementary statements in the conjunction) has not been found to exceed 5. In [5] an algorithm TEMP was proposed with the help of which all regularities characterizing RP-tables were found within reasonable machine time.
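The algorithm TEMP itself is described in [5]; as a rough stand-in, the following Python sketch enumerates short conjunctions and keeps those satisfying the regularity conditions P_S^ω ≥ δ, P_S^ω̄ ≤ β (all names are illustrative):

```python
from itertools import combinations

def share(S, objects):
    # fraction of objects on which the conjunction S is satisfied
    return sum(all(T(o) for T in S) for o in objects) / max(len(objects), 1)

def regularities(elementary, A_omega, A_other,
                 delta=0.6, beta=0.02, max_len=3):
    """Brute-force stand-in for a TEMP-style search: all conjunctions of
    length <= max_len built from candidate elementary statements."""
    found = []
    for length in range(1, max_len + 1):
        for S in combinations(elementary, length):
            if share(S, A_omega) >= delta and share(S, A_other) <= beta:
                found.append(S)
    return found
```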
For each ordering Π of the objects the criterion

OP(Π) = Σ_{i=1}^{N} δ(π(a_i))

is determined, where π(a_i) is the object's place number in Π. The ordering is the better, the less the value of the magnitude OP(Π). Let us consider the meaningful interpretation of this criterion. Let Γ be a set of geological areas. An area where deposits are discovered (pattern ω) gives some profit per unit of time starting from the moment of discovery. If no deposits are discovered on it, the money spent on working it (e.g., drilling) proves wasted. The criterion in question is shown to be connected with the rate of compensation for the money spent on working.
If the strategy of nature is known to be c = {p(ω, x)}, one can determine the mathematical expectation of the criterion, M_OP(Π) = M_OP(F, c), where F is the rule of ordering, an order relation on the set Rⁿ. It is necessary to minimize M_OP(F, c) over F (F ∈ 𝔉). The ordering rule F₀ is determined by the function g(x) = p(x|ω)/p(x|ω̄) as follows:
- if g(x_i) > g(x_t), then x_i ≻ x_t;
- if g(x_i) = g(x_t), then x_i ∼ x_t.
M_OP(F₀, c) ≤ M_OP(F, c).
The magnitude

g(b_t) = (N_{ω,t} + 1)/(N_{ω̄,t} + 1)

corresponds to each branch b_t of the tree (N_{ω,t} is the number of objects of the set A from the pattern ω involved in b_t, N_{ω̄,t} is the number of objects from A involved in b_t from the pattern ω̄). Each vector x_i is involved in some branch b_t = b(x_i). Define ĝ(x_i) = g(b_t) = g[b(x_i)]. If ĝ(x_i) > ĝ(x_t), then x_i ≻ x_t; if ĝ(x_i) = ĝ(x_t), then x_i ∼ x_t.
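A compact Python sketch of this ordering (assuming the smoothed ratio (N_{ω,t}+1)/(N_{ω̄,t}+1) as reconstructed above; the branch identifiers and counts are invented):

```python
def branch_scores(branch_counts):
    # branch_counts: {branch_id: (n_omega, n_not_omega)}
    return {b: (n1 + 1.0) / (n0 + 1.0)
            for b, (n1, n0) in branch_counts.items()}

def order_objects(objects, branch_of, scores):
    # larger g first: the most promising areas head the ordering
    return sorted(objects, key=lambda a: scores[branch_of[a]], reverse=True)

scores = branch_scores({"b1": (8, 1), "b2": (2, 5), "b3": (0, 9)})
print(order_objects(["a1", "a2", "a3"],
                    {"a1": "b2", "a2": "b1", "a3": "b3"}, scores))
# -> ['a2', 'a1', 'a3']
```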
In the previous sections the problem of predicting a feature value measured in the name scale has been considered. This section describes an algorithm for the case when the feature X₀ is measured in the scale of relations, while the rest of the features may be measured in different scales. A function F is chosen from a certain class Φ = {F} which maps a vector x to an evaluation of the feature X₀; i.e., the vector x describing an object a determines the evaluation of the goal feature X̂₀(a) = F(x). The class of functions in which F(x) is assigned by a tree R defined on features in the scales of names and order, together with a set of linear functions f_l of the features measured in the scales of relations and intervals, is called the class of linear-logical evaluation functions.
Let R be a function with values in the set of natural numbers from 1 to P, i.e. R(x): x → {1, …, P}. R divides the feature space into P non-intersecting classes A₁, …, A_P, where A_l = {a: R(x) = l}. For each of A₁, …, A_P its own linear function f_l is defined. According to the training sample {x_ij}, i = 1, …, N, j = 0, 1, …, n, one minimizes the criterion

L(F, X) = (1/N) Σ_{i=1}^{N} [X₀(a_i) − F(x_i)]².
(1) The feature X_{j₁} and the boundary B are found for which the quantity

L₁ = L_{A₁} + L_{A₂}

is minimal over all possible partitions of the initial table into two taxons (A₁ is the set of objects for which X_{j₁}(a) ≥ B, A₂ is the complementary set of objects). The coefficients of the function f₁₁ are obtained by constructing a linear regression of the feature X₀ on X_{j₁} over the objects of the taxon A₁; the coefficients of f₁₂, over the objects of the taxon A₂.
(2) Among the quantitative features not yet included in the equation we find the feature X_{j₂} which is most correlated with the residual X₀¹. We obtain the regression equation

X̂₀ = f₂(x) = β₀ + β₁X_{j₁} + β₂X_{j₂},

L₂ = (1/N) Σ_{i=1}^{N} [X₀(a_i) − f₂(a_i)]².
After steps 1 and 2 have been fulfilled, the values L₁ and L₂ are compared. If L₁ > L₂, then the introduction of the variable X_{j₂} into the regression equation is preferable to the condition obtained in the tree R, i.e. as a linear evaluation function we take f₂ of dimension 2. If L₁ ≤ L₂, then at this step we complement R with the elementary statement X_{j₁} ≥ B. Thus the linear functions f₁₁ and f₁₂ corresponding to the taxons A₁ and A₂ are obtained.
Thus, the initial sample (taxon 1) is partitioned into 2 taxons with numbers 2 and 3, respectively. The enumeration is performed as follows: if taxon j is divided, then the resulting subsets get the numbers 2j and 2j + 1.
Further steps proceed in the same way. Having completed a limited number of steps K, the algorithm constructs a tree R(x), partitioning the set of objects into P taxons, and a set of functions f₁, …, f_P corresponding to the taxons. The constructed function F evaluates the value of the feature X₀ for test objects by the rule X̂₀(a_i) = f_l(x_i), where l = R(x_i).
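One step of this procedure can be sketched in Python as follows (the least-squares details and the exact form of f₂ are assumptions of the reconstruction above):

```python
import numpy as np

def lsq_mse(X, y):
    # least-squares fit with intercept; returns the mean squared error
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.mean((y - A @ coef) ** 2))

def step(x1, x2, y, B):
    """Compare L1 (split on x1 >= B with one regression per taxon) with
    L2 (a single two-variable regression)."""
    hi, lo = x1 >= B, x1 < B
    n = len(y)
    L1 = (lo.sum() * lsq_mse(x1[lo, None], y[lo])
          + hi.sum() * lsq_mse(x1[hi, None], y[hi])) / n
    L2 = lsq_mse(np.column_stack([x1, x2]), y)
    return ("introduce X_j2 into the regression" if L1 > L2
            else "add the statement X_j1 >= B to the tree R")
```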
ψ(N_S) = C_N^{N_S} P_S^{N_S} (1 − P_S)^{N−N_S}.
The choice of the preference criterion for one logical statement over another in the search for regularities for the given problem of table approximation is based on the following hypothesis: the less the magnitude of ψ(N_S) for the statement S, the more reason there is to consider this statement a regularity.
The number of realizations N_S on which the statement is fulfilled on a 'random' table is, on the average, N·P_S. For grouping purposes we consider only those statements for which N_S > N·P_S.
Besides, we are interested in statements fulfilled on the initial table not less than 6 times.
For simplicity of determining the preference order on a set of statements we use the magnitude

γ(S) = (N_S − N·P_S)² / (N·P_S(1 − P_S)),

for which the normal approximation gives

f(z) = (2πN·P_S(1 − P_S))^{−1/2} exp[−γ(S)/2].

It is clear that the less the magnitude of ψ(N_S), the greater that of γ(S).
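A short Python check of this relation (with N restored in the denominator, as in the reconstruction above; the numbers are invented):

```python
from math import comb, exp, pi, sqrt

def psi(N, Ns, Ps):          # exact binomial probability of Ns successes
    return comb(N, Ns) * Ps**Ns * (1.0 - Ps)**(N - Ns)

def gamma_S(N, Ns, Ps):      # exponent of the normal approximation
    return (Ns - N * Ps) ** 2 / (N * Ps * (1.0 - Ps))

def f_normal(N, Ns, Ps):     # Gaussian approximation of psi
    return exp(-gamma_S(N, Ns, Ps) / 2) / sqrt(2 * pi * N * Ps * (1 - Ps))

N, Ps = 200, 0.1             # so N*Ps = 20
for Ns in (25, 40):          # both satisfy Ns > N*Ps
    print(Ns, psi(N, Ns, Ps), f_normal(N, Ns, Ps), gamma_S(N, Ns, Ps))
# the greater gamma_S, the smaller psi: the statement with Ns = 40 wins
```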
²When describing objects for pattern recognition, this effect was observed in all the solved applied problems.
Consider statements S₁ and S₂, fulfilled on the given table N_{S₁} and N_{S₂} times, respectively. The statement S₁ is considered preferable to S₂ if γ(S₁) > γ(S₂).
To discover the best statements according to the criterion γ(S) the algorithm TEMP is used. It works in the following way: after the first best statement S₁ has been chosen, the N_{S₁} objects on which this statement is fulfilled are excluded from the N objects. Then the best statement S₂ is determined, which is fulfilled only on the complementary subset of objects, and so on, until the partitioning of the initial set is completed.
X₀(t_l), l = 1, …, R−1.

Let us choose an arbitrary moment t_R as the origin of the time readings. The results of measuring the features of all objects, obtained at the moments t_R^i (t_R^1 for the first object, t_R^2 for the second one, and so on), are brought into correspondence with this origin. Then the set of moments {t_{R−1}^i} is brought into correspondence with the moment t_{R−1}, the set {t_{R−2}^i} with t_{R−2}, and so on.
References
Inference and Data Tables with Missing Values

N. G. Zagoruiko and V. N. Yolkina
1. Algorithm ZET
The basic ideas of the method ZET [1] are the following. As in most pattern recognition algorithms, we accept the hypothesis that 'similar' or 'like' objects are those having similar values of the representation parameters; the less the divergence, the more 'similar' these objects are. It is also supposed that the set of parameters describing the objects is not random, but reflects some regularity relating to these objects. Then one may expect that objects which are similar in n parameters will most probably have similar values of the (n + 1)st parameter. Under these conditions we have good reason to undertake the task of 'predicting' the missing values, i.e. of calculating the most plausible values in terms of the known elements of the matrix of initial data. As usual, we consider that the lines of the matrix of initial data contain the information about the description of the objects a_i, i = 1, …, m, in the given system of parameters X = {X_j}, j = 1, …, n, and that the columns contain the information on the values which the parameter X_j takes on the various objects a_i.
In the algorithm ZET, to predict a missing element only the relevant groups of lines or columns of the matrix under study are used. Relevance is defined as a function of two variables: a measure of similarity f_il between the line (column) containing the expected gap and the lines (columns) not having a blank in the place corresponding to the expected gap; and the degree of their mutual fillingness L_il. Naturally, the relevance of predicting lines (columns) is highest if they are more similar to the predicted one and if they contain the greatest number of mutually non-empty elements. As a measure of similarity between lines (columns) of the matrix, the algorithm ZET takes the modulus of the coefficient of paired correlation, calculated after normalization of all the columns of the matrix to the interval [0, 1].
While calculating a predicted element we take into account the predictive value of each line (column), which depends on its relevance and on some parameter α. The parameter α is chosen during the process of decision, each gap being treated separately. The criterion for the choice of α is the minimum of the mean error of prediction of all the known elements of the line (column) containing the gap.
Table 1
(Objects × features matrix: the line i contains the elements a_ik, a_ij; the line l contains a_lk, a_lj.)
An error δ_n of the prediction of the known elements of the table under the optimal value of α is obtained as an estimate of the expected 'quality of prediction'. If δ_n exceeds the given threshold, the algorithm leaves the gap empty.
The algorithm ZET works as follows. Let there be a matrix of dimension n × m, where n is the number of columns (features) and m is the number of lines (objects); see Table 1.
Step 0. All the columns of the matrix are normalized to the interval [0, 1].
Step 1. The next gap a_ij to be predicted is chosen.
Step 2. For each kth column having no gap in the ith line, we calculate its measure of fillingness L_jk (j ≠ k) with respect to the jth column. It is equal to the number of mutually non-empty pairs of elements of the jth and the kth columns.
Step 3. The measure of similarity |f_jk| of the columns j and k is computed.
Step 4. Under various α, the values of all known elements a_lj of the jth column are predicted by each kth column which has no gap in the ith line:

â_lj = Σ_k â_lj^(k) |f_jk|^α L_jk / Σ_k |f_jk|^α L_jk,

where â_lj^(k) denotes the prediction obtained from the kth column.
Step 10. Having filled all the missing elements in the table, one repeats Steps 2-9 (smoothing).
Step 11. After each iteration one estimates the mean summary difference between the results predicted at this step and the prediction results obtained at the previous one. The process ends when this difference becomes small. This serves as a second criterion to end the iterations.
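A compact Python sketch of one ZET prediction (Steps 2-4 only; since Steps 5-9 are not reproduced above, the per-column prediction is simplified to the value a_ik itself after normalization, which is an assumption of this sketch):

```python
import numpy as np

def zet_predict(A, i, j, alpha=1.0):
    """A: matrix normalized to [0, 1], with np.nan marking the gaps."""
    num = den = 0.0
    for k in range(A.shape[1]):
        if k == j or np.isnan(A[i, k]):
            continue                                  # no help from column k
        both = ~np.isnan(A[:, j]) & ~np.isnan(A[:, k])
        L_jk = int(both.sum())                        # mutual fillingness
        if L_jk < 2:
            continue
        f_jk = abs(np.corrcoef(A[both, j], A[both, k])[0, 1])
        w = f_jk ** alpha * L_jk                      # relevance weight
        num += w * A[i, k]                            # simplified prediction
        den += w
    return num / den if den > 0 else np.nan          # leave gap if no basis
```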
2. Algorithm VANGA
â_ij = (a_lj · a_ik) / a_lk.

But in reality a submatrix, consisting of four elements that belong to the ith line
Table 2
1 ... k ... j ... n
1 1 3 1 6
2 1 2
3 4 3 5
4 2 4 4
5 5 5 3
6 6 6 1
and the jth column, and to the lth line and the kth column as well, can give us only a variant (a 'prompt' b_lk) of the value of the element a_ij:

b_lk = (a_lj · a_ik) / a_lk,   k ≠ j; l ≠ i.
Let us form a matrix B of the same dimension as matrix A, in which the prompts b_lk stand at the crossing of the lth line and the kth column.
Let us now evaluate the mean value of the quantities b_lk (b̄) and their dispersion (D) in each column and in each line:

b̄_k = (1/(m−1)) Σ_{l} b_lk,   b̄_l = (1/(n−1)) Σ_{k} b_lk;

D_k = (1/(m−1)) Σ_{l} |b̄_k − b_lk|,   D_l = (1/(n−1)) Σ_{k} |b̄_l − b_lk|.
C_lk = [(D_max − D_lk)/(D_max − D_min)]^α,

where D_max and D_min are the maximum and minimum of D_lk over the whole table. C_lk approaches 1 when D_lk = D_min and it approaches 0 when D_lk = D_max. In the interval between these extremes, 0 and 1, the quantity C_lk is a function of D_lk and α. The value of α, as in ZET, is chosen by the best prediction of the known elements of the ith line and the jth column of matrix A.
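A Python sketch of the VANGA prompts for the relation scale (the matrix layout and the gap marking are our conventions, not the paper's):

```python
import numpy as np

def prompts(A, i, j):
    """Every complete 2x2 submatrix {i, l} x {j, k} of A gives a prompt
    b_lk = a_lj * a_ik / a_lk for the missing element a_ij."""
    m, n = A.shape
    B = np.full((m, n), np.nan)
    for l in range(m):
        for k in range(n):
            if l == i or k == j:
                continue
            vals = np.array([A[l, j], A[i, k], A[l, k]])
            if not np.isnan(vals).any() and A[l, k] != 0:
                B[l, k] = A[l, j] * A[i, k] / A[l, k]
    return B     # to be weighted by the competences C_lk afterwards
```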
The mean value â_ij of the expected element a_ij is computed as

â_ij = Σ_{k=1}^{n} Σ_{l=1}^{m} C_lk b_lk / Σ_{k=1}^{n} Σ_{l=1}^{m} C_lk.
Hence a prompt for the element a_ij, obtained with the participation of the elements a_lk and a_qk (let us denote it by b_lq,k), will be
Computing the prompts, their mean values and dispersions in the columns and lines, determining the competence of these prompts, and taking them into account when determining the value of the expected element, is done in the same way as for the relation scale. The value a_ij in matrix A is then predicted to be 2.3.
(3) In the order scale (algorithm VANGA-O) a minimal submatrix has dimension 2 × 2. Statements of the kind

"if a_ik ≥ a_lk, then a_ij ≥ a_lj too"

are invariant to the transformations of the order scale. Hence, the prompt is b_lk ≥ a_lj if a_ik ≥ a_lk, and b_lk < a_lj if a_ik < a_lk. If the rank correlation between the jth and the kth columns is negative, then the prompts have the inverse sign: b_lk ≥ a_lj if a_ik < a_lk, and b_lk < a_lj if a_ik ≥ a_lk.
In the columns of matrix B (Table 5), (d + 1) different variants of prompts may be found, where d is the number of known elements not coinciding with any other element in the jth column. In fact, one of d + 1 different events may take place; in our example such events are
The prompts b_lk show whether the unknown quantity a_ij may lie in one or another of the (d + 1) ranges of the order scale. If b_lk < 4, then the events a_ij < 1, 1 ≤ a_ij < 3, 3 ≤ a_ij < 4 are possible. Let us assign a 1 to each of these events. If b_lk ≥ 6, then only a_ij ≥ 6 is possible, and a 1 is assigned only to this range. After looking over all the prompts of a column, we interpret the sum of the ones assigned to each range, divided by the total number of ones, as the probability P_s of the fact that a_ij lies in this range. The entropy

H_k = − Σ_{s=1}^{d+1} P_s ln P_s
Table 5
(Matrix B of prompts for the order scale.)

Table 6
(Competences C_lk for α = 0.5.)
C_k = (H_max − H_k)^α.
The quantities C_lk under α = 0.5 are given in Table 6. Now, using the weights C_lk, one can find the sum of weighted votes for each of the six possible events. For the case under review these sums are
3. Conclusion
References
Recognition of Electrocardiographic Patterns

Jan H. van Bemmel

1. Introduction
1.1. Objectives
In this chapter we will discuss some statistical and other operations on the ECG, not only because the signal is representative of a time-varying biological phenomenon, but also because the efforts made and the results obtained in this area are illustrative of the methods applied to other biological signals.
The strategies and techniques to be described (preprocessing, estimation of features, boundary recognition and pattern classification) can also be applied to many other signals of biological origin, such as the electro-encephalogram, the spirogram or hemodynamic signals (Cox, 1972).
The ultimate goal of such processing is the medical diagnosis or, in research, to obtain insight into the underlying biological processes and systems. Our main objective, therefore, will be to show the state of the art in biological signal processing and recognition by discussing specifically the processing of the electrocardiogram.
1.2. Acquisition
With modern technology and its micro-sensors, a host of different transducers,
instruments and data acquisition techniques became available to study the human
body and to examine the individual patient. Electrodes for the recording of the
ECG have been much improved by new materials and buffered amplifiers;
multi-function catheters can be shifted through veins and arteries, e.g. for the
intra-cardiac recording of the His-electrogram, flows and pressures; micro-
transducers can even be implanted for long-term monitoring of biological signals.
Many other, noninvasive, methods have been devised to study the organism under
varying circumstances, e.g. during physical exercise.
In the entire process of signal analysis, the data acquisition is of course a very
important stage to obtain signals with a high signal-to-noise ratio (SNR). For this
reason it must be stressed that signal processing starts at the transducer; it makes
no sense to put much effort into very intricate statistical techniques if the transducers are not properly located and if the disturbances are unacceptable, hampering detection, recognition and classification. During the last decade, in many instances, the signals are digitized as soon as they have been acquired and amplified, and are recorded in digital form or processed in real time. Data acquisition equipment is presently often brought very near to the patient, even to the bedside, for reasons of accuracy, speed and protection against system failure. Depending on the specific goal, the processing is done in real time (e.g. for patient monitoring in a coronary care unit, CCU) or off-line (e.g. for the interpretation of the ECG for diagnostic purposes).
After this rather general introduction we will concentrate on ECG analysis,
primarily by computerized pattern recognition and related statistical techniques.
After a brief discussion of the different research lines in electrocardiology, we will
mainly restrict ourselves to the interpretation of the standard ECG leads (12-leads
or vectorcardiogram), recorded at the body's surface. In these sections we will
follow the main steps by which computer interpretation is usually done.
2. Electrocardiology
Fig. 1. (Block diagram of the approaches to ECG analysis: clinical findings, statistical criteria and a data base on one path; the cardiac generator and excitation model, the 'inverse problem' and modeling on the other; leading via fixed dipoles, moving dipoles or multipoles to a diagnosis.)
multilead (126 or more) systems, mainly for research purposes (see e.g. Barr,
1971). The vectorcardiographic lead systems are based on the physical assumption
that the heart's generator can be considered as a dipole with fixed location but
variable dipole moment. The main goal for the choice of the set of spatial samples
is to allow some kind of inverse computation, from the body's surface to the
electrical field within the heart, either by mathematical inverse solutions (based
on physical models) or by parameter estimation and logical reasoning (based on
statistical and pattern recognition models). Fig. 1 shows the different approaches
to the analysis of the ECG. For the development of the physical models, of
course, insight into the course of the electrical field through the myocardium is
necessary. Such models are again simplifications of reality: sometimes rather
crude as in vectorcardiography; moving-dipoles (Horan, 1971); multiple fixed
dipoles (Holt, 1969; Guardo, 1976); or, still more abstract, multipoles
(Geselowitz, 1971). It is not our intention to treat these different models in this
chapter, since almost none of them have demonstrated real clinical implications,
but we will restrict ourselves to the second type of solutions, that make use of
clinical experience and evidence.
Still, the advantage of theoretical electrocardiology is that it has provided us with comprehensive knowledge about the coherence between the different approaches, and with electrode positions that bear diagnostic significance (e.g. Kornreich, 1974) without being too sensitive to inter-individual differences in body shape, i.e. the volume conductor for the electric field.
We conclude this section by mentioning that the essential knowledge to build models of whatever kind is based on experiments with isolated or exposed hearts;
Fig. 2. Stages in pattern recognition and signal processing which run fully parallel. In many instances,
however, the processing is not as straightforward as indicated here but includes several feed-back
loops. In this chapter several such feedback examples are mentioned.
on the construction of the so-called forward models that simulate the real process;
and on the acquisition of (always large) populations of well-documented patient
data, ECG's. These, together, form the basis for the development of the inverse
models, of physical, statistical or mixed nature.
3. Detection
Fig. 4. Example of a VCG recording with up to 7 different QRS wave shapes. The points of QRS detection have been indicated by vertical marks, the cluster number by the numbers 1 to 7.
tor and its different time-varying conductances, see Horáček, 1974). In summary, an ECG may show two wavetrains of rather stable shape (P and QRS), both followed by repolarization signals, of which mainly the ST-T wave after the QRS is clearly seen. In abnormal cases one may see almost any combination of atrial and ventricular signals, sometimes consisting of different QRS wave shapes resulting from intra-ventricular pacemakers. In Fig. 4 an illustration is given of a rather chaotic signal in which only sporadic periodic epochs are observed.
Such signals pose a big challenge for the development of generally applicable processing methods. The first problem to be solved is to detect all QRS-complexes without too many false positives (FP) or missed beats (FN). If this problem has been solved, the question remains how to detect the tiny P-waves amidst the ever-present disturbances, especially if the rhythm and the QRS-waves are chaotic.
Fig. 5. Illustration of the principles of strong and weak coupling for simultaneous detection and
estimation of signals. In case of weak coupling, only one feed-back loop is present. In ECG pattern
recognition, these principles are frequently used.
The principle is the following. Biological processes, such as the functioning of the heart, can
frequently be considered in the time domain as a series of coupled events, as we
have seen already for the P-wave and the QRS-complex. In analyzing such signals
we are interested in the occurrence of each of these events, which can be
expressed as a point process. In many instances, however, it is a complicated task
to derive the point process from the signal if we do not exactly know what event
(wave shape) to look for. On the other hand, the determination of the wave shape
itself is much facilitated if we are informed about the occurrence of the events
(the point process). Accordingly, a priori knowledge about one aspect of the
signal considerably simplifies the estimation of the other.
In practice, this process of simultaneous detection and estimation in ECG's is
done iteratively: a small part of the signal serves for a first, rough event detection
and estimation of the wave shape, and based upon this, an improved point
process can be computed and so on. However, if we have to deal with only rarely
occurring wave shapes as in intra-ventricular depolarization with wandering
pacemakers, such a priori knowledge is not available. We can improve the
estimation only if at least a few ECG-beats of identical shape are present for
analysis.
It is unnecessary to state that the optimum performance of a detector is obtained only if we also have at our disposal the prior probabilities of the occurrences of the different waves. Although these are seldom known for the individual ECG, a good compromise is to optimize the detector's performance on a library of ECG's and to test it on another, independent population.
Preprocessing (QRS)
Since most ECG interpretative systems are processing at least 3 simultaneous
leads, the detection functions are also based on combined leads. The commonly
used detection functions d(i) (i the sample number, sampling rates taken accord-
ing to definitions given by the American Heart Association (AHA, 1975)) are
based on derivatives (i.e. comparable to band-pass filtered leads) o r - - i n terms of
three-dimensional vectorcardiography--the spatial velocity. If the ECG(i) is
expressed as
ECG(i) = (X₁(i), X₂(i), X₃(i)),

then the detection function d(i) can be written as

d(i) = Σ_k T(X_k(i))
with T a transformation of X_k(i). The simplest formula for computing the derivative is the two-sided first difference, so that

T = |X_k(i+1) − X_k(i−1)|².

The spatial velocity is in this case just the square root of d(i). Other, simpler forms, saving processing time for d(i), are, with absolute values,

T = |X_k(i+1) − X_k(i−1)|,

and a third detection function is computed from the original amplitudes,

T = |X_k(i)|².

The disadvantage of d(i) with the last transformation is that it is very sensitive to changes in the baselines.
essentially identical to the ones mentioned here. Fig. 6 shows an example of a
detection function computed from absolute first differences.
Fig. 6. Detection of the QRS-complexes in an ECG recording. From the scalar leads Xk(i), the
detection function d(i) is computed. Three thresholds are applied after estimating the 100% level. A
candidate QRS is detected with these thresholds. Further refinement, i.e. the determination of a point
of reference, is done from the derivative of one of the leads, in this case X₁. Vertical lines indicate the
points of reference of the detected QRS complexes.
First the 100% level is estimated as the averaged peak of all QRS-complexes in d(i). Next, thresholds are applied at 5, 25, and 40% of this averaged peak.
If the detection function fulfills the following conditions, a QRS-complex is labelled as a candidate wave:

(d(i) > 25% during > 10 msec.) ∧ (some d(i) > 40% during 100 msec. thereafter) ∧ (the distance to a preceding candidate > 250 msec.).
Other systems apply different thresholds and rules for QRS finding, but all
approaches follow one or another logical reasoning or syntactic rule, based on
intrinsic ECG properties, expressed as statistics of intervals, amplitudes or other
signal parameters. We will proceed with Plokker's method.
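A Python sketch of the detection function and of the candidate rule just quoted (a sketch, not Plokker's implementation; the sampling rate and the estimation of the 100% level are assumptions, e.g. 500 Hz and a given estimate):

```python
import numpy as np

def detection_function(X):
    """X: array of shape (3, n), the three leads; absolute two-sided
    first differences summed over the leads."""
    d = np.zeros(X.shape[1])
    d[1:-1] = np.sum(np.abs(X[:, 2:] - X[:, :-2]), axis=0)
    return d

def qrs_candidates(d, top, fs=500):
    """top: the estimated 100% level (averaged QRS peak of d)."""
    lvl25, lvl40 = 0.25 * top, 0.40 * top
    ms = lambda t: int(t * fs / 1000)
    cands, last = [], -10**9
    i = 0
    while i + ms(100) <= len(d):
        if (np.all(d[i:i + ms(10)] > lvl25)
                and np.any(d[i:i + ms(100)] > lvl40)
                and i - last > ms(250)):
            cands.append(i)
            last = i
        i += 1
    return cands
```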
If the candidates mentioned above are found, an algorithm is applied to discriminate between different QRS wave shapes. First, the lead with the largest absolute value of the derivative is determined (see also Fig. 6, where the X-lead is chosen). In this selected lead a point of reference is determined: the zero-crossing with the steepest slope within a search interval of ±100 msec. around the first rough indication. After the zero-crossing (i.e. the point of reference) has been found, a template is matched to the filtered QRS-complex to assure a stable reference point (the application of strong coupling in the detector). This template matching is done with the aid of two levels at ±25% of the extremum. Fig. 7 shows examples of the ternary signals that result from this procedure for typical QRS-shapes after band-pass filtering. Templates are determined for all different QRS wave shapes in an ECG recording, in such a way that each time a new QRS-shape is detected, a template is automatically generated. Since such templates have only the values 0 and ±1, matching itself can be reduced to elementary arithmetic involving only additions and subtractions. An already known QRS is assumed to be present if the correlation is larger than 0.70,
Fig. 7. Examples of ternary templates, derived from an individual ECG for the preliminary labelling
of QRS wave shapes and the determination of a point of reference. The ternary signals are computed
from the band-pass filtered QRS-complex and used for signal matching.
else a new template is generated. In the end, all candidate complexes are matched with all templates found in an individual recording to determine the highest correlation factors. A method as described here yields less than 0.1% FP (false alarms) or FN (missed beats), as found in a population of 47 750 beats from 2769 ECG's (Plokker, 1978).
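The essence of the ternary template matching can be sketched in Python as follows (the exact correlation used by Plokker is not reproduced above, so a normalized ternary correlation is assumed here):

```python
import numpy as np

def ternary(x):
    """Quantize a band-pass filtered complex to {-1, 0, +1} using the
    two levels at +/-25% of its extremum."""
    ext = np.max(np.abs(x))
    t = np.zeros(len(x), dtype=int)
    t[x > 0.25 * ext] = 1
    t[x < -0.25 * ext] = -1
    return t

def match(t, template):
    """Correlation of two ternary signals; a value above 0.70 would mean
    an already known QRS shape, otherwise a new template is created."""
    t, template = np.asarray(t), np.asarray(template)
    denom = np.sqrt((t != 0).sum() * (template != 0).sum())
    return float((t * template).sum() / denom) if denom else 0.0
```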
3.3. P-detection
A similar, though more intricate, strategy can be followed for the location of P-waves. They are small compared to the QRS-complex, of the order of 100 μV or less, which is of the order of magnitude of the noise in the bandwidth of 0.10 to 150 Hz, which can vary from 10 μV upward. The frequency spectrum of such P-waves, however, lies far below this 150 Hz: roughly between 0.10 and 8 Hz. The P's may or may not be coupled to QRS-complexes, and most of the time occur with a repetition rate of less than 200 per minute. The duration of the P-wave is about 60 to 100 msec.; its shape may vary from person to person and from lead to lead. If the P-wave is coupled to the QRS, the range of the PR interval distribution is less than 30 msec.
The processing of P-waves may require much computer time unless we use data
reduction methods and optimized algorithms to speed up the processing. Often
we have to look for a compromise between what is theoretically (from the
viewpoint of signal processing) desirable and practically (from the standpoint of
program size and processing time) feasible. P-wave detection is an illustrative
example in this respect. First of all, we will discriminate between 'coupled' and
'non-coupled' P-waves. This is of importance, since regular rhythms are most
commonly seen (in more than 90% of all patients). For that reason the detector
has first of all to ascertain whether such coupling is present or not. The detection
of 'non-coupled' P-waves is much more cumbersome and requires considerable
computational effort. For both approaches we will discuss the processing stages
mentioned earlier (for illustration purposes we follow the lines of thought
published by Hengeveld (1976)).
Fig. 8. Examples of PR, PP and RR interval histograms for 3 different ECG's, used for the
determination of the presence of coupled P-waves and stable sinus rhythms. In case of P-wave
coupling, the PR appears to be rather constant even if the PP and RR intervals are varying. Distances
between vertical marks are 1/30 sec. apart. Left: stable sinus rhythm; middle and right histograms:
irregular sinus rhythm but stable PR intervals.
The ranges of the interval distributions are computed, and if only one of these is small enough (i.e. < 30 msec.) for at least 80% of all QRS-complexes present, the decision 'coupled P-waves' is made. Fig. 8 shows examples of such distributions for 3 simultaneous leads, for coupled as well as for non-coupled Ps.
If the P-waves cannot be classified as coupled to the following QRS, it still remains to be investigated whether Ps are present and, if so, at what instants. For that reason the entire processing is started once more, now using shape information. We will follow the stages in this second approach as well.
Fig. 9. Steps in the recognition of non-coupled P-waves. In the original signal (a) the QRS is cut away
(b) in order to diminish the response of the high-amplitude QRS in the band-pass filtered output as
seen in (c). The latter signal is rectified (d) and thresholds are applied to derive a ternary signal (e) that
is supposed to give a response where P-waves are located. Signal (e) is crosscorrelated with a template
(f) that has been computed from a training population of P-waves. The matching function is seen in
(g). Again a level is applied to detect the presence of P-waves.
in the same manner (and not from the individual ECG recording, as in QRS detection). In this template (Fig. 9(f)) the information about the set of P-waves is condensed, albeit in a rather crude way, for reasons of processing speed. Of course it would in some instances be better to use the prior information about the P-wave shapes of the individual ECG but, as noted in Section 3.1, this is not always feasible, so that in such cases the statistical properties of a population of signals are used instead. The parameters used for recognition are thus the instants of the different level crossings, which can be visualized as a ternary signal. In practice, this cross-correlation or matching again does not imply any multiplication, since it can be proven that the entire correlation can be carried out by simple additions and subtractions of intervals, computed only at the instants of the level crossings, which is most advantageous for processing speed. An example of the correlation as a function of time is shown in Fig. 9(g). If the correlation reaches a maximum above 0.80, the presence of a P-wave is assumed. The procedure is carried out for each individual lead available and for all TQ-intervals.
Most interpretation systems for ECG's offer only overall evaluation results. The available literature in this field only seldom gives the reasons for ECG-
Table 1
Evaluation of the P-wave detection method described in Section 3. The numbers are derived from 42 240 P-waves from 2769 ECG recordings. 162 P's are missed and 1072 falsely detected. For these ECG's this gave rise to less than 2.5% errors in the arrhythmia classification, of which the majority were minor deviations.

              Reference
               +       −
Computer  +   41168   1072
          −     162   n.a.
misclassification, which might have happened anywhere during the various steps and stages of processing. For that reason it is of the utmost importance to trace the shortcomings of all intermediate steps involved in the interpretation, so that possible weak links in the chain of steps can be improved. For the two different approaches to P-wave detection this has been done by Plokker (1978). Evaluation results from a population of 1769 patients can be seen in Table 1.
We will conclude this section on detection by mentioning that in processing ECG's the finding of other events is important to avoid wrong (FP) detections. This regards the detection of 'spikes' (sometimes with shapes similar to the QRS) resulting from electrical disturbances in the environment of the patient; the measurement and correction of 60 Hz (50 Hz) mains voltage; and the effect of electrode polarization causing wandering baselines or even amplifier saturation. Furthermore, there are disturbances of biological origin: patient movements and their effects on the ECG, such as baseline fluctuations, electromyographic signals and the modulation of the signal (up to a modulation depth of 50%) caused by respiration. In order to obtain a reliable interpretation of the ECG with a minimum of FP and FN, all steps (the detectors, parameter estimators and classifiers) have to reckon with these disturbances. In many systems special detectors and pattern recognition algorithms have been built in to find baseline drift, spikes, EMG and so on. Especially in cases where a superposition of signal and nonstationary disturbances exists, discrimination is very complicated or even impractical, given the finite amount of time allowed for an ECG computer interpretation because of economic implications.
4. Typification
The aim of typification is to find the normal or modal beat by labelling all wave shapes; in the diagnostic classification it is directed towards a discrimination between (degrees or combinations of) disease patterns.
Preprocessing (typification)
In this step we depart from the original signal(s), given the fiducial points found by the detection. As in all pattern recognition applications, here also the question arises which set of features to use for typification. We repeat that as much a priori information should be utilized as is known and available. For illustration we discuss again one specific development (Van Bemmel, 1973). We know that the duration of a QRS-complex is on the average not longer than 100 msec. and that most signal power is found in the bandwidth between about 8 and 40 Hz. For these reasons the QRS is filtered by a digital convolution procedure around the reference point and the sampling rate is reduced to 100 Hz, in such a way that the instantaneous amplitudes (or, in VCG, vectors) are located in an area around, and phase-locked with, the fiducial point at 10 msec. distances apart, where most of the signal power is located. In practice, this could mean e.g. 3 or 6 parameters before and, respectively, 7 or 4 parameters after the reference point. The location of the window of instantaneous amplitudes is therefore dependent on the QRS morphology. From this set of 10 filtered, instantaneous amplitudes per QRS-complex, the features are computed for typification or labelling.
ρ(j, k) = cov(j, k)[cov(j, j)cov(k, k)]^{−1/2}.

Both the ρ's and the instantaneous power P(k) are used as features for typification. Two complexes with indices j and k are said to be identical if
Fig. 10. Effect of the typification threshold on the number of correctly labelled QRS-complexes. If the
level is too high, all complexes are called identical and vice versa. In case of parallel leads,
combinatory rules are applied for the optimization of typification.
The condition means that in 10-dimensional feature space two complexes are called identical if they fall within a cone with a spatial angle determined by λ and within two spherical shells determined by w.
To speed up the computation for an ECG of, say, 20 beats, the entire matrix of 20×20 terms is not computed; in practice a sequential method is employed which needs only a small part of the matrix. Starting with the first complex, the first row of the matrix (i.e. the covariances with all other complexes) is computed. Next, only those rows are computed for which the conditions of similarity were not fulfilled, which brings the number of computations back to about 10%. This procedure is repeated for all available leads. If more than one lead is available, again a syntactic majority rule is applied, which is optimized on a learning population. The determination of the dominant beat is also done by such rules, making use of the intervals between the different types of complexes. Table 2 presents a result of the algorithm for QRS-typification. The typification of ST-T waves is done in a similar way, also based on instantaneous amplitudes, which fall, however, in a much lower frequency bandwidth. The combined results of QRS- and ST-T labelling finally give the dominant complexes that can be used for the diagnostic shape classification, to be treated later on.
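The typification criterion can be sketched as follows in Python (the thresholds λ and w are illustrative, since the exact identity condition is not reproduced above):

```python
import numpy as np

def same_type(vj, vk, lam=0.9, w=1.3):
    """Two complexes are taken as identical if their correlation exceeds
    lam (the 'cone') and their powers agree within a factor w (the
    'spherical shells')."""
    rho = np.corrcoef(vj, vk)[0, 1]
    Pj, Pk = float(np.sum(vj ** 2)), float(np.sum(vk ** 2))
    return rho > lam and max(Pj, Pk) / min(Pj, Pk) < w

def typify(complexes, **kw):
    """Sequential labelling: compare each complex only with the types
    found so far, as in the row-skipping scheme described above."""
    types, labels = [], []
    for v in complexes:
        for idx, t in enumerate(types):
            if same_type(v, t, **kw):
                labels.append(idx)
                break
        else:
            types.append(v)
            labels.append(len(types) - 1)
    return labels
```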
Supervised learning
Thus far we have restricted ourselves to the labelling of QRS-waves from ECG
recordings of rather short (e.g. 5-15 sec.) duration. If this problem has to be
solved for very long recordings as in a coronary care unit or for ambulatory ECG
monitoring, slightly different techniques can be used (Ripley, 1975; Feldman,
1977). Especially in the first situation all operations have to be executed in
real time or even faster. If time allows, in such circumstances interactive
Table 2
Decision matrix for QRS typification. In 2769 records 47 751 QRS-complexes were seen, with up to 6 different wave shapes. The overall result was an error rate of less than 0.07%. The artefact typifications were done on 50 distorted waveforms and noise spikes, not seen by the detection stage.

                       Computer
Type           1      2    3    4    5    6
          1    47209  24   1
          2    19     424  7
Reference 3               50
          4                    11
          5
          6
Artefact       39     6    5
pattern recognition may offer great advantages (Kanal, 1972). Several systems,
therefore, make use of man-machine interaction for waveform recognition
(Swenne, 1973). As soon as the computer (e.g. by using the same methods as
explained before) finds an unknown QRS-shape, the user (nurse, physician) is
requested to indicate whether he wants to label this wave as normal or, e.g., as a
PVC or wants to ignore it. The computer stores the patterns belonging to the
indicated beats in memory for future comparison. In this way two goals are
served: the user determines the labelling for the individual patient himself and he
is alerted as soon as waves of strange or abnormal shape suddenly occur. During
training (supervised learning) and thereafter, the computer determines the gravity
points of the clusters $\varphi_k$, belonging to type k, as follows:

$m_k = \frac{1}{n_k}\sum_{i=1}^{n_k} v_i,$
$v_i$ being the (10-dimensional) feature vector for complex i and $n_k$ being the
number of times the complex of type k has been observed. The dispersion of the
cluster is determined in the usual way:

$s_k^2 = \frac{1}{n_k - 1}\sum_{i=1}^{n_k} (v_i - m_k)^2.$
The distance from some new vector $v_j$ to the clusters $\{\varphi_k\}$ is computed by the
normalized Euclidean distance:

$d_{jk}^2 = \sum (v_j - m_k)^2 / s_k^2.$

$v_j$ is allocated to $\varphi_k$ instead of $\varphi_l$ if $(d_{jk} < \lambda_a d_{jl}) \wedge (d_{jk} < \lambda_k)$. Proper measures for
the thresholds are $\lambda_a = 5$ and $\lambda_k = 3$ or 4. In order to allow a gradual change in
wave shapes (letting the cluster $\varphi_k$ slowly float in feature space), we may use
recursive formulae as soon as $n_k$ exceeds a threshold $\lambda_n$ (during the training $\lambda_n = n_k$).
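The following is a minimal Python sketch of the supervised-learning bookkeeping described above: gravity points and dispersions per type, allocation by normalized Euclidean distance with the thresholds $\lambda_a$ and $\lambda_k$, and a recursive update with a capped effective count so that a cluster can slowly float. The cap value and the exact form of the recursive update are assumptions, since the original recursive formulae are not reproduced here.

import numpy as np

class Cluster:
    # Gravity point m_k and dispersion s_k^2 of the feature vectors of type k.
    def __init__(self, v):
        self.n = 1
        self.m = np.asarray(v, dtype=float)
        self.s2 = np.ones_like(self.m)        # neutral initial dispersion

    def update(self, v, cap=50):
        # Recursive update with the effective count capped at 'cap' (an
        # assumed value) so the cluster may slowly float in feature space.
        n = min(self.n + 1, cap)
        self.m += (v - self.m) / n
        self.s2 += ((v - self.m) ** 2 - self.s2) / n
        self.n += 1

def allocate(v, clusters, lam_a=5.0, lam_k=3.5):
    # Normalized Euclidean distances d_jk from v to every cluster.
    d = np.array([np.sqrt(np.sum((v - c.m) ** 2 / c.s2)) for c in clusters])
    order = np.argsort(d)
    best = order[0]
    runner_up = order[1] if len(d) > 1 else order[0]
    # Allocate only if (d_best < lam_a * d_runner_up) and (d_best < lam_k).
    if d[best] < lam_k and (len(d) == 1 or d[best] < lam_a * d[runner_up]):
        return int(best)
    return None    # unknown shape: defer to the human observer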
5. Boundary recognition
$T = |AR_N(x_k(i))|$

where $AR_N$ is an autoregressive digital filter, computing some bandpass filtered
version of the ECG $x_k(i)$, intended to obtain the derivative (while increasing the
SNR) and based on ±N sample points around i.
Thresholds
Threshold detection is done by applying a fixed or relative amplitude level in
d(i) within a window where the wave boundary is to be expected. In some cases
feedback is built into the method in such a way that the threshold may be
adaptively increased if too many level crossings are seen within the window.
Signal matching
The second method that has been reported is the use of a standard wave form
around the point where the boundary is expected. This standard wave form s(k)
is computed from a learning set of functions d(k) with known boundaries,
indicated by human observers. The method then searches for the minimum of
$e_s(i) = \sum_{k=-N}^{M} w(k)\,\big(d(i+k) - s(k)\big)^2$
with N and M points before, resp. after the boundary. For the weighting factor
w(k), the dispersion of s(k) at point k is usually taken, so that $e_s$ is the weighted
mean squares difference between d and s. The minimum of $e_s(i)$ yields the
boundary at $i = i_0$.
A disadvantage of this method is that it is rather sensitive to noise, so that in
such circumstances the function d(i) may remain at relatively high amplitude
levels.
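A minimal sketch of the signal-matching search is given below, under the assumption that the weighted squared differences are simply summed over the window; the weighting function and the set of candidate positions are supplied by the caller and are not part of the original formulation.

import numpy as np

def match_boundary(d, s, w, candidates):
    # Find the position i0 in the detection function d minimizing the
    # weighted mean squares difference e_s(i) with the standard wave form s;
    # s covers N points before and M points after the boundary.
    L = len(s)
    best_i, best_e = None, np.inf
    for i in candidates:              # candidate window start positions
        seg = d[i:i + L]
        if len(seg) < L:
            continue
        e = float(np.sum(w * (seg - s) ** 2))
        if e < best_e:
            best_i, best_e = i, e
    return best_i, best_e             # boundary lies N points into window i0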
Template matching
Two-dimensional (i.e. time and amplitude) templates have been developed for
wave form boundary recognition as well. In such applications a signal part is
considered as a pattern in 2-dimensional space, to be matched with another 2-D
template, constructed from a learning set. We will briefly explain the method.
Here again we start from the set of L functions $\{d_l(i)\}$, reviewed by human
observers (boundaries were indicated in the original signals $x_k(i)$). Around the
boundaries, windows are applied (see Fig. 11), to be used later on in the
cross-correlation.
Within the window area we determine a multi-level threshold function f as

$f_l(i) = \mathrm{sign}\{d_l(i) - \lambda\}.$
[Figure: detection function d(i) over a 1-second strip; windows are placed around QRS onset, QRS end and T-wave end, with a 25% threshold level indicated.]
Fig. 11. Example of the windows that are applied in the detection function d(i) for the recognition of
wave boundaries. Within these windows a template is matched to a function computed from d(i) (see
Section 3.2 for the definition of d(i)).
i is the sample number, l one of the L functions $d_l$, and $\lambda$ the applied threshold.
'sign' takes the sign of the expression in brackets. So the area where the function
$d_l(i)$ is larger than $\lambda$ is given the value +1, otherwise the value -1. As a matter of
fact, see above, this is the simplest boundary detector, yielding a response
only at the place where $d_l(i)$ crosses $\lambda$.
We now define a template $T_\lambda(i)$ in which the statistical properties of all $f_l$ are
comprised:

$T_\lambda(i) = \frac{1}{L}\sum_{l=1}^{L} f_l(i).$
$W_\lambda(i) = \begin{cases} \mathrm{sign}(T_\lambda(i)) & \text{for } |T_\lambda(i)| > \lambda_T, \\ 0 & \text{for } |T_\lambda(i)| \le \lambda_T. \end{cases}$
[Figure: templates (a), (b) and (c) with values -1, 0 and +1 around the P-wave onset; a 5% level and a 20 ms scale are indicated.]
Fig. 12. Illustration of the adaptation of a P-wave template, after feedback of the computed template
to the learning population itself. The area where the detection function matches the template narrows
after a few iterations. The first template (a) is based on the learning population; (b) is computed from
the points estimated by (a), and so on for (c).
A suitable value for $\lambda_T$ lies somewhere around 0.30. The advantage of $W_\lambda$ instead
of $T_\lambda$ is that in this case again only additions and subtractions are computed and
no multiplications are involved.
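The template construction can be sketched in Python as follows; the treatment of the zero case of the sign function and the default clipping constant are assumptions.

import numpy as np

def build_template(D, lam, lam_T=0.30):
    # D is an L x n array of detection functions from the learning set.
    f = np.sign(D - lam)              # f_l(i) = sign{d_l(i) - lambda}
    f[f == 0] = -1                    # equality treated as 'below' (assumed)
    T = f.mean(axis=0)                # T_lambda(i) in [-1, +1]
    return np.where(np.abs(T) > lam_T, np.sign(T), 0).astype(int)

def correlate(W, f):
    # With W and f restricted to {-1, 0, +1}, the cross-correlation reduces
    # conceptually to additions and subtractions only; the vectorized
    # product below is numerically equivalent.
    return int(np.sum(W * f))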
Methods such as the ones described here are in routine use for a wide variety of
ECG interpretation systems. Only very few reports have appeared in the literature
giving evaluation results of the algorithms on well-documented ECG's. Yet, these
boundaries form the basis for all diagnostic procedures.
The inaccuracies that are still allowable for P-, QRS- and ST-T boundaries are
in the order of 15, 5 and 30 msec., resp., at both sides of the onsets or endpoints.
Some interpretation programs for ECG's apply the boundary detection to each
complex separately and next apply a majority rule to determine the most
probable locations of the wave edges (e.g. by the determination of the median of
the measured distribution of recognized boundaries). Other programs apply the
edge detection only after coherent averaging of the dominant beats by the
typification step. The outcomes of both approaches are in principle different,
because of the different influence of disturbances on both methods.
Promising wave parsing methods, also applied to ECG's, have been reported by
Stockman (1976) and Horowitz (1977). These methods are essentially syntactic
approaches to this problem. After many years of research in the wide field of
Pattern Recognition, no general method for edge or boundary detection is
available. The only common factor among all reported techniques is that they
at least strive for the maximization of the likelihood of the same phenomenon,
and one can only hope that they converge to identical solutions.
Problems which involve the segmentation of ECG's (in general: signals) have
much in common with the boundary recognition methods as treated here. Again,
it fully depends on the signal characteristics and the ultimate goal of the user
what strategy is followed.
The proper selection of features is the basis for all pattern and signal classifica-
tion. As soon as we have determined in earlier steps the signal parts to be
classified, the question arises: which features?
This question is also a main issue for ECG interpretation. Some investigators
supposed that those parameters by which the original ECG can be reconstructed
(e.g. an orthonormal basis such as Nyquist samples; Fourier, Karhunen-Loève or
Chebyshev components) are a sufficient basis for a feature space (e.g. Young,
1963). This, however, is only seldom true, since these parameters are ideal for a
syntactic shape reconstruction but do not necessarily have a semantic information
content. Features that have diagnostic discriminatory power are very often
computed from non-linear combinations of the syntactic basic components, such
as products, ratios, squares or time intervals, which may be related to biological
events and phenomena. Such parameters will hardly ever automatically arise, even
by non-linear mapping techniques. For that reason only sound theoretical reason-
ing based on fundamental knowledge of the biological process (see also par. 2) is
the ideal way to obtain relevant features.
In many cases, however, the significance of the features is only a posteriori
demonstrated by means of operations on well-documented populations of ECG's,
often referred to as heuristic feature selection.
Classification of contours
Classification results for wave contours have been reported for almost all
existing ECG programs, but except for the study by Bailey almost no objective
evaluation study has been published based on the same ECG population (Bailey,
1974). Classification results, of course, differ widely from one application area to
the other (e.g. in screening or in a heart clinic). Some programs (Pipberger, 1975)
are primarily based on independent, i.e. non-ECG information (history, catheteri-
zation data, autopsy reports, etc.) instead of the diagnosis by cardiologists based
on ECG morphology.
This purely statistical approach to ECG diagnosis, however, has not received
the expected interest from the medical community. If the final diagnosis based on
the ECG itself and obtained from a team of cardiologists is used for reference
purposes (e.g. Bonner, 1972), the best results reported thus far claim a percentage
of > 95% of correctly classified wave forms. The evaluation of ECG contours is
usually done along three different lines: with respect to measurements that can be
verified by non-ECG data; for features that can only be derived from the ECG
itself (like conduction disturbances); and for purely descriptive parameters (like
ST-elevations). A detailed description of these three approaches can be found in
the results of The Tenth Bethesda Conference on Optimal Electrocardiography
(1977).
In this section we will briefly describe the method that is followed for statistical
classification and evaluation. Although the proper choice of features primarily
determines the results of the classification and the model that is being used for
discrimination can reveal the requested performance only on the basis of these
properly chosen feature vectors, the statistical approach, at least from a theoreti-
cal point of view, offers some advantages over logical reasoning, as has been amply
shown by Cornfield (1973) and Pipberger (1975). An advantage is, e.g., the fact
that in a multivariate approach we may easily and analytically take into account
prior probabilities for the different diseases and cost and utility factors. The a
posteriori probability of having a disease k out of K, given the prior probabilities
p(k) and the feature (symptom) vectors x with their conditional probabilities
p(xlk), can be expressed with Bayes as:
$p(k \mid x) = p(x \mid k)\,p(k)\Big(\sum_{j=1}^{K} p(x \mid j)\,p(j)\Big)^{-1}.$
The classification of vector x to a certain class is determined by the maximum of
$p(k \mid x)$, if desired beforehand weighted with a matrix of cost factors. Assuming
that the vectors x have normal distributions for all diseases k and identical
variance-covariance matrices D for all distributions $x \mid k$, with $m_{jk} = m_j - m_k$ and
$\bar m_{jk} = m_j + m_k$ ($m_k$ the mean of class k), we may write for the a posteriori probabilities

$p(k \mid x) = \Big(1 + \sum_{j=1,\, j \neq k}^{K} \exp\big(x^{\mathrm T} D^{-1} m_{jk} - \tfrac{1}{2}\,\bar m_{jk}^{\mathrm T} D^{-1} m_{jk}\big)\, p(j)/p(k)\Big)^{-1}.$
Cornfield (1973) has shown the influence of the prior probabilities in such models
if used in clinical practice (see also Pipberger, 1975). If too many disease classes
(age, race, sex, etc.) in different degrees and combinations have to be discerned,
such statistical models require an impractically large population for training the
parameters. In such cases it is necessary to combine the advantage of the purely
statistical approach with that of the heuristic and logical solution to the classifica-
tion problem.
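For illustration, a small Python routine computing the a posteriori probabilities $p(k \mid x)$ of the multivariate normal model above; it works in the log domain for numerical stability, which is an implementation choice rather than part of the original formulation.

import numpy as np

def posteriors(x, means, D, priors):
    # p(k|x) for K disease classes with common covariance matrix D; the
    # quadratic term in x cancels because D is shared by all classes.
    Dinv = np.linalg.inv(D)
    scores = np.array([x @ Dinv @ m - 0.5 * (m @ Dinv @ m) + np.log(p)
                       for m, p in zip(means, priors)])
    scores -= scores.max()            # guard against overflow
    w = np.exp(scores)
    return w / w.sum()                # p(k|x), k = 1..K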
Classification of rhythms
Rhythm diagnosis is based on the measured PP, RR and PR intervals as well as
P-wave and QRS morphology, found by the detection and typification or wave
form labelling steps. If no detailed diagnosis is given of complicated rhythms but
Serial electrocardiography
In recent years many programs for ECG classification have incorporated
algorithms for serial analysis of ECG's (e.g., Macfarlane, 1975; Pipberger, 1977).
Improvement in the final classification is claimed, which is not surprising because
differences in morphology as compared to an earlier recording can be taken into
account. This requires, however, very standardized locations of the electrodes,
since especially in the chest leads a minor misplacement may cause large changes
in QRS shape.
Present research in contour classification is, besides the further investigation of
serial electrocardiography, primarily directed towards the derivation of features
from multiple leads (Kornreich, 1973); the stability of classifiers (Willems, 1977;
Bailey, 1976); the use of fuzzy set theory (Zadeh, 1965) and syntactic approaches
to classification (Pavlidis, 1979).
7. Data reduction
being used. Essentially such methods replace a signal by samples at unequal time
intervals, only measured if a certain threshold of the first or second difference is
crossed. Bertrand (1977) applied this and other algorithms in a system for
transmission.
Other techniques for ECG compression make use of a series of orthogonal
basic functions for the reconstruction of the wave forms. Well known is the
Karhunen-Loève expansion (see Young, 1963) or a Chebyshev transform. The
first method, yielding the eigenvectors, was evaluated by Womble (1977) together
with reduction by spectral techniques. As has been observed already, such
methods do not take into account the semantic information comprised in the
ECG. For that reason they are most helpful in detecting trends in intervals or
sudden changes in wave shapes in individual patients; for contour classification
they are rather inefficient since, e.g., tiny Q-waves may be missed by the fact that
their signal power is less than the distortions allowed, if integrated over the
duration of the wave. This is the reason why most long-term storage systems store
either the samples of the dominant beat or even the entire recording, possibly
sampled at 250 Hz. Another reason is the fact that the technical means for
inexpensive storage and retrieval have gradually diminished the need for data
reduction algorithms, which always more or less increase the signal entropy.
8. Discussion
References
Bailey, J. J., Horton, M. and Itscoitz, S. B. (1976). The importance of reproducibility testing of
computer programs for electrocardiographic interpretation. Comput. and Biomedical Res. 9, 307-316.
Barr, R. C., Spach, M. S. and Herman-Giddens, G. S. (1971). Selection of the number and position of
measuring locations in electrocardiography. IEEE Trans. Biomedical Engrg. 18, 125-138.
Bertrand, M., Guardo, R., Roberge, F. A. and Blondeau, P. (1977). Microprocessor application for
numerical ECG encoding and transmission. Proc. IEEE 65, 714-722.
Bonner, R. E., Crevasse, L., Ferrer, M. I. and Greenfield, J. L. (1972). A new computer program for
analysis of scalar electrocardiograms. Comput. and Biomedical Res. 5, 629-653.
Cornfield, J., Dunn, R. A., Batchlor, C. D. and Pipberger, H. V. (1973). Multigroup diagnosis of
electrocardiograms. Comput. and Biomedical Res. 6, 97-120.
Cox, J. R., Nolle, F. M., Fozzard, H. A. and Oliver, G. C. (1968). AZTEC, a preprocessing program
for real time ECG rhythm analysis. IEEE Trans. Biomedical Engrg. 15, 128-129.
Cox, J. R., Nolle, F. M. and Arthur, R. M. (1972). Digital analysis of the electroencephalogram, the
blood pressure wave and the electrocardiogram. Proc. IEEE 60, 1137-1164.
Feldman, C. L. (1977). Trends in computer ECG monitoring. In: J. H. van Bemmel and J. L. Willems,
eds., Trends in Computer-processed Electrocardiograms, 3-10. North-Holland, Amsterdam.
Geselowitz, D. B. (1971). Use of the multipole expansion to extract relevant features of the surface
electrocardiogram. IEEE Trans. Comput. 20, 1086-1089.
Guardo, R., Sayers, B. McA. and Monro, D. M. (1976). Evaluation and analysis of the cardiac
electrical multipole series on two-dimensional Fourier technique. In: C. V. Nelson and D. B.
Geselowitz, eds., The Theoretical Basis of Electrocardiology, 213-256. Clarendon, Oxford.
Gustafson, D. E., Willsky, A. S., Wang, J., Lancaster, M. C. and Triebwasser, J. H. (1978). ECG/VCG
rhythm diagnosis using statistical signal analysis. IEEE Trans. Biomedical Engrg. 25, 344-361.
Hengeveld, S. J. and van Bemmel, J. H. (1976). Computer detection of P-waves. Comput. and
Biomedical Res. 9, 125-132.
Holt, J. H., Barnard, A. C. L. and Lynn, M. S. (1969). The study of the human heart as a multiple
dipole source. Circulation 40, Parts I and II, 687-710.
Horacek, B. M. (1974). Numerical model of an inhomogeneous human torso. Advances in Cardiology
10, 51-57.
Horan, L. G. and Flowers, N. C. (1971). Recovery of the moving dipole from surface potential
recordings. Amer. Heart J. 82, 207-213.
Horowitz, S. L. (1977). Peak recognition in waveforms. In: K. S. Fu, ed., Syntactic Pattern Recognition
Applications, 31-49. Springer, New York.
Kanal, L. N. (1972). Interactive pattern analysis and classification systems: a survey and commentary.
Proc. IEEE 60, 1200-1215.
Kanal, L. N. (1974). Patterns in pattern recognition. IEEE Trans. Inform. Theory 20, 697-722.
Kornreich, F., Block, P. and Brismee, D. (1973/74). The missing waveform information in the
orthogonal electrocardiogram. Circulation 48, Parts I and II, 984-1004; ibid. 49, Parts III and IV,
1212-1231.
Macfarlane, P. W., Cawood, H. T. and Lawrie, T. D. V. (1975). A basis for computer interpretation of
serial electrocardiograms. Comput. and Biomedical Res. 8, 189-200.
McFee, R. and Baule, G. M. (1972). Research in electrocardiography and magnetocardiography. Proc.
IEEE 60, 290-321.
Nelson, C. V. and Geselowitz, D. B., eds. (1976). The Theoretical Basis of Electrocardiology.
Clarendon, Oxford.
Nolle, F. M. (1977). The ARGUS monitoring system: a reappraisal. In: J. H. van Bemmel and J. L.
Willems, eds., Trends in Computer-processed Electrocardiograms, 11-19. North-Holland, Amster-
dam.
Pavlidis, T. (1979). Methodologies for shape analysis. In: K. S. Fu and T. Pavlidis, eds., Biomedical
Pattern Recognition and Image Processing, 131-151. Verlag Chemie, Berlin.
Pipberger, H. V., Cornfield, J. and Dunn, R. A. (1972). Diagnosis of the electrocardiogram. In: J.
Jacquez, ed., Computer Diagnosis and Diagnostic Methods, 355-373. Thomas, Springfield.
Pipberger, H. V., McCaughan, D., Littman, D., Pipberger, H. A., Cornfield, J., Dunn, R. A., Batchlor,
C. D. and Berson, A. S. (1975). Clinical application of a second generation electrocardiographic
computer program. Amer. J. Cardiology 35, 597-608.
Waveform Parsing Systems

George C. Stockman
1. Introduction
The formal language theory that was developed by Chomsky and others for
modeling natural language turned out to be more useful for the modeling and
translation of programming languages. An algorithm for analyzing forms of a
language according to a grammar for the language is called a parsing algorithm.
The parsing of simple languages is well understood and is a common technique in
computer science. (See Gries, 1971, or Hopcroft and Ullman, 1969.) It is not
surprising that known parsing techniques were brought into play in attempts at
machine recognition of human speech (Miller, 1973; Reddy et al., 1973). How-
ever, even earlier the concept of parsing was applied to the analysis of the
boundary of 2-D objects; i.e. chromosome images (Ledley et al., 1966).
The attempts at automatic analysis of 1-D time or space domain signals have
been numerous and varied in approach. An excellent survey of early work on
bio-medical signals, such as the electrocardiogram and the blood pressure wave,
appears in Cox et al. (1972). Early approaches applied the constraints of a
structural model in an ad hoc manner; usually embedded in an application
program. As programming techniques developed, waveform parameters were
placed in tables in computer programs. Later, decision rules were also placed in
tables and the evolution towards more formal analysis techniques had begun.
Waveform parsing systems (WPS) were developed to be applicable to several
problem domains. In order to achieve this, the model of a particular waveform
domain must be input to the WPS as data rather than being implemented by
programming. The loading of a WPS with a waveform model before analysis of
waveform data is pictured in Fig. 1. The structural model is presented to WPS in
a fixed language, the 'structural description language' (SDL). Typically this will
be a BNF grammar or connection tables defining a grammar. The waveform
primitives must also be selected from the set known to WPS, perhaps augmented
by specific numeric parameters. For instance, WPS may use a vocabulary of
shapes such as CAP, CUP, or FLAT which may be parameterized by duration,
amplitude, curvature, etc. The primitives chosen for use in the specific application
[Figure: (a) a WPS skeleton (feature detection algorithms and a general parsing algorithm) is loaded with a description of special features and with structural tables; (b) the loaded WPS, equipped with primitive detectors and parameter tables or prototypes, performs structural analysis and interpretation.]
Fig. 1. (a) The WPS is tuned for a particular set of signals and (b) is then tasked with the analysis of
signals for that set.
Models for primitive and for aggregate structures of waveforms will be dis-
cussed. Assignment of meaning to the structures is clearly problem specific and
perhaps not one of the duties of WPS. A very simple example is presented first so
that the methodology can be established before real world complexity is intro-
duced.
[Figure: (a) six waveform samples; (b) three possible segment shapes, recognized as 'a', 'b' and 'c'; (c) segmentation of the samples with two primitive sets:

primitives are 'c' and 'a':   1 caa    2 aaaa   3 acaa    4 cacaa     5 not possible   6 not possible
primitives are 'a' and 'b':   1 abaa   2 aaaa   3 aabaa   4 abaabaa   5 baa            6 abbb]
Fig. 2. Example of waveform segmentation problem.
(a) Waveform samples 1-4 are in the problem domain and 5 and 6 are not.
(b) Three possible primitives for waveform segmentation.
(c) Segmentation of the six samples using two different sets of primitives.
[Figure: (a) the generative grammar

S → aAS
S → a
A → SbA
A → ba
A → SS

(b) a parse tree giving the structural description of the sample string a b a a b a a.]
Fig. 3. (a) A generative grammar and (b) an example of structural description for some waveform
samples.
algorithm which has been used in waveform parsing is Earley's algorithm (Earley,
1970).
Earley's algorithm is the culmination of a long line of research work and offers
a very simple paradigm for very flexible analysis. The algorithm uses the grammar
to generate hypotheses about the primitive content of the input string. The input
string is then checked for the primitive. There is no backing up by the algorithm
because alternate possibilities are pursued simultaneously. The input string is
recognized and analyzed when the goal symbol of the grammar is recognized as
the last input symbol is seen.
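A compact recognizer in the style of Earley's algorithm is sketched below in Python, using the grammar of Fig. 3 and its sample string; the representation of items and the agenda discipline are implementation choices, not part of Earley's original description.

def earley(tokens, grammar, start="S"):
    # grammar maps a nonterminal to a list of right-hand sides (tuples);
    # symbols without rules are terminals. Returns True iff tokens parse.
    n = len(tokens)
    # An item is (lhs, rhs, dot, origin); chart[k] holds items after k tokens.
    chart = [set() for _ in range(n + 1)]
    chart[0] = {("GOAL", (start,), 0, 0)}       # augmented start item
    for k in range(n + 1):
        agenda = list(chart[k])
        while agenda:
            lhs, rhs, dot, org = agenda.pop()
            if dot < len(rhs):
                sym = rhs[dot]
                if sym in grammar:              # predictor: hypothesize sym
                    for prod in grammar[sym]:
                        new = (sym, prod, 0, k)
                        if new not in chart[k]:
                            chart[k].add(new); agenda.append(new)
                elif k < n and tokens[k] == sym:  # scanner: check the input
                    chart[k + 1].add((lhs, rhs, dot + 1, org))
            else:                               # completer: lhs is finished
                for l2, r2, d2, o2 in list(chart[org]):
                    if d2 < len(r2) and r2[d2] == lhs:
                        new = (l2, r2, d2 + 1, o2)
                        if new not in chart[k]:
                            chart[k].add(new); agenda.append(new)
    return ("GOAL", (start,), 1, 0) in chart[n]

# The grammar of Fig. 3: S -> aAS | a, A -> SbA | ba | SS
g = {"S": [("a", "A", "S"), ("a",)],
     "A": [("S", "b", "A"), ("b", "a"), ("S", "S")]}
assert earley(list("abaabaa"), g)   # sample 4 of Fig. 2(c)

Because alternate items coexist in the chart, no backing up occurs, exactly as stated above.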
The important property of the grammar model is that it defines and constrains
the context in which symbols of the vocabulary can occur. For example, study of
Fig. 3 shows that if the symbol 'b' appears it can appear only in the context of at
most one other 'b'. Use of context in human speech analysis will be discussed in
the next sections.
The theory of parsing is too broad and complex to treat formally here. Instead
the goals and procedures of parsing are intuitively presented. Fig. 4 shows several
attempts at achieving a structural analysis of a given input string representing a
waveform. As a whole, these attempts represent the unconstrained trials that a
human might make in attempting to match the grammar goal symbol to the input.
[Figure: partial parse trees (b)-(g) over the input string a b a a b a a, each shown with a linear encoding such as [.S[a]] & [.A[b][a]]; integers mark symbol positions.]
Fig. 4. Possible parse states for the sample grammar of Fig. 3. A partial parse tree is shown together with
a linear encoding. '.' denotes a 'handle', or current focus of analysis. Integers indicate symbol positions.
Square brackets indicate recognized structures while parentheses indicate open goals.
Parsing algorithms are very limited in their behavior and could not develop all
parsing states shown in Fig. 4. States (b) and (c) are being developed 'bottom-up'
with only one phrase being worked on at a time. State (g) shows bottom-up
development of two phrases. States (d) and (e) show top-down development: a
parse tree rooted at the grammar goal symbol is being extended to match the
input. In (d) only one phrase is being worked on but in (e) there are three
unmatched phrases. In Fig. 4, (f) shows 'N-directional' development. The practical
meaning of these terms will become obvious in the detailed examples of Sections
3 and 4. The essentials of the waveform parsing paradigm are summarized below.
[Figure: (a) a speech signal (0-800 ms) with candidate words MOVE, THE and MORE, TIME matched to its segments; (b) a rough sketch of two competing states of analysis; (c) refined state encodings, e.g. state #1, merit 0.80, '(MOVE THE ...)' predicting the nouns KING, QUEEN, KNIGHT, PAWN, and state #2, merit 0.80, '(subj[adj[MORE,95,305,0.80] noun[TIME,305,410,0.65]] .pred[IS NEEDED | REQUIRED | WANTED,410,-,-])'.]
Fig. 5. Parse states representing alternative competing interpretations of a speech signal.
(a) Matching of possible words in vocabulary to speech segments.
(b) Rough concept of states of analysis.
(c) More refined concept of states of analysis including syntactic labels, phrase structure, location,
and merit. ('.' denotes the 'handle' or current focus of processing.)
of a move which the human wants to make in a chess game with the computer.
The analysis encoded in state #1 records the following. The words 'MOVE THE'
have been recognized in the speech signal by their acoustic features. These words
are in the chess vocabulary. The chess grammar recognized an imperative sentence
to be in progress and predicted that an object phrase was the next segment. The
name of a chess piece was predicted; names of pieces which were not currently in
the game were removed from the set of possibilities. From previous acoustic
analysis the certainty of the detection of 'MOVE' was 0.95 and the certainty for
'THE' was 0.80. The merit of the state was assigned to be the minimum of these.
State #2 encodes an alternate partial interpretation as follows. The words
'MORE TIME' have been detected in the initial speech segment with certainty
Table 1
Operations on parse states. ('Handle' of parse state is phrase where analysis is currently focused.)

Condition of current parse state | Action of feature extraction module | Action of syntactic module | Action of semantic module
handle is recognized word or phrase | none | halts if global goal is recognized; otherwise generates new states using the grammar by setting the next(a) goal | can delete state if unacceptable or reset merit
handle is non-terminal structural goal N | none | creates a new state for each grammar rule applicable to N; handle of each new state is the first(a) subgoal of N | none
handle is terminal structural goal T | attempts to match T in the input data at the appropriate location; returns a match value between 0 and 1 | none | none

(a) 'next' and 'first' mean in left-to-right order here. This order can actually be defined arbitrarily.
0.80 and 0.65 respectively. The chess grammar allows only 3 possible continua-
tions. Since state #1 has a higher merit than state #2, the analysis encoded there
should be continued next. Thus the acoustic feature detection routines should be
called to evaluate the predicted words against the rest of the input signal. This in
turn will result in several possibilities. Only one word may be recognized and
state # 1 will be amended. More than one word may be recognized and state # 1
will be split into competing states of differing likelihoods. If none of the predicted
words are recognized with any certainty the record of analysis in state # 1 will be
deleted. Provided that all of the details of analysis are encoded in the states, the
entire waveform analysis can be controlled by operations on the states defined as
in Table 1. Control is therefore rather simple--just allow the appropriate modules
(feature extractor, syntax analyzer, or semantic analyzer) their turn at accessing
the set of current states of analysis. How this solves the original practical problem
will become clear after Sections 3 and 4.
3.3. HEARSAY syntax
A BNF grammar defined the official chess move language. This means that not
only was the entire vocabulary specified but so was the syntax of all sentences.
Note that the set of all possible nouns would be very constrained, i.e. {KING,
QUEEN, PAWN, etc.}, as would the set of verbs, i.e. {MOVES, TAKES,
CAPTURES, etc.}. The essential property of the grammar, as discussed in Section
2, is that it could accept or reject specific words in context and it could generate
all possible legal contexts. The HEARSAY syntax analyzer behaved as described in
Fig. 5 and Table 1.
HEARSAY was never designed for anything but speech and so would not be
likely to process EKG's. The primary obstacle is the feature extraction module.
Certainly grammar models exist for EKG's and pulsewaves as shown in Section 4,
so the HEARSAY syntax analyzer could be used for structural analysis. Also, it is
likely that ' E K G semantics' could be loaded into HEARSAY in the same way as
chess semantics were. The control of the analysis used by HEARSAY also appears to
be adequate for any time-domain waveform analysis. Thus there are two obstacles
which would prevent HEARSAY from being classified as a true waveform parsing
system. First, not enough variety is provided in the set of primitive waveform
features. Secondly, certain knowledge appears to have been implemented via
program rather than data tables and hence reprogramming would be required for
other applications.
[Figure: (a) a stereotyped carotid pulse wave with labeled points F1 and F2, an UPSLOPE, a LARGE NEG slope, and systolic and diastolic portions; (b) ten different samples, numbered 0854-0867.]
Fig. 6. (a) Stereotyped carotid pulse wave and (b) a set of 10 different samples.
should provide information for disease diagnosis (Freis et al., 1966). Fig. 6 shows
10 sample waveforms and a stereotyped pattern for one heart cycle. The objective
of analysis is to reliably detect the location of the labeled points so that various
measurements can be made. For instance, heart rate is easily determined from $F_1$
and $F_2$. Location of the important points can only be reliably done when the
context of the entire set of points is considered and related to the heart model.
[Figure: each morph is defined by a model, parameter constraints, and width limits. For the CAP and CUP morphs, $y_m(x) = p_2 x^2 + p_1 x + p_0$ with $y(a) = y(b)$ and $c_3 \le b - a \le c_4$; the CAP requires $c_1 < p_2 < c_2 < 0$ and the CUP $c_1 > p_2 > c_2 > 0$. For the STRAIGHT LINE morph, $y_m(x) = p_1 x + p_0$ with $c_1 < p_1 < c_2$ and $c_3 \le b - a \le c_4$.]
Fig. 7. Five primitives defined by constrained fits of model $y_m(x)$ to data $y(x)$, $x \in [a, b] \subseteq [l, r]$.
(a) CAP,
(b) CUP,
(c) RIGHT SHOULDER,
(d) LEFT SHOULDER,
(e) STRAIGHT LINE.
Driven by the syntax analyzer, the WAPSYS segmentor is repetitively tasked with
identifying a specific morph M in a specific interval of data $[a, b] \subseteq [l, r]$. $[l, r]$ is
the constraint interval to be searched and $[a, b]$ is the match interval where the
primitive is detected. The segmentor may, in fact, identify no occurrence, one
occurrence, or many occurrences $(M, [a_j, b_j], e_j, p_j)$ of the morph existing on the
constraint interval $[l, r]$. $p_j$ is the parameterization of the morph M and $e_j$ is an
evaluation of its merit or certainty. The morph M is specified to the segmentor by
a syntactic name and semantic constraints C which must be satisfied by pa-
rameterization P. Morph M is formally defined as a functional form $y_m = y_m(x)
= f(x)$ to be fit to the data $y(x)$, $x \in [a, b]$, under a set of constraints C.
For example, the 'CAP' morph of Fig. 7 is defined as $y_m(x) = p_2 x^2 + p_1 x + p_0$,
subject to the constraints that $y_m(a) = y_m(b)$, $c_1 < p_2 < c_2 < 0$ and $c_3 \le b - a \le c_4$.
The parameterization $P = \{p_0, p_1, p_2\}$ is determined by least squares fitting
$y_m(x)$ to $y(x)$ over $x \in [a, b]$. All five morph definitions in Fig. 7 imply least
squares error estimation of 2 free parameters. Under the assumption of Gaussian
noise distributed as $N(0, \sigma^2)$ the variable

$s^2 = \sum_{x=a}^{b} \big(y_m(x) - y(x)\big)^2$

provides a chi-squared measure of the quality of fit. One approach to segmentation
is to raise the order m of a polynomial model

$y = y_m(x) = \sum_{i=0}^{m} p_i x^i$
until the hypothesis H(m), that this model generated the data, can be accepted at
a given confidence and the hypothesis H(m+1) does not significantly increase
this confidence of fit. The approach of the WAPSYS segmentor is to keep the
polynomial form $y_m(x)$ fixed and to vary the subinterval $[a, b] \subseteq [l, r]$ to
find the best fit(s). The reason for doing this is that it is desired that the data be
represented by morphs of confined geometric shape whose parameters might have
strong interpretation in the problem domain. For instance, if the rate of pressure
rise in a certain region of a pulse wave were thought to be significant in disease
diagnosis, it would be appropriate to estimate shape by fitting a straight line to
the region and not a bell-shaped curve.
Fig. 11 shows some pulse wave data that was fitted with models from Fig. 7.
The morphs UPSLOP, MPS, MNG, LN, and HOR are defined and detected by
using constraints on the parameters of the straight line model. The CAP morph
and RSH morph are instances of the cap and right shoulder of Fig. 7. Constrain-
ing the juxtaposition of these morphs is in the domain of syntax discussed in later
sections.
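A sketch of the constrained least-squares fitting performed by a segmentor of this kind is given below, here for the CAP morph only; the symmetry constraint $y_m(a) = y_m(b)$ is omitted for brevity and the width limits are illustrative, not values used by WAPSYS.

import numpy as np

def fit_cap(y, l, r, c, min_w=4, max_w=40):
    # Search the constraint interval [l, r] (indices into y) for the best
    # 'CAP' morph: a least-squares parabola on a subinterval [a, b] subject
    # to c[0] < p2 < c[1] < 0 and min_w <= b - a <= max_w.
    best = None
    for a in range(l, r - min_w + 1):
        for b in range(a + min_w, min(a + max_w, r) + 1):
            x = np.arange(a, b + 1)
            p2, p1, p0 = np.polyfit(x, y[a:b + 1], 2)
            if not (c[0] < p2 < c[1] < 0):
                continue                      # constraint set C violated
            s2 = np.sum((np.polyval([p2, p1, p0], x) - y[a:b + 1]) ** 2)
            if best is None or s2 < best[0]:
                best = (s2, a, b, (p0, p1, p2))
    return best    # (s^2, match interval a, b, parameterization P) or None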
showing that the structure ⟨JYNT⟩ (i.e. joint) has three substructures which
are respectively TREDGE (trailing edge), GLBMIN (global minimum), and
UPSLOP in time order of their appearance. Constraints to be used by the
curve-fitting routines are listed below each structure name. These constraints are
obtained using the training procedure discussed in Section 4.3 and are interpreted
as follows. MIN and MAX are the minimum and maximum width of the
primitive in number of points, ALO and AHI are the low and high limits of the
slope, and HGN and MNN are high and mean noise levels.
[Fig. 8. (a) Production rule for ⟨JYNT⟩ with constraints listed below each substructure name; (b) a ⟨JYNT⟩ structure identified on a pulse wave, with the point (FX, FY) marked at the global minimum; (c) parenthesized partial parse tree, e.g. [2=⟨JYNT⟩,1 FX=.210+03 FY=.807+02 Q=.800+00 RGT=.216+03 LFT=.173+03 [9=TREDGE,2 ...] [6=GLBMIN,3 ...] [4=UPSLOP,1 ...]].]
The informal notion of attributes used here has already been formalized by others,
yielding 'attribute grammars'. The production rule in Fig. 8(a) also tells the WAPSYS
analyzer in what sequence to search for the substructures of ⟨JYNT⟩. UPSLOP is to be sought first
because it is reliably found with no other context information. TREDGE is to be
found next, to the left of UPSLOP, of course, and then finally GLBMIN is to be
gotten somewhere in between UPSLOP and TREDGE. The MIN and MAX
parameters are required by WAPSYS so that it can assign a consistent search
interval $[l, r]$ to each hypothesized primitive. Fig. 8(b) shows graphically a
⟨JYNT⟩ structure identified on a pulse wave. A parenthesized form of the partial
parse tree generated by WAPSYS is given in Fig. 8(c). The LFT and RGT
attributes define the match interval [a, b] and Q is the chi-squared quality of fit.
Fig. 9 shows a complete structural model of a carotid pulse wave via BNF.
Parameters of each syntactic structure are omitted for clarity but the search
sequence is indicated on each right hand side.
Fig. 9. BNF grammar (without parameters) used to drive analysis of carotid pulse waves.
(a) Set of productions for carotid pulse wave grammar; ⟨STRT⟩ is start symbol or global goal.
Numbers in parentheses indicate search order.
(b) Terminal vocabulary of carotid pulse wave grammar.
[Parse tree listing (abridged): nested structures such as [6=GLBMIN,3 [9=TREDGE,2 C=.000 B=.644+03 A=-.787+01 Q=.900+00 ...] [4=UPSLOP,1 C=.000 B=.181+03 A=.105+03 Q=.100+01 ...]] within [2=⟨JYNT⟩,1], up to the global goal [1=⟨STRT⟩,0].]
Fig. 10. Parse tree showing the results of analysis of a complete carotid pulse wave cycle (first cycle
shown in Fig. 11).
[Figure: carotid pulse wave sample 0844, two consecutive cycles, with morphs such as UPSLOP marked; the scale applies to the top cycle.]
Fig. 11. Carotid pulse wave sample 0844 shows interesting structural variations on consecutive cycles
(analysis of first cycle shown in Fig. 10).
This grammar was used by WAPSYS to drive
the analysis of a few hundred pulse waves. Fig. 10 shows a complete parse tree
obtained by analyzing the first cycle of the waveform shown in Fig. 11. Fig. 11
shows interesting structural variations present in two consecutive cycles of the
546 George C. Stockman
same pulse wave. The parse tree for the second cycle is not shown here. As Fig. 10
shows, a number of global attributes have been computed by WAPSYS from the
parse tree: heart rate (RAT = .871+02) is only one of these. Global attributes
are manipulated by 'semantic routines' which are called whenever syntactic
structures are identified. The semantic routines are actually FORTRAN code
segments. Although they are simply linked to WAPSYS these routines are applica-
tion specific and must be written by the user.
5. Concluding discussion
structural model which is usually a grammar of some type. Both the primitives
and structural model can be input to WPS via tables of data and no programming
is required by the user for this. However, sometimes it is necessary for the user to
specify application specific semantic processing via code in a high level language.
selects a best one or makes minor corrections. The linguistic model used by WPS
appears to support this very well. With respect to speech recognition, most
current systems require an acknowledgement of some sort from the human to
indicate that the computer did indeed make the correct interpretation.
References
Aho, A. V. and Ullman, J. (1972). The Theory of Parsing, Translation and Compiling, Vol. I: Parsing.
Prentice Hall, New Jersey.
Cox, J., Nolle, F. and Arthur, R. (1972). Digital analysis of the electroencephalogram, the blood
pressure wave, and the electrocardiogram. Proc. IEEE 60(10) 1137-1164.
Earley, J. (1970). An efficient context-free parsing algorithm. Comm. ACM 13(2) 94-102.
Freis, E. D., Heath, W. C., Luchsinger, P. C. and Snell, R. E. (1966). Changes in the carotid pulse
which occur with age and hypertension. American Heart J. 71(6) 757-765.
Fu, K. S. (1974). Syntactic Methods in Pattern Recognition. Academic Press, New York.
Gries, D. (1971). Compiler Construction for Digital Computers. Wiley, New York.
Hall, P. A. (1973). Equivalence between and/or graphs and context-free grammars. Comm. ACM
16(7) 444-445.
Hopcroft, J. and Ullman, J. (1969). Formal Languages and Their Relationship to Automata.
Addison-Wesley, New York.
Horowitz, S. L. (1975). A syntactic algorithm for peak detection in waveforms with applications to
cardiography. Comm. ACM 18(5) 281-285.
Ledley, R. et al. (1966). Pattern recognition studies in the biomedical sciences. AFIPS Conf. Proc.
SJCC, 411-430. Boston, MA.
Miller, P. (1973). A locally organized parser for spoken output. Tech. Rept. 503, Lincoln Lab, M.I.T.,
Cambridge, MA.
Pavlidis, T. (1971). Linguistic analysis of waveforms. In: J. Tou, ed., Software Engineering, Vol. II,
203-225. Academic Press, New York.
Pavlidis, T. (1973). Waveform segmentation through functional approximation. IEEE Trans. Comput.
22(7) 689-697.
Reddy, D. R., Erman, L. D., Fennell, R. D. and Neely, R. B. (1973). The HEARSAY speech
understanding system. Proc. 3rd Internat. Joint Conf. Artificial Intelligence, 185-193. Stanford, CA.
Stockman, G., Kanal, L., and Kyle, M. C. (1976). Structural pattern recognition of carotid pulse waves
using a general waveform parsing system. Comm. ACM 19(12) 688-695.
Stockman, G. (1977). A problem-reduction approach to the linguistic analysis of waveforms. Com-
puter Science TR-538, University of Maryland.
You, K. C. and Fu, K. S. (1979). A syntactic approach to shape recognition using attributed
grammars. IEEE Trans. Systems Man Cybernet. 9(6) 334-344.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
© North-Holland Publishing Company (1982) 549-573

Continuous Speech Recognition: Statistical Methods

F. Jelinek, R. L. Mercer and L. R. Bahl
1. Introduction
[Figure: block diagram in which a text generator feeds a speaker, whose speech enters an acoustic processor feeding a linguistic decoder; the acoustic processor and linguistic decoder together constitute the speech recognizer.]
Fig. 1. A continuous speech recognition system.
2. Acoustic processors
3. Linguistic decoder
The AP produces an output string y. From this string y, the linguistic decoder
(LD) makes an estimate, $\hat w$, of the word string w produced by the text generator
(see Fig. 1). To minimize the probability of error, $\hat w$ must be chosen so that

$P(\hat w \mid y) = \max_{w} P(w \mid y).$ (3.1)

By Bayes' rule:

$P(w \mid y) = \frac{P(w)\,P(y \mid w)}{P(y)}.$ (3.2)
$q_s(t) = 0 \quad \text{if } s \neq L(t)$

and

$\sum_t q_s(t) = 1, \qquad s \in S - \{s_F\}.$ (4.1)
t
In general, the transition probabilities associated with one state are different
from those associated with another. However, this need not always be the case.
We say that state $s_1$ is tied to state $s_2$ if there exists a 1-1 correspondence
$T_{s_1 s_2}: \mathcal T \to \mathcal T$ such that $q_{s_1}(t) = q_{s_2}(T_{s_1 s_2}(t))$ for all transitions t. It is easily verified
that the relationship of being tied is an equivalence relation and hence induces a
partition of S into sets of states which are mutually tied.
A string of n transitions $t_1^n$ for which $L(t_1) = s_I$ is called a path; if $R(t_n) = s_F$,
then we refer to it as a complete path.² The probability of a path $t_1^n$ is given by

$P(t_1^n) = \prod_{i=1}^{n} q_{L(t_i)}(t_i).$

A Markov source for which each output string $a_1^n$ determines a unique path is
called a unifilar Markov source.
In practice it is useful to allow transitions which produce no output. These null
transitions are represented diagrammatically by interrupted lines (see Fig. 4).
Rather than deal with null transitions directly, we have found it convenient to
associate with them the distinguished letter $\phi$. We then add to the Markov source
² $t_1^n$ is a short-hand notation for the concatenation of the symbols $t_1, t_2, \ldots, t_n$. Strings are indicated
in boldface throughout.
a filter (see Fig. 5) which removes $\phi$, transforming the output sequence $a_1^n$ into an
observed sequence $b_1^m$ where $b_i \in \mathcal B = \mathcal A - \{\phi\}$. Although more general sources can
be handled, we shall restrict our attention to sources which do not have closed
circuits of null transitions.
If $t_1^n$ is a path which produces the observed output sequence $b_1^m$, then we say
that $b_i$ spans $t_j$ if $t_j$ is the transition which produced $b_i$ or if $t_j$ is a null transition
[Fig. 5: a Markov source producing $a_1, \ldots, a_n, \ldots$ followed by a filter that removes the null outputs $\phi$, yielding the observed sequence $b_1, b_2, b_3, \ldots$]
and so it is natural to consider structures for which a word string $w_1^{k-1}$ uniquely
determines the state of the model. A particularly simple model is the N-gram
model where the state at time k-1 corresponds to the N-1 most recent words
$w_{k-N+1}, \ldots, w_{k-1}$. This is equivalent to using the approximation

$P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-N+1}^{k-1}).$

N-gram models are computationally practical only for small values of N. In order
to reflect longer term memory, the state can be made dependent on a syntactic
analysis of the entire past word string $w_1^{k-1}$, as might be obtained from an
appropriate grammar of the language.
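A maximum likelihood N-gram model in this spirit can be sketched in Python as follows; the sentence padding convention is an implementation choice. (The zero probabilities returned for unseen events are exactly the sparse-data problem addressed in Section 8.)

from collections import defaultdict

class NGramModel:
    # The state is the N-1 most recent words; probabilities are relative
    # frequencies of N-tuples in the training text.
    def __init__(self, n=3):
        self.n = n
        self.ctx = defaultdict(int)     # counts of (N-1)-word states
        self.full = defaultdict(int)    # counts of N-word tuples

    def train(self, words):
        w = ["<s>"] * (self.n - 1) + list(words)
        for i in range(len(words)):
            state = tuple(w[i:i + self.n - 1])
            self.ctx[state] += 1
            self.full[state + (w[i + self.n - 1],)] += 1

    def prob(self, word, state):
        # P(w_k | w_{k-N+1} .. w_{k-1}); zero for unseen events.
        s = tuple(state)
        return self.full[s + (word,)] / self.ctx[s] if self.ctx[s] else 0.0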
Fig. 8. A word-based Markov subsource.
[Figure: acoustic subsources of individual phones connected according to the phonetic subsource of Fig. 9.]
Fig. 11. A phone-based Markov source based on the phonetic subsource of Fig. 9.
the corresponding word. The resulting Markov source is a model for the entire
stochastic process to the left of the linguistic decoder in Fig. 1. Each complete
path t through the model determines a unique word sequence w = W(t) and a
unique AP output string y = Y(t) and has the associated probability P(t).
Using well known minimum-cost path-finding algorithms, it is possible to de-
termine for a given AP string y, the complete path t which maximizes the
probability P(t) subject to the constraint Y(t) = y. A decoder based on this
strategy would then produce as its output W(t). This decoding strategy is not
optimal since it may not maximize the likelihood P(w, y). In fact, for a given pair
w, y there are many complete paths t for which W(t) = w and Y(t) = y. To
minimize the probability of error, one must sum P(t) over all these paths and
select the w for which the sum is maximum. Nevertheless, good recognition results
have been obtained using this suboptimal decoding strategy [7, 9, 16].
A simple method for finding the most likely path is a dynamic programming
scheme [11] called the Viterbi algorithm [13]. Let $\tau_k(s)$ be the most probable path
to state s which produces output $y_1^k$. Let $V_k(s) = P(\tau_k(s))$ denote the probability
of the path $\tau_k(s)$. We wish to determine $\tau_n(s_F)$ (see Section 4.1). Because of the
Markov nature of the process, $\tau_k(s)$ can be shown to be an extension of $\tau_{k-1}(s')$
for some $s'$. Therefore, $\tau_k(s)$ and $V_k(s)$ can be computed recursively from $\tau_{k-1}(s)$
and $V_{k-1}(s)$ starting with the boundary conditions $V_0(s_I) = 1$ and $\tau_0(s_I)$ being the
null string. Let $C(s, a) = \{t \mid R(t) = s,\; A(t) = a\}$. Then

$V_k(s) = \max\Big\{\max_{t \in C(s,\, y_k)} V_{k-1}(L(t))\, q_{L(t)}(t),\;\; \max_{t \in C(s,\, \phi)} V_k(L(t))\, q_{L(t)}(t)\Big\}.$
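A sketch of the Viterbi computation is given below for the simplified case of a source without null transitions (null transitions require the interleaved treatment suggested by the second term of the recursion above); the transition-list representation is an implementation choice.

def viterbi(transitions, s_I, s_F, y):
    # transitions: list of (L, R, a, q) tuples, i.e. t with left state L(t),
    # right state R(t), output A(t) and probability q_{L(t)}(t).
    V = {s_I: 1.0}                    # V_0(s_I) = 1
    back = {s_I: []}                  # tau_0(s_I) is the null string
    for out in y:
        V_new, back_new = {}, {}
        for (L, R, a, q) in transitions:
            if a != out or L not in V:
                continue
            p = V[L] * q              # extend tau_{k-1}(L) by transition t
            if p > V_new.get(R, 0.0):
                V_new[R] = p
                back_new[R] = back[L] + [(L, R, a)]
        V, back = V_new, back_new
    return back.get(s_F), V.get(s_F, 0.0)   # tau_n(s_F) and V_n(s_F)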
$\sum_{i=0}^{n} P(w, y_1^i) = P(w)\sum_{i=0}^{n} P(y_1^i \mid w).$ (6.1)
The first term on the right-hand side is the a priori probability of the word
sequence w. The second term, referred to as the acoustic match, is the sum over i
of the probability that w produces an initial substring y~ of the AP output string
y. Unfortunately, the value of (6.1) will decrease with lengthening word sequences
w, making it unsuitable for comparing incomplete paths of different lengths.
Some form of normalization to account for different path lengths is needed. As in
the Fano metric used for sequential decoding [14], it is advantageous to have a
likelihood function which increases slowly along the most likely path, and
If we consider $P(w, y_1^i)$ to be the cost associated with accounting for the initial
part of the AP string $y_1^i$ by the word string w, then $\sum_{w'} P(w', y_{i+1}^n \mid w, y_1^i)$
represents the expected cost of accounting for the remainder of the AP string $y_{i+1}^n$
with some continuation $w'$ of w. The normalizing factor $\alpha$ can be varied to control
the average rate of growth of $\Lambda(w)$ along the most likely path. In practice, $\alpha$ can
be chosen by trial and error.
An accurate estimate of $\sum_{w'} P(w', y_{i+1}^n \mid w, y_1^i)$ is of course impossible in prac-
tice, but we can approximate it by ignoring the dependence on w. An estimate of
$E(y_{i+1}^n \mid y_1^i)$, the average value of $P(w', y_{i+1}^n \mid y_1^i)$, can be obtained from training
data. In practice, a Markov-type approximation of the form
i.e., the probability that w was uttered and produced the complete output string
$y_1^n$.
The likelihood of a successor path $w_1^k = w_1^{k-1} w_k$ can be computed incremen-
tally from the likelihood of its immediate predecessor $w_1^{k-1}$. The a priori probabil-
ity $P(w_1^k)$ is easily obtained from the language model using the recursion

$P(w_1^k) = P(w_1^{k-1})\,P(w_k \mid w_1^{k-1}).$

The acoustic match values $P(y_1^i \mid w_1^k)$ can be computed incrementally if the values
$P(y_1^i \mid w_1^{k-1})$ have been saved [1].
A search based on this likelihood function is easily implemented by having a
stack in which entries of the form (w, A(w)) are stored. The stack, ordered by
decreasing values of A(w), initially contains a single entry corresponding to the
initial state of the language model. The term stack as used here refers to an
ordered list in which entries can be inserted at any position. At each iteration of
the search, the top stack entry is examined. If it is an incomplete path, the
extensions of this path are evaluated and inserted in the stack. If the top path is a
complete path, the search terminates with the path at the top of the stack being
the decoded path.
Since the search is not exhaustive, it is possible that the decoded sentence is not
the most likely one. A poorly articulated word resulting in a poor acoustic match,
or the occurrence of a word with low a priori probability can cause the local
likelihood of the most likely path to fall, which may then result in the path being
prematurely abandoned. In particular, short function words like the, a, and of,
are often poorly articulated, causing the likelihood to fall. At each iteration, all
paths having likelihood within a threshold A of the maximum likelihood path in
the stack are extended. The probability of prematurely abandoning the most
likely path depends strongly on the choice of A which controls the width of the
search. Smaller values of A will decrease the amount of search at the expense of
having a higher probability of not finding the most likely path. In practice, A can
be adjusted by trial and error to give a satisfactory balance between recognition
accuracy and computation time. More complicated likelihood functions and
extension strategies have also been used but they are beyond the scope of this
paper.
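The stack search can be sketched in Python as follows; the functions extend and complete, which encapsulate the language model and the acoustic match, are hypothetical placeholders, and the default threshold is illustrative.

import heapq

def stack_decode(initial, extend, complete, delta=8.0):
    # Entries (w, Lambda(w)) ordered by decreasing likelihood; heapq is a
    # min-heap, so likelihoods are stored negated.
    stack = [(-initial[1], initial[0])]
    while stack:
        if complete(stack[0][1]):
            return stack[0][1]        # top entry is a complete path: done
        best = -stack[0][0]
        expandable = []
        while stack and -stack[0][0] > best - delta:
            expandable.append(heapq.heappop(stack))
        for neg, path in expandable:
            if complete(path):        # keep complete paths for later
                heapq.heappush(stack, (neg, path))
                continue
            for successor, lik in extend(path):
                heapq.heappush(stack, (-lik, successor))
    return None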
Let $P_i(t, b_1^m)$ be the joint probability that $b_1^m$ is observed at the output of a
filtered Markov source and that $b_i$ spans t (see Section 4.1). The count

$c(t, b_1^m) = \frac{1}{P(b_1^m)}\sum_{i=1}^{m} P_i(t, b_1^m)$

is the Bayes a posteriori estimate of the number of times the transition t is used
when the string $b_1^m$ is produced. If the counts are normalized so that the total
count for transitions from a given state is 1, then it is reasonable to expect that
the resulting relative frequency

$f(t, b_1^m) = c(t, b_1^m)\Big(\sum_{t':\, L(t') = L(t)} c(t', b_1^m)\Big)^{-1}$

will be an improved estimate of the transition probability $q_{L(t)}(t)$. This suggests
an iterative procedure:
Step 1. Make initial guesses $q^0(t)$ of the transition probabilities.
Step 2. Set j = 0.
Step 3. Compute $P_i(t, b_1^m)$, $i = 1, \ldots, m$, on the basis of the current estimates $q^j(t)$.
Step 4. Compute $f^j(t, b_1^m)$ and obtain new estimates $q^{j+1}(t) = f^j(t, b_1^m)$.
Step 5. Set j = j + 1.
Step 6. Repeat from Step 3.
To apply this procedure we need a simple method for computing $P_i(t, b_1^m)$.
Now $P_i(t, b_1^m)$ is just the probability that a string of transitions ending in L(t) will
produce the observed sequence $b_1^{i-1}$, times the probability that t will be taken
once L(t) is reached, times the probability that a string of transitions starting
with R(t) will produce the remainder of the observed sequence. If $A(t) = \phi$, then
the remainder of the observed sequence is $b_i^m$; if $A(t) \neq \phi$, then, of course,
$A(t) = b_i$ and the remainder of the observed sequence is $b_{i+1}^m$. Thus if $\alpha_i(s)$
denotes the probability of producing the observed sequence $b_1^i$ by a sequence of
transitions ending in the state s, and $\beta_i(s)$ denotes the probability of producing
the observed sequence $b_i^m$ by a string of transitions starting from the state s, then

$P_i(t, b_1^m) = \begin{cases} \alpha_{i-1}(L(t))\, q_{L(t)}(t)\, \beta_i(R(t)) & \text{if } A(t) = \phi, \\ \alpha_{i-1}(L(t))\, q_{L(t)}(t)\, \beta_{i+1}(R(t)) & \text{if } A(t) = b_i. \end{cases}$ (7.3)

The $\beta$'s can be computed by the backward recursion
$\beta_{m+1}(s_F) = 1,$ (7.6a)

$\beta_i(s) = \sum_t \beta_i(R(t))\,\gamma(t, s, \phi) + \sum_t \beta_{i+1}(R(t))\,\gamma(t, s, b_i), \qquad i \le m,\; s \neq s_F,$ (7.6b)

where

$\gamma(t, s, a) = q_{L(t)}(t)\,\delta(L(t), s)\,\delta(A(t), a).$ (7.7)
Step 3 of the iterative procedure above then consists of computing $\alpha_i$ in a
forward pass over the data, $\beta_i$ in a backward pass over the data, and finally
$P_i(t, b_1^m)$ from (7.3). We refer to the iterative procedure together with the method
described for computing $P_i(t, b_1^m)$ as the forward-backward algorithm.
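A sketch of one reestimation pass is given below, again restricted for simplicity to sources without null transitions; transition probabilities are reestimated as the relative frequencies of the a posteriori counts. The dictionary-based representation is an implementation choice.

def forward_backward(q, s_I, s_F, b):
    # q maps each state s (including s_F) to its outgoing transitions,
    # given as (R, a, prob) triples; null transitions are not modeled here.
    m = len(b)
    states = list(q)
    alpha = [dict.fromkeys(states, 0.0) for _ in range(m + 1)]
    beta = [dict.fromkeys(states, 0.0) for _ in range(m + 1)]
    alpha[0][s_I] = 1.0
    for i in range(m):                              # forward pass
        for s in states:
            if alpha[i][s]:
                for (r, a, p) in q[s]:
                    if a == b[i]:
                        alpha[i + 1][r] += alpha[i][s] * p
    beta[m][s_F] = 1.0
    for i in range(m - 1, -1, -1):                  # backward pass
        for s in states:
            beta[i][s] = sum(p * beta[i + 1][r]
                             for (r, a, p) in q[s] if a == b[i])
    total = alpha[m][s_F]                           # P(b_1^m)
    counts = {}                                     # a posteriori counts
    for i in range(m):
        for s in states:
            for j, (r, a, p) in enumerate(q[s]):
                if a == b[i] and alpha[i][s]:
                    counts[(s, j)] = (counts.get((s, j), 0.0)
                                      + alpha[i][s] * p * beta[i + 1][r] / total)
    new_q = {}                                      # relative frequencies
    for s in states:
        norm = sum(counts.get((s, j), 0.0) for j in range(len(q[s])))
        new_q[s] = [(r, a, counts.get((s, j), 0.0) / norm if norm else p)
                    for j, (r, a, p) in enumerate(q[s])]
    return new_q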
It is often the case in practice that the data available are insufficient for a
reliable determination of all of the parameters of a Markov model. For example,
³ For definition of tying, see Section 4.1. For details of the forward-backward algorithm extended to
machines with tied states, see [15].
the trigram model for the Laser Patent Text corpus [4] used at IBM Research is
based on 1.5 million words. Trigrams which do not occur among these 1.5 million
words are assigned zero probability by maximum likelihood estimation, a degen-
erate case of the forward-backward algorithm. Even though each of these
trigrams is very improbable, there are so many of them that they constitute 23%
of the trigrams present in new samples of text. In other words, after looking at 1.5
million trigrams the probability that the next one seen will never have been seen
before is roughly 0.23. The forward-backward algorithm provides an adequate
probabilistic characterization of the training data but the characterization may be
poor for new data. A method for handling this problem, presented in detail in
[15], is discussed in this section.
Consider a Markov source model the parameters of which are to be estimated
from data $b_1^m$. We assume that $b_1^m$ is insufficient for the reliable estimation of all
of the parameters.
Let $q_s(t)$ be forward-backward estimates of the transition probabilities based
on $b_1^m$ and let $*q_s(t)$ be the corresponding estimates obtained when certain of the
states are assumed to be tied.³ Where the estimates $q_s(t)$ are unreliable, we would
like to fall back to the more reliably estimated $*q_s(t)$, but where $q_s(t)$ is reliable
we would like to use it directly.
A convenient way to achieve this is to choose as final estimates a linear
combination of $q_s(t)$ and $*q_s(t)$. Thus we let $\hat q_s(t)$ be given by

$\hat q_s(t) = \lambda_s\, q_s(t) + (1 - \lambda_s)\, {*q_s(t)}$ (8.1)

with $\lambda_s$ chosen close to 1 when $q_s(t)$ is reliable and close to zero when it is not.
Fig. 12(a) shows the part of the transition structure of the Markov source
related to the state s. Eq. (8.1) can be interpreted in terms of the associated Markov
source shown in Fig. 12(b), in which each state is replaced by three states.
[Figure: (a) state s with its outgoing transitions; (b) states s̄, s and s*, with null transitions from s̄ to s and to s*.]
Fig. 12. (a) Part of transition structure of a Markov source; (b) the corresponding part of an associated
Markov source.
In Fig. 12(b), s̄ corresponds directly to s in Fig. 12(a). The null transitions from s̄ to s and
s* have transition probabilities equal to $\lambda_s$ and $1 - \lambda_s$, respectively. The transi-
tions out of s have probabilities $q_s(t)$ while those out of s* have probabilities
$*q_s(t)$. The structure of the associated Markov source is completely determined by
the structure of the original Markov source and by the tyings assumed for
obtaining more reliable parameter estimates.
The interpretation of (8.1) as an associated Markov source immediately sug-
gests that the parameters $\lambda_s$ be determined by the forward-backward (F-B)
algorithm. However, since the $\lambda_s$ parameters were introduced to predict as yet
unseen data, rather than to account for the training data $b_1^m$, the F-B algorithm
must be modified. We wish to extract the $\lambda_s$ values from data that was not used to
determine the distributions $q_s(t)$ and $*q_s(t)$ (see (8.1)). Since presumably we have
only $b_1^m$ at our disposal, we will proceed by the deleted estimation method. We
shall divide $b_1^m$ into n blocks, and for $i = 1, \ldots, n$ estimate $\lambda_s$ from the i th block
while using $q_s(t)$ and $*q_s(t)$ estimates derived from the remaining blocks.
Since the $\lambda_s$ values should depend on the reliability of the estimate $q_s(t)$, it is
natural to associate them with the estimated relative frequency of occurrence of
the state s. We thus decide on k relative frequency ranges and aim to determine
corresponding values $\lambda(1), \ldots, \lambda(k)$. Then $\lambda_s = \lambda(i)$ if the relative frequency of s
was estimated to fall within the i th range.
We partition the state space S into subsets of tied states $S_1, S_2, \ldots, S_r$ and
determine the transition correspondence functions $T_{s,s'}$ for all pairs of tied states
$s, s'$. We recall from Section 4 that then $*q_s(t) = *q_{s'}(T_{s,s'}(t))$ for all pairs
$s, s' \in S_i$, $i = 1, \ldots, r$. If $L(t) \in S_i$, then $\mathcal T(t) = \{t' \mid t' = T_{L(t),s'}(t),\; s' \in S_i\}$ is the set
of transitions that are tied to t. Since $T_{L(t),L(t)}(t) = t$, $t \in \mathcal T(t)$.
We divide the data $b_1^m$ into n blocks of length l (m = nl). We run the F-B
algorithm in the ordinary way, but on the last iteration we establish separate
counters

$c_j(t, b_1^m) = \sum_{i=1}^{(j-1)l} P_i(t, b_1^m) + \sum_{i=jl+1}^{m} P_i(t, b_1^m), \qquad j = 1, 2, \ldots, n,$ (8.2)
for each deleted block of data not contributing to the counter. The above values
will give rise to detailed distributions

$q_s(t, j) = \big(c_j(t, b_1^m)\,\delta(s, L(t))\big)\Big(\sum_{t'} c_j(t', b_1^m)\,\delta(s, L(t'))\Big)^{-1}$ (8.3)

and to tied distributions

$*q_s(t, j) = \Big(\sum_{t' \in \mathcal T(t)} c_j(t', b_1^m)\Big)\Big(\sum_{t'} \delta(s, L(t')) \sum_{t'' \in \mathcal T(t')} c_j(t'', b_1^m)\Big)^{-1}.$ (8.4)
Note that $q_s(t, j)$ and $*q_s(t, j)$ do not depend directly on the output data
and setting $\lambda_s = \lambda(i)$ if $q(s, j)$ belonged to the i th frequency range. Also, the $\lambda$
counts estimated from the j th block are then added to the contents of the i th
counter pair.
After $\lambda$ values have been computed, new test data is predicted using an
associated Markov source based on probabilities

$*q_s(t) = \Big(\delta(s, L(t)) \sum_{t' \in \mathcal T(t)} \sum_{j=1}^{n} c_j(t', b_1^m)\Big)\Big(\sum_{t'} \delta(s, L(t')) \sum_{t'' \in \mathcal T(t')} \sum_{j=1}^{n} c_j(t'', b_1^m)\Big)^{-1}$ (8.6)

and $\lambda_s$ values chosen from the derived set $\lambda(1), \ldots, \lambda(k)$, depending on the range
within which the estimated relative frequency of the state s falls.
This approach to modeling data generation is called deleted interpolation.
Several variations are possible, some of which are described in [15]. In particular,
it is possible to have v different tying partitions of the state space corresponding
to transition distributions $^{(i)}q_s(t)$, $i = 1, \ldots, v$, and to obtain the final estimates by
the formula
Let $K(\varphi_i(w_1 w_2))$ be the number of times that members of the set $\varphi_i(w_1 w_2)$ occur in
the training text. Finally, partition the state space into sets (8.11)
which will be used to tie the associated states $w_1 w_2$ according to the frequency of
word pair occurrence. Note that if $K(\varphi(w_1 w_2)) \ge 2$, then $\varphi(w_1 w_2)$ is simply the
set of all word pairs that occurred in the corpus exactly as many times as $w_1 w_2$
did. A different $\lambda$ distribution will correspond to each different set (8.11). The
language model transition probabilities are given by the formula

$P(w_3 \mid w_1 w_2) = \sum_{i=1}^{4} \lambda_i(\varphi(w_1 w_2))\, P_i(w_3 \mid \varphi_i(w_1 w_2)).$ (8.12)
Fig. 13. A section of the interpolated trigram language model corresponding to the state determined by the word pair w_1, w_2.
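To make (8.12) concrete, the following minimal Python sketch interpolates uniform, unigram, bigram and trigram relative-frequency estimates with fixed weights; the toy corpus and the constant weights lam are illustrative assumptions, standing in for the frequency-bucketed λ distributions described above.

    from collections import Counter

    corpus = "a b a c a b a b c a".split()
    V = sorted(set(corpus))

    uni = Counter(corpus)
    bi = Counter(zip(corpus, corpus[1:]))
    tri = Counter(zip(corpus, corpus[1:], corpus[2:]))

    def p_interp(w3, w1, w2, lam=(0.1, 0.2, 0.3, 0.4)):
        # P(w3 | w1 w2) = sum_i lambda_i * P_i(...), cf. (8.12)
        p1 = 1.0 / len(V)                                    # uniform
        p2 = uni[w3] / sum(uni.values())                     # unigram
        p3 = bi[(w2, w3)] / uni[w2] if uni[w2] else 0.0      # bigram
        p4 = tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0  # trigram
        return sum(l * p for l, p in zip(lam, (p1, p2, p3, p4)))

    print(p_interp("a", "b", "a"))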
$$H_s(w) = -\sum_{w} P(w|s)\,\log_2 P(w|s). \tag{9.1}$$

The entropy, H(w), of the task is simply the average value of H_s(w). Thus if π(s)
is the probability of being in state s during the production of a sentence, then

$$H(w) = \sum_{s} \pi(s)\, H_s(w). \tag{9.2}$$

The perplexity S(w) of the task is given in terms of its entropy H(w) by

$$S(w) = 2^{H(w)}. \tag{9.3}$$
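As a concrete illustration (not from the original text), this minimal Python sketch evaluates (9.1)-(9.3) for a toy two-state source; the state probabilities pi and the per-state word distributions are invented for the example.

    import math

    # Invented per-state word distributions P(w|s) and state probabilities pi(s).
    P = {
        "s1": {"a": 0.5, "b": 0.25, "c": 0.25},
        "s2": {"a": 0.1, "b": 0.1, "c": 0.8},
    }
    pi = {"s1": 0.6, "s2": 0.4}

    def state_entropy(dist):
        # H_s(w) = -sum_w P(w|s) log2 P(w|s), cf. (9.1)
        return -sum(p * math.log2(p) for p in dist.values() if p > 0)

    H = sum(pi[s] * state_entropy(P[s]) for s in P)   # (9.2)
    S = 2 ** H                                        # perplexity, cf. (9.3)
    print(f"H(w) = {H:.3f} bits, S(w) = {S:.3f}")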
The results given in this section, obtained before 1980, are described in detail in
[2-6].
Table 1 shows the effect of training set size on recognition error rate. 200
sentences from the Raleigh language (100 training and 100 test) were recognized
using a segmenting acoustic processor and a stack algorithm decoder. We initially
estimated the acoustic channel model parameters by examining samples of
acoustic processor output. These parameter values were then refined by applying
the forward-backward algorithm to training sets of increasing size.
Table 1
Effect of training set size on the error rate

                     % of sentences decoded incorrectly
Training set size    Test      Training
0                    80%       --
200                  23%       12%
400                  20%       13%
600                  15%       16%
800                  18%       16%
1070                 17%       14%
Table 2
Effect of weak acoustic channel models

Model type                        % of sentences decoded incorrectly
Complete acoustic channel model   17%
Single pronunciation              25%
Spelling-based pronunciation      57%
While for small training set sizes performance on training sentences should be
substantially better than on test sentences, for sufficiently large training set sizes
performance on training and test sentences should be about equal. By this
criterion a training set size of 600 sentences is adequate for determining the
parameters of this acoustic channel model. Notice that even a training set size as
small as 200 sentences leads to a substantial reduction in error rate as compared
to decoding with the initially estimated channel model parameters.
The power of automatic training is evident from Table 1 in the dramatic
decrease in error rate resulting from training even with a small amount of data.
The results in Table 2 further demonstrate the power of automatic training.
In Table 2, three versions of the acoustic channel model are used, each weaker
than the previous one. The 'complete acoustic channel model' result corresponds
to the last line of Table 1. The acoustic channel model in this case is built up from
phonetic subsources and acoustic subsources as described in Section 4. The
phonetic subsources produce many different strings for each word reflecting
phonological modifications due to rate of articulation, dialect, etc. The 'single
pronunciation' result is obtained with an acoustic channel model in which the
phonetic subsources allow only a single pronunciation for each word. Finally, the
'spelling-based pronunciation' result is obtained with an acoustic channel model
in which the single pronunciation allowed by the phonetic subsources is based
directly on the letter-by-letter spelling of the word. This leads to absurd pronunciation
models for some of the words. For example, 'through' is modeled as if the
final g and h were pronounced. The trained parameters for the acoustic channel
with spelling-based pronunciations show that letters are often deleted by the
acoustic processor reflecting the large number of silent letters in English spelling.
Although the results obtained in this way are much worse than those obtained
with the other two channel models, they are still considerably better than the
Table 3
Decoding results for several different acoustic processors with the Raleigh language

                     Error rate
Acoustic processor   Sentence   Word
MAP                  27%        3.6%
CSAP                 2%         0.2%
TRIVIAL              2%         0.2%
results obtained with the complete channel model using parameters estimated by
people.
Table 3 shows results on the Raleigh Language for several different acoustic
processors. In each case the same set of 100 sentences was decoded using the
stack decoding algorithm. MAP is a segmenting acoustic processor, while CSAP
and TRIVIAL are non-segmenting acoustic processors. Prototypes for CSAP were
selected by hand from an examination of speech data. Those for TRIVIAL were
obtained automatically from a Viterbi alignment of about one hour of speech
data.
Table 4 summarizes the performance of the stack decoding algorithm with a
segmenting and a time-synchronous acoustic processor on 3 tasks of varying
perplexity. The Raleigh task has been described earlier in the paper. The Laser
task is a natural language task used at IBM. It consists of sentences from the text
of patents in laser technology. To limit the vocabulary, only sentences made
entirely from the 1000 most frequent words in the complete laser corpus are
considered. The CMU-AIX05 task [20] is the task used by Carnegie-Mellon
University in their Speech Understanding System to meet the ARPA specifica-
tions [18]. All these results were obtained with sentences spoken by a single talker
in a sound-treated room. Approximately 1000 sentences were used for estimating
the parameters of the acoustic channel model in each of the experiments.
In Table 4 we can see a clear correlation between perplexity and error rate. The
CMU-AIX05 task has the largest vocabulary but the smallest perplexity. Note
that for each of the tasks, the performance of the time-synchronous acoustic
processor is considerably better than that of the segmenting acoustic processor.
Table 4
Recognition results for several tasks of varying perplexity

            Vocabulary               Word error rate
Task        size        Perplexity   Segmenting AP   Time-synchronous AP
CMU-AIX05   1011        4.53         0.8%            0.1%
Raleigh     250         7.27         3.1%            0.6%
Laser       1000        24.13        33.1%           8.9%
Acknowledgment
References
[1] Bahl, L. R. and Jelinek, F. (1975). Decoding for channels with insertions, deletions and
substitutions with applications to speech recognition. IEEE Trans. Inform. Theory 21 (4)
404-411.
[2] Bahl, L. R., Baker, J. K., Cohen, P. S., Dixon, N. R., Jelinek, F., Mercer, R. L. and Silverman,
H. F. (1976). Preliminary results on the performance of a system for the automatic recognition
of continuous speech. Proc. IEEE Internat. Conf. on Acoustics, Speech and Signal Processing,
425-429.
[3] Bahl, L. R., Baker, J. K., Cohen, P. S., Cole, A. G., Jelinek, F., Lewis, B. L. and Mercer, R. L.
(1978). Automatic recognition of continuously spoken sentences from a finite state grammar.
Proc. IEEE Internat. Conf. on Acoustics, Speech and Signal Processing, 418-421.
[4] Bahl, L. R., Baker, J. K., Cohen, P. S., Jelinek, F., Lewis, B. L. and Mercer, R. L. (1978).
Recognition of a continuously read natural corpus. Proc. IEEE Internat. Conf. on Acoustics,
Speech and Signal Processing, 422-424.
[5] Bahl, L. R., Bakis, R., Cohen, P. S., Cole, A. G., Jelinek, F., Lewis, B. L. and Mercer, R. L.
(1979). Recognition results with several acoustic processors. Proc. IEEE Internat. Conf. on
Acoustics, Speech and Signal Processing, 249-251.
[6] Bahl, L. R., Bakis, R., Cohen, P. S., Cole, A. G., Jelinek, F., Lewis, B. L. and Mercer, R. L.
(1980). Further results on the recognition of a continuously read natural corpus. Proc. IEEE
Internat. Conf. on Acoustics, Speech and Signal Processing, 872-875.
[7] Baker, J. K. (1975). The DRAGON system: An overview. IEEE Trans. on Acoustics, Speech,
and Signal Processing 23 (1) 24-29.
[8] Baker, J. M. (1979). Performance statistics of the hear acoustic processor. Proc. of the IEEE
Internat. Conf. on Acoustics, Speech and Signal Processing, 262-265.
[9] Bakis, R. (1976). Continuous speech recognition via centisecond acoustic states. 91st Meeting of
the Acoustical Society of America. Washington, DC. (IBM Res. Rept. RC-5971, IBM Research
Center, Yorktown Heights, NY.)
[10] Baum, L. E. (1972). An inequality and associated maximization technique in statistical estimation
of probabilistic functions of Markov processes. Inequalities 3, 1-8.
[11] Bellman, R. E. (1957). Dynamic Programming. Princeton University Press, Princeton, NJ.
[12] Cohen, P. S. and Mercer, R. L. (1975). The phonological component of an automatic speech-
recognition system. In: D. R. Reddy, ed., Speech Recognition, 275-320. Academic Press, New
York.
[13] Forney, G. D., Jr. (1973). The Viterbi algorithm. Proc. IEEE 61, 268-278.
[14] Jelinek, F. (1969). A fast sequential decoding algorithm using a stack. IBM J. Res. and
Development 13, 675-685.
[15] Jelinek, F. and Mercer, R. L. (1980). Interpolated estimation of Markov source parameters from
sparse data. Proc. workshop on Pattern Recognition in Practice. North-Holland, Amsterdam.
[16] Lowerre, B. T. (1976). The Harpy speech recognition system. Ph.D. Dissertation, Dept. of
Comput. Sci., Carnegie-Mellon University, Pittsburgh, PA.
[17] Lyons, J. (1969). Introduction to Theoretical Linguistics. Cambridge University Press, Cambridge,
England.
[18] Newell, A., Barnett, J., Forgie, J. W., Green, C., Klatt, D., Licklider, J. C. R., Munson, J.,
Reddy, D. R. and Woods, W. A. (1973). Speech Understanding Systems: Final Report of a Study
Group. North-Holland, Amsterdam.
[19] Nilsson, N. J. (1971). Problem-Solving Methods in Artificial Intelligence. McGraw-Hill, New
York.
[20] Reddy, D. R. et al. (1977). Speech understanding systems final report. Comput. Sci. Dept.,
Carnegie-Mellon University, Pittsburgh, PA.
[21] Shannon, C. E. (1951). Prediction and entropy of printed English. Bell System Tech. J. 30,
50-64.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 575-593

Applications of Pattern Recognition in Radar

Alan A. Grometstein and William H. Schoendorf

1. Introduction
antenna: the difference can be sensed and the angular position of the target (say,
in azimuth and elevation relative to an oriented ground plane) estimated.
Special circuitry within a radar receiver can compare the radio frequency (r.f.)
of the echo with that of the transmitted pulse. The difference frequency, f_d, can be
ascribed to the component of the target velocity in the direction of the radar (i.e.,
to the target's Doppler speed) at the time of pulse reflection; thus, the radar can
measure the instantaneous Doppler, D, of the target:

$$D = \tfrac{1}{2}\lambda f_d.$$
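As a quick worked example (values assumed for illustration), a 3 cm wavelength and a 1000 Hz Doppler shift give

$$D = \tfrac{1}{2}\lambda f_d = \tfrac{1}{2}(0.03\ \mathrm{m})(1000\ \mathrm{s}^{-1}) = 15\ \mathrm{m/s}$$

of radial (Doppler) speed.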
3. Signature

$$\mathrm{RCS} = K \cdot R^4 \cdot P_e$$

where K is a constant associated with the radar circuitry and the power in the
transmitted pulse, R is the range to the target, and P_e is the power received in the
echo. The RCS of a target is a measure of its effective scattering area.
RCS is measured in terms of the ratio (energy returned in an echo) : (energy
density impinging on the target), and thus has the units of area. The projected
area presented by a target to a radar beam may be orders of magnitude smaller or
larger than its RCS: indeed, it is often a primary goal of radar observations to
estimate the physical size of a target from its electrical 'size'.
$$\mathrm{RCS_{dBsm}} = 10 \log(\mathrm{RCS} / 1\ \mathrm{m}^2)$$
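For instance, with illustrative values,

$$\mathrm{RCS} = 0.01\ \mathrm{m}^2 \Rightarrow 10\log(0.01) = -20\ \mathrm{dBsm}, \qquad \mathrm{RCS} = 100\ \mathrm{m}^2 \Rightarrow +20\ \mathrm{dBsm}.$$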
4. Coherence
5. Polarization
The pulse transmitted by the radar has a polarization state characteristic of the
radar transmitter and antenna. Some of the energy in the echo reflected from a
target retains the transmitted polarization and some is converted to the orthogo-
nal polarization state. The depolarizing properties of the target are informative as
to its nature, and some radars are built which can separately receive and process
the polarized and depolarized components of the echo.
6. Frequency diversity
Some radars have the ability to change their r.f. during the transmission of a
single pulse, so that the frequency content of the pulse has a time-varying
structure. (The pulse, therefore, does not resemble a finite section of a pure
578 Alan A. Grometstein and William H. Schoendorf
sinusoid.) If the frequency changes during the pulse in discrete jumps, the radar is
a frequency-jump radar, while if it changes smoothly, the radar is said to be of the
compressed-pulse type. In either case, the radar can be referred to as a wideband
radar.
Several reasons might lead to the incorporation of frequency diversity in a
radar; of particular interest to us is the fact that the way in which a target reflects
the different frequency components reveals something of its nature. If signature
information is to be processed in a wideband radar, the envelope of the returned
echo is sampled along its length, at intervals determined by the rapidity with
which the frequency changes. In this way, instead of collecting a single amplitude
and phase from the echo (i.e., a single complex return), such a radar collects a set
of complex returns from a pulse. This set, ordered along the pulse length, can be
thought of as the return vector for that echo.
The elaborations discussed (coherence, polarization diversity, frequency diversity)
can be incorporated into a radar virtually independently of one another,
although, for cost reasons, it is rare that all are found in a single instrument. In
the extreme case, however, where all are present, the information recorded for a
single echo takes the form of two amplitude vectors, one for each polarization
channel where each vector, as explained above, consists of the complex returns at
a set of specific points along the length of the echo.
Efficient processing of such a complicated information set is important if the
complete information-gathering capacity of the radar is to be utilized.
7. Pulse sequences
It is rarely the case that a decision must be based on a single echo: ordinarily, a
sequence of echoes from a target can be collected before the decision must be
made, and this permits extraction of yet more information than would be
available in a single echo.
If the target is a constant factor, so that the only causes of variation in the
echoes are such extraneous influences as noise, changes in the propagation path,
etc., the sequence of echoes is frequently integrated, either coherently or incoher-
ently, with a view to improving the signal-to-noise ratio (SNR), so that a clearer
view can be had of the target echo itself.
If, however, the target is representable by a stationary process with a significant
frequency content, integrating the pulses might destroy useful information. The
sequence might in such a case be treated as a finite, discrete time series: a
spectral analysis might be made, or some features extracted (say, peak values or
rate of zero-crossings) and processed.
In other cases, the target process cannot be treated as a stationary one: often a
change in the nature of the process is precisely what is being looked for, to trigger
a decision. In this case, the pulse sequence must be processed so as to detect the
change in the target characteristic, and integration may again be contraindicated.
9. Algorithm implementation
[Figure: a signature viewed as a time series of returns x_1, x_2,...,x_N at times t_1, t_2,...,t_N and, equivalently, as a single point in the N-dimensional observation space.]
It is at this point that the practical difficulties of dealing with finite and
ordinarily small numbers of signatures in high-dimensional spaces become evi-
dent. The dimensionality of the O-space can be quite large since radar signatures
are often as long as tens of pulses and under many circumstances can be
considerably longer.
Signatures can be obtained either directly from observations of flight tests or
from simulations by combining the calculated dynamics of the body with static
body measurements or theoretically produced static patterns. Since flight test data
are difficult and expensive to obtain, it is often necessary to use simulated
signatures to supplement the flight test signatures. In many cases, however, the
number of signatures required to estimate the class conditional pdf's or the
likelihood ratio is so large that even the use of simulations becomes too costly. In
these cases we must consider other approaches to the design of classification
algorithms, including reducing the dimensionality of the space in which classifi-
cations are made.
Parametric classification schemes, feature selection, feature extraction, and
sequential classification are possible alternatives to the non-parametric estimate
of the pdfs or the likelihood ratio in O-space, which diminish the required number
of sample signatures. In the first case, we may know or we may assume
parametric forms for the class-conditional pdf's. The pdf's are then completely
determined by the value of an unknown but nonrandom parameter vector. For
example, if each class is known or assumed to be Gaussian, the components of the
parameter vector are the class conditional mean and covariance matrix. The
learning data are then used to estimate these parameters for each target class and
the LRT can then be implemented using a smaller number of learning signatures
than would be required if a nonparametric estimate of the densities were used.
In the same vein, the number of parameters that must be estimated from the
data can be reduced by calculating features on the basis of prior knowledge or
intuition. For example, if we were trying to discriminate between a dipole and a
sphere on the basis of the returns from a circularly-polarized radar we might use
582 Alan A. Grometstein and William H. Schoendorf
the ratio of the principally polarized (PP) and orthogonally polarized (OP)
returns. Since this ratio would be unity for an ideal dipole and infinite for a
perfect sphere, the ratio could be used as a discriminating feature rather than the
individual returns, reducing the dimensionality of the classification space by a
factor of 2. In general, however, realistic discrimination problems are not as
simple as sphere vs. dipole, and the selection of features becomes difficult.
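A minimal sketch of this kind of knowledge-based feature reduction (the array names and data are invented for illustration): collapsing paired PP/OP returns into their ratio halves the dimensionality.

    import numpy as np

    rng = np.random.default_rng(0)
    # Hypothetical signature: N pulses, each with a principally polarized (PP)
    # and an orthogonally polarized (OP) amplitude -> a 2N-dimensional vector.
    N = 16
    pp = rng.uniform(1.0, 2.0, N)
    op = rng.uniform(0.1, 2.0, N)

    eps = 1e-9                   # guard against division by zero
    ratio = pp / (op + eps)      # N-dimensional feature vector

    # A dipole-like target gives ratios near 1; a sphere-like (rotationally
    # symmetric) target gives large ratios.
    print(ratio.mean())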
A third method that has been used for reducing the dimensionality of the space
in which classification is performed is mathematical feature extraction. Here we
attempt to find a transformation that maps the data to a lower-dimensional
feature space while minimizing the misclassification rate in the feature space. In
general, attempts to derive feature extraction techniques associated directly with
error criteria have been unsuccessful except when the underlying density of the
data is known. Nonparametric approaches to the problem have been computa-
tionally complex and require as many samples for the derivation of the transfor-
mation as would be required for the design of the classifier in the original
observation space. Because of this, criteria not directly associated with error rates
have been utilized, particularly those involving the means and covariances of each
class. Examples of these types of techniques are the Fukunaga-Koontz transfor-
mation, Fukunaga and Koontz (1970), and the sub-space method of Watanabe
and Pakvasa (1973). Therrien et al. (1975) extended the Fukunaga-Koontz
technique to the multiclass case and showed the applicability of the technique to
functions of the covariance matrix. They also showed that Watanabe's subspace
method is a special case of this extension. Applications to radar signature
classification of the Fukunaga-Koontz technique, the subspace method and
Therrien's extension are found in Therrien et al. (1975). The results presented in
that paper indicate that the performance when mapping down from high-dimen-
sional spaces is not outstanding. This is probably due to the irregular nature of
the underlying class-conditional pdf's of the radar signature data: these data are
rarely Gaussian or even unimodal. The multimodality of the data is due to the
different nature of the radar returns from a target when viewed from near
nose-on, broadside or the rear.
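A minimal numerical sketch of the two-class Fukunaga-Koontz construction (toy data and dimensions are invented; this illustrates the standard technique, not Therrien's multiclass extension):

    import numpy as np

    def fukunaga_koontz(X1, X2, k=2):
        """Two-class Fukunaga-Koontz transformation sketch.
        X1, X2: (n_samples, d) data arrays. Returns a d x k feature matrix."""
        S1 = X1.T @ X1 / len(X1)       # class autocorrelation matrices
        S2 = X2.T @ X2 / len(X2)
        lam, Phi = np.linalg.eigh(S1 + S2)
        P = Phi / np.sqrt(lam)         # whitening: P^T (S1+S2) P = I
        mu, V = np.linalg.eigh(P.T @ S1 @ P)
        # Eigenvalues mu near 1 (dominant for class 1) or near 0 (dominant
        # for class 2) index the most discriminating axes; mu near 0.5 are
        # the least useful.
        order = np.argsort(np.abs(mu - 0.5))[::-1]
        return P @ V[:, order[:k]]     # map data to the k best FK features

    rng = np.random.default_rng(1)
    X1 = rng.normal(size=(200, 5)) @ np.diag([3, 1, 1, 1, 1])
    X2 = rng.normal(size=(200, 5)) @ np.diag([1, 1, 1, 1, 3])
    W = fukunaga_koontz(X1, X2)
    print((X1 @ W).std(axis=0), (X2 @ W).std(axis=0))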
A fourth technique that has been used to reduce the dimensionality of the
classification space in radar applications is implementation of a sequential
classifier. The classification schemes that utilize the hyper-space concepts de-
scribed previously operate on a predetermined and fixed number of returns to
produce a decision. The sequential classifier makes more efficient use of radar
resources by acting on groups of pulses, one group at a time. After each group of
pulses is fed into the classifier the target is either placed into one of M possible
classes or another group of pulses is fed into the classifier. In this manner targets
which can be easily discriminated will be classified using a small number of
returns, while more difficult targets will be observed for longer times, and the
mean number of pulses required to classify a set of targets may be significantly
reduced, compared to the demands of fixed sampling.
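To illustrate the idea, here is a generic sketch of Wald's sequential probability ratio test for two Gaussian classes, processing returns one at a time (a group size of one); the means, variance, error levels and simulated returns are all assumptions for illustration, not the authors' classifier.

    import math, random

    random.seed(0)

    def sprt(samples, mu0=0.0, mu1=1.0, sigma=1.0, alpha=0.05, beta=0.05):
        """Accumulate log-likelihood ratios return by return and stop as
        soon as either decision boundary is crossed."""
        a = math.log(beta / (1 - alpha))        # lower boundary
        b = math.log((1 - beta) / alpha)        # upper boundary
        llr = 0.0
        for n, x in enumerate(samples, 1):
            llr += ((x - mu0) ** 2 - (x - mu1) ** 2) / (2 * sigma ** 2)
            if llr >= b:
                return "class 1", n
            if llr <= a:
                return "class 0", n
        return "undecided", len(samples)

    obs = [random.gauss(1.0, 1.0) for _ in range(100)]  # truly from class 1
    print(sprt(obs))  # easy targets classify after only a few returns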
The foundations of sequential classification theory were laid by Wald (1947).
Therrien (1978) has recast the structure of the Gaussian classifier (quadratic
After a classifier has been designed, its error rates must be determined in order
to evaluate its performance. The preferred method of accomplishing this is to do
so experimentally by passing a set of test signatures through the classifier and
using the fraction of misclassifications as an estimate of the error rate. The test
signatures should be distinct from those used in the design of the classifier in
order to obtain an unbiased estimate of the error rates. Therrien (1974) and
Burdick (1978) show the results of classifying the dynamic radar signatures of
missile components in O-space and also after using feature extraction techniques.
These papers show results using parametric classifiers such as the Gaussian
quadratic as well as non-parametric classifiers such as the grouped and pooled
'nearest-neighbor'. They indicate that the nonparametric classifiers can give
excellent results when the class-conditional pdf's are multimodal or irregular.
Because of the non-parametric nature of the density, these classifiers work best
when the dimensionality of the O-space is low (less than 20) and the number of
training samples is large. If these classifiers are used in high-dimensional spaces
they perform well on the learning data but suffer degraded performance when
applied to an independent set of test data if there are an insufficient number of
training samples. Ksienski et al. (1975) examined the application of linear
parametric classifiers and nearest-neighbor classifiers to the problem of identify-
ing simple geometric shapes and aircraft on the basis of their low-frequency radar
returns. In this analysis, the temporal response of the target was ignored: it was
assumed that the target was viewed at a fixed aspect angle which was known to
within a specified tolerance. The data vector consisted of a sequence of multi-
frequency returns taken at identical aspect angles. A collection of these data
vectors was then used to design a classifier and learning data were corrupted by
noise to produce data for testing the classifier.
12. Examples
features of the data. The second concerns the discrimination of airborne targets
by a wideband radar, and illustrates the use of an array classifier.
The pulse spacing was 0.05 s so it required just under 0.8 s to collect the signature.
The Likelihood Ratio Test (LRT) for the quadratic classifier is expressed as

$$g(X) = (X - M_W)^T \Sigma_W^{-1} (X - M_W) - (X - M_D)^T \Sigma_D^{-1} (X - M_D) \;\underset{D}{\overset{W}{\lessgtr}}\; T$$

where M_W, Σ_W, M_D and Σ_D are the mean vectors and the covariance matrices of
the warhead and debris classes, respectively. These means and covariance matrices,
together with the threshold, T, were calculated from simulated learning data.
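A minimal sketch of such a quadratic LRT (the 2-D means, covariances and threshold are invented; the real ones were estimated from simulated learning data):

    import numpy as np

    def quadratic_lrt(x, m_w, cov_w, m_d, cov_d, T=0.0):
        """g(X) = Mahalanobis distance to the warhead class minus distance
        to the debris class; declare warhead when g(X) < T."""
        dw = x - m_w
        dd = x - m_d
        g = dw @ np.linalg.solve(cov_w, dw) - dd @ np.linalg.solve(cov_d, dd)
        return ("warhead", g) if g < T else ("debris", g)

    # Toy 2-D example with invented class statistics.
    m_w, m_d = np.array([0.0, 0.0]), np.array([3.0, 3.0])
    cov_w = np.array([[1.0, 0.2], [0.2, 1.0]])
    cov_d = np.array([[2.0, 0.0], [0.0, 2.0]])
    print(quadratic_lrt(np.array([0.5, -0.2]), m_w, cov_w, m_d, cov_d))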
The major problem in designing the classifier was obtaining the learning data
for the debris class. Static measurements were made on objects thought to
resemble deployment debris, and these were combined with assumed motion
parameters to produce dynamic signatures. Since little was known of the true
debris motions a wide variety of tumble rates were employed to prevent the
resulting classifier from being tuned to a narrow spectrum of motions. Details of
the warhead shape and ranges of motion parameters were better known so there
was no difficulty in simulating its dynamic signatures.
A threshold was then chosen for the classifier and it was tested in real-time on
a series of flight tests. Examples of debris and warhead signatures are shown in
Figs. 2(a) and 2(b), respectively. The classifier was implemented on a radar and
used to observe real warhead and debris targets. On a total of 132 warhead and
49 debris signatures, a leakage rate of 5% and a false alarm rate of 0% were
obtained.
One of the more interesting aspects of this example involved the analysis of the
classifier. Examination of the coefficients of the classifier clarifies the characteris-
tics of the signatures which are most important for discrimination.
Prior to the actual flight test it had been postulated that the ratio of the mean
PP return to the mean OP return would be an effective feature for discrimination.
It was argued that a piece of debris, being sharp, irregular and edgy, would show
an OP return comparable to its PP return, much like a dipole, and thus its
[Fig. 2. PP and OP returns (RCS in dB) versus time in seconds: (a) debris signatures; (b) warhead signatures.]
polarization ratio would be close to unity. The warhead, on the other hand, was
known to be rotationally symmetric and would, therefore, have a low OP return,
much like a sphere, and hence provide a high polarization ratio. Fig. 2 shows that
this conjecture is poor for one of the pieces of debris and quite incorrect for the
other two pieces which, on the average, exhibit large polarization ratios.
Examination of the quadratic classifier coefficients revealed that the ratio of
the second moment of the PP return to the second moment of the OP return was
the dominant feature for discrimination rather than the ratio of the means.
Inspection of Fig. 2 confirms this prediction, although this feature would not be
so evident to the eye in the shorter 0.8 s signatures that were fed into the classifier.
To confirm this analysis a new classifier was designed which used the second-
moment-ratio as the sole selected feature: it performed about as well as the full
classifier had done, thus confirming the interpretation of the coefficients of the
classifier.
Fig. 3. Vector representation of wideband pulse shape: the n PP samples x_1,...,x_n and the n OP samples x_{n+1},...,x_{2n} are concatenated into a single 2n-dimensional vector X.
cies. In this way, a small (335 samples) but realistic set of signatures was obtained
at the three bandwidths of interest.
Fig. 4 shows examples of the time-averaged waveforms returned by the SR,
LD, and SD, at each of the bandwidths. For clarity, the OP return is displaced to
the right of the corresponding, simultaneous PP return. Time-averaged wave-
forms, rather than single-pulse waveforms, are plotted to present the underlying
structure. Single-pulse waveforms (on which the classifier operated) are much
noisier.
A novel problem posed by this study relates to the fact that the two classes,
RPV and drone, were not themselves homogeneous in content, since each was
comprised of two distinct types of target (LR and SR in the one case, LD and SD
in the other). What logic of separation ought to be employed to separate RPVs
from drones, there being no need to distinguish between types of RPV or between
types of drone? The following stratagem was found to be powerful:
The outputs, h_i, of the four linear classifiers were fed into a nearest-neighbor
classifier which made the final decision as to whether the signature in question
was that of an RPV or a drone. Fig. 5 shows the logical arrangement of the linear
and nearest-neighbor classifiers in the form of an array classifier. The four linear
classifier outputs, h_i, can be thought of as features and the classifiers themselves
1 Other linear classifiers, representing alternative decision options, were examined but found to be
comparatively ineffective.
[Fig. 4. Time-averaged PP and OP waveforms (RCS in dB) for the targets, including the large drone (LD), at each bandwidth.]
2 Historically, OCs are conventionally plotted with the two error variables arranged in linear form. A
log-log plot has been found to provide a more legible and readily interpretable shape to the OC curve.
This accounts for what might appear to be an unusual (viz. concave) shape, in contrast to the more
common (convex) shape of OCs.
[Fig. 5. Array classifier: linear classifiers (SR vs. LR, SR vs. LD, SR vs. SD, ...) produce outputs h_i that feed a nearest-neighbor classifier, which makes the final decision.]
Fig. 6 shows that error rates of close to 0 percent on each axis can be achieved: this is
essentially perfect discrimination on one pulse.
An interesting aspect of this problem arose from the use of linear classifiers as
the first step in the full-array classifier. For the Fisher linear classifier, the LRT
takes the form:
$$B^T X = \sum_{i=1}^{2n} b_i x_i.$$
That is, the LRT is a dot product between the vector of weighting coefficients, B,
and the data vector, X. The weighting vector is computed from the learning data
of the two target types (say 'R','D'), and is given by
$$B = \Sigma^{-1}(M_R - M_D).$$
Now, of the 2n components of B, those that are largest indicate the positions
within the pulse-form where the greatest amount of discrimination information
lies. Conversely, if a coefficient, bi, is small, that position supplies little dis-
crimination information (since, whatever the pulse amplitude there, it contributes
little to the dot product after being multiplied by the small coefficient).
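A minimal numerical sketch of this weighting-vector computation (the toy 8-sample pulse vectors and the pooled covariance are assumptions):

    import numpy as np

    rng = np.random.default_rng(2)
    # Toy learning data: 2n = 8 samples per pulse, two target types R and D.
    XR = rng.normal(loc=[2, 0, 0, 0, 1, 0, 0, 0], scale=1.0, size=(300, 8))
    XD = rng.normal(loc=[0, 0, 0, 0, 0, 0, 0, 1], scale=1.0, size=(300, 8))

    # Pooled covariance and Fisher weighting vector B = Sigma^{-1}(M_R - M_D).
    Sigma = 0.5 * (np.cov(XR.T) + np.cov(XD.T))
    B = np.linalg.solve(Sigma, XR.mean(0) - XD.mean(0))

    # Large |b_i| mark pulse positions carrying the most discrimination
    # information; small |b_i| contribute little to the dot product B.T @ x.
    print(np.round(B, 2))
    print("score for an R-like pulse:", B @ XR[0])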
Fig. 7 shows the PP and OP components of B_2, as an example, which
distinguishes between the SR and the LD. For 500 MHz, there are 40 components
Fig. 6. Operating characteristic for array classifier (log-log axes; curves for 100, 200, and 500 MHz).
to B_2 (20 for each polarization); for 200 MHz, 20 components; and for 100 MHz,
10 components. Several conclusions can be drawn:
(1) For all bandwidths, the leading edge of the PP return is important.
(2) For 500 and 200 MHz, the OP return supplies very little discrimination
information.
(3) For 100 MHz, the trailing edge of the OP return is important.
These observations suggest that, for the 500- and 200-MHz bandwidths, the
absence of the OP signature might not adversely affect the performance of the B_2
classifier, and the radar might as well not have the second polarization. However,
this remark must be tempered by the realization that we have examined only B_2:
the other three linear classifiers might tell a different story; further, a classifier
structure more powerful than the linear structure shown might make superior use
of the OP return.
Fig. 7. Weighting components of B_2: PP and OP components at 500, 200, and 100 MHz, plotted against increasing range.
References
Burdick, B. J. et al. (1978). Radar and penetration aid design. Proc. 1978 Comput. Soc. Conf. on
Pattern Recognition and Image Processing, Chicago, IL, U.S.A.
Chernoff, H. and Moses, L. E. (1969). Elementary Decision Theory. Wiley, New York.
Egan, J. P. (1975). Signal Detection Theory and ROC Analysis. Academic Press, New York.
Fukunaga, K. and Koontz, W. L. (1970). Application of the Karhunen-Loeve expansion to feature
selection and ordering. IEEE Trans. Comput. 19, 311.
Ksienski, A. A. et al. (1975). Low-frequency approach to target identification. Proc. IEEE 63, 1651.
Therrien, C. W. (1974). Application of feature extraction to radar signature classification. Proc. Second
Internat. Joint Conf. on Pattern Recognition, Copenhagen, Denmark.
Therrien, C. W. et al. (1975). A generalized approach to linear methods of feature extraction. Proc.
Conf. Comput. Graphics on Pattern Recognition and Data Structures, Beverly Hills, CA, U.S.A.
Therrien, C. W. (1978). A sequential approach to target discrimination. IEEE Trans. Aerospace
Electron. 14, 433.
Van Trees, H. L. (1968). Detection, Estimation and Modulation Theory, Vol. I. Wiley, New York.
Wald, A. (1947). Sequential Analysis. Wiley, New York.
Watanabe, S. and Pakvasa, N. (1973). Subspace methods in pattern recognition. Proc. First Internat.
Joint Conf. on Pattern Recognition, Washington, DC, U.S.A.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 595-607

White Blood Cell Recognition

E. S. Gelsema and G. H. Landeweerd

1. Introduction
The white blood cell differential count (WBCD) is a test carried out in large
quantities in hospitals all over the world. In the U.S. and in many European
countries the complete blood count including the differential count is usually a
part of the routine hospital admitting procedure. Thus, it may be estimated that a
hospital will generate between 50 and 100 differential counts per bed per year.
For the U.S. alone Preston [16] estimates that annually between 20 and 40 × 10⁹
human white blood cells are examined. The average time it takes a technician to
examine a slide varies widely and is amongst other things dependent on the
standards set by the hospital administration. The actual 100-cell differential count
can take as little as two minutes, but given the need to spot even rare abnormali-
ties and taking into account the time to load each slide and to record the results,
it seems reasonable to assume an average examination time of 10 minutes per
slide. This, then, for a 2000 bed hospital corresponds to a workload of between 50
and 100 man-hours per day for the microscopic examination only.
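The arithmetic behind this estimate can be checked roughly (rounded figures assumed):

$$2000\ \text{beds} \times \frac{50\text{ to }100\ \text{counts/bed·year}}{365\ \text{days}} \approx 270\text{ to }550\ \text{counts/day}, \qquad 270\text{ to }550 \times 10\ \mathrm{min} \approx 45\text{ to }90\ \text{man-hours/day}.$$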
For certain conditions the need to examine the blood smear is obvious from the
clinical information available and the information it provides is essential. Such
examples, however, are relatively few, and more often the information from the
differential is yet another important element which helps the physician to arrive
at a diagnosis.
The WBCD consists of estimating the percentages of the various types of white
blood cells present in a sample of peripheral blood. Normal values may vary
widely. The normal percentages and ranges as given by Wintrobe [20] are given in
Table 1.
A typical example of each of the normal cell types is given in Fig. 1.
In pathological cases, in addition to these normal cell types, a number of
immature or abnormal types may also be present. The number of such immature
types depends on the degree of subclassification one wants to introduce. More-
over, in an automated system, immature red blood cells have to be recognised as
well. The detection of immature and abnormal cells, even though in terms of
numbers they usually represent only a small proportion, is very important for
establishing a diagnosis.
Table 1
Normal values and ranges of the occurrence of the different normal types of white blood cells according to Wintrobe [20]

Cell type     Percentage   95% range
Neutrophils   53.0         34.6-71.4
Lymphocytes   36.1         19.6-52.7
Monocytes     7.1          2.4-11.8
Eosinophils   3.2          0.0-7.8
Basophils     0.6          0.0-1.8
Fig. 1. Six normal types of white blood cells. Top row: segmented neutrophil and band; middle row:
lymphocyte and monocyte; bottom row: eosinophil and basophil.
Table 2
Significant experiments on the automation of the white blood cell differential count

Year of                                 Type of       No.    No.      No.
publication  Authors            Ref.   features      cells  classes  features  % correct
1966         M. Ingram          8      Topology      117    3        6         72
             P. E. Norgren                                  3        10        84
             K. Preston
1966         J. M. S. Prewitt   17     From opt.     22     4        2         100
             M. L. Mendelsohn          dens. freq.
                                       dist.
1969         I. T. Young        21     Geometry      74     5        4         94
                                       Colour
1972         J. W. Bacus        1      Geometry      1041   5        8         93.2
             E. E. Gose                Colour               7        8         73.6
                                       Texture              8        8         71.2
1974         J. F. Brenner      4      Geometry      1296   7        20        86.3
             E. S. Gelsema             Colour               17       20        67.3
             T. F. Necheles            Texture
             P. W. Neurath
             W. D. Selles
             E. Vastola
1978         G. H. Landeweerd   10     Texture       160    3        10        84.4
             E. S. Gelsema
2.1

Ingram, Norgren and Preston are among the first to enter this field [8, 9].
Their CELLSCAN system differs considerably from the systems used by the
workers to be listed below. It essentially consists of a TV camera, linked to a
special purpose digital computer through an A/D converter. A binary image of
the cell is thus read into the computer, which then applies a series of 'parallel
pattern transforms' [7], in each step reducing the image by shrinking operations.
2.2
Fig. 2. Optical density histogram of the image of a white blood cell. The peaks from left to right are
generated mainly by points in the background, in the cytoplasm and in the nucleus, respectively.
1 Granulocytes are those cells which contain granules in the cytoplasm. The normal granulocytes are
neutrophils, eosinophils, and basophils.
observation that the optical density histogram is characteristic for the cell type
from which it is generated. Of course, parameters used by hematologists such as
nuclear area, cytoplasmic area, contrast, etc. may also be measured globally from
the histogram. The authors report 100% correct classification of 22 cells, using 2
parameters.
As the authors themselves remark, it is hardly surprising that with the enlargement
of the blood cell sample and the number of types to be recognised the
method of analysis will have to become more complicated.
All experiments to be described below use the 'segmentation' approach which
consists of finding the boundaries of the cell and of the nucleus prior to the
estimation of parameters. Fig. 3 shows the digitised image of a cell and in Fig. 4
the two boundaries imposed on it are given. Measurements may now be performed
on the two areas of interest, i.e. cytoplasm and nucleus. For convenience
the resulting parameters may be subdivided in three classes:
- geometry,
- colour,
- texture.
Geometrical parameters describe e.g., the area of the cell, the cellular to nuclear
area ratio, the shape of the nucleus, etc. Colour parameters may be retrieved by
analyzing at least two images obtained through two different colour filters. They
include the average colour of the cytoplasm, the average colour of the nucleus, the
width of the colour distributions in these two areas, etc. Texture parameters
describe the local variations in optical density. They incorporate somehow classi-
cal subjective descriptions such as 'fine chromatine meshwork' or 'pronounced
Fig. 3. Grey level plot of the image of a white blood cell (eosinophil).
Fig. 4. Contours of the cell and of the nucleus of the cell in Fig. 3.
granularity', etc. With most texture parameters, however, the link between the
numerical values and the visual perception is much less evident than with the
parameters from the geometry and colour category.
2.3
With the experiment by Young [21] systematic colour measurement is intro-
duced. He uses a flying spot colour scanner SCAD which scans colour slides of
the leukocytes through a set of dichroic mirrors. In this way three spectrally
filtered images are stored in the computer memory. The density histogram
provides threshold levels for the transition from background to cytoplasm and
from cytoplasm to nucleus, respectively. A set of quasi-chromaticity coordinates
is then used to identify surrounding red cells and to define the 'average' colour
of the cytoplasm and of the nucleus. In his classification scheme Young uses 4
features (the estimated area and one chromaticity coordinate for both the
cytoplasm and the nucleus) to distinguish 5 cell types (the 5 normal types). Of the
74 cells used as a learning population 94% are correctly classified.
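A minimal sketch of this histogram-threshold idea (synthetic grey levels; the search windows, which assume roughly known peak positions, are illustrative and not Young's exact procedure):

    import numpy as np

    rng = np.random.default_rng(3)
    # Synthetic image: background, cytoplasm and nucleus grey levels.
    img = np.concatenate([rng.normal(15, 2, 4000),   # background
                          rng.normal(30, 2, 1500),   # cytoplasm
                          rng.normal(45, 2, 800)])   # nucleus

    hist, edges = np.histogram(img, bins=64)

    def valley(hist, lo, hi):
        # Threshold = deepest point of the histogram between two peaks.
        return lo + int(np.argmin(hist[lo:hi]))

    t1 = valley(hist, 10, 30)   # background / cytoplasm transition (bin index)
    t2 = valley(hist, 30, 55)   # cytoplasm / nucleus transition
    print("thresholds (grey levels):", edges[t1], edges[t2])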
2.4
The work of Bacus et al. [1] is significant in more than one respect. First, his
experiment is conducted on a sample size much larger than the previous ones.
Blood smears from 20 people were used to constitute the data base. A total of
1041 cells is involved, half of which are used to train the eight-dimensional
Gaussian classifier. This is in fact the first time that a testing set different from
the learning population is used in the classification process. Secondly, Bacus has
also extensively studied the human percentage of error both on the basis of single
cell classification and of the differential count. Surprisingly, in the estimation of
the differential into eight classes (the lymphocytes are subdivided in three classes
according to size and the neutrophils are subdivided in two classes according to
age) the human error is as high as 12%. This figure sets an upper limit in the
evaluation of any automatic classification device. It is to be expected that the
human error in the single cell recognition when immature cells are included is
substantially higher than the 12% given above. Also, Bacus is the first to propose
analytical expressions for the derivation of some texture parameters in the WBC
application. The overall percentages correct classification are 93.2, 73.6 and 71.2
for the five class, the seven class and the eight class problem, respectively.
2.5
Immature cells enter into the picture for the first time in the work by Brenner
et al. [4]. They do a large scale experiment on the classification of white blood
cells into 17 types. From the images of 1296 cells, divided equally among the 17
types, a total of about 100 parameters is extracted. Of these, an optimal set of 20
is retained for classification purposes. These include features from all three
categories listed above. They present their results at three levels of sophistication
w.r.t. the types to be distinguished. Attempts at separation into 17 classes yield
67.3% correct classification. Of the misclassified cells 30% are confused with
adjacent stages in the maturation process. Treating all immature cells as one
single class, a classification into 7 types results in 86.3% correct classification, with
8.7% false negatives (immature cells classified as normal cells) and 12.5% false
positives (normal cells classified as immature cells). While the false negative rate is
comparable to the Poisson error if between two and three immature cells are
present in the sample of 100 cells, the false positive rate is judged too high for a
practical system.
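That comparison can be made explicit (a rough reconstruction of the argument, not a computation from the paper): if immature cells truly occur at a rate of two to three per 100 cells, the count found in a 100-cell sample is approximately Poisson with mean λ between 2 and 3, so the probability that the sample contains none at all is

$$P(0) = e^{-\lambda} \approx e^{-2.5} \approx 0.082,$$

of the same order as the 8.7% false negative rate.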
At this point (around 1974) in the history of the automated WBCD it becomes
clear that much work remains to be done in the reliable recognition of immature
cells as such and in the classification of the different types of such cells. Evidence
for this is also to be found in the performance of the commercial systems that at
this point in time have become available. (These systems will be described in the
next section.)
The problem is somewhat more complicated since, contrary to the situation in
normal types, the immature types as recognised by hematologists are not discrete
states but should rather be considered as subsequent stages in a continuum in the
evolution from stem cells to mature forms.
2.6
With the last entry in Table 2, representing an experiment by Landeweerd et al.
[10], the emphasis is on quantification rather than on automation. A certain
amount of interaction is introduced to ensure that measurements are taken in the
interesting portions of the cell image (i.e. in this case in the nucleus). Realising
that the differences between the immature cell types are mainly in the domain of
texture, as also evidenced by descriptions of such types in hematology textbooks
[20] and by the work of Lipkin et al. [11] and of Pressman [15], Landeweerd et al.
investigate the usefulness of various texture descriptions. They study a sample of
160 cells of three types (basophils, myeloblasts and metamyelocytes). They also
introduce the concept of hierarchical decision tree logic for this application.
Indeed, with so many classes to distinguish in the total WBCD, some of which are
rather arbitrarily defined stages in the maturation continuum, the concept of a
single level classifier becomes increasingly unrealistic. Using an optimal set of 10
parameters out of a set of 27 significant ones (based on T-tests with a confidence
limit of 99%) they arrive at 84.4% correct classification based on texture parame-
ters only.
Summarizing Table 2, it would be customary to indicate a correlation (positive
or negative) between the figures in the first and the last columns (year of
publication and percentage correct classification). The present case, however, calls
for a more qualified approach, since the numbers reflect extremely dissimilar
situations.
Future work will probably be directed toward better automation techniques to
be realised in hardware. At the present state of the art, however, there is also
scope for more sophisticated software to guide the decisions as to which parame-
ters should be extracted. Software to improve the design of optimal hierarchical
classifiers will be needed as well. In this respect interactive pattern analysis
systems such as ISPAHAN [5, 6] may be of considerable value. In any case, a
large scale experiment on several thousands of cells, including immatures, incor-
porating all promising techniques suggested so far is what is needed at this point.
As has been described in the previous section, research in the field of the
WBCD is at its peak in the late 1960's and early 1970's. Led (or misled) by the
early success of these research efforts a number of companies enter into the field
with commercially available machines. There are now four different instruments
on the market whose operation is based on the digital image processing
principle.² They are listed in Table 3.
Before discussing the merits of these machines it is of interest to consider Table
4 from Preston [16], which gives a breakdown in time of the operations done by a
technician in the manual execution of a WBCD. From these figures it is clear that
for commercial machines to be cost-effective, automation of the task correspond-
ing to visual inspection only is not sufficient. All other tasks together, when done
manually account for 60% of the total processing time, so that automation of
these is an essential part of the design of an acceptable system.
Table 3
Four commercially available white blood cell differential machines based on the digital image processing principle

Machine    Company                     Year of introduction
HEMATRAK   Geometric Data              1974
LARC       Corning                     1974
DIFF-3     Coulter Electronics, Inc.   1976
ADC-500    Abbott Laboratories         1977
Table 4
Breakdown of tasks in the manual white blood cell differential according to Preston [16]

Task                 % Time
Slide staining       13
Data logging         28
Slide loading        12
Visual examination   40
Overhead             7
Moreover, within the task listed as visual inspection, the technician, while
completing the WBCD also assesses the red blood cell morphology and the
platelet sufficiency. An automated system should therefore have the capability of
doing the same.
It is not an easy task to evaluate the performance of such machines and it is
even more difficult to compare one device against another. First of all, the
specifications of the instruments in terms of the parameters and classifiers that
are used are not always available. Secondly, when results from test runs in a
clinical environment are given, they are usually completely or partly on the basis
of the total differential, rather than on a cell by cell basis. In view of the fact that
in a normal smear most of the cells (53%) are neutrophils, which is one of the
most easily recognised cell types, the total WBCD may hide serious errors in the
less frequently occurring types.
Some properties of the four different machines are listed in Table 5. They have
in principle been taken from the commercial announcements of the various
machines as far as this information was available in the various brochures.
Otherwise the source of the information is referenced with the corresponding
entry. From the number of times such additional sources had to be consulted it is
already clear that even such a simple product comparison is not a straightforward
job. Also, the enormous difference in the number of parameters used in the
different machines is at least surprising. Features such as automatic slide loading,
automatic data logging, etc. seem to be taken care of in the newer versions of
most of the instruments.
Of the two earlier systems listed in Table 3, reports of field trials on a cell by
cell basis are now available [14, 19].
Table 5
Some properties of the commercial white blood cell differential counters

                       HEMATRAK 480   LARC     DIFF-3   ADC-500
Type of hardware       FS             TV       TV       TV
Resolution (μm)        0.25           0.42 a   0.40     0.50 b
No. classes            7              7        10       13
No. parameters         96             9 a      50       8 b
Time/100 cell diff.    25"            60"      90"      11" c
No. slides/hour        60             44 a     25       40 c
Aut. RBC morphol.      +              -        +        +
Aut. platelet count    +                       +        +

a From [19].
b Private communication from J. Green.
c Classifying 500 cells/slide.
Normals
NEU       99.1   99.9   100
LYM       97.7   95.1   88
MON       92.8   97.6   100
EOS       93.1   87.0   100
BAS       80.3   100    96
Average   92.6   95.9   96.8

Immatures
MYE       75     --     85
PRO       99     --     96
NRC       87     --     96
BLA       88     --     85
PLA       89     --     94
Average   86.4   64.3   91.2
4. Conclusions
Almost 20 years of research effort has now been invested in the automation of
the white blood cell differential count. Starting from promising results obtained in
simple experiments this effort has led to the situation where for this test a number
of different machines is now on the market and in routine use. Even though there
is scope for improvement in this field, the application to white blood cell
differential counting represents one of the successes of image processing and
pattern recognition.
Improvements are to be expected in two directions: First, with the advent of
parallel image processing techniques the speed of differential systems is likely to
increase considerably. This is of importance since in view of the small percentages
of occurrence of some cell types, the 100-cell differential for these types is
statistically meaningless.
Secondly, the situation with respect to the immature and abnormal cell types is
as yet unclear. The optimal choice of parameters and classifiers has still to be
White blood cell recognition 607
found experimentally. Moreover, even the a priori definition of the various types
of leukemic cells is still debated among hematologists [3, 13]. It is quite possible
that image processing and pattern recognition techniques, by virtue of their
inherent consistency, may be useful in this respect as well.
Finally, whether differential machines based on the flow-through principle
on the one hand and machines based on image processing techniques on the other
will continue to compete, or whether they will eventually merge into one
super-machine, is as yet hard to foresee.
References
[1] Bacus, J. W. and Gose, E. E. (1972). Leukocyte pattern recognition. IEEE Trans. Systems, Man
and Cybernetics 2, 513-526.
[2] Bacus, J. W. (1973). The observer error in peripheral blood cell classification. Amer. J. Clin.
Pathol. 59, 223-230.
[3] Bennett, J. M., Catovsky, D., Daniel, M. T., Flandrin, G., Galton, D. A. G., Gralnick, H. R. and
Sultan, C. (1976). Proposals for the classification of the acute leukaemias. Brit. J. Haematology
33, 451-458.
[4] Brenner, J. F., Gelsema, E. S., Necheles, T. F., Neurath, P. W., Selles, W. D. and Vastola, E.
(1974). Automated classification of normal and abnormal leukocytes, J. Histoch. and Cytoch. 22,
697-706.
[5] Gelsema, E. S. (1976). ISPAHAN, an interactive system for statistical pattern recognition. Proc.
BIOSIGMA Confer., 469-477.
[6] Gelsema, E. S. (1976). ISPAHAN users manual. Unpublished manuscript.
[7] Golay, M. J. E. (1969). Hexagonal parallel pattern transforms. IEEE Trans. Comput. 18,
733-740.
[8] Ingram, M., Norgren, P. E. and Preston, K. (1966). Automatic differentiation of white blood
cells. In: D. M. Ramsey, ed., Image Processing in Biological Sciences, 97-117. Univ. of
California Press, Berkeley, CA.
[9] Ingram, M. and Preston, K. (1970). Automatic analysis of blood cells. Sci. Amer. 223, 72-82.
[10] Landeweerd, G. H. and Gelsema, E. S. (1978). The use of nuclear texture parameters in the
automatic analysis of leukocytes. Pattern Recognition 10, 57-61.
[11] Lipkin, B. S. and Lipkin, L. E. (1974). Textural parameters related to nuclear maturation in the
granulocytic leukocytic series. J. Histoch. and Cytoch. 22, 583-593.
[12] Mansberg, H. P., Saunders, A. M. and Groner, W. (1974). J. Histoch. and Cytoch. 22, 711-724.
[13] Mathé, G., Pouillart, P., Sterescu, M., Amiel, J. L., Schwarzenberg, L., Schneider, M., Hayat,
M., De Vassal, F., Jasmin, C. and Lafleur, M. (1971). Subdivision of classical varieties of acute
leukemia: Correlation with prognosis and cure expectancy. Europ. J. Clin. Biol. Res. 16,
554-560.
[14] Miller, M. N. (1976). Design and clinical results of Hematrak: An automated differential
counter. IEEE Trans. Biom. Engrg. 23, 400-407.
[15] Pressman, N. J. (1976). Optical texture analysis for automatic cytology and histology: A
Markovian approach. Ph.D. Thesis, UCLA, UCLR-52155.
[16] Preston, K. (1976). Clinical use of automated microscopes for cell analysis. In: K. Preston Jr.
and M. Onoe, eds., Digital Processing of Biomedical Images. Plenum, New York.
[17] Prewitt, J. M. S. and Mendelsohn, M. L. (1966). The analysis of cell images. Ann. NYAcad. Sci.
128, 1035-1053.
[18] Rümke, C. L. (1960). Variability of results in differential cell counts on blood smears. Triangle
4, 154-158.
[19] Trobaugh, F. E. and Bacus, J. W. (1977). Design and performance of the LARC automated
leukocyte classifier. Proc. Conf. Differential White Cell Counting, Aspen, CO.
[20] Wintrobe, M. M. (1967). Clinical Hematology. Lea and Febiger, Philadelphia, PA.
[21] Young, I. T. (1969). Automated leukocyte recognition. Ph.D. Thesis, MIT, Cambridge, MA.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 609-620

Pattern Recognition Techniques for Remote Sensing Applications

Philip H. Swain
1 There are other important applications, however, such as seismic prospecting in the extractive
industries.
[Figure: schematic of an airborne multispectral scanner: prism and detectors, scanning mirror and motor, tape recorder; the direction of flight, resolution element, and raster line are indicated.]
tions of the reference data in the space. This mode of analysis is called supervised
because the data analyst can 'supervise' the partitioning of the measurement
space through use of the reference data. In contrast to this, unsupervised analysis
is required when reference data are scarce. Apparent clustering tendencies of the
data are used to infer the partitioning, the reference data then being used only to
assign class labels to the observed clusters. Supervised classification is at once the
most powerful and most expensive mode of analysis. Most practical analysis
procedures represent a tradeoff between the purely supervised and purely
unsupervised modes.
Assuming that the information has been specified which is required by the
application at hand and that all necessary data have been collected and pre-
processed, the data analysis procedure will consist of the following steps:
Step 1. Locate and extract from the primary data the measurements recorded
for those areas for which reference data are available.
Step 2. Define or compute the features on which the classification is to be
based. These may consist of all or a subset of the remote sensing measurements or
a mathematical transformation thereof (often a linear or affine transformation).
Ancillary variables may also be involved.
Step 3. Compute the mathematical/statistical characterization of the classes of
interest.
Step 4. Classify the primary data based on the characterization.
Step 5. Evaluate the results and refine the analysis if necessary.
Steps 1 through 3 are usually referred to as 'training the classifier', terminology
drawn from the pattern recognition technology. In practice there is considerable
overlap and interaction among all of the steps we have outlined. The two most
crucial aspects of the process are: (1) determining a set of features which will
provide sufficiently accurate discrimination among the classes of interest; and (2)
selecting a decision rule -- the classifier -- which can be implemented in such a
way as to provide the attainable classification accuracy at minimal cost in terms
of time and computational resources.
The decision rule most commonly employed for classifying multispectral re-
mote sensing data is based on classical decision theory. Let the spectral measure-
ments for a point to be classified comprise the components of a random vector X.
Assume that a pixel is to be classified into one of m classes ω_i, i = 1, 2, …, m.
The classification strategy is the Bayes (minimum risk) strategy, by which X is
assigned to the class ω_i minimizing the conditional average loss
$$L_X(\omega_i) = \sum_{j=1}^{m} \lambda_{ij}\, p(\omega_j \mid X),$$
where
λ_ij = the cost of classifying a pixel into class ω_i when it is actually from class ω_j;
p(ω_j | X) = the a posteriori probability that X is from class ω_j.
Typically, a 0-1 cost function is assumed, in which case the discriminant functions for the classification problem are simply
$$g_i(X) = p(X \mid \omega_i)\, p(\omega_i),$$
where p(X | ω_i) is the class-conditional probability of observing X from class ω_i and p(ω_i) is the a priori probability of class ω_i.
It is also common to assume that the class-conditional probabilities are
multivariate normal. If class ω_i has the Gaussian (normal) distribution with mean
vector U_i and covariance matrix Σ_i, then (taking logarithms) the discriminant functions become
$$g_i(X) = \ln p(\omega_i) - \tfrac{1}{2}\ln\lvert\Sigma_i\rvert - \tfrac{1}{2}(X - U_i)^{\mathrm{T}}\Sigma_i^{-1}(X - U_i).$$
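To make the computation concrete, here is a minimal sketch (not from the chapter; plain NumPy, with made-up class statistics) of evaluating these Gaussian discriminant functions and assigning a pixel to the highest-scoring class:

```python
import numpy as np

def gaussian_discriminants(x, means, covs, priors):
    """g_i(x) = ln p(w_i) - 0.5 ln|S_i| - 0.5 (x - U_i)' S_i^{-1} (x - U_i)."""
    scores = []
    for U, S, p in zip(means, covs, priors):
        _, logdet = np.linalg.slogdet(S)
        d = x - U
        scores.append(np.log(p) - 0.5 * logdet - 0.5 * d @ np.linalg.inv(S) @ d)
    return np.array(scores)

# Made-up two-class, two-band example: assign the pixel to the argmax class.
means = [np.array([30.0, 40.0]), np.array([50.0, 20.0])]
covs = [4.0 * np.eye(2), 9.0 * np.eye(2)]
priors = [0.6, 0.4]
print(int(np.argmax(gaussian_discriminants(np.array([33.0, 38.0]), means, covs, priors))))
```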
5. Clustering
when the Gaussian assumption is invoked and for partitioning the measurement
space based on natural groupings in the data. The latter use is often referred to as
'unsupervised classification'.
Unsupervised classification has two interesting and somewhat different applica-
tions in the analysis process. As noted in a previous section, in the early stages of
a supervised analysis procedure it is necessary to locate and extract from the
primary remote sensing data those areas for which reference data are available.
This process is greatly facilitated if the data can be displayed in such a way as to
enhance both the spectral similarities and differences in the data. Such enhance-
ment causes objects or fields and the boundaries between them to be more
sharply defined. Clustering techniques accomplish this nicely since in general they
aim to partition the data into clusters such that within-cluster variation is
minimized and between-cluster variation is maximized. After this unsupervised
classification is performed on every pixel in an area of interest, the results may be
displayed using contrasting tones or distinctive colors, making it easier to find
and identify landmarks in the scene.
On the other hand, unsupervised classification is the key process in an
unsupervised analysis procedure, used when reference data are in too short supply
to provide for adequate estimation of the class distributions. In this case,
clustering is applied after which such reference data as are available are used to
'label' the clusters, i.e., to infer the nature of the ground cover represented by
each cluster. Clearly this approach will suffice only when the clustering algorithm
can successfully associate unique clusters or sets of unique clusters with the
various ground cover classes of interest.
A great many multivariate clustering techniques are available; more of them
have been applied to remote sensing data analysis than we have space to describe
here. Most are iterative methods which depend on 'migration' of cluster
centers until a stopping criterion is satisfied (a minimal sketch follows the
list below). They differ primarily in two respects:
(1) The distance measure used on each iteration for assigning each data vector
to a 'cluster center'.
(2) The method used to determine the appropriate number of clusters. This
usually involves splitting, combining and deleting clusters according to various
'goodness' measures.
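A minimal sketch of one such migrating-means iteration, under assumptions the chapter leaves open (Euclidean distance, a fixed number of clusters, convergence of the centers as the stopping criterion); the function name and interface are illustrative only:

```python
import numpy as np

def migrating_means(data, k, iters=20, seed=0):
    """Assign each vector to its nearest center, then migrate each center to
    the mean of its members; stop when the centers no longer move."""
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iters):
        dist = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        new = np.array([data[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break  # stopping criterion: centers stopped migrating
        centers = new
    return labels, centers

# e.g. labels, centers = migrating_means(pixel_vectors, k=5)
```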
Unfortunately, the precise behavior of most clustering methods depends strongly
on several user-specified parameters. These parameters implicitly define what
constitutes a 'good' clustering, which is often highly application-dependent. Since there
is no objective means for relating the characteristics of the application to the
parameter values, these values are usually determined by trial-and-error, the user
or data analyst playing an essential (subjective) role in the process. This is often
viewed as a significant shortcoming of the data analysis procedure when opera-
tional use requires a procedure which is strictly objective, repeatable and as
'automatic' as possible. A considerable amount of research on clustering has been
motivated by the needs of remote sensing data analysis [1].
6. Dimensionality reduction
Determining the feasibility of using computer-implemented data analysis for
any particular remote sensing application often reduces to assessing the cost of
the required computations. Judicious choice of the algorithms employed for the
analysis is essential. They must be powerful enough to achieve the required level
of accuracy (assuming the level is achievable) but not so complex as to be
prohibitively expensive in terms of computational resources they require. Some-
times the most effective way to achieve computational economy is to reduce the
dimensionality of the data to be analyzed, either by selecting an appropriate
subset of the available measurements or by transforming the measurements to a
space of lower dimensionality.
For selecting best feature subsets, an approach is to choose those p features
which maximize the weighted average 'distance' D_ave between pairs of classes,
$$D_{\text{ave}} = \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} p(\omega_i)\, p(\omega_j)\, D_{ij}, \tag{4}$$
where D_ij is the Jeffreys-Matusita distance between classes ω_i and ω_j,
$$D_{ij} = \left\{ \int_X \left[ \sqrt{p(X \mid \omega_i)} - \sqrt{p(X \mid \omega_j)} \right]^2 dX \right\}^{1/2}. \tag{5}$$
For Gaussian classes this reduces to
$$D_{ij} = \left[ 2\left(1 - e^{-\alpha_{ij}}\right) \right]^{1/2}, \tag{6}$$
where α_ij is the Bhattacharyya distance between the two classes.
Notice from (6) that the Jeffreys-Matusita distance 'saturates' as the separability
(α) increases, the property which makes it behave functionally in a manner
similar to classification accuracy and accounts for its good performance as a
predictor of classification accuracy. The appropriate value of p may be determined
by finding the optimal value of D_ave for each of several candidate values
of p, plotting the results (D_ave versus p), and observing the point beyond which
little increased separability is gained by increasing p.
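As an illustration, the sketch below scores feature subsets by the prior-weighted average (4), computing each pairwise distance through the Gaussian form (6) via the Bhattacharyya distance; the exhaustive search over subsets is exactly the expensive step the text warns about, and all names here are hypothetical:

```python
import numpy as np
from itertools import combinations

def jm_distance(U1, S1, U2, S2):
    """Jeffreys-Matusita distance via the Bhattacharyya distance alpha,
    as in (6): D = sqrt(2 * (1 - exp(-alpha)))."""
    S = 0.5 * (S1 + S2)
    d = U1 - U2
    alpha = d @ np.linalg.inv(S) @ d / 8.0 + 0.5 * np.log(
        np.linalg.det(S) / np.sqrt(np.linalg.det(S1) * np.linalg.det(S2)))
    return np.sqrt(2.0 * (1.0 - np.exp(-alpha)))

def best_subset(means, covs, priors, p):
    """Exhaustively choose the p features maximizing D_ave of (4)."""
    m, n = len(means), means[0].size
    best_idx, best = None, -1.0
    for idx in combinations(range(n), p):
        ix = np.array(idx)
        d_ave = sum(priors[i] * priors[j] *
                    jm_distance(means[i][ix], covs[i][np.ix_(ix, ix)],
                                means[j][ix], covs[j][np.ix_(ix, ix)])
                    for i in range(m) for j in range(i + 1, m))
        if d_ave > best:
            best_idx, best = idx, d_ave
    return best_idx, best
```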
An alternative to subset selection is to apply a dimensionality-reducing trans-
formation to the measurements before classification. The use of linear transfor-
mations for this purpose has been studied to a considerable extent [2]. In this
case, the p-dimensional feature vector Y is derived from the n-dimensional
measurement X through the transformation Y = BX, where B is a p × n matrix of
rank p (p < n). Note that:
(1) If X is assumed to be a normally distributed random vector, then so is Y.
Specifically, if X ∼ N(U, Σ), then Y = BX ∼ N(BU, BΣBᵀ).
(2) Subset selection may be considered a special case of linear transformation
for which the B matrix consists only of 0's and 1's, one 1 per row.
(3) Numerical techniques are available (refer to [2]) for determining B so as to
extremize an appropriate criterion such as that defined by (4) and (5).
For nontrivial problems, both of these dimensionality reduction approaches
require considerable computation, and for this reason suboptimal procedures are
sometimes employed. However, in practice the computational expense may well
be warranted when the total area to be classified is large and the consequent
saving of computation in classification is substantial.
One final comment is appropriate before closing this section. A well-known
general method for dimensionality reduction is principal components analysis or
the Karhunen-Loève transformation [6]. On the face of it, this approach seems
attractive because to apply it one need not have previously computed the statistics
of the classes to be discriminated. All that is needed is the lumped covariance
matrix for the composite data. However, a fundamental assumption underlying
this approach is that variance (or covariance) is the most important information-
bearing characteristic of the data. While this assumption is appropriate for signal
representation problems, it is not appropriate when the final goal is discrimina-
tion of classes. In the former case the objective is to capture as much of the data
variability as possible in as few features (linear combinations of the measure-
ments) as possible. In the latter the requirement is to maintain separability of the
classes, and variability is only of incidental interest. 'Canonical analysis', a
somewhat similar approach in terms of the mathematical tools used, provides a
better method of determining linear combinations of features while preserving
class separability [8].
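For contrast, a minimal sketch of the principal-components (Karhunen-Loève) reduction just described: only the lumped covariance of the composite data is used, which is why the result favors overall variance rather than class separability. The interface is illustrative:

```python
import numpy as np

def principal_components(data, p):
    """Project onto the p largest-eigenvalue eigenvectors of the lumped
    covariance; captures data variance, not class separability."""
    cov = np.cov(data, rowvar=False)          # lumped covariance, classes pooled
    _, vecs = np.linalg.eigh(cov)             # eigenvalues in ascending order
    B = vecs[:, ::-1][:, :p].T                # p x n transformation matrix
    return data @ B.T, B
```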
are sums taken over all pixels in the object to be classified. Notice two facts:
(1) Only two terms in (9) depend on the object to be classified and need to be
computed for each classification.
(2) The expression for the 'log-likelihood' (9) is valid for the case s = 1, so that
no problem develops when a single-pixel object is encountered.
satisfying a relatively mild homogeneity test becomes a cell. If the group fails the
test, it is assumed to overlap an object boundary and the pixels are then classified
individually. At the second conjunctive level, adjacent cells which satisfy another
test are serially merged into an object. By successively 'annexing' adjacent cells,
each object expands as much as possible (as defined by the test criterion) and is
subsequently classified by maximum likelihood sample classification.
For practical reasons it is important that this scene partitioning algorithm,
together with the maximum likelihood sample classifier, can be implemented in a
sequential fashion, accessing the pixel data only once and in the raster order in
which they are stored on the tape.
The scene partitioning can be implemented in a 'supervised' mode which makes
use of the statistical characterization of the pattern classes, or in an unsupervised
mode which does not require such an a priori characterization. Given the
objectives of this section, we shall describe only the tests for the former.
Define the quantity
$$Q_j(X) = \sum_{i=1}^{s} (X_i - U_j)^{\mathrm{T}} \Sigma_j^{-1} (X_i - U_j),$$
where X_i is the ith pixel vector in the group being tested, s is the number of pixels
in the group, and U_j and Σ_j are, respectively, the mean vector and covariance
matrix for the jth training class (again we have invoked the multivariate Gaussian
model). Let ω* be the class for which the log-likelihood of the group is maximum.
Whether a candidate cell Y is annexed to an object X is decided by the likelihood ratio
$$\Lambda = \frac{\max_i p(X \mid \omega_i)\, p(Y \mid \omega_i)}{\max_i p(X \mid \omega_i)\, \max_j p(Y \mid \omega_j)}. \tag{12}$$
Observe that 0 ≤ Λ ≤ 1 and that Λ = 1 when both p(X | ω_i) and p(Y | ω_i) are
maximum for the same class. Thus, the cell is assumed to belong to the same class
as the object, and is annexed to the object, if Λ ≥ T, where T is a user-specified
threshold.
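Assuming the per-class log-likelihoods of the object X and the candidate cell Y have already been computed, the annexation test (12) reduces to a few lines (worked in log space to avoid underflow; names are illustrative):

```python
import numpy as np

def annex(loglik_obj, loglik_cell, T):
    """loglik_obj[i] = log p(X|w_i); loglik_cell[i] = log p(Y|w_i).
    Annex the cell to the object when Lambda of (12) is at least T."""
    log_lambda = (loglik_obj + loglik_cell).max() \
        - (loglik_obj.max() + loglik_cell.max())
    return np.exp(log_lambda) >= T
```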
Naturally the scene partitioning requires computational overhead not required
by pixel-at-a-time classification and this is why the greater efficiency of the
sample classification approach depends on the size of the objects in the scene
being large in comparison to the resolution of the sensor or pixel size. The larger
the objects, the greater the saving in computation required for classification per
se. Significantly, it is also possible to show [5] that the expected accuracy of
classification improves rapidly as the object size increases.
8. Research directions
To date, the statistical pattern recognition techniques most widely applied for
remote sensing data analysis have been brought to bear primarily on the spectral
space, although there have been some notable successes in attempts to augment
the spectral feature vector with, for example, texture features. Basically, however,
the fundamental approach has been to assume that the relationships among the
features can be characterized in relatively simple terms and that the scene of
interest can be classified a pixel at a time using standard decision-theoretical
methods.
There is a great wealth of information as yet virtually untapped in the remote
sensing data and the supporting environmental data which are usually available
in conjunction with it. Spatial characteristics, temporal variations, meteorological
data, soil background data, to name only a few, are significant information-
bearing factors which could be used profitably in the analysis process. Syntactic
scene analysis [3], contextual classification [12], and temporal feature extraction
[9] are some approaches which have begun to prove fruitful but will require
substantial research to develop their practical utility. The very complex relation-
ships among the multivarious forms of data found in the typical remote sensing
data base will not yield easily to the simple classification methods currently in
use. Generalized decision-theoretical methods are called for, and work is in
progress to develop and apply compound decision theory and hierarchical deci-
sion processes.
Finally, in the face of limited high quality reference data ('ground truth') which
must serve the dual purpose of providing both for training and testing classifiers,
there remains the problem of understanding how to best evaluate the results of
the analysis process [13]. Predictions and/or posterior estimates of classification
accuracy as well as biases and precision of the results are needed. It is not well
understood how classification accuracy per se reflects the quality of the results
achieved or achievable, especially when the objective is to obtain large area
estimates from classification of only a sample of the area.
References
[1] Bryant, J. (1979). On the clustering of multidimensional pictorial data. Pattern Recognition 11,
115-126.
[2] Decell, H. P. and Guseman, L. F., Jr. (1979). Linear feature selection with applications. Pattern
Recognition 11, 55-63.
[3] Fu, K. S. (1977). Syntactic Pattern Recognition. Springer, New York.
[4] Kettig, R. L. and Landgrebe, D. A. (1976). Classification of multispectral image data by
extraction and classification of homogeneous objects. IEEE Trans. Geosci. Electronics 14, 19-26.
[5] Kettig, R. L. (1975). Computer classification of remotely sensed multispectral image data by
extraction and classification of homogeneous objects. Ph.D. Thesis, Purdue University, West
Lafayette, IN.
[6] Kittler, J. and Young, P. C. (1973). A new approach to feature selection based on the
Karhunen-Loève expansion. Pattern Recognition 5, 335-352.
[7] Marill, T. and Green, D. M. (1963). On the effectiveness of receptors in recognition systems.
IEEE Trans. Inform. Theory 9, 11-17.
[8] Merembeck, B. F. and Turner, B. J. (1979). Directed canonical analysis and the performance of
classifiers under its associated linear transformation. Proc. Symp. Machine Processing of Re-
motely Sensed Data. IEEE Cat. No. 79CH1430-8 MPRSD, IEEE Single Copy Sales, Piscataway,
NJ.
[9] Misra, P. N. and Wheeler, S. G. (1978). Crop classification with Landsat multispectral scanner
data. Pattern Recognition 10, 1-13.
[10] National Aeronautics and Space Administration. Earth resources technology satellite data users
handbook. NASA Goddard Space Flight Center, Greenbelt, MD.
[11] Swain, P. H. and Davis, S. M. (1978). Remote Sensing: The Quantitative Approach. McGraw-Hill,
New York.
[12] Swain, P. H., Tilton, J. C. and Vardeman, S. B. (1982). Estimation of context for statistical
classification of multispectral image data. IEEE Trans. Geosci. Remote Sensing 20 (4).
[13] Todd, W. J., Gehring, D. G. and Haman, J. F. (1980). Landsat wildland mapping. Photogram-
metric Engrg. and Remote Sensing 46, 509-520.
[14] Wacker, A. G. (1971). The minimum distance approach to classification. Ph.D. Thesis, Purdue
University, West Lafayette, IN.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 621-649

Optical Character Recognition -- Theory and Practice

George Nagy

1. Introduction
3. Applications
[Fig. 1. The OCR-A font (USASCII character set): upper-case letters, numerals, and special symbols.]
control over document preparation. Typical applications are credit card slips,
invoices, and insurance application forms prepared by agents. Special accurately
machined typeheads for certain stylized fonts are available for ordinary type-
writers which must, however, be carefully aligned to satisfy the OCR-reader
manufacturers' specifications with regard to character skew, uniformity of impres-
sion, and spacing. Standards have also been promulgated with respect to accept-
able ink and paper reflectance, margins, and paper weight.
Among the typefaces most popular in the United States is the 64-symbol
OCR-A font (Standard USAS X3.17), which is available in three standard sizes
(Fig. 1). In the United States, OCR standards are set by the A.N.S.I. X3A1
Committee. In Europe the leading contender is the 113-symbol OCR-B font
(Standard ECMA-11), which is aesthetically more pleasing and which also
includes lower-case letters (Fig. 2). Some OCR devices are capable of reading only
subsets of the full character set, such as the numerals and five special symbols.
3.2. Typescript
With a well-aligned typewriter, carbon-film ribbon, and carefully specified
format conventions, single-font typewritten material can be readily recognized by
machine. Sans-serif typefaces generally yield a lower error rate than roman styles,
because serifs tend to bridge adjacent characters. When specifying the typeface, it
[Fig. 2. The OCR-B font (upper case only).]
[Fig. 3. Typewritten characters. (a) Elite, Adjudant, and Scribe fonts. (b) Digitized versions of '4' and 'F' from the three font styles. (c) Examples of difficult-to-distinguish lower-case letters.]
issued to date, retrieving relevant federal and state laws, and converting library
catalog cards to computer readable form. Most of these applications may,
however, eventually disappear when computerized typesetting becomes so preva-
lent that all newly published material is simultaneously made available in
computer readable form as a byproduct of typesetting.
The greatest problems with automatic reading of typeset material are the
immense number of styles (3000 typefaces are in common use in the United
States, each in several sizes and variants such as italic and boldface), and the
difficulty of segmenting variable-width characters. The number of classes in each
font is also larger than the 88 normally available on typewriters: combinations of
characters (called 'ligatures') such as fi, ffi, and fl usually appear on a single slug,
and there are many additional special symbols.
A resolution of about 0.003" is necessary for segmenting and recognizing
ordinary bookface characters (Fig. 4). The higher resolution is necessary because
the characters are smaller than in typescript, because of the variability in stroke
width within a single character (for instance, between the round and straight parts
of the letter 'e'), and because of the tight character spacing.
[Fig. 4. Typeset text. (a) Original. (b) Photoplotter output after digitization.]
writers (see Handprint Standard X3.45). Applications include filling out short
forms such as driver's license applications, magazine renewals, and sales slips. At
one time automatic interpretation of coding forms was considered important, but
this application has lost ground due to the spread of interactive programming.
Individual experimenters can usually learn to print consistently enough to have
their machine recognize their own characters with virtually zero error. Although
many of the handprinted character recognition devices described in the literature
are trainable or adaptive in nature, in some successful applications most of the
adaptation takes place in the human. An example is the operational and success-
ful Japanese ZIP-code reader, where the characters are carefully printed in boxes
printed on the envelope.
4. Transducers
The technology for accurate and faithful conversion of grey-scale material to digital form does of
course exist, but the necessary apparatus--flat-bed and rotating drum micro-
densitometers--is far too expensive for most academic research programs and far
too slow for commercial application. The challenge in optical character recogni-
tion is, in fact, to classify characters as accurately as possible with the lowest
possible spatial quantization and the minimum number of grey levels in the
transducer. Most OCR scanners therefore operate at a spatial resolution barely
sufficient to distinguish between idealized representations of the classes and
convert all density information to binary--black and white--form.
The remainder of this section outlines the principal characteristics of optical
transducers used for character recognition either in the laboratory or in the field.
4.2. Geometric characteristics
The major geometric characteristics of optical scanners are:
(1) Resolution. The exact measurement of the two-dimensional optical transfer
function is complex, but for OCR purposes one may equate the effective spot size
to the spatial distance between the 10% and 90% amplitude-response points, as a
black-white knife-edge is advanced across the scanning spot. Standard test charts
with stripe patterns are also available to measure the modulation as a function of
spatial frequency.
(2) Spot shape. A circular Gaussian intensity distribution is normally preferred,
but a spot elongated in the vertical direction reduces segmentation problems.
(3) Linearity. This characteristic, which may be readily measured with a grid
pattern, is important in tracking margins and lines of print. Solid-state devices are
inherently linear, but cathode-ray tubes require pin-cushion correction to com-
pensate for the increased path-length to the corners of the scan field. Skew is
sometimes introduced by the document transport.
(4) Repeatability. It is important to be able to return exactly to a previous spot
on the document, for example to rescan a character or to repeat an experiment.
Short-term positional repeatability is generally more important--and better--than
long-term repeatability.
4.3. Photometric characteristics
(1) Resolution. The grey-scale resolution may also be measured in different
ways. To some investigators it means the number of grey levels that may be
reliably discriminated at a specific point on the document; to others it means the
number of discriminable grey levels regardless of location. The latter interpreta-
tion is considerably more stringent, since a flat illumination or light collection
field is difficult to achieve without stored correction factors. Quantum noise also
affects the grey-scale resolution, but the signal-to-noise ratio may be generally
increased at the expense of speed by integrating the signal.
(2) Linearity. Grey-scale linearity is meaningless for binary quantization, but if
several grey-levels are measured, then an accurately linear or logarithmic ampli-
tude response function may be desirable. Standard grey-wedges for this measure-
ment are available from optical suppliers.
(3) Dynamic range. For document scanners a 20:1 range of measurable
reflectance values is acceptable. Transparency scanners, however, may provide
adequate response over 3.0 optical density units (1000:1).
(4) Repeatability. It is important that the measured grey level be invariant with
time as well as position on the document.
(5) Spectral match. The spectral characteristics of the source of illumination
and of the detector should be closely matched to those of the ink and of the
paper. The peak response of many OCR systems is in the invisible near-infrared
region of the spectrum.
4.4. Control characteristics
Most OCR devices operate in a rigid raster scan mode where the scan pattern is
independent of the material encountered. Line followers, however, track the black
lines on the page and have been used principally for the recognition of hand-
printed characters. Completely programmable scanners are invaluable for experi-
mentation, but have become a strong commercial contender only with the advent
of microprocessors. Programmable scanners allow rescanning rejected characters,
increase throughput through lower scan resolution in blank areas of the page, and
reduce storage requirements for line-finding and character isolation.
5. Character acquisition
The preprocessing necessary for classification must satisfy two basic require-
ments. The first requirement is to locate each character on the page in an order
that 'makes sense' in the coded file which eventually results from classification.
The second requirement is to present each isolated character to the recognition
algorithm in a suitable form.
5.2. Character isolation
Imperfect separation between adjacent characters accounts for a large number
of misclassifications, although in experimental studies segmentation errors are
often omitted from the reported classification error rate. With stylized characters
and with typewritten material produced by well-adjusted typewriters the expected
segmentation boundaries can be accurately interpolated by correlating an entire
line with the many easily detectable boundaries that occur between narrow
characters. The problem is, however, more complicated if the geometrical linearity
and repeatability of the transducer itself is low relative to the character dimen-
sions.
If the scanner geometry is undependable (CRT scanners) or if the characters
are not uniformly spaced (typeset material), then there is no alternative to
character-by-character segmentation. This, in turn, requires a scanning aperture
two or three times smaller than that required for recognition to find narrow
zig-zag white paths between adjacent characters. Ad hoc algorithms depending,
for instance, on 'serif suppression', are used to separate touching characters. The
expected fraction of touching character pairs (as measured directly on the
document using ten-fold magnification) depends on the type-style, type-size,
printing mechanism, and on the ink-absorbing characteristics of the paper, but
typescript or printed material (serif fonts) with up to 10% touching characters is
not uncommon. The worst offender in this respect is probably the addressograph
machine: with a well-inked ribbon one may seldom see an unbridged pair of
characters within the same word.
Some character pairs simply cannot be segmented unless the individual compo-
nents are first recognized. Current classification algorithms, however, generally
require isolated character input. The development of optimal 'on-the-fly' recogni-
tion, combining segmentation with classification, is one of the major challenges
facing OCR.
and vertical medians (the imaginary lines which divide the pattern into an equal
number of black bits in each quadrant) is more economical than computation of
the centroid, but both methods suffer from vulnerability to misalignments of the
printing mechanism, which causes one side, or top or bottom, of the character to
be darker than the other. An equally unsatisfactory alternative is lower-left-corner
(or equivalent) registration after stray-bit elimination.
Misregistrations of two or three pels (i.e., picture elements) in the vertical
direction and one or two pels horizontally are not unusual with the normally used
scanning resolution for typewritten characters. If uncorrected, such misregistra-
tion necessitates that template matching be attempted for all possible shifts of the
template with respect to the pattern within a small window (say, 7 × 5) in order to
guarantee inclusion of the ideal position.
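A minimal sketch of such a windowed match (hypothetical interface; binary arrays assumed). Note that np.roll wraps pixels around the border, which a padded implementation would avoid; it is kept here for brevity:

```python
import numpy as np

def best_shift_match(pattern, template, v_win=7, h_win=5):
    """Score the template against the pattern at every shift inside a
    v_win x h_win window; return the best count of agreeing pixels."""
    best = -1
    for dv in range(-(v_win // 2), v_win // 2 + 1):
        for dh in range(-(h_win // 2), h_win // 2 + 1):
            shifted = np.roll(template, (dv, dh), axis=(0, 1))  # wraps at borders
            best = max(best, int((shifted == pattern).sum()))
    return best
```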
6. Character classification
Let us now consider the computational problems of estimating P(V | a_k) for all
possible values of V and a_k in an OCR environment. For the sake of concreteness,
let us assume that each observation V corresponds to the digitized grey values of
the scan field. If the number of observable grey levels is S, the number of
elements in the scan field is N, and the number of character classes is M, then the
total number of terms required is M × S^N. In a practical example we may have
S = 16 (the number of differentiable reflectance values), N = 600 (for a 20 × 30
array representing the digitized character), and M = 64 (upper and lower case
characters, numerals, and special symbols). The number of probability estimates
required is thus 64 × 16^600 = 2^2406, more than 10^700. The goal of much of the work in
character recognition during the last two decades has been, explicitly or im-
plicitly, to find sufficiently good approximations, using a much smaller number of
terms, to the required probability density function.
Among the approaches tried are:
(1) Assuming statistical independence between elements of the attribute vector.
(2) Assuming statistical independence between subsets of elements of the
attribute vector (feature extraction).
(3) Computing only the most important terms of the conditional probability
density function (sequential classification).
In the next several paragraphs we will examine how these simplifications allow
us to reduce the number of estimations required, and the amount of computation
necessary to classify an unknown character. In order to obtain numerical com-
parisons of the relative number of computations, we shall stay with the above
example.
Since we need determine only the largest of the a posteriori probabilities, and the
logarithm function is monotonic, taking logarithms preserves the optimal choice.
This derivation leads to the weighted mask approach, where the score for each
class is calculated by summing the black picture elements in the character under
consideration with each black point weighted by a coefficient corresponding to its
position in the digitized array. The contribution of the white points can be taken
into account by the addition of a constant term. The character with the highest
score is selected as the most likely choice. If none of the scores are high enough,
or the top two scores are too close, then the character is rejected.
It may be noted that the weighted mask approach implemented through resistor
networks or optical masks was used in experimental OCR systems long before the
theoretical development was published. A special case, called template matching
or prototype correlation, consists of restricting the values of the coefficients
themselves to binary or ternary values; when plotted, the coefficients of the black
points resemble the characters themselves. A further reduction in computation
may be obtained by discarding the coefficients which are least useful in dis-
criminating between the classes, leading to peephole templates.
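A minimal sketch of the weighted-mask decision, including the reject conditions just described (all thresholds and the interface are illustrative; restricting the weights to binary or ternary values gives template matching, and zeroing most coefficients gives peephole templates):

```python
import numpy as np

def weighted_mask_classify(pixels, weights, consts, min_score, margin):
    """pixels: flattened binary character (N,). weights: (M, N) per-class
    coefficients for the black points; consts: (M,) terms absorbing the
    contribution of the white points. Returns a class index or None."""
    scores = weights @ pixels + consts
    first, second = np.argsort(scores)[::-1][:2]
    if scores[first] < min_score or scores[first] - scores[second] < margin:
        return None                  # reject: no score high enough, or a near tie
    return int(first)
```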
then one may assume that higher order dependences are insignificant, represent
the statistical relations in the form of a dependence tree, and restrict the computa-
tion to the most important terms, as shown by Chow.
This point of view also leads to a theoretical foundation for the ad hoc feature
extraction methods used in commercial systems. Straightline segments, curves,
corners, loops, serifs, line crossings, etc., correspond to groups of variables which
commonly occur together in some character classes and not in others, and are
therefore class-conditionally statistically dependent on one another. Even features
based on integral transform methods, such as the Fourier transform, can be
understood in terms of statistical dependences.
Feature selection and dimensionality reduction methods may also be considered
in the above context. Given a pool of features, each representing a group of
elementary observations vj, it is necessary to determine which set of groups most
economically represents the statistical dependences necessary for accurate estima-
tion of the a posteriori probability. The number of possible combinations of
features tends to be astronomical: if we wish to select 100 features from an initial
set of 1000, there are about 10^140 possible combinations. Most feature
6.4. Sequential classification
In sequential classification only a subset of the observations vj is used to arrive
at a decision for the character identity; unused elements need not even be
collected. The classification is based on a decision tree (Fig. 5) which governs the
sequence of observations (picture elements) to be examined. The tree is fixed for a
given application, but the path traced through the tree depends on the character
under consideration.
The first element vj to be examined, called the root of the tree, is the same for
any new character, since no information is available yet as to which element
would provide the most information. The second element to be examined,
however, depends on whether the first element was black or white. The third
element to be examined depends, in turn, on whether the second element was
black or white. Each node in the tree, corresponding to a given observation v_j,
thus has two offspring, each corresponding to two other observations. No
observation v_j occurs more than once in a path. The leaves of the tree are labelled
with the character identities or as 'reject' decisions. Normally there are several
leaves for each character class (and also for the reject decision), corresponding to
the several character configurations which may lead to each classification.
Binary decision trees for typewritten characters typically have from 1000 to
10 000 nodes. The path length through the tree, from root to leaf, varies with the
character under consideration.
[Fig. 5. A binary decision tree for sequential character classification.]
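A sketch of the traversal, with a toy hypothetical tree (the real trees, as noted above, run to thousands of nodes):

```python
def classify_by_tree(pixels, tree, root=0):
    """tree[node] = (pel_index, white_child, black_child) at internal nodes,
    ('leaf', label) at the leaves; pixels is a flat 0/1 sequence."""
    node = root
    while tree[node][0] != 'leaf':
        pel, white, black = tree[node]
        node = black if pixels[pel] else white
    return tree[node][1]

# Toy tree: test pel 1, then (if black) pel 3.
tree = {0: (1, 1, 2), 1: ('leaf', 'reject'), 2: (3, 3, 4),
        3: ('leaf', 'I'), 4: ('leaf', 'L')}
print(classify_by_tree([0, 1, 0, 1], tree))   # -> 'L'
```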
Table 1
Comparison of statistical classification algorithms. The comparison of the computational requirements
of the various classification methods is based on 64 character classes and 600 (20 × 30) binary picture
elements per character

Method: Computational requirement
Exhaustive search: 2^600 ≈ 10^180 comparisons (600 bits each)
Complete pairwise dependence: ½ × 64 × 600² ≈ 10^7 logical operations and additions
Complete second-order Markov dependence: 64 × 1200 ≈ 80 000 logical operations and additions
Class-conditional independence (weighted masks): 64 × 600 ≈ 40 000 logical operations and additions
Mask-matching (binary masks): 64 × 600 ≈ 40 000 logical operations and counts
Peephole templates (30 points each): 64 × 30 = 1920 logical operations and counts
Complete decision tree: 600 one-bit comparisons
Decision tree, 4000 nodes: log₂ 4000 ≈ 12 one-bit comparisons
7. Context
Characters that are difficult to classify by their shape alone can sometimes be
recognized correctly by using information about other characters in the same
document. For instance, in decoding a barely legible handwritten postcard, one
must frequently resort to locating different occurrences of the same misshapen
symbol. Contextual information may be used in a number of different ways to aid
recognition; Toussaint lists six major categories. In this section, however, we will
discuss only the use of the information available from the non-random sequenc-
ing of characters in natural-language text, including such relatively 'unnatural'
sequences as postal addresses, prices, and even social security numbers.
For a sequence of observed patterns V = v_1, v_2, …, v_n, we may again use Bayes'
rule to obtain
$$P(A \mid V) = P(V \mid A)\, P(A) / P(V).$$
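The sequence posterior above is typically maximized with a bigram (Markov) approximation to P(A) by dynamic programming (the Viterbi algorithm; cf. Neuhoff [30] and Raviv [33]). A minimal sketch, assuming per-character class log-likelihoods are already available and that any prior on the first character is folded into its likelihoods:

```python
import numpy as np

def contextual_decode(log_lik, log_bigram):
    """log_lik[t, c] = log p(v_t | class c); log_bigram[i, j] = log P(j | i).
    Returns the jointly most probable class sequence."""
    T, M = log_lik.shape
    score, back = log_lik[0].copy(), np.zeros((T, M), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_bigram   # cand[i, j]: end at j via i
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_lik[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```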
Table 2
Bigram frequencies based on 600 000 characters of legal text (×10)
A B C D E F G H I J K L
1 0.693 0.044 0.001 0.015 0.178 0.375 0.099 0.044 0.038 0.003 0.003 0.008 0.062
2 A 0.207 0.000 0.004 0.030 0.011 0.032 0.011 0.010 0.057 0.019 0.001 0.000 0.038
3 B 0.079 0.011 0.000 0.000 0.000 0.001 0.000 0.000 0.001 0.005 0.000 0.000 0.000
4 C 0.124 0.028 0.000 0.004 0.000 0.044 0.000 0.000 0.000 0.047 0.000 0.000 0.000
5 D 0.057 0.017 0.000 0.000 0.001 0.085 0.000 0.000 0.000 0.021 0.000 0.000 0.015
6 E 0.057 0.000 0.033 0.053 0.063 0.025 0.016 0.021 0.216 0.022 0.003 0.008 0.045
7 F 0.064 0.006 0.000 0.000 0.000 0.017 0.017 0.000 0.000 0.014 0.000 0.000 0.002
8 G 0.016 0.011 0.000 0.000 0.006 0.012 0.000 0.001 0.000 0.014 0.000 0.000 0.001
9 H 0.047 0.001 0.000 0.032 0.000 0.003 0.000 0.013 0.000 0.000 0.000 0.000 0.000
10 I 0.130 0.025 0.006 0.026 0.037 0.010 0.026 0.007 0.046 0.000 0.000 0.002 0.027
11 J 0.016 0.001 0.002 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
12 K 0.004 0.004 0.000 0.004 0.000 0.001 0.000 0.000 0.000 0.001 0.000 0.000 0.001
13 L 0.026 0.067 0.010 0.013 0.001 0.031 0.003 0.002 0.001 0.025 0.000 0.001 0.042
14 M 0.044 0.011 0.001 0.000 0.002 0.016 0.000 0.003 0.000 0.017 0.000 0.000 0.000
15 N 0.047 0.118 0.000 0.000 0.000 0.091 0.000 0.003 0.001 0.168 0.000 0.002 0.000
16 O 0.153 0.000 0.018 0.084 0.008 0.003 0.033 0.004 0.025 0.076 0.003 0.000 0.014
17 P 0.075 0.029 0.000 0.000 0.000 0.010 0.000 0.000 0.000 0.007 0.000 0.000 0.000
18 Q 0.004 0.000 0.000 0.000 0.000 0.005 0.000 0.000 0.000 0.000 0.000 0.000 0.000
19 R 0.057 0.063 0.005 0.006 0.003 0.140 0.009 0.008 0.003 0.017 0.000 0.000 0.000
20 S 0.116 0.060 0.003 0.001 0.005 0.075 0.000 0.004 0.001 0.081 0.000 0.001 0.007
21 T 0.307 0.099 0.001 0.043 0.000 0.022 0.004 0.001 0.008 0.078 0.000 0.000 0.005
22 U 0.024 0.010 0.011 0.010 0.006 0.000 0.006 0.005 0.003 0.000 0.013 0.000 0.006
23 V 0.023 0.008 0.000 0.000 0.001 0.018 0.000 0.000 0.000 0.010 0.000 0.000 0.002
24 W 0.075 0.007 0.000 0.000 0.001 0.007 0.000 0.000 0.001 0.000 0.000 0.000 0.001
25 X 0.000 0.005 0.000 0.000 0.000 0.015 0.000 0.000 0.000 0.001 0.000 0.000 0.000
26 Y 0.005 0.011 0.019 0.002 0.001 0.005 0.001 0.000 0.002 0.000 0.000 0.000 0.020
27 Z 0.003 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.002 0.000 0.000 0.000
Bigram frequencies depend greatly on the domain of discourse; in the example given, the frequency of
the pair 'ju' (from jury, judge, judicial, jurisprudence, etc.) is much higher than in
other types of material. The most common letter in English is 'e', accounting for
13% of all letters. The most common pair is 'he'.
In studying higher-order letter frequencies, upper and lower case, blanks, and
punctuation must also be considered. The letter 'q', for instance, is more common
at the beginning of words than in the middle or at the end--hence 'q-' has a
lower frequency than '-q'. The distribution of n-gram frequencies according to
the position within the word has been studied by Shinghal and Toussaint. A study
of the singlet frequencies of Chinese ideographs (which may be regarded as
complete words, but represent single characters as far as OCR is concerned) is
available in Chen. Japanese Katakana character frequencies are tabulated in Bird.
In deriving n-gram probabilities from a sample of text it is necessary to make
special provision for estimating the frequency of n-grams which do not occur in
the text at all. More generally, it is desirable to use a better estimator than the
sample frequency itself. By postulating an a priori distribution for each n-gram
frequency, an improved estimator may be derived either by minimizing the risk
Table 2 (continued)
Bigram frequencies based on 600 000 characters of legal text (×10)
M N O P Q R S T U V W X Y
1 0.018 0.176 0.074 0.008 0.000 0.097 0.213 0.180 0.002 0.011 0.013 0.005 0.096
2 A 0.031 0.020 0.008 0.020 0.000 0.038 0.013 0.049 0.006 0.007 0.022 0.003 0.000
3 B 0.002 0.000 0.003 0.000 0.000 0.001 0.000 0.000 0.008 0.000 0.000 0.000 0.000
4 C 0.001 0.031 0.010 0.000 0.000 0.009 0.005 0.001 0.014 0.000 0.000 0.005 0.000
5 D 0.000 0.090 0.006 0.000 0.000 0.025 0.001 0.000 0.008 0.000 0.000 0.000 0.000
6 E 0.039 0.037 0.003 0.043 0.000 0.124 0.074 0.085 0.012 0.037 0.020 0.003 0.003
7 F 0.000 0.003 0.095 0.000 0.000 0.002 0.000 0.000 0.002 0.000 0.001 0.000 0.000
8 G 0.000 0.052 0.003 0.000 0.000 0.004 0.000 0.000 0.005 0.000 0.000 0.000 0.000
9 H 0.000 0.001 0.005 0.003 0.000 0.001 0.016 0.258 0.000 0.000 0.022 0.001 0.000
10 I 0.024 0.017 0.004 0.004 0.000 0.047 0.033 0.096 0.010 0.024 0.019 0.002 0.001
11 J 0.000 0.003 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
12 K 0.000 0.002 0.001 0.000 0.000 0.006 0.001 0.000 0.000 0.000 0.000 0.000 0.000
13 L 0.000 0.004 0.017 0.017 0.000 0.003 0.003 0.003 0.017 0.000 0.001 0.000 0.001
14 M 0.011 0.001 0.028 0.001 0.000 0.011 0.002 0.003 0.005 0.000 0.003 0.000 0.001
15 N 0.000 0.005 0.143 0.000 0.000 0.008 0.000 0.002 0.023 0.000 0.004 0.000 0.000
16 O 0.015 0.032 0.005 0.027 0.000 0.043 0.016 0.067 0.001 0.003 0.006 0.000 0.002
17 P 0.012 0.000 0.013 0.028 0.000 0.007 0.008 0.000 0.013 0.000 0.000 0.002 0.000
18 Q 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
19 R 0.000 0.001 0.094 0.039 0.000 0.008 0.000 0.028 0.043 0.000 0.003 0.000 0.000
20 S 0.004 0.035 0.013 0.001 0.000 0.023 0.029 0.025 0.032 0.000 0.003 0.000 0.005
21 T 0.000 0.087 0.032 0.007 0.000 0.043 0.072 0.011 0.024 0.000 0.000 0.002 0.000
22 U 0.004 0.004 0.051 0.006 0.010 0.011 0.032 0.013 0.000 0.000 0.000 0.000 0.000
23 V 0.000 0.003 0.011 0.000 0.000 0.005 0.000 0.000 0.000 0.000 0.000 0.000 0.000
24 W 0.000 0.002 0.018 0.000 0.000 0.001 0.001 0.004 0.000 0.000 0.000 0.000 0.000
25 X 0.000 0.000 0.001 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
26 Y 0.000 0.009 0.001 0.000 0.000 0.014 0.001 0.019 0.000 0.000 0.000 0.000 0.000
27 Z 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
$$p = (m + 1)/(n + 2),$$
where m is the number of occurrences of the ith n-gram in the sample and n is
the total number of n-grams. The actual a priori distributions may be more
realistically approximated by means of beta distributions with two parameters,
yielding a posteriori estimates of the type (m + a)/(n + b).
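A minimal sketch of this estimator for bigrams (a = 1, b = 2 reproduces (m + 1)/(n + 2); the interface is illustrative):

```python
from collections import Counter

def bigram_estimator(text, a=1.0, b=2.0):
    """Return p(x, y) = (m + a) / (n + b); with a=1, b=2 this is the
    (m + 1)/(n + 2) rule, so unseen bigrams keep nonzero probability."""
    pairs = Counter(zip(text, text[1:]))
    n = max(len(text) - 1, 0)
    return lambda x, y: (pairs[(x, y)] + a) / (n + b)

p = bigram_estimator("the law of the land")
print(p('t', 'h'), p('q', 'z'))   # a seen pair vs. an unseen one
```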
The estimates of n-gram frequencies obtained by various authors have been
compared by Suen in an article that includes also an excellent bibliography on the
statistical parameters of textual material.
8. Error/reject rates
8.1. Prediction
Based on his 1957 work (see above), in 1970 Chow derived some very
interesting relations between the substitution error rate and the reject rate of a
character recognition system. The principal results are:
(1) The optimum rule for rejection is, regardless of the underlying distributions,
to reject the pattern if the maximum of the a posteriori probabilities is less than
some threshold.
(2) The optimal reject and error rates are both functions of a parameter t,
which is adjusted according to the desired error/reject tradeoff (which, in turn, is
based on the cost of errors relative to that of rejections).
(3) The reject rule divides the decision region into 'accept' and 'reject' regions;
the error and reject rates are the integrals over the two regions of the probability
P(v) of the observations.
(4) Both the error and reject rates are monotonic in t.
(5) t is an upper bound on the error rate.
(6) The slope dE/dR of the error-reject curve increases from −1 + 1/n (n is
the number of classes) to 0 as R increases from 0 to 1.
(7) The error-reject curve is always concave upwards (d²E/dR² ≥ 0).
(8) The optimum error rate may be computed from the reject rate according to
the equation
$$E(t) = -\int_0^t \tau \, dR(\tau),$$
regardless of the form of the underlying distribution.
The importance of (8) lies in the fact that in large-scale character-recognition
experiments it is generally impossible to obtain a sufficient number of labelled
samples to estimate small error rates directly. Eq. (8) allows the estimation of the
error rate using unlabelled samples.
Chow gives examples of the optimum error-reject curve for several forms of
the probability density function of the observations.
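Result (8) lends itself to direct numerical use: sweep the threshold on unlabelled data, record the empirical reject rate at each setting, and integrate. A sketch (midpoint rule for the Stieltjes integral; interface illustrative):

```python
import numpy as np

def error_from_reject_curve(t_vals, r_vals):
    """Evaluate E(t) = -integral_0^t tau dR(tau) by a midpoint Stieltjes sum.
    t_vals: ascending thresholds; r_vals: reject rates measured at them
    (no labels needed). Returns E at t_vals[1:]."""
    t, r = np.asarray(t_vals, float), np.asarray(r_vals, float)
    return -np.cumsum(0.5 * (t[1:] + t[:-1]) * np.diff(r))
```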
utes information to tune the classifier. The main branches of this endeavour are
supervised classification where each new sample is identified, and unsupervised
classification ('learning without a teacher') where the true identities of the new
characters remain unknown. None of this work seems to have found much
application to practical OCR systems.
Commercial OCR manufacturers normally test their devices on samples of
several hundred thousand or millions of characters in order to make performance
estimates at realistic levels of substitution error and reject rate. Academic re-
searchers, on the other hand, usually test their algorithms on a few hundred or
thousand samples only, using general purpose computers. The IEEE Pattern
Recognition Data Base contains several dozen test sets, originally collected by the
IEEE Computer Society's Technical Committee on Pattern Recognition, which
are available to experimenters for the cost of distribution. These test databases
include script, hand-printed characters, single and multifont typewritten char-
acters, and also some non-character image data. Several articles have been written
comparing different studies using identical design and test data. Another source
of fairly large files of alphanumeric test data is the Research Division of the U.S.
Postal Service.
Table 3
'Typical' error and reject rates

                           Error rate    Reject rate
Stylized characters        0.000005      0.00005
Typescript (OCR quality)   0.00001       0.0001
Ordinary typescript        0.0005        0.005
Handprint (alphanumeric)   0.005         0.05
Bookface                   0.001         0.01
There are twenty-four large-capacity mail sorters installed in U.S. Post
Offices in large cities throughout the country. The older machines are effective
only in sorting out-going mail, where the contextual relations between the city,
state, and zip-code on the last line of the address allow correct determination of
the destination even with several misrecognized characters. On high quality
printed matter (most of the mail in the United States, as opposed to Japan, has a
printed or typewritten address) 70% to 90% of the mail pieces are routed
correctly, and the rest is rejected for subsequent manual processing. In order to
use the machines effectively, obviously unreadable material is not submitted to
the machines. Large mailers participate in the U.S. Postal Service's 'red tag'
program by affixing special markers to large batches of presorted mail which is
known to have the appropriate format for automatic mail sorting.
The recognition performance on variable-pitch typeset material is much lower
than on typewritten and special OCR fonts: up to 1% of the characters may not
be recognized correctly. Only a few manufacturers market machines for typeset
material. Some of these machines must be 'trained' on each new font. Mixed-font
OCR has been successfully applied to reading-aids for the blind. Here even if a
relatively high fraction of the characters is misrecognized, the human intelligence
can make sense from the output of the device. In one commercially available
system the reader is coupled to an automatic speech synthesizer which voices the
output of the machine at an adjustable rate between 100 and 300 words per
minute. When the rules governing its pronunciation fail, this machine can spell
the material one letter at a time.
Few major benefits can be expected from further improvement of current
commercial classification accuracy on clean stylized typescript and on stylized
fonts. Current developments are focussed, therefore, on enabling OCR devices to
handle increasingly complicated formats (such as technical magazine articles) in
an increasing variety of styles including hand-printed alphanumeric information,
ordinary typescript, and bookface fonts, all at a price allowing small decentralized
applications. When these problems are successfully solved, we may expect general
purpose OCR input devices to be routinely attached to even the smallest
computer systems, complementing the standard keyboard.
Acknowledgment
The author wishes to acknowledge the influence on the points of view expressed
in this article of a number of former colleagues, particularly R. Bakis, R. G.
Casey, C. K. Chow, C. N. Liu, and Glenn Shelton, Jr. He is also indebted to
R. M. Ray III for some specific suggestions.
Bibliography
[1] Abend, K. (1968). Compound decision procedures for unknown distributions and for dependent
states of nature. In: L. N. Kanal, ed., Pattern Recognition. Thompson, Washington/L.N.K.
Corporation, College Park, MD.
[2] Ascher, R. N. et al. (1971). An interactive system for reading unformatted printed text. IEEE
Trans. Comput. 20, 1527-1543.
[3] Bird, R. B. (1967). Scientific Japanese: Kanji distribution list for elementary physics. Rept. No.
33. Chemical Engineering Department, The University of Wisconsin.
[4] Bledsoe, W. W. and Browning, I. (1959). Pattern recognition and reading by machines. Proc.
E.J.C.C., 225-233.
[5] British Computer Society. (1967). Character Recognition. BCS, London.
[6] Casey, R. G. and Nagy, G. (1968). An autonomous reading machine. IEEE Trans. Comput. 17,
492-503.
[7] Casey, R. G. and Nagy, G. (1966). Recognition of printed Chinese characters. IEEE Trans.
Comput. 15, 91-101.
[8] Chen, H. C. (1939). Modern Chinese Vocabulary. Soc. for the Advancement of Chinese
Education. Commercial Press, Shanghai.
[9] Chow, C. K. (1957). An optimum character recognition system using decision functions. IRE
EC 6, 247-257.
[10] Chow, C. K. and Liu, C. N. (1966). An approach to structure adaptation in pattern recognition.
IEEE SSC 2, 73-80.
[11] Chow, C. K. and Liu, C. N. (1968). Approximating discrete probability distributions with
dependence trees. IEEE Trans. Inform. Theory 14, 462-467.
[12] Chow, C. K. (1970). On optimum recognition error and reject tradeoff. IEEE Trans. Inform.
Theory 16, 41-46.
[13] Deutsch, S. (1957). A note on some statistics concerning typewritten or printed material. IRE IT
3, 147-148.
[14] Doyle, W. (1960). Recognition of sloppy, hand-printed characters. Proc. W.J.C.C., 133-142.
[15] Fischer, G. L. (1960). Optical Character Recognition. Spartan Books, Washington.
[16] Genchi, H., Mori, K. I., Watanabe S. and Katsuragi, S. (1968). Recognition of handwritten
numeral characters for automatic letter sorting. Proc. IEEE 56, 1292-1301.
[17] Greanias, E. C., Meagher, P. F., Norman, R. J. and Essinger, P. (1963). The recognition of
handwritten numerals by contour analysis. IBM J. Res. Develop. 7, 14-22.
[18] Greenough, M. L. and McCabe, R. M. (1975). Preparation of reference data sets for character
recognition research. Tech. Rept. to U.S. Postal Service. Office of Postal Tech. Res., Pattern
Recog. and Comm. Branch, Nat. Bur. of Standards, NBSIR 75-746. Washington.
[19] Harmon, L. D. (1972). Automatic recognition of print and script. Proc. IEEE 60, 1165-1176.
[20] Hennis, R. B. (1968). The IBM 1975 optical page reader. IBM J. Res. Develop. 12, 345-371.
[21] Herbst, N. M. and Liu, C. N. (1977). Automatic signature verification. IBM J. Res. Develop. 21,
245-253.
[22] Hoffman, R. L. and McCullough, J. W. (1971). Segmentation methods for recognition of
machine-printed characters. IBM J. Res. Develop. 15, 101-184.
[23] Hussain, A. B. S., and Donaldson, R. W. (1974). Suboptimal sequential decision schemes with
on-line feature ordering. IEEE Trans. Comput. 23, 582-590.
[24] Kamentsky, L. A. and Liu, C. N. (1963). Computer automated design of multi-font print
recognition logic. IBM J. Res. Develop. 7, 2-13.
[25] Kanal, L. N. (1980). Decision tree design--the current practices and problems. Pattern
Recognition in Practice. North-Holland, Amsterdam.
[26] Kovalevsky, V. A. (1968). Character Readers and Pattern Recognition. Spartan Books, Washing-
ton.
[27] Liu, C. N. and Shelton, G. L., Jr. (1966). An experimental investigation of a mixed-font print
recognition system. IEEE Trans. Comput. 15, 916-925.
[28] Minsky, M. (1961). Steps towards artificial intelligence. Proc. IRE.
[29] Nadler, M. (1963). An analog-digital character recognition system. IEEE Trans. Comput. 12.
[30] Neuhoff, D. L. (1975). The Viterbi algorithm as an aid in text recognition. IEEE Trans. Inform.
Theory 21, 222-226.
[31] OCR Users Association (1977). OCR Users Association News. Hackensack, NJ.
[32] Ledley, G. (1970). Special issue on character recognition. Pattern Recognition 2. Pergamon Press,
New York.
[33] Raviv, J. (1967). Decision making in Markov chains applied to the problem of pattern
recognition. IEEE Trans. Inform. Theory 13, 536-551.
[34] Riseman, E. M. and Ehrich, R. W. (1971). Contextual word recognition using binary digrams.
IEEE Trans. Comput. 20, 397-403.
[35] Riseman, E. M. and Hanson, A. R. (1974). A contextual postprocessing system for error
correction using binary N-grams. IEEE Trans. Comput. 23, 480-493.
[36] Schantz, H. F. (1979). A.N.S.I. OCR Standards Activities. OCR Today 3. OCR Users Associa-
tion, Hackensack, NJ.
[37] Shannon, C. (1951). Prediction and entropy of printed English. BSTJ 30, 50-64.
[38] Shillman, R. (1974). A bibliography in character recognition: techniques for describing char-
acters. Visible Language 7, 151-166.
[39] Shinghal, R., Rosenberg, D., and Toussaint, G. T. (1977). A simplified heuristic version of
recursive Bayes algorithm for using context in text recognition. IEEE Trans. Systems Man
Cybernet. 8, 412-414.
[40] Shinghal, R., Rosenberg, D., and Toussaint, G. T. (1977). A simplified heuristic version of
Raviv's algorithm for using context in text recognition. Proc. 5th Internat. Joint Conference
Artificial Intelligence, 179-180.
[41] Stevens, M. E. (1961). Automatic character recognition-state-of-the-art report. Nat. Bureau
Standards, Tech. Note 112. Washington.
[42] Suen, C. Y. (1979). Recent advances in computer vision and computer aids for the visually-
handicapped. Computers and Ophthalmology. IEEE Cat. No. 79CH1517-2C.
[43] Suen, C. Y. (1979). N-gram statistics for natural language understanding and text processing.
IEEE PA MI 1, 164-172.
[44] Suen, C. Y. (1979). A study on man-machine interaction problems in character recognition.
IEEE Trans. Systems Man Cybernet. 9, 732-737.
[45] Thomas, F. J. and Horwitz, L. P. (1964). Character recognition bibliography and classification.
IBM Research Report RC-1088.
[46] Toussaint, G. T. (1974). Bibliography on estimation of misclassification. IEEE Trans. Inform.
Theory 20, 472-479.
[47] Toussaint, G. T. and Shinghal, R. (1978). Tables of probabilities of occurrence of characters,
character pairs, and character triplets in English text. McGill University, School of Computing
Sciences. Tech. Rept. No. SOCS 78-6, Montreal.
[48] Toussaint, G. T. (1978). The use of context in pattern recognition. Pattern Recognition 10,
189-204.
[49] Walter, T. (1971). Type design classification. Visible Language 5, 59-66.
[50] Wright, G. G. N. (1952). The Writing of Arabic Numerals. University of London Press, London.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 651-671
1. Introduction
spilled oil samples as well as the suspect samples. The peaks of the spectra are
then determined manually (or automatically, in a computerized system) and the
identification of the spiller is accomplished by overlaying two spectra (the spill
sample and a suspect sample) at a time for comparison. The suspect that results in
the closest match (according to some numerical measure of closeness) at the
designated peaks is identified as the spiller.
Various oil identification schemes have been proposed and studied for their
potential applications in an integrated system which will determine the true
source of an oil spill with high reliability. For a more complete discussion of oil
spill identification schemes, the reader is referred to [4, 5, 6]. In this paper, a
number of mathematical models and techniques that make use of probability
theory and statistics in one form or another will be discussed.
Section 2 will deal with the various computer- and statistics-oriented methods for
data analysis as applied to oil identification. Two mathematical models for
identifying oil spillers, based on matching probability data or spectral data, are
presented in Section 3. The final section gives a summary of the research and
development projects related to oil spill identification work. This summary, when
used with the bibliography, should be useful to the reader who wishes to obtain
additional information on the subject of oil spill identification.
This section describes various computer and statistical techniques which have
been applied to the problem of identifying oil or chemical spectra. In Subsection
2.1 we discuss how computer graphics can be used to allow interactive analysis of
high dimensional data in a 2-dimensional space. The other two parts of this
section outline major work done by two investigators contracted to the United
States Coast Guard, Chris Brown [7, 8] from the University of Rhode Island,
Department of Chemistry, and James Mattson [35, 36] from the University of
Miami, Rosenthal School of Marine and Atmospheric Science. Each carried out
extensive studies of methods of matching oil spectra. These projects were carried
out in the early to mid 1970's and concentrated on two different aspects of the
problem.
Mattson applied the classical statistical and pattern recognition methodology to
the problem of oil identification, but was hampered by lack of real data to assess
the effects of weathering on types of oil samples. Brown, on the other hand,
collected voluminous data on weathering and simulated weathering. His analysis
was not done from a classical point of view but was developed specifically for the
oil identification problem.
The record match and change sample commands return control to the first
menu. The hard copy, select sample and move sample commands return control to
the second menu.
The printout from this program is a record of all suspect selections, any
movement made, and any matches recorded. The printout also includes the
spectral data read in by the program for the suspects and sample. Fig. 1 and Fig.
2 illustrate typical manipulations allowed in this system to match FL spectra
corresponding to spill and suspect oil samples.
Fig. 1. Graphical sample execution ('*' for spill sample and '+' for suspect).
(a) The initial display.
(b) Display following light pen detection on READ SUSPECTS and SAMPLE.
(c) Display following light pen detection on SELECT SUSPECT.
(d) Display following light pen detection on MOVE SUSPECT.
(e) Display following light pen detection on SPECIFY while entering point numbers.
(f) Display following entering point numbers.
[Fig. 1(a), (b): plots of normalized spectra (points 0-70) for the spill sample and suspect.]
[Fig. 1(c), (d): the spill sample and suspect normalized, with the menu commands MOVE SUSPECT, RECORD MATCH, HARD COPY, CHANGE SAMPLE, SHIFT RIGHT, SHIFT LEFT and SPECIFY visible.]
[Fig. 1(e), (f): the suspect spectrum shifted 4 units left and lowered relative to the spill sample.]
[Fig. 2(a): plot of spill sample 1 (points 0-70, normalized).]
Fig. 2. Graphical output ('*' for spill sample and '+' for suspect). Each of the plots (a) through (d)
was obtained by light pen detection on HARD COPY. They are examples of the hard copy and of the
manipulation of spectra available to the user.
(a) A display of spill sample 1.
(b) A display of spill sample 1 and suspect 2 both in their original normalized positions.
(c) A display of spill sample 1 in normalized position and suspect 2 moved so that the 52nd point of
the suspect coincides with the 45th point of the spill sample.
(d) A display in which the suspect spectral waveform has been shifted one position to the right in
relation to the display of (c).
[Fig. 2(b)-(d): spill sample 1 and suspect 2 normalized, then the suspect moved and shifted as described above.]
Fig. 3. An example of how nonlinear transformations can be applied to identify crude oil clusters
from a 2-dimensional plot (axes Y1 and Y2; legend: Middle East, Mid-Continent, Alaska, California).
Visually identified clusters are circled.
S^2 = \sum_{i=1}^{18} \Big\{ \log(A_{i1}/A_{i2}) - \frac{1}{18} \sum_{j=1}^{18} \log(A_{j1}/A_{j2}) \Big\}^2,

where A_{i1} and A_{i2} denote the absorbances of the spill and suspect samples at the ith designated peak.
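To make the statistic concrete, here is a minimal sketch that computes this S² from two vectors of matched peak absorbances. The function name, the example values, and the peak count are our illustrative assumptions, not data from Brown's reports.

```python
import math

def log_ratio_s2(spill_abs, suspect_abs):
    # Log ratios of matched peak absorbances; a constant multiplicative
    # factor between the two spectra (e.g. sample thickness) cancels here.
    ratios = [math.log(a1 / a2) for a1, a2 in zip(spill_abs, suspect_abs)]
    mean = sum(ratios) / len(ratios)
    # S^2 is the sum of squared deviations of the log ratios from their mean.
    return sum((r - mean) ** 2 for r in ratios)

# Hypothetical absorbances at matched designated peaks.
spill = [0.42, 0.31, 0.55, 0.12, 0.77, 0.28, 0.36, 0.49, 0.21,
         0.64, 0.18, 0.53, 0.25, 0.47, 0.39, 0.58, 0.33, 0.29]
suspect = [a * 1.05 for a in spill]      # a near-match up to a constant scale
print(log_ratio_s2(spill, suspect))      # ~0.0: small S^2 indicates a match
```

A small S² indicates a close match; because the statistic is built from deviations of log ratios about their mean, a uniform scale factor between spill and suspect spectra leaves it unchanged.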
Both of the criteria mentioned above have been found useful in identifying the
true source of a spill in most cases. However, when spills have experienced
weathering, especially in the case of light oils, the log ratio statistic has some
difficulty identifying the true spill sample. This may be because it cannot
compensate for the kinds of changes in the spectrum that weathering causes.
Now let the event that the spiller escaped without being sampled be denoted by
A_0. Implicit in our model is the assumption that either one of the identified
suspects caused the spill or the true spiller(s) escaped.
Notice that the sample space is simply

S = \bigcup_{i=0}^{n} A_i

and the A_i are disjoint. We have partitioned the set of all possible outcomes into
n + 1 disjoint events.
Prior probabilities
Many times prior information about the suspects will allow an investigator to
assign a prior probability of guilt to each suspect. This information may include
eyewitness identification, oceanographic data, information from ships' logs, etc.
These probabilities are denoted by P(A_i). Certainly P(A_i) ≥ 0 for i = 0, 1, 2, …, n
and \sum_{i=0}^{n} P(A_i) = 1. When no prior information is known, it is usually reasonable
to assign P(A_i) = 1/(n+1); that is, all events A_i, i = 0, 1, …, n, are equally likely.
If multiple samples or methods are involved, the event B will become more
complicated. We would need additional statistics for each sample and method
from each suspect: for example, if two spill samples and two methods were used,
the distance statistic for the second method would be denoted by D_2. The revised
probabilities are given by

P(A_i | B) = P(B | A_i) P(A_i) / \sum_{j=0}^{n} P(B | A_j) P(A_j),        (1)

and the likelihood of the escape event A_0 is a product of density values,

P(B | A_0) = \prod_j D(x_j).
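A minimal sketch of the revision rule (1), assuming the likelihoods P(B|A_i) have already been read from the appropriate histograms; the function and variable names are ours, not part of any published system.

```python
def revise(priors, likelihoods):
    """Posterior P(A_i | B) from priors P(A_i) and likelihoods P(B | A_i), eq. (1)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)                    # the denominator of (1)
    return [j / total for j in joint]

# Hypothetical values: the escape event A_0 followed by two suspects.
print(revise(priors=[1/3, 1/3, 1/3], likelihoods=[2.1e-3, 9.4e-3, 0.3e-3]))
```

Because any common factor in the likelihoods cancels in the normalization, they may be supplied up to an arbitrary constant multiple.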
In the case when multiple samples are obtained, specific care should be taken
to ensure statistical independence; for instance, spill samples should be taken
from different areas of the spill at different times. When multiple chemical
methods are used, histograms for each method must be tabulated and the
statistical independence of the statistics must be established. We have preliminary
results indicating independence of the infrared S^2 and of D^2, a distance statistic for
fluorescence.
Notice that since all prior probabilities are equal and appear in each term of (1), they cancel.
Similarly we get P(A_1 | B_1) ≈ 0.808, P(A_2 | B_1) ≈ 0.029. Our intuition has been
justified and suspect 1 is most likely the spiller. However, there is still a good
chance that A_0 occurred, and we certainly do not have enough information to
accuse suspect number 1.
If a second sample from a different part of the spill is also available, we may
continue in the following manner. Since the sample is from another part of the
spill we assume that it is independent of the first sample and recalculate (1) using
the above revised probabilities as priors. Suppose that the S^2 values (using the
original suspect samples vs. the new spill samples) are S_1^2 = 2.67 and S_2^2 = 16.73.
The table of D and S values becomes:
Here we let B_2 = {S_1^2 = 2.67, S_2^2 = 16.73} and take P(A_0) ≈ 0.163, P(A_1) ≈ 0.808, and
P(A_2) ≈ 0.029 as our prior probabilities; we obtain

P(B_2 | A_i) = 34.5 × 10^{-7} if i = 0,  335.8 × 10^{-7} if i = 1,  10.5 × 10^{-7} if i = 2.
The above example might have been done in one step, letting B = B_1 ∩ B_2
and P(A_i) = 1/3 as our prior probabilities. The calculations are slightly more
complex; for instance, P(B | A_1) = S(x_1)D(x_2)S(x_1')D(x_2') = (0.0094)(0.0017)
(0.0146)(0.0023). However, the final revised probability estimates would be the
same as those we obtained.
If another chemical method is employed and (i) the distance statistic for this
method is independent of S^2 and (ii) histograms for identical and different oils
have been tabulated, then we may revise our probability estimates again using our
old revised probabilities as priors. In this way it is possible to combine the results
of two or more methods into our probability calculations.
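Continuing with the numbers of this example, the same one-line normalization reproduces the sequential update: the revised probabilities from the first sample serve as priors and the P(B_2|A_i) values act as likelihoods (their common 10⁻⁷ factor cancels). This is a sketch of the calculation, not part of the original system.

```python
priors = [0.163, 0.808, 0.029]        # revised probabilities after the first sample
lik = [34.5, 335.8, 10.5]             # P(B2 | A_i), common 1e-7 factor dropped
joint = [p * l for p, l in zip(priors, lik)]
post = [j / sum(joint) for j in joint]
print([round(p, 3) for p in post])    # -> [0.02, 0.979, 0.001]
```

Suspect 1's probability rises further, while the escape event A_0 becomes much less plausible.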
Discussion
The method presented here is a general one; it may be applied to any of the
standard chemical tests as long as a suitable 'distance' measure can be established
and the required histograms are available.
Our technique gives a truly quantitative calculation of the probability of a
match, provides a reasonable probabilistic model for the oil identification prob-
lem, and possibly allows an investigator to systematically combine several chemi-
cal methods.
Table 1
Summary of oil spill I.D. research

Investigators       Sponsoring        Method of               Type of chemical        References
                    institution       data analysis           data
A. Bentz            USCG a            CG ID system            Infrared                1, 3, 4, 5, 6, 42, 43
C. Brown            USCG,             Log ratio statistics    Infrared                2, 7, 8, 34, 43
                    Univ. of RI
Y. T. Chien,        USCG,             Probability model,      Infrared,               1, 9, 10, 29, 30, 43
T. J. Killeen       Univ. of CT       curve fitting           fluorescence
M. Curtis           USCG,             Cluster analysis        Fluorescence            11, 43
                    Rice Univ.
D. Eastwood         USCG              CG ID system,           Fluorescence,           14, 17, 30, 42
                                      curve fitting           low temp. luminescence
G. Flanigan,        USCG              Log ratio statistics    Gas chromatography      15, 16, 18, 42, 43
G. Frame
J. Frankenfeld      USCG, Exxon       Multiple methods                                19, 20
P. Grose            NOAA b            Modeling oil spills                             21
R. Jadamec,         USCG              CG ID system            Thin layer and liquid   23, 24, 39, 40, 42
W. Saner                                                      chromatography
F. K. Kawahara      USEPA c           Ratios of absorbances,  Infrared                27, 28
                                      L.D.F.A.
B. Kowalski         Univ. of          Pattern recognition                             13, 31, 32, 33, 43
                    Washington
J. Mattson          Univ. of Miami,   Multivariate analysis,  Infrared                35, 36, 43
                    NOAA, USCG        L.D.F.A.
References
[1] Anderson, C. P., Killeen, T. J., Taft, J. B. and Bentz, A. P. Improved identification of spilled oils
by infrared spectroscopy, in press.
[2] Baer, C. D., and Brown, C. W. (1977). Identifying the source of weathered petroleum: Matching
infrared spectra with correlation coefficients, Appl. Spectroscopy 6 (31) 524-527.
[3] Bentz, A. P. (1976). Oil spill identification bibliography. U.S. Coast Guard Rept. ADA 029126.
[4] Bentz, A. P. (1976). Oil spill identification. Anal. Chem. 6 (48) 454A-472A.
[5] Bentz, A. P. (1978). Who spilled the oil? Anal. Chem. 50, 655A.
[6] Bentz, A. P. (1978). Chemical identification of oil spill sources. The Forum 12 (2) 425.
[7] Brown, C. W., Lynch, P. F. and Amadjian, M. (1976). Identification of oil slicks by infrared
spectroscopy. Nat. Tech. Inform. Service, Rept. ADA040975 (CG 81-74-1099).
[8] Brown, C. W., Lynch, P. F. and Amadjian, M. (1976). Infrared spectra of petroleum--Data base
formation and application to real spills. Proc. IEEE Comput. Soc. Workshop on Pattern
Recognition Applied to Oil Identification, 84-96.
[9] Chien, Y. T. (1978). Interactive Pattern Recognition. Dekker, New York.
[10] Chien, Y. T. and Killeen, T. J. (1976). Pattern recognition techniques applied to oil identifica-
tion. Proc. IEEE Comput. Soc. on Pattern Recognition Applied to Oil Identification, 15-33.
[11] Curtis, M. L. (1977). Use of pattern recognition techniques for typing and identification of oil
spills. Nat. Tech. Inform. Service, Rep. ADA043802 (CG-81-75-1383).
[12] Duda, R. and Hart, P. (1973). Pattern Classification and Scene Analysis. Wiley, New York.
[13] Duewer, D. L., Kowalski, B. R. and Schatzki, T. F. (1975). Source identification of oil spills by
pattern recognition analysis of natural elemental composition. Anal. Chem. 9 (47) 1573-1583.
[14] Eastwood, D., Fortier, S. H. and Hendrick, M. S. (1978). Oil identification--Recent develop-
ments in fluorescence and low temperature luminescence. Amer. Lab. 3, 10, 45.
[15] Flanigan, G. A. (1976). Ratioing methods applied to GC data for oil identification. Proc. IEEE
Comput. Soc. Workshop on Pattern Recognition Applied to Oil Identification, 162-173.
[16] Flanigan, G. A. and Frame, G. M. (1977). Oil spill 'fingerprinting' with gas chromatography.
Res. Development 9, 28.
[17] Fortier, S. H. and Eastwood, D. (1978). Identification of fuel oils by low temperature
luminescence spectrometry. Anal. Chem. 50, 334.
[18] Frame, G. M., Flanigan, G. A. and Carmody, D. C. (1979). The application of gas chromatogra-
phy using nitrogen selective detection to oil spill identification. J. Chromatography 168, 365-376.
[19] Frankenfeld, J. W. (1973). Weathering of oil at sea. USCG. Rept. AD87789.
[20] Frankenfeld, J. W. and Schulz, W. (1974). Identification of weathered oil films found in the
marine environment. USCG. Rept. ADA015883.
[21] Grose, P. L. (1979). A preliminary model to predict the thickness distribution of spilled oil.
Workshop on Physical Behavior of Oil in the Marine Environment at Princeton University.
[22] Grotch, S. (1974). Statistical methods for the prediction of matching results in spectral file
searching. Anal. Chem. 4 (46) 526-534.
[23] Jadamec, J. R. and Kleineberg, G. A. (1978). United States coast guard combats oil pollution.
Internat. Environment and Safety 9.
[24] Jadamec, J. R. and Saner, W. A. (1977). Optical multichannel analyzer for characterization of
fluorescent liquid chromatographic petroleum fractions. Anal. Chem. 49, 1316.
[25] Jurs, P. and Isenhour, T. (1975). Chemical Applications of Pattern Recognition. Wiley, New York.
[26] Kanal, L. N. (1972). Interactive pattern analysis and classification systems: A survey and
commentary. IEEE Proc. 60 (10) 1200-1215.
[27] Kawahara, F. K., Santer, J. F. and Julian, E. C. (1974). Characterization of heavy residual fuel
oils and asphalts by infrared spectrophotometry using statistical discriminant function analysis.
Anal. Chem. 46, 266.
[28] Kawahara, F. K. (1969). Identification and differentiation of heavy residual oil and asphalt
pollutants in surface waters by comparative ratios of infrared absorbances. Environmental Sci.
Technol. 3, 150.
[29] Killeen, T. J. and Chien, Y. T. (1976). A probability model for matching suspects with
spills--Or did the real spiller get away? Proc. IEEE Comput. Soc. Workshop on Pattern
Recognition Applied to Oil Identification, 66-72.
[30] Killeen, T. J., Eastwood, D. and Hendrick, M. S. Oil matching using a simple vector model, in
press.
[31] Kowalski, B. R. and Bender, C. F. (1973). Pattern recognition--A powerful approach to
interpreting chemical data. J. Amer. Chem. Soc. 94 (16) 5632-5639.
[32] Kowalski, B. R. and Bender, C. F. (1973). Pattern recognition II--Linear and nonlinear
methods for displaying chemical data. J. Amer. Chem. Soc. 95 (3) 686-692.
[33] Kowalski, B. R. (1974). Pattern recognition in chemical research. In: Klopfenstein and Wilkins,
eds., Computers in Chemical and Biochemical Research, Vol. 2. Academic Press, New York.
[34] Lynch, P. F. and Brown, C. W. (1973). Identifying source of petroleum by infrared spectros-
copy. Environmental Sci. Technol. 7, 1123.
[35] Mattson, J. S. (1976). Statistical considerations of oil identification by infrared spectroscopy.
Proc. IEEE Comput. Soc. Workshop on Pattern Recognition Applied to Oil Identification,
113-121.
[36] Mattson, J. S. (1971). Fingerprinting of oil by infrared spectroscopy. Anal. Chem. 43, 1872.
[37] Mattson, J. S. (1976). Classification of oils by the application of pattern recognition techniques
to infrared spectra. USCG Rept. ADA039387.
[38] Preuss, D. R. and Jurs, P. C. (1974) Pattern recognition applied to the interpretation of infrared
spectra. Anal. Chem. 46 (4) 520-525.
[39] Saner, W. A. and Fitzgerald, G. E. (1976). Thin layer chromatographic techniques for identifica-
tion of waterborne petroleum oils. Environmental Sci. Technol. 10, 893.
[40] Saner, W. A., Fitzgerald, G. E. and Walsh, J. P. (1976). Liquid chromatographic identification of
oils by separation of the methanol extractable fraction. Anal. Chem. 48, 1747.
[41] Ungar, A. and Trozzolo, A. N. (1958). Identification of reclaimed oils by statistical discrimina-
tion of infrared absorption data. Anal. Chem. 30, 187-191.
[42] United States Coast Guard (1977). Oil spill identification system. USCG. Rept. ADA044750.
[43] Workshop IEEE Comput. Soc. Proc. on Pattern Recognition Applied to Oil Identification (1976).
Catalogue No. 76CH1247-6C.
P. R. Kfishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2 '~ "[
@North-HollandPublishing Company (1982) 673-697 ,.y Jk
B r u c e R. K o w a l s k i * a n d S v a n t e W o l d
1. Introduction
*The research work of this author is supported in part by the Office of Naval Research.
Table 1
Chemical systems (objects, cases, samples) and variables typically used for their characterization (see
also Table 2). Pattern recognition classes with examples to the right

Systems                  Variables                  Pattern rec. classes   Examples
Chemical compounds       Spectral and               Structural type        Amines, alcohols, esters;
  or complexes             thermodynamic                                     quadratic, tetrahedral, ...
Chemical mixtures        Concentrations of          Type                   Corrosive or not
  (minerals, alloys,       constituents and         Source                 Mineral deposit 1, 2 or 3;
  water samples, oil       trace elements                                    oil tanker 1, 2 or 3;
  samples, polymers)                                                         cracking plastic, noncracking
Biological               Amounts of organic         Type                   Taxonomic type (family, etc.);
                           constituents and                                  allergenic, non-allergenic
                           fragmentation products   Source                 Blood of suspect 1, 2 or 3;
                           of biopolymers, also                              fish from lake 1, 2 or 3
                           trace element conc's     Disease                Differential diagnosis
Chemical reactions       Thermodynamic and          Mechanistic type       SN1, SN2;
                           kinetic, amounts of                               solvent assisted or not;
                           major and minor                                   kinetic or thermodyn. control
                           products
Biologically active      Fragment or substituent    Biologically active    Drugs;
  compounds                descriptors, quantum       or non-active          toxic compounds
                           mechanical indices       Type of biol. activ.   Carcinogens
Table 2
Common instrumental methods giving multiple measurements for chemical systems (objects,
cases, samples)

Method                   System                 Variables
(1) Spectral
IR, NMR, UV,             Compounds (sometimes   Wavelengths or frequencies of
  ESCA, X-ray              mixtures), polymers    characteristic absorption (peaks)
                                                  or digitized spectra (absorption
                                                  at regular wavelength intervals)
Mass spectra                                    Ion abundances of fragments
Atomic abs., etc.                               Trace and major element
                                                  concentrations
(2) Separation
Gas chromatogr., GC      Mixtures               Amounts of volatile constituents
Pyrolysis-GC             Polymers               Amounts of volatile fragments
Liquid chromat., LC      Biological samples     Amounts of soluble constituents
Pyrolysis-LC                                    Amounts of soluble fragments
Amino-acid analysis                             Amounts of different amino acids
Electrophoretic                                 Amounts of different macromolecules
Gel-filtration                                  Amounts of different larger molecules
Medical analyses         Blood, urine, etc.     Inorganic, organic, biochemical
                                                  constituents
standing, still far from complete, of how to apply pattern recognition in chem-
istry.
object serve to position each point in the space. The 'data structure' is the overall
relation of each object to every other object and, in this simple two-dimensional
vector space, the data structure is immediately available to the chemists once the
plot is made.
As mentioned in Section 1, chemistry is a multivariate science and in order for
the chemist to solve complex problems, more than two or three measurements per
object are often required. When the dimension of the vector space (henceforth
denoted M) becomes greater than three, the complete data structure is no longer
directly available to the chemist.
During the 1970s, chemists solved this problem by applying the many
methods of pattern recognition to their own M-dimensional vector spaces. The
many methods under the general heading of "Preprocessing" (Andrews, 1972),
sometimes "Feature Selection", can aid the chemist by transforming the data
vector space to a new coordinate system and thereby enhancing the information
contained in the measurements. These methods can also be used to eliminate
measurements that contain no useful information in the context of the application
or weight those measurements containing useful information in proportion to
their usefulness.
Display methods (Kowalski and Bender, 1973) are useful for providing linear
projections or nonlinear mappings of the M-dimensional data structures into
two dimensions with minimum distortion of data structure. These methods allow
an approximate view of data structure and, used in conjunction with unsupervised
learning (Duda and Hart, 1973) or cluster analysis methods that can detect natural
groupings or clusters of data vectors, they are often effective at providing the
chemist with an understanding of the data structure. They can be thought of as
viewing aids to allow a visual, or at least conceptual, examination of M-dimen-
sional space.
The majority of chemical applications of pattern recognition have used super-
vised learning (Fukunaga, 1972) or classification methods. Here, the goal is to
partition the vector space so that samples from well-defined categories based on
chemically meaningful rules fall into the same partition. Examples of classifica-
tion methods applied to chemical problems are given later in this chapter. In these
applications, the property referred to in the general statement given earlier is a
class or category membership. Prototype samples with known class membership
are used to partition the data vector space and samples with unknown class
membership are then classified depending upon their location in the vector space.
The prototype samples from all categories are collectively called the training set.
In the following, the elements of a data vector are interchangeably referred to
as 'measurements' and 'variables'. The former reflects the terminology most
palatable to chemists and the latter the terminology more often preferred by statisti-
cians. Besides these terms, the pattern recognition literature often refers to
single-valued functions of measurements or variables as 'features'. Likewise, a
data vector can also be referred to as a 'sample', 'pattern', 'object' or 'system'
usually depending upon the author's background. Reading the pattern recogni-
Pattern recognition in chemistry 677
Pattern recognition in chemistry had its start in 1969 when analytical chemists
at the University of Washington applied the Learning Machine (Nilsson, 1965) to
mass spectral data in attempts to classify molecules according to molecular
structural categories (Jurs, Kowalski and Isenhour, 1969). The objects, chemical
compounds, were characterized by normalized ion intensity measurements at
nominal mass/charge values as measured by low resolution, electron impact mass
spectrometry. For each application, objects with a particular functional group,
say a carbonyl (>C=O), were put into one class and all other molecules in
another. The learning machine was used to find a hyperplane that separated the
two classes in the vector space with the eventual goal of classifying compounds
with unknown molecular structure based on their location in the so-called mass
spectra space.
In the early 1970's, several improvements were made to preprocessing methods
applied to mass spectral data prior to classification (Jurs and Isenhour, 1975) and
the learning machine was applied to other types of molecular spectroscopy. It
wasn't until 1972 that chemists were introduced to the use of several pattern
recognition methods to solve complex multivariate problems (Kowalski, Schatzki
and Stross, 1972; Kowalski and Bender, 1972).
Currently, a strong and useful philosophy of the application of pattern recogni-
tion and other areas of multivariate analysis in chemistry is developing (Albano,
Dunn, Edlund, Johansson, Norden, Sjöström and Wold, 1978). At the same time,
pattern recognition is taking its place with other areas of mathematical and
statistical analysis as powerful tools in a developing branch of chemistry: chemo-
metrics (Kowalski, 1977; 1980). Chemometrics is concerned with the application
of mathematical and statistical methods (i) to improve the measurement process,
and (ii) to better extract useful chemical information from chemical measure-
ments. In recognition of the importance of chemometrics in chemistry, the journal
that published the first account of pattern recognition in chemistry, Analytical
Chemistry, celebrated its 50th anniversary with a symposium that ended with a
review of the developments of chemometrics over the 1970's with a look into the
future (Kowalski, 1978).
4.1
The representation of the data vectors measured on the objects as points in the
multidimensional space demands certain properties of these data as given in the
678 Bruce R. Kowalski and Svante Wold
following:
(1) The data should be continuous; i.e. a small chemical change between two
systems should correspond to a small distance between the corresponding points
in M-space and hence to a small change in all variables characterizing the
systems. Many chemical variables have this continuity property directly, but
others do not and therefore must be transformed to a better representation.
(2) Many methods of pattern recognition function better when the variables
are fairly symmetrically distributed within each class. Trace element concentra-
tions and sometimes chromatographic and similar data can display very skewed
distributions. It therefore is good practice to investigate the distributions of the
variables by means of, for instance, simple histograms and to transform skewed
data by a simple logarithm or the gamma-transformation of Box and Cox (1964).
(3) The variation in each variable over the whole data set should be of the same
order. Distance based pattern recognition methods, including subspace methods
such as SIMCA, are sensitive to the scaling of the data. Initially, equal weighting
of the data should therefore be the rule. This is obtained by regularization
(autoscaling) to equal variance (see Kowalski and Bender 1972, and the sketch
after this list).
(4) The variables should be used according to good chemical practice; logarith-
mic rate and equilibrium constants, internally normalized chromatograms, etc.
(5) Finally, for certain methods of pattern recognition there are bounds on the
number of variables (see Section 7.1). Also, there are dangers in preprocessing the
data by selecting the variables that differ most between the classes if the number
of variables exceeds the number of objects in the training set divided by three (see
Section 7.2). In this case, the only safe preprocessing methods are those which are
not conditioned on class separation, i.e. normalization, autoscaling, Fourier or
Hadamard transforms, autocorrelation transforms and the selection of variables
by cluster analysis.
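As a concrete reading of points (2) and (3), the sketch below log-transforms skewed columns and then autoscales every variable to unit variance; a plain objects-by-variables NumPy array and all names are our illustrative assumptions.

```python
import numpy as np

def pretreat(X, log_columns=()):
    """Log-transform selected skewed variables, then autoscale all columns."""
    X = np.asarray(X, dtype=float).copy()
    for j in log_columns:                 # point (2): tame skewed distributions
        X[:, j] = np.log(X[:, j])
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    return (X - mean) / std               # point (3): equal variance per variable

# Hypothetical trace-element concentrations; column 0 is strongly skewed.
X = [[1.0, 5.2], [10.0, 4.8], [100.0, 5.5], [1000.0, 5.1]]
Z = pretreat(X, log_columns=[0])
print(Z.std(axis=0, ddof=1))              # -> [1. 1.]
```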
4.2
Some of the types of chemical measurements that have been used in chemical
applications of pattern recognition include the following:
vector space. This is the basis of numerous studies aimed at using pattern
recognition and training sets of mass spectra to develop classification strategies
that can extract molecular structural information directly from mass spectra
without the assistance of an expert (Jurs and Isenhour, 1975).
When representing a mass spectrum as a data vector, the direct use of
intensities at various mass numbers often is impractical. The reason is that similar
compounds differing by, say, a methyl group, have very similar mass spectra but
parts of the spectra are shifted in relation to each other by 15 units (the mass of a
methyl group). Therefore a representation of the mass spectrum which recognizes
similarities in shifted spectra is warranted; Fourier, Hadamard and autocorrela-
tion transforms are presently the best choices (McGill and Kowalski, 1978).
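One way to see why the autocorrelation transform helps here: it depends only on intensity products at fixed mass separations, so a spectrum and the same spectrum displaced by 15 mass units give essentially the same transform. The sketch below is our illustration of the idea, not the procedure of McGill and Kowalski (1978).

```python
import numpy as np

def autocorr_transform(spectrum, max_lag=30):
    """Sum of intensity products at each mass separation (lag) 0..max_lag."""
    s = np.asarray(spectrum, dtype=float)
    return np.array([np.dot(s[:len(s) - k], s[k:]) for k in range(max_lag + 1)])

spec = np.zeros(200)
spec[[43, 57, 71]] = [1.0, 0.6, 0.3]   # hypothetical fragment peaks
shifted = np.roll(spec, 15)            # the same pattern displaced by a methyl unit
print(np.allclose(autocorr_transform(spec), autocorr_transform(shifted)))  # True
```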
peaks. The spectrum cannot be digitized directly, using the absorption at regular
intervals as variables. The reason is that a small shift of one peak from one
spectrum to another results in large jumps in the variables, thus making them lack
continuity properties as discussed in point one above. Instead, a transformation
of the spectrum should be used (see Kowalski and Reilly, 1971).
one are used. The method has the advantage of being little dependent on the
homogeneity and distribution within each class and works well as long as the
classes have approximately equally many representatives in the training set. This
makes the method a good complement to other methods which are more depen-
dent on the homogeneity of the classes.
The KNN method, in its standard form, however, has the same drawback as
the hyper-plane methods discussed above: it gives no information about outliers or
deviating systems and provides no opportunity for pattern analysis. Generaliza-
tions of the method to remove these drawbacks seem possible, however.
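For completeness, a minimal sketch of the standard KNN vote this passage refers to; the data and names are illustrative assumptions.

```python
import numpy as np

def knn_classify(X_train, y_train, x, k=3):
    """Majority vote among the k nearest training objects (Euclidean distance)."""
    d = np.linalg.norm(X_train - x, axis=1)
    nearest = y_train[np.argsort(d)[:k]]
    values, counts = np.unique(nearest, return_counts=True)
    return values[np.argmax(counts)]

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(3, 1, (10, 2))])
y = np.repeat([1, 2], 10)
print(knn_classify(X, y, x=np.array([2.5, 2.5])))   # -> 2 (typically)
```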
The nature of chemical problems (i.e. the information the chemist wants to
extract from his data) makes the so-called modelling methods presently the most
attractive for general use in chemistry. Here the Bayes method is the most widely
available and most well known (ARTHUR; Harper, Duewer, Kowalski and
Fasching, 1977; Fukunaga, 1972). In this method, the frequency distribution is
calculated for each variable within each class. An object from the test set is then
compared, variable by variable, with each class distribution to give a probability
that the variable value actually was drawn from that distribution. The probabili-
ties over all variables are then multiplied together to give a set of probabilities for
the object to belong to the available classes.
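A bare-bones sketch of this per-variable scheme, with the class frequency distributions summarized as smoothed histograms; actual programs (e.g. ARTHUR) differ in detail, and all names here are ours.

```python
import numpy as np

def class_probabilities(train_by_class, x, bins=10):
    """Multiply per-variable histogram probabilities to score each class for x."""
    scores = {}
    for label, X in train_by_class.items():        # X: objects x variables, one class
        p = 1.0
        for j, xj in enumerate(x):
            counts, edges = np.histogram(X[:, j], bins=bins)
            k = int(np.clip(np.searchsorted(edges, xj) - 1, 0, bins - 1))
            p *= (counts[k] + 1) / (counts.sum() + bins)   # smoothed bin frequency
        scores[label] = p
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}

rng = np.random.default_rng(0)
train = {1: rng.normal(0, 1, (30, 3)), 2: rng.normal(2, 1, (30, 3))}
print(class_probabilities(train, x=[1.8, 2.1, 2.4]))    # class 2 should dominate
```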
The Bayes method is based on the assumption that the variables are uncorre-
lated both over all classes and within each class. This is seldom fulfilled and
therefore the variables should be transformed to an orthogonal representation.
Otherwise the calculated probabilities can be grossly misleading. When using
orthogonalized variables in terms of principal components (Karhunen-Loève
expansion) disjointly for each class, the Bayes method becomes very similar to the
SIMCA method discussed below.
One disadvantage with the Bayes method seems to be that in order to get fair
estimates of the frequency distributions within each class, one needs rather many
objects in each class training set, say of the order of 30. With smaller training sets,
assumptions must be made concerning the shape of the distributions, which
complicates the application and makes it riskier.
A second modelling method developed for chemical use rather recently is the
SIMCA method (acronym for Soft Independent Modelling of Chemical Analogy).
This method is essentially a subspace method based on a truncated principal
components expansion of each separate class (Wold, 1976; Wold and Sjöström,
1977; Massart, Dijkstra and Kaufman, 1978). However, the modelling aspect of
the principal components analysis is emphasized more than is usual in the
subspace methods. This allows the dimensionality of the class PC models to be
estimated directly from the data (Wold, 1978).
Compared with the Bayes Classification method, the SIMCA method has the
advantage that it can function with correlated variables; in fact the principal
components models utilize differences in correlation structure between the classes
for the classification. Individual residuals for each variable and object are
calculated, which facilitates the interpretation of outliers (Sjöström and Kowalski,
1979).
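A bare-bones sketch of the subspace idea behind SIMCA: fit a truncated principal components model per class and assign a new object to the class with the smallest residual. This omits Wold's cross-validatory choice of the number of components and the residual significance tests; all names and data are ours.

```python
import numpy as np

def fit_class_model(X, n_comp):
    """Disjoint PC model of one class: column means plus leading loadings."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_comp]                 # loadings span the class subspace

def residual(model, x):
    """Distance from x to the class model (orthogonal to the PC subspace)."""
    mean, P = model
    d = x - mean
    return np.linalg.norm(d - P.T @ (P @ d))

rng = np.random.default_rng(1)
classes = {1: rng.normal(0, 1, (20, 5)), 2: rng.normal(3, 1, (20, 5))}
models = {c: fit_class_model(X, n_comp=2) for c, X in classes.items()}
x = rng.normal(3, 1, 5)                      # test object drawn near class 2
print(min(models, key=lambda c: residual(models[c], x)))   # -> 2 (typically)
```

Because each class is modelled disjointly, per-variable residuals are available for any object, which is what makes the outlier diagnosis described above possible.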
The great advantage with modelling methods in chemical problems is that they
directly give a 'data profile' for each class, which allows the construction of
systems (objects) typical for each class. In particular, in structure/activity appli-
cations (see the example in Section 6.2) this is a primary goal of the data analysis.
The classification in terms of probabilities for each object belonging to each class
provides opportunities to find outliers in both the training and test sets. More-
over, this gives valuable information as to whether an object is sufficiently close
to a class to be considered a typical class member; information of particular
importance in the classification of structural types (see the example in Section 6.1)
and the assignment of the source of a sample (Section 6.3).
Another advantage with the modelling methods is that they operate well even if
some data elements are missing either in the training set or test set or both. The
distribution of each variable (Bayes method) or the parameters in a principal
components model for each class are well estimated even for incomplete training
set data matrices, albeit less precisely. Data vectors in the test set are classified on
the basis of the existing data, again giving consistent, but naturally less precise,
results compared with the complete data situation.
Finally, the possibility of pattern analysis given by the modelling methods is
most valuable in chemical applications where the interpretation of the differences
between the classes and the structure inside the classes often is desired. The
modelling methods give information about the position of each object inside each
class which can be related to theoretical properties of the objects (Section 6.1) or
variables for which a prediction is desired (Section 6.2). The grouping of variables
Table 3
Summary of the properties of the most common methods in chemical pattern recognition

                                     LLM LDA KNN Bayes SIMCA
Classification in terms
  of closest class                   + + + + +
Class assignment in terms
  of probabilities                   + +
Not dependent on the
  homogeneity of classes             +
Works in asymmetric
  binary classification              + + +
Not dependent on
  M ≤ N                              + + +
Detects outliers in
  training and test sets             + +
Gives typical data
  profile of each class              + + +
Possibilities for
  pattern analysis                   + +
Tolerates missing data               + + +
The two typical applications of pattern recognition in chemistry are (i) the
determination of the structure of a compound on the basis of spectra measured
on the compound and a number of reference compounds and (ii) the determina-
tion of the type or source of a chemical mixture--an alloy, a mineral or a
hydrocarbon mixture (say, an oil spill) or a macro-molecular mixture (say, a
micro-organism)--on the basis of the content of chemical components in the
mixture and in a number of reference mixtures. We discuss first an example of the
former type, then an 'inverted' problem where relations are sought between
the chemical structure and biological activity of a series of compounds. Second,
we discuss two examples of the second type where the type of a sample is inferred
from its content of trace elements or biochemical macro-molecules. Several other
applications are covered in recent reviews (Kowalski, 1980; Varmuza, 1980).
(1960) included molecules which were locked in trans conformation via a carbon
bridge from R1 to R3 or locked in cis conformation by a carbon bridge from R1 to
R2.
Seven variables (the frequencies of CO and CC absorption in IR and the wave
length of maximum absorption in UV and the intensities of these absorptions)
were extracted from the spectra to characterize each compound. These data were
analyzed in terms of two classes:
Class 1. Compounds with 'known' trans conformation (four with 'locked'
conformation and two with R1 = R3 = hydrogen).
Class 2. Compounds with large substituents R1 and R3.
Sjöström and Kowalski (1979) have compared the five methods in Table 3 on
these data. All methods give consistent results on the crucial question, namely to
which class the three cis compounds are classified. They all fall closer to Class 2,
implying that the compounds with large groups R1 and R3 are twisted fully over
to the cis conformation.
However, only Bayes and SIMCA provide the valuable information that the
three cis compounds indeed are so close to Class 2 that they can be considered to
be typical members of that class. Since it was not possible to include compounds
with intermediate twisted conformation in the study, the other methods cannot
tell whether the compounds with large groups R1 and R3 are twisted compounds
more similar to cis than trans or if they actually are rotated fully over into the cis
conformation.
existence; data vectors can be generated from the theoretical structure and tables
of substituent scales. Hence predictions of the biological activity can be obtained
on a theoretical basis and only potentially interesting compounds need to be
synthesized and further tested.
As an example, we discuss a series of compounds with the general structure
X(Y)-C6H4-CH(R)-CH(R1)-NH-R2. The substituents range from H through
CH(CH3)-CH2-C6H4-4-OH in the five sites of substitution X, Y, R, R1 and
R2. Describing each site by means of the variables discussed above and one
measured property (the binding constant to a test beta-receptor), a training set of
32 compounds was used with 13 variables per compound. For test compounds
only the values of the 12 'theoretical' variables are available and therefore a data
analytic method tolerating missing data must be used. The training set consisted
of two classes, compounds active as blockers (class 1, n = 17) and beta-receptor
stimulators, agonists (class 2, n = 15).
On each of the compounds in the training set, the level of the biological activity
is known. Hence, it was of interest also to seek relations between this level and the
'independent' structural descriptors in addition to an ordinary classification
based on the latter variables (Dunn, Wold and Martin, 1978).
A SIMCA analysis of the two training set class matrices revealed that 4 of the
13 variables were irrelevant. The remaining 9 were well described by two separate
three dimensional principal components models. These models classify 15 of the
blockers and all 15 agonists correctly in the validation analysis. The SIMCA
analysis gives coordinates of each object in terms of its principal components
values. In a second step of the analysis, it was found that these coordinates were
related to the measured level of activity. In other words, the position of the
compounds inside their class was correlated to their measured activity value.
This 'pattern analysis' of each class can, in summary, be used to predict the
type and activity of chemical test compounds on the basis of their structure only.
The same methodology has thereafter been applied to other structure-activity
problems (Dunn, Wold and Martin, 1978; Norden, Edlund and Wold, 1978;
Dunn and Wold, 1980).
Twenty crude oil samples and twenty residual oil samples were first analyzed and
then portions of each oil were artificially weathered for different lengths of time
using fresh and salt water. The unweathered oil and nine weathered samples for
each of the 40 starting oils comprised 40 categories of oils with ten samples/
category in a 22-dimensional vector space.
The first goal in the study was to use the elemental concentrations to determine
whether or not the 40 oils could be separated even though the spread in each
category due to weathering effects was quite large in some cases. Using pre-
processing methods prior to classification analysis allowed classification accu-
racies as high as 99.3%, thereby establishing the feasibility of applying pattern
recognition to elemental concentration data to overcome weathering effects in oil
spill identification.
A second goal of the study was to identify, using weighting methods, the most
useful elements for discrimination. This is important as using fewer elements can
cut the cost of analysis and, in some cases, speed up the entire procedure.
Vanadium was by far the most discriminating element, a result that came as no
surprise to petroleum chemists. Vanadium is tightly complexed by very stable
high molecular weight compounds in oil that resist weathering very effectively.
Nickel was the second most discriminating element for the same reason. Sulfur
was third; it is known to be covalently bonded in large molecules that again are
quite stable to weathering.
Several other applications of pattern recognition to the material source identifi-
cation problem can be found in the chemical literature (Kowalski, Schatzki and
Stross, 1972; Howarth, 1974; Parsons, 1978; McGill and Kowalski, 1977). Also,
the use of pattern recognition to find useful patterns in industrial production data
in order to solve product quality and integrity problems is a most useful and
cost-effective type of application (Duewer et al., 1978).
determined at the end of the study. For complex natural products containing
hundreds or even thousands of components, the blind assay method can solve
complex problems and save enormous amounts of time.
7.1. Pitfalls
The pitfall most commonly encountered in chemical pattern recognition is that
of overestimating class separation. This is done mainly in two ways: (i) by
selecting variables that most differ between the classes and (ii) by using data
analysis methods which exaggerate class differences.
The two points are closely related to each other. Because the data set is finite,
there is a finite chance of finding a substantial difference in mean values in a
variable between two classes, even if the variable was drawn from a random
population. If we now have a large number of variables, the chance of finding,
just by accident, variables that differ between the classes becomes embarrassingly
large, especially for cases with few samples per class. Similarly, data analysis
methods which combine variables in order to enhance class differences are subject
to the risk that the resulting difference is just an artifact.
Empirically it has been found that for two-class problems the number of
variables must not exceed approximately a third of the number of independent
objects in the training set if variable selection methods or pattern recognition
methods conditioned on class separation are to be used (i.e. variance weights,
Fisher weights, etc. and hyper-plane methods, respectively, see further next
subsection).
A second pitfall related to the first is that one has fewer independent objects
(systems, cases, samples) than one realizes. In situations such as chromatography,
when each sample is analyzed in replicate and the raw data entered into the
analysis (this is strongly recommended to avoid a loss of information), the
number of independent objects is the same as the number of original samples, i.e.
half or a third of the number of objects in the analysis. If now the rule is
employed mechanically that the number of variables must not exceed N/3, one falls
into the 'too many variables trap'.
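The trap is easy to reproduce numerically. In the sketch below (our illustration, not taken from the chemical literature), both 'classes' are pure noise, yet selecting the few most class-separating variables out of a hundred manufactures an apparently convincing separation.

```python
import numpy as np

rng = np.random.default_rng(2)
n_per_class, n_vars = 5, 100                  # few objects, very many variables
a = rng.normal(size=(n_per_class, n_vars))    # 'class 1': pure noise
b = rng.normal(size=(n_per_class, n_vars))    # 'class 2': pure noise

# Pick the 3 variables with the largest mean difference between the classes.
diff = np.abs(a.mean(axis=0) - b.mean(axis=0))
top = np.argsort(diff)[-3:]
print(diff[top])   # large 'separations' found in data with no class structure
```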
7.3. Validation
Most pattern recognition methods classify the training set with an over-
optimistic 'success' rate. Hence a validation of the classification should be made
in another way. The best method seems to be to divide the training set into a
number of groups and then delete one of these groups at a time, making it a 'test'
set with known assignment. The pattern recognition method is then allowed to
'learn' on the reduced training set independently for each such deletion. When
summed up over all deletions, a lower bound of the classification rate is obtained
(see for example Kanal (1974)).
Several authors recommend that only one object at a time should be deleted
from each class in this validation. This might be true if one is certain that the
objects really are independent (see Section 7.1). In practice, however, one often
has at least weak groupings in the classes. It then seems safer to delete larger
groups at a time to get a fair picture of the real classification performance.
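A sketch of the deleted-group validation described here, with a toy nearest-mean classifier standing in for any pattern recognition method; the group splitting and all names are illustrative assumptions.

```python
import numpy as np

def fit_nearest_mean(X, y):
    """Toy classifier: one mean vector per class."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict_nearest_mean(model, X):
    labels = list(model)
    dists = np.stack([np.linalg.norm(X - model[c], axis=1) for c in labels])
    return np.array(labels)[np.argmin(dists, axis=0)]

def deleted_group_rate(X, y, n_groups=4, seed=3):
    """Re-learn with each group deleted in turn; sum successes over all deletions."""
    order = np.random.default_rng(seed).permutation(len(y))
    correct = 0
    for test in np.array_split(order, n_groups):
        train = np.setdiff1d(np.arange(len(y)), test)
        model = fit_nearest_mean(X[train], y[train])
        correct += int(np.sum(predict_nearest_mean(model, X[test]) == y[test]))
    return correct / len(y)                  # a lower bound on the true rate

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (12, 4)), rng.normal(2, 1, (12, 4))])
y = np.repeat([1, 2], 12)
print(deleted_group_rate(X, y))
```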
are studied with prediction of chemical properties (e.g. class membership) the
goal, overestimation should be avoided. The danger of overestimation in chemical
pattern recognition is analogous to data extrapolation by fitting polynomial
functions to experimental data. A better fit can always be obtained by adding
more terms to the model function but the model may not be meaningful in the
context of the experiment. Extrapolation or interpolation results in this case may
be very inaccurate. This danger is efficiently avoided in modelling methods by the
use of a cross-validatory estimation of the model complexity (Wold, 1976; 1978).
When a new application is encountered, the application of a single pattern
recognition method, even the preferred modelling methods described in this
chapter, is not recommended. All pattern recognition methods perform data
reduction using some selected mathematical criteria. The application of two or
more methods with the same goal but with different mathematical criteria
provides a much greater understanding of the vector space. In classification for
instance, when the results of several methods are in agreement, the classification
task is quite an easy one with well separated classes indicating that excellent
chemical measurements were selected for analysis. When the classification meth-
ods are in substantial disagreement then, armed with an understanding of the
criteria of each method, the chemist can use the results to envision the data
structure in the vector space. The use of multiple methods with different criteria is
recommended as is the use of multiple analytical methods for the analysis of
complex samples.
The selection of a method for application to specific problems is rather
complex and can only be learned by experience since almost every vector space
associated with an application has a different structure and every application has
different goals. When a training set is not available, a simple preprocessing
method such as scaling followed by cluster analysis and display in two-space may
be all that is required to detect a useful structure of the vector space. At the other
extreme, when classification results from several methods do not agree, or when
modelling is poor, the application may require a transformation of the vector
space or even the addition of new measurements to the study. This latter case can
become quite complex and may require several iterations.
Multivariate analysis is quite often an iterative exercise. As interesting data
structures are detected and found to be chemically meaningful, the analysis may
be repeated with a new training set. Few problems are solved in any field by the
application of a single tool and the application of pattern recognition to chemical
data is no exception.
7.6. Standardization
In the present state of early development, the field of chemical pattern
recognition cannot be subject to standardization either in terms of methodology
or data structure (experimental design). However, we feel that the following
elements of data analysis should be included in all applications: (1) a graphical
examination of the data using display methods, (2) test of the homogeneity of
each class using, for instance, a graphical projection of each separate class and
finally (3) a validation of the classification results.
References
Albano, C., Dunn, W. III, Edlund, U., Johansson, E., Norden, B., Sjöström, M. and Wold, S. (1978).
Four levels of pattern recognition. Anal. Chim. Acta. 103, 429-443.
Andrews, H. C. (1972). Introduction to Mathematical Techniques of Pattern Recognition. Wiley, New
York.
ARTHUR: A computer program for pattern recognition and exploratory data analysis.
INFOMETRIX, Seattle, WA, U.S.A.
Blomquist, G., Johansson, E., Söderström, B. and Wold, S. (1979). Reproducibility of pyrolysis-gas
chromatographic analyses of the mould Penicillium brevi-compactum. J. Chromatogr. 173, 7-19.
Box, G. E. P. and Cox, D. R. (1964). An analysis of transformations. J. Roy. Statist. Soc. Ser. B 26,
211-214.
Boyd, J. C., Lewis, J. W., Marr, J. J., Harper, A. M. and Kowalski, B. R. (1978). Effect of atypical
antibiotic resistance on microorganism identification by pattern recognition. J. Clinical Microbiol-
ogy 8, 689-694.
Burgard, D. R. and Perone, S. P. (1978). Computerized pattern recognition for classification of organic
compounds from voltammetric data. Anal. Chem. 50, 1366-1371.
Cammarata, A. and Menon, G. K. (1976). Pattern recognition. Classification of therapeutic agents
according to pharmacophores. J. Med. Chem. 19, 739-747.
Chapman, N. B. and Shorter, J., eds. (1974). Advances in Linear Free Energy Relationships. Plenum,
London.
Chapman, N. B. and Shorter, J., eds. (1978). Correlation Analysis in Chemistry. Plenum, London.
Duda, R. O. and Hart, P. E. (1973). Pattern Classification and Scene Analysis. Wiley, New York.
Duewer, D. L., Kowalski, B. R. and Schatzki, T. F. (1975). Source identification of oil spills by pattern
recognition analysis of natural elemental composition. Anal. Chem. 47, 1573-1583.
Duewer, D. L., Kowalski, B. R., Clayson, K. J. and Roby, R. J. (1978). Elucidating the structure of
some clinical data. Biomed. Res. 11, 567-580.
Dunn, W. J. III and Wold, S. (1978). A structure-carcinogenicity study of 4-nitroquinoline 1-oxides
using the SIMCA method of pattern recognition. J. Med. Chem. 21, 1001-1011.
Dunn, W. J. III, Wold, S. and Martin, Y. C. (1978). Structure-activity study of beta-adrenergic agents
using the SIMCA method of pattern recognition. J. Med. Chem. 21, 922-932.
Dunn, W. J. and Wold, S. (1980). Relationships between chemical structure and biological activity
modelled by SIMCA pattern recognition. Bioorg. Chem. 9, 505-521.
Exner, O. (1970). Determination of the isokinetic temperature. Nature 277, 366-378.
Exner, O. (1973). The enthalpy-entropy relationship. Progr. Phys. Org. Chem. 10, 411-422.
Fukunaga, K. (1972). Introduction to Statistical Pattern Recognition. Academic Press, New York.
Gerlach, R. W., Kowalski, B. R. and Wold, H. (1979). Partial least squares path modelling with latent
variables. Anal. Chim. Acta 112, 417-421.
Hammett, L. P. (1970). Physical Organic Chemistry, 2nd ed. McGraw-Hill, New York.
Hansch, C., Leo, A., Unger, S. H., Kim, K. H., Nikaitani, D. and Lien, E. J. (1973). 'Aromatic'
substituent constants for structure-activity correlations. J. Med. Chem. 16, 1207-1218.
Harper, A. M., Duewer, D. L., Kowalski, B. R. and Fasching, J. L. (1977). ARTHUR and
experimental data analysis: The heuristic use of a polyalgorithm. In: B. R. Kowalski, ed.,
Chemometrics: Theory and Application. Amer. Chem. Soc. Symp. Ser. 52, 14-52.
Harper, A. M. and Kowalski, B. R., unpublished.
Norden, B., Edlund, U. and Wold, S. (1978). Carcinogenicity of polycyclic aromatic hydrocarbons
studied by SIMCA pattern recognition. Acta Chem. Scand. Ser. B 21, 602-612.
Parsons, M. (1978). Pattern recognition in chemistry. Research and Development 29, 72-85.
Pijpers, F. W., Van Gaal, H. L. M. and Van Der Linden, J. G. M. (1979). Qualitative classification of
dithiocarbamate compounds from 13C-NMR and IR spectroscopic data by pattern recognition
techniques. Anal. Chim. Acta 112, 199-210.
Reiner, E., Hicks, J. J., Ball, M. M., and Martin, W. J. (1972). Rapid characterization of salmonella
organisms by means of pyrolysis-gas-liquid chromatography. Anal. Chem. 44, 1058-1063.
Rozett, R. W. and Petersen, E. M. (1975). Methods of factor analysis of mass spectra. Anal. Chem. 47,
1301-1310.
Sammon, J. W., Jr. (1968). On-line pattern analysis and recognition system (OLPARS). Rome Air
Develop. Center, Tech. Rept. TR-68-263.
Saxberg, B. E. H., Duewer, D. L., Booker, J. L. and Kowalski, B. R. (1978). Pattern recognition and
blind assay techniques applied to forensic separation of whiskies. Anal. Chim. Acta 103, 201-210.
Schachterle, S. D. and Perone, S. P. (1981). Classification of voltammetric data by computerized
pattern recognition. Anal. Chem. 53, 1672-1678.
Sjöström, M. and Edlund, U. (1977). Analysis of 13C NMR data by means of pattern recognition
methodology. J. Magn. Res. 25, 285-298.
Sjöström, M. and Kowalski, B. R. (1979). A comparison of five pattern recognition methods based on
the classification results from six real data bases. Anal. Chim. Acta 112, 11-30.
Taft, R. W., Jr. (1952). Polar and steric substituent constants for aliphatic and o-benzoate groups from
rates of esterification and hydrolysis of esters. J. Amer. Chem. Soc. 74, 3120-3125.
Van Gaal, H. L. M., Diesveld, J. W., Pijpers, F. W. and Van Der Linden, J. G. M. (1979). 13C NMR
spectra of dithiocarbamates. Chemical shifts, carbon-nitrogen stretching vibration frequencies, and
π bonding in the NCS2 fragment. Inorganic Chemistry 11, 3251-3260.
Varmuza, K. (1980). Pattern Recognition in Chemistry. Springer, New York.
Verloop, A., Hoogenstraaten, W. and Tipker, J. (1971). In: E. J. Ariens, ed., Drug Design, Vol. V.
Academic Press, New York.
Wold, H. (1977). Mathematical Economics and Game Theory. Essays in Honor of Oscar Morgenstern.
Springer, Berlin.
Wold, S. and Andersson, K. J. (1973). Major components influencing retention indices in gas
chromatography. J. Chromatogr. 80, 43-50.
Wold, S. (1976) Pattern recognition by means of disjoint principal components models. Pattern
Recognition 8, 127.
Wold, S. and SfiSstr6m, M. (1977). SIMCA: A method for analyzing chemical data in terms of
similarity and analogy. In: B. R. Kowalski, ed., Chemometrics: Theory and Application. Amer. Chem.
Soc. Symp. Ser. 52, 243--282.
Wold, S. (1978). Cross validatory estimation of the number of components in factor and principal
components models. Technometrics 20, 397-406.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 699-719

Covariance Matrix Representation and Object-Predicate Symmetry

T. Kaminuma, S. Tomita and S. Watanabe
1. Historical background
In the field of pattern recognition, a series of studies was carried out in the
early 1960s utilizing the Karhunen-Loève expansion as a tool for feature extraction.
The results were reported in a paper in 1965 [13], which pointed out an
identical mathematical structure (diagonalization of the covariance matrix) in the
Karhunen-Loève method and Factor Analysis. That paper also introduced and
proved the entropy minimizing property of the covariance matrix method. Since
then the method has proved to be one of the most powerful and versatile tools for
feature extraction, and is today widely used in many practical problems in pattern
analysis [1].
The idea of object-predicate reciprocity (or symmetry) dates back to 1958 [12],
but its computational advantages were emphasized only in a 1968 paper [15]. A
mathematical theorem that underlies this symmetry was demonstrated in a paper
in 1970 [10] and interesting applications were adduced [10, 11]. In the meantime, a
paper in 1970 [4] surveyed the mathematical core of the covariance method in a
very general perspective and pointed out parallel developments in physics and
chemistry. It traced back the mathematical theory to Schmidt, 1907 [8], although
the later authors may not have known that their works were only independent
rediscoveries. K. Pearson (1901) also had the idea of diagonalization of the
covariance matrix for a purpose related to Factor Analysis [7].
It is interesting to note that from the very beginning, Schmidt introduced two
products of an unsymmetric kernel of an integral equation and its conjugate
kernel. He pointed out that the eigenvalues of the two product kernels are
identical up to their degeneracy, and the two eigenfunction systems are related to
each other in a symmetrical manner. The symmetrical character of the two
optimal coordinate systems associated with the two products of unsymmetric
kernels was rediscovered by quantum chemists [3] in the 1960s and was applied to
obtain good approximate wave-functions in many particle systems.
Time and again, in different contexts, the error minimizing property of eigenvectors
of the covariance matrix was rediscovered and put to use.
2. Covariance representation
The data set is a set of vectors $\{x_i^{(\alpha)}\}$, $\alpha = 1, 2, \ldots, N$; $i = 1, 2, \ldots, n$, with a weight
function $\{W^{(\alpha)}\}$, where $x_i^{(\alpha)}$ is the result of the $i$th predicate measurement made on
the $\alpha$th object. The G-matrix is defined by

$$G_{ij} = \sum_{\alpha=1}^{N} W^{(\alpha)} x_i^{(\alpha)} x_j^{(\alpha)}.$$
It is assumed that the Euclidean distance in the $n$-dimensional real space defined
by the $x_i$'s ($i = 1, 2, \ldots, n$) has the meaning of a measure of difference or
dissimilarity between objects suitable for clustering or pattern recognition. The
restriction implied by this Euclidean assumption is to some extent alleviated by
dealing with some non-linear functions of the $x_i$'s as if they were independent
measurements. Instead of explicitly writing $W^{(\alpha)}$, we can multiply each $x_i^{(\alpha)}$ by
$\sqrt{W^{(\alpha)}}$ and simplify the G-matrix to

$$G_{ij} = \sum_{\alpha=1}^{N} x_i^{(\alpha)} x_j^{(\alpha)}.$$
If each variable is measured from its weighted mean, the G-matrix becomes the
covariance matrix. Under the same assumption, if we further replace each $x_i^{(\alpha)}$ by
$x_i^{(\alpha)}/\omega_i$, where

$$\omega_i^2 = \sum_{\alpha=1}^{N} (x_i^{(\alpha)})^2, \tag{2.6}$$

the G-matrix becomes proportional to the correlation matrix.
Without altering the Euclidean distance, we can change the coordinate system
by an orthogonal transformation

$$x_i^{(\alpha)\prime} = \sum_{j=1}^{n} A_{ij} x_j^{(\alpha)}, \qquad A A^{T} = I.$$

The three basic properties (2.7), (2.8), (2.9) of the G-matrix remain unchanged by this
transformation. So does the value of the trace $\sum_i G_{ii}$.
The invariant properties (2.8) and (2.9) allow us to define the entropy of a
coordinate system from the diagonal elements of the G-matrix; in the primed system,

$$S' = -\sum_{k=1}^{n} G'_{kk} \ln G'_{kk}. \tag{2.12}$$

The eigenvalues $\lambda^{(k)}$, $k = 1, 2, \ldots, n$, are the $n$ roots of the algebraic equation
$\det(G - \lambda I) = 0$ and are labeled in decreasing order,

$$\lambda^{(1)} \geq \lambda^{(2)} \geq \cdots \geq \lambda^{(n)}. \tag{2.16}$$

The corresponding eigenvectors $\varphi^{(k)}$ satisfy the orthonormality and completeness
relations

$$\sum_{i=1}^{n} \varphi_i^{(k)} \varphi_i^{(l)} = \delta_{kl} \tag{2.17}$$

and

$$\sum_{k=1}^{n} \varphi_i^{(k)} \varphi_j^{(k)} = \delta_{ij}. \tag{2.18}$$

If we take the primed coordinate system as the one determined by the eigenvectors,
then we obviously have

$$G'_{kk} = \lambda^{(k)}. \tag{2.19}$$
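As a concrete illustration of these definitions, the following minimal Python sketch (ours, not part of the original chapter; all names are our own) builds the simplified G-matrix from a hypothetical weighted data set, normalizes it so that the diagonal 'importances' sum to one, and evaluates the entropy of the coordinate system.

```python
import numpy as np

def g_matrix(X, w):
    """Simplified G-matrix: fold the weights into the data by
    multiplying each object vector by the square root of its weight."""
    Xw = X * np.sqrt(w)[:, None]      # X is N x n (objects x predicates)
    G = Xw.T @ Xw                     # G_ij = sum_a W^(a) x_i^(a) x_j^(a)
    return G / np.trace(G)            # normalize so the importances sum to 1

def entropy(G):
    """Entropy of the coordinate system from the diagonal importances G_ii."""
    p = np.diag(G)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))         # N = 100 objects, n = 5 predicates
w = np.full(100, 1 / 100)             # uniform weight function
print(entropy(g_matrix(X, w)))
```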
3. Minimum entropy principle
A good heuristic in pattern recognition is that we should formulate and
adjust our conceptual system to minimize the entropy definable by the data set
[17]. In our case this means that we should choose the coordinate system
that minimizes $S'$ defined in (2.12). In such a coordinate system the degrees of
importance will be the most unevenly distributed over the $n$ coordinates. This also
opens the possibility of reducing the dimensionality of the representation space,
since we shall be able to ignore those coordinates whose 'importances' are very
small.
(i) Optimal coordinate system. To obtain the 'optimum' coordinate system that
minimizes the entropy, we can take advantage of the following theorem.
THEOREM 3.1. The eigenvectors of the G-matrix provide the optimal coordinate
system.
In other words, the optimal coordinate system is the principal axes coordinate
system. The minimum entropy is, according to this theorem and (2.19),

$$S'_{\min} = -\sum_{k=1}^{n} \lambda^{(k)} \ln \lambda^{(k)}.$$
THEOREM 3.2. The variables representing the optimal coordinates are uncorrelated.
Consider the error incurred by retaining only the first $m$ primed coordinates,

$$E_{n,m} = \sum_{\alpha=1}^{N} \sum_{i=1}^{n} (x_i^{(\alpha)\prime})^2 - \sum_{\alpha=1}^{N} \sum_{i=1}^{m} (x_i^{(\alpha)\prime})^2. \tag{3.2}$$

THEOREM 3.3. No matter what value $m$ has ($m < n$), $E_{n,m}$ becomes minimal when
we take as the primed coordinate system the optimal coordinates with the convention
(2.16).
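A short sketch (ours) of Theorems 3.1-3.3 on simulated data: the eigenvectors of the G-matrix give the optimal coordinate system, with eigenvalues labeled as in (2.16), and truncation to the first $m$ optimal coordinates gives the error $E_{n,m}$.

```python
import numpy as np

def optimal_coordinates(X):
    """Eigen-system of the G-matrix: the optimal (principal axes) system."""
    G = X.T @ X
    G = G / np.trace(G)
    lam, V = np.linalg.eigh(G)
    order = np.argsort(lam)[::-1]     # convention (2.16): decreasing eigenvalues
    return lam[order], V[:, order]

def truncation_error(X, V, m):
    """E_{n,m}: squared norm lost by keeping the first m primed coordinates."""
    Xp = X @ V                        # primed coordinates
    return np.sum(Xp[:, m:] ** 2)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4)) @ np.diag([3.0, 1.0, 0.5, 0.1])
lam, V = optimal_coordinates(X)
print(-np.sum(lam * np.log(lam)))     # the minimum entropy S'_min
print(truncation_error(X, V, m=2))    # small: most variation is retained
```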
When the patterns are given as continuous functions $f^{(\alpha)}(u)$, the discrete
coordinates may be obtained with the help of a system of functions $g_i(u)$ as

$$x_i^{(\alpha)} = \int_a^b f^{(\alpha)}(u)\, g_i(u)\, du, \tag{3.4}$$

so that the foregoing analysis applies to continuous patterns as well.

In factor analysis, each variable is written as a linear combination

$$x_i = \sum_{j=1}^{m} a_{ij} y_j + b_i z_i,$$

where the first $m$ variables are the common independent variables and the last
term $b_i z_i$ represents a small variation of values specifically characteristic of $x_i$.
It is usually assumed that each of the variables $x_i$, $y_j$, $z_i$ has mean value zero
and standard deviation unity. Further, it is assumed that the pairs $(y_j, y_k)$,
$(z_i, z_j)$ and $(z_i, y_j)$, with $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, m$, are mutually
uncorrelated. The $A_i$'s and $B_i$'s, defined by $A_i = \sum_{j=1}^{m} a_{ij}^2$ and $B_i = b_i^2$,
measure the generality and the specificity of the influences on $x_i$.
We provisionally assume that the $z$'s do not exist and obtain the $y_j$ as
eigenvectors of the G-matrix, which is $(1/n)$ times the correlation matrix of the $x_i$'s. We
can adopt those $y_j$'s whose generality of influence is larger than an appropriate
threshold. We can then revive the $z_i$'s to satisfy (3.8).
4. SELFIC
For each object we define the degree to which it is captured by the first $m$
optimal coordinates,

$$\mu^{(\alpha)} = \sum_{i=1}^{m} (x_i^{(\alpha)\prime})^2 \Big/ \sum_{i=1}^{n} (x_i^{(\alpha)\prime})^2,$$

where the primed coordinates belong to the optimal coordinate system. We can
call those objects which have $\mu^{(\alpha)}$ larger than a certain threshold $\theta$ 'representative
objects':

$$\mu^{(\alpha)} \geq \theta. \tag{4.3}$$
The above explanation concerns the case where the class-affiliation of vectors is
not known; that is, the method is used as preprocessing for clustering. The
method can, however, also be used in the case of paradigm-oriented pattern
recognition: we can obtain a separate retrenched subspace for each class.
This leads to the method of CLAFIC (class-featuring information compression).
The results described in Sections 2, 3, and 4 were first reported in [13]. The
CLAFIC method was described in [14, 16].
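A minimal sketch of the CLAFIC idea as described here (our own illustration, not the original implementation): fit a separate low-dimensional subspace to each class and assign a new vector to the class whose subspace captures the largest share of its squared norm.

```python
import numpy as np

def class_subspace(X, m):
    """Leading m right singular vectors span the class subspace."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:m].T                             # columns form the n x m basis

def clafic_classify(x, subspaces):
    """Assign x to the class whose subspace captures most of ||x||^2."""
    scores = [np.sum((B.T @ x) ** 2) for B in subspaces]
    return int(np.argmax(scores))

rng = np.random.default_rng(2)
classes = [rng.normal(size=(40, 6)) + mu for mu in (0.0, 3.0)]  # two classes
subspaces = [class_subspace(X, m=2) for X in classes]
print(clafic_classify(classes[1][0], subspaces))                # expected: 1
```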
5. Object-predicate reciprocity
The given data $\{x_i^{(\alpha)}\}$ form an $n \times N$ matrix. (In Section 7 this corresponds to
the kernel $K$.) It is quite natural to consider this matrix either as a set of $N$ vectors of
dimension $n$ or as a set of $n$ vectors of dimension $N$. This object-predicate
symmetry was exploited for the first time for computational convenience in [12]
in the case of binary data, and the epistemological implications of this reciprocal
duality were explained in [15]. Relegating these methodological details to the
original papers, we only call attention here to the pragmatic fact that the
identification of an object depends also on a set of observations, and hence the
object-predicate table is, in the last analysis, nothing but a relation between two
sets of predicate variables.
Corresponding to the predicate correlation matrix $G_{ij}$, we can define an
object correlation matrix

$$H^{(\alpha\beta)} = \sum_{i=1}^{n} x_i^{(\alpha)} x_i^{(\beta)}. \tag{5.1}$$

Both $G_{ij}$ and $H^{(\alpha\beta)}$ can be derived from the dyadic

$$\Xi_{ij}^{(\alpha\beta)} = x_i^{(\alpha)} x_j^{(\beta)} \tag{5.2}$$

by

$$G_{ij} = \sum_{\alpha=1}^{N} \Xi_{ij}^{(\alpha\alpha)} \tag{5.3}$$

and

$$H^{(\alpha\beta)} = \sum_{i=1}^{n} \Xi_{ii}^{(\alpha\beta)}. \tag{5.4}$$
The eigenvectors $\varphi_l$ and the corresponding eigenvalues $\kappa_l$ of $H^{(\alpha\beta)}$ are defined
by

$$\sum_{\beta=1}^{N} H^{(\alpha\beta)} \varphi_l^{(\beta)} = \kappa_l \varphi_l^{(\alpha)} \tag{5.5}$$

with the labeling convention

$$\kappa_1 \geq \kappa_2 \geq \cdots \geq \kappa_N. \tag{5.7}$$

THEOREM 5.1. The matrices $G_{ij}$ and $H^{(\alpha\beta)}$ have the same rank and their
corresponding eigenvalues are equal, $\lambda_l = \kappa_l$ ($l = 1, 2, \ldots$), with the labeling conventions
(2.16) and (5.7).
The picture patterns are now represented by picture pattern vectors $\{x_i^{(\alpha)}\}$ with
$\alpha = 1, \ldots, N$; $i = 1, \ldots, m \times m$, and $x_i^{(\alpha)}$ being an integer between 0 and $n - 1$. We
then define the two conjugate correlation matrices $G$ and $H$ as before:

$$G_{ij} = \sum_{\alpha=1}^{N} W^{(\alpha)} x_i^{(\alpha)} x_j^{(\alpha)} / \|x^{(\alpha)}\|^2 \tag{6.2}$$

and

$$H^{(\alpha\beta)} = (W^{(\alpha)} W^{(\beta)})^{1/2} \sum_{i=1}^{m \times m} x_i^{(\alpha)} x_i^{(\beta)} / (\|x^{(\alpha)}\| \, \|x^{(\beta)}\|), \tag{6.3}$$

where

$$\|x^{(\alpha)}\|^2 = \sum_{i=1}^{m \times m} (x_i^{(\alpha)})^2 \tag{6.4}$$

and the weight function $\{W^{(\alpha)}\}$ must satisfy the condition that $W^{(\alpha)} > 0$ and

$$\sum_{\alpha=1}^{N} W^{(\alpha)} = 1. \tag{6.5}$$
we must be careful when we examine their limiting values, as will be shown in the
following section.

It was already noted that the two matrices $G$ and $H$ share the same
eigenvalues and their degeneracies. Thus there are at most $\min\{N, m \times m\}$
non-zero positive eigenvalues $\lambda_i$. Moreover, there also exist symmetrical relations
between the eigenvectors $g_i$ of $G$ and the eigenvectors $h_i$ of $H$:
$$g_{ij} = (1/\lambda_i)^{1/2} \sum_{\alpha=1}^{N} (W^{(\alpha)})^{1/2} x_j^{(\alpha)} h_{i\alpha} / \|x^{(\alpha)}\| \tag{6.6}$$

and

$$h_{i\alpha} = (1/\lambda_i)^{1/2} \sum_{j=1}^{m \times m} (W^{(\alpha)})^{1/2} x_j^{(\alpha)} g_{ij} / \|x^{(\alpha)}\|. \tag{6.7}$$
These relations allow us to compute the eigenvectors of one of the two matrices,
$G$ or $H$, without time-consuming matrix diagonalization, once we have the
eigenvalues and the eigenvectors of the other. This means that whenever we
compute the Karhunen-Loève system, we can always choose the lower dimensional
matrix, compute its Karhunen-Loève system, and then convert the eigenvectors
by (6.6) or (6.7) to obtain the other set of eigenvectors.
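The computational shortcut just described, in a minimal numpy sketch (ours; the weights and normalizations of (6.2)-(6.3) are omitted for brevity): diagonalize the smaller of the two conjugate matrices and convert its eigenvectors to those of the other, in the spirit of (6.6)-(6.7).

```python
import numpy as np

rng = np.random.default_rng(3)
n, N = 10, 200                        # few predicates, many objects
K = rng.normal(size=(n, N))           # the data as an n x N matrix

G = K @ K.T                           # n x n predicate matrix (small)
H = K.T @ K                           # N x N object matrix (large)

# Diagonalize only the small n x n matrix ...
lam, g = np.linalg.eigh(G)
order = np.argsort(lam)[::-1]
lam, g = lam[order], g[:, order]

# ... and convert the eigenvectors: h_i = K' g_i / sqrt(lambda_i).
h = K.T @ g / np.sqrt(lam)

# The converted vectors are eigenvectors of H with the same eigenvalues.
print(np.allclose(H @ h[:, 0], lam[0] * h[:, 0]))   # True
```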
where $\hat f^{(\alpha)}$ are approximations of $f^{(\alpha)}$ obtained by quantizing them into
$n$-level step functions.

Fig. 3. Digitization of two karyotype patterns where $n$ is fixed as 64 for different $m$:
(a) $m = 2$, (b) $m = 4$, (c) $m = 8$, (d) $m = 16$, (e) $m = 32$, (f) $m = 64$, (g) $m = 128$.
(Of course when either $\hat f^{(\alpha)}$ or $\hat f^{(\beta)}$ is zero, $H^{(\alpha\beta)}$ is also zero.) Since the $f^{(\alpha)}$ are
bounded, it follows from (6.9) that the limit of $H$ as $m \to \infty$ always exists for fixed $n$, and so do
its eigenvalues and entropy. Furthermore, if we take $n \to \infty$, then
$$N G(u, v; u', v') = \sum_{\alpha=1}^{N} f^{(\alpha)}(u, v) f^{(\alpha)}(u', v') / \|f^{(\alpha)}\|^2, \tag{6.15}$$
so that the finite matrices $G_{ij}$ are approximations to this $G$, and the infinite
dimensional $G$ should be identical to (6.15). Comparing (6.2) and (6.3), we see that they
are symmetrical counterparts of each other.
Fig. 4. Geometric convolution of two picture patterns: (a) is $f^{(\alpha)}$; (b) is $\hat f^{(\alpha)}$.
Fig. 5. Two calligraphy fonts of ten Japanese kana characters. (a) and (c) are enlarged
pictures of the first characters of (b) and (d), respectively.
Fig. 6. Graphs which show that the two entropy functions (entropy vs. number of meshes)
converge for two calligraphy fonts as $n$ goes to infinity, for $s = 2$ and 64.
An early study was carried out for Japanese handwritten characters and alphabets
in [10] by one of the authors. We here give some additional illustrative examples
from our experiments [5]. We compared two different calligraphy fonts of 10
Japanese characters (see Fig. 5). Their entropy functions are plotted in Fig. 6 for
fixed $s = 2$ and 64. In this example the curves reach a plateau between $n = 32 \times 32$
and $64 \times 64$. The results suggest that for classification problems of such patterns it
is useless to sample more finely than $n = 64 \times 64$. The same conclusion is
also drawn from the experiment using 23 pairs of karyotypes (see Fig. 7). Of
course, to confirm this conclusion we must take into account possible variations of
each pattern. However, the above result gives a guideline on how to choose
appropriate quantization levels in relation to differences between patterns.

Fig. 7. Entropy as a function of the number of meshes for the karyotype patterns.
For two functions $f$ and $g$ we write $fg(s, t) = f(s)g(t)$. In this notation an
integral equation

$$\lambda K f = f \tag{7.1}$$

stands for

$$\lambda \int_a^b K(s, t) f(t)\, dt = f(s). \tag{7.2}$$

A kernel is symmetric if

$$K^+ = K, \tag{7.3}$$

where $K^+(s, t) = K(t, s)$ is the conjugate kernel. A symmetric kernel can be
expanded in terms of its eigenfunctions $f_i$ and eigenvalues $\lambda_i$ as

$$K = \sum_{i=1}^{\infty} f_i f_i / \lambda_i, \tag{7.5}$$

which converges absolutely and uniformly. But such is not the case for unsymmetric
kernels. However, for a given unsymmetric kernel $K$, Schmidt associated two
symmetric operators

$$G = K K^+ \tag{7.6}$$

and

$$H = K^+ K, \tag{7.7}$$

which are positive definite and whose eigenvalues are real. He showed that:
(1) $G$ and $H$ defined by (7.6) and (7.7) share the same eigenvalues, which are
defined in (7.2), and their degeneracies.
(2) $K$ is expanded by the characteristic systems of $G$ and $H$ in a series

$$K = \sum_{i=1}^{\infty} f_i g_i / \lambda_i. \tag{7.8}$$

Moreover, the truncated series gives the best approximation of $K$ by a sum of $n$
product kernels:

$$\min_{\{X_i, Y_i;\, n\}} \left\| K - \sum_{i=1}^{n} X_i Y_i \right\|^2 = \left\| K - \sum_{i=1}^{n} f_i g_i / \lambda_i \right\|^2 = \sum_{i=n+1}^{\infty} 1/\lambda_i^2, \tag{7.9}$$

where we have assumed that the positive eigenvalues are arranged such that
$\lambda_1 \leq \lambda_2 \leq \cdots$. The two characteristic systems are related by

$$f_i = \lambda_i K g_i \tag{7.10}$$

and

$$g_i = \lambda_i K^+ f_i. \tag{7.11}$$
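In matrix terms Schmidt's construction is the singular value decomposition; a short numpy check (ours), with the singular values $\sigma_i$ playing the role of $1/\lambda_i$:

```python
import numpy as np

rng = np.random.default_rng(4)
K = rng.normal(size=(5, 7))           # an 'unsymmetric kernel' as a matrix

F, sigma, Gt = np.linalg.svd(K, full_matrices=False)
g = Gt.T                              # columns g_i

# G = K K+ and H = K+ K share the nonzero eigenvalues sigma_i^2.
print(np.allclose(np.linalg.eigvalsh(K @ K.T)[::-1], sigma**2))

# The relations (7.10)-(7.11) with lambda_i = 1/sigma_i:
i = 0
print(np.allclose(F[:, i], K @ g[:, i] / sigma[i]))      # f_i = lambda_i K g_i
print(np.allclose(g[:, i], K.T @ F[:, i] / sigma[i]))    # g_i = lambda_i K+ f_i
```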
With various modifications the previous theory becomes applicable to different
areas of mathematical science. The first modification is to extend $K(s, t)$ to
a complex-valued function. As is well known, the concept of symmetry must then
be replaced by the concept of hermiticity, and minor additional changes in the
definitions are required. $K$ may also be extended to a complex function of several
variables $K(x_1, x_2, \ldots, x_n)$, which is considered to be a kernel operator that
transforms $p$-dimensional $L^2$ space vectors into $s$-dimensional $L^2$ space vectors, or
conversely, with $n = s + p$. The application of this theory to physics, where $K$ is
identified with the many-particle wave function, was thoroughly discussed in [8].

Secondly, $K$ may be reduced to a matrix $K_{ij}$ or a family of indexed functions
$K_i(s)$. These are the cases which we encounter in pattern recognition, as discussed
in the previous sections. In particular we note that in pattern recognition theory
the error-minimizing expansion (7.9) was further identified with the entropy
minimizing expansion [13]. However, it is necessary to include a normalization
condition $\|K_i\| = 1$ for this purpose, where $K_i(s)$ is the $i$th pattern function when
patterns are represented by continuous variables.

In any case the essential features of Schmidt's arguments remain
unchanged, and all proofs are almost identical to the proofs given in this section.
8. Conclusion
Acknowledgment
T h e a u t h o r s w i s h to t h a n k Mr. I s a m u S u z u k i at the T o k y o M e t r o p o l i t a n
I n s t i t u t e of M e d i c a l Science, w h o k i n d l y p r o v i d e d e x p e r i m e n t a l r e s u l t s a n d
p r o d u c e d t h e i l l u s t r a t i o n s i n S e c t i o n 6. H e also k i n d l y r e a d the o r i g i n a l m a n u s c r i p t
a n d c o n t r i b u t e d n o t a t i o n a l corrections.
References
[1] Andrews, H. C. (1971). Multi-dimensional rotations in feature selection. IEEE Trans. Comput.,
1045-1051.
[2] Benzécri, J. P. (1976). L'Analyse des Données 2: L'Analyse des Correspondances. Bordas, Paris.
[3] Coleman, A. J. (1963). Structure of Fermion density matrices. Rev. Mod. Phys. 35, 668-687.
[4] Kaminuma, T. (1970). Informational entropy as a measure of correlation of interacting Fermi
particles. Ph.D. Thesis, University of Hawaii.
[5] Kaminuma, T. and Suzuki, I. (1980). Symmetry of Karhunen-Loève systems and its application
to geometric pattern analysis. Proc. 5th Internat. Conf. on Pattern Recognition, Vol. 2, pp.
1228-1231.
[6] Kanal, L. N. and Chandrasekaran, B. (1965). On dimensionality and sample size in statistical
pattern classification. Proc. Nat. Electronics Conf., Vol. 24, pp. 2-7; ibid., Pattern Recognition 3
(1971) 225-234.
[7] Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philos. Mag.
(Ser. 6) 2, 559-572.
[8] Schmidt, E. (1907). Zur Theorie der linearen und nichtlinearen Integralgleichungen. Math. Ann.
63, 433-476.
[9] Smithies, F. (1958). Integral Equations. Cambridge University Press, Cambridge.
[10] Tomita, S. et al. (1970). Theory of feature extraction for patterns by the Karhunen-Loève
orthogonal system. Systems Comput. Control 1, 55-62.
[11] Tomita, S. et al. (1973). On evaluation of handwritten characters by the K-L orthogonal system.
HICSS-3, 501-504.
[12] Watanabe, S. (1958). Lecture Notes of Summer School in Information Theory at Vienna; ibid.,
A note on the formation of concept and of association by information-theoretical correlation
analysis. Inform. Control 4, 291-296.
[13] Watanabe, S. (1965). Karhunen-Loève expansion and factor analysis--Theoretical remarks and
applications. Trans. Fifth Prague Conf. on Information Theory, Statistical Decision Functions and
Random Processes. Prague, 1967. Publishing House of the Czechoslovak Academy of Sciences,
Prague, pp. 635-660.
[14] Watanabe, S. (1969). Knowing and Guessing. Wiley, New York.
[15] Watanabe, S. (1969). Object-predicate reciprocity and its applications to pattern recognition.
Inform. Processing 68; ibid., Proc. IFIP Congress Edinburgh, Scotland. North-Holland,
Amsterdam, pp. 1608-1613.
[16] Watanabe, S. and Kulikowski, C. A. (1970). Multiclass subspace methods in pattern recognition.
Proc. Nat. Electronics Conf., Vol. 26, p. 468.
[17] Watanabe, S. Pattern recognition as quest for entropy minimization. Pattern Recognition, to
appear.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 721-745
Multivariate Morphometrics
Richard A. Reyment
1. Introduction
1.1. Definition
Morphometrics, in the statistical sense (the term is used differently in descriptive
zoology), means, as the name implies, the quantitative description of
variation in the morphology of organisms, plants and animals. Multivariate
morphometrics, as defined by Blackith and Reyment (1971), is concerned with the
application of the theory and practice of multivariate statistical analysis to two or
more morphological characters considered simultaneously. Although the concept
of multivariate morphometrics can be generalized to cover a wide range of topics,
including numerical taxonomy (as developed by Sneath and Sokal, 1973), it seems
more realistic to restrict it to the ample field represented by the multivariate
statistical evaluation of morphological variation.
Blackith and Reyment (1971) concentrated largely on case histories drawn from
the fields of Botany, Palaeontology and Zoology with little attention paid to the
details of computing. Pimentel (1979) has chosen the opposite approach and
concerned himself with the step-by-step treatment of the arithmetic of the main
standard methods of multivariate statistical analysis of use in morphometric
work. Blackith (1965) gave an excellent general introduction to the subject which
can be consulted as an easy entrance to the ideas involved.
The present account concentrates mainly on newer developments in multi-
variate morphometrics and the special analysis of particular problems of great
interpretative significance. It owes much to the imaginative and far-sighted
research of N. A. Campbell.
the two extremes, being of what we commonly call 'normal build'. There is thus
no clear distinction between studying contrasts of form between two selected
types and studying the variation of form to be found within any one group of
people. Within the group the variation will still be canalized along vectors of
various kinds. The difference between the situation in which we try to distinguish
these canalized patterns of variation in a supposedly homogeneous group, and that
in which we deliberately select individuals which show one or other extreme form
(to act as terminal groups for discriminant functions), is in practice a matter of
degree.
$$|S - \lambda I| = 0,$$

where $I$ is the unit matrix of the same order as $S$. The solution of this equation
consists of a set of eigenvalues, as many as there were characters in the covariance
matrix, each of which is associated with an eigenvector with as many elements as
there were characters measured.
Generally the first eigenvector, corresponding to the largest eigenvalue of the
characteristic equation, is taken to reflect the variation in size of the organisms--
for the above example it would be interpreted as reflecting the polarity of stature
between tall and short individuals. The size-variational interpretation of the first
eigenvector, or principal component, has been derived from work by the French
zoologist Teissier, who more than 45 years ago analyzed variation in crabs using
essentially the first principal component of the correlation matrix of features of
the carapace as a means of quantifying size variation. The ideas of Teissier were
carried further by Jolicoeur and Mosimann (1960) who added an interpretation of
the second principal component, the elements of which usually have plus and
minus signs, as a measure of variation in shape. Thus, in the fatness-thinness
example, the second principal component might reflect the fatness-thinness
polarity or the polarity between persons of stocky and slender build. Although
size and shape interpretations based on principal component analyses are very
common in morphometric studies, there is a disturbing arbitrariness to the
approach and it is by no means certain that all analyses are unchallengeable. Rao
(1964) has given attention to alternative methods of treating the problem. The
most complete probing of the entire field of size and shape variables has been
given by Mosimann (1970) and Mosimann and Malley (1979) (see Section 4).
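A minimal simulation (ours) of the size/shape reading of the first two principal components discussed above: with a common size factor and a shape contrast built into hypothetical measurements, the first eigenvector has elements of one sign ('size') and the second has mixed signs ('shape').

```python
import numpy as np

rng = np.random.default_rng(5)
size = rng.lognormal(mean=0.0, sigma=0.3, size=200)       # common size factor
shape = rng.normal(scale=0.05, size=200)                  # a shape contrast
X = np.column_stack([size * (1 + shape),                  # e.g. body length
                     size * (1 - shape),                  # e.g. body width
                     size])                               # a third character
X = np.log(X)                                             # work on log scale

S = np.cov(X, rowvar=False)
eigval, eigvec = np.linalg.eigh(S)
pc1, pc2 = eigvec[:, -1], eigvec[:, -2]                   # largest two eigenvalues
print(np.sign(pc1))                                       # one sign: 'size'
print(np.sign(pc2))                                       # mixed signs: 'shape'
```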
There is another way of extracting canalized variation from the covariance
matrix. The components obtained by principal component analysis are always
orthogonal (at right angles) to each other. Various techniques of factor analysis
have been devised in order to 'reorient' the eigenvectors so that they align more
closely with assumed sources of variation.
different experiments but the experimenter wishes to compare some notional axis
of variation. In psychometric work this situation seems to be common: the test
subjects are given a 'battery' of tests, differing from one laboratory to another,
and notional factors such as those related to intelligence or skills are to be
extracted from the matrices of correlations between the performances. An analo-
gous situation in morphometrics would arise when two experimenters wished to
compare vectors of, say, size and shape, but had, for some reason, been unable or
unwilling to measure the same characters of the organisms. There can be little
doubt that certain factor-analytical techniques can arrive at an assessment of
these axes of variation in terms which minimize the nuisance caused by the
different suites of characters used. Against this advantage has to be set the lack of
any way of testing whether factors so uncovered are significant. Such situations
seem to arise with great frequency in psychometric work but they are not
common in morphometrics.
Blackith and Reyment (1971, p. 147) discussed applications of principal
component analysis to morphometric problems: for example, size-variation in
thrips, the identification of shells deriving from the life cycle of fossil
foraminifers, variation in scale insects, and the morphological variants of the
common salamander.
Brief mention should be made of a method of 'factor analysis' known as
correspondence analysis (l'analyse factorielle des correspondances), which is
widely used (under the name of factor analysis) in French-language morphometrical
publications. The current version of the method is due to J. P. Benzécri (1973),
but its history is long and tortuous. An excellent account of the many
rediscoveries of the concept of combined R-Q-mode analyses has been published by
Hill (1975), starting with the original brilliant work of H. O. Hirschfeld (later
Hartley) in 1935.
Briefly stated, the aim of correspondence analysis is to obtain simultaneously,
and on equivalent scales, R-mode 'factor-loadings' and Q-mode 'factor-loadings'
that represent the same principal components of the data matrix. The goal is
achieved in Benzécri's variant by a method of scaling the data matrix, an
appropriate index of similarity, and the Q-R duality relationship of the Eckart-Young
theorem (Jöreskog et al., 1976, p. 107). The scaling procedure and analysis
are algebraically equivalent to Fisher's (1940) scaling of a contingency table (Hill, 1975).

The monograph on Madagascan lemurs by Mahé (1971) and that dealing with
Jurassic brachiopods by Delance (1974) contain many examples of the morphometrical
application of correspondence analysis.
Ideas for the separation of shape- and size-components of the various contrasts
of form with which one may have to deal date back to Penrose (1954). We shall
now consider the way in which Mosimann (1970) has taken up the general ideas
of Penrose for providing an acceptable quantification of shape variation.
Mosimann (op. cit.) and Mosimann and Malley (1979, p. 182) consider that the
optimal solution to the problem of defining shape- and size-variables is by
means of quantities based on geometric similarity.
This approach considers only positive $k$-dimensional vectors. We denote the set
of such vectors by $P^k$. Two vectors $x_1$, $x_2$, say, of $P^k$ have the same shape if they
are both on the same ray of $P^k$. A size-variable is generally defined to be any
homogeneous function $G$ of degree 1 from $P^k$ to $P^1$, the positive real numbers. This
may be expressed succinctly by saying that for every positive $a$,

$$G(ax) = aG(x).$$

A shape vector is then

$$Z(x) = x / G(x)$$

for all $x$. The direction cosines $x/(\sum x_i^2)^{1/2}$, proportions $x/\sum x_i$, and ratios
$x/x_k$ are examples of shape vectors. Two vectors $x_1$, $x_2$ have equal shape vectors
$Z_G(x_1) = Z_G(x_2)$ if they both lie on the same ray, i.e. if $x_1 = a x_2$ for some $a > 0$.
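A small sketch (ours) of Mosimann's construction: for any size variable $G$ that is homogeneous of degree 1, the shape vector $Z(x) = x/G(x)$ is constant along rays, so proportional specimens have identical shape.

```python
import numpy as np

def shape_vector(x, G=np.sum):
    """Z(x) = x / G(x) for a size variable G homogeneous of degree 1."""
    return x / G(x)

x = np.array([4.0, 2.0, 1.0])
for G in (np.sum,                          # proportions: x / sum(x)
          np.linalg.norm,                  # direction cosines
          lambda v: v[-1]):                # ratios: x / x_k
    same = np.allclose(shape_vector(x, G), shape_vector(3.0 * x, G))
    print(same)                            # True: same ray, same shape
```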
In many of the situations with which an experimental scientist has to cope, tests
of significance are superfluous to the problem of ascertaining the structure of the
experiment in multidimensional space. If a group of organisms is not known to
consist of definable sub-groups, then a principal component, or principal coordi-
nate, analysis seems the appropriate tool for a preliminary probing of its
morphometric structure (so-called 'zapping'). Once the material is known to
comprise a number of sub-groups, an analysis along canonical axes is advisable.
When the decision to use canonical variates has been taken, tests of multivariate
significance lose much of their point, for we know in advance that the sub-groups
differ. In this respect, the interests of the biologist may deviate from those of the
theoretically oriented statistician--a large body of the literature on multivariate
statistics is concerned with diverse aspects of the testing of significance.
Thus, the point at issue is not primarily a statistical one: an entomologist
investigating the form of insects in a bisexual species would rarely be well advised
to test the significance of the sexual dimorphism, for a glance at the genitalia will
normally settle the question of sex. A palaeontologist concerned with the question
of whether or not sexual dimorphism exists in fossil cephalopods might well, on
the other hand, take a quite opposite view.
The statistical ideas underlying one of the original problems of the method of
discriminant functions may be discussed in terms of two populations $\pi_1$ and $\pi_2$
reasonably well known from samples drawn from them. A linear discriminant
function is constructed on the basis of $v$ variables and two samples of sizes $N_1$ and
$N_2$. The coefficients of the sample discriminant function may be defined as

$$a = S^{-1}(\bar{x}_1 - \bar{x}_2), \tag{5}$$

where $a$ is the vector of discriminatory coefficients and $\bar{x}_1$ and $\bar{x}_2$ are the mean
vectors of the respective samples from the two populations. $S^{-1}$ is the inverse of
the pooled sample covariance matrix for the two samples.

The linear discriminant function between the two samples for the variables
$x_1, \ldots, x_v$ may be written as

$$z = x^T a. \tag{6}$$

If the variances of the $v$ variables are almost equal, the discriminant coefficients
give an approximate idea of the relative importance of each variable to the
efficiency of the function.
Considering now one of the classical problems of discriminatory analysis, we
have measurements on the same $v$ variables as before on a newly found individual
which the researcher wishes to assign to one of the two populations with the least
chance of being wrong. (This presupposes that the new specimen really does come
from one of the populations.) Using a pre-determined cut-off value, the measurements
on the new specimen are substituted into (6), and the determination is
made on the grounds of whether the computed value exceeds or falls short of the
cut-off point. Usually this is taken to lie midway between the two samples, but
other values may be selected for some particular biological reason.
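A minimal sketch (ours) of the allocation rule just described: estimate $a = S^{-1}(\bar{x}_1 - \bar{x}_2)$ from two hypothetical samples, and allocate a new specimen according to whether $z = x^T a$ exceeds the midpoint cut-off.

```python
import numpy as np

def discriminant(X1, X2):
    """Sample linear discriminant a = S^-1 (xbar1 - xbar2), S pooled."""
    n1, n2 = len(X1), len(X2)
    S = ((n1 - 1) * np.cov(X1, rowvar=False) +
         (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    a = np.linalg.solve(S, X1.mean(0) - X2.mean(0))
    cutoff = 0.5 * (X1.mean(0) + X2.mean(0)) @ a    # midway between samples
    return a, cutoff

rng = np.random.default_rng(6)
X1 = rng.normal(loc=0.0, size=(60, 4))
X2 = rng.normal(loc=1.0, size=(60, 4))
a, cutoff = discriminant(X1, X2)
x_new = rng.normal(loc=1.0, size=4)                 # a 'newly found individual'
print('population 1' if x_new @ a > cutoff else 'population 2')
```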
The supposition that the individual must come from one or other of the populations is
necessary on purely statistical grounds, but it is one that may make rather poor
biological sense. Doubtless the specimen could have come from one of the two
populations, but it is equally likely that it is morphometrically close to, but not
identical with, one of these. Such situations arise in biogeographical studies and in
the analysis of evolutionary series.
The linear discriminant function is connected with the Mahalanobis generalized
statistical distance by the relationship

$$D^2 = (\bar{x}_1 - \bar{x}_2)^T S^{-1} (\bar{x}_1 - \bar{x}_2). \tag{7}$$

In canonical variate analysis, one maximizes the ratio of between-groups to
within-groups sums of squares,

$$c^T B c / c^T W c. \tag{8}$$

The maximized ratio yields the first canonical root $f_1$, with which is associated the
first canonical vector $c_1$. Subsequent vectors and roots may be obtained analogously.
The canonical vectors are usually scaled so that, for example, $c_i^T W c_i = n_w$.
An important reference for canonical variate analysis is Rao (1952).
Other derivations are given elsewhere in this handbook.
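A minimal sketch (ours) of the maximization of the ratio (8): the canonical roots and vectors solve the generalized eigenproblem $Bc = fWc$, scaled here so that $c^T W c = n_w$.

```python
import numpy as np
from scipy.linalg import eigh

def canonical_variates(groups):
    """Canonical roots and vectors from between/within SSP matrices."""
    grand = np.vstack(groups).mean(0)
    W = sum((g - g.mean(0)).T @ (g - g.mean(0)) for g in groups)
    B = sum(len(g) * np.outer(g.mean(0) - grand, g.mean(0) - grand)
            for g in groups)
    n_w = sum(len(g) for g in groups) - len(groups)     # within-groups d.f.
    roots, C = eigh(B, W)                               # B c = f W c
    roots, C = roots[::-1], C[:, ::-1]                  # largest root first
    C *= np.sqrt(n_w)                                   # scale: c' W c = n_w
    return roots, C

rng = np.random.default_rng(7)
groups = [rng.normal(loc=mu, size=(30, 4)) for mu in (0.0, 1.0, 2.0)]
roots, C = canonical_variates(groups)
print(roots[:2])                                        # the canonical roots
```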
indicates that little or no discriminatory information has been lost). When this
occurs, the obvious conclusion is that one or some of the variables contributing
most to the principal component that has been shrunk have little influence on the
discriminatory process. One or some of these redundant variables can profitably
be eliminated. In addition, variables with small standardized canonical variate
coefficients can be deleted.
For interpreting the morphometrical relationships in a taxonomic study, or an
equivalent kind of analysis, those principal components that contribute most to
the discrimination are of particular interest and the characteristics of the corre-
sponding eigenvectors should be examined. For example, if the first principal
component is involved, size-effects of some kind may occur.
The within-groups sums of squares and cross products matrix W on n w degrees
of freedom and the between-groups sums of squares and cross products matrix B
are computed in the usual manner of canonical variate analysis, together with the
matrix of sample means. It is advisable to standardize the matrix W to correlation
form, with similar scaling for B. The standardization is obtained through pre- and
post-multiplying by the inverse of the diagonal matrix S, the diagonal elements of
which are the square roots of the diagonal elements of W.
Consequently, $W^* = S^{-1} W S^{-1}$ and $B^* = S^{-1} B S^{-1}$.
Usually the eigenvectors are now scaled by the square root of the eigenvalue; this
is a transformation for producing within-groups sphericity. Shrunken estimators
are formed by adding shrinking constants $k_i$ to the eigenvalues $e_i$ before scaling
the eigenvectors. The details of the mathematics are given in Campbell (1980).
Write $U$ for the matrix of eigenvectors of $W^*$, with eigenvalues $e_1 \geq \cdots \geq e_v$,
and define

$$G_{(k_1, \ldots, k_v)} = (E + K)^{-1/2}\, U^T B^* U\, (E + K)^{-1/2},$$

where $E = \mathrm{diag}(e_1, \ldots, e_v)$ and $K = \mathrm{diag}(k_1, \ldots, k_v)$, and set $d_i$ equal to the
$i$th diagonal element of $G$. The $i$th diagonal element $d_i$ is
the between-groups sum of squares for the $i$th principal component.

An eigen-analysis of the matrix $G_{(0, \ldots, 0)}$ yields the usual canonical roots $f$ and
the canonical vectors for the principal components, $a_u$. The usual canonical vectors
$c_u^{(0)}$ are then recovered by transforming the $a_u$ back to the original variables.
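A sketch of our reading of the shrinking procedure described above (not Campbell's own code): standardize $W$ and $B$, rotate to the principal components of $W^*$, and scale component $i$ by $(e_i + k_i)^{-1/2}$, so that $k_i = 0$ reproduces the usual analysis while a very large $k_i$ effectively removes component $i$.

```python
import numpy as np

def shrunken_between(W, B, k):
    """Between-groups matrix in shrunken, sphericized PC coordinates."""
    s = np.sqrt(np.diag(W))
    Ws, Bs = W / np.outer(s, s), B / np.outer(s, s)   # correlation-form W*, B*
    e, U = np.linalg.eigh(Ws)
    e, U = e[::-1], U[:, ::-1]                        # largest eigenvalue first
    scale = 1.0 / np.sqrt(e + k)                      # shrinking constants k_i
    return (U * scale).T @ Bs @ (U * scale)

rng = np.random.default_rng(8)
A = rng.normal(size=(80, 4))
W = A.T @ A
B = np.diag([4.0, 2.0, 1.0, 0.5])
print(np.diag(shrunken_between(W, B, k=np.zeros(4))))            # usual analysis
print(np.diag(shrunken_between(W, B, k=np.array([0., 0., 0., 1e6]))))
# the last diagonal element is shrunk to ~0: PC4 no longer contributes
```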
Table 1
Means, pooled standard deviations and correlations for the gastropod Dicathais

Sample         1       2       3       4       5       6       7
Sample size    102     101     75      69      29      48      32
Sample means
L              39.36   33.39   35.54   33.86   27.43   51.73   37.47
LS             16.10   11.99   14.06   13.07   10.14   20.73   13.79
LA             28.04   25.58   25.81   25.10   20.42   37.21   28.55
WA             12.81   12.02   11.76   11.60    9.154  17.97   13.39

Sample         8       9       10      11      12      13      14
Sample size    83      88      44      34      33      82      60
L              40.11   38.43   33.17   32.39   44.02   33.34   55.94
LS             13.16   12.71   12.36   13.29   14.91   13.34   25.00
LA             31.94   30.40   24.67   23.12   33.51   24.92   38.93
WA             16.08   14.90   11.21   11.76   17.46   13.02   20.84
6.4. Practical aspects

The analysis given by Campbell (1980) for the gastropod Dicathais from the
coasts of Australia and New Zealand provides a good idea of the morphometrical
consequences of instability in canonical variate coefficients. Four variables
describing the size and shape of the shell were measured, to wit, length of shell (L),
length of spire (LS), length of aperture (LA) and width of aperture (WA). Means,
pooled standard deviations and correlations for $W^*$ are listed in Table 1. As is
not uncommon in highly integrated characters in molluscs, all correlations are
very high. The eigenvalues and eigenvectors of the correlation matrix of Table 1
are listed in Table 2.

As an outcome of the very high correlations, there are two small eigenvalues,
with the smaller of them accounting for less than 0.08% of the within-groups
variation; the corresponding eigenvector contrasts L with LS and LA. The
between-groups sums of squares corresponding to each combination of eigenvalues
and eigenvectors (12) are supplied in the same table.
Table 2
Eigen-analysis of the within-groups correlation matrix for Dicathais

        Eigenvector
No.     L       LS      LA      WA      Eigenvalue   u*'B*u*
1       0.50    0.49    0.50    0.50    3.869        0.55
2       0.08    0.79   -0.42   -0.43    0.112        1.49
3      -0.33    0.15   -0.56    0.75    0.016        1.87
4       0.79   -0.33   -0.51    0.03    0.003        0.38
Table 3
Summary of canonical variate analysis for Dicathais (from Campbell, 1980)^a

                      First canonical vector               Second canonical vector
              PC1    PC2    PC3    PC4   Can. root    PC1    PC2    PC3    PC4   Can. root
a_u          -0.32  -0.08  -0.93   0.17               0.09  -0.93  -0.02  -0.35

              L      LS     LA     WA                 L      LS     LA     WA
c_u^(k4=0)    4.82   2.02  -2.41   5.64    2.13       4.65   4.28  -2.12   1.51    1.68
c_u^(k4=inf) -2.42   0.66  -3.78   5.91    2.09       0.17  -2.54   1.93   0.18    1.48

^a PC stands for principal component; c_u^(k4=0) is the usual canonical vector and
c_u^(k4=inf) denotes the generalized inverse coefficients for the canonical variates.
Table 4
Summary of canonical variate analysis for Prionocycloceras

                      First canonical vector               Second canonical vector
              PC1    PC2    PC3    PC4   Can. root    PC1    PC2    PC3    PC4   Can. root
a_u           0.94   0.04  -0.31   0.17               0.21  -0.44   0.78   0.38

              D      U      H      B                  D      U      H      B
c_u^(k4=0)    3.67  10.51   9.03   8.67    5.68      -6.43 -15.19  18.93   4.32    0.10
c_u^(k4=inf)  9.52  10.79   5.25   5.56    5.53       7.64 -15.46  11.45  -2.58    0.09
Fig. 1. Measurements made on Subbotina pseudobulloides.
variables measured are shown in Fig. 1. The basic statistics are listed in Tables 5
and 6.
Growth-free canonical variate analyses were made for each of the species for
k = 0 and k = 1. The analysis for k = 0 is the standard one of canonical variates
where no growth effects are extracted. The analysis with k = 1 corresponds to the
removal of one principal component as a 'growth vector', subject to the reserva-
tion of Section 2 regarding the arbitrary nature of such a principal-components
growth interpretation.
Tables 7 through 9 contain the squared generalized distances, the canonical
variate loadings and the canonical variate means resulting from the analysis.
Where no principal components were extracted ($k = 0$), the canonical variate
means are substantially different from those for $k = 1$. The sample illustrates the
comparatively large changes in the distances that may result from the removal of one
principal component.
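A minimal sketch (ours) of a $k = 1$ growth-free analysis in the spirit described above: project out the first within-groups principal component as a 'growth vector' before computing canonical variates (a simple variant; the Burnaby (1966) procedure differs in detail).

```python
import numpy as np
from scipy.linalg import eigh

def growth_free_cva(groups, k=1):
    """Remove the first k within-groups PCs, then do canonical variates."""
    W = sum((g - g.mean(0)).T @ (g - g.mean(0)) for g in groups)
    _, U = np.linalg.eigh(W)
    P = np.eye(W.shape[1]) - U[:, -k:] @ U[:, -k:].T    # projection matrix
    projected = [g @ P for g in groups]
    grand = np.vstack(projected).mean(0)
    Wp = sum((g - g.mean(0)).T @ (g - g.mean(0)) for g in projected)
    Bp = sum(len(g) * np.outer(g.mean(0) - grand, g.mean(0) - grand)
             for g in projected)
    # ridge the projected W: it is singular in the removed direction(s)
    roots, C = eigh(Bp, Wp + 1e-8 * np.eye(len(Wp)))
    return roots[::-1], C[:, ::-1]

rng = np.random.default_rng(9)
groups = [rng.normal(loc=mu, size=(40, 6)) for mu in (0.0, 0.5, 1.0)]
roots, C = growth_free_cva(groups, k=1)
print(roots[:2])
```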
Table 5
Pooled within-groups covariance matrix and group means for Subbotina pseudobulloides from the
Early Paleocene of Sweden (logarithmically transformed data)
Pooled covariance matrix
0.0184
0.0173 0.0203
0.0170 0.0180 0.0226
0.0165 0.0146 0.0159 0.0206
0.0176 0.0198 0.0188 0.0157 0.0257
0.0199 0.0210 0.0199 0.0155 0.0204 0.0298
Group means N
5.1391 4.9476 4.6280 4.6165 4.4182 4.2415 20
5.1501 4.9542 4.5415 4.6714 4.4530 4.2696 29
4.9748 4.7771 4.4850 4.4833 4.2821 4.0888 30
5.1837 5.0079 4.6270 4.6666 4.5271 4.3238 39
5.0404 5.8598 4.5178 4.5127 4.3508 4.1376 60
5.1845 5.0206 4.6998 4.6813 4.5481 4.3389 100
Table 6
Eigenvalues and eigenvectors of the within-groups covariance matrix for Subbotina
pseudobulloides
         1         2         3         4         5         6
Eigenvalues
  0.1129    0.0094    0.0070    0.0047    0.0022    0.0011
Eigenvectors
  0.3860   -0.1291    0.2063   -0.1784   -0.3120    0.8140
  0.4032    0.1582    0.1402   -0.0539   -0.7730   -0.4386
  0.4069   -0.1071   -0.0252    0.8992    0.1117    0.0363
  0.3535   -0.7514    0.3035   -0.2505    0.1894   -0.3460
  0.4288    0.0219   -0.7890   -0.2596    0.3463    0.0759
  0.4626    0.6179    0.4717   -0.1630    0.3701   -0.1348
Table 7
Squared generalized distances for Subbotina pseudobulloides for k = 0, 1

k = 0
1 0.0000
2 2.8677 0.0000
3 1.9539 4.4016 0.0000
4 1.5322 0.8623 3.6189 0.0000
5 0.7524 2.9727 0.8738 1.9316 0.0000
6 1.3623 2.6748 3.1345 0.7847 2.2441 0.0000
k = 1
1 0.0000
2 2.8647 0.0000
3 0.7654 3.0867 0.0000
4 1.3496 0.7246 1.3157 0.0000
5 0.2759 2.4150 0.7146 0.6821 0.0000
6 1.0088 2.3848 0.2947 0.7563 0.5923 0.0000
Table 8
Canonical variate analysis for Subbotina pseudobulloides for k = 0
Latent roots
1 2 3 4 5
3.1008 1.3326 0.6374 0.2336 0.0235
Canonical variate loadings
1 2 3 4 5
 -1.6494   -2.0139  -17.4327   -2.5077   15.2510
  4.7737    0.3278   -8.0457   -2.6102   16.2890
 -8.4805    9.7136    0.9643    3.7573   -0.0948
  5.6673   -3.3072    7.2132    8.2769   -6.8715
  2.4804    0.9962    7.2153   -7.5067    2.5701
  2.4981   -1.1284    7.9632    2.5453    5.4067
Coordinates of means
1 2 3 4 5
0.2762 0.3301 -0.5245 0.2498 0.0258
  0.9390   -0.7096    0.0179    0.1451   -0.0435
  1.0794   -0.3072    0.4137    0.1008    0.0420
  0.7064    0.1001    0.0143   -0.2162    0.1046
 -0.6122   -0.1777   -0.2669   -0.3043   -0.0652
  0.3224    0.7643    0.3455    0.0249   -0.0636
Table 9
Canonical variate analysis for Subbotina pseudobulloides for k = 1
Latent roots
1 2 3 4 5
2.2298 0.6467 0.2667 0.0619 0.0001
Canonical variate loadings
1 2 3 4 5
  1.5126   17.9109    0.4724   15.5306  -249.7511
 -4.0535    7.2746    5.9664  -16.3598   -98.3832
 12.9584   -1.6057   -2.6835   -1.4948    19.6511
 -6.6390   -6.8857   -7.5410   -7.8315    52.2846
 -1.6893   -7.5132    6.7275    4.1577    79.9098
 -2.4882   -7.6466   -3.7076    4.7446   162.8214
Coordinates of means
1 2 3 4 5
  0.4676    0.5034   -0.1826   -0.0667    0.0043
 -1.1577    0.0329   -0.1904    0.0278   -0.0033
  0.5503   -0.3420   -0.2395    0.1348   -0.0010
 -0.4440   -0.0502    0.2503    0.0903    0.0069
  0.3066    0.2902    0.2569    0.0392   -0.0071
  0.2772   -0.4342    0.1052   -0.1698    0.0001
9. Applications in taxonomy
References
Benzécri, J. P. (1973). L'Analyse des Données. 2, L'Analyse des Correspondances. Dunod, Paris.
Blackith, R. E. (1965). Morphometrics. In: T. H. Waterman and H. J. Morowitz, eds., Theoretical and
Mathematical Biology, 225-249. Blaisdell, New York.
Blackith, R. E. and Reyment, R. A. (1971). Multivariate Morphometrics. Academic Press, London.
Burnaby, T. P. (1966). Growth invariant discriminant functions and generalized distances. Biometrics
22, 96-110.
Campbell, N. A. (1978). The influence function as an aid in outlier detection in discriminant analysis.
Appl. Statist. 27, 251-258.
Campbell, N. A. (1979). Canonical variate analysis: some practical aspects. Ph.D. Thesis, Imperial
College, University of London.
Campbell, N. A. (1980). Shrunken estimators in discriminant and canonical variate analysis. Appl.
Statist. 29, 5-14.
Rao, C. R. (1966). Discriminant function between composite hypotheses and related problems.
Biometrika 53, 339-345.
Reyment, R. A. (1972). Multivariate normality in morphometric analysis. Math. Geology 3, 357-368.
Reyment, R. A. (1975). Canonical correlation analysis of hemicytherinid and trachyleberinid ostra-
codes in the Niger Delta. Bull. Amer. Paleontology 65 (282) 141-145.
Reyment, R. A. (1976). Chemical components of the environment and Late Campanian microfossil
frequencies. Geologiska Föreningens i Stockholm Förhandlingar 98, 322-328.
Reyment, R. A. (1978a). Graphical display of growth-free variation in the Cretaceous benthonic
foraminifer Afrobolivina afra. Palaeogeography, Palaeoclimatology, Palaeoecology 25, 267-276.
Reyment, R. A. (1978b). Quantitative biostratigraphical analysis exemplified by Moroccan Cretaceous
ostracods. Micropaleontology 24, 24-43.
Reyment, R. A. (1979). Analyse quantitative des Vascocératidés à carènes. Cahiers de Micropaléontologie
4, 56-64.
Reyment, R. A. (1980). Morphometric Methods in Biostratigraphy. Academic Press, London.
Reyment, R. A. and Banfield, C. F. (1976). Growth-free canonical variates applied to fossil for-
aminifers. Bull. Geological Institutions University of Uppsala 7, 11-21.
Sneath, P. H. A. and Sokal, R. R. (1973). Numerical Taxonomy. Freeman, San Francisco.
Sprent, P. (1968). Linear relationships in growth and size studies. Biometrics 24, 639-656.
Thompson, D'A. W. (1942). On Growth and Form. Cambridge University Press, Cambridge.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 747-771

Multivariate Analysis with Latent Variables*

P. M. Bentler and D. G. Weeks

*Preparation of this chapter was facilitated by a Research Scientist Development Award (K02-DA00017) and a research grant from the U.S. Public Health Service (DA01070).

1. Introduction
(Amemiya, 1977). In this nonlinear simultaneous equation system for the $t$th
observation there are $j$ equations, where $y_t$ is a $j$-dimensional vector of dependent
variables, $x_t$ is a vector of independent variables, $\alpha_i$ is a vector of unknown
parameters, and $u_t$ is a $j$-dimensional disturbance vector having an independent
and identically distributed multivariate normal distribution. It is the nonlinearity
that excludes this model from consideration here, but some models
described below may be considered as nonlinear in parameters provided this
nonlinearity is explicit (see Jennrich and Ralston, 1978). Among structured linear
models, the statistical problems involved in estimating and testing are
considerably simplified if the assumption is made that the
random variables associated with the models are multivariate normally distributed.
While such an assumption is not essential, as will be shown, it guarantees
that the first and second moments contain the important statistical information
about the data.
measuring these constructs. The LVs are related to each other in certain ways as
specified by the investigator's theory. When the relations among all LVs and the
relation of all LVs to MVs are specified in mathematical form--here simply a
simultaneous system of highly restricted linear structural equations--one obtains
a model having a certain structural form and certain unknown parameters. The
model purports to explain the statistical properties of the MVs in terms of the
hypothesized LVs. The primary statistical problem is one of optimally estimating
the parameters of the model and determining the goodness-of-fit of the model to
sample data on the MVs. If the model does not acceptably fit the data, the
proposed model is rejected as a possible candidate for the causal structure
underlying the observed variables. If the model cannot be rejected statistically, it
may provide a plausible representation of the causal structure. Since different
models typically generate different observed data, carefully specified competing
models can be compared statistically.
As mentioned above, factor analysis represents the structured linear model
whose latent variable basis has had the longest history, beginning with Spearman
(1904). Although it is often discussed as a data exploration method for finding
important latent variables, its recent development has focused more on hypothesis
testing as described above (Jöreskog, 1969). In both confirmatory and exploratory
modes, however, it remains apparent that the concept of latent variable is a
difficult one to communicate unambiguously. For example, Dempster (1971)
considers linear combinations of MVs as LVs. Such a concept considers LVs as
dependent variables. However, a defining characteristic of LV models is that the
LVs are independent variables with respect to the MVs; that is, MVs are linear
combinations of LVs and not vice-versa. There is a related confusion: although
factor analysis is typically considered to be a prime method for dimension
reduction, in fact it is just the opposite. If the MVs are drawn from a p-variate
distribution, then LV models can be defined by the fact that they describe a
(p + k)-variate distribution (see Bentler, 1982). Although less than p of the LVs
are usually considered important, it is inappropriate to focus only on these LVs.
In factor analysis the k common factors are usually of primary interest, but the p
unique factors are an equally important part of the model.
It should not surprise the reader to hear that the concept of drawing inferences
about (p +k)-variates based on only p MVs has generated a great deal of
controversy across the years. While the MVs are uniquely defined, given the
hypothetical LVs, the reverse can obviously not be true. As a consequence, the
very concept of LV modeling has been questioned. McDonald and Mulaik (1979),
Steiger and Schönemann (1978), Steiger (1979), and Williams (1978) review some
of the issues involved. Two observations provide a positive perspective on the
statistical use of LV multivariate analysis (Bentler, 1980). Although there may be
interpretive ambiguity surrounding the true 'meaning' of a hypothesized LV, it
may be proposed that the statistical evaluation of models would not be affected.
First, although an infinite set of LVs can be constructed under a given LV model
to be consistent with given MVs, the goodness-of-fit of the LV model to data (as
indexed, for example, by a $\chi^2$ statistic) will be identical under all possible choices
of such LVs. Consequently, the process of evaluating the fit of a model to data
and comparing the relative fit of competing models is not affected by LV
indeterminacy. Thus, theory testing via LV models remains a viable research
strategy in spite of LV indeterminacy. Second, although the problem has been
conceptualized as one of LV indeterminacy, it equally well can be considered one
of model indeterminacy. Bentler (1976) showed how LV and MV models can be
framed in a factor analytic context to yield identical covariance structures; his
proof is obviously relevant to other LV models. While the LVs and MVs have
different properties, there is no empirical way of distinguishing between the
models. Hence, the choice of representation is arbitrary, and the LV model may
be preferred on the basis of the LVs' simpler theoretical properties.
Models with MVs only can be generated from LV models. For example,
traditional path analysis or simultaneous equation models can be executed with
newer LV models. As a consequence, LV models are applicable to most areas of
multivariate analysis as traditionally conceived, for example, canonical correla-
tion, multivariate analysis of variance, and multivariate regression. While the LV
analogues or generalizations of such methods are only slowly being worked out
(e.g., Bentler, 1976; Jöreskog, 1973; Rock, Werts, and Flaugher, 1978), the LV
approach in general requires more information to implement. For example, LVs
may be related to each other via a traditional multivariate model (such as
canonical correlation), but such a model cannot be evaluated without the addi-
tional imposition of a measurement model that relates MVs to LVs. If the
measurement model is inadequate, the meaning of the LV relations is in doubt.
$$x^g = \nu^g + \Lambda^g \xi^g + \epsilon^g \tag{2.3}$$

with expectations $E(\xi^g) = \theta^g$ and $E(x^g) = \mu^g = \nu^g + \Lambda^g \theta^g$, and covariance matrix
$\Sigma^g = \Lambda^g \Phi^g \Lambda^{g\prime} + \Psi^g$. The group covariance matrices thus have confirmatory factor
analytic representations, with factor loading parameters $\Lambda^g$, factor intercorrelations
or covariances $\Phi^g$, and unique covariance matrices $\Psi^g$. It is generally
necessary to impose constraints on parameters across groups to achieve an
identified model, e.g., $\Lambda^g = \Lambda$ for all $g$.
Similar models were considered by Jöreskog (1971) and Please (1973) in the
context of simultaneous factor analysis in several populations. In a model such as
(2.3), there is an interdependence of first and second moment parameters. The
MVs' means are decomposed into basic parameters that may also affect the
parameters of the model are $T^g$, $\Sigma^g$, $\Omega^g$ and the matrices $\Lambda_j^g$, while $U^g$, $V^g$, and
$W^g$ are known constant matrices. In some important applications the $T^g$ can be
written as functions of the $\Lambda_j^g$. The columns of $x^g$ are independently distributed
with covariance matrix $\Sigma^g$. For simplicity it may be assumed that the $\xi_j^g$ have
covariance matrices $\Phi_j^g$ and are independent of $\xi_{j'}^g$, where $j \neq j'$.

It is apparent that this model introduces LVs of arbitrarily high order, while
allowing for an interdependence between first and second moment parameters.
Alternatively, in the case of a single population one may write model (2.8) as
$\Sigma = A_1 A_2 \cdots A_k \Phi A_k' \cdots A_2' A_1'$, where $\Phi$ is block-diagonal containing all of the $\Phi_j$
matrices in (2.8) (McDonald, 1978).
It is possible to write Tucker's (1966) generalization of principal components to
three 'modes' of measurement via the random vector form as $x = (A \otimes B)\Gamma\xi$, where
$x$ is a ($pq \times 1$) vector of observations; $A$, $B$, and $\Gamma$ are parameter matrices of
order ($p \times a$), ($q \times b$), and ($ab \times c$) respectively, and $\xi$ is of order ($c \times 1$). The
notation $(A \otimes B)$ refers to the right Kronecker product of matrices, $(A \otimes B) = [a_{ij} B]$.
Bentler and Lee (1979a) have considered an extended factor analytic version of
this model as

$$x = (A \otimes B)\Gamma\xi + \epsilon \tag{2.9}$$

with covariance structure

$$\Sigma = (A \otimes B)\Gamma\Phi\Gamma'(A' \otimes B') + \Psi^2, \tag{2.10}$$
where $\xi$ and $\eta$ are latent random variables and $\delta$ and $\epsilon$ are vectors of errors of
measurement that are independent of each other and of the LVs. All vectors have as
expectations the null vector. The measurement model (2.11) is obviously a
factor-analytic type of model, but the latent variables are furthermore related by a
linear structural matrix equation

$$B\eta = \Gamma\xi + \zeta, \tag{2.12}$$

so that, in reduced form,

$$y = \Lambda_y B^{-1}(\Gamma\xi + \zeta) + \epsilon, \tag{2.13}$$
where $\Phi = E(\xi\xi')$, $\Psi = E(\zeta\zeta')$, $\Theta_\delta = E(\delta\delta')$, and $\Theta_\epsilon = E(\epsilon\epsilon')$. Models of a similar nature
have been considered by Hausman (1977), Hsiao (1976), Geraci (1977), Robinson
(1977), and Wold (1980), but the Jöreskog-Keesling-Wiley model, also known as
LISREL, has received the widest attention and application.
The model represented by (2.11)-(2.14) represents a generalization of econo-
metric simultaneous equation models. When variables have no measurement
structure (2.11), the simultaneous equation system (2.12) can generate path
analytic models, multivariate regression, and a variety of other MV models.
Among such models are recursive and nonrecursive structures. Somewhat para-
doxically, nonrecursive structures are those that allow true 'simultaneous' or
reciprocal causation between variables, while recursive structures do not. Recur-
sivity is indicated by a triangular form for the parameters B* in (2.12). Recursive
structures have been favored as easier to interpret causally (Strotz and Wold,
1960), and they are less difficult to estimate. Models with a measurement
structure (2.11) allow recursivity or nonrecursivity at the level of LVs.
When the covariates in analysis of covariance (ANCOVA) are fallible--the
usual case--it is well known that ANCOVA does not make an accurate adjust-
ment for the effects of the covariate (e.g., Lord, 1960). A procedure for analysis of
covariance with a measurement model for the observed variables was developed
by S6rbom (1978). This model is a multiple-population structural equation model
where common factor LVs are independent of errors of measurement LVs. The
vector of criterion variables in the $g$th group is $y^g$, and $x^g$ is the vector of
covariates. The latent variables $\eta^g$ and $\xi^g$ are related by

$$\eta^g = \alpha^g + \Gamma^g \xi^g + \zeta^g. \tag{2.16}$$

The expected values of the latent variables are $E(\xi^g) = \mu_\xi^g$ and $E(\eta^g) = \mu_\eta^g$.
Consequently one obtains the expectations $E(x^g) = \mu_x^g = \nu_x + \Lambda_x \mu_\xi^g$ and $E(y^g) =
\mu_y^g = \nu_y + \Lambda_y \mu_\eta^g$. Then, rewriting (2.15) and (2.17), one obtains
where $\xi^g = \mu_\xi^g + \xi^{g*}$ and the expected values of $\xi^{g*}$, $\zeta^g$, $\epsilon^g$, and $\delta^g$ are null vectors.
The covariances of (2.18) are taken to be

$$\Sigma_{xx}^g = \Lambda_x \Phi^g \Lambda_x' + \Theta_\delta^g, \qquad \Sigma_{xy}^g = \Lambda_x \Phi^g \Gamma^{g\prime} \Lambda_y' + \Theta_{\delta\epsilon}^g$$

and

$$\Sigma_{yy}^g = \Lambda_y (\Gamma^g \Phi^g \Gamma^{g\prime} + \Psi^g) \Lambda_y' + \Theta_\epsilon^g, \tag{2.19}$$

where the covariance matrices of the $\xi^{g*}$ and $\zeta^g$ are given by $\Phi^g$ and $\Psi^g$, and
where the various $\Theta^g$ matrices represent covariances among the errors.
The structural equation models described above have been conceptualized as
limited in their application to situations involving latent structural relations in
which the MVs are related to LVs by a first-order factor analytic model. Causal
relations involving higher-order factors, such as 'general intelligence', have not
been considered. Weeks (1978) has developed a comprehensive model that
overcomes this limitation. The measurement model for the gth population is given
by

$$x = \mu_x + \Lambda_x \xi \quad \text{and} \quad y = \mu_y + \Lambda_y \eta, \tag{2.20}$$

where the superscript $g$ for the population has been omitted for simplicity of
notation. The components of the measurement model (2.20) are of the form (2.6)
and (2.7), but they are written in supermatrix form; for example, $\Lambda_x =
[\Lambda_x^1 \cdots \Lambda_x^k, \Lambda_x^1 \cdots \Lambda_x^{k-1}, \ldots, \Lambda_x^1]$ and, similarly, $\xi' = [\xi^{k\prime}, \xi^{k-1\prime}, \ldots, \xi^{1\prime}]$, where the
superscripts index the order of the factors. The structural relations among the
latent variables may be written

$$\eta = B_0 \eta + \Gamma \xi + \zeta, \tag{2.21}$$

where $E((\xi - \mu_\xi)(\xi - \mu_\xi)') = \Phi$, $E(\zeta\zeta') = \Psi$, $B = (I - B_0)$, and where $\Phi$ and $\Psi$ are
typically block-diagonal. Although (2.22) has a relatively simple representation
due to the supermatrix notation, quite complex models are subsumed by it. It may
be noted that (2.22) is similar in form to the Jöreskog-Keesling-Wiley structure
(2.19), but the matrices involved are supermatrices and one has the flexibility of
using primary or multilevel orthogonalized factors in structural relations. See
Bentler (1982) for a further discussion and Weeks (1980) for an application of
higher-order LVs, and Bentler and Weeks (1979) for algebraic analyses that
evaluate the generality and specialization possible among models (2.1)-(2.22).
It is apparent that a variety of LV models exist, and that the study of
higher-level LVs and more complex causal structures has typically been associated
with increasingly complex mathematical representations. It now appears that
arbitrarily complex models can be handled by very simple representations, based
on the idea of classifying all variables, including MVs and LVs, into independent
or dependent sets. As a consequence, a coherent field of multivariate analysis with
latent variables can be developed, based on linear representations that are not
more complex than those of traditional multivariate analysis.
system. Obviously, the vector $\eta$ represents more than the 'endogenous' variables
discussed in econometrics, and $\beta_0$ represents all coefficients for structural relations
among dependent variables, including the coefficients governing the relation
of lower-order factors to higher-order factors, excepting those residuals and the
highest-order factors that are never dependent variables in any equation. The
vector $\xi$ contains those MVs and LVs that are not functions of other manifest or
latent variables, and typically it will consist of three types of variables, $\xi' =
[x', f', e']$, namely, the random vector $x$ of MVs that are 'exogenous' variables as
conceived in econometrics, residual LV variables $f$ or orthogonalized factors, and
errors of measurement or unique LV factors $e$. Note that in a complete LV model,
where every MV is decomposed into latent factors, there will be no '$x$' variables.

While the conceptualization of residual variables and errors of measurement as
independent variables in a system is a novel one, particularly because these
variables are rarely if ever under experimental control, this categorization of
variables provides a model of great flexibility. In this approach, since $\gamma$ represents
the structural coefficients for the effects of all independent variables, the coefficients
for residual and error independent variables are typically fixed at unit values.
y = μ_y + G_y η   and   x = μ_x + G_x ξ,   (3.2)
where G_x and G_y are known matrices with zero entries except for a single unit in each row to select y from η and x from ξ. For definiteness we shall assume that there are p observed dependent variables and q observed independent variables. The vectors μ_y (p × 1) and μ_x (q × 1) are vectors of means. Letting z′ = [y′, x′], the selection model (3.2) can be written more compactly as
z = μ + Gν,   (3.3)
where μ′ = [μ_y′, μ_x′], ν′ = [η′, ξ′], and G is a 2 × 2 supermatrix containing the rows [G_y, 0], [0, G_x].
where E(ν) = μ_ν = TZU, with T and Z being parameter matrices of fixed, free, or
constrained elements and with U being a known vector. The use of means that are
where Φ is the covariance matrix of the independent variables ξ. Eq. (3.5) may be more simply represented as
Σ = G B^{-1} Γ Φ Γ′ B′^{-1} G′,   (3.6)
where Γ′ = [γ′, I], B_0 has rows [β_0, 0] and [0, I], and B = I − B_0. The orders of the matrices in (3.6) are given by G(r × s), B(s × s), Γ(s × n), and Φ(n × n), where r = p + q and s = m + n.
In general, a model of the form (3.1)-(3.6) can be formulated for each of
several populations, and the equality of parameters across populations can be
evaluated. Such a multiple-population model is relevant, for example, to factor
analysis in several populations (e.g., Sörbom, 1974) or to the analysis of covariance with latent variables (Sörbom, 1978), but these developments will not be pursued here. We concentrate on a single population with the structure (3.4) and (3.6), with μ_ν = 0.
It is possible to drop the explicit distinction between dependent and independent variables (Bentler and Weeks, 1979). All structural relations would be represented in β_0, and all variables with a null row in β_0 would be independent variables. The matrix Φ will now be of order equal to the number of independent plus dependent variables. The rows and columns (including diagonal elements) of Φ corresponding to dependent variables will be fixed at zero. The model is simpler in terms of number of matrices:
Σ = G(I − β_0)^{-1} Φ (I − β_0)^{-1′} G′.
4. Parameter identification
Similarly, from the first-order necessary condition, if θ̂ exists, there corresponds a vector of Lagrange multipliers λ′ = (λ_1, …, λ_r) such that
made of the information matrix ½M, with elements M(θ)(i, j) = tr(Σ^{-1} Σ̇_i Σ^{-1} Σ̇_j), where Σ̇_i = ∂Σ/∂θ_i (see Lee and Jennrich, 1979). The results include the following six propositions.
(a) The generalized least squares estimator θ̂ is consistent.
(b) The joint asymptotic distribution of the random variables n^{1/2}(θ̂ − θ_0) and n^{1/2}λ̂ is multivariate normal with zero mean vector and covariance matrix
For any choice of error function, and for almost any LV model with parameters subject to no constraints or to simple equality and proportionality constraints, parameter estimates can be obtained by one of several nonlinear programming algorithms. Certain algorithms commonly used in moment structure analysis will be briefly considered. All algorithms to be considered here may be written as
where θ_k is the vector of parameter estimates at the kth iteration. The vector θ has as many elements as there are nondependent parameters, i.e., the number of free parameters after considering equality and proportionality constraints. N_k is a square symmetric positive definite matrix, and g_k is the gradient
θ_{k+1} = θ_k + α[H(W ⊗ W)H′]^{-1} H(W ⊗ W) Vec(S − Σ),   (6.2)
where Vec stacks the elements of the subsequent matrix into a vector. Thus, under an appropriate choice of W, one may obtain least-squares (W = I), generalized least-squares (W = S^{-1}), or maximum-likelihood (W = Σ̂^{-1}) estimates from the modified Gauss-Newton algorithm. In an empirical comparison of the algorithms considered here (except steepest descent) for the orthogonal factor model, Lee and Jennrich (1979) found the modified Gauss-Newton algorithm to be a cost-efficient statistical optimizer.
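To illustrate (6.2), the following sketch applies the iteration to a hypothetical one-factor model Σ(θ) = λλ′ + diag(ψ) with W = S^{-1}; the model, the analytic derivatives, and the step size α are illustrative assumptions, not part of the original development. Since every matrix involved is symmetric, row-major raveling may stand in for the column-stacking Vec operator.

```python
import numpy as np

def sigma(theta, p):
    # Hypothetical one-factor structure: Sigma(theta) = lam lam' + diag(psi).
    lam, psi = theta[:p], theta[p:]
    return np.outer(lam, lam) + np.diag(psi)

def jac(theta, p):
    # H: row i holds d vec(Sigma)/d theta_i (analytic derivatives).
    lam = theta[:p]
    H = np.zeros((2 * p, p * p))
    for i in range(p):
        dS = np.zeros((p, p))
        dS[i, :] += lam                      # e_i lam' part of d(lam lam')/d lam_i
        dS[:, i] += lam                      # lam e_i' part
        H[i] = dS.ravel()
        dP = np.zeros((p, p))
        dP[i, i] = 1.0                       # d diag(psi)/d psi_i
        H[p + i] = dP.ravel()
    return H

def modified_gauss_newton(S, theta0, alpha=1.0, tol=1e-8, max_iter=200):
    # Iteration (6.2) with the generalized least-squares choice W = S^{-1}.
    p = S.shape[0]
    W = np.linalg.inv(S)
    WW = np.kron(W, W)
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(max_iter):
        H = jac(theta, p)
        resid = (S - sigma(theta, p)).ravel()            # Vec(S - Sigma)
        step = alpha * np.linalg.solve(H @ WW @ H.T, H @ WW @ resid)
        theta += step
        if np.max(np.abs(step)) < tol:
            break
    return theta
```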
For most moment structure models, the constrained generalized least squares estimator θ̂ and the corresponding Lagrange multipliers λ̂ cannot be solved for in closed form; thus, some nonlinear iterative procedure has to be used. Among other methods, the penalty function technique developed by Fiacco and McCormick (1968) has been accepted as an effective method of constrained optimization. Based on this technique, Lee and Bentler proposed an algorithm as
follows: (a) Choose a scalar c_1 > 0 and initial values of θ. (b) Given c_k > 0 and θ_k, by means of the Gauss-Newton algorithm (6.2) search for a minimum point θ_{k+1} of the function Q(θ) + c_k Σ_t Φ(h_t(θ)), where Φ is a real-valued differentiable function such that Φ(x) ≥ 0 for all x and Φ(x) = 0 if and only if x = 0. (c) Update k, increase c_{k+1}, and return to (b) with θ_{k+1} as the initial values. The process is terminated when the absolute values of
max_i |θ_{k+1}(i) − θ_k(i)|   and   max_t |h_t(θ_{k+1})|   (6.4)
are less than ε, where ε is a predetermined small real number. The algorithm will converge to the constrained generalized least squares estimator, if it exists. It has been shown by Fiacco and McCormick (1968) and by Luenberger (1973) that, if the algorithm converges to θ̂, the corresponding Lagrange multipliers are given by
follows that the elements of the unreduced gradient g* are stacked into the vector g*′ = [g(Φ)′, g(Γ)′, g(B_0)′], whose vector components are given by (6.7), while the corresponding blocks of the matrix of (6.8) are

(Γ′VΓ ⊗ Γ′VΓ),
2(VΓ ⊗ C′VΓ),   2[(V ⊗ C′VC) + E(C′V ⊗ VC)],
2(VΓ ⊗ DVΓ),   2[(V ⊗ DVC) + E(DV ⊗ VC)],   2[(V ⊗ DVD′) + E(DV ⊗ VD′)].   (6.8)

In (6.5), C = ΓΦ, D = B^{-1}ΓΦ, and V = B′^{-1}G′WGB^{-1}.
The matrix ∂Σ/∂θ* contains derivatives with respect to all possible parameters in the general model. In specific applications certain elements of Φ, Γ, and B_0 will be known constants, and the corresponding rows of ∂Σ/∂θ* must be eliminated. In addition, certain parameters may be constrained, as mentioned above. For example, Φ is a symmetric matrix, so that off-diagonal equalities must be introduced. The effect of constraints is to delete rows of ∂Σ/∂θ* corresponding to constrained parameters and to transform a row i of ∂Σ/∂θ* into a weighted sum of rows i, j for the constraint θ_i = w_j θ_j. These manipulations performed on (6.7) transform it into the (q × 1) vector g and, when carried into the rows and columns of (6.8), transform it into the (q × q) matrix N, where q is the number of nondependent parameters. The theory of Lee and Bentler (1980) for estimation with arbitrarily constrained parameters, described above, can be used with the proposed penalty function technique to yield a wider class of applications of the general model (3.6) than have yet appeared in the literature.
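A schematic rendering of steps (a)-(c) above, assuming the admissible choice Φ(x) = x², equality constraints h_t(θ) = 0, and a caller-supplied discrepancy function Q; a generic quasi-Newton search is substituted for the Gauss-Newton step (6.2):

```python
import numpy as np
from scipy.optimize import minimize  # generic search standing in for (6.2) in step (b)

def penalty_fit(Q, h, theta0, c1=1.0, growth=10.0, eps=1e-6, max_outer=25):
    """Fiacco-McCormick penalty loop: minimize Q(theta) + c_k * sum_t h_t(theta)**2,
    with c_k increased until the two quantities in (6.4) fall below eps."""
    theta, c = np.asarray(theta0, dtype=float), c1
    for _ in range(max_outer):
        obj = lambda t: Q(t) + c * np.sum(np.asarray(h(t)) ** 2)   # Phi(x) = x**2
        new = minimize(obj, theta).x                               # step (b)
        if (np.max(np.abs(new - theta)) < eps and                  # termination (6.4)
                np.max(np.abs(np.asarray(h(new)))) < eps):
            return new
        theta, c = new, c * growth                                 # step (c)
    return theta
```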
7. Conclusion
The field of multivariate analysis with continuous latent and measured random
variables has made substantial progress in recent years, particularly from
mathematical and statistical points of view. Mathematically, clarity has been
achieved in understanding representation systems for structured linear random
variable models. Statistically, large sample theory has been developed for a
variety of competing estimators, and the associated hypothesis testing procedures
References
Aigner, D. J. and Goldberger, A. S., eds. (1977). Latent Variables in Socioeconomic Models. North-
Holland, Amsterdam.
Aitchison, J. and Silvey, S. D. (1958). Maximum likelihood estimation of parameters subject to
restraint. Ann. Math. Statist. 29, 813-828.
Algina, J. (1980). A note on identification in the oblique and orthogonal factor analysis models.
Psychometrika 45, 393-396.
Amemiya, T. (1977). The maximum likelihood and the nonlinear three-stage least squares estimator in
the general nonlinear simultaneous equation model. Econometrica 45, 955-968.
Anderson, T. W. (1958). An Introduction to Multivariate Statistical Analysis. Wiley, New York.
Anderson, T. W. (1973). Asymptotically efficient estimation of covariance matrices with linear
structure. Ann. Statist. 1, 135-141.
Anderson, T. W. (1976). Estimation of linear functional relationships: Approximate distributions and
connections with simultaneous equations in econometrics. J. Roy. Statist. Soc. Ser. B 38, 1-20.
Discussion, ibid 20-36.
Anderson, T. W. and Rubin, H. (1956). Statistical inference in factor analysis. Proc. 3rd Berkeley
Symp. Math. Statist. Prob. 5, 111-150.
Bentler, P. M. (1976). Multistructure statistical model applied to factor analysis. Multivariate Behav.
Res. 11, 3-25.
Bentler, P. M. (1980). Multivariate analysis with latent variables: Causal modeling. Ann. Rev. Psychol.
31, 419-456.
Bentler, P. M. (1982). Linear systems with multiple levels and types of latent variables. In: K. G.
Jöreskog and H. Wold, eds., Systems under Indirect Observation. North-Holland, Amsterdam [in
press].
Bentler, P. M. and Bonett, D. G. (1980). Significance tests and goodness of fit in the analysis of
covariance structures. Psych. Bull. 88, 588-606.
Bentler, P. M. and Lee, S. Y. (1978a). Statistical aspects of a three-mode factor analysis model.
Psychometrika 43, 343-352.
Bentler, P. M. and Lee, S. Y. (1978b). Matrix derivatives with chain rule and rules for simple,
Hadamard, and Kronecker products. J. Math. Psych. 17, 255-262.
Multivariate analysis with latent variables 769
Bentler, P. M. and Lee, S. Y. (1979a). A statistical development of three-mode factor analysis. British
J. Math. Statist. Psych. 32, 87-104.
Bentler, P. M. and Lee, S. Y. (1979b). Newton-Raphson approach to exploratory and confirmatory
maximum likelihood factor analysis. J. Chin. Univ. Hong Kong. 5, 562-573.
Bentler, P. M. and Weeks, D. G. (1978). Restricted multidimensional scaling models. J. Math. Psych.
17, 138-151.
Bentler, P. M. and Weeks, D. G. (1979). Interrelations among models for the analysis of moment
structures. Multivariate Behav. Res. 14, 169-185.
Bentler, P. M. and Weeks, D. G. (1980). Linear structural equations with latent variables. Psycho-
metrika 45, 289-308.
Bhargava, A. K. (1977). Maximum likelihood estimation in a multivariate 'errors in variables'
regression model with unknown error covariance matrix. Comm. Statist. A--Theory Methods 6,
587-601.
Bishop, Y. M. M., Fienberg, S. E. and Holland, P. W. (1975). Discrete Multivariate Analysis. MIT
Press, Cambridge, MA.
Bock, R. D. and Bargmann, R. E. (1966). Analysis of covariance structures. Psychometrika 31,
507-534.
Browne, M. W. (1974). Generalized least-squares estimators in the analysis of covariance structures.
South African Statist. J. 8, 1-24.
Browne, M. W. (1977). The analysis of patterned correlation matrices by generalized least squares.
British J. Math. Statist. Psych. 30, 113-124.
Browne, M. W. (1982). Covariance structures. In: D. M. Hawkins, ed., Topics in Applied Multivariate
Analysis. Cambridge University Press, London.
Chechile, R. (1977). Likelihood and posterior identification: Implications for mathematical psychol-
ogy. British J. Math. Statist. Psych. 30, 177-184.
Cochran, W. G. (1970). Some effects of errors of measurement on multiple correlation. J. Amer.
Statist. Assoc. 65, 22-34.
Deistler, M. and Seifert, H. G. (1978). Identifiability and consistent estimability in econometric
models. Econometrica 46, 969-980.
Dempster, A. P. (1969). Elements of Continuous Multivariate Analysis. Addison-Wesley, Reading, MA.
Dempster, A. P. (1971). An overview of multivariate data analysis. J. Multivariate Anal. 1, 316-346.
Feldstein, M. (1974). Errors in variables: A consistent estimator with smaller MSE in finite samples. J.
Amer. Statist. Assoc. 69, 990-996.
Fiacco, A. V. and McCormick, G. P. (1968). Nonlinear Programming. Wiley, New York.
Fink, E. L. and Mabee, T. I. (1978). Linear equations and nonlinear estimation: A lesson from a
nonrecursive example. Sociol. Methods Res. 7, 107-120.
Gabrielsen, A. (1978). Consistency and identifiability. J. Econometrics 8, 261-263.
Geraci, V. J. (1976). Identification of simultaneous equation models with measurement error. J.
Econometrics 4, 263-283.
Geraci, V. J. (1977). Estimation of simultaneous equation models with measurement error. Econometrica
45, 1243-1255.
Geweke, J. F. and Singleton, K. J. (1980). Interpreting the likelihood ratio statistic in factor models
when sample size is small. J. Amer. Statist. Assoc. 75, 133-137.
Gleser, L. J. (1981). Estimation in a multivariate 'errors in variables' regression model: Large sample
results. Ann. Statist. 9, 24-44.
Goldberger, A. S. and Duncan, O. D., eds. (1973). Structural Equation Models in the Social Sciences.
Academic Press, New York.
Goodman, L. A. (1978). Analyzing Qualitative/Categorical Data. Abt Books, Cambridge, MA.
Hausman, J. A. (1977). Errors in variables in simultaneous equation models. J. Econometrics 5,
389-401.
Hsiao, C. (1976). Identification and estimation of simultaneous equation models with measurement
error. Internat. Econom. Rev. 17, 319-339.
Jennrich, R. I. and Ralston, M. L. (1978). Fitting nonlinear models to data. Ann. Rev. Biophys. Bioeng.
8, 195-238.
770 P. M. Bentler and D. G. Weeks
Jöreskog, K. G. (1967). Some contributions to maximum likelihood factor analysis. Psychometrika 32,
443-482.
Jöreskog, K. G. (1969). A general approach to confirmatory maximum likelihood factor analysis.
Psychometrika 34, 183-202.
Jöreskog, K. G. (1970). A general method for analysis of covariance structures. Biometrika 57,
239-251.
Jöreskog, K. G. (1971). Simultaneous factor analysis in several populations. Psychometrika 36,
409-426.
Jöreskog, K. G. (1973). Analysis of covariance structures. In: P. R. Krishnaiah, ed., Multivariate
Analysis III, 263-285. Academic Press, New York.
Jöreskog, K. G. (1977). Structural equation models in the social sciences: Specification, estimation and
testing. In: P. R. Krishnaiah, ed., Applications of Statistics, 265-287. North-Holland, Amsterdam.
Jöreskog, K. G. (1978). Structural analysis of covariance and correlation matrices. Psychometrika 43,
443-477.
Jöreskog, K. G. and Goldberger, A. S. (1972). Factor analysis by generalized least squares. Psycho-
metrika 37, 243-260.
Jöreskog, K. G. and Goldberger, A. S. (1975). Estimation of a model with multiple indicators and
multiple causes of a single latent variable. J. Amer. Statist. Assoc. 70, 631-639.
Jöreskog, K. G. and Sörbom, D. (1978). LISREL IV Users Guide. Nat. Educ. Res., Chicago.
Keesling, W. (1972). Maximum likelihood approaches to causal flow analysis. Ph.D. thesis. University
of Chicago, Chicago.
Krishnaiah, P. R. and Lee, J. C. (1974). On covariance structures. Sankhyā 38, 357-371.
Lawley, D. N. (1940). The estimation of factor loadings by the method of maximum likelihood. Proc.
R. Soc. Edinburgh 60, 64-82.
Lawley, D. N. and Maxwell, A. E. (1971). Factor Analysis as a Statistical Method. Butterworth,
London.
Lawley, D. N. and Maxwell, A. E. (1973). Regression and factor analysis. Biometrika 60, 331-338.
Lazarsfeld, P. F. and Henry, N. W. (1968). Latent Structure Analysis. Houghton-Mifflin, New York.
Lee, S. Y. (1977). Some algorithms for covariance structure analysis. Ph.D. thesis. Univ. Calif., Los
Angeles.
Lee, S. Y. (1980). Estimation of covariance structure models with parameters subject to functional
restraints. Psychometrika 45, 309-324.
Lee, S. Y. and Bentler, P. M. (1980). Some asymptotic properties of constrained generalized least
squares estimation in covariance structure models. South African Statist. J. 14, 121-136.
Lee, S. Y. and Jennrich, R. I. (1979). A study of algorithms for covariance structure analysis with
specific comparisons using factor analysis. Psychometrika 44, 99-113.
Lord, F. M. (1960). Large-sample covariance analysis when the control variable is fallible. J. Amer.
Statist. Assoc. 55, 307-321.
Lord, F. M. and Novick, M. R. (1968). Statistical Theories of Mental Test Scores. Addison-Wesley,
Reading, MA.
Luenberger, D. G. (1973). Introduction to Linear and Nonlinear Programming. Addison-Wesley,
Reading, MA.
McDonald, R. P. (1978). A simple comprehensive model for the analysis of covariance structures.
British J. Math. Statist. Psych. 31, 59-72.
McDonald, R. P. and Krane, W. R. (1977). A note on local identifiability and degrees of freedom in
the asymptotic likelihood ratio test. British J. Math. Statist. Psych. 30, 198-203.
McDonald, R. P. and Krane, W. R. (1979). A Monte-Carlo study of local identifiability and degrees
of freedom in the asymptotic likelihood ratio test. British J. Math. Statist. Psych. 32, 121-132.
McDonald, R. P. and Mulaik, S. A. (1979). Determinacy of common factors: A nontechnical review.
Psych. Bull. 86, 297-306.
Monfort, A. (1978). First-order identification in linear models. J. Econometrics 7, 333-350.
Nel, D. G. (1980). On matrix differentiation in statistics. South African Statist. J. 14, 137-193.
Olsson, U. and Bergman, L. R. (1977). A longitudinal factor model for studying change in ability
structure. Multivariate Behav. Res. 12, 221-242.
Multivariate analysis with latent variables 771
Please, N. W. (1973). Comparison of factor loadings in different populations. British J. Math. Statist.
Psych. 26, 61-89.
Rao, C. R. (1971). Minimum variance quadratic unbiased estimation of variance components. J.
Multivariate Anal. 1, 445-456.
Rao, C. R. and Kleffe, J. (1980). Estimation of variance components. In: P. R. Krishnaiah and L. N.
Kanal, eds., Handbook of Statistics, Vol. I, 1-40. North-Holland, Amsterdam.
Robinson, P. M. (1974). Identification, estimation and large-sample theory for regressions containing
unobservable variables. Internat. Econom. Rev. 15, 680-692.
Robinson, P. M. (1977). The estimation of a multivariate linear relation. J. Multivariate Anal. 7,
409-423.
Rock, D. A., Werts, C. E. and Flaugher, R. L. (1978). The use of analysis of covariance structures for
comparing the psychometric properties of multiple variables across populations. Multivariate Behav.
Res. 13, 403-418.
Sörbom, D. (1974). A general method for studying differences in factor means and factor structure
between groups. British J. Math. Statist. Psych. 27, 229-239.
Sörbom, D. (1978). An alternative to the methodology for analysis of covariance. Psychometrika 43,
381-396.
Spearman, C. (1904). The proof and measurement of association between two things. Amer. J. Psych.
15, 72-101.
Steiger, J. H. (1979). Factor indeterminacy in the 1930's and the 1970's: Some interesting parallels.
Psychometrika 44, 157-167.
Steiger, J. H. and Schönemann, P. H. (1978). A history of factor indeterminacy. In: S. Shye, ed.,
Theory Construction and Data Analysis, Jossey-Bass, San Francisco.
Strotz, Robert H. and Wold, H. O. A. (1960). Recursive vs. nonrecursive systems: An attempt at
synthesis. Econometrica 28, 417-427.
Swain, A. J. (1975). A class of factor analytic estimation procedures with common asymptotic
sampling properties. Psychometrika 40, 315-335.
Thurstone, L. L. (1947). Multiple Factor Analysis. Univ. of Chicago Press, Chicago.
Tucker, L. R. (1966). Some mathematical notes on three-mode factor analysis. Psychometrika 31,
279-311.
Tukey, J. W. (1954). Causation, regression, and path analysis. In: O. K. Kempthorne, T. A. Bancroft,
J. W. Gowen and J. L. Lush, eds., Statistics and Mathematics in Biology, 35-66. Iowa State
University Press, Ames, IA.
Weeks, D. G. (1978). Structural equation systems on latent variables within a second-order measure-
ment model. Ph.D. thesis. Univ. of Calif., Los Angeles.
Weeks, D. G. (1980). A second-order longitudinal model of ability structure. Multivariate Behav. Res.
15, 353-365.
Wiley, D. E. (1973). The identification problem for structural equation models with unmeasured
variables. In: Goldberger and Duncan, eds., Structural Equation Models in the Social Sciences,
69-83. Academic Press, New York.
Wiley, D. E., Schmidt, W. H. and Bramble, W. J. (1973). Studies of a class of covariance structure
models. J. Amer. Statist. Assoc. 68, 317-323.
Williams, J. S. (1978). A definition for the common-factor analysis model and the elimination of
problems of factor score indeterminacy. Psychometrika 43, 293-306.
Wold, H. (1980). Model construction and evaluation when theoretical knowledge is scarce: An
example of the use of partial least squares. In: J. Kmenta and J. Ramsey, eds., Evaluation of
Econometric Models. Academic Press, New York.
Wright, S. (1934). The method of path coefficients. Ann. Math. Statist. 5, 161-215.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 773-791
Moshe Ben-Bassat
*This study was partially supported by the National Science Foundation Grant ECS-8011369 from
the Division of Engineering, and by United States Army Research Institute for the Behavioral and
Social Sciences contract number DAJA 37-81-C-0065.
If the cost for all types of correct classification is zero and the cost for all types
of incorrect classification is one, then the optimal Bayes decision rule assigns the
object to the class with the highest a posteriori probability. In this case, the Bayes
risk associated with a given feature X reduces to the probability of error, Pe(X),
which is expected after observing that feature:
P_e(X) = E_X{1 − max_{1≤i≤m} π̂_i(X)}.   (3)
If our objective is to minimize the classifier error rate,¹ and the measurement cost for all the features is equal, then the most appealing function for evaluating the potency of a feature to differentiate between the classes is the Pe(X) function.
Nevertheless, extensive research effort has been devoted to the investigation of other functions, mostly based on distance and information measures, as feature evaluation tools.
Section 2 formally introduces feature evaluation rules and illustrates two of
them by a numerical example. In Section 3 the limitations of the probability of
error rule are pointed out, while in Section 4 the properties desired from a
substitute rule are discussed. Section 5 reviews the major categories of feature
evaluation rules and provides tables which relate these rules to error bounds. The
use of error bounds for assessing feature evaluation rules and for estimating the
probability of error is discussed in Section 6. Section 7 concludes with a summary
of the theoretical and experimental findings so far and also provides some
practical recommendations.
¹ For instance, in the early stages of sequential interactive pattern recognition tasks, the user may be interested in a feature which maximally reduces the number of plausible classes. Such an objective is not necessarily the same as minimizing the expected classifier error rate for the next immediate stages.
Table 1
An example with binary features

Prior probabilities   Classes   X1     X2     X3     X4     X5     X6
0.25                     1      0.75   0.90   0.05   0.40   0.10   0.05
0.25                     2      0.10   0.45   0.52   0.40   0.90   0.07
0.25                     3      0.80   0.45   0.60   0.40   0.75   0.90
0.25                     4      0.85   0.01   0.92   0.40   0.80   0.90
Before proceeding, let us introduce the following example which will be used
throughout the paper to demonstrate several of the concepts discussed.
EXAMPLE 1. Consider a classification problem with four classes and six binary
features which is presented in Table 1. For practical illustration the classes may
be considered as medical disorders, while the features are symptoms, signs or
laboratory tests which are measured as positive/negative. The entries of the table
represent the respective conditional probabilities for positive results of the fea-
tures given the classes. The ordering of features by the Pe rule is given in Table 2.
It should be noted that feature preference may be a function of the prior
probabilities. For instance, if the prior probabilities for Example 1 were (0.1, 0.7, 0.1, 0.1), then Pe(X6) = 0.259 while Pe(X2) = 0.300, which means that under these prior probabilities X6 is preferred to X2 (Table 3).
Table 2
Feature ordering by the Pe rule, Π = (0.25, 0.25, 0.25, 0.25)

Feature   X2      X3      X6      X5      X1      X4
Pe        0.527   0.532   0.537   0.550   0.562   0.750

Table 3
Feature ordering by the Pe rule, Π = (0.1, 0.7, 0.1, 0.1)

Feature   X6      X5      X1      X2      X3      X4
Pe        0.259   0.280   0.285   0.300   0.300   0.300

Another frequently used feature evaluation rule is derived from Shannon's entropy, by which X is preferred to Y if the expected posterior uncertainty resulting from X is smaller than that resulting from Y. The ordering of the features of Example 1 by this rule, the H rule, is given in Table 4.

Table 4
Feature ordering by the H rule, Π = (0.25, 0.25, 0.25, 0.25)

Feature   X6      X2      X3      X5      X1      X4
H         1.399   1.640   1.666   1.673   1.698   2.000
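The entries of Tables 2-4 can be verified directly from Table 1. The sketch below computes, for each binary feature, the expected probability of error and the expected posterior Shannon entropy (in bits); the tabled values are recovered up to rounding in the last digit.

```python
import numpy as np

# P_i(X_j = +) from Table 1; rows are classes 1..4, columns are features X1..X6.
P_POS = np.array([[0.75, 0.90, 0.05, 0.40, 0.10, 0.05],
                  [0.10, 0.45, 0.52, 0.40, 0.90, 0.07],
                  [0.80, 0.45, 0.60, 0.40, 0.75, 0.90],
                  [0.85, 0.01, 0.92, 0.40, 0.80, 0.90]])

def pe_and_h(prior):
    """Pe(X) and expected posterior entropy H(X) for each binary feature."""
    pe = np.zeros(P_POS.shape[1])
    h = np.zeros(P_POS.shape[1])
    for j in range(P_POS.shape[1]):
        for cond in (P_POS[:, j], 1.0 - P_POS[:, j]):   # results + and -
            joint = prior * cond                         # pi_i P_i(x)
            px = joint.sum()                             # mixture P(x)
            post = joint / px                            # posteriors pi-hat_i(x)
            pe[j] += px * (1.0 - post.max())
            h[j] -= px * (post * np.log2(post)).sum()
    return pe, h

pe, h = pe_and_h(np.full(4, 0.25))                  # Pe as in Table 2, H as in Table 4
pe2, _ = pe_and_h(np.array([0.1, 0.7, 0.1, 0.1]))   # Pe as in Table 3
```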
Although the Pe rule seems to be the most natural feature evaluation rule, alternative feature evaluation rules are of great importance for the following reasons:
(1) The probability of error may not be sensitive enough for differentiating between good and better features. That is to say, the partition into equivalence groups induced over F by the Pe rule is not sufficiently refined. For instance, in Table 3, feature X4, which can contribute nothing to differentiating among the classes (since Pi(X4) is the same for every i, i = 1, 2, 3, 4), is considered by the Pe function as equivalent to features X3 and X2, which may certainly contribute to differentiating among the classes. For instance, if X2 is selected and a positive result is obtained, then class 4 is virtually eliminated [π̂4(X2 = +) = 0.0022], while the likelihood of class 1 is doubled [π̂1(X2 = +) = 0.20]. If a negative result is obtained for X2, similarly strong indications in the opposite direction are suggested [π̂4(X2 = −) = 0.18, π̂1(X2 = −) = 0.018]. On the other hand, if X4 is selected, the posterior probabilities of all the classes remain the same as the priors regardless of the result observed for X4. Yet the expected probability of error for X4 is 0.300, the same as for X2. The main reason for the insensitivity of the Pe function lies in the fact that, directly, the Pe function depends only on the most probable class, and that under certain conditions the prior most probable class remains unchanged regardless of the result for the observed feature [8].
(2) For optimal subset selection of K features out of N, when exhaustive search over all possible subsets is impractical, procedures based on the relative value of individual features have been suggested [20, 30, 44, 55, 59-61]. Using the probability of
error as the evaluation function for individual features does not ensure 'good'
error rate performance of the resulting subset, not even for the case of condition-
ally independent features [11, 13, 21, 56]. Alternative criteria for evaluating
individual features may provide better error rate performance of the resulting
subset, or may diminish the search efforts; see [59] for more details. For example,
Narendra and Fukunaga [45] have recently introduced a branch and bound
algorithm for optimal subset selection. However, although their algorithm is quite
general, it is more efficiently applied if the criterion function satisfies a recursive
formula which expresses the value for t - 1 features by means of its value for t
features. Such a recursive formula is not satisfied by the Pe function but it is
satisfied by other functions, e.g., the divergence and the Bhattacharyya distance for
the case of normal distributions.
(3) In sequential classification, when dynamic programming procedures cannot
be used due to computational limitations, myopic policies are usually adopted, by
which the next feature to be tested is that feature which optimizes a criterion
function for just one or a few steps ahead. Usually, the objective is to reach a
predetermined level of the probability of error by a minimum number of features.
This objective is not necessarily achieved when the Pe function is used as the
myopic feature evaluation function, and substitute rules may perform better.
Experience with several myopic rules for the case of binary features is reported by
Ben-Bassat [4]. In fact, the author found that when ties are broken arbitrarily
for the Pe rule, this rule may be very inefficient under a myopic policy. The main
reason for the low efficiency of the Pe function in myopic sequential classification
is its low sensitivity for differentiating between good and better features, particu-
larly in advanced stages.
(4) Computation of the probability of error involves integration of the function max{π̂1(x), …, π̂m(x)}, which usually cannot be done analytically. Numerical integration, on the other hand, is a tedious process which becomes particularly difficult and inaccurate when continuous and multidimensional features are evaluated. For certain class distributions, alternative feature evaluation functions may lead to closed formulas which greatly facilitate the feature evaluation task. For instance, in the two-class case with Gaussian features, the Kullback-Leibler divergence and the Bhattacharyya coefficient are simple functions of the mean vectors and the covariance matrices [29].
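For reference, the closed form alluded to here (see Kailath [29]) can be written out. For two Gaussian classes N(μ₁, Σ₁) and N(μ₂, Σ₂), the Bhattacharyya distance is

$$
B = \frac{1}{8}(\mu_2-\mu_1)'\Big[\frac{\Sigma_1+\Sigma_2}{2}\Big]^{-1}(\mu_2-\mu_1)
  + \frac{1}{2}\ln\frac{\det\big[\frac{1}{2}(\Sigma_1+\Sigma_2)\big]}{\sqrt{\det\Sigma_1\,\det\Sigma_2}},
$$

and the Bhattacharyya coefficient is ρ = e^{−B}, so that only the mean vectors and covariance matrices enter the computation.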
When the reasons for considering alternative feature evaluation rules are the insensitivity of the Pe rule and/or computational difficulties, and our objective is still the minimization of the expected Pe, then an ideal alternative rule is one which does not contradict the Pe rule and perhaps refines it. That is, if X is not preferred to Y by the Pe rule, then it is not preferred to Y by that alternative rule either. Among features which are indifferent by the Pe rule, it is possible, and
5.1. Overview
Feature evaluation rules may be classified into three major categories.
(1) Rules derived from information measures (also known as uncertainty
measures).
(2) Rules derived from distance measures.
(3) Rules derived from dependence measures.
The assignment of feature evaluation rules to these categories may be equivocal, since several feature evaluation rules may be placed in different categories when considered from different perspectives. Moreover, we often found that a certain feature evaluation rule in category i may be obtained as a mathematical transformation of another rule in category j.
In the following sections we introduce a unified framework for each of these
categories and construct tables that contain representatives of each category along
with their relationships to the probability of error.
A feature evaluation rule derived from the concept of information gain states that X is preferred to Y if I(X) > I(Y). Since the prior uncertainty U(Π) is independent of the feature under evaluation, this rule is equivalent to a rule which says: X is preferred to Y if U(X) < U(Y), where
U(X) = E_X{U(π̂(X))}   (7)
is the expected posterior uncertainty.
Table 5
Uncertainty measures and the tightest bounds on the probability of error by their means (entries illegible in the source)
Ben-Bassat [4] reports on experiments with this measure for a sequential multiclass classification problem using conditionally independent binary features.
Devijver [18] and Toussaint [61, 64] relate the entropy of degree α to the Bayesian probability of error and to the nearest neighbor probability of error.
Except for Renyi's entropy, all of the above functions are special cases of the f-entropy family, which is given by
U(Π) = Σ_{i=1}^m f(π_i),   (8)
where f is strictly concave, f″ exists, and f(0) = lim_{ε→0} f(ε) = 0. The tightest upper and lower bounds on f-entropies by means of the probability of error are presented in [6]. Substituting the appropriate f in these bounds, we obtain the above-mentioned bounds as special cases (see Table 5).
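As a small illustration (the particular instances of f are standard choices, not taken from the source), Shannon's entropy corresponds to f(π) = −π log₂ π and the quadratic entropy to f(π) = π(1 − π); both are strictly concave with f(0) = 0:

```python
import numpy as np

def f_entropy(pi, f):
    # f-entropy of (8): U(Pi) = sum_i f(pi_i), for strictly concave f with f(0) = 0.
    pi = np.asarray(pi, dtype=float)
    return f(pi[pi > 0]).sum()

shannon = lambda p: -p * np.log2(p)       # Shannon's entropy as an f-entropy
quadratic = lambda p: p * (1.0 - p)       # quadratic entropy as an f-entropy

print(f_entropy([0.25] * 4, shannon))     # 2.0 bits for four equiprobable classes
print(f_entropy([0.25] * 4, quadratic))   # 0.75
```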
This extension to the multiclass case was used by Fu, Min and Li [22] for the Kullback-Leibler divergence measure, by Lainiotis [36] for the Bhattacharyya distance, and by Toussaint [57] for the Kolmogorov variational distance.
The major disadvantage of this approach is that one large value of d_ij(X) may dominate the value of D(X) and impose a ranking which reflects only the distance between the two most separable classes.
An alternative approach, also based on the d_ij values, suggests preferring X to Y if X discriminates 'better' between the two most confusable pair of classes, i.e. if
min_{i,j} d_ij(X) > min_{i,j} d_ij(Y).   (10)
By its definition, the drawback of this approach is that it takes into account the distance between the closest pair of classes only. Greetenberg [26] compares the methods of (9) and (10).
Another distance measure between the conditional density functions for the multiclass case is Matusita's [42, 43] extension of the affinity measure. Its relationship to the probability of error and to the average of the Kullback-Leibler divergence over all possible pairs of classes is discussed by Toussaint [62-64]. An axiomatic characterization of this measure is given by Kaufman and Mathai [31].
Glick [24] presents some general results concerning distance measures and the probability of error.
All of the distance functions that are used to derive feature evaluation rules by looking at the distance between p1(x) and p2(x) may also be used to derive corresponding versions of these rules by looking at the expected distance between π̂1(x) and π̂2(x) with respect to the mixed distribution of X.
Using this approach, it can be shown, see, e.g. [39], that for the two-class case the Kolmogorov distance is directly related to the probability of error by
P_e(X) = ½[1 − ∫ |π̂1(x) − π̂2(x)| P(x) dx].   (12)
Table 6
Distance measures on the prior and posterior class-probabilities

Bayesian distance           ∫ [Σ_i π̂_i²(x)] P(x) dx
Directed divergence         ∫ [Σ_i π̂_i(x) log(π̂_i(x)/π_i)] P(x) dx
Divergence of order α > 0   (1/(α−1)) ∫ [log Σ_i π̂_i^α(x) π_i^{1−α}] P(x) dx
Variance                    ∫ [Σ_i (π̂_i(x) − π_i)²] P(x) dx
In special cases these rules may be expressed in a closed form. Such is the case, for instance, with multivariate Gaussian features and the Bhattacharyya, Kullback-Leibler and Matusita distances, see [23].
Distance functions between the prior and posterior class probabilities have also been proposed as a tool for feature evaluation. The rationale behind this approach is that a feature which may change our prior assessment concerning the true class more drastically is a better feature. In principle this approach is similar to the information gain approach, except that distance functions are used instead of information functions. Several examples are included in Table 7.
Let us note that the directed divergence in Table 7 equals the information gain obtained from Shannon's entropy in Table 5 [25]. This illustrates the duplication of rules in different categories.
Table 7
Distance measures between the prior and posterior class probabilities, with the corresponding bounds on the probability of error (entries illegible in the source)
Table 8
Dependence measures expressed as distance measures between the
class-conditional probabilities and the mixed probability [66]

Kolmogorov   Σ_i π_i ∫ |P_i(x) − P(x)| dx   (remaining entries illegible in the source)
random variables. Silvey [54] and Ali and Silvey [2] discuss general dependence measures with respect to Renyi's postulates. The use of dependence measures for feature evaluation in pattern recognition started with Lewis' [37] work, where he used Shannon's mutual information for expressing the dependence between features and classes. This measure is given by

R(X) = Σ_{i=1}^m ∫ P(X, C_i) log [P(X, C_i) / (P(X) P(C_i))] dX,   (14)

where P(C_i) = π_i, P(X, C_i) = π_i P_i(X), and P(X) = Σ_{i=1}^m π_i P_i(X).
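A minimal sketch of (14) for a binary feature, using base-2 logarithms (the base is an assumption; the source does not specify it), applied to feature X2 of Example 1:

```python
import numpy as np

def mutual_information(prior, p_pos):
    """R(X) of (14) for a binary feature: p_pos[i] = P_i(X = +)."""
    prior, p_pos = np.asarray(prior, float), np.asarray(p_pos, float)
    r = 0.0
    for cond in (p_pos, 1.0 - p_pos):
        joint = prior * cond                     # P(X = x, C_i) = pi_i P_i(x)
        r += (joint * np.log2(joint / (joint.sum() * prior))).sum()
    return r

# Feature X2 of Table 1 under uniform priors: about 0.36 bits.
print(mutual_information([0.25] * 4, [0.90, 0.45, 0.45, 0.01]))
```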
Considering (14) as a distance measure between the probability functions P(X, C) and P(X)P(C), Vilmansen [66] proposes a set of dependence measures for feature evaluation which are based on various distance functions. He shows that these dependence measures attain their minimum when X and C are statistically independent and attain their maximum when each value of X is associated with one value of C, i.e. for every x there exists t such that π̂_t(x) = 1 and π̂_i(x) = 0, i ≠ t. Using some algebraic manipulation on the original formulation of these dependence measures, Vilmansen also shows that they may be expressed as the average of the distance measures between each of the class conditional probabilities P_i(X) and the mixed distribution P(X), see Table 8. All these properties provide a solid justification for using dependence measures as a tool for feature evaluation. Vilmansen's paper also contains error bounds by means of these dependence measures, which are based on the bounds of Table 7.
Let us note that the dependence measure based on the Kullback-Leibler divergence, also known as the Joshi measure, is mathematically identical to the Kullback-Leibler distance measure between the conditional probabilities [33].
For practical purposes, ideal rules are not stringently required. For, if Pe(X) is only slightly smaller than Pe(Y), then we are not too concerned if an alternative feature evaluation rule prefers Y to X. However, for a given rule it is desirable to know a priori how far it may deviate from being ideal. Namely, if a given rule may prefer Y to X while in fact Pe(X) ≤ Pe(Y), then we would like to know to what extent X is better than Y by means of their Pe. The concept of ε-equivalence [5] is designed to answer this question, and it also provides some further justification for using the upper and lower bounds as indicators of deviation from ideality.
Briefly, two features X and Y are said to be ε-equivalent if the difference between their corresponding probabilities of error is less than ε, i.e. |Pe(X) − Pe(Y)| < ε. A feature X is said to be ε-preferred to Y if Pe(Y) > Pe(X) + ε; namely, the expected probability of error from X is not just smaller than that for Y, it is smaller by more than ε. For a given ε, the grouping and ordering of features by the ε-equivalence and ε-preference relations may be considered as a distortion of the original grouping and ordering induced by the Pe rule. As ε goes up, the degree of distortion increases, and at a certain level this distortion is maximized by grouping all of the features into a single ε-equivalence group.
Considering the example of Section 2, if we are willing to tolerate differences of at most ε = 0.014, then the original ordering induced by the Pe rule (Table 2) is distorted as shown in Table 9. In this table we see that for ε = 0.014 the ε-equivalence groups are {X2, X3, X6}, {X6, X5}, {X5, X1} and {X4}. Looking at the ranking induced by the H rule (Table 4), we conclude that for this example with ε = 0.014 the Pe rule and the H rule may be considered equivalent.
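The grouping can be reproduced from the Pe values of Table 2. In the sketch below, the convention that a group is a maximal run of consecutively ranked features whose Pe values span at most ε is an assumption chosen to be consistent with the groups listed above:

```python
def epsilon_groups(pe, eps):
    feats = sorted(pe, key=pe.get)               # rank features by Pe
    vals = [pe[f] for f in feats]
    groups, last_end = [], -1
    for i in range(len(feats)):
        j = i
        while j + 1 < len(feats) and vals[j + 1] - vals[i] <= eps:
            j += 1                               # extend the run while within eps
        if j > last_end or not groups:           # keep runs that add a new feature
            groups.append(feats[i:j + 1])
            last_end = j
    return groups

pe = {'X2': 0.527, 'X3': 0.532, 'X6': 0.537, 'X5': 0.550, 'X1': 0.562, 'X4': 0.750}
print(epsilon_groups(pe, 0.014))
# [['X2', 'X3', 'X6'], ['X6', 'X5'], ['X5', 'X1'], ['X4']] -- as in Table 9
```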
For a given feature evaluation function, let ε̲ denote the lowest ε for which, if X is not preferred to Y by that rule, then either Y is ε̲-equivalent to X or Y is ε-preferred to X for some ε > ε̲. The value of ε̲ may serve as a measure of the deviation of a given function from being an ideal feature evaluation function. The smaller ε̲, the closer the function is to being ideal. An important result is that ε̲ is directly related to the tightest lower and upper bounds on the probability of error by means of the feature evaluation function. Let
P̄_e(u) = sup{P_e(X) | X ∈ F, U(X) = u},   (15)
P̲_e(u) = inf{P_e(X) | X ∈ F, U(X) = u},   (16)
d(u) = P̄_e(u) − P̲_e(u).   (17)
Table 9
Feature ranking by the Pe rule when differences of at most ε = 0.014 are tolerated, Π = (0.25, 0.25, 0.25, 0.25)

{X2  X3  X6}  {X6  X5}  {X5  X1}  {X4}

ε̲ = sup_u d(u).   (18)

Practically, this result suggests that the maximum difference between the upper and lower bounds is the tolerance level for a given feature evaluation function.
Table 1 from Ben-Bassat [5] lists several values of ε̲ for the quadratic and Shannon's entropy. This table demonstrates that, for the two-class case, if we are willing to consider two features as equal as long as the difference between their corresponding probabilities of error is no more than 0.162, then Shannon's entropy can be used to replace the Pe function in every problem. In most practical problems a much lower tolerance level is required for exchanging the two rules.
7. Summary
compared the quadratic and Shannon's information gain rules and obtained 0.947
correlation; Vilmansen [66] compared various dependence measures and obtained
correlations above 0.96, and Backer and Jain [3] compared eleven rules from the
three categories and obtained correlations above 0.84 with most of them above
0.91. These findings suggest that if we decide to avoid the Pe rule, then computa-
tional efficiency should be the key factor in determining the feature evaluation
rule to be used.
(4) Using the probability of error as the evaluation function of individual
features in suboptimal algorithms for subset selection (forward, backward or
other) does not ensure that the resulting subset will be even close to optimal [13].
The author is not aware of experiments with these algorithms which used evaluation functions other than the Pe. The findings of the previous paragraph do not suggest that much difference would be detected among the various feature evaluation rules, but perhaps they may perform better than the Pe rule.
(5) Experiments with sequential feature selection under myopic policy were
reported in [4] for the case of conditionally independent binary features. The
conclusion from these experiments using many simulated data sets was that no
rule is consistently superior to the others, and that no specific strategy for
alternating the rules seems to be significantly more efficient.
(6) In the past, computational difficulties with the Pe rule constituted the major reason for recommending a substitute rule. In the author's opinion, the insensitivity of the Pe rule is of much greater significance for avoiding the Pe rule, even when a substitute rule does not offer any computational advantage. As was pointed out in [8], the conditions under which the Pe rule becomes highly insensitive are quite often satisfied when one class has a relatively high prior probability compared to the other classes. In this case the use of a substitute rule is highly recommended under all circumstances, i.e., for evaluating individual features in an algorithm for subset selection, or for evaluating individual features in sequential feature selection.
References
[1] Aczel, J. and Daroczy, Z. (1975). On Measures of Information and Their Characterization.
Academic Press, New York.
[2] Ali, S. M. and Silvey, S. D. (1966). A general class of coefficients of divergence of one
distribution from another. J. Roy. Statist. Soc. Ser. B 28, 131-142.
[3] Backer, E. and Jain, A. K. (1976). On feature ordering in practice and some finite sample effects.
Proc. Third Internat. Joint Conf. Pattern Recognition 45-49. San Diego, CA.
[4] Ben-Bassat, M. (1978a). Myopic policies in sequential classification. IEEE Trans. Comput. 27,
170-174.
[5] Ben-Bassat, M. (1978b). ε-equivalence of feature selection rules. IEEE Trans. Inform. Theory 24,
769-772.
[6] Ben-Bassat, M. (1978c). f-entropies, probabilities of error and feature selection. Inform. and
Control 39, 227-242.
[7] Ben-Bassat, M. and Raviv, J. (1978). Renyi's entropy and the probability of error. IEEE Trans.
Inform. Theory 24, 324-331.
[8] Ben-Bassat, M. (1980). On the sensitivity of the probability of error rule for feature selection.
IEEE Trans. Pattern Anal. Machine Intell. 2, 57-60.
[9] Chen, C. H. (1971). Theoretical comparison of a class of feature selection criteria in pattern
recognition. IEEE Trans. Comput. 20, 1054-1056.
[10] Chen, C. H. (1976). On information and distance measures, error bounds and feature selection.
Inform. Sci. 10, 159-173.
[11] Cover, T. M. (1974). The best two independent measurements are not the two best. IEEE Trans.
Systems Man Cybernet. 4, 116-117.
[12] Cover, T. M. and Hart, P. E. (1967). Nearest neighbor pattern classification. IEEE Trans.
Inform. Theory 13, 21-27.
[13] Cover, T. M. and Van Campenhout, J. M. (1977). On the possible orderings in the measurement
selection problem. IEEE Trans. Systems Man Cybernet. 7, 657-660.
[14] Daroczy, Z. (1970). Generalized information functions. Inform. and Control 16, 36-51.
[15] DeGroot, M. (1962). Uncertainty, information and sequential experiments. Ann. Math. Statist.
33, 404-419.
[16] DeGroot, M. (1970). Optimal Statistical Decisions. McGraw-Hill, New York.
[17] Devijver, P. A. (1973). On a class of bounds on Bayes risk in multihypothesis pattern
recognition. IEEE Trans. Comput. 23, 70-80.
[18] Devijver, P. A. (1977). Entropies of degree β and lower bounds for the average error rate.
Inform. and Control 34, 222-226.
[19] Devijver, P. A. (1978). Nonparametric estimation by the method of ordered nearest neighbor
sample sets. Proc. Fourth Internat. Joint. Conf. Pattern Recognition, 217-223.
[20] Duin, R. P. W. and van Haresma Buma, C. E. (1974). Some methods for the selection of
independent binary features. Proc. Second Internat. Joint Conf. Pattern Recognition, 65-70.
Copenhagen, Denmark.
[21] Elashoff, J. D., Elashoff, R. M. and Goldman, G. E. (1971). On the choice of variables in
classification problems with dichotomous variables. Biometrika 54, 668-670.
[22] Fu, K. S., Min, P. J. and Li, T. J. (1970). Feature selection in pattern recognition. IEEE Trans.
Systems Sci. Cybernet. 6, 33-39.
[23] Fukunaga, K. (1972). Introduction to Statistical Pattern ReCognition. Academic Press, New York.
[24] Glick, N. (1973). Separation and probability of correct classification among two or more
distributions. Ann. Inst. Statist. Math. 25, 373-382.
[25] Good, I. J. and Card, W. I. (1971). The diagnostic process with special reference to errors. Math.
Inform. Med. 10, 176-188.
[26] Greetenberg, T. L. (1963). Signal selection in communication and radar systems. IEEE Trans.
Inform. Theory 9, 265-275.
[27] Hellman, M. E. and Raviv, J. (1970). Probability of error, equivocation and the Chernoff bound.
IEEE Trans. Inform. Theory 16.
[28] Jain, A. K. (1976). On an estimate of the Bhattacharyya distance. IEEE Trans. Systems Man
Cybernet. 6, 763-766.
[29] Kailath, T. (1967). The divergence and Bhattacharyya distance in signal selection. IEEE Trans.
Comm. Tech. 15, 52-60.
[30] Kanal, L. (1974). Patterns in pattern recognition: 1968-1974. IEEE Trans. Inform. Theory 20,
697-722.
[31] Kaufman, H. and Mathai, A. M. (1973). An axiomatic foundation for a multivariate measure of
affinity among a number of distributions. J. Multivariate Anal. 3, 236-242.
[32] Kittler, J. (1975a). Mathematical methods of feature selection in pattern recognition. Internat. J.
Man-Mach. Stud. 7, 609-637.
[33] Kittler, J. (1975b). On the divergence and Joshi dependence measure in feature selection.
Information Processing Lett. 3, 135-137.
[34] Kovalevsky, V. A. (1968). The problem of character recognition from the point of view of
mathematical statistics. In: V. A. Kovalevsky, ed., Character Readers and Pattern Recognition.
Spartan, New York.
[35] Kullback, S. and Leibler, R. A. (1951). Information and sufficiency. Ann. Math. Statist. 22,
79-86.
[36] Lainiotis, D. G. (1969). A class of upper bounds on probability of error for multihypothesis
pattern recognition. IEEE Trans. Inform. Theory 15, 730-731.
[37] Lewis, P. M. (1962). The characteristic selection problem in recognition systems. IEEE Trans.
Inform. Theory 8, 161-171.
[38] Lindley, D. V. (1956). On a measure of the information provided by an experiment. Ann. Math.
Statist. 27, 986-1005.
[39] Lissack, T. and Fu, K. S. (1976). Error estimation in pattern recognition via L^α-distance between
posterior density functions. IEEE Trans. Inform. Theory 22, 34-45.
[40] Marill, T. and Green, D. M. (1963). On the effectiveness of receptors in recognition systems.
IEEE Trans. Inform. Theory 9, 11-17.
[41] Mathai, A. M. and Rathie, P. N. (1975). Basic Concepts in Information Theory and Statistics.
Wiley, New York.
[42] Matusita, K. (1967). On the notion of affinity of several distributions and some of its
applications. Ann. Inst. Statist. Math. 19, 181-192.
[43] Matusita, K. (1973). Discrimination and the affinity of distributions. In: T. Cacoullos, ed.,
Discriminant Analysis and Applications, 213-223. Academic Press, New York.
[44] Mucciardi, A. N. and Gose, E. E. (1971). A comparison of seven techniques for choosing subsets
of pattern recognition properties. IEEE Trans. Comput. 20, 1023-1031.
[45] Narendra, P. M. and Fukunaga, K. (1977). A branch and bound algorithm for feature subset
selection. IEEE Trans. Comput. 26, 917-922.
[46] Renyi, A. (1959). On measures of dependence. Acta Math. Acad. Sci. Hungar. 10, 441-451.
[47] Renyi, A. (1960). On measures of entropy and information. Proc. Fourth Berkeley Symposium
Math. Statist. and Probability 1, 547-561.
[48] Renyi, A. (1964). On the amount of information concerning an unknown parameter in a
sequence of observations. Publ. Math. Inst. Hungar. Acad. Sci. 9, 617-624.
[49] Renyi, A. (1966). On the amount of missing information and Neyman-Pearson lemma. In F. N.
David, ed., Research Papers in Statistics, 281-288. Wiley, New York.
[50] Renyi, A. (1967a). On some problems of statistics from the point of view of information theory.
Proc. Fifth Berkeley Symposium on Math. Statist. 531-543.
[51] Renyi, A. (1967b). Statistics and information theory. Studia Sci. Math. Hungar. 2, 249-256.
[52] Renyi, A. (1970). Probability Theory. North-Holland, Amsterdam.
[53] Shannon, C. (1948). A mathematical theory of communication. Bell System Tech. J. 27, 379-423.
[54] Silvey, S. D. (1964). On a measure of association. Ann. Math. Statist. 35, 1157-1166.
[55] Stearns, S. D. (1976). On selecting features for pattern recognition. Proc. Third Internat. Joint
Conf. Pattern Recognition, 245-248. San Diego, CA.
[56] Toussaint, G. T. (1971a). Some upper bounds on error probability for multiclass pattern
recognition. IEEE Trans. Comput. 20, 943-944.
[57] Toussaint, G. T. (1971b). Note on the optimal selection of independent binary valued features
for pattern recognition. IEEE Trans. Comput. 17, 618.
[58] Toussaint, G. T. (1972). Feature evaluation with quadratic mutual information. Information
Processing Lett. 1, 153-156.
[59] Toussaint, G. T. (1974a). Recent progress in statistical methods applied to pattern recognition.
Proc. Second Internat. Joint Conf. Pattern Recognition, 479-488.
[60] Toussaint, G. T. (1974b). On the divergence between two distributions and the probability of
misclassification of several decision rules. Proc. Second Internat. Joint Conf. Pattern Recognition.
[61] Toussaint, G. T. (1974c). On information transmission, nonparametric classification and mea-
suring dependence between random variables. In: Proc. Symp. Statist., Related Topics. Carleton
University, Canada.
[62] Toussaint, G. T. (1977). An upper bound on the probability of misclassification in terms of the
affinity. Proc. IEEE 65, 275-276.
[63] Toussaint, G. T. (1978a). Probability of error, expected divergence and the affinity of several
distributions. IEEE Trans. Systems Man Cybernet. 8, 482-485.
[64] Toussaint, G. T. (1978b). Probability of error and equivocation of order α. Unpublished
manuscript.
[65] Vajda, I. (1968). Bounds on the minimal error probability and checking a finite or countable
number of hypotheses. Inform. Transmis. Problems 4, 9-19.
[66] Vilmansen, T. R. (1973). Feature evaluation with measures of probabilistic dependence. IEEE
Trans. Comput. 22, 381-388.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 793-803

Jan M. Van Campenhout
1. Introduction
One of the major problems one encounters during the design phase of an
automatic pattern recognition system is the identification of a good set of
measurements. These measurements, to be performed on future unclassified
patterns, should enable the recognition system to classify the patterns as correctly
as possible. At the same time, the cost of acquisition and processing of the
measurements in the classifier should be kept as low as possible.
P_e(S) = P{C ≠ C*(S)}   (1)
is minimal? Here C*(S) represents the Bayes decision with respect to a probability-of-error loss function and using only the measurements in S; P_e(S) represents the corresponding risk.
The naive solution to this problem is to try all possible k-element subsets of
{X~ ..... Xn}, and then choose the subset with the lowest misclassification proba-
bility. Unfortunately, neither of the steps of the above 'solution' is actually
feasible in practice. To begin with, both the number of possible measurements n
and the number of measurements to be retained, k, are typically so large that an
exhaustive investigation of all the k-element measurement subsets is ruled out.
Furthermore, the determination of the Bayes rule and its corresponding risk requires knowledge of the underlying statistical structure. This knowledge we seldom have, and even when we do, computing the Bayes risk for large values of k can be a very difficult task.
Therefore, the naive approach is rarely used, except perhaps in the related
problem of measurement selection for regression analysis (Furnival, 1974). Here
the numbers n and k are typically much smaller than in pattern recognition, and
the figure of merit of the measurement subsets, the residual sum of squares, is
relatively easy to compute compared to the Bayes risk in classification problems.
preclude large numbers of subsets from its subsequent search. For instance, suppose that the measurement subsets S1 and S2 have been investigated, and also suppose that Pe(S1) < Pe(S2). Then it is needless to investigate any of the subsets of S2: by the monotonicity of the Bayes error these subsets are at least as bad as S2, and hence worse than S1. Dramatic savings in search effort have been reported with these 'branch and bound' algorithms, in both pattern recognition (Narendra and Fukunaga, 1977) and regression analysis (Furnival, 1974).
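A minimal sketch of the pruning idea, stated (as in Narendra and Fukunaga's formulation) for a criterion J that can only decrease when measurements are deleted; with the misclassification probability the inequalities reverse, but the logic is identical. The criterion J is a caller-supplied function and purely illustrative.

```python
def branch_and_bound(n, k, J):
    """Find the best k-of-n measurement subset under a monotone criterion J
    (J(A) <= J(B) whenever A is a subset of B). Measurements are discarded one
    at a time from the full set; a branch is cut as soon as its J value can no
    longer beat the best k-element subset found so far."""
    best_val, best_set = J(frozenset(range(k))), frozenset(range(k))  # initial bound

    def search(current, removable):
        nonlocal best_val, best_set
        if J(current) <= best_val:
            return                                  # prune: deletions only lower J
        if len(current) == k:
            best_val, best_set = J(current), current
            return
        for f in sorted(removable):                 # remove in increasing order so
            search(current - {f}, {g for g in removable if g > f})  # no duplicates

    search(frozenset(range(n)), set(range(n)))
    return best_set
```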
The second approach uses heuristic reasoning to limit the search to a small fraction of the k-element measurement subsets. Various such algorithms exist and have been in use for a long time, e.g. the forward and backward sequential search techniques (Stearns, 1976). The forward sequential search algorithm uses the heuristic that the best (k+1)-element subset frequently contains the best k-element subset. The algorithm thus first finds the best individual measurement and then proceeds by adding to this set the conditionally best element not yet in the set. Unfortunately, as was soon realized, the heuristic search algorithms do not necessarily provide us with the best measurement subset of a given size. In fact, it has been conjectured (Kanal, 1974) that the only optimal search method, i.e., a method guaranteed to find the best k-element subset no matter what the underlying distribution is, necessarily has to perform the exhaustive search of the naive solution. Despite this conjecture, some researchers claimed near-optimal performance for the heuristic algorithms, stating that the measurement subsets selected by these algorithms, although perhaps not the best, are always very good (Chen, 1975).
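A sketch of the forward heuristic just described, assuming a caller-supplied subset criterion crit for which smaller is better (for instance, an estimate of the misclassification probability):

```python
def forward_search(n, k, crit):
    """Greedy forward selection: start from the best single measurement and
    repeatedly add the conditionally best measurement not yet in the set."""
    selected = set()
    while len(selected) < k:
        best = min((f for f in range(n) if f not in selected),
                   key=lambda f: crit(selected | {f}))
        selected.add(best)
    return selected
```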
So the following question arises: given a finite training set, how complex can the
measurements be before the performance degrades?
that is, we can map the measurements in A into the measurements in B in a way
that agrees with the statistical structure of B.
If case A and case B are related through (3) and (4), then one can easily prove that Pe(X^A) ≤ Pe(X^B) (Van Campenhout, 1978).
In the Hughes case the functions g1 and g2 could be defined as the merging of cells in A's partitioning to obtain B's partitioning. But then the question must be asked whether (4) is satisfied. The distribution F(X^B, T^B, C) is the mixture distribution, which clearly depends on the parametrization p^B and its prior distribution F(p^B). Consequently, once p^A and F(p^A) are determined, the choice of g1 and g2 imposes restrictions on the choice of p^B and F(p^B) if (4) is to be satisfied. Hughes does not take into account any such comparability requirements. Instead he freely chooses p1 = (p11, …, pm1) and p2 = (p12, …, pm2) to be independent and uniformly distributed over the parameter simplices Σ_i pi1 = 1 and Σ_i pi2 = 1, respectively, in both the A and B case. By doing so, he actually changes the amount of prior information contained in F(p).
One can satisfy the comparability requirement if one obtains p^B from p^A by the map g(p^A) that adds the cell probabilities of cells of partition A that are merged. We then require that F(p^B) = F(g(p^A)). It is easy to verify that if p^A has the uniform simplex distribution specified by Hughes, then p^B cannot have a uniform simplex distribution but rather has a Dirichlet distribution.
So we conclude that the peaking established by the (modified) Hughes model
arises from an inconsistency in the specification of the models to be compared:
while the information in the measurements X increases by refining the partition-
ing of the measurement space, at the same time the amount of information
brought in through the prior parameter distribution F(p) (which, notably, reflects
our ignorance!), is reduced. Hence Hughes' model trades prior information for
measurement information. The existence of an optimal trade-off causes peaking.
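The inconsistency is easy to verify numerically. In the sketch below (the six-cell partition and its merging into three cells are arbitrary assumptions), p^A is drawn uniformly on the simplex; the merged vector p^B then has the moments of a Dirichlet(3, 2, 1) distribution, not of a uniform simplex distribution.

```python
# Numerical check: merging cells of a uniformly distributed probability
# vector yields a Dirichlet, not a uniform, distribution on the coarser
# simplex. The partition sizes below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
groups = [[0, 1, 2], [3, 4], [5]]   # merge 6 cells of A into 3 cells of B

p_A = rng.dirichlet(np.ones(6), size=100_000)  # uniform on the 6-cell simplex
p_B = np.stack([p_A[:, g].sum(axis=1) for g in groups], axis=1)

# Dirichlet(3, 2, 1) has mean (3/6, 2/6, 1/6); a uniform simplex
# distribution would give (1/3, 1/3, 1/3).
print(p_B.mean(axis=0))   # approx [0.500, 0.333, 0.167]
```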
(b) The above analysis does not deal with typical situations encountered in
pattern recognition practice, where Bayes rules simply cannot be obtained for
lack of statistical information. Determining the optimal measurement complexity
in practical situations remains a problem.
THEOREM 3.1 (Van Campenhout, 1980). Any set of real numbers $\{P_e(S): S \subseteq \{X_1,\ldots,X_n\}\}$ giving rise to an ordering of the measurement subsets

$$\tfrac{1}{2} = P_e(S_1 = \varnothing) > P_e(S_2) > \cdots > P_e(S_{2^n} = \{X_1,\ldots,X_n\}) > 0$$

for which $S_j \subset S_k \Rightarrow P_e(S_j) > P_e(S_k)$ is inducible as a set of misclassification probabilities. Moreover, there exist multivariate normal models $N(-\mu, K)$ vs. $N(\mu, K)$ with vector-valued measurements inducing these numbers.
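To make the theorem's setting concrete, the sketch below evaluates Pe(S) for every subset of three Gaussian measurements under the model N(−μ, K) vs. N(μ, K) with equal priors; the particular μ and K are arbitrary assumptions. For such models Pe(S) = Φ(−Δ_S/2), with Δ_S the Mahalanobis distance between the two class means restricted to S, so the monotonicity condition always holds; the theorem asserts that essentially any ordering consistent with it can be induced.

```python
# Bayes error of each measurement subset under N(-mu, K) vs. N(mu, K)
# with equal priors: Pe(S) = Phi(-Delta_S / 2). mu and K are assumptions.
import itertools
import numpy as np
from scipy.stats import norm

mu = np.array([0.8, 0.5, 0.3])
K = np.array([[1.0, 0.6, 0.2],
              [0.6, 1.0, 0.4],
              [0.2, 0.4, 1.0]])

def pe(subset):
    """Misclassification probability using only the measurements in subset."""
    if not subset:
        return 0.5                              # no information: coin flip
    idx = list(subset)
    m, Ks = mu[idx], K[np.ix_(idx, idx)]
    delta = 2.0 * np.sqrt(m @ np.linalg.solve(Ks, m))  # distance of -mu vs. mu
    return norm.cdf(-delta / 2.0)

for s in itertools.chain.from_iterable(
        itertools.combinations(range(3), r) for r in range(4)):
    print(s, round(pe(s), 4))                   # supersets never do worse
```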
References
Abend, K. and Harley, T. J. (1969). Comments: On the mean accuracy of statistical pattern
recognizers. IEEE Trans. Inform. Theory 15, 420-421.
Beale, E. M. L., Kendall, M. G. and Mann, D. W. (1967). The discarding of variables in multivariate
analysis. Biometrika 54 (3,4) 357-366.
Chandrasekaran, B. and Harley, T. J. (1969). Comments: On the mean accuracy of statistical pattern
recognizers. IEEE Trans. Inform. Theory 15, 421-423.
Chen, C. H. (1975). On a class of computationally efficient feature selection criteria. Pattern
Recognition 7, 87-94.
Cover, T. M. (1974). The best two independent measurements are not the two best. IEEE Trans.
Systems Man Cybernet. 4 (1) 116-117.
Cover, T. M. and Van Campenhout, J. M. (1977). On the possible orderings in the measurement
selection problem. IEEE Trans. Systems Man Cybernet. 7 (9) 657-661.
Duda, R. O. and Hart, P. E. (1973). Pattern Classification and Scene Analysis. Wiley, New York.
Furnival, G. M. (1974). Regression by leaps and bounds. Technometrics 16 (4) 499-511.
Hughes, G. F. (1968). On the mean accuracy of statistical pattern recognizers. IEEE Trans. Inform.
Theory 14 (1) 55-63.
Kanal, L. N. (1974). Patterns in pattern recognition, 1968-1974. IEEE Trans. Inform. Theory 20,
697-722.
Narendra, P. M. and Fukunaga, K. (1977). A branch and bound algorithm for feature subset selection.
IEEE Trans. Comput. 26 (9) 917-922.
Stearns, S. D. (1976). On selecting features for pattern classifiers. Proc. Third Internat. Joint Conf.
Pattern Recognition. Coronado, CA, 71-75.
Toussaint, G. T. (1971). Note on optimal selection of independent binary-valued features for pattern
recognition. IEEE Trans. Inform. Theory 17, 618.
Van Campenhout, J. M. (1978). On the problem of measurement selection. Ph.D. Thesis, Department
of Electrical Engineering, Stanford University, Stanford, CA.
Van Campenhout, J. M. (1980). The arbitrary relation between probability of error and measurement
subset. J. Amer. Statist. Assoc. 75 (369) 104-109.
Waller, W. G. and Jain, A. K. (1977). Mean recognition accuracy of dependent binary measurements.
Proc. Seventh Internat. Conf. Cybernet. and Society, Washington, DC.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 805-820

Selection of Variables under Univariate Regression Models*
P. R. Krishnaiah
1. Introduction
The techniques of regression analysis have been used widely in the problems of
prediction in various disciplines. When the number of variables is large, it is of
interest to select a small number of important variables which are adequate for
the prediction. In this paper we review some methods of selection of variables
under univariate regression models. In Sections 3-5 we discuss the forward
selection, backward elimination and stepwise procedures. A description of these
procedures is given in Draper and Smith (1966). These procedures are widely used
by many applied statisticians since computer programs are easily available for
their implementation. We will discuss some drawbacks of these procedures. In
view of these drawbacks, we have serious reservations about using these methods
for selection of variables. In Sections 6-8 we discuss the problems of selection of
variables within the framework of simultaneous test procedures. Section 6 is
devoted to a discussion of how the confidence intervals associated with the well
known overall F test in regression analysis can be used for the selection of
variables. The above confidence intervals are available in the literature (e.g., see
Roy and Bose, 1953; Scheffé, 1959). In this section we also discuss some
procedures based upon all possible regressions. In Section 7 we discuss how the
finite intersection tests (FIT) proposed by Krishnaiah (1963; 1965a, b; 1979) for
testing the hypotheses on regression coefficients simultaneously can be used for
selection of variables. It is known that the FIT is better than the overall F test in
terms of the shortness of the lengths of the confidence intervals. For an illustra-
tion of the FIT, the reader is referred to the chapter by Schmidhammer in this
volume. Reviews of the literature on some alternative procedures are given in
Hocking (1976) and Thompson (1978a, b). For a discussion of the procedures for
selection of variables when the number of variables is large the reader is referred
to Shibata (1981) and the references in that paper.
*This work is sponsored by the Air Force Office of Scientific Research under contract
F49629-82-K-001. Reproduction in whole or in part is permitted for any purpose of the U. S.
Government.
2. Preliminaries
Consider the model

$$y = \beta_1 x_1 + \cdots + \beta_q x_q + e \qquad (2.1)$$

where e is distributed normally with mean zero and variance σ²; β_1,…,β_q are
unknown and x_1,…,x_q may be fixed or random. This chapter is devoted to a
review of some of the procedures for the selection of the above variables. Unless
stated otherwise we assume that x_1,…,x_q are fixed. Also, x_i: n × 1 and y: n × 1
denote respectively vectors of n observations on x_i and y. We now define the
multivariate t and multivariate F distributions since they are needed in the sequel.

Let z' = (z_1,…,z_p) be distributed as a multivariate normal with mean vector
μ' = (μ_1,…,μ_p) and covariance matrix σ²Ω where Ω is the correlation matrix.
Also, let s²/σ² be distributed independently of z as chi-square with m degrees of
freedom. Then, the joint distribution of t_1,…,t_p is the central (noncentral)
multivariate t distribution with m degrees of freedom and with Ω as the correlation
matrix of the 'accompanying' multivariate normal when μ = 0 (μ ≠ 0), where
t_i = z_i√m/s (i = 1,2,…,p). The joint distribution of t_1²,…,t_p² is the central
(noncentral) multivariate F with (1, m) degrees of freedom and with Ω as the
correlation matrix of the accompanying multivariate normal. The above distribution
is singular or nonsingular according as Ω is singular or nonsingular. Also, when
μ ≠ 0 and s²/σ² is distributed as the noncentral chi-square, then t_1²,…,t_p² are
jointly distributed as a doubly noncentral multivariate F distribution with (1, m)
degrees of freedom. The multivariate t distribution was considered by Cornish (1954)
and Dunnett and Sobel (1954) independently. The multivariate F distribution with
(1, m) degrees of freedom is a special case of the multivariate F distribution
proposed by Krishnaiah (1965a). Cox et al. (1980) investigated the accuracy of
various approximations for the multivariate t and multivariate F distributions. For a
review of the literature on multivariate t and multivariate F distributions, the
reader is referred to Krishnaiah (1980).
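The construction above is easy to simulate. The following sketch (the choices of Ω, m and σ are arbitrary assumptions) generates draws of (t_1, t_2); their squares are then draws from the bivariate F distribution with (1, m) degrees of freedom.

```python
# Simulating the multivariate t construction: z ~ N(0, sigma^2 * Omega),
# s^2/sigma^2 ~ chi-square with m degrees of freedom, independent of z,
# and t_i = z_i * sqrt(m) / s. Values below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
m, sigma = 10, 2.0
Omega = np.array([[1.0, 0.5],
                  [0.5, 1.0]])

z = rng.multivariate_normal(np.zeros(2), sigma**2 * Omega, size=200_000)
s = sigma * np.sqrt(rng.chisquare(m, size=200_000))
t = z * np.sqrt(m) / s[:, None]        # rows are draws of (t_1, t_2)

# Marginally each t_i is Student t with m degrees of freedom, and t_i**2
# is F(1, m); jointly, t**2 follows the (central) bivariate F distribution.
print(np.corrcoef(t, rowvar=False))
```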
$$y = X\beta + e \qquad (3.1)$$

where e' = (e_1,…,e_n) is distributed as multivariate normal with mean vector 0 and
covariance matrix σ²I_n. For i = 1,2,…,q, let

$$F_i = \frac{(n-1)\, y' x_i (x_i'x_i)^{-1} x_i' y}{y'\,[\,I - x_i (x_i'x_i)^{-1} x_i'\,]\, y}. \qquad (3.2)$$
If max(F_1, F_2,…,F_q) ≥ F_α, the variable corresponding to max(F_1, F_2,…,F_q) is
declared to be the most important. For example, if max(F_1,…,F_q) = F_1, then x_1
is declared to be the most important. Here F_α is the upper 100α% point of the
central F distribution with (1, n−1) degrees of freedom. If max(F_1, F_2,…,F_q) ≤
F_α, we declare that none of the variables are important and we do not proceed
further; otherwise, we proceed further as follows and select the variable which is
the second most important. Let
$$Q_{1i} = \bigl(I - x_1(x_1'x_1)^{-1}x_1'\bigr)\, x_i \bigl[x_i'x_i - x_i'x_1(x_1'x_1)^{-1}x_1'x_i\bigr]^{-1} x_i' \bigl(I - x_1(x_1'x_1)^{-1}x_1'\bigr), \qquad (3.4)$$

$$Q_{10} = I - [x_1, x_i]\bigl([x_1, x_i]'[x_1, x_i]\bigr)^{-1}[x_1, x_i]', \qquad F_{1i} = \frac{(n-2)\, y'Q_{1i}y}{y'Q_{10}y}. \qquad (3.5)$$
If max(F_{12},…,F_{1q}) ≤ F_{1α}, we conclude that none of the variables x_2, x_3,…,x_q
are important and we do not proceed further; here F_{1α} is the upper 100α%
point of the central F distribution with (1, n−2) degrees of freedom.
If max(F_{12}, F_{13},…,F_{1q}) > F_{1α}, then the variable corresponding to
max(F_{12}, F_{13},…,F_{1q}) is declared to be the second most important. Suppose x_2 is
the second most important variable according to the above procedure. Then we
proceed further to select the third most important variable. We continue this
procedure until a decision is made to declare that all variables are important or a
decision is made, at any stage, that none of the remaining variables are important.
Suppose we declare that r variables (say x_1, x_2,…,x_r) are important according to
the above procedure when x_i is declared to be the ith most important variable.
Then we proceed further as follows. Let
Then we proceed further as follows. Let
r --1
Qri=( I-- S(r)( g(r)g(r)) S(r))Xi
X [XtiXi--xtiS(r)(g~r)g(r)) Ig~r)Xi] 1x~
Qro=i_[X(~),xi ] X~)
Xti
(X(r),xi
Xi
, ,
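For concreteness, the recursion can be sketched as follows in Python, using the equivalent extra-sum-of-squares form of the F statistics for the no-intercept model (3.1); the function and variable names are hypothetical, and the sketch is offered as illustration only, in view of the criticisms discussed next.

```python
# Sketch of the forward selection procedure: at stage r, F_{ri} is the F
# statistic for adding x_i to the r variables already chosen; stop when
# the largest F statistic fails to exceed the critical value.
import numpy as np
from scipy.stats import f as f_dist

def forward_select(X, y, alpha=0.05):
    n, q = X.shape
    chosen = []
    while len(chosen) < q:
        r = len(chosen)
        if chosen:
            Xr = X[:, chosen]
            Hr = Xr @ np.linalg.solve(Xr.T @ Xr, Xr.T)   # projection on chosen
        else:
            Hr = np.zeros((n, n))
        stats = {}
        for i in [j for j in range(q) if j not in chosen]:
            Xc = X[:, chosen + [i]]
            Hc = Xc @ np.linalg.solve(Xc.T @ Xc, Xc.T)
            extra = y @ (Hc - Hr) @ y        # extra sum of squares due to x_i
            rss = y @ y - y @ Hc @ y         # residual sum of squares
            stats[i] = extra * (n - r - 1) / rss
        best = max(stats, key=stats.get)
        if stats[best] <= f_dist.ppf(1 - alpha, 1, n - r - 1):
            break                            # no remaining variable important
        chosen.append(best)
    return chosen                            # indices in order of importance
```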
$$E(y) = x_i\beta_i. \qquad (3.9)$$

$$P[F_i \le F_\alpha \mid H_i] = 1 - \alpha. \qquad (3.10)$$
In any situation all the q models E(y) = x_1β_1, …, E(y) = x_qβ_q may be wrong; at
best, one and only one of the above q models is correct. For any given i, let us
assume that the model E(y) = x_iβ_i is not correct. Then F_i is not distributed as a
central F distribution with (1, n−1) degrees of freedom even when H_i is true, since
β_i is estimated under the assumption that E(y) = x_iβ_i is the correct model. So, the
true type I error is not α even to test H_i individually. For example, let the true
model be E(y) = x_2β_2 or (say) E(y) = x_2β_2 + x_4β_4. Then F_1, F_3,…,F_q are distributed
as doubly noncentral F distributions with (1, n−1) degrees of freedom and
the noncentrality parameters associated with these statistics are usually unknown.
So, we will not be able to compute the exact type I error for testing H_i, even
for i = r+1,…,q. For any given i, the statistic F_ri defined by (3.6) is nothing but
the F statistic used to test H_i under the model (3.11). So, in the forward selection
procedure we are essentially testing the hypothesis H_i, for any given i (i =
r+1,…,q), under the model (3.11) to decide whether the ith variable is unimportant.
The criticism of the method used at the first stage applies here also.
At the (i+1)th stage (i = 1,2,…,q−1), the critical value F_{iα} is chosen ignoring the
decisions made at previous stages. Take, for example, the second stage. While
computing the critical value F_{1α}, we should compute the conditional probabilities
P[F_{1i} ≤ F_{1α} | H_i; max(F_1,…,F_q) ≥ F_α] for i = 2,3,…,q to find the type I error
to test H_i for given i, and not P[F_{1i} ≤ F_{1α} | H_i], since we go to the second
stage if and only if max(F_1,…,F_q) ≥ F_α.

The decision to select or not to select a variable at the first stage is made
according as max(F_1,…,F_q) ≷ F_α. So, the critical value F_α should be chosen such
that the probability of max(F_1,…,F_q) being less than F_α is equal to (1−α) when
∩_{i=1}^q H_i is true, instead of choosing it to satisfy (3.10). Similar criticism applies to
subsequent stages also. When ∩_{i=1}^q H_i is true and the model is of the form (3.9),
the joint distribution of F_1,…,F_q is a multivariate F distribution; this multivariate
F distribution is different from the one defined in Section 2 since the above F_i's
do not have a common denominator.
4. Stepwise regression
for i = 1,2,…,q. Let H_i: β_i = 0 as before and let F_i, for any given i, denote the
usual F statistic used to test H_i under the model (4.1). If max(F_1,…,F_q) ≤ F_α, we
decide that none of the variables are important and we do not proceed further.
Under the above model we test the hypothesis β_1 = 0 by using the usual F test. If
the hypothesis β_1 = 0 is accepted, we delete the variable x_1 and consider the
following model:

$$y = x_2\beta_2 + x_3\beta_3 + e. \qquad (4.5)$$

for i = 4,5,…,q. If under the model (4.3) the hypothesis H_1 is rejected, then we
consider the following models:

and test H_1 under the model (4.10). If H_1 is accepted under the above model, then
we delete x_1 from the prediction equation. Obviously, at the second stage, we
want to find whether x_1 is important when the effect of x_2 is eliminated. One may
pose the question as to why we should not examine the importance of x_1 after
eliminating the effect of x_3 or x_5, since it is possible that x_2 may be declared to be
unimportant and purged at a later stage whereas x_3 and x_5 may be considered to
be important.
5. Backward elimination

Consider the model

$$y = \beta_1 x_1 + \cdots + \beta_q x_q + e. \qquad (5.1)$$

Let H_i: β_i = 0 as before. Also, let F_i* denote the usual F statistic used to test H_i
against A_i: β_i ≠ 0 under the model (5.1). In the backward elimination procedure
we declare that none of the variables are unimportant if min(F_1*, F_2*,…,F_q*) ≥ F_{α1},
where F_{α1} is the upper 100α% point of the central F distribution with (1, n−q)
degrees of freedom. If min(F_1*,…,F_q*) ≤ F_{α1}, we eliminate the variable associated
with the smallest of F_1*,…,F_q*. The above procedure is equivalent to the following
procedure. Let F_{α1} be chosen such that

$$P[F_i^* \le F_{\alpha 1} \mid H_i] = 1 - \alpha$$

for i = 1,2,…,q. In the usual individual tests we accept or reject H_i according as

$$F_i^* \lessgtr F_{\alpha 1}. \qquad (5.3)$$
The hypothesis H_i is equivalent to the hypothesis that x_i is not important. If
min(F_1*,…,F_q*) ≥ F_{α1}, it implies that H_1, H_2,…,H_q are rejected when they are
tested individually. If we eliminate the variable connected with the smallest F_i*,
then we go to the second stage to determine whether we should eliminate the
second least important variable. The main drawback of this procedure at the first
stage is the choice of F_{α1}. Since we are interested in finding out whether all
variables are important, it is more meaningful to use the critical value F*_{α1}
(instead of F_{α1}) where

$$P[F_i^* \le F^*_{\alpha 1};\ i = 1,2,\ldots,q \mid \cap_{i=1}^{q} H_i] = 1 - \alpha.$$
At the second stage we find the variable associated with the smallest of the
remaining F statistics and discard it. The critical value F_{α2} is chosen such that

$$P[F_i^* \le F_{\alpha 2} \mid H_i] = 1 - \alpha$$

for i = 1,2,…,q−1. So the type I error is chosen at the second stage such that the
probability of rejecting the hypotheses H_i individually, when in fact they are
true, is α. But what is of more interest is to control the error of rejecting at least
one of the hypotheses H_1, H_2,…,H_{q−1} when, in fact, all of them are true. So, the
critical value should be F*_{α2} where

$$P\Bigl[F_i^* \le F^*_{\alpha 2};\ i = 1,2,\ldots,q-1 \,\Bigm|\, \cap_{i=1}^{q-1} H_i\Bigr] = 1 - \alpha \qquad (5.7)$$
if we had started at the second stage directly. But we arrive at the second stage if
and only if min(F_1*,…,F_q*) ≤ F*_{α1}. So, we should find the critical value F*_{α2} such
that

$$P\Bigl[F_i^* \le F^*_{\alpha 2},\ i = 1,2,\ldots,q-1 \,\Bigm|\, \cap_{i=1}^{q-1} H_i;\ \min(F_1^*,\ldots,F_q^*) \le F^*_{\alpha 1}\Bigr] = 1 - \alpha. \qquad (5.8)$$
Now, let F*_{ji} denote the F statistic used to test H_i under the model (5.9). Then,
according to the backward elimination procedure, we decide that none of the
variables x_1,…,x_{q−j} are unimportant if min(F*_{j1}, F*_{j2},…,F*_{j,q−j}) ≥ F_{α,j+1}, where
F_{α,j+1} is the upper 100α% point of the central F distribution with (1, n−q+j)
degrees of freedom. If min(F*_{j1}, F*_{j2},…,F*_{j,q−j}) ≤ F_{α,j+1}, then we conclude that the
variable associated with the smallest of the statistics F*_{j1},…,F*_{j,q−j} is the (j+1)th
least important variable and proceed to the next stage. The critical value F_{α,j+1}
here is chosen such that

$$P[F^*_{ji} \le F_{\alpha,j+1} \mid H_i] = 1 - \alpha \qquad (5.10)$$

for i = 1,2,…,q−j. But the critical value should be chosen such that the
conditional probability of rejecting at least one of the hypotheses H_1,…,H_{q−j}
(when all of them are true) is equal to α, given min(F*_{j−1,1},…,F*_{j−1,q−j+1}) ≤ F*_{αj},
where F*_{αj} is chosen in a similar way as F*_{α2}. In summary, the critical values used
in the backward elimination procedure are chosen somewhat arbitrarily.
Some of the drawbacks of the forward selection, backward elimination and
stepwise procedures were discussed in Pope and Webster (1972) also. In view of
the various drawbacks of the above procedures we do not recommend their use
in the selection of variables. In the following sections we discuss some alternative
procedures for the selection of the variables.
$$F \lessgtr F_\alpha, \qquad (6.1)$$

where

$$F = \frac{(n-q)\, y'Qy}{q\, y'Q_0 y}, \qquad (6.2)$$

$$P[F \le F_\alpha \mid H] = (1 - \alpha). \qquad (6.5)$$
$$a'\hat\beta - \bigl\{q F_\alpha\, (y'Q_0 y)\, a'(X'X)^{-1}a/(n-q)\bigr\}^{1/2} \le a'\beta \le a'\hat\beta + \bigl\{q F_\alpha\, (y'Q_0 y)\, a'(X'X)^{-1}a/(n-q)\bigr\}^{1/2} \qquad (6.6)$$

for all a' = (a_1,…,a_q) ≠ 0', where $\hat\beta' = y'X(X'X)^{-1} = (\hat\beta_1,\ldots,\hat\beta_q)$. In particular,
the confidence interval on β_i is given by

$$\hat\beta_i - \bigl\{q F_\alpha\, (y'Q_0 y)\, c^{ii}/(n-q)\bigr\}^{1/2} \le \beta_i \le \hat\beta_i + \bigl\{q F_\alpha\, (y'Q_0 y)\, c^{ii}/(n-q)\bigr\}^{1/2}, \qquad (6.7)$$

where c^{ii} is the ith diagonal element of (X'X)^{-1}. For any given i, the hypothesis H_i is accepted or
rejected according as the confidence interval (6.7) covers or does not cover zero.
This is equivalent to acceptance or rejection of H_i according as

$$F_i \lessgtr qF_\alpha, \qquad (6.8)$$

where

$$F_i = \hat\beta_i^2 (n-q) \big/ (c^{ii}\, y'Q_0 y). \qquad (6.9)$$
We now interpret the statistics F and F_i, where F and F_i are given by (6.2) and
(6.9), respectively. The statistic F can be written as

$$F = R^2(n-q)/q(1-R^2), \qquad (6.10)$$
where F_{12} is the F statistic associated with testing H_{12}. In general, the hypothesis
$H_{i_1\cdots i_t}$ is accepted or rejected according as the corresponding F statistic exceeds
or does not exceed its critical value, where $\hat\beta_{(i)}' = (\hat\beta_{i_1},\ldots,\hat\beta_{i_t})$ and $V_{(i)}\sigma^2$ is the
covariance matrix of $\hat\beta_{(i)}$. The above implications of the overall F test are well
known. The subset $(x_{i_1},\ldots,x_{i_t})$ of the set of variables x_1,…,x_q may be
considered important or unimportant for prediction of
where t = 1,2,…,q and i_1,…,i_t take values from 1 to q subject to the
restrictions i_1 ≠ i_2 ≠ ⋯ ≠ i_t. In the above situation there are 2^q − 1 possible
models and we have to select one of them. For given i_1,…,i_t let $F^*_{i_1\cdots i_t}$ denote the
usual F statistic used to test the hypothesis $H_{i_1\cdots i_t}$ under the model (6.14). Then
the hypotheses $H_{i_1\cdots i_t}$, for all possible values of i_1,…,i_t, can be tested simultaneously
as follows. We accept or reject $H_{i_1\cdots i_t}$ for given i_1,…,i_t according as

$$F^*_{i_1 i_2 \cdots i_t} \lessgtr F^*_\alpha, \qquad (6.15)$$

where

$$P[F^*_{i_1 i_2 \cdots i_t} \le F^*_\alpha;\ (i_1,\ldots,i_t) \in G \mid H] = (1 - \alpha), \qquad (6.16)$$

$$E(s^2_{i_1\cdots i_t}) = \sigma^2. \qquad (6.17)$$
a(q, t, n) and b(q, t, n) are properly chosen constants and $\hat\sigma^2$ is the usual error
mean square when the model is (3.1). It is complicated to compute the probability
of correct selection in both of the above situations. Akaike (1973) and Mallows
(1973) considered some special values of a(q, t, n) and b(q, t, n) in (6.18) while
considering procedures for the selection of variables.
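As a concrete instance of a criterion of this type, the sketch below scores all 2^q − 1 subset models with Mallows' Cp, one of the special cases just mentioned, taking σ̂² to be the error mean square under the full model (3.1); since the models here carry no intercept, the subset size t plays the role of the parameter count. The names are hypothetical.

```python
# All-possible-regressions search scored by Mallows' Cp:
#   Cp = SSE_subset / sigma2_hat - n + 2t,
# with sigma2_hat the error mean square of the full model.
import itertools
import numpy as np

def best_subset_cp(X, y):
    n, q = X.shape
    H_full = X @ np.linalg.solve(X.T @ X, X.T)
    sigma2 = y @ (np.eye(n) - H_full) @ y / (n - q)   # full-model mean square
    scores = {}
    for t in range(1, q + 1):
        for subset in itertools.combinations(range(q), t):
            Xs = X[:, list(subset)]
            Hs = Xs @ np.linalg.solve(Xs.T @ Xs, Xs.T)
            sse = y @ (np.eye(n) - Hs) @ y
            scores[subset] = sse / sigma2 - n + 2 * t
    best = min(scores, key=scores.get)
    return best, scores[best]
```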
$$F_i \lessgtr F^*_\alpha \qquad (7.1)$$
where g(e|τ) is the density of a multivariate normal with mean vector 0 and
covariance matrix τ²I_n, and h(τ) is the density of τ. For example, let the density
of τ be

If τ has the above density, then the distribution of ντ is known to be an inverted
chi distribution with ν degrees of freedom. In this case the density of e is given by

$$f(e) = c\,[\nu + e'e]^{-(\nu+n)/2}, \qquad (7.6)$$

where

$$c = \nu^{\nu/2}\,\Gamma\bigl(\tfrac{1}{2}(\nu+n)\bigr)\big/\bigl\{\pi^{n/2}\,\Gamma\bigl(\tfrac{1}{2}\nu\bigr)\bigr\}. \qquad (7.7)$$
The density given by (7.6) is a special case of the multivariate t distribution with ν
degrees of freedom and with I_n as the correlation matrix of the 'accompanying'
multivariate normal. By making different choices of the density of τ we get a wide
class of distributions. Zellner (1976) pointed out that the type I error associated
with the overall F test is not at all affected if the density of e is of the form (7.5).
A similar statement holds for the FIT also. But the power functions of the
overall F test and the FIT are affected when the assumption of normality of the
errors is violated and their joint density is of the form (7.5).
Next, let us assume the model to be

$$y = \beta_0 1 + X\beta + e \qquad (7.8)$$

instead of (3.1). Here β_0 is unknown and 1 is a column vector whose elements are
equal to unity. In this case the methods discussed in this paper hold good with
very minor modification. In the F statistics we replace y and X with y* and X*,
respectively, where $y^* = y - \bar y\,1$, $X^* = X - \bar X$, $\bar y = n^{-1}\sum_{i=1}^{n} y_i$, and $\bar X$ is the
matrix whose columns contain the corresponding column means of X.
References
[1] Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle.
In: B. N. Petrov and F. Csaki, eds., 2nd International Symposium on Information Theory,
267-281. Akadémiai Kiadó, Budapest.
[2] Cornish, E. A. (1954). The multivariate small t-distribution associated with a set of normal
sample deviates. Austral. J. Phys. 7, 531-542.
[3] Cox, C. M., Krishnaiah, P. R., Lee, J. C., Reising, J. and Schuurmann, F. J. (1980). A study on
finite intersection tests for multiple comparisons of means. In: P. R. Krishnaiah, ed., Multivariate
Analysis, Vol. V. North-Holland, Amsterdam.
[4] Draper, N. R. and Smith, H. (1966). Applied Regression Analysis. Wiley, New York.
[5] Dunnett, C. W. and Sobel, M. (1954). A bivariate generalization of Student's t-distribution with
tables for certain cases. Biometrika 41, 153-169.
[6] Hocking, R. R. (1976). The analysis and selection of variables in linear regression. Biometrics 32,
1-49.
[7] Krishnaiah, P. R. (1963). Simultaneous tests and the efficiency of generalized incomplete block
designs. Tech. Rept. ARL 63-174. Wright-Patterson Air Force Base, OH.
[8] Krishnaiah, P. R. (1965a). On the simultaneous ANOVA and MANOVA tests. Ann. Inst.
Statist. Math. 17, 35-53.
[9] Krishnaiah, P. R. (1965b). Multiple comparison tests in multiresponse experiments. Sankhyā,
Ser. A 27, 65-72.
[10] Krishnaiah, P. R. (1979). Some developments on simultaneous test procedures. In: P. R.
Krishnaiah, ed., Developments in Statistics. Vol. 2, 157-201. Academic Press, New York.
[11] Krishnaiah, P. R. (1980). Computations of some multivariate distributions. In: P. R. Krishnaiah,
ed., Handbook of Statistics, Vol. 1: Analysis of Variance, 745-971. North-Holland, Amsterdam.
[12] Mallows, C. L. (1973). Some comments on Cp. Technometrics 15, 661-675.
[13] Pope, P. T. and Webster, J. T. (1972). The use of an F-statistic in stepwise regression problems.
Technometrics 14, 327-340.
[14] Roy, S. N. and Bose, R. C. (1953). Simultaneous confidence interval estimation. Ann. Math.
Statist. 24, 513-536.
On the Selection of Variables under Regression Models*

James L. Schmidhammer
1. Introduction
The finite intersection test procedure under the univariate regression model was
first considered by Krishnaiah in 1960 in an unpublished report which was
subsequently issued as a technical report in 1963 and later published in Krishnaiah
(1965a). The problem of testing the hypotheses that the regression coefficients are
zero as well as the problem of testing the hypotheses that contrasts on means, in
the A N O V A setup, are equal to zero are special cases of this procedure. Finite
intersection tests under general multivariate regression models were proposed by
Krishnaiah (1965b). In this chapter, we discuss some applications of Krishnaiah's
finite intersection tests for selection of variables under univariate and multivariate
regression models.
Section 2 gives some background material on the multivariate F distribution,
which is the distribution most commonly used in conjunction with the finite
intersection test. Section 3 describes the application of the finite intersection test
procedure to the univariate linear regression problem, while Section 4 discusses
the extension to the multivariate linear regression case. Finally, Sections 5 and 6
illustrate the use of the finite intersection test with univariate and multivariate
examples respectively.
*This work is sponsored by the Air Force Office of Scientific Research under Contract F49629-82-
K-001. Reproduction, in whole or in part, is permitted for any purpose of the United States
Government.
Consider the univariate linear model

$$y = X\beta + e,$$

under which the finite intersection test of H: β = 0 tests H_i: β_i = 0, i = 0,1,…,q,
simultaneously using

$$F_i = \hat\beta_i^2 \big/ \bigl[w_{ii}\, S^2/(n-q-1)\bigr],$$

where W = (X'X)^{-1} = (w_{ij}), $\hat\beta$ is the least squares estimator of β, S² is the error
sum of squares, and $P[\cap_{i=0}^{q}\{F_i \le F_{i\alpha}\} \mid H] = 1 - \alpha$. In this paper we will only
consider the case where F_{iα} = F_α for i = 0,1,…,q. Now, the joint distribution of
F_0, F_1,…,F_q is a multivariate ((q+1)-variate) F distribution with 1 and n−q−1
degrees of freedom. Simultaneous confidence intervals associated with this
procedure are given by

$$\hat\beta_i - \sqrt{F_\alpha\, w_{ii}\, S^2/(n-q-1)} \le \beta_i \le \hat\beta_i + \sqrt{F_\alpha\, w_{ii}\, S^2/(n-q-1)}$$

for i = 0,…,q.
For comparison, if we use the usual overall F test, we reject H if F > F*_α where

$$F = \frac{\hat\beta'(X'X)\hat\beta/(q+1)}{S^2/(n-q-1)}.$$

Now, since F_α ≤ (q+1)F*_α (see Krishnaiah, 1969), the lengths of the confidence
intervals associated with the finite intersection test are never longer than the
lengths of the corresponding confidence intervals associated with the overall F
test.
In the procedure described above, H = ∩_{i=0}^q H_i, where H_i: β_i = 0. Thus, a test
is performed on the importance of every independent variable simultaneously,
including the intercept. However, it is usually the case that the test H_0: β_0 = 0 is
of no interest, and it is often the case that only a subset of all possible
independent variables is to be examined for importance. With this in mind,
consider r hypotheses of the form H_i: c_i'β = 0 for i = 1,…,r, with H* = ∩_{i=1}^r H_i.
In the above context, c_i' = (0,…,0,1,0,…,0), i.e., c_i selects the particular β_i of
interest for testing, although the procedure described below works for arbitrary c_i.
Using the finite intersection test procedure, we reject H_i if F_i > F_α, where

$$F_i = (c_i'\hat\beta)^2 \big/ \bigl[c_i'(X'X)^{-1}c_i\, S^2/(n-q-1)\bigr]$$

and $P[\cap_{i=1}^{r}\{F_i \le F_\alpha\} \mid H^*] = 1 - \alpha$, with the joint distribution of F_1,…,F_r being a
multivariate F distribution with 1 and (n−q−1) degrees of freedom. In this case
simultaneous confidence intervals are given by

$$c_i'\hat\beta - \sqrt{F_\alpha\, c_i'(X'X)^{-1}c_i\, S^2/(n-q-1)} \le c_i'\beta \le c_i'\hat\beta + \sqrt{F_\alpha\, c_i'(X'X)^{-1}c_i\, S^2/(n-q-1)}$$

for i = 1,…,r.
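A sketch of these computations: given the design matrix X (with a leading column of ones for the intercept), the contrast rows c_i', and a critical value F_α (in practice one of the bounds discussed below), the function returns the statistics F_i and the simultaneous intervals. The function and variable names are hypothetical.

```python
# Finite intersection test statistics and simultaneous confidence
# intervals for the hypotheses H_i: c_i' beta = 0.
import numpy as np

def fit_statistics(X, y, C, F_alpha):
    """C has one row c_i' per hypothesis; X includes the intercept column."""
    n, p = X.shape                      # p = q + 1 columns
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y               # least squares estimate of beta
    S2 = y @ y - b @ (X.T @ y)          # error sum of squares
    msq = S2 / (n - p)                  # S^2 / (n - q - 1)
    F, intervals = [], []
    for c in C:
        est = c @ b
        se2 = (c @ XtX_inv @ c) * msq   # estimated variance of c' beta-hat
        F.append(est**2 / se2)
        half = np.sqrt(F_alpha * se2)
        intervals.append((est - half, est + half))
    return np.array(F), intervals
```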
Table 1
Relative efficiency of overall F test to the finite intersection test (α = 0.05, ν = 10)

r       0.1     0.5     0.9
1       1.00    1.00    1.00
2       0.83    0.80    0.71
3       0.72    0.68    0.57
4       0.64    0.60    0.48
5       0.58    0.54    0.42
6       0.53    0.49    0.37
7       0.49    0.43    0.33
8       0.45    0.42    0.30
9       0.43    0.39    0.28
10      0.40    0.36    0.26
Table 2
Relative efficiency of overall F test to the finite intersection test (α = 0.05, ν = 30)

r       0.1     0.5     0.9
If instead the usual overall F test is used, then we reject H if F > F*_α, where F*_α is the corresponding critical value of the overall F test.
Table 3
Relative efficiency of overall F test to the finite intersection test (α = 0.01, ν = 10)

r       0.1     0.5     0.9
1       1.00    1.00    1.00
2       0.84    0.82    0.76
3       0.73    0.71    0.62
4       0.66    0.63    0.53
5       0.59    0.57    0.47
6       0.55    0.52    0.42
7       0.51    0.47    0.38
8       0.47    0.44    0.34
9       0.44    0.41    0.32
10      0.42    0.39    0.30
Table 4
Relative efficiency of overall F test to the finite intersection test (α = 0.01, ν = 30)

r       0.1     0.5     0.9
Analogous to the univariate linear model, the multivariate linear model is given
by

$$Y = XB + E$$

where Y is an n × p matrix of n observations on p variables y_1,…,y_p whose rows
are independently distributed as a p-variate normal distribution with covariance
matrix Σ, and E(Y) = XB. Furthermore, X is as described in the previous section,
while B = [β_0, β_1,…,β_q]' is a (q+1) × p matrix of unknown regression parameters,
and E is an n × p matrix whose rows are independently and identically
distributed as a p-variate normal distribution with mean vector 0 and covariance
matrix Σ.

The problem of selection of variables under the multivariate regression model
can again be formulated within the framework of simultaneous test procedures as
in the univariate case. The problem of testing H: B = 0 is equivalent to the
problem of testing H_i: c_i'B = 0' for i = 0,1,…,q simultaneously, where c_i' =
[c_{0i}, c_{1i},…,c_{qi}] for i = 0,1,…,q, with

$$c_{hi} = \begin{cases} 0 & \text{if } h \ne i, \\ 1 & \text{if } h = i. \end{cases}$$
For the step-down reduction, write Y_k = [y_1 : ⋯ : y_k] and regress each response
on X and on the preceding responses,

$$y_{j+1} = X\theta_{j+1} + \sum_{k=1}^{j} a_{k,j+1}\, y_k + e_{j+1}, \qquad (4.1)$$

where $\eta_{j+1} = \theta_{j+1} - B_j\,\xi_{j+1}$. Each of the hypotheses H_0, H_1,…,H_q can be expressed
as

$$H_i = \bigcap_{j=1}^{p} H_{ij}, \qquad i = 0,1,\ldots,q,$$

with H_{ij}: c_i'η_j = 0 for i = 0,1,…,q and j = 1,…,p. In (4.2), $D_{ij}S_j^2/(n-j-q)$ is the
sample estimate of the conditional variance of $c_i'\hat\eta_j$. We reject H_{ij} if F_{ij} > F_α,
with F_{ij} the statistic defined in (4.2).
When H is true, the joint distribution of F_{0j}, F_{1j},…,F_{qj}, for any given
j = 1,…,p, is a (q+1)-variate F distribution with 1 and n−j−q degrees of
freedom. The associated 100(1−α)% simultaneous confidence intervals are given
by

$$c_i'\hat\eta_j - \sqrt{F_\alpha\, D_{ij} S_j^2/(n-j-q)} \le c_i'\eta_j \le c_i'\hat\eta_j + \sqrt{F_\alpha\, D_{ij} S_j^2/(n-j-q)}.$$
Several comparisons have been made between the lengths of the confidence
intervals for the finite intersection test in the multivariate case and the lengths of
confidence intervals derived from other procedures. It is known (see Krishnaiah,
1965b) that the finite intersection test yields shorter confidence intervals than the
step-down procedure of J. Roy (1958). Also, Mudholkar and Subbaiah (1979)
made some comparisons of the finite intersection test with the step-down proce-
dure and Roy's largest root test. Additional comparisons of interest are to be
found in Cox et al. (1980).
Several remarks are in order at this time. First, the critical value F_α has been
chosen to be the same for all hypotheses H_{ij}. This was done out of convenience
but is certainly not necessary. Second, the hypothesis H_{ij} was chosen such that
H_i = ∩_{j=1}^p H_{ij} and H = ∩_{i=0}^q H_i, with H: B = 0. However, the overall hypothesis
H need not be H: B = 0. We can just as easily consider any set of hypotheses
H_1,…,H_r where H_i: c_i'B = 0' for i = 1,…,r, with c_i chosen as desired and
H = ∩_{i=1}^r H_i. In the context of selection of variables, however, the c_i's are to be
chosen so as to select out the particular independent variables of interest (see
discussion in the previous section).
5. A univariate example
¹The author is grateful to John Wiley & Sons for giving permission to use these data for illustrative
purposes.
Table 5
Finite intersection test--univariate example

Critical values
Error degrees of freedom: 23                   Poincare's lower bound: 9.0332
Sample estimate of error variance: 60.8        Sidak's upper bound: 9.1699
Overall α-level: 0.05                          Product upper bound: 9.2969
Overall F (9 F_{9,23}^{(0.95)}): 20.8809

The computations were carried out using a FORTRAN computer program written
for use on the DEC-10 computer at the University of Pittsburgh. The results
appear in Table 5.
The model for the data is

$$y = X\beta + e$$

where β' = [β_0, β_1'], β_1' = [β_1,…,β_q], and the overall hypothesis tested is H: β_1 = 0
against the alternative A: β_1 ≠ 0. Thus, we test the hypothesis that none of the
independent variables are related to the dependent variable, but do not test that
the intercept is zero.
In Table 5, note that simultaneous confidence intervals are constructed using
Poincare's lower bound, Sidak's upper bound, and the product upper bound on
the critical value for the finite intersection test, and also using the critical value
associated with the overall F test. The confidence intervals associated with the
overall F test are at least 50% wider than the corresponding confidence intervals
using the finite intersection test. However, the confidence intervals constructed
using the product upper bound are only 1.4% wider than those constructed using
Poincare's lower bound, while the confidence intervals constructed using Sidak's
upper bound are only 0.75% wider than those using Poincare's lower bound,
indicating that a fairly precise estimate of the true critical value is available, at
least in this case, using only some probability inequalities.
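Bounds of this kind can be computed from the F(1, ν) marginals alone. The sketch below gives a naive single-comparison lower bound together with two conservative upper bounds, a Šidák-type product bound and a Bonferroni bound; the sharper Poincaré bound used in Table 5 requires bivariate probabilities and is not reproduced here, so the correspondence with the tabled values is only approximate.

```python
# Simple bounds on the exact multivariate F critical value, using only
# the F(1, nu) marginal distributions of the r statistics.
from scipy.stats import f as f_dist

def critical_value_bounds(r, nu, alpha=0.05):
    lower = f_dist.ppf(1 - alpha, 1, nu)                 # single comparison
    sidak_upper = f_dist.ppf((1 - alpha) ** (1.0 / r), 1, nu)
    bonferroni_upper = f_dist.ppf(1 - alpha / r, 1, nu)
    return lower, sidak_upper, bonferroni_upper

# r = 9 hypotheses and nu = 23 error degrees of freedom, as in Table 5;
# the upper bounds land in the same range as the tabled upper bounds.
print(critical_value_bounds(9, 23))
```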
As for the results of the analysis, it is interesting that the only variable related
to corn crop yield is year number, reflecting the well known fact that grain
production in the United States has been steadily increasing for the past fifty
years.
6. A multivariate example
Table 6"
SAT PPVT RPMT N S NS NA SS
49 48 8 1 2 6 12 16
49 76 13 5 14 14 30 27
11 40 13 0 10 21 16 16
9 52 9 0 2 5 17 8
69 63 15 2 7 11 26 17
35 82 14 2 15 21 34 25
6 71 21 0 1 20 23 18
8 68 8 0 0 10 19 14
49 74 11 0 0 7 16 13
8 70 15 3 2 21 26 25
47 70 15 8 16 15 35 24
6 61 11 5 4 7 15 14
14 54 12 1 12 13 27 21
30 55 13 2 1 12 20 17
4 54 10 3 12 20 26 22
24 40 14 0 2 5 14 8
19 66 13 7 12 21 35 27
45 54 10 0 6 6 14 16
22 64 14 12 8 19 27 26
16 47 16 3 9 15 18 10
32 48 16 0 7 9 14 18
37 52 14 4 6 20 26 26
47 74 19 4 9 14 23 23
5 57 12 0 2 4 11 8
6 57 10 0 1 16 15 17
60 80 11 3 8 18 28 21
58 78 13 1 18 19 34 23
6 70 16 2 11 9 23 11
16 47 14 0 10 7 12 8
45 94 19 8 10 28 32 32
9 63 11 2 12 5 25 14
69 76 16 7 11 18 29 21
35 59 11 2 5 10 23 24
19 55 8 0 1 14 19 12
58 74 14 1 0 10 18 18
58 71 17 6 4 23 31 26
79 54 14 0 6 6 15 14
Table 7
Finite intersection test--multivariate example--first dependent variable (RPMT)

Critical values
Error degrees of freedom: 31                   Poincare's lower bound: 9.8828
Sample estimate of error variance: 8.65        Sidak's upper bound: 10.0195
Overall α-level: 0.0169524                     Product upper bound: 10.0586

Simultaneous confidence intervals
                              Poincare's        Sidak's           Product
i      β_i       F_i          lower bound       upper bound       upper bound
1      0.2110    0.82773      (-0.52, 0.94)     (-0.52, 0.95)     (-0.52, 0.95)
2      0.0646    0.24418      (-0.35, 0.48)     (-0.35, 0.45)     (-0.35, 0.45)
3      0.2136    2.85731      (-0.18, 0.61)     (-0.19, 0.61)     (-0.19, 0.61)
4     -0.0373    0.06725      (-0.49, 0.42)     (-0.49, 0.42)     (-0.49, 0.42)
5     -0.0521    0.11646      (-0.53, 0.43)     (-0.54, 0.43)     (-0.54, 0.43)
used. These data are reproduced in Table 6. The three dependent variables are
scores on a student achievement test (SAT), the Peabody Picture Vocabulary Test
(PPVT), and the Raven Progressive Matrices Test (RPMT). The independent
variables consisted of the sum of the number of items answered correctly out of
20 on a learning proficiency test on two exposures to five types of paired-associate
learning tasks. These five tasks are named (N), skill (S), named skill (NS),
named action (NA), and sentence skill (SS). The same FORTRAN program used
for the analysis of the previous section was used for this analysis, since when
using the finite intersection test, a multivariate linear regression can be expressed
as several independent univariate linear regressions. The results appear in
Tables 7-9.
The model for these data is

$$Y = XB + E \qquad (6.1)$$

where B' = [β_0, B_1'], B_1' = [β_1,…,β_q], and the overall hypothesis tested is H: B_1 = 0
against the alternative A: B_1 ≠ 0. Again, a test on the intercept is not performed.
As in the previous univariate example, Tables 7-9 display simultaneous confidence
intervals constructed using the three bounds on the critical values. For
these data the use of the product upper bound results in confidence intervals only
0.9% wider than the confidence intervals using Poincare's lower bound, while the
use of Sidak's upper bound produces confidence intervals only 0.7% wider than
the confidence intervals using Poincare's lower bound. Again, very satisfactory
estimates of the true critical values have been obtained using probability
inequalities.

Note that in each of Tables 7-9 the Type I error rate is given as α* = 0.0169524.
This yields an experimentwise error rate of α = 0.05, since (1−α*)³ = 1−α, there
Table 8
Finite intersection test--multivariate example--second dependent variable (PPVT)

Critical values
Error degrees of freedom: 30                             Poincare's lower bound: 9.9414
Sample estimate of conditional error variance: 86.49     Sidak's upper bound: 10.0781
Overall α-level: 0.0169524                               Product upper bound: 10.1172
being 3 dependent variables. Also recall that Tables 8 and 9 display statistics on
conditional means, variances, and regression coefficients, the results of Table 8
being conditioned on holding the first dependent variable (RPMT) fixed, and the
results of Table 9 being conditioned on holding both the first and second
dependent variables (RPMT and PPVT) fixed.

The results of Table 8 show that the overall hypothesis H is rejected, since the
hypothesis H_{42}: η_{42} = 0 is rejected. Thus, the independent variable named action
(NA) is probably the only variable of importance in (6.1), and the other
independent variables (N, S, NS, SS) can be regarded as unimportant.
Table 9
Finite intersection test--multivariate example--third dependent variable (SAT)

Critical values
Error degrees of freedom: 29                             Poincare's lower bound: 9.9805
Sample estimate of conditional error variance: 435.38    Sidak's upper bound: 10.1172
Overall α-level: 0.0169524                               Product upper bound: 10.1563
References
[1] Cox, C. M., Krishnaiah, P. R., Lee, J. C., Reising, J. and Schuurmann, F. J. (1980). A study on
finite intersection tests for multiple comparisons of means. In: P. R. Krishnaiah, ed., Multivariate
Analysis, Vol. V. North-Holland, Amsterdam.
[2] Draper, N. R. and Smith, H. (1966). Applied Regression Analysis. Wiley, New York.
[3] Krishnaiah, P. R. (1963). Simultaneous tests and the efficiency of generalized incomplete block
designs. Tech. Rept. ARL 63-174. Wright-Patterson Air Force Base, OH.
[4] Krishnaiah, P. R. (1964). Multiple comparison tests in multivariate case. Tech. Rept. ARL
64-124. Wright-Patterson Air Force Base, OH.
[5] Krishnaiah, P. R. and Armitage, J. V. (1965). Probability integrals of the multivariate F
distribution, with tables and applications. Tech. Rept. ARL 65-236. Wright-Patterson Air Force
Base, OH.
[6] Krishnaiah, P. R. (1965a). On the simultaneous ANOVA and MANOVA tests. Ann. Inst.
Statist. Math. 17, 35-53.
[7] Krishnaiah, P. R. (1965b). Multiple comparison tests in multi-response experiments. Sankhyā,
Ser. A 27, 65-72.
[8] Krishnaiah, P. R. (1969). Simultaneous test procedures under general MANOVA models. In:
P. R. Krishnaiah, ed., Multivariate Analysis, Vol. II. Academic Press, New York.
[9] Krishnaiah, P. R. and Armitage, J. V. (1970). On a multivariate F distribution. In: R. C. Bose
et al., eds., Essays in Probability and Statistics. Univ. of North Carolina Press, Chapel Hill, NC.
[10] Krishnaiah, P. R. (1979). Some developments on simultaneous test procedures. In: P. R.
Krishnaiah, ed. Developments in Statistics, Vol. 2. Academic Press, New York.
[11] Krishnaiah, P. R. (1980). Computations of some multivariate distributions. In: P. R. Krishnaiah,
ed., Handbook of Statistics, Vol. 1: Analysis of Variance. North-Holland, Amsterdam.
[12] Mudholkar, G. S. and Subbaiah, P. (1979). MANOVA multiple comparisons associated with
finite intersection tests. In: P. R. Krishnaiah, ed., Multivariate Analysis, Vol. V. North-Holland,
Amsterdam.
[13] Roy, S. N. and Bose, R. C. (1953). Simultaneous confidence interval estimation. Ann. Math.
Statist. 24, 513-536.
[14] Roy, J. (1958). Step-down procedure in multivariate analysis. Ann. Math. Statist. 29, 1177-1187.
[15] Scheffé, H. (1953). A method for judging all contrasts in the analysis of variance. Biometrika 40,
87-104.
[16] Scheffé, H. (1959). The Analysis of Variance. Wiley, New York.
[17] Schuurmann, F. J., Krishnaiah, P. R. and Chattopadhyay, A. K. (1975). Tables for a multivariate
F distribution. Sankhyā, Ser. B 37, 308-331.
[18] Sidak, Z. (1967). Rectangular confidence regions for the means of multivariate normal distribu-
tions. J. Amer. Statist. Assoc. 62, 626-633.
[19] Timm, N. (1975). Multivariate Analysis with Applications in Education and Psychology. Brooks/Cole, Monterey,
CA.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 835-855

Dimensionality and Sample Size Considerations in Pattern Recognition Practice

A. K. Jain* and B. Chandrasekaran**
1. Introduction
The designer of a statistical pattern classification system is often faced with the
following situation: finite sets of samples, or paradigms, from the various classes
are available along with a set of measurements, or features, to be computed from
the patterns. The designer usually proceeds by estimating the class-conditional
densities of the measurement vector on the basis of the available samples and uses
these estimates to arrive at a classification function. Naive intuition suggests that
if the dimensionality of the measurement vector is increased, then the classifica-
tion error rate should generally decrease. In the case where the added measure-
ments do not contribute in any way to classification, then the error rate should at
least stay the same. For, after all, is not more information being utilized in the
design? However, in practice quite often the performance of the classifier based
on estimated densities improved up to a point, then started deteriorating as
further measurements were added, thus indicating the existence of an optimal
measurement complexity when the number of training samples is finite. The past
decade has seen much research devoted to elucidating this phenomenon under
various conditions [38].
The purpose of this paper is to discuss the role which the relationship between
the number of measurements (dimensionality of the pattern vector, or simply
dimensionality) and the number of training patterns (sample size) plays at various
stages in the design of a pattern recognition system. The designer of a pattern
recognition system can (and should) pose the following basic question: For a
given knowledge about the form of the underlying class-conditional densities and
the availability of certain numbers of training samples, how many measurements
should be used in designing the classifier? While no specific design equations are
available, our review below shows that general guidelines can be used to clarify
the situation, and help the designer be aware of several possible pitfalls.
*Research supported by NSF Grants ENG 76-11936 Aol and ECS 8007106.
**Research supported by AFOSR Grant 72-2351.
2. Classification performance
The most well-known example of the "curse of finite sample size" is the
peaking in the classification performance as the number of measurements is
increased for a fixed number of training samples. Consider a two-class pattern
recognition problem, where a total of N measurements is made on each pattern.
Let the prior probabilities of the two classes be equal for simplicity and let f_i(x)
be the class-conditional density function of the measurement vector x from class
c_i, i = 1,2. Let us first consider the simple situation where the number of training
samples equals infinity, that is, the class-conditional density functions are com-
pletely known. Then, the probability of correct recognition based on the optimal
Bayes decision rule is given by

$$P_{cr}(N) = \tfrac{1}{2}\int \max\{f_1(x),\, f_2(x)\}\, dx.$$

It is well known that P_cr(N) ≤ P_cr(N+1); that is, in the presence of perfect
information the classification accuracy would never decrease as the number of
measurements is increased. Whether or not lim_{N→∞} P_cr(N) approaches unity, i.e., perfect
discrimination is attained in the limit, is, however, a different issue and has been
studied in [7, 17, 27], and more recently in [65]. For example, in [65] it is shown
that given two classes with known densities f_1(x) and f_2(x), if

$$E_{f_1}\bigl\{\log[f_1(x)/f_2(x)]\bigr\} \Big/ \sigma_{f_1}\bigl\{\log[f_1(x)/f_2(x)]\bigr\} \to \infty \quad\text{as } N \to \infty,$$

then the probability of correct classification for objects from c_1 tends to unity,
where E_{f_1} and σ²_{f_1} denote, respectively, the expectation and variance with respect
to the f_1 distribution. A similar result holds for c_2.
A more realistic and at the same time mathematically tractable pattern recognition
problem involves the case where the form of the densities f_i(x) is known but
parameter values are unknown. Let m_i be the number of samples available from
class c_i, i = 1,2, to train the recognizer. The class-conditional densities need to be
estimated from these samples. In the Bayesian formulation, some a priori densities
on the parameters of f_i(x) are assumed, and through the sample set χ_i,
f_i(x|χ_i) can be calculated. On the other hand, one can arrive at maximum
likelihood estimates of the density function (by, say, first obtaining the maximum
likelihood estimates of the unknown parameters and then substituting these
estimated values for the true parameters in f_i(x)) if one does not want to concern
oneself with a priori densities on parameters. These estimated density functions
are then used in the Bayes decision function.
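A one-dimensional sketch of the distinction (all numbers are illustrative assumptions): for a Gaussian class with known variance and unknown mean under a vague prior, the estimative approach plugs the sample mean into N(·, σ²), while the predictive density is N(x̄, σ²(1 + 1/m)), whose inflated variance accounts for the estimation uncertainty and matters most when m is small.

```python
# Estimative (plug-in) versus predictive class-conditional density for a
# Gaussian class with known variance and unknown mean, vague prior.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
sigma, m = 1.0, 5
sample = rng.normal(loc=0.7, scale=sigma, size=m)
xbar = sample.mean()

x = 2.0                                          # point to be classified
estimative = norm.pdf(x, loc=xbar, scale=sigma)
predictive = norm.pdf(x, loc=xbar, scale=sigma * np.sqrt(1.0 + 1.0 / m))
print(estimative, predictive)                    # predictive has heavier tails
```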
Given the knowledge of a priori densities on the unknown parameters, the
Bayesian method in the design of a classification system is optimal, that is, there
does not exist any other decision rule which has a higher recognition accuracy.
The fact that a priori densities are required to be known and the complexity
involved in computing the a posteriori densities even in the commonly occurring
case of multivariate Gaussian densities with unknown mean vectors and covari-
ance matrices [40] restrict and limit the scope of the Bayesian method. Therefore,
in most practical applications, a suboptimal procedure, such as using maximum
likelihood estimates of the parameters in place of their true values, is preferred. In
the statistical literature the Bayesian method is often referred to as the predictive
method, and the terms estimative procedure and plug-in rule are used to denote
the method in which the unknown parameter is replaced by its estimate. Recent
investigation by Aitchison [2] and Aitchison et al. [3] concerns the conditions
under which the predictive method of statistical discrimination has superior
properties. Clearly, the classification performance depends on the estimation
procedure used, and we need to study problems pertaining to the relationship
between dimensionality and sample size in the context of the method of estima-
tion.
Allais [4] pointed out an interesting relation between dimensionality, sample
size and recognition accuracy in the linear prediction problem. He considered an
unobservable random variable, y, an observable measurement vector, x, and a
linear predictor of y, g(x), represented as

$$g(x) = c + w^{T}x$$

where c is a scalar and w^T is a row vector. The performance of g(x) was evaluated
in terms of its mean square error defined as

$$e = E[y - g(x)]^2$$
where E denotes the expectation operator. Allais considered the case where the
joint distribution of the predictor y and the measurement vector x was multivariate
Gaussian. Since the parameters of this distribution were assumed not known,
Allais considered a maximum likelihood predictor ĝ(x) and derived its unconditional
mean square error; the resulting expression deteriorates as N grows toward
the sample size m and is undefined for N ≥ m. This raised the question of whether
such behavior indicated the existence of a finite measurement complexity due only
to finite sample size, or whether the kind of estimates (maximum likelihood vs.
Bayesian) used had any relationship to it.
$$\sum_{i} \alpha_i = \sum_{i} \beta_i = 1.$$
performance of a Bayesian classifier? Waller and Jain [69] demonstrate that for a
given sample size and measurement complexity, the mean recognition accuracy
P̄_cr increases as the problem becomes more structured.
So far, in the Bayesian context, we have talked only about the average
performance over all possible problems generated by the a priori densities.
However, what about the performance P_cr in an individual problem (a specific set
of parameters), even though the parameter estimates are Bayesian for the given a
priori densities? This is the kind of performance in which a designer of a
recognition system is really interested. In this case the notion of the 'optimality'
of the estimate does not enter, since the optimality of the Bayesian estimates is
assured by averaging over a problem space generated by the a priori densities. In
[13], [14] and [65], examples are given that show that peaking is indeed possible in
this situation. More generally, not much is known about the conditions for perfect
discrimination or the convergence of Pcr as the number of measurements is
increased. If the measurements are independent, then the sufficient conditions of
[13], [14] and [65], can be used to determine if, for a specific problem, perfect
discrimination is possible in the limit. But in order to check for these conditions
one must know the true parameters of the class-conditional densities, which, if
available, would obviate the need for estimation in the first place. However, these
conditions do illuminate the fact that a finite sample size imposes greater
constraints on the measurement parameters to be 'good'. For example, if the
measurements are independent and binary with parameters Pi = P(xi = 1]c 1) and
q~--P(xg--11c2), i = 1,... ,N, then, if one has infinite number of samples, a good
measurement is one for which IP~ - qi] >~8 ~>0, a condition which is not sufficient
for perfect discrimination for the finite sample case.
2.2. Suboptimal classification rules
The average recognition accuracy of a Bayesian classifier, as shown in [64, 70],
will never decrease as the number of measurements is increased. However, when
the approach to classifier design is non-Bayesian--typically in such a situation
the parameters of the distributions may be estimated by say maximum likelihood
methods--then peaking becomes theoretically possible. There is no longer any
obviously natural notion of 'optimality', i.e., while the original Bayes' rule is
optimal, the decision rule that results by substituting the maximum likelihood
estimates of the parameters is no longer optimal, and this is often the cause of
peaking observed in the performance of many practical pattern classifiers. In
some sense the errors caused by the nonoptimal use of added information
outweighs the advantages of extra information. The mechanism behind this
self-defeating behavior is the subject of this section.
Classification problems involving multivariate Gaussian densities have received
the most attention in the literatures of both statistics and pattern recognition.
This is understandable due to the ease in mathematical analysis and because
class-conditional densities of many real-world classification problems can be
reasonably approximated as multivariate Gaussian. Let us consider two equiprobable
pattern classes which are represented by multivariate Gaussian densities with
mean vectors μ_1, μ_2 and a common covariance matrix Σ. Decide class c_1 if

$$\bigl[x - \tfrac{1}{2}(\mu_1 + \mu_2)\bigr]'\, \Sigma^{-1} (\mu_1 - \mu_2) > 0;$$

otherwise decide class c_2. It is well known that the probability of error using this
discriminant function is related to the Mahalanobis distance Δ² between the two
populations, given as

$$\Delta^2 = (\mu_1 - \mu_2)'\, \Sigma^{-1} (\mu_1 - \mu_2).$$
The larger the Mahalanobis distance, the smaller the probability of error. Usually
the parameters Σ, μ_1 and μ_2 are not known, and the following estimates of these
parameters based on m_i training samples from class c_i are commonly used:

$$\hat\mu_i = \frac{1}{m_i}\sum_{j=1}^{m_i} x_j^{(i)}, \qquad i = 1,2,$$

$$S = \frac{1}{m_1 + m_2 - 2} \sum_{i=1}^{2} \sum_{j=1}^{m_i} \bigl(x_j^{(i)} - \hat\mu_i\bigr)\bigl(x_j^{(i)} - \hat\mu_i\bigr)'$$

where μ̂_i is the sample mean of class c_i and S is the pooled unbiased estimate of
the common covariance matrix Σ. In the above expressions x_j^{(i)} refers to the jth
training sample from class c_i. In this situation, the most commonly used decision
rule is based on the W statistic proposed by Wald [5, 68], where

$$W = \bigl[x - \tfrac{1}{2}(\hat\mu_1 + \hat\mu_2)\bigr]'\, S^{-1} (\hat\mu_1 - \hat\mu_2).$$
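The peaking behavior of this plug-in rule is easy to exhibit by simulation. In the sketch below (all settings are illustrative assumptions), the training size per class is held fixed, only the first coordinate separates the classes, and added measurements are pure noise; the estimated test error of the W rule first falls, then rises as N grows toward the sample size.

```python
# Monte Carlo sketch of peaking for the plug-in W rule: fixed training
# size m per class, fixed true Mahalanobis distance, noise dimensions added.
import numpy as np

rng = np.random.default_rng(3)
m, n_test = 20, 4000

def test_error(N, delta=2.0, reps=30):
    mu = np.zeros(N)
    mu[0] = delta / 2                    # only coordinate 1 carries separation
    errs = []
    for _ in range(reps):
        x1 = rng.normal(size=(m, N)) + mu
        x2 = rng.normal(size=(m, N)) - mu
        m1, m2 = x1.mean(axis=0), x2.mean(axis=0)
        S = (np.cov(x1.T, ddof=1) * (m - 1) +
             np.cov(x2.T, ddof=1) * (m - 1)) / (2 * m - 2)
        w = np.linalg.solve(np.atleast_2d(S), m1 - m2)
        c = 0.5 * (m1 + m2) @ w          # threshold of the W rule
        t1 = rng.normal(size=(n_test, N)) + mu
        t2 = rng.normal(size=(n_test, N)) - mu
        errs.append(0.5 * ((t1 @ w < c).mean() + (t2 @ w > c).mean()))
    return float(np.mean(errs))

for N in (1, 2, 5, 10, 15, 18):
    print(N, round(test_error(N), 3))    # error typically peaks up as N -> m
```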
Rao [52] first illustrated the dangers of using too many measurements in a
classification problem involving two populations some thirty years ago. He stated
that "It does not seem to be, always, the more the better...". It is unfortunate
that the pattern recognition community is not aware of this pioneering work by Rao.
The example used by Rao to illustrate this problem involved discrimination
between Indian and Anglo-Indian skeletons based on two measurements, the lengths
of the femur and the humerus. Rao took 20 samples of Indian skeletons and 27 samples
of Anglo-Indian skeletons and used the estimate of the Mahalanobis distance,
D² = (μ̂_1 − μ̂_2)'S^{-1}(μ̂_1 − μ̂_2), to test if the separation between the two populations
is significant (at the 5% level). It was found that while the two populations were
significantly different when only a single measurement (either femur length or
humerus length) was used, there was no significant separation when both
measurements were used. Rao provides a test to determine whether the addition
of q more measurements to an existing set of N measurements increases the
distance between the two populations, and concludes that if the Mahalanobis
distance (Δ²_N) increases proportionately with the number of measurements then,
except in situations where the total number of samples is very small, the addition
of extra measurements does not result in a loss in discrimination.
More recently Jain and Waller [33] have studied the peaking phenomenon in an
effort to relate the optimum number of measurements to the number of available
training samples and the Mahalanobis distance between the two populations.
Their results can be related to those obtained by Rao [52]. For a classification
problem involving two equiprobable multivariate Gaussian densities with a
common covariance matrix they use the asymptotic expansion of the average
probability of error (good up to order 1/m²) derived in [59] to conclude the
following (of course, we are assuming that the estimates μ̂_i and S given earlier are
being used):
(1) The minimal increase in the Mahalanobis distance needed to keep the same
error rate when a measurement is added to a set of N features is

$$\delta\Delta^2_N = \Delta^2/(2m - 3 - N)$$

where m is the number of training samples per class. In order to avoid peaking,
the actual increase must satisfy δΔ² > Δ²/(2m − 3 − N). Note that increasing the
sample size decreases the value of δΔ², and in the limit as m → ∞, δΔ²_N → 0.
(2) If the Mahalanobis distance is proportional to the number of measurements
or, equivalently, if all the features are equally good, then the peaking in the
performance of the classifier is not a real problem because N_opt = m − 1.
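Result (1) is easy to read numerically; the small sketch below (values assumed purely for illustration) shows how the required increase per added measurement grows sharply as N approaches 2m − 3.

```python
# Minimal Mahalanobis-distance increase needed to avoid peaking when one
# measurement is added to N existing features (m samples per class).
def min_distance_increase(delta_sq, m, N):
    return delta_sq / (2 * m - 3 - N)

# Delta^2 = 4 (about 16% Bayes error) and m = 50 samples per class:
for N in (5, 20, 50, 90):
    print(N, round(min_distance_increase(4.0, 50, N), 4))
```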
In order to study the effect of the structure of the covariance matrix on the
optimum number of measurements, Jain and Waller [33] considered three types of
Toeplitz matrices for Σ. However, in their analysis, the classifier did not incorpo-
rate knowledge of the form or the structure of the true covariance matrix, and
thus they used the general covariance matrix estimator S in the W statistic. As
was expected, the performance of a classifier improved if it could utilize the
knowledge about the structure of the covariance matrix, as was recently con-
firmed by Morgera and Cooper [48]. They introduce the notion of 'effective
sample size' to show that the constrained Toeplitz estimator provides better
performance at a sample size m for which the generalized estimator has poor
performance. In other words, for the same performance a classifier using con-
strained Toeplitz estimates requires fewer samples than if the generalized estima-
tor is used, assuming that the true covariance matrix is of Toeplitz form.
Interestingly enough this reduction in required sample size (to maintain the same
performance) is an increasing function of the dimensionality N.
Several recent investigations have demonstrated the relationship between di-
mensionality, sample size and recognition accuracy based on Monte Carlo simula-
tions. Boullion et al. [8] show that if the decision rule is in the form of a linear
discriminant function, then a subset of measurements can yield better classifica-
tion results than the full set of measurements. A real-world example they
considered to demonstrate this consisted of data from a 12-channel sensor used in
remote sensing of agricultural crops (soybeans and corn) obtained from NASA. If
only thirty training samples are available from each of the two classes, then the
expected probability of misclassification is lowest when only six measurements
are used out of a possible total of twelve. The number of training samples per
class must exceed one hundred before all twelve measurements can be used safely.
Van Ness and Simpson [67] and Van Ness [66] confirm the results of [33, 52],
namely the Mahalanobis distance between the two populations must increase as
the number of measurements is increased for the classification performance to
stay at least the same. However, they do not provide any explicit expression for
this increase as a function of sample size, dimensionality and Mahalanobis
distance. Van Ness and Simpson [67] also compare experimentally the classifica-
tory power of five different discriminant functions (in the order of assuming less
and less knowledge about the true underlying distributions which were multi-
variate Gaussian): linear with unknown mean vectors and known covariance
matrices, linear with unknown mean vectors and unknown but common covari-
ance matrix, quadratic with unknown mean vectors and unknown covariance
matrices and finally two non-parametric decision rules involving Parzen window
estimates of the density function based on Gaussian and Cauchy window func-
tions. A surprising result of this comparison was that the two non-parametric
decision rules performed better than the linear and quadratic discriminant
functions (with unknown covariance matrices) even when the dimensionality was
small.
Results reported by Raudys [53] are similar to those in [8, 67]. Like Van Ness
and Simpson [67], Raudys uses Monte Carlo simulations to generate tables
showing the relationship between sample size, dimensionality, classification accu-
racy and complexity of the classification rule. As is understandable, these tables
depend on the true underlying class-conditional densities, which in the case of
Raudys were assumed to be multivariate Gaussian with a common identity
covariance matrix. An important guideline proposed by Raudys is that the
number of training samples required to achieve a given recognition accuracy
should increase linearly with the number of measurements for linear discriminant
functions.
In the above expression, $\hat f_1(x)$ and $\hat f_2(x)$ are the estimates of the class-conditional
densities which are used in the design of the classifier. Thus the quantities $e_i$,
$i = 1,2$, are the errors made in estimating the densities $f_i(x)$, $i = 1,2$, respectively.
It is clear that $E(e_i)$ is a function of the sample size, dimensionality and the true
underlying density. For Gaussian densities, Duin computes $E(e_i)$ for different
values of $m$ and $N$ using Monte Carlo runs, and shows that as dimensionality
increases, more samples are needed to keep $E(e_i)$ at some fixed value. This
supports one of the main reasons given to explain peaking: for a fixed sample
size, as the dimensionality increases, the extra discriminatory power provided by
the added features is overcome, after a certain point, by the deterioration in the
estimates of the densities. More work needs to be done to establish this point of
crossover for underlying Gaussian densities with different parameter values.
So far, most of the results we have summarized deal with Gaussian densities in
one way or another. Can some set of general conditions be obtained as a function
of $f_1(x)$, $f_2(x)$, $m_1$, $m_2$, and $N$ such that for a two-class classification problem we
are guaranteed to have perfect discrimination and monotonicity of the expected
probability of misclassification as the number of measurements is increased?
Chandrasekaran and Jain [13, 14] provide a partial answer to the question raised
above (see also [65] for some corrections to the results of [13]). To summarize the
results of [14] and [65], let
$f_1(x) = \prod_{i=1}^{N} f_{1i}(x_i) \quad\text{and}\quad f_2(x) = \prod_{i=1}^{N} f_{2i}(x_i)$
following abbreviations:

$d_i \equiv \log \hat f_{1i} - \log \hat f_{2i}, \qquad M_i^{(j)} \equiv E_{x \in c_j} E_X d_i,$

and

$V_i^{(j)} \equiv E_{x \in c_j} E_X \{ d_i - M_i^{(j)} \}^2,$

where $\hat f$ stands for estimated densities, $c_1$ and $c_2$ are the two classes, $E_{x \in c_j}$ stands
for expectation with respect to class $c_j$, and $E_X$ for expectation with respect to the
training data set $X$. Now the probability of correct classification, given that $c_1$ is
the true class, can be expressed in terms of these moments,
where $K_1$ and $K_2$ are positive for all positive $m$. Thus, the necessary and sufficient
condition for perfect discrimination and monotonicity of recognition accuracy is
that $\Delta_N^2 = \sum_i (\theta_i - \varphi_i)^2$ is of order $\sqrt N$ as $N \to \infty$. Note that in this example the
covariance matrix is the identity matrix, which is known by the classifier, whereas
in the models of Rao [52] and Jain and Waller [33] the common covariance matrix
has to be estimated. This explains why, in the less structured models of [52, 33],
the Mahalanobis distance $\Delta_N^2$ is required to increase proportionally to $N$ to avoid
peaking, while in the above example it is sufficient that $\Delta_N^2$ is of order $\sqrt N$. We can
only hypothesize that if the two covariance matrices are unequal and have to be
estimated, then $\Delta_N^2$ must be of order $N^2$ to avoid peaking.
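The sufficiency of the $\sqrt N$ growth rate can be checked numerically. The sketch below assumes independent unit-variance Gaussian measurements, identity covariance known to the classifier, mean differences $\theta_i - \varphi_i = i^{-1/4}$ (so that $\Delta_N^2$ grows roughly like $2\sqrt N$), and maximum likelihood mean estimates from $m_1 = m_2 = 10$ samples per class; the error of the plug-in rule should keep decreasing as $N$ grows.

import numpy as np

rng = np.random.default_rng(2)
m1 = m2 = 10

def plug_in_error(N, trials=400):
    delta = np.arange(1, N + 1) ** -0.25               # theta_i - phi_i
    errs = 0.0
    for _ in range(trials):
        th = rng.normal(0.0, 1 / np.sqrt(m1), N)       # ML estimate of theta
        ph = rng.normal(delta, 1 / np.sqrt(m2), N)     # ML estimate of phi
        x = rng.normal(0.0, 1.0, N)                    # test point from class 1
        d = np.sum((x - ph) ** 2) - np.sum((x - th) ** 2)   # > 0 -> class 1
        errs += (d <= 0)
    return errs / trials

for N in (5, 20, 80, 320):
    print(N, plug_in_error(N))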
The conditions derived by Chandrasekaran and Jain [14] are, as mentioned, for
the case of statistically independent measurements. In [15] the same authors have
generalized the conditions to the case of dependent measurements with arbitrary
distributions. Let $d(N; x)$ be the Bayes decision function such that $x \in c_1$ if
$d(N; x) > 0$ and $x \in c_2$ otherwise, where the pattern vector $x$ has $N$ components,
and let $\hat d(N; x)$ be the classifier obtained by using estimates. Let further

$M_N^{(j)} = E_{x \in c_j} E_X \hat d \quad\text{and}\quad V_N^{(j)} = E_{x \in c_j} E_X \{ \hat d - M_N^{(j)} \}^2.$
Then, by arguments similar to those of Gaffey [27] and Van Ness [65], it can be
shown that, under appropriate conditions on $M_N^{(j)}$ and $V_N^{(j)}$,
$\lim_{N\to\infty} P_{cr} = 1$ for elements of class $c_j$, $j = 1,2$. However, applying these
conditions to actual cases will be more or less difficult depending upon the
tractability of the underlying distributions. For the case of multivariate normal
distributions with unknown mean vectors and known covariance matrices, more
compact conditions for perfect discrimination are given in [15], and experimental
and mathematical investigations of some aspects of performance as the dimen-
sionality is increased are provided in [62] and [55].
Several authors (e.g., Levine, Lustick and Saltzberg [43]) have outlined the advantages of having equal numbers of
samples. In the context of a linear discriminant function, Rao [52] showed that for
given $\Delta_N^2$, $N$ and $(m_1 + m_2)$, and using maximum likelihood estimates, it is more
profitable to have $m_1 = m_2$. Note that $D^2$ is often used as a feature selection
criterion. Jain and Waller [33] used an asymptotic expansion to show that
Okamoto's [50] probability of misclassification is minimum when $m_1 = m_2$, and
that the degradation in the classification performance due to an unbalanced set of
training samples is more severe for a large number of measurements. Levine et al.
[43] show that for a nearest-neighbor decision rule, best results are obtained when
$m_1 \approx m_2$. The above results suggest that if the designer can obtain only a small
number of samples from one class, it is not necessary to compensate by taking
large samples from the other class. In fact, Chandrasekaran and Jain [13]
demonstrate a counter-intuitive phenomenon whereby discarding the excess sam-
ples from the class containing the greater number of samples is profitable.
What should a designer do when confronted with unequal numbers of training
samples? In this situation the degrees of reliability associated with the estimates
of the different class density functions are clearly different, and one has the
intuitive feeling that this factor should somehow be taken into account, i.e., the
decision function must be 'balanced' with respect to the different sample sizes. To
be more concrete, while $\hat f_1(x)$ and $\hat f_2(x)$ might be individually the 'best' estimates
of the density function, it is not at all clear that $\hat d(x) = \{\log \hat f_1(x) - \log \hat f_2(x)\}$ is
the best decision function to use if the sample sizes are substantially different.
Perhaps a modification such as $\log W[\hat f_1(x), m_1] - \log W[\hat f_2(x), m_2]$, where $W$ is a
weighting function and $m_1$ and $m_2$ are the two sample sizes, might perform better.
Kaminuma and Watanabe [37] proposed the idea of a well-balanced adaptive
decision function in the context of a perceptron convergence algorithm, when
substantially different numbers of paradigms from the two classes are used. These
authors proposed to adjust the position and orientation of the resulting hyper-
plane to reflect the different numbers of paradigms. In [37] the problem was
posed in a non-probabilistic context, and the criterion was heuristic in nature.
Chandrasekaran and Jain [15] observed that this question of balancing does not
arise in the case where the classifier is Bayesian, i.e., prior distributions on
unknown parameters are assumed and the classifier essentially determines $p(x \mid X)$,
where $X$ denotes the sample set. In this approach the weighting is automatically
incorporated in the posterior probabilities.
In [15] Chandrasekaran and Jain suggested that the criterion for weighting the
estimated densities should be chosen so that the counter-intuitive phenomenon of
doing better by discarding excess samples from the class containing the greater
number of samples, reported in [13], does not arise. Consider a two-class problem
involving multivariate Gaussian densities where the common covariance matrix is
known, and the mean vectors $\theta$ and $\varphi$ are unknown. They showed that if
maximum likelihood estimates of $\theta$ and $\varphi$ are used, based on $m_1$ and $m_2$ samples,
respectively, then a balanced decision function $\hat d_b(x)$ satisfying the above criterion
can be constructed. Note that $\hat d(x)$ is the usual linear decision function. In case $m_1 = m_2$, $\hat d_b(x) = \hat d(x)$,
and as $m_1$ and $m_2$ approach infinity, $\hat d_b(x)$ approaches $d(x)$, where $d(x)$ is
the Bayes decision surface for this problem. An interesting aspect of this
particular weighting is that $E_X \hat d_b(x) = d(x)$, i.e., the weighting results in the
expected value (over the samples) of the estimated decision surface being the
same as the Bayes decision surface.
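The exact balanced function of [15] is not reproduced in the present text, so the following sketch only illustrates the balancing idea under an assumed known identity covariance: subtracting the estimation variance $N/m_h$ from each squared distance yields a rule with the quoted property $E_X \hat d_b(x) = d(x)$.

import numpy as np

rng = np.random.default_rng(3)
N, m1, m2 = 8, 5, 100                     # deliberately unbalanced samples
theta, phi = np.zeros(N), 0.6 * np.ones(N)

def decision(x, th_hat, ph_hat, balanced):
    d = 0.5 * np.sum((x - ph_hat) ** 2) - 0.5 * np.sum((x - th_hat) ** 2)
    if balanced:
        # remove the mean estimation error: E||x - th_hat||^2 exceeds
        # ||x - theta||^2 by N/m1, and similarly for the other class
        d -= 0.5 * (N / m2 - N / m1)
    return d                              # > 0 -> assign to class 1

def error(balanced, trials=2000):
    wrong = 0
    for _ in range(trials):
        th_hat = rng.normal(theta, 1 / np.sqrt(m1))
        ph_hat = rng.normal(phi, 1 / np.sqrt(m2))
        x1 = rng.normal(theta, 1.0)       # one test point from each class
        x2 = rng.normal(phi, 1.0)
        wrong += (decision(x1, th_hat, ph_hat, balanced) <= 0)
        wrong += (decision(x2, th_hat, ph_hat, balanced) > 0)
    return wrong / (2 * trials)

print('plug-in :', error(False))
print('balanced:', error(True))

Under heavy imbalance the uncorrected rule is systematically biased toward the better-sampled class; the correction removes that bias.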
The results of [15] provide an efficient way to utilize information provided by
unequal numbers of samples when a non-Bayesian decision rule is employed.
These results need to be generalized to a larger class of distributions and
estimates.
4. Error estimation
Foley [24] considered the case of two multivariate Gaussian densities with unknown mean vectors $\mu_1$ and $\mu_2$ and known
common covariance matrix $\Sigma$. He computed the expected design-set error rate as
a function of $m$ (number of samples per class), $N$ and $\Delta_N^2$ (the true Mahalanobis
distance). Foley's results can be summarized as follows:
(1) $\lim_{N\to\infty} E\{E_d(m, N, \Delta_N^2 = 0)\} = 0$, where $E_d$ denotes the design-set error
rate. That is, by adding more and more measurements it is possible to make the
design-set error rate approach zero even if the true error rate is ½.
(2) The ratio $(m/N)$ is critical to the bias in the design-set error rate. If
$(m/N) > 3$, then the bias in the design-set error rate is close to zero.
Mehrotra [47] extended Foley's results to situations where the common covariance
matrix $\Sigma$ is also unknown and concluded that the ratio $(m/N)$ must be
larger than five before the bias in the design-set error rate is sufficiently small.
These results seem to confirm the hypothesis [38, 39]: "the less is known about
the underlying probability structure, the larger is the ratio of sample size to
dimensionality".
Intuitively it appears that one should know the class assignment of test samples
in order to estimate the error rate of a classifier. However, in many applications
of pattern recognition methodology, labelling of samples can be very expensive.
In view of this, Chow [16] proposed a procedure to estimate the error rate of a
classifier based on a set of unlabelled test samples. Chow established a relation-
ship between the error rate and the reject rate of a classifier, and since computing
the reject rate does not require the knowledge of class assignment of test samples,
the error rate can be determined with unlabelled test samples. Again, the
ratio of the number of training samples to dimensionality plays an important role
in this method of estimating the error rate. Fukunaga and Kessell [26] showed
that for two Gaussian distributions with unknown mean vectors and unknown
common covariance matrix, the estimate of error rate obtained from the empirical
reject rate is optimistically biased. This bias in the error rate is a function of the
ratio of sample size to dimensionality, and [26] recommends that this ratio must
be at least ten for the bias to be small.
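A sketch of Chow's idea: the reject rate $R(t)$ is measurable without labels, and Chow's error-reject relation, in the form usually quoted, $E(t) = -\int_0^t s\,dR(s)$, then yields the error rate. The two univariate Gaussians, the equal priors and the sample sizes below are illustrative assumptions.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
mu1, mu2 = -1.0, 1.0
labels = rng.integers(0, 2, 20000)          # kept only to check the answer
x = rng.normal(np.where(labels == 0, mu1, mu2), 1.0)

p2 = norm.pdf(x - mu2) / (norm.pdf(x - mu1) + norm.pdf(x - mu2))
conf = np.maximum(p2, 1.0 - p2)             # maximum posterior probability

ts = np.linspace(0.0, 0.5, 201)
R = np.array([(conf < 1.0 - t).mean() for t in ts])     # empirical reject rate
E = -np.cumsum(0.5 * (ts[1:] + ts[:-1]) * np.diff(R))   # E(t) = -int_0^t s dR(s)

t = 0.3                                     # a rejection threshold
acc = conf >= 1.0 - t
labelled = ((p2 >= 0.5) != (labels == 1))[acc].mean() * acc.mean()
print('from reject rate:', E[np.searchsorted(ts, t) - 1])
print('with labels     :', labelled)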
5. Conclusions
We have shown the important role which dimensionality and sample size play
in various areas of pattern recognition, namely classification accuracy, the K-nearest-neighbor
approach, and error estimation. There is no doubt that the
designer of a pattern recognition system should make every possible effort to
obtain as many samples as possible. As the number of samples increases, not only
does the designer have more confidence in the performance of the classifier, but
more measurements can be incorporated in the design of the classifier without the
fear of peaking in its performance. However, there are many pattern classification
problems where either the number of samples is limited (for example, in a medical
decision-making problem, there may be only a small number of patients available
who are suffering from a specific disease) or obtaining a large number of samples
is simply too costly.
References
[1] Abend, K., Harley Jr., T. J., Chandrasekaran, B. and Hughes, G. F. (1969). Comments on "On
the mean recognition accuracy of statistical pattern recognizers". IEEE Trans. Inform. Theory
15, 420-423.
[2] Aitchison, J. (1975). Goodness of prediction fit. Biometrika 62, 547-554.
[3] Aitchison, J., Habbema, J. D. F. and Kay, J. W. (1977). A critical comparison of two methods of
statistical discrimination. Appl. Statist. 26, 15-25.
[4] Allais, D. C. (1966). The problem of too many measurements in pattern recognition. IEEE Int.
Conv. Rec. 7, 124-130.
[5] Anderson, T. W. (1951). Classification by multivariate analysis. Psychometrika 16, 31-50.
[6] Bailey, T. and Jain, A. K. (1978). A note on distance-weighted k-nearest-neighbor rules. IEEE
Trans. Systems Man Cybernet. 8, 311-313.
[7] Ben-Bassat, M. and Gal, S. (1977). Properties and convergence of a posteriori probabilities in
classification problems. Pattern Recognition 9, 99-107.
[8] Boullion, T. L., Odell, P. L. and Duran, B. S. (1975). Estimating the probability of misclassifica-
tion and variate selection. Pattern Recognition 7, 139-145.
[9] Bowker, A. H. (1961). A representation of Hotelling's T 2 and Anderson's classification statistic
W in terms of simple statistics. In: H. Solomon, ed., Studies in Item Analysis and Prediction,
271-284. Stanford University Press, Stanford, CA.
[10] Bowker, A. H. and Sitgreaves, R. (1961). An asymptotic expansion for the distribution function
of the W-classification statistic. In: H. Solomon, ed., Studies in Item Analysis and Prediction,
292-310. Stanford University Press, Stanford, CA.
[11] Chandrasekaran, B. (1971). Independence of measurements and the mean recognition accuracy.
IEEE Trans. Inform. Theory 17, 452-456.
[12] Chandrasekaran, B. and Jain, A. K. (1974). Quantization complexity and independent measurements.
IEEE Trans. Comput. 23, 102-106.
[13] Chandrasekaran, B. and Jain, A. K. (1975). Independence, measurement complexity and
classification performance. IEEE Trans. Systems Man Cybernet. 5, 240-244.
[14] Chandrasekaran, B. and Jain, A. K. (1977). Independence, measurement complexity and
classification performance: An emendation. IEEE Trans. Systems Man Cybernet. 7, 564-566.
[15] Chandrasekaran, B. and Jain, A. K. (1979). On balancing decision functions. J. Cybernet.
Inform. Sci. 2, 12-15.
[16] Chow, C. K. (1970). On optimum recognition error and reject trade-off. IEEE Trans. Inform.
Theory 16, 41-46.
[17] Chu, J. T. and Chueh, J. C. (1967). Error probability in decision functions for character
recognition. J. Assoc. Comput. Mach. 14, 273-280.
[18] Cover, T. M. (1965). Geometrical and statistical properties of systems of linear inequalities with
application in pattern recognition. IEEE Trans. Elec. Comput. 14, 326-334.
[19] Cover, T. M. and Hart, P. E. (1967). Nearest neighbor pattern classification. IEEE Trans.
Inform. Theory 13, 21-27.
[20] Duda, R. O. and Hart, P. E. (1973). Pattern Classification and Scene Analysis. Wiley, New York.
[21] Duin, R. P. W. (1976). A sample size dependent error bound. Proc. Third Internat. Joint Conf.
Pattern Recognition, Coronado, CA, 156-160.
[22] Fischer, F. P., II. (1971). K-nearest neighbor rules. Ph.D. dissertation, School of Elect. Engr.,
Purdue University, Lafayette, IN.
[23] Fix, E. and Hodges, J. L. (1952). Discriminatory analysis, nonparametric discrimination: Small
sample performance. USAF School of Aviation Medicine, Randolph AFB, TX, Project 21-49-004,
Rep. 11. Also in: A. K. Agarwala, ed., Machine Recognition of Patterns, 280-322. IEEE Press,
New York, 1977.
[24] Foley, D. M. (1972). Considerations of sample and feature size. IEEE Trans. Inform. Theory 18,
618-626.
[25] Fukunaga, K. and Hostetler, L. D. (1973). Optimization of k-nearest neighbor density estimates.
IEEE Trans. Inform. Theory 19, 320-326.
[26] Fukunaga, K. and Kessell, D. L. (1972). Application of optimum error-reject functions. IEEE
Trans. Inform. Theory 18, 814-817.
[27] Gaffey, W. R. (1951). Discriminatory analysis: Perfect discrimination as the number of variables
increases. Report No. 5, Project No. 21-49-004, USAF School of Aviation Medicine, Randolph
Field, TX.
[28] Harter, H. L. (1951). On the distribution of Wald's classification statistics. Ann. Math. Statist.
22, 58-67.
[29] Highleyman, W. H. (1962). The design and analysis of pattern recognition experiments. Bell
System Tech. J. 41, 723-744.
[30] Hughes, G. F. (1968). On the mean accuracy of statistical pattern recognizers. IEEE Trans.
Inform. Theory 14, 55-63.
[31] Jain, A. K. (1976). On an estimate of the Bhattacharyya distance. IEEE Trans. Systems Man
Cybernet. 6, 763-766.
[32] Jain, A. K. and Dubes, R. (1978). Feature definition in pattern recognition with small sample
size. Pattern Recognition 10, 85-97.
[33] Jain, A. K. and Waller, W. (1978). On the optimum number of features in the classification of
multivariate Gaussian data. Pattern Recognition 10, 365-374.
[34] John, S. (1961). Errors in discrimination. Ann. Math. Statist. 32, 1125-1144.
[35] Kabe, D. G. (1963). Some results on the distribution of two random matrices used in
classification procedures. Ann. Math. Statist. 34, 181-185.
[36] Kain, R. Y. (1969). The mean accuracy of pattern recognizers with many pattern classes. IEEE
Trans. Inform. Theory 15, 424-425.
[37] Kaminuma, T. and Watanabe, S. (1972). Fast-converging adaptive algorithms for well-balanced
separating linear classifier. Pattern Recognition 4, 289-305.
[38] Kanal, L. (1974). Patterns in pattern recognition, 1968-1974. IEEE Trans. Inform. Theory 20,
697-722.
[39] Kanal, L. and Chandrasekaran, B. (1971). On dimensionality and sample size in statistical
pattern classification. Pattern Recognition 3, 225-234.
[40] Keehn, D. G. (1965). A note on learning for Gaussian properties. IEEE Trans. Inform. Theory
11, 126-132.
[41] Kittler, J. (1975). Mathematical methods of feature selection in pattern recognition. Internat. J.
Man-Mach. Stud. 7, 609-637.
[42] Lachenbruch, P. A. and Mickey, M. R. (1968). Estimation of error rates in discriminant analysis.
Technometrics 10, 1-11.
[43] Levine, A., Lustick, L. and Saltzberg, B. (1973). The nearest-neighbor rule for small samples
drawn from uniform distributions. IEEE Trans. Inform. Theory 19, 697-699.
[44] Lindley, D. V. (1968). The choice of variables in multiple regression. J. Roy. Statist. Soc. Ser. B
30, 31-66.
[45] Lindley, D. V. (1977). The Bayesian Approach. Seventh Scandin. Conf. in Math. Statist., 55-72.
[46] Loftsgaarden, D. O. and Quesenberry, C. P. (1965). A nonparametric estimate of a multivariate
density function. Ann. Math. Statist. 36, 1049-1051.
[47] Mehrotra, K. G. (1973). Some further considerations on probability of error in discriminant
analysis. Report on RADC contract no. F30602-72-C-0281.
[48] Morgera, S. D. and Cooper, D. B. (1977). Structured estimation: sample size reduction for
adaptive pattern classification. IEEE Trans. Inform. Theory 23, 728-741.
[49] Murray, G. D. (1977). A cautionary note on selection of variables in discriminant analysis. Appl.
Statist. 26, 246-250.
[50] Okamoto, M. (1963). An asymptotic expansion for the distribution of the linear discriminant
function. Ann. Math. Statist. 34, 1286-1301. Correction: Ann. Math. Statist. 39 (1968) 1358-
1359.
[51] Pettis, K., Bailey, T., Jain, A. K. and Dubes, R. (1979). An intrinsic dimensionality estimator
from near-neighbor information. IEEE Trans. Pattern Anal. Mach. Intelligence 1, 25-37.
[52] Rao, C. R. (1949). On some problems arising out of discrimination with multiple characters.
Sankhyā 9, 343-364.
[53] Raudys, S. (1976). On dimensionality, learning sample size and complexity of classification
algorithms. Proc. Third Internat. Joint Conf. Pattern Recognition, Coronado, CA., 166-169.
[54] Rejto, L. and Revesz, P. (1973). Density estimation and pattern recognition. Problems Control
Inform. Theory 2, 67-80.
[55] Roucos, S. and Childers, D. G. (1980). On dimensionality and learning set size in feature
extraction. Proc. Internat. Conf. Cybernet., Soc., Cambridge, MA., 26-31.
[56] Schaafsma, W. and Steerneman, A. G. M. (1980). Proofs and extensions of the results in the
paper on classification and discrimination if p → ∞. Internal Rept., Dept. Math., University of
Groningen, Groningen.
[57] Sitgreaves, R. (1952). On the distribution of two matrices used in classification procedures. Ann.
Math. Statist. 23, 263-270.
[58] Sitgreaves, R. (1961). Some results on the distribution of the W-classification. In: H. Solomon,
ed., Studies in Item Analysis and Prediction, 241-251. Stanford University Press, Stanford, CA.
[59] Sitgreaves, R. (1973). Some operating characteristics of linear discriminant functions. In: T.
Cacoullos, ed., Discriminant Analysis and Applications, 365-374. Academic Press, New York.
[60] Teichroew, D. and Sitgreaves, R. (1961). Computation of an empirical sampling distribution for
the W-classification statistic. In: H. Solomon, ed., Studies in Item Analysis and Prediction,
271-284. Stanford University Press, Stanford, CA.
[61] Toussaint, G. T. (1974). Bibliography on estimation of misclassification. IEEE Trans. Inform.
Theory 20, 472-479.
[62] Trunk, G. V. (1979). A problem of dimensionality. IEEE Trans. Pattern Anal. Mach. Intelligence
1, 306-307.
[63] Vajta, M. and Fritz, J. (1974). Some remarks on optimal kn-nearest neighbor pattern classifica-
tion, Proc. Second Internat. Joint Conf. Pattern Recognition, Lyngby, Denmark, 547-549.
[64] Van Campenhout, J. M. (1978). On the peaking of the Hughes mean recognition accuracy: The
resolution of an apparent paradox. IEEE Trans. Systems Man Cybernet. 8, 390-395.
[65] Van Ness, J. W. (1977). Dimensionality and classification performance with independent
coordinates. IEEE Trans. Systems Man Cybernet. 7, 560-564.
[66] Van Ness, J. W. (1980). On the dominance of non-parametric Bayes rule discriminant algo-
rithms in high dimensions. Pattern Recognition 12, 355-368.
[67] Van Ness, J. W. and Simpson, C. (1976). On the effects of dimension in discriminant analysis.
Technometrics 18, 175-187.
[68] Wald, A. (1944). On a statistical problem arising in the classification of an individual into one of
two groups. Ann. Math. Statist. 15, 145-162.
[69] Waller, W. and Jain, A. K. (1977). Mean recognition accuracy of dependent binary measure-
ments. Proc. Seventh Internat. Conf. Cybernet. Soc., Washington, DC, 586-590.
[70] Waller, W. G. and Jain, A. K. (1978). On the monotonicity of the performance of Bayesian
classifiers. IEEE Trans. Inform. Theory 24, 392-394.
[71] Young, I. T. (1978). Further consideration of sample and feature size. IEEE Trans. Inform.
Theory 24, 773-775.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 857-881

Selecting Variables in Discriminant Analysis for Improving upon Classical Procedures

Willem Schaafsma
1. Introduction
The solution to the 'dimensionality problem' might depend on (1) the specific
aim one has in mind, (2) the costs of measuring various potential variables, (3) a
priori knowledge. With respect to (1) we distinguish the following aims which can
be of main interest in discriminant analysis situations (aims of secondary interest
arise in a natural way when dealing with these main aims).
Aim 1. To construct one or more discriminant functions, either for the purpose
of dimension reduction to describe the data, or as a first step in the process of
evaluating the data.
Aim 2. To test whether the populations differ with respect to the variables under consideration.
Aim 3. To assign the individuals under classification to one (or more) of the k
populations; this entails the secondary aim of estimating the various misclassifica-
tion probabilities of the classification procedure which has been used.
Aim 4. To estimate posterior probabilities, for the individuals under classifica-
tion, preferably by means of confidence intervals.
Aim 5. To select a subset of the set of all s potential variables for use in the
future.
Aim 6. To distinguish 'clusters' of the $n_0$ individuals under classification, or of
the $k$ populations, or of the $s$ variables.
With respect to measuring costs we remark that they will play (almost) no part
in this paper: we restrict the attention to the Aims 1, 3 and 4. The main
motivation for this paper is that we want to assist painstaking scientists by
providing protection against the introduction of too many variables. Protection will
be achieved by modifying the underlying standard procedures.
A priori knowledge will play an important part in the stage preliminary to the
actual data evaluation (and also for the determination of the a priori probabilities
for each of the individuals under classification). During this preliminary stage,
results of previous investigations can be incorporated. We distinguish the follow-
ing steps.
are deleted and $\xi_1, \xi_2, \ldots$ are chosen by taking $\eta_{q+t+1}, \eta_{q+t+2}, \ldots$ if no reliable a
priori knowledge exists with respect to the discriminatory properties of these
variables. Otherwise such knowledge might be incorporated by rearranging the
principal components (the subject of ordering variables is very unpleasant from a
theoretical point of view; more details are given in [12]).
Standard procedures often show a degrading performance if the number of
variables involved is increased beyond a certain bound $p^*$. This very interesting
phenomenon has been observed by many scientists. Note that various, intrinsically
different, illustrations can be made because the underlying aims can be
different, or be specified differently. Of course, the illustrations will also depend
upon the underlying parameters and sample sizes. It is interesting to remark that
for some 'completely specified' aims it holds that the performance of the standard
procedure admits different specifications and that even the concept of 'standard'
procedure can be doubtful. These variations are of course of almost no importance
when compared with the influences of the sample sizes, the underlying
parameter and, in particular, the specification of the aim in mind.
The above-mentioned phenomenon implies that the standard procedure based
on all s variables can often be improved by deleting variables. It obviously
depends on the specific aim, performance, values of the underlying parameters
and sample sizes, which selection of variables should be made. A complication
will be caused by the fact that the underlying parameters are unknown. Some
estimation procedure has to be introduced.
In this paper we restrict the attention to the case that only two populations are
of real interest (the other ones, if available, are only exploited for estimating $\Sigma$).
At the beginning we had the intuitive feeling that the Aims 1, 3 and 4 are so
closely related that any technique for selecting variables that is natural for one of
these aims will also be natural for the other two. Computations showed that this
relationship may exist between Aim 1 and certain specifications of Aim 3, but it
disappears if Aim 4 is taken into consideration (see Section 5 for an explanation).
Notation
Most theory will be developed for an arbitrary number $p$ of variables. If it is of
interest to indicate which $p$ variables are taken from the set $\xi_1,\ldots,\xi_s$, then this
will sometimes be indicated by a left upper-script $p$ if $(\xi_1,\ldots,\xi_p)$ is meant and by
a set of left upper-scripts $j(1),\ldots,j(p)$ if $(\xi_{j(1)},\ldots,\xi_{j(p)})$ is meant.
$(X_{2\cdot} - X_{1\cdot})^T (f^{-1}S)^{-1}\{x - \tfrac12(X_{1\cdot} + X_{2\cdot})\}$
only if $g(x) > 0$, then the values of $a$, $\|w\|$ and $b$ will all be of interest. However,
when dealing with Aim 1 for its own sake, and disregarding matters of scaling or
defining cut-off points, it seems of interest to choose the performance concept in
such a way that it neither depends on $a$, nor $b$, nor $\|w\|$, while it is maximized if $w$
is any positive multiple of $\omega = \Sigma^{-1}(\mu_2 - \mu_1)$. The appropriate concept seems to be
the discriminatory value
$\delta(w) = \big\{E_{(\mu_2,\Sigma)}g(X) - E_{(\mu_1,\Sigma)}g(X)\big\}\big\{\operatorname{var}_\Sigma g(X)\big\}^{-1/2} = w^T(\mu_2 - \mu_1)/(w^T\Sigma w)^{1/2},$

$|\delta(w)| \le \max_{w \in R^p}\delta(w) = \delta(\omega) = \big\{(\mu_2 - \mu_1)^T\Sigma^{-1}(\mu_2 - \mu_1)\big\}^{1/2} = \Delta.$
This is a simple consequence of the Cauchy-Schwarz inequality; $\Delta$ is the usual
Mahalanobis distance.
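A small numerical check of this bound (the particular $\Sigma$, $\mu_1$ and $\mu_2$ below are arbitrary): $\delta(w)$ never exceeds $\Delta$, with equality at $w = \Sigma^{-1}(\mu_2 - \mu_1)$.

import numpy as np

rng = np.random.default_rng(5)
p = 4
A = rng.normal(size=(p, p))
Sigma = A @ A.T + p * np.eye(p)           # an arbitrary p.d. covariance
mu1, mu2 = rng.normal(size=p), rng.normal(size=p)

def delta(w):
    return w @ (mu2 - mu1) / np.sqrt(w @ Sigma @ w)

Delta = np.sqrt((mu2 - mu1) @ np.linalg.solve(Sigma, mu2 - mu1))
omega = np.linalg.solve(Sigma, mu2 - mu1)

print('Delta          :', Delta)
print('delta(omega)   :', delta(omega))   # equals Delta
print('best random w  :', max(delta(rng.normal(size=p)) for _ in range(1000)))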
The discriminatory value of any data-dependent discriminant function, with
$W = w(X_{1\cdot}, X_{2\cdot}, S)$ as the underlying vector of weights, will be a random variable

$D = \delta(W) = W^T(\mu_2 - \mu_1)/(W^T\Sigma W)^{1/2},$

with approximate performance

$E_\theta D^2 \approx \Delta^2 - \frac{\Delta^2(p-1)\{1 + (f-1)mm_1^{-1}m_2^{-1}\Delta^{-2}\}}{(f-1)\{1 + pmm_1^{-1}m_2^{-1}\Delta^{-2}\}}.$

It is interesting to draw some graphs of $\Delta^2: \{1,\ldots,s\} \to [0,\infty)$ and to study the
performance $E_\theta D^2$ of the classical standard procedures with $W = cS^{-1}(X_{2\cdot} - X_{1\cdot})$
as a function of $p$.
The basic question is which graphs of $\Delta^2$ should be expected in actual practice.
Theoretical deliberations are not very useful in this connection. Know-how in the
area of application would be decisive but is usually only very scanty. In order to
do at least something, we discuss graphs of $\Delta^2$ given by $\Delta^2(p) = \delta^2 p(p+1)^{-1}$.
Here $\delta^2 \in [1,5]$ might be appropriate when discriminating between 'tribes' on the
basis of skull measurements, $\delta^2 \in [3,4]$ might be appropriate when discriminating
between 'sexes' and $\delta^2 \in [4,16]$ when discriminating between 'races'. If one has to
discriminate between 'successful' and 'non-successful' highschool records on the
basis of some pre-highschool psychological examinations and this task is regarded
as intrinsically very uncertain, then one might think of $\delta^2 \in [1,3]$. Note that
$\lim_{p\to\infty}\Delta^2(p) = \delta^2$ implies that $\Phi(-\tfrac12\delta)$ is an intrinsic lower bound for the
maximum of the two misclassification probabilities.
In order to get a nice illustration, we single out the very special situation where
(1) $\delta^2 = 4$, (2) $m_1 = m_2$, (3) $f = m + 1$. Thus we have to study the graph of

$4\{-p^2 + (m+2)p + 1\}\{p^2 + (m+2)p + m + 1\}^{-1}.$
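Evaluating this curve numerically confirms the peaking behaviour discussed below, and the maximizing $p$ agrees with the value $p^* = [-\tfrac12 + \tfrac12(1+2m)^{1/2}]$ quoted later in this chapter.

import numpy as np

def perf(p, m):
    return 4 * (-p**2 + (m + 2) * p + 1) / (p**2 + (m + 2) * p + m + 1)

for m in (12, 40):
    p = np.arange(1, 16)
    p_star = -0.5 + 0.5 * np.sqrt(1 + 2 * m)
    print(m, 'argmax:', p[np.argmax(perf(p, m))], ' p*:', p_star)
    # m = 12 -> argmax 2, m = 40 -> argmax 4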
Interpretation
If one knows the graph of $\Delta^2$ as a function of $p$, then one will 'often' see that
the approximate performance of the classical vector of weights $cS^{-1}(X_{2\cdot} - X_{1\cdot})$
increases for $p \le p^*$ and decreases for $p \ge p^*$, where $p^*$ will be an increasing
function of the sample sizes. This means in particular that the performance of the
classical procedure based on all $s$ variables (the value of the performance for
$p = s$) can 'often' be improved upon by replacing this procedure by a classical
one based on a predetermined subset of the set of all variables. In Section 3 other
'non-classical' procedures will be considered which essentially make use of all
variables $\xi_1,\ldots,\xi_s$ but are based on 'very complicated functions $w(X_{1\cdot}, X_{2\cdot}, S)$'
(and possibly $a(X_{1\cdot}, X_{2\cdot}, S)$ and $b(X_{1\cdot}, X_{2\cdot}, S)$). For these complicated procedures
the performance $E_\theta D^2$ will not depend on $\theta = (\mu_1, \mu_2, \Sigma)$ through the
Mahalanobis distance alone.
the appendix, we introduce the following concepts. Let $\Psi$ be any orthogonal
matrix with $\Delta^{-1}(\mu_2 - \mu_1)^T\Sigma^{-1/2}$ as its first row. Define

$\delta = (m_1 + m_2)^{-1/2} m_1^{1/2} m_2^{1/2}\Delta \quad\text{and}\quad V = \Psi\Sigma^{-1/2} S \Sigma^{-1/2}\Psi^T,$

and notice that the probabilistic assumptions in the beginning of this section
imply that those at the beginning of the appendix are satisfied. The following
identifications can be made:

$S^{-1}(X_{2\cdot} - X_{1\cdot}) = m_1^{-1/2} m_2^{-1/2}(m_1 + m_2)^{1/2}\,\Sigma^{-1/2}\Psi^T R,$

$D = \delta(W) = \Delta R_1/\|R\| \quad\text{if } W = cS^{-1}(X_{2\cdot} - X_{1\cdot}) \text{ for } c > 0,$

$ES^{-1}(X_{2\cdot} - X_{1\cdot}) = (f - p - 1)^{-1}\Sigma^{-1}(\mu_2 - \mu_1),$

and a related second-moment expression equals

$\{(f - p)(f - p - 1)(f - p - 3)\}^{-1}$

times the above-mentioned one. This result was first derived by Das Gupta [3]
(see [6]).
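These identifications make the distribution of $D$ easy to simulate; the sketch below draws $Y \sim N_p(\delta e_1, I_p)$ and $V \sim W_p(f, I_p)$ directly (the parameter values are arbitrary) and estimates $E D^2$ for the classical weights.

import numpy as np

rng = np.random.default_rng(6)
p, f, Delta, m1, m2 = 6, 30, 2.0, 20, 20
delta = np.sqrt(m1 * m2 / (m1 + m2)) * Delta   # noncentrality, as defined above

def draw_D():
    Y = rng.normal(0.0, 1.0, p)
    Y[0] += delta
    G = rng.normal(size=(f, p))
    V = G.T @ G                                # Wishart_p(f, I)
    R = np.linalg.solve(V, Y)
    return Delta * R[0] / np.linalg.norm(R)

Ds = np.array([draw_D() for _ in range(5000)])
print('E D^2 ~', (Ds**2).mean(), ' (Delta^2 =', Delta**2, ')')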
The question arises whether the choices $c = f - p - 1$ (leading to the best
unbiased estimator), $c = f$ (the usual plug-in estimator where $\Sigma$ is replaced by
$f^{-1}S$), or $c = f + 2$ (the maximum likelihood estimator) lead to admissibility from
the m.s.e. point of view. For that purpose we consider the mean squared error,
which decomposes into a coefficient times $\omega\omega^T$ and a coefficient times $\Sigma^{-1}$,
the latter involving the factor

$c^2(f-p+1)(f-p)^{-1}(f-p-1)^{-2}(f-p-3)^{-1},$

and we determine

$c^* = (f-p)(f-p-3)(f-p-1)^{-1}$

by minimizing the coefficient of $\omega\omega^T$. Note that $c^*$ is smaller than any of the
above-mentioned choices of $c$. This implies that each of these choices leads to an
inadmissible estimator because, by minimizing the coefficient of $\omega\omega^T$, the coefficient
of $\Sigma^{-1}$ is also decreased. Notice that $c^* \approx f - p - 1$ and that the improvement
over the best unbiased estimator will not be substantial. As $\omega$ and $\Sigma^{-1}$ can
vary independently, one can show that no estimator of the form $cS^{-1}(X_{2\cdot} - X_{1\cdot})$
with $c \in [0, c^*]$ can be improved upon uniformly by considering other estimators
of this form.
The basic question arises whether one should go beyond the class of estimators
of the form $cS^{-1}(X_{2\cdot} - X_{1\cdot})$ when estimating $\omega = \Sigma^{-1}(\mu_2 - \mu_1)$. This question is
closely related to the problem whether one should go beyond the class of weight
vectors $w(X_{1\cdot}, X_{2\cdot}, S) = cS^{-1}(X_{2\cdot} - X_{1\cdot})$ when constructing an affine-linear discriminant
function. The last formulation suggests that one should be willing to
delete variables, or in other words to consider estimators for $\omega \in R^s$ in which
various components are equated to 0. The original formulation suggests that
modifications can be obtained along the lines of 'ridge-regression' or 'Stein
estimators'. Thus there are many ways to go beyond the class $\{cS^{-1}(X_{2\cdot} - X_{1\cdot});\ c > 0\}$.
We shall restrict attention mainly to selection-of-variable modifications.
Section 2 provides evidence that from a practical point of view the standard
procedure based on all s variables can be substantially improved if s is large and
the sample sizes are small. The 'illustrations' in Section 2 and elsewhere in the
literature (see e.g. [12] for references in the area of pattern recognition) are based
on a given probabilistic structure: the graph of $\Delta^2$ should be known. In practice
the a priori information with respect to this graph is very scanty. One will have to
use the data under evaluation for estimating $\Delta^2$. It seems natural to proceed as
follows when dealing with Aim 1.
Suppose that the outcome of the independent random variables $X_{1\cdot}$, $X_{2\cdot}$ and $S$
has to be evaluated, where $X_{h\cdot} \sim N_s(\mu_h, m_h^{-1}\Sigma)$ and $S \sim W_s(f, \Sigma)$. Then one should
start out by reconsidering the initial ordering $\xi_1,\ldots,\xi_s$ because the performance of
the procedure to be chosen will depend on the reliability of this ordering. We
distinguish among the following situations:
(1) the initial ordering is so reliable that the investigator is not willing to
consider any other subset than $\xi_1,\ldots,\xi_p$ if he has to select $p$ variables from
$\xi_1,\ldots,\xi_s$;
(2) the initial ordering is 'rather' reliable but 'some deviations from this
ordering are allowed';
(3) the initial ordering is 'very shaky', 'almost random'.
The distinction between these categories is not very clear and to some extent a
matter of taste. Situation (3) gives rise to nasty complications which will be
outlined in the technical details at the end of this section.
Assuming that situation (2) applies, we propose the following procedure.
Step 1. Estimate $\Delta^2: \{1,\ldots,s\} \to [0,\infty)$ by means of the uniformly best unbiased
estimator $\hat\Delta^2$. Consider the corresponding outcomes. It may happen that for some $p$ the
difference $\hat\Delta^2(p+1) - \hat\Delta^2(p)$ is so large (small) that one would like to move $\xi_{p+1}$
to the left (right). This should be done very reluctantly (see the technical details at
the end of this section). One should do nothing unless some appropriate test(s)
lead to significance. In this context one will make use of the test for additional
discrimination information [7, 8d.3.7] or [16], the test for discriminant function
coefficients [7, 8d.3.9] or approximate tests for comparing standardized weights
[12, Section 4.9]. The procedure for rearranging the variables will not be described
explicitly here because various different specifications are reasonable. After
rearranging the variables, proceed as if the thus obtained ordered set had been
decided upon in advance, again using the notation $\xi_1,\ldots,\xi_s$.
Step 2. Consider the ordered set $\xi_1,\ldots,\xi_s$ obtained after Step 1 and estimate $\Delta^2$
by means of the above-mentioned estimator $\hat\Delta^2$, which has lost its optimum
property because Step 1 contains some data-peeping. It is expected that the
resulting positive bias is so small that it does not lead to serious consequences.
Step 3. Estimate the 'performance' $E_\theta D^2$ of the classical vector of weights
$c({}^pS)^{-1}({}^pX_{2\cdot} - {}^pX_{1\cdot})$ by plugging $\hat\Delta^2(p)$ into the approximation of $E_\theta D^2$ given in Section 2.
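The unbiased estimator itself is not reproduced above; the sketch below assumes the standard form $\hat\Delta^2(p) = f^{-1}(f-p-1)D_p^2 - p(m_1+m_2)/(m_1m_2)$, with $D_p^2$ the sample Mahalanobis distance based on the first $p$ variables, which is unbiased under the model of this section but may differ from the estimator the author has in mind.

import numpy as np

def delta2_hat(x1, x2, f):
    """Unbiased estimates of Delta^2(p), p = 1..s (assumed form)."""
    m1, s = x1.shape
    m2 = x2.shape[0]
    d = x2.mean(0) - x1.mean(0)
    S = (x1 - x1.mean(0)).T @ (x1 - x1.mean(0)) \
      + (x2 - x2.mean(0)).T @ (x2 - x2.mean(0))   # W_s(f, Sigma), f = m1+m2-2
    out = []
    for p in range(1, s + 1):
        D2 = f * d[:p] @ np.linalg.solve(S[:p, :p], d[:p])
        out.append((f - p - 1) / f * D2 - p * (m1 + m2) / (m1 * m2))
    return np.array(out)

rng = np.random.default_rng(9)
x1 = rng.normal(0.0, 1.0, (20, 6))
x2 = rng.normal([1, .7, .4, 0, 0, 0], 1.0, (20, 6))
print(delta2_hat(x1, x2, f=38).round(2))   # should roughly track the true curve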
Interpretation
Incorporating a selection of variables technique is a very complicated matter.
The ultimate goal in this section is to achieve a performance $E_\theta D^2$ which is as
large as possible, as a function of $\theta = (\mu_1, \mu_2, \Sigma)$, in the region of $\theta$'s which seem
of interest. It is expected that the performance

$E_\theta D^2 = E_\theta\big\{\big(W^{*T}(\mu_2 - \mu_1)\big)^2/\big(W^{*T}\Sigma W^*\big)\big\},$

with $D = \delta(W^*)$ and $W^* = w^*(X_{1\cdot}, X_{2\cdot}, S)$ defined by Step 4, compares favorably
with

$E_\theta D^2 = E_\theta \frac{\{(X_{2\cdot} - X_{1\cdot})^T S^{-1}(\mu_2 - \mu_1)\}^2}{(X_{2\cdot} - X_{1\cdot})^T S^{-1}\Sigma S^{-1}(X_{2\cdot} - X_{1\cdot})},$

with $D = \delta(W)$, in the greater part of the region of $\theta$'s which are more or less in
agreement with the initial ordering. Note that the last-mentioned $E_\theta D^2$ depends
on $\theta = (\mu_1, \mu_2, \Sigma)$ through the Mahalanobis distance $\Delta$ whereas the first-mentioned
performance is not determined by $\Delta$. A comparison by means of simulation
experiments will be a very complicated matter because the region of $\theta$'s for
which the initial ordering is 'correct' is not unambiguous (see [12, Section 4.9])
and certainly very complicated to overview.
Conclusion
We do not know precisely what to do when dealing with situation (3): all three
approaches look attractive, but large scale simulation experiments for comparing
the three approaches have not yet been carried out.
Before adopting a procedure for evaluating the outcome of $(X_0, X_{1\cdot}, X_{2\cdot}, S)$
one should discuss the underlying decision situation, e.g. by posing the following
questions. Is it a situation with 'forced decision' or is one allowed to remain
undecided? Which loss function should be used? Is it more appropriate to keep
the error probabilities under control by imposing significance level restrictions?
Can a reasonable choice be made for the a priori probability $P(T = 2) = \tau$ that the
individual under classification belongs to population 2?
Complete specification of action space, loss structure and a priori probabilities
is often felt to be extremely restrictive. We have the opinion that in such
situations it might be more attractive to construct a confidence interval for the
posterior probability (in principle this requires that some a priori probability $\tau$ is
specified; it will be seen in Section 5 that the influence of $\tau$ can be studied easily).
In this section we restrict the attention to two-decision situations where one
individual has to be assigned either to population 1 or to population 2 ($m_0 = 1$)
while no reasonable a priori probability $\tau$ is available. With respect to the loss
structure we consider
(1) the case of 0-1 loss,
(2) the Neyman-Pearson approach where the probability of assigning to
population 2, whereas the individual belongs to population 1 (the probability of
an error of the first kind), is bounded from above by the significance level $\alpha$.
Note that the case of 0-$a$-$b$ loss (an error of the first kind costs $b$ units, an
error of the second kind $a$) leads to the Neyman-Pearson formulation with
$\alpha = a/(a+b)$ if attention is restricted to the class of minimax procedures or,
equivalently, to that of all procedures which are unbiased in the sense of
Lehmann. ([8] is applicable: the problem is of type I because, using $\theta = (t, \mu_1, \mu_2, \Sigma)$
as the unknown parameter, the subsets $\Theta_1 = \{\theta;\ t = 1\}$ and $\Theta_2 = \{\theta;\ t = 2\}$
have $\Theta_0 = \{\theta;\ \mu_1 = \mu_2\}$ as common boundary.)
$A = f(X_{2\cdot} - X_{1\cdot})^T S^{-1}\big\{X_0 - \tfrac12(X_{1\cdot} + X_{2\cdot})\big\}$

of $(\mu_1, \mu_2, \Sigma)$, where the subscript $t \in \{1,2\}$ indicates that the moments have to be
computed for $\theta = (t, \mu_1, \mu_2, \Sigma)$. This function is closely related to the performance
$E_\theta D^2$ which was used in Sections 2 and 3, although the values of these two
functions differ more than we had expected (at least for the examples which we
considered).
Tedious computations, performed independently by A. Ambergen, provided
the result

$\operatorname{var}_t(A) = f^2(f-p)^{-1}(f-p-1)^{-2}(f-p-3)^{-1}\big(a\Delta^4 + b_t\Delta^2 + c\big)$

where

$a = \tfrac12(f-p),$
$b_t = (f-p-1)(f-1)\big(1 + m_{3-t}^{-1}\big) + m_1^{-1}m_2^{-1}(m_t - m_{3-t})(f-1),$
$c = m_1^{-1}m_2^{-1}(m_1 + m_2)p(f-p-1)(f-1) + \tfrac12\big(m_1^{-2} + m_2^{-2}\big)p(f-p)(f-1) - m_1^{-1}m_2^{-1}p(f-1).$
$\approx \Phi(-\tfrac12\Delta) + m^{-1}\varphi(-\tfrac12\Delta)\big(8^{-1}\Delta^3 + 4^{-1}(p+3)\Delta + p\Delta^{-1}\big),$

which formula does not agree with

$\operatorname{var}(A) \approx \frac{2mp(m+1-p) + m(m+2)(m-p)(p+1) + (m+1)(m-p)(p+1)^2}{mp(m-p+1)(m-p-2)}.$
We conjecture that the Aims 1 and 3 are so closely related, when 0-1 loss is
used, that the corresponding values of $p^*$ are almost the same. This conjecture can
be verified for our very special example. For $m = 12$ we verified that the
above-mentioned function of $p$ assumes its maximum for $p = 2$; $m = 40$ yields a
maximum for $p = 4$. These values are in perfect agreement with $p^* = [-\tfrac12 + \tfrac12(1+2m)^{1/2}]$ based on $E_\theta D^2$.
to use the exact expressions for $E_1 A$ and $\operatorname{var}_1 A$ in order to arrive at an
appropriate studentized version. If we consider $A - E_1 A$ and replace $E_1 A$ by the
corresponding best unbiased estimator, then we obtain an expression involving the
factor $\{(f-p)(f-p-1)(f-p-3)\}^{-1}$ and, among other terms, $2m_1^{-2}(f-p)(f-p-1)$.
5. Dealing with Aim 4 in the case $k = 2$, $m_0 = 1$
$Z_x = (X_{2\cdot} - X_{1\cdot})^T S^{-1}\big\{x - \tfrac12(X_{1\cdot} + X_{2\cdot})\big\}$
$\phantom{Z_x} = (X_{2\cdot} - X_{1\cdot})^T S^{-1}(x - X_{\cdot\cdot}) + \tfrac12 m^{-1}(m_2 - m_1)(X_{2\cdot} - X_{1\cdot})^T S^{-1}(X_{2\cdot} - X_{1\cdot}),$

where $X_{\cdot\cdot} = m^{-1}(m_1X_{1\cdot} + m_2X_{2\cdot})$ is introduced in order to exploit that $X_{\cdot\cdot}$,
$X_{2\cdot} - X_{1\cdot}$ and $S$ are independent. $ES^{-1} = (f-p-1)^{-1}\Sigma^{-1}$ (see Theorem A.2(1)
in the appendix) implies that the exact variance $\sigma_x^2$ involves, among others, the terms

$\cdots + mm_1^{-1}m_2^{-1}(f-1)(f-p-1)\big\{(x - \bar\mu)^T\Sigma^{-1}(x - \bar\mu) + 4^{-1}\Delta^2\big\}$
$+ (m_1^{-1} - m_2^{-1})(f-1)(f-p+1)\zeta_x$
$+ \tfrac12\big(m_1^{-2} + m_2^{-2}\big)p(f-1)(f-p) - m_1^{-1}m_2^{-1}p(f-1)\big],$
where the notation $\bar\mu = \tfrac12(\mu_1 + \mu_2)$ should not be confused with the notation
$\mu_{\cdot\cdot} = m^{-1}(m_1\mu_1 + m_2\mu_2)$. The exact variance $\sigma_x^2$ is a substantial improvement
over the asymptotic variance which was obtained in [10] and which also appears if
one lets $f = m_1 + m_2$ tend to infinity with $p$ and $\rho_h = m^{-1}m_h$ fixed.
Of course we propose using $\hat\zeta_x \pm u_{\alpha/2}(f-p-1)\hat\sigma_x$ as confidence bounds for $\zeta_x$;
$u_{\alpha/2}$ denotes the $\tfrac12\alpha$ upper quantile of the $N(0,1)$ distribution and $\hat\sigma_x$ any
appropriate estimator for $\sigma_x = (\sigma_x^2)^{1/2}$ ([1] is devoted to the general case $k > 2$).
Thus the classical procedure for constructing a confidence interval for $\zeta_x$ is
available. Note that the concept of posterior probability depends upon the set of
variables considered. If one tries to remove this dependence by restricting the
attention to the posterior probability with respect to all $s$ variables $\xi_1,\ldots,\xi_s$, then
this will have certain optimum properties if $\mu_1$, $\mu_2$, $\Sigma$ were known, but one may
make very serious estimation errors if (1) $\mu_1$, $\mu_2$ and $\Sigma$ are unknown, (2) $s$ is large,
(3) $f$, $m_1$ and $m_2$ are not large, (4) the classical estimator $\hat\zeta_x$ is used. This follows
from
$E_t\operatorname{var}\hat\zeta_{X_0} = (f-p-1)^2 E_t\sigma_{X_0}^2$
$= (f-p)^{-1}(f-p-3)^{-1}\big[\tfrac12(f-p)\Delta^4 + \big\{2 + (p+1)(f-p-1) + m_{3-t}^{-1}(f-1)(f-p+1)$
$- m_1^{-1}m_2^{-1}m(f-1)\big\}\Delta^2$
$+ m_1^{-1}m_2^{-1}(m+1)(f-1)(f-p-1)p$
$+ \tfrac12(f-1)(f-p)p\big(m_1^{-1} - m_2^{-1}\big)^2\big]$
which is based on approximating the moments of $\hat\sigma_{X_0}$ by
$\tfrac12\big(E_1\hat\sigma_{X_0} + E_2\hat\sigma_{X_0}\big)$ and $\big\{\tfrac12\big(E_1\sigma_{X_0}^2 + E_2\sigma_{X_0}^2\big)\big\}^{1/2}$.
With respect to estimating $\Delta^2$ the following situations should be distinguished:
(1) the data under evaluation need not be used for estimating $\Delta^2(p)$ or
$\Delta^2(\xi_{j(1)},\ldots,\xi_{j(p)})$ because estimation can be done on the basis of 'professional
knowledge' or 'other data',
(2) the data under evaluation have to be used because 'professional knowledge'
and 'other data' are not sufficiently relevant.
Situation (2) is very unpleasant because the data dependent subset has a very
peculiar probabilistic structure. The conditional distribution of the scores for the
selected subset, conditionally given that this subset $\xi_{j(1)},\ldots,\xi_{j(p)}$ has been selected,
will differ from the corresponding unconditional distribution. If the outcomes for
the selected subset are treated by means of the classical standard procedure, then
one will overestimate $\Delta^2(\xi_{j(1)},\ldots,\xi_{j(p)})$, overestimate the discriminatory properties
of any discriminant function, underestimate the misclassification probabilities
and construct confidence intervals which cover the true value with probability
smaller than $1 - \alpha$. The technical details at the end of Section 3 contain some
indications of the bias to be expected if all subsets $\xi_{j(1)},\ldots,\xi_{j(p)}$ of $p$ elements are
allowed. We have the opinion that one should try to avoid situation (2) by
introducing as much a priori professional knowledge as seems reasonable. In
practice one will not be able to avoid situation (2) completely because the
investigation would not be of much use if the a priori professional knowledge is
already sufficiently relevant for estimating $\Delta^2(\xi_{j(1)},\ldots,\xi_{j(p)})$.
We recall the distinctions drawn at the beginning of Section 3 between
(1) the initial ordering is 'very compelling',
(2) the initial ordering is 'rather compelling',
(3) the initial ordering is 'rather useless'.
Notice that the initial ordering was based on a priori professional knowledge,
see Step 4 of the preliminary stage discussed in Section 1.
If the initial ordering is (at least) rather compelling, then one might decide to
deviate from this initial ordering only in very exceptional situations. This will
imply that we need not worry much about the above-mentioned kinds of bias. We
propose to use either procedure 1 or procedure 2.
In practice the initial ordering will often be rather useless and one will decide
to deviate from the initial ordering. We worry so much about the before-
mentioned kinds of bias that we prefer the following approach.
$n_h/m_h = \big(1 + (p-1)^{1/2}\big)^{-1}$

are independent, while $Y_{hi}$ has the univariate $N(\eta_h, \sigma^2)$ distribution and $Y_0$ the
$N(\eta_t, \sigma^2)$ one. Of course $\eta_h = a w^T\mu_h + b$ and $\sigma^2 = a^2 w^T\Sigma w$. Thus theory for
univariate classification problems is applicable (see [11] for a survey and some
new results).
the dependence on $p$; [13] is devoted to the case $p \to \infty$, $\Sigma = I$, fixed sample sizes.
At a meeting in Ober-Wolfach (1978), Prof. Eaton made an indispensable
contribution by showing the author that various exact moments can be
obtained. Unfortunately the basic issues often still require that some unpleasant
approximations have to be made. Ton Steerneman performed indispensable
simulation experiments and Ton Ambergen verified various results by means of
independent computations. Many others have contributed by making comments,
referring to the literature and sending papers.
Appendix A
The following theoretical and simulation results constituted the core of the
previous discussions. Throughout the appendix Y and V are assumed to be
independent,

$Y \sim N_p(\delta e_1, I_p), \qquad V \sim W_p(f, I_p),$

where $e_1 = (1, 0, \ldots, 0)^T$, $f > p$, and $R = V^{-1}Y$.
PROOF. The first result in (1) is an immediate consequence of the Central Limit
Theorem for Wishart distributions. The second result follows from the first one
by using
$f^{-1}V = I_p + f^{-1/2}\big\{f^{1/2}\big(f^{-1}V - I_p\big)\big\},$
$fV^{-1} = I_p - f^{-1/2}\big\{f^{1/2}\big(f^{-1}V - I_p\big)\big\} + f^{-1}\big\{f^{1/2}\big(f^{-1}V - I_p\big)\big\}^2 - \cdots,$

where the last expansion suggests that $f^{1/2}(f^{-1}V - I_p) + f^{1/2}(fV^{-1} - I_p) \to 0$ in
probability. This last result can be proved rigorously and (1) follows. Next
$(1) \Rightarrow (2) \Rightarrow (3) \Rightarrow (4)$; notice that $R_1/\|R\| \to 1$ in probability.
Prof. Eaton (1979, personal communication) taught me the following result (see also [6,
Section 6.5]).
and we can compute $d$ in order to establish (2), while (4) follows from
$\operatorname{cov} R = ERR^T - (ER)(ER)^T$ where

$ERR^T = EV^{-1}YY^TV^{-1} = EV^{-2} + \delta^2 EV^{-1}e_1e_1^TV^{-1}$

and

$V^{-1}e_1e_1^TV^{-1} = \begin{pmatrix} (V^{11})^2 & V^{11}V^{(1,2)} \\ V^{11}V^{(2,1)} & V^{(2,1)}V^{(1,2)} \end{pmatrix},$
$fc_\alpha \approx \tfrac12\big(1 + \delta^{-2}\big)\chi^2_{p-1;\alpha}, \qquad fd_\alpha \approx \big(1 + \delta^{-2}\big)\chi^2_{p-1;\alpha},$

where $\chi^2_{p-1;\alpha}$ denotes the $(1-\alpha)$th quantile of $\chi^2_{p-1}$. Moreover Theorem A.1
suggests the approximations

$c_\alpha^{(1)} = \tfrac12\big(f^{-1} + \delta^{-2}\big)\chi^2_{p-1;\alpha}, \qquad a^{(1)} = \tfrac12\big(f^{-1} + \delta^{-2}\big)(p-1),$

obtained by plugging in an estimate for $\delta$; moreover $d^{(1)} = 2c^{(1)}$ and $b^{(1)} = 2a^{(1)}$.
The exact results in Theorem A.2 may be used to define other approximations
by using a function of expectations as an approximation for the expectation of
the function. Two ideas were exploited: first

$a^{(2)} = 1 - \delta(f-p)^{1/2}(f-p-3)^{1/2}\big/\big\{(f-1)^{1/2}(f-p-1)^{1/2}\big(p + \delta^2\big)^{1/2}\big\}.$
Simulation experiments
The following cases were considered in [15].
(1) $f = 25$, $\delta(p) = \{24p/(p+1)\}^{1/2}$, $2 \le p \le 15$,
(2) $f = 40$, $\delta(p) = 10 + 2p^{1/2}$, $2 \le p \le 20$,
(3) $f = 11$, $\delta(p) = \{112p/(p+1)\}^{1/2}$, $2 \le p \le 25$.
True values were estimated from 1000 independent repetitions. The approximations
for $c_{0.10}$ and $d_{0.10}$ were not very satisfactory. The approximations for $d_{1/2}$
and $b$ based on $b \approx b^{(3)}$ looked very accurate and better than any other approximation.
The approximations based on Theorem A.1 looked a bit worse than those
based on $b \approx b^{(3)}$. The approximations based on $a \approx a^{(2)}$ looked worst of all and
positively biased. We were not successful in our attempt to give a satisfactory
theoretical explanation of the bad behaviour of the last-mentioned approximations.
References
[1] Ambergen, T. and Schaafsma, W. (1981). Interval estimates for posterior probabilities. Report
TW-224, Department of Mathematics, Groningen.
[2] Anderson, T. W. (1972). Asymptotic evaluation of the probabilities of misclassification by linear
discriminant functions. In: T. Cacoullos, ed., Discriminant Analysis and Applications, 17-35.
Academic Press, New York.
[3] Das Gupta, S. (1968). Some aspects of discriminant function coefficients. Sankhyā Ser. A 30 (4),
387-400.
[4] Das Gupta, S. and Perlman, M. (1974). Power of the noncentral F-test: effect of additional
variates on Hotelling's T 2 test. J. Amer. Statist. Assoc. 69, 174-180.
[5] Hocking, R. R. (1976). The analysis and selection of variables in linear regression. Biometrics 32,
1-49.
[6] Kshirsagar, A. M. (1972). Multivariate Analysis. Dekker, New York.
[7] Rao, C. R. (1965). Linear Statistical Inference and its Applications. Wiley, New York.
[8] Schaafsma, W. (1969). Multiple decision problems of type I. Ann. Math. Statist. 40, 1684-1720.
[9] Schaafsma, W. (1972). Classifying when populations are estimated. In: T. Cacoullos, ed.,
Discriminant Analysis and Applications, 339-364. Academic Press, New York.
[10] Schaafsma, W. (1976). The asymptotic distribution of some statistics from discriminant analysis.
Report TW-176, Department of Mathematics, Groningen.
[11] Schaafsma, W. and van Vark, G. N. (1977). Classification and discrimination problems with
applications, Part I. Statist. Neerlandica 31, 25-45.
[12] Schaafsma, W. and van Vark, G. N. (1979). Classification and discrimination problems with
applications, Part II a. Statist. Neerlandica 33, 91-125.
[13] Schaafsma, W. and Steerneman, A. G. M. (1981). Discriminant analysis when the number of
features is unbounded. IEEE Trans. Systems Man Cybernet. 11, 144-151.
[14] Solomon, H. (1961). Studies in Item Analysis and Prediction. Stanford University Press, Stan-
ford.
[15] Steerneman, A. G. M. (1979). Simulating the performance of a special linear discriminant.
Report SE-57/7907, Institute of Econometrics, Groningen.
[16] Stein, Ch. (1966). Multivariate analysis (mimeographed notes recorded by M. L. Eaton). Dept.
Statistics, Stanford.
[17] van Vark, G. N. (1976). A critical evaluation of the application of multivariate statistical
methods to the study of human populations. HOMO 28, 94-114.
[18] Wald, A. (1944). On a statistical problem arising in the classification of an individual into one of
two groups. Ann. Math. Statist. 15, 145-162.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 883-892

Selection of Variables in Discriminant Analysis*

P. R. Krishnaiah

*This work is sponsored by the Air Force Office of Scientific Research under contract F49629-82-K-001. Reproduction in whole or in part is permitted for any purpose of the United States Government.
1. Introduction
In this section we discuss procedures for testing hypotheses on the coefficients
of the discriminant function associated with the discrimination between
two multivariate normal populations. Let the mean vector and covariance matrix
of the $i$th multivariate normal population be given by $\mu_i$ and $\Sigma$. Now, consider the
discriminant function $a'x$ for the two-population case, where $a' = (a_1,\ldots,a_p) = (\mu_1 - \mu_2)'\Sigma^{-1}$
and $x' = (x_1,\ldots,x_p)$. If any of the coefficients $a_i$ are zero, then the
corresponding variables $x_i$ do not make any contribution to the discrimination
between the two populations. So it is of interest to find out which of the
coefficients are zero. Suppose we know a priori that $x_1,\ldots,x_q$ are important and
we are not sure of $x_{q+1},\ldots,x_p$. Then we are interested in testing the hypothesis
that $a_{q+1} = \cdots = a_p = 0$.
Let $x_1' = (x_1,\ldots,x_q)$, $x_2' = (x_{q+1},\ldots,x_p)$, $\delta' = (\delta_1,\ldots,\delta_p)$ and let $\mu_1$, $\mu_2$, $\Sigma$ be
partitioned accordingly. Also, let the within-group SP matrix be partitioned as

$S_e = \begin{pmatrix} S_{e11} & S_{e12} \\ S_{e21} & S_{e22} \end{pmatrix}, \qquad S_e = \sum_{i=1}^{2}\sum_{t=1}^{n_i} \begin{pmatrix} z_{i1t} - \bar z_{i1\cdot} \\ z_{i2t} - \bar z_{i2\cdot} \end{pmatrix}\begin{pmatrix} z_{i1t} - \bar z_{i1\cdot} \\ z_{i2t} - \bar z_{i2\cdot} \end{pmatrix}'.$

We accept or reject the hypothesis according as

$F \lessgtr F_\alpha \qquad (2.5)$

where $F$ denotes the statistic given in (2.3) and $F_\alpha$ the corresponding critical value.
The above procedure for testing the hypothesis $a_{q+1} = \cdots = a_p = 0$ was proposed
by Rao (1946, 1966). It is known (e.g., see Kshirsagar (1972)) that Rao's $U$ statistic
is related to the $F$ statistic given in (2.3).
The simultaneous confidence intervals associated with the above procedure are
known to be

$|b'(\hat\delta_2 - \hat B\hat\delta_1 - \delta_2 + B\delta_1)| \le \big\{F_\alpha\, b'S_{e22\cdot1}b\,(p-q)\big(1 + c\,\hat\delta_1' S_{e11}^{-1}\hat\delta_1\big)\big/\big(c(n-p-1)\big)\big\}^{1/2} \qquad (2.7)$
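A sketch of the test for additional discrimination information in its common textbook form (often attributed to Rao); the constant $c$ and the degrees of freedom below follow that standard account and are assumptions insofar as (2.3) is not reproduced in the present text.

import numpy as np
from scipy.stats import f as f_dist

def rao_additional_info(x1, x2, q):
    """Test that variables q+1..p add no discrimination, given the first q."""
    n1, p = x1.shape
    n2 = x2.shape[0]
    n = n1 + n2
    d = x2.mean(0) - x1.mean(0)
    Sp = ((n1 - 1) * np.cov(x1.T) + (n2 - 1) * np.cov(x2.T)) / (n - 2)
    def D2(k):          # sample Mahalanobis distance on the first k variables
        return d[:k] @ np.linalg.solve(Sp[:k, :k], d[:k])
    c = n1 * n2 / (n * (n - 2))
    F = (n - p - 1) / (p - q) * c * (D2(p) - D2(q)) / (1 + c * D2(q))
    return F, 1 - f_dist.cdf(F, p - q, n - p - 1)   # F_{p-q, n-p-1} under H0

rng = np.random.default_rng(7)
x1 = rng.normal(0, 1, (30, 5))
x2 = rng.normal([1, .8, 0, 0, 0], 1, (30, 5))   # only the first two variables help
print(rao_additional_info(x1, x2, q=2))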
In this section we consider the problem of testing the hypotheses that the
discriminant coefficients associated with certain variables in the discriminant
functions are zero. Let $x_1,\ldots,x_k$ be distributed independently as multivariate
normal with mean vectors $\mu_1,\ldots,\mu_k$ and covariance matrix $\Sigma$. Also, let $x_{ij}$
($j = 1,2,\ldots,n_i$) denote the $j$th independent observation on $x_i$. Then the between-group
sums of squares and cross products (SP) matrix is given by

$S = \sum_{i=1}^{k} n_i\big(\bar x_{i\cdot} - \bar x_{\cdot\cdot}\big)\big(\bar x_{i\cdot} - \bar x_{\cdot\cdot}\big)' = \begin{pmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix}$
where $S_{11}$ is of order $q \times q$.
Now, let $\theta_1 \ge \cdots \ge \theta_p$ denote the eigenvalues of the noncentrality matrix $\Omega = A\Sigma^{-1}$
and let $\nu_i' = (\nu_{i1},\ldots,\nu_{ip})$ ($i = 1,\ldots,p$) denote the eigenvector corresponding
to $\theta_i$, where

$A = \sum_{i=1}^{k} n_i(\mu_i - \bar\mu)(\mu_i - \bar\mu)'. \qquad (3.1)$
where $\mu_{i1}$ and $\nu_{i1}$ are of order $q \times 1$, and $A_{11}$ and $\Sigma_{11}$ are of order $q \times q$. Let $H_j$:
$\nu_{j2} = 0$ for $j = 1,2,\ldots,r$ and $H = \bigcap_{j=1}^{r} H_j$. It is known (e.g., see McKay (1977) and
Fujikoshi (1980)) that the hypothesis $H$ and the following statements are equivalent.
It is well known that Fisher's linear discriminant functions are the best for
discrimination among all linear functions of the original variables. In this section
we review some procedures for the selection of important discriminant functions.
We know that

$l_i \gtrless c_\alpha \qquad (4.4)$

and

$l_i \gtrless c_{\alpha 1} \qquad (4.6)$

where the critical values are chosen so that the tests have the desired levels.
Here we note that the hypotheses $H_1,\ldots,H_p$ are nested. For example, $H_i$ implies
$H_{i+1},\ldots,H_p$. When $H_1$ is true, the exact distribution of $l_1$ was given in Krishnaiah
and Chang (1971). For percentage points of the distribution of $l_1$ in the null case,
the reader is referred to Krishnaiah (1980) and Pillai (1960). A review of the
literature on the exact distributions of individual roots and certain functions of
the eigenvalues is given in Krishnaiah (1978).
Fang and Krishnaiah (1982) derived asymptotic nonnull distributions of cer-
tain functions of the eigenvalues of some random matrices when the underlying
distribution is not normal.
The likelihood ratio statistic for testing the hypothesis $H_{r+1}$ is known to be

$L_1 = \prod_{i=r+1}^{p} (1 + l_i)^{-n/2}. \qquad (4.8)$

We accept or reject according as

$\psi(l_{r+1},\ldots,l_p) \gtrless c_{\alpha 2} \qquad (4.9)$

where the critical value is chosen to attain the desired level. Some special cases of
$\psi(l_{r+1},\ldots,l_p)$ are $l_{r+1}$, $(l_{r+1} + \cdots + l_p)$, etc.
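A short sketch of the ingredients of this section: the roots $l_i$ of $SS_e^{-1}$ computed from simulated data and the statistic (4.8); the group means, the dimensions and the choice $r = 1$ are arbitrary.

import numpy as np

rng = np.random.default_rng(8)
k, p, ni = 4, 3, 25
mus = np.vstack([np.zeros(p), np.eye(p)])[:k] * 1.5   # assumed group means
xs = [rng.normal(mu, 1.0, (ni, p)) for mu in mus]

xbar = np.mean([x.mean(0) for x in xs], axis=0)       # grand mean (equal n_i)
S = sum(ni * np.outer(x.mean(0) - xbar, x.mean(0) - xbar) for x in xs)
Se = sum((x - x.mean(0)).T @ (x - x.mean(0)) for x in xs)

l = np.sort(np.linalg.eigvals(S @ np.linalg.inv(Se)).real)[::-1]
n = k * ni
r = 1                                    # keep one discriminant function
L1 = np.prod((1 + l[r:]) ** (-n / 2))    # likelihood ratio statistic (4.8)
print('roots:', l.round(3), ' L1:', L1)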
We now review some known results on asymptotic distributions of li's and
certain functions of these eigenvalues.
Let $l_1 \ge \cdots \ge l_v$ ($r \le v \le p$) be the nonzero eigenvalues of $SS_e^{-1}$. Also, let the
eigenvalues of $\Omega$ have multiplicities as below:

$\theta_1 = \cdots = \theta_{p_1} = n\delta_1,\ \ldots,\ \theta_{p_{t-1}+1} = \cdots = \theta_{p_t} = n\delta_t, \qquad (4.11)$
$\theta_{p_t+1} = \cdots = \theta_p = 0.$
$u_{ih} = \sqrt n\,\big(2\delta_h^2 + 4\delta_h\big)^{-1/2}\big(l_{ih} - \delta_h\big), \qquad u_{r+j} = n\,l_{r+j}. \qquad (4.12)$
where $\eta_j(\cdot)$ ($j = 1,2,\ldots,t$) denotes the joint density of the eigenvalues of $A_j$, and
the elements of $A_j$: $p_j \times p_j$ are distributed independently as normal with mean
zero. The variances of the diagonal elements of $A_j$ are equal to one whereas the
variances of the off-diagonal elements are equal to $1/2$. Also, $\eta_{t+1}(\cdot)$ is the joint
density of the eigenvalues of $A_{t+1}$: $(v-r) \times (v-r)$, where $A_{t+1}$ is distributed as
the central Wishart matrix with $(k-1-r)$ degrees of freedom and $E(A_{t+1}) = (k - r - 1)I_{v-r}$.
Here $A_1,\ldots,A_{t+1}$ are distributed independently of each other.
Expressions for the densities of the eigenvalues of $A_j$ ($j = 1,2,\ldots,t$) and $A_{t+1}$
were given in Hsu (1941b). The asymptotic joint density of $u_1,\ldots,u_v$ given by
(4.13) was derived by Anderson (1951a) by a different method.
Now let $\theta_i = (n - k - 2\eta)\beta_i$ ($i = 1,2,\dots,r$), $\theta_{r+1} = \cdots = \theta_p = 0$, where $\eta$ and the $\beta_i$'s are constants. Then Fujikoshi (1976) derived approximations to the distributions of $m_i T_i$ ($i = 1,2,3$) up to terms of order $m_i^{-2}$, where

$$ T_1 = \sum_{j=r+1}^{p} \log(1 + l_j), \tag{4.14} $$

$$ T_2 = \sum_{j=r+1}^{p} l_j, \tag{4.15} $$

$$ T_3 = \sum_{j=r+1}^{p} l_j/(1 + l_j), \tag{4.16} $$
and $m_1$, $m_2$ and $m_3$ are suitable correction factors. The first terms in these approximations involve a chi-square distribution. Similar approximations can be derived for various other functions of $l_{r+1},\dots,l_p$. Asymptotic distributions of a wide class of functions of $l_1,\dots,l_p$ in the nonnull cases were given in Fujikoshi (1978), Krishnaiah and Lee (1979) and Fang and Krishnaiah (1982).
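As a simple illustration (ours, not Fujikoshi's exact development), the sketch below computes $T_1$, $T_2$ and $T_3$ and pairs $T_1$ with the classical Bartlett (1947) correction factor $m = n - 1 - (p + k)/2$ and a chi-square reference distribution with $(p - r)(k - 1 - r)$ degrees of freedom; Fujikoshi's $m_1$, $m_2$, $m_3$ refine this choice of factor.

```python
# A sketch of the statistics (4.14)-(4.16), with a Bartlett-type chi-square
# approximation for T_1 as a stand-in for Fujikoshi's refined factors.
import numpy as np
from scipy import stats

def remaining_root_statistics(l, r, n, p, k):
    tail = np.asarray(l)[r:]                    # l_{r+1},...,l_p
    T1 = np.log1p(tail).sum()
    T2 = tail.sum()
    T3 = (tail / (1.0 + tail)).sum()
    m = n - 1.0 - (p + k) / 2.0                 # Bartlett-type factor (assumed)
    df = (p - r) * (k - 1 - r)
    return T1, T2, T3, stats.chi2.sf(m * T1, df)
```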
In some situations we know in advance that the last few eigenvalues of $\Omega$ are equal to zero. For example, when $p > k - 1$, $\theta_k = \theta_{k+1} = \cdots = \theta_p = 0$. In these situations it is of interest to test whether some of the $\theta_i$'s ($i = 1,2,\dots,k-1$) are zero. We can also test the hypotheses $H_j$ ($j = t, t+1,\dots,k-1$) as follows. We accept or reject $H_j$ according as

$$ \psi_j \lessgtr c_\alpha \tag{4.17} $$

where $\psi_j$ is a suitable function of the roots and $c_\alpha$ is determined from its null distribution.
We accept or reject $H_1$ according as

$$ l_1 \lessgtr c_{\alpha 1} \tag{4.19} $$

where

$$ P[l_1 \le c_{\alpha 1} \mid H_1] = 1 - \alpha_1. \tag{4.20} $$

If $H_1$ is rejected, we accept or reject $H_2$ according as

$$ l_2 \lessgtr c_{\alpha 2}, \tag{4.21} $$

where

$$ P[l_2 \le c_{\alpha 2} \mid l_1 \ge c_{\alpha 1};\ H_2] = 1 - \alpha_2. \tag{4.22} $$

If $H_2$ is also rejected, we accept or reject $H_3$ according as

$$ l_3 \lessgtr c_{\alpha 3} $$

where

$$ P[l_3 \le c_{\alpha 3} \mid l_1 \ge c_{\alpha 1},\ l_2 \ge c_{\alpha 2};\ H_3] = 1 - \alpha_3. \tag{4.23} $$

Proceeding in this way, we accept or reject $H_{i+1}$ according as

$$ l_{i+1} \lessgtr c_{\alpha, i+1} $$

where the critical values are determined from the corresponding conditional null distributions. Then the overall type I error for testing $H_1, H_2,\dots,H_{i+1}$ sequentially is given by $\alpha^*_{i+1}$, where

$$ \prod_{t=0}^{i} P[l_{t+1} \le c_{\alpha, t+1} \mid l_j \ge c_{\alpha j},\ j = 1,2,\dots,t;\ H_{t+1}] = 1 - \alpha^*_{i+1}. \tag{4.25} $$
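The mechanics of this sequential scheme are summarized in the sketch below (ours); the critical values $c_{\alpha,1}, c_{\alpha,2},\dots$ must come from the conditional null distributions described above and are simply assumed here to be available as a precomputed array.

```python
# A sketch of the sequential procedure around (4.19)-(4.25): test H_1, H_2,...
# in turn against precomputed critical values, stopping at first acceptance.
def sequential_root_tests(l, c):
    """l: roots in decreasing order; c: critical values c_{alpha,1},...
    Returns the number of hypotheses rejected, i.e. the number of
    discriminant functions judged significant."""
    for j, (lj, cj) in enumerate(zip(l, c)):
        if lj <= cj:          # accept H_{j+1}; stop
            return j
    return len(c)
```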
References
Anderson, T. W. (1951a). The asymptotic distribution of certain characteristic roots and vectors. In: J. Neyman, ed., Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, 103-130. University of California Press, Berkeley, CA.
Anderson, T. W. (1951b). Estimating linear restrictions on regression coefficients for multivariate
normal distributions. Ann. Math. Statist. 22, 327-351.
Bartlett, M. S. (1947). Multivariate analysis. J. Roy. Statist. Soc. Suppl. 9, 176-190.
Chou, R. J. and Muirhead, R. J. (1979). On some distribution problems in MANOVA and discriminant
analysis. J. Multivariate Anal. 9, 410-419.
Fang, C. and Krishnaiah, P. R. (1981). Asymptotic distributions of functions of the eigenvalues of the
real and complex noncentral Wishart matrices. In: M. Csorgo, D. A. Dawson, J. N. K. Rao, and A.
K. Md. E. Saleh, eds., Statistics and Related Topics. North-Holland, Amsterdam.
Fang, C. and Krishnaiah, P. R. (1982). Asymptotic distributions of functions of the eigenvalues of
some random matrices for nonnormal populations. J. Multivariate Anal. 12, 39-63.
Fisher, R. A. (1938). The statistical utilization of multiple measurements. Ann. Eugenics 8, 376-386.
Fujikoshi, Y. (1976). Asymptotic expressions for the distributions of some multivariate tests. In: P. R. Krishnaiah, ed., Multivariate Analysis-IV, 55-71. North-Holland, Amsterdam.
Fujikoshi, Y. (1978). Asymptotic expansions for the distributions of some functions of the latent roots
of matrices in three situations. J. Multivariate Anal. 8, 63-72.
Fujikoshi, Y. (1980). Tests for additional information in canonical discrimination analysis and
canonical correlation analysis. Tech. Rept. No. 12, Statistical Research Group, Hiroshima Univer-
sity, Japan.
Hsu, P. L. (1941a). On the problem of rank and the limiting distribution of Fisher's test functions.
Ann. Eugenics 11, 39-41.
Hsu, P. L. (1941b). On the limiting distribution of roots of a determinantal equation. J. London Math.
Soc. 16, 183-194.
Krishnaiah, P. R. and Schuurmann, F. J. (1974). On the evaluation of some distributions that arise in
simultaneous tests for the equality of the latent roots of the covariance matrix. J. Multivariate Anal.
4, 265-283.
Krishnaiah, P. R. (1978). Some developments on real multivariate distributions. In: P. R. Krishnaiah,
ed., Developments in Statistics, Vol. 1, 135-169. Academic Press, New York.
Krishnaiah, P. R. and Lee, J. C. (1979). On the asymptotic joint distributions of certain functions of
the eigenvalues of four random matrices. J. Multivariate Anal. 9, 248-258.
Krishnaiah, P. R. (1980). Computations of some multivariate distributions. In: P. R. Krishnaiah, ed., Handbook of Statistics, Vol. 1, 745-971. North-Holland, Amsterdam.
Kshirsagar, A. M. (1972). Multivariate Analysis. Dekker, New York.
Lawley, D. N. (1959). Tests of significance in canonical analysis. Biometrika 46, 59-66.
McKay, R. J. (1976). Simultaneous procedures in discriminant analysis involving two groups.
Technometrics 18, 47-53.
McKay, R. J. (1977). Simultaneous procedures for variable selection in multiple discriminant analysis.
Biometrika 64, 283-290.
Muirhead, R. J. (1978). Latent roots and matrix variates: a review of some asymptotic results. Ann.
Statist. 6, 5-33.
Pillai, K. C. S. (1960). Statistical Tables for Tests of Multivariate Hypotheses. Statistical Center, University of the Philippines, Manila.
Rao, C. R. (1946). Tests with discriminant functions in multivariate analysis. Sankhyā 7, 407-414.
Rao, C. R. (1948). Tests of significance in multivariate analysis. Biometrika 35, 58-79.
Rao, C. R. (1966). Covariance adjustment and related problems in multivariate analysis. In: P. R.
Krishnaiah, ed., Multivariate Analysis, 87-103. Academic Press, New York.
Rao, C. R. (1973). Linear Statistical Inference and Its Applications. Wiley, New York.
Schuurmann, F. J., Krishnaiah, P. R. and Chattopadhyay, A. K. (1973). On the distribution of the
ratios of the extreme roots to the trace of the Wishart matrix. J. Multivariate Anal. 3, 445-453.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
© North-Holland Publishing Company (1982) 893-894

Corrections to
Handbook of Statistics, Volume 1:
Analysis of Variance
Page 60: The vector $a'$ after expression (5.6) should read as $a' = (1, 0, -1)$ and not $(1, 0, 1)$.
Page 71: The matrix

$$ A = D\begin{pmatrix} 1 & 0 \\ 0 & 1 \\ -1 & 1 \end{pmatrix} $$

should read as

$$ A = D\begin{pmatrix} 1 & 0 \\ 0 & 1 \\ -1 & -1 \end{pmatrix}. $$
Page 167, lines 2-3 from bottom: quantiles and mechans should read as
quartiles and medians.
Page 518: Delete the last line.
Page 528, line below (4.14): Section 4.2 should read as Section 4.3.
Page 531, line 17: accepted should read as rejected.
Page 549: The table for $\alpha = 0.01$ is reproduced from a technical report by Lee, Chang and Krishnaiah (1976).
Pages 556-557: $M$ should read as $N$.
Pages 558-559: The entries in Table 14 give the percentage points of the likelihood ratio test statistic for the homogeneity of complex multivariate normal populations instead of real multivariate normal populations. The correct entries are given in Table II of the paper by Chang, Krishnaiah and Lee (1977).