
Preface

The series Handbook of Statistics was started to serve as an important medium
for the dissemination of knowledge in all branches of statistics. Many prominent
workers in theoretical and applied aspects of statistics are invited to write
chapters in the volumes of this series. The material in these volumes is expository
in nature and will be of interest to all statistical scientists. Various developments
in the analysis of variance are discussed in the first volume. This second volume
contains articles on discriminant analysis, clustering techniques and software,
multidimensional scaling, statistical, linguistic and artificial intelligence models
and methods for pattern recognition and some of their applications, the selection
of subsets of variables for allocation and discrimination, and reviews of some
paradoxes and open questions in the areas of variable selection, dimensionality,
sample size and error estimation.
More than four decades ago R. A. Fisher introduced the linear discriminant
function. Since then, classification, i.e., allocation of an object to the class to
which it is closest, and discrimination, i.e., determining separation rules for
distinguishing between categories into which an object may be classified, have
been the object of numerous papers falling under the heading Discriminant
Analysis. The first eight chapters in the volume may be generally said to fall
under that heading. R. H. Shumway describes linear and quadratic methods for
discrimination of time series when the underlying distributions are assumed to
have multivariate Gaussian distributions with equal and with unequal covariance
matrices. S. Das Gupta considers a variety of optimality criteria and presents
classification rules which are optimal for Gaussian distributions with the same
covariance matrix, whereas M. Siotani gives a review of the asymptotic distribu-
tions of classification statistics. S. Geisser discusses Bayesian, frequentist, and
semi-Bayesian approaches to developing allocation and separation rules with the
focus on Gaussian distributions and linear discriminants. The chapter by J. C.
Lee presents a generalized multivariate analysis of variance model for growth
curves and discusses the Bayesian classification of data from growth curves. J. D.
Broffitt, noting the prevalence of non-Gaussian data, considers nonparametric
discrimination rules based on ranks of discriminant function scores. He describes
robust estimators for use in linear and quadratic discriminant functions. J. A.
Anderson presents logistic discrimination in which the basic assumption is that
the logarithm of the likelihood ratio (for the two group case) is a linear function
of the variables. He also mentions the case where the logarithm of the likelihood
ratio is represented by a quadratic function. L. Devroye and T. J. Wagner discuss
asymptotic and finite sample results for nearest neighbor classification rules; such
rules have been of interest and the subject of many papers in recent years,
especially in the engineering literature on pattern recognition.
Cluster analysis algorithms, software, and graphical display techniques are the
topics reviewed in the next four chapters. G. J. McLachlan describes classification
and mixture maximum likelihood approaches to cluster analysis when the set of
class labels is not known a priori. The problem of determining the number of
distinct clusters is also considered. J. M. Chambers and B. Kleiner present the
general features of several graphical display techniques for multivariate data and
for clustering and describe the details of some of the techniques. R. K. Blashfield,
M. S. Aldenderfer and L. C. Morey present the major categories of cluster
analysis software and discuss a number of software programs in some detail. F. J.
Rohlf considers computational algorithms for the single link clustering method.
Techniques for finding a configuration of points in a low dimensional space to
represent a high dimensional set of data points, while preserving the local
structure of the points according to some loss criterion, are referred to as
multidimensional scaling (MDS) methods. In their chapter on multidimensional
scaling theory, J. de Leeuw and W. Heiser note that scaling procedures differ
from each other in two ways: either they use different loss functions to fit the
same data structure, or they use different algorithms to minimize the same loss
function. The chapter gives some history of MDS and presents MDS models and
algorithms. M. Wish and J. D. Carroll first discuss MDS applications with a two
way data matrix and a quadratic model for squared distances. Applications
include the dimensions of Morse code signals and perception of similarity among
nations. They also consider extensions to three-way tables and trilinear models
and discuss INDSCAL (individual differences scaling) and its applications to
data on perceptions of nations and dimensions of interpersonal communication.
The chapter by K. Fukunaga discusses methods for determining the intrinsic
dimensionality of the data in the original high dimensional space; the assumption
is that the data is governed by a certain small number of underlying parameters and
the intrinsic dimensionality estimates the minimum number of such parameters.
Fukunaga discusses intrinsic dimensionality for representation and for classifica-
tion, and also some articles in the engineering literature which have used MDS
algorithms for nonlinear mapping of high dimensional data into two or three
dimensions. He also mentions non-iterative nonlinear mapping algorithms and
mappings which preserve class separability as well as the structure of the data
distribution.
To complement the material in the other chapters, the chapter by L. N. Kanal,
B. A. Lambird and D. Lavine provides an elementary introduction to some of the
major, non-statistical approaches to modelling pattern structures. Following a
brief presentation of generative grammar models for describing syntactic struc-
tures, the topics of incorporating semantic attributes, inferring grammars and
doing error correcting parsing are touched on. The chapter introduces a variety of
approaches from the field of Artificial Intelligence, such as production rule
systems, frames, and knowledge-based and decision support systems which repre-
sent attempts to incorporate contextual and problem domain knowledge of
various types in the pattern recognition and description process.
N. Ahuja and A. Rosenfeld review work on pixel based and region based image
models. Pixel based models include one dimensional time series models and
random field models. Mosaic models based on random planar generation processes
are an example of region based models. R. M. Haralick surveys extraction
techniques and models, including autocorrelation functions, optical transforms,
digital transforms, textural edgeness, structural elements, spatial gray tone run
lengths, and autoregressive models for measuring textural properties. The chapter
by K. S. Fu presents a linguistic-statistical model based on stochastic languages.
Fu briefly introduces string languages and describes their application in com-
munication and coding, syntactic pattern recognition and error correcting pars-
ing. He then introduces stochastic tree languages and describes their application
to texture modelling. J. C. Simon, E. Backer and J. Sallentin present a unifying
viewpoint on pattern recognition which attempts to unify different approaches to
pattern recognition, such as the statistical approach, the fuzzy set approach and
the clustering approach by establishing a homomorphism between a representa-
tion space and an interpretation space.
The next two chapters are concerned with inference in empirical data tables.
G. S. Lbov considers problems of pattern recognition, clustering, and prediction
for empirical tables with either a large number of features and a small sample size
approximately equal to the number of features, or heterogeneous types of
features. In both cases feature values may be missing. For the case of heteroge-
neous features Lbov presents a class of logical decision rules. N. G. Zagoruiko
and V. N. Yolkina present two algorithms for filling missing values in data tables.
Starting with the chapter by J. van Bemmel, nine chapters are devoted to
practical applications of pattern recognition techniques. Van Bemmel reviews
statistical and other current methodology for the recognition of electrocardio-
graphic signal patterns. G. C. Stockman surveys some of the key issues in the
linguistic approach to waveform analysis for describing and recognizing struc-
tures in waveforms. F. Jelinek, R. L. Mercer and L. R. Bahl present the statistical
methods used in research carried out at IBM on continuous speech recognition.
A. A. Grometstein and W. H. Schoendorf describe the nature of radar signals, the
design of classifiers, radar decoys, and other related problems in radar pattern
recognition. E. S. Gelsema and G. H. Landeweerd review experiments in white
blood cell differential count automation and describe some of the commercially
available systems. P. H. Swain reviews pattern classification methods currently
applied to remotely sensed data and comments on research directions likely to be
pursued for this application. G. Nagy presents an overview of optical character
recognition, one of the earliest (and now the most commercially successful)
applications of pattern recognition. Nagy discusses various statistical approxima-
tions to the optimal classifier, including dimensionality reduction, feature extrac-
tion and feature selection. He contrasts parallel classification methods with
sequential methods, considers the use of contextual information provided by
sequences of characters and describes special hardware and software. He also
discusses error and reject rate relationships and their estimation, and the ad-
vantages and disadvantages of various experimental designs. Y. T. Chien and T. J.
Killeen review research and development projects on oil spill identification and
discuss methods for matching oil spillers by examining the spectra taken for the
spilled oil samples and the suspect samples. The last chapter in this group by
B. R. Kowalski and S. Wold reviews various applications of pattern recognition in
Chemistry and comments on the current directions of research in this area.
The remaining papers in this volume fall into two groups. The first group,
consisting of three papers, is concerned with some classical multivariate data
representation models. T. Kaminuma, S. Tomita and S. Watanabe present orthog-
onal expansions for covariance and correlation matrices which together with a
view of a data matrix as an object-predicate table suggesting symmetric treat-
ment of objects and predicates, lead them to interesting observations on reducing
computation in dimensionality reduction, and applications of symmetric
Karhunen-Loève systems in image representation. They trace the Karhunen-
Loève system back to a 1907 paper on unsymmetric kernels of integral equations
and mention its independent rediscovery in quantum chemistry and pattern
recognition. R. A. Reyment notes that in Multivariate Morphometrics the stan-
dard tools of multivariate statistics are being applied, with emphasis on canonical
variate analysis and principal component analysis; but success in explaining the
results is likely to depend on a deep familiarity with modern evolutionary theory,
taxonomy, functional morphology and aspects of ecology. The chapter by P. M.
Bentler and D. G. Weeks reviews factor analysis models and related problems of
estimation and testing.
The final set of seven papers deals with aspects of feature evaluation and
questions concerning measurement selection, as well as selection of variables in
regression analysis and discriminant analysis. M. Ben Bassat discusses distance
and information measures and error bounds, categorizes feature evaluation rules
and summarizes current findings on their use. The paper updates the results
presented in Table 1 in L. N. Kanal, Patterns in Pattern Recognition: 1968-1974,
IEEE Trans. Inform. Theory 20 (1974) 697-722. The paper by J. M. Van
Campenhout provides additional important insights concerning some results
previously considered paradoxical in the area of measurement selection. He
clarifies why the phenomenon of misclassification rate peaking might occur in
practice. He also provides examples where non-exhaustive selection procedures
could be optimal even though, in general, non-exhaustive selection algorithms can
result in arbitrarily bad measurement subsets. The first paper by P. R. Krishnaiah
and the paper by J. L. Schmidhammer deal with the problems of the selection of
the important variables under regression models. The paper of P. R. Krishnaiah
points out the drawbacks of stepwise procedures and reviews some alternative
procedures for the selection of variables under univariate regression models. The
paper of J. L. Schmidhammer illustrates the use of finite intersection tests
proposed by Krishnaiah for the selection of variables under univariate and
multivariate regression models. A. K. Jain and B. Chandrasekaran discuss the role
which the relationship between the number of measurements and the number of
training patterns plays at various stages in the design of pattern classifiers and
mention the guidelines provided by research to date. In some situations, the
discriminating ability of various procedures for discrimination between the popu-
lations may actually decrease as the number of variables increases. Apart from this,
it is advantageous to deal with a small number of important variables from cost
and computational considerations. Motivated by the above considerations, the
paper by W. Schaafsma and the second paper by P. R. Krishnaiah deal with
methods of selection of important variables in the area of discriminant analysis.
Krishnaiah first reviews certain methods of the selection of the original variables
for discrimination between several multivariate populations. Then he discusses
various methods of selecting a small number of important discriminant functions.
The models, examples, applications, and references from diverse sources con-
tained in these articles by statisticians, engineers, computer scientists, and scien-
tists from other disciplines, should make this volume a valuable aid to all those
interested in classification and the analysis of data and pattern structure in the
presence of uncertainty.
We wish to thank Professors T. Cover, S. Das Gupta, K. S. Fu, J. A. Hartigan,
V. Kovalevsky, C. R. Rao, B. K. Sinha and J. Van Ryzin for serving on the
editorial board. Thanks are also due to Professors J. Bailey, R. Banerji,
B. Chandrasekaran, S. K. Chatterjee, R. A. Cole, A. K. Jain, K. G. Jöreskog,
J. Lemmer, S. Levinson, J. M. S. Prewitt, A. Rudnicky, J. van Ness, and
M. Wish for reviewing various papers. We are grateful to the authors and
North-Holland Publishing Company for their excellent cooperation in bringing
out this volume.
P. R. Krishnaiah
L. N. Kanal
Table of Contents

Preface v
Table of Contents xi
Contributors xxi

Ch. 1. Discriminant Analysis for Time Series 1


R. H. Shumway

1. Introduction 1
2. Time domain classification methods 5
3. Discriminant analysis in the frequency domain 11
4. Statistical characterization of patterns 26
5. An application to seismic discrimination 33
6. Discussion 42
Acknowledgment 43
References 43

Ch. 2. Optimum Rules for Classification into Two Multivariate Normal
Populations with the Same Covariance Matrix 47
S. Das Gupta

1. Introduction 47
2. The univariate case 49
3. Multivariate case: Σ known 54
4. Multivariate case: Σ unknown 56
5. Multivariate case: μ1 and μ2 known 58
References 60

Ch. 3. Large Sample Approximations and Asymptotic Expansions of
Classification Statistics 61
M. Siotani

1. Introduction 61
2. Statistics of classification into one of two multivariate normal
populations with a common covariance matrix 62

3. Statistics of classification into one of two multivariate normal
populations with different covariance matrices 83
4. Statistics in the non-normal case and in the discrete case 90
References 97

Ch. 4. Bayesian Discrimination 101


S. Geisser

1. Introduction 101
2. Bayesian allocation 101
3. Multivariate normal allocation 106
4. Bayesian separation 109
5. Allocatory-separatory compromises 111
6. Semi-Bayesian multivariate normal applications 112
7. Semi-Bayesian sample reuse selection and allocation 118
8. Other areas 119
References 120

Ch. 5. Classification of Growth Curves 121


J. C. Lee

1. Introduction 121
2. Preliminaries 122
3. Classification into one of two growth curves 123
4. Bayesian classification of growth curves 125
5. Arbitrary p.d. Σ 125
6. Rao's simple structure 132
References 136

Ch. 6. Nonparametric Classification 139


J. D. Broffitt

1. Introduction 139
2. A procedure for partial and forced classification based on ranks of discriminant scores 144
3. Robust discriminant functions 153
4. Nonparametric discriminant functions 159
References 167

Ch. 7. Logistic Discrimination 169


J. A. Anderson

1. Introduction 169
2. Logistic discrimination: Two groups 170
3. Maximum likelihood estimation 175
4. An example: The preoperative prediction of postoperative deep vein thrombosis 180
5. Developments of logistic discrimination: Extensions 182
6. Logistic discrimination: Three or more groups 187
7. Discussion: Recent work 189
References 191

Ch. 8. Nearest Neighbor Methods in Discrimination 193


L. Devroye and T. J. Wagner

References 196

Ch. 9. The Classification and Mixture Maximum Likelihood Approaches to
Cluster Analysis 199
G. J. McLachlan

1. Introduction 199
2. Classification approach 201
3. Mixture approach 202
4. Efficiency of the mixture approach 204
5. Unequal covariance matrices 205
6. Unknown number of subpopulations 206
7. Partial classification of sample 206
References 207

Ch. 10. Graphical Techniques for Multivariate Data and for Clustering 209
J. M. Chambers and B. Kleiner

1. Graphics and multivariate analysis 209


2. Displays for multivariate data 210
3. Plots for clustering 226
4. Summary and conclusions 243
References 244

Ch. 11. Cluster Analysis Software 245


R. K. Blashfield, M. S. Aldenderfer and L. C. Morey

1. Major categories of cluster analysis software 247


2. Programs with hierarchical methods 249
3. Programs with iterative partitioning methods 254
4. Special purpose programs 258
5. Usability of cluster analysis software 260
6. Discussion 263
References 264

Ch. 12. Single-link Clustering Algorithms 267


F. J. Rohlf

1. Introduction 267
2. Notation and definitions 268
3. Algorithms 270
Acknowledgment 282
References 282

Ch. 13. Theory of Multidimensional Scaling 285


J. de Leeuw and W. Heiser

1. The multidimensional scaling problem 285


2. Multidimensional scaling models 291
3. Multidimensional scaling algorithms 303
References 311

Ch. 14. Multidimensional Scaling and its Applications 317


M. Wish and J. D. Carroll

1. Multidimensional scaling of two-way data 317


2. Multidimensional scaling of three-way data 327
3. Recent developments and future trends 341
References 342

Ch. 15. Intrinsic Dimensionality Extraction 347


K. Fukunaga

1. Introduction 347
2. Intrinsic dimensionality for representation 348
3. Intrinsic dimensionality for classification 353
References 359

Ch. 16. Structural Methods in Image Analysis and Recognition 361


L. N. Kanal, B. A. Lambird and D. Lavine

1. Introduction 361
2. Syntactic pattern recognition 362
3. Artificial intelligence 371
4. Relaxation 379
Acknowledgment 381
References 381

Ch. 17. Image Models 383


N. Ahuja and A. Rosenfeld

1. Introduction 383
2. Pixel based models 383
3. Region based models 393
4. Discussion 394
Acknowledgment 395
References 395

Ch. 18. Image Texture Survey 399


R. M. Haralick

1. Introduction 399
2. Review of the literature on texture models 400

3. Structural approaches to texture models 406


4. Conclusion 412
References 412

Ch. 19. Applications of Stochastic Languages 417


K. S. Fu

1. Introduction 417
2. Review of stochastic languages 417
3. Application to communication and coding 423
4. Application to syntactic pattern recognition 427
5. Application to error-correcting parsing 430
6. Stochastic tree grammars and languages 433
7. Application of stochastic tree grammars to texture modelling 441
8. Conclusions and remarks 446
References 447

Ch. 20. A Unifying Viewpoint on Pattern Recognition 451


J. C. Simon, E. Backer and J. Sallentin

0. Introduction 451
1. Representations and interpretations 451
2. Laws and uses of similarity 460
3. Conclusion 475
References 476

Ch. 21. Logical Functions in the Problems of Empirical Prediction 479


G. S. Lbov

0. Introduction 479
1. Requirements for a class of decision rules 480
2. Class of logical decision rules 483
3. Method of predicting object's perspectiveness 486
4. Algorithm of predicting the value of quantitative feature 487
5. Automatic grouping of objects 488
6. Method of dynamic prediction 490
References 491

Ch. 22. Inference and Data Tables with Missing Values 493


N. G. Zagoruiko and V. N. Yolkina

1. Algorithm ZET 493


2. Algorithm VANGA 495
3. Conclusion 500
References 500

Ch. 23. Recognition of Electrocardiographic Patterns 501


J. H. van Bemmel

1. Introduction 501
2. Electrocardiology 502

3. Detection 505
4. Typification 513
5. Boundary recognition 517
6. Feature selection and classification 520
7. Data reduction 523
8. Discussion 524
References 524

Ch. 24. Waveform Parsing Systems 527


G. C. Stockman

1. Introduction 527
2. Models for waveform analysis: SDL and FDL 529
3. The HEARSAY speech understanding system 535
4. Analysis of medical waveforms using WAPSYS 537
5. Concluding discussion 546
References 548

Ch. 25. Continuous Speech Recognition: Statistical Methods 549


F. Jelinek, R. L. Mercer and L. R. Bahl

1. Introduction 549
2. Acoustic processors 551
3. Linguistic decoder 551
4. Markov source modeling of speech processes 552
5. Viterbi linguistic decoding 558
6. Stack linguistic decoding 560
7. Automatic estimation of Markov source parameters from data 562
8. Parameter estimation from insufficient data 564
9. A measure of difficulty for finite state recognition tasks 569
10. Experimental results 570
Acknowledgment 572
References 573

Ch. 26. Applications of Pattern Recognition in Radar 575


A. A. Grometstein and W. H. Schoendorf

1. Introduction 575
2. A radar as an information-gathering device 575
3. Signature 576
4. Coherence 577
5. Polarization 577
6. Frequency diversity 577
7. Pulse sequences 578
8. Decisions and decision errors 579
9. Algorithm implementation 579
10. Classifier design 580

11. Classifier performance 583


12. Examples 583
References 593

Ch. 27. White Blood Cell Recognition 595


E. S. Gelsema and G. H. Landeweerd

1. Introduction 595
2. Experiments on the automation of the WBCD 596
3. Developments in the commercial field 603
4. Conclusions 606
References 607

Ch. 28. Pattern Recognition Techniques for Remote Sensing Applications 609
P. H. Swain

1. Introduction: The setting 609


2. The rationale for using statistical pattern recognition 611
3. A typical data analysis procedure 611
4. The Bayesian approach to pixel classification 612
5. Clustering 613
6. Dimensionality reduction 615
7. An extension of the basic pattern recognition approach 616
8. Research directions 619
References 620

Ch. 29. Optical Character Recognition--Theory and Practice 621


G. Nagy

1. Introduction 621
2. OCR problem characterization 622
3. Applications 623
4. Transducers 628
5. Character acquisition 631
6. Character classification 634
7. Context 639
8. Error/reject rates 643
Acknowledgment 647
Bibliography 647

Ch. 30. Computer and Statistical Considerations for Oil Spill
Identification 651
Y. T. Chien and T. J. Killeen

1. Introduction 651
2. Methods for oil data analysis 652
3. Computational models for oil identification 663
4. Summary of oil identification research 668
References 669

Ch. 31. Pattern Recognition in Chemistry 673


B. R. Kowalski and S. Wold

1. Introduction 673
2. Formulation of chemical problems in terms of pattern
recognition 675
3. Historical development of pattern recognition in chemistry 677
4. Types of chemical data and useful preprocessing methods 677
5. Pattern recognition methods used 682
6. Some selected chemical applications 685
7. Problems of current concern 689
8. Present research directions 693
9. Conclusions and prognosis 694
References 695

Ch. 32. Covariance Matrix Representation and Object-Predicate
Symmetry 699
T. Kaminuma, S. Tomita and S. Watanabe

1. Historical background 699


2. Covariance representation 700
3. Minimum entropy principle 702
4. SELFIC 705
5. Object-predicate reciprocity 706
6. Applications to geometric patterns 707
7. Schmidt's theory of unsymmetric kernels 716
8. Conclusion 718
Acknowledgment 719
References 719

Ch. 33. Multivariate Morphometrics 721


R. A. Reyment

1. Introduction 721
2. Variation in a single sample 724
3. Homogeneity and heterogeneity of covariance matrices 726
4. Size and shape 728
5. Significance tests in morphometrics 729
6. Comparing two or more groups 730
7. Morphometrics and ecology 738
8. Growth-free canonical variates 738
9. Applications in taxonomy 743
References 743

Ch. 34. Multivariate Analysis with Latent Variables 747


P. M. Bentler and D. G. Weeks

1. Introduction 747
2. Moment structure models: A review 751
3. A simple general model 757
4. Parameter identification 760

5. Estimation and testing: Statistical basis 761


6. Estimation and testing: Nonlinear programming basis 764
7. Conclusion 767
References 768

Ch. 35. Use of Distance Measures, Information Measures and Error Bounds in
Feature Evaluation 773
M. Ben-Bassat

1. Introduction: The problem of feature evaluation 773


2. Feature evaluation rules 774
3. What is wrong with the Pe rule 776
4. Ideal alternatives for the Pe rule do not generally exist 777
5. Taxonomy of feature evaluation rules 778
6. The use of error bounds 785
7. Summary 787
References 788

Ch. 36. Topics in Measurement Selection 793


J. M. Van Campenhout

1. Introduction 793
2. The monotonicity of the Bayes risk 796
3. The arbitrary relation between probability of error and measurement subset 800
References 803

Ch. 37. Selection of Variables Under Univariate Regression Models 805


P. R. Krishnaiah

1. Introduction 805
2. Preliminaries 806
3. Forward selection procedure 806
4. Stepwise regression 809
5. Backward elimination procedure 811
6. Overall F test and methods based on all possible regressions 814
7. Finite intersection tests 817
References 819

Ch. 38. On the Selection of Variables Under Regression Models Using
Krishnaiah's Finite Intersection Tests 821
J. L. Schmidhammer

1. Introduction 821
2. The multivariate F distribution 821
3. The finite intersection test: A simultaneous procedure in the univariate case 823
4. The finite intersection test: A simultaneous procedure in the multivariate case 826
5. A univariate example 828
6. A multivariate example 830
References 833

Ch. 39. Dimensionality and Sample Size Considerations in Pattern Recognition
Practice 835
A. K. Jain and B. Chandrasekaran

1. Introduction 835
2. Classification performance 836
3. K-nearest neighbor procedures 849
4. Error estimation 850
5. Conclusions 851
References 852

Ch. 40. Selecting Variables in Discriminant Analysis for Improving upon
Classical Procedures 857
W. Schaafsma

1. Introduction 857
2. Illustrating the phenomenon when dealing with Aim 1 in the case k = 2 860
3. One particular rule for selecting variables 864
4. Dealing with Aim 3 in the case k = 2, m0 = 1 868
5. Dealing with Aim 4 in the case k = 2, m0 = 1 872
6. Incorporating a selection of variables technique when dealing with Aim 3 or Aim 4 in the case
k = 2, m0 = 1 875
7. Concluding remarks and acknowledgment 877
Appendix A 878
References 881

Ch. 41. Selection of Variables in Discriminant Analysis 883


P. R. Krishnaiah

1. Introduction 883
2. Tests on discriminant functions using conditional distributions for two populations 883
3. Tests on discriminant functions for several populations using conditional distributions 885
4. Tests for the number of important discriminant functions 886
References 891

Corrections to Handbook of Statistics, Volume 1: Analysis of Variance 893

Subject Index 895


Contributors

N. Ahuja, University of Maryland, College Park (Ch. 17)


M. S. Aldenderfer, State University of New York, Buffalo (Ch. 11)
J. A. Anderson, University of Newcastle Upon Tyne, Newcastle Upon Tyne (Ch. 7)
E. Backer, University of Technology, Delft (Ch. 20)
L. R. Bahl, IBM Thomas J. Watson Research Center, Yorktown Heights (Ch. 25)
J. H. van Bemmel, Vrije Universiteit, Amsterdam (Ch. 23)
M. Ben-Bassat, University of Southern California, Los Angeles (Ch. 35)
P. M. Bentler, University of California, Los Angeles (Ch. 34)
R. K. Blashfield, University of Florida, Gainesville (Ch. 11)
J. D. Broffitt, University of Iowa, Iowa City (Ch. 6)
J. D. Carroll, Bell Laboratories, Murray Hill (Ch. 14)
J. M. Chambers, Bell Laboratories, Murray Hill (Ch. 10)
B. Chandrasekaran, Ohio State University, Columbus (Ch. 39)
Y. T. Chien, University of Connecticut, Storrs (Ch. 30)
S. Das Gupta, University of Minnesota, Minneapolis (Ch. 2)
L. Devroye, University of Texas, Austin (Ch. 8)
K. S. Fu, Purdue University, West Lafayette (Ch. 19)
K. Fukunaga, Purdue University, West Lafayette (Ch. 15)
S. Geisser, University of Minnesota, Minneapolis (Ch. 4)
E. S. Gelsema, Vrije Universiteit, Amsterdam (Ch. 27)
A. A. Grometstein, M.I.T. Lincoln Laboratory, Lexington (Ch. 26)
R. M. Haralick, Virginia Polytech, Blacksburg (Ch. 18)
W. Heiser, Universiteit van Leiden, Leiden (Ch. 13)
A. K. Jain, Michigan State University, East Lansing (Ch. 39)
F. Jelinek, IBM Thomas J. Watson Research Center, Yorktown Heights (Ch. 25)
T. Kaminuma, Metropolitan Institute of Medical Science, Tokyo (Ch. 32)
L. N. Kanal, University of Maryland, College Park (Ch. 16)
T. J. Killeen, University of Connecticut, Storrs (Ch. 30)
B. Kleiner, Bell Laboratories, Murray Hill (Ch. 10)
B. R. Kowalski, University of Washington, Seattle (Ch. 31)
P. R. Krishnaiah, University of Pittsburgh, Pittsburgh (Chs. 37, 41)
B. A. Lambird, L.N.K. Corporation, Silver Spring (Ch. 16)

G. H. Landeweerd, Vrije Universiteit, Amsterdam (Ch. 27)


D. Lavine, L.N.K. Corporation, Silver Spring (Ch. 16)
G. S. Lbov, USSR Academy of Sciences, Novosibirsk (Ch. 21)
J. C. Lee, Bell Laboratories, Murray Hill (Ch. 5)
J. de Leeuw, Universiteit van Leiden, Leiden (Ch. 13)
G. J. McLachlan, University of Queensland, Queensland (Ch. 9)
R. L. Mercer, IBM Thomas J. Watson Research Center,
Yorktown Heights (Ch. 25)
L. C. Morey, University of Florida, Gainesville (Ch. 11)
G. Nagy, University of Nebraska, Lincoln (Ch. 29)
R. A. Reyment, Uppsala University, Uppsala (Ch. 33)
F. J. Rohlf, State University of New York, Stony Brook (Ch. 12)
A. Rosenfeld, University of Maryland, College Park (Ch. 17)
J. Sallentin, Université Pierre et Marie Curie, Paris (Ch. 20)
W. Schaafsma, Groningen University, Groningen (Ch. 40)
J. L. Schmidhammer, University of Pittsburgh, Pittsburgh (Ch. 38)
W. H. Schoendorf, M.I.T. Lincoln Laboratory, Lexington (Ch. 26)
R. H. Shumway, University of California, Davis (Ch. 1)
J. C. Simon, Université Pierre et Marie Curie, Paris (Ch. 20)
M. Siotani, Hiroshima University, Hiroshima (Ch. 3)
G. C. Stockman, L.N.K. Corporation, Silver Spring (Ch. 24)
P. H. Swain, Purdue University, West Lafayette (Ch. 28)
S. Tomita, University of Yamaguchi, Ube (Ch. 32)
J. M. Van Campenhout, Rijksuniversiteit Gent, Gent (Ch. 36)
T. J. Wagner, University of Texas, Austin (Ch. 8)
S. Watanabe, University of Hawaii, Honolulu (Ch. 32)
D. G. Weeks, University of California, Los Angeles (Ch. 34)
M. Wish, Bell Laboratories, Murray Hill (Ch. 14)
S. Wold, Umeå University, Umeå (Ch. 31)
V. N. Yolkina, USSR Academy of Sciences, Novosibirsk (Ch. 22)
N. G. Zagoruiko, USSR Academy of Sciences, Novosibirsk (Ch. 22)
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 1-46

Discriminant Analysis for Time Series

R. H. Shumway

1. Introduction

The extension of classical pattern recognition techniques to experimental time
series data is a problem of great practical interest. A series of observations
indexed in time often produces a pattern which may form a basis for discriminat-
ing between different classes of events. As an example, one might consider Fig. 1
which shows data from a number of presumed earthquakes and nuclear explo-
sions as recorded at the Large Aperture Seismic Array (LASA) in Montana. The
records, taken from [71], indicate that there may be some rather pronounced
differences between the patterns formed for this particular group of earthquakes
and explosions. For example, the rather impulsive nature of the explosion
patterns stands out in contrast to the energetic latter portions (codas) of the
earthquake signals. The statistical characterization of these differences can lead to
a classification procedure which assigns future events of unknown origin to either
of the two categories with relatively low error rate. Such classification procedures
form the basis for the discriminant analysis of time series.
Time series classification problems are not restricted to geophysical applica-
tions but occur under many and varied circumstances in other fields. Tradition-
ally, the detection of a signal imbedded in a noise series has been analyzed in the
engineering literature by statistical pattern recognition techniques. For example,
the problem of detecting a radar signal (see [44] or [77]) is important for
estimating the time delay and Doppler parameters which characterize the position
and velocity of a moving target. The detection of the signal or, equivalently, the
problem of discriminating between a patterngenerated by signal plus noise and a
pattern generated by noise alone, has been analyzed extensively in the engineering
literature, and is discussed comprehensively in the above references or in [24] and
[691. The foundations for the classical engineering approach, which assumes a
continuous time parameter model, were laid by Grenander [39] and extended by
Root [66], Kadota [47], Kadota and Shepp [48], and Parzen [63, 64]. Basically one
must expand the continuous parameter time series in terms of the orthonormal
eigenfunctions of its covariance operator. The resulting Karhunen-Loève expan-
sion yields a countable infinity of random variables which can be analyzed using
conventional likelihood expressions.

[Fig. 1. Short period seismic records of presumed earthquakes (EQ) and explosions (EX) recorded at LASA, annotated with USGS origin dates, epicenters, and magnitudes; time scale in seconds.]

The difficulties encountered in applying this approach to real sampled-data
systems, where the integral equations cannot necessarily be solved, or to even
more difficult cases where the covariance function must be estimated from data,
have tended to discourage prospective users from embracing it as a general
purpose practical tool. In the more difficult cases the use of the more restrictive
Fourier theory has been common in the literature, and one can find accounts in
[44] and [80].
An important potential application in medicine is to the problem of dis-
criminating between different classes of brain wave recordings. Electroencephalo-
graphic (EEG) time series have been used to discriminate among sleep stages and
to predict the onset of epileptic seizures. Gevins et al. [35] have summarized
applications of discriminant analysis to EEG data, emphasizing approaches based
primarily on recognizing specific spectral features in the frequency domain.
Recently Gersch and colleagues [33, 34] have investigated EEG classification
methods based on characterizing group differences in terms of autoregressive
models; the application in [34] was to discriminate between anesthesia levels
sufficient, and insufficient, for deep surgery.
Potential applications of time series discriminant functions can be identified in
recorded speech data, where one may be interested in discriminating between
various speech patterns [57, 79, 84]. The problem of identifying speakers using
variables derived from the frequency spectrum has been investigated extensively
in [15].
The characterization of two-dimensional patterns, as in [83] or [55], leads to
possible applications of picture processing [45, 58, 67]. Multidimensional patterns
also arise in modelling the changing character of the ocean as a propagating
disturbance moves across an array of acoustic or temperature sensors. In seismic
applications the pattern introduced by the propagating plane wave can yield
valuable information relating to the source of the event as well as other character-
istics of the propagating wave. The detection of such multidimensional signal
sources is considered in [80] for patterns generated as propagating plane waves.
The common feature in many of the above practical problems is that one
observes a discrete parameter time series {x(t), t = 0, 1,..., T-1} at each of T
points in time, with the objective being to classify the observed series into one of
q mutually exclusive and exhaustive categories. One may also observe a multi-
variate series of the form {x_j(t), t = 0,..., T-1, j = 1, 2,..., p}, but we defer the
discussion of this important case in the interest of simplifying the notation. It is
convenient to represent the univariate sampled time series as the T × 1 vector

x = (x(0), x(1),..., x(T-1))',

and consider the classification problem for finite dimensional random vectors.
This reduces the problem to one that is covered very well in standard multivariate
references such as in [3, 37, 59, 75]. In these approaches one usually assigns a
multivariate normal observation to each of the q classification categories on the
basis of a Bayes or likelihood based rule, which usually ensures that some
combination of misclassification probabilities will be minimized.
In the case of q categories the vector x is regarded as belonging to a
T-dimensional Euclidean space which has been partitioned into q disjoint regions
E_1, E_2,..., E_q, such that if x falls in region E_i, we assign x to population i. If x has
a probability density of the form pi(x) when x belongs to population i, the
probability of misclassifying an observation into population j can be written as

P(j|i) = ∫_{E_j} p_i(x) dx    (1.1)

for i ≠ j = 1, 2,..., q. If the prior probability of belonging to population i is π_i and


the overall misclassification costs are equal, the overall cost

P_e = Σ_{i=1}^{q} π_i Σ_{j≠i} P(j|i)    (1.2)

is minimized by the Bayes rule that assigns x to population l if

p_l(x)/p_j(x) > π_j/π_l    (1.3)

for all j ≠ l. This is equivalent to assigning x to that population which has the
largest posterior probability.
In the case where there are only two (q = 2) populations of interest, it is
convenient to define two hypotheses, say H_1 and H_2. In this case, the Neyman-
Pearson lemma leads to accepting H_1 for

p_1(x)/p_2(x) > K    (1.4)

and accepting H_2 otherwise. The advantage here is that the rule is independent of
the prior probabilities and has the property that for a fixed misclassification
probability of one kind the error of the other kind is minimized. That is, fixing
P(1|2) yields a minimum P(2|1), and fixing P(2|1) yields a rule which minimizes
P(1|2). It is obvious from (1.3) and (1.4) that K = π_2/π_1 is the Bayes rule when
the prior probabilities are π_1 and π_2.
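As a quick illustration of the two-group rules (1.3) and (1.4), the Python sketch below applies the likelihood ratio with the Bayes threshold K = π_2/π_1; the Gaussian densities, dimension, means, covariance, and priors are assumed values chosen only for the example.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
T = 4                                    # small dimension for illustration
mu1, mu2 = np.ones(T), np.zeros(T)       # hypothetical mean vectors
R = np.eye(T)                            # common covariance matrix (assumed)

def likelihood_ratio(x):
    """p_1(x) / p_2(x) for the two Gaussian populations, as in (1.4)."""
    return (multivariate_normal.pdf(x, mean=mu1, cov=R) /
            multivariate_normal.pdf(x, mean=mu2, cov=R))

pi1, pi2 = 0.3, 0.7                      # assumed prior probabilities
K = pi2 / pi1                            # Bayes threshold from (1.3)

x = rng.multivariate_normal(mu2, R)      # an observation actually drawn from population 2
print("assign to population", 1 if likelihood_ratio(x) > K else 2)
```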
The discussion can be made more concrete by considering the classical problem
of detecting a signal in noise. Suppose that H_1 denotes the signal present
hypothesis, and that the signal is absent under H_2. Then P(1|2) denotes the false
alarm probability, which might be set at some prespecified level, say 0.001. In this
case it follows that P(2|1), the missed signal probability, is minimized or,
equivalently, that P(1|1), the signal detection probability, is maximized. As
another example the seismic data in Fig. 1 can be analyzed by identifying the
earthquakes with H_1 and the explosions with H_2. Since the interest is in detecting
an explosion, presumably identified as a violation of an underground nuclear test
ban treaty, P(2|1) can be identified as the false alarm probability, whereas P(2|2)
is a measure of the explosion detection probability.
The above criteria have been applied mainly to the problem of classifying
multivariate vectors, where the dimensionality T was fairly small and an adequate
learning population was available for estimating the unknown parameters. This
will not generally be the case for time series data, where T can be very large
relative to the number of elements likely to be found in the learning population.
For example, the earthquakes and explosions in Fig. 1 are sampled at T = 256
points, and the potential learning populations contain only 40 earthquakes and 26
explosions respectively. In this case the computations required to calculate the
discriminant function and to numerically evaluate its performance will always
involve inverting a 256 × 256 covariance matrix. The estimation of the parameters
in the learning sets will be difficult because of the fact that when the dimension T
exceeds the number of elements in the learning population, the sample covariance
matrices are not of full rank.
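A small numerical illustration of this rank problem, using hypothetical Gaussian learning data of the sizes quoted above: with n = 26 series of length T = 256, the sample covariance matrix has rank at most n - 1 and cannot be inverted directly.

```python
import numpy as np

rng = np.random.default_rng(1)
T, n = 256, 26                           # series length and learning sample size
X = rng.normal(size=(n, T))              # hypothetical learning series, one per row
S = np.cov(X, rowvar=False)              # T x T sample covariance matrix
print("dimension:", S.shape[0], " rank:", np.linalg.matrix_rank(S))   # rank <= n - 1
```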
The difficulties inherent in time domain computations can be alleviated consid-
erably by applying spectral approximations suggested by the properties of the
discrete Fourier transform (DFT) of a stationary process. If the covariance
functions are assumed to come from stationary error or noise processes, then the
involved matrix operations can be replaced by simple ones involving the spectra
and DFT's of the data and mean value functions. The use of spectral approxima-
tions has been fairly standard, beginning with the work of Whittle [81], and
continuing with the work of Brillinger [18] and Hannan [43]. The approximations
used here depend on those developed by Wahba [78], Liggett [56], and Shumway
and Unger [72]. Spectral approximations to the likelihood function have been
used by a number of authors for solving a great variety of problems (see [6, 19,
20, 22, 27, 28, 29, 56, 73]).
We begin by reviewing the standard approach to the problem of discriminating
between two normal processes with unequal means or covariance functions. The
two cases lead separately to linear or quadratic discriminant functions which can
be approximated by spectral methods if the covariance function is stationary. The
discriminant functions and their performance characteristics are approximated by
frequency domain methods, leading to simple and easily computed expressions. A
section is included which discusses methods for (1) determining whether the
group patterns are better modelled as differing in the mean values or
covariances (spectra), and (2) the estimation of means and spectra from a learning
population. Finally an example shows the application of the techniques to the
problem of discriminating between short period seismic recordings of earthquakes
and explosions.

2. Time domain classification methods

Suppose that a time series is regarded as a multivariate normal vector x =
(x(0), x(1),..., x(T-1))' with mean vector μ_j = (μ_j(0), μ_j(1),..., μ_j(T-1))' and
T × T covariance matrix

R_j = {r_j(t-u), t, u = 0, 1,..., T-1}    (2.1)

under hypothesis H_j, j = 1, 2,..., q. Writing the covariance function in terms of the
time difference t-u indicates that under H_j, one may represent x as a signal plus
a stationary zero mean noise process, i.e.,

x = μ_j + n_j    (2.2)

where μ_j denotes a fixed signal and n_j = (n_j(0), n_j(1),..., n_j(T-1))' has mean 0
and stationary covariance matrix R_j. One may obtain a stochastic signal model by
choosing μ_j = 0 in (2.2) and regarding n_j as formed by adding a zero-mean
stationary stochastic signal s_j = (s_j(0), s_j(1),..., s_j(T-1))', depending on the par-
ticular hypothesis H_j, to a noise n which does not depend on j. This leads to the
stochastic signal model

x = s_j + n,    (2.3)

where the covariance matrix of x can be represented as the sum of the signal and
noise covariance matrices if the signal and noise processes are assumed to be
uncorrelated. The simple case of detecting either a fixed or stochastic signal s
imbedded in Gaussian noise follows by taking q = 2, μ_1 = s, and μ_2 = 0 in (2.2)
for a deterministic signal model and s_1 = s, s_2 = 0 in (2.3) for the stochastic signal
case. It follows that the general model which regards x as normal with mean μ_j
and covariance matrix R_j under H_j subsumes the standard cases of interest in
signal detection theory.
For the multivariate normal or Gaussian case the probability density appearing
in the likelihood or Bayes approaches takes the form

p_j(x) = (2π)^{-T/2} |R_j|^{-1/2} exp{-½(x - μ_j)'R_j^{-1}(x - μ_j)},
-∞ < x < ∞, j = 1, 2.    (2.4)

The basic approach and equations following from (2.4) can be found in the
standard references [3, 37, 75].
It should be noted that the results in this section assume that the mean value
and covariance parameters are known exactly. This idyllic assumption is almost
never satisfied in practice, and one must find ways of estimating the initial means
and covariances for each of the populations. Such problems are considered in
Section 5. Since the case of q = 2 predominates in applications and is less
involved, we cover that case first in the following sections. For example, the
problem of discriminating between the earthquake records (H_1) and the explosion
records (H2) falls in this category.

2.1. Unequal means case

In the unequal means case for q = 2, assume first that the covariance matrices
are equal, i.e. R_1 = R_2 = R, and note that the usual Neyman-Pearson criterion (1.4)
implies that one should accept H_1 if the linear discriminant function

d_L(x) = (μ_1 - μ_2)'R^{-1}x - ½μ_1'R^{-1}μ_1 + ½μ_2'R^{-1}μ_2    (2.5)

exceeds some threshold value K, where K is usually chosen to obtain a desired
performance in terms of the two error probabilities P(1|2) and P(2|1) as defined
in Section 1.
It is easy to show that the discriminant function d_L(x) is normally distributed
with mean ½δ_T^2 under H_1 and -½δ_T^2 under H_2, with variance δ_T^2 in both cases,
where

δ_T^2 = (μ_1 - μ_2)'R^{-1}(μ_1 - μ_2)    (2.6)

is the standard Mahalanobis measure of distance between the two populations.


The two error probabilities are

P(1|2) = 1 - Φ((K + ½δ_T^2)/δ_T)    (2.7)

and

P(2|1) = Φ((K - ½δ_T^2)/δ_T),    (2.8)

where

Φ(x) = (2π)^{-1/2} ∫_{-∞}^{x} exp(-½z^2) dz    (2.9)

denotes the cumulative distribution function of the standard normal distribution.

A useful special case occurs when the error probabilities are set equal, so that
with K = 0 in (2.7) and (2.8)

P(1|2) = P(2|1) = Φ(-½δ_T),    (2.10)

and the performance is exhibited as a simple monotone function of the distance
measure. The detection probabilities are then given by

P(1|1) = P(2|2) = Φ(½δ_T)    (2.11)

and clearly increase as a function of the distance measure.
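The calculations in (2.5)-(2.8) are easy to carry out numerically; the sketch below uses an assumed pair of mean functions and an assumed stationary covariance matrix, so the numbers are purely illustrative.

```python
import numpy as np
from scipy.stats import norm
from scipy.linalg import solve

T = 8
t = np.arange(T)
mu1 = np.sin(2 * np.pi * t / T)               # assumed mean under H_1
mu2 = np.zeros(T)                             # zero mean under H_2
R = 0.5 ** np.abs(t[:, None] - t[None, :])    # an assumed stationary covariance matrix

Rinv_diff = solve(R, mu1 - mu2)               # R^{-1}(mu_1 - mu_2)
delta2 = (mu1 - mu2) @ Rinv_diff              # Mahalanobis distance (2.6)
delta = np.sqrt(delta2)

def d_L(x):
    # linear discriminant (2.5); accept H_1 when d_L(x) exceeds the threshold K
    return Rinv_diff @ x - 0.5 * mu1 @ solve(R, mu1) + 0.5 * mu2 @ solve(R, mu2)

K = 0.0                                       # equal-error threshold
P12 = 1 - norm.cdf((K + 0.5 * delta2) / delta)   # (2.7)
P21 = norm.cdf((K - 0.5 * delta2) / delta)       # (2.8)

rng = np.random.default_rng(2)
x = rng.multivariate_normal(mu1, R)           # a series generated under H_1
print("accept H1:", d_L(x) > K,
      " delta_T^2 = %.2f  P(1|2) = %.3f  P(2|1) = %.3f" % (delta2, P12, P21))
```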


Again, a special case of some interest is that of detecting a deterministic signal s
imbedded in Gaussian white noise n. The hypotheses H_1 and H_2 may then be
formulated as

H_1: x(t) = s(t) + n(t)   and   H_2: x(t) = n(t).


The covariance matrix of the vector white noise process n is R = σ_n^2 I_T, where I_T
denotes the T × T identity matrix and σ_n^2 = E(n^2(t)) is the noise power. The
discriminant function in this case is obtained by taking μ_1 = s, μ_2 = 0 with R as
given above. From (2.5) note that

d_L(x) = σ_n^{-2} Σ_{t=0}^{T-1} s(t)x(t) - ½σ_n^{-2} Σ_{t=0}^{T-1} s^2(t)    (2.12)

is the filter resulting from simply matching the theoretical signal to the observed
series. The performance of this matched filter depends on the distance measure
(2.6) which becomes

δ_T^2 = σ_n^{-2} Σ_{t=0}^{T-1} s^2(t)    (2.13)

and is just the signal to noise ratio. From (2.10) and (2.11) it is easy to see that
P(1|2), the false alarm probability, gets small as the signal to noise ratio
increases, whereas the signal detection probability P(1|1) increases towards a
limiting value of one.
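A short numerical sketch of the matched filter (2.12) and its performance through (2.13); the signal shape and noise power are assumed for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
T = 128
t = np.arange(T)
s = np.exp(-0.05 * t) * np.sin(2 * np.pi * t / 16)   # hypothetical deterministic signal
sigma_n2 = 1.0                                       # assumed white noise power

x = s + rng.normal(scale=np.sqrt(sigma_n2), size=T)  # data generated under H_1

d = (s @ x) / sigma_n2 - 0.5 * (s @ s) / sigma_n2    # matched filter statistic (2.12)
delta2 = (s @ s) / sigma_n2                          # signal to noise ratio (2.13)

# With K = 0 the error probabilities reduce to (2.10)-(2.11)
print("accept H1 (signal present):", d > 0)
print("false alarm probability   :", norm.cdf(-0.5 * np.sqrt(delta2)))
print("detection probability     :", norm.cdf(+0.5 * np.sqrt(delta2)))
```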
For the case of discriminating among more than two populations differing only
in the mean values, it is convenient to define the intermediate term

g_j(x) = μ_j'R^{-1}x - ½μ_j'R^{-1}μ_j    (2.14)

for j = 1, 2,..., q, with the Bayes rule (1.3), implying that one should classify x into
population l whenever

u_{lj}(x) = g_l(x) - g_j(x) > ln π_j - ln π_l    (2.15)

for j = 1, 2,..., q, j ≠ l. If the error probabilities for the multiple group case are
needed, note that the random variable u_{lj}(x) (u_{lj}(x) = -u_{jl}(x)) is normal with
mean ½δ_{ljT}^2 under H_l and mean -½δ_{ljT}^2 under H_j, with variance δ_{ljT}^2 under both
hypotheses, where

δ_{ljT}^2 = (μ_l - μ_j)'R^{-1}(μ_l - μ_j).    (2.16)

Unfortunately the u_{lj}(x) and u_{lj'}(x) are correlated for j ≠ j', and the regions for
determining the probability of correctly classifying x into H_l are rather involved.
However, defining K_{lj} = ln π_j - ln π_l, we may use Bonferroni's inequality to write

P(l|l) ≥ 1 - Σ_{j≠l} Φ((K_{lj} - ½δ_{ljT}^2)/δ_{ljT})    (2.17)

as a lower bound for the probability of correctly classifying x into population l. If
the prior probabilities are equal, K_{lj} = 0 and we have an expression which
depends strictly on the distance measures.
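The rule (2.15) amounts to classifying x into the population with the largest value of g_j(x) + ln π_j. The sketch below implements this for assumed means, a common covariance matrix, and assumed priors.

```python
import numpy as np
from scipy.linalg import solve

rng = np.random.default_rng(4)
T, q = 6, 3
mus = [np.full(T, c) for c in (0.0, 1.0, -1.0)]   # hypothetical group means
R = np.eye(T) + 0.3 * np.ones((T, T))             # common covariance matrix (assumed)
priors = np.array([0.5, 0.3, 0.2])                # assumed prior probabilities

def g(x, mu):
    # the intermediate term (2.14)
    return mu @ solve(R, x) - 0.5 * mu @ solve(R, mu)

x = rng.multivariate_normal(mus[1], R)            # an observation from population 2
scores = [g(x, mu) + np.log(p) for mu, p in zip(mus, priors)]
print("classified into population", int(np.argmax(scores)) + 1)
```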

2.2. Unequal covariance matrices


For the case where one allows the covariance functions to differ, the applica-
tion of (1.4) leads to a discriminant function of the form

d_Q'(x) = ½x'(R_2^{-1} - R_1^{-1})x + (μ_1'R_1^{-1} - μ_2'R_2^{-1})x,    (2.18)

which is the sum of a quadratic and a linear form. The probability distribution of
this discriminant is very involved, and since one can often deal with the case
where a signal has only a stochastic part, it is convenient to let μ_1 = μ_2 = 0 and
work with the purely quadratic discriminant

d_Q(x) = x'(R_2^{-1} - R_1^{-1})x,    (2.19)

with the rule being to accept H_1 if d_Q(x) > K. The distribution of d_Q(x) under
hypothesis H_j is basically a linear combination of single degree of freedom
chi-squared random variables where the coefficients are the eigenvalues of the
matrix R_j(R_2^{-1} - R_1^{-1}) for j = 1, 2. Even though these coefficients may be either
positive or negative, there are numerical methods for calculating the critical
points (c.f. [26]). In the case where T is moderately large, the normal approxima-
tion may be reasonable, and it is useful to note that the means and variances of
d_Q(x) under H_j, j = 1, 2, are

E_j(d_Q(x)) = tr[(R_2^{-1} - R_1^{-1})R_j]    (2.20)

and

var_j(d_Q(x)) = 2 tr[(R_2^{-1} - R_1^{-1})R_j]^2,    (2.21)

where tr denotes trace.
A special case of interest is that of detecting a white Gaussian signal in a white
Gaussian noise process, for which we may take R_2 = σ_n^2 I_T and R_1 = (σ_s^2 + σ_n^2)I_T,
where σ_s^2 = E(s^2(t)) is the average signal power and σ_n^2 is the average noise power
as before. Then

d_Q(x) = ((σ_n^2)^{-1} - (σ_s^2 + σ_n^2)^{-1}) Σ_{t=0}^{T-1} x^2(t)    (2.22)

is simply a constant multiple of a chi-squared random variable with T degrees of
freedom under both H_1 and H_2. It is easy to verify for this particular case that the
false alarm and signal detection probabilities may be written as

P(1|2) = 1 - G_T((1 + r^{-1})K)    (2.23)

and

P(1|1) = 1 - G_T(r^{-1}K),    (2.24)

where G_T(·) denotes the cumulative distribution function of the chi-squared
distribution with T degrees of freedom and

r = σ_s^2/σ_n^2    (2.25)

is the signal to noise ratio. Later it will be evident that a narrow band version of
this detector is a reasonable approximation in the general case.
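The detector (2.22) and its chi-squared error probabilities (2.23)-(2.24) can be evaluated as in the sketch below, where the signal and noise powers are assumed values and the false alarm probability is fixed at 0.001.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(5)
T = 256
sigma_s2, sigma_n2 = 2.0, 1.0                    # assumed signal and noise powers
r = sigma_s2 / sigma_n2                          # signal to noise ratio (2.25)

c = 1.0 / sigma_n2 - 1.0 / (sigma_s2 + sigma_n2)
x = rng.normal(scale=np.sqrt(sigma_s2 + sigma_n2), size=T)   # data under H_1
dQ = c * np.sum(x ** 2)                          # quadratic discriminant (2.22)

# Fix the false alarm probability P(1|2) at 0.001 via (2.23), then get P(1|1) from (2.24)
K = chi2.ppf(1 - 0.001, df=T) / (1 + 1 / r)
P_detect = 1 - chi2.cdf(K / r, df=T)
print("accept H1:", dQ > K, "  detection probability =", round(P_detect, 3))
```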
The distributional complications associated with working with the discrimi-
nants (2.18) and (2.19) lead to considering the possibility of applying linear
procedures in these complicated cases. In particular, the problem of determining
the T × 1 vector b such that the decision rule accepts H_1 when b'x > K was
considered first by Kullback [53], who showed that when both the means and
covariances differed, the solution that minimized P(1|2) for a given value of
P(2|1) was of the form

b = (w_1R_1 + w_2R_2)^{-1}(μ_1 - μ_2),    (2.26)

where the covariance matrix R in (2.5) is replaced by a positive definite weighted
average of R_1 and R_2, and where the weights w_1 and w_2 are determined to
minimize the error probabilities. Anderson and Bahadur [4] characterized the set
of admissible linear discriminants as being of the form (2.26), with the threshold
K required to satisfy

K = b'μ_1 - w_1σ_1^2 = b'μ_2 + w_2σ_2^2,    (2.27)


where
σ_j^2 = b'R_jb,    (2.28)

j = 1, 2, are the variances of the discriminant under H_j. Giri [37] has presented a
convenient method for searching over the admissible linear solutions, noting that
(2.27) implies that one may write

P(2|1) = 1 - Φ(w_1σ_1)    (2.29)


and
P(1|2) = 1 - Φ(w_2σ_2),    (2.30)

and restrict the search to weights w_1 and w_2 such that w_1 + w_2 = 1 for w_1, w_2 > 0,
w_1 - w_2 = 1 for w_1 > 0, w_2 < 0, and w_2 - w_1 = 1 for w_1 < 0, w_2 > 0, as shown in
Fig. 2.
Then, choosing w_1 and w_2 in Fig. 2 leads to b by solving (2.26) as long as the
positive definite condition holds. This leads, in turn, to the two error probabilities
given in (2.29) and (2.30). We do not discuss more detailed procedures for finding
w_1 and w_2 to minimize one error probability for a given value of the other, since
the spectral approximations will enable a very rapid scan over the values of w_1
and w_2 with an accompanying simple method for determining whether the
positive definite condition holds.
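The scan over admissible weights can be sketched as follows for the branch w_1 + w_2 = 1 of Fig. 2, with assumed means and covariances; for brevity the sketch simply reports the weights that minimize the sum of the two error probabilities rather than fixing one of them as described above.

```python
import numpy as np
from scipy.stats import norm
from scipy.linalg import solve

T = 5
t = np.arange(T)
mu1, mu2 = np.ones(T), np.zeros(T)                    # assumed means
R1 = 0.6 ** np.abs(t[:, None] - t[None, :])           # assumed covariance under H_1
R2 = 2.0 * 0.2 ** np.abs(t[:, None] - t[None, :])     # assumed covariance under H_2

best = None
for w1 in np.linspace(0.01, 0.99, 99):                # branch w_1 + w_2 = 1, w_1, w_2 > 0
    w2 = 1.0 - w1
    W = w1 * R1 + w2 * R2
    if np.any(np.linalg.eigvalsh(W) <= 0):            # positive definite condition
        continue
    b = solve(W, mu1 - mu2)                           # admissible direction (2.26)
    s1, s2 = np.sqrt(b @ R1 @ b), np.sqrt(b @ R2 @ b) # sigma_1, sigma_2 from (2.28)
    P21, P12 = 1 - norm.cdf(w1 * s1), 1 - norm.cdf(w2 * s2)   # (2.29)-(2.30)
    if best is None or P21 + P12 < best[0] + best[1]:
        best = (P21, P12, w1)

print("smallest total error on this branch: P(2|1)=%.3f, P(1|2)=%.3f at w1=%.2f" % best)
```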
[Fig. 2. Values of the weights w_1 and w_2 for which an admissible linear discriminant may exist in the
unequal covariance matrix case.]

In order to obtain the multiple group discriminant function corresponding to
the unequal means, unequal covariances case, simply substitute (2.4) into (1.3) to
obtain the analog of (2.15). In this case one classifies x into population l if

v_{lj}(x) = h_l(x) - h_j(x) > ln π_j - ln π_l    (2.31)

for j = 1,2,..., q, where

h_j(x) = g_j(x) - ½ ln det(R_j) - ½x'R_j^{-1}x,    (2.32)

with g_j(x) the linear term given in (2.14). Taking the means μ_j = 0 in (2.32) gives
g_j(x) = 0, leading to the pure quadratic discriminant.

3. Discriminant analysis in the frequency domain



While the time domain approach of the previous section can lead to rather
cumbersome matrix calculations, this is not the primary reason for considering
the more easily computed spectral approximations. The motivation for the
frequency domain approach stems mainly from the convenient theoretical proper-
ties of the discrete Fourier transform of a weakly stationary process, namely, that
the random variables produced are nearly uncorrelated, with variances approxi-
mately equal to the power spectrum. In this case the estimation and hypothesis
testing problems are formulated in terms of sample spectral densities with simple
approximate distributions, and one avoids the more difficult sampling character-
istics of the autocorrelation functions observed in time domain computations. In
a practical phenomenological context, the power spectrum usually turns out to be
an essential component in any overall model for a physical system. The purpose
of this section is to present some of the spectral approximations which make
discriminant analysis in the frequency domain such an attractive alternate proce-
dure. The results allow one to replace complicated expressions involving matrix
inverses by simple sums involving discrete Fourier transforms (DFT's) and
spectral density functions.

In order to develop the basic approximations, assume a zero mean stationary discrete time parameter process x(t), observed at the points t = 0, 1, ..., T-1 and with autocorrelation function r(t-u) = E(x(t)x(u)). As in Section 2, define the vector x = (x(0), x(1), ..., x(T-1))' of time sampled values with covariance matrix

R = E(xx') = { r(t-u),  t, u = 0, 1, ..., T-1 }.    (3.1)

The power spectrum f(·) of the stationary process is defined by the usual Fourier representation

r(t) = (2π)^{-1} ∫_{-π}^{π} exp{iλt} f(λ) dλ    (3.2)

as long as r(t) is absolutely summable. The 2π is associated with λ in order to relate the angular frequency in radians directly to the frequency f in cycles per unit time (λ = 2πf). For the approximations detailed here, the spectrum should be bounded above and below away from zero, i.e.

0 < m ≤ f(λ) ≤ M < ∞,    (3.3)

so that expressions involving f^{-1}(λ) will be stable. It is also necessary that the regularity condition

Σ_{t=-∞}^{∞} |t|^{1+α} |r(t)| < ∞    (3.4)

holds for some α, 0 < α < 1. This condition from Liggett [56] is used to justify the approximation to R^{-1}.
In order to develop a reasonable approximation to the covariance function r(t-u), note that (3.2) suggests the form

k(t-u) = T^{-1} Σ_{k=0}^{T-1} exp{iλ_k (t-u)} f(λ_k),    (3.5)

where λ_k = 2πkT^{-1} for k = 0, 1, ..., T-1. Wahba [78] showed that

Σ_{t,u=0}^{T-1} |r(t-u) - k(t-u)| ≤ 2θ,    (3.6)

where

θ = Σ_{t=-∞}^{∞} |t| |r(t)| < ∞    (3.7)

because of (3.4). If R̂ = {k(t-u), t, u = 0, 1, ..., T-1}, Liggett [56] has shown



that R̂^{-1}, with elements of the form

k^{-1}(t-u) = T^{-1} Σ_{k=0}^{T-1} exp{iλ_k (t-u)} f^{-1}(λ_k),    (3.8)

is an approximation to R^{-1} in the sense that a difference of the form (3.6) involving the elements of the respective inverses is bounded. These results suggest the possibility of replacing R and R^{-1} by R̂ and R̂^{-1} in the linear and quadratic discriminant functions in Subsections 2.1 and 2.2. Other approximations to the inverse for autoregressive moving average processes are considered in [70] and [5].
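As an illustration of (3.5) and (3.8), the following minimal numerical sketch (Python with numpy; the function and variable names are ours and are not part of any published package) builds the approximations R̂ and R̂^{-1} directly from spectral ordinates supplied on the DFT grid λ_k = 2πk/T. Because both matrices are circulant with symbols f and f^{-1}, their product is the identity, which is one way of seeing why (3.8) approximates R^{-1}.

    import numpy as np

    def circulant_cov_and_inverse(f_vals):
        # (3.5): k(h) = T^{-1} sum_k exp{i*lam_k*h} f(lam_k) is an inverse DFT of the
        # spectral ordinates; (3.8) is the same construction applied to 1/f.
        T = len(f_vals)
        k_hat = np.fft.ifft(f_vals).real           # k(h), h = 0, ..., T-1
        kinv_hat = np.fft.ifft(1.0 / f_vals).real  # elements of R-hat^{-1}
        lag = (np.arange(T)[:, None] - np.arange(T)[None, :]) % T
        return k_hat[lag], kinv_hat[lag]

    # Example with a toy AR(1)-type spectrum evaluated on the DFT grid.
    T = 64
    lam = 2 * np.pi * np.arange(T) / T
    f_vals = 1.0 / np.abs(1 - 0.5 * np.exp(-1j * lam)) ** 2
    R_hat, R_hat_inv = circulant_cov_and_inverse(f_vals)
    print(np.abs(R_hat @ R_hat_inv - np.eye(T)).max())   # essentially zero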
A fundamental transformation used in frequency domain methods is the discrete Fourier transform (DFT), defined for the process x(t) as

X^(k) = T^{-1/2} Σ_{t=0}^{T-1} x(t) exp{-iλ_k t}    (3.9)
t=0

for λ_k = 2πkT^{-1} as before. Since these transforms can be rapidly calculated using one of the fast Fourier transform algorithms (see, for example, [60]) and generate complex random variables which are approximately uncorrelated, they occupy a central role in time series inference (see [5, 10, 16-20, 43] for examples).
In fact, under our fairly restrictive conditions it follows that, for x(t) a normal process, X^(k) is approximately a complex Gaussian random variable in the sense of Goodman [38]. Many authors have shown that

E{X^(k) X^(l)*} = f(λ_k) + O(T^{-1})   if k = l,
               = O(T^{-1})             if k ≠ l,    (3.10)

for k, l = 1, 2, ..., ½T - 1, where T is arbitrarily taken to be even to simplify notation. At the points k = 0, ½T, X^(k) is purely real, and the coefficient of f(λ_k) in (3.10) is ½. Since X^(k) = X^(T-k)*, where * denotes the complex conjugate, it is enough to consider the frequency points k = 0, 1, ..., ½T. A useful additional observation (see [54]) is that under the regularity condition (3.7), the error in the above approximations can be bounded by θT^{-1}, enabling one to specify exactly how close the DFT's are to satisfying (3.10) for any given autocorrelation. In the case where x(t) is not a normal process, but is mixing, it is possible to develop central limit results as in [43] to show that X^(k) tends toward normality as λ_k tends toward some reference frequency λ.
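A small simulation sketch of (3.9) and (3.10) may help fix the scaling. The code below (Python with numpy; illustrative only, with hypothetical names) computes X^(k) = T^{-1/2} Σ_t x(t) exp{-iλ_k t} via the fast Fourier transform and checks empirically that the average periodogram |X^(k)|² of a simulated AR(1) series is close to its spectrum.

    import numpy as np

    def dft_shumway(x):
        # Discrete Fourier transform with the T^{-1/2} scaling of (3.9).
        return np.fft.fft(x) / np.sqrt(len(x))

    rng = np.random.default_rng(0)
    T, phi, nrep = 256, 0.5, 500
    lam = 2 * np.pi * np.arange(T) / T
    f_true = 1.0 / np.abs(1 - phi * np.exp(-1j * lam)) ** 2   # AR(1) spectrum, unit innovations
    acc = np.zeros(T)
    for _ in range(nrep):
        e = rng.standard_normal(T + 100)
        x = np.zeros(T + 100)
        for t in range(1, T + 100):                            # simulate, then drop burn-in
            x[t] = phi * x[t - 1] + e[t]
        acc += np.abs(dft_shumway(x[100:])) ** 2
    print((acc / nrep)[1:10] / f_true[1:10])                   # ratios close to one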
The spectral approximations for the unequal means case depend on evaluating the asymptotic behavior of the q mean value functions μ_j(t), j = 1, 2, ..., q, over the time points t = 0, 1, ..., T-1. For this case it is convenient to adopt a version of Grenander's conditions [5, 41, 43]. We adopt the scaling convention used in [72] and [20]. Suppose that for μ(t) = (μ_1(t), μ_2(t), ..., μ_q(t))',

(i)  sup_{0 ≤ t ≤ T-1} |μ_j(t)| < ∞,   j = 1, ..., q,

and that the q × q sample autocorrelation matrix

ρ_T(τ) = T^{-1} Σ_{t=0}^{T-1-τ} μ(t+τ) μ'(t)    (3.11)

has a limit given by

(ii)  lim_{T→∞} ρ_T(τ) = ρ(τ),

where ρ(0) is positive definite. Then it follows, as in [43], that

ρ(τ) = (2π)^{-1} ∫_{-π}^{π} exp{iλτ} dM(λ),    (3.12)

where M(λ) is a q × q matrix with Hermitian non-negative increments, uniquely defined by continuity from the right and with M(-π) = 0.
Now, if we define the matrix M_T = (μ_1, μ_2, ..., μ_q) composed of the mean vectors μ_j = (μ_j(0), μ_j(1), ..., μ_j(T-1))', it is easy to see that the elements of the matrix

D_T = T^{-1} M'_T R^{-1} M_T    (3.13)

are of the form D_T = {T^{-1} μ'_j R^{-1} μ_k, j, k = 1, 2, ..., q}, and appear in the discriminant functions in Subsection 2.1. Then it can be shown, as in [5] or [43], that

lim_{T→∞} D_T = D,

where

D = (2π)^{-1} ∫_{-π}^{π} f^{-1}(λ) dM(λ).    (3.14)

Similarly, one may consider the easily computed approximation

D̂_T = T^{-1} M'_T R̂^{-1} M_T = T^{-1} Σ_{k=0}^{T-1} f^{-1}(λ_k) M^(k) M^(k)*,    (3.15)

where M^(k) = (M_1^(k), M_2^(k), ..., M_q^(k))' with

M_j^(k) = T^{-1/2} Σ_{t=0}^{T-1} μ_j(t) exp{-iλ_k t},    (3.16)

the DFT of the mean value function. (The notation * will also be used for the complex conjugate transpose.) It is easy to show by the arguments used in [72]


that

lim_{T→∞} D̂_T = D,

so that the approximation and the true value have the same limit. Furthermore, it can be shown (see [68]) that the absolute difference between two corresponding elements of D_T and D̂_T is O(T^{-1}), implying that the approximation is reasonable for finite values of T.
A number of other approximations will be introduced in the following sections which reduce involved matrix operations like those in D_T to simple sums like D̂_T.
The complex exponentials appearing in the DFT can be regarded as coefficients
in a linear transformation which reduces the matrix R approximately to a
diagonal matrix with the spectral ordinates appearing down the diagonal. Work
of Fuller [31] and Davies [25] can be consulted for more detailed analyses from
this point of view.

3.1. Linear discriminant functions: The deterministic signal case


For the unequal means case one would like to develop an approximation to d_L(x) as given in (2.5), which may be written in the more convenient form

d_L(x) = g_1(x) - g_2(x),    (3.17)

with g_j(x), j = 1, 2, defined in (2.14). The approximation to d_L(x) is obtained by replacing R^{-1} by R̂^{-1} in (2.14) for g_j(x), so that the approximation to (3.17) becomes

d̂_L(x) = ĝ_1(x) - ĝ_2(x),    (3.18)

with

ĝ_j(x) = μ'_j R̂^{-1} x - ½ μ'_j R̂^{-1} μ_j
       = Σ_{k=0}^{T-1} M_j^(k)* X^(k) / f(λ_k) - ½ Σ_{k=0}^{T-1} |M_j^(k)|² / f(λ_k)    (3.19)

for λ_k = 2πkT^{-1}, with M_j^(k), X^(k) the DFT's of μ_j(·) and x(·) respectively.
The resulting expression for (3.18) contains only simple sums depending on the DFT's of the mean value and data series, weighted inversely by the spectral values. Furthermore, note that T^{-1} d̂_L(x) is normally distributed with mean ½T^{-1} δ̂_T² under H_1 and mean -½T^{-1} δ̂_T² under H_2, where

δ̂_T² = (μ_1 - μ_2)' R̂^{-1} (μ_1 - μ_2) = Σ_{k=0}^{T-1} |M_1^(k) - M_2^(k)|² / f(λ_k).    (3.20)

Note that for q = 2 one may use the limiting results for D_T and D̂_T to write

δ² = lim_{T→∞} T^{-1} δ_T² = lim_{T→∞} T^{-1} δ̂_T²,

where

δ² = (2π)^{-1} ∫_{-π}^{π} f^{-1}(λ) a' dM(λ) a    (3.21)

with a = (1, -1, 0, ..., 0)'. Again, the approximations are good for finite T, as we have

T^{-1} |δ̂_T² - δ_T²| = O(T^{-1}).
The variance of the approximate discriminant T^{-1} d̂_L(x) is

var(T^{-1} d̂_L(x)) = T^{-2} (μ_1 - μ_2)' R̂^{-1} R R̂^{-1} (μ_1 - μ_2).

It can be shown by methods analogous to those used by Liggett [56] (see also [68] and [72]) that

δ² = lim_{T→∞} T var T^{-1} d_L(x) = lim_{T→∞} T var T^{-1} d̂_L(x),

with the relations

T |var T^{-1} d_L(x) - var T^{-1} d̂_L(x)| = O(T^{-1})

and

T |var T^{-1} d̂_L(x) - T^{-2} δ̂_T²| = O(T^{-1})

holding for finite T. This justifies the use of the approximate discriminant function T^{-1} d̂_L(x) as being normal with approximate means ½T^{-1} δ̂_T² and -½T^{-1} δ̂_T² under H_1 and H_2 respectively. The variance is approximately T^{-2} δ̂_T² under both hypotheses. The error probabilities P(1|2) and P(2|1) can be approximated by replacing δ_T² in (2.7) and (2.8) by δ̂_T², and we note that, for K = 0, the argument in (2.10) and (2.11) is approximately δ̂_T ≈ T^{1/2} δ, where δ² is given in (3.21). These are the same approximate error probabilities that would be obtained from T^{-1} d̂_L(x), and it is clear that both error rates go to zero as T → ∞.
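In code, the approximate linear discriminant (3.18)-(3.19) and its performance are only a few lines. The sketch below (Python with numpy and scipy.stats; the function names are ours) assumes the means and the common spectrum are supplied on the DFT grid, and uses the familiar equal-prior form of the error probability, Φ(-δ̂_T/2) for K = 0, as an approximation to the error expressions of Section 2.

    import numpy as np
    from scipy.stats import norm

    def g_hat(x, mu_j, f_vals):
        # (3.19): g_j(x) = sum_k Mj^(k)* X^(k)/f(lam_k) - 0.5*sum_k |Mj^(k)|^2/f(lam_k).
        T = len(x)
        X = np.fft.fft(x) / np.sqrt(T)
        M = np.fft.fft(mu_j) / np.sqrt(T)
        # the first sum is real up to rounding because of conjugate symmetry
        return np.real(np.sum(np.conj(M) * X / f_vals)) - 0.5 * np.sum(np.abs(M) ** 2 / f_vals)

    def d_hat_L(x, mu1, mu2, f_vals):
        return g_hat(x, mu1, f_vals) - g_hat(x, mu2, f_vals)   # (3.18)

    def approx_error_rate(mu1, mu2, f_vals):
        # delta_T^2 of (3.20); both error rates are roughly Phi(-delta_T/2) when K = 0.
        T = len(mu1)
        D = np.fft.fft(mu1 - mu2) / np.sqrt(T)
        return norm.cdf(-0.5 * np.sqrt(np.sum(np.abs(D) ** 2 / f_vals)))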
The special case of detecting a deterministic signal s(t) in noise n(t) is of interest in its own right. Denoting the noise autocorrelation and spectrum by R_n(·) and f_n(·) respectively, we obtain the approximate discriminant function

d̂_L(x) = Σ_{k=0}^{T-1} S^(k)* X^(k) / f_n(λ_k) - ½ δ̂_T²,    (3.22)

where

δ̂_T² = Σ_{k=0}^{T-1} |S^(k)|² / f_n(λ_k)    (3.23)

determines the approximate performance of the filter in terms of the signal to noise ratios summed over the separate frequencies. The linear discriminant function (3.22) is equivalent to the usual signal processing technique which matches the prewhitened data against a prewhitened version of the signal. One may first prewhiten both DFT's by dividing by f_n^{1/2}(λ_k) at each frequency. The inverse DFT's, say x_w(t) and s_w(t), are the prewhitened data and signal processes respectively, and they can be applied as in matched filtering. This yields

d̂_L(x) = Σ_{t=0}^{T-1} s_w(t) x_w(t) - ½ Σ_{t=0}^{T-1} s_w²(t)

as an equivalent form of (3.22).
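A sketch of the prewhitened matched filter just described follows (Python with numpy; illustrative only, with assumed variable names), working entirely with the DFT's so that (3.22) and (3.23) appear as single sums.

    import numpy as np

    def prewhitened_matched_filter(x, s, fn_vals):
        # fn_vals holds the noise spectrum f_n(lam_k) on the DFT grid.
        T = len(x)
        X = np.fft.fft(x) / np.sqrt(T)
        S = np.fft.fft(s) / np.sqrt(T)
        Xw, Sw = X / np.sqrt(fn_vals), S / np.sqrt(fn_vals)   # prewhitened transforms
        stat = np.real(np.sum(np.conj(Sw) * Xw))              # sum_k S^(k)* X^(k)/f_n(lam_k)
        snr = np.sum(np.abs(Sw) ** 2)                         # delta_T^2 of (3.23)
        return stat - 0.5 * snr                               # d_L(x) of (3.22)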


For the multiple group discriminant function one may simply replace g_j(x) in (2.15) by ĝ_j(x) in (3.19), obtaining

v̂_{lj}(x) = ĝ_l(x) - ĝ_j(x)    (3.24)

as the approximate discriminant function, with the performance determined by the approximate distance

δ̂²_{ljT} = (μ_l - μ_j)' R̂^{-1} (μ_l - μ_j) = Σ_{k=0}^{T-1} |M_l^(k) - M_j^(k)|² / f(λ_k)    (3.25)

for l, j = 1, 2, ..., q, and l ≠ j.

3.2. Quadratic discriminant functions: The stochastic signal case


In the case where the purely quadratic discriminant function (2.19) is of interest, the approximation

d̂_Q(x) = x' (R̂_2^{-1} - R̂_1^{-1}) x = Σ_{k=0}^{T-1} |X^(k)|² (f_2^{-1}(λ_k) - f_1^{-1}(λ_k))    (3.26)

can be used. Such approximations have been considered by Liggett [56] who showed, for example, that T^{-1} |d_Q(x) - d̂_Q(x)| → 0 almost surely under both H_1 and H_2. Central limit results for the test statistic (3.26) were considered by Capon [21] and Rubin [68], who proved essentially that

m_j = lim_{T→∞} T^{-1} E_j d_Q(x) = lim_{T→∞} T^{-1} E_j d̂_Q(x),

where

m_j = (2π)^{-1} ∫_{-π}^{π} (f_2^{-1}(λ) - f_1^{-1}(λ)) f_j(λ) dλ.    (3.27)

The variance terms satisfy

γ_j² = lim_{T→∞} T var_j T^{-1} d_Q(x) = lim_{T→∞} T var_j T^{-1} d̂_Q(x),

where

γ_j² = 2(2π)^{-1} ∫_{-π}^{π} (f_2^{-1}(λ) - f_1^{-1}(λ))² f_j²(λ) dλ.    (3.28)

The approximations to the means and variances, derived by replacing R_j by R̂_j in (2.20) and (2.21) and scaled as above, also have this same limiting property, implying that one may take the approximate quadratic discriminant function T^{-1} d̂_Q(x) as being normal with mean

m̂_j = T^{-1} Σ_{k=0}^{T-1} (f_2^{-1}(λ_k) - f_1^{-1}(λ_k)) f_j(λ_k)    (3.29)

and variance

σ̂_j² = 2 T^{-2} Σ_{k=0}^{T-1} (f_2^{-1}(λ_k) - f_1^{-1}(λ_k))² f_j²(λ_k)    (3.30)

under H_j, j = 1, 2. These approximations to the true means and variances of T^{-1} d_Q(x) given in (2.20) and (2.21) satisfy

|m̂_j - E_j T^{-1} d_Q(x)| = O(T^{-1})

and

T |σ̂_j² - var_j T^{-1} d_Q(x)| = O(T^{-1}),

implying that they are reasonable for finite T.
Since the decision rule is to accept H_1 for d_Q(x) > K, the error probabilities P(1|2) and P(2|1) may be approximated as

P(1|2) = 1 - Φ((K - m̂_2)/σ̂_2)    (3.31)

and

P(2|1) = Φ((K - m̂_1)/σ̂_1),    (3.32)

where the arguments are proportional to m̂_j/σ̂_j ≈ T^{1/2} m_j/γ_j. Grenander [42] has derived other asymptotic expressions for the error rates which are related directly to the spectral distribution of the eigenvalues of R_1 R_2^{-1}.
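For the purely stochastic case the computations are equally direct. The following sketch (Python with numpy and scipy.stats; names are ours) evaluates (3.26) and the normal approximation (3.29)-(3.32) for a given threshold K.

    import numpy as np
    from scipy.stats import norm

    def d_hat_Q(x, f1, f2):
        # (3.26): sum_k |X^(k)|^2 (1/f2(lam_k) - 1/f1(lam_k)).
        X = np.fft.fft(x) / np.sqrt(len(x))
        return np.sum(np.abs(X) ** 2 * (1.0 / f2 - 1.0 / f1))

    def quadratic_error_probs(f1, f2, K):
        # Normal approximation for T^{-1} d_Q(x): means (3.29), variances (3.30),
        # error probabilities (3.31)-(3.32).
        T = len(f1)
        diff = 1.0 / f2 - 1.0 / f1
        m = [np.sum(diff * fj) / T for fj in (f1, f2)]
        s = [np.sqrt(2.0 * np.sum((diff * fj) ** 2)) / T for fj in (f1, f2)]
        P_1_given_2 = 1.0 - norm.cdf((K - m[1]) / s[1])
        P_2_given_1 = norm.cdf((K - m[0]) / s[0])
        return P_1_given_2, P_2_given_1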
Freiberger [29] (see also [30]) used the continuous analog of T^{-1} d̂_Q(x), say

d_Q(x) = (2π)^{-1} ∫_{-π}^{π} |X^(λ)|² (f_2^{-1}(λ) - f_1^{-1}(λ)) dλ,    (3.33)

as an approximation to the quadratic discriminant function with means m_j, j = 1, 2, under H_j, where m_j is given in (3.27). They compared the probability density of (3.33) with the exact distribution of the quadratic form d_Q(x) for several cases where both could be evaluated and found excellent agreement. The test statistic d_Q(x) is the output of a continuous linear filter followed by a power integrator, and can be realized directly with analog equipment. Shaman [70] has also considered some properties of quadratic forms which are similar to (3.33) for autoregressive or moving average processes. The processing of digital data would require some discrete approximation of the form (3.26).
If the number of non-zero terms in (3.26) is not large, the normal approximation may work poorly, and approximations related to the chi-squared distribution may be appropriate. A realistic assumption for many applications is that a suitably restricted band of frequencies in the neighborhood of some center frequency λ_m contains the values for which the two spectra differ, and that the spectra are essentially equal outside this band. If the L frequencies are of the form λ_{m+k}, k = -½(L-1), ..., 0, ..., ½(L-1), the frequency domain quadratic detector (3.26) may be written in the form

d̂_Q(x) = 2 Σ_k |X^(m+k)|² (f_2^{-1}(λ_{m+k}) - f_1^{-1}(λ_{m+k})).    (3.34)

Eq. (3.10) then implies that, under H_j, the squared DFT's in (3.34) are approximately independent and distributed as chi-square random variables if they are standardized by dividing by f_j(λ_{m+k}). Hence the quadratic discriminant can be exhibited as

d̂_Q(x) = Σ_k a_k^{(j)} U_k,    (3.35)

which is a linear combination of L chi-square random variables U_k with two degrees of freedom each. The coefficients are

a_k^{(j)} = f_j(λ_{m+k}) (f_2^{-1}(λ_{m+k}) - f_1^{-1}(λ_{m+k}))    (3.36)
under H_j, j = 1, 2. Several options are available for calculating the critical values associated with the approximate quadratic discriminant function (3.35). Box [13] has given a simple expression for computing probabilities involving a linear combination of negative exponentials when the coefficients may be both positive

and negative. The cdf corresponding to (3.35) in this case is of the form

F_Q(z) = 1 - Σ_k α_k^{(j)} (1 - G_2(z/a_k^{(j)})),    (3.37)

where G_2(·) denotes the cdf of a chi-square random variable with two degrees of freedom and the coefficients are defined by

α_k^{(j)} = ∏_{l≠k} a_k^{(j)} / (a_k^{(j)} - a_l^{(j)}).    (3.38)

For a general formula which applies to any linear combination of possibly non-central chi-squared random variables see [25].
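Box's expression is easy to evaluate numerically. The sketch below (Python with numpy; a hypothetical helper, assuming the coefficients a_k are distinct) returns the upper tail probability of (3.35) for z > 0; only the positive coefficients contribute to the upper tail, and 1 - G_2(w) = exp(-w/2) for a chi-square variable with two degrees of freedom. A critical value for a prescribed false alarm rate can then be found by bisection on z.

    import numpy as np

    def box_tail_prob(z, a):
        # P{ sum_k a_k U_k > z } for z > 0, U_k i.i.d. chi-square(2), via (3.37)-(3.38).
        a = np.asarray(a, dtype=float)
        tail = 0.0
        for k in range(len(a)):
            if a[k] <= 0:
                continue
            others = np.delete(a, k)
            alpha_k = np.prod(a[k] / (a[k] - others))    # (3.38)
            tail += alpha_k * np.exp(-z / (2.0 * a[k]))  # 1 - G2(z/a_k)
        return tail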
If we assume that the spectra f_1(·) and f_2(·) are essentially constant over the band centered on λ_m and that

H_1:  f_1(λ_m) = f_s(λ_m) + f_n(λ_m)   and   H_2:  f_2(λ_m) = f_n(λ_m)

specify the signal plus noise and noise alone hypotheses, then the distribution of the test statistic (3.34) reduces to that of a constant times a chi-squared random variable. The performance in this case reduces to (2.23) and (2.24), with T replaced by 2L and r replaced by

r_m = f_s(λ_m) / f_n(λ_m),    (3.39)

the signal to noise ratio at λ_m.


The approximation (3.34) suggests that functions involving the sample spectrum (smoothed periodogram)

f̂_T(λ_m) = Σ_{k=-(L-1)/2}^{(L-1)/2} A(k) |X^(m+k)|²    (3.40)

might be nearly optimal for purposes of discrimination. In general, the weights can be determined in accordance with the usual considerations that are common in designing spectral windows (see, for example, [10, 18, 43, 46, 60]). Liggett [56] has shown that the periodogram ordinates appearing in the quadratic discriminant function can be replaced asymptotically by smoothed spectral estimates of the form (3.40), which would yield an expression for (3.26) of the form

d̂_Q(x) = Σ_{m=0}^{M-1} f̂_T(λ_m) (f_2^{-1}(λ_m) - f_1^{-1}(λ_m)),    (3.41)

where ML ≈ T and λ_m = π[(2m+1)L - 1] T^{-1}. Then M^{-1} d̂_Q(x) can be used, with the errors under the normal approximation determined by making the appropriate replacements in (3.29)-(3.32). The choice of M follows from smoothness conditions imposed on the spectrum which are given by Liggett [56].
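A simple equal-weight version of the sample spectrum (3.40) can be computed as a circular moving average of the periodogram, as in the sketch below (Python with numpy; A(k) = L^{-1} is only one of many admissible window choices).

    import numpy as np

    def smoothed_periodogram(x, L):
        # (3.40) with equal weights A(k) = 1/L over L = 2*half + 1 neighbouring frequencies.
        T = len(x)
        per = np.abs(np.fft.fft(x) / np.sqrt(T)) ** 2
        half = (L - 1) // 2
        padded = np.concatenate([per[-half:], per, per[:half]])    # wrap the DFT grid
        return np.convolve(padded, np.ones(L) / L, mode="valid")   # length T again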

In the case where both the mean value and covariance functions are different, linear procedures of the form b'x > K may be considered, and one may develop the approximate version in terms of the DFT's of b and x, leading to accepting H_1 when T^{-1} d̂_LQ(x) > K̂, where

d̂_LQ(x) = Σ_{k=0}^{T-1} B^(k)* X^(k)    (3.42)

denotes the approximation to the linear discriminant function applied to the mixed case. The coefficients B^(k) are given by the approximate version of (2.26), say

B^(k) = (w_1 f_1(λ_k) + w_2 f_2(λ_k))^{-1} (M_1^(k) - M_2^(k)),    (3.43)

with the error probabilities (2.29) and (2.30) given by

P̂(2|1) = 1 - Φ(w_1 σ̂_1)    (3.44)

and

P̂(1|2) = 1 - Φ(w_2 σ̂_2),    (3.45)

where

σ̂_j² = T^{-2} Σ_{k=0}^{T-1} |B^(k)|² f_j(λ_k)    (3.46)

is the approximate variance under H_j. The requirement that w_1 R_1 + w_2 R_2 be positive definite in the stationary case is expressed very simply in terms of the condition

w_1 f_1(λ) + w_2 f_2(λ) > 0    (3.47)

for -π ≤ λ ≤ π. The whole range of admissible linear discriminants for this case can be very quickly searched over the values of w_1 and w_2 shown in Fig. 2, satisfying the positive definite condition (3.47). The performance characteristics (3.44) and (3.45) can be evaluated for each w_1 and w_2, with the approximate threshold K̂ following from the conditions (2.27) and (2.28) as

K̂ = T^{-1} Σ_{k=0}^{T-1} B^(k)* M_1^(k) - w_1 σ̂_1²
  = T^{-1} Σ_{k=0}^{T-1} B^(k)* M_2^(k) + w_2 σ̂_2².    (3.48)

The justification (see [72]) for these approximations involves showing that the means and variances of T^{-1} d̂_LQ(x) satisfy limiting conditions which are the same as those satisfied by the exact time domain version T^{-1} b'x.
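The scan over admissible weights is cheap once everything is expressed in the frequency domain. The sketch below (Python with numpy and scipy.stats; names are ours, and the formulas follow (3.43)-(3.47) as written above) walks along the branch w_1 + w_2 = 1 of Fig. 2, checks the positivity condition (3.47), and records the two approximate error probabilities; the other two branches can be scanned in the same way.

    import numpy as np
    from scipy.stats import norm

    def scan_admissible_weights(M1, M2, f1, f2, n_grid=99):
        # M1, M2: DFT's of the two mean functions; f1, f2: the two spectra on the DFT grid.
        T = len(f1)
        out = []
        for w1 in np.linspace(0.01, 0.99, n_grid):
            w2 = 1.0 - w1
            fw = w1 * f1 + w2 * f2
            if np.any(fw <= 0):                      # condition (3.47)
                continue
            B = (M1 - M2) / fw                       # coefficients (3.43)
            sig = [np.sqrt(np.sum(np.abs(B) ** 2 * fj)) / T for fj in (f1, f2)]   # (3.46)
            out.append((w1, w2,
                        1.0 - norm.cdf(w1 * sig[0]),     # P(2|1), (3.44)
                        1.0 - norm.cdf(w2 * sig[1])))    # P(1|2), (3.45)
        return out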

For the case of discriminating among q groups one may develop an approximation to v_{lj}(x) as given in (2.31), noting that

I_j = lim_{T→∞} T^{-1} ln det R_j = lim_{T→∞} T^{-1} Σ_{k=0}^{T-1} ln f_j(λ_k),    (3.49)

where

I_j = (2π)^{-1} ∫_{-π}^{π} ln f_j(λ) dλ    (3.50)

(see [56], or [40, p. 65]). This means that one may use the discriminant T^{-1} v̂_{lj}(x), accepting H_l if

v̂_{lj}(x) = ĥ_l(x) - ĥ_j(x) > ln π_j - ln π_l,    (3.51)

as in (2.31) and (2.32), with

ĥ_j(x) = ĝ_j(x) - ½ Σ_{k=0}^{T-1} ln f_j(λ_k) - ½ Σ_{k=0}^{T-1} |X^(k)|² / f_j(λ_k),    (3.52)

where ĝ_j(x) is as in (3.19) with f_j(λ_k) replacing f(λ_k).
In the case where the mean values are zero, corresponding to discriminating among purely stochastic signals, one would have ĝ_j(x) = 0, and Capon [22] showed the asymptotic joint normality of v̂_{lj}(x), j = 1, ..., q, j ≠ l, where the v̂_{lj}(x) have been appropriately standardized. This implies that T^{-1} v̂_{lj}(x) can be treated as approximately normally distributed with mean

m̂_{lj} = ½ T^{-1} Σ_{k=0}^{T-1} [ln(f_j(λ_k)/f_l(λ_k)) + (f_j^{-1}(λ_k) - f_l^{-1}(λ_k)) f_l(λ_k)]    (3.53)

under hypothesis H_l for j = 1, ..., q, j ≠ l. Again, as in (2.17), Bonferroni's inequality can be applied to place an approximate lower bound on the probability of detecting H_l.
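The multiple group rule (3.51)-(3.52) can be sketched as follows (Python with numpy; illustrative names), assigning a series to the group that maximizes ĥ_j(x) + ln π_j, which is equivalent to (3.51).

    import numpy as np

    def h_hat(x, mu_j, f_j):
        # (3.52): h_j(x) = g_j(x) - 0.5*sum_k ln f_j(lam_k) - 0.5*sum_k |X^(k)|^2/f_j(lam_k).
        T = len(x)
        X = np.fft.fft(x) / np.sqrt(T)
        M = np.fft.fft(mu_j) / np.sqrt(T)
        g = np.real(np.sum(np.conj(M) * X / f_j)) - 0.5 * np.sum(np.abs(M) ** 2 / f_j)
        return g - 0.5 * np.sum(np.log(f_j)) - 0.5 * np.sum(np.abs(X) ** 2 / f_j)

    def classify(x, means, spectra, priors):
        # maximize h_j(x) + ln(pi_j) over the q groups
        scores = [h_hat(x, m, f) + np.log(p) for m, f, p in zip(means, spectra, priors)]
        return int(np.argmax(scores))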

3.3. Multivariate extensions
In some cases it may be necessary to handle multivariate time series data. For
example, a number of sensors can be employed to monitor a phenomenon, with
the possibility that population differences may be characterized by the sensor
cross correlation structure. Possible applications occur in processing EEG data
from multiple leads attached to the same subject where cross spectral parameters
are assumed to be important.

For a multivariate time series (see also [81-83] and [43]) we imagine a collection of p sensors x_1(t), x_2(t), ..., x_p(t) sampled at the time points t = 0, 1, ..., T-1. The vector series x(t) = (x_1(t), x_2(t), ..., x_p(t))' is assumed to have a p × 1 mean value function μ_j(t) = (μ_{j1}(t), μ_{j2}(t), ..., μ_{jp}(t))' and a p × p matrix valued covariance function

R_j(t-u) = {r_{jmn}(t-u), m, n = 1, ..., p}    (3.54)

under H_j, j = 1, ..., q. The p × p Hermitian non-negative definite spectral matrices F_j(λ) are related to the cross correlation matrices through the usual Fourier representation

R_j(t-u) = (2π)^{-1} ∫_{-π}^{π} exp{iλ(t-u)} F_j(λ) dλ.    (3.55)

The DFT's for the observed vector series x(t) and the mean value functions μ_j(t), j = 1, ..., q, are defined as in (3.9) and (3.16); we denote the transforms by X^(k) and M_j^(k) respectively. The sampled data will be represented generically by the p × T matrix X = (x(0), x(1), ..., x(T-1)).
In order to discriminate between two hypotheses

H_1:  x(t) = μ_1(t) + n(t)   and   H_2:  x(t) = μ_2(t) + n(t),

where the matrix valued covariance functions of n(t) are the same under H_1 and H_2, one may construct the multivariate approximation

d̂_L(X) = ĝ_1(X) - ĝ_2(X),    (3.56)

where

ĝ_j(X) = Σ_{k=0}^{T-1} M_j^(k)* F^{-1}(λ_k) X^(k) - ½ Σ_{k=0}^{T-1} M_j^(k)* F^{-1}(λ_k) M_j^(k),    (3.57)

j = 1, ..., q, with F^{-1}(·) denoting the inverse of the spectral matrix F(·). The performance in this case can be approximated by using

Σ_{k=0}^{T-1} (M_1^(k) - M_2^(k))* F^{-1}(λ_k) (M_1^(k) - M_2^(k))    (3.58)

in (2.7) and (2.8) for the error probabilities.


The case where we are interested in discriminating among multiple groups with unequal means is handled similarly, replacing ĝ_j(x) in (3.19) by ĝ_j(X) given above and δ̂²_{ljT} in (3.25) by its matrix generalization, as for the case q = 2 given above in (3.58).

For the case where the covariance matrices are unequal but q = 2, the pure quadratic discriminant (2.18) becomes

d̂_Q(X) = Σ_{k=0}^{T-1} X^(k)* (F_2^{-1}(λ_k) - F_1^{-1}(λ_k)) X^(k),    (3.59)

and the normal approximation can again be used, with T^{-1} d̂_Q(X) assumed to have means

m̂_j = T^{-1} Σ_{k=0}^{T-1} tr{(F_2^{-1}(λ_k) - F_1^{-1}(λ_k)) F_j(λ_k)}    (3.60)

and variances

σ̂_j² = 2 T^{-2} Σ_{k=0}^{T-1} tr[(F_2^{-1}(λ_k) - F_1^{-1}(λ_k)) F_j(λ_k)]²    (3.61)

for j = 1, 2, with tr denoting the trace.


One may also develop the linear admissible discriminant as in (3.42) and (3.43), where the coefficients of the linear discriminant are expressed in terms of the vector DFT, B^(k) = (B_1^(k), B_2^(k), ..., B_p^(k))', so that the approximate discriminant function becomes T^{-1} d̂_LQ(X), where

d̂_LQ(X) = Σ_{k=0}^{T-1} B^(k)* X^(k)    (3.62)

with

B^(k) = (w_1 F_1(λ_k) + w_2 F_2(λ_k))^{-1} (M_1^(k) - M_2^(k)).    (3.63)

The weighted spectrum is restricted to values of w_1 and w_2 for which the matrix

F̂(λ) = w_1 F_1(λ) + w_2 F_2(λ)

is positive definite (see Fig. 2). The two error probabilities are approximated as in (3.44) and (3.45) with

σ̂_j² = T^{-1} Σ_{k=0}^{T-1} B^(k)* F_j(λ_k) B^(k).    (3.64)

We note that Azari [9] has given the details which establish the validity of these approximations in the multivariate case. Generally it is required that the cross correlations satisfy condition (3.4) and that 0 < m ≤ det F_j(λ) ≤ M < ∞ for all λ.
The multiple population case just uses the function ĝ_j(X) in the same way that (3.24) uses ĝ_j(x) in the univariate case, and the performance depends on the obvious generalization of (3.25). Another approach (see [18, pp. 390-391]) which might be useful in the multiple group case is to choose the discriminant or transformation vector β^(k) at each frequency by maximizing the ratio of the between-group power to the within-group power of the transformed observations in the learning populations. In this approach the discriminant vector β^(k) is the characteristic vector corresponding to the largest root ν_k of the determinantal equation

(B_k - ν_k E_k) β^(k) = 0,    (3.65)

involving the sample between-group and within-group power matrices B_k and E_k defined in (4.10) and (4.11) of Subsection 4.1. The discriminant function may then be applied to the observed vector X^(k), as in (3.63), over a band of frequencies of interest, say for k = 1, ..., L.
A further generalization of interest in certain array processing applications is to regard the time parameter as a vector, with coordinates in space as well as time. For example, a signal propagating through the ocean could be indexed by its location in latitude, longitude, and depth for each point in time. A black and white picture could be described in terms of a process taking values in an ordinary Cartesian coordinate system, where the value is proportional to the intensity of the picture. For these space-time processes, define the vector series x(t) = (x_1(t), ..., x_p(t))', where t = (t_1, ..., t_r)' is an r × 1 vector describing the location of the series relative to r space-time coordinates. The application of spectral methods to space-time data was considered by Whittle [83], Pagano [61], and recently by Larimore [55] for fitting autoregressive moving average models. We simply note here that the methods of this section are easily extended to the case of a multidimensional space-time process by observing that the autocorrelation matrix of a stationary process again has the representation

R_j(t-u) = (2π)^{-r} ∫_Λ exp{iλ'(t-u)} F_j(λ) dλ,    (3.66)

where λ = (λ_1, ..., λ_r)' is a vector of frequencies restricted to |λ_1|, |λ_2|, ..., |λ_r| ≤ π, and F_j(λ) is the p × p spectral density matrix under H_j. The DFT is defined as

X^(k) = T_1^{-1/2} ··· T_r^{-1/2} Σ_{t_1=0}^{T_1-1} ··· Σ_{t_r=0}^{T_r-1} x(t) exp{-iλ_k' t},    (3.67)

where λ_k = (2πk_1 T_1^{-1}, ..., 2πk_r T_r^{-1})' is the vector of wavenumbers corresponding to the index k = (k_1, ..., k_r)', with k_1 = 0, ..., T_1 - 1, ..., k_r = 0, ..., T_r - 1. The equations appearing in the earlier part of this section are then modified so that the sums run over the coordinates defined above.

4. Statistical characterization of patterns

All of the approximations in Section 3 require that we know in advance the


spectral densities and mean value functions under the various hypotheses. Fur-
thermore, one should have some method for determining whether the populations
are distinguished by differences in the mean value functions as in Subsection 3.1,
or by differences in the covariances (spectra) as considered in Subsection 3.2. If
possible, it is advantageous to express the differences in terms of the mean value
functions, with the covariances assumed to be the same. This implies that a linear
discriminant function will have the optimality property, with error probabilities
which can be easily computed using the normal distribution. If the means and
covariances are not equal, a linear procedure is available, but one can only
guarantee optimality within that restricted class of linear procedures. The ap-
proximate optimum discriminant function is a mixture of a linear and a purely quadratic function, as given in (3.51) and (3.52). If the means are zero, the populations can be distinguished using spectral characteristics, as in the quadratic discriminant (3.51) and (3.52) with ĝ_j(x) = 0.
The discriminant analysis procedure requires that we first decide whether linear
or quadratic statistics are more appropriate. Results obtained from testing the
equality of means and covariances hypotheses can be used as diagnostic aids for
determining whether to employ linear methods or quadratic methods or even a
mixture of the two. After the appropriate form has been chosen for the discrimi-
nant function, one will require estimators for the unknown population character-
istics. This reduces in the normal case to estimating the mean value functions and
spectra for the various groups or populations. This section provides some rea-
sonable tests for determining which discriminant function is appropriate, and
then provides estimators for the population mean values and spectra for use in
the approximate discriminant functions. Many of the multivariate test statistics
considered in this section depend on expanding classical results for real multi-
variate normal random variables to the complex case. A number of the multi-
variate results needed here can be found in [36] and [49]; some later results and a
review of the literature on complex multivariate normal distributions is given in
[50].
In general, write the lth member of the jth learning population as

x_{jl}(t) = μ_j(t) + n_{jl}(t)    (4.1)

for j = 1, ..., q, l = 1, ..., N_j, with the sample series assumed to be observed at the points t = 0, 1, ..., T-1. The error series n_{jl}(t) are assumed to be zero-mean stationary normal processes with an autocorrelation function of the form

E[n_{jl}(t) n_{j'l'}(u)] = R_j(t-u)   if (j, l) = (j', l'),
                        = 0           otherwise.    (4.2)

As a first step, it is of interest to determine whether the mean values are significantly different, assuming that the covariance functions are equal. This would imply that the linear discriminant function would be nearly optimum, with a corresponding simplification in the form and performance as described in earlier sections. If significant differences are apparent in the group covariance functions or spectra, one may need to look at some tests for equality of the group spectra.
Since the discriminant functions in the previous section have all been applied in the frequency domain, it is convenient to rely on an approach for testing these hypotheses on a frequency by frequency basis. These follow by taking DFT's on both sides of (4.1) to obtain the frequency domain model

X_{jl}^(k) = M_j^(k) + N_{jl}^(k),    (4.3)

where the complex normal variables N_{jl}^(k) now have the approximate covariance structure

E{N_{jl}^(k) N_{j'l'}^(k')*} = f_j(λ_k)   if (j, k, l) = (j', k', l'),
                             = 0           otherwise,    (4.4)

where f_j(λ_k) is just the spectrum of the jth population evaluated at frequency λ_k = 2πkT^{-1} as before. The next subsection examines the problem of testing the equality of the q means M_1^(k), M_2^(k), ..., M_q^(k) on a frequency by frequency basis.

4.1. Testing for pattern differences in the means


A test for equality of the q means can be developed using a spectral analysis of variance approach described in [16, 20] (see also [12, 73, 74]). If the spectra of the q groups are assumed to be the same, the usual hypothesis testing approach applied to the complex model (4.3) leads to a frequency domain analysis of power (ANOPOW) table, as shown in Table 1. Note that the group means

X_{j·}^(k) = N_j^{-1} Σ_l X_{jl}^(k),    (4.5)

j = 1, ..., q, and the overall means

X_{··}^(k) = (Σ_j N_j)^{-1} Σ_{jl} X_{jl}^(k)    (4.6)

are defined as in the usual case. The F test for equality of the q means at a given frequency λ_k follows by using the ratio of the mean between power component to the mean error power component in Table 1, which yields

F_{2(q-1), 2n}(λ_k) = (PB_k / PE_k) (n / (q - 1)),    (4.7)

where it is convenient for later expressions to define

n_j = N_j - 1    (4.8)

and

n = Σ_j n_j = Σ_j N_j - q.    (4.9)

If the test is performed over a band of L frequencies centered at λ_m, say λ_{m+k}, k = -½(L-1), ..., 0, ..., ½(L-1), the components of power in Table 1 are simply smoothed over the frequencies, and the degrees of freedom are replaced by 2L(q-1) and 2Ln respectively. We will refer frequently in the sequel to smoothing of various power components, and will mean by this in most cases an average of the form (3.40) with A(k) = 1 or L^{-1}. It is useful to plot the F statistic (4.7) as a function of frequency in order to determine which frequencies discriminate between the group means.
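The ANOPOW components and the statistic (4.7) are straightforward to compute from replicated series. In the sketch below (Python with numpy; groups is a hypothetical list of N_j × T arrays, one per population), the returned values can be plotted against frequency and referred to an F distribution with 2(q-1) and 2n degrees of freedom.

    import numpy as np

    def anopow_f(groups):
        T = groups[0].shape[1]
        q = len(groups)
        X = [np.fft.fft(g, axis=1) / np.sqrt(T) for g in groups]           # X_jl^(k)
        Xbar_j = [Xj.mean(axis=0) for Xj in X]                             # group means (4.5)
        N = sum(g.shape[0] for g in groups)
        Xbar = sum(Xj.sum(axis=0) for Xj in X) / N                         # overall means (4.6)
        PB = sum(g.shape[0] * np.abs(xb - Xbar) ** 2
                 for g, xb in zip(groups, Xbar_j))                         # between group power
        PE = sum(np.sum(np.abs(Xj - xb) ** 2, axis=0)
                 for Xj, xb in zip(X, Xbar_j))                             # error power
        n = N - q
        return (PB / PE) * (n / (q - 1))                                   # (4.7), per frequency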
In the multivariate case the vector process x_{jl}(t) = (x_{jl1}(t), ..., x_{jlp}(t))' is transformed to the frequency domain, and we obtain the equality of means test in terms of the p × p between groups spectral matrix

B_k = Σ_{jl} (X_{j·}^(k) - X_{··}^(k)) (X_{j·}^(k) - X_{··}^(k))*    (4.10)

and the error spectral matrix

E_k = Σ_{jl} (X_{jl}^(k) - X_{j·}^(k)) (X_{jl}^(k) - X_{j·}^(k))*.    (4.11)

Application of the likelihood ratio test leads to rejecting the hypothesis when

Λ(λ_k) = det(E_k) / det(E_k + B_k)    (4.12)

is less than some critical value. Khatri [49] and Hannan [43] give the usual

Table 1
Analysis of power for equality of means test at frequency λ_k = 2πkT^{-1}

Source            Degrees of freedom     Power
Between groups    2(q-1)                 PB_k = Σ_{jl} |X_{j·}^(k) - X_{··}^(k)|²
Error             2(Σ_j N_j - q)         PE_k = Σ_{jl} |X_{jl}^(k) - X_{j·}^(k)|²
Total             2(Σ_j N_j - 1)         PT_k = Σ_{jl} |X_{jl}^(k) - X_{··}^(k)|²

approximation

Pr{-ν ln Λ(λ_k) ≤ z} = G_f(z) + O(ν^{-2}),    (4.13)

where

f = 2(q-1)p    (4.14)

and

ν = 2n + q - 1 - p,    (4.15)

with n as in (4.9) and G_f(·) the cdf of a chi-squared distribution with f degrees of freedom as before. One term will generally be sufficient, since smoothing over L points simply replaces f by f' = 2L(q-1)p and ν by ν' = L(2n + q - 1) - p, so that the remainder is automatically made small by increasing bandwidth. The case of q = 2 populations specializes to a version of Hotelling's T², as shown in [37], where

T²(λ_k) = (N_1 N_2 / (N_1 + N_2)) (X_{1·}^(k) - X_{2·}^(k))* F̂_T^{-1}(λ_k) (X_{1·}^(k) - X_{2·}^(k)),    (4.16)

with the pooled spectral matrix defined as

F̂_T(λ_k) = n^{-1} [n_1 F̂_{1T}(λ_k) + n_2 F̂_{2T}(λ_k)],    (4.17)

where the spectral matrix of the jth group is always estimated by

F̂_{jT}(λ_k) = n_j^{-1} Σ_l (X_{jl}^(k) - X_{j·}^(k)) (X_{jl}^(k) - X_{j·}^(k))*.    (4.18)

The equality of means hypothesis is rejected if T²(λ_k) exceeds

K = (np / (n - p + 1)) F_{2p, 2(n-p+1); α},    (4.19)

where F_{f_1, f_2; α} denotes the upper α critical value from an F distribution with f_1 and f_2 degrees of freedom. Again it is informative to plot Hotelling's T² as a function of frequency.

4.2. Testing for pattern differences in the spectra


The case where the spectra of the populations may differ can also be handled on a frequency by frequency basis, using tests which are similar to those used with conventional multivariate data. For example, the univariate version of the spectral estimator for the jth population is of the form

f̂_{jT}(λ_k) = n_j^{-1} Σ_l |X_{jl}^(k) - X_{j·}^(k)|²,    (4.20)

and we notice that it is essentially a complex sample variance. For q = 2, the ratio

F_{2n_1, 2n_2}(λ_k) = f̂_{1T}(λ_k) / f̂_{2T}(λ_k)    (4.21)

has an F distribution with 2n_1 and 2n_2 degrees of freedom under the assumption that the two population spectra are the same at frequency λ_k. If smoothing is introduced over L frequencies, the degrees of freedom for the F approximation change to 2Ln_1 and 2Ln_2 for the numerator and denominator respectively. One may follow the usual convention of putting the larger spectral component in the numerator and rejecting for values of F exceeding some constant K.
For more than two populations one might just as well look at the multivariate likelihood ratio test that the q p × p spectral matrices are equal. A modification which incorporates the degrees of freedom n_j in place of the sample sizes N_j amounts to rejecting the equality hypothesis when

Λ'(λ_k) = n^{np} ∏_{j=1}^{q} [det n_j F̂_{jT}(λ_k)]^{n_j} / { ∏_{j=1}^{q} n_j^{n_j p} [det n F̂_T(λ_k)]^{n} }    (4.22)

is less than some critical value, where

F̂_T(λ_k) = n^{-1} Σ_{j=1}^{q} n_j F̂_{jT}(λ_k)    (4.23)

is the pooled estimator of the spectrum. Krishnaiah et al. [51] have given the hth moment of Λ'(λ_k) and calculated 95% critical points for p = 3, 4, using a Pearson type I approximation. For reasonably large samples involving smoothed spectral estimators, the first term in the complex version of the usual chi-squared series may be sufficiently accurate, and we note that the critical values can also be determined from

Pr{-2ρ ln Λ'(λ_k) ≤ z} = G_f(z) + O(n^{-1}),    (4.24)

where

f = (q-1)p²    (4.25)

and

1 - ρ = ((2p² - 1) / (6p(q-1))) (Σ_{j=1}^{q} n_j^{-1} - n^{-1}).    (4.26)

Introduction of smoothing over L frequencies leads to replacing n_j and n by Ln_j and Ln in (4.26). The univariate case with more than two groups can be treated by substituting p = 1 into the above equations.
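For two groups the frequency-by-frequency test is particularly simple. The sketch below (Python with numpy and scipy.stats; names are ours) computes the group spectra (4.20) as complex sample variances of the DFT's and refers their ratio (4.21) to an F distribution with 2n_1 and 2n_2 degrees of freedom.

    import numpy as np
    from scipy.stats import f as f_dist

    def group_spectrum(g):
        # (4.20): complex sample variance of the DFT's of the N_j series in the group.
        T = g.shape[1]
        X = np.fft.fft(g, axis=1) / np.sqrt(T)
        return np.sum(np.abs(X - X.mean(axis=0)) ** 2, axis=0) / (g.shape[0] - 1)

    def spectrum_equality_test(group1, group2):
        f1, f2 = group_spectrum(group1), group_spectrum(group2)
        n1, n2 = group1.shape[0] - 1, group2.shape[0] - 1
        ratio = f1 / f2                                           # (4.21)
        # two-sided p-value at each frequency
        p = 2 * np.minimum(f_dist.sf(ratio, 2 * n1, 2 * n2),
                           f_dist.cdf(ratio, 2 * n1, 2 * n2))
        return ratio, p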

4.3. Estimation of the group means, spectra and error probabilities


As a first step it is recommended that one examine the generating phenomena to see if either a purely deterministic (unequal means) or a purely stochastic (unequal covariance) model might be preferred on practical grounds. If no tendency towards either model is apparent on a priori grounds, one may use the learning population to look qualitatively at the mean value functions and estimated spectra for the q groups.
If the spectra are roughly equal over all frequencies, a frequency dependent test of the equality of means hypothesis may be performed using the ANOPOW approach given in Subsection 4.1. If the hypothesis can be rejected strongly at a number of frequencies, the spectral approximations of Subsection 3.1 might be appropriate, with M_j^(k) replaced by X_{j·}^(k) and f(λ_k) replaced by the pooled univariate spectral estimator

f̂_T(λ_k) = n^{-1} [n_1 f̂_{1T}(λ_k) + n_2 f̂_{2T}(λ_k)],    (4.27)

which can be identified as the mean error-power component in Table 1.
If the means are essentially zero and the discriminating power appears to be concentrated in the spectra, one may apply the purely quadratic discriminant function of Subsection 3.2 with the estimator

f̂_{jT}(λ_k) = N_j^{-1} Σ_l |X_{jl}^(k)|²,    (4.28)

j = 1, ..., q, replacing the population values f_j(λ_k) in the discriminant functions and performance measures.
If both the mean value and spectral functions differ, in the case q = 2, the linear admissible procedure of Subsection 3.2 might be applied, with M_j^(k) replaced by X_{j·}^(k) and f_j(λ_k) replaced by f̂_{jT}(λ_k) as defined in (4.20). Since the weights for the pooled estimator (4.27) are of the form w_j = n_j/n, it follows that the linear discriminant function belongs to the class of linear admissible procedures if the sample spectra are essentially equal to the population spectra. In the case of discriminating among multiple populations, the theoretical spectra are again replaced by their sample values (4.20) and the mean values M_j^(k) are replaced by X_{j·}^(k) as before.
It should be emphasized that smoothing over a band of L frequencies will generally be a good idea when we are interested in estimating population spectra for use in the various discriminant functions. In these cases, the group spectral estimators (4.20) can be replaced by the smoothed versions

f̂_{jT}(λ_k) = (L n_j)^{-1} Σ_{k'} Σ_l |X_{jl}^(k+k') - X_{j·}^(k+k')|²    (4.29)

for j = 1, ..., q, with L chosen in accordance with the usual resolution and bandwidth stability considerations. The smoothing introduces stability into computations involving the approximate discriminant functions in Section 3. For example, the occurrence of a zero value for f̂_{jT}(λ) is to be avoided at all costs, since it will induce a large spurious contribution to the discriminant function over

those frequencies. Frequently there will be bands where the group spectra and
mean value transforms are near zero, and we may have a situation where
assumption (3.3) is nearly violated. In order to keep the frequencies where the
spectra are nearly zero from exerting an unnatural effect on the computations,
one can either replace the spectrum in question by some small non-zero value or
simply assume that the numerator is zero over intervals where the denominator
spectrum is null, and sum the test statistic over a reduced number of frequencies.
The use of the autoregressive estimator for the group spectra may be preferred
if the original series satisfy low order autoregressions which differ between the
groups. This requires that one determines the appropriate order by some reason-
able criterion such as has been given by Akaike [1, 2]. Anderson [7] has given a
procedure for fitting autoregressive models to the replicated data in each of the
groups, and this will imply specific forms for the spectral densities ~v()t),
j = 1,..., q. One might even consider the autoregressive moving average (ARMA)
model, which can be developed using the model identification techniques of Box
and Jenkins [14]. The Akaike information theory criterion (AIC) can help one to
choose a final version, and then one can use the techniques summarized by
Anderson [6, 7] to fit a final model in either the time or frequency domain. In the
case where one is interested in detecting a simple autoregressive stochastic signal
imbedded in white noise, the estimation techniques developed in [62] could be
employed.
The linear discriminant function can be evaluated by examining its performance on samples not included in the learning population or by constructing estimators for P(1|2) and P(2|1), using one of the conventional techniques described in [59] or [37]. One possibility is to use the conventional expressions (2.7) and (2.8) with the approximation

δ̂_T² = Σ_{k=0}^{T-1} |X_{1·}^(k) - X_{2·}^(k)|² / f̂_T(λ_k)    (4.30)

for δ_T². Another possibility is to use the sample means and variances of the discriminant functions evaluated over a population not used for estimating the original mean values and spectra. For example, let d_L(x_{jl}) be the value of a discriminant function for the sample x_{jl}, where the discriminant function was derived from a learning sample not containing x_{jl}. Then the sample means

d̄_j = N_j^{-1} Σ_l d_L(x_{jl})    (4.31)

and variances

s_j² = (N_j - 1)^{-1} Σ_l (d_L(x_{jl}) - d̄_j)²    (4.32)

can be used in the usual expressions for the two errors, given in this case by

P̂(1|2) = 1 - Φ((K - d̄_2)/s_2)    (4.33)

and

P̂(2|1) = Φ((K - d̄_1)/s_1).    (4.34)

If the entire sample has been used as a learning set, it is conventional to evaluate the means and variances of discriminants which were computed after discarding the observation in question during the parameter estimation stage.
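The holdout calculations (4.31)-(4.34) amount to a few summary statistics on discriminant scores computed for series that were excluded from the learning sample, as in the following sketch (Python with numpy and scipy.stats; names are ours).

    import numpy as np
    from scipy.stats import norm

    def sample_error_estimates(scores_h1, scores_h2, K):
        # scores_h1, scores_h2: held-out values of d_L under H_1 and H_2 respectively.
        d1, d2 = np.mean(scores_h1), np.mean(scores_h2)                 # (4.31)
        s1, s2 = np.std(scores_h1, ddof=1), np.std(scores_h2, ddof=1)   # (4.32)
        P_1_given_2 = 1.0 - norm.cdf((K - d2) / s2)                     # (4.33)
        P_2_given_1 = norm.cdf((K - d1) / s1)                           # (4.34)
        return P_1_given_2, P_2_given_1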

5. An application to seismic discrimination

The methods and procedures of the previous sections can be clarified and
illustrated by applying them to a reasonably complete data set. We have chosen a
collection of 40 presumed earthquakes and 26 presumed nuclear explosions
recorded at the Large Aperture Seismic Array (LASA) in Montana. The traces are
short period beams (averages) for events located between 40 and 60 degrees
latitude and between 70 and 90 degrees longitude. Fig. 1 (see Section 1) shows ten
typical members of each of the two populations, and it should be noted that there
appear to be different features which suggest the possibility that these populations
can be separated using a discriminant function. One might mention the energy in
the earthquake records which tends to appear in the latter part or coda of the
waveform. This is often due to a 'depth phase' (pP) which arrives later for the
deeper events.
One may proceed in accordance with the recommendations of Section 4 by first
investigating the possibility that the mean value functions are different. Fig. 3
shows the mean value functions x~.(t) and x2.(t ) computed from the full
populations of earthquakes (H~) and explosions (H2) respectively. It should be
noted that all of the recordings were standardized so that one-half of the peak to
peak amplitude of the largest cycle on each trace was unity. The sample means
indicate that both populations seem to contain fixed deterministic components
with a large excursion appearing in the explosion population. The mean values
show the impulsive nature of the explosion population, in contrast with the more
energetic behavior of the entire earthquake waveform. The coda of the earth-
quake, say between 5 and 10 seconds after th.e start, seems to have more power
than the explosion, due again to the impulsive nature of the explosive source.
In order to characterize these qualitative signal differences over frequency, one
may perform the analysis of power described in Subsection 4.1. The ANOPOW
components shown in Table 1 of that subsection are plotted as a function of
frequency in Fig. 4. For this population there are N~ = 40 earthquakes and
N2 = 26 explosions with the sampling rate of 10 points per second, yielding a
folding frequency of 5 Hz. The power components tend to peak in the neighbor-
hood of 1 Hz with little or no activity beyond 3 Hz. The between power compo-
nent indicates roughly that the discriminating power ought to lie between 0.5 and

[Fig. 3. Sample means for earthquake (EQ) and explosion (EX) populations (256 points at 10 samples per second).]

[Fig. 4. Analysis of power resulting from testing equality of earthquake and explosion means; the total power and between group power (2 df) components are plotted against frequency (0-5 Hz).]

1.5 Hz, although the within or error component is also large over these frequencies. All components are reasonably stable, so that no smoothing was introduced, and there are 129 frequency points running from zero to the folding frequency.
A clear indication as to which frequencies ought to discriminate between the mean values is provided by plotting the F-statistic (4.7), which is the ratio of the average between group power to the average within group power. For the case q = 2, this may be written in the form

F(λ_k) = (N_1 N_2 / (N_1 + N_2)) |X_{1·}^(k) - X_{2·}^(k)|² / f̂_T(λ_k),    (5.1)

with f̂_T(λ_k) the pooled estimator of the error spectrum defined in (4.27). This exhibits the F-statistic as a sort of signal to noise ratio expressed as a function of frequency, since the values in (5.1) are essentially the components of the estimated distance δ̂_T² in equation (4.30). Fig. 5 shows that the discriminating frequencies tend to be distributed over a somewhat broader band, with values exceeding the 0.01 critical value occurring between 1.0 and 2.5 Hz. This indicates that whatever discriminating power there is in the mean value functions is concentrated in a relatively high frequency band, with the lower frequency component masked by the large error power contribution. It is interesting to look in more detail at the error spectra to determine whether the assumption that the theoretical spectra of the two groups are equal over all bands is reasonable. The group spectra f̂_{1T}(λ) and f̂_{2T}(λ) defined in (4.20) are plotted in Fig. 6.

[Fig. 5. F-statistics for testing equality of earthquake and explosion means as a function of frequency (129 frequencies, 2 and 128 df); the dashed line marks the 0.01 critical value of F_{2,128}.]

[Fig. 6. Error spectra of earthquake and explosion populations, plotted against frequency (0-5 Hz) (129 frequencies, EQ 78 df, EX 50 df).]

First of all, note that the error spectra appear to be roughly equal in the higher frequency bands (1.2 to 3.0 Hz),
where we have just indicated that significant differences in the mean value
functions are present. Using the test for equality of the error spectra given in
(4.21) leads to comparing the spectral ratios with an F-statistic, where we
associate 78 degrees of freedom with the larger earthquake spectrum and 50
degrees of freedom with the smaller explosion spectrum. The ratio exceeds the
critical value (4.8 at the 0.01 level) fairly consistently over the band ranging from
0.0 to 1.2 Hz, implying that the signals have significantly different spectra over the
band.
The tentative conclusions that follow from the preceding tests are that stochas-
tic or spectral differences characterize the low frequency range whereas determin-
istic or mean differences predominate at the higher frequencies. It should be
noted that these conclusions tend to support the near optimality of classical long
period discriminants, such as surface-wave body-wave magnitudes as in [23] or
short period discriminants such as complexity or spectral ratios as in [8] (see also
[11]). For example, a classical and reliable discriminant exists when the short
period P-wave arrival represented in Fig. 1 can be compared with a long period
(low frequency) surface wave observed on a separate system, recording frequen-
cies between 0.0 and 1.0 Hz. The surface wave magnitude, measured basically as
the logarithm of the amplitude divided by the period, is roughly a measure of
power in the low frequency band. The surface wave magnitude M s is combined
with the body wave magnitude M b measured from the maximum cycle on the
short period data traces which we have scaled to unity for all events. Hence the
Discriminant analysis for time series 37

M_s - M_b discriminant is related closely to the low frequency power which shows


up in Fig. 6 as a significant band. Another discriminant used in seismic studies is
the spectral ratio, defined as the energy in the low frequency band (0.4-0.8 Hz)
divided by the energy in the high frequency band (1.4-1.8 Hz). Fig. 6 shows that
this can be expected to be much higher for earthquakes than for explosions,
confirming that the spectral ratio might also be a possible discriminant. A notion
of complexity, defined as the integrated power in the coda of the signal, is easily
seen to be a broad band version of the same idea. The preceding comments
confirm that the classical measures tend to emphasize the properties which should
be good discriminants, particularly in the low frequency band. The test results,
based on Subsections 4.1 and 4.2, tend to indicate that a quadratic discriminant
should be used over the low frequencies, whereas a linear discriminant might be
superior for the higher frequencies. In any case, it would be interesting to
compare the performance of the linear and quadratic discriminant functions with
that of the more classical discriminants given above. The analysis given below is
taken from [71].
Since both the means and spectra are unequal, a version of the linear filter (3.43) was applied, with the weights w_1 and w_2 chosen to yield the largest explosion detection probability for a given false alarm rate. For a false alarm probability set at P̂(2|1) = 0.001, we obtained w_1 = w_2 = 1, with the resulting discriminant function proportional to

Σ_{k=0}^{T-1} (M_1^(k) - M_2^(k))* X^(k) / (f_1(λ_k) + f_2(λ_k)),    (5.2)

implying an explosion detection probability of P̂(2|2) = 0.997.


For purposes of comparison, a linear matched filter (LMF) was defined as (5.2) with f_1(·) = f_2(·) = ½. This leads to a discriminant function of the form

d_L^M(x) = Σ_{k=0}^{T-1} (M_1^(k) - M_2^(k))* X^(k),    (5.3)

which is just a matching of the mean difference function with the data vector. The linear matched filter (LMF) would be optimal for the case where both spectra are constant over all frequencies (white) and the signal difference is a known deterministic function (see also (2.12)).
Several different versions of the quadratic filter will be considered. The most general version, which is appropriate in the case where there are presumed differences in both the means and spectra, is defined in (3.51) and (3.52). This is proportional, for the case q = 2, to

½ Σ_{k=0}^{T-1} |X^(k)|² (f̂_{2T}^{-1}(λ_k) - f̂_{1T}^{-1}(λ_k))
+ Σ_{k=0}^{T-1} [M_1^(k)* f̂_{1T}^{-1}(λ_k) - M_2^(k)* f̂_{2T}^{-1}(λ_k)] X^(k),    (5.4)

and will be referred to as the quadratic detection filter (QDF).

An approximation to the purely quadratic detector of the form (3.34) can be based on the principle that the spectral ratio ought to be a good discriminant. Since the series has been standardized to a unit amplitude, the high frequency power tends to be equalized over the events, so that the integrated power in the low frequency band is essentially the same as the spectral ratio. Hence we will define the spectral ratio (SR) discriminant as essentially (3.34) with the spectra assumed to be constant over the smoothing interval. Then

f_j(λ_{m+k}) = f_j(λ_m)   for k = -½(L-1), ..., ½(L-1),

and we have

d̂_Q(x) = 2 Σ_{k=-(L-1)/2}^{(L-1)/2} |X^(m+k)|² (f_2^{-1}(λ_m) - f_1^{-1}(λ_m)),    (5.5)

where L is chosen so that the frequency band runs from 0.4 to 0.8 Hz. For 10 samples per second and 256 points the primary frequencies are of the form f_n = 10n/256 cycles per second for n = 0, ..., 128, so that taking L = 11 in (5.5) produces the desired bandwidth, where we center (5.5) on the value m = 15, corresponding to 0.6 Hz.
For completeness, the classical complexity discriminant, defined as the mean square coda of the series, say

C = (200)^{-1} Σ_{t=57}^{256} x²(t),    (5.6)

will be calculated. This is closely related to the quadratic detector (2.22) which is
optimal for detecting a Gaussian white signal in white noise.
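For concreteness, the two classical statistics can be sketched as follows (Python with numpy; the index convention assumes the 256-point record is stored with sample t = 1 of the text at Python index 0, and the default spectral values are the smoothed group estimates quoted later in this section).

    import numpy as np

    def spectral_ratio_stat(x, m=15, L=11, f1_m=5.802, f2_m=1.372):
        # SR detector (5.5): band of L frequencies centred on index m (0.6 Hz here).
        T = len(x)
        X = np.fft.fft(np.asarray(x, dtype=float)) / np.sqrt(T)
        half = (L - 1) // 2
        band = np.arange(m - half, m + half + 1)
        return 2.0 * np.sum(np.abs(X[band]) ** 2) * (1.0 / f2_m - 1.0 / f1_m)

    def complexity(x):
        # (5.6): mean square of the coda, samples t = 57, ..., 256 of the record.
        x = np.asarray(x, dtype=float)
        return np.mean(x[56:256] ** 2)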
The unknown spectral and mean value parameters were estimated using the entire population of 40 earthquakes (EQ) and 26 explosions (EX) as a learning sample, with f̂_{jT}(·) determined from (4.29) (L = 3) and M_j^(k) estimated by the sample means X_{j·}^(k), j = 1, 2. When spectral values were zero, they were replaced by a small constant percentage of the observed maximum over the entire frequency band. The estimated spectral values for the spectral ratio (SR) detector (5.5) were calculated using a smoothed version of (4.28) with L = 11, i.e.,

f̂_{jT}(λ_{15}) = (11 N_j)^{-1} Σ_{k=-5}^{5} Σ_{l=1}^{N_j} |X_{jl}^(15+k)|²

for j = 1, 2, yielding f̂_{1T}(λ_{15}) = 5.802, f̂_{2T}(λ_{15}) = 1.372.
For the linear versions (5.2) and (5.3), the threshold values were calculated using the spectral approximations and normal theory described in Subsections 3.1 and 3.2. The matched filter results can be determined by using (5.3) with M_j^(k) replacing X^(k) for the mean of d_L^M(x) under H_j, and with the variance

given by

σ_j² = Σ_{k=0}^{T-1} f_j(λ_k) |M_1^(k) - M_2^(k)|²

for j = 1, 2, corresponding to the earthquake and explosion populations respec-


tively. The linear detection filter (LDF) was applied with a threshold value set at K = -8.51 in order to achieve a theoretical false alarm rate of 0.001, whereas the matched filter threshold set at -0.32 gave a theoretical false alarm rate of 0.05. The explosion detection probabilities were both 0.99 as computed from the theoretical value P(2|2). This is not achieved with the population used here, as can be seen in Fig. 7, which shows the outputs of the filters LDF and LMF plotted against each other. The number of false alarms is higher for both the LDF (two) and LMF (eight) than would be predicted from theory. The matched filter detected 23 out of the 26 explosions as compared to 22 out of 26 for the linear detection filter. Although these values are far below the performance values that could theoretically be attained, the sample predicted performance using (4.31) to (4.34) is also low, as can be seen by comparing the theoretical and sample
[Fig. 7. Output of linear detection filter (LDF) and linear matched filter (LMF), applied to full suite of events; earthquakes plotted as circles and explosions as asterisks, with the 0.001 (LDF) and 0.05 (LMF) threshold lines marked.]

parameters in Table 2. The disparity between the predicted and observed sample performance is due to the increase in the sample variances caused by several extreme observations, clearly visible in Fig. 7. In order to evaluate the performance under other conditions, a subpopulation of 23 earthquakes and 15 explosions was drawn to serve as a hypothetical learning sample. The mean value and spectra from this small learning sample did not differ substantially from the initial values evaluated over the complete population. Furthermore, the estimated filter parameters did not change substantially from those given in Table 2 when they were evaluated over either the learning or test populations. For further details, see [71].
The different classes of quadratic detectors were also applied to the full suite of earthquakes and explosions. The threshold value for the spectral ratio (SR) detector (5.5) was determined for a specified false alarm probability of 0.01, using the chi-squared distribution with 2L = 22 degrees of freedom and the average spectral values mentioned earlier. The predicted detection probability for that false alarm probability was 0.99. The generalized quadratic filter (QDF) and complexity thresholds were estimated using the observed empirical values for the discriminants. Fig. 8 shows the performance of the quadratic filter (QDF) compared with the linear detection filter (LDF), and we note that the performance of the quadratic filter is an improvement, with only one false alarm and 24 out of 26 explosions detected. Note that the variance of the quadratic filter output increases substantially for the earthquake population, whereas the linear filter outputs have approximately equal variances under H_1 and H_2.
The empirical false alarm and signal detection probabilities for all of the methods, based on the proportion correctly and incorrectly classified in the sample, are shown in Table 3. If the empirical false alarm and signal detection probabilities are denoted by P_S(2|1) and P_S(2|2) = 1 - P_S(1|2), we may express the overall error probability, for equal prior probabilities π_1 = π_2 = ½, as

P_e = ½ (P_S(2|1) + P_S(1|2)),

as in (1.2). This enables a comparison between the empirical detection methods to

Table 2
Theoretical and sample parameters, and predicted false alarm and signal detection probabilities for linear detection (LDF) and linear matched (LMF) filters

                     Theoretical           Sample^a
                     LDF       LMF         LDF       LMF
Means       EQ        1.0       0.01        0.9     -0.03
            EX      -15.9      -0.59      -15.9      0.59
Std. dev.   EQ        3.1       0.20        4.6      0.32
            EX        2.7       0.11        8.9      0.28
P(2|1)^b              0.001     0.052       0.02     0.18
P(2|2)^c              0.997     0.992       0.80     0.83

^a Equations (4.31)-(4.34).  ^b Explosion false alarm probability.  ^c Explosion signal detection probability.

[Figure 8 is a scatter plot of linear detection filter (LDF) output against quadratic filter (QDF) output for the full suite of events, with earthquakes plotted as circles and explosions as asterisks; the estimated thresholds are marked.]

Fig. 8. Output of quadratic detection filter (QDF) and linear detection filter (LDF), designed and applied to full suite of events. Quadratic threshold at 0.1 estimated visually.

be made, as in Table 3. We note that the quadratic filter (QDF) performs best
with an overall error rate of 0.07, with the linear detection filter (LDF) running a
close second. Since the distribution theory for the LDF is formulated easily in
terms of the normal distribution, it might be chosen in the absence of any clear
superiority for the quadratic detector. It seems to be clear, however, that some
improvement can be expected by combining the spectral and mean value informa-
tion from a short period recording in a more nearly optimal manner. It should be

Table 3
Sample empirical error and detection probabilities for all methods

                             P_s(2|1)^a   P_s(2|2)^b   P_e^c
Linear detection (LDF)       0.05         0.85         0.10
Linear matched (LMF)         0.20         0.88         0.16
Quadratic detection (QDF)    0.05         0.92         0.07
Spectral ratio (SR)          0.20         0.92         0.14
Complexity (C)               0.10         0.85         0.13

^a Explosion false alarm probability.  ^b Explosion signal detection probability.  ^c Overall error probability.
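The overall error probabilities in the last column of Table 3 follow directly from P_e = ½(P_s(2|1) + 1 − P_s(2|2)); a small check in Python, with the detector names and probabilities copied from the table:

    # Recompute the overall error probability P_e = (P_s(2|1) + 1 - P_s(2|2)) / 2
    # for the detectors in Table 3 (equal prior probabilities pi_1 = pi_2 = 1/2).
    table3 = {
        "LDF": (0.05, 0.85),
        "LMF": (0.20, 0.88),
        "QDF": (0.05, 0.92),
        "SR":  (0.20, 0.92),
        "C":   (0.10, 0.85),
    }
    for name, (false_alarm, detection) in table3.items():
        p_e = 0.5 * (false_alarm + 1.0 - detection)
        print(f"{name}: P_e = {p_e:.3f}")
    # Rounded to two decimals, these values reproduce the last column of Table 3.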

noted that the above analysis is predicated on the assumption that the surface
wave was not detected on the long period recording instrument. If the surface wave magnitude can be measured, the classical M_s − M_b discriminant may be superior to any of those described above.
The use of autoregressive techniques for discrimination using short period data has been investigated by Tjøstheim [76] using a similar population containing 45 earthquakes and 40 explosions recorded at NORSAR (Norwegian Seismic Array).
It was noted that the coda or complexity portions could be modelled as third
order autoregressions. Then, displaying the first two autoregressive coefficients
for earthquakes and explosions led to a separation between the two classes, which
was comparable to that achieved using the classical complexity and spectral ratio
methods described here.
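A minimal sketch of the autoregressive feature idea just described: fit a third-order autoregression to a coda segment by the Yule-Walker equations and keep the first two coefficients as a two-dimensional discriminating feature. The data array below is hypothetical and the fitting method is one standard choice, not necessarily the one used in [76].

    # Sketch: third-order autoregressive fit (Yule-Walker equations) to a seismic
    # coda segment; the first two AR coefficients serve as discriminating features.
    import numpy as np

    def ar3_features(coda):
        x = np.asarray(coda, dtype=float)
        x = x - x.mean()
        n = len(x)
        # Biased sample autocovariances r(0), ..., r(3).
        r = np.array([np.dot(x[:n - k], x[k:]) / n for k in range(4)])
        # Solve the Yule-Walker equations R a = (r(1), r(2), r(3))' for the AR(3) coefficients.
        R = np.array([[r[abs(i - j)] for j in range(3)] for i in range(3)])
        a = np.linalg.solve(R, r[1:4])
        return a[0], a[1]          # first two AR coefficients

    rng = np.random.default_rng(0)
    phi1, phi2 = ar3_features(rng.standard_normal(500))   # toy data, for illustration only
    print(phi1, phi2)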

6. Discussion

The historical approaches to the problem of discriminating among different


classes of time series can be divided into two distinct categories.
(I) The 'optimality' approach as found throughout the engineering literature
has traditionally assumed very specific Gaussian additive signal and noise models,
and then developed solutions to satisfy well-defined minimum error criteria. In
general, this requires that one assumes prior knowledge of the signal waveforms
and spectra under each of the hypotheses, so that discriminant functions like
those in Section 3 can be calculated for an observed time series. Less attention
was paid to incorporating realistic estimators for the parameters into the scheme
when they were not known in advance; for example, the signal and noise were
frequently assumed to be white when this was clearly not the case. Hence one
might achieve by this approach an optimum solution to what is really a very
rough approximation to the actual problem.
(II) A second approach, which one might term the 'feature extraction' method,
proceeds more heuristically by looking at quantities which tend to be good visual
discriminators for well-separated populations and have some basis in physical
theory or intuition. Little attention is paid to the theoretical distributional
properties of the discriminants or to determining whether they might be ap-
proximations to the exact solution of some optimality problem. As examples one
might mention the classical seismic discriminants cited in the previous section or
the traditional use of spectra and coherence values in EEG analysis (see [35]).
Gersch and colleagues [33, 34] have given an excellent critique of these methods,
but their conclusion [34] that "spectral analysis doesn't work here" should be
checked by applying the quadratic discriminant function (3.60) in conjunction
with a test (4.21) of the hypothesis that the group spectra are equal.
The approach suggested here involves the use of the frequency domain ap-
proximations to the optimum discriminant function with sample estimators
substituted for the unknown parameters. The form for the discriminant functions
is established using preliminary hypothesis testing procedures on learning samples

known to be members of the respective populations. For the multivariate Gaussian or normal case this may involve the discriminants discussed in this paper, which arise from the likelihood approach or which may alternatively be based on information theoretic principles as in [32, 33, 34, 65, 72]. As an alternative to conventional spectral estimation, one may replace the smoothed sample spectra with autoregressive estimators. The likelihood approach should not be applied blindly but should be used in conjunction with knowledge gained by examining single features which tend to be good discriminators. The test statistics appearing in this chapter have an inherent flexibility as to choice of model and bandwidth parameters, so that the optimal discriminant functions can be tuned to enhance selected discriminating features.
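As an illustration of this frequency-domain plug-in strategy, the sketch below evaluates a Whittle-type approximation to the Gaussian log-likelihood ratio for a zero-mean series under two estimated group spectra. This is in the spirit of the discriminants discussed in this chapter but is not the exact form (3.60); the group spectra f1 and f2 are assumed to be given at the positive Fourier frequencies of the series.

    # Sketch (under stated assumptions): frequency-domain approximation to the
    # Gaussian log-likelihood ratio for classifying a zero-mean series x between
    # two groups with estimated spectra f1 and f2; positive values favour group 1.
    import numpy as np

    def spectral_loglik_ratio(x, f1, f2):
        x = np.asarray(x, dtype=float)
        n = len(x)
        dft = np.fft.rfft(x)
        periodogram = (np.abs(dft) ** 2) / n
        I = periodogram[1:1 + len(f1)]        # drop the zero frequency
        return np.sum(np.log(f2 / f1) + I * (1.0 / f2 - 1.0 / f1))

    # Toy usage with flat hypothetical spectra.
    rng = np.random.default_rng(1)
    x = rng.standard_normal(128)
    f1 = np.ones(64)
    f2 = np.full(64, 2.0)
    print(spectral_loglik_ratio(x, f1, f2))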

Acknowledgment

I would like to thank Professor David Brillinger of the University of California, Berkeley, for a number of helpful suggestions.
The seismic discrimination study was supported by the Defense Advanced Research Projects Agency, Nuclear Test Monitoring Office, under contract no. F0806-74-C0006 at the Seismic Data Analysis Center, Teledyne Geotech, Alexandria, Virginia. Dr. Robert Blandford of Teledyne Geotech contributed substantially to the general approach and methodology.

References

[1] Akaike, H. (1974). A new look at the statistical model identification. I E E E Trans. Automat.
Control 19, 716-723.
[2] Akaike, H. (1977). On entropy maximization principle. In: P. R. Krishnaiah, ed., Proc. Symp.
Applications Statistics, 27-47. North-Holland, Amsterdam.
[3] Anderson, T. W. (1958). An Introduction to Multivariate Statistical Analysis. Wiley, New York.
[4] Anderson, T., W. and Bahadur, R. R. (1962). Classification into two multivariate normal
populations with different covariance matrices. Ann. Math. Statist. 33, 420-431.
[5] Anderson, T. W. (1971). The Statistical Analysis of Time Series. Wiley, New York.
[6] Anderson, T. W. (1977). Estimation for autoregressive moving average models in the time and
frequency domain. Ann. Statist. 5, 842-865.
[7] Anderson, T. W. (1978). Repeated measurements on autoregressive processes. J. Amer. Statist.
Assoc. 73, 371-378.
[8] Anglin, F. M. (1971). Discrimination of earthquakes and explosions using short period seismic
array data. Nature 233, 51-52.
[9] Azari, R. (1975). Information theoretic properties of some spectral approximations in stationary time series. Dissertation. George Washington University, Washington.
[10] Bloomfield, P. (1976). Fourier Analysis of Time Series: An Introduction. Holt, Rinehart and
Winston, New York.
[11] Booker, A. and Mitronovas, W. (1964). An application of statistical discrimination to classify
seismic events. Bull. Seismological Soc. America 54, 961-977.
[12] Borpujari, A. S. (1977). An empirical Bayes approach for estimating the mean of N stationary
time series. J. Amer. Statist. Assoc. 72, 397-402.
[13] Box, G. E. P. (1954). Some theorems on quadratic forms applied in the study of analysis of
variance problems. Part I: Effect of inequality of variance in the one-way classification. Ann.
Math. Statist. 25, 290-302.

[14] Box, G. E. P. and Jenkins, G. M. (1970). Time Series Analysis, Forecasting and Control.
Holden-Day, San Francisco.
[15] Bricker, P. D., Gnanadesikan, R., Mathews, M. V., Pruzansky, S., Tukey, P. A., Wachter, K. W.,
and Warner, J. L. (1971). Statistical techniques for talker identification. Bell Syst. Tech. J. 50,
1427-1454.
[16] Brillinger, D. R. (1973). The analysis of time series collected in an experimental design. In: P. R.
Krishnaiah, ed., Multivariate Analysis-III. Academic Press, New York.
[17] Brillinger, D. R. (1974). Fourier analysis of stationary processes. Proc. IEEE 62, 1623-1643.
[18] Brillinger, D. R. (1975). Time Series: Data Analysis and Theory. Holt, Rinehart and Winston,
New York.
[19] Brillinger, D. R. (1978). Comparative aspects of the study of ordinary time series and of point
processes. In: P. R. Krishnaiah, ed., Developments in Statistics, Vol. 1, 33-133. Academic Press,
New York.
[20] Brillinger, D. R. (1979). Analysis of variance and problems under time series models. Handbook
of Statistics, Vol. 1, 237-278. North-Holland, Amsterdam.
[21] Capon, J. (1965). Hilbert space methods for detection theory and pattern recognition. IEEE
Trans. Inform. 11, 247-59.
[22] Capon, J. (1965). An asymptotic simultaneous diagonalization procedure for pattern recogni-
tion. J. Informat. Control 8, 264-281.
[23] Capon, J., Greenfield, R. J., and Lacoss, R. T. (1969). Long-period signal processing results for
the large aperture seismic array. Geophysics 34, 305-329.
[24] Davenport, W. B. and Root, W. L. (1958). An Introduction to the Theory of Random Signals and
Noise. McGraw-Hill, New York.
[25] Davies, R. B. (1973). Asymptotic inference in stationary Gaussian time series. Adv. in Appl.
Probab. 5, 469-497.
[26] Davies, R. B. (1973). Numerical integration of a characteristic function. Biometrika 60, 415-417.
[27] Dunsmuir, W. (1979). A central limit theorem for parameter estimation in stationary vector time
series and its application to models for a signal observed with noise. Ann. Statist. 7, 490-506.
[28] Dunsmuir, W. and Hannan, E. J. (1976). Vector linear time series models. J. Appl. Probab. 10,
130-145.
[29] Freiberger, W. F. (1963). An approximate method in signal detection. Quart. Appl. Math. 20,
373-378.
[30] Freiberger, W. F. and Grenander, U. (1959). Approximate distributions of noise power
measurements. Quart. Appl. Math. 17, 271-1284.
[31] Fuller, W. A. (1976). Introduction to Statistical Time Series. Wiley, New York.
[32] Gersch, W. (1977). Discrimination between stationary Gaussian time series, large sample results,
Tech. Rept. No. 30. Dept. of Statistics, Stanford University, Palo Alto.
[33] Gersch, W. and Yonemoto, J., (1977). Automatic classification of multivariate EEG, using an
amount of information measure and the eigenvalues of parametric time series model features.
Comput. Biomed. Res. 10, 297-316.
[34] Gersch, W., Martinelli, F., Yonemoto, J., Lew, M. D., and McEwan, J. A. (1979). Automatic
classification of electroencephalograms: Kullback-Leibler nearest neighbor rules. Science 205,
193-195.
[35] Gevins, A. S., Yeager, C. L., Diamond, S. L., Spire, J., Zeitlin, G., and Gevins, A. (1975).
Automated analysis of the electrical activity of the human brain (EEG): A progress report. Proc.
IEEE 63, 1382-1399.
[36] Giri, N. (1965). On the complex analogues of T² and R² tests. Ann. Math. Statist. 36, 664-670.
[37] Giri, N. C. (1977). Multivariate Statistical Inference. Academic Press, New York.
[38] Goodman, N. R. (1963). Statistical analysis based on a certain multivariate complex Gaussian
distribution. Ann. Math. Statist. 34, 152-177.
[39] Grenander, U. (1950). Stochastic processes and statistical inference. Ark. Mat. 1 (17) 195-277.
[40] Grenander, U. and Szegő, G. (1958). Toeplitz Forms and their Applications. University of
California Press, Berkeley.
[41] Grenander, U. (1965). On the estimation of regression coefficients in the case of an autocorre-
lated disturbance. Ann. Math. Statist. 25, 252-272.

[42] Grenander, U. (1974). Large sample discrimination between two Gaussian processes with
different spectra. Ann. Statist. 2, 347-352.
[43] Hannan, E. J. (1970). Multiple Time Series. Wiley, New York.
[44] Helstrom, C. W. (1968). Statistical Theory of Signal Detection. Pergamon Press, Oxford.
[45] Huang, T. S., Schreiber, W. F., and Tretiak, O. J. (1971). Image processing. Proc. IEEE 59,
1586-1609.
[46] Jenkins, G. M. and Watts, D. J. (1968). Spectral Analysis and its Applications. Holden Day, San
Francisco.
[47] Kadota, T. T. (1965). Optimum reception of binary sure and Gaussian signals. Bell System Tech. J. 44, 1621-1658.
[48] Kadota, T. T. and Shepp, L. A. (1967). On the best finite set of linear observables for
discrimination between two Gaussian signals. IEEE Trans. Inform. Theory 13, 278-284.
[49] Khatri, C. G. (1965). Classical statistical analysis based on a certain multivariate complex
Gaussian distribution. Ann. Math. Statist. 36, 115-119.
[50] Krishnaiah, P. R. (1976). Some recent developments on complex multivariate distributions. J.
Multivariate Anal. 6, 1-30.
[51] Krishnaiah, P. R., Lee, J. C., and Chang, T. C. (1976). The distribution of likelihood ratio
statistics for tests of certain covariance structures of complex multivariate normal populations.
Biometrika 63, 543-549.
[52] Krishnaiah, P. R. and Lee, J. C. (1979). Likelihood ratio tests for mean vectors and covariance
matrices. In: P. R. Krishnaiah, ed., Handbook of Statistics, Vol. 1, 513-570. North-Holland,
Amsterdam.
[53] Kullback, S. (1959). Information Theory and Statistics. Smith, Gloucester, MA.
[54] Lagakos, S. W. (1973). Bounds on the diagonalizability of the finite Fourier transforms of
stationary time series. Tech. Rept. No. 2, Dept. of Computer Sciences, State University of New
York at Buffalo, Amherst.
[55] Larimore, W. E. (1977). Statistical inference on random fields. Proc. IEEE 65, Special Issue on
Multidimensional Systems, 961-970.
[56] Liggett, W. S. (1971). On the asymptotic optimality of spectral analysis for testing hypotheses
about time series. Ann. Math. Statist. 42, 1348-1358.
[57] Markel, J. D. and Gray, A. H., Jr. (1976). Linear Prediction of Speech. Springer, Berlin.
[58] Meisel, W. S. (1972). Computer Oriented Approaches to Pattern Recognition. Academic Press,
New York.
[59] Morrison, D. E. (1976). Multivariate Statistical Methods. McGraw-Hill, New York.
[60] Otnes, R. K. and Enochson, L. (1978). Applied Time Series Analysis. Wiley, New York.
[61] Pagano, M. (1970). Some asymptotic properties of a two-dimensional periodogram, Tech. Rept.
No. 146, Dept. of Statistics, The Johns Hopkins University, Baltimore.
[62] Pagano, M. (1974). Estimation of models of autoregressive signal plus white noise. Ann. Statist.
2, 99-108.
[63] Parzen, E. (1962). Extraction and detection problems and reproducing kernel Hilbert spaces. In:
E. Parzen, ed., Time Series Papers (1967), 492-519. Holden-Day, San Francisco.
[64] Parzen, E. (1959). Statistical inference on time series by Hilbert space methods I. In: E. Parzen,
ed., Time Series Papers (1967), 251-382. Holden-Day, San Francisco.
[65] Pinsker, M. S. (1964). Information and Information Stability of Random Variables and Processes.
Holden-Day, San Francisco.
[66] Root, W. L. (1962). Singular measures in detection theory. In: M. Rosenblatt, ed., Time Series
Analysis, Symposium, 292-315. Wiley, New York.
[67] Rosenfeld, A. and Weszka, J. S. (1976). Picture recognition. In: K. S. Fu, ed., Digital Pattern
Recognition, 135-166. Springer, Berlin.
[68] Rubin, G. E. (1977). On the quadratic classification problem for zero mean stationary time
series. Dissertation. George Washington Univ., Washington.
[69] Selin, I. (1965). Detection Theory. Princeton Univ. Press, Princeton.
[70] Shaman, P. (1975). An approximate inverse for the covariance matrix of moving average and
autoregressive processes. Ann. Statist. 3, 532-538.

[71] Shumway, R. H. and Blandford, R. (1974). An examination of some new and classical short
period discriminants. Tech. Rept. No. TR-74-10, Seismic Data Analysis Center, Alexandria,
U.S.A.
[72] Shumway, R. H. and Unger, A. N. (1974). Linear discriminant functions for stationary time
series. J. Amer. Statist. Assoc. 69, 948-956.
[73] Shumway, R. H. (1971). On detecting a signal in N stationarily correlated noise series.
Technometrics 13, 499-519.
[74] Shumway, R. H. (1970). Applied regression and analysis of variance for stationary time series. J.
Amer. Statist. Assoc. 65, 1527-1546.
[75] Srivastava, M. S. and Khatri, C. G. (1979). An Introduction to Multivariate Statistics. North-Hol-
land, New York.
[76] Tjøstheim, D. (1975). Autoregressive representation of seismic P-wave signals with an application to the problem of short period discriminants. Geophys. J. Roy. Astron. Soc. 43, 269-291.
[77] Van Trees, H. L. (1968). Detection Estimation and Modulation Theory, Parts I, II. Wiley, New
York.
[78] Wahba, G. (1968). On the distribution of some statistics useful in the analysis of jointly
stationary time series. Ann. Math. Statist. 38, 1849-1862.
[79] Welch, P. D. and Wimpress, R. S. (1961). Two multivariate statistical computer programs and
their application to the vowel recognition problem. J. Acoust. Soc. Amer. 33, 426-434.
[80] Whalen, A. D. (1971). Detection of Signals in Noise. Academic Press, New York.
[81] Whittle, P. (1951). Hypothesis Testing in Time Series Analysis. Almqvist and Wiksell, Uppsala.
[82] Whittle, P. (1953). The analysis of multiple stationary time series. J. Roy. Statist. Soc. Ser. B 15,
125-139.
[83] Whittle, P. (1963). Stochastic processes in several dimensions. Bull. Inst. Internat. Statist. 40,
974-994.
[84] Wolf, J. J. (1976). Speech recognition and understanding. In: K. S. Fu, ed., Digital Pattern
Recognition, 167-203. Springer, Berlin.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 47-60

Optimum Rules for Classification into
Two Multivariate Normal Populations
with the Same Covariance Matrix*

Somesh Das Gupta

1. Introduction

Let ω denote an experimental unit drawn randomly from a population π. The classification problem in its standard form is to devise rules so as to identify π with one of the two given 'distinct' populations π_1 and π_2. A set of p real-valued measurements X: p × 1 is observed on ω and it is believed that the distributions of X in those two populations are different. In this paper we shall assume that X ~ N_p(μ, Σ).
Let μ_i denote the mean of X in the population π_i (i = 1,2), where μ_1 ≠ μ_2. The classification problem is to find 'good' rules for deciding whether μ = μ_1 or μ = μ_2. When all the parameters μ_1, μ_2 and Σ are known, Wald's decision theory [17] may be used to derive the minimal complete class of decision rules for the zero-one loss function. It is given by the following, except for sets of measure zero [2]:
The rule φ_k decides μ = μ_1 iff

(X − μ_1)'Σ^{-1}(X − μ_1) − (X − μ_2)'Σ^{-1}(X − μ_2) ≤ k.   (1.1)

It can be proved [2] that the rule φ_0 is the only admissible minimax rule.
However, in practice all the parameters are not known, and in order to differentiate the two populations random (training) samples from both the populations are obtained. It may be remarked that if either of μ_1 and μ_2 is known it is not necessary to draw samples from both the populations.
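A minimal numerical sketch of the rule φ_k in (1.1) with all parameters known; the particular means, covariance matrix, and cut-off used below are illustrative values only, not taken from the text.

    # Sketch of the rule phi_k in (1.1): decide mu = mu_1 iff
    #   (x - mu_1)' Sigma^{-1} (x - mu_1) - (x - mu_2)' Sigma^{-1} (x - mu_2) <= k.
    import numpy as np

    def phi_k(x, mu1, mu2, sigma, k=0.0):
        si = np.linalg.inv(sigma)
        d1, d2 = x - mu1, x - mu2
        stat = d1 @ si @ d1 - d2 @ si @ d2
        return 1 if stat <= k else 2        # population decided for

    # Illustrative values (not from the paper).
    mu1 = np.array([0.0, 0.0])
    mu2 = np.array([2.0, 1.0])
    sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
    print(phi_k(np.array([0.4, 0.2]), mu1, mu2, sigma))   # -> 1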
Let θ stand for (μ, μ_1, μ_2, Σ), and

Θ_1 = {θ: μ = μ_1, (μ_1, μ_2, Σ) ∈ Ω},   (1.2)

Θ_2 = {θ: μ = μ_2, (μ_1, μ_2, Σ) ∈ Ω},   (1.3)

where Ω is a known set in the space of μ_1, μ_2 and Σ.

*Supported by a grant from the Mathematics Division, U.S. Army Research Office, Durham, NC,
Grant No. DAAG-29-0038.


It may be noted that in order to control (arbitrarily) both probabilities of incorrect classification certain conditions must be imposed on Ω and sequential sampling schemes may have to be used [5]. However, in standard practice Ω is taken to be the set

Ω = {(μ_1, μ_2, Σ): μ_1, μ_2 ∈ R^p, μ_1 ≠ μ_2, Σ positive-definite}.   (1.4)

Following [7] a set of heuristic rules (called plug-in rules) may be devised by first choosing some good estimates of the unknown parameters and replacing the unknown parameters in φ_k by their respective estimates. We shall call such a rule φ_p^k when the standard estimates are used.
Let X_{i1},..., X_{in_i} denote the X-observations of the training sample from π_i (i = 1,2). Define (assume n_1 + n_2 − 2 > 0)

X̄_i = Σ_{j=1}^{n_i} X_{ij}/n_i   (i = 1,2),
S = [Σ_{j=1}^{n_1} (X_{1j} − X̄_1)(X_{1j} − X̄_1)' + Σ_{j=1}^{n_2} (X_{2j} − X̄_2)(X_{2j} − X̄_2)'] / (n_1 + n_2 − 2).   (1.5)

When all the parameters are unknown, Fisher's plug-in rules are given by the following:
The rule φ_p^k decides μ = μ_1 iff

(X − X̄_1)'S^{-1}(X − X̄_1) − (X − X̄_2)'S^{-1}(X − X̄_2) ≤ k.   (1.6)

Using the likelihood-ratio principle Anderson [1] proposed the following rules when (μ_1, μ_2, Σ) lies in Ω given by (1.4):
The rule ψ_λ decides μ = μ_1 iff

(1 + 1/n_1)^{-1}(X − X̄_1)'S*^{-1}(X − X̄_1) − λ(1 + 1/n_2)^{-1}(X − X̄_2)'S*^{-1}(X − X̄_2) ≤ λ − 1,   (1.7)

where S* = mS, m = n_1 + n_2 − 2 (λ > 0). Note that ψ_1 = φ_p^0 when n_1 = n_2. The likelihood-ratio rules turn out to be the following when Σ is known:
The rule φ_Σ^k decides μ = μ_1 iff

(1 + 1/n_1)^{-1}(X − X̄_1)'Σ^{-1}(X − X̄_1) − (1 + 1/n_2)^{-1}(X − X̄_2)'Σ^{-1}(X − X̄_2) ≤ k.   (1.8)
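As an illustrative sketch (not from the paper), the rule ψ_λ in (1.7) can be evaluated from training samples as follows; the sample arrays and dimensions below are hypothetical.

    # Sketch of the likelihood-ratio rule psi_lambda in (1.7), with S* = mS,
    # m = n1 + n2 - 2, and S the pooled matrix in (1.5).
    import numpy as np

    def psi_lambda(x, sample1, sample2, lam=1.0):
        x1, x2 = sample1.mean(axis=0), sample2.mean(axis=0)
        n1, n2 = len(sample1), len(sample2)
        m = n1 + n2 - 2
        pooled = ((sample1 - x1).T @ (sample1 - x1) +
                  (sample2 - x2).T @ (sample2 - x2)) / m      # S in (1.5)
        s_star_inv = np.linalg.inv(m * pooled)                # (S*)^{-1}
        q1 = (x - x1) @ s_star_inv @ (x - x1) / (1 + 1 / n1)
        q2 = (x - x2) @ s_star_inv @ (x - x2) / (1 + 1 / n2)
        return 1 if q1 - lam * q2 <= lam - 1 else 2

    rng = np.random.default_rng(1)
    s1 = rng.normal(0.0, 1.0, size=(20, 3))       # hypothetical training samples
    s2 = rng.normal(1.5, 1.0, size=(25, 3))
    print(psi_lambda(np.zeros(3), s1, s2, lam=1.0))   # typically -> 1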
One may also derive some 'good' constructive rules from various optimality
criteria. In this paper we shall obtain some good rules from Wald's decision-theo-
retic viewpoint and also from an asymmetrical Neyman-Pearson approach. We shall
also study the above two classes of heuristic rules with respect to some optimality
criteria.

2. The univariate case

2.1. p = 1, σ² is known
Without any loss of generality we shall assume that σ² = 1. Let φ = (φ_1, φ_2) stand for a decision rule, where φ_i is the probability of deciding μ = μ_i given the observations. We shall consider only the rules based on the sufficient statistics X, X̄_1 and X̄_2.
First we shall make an orthogonal transformation as follows: Define

U_1 = k_1[(1 + 1/n_1)^{-1/2}(X − X̄_1) + (1 + 1/n_2)^{-1/2}(X − X̄_2)],   (2.1)

U_2 = k_2[(1 + 1/n_1)^{-1/2}(X − X̄_1) − (1 + 1/n_2)^{-1/2}(X − X̄_2)],   (2.2)

U_3 = k_3[X + n_1 X̄_1 + n_2 X̄_2]   (2.3)

where the k_i's are chosen so that var(U_i) = 1, i = 1,2,3. Note that the U_i's are independently distributed. Let E(U_i) = ν_i. Then U_i ~ N(ν_i, 1).
In terms of (ν_1, ν_2, ν_3) the sets Θ_1 and Θ_2, as defined in (1.2)-(1.4), are transformed as follows:

Ω'_1 = {(ν_1, ν_2, ν_3): ν_1 = ν, ν_2 = −cν, ν ≠ 0, ν_3 ∈ R},   (2.4)

Ω'_2 = {(ν_1, ν_2, ν_3): ν_1 = ν, ν_2 = cν, ν ≠ 0, ν_3 ∈ R}   (2.5)

where c = k_2/k_1 > 0 (the k_i's are chosen to be positive). Note that c > 1.

2.1.1. Bayes rules and minimax rules


It is easy to see that by taking a suitable prior distribution of ν_3 independently of ν_1 and ν_2 we can get Bayes rules free from U_3. Hence we shall only consider prior distributions of (ν_1, ν_2) and drop U_3 from the argument of φ. Let

Ω*_i = {(ν_1, ν_2): ν_1 = ν, ν_2 = (−1)^i cν, ν ≠ 0}.   (2.6)

Consider a prior distribution ξ(β, γ, ν_0) which assigns probabilities βγ, (1 − β)(1 − γ), β(1 − γ), γ(1 − β) to the parameter points (ν_0, cν_0), (−ν_0, −cν_0), (ν_0, −cν_0), (−ν_0, cν_0), respectively, where 0 ≤ β ≤ 1, 0 ≤ γ ≤ 1, ν_0 > 0.
It can be seen that the unique (a.e.) Bayes rule (for the zero-one loss function) against the above prior distribution is given by the following:

Decide (ν_1, ν_2, ν_3) ∈ Ω'_1 iff

(U_1 − c_1)(U_2 − c_2) ≤ 0   (2.7)

where c_1 and c_2 are functions of ν_0, β, γ and c. Conversely, given c_1 and c_2 it is possible to choose β, γ and ν_0 appropriately. Another class of Bayes rules may be obtained from the following prior distributions: The probability that (ν_1, ν_2) ∈ Ω*_i is ξ_i, and given that ν_1 = ν and ν_2 = (−1)^i cν the distribution of ν is N(0, τ²). The unique (a.e.) Bayes rule against the above prior distribution decides (ν_1, ν_2, ν_3) ∈ Ω'_1 iff

U_1 U_2 ≤ k   (2.8)

where k is a function of ξ_1, ξ_2 and c. Different types of Bayes rules are given by Das Gupta and Bhattacharya [3].
Now consider the rule which decides (ν_1, ν_2, ν_3) ∈ Ω'_1 iff

U_1 U_2 ≤ 0.   (2.9)

Note that (2.9) is equivalent to

(1 + 1/n_1)^{-1}(X − X̄_1)² ≤ (1 + 1/n_2)^{-1}(X − X̄_2)².   (2.10)

Thus the above rule is the same as φ_Σ^0, defined in (1.8). The rule φ_Σ^0 is the unique Bayes rule against the prior ξ(½, ½, ν_0) for any ν_0 > 0. Moreover, the risk of the rule φ_Σ^0 is constant over the four-point set (ν_0, cν_0), (−ν_0, −cν_0), (−ν_0, cν_0), (ν_0, −cν_0). Hence φ_Σ^0 is an admissible minimax rule, and moreover the supremum of the risk of φ_Σ^0 is equal to ½.
However, φ_Σ^0 is not the unique minimax rule (leaving aside the trivial rule φ_1 = φ_2 = ½). To see this, transform (U_1, U_2) to (V_1, V_2) by an orthogonal transformation L such that (EV_1, EV_2) is proportional to (1, −d_1) and (1, d_2) for (ν_1, ν_2) ∈ Ω*_1 and (ν_1, ν_2) ∈ Ω*_2, respectively, where d_1 > 0, d_2 > 0. Let ψ be the rule which decides (ν_1, ν_2) ∈ Ω*_1 iff V_1 V_2 ≤ 0. It can be easily seen (or, see [6]) that the supremum of the risk of ψ is ½. Note that there are many such orthogonal transformations L which will satisfy the desired property for (EV_1, EV_2). It may be shown that neither of the rules φ_Σ^0 and ψ dominates the other. However, the characterization of the class of all admissible minimax rules is not known.
Now, instead of the zero-one loss function consider a loss function which takes the value 0 for correct decisions and equals l(|μ_1 − μ_2|) for any incorrect decision, where l is a positive-valued, bounded, continuous function such that l(Δ) → 0 as Δ ↓ 0. Das Gupta and Bhattacharya [3] have shown that φ_Σ^0 is the unique minimax rule (and Bayes admissible) for the above loss function when n_1 = n_2.
It is clear that neither of φ_p^0 and φ_Σ^0 dominates the other. It is believed that φ_p^0 is also admissible.

2.1.2. Invariant rules
Let us now consider the following conditions on the rules based on U_1, U_2, U_3.
Translation invariance:

φ(u_1, u_2, u_3) = φ(u_1, u_2, u_3 + b)   (2.11)

for all u_1, u_2, u_3 and b ∈ R.
A set of maximal invariants for (2.11) is given by (U_1, U_2). Hence we shall write a translation-invariant rule as a function of U_1 and U_2.
Sign invariance:

φ(u_1, u_2, u_3) = φ(−u_1, −u_2, −u_3)   (2.12)

for all u_1, u_2, u_3.
A translation-invariant rule is sign-invariant iff it is a function of (U_1U_2/|U_2|, |U_2|), see [10].
Symmetry:

φ_1(u_1, −u_2, u_3) = φ_2(u_1, u_2, u_3)   (2.13)

for all u_1, u_2 and u_3.
It is clear that both Ω'_1 and Ω'_2 are unchanged under the transformations (u_1, u_2, u_3) → (u_1, u_2, u_3 + c) and (u_1, u_2, u_3) → (−u_1, −u_2, −u_3). In terms of x, x̄_1 and x̄_2 these transformations are respectively (x, x̄_1, x̄_2) → (x + b, x̄_1 + b, x̄_2 + b) and (x, x̄_1, x̄_2) → (−x, −x̄_1, −x̄_2). The sets Ω'_1 and Ω'_2 are interchanged under the transformation (u_1, u_2, u_3) → (u_1, −u_2, u_3). This transformation is obtained by interchanging (x̄_1, n_1) and (x̄_2, n_2). We shall now show that φ_Σ^0 is the uniformly best translation-invariant, sign-invariant symmetric rule. For (ν_1, ν_2, ν_3) ∈ Ω'_1 (i.e., ν_1 = ν, ν_2 = −cν) the risk of a translation-invariant, sign-invariant symmetric rule φ is given by

∫∫ [φ_2(u_1, u_2)n(u_1; ν)n(u_2; −cν) + φ_2(u_1, u_2)n(u_1; −ν)n(u_2; cν)
   + (1 − φ_2(u_1, u_2))n(u_1; −ν)n(u_2; −cν)
   + (1 − φ_2(u_1, u_2))n(u_1; ν)n(u_2; cν)] du_1 du_2,   (2.14)

where n(u; ν) is the density of N(ν, 1) at u. It may be seen that (2.14) is minimum (uniformly in ν and ν_3) for φ_2(u_1, u_2) = 1 when u_1u_2 > 0. The above result can also be proved using the distribution of (U_1U_2/|U_2|, |U_2|) [10].
Kinderman [10] characterized the (essential) complete class among all translation-invariant, sign-invariant rules when n_1 = n_2.

2.1.3. Best invariant similar test
The classification problem may be viewed in the light of Neyman-Pearson theory. We may pose the problem as testing the hypothesis H_1: θ ∈ Θ_1 against the alternative H_2: θ ∈ Θ_2. We restrict our attention to the class of tests which are translation-invariant and sign-invariant. Let ψ be a test function, i.e. ψ(X, X̄_1, X̄_2) is the probability of rejecting H_1 given X, X̄_1 and X̄_2. Define

Y_1 = (1 + 1/n_1)^{-1/2}(X − X̄_1),   (2.15)

Y_2 = [(1 + 1/n_2)^{-1/2}(X − X̄_2) − (1 + 1/n_2)^{-1/2}(1 + 1/n_1)^{-1}(X − X̄_1)] d,   (2.16)

Y_3 = (1 + n_1 + n_2)^{-1/2}(X + n_1X̄_1 + n_2X̄_2)   (2.17)

where d is a constant chosen appropriately to make var(Y_2) = 1. If ψ is translation-invariant it will depend only on Y_1 and Y_2. Furthermore, the sign-invariance of ψ means

ψ(Y_1, Y_2) = ψ(−Y_1, −Y_2).   (2.18)

Under H_1 the means of Y_1 and Y_2 are given by

δ_1 = EY_1 = 0,   δ_2 = EY_2 = d(1 + 1/n_2)^{-1/2}(μ_1 − μ_2).   (2.19)

Similarly, the means of Y_1 and Y_2 under H_2 are given by

δ_1 = (1 + 1/n_1)^{-1/2}(μ_2 − μ_1),
δ_2 = −d(1 + 1/n_2)^{-1/2}(1 + 1/n_1)^{-1}(μ_2 − μ_1).   (2.20)

In terms of δ_1 and δ_2 the parameter sets may be expressed as

Δ_1 = {(δ_1, δ_2): δ_1 = 0, δ_2 ≠ 0},   (2.21)

Δ_2 = {(δ_1, δ_2): δ_2 = aδ_1 ≠ 0}   (2.22)

under H_1 and H_2, respectively; a = −d(1 + 1/n_1)^{-1/2}(1 + 1/n_2)^{-1/2}. Since δ_2 is still unknown under H_1 we require ψ to be similar of size α for H_1, i.e.

E_{0,δ_2}ψ(Y_1, Y_2) = α   for all δ_2 ≠ 0.   (2.23)

This is equivalent to

∫ ψ(y_1, y_2)n(y_1; 0) dy_1 = α   a.e. (y_2).   (2.24)

The power of the test ψ is given by

E_{δ_1,δ_2}ψ(Y_1, Y_2) = ½[E_{δ_1,δ_2}ψ(Y_1, Y_2) + E_{−δ_1,−δ_2}ψ(Y_1, Y_2)]
   = ½ e^{−δ_1²(1+a²)/2} ∫∫ ψ(y_1, y_2)[e^{δ_1(y_1+ay_2)} + e^{−δ_1(y_1+ay_2)}] n(y_1; 0)n(y_2; 0) dy_1 dy_2.   (2.25)

Using the Neyman-Pearson lemma in order to maximize

∫ ψ(y_1, y_2)[e^{δ_1(y_1+ay_2)} + e^{−δ_1(y_1+ay_2)}] n(y_1; 0) dy_1   (2.26)

subject to (2.24) we get the following optimum test:

ψ*(y_1, y_2) = 1   iff   |y_1 + ay_2| > k(y_2)   (2.27)

where k(y_2) is chosen so that

∫_{−ay_2 − k(y_2)}^{−ay_2 + k(y_2)} n(y_1; 0) dy_1 = 1 − α.   (2.28)

Thus ψ* is the uniformly most powerful invariant similar test. The above result is due to Schaafsma [12].

2.2. The common variance σ² is unknown
It may be easily seen that the rules given by (2.7) and (2.8) are still unique Bayes. Moreover, the rule ψ_1 is the one which accepts θ ∈ Θ_1 if (2.10) holds, and it is admissible minimax. When n_1 = n_2 Das Gupta and Bhattacharya [3] have shown that the rule ψ_1 is the unique (a.e.) minimax rule when the loss for an incorrect decision is l(|μ_1 − μ_2|/σ), where l is a positive-valued, bounded, continuous function such that l(Δ) → 0 as Δ ↓ 0. To see all the above results, note that (U_1, U_2, U_3, S) are sufficient statistics in this case and S is distributed independently of (U_1, U_2, U_3). It also follows that ψ_1 is the uniformly best translation-invariant, symmetric rule. To see this, condition on S and fix σ.
Schaafsma [13] has shown that the following critical region for testing H_1 against H_2 is (i) similar of size α for H_1, (ii) unbiased for H_2, and (iii) asymptotically (as min(n_1, n_2) → ∞) most stringent among all level α tests:

Y_1 sign(Y_2)/S^{1/2} ≥ t_{n_1+n_2−2, α}   (2.29)

where Y_1 and Y_2 are given in (2.15) and (2.16), S is given in (1.5), and t_{n_1+n_2−2, α} is the upper 100α% point of the Student's t distribution with n_1 + n_2 − 2 degrees of freedom. However, it is very likely that this test is not admissible.

It follows from [9] that the rule ψ_λ is a (unique) Bayes rule. We shall give a sketch of the prior distribution against which ψ_λ is unique Bayes. Consider U_1, U_2, U_3 as defined in (2.1)-(2.3). Then the U_i's are independently distributed, and U_i ~ N(ν_i, σ²). Moreover, under θ ∈ Θ_i (i.e. (ν_1, ν_2, ν_3) ∈ Ω'_i) we have ν_1 = ν, ν_2 = (−1)^i cν, ν ≠ 0. The prior distribution is given as follows.
(i) P(θ ∈ Θ_i) = ξ_i, i = 1,2.
(ii) Given θ ∈ Θ_i, the conditional distribution of (ν, ν_3, σ²) is derived from the following:
(iia) Given σ² = (1 + τ²)^{-1}, the conditional distribution of (ν/σ², ν_3/σ²) is the same as that of (τV, τV_3), where V and V_3 are independently distributed with V ~ N(0, (1 + τ²)/(1 + c²)) and V_3 ~ N(0, 1 + τ²).
(iib) The density of τ is proportional to (1 + τ²)^{−(m+1)/2}.

3. Multivariate case: Σ known

Without any loss of generality we shall assume that Σ = I_p. First we shall derive a class of Bayes rules and obtain an admissible minimax rule. Define U_1, U_2, U_3 and k_1, k_2 as in (2.1)-(2.3), except that the U_i's are now p × 1 vectors and U_i ~ N_p(ν_i, I_p). Correspondingly redefine the sets Ω'_i as follows:

Ω'_i = {(ν_1, ν_2, ν_3): ν_1 = ν, ν_2 = (−1)^i cν ≠ 0, ν_3 ∈ R^p},   (3.1)

i = 1,2. As before U_3 may be eliminated from a Bayes rule by taking a fixed distribution, independent of (ν_1, ν_2), under both Ω'_1 and Ω'_2. Now consider the prior distribution which assigns the probability ξ_i to Ω'_i and, given ν_1 = ν, ν_2 = (−1)^i cν, the distribution of ν is N_p(0, τ²I_p). It can now be seen that the unique (a.e.) Bayes rule against the above prior distribution decides (ν_1, ν_2, ν_3) ∈ Ω'_1 iff

U'_1U_2 ≤ k   (3.2)

where k is a function of ξ_1 and ξ_2; conversely, given k the probabilities ξ_1 and ξ_2 can be suitably chosen. Thus any likelihood-ratio rule φ_Σ^k is Bayes and admissible.
We shall now show that φ_Σ^0 is minimax. First we shall consider a different prior distribution against which φ_Σ^0 is unique Bayes. As before, ν_3 can be eliminated from the problem. Now consider a prior distribution which assigns equal probabilities to the sets Ω*_1 and Ω*_2 where

Ω*_i = {(ν_1, ν_2): ν_1 = ν, ν_2 = (−1)^i cν, ν ≠ 0, ν ∈ R^p}.   (3.3)

Moreover, given that (ν_1, ν_2) ∈ Ω*_i, the distribution of ν is taken to be uniform over the surface of the hypersphere ν'ν = Δ². See [4] for a detailed proof of the fact that φ_Σ^0 is unique (a.e.) Bayes against the above prior distribution.

To see that φ_Σ^0 is minimax, note that the risk of φ_Σ^0 is constant over the set

∪_i {(ν_1, ν_2, ν_3): ν_1 = ν, ν_2 = (−1)^i cν, ν'ν = Δ²}.   (3.4)

Das Gupta [4] has also shown that the rule φ_Σ^0 is the unique (a.e.) minimax when the loss for any correct decision is zero, and the loss for deciding μ = μ_i incorrectly is

l[(1 + 1/n_i)^{-1}(μ − μ_i)'(μ − μ_i)]   (3.5)

where l is a positive-valued, bounded, continuous function such that l(Δ) → 0 as Δ ↓ 0.
As in (2.11) we may call a rule φ translation-invariant if

φ(u_1, u_2, u_3) = φ(u_1, u_2, u_3 + b)   (3.6)

for all b ∈ R^p. Clearly, (U_1, U_2) is a set of maximal invariants. A rule φ is called orthogonally-invariant if

φ(u_1, u_2, u_3) = φ(Ou_1, Ou_2, Ou_3)   (3.7)

for all orthogonal p × p matrices O.
Kudo [11] considered the following 'symmetry' condition for a translation-invariant rule φ:

β_1(φ; (1 + 1/n_2)^{-1/2}d) = β_2(φ; (1 + 1/n_1)^{-1/2}d)   (3.8)

where β_i(φ; d) = E_θφ_i when d = (μ_1 − μ_2) and μ = μ_i. Moreover, he required β_i(φ; d) to depend on d only through d'd. This condition clearly holds if φ is translation-invariant and orthogonally-invariant. Note also that for a translation-invariant and an orthogonally-invariant rule φ satisfying (2.13) the condition (3.8) holds. Kudo [11] has shown that φ_Σ^0 simultaneously maximizes both β_1(φ; d) and β_2(φ; d) in the class of all translation-invariant rules satisfying (3.8) and for which β_i(φ; d) depends on d only through d'd. This can be seen easily by integrating the probability of correct classification with respect to the uniform distribution of ν over ν'ν = Δ², where ν_1 = ν and ν_2 = (−1)^i cν.
Rao [15] has considered the class C* of rules whose probabilities of misclassification depend only on

Δ² = (μ_1 − μ_2)'Σ^{-1}(μ_1 − μ_2).   (3.9)

For a rule φ ∈ C* let G_1(φ; Δ²) and G_2(φ; Δ²) be the error probabilities when μ = μ_1 and μ = μ_2, respectively.

Rao [15] has posed the problem of minimizing

(d/dΔ²){aG_1(φ; Δ²) + bG_2(φ; Δ²)}|_{Δ=0},   (3.10)

subject to the condition that the ratio of G_1(φ; 0) to G_2(φ; 0) is equal to some specified constant. The resulting optimum rule decides μ = μ_1 iff

a[(X − X̄_1) − (1 + 1/n_1)(X − X̄_2)]'[(X − X̄_1) − (1 + 1/n_1)(X − X̄_2)]
   − b[(X − X̄_2) − (1 + 1/n_2)(X − X̄_1)]'[(X − X̄_2) − (1 + 1/n_2)(X − X̄_1)] ≥ k.   (3.11)

The above rule coincides with φ_Σ^0 when n_1 = n_2 and a = b, k = 0.

4. Multivariate case: Σ unknown

First we shall show that a likelihood-ratio rule ψ_λ is unique (a.e.) Bayes and hence is admissible (for the zero-one loss function). Note that U_1, U_2, U_3 and S are sufficient statistics in this case, where the U_i's (in p × 1 vector notation) are given by (2.1)-(2.3) and S is given by (1.5). Here U_i ~ N_p(ν_i, Σ). We now consider the following prior distribution.
(i) P(θ ∈ Θ_i) = ξ_i.
(ii) Given θ ∈ Θ_i (i.e., ν_1 = ν, ν_2 = (−1)^i cν), the conditional distribution of (ν, ν_3, Σ) is derived from the following:
(iia) Given Σ^{-1} = I_p + ττ' (τ: p × 1), the conditional distribution of (Σ^{-1}ν, Σ^{-1}ν_3) is the same as the distribution of (τV, τV_3), where V and V_3 are independently distributed as

N(0, (1 + c²)^{-1}(1 + τ'τ))   and   N(0, 1 + τ'τ),

respectively.
(iib) The density of τ is proportional to (1 + τ'τ)^{−(m+1)/2} where m > p − 1.
Following a simplified version of the results of Kiefer and Schwartz [9] it can be shown that a unique (a.e.) Bayes rule against the above prior distribution accepts μ = μ_1 if (1.7) holds, where λ is a function of the ξ_i's; conversely, given λ the constants ξ_i can be appropriately chosen.
Das Gupta [4] has considered a class C** of rules invariant under the following transformations:

(X, X̄_1, X̄_2, S) → (AX + b, AX̄_1 + b, AX̄_2 + b, ASA')   (4.1)

where A is any p × p nonsingular matrix and b is any vector in R^p.

It is shown [4] that a set of maximal invariants is given by (m_11, m_12, m_22) where

m_ij = U'_i S^{-1}U_j / m.   (4.2)

When ν_1 = ν, ν_2 = (−1)^i cν, ν'Σ^{-1}ν = Δ², the joint density of (m_11, m_12, m_22) is given by [14]

p_i(m_11, m_12, m_22; Δ²) = K exp[−Δ²(1 + c²)/2] |M|^{(p−3)/2} Σ_{j≥0} g_j(½Δ²)^j h_j(m_11, m_12, m_22)   (4.3)

where

h_j(m_11, m_12, m_22) = (m_11 + 2(−1)^i c m_12 + c²m_22 + (1 + c²)|M|)^j / |I_2 + M|^{(1/2)(m+2)+j},   (4.4)

|M| = det M,   M = [ m_11  m_12 ; m_12  m_22 ],   (4.5)

m = n_1 + n_2 − 2,   (4.6)

and K > 0, g_j > 0 are numerical constants.
Consider a prior distribution which assigns equal probabilities to Θ_1 and Θ_2 and, given θ ∈ Θ_i (i.e. ν_1 = ν, ν_2 = (−1)^i cν), the value of ν'Σ^{-1}ν = Δ² is held fixed. The Bayes rule in C** against the above prior distribution decides θ ∈ Θ_1 iff

m_12 < 0.   (4.7)

To see this, note that for a > 0,

(a + x)^j < (a − x)^j   (4.8)

for any positive j if x < 0. The relation (4.7) is the same as (1.7) for λ = 1. It now follows easily that the rule ψ_1 is admissible and minimax in C** [4]. Das Gupta [4] has also shown that ψ_1 is the unique (a.e.) minimax in C** if the loss for any correct decision is zero and the loss for deciding μ = μ_i incorrectly is

l[(1 + 1/n_i)^{-1}(μ − μ_i)'Σ^{-1}(μ − μ_i)]   (4.9)

where l is a positive-valued, bounded, continuous function such that l(Δ) → 0 as Δ ↓ 0.
Again for this case Rao [15] considered the class C* of rules whose probabilities of misclassification depend only on Δ² given in (3.9). Then he derived the optimum rule which minimizes the expression given by (3.10) subject to the condition of similarity for the subset of the parameters given by μ_1 = μ_2.

The optimum rule decides μ = μ_1 iff

a[(X − X̄_1) − (1 + 1/n_1)(X − X̄_2)]'B^{-1}[(X − X̄_1) − (1 + 1/n_1)(X − X̄_2)]
   − b[(X − X̄_2) − (1 + 1/n_2)(X − X̄_1)]'B^{-1}[(X − X̄_2) − (1 + 1/n_2)(X − X̄_1)] ≥ c(B)   (4.10)

where

B = mS + [n_1n_2/(1 + n_1 + n_2)][(1 + 1/n_2)(X − X̄_1)(X − X̄_1)' + (1 + 1/n_1)(X − X̄_2)(X − X̄_2)' − 2(X − X̄_1)(X − X̄_2)'].   (4.11)

It is not clear why Rao imposed the similarity condition even after restricting to the class C*. One may directly consider the class of rules invariant under (4.1) and try to minimize (3.10) subject to the condition that G_i(φ; 0) is equal to a specified constant. Using (4.3) it can be found that the optimum rule decides μ = μ_1 iff

a(k_1²m_11 + k_2²m_22 + (k_1² + k_2²)|M| − 2k_1k_2m_12)(1 + 1/n_2)^{-1}
   − b(k_1²m_11 + k_2²m_22 + (k_1² + k_2²)|M| + 2k_1k_2m_12)(1 + 1/n_1)^{-1} ≥ λ det(I_2 + M).   (4.12)

As in (2.29) a similar region for Θ_1 may be constructed for this case also. It is given by the following:

Y'_2(mS + Y_1Y'_1)^{-1}Y_1 / [Y'_2(mS + Y_1Y'_1)^{-1}Y_2]^{1/2} > k   (4.13)

where Y_1 and Y_2 are given in (2.15) and (2.16) in vector notation.

5. Multivariate case: μ_1 and μ_2 known

In this case the plug-in rules are given by the following: Decide μ = μ_1 if

(X − μ_1)'A^{-1}(X − μ_1) − (X − μ_2)'A^{-1}(X − μ_2) ≤ λ   (5.1)

where

A = [mS + n_1(X̄_1 − μ_1)(X̄_1 − μ_1)' + n_2(X̄_2 − μ_2)(X̄_2 − μ_2)'].   (5.2)

On the other hand, a likelihood-ratio rule decides μ = μ_1 iff

[1 + (X − μ_2)'A^{-1}(X − μ_2)] / [1 + (X − μ_1)'A^{-1}(X − μ_1)] > λ   (λ > 0).   (5.3)

Define m* = m + 2.

Without loss of generality we may assume that ~1 7--0 and /~'2 = (1,0 ..... 0).
Then the problem is invariant under the following transformations:

(X,A)~(LX, LAL') (5.4)

where L is a nonsingular p × p matrix of the form

[0l L22
It can be seen that a set of maximal invariants is given by (X1.2, X(2)A221X(2), A 11.2)
where

All AI2] 1 Xl) 1 (5.6)


A= p-l' X~ X(2) p - l '
[A21 A22
1 p-1
AII.2 = All -- AIzA~21A21" (5.7)
X1.2 = X 1 - AtzAzzIX(2). (5.8)

All.2 is distributed, independently of (XI.2, X[2)A221XI2), as 0 11 . 2 X 2 , _ p + 1; given


X(z)A~IX(2}, the distribution of Xl. 2 is
N( d, Oll.2(l + X[z)A~IX(2))),
and X[2)A221X(2)is distributed as the ratio of independent Xp2 i and Xm*-p+22
variates. In the above d is equal to 0 or 1 according as/~ =/~1 or/~ =/~2, and o I 1.2
is the residual variance of X 1 given )((2). It can be shown now that the following
rule is minimax (and Bayes) in the class of rules invariant under (5.5):
Decide/~ =/~ ~ iff

Xl. 2 < 1/2. (5.9)

The relation (5.9) is the same as (5.1) for X = Q,, and as (5.3) for X = 1. The above
region is not similar for/~ = #1- Such a similar region may be constructed using

X1.2(1 + X[2)A221X(2))-I/2(All.2/.(m * - p + 1)) -1/2 (5.1o)


which is distributed as Student's t distribution with m*-p + 1 degrees of
freedom when/~ =/~1. The Mahalanobis distance is equal to (011.2) -1/2 i n this
case. The probabilities of correct classification for the rule given by (5.9) are the
same and they decrease as p increases if e l l 2 is held fixed.
This section is new in the literature and it is due to the author.
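For illustration, a minimal sketch of the invariant rule (5.9), with μ_1 = 0 and μ'_2 = (1, 0,..., 0) as assumed above; the matrix A below is a hypothetical placeholder rather than the estimator (5.2).

    # Sketch of rule (5.9): compute X_{1.2} = X_1 - A_12 A_22^{-1} X_(2) as in (5.8)
    # and decide mu = mu_1 iff X_{1.2} < 1/2.
    import numpy as np

    def classify_known_means(x, a):
        a12 = a[0, 1:]
        a22 = a[1:, 1:]
        x12 = x[0] - a12 @ np.linalg.solve(a22, x[1:])
        return 1 if x12 < 0.5 else 2

    # Illustrative values only.
    a = np.array([[2.0, 0.5, 0.2],
                  [0.5, 1.5, 0.1],
                  [0.2, 0.1, 1.0]])
    print(classify_known_means(np.array([0.2, -0.1, 0.3]), a))   # -> 1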

References

[1] Anderson, T. W. (1951). Classification by multivariate analysis. Psychometrika 16, 31-50.


[2] Anderson, T. W. (1958). An Introduction to Multivariate Statistical Analysis. Wiley, New York.
[3] Das Gupta, S. and Bhattacharya, P. K. (1964). Classification into exponential populations.
Sankhy~ Ser. A 26, 17-24.
[4] Das Gupta, S. (1965). Optimum classification rules for classification into two multivariate
normal populations. Ann. Math. Statist. 36, 1174-1184.
[5] Das Gupta, S. and Kinderman, A. (1974). Classifiability and designs for sampling. Sankhy~ Ser.
A 36, 237-250.
[6] Das Gupta, S. (1974). Probability inequalities and errors in classification. Ann. Statist. 2,
751-762.
[7] Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Ann. Eugenics
7, 179-188.
[8] Fisher, R. A. (1938). The statistical utilization of multiple measurements. Ann. Eugenics 8,
376-386.
[9] Kiefer, J. and Schwartz, R. (1965). Admissible Bayes character of T²-, R²-, and other fully invariant tests for classical multivariate normal problems. Ann. Math. Statist. 36, 747-770.
[10] Kinderman, A. (1972). On some problems in classification. Tech. Rept. 178, School of Statistics,
University of Minnesota, Minneapolis.
[11] Kudo, A. (1959). The classificatory problem viewed as a two-decision problem. Mem. Fac. Sci. Kyushu Univ. Ser. A 13, 96-125.
[12] Schaafsma, W. (1971). Testing statistical hypotheses concerning the expectations of two inde-
pendent normals; both with variance 1. Parts I and II. Proc. Kon. Ned. Akad. 33, 86-105.
[13] Schaafsma, W. and van Vark, G. N. (1977). Classification and discrimination problems with applications. Statist. Neerlandica 31, 25-45.
[14] Sitgreaves, R. (1952). On the distribution of two random matrices used in classification
procedures. Ann. Math. Statist. 23, 263-270.
[15] Rao, C. R. (1954). A general theory of discrimination when the information about alternative
population distributions is based on samples. Ann. Math. Statist. 25, 651-670.
[16] Von Mises, R. (1945). On the classification of observation data into distinct groups. Ann. Math.
Statist. 16, 68-73.
[17] Wald, A. (1944). On a statistical problem arising in the classification of an individual into one of
two groups. Ann. Math. Statist. 15, 145-162.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 61-100

Large Sample Approximations and Asymptotic
Expansions of Classification Statistics

Minoru Siotani

1. Introduction

This is a review paper on special distributional problems which arise in classifying an observation x of p components into one of m distinct populations Π_1, Π_2,..., Π_m having p-dimensional distributions P_1, P_2,..., P_m, respectively. For a general review on classification, we may refer to the paper by Das Gupta (1973).
Distributions of classification statistics are needed mainly to know the probability of misclassification (PMC) or the probability of correct classification (PCC), to make an optimum choice of cut-off point, and to discuss some testing and estimation problems in classification. Unfortunately the exact distributions are usually hard to obtain. In particular, when the P_i's or parameters in them are partially or completely unknown and supplementary information has to be estimated through available samples from the Π_i's, the distributional problem becomes quite complicated. In those cases, large sample approximations to and asymptotic expansions for the distributions of classification statistics are required. In non-parametric classification, the distribution of classification statistics is usually based on large samples.
In this paper we consider the distributions of the following classification statistics:
(i) Statistics for classification into one of two multivariate normal populations with a common covariance matrix (Section 2).
   - Wald's W and the Studentized W statistics (Subsection 2.1).
   - ML classification statistic Z and its Studentized statistic (Subsection 2.2).
   - Classification statistics in covariant discriminant analysis (Subsection 2.3).
(ii) Statistics for classification into one of two multivariate normal populations with different covariance matrices (Section 3).
(iii) Statistics in the non-normal case and in the discrete case (Section 4).

Notations

α_T(c) (β_T(c)) is the PMC when x from the first (the second) population Π_1 (Π_2) is wrongly classified by a classification rule with cut-off point c based on the statistic T.
Φ(x) and φ(x) are respectively the c.d.f. and p.d.f. of N(0, 1).
ℒ(T | Π) is the distribution law of a statistic T when Π is the underlying population.

2. Statistics of classification into one of two multivariate normal populations with a common covariance matrix

Let Π_i: N_p(ξ_i, Σ), i = 1,2, be the two p-variate normal populations into one of which we wish to classify a vector x. When all parameters are known, an optimum classification rule (Bayes rule, minimax rule, etc.) is based on the statistic

U_0 = {x − ½(ξ_1 + ξ_2)}'Σ^{-1}(ξ_1 − ξ_2)   (2.1)

which is also the minimum distance rule:

2U_0 = (x − ξ_2)'Σ^{-1}(x − ξ_2) − (x − ξ_1)'Σ^{-1}(x − ξ_1).   (2.2)

The term in (2.1),

F_0 = (ξ_1 − ξ_2)'Σ^{-1}x,

is Fisher's linear discriminant function (LDF) in the population.
When all the parameters are not known, random samples of sizes N_i from Π_i are used to estimate them. Some general theory for constructing a classification rule based on samples has been discussed by Rao (1954).

2.1. Wald's W and the Studentized W statistics
Wald (1944) proposed the plug-in version of U_0 as the classification statistic

W = {x − ½(x̄_1 + x̄_2)}'S^{-1}(x̄_1 − x̄_2)   (2.3)

where the x̄_i are the sample mean vectors and S is the pooled sample covariance matrix with divisor n = N_1 + N_2 − 2. Fisher (1936) and Wald (1944) suggested the plug-in LDF given by

F = (x̄_1 − x̄_2)'S^{-1}x.   (2.4)

The distributions of W and F have been studied by many authors, but here we are concerned with large sample approximations and asymptotic expansions of them.
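As a small illustration (with hypothetical data, not from the text), W in (2.3) can be computed from the training samples as follows.

    # Sketch: Wald's W statistic (2.3),
    #   W = {x - (xbar1 + xbar2)/2}' S^{-1} (xbar1 - xbar2),
    # with S the pooled sample covariance matrix on n = N1 + N2 - 2 degrees of freedom.
    import numpy as np

    def wald_w(x, sample1, sample2):
        xbar1, xbar2 = sample1.mean(axis=0), sample2.mean(axis=0)
        n = len(sample1) + len(sample2) - 2
        s = ((sample1 - xbar1).T @ (sample1 - xbar1) +
             (sample2 - xbar2).T @ (sample2 - xbar2)) / n
        return (x - 0.5 * (xbar1 + xbar2)) @ np.linalg.solve(s, xbar1 - xbar2)

    rng = np.random.default_rng(2)
    train1 = rng.normal(0.0, 1.0, size=(30, 2))   # hypothetical sample from Pi_1
    train2 = rng.normal(2.0, 1.0, size=(30, 2))   # hypothetical sample from Pi_2
    print(wald_w(np.array([0.1, -0.2]), train1, train2) > 0)   # classify to Pi_1?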

Wald (1944) showed that the limiting distribution of W as N_i → ∞, i = 1,2, is the same as the distribution of U_0, i.e.

ℒ(W | Π_1) → N(½Δ², Δ²),   ℒ(W | Π_2) → N(−½Δ², Δ²)   (2.5)

where Δ² = (ξ_1 − ξ_2)'Σ^{-1}(ξ_1 − ξ_2), Δ being the Mahalanobis distance between Π_1 and Π_2. Similarly, the limiting distribution of F is, as N_i → ∞, i = 1,2,

ℒ(F | Π_1) → N(w_2 + Δ², Δ²),   ℒ(F | Π_2) → N(w_1 − Δ², Δ²)   (2.6)

where w_i = E(F_0 | Π_i) = (ξ_1 − ξ_2)'Σ^{-1}ξ_i, i = 1,2. Thus for sufficiently large samples we can use W and F as if we knew the populations exactly. Hence the PMC's of the W-rule and the F-rule with a cut-off point c are approximately equal to
β_W(c) ≈ Φ(−(c + ½Δ²)/Δ),   (2.7)

β_F(c) ≈ Φ(−(c − w_1 + Δ²)/Δ),   (2.8)

respectively. Elfving (1961) obtained an approximation to the c.d.f. of W for large N_1 = N_2 and p = 1. Teichroew and Sitgreaves (1961) discussed an empirical sampling experiment to obtain an estimate of the c.d.f. of W. In the univariate case, Linhart (1961) gave an asymptotic expansion for the expectation of the average of the two types of conditional PMC given a sample, i.e. α_W(0 | x̄_1, x̄_2, s²) and β_W(0 | x̄_1, x̄_2, s²), in the form

E[½{α_W(0 | x̄_1, x̄_2, s²) + β_W(0 | x̄_1, x̄_2, s²)}] = …   (2.9)

where m = (N_1 + N_2)/8N_1N_2, H_{−1}(Δ) = Φ(−Δ)/φ(Δ), and the H_ν(Δ) are the Hermite polynomials of degree ν. Bowker (1961) showed that W can be represented as a function of two independent 2 × 2 Wishart matrices, one of which is noncentral. Bowker and Sitgreaves (1961) used this representation to obtain the following asymptotic expansion for the c.d.f. of W when N_1 = N_2 = N:

P { W <~w° lH J } = cb - -~22 + -~ ' = 3 (h¢~2) h~2 ]

1 ~ htj ,~(t)[ hlj ) (~)(2.10)



where j = 1,2, q)(l~(x) = ~ ~(x),

( 2N ]1/2(( 2N--p--1 }
hlj= ~ ! (-1)j+l(½A2) 2N-2 w° '

2N-3 ( 2N+2 2 2p } N ( 2 N - p - 1 ) w2'


h2=2N--p-2 l ~ A +N- + (N_l)2(2N+l)

h3j=(_l)g+12p--3A _ ~w61, 3 h4j -- P--


A 12A4++ ~p ' 1 6

hsj=(--1)J+l(2A2), h6j = }A4 +2A2,

hTj = ( - 1)J+IA4, h s j - - ± A4

a3j z ( -- 1)J + IA2, a4j z A 2 "

Okamoto (1963, 1968) considered a more general case of sample sizes and derived asymptotic expansions for

P_1(u; Δ) = P{(W − ½Δ²)/Δ ≤ u | Π_1},   P_2(u; Δ) = P{(W + ½Δ²)/Δ ≤ u | Π_2}   (2.11)

and also for the PMC of the W-rule with cut-off point 0,

α_W(0) = P{W < 0 | Π_1} = P_1(−½Δ; Δ),   β_W(0) = P{W ≥ 0 | Π_2} = 1 − P_2(½Δ; Δ),   (2.12)

as N_1, N_2, and n tend to ∞ and N_2/N_1 tends to a finite positive constant, up to terms of the second order with respect to (N_1^{-1}, N_2^{-1}, n^{-1}). Siotani and Wang (1975, 1977) added the third order terms to Okamoto's expansions. With some further computation, the expansion formulae are

P_1(u; Δ) = Φ(u) + [{h_1/N_1 + h_2/N_2 + h_3/n}
   + {g_11/N_1² + g_12/(N_1N_2) + g_22/N_2² + g_13/(N_1n) + g_23/(N_2n) + g_33/n²}]φ(u) + O_3   (2.13)

where h i = hi(u; A), gij = gij(u; A),


1 {u3+(p--3)u--pA}
hI- 2A 2
1 {u3+ZAu2+(p'3+AZ)u+(p--Z)A},
h 2 -- 2A2
h3=-¼{4u 3+4Au z+(6p 6+AZ)u+Z(p--1)A},
1 {uV+(2p_lV)u5 2 ( p + 2 ) A u 4+(p2 18p+65)u3
g I 1 -- 8A4
- 2 ( p + 2 ) ( p --6)Au 2 + p ( p 42)A2u --3(p --3)(p --5)u
+2(p +2)(p--3)A},
1 {(u+~)2uS+(Zp--lV)uS+Z(p--13)Au4--(p+lO)A2u3
g 12 -- 4A 4
-- pA3u 2 +(p2--18p+65)u3--6(2p--ll)Au2
_(p2 - - 3 p - - 1 5 ) A 2 u q- p A3 - - 3 ( p - - 3 ) ( p - - 5 ) u + 6 ( p - 3)a},
1 {(u+z~)nu3+(Zp-lV)u5
g22 -- NA4

+ 6 ( p - 8 ) ( u + a ) a u 3 + 2 ( p - lO)a' u
- - 3 A4u + ( p 2 - - 18p +65)u 3 + ( p 2 - - 16p +54)(2u + A)A u
- - 2 ( p --4)a3 - - 3 ( p - - 3 ) ( p --5)u - - 2 ( p --3)(p -- 4)a },
1 { ( 2 u + A)2u5 + 2 ( 5 p - - 2 3 ) u 5
g13 -- 8A 2
+ 2(p - 21) a u 4 - (3p + 10)A2u 3
--pA3u 2 + 6 ( p 2 - - 9 p + 16)u3--4(p 2 + 4 p - - 1 8 ) A u 2
--(2p 2 - 7 p - l l ) A 2 u + p A 3-6(p- 1)(p--3)u+6(p-- 1)A},
1
g23-- 8A2 {(u+z~)Z(2u+k)2u3+2(5p--Z3)uS+Z(llp--51) Au4

+5(3p -16)A2u 3 + ( 3 p --26)A3u 2 --3A4u


+ 2(3p 2 -- 27p +48) u 3 + 4(2p 2 -- 19p + 34) k u 2 -- (3p -- 8)A 3
+(2p2--31p+61)A2u--6(p ~ Z 1 ) ( p - 3 ) u - 2 ( p - 1 ) ( 2 p - 5 ) A } ,
1 {48(u + A )2u5 + 24(u + A )A2u 4 + 3Aau 3 + 16(9p -- 26)u 5
g33 = 96
+ 96(2p - 7) zl b/4 -}- 12(7p -- 33)A2 u 3 + 4(3p -- 25)Zl3u 2
--9A4u + 4(27p 2 -- 132p + 137)u 3 + 24(3p 2 - 22p + 27)Au 2
+ 12(p 2 --13p + 20)A2u --4(3p --7)A 3 -- 12(p -- 1)(3p --1)u
-24(p- 1)2k}.

The asymptotic expansion for P_2(u; Δ) is obtained from the relation

P_2(u; Δ) = 1 − P'_1(−u; Δ)   (2.14)

where P'_1(w; Δ) is P_1(w; Δ) with N_1 and N_2 interchanged. The relation between the cut-off point c and the argument u is

c = uΔ + ½Δ²,   u = (c − ½Δ²)/Δ.   (2.15)

The PMC's of the W-rule with c = 0 then have the asymptotic expansions

α_W(0) = Φ(−Δ/2) + {a_1/N_1 + a_2/N_2 + a_3/n}
   + {b_11/N_1² + b_12/(N_1N_2) + b_22/N_2² + b_13/(N_1n) + b_23/(N_2n) + b_33/n²} + O_3,   (2.16)

β_W(0) = the expression obtained by interchanging N_1 and N_2 in α_W(0),   (2.17)

where a_i = h_i(−Δ/2; Δ)φ(−Δ/2), b_ij = g_ij(−Δ/2; Δ)φ(−Δ/2). (The third order terms are too long to present here.)
Tables of a_i and b_ij were given by Okamoto (1963, 1968) for p = 1, 2, 3, 5, 7, 10, 20, 50 and Δ = 1, 2, 3, 4, 6, 8. Siotani and Wang (1975, 1977) prepared tables of c_ijk, the coefficients of the third order terms in (2.16), as well as a_i and b_ij, for p = 1(1)20, 25, 30(10)50 and Δ = 0.4(0.2)2.0(0.4)4.0, 3.0, 5.0(1.0)8.0.
Lachenbruch and Mickey (1968) considered the estimation of α_W(0) and β_W(0) based on (2.16) and (2.17) by replacing Δ² with its estimate

D² = (x̄_1 − x̄_2)'S^{-1}(x̄_1 − x̄_2)   or   D*² = (n − p − 1)D²/n.
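For illustration, a sketch (with hypothetical data) of D², D*², and the resulting plug-in estimate Q_D = Φ(−D/2) discussed in the surrounding text.

    # Sketch: sample Mahalanobis distance D^2 = (xbar1 - xbar2)' S^{-1} (xbar1 - xbar2),
    # the adjusted D*^2 = (n - p - 1) D^2 / n, and the plug-in PMC estimate Phi(-D/2).
    import numpy as np
    from scipy.stats import norm

    def pmc_estimates(sample1, sample2):
        p = sample1.shape[1]
        xbar1, xbar2 = sample1.mean(axis=0), sample2.mean(axis=0)
        n = len(sample1) + len(sample2) - 2
        s = ((sample1 - xbar1).T @ (sample1 - xbar1) +
             (sample2 - xbar2).T @ (sample2 - xbar2)) / n
        d2 = (xbar1 - xbar2) @ np.linalg.solve(s, xbar1 - xbar2)
        d2_star = (n - p - 1) * d2 / n
        return d2, d2_star, norm.cdf(-np.sqrt(d2) / 2)

    rng = np.random.default_rng(3)
    print(pmc_estimates(rng.normal(0, 1, (40, 3)), rng.normal(1, 1, (40, 3))))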
Anderson (1973a) derived asymptotic expansions for the c.d.f.'s of the Studentized W, i.e. (W − ½D²)/D and (W + ½D²)/D under Π_1 and Π_2, respectively;

P( W - ½D E <~u ]H ' }

- - * ( u ) + l {-~~ ( l + k ) n

_(p_¼+½k)u_¼u3)ep(u)+O(n
2), (2.18)
{ W+½D2 }

(2.19)
where k = lim_{n→∞}(N_2/N_1). Using (2.18) and (2.19), Anderson (1973b) discussed

the comparison of the expansions for the c.d.f.'s, densities, and the first two moments of W itself and the Studentized W. In particular, when k = 1 and u = −Δ/2,

P{W ≤ ½D² − ½ΔD | Π_1, k = 1} − P{W ≤ 0 | Π_1, k = 1} = …   (2.20)

He also considered how one chooses the cut-off point c = uΔ + ½Δ² for W to achieve a given PMC P_0 when Π_1 is true.
When p = 1 and all the parameters are known, the rule U_0 ≷ 0 is equivalent to

[x − ½(ξ_1 + ξ_2)](ξ_1 − ξ_2)/σ² ≷ 0   or   x − ½(ξ_1 + ξ_2) ≶ 0

for ξ_1 < ξ_2. Friedman (1965) considered the plug-in rule x ≷ ½(x̄_1 + x̄_2), and compared their PMC's with approximations for large samples from Π_1 and Π_2.
The conditional distribution of W given x̄_1, x̄_2, and S is obviously normal, and the conditional PMC of the rule W ≷ 0 is written as

α_W(0 | K) = P{W < 0 | K, Π_1}
           = Φ( {½(x̄_1 + x̄_2) − ξ_1}'S^{-1}(x̄_1 − x̄_2) / {(x̄_1 − x̄_2)'S^{-1}ΣS^{-1}(x̄_1 − x̄_2)}^{1/2} ),   (2.21)

β_W(0 | K) = P{W > 0 | K, Π_2}
           = Φ( {ξ_2 − ½(x̄_1 + x̄_2)}'S^{-1}(x̄_1 − x̄_2) / {(x̄_1 − x̄_2)'S^{-1}ΣS^{-1}(x̄_1 − x̄_2)}^{1/2} ),   (2.22)

where K = {x̄_1, x̄_2, S}. We have the obvious relations

α_W(0) = E_K{α_W(0 | K)},   β_W(0) = E_K{β_W(0 | K)}.   (2.23)

When all the parameters are known, we have the optimum PMC given by

α_{U_0}(0) = β_{U_0}(0) = Φ(−Δ/2).   (2.24)

The problem of estimating one or more of these PMC's and the study of the effectiveness and comparison of the estimators have been considered by many authors.

The explanation here is made only on approximations and asymptotic expansions of estimators of the PMC.
McLachlan (1973, 1974a-d, 1976, 1977) discussed in a series of papers the asymptotic expansions for the conditional PMC of Wald's W-rule and for the distributions of estimators of the PMC. Based on those expansion formulae together with Okamoto's formulae (2.16) and (2.17), he evaluated the biases and asymptotic mean square errors (AMSE) of the estimators in expanded forms, and made comparisons of them based on those criteria. Some of them are explained below.
McLachlan (1973) gave an asymptotic expansion for the expectation of Q_D = Φ(−D/2) in the following form:

E{Q_D} = B + O_3
       = Φ(−Δ/2) + φ(−Δ/2)[b_1/N_1 + b_2/N_2 + b_3/n
         + b_11/N_1² + b_12/(N_1N_2) + b_22/N_2² + b_13/(N_1n) + b_23/(N_2n) + b_33/n²]   (2.25)
where

b,=b 2=~1 {A2--4(p--1)}, b3 = 3 - ~ { A z - 4 ( 2 p + l ) } ,

bll : ½b12 : bz2 - 1024A


1 3 {ZI6--4(2p + 1)A4 + 16(p-- 1)2A2

+ 6 4 ( p - - 1)(p--3)},
1
b'3 = b 2 3 - 1024A {A6--4(3p +7)A4 + 16(2p2 +8P +5)A2

--64(p-- 1)(2p + 1)},

A (3Z~6_4(12p+35)A 4 +16(12p2 +72p+Vl)Zl 2


b33- 12288
--192(12p 2 + lZp +1)}.

If we write Okamoto's formula (2.16) as

α_W(0) = E_K{α_W(0 | K)} = A + O_3,   (2.26)


Classification statistics 69

then we have an asymptotic bias of QD given by

Bias(QD) -- A - B + 03

=dP(--A/2)[+(P--1)+ 3+n (4(4p--1)--A2}]+02


(2.27)

neglecting second order terms.


McLachlan (1974a) proved that if the term equal to 0 2 is ignored, then
M 1= aw(O[K ) is asymptotically distributed according to N(/~, o2), where
/~ = the first order term in Okamoto's formula (2.16)

= ~ ( - zl/2)[ I ~ N ~ {A2 + 12(p -- 1)}

1 2 A P _ l)] (2.28)
+ 1 ~ - 2 ~ {A - 4 ( P - 1)} +-~n (
A

0 2 ~__41( ~ 2 ( - A / 2 ) [ 1 / N 1 + 1 / N 2 ], (2.29)

but if the term equal to 02 is taken into account, M~ does not have a normal
distribution. He also proved that (M 1+ M 2 ) / 2 , where Mz=flw(O[K), has a
normal distribution with mean

m = 1 [expression obtained from (2.16) and (2.17) up to the


second order term],
(2.30)

and variance

,2
r2=-l~ q~(-A/2) [(,~ + N ~ {A4+16(p--1)}

+ 1 + !,~2) 16(p_ 1)1


+ 1n( 1- ~ l N z n (2.31)

if the term equal to 03 only is ignored.


Using asymptotic expansions (2.13) and (2.25), McLachlan (1974b, d) discussed
a technique for constructing an estimator of M 1 whose bias is only 03, and gave
two new estimators. The technique is, of course, applicable to the case of M 2.
Furthermore McLachlan (1974c, d) calculated AMSE's of various existing
estimators of MI. Let us denote an estimator of M 1 using a technique t by Qt;
then AMSE of Qt is defined as the asymptotic expansion of E ( ( M 1-Qt)2}, the
expectation of (M~ - Qt) 2 over the joint distribution of K = {71, 72, S}. He first
70 Minoru Siotani

calculated the AMSE of QD= 0(-- D/2) which is equal to

AMSE(QD) = ½02( -- A / 2 )

X ~+ Az+ZNZA-------~{A4+4(p--2)Ae+I6(p--1)(p--2))

+ N--~ (A2 - - 2 p ) 1n
+ 4--~i (A4--(3p+14)A2 +4p(8p--1)}

1 4
+ 8---~2n (~1 - 2 ( p + 4 ) a 2 + 8 p )

+ {7A4 -- 8(8p + 13)A2 + 16(16p2 +23) } + 0 3


128n 2

(2.32)

and the expansions for AMSE(Qt)-AMSE(Qo), t =/=D, were listed for several
estimators which are all 02, so that they have the same leading term of the first
order

q~2(_ A / 2 ) ( ( 1 / N 1) + (AZ/8n) }.

He also considered AMSE's of Qt against O ( - A / 2 ) and aw(0 ) instead of M 1.


Those AMSE's were used to assess the estimators and make comparisons.
Let R i = mi/N ~(i = 1,2) be the resubstitution estimator of 34,. suggested first by
Smith (1947), where m i is the number of misclassified members in the initial
sample size N, from the population H i. When i = 1 , let ql = q l ( K ) be the
expectation of R~ conditional on K, which is equal to

q,= P(W(y)<0]K}, (2.33)

where W ( y ) is the Wald W for a randomly chosen member y from the initial N I
observations f r o m / 7 I. McLachlan (1976) derived the asymptotic bias of R~ by
evaluating an asymptotic expansion of E(q~ ), which is the unconditional expecta-
tion of R~, as

E(q,)=E(O(-D/2)}+q~(-~/2) 8'
-N~-I + - ( 1 2 - A 2 ) n t +O2"
(2.34)
Classification statistics 71

Using (2.27), the asymptotic bias of R 1 is thus

Bias(R,)=0(--A/2) ~{A2+4(p--1)}+-~n(P--1 ) + O 2.

(2.35)

The asymptotic bias of R 2 is obtained just by interchanging N 1 and N 2 in (2.35).


McLachlan (1977) considered the conditional c.d.f, of the Studentized form of
W when x comes from H1, given K, which is normal; namely

P1 = P( (W--½D2)/D<cIK,/71}
{cD+(21 --'l)S' - I - (Xl--X2)} (2.36)
Y

where y = {(21 - 2 2 ) ' S 12JS-1(21- 22)} 1/2. This corresponds to the conditional
PMC for an observation from /71. He showed that the distribution of PI is
asymptotically normal with mean

~I 1 --'--~ ( ¢ ) ~- leo( c )a( c ) (2.37)

and variance

o 2 = l~2(c)b(c) (2.38)

where

a(c)= ~_( p - - 1 ) ( l + k ) - ( p - z + ~ k ) c - - z c ,
1 1 1 3

b(c)=l+k +½c2, k = l i m ( N 2 / N 1 ) asNl--,~,Nz~m.

It is noted that/~1 is the same expression as the first two terms of (2.18) with
c = u. Using this asymptotic distribution of P1, he discussed the confidence
statement on P1 by determining the cut-off point c so that

P(P1 < M} =- a + O(n -2) (2.39)


72 Minoru Siotani

is satisfied, where M is a preassigned upper bound on P1 and a (0 < a < 1) is the


desired confidence level. The result is given by

hi h2 h3 + O ( n - 2 ) , (2.40)
c = Co nl/2 17 /,/3/2

where C o = ~ I(M) and where, on writing a(Co) and b(co) as a 0 and bo,
respectively,

h 1: zbl/2 '

h 2 : Ct0 + ½ Z 2 c o ( b o - 1),
1/2[ ^ 1 ~ 2
h3=zbo [Coao+~oCO(ZCo-4Cto)+½z2bo(c2-1 )

+34c2(1-z2)+~z 2+p-¼+½k],
/

z = ~ - l ( a ) , and 4 0 denotes a o with D in the place of A.


In the case of the second type PMC, the conditional PMC for an observation
from/72,

/'2 = P( (W + ½D2)/D >-- c*IK, H2), (2.41)

was considered, where c* is determined so that P{P2 < M} = a + O ( n 2). It turns


out that c* is given by inverting k in (2.40). McLachlan also considered the case
where both types of conditional PMC's are to be bounded with high probability
by considering the following combined rule:

Allocate x to 1-/1 if ( W - ½D 2) / D > c 1 and ( W + ½D 2) / D > - c'~,


and to H 2 if ( W - ½0 2 )/D < c 1 and ( W + ½0 2)/D < - c~.
(2.42)

The cut-off points c I and c~ are determined seperately for the upper bounds M 1
and M 2 on the conditional PMC's P~ and P2* and specified confidence levels a I
and a2, respectively. Since

P?=aw(C"c~lK)--e
W ½D 2
D <cl'
W+½D2
D <-c~]K'H1
)
< P { W-½D2D < cIlK'/-/1} = P l ,
Classification statistics 73

we obviously have

P{P~ < M,}/> "1 + 0 ( n - 2 ) (2.43)

if we use the clgiven by the formula (2.40) for M = M I and " = a l . Similarly, we
have

P{P2* < M2} >~"2 + O ( n - 2 ) (2.44)

if c~ is given by inverting k in (2.40) with M = M 2 a n d , = "2. It is noted that in


order to bound both errors simultaneously, the combined rule gives rise to a
non-allocation region

{X: min(A1, A2) ~ W~ max(A l, A2) } (2.45)

where A 1= D(c I + ½D) and A 2 = - D(c~+ ½D).


Lachenbruch (1966) discussed the effect of incorrect initial classification of the
samples on the W-rule. Let ~,~ be the proportion of the N~ observations in the first
group that really belong to H2, and ~'2 be the proportion of the N 2 observations
that really belong to H I. McLachlan (1972) obtained asymptotic expansions for
the means and variances of the conditional PMC's for the rule W <> 0 under this
situation with ,{~= 0. Let n = N~ + N2 --2, f = Y2(1- 72)N2, and 7 = 72 < 1, then

r{.w(01x)) = ~ ( - ~ff-~-A ) + (2qr)-'/2exp{ (1 8Y)2A2 }

× (a, +-~n (P-1)(1-Y)} +02, (2.46)

r{¢w(oIK)} = q)(- ½(1 + y)A) + (2~r) l/2exp{ (1+v)242}8


× d2+~n(P-1)(l+y) +O2, (2.47)

Var(,Xw(0lK)) (2~r)-l/2exp{
=
(14v) 42 4 N + +02,
(2.48)

Var{flw(OIK)) = (2~r) '/2exp{ (1 v) 42 T~, + +02,

(2.49)
74 Minoru Siotani

where

1[
dl=~- 7 (p--l)
l+t 2jn,
-(~TT--~ {l+½(l+(fA2/n)))+l(1--y)A
]
1[ y+(fA2/n) ]
+2---~2T ( p - - l ) -~--T--~ {½(7+(fA2/n))--Y}+I(1--Y)Y2A
[1
+ 2N2(11--7) ---2-~(P--I)(I--Y)+½(1--T)3A ,
] (2.50)

1 [ l+(fA2/n){l+(fA2/n)(l+ 7 } ]
d2 = -2N,
- (p--l) ~-_2--~ 1_--~) - 1 + 1(1-¢- T)a
2
1
+ 2 N 2 ( I _ y ) [ - - ~ ( P - - 1 ) ( y - - 3 ) + ½ ( I + y ) ( 1 - - Y ) 2A]

l [ y+(fA2/n){~,+(fA2/n)(l+yl }
+2--~2y ( p - - l ) (1--y)A 2 ~ l - - T ] +3'

+l(1 + 7)72A]. (2.51)

2.2. Maximumlikelihood(ML) classificationstatisticZ


andits Studentizedform
Another classification statistic based on samples for the case of two p-variate
normal populations with a common covariance matrix is the likelihood ratio (LR)
criterion

n + Nl --
NT~(x- ~2)'S '(x-
(2.52)
see Anderson (1958). When the cut-off point c is l, this LR rule reduces to the
maximum likelihood (ML) rule z X 0 where

Z - N1N,+~ (x_y,),S_,(x_yl) N: (x- ~)'S '(x-- ~)


N2÷l
(2.53)
I f N 1 ----- N 2 = N, then Z = -[2N/(N+ 1)]W. Z is also a classification statistic
proposed by Kudo (1959, 1960) and John (1960, 1963). Some optimum properties
of the Z-rule were discussed by Das Gupta (1965).
Classificationstatistics 75

The limiting distribution of Z as Nl, N 2 ~ ~ (hence also n ~ ~ ) is

E ( Z [/-/,) ~ N(-- A2,4A2), E(ZI/-/2) ~ N(A2,4A2). (2.54)

Memon and Okamoto (1971) derived asymptotic expansions for

P~(u; A)=P{ Z+ A2~u'I-I1} P'(u; A)=P{ Z-A2<~u'I-[2


(2.55)

and also for the PMC's of the Z-rule, i.e., az(0 ) and flz(0), in similar forms to
those of the W-statistic and the W-rule, i.e., to (2.13) and (2.16), respectively. The
coefficients hi, giy, ai, biy, and cij k there are denoted with * here.
For P~(u; A), the first term is ~ ( u ) and

h~ - 1 2 (U 3 --Au 2 q - ( p - - 3 ) u + A } ,
2A

h~ - 2A
21{u3-Au2+(P-3-Aa)u+A(A2+I)}

h~= -¼{4u 3-4Au 2 +AZu+6(p--1)u--2(p--1)A},

1 {(u_A)2uS+(2p_17)uS 2(p_13)Au4_laA2u3
g~a 1 - - 8~ 4

+4A3u 2 + ( p 2 _ 18p +65)u 3 + 6 ( 2 p - - l l ) A u 2

, --(4p -- 27)a2 u -- 3( p -- 3)( p -- 5)u --6( p -- 3)A -- 4A3 },


¢

1 {(u--ZA)u6+A3(Zu--A)u3+(Zp--17)u5
g~2 -- 4A4
- - 2 ( p -- 13)Au 4 - - ( p +4)AZu 3 + ( p -- 8)A3u 2 +3Aau

+(p2_18p+65)u 3+6(2p--ll)Au 2

+ ( p + 12)AZu - - ( p --2)a 3 ---3(p " 3 ) ( p -- 5)u - - 6 ( p -- 3)a},

1 {(u+a)Z(u--a)4u+(2p--lV)uS--2(p--13)au 4
g~2 -- 8A4
- - 2 ( p -- 1)AZu3 + 2 ( p -- 8)A3u 2 +7A4u --2 A5

+(p2 lSp+65)u 3+6(2p-ll)Au 2+(2p+9)Azu

-- 2(p -- 2)a 3 -- 3( p -- 3)( p -- 5)u --6( p -- 3)A ),


76 Minoru Siotani

1 {(u_A)(2u_A)2u4+2(5p_23)u5 12(p_6)Au4
g~3 -- 8A 2

+ 3 ( p - 12)AZu3 +6A3u 2 + 6 ( p 2 --9p + 16)u 3

- - 2 ( p 2 - 2 3 p +52) Au 2 --3(3p --11)a2u --3A 3

--6(p--1)(p--3)u+2(p--1)(p--4)A},
1
g~3-- 8A2 {(u+ A)(u-- A)2(2u--A)2u2 +2(5p--23) u5

-- 12(p --6) Au 4 - - 3 ( p +6)A2u 3 + 2 ( 4 p -- 9)A3u 2

--(2p -- 9)A4u -- A5 + 6 ( p 2 --9p + 16)u 3 - - 2 ( p 2 --23p +52) AU2

- - 3 ( p -- 9)A2u --(rip -- 1)A3 - - 6 ( p -- 1)(p --3)u


+2(p-1)(p-4)a),
1 {3(2u - A)4u 3 + 16(9p --26)u 5 --96(2p --7)Au 4
g~'3- 96
+ 12(7p - 33)A 2u 3 _ 4(3p -- 25)A3u z _ 9A4u

+ 4(27p 2 - 132p + 137)u 3 - 24(3p 2 - 22p + 27)Au 2

+ 12( p2 _ 13p + 20)A2u + 4(3p -- 7)A3

-12(p- 1)(3p - 1)u + 24( p - 1)2A ).

It is noted that h~ = h~(u; A)= h3(u;- A) and g~'3= g~'3(u; A ) = g33(u;- A). As
in the case of W, P~(u; A) can be obtained from

P~(u; A) = 1- ff~(- u; A) (2.56)

where /~*(w;A) is the expression obtained by interchanging N 1 and N z in


P?(w; A). Siotani and Wang (1975, 1977) extended the expansion formulae up to
the third order terms. The PMC's of the ML rule are calculated from

az(O ) = 1 - p~'(½A; A),


/32(0) = P2*(-- ½A; A) = 1 - ff~(½A; A), (2.57)

hence/3z(0 ) is the az(0 ) with the interchange of N~ and N 2. Thus the first term in
the expansion for az(O) is 1 - ~ ( A / 2 ) = ~ ( - A/2), and the coefficients a*, b~,
and c~k are obtained respectively from a* = - h *i (~A,
~ • A), bTj = - gTj(½A;A), and
the corresponding relation for e~k.
Tables of a* and b~ were prepared by Memon and Okamoto (1971) for
p =1,2,3,5,7, 10,20,50 and A =1,2,3,4,6,8. Siotani and Wang (1975, 1977) pre-
Classification statistics 77

pared the tables of Cijk and C~jk and made a comparison of the W-rule and the
Z-rule with respect to the PMC.
As in the case of W, an asymptotic expansion for the distribution of the
Studentized Z was given by Fujikoshi and Kanazawa (1976) which is rewritten as
Z+D2 <__ul//l)
P T~ =

=#P(ul--e~(ul[2----~l(Au--u
~ 1
+p--l)
2
(2.58)

1 1 (u2+4p_3)u]+02"
+ 2--~2~ { ( u - A) 2 + ( p - - 1)) +-4-nn
The case when x comes f r o m / / 2 can be treated by using a relation similar to
(2.56).
For the estimation of az(O), flz(O) or az(OIK ), flz(O[K), we may make a
similar investigation to the one mentioned in the last section for the estimators of
the PMC's of the W-rule.
Siotani (1980) obtained the asymptotic expansions for the conditional distribu-
tions of ( Z - ( - 1 ) ~ 2 ) / 2 A and of the Studentized Z, i.e.

Zsi = (Z--(--1)iDZ)/2D
given K when x comes from H~, i---1,2. From those results, the following large
sample approximations to the distributions of conditional PMC's were derived: If
the second order term O2 with respect to (N~ ~, N f ~, n -1) is ignored, then the
conditional PMC az(C IK) has a normal distribution with mean

~ = ~ ~-~ + (c 3 + -

, + ( 4 p - - 4 - - AZ)A4 )
1
~ {c 3 + AZc2 + (4p -- 12- 5 £ ) £ c + (4p - 4+ 3a2),a 4)
+ 16N2A--------

+ - - (c ~ +z~c ~ +6(p - 1)z~c - 2 ( p - 1)z~4 ,


8rtA3 2A '
(2.59)
and variance

°~2= ~ 2A

X ( c + A2)2 + 4NzA 4

In the special case when c = 0, the asymptotic distribution of


1.
az(OlK) is normal
78 Minoru Siotani

with mean
1 1
/*~0= ~b(-- A / 2 ) + 1 ~ - ~ {4(p -- 1)-- A2} + ~ {4(p -- 1) + 3A2}

1 1
4n ( p --1)a , ~ ( - - a / 2 ) (2.61)

and variance

o~02= ¼{q~(- A/2)}z(1/N~ + 1/N2). (2.62)

For the Studentized form, we consider the classification rule; assign x to II1 if
Zsl < c and to H 2 otherwise. Then if the term equal to 02 is ignored, the
distribution of the conditional PMC a Zs,(ClK) is N(m T, z~'2), where

m~= ~(--c)+[ 1 ((p-1)+ ac-c }


1
+ 2 74 a { ( p - 1+ a )-2ac + c2}
,
+ ~nn {(4p - 3 ) c + C3 } d~(C), ] (2.63)

"r~2 = q~2(c)(1/N, + c2/2n). (2.64)

For the asymptotic distributions of flz(cIK), •z( 01K), and BZo2(ClK), they are
all normal with means and variances calculated from (~]~, o~'2") in (2.59) and
(2.60), (/~0, °~'2) in (2.61) and (2.62), and (m]~, ~.~2) in (2.63) and (2.64), respec-
tively, by interchanging N 1 and N 2.

2.3. Classification statistics in covariate discriminant analysis


Cochran and Bliss (1948) discussed the classification problem when certain
variates (called the covariates) are known to have the same means in H~ a n d / / 2 ,
i.e.

12]1
where x is a subvector of p discriminators, y a subvector of q covariates,
E ( y l H i ) = ~ , , I; is a ( p + q ) × ( p + q ) covariance matrix with the partition
corresponding to the partition of the random vector. They introduced

W* = {x* -- ½(.~' + Y~)}'S~ !2(~' - 2~) (2.66)

as a classification statistic, which is the W expressed in terms of the residuals


Classification statistics 79

x* = x - By, where

Y*=~i-Bfi,., i=1,2,
= [s,, s12]
s((p+q)x(p+q)) [s2, s ~ '
B(pXq)=SI2S221; Sll.2=S11-S12S221S21 .

When q = 0, obviously W* = W. Cochran (1964) discussed on the gain obtained


by the use of covariates.
Fujikoshi and Kanazawa (1976) considered the ML rule in this covariate case
and introduced the classification statistic

Z* = N1 --* t --1 *__


N, + l + ( l , / r , ) ( x * - - x , ) Sll.2(x X~)

N2 (x*-- 2~)'Slq }2(x*- ~ ) (2.67)


Uz + l +(12/r2)

where ri = n/N~, l i = ( y -- j])'S2~ l( y .~), i = 1,2 and its modification

Z**- N1 ( x * - Y ~ { ) ' S ~ ' tx* ~*~


NI+~ .2t - 1:
U~
N 2 + l ( x * - - ~)'Sl-l!z(X* ~). (2.68)

Z** was first mentioned by Memon (1968). When q = 0, the both Z* and Z** are
equal to Z.
The limiting distributions of IV*, Z*, a n d Z** as N 1, N2, n --, ce (lim r~ = const.)
are as follows,:

E(W* I/7,) ~ N(A*2/2, ZI*2), E(W*[//2) - N ( - A'V2, Z~*2) ,


(2.69)
~(Z*lU ,) ~ N ( - A*2,4A*2),
(2.70)
g ( Z * * l H , ) - N ( - A*2,4A*2), e(z**ln2) ~ N(A*2,4A*2),
(2.71)
where A.2 =(~1--~e2) t ~ - -ll-a(~el--6)'
1
"~11"2Z ~ l l ~12"~221~21"
Memon and Okamoto (1970) gave an asymptotic expansion for

Q~')(u; ~*)= e((w*-~*2)/~*<~uln,}


up to the terms of the second order with respect to (N 1- i, N2-1, n - l ) and McGee
80 Minoru Siotani

(1976) obtained the third order terms. The formula of

Q(2t)(u; zl*)= P{ (W* + ½A*2)/k* ~ ulH2}

is obtained using the relation corresponding to (2.14). Fujikoshi and Kanazawa


(1976), and McGee (1976) gave asymptotic expansions for

Q}2)(U ; A*)= p{(Z* + A*2)/2A * u117,} ,


Q{22)(u; A*) = P{(Z*-- a * i ) / 2 a * ~< u l n 2 } ,

Q{')(u;a*)=P{(Z**+ A*i)/2A*< ublI,}


and
Q?)( u; zX*) = P { ( Z * * - zX*2 ) /2zX* ~ ul H2 ) .

Since these Q}J)(u; z~*) become the corresponding probability functions in the
non-covariate case when q -- 0, they are written in the form

Q}J)(u; A*)= Pi(Jl(u;a*)+ q.L}Jl(u; A*), i = 1 , 2 , j = 1 , 2 , 3


(2.72)

where P/O)(u; A*) are independent of q; hence they are parts due to discrimina-
tors, so that

e(')(u; A*) = Pl(U; A*), the expression (2.11)


with A* in the place of A, (2.73)
Pi(2)(u; A*) = pg)(u; A*) = pi*(u; A*), the expression (2.55)
with A* in the place of A. (2.74)

L}J)(u; A*) are due to covariates. From those results, asymptotic expansions for
the PMC's of the W*, Z*, and Z** rules with cut-off point 0 are given by

aw,(0) = Q~l)(_ A*/2; A*) = aw(O) + q. L(l~)(--a * / 2 ; A*),


az,(0 ) = 1 -- Q}Z)(A*/2; A*) = az(0 ) -- q.L?)(A*/2; A*), (2.75)
az**(0 ) = 1 -- Q}3)(A*/2; A*) = az(0)-- q.L?)(A*/2; a*).

flw,(0),Bz,(0) and Bz**(0) are obtained by interchanging N 1 and N2 in the


expressions of aw,(0 ), az**(O), and az**(O), respectively. The formulae for aw(O)
and az(O) can be calculated from (2.16) and (2.57), so we give here only the
Classification statistics 81

covariate parts;

L,')(-- A*/2; A*) = [ - ~ ÷


+ Va2n+ /~2 J ] q)(-- a * / 2 ) + 03'

(2.76)
where
1
b]'3) - 128A* { A * 4 +4(3p--4)A*2 + 4 8 ( p - - 1 ) } ,
1
b0)- { A . , _ 4 p A . 2 _ 16(p _ 1)},
23 128/1"
a*
u33h(1)= --~ ( ( 2 p + q ) ( A * 2 + 4 ) - - 1 6 } , "3•(1) --
I a *4- ~- '

-- L(tJ)( A*/2; A*) =

= a(J) f h(J) + b(2J)


3 +~"13 big) ] ]
j=2,3
n +7I]
(2.77)
where
a~2) = a~3) = a('), b~2) = b(33)= b(3~),

b(2)
, 3 -_ b(3)
13 _ 128A*
1 {--A*4+4pA*2+16(p--1)}'

1
b(2)
2 3 -- b(3)
23 - 128A* {3 A*4 + 4 ( p --4)A'2 + 1 6 ( p - - 1)}"

It is noted here that asymptotic expansions for az.(O ) and az**(0 ) are equal if the
terms of the third order are negligible. McGee (1976) and Kanazawa, McGee, and
Siotani (1979) made a comparison of those covariate classification rules on the
basis of the PMC's thus obtained.
Tables of coefficients a's, b's as well as c's in the expansions for aw.(O ) and
az**(0 ) were given by McGee (1976).
As in the non-covariate case, asymptotic expansions for the c.d.f.'s of the
Studentized W*, Z*, and Z** are available. Kanazawa and Fujikoshi (1977) gave
the formula

p{(W*-½D*2)/D*~uIHI} =

o(u,
(2.78)

where D* 2 = ( ~ __ . ~ ) , S 1 1 }2 ( - ~ - - / ~ ) . (The formula was derived up to the terms


of the second order with respect to ( N l- ', N-2 1, n -laj.ja When q = 0, this coincides
82 Minoru Siotani

with Anderson's formula (2.18). Fujikoshi and Kanazawa (1976) obtained the
formulae for the Studentized Z* and Z**, which are expressed as
P / Z * + D .2 [right hand side of (2.58) with A*
(2.79)
k
2D* <-u[II,) = instead of A 1_ ~n A, + O2,
¢ z * * + 0 *2 ~ [right hand side of (2.59) with ,4*
Pl 2 D* )
< u11I~ = instead of ,4 ] - q u + 02 .
n
(2.80)

The expressions for


P{ ( W* + ½0*2)/D* <-<u[172},
P{(Z* -- 0 * 2 ) / 2 0 * ~< U 1/~2} ,
and
P { ( Z * * - - D , 2 1/2D , ~<u1/72}
can be obtained in the same manner as before. Those formulae were used by
Kanazawa and Fujikoshi (1977) and Kanazawa (1979) to discuss on the setting of
the cut-off point to achieve a specified PMC in each classification procedure. For
example, Kanazawa (1979) obtained the cut-off points c~,, c2a, and c3~ such that
a=P{W*<-Cl,IH,}=P{Z*>cz, I171}=P{Z**>c3,IHl} (2.81)
for a given a (0 < a < l ) by using asymptotic expansions for distributions of the
Studentized W*, Z*, and Z**, i.e. (2.78), (2.79), and (2.80), respectively, and by
applying the general inverse expansion formula of Hill and Davis (1968);
c,,~=A,~(W*)D*+½D .2, c2,~=2A,~(Z*)D*-D .2,
c3~= A~(Z**)D* - D .2 , (2.82)
where
Ao(w*)=u° + 2-N1,a, (A,u _ 2 p + 2 ) + 1 2+4p_3)+qu,,

(2.83)
A,,(Z*) 1

+ 2_~2~, { ( p _ 1)+ (uo+a), 2 }


l (u]+4p-3)u.+--~na*, (2.84)
4n
A , , ( Z * * ) = - u , ~ + ~ (1( p - 1 ) - u . 2
A, u.)

+ 2_~2~, { ( p _ 1)+ ,2

1 (u 2+4p_3)u, qu (2.85)
4n n a,
and u~ is the upper 1008% point of N(0, 1).
Classification statistics 83

3. Statistics of classification into one of two multivariate normal populations


with different covariance matrices

The two populations are now//~: N p ( ~ , ~ ) , i = 1 , 2 , and it is known that


~1 # ~2. When all the parameters are known, an optimum classification rule
(Bayes rule, minimax rule, etc.) is based on the quadratic forms of x, i.e.

U 1= (x - ~'I)'Zl-I( x - ~'1)- (x - &)tZ 21(X - &)

+log(l~,l/l~21). (3.1)
If ~e1= ~e2 = C0, then this reduces to

U2=(x-&)' ( ~-'-~2')(x-&)+log(l~,l/l~21 ). (3.2)

The distribution of these statistics is generally too complicated to evaluate the


PMC's of rules, except some special cases. When unknown parameters to be
estimated from samples are involved in the statistics, the distributional problem
becomes almost untractable. In the literature, the special structure of ~Y~and the
restriction of the class of rules are considered by several authors. For example,

x, = o / { ( 1 - p,)I, + pi~s;) (3.3)

(Bartlett and Please, 1963; Han, 1968),

~1 = O2"~2 (Han, 1969), (3.4)


1 Pl P2 P3 "'" P2 Pl
Pl 1 p~ P2 "'" P3 P2
~i ~-- Oi2 P2 p~ 1 p~ "'" P4 P3 (3.5)

Pl P2 P3 P4 "'" Pl 1

circular structure (Han, 1970). An optimum rule in the class of rules based on
linear functions of x is studied by Kullback (1952, 1958), Clunies-Ross and
Riffenburgh (1960), Anderson and Bahadur (1962), and Banerjee and Marcus
(1965). There are other studies on optimum rules when parameters are known or
unknown, but only a few approximations to or asymptotic expansions for the
distributions of classification statistics are known.
Okamoto (1961) considered the plug-in version of U2, apart from a constant,

Qo = ( x- eo)'( s;,' - So2' )( x- Co), when & is known, (3.6)


where
1 N~
Soi- ~, y. (x~ i' - Co)( x~
-~i' - - #
" ~o' t , i=1,2,
a=l
84 Minoru Siotani

and
Q=(x-£)'(S(1-S[1)(x-£), when ~0 is unknown , (3.7)

where
1
-r =
N1 + N2(Nl-rl + N2 X2 ), sample grand mean vector,

1 N,
1 2 (xo( i ) _ _x,)(xo
- (i)__- , i=1,2.
ot:l

He derived an asymptotic expansion for the distribution of Q0 in the special case.


He first suggested a method for reducing the number of dimensions from p to q
(~< p), holding the efficiency of classification as high as possible. The reduced
quadratic classification statistic is written by

Q~= ~ + ~ 1 - 1 z,2 (3.8)


i=1 i=p--q+s+l

for an appropriate choice of a value of s = O, 1,..., q, where lj's (11 >~ 12 9 . . . ~ lp)
are roots of IS02 - lSol ] = 0 and zj is thejth component of Z = F(x -- ~0), F being
a nonsingular matrix such that F'S01F =Ip, F'So2F = diag(ll, l 2..... lp). Okamoto
(1961) gave an asymptotic expansion for the distribution of Q~ only for q = 1, i.e.,
Q~ = { 1 - (1/l l) }zl2, in the following form:

P(2]l ) = P(Q~>k(I,)IH1}

= l - - ~ ( X ' k ( X -' )- )i ~ l --nl{tp'A+½~P~B}+O(n-2)' (3.9)

where n = Nl, ~1 is the largest root of IX2 - ? t ~ l [ = 0, k()tl) is the cut-off point
for the Bayes rule or minimax rule when q = 1 in the reduced form, which is a
function of ~1, k(/1) is obtained by replacing 2t1 by l I in k(?tl),

__ u -- 1

20 e -u/2,
Classificationstatistics 85

and putting 2, = 2t~, K = k(X), K ' = (d/dX)k(X), K"---- (d2/dX2)k(2t),


F = ]~f=2(2tl - ~tj) -1, and c = lim(N1/N2) 1/2,

A = ~ Z - ]- 2t2K"+XK ' X F - 2 -K 2~F+--


X--1 x-1 (x-_i)2
{ ±-

-K _ (A._ 1) 2 ~.-1 '

B_ (£_])2
2)~2 (X~( K
A.--1 K') 2
+c2( K _ X K , ) 2}
X--1 "

P(112) = P{Q~ < k ( l , ) l H 2 }

(3.10)

where putting again ~t = ~1,

C= 7t2K"+XK ' XF-•_------ i- - K X-1


()k-- 1) 2

+ c 2 [ ) t 2 K " + ) t K ' ( ) ~ F - - - -2- p + l )


)~--1
~ 2 m
A
+ K~(XF- p + 1) X--~ -+
L (x-i)? ,
D = (1/?t2)B.

Asymptotic expansion in the general case may, in principle, be obtainable, but it


seems to be quite complicated. The similar study for the classification statistic Q
or its reduced form Q* will be done in the future.
86 Minoru Siotani

Han (1969) treated the distributions of U l in (3.1) when ~1 = ~ , ~2 = °2~7-


Apart from a constant term, U 1 becomes

t:, = 0-17 (x - ~e2)'~-'(x - ~2)


= (a + 1 ) - 1 ( Y + a ~ ) ' ~ - 1 ( y + a a ) - a A2, (3.11)

where y = x - ~ l , I/a=02--1, d~=~e2-~e 1, A2:--8'~-1~ (the Mahalanobis


squared distance). When x comes from 111, y ~ Np(O, Z') and when x comes from
112, Y ~ Np( ~, 02~). Hence
( y + a a ) ' X - l( y + aO) (3.12)

is distributed as x~Z(a2A2) if x Comes from Hi, where X'p20"z) denotes a non-


central chi-square distribution with p d.f. and noncentrality parameter ~.2. If x
comes from H2, it is distributed as o2X'vZ(o2a2A2).
For the case when ~/ are unknown but o z and ~ are known, Han (1969)
considered the plug-in version of U I of (3.11), i.e.

vl = ( x - 1 ( x _ ~ 2 ) , ~ _ l ( x _ :~2)" (3.13)

Since V1 is invariant under any linear transformation, we may without loss of


generality let ~e1 = O, ~e~= (A, O, 0 ..... 0), and ~ = Ip for the distributional problem.
Thus V1 is written apart from a constant as

V, = (x - ~, - a(~, - £ 2 ) } ' { x - - ~ 1 - a ( £ 1 - ~2)}


- a(a + 1)(.~,- £2)'(.~1- :r2). (3.14)
The limiting distribution of V1 is easily seen to be the same distribution of

[( y + aO)'X-1( y + a O ) - a ( a + 1)~2],

i.e., of [(noncentral chi-square v a r i a t e ) - a ( a + I)A2], whose characteristic func-


tion is

~t(t) = (1--2it) V/2exp( - - i t a ( a + 1)A2 + ~ it a2A2j/ (3.15)

when x comes from/I1 and

q,2(t) = ( 1 - 2 i t o 2 ) - p / 2 e x p ( - i t a ( a + 1)A2 -~ it (a + 1)2A21


1 --2ito 2 J
(3.16)
Classificationstatistics 87
when x comes from H 2. Han (1969) derived an asymptotic expansion for the
distribution of V~ up to the second order. Let G(pi)(x) (i = 1,2) be the c.d.f, of the
random variable having the characteristic function q~i(t). Then the formula up to
terms of the first order is written as

F,(v) = P( V1~<v I/7, }

= GO)(v) + ~ ( a,(d )GO)(v) + a2(d )Gp(~2(v) + a3(d ) G(~ 4( I~) }

0"2
+N + 02.
(3.17)

r2( v) = P{ V, <<-vJn2)

: G(2)(v)+ ~-1 (bl(d)G(2)(v)+ b2(d)G(2)+2(~)+ b3( d )G(p2)+4(v ) }


0"2
-}-~22 (bl( d )Gp(2)(~)-[- b4( d )Gp(2+)2(I)) --[-bs( d )ap(2+)4(1.))) -[- O2
(3.18)

where N~ (i = 1,2) is the size of sample taken from H i, d = d/dv, and

al( d ) = pa( a + 1)d +2a2(a + 1)2A2d2,


a2(d ) = - p(a + 1)2d - 4 a 2 ( a + 1)2AZd2, a3(d ) = 2a2(a + 1)2A2d2,
a4(,d ) = _ pa2d - 4 a 3 ( a + 1)A2d2, as(d ) 2 a 4 A 2 d 2 '
=

bl(d)=al(d),
b2(d ) = - p(a + 1)2d - 4 a ( a + 1)3A2d2, b3(d ) = 2(a + 1)4AZd2,
b4(d ) = _ paZd - 4 a z ( a + 1)2AZd2. bs(d ) = 2a2(a + 1)2A2d2,

Han (1970) considered the distribution of the plug-in version of U1 of (3.1) apart
from the term log[ 2~11/ [2~2[, when ~ have the circular Structure given in (3.5). In
this case, there exists an orthogonal matrix H with the (j, k)-element

hjk= P 1/2[C0s (j--1)(pk--1)2~r +sin ( j - 1)(k-p 1)2~r ] (3.19)

such that H ' . ~ i H = diag(o/], °'2i2,"".,0"2) (cf. Wise (1955)). Since U1 is invariant
88 Minoru Siotani

under any linear transformation, U~ can be expressed, apart from a constant, as

j=,
11t(
a?j o?j xj 1/o---Tjj 1/o2j
(3.20)

where xj a n d ~ij are thejth component of x and 6, respectively. If 02j > o2 for all
j or equivalently 2J1- ~2 is positive definite, then when x comes from Hi, i = 1 or
2, V2 is distributed as the sum of 52X{2(~2,.),
J J
where X~2(`/2)
J
is a noncentral
chi-square distribution with 1 d.f. and noncentrality parameter ,/2. = m2j/rt2, and

( 1 1 )'/2( ~2j/o22,-~,j/02j)
(3.21)
rn,, = °2J °b litj 1/022j -- 1/o2 '
rtj2 = oi2j(1/ozj -1/o2j ) (3.22)

According to a X 2 approximation due to Patnaik (1949), the distribution of V2 is


approximated by atx~, where

X, rt4 + 2Ej'rt2m~, 1
( Z ~ + ~m2:)- (3.23)
ai = ~j,l.i ~ _[_ ~ , j m 2 j , Pt = --
at j J -

When set are unknown and estimated by sample means -~t, the plug-in statistic

V3= E
,,
1
022,
1
x,-
x2,/Oz2,--x,,/02j
1/o22, l / 4
t 2] (Xl > 2~2) '

(3.24)

is also approximated by a i, X .2 7 when x comes from H t, where a* and v* can be


calculated by the Patnaik method.
When all the parameters are unknown, Han (1970) derived an asymptotic
expansion for the plug-in statistic

j=, 4, 4 1)t x, 1/4,-1/4


(3.25)

2
under the assumption that 271 > ~2, where stj- N i i(xtj~- Ytj)2/nt, n / = N~- 1,
]~= - -

N~ being the sample sizes. It is noted here that, without loss of generality,
we may let ~l = O, 271 = lp, ~ = ~ = (~01, ~02. . . . . ~0p), and '~2 = '~0 =
Classification statistics 89

• 2 2
dlag(o61, or~2. . . . . 02p). The results were given in the following forms:

F~(v) = P{ v4--< v1/71}


P
= E {%(v)-q,jo~)~(v)+q~#(,~(v)
j=l

-- q3jG(73)(v ) + q4jG(94)( v ) ) q- 02, (3.26)

F~(v) = P( V4 ~<vIH2}
P
= E {L,j(v)--,-,+L(,y(v)+r~+L(?~(v)
j=l

- rsjL~73](v) + r4jL~94)(v)} + O2 (3.27)

where G,j(v) is the c.d.f, of a noncentral chi-square distribution with v d.f. and
4 2 G(f)(v) is the kth derivative of G,j(v), L,j(v)
noncentrality parameter ~oj/Ooja),2
is the c.d.f, of a noncentral chi-square distribution with v d.f. and noncentrality
parameter ~Oj/OgjOlj,
2 2 2 L~))(v) is the kth derivative of L v j ( t ) ) , eli= (1/%2) - 1, and

q,j 1/%Nl + l/o~ajNz + aj/nl +oo)bj/n2,


1 2~2j 1 2~5o
2j 1(
q2J--N100).a)4 2 "q- N2
%a) -~-
62 af +2aj+cj)

+l (o~jb2_ 2b, _ %jdj),


4 t
g.,

% Y (4cj + 2cjaj)/n, + (44-2o~4b j)/n2


q4j = cf/nl + °~df/n2,

with the notations

%- %4ja2 + 1 ,
o6ja)
6 3
i)
b,: C 1 los b'J = o~j
2--( ~2j 3~2j )
04'
o6ja
6 3
) 2--'-'-'~
%a) + 1 '
2~2j 2~°2J ( 1 +2 )
cj - Oo~aj '
2~2j

di-4 ,, d;- %%
,o ( 4 ~] j -41, .
90 Minoru Siotani

and
rlj = 1/Nlaj+l/N2o~aj+ A~/n ' + OojSj/n~,
4 ,

2~2j 2~j 1 2
rzj-- m + - + -~((Aj + 2 0 ~ A j + C})
N1a2 Nzoo4j#

+ <v;),
~
r 3 j (4o~Cj + 2cjaj)/n 1+(4o~Dj--2o~DjBj)/n2,
% = CjZ/n, + o4jOZ/n2
with the notations

Aj - a~ --+o~., A ~ = 2 / = 3 + _--57- + _ -

1 3~5~j 2
OJ = 4
Oo)a)
2 O.02j' n ; - - %a)
6 3+ %4 '
2 2
2o6j~oj
C j . - -aj- -
aj j
2fo2j ~2
vJ- j,
Oojaj ~ %a~ I

Kullback (1952, 1958) suggested a rule based on the linear statistic which
maximizes the divergence J(1,2) between Np(~el, ~q) and Np(¢2, X2). Matusita
(1967) considered a minimum distance rule based on the distance

d<<, n2)-- If( I,(vT S- Cf2(x) ) am]"11/2

or equivalently based on the affinity p ( H l , / I 2 ) = f f l ( ~ f 2 ( ~ dm. He dis-


cussed the different cases according as ~/'s and ~,i's are known or unknown and
gave some bounds for the probability of the success rate.

4. Statistics in the non-normal case and in the discrete case

The description given here is not for a general explanation on the classification
problem but is a short note on the main topic of this chapter in the non-normal
and discrete cases.

4.1. Classification for multivariate dichotomous responses


This is the case of the p × 1 random vector x'= (x I..... xp), each component x i
taking values only 0 or 1. In this binary or dichotomous case, x generates s = 2 p
Classification statistics 91

possible patterns of zeros and ones. We call each unique pattern a state and with
each state a probability is associated. In the case of two groups/71 a n d / / 2 , we
denote the probability distributions of x in /71 and //2 by p ( x ) and q(x),
respectively. More specifically, we have ( P l , P2,...,Ps) and (ql, q 2 , . . . , q s ) i n / / 1
and/72, respectively, where pi is of the ith state i n / / 1 and q~ is the probability of
the ith state in //2. If p ( x ) and q ( x ) or Pi, i = 1 ..... s and q~, i = l ..... s are
known, then the optimal classification procedure is based on the likelihood ratio
p ( x ) / q ( x ) or equivalently on

L(x) = log{p(x)) -log{q(x)}. (4.1)

The rule allocates an individual with response pattern x into H 1 if L ( x ) ~ c and


into/72 if L ( x ) < c; in other words, we classify an individual with state i into H 1
if p i / q i >>-c* and i n t o / / 2 if P i / q i < c*.
In practice, the population parameters (state or cell probabilities) are rarely
known. The usual approach is to use the plug-in rule by replacing each parameter
by the relative frequencies or in this case the ML estimates /3g = n g / n and
gli = m i / m , where n i ( m i ) is the number of sample observations of state i out of n
(m) observations known to come from H 1 (//2)- The plug-in rule is then: classify
an individual with state i into Hi if p i / ~ I> c and into/72 if/O~/~ < c.
When the number of states is large, we encounter a trouble-some problem of
spareness, in which some of the states or multinomial cells may have no data to
use for the estimation; this means that some parameters are inestimable but may
be required for classifying some future observations. To overcome this compli-
cation it was proposed to impose a further structure on the state probabilities.
Bahadur (1961) gave the following series representation for p ( x ) = p ( x 1. . . . . Xp ):

(4.2)

where

f(x)=l+ ~ rgjyiYj+ • rijkyiyJk+'''+rl2...pyly2"''y p,


i<j i<j<k

~ = P{x~=ll//~), yg=(Xi-~i)/~g(1-ai), (4.3)

rij. . .k = Ep( YiYj" " "Yk ).

Similarly

(4.4)
92 Minoru Siotani

where

h(x)=l+ ~ suzizj+ E SijkZiZjZk+''"-}-S12...pZlZ2"''Zp,


i<j i<j<k

ili=P(xi=llH2}, Zi=(Xi--ili)/~ili(1--ili), (4.5)

Sij...k: Eq( ZiZj" " " Zk).

U sing the truncation of these representations and denoting Ep{ L (x) }, Ep [L ( x ) -


Ep{L(x)}] 2, Eq(L(x)), Eq[L(x)-- Eq(L(x))] 2 by /~, o2,/~2, o~, respectively,
Bahadur (1961) showed that if p is large, if

J : Ix, -- t~2 = Ep[log(p(x)/q(x)}] -- Eq[log(p(x)/q(x))]

(the symmetric Kullback-Leibler information measure) is small, and if x~'s are not
highly interdependent, then L(x) is approximately normally distributed in /71
and/72 with means/~l and /~2 and variances o~ and o2, respectively. He also
showed that o~ and a 2, under some conditions on p and q, may be approximated
by j l / 2 = (l~l-/x2) ~/2. It should be noted that/~ > 0 and/x 2 < 0 unless p and q
are identical distribution. The PMC's associated with a cut-off point c are then

C--~l
(4.6)
(x: L(x)<~c}

ilL(C) = ~
{x: L(x)>c}
q(x)~( -- C+/~2
jl/2
)j" (4.7)

Solomon (1961) used the representations (4.2) and (4.4) to assess the loss of
information incurred by the approximations to p(x) and q(x) by exploring the
PMC's of classification procedures using test-item dichotomous response data.
Moore (1973) discussed and evaluated five procedures for classification with
binary variables. Among them the plug-in version of the first and the second
order approximations to the Bahadur models (4.2) and (4.4), i.e., the first
approximation

p
p(')(x) = 1-[ aX'( 1 - - ai)
,~1 x,
i=1
FA = P
(4.8)
q(')(x) = 1-[ ill x '(1-- fl,) , - x
',
i=1
Classification statistics 93

and the second approximation

p(2)(x)= [fi a;'(1--%),,-x,] (


]exp~ 1+ 2 rijYiYj} ,
i=1 i<j
SA z (4.9)
q(2'(x)=[ fi fl~"(l--fli)'-X']exp{l+ ~ sijzizj},
i=1 i<j

were treated, where a~ and fli in them are estimated unbiasedly by

&~= ~ n ( x ) / n , fl~: ~ m ( x ) / m , (4.10)


s, &
where n, m are sizes of samples independently drawn f r o m / I 1 a n d / I 2 , respec-
tively, n(x), re(x) are frequencies of a state x i n / / 1 and 112, respectively, and S/
is the set of all states x with x, = 1. r~j and sij are estimated by corresponding
sample corrdation coefficients

~&sn(x)/n - &i&j ]~si,m(x)/rn -fliflj (4.11)


Pi'= < ( 1 - < ) S , ( l - '

where Sis is the set of all states x with x i = 1 and xj -- 1.


Martin and Bradley (1972) developed a probability model for the joint mass
function &( x ) of a p-dimensional state random vector x = ( xl, x 2..... Xp), x i = 0
or 1, in the form

p j ( x ) = f ( x ) { l + h ( a j , x)}, xEl-Ij, j=l,2 (4.12)


t

where h(aj, x) is a linear combination of orthogonal polynomials in x and

f(x)----wlfl(x)+Wzf2(x ), Wi>~O, WI+W2----1. (4.13)

They discussed the estimation of aj and f under some constraints. Using these
estimates, we have the plug-in classification rule: classify x into H 1 if h(al, x)>I
h(a=, x), which is equivalent, when all estimators are included, to the rule:
classify x into H 1 if n l ( x ) / n i ~> n 2 ( x ) / n 2, where n i are sizes of samples indepen-
dently taken from H i, and ni(x ) are frequencies of state x in H i.
There are many ways to represent binary data. Cox (1972) gave a brief
overview of the properties and problems of various methods used in multivariate
binary distributions. Among them the representation

P(x)=expao + •aix, + E aijxixj-~- "" (4.14)


i i<j
94 Minoru Siotani

is easy to handle. In a manner analogous to that in a factorial design, a 0 is called


the overall effect, ai's are main effects, a~j's are first order interactions, etc. An
ith order interaction model is a model of the representation (4.14) in which all
interactions of an order higher than i equal to zero.
Kronmal and Tarter (1968) considered an estimation of probability densities by
Fourier series methods, expressing the probabilities as linear combinations of an
orthogonal basis. Using this idea, Ott and Kronmal (1976) discussed the represen-
tation of state probabilities of a multivariate binary density and, using it,
sample-based classification rules for the case of sampling from a mixture of two
multivariate dichotomous populations. Goldstein (1977) succeeded this idea to
provide classification rules when independent random samples are available from
the two groups H~ a n d / / 2 - The representation considered by Ott and Kronmal
for a state probability P(x) of x'= (x l, x 2..... xp), xe = 0 or 1, is

P ( X ) = 71 E dr%(x) (4.15)
rc S(x)

where s = 2 p as before, r = ( r l, r2..... rp) is a binary indexing vector which


numbers all of the possible states, and %(x) is the r t h orthogonal function at x
defined by
P
%(x) = ( - 1) x'' where x'r= E xiri. (4.16)
i=1

S(x) is a set of all the s state points x and the coefficients dr are, using the
orthogonality of %(x), evaluated by

dr = E(%(x)}. (4.17)

Let p~(x) be state probabilities associated w i t h / / i . Then assuming equal prior


probabilities, an optimal classification rule using the above representation is:

Classify x into/-I1 ( / / 2 ) if
(4.18)
E E
r~S(x) r~S(x)

and randomly otherwise, where the sets (dj, r} and (d2, r) are associated with H I
and /72, respectively. If all the parameters are to be estimated from available
independent samples, the plug-in rule is simply the rule given by (4.18) with {di, r)
replaced by their estimates

di.r = E %(x)ni(x)/ni (4.19)


xES(x)

where n i and ni(x) are the same notations as before.


Classification statistics 95

The classification rules or their plug-in versions based on the representations


explained above were examined and compared by Monte Carlo sampling experi-
ments by the authors cited above and others.

4.2. Classification for multinomial distributions


The binary case in the last subsection is a special case of the multinomial
classification, where each component of a random vector x ' = (x j,..., xp) assumes
a finite number of distinct values. If x i takes s i values, then x generates s = I-I/P=lSi
possible patterns or states.
Matusita (1956) proposed a minimum distance rule. His distance is defined by
the square root of

[IF_GII2 = ~ (f~/_~//)2 (4.20)


i=1

where (fl, f2 ..... f~) and (gl, g 2 , ' " , g s ) a r e state probabilities corresponding to
the two distributions F and G, respectively. Suppose that independent samples of
sizes n i and n 2 from H l and H E are available. We wish to classify a new sample of
size n o into either H l o r / 7 2. Let S1, $2, and SO be the empirical distributions
formed on the basis of these independent samples. Then Matusita's sample-based
rule is: classify the new sample into H 1 (H2) if

1152-S011 > ( < ) l l S ~ - S 0 1 1 . (4.21)

He obtained lower bounds for PCC and an approximate value of PCC when
sample sizes are large.
Dillon and G,oldstein (1978) considered a modification for the case of n 0 = 1;

Classify x into H t if IIS~' - S2 II > IIS~ - S~ II; otherwise to ~r2 , (4.22)

where S* (i =1,2) is the empirical distribution formed by n i + 1 observations


including a new observation x. They compared this rule with other commonly
used rules by Monte Carlo study.
Cochran and Hopkins (1961) obtained the form of the Bayes rules and
considered especially the ML rule. They further discussed a sample-based rule
using estimates of state probabilities, and suggested a correction for bias in the
estimation of the PMC due to the plug-in. They showed that the expectation of
the actual error (PMC that results when the rule is based on estimated log
likelihood ratio) is always greater than or equal to the optimal error (PMC that
results when all the parameters are known and the rule is optimum) and for any
fixed same size, the expected difference between optimum and actual non-error
rates depends, in a complicated way, on the multinomial cell or state probabili-
ties.
96 Minoru Siotani

But Glick (1972, 1973) showed rigorously that the difference has an upper
bound which diminishes to zero exponentially as sample size n ~ oo and also
P(actual = optimum} --, 1 exponentially. He also gave a proof of the proposition
that the expected excess of the apparent non-error rate (Smith's resubstitution or
reallocation estimator of PMC) over the optimum non-error rate has an upper
bound proportional to (n-l/2)an where a < 1. Based on Glick's work, Goldstein
and Rabinowitz (1975) discussed a sample-based procedure for selecting an
optimum subset of variables.
Glick's results contain a generalization of the results obtained by Cochran and
Hopkins (1961), and Hills (1966).

4. 3. Classification for parametric non-normal continuous type distributions


Cooper (1962, 1963) considered a multivariate distribution of Pearson type II
or type VII as the basic distribution of x in H r He (1965) also studied the case
where p.d.f, of x in H i is

f~( x) = (O,(X)} 1/2],


Ail.~,il-':g,[
Q i ( x ) : a positive definite quadratic form in x, (4.23)
gi (u) : a decreasing function of u I> O.

Cooper studied the LR statistics for the distributions mentioned above.


Cox (1966) and Day and Kerridge (1967) both suggested the logistic form for
posterior probabilities as a basis for discrimination between two populations H~
and H 2. Day and Kerridge considered estimating the classification rule where
sampling was from the mixture of H l and H 2, while Cox was not concerned with
discriminant estimation. Anderson (1972) extended the C o x - D a y - K e r r i d g e ap-
proach to the situation where separate samples were taken from each population
and further to classification between three or more populations. Anderson
intended to cover the case where continuous and polychotomous data both occur.
The basic p.d.f, of x in/-/i considered by Day and Kerridge (1967) was

f(x)=diexp(-l(x--lLi)'~,-l(x--i~i)}g(x), i=1,2. (4.24)

The posterior probability of the hypothesis Hi: H = H i given x is expressed as


exp(a'x + b)/( 1 + exp(a'x + e)}. Anderson's (1972) extended form is

p( Hilx) = exp(ao + a~x)p( Hmlx),


m--I }--1
p(H,,Jx)= 1+ ~ exp(aio+a~x ) (4.25)
j=l
Classification statistics 97

for the posterior probabilities of Hi: H : H i , i : 1..... m, given x. They consid-


ered the ML estimates of unknown parameters to obtain a plug-in classification
rule.
Since Fisher's sample linear discriminant function (LDP) defined by (2.4) is
derived without the assumption of normality, and since its ease in practical use
and computation, it is attractive to the practicians. Hence the knowledge of
performance of the L D F in non-normal conditions would be valuable.
Lachenbruch et al. (1973) considered the robustness of Fisher's L D F under
non-normal distributions generated from the normal distributions by using non-
linear transformations suggested by Johnson (1949). Zhezhel (1968) examined the
efficiency of L D F for the case of two arbitrary distributions with equal covari-
ance matrices. A general review of the published works on the performance of
L D F was given by Krzanowski (1977) who also discussed the cases of the
LDF-use when all the variables are discrete and when some variables are discrete
and the remainder continuous.

4.4. Classification when both continuous and discrete variables are involved
Chang and Afifi (1974) suggested a method suitable for one binary and p
continuous variables, based on the location model. An extension to the case of q
binary and p continuous variables variables was proposed by Krzanowski (1975)
under Olkin and Tate's (1961) location model and L R classification, and its
plug-in version was considered. He also discussed on the conditions for success or
failure in the performance of LDF.

References

[1] Anderson, J. A. (1972). Separate sample logistic discrimination. Biometrika 59, 19-36.
[2] Anderson, T. W. (1951). Classification by multivariate analysis. Psychometrika 16, 31-50.
[3] Anderson, T. W. (1958). An Introduction to Multivariate Statistical Analysis. Wiley, New York.
[4] Anderson, T. W. (1973a). An asymptotic expansion of the distribution of the studentized
classification statistic. Ann. Statist. 1, 964-972.
[5] Anderson, T. W. (1973b). Asymptotic evaluation of the probabilities of misclassification by
linear discriminant functions. In: T. Cacoulos, ed., Discriminant Analysis and Applications,
17-35. Academic Press, New York.
[6] Anderson, T. W. and Bahadur, R. R. (1962). Classification into two multivariate normal
distributions with different covariance matrices. Ann. Math. Statist. 33, 420-431.
[7] Bahadur, R. R. (1961). On classification based on response to N dichotomous items. In: H.
Solomon, ed., Studies in Item Analysis and Prediction, 169-176. Stanford Univ. Press, Stanford.
[8] Banerjee, K. and Marcus, L. F. (1965). Bounds in a minimax classification procedure. Bio-
metrika 52, 653-654.
[9] Bartlett, M. S. and Please, N. W. (1963). Discrimination in the case of zero mean differences.
Biometrika 50, 17-21.
[1o] Bowker, A. H. (1961). A representation of Hotelling's T 2 and Anderson's classification statistic
W in terms of simple statistics. In: H. Solomon, ed., Studies in Item Analysis and Prediction,
285-292. Stanford Univ. Press, Stanford.
98 Minoru Siotani

[11] Bowker, A. H. and Sitgreaves, R. (1961). An asymptotic expansion for the distribution function
of the W-classification statistic. In: H. Solomon, ed., Studies in Item Analysis and Prediction,
293-310. Stanford Univ. Press, Stanford.
[12] Chang, P. C. and Afifi, A. A. (1974). Classification based on dichotomous and continuous
variables. J. Amer. Statist. Assoc. 69, 336-339.
[13] Clunies-Ross, C. W. and Riffenburgh, R. H. (1960). Geometry and linear discrimination.
Biometrika 47, 185-189.
[14] Cochran, W. G. (1964). Comparison of two methods of handling covariates in discriminatory
analysis. Ann. Inst. Statist. Math. 16, 43-53.
[15] Cochran, W. G. and Bliss, C. I. (1946). Discriminant functions with covariance. Ann. Math.
Statist. 19, 151-176.
[16] Cochran, W. G. and Hopkins, C. E. (1961). Some classification problems with multivariate
qualitative data. Biometrics 17, 10-32.
[17] Cooper, P. W. (1962a). The hyperplane in pattem recognition. Cybernetica 5, 215-238.
[t8] Cooper, P. W. (1962b). The hypersphere in pattern recognition. Information and Control 5,
324-346.
[19] Cooper, P. W. (1963). Statistical classification with quadratic forms. Biometrika 50, 439-448.
[20] Cooper, P. W. (1965). Quadratic discriminant functions in pattern recognition. I E E E Trans.
Inform. Theory 11, 313-315.
[21] Cox, D. R. (1966). Some procedures associated with the logistic qualitative response curve. In:
F. N. David, ed., Research Papers in Statistics: Festschrift for J. Neyman, 55-71. Wiley, New
York.
[22] Cox, D. R. (1972). The analysis of multivariate binary data. Appl. Statist. 21, 113-120.
[23] Das Gupta, S. (1965). Optimum classification rules for classification into two multivariate
normal populations. Ann. Math. Statist. 36, 1174-1184.
[24] Das Gupta, S. (1973). Theories and methods in classification: A review. In: T. Cacoullos, ed.,
Discriminant Analysis and Applications, 77-137. Academic Press, New York.
[25] Day, N. E. and Kerridge, D. F. (1967). A general maximum likelihood discriminant. Biometrics
23, 313-323.
[26] Dillon, W. R. and Goldstein, M. (1978). On the performance of some multinomial classification
rules. J. Amer. Statist. Assoc. 73, 305-313.
[27] Elfving, G. (1961). An expansion principle for distribution functions with applications to
Student's statistic and the one-dimensional classification statistic. In: H. Solomon, ed., Studies in
Item Analysis and Prediction, 276-284. Stanford Univ. Press, Stanford.
[28] Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Ann. Eugenics
7, 179-188.
[29] Friedman, H. D. (1965). On the expected error in the probability of misclassification. Proc.
IEEE 53, 658-659.
[30] Fujikoshi, Y. and Kanazawa, M. (1976). The ML classification statistic in covariate discriminant
analysis and its asymptotic expansions. Essays in Probability and Statistics (Ogawa Volume),
305-320. Shinko-Tsusho, Tokyo.
[31] Gilbert, E. S. (1968). On discrimination using qualitative variables. J. Amer. Statist. Assoc. 63,
1399-1412.
[32] Glick, N. (1972). Sample-based classification procedures derived from density estimators. J.
Amer. Statist. Assoc. 67, 116-122.
[33] Glick, N. (1973). Sample-based multinomial classification. Biometrics 29, 241-256.
[34] Goldstein, M. (1976). An approximate test for comparative discriminatory power. Multiv.
Behav. Res. 11, 157-163.
[35] Goldstein, M. (1977), A two-group classification procedure for multivariate dichotomous
responses. Multiv. Behav. Res. 12, 335-346.
[36] Goldstein, M. and Rabinowitz, M. (1975). Selection of variates for the two-group multinomial
classification problem. J. Amer. Statist. Assoc. 70, 776-781.
[37] Han, C. P. (1968). A note on discrimination in the case of unequal covariance matrices.
Biometrika 55, 586-587.
Classification statistics 99

[38] Han, C. P. (1969). Distribution of discriminant function when covariance matrices are propor-
tional. Ann. Math. Statist. 40, 979-985.
[39] Hart, C. P. (1970). Distribution of discriminant function in circular models. Ann. Inst. Statist.
Math. 22, 117-125.
[40] Hills, M. (1966). Allocation rules and their error rates. J. Roy. Statist. Soc. Set. B 28, 1-31.
[41] Hills, M. (1967). Discrimination and allocation with discrete data. Appl. Statist. 16, 237-250.
[42] Hill, G. W. and Davis, A. W. (1968). Generalized asymptotic expansions of Cornish-Fisher
type. Ann. Math. Statist. 39, 1264-1273.
[43] John, S. (1960). On some classification problems, I, II. Sankhy~ 22, 301-308,309-316.
[44] John, S. (1963). On classification by the statistics R and Z. Ann. Inst. Statist. Math. 14,
237-246.
[45] Johnson, N. L. (1949). Systems of frequency curves gerated by methods of translation.
Biometrika 36, 149-176.
[46] Kanazawa, M. (1979). The asymptotic cut-off point and comparison of error probabilities in
covariate discriminant analysis. J. Japan Statist. Soc. 9, 7-17.
[47] Kanazawa, M. and Fujikoshi, Y. (1977). The distribution of the Studentized classification
statistics W* in covariate discriminant analysis. J. Japan Statist. Soc. 7, 81-88.
[48] Kanazawa, M., McGee, R. I., and Siotani, M. (1979). Comparison of the three procedures in
covariate discriminant analysis. Unpublished paper.
[49] Kronmal, R. and Tarter, M. (1968). The estimation of probability densities and cumulatives by
Fourier series methods. J. Amer. Statist. Assoc. 63, 925-952.
[50] Krzanowski, W. J. (1975). Discrimination and classification using both binary and continuous
variables. J. Amer. Statist. Assoc. 70, 782-790.
[51] Krzanowski, W. J. (1976). Canonical representation of the location model for discrimination or
classification. J. Amer. Statist. Assoc. 71, 845-848.
[52] Krazanowski, W. J. (1977). The performance of Fisher's linear discriminant function under
non-optimal conditions. Technometrics 19, 191-200.
[53] Kullback, S. (1952). An application of information theory to multivariate analysis, I. Ann. Math.
Statist. 23, 88-102.
[54] Kullback, S. (1959). Information Theory and Statistics. Wiley, New York.
[55] Kudo, A. (1959). The classificatory problem viewed as a two-decision problem. Mere. Fac. Sci.
Kyushu Univ. Ser. A. 13, 96-125.
[56] Kudo, A. (1960). The classificatory problem viewed as a two-decision problem, II. Mere. Fac.
Sci. Kyushu Univ. Ser. A 14, 63-83.
[57] Lachenbruch, P. A., Sneeringer,~C., and Revo, L. T. (1973). Robustness of the linear and
quadratic discfiminant functions to certain types of non-normality. Comm. Statist. 1, 39-56.
[58] Lachenbruch, P. A. (1966). Discriminant analysis when the initial samples are misclassified.
Technometrics 8, 657-662.
[59] Lachenbruch, P. A. and Mickey, M. R. (1968). Estimation of error rates in discriminant analysis.
Technometrics 10, 1-11.
[60] Linhart, H. (1961). Zur Wahl yon Variablen in der Trennanalyse; Metrika 4, 126-139.
[61] Matusita, K. (1956). Decision rule, based on the distance, for the classification problem. Ann.
Inst. Statist. Math. 8, 67-77.
[62] Matusita, K. (1967). Classification based on distance in multivariate Gaussian cases. Proc. Fifth
Berkeley Syrup. Math. Statist. Prob. 1, 299-304:
[63] Martin, D. C. and Bradley, R. A. (1972). Probability models, estimation and classification for
multivariate dichotomous populations. Biometrics 28, 203-222.
[64] McGee, R. I. (1976). Comparison of the W* and Z* procedures in covariate discriminant
analysis. Dissertation submitted in partial fulfillment of P h . D . requirements. Kansas State
Univ.
[65] McLachlan, G. J. (1972). Asymptotic results for discriminant analysis when the initial samples
are misclassified. Technometrics 14, 415-422.
[66] McLachlan, G. J. (1973). An asymptotic expansion of the expectation of the estimated error rate
in discriminant analysis. Austral. J. Statist. 15, 210-214.
1O0 Minoru Siotani

[67] McLachlan, G. J. (1974a). The asymptotic distributions of the conditional error rate and risk in
discriminant analysis. Biometrika 61, 131-135.
[68] McLachlan, G. J. (1974b). An asymptotic unbiased technique for estimating the error rates in
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 101-120

Bayesian Discrimination*

Seymour Geisser

1. Introduction

The complementary problems of allocation to and separation of several popula-


tions are reviewed and amplified. In either case we assume we have two or more
identifiable populations whose distribution functions for a set of manifest varia-
bles are known up to some specifiable parameters. An identifiable sample is
drawn from the populations. In one case future observations to be generated, or
observations possibly already in hand but with unknown latent identity, require
labeling, diagnosis or allocation. In the second case we require some simple
functions (discriminants) which maximally distinguish or separate these popula-
tions. This is attempted in order to throw some light on relevant issues or to
formulate hypotheses concerning these populations. Sometimes the goal is to
make high dimensional data more immediately accessible and more manageable
by severely reducing their dimensionality yet retaining a large degree of the total
information available in the data.
We first describe a general Bayesian procedure for allocation and then give
applications for the most popular of models in this area, the multivariate normal
one. The problem of separation from a Bayesian viewpoint is then presented.
Often both allocation and separation are part of the same study, and some
compromise solutions, which can serve in a near optimal manner for both
purposes, are obtained and applied to multivariate normal populations. A sample
reuse procedure in conjunction with a semi-Bayes approach which is useful for
selecting the appropriate allocatory/separatory model is also presented. Further
areas for examination via the Bayesian approach are proposed.

2. Bayesian allocation

Suppose we have $k$ populations $\pi_i$, $i = 1, \ldots, k$, each specified by a density $f(\cdot \mid \theta_i, \psi_i)$ where $\theta_i$ is the set of distinct unknown parameters of $\pi_i$; $\psi_i$ is the set of distinct known parameters of $\pi_i$; $X_i$ are the data obtained on $\pi_i$ based on $N_i$ independent (vector) observations; and $Z = z$ is a new (vector) observation to be assigned which has prior probability $q_i$ of belonging to $\pi_i$, $\sum_{i=1}^{k} q_i = 1$.

*Work was supported in part by grant NIH-GM-25271.

Further, let $\theta = \bigcup_{i=1}^{k} \theta_i$, $\psi = \bigcup_{i=1}^{k} \psi_i$, i.e., the total set of distinct unknown and known parameters, respectively, and $g(\theta \mid \psi)$ be the joint prior density of $\theta$ for known $\psi$. Let $L(X_i \mid \theta_i, \psi_i)$ be the likelihood of the sample obtained from $\pi_i$ with the joint likelihood obtained on $\pi_1, \ldots, \pi_k$ given by

$L(X \mid \theta, \psi) = \prod_{i=1}^{k} L(X_i \mid \theta_i, \psi_i)$  (2.1)

where $X$ represents the set of all the data samples $X_1, \ldots, X_k$, often referred to as the training sample. Hence the posterior density, when it exists, is

$p(\theta \mid X, \psi) \propto L(X \mid \theta, \psi)\, g(\theta \mid \psi)$,  (2.2)


from which we may obtain the predictive density of $Z$ on the hypothesis that it was obtained from $\pi_i$, which results in

$f(z \mid X, \psi, \pi_i) = \int f(z \mid \theta_i, \psi_i, \pi_i)\, p(\theta \mid X, \psi)\, d\theta$.  (2.3)

Occasionally it is more convenient to express the above equation in the following manner:

$f(z \mid X, \psi, \pi_i) = \int f(z \mid \theta_i, \psi_i, \pi_i)\, p(\theta_i \mid X, \psi)\, d\theta_i$  (2.4)

where

$p(\theta_i \mid X, \psi) = \int p(\theta \mid X, \psi)\, d\theta_i^{c}$  (2.5)

and $\theta_i^{c}$ is the complement of $\theta_i$, $\theta_i^{c} \cup \theta_i = \theta$. We then calculate the posterior probability that $z$ belongs to $\pi_i$,

$\Pr\{z \in \pi_i \mid X, \psi, q\} \propto q_i f(z \mid X, \psi, \pi_i)$  (2.6)

where $q$ stands for $(q_1, \ldots, q_k)$. For allocation purposes we may choose to assign $z$ to that $\pi_i$ for which (2.6) is a maximum, if we ignore the differential costs of misclassification. We could also divide up the observation space of $Z$ into sets of regions $R_1, \ldots, R_k$, where $R_i$ is the set of regions for which $u_i(z) = q_i f(z \mid X, \psi, \pi_i)$ is maximal, and use these as allocating regions for future observations. We may also compute 'classification errors,' based on the predictive distributions, which are in a sense a measure of the discriminatory power of the variables or characteristics. If we let $\Pr\{\pi_j \mid \pi_i\}$ represent the predictive probability that $z$ has been classified as belonging to $\pi_j$ when in fact it belongs to $\pi_i$, then we obtain

$\Pr\{\pi_i \mid \pi_i\} = \int_{R_i} f(z \mid X, \psi, \pi_i)\, dz$,  (2.7)

$\Pr\{\pi_j \mid \pi_i\} = \int_{R_j} f(z \mid X, \psi, \pi_i)\, dz, \qquad i \neq j$,  (2.8)

$\Pr\{\pi_i^{c} \mid \pi_i\} = 1 - \int_{R_i} f(z \mid X, \psi, \pi_i)\, dz$  (2.9)

where $\pi_i^{c}$ stands for all the populations with the exception of $\pi_i$. Then the predictive probability of a misclassification is

$\sum_{i=1}^{k} q_i \Pr\{\pi_i^{c} \mid \pi_i\} = 1 - \sum_{i=1}^{k} q_i \Pr\{\pi_i \mid \pi_i\}$.  (2.10)

Prior to observing Z, the smaller the predictive probability of a misclassification


the more confidence we have in the discriminatory variables. However, once Z
has been observed and if our interest is only in the particular observed z, the
misclassification errors are relatively unimportant, but what is important is (2.6),
i.e., the posterior probability that $z$ belongs to $\pi_i$. Nevertheless, before any
observations are inspected for assignment, the error of classification can be of
value in determining whether the addition of new variables or the deletion of old
ones is warranted.
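A minimal computational sketch of the allocation rule based on (2.6); the predictive densities are supplied by the user, and all names here are illustrative rather than taken from the text:

```python
import numpy as np

def allocate(z, predictive_densities, priors):
    """Assign z to the population maximizing q_i * f(z | X, psi, pi_i), cf. (2.6).

    predictive_densities: list of callables returning f(z | X, psi, pi_i)
    priors: list of prior probabilities q_i summing to one
    """
    scores = np.array([q * f(z) for q, f in zip(priors, predictive_densities)])
    posterior = scores / scores.sum()          # posterior probabilities of (2.6)
    return int(np.argmax(posterior)), posterior

# Hypothetical example with two univariate normal predictive densities
if __name__ == "__main__":
    from scipy.stats import norm
    f1 = lambda z: norm.pdf(z, loc=0.0, scale=1.0)
    f2 = lambda z: norm.pdf(z, loc=2.0, scale=1.5)
    label, post = allocate(1.4, [f1, f2], [0.5, 0.5])
    print(label, post)
```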
In many situations the $q_i$'s are also unknown. First we consider that the sampling situation was such that we have the multinomial density for the $N_i$'s (where throughout what follows $N_k = N - N_1 - \cdots - N_{k-1}$, and $q_k = 1 - q_1 - \cdots - q_{k-1}$). Thus the likelihood for the observed frequencies in the training sample is

$L(q_1, \ldots, q_{k-1}) \propto \prod_{j=1}^{k} q_j^{N_j}$.  (2.11)

If we assume that the prior probability density of the $q_i$'s is of the Dirichlet form

$g(q_1, \ldots, q_{k-1}) \propto \prod_{j=1}^{k} q_j^{\alpha_j}$,  (2.12)

we obtain the posterior density of the $q_i$'s,

$p(q_1, \ldots, q_{k-1} \mid N_1, \ldots, N_{k-1}) \propto \prod_{j=1}^{k} q_j^{N_j + \alpha_j}$.  (2.13)

Further

$p(q_1, \ldots, q_{k-1} \mid z, N_1, \ldots, N_{k-1}) \propto p(q_1, \ldots, q_{k-1} \mid N_1, \ldots, N_{k-1})\, f(z \mid q_1, \ldots, q_{k-1}, X, \psi)$  (2.14)

where

$f(z \mid q_1, \ldots, q_{k-1}) = \sum_{j=1}^{k} q_j f(z \mid X, \psi, \pi_j)$,  (2.15)

whence we obtain the posterior probability no longer conditioned on $q$,

$\Pr\{z \in \pi_i \mid X, \psi\} = \int \cdots \int \Pr\{z \in \pi_i \mid X, \psi, q\}\, p(q_1, \ldots, q_{k-1} \mid z, N_1, \ldots, N_{k-1})\, dq_1 \cdots dq_{k-1} = \dfrac{(N_i + \alpha_i + 1)\, f(z \mid X, \psi, \pi_i)}{\sum_j (N_j + \alpha_j + 1)\, f(z \mid X, \psi, \pi_j)}$.  (2.16)

In the second situation we assume that the $N_i$'s were chosen and not random variables. This is tantamount to assuming that $N_i = 0$ for all $i$ as regards the posterior distribution of the $q_i$'s, resulting in

$\Pr\{z \in \pi_i \mid X, \psi\} \propto (\alpha_i + 1) f(z \mid X, \psi, \pi_i)$.  (2.17)

The $\alpha_i$'s may be regarded as reflecting previous frequencies or intuitive impressions about the frequencies of the various $\pi_i$'s. If there is neither previous data nor any other kind of prior information, the assumption $\alpha_i = \alpha$ for all $i$ leads to the same result that we would obtain had we assumed that the $k$ populations were all equally likely a priori, i.e. $q_i = 1/k$.
Suppose we wish to classify jointly $n$ independent observations $z_1, \ldots, z_n$, each having prior probability $q_i$ of belonging to $\pi_i$. We can then compute the joint predictive density on the hypothesis that $(Z_1 \in \pi_{i_1}, \ldots, Z_n \in \pi_{i_n})$, where $i_1, \ldots, i_n$ are each some integer such that $1 \le i_j \le k$, $j = 1, \ldots, n$. Therefore,

$f(z_1, \ldots, z_n \mid X, \psi, \pi_{i_1}, \ldots, \pi_{i_n}) = \int p(\theta \mid \psi, X) \prod_{j=1}^{n} f(z_j \mid \theta_{i_j}, \psi_{i_j}, \pi_{i_j})\, d\theta$

or

$= \int p\Bigl(\bigcup_{j=1}^{n} \theta_{i_j} \mid \psi, X\Bigr) \prod_{j=1}^{n} f(z_j \mid \theta_{i_j}, \psi_{i_j}, \pi_{i_j})\, d\bigcup_{j=1}^{n} \theta_{i_j}$  (2.18)

where

$p\Bigl(\bigcup_{j=1}^{n} \theta_{i_j} \mid \psi, X\Bigr) = \int p(\theta \mid \psi, X)\, d\Bigl(\theta \setminus \bigcup_{j=1}^{n} \theta_{i_j}\Bigr)$.  (2.19)

This then yields the joint posterior probability

$\Pr\{z_1 \in \pi_{i_1}, \ldots, z_n \in \pi_{i_n} \mid X, \psi, q\} \propto \Bigl(\prod_{j=1}^{n} q_{i_j}\Bigr) f(z_1, \ldots, z_n \mid X, \psi, \pi_{i_1}, \ldots, \pi_{i_n})$.  (2.20)

It is to be noted that while the joint density of $Z_1, \ldots, Z_n$ given $\theta_{i_1}, \ldots, \theta_{i_n}$ factorizes to $\prod_{j=1}^{n} f(z_j \mid \theta_{i_j}, \psi_{i_j}, \pi_{i_j})$, this will not be generally true for the predictive density; i.e.,

$f(z_1, \ldots, z_n \mid X, \psi, \pi_{i_1}, \ldots, \pi_{i_n}) \neq \prod_{j=1}^{n} f(z_j \mid X, \psi_{i_j}, \pi_{i_j})$.  (2.21)

Hence the results of a joint allocation will be in principle different from the previous type, which we may refer to as a marginal allocation, although perhaps not too often in practice.
It is sometimes convenient to write

$\Pr\{z_1 \in \pi_{i_1}, \ldots, z_n \in \pi_{i_n} \mid X, \psi, q\} = \Pr\{\delta_1 \in \pi_1, \ldots, \delta_k \in \pi_k \mid X, \psi, q\}$,  (2.22)

where $\delta_i$ represents the set of $n_i$ observations assumed from $\pi_i$ and $\sum_{i=1}^{k} n_i = n$, since the set of observations $z_1, \ldots, z_n$ is apportioned among the $k$ populations such that $n_i$ belong to $\pi_i$. The reason for using (2.22) is that under certain conditions we do have a useful factorization such that

$\Pr\{\delta_1 \in \pi_1, \ldots, \delta_k \in \pi_k \mid X, \psi, q\} = \prod_{j=1}^{k} \Pr\{\delta_j \in \pi_j \mid X, \psi, q\}$.  (2.23)

Another form of predictive classification would be one wherein diagnoses or allocations need be made as soon as possible, i.e., as soon as $Z_n$ is observed. Hence, if $Z_1, Z_2, \ldots$ are observed sequentially, we may wish, when we are ready to observe and classify $Z_n$, to make our allocation as precise as possible by incorporating the previous observations $z_1, \ldots, z_{n-1}$ into our predictive apparatus. We need now compute the sequential predictive density of $Z_n$ on the hypothesis that it belongs to $\pi_i$, conditional on $\psi$ and on the observations $X$ (whose population origin is known), and on the observations $z_1, \ldots, z_{n-1}$ (whose population origin is uncertain). We then obtain the sequential predictive density of $Z_n$ on the hypothesis that it belongs to $\pi_i$,

$f(z_n \mid X, \psi, z_1, \ldots, z_{n-1}, \pi_i) \propto \sum_{i_1=1}^{k} \cdots \sum_{i_{n-1}=1}^{k} q_{i_1} \cdots q_{i_{n-1}} f(z_1, \ldots, z_n \mid X, \psi, \pi_{i_1}, \ldots, \pi_{i_{n-1}}, \pi_i)$,  (2.24)

i.e., a mixture of joint predictive densities with $Z_n$ assumed from $\pi_i$. Further,

$\Pr\{z_n \in \pi_i \mid X, \psi, z_1, \ldots, z_{n-1}\} \propto q_i f(z_n \mid X, \psi, z_1, \ldots, z_{n-1}, \pi_i)$.  (2.25)

This same result can also be obtained from the product of the likelihoods and the prior density,

$L(X \mid \theta, \psi)\, L(z_1, \ldots, z_{n-1} \mid \theta, \psi)\, g(\theta \mid \psi) \propto p(\theta \mid X, \psi, z_1, \ldots, z_{n-1})$  (2.26)

where

$L(z_1, \ldots, z_{n-1} \mid \theta, \psi) = \prod_{j=1}^{n-1} \sum_{i_j=1}^{k} q_{i_j} f(z_j \mid \theta_{i_j}, \psi_{i_j})$,

and finally

$f(z_n \mid X, \psi, z_1, \ldots, z_{n-1}, \pi_i) = \int f(z_n \mid \theta_i, \psi_i)\, p(\theta \mid X, \psi, z_1, \ldots, z_{n-1})\, d\theta$,  (2.27)

which is equivalent to (2.24).

3. Multivariate normal allocation

We now illustrate the previous work by applying it to multivariate normal distributions. The usual situation is to assume equal covariance matrices but differing means for the $k$ populations $\pi_1, \ldots, \pi_k$. Hence $\pi_i$ is represented by a $N(\mu_i, \Sigma)$ distribution with an available training sample $x_{i1}, \ldots, x_{iN_i}$, $i = 1, \ldots, k$. We define

$\bar{x}_i = N_i^{-1} \sum_{j=1}^{N_i} x_{ij}, \qquad (N_i - 1)S_i = \sum_{j} (x_{ij} - \bar{x}_i)(x_{ij} - \bar{x}_i)'$,

$(N - k)S = \sum_{i} (N_i - 1)S_i, \qquad N = \sum_{i=1}^{k} N_i$.

Using a convenient reference prior for $\mu_1, \ldots, \mu_k$ and $\Sigma^{-1}$,

$g(\Sigma^{-1}, \mu_1, \ldots, \mu_k) \propto |\Sigma|^{(p+1)/2}$,  (3.1)


we easily obtain, including only relevant constants,

f(zlX,,S, Tr,)~
(N~)~/2[
~ 1-+
U~(Xi-z)'S-'(Xi-z)
(Ni + 1)(N_ k )
(N k+1)/2

(3.2)
Bayesiandiscrimination 107

the predictive density of the observation to be allocated. This then is inserted into
either (2.6), (2.16) or (2.17) depending on the circumstances involving q and is
appropriate for allocating a single new vector observation z 1.
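Assuming the form of (3.2) reconstructed above, a short sketch of marginal allocation under equal covariance matrices (illustrative names; constants common to all populations are omitted since they cancel in (2.6)):

```python
import numpy as np

def predictive_kernel(z, xbar_i, S_pooled, N_i, N, k, p):
    """Kernel of the predictive density (3.2) under equal covariance matrices."""
    d = xbar_i - z
    quad = float(d @ np.linalg.solve(S_pooled, d))
    factor = (N_i / (N_i + 1.0)) ** (p / 2.0)
    return factor * (1.0 + N_i * quad / ((N_i + 1.0) * (N - k))) ** (-(N - k + 1.0) / 2.0)

def allocate_equal_cov(z, samples, priors):
    """samples: list of (N_i x p) arrays; priors: q_i. Returns posterior probabilities."""
    k = len(samples)
    p = samples[0].shape[1]
    Ns = [x.shape[0] for x in samples]
    N = sum(Ns)
    # pooled S of Section 3: (N - k) S = sum_i (N_i - 1) S_i
    S = sum((x - x.mean(0)).T @ (x - x.mean(0)) for x in samples) / (N - k)
    scores = np.array([q * predictive_kernel(z, x.mean(0), S, n, N, k, p)
                       for q, x, n in zip(priors, samples, Ns)])
    return scores / scores.sum()

# Illustrative usage with simulated training samples
rng = np.random.default_rng(0)
X1 = rng.normal(0.0, 1.0, size=(20, 2))
X2 = rng.normal(1.5, 1.0, size=(15, 2))
print(allocate_equal_cov(np.array([1.0, 1.0]), [X1, X2], [0.5, 0.5]))
```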
We now assume that we need to jointly allocate $n$ new vector observations $z_1, \ldots, z_n$. Letting, as in (2.22), $\delta_i$ represent the set of $n_i$ observations assumed from $\pi_i$, with $n = \sum_{i=1}^{k} n_i$, we obtain

$f(\delta_1, \ldots, \delta_k \mid X, \pi_1, \ldots, \pi_k) \propto \prod_{i=1}^{k} \left(\dfrac{N_i}{N_i + n_i}\right)^{p/2} \left|(N - k)S + \sum_{i=1}^{k} (\delta_i - \bar{x}_i e_i')\Omega_i^{-1}(\delta_i - \bar{x}_i e_i')'\right|^{-(N + n - k)/2}$  (3.3)

where $\Omega_i = I + N_i^{-1} e_i e_i'$ and $e_i' = (1, \ldots, 1)$ of dimension $n_i$. Hence

$\Pr\{\delta_1 \in \pi_1, \ldots, \delta_k \in \pi_k \mid X, q\} \propto \Bigl(\prod_{i=1}^{k} q_i^{n_i}\Bigr) f(\delta_1, \ldots, \delta_k \mid X, \pi_1, \ldots, \pi_k)$  (3.4)

where again, if the $q_i$'s are unknown, appropriate substitutes can be found in (2.16) or what follows it.
The observations may in many instances be sequentially obtained and for compelling reasons allocations (diagnoses) need be made as soon as possible. Let $z^{(n-1)} = (z_1, \ldots, z_{n-1})$ and $\sum'$ stand for the sum over all assignments of $z_1, \ldots, z_{n-1}$ to $\delta_1, \ldots, \delta_k$ with $z_n$ always assigned to $\delta_i$, and then summed over all partitions of $n$ such that $\sum_{j=1}^{k} n_j = n$, $n_j \ge 0$, $j \neq i$, and $n_i \ge 1$. Then

$\Pr\{z_n \in \pi_i \mid X, z^{(n-1)}, q\} \propto \sum{}' \prod_{j=1}^{k} q_j^{n_j} \left(\dfrac{N_j}{N_j + n_j}\right)^{p/2} \left|(N - k)S + \sum_{i=1}^{k} (\delta_i - \bar{x}_i e_i')\Omega_i^{-1}(\delta_i - \bar{x}_i e_i')'\right|^{-(N + n - k)/2}$  (3.5)

for $n = 2, 3, \ldots$.
A second case that is also easily managed is the unequal covariance matrix situation. Here $\pi_i$ is represented by a $N(\mu_i, \Sigma_i)$ distribution, $i = 1, \ldots, k$. Using the same training sample notation as previously and a similar convenient unobtrusive reference prior

$g(\mu_1, \ldots, \mu_k, \Sigma_1^{-1}, \ldots, \Sigma_k^{-1}) \propto \prod_{i=1}^{k} |\Sigma_i|^{(p+1)/2}$,  (3.6)

we obtain

$f(z \mid X_i, \pi_i) \propto \dfrac{\bigl(\frac{N_i}{N_i+1}\bigr)^{p/2}\, \Gamma(N_i/2)}{\pi^{p/2}\, \Gamma((N_i - p)/2)\, |(N_i - 1)S_i|^{1/2}} \left[1 + N_i(\bar{x}_i - z)'\bigl[(N_i + 1)(N_i - 1)S_i\bigr]^{-1}(\bar{x}_i - z)\right]^{-N_i/2}$,  (3.7)

the predictive density of the observation to be allocated. This is then inserted into the appropriate formula as previously, to calculate the posterior probability of $z$ belonging to $\pi_i$.
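A corresponding sketch for the unequal covariance case, again assuming the reconstruction of (3.7) above; here the gamma-function constants differ across populations and must be kept (illustrative names):

```python
import numpy as np
from scipy.special import gammaln

def predictive_density_unequal(z, xbar, S, N):
    """Predictive density (3.7) for one population with its own covariance matrix."""
    p = len(z)
    d = xbar - z
    quad = float(d @ np.linalg.solve((N + 1.0) * (N - 1.0) * S, d))
    log_c = (0.5 * p * np.log(N / (N + 1.0)) + gammaln(0.5 * N)
             - 0.5 * p * np.log(np.pi) - gammaln(0.5 * (N - p))
             - 0.5 * np.linalg.slogdet((N - 1.0) * S)[1])
    return np.exp(log_c - 0.5 * N * np.log1p(N * quad))
```

The posterior probability of $\pi_i$ again follows by normalizing $q_i$ times these densities, as in (2.6).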
For the joint classification of $z_1, \ldots, z_n$ we obtain, as in (2.15), by assigning $z_1, \ldots, z_n$ to $\delta_1, \ldots, \delta_k$,

$\Pr\{\delta_1 \in \pi_1, \ldots, \delta_k \in \pi_k \mid X, q\} \propto \prod_{i=1}^{k} q_i^{n_i}\, d(\delta_i \mid \bar{x}_i e_i', \Omega_i^{-1}, S_i, N_i - 1, n_i, p)$  (3.8)

where $d(\cdot \mid \cdot)$ represents the determinantal density (Geisser, 1966),

$d(Y \mid \Delta, \Omega, \Lambda, M, m, p) = \dfrac{(2\pi)^{-pm/2}\, K(p, M)\, |M\Lambda|^{M/2}\, |\Omega|^{p/2}}{K(p, M + m)\, |M\Lambda + (Y - \Delta)\Omega(Y - \Delta)'|^{(M+m)/2}}$  (3.9)

for $M \ge p$, $m \ge 1$, $\Lambda$ is $p \times p$ and positive definite, $\Omega$ is $m \times m$ and positive definite, $Y$ and $\Delta$ are $p \times m$; in addition (3.9) is defined as 1 for $m = 0$, and

$K^{-1}(p, v) = 2^{pv/2}\, \pi^{p(p-1)/4} \prod_{j=1}^{p} \Gamma\!\left(\dfrac{v + 1 - j}{2}\right)$.

For sequential allocation we obtain, for $n = 2, 3, \ldots$,

$\Pr\{z_n \in \pi_i \mid X, z^{(n-1)}, q\} \propto \sum{}' \prod_{j=1}^{k} q_j^{n_j}\, d(\delta_j \mid \bar{x}_j e_j', \Omega_j^{-1}, S_j, N_j - 1, n_j, p)$.  (3.10)

A third case of interest, especially in genetic studies of monozygotic and dizygotic twins, is where $\pi_i$ is represented by a $N(0, \Sigma_i)$ distribution, $i = 1, 2$. Again assuming a prior of the form

$g(\Sigma_1^{-1}, \Sigma_2^{-1}) \propto |\Sigma_1\Sigma_2|^{(p+1)/2}$,  (3.11)

we obtain the predictive density of the vector difference of a twin pair

$f(z \mid X, \pi_i) \propto \dfrac{\Gamma\bigl(\tfrac{1}{2}(N_i + 1)\bigr)\, |N_iT_i|^{N_i/2}}{\Gamma\bigl(\tfrac{1}{2}(N_i + 1 - p)\bigr)\, |N_iT_i + zz'|^{(N_i+1)/2}}$  (3.12)

where $N_iT_i = \sum_{j=1}^{N_i} x_{ij}x_{ij}'$ and $x_{ij}$ represents the vectorial difference between a twin pair. Insertion of (3.12) into the appropriate formula yields the posterior probability of the new twin pair being either monozygotic or dizygotic.
For joint classification,

$\Pr\{\delta_1 \in \pi_1, \delta_2 \in \pi_2 \mid X, q\} \propto \prod_{i=1}^{2} q_i^{n_i}\, d(\delta_i \mid 0, I, T_i, N_i, n_i, p)$

and for sequential allocation

$\Pr\{z_n \in \pi_i \mid X, z^{(n-1)}, q\} \propto \sum{}' \prod_{j=1}^{2} q_j^{n_j}\, d(\delta_j \mid 0, I, T_j, N_j, n_j, p)$.

The material in this and the previous section is derived from Geisser (1964, 1965, 1966), Geisser and Cornfield (1963), and Geisser and Desu (1968, 1973).

4. Bayesian separation

A second goal in discrimination studies is to identify and utilize in some parsimonious manner the manifest features that separate the various populations. Here there are no new observations that require allocation. The stress is on throwing some light on scientific, technical or social issues.
One defines a class $\mathcal{C}(z)$ of discriminants and some measure of spread amongst the populations, and then selects some minimal set of discriminants that maximizes the spread given the constraints. The technique appears to work best when the $p$-dimensional multivariate populations can be assumed to have approximately the same covariance matrix $\Sigma$ and differing mean vectors $\mu_1, \ldots, \mu_k$, and exhibit roughly the kind of shape possessed by multivariate normal densities. Hence the major source of their differences is their location. Fisher (1936) found the set of linear combinations $c'z$ which maximized pairwise the distance function

$\dfrac{[c'(\mu_i - \mu_j)]^2}{c'\Sigma c}, \qquad i, j = 1, \ldots, k$.  (4.1)
Let

$AA' = \sum_{i=1}^{k} (\mu_i - \bar{\mu})(\mu_i - \bar{\mu})'$  (4.2)

be of rank $k - v \le p$, where $A$ is a $p \times (k - v)$ matrix and in particular $v = 1$ if $\mu_1, \ldots, \mu_k$ are linearly independent. The solution then is the set of $k - v$ linear discriminants given by

$z'\Sigma^{-1}AP$  (4.3)

where $P$ is a $(k - v) \times (k - v)$ orthogonal matrix which reduces $A'\Sigma^{-1}A$ to the $\mathrm{diag}(\delta_1, \ldots, \delta_{k-v})$ matrix and the $\delta_j$ are the non-zero roots, in descending order, of $A'\Sigma^{-1}A$. Fisher's derivation essentially employed Lagrange multipliers. An alternate geometric derivation is given by Dempster (1969). Wilks (1962) obtained these results by maximizing a single measure of spread. A somewhat more general approach using algebraic methods is given by Geisser (1977), who demonstrates that any scalar measure of the spread of the $k$ populations that is increasing in the non-zero roots of $AA'\Sigma^{-1}$ is maximized in an $r \le p$ dimensional space by that set of $r \le k - v \le p$ linear discriminants

$z'\Sigma^{-1}AP_{(r)}$  (4.4)

where $P_{(r)} = (P_1, \ldots, P_r)$ are the $r$ column vectors associated with the $r$ largest non-zero roots $\delta_j$ of $AA'\Sigma^{-1}$.
The focus here is on the estimation of $c$. In particular if we are dealing with two populations, then (4.4) is equivalent to

$z'\Sigma^{-1}(\mu_1 - \mu_2)$.  (4.5)

To estimate this quantity in a Bayesian manner would generally require a joint posterior distribution for $\Sigma^{-1}$, $\mu_1$, and $\mu_2$, and hence precise distributional specifications on $\pi_1$ and $\pi_2$. However, if we take the posterior mean of $z'\Sigma^{-1}(\mu_1 - \mu_2)$ as its estimator and assume that $E(\mu_1 - \mu_2 \mid \Sigma^{-1})$ is $\bar{x}_1 - \bar{x}_2$ and the marginal expectation of $\Sigma^{-1}$ is $S^{-1}$ where, in terms of the sample values in Section 3, $(n_1 + n_2 - 2)S = (n_1 - 1)S_1 + (n_2 - 1)S_2$, then we have the result that the Bayesian estimator of $z'\Sigma^{-1}(\mu_1 - \mu_2)$ is

$z'S^{-1}(\bar{x}_1 - \bar{x}_2)$.  (4.6)

If we make the multivariate normal assumptions of Section 3 and also use the same prior density for $\Sigma^{-1}$, $\mu_1$, and $\mu_2$, then we obtain the result of (4.6). Hence one may obtain for $k$ populations that the estimator for $z'\Sigma^{-1}(\mu_i - \mu_j)$ is $z'S^{-1}(\bar{x}_i - \bar{x}_j)$ and generate the estimator, using $\bar{x}$ for $\mu$,

$z'S^{-1}\hat{A}\hat{P}$  (4.7)

of the set of linear discriminants where $\hat{A}$ and $\hat{P}$ are obtained from the solution

$\hat{P}'\hat{A}'S^{-1}\hat{A}\hat{P} = \mathrm{Diag}$.  (4.8)
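A sample-based sketch of (4.7)-(4.8): here $\hat{A}$ is formed from the deviations of the group sample means about their (unweighted) average, an assumption mirroring (4.2), and the discriminant coefficient vectors are obtained from the generalized eigenproblem implied by (4.8). All names are illustrative.

```python
import numpy as np
from scipy.linalg import eigh

def sample_linear_discriminants(samples, r):
    """Coefficients of the first r sample linear discriminants z'S^{-1}A P, cf. (4.7)."""
    k = len(samples)
    N = sum(x.shape[0] for x in samples)
    means = np.array([x.mean(0) for x in samples])
    A_hat = (means - means.mean(0)).T                     # p x k, sample analogue of A in (4.2)
    S = sum((x - x.mean(0)).T @ (x - x.mean(0)) for x in samples) / (N - k)
    B = A_hat @ A_hat.T                                   # between-group matrix A A'
    # generalized eigenproblem B c = delta S c, i.e. eigenvectors of S^{-1} A A'
    vals, vecs = eigh(B, S)
    order = np.argsort(vals)[::-1][:r]                    # r largest roots, cf. (4.4)
    return vecs[:, order]                                 # columns: discriminant coefficient vectors

def discriminant_scores(z, coef):
    return z @ coef
```

For two populations this reduces, up to scaling, to the single coefficient vector $S^{-1}(\bar{x}_1 - \bar{x}_2)$ of (4.6).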



5. Allocatory-separatory compromises

By an allocatory-separatory compromise we mean that we shall derive the discriminant from allocatory/separatory considerations and apply it in a semi-Bayesian manner for separatory/allocatory purposes.
For the sake of simplicity we shall confine ourselves to the two population case $\pi_1$ or $\pi_2$, as there is no intrinsic difficulty in extending it to the case of $k$ populations. Assume now that $\pi_i$ is specified by density $f(\cdot \mid \theta_i, \pi_i)$, suppressing the known parameter $\psi_i$. For purposes of allocation we obtain

$\rho = \dfrac{f(z \mid \theta_1, \pi_1)}{f(z \mid \theta_2, \pi_2)} \begin{cases} \ge q_2 q_1^{-1} & \text{allocate } z \text{ to } \pi_1, \\ < q_2 q_1^{-1} & \text{allocate } z \text{ to } \pi_2, \end{cases}$  (5.1)

or

$h(\rho) \begin{cases} \ge h(q_2 q_1^{-1}) & \text{allocates } z \text{ to } \pi_1, \\ < h(q_2 q_1^{-1}) & \text{allocates } z \text{ to } \pi_2, \end{cases}$  (5.2)

where $h(\rho)$, any monotone function, equally serves as an identical allocator. We could also consider $\rho$ or $h(\rho)$ as a separatory function derived initially from allocatory considerations. In frequentist theory $\rho$ depends on $\theta$, so an estimate of $\rho$ is obtained by plugging in an estimate for the set of parameters $\theta$ obtained from the training sample employing some 'optimal' estimation property. However, as is usually the case, these optimal properties will not ordinarily be invariant under monotone transformations, e.g. mean squared error. A way around this dilemma which preserves the invariance of the allocation rule is to use an estimator $\hat{\rho}$ such that $\widehat{h(\rho)} = h(\hat{\rho})$, in particular the maximum likelihood estimator of $\rho$. Of course, for purposes of allocating, one might attempt to derive an estimator of $h$ (or better a rule) which minimized future errors of allocation. However, this in general cannot be achieved for all $\theta$ when $\theta$ must be estimated from a training sample. A semi-Bayesian approach to estimating $\rho$ or $h(\rho)$ is fraught with some of the same difficulties. For example, minimizing posterior squared error implies that the Bayesian estimator is $E_\theta(h(\rho))$, where the expectation is taken over the posterior distribution of $\theta$. However, this estimator will not in general be equal to $h(E_\theta(\rho))$, thus this loss function does not possess the invariance property. In order to retain the invariance property, one could use the posterior median of $h(\rho)$. In practice this turns out to be a rather difficult computation. Hence one settles for a convenient and simple function $h(\rho)$ and calculates its posterior expectation (Geisser, 1967; Enis and Geisser, 1970). Note that we started from an allocatory point of view and obtained a separatory function. One sometimes also is interested in finding the allocatory properties of such a separatory function or, more generally, a Bayesian analysis of error rates of any proposed separatory discriminant.
Another semi-Bayesian way of proceeding is to start from the predictive density functions, which are the prime ingredients of allocatory rules, and then define a class of separatory discriminants, selecting that one which minimizes the total error of classification with respect to the predictive distributions (Enis and Geisser, 1974). This then would be an all purpose discriminant having both good separatory and allocatory properties. This approach modifies the optimal Bayesian allocatory discriminant, which is

$r(z) = \dfrac{f(z \mid X, \pi_1)}{f(z \mid X, \pi_2)} \gtrless q_2 q_1^{-1}$,

by introducing a constraint on the form of the discriminant. Define $W(z, c)$ as a member of the class $\mathcal{C}(z)$, where $c$ is a set of unknown constants, such that

$W(z, c) \begin{cases} \ge 0 & \text{allocates } z \text{ to } \pi_1, \\ < 0 & \text{allocates } z \text{ to } \pi_2. \end{cases}$

Then minimize

$q_1 \Pr[W < 0 \mid X, \pi_1] + q_2 \Pr[W \ge 0 \mid X, \pi_2]$

with respect to $c$ where

$\Pr[W < 0 \mid X, \pi_1] = \int_{-\infty}^{0} f(w \mid X, \pi_1)\, dw, \qquad \Pr[W \ge 0 \mid X, \pi_2] = \int_{0}^{\infty} f(w \mid X, \pi_2)\, dw$,

where $f(w \mid X, \pi_i)$ is the predictive density of $W$ derived from the predictive density of $Z$, given $\pi_i$.
This provides us with a Bayesian discriminant of a stipulated form that is
optimal with respect to error rates. This compromises the form of the discrimi-
nant with an allocatory requirement.

6. Semi-Bayesian multivariate normal applications

In the multivariate normal case with equal covariance matrices, interest has generally focussed on the linear discriminant

$U = [z - \tfrac{1}{2}(\mu_1 + \mu_2)]'\Sigma^{-1}(\mu_1 - \mu_2)$  (6.1)

with the accompanying allocatory rule, for $r = q_2 q_1^{-1}$,

$U \begin{cases} \ge \log r & \text{assigns } z \text{ to } \pi_1, \\ < \log r & \text{assigns } z \text{ to } \pi_2. \end{cases}$  (6.2)

The usual frequentist estimator of $U$ is the sample linear discriminant

$V = [z - \tfrac{1}{2}(\bar{x}_1 + \bar{x}_2)]'S^{-1}(\bar{x}_1 - \bar{x}_2)$  (6.3)

obtained by substituting the usual estimators for $\mu_1$, $\mu_2$, and $\Sigma$ in $U$. The actual allocation rule derived from the training sample then is as follows:

$V \begin{cases} \ge \log r & \text{assigns } z \text{ to } \pi_1, \\ < \log r & \text{assigns } z \text{ to } \pi_2. \end{cases}$  (6.4)
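A direct sketch of (6.3)-(6.5) (illustrative names):

```python
import numpy as np

def sample_linear_discriminant(z, x1, x2, S):
    """V = [z - (x1 + x2)/2]' S^{-1} (x1 - x2), cf. (6.3)."""
    return float((z - 0.5 * (x1 + x2)) @ np.linalg.solve(S, x1 - x2))

def allocate_V(z, x1, x2, S, q1=0.5, q2=0.5,
               bayes_shift=False, p=None, N1=None, N2=None):
    V = sample_linear_discriminant(z, x1, x2, S)
    if bayes_shift:                       # use E(U | z) of (6.5) instead of V
        V += 0.5 * p * (1.0 / N2 - 1.0 / N1)
    return 1 if V >= np.log(q2 / q1) else 2   # rule (6.4)
```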

The first thing we note is that we can provide a Bayesian estimator of $U$ by calculating its posterior expectation with regard to $\mu_1$, $\mu_2$ and $\Sigma$ for fixed $z$. For the particular prior distribution used previously,

$E(U \mid z) = V + \tfrac{1}{2}p(N_2^{-1} - N_1^{-1})$  (6.5)

is a Bayesian estimator of $U$ and in terms of its use as a separatory discriminant it is virtually identical to $V$, because for separatory purposes the constant displacement is more or less irrelevant. Frequentists also use $V$ in its allocatory mode, as determined by (6.4). There it appears that the semi-Bayes approach may yield a rather slight improvement that diminishes with increasing sample sizes and decreasing difference between sample sizes, in terms of frequentist error rates.
Although the frequentist theory of allocation concerns itself with a number of different error rates and their estimators (Hills, 1966), we shall only discuss the two most important ones. Now the optimal errors of classification are given as

$\Pr[U < \log r \mid \mu_1, \mu_2, \Sigma^{-1}, \pi_1] = \varepsilon_1 = \int_{-\infty}^{\tau_1} \varphi(v)\, dv = \Phi(\tau_1)$,  (6.6)

$\Pr[U > \log r \mid \mu_1, \mu_2, \Sigma^{-1}, \pi_2] = \varepsilon_2 = \int_{\tau_2}^{\infty} \varphi(v)\, dv = 1 - \Phi(\tau_2)$  (6.7)

where $\varphi(v) = (2\pi)^{-1/2}e^{-v^2/2}$ is the standard normal density and

$\tau_1 = (\log r - \tfrac{1}{2}\alpha)/\alpha^{1/2}, \qquad \tau_2 = (\log r + \tfrac{1}{2}\alpha)/\alpha^{1/2}$,  (6.8)

and

$\alpha = (\mu_1 - \mu_2)'\Sigma^{-1}(\mu_1 - \mu_2)$.  (6.9)

A frequentist estimator of the optimal errors employs $V$ for $U$ and for $\alpha$ substitutes

$Q = (\bar{x}_1 - \bar{x}_2)'S^{-1}(\bar{x}_1 - \bar{x}_2)$.  (6.10)

A Bayesian estimator for $\varepsilon_i$ can, in principle, be obtained by calculating $E(\varepsilon_i)$. This is a rather difficult calculation and an approximation is available. Let $c = N_1N_2/(N_1 + N_2)$ and $\nu = N_1 + N_2 - 2$; then

$E(\varepsilon_1) \approx \Phi\!\left(\dfrac{\log r - \tfrac{1}{2}(pc^{-1} + Q)}{(pc^{-1} + Q)^{1/2}}\right)$  (6.11)

and with increasing sample size

$E(\varepsilon_1) \approx \Phi\!\left(\dfrac{\log r - \tfrac{1}{2}Q}{Q^{1/2}}\right) = \hat{\varepsilon}_1$.  (6.12)

Further one can obtain

$P(\varepsilon_1) \equiv \Pr[\varepsilon_1 \le b] \approx 1 - F_d\!\left(\dfrac{4c(p + cQ)\bigl(\Phi^{-1}(b)\bigr)^2}{p + cQ + \nu^{-1}(cQ)^2}\right)$  (6.13)

where $\nu = N_1 + N_2 - 2$ and $F_d(\cdot)$ is the distribution function of a chi-squared random variable with

$d = (p + cQ)^2\big/\bigl(p + cQ + (cQ)^2p^{-1}\bigr)$

degrees of freedom. A similar result is available for $\varepsilon_2$.
The estimation of $\varepsilon = q_1\varepsilon_1 + q_2\varepsilon_2$, the total optimal error rate, is useful as a guide to the optimal discriminatory power of the variables used for allocation. If the estimate of $\varepsilon$ indicates that $\varepsilon$ is larger than the accuracy required for the allocation procedure, one would search for additional or another set of variables that would diminish the total error rate. If (6.5) is used for allocation in place of (6.4), then one replaces $\log r$ by $\log r - \tfrac{1}{2}p(N_2^{-1} - N_1^{-1})$ in the Bayesian estimators of (6.11) and (6.12).
From the practical point of view the error rates that are most important are those that are actually incurred when using the sample discriminant $V$ on future observations. These actual errors are defined as

$\Pr(V < \log r \mid \mu_1, \mu_2, \Sigma, \pi_1) = \beta_1 = \int_{-\infty}^{\theta_1} \varphi(v)\, dv = \Phi(\theta_1)$,  (6.14)

$\Pr(V > \log r \mid \mu_1, \mu_2, \Sigma, \pi_2) = \beta_2 = \int_{\theta_2}^{\infty} \varphi(v)\, dv = 1 - \Phi(\theta_2)$  (6.15)

for the fixed values $\bar{x}_1$, $\bar{x}_2$ and $S$ where

$\theta_1 = \bigl\{[\tfrac{1}{2}(\bar{x}_1 + \bar{x}_2) - \mu_1]'S^{-1}(\bar{x}_1 - \bar{x}_2) + \log r\bigr\}\bigl[(\bar{x}_1 - \bar{x}_2)'S^{-1}\Sigma S^{-1}(\bar{x}_1 - \bar{x}_2)\bigr]^{-1/2}$,  (6.16)

$\theta_2 = \bigl\{[\tfrac{1}{2}(\bar{x}_1 + \bar{x}_2) - \mu_2]'S^{-1}(\bar{x}_1 - \bar{x}_2) + \log r\bigr\}\bigl[(\bar{x}_1 - \bar{x}_2)'S^{-1}\Sigma S^{-1}(\bar{x}_1 - \bar{x}_2)\bigr]^{-1/2}$  (6.17)

and $\theta_1$ and $\theta_2$ are random variables that are functions of $\mu_1$, $\mu_2$ and $\Sigma$. Hence we have defined $\beta_1$ and $\beta_2$ as functions of the random variables $\mu_1$, $\mu_2$, and $\Sigma$ for fixed values of $\bar{x}_1$, $\bar{x}_2$ and $S$, which differs from the sampling interpretation where $\beta_1$ and $\beta_2$ are considered either as functions of the fixed parameters $\mu_1$, $\mu_2$, and $\Sigma$ obtained from the unconditional sampling distribution of $V$ in terms of the random variables $\bar{x}_1$, $\bar{x}_2$, and $S$, or defined as functions of the random variables $\bar{x}_1$, $\bar{x}_2$, and $S$. Although the exact posterior distribution of the $\beta_i$, both jointly or marginally (Geisser, 1967), can easily be found, a convenient and rather good approximation is obtained as

$\Pr[\beta_1 \le b] \approx \Phi\!\left(\dfrac{\Phi^{-1}(b) - A_1}{(N_1^{-1} + B_1)^{1/2}}\right)$  (6.18)

where

$A_1 = \left[\dfrac{\nu - p + \tfrac{1}{2}}{Q\nu}\right]^{1/2}(\log r - \tfrac{1}{2}Q), \qquad B_1 = (\log r - \tfrac{1}{2}Q)^2/2\nu Q$,

$\Pr[\beta_2 \le b] \approx 1 - \Phi\!\left(\dfrac{\Phi^{-1}(1 - b) - A_2}{(N_2^{-1} + B_2)^{1/2}}\right)$,  (6.19)

$A_2 = \left[\dfrac{\nu - p + \tfrac{1}{2}}{Q\nu}\right]^{1/2}(\log r + \tfrac{1}{2}Q), \qquad B_2 = (\log r + \tfrac{1}{2}Q)^2/2\nu Q$.

A Bayesian estimator of $\beta_i$, $E(\beta_i)$, is also the unconditional predictive probability:

$E(\beta_1) = \Pr[V \le \log r \mid X, \pi_1]$,  (6.20)

$E(\beta_2) = \Pr[V > \log r \mid X, \pi_2]$.  (6.21)

The argument runs as follows:

$E(\beta_1) = \int \Pr[V \le \log r \mid \mu_1, \mu_2, \Sigma^{-1}, \pi_1]\, p(\mu_1, \mu_2, \Sigma^{-1} \mid X)\, d\mu_1\, d\mu_2\, d\Sigma^{-1}$  (6.22)

where $f(V \mid \mu_1, \mu_2, \Sigma, \pi_1)$ represents the conditional density of $V$. Hence

$E(\beta_1) = \int_{-\infty}^{\log r} f(V \mid X, \pi_1)\, dV = \Pr[V \le \log r \mid X, \pi_1]$  (6.23)

where $f(V \mid X, \pi_1)$ represents the unconditional or predictive density of $V$.

Thus we can obtain, in terms of a $t$ with $\nu + 1 - p$ d.o.f.,

$E(\beta_1) = \Pr\bigl[t_{\nu+1-p} \le (\log r - \tfrac{1}{2}Q)\bigl[\nu(N_1 + 1)Q/(\nu + 1 - p)N_1\bigr]^{-1/2}\bigr]$  (6.24)

which may be evaluated directly from tables of the $t$-distribution. Similarly

$E(\beta_2) = \Pr[V > \log r \mid X, \pi_2] = \Pr\bigl[t_{\nu+1-p} > (\log r + \tfrac{1}{2}Q)\bigl[\nu(N_2 + 1)Q/(\nu + 1 - p)N_2\bigr]^{-1/2}\bigr]$.  (6.25)

In practice, if an investigator is satisfied with the estimate of the optimal error $\varepsilon$, then he can compute his estimates of $\beta_1$ and $\beta_2$. If they are larger than he can tolerate, then he should collect larger sample sizes, since $\beta \to \varepsilon$ from above as the sample sizes increase. Of course all of this is prior to obtaining the observations to be allocated, because once they are in hand the only relevant calculation for the Bayesian is the posterior probability that $z \in \pi_i$ or the allocatory decision for that observation. The optimal and actual probabilities of correct allocation, $1 - \varepsilon$ and $1 - \beta$, refer only to the long run frequency of future allocations using the discriminant from a hypothetically infinite sample in the first case and the actual sample in hand in the second case. A more detailed exposition with other results can be found in Geisser (1967, 1970).
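Assuming the reconstructions of (6.10), (6.12) and (6.24)-(6.25) above, the estimated error rates can be computed as in the following sketch (illustrative names):

```python
import numpy as np
from scipy.stats import norm, t

def error_rate_estimates(x1bar, x2bar, S, N1, N2, q1=0.5, q2=0.5):
    p = len(x1bar)
    nu = N1 + N2 - 2
    r = q2 / q1
    d = x1bar - x2bar
    Q = float(d @ np.linalg.solve(S, d))                      # (6.10)
    eps1_hat = norm.cdf((np.log(r) - 0.5 * Q) / np.sqrt(Q))   # large-sample estimate (6.12)
    # Bayesian estimates of the actual error rates, (6.24)-(6.25)
    df = nu + 1 - p
    scale1 = np.sqrt(nu * (N1 + 1) * Q / (df * N1))
    scale2 = np.sqrt(nu * (N2 + 1) * Q / (df * N2))
    E_beta1 = t.cdf((np.log(r) - 0.5 * Q) / scale1, df)
    E_beta2 = t.sf((np.log(r) + 0.5 * Q) / scale2, df)
    return eps1_hat, E_beta1, E_beta2
```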
Another semi-Bayesian approach would be to find the linear discriminant $W(z) = a'z - b$ such that if

$W(z) \begin{cases} \ge 0, & \text{assign } z \text{ to } \pi_1, \\ < 0, & \text{assign } z \text{ to } \pi_2, \end{cases}$  (6.26)

where $a' = [a_1, \ldots, a_p]$ is a nonnull vector and $b$ is an arbitrary scalar, then for variations in $a$ and $b$ the total predictive probability of correct allocation is maximized. The solution obtained by Enis and Geisser (1974) is termed the optimal predictive linear discriminant,

$W_0(z) = a_0'z - b_0$  (6.27)

where

$a_0 = S^{-1}(\bar{x}_1 - \bar{x}_2)$,  (6.28)

bo = (RK? - { ( X , - X2)'S-' K X2)-


(6.29)
o~ = Q R K ~K 2 -- e( R - 1)(RK 2 - K 2),
[ qzK2 ]2/(v+l)
R
qlK1 ]

and
1/2

/q= (~+l)(.+p-1)

First we note that for purely separatory purposes the constant is irrelevant and again we obtain Fisher's linear discriminant function. For allocatory purposes the constant $b_0$ is relevant and may yield a rather slight error rate improvement over $V$ or $V + \tfrac{1}{2}p(N_2^{-1} - N_1^{-1})$. But note that $W$ will be globally optimal iff $RK_2^2 = K_1^2$, since this is equivalent to $r(z)$, the optimal posterior discriminant, which under these circumstances and appropriate $h(r)$ becomes linear as well. If $q_1 = q_2$ and $N_1 = N_2$, then all methods mentioned thus far essentially yield $V$.
When $\Sigma_1 \neq \Sigma_2$, the optimal discriminant is the quadratic

$U = \tfrac{1}{2}\bigl\{\log|\Sigma_1^{-1}\Sigma_2| + (z - \mu_2)'\Sigma_2^{-1}(z - \mu_2) - (z - \mu_1)'\Sigma_1^{-1}(z - \mu_1)\bigr\}$.  (6.30)

Error rates become much more difficult to compute under these circumstances. It is, however, interesting to note that the usual estimator of $U$, namely

$V = \tfrac{1}{2}\bigl\{\log|S_1^{-1}S_2| + (z - \bar{x}_2)'S_2^{-1}(z - \bar{x}_2) - (z - \bar{x}_1)'S_1^{-1}(z - \bar{x}_1)\bigr\}$,  (6.31)

is also very nearly achieved as the posterior expectation of $U$. Enis and Geisser (1970) showed that

$E(U \mid z) = V + h(p, N_1, N_2)$  (6.32)

where

$h(p, N_1, N_2) = \tfrac{1}{2}\sum_{i=1}^{2}\sum_{j=1}^{p}(-1)^i\bigl\{\log(N_i - 1) + (N_i - 1)^{-1} - \psi\bigl[\tfrac{1}{2}(N_i - j)\bigr]\bigr\}$  (6.33)

and $\psi(x) = \Gamma'(x)/\Gamma(x)$ is the psi (digamma) function. Note that as $N_1$ and $N_2$ increase, $h \to 0$, and in particular when $N_1 = N_2$, $h = 0$. Thus $V$ differs from the posterior expectation by at most a negligible quantity.
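A sketch of the sample quadratic discriminant (6.31), together with the shift $h(p, N_1, N_2)$ of (6.33) as reconstructed above (the inner terms of $h$ are uncertain in the source, so this part is only indicative; all names are illustrative):

```python
import numpy as np
from scipy.special import digamma

def quadratic_discriminant(z, x1, S1, x2, S2):
    """V of (6.31)."""
    logdet = np.linalg.slogdet(S2)[1] - np.linalg.slogdet(S1)[1]
    q2 = float((z - x2) @ np.linalg.solve(S2, z - x2))
    q1 = float((z - x1) @ np.linalg.solve(S1, z - x1))
    return 0.5 * (logdet + q2 - q1)

def h_shift(p, N1, N2):
    """h(p, N1, N2) of (6.33); it vanishes when N1 == N2."""
    total = 0.0
    for i, Ni in enumerate((N1, N2), start=1):
        s = sum(np.log(Ni - 1.0) + 1.0 / (Ni - 1.0) - digamma(0.5 * (Ni - j))
                for j in range(1, p + 1))
        total += 0.5 * ((-1.0) ** i) * s
    return total
```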
The optimal predictive discriminant is

$r(z) = \dfrac{f(z \mid \bar{x}_1, S_1, \pi_1)}{f(z \mid \bar{x}_2, S_2, \pi_2)} \ge q_2 q_1^{-1}$  (6.34)

as defined in (3.7) and is a rather complicated function of $z$, and no $h(r)$ emerges that will simplify it. One could attempt to derive the optimal predictive quadratic discriminant, but in general this is quite difficult to obtain.

7. Semi-Bayesian sample reuse selection and allocation

In many problems we cannot always formulate definitively the density functions for $\pi_1$ and $\pi_2$. For example, in certain situations we may be uncertain as to whether we are dealing with two normal populations with differing means and either the same or differing covariance matrices. Hence often to the problem of allocation there is added an uncertainty regarding the specification. More generally, suppose that $f(\cdot \mid \theta_i, \pi_i, \omega)$, the basic density, is now indexed by the double designator $\omega \in \Omega$ which jointly specifies a pair of densities for $\pi_1$ and $\pi_2$ and is assumed subject to a probability function $g(\omega)$. A complete Bayesian solution for the allocation of $z$ (Geisser, 1980) maximizes (over all $i$)

$\Pr[z \in \pi_i \mid X, q] \propto q_i\, E_\omega\bigl[f(z \mid X, \omega, \pi_i)\, f(X_1, X_2 \mid \pi_1, \pi_2, \omega)\bigr]$  (7.1)

where the expectation is over $g(\omega)$, $X_i$ represents the set of observations from $\pi_i$,

$f(z \mid X, \omega, \pi_i) = \int f(z \mid \theta_i, \omega, \pi_i)\, p(\theta \mid X, \omega)\, d\theta$  (7.2)

where $f$, the sampling density of $z$, and $p$, the posterior density of $\theta$ in the integrand, are now indexed by $\omega$, which specifies the assumed population, and where

$f(X_1, X_2 \mid \omega, \pi_1, \pi_2) = \int p(\theta \mid \omega) \prod_{i=1}^{2}\prod_{j=1}^{N_i} f(x_{ij} \mid \theta_i, \omega, \pi_i)\, d\theta$.  (7.3)

This full Bayesian approach requires a body of prior knowledge that is often unavailable and may be highly sensitive to some of these assumptions.
We shall present here only one of a series of data analytic techniques given by Geisser (1980) which selects a single $\omega = \omega^*$ to be used for allocation rather than the Bayesian averaging. It is a technique which combines Bayesian, frequentist and sample reuse procedures.
Let

$L(\omega) = \prod_{i=1}^{2}\prod_{j=1}^{N_i} f(x_{ij} \mid X_{(ij)}, \omega, \pi_i)$  (7.4)

be the product of reused predictive densities, where $X_{(ij)}$ is the set of observations $X$ with $x_{ij}$ deleted, and $f$ is of the same form as (7.2); i.e., $x_{ij}$ replaces $z$ and $X_{(ij)}$ replaces $X$. Choose $\omega^*$ according to

$\max_{\omega} g(\omega)L(\omega)$,

and then use the $\pi_1$ and $\pi_2$ specified by $\omega^*$ in an allocatory or separatory mode. As an example, suppose $\omega = \omega_1$ specified that $\pi_i$ is $N(\mu_i, \Sigma)$ and $\omega = \omega_2$ specified that $\pi_i$ is $N(\mu_i, \Sigma_i)$, respectively.
Under $\omega_1$,

$L(\omega_1) = \prod_{i=1}^{2}\prod_{j=1}^{N_i} f(x_{ij} \mid \bar{x}_{i(j)}, S_{(ij)}, N_i - 1, N - 1, \omega_1, \pi_i)$  (7.5)

where the density $f$ is given by (3.2) with $z$, $\bar{x}_i$, $S$, $N_i$, and $N$ replaced by $x_{ij}$, $\bar{x}_{i(j)}$, $S_{(ij)}$, $N_i - 1$ and $N - 1$, respectively; $\bar{x}_{i(j)}$ and $S_{(ij)}$ being the sample mean and pooled covariance matrix with $x_{ij}$ deleted.
Under $\omega_2$,

$L(\omega_2) = \prod_{i=1}^{2}\prod_{j=1}^{N_i} f(x_{ij} \mid \bar{x}_{i(j)}, S_{i(j)}, N_i - 1, \omega_2, \pi_i)$  (7.6)

where the density $f$ is given by (3.7) with $z$, $\bar{x}_i$, $S_i$ and $N_i$ replaced by $x_{ij}$, $\bar{x}_{i(j)}$, $S_{i(j)}$, and $N_i - 1$, respectively, and $S_{i(j)}$ being the sample covariance matrix calculated from $X_i$ with $x_{ij}$ deleted. The choice of $\omega^*$ now rests with

$\max_{i} g(\omega_i)L(\omega_i), \qquad i = 1, 2$.  (7.7)

One then uses the $\omega^*$ specification for allocation or separation.
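A structural sketch of the sample reuse criterion (7.4)-(7.7): the two candidate specifications would be represented by leave-one-out log predictive densities built from (3.2) and (3.7) with the substitutions described in (7.5)-(7.6); equal prior weights $g(\omega_i)$ are assumed here and all names are illustrative.

```python
import numpy as np

def loo_log_score(samples, log_predictive):
    """log L(omega) of (7.4): sum over all x_ij of the log predictive density
    computed from the data with x_ij deleted, cf. (7.5)-(7.6)."""
    total = 0.0
    for i, Xi in enumerate(samples):
        for j in range(Xi.shape[0]):
            held_out = Xi[j]
            reduced = [np.delete(Xk, j, axis=0) if k == i else Xk
                       for k, Xk in enumerate(samples)]
            total += log_predictive(held_out, i, reduced)
    return total

def select_specification(samples, candidates, log_prior_weights):
    """Choose omega* maximizing g(omega) L(omega), cf. (7.7).
    candidates: dict mapping a label to a log_predictive(x, i, reduced_samples) callable."""
    scores = {name: log_prior_weights[name] + loo_log_score(samples, f)
              for name, f in candidates.items()}
    return max(scores, key=scores.get), scores
```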

8. Other areas

Most of the current work in separatory discriminants has been linear mainly
because of convenience and ease of interpretation. However, it would be desirable
to consider other functional discriminants as there are situations where the
natural discriminants are quadratic.
There is also another useful model wherein the so-called populations or labels have some underlying continuous distribution, but one can only observe whether $\pi$ is in a set $S_i$, where $S_1, \ldots, S_k$ exhaust the range of $\pi$; see, for example, Marshall and Olkin (1968). In the previous case $\pi = \pi_i$ was synonymous with $S_i$, and the distribution only involved the discrete probabilities $q_i$. However, this case involves
more structure and requires a more delicate Bayesian analysis. Work in this area
is currently in progress.

References

Dempster, A. P. (1969). Elements of Continuous Multivariate Analysis. Addison-Wesley, Reading, MA.


Desu, M. M. and Geisser, S. (1973). Methods and applications of equal-mean discrimination. In: T.
Cacoullos, ed., Discriminant Analysis and Applications, 139-161. Academic Press, New York.
Enis, P. and Geisser, S. (1970). Sample discriminants which minimize posterior squared error loss.
South African Statist. J. 4, 85-93.
Enis, P. and Geisser, S. (1974). Optimal predictive linear discriminants, Ann. Statist. 2(2) 403-410.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems, Ann. Eugenics 7,
179-188.
Geisser, S. (1964). Posterior odds for multivariate normal classification. J. Roy. Statist. Soc. Ser. B 26,
69-76.
Geisser, S. (1965). Bayesian estimation in multivariate analysis. Ann. Math. Statist. 36, 150-159.
Geisser, S. (1966). Predictive discrimination. In: P. Krishnaiah, ed., Multivariate Analysis, 149-163.
Academic Press, New York.
Geisser, S. (1967). Estimation associated with linear discriminants. Ann. Math. Statist. 38, 807-817.
Geisser, S. (1970). Discriminatory practices. In: D. Meyer and R. C. Collier, eds., Bayesian Statistics,
57-70. Peacock, Illinois.
Geisser, S. (1977). Discrimination, allocatory and separatory, linear aspects. In: J. Van Ryzin, ed.,
Classification and Clustering, 301-330. Academic Press, New York.
Geisser, S. (1980). Sample reuse selection and allocation criteria. In: P. Krishnaiah, ed., Multivariate
Analysis V, 387-398. North-Holland, Amsterdam.
Geisser, S. and Cornfield, J. (1963). Posterior distributions for multivariate normal parameters. J. Roy.
Statist. Soc. Ser. B 25, 368-376.
Geisser, S. and Desu, M. M. (1968). Predictive zero-mean uniform discrimination. Biometrika 55
519-524.
Hills, M. (1966). Allocation rules and their error rates, J. Roy. Statist. Soc. Ser. B 28, 1-31.
Marshall, A. W. and Olkin, I. (1968). A general approach to some screening and classification
problems. J. Roy. Statist. Soc. Ser. B 30, 407-443.
Wilks, S. S. (1962). Mathematical Statistics. Wiley, New York.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 121-137

Classification of Growth Curves

Jack C. Lee

1. Introduction

The model considered here is a generalized multivariate analysis of variance model useful especially for growth curve problems. It was first proposed by Potthoff and Roy (1964). The model was subsequently considered by many authors, including Rao (1965, 1966, 1967), Khatri (1966), Krishnaiah (1969), Geisser (1970), and Lee and Geisser (1972, 1975), among others.
The growth curve model is defined as

$Y_{p \times N} = X_{p \times m}\tau_{m \times r}A_{r \times N} + e_{p \times N}$  (1.1)

where $\tau$ is unknown, and $X$ and $A$ are known matrices of ranks $m < p$ and $r < N$, respectively. Further, the columns of $e$ are independent $p$-variate normal with mean vector $0$ and common covariance matrix $\Sigma$, i.e., $G(Y \mid \tau, \Sigma) = N(Y;\, X\tau A,\, \Sigma \otimes I_N)$, where $\otimes$ denotes the Kronecker product and $G(\cdot)$ the cumulative distribution function (c.d.f.).
Several examples of growth curve applications for the model (1.1) were given by Potthoff and Roy (1964). We will only indicate two of them here.
(i) $N$ individuals, all subject to the same conditions, are each observed at $p$ points in time $t_1, \ldots, t_p$. The $p$ observations on a given individual are not independent, but rather are assumed to be multivariate normal with unknown covariance matrix $\Sigma$. The observations of different individuals are assumed to be independent. The growth curve is assumed to be a polynomial in time of degree $m - 1$, so that the expected value of the measurement of any individual at time $t$ is $\tau_0 + \tau_1 t + \cdots + \tau_{m-1}t^{m-1}$. The matrix $A$ is $1 \times N$ and contains all 1's, $\tau = (\tau_0, \ldots, \tau_{m-1})'$, and the element in the $j$th row and $c$th column of $X$ is $t_j^{c-1}$.
(ii) There are $r$ groups of individuals with $n_j$ individuals in the $j$th group, and with each group being subjected to a different treatment. Individuals in all groups are measured at the same points in time and are assumed to have the same covariance matrix $\Sigma$. The growth curve associated with the $j$th group is $\tau_{0j} + \tau_{1j}t + \cdots + \tau_{m-1,j}t^{m-1}$. The matrix $A$ will contain $r$ rows, and will consist of $n_1$ columns of $(1, 0, \ldots, 0)'$, $n_2$ columns of $(0, 1, 0, \ldots, 0)'$, $\ldots$, and $n_r$ columns of $(0, \ldots, 0, 1)'$. The $(j, c)$ element of $\tau$ is $\tau_{j-1,c}$ and the matrix $X$ is the same as in example (i).
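The design matrices of examples (i) and (ii) can be written down directly; a small sketch (illustrative names):

```python
import numpy as np

def polynomial_X(times, m):
    """p x m within-individual design: element (j, c) is t_j**(c-1), as in example (i)."""
    t = np.asarray(times, dtype=float)
    return np.vander(t, N=m, increasing=True)

def group_A(group_sizes):
    """r x N between-individual design of example (ii): n_j columns of the j-th unit vector."""
    r = len(group_sizes)
    cols = [np.eye(r)[:, [j]].repeat(n, axis=1) for j, n in enumerate(group_sizes)]
    return np.hstack(cols)

# Example: p = 4 time points, quadratic growth curve (m = 3), two groups of sizes 3 and 2
X = polynomial_X([1, 2, 3, 4], m=3)    # 4 x 3
A = group_A([3, 2])                     # 2 x 5
```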
Growth curve classification was first considered by Lee (1977) from a Bayesian
viewpoint and later extended by Nagel and deWaal (1979). It was also considered
by Leung (1980).
In this paper we consider the situation where $c$ growth curves

$G(Y_i \mid \tau_i, \Sigma_i, \Pi_i) = N(Y_i;\, X\tau_iA_i,\, \Sigma_i \otimes I_{N_i}), \qquad i = 1, \ldots, c$,  (1.2)

have been observed, together with a set of future observations $V$, of dimension $p \times K$, which is known to be drawn from one of the $c$ populations $\Pi_1, \ldots, \Pi_c$ with probabilities $q_1, \ldots, q_c$. Here, of course, we assume

$G(V \mid \tau_i, \Sigma_i, \Pi_i) = N(V;\, X\tau_iF_i,\, \Sigma_i \otimes I_K)$  (1.3)

where $F_i$ is a known design matrix usually formed by some columns of $A_i$. Two different covariance structures for $\Sigma$ will be considered. They are: (i) arbitrary positive definite (p.d.) and (ii) Rao's simple structure (RSS) $\Sigma = X\Gamma X' + Z\Theta Z'$, where $Z$ is $p \times (p - m)$ such that $X'Z = 0$, cf. Lee and Geisser (1972). The importance of RSS was demonstrated in Rao (1967), Geisser (1970), and Lee and Geisser (1972, 1975). The likelihood ratio test for RSS was given in Lee and Geisser (1972).
In this paper we shall primarily study the problem of classification. But as a
spin-off from this work we shall also discuss the estimation of parameters for the
covariance matrices described above. We shall give some preliminary results in
the next section. In Section 3 classification into one of two growth curves is
considered from a non-Bayesian viewpoint and in Section 4 a Bayesian approach
to growth curve classification is given. In Section 5 Bayesian classification and
parameter estimation are considered when $\Sigma$ is arbitrary p.d., and in Section 6 we
discuss the case where $\Sigma = X\Gamma X' + Z\Theta Z'$ obtains.

2. Preliminaries

We need the following in the sequel:

LEMMA 2.1. Let $Y_{p \times N}$, $X_{p \times m}$, $\tau_{m \times r}$, $A_{r \times N}$ be such that the ranks of $X$ and $A$ are $m < p$ and $r < N$, respectively. Then

$|(Y - X\tau A)(Y - X\tau A)'|^{-N/2} = [\mathrm{mod}(BZ')]^{N}\, |Z'YY'Z|^{-N/2}\, |(X'S^{-1}X)^{-1} + (\tau - \hat\tau)G(\tau - \hat\tau)'|^{-N/2}$  (2.1)

where $Z$ is $p \times (p - m)$ such that $X'Z = 0$,

$B = (X'X)^{-1}X'$,
$S = Y(I - A'(AA')^{-1}A)Y'$,
$\hat\tau = (X'S^{-1}X)^{-1}X'S^{-1}YA'(AA')^{-1}$,  (2.2)
$G^{-1} = (AA')^{-1} + T_2'(Z'SZ)^{-1}T_2$,
$T_2 = Z'YA'(AA')^{-1}$.

For a proof of the above lemma, the reader is referred to Geisser (1970).
We shall say that $L$, with dimension $m \times r$, is distributed as $D_{m,r}(\cdot\,;\Delta, \Lambda, \Sigma, N)$ (see Geisser (1966)) if it has probability density function (p.d.f.)

$g(L) = \dfrac{C_{m,\nu}\, \pi^{-rm/2}\, |\Sigma|^{\nu/2}\, |\Lambda|^{m/2}}{C_{m,N}\, |\Sigma + (L - \Delta)\Lambda(L - \Delta)'|^{N/2}}$  (2.3)

where $\nu = N - r$ and

$C^{-1}_{m,p} = \pi^{m(m-1)/4}\prod_{j=1}^{m}\Gamma\bigl[\tfrac{1}{2}(p + 1 - j)\bigr]$.

We note that the general determinantal distribution $D_{m,r}(\cdot)$ includes the multivariate $T$ distribution, $T(\cdot\,;\Delta, \Sigma, N)$, as a special case when $r = 1$, $\Lambda = (N - m)^{-1}$. Some properties of $D_{m,r}(\cdot)$ were given in Lee and Geisser (1972). In the sequel we will use $h(\cdot)$ for the prior probability density functions (p.d.f.) of parameters, $g(\cdot)$ for probability density functions other than priors, and $G(\cdot)$ for the c.d.f., even though they stand for different functional forms in different circumstances.

3. Classification into one of two growth curves

In this section we consider the special case where $c = 2$ and $q_1, q_2$ are known. Following Welch (1939) we classify $V$ into $\Pi_1$ if

$g_1(V)/g_2(V) > q_2/q_1$  (3.1)

where

$g_i(V) = (2\pi)^{-pK/2}|\Sigma_i|^{-K/2}\exp\bigl\{-\tfrac{1}{2}\,\mathrm{tr}\,\Sigma_i^{-1}(V - X\tau_iF_i)(V - X\tau_iF_i)'\bigr\}$.  (3.2)

Since the natural logarithm is a monotonically increasing function, the inequality (3.1) can be written as

$K\{\ln|\Sigma_2| - \ln|\Sigma_1|\} + \mathrm{tr}(V - X\tau_2F_2)'\Sigma_2^{-1}(V - X\tau_2F_2) - \mathrm{tr}(V - X\tau_1F_1)'\Sigma_1^{-1}(V - X\tau_1F_1) \ge 2\ln(q_2/q_1)$.  (3.3)

Thus, $V$ is classified into $\Pi_1$ if (3.3) holds. In the case $\Sigma_1 = \Sigma_2 = \Sigma$ and $K = 1$, (3.3) reduces to

$V'\Sigma^{-1}X(\tau_1F_1 - \tau_2F_2) - \tfrac{1}{2}(\tau_1F_1 + \tau_2F_2)'X'\Sigma^{-1}X(\tau_1F_1 - \tau_2F_2) \ge \ln(q_2/q_1)$.  (3.4)

It should be noted that (3.4) is an extension of (5) on page 134 of Anderson (1958).
In practice, the parameters involved are unknown and are replaced by their estimates. The maximum likelihood estimate (m.l.e.) of $\tau_i$ is

$\hat\tau_i = (X'S_i^{-1}X)^{-1}X'S_i^{-1}Y_iA_i'(A_iA_i')^{-1}$  (3.5)

where

$S_i = Y_i(I - A_i'(A_iA_i')^{-1}A_i)Y_i'$  (3.6)

and the m.l.e. of $\Sigma_i$ is

$\hat\Sigma_i = N_i^{-1}(Y_i - X\hat\tau_iA_i)(Y_i - X\hat\tau_iA_i)'$  (3.7)

if $\Sigma_i$ is arbitrary p.d. Now, let us assume that $\Sigma_i$ is of the following structure,

$\Sigma_i = X\Gamma_iX' + Z\Theta_iZ'$

where $Z$ is a $p \times (p - m)$ matrix such that $X'Z = 0$; the above structure of $\Sigma_i$ was considered by Rao (1967) in another connection.
We refer to this as Rao's simple structure (RSS). In this case, the m.l.e. of $\tau_i$ is

$T_{1i} = BY_iA_i'(A_iA_i')^{-1}$  (3.8)

and the m.l.e. of $\Sigma_i$ is

$\hat\Sigma_{si} = N_i^{-1}\{XBS_iB'X' + ZDY_iY_i'D'Z'\}$  (3.9)

where

$D = (Z'Z)^{-1}Z'$.

Of course, when $\Sigma_1 = \Sigma_2 = \Sigma$, the m.l.e. of the common covariance matrix is

$\hat\Sigma = (N_1 + N_2)^{-1}(N_1\hat\Sigma_1 + N_2\hat\Sigma_2)$  (3.10)

when $\Sigma$ is arbitrary p.d., and

$\hat\Sigma_s = (N_1 + N_2)^{-1}(N_1\hat\Sigma_{s1} + N_2\hat\Sigma_{s2})$  (3.11)

when RSS holds. Bayesian estimators will be discussed in Sections 5 and 6.
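A sketch of the m.l.e.'s (3.5)-(3.7) and of the plug-in form of rule (3.3), assuming the reconstructions above; $X$ is taken to be common to the populations and all names are illustrative.

```python
import numpy as np

def mle_growth(Y, X, A):
    """Return tau_hat (3.5), S (3.6) and Sigma_hat (3.7) for one population."""
    S = Y @ (np.eye(A.shape[1]) - A.T @ np.linalg.solve(A @ A.T, A)) @ Y.T       # (3.6)
    W = np.linalg.solve(S, X)                                                     # S^{-1} X
    tau = np.linalg.solve(X.T @ W, W.T @ Y @ A.T) @ np.linalg.inv(A @ A.T)        # (3.5)
    R = Y - X @ tau @ A
    Sigma = R @ R.T / Y.shape[1]                                                  # (3.7)
    return tau, S, Sigma

def classify_growth(V, X, fits, F_list, priors, K):
    """Plug-in version of (3.3) for two populations."""
    scores = []
    for (tau, _, Sigma), F in zip(fits, F_list):
        R = V - X @ tau @ F
        scores.append(K * np.linalg.slogdet(Sigma)[1]
                      + np.trace(R.T @ np.linalg.solve(Sigma, R)))
    # classify into Pi_1 if score_2 - score_1 >= 2 ln(q2/q1), cf. (3.3)
    return 1 if scores[1] - scores[0] >= 2.0 * np.log(priors[1] / priors[0]) else 2
```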



4. Bayesian classification of growth curves

For Bayesian classification purposes we derive the predictive density of $V$, $g(V \mid Y, \Pi_i)$, where $Y$ represents the set of available data $Y_1, \ldots, Y_c$ (Lee, 1975), following the proposal of Geisser (1964, 1966). The predictive probability that $V$ belongs to $\Pi_i$ is

$P(V \in \Pi_i \mid Y, q) \propto q_i g(V \mid Y, \Pi_i)$  (4.1)

for $i = 1, \ldots, c$; $q = (q_1, \ldots, q_c)$. The observation $V$ is then classified into the $\Pi_i$ for which (4.1) is a maximum. For the prior density of the parameters we follow Geisser (1970) in assuming a situation where there is little or no prior knowledge regarding the parameters. In the next two sections we discuss Bayesian classification of growth curves for the two different covariance structures.

5. Arbitrary p.d. $\Sigma$

5.1. $\Sigma_i$ unknown, $\tau_i$ unknown


By using the convenient prior $h(\tau_i, \Sigma_i^{-1}) \propto |\Sigma_i|^{(p+1)/2}$ it can be shown that the predictive density of $V$ is such that, for $V_1 = (X'X)^{-1}X'V = BV$ and $V_2 = Z'V$,

$g(V_1, V_2 \mid Y_i, \Pi_i) = g(V_1 \mid V_2, Y_i, \Pi_i)\, g(V_2 \mid Y_i, \Pi_i)$  (5.1)


where
G(V, IVz,Y~,Hi)=Dm,K(" ; Q i , G i , ( X St [- l X) 1
,N+ K--r), (5.2)
G(V2I •, Hi) = Dp m,K(" "~O, I, Z'Yy/Z, N i + K -- m) (5.3)
and
Gi -1= m[-' - ] - ( V - gi)tZ( Z t S i z ) - l z t ( v - Vi),
y -1
Mi=I--F[(H,H;) Fi, Hi = (A,, F~), (5.4)

~ = Y~A'i(AiA:)-'Fi, ^ + Bsiz(z ,siz) - ' z ' ( v - ~)


Q, = ~v~

and S i is defined as the S in (2.2) except that Y is replaced by Y/and A by A i. The


predictive density of V can be shown to be

g(WlYi, zL) ¢c oF I( x's,-~X)I-¢N,-~)/ZlG, I~/2IZ'Y~Y[ZICN,-m)/2


× l( x's;-'x)-~ + B ( v - XQ,)Gi(V- xoi)'B'l -¢u,+K ~)/2
× IZ'(V,V,' + VV')ZI-¢~',+K-m)/~ (5.5)

for i = 1,2 ..... c, where

C/*-~-~ /~[21(Ni_~g_~_l_ _ .)] p~m/~[l(N/_lt_g_~_l_Fn~.)~)]


j=l r[{(NiTiyTry-~ j, r[½(g+F-m-
(5.6)

and the constant of proportionality is

mod ( BZ') s: ~r-PK/2

and hence is irrelevant• The irrelevant constant will be absorbed by the propor-
tionality sign oc and hence will be omitted from now on.
The posterior expectation of ~'i is ~ as given in (3.5) and the posterior
expectation of 2i is

E(Zi [I~/)= ( N ~ - p - 1 ) - 1 { ( Y i - X~iA,)(Y ~- X;riA,)'

+(N/-- m--r-1) -1 X(X'S~-'X) - ~X'[ tr G; IAiA'i] }.


(5.7)

5.2. $\Sigma_i = \Sigma$ but unknown, $\tau_i$ unknown


Using the convenient prior h(~-, .... , ~'c,2 - ') ~: {2[-<P+ ,)/2, we obtain the post-
erior density

g(~, ..... ~c,z - ' IY)~ I~:-' I<N-p-'>/2


Xexp{-½trZ-'(Y- X'rA)(Y-- X'rA)'} (5.8)
where
r--(r,,r2 ..... r,), =(~,,~2 . . . . . ~), N=N,+N2+... +N~.,
A1 0 0 • • " 0

0 A2 0 • " " 0

A= 0 0 A3 (5 •9)

0 0 • *" A C

The posterior distribution of Z ' and ¢ satisfy

G ( Z - ' [~-, Y ) = W ( - ; [ ( Y - X'rA)(Y- X'rA)']-',N), (5.10)

G(T[Y)=Dm.rc(-;4, G,(X'S-'X) ',U) (5.11)



where 4, S and G are defined in (2.2) and W(.) stands for the Wishart
distribution. The posterior expectation of 2J is

E ( ~ . I Y ) = ( N - p - - 1 ) - I { ( Y - X?A)(Y-- X~A)'
+( N - m -- rc-- I )-' X( X'S-1X)-' X'[trG-1AA'I }. (5.12)

We also note that ÷ is the m.l.e, as well as the posterior expectation of ~-, and the
m.l.e, of N is

~,= N - ' ( Y - - X÷A)(Y-- X÷A)'. (5.13)


F r o m (5.8) and the fact that

I(r- X.:A)(Y- X'rA)'+(V- X'riFi)(V- Xq'iFi)'I =


= v ) - X,r*/-/,*] v ) - x¢*H*]'I (5.14)
where
Y/* = (Y,, Y2 ..... Y/-,, ~+~ ..... Yc, Yi),
~* = (~1, "r2. . . . . 5 - 1 , 5 + 1 ..... ~'c, 5 ) ,
A I 0 0 0 ••• 0
0 A2 0 0 "'"

0 0 Ai_ 1 0 0
4,,= 0 = (A*, F~*),
Ai+ 1 0

Ac 0 0
0 0 0 ..- 0 Ai
(5.15)
~t__ t
F i - (0,0 .... ,0, Fi ),

we obtain, by applying L e m m a 2.1, the predictive density of V

g(V[ Y, Hi) = g(V l [V2, Y, Hi)g( V2[Y, Hi) (5.16)


where
G(V,[V2,Y, YIi)=Dm,K(.;Q*,G*,(X'S*-'X ) 1
(5.17)
G(Vz[Y, Hi)=Dp_m,K(.;O,I,Z'Y*Y,*'Z,N+K-m) (5.18)

where Q*, G* are defined in the same way as Qz, Gi given in (5.4) except that Y/is

replaced by Y~*,Ai by A*, F/by F~* and S~ by

S*i --- ~"i *\--


( 1 - A*'(
i ~ A*A*q-
- - i - - i / ~--i
A* ]~*~*'
i "

The predictive density of V can be shown to be

g ( V l Y , 1~ i ) OC l( X t S ~ - I X ) [ - ( N - ' c ) / ~ l G y l ~ / 2 l Z ' ~ * ~ * ' Z l ( U - ~.)/~

× ( X ' S * - I x ) - ' + B ( V - XQ*)G*(V- XQ*)'B' -(2V+K-~c)/2

X [Zt(~t~Yi*t~ - [/'Vt)Z[ -(N+K-m)/2 (5.19)

5.3. $\Sigma_i = \Sigma$ but unknown, $\tau_i$ known


With the prior h(Y~ ~)~ [Z[ (p+O/2, we have the posterior density of 2; 1:

g(~y-1 [~., y) ~ [y-~[(N-p-O/2exp{_ ½tr ~ - l ( y _ X'rA)( Y - X'rA)'}


(5.20)

where Y, z, A and N are defined in (5.9), which implies

E(Z[ % Y) = ( U - p - 1)-1(Y - X z A ) ( Y - X~A)'.

The predictive distribution of V is easily seen to be

G(V[ ~, Y, Hi) = Dp, K('; XTiFi, I, (Y-- X~A)( Y-- XTA)', U + K).
(5.21)

5.4. $\Sigma_i$ unknown, $\tau_i = \tau$ unknown

With the prior h(+, X ~ ..... 2~ -~) cc II~=llY~jI(P+°/2, we have the posterior
density

P jl
j=l

× e x p { - ½ tr ~Y/'(Yj - X'Mj)(Yj- X~Aj)'}.


(5.22)

By Lemma 2.1 we have the posterior density of ~,

g(~lY) ~ 1~ J ( X ' S ; ' X ) - ' + ( ~ - ~ I G j ( * - ~ ) ' I -~j/2 (5.23)


j=l

where ~ , ~, G/are defined in the same way as S, ++,G given in (2.2) except that Y
is replaced by Y/, A by Aj. We note that (5.22) is a product of c determinantal
distribution kernels.
We now consider the special case where c = 2, r = 1. From (5.22) it is easy to
see that the posterior distribution of Ni- ~ for a given "~ is

G(N~-I]Y,+)=W(. ;[(Yi- X'i'Ai)(Yi- X4"Ai)t ] 1 ,Ni) , (5.24)

and it can be shown that the posterior density of + is

g(+lY) = f 0 ° g(4"lO,Y)f(O)dO (5.25)


where
G(4"IO, Y) = T(-; 4,*, N*, N ) , N = N l + N2 (5.26)
and
f( O) = I?20N'/Z-l[a* + ( 4'2-- 4'1)'+[(4'2 -- 4'1)] - ( u m)/2lA*l-'/2,
~* = ( U - m) -1 (a* +(4. 2 - 4.1)'A(4.2 - 4,1) ) A * - 1 , (5.27)
++*=++1+A*-~X'S2-~X(h-++,), ot*=OG~lq-Gf l,
A* = OXtSI1X~ XtS21X, 4 = OXtSIIXA*-Ixts21x

a n d / £ is a normalizing constant. We can also obtain

E(+]Y) = ff°4.*f(O)dO (5.28)


and
E(~ilV) = ( Ni - P - 1)-1( (Yi - X[ E(+IY)] a,)
, ×(y~- X[E(+IY)]A~ )'
+(AiA ~) X[cov( +IY)] X'}.

A reasonable approximation can be obtained as

G(4"IY ) ~--T(- ; 4.*(0 ), X*(O ), N ) (5.29)

where t~ maximizes f(O). We thus have approximately

e(+lt) ++*(0),
^q¢ ^
E(~ilY) ~(N~- p-1)-'{(Y,- X++*(0)Ai)(g~- X++ (0)Ai) !

(5.30)

For the special case c = 2, r = 1 the predictive density of V can be shown to be

g(Vl~, hi) o~c, im,YjUml-~/~lZ,(y,y,, + VV')ZI-(Ni+K)/2


× IX,S?,Xl~/21x,s~,xl(N,+K)/~
X t(UJ2)-l[li(t)]-(N'+NJ+g)/2]Di(t)l-1/2dt
(5.31)
where for j v~ i

ci=r-' r' 5- ~=, r[½(Ui+l-~)] '

Di( t ) = tGj X ' S ~ IX-F Goi X'So-i Ix, (5.32)

li(t)= l + t( ( ~_~oi),GjX,Sj lX[Di(t)]-I GoiX iSo


, -Ix(~.^ - ~roi)
and SOD~Oi'aoi are defined in the same form as S, ~ and G given in (2.2) except
that Yis replaced by U~--- (Y/, V) and A is replaced by/4,. = (A i, F/). We note that
the one-dimensional integrals in (5.25) and (5.31) can be executed with desired
accuracy without difficulty through the modern computer (see, for example, Lee
(1975)).

5.5. $\Sigma_i$ known, $\tau_i = \tau$ unknown


The posterior density is easily seen to be

g(4"[~l ..... ~c, Y) °c ~I exp{--~ tr( X'~}-lX)('r--roj)AjA~(ec-~'oj) ')


j=l
(5.33)

where ~'ojis defined in the same way as + given in (2.2) with S replaced by 2j, Y by
Yj and A by Aj. We note that (5.33) is a product of c multivariate normal kernels.
For the rest of this case we consider the situation where c = 2, r -- 1. It can be
shown that the posterior distribution of + is

G(+I~,, 2 2, Y) =N(. ; ~0, 12, +~22) (5.34)


where
[2i=(AiA~)X'~/'X , i=1,2,
(5.35)

We note that 4o is the posterior expectation of ~.



The joint density of V and ~ is, for j r ~ i,

g(V, ~ l S l , ~ 2 , Y , I I i ) o c e x p ( - ½ tr~yil[(Yi,V) - X'~Hi]


X [(Y/, V)-- S'rni]'
- ½t r Z ; ' ( ~ - X+~j)(~-X+Aj)'},
from which it can be shown that the predictive density of V is, forj v~ i,

g(Vl'~l, "~2, Y, Hi) oc I:~il-X/2exp { - ½ tr( Z'y, i Z ) - ' Z , G U / Z }


X e x p { - ½tr(Z'~,/Z) 'Z'YjY]Z}

Xexp(-½trX'~,71X[(X'S~ilX) -1 + (~/i - ~oi) C2(~).1(~i - ~oi)'] }

× exp{- ½(To,- Toj)'HiH;X'~,;1XQolAjA;X'r,f-IX(Toi - Toj)}


x I~ HitXtZi 1X-}- AjAj X t Z ; ' X l - 1 / 2
xexp (-½ tr X'N/'X[( X"Sj-1X)-l Jr< ~ j - ~ j ) f ( J ! , <" O j - ,~j)t] }
(5.36)
where Sol is defined as in Section 5.4 and

ni = B Z i Z ( Z !~i Z ) --| , ~oi z BSoiZ(Z'SoiZ) 1,


U~= (Y~,V), Qo = I-t~H;X'~,?'X + A j A i X 'Z -'X,
7; i,= ^%i - ( ^ )z'~/-/i(/-/d4;)
~/i--~oi ' (5.37)
Toj=~j--(~ j _ _
~j)ZYjAj(AjAj)
^ t ! t --
1,

d~i~ d~ z'uitJ; z'u,v,'z]

(~(J) is defined as ~(o except that H i is replaced by A/ and U~ by Y/, and


C(2~). 1 ~- C(2~) -- "'21
[~(i)g%(
"'11i) - 1/'%(0
"12"

5.6. $\Sigma_i$ unknown, $\tau_i$ known
With the prior h ( ~ -1) ec I~Jil(p÷l)/2, we have the posterior density

g(~-1Yi) cc I~/1[ (N~-p-1)/2exp( _ ½ tr ~/-'(Yi - X%A~)(Ye - X'riA~)'}


(5.38)

which implies

E(Zil ~ ) = (U~ - p - 1 ) - ' ( ~ - X~iAi)(~ - X~Ai)'. (5.39)

The predictive distribution of V is easily seen to be

G(VlZi, Yi, Hi) = D ( . ; XTiFi, I , ( Y i -- X ' r i A i ) ( Y i - XTiAi)' , N i Af_K).


(5.40)

5.7. $\Sigma_i$ known, $\tau_i$ unknown


By the prior h(Ti)d'r icc d% it can be shown that the posterior distribution of ~'i
is

t --1
G('rilZi,Yi)=N(';'roi,(X'-y,/lx)-l®(AiAi) ) (5.41)

where "r0iis defined in the same way as + given in (2.2) with S replaced by Z i, Y by
Yi, A by Ai.
The predictive density satisfies

g(VI.Yi-',Y~,Hi)=g(V11V2,.YTi,Yi,Hi)g(V21~71,Yi,Hi) (5.42)
where
G(V, IV2,•i', Hi) = N ( ' ; Qi,( X t ~ i l x ) -1 @Mi I )'
G(V2IZZ 1, Yi, H i ) = N(. ;0, Z'Z/'Z®IK) (5.43)

and Qi is defined in the same way as Qi given in (5.4) except that S i is replaced by
~i-

6. Rao's simple structure

6.1. $\Sigma_i = X\Gamma_iX' + Z\Theta_iZ'$; $\Gamma_i$, $\Theta_i$, $\tau_i$ unknown


With the convenient prior

h( i, r/l, o7 l) o: IEI <m+o/21oi1<p-,.+1 /2 ,


we obtain the predictive density of V

g( V i Yi, Hi) oc Ki l BSiB'[ (N,-r) /2 l Mi l m/21D YiiYii'D'lN'/Z


× IgSiB,+(nv__ ZliFi)Mi(B V _ TliFii), I (Ni+g-r)/2
× [DY~Y'D'+ DVV'D'[ -<Ni+K)/2 (6.1)

where Tli is defined in (3.8) and

K i = ~I F [ ½ ( N ~ + K + I - r - j ) ] F[½(N~+K+I--j)]
j, r[~(N,+i---;--~] pIIm.j=, r[½(Ni+F--j-)] "
(6.2)

The posterior expectations are


t t --1
E(~i]Yi) = B YiAi( AiAi ) ,
E(F~IY~)=(Ni-r-m-1 ) 1BS~B,t (6.3)
E( Oil Y~) = ( U - - p + m -- 1)-I D YiYii'D'.

6.2. $\Sigma_i = X\Gamma X' + Z\Theta Z'$; $\Gamma$, $\Theta$ and $\tau_i$ unknown


By the convenient prior

h ( T 1 . . . . . ,rc ' / ~ - 1 0 - 1 ) (3c I r l ( m + ' ) / 2 1 0 1 ( p - m + 1)/2,

we see that the posterior distributions are

G(~gi[Z) = Din,r('; rli, AiA ;, BSB', N - r),


G(F-1y ) = W ( - ; ( B S B ' ) -1 , N -- rc ), (6.4)
G(O-IIy)=w( •;(DYY'D')-I,N)

where

S = ~, S~, Y=(Y1,...,Yc), N= N i.
i=1 i=1

Hence

E('ri]Y)=Tli, E(FIY)=(N-rc-m--1 ) 1BSB'


and
E(O IY) = ( N - p + m - 1 ) - ' D Y Y ' D ' .

The predictive density of V is


g(VlY, Hi) c~ IM~im/2IDYY'D'+DVV'D'I-(N+K)/2
× IB S B ' + (~V- T , f , ) ~ ( ~ V - V,iF,)'I ~N+K-rc)/2
(6.5)

6.3. $\Sigma_i = X\Gamma X' + Z\Theta Z'$; $\Gamma$, $\Theta$ unknown, $\tau_i$ known


With the prior

h( r - ', O-1) Irl m+ 'V21Ol P-m+ l)/2,


w e see that the posterior distributions of F 1 and 0 ~ are

G(F IIY, r ) = W(. ;[(BY-- "rA)(BY- ~'A)'] 1 N),


G( O-'IY, "r) = W(" ;(D YY'D') -1, N) (6.6)

where Y, r, A and N are defined in (5.9). H e n c e w e have

E(FI Y, ~-) = ( N - m - 1 ) - ' ( B Y - "rA)(BY- "rA)',


E(OIY, ~') = ( N - p + m - 1)-1DYY'D ',

and the predictive density

g(VIY, "r, 17,) cc I( B y - "rA)( B Y - "cA)'


+ (BV-
× [DYY'D'+ DVV'D'I -(N+K)/2. (6.7)

6.4. $\Sigma_i = X\Gamma_iX' + Z\Theta_iZ'$; $\Gamma_i$, $\Theta_i$ unknown, $\tau_i = \tau$ unknown
As in Section 5.4 we will consider the case where c = 2, r = 1. With the
convenient prior

h(/] -1, 0/-', + ) cc I~l~m+l)/2 IO,I~P m+,~/2

we have the posterior distribution of ~ - i and O~-1 for a given ~:

X W( 0/-1;(D Y,Yi'D') -1 , N/) (6.8)

and the posterior density of 4

g(~lY) = f~g(+lY, t ) f , ( t ) d t (6.9)


where
G('~IY, t ) = T(. ; T I , ~ , N ), N = NI + N2, (6.10)

and
f , ( t ) = I~tN'/2--1[Ol"t-(rl2 -- Tll)tA(T12 -- r l , ) ] - ( N m)/2 iA, l_l/2,
,Y: (N- m)-l {a +(T,2 - T,,)'A(T,2- T1, ) }A *-1 ,
T, = rll + fI*-'(BS2B')-l(T12 - r,,), (6.11)
ot --= t ( A I A ] ) -1 + ( A 2 A I ) -1 ,

A* = t( B S 1 B ' ) -1 + ( B S 2 B O -1, A = t(BSIB')-'A*-1(BS2B') -1

a n d / ~ is a normalizing constant. As in Section 5.4 we have

E(+IY) = fo°~Tl(t)fl(t)dt,
E(O,IY)=(~-p+m-1) --I
DEED t !

and
E(F/] r ) = (N/-- m -- 1) - ' { (BY/-- E(4IY)Ai)(BY i -- E(,I v)A,)'
+ AiA;[cov (elY)]}.
A reasonable approximation for the posterior distribution of 4 is

G( 41Y) ~ T(-; TI([ ), •(/), N ) (6.12)

where t~maximizes fl(t). Thus, approximately, E( 4 [ Y) ~ T 1([) and

E(~IY)~(N+--m--1) l((nYi-- Zl(i)Ai)(B~i-- Zl(i)Ai)t

+
+ A m m 2A~A;2(/)}"
The predictive density of V is, f o r j ~ i,

g(mlg,/L) ~ ~IBS+B'I (NF1)/2IDYf/D'IN'/2


XID(Yiy;_I_VV,)D,I(Ni+K)/ZIBS/B,I-(NFO/2IB~I-(N,+K1)/2
× f~t(N'+X)/2--1[ h(t)] -(N+ ~,:)/21tHfl;BS/B, + A/A}B, l-1/2dt
~3

(6.13)
where
:N,+K~ ,:Nj~ r[~(N~ K+I--~)]
. =H, + 1 - a)]
F[½(N/+
p-m r[l(N~ + K + l~)i)] (6.14)
X al-I 1 F[l(Ni +l- '
B~= BS~B'+ (BY-- T,,E)'M~(BV-- TI~).

6.5. $\Sigma_i = X\Gamma_iX' + Z\Theta_iZ'$; $\Gamma_i$, $\Theta_i$ unknown, $\tau_i$ known


By the convenient prior

h(~ 1,0/~1)(3CIFil(m+l)/2[Oel (p m+O/Z,

it is easily seen that

G(F/-' IY/, 5 ) = W ( - ; [ ( B Y i - ' r i A i ) ( B Y i - ' r i A i ) ' ] - l , N i ) ,


(6.15)
G(OZ'IYi,$i)=W( • ; (DY/Y/'D')-', ~).
Hence
E(F/I Y/, Ti) = (N/-- m -- 1)-'(BY/-- ~ . A i ) ( B Y i -- "riAi)',

E( Oil Yi, ~'i) = ( Ni -- P + m -- 1)-I D YiYi'D '.

It can be shown that the predictive density of V is

g( V l Y~, ,i-i, I7[i) o~ /£~ I(n Y//- ,.l-iAi)( B ~ii - ,riAi )'l Ni/2 l D Y~Yg'O'l N,/2
XI V)- [B(r,, V ) - K)/2
X ID(Y,.Y~' + V V ' ) D ' I-(u,+K)/2 (6.16)

where/£* is defined as K i given in (6.2) with r = 0.
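As a computational illustration, the following Python sketch evaluates the posterior expectations and the log of the predictive density (6.16) as reconstructed above, assuming B, D, Y_i, A_i, τ_i, F_i and the future data V are supplied as conforming NumPy arrays; the function and variable names are illustrative only and not from the chapter.

    import numpy as np

    def posterior_expectations(B, D, Yi, Ai, taui):
        # E(Gamma_i | Yi, tau_i) and E(Theta_i | Yi, tau_i) as in (6.15)
        m, p, Ni = B.shape[0], Yi.shape[0], Yi.shape[1]
        R = B @ Yi - taui @ Ai                      # residual B Y_i - tau_i A_i
        E_gamma = R @ R.T / (Ni - m - 1)
        E_theta = D @ Yi @ Yi.T @ D.T / (Ni - p + m - 1)
        return E_gamma, E_theta

    def log_predictive(B, D, Yi, Ai, taui, Fi, V):
        # log of (6.16), up to the normalizing constant K~*_i
        Ni, K = Yi.shape[1], V.shape[1]
        R = B @ Yi - taui @ Ai                      # training residual
        Q = B @ V - taui @ Fi                       # residual of the future matrix V
        ld = lambda M: np.linalg.slogdet(M)[1]
        return (0.5 * Ni * ld(R @ R.T) + 0.5 * Ni * ld(D @ Yi @ Yi.T @ D.T)
                - 0.5 * (Ni + K) * ld(R @ R.T + Q @ Q.T)
                - 0.5 * (Ni + K) * ld(D @ (Yi @ Yi.T + V @ V.T) @ D.T))

Classification of a new growth curve V then amounts to comparing log_predictive under the competing populations, exactly as the posterior-odds rules of the earlier sections prescribe.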

References
Anderson, T. W. (1958). An Introduction to Multivariate Statistical Analysis. Wiley, New York.
Geisser, S. (1964). Posterior odds for multivariate normal classifications. J. Roy. Statist. Soc. Ser. B 26, 69-76.
Geisser, S. (1966). Predictive discrimination. In: P. R. Krishnaiah, ed., Multivariate Analysis, 149-163. Academic Press, New York.
Geisser, S. (1970). Bayesian analysis of growth curves. Sankhyā, Ser. A 32, 53-64.
Geisser, S. and Desu, M. M. (1968). Predictive zero-mean uniform discrimination. Biometrika 55, 519-524.
Khatri, C. G. (1966). A note on a MANOVA model applied to problems in growth curve. Ann. Inst. Statist. Math. 18, 75-86.
Krishnaiah, P. R. (1969). Simultaneous test procedures under general MANOVA models. In: P. R. Krishnaiah, ed., Multivariate Analysis II, 121-143. Academic Press, New York.
Lee, J. C. (1975). A note on equal-mean discrimination. Comm. Statist. 4, 251-254.
Lee, J. C. (1977). Bayesian classification of data from growth curves. South African Statist. J. 11, 155-166.
Lee, J. C. and Geisser, S. (1972). Growth curve prediction. Sankhyā, Ser. A 34, 393-412.
Lee, J. C. and Geisser, S. (1975). Applications of growth curve prediction. Sankhyā, Ser. A 37, 239-256.
Leung, C. Y. (1980). Discriminant analysis and testing problems based on a general regression model. Ph.D. Thesis, University of Toronto.
Nagel, P. J. A. and de Waal, D. J. (1979). Bayesian classification, estimation and prediction of growth curves. South African Statist. J. 13, 127-137.
Potthoff, R. F. and Roy, S. N. (1964). A generalized multivariate analysis of variance model useful especially for growth curve problems. Biometrika 51, 313-326.
Rao, C. R. (1965). The theory of least squares when the parameters are stochastic and its application to the analysis of growth curves. Biometrika 52, 447-458.
Rao, C. R. (1966). Covariance adjustment and related problems in multivariate analysis. In: P. R. Krishnaiah, ed., Multivariate Analysis, 87-103. Academic Press, New York.
Rao, C. R. (1967). Least squares theory using an estimated dispersion matrix and its application to measurement of signals. Proc. Fifth Berkeley Symp. Math. Statist. and Probability 1, 355-372.
Welch, B. L. (1939). Note on discriminant functions. Biometrika 31, 218-220.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 139-168

Nonparametric Classification

James D. Broffitt

1. Introduction

1.1. Preliminaries
Consider the problem of discriminating between two populations π_1 and π_2. An element is selected at random from a mixture of π_1 and π_2, and p variates are measured on this element, which we arrange in a vector z. Based on z we must decide if the element is a member of π_1 or π_2. For convenience we shall talk about classifying z rather than the element which gave rise to z. We may then state the problem as deciding between z being a member of π_1 and z being a member of π_2.
We may think of any decision rule as a partition of the p-dimensional space of the random vector Z into two sets, K_1 and K_2. (We shall follow the usual convention that z denotes an observed value of the random vector Z.) If z ∈ K_1, then z is classified into π_1, otherwise z is classified into π_2. Thus any decision rule may be defined by specifying the set K_1. Corresponding to each decision rule are the probabilities of classification: P(j|i) is the probability of classifying an element from π_i into π_j. If i ≠ j, then P(j|i) is a probability of misclassification, otherwise P(j|i) is a probability of correct classification. Let the p variates have pdf f_1(·) in π_1 and f_2(·) in π_2. We may then write

P(j|i) = P{Z ∈ K_j | Z ~ f_i}

where Z ~ f_i means that Z has pdf f_i(·).
The objective is to determine a decision rule which minimizes some function of the misclassification probabilities. Such a function may be obtained by considering the costs of misclassification. Let C(j|i) denote the loss (cost) incurred when an element from π_i is classified into π_j. We restrict C(j|i) to satisfy: C(j|i) = 0 if i = j, and C(j|i) > 0 if i ≠ j. Then the risk (expected loss) corresponding to a decision rule is

q_1 C(2|1) P(2|1) + q_2 C(1|2) P(1|2),

where q_i = P{Z ~ f_i} is the prior probability that Z is a member of π_i. The


decision rule that minimizes this risk (see Anderson (1958, Chapter 6)) is specified by

K_1^* = {z: f_1(z)/f_2(z) ≥ k}   where k = q_2 C(1|2)/q_1 C(2|1).      (1.1)

Thus if we know the prior probabilities, misclassification costs, and the pdf's f_1(·) and f_2(·), we classify z into π_1 if f_1(z)/f_2(z) ≥ k, and into π_2 otherwise.
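The optimal rule (1.1) is easy to apply once the ingredients are specified. The following Python sketch assumes known univariate normal densities and hypothetical priors and costs, purely to make the rule concrete; none of the numbers come from the chapter.

    from scipy.stats import norm

    q1, q2 = 0.6, 0.4          # prior probabilities (hypothetical)
    C21, C12 = 1.0, 2.0        # C(2|1) and C(1|2) (hypothetical)
    k = (q2 * C12) / (q1 * C21)

    f1 = norm(loc=0.0, scale=1.0).pdf   # density under pi_1
    f2 = norm(loc=2.0, scale=1.0).pdf   # density under pi_2

    def classify(z):
        # classify z into pi_1 if and only if f1(z)/f2(z) >= k
        return 1 if f1(z) / f2(z) >= k else 2

    print([classify(z) for z in (-1.0, 0.8, 1.5, 3.0)])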
In many practical problems it may be difficult to assess the misclassification
costs as well as the prior probabilities. In this case a different approach may be
possible. Suppose we are able to specify the value that we would like to have for
the ratio P(2|1)/P(1|2). For example, if we desire a decision rule that misclassifies elements from π_2 twice as often as elements from π_1, then we would let P(2|1)/P(1|2) = ½. In general let γ be the value assigned to P(2|1)/P(1|2). Then among all decision rules that satisfy our constraint P(2|1) = γP(1|2), we want the one that minimizes P(2|1), or equivalently, minimizes P(1|2). The solution is given by

K_1^* = {z: f_1(z)/f_2(z) ≥ k}

where k is selected to satisfy P(2|1) = γP(1|2). In Section 2.4 we will show how to apply discriminant functions in such a way that, at least asymptotically, P(2|1)/P(1|2) = γ. These procedures allow the experimenter to have some control over the balance between the two misclassification probabilities.
In order to apply the above decision rules it is necessary to compare f_1(z) to f_2(z). In practical problems f_1(·) and f_2(·) are not fully known and must be estimated from sample data. Our decision rules will then be based on f̂_1(z)/f̂_2(z) where f̂_i(z) is an estimate of f_i(z). That is, we replace the set K_1^* with

K̂_1^* = {z: f̂_1(z)/f̂_2(z) ≥ k},

which defines the sample based rule: classify z into π_1 if and only if z ∈ K̂_1^*. There are different levels of prior knowledge that we may have about f_i(·). If we assume that f_i(·) is a normal density with unknown mean vector and dispersion matrix, then we could obtain f̂_i(·) by replacing the unknown parameters in f_i(·) with their sample estimates. This would be termed a parametric solution since we assumed very specific forms for f_1(·) and f_2(·), and consequently needed to estimate only a few parameters. On the other hand if we have very little knowledge of f_i(·), then we might use a nonparametric density estimate to obtain f̂_i(·). In either case we must have observations from π_1 and π_2 in order to form estimates of f_1(·) and f_2(·).
Two methods of sampling are used in practice. The method used is generally dictated by the practical problem rather than a preference choice. In the first method we draw random samples separately from π_1 and π_2. Let x_1,...,x_{n_1} be the observation vectors for n_1 randomly selected elements from π_1, and similarly y_1,...,y_{n_2} are the observation vectors for n_2 elements from π_2. The x's and y's are called training samples and may be used to estimate f_1(·) and f_2(·), respectively. The second sampling method occurs when we must sample from the mixture of π_1 and π_2 rather than taking samples from the individual populations. Let n be the number of observations taken from the mixture. If these observations can be correctly classified, then those from π_1 are the x's and those from π_2 are the y's. In this case, n_1 and n_2 are random and n_i/n may be used as an estimate of q_i. In the remainder of this section we will concentrate on methods of classification based on training samples.

1.2. The LDF and QDF parametric rules

The most common parametric solution for the classification problem assumes that the distribution for π_i is N_p(μ_i, Σ), i = 1, 2, that is, p-variate normal with mean vector μ_i and dispersion matrix Σ. Notice that the dispersion matrix is assumed to be the same for π_1 and π_2. Then

f_i(z) = (2π)^{−p/2} |Σ|^{−1/2} exp{−½(z − μ_i)'Σ^{−1}(z − μ_i)}.

The parameters μ_1, μ_2, and Σ are unknown. To obtain f̂_1(z) and f̂_2(z) we replace μ_1 by x̄, μ_2 by ȳ, and Σ by S where

x̄ = ∑_{i=1}^{n_1} x_i/n_1,   ȳ = ∑_{i=1}^{n_2} y_i/n_2,
S_1 = ∑_{i=1}^{n_1} (x_i − x̄)(x_i − x̄)'/(n_1 − 1),
S_2 = ∑_{i=1}^{n_2} (y_i − ȳ)(y_i − ȳ)'/(n_2 − 1),

and

S = [(n_1 − 1)S_1 + (n_2 − 1)S_2]/(n_1 + n_2 − 2).

Then f̂_1(z)/f̂_2(z) ≥ k is equivalent to D_L(z) ≥ ln k where

D_L(z) = [z − ½(x̄ + ȳ)]'S^{−1}(x̄ − ȳ).

R. A. Fisher was the first to suggest using D_L(z) for purposes of classification. Since D_L(z) is linear in the components of z, it is called the linear discriminant function or LDF. The procedure is to compute x̄, ȳ, and S from the training samples, then compute D_L(z) and compare it to ln k. If D_L(z) ≥ ln k we classify z into π_1, otherwise we classify z into π_2.
Notice that in a sample based classification rule the classification probabilities P(j|i) are computed with respect to the joint distribution of Z, X_1,...,X_{n_1}, Y_1,...,Y_{n_2}. For example, with the above rule based on D_L,

P(2|1) = P[(Z, X_1,...,X_{n_1}, Y_1,...,Y_{n_2}): D_L(Z) < ln k | Z ~ f_1].

If we relax the assumption of equal dispersion matrices, then the distribution for π_i is N_p(μ_i, Σ_i). We obtain f̂_1(z) by replacing μ_1 by x̄ and Σ_1 by S_1, and to obtain f̂_2(z) we replace μ_2 by ȳ and Σ_2 by S_2. Then

f̂_1(z)/f̂_2(z) ≥ k is equivalent to D_Q(z) ≥ 2 ln k

where

D_Q(z) = (z − ȳ)'S_2^{−1}(z − ȳ) − (z − x̄)'S_1^{−1}(z − x̄) + ln(|S_2|/|S_1|).

Since D_Q(z) is a second degree polynomial in the components of z, it is called the quadratic discriminant function or QDF. We classify z into π_1 if D_Q(z) ≥ 2 ln k and into π_2 otherwise.
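Both rules are direct to compute from the training samples. The Python sketch below evaluates D_L and D_Q for simulated data; the simulated samples, the cutoff k and the seed are all hypothetical choices made only so the example runs.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.multivariate_normal([0, 0], np.eye(2), size=30)   # training sample from pi_1
    y = rng.multivariate_normal([2, 1], np.eye(2), size=25)   # training sample from pi_2

    xbar, ybar = x.mean(axis=0), y.mean(axis=0)
    S1 = np.cov(x, rowvar=False)                              # divisor n1 - 1
    S2 = np.cov(y, rowvar=False)
    n1, n2 = len(x), len(y)
    S = ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)       # pooled dispersion matrix

    def D_L(z):
        # [z - (xbar + ybar)/2]' S^{-1} (xbar - ybar)
        return (z - 0.5 * (xbar + ybar)) @ np.linalg.solve(S, xbar - ybar)

    def D_Q(z):
        d2 = (z - ybar) @ np.linalg.solve(S2, z - ybar)
        d1 = (z - xbar) @ np.linalg.solve(S1, z - xbar)
        return d2 - d1 + np.log(np.linalg.det(S2) / np.linalg.det(S1))

    k = 1.0                                                   # equal priors and costs
    z = np.array([1.0, 0.5])
    print("LDF:", 1 if D_L(z) >= np.log(k) else 2,
          "QDF:", 1 if D_Q(z) >= 2 * np.log(k) else 2)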
It must be strongly emphasized that the two sample based procedures defined above are not optimal since they are based on f̂_1(z)/f̂_2(z) rather than f_1(z)/f_2(z). However, since f̂_i(z) is a consistent estimator of f_i(z), the sample based decision rules approach the corresponding optimal rules as n_1 and n_2 increase. As a consequence (see Glick (1972)) the misclassification probabilities for the sample based rules (classify z into π_1 if and only if z ∈ K̂_1^*) converge to the misclassification probabilities for the optimal rules (classify z into π_1 if and only if z ∈ K_1^*). Thus for large sample sizes the rules based on D_L and D_Q will be approximately optimal.
There is another question of optimality which has never been resolved. It is unrealistic to expect rules based on f̂_1(z)/f̂_2(z) to perform as well as those based on f_1(z)/f_2(z); however, if f_1(z)/f_2(z) contains unknown parameters, then it cannot be used to classify z. Thus decision rules based on f_1(z)/f_2(z) are of no practical value since they could never be used. We must consider only those rules which do not depend on unknown parameters, and among these we might look for the best rule. In general it is not known how to find such rules. For the present we must be content with using rules that are asymptotically optimal.

1.3. The multipopulation classification problem


In many problems we have more than two populations to which z could belong. Suppose we have m populations π_1,...,π_m with corresponding pdf's f_1(·),...,f_m(·), and prior probabilities q_1,...,q_m. As above let P(j|i) be the probability of classifying an observation from π_i into π_j, and let C(j|i) be the loss incurred by this action, where C(j|i) = 0 if i = j and C(j|i) > 0 if i ≠ j. Then the expected loss or risk is

∑_{i=1}^{m} ∑_{j=1}^{m} q_i C(j|i) P(j|i).      (1.2)

A decision rule is equivalent to a partition of the space of Z into sets K_1,...,K_m, such that z is classified into π_l if and only if z ∈ K_l. Then the decision rule that minimizes the expected loss (1.2) is specified by

K_l^* = {z: ∑_{i=1}^{m} q_i C(l|i) f_i(z) < ∑_{i=1}^{m} q_i C(j|i) f_i(z), j = 1,...,m, j ≠ l}.

That is, classify z into that population which corresponds to the smallest of the m values:

∑_{i=1}^{m} q_i C(1|i) f_i(z), ..., ∑_{i=1}^{m} q_i C(m|i) f_i(z).

If there is a tie between two or more populations for this minimum value, then z may be classified into any one of those populations involved in the tie. (Note that K_1^*,...,K_m^* is not a partition in the strict sense because they do not include these tie producing z's.) In the special case where C(j|i) = 1 for i ≠ j, K_l^* may be written in the simplified form

K_l^* = {z: f_l(z)/f_j(z) > q_j/q_l, j = 1,...,m, j ≠ l}.

In particular, if q_i = 1/m, i = 1,...,m, then z is classified into that population whose density is largest at z.
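A small Python sketch of this minimum-expected-loss rule follows; the three densities, priors and cost matrix are hypothetical and chosen only to illustrate the computation.

    import numpy as np
    from scipy.stats import norm

    f = [norm(0, 1).pdf, norm(2, 1).pdf, norm(4, 2).pdf]   # f_1, f_2, f_3
    q = np.array([0.5, 0.3, 0.2])                          # prior probabilities
    C = np.array([[0, 1, 4],                               # C[i, j] = C(j+1 | i+1), zero on the diagonal
                  [2, 0, 1],
                  [4, 1, 0]], dtype=float)

    def classify(z):
        dens = np.array([fi(z) for fi in f])
        # expected loss of deciding pi_l: sum_i q_i C(l|i) f_i(z)
        loss = np.array([np.sum(q * C[:, l] * dens) for l in range(len(f))])
        return int(np.argmin(loss)) + 1

    print([classify(z) for z in (-0.5, 1.2, 3.0, 6.0)])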
When the pdf's f_1(·),...,f_m(·) are unknown, they must be estimated from training samples. We may then use the sample based rule which classifies z into π_l if and only if z ∈ K̂_l where

K̂_l = {z: ∑_{i=1}^{m} q_i C(l|i) f̂_i(z) < ∑_{i=1}^{m} q_i C(j|i) f̂_i(z), j = 1,...,m, j ≠ l}.

If f̂_i(z) is a consistent estimator of f_i(z), then the sample based rule is asymptotically optimal. Thus the two population problem may be extended to m populations without difficulty.
To enhance exposition we generally restrict our discussion to the two population problem, referring to the m population problem when it is unclear how a two population procedure extends to the m population case.

1.4. The need for nonparametric classification


Since the LDF and QDF rules have simple forms and are based on the normal distribution, they have become the most widely used rules for discriminant analysis. Unfortunately, in most practical problems the assumption of normality is violated. We often encounter problems where the distributions are skewed or where one or more of the variates is discrete. Recently statisticians have shown greater awareness and concern for the presence of outliers in sample data, that is, observations that are abnormally distant from the center or main body of the data. Outliers are particularly hard to spot in multivariate data since we cannot plot points in more than two or three dimensions. It is because of the persistence of nonnormal data that we are interested in nonparametric classification.
In Section 2 we shall present a method which uses ranks of the discriminant scores to classify z. A problem often encountered with using D_L and D_Q on nonnormal data is the imbalance in the resulting misclassification probabilities. For example, when using D_L with highly skewed distributions or distributions that have quite different dispersion matrices, it is not uncommon to obtain misclassification probabilities such as P(2|1) = 0.08 and P(1|2) = 0.39. The experimenter may find this an undesirable feature. The rank method affords the experimenter some control over the ratio P(2|1)/P(1|2). These ranks may also be used to define partial classification rules where P(2|1) and P(1|2) are bounded by prespecified values. The rank procedure is universal since it may be used in conjunction with virtually any discriminant function including D_L, D_Q, and others discussed in Sections 3 and 4. In Sections 3 and 4 we shall review robust and nonparametric discriminant functions.

2. A procedure for partial and forced classification based on ranks of discriminant scores

2.1. Introduction
In this section we shall present a method of ranking discriminant scores. Rules
for partial and forced classification will then be defined in terms of these ranks.
We are not actually developing any new discriminant functions but rather a
nonparametric method of applying discriminant functions to classify observa-
tions. Its use in partial classification, where highly questionable observations need
not be classified, permits us to place upper bounds on the misclassification
probabilities. In forced classification we will be able to control, at least asymptoti-
cally, the balance between the two misclassification probabilities. These results
are valid regardless of the discriminant function being used and the distributions
of 7r1 and ~r2, provided the discriminant functions have continuous distributions.
Of course, to obtain an efficient procedure we should choose a discriminant
function which is suitable for discriminating between ~r~ and 7r2. As we shall see,
choice of the discriminant function may be made after examining the data,
without disturbing the nonparametric nature of the procedure. This is particularly
appealing in partial classification since we may pick the discriminant function
after a preliminary look at the data and still maintain prespecified bounds on the
misclassification probabilities. Thus the rank method provides us with an oppor-
tunity to adaptively select discriminant functions.

2.2. Partial classification


Suppose we have a discriminant function D(.) which we plan to use in
classifying observations into 7r~ or ~r2. Further suppose that D(-) is constructed so
Nonparametric classification 145

Fig. 1. Hypothetical distributions of D(Z).

that 'large' values of the discriminant score D(z) indicate classification of z into π_1. That is, classify z into π_1 if D(z) ≥ c and into π_2 otherwise. For this rule the misclassification probabilities are given by

P(2|1) = P{D(Z) < c | Z ~ f_1}   and   P(1|2) = P{D(Z) ≥ c | Z ~ f_2}.

Let the pdf of D(Z) be denoted by g_i(·) when Z ~ f_i. If g_1(·) and g_2(·) have a substantial overlap, then P(2|1) and P(1|2) may be quite large. This situation is depicted in Fig. 1.
Furthermore, if the observed value D(z) is close to the cutoff c, we should feel quite dubious about classifying z, since the choice between π_1 and π_2 is not clear cut, as it is when D(z) is very large or very small. Of course we prefer to use a discriminant function whose pdf's g_1(·) and g_2(·) are well enough separated so that the values P(2|1) and P(1|2) are tolerable, and we are comfortable about making classifications. This may not happen for two reasons. We may be using a discriminant function which is inappropriate for π_1 and π_2. For example, if we use the LDF, i.e., D(z) = D_L(z), when the population parameters are such that μ_1 = μ_2 and Σ_1 ≠ Σ_2, then both P(2|1) and P(1|2) could be 0.5. Also it may be that the distributions of our p variates, f_1(·) and f_2(·), are so similar that no discriminant function (including the optimal one) can adequately distinguish between π_1 and π_2.
These considerations suggest the use of partial (rather than forced) classification. In forced classification an observation z must be classified into π_1 or π_2. The decision rules we have considered thus far have been of this type. However, in partial classification we have the option of not classifying a z whose discriminant score D(z) is so close to c that we have little confidence in deciding between π_1 and π_2. Partial classification is a three decision procedure: classify z into π_1, classify z into π_2, or do not classify z. The advantage is that we will have some control over the misclassification probabilities.
To define a partial classification rule, we specify two sets A_1 and A_2 so that the occurrence of A_i favors z ∈ π_i, i = 1, 2. With Ā_i representing the complement of A_i, we then use the following classification rule:
(a) If A_1 ∩ Ā_2 occurs, classify z into π_1;
(b) If Ā_1 ∩ A_2 occurs, classify z into π_2;      (2.1)
(c) If (A_1 ∩ A_2) ∪ (Ā_1 ∩ Ā_2) occurs, do not classify z.
This formulation is due to Quesenberry and Gessaman (1968); also see Gessaman and Gessaman (1972). Under this scheme we have

P(2|1) = P[Ā_1 ∩ A_2 | Z ~ f_1] ≤ P[Ā_1 | Z ~ f_1].      (2.2)

Since Ā_1 = (Ā_1 ∩ A_2) ∪ (Ā_1 ∩ Ā_2), this upper bound for P(2|1) will be sharp if P[Ā_1 ∩ Ā_2 | Z ~ f_1] = 0 (a similar statement holds for P(1|2)). Thus if we wish to control P(2|1) and P(1|2) at specified levels α_1 and α_2, we should try to define A_1 and A_2 so that P[Ā_i | Z ~ f_i] = α_i and P[Ā_1 ∩ Ā_2 | Z ~ f_i] = 0 for i = 1, 2. We would then have P(j|i) bounded closely above by α_i.

2.3. The rank procedure for partial classification


Ideally, suppose we knew the pdf's g_1(·) and g_2(·). Then we could compute numbers c_1 and c_2 so that the left tail area of g_1(·) is α_1 and the right tail area of g_2(·) is α_2 (see Fig. 2). We could then define A_1 and A_2 as the following events:

A_1 = [D(Z) > c_1]   and   A_2 = [D(Z) < c_2].      (2.3)

By (2.2) the upper bounds on the misclassification probabilities for the partial rule (2.1) with A_1 and A_2 given by (2.3) are

P(2|1) ≤ P[D(Z) ≤ c_1 | Z ~ f_1] = α_1,
P(1|2) ≤ P[D(Z) ≥ c_2 | Z ~ f_2] = α_2.

Also if c_1 < c_2, then Ā_1 ∩ Ā_2 = ∅ and these upper bounds are achieved, i.e., P(2|1) = α_1 and P(1|2) = α_2. The condition c_1 < c_2 occurs when g_1(·) and g_2(·) have a substantial overlap. It is in this situation that we recommend partial classification. If g_1(·) and g_2(·) are well separated (see Fig. 3) so that c_1 > c > c_2, then we would be foolish to use partial classification since the forced rule, which classifies z into π_1 if and only if D(z) ≥ c, has misclassification probabilities which satisfy the specified conditions P(2|1) ≤ α_1 and P(1|2) ≤ α_2, and never fails to

Fig. 2. Ideal partial classification regions.


classify observations. Unfortunately this situation rarely occurs in practice, so the use of partial classification is usually quite appropriate.
In practice, c_1 and c_2 are unknown so we cannot tell when A_1 or A_2 occur. An obvious alternative is to replace c_1 and c_2 by estimates and use the partial rule (2.1) with the events A_1 and A_2 replaced by

Â_1 = [D(Z) > ĉ_1]   and   Â_2 = [D(Z) < ĉ_2].      (2.4)

However, the difficulty here lies in finding reasonable estimates ĉ_1, ĉ_2. Even if we had such estimators, and used the partial classification rule (2.1) with the events Â_1 and Â_2 given in (2.4), there is no guarantee that P(2|1) ≤ α_1 and P(1|2) ≤ α_2.
The rank procedure will determine two events, A_1 and A_2, whose use in the partial classification rule (2.1) will ensure that P(2|1) ≤ α_1 and P(1|2) ≤ α_2. In this procedure two ranks are assigned to z. One rank measures its closeness to the training sample of x's and the other measures its closeness to the y's.
First we assume that z is from π_1 and accordingly we consider the two samples x_1,...,x_{n_1}, z and y_1,...,y_{n_2} of sizes n_1+1 and n_2 from π_1 and π_2 respectively. Based on these samples we determine a discriminant function D_x(·), designed so that discriminant scores for observations from π_1 will tend to be large and observations from π_2 will tend to produce small values for D_x(·). An essential requirement in the determination of D_x(·) is that the sample of n_1+1 observations from π_1 must be treated symmetrically, and the n_2 observations from π_2 must be treated symmetrically. That is, D_x(·) must be a symmetric function of x_1,...,x_{n_1}, z and a symmetric function of y_1,...,y_{n_2}. To fix this idea we use the following notation which emphasizes the dependence of D_x(·) on the observations:

D_x(·) = D_x(· | x_1,...,x_{n_1+1}; y_1,...,y_{n_2})

where x_{n_1+1} = z. Now if i_1,...,i_{n_1+1} and j_1,...,j_{n_2} denote arbitrary permutations of the integers 1,...,n_1+1 and 1,...,n_2, respectively, then D_x(·) must satisfy

D_x(· | x_{i_1},...,x_{i_{n_1+1}}; y_{j_1},...,y_{j_{n_2}}) = D_x(· | x_1,...,x_{n_1+1}; y_1,...,y_{n_2}).      (2.5)
Fig. 3. g_1 and g_2 well separated.
For example D_x(·) could be D_L(·) modified so that x_{n_1+1} = z is included in the computation of x̄ and S. Let R_x(z) be the rank of D_x(z) among the n_1+1 values D_x(x_1),...,D_x(x_{n_1}), D_x(z). The smallest item receives a rank of 1 and the largest a rank of n_1+1. A large rank assigned to D_x(z) indicates that z 'looks' more like an x than a y in terms of its D_x(·) value.
Then in a similar fashion we assume that z is from π_2 and consider the two samples x_1,...,x_{n_1} and y_1,...,y_{n_2}, z of sizes n_1 and n_2+1 from π_1 and π_2 respectively. We construct a discriminant function D_y(·) in a fashion that treats the sample x_1,...,x_{n_1} symmetrically and the sample y_1,...,y_{n_2}, z symmetrically. Again D_y(·) is designed so that it will tend to be large (small) for observations from π_1 (π_2). Let R_y(z) be the rank of −D_y(z) among the n_2+1 values −D_y(y_1),...,−D_y(y_{n_2}), −D_y(z). A large value for R_y(z) indicates that z 'looks' more like a y than an x.
We are now prepared to define A_1 and A_2. For convenience let us assume a_i = [(n_i + 1)α_i], i = 1, 2, where [·] denotes the greatest integer function. We then define the events A_1 and A_2 as

A_1 = [R_x(Z) > a_1]   and   A_2 = [R_y(Z) > a_2].      (2.6)

In a previous paper (Broffitt, Randles, and Hogg, 1976) it was shown that when Z ~ f_1, then R_x(Z) has a discrete uniform distribution over the integers 1,...,n_1+1, that is,

P{R_x(Z) = t | Z ~ f_1} = 1/(n_1+1)   if t = 1,...,n_1+1.

This result is independent of the distributions of π_1 and π_2 provided D_x(Z) has a continuous distribution. It follows that

P(2|1) = P[Ā_1 ∩ A_2 | Z ~ f_1] ≤ P[Ā_1 | Z ~ f_1]
       = P[R_x(Z) ≤ a_1 | Z ~ f_1] = a_1/(n_1+1) ≤ α_1.

Also, if Z ~ f_2 and D_y(Z) has a continuous distribution, then R_y(Z) has a discrete uniform distribution over the integers 1,...,n_2+1. By an analogous argument it follows that P(1|2) ≤ α_2. Thus we have defined events A_1 and A_2 (2.6), whose use with the partial rule (2.1) produces misclassification probabilities which are bounded by prespecified values α_1 and α_2.
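The following Python sketch implements the rank-based partial rule, using the LDF as both D_x(·) and D_y(·); because the LDF depends on each sample only through its mean and the pooled covariance, the symmetry requirement (2.5) is met. The data, alpha levels and helper names are hypothetical illustrations.

    import numpy as np

    def ldf_scores(x, y, pts):
        # pooled-covariance LDF scores; symmetric in the rows of x and in the rows of y
        xbar, ybar = x.mean(0), y.mean(0)
        n1, n2 = len(x), len(y)
        S = ((n1 - 1) * np.cov(x, rowvar=False) + (n2 - 1) * np.cov(y, rowvar=False)) / (n1 + n2 - 2)
        w = np.linalg.solve(S, xbar - ybar)
        return (pts - 0.5 * (xbar + ybar)) @ w

    def partial_classify(x, y, z, alpha1, alpha2):
        n1, n2 = len(x), len(y)
        # R_x(z): rank of D_x(z) among D_x(x_1),...,D_x(x_n1), D_x(z), with z pooled into the x's
        sx = ldf_scores(np.vstack([x, z]), y, np.vstack([x, z]))
        Rx = 1 + np.sum(sx[:-1] <= sx[-1])
        # R_y(z): rank of -D_y(z) among -D_y(y_1),...,-D_y(y_n2), -D_y(z), with z pooled into the y's
        sy = -ldf_scores(x, np.vstack([y, z]), np.vstack([y, z]))
        Ry = 1 + np.sum(sy[:-1] <= sy[-1])
        a1, a2 = int((n1 + 1) * alpha1), int((n2 + 1) * alpha2)
        A1, A2 = Rx > a1, Ry > a2                   # events (2.6)
        if A1 and not A2:
            return "pi_1"
        if A2 and not A1:
            return "pi_2"
        return "not classified"                     # rule (2.1), case (c)

    rng = np.random.default_rng(1)
    x = rng.normal([0, 0], 1, (20, 2)); y = rng.normal([1.5, 1.0], 1, (25, 2))
    print(partial_classify(x, y, np.array([0.7, 0.5]), alpha1=0.10, alpha2=0.10))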
It is important to note that we do not require D_x(·) and D_y(·) to be of the same form; for example, one could be based on D_L(·) and the other on D_Q(·). More importantly, the choice of D_x(·) and D_y(·) may be (and should be) made after examining the data, since these selections are actually just one step in the computation of D_x(·) and D_y(·). For example, suppose p = 2 and we wish to select D_x(·). We could examine a plot of the two samples of n_1+1 and n_2 observations in order to determine a discriminant function which does a good job in separating these two samples. In this process z must be plotted and treated as just another x. We must not know which point is actually z (or at least we must not use the knowledge that we know which point is z) in determining D_x(·). If we do not violate (2.5) and a corresponding equation for D_y(·), then R_x(Z) and R_y(Z) will have uniform distributions and the misclassification probabilities will satisfy the upper bound restrictions. This opportunity to 'legally' inspect the data before selecting a discriminant function may be used to great advantage. We should be able to pick reasonably accurate models for the distributions of π_1 and π_2 and thereby improve the efficiency of our analysis without disturbing its nonparametric nature.

2.4. The rank procedure for forced classification


In forced classification it is not possible to simultaneously bound P(l12 ) and
P(211 ). If f l(. ) and f2(') are very similar, there is no discriminant function that
will produce small values for both P(2[1) and P ( l l 2 ). In partial classification we
were able to bound the misclassification probabilities simply because we classified
only those z ' s which showed a definite preference for ~r~ or ~r2. In forced
classification we do not have that option. All z ' s must be classified including
those whose discriminant scores are in the intermediate zone. We may, however,
be interested in controlling the balance between P(211 ) and P(112 ). We shall use
the two ranks, Rx(z ) and Ry(z), defined in Section 2.3 to construct a forced
classification rule such that asymptotically P(2 [1 ) / P ( I ]2) = 2/, where 2/is specified
by the experimenter.
It is helpful to think in terms of p-values similar to those used in tests of
significance. First construct a p-value for ~q, representing the probability of
drawing an observation from ~rI at least as extreme in the direction of the y ' s as is
z. That is, suppose we repeat the process of drawing training samples of sizes n~
and n 2 from qr1 and ~r2 respectively, and then we draw an additional observation,
x, from ~rl, which we shall treat as an observation to be classified. Using these
new observations we construct Rx(X ), and note that it is uniformly distributed
over 1,2 ..... n I + 1 . We denote the observed value of Rx(Z ) by Rx(z ), so the
p-value we want is

Px(Z) = P ( R x ( X ) ~ Rx(z)) = Rx(z)/(nl + 1).


Correspondingly the p-value for 7r2 is

py(Z) = Ry(Z)/(n2 + 1).

This represents the probability of drawing an observation from 7r2 at least as


extreme in the direction of the x ' s as is z. The larger Px(Z) (Py(Z)) is, the more it
appears z belongs to ~r1 (rr2).
We define the forced classification rule as follows:
(a) Classify z into 7r1 ifpx(z)>2/py(Z);
(b) Classify z into ~r2 if Px(Z) < 2/py(Z); (2.7)
(c) If px(Z ) = 2/py(Z), then classify z using a nonrank procedure.
Notice that in using this rule we do not need to specify prior probabilities or misclassification costs. This rule produces misclassification probabilities which asymptotically satisfy P(2|1)/P(1|2) = γ. A heuristic argument for this result was given by Randles et al. (1978). In the special case where γ = 1, rule (2.7) classifies z into that population which corresponds to the larger p-value. This is an intuitively appealing result since the p-values measure the affinity of z for the respective populations.
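A Python sketch of rule (2.7) is given below; it reuses the ldf_scores helper from the sketch in Section 2.3 above, and the choice of the LDF sign as the nonrank tie-breaker in case (c) is a hypothetical convenience, not something prescribed by the text.

    import numpy as np   # ldf_scores as defined in the Section 2.3 sketch

    def forced_classify(x, y, z, gamma=1.0):
        n1, n2 = len(x), len(y)
        sx = ldf_scores(np.vstack([x, z]), y, np.vstack([x, z]))
        Rx = 1 + np.sum(sx[:-1] <= sx[-1])
        sy = -ldf_scores(x, np.vstack([y, z]), np.vstack([y, z]))
        Ry = 1 + np.sum(sy[:-1] <= sy[-1])
        px, py = Rx / (n1 + 1), Ry / (n2 + 1)       # p-values for pi_1 and pi_2
        if px > gamma * py:
            return "pi_1"
        if px < gamma * py:
            return "pi_2"
        # tie: fall back on a nonrank procedure, here the sign of the plain LDF score
        return "pi_1" if ldf_scores(x, y, z[None, :])[0] >= 0 else "pi_2"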

2.5. The multipopulation rank procedure for partial classification


In this section we shall discuss extensions of the rank procedure to the m population problem (m > 2). With the aim of preserving clarity and simplifying notation we shall assume m = 3. Once the m = 3 problem is understood, extensions to m > 3 will be obvious. We have populations π_1, π_2, and π_3 with respective pdf's f_1(·), f_2(·), and f_3(·), and training samples x_1,...,x_{n_1}, y_1,...,y_{n_2}, and w_1,...,w_{n_3}. As before, z will denote the observation to be classified.
For the partial classification problem we must specify three events A_1, A_2, and A_3 such that the occurrence of A_i favors classification of z into π_i. We may then extend rule (2.1) as follows:
(a) If A_1 ∩ Ā_2 ∩ Ā_3 occurs, classify z into π_1;
(b) If Ā_1 ∩ A_2 ∩ Ā_3 occurs, classify z into π_2;
(c) If Ā_1 ∩ Ā_2 ∩ A_3 occurs, classify z into π_3;
(d) If A_1 ∩ A_2 ∩ Ā_3 occurs, classify z into π_1 ∪ π_2;      (2.8)
(e) If A_1 ∩ Ā_2 ∩ A_3 occurs, classify z into π_1 ∪ π_3;
(f) If Ā_1 ∩ A_2 ∩ A_3 occurs, classify z into π_2 ∪ π_3;
(g) Otherwise do not classify z.
Notice that rule (2.8) allows classification into a mixture of two populations. For example, in case (d) we have eliminated π_3 as a possibility but we cannot decide between π_1 and π_2. Of course we could define a rule that lumps cases (d), (e) and (f) into the 'do not classify z' category, but we may as well take advantage of all our information by at least eliminating one distribution if we can, even if we cannot classify z uniquely.
Using 'ZMC' to denote the event that z is misclassified, and noting that z's from π_1 are misclassified in cases (b), (c), and (f), we have

P[ZMC | Z ~ f_1] = P[Ā_1 | Z ~ f_1] − P[Ā_1 ∩ Ā_2 ∩ Ā_3 | Z ~ f_1]

and in general

P[ZMC | Z ~ f_i] ≤ P[Ā_i | Z ~ f_i].      (2.9)

Thus to achieve the bounds P[ZMC | Z ~ f_i] ≤ α_i, i = 1, 2, 3, we should define A_i


so that P[Ā_i | Z ~ f_i] = α_i. We also note that these upper bounds will be sharp if P[Ā_1 ∩ Ā_2 ∩ Ā_3 | Z ~ f_i] = 0.
Two methods for determining A_1, A_2, and A_3 will be discussed. In the first we consider the populations pairwise, and for each of these pairs, z is assigned two ranks as in Section 2.3. Let R_12(z) and R_21(z) be the two ranks assigned to z using populations π_1 and π_2. The first rank, R_12(z), is obtained by ranking z among the x's, so that large values of R_12(z) indicate that z looks more like an x than a y. Also R_21(z) is obtained by ranking z among the y's. Similarly we may obtain R_13(z), R_31(z), R_23(z), and R_32(z). The two subscripts for these ranks indicate the training samples used to construct the discriminant function, and the first subscript denotes the training sample among which z is ranked. We could then define the events A_1, A_2, and A_3 as

A_1 = [R_12(Z) > a_1] ∩ [R_13(Z) > a_1],
A_2 = [R_21(Z) > a_2] ∩ [R_23(Z) > a_2],      (2.10)
A_3 = [R_31(Z) > a_3] ∩ [R_32(Z) > a_3]

where a_1, a_2, and a_3 are integers. Then from (2.9) and (2.10),

P[ZMC | Z ~ f_1] ≤ P[Ā_1 | Z ~ f_1]
    ≤ P[R_12(Z) ≤ a_1 | Z ~ f_1] + P[R_13(Z) ≤ a_1 | Z ~ f_1]
    = 2a_1/(n_1 + 1).      (2.11)

Thus to satisfy the bounds P[ZMC | Z ~ f_i] ≤ α_i we must let a_i = [(n_i + 1)α_i/2] where [·] denotes the greatest integer function. As an example, suppose n_1 = n_2 = n_3 = 18, and we desire α_1 = α_2 = α_3 = 0.10; then a_1 = a_2 = a_3 = 0. In this case A_1 ∩ A_2 ∩ A_3 always occurs and we would always fail to classify z. This undesirable result seems to be somewhat characteristic for the events defined in (2.10).
This situation may be improved with a second method of defining A_1, A_2, and A_3. First we combine the training samples from π_2 and π_3, and consider the two population problem where one population is π_1 and the other is a mixture of π_2 and π_3. We determine two ranks for z as in Section 2.3. Let R_1(z) be the rank of z among the x's and R̄_1(z) the rank of z among the combined samples of y's and w's. Large values of R_1(z) indicate that z looks more like an x than a y or w, and large values of R̄_1(z) indicate that z looks more like a y or w than an x. In an analogous manner we determine R_2(z), R̄_2(z), R_3(z), and R̄_3(z). The ranks R_2(z) and R̄_2(z) are obtained from the two population problem where π_2 is one population and the other is a mixture of π_1 and π_3, and so on. We can then define the events A_1, A_2, and A_3 as

A_1 = [R_1(Z) > a_1],   A_2 = [R_2(Z) > a_2],   A_3 = [R_3(Z) > a_3]      (2.12)

where a_1, a_2, and a_3 are integers.

Using events (2.12) with rule (2.8) we obtain

P[ZMC | Z ~ f_i] ≤ P[Ā_i | Z ~ f_i] = P[R_i(Z) ≤ a_i | Z ~ f_i] = a_i/(n_i + 1)

where the last equality follows by the uniform distribution of R_i(Z). Thus we can satisfy the bound P[ZMC | Z ~ f_i] ≤ α_i by selecting

a_i = [(n_i + 1)α_i].

Notice that this is the same quantity used for a_i in the two population problem of Section 2.3. In fact if we extend definition (2.12) to the general m population problem, then we should still select a_i = [(n_i + 1)α_i].
For the partial rule (2.8), the events (2.12) seem to be a better choice than those in (2.10). We should emphasize that the ranks used in (2.12) are based on discriminant functions constructed to separate a group of m populations into one population and a mixture of the remaining m − 1 populations. We may not be able to find such a discriminant function in a simple form such as D_L and D_Q. We must be prepared to consider a variety of different types of discriminant functions including those based on nonparametric density estimation.

2.6. The multipopulation rank procedure in forced classification


In the multipopulation problem, forced classification rules may be structured upon p-values in a manner similar to that in Section 2.4. In the previous section we defined two different sets of ranks for z. The first set of ranks, used to define events (2.10), were denoted by R_ij(z), i ≠ j = 1, 2, 3, and the second set of ranks, used to define events (2.12), were denoted by R_i(z), R̄_i(z), i = 1, 2, 3. Each set of ranks may be used to define p-values.
Based on the first set of ranks, the p-value for population π_1 is

p_1(z) = min(R_12(z), R_13(z))/(n_1 + 1).

This represents the smallest probability of drawing an observation from π_1 that is as close or closer to π_2 or π_3 than is z. We may similarly define p_2(z) and p_3(z). If we choose the second method of assigning ranks, then we define the p-value for population π_1 as simply

p_1(z) = R_1(z)/(n_1 + 1),

since R_1(z) already takes into account the closeness of z to both π_2 and π_3. In general p_i(z) = R_i(z)/(n_i + 1). Both sets of p-values seem reasonable and either may be used. The first set of p-values may be more advantageous since they will usually be based on simpler discriminant functions. In either case, the larger p_i(z) is, the more it appears that z is a member of π_i.
An intuitively appealing forced rule is:

(a) if l_i p_i(z) = max_{j=1,2,3} [l_j p_j(z)], classify z into π_i;
(b) if there is not a unique maximum in (a), break the tie with any nonrank procedure.      (2.13)

The quantities l_i, i = 1, 2, 3, are constants which when varied will change the values of the misclassification probabilities. For example, if we increase l_1, then it is easier to classify z into π_1, so the probability of misclassifying observations from π_1 will decrease, and the probability of misclassifying observations from π_2 and π_3 will increase. Since it is the ratios of the l's and not their actual values that are important we may, without loss of generality, assume l_3 = 1. Unfortunately we do not know how to set l_1 and l_2 in order to control the balance between the various pairs of misclassification probabilities. We can, however, attempt an empirical evaluation of l_1 and l_2 by trial and error.
Apparently very little research has been done on the multipopulation rank procedure for both the partial and forced classification problem. The suggestions given in Sections 2.5 and 2.6 should not be taken as the final word on this subject. Hopefully there will be new research in this area which could provide improved procedures.

3. Robust discriminant functions

3.1. LDF and QDF with robust estimates


In order to obtain robust discriminant functions we must broaden our model to include other distributions in addition to the multivariate normal. In particular if we are concerned about outliers, we would like our model to include distributions which occasionally produce extreme observations. For example, if φ_p(z; μ, Σ) denotes the pdf, with argument z, of a N_p(μ, Σ) distribution, then

f(z) = (0.90)φ_p(z; μ, V) + (0.10)φ_p(z; μ, 40V)      (3.1)

is the pdf of a mixture of two normal distributions. The primary distribution is N_p(μ, V), but 10% of our observations comes from a N_p(μ, 40V) distribution. Because of the inflated dispersion matrix, these observations may be quite far from the main body of the data and appear as outliers. Notice that the pdf (3.1) has mean μ and dispersion matrix 4.9V. It is also elliptically symmetric, that is, it can be written in the form

f(z) = |V|^{−1/2} h{[(z − μ)'V^{−1}(z − μ)]^{1/2}}.      (3.2)

Thus the set of z's that satisfy f(z) = const. is an ellipsoid in p dimensions with center at μ and shape determined by V. If moments exist for the distribution in

Thus the set of z's that satisfy f(z)= const, is an ellipsoid in p dimensions with
center at/~ and shape determined by V. If moments exist for the distribution in
154 James D. Broffitt

(3.2), the mean is /t and the dispersion matrix is o2V where 0 2 is a constant
depending on h(.).
Suppose we have a random sample x~ .... ,x n from an elliptically symmetric
distribution (3.2). One possible estimate o f / t is .~, which is a special case of the
weighted average ( w l x 1+ . . . + W n X n ) / / ~ l W i , where each weight equals one. If
h(-) is such that we are likely to observe outliers, then Y may be a poor estimate
since one extreme observation can pull :~ away from the center of the data. In this
case we can improve our estimate by giving extreme observations smaller weights
and thus decrease the influence of outliers should they exist.
Huber (1964) developed the theory of robust M-estimators for univariate distributions, and Maronna (1976) extended this research to the multivariate case. Maronna's estimators μ̂ and V̂ are the solutions of the equations

μ̂ = ∑_{i=1}^{n} w_1(s_i)x_i / ∑_{i=1}^{n} w_1(s_i),
V̂ = n^{−1} ∑_{i=1}^{n} w_2(s_i²)(x_i − μ̂)(x_i − μ̂)'      (3.3)

where s_i² = (x_i − μ̂)'V̂^{−1}(x_i − μ̂) measures the distance of x_i from the center of the data, and w_1(s_i) and w_2(s_i²) are weight functions. The object then is to select w_1 and w_2 so as to minimize the effect of outliers and consequently produce good estimates. One set of possible weight functions which corresponds to Huber's univariate estimators is

w_1(s) = 1 for s ≤ k,   w_1(s) = k/s for s > k,

and

w_2(s²) = 1 for s² ≤ k²,   w_2(s²) = k²/s² for s² > k².

The quantity k² should be chosen so that all but the extreme observations receive weights of one or almost one. We suggest trying k² = p + k_0(2p)^{1/2} where k_0 is a constant less than 2.0. Many other weight functions are possible. Maronna gives sufficient conditions on weight functions for which the solutions to (3.3) are consistent and asymptotically normal. By consistency we mean that (μ̂, V̂) → (μ_0, V_0) a.s., where (μ_0, V_0) is the solution of

E{w_1([(X − μ_0)'V_0^{−1}(X − μ_0)]^{1/2})(X − μ_0)} = 0

and

E{w_2((X − μ_0)'V_0^{−1}(X − μ_0))(X − μ_0)(X − μ_0)'} = V_0.

If X has pdf (3.2), then μ_0 = μ and V_0 = (const.)V, where the constant depends on w_2(·) and h(·). Notice that V̂ is actually estimating (const.)V rather than V. For our purposes it does not matter if we ignore the constant multiplier, and we use V̂ to denote the estimate of V apart from the multiplier. Once we have selected weight functions w_1 and w_2, (3.3) may be solved by iteration. It is interesting that if the true pdf is (3.2), then the maximum likelihood estimators of μ and V are given by the solutions to (3.3) if the weight functions are

w_1(s) = −(1/s)·h'(s)/h(s)   and   w_2(s²) = w_1(s).
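A simple way to solve (3.3) numerically is fixed-point iteration: starting from the classical estimates, recompute the weights and the weighted estimates until they stabilize. The Python sketch below does this with the Huber-type weights given above; the choice k_0 = 1, the starting values and the stopping rule are hypothetical defaults.

    import numpy as np

    def huber_m_estimates(x, k0=1.0, n_iter=100, tol=1e-8):
        n, p = x.shape
        k2 = p + k0 * np.sqrt(2 * p)          # suggested k^2 = p + k0 (2p)^{1/2}
        mu = x.mean(axis=0)                   # start from the classical estimates
        V = np.cov(x, rowvar=False)
        for _ in range(n_iter):
            d2 = np.einsum('ij,jk,ik->i', x - mu, np.linalg.inv(V), x - mu)   # s_i^2
            d2 = np.maximum(d2, 1e-12)
            w1 = np.where(d2 <= k2, 1.0, np.sqrt(k2 / d2))                    # w1(s_i)
            w2 = np.where(d2 <= k2, 1.0, k2 / d2)                             # w2(s_i^2)
            mu_new = (w1[:, None] * x).sum(0) / w1.sum()
            r = x - mu_new
            V_new = (w2[:, None] * r).T @ r / n
            converged = (np.linalg.norm(mu_new - mu) < tol and
                         np.linalg.norm(V_new - V) < tol)
            mu, V = mu_new, V_new
            if converged:
                break
        return mu, V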

A discriminant score D(z) may be viewed as a comparison of the 'distances' of z from each of the populations, π_1 and π_2. Suppose that D_i(z) measures the distance of z from π_i; then we may write D(z) = D_2(z) − D_1(z). So if D(z) > 0, z is 'closer' to π_1 than π_2. Accordingly we classify z into π_1 if and only if D(z) ≥ 0, or more generally if and only if D(z) ≥ c. The optimal rule (1.1) is of this form where

D(z) = −ln f_2(z) + ln f_1(z)   and   c = ln k.

In this case the distance measure is D_i(z) = −ln f_i(z).
If the pdf for π_i is (3.2) with parameters μ_i and V_i, then D_i(z) unfortunately depends on h(·), which is fairly difficult to estimate. An intuitively appealing compromise is to use the D_i corresponding to a normal distribution in conjunction with robust estimates μ̂_i and V̂_i. That is,

D_i(z) = const. + ½(z − μ̂_i)'V̂_i^{−1}(z − μ̂_i) + ½ ln|V̂_i|.

The resulting classification rule assigns z to π_1 if and only if D_HQ(z) ≥ c where

D_HQ(z) = (z − μ̂_2)'V̂_2^{−1}(z − μ̂_2) − (z − μ̂_1)'V̂_1^{−1}(z − μ̂_1) + ln(|V̂_2|/|V̂_1|).      (3.4)

If we assume V_1 = V_2 = V, then we would use a pooled estimate of V, and our rule reduces to classifying z into π_1 if and only if D_HL(z) ≥ c where

D_HL(z) = [z − ½(μ̂_1 + μ̂_2)]'V̂^{−1}(μ̂_1 − μ̂_2).      (3.5)

Thus we are simply using the usual linear and quadratic discriminant functions
with different estimates for the location and scale parameters.
Of course we suggest using the rank procedure (Section 2) in conjunction with these robust discriminant functions in order to classify z. The use of robust estimates seems particularly appropriate when the rank procedure is being applied. For example, in the determination of R_x(z), z is placed with the x's, and if z really is from π_2, then z may act as an outlier in this sample and distort the estimates x̄ and S_1. Consequently z may receive a higher rank than if robust estimates had been used. This of course would be an undesirable feature since, when z belongs to π_2, we want R_x(z) to be as small as possible. In the process of
placing z with each training sample we can artificially create outliers, and since robust estimates provide a degree of protection against outliers, these estimates are very attractive when the rank procedure is being used.
As an illustration of the use of robust estimates consider the data displayed in Fig. 4. Here we have m = 2 and p = 2. The ×'s denote observations from π_1 and o's denote observations from π_2. Each sample of ten observations contains one very conspicuous outlier. Each of the four lines represents a different discriminant function, and in each case the region above the line corresponds to classification into π_1 and the region below the line corresponds to classification into π_2. Line L is a plot of D_L(z) = 0 where x̄, ȳ, and S are computed using all 20 observations. In order to assess the effect of the two outliers, we trimmed (deleted) them from the data sets, recomputed the sample means and dispersion matrix, and replotted D_L(z) = 0. The result was line LT. Since we will observe outliers infrequently, our main concern should be to correctly classify the 'inliers'. Thus line LT is preferred as a separator between π_1 and π_2. The difference between lines L and LT dramatically illustrates the damage that outliers can cause. Line H is a plot of D_HL(z) = 0 where D_HL(·) was computed using all 20 observations, and line HT is a plot of D_HL(z) = 0 where D_HL(·) was based on the trimmed sample of 18 observations. We notice that LT and HT are nearly identical, that is, when outliers are not present there is very little difference in the results based on robust

Fig. 4. Comparison of D_L and D_HL.

and nonrobust estimates. Lines L and H are quite different however, showing that the robust estimates did partially adjust for the outliers. While we would prefer line H to be closer to LT, in terms of classifications, H will perform more like LT than will L. Lines H and HT were obtained by iteratively solving the three equations:

μ̂_1 = ∑ w_1(s_{1i})x_i / ∑ w_1(s_{1i}),
μ̂_2 = ∑ w_1(s_{2i})y_i / ∑ w_1(s_{2i}),      (3.6)
V̂ = [∑ w_2(s_{1i}²)(x_i − μ̂_1)(x_i − μ̂_1)' + ∑ w_2(s_{2i}²)(y_i − μ̂_2)(y_i − μ̂_2)'] / (n_1 + n_2 − 2)

where w_1(·) is the Huber weight function given above with k_0 = 1, s_{1i}² = (x_i − μ̂_1)'V̂^{−1}(x_i − μ̂_1), and s_{2i}² = (y_i − μ̂_2)'V̂^{−1}(y_i − μ̂_2). Thus we used a pooled V̂ in computing the weights throughout the iteration process. The final weight for the x outlier was w_1(s_1) = 0.33, and the final weight for the y outlier was w_1(s_2) = 0.25. For the remaining 18 observations the final weights were all ones. We could move line H closer to LT by decreasing k_0 and thereby decreasing the weights of the outliers. There generally do not appear to be good guidelines for choosing k_0, but in a real problem we could try several values of k_0 and choose that value corresponding to the smallest estimates of misclassification probabilities. Clearly the choice of k_0 is an important practical problem which needs further development.

3.2. A robust linear discriminant function


Consider the family of linear functions of z, β'z where β'β = 1. Graphically, (β'z)β is the orthogonal projection of z onto the one-dimensional linear space generated by β, and accordingly β'z is the distance of this projection from the origin. We may think about β'x_1,...,β'x_{n_1}, β'y_1,...,β'y_{n_2} as a one-dimensional linear reduction of the data. If we plan to use β'z as a discriminant function, then we would like to choose β to maximize the separation between the reduced training samples β'x_1,...,β'x_{n_1} and β'y_1,...,β'y_{n_2}. That is, specifying β is equivalent to specifying a direction, and we would like to determine that direction which corresponds to maximum separation.
An example of this type of procedure is Fisher's (1936) original derivation of the LDF. He considered the family of linear functions and chose the one that maximized a standardized measure of separation between the training samples. Specifically he determined β to maximize

(β'x̄ − β'ȳ)/(β'Sβ)^{1/2}.      (3.7)

The optimal β is β_0 = S^{−1}(x̄ − ȳ)/[(x̄ − ȳ)'S^{−1}(x̄ − ȳ)]^{1/2}, which leads to the

rule: classify z into π_1 if and only if β_0'z ≥ ½(β_0'x̄ + β_0'ȳ). It may be easily verified that this inequality is equivalent to D_L(z) ≥ 0.
Notice that the ratio (3.7) can be rewritten as

(1/n_1) ∑_{i=1}^{n_1} (β'x_i − α)/(β'Sβ)^{1/2} − (1/n_2) ∑_{i=1}^{n_2} (β'y_i − α)/(β'Sβ)^{1/2}.      (3.8)

The quantity (β'x_i − α)/(β'Sβ)^{1/2} is a standardized measure of distance between the projection of x_i and the point α. We may think of α as the point of separation between the two classification regions. That is, β'z ≥ α corresponds to classification into π_1, etc. An x for which β'x > α is correctly classified and accordingly contributes a positive increment to (3.8). If β'x < α, then x would be misclassified and accordingly a negative value is added to (3.8). A similar type of statement holds for each y. Fig. 5 should help clarify the idea of comparing projections of x and y in the β direction to a point α. In this figure β'x > α and β'y < α.
Consider now the effect of an extreme observation which could be either an x or a y. Such an observation can have a disproportionate influence in determining the β that maximizes (3.8). To obtain a robust discriminant function we should
Fig. 5. Projections of points x and y upon the linear space generated by β.

minimize the impact of these outliers. To accomplish this we generalize (3.8) to

T(α, β) = (1/n_1) ∑_{i=1}^{n_1} τ((β'x_i − α)/(β'V̂β)^{1/2}) − (1/n_2) ∑_{i=1}^{n_2} τ((β'y_i − α)/(β'V̂β)^{1/2}),      (3.9)

where τ(·) is a nondecreasing, odd, and nonconstant function, and V̂ is a robust estimate of scale such as that given in (3.6). Rather than using τ(d) = d as in (3.8), we would use a τ(·) such that |τ(d)| does not increase as rapidly as |d|. Such a τ(·) will not be influenced as much by extreme observations. Two possibilities for τ(·) are

τ_1(d) = −k for d < −k,   τ_1(d) = d for −k ≤ d ≤ k,   τ_1(d) = k for d > k,

and

τ_2(d) = −1 for d < −k,   τ_2(d) = sin(πd/2k) for −k ≤ d ≤ k,   τ_2(d) = 1 for d > k.

The quantity T(α, β) will be maximized with respect to α and β using a computer algorithm. Since τ_2(·) is everywhere differentiable, it is smoother than τ_1(·), and consequently τ_2(·) may be easier to maximize with an algorithm. The constant k should be picked so that only the extreme observations produce d's that are larger than k in magnitude. We suggest using k ≤ 2. In using the rank procedure for classifications we may try several values of k and pick the one that seems most appropriate. For example, to determine R_x(z) we consider the two samples x_1,...,x_{n_1}, z and y_1,...,y_{n_2}. Provided that we treat the n_1+1 x's symmetrically and also the n_2 y's, we may do anything to determine a good discriminant or ranking function. Thus we may try several k's and also several τ's before deciding which ones to use. The selection of these items should be viewed as just one step in the computation of the discriminant function.
A Monte Carlo study of some robust discriminant functions similar to those
discussed above was reported by Randles et al. (1978).
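One concrete way to carry out the maximization of T(α, β) is with a general-purpose derivative-free optimizer, which is natural for the smooth choice τ_2. The Python sketch below does this; the use of Nelder–Mead, the value k = 1.5 and the stand-in pooled scale estimate are hypothetical choices, not prescriptions from the text.

    import numpy as np
    from scipy.optimize import minimize

    def tau2(d, k=1.5):
        # smooth bounded score function tau_2
        return np.where(d < -k, -1.0, np.where(d > k, 1.0, np.sin(np.pi * d / (2 * k))))

    def robust_linear_df(x, y, Vhat, k=1.5):
        def neg_T(theta):
            alpha, beta = theta[0], theta[1:]
            beta = beta / np.linalg.norm(beta)        # enforce beta'beta = 1
            scale = np.sqrt(beta @ Vhat @ beta)
            dx = (x @ beta - alpha) / scale
            dy = (y @ beta - alpha) / scale
            return -(tau2(dx, k).mean() - tau2(dy, k).mean())   # -T(alpha, beta)
        p = x.shape[1]
        theta0 = np.concatenate([[0.0], np.ones(p) / np.sqrt(p)])
        res = minimize(neg_T, theta0, method="Nelder-Mead")
        alpha, beta = res.x[0], res.x[1:] / np.linalg.norm(res.x[1:])
        return alpha, beta        # classify z into pi_1 iff beta'z >= alpha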

4. Nonparametric discriminant functions

4.1. Introduction
We have developed the rank procedure (Section 2) which forms the basis for
classification rules. Use of this procedure requires a discriminant or ranking
function. If π_1 and π_2 have normal distributions, then we would use D_L(·) or D_Q(·) for the ranking function. Also the robust discriminant functions developed in Section 3 are appropriate for elliptically symmetric distributions which produce outliers. In order to accommodate problems with nonelliptically symmetric data, we must broaden our model for f_i(·) to include a wider variety of distributions.


Loosely speaking we may define a nonparametric discriminant function as one
that is designed for models which contain a wide range of distributions. In this
section we shall review some techniques for finding nonparametric discriminant
functions. In order to give a perspective of the different possibilities we include
short discussions of nonparametric density estimation and the multinomial tech-
nique. Although we do not treat them in detail, there is a sizable quantity of
literature on both of these methods and the interested reader is referred elsewhere.
We also discuss best of class rules and procedures based on rank data.

4.2. Multinomial method


The most universally applicable nonparametric discriminant function is based
on the multinomial distribution. This is so because any set of variates may be
discretized. For example, suppose we are measuring four variates: sex (male,
female), political preference (Republican, Democrat, other), yearly income, and
IQ. We may discretize yearly income by simply recording in which of the
following intervals it belongs: below $10000, $10000 to $20000, $20000 to
$30000, $30000 to $40000, above $40000. Thus yearly income is transformed to
a discrete variable with five possible values. Similarly we could discretize IQ using
the six intervals: below 95, 95 to 100, 100 to 105, 105 to 115, 115 to 130, above
130. We now have four discrete variates with the number of possible values being
2, 3, 5, and 6 respectively. Thus the number of categories in our multinomial
distribution is 2 × 3 × 5 × 6 = 180. If the probabilities of these categories are denoted by p_{i1},...,p_{i180} for π_i, then f_1(z)/f_2(z) = p_{1j}/p_{2j} where j is the category corresponding to the outcome z. If N_{ij} denotes the number of training sample observations from π_i in category j, then p̂_{ij} = N_{ij}/n_i, and accordingly f̂_1(z)/f̂_2(z) = (N_{1j}n_2)/(N_{2j}n_1). The disadvantage of the multinomial method should be apparent. With as few as four variates we created multinomial distributions with 180 categories, so for each of π_1 and π_2 we need to estimate 179 parameters
(probabilities). This means that we must have an extremely large number of
observations in each training sample. Of course many of the 180 categories may
have such small probabilities that they would virtually never occur and may be
ignored in the estimation process or absorbed into other categories. However, it is
unlikely that we could make enough of a reduction in the number of categories to
eliminate the problem of estimating many parameters. We could reduce the
number of parameters by dichotomizing all variates, e.g., political preference
(Democrat, non-Democrat), yearly income (below $20000, above $20000), IQ
(below 105, above 105), then we would have 2 4 = 16 categories. This achieves a
large reduction in the number of parameters to be estimated, but at the same time
we lose information contained in the original measurements since we are measur-
ing our variates on a less refined scale. Other techniques have been used in
classifying multinomial data. We will not consider it further in this section; the
interested reader is referred to Goldstein and Dillon (1978).
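To make the category-count estimate concrete, the Python sketch below discretizes just the two continuous variates of the example (yearly income and IQ) with the cut points mentioned above, and forms the ratio (N_{1j}n_2)/(N_{2j}n_1); the helper names and the handling of an empty category are hypothetical.

    import numpy as np

    def discretize(income, iq):
        inc_code = np.digitize(income, [10000, 20000, 30000, 40000])   # 5 income classes
        iq_code = np.digitize(iq, [95, 100, 105, 115, 130])            # 6 IQ classes
        return inc_code * 6 + iq_code                                  # joint category label, 0..29

    def multinomial_ratio(cat_x, cat_y, cat_z, n_cats=30):
        # fhat_1(z)/fhat_2(z) = (N_{1j} n_2) / (N_{2j} n_1) for the category j of z
        N1 = np.bincount(cat_x, minlength=n_cats)
        N2 = np.bincount(cat_y, minlength=n_cats)
        n1, n2 = len(cat_x), len(cat_y)
        j = cat_z
        return (N1[j] * n2) / (N2[j] * n1) if N2[j] > 0 else np.inf

Classification then proceeds by comparing this ratio to k exactly as in (1.1); the weakness the text points out is visible here too, since reliable estimates require many observations per category.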
4.3. Density estimation
A nonparametric discriminant function applicable when all p variates are
continuous is based on density estimation. Let x ~ , . . . , x n be univariate sample
observations on a continuous random variable X with pdf f(.). How can we
estimate f ( z ) , the density of X at the point z? Let N ( z ) be the number of sample
x's in the interval [ z , h, z + h], then N ( z ) / n is an estimate of P{z - h <- X ~ z
+ h}. If we divide this by the length of the interval, 2h, we obtain an estimate of
f ( z ) , that is,

f(z) = N(z)/2hn.

Now define

K ( u ) = {½0 (]u]~<l)
(lul>l)
then N ( z ) = 2~,7: lK((z -- x i ) / h ) and

1 ~ tz-x~

This form of f ( z ) is the so-called Rosenblatt (1956)-Parzen (1962) kernel esti-


mate. The term 'kernel' refers to the function K ( u ) . With K ( u ) as defined above,
the graph of f ( z ) will appear as a step function. This is unappealing since we
generally believe densities to be smooth functions. To obtain a smooth version of
f ( z ) , we simply choose a smoother kernel, e.g.,

' l u2/2
K(•) = - - e . .

Cac0ullos (1966) extended the idea of kernel estimates to the multivariate case.
Let x~ .... , x n be sample observations on a continuous p-variate random variable X
with pdf f ( . ) . If x~= (Xil .... ,Xip ) and z ' = ( z l ..... zp), then

1 n ( Zl -- Xi 1 Zp -- Xip )
f(Z)- n h , . - . hp i~=l K h1 '"" -h; •

If h_j → 0 as n → ∞, if n h_1 ⋯ h_p → ∞ as n → ∞, and if the multivariate kernel
K satisfies certain conditions, then f̂(z) is a consistent estimator of f(z). It is
interesting to note that although K(u) can be a quite general function and may be
chosen without knowledge of the form of the density f(z), the estimate f̂(z) is still
consistent. Thus in the classification problem we may obtain nonparametric
consistent estimates of f_1(z) and f_2(z), and the classification of z may be based on
f̂_1(z)/f̂_2(z).
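As an illustration of how such a rule might be programmed, the following Python sketch (not part of the chapter; it assumes NumPy and, for simplicity, a product Gaussian kernel with a single common bandwidth h) forms estimates f̂_1(z) and f̂_2(z) and classifies z into π_1 when f̂_1(z)/f̂_2(z) ≥ k. All function and variable names are illustrative.

import numpy as np

def kernel_density(z, sample, h):
    # Product Gaussian kernel estimate of f(z): (1/(n h^p)) sum_i prod_j K((z_j - x_ij)/h)
    u = (z - sample) / h
    k = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    return np.mean(np.prod(k, axis=1)) / h ** sample.shape[1]

def classify(z, x_sample, y_sample, h, k=1.0):
    # Allocate z to pi_1 when f_1(z)/f_2(z) >= k, otherwise to pi_2
    f1 = kernel_density(z, x_sample, h)
    f2 = kernel_density(z, y_sample, h)
    return 1 if f1 >= k * f2 else 2

In practice separate bandwidths h_1,..., h_p, or a different kernel, could be used without changing the structure of the computation.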

4.4. Best of class rules


The set of points B = {z: f_1(z)/f_2(z) = k} forms a boundary between the two
sets K_1* of (1.1) and K_2*, the complement of K_1*. If f_1(·) and f_2(·) are normal pdf's with equal
dispersion matrices, then B is a hyperplane, given by

B = {z: [z − ½(μ_1 + μ_2)]' Σ^{−1}(μ_1 − μ_2) = ln k}.

If we relax the assumption of equal dispersion matrices, then B is a quadratic
surface, i.e.,

B = {z: (z − μ_2)' Σ_2^{−1}(z − μ_2) − (z − μ_1)' Σ_1^{−1}(z − μ_1) + ln(|Σ_2|/|Σ_1|) = 2 ln k}.

The more irregularly shaped f_1(·) and f_2(·) become, the more complex the set B
may become. However, if f_1(·) and f_2(·) are not too complex, then a hyperplane
or quadratic surface may be an adequate substitute for B, at least for classifi-
cation purposes. The hyperplane {z: β'z = α} splits the space of Z into two
halfplanes {z: β'z ≥ α} and {z: β'z < α}. If we could find values of β and α, say
β* and α*, so that {z: β*'z ≥ α*} is 'similar' to K_1*, then we could use the rule:
classify z into π_1 if and only if β*'z ≥ α*. In general this rule is not optimal, but it
may be reasonable if its misclassification probabilities are not much larger than
those of the optimal rule. Of course we would always choose the optimal rule if
f_1(·) and f_2(·) are known. In practice, however, we must estimate the sets K_1* and
{z: β*'z ≥ α*} from sample data. Estimating the best hyperplane requires estima-
tion of p + 1 quantities (β* and α*), whereas determining K_1* requires estimation
of two entire distributions. Since in general it is statistically more efficient
to estimate fewer parameters, it is plausible that the rule corresponding to
{z: β̂*'z ≥ α̂*} may produce smaller misclassification probabilities than the rule
based on K̂_1*, when the sample sizes are small or moderate.
Consider now the m population classification problem. Let K = (K_1,..., K_m) be
a partition of the space of Z which corresponds to the rule that classifies z into π_i
if z ∈ K_i. We shall use the symbol K to denote both the partition and its
corresponding rule. Let r(K) be the probability of correctly classifying an
observation drawn from the mixture of π_1,..., π_m when rule K is used, that is,

r(K) = Σ_{i=1}^{m} q_i P(i|i).

Finally we let C be the class of partitions from which we shall choose our rule. If
K⁺ ∈ C and

r(K⁺) = sup_{K ∈ C} r(K),

then K⁺ is a best decision rule among those in class C. If C is the class of all
partitions, then K⁺ is the unrestricted best rule since it achieves the largest
probability of correct classification. If m = 2 and C is the class of partitions
K = (K_1, K_2) where K_1 and K_2 are complementary halfplanes, then the best
partition K⁺ is a split of the space of Z into two halfplanes (K_1⁺, K_2⁺) for which
the probability of correct classification is a maximum. So rather than considering
the class of all decision rules, we may restrict our attention to those rules in a
special class C and seek the best rule within this class.
Following Glick (1969), let r̂(K) be the sample based counting estimate of
r(K). That is, r̂(K) equals the proportion of training sample observations which
are correctly classified by K. If K̂ ∈ C satisfies

r̂(K̂) = sup_{K ∈ C} r̂(K),

i.e., among all rules in C, K̂ correctly classifies the largest proportion of training
sample observations, then K̂ is called the best of class rule. Thus we are picking
that rule within C, K̂, which maximizes our estimate of r(K). In this sense K̂ is an
estimate of K⁺. For certain classes C Glick showed that the rule K̂ is asymptoti-
cally equivalent to K⁺. In particular, if m = 2 and C is the class of complementary
halfplanes, then

r(K̂) → r(K⁺) a.s.,

i.e., with probability one the best of class rule K̂ is asymptotically optimal within
the class C.
As an example suppose p and m are both two, and C is the class of complemen-
tary halfplanes. Any line in the two-dimensional plane divides the training sample
observations into two sets. Observations on one side of the line are classified as
x's, and those on the other side as y's. Then we must find that line which
separates the observations so that the number correctly classified is a maximum.
To illustrate this idea, we use the symbols × and o to represent the training
sample observations corresponding to π_1 and π_2 respectively, and consider the
example shown in Fig. 6. Notice that the best of class rule is not unique since
there are several lines, labelled A, B, and C, for which the empirical probability of
correct classification is a maximum, in this case 16/19.

When p is larger than 2, determining the best separating hyperplane presents
computational difficulties. We can no longer rely on a visual determination of the
best line but must use some sort of computer algorithm. This problem along with
some of its variants was studied by Wang (1979). The computations generally
become quite difficult for large values of n_1, n_2, or p.
Fig. 6. Examples of a best of class rule.
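The search for a best of class line can be sketched as a brute-force computation; the Python fragment below (illustrative only, and not Wang's (1979) algorithm) examines every line through a pair of training points and keeps the orientation that classifies the largest number of observations correctly. The names are assumptions of this sketch.

import itertools
import numpy as np

def best_of_class_line(points, labels):
    # points: n x 2 array; labels: 1 for pi_1, 2 for pi_2
    best_rule, best_correct = None, -1
    for i, j in itertools.combinations(range(len(points)), 2):
        d = points[j] - points[i]
        beta = np.array([-d[1], d[0]])        # a normal to the line through the two points
        alpha = float(beta @ points[i])
        for sign in (1.0, -1.0):              # try both orientations of the rule
            pred = np.where(sign * (points @ beta - alpha) >= 0, 1, 2)
            correct = int(np.sum(pred == labels))
            if correct > best_correct:
                best_rule, best_correct = (sign * beta, sign * alpha), correct
    return best_rule, best_correct

The cost grows rapidly with n_1 + n_2 and with p, which illustrates the computational difficulty noted above.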

4.5. Discriminant functions based on rank data


In Section 2 we developed classification rules that were based on ranks of
discriminant scores. In this section we are interested in first ranking the data and
then computing discriminant scores which are functions of these ranks.
Let x_i' = (x_{i1},..., x_{ip}), y_i' = (y_{i1},..., y_{ip}), and z' = (z_1,..., z_p); that is, x_{ij} is the
measurement on variate j in observation vector x_i, and so on. For each variate we
shall rank the combined samples, and then replace the original data by their ranks
before computing a discriminant function. That is, for variate j we rank the
n_1 + n_2 + 1 measurements x_{1j},..., x_{n_1 j}, y_{1j},..., y_{n_2 j}, z_j, so that the smallest item
receives a rank of 1 and the largest a rank of n_1 + n_2 + 1. If ties occur, then we use
the customary procedure of computing average ranks. Let the ranks of
x_{1j},..., x_{n_1 j}, y_{1j},..., y_{n_2 j}, z_j be respectively a_{1j},..., a_{n_1 j}, b_{1j},..., b_{n_2 j}, c_j. Thus if no
ties occur, a_{1j},..., b_{n_2 j}, c_j is a permutation of the integers 1,..., n_1 + n_2 + 1. Now
define

a_i' = (a_{i1},..., a_{ip}),  i = 1,..., n_1,

b_i' = (b_{i1},..., b_{ip}),  i = 1,..., n_2,

and

c' = (c_1,..., c_p).

Rather than using the original data x_1,..., y_{n_2}, z we shall base our discriminant
functions on the corresponding rank data a_1,..., b_{n_2}, c.

In particular, for the linear and quadratic discriminant functions we would use

DRL(c) = [c − ½(ā + b̄)]' U^{−1}(ā − b̄)

and

DRQ(c) = (c − b̄)' U_2^{−1}(c − b̄) − (c − ā)' U_1^{−1}(c − ā) + ln(|U_2|/|U_1|),

where

ā = Σ_{i=1}^{n_1} a_i/n_1,   b̄ = Σ_{i=1}^{n_2} b_i/n_2,

U_1 = Σ_{i=1}^{n_1} (a_i − ā)(a_i − ā)'/(n_1 − 1),

U_2 = Σ_{i=1}^{n_2} (b_i − b̄)(b_i − b̄)'/(n_2 − 1),

and

U = [(n_1 − 1)U_1 + (n_2 − 1)U_2]/(n_1 + n_2 − 2).

Then to classify z we compare DRL(c) to ln k or DRQ(c) to 2 ln k. If the
distributions of π_1 and π_2 are not normal, then hopefully the rank vectors
a_1,..., b_{n_2} will appear more like normal data than the original observations
x_1,..., y_{n_2}. Thus we might expect to make better classifications using DRL or DRQ
than DL or DQ.
If there are several z's to be classified, then it would be computationally
cumbersome to rerank the x's and y's with each of the z's. As an alternative we
could combine just the x's and y's and rank marginally as above to determine
a_1,..., a_{n_1}, b_1,..., b_{n_2}. These ranks would not change and could be used with each
z. Of course we must also determine ranks for the elements of z. Consider the jth
variate. Then we must order x_{1j},..., x_{n_1 j}, y_{1j},..., y_{n_2 j} and see where z_j fits among
these ordered values. If z_j is between the kth and (k + 1)st of these ordered
values, then we would assign z_j a rank of k + ½, i.e., c_j = k + ½. If z_j is smaller
than each x_{ij} and each y_{ij}, then we let c_j = ½, and if z_j is larger than each x_{ij} and
y_{ij}, we let c_j = n_1 + n_2 + ½. This is repeated for j = 1,..., p, and thereby we
generate the vector c.
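A sketch of this ranking scheme in Python (using NumPy and SciPy; the names are illustrative and not from the chapter) is given below; the rank vectors a_i, b_i and c produced here would then be fed into DRL or DRQ in place of the original observations.

import numpy as np
from scipy.stats import rankdata

def rank_training(x, y):
    # x: n1 x p, y: n2 x p; rank each variate over the combined samples (average ranks for ties)
    combined = np.vstack([x, y])
    ranks = np.apply_along_axis(rankdata, 0, combined)
    return ranks[: len(x)], ranks[len(x):]          # the a_i's and b_i's

def rank_new_observation(z, x, y):
    # c_j = k + 1/2 when z_j lies between the k-th and (k+1)-st ordered training values
    combined = np.vstack([x, y])
    return np.sum(combined < z, axis=0) + 0.5

With these ranks, ā, b̄, U_1, U_2 and U are computed exactly as the corresponding quantities would be computed from the original data.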
As an example, consider the data displayed in Fig. 4. Since we do not have a z
to be classified, we computed the a's and b's from the combined sample of x's
and y's as explained in the preceding paragraph. A plot of the rank data is shown
in Fig. 7, where the ×'s represent a's and the o's represent b's. The two points
inside the diamonds correspond to the outliers shown in Fig. 4. Notice that these
points no longer appear as obvious outliers since the rank transformation
eliminated most of the distance between these points and the other observations
in their respective samples. The line R shown in Fig. 7 is a plot of DRL(c) = 0.
Thus the region above line R indicates classification into π_1. Since the scales of
measurement are different for Figs. 4 and 7, we cannot make a precise compari-
son between line R and the lines in Fig. 4. However, by inspection it seems clear
Fig. 7. DRL with outliers present.

that line R will be a much better classifier than line L but possibly not quite as
good as line H.
Both lines H and R adjust for outliers, but in different ways. To compute line
H we leave the outliers in their original positions but give them small weights in
the analysis. In computing line R we first move the outliers inward so they are on
the fringe of the data set and then give them full weight in the analysis. We
cannot say which method is better. That depends on the distributions of π_1 and π_2,
i.e., in some cases line H may be better and in others line R may be better. We
note however that as the outliers move further from the main body of data, the
computation of line H will assign increasingly smaller weights to the outliers,
whereas line R will not change. That is, line R will remain the same but line H
will move closer to line LT.
Conover and Iman (1978) did a Monte Carlo study comparing classification
rules based on DRL and DRQ to those based on DL, DQ, and a variety of
nonparametric density estimates. For the distributions they simulated and the
sample sizes used, their general conclusion was that DRL and DRQ are nearly as
good as DL and DQ when π_1 and π_2 have normal distributions, but with
nonnormal distributions DRL is generally better than either DL or DQ but not
quite as good as DRQ, although the differences between DRL and DRQ seemed
to be very slight.
Finally we should note that there have been other proposals for classification
rules based on rank data and distance functions. Since we will not review them in
detail, the interested reader should consult the references cited. Stoller (1954)
considered a univariate classification rule which was later generalized by Glick to
the best of class rules (Section 4.4). The univariate problem was also considered
by Hudimoto (1964) and Govindarajulu and Gupta (1977). Hudimoto worked
with the two population problem, while the latter paper considered the general
multipopulation setup. In both cases it is assumed that a sample of n_0 z's, all
from the same population, is to be classified. Classification rules based on the
ranks of the combined samples are defined. Hudimoto derives bounds on the
misclassification probabilities. Govindarajulu and Gupta show that the probabil-
ity of correct classification converges to one as n_0, n_1,..., n_M approach infinity.
Chatterjee (1973) has multivariate samples of size n_0, n_1, and n_2 from distribu-
tions F_0, F_1, and F_2, where F_0 is a mixture of F_1 and F_2 (i.e., F_0 = θF_1 + (1 − θ)F_2).
Thus some of the n_0 z's are from π_1 and some are from π_2. Rather than
classifying the z's, the problem is to classify the mixing parameter as either large,
small, or intermediate. He defines a decision rule based on the combined sample
ranks and shows consistency. Das Gupta (1964) considered the multivariate
multipopulation problem where a sample of z's, all from the same population, is to
be classified. His classification rule is based on a comparison of a measure of
distance between the empirical distribution function of the z's and the empirical
distribution of each of the other samples. He also considers a univariate
two-population problem similar to that of Hudimoto. For both problems he
shows that his classification rules are consistent; i.e., the probabilities of correct
classification converge to one as the sample sizes approach infinity.

References

Anderson, T. W. (1958). An Introduction to Multivariate Statistical Analysis. Wiley, New York.
Broffitt, J. D., Randles, R. H. and Hogg, R. V. (1976). Distribution-free partial discriminant analysis.
J. Amer. Statist. Assoc. 71, 934-939.
Cacoullos, T. (1966). Estimation of a multivariate density. Ann. Inst. Statist. Math. 18, 179-189.
Chatterjee, S. K. (1973). Rank procedures for some two-population multivariate extended classifica-
tion problems. J. Multivariate Anal. 3, 26-56.
Conover, W. J. and Iman, R. L. (1978). The rank transformation as a method of discrimination with
some examples. Sandia Laboratories.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Ann. Eugenics 7,
179-188.
Gessaman, M. P. and Gessaman, P. H. (1972). A comparison of some multivariate discrimination
procedures. J. Amer. Statist. Assoc. 67, 468-472.
Glick, N. (1969). Estimating unconditional probabilities of correct classification. Tech. Rept. No. 19,
Department of Statistics, Stanford University, Stanford, CA.
Glick, N. (1972). Sample-based classification procedures derived from density estimators. J. Amer.
Statist. Assoc. 67, 116-122.
Goldstein, M. and Dillon, W. R. (1978). Discrete Discriminant Analysis. Wiley, New York.
Govindarajulu, Z. and Gupta, A. K. (1977). Certain nonparametric classification rules and their
asymptotic efficiencies. Canad. J. Statist. 5, 167-178.
Huber, P. J. (1964). Robust estimation of a location parameter. Ann. Math. Statist. 35, 73-101.
Hudimoto, H. (1964). On a distribution-free two-way classification. Ann. Inst. Statist. Math. 16,
247-253.
Maronna, R. A. (1976). Robust M-estimators of multivariate location and scatter. Ann. Statist. 4,
51-67.
Parzen, E. (1962). On estimation of a probability density function and mode. Ann. Math. Statist. 33,
1065-1076.
Quesenberry, C. P. and Gessaman, M. P. (1968). Nonparametric discrimination using tolerance
regions. Ann. Math. Statist. 39, 664-673.
Randles, R. H., Broffitt, J. D., Ramberg, J. R. and Hogg, R. V. (1978). Discriminant analysis based
on ranks. J. Amer. Statist. Assoc. 73, 379-384.
Randles, R. H., Broffitt, J. D., Ramberg, J. R. and Hogg, R. V. (1978). Generalized linear and
quadratic discriminant functions using robust estimates. J. Amer. Statist. Assoc. 73, 564-568.
Rosenblatt, M. (1956). Remarks on some nonparametric estimates of a density function. Ann. Math.
Statist. 27, 832-837.
Stoller, D. S. (1954). Univariate two-population distribution-free discrimination. J. Amer. Statist.
Assoc. 49, 770-777.
Wang, C. (1979). Robust linear classification rules for the two-population classification problem.
Ph.D. Dissertation, Industrial and Management Engineering, University of Iowa, Iowa City, IA.
7
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
© North-Holland Publishing Company (1982) 169-191

Logistic Discrimination

J. A. Anderson

1. Introduction

Discriminant methods are used in broadly two ways, firstly to summarise and
describe group differences and secondly to allocate new individuals to groups.
This chapter is chiefly concerned with the latter problem, of which medical
diagnosis is a prime example.
Three of the chief attractions of logistic discrimination are: (i) Few distribu-
tional assumptions are made. (ii) It is applicable with either continuous or
discrete predictor variables, or both. (iii) It is very easy to use--once the
parameters have been estimated, the allocation of a fresh individual requires only
the calculation of a linear function. There are many other methods that have been
suggested for statistical discrimination. One possible categorisation is by the level
of assumptions made about the functional forms of the underlying likelihoods.
Three classes are probably sufficient: (i) fully distributional, (ii) partially distribu-
tional and (iii) distribution-free. Thus, suppose that it is required to discriminate
between k groups (H_1,..., H_k) on the basis of random variables x^T =
(x_1, x_2,..., x_p). The likelihood of x in H_s, L(x|H_s) (s = 1,..., k), may be assumed
to have a fully specified functional form except for some parameters to estimate.
This is the fully distributional approach (i), the classical example being the
assumption of multivariate normality (Welch, 1939; Anderson, T. W., 1958). An
example in the partially distributional class (ii) is the logistic discrimination
(Anderson, J. A., 1972) where ln{L(x|H_s)/L(x|H_t)} is taken to be linear in the
(x_j) or simple functions of them. Here only the likelihood ratios are modelled; the
remaining aspects of the likelihoods are estimated only if required. Distribution-
free methods of discrimination have been described, for example, by Aitchison
and Aitken (1976), and Habbema, Hermans and van der Broek (1974). The basic
idea here is that distribution-free estimates of the likelihoods are found, perhaps
using kernel or nearest neighbour methods. The above is intended not as a review
of the literature but rather to give examples of the three classes of discriminant
method.


Irrespective of the level of assumptions made about the likelihoods, many
discriminant techniques make use of the optimality theorem. This states that if
sample points x from the mixture of populations H_1, H_2,..., H_k in the proportions
Π_1, Π_2,..., Π_k are to be allocated to one or other of the populations, the overall
probability of correct assignment is maximised if an individual with observations
x is allocated to H_i provided Π_i L(x|H_i) ≥ Π_j L(x|H_j) (j = 1,..., k; j ≠ i) or,
equivalently, Pr(H_i|x) ≥ Pr(H_j|x) (j = 1,..., k; j ≠ i). This result is proved by
Rao (1965) and is readily extended to maximising the average utility. A technique
giving optimal allocation in this sense is termed 'optimal'.
Quite a different approach is used in Fisher's linear discriminant function. This
is widely (and erroneously) associated with multivariate normal distributions. It is
optimal for the multinormal and other specific distributions, but as introduced by
Fisher (1936) and developed by Bartlett (1947) it is not restricted to any families
of distribution. Rather it is designed to maximise the distance between population
means in a specific sense and is equally concerned with describing group
differences.
The topics to be covered in subsequent sections are to introduce the ideas of
logistic discrimination between two groups (Section 2), and then to consider
iterative maximum likelihood estimation of the parameters postulated, and treat
associated topics and any difficulties (Section 3). An example of the method
applied to diagnosis in medicine is given in Section 4. The discussion then turns
in Section 5 to extensions of the method: quadratic logistic discrimination, bias
reduction in the estimation method, and logistic compound methods for discrimi-
nant updating and estimating mixing proportions. In Section 6 k-group logistic
discrimination is introduced and developed along the lines established for two
groups. Finally, in Section 7, applications of the methods outside discrimination
are described briefly, some new work is mentioned, and some comparisons made
with other discriminant methods.

2. Logistic discrimination: Two groups

In the notation of the previous section, the objective is to allocate an individual


with observations x to one of the two populations H 1 or H 2. The possibility that
the individual is from neither population is not considered here, but separate
techniques are available (Aitchison and Aitken, 1976) to investigate this eventual-
ity. Alternatively, this question could be posed as an outlier problem, see Barnett
and Lewis (1979) for a review.
The fundamental assumption in the logistic approach to discrimination is that
the log-likelihood ratio is linear, that is:
ln{L(x|H_1)/L(x|H_2)} = β_0' + β^T x, (2.1)

where β^T = (β_1,..., β_p). Many authors have investigated this model from various
aspects: useful references are Anderson (1972), Cox (1966), Day and Kerridge
(1967), and Truett, Cornfield and Kannel (1967).

The importance of the model (2.1) has many facets. Perhaps the first is that it
gives the posterior probabilities a simple form:

Pr(H_1|x) = exp(β_0' + ln K + β^T x)/{1 + exp(β_0' + ln K + β^T x)}, (2.2)

where K = Π_1/Π_2, and as before Π_s is the proportion of sample points from H_s
(s = 1, 2). Once β_0', β and K are estimated, the optimal rule is very easy to use as
the decision about allocation depends only on the linear function β_0' + ln K + β^T x.
Another important property of this approach is that the same estimation proce-
dure can be used for β irrespective of whether x represents continuous or discrete
data (or some of each). Further, β can be estimated without making any further
distributional assumptions over those implied in (2.1). Hence logistic discrimina-
tion is a partially distributional method. These assertions will be proved using the
recent techniques developed by Anderson (1979).
The utility of the method is enhanced by the wide variety of families of
distributions which satisfy (2.1), including (i) multivariate normal distributions
with equal covariance matrices, (ii) multivariate discrete distributions following
the log-linear model with equal (not necessarily zero) interaction terms, (iii) joint
distributions of continuous and discrete variables following (i) and (ii), not
necessarily independent, (iv) selective and truncated versions of the foregoing, (v)
versions of the previous families with quadratic, log, or any specified functions of
the (x_j). Anderson (1972, 1975) discusses this. Hence the basic assumption (2.1) is
satisfied by many families of distributions, and logistic discrimination is widely
applicable.
There are three common sampling designs that yield data suitable for estimat-
ing β: (i) mixture sampling, (ii) sampling conditional on x and (iii) separate
sampling for each H s. Under (i), points are sampled from the joint distribution
(H, x) with likelihood L(H, x) where H takes the value H 1 or H 2. Thus the
proportion of sample points from H 1 is an estimate of the incidence of H 1. Under
(ii), x is fixed and then one or more sample values are taken, which take the
values H_1 or H_2. This is sampling the conditional distribution H|x with likelihood
Pr(H|x). This sampling scheme arises frequently in bioassay. Under (iii), the
other conditional distributions (x|H_1) and (x|H_2) are sampled. This is very
common in discriminant work.
The estimation procedures for all three modes of sampling are very similar; the
same basic algorithm can be used for iterative maximum likelihood estimation
with very minor subsequent adjustments for separate sampling. However, the
justification of the sampling procedure for separate sampling is more intricate.

2.1. The x-conditional sampling case


Suppose that n sample points are drawn and that at x there are n_s(x) points
from H_s (s = 1, 2). Note that many or most of the n(x) may be zero or unity.
Under x-conditional sampling, the likelihood is

L_c = Π_x {Pr(H_1|x)}^{n_1(x)} {Pr(H_2|x)}^{n_2(x)}. (2.3)

Here n(x) = n_1(x) + n_2(x) is fixed for all x. Now it follows from (2.2) that

Pr(H_1|x) = exp(β_0 + β^T x)/{exp(β_0 + β^T x) + 1} = p_1(x), (2.4)

say, where β_0 = β_0' + ln K. Then

p_2(x) = Pr(H_2|x) = 1/{1 + exp(β_0 + β^T x)}. (2.5)

Note that the decision about discrimination rests solely on the linear function
l(x) = β_0 + β^T x.

The likelihood, L_c, may now be written

L_c = Π_x {p_1(x)}^{n_1(x)} {p_2(x)}^{n_2(x)}. (2.6)

It is the logistic form of the conditional probabilities in (2.4) and (2.5) which gives
its name to this approach to discrimination. We see that L_c in (2.6) is a function
of the β_j (j = 0, 1,..., p) and hence the maximum likelihood estimators of these
parameters are derived by an iterative optimisation procedure. Note that β_0 is
estimable but that β_0' = β_0 − ln(Π_1/Π_2) is estimable only if Π_1 is known or
estimable separately. Technical and practical details of the iterative procedure will
be given in the next section.

2.2. The mixture sampling case


Under mixture sampling the likelihood is given by

L_m = Π_x {L(x|H_1)}^{n_1(x)} {L(x|H_2)}^{n_2(x)}. (2.7)

This is different from L_c but it can be shown that the β_j (j = 0, 1,..., p) are
estimated as in the x-conditional sampling case by maximising L_c in (2.6). To see
this, note that

L(x|H_s) = Pr(H_s|x) L(x),  s = 1, 2.

Hence

L_m = L_c · Π_x {L(x)}^{n(x)}. (2.8)

Now the functional form of the likelihood ratio has been assumed in (2.1) but
specifically no further assumptions have been made. This implies that L(x), the
marginal distribution of x, contains no information about the β_j's. Even if there is
further information about the forms of L(x|H_s) (s = 1, 2), the extra information
about the β_j's in Π_x {L(x)}^{n(x)} is likely to be small compared with that contained
in L_c. Hence maximum likelihood estimates of the β_j's are again obtained by
optimising L_c in (2.6). Day and Kerridge (1967) give further details of the above
argument. Note that the above conclusions apply equally well to data of any kind,
continuous or discrete, satisfying (2.1). Further, Π_1 and hence β_0' are now
estimable.

2.3. The separate sampling case


The likelihood of the observations is

L_s = Π_{s=1}^{2} Π_x {L(x|H_s)}^{n_s(x)}. (2.9)

It is not at all obvious that the estimation scheme above (maximising L_c in (2.6))
yields maximum likelihood estimates of the β_j's in this case. However, the result
does follow. Anderson (1972) gave a proof for discrete random variables and gave
a brief discussion of the continuous variate case. A rather different approach will
be sketched here following Anderson (1979). That paper was concerned with the
same likelihood structure as here, but additionally there were n_3 sample points
from the compound distribution

θ_1 L(x|H_1) + (1 − θ_1) L(x|H_2)

for some unknown proportion θ_1. The following discussion is derived from
Anderson's (1979) results with n_3 = 0.
Suppose

L(x|H_2) = f(x), (2.10)

then

L(x|H_1) = f(x) e^{β_0' + β^T x} (2.11)

and

L_s = Π_x exp{n_1(x)(β_0' + β^T x)} f(x)^{n(x)}. (2.12)

This expression for L_s involves the parameters β_0', β and the function f(x). To
proceed with the estimation, some structure for f(·) is necessary. The discrete case
is most straightforward and will be dealt with first. Thus suppose that the sample
space for x is discrete; then the values of f(x) may be taken to be multinomial
probabilities. We may attempt to estimate all these as multinomial probabilities
without imposing any further structure. Note that if the sample from H_1 is not
available, the maximum likelihood estimates of the {f(x)} are f̂(x) = n_2(x)/n_2
for all x. In the more interesting case where we also have n_1 sample points from

H_1, we must impose two constraints to preserve the probability distributions
L(x|H_s), s = 1, 2. Hence

Σ_x f(x) e^{β_0' + β^T x} = 1, (2.13)

Σ_x f(x) = 1. (2.14)

The likelihood L_s in (2.12) is now seen as a function of the parameters β_0', β and
{f(x)}. This likelihood may then be maximised, subject to the constraints (2.13)
and (2.14). It can be shown that the constrained maximum likelihood estimate of
f(x) is

f̂(x) = n(x)/{n_1 exp(β_0' + β^T x) + n_2}  for all x. (2.15)

This is Anderson's (1979) equation (9) with n_3 = 0. Substituting f̂(x) for f(x) in
(2.12) gives

L_s = L* · n_1^{−n_1} · n_2^{−n_2} · Π_x n(x)^{n(x)}, (2.16)

where

L* = Π_x {n_1 e^{β_0'+β^T x}/(n_1 e^{β_0'+β^T x} + n_2)}^{n_1(x)} {n_2/(n_1 e^{β_0'+β^T x} + n_2)}^{n_2(x)}

   = Π_x {p_1(x)}^{n_1(x)} {p_2(x)}^{n_2(x)}, (2.17)

where p_1(x) and p_2(x) are as given in (2.4) and (2.5) with β_0 = β_0' + ln(n_1/n_2). It
follows that the maximum likelihood estimates of β and β_0 (and hence β_0') are
obtained by maximising L* in (2.17), or equivalently by maximising L_c in (2.6).
If some or all of the x variates are continuous, the above treatment does not
hold. Anderson (1979) discusses this case briefly. It is suggested that β_0' and β
may still be estimated by maximising L* or, equivalently, L_c. The simplest
justification is based on the subdivision of each continuous variate to make it
discrete. Subject to the condition (2.1) holding, the above approach is applicable
with perhaps some information loss due to the range subdivision.

To retain the 'continuous' nature of the x variates introduces some difficulties
in the context of maximum likelihood estimation. Any attempt to maximise the
likelihood (2.12) with respect to β_0', β, and the functional form f(·) quickly
founders as the likelihood is not bounded above with respect to f(·), even though
it is constrained as a density. Good and Gaskins (1971) discuss this in a simpler
context with the aid of Dirac delta functions. They showed how to 'bound' the
likelihood by introducing smoothness constraints on the function f(·). Recent

work has concentrated on the maximisation of L_s in (2.12) with smoothness
constraints on f(·). It is conjectured that the maximum likelihood estimates of β_0'
and β will again be given by maximising L* (or L_c) for the x variates all
continuous and also in the case where some of them are continuous and some
discrete. Simulation studies (Anderson, 1972; Anderson and Richardson, 1979)
confirm that this procedure gives sampling distributions for the estimators as
would be suggested by asymptotic theory.

Recall that the only questions over continuous variates occur with separate
sampling. For x-conditional or mixture sampling, the estimation of β_0 and β from
L_c is unquestioned, no matter the nature of the x-variates.

2.4. Discussion
Provided that the likelihood ratio satisfies (2.1), the parameters β may be
estimated by maximising L_c, irrespective of (i) which of the three sampling plans has
been taken and (ii) the nature of the x-variates, discrete or continuous. The role of
β_0' is a little different in the three sampling plans considered here, and care is
required. In particular, for use in discrimination, it has been seen in (2.2) that
β_0' + ln Π_1/Π_2 is required. This is estimated automatically in x-conditional and
mixture sampling, but in separate sampling β_0' is estimated and Π_1 must be
supplied (or estimated) additionally. This is not normally a severe requirement.

Suppose that β_0 and β have been estimated corresponding to proportions
Π_1, Π_2 for H_1, H_2. It is now required to use the discrimination system in a
context where sample points are drawn from the mixture of H_1 and H_2 in the
proportions Π̃_1 and Π̃_2. Anderson (1972) showed that the appropriate posterior
probability for H_1 is

p̃_1(x) = exp(β̃_0 + β^T x)/{1 + exp(β̃_0 + β^T x)}, (2.18)

where

β̃_0 = β_0 + ln{(Π̃_1 Π_2)/(Π̃_2 Π_1)}.
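A one-line computation suffices to apply (2.18) in practice. The following Python sketch (illustrative names only, not from the chapter) adjusts an estimated β_0 for new mixing proportions.

import math

def adjust_intercept(beta0, pi1, pi1_new):
    # beta0 was estimated under proportions (pi1, 1 - pi1); return the intercept
    # appropriate for a population with proportions (pi1_new, 1 - pi1_new), as in (2.18)
    pi2, pi2_new = 1.0 - pi1, 1.0 - pi1_new
    return beta0 + math.log((pi1_new * pi2) / (pi2_new * pi1))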

3. Maximum likelihood estimation

It has been seen that a key feature of logistic discrimination is that the
parameters β_0 and β required for the posterior probabilities, Pr(H_s|x), s = 1, 2,
are estimated by the same maximum likelihood procedure for continuous and
discrete data, for different underlying families of distributions and for different
sampling plans. In all these cases it is necessary to maximise

L_c = Π_x {p_1(x)}^{n_1(x)} {p_2(x)}^{n_2(x)} (2.6)

where

p_1(x) = exp(β_0 + β^T x)/{1 + exp(β_0 + β^T x)} (2.4)

and p_2(x) = 1 − p_1(x).



3.1. Iterative maximum likelihood estimation


Day and Kerridge (1967) and Anderson (1972) originally suggested using the
Newton-Raphson procedure to maximise L c. However, with the progress made in
numerical optimisation, quasi-Newton methods can now be recommended, as
these combine the Newton property of speed of convergence near the optimum
with the advantages possessed by the steepest descent method with poor starting
values (Gill and Murray, 1972). A further advantage of the quasi-Newton
procedures is that they require only the first-order derivatives at each iteration
while giving an estimate of the matrix of second order derivatives at the
maximum point. These properties are almost tailor made for statistical use, as
maximisation of the log-likelihood supplies the information matrix without ex-
plicit calculation. Some unpublished calculations suggest that, provided the
number of iterations is not less than the number of parameters to be estimated, in
logistic discrimination the error in the estimates of the variances introduced by
the quasi-Newton approximation is of the order of 5%. This is acceptable for
most purposes.
The derivatives of ln L_c can be derived quickly:

∂ ln L_c/∂β_j = Σ_x {n_1(x) − n(x) p_1(x)} x_j,  j = 0, 1,..., p, (3.1)

where the summation is over all x-values, but clearly only x-values with n(x) > 0
need to be included. The second derivatives are equally accessible:
∂² ln L_c/(∂β_j ∂β_l) = −Σ_x n(x) p_1(x) p_2(x) x_j x_l,  j, l = 0, 1,..., p. (3.2)

The only extra information required to start the iterative optimisation proce-
dure, whether Newton or quasi-Newton, is the starting values. Cox (1966)
suggested taking linear approximations to p_i(x) (i = 1, 2) and obtaining initial
estimates of β_0 and β by weighted least squares. However, Anderson (1972)
suggested taking zero as the starting value for all p + 1 logistic parameters. This
has worked well in practice, and can be recommended with confidence. Albert
(1978) recently showed that except in two special circumstances, which can be
readily recognized, the likelihood function L_c has a unique maximum attained for
finite β. Hence the procedure of starting from a fixed 'neutral' value may be
expected to converge rapidly to the maximum. Special computer programs are
available in Fortran and Algol 60 to effect this estimation, and they can be
obtained from the author.
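The following Python sketch (assuming NumPy and SciPy; it is not the author's Fortran or Algol 60 program) maximises ln L_c with a quasi-Newton routine, supplying the gradient (3.1); here y codes 1 for sample points from H_1 and 0 for points from H_2, and all names are illustrative.

import numpy as np
from scipy.optimize import minimize

def neg_log_Lc(beta, X1, y):
    # Negative of ln L_c in (2.6); X1 carries a leading column of ones so beta[0] = beta_0
    eta = X1 @ beta
    return -np.sum(y * eta - np.logaddexp(0.0, eta))

def neg_grad(beta, X1, y):
    # Negative of (3.1): sum over sample points of {n_1(x) - n(x) p_1(x)} x_j
    p1 = 1.0 / (1.0 + np.exp(-(X1 @ beta)))
    return -X1.T @ (y - p1)

def fit_logistic(X, y):
    X1 = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    start = np.zeros(X1.shape[1])            # the zero starting value recommended above
    res = minimize(neg_log_Lc, start, args=(X1, y), jac=neg_grad, method="BFGS")
    return res.x, res.hess_inv               # estimates and an approximate inverse information matrix

The approximate inverse Hessian returned by the quasi-Newton routine plays the role of I^{-1} in Subsection 3.2 below, so square roots of its diagonal give rough asymptotic standard errors.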

3.2. Asymptotic standard errors


Asymptotic standard errors for the logistic parameters can be found as usual.
At the maximum likelihood point, the observed information matrix,

I = (−∂² ln L_c/∂β_j ∂β_l),

is evaluated from (3.2) and inverted to give the usual asymptotic result that
cov(β̂_j, β̂_l) = I^{jl}, where I^{−1} = (I^{jl}) for all j and l. This result is obvious except for
separate sampling where the likelihood L_s (2.12) is subject to the constraints
(2.13) and (2.14). However, Anderson (1972) showed that the above asymptotic
matrix is appropriate in this case also, although the variance of β̂_0, the maximum
likelihood estimator of β_0, has a further error of o(1/n) introduced.

3.3. Maxima at infinity


As noted above, there are two kinds of configurations of sample points for
which the likelihood L_c (2.6) does not have a unique maximum attained for finite
β. The first kind is said to exhibit complete separation, which occurs if all the
sample points from H_1 lie on one side of a hyperplane and the points from H_2
lie on the other side. The points x^T = (x_1,..., x_p) are considered to lie in a
Euclidean space R^p, of dimension p, irrespective of whether the underlying
random variables are continuous or discrete. With complete separation there are
non-unique maxima at infinity. Suppose the equation of the separating hyper-
plane is l(x) = γ_0 + γ^T x = 0, and that l(x) is positive if x lies on the same side of
the hyperplane as the H_1 points. Hence l(x) is negative for all the H_2 points.
Taking β_0 = kγ_0 and β = kγ gives for the likelihood L_c in (2.6)

L_c(k) = Π_{H_1} [exp{kl(x)}/(1 + exp{kl(x)})] · Π_{H_2} [1/(1 + exp{kl(x)})], (3.3)

where the products are over the sample points from H_1 and H_2, respectively. As
k → +∞, L_c(k) → 1. Hence there is a maximum of L_c at β_0 = kγ_0, β = kγ as
k → +∞, that is, at infinity. Further, there are equivalent points at infinity giving
the same upper bound of 1 for L_c for all hyperplanes which give complete
separation. These non-unique maxima suggest that β_0 and β cannot be estimated
with much precision, although some bounds may be placed on them. However,
from the discrimination point of view, quite a good discriminant function may
ensue for any separating hyperplane. It will after all have good discriminant
properties on the sample.

It is easy to test whether a particular data configuration exhibits complete
separation. If there is a separating hyperplane, the maximum of the likelihood is
at infinity, and the iterative procedure must at some stage find a β_0^(m) and β^(m)
such that the plane

l^(m)(x) = β_0^(m) + β^(m)T x = 0

completely separates the points from H_1 and H_2. The values of l^(m)(x) are
required in the iterative procedure, so it is quick and simple to check whether
l^(m)(x) gives complete separation at each stage of the iteration. If so, the iteration
stops with an appropriate message. Day and Kerridge (1967) and Anderson
(1972) give full details of complete separation.
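A check of this kind is simple to code. The sketch below (Python with NumPy, illustrative names) reports whether a current linear function l(x) = β_0 + β^T x completely separates the two training samples.

import numpy as np

def completely_separates(beta0, beta, X1, X2):
    # True when every H_1 point has l(x) > 0 and every H_2 point has l(x) < 0;
    # the opposite orientation can be checked by reversing the roles of X1 and X2
    l1 = beta0 + X1 @ beta
    l2 = beta0 + X2 @ beta
    return bool(np.all(l1 > 0) and np.all(l2 < 0))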
The second kind of problem of data configuration has zero marginal propor-
tions occurring with discrete data. Again the maximum for L_c is at infinity. For
example, suppose that x_1 is a binary variable, taking the values 0 or 1. In the
sample obtained, x_1 = 0 for all points from H_1 and x_1 = 1 for at least one point
from H_2. Anderson (1974) showed that the maximum value of L_c is attained with
β_1 = −∞. This is unsatisfactory as it implies certainty about group membership
when x_1 = 1. However, x_1 must be retained as a predictor since it contains
valuable information about group membership. Anderson (1974) suggested an ad
hoc alternative procedure for estimating β_1 in this case, based on the assumption
that x_1 is independent of x_2, x_3,..., x_p, conditionally in each group, H_1 and H_2.
Clearly this assumption will only be true approximately in many cases, but this
type of data configuration will occur mostly with small samples where hopefully
the approximation will not be too important. It can be regarded as a compromise
between naive Bayes and logistic discrimination.

Anderson (1974) discussed this problem more fully and showed that taking the
above approach gives the following estimates:

β̂_0 = β̂_0^(−1) + ln(q̂_11/q̂_21),

β̂_1 = ln{(p̂_11 q̂_21)/(q̂_11 p̂_21)},

β̂_j = β̂_j^(−1),  j = 2,..., p,

where p̂_i1 is the estimate of Pr(x_1 = 1|H_i), i = 1, 2, and q̂_i1 = 1 − p̂_i1. The estimate
p̂_i1 = (a_i1 + 1)/(n_i + 2) is suggested, where a_i1 is the number of sample points
observed in H_i with x_1 = 1, and n_i is the total number of sample points from H_i,
i = 1, 2. Further, β̂_0^(−1) and β̂^(−1) are the maximum likelihood estimates of the
logistic parameters as derived in this section, but omitting x_1. This approach may
be extended easily if more than one variable has troublesome zero marginal
proportions.
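These ad hoc estimates are easily computed. The following sketch (Python, illustrative names) assumes that the logistic fit omitting x_1 has already produced β̂_0^(−1), and returns the adjusted β̂_0 and β̂_1.

import math

def adjust_for_zero_margin(beta0_without_x1, a11, n1, a21, n2):
    # a_i1 = number of H_i points with x_1 = 1; n_i = sample size from H_i
    p11, p21 = (a11 + 1) / (n1 + 2), (a21 + 1) / (n2 + 2)
    q11, q21 = 1.0 - p11, 1.0 - p21
    beta0 = beta0_without_x1 + math.log(q11 / q21)
    beta1 = math.log((p11 * q21) / (q11 * p21))
    return beta0, beta1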

3.4. Step-wise selection of discriminant variables


In some discriminant situations there are very m a n y potential discriminant
variables. This may represent the experimenter's lack of knowledge, his caution,
or both. One objective of the statistician must be to choose a set of good
predictors from the set of possibles. This is a very similar problem to that of
choosing predictor variables in ordinary regression, which usually involves step-
wise methods. The procedure suggested here is the simplest such procedure for
logistic discrimination.
The first stage is to ascertain whether any of the variables considered singly are
significantly related, in the statistical sense, to group membership. For considera-
tion, there is a set of p possible variables giving as models for the probability of
H1, given x j,

exp( fl(oj' +/3j(.J'xj)


Pr(H,Ixs)=p{S (xs)= j=l,...,p. (3.4)

We must decide which of these models is 'best' and whether it is 'better' than the
Logistic discrimination 179

'null' model which indicates that none of the potential predictor variables is
useful:

Pr(H_1) = p_1^(0) = exp(β_0^(0))/{1 + exp(β_0^(0))}. (3.5)

Taking a pragmatic approach to model selection, the criterion for comparing


models here is the maximised log-likelihood. Denote the maximised values for the
likelihood functions L_c (found as in Subsection 3.1) corresponding to the models
in (3.5) and (3.4), that is, no predictor variable, predictor variables x_1, x_2,..., x_p,
by M_0, M_1,..., M_p, respectively. Note that

M_0 = n_1^{n_1} n_2^{n_2}/n^n,

where n = n_1 + n_2. Then x_(1) is selected as the 'best' single predictor variable if its
maximised likelihood M_(1) satisfies

M_(1) ≥ M_j,  j = 1,..., p. (3.6)

An approximate asymptotic chi-square test of the hypothesis that x_(1) is not
effective as a predictor variable (that is, that the logistic coefficient corresponding to
x_(1), β_(1), equals zero) is given by the statistic 2 ln(M_(1)/M_0). Under the above
hypothesis this has the chi-square distribution with one degree of freedom. If this
statistic is not statistically significant, it is concluded that none of the potential
predictor variables is effective, and the exercise is complete. If β_(1) is significantly
different from zero, x_(1) is confirmed as a predictor.
The next step is to see if any of the remaining (p − 1) potential predictors
significantly add to the predictive power of x_(1). Models similar to (3.4) but
containing two variables are considered. Variable x_(1) is always taken as one of
the pair, giving (p − 1) pairs of variables. Corresponding to each of these is a
likelihood L_c and a maximised log-likelihood. Suppose the greatest of these is M_(2),
which occurs when variable x_(2) is paired with x_(1). The likelihood ratio
2 ln(M_(2)/M_(1)) is tested as above, and if significant, x_(2) is confirmed as a
predictor variable. Otherwise the predictor set is restricted to x_(1) and the exercise
is complete.
This procedure is repeated at each stage adding a variable from the 'possible'
to the 'confirmed' predictor set. The procedure stops either when the next best
potential predictor does not significantly add to the predictive power or when
there are no more potential predictors.
More sophisticated rules for the inclusion and exclusion of predictor variables
could be devised along the lines of those used in linear regression. Moreover, the
above procedure is probably biased towards the inclusion of extra variables in
the predictor set. This is because at the jth stage, 2 ln(M_(j)/M_(j−1)) has been
tested as a chi-square variate but really it is the largest of (p − j + 1) correlated

chi-square variates. More sophisticated procedures could be devised which would


take some account of these points, but the gain is probably small in most cases.
Perhaps the decision to use some method of variable selection is more important
than the actual choice of method.
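The simple forward procedure described above can be sketched as follows (Python; max_log_Lc is assumed to be a user-supplied routine returning the maximised value of ln L_c for a given set of columns, for instance built from a fit such as the one sketched in Subsection 3.1, and all names are illustrative).

import numpy as np
from scipy.stats import chi2

def forward_selection(X, y, max_log_Lc, alpha=0.05):
    n1 = float(np.sum(y)); n2 = float(len(y)) - n1; n = n1 + n2
    current = n1 * np.log(n1) + n2 * np.log(n2) - n * np.log(n)   # ln M_0 for the null model (3.5)
    remaining, chosen = list(range(X.shape[1])), []
    threshold = chi2.ppf(1 - alpha, df=1)
    while remaining:
        lls = [max_log_Lc(X[:, chosen + [j]], y) for j in remaining]
        best = int(np.argmax(lls))
        if 2.0 * (lls[best] - current) < threshold:   # 2 ln(M_(j)/M_(j-1)) tested as chi-square(1)
            break
        current = lls[best]
        chosen.append(remaining.pop(best))
    return chosen

As noted above, the nominal chi-square reference is optimistic at each stage, so the significance level alpha should be interpreted loosely.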

4. An example: The preoperative prediction of postoperative


deep vein thrombosis

Recently evidence has accumulated that the incidence of postoperative deep


vein thrombosis can be reduced by prophylactic measures, particularly low-dose
heparin. Low-dose heparin may, however, significantly increase bleeding prob-
lems at the time of surgery, and the object is to try to identify before operation
those patients particularly at risk of deep vein thrombosis, in the hope that in
future it would be unnecessary to give prophylactic treatment to patients in a
low-risk group.
Logistic discrimination was used on variables available preoperatively to pre-
dict membership of the two groups Hi: postoperative deep vein thrombosis and
H2: no postoperative deep vein thrombosis. Full details of this important and
successful study are given by Clayton, Anderson and McNicol (1976).
One hundred and twenty-four patients undergoing major gynaecological surgery
were investigated. On each patient ten items of clinical data were recorded and
sixteen laboratory tests were carried out. The diagnosis of postoperative deep vein
thrombosis was made on the basis of isotopic scanning of the legs.
No patients had evidence of deep vein thrombosis preoperatively, but after
operation 20 of the 124 developed deep vein thrombosis. Thus there were 20
sample patients from H 1 and 104 from H 2. Note that the mixture sampling plan
has been used.
After some preliminary screening attention focussed on the following ten
variables as being the best potential predictors of deep vein thrombosis: (1)
fibrinogen; (2) factor VIII; (3) euglobulin lysis time; (4) FR antigen; (5) age; (6)
length of preoperative stay; (7) percentage overweight for height; (8) presence of
varicose veins; (9) cigarette smoking habits recorded as 'Yes' or 'No' and (10)
benign or malignant disease. Variables (8) and (9) were scored 0 for absence and
1 for presence. Variable (10) was scored 0 for benign and 1 for malignant.
The stepwise procedure outlined in Subsection 3.4 was then employed on these
data to select a predictor set from the above 10 variables. The five variables
selected were (3) euglobulin lysis time; (5) age; (8) varicose veins; (4) FR antigen
and (7) percentage overweight for height, in the order of their selection. No
further variables were added to this set of five.
It follows from (2.4) that the posterior probabilities Pr(H_1|x) depend solely on
the index

l(x) = β_0 + β^T x, (4.1)

giving

Pr(H_1|x) = e^{l(x)}/(1 + e^{l(x)}). (4.2)

The logistic parameters for the five variables in the predictor set were
estimated as part of the stepwise system and gave as the estimate of l(x)

l̂(x) = −11.3 + 0.0090x_3 + 0.22x_4 + 0.085x_5 + 0.043x_7 + 2.19x_8. (4.3)

The estimates of the β-parameters as given in this equation display their depen-
dence on the scale of measurement for the (x_j). For example, the mean of x_3 in
the deep vein thrombosis group is 412, while that for x_4 is 9 in the same group.
Using the method given in Subsection 3.2, the standard errors of the logistic
parameters in (4.3) were estimated to be 2.39, 0.0028, 0.12, 0.03, 0.023 and 0.76,
respectively.
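To illustrate how the fitted index is used, the short Python fragment below evaluates (4.3) and (4.2) for a hypothetical patient; the coefficients are those quoted in the text, but the patient's values are invented purely for illustration.

import math

def dvt_index(x3, x4, x5, x7, x8):
    # Estimated prognostic index l(x) of (4.3)
    return -11.3 + 0.0090 * x3 + 0.22 * x4 + 0.085 * x5 + 0.043 * x7 + 2.19 * x8

# Hypothetical values: euglobulin lysis time 400, FR antigen 10, age 55,
# 20% overweight for height, varicose veins present
l_hat = dvt_index(400, 10, 55, 20, 1)
prob = math.exp(l_hat) / (1.0 + math.exp(l_hat))     # Pr(H_1 | x) from (4.2)
print(l_hat, prob)                                   # compare l_hat with the cut-off of -2.5 discussed below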
The values of l̂(x) were calculated for each of the 124 patients in the study and
plotted in Fig. 1, distinguishing between the two groups. Clearly patients with and
without deep vein thrombosis are well separated, but there is inevitably a 'grey'
zone where group membership is equivocal. In this context there is a preoperative
decision to be made, whether to use anti-coagulant therapy or not. This decision
depends not only on the relative likelihood of the two groups but also on the costs
of the various (state-of-the-world, action) pairs. Bearing in mind the potential
gains and losses mentioned at the beginning of this section, a cut-off point of -2.5
was taken, the idea being that patients whose l̂ values are greater than -2.5
should be given preoperative anti-coagulant therapy. No other patient would be
given this treatment.
It can be seen from the figure that some quite extreme values for l̂(x) have
been found, as low as -8 and as high as 6.5. The corresponding posterior
probabilities for H_2 and H_1 are not convincing as they are very high at 0.9997 and
0.9985, respectively. In other studies even more extreme estimates of the posterior
probabilities have been noted. This is to some extent caused by the relatively
small number of patients from H1, but the phenomenon of extreme odds occurs in
many similar techniques where estimated parameters are 'plugged' into an
expression involving the true values of parameters. Aitchison and Dunsmore
(1975) discuss this 'estimative' approach and prefer their 'predictive' method.
Unfortunately this is limited largely to multivariate normal distributions. There is
no predictive treatment for multivariate discrete distributions. Hence the numeri-
cal values of the posterior probabilities obtained from logistic discrimination
should be treated with caution, particularly with small samples. However, the
ordering of probabilities is unlikely to be changed by a better method of

Fig. 1. Prognostic index (l̂) for patients at risk from deep vein thrombosis (DVT). Sample values: o,
patients with DVT; ×, patients with no DVT.

estimation (if one existed), so some monotone transformation of the index values
l̂(x) should give satisfactory estimates of the posterior probabilities. The index
values should be interpreted in this light.

Note that here the sampling was from the mixture of the two distributions.
Hence the β̂_0 given by the maximum likelihood procedure of Subsection 3.1
required no adjustment for use in the estimated discriminant function or index
l̂(x). This would have also been true for x-conditional sampling, but if separate
sampling had been employed, ln{(Π_1 n_2)/(Π_2 n_1)} would have been added to the
estimate β̂_0 emerging from the maximum likelihood procedure. As discussed in
Subsection 2.4, Π_1 is the proportion of sample points from H_1 in the mixed
population in which discrimination is to take place.
The need to derive discriminant functions based on continuous and discrete
variables is common in medical situations, as here, and gives logistic methods
particular importance.
The decision system suggested for the management of postoperative deep vein
thrombosis has been tried in practice and has given very satisfactory results, with
a reduction in the incidence of the thrombosis from 16% (20/124) here to 3%
(3/100) in a recently completed study. Papers are in preparation to report these
findings in detail.

5. Developments of logistic discrimination: Extensions

Over the last few years certain developments and extensions of logistic dis-
crimination have emerged.

5.1. Quadratic logistic discrimination


It was pointed out at the beginning of Section 2 that the fundamental
assumption of (2.1) did not imply that the log-likelihood ratio was linear in the
basic observations. On the contrary, any specified functions of these may be
included as x-variates. Perhaps most common are the log, square and squareroot
transformations which arise for the same reasons in discrimination as in linear
regression.
There is a type of quadratic transformation that is introduced for a rather
different reason. Suppose that x has the multivariate normal distribution N(μ_s, Σ_s)
in the sth group, H_s, s = 1, 2. Then the log-likelihood ratio is a quadratic function
in x:

ln{L(x|H_1)/L(x|H_2)} = β_0' + β^T x + x^T Γ x, (5.1)

where the β's are as before and Γ = (γ_{jj'}) is a p × p symmetric matrix. Now the
log-likelihood ratio in (5.1) is linear in the coefficients β_0', β and (γ_{jj'}) (j ≤ j'), so
that it can be written in the form of (2.1) with p + p(p + 1)/2 independent
parameters (β's and γ's). In principle these parameters may be estimated just as

in Sections 2 and 3. In practice, if the number of basic variates p is at all large,
say greater than 4 or 5, this approach would result in far too many parameters to
estimate in an iterative procedure. Thus for p = 5 and 10, the number of
parameters is 21 and 66, respectively.
Anderson (1975) discussed this area in some detail and suggested some ap-
proximations for the quadratic form x^T Γ x which enable the estimation to
proceed. The simplest of these was to take a rank one approximation Γ = λ_1 l_1 l_1^T,
visualised in terms of the largest eigenvalue (λ_1) and corresponding eigenvector
(l_1) of Γ. In this case, the probability of H_1, given x, is

p_1(x) = e^q/(1 + e^q) (5.2)

and p_2(x) = 1 − p_1(x), where

q = β_0 + β^T x + λ_1(l_1^T x)². (5.3)

This is no longer linear in the parameters but Anderson (1975) showed that the
likelihood

L' = Π_x {p_1(x)}^{n_1(x)} {p_2(x)}^{n_2(x)} (5.4)

could be maximised to give estimates of β_0, β_1,..., β_p and l_1,..., l_p, where p_1(x) is
given by (5.2). Because of the non-linearity of q, a different iterative procedure
from that of Section 3 is required, but using the quasi-Newton methods this is
straightforward. Clearly the discriminant function based on (5.2) and
(5.3) has a parabolic boundary if it is agreed to allocate all x such that p_1(x) ≥ λ to
H_1, for some λ. Anderson (1975) demonstrated this by means of an example.
The need for quadratic discriminant functions is by no means restricted to
situations where all the underlying variates are continuous. For example, if the
variates are binary and the first-order interactions on the log-linear scale are not
equal in the two groups, then again the log-likelihood ratio satisfies (5.1),
provided the higher order interactions are the same in the two groups. Usually
any log-likelihood ratio that may be thought to be linear in x and satisfy (2.1)
may be generalised to be quadratic in x and satisfy (5.1). Note that the basic
variables may be transformed before commencing the quadratic logistic proce-
dures.
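A sketch of the reduced-rank quadratic index is given below (Python, illustrative names); the parameters would in practice be estimated by maximising (5.4) with a quasi-Newton routine, which is not shown here.

import numpy as np

def quadratic_logistic_p1(x, beta0, beta, lam1, l1):
    # q of (5.3) and p_1(x) of (5.2) under the rank-one approximation Gamma = lam1 * l1 l1^T
    q = beta0 + beta @ x + lam1 * (l1 @ x) ** 2
    return 1.0 / (1.0 + np.exp(-q))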

5.2. Bias reduction in logistic estimation


It is well known that there is a potential bias when maximum likelihood
estimators are based on small samples. This can be partially eliminated by the use
of bias correction techniques (Cox and Hinkley, 1974). Anderson and Richardson
(1979) considered the application of these methods in logistic discrimination.
The general method can be described briefly. Suppose that r parameters
θ^T = (θ_1,..., θ_r) are to be estimated. The expectation of the maximum likeli-
hood estimator θ̂ is given by

E(θ̂) = θ + b(θ) + e, (5.5)

where b^T(θ) = [b_1(θ),..., b_r(θ)] and e is a vector whose components are all
o(1/n),

b_t(θ) = ½ Σ_{i,j,k=1}^{r} I^{ti} I^{jk} {2E(∂ ln L/∂θ_j · ∂² ln L/∂θ_i ∂θ_k) + E(∂³ ln L/∂θ_i ∂θ_j ∂θ_k)}

(t = 1, 2,..., r), (5.6)

and (I^{ij}) = I^{−1} where, as usual, I = {−E(∂² ln L/(∂θ_i ∂θ_j))}. Proceeding as
before, corrected maximum likelihood estimators are given by

θ̃ = θ̂ − b(θ̂). (5.7)

The techniques introduced above to approximate expected values of the deriva-
tives of likelihoods may be extended to the multi-parameter case where necessary.
Thus the expected value of a derivative such as

∂^{a+b+c} ln L/(∂θ_i^a ∂θ_j^b ∂θ_k^c)

may be estimated by its observed value calculated at θ̂, where a, b, c ≥ 0. Similarly,
for independent identically distributed sample points,

E(∂ ln L/∂θ_j · ∂² ln L/∂θ_i ∂θ_k)

may be estimated by

Σ_{u=1}^{n} (∂ ln L_u/∂θ_j)(∂² ln L_u/∂θ_i ∂θ_k),

where L_u denotes the likelihood contribution of the uth sample point.

As described in Section 2, the logistic discrimination situation with mixture
sampling gives rise to a likelihood

L = Π_x Π_{s=1}^{2} {p_s(x) L(x)}^{n_s(x)}

as in (2.8). The probabilities {p_s(x)} are as given in (2.4) and (2.5). Although
there are two sets of parameters, the (β_j) and the {L(x)}, to estimate, the

maximum likelihood estimation of the (β_j) and the bias correction proceed
independently of the terms in L(x), as the likelihood factorises. Hence, as in
Subsection 2.2, the maximum likelihood estimation is based on L_c given in (2.6).
The first and second derivatives ∂ ln L_c/∂β_j and ∂² ln L_c/(∂β_j ∂β_l) are given in
(3.1) and (3.2). Hence it can be proved (Anderson and Richardson, 1979) that

E(∂ ln L/∂β_j · ∂² ln L/∂β_k ∂β_l) = 0,  for all j, k, l. (5.8)

Thus the bias corrected estimators of the (β_j) may be calculated from (5.7) with
the simplification that one set of expectations are all zero.

The situation for separate sampling logistic discrimination is not so straightfor-
ward. It was seen in Subsection 2.3 that the estimation of the (β_j) does not
separate easily from that of the quantities {f(x)} introduced there, largely
because of the constraints (2.13) and (2.14). Strictly, new results for bias correc-
tion in the presence of constraints are required, but because other results for
mixture and x-conditional sampling have carried over to the separate sampling
case, Anderson and Richardson (1979) suggested using the above bias corrections
in this case also. They investigated the properties of the bias corrected estimators
using simulation studies, and concluded that worthwhile improvements to the
maximum likelihood estimates could be obtained provided that the smaller of n_1
and n_2 (the sample sizes from the two groups) was not too small.

5.3. Logistic compound distributions: Discriminant updates


As above, the objective is to distinguish between the two groups H_1 and H_2.
Suppose that a logistic discriminant function has been estimated along the lines of
Section 3, for one of the three types of sampling discussed in Section 2. The
discriminant function is then used routinely to allocate new sample points to H_1
or H_2. When the 'true' group for a new sample point is eventually known, this
information can be included in the likelihood L_c and the parameter estimates
updated using the methods of Section 3. However, at any one time there may well
be a large sample of points for whom no definite allocation to groups is available.
The size of this group depends on the time lag before definite identification. This
sample has been taken from population H_3, the mixture of the distributions in H_1
and H_2 in the proportions θ_1 and θ_2, where θ_1 + θ_2 = 1. This is a very natural
model of the way information might accumulate in a medical diagnostic context
where there might be a considerable time interval between the time a patient is
first seen and the time at which diagnosis is definite. The sample from H_3
contains information about the logistic parameters and about θ_1; the problem is
to extract the information without making stronger assumptions than in Sections
2 and 3. As posed, this is clearly similar to the mixture problems considered by
Hosmer (1973) and Titterington (1976), but the assumptions in Subsection 2.1 are
much weaker.

Anderson (1979) considered this problem and showed that a neat solution
exists in terms of the logistic formulation. Following his notation, suppose that
separate samples are taken from $H_1$, $H_2$ and $H_3$ of size $n_1$, $n_2$ and $n_3$, respectively. Extending the notation of Section 2, suppose that at $x$, $n_s(x)$ sample points are observed from $H_s$ ($s = 1, 2, 3$). The likelihood is given by

$$L_{\text{mix}} = \prod_x \prod_{s=1}^{3} \{ L(x \mid H_s) \}^{n_s(x)}. \tag{5.9}$$

Writing $L(x \mid H_2) = f(x)$, as in (2.12), this gives

$$L_{\text{mix}} = \prod_x \left( e^{\beta_0' + \beta^T x} \right)^{n_1(x)} \left( \theta_1 e^{\beta_0' + \beta^T x} + \theta_2 \right)^{n_3(x)} \{ f(x) \}^{n(x)} \tag{5.10}$$

where $n(x) = n_1(x) + n_2(x) + n_3(x)$.


Note that there are two constraints on the unknowns in (5.10) since $L(x \mid H_1)$ and $L(x \mid H_2)$ must sum or integrate to unity. This will ensure that $L(x \mid H_3)$ is also normalised since

$$L(x \mid H_3) = \theta_1 L(x \mid H_1) + \theta_2 L(x \mid H_2)$$

and $\theta_1 + \theta_2 = 1$.
In the case where $x$ is discrete, Anderson (1979) estimated the $\{f(x)\}$ as a set of multinomial probabilities, one defined for each distinct $x$-value. Using Lagrange's method of undetermined multipliers, he showed that the maximum likelihood estimate of $f(x)$ is

$$\hat{f}(x) = n(x) \big/ \big( n_1^* e^{\beta_0' + \beta^T x} + n_2^* \big) \quad \text{for all } x, \tag{5.11}$$

where

$$n_s^* = n_s + \theta_s n_3, \qquad s = 1, 2. \tag{5.12}$$

Substituting this result for $f(x)$ into (5.10) implies that $\beta_0'$, $\beta$ and $\theta_1$ may be estimated by maximising the function

$$L^*_{\text{mix}} = \prod_x \left\{ \frac{P(x)}{n_1^*} \right\}^{n_1(x)} \left\{ \frac{Q(x)}{n_2^*} \right\}^{n_2(x)} \left\{ \frac{\theta_1 P(x)}{n_1^*} + \frac{\theta_2 Q(x)}{n_2^*} \right\}^{n_3(x)} \tag{5.13}$$

where

$$P(x) = n_1^* e^{\beta_0' + \beta^T x} \big/ \big( n_1^* e^{\beta_0' + \beta^T x} + n_2^* \big) \tag{5.14}$$

and $Q(x) = 1 - P(x)$.
The expression in (5.13) clearly displays a compound logistic distribution which
gives its name to this approach.
The function $L^*_{\text{mix}}$ may be maximised using one of the quasi-Newton procedures referred to in Section 3. A Fortran program has been written to do this and is available.

Note that $L^*_{\text{mix}}$ contains only $p + 2$ parameters, so the update facility has introduced only one extra parameter, a small cost for the additional power. It follows from the form of $n_s^*$ that if there are no sample points from $H_3$ ($n_3(x) = 0$ for all $x$), the functional form of $L^*_{\text{mix}}$ reduces to that of $L_c$ in (2.6).
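As a rough numerical illustration of this maximisation (not the Fortran program referred to above), the sketch below codes the logarithm of (5.13) for vector observations and hands it to a bounded quasi-Newton optimiser over $(\beta_0', \beta, \theta_1)$. The data, sample sizes and names are all invented for the example.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_Lmix_star(params, X1, X2, X3):
    """Negative log of (5.13).  Rows of X1, X2 are the classified sample points
    from H1 and H2; rows of X3 are the unclassified points from the mixture H3."""
    p = X1.shape[1]
    b0, beta, theta1 = params[0], params[1:p + 1], params[p + 1]
    theta2 = 1.0 - theta1
    n1, n2, n3 = len(X1), len(X2), len(X3)
    n1s, n2s = n1 + theta1 * n3, n2 + theta2 * n3          # n_s^* of (5.12)

    def P(X):                                              # P(x) of (5.14), stable form
        t = b0 + X @ beta + np.log(n1s) - np.log(n2s)
        return np.clip(1.0 / (1.0 + np.exp(-t)), 1e-12, 1 - 1e-12)

    ll = (np.sum(np.log(P(X1) / n1s))
          + np.sum(np.log((1.0 - P(X2)) / n2s))
          + np.sum(np.log(theta1 * P(X3) / n1s + theta2 * (1.0 - P(X3)) / n2s)))
    return -ll

# Hypothetical data: two classified samples and one mixed sample of uncertain origin.
rng = np.random.default_rng(1)
X1 = rng.normal(1.0, 1.0, size=(40, 2))
X2 = rng.normal(-1.0, 1.0, size=(60, 2))
X3 = np.vstack([rng.normal(1.0, 1.0, size=(30, 2)),
                rng.normal(-1.0, 1.0, size=(70, 2))])

p = X1.shape[1]
start = np.r_[np.zeros(p + 1), 0.5]                        # (beta_0', beta, theta_1)
fit = minimize(neg_log_Lmix_star, start, args=(X1, X2, X3), method='L-BFGS-B',
               bounds=[(None, None)] * (p + 1) + [(1e-4, 1 - 1e-4)])
print(fit.x)          # estimates of beta_0', beta and the mixing proportion theta_1
```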
Although the above results hold strictly for discrete random variables,
Anderson (1979) used arguments similar to those in Subsection 2.3 to justify their
use with continuous random variables or combinations of discrete and continuous
random variables in the separate sampling case. In the mixture or x-conditional
case, these results may be justified directly for continuous and/or discrete
random variables.
The above procedure for updating logistic discriminant functions, using information from sample points of uncertain provenance, gives logistic discrimination a considerable advantage over most of its rivals. For example, if multivariate normal distributions are assumed, the iterative procedures for discriminant updating involve $O(p^2)$ parameters. If no distributional assumptions are made, it is difficult to incorporate information from the mixed sample. Thus there is no extension of Fisher's linear discriminant function to cover this case. However, Murray and Titterington (1978) have recently extended the kernel method to provide an alternative to the method of logistic compounds in some circumstances.
The emphasis here has been on discriminant function updating, but the methods derived here can be used quite generally where the fundamental problem is to estimate the mixing proportions $\theta_s$. Given samples from the two distributions and the mixture, maximisation of $L^*_{\text{mix}}$ then gives estimates of $\theta_1$ and the logistic parameters. The logistic approach outlined here is particularly appropriate if there is only weak information available about the underlying likelihoods; it provides a partially distributional approach.

6. Logistic discrimination: Three or more groups

For simplicity all the results on logistic discriminant functions have been given so far in terms of two groups. There is no difficulty in extending at once all the previous results to discrimination between $k$ groups, $H_1, H_2, \ldots, H_k$. An outline of the methods is given here.
Denote the likelihood of the observations $x$ given $H_s$ by $L(x \mid H_s)$, $s = 1, \ldots, k$. The equivalent of the fundamental assumption (2.1) on the linearity of the log-likelihood ratio is

$$\ln \{ L(x \mid H_s) / L(x \mid H_k) \} = \beta_{0s}' + \beta_s^T x, \qquad s = 1, \ldots, k - 1, \tag{6.1}$$

where $\beta_s^T = (\beta_{1s}, \beta_{2s}, \ldots, \beta_{ps})$. Note that this implies that the log-likelihood ratio has this form for any pair of likelihoods. As in previous sections, the linearity in (6.1) is not necessarily in the basic variables; transforms of these may be taken. It has been shown in Section 2 that the assumptions embodied in (6.1) are likely to be satisfied by many families of distributions, including those often postulated in discrimination.
Arguing as in (2.2),

$$\Pr(H_s \mid x) = e^{z_s} \Big/ \sum_{t=1}^{k} e^{z_t}, \qquad s = 1, \ldots, k, \tag{6.2}$$

where

$$z_s = \beta_{0s}' + \ln K_s + \beta_s^T x, \qquad s = 1, \ldots, k - 1,$$

and $z_k = 0$. Further, $K_s = \pi_s / \pi_k$, where $\pi_s$ is the proportion of sample points from $H_s$ ($s = 1, \ldots, k$). An alternative expression for $z_s$ is

$$z_s = \beta_{0s} + \beta_s^T x, \qquad s = 1, \ldots, k, \tag{6.3}$$

where $\beta_{0s} = \beta_{0s}' + \ln K_s$.
Thus if the $\beta$'s are known or have been estimated, the decision about the allocation of a sample point $x$ requires little computing as it depends solely on the linear forms $z_s$, $s = 1, \ldots, k - 1$, in $x$.
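For instance, once estimates of the $\beta_{0s}$ and $\beta_s$ are available, allocation amounts to evaluating the $k - 1$ linear forms, appending $z_k = 0$, and picking the largest posterior probability (6.2). The short sketch below uses invented coefficient values purely for illustration.

```python
import numpy as np

def logistic_posteriors(x, beta0, B):
    """Posterior probabilities (6.2): z_s = beta0[s] + B[s] @ x for s = 1,...,k-1,
    z_k = 0, and Pr(H_s | x) proportional to exp(z_s)."""
    z = np.append(beta0 + B @ x, 0.0)        # the k linear forms, with z_k = 0
    z -= z.max()                             # guard against overflow in exp
    e = np.exp(z)
    return e / e.sum()

# Hypothetical fitted coefficients for k = 3 groups and p = 2 variables.
beta0 = np.array([0.4, -0.2])                # beta_{0s}, s = 1, 2
B = np.array([[1.5, -0.5],                   # beta_s^T as rows
              [-1.0, 0.8]])

x = np.array([0.3, 1.2])
post = logistic_posteriors(x, beta0, B)
print(post, "allocate to group", post.argmax() + 1)
```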
Following the notation in Section 2, suppose that $n_s(x)$ sample points are noted at $x$ from $H_s$ ($s = 1, \ldots, k$). Then under x-conditional sampling the likelihood is

$$L_c^{(k)} = \prod_x \prod_{s=1}^{k} \{ \Pr(H_s \mid x) \}^{n_s(x)} \tag{6.4}$$

or

$$L_c^{(k)} = \prod_x \prod_{s=1}^{k} \left\{ e^{z_s} \Big/ \sum_{t=1}^{k} e^{z_t} \right\}^{n_s(x)}, \tag{6.5}$$

substituting from (6.2). This displays $L_c^{(k)}$ as a function of the $\{\beta_{0s}\}$ and $\{\beta_s\}$ alone. Hence $L_c^{(k)}$ may be maximised to give maximum likelihood estimates of the $\beta$-parameters. Nothing further is required for discrimination, but note that if estimates of the $\{\beta_{0s}'\}$ are required, extra information about the $\{K_s\}$ or $\{\pi_s\}$ is required.
Arguing as in Subsection 2.2, it follows that the above procedure also yields the maximum likelihood estimates of the $\beta$-parameters under mixture sampling where the basic random variable is $(x, H)$ ($H = H_1, H_2, \ldots$ or $H_k$). Here $n_s / n$ gives an estimate of $\pi_s$, so $\beta_{0s}'$ is estimable without extra information ($s = 1, \ldots, k - 1$). Note $n_s = \sum_x n_s(x)$ and $n = \sum_{s=1}^{k} n_s$.
The separate sampling case is more complicated, but it can be proved (Anderson, 1972 and 1979) that for discrete random variables maximisation of $L_c^{(k)}$ again gives estimates of the $\{\beta_{0s}\}$ and $\{\beta_s\}$.

However, now

$$\beta_{0s} = \beta_{0s}' + \ln(n_s / n_k), \qquad s = 1, \ldots, k - 1. \tag{6.6}$$

Hence the $\beta_{0s}'$ are estimable here directly, but for discrimination $\beta_{0s}' + \ln(\pi_s / \pi_k)$ is required, as in (6.3), $s = 1, \ldots, k - 1$. Thus for discrimination the $\{\pi_s\}$ must be estimated separately. If some of the variables are continuous under separate sampling, it is suggested that the above approach of maximising $L_c^{(k)}$ is still valid. The full justification for this is still awaited but it is certainly approximately valid (Anderson, 1979).
The likelihood $L_c^{(k)}$ in (6.5) is maximised iteratively using a quasi-Newton or a Newton-Raphson procedure along the lines of Section 3. Anderson (1972, 1979) gives full details of this, and Fortran programmes are available.
Complete separation may occur with k groups, but again these data configura-
tions can be easily identified in the course of the iterative maximum likelihood
procedure. Hence, although the maximum of the likelihood is achieved at a point
at infinity in the parameter space, the situation is recognised and the iterations
stopped before time has been wasted. In this case the estimates of the parameters
are unreliable but good discriminant functions emerge (Anderson, 1972).
Zero marginal sample proportions cause the same difficulties with k groups as
with two. The ad hoc method suggested by Anderson (1974) may be used here
also.
The ideas of quadratic logistic discriminators can be applied immediately in the
k group case as discussed by Anderson (1975). Equally, compound logistic
methods may be used (i) to update discriminant functions for k groups using data
points of uncertain origin and (ii) to estimate the mixing proportions of k groups
(Anderson, 1979).
In short, there is no additional theoretical problem if the number of groups is
greater than two. Note, however, that some constraints on the dimensionality ($p$) of a problem may be introduced because of the number of parameters to be estimated. In the k group case, there are $(k - 1) \times (p + 1)$ parameters, and clearly
this number must be kept within the operational limits of the optimisation
procedure.

7. Discussion: Recent work

Logistic methods have been described here from the standpoint of their
application in discrimination. However, implicit in the assumptions (2.1) and (6.1)
are the models (2.2) and (6.2) for the conditional probability of H s given x. If the
(H s) are now thought of as representing levels of a variable of interest, say y, then
the methods discussed here may be used to investigate aspects of the relationship
between y and x. This is because (2.2) and (6.2) now model the conditional
distribution of y given x as logit regressions. These ideas have been used and
developed in various contexts including the estimation of and inference about

relative risks in epidemiology (Anderson, 1973; Prentice, 1976; Breslow and


Powers, 1978). Regression methods in life table analysis were discussed by Cox
(1972) taking an approach with some similarities to logistic discrimination.
Prentice and Breslow (1978) and Farewell (1979) developed the link and intro-
duced a conditional likelihood approach to logistic estimation appropriate where
there is a pairing or matching. This approach is not possible in the sampling
contexts considered here for computational reasons.
There is also a case (Mantel, 1973) for replacing standard normal variable
analyses like t-tests and F-tests by tests based on logistic regression. The argument is that the logistic tests are much more robust to departures from ideal
sampling conditions. For example, they are not affected by an unsuspected
truncation of the sample space.
Returning to the use of the logistic approach in discrimination, the emphasis
here has been on allocating points to groups which are qualitatively distinct. In
fact, in many contexts the groups may be defined quantitatively. For example we
may wish to identify a group of individuals of high intelligence on the basis of test
scores x. The two groups are 'high' and 'not high' intelligence. They are not
qualitatively distinct as the difference between them is one of degree. It is unlikely
a priori that x will have a multivariate normal distribution in both groups, nor
does it seem that any of the other standard discriminant models will be valid.
However, Albert and Anderson (1979) showed that probit or logit functions can
be used to model this situation and that if the logit approach is taken, then the
probability of H 1 given x has the same logistic form (2.2) as in Section 2 and the
method of Section 3 and programmes referred to there may be used to estimate
the $\beta$-parameters. It now seems that these ideas may be extended to discrimina-
tion between k ordered groups but this work is not yet complete.
There have been some papers comparing the various methods for discrimina-
tion, but most of these have concentrated on one set of data for their conclusions.
For example, Gardner and Barker (1975) compared seven techniques on a real
data set comprising nine binary variables but found few differences between the
different methods. This seems to characterise the difficulties: if a genuine context
is considered, then usually fairly good discrimination is possible and any reasona-
bly good technique will perform reasonably well. If on the other hand simulated
data are used, it will be found that the method most closely related to the
distributions being used will be 'best'. Logistic discrimination performs well and
has several distinct advantages:
(i) continuous and discrete variables can be handled with equal facility;
(ii) the partially distributional assumptions give tests and enable extensions to
updating, compounds, quadratic functions, etc., to be effected easily;
(iii) it is simple to use: a hand calculator or even paper and pen is all that is
required;
(iv) it is applicable over a very wide range of distributions;
(v) a relatively small number of parameters is required.
We recommend the use of logistic discrimination and related methods quite
generally.

References

Aitchison, J. and Aitken, C. G. G. (1976). Multivariate binary discrimination by the kernel method.
Biometrika 63, 413-20.
Aitchison, J. and Dunsmore, I. R. (1975). Statistical Prediction Analysis. Cambridge University Press,
Cambridge.
Albert, A. (1978). Quelques apports nouveaux à l'analyse discriminante. Ph.D. Thesis, Faculté des
Sciences, Université de Liège.
Albert, A. and Anderson, J. A. (1981). Probit and logistic discriminant functions. Comm. Statist.--
Theory Methods 10, 641-657.
Anderson, J. A. (1972). Separate sample logistic discrimination. Biometrika 59, 19-35.
Anderson, J. A. (1973). Logistic discrimination with medical applications. In: T. Cacoullos, ed.,
Discriminant Analysis and Applications, pp. 1-15. Academic Press, New York.
Anderson, J. A. (1974). Diagnosis by logistic discriminant function: Further practical problems and
results. Appl. Statist. 23, 397-404.
Anderson, J. A. (1975). Quadratic logistic discrimination. Biometrika 62, 149-54.
Anderson, J. A. (1979). Multivariate logistic compounds. Biometrika 66, 17-26.
Anderson, J. A. and Richardson, S. C. (1979). Logistic discrimination and bias correction in maximum
likelihood estimation. Technometrics 21, 71-8.
Anderson, T. W. (1958). An Introduction to Multivariate Statistical Analysis, p. 133. Wiley, New York.
Barnett, V. and Lewis, T. (1978). Outliers in Statistical Data. Wiley, Chichester.
Bartlett, M. S. (1947). Multivariate analysis. J. Roy. Statist. Soc. Suppl. 6, 169-73.
Breslow, N. and Powers, W. (1978). Are there two logistic regressions for retrospective studies?
Biometrics 34, 100-5.
Clayton, J. K., Anderson, J. A. and McNicol, G. P. (1976). Preoperative prediction of
postoperative deep vein thrombosis. British Med. J. 2, 910-2.
Cox, D. R. (1966). Some procedures associated with the logistic qualitative response curve. In: F. N.
David, ed., Research Papers in Statistics: Festschrift for J. Neyman, pp. 55-71. Wiley, New York.
Cox, D. R. (1972). Regression models and life tables (with discussion). J. Roy. Statist. Soc. Ser. B 34,
187-220.
Cox, D. R. and Hinkley, D. V. (1974). Theoretical Statistics, p. 309. Chapman and Hall, London.
Day, N. E. and Kerridge, D. F. (1967). A general maximum likelihood discriminant. Biometrics 23,
313-23.
Farewell, V. T. (1979). Some results on the estimation of logistic models based on retrospective data.
Biometrika 66, 27-32.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Ann. Eugen. Lond. 7,
179-88.
Gardner, M. J. and Barker, D. J. P. (1975). A case study in techniques of allocation. Biometrics 31,
931-42.
Gill, P. E. and Murray, W. (1972). Quasi-Newton methods for unconstrained optimisation. J. Inst.
Math. Appl. 9, 91-108.
Good, I. J. and Gaskins, R. A. (1971). Non-parametric roughness penalties for probability densities.
Biometrika 58, 255-77.
Habbema, J. D., Hermans, J. and van den Broek, K. (1974). A stepwise discriminant analysis program
using density estimation. In: G. Bruckmann, ed., Compstat 1974, pp. 101-110. Physica, Vienna.
Mantel, N. (1973). Synthetic retrospective studies and related topics. Biometrics 29, 479-86.
Murray, G. D. and Titterington, D. M. (1978). Estimation problems with data from a mixture. Appl.
Statist. 27, 325-34.
Prentice, R. (1976). Use of the logistic model in retrospective studies. Biometrics 32, 599-606.
Prentice, R. and Breslow, N. (1978). Retrospective studies and failure time models. Biometrika 65,
153-8.
Rao, C. R. (1965). Linear Statistical Inference and its Applications, p. 414. Wiley, New York.
Truett, J., Cornfield, J. and Kannel, W. (1967). A multivariate analysis of the risk of coronary heart
disease in Framingham. J. Chron. Dis. 20, 511-24.
Welch, B. L. (1939). Note on discriminant functions. Biometrika 31, 218-20.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 193-197

Nearest Neighbor Methods in Discrimination

L. Devroye and T. J. Wagner

In the discrimination problem one makes an observation $X = (X_1, \ldots, X_d)$ on some object whose state $\theta$ is known to be in some finite set which we may take to be $\{1, \ldots, M\}$. Assuming that the object is picked at random from some population, $(X, \theta)$ is a random vector with an arbitrary probability distribution. All that is assumed known about this distribution is that which can be inferred from a sample $(X_1, \theta_1), \ldots, (X_n, \theta_n)$ of size $n$ made from objects drawn from the same population used for $(X, \theta)$. This sample, called data, is assumed to be independent of $(X, \theta)$. Using $X$ and the data one makes an estimate $\hat{\theta}$ of $\theta$, where the procedure used for making this estimate is called a rule.
The rule which is the standard example for the class of rules considered in this article is the k-nearest neighbor rule. Here $\hat{\theta}$ is taken to be the state which occurs most frequently among the states of the $k$ closest measurements to $X$ from $X_1, \ldots, X_n$. To break ties in determining which of the vectors $X_1, \ldots, X_n$ is among the $k$ closest to $X$ and to break ties in determining which state occurs most frequently among these $k$ closest, the independent sequence $Z, Z_1, \ldots, Z_n$ of independent random variables, each with a uniform distribution on $[0, 1]$, is generated. We will think of $Z$ as being attached to $X$ and $Z_i$ as being attached to $X_i$, $1 \leq i \leq n$. Then $X_i$ is closer to $X$ than $X_j$ if
(a) $\|X - X_i\| < \|X - X_j\|$, or
(b) $\|X - X_i\| = \|X - X_j\|$ and $|Z - Z_i| < |Z - Z_j|$, or
(c) $\|X - X_i\| = \|X - X_j\|$, $|Z - Z_i| = |Z - Z_j|$ and $i < j$.
The $k$ closest vectors to $X$ from $X_1, \ldots, X_n$ are now determined and $\hat{\theta}$ is taken as the state occurring most frequently among these vectors. If several states occur most frequently among the $k$ closest, the state whose observation is closest to $X$ from among those tied is chosen. If $(X^j, \theta^j, Z^j)$ represents the $j$th closest observation to $X$, its corresponding state, and attached random variable, then we see that $\hat{\theta}$ for the k-nearest neighbor rule can be written as

$$\hat{\theta} = g\big( (X^1, Z^1, \theta^1), \ldots, (X^k, Z^k, \theta^k) \big) \tag{1}$$

for some function $g$. Rules which have the form

$$\hat{\theta} = g_n\big( (X^1, Z^1, \theta^1), \ldots, (X^n, Z^n, \theta^n) \big) \tag{2}$$

for some function $g_n$ are termed nearest neighbor rules, while rules which can be put in the form (1) for some $g$ are called k-local.
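A minimal sketch of the k-nearest neighbor rule as just described, with the attached uniform variables used for distance tie-breaking (the data are invented; the index comparison of case (c) is supplied by the final sort key):

```python
import numpy as np

def knn_rule(x, z, data_X, data_Z, data_states, k):
    """k-nearest neighbor estimate of the state of (x, z).
    Ties in distance are broken by |z - Z_i|, then by the index i."""
    dist = np.linalg.norm(data_X - x, axis=1)
    tie = np.abs(data_Z - z)
    order = np.lexsort((np.arange(len(dist)), tie, dist))   # (a), then (b), then (c)
    nearest_states = data_states[order[:k]]
    states, votes = np.unique(nearest_states, return_counts=True)
    best = states[votes == votes.max()]
    if len(best) == 1:
        return best[0]
    # Several states tie on votes: take the one whose observation is closest to x.
    for s in nearest_states:                                 # already in distance order
        if s in best:
            return s

# Hypothetical data: d = 2, M = 2 states, n = 8 observations.
rng = np.random.default_rng(2)
data_X = rng.normal(size=(8, 2))
data_Z = rng.uniform(size=8)
data_states = np.array([1, 1, 2, 2, 1, 2, 1, 2])
print(knn_rule(np.zeros(2), rng.uniform(), data_X, data_Z, data_states, k=3))
```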
The probability of error for a rule given the data and attached random variables is given by

$$L_n = P[\hat{\theta} \neq \theta \mid D_n]$$

where

$$D_n = \big( (X_1, \theta_1, Z_1), \ldots, (X_n, \theta_n, Z_n) \big).$$

The frequency interpretation of $L_n$ is that a large number of new observations, whose states are estimated with the rule and the given data, will produce a frequency of errors equal to the value of $L_n$. (Each of these new observations will have a new independent $Z$ attached to it but the $Z_1, \ldots, Z_n$ stay fixed with the data.) The random variable $L_n$ is important then because it measures the future performance of the rule with the given data.
Most of the results dealing with nearest neighbor rules are of the asymptotic variety, that is, results concerned with where $L_n$ converges to and how it converges as $n$ tends to infinity. If the limiting behavior of $L_n$ compares favorably to $L^*$, the Bayes probability of error (the smallest possible probability of error if one knew the distribution of $(X, \theta)$), then one has some hope that the rule will at least perform well with large amounts of data. For the k-nearest neighbor rule with fixed $k$ the first result of this type, and certainly the best known, is that of Cover and Hart (1967) who showed that

$$E L_n \to L \quad \text{as } n \to \infty \tag{3}$$

when $P[\theta = i \mid X = x]$ has an almost everywhere continuous version, $1 \leq i \leq M$. In (3), $L$ is a constant satisfying, for $k = 1$,

$$L^* \leq L \leq 2L^*(1 - L^*) \leq 2L^*. \tag{4}$$

For arbitrary $k$ the "2" in (4) is replaced by $a_k$ where $a_k \downarrow 1$ as $k \to \infty$. For these same assumptions it is also known that

$$L_n \to L \quad \text{in probability} \tag{5}$$

(Wagner, 1971) with convergence in (5) actually being with probability one for $k = 1$ (Fritz, 1975).
If $k$ is allowed to vary with $n$, then Stone (1977) showed that for any distribution of $(X, \theta)$

$$L_n \to L^* \quad \text{in probability} \tag{6}$$

if

$$k = k_n \to \infty \quad \text{and} \quad k_n / n \to 0.$$

This distribution-free result extends to a large class of nearest neighbor rules, which are also discussed by Stone and, because of its sheer technical achievement, rivals the original accomplishment of Fix and Hodges (1951) who introduced k-nearest neighbor rules and proved (6) in a slightly different setting with analytic assumptions on the distribution of $(X, \theta)$. We should note here that Stone breaks ties differently than described earlier. For example, if $k_n = 5$ and if six vectors, with the attached $Z$'s, have positions 4-9 in the distance ordering of $X_1, \ldots, X_n$ to $X$ and all have the same distance to $X$, then each of the states of these six vectors gets a $2/6 = 1/3$ 'vote' for the estimate $\hat{\theta}$. By contrast, in the first way of breaking ties two of these six vectors would get one vote each and the other four would get 0. Devroye (1981a) has recently shown that if one also assumes that

$$k_n / (\log n) \to \infty,$$

then (6) holds with the convergence being with probability one.
In view of Stone's result, it might be expected that the asymptotic results of the k-nearest neighbor rule with $k$ fixed are also distribution-free, that is, no conditions on the distribution of $(X, \theta)$ are needed for (5). In fact, using Stone's way of breaking ties, Devroye (1981b) has shown exactly that. Moreover, the constant $L$ for the general case, which is the same as Cover and Hart's for their assumptions on the distribution of $(X, \theta)$, continues to obey the inequality (4).
As intellectually satisfying as these results are, one is still faced with the finite sample situation. You have data $D_n$ and your immediate need is for a reliable estimate of $L_n$ for your chosen rule. You may even wish to examine the data and then pick the rule. In this case reliable estimates of $L_n$ for each rule may guide you in your choice. If one is using a local rule, then a natural estimate is the deleted estimate of $L_n$ given by

$$\hat{L}_n = (1/n) \sum_{i=1}^{n} I_{[\hat{\theta}_i \neq \theta_i]}$$

where $\hat{\theta}_i$ is the estimate of $\theta_i$ from $X_i$, $Z_i$, and $D_n$ with $(X_i, \theta_i, Z_i)$ deleted. This definition requires, of course, that $k \leq n - 1$. Deleted estimates are not easy to compute but, in cases like the k-nearest neighbor rule, the computation is reasonable and the intuitively appealing use of the data can be taken advantage of. Rogers and Wagner (1977) have shown that for all distributions of $(X, \theta)$ and any k-local rule

$$E(\hat{L}_n - L_n)^2 \leq \frac{2k + 1/4}{n} + \frac{2k(2k + 1/4)^{1/2}}{n^{3/2}} + \frac{k^2}{n^2}. \tag{7}$$

Using Chebychev's inequality and (7), distribution-free upper bounds for $P[|\hat{L}_n - L_n| \geq \varepsilon]$ can be obtained which are $O(1/n)$. In Devroye and Wagner (1979a) distribution-free upper bounds for $P[|L_n - \hat{L}_n| \geq \varepsilon]$ of the form $A e^{-nB}$ are also given where $A$ and $B$ are positive constants which depend only on $d$, $M$, and $\varepsilon$. In these bounds, however, the rate of decrease of $B$ to 0 with $d$ is quite rapid. In contrast, the right-hand side of (7) does not depend on $d$ at all. Finally, simulations carried out by Penrod and Wagner (1979) suggest that $2e^{-2n\varepsilon^2}$ is generally an upper bound for $P[|L_n - \hat{L}_n| \geq \varepsilon]$. Other estimates of $L_n$ are discussed in the references mentioned above.
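A sketch of the deleted estimate for the k-nearest neighbor rule is given below on invented data; for brevity it leaves distance ties to the sort order rather than using the attached uniform variables.

```python
import numpy as np

def deleted_estimate(data_X, data_states, k):
    """Leave-one-out (deleted) error estimate for the k-nearest neighbor rule."""
    n = len(data_X)
    errors = 0
    for i in range(n):
        dist = np.linalg.norm(data_X - data_X[i], axis=1)
        dist[i] = np.inf                               # delete (X_i, theta_i)
        nearest = data_states[np.argsort(dist)[:k]]
        states, votes = np.unique(nearest, return_counts=True)
        estimate = states[np.argmax(votes)]
        errors += (estimate != data_states[i])
    return errors / n

# Hypothetical two-class data in the plane.
rng = np.random.default_rng(3)
data_X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
                    rng.normal(2.0, 1.0, size=(50, 2))])
data_states = np.repeat([1, 2], 50)
print(deleted_estimate(data_X, data_states, k=3))
```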
If one considers just the single nearest neighbor rule for the finite sample case, two features stand out. The first is that one must store and search all of the data for each of the future estimates. The second point is that the nearest neighbor rule performance deteriorates from the Bayes rule (e.g., the rule used to achieve $L^*$ when the distribution of $(X, \theta)$ is known), because in the region of $R^d$ where $P[\theta = m \mid X = x]$ is maximal (which is where $\hat{\theta}(x) = m$ in the Bayes rule) all of the samples $X_i$ which fall there 'carve' out a subset where $\hat{\theta} = \theta_i$, regardless of whether $i = m$ or not. To reduce one or both of these effects, many authors have suggested condensing or editing the data before the nearest neighbor rule is applied (e.g., see Ritter et al. (1975) for recent references). There are no really general asymptotic results for condensing methods at this writing, but it seems clear that condensing, properly done, will definitely reduce computation for future estimates and improve performance. Devroye and Wagner (1979b) have also shown that if the original data is condensed in any way to $J$ points,

$$(Y_1, \sigma_1), \ldots, (Y_J, \sigma_J), \tag{8}$$

and if the single nearest neighbor rule is used with these $J$ points, then

$$P[|L_n - \hat{E}_n| \geq \varepsilon] \leq 4(4n)^{dJ(J-1)/2} e^{-n\varepsilon^2 / 8} \tag{9}$$

where $\hat{E}_n$ is the frequency of errors one gets on the original data with the single nearest neighbor rule now using (8) as data. The right-hand side of (9) is, of course, distribution-free, but requires that $J$ be small to be useful.

References

Cover, T. and Hart, P. (1967). Nearest neighbor pattern classification. IEEE Trans. Inform. Theory 13,
21-27.
Devroye, L. (1981a). On the almost everywhere convergence of nonparametric regression function
estimates. Ann. Statist. 9, 1310-1319.
Devroye, L. P. (1981b). On the inequality of Cover and Hart in nearest neighbor discrimination. IEEE
Trans. Pattern Analysis Machine Intelligence 3, 75-78.
Devroye, L. P. and Wagner, T. J. (1979a). Distribution-free inequalities for the deleted and holdout
error estimates. IEEE Trans. Inform. Theory 25, 202-207.

Devroye, L. P. and Wagner, T. J. (1979b). Distribution-free performance bounds with the resubstitution
error estimate. IEEE Trans. Inform. Theory 25, 208-210.
Fix, E. and Hodges, J. (1951). Discriminatory analysis: Nonparametric discrimination: consistency
properties. Rept. No. 4, USAF School of Aviation Medicine, Randolph Field, TX.
Fritz, J. (1975). Distribution-free exponential error bound for nearest neighbor pattern classification.
IEEE Trans. Inform. Theory 21, 552-557.
Penrod, C. S. and Wagner, T. J. (1979). Risk estimation for nonparametric discrimination and
estimation rules: A simulation study. IEEE Trans. Inform. Theory [to appear].
Ritter, G. L., Woodruff, H. B., Lowry, S. R., and Isenhour, T. L. (1975). An algorithm for a selective
nearest neighbor rule. IEEE Trans. Inform. Theory 21, 665-669.
Rogers, W. H. and Wagner, T. J. (1977). A finite sample distribution-free performance bound for local
discrimination rules. Ann. Statist. 6, 506-514.
Stone, C. J. (1977). Consistent nonparametric regression. Ann. Statist. 5, 595-645.
Wagner, T. J. (1971). Convergence of the nearest neighbor rule. IEEE Trans. Inform. Theory 17,
566-571.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 199-208

The Classification and Mixture Maximum


Likelihood Approaches to Cluster Analysis*

G. J. McLachlan

1. Introduction

A common and very old problem in statistics is the separation of a heteroge-


neous population into more homogeneous subpopulations. We concentrate here
on the situation where the population of interest, $\Pi$, is known or assumed to consist of, say, $k$ different subpopulations $\Pi_1, \ldots, \Pi_k$, and where the density of a p-dimensional observation $x$ from $\Pi_i$ is known or assumed to be $f_i(x; \theta)$ for some unknown vector of parameters, $\theta$ ($i = 1, \ldots, k$). In this context the problem may be formulated as follows: Given a random sample of observations $x_1, \ldots, x_n$ from $\Pi$, attempt to allocate each $x_j$ to the subpopulation to which it belongs. We let $\gamma^T = (\gamma_1, \ldots, \gamma_n)$ denote the set of identifying labels, where $\gamma_j = i$ if $x_j$ comes from $\Pi_i$. This would be the classical discrimination problem if $\gamma$ were known a priori; a discrimination procedure would be formed from the classified sample for the allocation of subsequent observations of unknown origin.
In what is sometimes called the classification maximum likelihood procedure, $\theta$ and $\gamma$ are chosen to maximize

$$L_C(x_1, \ldots, x_n; \theta, \gamma) = \prod_{j=1}^{n} f_{\gamma_j}(x_j; \theta). \tag{1.1}$$

The maximization is over the set of values of $\gamma$ corresponding to all possible assignments of the $x_j$ to the various subpopulations as well as over all admissible values of $\theta$. The estimates of $\theta$ and $\gamma$ so obtained are denoted by $\hat{\theta}$ and $\hat{\gamma}$ respectively. The $x_1, \ldots, x_n$ are then classified according to the estimates $\hat{\gamma}_1, \ldots, \hat{\gamma}_n$; for example, $x_j$ is assigned to $\Pi_g$ if $\hat{\gamma}_j = g$. This procedure has been considered by several authors including Hartley and Rao [14], John [17], Scott and Symons [31], and Sclove [30]. Unfortunately, with this procedure, the $\gamma_j$ increase in number with the number of observations, and under such conditions the maximum likelihood estimates need not be consistent. Marriott [23] pointed out that under the standard assumption of normal distributions with common variance matrices, this procedure gives definitely inconsistent estimates for the parameters involved.

*This work was completed while the author was on leave with the Department of Statistics at
Stanford University, and was supported in part by ONR contract N00014-76-C-0475.


More recently, Bryant and Williamson [4] extended Marriott's results and showed
that the method may be expected to give asymptotically biased results quite
generally.
A related approach is the mixture maximum likelihood method considered by
Day [5] and Wolfe [34], among many others. With this approach $x_1, \ldots, x_n$ are assumed to be a random sample of size $n$ from a mixture of $\Pi_1, \ldots, \Pi_k$ in the proportions $(e_1, \ldots, e_k) = e$. Hence the likelihood

$$L_M(x_1, \ldots, x_n; \theta, e) = \prod_{j=1}^{n} \left\{ \sum_{i=1}^{k} e_i f_i(x_j; \theta) \right\} \tag{1.2}$$

can be formed; the estimates of $\theta$ and $e$ obtained by maximizing (1.2) are denoted by $\hat{\theta}$ and $\hat{e}$ respectively. Each $x_j$ can then be classified on the basis of the estimated posterior probabilities $\hat{P}_{ij}$ ($i = 1, \ldots, k$) formed by replacing $\theta$ and $e$ with $\hat{\theta}$ and $\hat{e}$ in

$$P_{ij} = P\{ x_j \in \Pi_i \mid x_j \},$$

and $x_j$ is assigned to $\Pi_g$ if

$$\hat{P}_{gj} \geq \hat{P}_{ij}, \qquad i = 1, \ldots, k.$$

It can then be seen that the mixture approach is equivalent to the classification procedure with the additional assumption that $\gamma_1, \ldots, \gamma_n$ is an (unobservable) random sample from a probability distribution with mass $e_i$ at $i$ ($i = 1, \ldots, k$). It appears to avoid the asymptotic biases associated with the classification procedure, where at each step in the iterative process of computing the maximum likelihood estimates each $x_j$ is assigned outright to a particular subpopulation according to the estimate for $\gamma_j$. By contrast, the mixture approach does not insist on definite membership to any subpopulation; rather it gives an estimated probability of membership of each subpopulation.
Note that another approach to this problem is to proceed further and adopt a Bayesian procedure in which all parameters are random variables (see [2] and [32]).
A common assumption in practice is to adopt the normality model

$$x_j \sim N(\mu_i, \Sigma) \quad \text{in } \Pi_i \quad (i = 1, \ldots, k). \tag{1.3}$$

In this case $\theta$ has $\tfrac{1}{2} p(p + 2k + 1)$ elements, comprising the components of the $k$ mean vectors $\mu_i$ and the distinct elements of the common covariance matrix $\Sigma$, and the density $f_i(x; \theta)$ is given by

$$f(x; \mu_i, \Sigma) = (2\pi)^{-p/2} |\Sigma|^{-1/2} \exp\{ -\tfrac{1}{2}(x - \mu_i)' \Sigma^{-1}(x - \mu_i) \}.$$

We now proceed to consider the application of the classification and mixture approaches under the normality model (1.3), which is assumed to hold through to Section 5, where the condition of a common covariance matrix is relaxed to cover the general case of unequal covariance matrices.

2. Classification approach

In principle the maximization process for the classification maximum likeli-


hood procedure can be carried out since it is just a matter of computing the
maximum value of the likelihood (1.1) over all possible partitions of the n
observations to the k subpopulations. However, unless n is quite small, searching
over all possible partitions is prohibitive. It follows that $\hat{\gamma}_j = g$ if

$$f(x_j; \hat{\mu}_g, \hat{\Sigma}) \geq f(x_j; \hat{\mu}_i, \hat{\Sigma}), \qquad i = 1, \ldots, k, \tag{2.1}$$

where $\hat{\mu}_i$ and $\hat{\Sigma}$ are the ordinary maximum likelihood estimates of $\mu_i$ and $\Sigma$ for a sample of normal observations classified according to $\hat{\gamma}$. Hence the solution can be computed iteratively [17, 30]. Starting with some initial clustering $\gamma$, the $\mu_i$ and $\Sigma$ are estimated accordingly and then used to give a new estimate of $\gamma$ on the basis of (2.1), equivalent to allocating each observation to the nearest cluster centre in terms of the estimated Mahalanobis distance. Each step in the iterative process yields a value of the likelihood not less than that at the previous step, and the iterations may be continued until no observation changes clusters. Various starting values should be taken in an attempt to locate the global solution. It will be seen in the next section that the likelihood equations under the mixture approach can be easily modified to be applicable also under the classification approach. There are other procedures for finding the solution under the classification approach; for example, the Mahalanobis distance version of MacQueen's [20] k-means procedure, where the $\mu_i$ and $\Sigma$ are re-estimated after each observation is allocated rather than waiting until after all the observations have been allocated.
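A bare-bones sketch of the iterative scheme just described is given below, on invented data and from a single random starting partition (several starts would be used in practice); it is an illustration only, not any published program.

```python
import numpy as np

def classification_ml(X, k, gamma, max_iter=100):
    """Iterative classification ML under model (1.3): estimate the mu_i and the
    common Sigma from the current labels, then reallocate each observation to
    the nearest cluster centre in estimated Mahalanobis distance, as in (2.1)."""
    n, p = X.shape
    mus = np.zeros((k, p))
    for _ in range(max_iter):
        for i in range(k):
            if np.any(gamma == i):            # keep the old centre if a cluster empties
                mus[i] = X[gamma == i].mean(axis=0)
        R = X - mus[gamma]                    # within-cluster residuals
        Sigma = R.T @ R / n                   # pooled ML covariance estimate
        Sinv = np.linalg.inv(Sigma)
        diff = X[:, None, :] - mus[None, :, :]
        d2 = np.einsum('nkp,pq,nkq->nk', diff, Sinv, diff)
        new_gamma = d2.argmin(axis=1)
        if np.array_equal(new_gamma, gamma):
            break
        gamma = new_gamma
    return gamma, mus, Sigma

# Hypothetical data from two normal subpopulations with a common covariance.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal([0, 0], 1.0, size=(60, 2)),
               rng.normal([3, 1], 1.0, size=(40, 2))])
gamma0 = rng.integers(0, 2, size=len(X))      # random starting clustering
labels, centres, Sigma = classification_ml(X, 2, gamma0)
```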
For the classification approach applied under the normality model (1.3), Scott and Symons [31] showed that $\hat{\gamma}$ corresponds to the partition which minimizes the determinant of the pooled within-subpopulations sum of squares matrix

$$W = \sum_{i=1}^{k} W_i,$$

where

$$W_i = \sum_{q=1}^{n_i} (x_{iq} - \bar{x}_i)(x_{iq} - \bar{x}_i)'$$

and $x_{iq}$ ($q = 1, \ldots, n_i$) denote the $n_i$ observations assigned to $\Pi_i$ according to $\hat{\gamma}$ and $\bar{x}_i$ refers to their sample mean; see also the paper by Friedman and Rubin [9], who originally suggested this criterion. The minimization of $|W|$ would appear to be a reasonable clustering criterion regardless of the underlying distributions.
Marriott [22] has given a comprehensive account of the properties of this criterion. It does have the tendency to produce clusters of roughly equal size, although the modified version,

$$n \log |W| - 2 \sum_{i=1}^{k} n_i \log n_i,$$

suggested recently by Symons [32], would appear to go some way to overcome


this.

3. Mixture approach

An excellent account of the computation of the maximum likelihood estimates


of $\mu_i$, $\Sigma$, and $e$ for the mixture approach has been given by Day [5]. Under the normality model (1.3) the posterior probabilities $P_{ij}$ ($i = 1, \ldots, k$; $j = 1, \ldots, n$) have the form

$$P_{ij} = \exp(a_i' x_j + b_i) \Big/ \sum_{r=1}^{k} \exp(a_r' x_j + b_r),$$

where

$$a_r = \Sigma^{-1}(\mu_r - \mu_1)$$

and

$$b_r = \tfrac{1}{2}(\mu_1 + \mu_r)' \Sigma^{-1}(\mu_1 - \mu_r) + \log(e_r / e_1)$$

for $r = 1, \ldots, k$; that is, $a_1 = 0$ and $b_1 = 0$. The maximum likelihood estimates are evaluated from the equations

$$\hat{e}_i = \sum_{j=1}^{n} \hat{P}_{ij} / n, \tag{3.1}$$

$$\hat{\mu}_i = \sum_{j=1}^{n} \hat{P}_{ij} x_j / (n \hat{e}_i) \tag{3.2}$$

and

$$\hat{\Sigma} = \sum_{i=1}^{k} \sum_{j=1}^{n} (\hat{P}_{ij} / n)(x_j - \hat{\mu}_i)(x_j - \hat{\mu}_i)', \tag{3.3}$$

which can be solved iteratively by substituting some initial values for the
estimates into the right-hand side of (3.1)-(3.3) to produce new estimates on the
left-hand side, which are then substituted into the right-hand side, and so on.
These iterative estimates can be identified with those obtained by directly
applying the so-called EM algorithm of Dempster et al. [6], which shows that the
estimates will converge to a local maximum irrespective of the starting point. The

iterative process should be started from several points in an attempt to ensure


that the global maximum is obtained.
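The iteration of (3.1)-(3.3) is easy to code directly, as in the minimal sketch below (equivalently a run of the EM algorithm). The data and the random initial posterior matrix are invented for illustration; in practice several starting values would be tried, as noted above.

```python
import numpy as np

def mixture_ml(X, k, P, max_iter=200, tol=1e-8):
    """Iterate (3.1)-(3.3) for a k-component normal mixture with a common
    covariance matrix.  P is an initial n-by-k matrix of posterior probabilities."""
    n, p = X.shape
    for _ in range(max_iter):
        e = P.sum(axis=0) / n                                    # (3.1)
        mus = (P.T @ X) / (n * e)[:, None]                       # (3.2)
        diff = X[:, None, :] - mus[None, :, :]
        Sigma = np.einsum('nk,nkp,nkq->pq', P, diff, diff) / n   # (3.3)
        # Recompute the posterior probabilities P_ij under the new estimates.
        Sinv = np.linalg.inv(Sigma)
        d2 = np.einsum('nkp,pq,nkq->nk', diff, Sinv, diff)
        logw = np.log(e) - 0.5 * d2
        logw -= logw.max(axis=1, keepdims=True)
        newP = np.exp(logw)
        newP /= newP.sum(axis=1, keepdims=True)
        if np.max(np.abs(newP - P)) < tol:
            P = newP
            break
        P = newP
    return e, mus, Sigma, P

# Hypothetical data; the initial P comes from a random split.
rng = np.random.default_rng(5)
X = np.vstack([rng.normal([0, 0], 1.0, size=(70, 2)),
               rng.normal([2.5, 1.0], 1.0, size=(30, 2))])
P0 = rng.dirichlet([1, 1], size=len(X))
e, mus, Sigma, P = mixture_ml(X, 2, P0)
print(e)          # estimated mixing proportions; each x_j goes to argmax_i P_ij
```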
Day [5] has shown that considerable computing time can be saved for $k = 2$ by reparametrizing the likelihood in terms of $a$, $b$, $m$, and $V$, where

$$m = e_1 \mu_1 + e_2 \mu_2$$

and

$$V = \Sigma + e_1 e_2 (\mu_1 - \mu_2)(\mu_1 - \mu_2)'$$

are the mean and covariance matrix of the mixture distribution; $a$ and $b$ denote $a_2$ and $b_2$ with their subscripts suppressed since $k = 2$ only. The maximum likelihood equations now can be written as

$$\hat{m} = \sum_{j=1}^{n} x_j / n, \tag{3.4}$$

$$\hat{V} = \sum_{j=1}^{n} (x_j - \hat{m})(x_j - \hat{m})' / n, \tag{3.5}$$

$$\hat{a} = \hat{V}^{-1}(\hat{\mu}_2 - \hat{\mu}_1) \Big/ \big\{ 1 - \hat{e}_1 \hat{e}_2 (\hat{\mu}_1 - \hat{\mu}_2)' \hat{V}^{-1}(\hat{\mu}_1 - \hat{\mu}_2) \big\} \tag{3.6}$$

and

$$\hat{b} = -\tfrac{1}{2} \hat{a}'(\hat{\mu}_1 + \hat{\mu}_2) + \log(\hat{e}_2 / \hat{e}_1). \tag{3.7}$$

Only values of $\hat{a}$ and $\hat{b}$ are needed in solving the above equations, as $\hat{m}$ and $\hat{V}$ are given explicitly.
To obtain suitable initial values of $a$ and $b$, it is suggested that, for various bivariate subsets of the variables, the data points be plotted and a line drawn which divides the data into two groups whose scatter appears normal (see, for example, [28] and [12]). Estimates of $a$ and $b$ can be formed on the basis of this subdivision, proceeding as if the observations were correctly classified. There appears to be no difficulty in locating the global maximum for $p = 1$ and $p = 2$, but for $p \geq 3$ there are problems with multiple maxima, particularly for small values (less than two, say) of the Mahalanobis distance between $\Pi_1$ and $\Pi_2$,

$$\Delta = \big\{ (\mu_1 - \mu_2)' \Sigma^{-1}(\mu_1 - \mu_2) \big\}^{1/2},$$

when $n$ is not large [5]. Also, it is well known [5, 16] that maximum likelihood estimates based on a mixture of normal distributions are very poor unless $n$ is very large (for example, $n > 500$). However, Ganesalingam and McLachlan [11] found that although the maximum likelihood estimates $\hat{a}$ and $\hat{b}$ may not be very reliable for small $n$, it appears that the proportions in which the components of $\hat{a}$ and $\hat{b}$ occur are such that the resulting discriminant function, $\hat{a}'x + \hat{b}$, may still provide reasonable separation between the subpopulations.
Note that the same set of equations here can be used as follows to compute the estimates $\hat{\mu}_i$, $\hat{\Sigma}$, and $\hat{\gamma}$ under the classification approach. At a given step, $\hat{\gamma}_j$ is put equal to that $g$ for which $\hat{P}_{gj} \geq \hat{P}_{ij}$ ($i = 1, \ldots, k$) where, in the $\hat{P}_{ij}$, $b_r$ is used without the $\log(e_r / e_1)$ term. Then on the next step the $\hat{\mu}_i$ and $\hat{\Sigma}$ are computed from (3.1)-(3.3) in which, for each $j$, $\hat{P}_{ij}$ is replaced by 1 ($i = g$) and 0 ($i \neq g$). The transformed equations (3.4)-(3.7) for $k = 2$ are also applicable to the classification approach with the above modifications; that is, the term corresponding to $\hat{e}_i$ in (3.6) is given by $n_i / n$ ($i = 1, 2$), while there is no term corresponding to $\log(\hat{e}_2 / \hat{e}_1)$ in (3.7).
A simulation study undertaken by Ganesalingam and McLachlan [13] for k = 2
suggests that overall the mixture approach performs quite favourably relative to
the classification approach even where mixture sampling does not apply. The
apparent slight superiority of the latter approach for samples with subpopulations
represented in approximately equal numbers is more than offset by its inferior
performance for disparate representations.

4. Efficiency of the mixture approach

We consider now the efficiency of the mixture approach for k = 2 normal


subpopulations, contrasting the asymptotic theory with small sample results
available from simulation.
For a mixture of two univariate normal distributions Ganesalingam and
McLachlan [10] studied the asymptotic efficiency of the mixture approach relative to the classical discrimination procedure (appropriate for known $\gamma$) by considering the ratio

$$e = \{ E(R) - R_0 \} / \{ E(R_M) - R_0 \} \tag{4.1}$$

where $E(R_M)$ and $E(R)$ denote the unconditional error rate of the mixture and classical procedures respectively applied to an unclassified observation subsequent to the initial sample, and $R_0$ denotes their common limiting value as $n \to \infty$. The asymptotic relative efficiency was obtained by evaluating the numerator and denominator of (4.1) up to and including terms of order $1/n$. The multivariate analogue of this problem was considered independently by O'Neill [28]. By definition the asymptotic relative efficiency does not depend on $n$, and O'Neill [28] showed that it also does not depend on $p$ for equal prior probabilities, $e_1 = 0.5$. The asymptotic values of $e$ are displayed in Table 1 as percentages for selected combinations of $\Delta$, $e_1$, $p$, and $n$; the corresponding values of $e$ obtained from simulation are extracted from [11] and listed below in parentheses. It can be seen that the asymptotic relative efficiency does not give a reliable guide as to the true relative efficiency when $n$ is small, particularly for $\Delta = 1$. This is not surprising since the asymptotic theory of maximum likelihood for this problem requires $n$ to be very large before it applies [5, 16]. Further simulation studies by Ganesalingam and McLachlan [11] in the univariate case indicate that the asymptotic relative efficiency gives reliable predictions at least for $n \geq 100$ and $\Delta \geq 2$.
Table 1
Asymptotic versus simulation results for the relative efficiency of the mixture approach

        p = 1, n = 20            p = 2, n = 20            p = 3, n = 40
Δ       e1 = 0.25   e1 = 0.50    e1 = 0.25   e1 = 0.50    e1 = 0.25   e1 = 0.50
1       0.25        0.51         0.34        0.51         0.42        0.51
        (33.01)     (25.12)      (46.71)     (63.11)      (25.00)     (43.39)
2       7.29        10.08        9.36        10.08        10.51       10.08
        (22.05)     (17.74)      (25.73)     (16.26)      (16.28)     (14.51)
3       31.41       35.92        35.13       35.92        36.78       35.92
        (19.57)     (23.54)      (43.91)     (29.63)      (29.01)     (23.46)

The simulated values for the relative efficiency in Table 1 suggest that for the
mixture approach to perform comparably with the classical discrimination proce-
dure it needs to be based on about two to five times the number of initial
observations, depending on the combination of the parameters.

5. Unequal covariance matrices

For normal subpopulations $\Pi_i$ with unequal covariance matrices $\Sigma_i$, the


classification procedure has to be applied with the restriction that at least p + 1
observations belong to each subpopulation to avoid the degenerate case of infinite
likelihood.
The likelihood equations under the mixture approach are given by (3.1)-(3.3)
appropriately modified to allow for k different covariance matrices [34]. Unfor-
tunately, maximum likelihood estimation breaks down in practice, for each data
point gives rise to a singularity in the likelihood on the edge of the parameter
space. This problem has received a good deal of attention recently. For a mixture
of two univariate normal distributions, Kiefer [18] has shown that the likelihood equations have a root $\hat{\Phi}$ which is a consistent, asymptotically normal, and efficient estimator of $\Phi = (\theta', e')'$. Quandt and Ramsey [29] proposed the moment generating function (MGF) estimator obtained by minimizing

$$\sum_{j=1}^{h} \left\{ \frac{1}{n} \sum_{u=1}^{n} \exp(t_j x_u) - \phi(t_j) \right\}^2$$

for selected values $t_1, \ldots, t_h$ of $t$ in some small interval $(c, d)$, $c < 0 < d$, where

$$\phi(t) = \sum_{i=1}^{2} e_i \exp(\mu_i t + \tfrac{1}{2} \sigma_i^2 t^2)$$

is the MGF of a mixture of two normal distributions with variances $\sigma_1^2$ and $\sigma_2^2$. The usefulness of the MGF method would appear to be that it provides a consistent estimate which can be used as a starting value when applying the EM algorithm in an attempt to locate the root of the likelihood equations corresponding to the consistent, asymptotically efficient estimator. Bryant [3] suggests taking the classification maximum likelihood estimate of $\Phi$ as a starting value in the likelihood equations.
The robustness of the mixture approach based on normality as a clustering
procedure requires investigation. A recent case study by Hernandez-Alvi [15]
suggests that, at least in the case where the variables are in the form of
proportions, the mixture approach may be reasonably robust from a clustering
point of view of separating samples in the presence of multimodality.

6. Unknown number of subpopulations

Frequently with the application of clustering techniques there is the difficult


problem of deciding how many subpopulations, k, there are. A review of this
problem has been given by Everitt [8]; see also [7] and [19]. With respect to the
classification approach Marriott [21] has suggested taking k to be the number
which minimizes $k^2 |W|$. For heterogeneous covariance matrices there may be
some excessive subdivision, but this can be rectified by recombining any two
clusters which by themselves do not suggest separation was necessary.
With the mixture approach the likelihood ratio test is an obvious criterion for
choosing the number of subpopulations. However, for testing the hypothesis of,
say, $k_1$ versus $k_2$ subpopulations ($k_1 < k_2$), it has been noted [35] that some of the
regularity conditions are not satisfied for minus twice the log-likelihood ratio to
have under the null hypothesis an approximate chi-square distribution with
degrees of freedom equal to the difference in the number of parameters in the two
hypotheses. Wolfe [35] suggested using a chi-square distribution with twice the
difference in the number of parameters (not including the proportions), which
appears to be a reasonable approximation [15].
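As a hypothetical numerical illustration of Wolfe's suggestion, suppose the maximised log-likelihoods under $k_1 = 1$ and $k_2 = 2$ normal components (common covariance matrix, $p = 3$) have already been computed; the invented values and parameter count below are for illustration only.

```python
from scipy.stats import chi2

p = 3                                     # dimension of x (assumed for the example)
loglik_k1, loglik_k2 = -512.4, -497.8     # invented maximised log-likelihoods
lr = -2.0 * (loglik_k1 - loglik_k2)       # minus twice the log-likelihood ratio

# Extra parameters in the two-component model, not counting the proportions:
# one additional mean vector of length p (the common covariance adds nothing).
extra_params = p
df = 2 * extra_params                     # Wolfe's adjustment: twice the difference
print(lr, chi2.sf(lr, df))                # approximate p-value
```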

7. Partial classification of sample

We now consider the situation where the classification of some of the observa-
tions in the sample is initially known. This information can be easily incorporated
into the maximum likelihood procedures for the classification and mixture
approaches. If an $x_j$ is known to come from, say, $\Pi_r$, then under the former approach, $\gamma_j = r$ always in the associated iterative process while, under the latter, $\hat{P}_{ij}$ is set equal to 1 ($i = r$) and 0 ($i \neq r$) in all the iterations. In those situations
where there are sufficient data of known classification to form a reliable dis-
crimination rule, the unclassified data can be clustered simply according to this
rule and, for the classification approach, the results of McLachlan [24, 25] suggest
this may be preferable unless the unclassified data are in approximately the same
proportion from each subpopulation. With the mixture approach a more efficient
clustering of the unclassified observations should be obtained by simultaneously

using them in the estimation of the subpopulation parameters, at least as $n \to \infty$,


since the procedure is asymptotically efficient. The question of whether it is a
worthwhile exercise to update a discrimination rule on the basis of a limited
number of unclassified observations has been considered recently by McLachlan
and Ganesalingam [26]. For other work on the updating problem the reader is
referred to [1, 27] and [33].

References

[1] Anderson, J. A. (1979). Multivariate logistic compounds. Biometrika 66, 17-26.


[2] Binder, D. A. (1978). Bayesian cluster analysis. Biometrika 65, 31-38.
[3] Bryant, P. (1978). Contributions to the discussion of the paper by R. E. Quandt and J. B.
Ramsey. J. Amer. Statist. Assoc. 73, 748-749.
[4] Bryant, P. and Williamson, J. A. (1978). Asymptotic behavior of classification maximum
likelihood estimates. Biometrika 65, 273-281.
[5] Day, N. E. (1969). Estimating the components of a mixture of normal distributions. Biometrika
56, 463-474.
[6] Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete
data via the EM algorithm. J. Roy. Statist. Soc. Ser. B 39, 1-38.
[7] Engelman, L. and Hartigan, J. A. (1969). Percentage points of a test for clusters. J. Amer.
Statist. Assoc. 64, 1647-1648.
[8] Everitt, B. S. (1979). Unsolved problems in cluster analysis. Biometrics 35, 169-181.
[9] Friedman, H. P. and Rubin, J. (1967). On some invariant criterion for grouping. J. Amer. Statist.
Assoc. 62, 1159-1178.
[10] Ganesalingam, S. and McLachlan, G. J. (1978). The efficiency of a linear discriminant function
based on unclassified initial samples. Biometrika 65, 658-662.
[11] Ganesalingam, S. and McLachlan, G. J. (1979). Small sample results for a linear discriminant
function estimated from a mixture of normal populations. J. Statist. Comput. Simulation 9,
151-158.
[12] Ganesalingam, S. and McLachlan, G. J. (1979). A case study of two clustering methods based on
maximum likelihood. Statistica Neerlandica 33, 81-90.
[13] Ganesalingam, S. and McLachlan, G. J. (1980). A comparison of the mixture and classification
approaches to cluster analysis. Comm. Statist. A - Theory Methods 9, 923-933.
[14] Hartley, H. O. and Rao, J. N. K. (1968). Classification and estimation in analysis of variance
problems. Rev. Internat. Statist. Inst. 36, 141-147.
[15] Hernandez-Alvi, A. (1979). Problems in cluster analysis. Ph.D. thesis, University of Oxford,
Oxford [unpublished paper].
[16] Hosmer, D. W. (1973). On MLE of the parameters of a mixture of two normal distributions
when the sample size is small. Comm. Statist. 1, 217-227.
[17] John, S. (1970). On identifying the population of origin of each observation in a mixture of
observations from two normal populations. Technometrics 12, 553-563.
[18] Kiefer, N. (1978). Discrete parameter variation: efficient estimation of a switching regression
model. Econometrica 46, 427-434.
[19] Lee, K. L. (1979). Multivariate tests for clusters. J. Amer. Statist. Assoc. 74, 708-714.
[20] MacQueen, J. (1966). Some methods for classification and analysis of multivariate observations.
Proc. Fifth Berkeley Symp. Math. Statist. Probability 1, 281-297.
[21] Marriott, F. H. C. (1971). Practical problems in a method of cluster analysis. Biometrics 27,
501-514.
[22] Marriott, F. H. C. (1974). The Interpretation of Multiple Observations. Academic Press, London.
[23] Marriott, F. H. C. (1975). Separating mixtures of normal distributions. Biometrics 31, 767-769.

[24] McLachlan, G. J. (1975). Iterative reclassification procedure for constructing an asymptotically


optimal rule of allocation. J. Amer. Statist. Assoc. 70, 365-369.
[25] McLachlan, G. J. (1977). Estimating the linear discriminant function from initial samples
containing a small number of unclassified observations. J. Amer. Statist. Assoc. 72, 403-406.
[26] McLachlan, G. J. and Ganesalingam, S. (1980). Updating a discriminant function on the basis
of unclassified data. Tech. Rept. No. 47, Department of Statistics, Stanford University.
[27] Murray, G. D. and Titterington, D. M. (1978). Estimation problem with data from a mixture.
Appl. Statist. 27, 325-334.
[28] O'Neill, T. J. (1978). Normal discrimination with unclassified data. J. Amer. Statist. Assoc. 73,
821-826.
[29] Quandt, R. E. and Ramsey, J. B. (1978). Estimating mixtures of normal distributions and
switching regressions. J. Amer. Statist. Assoc. 73, 730-738.
[30] Sclove, S. L. (1977). Population mixture models and clustering algorithms. Comm. Statist.
A - - Theory Methods 6, 417-434.
[31] Scott, A. J. and Symons, M. L. (1971). Clustering methods based on likelihood ratio criteria.
Biometrics 27, 387-397.
[32] Symons, M. J. (1980). Clustering criteria for multivariate normal mixtures. Biometrics 37 [to
appear].
[33] Titterington, D. M. (1976). Updating a diagnostic system using unconfirmed cases. Appl. Statist.
25, 238-247.
[34] Wolfe, J. H. (1970). Pattern clustering by multivariate mixture analysis. Multivariate Behav. Res.
5, 329-350.
[35] Wolfe, J. H. (1971). A Monte-Carlo study of the sampling distribution of the likelihood ratio for
mixtures of multinormal distributions. Tech. Bullet. STB 72-2, Naval Personnel and Training
Research Laboratory, San Diego.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 209-244

Graphical Techniques for Multivariate Data and


for Clustering

John M. Chambers and Beat Kleiner

1. Graphics and multivariate analysis

Graphical displays are especially important in the analysis of multivariate data,


because they provide direct methods of studying the data and the results of
statistical analyses. Well-chosen displays will enhance the understanding of
multivariate data and can provide a partial antidote to the dangerous habit of
applying techniques of multivariate analysis in a careless and uncritical way.
The data for most multivariate statistical analyses may be regarded as a set of n
observations on p observable variables, which are assumed to be quantitative by
most analytical and graphical methods. The most basic graphical problem is the
display of such data. Section 2 of this article describes some useful techniques for
this purpose.
Our approach in this paper is to present the general features of several
techniques and to discuss a small number of these methods in some detail;
sufficient, we hope, to permit their application in practice. We consider the
techniques tre'ated in detail to be relatively simple and useful over a wide range of
applications. 'There are, however, no 'best' techniques for all applications, and
users should be aware of the limitations and drawbacks of each method. Other
useful techniques are discussed in less detail. Readers may wish to pursue some of
these for their own work. Finally, limitations of space prechide discussing every
worthwhile variant. We feel that a reasonably detailed understanding of some
basic graphical methods should help the practitioner develop variations or new
methods suitable for specific problems.
Many of the techniques for analyzing multivariate data may be regarded as
defining a set of new variables derived from the original variables. The derived
variables are intended to focus on certain properties of the original data, with the
hope that a relatively small number of derived variables will retain most of the
information in the original data. Principal component analysis, canonical analy-
sis, discriminant analysis, factor analysis and multidimensional scaling can be
regarded, at least partly, as having this purpose. The graphical techniques of
Section 2 can be used to look at the derived variables or at combinations

of original and derived variables. There are in addition some displays specific to
each of the analyses mentioned, but these will not be discussed here.
One technique of multivariate analysis which does generate important new
graphical displays is cluster analysis. Here a set of objects (which may be either
the observations or the variables) are assigned to clusters, such that objects within
each cluster are relatively close together or similar. Frequently, the clustering is
hierarchical; for example, by successively merging clusters. The need to under-
stand the process of clustering and its relation to the underlying data leads to a
set of graphical displays, discussed in Section 3.

EXAMPLES. All the graphical techniques illustrated in this article will be applied to the same set of data, consisting of the yearly yields ((dividend + price change)/price) for 15 transportation companies in the Dow Jones transportation
change)/price) for 15 transportation companies in the Dow Jones transportation
index, from 1953 to 1977. The data were obtained from the monthly yield tape of
a master file maintained by the Center for Research in Security Prices at the
University of Chicago, Graduate School of Business. It contained dividends and
monthly closing prices which were used to compute the yearly yields given in
Table 2 of Kleiner and Hartigan (1980). The 15 companies include 8 corpora-
tions with large interests in railroads: Canadian Pacific (CP), Chessie System
(CO), Missouri-Pacific Corp. (MIS), Norfolk & Western (NFW), St. Louis-San
Francisco Railway (FN), Seaboard Coastline (SCI), Southern Pacific (SX), and
Southern Railway (SR); three primarily domestic airlines: American Airlines
(AMR), Eastern Airlines (EAL) and United Airlines (UAL); three to a large
extent international airlines: Northwest Airlines (NWA), Pan American World
Airways (PN) and Trans World Airlines (TWA); and one conglomerate: Trans-
way (TNW). As happens very often, this data set has a considerably different
structure from a multivariate random sample. Nevertheless, multivariate analysis,
and particularly graphical methods, can be useful in studying the data. As seems
appropriate, either the companies or the years may play the role of observations
or variables. Some of the techniques, in addition, may utilize the sequential
(time-series) nature of the data.
A thorough case study of this data would involve many techniques not
presented here and would follow a different order of presentation, but we hope
the reader will agree, after looking at the displays shown, that graphical tech-
niques point out interesting features of the data which could otherwise go
undetected.

2. Displays for multivariate data

By comparison with data on one or two variables, multivariate data both
benefit greatly from graphical displays and also present significant difficulties and
pitfalls. The benefits accrue because there is often a great deal of information in
multivariate data, making inspection of a simple table of values a challenging
task. The essential problem, however, is that our most common plotting tech-
niques, such as scatter plots, time-series plots and histograms, are directly useful
only for one or two variables. Even the techniques for three-dimensional data
apply only to a minority of problems, since most interesting multivariate data sets
will have more than three variables.
One must keep in mind that we are trying to use a fundamentally two-dimen-
sional plot to represent data which have intrinsically more than two dimensions.
No single multivariate plot is likely to convey all the relevant information for any
nontrivial data set. For effective data analysis, one needs to try several techniques
and to integrate the graphical displays with model-fitting, data transformation
and other data-analytic techniques. In addition the suitability of specific methods
for a given set of data depends on the number of observations, number of
variables and other properties of the data.
It is useful to group most multivariate plotting methods into two classes:
- extensions of scatter plots and
- symbolic plots.
In the first class the actual plots are two-variable scatter plots. A set of these is
generated to represent higher-dimensional data, directly or through derived vari-
ables. In the second class the data values are not used as coordinates in scatter
plots but as parameters to select or draw graphical symbols. The two classes are
not mutually exclusive. Symbols may usefully enhance scatter plots in some cases.

2.1. Multiple scatter plots


We first consider the use of scatter plots. The simplest procedure is to make all
or some of the possible scatter plots of pairs of the original variables. For data
with a moderate number of variables (say not much more than 10) pairwise
scatter plots are an attractive method. The plots are easy to describe and to
generate. They relate directly to the original data and do not require explanation
of any intermediate transformations. Drawbacks of the method are that it is not
easy to infer patterns from the plots for more than two variables and that the
number of plots becomes impractically large for a large number of variables.
For inspection of pairwise plots, adjacent axes should display the same varia-
ble. An easy way to do this for p variables is to use a (p-1) by (p-1) array of
plots in which, for example, the (i, j) plot could be of variable i+1 against
variable j. Only p(p-1)/2 of the plots are needed (say, the lower triangle). Since
the number of plots increases roughly with the square of the number of variables,
individual plots will soon become small relative to the overall display. For much
more than 10 variables, the method becomes of questionable value.
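As an illustration only (no such computation appears in the original chapter), the lower-triangle arrangement just described might be sketched as follows in Python; a small synthetic matrix and hypothetical column labels stand in for the yield data.

    # Sketch of a lower-triangle array of pairwise scatter plots.
    # The data and labels below are synthetic stand-ins.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    X = rng.normal(size=(25, 5))           # 25 "years" by 5 "companies"
    names = ["A", "B", "C", "D", "E"]      # hypothetical column labels
    p = X.shape[1]

    fig, axes = plt.subplots(p - 1, p - 1, figsize=(8, 8))
    for i in range(1, p):                  # variable i on the vertical axis
        for j in range(p - 1):             # variable j on the horizontal axis
            ax = axes[i - 1, j]
            if j < i:                      # only p(p-1)/2 plots are needed
                ax.scatter(X[:, j], X[:, i], s=8)
                if i == p - 1:
                    ax.set_xlabel(names[j])
                if j == 0:
                    ax.set_ylabel(names[i])
            else:
                ax.set_visible(False)
            ax.set_xticks([])
            ax.set_yticks([])
    plt.show()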
In the data chosen for our examples, pairwise scatter plots of the 15 companies
required 105 separate plots, with 25 points on each. Plotting all these by hand is
unattractive, but with reasonable computing and graphical displays it can be
done. The problem is in scanning such a large set of plots. With a high-quality
display device, it is possible to draw all the plots on one page or else to make
several displays and combine them for inspection (Fig. 1). An overall impression
can be gained from this rather crowded plot.
Fig. 1. Pairwise scatter plots.



There is general evidence of a positive relationship for most pairs of companies, along with some outlying years which
deviate from the general pattern. Some relationships are substantially stronger
than others, e.g., CP versus UAL is very weak.
In addition to such general criteria, one should also be ready to use information
about the specific data at hand in selecting graphical displays. For our example,
the set of 15 companies naturally formed four subsets: railroads, domestic and
international airlines and the conglomerate. Picking one company from each
subset gives us a display requiring only 6 scatter plots (Fig. 2). We see somewhat
more detail in the relationships; for example, we can see that two years have
unusually high returns for TWA, accounting for part of the departure from the
positive relation.
To apply scatter plots for larger numbers of variables, one may select either a
subset of variables or a subset of variable pairs, generating a more reasonable
number of plots. Essentially any variable-selection technique could be used,
depending on the goals of the analysis (e.g., subsets defined by regressing one
important variable on all others). Pairs can be selected by looking, say, at
properties of their joint distribution, such as comparing ordinary pairwise correla-
tions with robust correlations, and then looking at the pairs of variables where the
two correlations differ substantially. Conversely, one may look at all scatter plots
within one subset. Fig. 3 shows the scatter plots for the railroads.
If one is willing to sacrifice the direct interpretability of the original variables,
plots may be made for derived variables. Any of the techniques of multivariate
analysis (Gnanadesikan, 1977; Kruskal and Wish, 1978) could be used to derive a
smaller set of variables to plot. Examples are:
(i) principal component analysis;
(ii) factor analysis;
(iii) multidimensional scaling;
(iv) canonical analysis;
(v) discriminant analysis.
The first three methods define derived variables intended to represent as much
as possible of the internal variation in the data. Note that multidimensional
scaling may also be used when the data is originally a set of similarities among the
n observations. For canonical analysis, the original variables are divided into
subsets, and derived variables are found within each subset that are highly
correlated. Discriminant analysis takes a partitioning of the observations into
groups, and looks for derived variables that are good discriminators among the
groups, i.e. predict well some contrast among the groups.
In general, the result of the analysis is some set of new variables of which we
use the first k to represent the original variables. Graphical presentation may be
more effective if we can choose k << p. One is tempted, of course, to choose k = 2.
Unless the data support this, the temptation should be resisted.
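As a sketch only (and not part of the original analysis), the first two principal components might be derived and plotted as follows; a synthetic matrix stands in for the 25 by 15 matrix of returns, and the singular value decomposition is one standard way to obtain the components.

    # Sketch: first two principal components of a data matrix
    # (rows = years, columns = companies); synthetic stand-in data.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    X = rng.normal(size=(25, 15))

    Xc = X - X.mean(axis=0)                    # centre each column
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:2].T                     # one point per year
    print(Vt[:2].round(2))                     # loadings, analogous to Table 1

    plt.scatter(scores[:, 0], scores[:, 1])
    plt.xlabel("First principal component")
    plt.ylabel("Second principal component")
    plt.show()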
Fig. 4 shows the pairwise scatter plot of the first two principal components of
the transportation data. Table 1 shows the linear combinations of the
companies defining the first and second principal components. The first principal
component gives positive weight to all companies, and may be interpreted as a
measure of overall behavior, with relatively higher weights for the airlines
reflecting their greater variability.

Fig. 2. Scatter plots of one company from each group.



Fig. 3. Scatter plots of the railroads.



Fig. 4. Scatter plot of the first two principal components.



Table 1
       TNW    FN    CP    CO   NFW   MIS    SX    SR   SCI
1st   0.24  0.20  0.18  0.13  0.10  0.11  0.11  0.17  0.16
2nd  -0.53 -0.24 -0.13 -0.07 -0.08 -0.18 -0.13 -0.23 -0.23

       EAL   AMR   UAL   TWA   NWA    PN
1st   0.25  0.31  0.26  0.44  0.46  0.36
2nd   0.07  0.13  0.17  0.57 -0.23  0.21

The plot shows that 1954, 1958, 1963 and 1971
all have been exceptionally good years. The second principal component contrasts
TNW, NWA and the railroads to the other airlines but seems harder to interpret.
In any event, if a small set of derived variables is a good representation of the
original variables, plots of this set may be helpful. Drawbacks are the difficulty of
explaining the analysis leading to the plots and, more fundamentally, the danger
that the transformation has obscured rather than enhanced some essential infor-
mation, reducing the usefulness of these plots as diagnostic aids.

2.2. Symbolic plots


The general idea of symbolic plots is very simple, although there are a
bewildering number of different methods. A set of plotting symbols is chosen,
indexed by one or more parameters, such that varying the values of a parameter
causes the appearance of the symbol to vary in an easily visible way. A set of data
values is mapped into parameter values and the resulting symbols plotted. If the
method is well designed, we hope to infer properties of the data from looking at
the symbols.
Consider two common examples: a set of characters with varying amounts of
ink (grey level) and a line of varying length. In the first case, there will be some
fixed, finite set of symbols; for example, three levels might be given by the
characters ".", "+" and a heavier overstruck character. Each data value x_ij must
be mapped into one of the three symbols. Naturally, the implication is that "."
represents a smaller data value than "+" and both represent smaller values than
the overstruck character. If variables are not in the same units, one should usually
scale each variable separately, often by mapping the range [min_i x_ij, max_i x_ij]
into [0, 1]. Then values in the range [0, 1/3] are shown as ".", and so forth. Notice
that outlying values can destroy the
usefulness of the plot; one should use a robust estimate of the range and, perhaps,
mark outlying values specially.
When some or all of the variables are in the same units, it may be more
informative to scale all the data together using a range for all data values. This is
the case in our example.

The procedure is essentially the same for continuously varying symbols, except
that there is no need to use only a discrete set of values. One must decide what
range of symbol values is wanted; for example, whether the shortest line should
be of zero length. Notice that line length gives more graphical impact to large
values than to small ones. Such biases are a significant danger with many
symbolic plots. They can be somewhat relieved by plotting several times with
changes in the variables (e.g., -x_ij instead of x_ij).
The two most common approaches to choosing symbols are to have either a
one-parameter symbol correspond to each data value or a p-parameter symbol
correspond to each observation. The former method, which we call a symbolic
matrix, usually amounts to taking the printed form of the data matrix and
replacing each entry by a symbol. One may choose any set of symbols, but usually
the choice should be made so that the symbols are obviously ordered. For
example, printed characters, possibly overstruck, will give a varying grey scale.
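A symbolic matrix is simple to sketch; the fragment below is illustrative only, uses a synthetic data matrix with hypothetical row labels, and scales all values together since they are assumed to be in the same units.

    # Sketch of a printed symbolic matrix: each value is replaced by one of
    # a few ordered characters after scaling all data to [0, 1].
    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.normal(size=(15, 25))                 # synthetic: companies by years
    labels = [f"C{i:02d}" for i in range(15)]     # hypothetical row labels

    symbols = ".", "+", "#", "$"                  # four ordered grey levels
    lo, hi = X.min(), X.max()                     # a robust range may be preferable
    levels = np.clip(((X - lo) / (hi - lo) * len(symbols)).astype(int),
                     0, len(symbols) - 1)

    for name, row in zip(labels, levels):
        print(name, " ".join(symbols[k] for k in row))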
Fig. 5 shows a symbolic matrix plot of the transportation data, using a
four-level set of symbols. We have ordered the companies, using an ordering
implied by a clustering algorithm (to be described in Subsection 3.1). Four
symbols are used, from "." to an overstruck "$". The representation
of the data is rather crude. However, one can see that certain years are particu-
larly high (1958) or low (1953) and, with closer scrutiny, that there is some
grouping as expected (for example, airlines did relatively better than railroads in
the mid 1960's and worse in 1972-1974). The symbolic matrix is not really
competitive with other methods, however, until the size of the data matrix is
substantially larger than our example.
Advantages of the symbolic matrix are its simplicity and compactness. Data
matrices of quite large size can be represented legibly by this method. The
technique has been exploited extensively by Bertin (1967), particularly through a
mechanical display which allows the user to permute rows or columns of the

TNW ° + + + . $ + , + ° + + + ° + + ° . + o ° . + + +
FN , ++ , . # ° . + + + + + , + + ° ++ ° ° ° +# +
SCI . # . . . + . . . + + . + . + + . + + . . . . + +
CP ° + + + . + ° • + • + + + . + + • ° + + + , ° + °
CO . + + + . + + . . • + + + • + + • + + • + . + + •
NFW . + + + . + + . + + + + . • • + • • + • + • + + •
MIS + + + + . + . + + . . + + . + + . . + + + • + + +
SX . + + . . # + . + + + + + . + + . + + . . . + + .
SR . # + + . # . . + + + . +. + + . + + + + . + + .
EAL . + + . . + . . + . + + # . + . . + + . . . + # .
AMR . # + . . + + . + . # + + + . + . . + . . . + + .
UAL . + + + . + + . + • + + # + + . • . # • • • # • .
TWA . $ . . . + + . . . $ + + + . . . . ~ ° . . + + .
N~A . $ . . . ~ . . # + # + # # . . . . + . . . # + .
PN • # . + • # + • + • # + + + • + • + + • • • 9 • •

Fig. 5. Symbolic matrix.

Disadvantages of the method are its relatively coarse
parameter values and perhaps the difficulty of perceiving the overall relation
among observations.
Five of the methods which associate a symbol with each observation are the
following.
- Profiles, which represent each observation by p vertical bars for p variables, each
bar having height proportional to the corresponding variable. The profile refers to
the top of the bars; sometimes the profile is shown as a connected line.
- Stars or polygons, which represent each variable as a value along equally spaced
radii from a common center. The points on the radii are usually connected in a
polygon.
- Faces, which represent each variable by features of a cartoon face (Chernoff,
1973). Such features as the shape of the face, the curve of the mouth, the position
and shape of the eyes can be used as parameters.
- Curve plots (Andrews, 1972), where each observation is mapped into a curve
which is a linear combination, defined by (x_i1, ..., x_ip), of a set of basis curves
(usually trigonometric); a short sketch follows this list.
- Trees, which represent each variable as the length of a branch of a tree whose
structure is determined by applying a hierarchical clustering algorithm to the
variables (Kleiner and Hartigan, 1980).
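The fragment below sketches curve plots of this kind; it is not the authors' program, uses a synthetic data matrix, and adopts the trigonometric basis 1/sqrt(2), sin t, cos t, sin 2t, cos 2t, ... commonly associated with Andrews (1972).

    # Sketch of superimposed curve plots: each observation (row of X) defines
    # a linear combination of trigonometric basis curves.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(3)
    X = rng.normal(size=(15, 25))                 # synthetic stand-in data

    t = np.linspace(-np.pi, np.pi, 200)
    p = X.shape[1]
    basis = [np.full_like(t, 1 / np.sqrt(2))]
    k = 1
    while len(basis) < p:                         # sin t, cos t, sin 2t, cos 2t, ...
        basis.append(np.sin(k * t))
        if len(basis) < p:
            basis.append(np.cos(k * t))
        k += 1
    B = np.vstack(basis)                          # p basis curves evaluated on t

    for x in X:                                   # one curve per observation
        plt.plot(t, x @ B, linewidth=0.7)
    plt.show()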
Having defined the symbols for each observation one is at liberty to plot them
in any suitable arrangement. For all but the curve plots the usual practice is to
plot the symbols separately, say in an array on the page. Curve plots are usually
superimposed.
Among the methods mentioned, no single method is entirely best, but we feel
that star or polygon plots combine a reasonably distinctive appearance with
computational simplicity and ease of interpretation. Profiles are not so easy to
compare as a general shape. Faces are memorable but they are more complex to
draw and one must be careful in assigning variables to parameters and in
choosing parameter ranges. Curve plots are effective when p is large but become
cluttered when there are many observations. Trees provide additional informa-
tion, using a non-subjective clustering of the variables, and are vivid symbols, but
require considerable initial computation. Faces and curves to some extent disguise
the data in the sense that individual data values may not be directly comparable
from the plot.
Polygon plots are simple to construct and to interpret. Given p variables each
symbol will have p radii, usually spanning either a full or a half circle. The angle
between the horizontal and the jth radius is

    θ_j = 2π(j-1)/p          for a full circle,
    θ_j = π(j-1)/(p-1)       for a half circle,

for j = 1, ..., p. The full circle is the more compact form and tends to give more
distinct symbols.

The ith polygon consists of p points, the jth point lying a distance along the jth
radius proportional to the data value x_ij. A simple technique is to arrange for the
n values of each variable to be scaled to the range [0, 1], to put the center of the
symbol at the plotting origin and to let the maximum radius be 1. Then the point
corresponding to x_ij has plotting coordinates

    P_ij = (x_ij cos θ_j, x_ij sin θ_j).
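These coordinates are easy to compute; the sketch below (illustrative only) draws one full-circle polygon symbol per observation for a small synthetic matrix already scaled to [0, 1].

    # Sketch of star/polygon symbols: the jth vertex of symbol i lies at
    # distance x_ij along the jth radius, at angle theta_j = 2*pi*(j-1)/p.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(4)
    X = rng.random(size=(6, 8))                   # 6 observations, p = 8 variables, in [0, 1]
    n, p = X.shape
    theta = 2 * np.pi * np.arange(p) / p

    fig, axes = plt.subplots(1, n, figsize=(2 * n, 2), subplot_kw={"aspect": "equal"})
    for ax, x in zip(axes, X):
        px, py = x * np.cos(theta), x * np.sin(theta)
        ax.plot(np.append(px, px[0]), np.append(py, py[0]))   # closed polygon
        for a, b in zip(px, py):
            ax.plot([0, a], [0, b], linewidth=0.5)            # radii to each vertex
        ax.set_xlim(-1, 1)
        ax.set_ylim(-1, 1)
        ax.set_xticks([])
        ax.set_yticks([])
    plt.show()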

Notice that by normalizing the variables we destroy the information about
location and scale of each variable. This is a problem common to all symbolic
plots. If, as in our example, the variables are comparable, one may choose to scale
all the data in the same way. Since all variables are represented similarly in star
plots, one can then compare the magnitude of different variables (for the faces
plot, in contrast, this would not be straightforward). The overall loss of location
information remains a problem.
The star or polygon may be plotted in a number of different ways. One choice,
used in the example below, is to draw radii from each P_ij to the center and to join
(P_i1, P_i2, ..., P_ip, P_i1) by connected lines. There are many possible variants, either
deleting part of the symbol to reduce clutter or adding additional visual clues.
When plotting by hand, one should produce first a symbol with all the radii
drawn heavily to full length. Each of the individual symbols can then be traced
over this, first drawing radii of length x_ij and then connecting the ends of the
radii.
Fig. 6 shows symbolic plots of the data in our example, using polygon plots,
faces and trees. One symbol is drawn for each company, treating the years as
variables. In all cases we maintain a consistent scale over all years, since we have
in effect repeated observations of the same variable rather than a set of distinct
variables. While this reduces the visual range of an individual year, it enables us
to make comparisons across years.
Some points of detail are worth noting. Plotting a half-circle for the polygon
might be in principle more attractive, since then the last year would not be
adjacent to the first. In practice the resulting symbols show less contrast in shape
and visual discrimination among companies was less effective. In Fig. 6a, the
plots have years going counterclockwise from the east. Each quarter-circle repre-
sents 6.25 years, so that north, for example comes just after 1958, and south just
before 1972. Fig. 6b shows a single polygon with its vertices labelled.
The algorithm used for Fig. 6c has only 15 parameters to describe the face. We
have chosen to use data from 1958 through 1972 and have in one case rearranged
the data so that a noteworthy feature (the smile) corresponds to a year (1963)
with interesting contrasts among companies. 1958 and 1959 are coded by area
and shape of the face, 1960 by the length of the nose, 1961-63 by location, width
and curvature of the mouth, 1964-68 by location, separation, angle, shape and
width of the eyes, 1969 by the location of the pupils and 1970-72 by location, angle and width of the eyebrows.

Fig. 6a. Polygon plots.



Fig. 6b. Labelled polygon plot.

In this example, the faces display the grouping
of companies less clearly than some of the other methods.
Tree plots (Figs. 6d and 6e) are constructed by first hierarchically clustering the
years. The resulting dendrogram (see Subsection 3.1 for further discussion of the
dendrogram) serves as the template for the tree plots in the sense that the tree
symbols all have the same topology as the dendrogram and the angles between
'branches' of the symbols are a function of the distances at which the correspond-
ing subclusters were merged. Each symbol gets its individual shape by making the
length of each outermost limb proportional to the size of the variable it represents
and setting the length of each inner limb equal to the average of all outermost
limbs whose path to the bottom of the tree passes through this inner limb. For
more details see Kleiner and Hartigan (1980). Figs. 6d and 6e show that the
extremely good years 1954, 58, 63 and 71 are grouped on the lower left, the good
years 1961, 64-66 and 75-76 on the lower right and the extremely disappointing
years 1957, 69 and 73-74 just above the extremely good years. Fig. 6e clearly
indicates that the airlines (especially the international ones) do extremely well in
good years and on closer inspection, extremely poorly in the bad years.
The plots in Fig. 6 are generally able to contrast companies, showing that the
shapes of the symbols representing companies within a group (e.g., the railroads)
are more similar than for companies from different groups. The polygon plots
and the trees especially, also point out that the groups differ in variability as well.
The airlines, particularly the three international airlines, have much greater
year-to-year variability. The features seen here were also visible in the symbolic
matrix although with less detail available.

Fig. 6c. Faces.



Fig. 6d. The results of hierarchically clustering the years and a labelled tree plot.

2.3. Summary
All the methods described here can be useful, but it is worth repeating that
none gives a completely adequate picture of the data. Scatter plots are natural to
interpret and have fewer problems of scaling or of distorting the range of data
values, but they can only be used indirectly for a large number of variables and
integrating the pairwise plots is difficult. Symbolic plots are overly dependent on
several arbitrary choices; for example, the mapping of data values onto the
parameter interval and the ordering of both variables and observations. Some
(like faces) also treat the different variables in a highly unsymmetric way. Some
(particularly curve plots) are hard to use when the number of observations is
large.
These problems point to the need for care in interpreting the plots. Several
different methods should be used to get a good look at a difficult data set. Other
things being equal, simple methods, such as pairwise scatter plots, the symbolic
matrix and polygon plots, are easier to use and to explain and offer fewer hidden
distortions. The previously mentioned drawbacks in some of these methods
should be kept in mind, however.




Fig. 6e. Tree plots.

3. Plots for clustering

The statistical technique of cluster analysis tries to partition a set of objects
into subsets such that pairs of objects within a subset are relatively similar to one
another, by comparison to objects from different subsets. By object we may mean
either the observations or the variables in a set of data.
Hierarchical clustering methods produce a sequence or hierarchy of clusterings.
Each clustering can be regarded as the result of merging two subsets from a
previous clustering, beginning with each object in a cluster by itself and ending
with one cluster containing all the objects. (One can also view this as a splitting
process, in the opposite direction.)
By contrast nonhierarchical clustering produces a single partition of the objects.
Naturally, each step of the hierarchical clustering defines a nonhierarchical
clustering, but typically without a rule for stopping at a 'best' clustering. In fact
the popularity of hierarchical methods is a combination of the computational
difficulty of choosing a 'best' partition and the advantage to the user of studying
a number of possible partitions.
We consider the graphical techniques for hierarchical and nonhierarchical
clustering separately. The latter can be applied to any partition derived from the
former.
For simplicity, we will always speak of measures of dissimilarity, d_ij, between
objects i and j. The entire discussion, however, could equally well be phrased in
terms of similarities. Speaking of objects being 'close' for example, implies small
dissimilarity or large similarity. Also, when specific notation is required, we will
tend to write formulas as if the objects were the n observations, although only
interchanging subscripts is needed to talk about clustering variables.
If we start with a set of multivariate data, the most common measure of
dissimilarity between observations would be some measure of the distance be-
tween two observations, regarded as points in p-dimensional space. Conversely, to
cluster variables one may begin with a measure of correlation as indicating the
similarity of variables. A third class of clustering applications may arise when
measures of similarity or distance arise as the original data.

3.1. Hierarchical clustering: The cluster tree


The fundamental description of hierarchical clustering of n objects consists of
(n - 1) steps of successively merging clusters. At step j, the algorithm merges two
clusters, each of which may be one of the original objects or a cluster merged at
some previous step k < j. The choice of which clusters to merge is usually based
on a measure of the dissimilarity of a pair of clusters (for example, the complete
linkage or compact method of clustering defines this as the maximum dissimilarity
of all pairs of objects in the separate clusters). The cluster process can then be
defined, for each step, by the two clusters merged and their dissimilarity.
The cluster tree or dendrogram is a graphical representation of this structure.
Fig. 7 shows an example: the cluster tree describing a hierarchical clustering of
the 15 transportation companies (using the compact method with Euclidean
distances). The original objects are represented by (arbitrary) positions along the
x-axis. Each cluster, as it is formed, is represented by a vertical position equal to
the dissimilarity of the two clusters when joined and a horizontal position which is
the average of the horizontal positions of the clusters merged. One now joins these
positions to show the tree structure. Fig. 7 shows one version of the cluster tree.
The vertical position for each of the original objects has been taken to be zero.
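For readers using current software, the compact (complete linkage) clustering and its dendrogram can be sketched as follows; this is not the program used in the original study, and a synthetic matrix with hypothetical company codes stands in for the 15 by 25 matrix of returns.

    # Sketch: complete linkage (compact) clustering on Euclidean distances,
    # displayed as a dendrogram; synthetic stand-in data.
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import dendrogram, linkage

    rng = np.random.default_rng(5)
    X = rng.normal(size=(15, 25))
    labels = [f"C{i:02d}" for i in range(15)]     # hypothetical company codes

    Z = linkage(X, method="complete", metric="euclidean")
    dendrogram(Z, labels=labels)                  # merge heights = dissimilarities
    plt.ylabel("Dissimilarity")
    plt.show()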

Notice that some reordering of the original objects is necessary to avoid lines
crossing in the cluster tree. The ordering is not unique; at each node one could
flip the right and left subtree. There are various mechanisms for choosing a
unique ordering of the objects. We shall assume this to be done by the clustering
algorithm and will not discuss it further.
A cluster tree plot allows one to gain a number of insights into the clustering.
One may look for subsets which are clearly defined by the clustering. These are
indicated by clusters which join together at a relatively low distance. In Fig. 7 all
the railroads form such a cluster. On the other hand, the grouping of TWA, NWA
and PN (the international airlines) forms at a larger distance (3.0 opposed to 1.4),
indicating that their cluster is not as tight. The cluster of the domestic airlines
(EAL, AMR and UAL) is nearly as tight as the railroads. The conglomerate,
TNW, is merged with the combined railroad/domestic airline cluster, indicating
that it shares some characteristics of each, based on the data.
By 'cutting' the tree at any level of dissimilarity, one obtains a partition of the
objects. In the example of Fig. 7, we could obtain a partition into 6 subsets by
cutting at level 2.0. This defines the clusters of railroads and domestic airlines;
TNW and the three international airlines form four one-object clusters. On
substantive grounds one would prefer to group TWA, NWA and PN into one
cluster, giving four clusters (even though this does not strictly represent cutting
the tree).
Many variations of the basic cluster plot exist. One may put the objects some
fixed distance below the value for the cluster they join (as in Fig. 8), rather than at 0.
Fig. 7. Dendrogram of complete linkage on Euclidean distances (basic form).

Fig. 8. Dendrogram with vertical lines not drawn all the way to 0.

Fig. 9. Dendrogram with nodes connected by straight lines.



Fig. 10. Dendrogram with ideas from Figs. 8 and 9 combined.

This makes the members of a cluster easier to see, but obscures distances
slightly. Instead of the horizontal and vertical lines marking each step, one can
join the successive plotting positions by a single, oblique line (Fig. 9) or, better,
combine this with the previous modification (Fig. 10) to somewhat reduce the
chance of lines touching. For modest-sized problems, the original method seems
the clearest. For larger problems, however, it may be difficult to associate objects
with long vertical lines, in which case the variation of Fig. 10 may be preferable.
If the cluster tree is produced on a line printer or printing terminal, rather than
on a graphic device, a number of further variations may be imposed. Because of
the limited resolution, the successive lines do not represent the dissimilarities but
simply the merging step. The dissimilarity may be printed beside the tree (Fig.
11). This still obscures the graphical evidence of cluster tightness (compare Fig.
7).
The printed tree may be further reduced in size, simply by omitting some of the
lower merges. Fig. 12 shows this variation, in which only the mergers from steps 7
through 14 of Fig. 11 are shown. This display, called a squashed tree by Gross
(1975), takes advantage of the tendency of people to be mainly interested in
obtaining a small number of clusters; there is little point in generating, say, more
than n / 2 subsets of n objects. The squashed tree plots also may print two disjoint
clusters side by side, rather than indicating the merging times consecutively for
the entire tree. Another variation is the icicle plot (Kruskal and Landwehr, 1980),
which uses the object labels to fill in vertically, starting from the step at which the
object first joins a cluster. Fig. 13 shows an icicle plot of the transportation data.
Notice that the company codes are repeated within each icicle.

Fig. 11. Dendrogram (printer version).

Fig. 12. Squashed tree.



Fig. 13. Icicle plot.

Identifying individual objects in a cluster is easier in this version, particularly for
substantially larger examples than ours, because the object labels are repeated in the
columns. Also, scanning across a line shows immediately which objects are linked
in clusters (other than singletons) at this stage. The price for this information is
some degree of visual clutter.
Sneath and Sokal (1973, Subsection 5.9) describe two methods for representing
hierarchical clusters which are not variations of the dendrogram: the linkage
diagrams and the Wroclaw diagram. The linkage diagrams consist of a series of
successive graphs (in the graph theoretical sense) for different levels of merging.
Their main drawback is that a fairly large number of graphs is needed to get an
understanding of the relationships between a given set of objects. In the Wroclaw
diagram the levels of the different mergers are shown by contour lines which
correspond to the contours on a topographical map. Each contour is drawn
around all objects which have been merged at a level below that of the contour.

3.2. Plotting distances
The concept of distance is natural and central to many applications of
clustering, because distances (or dissimilarities) between objects are inputs to
many clustering algorithms. In this subsection we will discuss several ways of
using plots of distances to assess appropriateness, tightness and separations of
clusters and to gain insight into how individual objects in a cluster differ from the
'average' cluster behavior. We will discuss three sets of distances:
(i) Distances between pairs of individual objects (Subsection 3.2.1),
(ii) Distances between pairs of cluster centroids (Subsection 3.2.2),
(iii) Distances between cluster centroids and individual objects (Subsection
3.2.3).

3.2.1. Distances between individual objects


A rough but simple way of assessing the effectiveness of a given cluster
configuration is the shaded distance matrix. If the objects have been rearranged so
that objects in the same cluster are adjacent to each other, distances within a
cluster will be small for tight and well-separated clusters, while distances between
objects in different clusters will be large. If one codes increasing distances by
decreasing gray levels, one should therefore get a series of dark triangles under
each tight and well-separated cluster, while clusters due to randomness would not
exhibit this behavior.
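A sketch of such a display, illustrative only, is given below; it uses synthetic data and one possible cluster labelling, obtained by cutting a complete-linkage tree into four groups.

    # Sketch of a shaded distance matrix: objects in the same cluster are made
    # adjacent, and larger distances are drawn lighter.
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import pdist, squareform

    rng = np.random.default_rng(6)
    X = rng.normal(size=(15, 25))                      # synthetic stand-in data

    D = squareform(pdist(X))                           # Euclidean distance matrix
    clusters = fcluster(linkage(X, method="complete"), t=4, criterion="maxclust")
    order = np.argsort(clusters, kind="stable")        # cluster members adjacent

    plt.imshow(D[np.ix_(order, order)], cmap="gray")   # small distances dark
    plt.colorbar(label="Euclidean distance")
    plt.show()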
Fig. 14 shows a shaded distance matrix of the 15 transportation companies.
The companies have been sorted according to the results of the compact
hierarchical clustering algorithm used in Figs. 7-11 and four clusters are shown:
Transway, all 8 railroads, 3 domestic airlines, 3 international airlines. Five different
shadings have been used, each representing 20% of the distances. From Fig. 14 it
is clear that the data can be clustered successfully, that the cluster of the railroads
is very tight, that the domestic airlines also form a reasonably tight cluster, but
that the international airlines do not form a good cluster.

Fig. 14. Shaded representation of the Euclidean distance matrix between companies.

Cohen et al. (1977) describe a method for plotting distances between objects,
which can be used to identify the presence and composition of clusters without
going through any clustering algorithm. The distances between the objects are
plotted in groups which consist of the distances between each object and its
nearest neighbor, the distances between each object and its second nearest
neighbor, and so on.
In the diagram they plot

    d_i(j) vs. median_i d_i(j),    i = 1, 2, ..., n;  j = 1, 2, ..., n-1,

where n is the number of objects and d_i(j) is the jth smallest value among
d_i1, d_i2, ..., d_i,i-1, d_i,i+1, ..., d_in. Thus the first column of the plot (j = 1) displays
the empirical distribution of the n nearest neighbor distances against their
median. The second column shows the n second nearest neighbor distances, and
so on. A diagram of this type, together with output identifying i and j for each
point, can be helpful in detecting certain types of clusters and outliers.
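A sketch of this display, with a synthetic stand-in for the data, might look as follows (it is not the original implementation).

    # Sketch of the nearest-neighbour distance diagram: column j holds the
    # jth nearest neighbour distances d_i(j), plotted against their median.
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.spatial.distance import pdist, squareform

    rng = np.random.default_rng(7)
    X = rng.normal(size=(15, 25))                 # synthetic stand-in data
    n = X.shape[0]

    D = squareform(pdist(X))
    np.fill_diagonal(D, np.inf)                   # ignore each object's own distance
    sorted_d = np.sort(D, axis=1)[:, : n - 1]     # d_i(1), ..., d_i(n-1)
    medians = np.median(sorted_d, axis=0)

    for j in range(n - 1):
        plt.scatter(np.full(n, medians[j]), sorted_d[:, j], s=10)
    plt.xlabel("Median of column")
    plt.ylabel("Distance to jth nearest neighbour")
    plt.show()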
The distances between the 15 transportation companies are displayed in Fig.
15. They show a clearly separated group in the lower left hand corner (all data
points with y < 1.5). All distances in this group except one (AMR-UAL) are
among the 8 railroads, indicating that they form a relatively tight and well-sep-
arated cluster. Furthermore, the two largest distances in all but the last column
involve two of the international air carriers (TWA and NWA); the largest four
distances in all but the last two columns include all three international air carriers
TWA, NWA and PN. This suggests that the international airlines cluster consists
of three objects, which are quite distant among themselves but even farther from
all other objects.

3.2.2. Distances between cluster centroids


Given a set of clusters, their relative positions and sizes and the distances
between them are of great interest, and a large number of ways have been
proposed to plot the inter-cluster distances.
If one is dealing with more than two variables, it is usually not possible to
position all cluster centers on a sheet of paper without distorting some of the
distances between them. The true distances can be approximated in one of the
following two ways:
- Make scatter plots of the cluster centers in a new coordinate system chosen to be
optimal in some sense. Such coordinate systems can be found, for example, by
using principal components, discriminant coordinates or multidimensional scal-
ing.
- Try to plot some of the distances accurately and put less emphasis (and
accuracy) on others. Carmichael and Sneath (1969) have, for instance, proposed to
plot what they call taxometric maps by making the straight lines between close
clusters conform to the true distances while distances between farther away
clusters are represented by other devices (such as wiggly lines whose total lengths
are equal to the true distances).

Fig. 15. Diagram of ith closest neighbor distances vs. their median. The circles enclosing a star denote TWA and NWA, the circles enclosing a triangle denote PN.

They also plot the 'diameters' of the clusters in
order to compare them to the distances between the clusters.
Taxometric maps contain large amounts of information but are rather cumber-
some to construct and seem hard to interpret. A similar, but much simpler,
diagram is described by Fowlkes et al. (1976). They begin by cutting the
dendrogram at a dissimilarity level which will create a desired number of
(hopefully well separated) clusters. The resulting clusters are represented by
circles with diameters equal to the diameters of the clusters. The interpretation of
the circle diameter depends on the algorithm and metric used; for the compact
method on Euclidean distances, for instance, the diameter is equal to the
maximum distance between any two objects within the cluster. Finally the circles
are connected by horizontal and vertical lines whose lengths are equal to the
distances between the corresponding clusters.

Fig. 16. Diameters and distances of clusters obtained by cutting the dendrogram in Fig. 7 at level 2.0.

Fig. 16 shows the plot resulting from cutting the dendrogram in Fig. 7 at level
2.0. Each circle is labeled by the objects it contains. There are 6 clusters, 4 of
which consist of only one object each. The distance between the centers of the two
circles is 2.07, i.e. the distance between the clusters represented by these two
circles. The distance between TNW and the middle of the line connecting the
two circle centers is 2.43, i.e. the distance between TNW and the union of the two
circles. This plot shows two rather tight clusters which are reasonably close
together but are quite far away from the other objects. Note that, besides the
diameters of the circles, only the distances along the vertical and horizontal lines
matter.

Therefore the fact that NWA lies very close to the circles does not mean
that it is close to them in the metric used in the clustering algorithm.
Another method of displaying inter-cluster distances is a plot of the distances
between a first cluster and all remaining clusters in the first column, the distances
between a second cluster and the remaining ones in the second column, and so
on. This often reveals isolated clusters and clusters so close to each other that the
question arises whether they should be separate clusters at all.
Fig. 17 shows distances between the cluster centroids of cluster 1 (consisting of
TNW), cluster 2 (containing the 8 railroads), cluster 3 (domestic airlines) and
cluster 4 (international air carriers). The plot is not very illuminating; it shows
that clusters 1 and 4 are somewhat farther apart from all other clusters than are
clusters 2 and 3.

Fig. 17. Distances between cluster centroids (in fractions of returns).



3.2.3. Plotting distances between cluster centers and individual objects


Probably the most useful diagrams involving distances are plots of the dis-
tances between cluster centers and individual objects as suggested by Dunn and
Landwehr (1980). For each cluster they plot the distances between its center and
all objects, indicating which objects belong to the cluster. If the objects in the
cluster are close to the cluster center, the cluster can be regarded as tight; if there
is a substantial gap between the farthest object in the cluster and the closest one
outside, the cluster is well separated. This kind of plot also allows the detection of
borderline objects; if, e.g., an object is quite far removed from the center of an
otherwise relatively tight cluster but is closer to the center of another cluster, one
might consider this object misclassified and reassign it to another cluster.
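One way to produce such a display is sketched below, again on synthetic data and with cluster memberships taken from a complete-linkage tree cut into four groups; it is not the Dunn and Landwehr program.

    # Sketch: for each cluster, plot the distances from its centroid to every
    # object, with the cluster's own members marked differently.
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import fcluster, linkage

    rng = np.random.default_rng(8)
    X = rng.normal(size=(15, 25))                      # synthetic stand-in data
    clusters = fcluster(linkage(X, method="complete"), t=4, criterion="maxclust")

    for k in np.unique(clusters):
        centroid = X[clusters == k].mean(axis=0)
        dist = np.linalg.norm(X - centroid, axis=1)    # centroid-to-object distances
        member = clusters == k
        plt.scatter(np.full(member.sum(), k), dist[member], marker="o")
        plt.scatter(np.full((~member).sum(), k + 0.15), dist[~member], marker="x")
    plt.xlabel("Cluster number")
    plt.ylabel("Distance to cluster centroid")
    plt.show()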
Fig. 18 shows the distances between the cluster centers of the 4 clusters
discussed in Subsection 3.2.2 and all individual objects.

Fig. 18. Distances between cluster centroids and individual objects (in fractions of returns).

The objects belonging to a particular cluster are plotted directly above the corresponding cluster number
while the other objects are somewhat set off to the right. We can see that cluster 1
(consisting of only one object) is very well separated, that clusters 2 and 3 are
tight and reasonably isolated (cluster 3 somewhat less so than cluster 2), while
cluster 4 has one outside object which is closer to its center than any of its own
objects.
Due to the relatively small size of this data set, all possible distances could be
reasonably plotted on the same page. This might not be the case for larger data
sets; there one usually only plots the objects closest to the cluster centers.

Fig. 19. Sets of fractional returns vs. cluster number for the years 1954-57.

Often it is also advantageous to use different scales for different clusters and to code each
object simply by the cluster to which it belongs.

3.3. Looking at individual objects and variables

Plots displaying the behavior of a given set of clusters and their objects for each
individual variable are described next, followed by descriptions of the behavior of
individual clusters compared to the overall center, or of objects compared to their
cluster center with respect to all variables.
The former include separate diagrams for each variable where the values of
each object are plotted against the number of the cluster containing that object
(Fig. 19), showing the location of the different clusters for each variable and the
spread within each cluster. Such a diagram also pinpoints outlying values with respect to a
single variable. A variation of this is to first subtract the respective cluster center
from each object before plotting. This allows comparing spreads within clusters
and detecting outliers more easily, but loses the information about cluster levels.
Fig. 19 shows the returns of the 15 transportation companies for the years
1954-57. Generally the median of all values in a cluster is denoted by a star; here this has only been necessary for cluster 2.
Fig. 20. Deviations of object SR from the center of cluster 2 for all years 1953-77 (in fractions of returns).

A dotted line is drawn at level 0. The
returns have been very good in 1954; highest for the international carriers in
cluster 4, followed by cluster 3, followed by cluster 2. In 1955-57 the reverse is
true, and the returns grow steadily worse until in 1957 none of the 15 companies
has a positive return. Note that cluster 1 does better than the other clusters in
1955-57.
In order to see how individual cluster centers relate to an overall center, Dunn
and Landwehr (1980) have suggested plotting the differences between a cluster
center and the overall center for each variable; to see the relations between an
object and its cluster center they plot the differences between that object and its
cluster center.
Fig. 20 shows a plot of this type for the object SR (a railroad). It shows the
deviation from the median of cluster 2 for each of the 25 variables from bottom to
top. Interesting here is the very cyclical behavior of the deviations, a phenomenon
which has also been observed for other railroads.
If the variables are measured in different units, it is advisable to standardize the
differences by dividing by a measure of scale within the cluster.

3.4. Sensitivity analyses
Given a tentative set of clusters it is important to be able to assess the
reliability of these clusters.
The idea of dropping variables is a natural and appealing tool to use in this
context. One drops one (or small groups of) variables at a time from the study,
reapplies the clustering algorithm and checks which clusters are still intact and
which ones are not. This not only shows the effect of minor changes in the data
on the resulting clusters but also enables the user to assess the effects of including
or excluding certain variables (Gnanadesikan et al., 1977).
Fig. 21 shows overall distances between the center of cluster 2 and the other
cluster centers when no variables are left out (denoted 5-year group 0), when the
first 5 years (1953-57) are left out (denoted 5-year group 1), the second group of
5 consecutive years is left out (5-year group 2), and so on. The distances have to
be normalized to be comparable. The square root of the number of variables
involved seems to be a reasonable choice for the normalizing constant, because
under the assumption of i.i.d. normal variables the squared distances will follow
a χ² distribution with the number of degrees of freedom equal to the number of
variables.
Fig. 21 suggests that cluster 2 might be somewhat better separated if the first
five years were excluded and might be less well separated from cluster 1 if the
second five years (1958-62) were left out.
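A sketch in the spirit of Fig. 21 is given below; it is illustrative only, uses synthetic data, keeps the cluster memberships fixed at those found from the full set of variables (one of several reasonable choices), and normalizes each distance by the square root of the number of variables used.

    # Sketch of the drop-a-group sensitivity check: normalized distances from
    # the centroid of one cluster to the other centroids, recomputed with
    # successive 5-variable blocks omitted.
    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    rng = np.random.default_rng(9)
    X = rng.normal(size=(15, 25))                  # synthetic: 15 objects, 25 yearly variables
    clusters = fcluster(linkage(X, method="complete"), t=4, criterion="maxclust")

    def normalized_centroid_distances(data, ref=2):
        ck = data[clusters == ref].mean(axis=0)
        return {int(m): round(np.linalg.norm(ck - data[clusters == m].mean(axis=0))
                              / np.sqrt(data.shape[1]), 3)
                for m in np.unique(clusters) if m != ref}

    print("all variables:", normalized_centroid_distances(X))
    for g in range(5):                             # omit 5-year groups 1, ..., 5 in turn
        keep = np.ones(X.shape[1], dtype=bool)
        keep[5 * g: 5 * (g + 1)] = False
        print(f"group {g + 1} omitted:", normalized_centroid_distances(X[:, keep]))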
Fig. 22 shows the normalized distances between the centroid of cluster 3
(EAL, AMR, UAL) and all other objects with no years left out (group 0) and
consecutive chunks of 5 years left out. As in Fig. 18, objects not in cluster 3 have
been offset somewhat to the right. The plot suggests that dropping the second 5
years would decrease tightness and separation of cluster 3 and that cluster 3
would be somewhat tighter, but not more separated, if the last 5 years were not
included.

Fig. 21. Normalized distances between the centroid of cluster 2 and the other cluster centers when 0, the first 5, the second 5 etc. years are left out.

An enhancement of the ordinary dendrogram designed to give the viewer more
information about each merger is due to Rohlf (1970). He not only denotes each
merger by a horizontal line such as in Fig. 7 but for each merger also plots the
distances between all possible pairs of objects between the two clusters merged.
Therefore at the merging of MIS with {SX, SR} one would plot the distances
MIS-SX and MIS-SR; at the merging of the 8 railroads with the 3 domestic
airlines one would plot the distances between all pairs with exactly one railroad and one domestic airline.

Fig. 22. Normalized distances between the centroid of cluster 3 and all objects when 0, the first 5, the second 5 years etc. are left out.

These distance plots will give some indication about the
validity of the level of a particular merge.

4. Summary and conclusions

In this article we have described a large number of graphical tools to be used in
conjunction with the analysis of multivariate data. It is very unlikely that any user
will use all the graphical methods described here in any given analysis; at the
other extreme no single picture is ever going to tell the whole story. Since
analyzing multivariate data should be an iterative process, with the results from
one stage determining the steps of the following stage, the user should at each
stage choose the techniques which serve the current needs.
As in all applications of graphical methods, the display of multivariate data can
serve both to suggest patterns and models for the data, and to check the adequacy
of such models. Both for insights into the data and for a check against reckless
modeling, plots are essential.

References

Andrews, D. F. (1972). Plots of high-dimensional data. Biometrics 28, 125-136.


Bertin, J. (1967). Sémiologie Graphique. Gauthier-Villars, Paris.
Carmichael, J. W. and Sneath, P. H. A. (1969). Taxometric maps. Systematic Zool. 18, 402-415.
Chernoff, H. (1973). The use of faces to represent points in k-dimensional space graphically. J. Amer.
Statist. Assoc. 68, 361-368.
Cohen, A., Gnanadesikan, R., Kettenring, J. R. and Landwehr, J. M. (1977). Methodological
developments in some applications of clustering. In: P. R. Krishnaiah, ed., Applications of Statistics,
141-162. North-Holland, Amsterdam.
Dunn, D. M. and Landwehr, J. M. (1980). Analyzing clustering effects across time. J. Amer. Statist.
Assoc. 75, 8-15.
Fowlkes, E. B., Gabbe, J. D. and McRae, J. E. (1976). Graphical techniques for displaying
multidimensional clusters. Proc. Business and Economic Section Amer. Statist. Assoc., 308-312.
Gnanadesikan, R. (1977). Methods for Statistical Data Analysis of Multivariate Observations. Wiley,
New York.
Gnanadesikan, R., Kettenring, J. R. and Landwehr, J. M. (1977). Interpreting and assessing the results
of cluster analyses. Bull. Internat. Statist. Inst. 47 (2) 451-463.
Gross, A. M. (1975). Condensing hierarchical trees. Bell Lab. Memo [unpublished].
Kleiner, B. and Hartigan, J. A. (1980). Representing points in many dimensions by trees and castles
(with discussion). J. Amer. Statist. Assoc. [in press].
Kruskal, J. B. and Landwehr, J. M. (1980). Icicle plots: Better displays for hierarchical clustering. J.
Amer. Statist. Assoc. [submitted].
Kruskal, J. B. and Wish, M. (1978). Multidimensional Scaling. SAGE Publications, Beverly Hills.
Rohlf, F. J. (1970). Adaptive hierarchical clustering schemes. Systematic Zool. 19, 58-82.
Sneath, P. H. A. and Sokal, R. R. (1973). Numerical Taxonomy. Freeman, San Francisco.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
© North-Holland Publishing Company (1982) 245-266

Cluster Analysis Software*

Roger K. Blashfield, Mark S. Aldenderfer and Leslie C. Morey

When in danger or in doubt,
Run in circles, scream and shout
Ancient Adage

The amount and diversity of cluster analysis software has grown almost as
rapidly as the number of publications which describe its use (Blashfield and
Aldenderfer, 1978a). New methods (and the programs which implement them) are
proposed continually, and no end to this process of innovation is in sight (Sneath
and Sokal, 1973). No reliable estimate of the total amount of clustering software
in use today has ever been made because cluster analysis has spread to innumer-
able scientific disciplines and subdisciplines, making any attempt at a comprehen-
sive review futile. An indirect measure of the abundance of clustering software
can be found in a study by Blashfield (1976b). Selecting the members of the
Classification Society (North American Branch) as a sample, Blashfield sent each
member a questionnaire, asking what cluster software each currently used. The
fifty-three respondents listed fifty different programs and packages. On the basis
of this result he suggested that there may be as many different programs in
existence to perform cluster analysis as there are users.
Three reasons can be identified why so much software has been developed.
(1) The process of creating groups of entities--classification--is a fundamental
human activity which forms the basis for most scientific progress (Hempel, 1952).
However, there are many competing philosophies as to how groups should be
constructed and how they should be defined. Consequently, a wide variety of
logical, statistical, mathematical, and heuristic methods has been applied to the
problem of creating groups. Cluster analysis has been strongly affected by this
diversity of thought, and at least seven major families of clustering methods have
been developed. These are: (1) hierarchical agglomerative; (2) hierarchical divi-
sive; (3) iterative partitioning; (4) mode searching; (5) factor analytic; (6) clump-
ing; and (7) graph theoretic methods (Anderberg, 1973; Bailey, 1975; Everitt,
1974). Each of these families represents a different perspective on the creation of
groups. The results obtained when different methods are applied to the same data
set can be widely divergent. This diversity in approach to clustering makes it
likely that a number of programs will be written in order to meet the alternative
philosophies of classification.

*The material in this chapter was partially collected while the first author was
supported by NSF Grant DCR No. 74-20007.
(2) Different sciences have different analytical and methodological needs.
Although each of the families of clustering methods can be used by any discipline,
certain methods have been found to be particularly useful in certain sciences
(Clifford and Stephenson, 1975). Thus hierarchical agglomerative methods are
most frequently used in the biological sciences (Sneath and Sokal, 1973), and
factor analysis variants are popular in psychology (Lorr, 1966; Overall and Klett,
1972; Skinner, 1977). At first glance this state of affairs would seem to work
against the development of new programs, but it is, in fact, a cause of software
proliferation. For instance, a new type of classification problem may arise in a
science for which existing software is poorly structured, or a new important
variation on a traditional cluster method may be proposed, or researchers in one
science may become aware of a clustering technique which has proved useful in
another science. All of these instances are likely to cause the creation of new
software or the major revision of old software programs.
(3) Finally, the tendency to write new clustering programs is greatly facilitated
because most cluster analysis methods are relatively easy to program (Anderberg,
1973). Many methods (and their variants) are not based on sophisticated statisti-
cal or mathematical models and thus they do not require considerable expertise to
implement. In fact, most clustering methods are no more than heuristics, or 'rules
of thumb,' which have reasonably straightforward rules of group formation
(Hartigan, 1975a).
Clustering software comes in a variety of forms, ranging from simple
100-line FORTRAN programs to packages containing many thousands of state-
ments. The software reviewed in this paper is divided into five categories: (1)
collections of subroutines and algorithms; (2) general statistical packages which
include clustering methods; (3) cluster analysis packages; (4) simple programs
which perform one type of clustering; and (5) special purpose clustering pro-
grams, including novel methods, graphics, and other aids to cluster interpretation.
Since a comprehensive review of clustering software is effectively impossible,
some selection is necessary. The decision was made to emphasize the software
programs shown to be most popular by Blashfield (1976b). In addition, other
software was included which concerns recent advances that are likely to be
popular and/or which represents striking alternatives to the most commonly used
programs. Most of the software discussed in the paper was developed in the
United States, Great Britain or Australia. The software developed in Europe has
not been fully sampled (e.g., Spath, 1975).
The remainder of this chapter will be organized as follows: First, there will be a
short discussion of the five major categories of software and the particular
programs included under each category (Section 1). Next, a reasonably detailed
discussion will follow concerning software programs which emphasize hierarchical
methods (Section 2) and those which contain iterative partitioning methods
(Section 3). Fourth, the special purpose programs will be described (Section 4).
Finally, there will be a section on usability which concerns the users manuals and
error handling of the programs (Section 5).

1. Major categories of cluster analysis software

1.1. Collections of subroutines and algorithms


Three collections of subroutines and algorithms are available today. Two of the
collections are found in books: Anderberg's (1973) Cluster Analysis for Application
(abbreviated ANDER), and Hartigan's (1975a) Clustering Algorithms (abbrevia-
ted HART) (see also Dallal, 1975). The third collection is from the popular
International Mathematical and Statistical Library of scientific subroutines (IMSL,
1977).
The most important feature is that, unlike statistical packages, these collections
require the user to supply all of the job control language of the computing system
needed to link and run the subroutines. This means that a potential user of these
collections must be fluent in FORTRAN and knowledgeable about computing to
use them effectively. With the exception of the IMSL subroutines, there is
currently no organization or group responsible for the maintenance, modification,
and further development of these subroutines. The authors of both books stress
that their collections are primarily research tools, and that the ideal user is an
experienced computer programmer. A user of the IMSL subroutines would also
have to have considerable computing expertise. Consequently, these collections
are not suitable for use by novices unless extensive guidance is provided.
Of these collections HART is the most versatile and contains a number of
clustering techniques which are not available elsewhere. However, this collection
offers relatively little concerning the dominant hierarchical agglomerative meth-
ods. ANDER, on the other hand, has reasonable breadth of hierarchical ag-
glomerative and iterative partitioning techniques. The IMSL subroutines only
have two hierarchical agglomerative methods and thus are the most limited of the
three.

1.2. General statistical packages which contain clustering methods


Clustering methods are also found in general statistical packages. The philoso-
phy behind these packages is well known; they have been designed to provide
relative novices in both computing and statistics with access to complex statistical
methods which can be applied to a wide variety of data sets. All major packages
include many options for data screening and transformation and they normally
contain methods for file storage and manipulation. Once a user has learned the
command language, running jobs with the package is very simple. The package
provides the system job-control language, whereas the user provides the ap-
propriate set of commands in their proper sequence. The use of these systems
offers great flexibility for many types of data manipulation and exploration.

The addition of cluster analysis methods to general statistical packages has
been recent. For example, the BMDP series has added an iterative partitioning
procedure to its existing repertoire of hierarchical agglomerative and block
clustering methods. Two other general packages, SAS and OSIRIS, contain a
limited selection of procedures for hierarchical clustering. However, a SAS
procedure is now available which permits the user to access CLUSTAN through
the SAS package. CLUSTAN is a versatile program which is described later.
All of these statistical packages offer substantial advantages to researchers
interested in cluster analysis should the method of choice be available in the
package. These packages are widespread and are available on a wide variety of
computing systems. The only exception to this is SAS, which is limited to IBM
systems because most of the package is written in IBM assembly language and
PL/1. These packages are well maintained, and each has an excellent user's
manual which describes in detail the options available for use. Because these
packages are designed for people with little or no programming experience, they
are excellent for novices.
Nevertheless, the general statistical packages have serious limitations. The most
crucial of these is the relative lack of versatility. Relative to the range of clustering
methods which have been proposed in the literature, these packages have a
limited selection. However, it should be noted that BMDP appears to be seriously
concerned with the incorporation of an adequate range of clustering methods, and
hence this criticism of poor versatility may soon be obsolete.

1.3. Cluster analysis packages

Six packages devoted to clustering and related methods are reviewed. They are
CLUSTAN IC (Wishart, 1978), NT-SYS (Rohlf, Kishpaugh and Kirk, 1974),
CLUS (Rubin and Friedman, 1967), TAXON (Milne, 1976; Williams and Lance,
1977), BC-TRY (Tryon and Bailey, 1970) and CLAB (Shapiro, 1977). In many
ways, packages devoted to cluster analysis represent the ultimate in both flexibil-
ity and user convenience. These packages combine many of the advantages of a
general statistical package (data screening, file manipulation, and data transfor-
mation), with features of interest to users of cluster analysis (a wide choice of
similarity measures, cluster diagnostics, and graphics). Novices will have more
difficulty learning to use these packages than general statistical packages such as
SAS or SPSS, but these packages are not overly difficult. Experienced users find
these packages important because they often contain many different or hard-to-
find options which may be particularly suited to a specific problem or data set.
Most of these packages (except CLUS and CLAB) are maintained by private or
commercial organizations which are responsible for their development and distri-
bution.
Of the six cluster analysis packages CLUSTAN is the most versatile, containing
the widest range of clustering methods and similarity measures. Also worth noting
about CLUSTAN is that versions are now available which allow CLUSTAN to
handle SAS or SPSS files. NT-SYS was designed for users in the biological
sciences and relies heavily on the ideas proposed by Sokal and Sneath (1963;
Sneath and Sokal, 1973). CLUS is an iterative partitioning program which has
many options concerning this type of clustering method. TAXON is an Australian
program which emphasizes developments by Lance and Williams (1967a, 1967b;
Clifford and Stephenson, 1975; Williams and Lance, 1977). Hierarchical divisive
methods and the use of statistics from information theory are particularly salient
in TAXON. The next package, CLAB, is the only one which is interactive. It was
designed for use on a DEC-10 computer. Finally, the BC-TRY system is a factor
analysis oriented package derived from the work on clustering by the psychologist
Tryon (1939; Tryon and Bailey, 1970).

1.4. Simple cluster analysis programs


Included under this category are six relatively short FORTRAN programs
which have become somewhat popular among cluster analysis users. HGROUP,
for instance, is a program by Veldman (1967) which performs Ward's minimum
variance method of clustering (Ward, 1963; Ward and Hook, 1963). This program
has been commonly used in geography, industrial psychology and business.
JCLUST is a program written by Johnson, which incorporates single and com-
plete linkage methods as discussed in his influential article (Johnson, 1967).
Johnson's article is also the basis for the clustering methods available in OSIRIS,
IMSL and SAS. Thus, these three programs are effectively equivalent to JCLUST
concerning cluster analysis options. HOWD, ISODATA and MIKCA contain
iterative partitioning methods. Like CLUS, they emphasize options in the statistic
to be maximized by the partitions. ISODATA is a flexible iterative partitioning
method which has proved popular in engineering (Hall and Khanna, 1977).
Finally, BUILDUP (Lorr and Radhakrishnan, 1967) is a program which uses a
clumping method and has been popular in clinical psychology.
All of these programs except MIKCA have limited versatility. These
programs are not user oriented, generally have poor documentation and provide
relatively little output information. The only major reason for considering them is
that BUILDUP, HOWD, ISODATA and MIKCA contain clustering methods
which are not available in other software.

1.5. Special purpose clustering programs


Included in this category will be programs which can handle very large data
sets (over 500 entities), graphics, or relatively unusual approaches to cluster
analysis. These programs and their characteristics will be the focus of a later
section in this chapter.

2. Programs with hierarchical methods

In Section 1, eighteen separate software programs for cluster analysis were
introduced: three collections of subroutines, three general statistical packages, six
clustering packages and six simple programs. Of these, twelve contained at least
one hierarchical method. The six remaining programs (BC-TRY, BUILDUP,
CLUS, HOWD, ISODATA, and MIKCA) will not be discussed further in this
section.
Hierarchical methods of cluster analysis are, by far, the most commonly used
techniques in the clustering literature (Sneath and Sokal, 1973; Blashfield and
Aldenderfer, 1978a). Any software which intends to appeal to a wide range of
users should include a number of the hierarchical clustering methods as well as a
reasonable sampling of similarity measures.
Attempts have been made to identify characteristic ways in which hierarchical
cluster analysis methods differ. Bailey (1975), for example, proposed 12 'criteria'
to be used in selecting a method; similarly, Sneath and Sokal (1973) discussed 8
'options' from which to choose. In Subsections 2.1-2.4 the dimensions crucial to
the understanding of hierarchical methods are treated.

2.1. Agglomeration versus division


This dimension refers to the basic strategy used in creating a hierarchical
classification. An agglomerative strategy begins with each entity defined as a
cluster, and these clusters are combined on the basis of similarity until only one
cluster remains. Divisive strategies do just the opposite; here K = 1 (where K is
the number of clusters) at the beginning of the procedure and K = N (or a
user-specified number of clusters) at its finish (Bailey, 1975; Lance and Williams,
1965; Whallon, 1971, 1972). Of the programs reviewed here, all contain agglomer-
ative methods of cluster analysis; only CLUSTAN and TAXON have an option
for divisive methods.
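To make the agglomerative strategy concrete, the following minimal sketch (written in Python with NumPy on artificial data, using a single-linkage rule for the distance between clusters; it is not drawn from any of the programs reviewed here) begins with K = N singleton clusters and merges the two closest clusters until K = 1.

```python
import numpy as np

def agglomerate(X):
    # Begin with every entity as its own cluster (K = N) and repeatedly merge
    # the two closest clusters until only one cluster remains (K = 1).
    clusters = [[i] for i in range(len(X))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single-linkage distance between cluster a and cluster b
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((tuple(clusters[a]), tuple(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges

print(len(agglomerate(np.random.default_rng(0).normal(size=(8, 2)))))  # 7 merges for 8 entities
```

A divisive program runs the loop in the other direction, starting from one all-inclusive cluster and splitting until the desired number of clusters is reached.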

2.2. Linkage form


This parameter refers to the set of rules used to join entities together in forming
clusters. Some authors have referred to this option as the sorting strategy. Linkage
forms have been described in detail by many authors (see Everitt, 1974; Sneath
and Sokal, 1973; Lance and Williams, 1967a).
There are many possible linkage forms, each yielding a unique hierarchical
agglomerative method. Over twelve different linkage forms exist, of which four
have become the most popular. The resulting four methods, single linkage,
complete linkage, average linkage and Ward's method, are the dominant methods
of cluster analysis used in applied research.
(a) Single linkage clustering. This method, which was first discussed by Sneath
(1957), specified that an entity will join a cluster if it has a certain level of
similarity with an entity already in the cluster to which it is most similar; that is,
it must be similar to at least one member of that cluster. Thus, connections are
based upon links between single entities.
Single linkage has a tendency to form chains, whereby clusters are elongated and
entities at either end of the cluster are likely to bear little resemblance to each
other (Sneath and Sokal, 1973). Nonetheless, some authors feel that this is the
only method which meets the mathematical criteria to be satisfied by an accept-
able clustering method (Jardine and Sibson, 1968, 1971). All programs discussed
here, with the exception of HGROUP, OSIRIS, and SAS, contain the single
linkage method as an option.
Single linkage clustering is related to the formation of minimum spanning trees
(Zahn, 1971). For users interested in minimum spanning trees, CLAB contains a
number of intriguing options. CLUSTAN and NT-SYS also will permit the
formation of minimum spanning trees.
(b) Complete linkage clustering. This method is the logical opposite of single
linkage clustering. Rather than an entity joining with the closest member of a
cluster, complete linkage requires that an entity be within a certain level of
similarity with the most distant entity, or in effect, with all members of that
cluster (Sokal and Michener, 1958). This method thus tends to form compact,
hyperspherical clusters composed of highly similar entities. Only HART and
HGROUP do not include an option for complete linkage clustering.
(c) Average linkage clustering. Sokal and Michener (1958) introduced average
linkage clustering as a compromise between the 'conservative' complete linkage
method and the 'liberal' single linkage method. Although more than one average
linkage technique exists, one particular variation has attracted the most use:
unweighted pairwise group mean averaging (abbreviated UPGMA) (Sneath and
Sokal, 1973). From the twelve programs, UPGMA was available in eight. The
other four programs (SAS, OSIRIS, IMSL, JCLUST) were designed in response
to the Johnson (1967) article which does not consider average linkage because
UPGMA is not invariant under monotonic transformation of the similarity
measure. However, since average linkage is the dominant clustering method, its
absence in programs intended for general users such as SAS, OSIRIS and IMSL
clearly limits the wide usefulness of these programs.
(d) Ward's method. This method was designed to optimize an objective func-
tion, the minimum variance within clusters (Ward, 1963). The method has proved
popular in the social sciences but not in the biological sciences. Despite the over
220 citations to Ward's article in a wide range of social sciences, this method has
been frequently overlooked in software. Only three of the twelve programs,
ANDER, CLUSTAN, and HGROUP, have incorporated Ward's method. In this
regard, it is worth noting that one clustering package, CLUSTAN, has suggested
that Ward's method be used as the default option in the choice of linkage form.
Ward's method has also been shown to yield relatively good solutions in Monte
Carlo comparisons of cluster analysis methods (Gross, 1972; Kuiper and Fisher,
1975; Blashfield, 1976a; Mojena, 1977).
In conclusion about linkage forms, CLUSTAN is the best program. It has eight
hierarchical agglomerative methods and two hierarchical divisive methods.
ANDER, TAXON, and NT-SYS are close with seven methods each. BMDP and
CLAB contain three linkage forms, but, in BMDP, these are available only if the
user wishes to cluster variables. If the user wants to cluster entities (as is usually the
case), BMDP forces the user to cluster with average linkage. SAS, OSIRIS, IMSL,
JCLUST, HGROUP and HART are clearly the most limited in terms of their
variety of linkage forms. None of these last six programs is recommended for use
as general cluster analysis software.
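As a concrete illustration of how much the choice of linkage form matters, the sketch below (a hedged example written with the SciPy library on artificial data; it is not taken from CLUSTAN or any other package reviewed here) applies the four popular linkage forms to one data set and also checks the connection noted above between single linkage and minimum spanning trees: the N − 1 single-linkage merge heights coincide with the N − 1 edge lengths of the minimum spanning tree.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (15, 2)), rng.normal(4, 1, (15, 2))])  # two artificial groups

for method in ("single", "complete", "average", "ward"):   # 'average' is UPGMA
    Z = linkage(X, method=method)
    labels = fcluster(Z, t=2, criterion="maxclust")         # cut the tree at K = 2
    print(method, np.bincount(labels)[1:])                  # cluster sizes can differ by method

# Single linkage and the minimum spanning tree are built from the same distances.
D = squareform(pdist(X))
mst = minimum_spanning_tree(D).toarray()
mst_edges = np.sort(mst[mst > 0])
Z_single = linkage(pdist(X), method="single")
print(np.allclose(np.sort(Z_single[:, 2]), mst_edges))       # True
```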

2.3. Measures of similarity


Hierarchical methods require the calculation of a similarity matrix of order
N × N (N refers to the number of entities). This involves the calculation of the
degree of resemblance that exists between each pair of the N objects being
clustered. Different measures of similarity have been proposed, and use of these
different measures can result in different solutions, even if all other parameters
are held constant. A number of useful discussions exists concerning these mea-
sures (e.g., Sokal and Sneath, 1963; Sneath and Sokal, 1973; Bailey, 1975; Everitt,
1974; Cormack, 1971). The reader is referred to these for a more comprehensive
review. Sneath and Sokal (1973) divided similarity measures into four groups:
(a) Distance measures. These measures can be used with all types of data. Many
types of distance exist (Cormack, 1971). Of these, Euclidean distance is frequently
chosen. However, the user should be cautioned that there is some variation in the
definition of Euclidean distance. Blashfield (1977) demonstrated that the Euclidean
distance definitions used by CLUSTAN, NT-SYS and BMDP are not identical
and can result in substantially different cluster solutions (a numerical illustration
follows this list of measures).
(b) Association coefficients. These measures are generally used with binary
categorical or nominally scaled data. Sneath and Sokal (1973) have an excellent
discussion of these measures.
(c) Correlation coefficients. The Pearson product moment correlation and other
measures of correlation have been used for continuous data (Sokal and Sneath,
1963). Correlation measures have become unpopular because they are only
sensitive to the shape of a data profile (Fleiss and Zubin, 1969). However,
Milligan (1978) has demonstrated that correlation measures can recover the
structure of Monte Carlo data better than distance measures with continuous
data.
(d) Probabilistic similarity coefficients. These are based on information statistics.
The Australian researchers have emphasized this measure (Williams, Lambert and
Lance, 1966; Clifford and Stephenson, 1975). This type of measure is available in
the Australian program TAXON.
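Two of the points above are easy to demonstrate numerically. The hedged sketch below (illustrative Python; the distance formulas are common textbook variants and are not claimed to be the exact definitions used by CLUSTAN, NT-SYS or BMDP) first shows three quantities that have all been labelled 'Euclidean distance' by software, and then shows why correlation responds only to the shape of a profile while distance also responds to its elevation.

```python
import numpy as np

# (a) Three quantities that have all been called 'Euclidean distance'.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 1.0, 0.0])
p = len(x)                                # number of variables

d_squared = np.sum((x - y) ** 2)          # squared Euclidean distance
d_root    = np.sqrt(d_squared)            # the usual Euclidean distance
d_scaled  = np.sqrt(d_squared / p)        # root mean squared difference per variable
print(d_squared, d_root, d_scaled)        # 25.0  5.0  2.5

# (c) Correlation reflects only profile shape; distance also reflects elevation.
a = np.array([1.0, 3.0, 2.0, 5.0])
b = a + 10.0                              # identical shape, elevated by a constant
print(np.corrcoef(a, b)[0, 1])            # 1.0
print(np.linalg.norm(a - b))              # 20.0
```

A manual that reports only 'Euclidean distance' without stating which variant is computed leaves the user unable to reconcile solutions across packages.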
The extent to which these types of similarity measures are represented by the
packages under consideration is quite variable. The most extensive of these is
CLUSTAN, with 38 different coefficients available as options, covering all four
types of measures. NT-SYS is also versatile in this respect, containing 21
measures with all but a probabilistic metric. In addition to the above, ANDER
(with 9) and BMDP (with 15) have a reasonable choice of measures.
Three of the programs, JCLUST, IMSL, and OSIRIS, are limited in that they
require the user to input the similarity matrix, rather than the raw data. User
submission of a similarity matrix is an option for all other programs, with the
exception of HGROUP. Finally, BMDP and OSIRIS both have options which
permit clustering of variables rather than entities.

2.4. Dendrograms
A final major dimension associated with hierarchical methods concerns the use
of dendrograms or 'trees' which graphically represent the results of hierarchical
clustering. For users in the biological sciences, trees are the major output
necessary for interpretation. Most programs, except SAS, OSIRIS, JCLUST
and HGROUP, print trees as either standard or optional output. CLAB, NT-SYS
and CLUSTAN have the most easily interpreted trees. CLUSTAN permits the
user to request horizontal or vertical trees. BMDP has an option which allows the
tree to be printed over the similarity matrix.
Other graphics which can be used to display hierarchical clustering results are
skyline plots and shaded similarity matrices. Skyline plots are somewhat like
vertical trees depicted by a series of bars, while the shaded similarity matrices
(Ling, 1973) are reorganized similarity matrices in which similarity values are
replaced by dots of different darkness. Both of these graphics become awkward to
visually interpret when the number of entities is moderately large (N > 75). SAS,
OSIRIS and JCLUST print skyline plots rather than trees. BMDP permits shaded
similarity matrices as an option in addition to trees.
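For readers who wish to produce comparable displays outside the packages reviewed, the sketch below (a hedged example using SciPy and matplotlib on artificial data) draws a tree and a crude shaded matrix by reordering the distance matrix according to the leaf order of the tree.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, leaves_list
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (10, 3)), rng.normal(5, 1, (10, 3))])
Z = linkage(X, method="average")

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3))
dendrogram(Z, ax=ax1)                                 # the tree
order = leaves_list(Z)                                # entity order along the tree
D = squareform(pdist(X))
ax2.imshow(D[np.ix_(order, order)], cmap="gray")      # reordered ('shaded') distance matrix
plt.show()
```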

2.5. Conclusion about hierarchical software


Of the programs reviewed, it appears that CLUSTAN, followed by NT-SYS,
are the most versatile of the popular clustering programs. For the biological
scientist, NT-SYS has the advantage that its logic is specifically designed for
biological problems and includes features (such as the cophenetic correlation
coefficient as an index of cluster 'goodness') well suited for this area. CLUSTAN
has the widest range of hierarchical clustering. In addition, the output of
CLUSTAN is readable and provides more information than any other existing
program. For the general user of cluster analysis, CLUSTAN is clearly the most
versatile package available.
Of the remaining ten programs, ANDER, BMDP, CLAB, and TAXON all
have a moderate degree of versatility across the four dimensions mentioned
above. Within these four, ANDER has the widest range of linkage forms and
similarity measures. However, ANDER requires the user to keypunch the pro-
grams, plus the user must be sufficiently sophisticated in FORTRAN to correct
any programming errors. CLAB and TAXON have limited portability and are
not widely available. Like ANDER, the BMDP hierarchical procedures are
moderately versatile. Its major limitation is that the majority of useful options
only exist on the procedure which clusters variables. For clustering entities,
BMDP is a poor hierarchical program.

The remaining six programs, including the general statistical packages of SAS
and OSIRIS, have limited versatility, and there is no particular reason why a user
should seek out these programs for performing hierarchical cluster analysis.

3. Programs with iterative partitioning methods

The iterative partitioning methods comprise the second major family of cluster
analysis techniques. In general, these methods assign entities to the nearest
cluster, compute the new cluster centroids and reassign entities. These alterations
are performed until no object changes cluster membership. Conceptually, this
approach circumvents a serious drawback with hierarchical methods. Hierarchical
techniques require the formation of a similarity matrix which has N(N − 1)/2
unique values. Since the amount of computer memory needed grows roughly as
the square of N, the hierarchical methods are limited by the size of the
data matrix. For instance, 400 entities would require storage for 79 800 unique
similarity values.
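The assign-and-recompute cycle just described can be written in a few lines. The sketch below (a minimal K-means in Python with NumPy, not a transcription of any program reviewed here) stores only the K centroids and a membership vector and stops when no entity changes cluster.

```python
import numpy as np

def kmeans(X, K, max_pass=100, seed=0):
    # Minimal iterative partitioning: assign each entity to the nearest centroid,
    # recompute the centroids, and repeat until no membership changes occur.
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)].copy()  # K data points as seeds
    labels = None
    for _ in range(max_pass):
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dist.argmin(axis=1)                 # assignment pass
        if labels is not None and np.array_equal(new_labels, labels):
            break                                        # no entity changed cluster: stop
        labels = new_labels
        for k in range(K):
            members = X[labels == k]
            if len(members):
                centroids[k] = members.mean(axis=0)      # recompute centroids
    return labels, centroids

labels, centroids = kmeans(np.random.default_rng(9).normal(size=(200, 5)), K=4)
print(np.bincount(labels))                               # sizes of the four clusters
```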
As iterative partitioning methods do not require the storage of a similarity
matrix, they have the potential of handling distinctly larger data sets. However,
these methods are subject to a different limitation. The optimal way to perform
iterative analysis would be to form all possible partitions of the data set.
Unfortunately, this approach requires an enormous number of iterations. When
there are over fifty entities, an exhaustive approach becomes unfeasible. As a
result, the authors of the programs which perform iterative partitioning have
created procedures which sample a small subset of the possible partitions. The
heuristic procedures for choosing likely partitions are plausible, yet quite varied in
approach. Thus, large differences exist both between and within the options of
programs which perform iterative analysis.
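The combinatorial burden is easy to quantify. Under the standard count, the number of ways to partition N entities into K nonempty clusters is the Stirling number of the second kind; the short sketch below (plain Python, independent of any program discussed here) evaluates it for modest N and shows why an exhaustive search over partitions is out of the question.

```python
from math import comb, factorial

def stirling2(n, k):
    # Number of ways to partition n entities into k nonempty clusters.
    return sum((-1) ** j * comb(k, j) * (k - j) ** n for j in range(k + 1)) // factorial(k)

print(stirling2(25, 4))   # roughly 5 x 10^13 four-cluster partitions of only 25 entities
print(stirling2(50, 4))   # roughly 5 x 10^28 for fifty entities
```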
Ten of the eighteen clustering programs have an iterative partitioning method
of clustering. The ten programs are ANDER, BC-TRY, BMDP, CLAB, CLUS,
CLUSTAN, HART, HOWD, ISODATA, and MIKCA. These programs are more
varied than were the hierarchical clustering programs. In fact, it is impossible,
with the exception of two programs (ANDER and CLUSTAN), to set the options
in such a way that these programs will yield identical solutions to the same data
sets. Hence the discussion of these programs will likewise be organized around the
major dimensions of these methods. Anderberg (1973) contains a helpful discussion of
iterative partitioning methods.

3.1. Initial partition


This dimension refers to the procedure by which the partitioning method is
begun. Almost all programs use different methods to select the starting partition.
Milligan (1979) has emphasized the importance of choosing a starting procedure
by demonstrating, with Monte Carlo methods, that the choice of the initial
partition is the option which has the most effect on an iterative method. Some
methods use initial estimates of cluster centroids as the basis for the initial
partition. For example, ANDER and BMDP start with the specification of
centroid estimates, called 'seed points,' which can be user specified or can be an
arbitrarily chosen set of K actual data points (where K is the number of clusters).
In the first pass through the data, the entities are assigned to the cluster with the
nearest centroid. MIKCA, on the other hand, starts by analyzing three different
sets of randomly chosen seed points for the set which seems most likely to lead to
an efficient solution.
A second type of initial partitioning involves the specification of the first
cluster assignment. With this procedure the centroid of each cluster is defined as
the multivariate mean of the entities within that cluster. This can be accomplished in a
variety of ways. ANDER, apart from the seed point method, allows the user to
specify the initial partition. CLUSTAN may be started either with a user specified
partition or by pseudorandom assignment, where every Kth element is assigned
to the same cluster. ISODATA selects initial cluster centroids that are relatively
distant from the centroid of the entire data set. There are four starting options
permitted by CLUS: (1) randomly chosen partition; (2) user-specified partition;
(3) a partition where the first N/K entities are assigned to the first cluster, the
second N/K to the second cluster, and so on; and (4) a partition that the package
chooses "by its own method." BC-TRY allows a user-specified starting partition,
as well as a seed point procedure. The BC-TRY seed points are unique in that
they are found by dividing Q-dimensional space (as found in multiple group
factor analysis) into 2^Q equal segments. The centroids of these segments
define the initial seed points for the iterative clustering process.
There are also three programs which form clusters over a range of K. HOWD
and BMDP, for example, superimpose a hierarchical divisive algorithm onto an
iterative procedure. The initial partition is formed by dividing the data set at the
mean of the variable with the largest variance. After a stable solution is found
using an iterative K-means procedure, these programs search for the variable
within the two clusters that now has the largest variance. Subdividing at the mean
of that variable, the next K-means pass results in a three cluster solution. This
process is repeated until an upper limit on the number of clusters is reached.
CLUSTAN works in the opposite direction, superimposing a hierarchical
agglomerative procedure. First, an initial cluster solution is obtained for the
maximum number of clusters (K_max). The two closest clusters are merged, and
iterations are performed to find K_max − 1 clusters. This process repeats until a
lower limit of K is reached.
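The two broad starting strategies, centroid 'seed points' versus an explicit first assignment, can be caricatured as follows. The sketch is illustrative Python written under the author's reading of the descriptions above; in particular, the 'every Kth element' rule is one plausible rendering of the pseudorandom assignment attributed to CLUSTAN, not its actual code.

```python
import numpy as np

def seed_point_start(X, K, rng):
    # Seed-point start: pick K actual data points as initial centroid estimates.
    return X[rng.choice(len(X), size=K, replace=False)]

def every_kth_start(N, K):
    # Assignment start: entity i is placed in cluster i mod K, so every Kth
    # entity lands in the same cluster; centroids are then the cluster means.
    return np.arange(N) % K

rng = np.random.default_rng(3)
X = rng.normal(size=(12, 3))
print(seed_point_start(X, 3, rng).shape)   # (3, 3): three initial centroids
print(every_kth_start(12, 3))              # [0 1 2 0 1 2 0 1 2 0 1 2]
```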

3.2. Type of pass


This dimension involves the type of pass used to assign entities to particular
clusters. The K-means passes, also called 'nearest centroids' and 'reassignment'
passes, involve entity reassignments to the cluster with the nearest centroid. This
type of pass is used exclusively in ANDER, BC-TRY, BMDP, CLAB, HART,
HOWD, and ISODATA. There are distinctions among different approaches to
the K-means pass, however. Most programs use combinatorial passes where cluster
centroids are updated after each membership change. ANDER contains two
K-means procedures which are noncombinatorial; that is, the cluster centroids are
not redefined until a complete pass has been made through the data set. Another
distinction concerns whether the centroid calculation is exclusive or inclusive of
the entity under consideration; that is, whether the entity is removed from the
parent cluster when centroids are computed. CLUSTAN is the only program
which has both as options; all other programs contain only inclusive centroid
calculations.
A second type of pass is the hill-climbing pass. This pass, rather than assigning
on the basis of centroid distance, moves an entity from one cluster to another if a
particular statistical criterion is better optimized. CLUS and CLUSTAN both
contain this type of pass. CLUS permits passes which 'force' entities to join new
clusters in order to start a new partitioning sequence. MIKCA uses an interaction
of both reassignment and hill-climbing passes to reach a solution. Finally, with
the exception of one subroutine in ANDER, all programs repeat passes until no
membership changes occur or until some limit on iterations is reached.
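The distinction between combinatorial and noncombinatorial K-means passes is easiest to see in code. The hedged sketch below (plain Python with NumPy; it is not a transcription of ANDER or any other program) recomputes centroids after every membership change in the first routine and only after a complete pass in the second; both use inclusive centroid calculations.

```python
import numpy as np

def combinatorial_pass(X, labels, centroids):
    # Centroids are recomputed immediately after each membership change,
    # so later entities in the same pass see the updated centroids.
    changed = 0
    for i, x in enumerate(X):
        k = np.linalg.norm(centroids - x, axis=1).argmin()
        if k != labels[i]:
            labels[i] = k
            changed += 1
            for c in range(len(centroids)):
                members = X[labels == c]
                if len(members):
                    centroids[c] = members.mean(axis=0)
    return changed

def noncombinatorial_pass(X, labels, centroids):
    # All assignments use the centroids defined at the start of the pass;
    # centroids are redefined only once the full pass is complete.
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    new_labels = dist.argmin(axis=1)
    changed = int(np.sum(new_labels != labels))
    labels[:] = new_labels
    for c in range(len(centroids)):
        members = X[labels == c]
        if len(members):
            centroids[c] = members.mean(axis=0)
    return changed

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 2))
labels = np.zeros(len(X), dtype=int)
centroids = X[rng.choice(len(X), 3, replace=False)].copy()
for _ in range(100):                      # repeat passes until no changes occur
    if noncombinatorial_pass(X, labels, centroids) == 0:
        break
```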

3.3. Statistical criterion

As mentioned in the previous section, the hill-climbing passes are concerned
with making membership changes which optimize a particular statistical criterion.
CLUS and MIKCA give the user a choice of four criteria: tr W, tr W^-1B, |W|, and the
largest eigenvalue of W^-1B, where W refers to the pooled within-cluster covari-
ance matrix and B is the between-cluster covariance matrix. All four statistics are
measures often discussed in multivariate analysis of variance (Olsen, 1976).
Since the K-means procedures attempt to minimize the variance within each
cluster, they implicitly are concerned with optimization of the tr W criterion.
However, the two types of passes are not identical and can yield different
solutions to the same data set.
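Taking the four criteria to be tr W, tr W^-1B, |W|, and the largest eigenvalue of W^-1B (the usual multivariate analysis of variance statistics), they can be computed from a labelled data set as in the hedged sketch below (illustrative Python; not the CLUS or MIKCA code).

```python
import numpy as np

def partition_criteria(X, labels):
    # Pooled within-cluster (W) and between-cluster (B) cross-product matrices,
    # and the partitioning criteria built from them.
    grand_mean = X.mean(axis=0)
    p = X.shape[1]
    W = np.zeros((p, p))
    B = np.zeros((p, p))
    for k in np.unique(labels):
        Xk = X[labels == k]
        dev = Xk - Xk.mean(axis=0)
        W += dev.T @ dev
        m = (Xk.mean(axis=0) - grand_mean)[:, None]
        B += len(Xk) * (m @ m.T)
    WinvB = np.linalg.solve(W, B)
    return {"tr W": np.trace(W),
            "tr W^-1 B": np.trace(WinvB),
            "|W|": np.linalg.det(W),
            "max eigenvalue of W^-1 B": np.max(np.linalg.eigvals(WinvB).real)}

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (30, 3)), rng.normal(3, 1, (30, 3))])
print(partition_criteria(X, np.repeat([0, 1], 30)))
```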

3.4. Fixed versus variable number of clusters

This dimension refers to the manner in which the programs aid in determining
the number of clusters that exist in the data set. In ANDER, CLUS and MIKCA
the number of clusters (K) must be specified by the user and hence is fixed. The
remaining packages contain procedures which allow K to vary in some manner.
Again the different programs take different approaches to this problem.
CLUSTAN agglomeratively collapses clusters across a user-specified range while
HOWD and BMDP use a divisive procedure to form a range for K. BC-TRY and
ISODATA provide procedures for 'splitting' and 'merging' clusters. ISODATA is
quite flexible in this regard as it permits the user to specify the limits on both the
diameter and/or the size of the cluster. If clusters are too close, ISODATA may
merge them; if a cluster is too heterogeneous, it may be split. In the same way,
clusters that are too large may be split and clusters that are too small may be
assigned to an outlier group.
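A toy version of such split/merge logic is sketched below (illustrative Python written in the spirit of the description above; the thresholds min_dist, max_sd and min_size are invented for the example and do not correspond to ISODATA's actual parameters).

```python
import numpy as np

def split_merge_suggestions(X, labels, centroids, min_dist=1.0, max_sd=2.0, min_size=5):
    # Suggest ISODATA-style actions: merge clusters whose centroids are very
    # close, split clusters that are too heterogeneous, and flag clusters that
    # are too small as candidate outlier groups.
    actions = []
    K = len(centroids)
    for a in range(K):
        for b in range(a + 1, K):
            if np.linalg.norm(centroids[a] - centroids[b]) < min_dist:
                actions.append(("merge", a, b))
    for k in range(K):
        members = X[labels == k]
        if len(members) < min_size:
            actions.append(("outlier group", k))
        elif members.std(axis=0).max() > max_sd:
            actions.append(("split", k))
    return actions
```

In ISODATA itself these decisions interact with the iterative passes, so the routine above should be read only as a caricature of the control flow.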

3.5. Cost
Another factor to consider when performing iterative partitioning cluster
analysis is cost. This is particularly important because cost puts a limitation on
the tremendous number of calculations that would be required to exhaustively
test all possible partitions of the data. The various programs which perform
iterative analysis all attempt to efficiently find an optimal solution. However, the
programs differ drastically in terms of cost (Blashfield and Aldenderfer, 1978b).
CLUS is by far the most expensive program to run. This is a result of (1) the
hill-climbing passes, which require computation of the criterion statistic with each
possible move, and (2) the fact that it will either restart the partitioning analysis
or 'force' movement to avoid the problem of 'local maxima.'

3.6. Output information


Another important factor on which the programs vary is the amount of
information about the clustering solution which is contained in the program
output. In general all of the programs are fairly thorough, especially when
compared to the hierarchical packages. BMDP, CLUSTAN, and HART present
information about the cluster centroids, distances to the centroid for each entity
in a cluster and distances between clusters. BMDP, CLUS, CLUSTAN and
ISODATA list the vectors of standard deviations for the variables. BMDP and
BC-TRY also provide information about the homogeneity and significance of the
clusters. BC-TRY, CLUS, and CLUSTAN can output every membership change
which occurs during a given iteration. CLUSTAN is the only program, however,
which allows the user to output the membership array onto punched cards for
further use.

3.7. Conclusion
Comparison of the programs that perform iterative analyses is more difficult
than those which perform hierarchical analyses. CLUS is clearly the most
versatile of the packages, as it has the widest range of options for type of pass, the
choice of a starting partition, etc. Unfortunately it is also by far the most
expensive of the iterative partitioning programs. The more cost efficient programs
such as BMDP, CLUSTAN, and HART do not have quite as much versatility.
However, these three programs do have a good range of choices concerning
starting partitions and output information. HOWD, ISODATA, and BC-TRY all
have distinctly unique properties which have no direct analogs in the other
packages. ANDER, CLAB, and MIKCA are relatively efficient programs which
represent the major options existent in current thinking about iterative analysis,
but are not as complete in terms of output as other packages. In sum, the
particular needs of the user will dictate the software of preference for performing
iterative partitioning analysis.

4. Special purpose programs

4.1. Validation

A major problem concerning the use of cluster analysis is the validation of a
solution (Dubes and Jain, 1977). Cluster analysis methods have two characteris-
tics which can be disturbing to a casual user: (1) almost all cluster analysis
methods will find clusters in a data set even if none naturally exist; and (2)
different cluster analysis methods frequently yield very dissimilar solutions to the
same data set. Since cluster analysis methods are heuristics which are not derived
in accordance with a well-defined statistical theory, a user should be skeptical of
any clustering solution. Thus it would be a distinct advantage if a cluster analysis
program contained procedures which could be used to validate a solution.
Most of the programs which emphasize hierarchical agglomerative methods of
clustering have no procedures for validation. This lack of validation procedures is
evident in IMSL, SAS, OSIRIS, TAXON, HGROUP, and JCLUST. The biologi-
cally oriented program NT-SYS does list the cophenetic correlation, a statistic
frequently used in biological classification to assess how well a hierarchical tree
represents the structure of a similarity matrix (Sokal and Rohlf, 1962). Cophenetic
correlations could be calculated using the output available from CLAB. Also
concerning hierarchical methods, CLUSTAN prints the measures of cluster
homogeneity, such as the ratio of the within cluster variance to the total data set
variance for each variable.
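For hierarchical solutions, the cophenetic correlation mentioned above is straightforward to obtain with present-day library support, as in the hedged sketch below (SciPy on artificial data; not the NT-SYS implementation).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (20, 4)), rng.normal(4, 1, (20, 4))])
d = pdist(X)
Z = linkage(d, method="average")

c, coph_d = cophenet(Z, d)   # correlation between tree distances and input distances
print(round(c, 3))           # values near 1 suggest the tree reproduces the similarities well
```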
As noted earlier, the iterative partitioning programs are much better than the
hierarchical programs concerning output descriptive statistics. Every iterative
partitioning program gives the user some indication of the homogeneity of the
clusters by reporting the error sum of squares, a value of Wilks' lambda, etc.
Besides output statistics estimating cluster homogeneity, some partitioning
programs provide additional means of estimating the validity of a cluster solution.
For example, HART has an interesting collection of multivariate descriptive
statistics including multivariate histograms which could be of great value. The
new iterative partitioning routine in BMDP plots cluster data points along a
vector which goes from the cluster centroid to the grand mean of the data set in
order to present a visual indication of the separateness of a cluster from the data
swarm. In addition, BMDP has a modified F test which can be used to test the
separateness of clusters (Hartigan, 1978a).
The BC-TRY system is the program with the most emphasis on validation. It
has a separate procedure called 4CAST which (1) estimates statistics measuring
cluster homogeneity, (2) uses cluster membership in both simple and more
complex prediction systems to estimate an external criterion, and (3) draws a
large number of randomly selected partitions against which to compare the
cluster solution measures of homogeneity.

4.2. Large data sets


The eighteen clustering programs and packages emphasized in this paper can
only handle moderately sized data sets. Most have limits (either in core size or in
effective computational cost) of 100 to 400 entities. These limits may vary
somewhat across computer centers depending upon the type of operating system,
the computer hardware, and the local options concerning access to core.
There are four programs which have been written to handle large data sets
(N = 500 entities or more). Sibson (1973) has written a subroutine which permits
single linkage to be used on data sets of this size. McQuitty and Koch (1975)
discussed three hierarchical agglomerative algorithms which have been pro-
grammed for 1000 entities or more. QUICLSTR (Bell and Korey, 1975) and
CLUSTER (Levinsohn and Funk, 1974) are designed to permit the use of Ward's
method with very large data sets. In addition, the complete linkage methods have
been efficiently programmed for reasonably large data sets by Defays (1977) and
by Rohlf (1977).
Concerning the iterative partitioning methods, there is one recent program,
CLASSY by Lennington and Rossbach (1978), which can handle very large data
sets. This program is related to ISODATA in logic and was designed for the large
data sets coming from Landsat satellite data. ISODATA itself has been expanded
to also handle satellite data (Hall and Khanna, 1977).

4.3. Graphics
Another area of current interest in clustering methods is graphical representa-
tion. Since cluster analysis methods are simply heuristics, one solution is to
present visual representations of the multivariate data and let the human re-
searcher decide.
For instance, Chernoff (1973; Chernoff and Rizvi, 1975) proposed a method of
representing multidimensional data using computer drawn faces. He wrote a
program utilizing a Calcomp plotter which will create these faces (Chernoff,
1971). Turner and Tidmore (1977) made this approach more usable by writing a
program which draws the Chernoff faces on a line printer. CLAB also can create
the faces as clustering output.
Another interesting set of available graphical techniques includes comparative
univariate histograms, multivariate histograms, joining trees (not dendrograms)
and dimensional boxes. These are illustrated and discussed by Hartigan (1975b).
Another graphical procedure suggested by this author is the use of 'sleeves' to
visually represent clusters which occur in multivariate data gathered over time
(Hartigan, 1978b).

4.4. Additional cluster analysis software


The eighteen programs which are the focus of this chapter, and the additional
software for graphics and large data sets, by no means exhaust all clustering
software. For example, consider the following:
Revelle (1978) has written a program, ICLUST, to use cluster analysis instead
of factor analysis as a method for generating scales on psychological tests. K-CLUS
is a program to form clusters based upon the theory of Ling (1972). Huizinga
(1978) has proposed a method for finding modes in multivariate data. Wishart
(1969) also worked on a mode-searching algorithm which now is part of
CLUSTAN. Also in CLUSTAN is a non-hierarchical method for minimizing the
error sum of squares (Gordon and Henderson, 1977). Not mentioned earlier is an
interesting method in OSIRIS called AID which forms groups in such a way as to
minimize variance within clusters on a criterion variable. TAXON has a related
method. Sale (1971) created a program to form clusters which meet a specified
dissimilarity definition. Jardine and Sibson (1971) list a program for performing
their B_k method which can generate overlapping clusters. TAXMAP is a program
used in biology to print out within-cluster and between-cluster maps (Carmichael
and Sneath, 1969). Wolfe (1970) has written two programs to perform maximum
likelihood cluster analysis. A program to cluster within a multidimensional,
sociological model was written by Coleman (1970). Cattell, who is known for his
own work on factor analysis, wrote a clustering program called TAXONOME
(Cattell, Coulter and Tsujioka, 1966). Carlson (1972) suggested a variation of
complete linkage which determined the number of clusters, and he wrote a
program to perform his method.
With little effort the list could be continued.

5. Usability of cluster analysis software

Up to this point the focus has been on the features of cluster analysis programs
which relate to the various methods being performed. In this respect the entire
chapter has been devoted to just one issue: versatility of options. The aim of this
section is to provide the reader with information concerning the general features
of the programs that contribute to their ease of use in research.
Obviously, the initial problem any user faces is learning how to use the
program. In this respect a user manual is of primary importance because it serves
as the basic source of information about the program. However, not all programs
are accompanied by a manual. Of the eighteen programs under discussion the six
simple programs except MIKCA do not have manuals. In addition, ANDER,
BC-TRY and HART are primarily explained in their books although the latter
two do have independent manuals. Those programs which do have manuals vary
in clarity and informativeness.
The packages which include cluster analysis as part of a more general coverage
of statistical routines (e.g., BMDP, OSIRIS, and SAS) have manuals which are
sold to potential users. In addition, CLUSTAN now makes its manual commer-
cially available. These manuals contain information about the structural features
of the program, control cards, computational algorithms, etc.
Another method of manual distribution is to have the manual available as a
printout from the master tape which contains the source listing of the program
(e.g., NT-SYS and earlier versions of CLUSTAN). This method is convenient in
that manual access is relatively easy. However, in order to keep the printout to a
reasonable size, these manuals tend not to have some user oriented features which
add to clarity (e.g., sample job runs).
All of the manuals provide some basic information about using the program.
This includes the specification of the format for the control cards, a listing of the
available options, and basic references which describe the methods which are in
the program. However, there are other valuable aspects which have been left out
in some instances. For example, the specification of standard or suggested
options, a listing of control cards for an example run, and a listing of the output
generated by the example run are all useful, providing concrete examples for the
user to follow. CLAB, BMDP, OSIRIS, CLUSTAN, and SAS provide all of these
features (note: the new TAXON manual was not available at the time this chapter
was written). Of the remaining manuals, MIKCA contains an example of the
output generated by the program. The manuals for CLUS, NT-SYS and BC-TRY
have sections describing the structure and interpretation of output.
Another shortcoming that is found in some manuals is a failure to describe the
error messages generated by the program. In this regard the OSIRIS manual is
exemplary. It has an entire section devoted to a description of the error messages
generated by the different procedures. In this way it gives the user some idea of
what action will be needed to correct the error. The newer CLUSTAN manual
(CLUSTAN IC) also has a separate error description section, a large advance
over its previous documentation. The SAS manual has some descriptions of errors
and suggestions on how to find them. Another important user-oriented consideration
is how clear and jargon-free the introductory sections of the
manual are for novices who are unfamiliar with the program. BC-TRY, BMDP
and OSIRIS have sections of the manual written especially for novices. BMDP
and OSIRIS also have clear and concise descriptions of the logic of the various
clustering techniques available in the packages. Of the manuals available only
four (OSIRIS, BMDP, CLUSTAN and SAS) have indexes, another important
aspect.
Most manuals are clearly deficient in the statistical documentation of the
various procedures which are used within a program. For example, Blashfield
(1977) noted that three packages generated considerably different solutions when
apparently the same clustering techniques were used. He had to calculate the
clustering steps by hand in order to learn why the programs found different
results. These calculations were necessary since the manuals did not provide
sufficient detail about the definitions of the Euclidean distance. The best manual
in this regard is the recent CLUSTAN manual. It provides an entire chapter
describing each similarity measure available, the calculation formulae and perti-
nent comments concerning each measure. There is a need for other manuals to
follow this example.
Yet another problem with the user manuals is in their use of jargon. For
example, the complete linkage algorithm has been variously called the diameter
method, the furthest neighbor method, and the maximum method by different
manuals. The unfortunate result of this use of jargon is that the user is confused
by the idiosyncratic use of these terms. Again, the CLUSTAN manual is probably
best in this regard as it provides synonymous names for some of its methods.
In sum, the manuals vary a great deal in usability and comprehensiveness. The
manuals for CLUSTAN (version IC2), BMDP and CLAB are the clearest.
Nonetheless, the three are limited either by terseness or their use of jargon. Nearly
all of the manuals can be found to be lacking in some respect.

5.1. Error handling

The last major area of concern is the facility of the packages for error handling.
Ideally, a package should have sufficient internal checks so that (a) the common
user errors will be detected by the package, (b) the user will be explicitly told in
English (as opposed to being told in computerese) what the error is, and (c) the
user will be told what steps are probably needed to fix the error. A less desirable
error message is generated by the FORTRAN environment of the user's computer
system. Such errors generally require the user to have a knowledge of FORTRAN
and may even require the user to access the source code for the package if he/she
is to decode the error. An even worse response to an error occurs when the
program does not detect an error and executes without notifying the user that an
error (or probable error) did occur. The last response to an error is particularly
serious because the package in fact may generate a solution which is gibberish but
the user will assume the solution is valid.
In order to check the error handling facility, thirteen of the programs were run
on a standard data set, and four errors were intentionally made in the control
cards. The thirteen programs examined included all except TAXON, OSIRIS,
SAS, IMSL, and CLAB. The analysis primarily was based on versions of the
programs available in 1977.
The standard error messages from most programs were FORTRAN error
messages. For instance, a common error message to an error involving control
card transposition was the message "IHC215I--CONVERT--ILLEGAL DECIMAL
CHARACTER." The user who is familiar with FORTRAN at IBM
installations will recognize that this error message means that the program
encountered a character, such as an alphabetical letter, which it did not expect.
Thus this error will suggest to the sophisticated user that a control card is not in
its correct sequence. However, for users who are not familiar with FORTRAN,
the error message will have no obvious meaning, and they will be forced to
seek out a computer consultant in order to correct the error.
Some programs, such as BMDP, CLUS, CLUSTAN IC, and MIKCA, usually
generated error messages in the language of the program. In most instances, the
errors were not considered 'fatal,' and the programs attempted to generate some
type of cluster solution after printing the error message.
A few unusual responses to error conditions were noted. For example, error
conditions were found under which CLUS and NT-SYS generated large volumes
of printed output which had no useful purpose. In one error condition CLUS
noted that the covariance matrix was singular and created an error message telling
the user this. However, the error apparently was not fatal, so the message was
repeated for 5000 lines until the program exceeded the maximum number of lines
as specified by the user.

5.2. Conclusions about usability


In summary, the level of usability for most cluster analysis programs was not
high. Many of the programs do not have user manuals; those that do often
contain idiosyncratic jargon. The error messages from the programs typically were
in FORTRAN and were not clear. For the sophisticated user these problems
would not be as serious as they could be for the novice.

6. Discussion

The problems associated with cluster analysis software are many. First, there is
a very large number of methods and programs for performing cluster analysis. A
conservative estimate would place the number of clustering methods in excess of
100. Different researchers have attempted to resolve the issue of which method is
best for cluster analysis but the conclusions have been equivocal and conflicting.
As a result, there is no easy way to determine which methods should be
incorporated in cluster analysis software.
A second problem concerns the diversity of the audience of cluster analysis
users. The consumers of this software are research scientists from different
disciplines with decidedly different needs. Similarity measures popular in bio-
chemistry are often very different from those used in psychology. With the explosion
of interest in cluster analysis in the last decade, a program author must face the
difficult problem of making the program sufficiently general to meet the needs of
the wide range of users, and yet still keep the program reasonably small so that it
will not be too expensive or cumbersome to use.
A third problem associated with this software is the relative lack of usability of
this collection as compared to the more general statistical packages. Some of the
larger programs, especially CLUSTAN and BMDP, are not bad in this respect.
However, six of the programs discussed do not have anything remotely resem-
bling a user manual, and the error handling of most programs was not overly
clear. Perhaps if the popular statistical packages such as SAS and SPSS add
cluster analysis to their repertoire, usability will be less of an issue.
In conclusion, the software for cluster analysis displays marked heterogeneity.
The popular programs vary in terms of which clustering methods they contain
and how usable they are. The choice of software will be dictated primarily by the
needs of the consumer. The aim of this chapter has been to provide sufficient
information about these programs so that the reader can make a thoughtful
choice.

References

Anderberg, M. R. (1973). Cluster Analysis for Applications. Academic Press, New York.
Bailey, K. D. (1975). Cluster analysis. In: D. Heise, ed., Sociological Methodology. Jossey-Bass, San
Francisco.
Bell, P. A. and Korey, J. L. (1975). QUICLSTR: A FORTRAN program for hierarchical cluster
analysis with large numbers of subjects. Behav. Res. Methods Instrumen. 7, 575.
Blashfield, R. K. (1976a). Mixture model test of cluster analysis: Accuracy of four hierarchical
agglomerative methods. Psych. Bull. 83, 377-378.
Blashfield, R. K. (1976b). Questionnaire on cluster analysis software. Class. Soc. Bull. 83, 25-42.
Blashfield, R. K. (1977). The equivalence of three statistical packages for performing hierarchical
cluster analysis. Psychometrika 42, 429-431.
Blashfield, R. K. and Aldenderfer, M. S. (1978a). The literature on cluster analysis. Multivar. Behav.
Res. 13, 271-295.
Blashfield, R. K., and Aldenderfer, M. S. (1978b). Computer programs for performing iterative
partitioning cluster analysis. Appl. Psych. Measure 2, 533-541.
Carlson, K. A. (1972). A method for identifying homogeneous classes. Multivar. Behav. Res. 7,
483-488.
Carmichael, J. W. and Sneath, P. H. A. (1969). Taxometric maps. Systematic Zool. 18, 402-415.
Cattell, R. B., Coulter, M. A. and Tsujioka, B. (1966). The taxometric recognition of types and
functional emergents. In: R. B. Cattell, ed., Handbook of Multivariate Experimental Psychology,
287-312. Rand-McNally, Chicago.
Chernoff, H. (1971). The use of faces to represent points in N-dimensional space graphically. Tech.
Rept. No. 71. Department of Statistics, Stanford University, Stanford.
Chernoff, H. (1973). Using faces to represent points in K-dimensional space graphically. J. Amer.
Statist. Assoc. 68, 361-368.
Chernoff, H. and Rizvi, M. H. (1975). Effect of classification error of random permutations of features
in representing multivariate data by faces. J. Amer. Statist. Assoc. 70, 548-554.
Clifford, H. T. and Stephenson, W. (1975). An Introduction to Numerical Classification. Academic
Press, New York.
Coleman, J. S. (1970). Clustering in N dimensions by use of a system of forces. J. Math. Soc. 1, 1-47.
Cormack, R. M. (1971). A review of classification. J. Roy. Statist. Soc. 134, 321-367.
Dallal, G. E. (1975). A user's guide to J. A. Hartigan's clustering algorithms. Yale University, New
Haven.
Defays, D. (1977). An efficient algorithm for a complete link method. Comput. J. 20, 364-366.
Dubes, R. and Jain, A. K. (1977). Models and methods in cluster validity. Tech. Rept. JR-77-05.
Department of Computer Science, Michigan State University, East Lansing.
Everitt, B. D. (1974). Cluster Analysis. Halstead Press, London.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Ann. Eugenics 7,
179-188.
Fleiss, J. L. and Zubin, J. (1969). On the methods and theory of clustering. Multivar. Behav. Res. 4,
235-250.
Gordon, A. J. and Henderson, J. T. (1977). An algorithm for Euclidean sum of squares classification.
Biometrics 33, 355-362.
Gross, A. L. (1972). A Monte Carlo study of the accuracy of a hierarchical grouping procedure.
Multivar. Behav. Res. 7, 379-389.
Cluster analysis software 265

Hall, D. J. and Khanna, D. (1977). The ISODATA method computation for the relative perception of
similarities and differences in complex and real data. In: K. Enslein, A. Ralston and H. W. Will,
eds., Statistical Methods for Digital Computers, Vol. 3. Wiley, New York.
Hartigan, J. (1975a). Clustering Algorithms. Wiley, New York.
Hartigan, J. (1975b). Printer graphics for clustering. J. Statist. Comput. Simulations 4, 187-213.
Hartigan, J. (1978a). Asymptotic distributions for clustering criteria. Ann. Statist. 6, 117-131.
Hartigan, J. (1978b). Graphical techniques in clustering: Sleeves. Paper presented at the Classification
Society Meetings in Clemson, SC.
Hempel, C. G. (1952). Problems of Concept and Theory Formation in the Social Sciences, Language and
Human Rights. University of Pennsylvania Press, Philadelphia.
Huizinga, D. (1978). MODES: A natural or mode seeking cluster analysis algorithm. Tech. Rept. No.
78-11. Behavioral Research Institute, Boulder, CO.
IMSL (1977). IMSL Reference Manual, Library 1, Ed. 6, Vols. 1 and 2. Houston, TX.
Jardine, N. and Sibson, R. (1968). A mode for taxonomy. Math. Biosci. 2, 465-482.
Jardine, N. and Sibson, R. (1971). Mathematical Taxonomy. Wiley, New York.
Johnson, S. C. (1967). Hierarchical clustering schemes. Psychometrika 38, 241-254.
Kuiper, F. K. and Fisher, L. (1975). A Monte Carlo comparison of six clustering procedures.
Biometrics 31, 777-783.
Lance, G. N. and Williams, W. T. (1965). Computer program for monothetic classification (associa-
tion analysis). Comput. J. 8, 246-249.
Lance, G. N. and Williams, W. T. (1967a). A general theory of classificatory sorting strategies. I.
Hierarchical systems. Comput. J. 9, 373-380.
Lance, G. N. and Williams, W. T. (1967b). A general theory of classificatory sorting strategies. II.
Cluster systems. Comput. J. 10, 271-277.
Lennington, R. K. and Rossbach, M. E. (1978). CLASSY--An adaptive maximum likelihood
clustering algorithm. Paper presented at the Classification Society Meetings at Clemson, SC.
Levinsohn, J. R. and Funk, S. G. (1974). CLUSTER: A hierarchical clustering program for large data
sets (n > 100). Research Memo. No. 40. Thurstone Psychometric Laboratory, University of North
Carolina, Chapel Hill, NC.
Ling, R. F. (1972). On the theory and construction of K-clusters. Comput. J. 15, 326-332.
Ling, R. F. (1973). A computer generated aid for cluster analysis. Comm. ACM 10, 355-361.
Lorr, M. (1966). Explorations in Typing Psychotics. Pergamon, New York.
Lorr, M. and Radhakrishnan, B. K. (1967). A comparison of two methods of cluster analysis. Educ.
Psych. Measure 27, 47-53.
McQuitty, L. L. and Koch, V. L. (1975). A method for hierarchical clustering of a matrix of a
thousand by a tt~ousand. Educ. Psych. Measure 35, 239-254.
Milligan, G. W. (1978). An examination of the effects of error perturbation of constructed data on
fifteen clustering algorithms. Unpublished Ph.D. Thesis, Ohio State University, Columbus.
Milligan, G. W. (1979). Further results on true cluster recovery: Robust recovery with the K-means
algorithms. Paper presented at the Classification Society Meetings in Gainesville, FL.
Milne, P. W. (1976). The Canberra programs and their accession. In: W. T. Williams, ed., Pattern
Analysis in Agricultural Science, 116-123. Elsevier, Amsterdam.
Mojena, R. (1977). Hierarchical grouping methods and stopping rules--An evaluation. Comput. J. 20,
359-363.
Olsen, C. L. (1976). On choosing a test statistic in multivariate analysis of variance. Psych. Bull. 83,
579-586.
Overall, J. E. and Klett, C. J. (1972). Applied Multivariate Analysis. McGraw-Hill, New York.
Revelle, W. (1978). ICLUST: A cluster analytic approach to exploratory and confirmatory scale
construction. Behav. Res. Methods Instrumen. 10, 739-742.
Rohlf, F. J. (1977). Computational efficiency of agglomerative clustering algorithms. Tech. Rept.
RC-6831. IBM Watson Research Center.
Rohlf, F. J., Kishpaugh, J. and Kirk, D. (1974). NT-SYS user's manual. State University of New York
at Stonybrook, Stonybrook.
266 R. K. Blashfield, M. S. Aldenderfer and L. C. Morey

Rubin, J. and Friedman, H. (1967). CLUS: A cluster analysis and taxonomy system, grouping and
classifying data. IBM Corporation, New York.
Sale, A. H. J. (1971). Algorithm 65: An improved clustering algorithm. Comput. J. 14, 104-106.
Shapiro, M. (1977). C-LAB: An on-line clustering laboratory. Tech. Rept. Division of Computer
Research and Technology, National Institute of Mental Health, Washington, DC.
Sibson, R. (1973). SLINK-An optimally efficient algorithm for single-link cluster methods. Comput.
J. 16, 30-34.
Skinner, H. A. (1977). The eyes that fix you: A model for classification research. Canad. Psych. Rev.
18, 142-151.
Sneath, P. H. A. (1957). The application of computers to taxonomy. J. Gen. Microbiol. 17, 201-226.
Sneatli, P. H. A. and Sokal, R. R. (1973). Numerical Taxonomy. Freeman, San Francisco.
Sokal, R. R. and Michener, C. D. (1958). A statistical method for evaluating systematic relationships.
Kansas Univ. Sci. Bull. 38, 1409-1438.
Sokal, R. R. and Rohlf, F. J. (1962). The comparison of dendrograms by objective methods. Taxonomy
11, 33-40.
Sokal, R. R. and Sheath, P. H. A. (1963). Principles of Numerical Taxonomy. Freeman, San Francisco.
Spath, H. (1975). Cluster-Analyse Algorithmen. R. Oldenbourg, Munich.
Tryon, R. C. (1939). Cluster Analysis. Edward, Ann Arbor.
Tryon, R. C. and Bailey, D. E. (1970). Cluster Analysis. McGraw-Hill, New York.
Turner, D. W. and Tidmore, F. E. (1977). Clustering and Chernoff-type faces. Statistical Computing
Section Proceedings of the American Statistical Association, 372-377.
Veldman, D. J. (1967). FORTRAN Programming for the Behavioral Sciences. Holt, Rinehart and
Winston, New York.
Ward, J. H. (1963). Hierarchical grouping to optimize an objective function. J. Amer. Statist. Assoc.
58, 236-244.
Ward, J. H. and Hook, M. E. (1963). Application of a hierarchical grouping procedure to a problem of
grouping profiles. Educ. Psych. Measure 32, 301-305.
Whallon, R. (1971). A computer program for monothetic subdivisive classification in archaeology.
Tech. Rept. No. 1. University of Michigan Museum of Anthropology, Ann Arbor.
Whallon, R. (1972). A new approach to pottery typology. Amer. Antiquity 37, 13-34.
Williams, W. T., Lambert, J. M. and Lance, G. N. (1966). Multivariate methods in plant ecology. V.
Similarity analyses and information analysis. J. Ecology 54, 427-446.
Williams, W. T., and Lance, G. N. (1977). Hierarchical classification methods. In: K. Enslein, A.
Ralston and H. Wilf, eds., Statistical Methods for Digital Computers. Wiley, New York.
Wishart, D. (1969). Mode analysis: A generalization of nearest neighbor which reduces chaining
effects. In: A. J. Cole, ed., Numerical Taxonomy. Academic Press, London.
Wishart, D. (1978). CLUSTAN 1C user's manual. Program Library Unit of Edinburgh University,
Edinburgh.
Wolfe, J. H. (1970). Pattern clustering by multivariate mixture analysis. Multivar. Behav. Res. 5,
329-350.
Zahn, C. T. (1971). Graph theoretical methods for dissecting and describing Gestalt clusters. IEEE
Trans. Comput. 2, 68-86.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 267-284

Single-link Clustering Algorithms

F. James Rohlf

1. Introduction

The present paper is concerned with computational algorithms for the single-link
clustering method which is one of the oldest methods of cluster analysis. It was
first developed by Florek et al. (1951a, b) and then independently by McQuitty
(1957) and Sneath (1957). This clustering method is also known by many other
names (e.g., minimum method, nearest neighbor method, and the connectedness
method), both because it has been reinvented in different application areas and
because there exist many very different computational algorithms corresponding
to the single-link clustering model. Often this identity
has gone unnoticed since new clustering methods are not always compared with
existing ones. Jardine and Sibson (1971) point out quite correctly that one must
distinguish between clustering methods (models) and the various computational
algorithms which enable one to actually determine the clusters for a particular
method. Different clustering methods imply different definitions of what con-
stitutes a 'cluster' and should thus be expected to give different results for many
data sets.
Since we are concerned here only with algorithms, the interested reader is
referred to Sneath and Sokal (1973) and Hartigan (1975) for general discussions
of some of the important properties of the single-link clustering method. They
contrast this clustering method with other related methods such as the complete
link and the various forms of average link clustering methods. Fisher and van
Ness (1971) and van Ness (1973) summarize some of the more important
mathematical properties of a variety of clustering methods including the single-link
method. The book by Jardine and Sibson (1971) considers some of the more
abstract topological properties of the single-link clustering method and its gener-
alizations.
In the account given below a variety of algorithms are presented to serve as a
convenient source of algorithms for the single-link method. It is hoped that
presenting these diverse algorithms together will also lead to a further understand-
ing of the single-link clustering method. While the algorithms differ considerably
in terms of their computational efficiency (O(n log n) versus O(n^5)), even the least
efficient algorithm may sometimes be useful for small data sets. The less time
efficient algorithms are simpler to program in languages such as FORTRAN. They
also may require much less computer storage.

2. Notation and definitions

Informally, the single-link clustering problem is to group n objects (items,
persons, OTU's, species, etc.) into a system of sets (clusters) of similar objects
such that, for any given level A of clustering intensity, objects i and j are placed in
the same set if and only if the resulting set satisfies the single-link criterion (see
below).
The dissimilarity dij, which is a measure of some desired relationship between a
pair of objects i and j, is usually considered given in clustering problems. Thus we
will not be concerned here with the advantages of different choices of dissimilar-
ity coefficients which may be appropriate in specific applications. Whether or not
the dissimilarity coefficient satisfies the triangle inequality (and thus defines a
metric space) is not important for most of the algorithms presented below. For
convenience we will always refer to clustering 'dissimilarities' rather than 'similar-
ities.' Table 1 presents a dissimilarity matrix for 6 objects which will be used to
illustrate some of the data structures used in the algorithms given below.
In order to define the single-link clustering method the following definitions are
convenient:
A chain from object i to object j is an ordered sequence of objects o1, o2, ..., ot
with o1 = i and ot = j. No object may appear more than once in a given chain.
The cardinality t of the chain is defined as the number of objects in the sequence.
The length of a chain is the sum of the dissimilarity values between pairs of
objects which are adjacent in the given sequence. The size of the chain is defined
as the largest of these dissimilarity values.
Two objects i and j belong to the same single-link cluster at a clustering level A
if and only if there exists a chain of size less than or equal to A connecting them.
Thus at a given threshold level A the single-link clusters consist of the sets of
objects (clusters) for which such chains exist for all pairs of objects within the
same cluster. No such chains exist between objects in different clusters. Therefore
the clusters will be non-overlapping at a given level A.

Table 1
A dissimilarity matrix for 6 objects
i    1    2    3    4    5    6
1 0.0 6.8 2.6 3.0 3.5 7.0
2 6.8 0.0 4.5 9.8 4.9 0.8
3 2.6 4.5 0.0 5.4 1.2 5.2
4 3.0 9.8 5.4 0.0 6.3 9.9
5 3.5 4.9 1.2 6.3 0.0 6.1
6 7.0 0.8 5.2 9.9 6.1 0.0

Table 2
Single-link hierarchical clustering scheme for the dissimilarity matrix given in Table 1
α    A_α    Clustering C_α
0    0.0    {1}, {2}, {3}, {4}, {5}, {6}
1    0.8    {1}, {2,6}, {3}, {4}, {5}
2    1.2    {1}, {2,6}, {3,5}, {4}
3    2.6    {1,3,5}, {2,6}, {4}
4    3.0    {1,3,5,4}, {2,6}
5    4.5    {1,3,5,4,2,6}

A clustering C_α is a partition of the n objects into k_α mutually exclusive sets
(clusters) C_1, C_2, ..., C_c, ..., C_{k_α}.
The final result of a single-link cluster analysis is what Johnson (1967) called a
hierarchical clustering scheme (HCS), which is a sequence of distinct clusterings
C_0, C_1, ..., C_α, ..., C_ω, where 0 ≤ ω ≤ n - 1. C_0 is the weakest clustering, which has n
clusters each containing only a single object, and C_ω is the strongest clustering, in
which all objects are united into a single cluster. These clusterings are nested,
meaning that every cluster in C_{α+1} is either a cluster in C_α or is the union of two or
more clusters in C_α. Associated with each clustering C_α is a number A_α
(0 ≤ A_{α-1} ≤ A_α ≤ A_{α+1}) representing the clustering level or intensity of the clustering. The
single-link hierarchical clustering scheme for the data in Table 1 is shown in
Table 2. The ordering of the clusters in each row of this table is arbitrary.
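As a small worked illustration (not part of the original text), the following Python sketch applies the chain criterion directly to the dissimilarity matrix of Table 1: two objects fall in the same single-link cluster at level A exactly when they lie in the same connected component of the graph whose edges are the pairs with dij ≤ A. The dictionary D and the function names are our own choices.

# Single-link clusters of the Table 1 data at a given level, obtained by growing
# chains (connected components of the threshold graph with edges dij <= level).

D = {  # upper-triangular dissimilarities from Table 1 (objects 1..6)
    (1, 2): 6.8, (1, 3): 2.6, (1, 4): 3.0, (1, 5): 3.5, (1, 6): 7.0,
    (2, 3): 4.5, (2, 4): 9.8, (2, 5): 4.9, (2, 6): 0.8,
    (3, 4): 5.4, (3, 5): 1.2, (3, 6): 5.2,
    (4, 5): 6.3, (4, 6): 9.9,
    (5, 6): 6.1,
}

def d(i, j):
    return 0.0 if i == j else D[(min(i, j), max(i, j))]

def single_link_clusters(level, n=6):
    """Connected components of the threshold graph at the given level."""
    unvisited = set(range(1, n + 1))
    clusters = []
    while unvisited:
        stack = [unvisited.pop()]
        comp = set(stack)
        while stack:                      # grow chains whose largest link is <= level
            i = stack.pop()
            reachable = {j for j in unvisited if d(i, j) <= level}
            unvisited -= reachable
            comp |= reachable
            stack.extend(reachable)
        clusters.append(sorted(comp))
    return sorted(clusters)

print(single_link_clusters(2.6))   # [[1, 3, 5], [2, 6], [4]]  (row A = 2.6 of Table 2)
print(single_link_clusters(4.5))   # [[1, 2, 3, 4, 5, 6]]      (row A = 4.5 of Table 2)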
Monotonic transformations of the dissimilarity coefficient (such as dij^2 or
log dij) affect the clustering levels but do not affect the nested system of clusters
themselves. Thus only the rank order of the input dissimilarities is required in
order to determine the single-link clusters. The single-link clustering method is
therefore monotone invariant.

2" '

°1
i I

6
h
Fig. 1. Single-link dendrogram for the hierarchical clustering scheme given in Table 2.

An HCS can always be displayed as a dendrogram, a tree-like diagram in which
the n objects are represented as terminal 'twigs.' The sequence of merging of
clusters is represented by the fusion of twigs into branches and finally into a
single trunk. The dendrogram for the hierarchical clustering scheme given in
Table 2 is shown in Fig. 1. More extensive examples are given in Sneath and
Sokal (1973).
The computational effort required by an algorithm is expressed using the
notation O(f(n)), which means that as n → ∞ the running time will eventually be
proportional to f(n). This measure of effort may not be an accurate guide to the
relative costs of two algorithms for small values of n. Two algorithms with the
same order of effort can differ in the coefficients of the terms in the function f(n),
so these coefficients must also be considered in deciding which algorithm should be
implemented for a practical application.

3. Algorithms

The different algorithms presented below are classified into five different types
of approach.

3.1. Connectedness algorithms


These are the simplest single-link algorithms. If we represent the n objects as n
vertices in an abstract graph, and connect all pairs (i, j) of vertices with an edge
if and only if dij ≤ A, then the single-link clusters at level A correspond to the
vertices in the connected subgraphs of this graph.
Van Groenewoud and Ihm (1974) present the following algorithm, which they
credit to Berge (1966), for finding these subgraphs.

ALGORITHM 1
(a) Set the clustering level A to an initial value: A_0 → A. [A = min{dij; i ≠ j} is
the smallest value of interest.]
(b) Define a connection matrix A = (aij) such that aij = 0 if dij > A and aij = 1
if dij ≤ A.
(c) Raise the matrix A to a power m such that A^m = A^(m+1). [Then the (i, j)th
element of A^m equals unity if and only if the ith and jth objects belong to the
same connected subgraph (single-link cluster). All other elements are equal to
zero.]
(d) Repeat steps (b) and (c) for a larger value of A. [One could increment A by
a fixed amount as suggested by van Groenewoud and Ihm (1974) or (in order to
obtain nonredundant solutions) one can use the smallest value of dij such that i
and j have not yet been placed into the same cluster. At most n - 1 distinct A
values are required.]
The straightforward implementation of such an algorithm implies considerable
computational effort since the multiplication of two matrices requires effort
proportional to n^3. For a given value of A the effort will be determined by the
chain of maximum cardinality. This chain would be found at the highest cluster-
ing level, A_ω. Thus for a perfectly symmetrical dendrogram one would have to
raise A to the (n/2)th power. For a skewed dendrogram the worst case would be if
the highest fusion resulted in a single object being added to a cluster of
cardinality n - 1. Since there are at most n - 1 distinct clustering levels, the total
effort is O(n^5).
The algorithm given above has the desirable property that arbitrary rules for
treating ties in the input dissimilarities are not required as in some of the
algorithms given below. It is also easy to understand. A computer program exists
for this algorithm (COMPCON 5, van Groenewoud and Ihm, 1974).
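The following is a minimal numpy sketch of Algorithm 1 for a single clustering level, written from the description above (it is not the COMPCON 5 program). Instead of raising the connection matrix to successive powers it squares it repeatedly, which converges to the same limiting matrix; row i of the stabilized matrix then lists the members of the single-link cluster containing object i.

import numpy as np

def connectedness_clusters(D, level):
    """Algorithm 1, steps (b)-(c): single-link clusters at one level from powers
    of the 0/1 connection matrix."""
    n = D.shape[0]
    A = (D <= level).astype(int)          # connection matrix: aij = 1 iff dij <= level
    np.fill_diagonal(A, 1)
    M = A.copy()
    while True:
        M_next = np.minimum(M @ M, 1)     # squaring doubles the admissible chain length
        if np.array_equal(M_next, M):     # reachability has stabilized
            break
        M = M_next
    seen, clusters = set(), []
    for i in range(n):
        if i not in seen:
            members = set(int(j) for j in np.flatnonzero(M[i]))
            seen |= members
            clusters.append(sorted(j + 1 for j in members))
    return clusters

D = np.array([[0.0, 6.8, 2.6, 3.0, 3.5, 7.0],     # Table 1
              [6.8, 0.0, 4.5, 9.8, 4.9, 0.8],
              [2.6, 4.5, 0.0, 5.4, 1.2, 5.2],
              [3.0, 9.8, 5.4, 0.0, 6.3, 9.9],
              [3.5, 4.9, 1.2, 6.3, 0.0, 6.1],
              [7.0, 0.8, 5.2, 9.9, 6.1, 0.0]])
print(connectedness_clusters(D, 2.6))             # [[1, 3, 5], [2, 6], [4]]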
Davies (1971) gives the following algorithm for determining the disconnected
subgraphs at a given clustering level A (steps (b) and (c) in Algorithm 1 above).
He also provides a FORTRAN program ('program 46') for single-link cluster
analysis (his subroutine SILINK corresponds to Algorithm 2).

ALGORITHM 2
(a) Initialize: 0 → P_i (i = 1, ..., n), 0 → s, and 1 → i. [P_i will contain the index of
the cluster to which object i belongs.]
(b) If P_i ≠ 0, then: go to step (f) [object i has already been clustered]; else:
continue.
(c) If there exists a dij < A (j = i + 1, ..., n), then: go to step (d) [object i is
connected to at least one other object]; else: go to step (f).
(d) A new cluster containing more than a single object has been found:
s + 1 → s, s → P_i.
(e) While there exists an object k (k = i + 1, ..., n) not yet assigned to cluster s
such that djk < A for some j with P_j = s (j = 1, ..., n; j ≠ k): s → P_k. [This step
finds all objects connected to the present members of cluster s by repeatedly
searching the dissimilarity matrix and adding new objects to the cluster.]
(f) If i < n, then: i + 1 → i and go to step (b); else: stop.

At termination P_i = 0 if object i belongs to a singleton cluster, while objects i
and j belong to the same cluster if and only if P_i = P_j ≠ 0. The critical part of this
algorithm is in step (e), in which the effort can be O(n^3). However, for favorable
data sets this can be reduced to only O(n^2). Thus the total effort in performing a
single-link cluster analysis will be O(n^4) or O(n^5). Even in the worst case the
computer time required should be much less than for Algorithm 1.

3.2. Algorithms based on an ultrametric transformation


Jardine and Sibson (1971) view cluster analysis as being a transformation which
maps an input dissimilarity matrix D into an output ultrametric dissimilarity
matrix U. They show that this transformation has certain desirable continuity

Table 3
Ultrametric distance matrix for the hierarchical clustering scheme given in Table 2
i 1 2 3 4 5 6
1 0.0 4.5 2.6 3.0 2.6 4.5
2 4.5 0.0 4.5 4.5 4.5 0.8
3 2.6 4.5 0.0 3.0 1.2 4.5
4 3.0 4.5 3.0 0.0 3.0 4.5
5 2.6 4.5 1.2 3.0 0.0 4.5
6 4.5 0.8 4.5 4.5 4.5 0.0

properties only for the single-link method. Given an HCS, the ultrametric distance
uij between all pairs of objects i and j is defined as follows. Let uij = A_α, where α
is the smallest integer such that in clustering C_α objects i and j are in the same set
(0 ≤ α ≤ n - 1). Then U = (uij) is a matrix of ultrametric distances, that is, uii = 0
and 0 ≤ uij ≤ max{uik, ujk} for all triples of objects i, j, and k. This is a stronger
condition than the usual metric condition, since max{uik, ujk} ≤ uik + ujk. This
relationship between clusters and ultrametrics is discussed in Jardine et al. (1967),
Johnson (1967), and Hartigan (1967).
hierarchical clustering scheme given in Table 2 is shown in Table 3.
The single-link clusters for a given data set can readily be determined from an
ultrametric matrix (objects i and j belong to the same cluster at level A if and
only if uij ≤ A). Thus another algorithmic approach to single-link clustering is to
transform D to U, and then recover the single-link clusters from U.
The following single-link ultrametric transformation algorithm is a special case
of Jardine and Sibson's (1968) algorithm for their B_k (fine) clustering methods
(B_1 corresponds to the single-link method).

ALGORITHM 3
(a) Consider all possible triplets of distinct objects, i, j, and k. For each such
triplet of objects determine the largest, d', and second largest, d'', dissimilarities
between them. If d' > d'', then: replace the d' value in the dissimilarity matrix
with d''; else [d' = d'']: leave the dissimilarities unchanged.
(b) Repeat step (a) with the updated dissimilarity matrix until no dissimilarity
values are changed. [At completion the dissimilarity matrix will have been
transformed into an ultrametric matrix corresponding to the single-link clustering
method.]

The effort for this algorithm is O(mn^3) where m is the number of repetitions of
step (a). Cole and Wishart (1970) state that the value of m is usually between 3
and 5. While this algorithm requires considerable effort, it is more efficient than
Algorithm 1 or Algorithm 2. Cole and Wishart (1970) present a number of
improvements in this algorithm which reduce needless checks of triplets of objects
which ultimately require no adjustment in their dissimilarity values. Another
important innovation is first sorting the dissimilarities (O(n^2 log n) effort) and
then considering the triplets in sorted order. The FORTRAN computer program
KDEND is available for this algorithm (Cole and Wishart, 1970).
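The following Python sketch shows the basic triplet-replacement transformation of Algorithm 3 without the refinements of Cole and Wishart (1970); it is not the KDEND program, and the function name is our own. Applied to the matrix of Table 1 it converges to the ultrametric matrix of Table 3.

from itertools import combinations

def ultrametric_transform(D):
    """Algorithm 3: in every triple of objects, replace the largest of the three
    dissimilarities by the second largest; repeat until nothing changes.  The result
    is the single-link ultrametric matrix U."""
    U = [row[:] for row in D]            # work on a copy of the dissimilarity matrix
    n = len(U)
    changed = True
    while changed:                       # step (b): repeat step (a) until stable
        changed = False
        for i, j, k in combinations(range(n), 3):
            pairs = [(i, j), (i, k), (j, k)]
            vals = sorted(U[a][b] for a, b in pairs)
            largest, second = vals[2], vals[1]
            if largest > second:         # d' > d'': lower d' to d''
                for a, b in pairs:
                    if U[a][b] == largest:
                        U[a][b] = U[b][a] = second
                changed = True
    return U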
Rohlf (1973a) proposed another algorithm for B_k (fine) clustering in which the
elements in the dissimilarity matrix are initially sorted, but sets of objects (triplets
in the case of B_1) are not explicitly considered. For the single-link method (B_1) it
reduces to the following algorithm.

ALGORITHM 4
(a) Sort the elements of the upper half portion, excluding diagonal elements, of
the dissimilarity matrix D in ascending order into array L. Clear the n by n matrix
U. Let C = C_0 be the weakest clustering, where each cluster C_k contains only the
single object k. 0 → l.
(b) Set l + 1 → l and let L_l = dij [the next, larger, dissimilarity value from the
sorted array].
(c) If uij has already been defined, then: go to step (b); else: continue.
(d) Let C_a and C_b represent the clusters to which objects i and j belong,
respectively. Then dij → u_cd for all objects c ∈ C_a and d ∈ C_b. Update C:
C_a ∪ C_b → C_a and then remove C_b from C.
(e) If all elements of U have not been defined (i.e., C consists of more than a
single cluster), then: go to step (b); else: stop.

The principal effort required by this algorithm is that of sorting D, which is
O(n^2 log n). Note that with this algorithm one does not have to apply a separate
algorithm to determine the single-link clusters from U. Each time one passes
through step (d), C contains the single-link clusters at level A = dij. The FORTRAN
program BKG is available for this algorithm (Rohlf, 1973a). The FORTRAN pro-
gram ALLIN1 (Anderberg, 1973) represents an equivalent approach.
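A sketch of Algorithm 4 follows, with a simple membership list standing in for the clustering C; it is written from the steps above and is neither the BKG nor the ALLIN1 program. Each time the merge step runs, the sets held in clusters are the single-link clusters at level A = dij, as noted above.

def sorted_edge_ultrametric(D):
    """Algorithm 4: sort the dissimilarities once, sweep through them, and whenever
    uij is still undefined merge the two clusters involved, filling in the ultrametric
    value for every pair of objects that the merger brings together."""
    n = len(D)
    U = [[0.0 if i == j else None for j in range(n)] for i in range(n)]
    clusters = {i: {i} for i in range(n)}        # C_0: one cluster per object
    member = list(range(n))                      # cluster label of each object
    edges = sorted((D[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    for dij, i, j in edges:                      # step (b): next (larger) value
        if U[i][j] is not None:                  # step (c): uij already defined
            continue
        a, b = member[i], member[j]
        for c in clusters[a]:                    # step (d): dij -> u_cd for c in C_a, d in C_b
            for e in clusters[b]:
                U[c][e] = U[e][c] = dij
        clusters[a] |= clusters[b]               # C_a u C_b -> C_a, then remove C_b
        for e in clusters[b]:
            member[e] = a
        del clusters[b]
        if len(clusters) == 1:                   # step (e): all of U has been defined
            break
    return U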

3.3. Probability density estimation algorithms


An intuitively appealing approach for finding clusters is to estimate the
multivariate probability density function, p.d.f., over the p-dimensional space. The
modes in the estimated p.d.f. then correspond to the regions of the space with
the higher densities (clusters) and the valleys can be interpreted as the regions of
separation (gaps) between the clusters.
Shaffer et al. (1979) presented a 'mode-seeking' algorithm. They defined the
probability density D_i in the region of the space near an object i as simply the
cardinality of its neighborhood. The neighborhood N_i of an object i is defined as
the set of objects within a fixed radius ρ of object i.

ALGORITHM 5

(a) Clear sets P and Q. Form a set S consisting of all objects i such that D_i ≠ 0,
where D_i = |N_i|. [P will contain objects whose neighborhoods are to be consid-
ered, and Q will contain the objects in the current cluster.]
(b) Select an object i ∈ S as the first member of a new cluster Q. S - {i} → S,
{i} → Q, and N_i → P.
(c) Find the object j ∈ P such that D_j is a maximum [j has the most dense
neighborhood of the objects in P].
(d) Add to P the objects in the neighborhood of j: P ∪ N_j - {j} → P, S - {j} →
S, Q + {j} → Q.
(e) If |P| > 0, then: go to step (c); else: output cluster Q.
(f) If |S| > 0, then: go to step (b); else: stop.

If one plots the estimated densities D_j for the neighborhood of each object j as
the j's are added to cluster Q, they will rise to a maximum (thereby indicating
which objects are closest to the estimated mode of the cluster) and then fall off as
we move towards a valley. When the set P is empty a valley has been reached
which has zero density. A new starting object which does not belong to a mode
that has already been found is then selected at step (b). This is repeated until all
modes have been found for the given radius ρ. The procedure can then be
repeated with a new (larger) value for ρ.
If the criterion that a cluster is a 'dense' region of space separated by a gap of
zero density from other such clusters is used, then the above algorithm will find
the single-link clusters corresponding to a clustering level A = ρ. Note: other
definitions of density for the p.d.f. (e.g., by using a Gaussian kernel) do not, in
general, lead to single-link clusters (however, Hartigan, 1977, shows that asymp-
totically the clusters formed using these two definitions of 'density' converge). It
should also be pointed out that the usual approach to density estimation
clustering does not involve defining a cluster based on a minimum threshold of
density (density contour clustering) as proposed by Shaffer et al. (1979) but uses
valleys to separate modes (density gradient clustering). Katz and Rohlf (1973)
defined two objects to belong to the same cluster if and only if the path from each
point following the steepest gradient in the p.d.f. leads to the same peak in the
p.d.f. Kittler (1976, 1979) used an intuitive assessment of the depth of the valley
in order to decide whether two modes are sufficiently distinct. It is not clear from
their algorithm how sure one can be that the depths of the valleys displayed on
their plots reflect the depths of the valleys between adjacent modes.
The principal effort required by these algorithms is in computing the density
around each object. If the number of objects within a fixed radius of each object
is found by a direct search (O(n^2)), then the effort can be O(n^3) for each value of
ρ (the exact effort depends upon the size of the neighborhoods and the method
used to update the sets P and S).
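The sketch below implements the mode-seeking procedure for one value of the radius ρ, written from the steps above rather than from the program of Shaffer et al. (1979). Densities are neighborhood cardinalities, and one small guard has been added in step (d), namely that objects already placed in Q are not returned to the frontier set P, so that the sketch terminates; the variable names are ours.

def mode_seeking_clusters(D, rho):
    """Algorithm 5 (sketch): grow each cluster by repeatedly absorbing, from the
    frontier set P, the object whose neighborhood (objects within radius rho) is
    densest.  Under the zero-density-gap criterion these are the single-link clusters
    at level rho; isolated objects are left out of S, as in step (a)."""
    n = len(D)
    N = [{j for j in range(n) if j != i and D[i][j] <= rho} for i in range(n)]
    dens = [len(N[i]) for i in range(n)]          # D_i = |N_i|
    S = {i for i in range(n) if dens[i] > 0}      # step (a)
    found = []
    while S:                                      # step (f)
        i = S.pop()                               # step (b): seed a new cluster Q
        Q, P = {i}, set(N[i])
        while P:                                  # step (e)
            j = max(P, key=lambda k: dens[k])     # step (c): densest neighborhood in P
            P = (P | N[j]) - {j} - Q              # step (d), with the extra '- Q' guard
            S.discard(j)
            Q.add(j)
        found.append(sorted(Q))                   # output cluster Q
    return found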

3.4. Agglomerative algorithms


The agglomerative algorithms for single-link cluster analysis are the best
known. Sneath and Sokal (1973) give a good general account of this approach for
what they call sequential, agglomerative, hierarchic, nonoverlapping (SAHN)
clustering methods (actually these are, of course, algorithms). They give the
following algorithm for the single-link method.

ALGORITHM 6
(a) Let C = C_0 be the weakest clustering, where each cluster C_k contains only a
single object k. [The dissimilarity matrix will be interpreted as a matrix of
dissimilarities between the corresponding clusters.]
(b) Find a pair (not necessarily unique) of clusters (C_a, C_b) which are least
dissimilar (i.e., d_ab is minimal): d_ab → A.
(c) C_a ∪ C_b → C_a, and delete C_b from C. The A value is saved as the clustering
level at which the new cluster was formed.
(d) Repeat steps (b) and (c) for all pairs of clusters (if any) with the same
minimal dissimilarity (i.e., allow for ties).
(e) Recalculate the dissimilarity between the new cluster C_a and each of the
other clusters C_i: it is computed as min{d_jk : j ∈ C_a, k ∈ C_i}.
(f) Repeat steps (b) through (e) until C = C_ω (i.e., C consists of only a single
cluster of cardinality n).

This general algorithm can be implemented in many ways. One simple scheme
uses an n by n matrix of dissimilarities between all pairs of objects. When a pair
of objects (e.g., i and j) is merged, the corresponding rows and columns i and j are
deleted from the matrix and a new row and column i' corresponding to the
resulting cluster is added. Thus in step (c) the row- and column-dimension of the
matrix is decreased by one each time there is a merger. Johnson (1967) called the
single-link clustering method the 'minimum method' due to the fact that the min
function is used in step (e).
A c by c matrix of dissimilarities among the c currently existing clusters must
be examined in order to find the smallest dissimilarity for each of the n - 1
clustering levels. The value of c is initially n and is reduced by unity as the
clusters merge. Therefore the total number of dissimilarity coefficients which need
to be considered is O(n^3).
A number of computer programs exist for this algorithm (e.g., JOIN, Hartigan,
1975; Johnson, 1967; CLASS and CENTBEV, Lance and Williams, 1967; CLSTR,
Anderberg, 1973).
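A minimal sketch of Algorithm 6 in its simplest O(n^3) form is given below; it applies the min update of step (e) to a shrinking table of between-cluster dissimilarities. For brevity it merges one pair per pass instead of handling equal minima simultaneously as in step (d); for the single-link method this yields the same system of clusters. It is not one of the programs just listed.

def sahn_single_link(D):
    """Algorithm 6 (simplest implementation): repeatedly merge the two least dissimilar
    clusters and update the between-cluster dissimilarities with the min rule of step (e).
    Returns the merges as (level, members_of_a, members_of_b) in the order performed."""
    clusters = {i: [i] for i in range(len(D))}                       # C_0
    dist = {(i, j): D[i][j] for i in range(len(D)) for j in range(i + 1, len(D))}
    merges = []
    while len(clusters) > 1:
        a, b = min(dist, key=dist.get)                               # step (b)
        merges.append((dist[(a, b)], clusters[a][:], clusters[b][:]))
        for k in clusters:                                           # step (e): min update
            if k not in (a, b):
                key_a, key_b = tuple(sorted((a, k))), tuple(sorted((b, k)))
                dist[key_a] = min(dist[key_a], dist[key_b])
        clusters[a].extend(clusters[b])                              # step (c)
        del clusters[b]
        dist = {pair: v for pair, v in dist.items() if b not in pair}
    return merges

Applied to the matrix of Table 1 (with objects numbered from 0), the recorded merge levels come out as 0.8, 1.2, 2.6, 3.0 and 4.5, matching the clustering levels of Table 2.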
The computational effort can be reduced considerably by using the fact that
when updating the dissimilarity matrix in step (e), the dissimilarity between two
objects (or clusters) is not affected by mergers which only involve other clusters.
Thus the entire matrix need not be searched in step (b) but rather only the few
'local' changes made as a result of the last fusion. In particular, when two clusters
C a and C b are merged, a list of the nearest neighbors of each cluster must be
updated only for those clusters which were nearest neighbors of either cluster a or
cluster b. These can conveniently be located in step (e) as described by Rohlf
(1977).

The algorithm can also be improved by locating all mutually closest pairs of
clusters in step (b), rather than just the one pair of clusters with minimum
dissimilarity. All such pairs can then be merged simultaneously in step (c).
Clusters C_a and C_b are mutually closest if cluster C_a has its least dissimilarity with
cluster C_b and cluster C_b has its least dissimilarity with cluster C_a. There must be
at least one such pair. For the dissimilarity matrix given in Table 1, two mutually
closest pairs (3-5 and 2-6) would be found during the first pass through step (b).
If there are many such pairs found each time through step (b), then the effort will
approach O(n^2). However, if the final dendrogram is very asymmetrical, only a
few mutually close pairs of points will be found each time and the effort will
remain close to O(n^3). Modifications of this algorithm are also available such that
the average effort is expected to remain very close to O(n^2) even for such data
sets (see, for example, Rohlf, 1977).
Even more efficient algorithms, such as one developed by Sibson (1973), result
from the fact that only local changes in the reduced dissimilarity matrix result
from the merging of two clusters. Given an initial single-link clustering for m < n
objects, Sibson (1973) developed a method for determining the changes required
in a dendrogram in order to correctly add an additional object. This makes it
possible to start with a dendrogram consisting of only a single object and build
the final dendrogram recursively by adding the remaining n - 1 objects one at a
time in an arbitrary order. His algorithm is as follows.

ALGORITHM 7

(a) Initialize: 1 → H_1, ∞ → A_1, and 1 → m. [H and A constitute a pointer
representation of a dendrogram, see below.]
(b) Set m + 1 → H_{m+1} and ∞ → A_{m+1}. Set d_{i,m+1} → M_i (for i = 1, ..., m).
(c) Set 1 → i.
(d) If A_i ≥ M_i, then: set min{M_{H_i}, A_i} → M_{H_i}, M_i → A_i, and m + 1 → H_i; else
(A_i < M_i): set min{M_{H_i}, M_i} → M_{H_i}.
(e) If i < m, then: set i + 1 → i and go to step (d); else: continue.
(f) 1 → i.
(g) If A_i ≥ A_{H_i}, then: m + 1 → H_i.
(h) If i < m, then: set i + 1 → i and go to step (g); else: continue.
(i) If m < n, then: set m + 1 → m and go to step (b); else: stop.

In the above algorithm (H and A) constitute a pointer representation of a
dendrogram. A_i is the clustering level at which object i is no longer the last listed
object in its cluster and H_i is the last object in the cluster which object i joins. The
array M is used for intermediate storage. While this pointer representation is
perhaps not particularly convenient from the user's point of view (see the example
in Table 4), it is computationally quite efficient. In fact the computational effort
to perform the algorithm for any data set is clearly just O(n^2). Furthermore, the
algorithm has the convenient property that only one row of the dissimilarity
matrix is needed during an iteration of step (b). Thus the entire n by n

Table 4
Pointer representation of the hierarchical clustering scheme given in Table 2
i     1    2    3    4    5    6
H_i   5    6    5    6    4
A_i   2.6  0.8  1.2  4.5  3.0

dissimilarity matrix need not be placed in fast access storage during execution,
giving the algorithm a considerable advantage when n is very large. Sibson (1973)
furnishes a FORTRAN computer program, SLINK1, for the above algorithm and
another, SLINK2, for transforming the pointer representation into a more conveni-
ent 'packed' form (see below).
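The recursion of Algorithm 7 is compact enough to transcribe directly; the sketch below is written from the steps above and is not Sibson's SLINK1. Here dist is any function returning the dissimilarity of two (0-indexed) objects, so that only the single row of dissimilarities to the object currently being inserted is ever needed.

import math

def slink(dist, n):
    """Algorithm 7 (SLINK): build the pointer representation (H, A) by inserting the
    objects one at a time.  H[i] points to an object with a larger index and A[i] is
    the associated clustering level; A of the last object is infinite."""
    H = [0] * n
    A = [math.inf] * n
    M = [0.0] * n                    # scratch row of dissimilarities to the new object
    for m in range(1, n):            # insert object m into the dendrogram of 0..m-1
        H[m], A[m] = m, math.inf     # step (b)
        for i in range(m):
            M[i] = dist(i, m)
        for i in range(m):           # steps (c)-(e)
            if A[i] >= M[i]:
                M[H[i]] = min(M[H[i]], A[i])
                A[i], H[i] = M[i], m
            else:
                M[H[i]] = min(M[H[i]], M[i])
        for i in range(m):           # steps (f)-(h)
            if A[i] >= A[H[i]]:
                H[i] = m
    return H, A

A call such as slink(lambda i, j: D[i][j], len(D)) works on a stored matrix D, but dist may equally well compute each dissimilarity from the raw data when it is asked for, which is what makes the saving in fast access storage mentioned above possible.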
Williams et al. (1966) and Anderberg (1973) give the following simple algo-
rithm (which is very similar in its approach to Algorithm 4).

ALGORITHM 8
(a) Sort the entries in the upper triangular portion, excluding diagonals, of the
dissimilarity matrix into an array L. 0 → l. Let C = C_0 (the weakest clustering).
(b) l + 1 → l, let L_l = dij. [The next dissimilarity value from the sorted list.]
(c) If i and j already belong to the same cluster, then: go to step (b); else: merge
the two clusters to form a new cluster at clustering level A = dij.
(d) Repeat steps (b) and (c) until C = C_ω (a single cluster which contains all n
objects).

This algorithm requires O(n^2 log n) effort to initially sort the dissimilarity
matrix, followed by O(n^2) effort which is spent actually constructing the clusters.
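A sketch of Algorithm 8 in which the membership test of step (c) is carried out with a standard union-find (disjoint-set) structure; the names are ours and the code is not the published routine of Williams et al. or Anderberg.

def sorted_edge_single_link(D):
    """Algorithm 8: sweep the sorted dissimilarities; an entry whose two objects already
    belong to the same cluster is skipped, otherwise their clusters are merged at
    clustering level A = dij."""
    n = len(D)
    parent = list(range(n))

    def find(x):                       # root of x's cluster, with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    merges = []
    edges = sorted((D[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    for dij, i, j in edges:            # step (b)
        ri, rj = find(i), find(j)
        if ri == rj:                   # step (c): already in the same cluster
            continue
        parent[rj] = ri                # merge at level A = dij
        merges.append((dij, i, j))
        if len(merges) == n - 1:       # step (d): C = C_omega
            break
    return merges

The accepted entries are, incidentally, exactly the edges of a minimum spanning tree of the objects, which anticipates the connection exploited in the next section.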

3.5. Minimum spanning tree algorithms


There is a close relationship between a single-link cluster analysis and a
minimum length spanning tree, MST (Gower and Ross, 1969). A MST for a set of
n p-dimensional vertices (objects) consists of n - 1 edges (connections) joining
pairs of objects such that:
(a) No closed loops occur.
(b) Each object is connected to at least one other object by an edge.
(c) The length of the tree (the sum of the dissimilarities corresponding to the
particular n - 1 edges employed) is as small as possible.
A series of disconnected subgraphs is obtained by deleting all edges longer than
A. The sets of objects belonging to the same subgraph are the single-link clusters
at level A. Note: the 'minimal spanning tree' clustering method of Zahn (1971)
represents a different clustering method despite the similar name.
Gower and Ross (1969) give the following algorithm to compute single-link
clusters from a given MST.

ALGORITHM 9

(a) Let H be a list of the edges in the MST which are of length less than or
equal to a given value of A.
(b) Let C = C_0 be the weakest clustering. (Each cluster C_k contains only a single
object).
(c) Let (i, j) be the next edge (of length dij) from list H (the order in which
they are considered is arbitrary).
(d) Let C_a and C_b be the clusters to which objects i and j belong. Set
C_a ∪ C_b → C_a, and then delete C_b from C.
(e) Repeat steps (c) and (d) until list H is empty. C will then contain the
single-link clusters at level A.

Repeated application of the above algorithm to obtain the clusters at all n - 1
levels would require O(n^2) effort. The usual algorithms for computing MST's
from a dissimilarity matrix (e.g., Dijkstra, 1959; Kruskal, 1956; Prim, 1957;
Whitney, 1972; Rohlf, 1973b) require O(n^2) effort. Thus the total effort required
by the strategy of first computing a MST and then the single-link clusters is just
O(n^2). Ross (1969) furnishes an ALGOL computer program for the above proce-
dure. Rohlf (1973b) suggests some improvements in the above algorithm and
furnishes a FORTRAN subroutine for computing single-link clusters from a given
MST.
Rohlf (1973b) also shows that it is more efficient to modify the data structures
used in the MST algorithms so that the single-link clusters are built as the MST is
being computed (thus saving an unnecessary step). The effort is also O(n^2) but
the constants are much smaller. The 'trick' is simply to save the edges in the MST
in the order in which they are found by an algorithm such as that of Prim (1957)
in which one adds vertices one at a time to a growing fragment of the MST.
Consider the example shown in Table 5 (each column of this table corresponds to
an edge in the MST). The MST starts (arbitrarily) with object 1. Row j gives the
new object which is linked to an object already part of the fragment of the MST.
Row dij gives the length of the edges. The objects in a single-link dendrogram can
be listed in the order 1, 3, 5, 4, 2, and 6 (i.e., one takes the first object from row i
followed (in order) by all of the objects in row j). Row dij lists the clustering
levels for the dendrogram in the order in which they will be needed to produce a
dendrogram (they give the lengths of the horizontal lines in Fig. 1 (see Section 2)
in the order from top to bottom in which they appear in that figure). Sibson
(1973) calls this the packed representation (an example is given in Table 6).

Table 5
Minimum spanning tree for the dissimilarity matrix given in Table 1
i 1 3 1 3 2
j 3 5 4 2 6
dij 2.6 1.2 3.0 4.5 0.8

Table 6
Packed representation of the hierarchical clustering scheme given in Table 2
i 1 3 5 4 2 6
2.6 1.2 3.0 4.5 0.8

Rohlf (1973b) gives a FORTRAN subroutine, DEND, which creates a simple line printer
plot of the dendrogram from this packed representation with only O(n) effort.
Rohlf (1974, 1975) gives a FORTRAN subroutine for plotting a dendrogram which
requires slightly more effort since it centers the 'stems' under the branches.
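The strategy just described can be sketched as follows: an MST is grown from the first object by Prim's method and the edges are saved in the order in which they are found, so the vertex order and the edge lengths form the packed representation directly. Applied to the matrix of Table 1 (with objects numbered from 0) it returns the order 0, 2, 4, 3, 1, 5 and the levels 2.6, 1.2, 3.0, 4.5, 0.8, i.e., the contents of Tables 5 and 6. It is a sketch written from the description above, not Rohlf's subroutine.

import math

def prim_packed(D):
    """Prim's algorithm started at object 0; the vertices in order of attachment and the
    lengths of the attaching edges give the packed representation of the dendrogram."""
    n = len(D)
    in_tree = [False] * n
    best = [math.inf] * n              # shortest known distance to the growing fragment
    order, levels = [0], []            # packed representation: object order and levels
    in_tree[0] = True
    for j in range(1, n):
        best[j] = D[0][j]
    for _ in range(n - 1):
        j = min((v for v in range(n) if not in_tree[v]), key=lambda v: best[v])
        order.append(j)                # new object linked to the growing fragment
        levels.append(best[j])         # length of the MST edge, i.e. the clustering level
        in_tree[j] = True
        for k in range(n):             # update distances to the enlarged fragment
            if not in_tree[k] and D[j][k] < best[k]:
                best[k] = D[j][k]
    return order, levels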
Many papers have been published which are concerned with efficient algo-
rithms for constructing the MST's themselves. Recent work has been concerned
with efforts to reduce the computational effort below O(n^2). Clearly this is only
possible if some proportion of the n(n - 1)/2 dissimilarities can be ignored as
being irrelevant. The larger dissimilarities are not of interest since the MST
contains only edges connecting near neighbors. When only a small proportion of
the n(n - 1)/2 dissimilarities are defined (the undefined coefficients are consid-
ered infinite), then the following algorithm by Yao (1975) is useful.

ALGORITHM 10
(a) For each object i partition the remaining objects into k ordered sets
S_1, S_2, ..., S_k so that dij ≤ dil if j ∈ S_a, l ∈ S_b, and a < b. [This can be done by
repeatedly applying the linear median-finding algorithm of Blum et al. (1973).]
The number of sets, k, is taken as log n.
(b) Sollin's algorithm (a multi-fragment MST algorithm, see Berge and
Ghouila-Houri, 1962) is then applied, but with the dissimilarities in the higher
indexed sets considered only after those in the lower indexed sets have been
exhausted.

This algorithm requires O(m log log n) effort, where m ≤ n(n - 1)/2 is the
number of defined dissimilarities. Thus the effort to completely order all of the
dissimilarities is avoided since only the smaller dissimilarities are relevant to
the MST and hence to the single-link cluster analysis.

3.6. Coordinate spaces


In algorithms such as Algorithm 7 and the MST algorithms mentioned above,
each dij value is needed only once. Thus if the original data matrix is available at
execution time, each dij value can be computed as needed and need not be stored
unless it is part of the resulting MST. When the dimensionality of the space, p, is
much less than n, this can require considerably less computer storage and even a
savings in total running time if the D matrix would be too large to be stored in
the fast access memory of the computer.
In the algorithms given above all of the n(n - 1)/2 dissimilarities must be
considered at least once in order to find the nearest neighbors of the points.
Therefore the most efficient of such algorithms must expend O(n^2) effort (Sibson,
1973). All of the dissimilarities must be examined because the analyses make no
assumptions about the metric properties of the space, and it is not possible to
infer the magnitude of any dij by knowing any or all of the other dissimilarities.
In most applications of single-link cluster analysis this is unnecessary since in fact
most dissimilarity coefficients used allow such inferences. In the case of a metric
space, knowing d12 and d13 allows one to set an upper bound on the magnitude of
d23, since d23 ≤ d12 + d13. What is needed, however, is a lower bound on the
dissimilarity between two objects, since if one could infer that dij was greater than
A, then one could skip over the dissimilarity between objects i and j.
If the dissimilarity coefficients were computed from a p variable by n object
matrix X such that x_ki can be interpreted as the coordinate of a point (corre-
sponding to object i) along axis k, then more time-efficient algorithms are
possible. The choice of a dissimilarity coefficient for these algorithms is limited
only by the restrictions that

    dij ≥ max_k |x_ki - x_kj|

and that considering additional dimensions cannot cause dij to decrease. The
coefficients which satisfy these relationships include all Minkowski metrics

    dij = (Σ_k |x_ki - x_kj|^r)^(1/r),    r ≥ 1.


Hartigan's (1975) spiral search algorithm is a simple example of this approach
which makes use of the geometry of coordinate spaces. This technique can be
imbedded in many of the algorithms given above, wherever it is necessary to find
the nearest neighbor of a given object. The algorithm makes use of the fact that
objects which are adjacent in a sorted list for a given dimension are expected to
be near neighbors in the p-dimensional space.

ALGORITHM 11
(a) Let i denote the object for which the nearest neighbor is required.
(b) Initialize: create a table T giving the objects in rank order for each of the p
variables separately and an array R which will contain the dissimilarity from the
nearest neighbor to i in the kth dimension. A table of inverted lists, L, should
also be prepared so that one can directly look up the position of an object in T for
a given dimension. 0 → R_k, 1 → l_k, 1 → r_k (for k = 1 to p). [Pointers l_k and r_k
point to an object to the left and to the right of i (if such exist) in the sorted list
for dimension k.]
(c) Initialize: 1 → k, ∞ → d_min, ∞ → m. [k is the dimension being searched over,
m will be the nearest neighbor of object i at distance d_min.]
(d) Find the location a in T of object i for dimension k: L_ki → a.
(e) Find the two objects on either side of i in the kth dimension: T_{k,a-l_k} →
b, T_{k,a+r_k} → c (if a - l_k < 1 or a + r_k > n, then: b or c, respectively, is undefined
and dissimilarities to such undefined objects are set to ∞).
(f) If min(d_ib, d_ic) ≥ d_min, then: go to step (j) [objects b and c are both outside
of the search radius]; else: continue. [A nearer neighbor has been found.]
(g) Update radius of spiral: min{|x_ki - x_kb|, |x_ki - x_kc|} → R_k.
(h) If d_ib < d_ic, then: b → m, d_ib → d_min, l_k + 1 → l_k; else: c → m, d_ic → d_min,
r_k + 1 → r_k.
(i) If k < p, then: k + 1 → k, go to step (d); else: continue. [A new nearest
neighbor candidate has been found.]
(j) Compute the dissimilarity coefficient based on the differences in array R. If it
is larger than d_min, then: stop [object m is the nearest neighbor]; else: go to
step (c).

Steps (c) through (j) are repeated until it is impossible for there to be a point
closer to i than m (d_im = d_min). The search 'spirals' outward from point i due to
step (h). O(n log n) effort is spent during the initialization step sorting the data
and building the tables T and L of pointers, but for a given analysis one builds
the tables only once.
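To make the idea concrete, the following deliberately simplified sketch searches outward along a single sorted coordinate only (sometimes called a projection search); the stopping rule is valid because dij ≥ |x_ki - x_kj| for the coefficients admitted above. The full spiral search of Algorithm 11 interleaves such outward steps over all p dimensions and applies the combined bound of step (j). Euclidean distance and the names used are our own choices.

import math

def nearest_neighbor_projection(X, i, k=0):
    """Simplified spiral search: find the nearest neighbor of object i (Euclidean
    distance) by walking outward in the list of objects sorted on coordinate k;
    stop as soon as the coordinate gap alone rules out a closer point."""
    n, p = len(X), len(X[0])

    def dist(a, b):
        return math.sqrt(sum((X[a][t] - X[b][t]) ** 2 for t in range(p)))

    order = sorted(range(n), key=lambda v: X[v][k])      # table T for dimension k
    pos = order.index(i)                                 # inverted list L
    best, d_min = None, math.inf
    left, right = pos - 1, pos + 1
    while left >= 0 or right < n:
        candidates = ([order[left]] if left >= 0 else []) + \
                     ([order[right]] if right < n else [])
        # every still-unexamined object has a coordinate gap at least this large
        if min(abs(X[c][k] - X[i][k]) for c in candidates) >= d_min:
            break
        for c in candidates:
            dc = dist(i, c)
            if dc < d_min:
                best, d_min = c, dc
        left, right = left - 1, right + 1
    return best, d_min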
Two other general approaches have been suggested for finding nearest neigh-
bors which also avoid the computation of dissimilarities between all pairs of
objects. Bentley (1975) developed a multidimensional binary search tree, 'k-d tree'
(refers to a k-dimensional tree), which can be fitted to a given data set so that
nearest neighbors tend to be on nearby nodes of the binary tree. As in the spiral
search algorithm, dissimilarities are computed during construction of the tree only
for objects which are likely to be nearest neighbors. The second approach is to
partition the p-dimensional space into cells and then compute dissimilarities only
among points in the same or adjacent cells. The optimal approach among these
three depends upon the configuration of the points in the p-dimensional space.
The application of k-d trees to the construction of MST's is described in
Bentley and Friedman (1978) (one can easily compute single-link clusters from a
MST as described above). They give a detailed presentation of the algorithm and
the results of a simulation study showing that the average running time is
O(n log n) for the spherical normal distribution. There is, however, considerable
overhead at the preprocessing stage so that the algorithm represents an improve-
ment in total running time only for larger values of n. For p = 2 the break-even
point with respect to a simple MST algorithm such as that of Prim (1957) is at
n = 250. For p = 8 the break-even point is at n = 340 (Bentley and Friedman,
1978). Execution times for this algorithm are especially favorable for data sets
consisting of points which are uniformly distributed along the orthogonal coordi-
nate axes, while data sets with a few isolated clusters exemplify worst case
situations.
Many authors have suggested a preprocessing step in which the coordinate
space is partitioned into cells to facilitate searching (e.g., Yuval, 1975, 1976;
Bentley and Shamos, 1976; and Rabin, 1976). Given a fixed threshold of
dissimilarity, δ, Rabin (1976) developed a method for partitioning a p-dimen-
sional space into 2^p systems of p-dimensional cells of width 2δ on each side such
that all points at distance d ≤ δ would be found together in the same cell at least
once. Thus dissimilarity values need only be computed for pairs of points
contained in a common cell. Rabin (1976) also proposed that δ be estimated from
a random sample of points. Rohlf (1977, 1978) adapted this method to the
computation of MST's and single-link clusterings. In simulations, based on
samples from p-dimensional spherical normal distributions it was found that this
algorithm had average running times which increased more slowly than O(n log n).
The break-even point with respect to a simple MST algorithm such as Prim (1957)
for p = 2 was only n = 150, but for p = 4 it was n = 350. Thus the apparent
advantage of the use of cells versus k-d trees seems to decrease as p increases.
Rohlf (1977) showed that additional preprocessing of the data could improve the
performance of using cells. The partitioning of the space could be based, for
example, on only the k most important variables (in the sense of having the
largest variances). Or the data could be projected onto the first k principal
components. If the k dimensions used for the partitioning of the data into cells
explain most of the variance in the data and k is much smaller than p, then the
running time should be reduced considerably. The most favorable data configura-
tion for this approach would be that a large number of points uniformly
distributed in a low-dimensional subspace. As in the case of k-d trees the worst
case is for well-separated clusters of objects. From the point of view of applica-
tions to cluster analysis, these algorithms work best on data in which there are in
fact no clear clusters. A description of these algorithms for the computation of
MST's and single-link clusters is given in Rohlf (1977, 1978).
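As an illustration of the simplest form of the cell idea (not Rabin's 2^p shifted systems), the sketch below buckets the points into axis-parallel cells of width δ and generates only the pairs lying in the same or in adjacent cells; every pair at distance d ≤ δ is guaranteed to be among them for the Minkowski coefficients considered above. Feeding only these candidate pairs into a routine such as Algorithm 8 then gives the single-link clusters up to level δ without the larger dissimilarities ever being computed.

from collections import defaultdict
from itertools import product

def candidate_pairs(X, delta):
    """Grid preprocessing: return the pairs of points lying in the same cell of width
    delta or in one of the adjacent cells.  Any pair at (Minkowski) distance <= delta
    differs by at most delta in every coordinate and is therefore included."""
    p = len(X[0])
    cells = defaultdict(list)
    for idx, x in enumerate(X):
        cells[tuple(int(x[k] // delta) for k in range(p))].append(idx)
    pairs = set()
    for cell, members in cells.items():
        for offset in product((-1, 0, 1), repeat=p):     # the 3^p surrounding cells
            other = tuple(c + o for c, o in zip(cell, offset))
            for i in members:
                for j in cells.get(other, ()):
                    if i < j:
                        pairs.add((i, j))
    return pairs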
For the special case in which dij is the Euclidean distance in the plane (p = 2)
Shamos (1975) and Shamos and Hoey (1975) show that Voronoi diagrams can be
used to reduce the computational effort in finding the nearest neighbors of an
object. They describe an algorithm for finding a MST which in the worst case
requires computational effort of only O(n log n). Unfortunately, this approach
has not been extended to other dissimilarity coefficients or even to greater than
p = 2 dimensions.
This is an important area which needs further work. It appears that only
through the use of such techniques will it be feasible to cluster very large
(n > 1000) data sets.

Acknowledgment

This paper represents contribution No. 332 from the Program in Ecology and
Evolution at the State University of New York at Stony Brook. It was supported
in part by a grant DEB77-24611 from the National Science Foundation. I wish to
thank Sally Howe who critically read a draft of this paper and made many
valuable suggestions.

References
Anderberg, M. R. (1973). Cluster Analysis for Applications. Academic Press, New York.
Bentley, J. L. (1975). Multidimensional binary search trees used for associative searching. Comm.
ACM 18, 509-517.
Bentley, J. L. and Friedman, J. H. (1978). Fast algorithms for constructing minimal spanning trees in
coordinate spaces. IEEE Trans. Comput. 27, 97-105.
Bentley, J. L. and Shamos, M. I. (1976). Divide-and-conquer in multidimensional space. Proc. Eighth
ACM Symp. on the Theory of Computing, 220-230.
Berge, C. (1966). The Theory of Graphs and its Applications. Methuen, London.
Berge, C. and Ghouila-Houri, A. (1962). Programming, Games and Transportation Networks. Methuen,
London.
Blum, M., Floyd, R. W., Pratt, V. R., Rivest, R. L. and Tarjan, R. E. (1973). Time bounds for
selection. J. Comput. System Sci. 7, 448-461.
Cole, A. J. and Wishart, D. (1970). An improved algorithm for the Jardine-Sibson
method of generating overlapping clusters. Comput. J. 13, 156-163.
Davies, R. G. (1971). Computer Programming in Quantitative Biology. Academic Press, New York.
Dijkstra, E. W. (1959). A note on two problems in connection with graphs. Numer. Math. 1, 269-271.
Fisher, L. and van Ness, J. W. (1971). Admissible clustering procedures. Biometrika 58, 91-104.
Florek, K., Lukaszewicz, J., Perkal, J., Steinhaus, H. and Zubrzycki, S. (1951a). Sur la liaison et la
division des points d'un ensemble fini. Colloq. Math. 2, 282-285.
Florek, K., Lukaszewicz, J., Perkal, J., Steinhaus, H. and Zubrzycki, S. (1951b). Taksonomia
Wroclawska. Przegl. Antropol. 17, 193-211.
Gower, J. C. and Ross, G. J. S. (1969). Minimum spanning trees and single-linkage cluster analysis.
Applied Statistics 18, 54-64.
van Groenewoud, H. and Ihm, P. (1974). A cluster analysis based on graph theory. Vegetatio 29,
115-120.
Hartigan, J. A. (1967). Representation of similarity matrices by trees. J. Amer. Stat. Assoc. 62,
1140-1158.
Hartigan, J. A. (1975). Clustering Algorithms. Wiley, New York.
Hartigan, J. A. (1977). Distributional problems in clustering. In: J. van Ryzin, ed., Classification and
Clustering, 45-71. Academic Press, New York.
Jardine, C. J., Jardine, N. and Sibson, R. (1967). The structure and construction of taxonomic
hierarchies. Math. Biosci. 1, 173-179.
Jardine, N. and Sibson, R. (1968). The construction of hierarchic and non-hierarchic classifications.
Comput. J. 11, 177-184.
Jardine, N. and Sibson, R. (1971). Mathematical Taxonomy. Wiley, New York.
Johnson, S. C. (1967). Hierarchical clustering schemes. Psychometrika 32, 241-254.
Katz, J. O. and Rohlf, F. J. (1973). Function-point cluster analysis. Systematic Zool. 22, 295-301.
Kittler, J. (1976). A locally sensitive method for cluster analysis. Pattern Recognition 8, 23-33.
Kittler, J. (1979). Comments on "single-link characteristics of a mode-seeking clustering algorithm".
Pattern Recognition 11, 71-73.
Kruskal, J. B. Jr. (1956). On the shortest spanning subtree of a graph and the traveling salesman
problem. Proc. Amer. Math. Soc. 7, 48-50.
Lance, G. N. and Williams, W. T. (1967) A general theory of classificatory sorting strategies, Part I.
Hierarchical systems. Comput. J. 9, 373-380.
McQuitty, L. L. (1957). Elementary linkage analysis for isolating orthogonal and oblique types and
typal relevancies. Educ. Psychol. Meas. 17, 207-222.
van Ness, J. W. (1973). Admissible clustering procedures. Biometrika 60, 422-424.
Prim, R. C. (1957). Shortest connection networks and some generalizations. Bell System Tech. J. 36,
1389-1401.
Rabin, M. O. (1976). Probabilistic algorithms. In: J. F. Traub, ed., Algorithms and Complexity, 21-39.
Academic Press, New York.
Rohlf, F. J. (1973a). A new approach to the computation of the Jardine-Sibson Bk clusters. Comput.
J. 18, 164-168.
Rohlf, F. J. (1973b). Algorithm 76, Hierarchical clustering using the minimum spanning tree. Comput.
J. 16, 93-95.
Rohlf, F. J. (1974). Algorithm 81, dendrogram plot. Comput. J. 17, 89-91.
Rohlf, F. J. (1975). Note on Algorithm 81, dendrogram plot. Comput. J. 18, 90-92.
Rohlf, F. J. (1977). Computational efficiency of agglomerative clustering algorithms. Tech. Rept.
RC 6831. IBM T. J. Watson Research Center, Yorktown Heights, New York.
Rohlf, F. J. (1978). A probabilistic minimum spanning tree algorithm. Information Processing Lett. 7,
44-48.
Ross, G. J. S. (1969). Algorithms AS 13-15. Appl. Statist. 18, 103-110.
Shaffer, E., Dubes, R. and Jain, A. K. (1979). Single-link characteristics of a mode-seeking clustering
algorithm. Pattern Recognition 11, 65-70.
Shamos, M. I. (1975). Computational geometry. Ph. D. dissertation. Yale University.
Shamos, M. I. and Hoey, D. (1975). Closest-point problems. Proc. 16th Annual IEEE Symposium on
Foundations of Computer Science, 151-162.
Sibson, R. (1973). SLINK: an optimally efficient algorithm for the single-link cluster method. Comput.
J. 16, 30-34.
Sneath, P. H. A. (1957). The application of computers to taxonomy. J. Gen. Microbiol. 17, 201-226.
Sneath, P. H. A. and Sokal, R. R. (1973). Numerical Taxonomy. Freeman, San Francisco.
Whitney, V. K. M. (1972). Algorithm 422, minimal spanning tree. Comm. ACM 15, 273-274.
Williams, W. T., Lambert, J. M. and Lance, G. N. (1966). Multivariate methods in plant ecology. V.
Similarity analyses and information-analysis. J. Ecology 54, 427-445.
Yao, A. C. (1975). An O(|E| log log |V|) algorithm for finding minimum spanning trees. Information
Processing Lett. 4, 21-23.
Yuval, G. (1975). Finding near neighbors in k-dimensional space. Information Processing Lett. 3,
113-114.
Yuval, G. (1976). Finding nearest neighbors. Information Processing Lett. 5, 63-65.
Zahn, C. T. (1971). Graph-theoretical methods for detecting and describing gestalt clusters. IEEE
Trans. Comput. 20, 68-86.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2          13
©North-Holland Publishing Company (1982) 285-316

Theory of Multidimensional Scaling

Jan de Leeuw and Willem Heiser

1. The multidimensional scaling problem

1.1. MDS: broad and narrow

It is difficult to give a precise definition of MDS, because some people use the
term for a very specific class of techniques while others use it in a much more
general sense. Consequently it makes sense to distinguish between MDS in the broad sense and MDS in the narrow sense. MDS in the broad sense includes various forms of cluster analysis and of linear multivariate analysis, while MDS in the narrow sense represents dissimilarity data in a low-dimensional space.
People who prefer the broad-sense definition want to emphasize the close
relationships of clustering and scaling techniques. Of course this does not imply
that they are not aware of the important differences. Clustering techniques fit a
non-dimensional discrete structure to dissimilarity data, while narrow-sense MDS fits a
continuous dimensional structure. But both types of technique can be formalized
as representing distance-like data in a particular metric space by minimizing some
kind of loss function. The difference, then, is in the choice of the target metric
space; the structure of the two problems is very much alike. The paper pioneering
this point of view is Hartigan's (1967). In fact Hartigan proceeds the other way around: he takes clustering as the starting point and lets clustering in the broad
sense include narrow-sense MDS. The same starting point and the same order are
more or less apparent in the important review papers of Cormack (1971) and
Sibson (1972). An influential broad-sense paper that uses narrow-sense MDS as
the starting point is that of Carroll (1976). In fact Carroll discusses techniques
that explicitly combine aspects of clustering and narrow-sense MDS, and find
mixed discrete/continuous representations. In a very recent paper Carroll and
Arabie (1980) propose a useful taxonomy of MDS data and methods which is
very broad indeed. Investigators who are less interested in formal similarities of
methods and more interested in substantial differences in models have naturally
emphasized the choice of the space, and consequently the differences between
clustering and narrow-sense MDS. Of course detailed comparison of the two
classes of techniques already presupposes a common framework. This is obvious

in the important papers of Shepard (1974) and Shepard and Arabie (1979), who
present the two classes of methods essentially as complementing each other. If the
emphasis shifts more towards a theory of similarity judgments, the discrete and
continuous representations tend to become rivals. In a brilliant paper Tversky
(1977) has attacked narrow-sense MDS as a realistic model for psychological
similarity, and Sattath and Tversky (1977) have presented the free or additive tree
as a more realistic alternative. Krumhansl (1978) has defended dimensional
models.
We are not interested, in this paper, in psychological theories of similarity, i.e.
in narrow-sense MDS as a miniature psychological theory. We are also not
interested in discussing the many forms of cluster analysis and nondimensional
scaling. Consequently we propose a definition of scaling which is quite broad
(although not as broad as the one suggested by Carroll's taxonomy), and after
proposing this definition we quickly specialize to MDS. From that point on there
is no more need to distinguish between narrow and broad MDS. Our definition is
inspired by the definition of Kruskal (1977), and by the discussion in Kruskal and
Wish (1978), and Cliff (1973). We agree with these authors that MDS should be
further classified, and that the most important distinctions are between metric
and nonmetric MDS, and between two-way and three-way MDS. Other criteria
are choice of metric, choice of loss function, and choice of algorithm, but these
seem to be more technical and less essential.

1.2. Notation and definitions


In metric two-way scaling the data are a pair (I, δ), with I a nonempty set and δ a real valued function on I × I. The elements of I are called objects or stimuli, the number δ(i, j) is the dissimilarity between objects i and j. A second pair (X, d) is also available, with X another nonempty set and d a real valued function on X × X. The elements of X are called points, the number d(x, y) is the distance between points x and y. The metric two-way scaling problem is to construct a mapping φ of I into X in such a way that for all i and j in I the distance d(φ(i), φ(j)) is approximately equal to the dissimilarity δ(i, j).
In nonmetric two-way scaling the situation is more complicated. The data do not consist of a single real valued function δ on I × I, but of a set Δ of real valued functions on I × I. If we agree to call an arbitrary real valued function on I × I a disparity, then Δ is the set of all admissible disparities. The nonmetric two-way scaling problem is to find a mapping φ of I into X and an admissible disparity d̂ in such a way that for all i and j in I the distance d(φ(i), φ(j)) is approximately equal to d̂(i, j).
In metric three-way scaling we use an additional set K. The elements of K are called replications, or occasions, or individuals. For each k in K we have a dissimilarity structure (I, δ_k), and the metric three-way scaling problem is to find for each k a mapping φ_k from I into X such that d(φ_k(i), φ_k(j)) is approximately equal to δ_k(i, j) for all i and j in I. It will be obvious by now how to define the nonmetric three-way scaling problem: there is a set Δ_k of admissible disparities for each k, and we have to construct both the mappings φ_k and the disparities d̂_k.
A number of additional comments are in order here. We have called d(x, y) the distance of x and y, but we have not assumed that the function d satisfies the usual axioms for a metric. In the same way we have defined disparities as arbitrary functions on I × I, while in most applications disparities (and dissimilarities) are distance-like too. For ease of reference we briefly mention the metric axioms here. For (X, d) we must have the following.

(M1) If x ≠ y, then d(x, y) > 0, and d(x, x) = 0. (minimality)
(M2) d(x, y) = d(y, x). (symmetry)
(M3) d(x, y) + d(y, z) ≥ d(x, z). (triangle inequality)

In most applications our target space (X, d) does satisfy these axioms, but in some (M1) must be replaced by

(M1') d(x, y) ≥ 0, and d(x, x) = 0. (weak minimality)

Moreover in many other applications we would like to drop symmetry. Tversky (1977) has argued, quite convincingly, that assuming symmetry is not very realistic in many psychological situations. In the same way (I, δ) often satisfies (M1) and (M2), but sometimes only

(M1'') If x ≠ y, then δ(x, y) > δ(x, x) = δ(y, y). (shifted minimality)

Because of the very many possibilities we do not impose a specific set of axioms about (X, d) and/or (I, δ); we simply have to remember that they usually are distance-like.
The expression "approximately equal to" in our definitions has not been
defined rigorously. As we mentioned previously the usual practice in scaling is to
define a real valued loss function, and to construct the mapping of I into X in such
a way that this loss function is minimized. In this sense a scaling problem is
simply a minimization problem. One way in which scaling procedures differ is
that they use different loss functions to fit the same structure. Another way in
which they differ is that they use different algorithms to minimize the same loss
function. We shall use these technical distinctions in the paper in our taxonomy
of specific multidimensional scaling procedures.
It is now very simple to define MDS: a (p-dimensional) multidimensional
scaling problem is a scaling problem in which X is R^p, the space of all p-tuples of
real numbers. Compare the definition given by Kruskal (1977, p. 296): "We
define multidimensional scaling to mean any method for constructing a configura-
tion of points in low-dimensional space from interpoint distances which have
been corrupted by random error, or from rank order information about the
corrupted distances." This is quite close to our definition. The most important
difference is that Kruskal refers to 'corrupted distances' and 'random error'. We
do not use these notions, because it seems unnecessary and maybe even harmful
to commit ourselves to a more or less specific stochastic model in which we
assume the existence of a 'true value.' This point is also discussed by Sibson
(1972). Moreover, not all types of deviations we meet in applications of MDS can
reasonably be described as 'random error.' This is also emphasized by Guttman
(1971).
We have now specified the space X, but not yet the metric d. The most familiar
choice is, of course, the ordinary Euclidean metric. This has been used in at least
90% of all MDS applications, but already quite early in MDS history people were
investigating alternatives and generalizations. In the first place the Euclidean metric is a member of the family of power metrics; Attneave (1950) found empirically that another power metric gave a better description of his data, and Restle (1959) proposed a qualitative theory of similarity which leads directly to Attneave's 'city block' metric. The power metrics themselves are special cases of general Minkovski metrics; they are also special cases of the general additive/difference metrics investigated axiomatically by Tversky and Krantz (1970). On the
other hand Euclidean space is also a member of the class of spaces with a
projective (or Cayley-Klein) metric, of which hyperbolic and elliptic space are
other familiar examples. In Luneburg (1947) a theory of binocular visual space
was discussed, based on the assumption of a hyperbolic metric, and this theory
has always fascinated people who are active in MDS. It is clear, consequently,
that choice of metric is another important criterion which can be used to classify
MDS techniques.

1.3. History of MDS


Multidimensional scaling is quite old. Closely related methods have been used by surveyors and geographers since Gauss; Kruskal has discovered a crude MDS method used in systematic zoology by Boyden around 1930; and algorithms for
the mapping of chromosomes from crossing-over frequencies can already be
found in an interesting paper of Fisher (1922). The systematic development of
MDS, however, has almost completely taken place in psychometrics.
The first important contribution to the theory of MDS, not to the methods, is
probably Stumpf's (1880). He distinguishes four basic types of judgments which
correspond with four of the eight types of data discussed in the book of Coombs
(1964). Stumpf defines psychological distance explicitly as degree of dissimilarity; he argues that reliable judgments of dissimilarity are possible, indicates that
judgments about very large and very small dissimilarities are not very reliable,
mentions that small distances are often overestimated while large distances are
underestimated, and argues that triadic comparisons are much easier than tetradic
comparisons [124, pp. 56-65, pp. 122-123, pp. 128-133]. Stumpf's work did not
lead to practical MDS procedures, and the same thing is true for later important
theoretical work of Goldmeier (1937) and Landahl (1945).
The contributions to the method are due to the Thurstonian school. Richardson
(1938) and Klingberg (1941) applied classical psychophysical data collection
methods to pairs of stimuli, used Thurstone's 'law of comparative judgment' to
transform the proportions to interval scale values, estimated an additive constant
to convert the scale values to distance estimates, and constructed coordinates of
the points by using a theorem due to Young and Householder (1938). The
Thurstonian methods were systematized by Torgerson in his thesis of 1951, the
main results were published in [127]. Messick and Abelson (1956) contributed a
better method to estimate the additive constant, and Torgerson summarizes the
Thurstonian era in MDS in Chapter 11 of his book (1958).
The first criticisms of the Thurstonian approach are in an unpublished disserta-
tion of Rowan in 1954. He pointed out that we can always choose the additive
constant in such a way that the distances are Euclidean. Consequently the
Thurstonian procedures tend to represent non-Euclidean relationships in
Euclidean space, which is confusing, and makes it impossible to decide if the
psychological distances are Euclidean or not. Rowan's work is discussed by
Messick (1956). He points out that non-Euclidean data lead to a large additive
constant, and a large additive constant leads to a large number of dimensions. As
long as the Thurstonian procedures find that a small number of dimensions is
sufficient to represent the distances, everything is all right. In the meantime a
more interesting answer to Rowan's objections was in the making. In another
unpublished dissertation Mellinger applied Torgerson's procedures to colour
measurement. This was in 1956. He found six dimensions, while he expected to
find only two or three. Helm (1960) replicated Mellinger's work, and found not
less than twelve dimensions. But Helm made the important observation that this
is mainly due to the fact that large distances tend to be underestimated. If he
transformed the distances exponentially he merely found two dimensions, and if
he transformed Mellinger's data exponentially he found three. Consequently a
large number of dimensions is not necessarily due to non-Euclidean distances, it
can also be due to a nonlinear relationship between dissimilarities and distances.
In the meantime the important work of Attneave (1950) had been published.
His study properly belongs to 'multidimensional psychophysics', in which the
projections of the stimuli on the dimensions are given, and the question is merely
how the subjects 'compute' the dissimilarities. In the notation of the previous
section the mapping q~ is given but the distance d is not, while in MDS the
distance d is given and the mapping q~ is not. In this paper Attneave pointed out
that other distance measures may fit the data better than the Euclidean distance,
and he also compared direct judgments of similarity with identification errors in
paired associate learning. He found that the two measures of similarity had a
nonlinear but monotonic relationship. In his 1955 dissertation Roger Shepard
also studied errors in paired associates learning. The theoretical model is ex-
plained in [117], the mathematical treatment is in [116], and the experimental
results are in [118]. Shepard found that to a fairly close approximation, distance is
the negative logarithm of choice probability, which agrees with a choice theory
analysis of similarity judgments by Luce (1961, 1963). Compare also [68]. On the
other hand Shepard also found systematic deviations from this 'rational distance
function', and concluded that the only thing everybody seemed to agree on was
that the function was monotonic.
Also around 1950 Coombs began to develop his theory of data. The main
components of this theory are the classification of data into the four basic
quadrants, the emphasis on the geometric interpretation of models and on the
essentially qualitative nature of data. All these three aspects have been enor-
mously influential, but the actual scaling procedures developed by Coombs and
his associates have not been received with great enthusiasm. The main reason is
that they were 'worksheet' methods involving many subjective choices, and that
they gave nonmetric representations of nonmetric data. In the case of the
one-dimensional unfolding model the derived ordered metric scale turned out to
be close to an interval scale, but the multidimensional extensions of Bennett and
Hays [6, 54] were much more problematical. The same thing is true for the
method developed by Hays to analyze QIVa (dis)similarity data, which is discussed in [24, Part V]. The methods of Hays and Bennett are the first
nonmetric MDS methods. But, in the words of Coombs himself, "This method,
however, new as it is, may already be superseded by a method that Roger Shepard
has developed for analysis of similarities data" [24, p. 494]. Coombs was referring to Shepard's 1962 work.
Another important line of development can be found in the work of Louis
Guttman. It is obvious from the book of Coombs (1964) that Guttman's work on
the perfect scale, starting with [42] and culminating in [44], has been very
influential. On the other hand Guttman's techniques which find metric represen-
tations from nonmetric data, discussed in [41, 43, 44], were not used a great deal.
It seems that around 1960 Guttman had arrived at approximately the same point
as Coombs and Shepard (compare for example [46]). Ultimately this resulted in
[47], but the unpublished 1964 version of this paper was also widely circulated.
Although Coombs and his students had studied nonmetric MDS and although
Guttman had proposed methods to quantify qualitative data, the real 'computa-
tional breakthrough' was due to Roger Shepard [119]. We have already seen that
his earlier work on functions, relating dissimilarity to distance, pointed strongly in
the direction of merely requiring a monotonic relationship without specifying a
particular functional form. In his 1962 papers Shepard showed that an MDS
technique could be constructed on the basis of this requirement but (perhaps even
more important) he also showed that an efficient computer program could be
constructed which implemented this technique. Moreover he showed with a large
number of real and synthetic examples that his program could 'recover' metric
information, and that the information in rank-ordered dissimilarities was usually
sufficient to determine the configuration of points (cf. also [120]). This idea of
'getting something from nothing' appealed to a lot of people, and the idea that it
could be obtained by simply pushing a button appealed to even more people. The
worksheet methods of Coombs and Hays were quickly forgotten.
Another important effect of the Shepard 1962 papers was that they got Joseph
Kruskal interested in MDS. He took the next important step, and in [73, 74] he
introduced psychometricians to loss functions, monotone regression, and (gradi-
ent) minimization techniques. In these papers Kruskal puts Shepard's ideas,
which still had a number of heuristic elements, on a firm footing. He showed,
essentially, how any metric psychometric technique could be converted into a
nonmetric one by using monotone regression in combination with a least squares
loss function, and he discussed how such functions could be minimized. Early
systematizations of this approach are from Roskam (1968) and Young (1972).
Torgerson (1965) reported closely related work that he had been doing, and made
some thoughtful (and prophetic) comments about the usefulness of the new
nonmetric procedures. Guttman (1968) contributed a long and complicated paper
which introduced some useful notation and terminology, contributed some inter-
esting mathematical insights, but, unfortunately, also a great deal of confusion. It
is obvious now that Kruskal's discussion of his minimization method was rather
too compact for most psychometricians at that time. The confusion is discussed,
and also illustrated, in [87].
The main contribution to MDS since the Shepard-Kruskal 'computer revolu-
tion' is undoubtedly the paper by Carroll and Chang (1970) on three-way MDS.
It follows the by now familiar pattern of presenting a not necessarily new model
by presenting an efficient algorithm and an effective computer program, together
with some convincing examples. A recent paper, following the same strategy,
integrates two- and three-way metric and nonmetric MDS in a single program
[125].

2. Multidimensional scaling models

2.1. The Euclidean distance model

2.1.1. Euclidean metric two-way scaling


We have already decided to study mappings of (I, δ) into (R^p, d), with R^p the space of all p-tuples of real numbers. The most familiar way to define a metric in R^p is as follows. For each x, y in R^p we define

d²(x, y) = (x − y)'(x − y),

or, equivalently,

d(x, y) = ||x − y||,

where ||·|| is the Euclidean norm. The metric two-way Euclidean MDS problem is to find a mapping φ of I into R^p such that δ(i, j) is approximately equal to ||φ(i) − φ(j)||. In this section we are interested in the conditions under which this problem can be solved exactly, or, to put it differently, under which conditions (I, δ) can be imbedded in (R^p, d). This problem is also studied in classical distance geometry. There is some early work on the subject by Gauss, Dirichlet, and Hermite, but the first systematic contribution is the very first paper of Arthur Cayley (1841). Cayley's approach was generalized by Menger (1928) in a fundamental series of papers. Cayley and Menger used determinants to solve the imbedding problem; an alternative formulation in terms of quadratic forms was suggested by Fréchet (1935) and worked out by Schoenberg (1935). The same
result appears, apparently independently, in [141]. The contributions of Cayley, Menger, Fréchet, and Schoenberg are summarized and extended by Blumenthal (1938, 1953). In this paper we prefer the Schoenberg solution to the Menger
(1938, 1953). In this paper we prefer the Schoenberg solution to the Menger
solution, because it leads more directly to computational procedures.
We suppose throughout this section that (I, δ) is a semi-metric space, by which we mean that δ satisfies both minimality and symmetry. For our first theorem we assume in addition that I is a finite set, say with n elements. We define the n × n matrix H with elements h_ij = δ²(i, j), and the n × n matrix B with elements b_ij = −½(h_ij − h_i. − h_.j + h_..), where dots replacing an index mean that we have averaged over that index. If J is the matrix which centers each n-vector, i.e. J = I − (1/n)ee', then B = −½JHJ.

THEOREM 1. The finite semimetric space (I, δ) can be imbedded in Euclidean p-space if and only if B is positive semi-definite, and rank(B) ≤ p.

PROOF. Suppose x_1, ..., x_n are p-vectors such that δ²(i, j) = d²(x_i, x_j). Define the n-vector a by a_i = x_i'x_i, and collect the x_i in the n × p matrix X. Then H = ae' + ea' − 2XX', and consequently B = −½JHJ = JXX'J, which implies that B is positive semi-definite (from now on: psd) of rank(B) = rank(JX) ≤ p. This proves necessity. Conversely if B is psd and rank(B) ≤ p, then there is an n × p matrix X such that B = XX'. For this X we have d²(x_i, x_j) = b_ii + b_jj − 2b_ij = h_ij.

In fact Schoenberg (1935) and Young and Householder (1938) prove a slightly different version of the theorem. They place the origin in one of the points while our version places the origin in the centroid of the points, which is considerably more elegant from a data analysis point of view. Our formulation is due to Tucker, Green, and Abelson (cf. [128, pp. 254-259]). Theorem 1 can be sharpened by defining (I, δ) to be irreducibly imbeddable in Euclidean p-space if it is imbeddable, but not imbeddable in Euclidean (p − 1)-space. A slight rewording of the proof shows that (I, δ) is irreducibly imbeddable if and only if B is psd and rank(B) = p. A complete solution of the general imbedding problem, in which I can be infinite, was given by Menger (1928). We quote his result without proof; for an excellent discussion we refer to [10, Chapter IV].

THEOREM 2. The semimetric space (I, δ) can be imbedded in Euclidean p-space if and only if each subset of p + 3 points can be imbedded in Euclidean p-space.

The imbedding problem is the first fundamental problem of classical distance geometry. The second fundamental problem is the space problem. The imbedding problem gives conditions under which a semimetric space is metrically congruent to a subset of Euclidean p-space, the space problem gives conditions under which it is metrically congruent to Euclidean p-space itself, i.e. there must be a one-one correspondence between the two spaces. The first solution to the space problem was again given by Menger in 1928; a more elegant solution was found by Wilson
in 1932. The results are summarized in [10, Chapter V], a more recent summary is in [11]. Again we quote the main results without proof. Remember that in a semimetric space (A, μ) the point b is between a and c if μ(a, b) + μ(b, c) = μ(a, c). A semimetric space is convex if for each pair a, c there is at least one b ≠ a, c between a and c, and it is externally convex if for each pair a, b there is at least one c ≠ a, b such that b is between a and c.

THEOREM 3. The semimetric space (I, δ) is congruent with Euclidean p-space if and only if it is complete, convex, externally convex, and irreducibly imbeddable in Euclidean p-space.

To obtain Wilson's characterization we need an additional definition: a semimetric space has the Euclidean four-point property if each quadruple of points can be imbedded in Euclidean 3-space.

THEOREM 4. The semimetric space (I, δ) is congruent with Euclidean p-space if and only if it is complete, convex, externally convex, it satisfies the Euclidean four-point property, and each set of p + 1 points defines a singular B-matrix (cf. Theorem 1).

2.1.2. Euclidean non-metric two-way scaling


Again we suppose that I is finite, with n elements. The additive constant problem is to find x_1, ..., x_n in R^p and a real number α such that δ(i, j) + α(1 − δ^{ij}) is approximately equal to d(x_i, x_j). Superscripted delta (δ^{ij}) is the Kronecker symbol.

THEOREM 5. Suppose (I, δ) is a semimetric space, and suppose I is finite with n elements. Then there is an α such that I with semimetric δ(i, j) + α(1 − δ^{ij}) can be imbedded in Euclidean (n − 1)-space.

PROOF. Define H(α) by the rule

h_ij(α) = δ²(i, j) + 2αδ(i, j) + α²(1 − δ^{ij}),

and define B(α) = −½JH(α)J. Clearly B(α) is of the form

B(α) = B_0 + 2αC_0 + ½α²J.

This implies that there is an α_0 such that b_ij(α) < 0 for all i ≠ j if α > α_0. Because rows and columns of B(α) add up to zero a familiar matrix theorem [126] proves that B(α) is psd with rank(B(α)) = n − 1 for all α > α_0. Theorem 1 now gives the required result.

THEOREM 6. If (I, δ) is a semimetric space, I is finite with n elements, and (I, δ) cannot be imbedded in Euclidean (n − 1)-space, then there is an α such that I with semimetric δ(i, j) + α(1 − δ^{ij}) can be imbedded in Euclidean (n − 2)-space.
PROOF. Consider λ(α), the minimum of x'B(α)x over all x satisfying x'x = 1 and x'e = 0. Clearly λ(α) is continuous. By hypothesis λ(0) < 0, and the proof of Theorem 5 shows that λ(α) > 0 for α > α_0. Consequently λ(α) = 0 for some α between 0 and α_0.

These two theorems improve the (unpublished) results of Rowan we mentioned in Subsection 1.3. Another one-parameter family of transforms was used by Guttman (cf. [86]).

THEOREM 7. Suppose (I, δ) is a semimetric space, and suppose I is finite with n elements. Then there is an α such that I with the semimetric {δ²(i, j) + α(1 − δ^{ij})}^{1/2} can be imbedded in Euclidean (n − 2)-space.

PROOF. In this case B(α) = B_0 + ½αJ, and λ(α) = λ(0) + ½α. Thus if α = −2λ(0), then λ(α) = 0, and B(α) is psd of rank n − 2.
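A minimal sketch of this additive constant construction (illustrative only; delta is assumed to be a numpy array holding the semimetric):

    import numpy as np

    def additive_constant_transform(delta):
        # Theorem 7: alpha = -2*lambda(0) makes B(alpha) psd of rank <= n-2, so the
        # transformed dissimilarities are imbeddable in Euclidean (n-2)-space.
        n = delta.shape[0]
        J = np.eye(n) - np.ones((n, n)) / n
        B0 = -0.5 * J @ (delta ** 2) @ J
        # smallest eigenvalue of B0; since B0 e = 0 this equals lambda(0)
        # whenever lambda(0) <= 0, and is 0 (alpha = 0) if B0 is already psd
        lam0 = np.linalg.eigvalsh(B0)[0]
        alpha = -2.0 * lam0
        delta2_new = delta ** 2 + alpha * (1.0 - np.eye(n))
        return np.sqrt(delta2_new), alpha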

The transformation considered in Theorem 7 is clearly monotone. Consequently the following theorem is an easy corollary.

THEOREM 8. Suppose (I, δ) is a semimetric space, and suppose I is finite with n elements. Then there is a semimetric δ̄ such that δ̄(i, j) ≤ δ̄(i', j') if and only if δ(i, j) ≤ δ(i', j') for all i ≠ j and i' ≠ j', and such that (I, δ̄) can be imbedded in Euclidean (n − 2)-space.

It was already pointed out by Lingoes (1971) that the mapping constructed in the proof of Theorems 7 and 8 may lead to δ̄(i, j) = 0 for some i ≠ j. This means that in general the conclusion of Theorem 8 cannot be strengthened to δ̄(i, j) ≤ δ̄(i', j') if and only if δ(i, j) ≤ δ(i', j') for all i, j, i', j'. The precise conditions under which such a strengthening is possible were investigated by Holman (1972).
His work is based partly on unpublished work of the distance geometer L. M. Kelly. In the first place a semimetric space (I, δ) is called an ultrametric space if

δ(i, j) ≤ max{δ(i, k), δ(k, j)}

for all i, j, k. This ultrametric inequality, which obviously implies the triangle inequality, is interesting from a data analysis point of view, because it is well known that a semimetric space can be imbedded in a hierarchical tree structure if and only if the semimetric satisfies the ultrametric inequality (cf. for example [62]). Hierarchical tree structures are very important in clustering and classification literature [25]. We now give the two theorems proved by Holman.

THEOREM 9. Suppose (I, δ) is an ultrametric space, and suppose I is finite with n elements. Then (I, δ) can be imbedded in Euclidean (n − 1)-space, but not in Euclidean (n − 2)-space.
Because the ultrametric inequality remains true if we transform the δ_ij monotonically, this also implies that no strictly monotone transformation of the δ_ij can be imbedded in Euclidean (n − 2)-space.

THEOREM 10. Suppose (I, δ) is a semimetric space, and suppose I is finite with n elements. Then there is a semimetric δ̄ such that δ̄(i, j) ≤ δ̄(i', j') if and only if δ(i, j) ≤ δ(i', j') for all i, j, i', j' and such that (I, δ̄) can be imbedded in Euclidean (n − 2)-space if and only if (I, δ) is not an ultrametric space.

For proofs of Theorems 9 and 10 we refer to Holman. Another interesting nonmetric MDS problem is the missing data problem. Suppose I is finite again, and suppose δ is defined only on a subset L of I × I (and is not defined on the complement L̄). We suppose that (i, i) is in L for all i, and that (i, j) is in L if and only if (j, i) is in L. Define a matrix H_0, which has elements (i, j) equal to δ²(i, j) if (i, j) is in L, and equal to zero otherwise. For each (i, j) in L̄ we define a matrix A_ij, which has elements (i, j) and (j, i) equal to +1, and all other elements equal to zero. Define

H(θ) = H_0 + Σ{θ_ij A_ij | (i, j) ∈ L̄},
B(θ) = B_0 + Σ{θ_ij T_ij | (i, j) ∈ L̄},

where B_0 = −½JH_0J and T_ij = −½JA_ijJ as usual. The following theorem is rather trivial but is given because it has computational consequences.

THEOREM 11. Suppose (I, δ) is a semimetric space with missing elements, suppose I is finite. Then (I, δ) can be imbedded in Euclidean p-space if and only if there exist θ_ij, (i, j) ∈ L̄, θ_ij > 0, such that B(θ) is psd and rank(B(θ)) ≤ p.

A special missing data problem is the metric unfolding problem. Here I is partitioned into the finite sets I_1 and I_2, and L = I_1 × I_2. Because of the special structure of this problem a more precise analysis is possible than the one given in Theorem 11; for the results we refer to papers by Schönemann (1970), Gold (1973), and Heiser and De Leeuw (1978). A computationally more convenient version of Theorem 8 can also be formulated along the lines of Theorem 11. Define for each i ≠ j a matrix A_ij in which elements (i', j') and (j', i') are equal to +1 if δ(i', j') ≥ δ(i, j), and equal to zero otherwise. Define T_ij = −½JA_ijJ and B(θ) = Σθ_ij T_ij.

THEOREM 12. Suppose (I, δ) is a semi-metric space, with I finite. Then there exists a semimetric δ̄ such that for all i, j, i', j', δ(i, j) ≤ δ(i', j') implies δ̄(i, j) ≤ δ̄(i', j'), and such that (I, δ̄) can be imbedded in Euclidean p-space, if and only if there exist nonnegative numbers θ_ij such that B(θ) is psd, and rank(B(θ)) ≤ p.
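For small n the matrices entering Theorem 12 can be generated explicitly; the sketch below is illustrative only (finding nonnegative θ such that B(θ) = Σθ_ij T_ij is psd of low rank is a separate optimization problem, taken up in Section 3.1.2):

    import numpy as np

    def ordinal_T_matrices(delta):
        # Theorem 12 construction: A_ij has entries +1 at (i', j') and (j', i')
        # whenever delta(i', j') >= delta(i, j), zero elsewhere; T_ij = -1/2 J A_ij J.
        n = delta.shape[0]
        J = np.eye(n) - np.ones((n, n)) / n
        T = {}
        for i in range(n):
            for j in range(i + 1, n):
                A = np.triu((delta >= delta[i, j]).astype(float), 1)
                A = A + A.T                       # symmetric, zero diagonal
                T[(i, j)] = -0.5 * J @ A @ J
        return T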

Up to now we have only analyzed nonmetric MDS in cases in which I was finite. It is clear, however, that we can combine the results in this section with
Theorem 2 from the previous subsection to find solutions to the general imbedding problem. The space problem is somewhat more complicated. The most interesting case is the one in which the elements of I × I are ordered. We want to study the conditions under which this order can be considered to be induced by a convex metric. This topic was studied by Beals and Krantz (1967); their treatment was improved by Krantz (1968), explained for psychologists by Beals, Krantz, and Tversky (1968), and generalized by Lew (1975). Minimality and symmetry can easily be defined in terms of the order; they are obviously necessary conditions for the representation of the order by any metric, convex or not. A number of 'technical' assumptions is also needed, which state that the space is continuous, without holes. They are usually untestable on empirical data. The most important assumption is based on an ordinal characterization of betweenness. Suppose i_1, i_2 and i_3 are three distinct objects in I. Following Beals and Krantz we define (i_1 i_2 i_3)* if and only if
(a) (i_1, i_1') ≤ (i_1, i_2) and (i_1, i_3') ≥ (i_1, i_3) imply (i_1', i_3') ≥ (i_2, i_3);
(b) if the conditions of (a) hold, and (i_1', i_3') = (i_2, i_3), then (i_1, i_1') = (i_1, i_2) and (i_1, i_3') = (i_1, i_3).
Moreover we define (i_1 i_2 i_3) if and only if both (i_1 i_2 i_3)* and (i_3 i_2 i_1)*. The basic assumption is that if 0 < (i_1, i_2') < (i_1, i_3) then there is an i_2 in I such that (i_1, i_2) = (i_1, i_2') and (i_1 i_2 i_3). The conditions can be illustrated by drawing 'iso-similarity contours' as in [5]. Under the assumptions we have stated there exists a metric d on I such that
(a) (i, j) ≤ (i', j') if and only if d(i, j) ≤ d(i', j');
(b) (i_1 i_2 i_3) if and only if d(i_1, i_2) + d(i_2, i_3) = d(i_1, i_3).
To get to Euclidean space we can use Theorem 3 or 4. Completeness and external convexity can easily be defined in terms of the order relation and the betweenness relation. Because the convex metric constructed by Beals and Krantz is unique up to scale we can simply use it to test the Euclidean four-point property (or the weaker versions of the property discussed by Blumenthal (1975)). A more direct approach is also possible. Blumenthal (1938, pp. 10-13) discusses an axiomatization of three-dimensional Euclidean geometry due to the Italian geometer M. Pieri. The single undefined elements are 'points', the single primitive relation 'i_1 is equally distant from i_2 and i_3'. This axiomatization, published in 1908, can easily be generalized to p-dimensional space.
Holman's Theorem 9 can be interpreted as a negative result, which shows that ultrametrics cannot be represented in low-dimensional Euclidean spaces. It is well known that ultrametric and tree distances are closely related to 'city block' or l_1-distances. In this sense the counterexamples presented by Lew (1978) generalize Holman's results. Lew proves that the p-dimensional 'city block' spaces l_1^p cannot be imbedded into finite-dimensional Euclidean space, and that there is no monotone transform of the metric which makes such an imbedding possible. On the other hand it has been shown by Schoenberg (1937, 1938) that if δ(i, j) is the l_1-metric, then {δ(i, j)}^γ, with 0 < γ ≤ ½, can be imbedded into l_2, the natural infinite-dimensional generalization of Euclidean space. Lew (1978) also presents other interesting results based on Schoenberg's metric transform theory.
2.1.3. Euclidean three-way scaling


In our discussion of three-way scaling we restrict ourselves to the metric case with finite I. It will be clear from the previous sections how to extend at least some of the results to more general cases. In the major models that have been proposed for Euclidean three-way scaling we require that the mappings φ_k of I into R^p are of the form φ_k(i) = T_k x_i, where x_1, ..., x_n are elements of R^p, and where T_1, ..., T_m are p × p matrices. There are three special cases to be considered. We use names suggested by Carroll and Chang (1970), Carroll and Wish (1974), and Harshman (1972). The names are actually names of computer programs, but it is common practice to use them for the models as well. In IDIOSCAL the T_k are unrestricted, in PARAFAC the T_k must be of the form T_k = W_k S with W_k diagonal and S unrestricted, and in INDSCAL we must have T_k = W_k with W_k diagonal. As in the previous sections we define the matrices E_k with elements δ_k²(i, j) and B_k = −½JE_kJ. We also collect the x_i in the n × p matrix X, and we suppose (without loss of generality) that X is centered, i.e. JX = X. We study the conditions under which IDIOSCAL and INDSCAL can be fitted exactly. The results are due mainly to Schönemann (1972), but they have been simplified considerably by De Leeuw and Pruzansky (1978). Unfortunately it is not possible to give an equally satisfactory algebraic analysis of the PARAFAC model.

THEOREM 13. The semimetric spaces (I, δ_k) have an irreducible IDIOSCAL imbedding in Euclidean p-space if and only if
(a) B_k is psd,
(b) B_* = ΣB_k has rank p.

PROOF. We have to solve B_k = XC_kX', with C_k = T_kT_k'. It follows directly that (a) is necessary. If rank(B_*) = r < p, then there is an IDIOSCAL imbedding in r-space, and consequently (b) is necessary for irreducibility. To prove sufficiency we identify the system (partially) by requiring that C_*, the sum of the C_k, is the identity matrix. Thus we have to solve B_* = XX'. Suppose B_* = KΛ²K' is the canonical form of B_*, with Λ² the diagonal p × p matrix of eigenvalues. It follows that X = KΛL', with L square orthonormal but otherwise arbitrary. We now have to solve B_k = KΛL'C_kLΛK' for C_k. Because KK'B_kKK' = B_k for all k, the solution is simply C_k = LΛ^{-1}K'B_kKΛ^{-1}L'. Observe that indeed C_* = I.

THEOREM 14. The semimetric spaces (I, δ_k) have an irreducible INDSCAL imbedding in Euclidean p-space if and only if
(a) B_k is psd,
(b) rank(B_*) = p,
(c) B_kB_*^+B_l = B_lB_*^+B_k for all k, l = 1, ..., m.

PROOF. For the irreducible INDSCAL imbedding we can also require without loss of generality that W_*² = I. Moreover, we are only interested in X of full column rank p. By using exactly the same proof as in Theorem 13 we find that an INDSCAL solution exists if and only if L can be chosen in such a way that LΛ^{-1}K'B_kKΛ^{-1}L' is diagonal for each k, which is possible if and only if the Λ^{-1}K'B_kKΛ^{-1} commute. This is true if and only if (c) is true.

It is possible in INDSCAL that solutions exist in which X is not of full column
rank, and which are irreducible in the sense that no INDSCAL solution exists
with a smaller value of p. In fact it is possible in some cases that INDSCAL
solutions only exist for some p > n . This is one of the main reasons that
INDSCAL is very interesting from a theoretical point of view; the other reason is
that INDSCAL often gives unique solutions (up to a permutation of the dimen-
sions), even if p > n. Thus there is no 'rotational indeterminacy'. We give a
uniqueness theorem for the case in which X is of full column rank, the more
general case is studied by Kruskal [75, 76].

THEOREM 15. An irreducible INDSCAL imbedding in Euclidean p-space is unique if and only if for each s ≠ t there is a k such that w_ks² ≠ w_kt².

PROOF. The rotation matrix L in Theorem 14 is uniquely determined if and only if there is at least one linear combination of the Λ^{-1}K'B_kKΛ^{-1} with different eigenvalues, which is true if and only if there is at least one linear combination of the W_k² with different diagonal values. This is equivalent to the condition in the theorem.
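A minimal sketch of how Theorems 13-15 can be checked numerically (illustrative only; the pseudo-inverse of B_* enters implicitly through Λ^{-1}K', and the random linear combination presumes the uniqueness condition of Theorem 15 holds):

    import numpy as np

    def indscal_exact(B_list, p, tol=1e-8):
        # Theorem 14: an irreducible INDSCAL imbedding exists iff every B_k is psd,
        # rank(B_*) = p, and the M_k = Lam^{-1} K' B_k K Lam^{-1} commute.
        B_star = sum(B_list)
        lam, K = np.linalg.eigh(B_star)
        order = np.argsort(lam)[::-1][:p]
        K, lam = K[:, order], lam[order]          # assumes these p eigenvalues are positive
        Lam = np.sqrt(lam)
        M = [(K.T @ Bk @ K) / np.outer(Lam, Lam) for Bk in B_list]
        if any(np.linalg.norm(Mk @ Ml - Ml @ Mk) > tol for Mk in M for Ml in M):
            return None                           # condition (c) fails
        # a generic linear combination of the M_k has distinct eigenvalues under the
        # uniqueness condition of Theorem 15; its eigenvectors give the common basis
        weights = np.random.default_rng(0).random(len(M))
        _, V = np.linalg.eigh(sum(w * Mk for w, Mk in zip(weights, M)))
        X = (K * Lam) @ V                         # X = K Lam L' with L = V'
        W2 = [np.diag(V.T @ Mk @ V) for Mk in M]  # diagonals of the W_k^2
        return X, W2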

2.1.4. Asymmetries in Euclidean MDS


One of the main objections of Tversky (1977) against the use of metric
dimensional models in psychology is that dissimilarity is very often not symmet-
ric. This is not very convincing. In distance geometry it has been recognized from
the start that the symmetry axiom may be too restrictive in some applications.
Busemann (1955, p. 4) gives the familiar example that a distance downhill may be
shorter than 'the same' distance uphill. Busemann's students Zaustinsky, Phadke,
and Featherstone have generalized much of the theory of G-spaces to asymmetric
metrics (cf. [14]).
In Euclidean MDS several people have proposed asymmetric modifications of the Euclidean distance. An early one by Kruskal (personal communication, 1973) is

d²(φ(i), φ(j)) = Σ_{s=1}^p (x_is − x_js + z_s)²,

where z is called the 'slide vector'. A related 'jet stream' model appears in [40]
where Gower also has an interesting 'cyclone' model. Gower (1977), and Constantine and Gower (1978) also discuss MDS techniques which decompose a matrix in its symmetric and antisymmetric parts, and then compute the singular value decomposition of both parts. Baker (cf. [3]) has proposed

d²(φ(i), φ(j)) = Σ_{s=1}^p w_is (x_is − x_js)².
This ASYMSCAL model is incorporated as one of the options in the ALSCAL-4 computer program, which also has corresponding three-way asymmetric models [140].
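The two asymmetric forms written out above are easy to evaluate; a minimal sketch, assuming hypothetical inputs X (n × p coordinates), z (slide vector) and W (n × p row weights):

    import numpy as np

    def slide_vector_d2(X, z):
        # Kruskal's slide-vector form: d^2(i, j) = sum_s (x_is - x_js + z_s)^2
        diff = X[:, None, :] - X[None, :, :] + z
        return (diff ** 2).sum(axis=2)

    def asymscal_d2(X, W):
        # Baker's ASYMSCAL form: d^2(i, j) = sum_s w_is (x_is - x_js)^2
        diff2 = (X[:, None, :] - X[None, :, :]) ** 2
        return (W[:, None, :] * diff2).sum(axis=2)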
Another idea, which is already quite old, is that dissimilarity is really symmetric
(and even Euclidean), but it is contaminated by response bias or other
uphill-downhill effects. We merely need a procedure to remove the response
bias, and we can use the ordinary Euclidean model again. The procedures of
Kruskal, Baker, and Gower are of the same type, but it is not immediately
obvious how the corrections should be carried out. In the Shepard-Luce model
for confusion matrices [88, 116] the corrections are very simple. We suppose that the confusion probabilities π_ij satisfy a model of the form π_ij = α_i β_j η_ij, with η_ij symmetric. This is closely related to work in the contingency table literature on 'quasi-symmetry' [61] and on social mobility tables [49, Chapter VI]. It is also related to a recent proposal of Krumhansl (1978), who responds to Tversky's criticisms with the simple equation

d̃(φ(i), φ(j)) = d(φ(i), φ(j)) + a(φ(i)) + b(φ(j)),

especially if we consider the fact that Shepard and Luce extend their model by supposing that d(φ(i), φ(j)) = −ln η_ij. For recent extensions of the Shepard-Luce models we refer to work by Nakatani (1972), and Townsend (1978). Models of this form can be used to find maximum likelihood estimates of the quasi-symmetry parameters and, in the complete specification, of the MDS coordinates. De Leeuw and Heiser (1979) propose a variety of probability models and computational techniques for these 'discrete interaction matrices'.

2.1.5. Restricted Euclidean MDS


In restricted Euclidean MDS not all maps φ of I into R^p are feasible. If x_i = φ(i) and the x_i are collected in the n × p matrix X, then general restrictions can be written as X ∈ Ω, with Ω a subset of R^{np}, the space of all n × p matrices. Various types of linear and nonlinear restrictions are possible [7, 8, 33], but we discuss what seems to be the most important type of restrictions. There are some obvious relationships between MDS and principal component analysis (PCA). One way of formulating the relationship of the two techniques is that MDS fits distances to data, while PCA fits inner products. PCA has been extended to factor analysis and, more generally, to 'structural analysis' of covariance matrices (a recent review is given by Jöreskog, 1978). Similar extensions of MDS are possible, and probably useful [33].
The restrictions we discuss are of the form x_i = Ty_i, where T and the y_i may be specified in various different ways. Observe the relationship with three-way scaling, where we used φ_k(i) = T_k x_i. The first special case, important in multidimensional psychophysics, has y_i known and T unknown and unrestricted. The y_i can be collected in an n × q matrix Y, T is a p × q matrix. Without loss of generality we can require that JY = Y, and that Y'Y = I. We can now apply Theorem 1 to derive an imbedding theorem.
THEOREM 16. Suppose (I, δ) is a finite semimetric space, Y is an n × q matrix such that JY = Y and Y'Y = I. Then there exists a p × q matrix T such that d(Ty_i, Ty_j) = δ(i, j) if and only if
(a) B = −½JEJ is psd,
(b) rank(Y'BY) ≤ p,
(c) rank(B + YY') = rank(B).

The proof is easy matrix algebra. In a principal component context this model
has been discussed by Carroll, Green, and Carmone (1976). It has been extended
to three-way PCA by Carroll and Pruzansky (1977). Various applications, in
which for example the matrix Y is an ANOVA-type design matrix, are also
discussed in these papers. A more complicated class of restrictions uses x_i = Ty_i in combination with T diagonal. This can be used to fit simplexes and circumplexes.
Cases in which X is partially restricted and partially free can be used to build
MDS versions of common factor models.
If T is diagonal and Y is binary (zero-one) a special interpretation of the Euclidean distance model is possible. Suppose P = {1, ..., p}. If S is a subset of P and t is the p-vector with the diagonal elements of T, then we can define

μ(S) = Σ{t_s² | s ∈ S}.

If S_1, ..., S_n are the subsets of P defined by

S_i = {s | y_is = 1},

then

d²(x_i, x_j) = μ(S_i Δ S_j),

with Δ the symmetric difference. The subset model has been studied in distance geometry by Kelly (1968, also 1975 and the references given there), and in MDS by Shepard and Arabie (1979). For special systems of subsets we recover the additive tree models of Buneman (1971), Cunningham (1978), and Sattath and Tversky (1977). For even more special systems we find the hierarchical trees of Johnson (1967) and many, many others.
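The symmetric difference interpretation can be verified directly; a minimal sketch, assuming a hypothetical binary indicator matrix Y and a vector t holding the diagonal of T:

    import numpy as np

    def subset_model_check(Y, t):
        # x_i = T y_i with T = diag(t); for binary Y, (y_is - y_js)^2 indicates the
        # symmetric difference of S_i and S_j, so d^2(x_i, x_j) = mu(S_i delta S_j)
        # with mu(S) the sum of t_s^2 over s in S.
        X = Y * t
        d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        mu = (np.abs(Y[:, None, :] - Y[None, :, :]) * t ** 2).sum(axis=2)
        assert np.allclose(d2, mu)
        return d2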

2.2. Non-Euclidean models

2.2.1. Additive difference models


In multidimensional psychophysics we suppose that the set of objects I has product structure, i.e. I = I_1 × ··· × I_p, and i ∈ I can be written as the p-tuple (i_1, ..., i_p). In classical multidimensional psychophysics the I_s are sets of real numbers, and we suppose that

δ(i, j) = Σ_{s=1}^p |i_s − j_s|

[2], or

δ²(i, j) = Σ_{s=1}^p (i_s − j_s)²

[128]. This approach has been generalized by Tversky (1966), whose work is also discussed in [5], and is improved and generalized in [134]. Now the I_s do not have to be sets of real numbers anymore, they do not even have to be ordered sets. We suppose that there are real valued functions φ_s on I_s, increasing functions χ_s: R → R, and an increasing function F: R → R, such that

δ(i, j) = F(Σ_{s=1}^p χ_s(|φ_s(i_s) − φ_s(j_s)|)).

This is called the additive difference model. Tversky gives necessary and sufficient conditions in terms of the dimensions of the product structure and the order relation on I × I which must be satisfied for an additive difference representation. It is also proved that the φ_s are interval scales, and the χ_s are interval scales with a common unit. Of course an additive difference model does not necessarily define a metric. The additive difference representation is said to be compatible with a metric with additive segments if the representation satisfies the assumptions of Beals and Krantz (1967), i.e., if the order on I × I also defines a convex metric. Krantz and Tversky (1970) prove the very satisfactory result that compatibility implies that there is an r ≥ 1 such that

δ^r(i, j) = Σ_{s=1}^p |φ_s(i_s) − φ_s(j_s)|^r.

In other words, the only additive difference models compatible with a convex metric are the power metrics. Tests of the additive difference theory have been carried out by Tversky and Krantz (1969), Wender (1971), Krantz and Tversky (1975), and Schönemann (1978).
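The power metric singled out by this result is straightforward to evaluate once scale values are available; a minimal sketch, assuming a hypothetical n × p array Phi with entries φ_s(i_s):

    import numpy as np

    def power_metric(Phi, r):
        # delta(i, j) = (sum_s |phi_s(i_s) - phi_s(j_s)|^r)^(1/r), with r >= 1;
        # r = 2 gives the Euclidean metric, r = 1 the 'city block' metric.
        diff = np.abs(Phi[:, None, :] - Phi[None, :, :])
        return (diff ** r).sum(axis=2) ** (1.0 / r)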

2.2.2. Minkovski geometry


Power metrics are already mentioned by Torgerson (1958, p. 294), but they did
not become popular in psychometrics until Kruskal (1964), Shepard (1964) and
Cross (1965). Imbedding of finite semimetric spaces in Euclidean space is
investigated by transforming squared distances to scalar products. There is no
scalar product associated with power metrics, and, as a consequence, there is no
imbedding theory for finite sets. There are some results for the city block case in
[35], but Eisler supposes that the order in which the points project on the
dimensions is known.
There are more results in the infinite case. We have already discussed the
elegant characterization of the power metrics by Krantz and Tversky, but this
supposes that the objects have a product structure. If only the metric, or an order on I × I, is given, we have to follow a different route. We can use results of Andalafte and Blumenthal (1964) that characterize Banach spaces in the class of complete convex metric spaces, by using ordinal properties of the metric only. A Minkovski space is a finite dimensional Banach space in which power metrics can be characterized by using homogeneity. Another possibility is to characterize Minkovski spaces in the class of straight G-spaces, as in [13, pp. 144-163], by using the theory of parallels. In both cases we simply have to add some qualitative axioms to the ones given by Beals and Krantz (1967); the additional axioms are 'testable' and not 'technical' in the sense of Beals, Krantz, and Tversky (1968). Both Andalafte and Blumenthal (1964) and Busemann (1955) list additional simple qualitative properties which characterize Euclidean space in the class of Minkovski spaces.

2.2.3. Classical non-Euclidean geometries


We have already seen earlier in this paper that Luneburg's theory of binocular vision has inspired psychometricians (for example Indow, 1975, with references) to look at hyperbolic space. For the classical non-Euclidean spaces far more interesting results are available than for Minkovski spaces. In fact Blumenthal (1953) concentrates throughout his book on imbedding semimetric spaces in hyperbolic, elliptic, spherical, and Euclidean spaces. He uses analogues of the Cayley-Menger determinant, but Schoenberg (1935) gives a quadratic form result for spherical space, and Valentine (1969) gives a quadratic form result for hyperbolic space. The relationships between the elementary spaces are studied, using matrices and quadratic forms, in [114]. In MDS Indow and his collaborators have used ad hoc methods to find imbeddings in hyperbolic space, and Pieszko (1975) has constructed a method to imbed in Riemannian spaces of non-constant curvature. Lindman and Caelli (1978) criticized Pieszko's work and proposed a different algorithm. We give versions of the theorems of Valentine and Schoenberg which make them maximally like the Euclidean Theorem 1. We write S_{p,ρ} for the p-dimensional spherical space of radius ρ.

THEOREM 17. The finite semimetric space (I, δ) can be imbedded in S_{p,ρ} if and only if
(a) δ(i, j) ≤ πρ,
(b) the matrix B with elements b_ij = cos(δ(i, j)/ρ) is psd, and
(c) rank(B) ≤ p + 1.

For p-dimensional hyperbolic space H_{p,ρ} the situation is even more like Euclidean space. We define the matrix H with elements h_ij = cosh(δ(i, j)/ρ), and the matrix B by b_ij = 1 − h_ij h_.. /(h_i. h_.j), where dots are averages again.

THEOREM 18. The finite semimetric space (I, δ) can be imbedded in H_{p,ρ} if and only if B is positive semi-definite, and rank(B) ≤ p.
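The B-matrices of Theorems 17 and 18 are as easy to form as the Euclidean one, and can be combined with the psd and rank checks used for Theorem 1; a minimal sketch, following the definitions given above:

    import numpy as np

    def spherical_B(delta, rho):
        # Theorem 17: b_ij = cos(delta(i, j)/rho); imbeddable in S_{p,rho} iff
        # delta <= pi*rho, B is psd and rank(B) <= p + 1.
        return np.cos(delta / rho)

    def hyperbolic_B(delta, rho):
        # Theorem 18: h_ij = cosh(delta(i, j)/rho), b_ij = 1 - h_ij h_.. / (h_i. h_.j)
        H = np.cosh(delta / rho)
        return 1.0 - H * H.mean() / np.outer(H.mean(axis=1), H.mean(axis=0))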
Congruence orders of S_{p,ρ} and H_{p,ρ}, i.e. the analogues of Theorem 2, are studied in Blumenthal (1953). The theorem is also true in these spaces; they also have congruence order p + 3. For elliptic space the situation is more complicated. For imbedding results and congruence order results we refer to Blumenthal [10, Chapters IX-XI], and the more recent review of Seidel [115]. Because of the developments of non-Euclidean geometry at the end of the previous century and in the beginning of this century we know many ways to solve the space problem for these geometries. Not all of them are useful for our purposes, however. Busemann discusses many metric characterizations. The most convenient one is given in the following theorem [13, p. 331].

THEOREM 19. If each bisector B(a, a') (i.e. the locus of points x with xa = xa') of a G-space R contains with any two points x, y at least one segment T(x, y), then the space is Euclidean, hyperbolic, or spherical of dimension greater than 1.

Again it is easy to add the flatness of the bisectors to the axioms of Beals and
Krantz (1967) for G-spaces.

3. Multidimensional scaling algorithms

3.1. Least squares on the scalar products

3.1.1. Metric two-way MDS


Theorem 1 suggests that a reasonable way to solve the metric Euclidean two-way scaling problem is to minimize the loss function

σ_1(X) = tr(B − XX')²

over all X in R^{np}. Suppose λ_1 ≥ ··· ≥ λ_n are the ordered eigenvalues of B, let λ̃_s = min(0, λ_s), and let k_s be an eigenvector corresponding with λ_s; the k_s corresponding with equal eigenvalues are chosen to be orthogonal.

THEOREM 20.

min{σ_1(X) | X ∈ R^{np}} = Σ_{s=1}^p λ̃_s² + Σ_{s=p+1}^n λ_s².

Moreover, the minimum is attained if we set column s of X equal to (λ_s − λ̃_s)^{1/2} k_s, s = 1, ..., p.
A proof is given, for example, by Keller (1958). This theorem justifies the
classical scaling method explained most extensively by Torgerson (1958). Observe
that the last columns of X are equal to zero if B does not have p positive
eigenvalues. This metric scaling procedure has three major advantages compared
with other competing ones. In the first place we know how to compute eigenvec-
tors and eigenvalues precisely and efficiently. In the second place the solutions
are nested in the sense that the solution for q dimensions is contained in the
solution for p dimensions if q < p. And finally, we are sure that we find the global
minimum of the loss function, and not merely a local minimum. The disadvantage
is that the procedure of computing B only makes sense in the Euclidean case. We
can use Theorems 17 and 18 to construct metric scaling procedures for spherical
and hyperbolic geometry, but they use another definition of B. If we cannot
transform to scalar products, as in the Minkovski case, then these procedures
cannot be used. Another disadvantage in the Euclidean and hyperbolic case is
that the elements of B are not independent if the elements of E are independent,
which makes unweighted least squares look bad. Moreover, the method loses
much of its computational appeal if there are missing data, while this does not
bother other methods.
Carroll, Green, and Carmone (1976) have already pointed out that the simple
scaling procedure can be generalized if we have linear restrictions of the form
X = YT', with Y'Y = I and Y known. The loss function σ_1(X) becomes σ_1(S) = tr(B − YSY')², with S = T'T. If we define Ŝ = Y'BY, then we can write σ_1(S) = tr(B − YŜY')² + tr(Ŝ − S)², which we can minimize by minimizing tr(Ŝ − T'T)² over T, using the least squares result of Keller (1958) again, as we did in Theorem 20. The asymmetric 'slide vector' model can also be fitted by using the same matrix methods.
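A minimal sketch of Theorem 20 and of the restricted variant just described (function names are illustrative assumptions):

    import numpy as np

    def classical_scaling(B, p):
        # Theorem 20: least squares fit of XX' to B; negative eigenvalues are truncated.
        lam, K = np.linalg.eigh(B)
        order = np.argsort(lam)[::-1][:p]
        return K[:, order] * np.sqrt(np.clip(lam[order], 0.0, None))

    def restricted_scaling(B, Y, p):
        # X = YT' with JY = Y and Y'Y = I: fit T'T to S-hat = Y'BY by the same
        # eigenvalue truncation, then map back to configuration space.
        S_hat = Y.T @ B @ Y
        Tprime = classical_scaling(S_hat, p)      # q x p matrix playing the role of T'
        return Y @ Tprime                         # the restricted configuration X = YT'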

3.1.2. Nonmetric two-way MDS


Theorem 5 suggests an interesting way to solve the additive constant problem [106]. We have to minimize

σ_1(X, α) = tr{B(α) − XX'}²

over all X in R^{np}, and over α. Define the minimum of σ_1(X, α) over X for fixed α as σ̄_1(α). Then, in the same way as in Theorem 20,

σ̄_1(α) = Σ_{s=1}^p λ̃_s²(α) + Σ_{s=p+1}^n λ_s²(α),

where the λ_s(α) are the ordered eigenvalues of B(α) and λ̃_s(α) = min(0, λ_s(α)).

This is a function of the single real parameter α, which can be minimized
efficiently in a number of ways. It is clear that the approach generalizes without
further complications to any problem in which we have a one-parameter family of
matrices B(α). By using the theory of lambda-matrices [81] we can in all cases
construct efficient algorithms. This fact was used by Critchley (1978) in his
alternative approach to MDS. In our notation he does not optimize a criterion
which depends on the choice of the dimensionality p, but he suggests maximizing
the variance of the λ(α). This usually makes a good fit in a low dimensionality
possible. The approach in this section can obviously be generalized to multi-
parameter problems. Theorems 11 and 12 suggest defining σ1(X, θ) and σ̃1(θ).
Nonmetric Euclidean MDS can be formulated as minimization of these two
functions. In unpublished research De Leeuw, Takane, and Young have applied
alternating least squares to σ1(X, θ). Each iteration has two steps. In step 1 we
minimize σ1(X, θ) over X with θ fixed at its current value. This is a partial
eigen-problem. In later iterations a very good initial estimate is available, and we
consequently use simultaneous iteration [105] to compute the vectors. In step 2 we
minimize σ1(X, θ) over θ with X fixed at its current value. This is a linear least
squares problem. Often the vector θ is restricted to be nonnegative. We use a fast
iterative quadratic programming routine using the good initial estimate. The
complete algorithm, called INDISCAL, is extremely fast, especially if θ has only a
few elements (few missing data, or many ties in ordinal data). If we want a very
precise solution we can use Newton's method in the final iterations. Alternatively
we can also use Critchley's criterion in multiparameter problems.
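Returning to the one-parameter additive constant case, a minimal sketch of the scalar minimization might look as follows. This is our own illustration in Python (numpy and scipy assumed); it uses a bounded one-dimensional search rather than the lambda-matrix machinery of [81], and it assumes B(α) is the doubly centered matrix of the quantities −½(δ_ij + α)², with α added to the off-diagonal dissimilarities only.

    import numpy as np
    from scipy.optimize import minimize_scalar

    def B_of_alpha(Delta, alpha):
        # Doubly centered -1/2 (delta_ij + alpha)^2 (our assumed form of B(alpha)).
        n = Delta.shape[0]
        D = Delta + alpha * (1.0 - np.eye(n))
        J = np.eye(n) - np.ones((n, n)) / n
        return J @ (-0.5 * D ** 2) @ J

    def sigma1_tilde(alpha, Delta, p):
        # min over X of sigma_1(X, alpha), via the eigenvalue formula of Theorem 20.
        lam = np.sort(np.linalg.eigvalsh(B_of_alpha(Delta, alpha)))[::-1]
        return np.sum(np.minimum(lam[:p], 0.0) ** 2) + np.sum(lam[p:] ** 2)

    def best_additive_constant(Delta, p, lo=0.0, hi=10.0):
        res = minimize_scalar(sigma1_tilde, args=(Delta, p), bounds=(lo, hi), method='bounded')
        return res.x, res.fun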
The major disadvantage of these procedures is that they may converge to
non-global minima. Moreover, it is usually the case that B(θ) or B(α) has some
negative eigenvalues at the optimum, which may be undesirable in some applica-
tions. An alternative approach is to require that B(θ) is psd, and to maximize the
sum of the first p eigenvalues of B(θ). In missing data problems or ordinal MDS
problems this amounts to maximizing a convex function on a convex set. In [29]
some ways of solving this problem have been suggested. By approximating the
convex set by a sequence of convex polyhedra we can approximate the global
minimum of the loss function in this case (but the algorithm is very expensive). In
ordinal MDS (Theorem 12) the matrix B(θ) is of the form given in that theorem, and
minimizing σ̃1(θ) is easy enough: we simply set θ = 0. Thus in this case the problem
must be normalized: if we minimize σ̃1(θ), we require tr B²(θ) = 1. If we maximize
the sum of the first p eigenvalues, we require tr B(θ) = 1.
The metric unfolding algorithm of Schönemann (1970) also belongs in this
section. It is based on an algebraic analysis of the metric unfolding model, which
has been clarified further by Gold (1973). Unfortunately, numerical experiments
of Heiser and De Leeuw (1978, 1979) indicate that Schönemann's algorithm does
not work very well. As a matter of fact, even the best metric unfolding methods
do not work very well. Nonmetric unfolding methods do not work at all.

3.1.3. Three-way MDS


The theory in Subsection 2.1.3 suggests that we minimize

    σ1(X; C_k) = Σ_{k=1}^{m} tr(B_k − XC_kX')²

over X and C_1, ..., C_m. In the IDIOSCAL case there are no further restrictions on
the C_k, in the INDSCAL case we require them to be diagonal. In [17, 50] an
alternating least squares algorithm was proposed for the INDSCAL model. A
slightly different ALS method was proposed and implemented for the IDIOSCAL
model by Kroonenberg and De Leeuw (1978). The algorithms have two substeps in
each iteration: in the first substep we minimize σ1(X; C_k) over the C_k with X fixed
at its current value, in the second substep we minimize σ1(X; C_k) over X with the
C_k fixed at their current values. The first subproblem is simple because we can
solve it for each k separately and because XC_kX' is linear in C_k. The second
problem is less simple; Carroll and Chang propose a general ALS trick which we
shall call splitting. Instead of minimizing σ1(X; C_k) we minimize

    σ1(X, Y; C_k) = Σ_{k=1}^{m} tr(B_k − XC_kY')²

over both X and Y. Because the B_k are symmetric and the C_k are symmetric too,
we expect X and Y to converge to the same value. Splitting can be used to
generalize ALS from multivariate multilinear problems to multivariate poly-
nomial problems in many cases but the precise conditions under which splitting
works have not been established. In the three-way case it is easy to show that
using splitting is closely related to using the Gauss-Newton method. In the
Gauss-Newton method we minimize the approximation

    Σ_{k=1}^{m} tr{B_k − (XC_kX' + XC_kA' + AC_kX')}²

over A, and then set the new X equal to the old X plus A. This gives the same
iterates as ALS applied to σ1(X, Y; C_k). Of course convergence of the ALS
procedures is no problem, they converge more or less by definition. Often
convergence is painfully slow, however. Ramsay (1973) has shown that we can
usually accelerate convergence considerably by choosing a suitable relaxation
factor in the ALS iterations. Another disadvantage of ALS in this context is that
the procedure may converge to a C_k which is not psd, or to a W_k which is not
nonnegative.
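For the INDSCAL case (diagonal C_k) the two substeps with splitting can be sketched as follows. This is our own Python/numpy illustration of the scheme described above, not the code of [17, 50] or of Kroonenberg and De Leeuw; no relaxation factor and no psd safeguards are included.

    import numpy as np

    def indscal_als(B_list, p, n_iter=100, seed=0):
        # Alternating least squares for sigma_1(X; C_k) = sum_k tr(B_k - X C_k X')^2,
        # with diagonal C_k, using the splitting loss sum_k tr(B_k - X C_k Y')^2.
        rng = np.random.default_rng(seed)
        n, m = B_list[0].shape[0], len(B_list)
        X = rng.standard_normal((n, p))
        Y = X.copy()
        C = [np.eye(p) for _ in range(m)]
        for _ in range(n_iter):
            # Substep 1: each C_k separately; B_k ~ X C_k Y' is linear in diag(C_k).
            M = np.stack([np.outer(X[:, s], Y[:, s]).ravel() for s in range(p)], axis=1)
            for k in range(m):
                c, *_ = np.linalg.lstsq(M, B_list[k].ravel(), rcond=None)
                C[k] = np.diag(c)
            # Substep 2: X and Y in turn, each a linear least squares problem.
            X = sum(B_list[k] @ Y @ C[k] for k in range(m)) @ np.linalg.pinv(
                sum(C[k] @ Y.T @ Y @ C[k] for k in range(m)))
            Y = sum(B_list[k].T @ X @ C[k] for k in range(m)) @ np.linalg.pinv(
                sum(C[k] @ X.T @ X @ C[k] for k in range(m)))
        return X, Y, C

Because the B_k and C_k are symmetric, X and Y are expected to converge to the same configuration, as noted above.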
We can compute very good initial estimates for our iterative procedures by
using Theorems 13 and 14. The proof of Theorem 13 gives us IDIOSCAL
estimates of X and Ck, Theorem 14 says that we can find INDSCAL estimates if
we diagonalize these C_k. A number of these two-step procedures have been
proposed; the first one in [112], the most straightforward one in [34]. This last
paper also has the necessary references. Another two-step procedure proposed by
De Leeuw is used to construct the initial configuration in ALSCAL. It is
described in [139]. It is clear that we can construct nonmetric versions of
three-way MDS by combining the results in this section with those from the
previous section. Carroll and Chang (unpublished) have experimented with non-
metric INDSCAL, called NINDSCAL, while Richard Sands (in press) has a
nonmetric version of IDIOSCAL which is comparable to ALSCAL.

3.2. Least squares on the squared distances

3.2.1. Two-way MDS
For metric two-way MDS we can also consider the loss function

    σ2(X) = Σ_{i=1}^{n} Σ_{j=1}^{n} (h_{ij} − d²_{ij}(X))²,

with h_{ij} = δ²(i, j), as before, and d²_{ij}(X) = (x_i − x_j)'(x_i − x_j). This loss function
was proposed by Obenchain (1971) and Hayashi (1974), but efficient algorithms
to minimize functions like this in several different MDS situations were proposed
by Takane, Young, and De Leeuw (1977). The current version of the ALSCAL
algorithm (see [140]) can handle all kinds of metric/nonmetric two/three-way
data structures, using the basic two-step alternating least squares methodology of
Young, De Leeuw, and Takane (in press). The interesting problem is how we
must minimize σ2(X) over X. In ALSCAL a single coordinate is changed at a
time, the other coordinates are fixed at their current values, and we cycle through
the coordinates. Of course σ2 is a quartic in each coordinate; we minimize over the
coordinate by solving a cubic. The procedure may not be very appealing at first
sight but is surprisingly efficient.
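The single-coordinate update can be sketched as follows; this is our own Python/numpy illustration of the idea, not the ALSCAL code itself. For the coordinate x_{is} the restricted loss is a quartic polynomial, so its derivative is a cubic whose real roots can be inspected directly.

    import numpy as np

    def update_coordinate(X, H, i, s):
        # One substep: minimize sigma_2 over the single coordinate X[i, s], all other
        # coordinates fixed.  H contains the squared disparities h_ij = delta^2(i, j).
        n = X.shape[0]
        quartic = np.zeros(5)                       # polynomial coefficients, highest degree first
        for j in range(n):
            if j == i:
                continue
            rest = np.sum((X[i] - X[j]) ** 2) - (X[i, s] - X[j, s]) ** 2
            # residual e_ij(t) = h_ij - ((t - X[j, s])^2 + rest) is quadratic in t = X[i, s]
            e = np.array([-1.0, 2.0 * X[j, s], H[i, j] - X[j, s] ** 2 - rest])
            quartic += np.polymul(e, e)             # e_ij(t)^2; the symmetric (j, i) term only doubles it
        roots = np.roots(np.polyder(quartic))       # derivative of the quartic is a cubic
        real = roots[np.abs(roots.imag) < 1e-9].real
        X[i, s] = real[np.argmin(np.polyval(quartic, real))]
        return X

Cycling this update over all coordinates gives the basic configuration step described above.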
The loss function σ2(X) is called SSTRESS by Takane et al., the loss function
σ1(X) is called STRAIN by Carroll. There is an interesting relationship between
STRAIN and SSTRESS which explains why the initial configuration routines for
ALSCAL work as well as they do.

THEOREM 21. If X is centered, then σ2(X) ≥ 4σ1(X).

PROOF. Define σ(X, b) = tr{U(b) − XX'}², where U(b) has elements

    u_{ij}(b) = −½(h_{ij} − b_i − b_j).

Then σ1(X) = min{σ(X, b) | b}, while σ2(X) = 4σ(X, a), with a_i = Σ_s x²_{is}.

If we combine Theorems 20 and 21, we obtain a lower bound for

    min{σ2(X) | X ∈ R^{n×p}}.

Similar bounds can be obtained for nonmetric and three-way versions of
SSTRESS in terms of STRAIN.
In nonmetric two-way scaling we have to minimize σ2(X, d̂) over X in R^{n×p} and
over all admissible disparities. In the ordinal case we have to use normalization
requirements again to prevent certain forms of degeneracy. Discussions on how to
normalize SSTRESS are in [138] and in [140]. In the simplest case (for ordinal
two-way MDS) we require that the sum of the fourth powers of the disparities is
unity (the loss function is the sum of squares of the differences between squared
disparities and squared distances).

THEOREM 22. Suppose that a matrix with all off-diagonal disparities equal is
admissible. Suppose in addition that n and p are such that an n × p matrix Y exists
with Σ_i y_{is} = 0 for each s, Σ_i y²_{is} = n/p for each s, Σ_i y_{is} y_{it} = 0 for all s ≠ t,
and Σ_s y²_{is} = 1 for all i. Then

    min{σ2(X, d̂) | X, d̂} ≤ 1 − (p/(p + 1))(n/(n − 1)).

PROOF. Simply substitute constant d̂, suitably normalized, and Y in the formula
for SSTRESS.

The assumptions in Theorem 22 are quite realistic. Constant disparities are
admissible in all ordinal MDS problems, a matrix Y with the required properties
exists if p = 1 and n is even, and also if p = 2 (regular polygon). The theorem is
important because it tells us that even if there is no structure at all in the data, we
can still find fairly low SSTRESS by distributing clumps of points regularly on a
sphere. Observe that Y satisfies the conditions of the theorem for n = 10 and p = 2
if we have ten points equally spaced on a circle, but also if we have five groups of
two points equally spaced on a circle. In fact, in many applications we have found
that ALSCAL tends to make clumps on a sphere.
Using squared distance loss has one important advantage over inner product
loss. We have seen that the double centering operation B = −½JHJ can introduce
statistical dependencies, but it also complicates the regression problems in the
optimal scaling step. In the ordinal case, for example, minimizing σ1(X, θ) for
fixed X over θ ≥ 0 is a quadratic programming problem; minimizing σ2(X, d̂) over
the admissible d̂ for fixed X is also a quadratic programming problem, but it has
simpler structure and consequently the more efficient monotone regression algo-
rithm can be used. A disadvantage is that using squared distances is more
complicated in the metric case, and Theorem 22 shows that using squared
distances may bias towards regular spherical configurations. For inner product
loss the corresponding upper bound is 1 − p/(n − 1), which is attained for all Y
satisfying the less stringent conditions Σ_i y_{is} y_{it} = δ_{st}.
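The monotone regression step itself can be sketched with the usual pool-adjacent-violators algorithm. The following is a generic illustration in Python (unit weights, our own function name), not the ALSCAL implementation; in the ordinal optimal scaling step it would be applied to the squared distances arranged in the order of the data, followed by the normalization discussed above.

    def monotone_regression(y):
        # Least squares fit to y that is nondecreasing in the given order (pool adjacent violators).
        values, sizes = [], []                 # block means and block sizes
        for v in y:
            values.append(float(v))
            sizes.append(1)
            while len(values) > 1 and values[-2] > values[-1]:
                total = values[-2] * sizes[-2] + values[-1] * sizes[-1]
                size = sizes[-2] + sizes[-1]
                values[-2:] = [total / size]   # merge the two offending blocks
                sizes[-2:] = [size]
        fitted = []
        for v, s in zip(values, sizes):
            fitted.extend([v] * s)
        return fitted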

3.2.2. Three-way MDS
The only three-way MDS program based on squared distance loss is ALSCAL.
We do not discuss the algorithm here because the principles are obvious from the
previous section. There are two substeps, the first one is the optimal scaling step,
it finds new disparities for a given configuration, the second one changes the
configuration by the cubic equation algorithm of ALSCAL and the weights for
the individuals by linear regression techniques. There is an interesting modifica-
tion which fits naturally into the ALSCAL framework, although it has not been
implemented in the current versions. We have seen that the inner product
algorithms can give estimates of C_k = T_kT_k' that are not psd. The same thing is
true for ALSCAL, but if we minimize the loss over Tk instead of over C k we do

not have this problem, and the minimization can be carried out 'one variable at a
time' by using the cubic equation solver again. This has the additional advantage
that we can easily incorporate rank restrictions on the C k. If we require that the
Tk are p × 1, for example, we can fit the 'personal compensatory model' men-
tioned by Coombs (1964, p. 199), and by Roskam (1968, Chapter IV).
An important question in constructing MDS loss functions is how they should
be normalized. This is discussed in general terms in [78], and for ALSCAL in
[125] and [140]. MacCallum (1977) studies the effect of different normalizations in
a three-way situation empirically. Other Monte Carlo studies have been carried
out by MacCallum and Cornelius (1977), who study metric recovery by ALSCAL,
and by MacCallum (unpublished), who compares ALSCAL and INDSCAL re-
covery (in terms of mean squared error). It seems that metric INDSCAL often
gives better results than nonmetric ALSCAL, even for nonmetric data. This may
be due to the difference in metric/nonmetric, but also to the difference between
scalar product and squared distance loss. Takane, Young, and De Leeuw (1977)
compare CPU-times of ALSCAL and INDSCAL. The fact that ALSCAL is much
faster seems to be due almost completely to the better initial configuration, cf.
[34].

3.3. Least squares on the distances

3.3.1. Metric two-way MDS


The most familiar MDS programs are based on the loss function

    σ3(X) = Σ_{i=1}^{n} Σ_{j=1}^{n} w_{ij}(δ_{ij} − d_{ij}(X))²,

where we have written δ_{ij} for δ(i, j) and where we have introduced nonnegative
weights w_{ij}. For w_{ij} ≡ 1 this loss function is STRESS, introduced by Kruskal [73,
74]. The Guttman-Lingoes-Roskam programs are also based on loss functions of
this form. Using σ3 seems somewhat more direct than using σ2 or σ1; moreover,
both σ2 and σ1 do not make much sense if the distances are non-Euclidean. A
possible disadvantage of σ3 is that it is somewhat less smooth, and that computer
programs that minimize σ3 usually converge more slowly than programs mini-
mizing σ1 or σ2. Moreover, the classical Young-Householder-Torgerson starting
point works better for σ2 and σ1, which has possibly some consequences for the
frequency of local minima. There are as yet, however, no detailed comparisons of
the three types of loss functions.
We have introduced the weights w_{ij} for various reasons. If there is information
about the variability of the δ_{ij}, we usually prefer weighted least squares for
statistical reasons; if there is a large number of independent identically distributed
replications, then weighted least squares gives efficient estimates and the mini-
mum of σ3 has a chi-square distribution. Another reason for using weights is that
we can compare STRESS and SSTRESS more easily. It is obvious that if
δ_{ij} ≈ d_{ij}(X) and if we choose w_{ij} = 4δ²_{ij}, then σ3(X) ≈ σ2(X). Thus, if a good fit is
possible, we can imitate the behaviour of σ2 by using σ3 with suitable weights.
Ramsay (1977, 1978) has proposed the loss function

    σ4(X) = Σ_{i=1}^{n} Σ_{j=1}^{n} (ln δ_{ij} − ln d_{ij}(X))²,

which makes sense for log-normally distributed dissimilarities. Again, if δ_{ij} ≈
d_{ij}(X) and if we choose w_{ij} = 1/δ²_{ij}, we find σ3(X) ≈ σ4(X).
The algorithms for minimizing σ3(X) proposed by Kruskal [73, 74], Roskam
(1968), Guttman (1968), Lingoes and Roskam (1973) are gradient methods. They
are consequently of the form

    X^{(τ+1)} = X^{(τ)} − α_τ ∇σ3(X^{(τ)}),

where index τ is the iteration number, ∇σ3 is the gradient, and α_τ > 0 is the
step-size. Kruskal (1977) discusses in detail how he chooses his step-sizes in
MDSCAL and KYST, the same approach with some minor modifications is
adopted in the MINISSA programs of Lingoes and Roskam (1973). KYST [80]
also fits the other power metrics, but for powers other than two there are both
computational and interpretational difficulties. The city-block (power = 1) and
sup-metric (power = ∞) are easy to interpret but very difficult to fit because of the
serious discontinuities of the gradient and the multitude of local minima. The
intermediate cases are easier to fit but difficult to interpret. We prefer a somewhat
different approach to step-size. Consider the Euclidean case first. The cross
product term

    ρ(X) = Σ_{i} Σ_{j} w_{ij} δ_{ij} d_{ij}(X)

in the definition of σ3(X) is a homogeneous convex function, the term η²(X) =
Σ_{i} Σ_{j} w_{ij} d²_{ij}(X) is quadratic and can be written as η²(X) = tr X'VX for some V. If ρ is
differentiable at X, then ∇σ3(X) = 2VX − 2∇ρ(X), which suggests the algorithm

    X^{(τ+1)} = V⁺ ∇ρ(X^{(τ)}),

with V⁺ a generalized inverse of V. If ρ is not differentiable at X, which happens
only if x_i = x_j for some i ≠ j, then we can use the subgradient ∂ρ(X) instead of
the gradient ∇ρ(X), and use the algorithm

    X^{(τ+1)} ∈ V⁺ ∂ρ(X^{(τ)}).
De Leeuw and Heiser (1980) proved the following global convergence theorem.

THEOREM 23. Consider the algorithm A^{(τ)} ∈ V⁺ ∂ρ(X^{(τ)}) and

    X^{(τ+1)} = α_τ A^{(τ)} + (1 − α_τ) X^{(τ)}

with 0 < c_1 < α_τ < 2 − c_2 < 2. Then σ3(X^{(τ)}) is a decreasing, and thus convergent,
sequence. Moreover

    tr(X^{(τ+1)} − X^{(τ)})' V (X^{(τ+1)} − X^{(τ)})

converges to zero.

In [30] there is a similar, but less general, result for general Minkowski metrics.
In that case the computations in an iteration are also considerably more com-
plicated.
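In the Euclidean case, with non-differentiability handled by the usual convention (a zero contribution whenever x_i = x_j), one step of the algorithm of Theorem 23 with α_τ = 1 can be sketched as follows. This is our own Python/numpy reading of the update X^{(τ+1)} = V⁺ ∂ρ(X^{(τ)}), not the SMACOF programs of [55, 57].

    import numpy as np

    def smacof_update(X, Delta, W=None):
        # One step of X+ = V^+ grad rho(X).  Delta holds the dissimilarities delta_ij,
        # W the nonnegative weights w_ij (unit off-diagonal weights if W is None).
        n = X.shape[0]
        if W is None:
            W = np.ones((n, n)) - np.eye(n)
        D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
        C = np.where(D > 0, W * Delta / np.where(D > 0, D, 1.0), 0.0)
        np.fill_diagonal(C, 0.0)
        B = np.diag(C.sum(1)) - C              # grad rho(X) = 2 B X where rho is differentiable
        V = np.diag(W.sum(1)) - W              # eta^2(X) = 2 tr X'VX with this scaling of V
        return np.linalg.pinv(V) @ B @ X       # the factors of 2 cancel

Iterating this map decreases σ3 monotonically; for unit weights the update reduces to the familiar form (1/n)B(X)X, and the projected version of the next subsection only adds a projection after this step.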

3.3.2. MDS with restrictions


Consider general restrictions of the form X ∈ Ω, with Ω a given subset of R^{n×p}.
Gradient methods can be used without much trouble if the restrictions are simple
(fix some parameters at constant values, restrict others to be equal). This has been
discussed by Bentler and Weeks (1977) and by Bloxom (1978). For complicated
sets of restrictions the gradient methods get into trouble and must be replaced by
one of the more complicated feasible direction methods of nonlinear pro-
gramming. The approach based on convex analysis generalizes quite easily.
De Leeuw and Heiser (1980) proposed the convergent algorithm

    X^{(τ+1)} = P_Ω(V⁺ ∂ρ(X^{(τ)})),

where P_Ω is the metric projection on Ω in the metric defined by V. Three-way
MDS methods are special cases of this general model, because we can use an
nm × nm supermatrix of weights W, with only the m diagonal submatrices of
order n nonzero. The configurations can be collected in an nm × p supermatrix X
whose m submatrices of order n × p must satisfy the restrictions X_k = YT_k.

3.3.3. Nonmetric MDS


Nonmetric versions of all algorithms discussed in the previous sections can be
easily constructed by using the general optimal scaling approach which alternates
disparity adjustment and parameter (or configuration) adjustment. In the convex
analysis approach we can moreover use the fact that the maximum of ρ(X) over
normalized disparities is the pointwise maximum of convex functions and conse-
quently also convex. Thus if we impose suitable normalization requirements, the
same theory applies as in the metric case (cf. [31]).

References

[1] Andalafte, E. Z. and Blumenthal, L. M. (1964). Metric characterizations of Banach and


Euclidean spaces. Fund. Math. 55, 23-55.
[2] Attneave, F. (1950). Dimensions of similarity. Amer. J. Psychol. 63, 516-556.
[3] Baker, R. F., Young, F. W. and Takane, Y. (1979). An asymmetric Euclidean model: an
alternating least squares method with optimal scaling features. Psychometrika, to appear.

[4] Beals, R. and Krantz, D. H. (1967). Metrics and geodesics induced by order relations. Math. Z.
101, 285-298.
[5] Beals, R., Krantz, D. H. and Tversky, A. (1968). Foundations of multidimensional scaling.
Psychol. Rev. 75, 127-142.
[6] Bennett, J. F. and Hays, W. L. (1960). Multidimensional unfolding, determining the dimen-
sionality of ranked preference data. Psychometrika 25, 27-43.
[7] Bentler, P. M. and Weeks, D. G. (1978). Restricted multidimensional scaling. J. Math. Psychol.
17, 138-151.
[8] Bloxom, B. (1978). Constrained multidimensional scaling in N-spaces. Psychometrika 43,
397-408.
[9] Blumenthal, L. M. (1938). Distance geometries. Univ. Missouri Studies 13 (2).
[10] Blumenthal, L. M. (1953). Theory and Applications of Distance Geometry. Clarendon Press,
Oxford.
[11] Blumenthal, L. M. (1975). Four point properties and norm postulates. In: L. M. Kelly, ed., The
Geometry of Metric and Linear Spaces. Lecture Notes in Mathematics 490. Springer, Berlin.
[12] Bunemann, P. (1971). The recovery of trees from measures of dissimilarity. In: R. F. Hodson,
D. G. Kendall and P. Tautu, eds., Mathematics in the Archeological and Historical Sciences.
University of Edinburgh Press, Edinburgh.
[13] Busemann, H. (1955). The Geometry of Geodesics. Academic Press, New York.
[14] Busemann, H. (1970). Recent Synthetic Differential Geometry. Springer, Berlin.
[15] Carroll, J. D. (1976). Spatial, non-spatial and hybrid models for scaling. Psychometrika 41,
439-463.
[16] Carroll, J. D. and Arabie, P. (1980). Multidimensional scaling. Ann. Rev. Psychol. 31, 607-649.
[17] Carroll, J. D. and Chang, J. J. (1970). Analysis of individual differences in multidimensional
scaling via an N-way generalization of 'Eckart-Young' decomposition. Psychometrika 35,
283-319.
[18] Carroll, J. D., Green, P. E. and Carmone, F. J. (1976). CANDELINC: a new method for
multidimensional analysis with constrained solutions. Paper presented at International Congress
of Psychology, Paris.
[19] Carroll, J. D. and Pruzansky, S. (1977). MULTILINC: multiway CANDELINC. Paper
presented at American Psychological Association Meeting, San Francisco.
[20] Carroll, J. D. and Wish, M. (1974). Models and methods for three-way multidimensional
scaling. In: Contemporary Developments in Mathematical Psychology. Freeman, San Francisco.
[21] Cayley, A. (1841). On a theorem in the geometry of position. Cambridge Math. J. 2, 267-271.
[22] Cliff, N. (1973). Scaling. Ann. Rev. Psychol. 24, 473-506.
[23] Constantine, A. G. and Gower, J. C. (1978). Graphical representation of asymmetric matrices.
Appl. Statist. 27, 297-304.
[24] Coombs, C. H. (1964). A Theory of Data. Wiley, New York.
[25] Cormack, R. M. (1971). A review of classification. J. Roy. Statist. Soc. Ser A. 134, 321-367.
[26] Critchley, F. (1978). Multidimensional scaling: a critique and an alternative. In: L. C. A.
Corsten and J. Hermans, eds., COMPSTAT 1978. Physica Verlag, Vienna.
[27] Cross, D. V. (1965). Metric properties of multidimensional stimulus generalization. In: J. R.
Barra et al., eds., Stimulus Generalization. Stanford University Press, Stanford.
[28] Cunningham, J. P. (1978). Free trees and bidirectional trees as representations of psychological
distance. J. Math. Psychol. 17, 165-188.
[29] De Leeuw, J. (1970). The Euclidean distance model. Tech. Rept. RN 02-70. Department of
Datatheory, University of Leiden.
[30] De Leeuw, J. (1977). Applications of convex analysis to multidimensional scaling. In: J. C.
Lingoes, ed., Progress in Statistics. North-Holland, Amsterdam.
[31] De Leeuw, J. and Heiser, W. (1977). Convergence of correction matrix algorithms for
multidimensional scaling. In: Geometric Representations of Relational Data. Mathesis Press,
Ann Arbor.
[32] De Leeuw, J. and Heiser, W. (1979). Maximum likelihood multidimensional scaling of
interaction data. Department of Datatheory, University of Leiden.

[33] De Leeuw, J. and Heiser, W. (1980). Multidimensional scaling with restrictions on the
configuration. In: P. R. Krishnaiah, ed., Multivariate Analysis, Vol. V. North-Holland,
Amsterdam.
[34] De Leeuw, J. and Pruzansky, S. (1978). A new computational method to fit the weighted
Euclidean model. Psychometrika 43, 479-490.
[35] Eisler, H. (1973). The algebraic and statistical tractability of the city block metric. Brit. J.
Math. Statist. Psychol. 26, 212-218.
[36] Fisher, R. A. (1922). The systematic location of genes by means of cross-over ratios. American
Naturalist 56, 406-411.
[37] Fréchet, M. (1935). Sur la définition axiomatique d'une classe d'espaces distanciés vectorielle-
ment applicable sur l'espace de Hilbert. Ann. Math. 36, 705-718.
[38] Gold, E. M. (1973). Metric unfolding: data requirements for unique solution and clarification
of Schönemann's algorithm. Psychometrika 38, 555-569.
[39] Goldmeier, E. (1937). Über Ähnlichkeit bei gesehenen Figuren. Psychol. Forschung 21, 146-208.
[40] Gower, J. C. (1977). The analysis of asymmetry and orthogonality. In: J. R. Barra et al., eds.,
Progress in Statistics. North-Holland, Amsterdam.
[41] Guttman, L. (1941). The quantification of a class of attributes: a theory and method of scale
construction. In: P. Horst, ed., The Prediction of Personal Adjustment. Social Science Research
Council, New York.
[42] Guttman, L. (1944). A basis for scaling qualitative data. Amer. Sociol. Rev. 9, 139-150.
[43] Guttman, L. (1946). An approach for quantifying paired comparisons and rank order. Ann.
Math. Statist. 17, 144-163.
[44] Guttman, L. (1950). The principal components of scale analysis. In: S. A. Stouffer, ed.,
Measurement and Prediction. Princeton University Press, Princeton.
[45] Guttman, L. (1957). Introduction to facet design and analysis. Paper presented at Fifteenth Int.
Congress Psychol., Brussels.
[46] Guttman, L. (1959). Metricizing rank-ordered or unordered data for a linear factor analysis.
Sankhya 21, 257-268.
[47] Guttman, L. (1968). A general nonmetric technique for finding the smallest coordinate space
for a configuration of points. Psychometrika 33, 469-506.
[48] Guttman, L. (1971). Measurement as structural theory. Psychometrika 36, 329-347.
[49] Haberman, S. (1974). The Analysis of Frequency Data. University of Chicago Press, Chicago.
[50] Harshman, R. A. (1970). Foundations of the PARAFAC procedure: models and conditions for
an explanatory multi-modal factor analysis. Department of Phonetics, UCLA.
[51] Harshman, R. A. (1972). PARAFAC2: mathematical and technical notes. Working papers in
phonetics No. 22, UCLA.
[52] Hartigan, J. A. (1967). Representation of similarity matrices by trees. J. Amer. Statist. Assoc.
62, 1140-1158.
[53] Hayashi, C. (1974). Minimum dimension analysis MDA: one of the methods of multidimen-
sional quantification. Behaviormetrika 1, 1-24.
[54] Hays, W. L. and Bennett, J. F. (1961). Multidimensional unfolding: determining configuration
from complete rank order preference data. Psychometrika 26, 221-238.
[55] Heiser, W. and De Leeuw, J. (1977). How to use SMACOF-I. Department of Datatheory,
University of Leiden.
[56] Heiser, W. and De Leeuw, J. (1979). Metric multidimensional unfolding. MDN, Bulletin VVS
4, 26-50.
[57] Heiser, W. And De Leeuw, J. (1979). How to use SMACOF-III. Department of Datatheory,
University of Leiden.
[58] Helm, C. E. (1959). A multidimensional ratio scaling analysis of color relations. E.T.S.,
Princeton.
[59] Holman, E. W. (1972). The relation between hierarchical and Euclidean models for psychologi-
cal distances. Psychometrika 37, 417-423.
[60] Indow, T. (1975). An application of MDS to study binocular visual space. In: US Japan
seminar on MDS.. La Jolla.

[61] Ireland, C. T., Ku, H. H. and Kullback, S. (1969). Symmetry and marginal homogeneity in an
r × r contingency table. J. Amer. Statist. Assoc. 64, 1323-1341.
[62] Johnson, S. C. (1967). Hierarchical clustering schemes. Psychometrika 32, 241-254.
[63] Jöreskog, K. G. (1978). Structural analysis of covariance and correlation matrices. Psycho-
metrika 43, 443-477.
[64] Keller, J. B. (1962). Factorization of matrices by least squares. Biometrika 49, 239-242.
[65] Kelly, J. B. (1968). Products of zero-one matrices. Can. J. Math. 20, 298-329.
[66] Kelly, J. B. (1975). Hypermetric spaces. In: L. M. Kelly, ed., The Geometry of Metric and Linear
Spaces. Lecture Notes in Mathematics 490. Springer, Berlin.
[67] Klingberg, F. L. (1941). Studies in measurement of the relations between sovereign states.
Psychometrika 6, 335-352.
[68] Krantz, D. H. (1967). Rational distance functions for multidimensional scaling. J. Math.
Psychol. 4, 226-245.
[69] Krantz, D. H. (1968). A survey of measurement theory. In: G. B. Dantzig and A. F. Veinott,
eds., Mathematics of the Decision Sciences. American Mathematical Society, Providence.
[70] Krantz, D. H. and Tversky, A. (1975). Similarity of rectangles: an analysis of subjective
dimensions. J. Math. Psychol. 12, 4-34.
[71] Kroonenberg, P. M. and De Leeuw, J. (1977). TUCKALS2: a principal component analysis of
three mode data. Tech. Rept. RN 01-77. Department of Datatheory, University of Leiden.
[72] Krumhansl, C. L. (1978). Concerning the applicability of geometric models to similarity data:
the interrelationship between similarity and spatial density. Psychol. Rev. 85, 445-463.
[73] Kruskal, J. B. (1964). Multidimensional scaling by optimizing goodness of fit to a nonmetric
hypothesis. Psychometrika 29, 1-27.
[74] Kruskal, J. B. (1964). Nonmetric multidimensional scaling: a numerical method. Psychometrika
29, 28-42.
[75] Kruskal, J. B. (1976). More factors than subjects, tests, and treatments: an indeterminacy
theorem for canonical decomposition and individual differences scaling. Psychometrika 41,
281-293.
[76] Kruskal, J. B. (1977). Trilinear decomposition of three-way arrays: rank and uniqueness in
arithmetic complexity and in statistical models. Linear Algebra Appl. 18, 95-138.
[77] Kruskal, J. B. (1977). Multidimensional scaling and other methods for discovering structure.
In: Statistical Methods for Digital Computers. Wiley, New York.
[78] Kruskal, J. B. and Carroll, J. D. (1969). Geometric models and badness of fit functions. In:
P. R. Krishnaiah, ed., Multivariate Analysis, Vol. II. Academic Press, New York.
[79] Kruskal, J. B. and Wish, M. (1978). Multidimensional Scaling. Sage Publications, Beverly Hills.
[80] Kruskal, J. B., Young, F. W. and Seery, J. B. (1977). How to use KYST-2, a very flexible
program to do multidimensional scaling and unfolding. Bell Laboratories, Murray Hill.
[81] Lancaster, P. (1977). A review of numerical methods for eigenvalue problems nonlinear in the
parameter. In: Numerik und Anwendungen von Eigenwertaufgaben und Verzweigungsproblemen.
Internat. Ser. Numer. Math. 38. Birkhauser, Basel.
[82] Landahl, H. D. (1945). Neural mechanisms for the concepts of difference and similarity. Bull.
Math. Biophysics 7, 83-88.
[83] Lew, J. S. (1975). Preorder relations and pseudoconvex metrics. Amer. J. Math. 97, 344-363.
[84] Lew, J. S. (1978). Some counterexamples in multidimensional scaling. J. Math. Psychol. 17,
247-254.
[85] Lindman, H. and Caelli, T. (1978). Constant curvature Riemannian scaling. J. Math. Psychol.
17, 89-109.
[86] Lingoes, J. C. (1971). Some boundary conditions for a monotone analysis of symmetric
matrices. Psychometrika 36, 195-203.
[87] Lingoes, J. C. and Roskam, E. E. (1973). A mathematical and empirical analysis of two
multidimensional scaling algorithms. Psychometrika 38, monograph supplement.
[88] Luce, R. D. (1961). A choice theory analysis of similarity judgements. Psychometrika 26,
151-163.
[89] Luce, R. D. (1963). Detection and recognition. In: R. D. Luce, R. R. Bush and E. Galanter,
eds., Handbook of Mathematical Psychology, Vol. I. Wiley, New York.

[90] Luneburg, R. K. (1947). Mathematical Analysis of Binocular Vision. Princeton University Press,
Princeton.
[91] MacCallum, R. C. (1977). Effects of conditionality on INDSCAL and ALSCAL weights.
Psychometrika 42, 297-305.
[92] MacCallum, R. C. and Cornelius III, E. T. (1977). A Monte Carlo investigation of recovery of
structure by ALSCAL. Psychometrika 42, 401-428.
[93] Menger, K. (1928). Untersuchungen über allgemeine Metrik. Math. Ann. 100, 75-163.
[94] Messick, S. J. (1956). Some recent theoretical developments in multidimensional scaling. Ed.
Psychol. Meas. 16, 82-100.
[95] Messick, S. J. and Abelson, R. P. (1956). The additive constant problem in multidimensional
scaling. Psychometrika 21, 1-15.
[96] Nakatani, L. H. (1972). Confusion-choice model for multidimensional psychophysics. J. Math.
Psychol. 9, 104-127.
[97] Obenchain, R. L. (1971). Squared distance scaling as an alternative to principal components
analysis. Bell Laboratories, Holmdell.
[98] Pieszko, H. (1975). Multidimensional scaling in Riemannian space. J. Math. Psychol. 12,
449-477.
[99] Ramsay, J. O. (1975). Solving implicit equations in psychometric data analysis. Psychometrika
40, 337-360.
[100] Ramsay, J. O. (1977). Maximum likelihood estimation in multidimensional scaling. Psycho-
metrika 42, 241-266.
[101] Ramsay, J. O. (1978). Confidence regions for multidimensional scaling analysis. Psychometrika
43, 145-160.
[102] Restle, F. (1959). A metric and an ordering on sets. Psychometrika 24, 207-220.
[103] Richardson, M. W. (1938). Multidimensional psychophysics. Psychol. Bull. 35, 659-660.
[104] Roskam, E. E. (1968). Metric analysis of ordinal data in psychology. VAM, Voorschoten, The
Netherlands.
[105] Rutishauser, H. (1970). Simultaneous iteration method for symmetric matrices. Numer. Math.
16, 205-223.
[106] Saito, T. (1978). An alternative procedure to the additive constant problem in metric multidi-
mensional scaling. Psychometrika 43, 193-201.
[107] Sattath, S. and Tversky, A. (1977). Additive similarity trees. Psychometrika 42, 319-345.
[108] Schoenberg, I. J. (1935). Remarks to Maurice Fréchet's article "Sur la définition axiomatique
d'une classe d'espaces distanciés vectoriellement applicable sur l'espace de Hilbert." Ann.
Math. 38, 724-732.
[109] Schoenberg, I. J. (1937). On certain metric spaces arising from Euclidean space by a change of
metric and their imbedding in Hilbert space. Ann. Math. 40, 787-793.
[110] Schoenberg, I. J. (1938). Metric spaces and positive definite functions. Trans. Amer. Math. Soc.
44, 522-536.
[111] Schönemann, P. H. (1970). On metric multidimensional unfolding. Psychometrika 35, 349-366.
[112] Schönemann, P. H. (1972). An algebraic solution for a class of subjective metrics models.
Psychometrika 37, 441-451.
[113] Schönemann, P. H. (1977). Similarity of rectangles. J. Math. Psychol. 16, 161-165.
[114] Seidel, J. J. (1955). Angles and distances in n-dimensional Euclidean and non-Euclidean
geometry. Parts I, II, III. Indag. Math. 17, 329-335, 336-340, 535-541.
[115] Seidel, J. J. (1975). Metric problems in elliptic geometry. In: L. M. Kelly, ed., The Geometry of
Metric and Linear Spaces. Lecture Notes in Mathematics 490. Springer, Berlin.
[116] Shepard, R. N. (1957). Stimulus and response generalization: a stochastic model relating
generalization to distance in psychological space. Psychometrika 22, 325-345.
[117] Shepard, R. N. (1958). Stimulus and response generalization: tests of a model relating
generalization to distance in psychological space. J. Exp. Psychol. 55, 509-523.
[118] Shepard, R. N. (1958). Stimulus and response generalization: deduction of the generalization
gradient from a trace model. Psychol. Rev. 65, 242-256.
[119] Shepard, R. N. (1962). The analysis of proximities: multidimensional scaling with an unknown
distance function, Parts I, II. Psychometrika 27, 125-140, 219-246.

[120] Shepard, R. N. (1966). Metric structures in ordinal data. J. Math. Psychol. 3, 287-315.
[121] Shepard, R. N. (1974). Representation of structure in similarity data: problems and prospects.
Psychometrika 39, 373-421.
[122] Shepard, R. N. and Arabie, P. (1979). Additive clustering: representation of similarities as
combinations of discrete overlapping properties. Psychol. Rev. 86, 87-123.
[123] Sibson, R. (1972). Order-invariant methods for data analysis. J. Roy. Statist. Soc. Ser. B 34,
311-349.
[124] Stumpf, C. (1880). Tonpsychologie, Vols. I and II. Teubner, Leipzig.
[125] Takane, Y., Young, F. W. and De Leeuw, J. (1977). Nonmetric individual differences in
multidimensional scaling: an alternating least squares method with optimal scaling features.
Psychometrika 42, 7-67.
[126] Taussky, O. (1949). A recurring theorem on determinants. Amer. Math. Monthly 56, 672-676.
[127] Torgerson, W. (1952). Multidimensional scaling I--theory and methods. Psychometrika 17,
401-419.
[128] Torgerson, W. (1958). Theory and Methods of Scaling. Wiley, New York.
[129] Torgerson, W. (1965). Multidimensional scaling of similarity. Psychometrika 30, 379-393.
[130] Townsend, J. T. (1978). A clarification of some current multiplicative confusion models.
J. Math. Psychol. 18, 25-38.
[131] Tversky, A. (1966). The dimensional representation and the metric structure of similarity data.
Michigan Math. Psychol. Program.
[132] Tversky, A. (1977). Features of similarity. Psychol. Rev. 84, 327-352.
[133] Tversky, A. and Krantz, D. H. (1969). Similarity of schematic faces: a test of interdimensional
additivity. Perception and Psychophysics 5, 124-128.
[134] Tversky, A. and Krantz, D. H. (1970). The dimensional representation and the metric structure
of similarity data. J. Math. Psychol. 7, 572-596.
[135] Valentine, J. E. (1969). Hyperbolic spaces and quadratic forms. Proc. Amer. Math. Soc. 37,
607-610.
[136] Wender, K. (1971). A test of independence of dimensions in multidimensional scaling.
Perception and Psychophysics 10, 30-32.
[137] Young, F. W. (1972). A model for polynomial conjoint analysis algorithms. In: R. N. Shepard,
A. K. Romney and S. B. Nerlove, eds., Multidimensional Scaling: Theory and Applications in the
Social Sciences, Vol. I. Seminar Press, New York.
[138] Young, F. W., De Leeuw, J. and Takane, Y. (1980). Quantifying qualitative data. In:
E. Lantermann and H. Feger, eds., Similarity and Choice. Huber, Bern.
[139] Young, F. W., Takane, Y. and Lewyckyj, R. (1978). Three notes on ALSCAL. Psychometrika
43, 433-435.
[140] Young, F. W. and Lewyckyj, R. (1979). ALSCAL-4 User's guide. Data analysis and theory
associates, Carrboro, NC.
[141] Young, G. and Householder, A. S. (1938). Discussion of a set of points in terms of their mutual
distances. Psychometrika 3, 19-22.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2          14
©North-Holland Publishing Company (1982) 317-345

Multidimensional Scaling and its Applications

Myron Wish and J. Douglas Carroll

1. Multidimensional scaling of two-way data

Multidimensional scaling (MDS) is a general term for a class of techniques that


can be used to develop spatial representations of proximities among psychological
stimuli or other entities. These techniques involve iterative algorithms, almost
always requiring high-speed computers, for discovering and displaying the un-
derlying structure of the data matrix.
The aim of MDS is to discover the number of dimensions appropriate for the
data and to locate the stimulus objects on each dimension; that is, to determine
the dimensionality and the configuration of stimuli in the multidimensional space.
The problem of interpretation is to identify psychological, physical, or other
correlates of these dimensions, or to specify their meaning in some way. This can
be based on visual inspection of the space, or statistical procedures such as
multiple regression. It should be pointed out, however, that interpretation is
sometimes configurational rather than dimensional; i.e., entailing description of
meaningful clusters, regions, or other patterns in the multidimensional space (see
Guttman, 1968, 1971). Furthermore, dimensions need not actually vary continu-
ously in the sense that all possible values will be evidenced, but may be more
discretely valued as in a highly clustered space (see Kruskal and Wish, 1978).
The typical kind of data used in two-way MDS analysis is a single two-way
matrix of proximities, that is a table of numbers reflecting the relative closeness or
distance of stimulus objects to one another. If the numbers are assumed to be
directly related to distance, they are usually referred to as dissimilarities; if
inversely related, they are called similarities. Thus, the smaller the similarity value,
or the larger the dissimilarity, the further apart the associated stimulus objects
should be in the multidimensional space.
There is a wide variety of methods for obtaining data appropriate for MDS.
The most direct way is to ask subjects to give pairwise ratings or to sort stimulus
objects according to their similarity, relatedness, association, etc. Some other
sources of proximities are 'confusions' from a stimulus identification or 'same-
different' task, co-occurrences of stimuli in text or other material, indices of
communication or volume flow, and various measures of profile distance derived
from objective or subjective multivariate data. Extensive discussions of data

collection procedures relevant for MDS are provided in Shepard (1972), Wish
(1972), and Kruskal and Wish (1978).
Proximity is taken to be a primitive relation on pairs of stimuli, which is
assumed to be orderable. Thus it is defined on at least what Stevens (1951) has
called an ordinal scale (the usual assumption in the so-called nonmetric MDS
methods). In some cases it may be assumed to be measurable on an interval scale
(the standard assumption for metric MDS) or even on a ratio scale, see Ekman
(1963).
Since proximities can be regarded as distance-like numbers, they should
roughly satisfy the general conditions for a metric space (see Carroll and Wish,
1974a, b and de Leeuw, this volume); that is, the distance axioms of positivity
(d_{jk} ≥ d_{jj} = 0), symmetry (d_{jk} = d_{kj}), and the triangle inequality (d_{jl} ≤ d_{jk} + d_{kl})
should be satisfied in an ordinal sense. If the data are dissimilarities, δ_{jk}, the first
two conditions become δ_{jk} ≥ δ_{jj} = δ_{kk} and δ_{jk} = δ_{kj} for all j, k.
In practice, approximate satisfaction of positivity and symmetry is about the
best that one can hope for. Furthermore, the third condition is untestable if the
proximities are only on an ordinal or interval scale, since by adding a suitably
large constant to each data value the proximities can be made to satisfy the
triangle inequality. One could, however, establish a rough criterion based on how
large that constant must be, relative to the variance of the proximities, to
guarantee satisfaction of something analogous to the triangle inequality. When,
however, the violations are severe, it is essential to transform the data in some
way to make it more appropriate for MDS (e.g., symmetrizing the matrix,
normalizing rows and columns, computing some measure of profile distance; see
Kruskal and Wish, 1978), or to use a non-spatial type of data analysis procedure
(see Cunningham, 1978; Carroll, 1976; Tversky, 1977; Sattath and Tversky, 1977;
Shepard and Arabie, 1979; and Arabie and Carroll, 1980).

1.1. "Classical" multidimensional scaling


The 'classical' metric method of MDS, which is described by Torgerson (1958),
assumes that the dissimilarities (or similarities) are approximately a linear func-
tion of a set of underlying distances. Its theoretical basis was a theorem by Young
and Householder (1938) that enabled determination of the minimum dimensional-
ity required to accommodate a given set of Euclidean distances among n points.
This theorem also provided a method for constructing the space, and for
determining whether the distances could be accommodated in any Euclidean
space. Richardson (1938) used this constructive method to implement the first
known application of multidimensional scaling (see also Klingberg, 1941). This
method lay dormant until the early 1950's when improvements by Torgerson and
others (see Torgerson, 1958), along with the advent of computers, made it feasible
to handle large data sets.
The now 'classical' MDS procedure associated with Torgerson can be sum-
marized as follows. Similarities data were collected by one of several means (e.g.,

the complete method of triads or the method of successive intervals) and put
through one of the Thurstone-type unidimensional scaling procedures (see
Torgerson, 1958) to produce interval scale measures called comparative distances.
Since ratio scale distances were needed, the problem of estimating an additive
constant arose. The simplest estimation procedure was to add to each data value
the smallest constant guaranteeing satisfaction of the triangle inequality. Once
ratio scale distances were obtained they were converted to scalar products around
an origin placed at the centroid of the configuration. The conversion from
Euclidean distances to scalar products was effected by double centering the
matrix whose general entry is −½d²_{jk}. Double centering is equivalent to taking out
both row and column effects in analysis of variance, leaving 'interaction numbers'
which in this case are estimated scalar products. The matrix of estimated scalar
products S = XX', can be viewed as analogous to a covariance matrix, and
methods closely related to factor analysis and principal components analysis can
be used to solve for the dimensionality and configuration (based on the eigenval-
ues and eigenvectors of S).
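As a small numerical check of the double-centering step (our own illustration in Python with numpy, not part of the original account): for a configuration centered at its centroid, double centering the matrix of −½d²_{jk} reproduces the scalar products XX' exactly.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((6, 2))
    X -= X.mean(axis=0)                                    # origin at the centroid
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)    # squared Euclidean distances
    J = np.eye(6) - np.ones((6, 6)) / 6                    # centering operator
    S = -0.5 * J @ D2 @ J                                  # 'interaction numbers'
    print(np.allclose(S, X @ X.T))                         # True: S equals XX'

The dimensionality and configuration then follow from the eigenvalues and eigenvectors of S, exactly as stated above.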

1.2. Nonmetric multidimensional scaling


Although Coombs and his coworkers (see Coombs, 1964) were the first to
devise a procedure for nonmetric MDS, Shepard (1962a, b) was the first to
demonstrate that metric solutions could be produced from purely nonmetric (i.e.
ordinal) data. These were metric in two respects: the coordinates were defined up
to an interval scale, and the space was indeed a (Euclidean) metric space.
Shepard's method, which he called the analysis of proximities, started out with
the n stimuli arranged in a maximum entropy configuration (a regular simplex in
n - 1 dimensions, that is, the generalization of an equilateral triangle to higher
dimensionality). He then introduced a process to decrease the dimensionality that
was based on the idea of increasing large distances and decreasing small ones.
When the dimensionality appeared to be reduced as much as possible, the points
were projected down exactly into a space having that number of dimensions, and
a second process was initiated. The aim of this latter process was to improve the
fit between the data and the spatial distances by increasing distances between
points that were too close together (relative to the data) and decreasing distances
between points that were too far apart. This was done iteratively by setting up
'force' vectors for each point oriented toward or away from each of the others,
according to whether the distance was too large or too small; the magnitude of a
vector was proportional to the size of the discrepancy. A resolution vector was
determined for each point by summing up the n - 1 force vectors. Each point was
then moved in the direction of its resolution vector a distance proportional to the
magnitude of the force. New force vectors were then computed, and the process
continued until the system was in equilibrium, or as close to it as possible.
Kruskal's (1964a, b) MDSCAL procedure improved Shepard's method in two
important ways. First, the notion of optimizing an explicitly defined measure of

goodness (or badness) of fit was introduced. Second, an explicit numerical
procedure, called the method of gradients, or steepest descent, was used. Kruskal
called his loss function 'STRESS,' which was defined in his original paper as

    STRESS = [ Σ_{j<k} (d_{jk} − d̂_{jk})² / Σ_{j<k} d²_{jk} ]^{1/2},    (1)

where the d_{jk}'s are distances in the underlying metric space, and the d̂_{jk}'s are
related to proximities by a monotonic function (nonincreasing if the proximities
are similarities, nondecreasing if the data are dissimilarities). This formula is
referred to as stress form 1 (or SFORM1). Another formula, stress form 2, differs
only in the normalization factor; that is, the denominator involves a summation
of (d_{jk} − d̄)² rather than d²_{jk}, where d̄ is the mean of the d_{jk}'s. Least squares
monotone regression (see van Eeden, 1957; Miles, 1959; Barton and Mallows,
1961; and Bartholomew, 1959) was incorporated in the algorithm for finding the
best monotonic function; that is, the function minimizing stress. Another im-
portant generalization introduced by Kruskal allowed specific non-Euclidean (Lp)
metrics to be fit.
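As a small illustration of formula (1) and of stress form 2 (our own Python/numpy sketch; the disparities d̂_{jk} are assumed to be given already, e.g. from a monotone regression on the proximities):

    import numpy as np

    def stress(X, Dhat, form=1):
        # Kruskal's STRESS for a configuration X (n x R) and fitted disparities Dhat (n x n).
        D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
        j, k = np.triu_indices(len(X), k=1)                # pairs j < k
        d, dhat = D[j, k], Dhat[j, k]
        num = np.sum((d - dhat) ** 2)
        den = np.sum(d ** 2) if form == 1 else np.sum((d - d.mean()) ** 2)
        return np.sqrt(num / den)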
The first step in the procedure was to generate a set of coordinates for stimuli
in a specified dimensionality by some random, arbitrary, or rational method.
Given a particular metric in the underlying space (usually, but not necessarily,
Euclidean) and a particular dimensionality, the objective was rigorously defined
as that of minimizing stress over the class of all R-dimensional spaces and over
the class of all monotonic functions. This involved computation of the partial
derivatives of stress with respect to the total set of coordinates (all stimuli in all
dimensions) to determine the direction and relative distance each point should
be moved on that iteration. When the components of the gradient were suffi-
ciently small to indicate convergence, the process was terminated.
Kruskal's algorithm was originally implemented in a computer program called
MDSCAL that went through several versions. More recently several features of
another MDS program called TORSCA (Young and Torgerson, 1967) were
combined with those of MDSCAL to produce an improved program called KYST
(Kruskal, Young, and Seery, 1973). (The name, 'KYST,' is based on a combina-
tion of four contributors to MDS methodology--Kruskal, Young, Shepard, and
Torgerson.) Another series of two-way MDS procedures under the general name
of Smallest Space Analysis, or SSA, was independently developed by Guttman
and Lingoes (see Guttman, 1968; Lingoes, 1972).
Although the MDSCAL and KYST programs are usually thought of as being
nonmetric, they do also allow for metric MDS (as well as a compendium of
options for different types of analyses). This amounts to replacing the monotone
function with some specified metric function, e.g. a polynomial of some degree.
These approaches are metric in that interval scale properties of the data are used.

Of course, as functions with increasing numbers of parameters are used, the
distinction between metric and nonmetric becomes less meaningful.

1.3. Dimensionality
A difficult task confronting the user of KYST or any other MDS program is to
decide how many dimensions are appropriate for the data. In most cases the
answer is not straightforward since there is a balance between goodness of fit,
interpretability, and parsimony of data representation. Stress (or another com-
parable measure) can be made as low as desired by using a sufficiently large
number of dimensions, but this would provide limited data reduction and would
greatly complicate interpretation of results. In contrast, a two-dimensional solu-
tion greatly facilitates comprehension of the space, but if the stress is too high, the
configuration may misrepresent the overall data structure or omit important
features. In this regard it is difficult to say how low stress should be on an
absolute basis since it depends on so many considerations--for example the
number of stimuli, the amount of 'noise', the distribution of proximity values, the
analysis type (e.g., metric vs. nonmetric, Euclidean vs. non-Euclidean), etc.
The most prevalent procedure used for deciding how many dimensions are
needed is to look at a plot showing the stress values for solutions in several
dimensionalities. (To avoid the possibility of a local minimum in one or more of
these dimensionalities, it is desirable to use more than one starting configuration,
and perhaps to do both metric and nonmetric analyses of the data.) Under ideal
circumstances there is a sharp bend or elbow in the plot. However, unless there
are physical or other clearly defined attributes associated with the dimensions,
there is unlikely to be a prominent inflection point.
Another technique for assessing dimensionality is to refer the stress values to
charts based on Monte Carlo simulations (Wagenaar and Padmos, 1971; Isaac
and Poor, 1974; Spence and Graef, 1974). These charts, which are based on
sampling distributions, can be helpful in some instances, but several strong
assumptions about the data are involved as well as other limitations (see Kruskal
and Wish, 1978).
The size and distribution of residuals (d̂_{jk} − d_{jk}) is also relevant to decisions
about dimensionality. The presence of large outliers or a systematic pattern of
residuals can be grounds for additional dimensions. In some instances inspection
of the residuals can suggest what those dimensions might be, or suggest other
structure in the proximity matrix.
Two additional criteria that can be brought to bear are replicability and
interpretability of results. In addition to analysing other sets of comparable data,
replicability can be assessed by splitting the data in several ways, by a
'jackknife'-type approach (Tukey, 1977), or by various types of 'sensitivity'
analyses.
Interpretability depends to some extent on the prior knowledge and intuitions
of the experimenter, but can also be aided by regression analyses and procedures

for embedding clusters in multidimensional spaces (see Shepard, 1972). In most


cases lower dimensional solutions are much easier to interpret, due to the
rotational problems encountered in high dimensional solutions. There are times,
however, when additional dimensions help to clarify the meaning of the structure
as a whole.

1.4. Dimensions of Morse code signals


Several concepts of MDS can be clarified by applications to real data. The first
example is based on data collected by Rothkopf (1957) on confusability of Morse
code signals. In the experiment subjects listened to pairs of Morse code signals,
and responded 'same' or 'different' after each pairwise presentation. The proxim-
ity between stimulus j and stimulus k was defined as the proportion of subjects
responding 'same' when these two signals were presented. These data were
analyzed by Shepard (1963) using an early version of Kruskal's MDSCAL
program.
Fig. 1 shows the stress vs. dimensionality plot for these data. Although there is
not a clear elbow, the sharpest drop appears to be between one and two
dimensions. Accordingly, two dimensions were regarded to be most appropriate
for these data.
Fig. 2 shows the rotated two-dimensional configuration obtained by Shepard
(1963) for these data, which had a stress value (SFORM1) of 0.18. As can be seen,
the dimensions were interpreted as 'number of components' and 'dots vs. dashes'.
Thus, from bottom to top, the number of components increases successively from
one to five, while the dash-to-dot ratio generally increases from left to right in the

[Figure: plot of STRESS (vertical axis, roughly 0.10 to 0.40) against DIMENSIONALITY (1 to 5).]

Fig. 1. STRESS (form 1) for Morse code confusion data (Rothkopf, 1957) in one to five dimensions.

Fig. 2. Two-dimensional MDSCAL solution for Morse code confusion data, as interpreted by
Shepard (1963).

figure. The vertical dimension might also be interpreted as total duration of the
signal (dashes are three times as long as dots, and the interval between compo-
nents has the same duration as a dot).
The plot of proximities (percentages of 'same' judgments for pairs) vs. inter-
point distances (in the multidimensional space) is shown in Fig. 3. The nonlinear-
ity of the best-fitting monotonic function is striking, and is in accord with the
exponential relationship expected by Shepard for such 'confusions' data. Also of
note are the rather large discrepancies from a perfect monotonically decreasing
function. Thus, although the number of components and the relative number of
dots and dashes are the two most prominent attributes along which Morse code
signals are confused, the large residual error suggests the possibility of additional
structure latent in these data. In this regard, further analyses of these and other
similar data (Wish, 1967a,b) showed that the higher dimensions reflect the
particular sequence of components. For example, a three-dimensional solution,
which had a stress of 0.13, had an additional dimension, separating signals that
begin with a dash from those whose first component is a dot.

[Figure: scatter plot of the percentage of 'same' responses (vertical axis) against INTERPOINT DISTANCE (roughly 0.5 to 3.0).]

Fig. 3. Relationship between percentage of 'same' responses for stimulus pairs and their distance in
the two-dimensional configuration.

This was one of the first applications of MDS showing that metric information
could be obtained from ordinal data (see Shepard, 1966). Besides revealing the
fundamental dimensions of these data, the study made a theoretical contribution
in showing that the function relating stimulus confusions to distance was ex-
ponential. It should be pointed out, however, that the dimensions in this applica-
tion and most others depend on the kind of judgment or psychological process
involved, as well as on the stimulus domain. Thus, somewhat different dimensions
were obtained for data using other experimental tasks (see Shepard, 1963).

1.5. Relatedness among societal problems

The next application of MDS is from a pilot study by Wish and Carroll (in
association with W. Kluver and J. Martin of Experiments in Art and Technology,
Inc.) The main task for the 14 subjects participating in the study (see Kruskal and
Wish, 1978) was to give pairwise ratings of relatedness among 22 societal
problems. In addition, they evaluated these problems on 15 attributes; for
example, 'not at all important vs. very important,' 'affects very few people vs.
affects most people,' etc. The data were collected in 1972; other problems, such as
energy, would undoubtedly be included if the study were done today.
The relatedness judgments were regarded as proximities and analyzed by the
KYST procedure. Analyses in one through six dimensions gave stress values

(SFORM1) of 0.41, 0.23, 0.15, 0.12, 0.10, and 0.08, respectively. As in the previous example, the dimensionality was not clear cut. However, the relatively sharp drop from two to three dimensions, followed by a gradual decline thereafter, suggested that a three-dimensional solution was most appropriate.
Interpretation of dimensions was facilitated by the supplemental ratings of these stimuli on bipolar scales. The first step was to compute mean ratings of each problem on the various scales. The vector of mean ratings for any given scale was then 'regressed' on the three-dimensional KYST solution for the relatedness data. In other words, the three dimensions served as predictors, and each vector of mean ratings served in turn as a criterion.
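A minimal sketch of this property-fitting step is given below (Python/NumPy; the coordinate matrix X and the vector of mean ratings are illustrative placeholders rather than the study's data, and the normalization follows the convention used in Table 1, a unit sum of squares across dimensions):

```python
# A minimal sketch of regressing mean bipolar-scale ratings on MDS dimensions
# ('property fitting').  X and mean_ratings are illustrative placeholders.
import numpy as np

def fit_property_vector(X, mean_ratings):
    """X: (n_stimuli, n_dims) MDS coordinates; mean_ratings: (n_stimuli,)."""
    X1 = np.column_stack([np.ones(len(X)), X])       # add an intercept term
    coefs, *_ = np.linalg.lstsq(X1, mean_ratings, rcond=None)
    b = coefs[1:]                                     # direction in the MDS space
    fitted = X1 @ coefs
    mult_corr = np.corrcoef(fitted, mean_ratings)[0, 1]
    weights = b / np.linalg.norm(b)                   # sum of squares across dims = 1
    return weights, mult_corr
```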
The normalized regression weights (sum of squares across dimensions equals 1) and multiple correlations are shown in the first four columns of Table 1. These dimensions are in a principal axis orientation, as provided by the KYST program (one of the options). Although none of the multiple correlations is very high, six are statistically significant (at the 0.05 level or better).

The first dimension is almost perfectly coincident with the scales 'affects me a great deal' and 'affects most people.' Thus, the characteristic most salient to subjects in making these ratings appears to be how much the problems affect them, and perhaps how many other people are affected.

Table 1
Multiple regression of mean bipolar scale ratings on dimensions of societal problems

                                               Mult.      Principal axes           Rotated axes
Rating scales                                  corr.    (1)    (2)    (3)       (1)    (2)    (3)
(1)  Very important                            0.49     0.54   0.76   0.35      0.54  -0.53   0.66
(2)  Very interested                           0.36     0.91   0.39   0.16      0.91  -0.42   0.03
(3)  Affects me a great deal                   0.84*    0.99   0.04   0.10      0.99  -0.08  -0.08
(4)  Affects most people                       0.80*    0.94   0.34   0.05      0.94   0.33   0.11
(5)  Action urgently needed                    0.30     0.56   0.42  -0.71      0.56  -0.06   0.83
(6)  Economic problem                          0.52     0.33   0.94   0.07      0.33  -0.81   0.49
(7)  Moral problem                             0.52     0.17  -0.12   0.98     -0.17  -0.33  -0.93
(8)  Political problem                         0.30     0.36  -0.68  -0.63      0.36   0.90   0.26
(9)  Organizational problem                    0.64**   0.32  -0.47  -0.82      0.32   0.79   0.52
(10) Technological problem                     0.77*    0.61   0.42  -0.67      0.61   0.07   0.79
(11) Responsibility of federal government      0.44    -0.03   0.74   0.68      0.03  -0.96  -0.28
(12) Responsibility of local government        0.77     0.26  -0.86  -0.43     -0.26   0.97   0.00
(13) Responsibility of non-profit
     institutions                              0.41     0.59   0.44  -0.68      0.59  -0.09   0.80
(14) Responsibility of profit-making
     institutions                              0.60**   0.56   0.77   0.32      0.56   0.83   0.06
(15) Responsibility of people directly
     affected                                  0.39     0.30  -0.86  -0.41      0.30   0.95   0.01

* $F_{3,18}$ significant at less than the 0.001 level.
** $F_{3,18}$ significant at less than the 0.05 level.



Fig. 4. Three-dimensional KYST solution for data on perceived relatedness among 22 societal problems. Vectors for rating scales are based on multiple regression analyses. (Panel (a) plots dimensions 1 and 3; panel (b) plots dimensions 2 and 3, with vectors for the 'local government responsibility' and 'technological problem' scales.)

Other scales have high regression weights on the second and third dimensions (economic vs. not economic problem, moral vs. not moral problem), but their multiple correlations are too low for these scales to be useful for interpretational purposes. Two other rating scales having reasonably high multiple correlations are 'responsibility (vs. not responsibility) of local government' and 'technological (vs. not technological) problem.' Since neither of these lines up closely with dimension 2 or 3, a rotation was needed to provide a satisfactory interpretation of these coordinate axes.
This is displayed geometrically in Fig. 4, which shows two planes from the
three-dimensional solution for the relatedness data, along with vectors for some of
the scales. The location of these vectors is based on the normalized regression
weights, or direction cosines, in the first three columns of the table. In this
orientation mean ratings on the scales correlate as highly as possible with stimulus
projections on the associated vectors.
Since two scales already provide a good definition of the first dimension, no
rotation is required for its interpretation. Labelling of the other two dimensions
could be enhanced considerably, however, by rotating the axes to closer con-
gruence with the two vectors shown in Fig. 4(b). This cannot be done perfectly
with an orthogonal rotation since the two vectors are not exactly at right angles to
one another.
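One way such a rotation might be carried out (a sketch under assumptions, not necessarily the procedure actually used) is an orthogonal Procrustes fit in the plane of dimensions 2 and 3, rotating those axes toward the two target vectors:

```python
# A minimal sketch (an illustrative assumption, not the authors' procedure):
# rotate dimensions 2 and 3 of the configuration toward two target property
# vectors with an orthogonal Procrustes fit in that plane.  X and targets
# are illustrative placeholders.
import numpy as np

def rotate_plane_toward_targets(X, targets):
    """X: (n_stimuli, 3) coordinates; targets: (2, 2) direction cosines of the
    two property vectors on dimensions 2 and 3."""
    # Orthogonal R minimizing ||targets @ R - I|| in the least squares sense.
    U, _, Vt = np.linalg.svd(targets.T)
    R = U @ Vt
    X_rot = X.copy()
    X_rot[:, 1:3] = X[:, 1:3] @ R        # dimension 1 is left untouched
    return X_rot
```

Because the two target vectors are not exactly orthogonal, the rotated axes can only approximate them, as noted above.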
The last three columns of Table 1 show the regression weights for the rotated
dimensions. The fourth column is the same as the first since dimension 1 was not
involved in the rotation. In this orientation, however, the other two dimensions
can be more reasonably interpreted as 'responsibility vs. not responsibility of
local government' and 'technological vs. not technological problem.' Since multi-
ple correlations are only in the 0.70's for these scales (they are unaffected by the
rotation), the names for dimensions are still only suggestive. Further studies
including additional scales and societal problems, as well as a large sample of
subjects, would be needed to arrive at a compelling determination of the dimen-
sions of societal problems.

2. Multidimensional scaling of three-way data

Everything discussed so far has assumed a two-way data matrix and a quadratic
model for squared distances. Since the squared distances can be converted to
scalar products (as in classical MDS), two-way MDS can also be thought of as
involving bilinear models for derived scalar products. We now consider the
extension to three-way arrays and trilinear models.
Probably the simplest generalization of the bilinear model for two-way MDS is
one given by (2).
$$z_{ijk} \approx \hat{z}_{ijk} = \sum_{r=1}^{R} a_{ir}\, b_{jr}\, c_{kr}. \tag{2}$$

In this equation $a_{ir}$, $b_{jr}$, and $c_{kr}$ are elements of the matrices A, B, and C having I, J, and K rows, respectively. (All three matrices have R columns.) If A corresponds to subjects and B to stimuli, then C might be associated, for example, with different rating scales, experimental methods, occasions, etc.
The least squares solution for this trilinear model can be obtained by extending the Eckart-Young (1936) singular value decomposition to the three-way case.
This entails use of what Wold (1966) has called a NIPALS (for nonlinear iterative
partial least squares) procedure, or what has more recently (de Leeuw, 1977) come
to be called an ALS (for alternating least squares) procedure. This involves
converting the overall nonlinear least squares problem (of finding matrices A, B,
and C that optimally fit the Z array in a least squares sense) into a series of linear
least squares problems that are solvable by standard methods.
This is done by successively fixing two of the matrices (say B and C) and
solving for the least squares estimate of the third with these fixed values; then
fixing two others (say A and C) and solving for the remaining one and so on
around the iterative loop until no further improvement occurs. Each iteration of
the process necessarily improves the fit; i.e., decreases the sum of squared
residuals between Z and $\hat{Z}$ (the three-way array containing the $\hat{z}_{ijk}$'s). When
the process has converged (and it can be proved that it will converge), the
resulting matrices constitute at least locally optimal least squares estimates of A,
B, and C. This process has been called CANDECOMP, for canonical decomposi-
tion of N-way tables.
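A minimal sketch of this alternating least squares idea in Python/NumPy is shown below; it is an illustration under simplifying assumptions rather than the CANDECOMP program itself, and the array Z, the rank R, and the convergence tolerance are placeholders:

```python
# A minimal ALS sketch for the trilinear model (2):
# Z[i,j,k] is approximated by sum_r A[i,r] * B[j,r] * C[k,r].
import numpy as np

def candecomp_als(Z, R, n_iter=200, tol=1e-8, seed=0):
    I, J, K = Z.shape
    rng = np.random.default_rng(seed)
    A, B, C = (rng.standard_normal((m, R)) for m in (I, J, K))

    def kr(X, Y):
        # Column-wise (Khatri-Rao) product: row (i,j), column r is X[i,r]*Y[j,r].
        return np.einsum('ir,jr->ijr', X, Y).reshape(-1, X.shape[1])

    prev = np.inf
    for _ in range(n_iter):
        # Each step is an ordinary linear least squares problem with the
        # other two matrices held fixed.
        A = np.linalg.lstsq(kr(B, C), Z.reshape(I, -1).T, rcond=None)[0].T
        B = np.linalg.lstsq(kr(A, C), Z.transpose(1, 0, 2).reshape(J, -1).T, rcond=None)[0].T
        C = np.linalg.lstsq(kr(A, B), Z.transpose(2, 0, 1).reshape(K, -1).T, rcond=None)[0].T
        sse = np.sum((Z - np.einsum('ir,jr,kr->ijk', A, B, C)) ** 2)
        if prev - sse < tol:          # stop when the fit no longer improves
            break
        prev = sse
    return A, B, C
```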
In the case described we have N = 3, but the extension to higher-way tables is
straightforward (Carroll and Chang, 1970; Carroll and Wish, 1974a). Thus, in
higher-way methods all but one way of the data table is fixed at any time, and the
remaining one is estimated by the NIPALS-ALS procedure. A fruitful appli-
cation of N-way CANDECOMP has been for the analysis of N-way contingency
tables in terms of the Lazarsfeld (Lazarsfeld and Henry, 1968) latent class model
(see Carroll, Pruzansky, and Green, 1980).

2.1. The INDSCAL model and method


INDSCAL (Carroll and Chang, 1970), for individual differences scaling, as-
sumes a three-way table of a special kind (see also Horan, 1969). The simplest
characterization of this particular array is that it consists of several (at least two)
two-way matrices of proximities referring to the same entities. In the typical case
each matrix contains the data for one individual (i.e., a subject) participating in
the study. However, the matrices can be associated with any other kind of source,
such as occasions, trials, methods, measures, etc. For example, INDSCAL was
used (Wish and Carroll, 1974) to analyze data on confusability of English
consonants under 17 kinds of acoustic degradations; in this application the
degradation conditions served as data sources.
The general entry in the table can be called $\delta_{ijk}$, or $\delta_{jk}^{(i)}$, for the dissimilarity
between objects j and k for individual i. (If the data are similarities, they can
always be converted to dissimilarities by subtracting all values from the largest

value, or applying any other linear function with negative slope.) Fig. 5 schema-
tizes both the input data array and output matrices for INDSCAL.
The model for INDSCAL is summarized in the following equations:

$$\delta_{jk}^{(i)} \approx F_i\bigl(d_{jk}^{(i)}\bigr), \tag{3}$$

where

$$d_{jk}^{(i)} = \left(\sum_{r=1}^{R} w_{ir}\,(x_{jr} - x_{kr})^2\right)^{1/2}. \tag{4}$$

Fig. 5. Schematic representation of INDSCAL input (a) and output (b). The input data are two or more square symmetric matrices, one for each subject or other data source. The data values, $\delta_{jk}^{(i)}$, indicate the dissimilarity between stimulus j and stimulus k for subject i. The set of two-way matrices defines the rectangular solid or three-way data array, shown at the top. The output from INDSCAL consists of a matrix of stimulus coordinates (bottom left) and a matrix of subject weights (lower right) on the R dimensions. (Note: 'objects' need not be 'stimuli'; 'subjects' may be other data sources.)
The F ' s will generally be considered to be either linear, in the metric case, or
monotonic in the nonmetric case. It is important to note, however, that a different $F_i$ is assumed for each individual. Thus INDSCAL generalizes two-way MDS by substituting a weighted Euclidean metric for the ordinary (unweighted) Euclidean metric. Different patterns of weights are allowed for each individual or other data source. The possibility of zero weights means that some dimensions may be
totally irrelevant to a subject. Thus, INDSCAL allows for a large degree of
variation among subjects within the context of a shared multidimensional space.
A more geometric interpretation of the INDSCAL model is provided by (5).

$$y_{jr}^{(i)} = w_{ir}^{1/2}\, x_{jr}. \tag{5}$$


In this equation $x_{jr}$ is the rth coordinate of the jth stimulus in what is called the 'group stimulus space,' as defined by the matrix shown at the bottom left in Fig. 5. Each individual, or data source, can be thought of as having a 'private perceptual space' whose general coordinates are designated by $y_{jr}^{(i)}$. The $y_{jr}^{(i)}$'s, however, are derived from the $x_{jr}$'s by the simple relation expressed in (5).
Ordinary Euclidean distances are then computed for individual i by the equation

$$d_{jk}^{(i)} = \left(\sum_{r} \bigl(y_{jr}^{(i)} - y_{kr}^{(i)}\bigr)^2\right)^{1/2}. \tag{6}$$

By substituting (5) into (6) we get (4). Thus, (5) and (6) together provide an
alternative interpretation of the weighted generalization of the ordinary Euclidean
metric defined in (4). They express the INDSCAL model in terms of a particu-
larly simple class of transformations of the common space, followed by computa-
tion of the ordinary Euclidean metric. The class of transformations can be
described algebraically as linear transformations, with the transformation matrix
constrained to be diagonal (sometimes called a strain or 'stretch' transformation).
More simply, this amounts to rescaling each dimension by the square root of that particular subject's weight for that dimension. Geometrically this can be
thought of as differentially stretching or shrinking each dimension by a factor
proportional to the square root of the associated weight.
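A minimal sketch of equations (5) and (6) in Python (the group space X and the weight vector are illustrative placeholders):

```python
# A minimal sketch of eqs. (5)-(6): each subject's 'private space' is the
# group stimulus space with every dimension rescaled by the square root of
# that subject's weight; distances are then ordinary Euclidean distances.
import numpy as np

def private_space(X, w_i):
    """Eq. (5): y_jr^(i) = sqrt(w_ir) * x_jr; X is (n_stimuli, R), w_i is (R,)."""
    return X * np.sqrt(w_i)

def weighted_distances(X, w_i):
    """Eqs. (4)/(6): weighted Euclidean distances for one subject."""
    Y = private_space(X, w_i)
    diff = Y[:, None, :] - Y[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))
```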
The weights can be plotted in a second space, which is generally referred to as
the 'subject space,' since in most psychological applications these points corre-
spond to different individuals. The coordinates of the subject space are contained
in an I × R matrix, as shown in the bottom right of Fig. 5. Although in principle
these weights should all be positive (or zero), some of the weights turn out
occasionally to be slightly negative. Near-zero weights are likely to represent
random statistical fluctuation. Large negative weights signal inappropriateness of
the model, too many dimensions, or perhaps a selection of the wrong option (e.g.,
treating the data as dissimilarities when they should be similarities).

2.2. INDSCAL application to perceptions of nations


These concepts can be illustrated by an application of INDSCAL to percep-
tions of similarity among nations. The data were collected in 1968 from 18
Fig. 6. Group stimulus space (a) and subject space (b) from an INDSCAL analysis of perceived similarities among 12 nations. Private perceptual spaces for subjects 17 (c) and 4 (d) were obtained by applying the square roots of their respective weights to the dimensions of the group stimulus space. (Dimension 1: pro-communist vs. pro-Western; dimension 2: economically developed vs. underdeveloped.)

Fig. 6 (continued).

students in a psychological measurement class taught by Wish (Wish, Deutsch,


and Biener, 1970). In addition to making pairwise ratings of similarities (on a
9-point scale) for a set of 12 nations, the subjects gave their views about U.S.
policy in Vietnam.
The set of eighteen 12 × 12 matrices was used as input to INDSCAL, produc-
ing the group stimulus space in Fig. 6(a) and the subject space in Fig. 6(b). The

unrotated stimulus dimensions were interpreted as 'political alignment' (or com-


munist vs. noncommunist) and 'economic development.' (A three-dimensional
analysis revealed another dimension reflecting geography and culture.)
The D's, M's, and H's in the subject space indicate whether the student was a 'dove,' 'moderate,' or 'hawk' regarding Vietnam. There appears, then, to be a systematic relationship between political views and perceptions of these nations: the political alignment dimension is more important for hawks than for doves, while the reverse is true for economic development.
The difference between doves' and hawks' perceptions is dramatized in Figs. 6(c) and 6(d). The horizontal stretching of subject 17's space reflects the higher weight on dimension 1 than on dimension 2 for that subject. He seems to view the world in terms of communist vs. noncommunist nations (or perhaps, good guys vs. bad guys). The vertical stretching of the private space for subject 4 reflects greater salience for the economic development dimension. Thus, in his world view, the primary distinction is between developed and underdeveloped nations (or haves vs. have nots).
This example shows that INDSCAL can yield important information about the
subjects as well as the stimuli, and the differences in subject weights can be
associated with other characteristics (e.g., perceptual, cognitive, physiological,
demographic, etc.). It should also be noted that the unrotated axes were directly
interpretable.

2.3. Metric INDSCAL analysis


In a metric INDSCAL analysis the function relating the dissimilarities to
distances is assumed to be linear, with a different linear function allowed for each
individual or data source. To apply the CANDECOMP algorithm, it is necessary
to convert these dissimilarities to scalar products, as in classical MDS. Since the array of scalar products is three-way rather than two-way, the additive constant and the transformation to scalar products are applied separately for each individual.
Under the assumed form expressed in (5) for the distances, the scalar products, $s_{jk}^{(i)}$, should obey the model

$$s_{jk}^{(i)} \approx \sum_{r=1}^{R} w_{ir}\, x_{jr}\, x_{kr}, \tag{7}$$

or, in matrix notation,

$$S_i \approx X\, W_i\, X'. \tag{8}$$
The correspondence with (2) can be seen by substituting $w_{ir}$ for $a_{ir}$, $x_{jr}$ for $b_{jr}$, and $x_{kr}$ for $c_{kr}$. This is a symmetric form of CANDECOMP in which the second and third ways of the table correspond to the same set of entities, and therefore J = K = n, the number of stimuli. Furthermore, $S_i$ is an n × n matrix of derived scalar products, and $W_i$ is an R × R diagonal matrix.
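A minimal sketch of this per-subject conversion (double centering of the squared dissimilarities, as in classical MDS; the additive-constant estimation mentioned above is omitted, and the array names are illustrative):

```python
# A minimal sketch: convert each subject's dissimilarity matrix to a matrix
# of scalar products by double centering the squared dissimilarities.
# Delta is an illustrative (n_subjects, n, n) array; the additive constant
# step is omitted for brevity.
import numpy as np

def scalar_products(Delta):
    I, n, _ = Delta.shape
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    S = np.empty_like(Delta, dtype=float)
    for i in range(I):
        S[i] = -0.5 * J @ (Delta[i] ** 2) @ J  # B = -(1/2) J D^2 J
    return S
```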

The goodness of fit value that is maximized is equal to the root-mean-square


correlation (computed over subjects or other data source) between scalar products
computed from the input data and scalar products computed from the multidi-
mensional solution. As in the general CANDECOMP procedure, the NIPALS-
ALS approach is used cyclically and iteratively to arrive at the optimal set of
stimulus coordinates and subject (or other data source) space.
When CANDECOMP has converged, matrices B and C (which refer to the
same set of stimuli) will generally be the same up to a possible diagonal
transformation. This is assured, however, by a final normalizing step in which B is
set equal to C, the two are normalized appropriately, and A is recalculated. The
ith row of A then defines the vector of weights for data source i.
INDSCAL can also be used to solve for weights relative to a prespecified
configuration. This configuration could be based, for example, on previous
analyses or on theoretical considerations. The procedure involved is to input the
specified set of stimulus dimensions as matrices B and C, and to run INDSCAL
for zero iterations to solve for A, the matrix of weights. This amounts to a kind of
regression analysis in which the stimulus dimensions serve as predictors.
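A minimal sketch of this idea (an illustration, not the zero-iteration INDSCAL run itself; S is an illustrative array of per-source scalar products and X the prespecified configuration):

```python
# A minimal sketch: with a prespecified configuration X, obtain each data
# source's dimension weights by regressing its scalar products s_jk^(i) on
# the products x_jr * x_kr, one predictor per dimension.
import numpy as np

def weights_for_fixed_configuration(S, X):
    n, R = X.shape
    P = np.column_stack([np.outer(X[:, r], X[:, r]).ravel() for r in range(R)])
    W = np.empty((S.shape[0], R))
    for i, Si in enumerate(S):
        W[i] = np.linalg.lstsq(P, Si.ravel(), rcond=None)[0]
    return W
```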

2.4. Normalizing conventions in INDSCAL


There is a basic problem of scale indeterminacy in INDSCAL. In this regard,
the dimensions of the group stimulus space could be rescaled by an arbitrary
diagonal transformation, and the inverse of the square of the transformation
would then be applied to the subject space. To resolve this indeterminacy,
normalizing conventions have been adopted. The origin of the group stimulus
space is set at the centroid of the stimulus points, and the variance of stimulus
coordinates is made equal to 1/n for all dimensions. The coordinates on each
dimension then have a mean equal to zero, and sum of squares equal to one.
Furthermore, with this normalization, $w_{ir}^2$ is approximately equal to the propor-
tion of variance in subject i's data (converted to scalar products) accounted for
by dimension r. (The squared weight underestimates the proportion of explained
variance if the dimensions of the group stimulus space are correlated.) Squared
distance from the origin of the subject space approximates the proportion of
variance in that individual's data explained by the entire set of stimulus dimen-
sions.
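A minimal sketch of these conventions (illustrative names; the weights are adjusted by the inverse square of the rescaling so that the modeled scalar products are unchanged):

```python
# A minimal sketch of the INDSCAL normalizing conventions: center the group
# stimulus space at the centroid of the stimulus points and give each
# dimension unit sum of squares (variance 1/n); the weights are multiplied
# by the squared scale factors so the modeled scalar products are unchanged.
import numpy as np

def normalize_indscal(X, W):
    Xc = X - X.mean(axis=0)                    # centroid at the origin
    scale = np.sqrt((Xc ** 2).sum(axis=0))     # per-dimension root sum of squares
    return Xc / scale, W * scale ** 2
```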

2.5. Dimensional uniqueness


The determination of a subject as well as a stimulus space is an important
advantage that INDSCAL has over two-way MDS. Thus, it is possible to explore
individual differences, time trends, treatment effects, etc., directly within
the context of multidimensional scaling. Another important advantage that
INDSCAL has over two-way and most three-way MDS methods is referred to as
dimensional uniqueness (see also Harshman, 1972; Harshman and Berenbaum,
1980). Thus the dimensions are uniquely determined in that they cannot be
rotated or otherwise transformed without changing the solution in an essential

way. Mathematically a transformation of axes will change the family of permissi-


ble transformations of the group stimulus space, and thus the family of possible
individual metrics. Statistically a transformation of axes will generally lead to a
reduction in variance accounted for.
A central reason why INDSCAL is so useful in practice is that in most cases
the interesting directions in the configuration are along the coordinate axes that
are obtained directly from the analysis (as illustrated by the application to nation
perceptions). This means that INDSCAL dimensions can be assumed to corre-
spond to fundamental dimensions (psychological, physical, etc.) which differ in
degree among individuals or other data sources.
More important perhaps than statistical uniqueness of dimensions is the
empirical observation that the unrotated INDSCAL dimensions can usually be
directly interpreted without rotation. As a result, it is generally much easier to
interpret INDSCAL solutions, particularly those involving several dimensions,
than those based on other MDS procedures or factor analysis. Although principal
components directions and axes obtained from a varimax rotation are also unique, their interpretability has generally been less compelling than that of the dimensions obtained directly from INDSCAL. This appears to be at least partially due to the
use of three-way vs. two-way data.

2.6. Dimensions of interpersonal communication


There are many instances when the data have to be transformed in some way to
be appropriate for MDS. This would occur, for example, if the data consisted of
subjects' ratings of stimulus objects on various attributes, or if several types of
stimulus measurements were taken at different time periods. A common tech-
nique for deriving a proximity matrix from such multivariate data is to compute
some measure of profile similarity or dissimilarity between rows or columns of the
original data matrix. (In some instances there is theoretical or statistical justifica-
tion for transforming a matrix that already contains proximities; see Rosenberg,
et al., 1968.)
An application of INDSCAL to perceptions of interpersonal communication
illustrates the use of such a proximity measure (Wish and Kaplan, 1977). The
stimuli were 64 hypothetical communication episodes obtained by factorially
combining the eight role relations and eight situational contexts shown in Table 2.
The task for the 72 subjects was to rate the expected communication in each
episode on 14 bipolar scales. Thus, the stimuli 'parent and teenager discussing a
controversial social issue on which their opinions differ' and 'supervisor and
employee pooling their knowledge and skills to solve a difficult problem' were
rated on such attributes as friendly vs. hostile, personal vs. impersonal, etc. (It
would have been prohibitive to have asked subjects to make similarity or
dissimilarity ratings for all combinations based on such a large stimulus set.)
A subjects (72 rows) by stimuli (64 columns) matrix was first defined for each
bipolar scale. Each of these was converted to a matrix of dissimilarity between
stimuli by computing the root-mean-square difference between columns of the

Table 2
Interpersonal relations and situational contexts used to generate the factorial set of hypothetical
communication episodes
Interpersonal relations
(A) Bitter enemies
(B) Business partners
(C) Casual acquaintances
(D) Husband and wife
(E) Marine sergeant and private
(F) Parent and teenager
(G) Political rivals
(H) Supervisor and employee

Situational contexts
(1) Attempting to work out a compromise when their goals are strongly opposed
(2) Pooling their knowledge and skills to solve a difficult problem
(3) Discussing a controversial social issue on which their opinions differ
(4) Talking to each other at a large social gathering
(5) Expressing anxiety about a national crisis that is affecting them personally
(6) Working for a common goal with one person directing the other
(7) Blaming one another for a serious error that was made
(8) Having a brief exchange about a minor technical detail

original two-way table. The input to INDSCAL was then 14 matrices of stimulus dissimilarity, one for each bipolar scale. (When INDSCAL is used to analyze matrices of profile distances, it is desirable to select the option in the program that reads the data values as Euclidean distances rather than simply as dissimilarities. In such a case no additive constant is applied.)
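A minimal sketch of this profile-distance computation for a single bipolar scale (the ratings array is an illustrative placeholder):

```python
# A minimal sketch: derive one stimulus dissimilarity matrix per bipolar
# scale as the root-mean-square difference between columns (stimuli) of the
# subjects-by-stimuli rating matrix for that scale.
import numpy as np

def profile_distances(ratings):
    """ratings: (n_subjects, n_stimuli) array for one bipolar scale."""
    diff = ratings[:, :, None] - ratings[:, None, :]   # subject-wise differences
    return np.sqrt((diff ** 2).mean(axis=0))           # RMS over subjects
```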
The small drop in variance accounted for from six dimensions (88.0%) to five (86.2%), compared with the much larger drop from five to four (76.6%), suggested that a five-dimensional solution was most appropriate. In this example the dimension weights, which are analogous to partial correlations, provided useful information for interpreting dimensions. This is because the weights, like the original input matrices, were associated with bipolar scales rather than with individual subjects.
(See Wish and Carroll (1974) for a discussion of some advantages of using such
weights from an INDSCAL analysis for interpreting dimensions rather than
supplementary multiple regression analyses.) Thus, as with factor loadings and
weights from regression analyses, the higher the dimension weight for a particular
scale, the more relevant that attribute is for the interpretation of the dimension.
(Weights for subjects could also be determined if matrices of stimulus dissimilar-
ity were derived for each subject from bipolar scale profiles as described in Wish
and Carroll (1974).)
Table 3 shows the weights on the five dimensions for each bipolar scale. Based
on this pattern of dimension weights (which was compatible with visual inspec-
tion of the stimulus space), the dimensions were interpreted as: 'cooperative and
friendly vs. competitive and hostile,' 'intense vs. superficial,' 'dominance vs.

Table 3
Dimension weights for 14 bipolar scales based on an INDSCAL analysis of 64 hypothetical communication episodes

Rating scales                              Dim. 1   Dim. 2   Dim. 3   Dim. 4   Dim. 5
Very cooperative vs.
  very competitive                          0.96*    0.03     0.00     0.03     0.03
Very friendly vs.
  very hostile                              0.96*   -0.10     0.04     0.10     0.03
No conflict vs.
  constant conflict                         0.88*    0.20     0.03    -0.04    -0.03
Very intense vs.
  very superficial                          0.11     0.92*    0.04    -0.02     0.05
Completely engrossed vs.
  uninterested and uninvolved              -0.08     0.85*    0.04     0.07     0.14
Very emotional vs.
  very unemotional                          0.33     0.73*    0.02    -0.02     0.05
Very personal vs.
  very impersonal                          -0.08     0.70*    0.00     0.50     0.05
Very different roles and behavior vs.
  very similar roles and behavior           0.03     0.02     0.93*    0.00     0.03
One totally dominates the other vs.
  each treats the other as an equal         0.18     0.04     0.92*    0.07     0.01
Very autocratic vs.
  very democratic                           0.21     0.00     0.91*    0.08     0.02
Very formal vs.
  very informal                             0.10     0.00     0.09     0.92*    0.02
Very reserved and cautious vs.
  very frank and open                       0.19     0.01     0.02     0.85*    0.08
Very task oriented vs.
  not at all task oriented                 -0.02     0.12     0.05     0.05     0.91*
Very productive vs.
  very unproductive                         0.39    -0.08    -0.02     0.07     0.78*

* Weight ≥ 0.70.

equality,' 'informal and open vs. formal and cautious,' and 'task oriented vs. nontask oriented.'

To assess the uniqueness of this multidimensional solution, the sample of subjects was split in half, and profile distance matrices were computed for bipolar scales from data for each subsample. Separate INDSCAL analyses for these two data sets gave almost identical results; that is, the correlations between the corresponding unrotated dimensions were all above 0.980. The stability of results was also checked by comparing the results for the entire sample with those obtained from a parallel study involving ratings of 120 stimuli (12 relations × 10 situations) on the same bipolar scales. Without rotation of axes, the correlations between corresponding dimensions from the INDSCAL analyses of 120 and 64 stimuli (based on the 64 stimuli in common) were all in the high 0.90's.
Since the stimuli were generated from a factorial design, analyses of variance were also done to determine the extent of relation-by-situation interaction for

each dimension. Details about the study as a whole and these supplementary
analyses can be found in Wish and Kaplan (1977).

2.7. Alternative methods for fitting the INDSCAL model


Ramsay's (1978) MULTISCALE procedure provides alternative metric meth-
ods for fitting the INDSCAL model, as well as the two-way MDS model, using a
maximum likelihood criterion under different distributional assumptions. Takane,
Young, and de Leeuw's (1977) ALSCAL procedure also allows metric fitting of
INDSCAL or two-way MDS.
There are 'quasi-nonmetric' and fully nonmetric algorithms for fitting the
INDSCAL model (Carroll and Chang, 1970; de Leeuw, 1977; Takane, Young,
and de Leeuw, 1977) that allow $F_i$ in (3) to be a monotonic rather than a linear
function. Using their quasi-nonmetric approach (called NINDSCAL), Carroll and
Chang observed empirically, however, that the stimulus space was virtually the
same for nonmetric as for metric INDSCAL, and that there were only slight
changes in the subject weights in most instances.
Since the cost is substantially greater for a nonmetric analysis, the metric
version was used almost exclusively until Takane, Young, and de Leeuw's
ALSCAL made a fully nonmetric approach computationally feasible. ALSCAL
uses a complex alternating least squares procedure to minimize a loss function
called SSTRESS (for Squared distance STRESS), which is defined by (9).
$$\mathrm{SSTRESS} = \left\{ \sum_i \frac{\sum_{j<k} \left[ \bigl(d_{jk}^{(i)}\bigr)^2 - \bigl(\hat{d}_{jk}^{(i)}\bigr)^2 \right]^2}{\mathrm{Norm}_i} \right\}^{1/2}. \tag{9}$$

The two versions of Norm$_i$ are analogous to those used in stress forms 1 and 2.
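A minimal sketch of an SSTRESS-like loss (the per-source normalization by the sum of fourth powers of the disparities is an assumption analogous to stress form 1; D and D_hat are illustrative arrays of model distances and disparities):

```python
# A minimal sketch of an SSTRESS(form 1)-style loss: squared model distances
# are compared with squared disparities and normalized per data source.
# The normalization choice is an assumption analogous to stress form 1.
import numpy as np

def sstress_form1(D, D_hat):
    """D, D_hat: (n_sources, n, n) arrays of distances and disparities."""
    total = 0.0
    for d, dh in zip(D, D_hat):
        iu = np.triu_indices_from(d, k=1)        # pairs with j < k
        num = np.sum((d[iu] ** 2 - dh[iu] ** 2) ** 2)
        total += num / np.sum(dh[iu] ** 4)
    return np.sqrt(total)
```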

2.8. CANDECOMP with linear constraints


Recently the CANDECOMP model was generalized to allow for the imposition of linear constraints on the parameters (i.e., the $a_{ir}$'s, $b_{jr}$'s, and $c_{kr}$'s). The general
approach is called CANDELINC (Carroll, Pruzansky, and Kruskal, 1979) for
canonical decomposition with linear constraints. This is useful when it is desired
to constrain the dimensions to be completely explained by a set of outside
variables (i.e., variables not used in determining the multidimensional space).
These external variables could correspond, for example, to physical measure-
ments, as well as to any objective or subjective attributes of the stimuli, subjects,
or other data mode. They might also be based on a factorial design underlying the
stimulus domain. In this latter case, the artificial (or dummy) variables used to
code the design may function as the external variables. In order to constrain one of the matrices defining the model (say A), it is required to have the form A = XT, where X is a fixed, known 'design matrix' and T is the unknown parameter matrix to be estimated.

Suppose, for example, that the subjects are defined by a factorial design (e.g.,
using demographic factors such as age, sex, and socioeconomic status), and
that stimuli are defined factorially in terms of color, size, and shape. In a
CANDELINC analysis the subject and/or stimulus parameters could be required
to satisfy any set of linear constraints desired by the experimenter or data analyst.
Thus one or both sets could be constrained to be perfectly decomposable by an
additive (or main effects only) ANOVA-type model. Alternatively, specified
interaction terms could be included allowing for all two-way, but no three-way
interactions; or certain one-degree-of-freedom contrasts (partitioning main ef-
fects, interactions, or both) could be allowed.
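As an illustration of the A = XT constraint (a sketch under assumptions, not the CANDELINC program), the following builds an additive, main-effects-only design matrix for an 8 × 8 relations-by-situations factorial and projects an unconstrained coordinate matrix onto its column space:

```python
# A minimal sketch of the A = X T constraint: an additive (main effects only)
# design matrix for an 8 x 8 factorial, and a least squares projection of an
# unconstrained matrix A onto the column space of X.  The dummy coding is
# over-parameterized, but the projection X @ T is unaffected by that.
import numpy as np

def additive_design(n_rel=8, n_sit=8):
    rows = []
    for a in range(n_rel):
        for b in range(n_sit):
            x = np.zeros(1 + n_rel + n_sit)
            x[0] = 1.0                    # grand mean
            x[1 + a] = 1.0                # relation main effect
            x[1 + n_rel + b] = 1.0        # situation main effect
            rows.append(x)
    return np.array(rows)                 # (64, 17) design matrix

def constrain(A, X):
    T = np.linalg.lstsq(X, A, rcond=None)[0]   # least squares T
    return X @ T                                # constrained A = X T
```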
One important advantage of imposing such constraints is to enhance the
experimenter's ability to extrapolate results to new stimuli, subjects, etc. Another
benefit is the potential for comparing alternative models (as in analysis of
variance or conjoint analysis) and providing a parsimonious representation of the
data. Since CANDELINC entails a decomposition of a much smaller array, it
also saves considerable computer time and costs.

2.9. Linearly constrained INDSCAL


The most useful implementation of this general procedure to date has involved
a constrained version of INDSCAL, called LINCINDS (for linearly constrained
INDSCAL). A comparison of the previously described INDSCAL analysis of the
hypothetical communication episodes with a LINCINDS analysis of the same
data illustrates this procedure. The same set of matrices of profile distances
between stimuli were used, but in this case the stimuli were constrained. The
design matrix corresponded to a simple additive model for the 8 (relations) × 8
(situations).
The variance accounted for by the constrained analysis was 79.8%, as opposed
to 86.2% for the conventional INDSCAL solution. The solutions were extremely
similar, however; that is, the correlations of corresponding dimensions of the
stimulus space ranged from 0.970 to 0.986, while the correlations for the dimen-
sion weights were 0.999 in four instances and 0.998 in the other. The constrained
version of INDSCAL appeared, then, to provide a more parsimonious description
of the data without changing the meaning of the dimensions.
Another important use of LINCINDS worth noting is to provide a computa-
tionally efficient, rational starting configuration for INDSCAL. This dramatically
reduces the number of iterations required for convergence, and thereby reduces
the cost by as much as an order of magnitude in some instances. Details and
examples of this use of LINCINDS appear in Carroll and Pruzansky (1979).

2.10. More general three-way MDS models


An interesting class of general three-way models for MDS can all be viewed as
special cases of what Carroll and Chang (1970) have called the IDIOSCAL (for
individual difference in orientation scaling) model. The central assumption is that

the distances (which again are assumed linearly, monotonically, or otherwise


'simply' related to dissimilarities) are given by an equation of the form

$$d_{jk}^{(i)} = \left(\sum_{r}\sum_{r'} (x_{jr} - x_{kr})\, c_{rr'}^{(i)}\, (x_{jr'} - x_{kr'})\right)^{1/2}, \tag{10}$$

where $C_i = \|c_{rr'}^{(i)}\|$ is an R × R symmetric positive definite or semidefinite matrix.


This model, as INDSCAL, can be converted to a scalar product form:

$$s_{jk}^{(i)} \approx \sum_{r}\sum_{r'} x_{jr}\, c_{rr'}^{(i)}\, x_{kr'}, \tag{11}$$

or, in matrix notation,

$$S_i = X\, C_i\, X'. \tag{12}$$

Approaches to analyzing data in terms of this model have been outlined by Carroll and Chang (1972). INDSCAL can be seen to be a special case of IDIOSCAL in which $C_i$ ($= W_i$) is constrained to be a diagonal matrix (usually, but not necessarily, with non-negative diagonal entries). It should be pointed out that the generality of the IDIOSCAL model leads to a rotational invariance property, meaning that 'dimensional uniqueness' does not hold. Some other interesting special cases are considered below.

Tucker's "three-mode scaling"


In this special case, the $C_i$ are of the form

$$C_i = \sum_{h=1}^{H} a_{ih}\, G_h, \tag{13}$$

where each $G_h$ is a square symmetric matrix (possibly, but not necessarily,


positive definite or semidefinite) and H is less than I, the number of subjects.
Tucker (1972) has described a way of analyzing data of the prescribed sort in
terms of this model, treating it as a special case of his three-mode factor analysis
(Tucker, 1964). In this model, as in IDIOSCAL, the orientation of axes is not
unique. Indeed, even for a fixed stimulus space there is a residual nonuniqueness
of definition of the G matrices and of the $a_{ih}$'s (which constitute subject
parameters of this model). There may be many situations, however, in which this
model provides an appropriate compromise between INDSCAL and the fully
general IDIOSCAL model.

Harshman's PARAFAC-2 model


Richard Harshman (1972, 1980) has proposed a model that is also midway in
generality between INDSCAL and IDIOSCAL. With Jennrich he produced a

computational method for fitting this model to data. This model, which is called PARAFAC-2 (for parallel factors, model 2), assumes that $C_i$ can be decomposed as

$$C_i = D_i\, R\, D_i, \tag{14}$$

where $D_i$ is diagonal and R is a presumed positive definite (or semidefinite) matrix. Harshman interprets the diagonal elements of $D_i$ as (square roots of) weights applied to oblique dimensions. The off-diagonal elements of R are interpreted as cosines of angles between these oblique coordinate axes. Since R is
interpreted as cosines of angles between these oblique coordinate axes. Since R is
not subscripted, it is assumed that these angles (and thus their cosines) are the
same for all individuals.

3. Recent developments and future trends

There are several recent trends in MDS methodology that offer bright prospects
for the future. There are also arid areas that should become more fertile in the
coming years.
One recent trend that appears almost to represent a zeitgeist involves the fusion
of continuous and discrete models for the analysis of proximities. In this regard,
Carroll and Pruzansky (1975) have recently developed hybrid models that incor-
porate dimensional and tree-structure components (see also Carroll, 1976). Pre-
liminary applications have shown the potential for a more comprehensive data
representation than would be possible with a spatial or clustering model alone.
Until recently nonsymmetric data were generally simply symmetrized (e.g., by
averaging the j, k and k, j cells). Recently, however, models and methods have
been proposed for dealing directly with nonsymmetries. A procedure developed
by Young (1975) called ASYMSCAL and a nonsymmetric version of INDSCAL
(DeSarbo and Carroll, 1979) offer potentially useful perspectives for tackling this
ubiquitous problem. Other interesting developments related to the analysis of
nonsymmetric proximities data have been made by Tobler (1976), Chino (1978),
and Gower (1977, 1978). A particularly innovative attempt for handling and
graphically displaying general patterns of nonsymmetries is the DEDICOM
approach of Harshman (1980).
CANDELINC represents a general approach that could be considerably devel-
oped and expanded in the future. In this regard it should be pointed out that
linear constraints can be applied to the parameters of the other three-way
(and higher-way) models by methods very similar to the procedure used in
CANDELINC. In fact, Carroll, Pruzansky, and Kruskal (1980) discuss an explicit
generalization of CANDELINC to Tucker's three-mode factor analysis and
scaling models. In the future other kinds of constraints could be allowed, or the
constraints could apply differentially to specified dimensions. Work along these
lines has recently been done by Bentler and Weeks (1978), Bloxom (1978), de
Leeuw and Heiser (1979) and Noma and Johnson (1979). Perhaps the most

encouraging aspect of this approach is the attempt to integrate MDS with other
areas of statistics such as analysis of variance. The possibility of comparing a
range of models may also increase the value of MDS for theory construction and
testing in various fields.
Approaches to the analysis of highly nonlinear data, which have been developed more generally for multivariate statistics (e.g., McDonald, 1962; Shepard and Carroll, 1966; Gnanadesikan and Wilk, 1969), will undoubtedly be proposed
within the context of MDS. Likewise, the important work by Ramsay (1978) and
Takane (1981) on the development of maximum likelihood methods for estimat-
ing MDS parameters offers another example of the introduction of general
statistical methodology within the context of MDS. (These maximum likelihood
methods also allow for constraints of various kinds on the solutions.) This is
particularly encouraging since MDS has been primarily used in the past as a
purely descriptive rather than as an inferential tool.
Although there have been attempts, using Monte Carlo methods, to provide
objective indices for assessing dimensionality, considerable work remains to be
done toward the development of distribution theory within the context of
multidimensional scaling. Likewise, developing a more solid base in statistical inference is a sine qua non for the future.
There are clearly numerous other unsolved problems and unexplored frontiers
in multidimensional scaling, such as identifiability and uniqueness of various
models, efficiency of numerical algorithms, and the development of diagnostics to
aid naive as well as sophisticated users. Hopefully, the broad range of challenges
and the growing awareness of these needs will motivate important breakthroughs
in multidimensional scaling.

References

Arabie, P. and Carroll, J. D. (1980). MAPCLUS: A mathematical programming approach to fitting the ADCLUS model. Psychometrika 45, 211-235.
Bartholomew, D. J. (1959). A test of homogeneity for ordered alternatives. Biometrika 46, 36-48.
Barton, D. E. and Mallows, C. L. (1961). The randomization bases of the amalgamation of weighted means. J. Roy. Statist. Soc. Ser. B 23, 423-433.
Bentler, P. M. and Weeks, D. G. (1978). Restricted multidimensional scaling models. J. Math.
Psychology 17, 138-151.
Bloxom, B. (1978). Constrained multidimensional scaling in N spaces. Psychometrika 43, 397-408.
Carroll, J. D. (1976). Spatial, nonspatial, and hybrid models for scaling. Psychometrika 41, 439-463.
Carroll, J. D. and Arabie, P. (1980). Multidimensional scaling. In: M. R. Rosenzweig and L. W. Porter, eds., Annual Review of Psychology 31, 607-649.
Carroll, J. D. and Chang, J. J. (1970). Analysis of individual differences in multidimensional scaling via an N-way generalization of Eckart-Young decomposition. Psychometrika 35, 283-319.
Carroll, J. D. and Chang, J. J. (1972). IDIOSCAL (Individual Differences In Orientation SCALing): A generalization of INDSCAL allowing IDIOsyncratic reference systems as well as an analytic approximation to INDSCAL. Paper presented at meetings of the Psychometric Society, Princeton, NJ, 1972.
Carroll, J. D. and Pruzansky, S. (1975). Fitting of hierarchical tree structure (HTS) models, mixtures
of HTS models, and hybrid models, via mathematical programming and alternating least squares.
Presented at U.S.-Japan Seminar Multidimensional Scaling, University of California, San Diego, La Jolla.
Carroll, J. D. and Pruzansky, S. (1979). Use of LINCINDS as a rational starting configuration for
INDSCAL. Submitted to Psychometrika.
Carroll, J. D. and Pruzansky, S. (1981). Discrete and hybrid models for scaling. In: E. D. Lantermann
and H. Feger, eds., Similarity and Choice. Huber, Bern.
Carroll, J. D., Pruzansky, S. and Green, P. E. (1980). Estimation of the parameters of Lazarsfeld's latent class model by application of canonical decomposition (CANDECOMP) to multi-way contingency tables. Submitted to Psychometrika.
Carroll, J. D., Pruzansky, S. and Kruskal, J. B. (1979). CANDELINC: A general approach to
multidimensional analysis with linear constraints on parameters. Psychometrika 45, 3-24.
Carroll, J. D. and Wish, M. (1974a). Models and methods for three-way multidimensional scaling. In:
D. H. Krantz, R. C. Atkinson, R. D. Luce and P. Suppes, eds., Contemporary Developments in
Mathematical Psychology. Vol. II: Measurement Psychophysics, and Neural Information Processing,
57-105. W. H. Freeman, San Francisco.
Carroll, J. D. and Wish, M. (1974b). Multidimensional perceptual models and measurement methods.
In: E. C. Carterette and M. P. Friedman, eds., Handbook of Perception, Vol. 2, 391-447. Academic
Press, New York.
Chino, N. (1978). A graphical technique for representing the asymmetric relationships between N
objects. Behaviormetrika 5, 23-40.
Coombs, C. H. (1964). Theory of Data. Wiley, New York.
Cunningham, J. P. (1978). Free trees and bidirectional trees as representations of psychological
distance. J. Math. Psychol. 17, 165-188.
de Leeuw, J. (1977). Applications of convex analysis to multidimensional scaling. In: J. R. Barra, F.
Brodeau, G. Romier, B. van Cutsem, eds., Recent Developments in Statistics. North-Holland,
Amsterdam.
de Leeuw, J. and Heiser, W. (1981). Theory of multidimensional scaling. In: P. R. Krishnaiah and L.
Kanal, eds., Handbook of Statistics, Vol. 2: Classification, Pattern Recognition and Reduction of
Dimension. North-Holland, Amsterdam [this volume].
de Leeuw, J. and Heiser, W. (1979). Multidimensional scaling with restrictions on the configuration.
In: P. R. Krishnaiah, ed., Multivariate Analysis, Vol 5. North-Holland, Amsterdam.
DeSarbo, W. S. and Carroll, J. D. (1979). Three-way metric unfolding. Unpublished manuscript, Bell
Telephone Laboratories, Murray Hill.
Ekman, G. (1963). Direct method for multidimensional ratio scaling. Psychometrika 23, 33-41.
Gnanadesikan, R. and Wilk, M. B. (1969). Data analytic methods in multivariate statistical analysis.
In: P. R. Krishnaiah, ed., Multivariate analysis, Vol. 2 (Academic Press, New York).
Gower, J. C. (1977). The analysis of asymmetry and orthogonality. In: J. R. Barra, F. Brodeau, G.
Romier, B. van Cutsem, eds., Recent Developments in Statistics. North-Holland, Amsterdam.
Gower, J. C. (1978). Unfolding: Some technical problems and novel uses. Presented at European
Meeting Psychom. Math. Psychol., Uppsala.
Guttman, L. (1968). General nonmetric technique for finding the smallest coordinate space for a
configuration of points. Psychometrika 33, 469-506.
Guttman, L. (1971). Measurement as structural theory. Psychometrika 36, 329-347.
Harshman, R. A. (1972). PARAFAC2: Mathematical and technical notes. UCLA, Working Papers in
Phonetics No. 22.
Harshman, R. A. (1980). DEDICOM: A family of models for analysis of asymmetrical relationships.
Unpublished manuscript, Bell Laboratories, Murray Hill.
Harshman, R. A. and Berenbaum, S. (1980). Basic concepts underlying the PARAFAC-CANDE-
COMP three-way factor analysis model and its application to longitudinal data. Unpublished
manuscript, Bell Laboratories, Murray Hill.
Horan, C. B. (1969). Multidimensional scaling: Combining observations when individuals have
different perceptual structures. Psychometrika 34, 139-165.
Isaac, P. D. and Poor, D. D. S. (1974). On the determination of appropriate dimensionality in data
with error. Psvchomelrika 39, 91-109.
344 Myron Wish and J. Douglas Carroll

Klingberg, F. L. (1941). Studies in measurement of the relations among sovereign states. Psycho-
metrika 6, 335-352.
Kruskal, J. B. (1964a). Multidimensional scaling by optimizing goodness of fit to a nonmetric
hypothesis. Psychometrika 29, 1-27.
Kruskal, J. B. (1964b). Nonmetric multidimensional scaling: A numerical method. Psychometrika 29,
115-129.
Kruskal, J. B. and Wish, M. (1978). Multidimensional Scaling. Sage Publication, Beverly Hills.
Kruskal, J. B., Young, F. W. and Seery, J. B. (1973). How to use KYST, a very flexible program to do
multidimensional scaling and unfolding. Tech. Rept., Bell Telephone Laboratories, Murray Hill.
Lazarsfeld, P. F. and Henry, N. W. (1968). Latent Structure Analysis. Houghton Mifflin, Boston.
Lingoes, J. C. (1972). A general survey of the Guttman-Lingoes nonmetric program series. In: R. N.
Shepard, A. K. Romney and S. Nerlove, eds., Multidimensional Scaling: Theory and Applications in
Behavioral Sciences. Vol I." Theory, 49-68. Seminar Press, New York.
McDonald, R. P. (1962). A general approach to nonlinear factor analysis. Psychometrika 27, 397-415.
Miles, R. E. (1959). The complete amalgamation into blocks, by weighted means, of a finite set of real
numbers. Biometrika 46, 317-327.
Noma, E. and Johnson, I. (1979). Constrained nonmetric multidimensional scaling. Tech. Rept.
MMPP 1979-4. Ann Arbor: University of Michigan Math. Psychol. Program.
Ramsay, J. O. (1978). Confidence regions for multidimensional scaling analysis. Psychometrika 43,
145-160.
Richardson, M. W. (1938). Multidimensional psychophysics. Psychological Bulletin 35, 659-660.
Rosenberg, S., Nelson, C. and Vivekananthan, P. S. (1968). A multidimensional approach to the
structure of personality impressions. J. Personality and Social Psychology 9, 283-294.
Rothkopf, E. Z. (1957). A measure of stimulus similarity and errors in some paired-associate learning
tasks. Experimental Psychology 53, 94-101.
Sattath, S. and Tversky, A. (1977). Additive similarity trees. Psychometrika 42, 319-345.
Shepard, R. N. (1962a). Analysis of proximities: Multidimensional scaling with an unknown distance
function. I. Psychometrika 27, 125-140.
Shepard, R. N. (1962b). Analysis of proximities: Multidimensional scaling with an unknown distance
function. II. Psychometrika 27, 219-246.
Shepard, R. N. (1963). Analysis of proximities as a technique for the study of information processing
in man. Human Factors 5, 33-48.
Shepard, R. N. (1966). Metric structures in ordinal data. J. Math. Psych. 3, 297-315.
Shepard, R. N. (1972). A taxonomy of principal types of data and of multidimensional methods for their analysis. In: R. N. Shepard, A. K. Romney and S. Nerlove, eds., Multidimensional Scaling: Theory and Applications in the Behavioral Sciences, Vol. I, 21-47. Seminar Press, New York.
Shepard, R. N. and Arabie, P. (1979). Additive clustering: Representation of similarities as combinations of discrete overlapping properties. Psychol. Rev. 86, 87-123.
Shepard, R. N. and Carroll, J. D. (1966). Parametric representations of nonlinear data structures. In:
P. R. Krishnaiah, ed., Multivariate Analysis, Vol. 2, 561-592. Academic Press, New York.
Spence, I. and Graef, J. (1974). The determination of the underlying dimensionality of an empirically
obtained matrix of proximities. Multivariate Behav. Res. 9, 331-342.
Stevens, S. S. (1951). Mathematics, measurement and psychophysics. In: S. S. Stevens, ed., Handbook
of Experimental Psychology. Wiley, New York.
Takane, Y. (1981). Multidimensional successive categories scaling: A maximum likelihood method.
Psychometrika [in press].
Takane, Y., Young, F. W., and de Leeuw, J. (1977). Nonmetric individual differences multidimen-
sional scaling: An alternating least squares method with optimal scaling features. Psychometrika 42,
7-67.
Tobler, W. (1976). Spatial interaction patterns. J. Environ. Syst. 6, 271-301.
Torgerson, W. S. (1958). Theory and Methods of Scaling. Wiley, New York.
Tucker, L. R. (1964). The extension of factor analysis to three-dimensional matrices. In: N. Fredriksen
and H. Gulliksen, eds., Contributions to Mathematical Psychology. Holt, Rinehart and Winston, New
York.

Tucker, L. R. (1972). Relations between multidimensional scaling and three-mode factor analysis. Psychometrika 37, 3-27.
Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley, Reading.
Tversky, A. (1977). Features of similarity. Psychol. Rev. 84, 327-352.
van Eeden, C. (1957). Note on two methods for estimating ordered parameters of probability distributions. Proceedings Akademie van Wetenschappen, Ser. A 60, 128-136.
Wagenaar, W. A. and Padmos, P. (1971). Quantitative interpretation of stress in Kruskal's multidimen-
sional scaling technique. British J. Math. Statist. Psych. 24, 101-110.
Wish, M. (1967a). A structural theory for the perception of Morse code signals and related rhythmic
patterns. Center for Research on Language and Language Behavior, University of Michigan.
Wish, M. (1967b). A model for the perception of Morse code like signals. Human Factors 9, 529-540.
Wish, M. (1972). Notes on the variety, appropriateness, and choice of proximity measures. Unpub-
lished manuscript, Bell Telephone Laboratories, Murray Hill.
Wish, M. and Carroll, J. D. (1974). Applications of INDSCAL to studies of human perception and
judgment. In: E. C. Carterette and M. P. Friedman, eds., Handbook of Perception. 449-491.
Academic Press, New York.
Wish, M., Deutsch, M. and Biener, L. (1970). Differences in conceptual structures of nations: An
exploratory study. J. Personality, Social Psychology 16, 361-373.
Wish, M. and Kaplan, S. J. (1977). Toward an implicit theory of interpersonal communication.
Sociometry 40, 234-246.
Wold, H. (1966). Estimation of principal components and related models by iterative least squares. In:
P. R. Krishnaiah, ed., Multivariate Analysis. Academic Press, New York.
Young, F. W. (1975). An asymmetric Euclidean model for multiprocess asymmetric data. Presented at
U.S.-Japan Seminar Multidimensional Scaling, University of California, San Diego, La Jolla.
Young, G. and Householder, A. S. (1938). Discussion of a set of points in terms of their mutual
distances. Psychometrika 3, 19-22.
Young, F. W. and Torgerson, W. S. (1967). TORSCA, a FORTRAN IV program for Shepard-Kruskal
multidimensional scaling analysis. Behav. Sci. 12, 498.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 347-360

Intrinsic Dimensionality Extraction*

Keinosuke Fukunaga

1. Introduction

The discussion of intrinsic dimensionality was initiated in [1-5]. There a


problem in psychology is considered where a collection of N samples is originally
characterized by $N(N-1)/2$ pairwise similarities, $S_{1,2}, \ldots, S_{N-1,N}$. Only the relative value or rank order of each $S_{i,j}$ has physical meaning. The problem is to obtain a geometrical representation of these samples in a vector space in order to make quantitative discussion possible. This is known as multidimensional scaling, in which N vectors, $X_1, \ldots, X_N$, are located in an n-dimensional space such that their pairwise euclidean distances $d(X_i, X_j)$ keep the same rank orders as the corresponding similarities $S_{i,j}$. The intrinsic dimensionality is defined as the smallest possible dimensionality of the space we can obtain under the above constraint.
Related problems arise in engineering analysis. Suppose we have a random
process

x(t) = . e (1)

where a and m are two random variables that govern this process. Assume that
our observations are formed by taking n time samples of each x(t). Observe that
knowledge of the parameters a and m allows a complete representation of the n
observed time samples as well as the process. This implies that the intrinsic
dimensionality is two. However, conventional basis function approaches, such as
the Karhunen-Loève expansion or the Fourier transform, will require significantly more than two terms to approximate the above process. This is due to the nonlinear nature of x(t). In contrast to the psychology example, absolute values,
as opposed to relative values, are necessary in general. Furthermore, the observa-
tions of x(t) naturally provide N vectors in an n-dimensional space. We seek to
transform the vectors from the n-dimensional space to an m-dimensional space
with m < n such that the pairwise euclidean distances are 'approximately' the

*This work was supported in part by the National Science Foundation under Grant ECS-80-05482.


same. The intrinsic dimensionality is defined as the minimum m for which the
above constraint holds.
The previous two problems are concerned with representation. We desire a low
dimensional space that preserves the structure of the distribution of samples. The
representation problem will be discussed in more detail in Section 2. For the
classification of samples, the preservation of distribution structure is no longer
required. Instead, the degree of overlap (Bayes error) among different classes (or
categories) should be preserved. In this case the intrinsic dimensionality becomes
the lowest dimensionality we can obtain without changing the overlap among
classes. This problem will be discussed in Section 3.
In this article, we limit our discussion to the case where original samples are
given as vectors in an n-dimensional space. Therefore, when a random process
x(t) is considered, it should be converted to a random vector by taking n time
samples, as

X = [x(t_1), ..., x(t_n)]^T   (2)

where T denotes transposition.

2. Intrinsic dimensionality for representation

2.1. Local properties of a distribution [6]


Whenever we are confronted by large multidimensional data sets, it is usually
advantageous for us to discover or impose some structure on the data. Therefore,
we might assume that the data are governed by a certain number of underlying
parameters. The minimum number, n_0, of parameters required to account for the
observed properties of the data is called the intrinsic dimensionality for
representation of the data set or, equivalently, of the data-generating process. The
geometric interpretation is that the entire data set lies on a topological hypersurface
of dimension n_0.
A conventional technique to find the dimensionality of an n-dimensional data
set is to calculate the eigenvalues, λ_1, ..., λ_n, and the eigenvectors, φ_1, ..., φ_n, of the
covariance matrix Σ. Since the eigenvalues represent the variances of the data
distribution along the corresponding eigenvectors, only the eigenvectors with
significant eigenvalues (larger than a certain threshold) may be selected to span a
subspace. Although this technique is powerful for finding effective features, it is
limited because it is based on a linear transformation. For example, in Fig. 1 a
one-dimensional distribution is shown by a solid line. The principal axes are φ_1
and φ_2 and are the same as those for the distribution shown by the dotted line.
Thus the linear mapping fails to demonstrate the intrinsic dimensionality, which is
one for this example.
As is seen from Fig. 1, the intrinsic dimensionality is in essence a local
characteristic of the distribution. Referring to Fig. 2, if we establish small regions
around X_1, X_2, X_3, etc., then the eigenvalues of the covariance matrix for each
Fig. 1. Intrinsic dimensionality and linear mapping.

local subset of data should indicate the dimensionality which is close to the
intrinsic dimensionality. The local eigenvectors will also give the basis vectors for
the local distributions. The eigenvalue approach is not the only way to estimate
local dimensionality, and there are other alternatives. One example is the maxi-
mum likelihood estimation technique proposed by Trunk in which each local
distribution is statistically tested to determine the most likely number of parame-
ters in the data generating process [7].
Local dimensionalities with unlimited noise-free data may be found by reduc-
ing the size of the local regions until a limiting dimensionality is reached. In
practice, however, some factors such as a limited data set and noise complicate
this procedure.
(1) Dominant eigenvalues. As is seen in Figs. 2 and 3, surface convolutions and
noise tend to enlarge the eigenvalues along insignificant eigenvectors. Therefore,
instead of computing the pure mathematical rank of the covariance matrix, the
dominant eigenvalues should be chosen with a properly selected threshold. The
threshold value affects the estimation of the dimensionality directly, and is very
subjective. It is advisable to try several threshold values and compare the results.
(2) Sample size. It is known that in order to ensure the nonsingularity of the
covariance matrix, the number of samples must be larger than the dimension n.
However, the necessary sample size to detect the intrinsic dimensionality n_0 is
some number larger than n_0, not necessarily larger than n. In addition, our

Fig. 2. Local subsets of data.

Fig. 3. Effect of noise.

concern is the total number of significant eigenvalues, not the accuracy of their
estimate. So the required sample size may be much smaller than the one needed
for estimating eigenvalues. Experience shows that the local sample size may only
be two or three times the number of dominant eigenvalues, n_0, regardless of n.
For example, with n_0 = 3 only 6 or 9 samples are required even for n = 100. A
technique is available to calculate the eigenvalues with a sample size smaller than n,
without computing an n × n covariance matrix [8].
(3) Effect of noise. The addition of noise has the effect of constraining the
minimum size of the local regions. This is shown in Fig. 3, where a one-dimensional
line is smeared by noise such that the number of dominant
eigenvalues in small local regions may be increased to two. As mentioned earlier,
the choice of large regions may include several surface convolutions, leading to an
overestimate of the intrinsic dimensionality. On the other hand, a small local
region will pick up the eigenvalues due to the noise component. Therefore, an
engineering compromise on the size of local regions is necessary. This problem
may be corrected by using a data filter, which will be described later, to eliminate
the noise.
Since the choice of threshold value and local region size depends on the data, it
is desirable to have an interactive computer system to provide operator flexibility.
One possible algorithm is given as follows.
(1) Selection of the size of local regions. Although the size of local regions could
be adjusted region-by-region, it might be more convenient to fix the size at the
beginning of the program. The proper size would be chosen after studying the
relationship between the size and the resulting dimensionality around the sample
nearest to the sample mean vector of the entire sample set. The size is specified by
either the radius of a hypersphere or the number of samples in a local region. An
operator should be able to adjust the size whenever needed.
(2) Selection of local centers. This problem is essentially a search for 'good' local
regions in a high dimensional space, and involves many compromises such as
local region size, amount of overlap, and the methodology used to find local
centers. One of the possibilities is to start by using the first sample of a sample list
as a local center. The samples of the local region around the sample are used to
determine the local dimensionality and then removed from the sample list. The
same procedure is repeated until the list becomes empty. Statistics of local

dimensionalities are tabulated for determining the intrinsic dimensionality of the
data.
(3) Application of a data filter. A high dimensional data filter which eliminates
noise can be realized by moving samples toward the gradient of the density
function [9]. The gradient of a density can be estimated by

∇p̂(X) = (k(X)/(Nv)) · ((n+2)/h^2) · (1/k(X)) Σ_{||X_i − X|| ≤ h} (X_i − X)   (3)

where v is the volume of the hypersphere around X with radius h and k(x) is the
number of samples in the hypersphere. Eq. (3) indicates that V/~(X) is propor-
tional to k ( X ) / N v , a density estimate, and Y~(Xi - X ) / k , the sample mean-shift
from X. When the estimate of a density gradient is normalized by a density
estimate as v p ( X ) / p ( X ) , each sample is moved in proportion to the sample
mean-shift. Fig. 4 shows the experimental result for a simple two-dimensional
example, where the above operation was repeatedly applied until no sample
moves were observed.
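A minimal sketch of such a mean-shift data filter is given below, assuming NumPy is available; the radius h, the iteration limit, and the function name are illustrative choices rather than part of the original algorithm description.

import numpy as np

def mean_shift_filter(X, h, n_iter=20, tol=1e-6):
    # Repeatedly move each sample by its local sample mean-shift within a
    # hypersphere of radius h, i.e. the normalized gradient step of eq. (3).
    X = np.asarray(X, dtype=float).copy()
    for _ in range(n_iter):
        X_new = X.copy()
        moved = False
        for i, x in enumerate(X):
            nbrs = X[np.linalg.norm(X - x, axis=1) <= h]   # samples in the hypersphere
            shift = nbrs.mean(axis=0) - x                  # sample mean-shift from x
            if np.linalg.norm(shift) > tol:
                X_new[i] = x + shift
                moved = True
        X = X_new
        if not moved:                                      # stop when no sample moves
            break
    return X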

EXAMPLE 2.1. The Gaussian pulse is a popular function since it reasonably


approximates many signals encountered in practice, and is easily characterized by
two parameters as shown in (1). For this example, 100, 20-dimensional (20

Fig. 4. A data filter. (a) The input of the filter; (b) the output of the filter.

Table 1
Intrinsic dimensionality of Gaussian pulse

No. of samples           Range of               Intrinsic dimensionality using
used in local regions    hypersphere radius     10%        1%
  5                      0.81-0.83               1          1
 10                      1.60-2.43               2          3
 15                      1.92-2.15               2          3
100                      4.60                    4          6

time-sampled) vectors were generated with the two parameters a and m uniformly
distributed as 1 ≤ a ≤ 3 and 0.2 ≤ m ≤ 0.8. The size of the local region was
specified by requiring that k points be contained in each local region. Table 1
contains the results that were generated for k = 5, 10, and 15 using 20 local
regions. The intrinsic dimensionality was determined by counting the number of
eigenvalues larger than t% of the largest eigenvalue in each local region. Two
values of t were used, 10% and 1%. For each combination of k and t the most
frequent number of sufficiently large eigenvalues was called the intrinsic
dimensionality. Referring to Table 1 it can be seen that either 2 or 3 is indicated as the
intrinsic dimensionality. To provide a comparison the entire data set was
transformed using the Karhunen-Loève expansion. This corresponds to one 'local'
cluster with 100 samples. Using the same criteria the dimensionality is 4 for
t = 10% and 6 for t = 1%.
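The local-eigenvalue procedure used in this example can be sketched as follows; the random choice of local centers and the helper name are illustrative assumptions, not part of the original interactive algorithm.

import numpy as np

def local_intrinsic_dimensionality(X, k=10, t=0.10, n_regions=20, seed=0):
    # Count eigenvalues larger than t times the largest eigenvalue in each
    # local region of k nearest samples; return the most frequent count.
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    counts = []
    for c in rng.choice(len(X), size=min(n_regions, len(X)), replace=False):
        d = np.linalg.norm(X - X[c], axis=1)
        local = X[np.argsort(d)[:k]]                     # local region around X[c]
        evals = np.linalg.eigvalsh(np.cov(local.T))
        counts.append(int(np.sum(evals > t * evals.max())))
    return int(np.bincount(counts).argmax())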

2.2. Mapping algorithms


With a small sample size, attempts have been made to unfold and stretch out a
bent data distribution and then find the dominant eigenvalues for the entire
data set. These techniques have been developed in the area of multidimensional
scaling and are discussed in another chapter. Therefore, only some of the
references and a brief discussion are presented here.
In the scaling problem, one would like to find the lowest-dimensional vector
space in which a set of N points can be located so that their interpoint distances are a
monotonic function of the experimental dissimilarities between the N samples. Starting
from an (N − 1)-dimensional space, Shepard moved vectors so as to increase the
variance of the interpoint distances while maintaining the rank orders of the
interpoint distances for the smallest 40% of distances [1, 2]. Later, Bennett
applied the same technique to engineering problems in which the original samples
were given in a vector space [10]. He maintained monotonicity only in local
neighborhoods within hyperspheres around each observation point with a fixed
radius.
Kruskal, on the other hand, fixed the dimension of the mapped space, say m.
Then, he searched for the best point locations, Y_1, ..., Y_N, in the mapped space by
minimizing a stress criterion S = Σ_{i,j} ω_{ij}[d(X_i, X_j) − d(Y_i, Y_j)]^2, where ω_{ij} is a
weighting coefficient which is larger for smaller d's [3, 4]. When an intrinsic
dimensionality is sought, several values for m are selected and the minimized stress is
computed for each m. One then chooses a value of m that has a small stress and
for which an increase in m does not significantly reduce the stress. Shepard and
Carroll used a similar procedure for a different criterion called continuity [5].
In the engineering literature, Sammon minimized the stress criterion for m = 2.
His purpose was to display data on a cathode-ray tube in his interactive computer
system OLPARS [11]. Calvert and Young added to the stress criterion a criterion to
measure the overlap among different classes [12]. Thus, they tried to find a
mapping which preserves the class separability as well as the structure of the data
distribution.
All these algorithms use the point locations, Y_1, ..., Y_N, as the variables to be
optimized, and the optimization process is iterative in nature. The iterative
optimization of N × m variables is very time consuming even on present computers.
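As a rough illustration of this iterative approach, the following sketch minimizes a weighted stress of the above form with a general-purpose optimizer; it assumes SciPy is available, and the 1/d weighting (which emphasizes small distances, in the spirit of Sammon's mapping) is an illustrative choice rather than the exact criterion of [3, 4] or [11].

import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import pdist

def stress_mapping(X, m=2, eps=1e-9, seed=0):
    # Find N x m point locations Y minimizing S = sum_ij w_ij (d_ij^X - d_ij^Y)^2.
    dX = pdist(X)                          # original pairwise distances
    w = 1.0 / (dX + eps)                   # weight small distances more heavily
    N = len(X)
    Y0 = np.random.default_rng(seed).standard_normal((N, m))

    def stress(y):
        dY = pdist(y.reshape(N, m))
        return np.sum(w * (dX - dY) ** 2)

    res = minimize(stress, Y0.ravel(), method="L-BFGS-B")
    return res.x.reshape(N, m), res.fun    # mapped points and minimized stress

Repeating this for several values of m and examining how the minimized stress decreases with m gives the kind of curve used above to choose an intrinsic dimensionality.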
Some attempts have been made to develop noniterative mapping algorithms.
Olsen and Fukunaga specified the intrinsic dimensionality n_0 by determining
dominant local eigenvalues and selecting the set of n_0 eigenvectors for each local
region. These n_0-dimensional subspaces were transformed into a common subspace
[13]. Also, Koontz and Fukunaga found a mapping function of interpoint
distances which preserves the class separability as well as the structure of the data
distribution [14].

3. Intrinsic dimensionality for classification

3.1. The optimum features for classification


In the previous section, we were concerned with the computation of the
intrinsic dimensionality in the context of representation, i.e., the structure of a
data distribution was to be preserved by the mapping. In this section we address
intrinsic dimensionality in the context of classification.
In classification it is well known that the Bayes classifier is the best classifier
for given distributions. The resulting classification error, the Bayes error, is the
minimum probability of error. Since the Bayes classifier compares the posterior
probabilities of the classes given X, η_i(X), the optimum features y_i(X) for M-class
classification problems are the posterior probability functions,

y_1(X) = η_1(X),  y_2(X) = η_2(X),  ...,  y_M(X) = η_M(X).   (4)

In the feature space of [η_1(X), ..., η_M(X)]^T, the Bayes error is identical
to the one in the original X-space, and the Bayes classifier becomes a set of simple
bisectors as shown in Fig. 5. Furthermore, since η_1(X) + ··· + η_M(X) = 1, only
M − 1 features are linearly independent. This implies that M − 1 is the intrinsic
dimensionality for classification of M classes. Hence, we would like to find a
mapping that would convert our observed variables into the needed M − 1
posterior probability functions. We would then be working in a space of intrinsic
dimension, using a simple classifier, and achieving the Bayes error.
Fig. 5. The optimum mapping for classification.

Unfortunately, it is very hard to estimate the posterior probabilities, and even


more difficult to obtain the posterior probability functions particularly in a high
dimensional space. Therefore, feature extraction for classification in the past has
been focused on two problems:
(1) Selection of a criterion which measures the degree of overlap among
distributions.
(2) Determination of a systematic method of improving the feature (mapping)
functions so as to optimize the criterion value.
There is no known general method to handle (2). As a result, the trial and error
method is often used. The procedures used to improve feature functions very
much depend on which criterion is used. Therefore, in this section, we would like
to focus our attention on the problem of criteria.
As for criteria, the Bayes error is the best. However, since the Bayes error is not
easily expressed in an explicit form that may be readily manipulated, many
alternatives have been proposed. These include the asymptotic nearest neighbor
error, the equivocation, the Chernoff bound, the Bhattacharyya bound and so on.
These criteria give upper bounds for the Bayes error. However, the optimization
of these criteria does not give the posterior probability functions as the solutions.
There is another family of criteria which are simpler although they are not
directly related to the Bayes error. They are functions of the first- and second-order
moments, such as f(Ȳ_1, ..., Ȳ_M, K_0), where Ȳ_i is the expected vector of Y(X)
with respect to the ith class, ω_i, and K_0 is the covariance matrix of Y(X). Y(X) is
a vector in the feature space, Y(X) = [y_1(X), ..., y_M(X)]^T. In this section we
will show that this family has the posterior probability functions as the optimum
solutions. This result has been obtained for specific functional forms by Devijver,
Otsu, Fukunaga, and Ando [15-18] and generalized to a general functional form
by Fukunaga and Short [19]. The presentation follows that in Fukunaga and
Short [20].

3.2. f(Ȳ_1, Ȳ_2, K_0) for the two-class problem [20]


The criterion which is considered in this section is a general function of Ȳ_1, Ȳ_2,
and K_0. In addition, the criterion is assumed to be invariant under any nonsingular
linear transformation and coordinate shift. This property is generally considered
necessary for a criterion, since these transformations should not change the
class separability. After a whitening transformation (A^T K_0 A = I) and a coordinate
shift,

J = f(Ȳ_1, Ȳ_2, K_0) = f(0, μ, I),   (5)

where μ = A^T(Ȳ_2 − Ȳ_1). Further application of any unitary transformation does
not change the criterion value. Therefore, it may be stated that the criterion value
of (5) depends only on the length of the vector μ, that is,

J = g(d_{12}^2)   (6)
where
d_{12}^2 = μ^T μ = (Ȳ_2 − Ȳ_1)^T K_0^{-1} (Ȳ_2 − Ȳ_1).   (7)

Therefore, the optimization of f(Ȳ_1, Ȳ_2, K_0) with respect to Y(X) is equivalent to
the optimization of g(d_{12}^2), or to solving

(∂g/∂d_{12}^2) · δd_{12}^2 |_{Y(X)=Y*(X)} = 0.   (8)

Eq. (8) reveals that, regardless of the selection of f, the optimization of f is to find
Y(X) which maximizes d_{12}^2 = (Ȳ_2 − Ȳ_1)^T K_0^{-1} (Ȳ_2 − Ȳ_1). Although ∂g/∂d_{12}^2 = 0
gives another optimum condition of f, this is an undesired solution, since the
solution does not depend on the functional form of Y(X).

EXAMPLE 3.1. One example of a functional form for f(Ȳ_1, Ȳ_2, K_0), which is
commonly used in discriminant analysis [22], is

J = tr S_w^{-1} S_b = π_1 π_2 d_{12}^2 / (1 − π_1 π_2 d_{12}^2)   (9)

where S_w = π_1 K_1 + π_2 K_2 (the within-class scatter matrix), and
S_b = π_1(Ȳ_1 − Ȳ)(Ȳ_1 − Ȳ)^T + π_2(Ȳ_2 − Ȳ)(Ȳ_2 − Ȳ)^T (the between-class scatter matrix). π_i is the
class probability of ω_i, K_i is the covariance matrix of ω_i, and Ȳ is the expected
vector of Y(X).
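The identity in (9) is easy to check numerically. The following sketch, assuming NumPy and two labeled classes, compares tr S_w^{-1} S_b with π_1 π_2 d_{12}^2 / (1 − π_1 π_2 d_{12}^2) computed from K_0 = S_w + S_b; the function name and sample-based estimates are illustrative.

import numpy as np

def check_relation_9(Y, labels):
    # Compare tr(S_w^{-1} S_b) with pi1*pi2*d12^2 / (1 - pi1*pi2*d12^2).
    Y, labels = np.asarray(Y, dtype=float), np.asarray(labels)
    Y1, Y2 = Y[labels == 0], Y[labels == 1]
    p1, p2 = len(Y1) / len(Y), len(Y2) / len(Y)
    m1, m2 = Y1.mean(axis=0), Y2.mean(axis=0)
    m0 = p1 * m1 + p2 * m2                              # mixture mean
    Sw = p1 * np.cov(Y1.T, bias=True) + p2 * np.cov(Y2.T, bias=True)
    Sb = p1 * np.outer(m1 - m0, m1 - m0) + p2 * np.outer(m2 - m0, m2 - m0)
    K0 = Sw + Sb                                        # covariance of Y(X)
    d12sq = (m2 - m1) @ np.linalg.inv(K0) @ (m2 - m1)
    lhs = np.trace(np.linalg.inv(Sw) @ Sb)
    rhs = p1 * p2 * d12sq / (1 - p1 * p2 * d12sq)
    return lhs, rhs                                     # equal up to round-off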

EXAMPLE 3.2. The Fisher discriminant is

J = tr [K_1 + K_2]^{-1} (Ȳ_1 − Ȳ_2)(Ȳ_1 − Ȳ_2)^T = ½ d_{12}^2.   (10)

EXAMPLE 3.3. Another functional form is

J = ln (|S_w| / |S_m|) = ln (1 − π_1 π_2 d_{12}^2)   (11)

where S_m = S_w + S_b (the mixture scatter matrix).

EXAMPLE 3.4. The following are examples in which the criterion values
depend on the selected coordinate system:

J = tr S_w / tr S_m,    J = tr S_w + μ tr S_m   (μ: Lagrange multiplier).   (12)

Therefore, these criteria cannot be expressed as a function of d_{12}^2.

3.3. f(Ȳ_1, ..., Ȳ_M, K_0) for the M-class problem [20]


The discussion of the two-class problem can be extended to the multiclass case
as follows. Applying the whitening transformation and the coordinate shift,

J = f(Ȳ_1, ..., Ȳ_M, K_0) = f(0, μ_2, ..., μ_M, I)
  = g(μ_2^T μ_2, μ_2^T μ_3, ..., μ_j^T μ_k, ..., μ_M^T μ_M),   (13)

where μ_i = A^T(Ȳ_i − Ȳ_1). Therefore, the optimization of this criterion is
accomplished by solving

Σ_{j=2}^{M} Σ_{k=2}^{M} (∂g/∂(μ_j^T μ_k)) · δ(μ_j^T μ_k) |_{Y(X)=Y*(X)} = 0   (14)

where Y*(X) is the optimum Y(X). This is equivalent to optimizing the criterion J',

J' = Σ_{j=2}^{M} Σ_{k=2}^{M} a_{jk} μ_j^T μ_k = tr B^T H^T M^T K_0^{-1} M H B   (15)

where a_{jk} denotes ∂g/∂(μ_j^T μ_k) evaluated at Y*(X),

M = [π_1(Ȳ_1 − Ȳ), ..., π_M(Ȳ_M − Ȳ)],   (16)

H = [ −1/π_1   −1/π_1   ···   −1/π_1
       1/π_2      0     ···      0
        0       1/π_3   ···      0
        ⋮                 ⋱       ⋮
        0         0     ···    1/π_M ]   (17)

and BB^T is the symmetric (M − 1) × (M − 1) matrix whose (j, k) element is
a_{j+1,k+1} for j = k and ½a_{j+1,k+1} for j ≠ k.   (18)

The criterion in (15) may be related to the mean-square-error criterion discussed
by Devijver [15],

J'' = E{||Y(X) − P(X)||^2}   (19)

where P(X) is the M-dimensional Bayes risk vector. The kth component of P(X)
is defined by

ρ_k(X) = Σ_{j=1}^{M} λ_{jk} η_j(X)   (20)

where λ_{jk} is the cost of choosing class k given class j occurs. Eq. (19) may be
rewritten as

J'' = E{||P(X) − P̄||^2} − tr Λ^T M^T K_0^{-1} M Λ   (21)

where
Λ = [λ_{jk}].   (22)

Since the first term in (21) is independent of Y(X), minimizing (21) with respect
to Y(X) is equivalent to maximizing the second term in (21) or to maximizing
(15) with HB = Λ.
Thus, it may be concluded that the optimization of f(Ȳ_1, ..., Ȳ_M, K_0) is
equivalent to the minimization of the mean-square-error of the Bayes risk
estimate as in (19), and the selection of a criterion function of the form (13)
corresponds essentially to the selection of a cost matrix Λ in the Bayes risk
estimate.

EXAMPLE 3.5. The λ_{jk}'s corresponding to J = tr S_m^{-1} S_b are

λ_{jk} = 1/π_j  for j = k,   λ_{jk} = 0  for j ≠ k.   (23)

If π_1 = ··· = π_M, then from (20) ρ_k(X) is proportional to η_k(X). Therefore, the
maximization of tr S_m^{-1} S_b with equal class probabilities is equivalent to the
minimization of the mean-square-error of the posterior probability estimate.

3.4. The optimum solutions of f(Ȳ_1, ..., Ȳ_M, K_0)


Now, it is obvious that Y*(X) = P(X) minimizes the mean-square-error of (19)
and thus optimizes f(Ȳ_1, ..., Ȳ_M, K_0). Furthermore, since the ρ_j(X)'s and the
η_i(X)'s are mutually related by an M × M linear transformation, they share the
same feature space as long as Λ is selected to be nonsingular. Therefore, we can
claim that η_1(X), ..., η_M(X) are the optimal solutions of f(Ȳ_1, ..., Ȳ_M, K_0) with
respect to Y(X).

Although this result still does not aid in the estimation of the posterior
probability functions, there is a way to estimate the optimum value of f without
computing the posterior probabilities. Thus, when a data set is given, and a
feature set is proposed, one can estimate f*, the criterion value for the optimum
feature set, and compare it with the criterion value for the proposed feature set. If
both are close, the proposed feature set may be acceptable. If not, a better feature
set should be found.
The jth component of Ȳ*_i is

ȳ*_{ij} = E{η_j(X) | ω_i} = E{η_i(X) η_j(X)}/π_i.   (24)

Also, the jl-th component of K*_0 is

k*_{jl} = E{η_j(X) η_l(X)}.   (25)

E{η_j(X) η_l(X)} is the asymptotic error between class j and class l due to the
nearest neighbor classification [21]. Therefore, the classification errors of the
nearest neighbor rule can be estimated in the original X-space, and thus
f(Ȳ*_1, ..., Ȳ*_M, K*_0) can be computed through (24), (25), and the given functional
form of f.

3.5. The subspace problem


So far in our discussion we have considered the optimization of
f(Ȳ_1, ..., Ȳ_M, K_0) over all possible Y(X). In this section, let us assume that a set
of L features Ψ(X) = [ψ_1(X), ..., ψ_L(X)]^T is given and that Y(X) is restricted to
be a linear combination of these ψ_j(X)'s. That is,

y_k(X) = Σ_{j=1}^{L} a_{jk} ψ_j(X)   or   Y(X) = A^T Ψ(X).   (26)

Using (21), the problem becomes to maximize the following criterion with respect
to A,

J = tr Λ^T (A^T D)^T (A^T S_0 A)^{-1} (A^T D) Λ = tr (A^T S_0 A)^{-1} (A^T D Λ Λ^T D^T A)   (27)

where D and S_0 are the matrices in the Ψ-space which correspond to M of (16)
and K_0 in the Y-space. D and S_0 become A^T D and A^T S_0 A, respectively, after the
transformation of (26).
The optimum solution of (26) specifies that the column vectors of A should be
the eigenvectors of S_0^{-1} D Λ Λ^T D^T with nonzero eigenvalues [22]. Since the rank of
D is M − 1, there are M − 1 nonzero eigenvalues, leading to M − 1 columns in A.
The subspace spanned by these M − 1 eigenvectors is identical to the subspace
spanned by Ψ̄_1 − Ψ̄, ..., Ψ̄_M − Ψ̄. Note that these M vectors are related by
Σ_{i=1}^{M} π_i(Ψ̄_i − Ψ̄) = 0 and thus only M − 1 of them are linearly independent. It
means that all column vectors of A can be expressed by linear combinations of
the (Ψ̄_i − Ψ̄)'s. On the other hand, Ψ̄_i = E{Ψ(X) | ω_i} = E{Ψ(X) η_i(X)}/π_i. This
indicates that the optimum features of the Ψ-space will be the projection of the
Y*-space onto the Ψ-space. Therefore, adding an additional feature ψ_{L+1}(X) to
the Ψ-space only affects the optimum solution if ψ_{L+1}(X) has a component in the
Y*-space.
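For concreteness, the eigenvector solution to (27) can be sketched as follows, assuming NumPy, class labels for estimating the Ψ̄_i's and, as an illustrative default, an identity cost matrix Λ; the function name and the sample-based estimates are not part of the original presentation.

import numpy as np

def optimal_linear_features(Psi, labels, Lam=None):
    # Columns of A are the eigenvectors of S0^{-1} D Lam Lam^T D^T with
    # nonzero eigenvalues; D has columns pi_i (Psi_bar_i - Psi_bar).
    Psi, labels = np.asarray(Psi, dtype=float), np.asarray(labels)
    classes = np.unique(labels)
    M = len(classes)
    pis = np.array([(labels == c).mean() for c in classes])
    mean_all = Psi.mean(axis=0)
    D = np.column_stack([pis[i] * (Psi[labels == c].mean(axis=0) - mean_all)
                         for i, c in enumerate(classes)])
    S0 = np.cov(Psi.T, bias=True)                       # covariance of Psi(X)
    Lam = np.eye(M) if Lam is None else Lam
    C = np.linalg.solve(S0, D @ Lam @ Lam.T @ D.T)      # S0^{-1} D Lam Lam^T D^T
    evals, evecs = np.linalg.eig(C)
    order = np.argsort(-np.abs(evals))[:M - 1]          # the M-1 nonzero eigenvalues
    return np.real(evecs[:, order])                     # features: Y(X) = A^T Psi(X)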

References

[1] Shepard, R. N. (1962). The analysis of proximities: multidimensional scaling with an unknown
distance function, I. Psychometrika 27, 125-140.
[2] Shepard, R. N. (1962). The analysis of proximities: multidimensional scaling with an unknown
distance function, II. Psychometrika, 27, 219-245.
[3] Kruskal, J. B. (1964). Multidimensional scaling by optimizing goodness of fit to a nonmetric
hypothesis. Psychometrika 29, 1-28.
[4] Kruskal, J. B. (1964). Nonmetric multidimensional scaling: a numerical method. Psychometrika
29, 115-130.
[5] Shepard, R. N. and Carroll, J. D. (1965). Parametric representation of nonlinear data structures.
In: P. R. Krishnaiah, ed., Multivariate Analysis. Academic Press, New York.
[6] Fukunaga, K. and Olsen, D. R. (1971). An algorithm for finding the intrinsic dimensionality of
data. IEEE Trans. Comput. 20, 176-183.
[7] Trunk, G. V. (1968). Statistical estimation of the intrinsic dimensionality of data collections.
Inform. and Control. 12, 508-525.
[8] McLaughlin, J. A. and Raviv, J. (1968). Nth order autocorrelation in pattern recognition.
Inform. and Control 12, 121-142.
[9] Fukunaga, K. and Hostetler, L. D. (1975). The estimation of the gradient of a density function
with application in pattern recognition. IEEE Trans. Inform. Theory 21, 32-40.
[10] Bennett, R. S. (1969). The intrinsic dimensionality of signal collections. IEEE Trans. Inform.
Theory 15, 517-525.
[11] Sammon, Jr., J. W. (1969). A nonlinear mapping algorithm for data structure analysis. IEEE
Trans. Comput. 18, 401-409.
[12] Calvert, T. W. and Young, T. Y. (1969). Randomly generated nonlinear transformations for
pattern recognition. IEEE Trans. Systems Sci. Cybernet. 5, 266-273.
[13] Olsen, D. R. and Fukunaga, K. (1973). Representation of nonlinear data surfaces, IEEE Trans.
Comput. 22, 915-922.
[14] Koontz, W. L. G. and Fukunaga, K. (1972). A nonlinear feature extraction algorithm using
distance transformation. IEEE Trans. Comput. 21, 56-63.
[15] Devijver, P. A. (1973) Relationships between statistical risks and the least-mean-square-error
design criterion in pattern recognition. Proc. First Internat. Joint Conf. Pattern Recognition, 139-
148, Washington, DC.
[16] Otsu, N. (1972). An optimal nonlinear transformation based on variance criterion for pattern
recognition--I. Its derivation. Bull. Electrotechnical Laboratory 36, 815-830.
[17] Otsu, N. (1973). An optimal nonlinear transformation based on variance criterion for pattern
recognition--II. Its properties and experimental confirmation. Bull. Electrotechnical Laboratory
37, 283-295.
[18] Fukunaga, K. and Ando, S. (1977). On a nonlinear feature extraction. IEEE Trans. Inform.
Theory 23, 453-459.
[19] Fukunaga, K. and Short, R. D. (1978). Nonlinear feature extraction with a general criterion
function. IEEE Trans. Inform. Theory 24, 600-607.
[20] Fukunaga, K. and Short, R. D. (1978). A class of feature extraction criteria and its relation to
the Bayes risk estimate. IEEE Trans. Inform. Theory 26, 59-65.
[21] Cover, T. M. and Hart, P. E. (1967). Nearest neighbor pattern classification. IEEE Trans.
Inform. Theory 13, 21-27.
[22] Fukunaga, K. (1972). Introduction to Statistical Pattern Recognition. Academic Press, New York.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 361-382

Structural Methods in Image Analysis and Recognition

Laveen N. Kanal, Barbara A. Lambird and David Lavine

1. Introduction

The field of image processing has grown enormously in recent years. Among
the many areas of application are remote sensing for natural resource evaluation,
industrial parts inspection, cartographic feature extraction, and x-ray analysis. In
these applications considerable information about the probable contents of
images is often available. The amount and complexity of this information has
prevented full use of this knowledge. Many approaches to the structuring of
image knowledge have been studied; in this chapter we provide an elementary
introduction to some of the major approaches.
A wide variety of techniques has been developed for image analysis. One
common approach consists of image segmentation using some type of similarity
criterion for grouping areas within an image followed by measurement of result-
ing region properties such as shape and texture. Finally, these measurements are
used to classify the regions into types by computing the similarity of these
measurements to those of a set of training regions. Statistical methods such as
discriminant analysis and Bayesian classifiers have been used for this classifica-
tion step. This type of approach has not been adequate to handle many problems
in which objects are formed from substructures, and the regions surrounding an
object are important for the identification of the object.
The primary goal of structural pattern recognition procedures has been the
recognition of objects in an image. In some tasks such as industrial parts
inspection, there is often only one object present, and the identity of the object is
known a priori. The problem in this situation is to determine the position of the
object and perform visual inspection tasks automatically. For remote sensing
applications, many objects may be present in the image. The types of objects
present may, in some cases, not be known during algorithm development.
Structural methods are appealing because they allow the designer or user of a
pattern recognition system to employ a somewhat intuitive description of an
object as the basis for a recognition scheme. Unfortunately, these methods can be
time-consuming to implement and use. In spite of such drawbacks, there are
many situations in which structural methods appear necessary, especially when
there is considerable variation in factors such as image size, location, shape, and

color. Such variations frequently arise in remote sensing. These variations are a
result of factors such as lighting, viewing angle, type of sensor, seasonal changes
in crops, water content of the soil, and elevation.
A fundamental part of many structural recognition systems is a search space.
This space may be explicitly stored in a computer or implicitly stored and
dynamically generated as in a grammar. Often measures of merit are defined on
those parts of the search space which have been examined. These measures may
provide a notion of distance between parts of the search space already searched
and the goal of the search. In addition they are often used to direct the next step
in the search. There may be more than one goal and belief in the goal, once
reached, may be complete or probabilistic.
Numerous types of search spaces and problem representations may be found in
the literature. Grammars, stochastic grammars (see the chapter by K. S. Fu),
production rule systems, predicate calculus systems, semantic nets, state spaces,
and AND/OR graphs (see the chapter by G. C. Stockman), are a few of the more
common types of problem representations. Within each of the above representa-
tions, various subclasses of representations have been implemented. In addition,
more than one search algorithm has been defined for each of these search spaces.
Furthermore, many structural recognition systems are quite large and their
performance is strongly affected by the problem domain, the extent of expert
domain knowledge employed, and general system design strategies.

2. Syntactic pattern recognition

Many of the limitations of statistical pattern recognition can be traced to its


neglect of structural information. In addition, statistical pattern recognition is
normally used only for classification. Syntactic pattern recognition, on the other
hand, attempts to use structural information for both feature extraction and
classification processes, by using the formalism developed for formal language
theory.
In syntactic pattern recognition, the pattern is decomposed into subpatterns,
which can be further decomposed, until eventually primitive patterns are reached.
Formally, a pattern is a sentence in a language specified by a grammar. The
grammar is a set of rules of syntax used for the generation of the sentences from
given symbols. The sentences would then consist of strings of these symbols. In
syntactic pattern recognition the symbols represent the primitives and subpat-
terns. For example in Fig. 1, suppose a and b are the primitives, then the pattern
'airplane' is represented by the string (starting from S and traveling clockwise)
abbabbbababbbabb. The subpattern wing would be represented by bab and the
subpattern tail would be represented by babab.
A grammar G is formally defined by the four-tuple (N, Σ, P, S) where N is a
finite set of nonterminals or subpatterns, Σ is a finite set of terminals or primitive
patterns, P is a finite set of production or rewriting rules, and S in N is the
starting symbol. In the airplane example, N = {S, W, T}, Σ = {a, b} and P = {S →
abWbTbWb, W → bab, T → babab}. In this example, W and T represent wing

Fig. 1. The pattern airplane is divided into primitives 'a' and 'b'.

and tail respectively, and the production rules indicate that W may be replaced by
bab, T by babab, and S by abWbTbWb.
The notation

ραδ ⇒_G ρβδ

indicates that the string ρβδ is derivable from the string ραδ by one application of a
production α → β, where α, β, δ and ρ represent mixed strings of terminals and
nonterminals. Two possible derivations of the airplane, from the above example,
are

S ⇒_G abWbTbWb ⇒_G abbabbTbWb ⇒_G abbabbbababbWb ⇒_G abbabbbababbbabb,

and

S ⇒_G abWbTbWb ⇒_G abWbbababbWb ⇒_G abWbbababbbabb ⇒_G abbabbbababbbabb.
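A tiny sketch of this grammar as a data structure, assuming Python; expanding every non-terminal reproduces the boundary string given above (the function name is illustrative).

# Productions of the airplane grammar: S -> abWbTbWb, W -> bab, T -> babab.
PRODUCTIONS = {"S": "abWbTbWb", "W": "bab", "T": "babab"}

def expand(symbol="S"):
    # Recursively replace non-terminals until only terminals remain.
    rhs = PRODUCTIONS.get(symbol)
    if rhs is None:                 # terminal symbol
        return symbol
    return "".join(expand(ch) for ch in rhs)

assert expand("S") == "abbabbbababbbabb"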

There are four types of grammars, depending on their production rules.
Grammars with no restrictions on the form of their productions are called
unrestricted grammars. The next least restrictive grammars are context-sensitive
grammars. Their productions have the form ρAδ → ρβδ, where ρ and δ are
possibly empty, mixed strings of terminals and non-terminals, β is a non-empty
mixed string of terminals and non-terminals, and A is in N. These grammars are
context-sensitive because A can be rewritten as β only when it appears in the
context of ρAδ.
A context-free grammar is the next type of grammar and has productions only
of the form A → α, where A is in N, and α is a non-empty mixed string of
nonterminals and terminals. The above airplane grammar is context-free. The
most restricted grammars are called regular grammars. They have productions of
the form A → aB or A → a, where A and B are in N and a is in Σ.

The division of grammars into these four categories is useful in the design of
procedures for finding derivations. Efficient procedures exist for handling regular
and context-free grammars, while the complexity of context-sensitive and unrestricted
grammars, in general, makes them infeasible to handle. Context-free
grammars do not appear to be powerful enough to handle some applications.
New types of grammars representing a compromise between context-free and
context-sensitive grammars have been designed to alleviate these problems. Some
of those types of grammars, such as programmed grammars or indexed grammars
have been used to describe non-context-free constructions in programming lan-
guages.
Grammars can also be classified in a different way. A grammar is deterministic
if at each step in a derivation there is only one possible action that can be taken,
i.e. there are no alternatives. A grammar is stochastic if probabilities are assigned
to each alternative, indicating the likelihood of that alternative. A grammar is
nondeterministic if there are a finite number of choices at any step, and
probabilities are not used.
Stochastic grammars have received considerable attention in recent years (Fu,
1982). By using such grammars for parsing it is possible to attach a probability to
each derivation of a string. Furthermore, the probabilities attached to rewriting
rules can be used to guide the search for derivations of a string. In any step of a
derivation, the search procedure merely applies that rewrite rule which has the
highest probability among all rewrite rules which are applicable. Other applicable
rules may be tried, in order of decreasing probability, at this point in the
derivation. The probability attached to a derivation is the product of the
probabilities of the rewrite rules used in the derivation. By using the probabilities
attached to the rewrite rules to guide the search for derivations we are attempting
to apply the best rewrite rule at each point in the search for derivations.
A strong motivation for the use of stochastic grammars lies in the relationship
between probabilities on rewrite rules and probabilities on sentences. If we view
the objects we are attempting to recognize as being generated by some type of
probabilistic mechanism, then each object corresponds to a sentence and so a
probability can be attached to each sentence. Under certain conditions, it is
possible to find, for such a probability distribution, a stochastic grammar which
yields the same probability distribution on sentences.
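As a minimal illustration of scoring a derivation, the sketch below attaches illustrative (made-up) probabilities to the rewrite rules of the small grammar used later in Fig. 5(a) and multiplies the probabilities of the rules applied.

# Illustrative rule probabilities for the grammar S -> dAbc | dS, A -> aA | c.
RULES = {"S": [("dAbc", 0.7), ("dS", 0.3)],
         "A": [("aA", 0.4), ("c", 0.6)]}

def derivation_probability(applied_rules):
    # applied_rules is the ordered list of (lhs, rhs) productions used.
    p = 1.0
    for lhs, rhs in applied_rules:
        p *= dict(RULES[lhs])[rhs]
    return p

# the derivation S => dS => ddAbc => ddaAbc => ddacbc
print(derivation_probability([("S", "dS"), ("S", "dAbc"),
                              ("A", "aA"), ("A", "c")]))   # 0.3 * 0.7 * 0.4 * 0.6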

2.1. PDL -- Picture description language


The string grammars discussed so far only allow the relationship of simple
concatenation between subpatterns, i.e., the subpatterns are always connected at
their ends. There has been much effort to generalize grammars which allow more
complicated relationships. This section will briefly describe grammars which
allow more complicated concatenations.
Picture description languages, developed by Shaw (1969, 1970) associate a head
and a tail with each primitive pattern. The primitive patterns can be connected at
only these points using four binary concatenation operators and a reversal

Fig. 2. PDL operators: the strings a + b, a − b, a × b, and a * b and the structures they produce.

operator, as demonstrated in Fig. 2. Segments a and b are the abstract structures
defined by the head and the tail of the primitive patterns.
As an example, let the primitive patterns be represented by the set {a, b, c, d},
each a directed segment with a head and a tail. Then the string a + b + −a + −b and
the string a + c + −d + ~a + −b represent composite structures formed by joining
these primitives at their heads and tails. PDL's have been used in the analysis of
particle tracks in bubble and spark chambers (Shaw, 1968).

2.2. Higher dimensional pattern grammars

The previous sections described grammars that only allow concatenation as the
relationship between subpatterns. This section will briefly touch on grammars
that are much more complicated but allow complex relations (Fu, 1974; Kanal,
1974; Gonzalez and Thomason, 1978). These grammars are based on graph
theoretical principles (Harary, 1969).
Tree grammars have productions where the terminals and non-terminals are
trees. However, since trees can be represented by lists or strings, tree grammars
can be written in string grammar form. Fig. 3 shows an example of a tree
grammar and a derivation. Tree grammars have been used in the detection of
texture (Lu and Fu, 1978) and in analysis of bubble chamber photographs (Fu
and Bhargarva, 1973). Texture can be defined in a hierarchical way, i.e., as
subpatterns occurring repeatedly in a highly specified manner. This definition
leads naturally to a tree grammar, where small regions of some grey level are the
primitive patterns.
Web grammars use webs as the subpatterns, where webs are undirected, labeled
graphs. Web grammars can describe more general patterns than tree or string
grammars. However, the increased flexibility requires much more complicated
production rules. In the string or tree grammar case it was easy to see how to
substitute the new subpattern. In the web grammar case, it is necessary to specify
how to connect the new subpattern. For example, suppose α is to be replaced by
β (Fig. 4(a)), in the web pattern shown. There is more than one possible way to do
this substitution, as demonstrated. As a result, web productions are written as
triples (α, β, f) where f is a function that specifies how to join the nodes of
subweb β to the neighbors of each node of the removed subweb α.
Web grammars have been used in the classification of LANDSAT data (Brayer
and Fu, 1976). Their graph model for the scene is shown in Fig. 4(b). The graph is

Fig. 3. An example of a tree grammar and derivation: (a) production rules; (b) derivation of a pattern.

not a tree because of the presence of the relationships 'surround', 'near', and
'range'.
Plex grammars have even more complicated production rules. Plex structures
have an arbitrary number of attaching points for joining to other structures. The
production rules describe the connectivity by providing lists of labeled concatena-
tion points. Plex grammars have been used to describe chemical structures (Feder,
1971).
In the above discussion it should be noted that the more complicated the
primitives are, the more complicated the form of the production rules. However,
with the more complicated grammars, the number of production rules in the
grammar needed to describe a given pattern may be a great deal less. This
tradeoff of primitive type versus grammar complexity becomes important in the
recognition process discussed next.

2.3. Recognition process


Once a grammar or grammars have been designed to recognize the desired
patterns, it is necessary to construct a recognizer. Each recognizer for a particular
grammar should recognize patterns only generated by that grammar. The process
of recognizing a pattern with respect to a given grammar is called parsing, or

Fig. 4(a). A web grammar and derivation. [Grammar G_w = (N, Σ, P, S) with N = {S} and Σ = {a, b, c}; P consists of two triples (α, β, f) with embedding function f(a, S) = {b, c}; the derivation is shown as a sequence of labeled graphs.]

syntax analysis, or finding the derivation. If the process begins with the start
symbol and attempts to find the derivation by progressively expanding the
non-terminals (usually done leftmost first), then the process is called top-down. If
the parse begins with the terminal symbols and progressively attempts to replace
them with the left-hand side of the production rules, then the parse is called
bottom-up. Since at each step a wrong choice can be made, if more than one
choice is possible, the parsing process can be inefficient.
Top-down parsing is 'goal-oriented'. Suppose the start symbol S has the
production S → P_1 ··· P_n. If P_1 is a terminal symbol, then the pattern
must start with P_1; if P_1 is a non-terminal, then P_1 becomes a subgoal and the P_1
productions are examined. This process continues until all the P_i's are recognized.
If at any point a P_i cannot be recognized, then an alternative S production is
examined. Fig. 5(a) shows a grammar, a pattern string to be recognized, and the
top-down parsing process.
top-down parsing process.
Top-down parsing techniques are conceptually simple since they can use the
structure of the grammar to direct the parsing. In addition, since the desired
primitive is known at each step, primitive detection can be tailored to the desired
primitive.

Grammar G = ({S, A}, {a, b, c, d}, P, S)
Production rules:  S → dAbc,  S → dS,  A → aA,  A → c

Parse string ddacbc top-down:

Step   String    Production rule used
1      S         initially
2      dS        S → dS
3      ddAbc     S → dAbc
4      ddaAbc    A → aA
5      ddacbc    A → c

Fig. 5(a). Example of a top-down parse.
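A minimal backtracking sketch of this goal-oriented process for the grammar of Fig. 5(a), assuming Python (the representation and function name are illustrative; practical top-down parsers are considerably more elaborate).

GRAMMAR = {"S": ["dAbc", "dS"], "A": ["aA", "c"]}

def parse(symbols, text):
    # True if the sequence of grammar symbols can derive exactly `text`.
    if not symbols:
        return text == ""
    head, rest = symbols[0], symbols[1:]
    if head in GRAMMAR:                       # non-terminal: try each production
        return any(parse(list(rhs) + rest, text) for rhs in GRAMMAR[head])
    # terminal: must match the next input character
    return bool(text) and text[0] == head and parse(rest, text[1:])

print(parse(["S"], "ddacbc"))   # True, via S => dS => ddAbc => ddaAbc => ddacbc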

Bottom-up parsers, on the other hand, require all the primitives to be detected
before parsing. Fig. 5(b) shows an example of bottom-up parsing. At each step
some part of the pattern string is substituted by the left side of a production rule
until the start symbol is reached.
A general comparison of bottom-up and top-down methods is difficult (Fu,
1974). Present top-down parsers can accept a wider class of grammars but are
more inefficient. Many efficient special-purpose bottom-up parsing techniques are
available for various restricted grammars.
The one-directional analysis used by bottom-up parsers or top-down parsers
causes problems in difficult applications such as cartographic feature recognition.
Bottom-up methods do not take advantage of a priori knowledge during segmentation,
while strictly top-down methods can inefficiently generate hypotheses that
are consistent with the model but in no way related to a given instance of data. A
method which combines the bottom-up and top-down methodologies is presented

Grammar G = ({S, A}, {a, b, c}, P, S)
Productions:  S → Abc,  S → aAS,  A → c

Parse string accbc bottom-up:

Step   String    Production rule used
1      accbc     initially
2      aAcbc     A → c
3      aAAbc     A → c
4      aAS       S → Abc
5      S         S → aAS

Fig. 5(b). An example of bottom-up parsing.



in (Stockman 1977, Stockman and Kanal, 1982). This method is able to proceed
in a bi-directional manner: model-directed and data-confirmed. The analysis is
not confined to a canonical scan of the data such as row-by-row and left-to-right.
Multiple or ambiguous interpretations are allowed and are developed on a
best-first basis. This ambiguity is permissible at both the segmentation level and
the structural analysis level. The algorithm has subsequently been extended to
allow parallel development of the interpretations (Kanal and Kumar, 1981).
The non-directional parsing is accomplished by combining artificial intelligence
problem-solving techniques with formal language theory. The problem-reduction
representation (PRR) approach of artificial intelligence subdivides the original
problem into a set of subproblems which are in turn subdivided. The subdivision
continues until primitive subproblems are reached whose solutions are trivial.
Informally, each production in the grammar can be thought of as a subproblem,
the terminal symbols represent primitive subproblems, and start productions
represent the original problem. This representation can then be searched for an
optimal solution using various techniques. For more details see (Nilsson, 1980;
Stockman, 1977; Stockman and Kanal, 1982).
Stockman applied this procedure to one-dimensional waveforms and experi-
mented with carotid pulse waves. The system (WAPSYS) successfully recognized
the pulse waves and automatically extracted features. This work is now being
extended to images (Lambird, 1982).

2.4. Semantic information


Most complex problems are difficult to formulate as context-free or regular
grammars. However, efficient parsing techniques are not available for context-
sensitive and unrestricted grammars. Attributed context-free grammars are one
way around this problem. Many context-sensitive grammars can be formulated as
attributed context-free grammars. Stockman's waveform analyzer uses attributed
context-free grammars; Davis (1981) used attributed context-free grammars to
recognize airplanes.
Attributed grammars are ordinary grammars in which terminals or nontermi-
nals may have values associated with them. These values generally represent
properties of the structure represented by the terminal or nonterminal. An early
work in this field (Knuth, 1968) dealt with the application of attributed grammars
to the recognition and interpretation of numbers. Knuth addressed the problem
of determining whether or not a string of characters represented a number where
the valid forms for numbers were specified by a grammar. The attributes of a
nonterminal gave the value corresponding to the segment of the character string
corresponding to that nonterminal and enabled the parser to determine the value
of the character string during parsing. This approach of giving values to non-
terminals provides a convenient formalism for the manipulation of descriptors in
images.
The attributed context-free grammar can be formulated as a 4-tuple G =
(N, Σ, P, S) where N is a finite set of non-terminals, Σ is a finite set of terminals,
P is a finite set of productions, and S in N is the start symbol. For each
V ∈ (N ∪ Σ), there exists a finite set of attributes A(V), where each attribute a of
A(V) has a set of possible values D_a. The production rules, P, have two
parts--the syntactic part with the form V_0 → V_1 V_2 ··· V_m, where V_0 ∈ N and
V_i ∈ (N ∪ Σ) for 1 ≤ i ≤ m, and the semantic part, which is a set of functions or
procedures which show how the attributes of each V_i are generated from the other
V_i's in the production.
Attributes in an attributed grammar may be used for directing parsing and for
assigning a measure of merit to a parse. Owing to the lack of restrictions on the
semantic part of an attributed grammar, no general statements can be made
concerning the efficiency of using attributes for directing parsing. For simple
semantic rules attributes have proven useful for directing parsing in various
applications.
In image processing applications attributes may represent object properties
such as length. A semantic rule may in this case assign to the left-hand side of a
production rule the sum of the lengths of the right-hand side attributes.
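A minimal sketch of such a semantic rule, assuming attributes are held in dictionaries; the rule and names are illustrative.

# Semantic rule attached to a syntactic production A -> B C:
# the 'length' attribute of A is the sum of the lengths of B and C.
def length_rule(attrs_B, attrs_C):
    return {"length": attrs_B["length"] + attrs_C["length"]}

print(length_rule({"length": 3.0}, {"length": 4.5}))   # {'length': 7.5}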

2.5. Inference of grammars

Grammatical inference is the procedure of learning a grammar based on a set


of sample patterns. At the present, there exist no widely accepted or adequate
procedures for automatic grammatical inference for a broad class of problems. In
addition, automatic procedures do not, in general, generate meaningful subpat-
terns, i.e. the nonterminals do not correspond to a meaningful part of the pattern.
For example, in the airplane example, the nonterminals represented wings and
tails of the airplanes. Moreover, no procedures for generating meaningful semantic
information have been formulated. A survey of inference mechanisms can be
found in (Fu and Booth, 1975).

2.6. Error-correcting parsing

An obscured portion of an object in an image can cause severe difficulties when


trying to recognize this object by ordinary parsing methods. The obscured portion
may correspond to one or more terminals which must be present if a parse for the
object is to be found. To avoid this difficulty, another type of parsing called
error-correcting parsing is used (Fu, 1982). Error-correcting parsing procedures
find the string in the language of interest which is most similar to the string to be
parsed. Similarity is often defined in terms of the number of changes, such as
insertions or deletions, which are required to transform one string into another.
Error-correcting parsing offers considerable flexibility over ordinary parsing,
although the computational cost is higher.
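The similarity measure mentioned above is typically an edit distance; a standard dynamic-programming sketch is shown below (an error-correcting recognizer would then pick the grammar string closest to the observed string).

def edit_distance(s, t):
    # Minimum number of insertions, deletions and substitutions turning s into t.
    d = [[j for j in range(len(t) + 1)]] + \
        [[i + 1] + [0] * len(t) for i in range(len(s))]
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            d[i][j] = min(d[i - 1][j] + 1,                       # delete s[i-1]
                          d[i][j - 1] + 1,                       # insert t[j-1]
                          d[i - 1][j - 1] + (s[i - 1] != t[j - 1]))
    return d[len(s)][len(t)]

print(edit_distance("abbabbbababbbabb", "abbabbbababbabb"))      # 1 (one symbol missing)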

3. Artificial intelligence

Syntactic pattern recognition methods do not appear capable of handling many


of the more complex image-processing tasks. To recognize an object in an image
despite object variability due to factors such as partial occlusion, shadow, and
noise, requires contextual information which is not easily representable by
grammars. To overcome the limitations of the grammatical approach, a variety of
approaches from the field of artificial intelligence has been investigated. These
approaches are, in general, more flexible than the syntactic approach. Unfortunately,
many artificial intelligence techniques have not been sufficiently well formulated to
allow for comparison.
Representation of knowledge is a central issue in artificial intelligence. Some
techniques, such as production rules, place strong restrictions on the form in
which information can be represented. Others, such as semantic nets and the
predicate calculus, can be extremely general. In addition to the problem of
knowledge representation, artificial intelligence programs must provide some
procedures for using the knowledge base for image processing.
We now survey some artificial intelligence programs and methods which may
be useful in image analysis.

3.1. Frames
A frame (Winston, 1977) is a scheme for representing knowledge. It represents
a stereotyped situation such as a prototype of a house or a car. Several types of
information may be contained in a frame. Generally there is a description of
those properties which are to be found in all the observed variants of the
prototype. In addition, a set of initially empty areas called slots are present. These
slots contain information describing a specific instance of the type of object
represented by the prototype. For a frame representing an image, the slots may
contain a description of a specific object, such as a house, within the image. The
frame may also contain information on how the frame is to be used, what it can
expect to see next, and what to do if it does not see what it expects to see.
A slot in a frame may contain another frame. Thus frames can be linked
together to represent complex relationships between frames. Spatial relationships
such as adjacency and part-whole relationships can be represented using frames.
Because of the lack of restrictions on the type of information and procedures
which may be stored within frames, it is difficult to draw any conclusions about
the value of frame systems in general. Frames, as well as closely related systems
such as semantic nets, have been used heavily in recent years, and further
development of such systems is likely to continue.

3.2. Production rule systems


Artificial intelligence research, aimed at combining the power of the predicate
calculus of formal logic with the rule based opinions of experts in particular
applications, has led to the development of large programs known as production
rule systems which have achieved high quality decision making in diverse disci-
plines. Partitioned semantic networks, an important class of production rule
systems, have proved to be particularly powerful, flexible, and convenient for user
interaction. This section presents an overview of this class of systems.

Fig. 6. Representation of the statement: entity-1 is made of brick.

The underlying structure of a semantic network is a directed graph consisting
of nodes, representing objects or relations, and arcs connecting nodes, representing
relations between nodes. The arcs are labelled with the type of argument of
the relation. The net shown in Fig. 6 consists of three nodes. The node 'Made of'
is a relation and the nodes 'Entity-1' and 'Brick' are objects. The arc labels
'entity-1' and 'brick' are used to indicate that the relation 'Made of' has two
arguments, an 'entity' and a 'value'. Certain parts of the graph representing
relationships and types of values are partitioned off for special handling.
Logical relationships in the form of simple implicational statements such as E_1
and E_2 and ... and E_N ⇒ H, where E_1, E_2, ..., E_N are evidence for the hypothesis
H, can be represented. Fig. 7 illustrates the network representation of such a
statement. This network represents the rule "A road going into a mountain is a
tunnel". Object E-1A is a road going into object E-1B, which is a mountain, and
the resulting object, E-1C, is a tunnel. Consequences of rules can be used as
antecedents to other rules, thus giving a network structure to the set of implication
statements.
The utility of implication rules can be extended by allowing inexact matching
of antecedents and providing probabilistic measures of belief in the antecedents.
Due to the complexity of propagating probabilities, approximations to Bayesian
methods can be applied. Research into efficient methods for modelling limited
statistical dependence among antecedents is an area of active research which has
to date yielded various algorithms which have not been extensively tested.

Fig. 7. The network that represents the production rule: "a road that goes into a mountain is a tunnel".

Strategies, called control structures, for movement of information through


networks remain a complex problem in the design of expert systems.

3.3. Knowledge-directed image analysis


The vision system developed in (Ballard, 1978) uses a three level system of
image representation. At the highest level of representation a semantic network is
used to model the important features for a range of possible scenes. At the next
level we have a synthesized data structure called a sketch map relating parts of the
image to items in the semantic network. At the lowest level, we have a structure,
called the image data structure, containing the image, possibly at various levels of
resolution, spectra, filtered versions of the image, etc.
The basic unit of data in each level of representation is called a node. The goal
of image analysis, in this case, is to determine a mapping from nodes in the image
data structure to nodes in the sketchmap and a mapping from nodes in the
sketchmap to nodes in the semantic network. Two structures are used for
generating these mappings. The links between the semantic network, called a real-world model, and the sketch map are controlled by a routine called the executive procedure, a fast, specific, user-written program that controls the construction of links. The links between the sketch map and the image data structure are formed by procedures attached to the nodes of the sketch map.
The real world model contains constraints relating nodes. In Fig. 8 (Ballard,
1978), we see a portion of a model representing a dock and its surroundings. The
circles containing the words 'Docked Ship', 'Coastlines', and 'Dock' represent
nodes in the semantic network. The constraint 'Intersection' means that the coastline and dock must intersect, while the constraint 'Parallel at a Distance' means that the line through the centroids of the docked ships should be parallel to the line of intersection between the coastline and the dock. The circles labelled 'I' denote instantiations of nodes in the semantic network and the circles labelled 'LD' refer to descriptors specifying the location of an object.
The system answers user queries, updating its model with each search. Given a query, the system generates a sketch map, and the sketch map activates procedures to locate objects in the image. Information on a located object is then used to update the network.
The verification of map information in images has been studied by L.N.K. Corporation (Stockman, 1978; Stockman et al., 1981). This work includes a survey of
basic problems related to cartographic feature extraction, a proposed feature
extraction process, and experimentation using a profile search method for feature
verification and a servoing procedure for road verification.
The servoing technique is a procedure for following roads by using a map in
conjunction with verified road points to predict new image road points. The need
for this corrective process arises because of the approximations involved in the
registration transformation and, in the case of time-varying features such as
rivers, because of changes in the physical location of feature points. The average
correction vector between the five most recent image and map features observed

Fig. 8. A real world model of a dock and coastline (Ballard, 1978). (I = instance; LD = location descriptor.)

in the road following procedure is used as a correction vector for the location of
the next image feature point to be detected. The experimental results indicate
high accuracy in road point verification and extensions of the method appear
promising.
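As a rough illustration of the servoing idea, the sketch below predicts the next image road point from its map point plus the average of the five most recent map-to-image offsets. The window of five matched points comes from the description above; the data structures, function name, and example coordinates are assumptions made purely for illustration.

# Sketch of the road-servoing correction: the average offset between the
# five most recent verified image points and their map points is added to
# the next map point to predict where the next image road point should
# be searched for.

def predict_next_image_point(map_points, image_points, next_map_point, window=5):
    """map_points, image_points: matched (x, y) pairs verified so far."""
    recent = list(zip(map_points, image_points))[-window:]
    if not recent:
        return next_map_point
    dx = sum(ix - mx for (mx, my), (ix, iy) in recent) / len(recent)
    dy = sum(iy - my for (mx, my), (ix, iy) in recent) / len(recent)
    return (next_map_point[0] + dx, next_map_point[1] + dy)

# Example: a roughly constant registration error of (+2, -1) pixels.
maps   = [(10, 10), (12, 11), (14, 12), (16, 13), (18, 14)]
images = [(12,  9), (14, 10), (16, 11), (18, 12), (20, 13)]
print(predict_next_image_point(maps, images, (20, 15)))   # about (22, 14)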

3.4. Decision support systems


The development of large knowledge-based systems for decision making is a
complex task requiring considerable programming facility and time. Though
much attention has been directed toward the problem of providing easy user
updating of existing systems, few workers have focused on the development of
very high level languages which are sufficiently powerful to permit users to easily
construct entire systems. In this section we describe a system (Reggia, 1981)
which attempts to deal with this task.
The development system KMS (Knowledge Management System) is a tool,
written in Lisp, to aid applications specialists in the implementation of knowledge
based inference systems. The system provides for easy input of a knowledge base,
selection of inference mechanisms, and a simple user interface to the system.

The representation of knowledge in KMS can take several forms. In all forms,
the user must supply a set of attributes, the range of possible values for these
attributes, and a set of relationships among the attributes. The relationships can
be given in four forms:
(1) Production rules,
(2) User-defined mathematical formulas,
(3) Bayesian inferences,
(4) Feature description.
In the KMS system measures of belief can be attached to each attribute on the
right-hand side of a production rule. A rule is provided by the system for
combining these measures of belief into a measure of belief for the left hand side.
In all cases, attributes refer to variables whose values can be computed or
assigned by the user. For example, we may have an attribute bridge, which has a
value determined by two other attributes, (1) water and (2) a strip of pavement of
specified dimension over water. In each of the first three types of relationships we
may assign some measure of belief or probability to the attributes water and strip
of pavement over water. Each system will then compute a number which may be a
probability, measure of belief, etc., for the entity bridge.
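The sketch below illustrates the bridge example numerically. KMS supplies its own combining rule, which is not reproduced in the text; the min and certainty-factor style combinations used here are common choices assumed only for illustration, as are all the numerical belief values.

# Hypothetical sketch of combining measures of belief for the 'bridge'
# attribute.  This is not the actual KMS combining rule.

def combine_antecedents(beliefs):
    """Belief in the conjunction of antecedents: take the weakest one."""
    return min(beliefs)

def update_belief(current, new_evidence):
    """Accumulate evidence from independent rules concluding the same
    attribute (certainty-factor style combination)."""
    return current + new_evidence * (1.0 - current)

belief_water    = 0.9
belief_pavement = 0.7      # strip of pavement of specified dimension over water
rule_strength   = 0.8      # belief attached to the rule itself

belief_bridge = rule_strength * combine_antecedents([belief_water, belief_pavement])
print(round(belief_bridge, 2))        # 0.56 from this single rule

# A second, independent rule concluding 'bridge' with belief 0.4 raises
# the combined figure rather than replacing it:
print(round(update_belief(belief_bridge, 0.4), 2))   # 0.74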
Production rules were described in Section 3.2. The user-defined mathematical
formulas are functions, such as regression functions, mapping values of a set of
attributes into the value of another attribute. Bayesian inferences refer to applica-
tions of Bayes rule to calculate the value of an attribute using the values of other
attributes. Finally, feature descriptions consist of a set of attributes describing
another attribute together with probabilities of the presence of these attributes in
the attribute being described. A hypothesis and test inference mechanism is
available with the feature descriptions.
At present in all but the feature description systems, KMS seeks to assign
values to attributes by exhaustive top-down directed search through the knowl-
edge base using an inference mechanism. This is made possible by the KMS
restriction that the relationships between attributes be ordered in a hierarchical
fashion. Thus for each attribute A one can find a set of attribute relations, and an ordering of those relations, such that the system can start with user-supplied attribute values and apply these relations in order to compute the value of attribute A. The KMS control structure could be modified to allow for alternate
types of search, though this requires modification of the code and not a simple set
of user commands to the system.
The goal of the KMS subsystem DESCRIPTION is to provide an explanation
of a set of data using a minimum number of explanatory factors. Hypotheses are
formed based on an initial set of data. These hypotheses are then used to generate
new questions whose answers can be used to test the hypotheses. This process is
iterated until an acceptable explanation has been found.
The mathematical basis (Stoffel, 1974) of the definition of a minimal explanation is found in the theory of minimal covers of sets. Let $S, S_1, \ldots, S_k$ be sets. A minimal cover of $S$ is defined to be a subset $\{S_{i_1}, \ldots, S_{i_m}\}$ of $\{S_1, \ldots, S_k\}$ such that $S \subseteq S_{i_1} \cup \cdots \cup S_{i_m}$ and, for any other subset $\{S_{j_1}, \ldots, S_{j_n}\}$ of $\{S_1, \ldots, S_k\}$, $\bigcup_{l=1}^{n} S_{j_l} \supseteq S$ implies $n \geq m$. In our application, each $S_i$ may be viewed as an explanation of the elements of $S_i$, and we are interested in explaining all the elements of $S$ using the minimum number of explanations. The goal of the
description section of KMS is to determine such minimal explanations, though the minimality of the cover produced by KMS has not been proven. The worst-case running time of existing minimal cover algorithms is exponential, though experience with KMS has indicated that actual running times are short.
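The text does not give the algorithm KMS actually uses, so the following sketch shows only the covering idea itself, using a greedy approximation: repeatedly pick the candidate explanation that accounts for the most still-unexplained findings. The data and names are hypothetical.

# Greedy approximation to the minimal-cover problem described above.
# Exact minimal cover is worst-case exponential; the heuristic below is
# not guaranteed minimal and serves only to illustrate the idea.

def greedy_cover(S, candidates):
    """S: set of findings to explain; candidates: dict name -> set explained."""
    uncovered = set(S)
    chosen = []
    while uncovered:
        # Pick the explanation covering the most still-unexplained findings.
        best = max(candidates, key=lambda name: len(candidates[name] & uncovered))
        if not candidates[best] & uncovered:
            break                       # remaining findings cannot be explained
        chosen.append(best)
        uncovered -= candidates[best]
    return chosen, uncovered

findings = {'f1', 'f2', 'f3', 'f4'}
explanations = {'d1': {'f1', 'f2'}, 'd2': {'f2', 'f3', 'f4'}, 'd3': {'f4'}}
print(greedy_cover(findings, explanations))   # (['d2', 'd1'], set())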

3.5. Map guided analysis


The desirability of map guided image interpretation is now recognized by many
researchers in image processing, though there is considerable variation in the
approaches taken. In this section we discuss an approach (Tenenbaum, 1978)
employing a map to rapidly localize search so that object-specific operators may
be applied. A similar approach is given in (Stockman, 1978).
The authors cite three major aids provided by the use of a map in image
interpretation. First, the map can be used to select regions for examination. In
many cases, accurate locations of objects of interest are in the map and may be
used to localize search sufficiently to make the use of template matching inexpen-
sive. Second, the map can provide information constraining the types of objects
which can occur in parts of the image, thus reducing the number of object-specific
operators which must be applied. Finally, the map can aid in the interpretation of
detected objects by the use of context.
The detection of reservoir boundaries, roads, boxcars and ships was among the tasks studied. The map contained ground coordinates and elevation data for landmarks (coastlines, roads, etc.) as well as for all sites to be examined for object detection. A global and a local registration are performed; the approximate image position of each site of interest is then located using the map.
The operations used for object detection are highly object specific. Conse-
quently feature extraction can sometimes be done to high accuracy. As an
example we describe a procedure for determining depths of a reservoir. The map
contains the coordinates of the reservoir as well as the coordinates of curves
normal to the contour lines. A model of the image intensity along lines perpendic-
ular to the land-water interface was determined. First, the image is registered to the map. Then the image intensity along the normal curves, starting from deep in the reservoir interior, is used in conjunction with the edge model to determine points on the land-water boundary. Using the map-image correspondence, the altitude of each of these boundary points is determined, and these altitudes are averaged to give an average altitude for the reservoir. Using the map, reservoir boundaries for several altitudes near the average altitude are computed and compared with the observed image reservoir boundary, and the best fitting computed boundary is selected. From this boundary the reservoir depth and volume can be computed.
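The last two steps of this procedure can be sketched as follows. The averaging of boundary-point altitudes is taken directly from the description above; the fit score (mean distance to the nearest contour point) and all data values are assumptions made for illustration.

# Illustrative sketch: average the altitudes found at detected land-water
# boundary points, then pick the map contour that best fits the observed
# image boundary.

def estimate_reservoir_altitude(boundary_point_altitudes):
    return sum(boundary_point_altitudes) / len(boundary_point_altitudes)

def best_fitting_contour(observed_boundary, candidate_contours):
    """candidate_contours: dict altitude -> list of (x, y) boundary points."""
    def score(contour):
        total = 0.0
        for ox, oy in observed_boundary:
            total += min(((ox - cx) ** 2 + (oy - cy) ** 2) ** 0.5
                         for cx, cy in contour)
        return total / len(observed_boundary)
    return min(candidate_contours, key=lambda alt: score(candidate_contours[alt]))

altitudes = [121.0, 122.5, 121.8, 122.1]
print(estimate_reservoir_altitude(altitudes))          # 121.85
observed = [(0, 0), (1, 0), (2, 1)]
contours = {121.0: [(0, 0), (1, 0), (2, 0)],
            122.0: [(0, 0), (1, 0), (2, 1)]}
print(best_fitting_contour(observed, contours))        # 122.0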

The approach taken by Tenenbaum et al. (1978) and Stockman (1978) appears to be extremely promising, for it attempts to reduce computation by using the map to severely limit the class of structures one is attempting to locate. The usefulness of this approach depends on the algorithm developer's ability to strike a successful balance between the desirability of highly specific operations for detecting certain objects and the desirability of general operators to reduce software development costs.

3.6. A modular vision system


The development of the HEARSAY-II speech understanding system drew
attention to the advantages of recognition systems which were driven by both
data and an underlying model of the possible contents of the data. This approach
offers great flexibility, though it generally requires a complex control structure.
Due to the complexity of the control structures, little is understood about the
theoretical properties of these procedures. As an example of this approach, we
describe a vision system (Levine and Shaheen, 1981) designed to provide segmen-
tation and interpretation of images.
The system is formed from three types of components--a long-term memory, a
short-term memory, and a set of processors. The long-term memory (LTM) is a
relational database describing objects which may appear in an image and rela-
tionships between such objects. Object features together with typical values for
these features and a range of permissible feature values are included in the object
description. In addition, a list of features which distinguish the object from other objects, together with measures of the importance of each such feature, is present.
Constraint relations specifying conditions necessary for objects to coexist are also
contained in the LTM. The LTM also contains CONDITION → ACTION relations which specify procedures to be implemented when specified conditions are met.
The short-term memory (STM) is a relational database containing both the
input data and current interpretation of the input data. A region map indicating
tentative regions is an important relation in the STM. Region features, interpreta-
tions, and spatial relationships are also stored in the STM. The STM also contains
a list of regions in which the order of the regions denotes the sequence in which
the regions should be analyzed.
The third type of structure within the system is the processor. A collection of
processors is available for specialized tasks such as the formation of tentative
interpretations, verification of interpretations, computation of features, and
scheduling of processors. The system begins by segmenting the picture into non-overlapping regions. It then assigns a confidence value to each interpretation of these regions. These operations are performed through processor calls. Neighbor constraints are then used to disambiguate the interpretations.
This system is highly modular and facilitates the rapid addition of new processors, STMs, and LTMs. It appears to provide reasonably good segmentation of color images, though it does not yet incorporate three-dimensional analysis.
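The sketch below suggests the kind of object description the long-term memory might hold and how the short-term memory's tentative region interpretations could be checked against it. It is not the actual Levine and Shaheen schema; every object, feature, range, and constraint shown is a hypothetical example.

# Illustrative sketch of LTM object descriptions (typical values, ranges,
# distinguishing features, coexistence constraints) and an STM check.

long_term_memory = {
    'sky': {
        'features': {'intensity': {'typical': 200, 'range': (150, 255)},
                     'position_y': {'typical': 0.1, 'range': (0.0, 0.4)}},
        'distinguishing': [('position_y', 0.9)],   # (feature, importance)
        'constraints': ['above(grass)'],           # coexistence condition
    },
    'grass': {
        'features': {'hue': {'typical': 0.3, 'range': (0.2, 0.45)}},
        'distinguishing': [('hue', 0.8)],
        'constraints': ['below(sky)'],
    },
}

def feature_consistent(region_features, object_name):
    """Check a tentative region interpretation against the LTM ranges."""
    model = long_term_memory[object_name]['features']
    return all(lo <= region_features.get(f, lo) <= hi
               for f, spec in model.items()
               for lo, hi in [spec['range']])

# Short-term memory holds the region map and current interpretations.
short_term_memory = {'regions': {1: {'intensity': 210, 'position_y': 0.05}}}
print(feature_consistent(short_term_memory['regions'][1], 'sky'))   # True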

4. Relaxation

Relaxation is another structural pattern recognition procedure. Relaxation processes use local contextual information to remove local inconsistencies. Thus, relaxation is a local process; globally consistent interpretations can nevertheless result from its application.
Well-designed relaxation processes obey the 'least commitment' principle (Marr, 1977). The principle of least commitment states that all decisions should be postponed until there is relative certainty that the decision is correct. Making decisions (e.g., classification, segmentation) with insufficient evidence often leads to errors, and hence to wrong or inconsistent interpretations of a scene.
One use of relaxation is the analysis of scenes (Rosenfeld, 1976). The relaxation
process starts with an ambiguous scene description and attempts to disambiguate
the scene using neighbor constraints. The initial scene is represented as a labelled
graph whose nodes correspond to primitive objects in the scene. The arcs in the graph represent the allowed relationships between neighboring primitive objects.
Each node is labelled with the alternative interpretations of the object correspond-
ing to that node. Ambiguity results if there is a node with more than one label.
Disambiguation is achieved by eliminating conflicting labels and thus simplifying
the graph. The conflicting labels are determined using a set of constraint relations
or compatibility functions. Thus, in relaxation the constraint relations provide the
model for the expected input scenes.
The refinement process of applying the constraint relations between each label
at a node and the labels of neighboring nodes changes the label set at the nodes.
The relaxation process is the iterative application of the refinement process.
Under certain conditions this process converges (Hummel and Zucker, 1980).
Relaxation processes are useful only if the initial labelling processes cannot
produce an unambiguous interpretation and there are enough neighborhood
constraints between the objects which can be used to find the incorrect labels. If
the refinement process actually removes the labels, the relaxation process is called
discrete. Thus, discrete relaxation finds the set of possible consistent interpreta-
tions. An example of a discrete relaxation system is the system designed by Waltz (1975), which segmented the objects from the background and from each other in a scene containing polyhedra.
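A minimal sketch of the discrete refinement step is given below: a label is discarded from a node whenever some neighboring node has no label compatible with it, and this is iterated to a fixed point. The toy graph and compatibility predicate are assumptions for illustration only.

# Sketch of discrete relaxation: labels are repeatedly discarded when no
# compatible label survives at some neighbouring node.

def discrete_relaxation(labels, neighbours, compatible):
    """labels: node -> set of candidate labels; neighbours: node -> list of nodes;
    compatible(a, la, b, lb): True if label la at a can coexist with lb at b."""
    changed = True
    while changed:
        changed = False
        for node in labels:
            for label in list(labels[node]):
                for other in neighbours[node]:
                    if not any(compatible(node, label, other, lb)
                               for lb in labels[other]):
                        labels[node].discard(label)   # no support: remove
                        changed = True
                        break
    return labels

# Two adjacent regions; 'water' may not border 'water' in this toy model.
labels = {'A': {'water', 'land'}, 'B': {'water'}}
neighbours = {'A': ['B'], 'B': ['A']}
compatible = lambda a, la, b, lb: not (la == 'water' and lb == 'water')
print(discrete_relaxation(labels, neighbours, compatible))
# {'A': {'land'}, 'B': {'water'}}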

4.1. Probabilistic relaxation


Discrete relaxation processes could not select an 'optimal' interpretation from
the set of consistent interpretations. To provide this capability, probabilistic and
fuzzy relaxation processes were introduced (Rosenfeld, Hummel and Zucker,
1976). In this case each node is labelled with label-weight pairs which represent
the possible interpretations of the corresponding object and an associated 'mea-
sure of belief' or probability of occurrence. The weights must be in the range of
zero to one and, in the probabilistic case, the weights at a node must sum to one.
The refinement process no longer discards inconsistent labels but uses a set of compatibility functions to generate a new set of weights at a node in terms of the old set of weights at the node and its neighbors. The original compatibility
functions were related to correlations. More recent rules using conditional proba-
bility have been formulated (Peleg, 1980).
At the end of the relaxation process, the weights of the interpretations can be
used to select the 'best' interpretation.
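The sketch below shows one commonly cited form of this kind of update, in the spirit of Rosenfeld, Hummel and Zucker (1976): each weight is scaled by the support it receives from neighboring labels through a compatibility function taking values in [-1, 1], then the weights at each node are renormalized to sum to one. The particular support formula, the averaging over neighbors, and the toy data are assumptions made for illustration.

# Sketch of a probabilistic relaxation step with compatibilities in [-1, 1].

def relaxation_step(weights, neighbours, r):
    """weights[i][lab] -> current weight; r(i, la, j, lb) -> compatibility."""
    new = {}
    for i in weights:
        support = {}
        for la in weights[i]:
            q = sum(sum(r(i, la, j, lb) * weights[j][lb] for lb in weights[j])
                    for j in neighbours[i]) / max(len(neighbours[i]), 1)
            support[la] = weights[i][la] * (1.0 + q)
        total = sum(support.values()) or 1.0
        new[i] = {la: s / total for la, s in support.items()}
    return new

weights = {'A': {'edge': 0.5, 'no_edge': 0.5}, 'B': {'edge': 0.9, 'no_edge': 0.1}}
neighbours = {'A': ['B'], 'B': ['A']}
r = lambda i, la, j, lb: 1.0 if la == lb else -1.0   # like labels reinforce
for _ in range(3):
    weights = relaxation_step(weights, neighbours, r)
print({i: {k: round(v, 2) for k, v in w.items()} for i, w in weights.items()})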

4.2. Hierarchical relaxation


Relaxation methods can be extended (Davis and Henderson, 1981) to incorpo-
rate hierarchical object descriptions such as grammars. Hierarchical relaxation
methods generally have a strong bottom-up component, in keeping with the
data-directed philosophy of relaxation. The basic approach in hierarchical relaxa-
tion is to use an ordinary relaxation process to label primitive structures within an
image. A new graph is now formed by assigning to each node in the original
graph a new set of nodes. These nodes correspond to objects which contain the
type of object corresponding to the original node. For example, a node in the
original graph may be a terminal in a grammar. The new nodes assigned to this
might be those nonterminals in the grammar which occur on the left-hand side of
a production rule in which the terminal appears on the right. Relaxation is then
carried out on this new level and below. The process is iterated until a graph with nodes on the highest possible level is obtained. These nodes correspond to the composite objects in the picture.
The details of the hierarchical procedures are quite complex. Whenever a new
set of nodes is generated, some procedure for generating constraints must be
applied. The data structure for the relaxation process must contain pointers to
neighbors on the same level, the level above and the level below. No detailed
analysis of the efficiency of the various hierarchical relaxation procedures has
appeared.

4.3. Experiments with hierarchical relaxation


Davis and Rosenfeld (1978) use hierarchical relaxation, with a context-free grammar providing the constraints, to analyze a simple but noisy waveform. Ostroff et al. (1982) applied HRPS, a more efficient version of this system, to carotid pulse waves; HRPS successfully parsed and analyzed the waveforms. More recently, Davis and
Henderson (1981) have expanded their system to use attributed context-free
grammars and have applied it to recognize segmented airplanes. The airplanes in
their experiments were perfectly segmented in that all edge points of the entire
airplanes were found (i.e., no gaps existed) and the only ambiguity allowed was
that several straight lines were used to represent one edge. It is not clear how well
their system would perform with less constrained data.

4.4. Relaxation summary
Although much success has been claimed for relaxation methods, it is not clear that relaxation would perform better than the top-down methods presented in this chapter. Since relaxation requires all preprocessing to be performed initially, computational costs preclude the use of specialized preprocessing. The inability to use partially analyzed scenes to guide further recognition is a major drawback of relaxation techniques. Davis and Henderson (1981) seem to have come to the same conclusion and suggest embedding relaxation within artificial intelligence search processes.

Acknowledgment

The preparation of this article was supported in part by NSF grant ECS-7822159
to the Laboratory for Pattern Analysis and in part by L.N.K. Corporation,
Silver Spring, MD.

References
Ballard, D. H., Brown, C. M., and Feldman, J. A. (1978). An approach to knowledge-directed image
analysis. In: A. R. Hanson and E. M. Riseman, eds., Computer Vision Systems, 271-281. Academic
Press, New York.
Brayer, J. and Fu, K. (1976). Application of a web grammar model to an ERTS picture. Proc. Third
Internat. Joint Conference on Pattern Recognition, 405-410. Coronado, CA.
Davis, L. and Henderson, T. (1981). Hierarchical constraint processes for shape analysis. IEEE Trans.
Pattern Analysis and Machine Intelligence 3, 265-277.
Davis, L. and Rosenfeld, A. (1978). Cooperative processes for waveform parsing. In: A. Hanson and
E. Riseman, eds., Computer Vision Systems. Academic Press, New York.
Feder, J. (1971). Plex languages. Inform. Sci. 3, 225-241.
Fu, K. (1974). Syntactic Methods in Pattern Recognition. Academic Press, New York.
Fu, K. (1982). Syntactic Pattern Recognition and Applications. Prentice-Hall, Englewood Cliffs, NJ.
Fu, K. and Bhargava, B. (1973). Tree systems for syntactic pattern recognition. IEEE Trans. Comput.
22 (12) 1087-1099.
Fu, K. and Booth, T. (1975). Grammatical inference: introduction and survey--Part I. IEEE Trans.
Systems Man Cybernet. 5 (1) 95-111.
Fu, K. and Booth, T. (1975). Grammatical inference: introduction and survey--Part II. IEEE Trans.
Systems Man Cybernet. 5 (4) 409-423.
Gonzalez, R. C. and Thomason, M. G. (1978). Syntactic Pattern Recognition--An Introduction.
Addison-Wesley, Reading, MA.
Harary, F. (1969). Graph Theory, Addison-Wesley, Reading, MA.
Hummel, R. and Zucker, S. (1980). On the foundations of relaxation labeling processes. Computer
Vision and Graphics Laboratory, Tech. Rept. TR-80-7. McGill University, Montreal.
Kanal, L. (1974). Patterns in pattern recognition: 1968-1974. IEEE Trans. Inform. Theory 20 (6)
697-722.
Kanal, L. and Kumar, V. (1981). Parallel implementations of a structural analysis algorithm. Proc.
Conf. Pattern Recognition and Image Processing, 452-458. Dallas.
Knuth, D. E. (1968). Semantics of context-free languages. J. Math. Systems Theory 2, 127-142.
Lambird, B. A. and Kanal, L. N. (1982). Syntactic analysis of images. In progress, Dept. of Computer
Science, University of Maryland, College Park, MD.
Lu, S. and Fu, K. (1978). A syntactic approach to texture analysis. Comput. Graphics Image Process. 7
(3) 303-330.
Levine, M. D. and Shaheen, S. I. (1981). A modular computer vision system for picture segmentation
and interpretation. IEEE Trans. Pattern Analysis and Machine Intelligence 3, 540-556.
Marr, D. (1977). Artificial intelligence--A personal view. Artificial Intelligence 9 (1) 37-48.
382 L. N. Kanal, B. A. Lambird and D. Lavine

Nilsson, N. (1980). Principles of Artificial Intelligence. Tioga, Palo Alto, CA, 2nd ed.
Ostroff, B., Lambird, B., Lavine, D. and Kanal, L. (1982). HRPS--Hierarchical relaxation parsing
system. Tech. Note, Lab. for Pattern Analysis, Department of Computer Science, University of
Maryland, College Park, MD.
Reggia, J. A. (1981). Knowledge-based decision support systems: development through KMS. Ph.D.
Thesis, University of Maryland; also Tech. Rept. TR-1121, Comput. Sci. Center, Univ. of
Maryland, College Park, MD.
Rosenfeld, A., Hummel, R. and Zucker, S. (1976). Scene labelling by relaxation operations. IEEE
Trans. Systems Man Cybernet. 6, 420-433.
Shaw, A. (1968). The formal description and parsing of pictures. Stanford Linear Accelerator Center,
Rept. SLAC-84, Stanford University, Stanford, CA.
Shaw, A. (1969). A formal picture description scheme as a basis for picture processing systems.
Inform. Contr. 14, 9-52.
Shaw, A. (1970). Parsing of graph-representable pictures. J. ACM 17, 453-481.
Stockman, G. C. (1977). A problem-reduction approach to the linguistic analysis of waveforms. Tech.
Rept. TR-538, University of Maryland.
Stockman, G. C. (1978). Toward automatic extraction of cartographic features. U.S. Army Engineer
Topographic Laboratory, Rept. No. ETL-0153, Fort Belvoir, VA.
Stockman, G. C. and Kanal, L. N. (1982). A problem reduction approach to the linguistic analysis of
waveforms. IEEE Trans. Pattern Analysis and Machine Intelligence, to appear.
Stockman, G. C., Lambird, B. A., Lavine, D. and Kanal, L. N. (1981). Knowledge-based image
analysis, U.S. Army Engineer Topographic Laboratory, Rept. ETL-0258, Fort Belvoir, VA.
Stoffel, J. C. (1974). A classifier design technique for discrete variable pattern recognition problems.
IEEE Trans. Comput. 23, 428-444.
Tenenbaum, J. M., Fischler, M. A. and Wolf, H. C. (1978). A scene-analysis approach to remote
sensing. Stanford Research Institute Internat. Tech. Note 173.
Waltz, D. (1975). Understanding line drawings of scenes with shadows. In: Winston, ed., The
Psychology of Computer Vision. McGraw-Hill, New York.
Winston, P. H. (1977). Artificial Intelligence. Addison-Wesley, Reading, MA.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 383-397

Image Models

Narendra Ahuja and Azriel Rosenfeld

1. Introduction

In this paper we will be concerned with models of spatial intensity variation in homogeneous images, that is, images which do not exhibit any macrostructure. The lack of macrostructure results when a uniformly structured scene is photographed at a coarse resolution, the consequent dense packing of structure giving rise to image texture. We will use the terms image and texture interchangeably.
Traditionally, image models have been classified as statistical or structural [22,
48, 54]. Statistical models involve description of image statistics such as autocorre-
lation, etc., while the structural approach specifies spatial primitives and place-
ment rules for laying these primitives out in the plane. It should be noted that if
the rules in the structural approach are not statistical, the resulting models will be
too regular to be of interest. If a statistical model cannot reveal the basic structure of an image, it is not powerful enough to be of much use. A somewhat better classification of image models might be as follows:
(a) Pixel based models. These models view individual pixels as primitives of an
image. Specification of the characteristics of the spatial distribution of pixel
properties [22, 42] constitutes the image description.
(b) Region based models. These models conceive of an image as an arrangement
of a set of spatial (sub)patterns according to certain placement rules [54]. Both the
subpatterns and their placement may be defined statistically. The subpatterns
may further be composed of smaller patterns. The objective is still to model a
single texture; the use of regions being to capture the microstructure.
In the following sections we will discuss these two classes of models and review
many of the studies of image modeling conducted through 1978. It should be
emphasized that image modeling is a rapidly evolving field and much further
work is currently in progress.

2. Pixel based models

Pixel based models can be further divided into two classes.



2.1. One-dimensional time series models

Time series analysis [10] has been extensively used [38, 60, 61] to model relationships among gray levels of a given pixel and those preceding it in the scan of a texture. An image is raster scanned to provide a series of gray level fluctuations, which is treated as a stochastic process evolving in 'time'. The future course of the process is presumed to be predictable from information about its past.
Before summarizing the models, we review some of the commonly used notation in time series.
Let

$\cdots, Z_{t-1}, Z_t, Z_{t+1}, \cdots$

be a discrete time series, where $Z_i$ is the random variable $Z$ at time $i$. We denote the series by $[Z]$.
Let $\mu$ be the mean of $[Z]$, called the level of the process.
Let $[\tilde{Z}]$ denote the series of deviations about $\mu$, i.e.,

$\tilde{Z}_i = Z_i - \mu .$

Let $[a]$ be a series of outputs of a white noise source, with mean zero and variance $\sigma_a^2$.
Let $B$ be the 'backward' shift operator such that $B\tilde{Z}_t = \tilde{Z}_{t-1}$; hence $B^m\tilde{Z}_t = \tilde{Z}_{t-m}$; and let $\nabla$ be the 'backward' difference operator such that

$\nabla\tilde{Z}_t = \tilde{Z}_t - \tilde{Z}_{t-1} = (1-B)\tilde{Z}_t ;$

hence $\nabla^m\tilde{Z}_t = (1-B)^m\tilde{Z}_t$.

The dependence of the current value $\tilde{Z}_t$ of the random variable $\tilde{Z}$ on the past values of $\tilde{Z}$ and $a$ is expressed in different ways, giving rise to several different models [38].

Autoregressive model (AR)
In this model the current $Z$-value depends on the previous $p$ $Z$-values, and on the current noise term:

$\tilde{Z}_t = \phi_1\tilde{Z}_{t-1} + \phi_2\tilde{Z}_{t-2} + \cdots + \phi_p\tilde{Z}_{t-p} + a_t .$   (1)

If we let

$\phi_p(B) = 1 - \phi_1 B - \phi_2 B^2 - \cdots - \phi_p B^p ,$

then (1) becomes

$[\phi_p(B)](\tilde{Z}_t) = a_t .$

$[\tilde{Z}]$, as defined above, is known as the autoregressive process of order $p$, and $\phi_p(B)$ as the autoregressive operator of order $p$. The name autoregressive comes from the model's similarity to regression analysis and the fact that the variable is being regressed on previous values of itself.
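A minimal sketch of Eq. (1) follows: it simulates a raster-scan gray-level series from an AR(p) model and then adds the level back. The coefficients, level, and noise variance are assumed values chosen purely for illustration.

# Generate a scan line from an autoregressive model of order p (Eq. (1)),
# then restore the level mu.
import random

def simulate_ar(phi, n, mu=128.0, sigma_a=5.0, seed=0):
    """phi: [phi_1, ..., phi_p]; returns n simulated gray levels."""
    random.seed(seed)
    p = len(phi)
    z_tilde = [0.0] * p                      # deviations about the level mu
    for _ in range(n):
        a_t = random.gauss(0.0, sigma_a)     # white noise term
        z_t = sum(phi[k] * z_tilde[-(k + 1)] for k in range(p)) + a_t
        z_tilde.append(z_t)
    return [mu + z for z in z_tilde[p:]]

row = simulate_ar(phi=[0.6, 0.2], n=8)       # an AR(2) scan line
print([round(v, 1) for v in row])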

Moving average model (MA)


In the above model $\tilde{Z}_{t-1}$ can be eliminated from the expression for $\tilde{Z}_t$ by substituting

$\tilde{Z}_{t-1} = \phi_1\tilde{Z}_{t-2} + \phi_2\tilde{Z}_{t-3} + \cdots + \phi_p\tilde{Z}_{t-p-1} + a_{t-1} .$

This process can be repeated to eventually yield an expression for $\tilde{Z}_t$ as an infinite series in the $a$'s.
The moving average model allows a finite number $q$ of previous $a$-values in the expression for $\tilde{Z}_t$. This explicitly treats the series as being observations on linearly filtered Gaussian noise.
Letting

$\theta_q(B) = 1 - \theta_1 B - \theta_2 B^2 - \cdots - \theta_q B^q ,$

we have

$\tilde{Z}_t = [\theta_q(B)](a_t)$

as the moving average process of order $q$.

Mixed model (ARMA)


To achieve greater flexibility in fitting actual time series, this model includes both the autoregressive and the moving average terms. Thus

$\tilde{Z}_t = \phi_1\tilde{Z}_{t-1} + \phi_2\tilde{Z}_{t-2} + \cdots + \phi_p\tilde{Z}_{t-p} + a_t - \theta_1 a_{t-1} - \theta_2 a_{t-2} - \cdots - \theta_q a_{t-q} ,$   (2)

i.e., $[\phi_p(B)](\tilde{Z}_t) = [\theta_q(B)](a_t)$.


In all the three models just mentioned, the process generating the series is
assumed to be in equilibrium about a constant mean level. Such models are called
stationary models.
There is another class of models called nonstationary models, in which the level
$\mu$ does not remain constant. The series involved may, nevertheless, exhibit
homogeneous behavior when the differences due to level-drift are accounted for.
It can be shown [10] that such a behavior may be represented by a generalized
autoregressive operator.
A time series may exhibit a repetitive pattern. For example, in a raster scanned
image, the segments corresponding to rows will have similar characteristics. A
model can be formulated that incorporates such 'seasonal effects' [38].
All of the time series models discussed above are unilateral, i.e., a pixel depends
only upon the pixels that precede it in a raster scan. Any introduction of bilateral
dependence gives rise to more complex parameter estimation problems [9, 12].
Interestingly, a frequency domain treatment makes parameter estimation in
bilateral representation much easier [13].

2.2. Random field models


These models treat the image as a two-dimensional random field [53, 64]. We
will consider two subclasses.

2.2.1. Global models
Global models treat an entire image as the realization of a random field.
Different image features may be modeled by a random field, and the field may be
specified in different ways. An important model for height fields has been used by
oceanographers [31-33, 49] interested in the patterns formed by waves on the
ocean surface. Longuet-Higgins [31-33] treats the ocean surface as a random field
satisfying the following assumptions:
(a) the wave spectrum contains a single narrow band of frequencies, and
(b) the wave energy is being received from a large number of different sources
whose phases are random.
Considering such a random field, he obtains [32] the statistical distribution of
wave heights and derives relations between the root mean square wave height, the
mean height of the highest p% of the waves, and the most likely height of the
largest wave in a given interval of time.
In subsequent papers [31, 32], Longuet-Higgins obtains a set of statistical
relations among parameters describing (a) a random moving surface [31], and (b)
a Gaussian isotropic surface [32].
Some of his results are:
(1) The probability distribution of the surface elevation, and the magnitude and
orientation of the gradient.
(2) The average number of zero crossings per unit distance along an arbitrarily
placed line transect.
(3) The average contour length per unit area.
(4) The average density of maxima and minima.
(5) The probability distribution of the heights of maxima and minima.
All results are expressed in terms of the two-dimensional energy spectrum up to
a finite order only. The converse of the problem is also studied and solved. That
is, given certain statistical properties of the surface, a convergent sequence of
approximations to the energy spectrum is determined.
The analogy between this work and image processing, and the significance of
the results obtained therein, is obvious. Fortunately the assumptions made are
also acceptable for images.
Panda [47] uses this approach to analyze background regions selected from
Forward Looking InfraRed (FLIR) imagery. He derives expressions for (a)
density of border points and (b) average number of connected components along
a row of the thresholded image. Panda [46] also uses the same model to predict

the properties of images obtained by running several edge operators (based on


differences of average gray levels) on some synthetic pictures with normally
distributed gray levels, and having different correlation coefficients. The images
are assumed to be continuous-valued stationary Gaussian random fields with
continuous parameters. Schachter [57] suggests a version of the above model for
the case of a narrow band spectrum.
Nahi and Jahanshahi [43] suggest modeling the image by background and
foreground statistical processes. The foreground consists of regions corresponding
to the objects in the image. In estimating the boundaries of horizontally convex
objects on background in noisy binary images, Nahi and Jahanshahi assume that
the two kinds of regions in the image are formed by two statistically independent
stationary random processes with known (or estimated) first two moments. The
borders of the regions covered by the different statistical processes are modelled
locally. The end-points of the intercepts of an object on successive rows are
assumed to form a first order Markov process. This model thus also involves local
interactions.
Let
$b(m, n)$ = gray level at the $n$th column of the $m$th row,
$\gamma(m, n)$ = a binary function carrying the boundary information,
$b_b$ = a sample gray level from the background process,
$b_o$ = a sample gray level from the object process, and
$\nu$ = a sample gray level from the noise process.
The model allows us to write

$b(m, n) = \gamma(m, n)\, b_o(m, n) + [1 - \gamma(m, n)]\, b_b(m, n) + \nu(m, n)$

where $\gamma$ incorporates the Markov constraints on the object boundaries.


In a subsequent paper Nahi and Lopez-Mora [44] use a more complex $\gamma$ function. For each row, $\gamma$ either indicates the absence of an object, or provides a vector estimate of the object width and its geometric center in that row. Thus, the two-element vector contains information about the object size and skewness. The vectors corresponding to successive rows are assumed to define a first-order Markov process.
Pratt and Faugeras [50] and Gagalowicz [17] view texture as the output of a
homogeneous spatial filter excited by white noise, not necessarily Gaussian. A
texture is characterized by its mean, the histogram of the input white noise, and
the transfer function of the filter. For a given texture, the model parameters are
obtained as follows:
(a) The mean is readily estimated from the image.
(b) The autocorrelation function is computed to determine the magnitude of
the transfer function.
(c) Higher-order moments are computed to determine the phase of the transfer
function.
Inverse filtering yields the white noise image, and hence its histogram and
probability density. The inverse filtering or decorrelation may be done by simple
operators. For example, for a first order Markov field, decorrelation may be
achieved by using a Laplacian operator [50]. The whitened field estimate of the
independent identically distributed noise process will only identify the spatial
operator in terms of the autocorrelation function, which is not unique. Thus, the
white noise probability density and spatial filter do not, in general, make up a
complete set of descriptors [51]. To generate a texture, the procedure can be
reversed by generating a white noise image having the computed statistics, and
then applying the inverse of the whitening filter.
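The synthesis direction of this model can be sketched as follows: draw a white-noise image with the desired first-order statistics, then pass it through a spatial filter. The 3x3 averaging kernel used below merely stands in for the transfer function that would be estimated from the autocorrelation and higher-order moments of a real texture; image size and parameter values are likewise assumed.

# Generate a correlated texture by spatially filtering a white-noise image.
import random

def white_noise_image(rows, cols, mean=128.0, sigma=40.0, seed=1):
    random.seed(seed)
    return [[random.gauss(mean, sigma) for _ in range(cols)] for _ in range(rows)]

def filter_image(img, kernel):
    rows, cols = len(img), len(img[0])
    k = len(kernel) // 2
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc, wsum = 0.0, 0.0
            for dr in range(-k, k + 1):
                for dc in range(-k, k + 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        w = kernel[dr + k][dc + k]
                        acc += w * img[rr][cc]
                        wsum += w
            out[r][c] = acc / wsum
    return out

noise = white_noise_image(8, 8)
texture = filter_image(noise, [[1, 1, 1], [1, 1, 1], [1, 1, 1]])   # correlated field
print(round(texture[4][4], 1))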
Several authors describe models for the Earth's surface. Freiberger and
Grenander [16] reason that the Earth's surface is too irregular to be represented
by an analytic function having a small number of free parameters. Nevertheless,
landscapes possess strong continuity properties. They suggest using stochastic
processes derived from physical principles. Mandelbrot [35] uses a Poisson-Brown
surface to give a first approximation to the Earth's relief. The Earth's surface is
assumed to have been formed by the superimposition of very many, very small
cliffs along straight faults. The positions of the faults and the heights of the cliffs
are assumed random and independent. The irregularity predicted by the model is
excessive. Mandelbrot suggests that the surface could be made to resemble some
terrain more closely by introducing anisotropy into ridge directions. Mandelbrot's
model is often used in computer graphics to generate artificial terrain scenes.
Adler [2] presents a theoretical treatment of Brownian sheets and relates them to
the rather esoteric mathematical concept of Hausdorff dimension.
Recursive solutions based upon differential (difference) equations are common
in one-dimensional signal processing. This approach has been generalized to two dimensions. Jain [29] investigates the applicability of three kinds of random
fields to the image modeling problem, each characterized by a different class of
partial differential equations (PDE's). A digital shape is defined by a finite
difference approximation of a PDE. The class of hyperbolic PDE's is shown to
provide more general causal models than autoregressive moving average models.
For a given spectral density (or covariance) function, parabolic PDE's can
provide causal, semicausal, or even noncausal representations. Elliptical PDE's
provide noncausal models that represent two-dimensional discrete Markov fields.
They can be used to model both isotropic and nonisotropic imagery. Jain argues
that the PDE model is based upon a well-established mathematical theory.
Furthermore, there exists a considerable body of computer software for numerical
solutions. The PDE model also obviates the need for spectral factorization, thus
eliminating the restriction of separable covariance function. System identification
techniques may be used for choosing a PDE model for a given class of images.
In the absence of any knowledge or assumption about the global process
underlying a given image, models of the joint gray level probability density and
its derivative properties may be used. Among models of the joint density for
pixels in a window, the multivariate normal has been the one most commonly
used because of its tractability. However, it has been found to have limited
applicability. Hunt [25, 26] points out that stationary Gaussian models are based
upon an oversimplification. Consider the vector F of the picture points obtained

by concatenating them as in a raster scan. Let $R_F$ be the covariance matrix of the gray levels in $F$. Then according to the Gaussian assumption, the probability density function is given by

$f(F) = K \exp\left[-\tfrac{1}{2}(F - \bar{F})^{\mathrm T} R_F^{-1} (F - \bar{F})\right]$

where $\bar{F}$ is a constant mean vector, and $K$ is a normalizing constant. This means that each point in the image has the same ensemble statistics. Images, however, seldom have a bell-shaped histogram. Hunt [25] proposes a nonstationary Gaussian model which differs from the stationary model only in that the mean vector $\bar{F}$ has unequal components. He demonstrates the appropriateness of this
model by subtracting the local ensemble average from each point and showing
that the resulting image fits a stationary Gaussian model.
Trussel and Kruger [62] claim that the Laplacian density function is better
suited for modelling high-pass filtered imagery than the Gaussian model. Never-
theless, they contend that the basic assumptions which allow the Gaussian model
to be used for image restoration purposes are still valid under a Laplacian model.
Matheron [37] uses the change in pixel properties as a function of distance to
model a random field. He uses the term "regionalized variables" to emphasize the
features of pixels whose complex mutual correlation reflects the structure of the
underlying phenomenon. He assumes weak stationarity of the gray level incre-
ments between pixels. The second moment of the increments for pixels at an
arbitrary distance, called the variogram, is the basic analytic tool. Huijbregts [24]
discusses several properties of the variogram and relates them to the structure of
the regionalized variables. For nonhomogeneous fields having spatially varying
mean, the variogram of the residuals with respect to the local means is used.
Angel and Jain [8] use the diffusion equation to model the spread of values
around a given point. Thus a given image is viewed as a blurred version of some
original image.
A characterization similar to the variogram is given by the autocorrelation
function. For restoration, images have often been modeled as two-dimensional
wide-sense stationary random fields with a given mean and autocorrelation. The following general expression has been suggested for the autocorrelation function:

$R(\tau_1, \tau_2) = \sigma^2 \rho\left[-\alpha_1|\tau_1| - \alpha_2|\tau_2|\right]$

which is stationary and separable. Specifically, the exponential autocorrelation function ($\rho = e$) has been found to be useful for a variety of pictorial data [15, 18, 23, 27, 30].
Another autocorrelation model often cited as being more realistic is

$R(\tau_1, \tau_2) = \rho^{(\tau_1^2 + \tau_2^2)^{1/2}}$

which is isotropic but not separable.
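A small numerical sketch of the two models follows, with the first written in its exponential form ($\rho = e$) and with all parameter values assumed for illustration. The separable model factors into a product of one-dimensional terms; the isotropic model depends only on the Euclidean distance between pixels.

# Evaluate the separable exponential and the isotropic autocorrelation
# models with assumed parameters.
import math

def r_separable(t1, t2, sigma2=1.0, a1=0.5, a2=0.5):
    return sigma2 * math.exp(-a1 * abs(t1) - a2 * abs(t2))

def r_isotropic(t1, t2, rho=0.6):
    return rho ** math.sqrt(t1 * t1 + t2 * t2)

# Separability: R(1, 2) equals R(1, 0) * R(0, 2).
print(round(r_separable(1, 2), 3), round(r_separable(1, 0) * r_separable(0, 2), 3))
# Isotropy: R(1, 2) equals R(2, 1).
print(round(r_isotropic(1, 2), 3), round(r_isotropic(2, 1), 3))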



2.2.2. Local models


A simplification that may be introduced to reduce the problems involved in
specifying the joint probability function or its properties for the entire image is to
assume that not all points in an image are simultaneously constrained but that
this is only true of small neighborhoods of pixels. The resulting models are called
local models. Predetermined formalisms are used to describe the gray level
relationships in a neighborhood. The modeling process consists of choosing a
formalism and evaluating its parameters.
Read and Jayaramamurthy [52] and McCormick and Jayaramamurthy [39]
make use of switching theory techniques to describe local gray level patterns by
minimal functions. Suppose that each pixel can take one out of Ng gray levels.
Then a given neighborhood of n pixels from an image can be represented by a
point in an n × Ng-dimensional space. If many such neighborhoods from a given
texture are considered, then they are likely to provide a cluster of points in the
above space. The differences in the local characteristics of different textures are
expected to result in different clusters. The set covering theory of Michalski and
McCormick [40] (which is a generalization of the minimization machinery of
switching theory already available) is used to describe the sets of points in each
cluster. These maximal descriptions allow for coverage of empty spaces in and
around clusters. The samples do not have to be exhaustive but only have to be
large enough to provide a reasonable representation.
Haralick et al. [20] confine the local descriptions to neighborhoods of size two.
They identify a texture by the gray-level cooccurrence frequencies at neighboring
pixels, which are the first estimates of the corresponding joint probabilities. They
use several different features, all derived from the cooccurrence matrix, for
texture classification.
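The cooccurrence idea can be sketched briefly: count how often gray level i occurs next to gray level j for a fixed displacement, normalize the counts, and derive features from the resulting matrix. The tiny image is illustrative; energy and contrast are two commonly used cooccurrence features, shown here only as examples of the kind of feature derived from the matrix.

# Gray-level cooccurrence matrix for a fixed displacement, plus two
# example features derived from it.

def cooccurrence(image, drow, dcol, levels):
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            rr, cc = r + drow, c + dcol
            if 0 <= rr < rows and 0 <= cc < cols:
                counts[image[r][c]][image[rr][cc]] += 1
    total = sum(map(sum, counts)) or 1
    return [[v / total for v in row] for row in counts]

def energy(P):
    return sum(p * p for row in P for p in row)

def contrast(P):
    return sum((i - j) ** 2 * P[i][j] for i in range(len(P)) for j in range(len(P)))

img = [[0, 0, 1, 1],
       [0, 0, 1, 1],
       [0, 2, 2, 2],
       [2, 2, 3, 3]]
P = cooccurrence(img, drow=0, dcol=1, levels=4)    # horizontal neighbours
print(round(energy(P), 3), round(contrast(P), 3))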
Most of the local models, however, use conditional properties of pixels within a
window, instead of their joint probability distributions as in the local models
discussed above. We will now discuss such Markov models that make a pixel
depend upon its neighbors.
Time series analysis for the one-dimensional models discussed earlier can also
be used to capture part of the two-dimensional dependence, without getting into
the analytical problems arising from a bilateral representation. Tou et al. [60]
have done this by making a point depend on the points in the quadrant above it
and to its left. For such a case the autoregressive process of order (q, p) is

$\tilde{Z}_{ij} = \phi_{01}\tilde{Z}_{i,j-1} + \phi_{10}\tilde{Z}_{i-1,j} + \phi_{11}\tilde{Z}_{i-1,j-1} + \cdots + \phi_{qp}\tilde{Z}_{i-q,j-p} + a_{ij} ;$

the moving average process of order (q, p) is

$\tilde{Z}_{ij} = a_{ij} - \theta_{01}a_{i,j-1} - \theta_{10}a_{i-1,j} - \theta_{11}a_{i-1,j-1} - \cdots - \theta_{qp}a_{i-q,j-p} ;$

and the two-dimensional mixed autoregressive/moving average process is

$\tilde{Z}_{ij} = \phi_{01}\tilde{Z}_{i,j-1} + \phi_{10}\tilde{Z}_{i-1,j} + \phi_{11}\tilde{Z}_{i-1,j-1} + \cdots + \phi_{qp}\tilde{Z}_{i-q,j-p} + a_{ij} - \theta_{01}a_{i,j-1} - \theta_{10}a_{i-1,j} - \theta_{11}a_{i-1,j-1} - \cdots - \theta_{qp}a_{i-q,j-p} .$

The model, in general, gives a nonseparable autocorrelation function. If the


coefficients of the process satisfy the condition

$\phi_{mn} = \phi_{m0}\,\phi_{0n} ,$

then the process becomes a multiplicative process in which the influence of rows and columns on the autocorrelation is separable. Thus

$\rho_{ij} = \rho_{i0}\,\rho_{0j} .$

Tou et al. consider fitting a model to a given texture. The choice among the
autoregressive, moving average and mixed models, as well as the choice of the
order of the process, is made by comparing the behavior of some observed
statistical property, e.g., the autocorrelation function, with that predicted by each
of the different models. The values of the model parameters are determined so as
to minimize, say, the least square error in fit. In a subsequent paper Tou and
Chang [61] use the maximum likelihood principle to optimize the values of the
parameters, in order to obtain a refinement of the preliminary model as suggested
by the autocorrelation function.
A bilateral dependence in two dimensions is more complex than in the one-dimensional case discussed earlier. Once again, a simpler, unilateral model
may be obtained by making a point depend on the points in the rows above it, as
well as on those to its left in its own row. Whittle [63] gives the following reasons
in recommending working with the original two-dimensional model:
(1) The dependence on a finite number of lattice neighbors, for example a finite
autoregression in two dimensions, may not always have a unilateral representa-
tion that is also a finite autoregression.
(2) The real usefulness of the unilateral representation is that it suggests a
simplifying change of parameters. For most two-dimensional models, however,
the appropriate transformation, even if evident, is so complicated that nothing is
gained by performing it. It may be pointed out that frequency domain analysis
for parameter estimation [13] may prove useful here too.
Two-dimensional Markov random fields have been investigated for represent-
ing textures. A wide sense Markov field representation aims at obtaining linear
dependence of a pixel property, say its gray level, on the gray levels of certain
other pixels so as to minimize the mean square error between the actual and the
estimated values. This requires that the error terms of various pixels be uncorre-
lated random variables. A strict sense Markov field representation involves
specification of the probability distribution of the gray level given the gray levels
of certain other pixels. Although processes of both these types have been
investigated, more experimental work has been done on the former.
Woods [65] shows that the strict sense Markov field differs from a wide sense
field only in that the error variables in the former have a specific correlation
structure, whereas the errors in the latter are uncorrelated. He points out
restrictions on the strict sense Markov field representation under which it yields a

model for non-Markovian processes. The condition under which a general non-
causal Markov dependence reduces to a causal one is also specified.
Abend et al. [1] introduce Markov meshes to model dependence of a pixel on a
certain immediate neighborhood. Using Markov chain methods on the sequences
of pixels from various neighborhoods, they show that in many cases a causal
dependence translates into a noncausal dependence. For example, the dependence
of a pixel on its west, northwest and north neighbors translates into dependence
on all eight neighbors. Interestingly, the causal neighborhood that results in a
4-neighbor noncausal dependence is not known in their formulation, although in
the Gauss Markov formulation of Woods [65] such an explicit dependence is
allowed. In this sense Woods' definition of a Markov field is more general than
the Markov meshes of Abend et al. [1].
Hassner and Sklansky [21] also discuss a Markov random field model for
images. They present an algorithm to generate a texture from an initial random
configuration. The Markov random field is characterized by a set of independent
parameters that specify a consistent collection of nearest neighbor conditional
probabilities.
Deguchi and Morishita [14] use a noncausal model for the dependence of a
pixel on its neighborhood. The coefficients of linear dependence are determined
by minimizing the mean square estimation error. The resulting two-dimensional
estimator characterizes the texture. They use such a characterization for classifica-
tion and for segmentation of images consisting of more than one textural region.
Jain and Angel [27] use a 4-neighbor autoregression to model a given, not
necessarily separable, autocorrelation function. They obtain values of the autore-
gression coefficients in terms of the autocorrelation function. However, their
representation involves error terms that are uncorrelated with each other and with
the non-noisy pixel gray level values. As pointed out by Panda and Kak [45], the
two assumptions about the error terms are incompatible for Markov random
fields [65]. Jain and Angel [27] point out that a 4-neighbor Markov dependence
can represent a large number of physical processes such as steady state diffusion,
random walks, birth and death processes, etc. They also propose 8-neighbor [27]
and 5-neighbor (the 8 neighbors excluding the northeast, east, and southeast
neighbors) [27, 28] models.
Wong [64] discusses characterization of second order random fields (having
finite first and second moments) from the point of view of their possible use in
representing images. He considers various properties of a two-dimensional ran-
dom field and their implications in terms of its second-order properties. Some of
the results he obtains are as follows:
(1) There is no continuous Gaussian random field of two dimensions (or higher
dimensions) which is both homogeneous and Markov (degree 1).
(2) If the covariance function is invariant under translation as well as rotation,
then it can only depend upon the Euclidian distance. The second-order properties
of such fields (Wong calls them homogeneous) are characterizable in terms of a
single one-dimensional spectral distribution.

Wong generalizes his notion of homogeneity to include random fields that are
not homogeneous, but can be easily transformed into homogeneous fields. Even
this generalized class of fields is no more complicated than a one-dimensional
stationary process.
Lu and Fu [34] identify repetitive subpatterns in some highly regular textures
from Brodatz [11] and design a local descriptor of the subpattern in an enumer-
ative way by generating each of the pixels in the window individually. The
subpattern description is done by specifying a grammar whose productions
generate a window in several steps. For example, starting from the top left corner
rows may be generated by a series of productions, while other productions will
generate individual pixels within the rows. The grammar used may also be
stochastic.

3. Region based models

These models use the notion of a structural primitive. Both the shapes of the
primitives and the rules to generate the textures from the primitives may be
specified statistically.
Matheron [36] and Serra and Verchery [58] propose a model that views a binary
texture as produced by a set of translations of a structural element. All locations
of the structural elements such that the entire element lies within the foreground
of the texture are identified. Note that there may be (narrow) regions which
cannot be covered by any placement of the structural element, as all possible
arrangements of the element that cover a given region may not lie completely
within the foreground. Thus only an 'eroded' version of the image can be spanned
by the structural element which is used as the representation of the original
image. Textural properties can be obtained by appropriately parameterizing the
structure element. For a structural element consisting of two pixels at distance d,
the eroded image represents the autocovariance function of the original image at
distance d. More complicated structural elements would provide a generalized
autocovariance function which has more structural information. Matheron and
Serra show how the generalized covariance function can be used to obtain various
texture features.
Zucker [67] views a real texture as a distortion of an ideal texture. The ideal
texture is a spatial layout of cellular primitives along a regular or semiregular
tessellation. Randomness is introduced by distorting the primitives using certain
transformations.
Yokoyama and Haralick [66] describe a growth process to synthesize textures.
Their method consists of the following steps:
(a) Mark some of the pixels in a clean image as seeds.
(b) The seeds grow into curves called skeletons.
(c) The skeletons thicken to become regions.

(d) The pixels in the regions thus obtained are transformed into gray levels in
the desired range.
(e) A probabilistic transformation is applied, if desired, to modify the gray level
cooccurrence probability in the final image.
The distribution processes in (a) and the growth processes in (b) and (c) can be
deterministic or random. The dependence of the properties of the images gener-
ated on the nature of the underlying operations is not obtained.
A class of models called mosaic models, based upon random, planar pattern
generation processes, have been considered by Ahuja [3-6], Ahuja and Rosenfeld
[7] and Schachter, Davis, and Rosenfeld [56]. Schachter and Ahuja [55] describe a
set of random processes that produce a variety of piecewise uniform random
planar patterns having regions of different shapes and relative placements. These
patterns are analyzed for various geometrical and topological properties of the
components, and for the pixel correlation properties in terms of the model
parameters [3-6]. Given an image and various feature values measured on it, the
relations obtained above are used to select the appropriate model.
The syntactic model of Lu and Fu [34] discussed earlier can also be interpreted
as a region based model, if the subpattern windows are viewed as the primitive
regions. Although the model used by Nahi and Jahanshahi [43], and Nahi and
Lopez-Mora [44], discussed earlier, is pixel based, the function $\gamma$ carries informa-
tion about the borders of various regions. Thus, under the constraint that all
regions except the background are convex, the model can also be interpreted as a
region based model.

4. Discussion

Region based models can act as pixel based models. For the case of images on
grids this is easy to see. Consider a subpattern that consists of a single pixel. The
region shapes are thus trivially specified. It is obvious that the region characteris-
tics and their relative placement rules can be designed so as to mimic the pixel
and joint pixel properties of a pixel based model, since both have control over the
same set of primitives and can incorporate the same types of interaction. On the
other hand if we are dealing with images that are structured, i.e. that have planar
clusters of pixels such that pixels within a cluster are related in a different way
than pixels across clusters, then we must make such a provision in the model
definition. Such a facility is unavailable in pixel based models, whereas the use of
regions as primitives serves exactly this purpose. The pixel based models are
acceptable for images where there are no well-defined spatial regional primitives.
Region based models appear to be more appropriate for representing many
natural textures, which do usually consist of regions.
Many texture studies are basically technique oriented and describe texture
feature detection and classification schemes which are not based upon any
underlying model of the texture. We do not discuss these here; see [19, 41, 59] for
several examples and more references.

Acknowledgment

The support of the U.S. Air Force Office of Scientific Research under Grant
AFOSR-77-3271 to the University of Maryland and of Joint Services Electronics
Program (U. S. Army, Navy and Air Force) under Contract N00014-79-C-0424 to
the University of Illinois is gratefully acknowledged as is the help of Kathryn
Riley and Chris Jewell in preparing this paper.

References

[1] Abend, K., Harley, T. J. and Kanal, L. N. (1965). Classification of binary random patterns.
IEEE Trans. Inform. Theory 11, 538-544.
[2] Adler, R. J. (1978). Some erratic patterns generated by the planar Wiener process. Suppl. Adv.
Appl. Probab. 10, 22-27.
[3] Ahuja, N. (1979). Mosaic models for image analysis and synthesis. Ph.D. dissertation, Depart-
ment of Computer Science, University of Maryland, College Park, MD.
[4] Ahuja, N. (1981). Mosaic models for images, I: Geometric properties of components in cell
structure mosaics. Inform. Sci. 23, 69-104.
[5] Ahuja, N. (1981). Mosaic models for images, II: Geometric properties of components in
coverage mosaics. Inform. Sci. 23, 159-200.
[6] Ahuja, N. (1981). Mosaic models for images, III: Spatial correlation in mosaics. Inform. Sci. 24,
43-69.
[7] Ahuja, N. and Rosenfeld, A. (1981), Mosaic models for textures. IEEE Trans. Pattern Analysis,
Machine Intelligence 3, 1-11.
[8] Angel, E. and Jain, A. K. (1978). Frame-to-frame restoration of diffusion images. IEEE Trans.
Automat. Control 23, 850-855.
[9] Bartlett, M. S. (1975). The Statistical Analysis of Spatial Pattern. Wiley, New York.
[10] Box, G. E. P. and Jenkins, G. M. (1976). Time Series Analysis. Holden-Day, San Francisco.
[11] Brodatz, P. (1966). Textures: A Photographic Album for Artists and Designers. Dover, New York.
[12] Brook, D. (1964). On the distinction between the conditional probability and joint probability
approaches in the specification of nearest-neighbor systems. Biometrika 51, 481-483.
[13] Chellappa, R. and Ahuja, N. (1979). Statistical inference theory applied to image modeling.
Tech. Rept. TR-745, Department of Computer Science, University of Maryland, College Park,
MD.
[14] Deguchi, K. and Morishita, I. (1976). Texture characterization and texture-based image parti-
tioning using two-dimensional linear estimation techniques. U.S.-Japan Cooperative Science
Program Seminar on Image Processing in Remote Sensing, Washington, DC.
[15] Franks, L. E. (1966). A model for the random video process. Bell System Tech. J. 45, 609-630.
[16] Freiberger, W. and Grenander, U. (1976). Surface patterns in theoretical geography. Report 41,
Department of Applied Mathematics, Brown University, Providence, RI.
[17] Gagalowicz, A. (1978). Analysis of texture using a stochastic model. Proc. Fourth Internat. Joint
Conf. Pattern Recognition, 541-544.
[18] Habibi, A. (1972). Two dimensional Bayesian estimate of images. Proc. IEEE 60, 878-883.
[19] Haralick, R. M. (1978) Statistical and structural approaches to texture. Proc. Fourth Internat.
Joint Conf. Pattern Recognition, 45-69.
[20] Haralick, R. M., Shanmugam, K. and Dinstein, I. (1973). Textural features for image classifica-
tion. IEEE Trans. Systems Men Cybernet. 3, 610-621.
[21] Hassner, M. and Sklansky, J. (1978). Markov random field models of digitized image texture.
Proc. Fourth Internat. Joint Conf. Pattern Recognition, 538-540.
[22] Hawkins, J. K. (1970). Textural properties for pattern recognition. In: B. S. Lipkin and A.
Rosenfeld, eds., Picture Processing and Psychopictorics, 347-370. Academic Press, New York.
[23] Huang, T. S. (1965). The subjective effect of two-dimensional pictorial noise. IEEE Trans.
Inform. Theory 11, 43-53.
[24] Huijbregts, C. (1975). Regionalized variables and quantitative analysis of spatial data. In: J.
Davis and M. McCullagh, eds., Display and Analysis of Spatial Data, 38-51. Wiley, New York.
[25] Hunt, B. R. (1977). Bayesian methods in nonlinear digital image restoration. IEEE Trans.
Comput. 26, 219-229.
[26] Hunt, B. R. and Cannon, T. M. (1976). Nonstationary assumptions for Gaussian models of
images. IEEE Trans. Systems Man Cybernet. 6, 876-882.
[27] Jain, A. K. and Angel, E. (1974). Image restoration, modelling, and reduction of dimensionality.
IEEE Trans. Comput. 23, 470-476.
[28] Jain, A. K. (1977). A semi-causal model for recursive filtering of two-dimensional images. IEEE
Trans. Comput. 26, 343-350.
[29] Jain, A. K. (1977). Partial differential equations and finite-difference methods in image
processing, Part 1: Image representation. J. Optim. Theory Appl. 23, 65-91.
[30] Kretzmer, E. R. (1952). Statistics of television signals. Bell System Tech. J. 31, 751-763.
[31] Longuet-Higgins, M. S. (1957). The statistical analysis of a random moving surface. Phil. Trans.
Roy. Soc. London Ser. A 249, 321-387.
[32] Longuet-Higgins, M. S. (1957). Statistical properties of an isotropic random surface. Phil. Trans.
Roy. Soc. London Ser. A 250, 151-171.
[33] Longuet-Higgins, M. S. (1952). On the statistical distribution of the heights of sea waves. J.
Marine Res. 11, 245-266.
[34] Lu, S. Y. and Fu, K. S. (1978). A syntactic approach to texture analysis. Comput. Graphics
Image Process. 7, 303-330.
[35] Mandelbrot, B. (1977). Fractals: Form, Chance, and Dimension. Freeman, San Francisco.
[36] Matheron, G. (1967). Elements pour une Theorie des Milieux Poreux. Masson, Paris.
[37] Matheron, G. (1971). The theory of regionalized variables and its applications. Les Cahiers du
Centre de Morphologie Math. de Fontainbleau 5.
[38] McCormick, B. H. and Jayaramamurthy, S. N. (1974). Time series model for texture synthesis.
Internat. J. Comput. Inform. Sci. 3, 329-343.
[39] McCormick, B. H. and Jayaramamurthy, S. N. (1975). A decision theory method for the analysis
of texture. Internat. J. Comput. Inform. Sci. 4, 1-38.
[40] Michalski, R. S. and McCormick, B. H. (1971). Interval generalization of switching theory. Proc.
Third Annual Houston Conf. Computing System Science, 213-226. Houston, TX.
[41] Mitchell, O. R., Myers, C. R. and Boyne, W. (1977). A max-min measure for image texture
analysis. IEEE Trans. Comput. 26, 408-414.
[42] Muerle, J. L. (1970). Some thoughts on texture discrimination by computer. In: B. S. Lipkin and
A. Rosenfeld, eds., Picture Processing and Psychopictorics, 371--379. Academic Press, New York.
[43] Nahi, N. E. and Jahanshahi, M. H. (1977). Image boundary estimation. IEEE Trans. Comput.
26, 772-781.
[44] Nahi, N. E. and Lopez-Mora, S. (1978). Estimation detection of object boundaries in noisy
images. IEEE Trans. Automat. Control 23, 834-845.
[45] Panda, D. P. and Kak, A. C. (1977). Recursive least squares smoothing of noise in images. IEEE
Trans. Acoust. Speech Signal Process. 25, 520-524.
[46] Panda, D. P. and Dubitzki, T. (1979). Statistical analysis of some edge operators. Comput.
Graphics Image Process. 9, 313-348.
[47] Panda, D. P. (1978). Statistical properties of thresholded images. Comput. Graphics Image
Process. 8, 334-354.
[48] Pickett, R. M. (1970). Visual analysis of texture in the detection and recognition of objects. In:
B. S. Lipkin and A. Rosenfeld, eds., Picture Processing and Psychopictorics, 289-308. Academic
Press, New York.
[49] Pierson, W. J. (1952). A unified mathematical theory for the analysis, propagation, and
refraction of storm generated surface waves. Department of Meteorology, New York University,
New York.
[50] Pratt, W. K. and Faugeras, O. D. (1978). Development and evaluation of stochastic-based visual
textures features. Proc. Fourth Internat. Joint Conf. Pattern Recognition, 545-548.
[51] Pratt, W. K., Faugeras, O. D., and Gagalowicz, A. (1978). Visual discrimination of stochastic
texture fields. IEEE Trans. Systems Man Cybernet. 8, 796-804.
[52] Read, J. S. and Jayaramamurthy, S. N. (1972). Automatic generation of texture feature
detectors. IEEE Trans. Comput. 21, 803-812.
[53] Rosenfeld, A. and Kak, A. C. (1976). Digital Picture Processing. Academic Press, New York.
[54] Rosenfeld, A. and Lipkin, B. S. (1970). Texture synthesis. In: B. S. Lipkin and A. Rosenfeld,
eds., Picture Processing and Psychopictorics, 309-322. Academic Press, New York.
[55] Schachter, B. and Ahuja, N. (1979). Random pattern generation processes. Comput. Graphics
Image Process. 10, 95-114.
[56] Schachter, B. J., Davis, L. S. and Rosenfeld, A. (1978). Random mosaic models for textures.
IEEE Trans. Systems Man Cybernet. 8, 694-702.
[57] Schachter, B. J. (1980). Long crested wave models for Gaussian fields. Comput. Graphics Image
Process. 12, 187-201.
[58] Serra, J. and Verchery, G. (1973). Mathematical morphology applied to fibre composite
materials. Film Science and Technology 6, 141-158.
[59] Thompson, W. B. (1977). Texture boundary analysis. IEEE Trans. Comput. 26, 272-276.
[60] Tou, J. T. and Chang, Y. S. (1976). An approach to texture pattern analysis and recognition.
Proc. IEEE Conf. Decision and Control, 398-403.
[61] Tou, J. T., Kao, D. B., and Chang, Y. S. (1976). Pictorial texture analysis and synthesis. Proc.
Third Internat. Joint Conf. Pattern Recognition.
[62] Trussel, H. J. and Kruger, R. P. (1978). Comments on 'nonstationary' assumption for Gaussian
models in images. IEEE Trans. Systems Man Cybernet. 8, 579-582.
[63] Whittle, P. (1954). On stationary processes in the plane. Biometrika 41, 434-449.
[64] Wong, E. (1968). Two-dimensional random fields and representations of images. SIAM J. Appl.
Math. 16, 756-770.
[65] Woods, J. W. (1972). Two-dimensional discrete Markovian fields. IEEE Trans. Inform. Theory
18, 232-240.
[66] Yokoyama, R. and Haralick, R. M. (1978). Texture synthesis using a growth model. Comput.
Graphics Image Process. 8, 369-381.
[67] Zucker, S. (1976). Toward a model of texture. Comput. Graphics Image Process. 5, 190-202.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
North-Holland Publishing Company (1982) 399-415

Image Texture Survey*

Robert M. Haralick

1. Introduction

Texture is an important characteristic for the analysis of many types of images.
It can be seen in all images from multi-spectral scanner images obtained from
aircraft or satellite platforms (which the remote sensing community analyzes) to
microscopic images of cell cultures or tissue samples (which the bio-medical
community analyzes). Despite its importance and ubiquity in image data, a formal
approach or precise definition of texture does not exist. The texture discrimina-
tion techniques are, for the most part, ad-hoc. In this paper we survey, unify, and
generalize some of the extraction techniques and models which investigators have
been using to measure textural properties.
The image texture we consider is non-figurative and cellular. We think of this
kind of texture as an organized area phenomenon. When it is decomposable, it
has two basic dimensions on which it may be described. The first dimension is for
describing the primitives out of which the image texture is composed, and the
second dimension is for the description of the spatial dependence or interaction
between the primitives of an image texture. The first dimension is concerned with
tonal primitives or local properties, and the second dimension is concerned with
the spatial organization of the tonal primitives.
Tonal primitives are regions with tonal properties. The tonal primitive can be
described in terms such as the average tone, or maximum and minimum tone of
its region. The region is a maximally connected set of pixels having a given tonal
property. The tonal region can be evaluated in terms of its area and shape. The
tonal primitive includes both its gray tone and tonal region properties.
An image texture is described by the number and types of its primitives and the
spatial organization or layout of its primitives. The spatial organization may be
random, may have a pairwise dependence of one primitive on a neighboring
primitive, or may have a dependence of n primitives at a time. The dependence
may be structural, probabilistic, or functional (like a linear dependence).
To characterize texture, we must characterize the tonal primitive properties as
well as characterize the spatial inter-relationships between them. This implies that

*A full text of this paper by the same author appeared in Proc. IEEE 67 (5) (1979) 786-804, under
the title "Statistical and Structural Approaches to Texture." Reprinted with permission. @1979 IEEE.

texture-tone is really a two-layered structure, the first layer having to do with
specifying the local properties which manifest themselves in tonal primitives and
the second layer having to do with specifying the organization among the tonal
primitives. We, therefore, would expect that methods designed to characterize
texture would have parts devoted to analyzing each of these aspects of texture. In
the review of the work done to date, we will discover that each of the existing
methods tends to emphasize one or the other aspect and tends not to treat each
aspect equally.

2. Review of the literature on texture models

There have been eight statistical approaches to the measurement and char-
acterization of image texture: autocorrelation functions, optical transforms, dig-
ital transforms, textural edgeness, structural elements, spatial gray tone co-
occurrence probabilities, gray tone run lengths, and auto-regressive models. An
early review of some of these approaches is given by Hawkins (1970). The first
three of these approaches are related in that they all measure spatial frequency
directly or indirectly. Spatial frequency is related to texture because fine textures
are rich in high spatial frequencies while coarse textures are rich in low spatial
frequencies.
An alternative to viewing texture as spatial frequency distribution is to view
texture as amount of edge per unit area. Coarse textures have a small number of
edges per unit area. Fine textures have a high number of edges per unit area.
The structural element approach of Serra (1974) and Matheron (1967) uses a
matching procedure to detect the spatial regularity of shapes called structural
elements in a binary image. When the structural elements themselves are single
resolution cells, the information provided by this approach is the autocorrelation
function of the binary image. By using larger and more complex shapes, a more
generalized autocorrelation can be computed.
The gray tone spatial dependence approach characterizes texture by the co-
occurrence of its gray tones. Coarse textures are those for which the distribution
changes only slightly with distance and fine textures are those for which the
distribution changes rapidly with distance.
The gray level run length approach characterizes coarse textures as having
many pixels in a constant gray tone run and fine textures as having few pixels in a
constant gray tone run.
The auto-regressive model is a way to use linear estimates of a pixel's gray tone
given the gray tones in a neighborhood containing it in order to characterize
texture. For coarse textures, the coefficients will all be similar. For fine textures,
the coefficients will have wide variation.
The power of the spatial frequency approach to texture is the familiarity we
have with these concepts. However, one of the inherent problems is in regard to
gray tone calibration of the image. The procedures are not invariant under even a
linear translation of gray tone. To compensate for this, probability quantizing can
be employed. But the price paid for the invariance of the quantized images under
monotonic gray tone transformations is the resulting loss of gray tone precision in
the quantized image. Weszka, Dyer and Rosenfeld (1976) compare the effective-
ness of some of these techniques for terrain classification. They conclude that
spatial frequency approaches perform significantly more poorly than the other ap-
proaches.
The power of the structural element approach is that it emphasizes the shape
aspects of the tonal primitives. Its weakness is that it can only do so for binary
images.
The power of the co-occurrence approach is that it characterizes the spatial
inter-relationships of the gray tones in a textural pattern and can do so in a way
that is invariant under monotonic gray tone transformations. Its weakness is that
it does not capture the shape aspects of the tonal primitives. Hence, it is not likely
to work well for textures composed of large-area primitives.
The power of the auto-regressive linear estimator approach is that it is easy to
use the estimator in a mode which synthesizes textures from any initially given
linear estimator. In this sense, the auto-regressive approach is sufficient to capture
everything about a texture. Its weakness is that the textures it can characterize are
likely to consist mostly of micro-textures.

2.1. The autocorrelation function and texture


From one point of view, texture relates to the spatial size of the tonal primitives
on an image. Tonal primitives of larger size are indicative of coarser textures;
tonal primitives of smaller size are indicative of finer textures. The autocorrela-
tion function is a feature which tells about the size of the tonal primitives.
We describe the autocorrelation function with the help of a thought experi-
ment. Consider two image transparencies which are exact copies of one another.
Overlay one transparency on top of the other and with a uniform source of light,
measure the average light transmitted through the double transparency. Now
translate one transparency relative to the other and measure only the average light
transmitted through the portion of the image where one transparency overlaps the
other. A graph of these measurements as a function of the (x, y) translated
positions and normalized with respect to the (0,0) translation depicts the two-
dimensional autocorrelation function of the image transparency.
Let I(u, v) denote the transmission of an image transparency at position (u, v).
We assume that outside some bounded rectangular region $0 \le u \le L_x$ and
$0 \le v \le L_y$ the image transmission is zero. Let (x, y) denote the x-translation
and y-translation, respectively. The autocorrelation function for the image
transparency is formally defined by

$$
\rho(x, y) = \frac{\dfrac{1}{(L_x - |x|)(L_y - |y|)} \displaystyle\iint I(u, v)\, I(u + x,\, v + y)\, du\, dv}
{\dfrac{1}{L_x L_y} \displaystyle\iint I^2(u, v)\, du\, dv},
\qquad |x| < L_x,\ |y| < L_y.
$$


If the tonal primitives on the image are relatively large, then the autocorrelation
will drop off slowly with distance. If the tonal primitives are small, then the
autocorrelation will drop off quickly with distance. To the extent that the tonal
primitives are spatially periodic, the autocorrelation function will drop off and
rise again in a periodic manner. The relationship between the autocorrelation
function and the power spectral density function is well known: they are Fourier
transforms of one another (Yaglom, 1962).
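The thought experiment has a direct digital analogue. The sketch below, assuming a NumPy array as the digital image and the axis conventions chosen here purely for illustration, estimates the normalized autocorrelation over a small range of shifts; it is a brute-force illustration rather than an efficient (FFT-based) implementation.

```python
import numpy as np

def autocorrelation(image, max_shift=16):
    """Discrete analogue of the normalized autocorrelation rho(x, y)."""
    I = np.asarray(image, dtype=float)
    Lx, Ly = I.shape  # axis 0 treated as u (x), axis 1 as v (y)
    denom = (I ** 2).sum() / (Lx * Ly)
    rho = np.zeros((2 * max_shift + 1, 2 * max_shift + 1))
    for x in range(-max_shift, max_shift + 1):
        for y in range(-max_shift, max_shift + 1):
            # overlapping parts of the image and its (x, y)-translated copy
            a = I[max(0, x):Lx + min(0, x), max(0, y):Ly + min(0, y)]
            b = I[max(0, -x):Lx + min(0, -x), max(0, -y):Ly + min(0, -y)]
            num = (a * b).sum() / ((Lx - abs(x)) * (Ly - abs(y)))
            rho[x + max_shift, y + max_shift] = num / denom
    return rho

# Demo on a random image; rho(0, 0) equals 1 by construction.
rng = np.random.default_rng(1)
img = rng.random((64, 64))
print(autocorrelation(img, max_shift=4)[4, 4])
```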
The tonal primitive in the autocorrelation model is the gray tone. The spatial
organization is characterized by the correlation coefficient which is a measure of
the linear dependence one pixel has on another.
An experiment was carried out by Kaizer (1955) to see if the autocorrelation
function had any relationship to the texture which photointerpreters see in
images. He used a series of seven aerial photographs of an Arctic region and
determined the autocorrelation function of the images with a spatial correlator
which worked in a manner similar to the one envisioned in our thought experi-
ment. Kaizer assumed the autocorrelation function was circularly symmetric and
computed in only as a function of radial distance. Then for each image, he found
the distance d such that the autocorrelation function p at d took the value
l / e ; p(d) = 1/e.
Kaizer then asked 20 subjects to rank the seven images on a scale from fine
detail to coarse detail. He correlated the rankings with the distances correspond-
ing to the (1/e)th value of the autocorrelation function. He found a correlation
coefficient of 0.99. This established that at least for his data set, the autocorrela-
tion function and the subjects were measuring the same kind of textural features.
Kaizer noticed, however, that even though there was a high degree of correla-
tion between ρ⁻¹(1/e) and subject rankings, some subjects put first what ρ⁻¹(1/e)
put fifth. Upon further investigation, he discovered that a relatively flat back-
ground (indicative of high frequency or fine texture) can be interpreted as a fine
textured or coarse textured area. This phenomenon is not unusual and actually
points out a fundamental characteristic of texture: it cannot be analyzed without
a reference frame of tonal primitive being stated or implied. For any smooth gray
tone surface there exists a scale such that when the surface is examined, it has no
texture. Then as resolution increases, it takes on a fine texture and then a coarse
texture. In Kaizer's situation, the resolution of his spatial correlator was not good
enough to pick up the fine texture which some of his subjects did in an area which
had a weak but fine texture.

2.2. Orthogonal transformations


Spatial frequency characteristics of two-dimensional images can be expressed
by the autocorrelation function or by the power spectra of those images. Both
may be calculated digitally and/or implemented in a real-time optical system.
Lendaris and Stanley (1969, 1970) used optical techniques to perform texture
analysis on a data base of low altitude photographs. They illuminated small
circular sections of those images and used the Fraunhofer diffraction pattern to
generate features for identifying photographic regions. The major discriminations
of concern to these investigators were those of man-made versus natural scenes.
The man-made category was further subdivided into roads, intersections of roads,
buildings and orchards.
Feature vectors extracted from these diffraction patterns consisted of forty
components. Twenty of the components were mean energy levels in concentric
annular rings of the diffraction pattern and the other twenty components were
mean energy levels in 9°-wedges of the diffraction pattern. Greater than 90%
classification accuracy was reported using this technique.
Cutrona, Leith, Palermo and Porcello (1969) present a review of optical
processing methods for computing the Fourier transform. Goodman (1968),
Preston (1972) and Shulman (1970) also present in their books comprehensive
reviews of Fourier optics. Swanlund (1971) discusses the hardware specifications
for a system using optical techniques to perform texture analysis.
Gramenopoulos (1973) used a digital Fourier transform technique to analyze
aerial images. He examined subimages of 32 × 32 pixels and determined that for
an ERTS image over Phoenix, spatial frequencies between 3.5 and 5.9 cycles/km
contained most of the information required to discriminate among terrain types.
An overall classification accuracy of 87% was achieved using image categories of
clouds, water, desert, farms, mountain, urban, river bed and cloud shadows.
Horning and Smith (1973) used a similar approach to interpret aerial multispec-
tral scanner imagery.
Bajcsy (1972, 1973) and Bajcsy and Lieberman (1974, 1976) computed the
two-dimensional power spectra of a matrix of square image windows. They
expressed the power spectrum in a polar coordinate system of radius (r) versus
angle (a). As expected, they determined that directional textures tend to have
peaks in the power spectrum along a line orthogonal to the principal direction of
the texture. Blob-like textures tend to have peaks in the power spectrum at radii
(r) comparable to the sizes of the blobs. This work also shows that texture
gradients can be measured by determining the trends of relative maxima of radii
(r) and angles (a) as a function of the position of the image window whose power
spectrum is being analyzed. For example, as the power peaks along the radial
direction tend to shift towards larger values of r, the image surface becomes more
finely textured. In general, features based on Fourier power spectra have been
shown to perform more poorly than features based on second order gray level
co-occurrence statistics (Haralick, Shanmugam, and Dinstein, 1973) or those
based on first order statistics of gray level differences (Weszka, Dyer and
Rosenfeld, 1976). Presence of aperture effects has been hypothesized to account
for part of the unfavorable performance by Fourier features compared to space-
domain gray level statistics (Dyer and Rosenfeld, 1976), although experimental
results indicate that this effect, if present, is minimal.
Transforms other than the Fourier Transform can be used for texture analysis.
Kirvida (1976) compared the fast Fourier, Hadamard and Slant transforms for
textural features on aerial images of Minnesota. Five classes (i.e., hardwood trees,
conifers, open space, city and water) were studied using 8 × 8 subimages. A 74%
correct classification rate was obtained using only spectral information. This rate
increased to 98.5% when textural information was also included in the analysis.
These researchers reported no significant difference in the classification accuracy
as a function of which transform was employed.
Pratt (1978) and Pratt, Faugeras and Gagalowicz (1978) suggest measuring
texture by the coefficients of the linear filter required to decorrelate an image and
by the first four moments of the gray level distribution of the decorrelated image.
They have shown promising preliminary results.
The linear dependence which one image pixel has on another is well known and
can be measured by the autocorrelation function. This linear dependence is
exploited by the autoregression texture characterization and synthesis model
developed by McCormick and Jayaramamurthy (1974) to synthesize textures.
McCormick and Jayaramamurthy used the Box and Jenkins (1970) time series
seasonal analysis method to estimate the parameters of a given texture. These
estimated parameters and a given set of starting values were then used to
illustrate that the synthesized texture was close in appearance to the given texture.
Deguchi and Morishita (1978), Tou, Kao and Chang (1976) and Tou and Chang
(1976) used similar techniques.
The autoregressive model for texture synthesis begins with a randomly gener-
ated noise image. Then, given any sequence of K synthesized gray level values in
its immediately past neighborhood, the next gray level value can be synthesized as
a linear combination of those values plus a linear combination of the previous L
random noise values. The coefficients of these linear combinations are the
parameters of the model. Texture analysis work based on this model requires the
identification of these coefficient values from a given texture image.
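A minimal sketch of such a synthesis loop is given below, assuming a causal raster-scan neighborhood of the left and upper pixels and illustrative (not estimated) coefficient values; it is not the seasonal-analysis estimation procedure of McCormick and Jayaramamurthy, and the parameter names are hypothetical.

```python
import numpy as np

def synthesize_ar_texture(h=64, w=64, phi=(0.5, 0.4), theta=(0.3,), noise_sd=1.0, seed=0):
    """Sketch of causal autoregressive(-moving-average) texture synthesis.

    phi weights the previously synthesized neighbours (here: left and upper
    pixels); theta weights previous noise values (here: the noise at the left
    neighbour). All coefficient values are illustrative, not estimated.
    """
    rng = np.random.default_rng(seed)
    eps = rng.normal(0.0, noise_sd, size=(h, w))   # the driving noise image
    img = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            left = img[i, j - 1] if j > 0 else 0.0
            up = img[i - 1, j] if i > 0 else 0.0
            left_eps = eps[i, j - 1] if j > 0 else 0.0
            img[i, j] = phi[0] * left + phi[1] * up + theta[0] * left_eps + eps[i, j]
    return img

tex = synthesize_ar_texture()
print(tex.shape, round(tex.std(), 2))
```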

2.3. Gray tone co-occurrence


Textural features can also be calculated from a gray level spatial co-occurrence
matrix. The co-occurrence P(i, j) of gray tones i and j for an image I is defined as
the number of pairs of neighboring resolution cells (pixels) having gray levels i
and j, respectively. The co-occurrence matrix can be normalized by dividing each
entry by the sum of all of the entries in the matrix. Conditional probability
matrices can also be used for textural feature extraction with the advantage that
these matrices are not affected by changes in the gray level histogram of an image,
only by changes in the topological relationships of gray levels within the image.
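For concreteness, a minimal sketch of the co-occurrence computation follows, assuming an image already quantized to a small number of gray levels and a single displacement vector; the contrast statistic at the end is just one commonly used feature of such a matrix, and the function and parameter names are hypothetical.

```python
import numpy as np

def cooccurrence_matrix(image, dr=0, dc=1, levels=8, normalize=True, symmetric=True):
    """Gray level co-occurrence matrix for the displacement (dr, dc)."""
    I = np.asarray(image)
    P = np.zeros((levels, levels), dtype=float)
    rows, cols = I.shape
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                P[I[r, c], I[r2, c2]] += 1
                if symmetric:
                    P[I[r2, c2], I[r, c]] += 1
    if normalize and P.sum() > 0:
        P /= P.sum()           # relative frequencies of gray level pairs
    return P

rng = np.random.default_rng(2)
img = rng.integers(0, 8, size=(32, 32))     # an 8-level test image
P = cooccurrence_matrix(img, dr=0, dc=1)    # horizontally adjacent pairs
contrast = sum(P[i, j] * (i - j) ** 2 for i in range(8) for j in range(8))
print(round(contrast, 3))
```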
Apparently Julesz (1962) was the first to use co-occurrence statistics in visual
human texture discrimination experiments. Darling and Joseph (1968) used
statistics obtained from nearest-neighbor gray-level transition probability matrices
to measure texture using spatial intensity dependence in satellite images taken of
clouds. Bartels and Wied (1975), Bartels, Bahr and Wied (1969) and Wied, Bahr
and Bartels (1970) used one-dimensional co-occurrence statistics for the analysis
of cervical cells. Rosenfeld and Troy (1970), Haralick (1971) and Haralick,
Shanmugam and Dinstein (1973) suggested the use of spatial co-occurrence for
arbitrary distances and directions. Galloway (1975) used gray level run length
statistics to measure texture. These statistics are computable from co-occurrence
assuming that the image is generated by a Markov process. Chen and Pavlidis
(1978) used the co-occurrence matrix in conjunction with a split and merge
algorithm to segment an image at textural boundaries. Tou and Chang (1977)
used statistics from the co-occurrence matrix, followed by a principal components
eigenvector dimensionality reduction scheme (Karhunen-Loève expansion) to
reduce the dimensionality of the classification problems.
Statistics which Haralick, Shanmugam and Dinstein (1973) computed from such
co-occurrence matrices have been used to analyze textures in satellite images
(Haralick and Shanmugam, 1974). An 89% classification accuracy was obtained.
Additional applications of this technique include the analysis of microscopic
images (Haralick and Shanmugam, 1973), pulmonary radiographs (Chien and Fu,
1974) and cervical cell, leukocyte and lymph node tissue section images
(Pressman, 1976a, 1976b).
Haralick (1975) illustrates a way to use co-occurrence matrices to generate an
image in which the value at each resolution cell is a measure of the texture in the
resolution cell's neighborhood. All of these studies produced reasonable results on
different textures. Conners and Harlow (1976) concluded that this spatial gray
level dependence technique is more powerful than spatial frequency (power
spectra), gray level difference (gradient) and gray level run length methods
(Galloway, 1975) of texture quantitation.

2.4. Mathematical morphology


A structural element and filtering approach to texture analysis of binary images
was proposed by Matheron (1967) and Serra and Verchery (1973). This approach
requires the definition of a structural element (i.e., a set of pixels constituting a
specific shape such as a line or square) and the generation of binary images which
result from the translation of the structural element through the image and the
erosion of the image by the structural element. The textural features can be
obtained from the new binary images by counting the number of pixels having the
value 1. This mathematical morphology approach of Serra and Matheron is the
basis of the Leitz Texture Analyser (LTA) (Muller and Hunn, 1974; Muller, 1974;
Serra, 1974). A broad spectrum of applications has been found for this quantita-
tive analysis of microstructures method in materials science and biology.
Watson (1975) summarizes this approach to texture analysis and we now give a
precise description. Let H, a subset of resolution cells, be the structural element.
We define the translate of H by row-column coordinates (r, c) as H(r, c), where

$$H(r, c) = \{(i, j) \mid i = r + r' \text{ and } j = c + c' \text{ for some } (r', c') \in H\}.$$

Then the erosion of F by the structural element H, written $F \ominus H$, is defined as

$$F \ominus H = \{(m, n) \mid H(m, n) \subseteq F\}.$$

The eroded image J obtained by eroding F with structural element H is a binary
image where pixels take the value 1 for all resolution cells in $F \ominus H$. Textural
properties can be obtained from the erosion process by appropriately parameterizing
the structural element H and determining the number of elements of
the erosion as a function of the parameter's value. Theoretical properties of the
erosion operator as well as other operators are presented by Matheron (1975),
Serra (1978) and Lantuejoul (1978). The importance of this approach to texture
analysis is that properties obtained by the application of operators in mathemati-
cal morphology can be related to physical properties of the materials imaged.
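A minimal sketch of the erosion operation on a binary array follows, with the structural element given as a list of (row, column) offsets; counting the surviving pixels for a two-pixel element at increasing distances gives the autocovariance-like curve mentioned in the text. This is an illustrative brute-force version under those assumptions, not the Leitz Texture Analyser procedure.

```python
import numpy as np

def erode(F, H):
    """Erosion of binary image F by structural element H (list of (dr, dc) offsets).

    A pixel (m, n) survives iff every translate (m + dr, n + dc) of H lies in
    the foreground of F; translates falling outside the image are treated as
    background, so border pixels are eroded away.
    """
    F = np.asarray(F, dtype=bool)
    rows, cols = F.shape
    out = np.zeros_like(F)
    for m in range(rows):
        for n in range(cols):
            out[m, n] = all(
                0 <= m + dr < rows and 0 <= n + dc < cols and F[m + dr, n + dc]
                for dr, dc in H
            )
    return out

rng = np.random.default_rng(3)
F = rng.random((64, 64)) > 0.3              # a random binary "texture"
for d in (1, 2, 4):
    H = [(0, 0), (0, d)]                    # two-pixel element at distance d
    # the count for a two-pixel element relates to the binary autocovariance
    # at distance d, as described in the text
    print(d, int(erode(F, H).sum()))
```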

2.5. Gradient analysis


Rosenfeld and Troy (1970) and Rosenfeld and Thurston (1971) regard texture
in terms of the amount of 'edge' per unit image area. An edge can be detected by
a variety of local mathematical operators which essentially measure some property
related to the gradient of the image intensity function. Rosenfeld and Thurston
used the Roberts gradient and then computed, as a measure of texture for any
image window, the average value of the Roberts gradient taken over all of the
pixels in the window. Sutton and Hall (1972) extend this concept by measuring
the gradient as a function of the distance between pixels. An 80% classification
accuracy was achieved by applying this textural measure in a pulmonary disease
identification experiment.
Related approaches include Triendl (1972), who smoothes the image using 3 × 3
neighborhoods, then applies a 3 × 3 digital Laplacian operator, and finally smoothes
the image with an 11 × 11 window. The resulting texture parameters obtained
from the frequency filtered image can be used as a discriminatory textural feature.
Hsu (1977) determines edgeness by computing variance-like measures for the
intensities in a neighborhood of pixels. He suggests the deviation of the intensities
in a pixel's neighborhood from both the intensity of the central pixel and from the
average intensity of the neighborhood. The histogram of a gradient image was
used to generate textural parameters by Landeweerd and Gelsema (1978) to
measure texture properties in the nuclei of leukocytes. Rosenfeld (1975) generates
an image whose intensity is proportional to the edge per unit area of the original
image. This transformed image is then further processed by gradient transforma-
tions prior to textural feature extraction.
Mosaic texture models, for example, tessellate a picture into regions and assign
a gray level to each region according to a specified probability density function
(Schachter, Rosenfeld and Davis, 1978). Among the kinds of mosaic models are the
Occupancy Model (Miles, 1970), Johnson-Mehl Model (Gilbert, 1962), Poisson
Line Model (Miles, 1969) and Bombing Model (Switzer, 1967). The mosaic
texture models seem readily adaptable to numerical analysis and their properties
seem amenable to mathematical analysis.

3. Structural approaches to texture models

Pure structural models of texture presume that textures consist of primitives
which appear in quasi-periodic spatial arrangements. Descriptions of these primi-
tives and their placement rules can be used to describe textures (Rosenfeld and
Lipkin, 1970). The identification and location of a particular primitive in an
image may be probabilistically related to the identification and distribution of
primitives in its neighborhood.
Carlucci (1972) suggests a texture model using primitives of line segments, open
polygons and closed polygons in which the placement rules are given syntactically
in a graph-like language. Zucker (1976a, 1976b) conceives of a real texture to be
the distortion of an ideal texture. Zucker's model, however, is more of a
competence-based model than a performance model. Lu and Fu (1978) and Tsai
and Fu (1978) use a syntactic approach to texture.
In the remainder of this section, we discuss some structural-statistical ap-
proaches to texture models. The approach is structural in the sense that primitives
are explicitly defined. The approach is statistical in that the spatial interaction, or
lack of it, between primitives is measured by probabilities.
We classify textures as being weak textures or strong textures. Weak textures
are those which have weak spatial interaction between primitives. To distinguish
between them it may be sufficient to only determine the frequency with which the
variety of primitive kinds occur in some local neighborhood. Hence, weak texture
measures account for many of the statistical textural features. Strong textures are
those which have non-random spatial interactions. To distinguish between them it
may be sufficient to only determine, for each pair of primitives, the frequency
with which the primitives co-occur in a specified spatial relationship. Thus, our
discussion will center on the variety of ways in which primitives can be defined
and the ways in which spatial relationships between primitives can be defined.

3.1. Primitives
A primitive is a connected set of resolution cells characterized by a list of
attributes. The simplest primitive is the pixel with its gray tone attribute.
Sometimes it is useful to work with primitives which are maximally connected sets
of resolution cells having a particular property. An example of such a primitive is
a maximally connected set of pixels all having the same gray tone or all having
the same edge direction.
Gray tones and local properties are not the only attributes which primitives
may have. Other attributes include measures of shape of connected region and
homogeneity of its local property. For example, a connected set of resolution cells
can be associated with the length or elongation of its shape, or with the variance
of its local property.

3.2. Spatial relationships


Once the primitives have been constructed, we have available a list of primi-
tives, their center coordinates, and their attributes. We might also have available
some topological information about the primitives, such as which are adjacent to
which. From this data, we can select a simple spatial relationship such as
adjacency of primitives or nearness of primitives and count how many primitives
of each kind occur in the specified spatial relationship.
More complex spatial relationships include closest distance or closest distance
within an angular window. In this case, for each kind of primitive situated in the
texture, we could lay expanding circles around it and locate the shortest distance
between it and every other kind of primitive. In this case our co-occurrence
frequency is three-dimensional, two dimensions for primitive kind and one
dimension for shortest distance. This can be dimensionally reduced to two
dimensions by considering only the shortest distance between each pair of like
primitives.

3.3. Weak texture measures

Tsuji and Tomita (1973) and Tomita, Yachida, and Tsuji (1973) describe a
structural approach to weak texture measures. First a scene is segmented into
atomic regions based on some tonal property such as constant gray tone. These
regions are the primitives. Associated with each primitive is a list of properties
such as size and shape. Then they make a histogram of size property or shape
property over all primitives in the scene. If the scene can be decomposed into two
or more regions of homogeneous texture, the histogram will be multi-modal. If
this is the case, each primitive in the scene can be tagged with the mode in the
histogram to which it belongs. A region growing/cleaning process on the tagged
primitives yields the homogeneous textural region segmentation.
If the initial histogram modes overlap too much, a complete segmentation may
not result. In this case, the entire process can be repeated with each of the then so
far found homogeneous texture region segments. If each of the homogeneous
texture regions consists of mixtures of more than one type of primitive, then the
procedure may not work at all. In this case, the technique of co-occurrence of
primitive properties would have to be used.
Zucker, Rosenfeld and Davis (1975) used a form of this technique by filtering a
scene with a spot detector. Non-maxima pixels on the filtered scene were thrown
out. If a scene has many different homogeneous texture regions, the histogram of
the relative max spot detector filtered scene will be multi-modal. Tagging the
maxima with the modes they belong to and region growing/cleaning thus
produced the segmented scene.
The idea of the constant gray level regions of Tsuji and Tomita or the spots of
Zucker, Rosenfeld, and Davis can be generalized to regions which are peaks, pits,
ridges, ravines, hillsides, passes, breaks, flats and slopes (Toriwaki and Fukumura,
1978; Peucker and Douglas, 1975). In fact, the possibilities are numerous
enough that investigators doing experiments will have a long working period
before they are exhausted. The next three subsections
review in greater detail some specific approaches and suggest some generaliza-
tions.

3.3.1. Edge per unit area


Rosenfeld and Troy (1970) and Rosenfeld and Thurston (1971) suggested the
amount of edge per unit area for a texture measure. The primitive here is the pixel
and its property is the magnitude of its gradient. The gradient can be calculated
by any one of the gradient neighborhood operators. For some specified window
centered on a given pixel, the distribution of gradient magnitudes can then be
determined. The mean of this distribution is the amount of edge per unit area
associated with the given pixel. The image in which each pixel's value is edge per
unit area is actually a defocussed gradient image. Triendl (1972) used a de-
focussed Laplacian image. Sutton and Hall (1972) used such a measure for the
automatic classification of pulmonary disease in chest X-rays.
Ohlander (1975) used such a measure to aid him in segmenting textured scenes.
Rosenfeld (1975) gives an example where the computation of gradient direction
on a defocussed gradient image is an appropriate feature for the direction of
texture gradient. Hsu (1977) used a variety of gradient-like measures.
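A sketch of the edge-per-unit-area measure follows, assuming a Roberts-style gradient magnitude (sum of absolute diagonal differences) averaged over a square window; the window size, the magnitude approximation, and the function name are illustrative choices.

```python
import numpy as np

def edge_per_unit_area(image, window=7):
    """Mean Roberts gradient magnitude over a window centred on each pixel."""
    I = np.asarray(image, dtype=float)
    # Roberts gradient: differences along the two diagonals of each 2x2 cell
    g1 = I[1:, 1:] - I[:-1, :-1]
    g2 = I[1:, :-1] - I[:-1, 1:]
    grad = np.abs(g1) + np.abs(g2)           # a common magnitude approximation
    h, w = grad.shape
    half = window // 2
    out = np.zeros_like(grad)
    for r in range(h):
        for c in range(w):
            win = grad[max(0, r - half):r + half + 1, max(0, c - half):c + half + 1]
            out[r, c] = win.mean()            # the "defocussed" gradient image
    return out

rng = np.random.default_rng(4)
fine = rng.random((32, 32))                              # fine texture: many edges
coarse = np.kron(rng.random((8, 8)), np.ones((4, 4)))    # coarse texture: 4x4 blocks
print(round(edge_per_unit_area(fine).mean(), 3),
      round(edge_per_unit_area(coarse).mean(), 3))
```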

3.3.2. Run lengths


The gray level run length primitive in its one-dimensional form is a maximal
collinear connected set of pixels all having the same gray level. Properties of the
primitive include the length of the run, its gray level, and the angular orientation of the run.
Statistics of these properties were used by Galloway (1975) to distinguish between
textures.
In the two-dimensional form, the gray level run length primitive is a maximally
connected set of pixels all having the same gray level. These maximal homoge-
neous sets have properties such as number of pixels, maximum or minimum
diameter, gray level, and angular orientation of the maximum or minimum diameter.
Maleson et al. (1977) have done some work related to maximal homogeneous sets
and weak textures.
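The following sketch extracts horizontal runs and two standard run-length statistics (short-run and long-run emphasis); the normalization shown (dividing by the number of runs) is one common convention, not necessarily Galloway's exact formulation, and the names are hypothetical.

```python
import numpy as np

def run_length_statistics(image):
    """Horizontal gray level run lengths and two simple run-length features."""
    I = np.asarray(image)
    runs = []                                 # (gray level, run length) pairs
    for row in I:
        start = 0
        for j in range(1, len(row) + 1):
            if j == len(row) or row[j] != row[start]:
                runs.append((int(row[start]), j - start))
                start = j
    n = len(runs)
    sre = sum(1.0 / (length ** 2) for _, length in runs) / n   # short-run emphasis
    lre = sum(float(length ** 2) for _, length in runs) / n    # long-run emphasis
    return runs, sre, lre

rng = np.random.default_rng(5)
fine = rng.integers(0, 4, size=(16, 16))                         # short runs dominate
coarse = np.repeat(rng.integers(0, 4, size=(16, 4)), 4, axis=1)  # longer runs
print(round(run_length_statistics(fine)[2], 2),
      round(run_length_statistics(coarse)[2], 2))
```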

3.3.3. Relative extrema density


Rosenfeld and Troy (1970) suggest the number of extrema per unit area for a
texture measure. They define extrema in a purely local manner allowing plateaus
to be considered extrema. Ledley (1972) also suggests computing the number of
extrema per unit area as a texture measure.
Mitchell, Myers and Boyne (1977) suggest the extrema idea of Rosenfeld and
Troy except they proposed to use true extrema and to operate on a smoothed
image to eliminate extrema due to noise. See also the work by Carlton and
Mitchell (1977) and Ehrich and Foith (1976, 1978).
One problem with simply counting all extrema in the same extrema plateau as
extrema is that extrema per unit area is then not sensitive to the difference between
a region having a few large plateaus of extrema and a region having many
single-pixel extrema. The solution to this problem is to count an extrema plateau
only once. This can be achieved by locating some central pixel in the extrema
plateau and marking it as the extremum associated with the plateau. Another way
of achieving this is to associate a value 1/N with every extremum in an N-pixel
extrema plateau.
In the one-dimensional case there are two properties that can be associated
with every extremum: its height and its width. The height of a maximum can be
defined as the difference between the value of the maximum and the highest
adjacent minimum. The height (depth) of a minimum can be defined as the
difference between the value of the minimum and the lowest adjacent maximum.
The width of a maximum is the distance between its two adjacent minima. The
width of a minimum is the distance between its two adjacent maxima.
Two-dimensional extrema are more complicated than one-dimensional extrema.
One way of finding extrema in the full two-dimensional sense is by the iterated
use of some recursive neighborhood operators propagating extrema values in an
appropriate way. Maximally connected areas of relative extrema may be areas of
single pixels or may be plateaus of many pixels. We can mark each pixel in a
relative extrema region of size N with the value h indicating that it is part of a
relative extremum having height h, or mark it with the value h/N indicating its
contribution to the relative extrema area. Alternatively, we can mark the most
centrally located pixel in the relative extrema region with the value h. Pixels not
marked can be given the value 0. Then for any specified window centered on a
given pixel, we can add up the values of all pixels in the window. This sum
divided by the window size is the average height of extrema in the area.
Alternatively we could set h to 1 and the sum would be the number of relative
extrema per unit area to be associated with the given pixel.
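A sketch of the plateau-once counting scheme follows: 4-connected equal-valued plateaus are found by flood fill, plateaus with no higher neighbour are treated as relative maxima, and each pixel of an N-pixel maximum plateau receives weight 1/N, so that a window sum counts each plateau once. The connectivity choice and the restriction to maxima are illustrative simplifications.

```python
import numpy as np
from collections import deque

def relative_maxima_weights(image):
    """Weight image: each pixel of an N-pixel relative-maximum plateau gets 1/N."""
    I = np.asarray(image)
    rows, cols = I.shape
    seen = np.zeros(I.shape, dtype=bool)
    weights = np.zeros(I.shape, dtype=float)
    for r in range(rows):
        for c in range(cols):
            if seen[r, c]:
                continue
            # flood-fill the 4-connected, equal-valued plateau containing (r, c)
            plateau, is_max, queue = [], True, deque([(r, c)])
            seen[r, c] = True
            while queue:
                i, j = queue.popleft()
                plateau.append((i, j))
                for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < rows and 0 <= nj < cols:
                        if I[ni, nj] == I[r, c] and not seen[ni, nj]:
                            seen[ni, nj] = True
                            queue.append((ni, nj))
                        elif I[ni, nj] > I[r, c]:
                            is_max = False     # a higher neighbour: not a maximum
            if is_max:
                for i, j in plateau:
                    weights[i, j] = 1.0 / len(plateau)
    return weights

rng = np.random.default_rng(6)
img = rng.integers(0, 16, size=(32, 32))
w = relative_maxima_weights(img)
window = 8
# extrema per unit area in one window: each plateau inside it contributes once
print(round(w[:window, :window].sum() / window ** 2, 4))
```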
Going beyond the simple counting of relative extrema, we can associate
properties to each relative extremum. For example, given a relative maximum, we
can determine the set of all pixels reachable only by the given relative maximum
and not by any other relative maximum by monotonically decreasing paths. This
set of reachable pixels is a connected region and forms a mountain. Its border
pixels may be relative minima or saddle pixels.
The relative height of the mountain is the difference between its relative
maximum and the highest of its exterior border pixels. Its size is the number of
pixels which constitute it. Its shape can be characterized by features such as
elongation, circularity, and symmetric axis. Elongation can be defined as the ratio
of the larger to the smaller eigenvalue of the 2 × 2 second moment matrix obtained
from the (x, y) coordinates of the border pixels (Bachi, 1973; Frolov, 1975).
Circularity can be defined as the ratio of the standard deviation to the mean of
the radii from the region's center to its border (Haralick, 1975). The symmetric
axis feature can be determined by thinning the region down to its skeleton and
counting the number of pixels in the skeleton. For regions which are elongated it
may be important to measure the direction of the elongation or the direction of
the symmetric axis.

3.4. Strong texture measures and generalized co-occurrence


Strong texture measures take into account the co-occurrence between texture
primitives. On the basis of Julesz (1975) it is probably the case that the most
important interaction between texture primitives occurs as a two-way interaction.
Textures with identical second and lower order interactions but with different
higher order interactions tend to be visually similar.
The simplest texture primitive is the pixel with its gray tone property. Gray
tone co-occurrence between neighboring pixels was suggested as a measure of
texture by a number of researchers, as discussed in Section 2.3. All the studies
mentioned there achieved a reasonable classification accuracy of different tex-
tures using co-occurrences of the gray tone primitive.
The next more complicated primitive is a connected set of pixels homogeneous
in tone (Tsuji and Tomita, 1973). Such a primitive can be characterized by size,
elongation, orientation, and average gray tone. Useful texture measures include
co-occurrence of primitives based on relationships of distance or adjacency.
Maleson et al. (1977) suggest using region growing techniques and ellipsoidal
approximations to define the homogeneous regions and degree of co-linearity as
one basis of co-occurrence. For example, for all primitives of elongation greater
than a specified threshold we can use the angular orientation of each primitive
with respect to its closest neighboring primitive as a strong measure of texture.
Relative extrema primitives were proposed by Rosenfeld and Troy (1970),
Mitchell, Myers and Boyne (1977), Ehrich and Foith (1976), Mitchell and
Carlton (1977), and Ehrich and Foith (1978). Co-occurrence between relative
extrema was suggested by Davis et al. (1978). Because of their invariance under
any monotonic gray scale transformation, relative extrema primitives are likely to
be very important.
It is possible to segment an image on the basis of relative extrema (for example,
relative maxima) in the following way: label all pixels in each maximally
connected relative maxima plateau with a unique label. Then label each pixel with
the label of the relative maximum that can reach it by a monotonically decreasing
path. If more than one relative maximum can reach it by a monotonically
decreasing path, then label the pixel with a special label 'c' for common. We call
the regions so formed the descending components of the image.
Co-occurrence between properties of the descending components can be based
on the spatial relationship of adjacency. For example, if the property is size, the
co-occurrence matrix could tell us how often a descending component of size s₁
occurs adjacent to or near a descending component of size s₂ or of label 'c'.
To define the concept of generalized co-occurrence, it is necessary to first
decompose an image into its primitives. Let Q be the set of all primitives on the
image. Then we need to measure primitive properties such as mean gray tone,
variance of gray tones, region size, shape, etc. Let T be the set of primitive
properties and f be a function assigning to each primitive in Q a property of T.
Finally, we need to specify a spatial relation between primitives such as distance
or adjacency. Let S ⊆ Q × Q be the binary relation pairing all primitives which
satisfy the spatial relation. The generalized co-occurrence matrix P is defined by

$$P(t_1, t_2) = \frac{\#\{(q_1, q_2) \in S \mid f(q_1) = t_1 \text{ and } f(q_2) = t_2\}}{\#S}.$$

P(t₁, t₂) is just the relative frequency with which two primitives occur in the
specified spatial relationship in the image, one primitive having property t₁ and
the other primitive having property t₂.
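A direct, if naive, sketch of this definition follows; the primitive representation (dictionaries with a centre and a size class) and the nearness predicate are purely illustrative assumptions, not part of the formal definition.

```python
import numpy as np

def generalized_cooccurrence(primitives, f, related, properties):
    """Generalized co-occurrence matrix P(t1, t2).

    primitives : list of primitive descriptions (any objects)
    f          : maps a primitive to one of `properties`
    related    : predicate deciding whether an ordered pair belongs to S
    """
    index = {t: k for k, t in enumerate(properties)}
    P = np.zeros((len(properties), len(properties)))
    S_size = 0
    for q1 in primitives:
        for q2 in primitives:
            if q1 is q2:
                continue
            if related(q1, q2):
                S_size += 1
                P[index[f(q1)], index[f(q2)]] += 1
    return P / S_size if S_size else P

# Toy example: primitives are blobs with a centre and a size class; S pairs
# primitives whose centres lie within distance 2 of each other.
blobs = [{"centre": (0, 0), "size": "small"}, {"centre": (1, 1), "size": "large"},
         {"centre": (5, 5), "size": "small"}, {"centre": (6, 5), "size": "small"}]
near = lambda a, b: ((a["centre"][0] - b["centre"][0]) ** 2 +
                     (a["centre"][1] - b["centre"][1]) ** 2) <= 4
P = generalized_cooccurrence(blobs, lambda q: q["size"], near, ["small", "large"])
print(P)
```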
Zucker (1974) suggests that some textures may be characterized by the frequency
distribution of the number of primitives any primitive has related to it. This
probability p(k) is defined by

$$p(k) = \frac{\#\{q \in Q \mid \#S(q) = k\}}{\#Q},$$

where S(q) denotes the set of primitives related to q under the spatial relation S.

Although this distribution is simpler than co-occurrence, no investigator appears
to have used it in texture discrimination experiments.

4. Conclusion

We have surveyed the image processing literature on the various approaches
and models investigators have used for textures. For microtextures, the statistical
approach seems to work well. The statistical approaches have included autocorre-
lation functions, optical transforms, digital transforms, textural edgeness, struct-
ural elements, gray tone co-occurrence, and autoregressive models. Pure structural
approaches based on more complex primitives than gray tone seem not to be
widely used. For macro-textures, investigators seem to be moving in the direction
of using histograms of primitive properties and co-occurrence of primitive proper-
ties in a structural-statistical generalization of the pure structural and statistical
approaches.

References
Bachi, R. (1973). Geostatistical analysis of territories. Proc. 39th Session-Bull. Internat. Statist.
Institute. Vienna.
Bajcsy, R. (1972). Computer identification of textured visual scenes. Stanford Univ., Palo Alto, CA.
Bajcsy, R. (1973). Computer description of textured surfaces. Third Internat. Joint Conf. on Artificial
Intelligence, 572-578. Stanford, CA.
Bajcsy, R. and L. Lieberman (1974). Computer description of real outdoor scenes. Proc. Second
Internat. Joint Conference on Pattern Recognition, 174-179. Copenhagen, Denmark.
Bajcsy, R. and L. Lieberman (1976). Texture gradient as a depth cue. Comput. Graphics Image Process.
5 (1) 52-67.
Bartels, P., G. Bahr and G. Wied (1969). Cell recognition from line scan transition probability profiles.
Acta Cytol. 13, 210-217.
Bartels, P. H. and G. L. Wied (1975). Extraction and evaluation of information from digitized cell
images. In: Mammalian Cells: Probes and Problems, 15-28. U.S. NTIS Technical Information
Center, Springfield, VA.
Box, G. E. P. and G. M. Jenkins (1970). Time Series Analysis. Holden-Day, San Francisco, CA.
Carlton, S. G. and O. Mitchell (1977). Image segmentation using texture and grey level. In: Pattern
Recognition and Image Processing Conference, 387-391. Troy, NY.
Carlucci, L. (1972). A formal system for texture languages. Pattern Recognition 4, 53-72.
Chen, P. and T. Pavlidis (1978). Segmentation by texture using a co-occurrence matrix and a
split-and-merge algorithm. Tech. Rept. 237. Princeton Univ., Princeton, NJ.
Chien, Y. P. and K. S. Fu (1974). Recognition of x-ray picture patterns. IEEE Trans. Systems Man
Cybernet. 4 (2) 145-156.
Conners, R. W. and Ch. A. Harlow (1976). Some theoretical considerations concerning texture
analysis of radiographic images. Proc. 1976 IEEE Conf. on Decision and Control. Clearwater Beach,
FL.
Cutrona, L. J., E. N. Leith, C. J. Palermo and L. J. Porcello (1969). Optical data processing and
filtering systems. IRE Trans. Inform. Theory 15 (6) 386-400.
Darling, E. M. and R. D. Joseph (1968). Pattern recognition from satellite altitudes. IEEE Trans. Systems Man Cybernet. 4, 38-47.
Davis, L., S. Johns and J. K. Aggarwal (1978). Texture analysis using generalized co-occurrence
matrices. Pattern Recognition and Image Processing Conf. Chicago, IL.
Deguchi, K. and I. Morishita (1978). Texture characterization and texture-based image partitioning
using two-dimensional linear estimation techniques. IEEE Trans. Comput. 27 (8) 739-745.
Dyer, C. and A. Rosenfeld (1976). Fourier texture features: suppression of aperture effects. IEEE
Trans. Systems Man Cybernet. 6 (10) 703-706.
Ehrich, R. and J. P. Foith (1976). Representation of random waveforms by relational trees. IEEE
Trans. Comput. 26 (7) 726-736.
Ehrich, R. and J. P. Foith (1978). Topology and semantics of intensity arrays. In: Hanson and
Riseman, eds., Computer Vision Systems. Academic Press, New York.
Frolov, Y. S. (1975). Measuring the shape of geographical phenomena: a history of the issue. Soviet
Geography: Review and Translation 16 (10) 676-687.
Galloway, M. M. (1975). Texture analysis using gray level run lengths. Comput. Graphics and Image
Process. 4 172-179.
Gilbert, E. (1962). Random subdivisions of space into crystals. Ann. Math. Statist. 33, 958-972.
Goodman, J. W. (1968). Introduction to Fourier Optics. McGraw-Hill, New York.
Gramenopoulos, N. (1973). Terrain type recognition using ERTS-I MSS images. Symp. on Significant
Results Obtained from the Earth Resources Technology Satellite, NASA SP-327, pp. 1229-1241.
Haralick, R. M. (1971). A texture-context feature extraction algorithm for remotely sensed imagery.
Proc. 1971 IEEE Decision and Control Conf., 650-657. Gainesville, FL.
Haralick, R. M. (1975). A textural transform for images. Proc. IEEE Conf. on Computer Graphics,
Pattern Recognition and Data Structure. Beverly Hills, CA.
Haralick, R. M. and K. Shanmugam (1973). Computer classification of reservoir sandstones. IEEE
Trans. Geosci. Electronics 11 (4) 171-177.
Haralick, R. M. and K. Shanmugam (1974). Combined spectral and spatial processing of ERTS
imagery data. J. Remote Sensing of the Environment 3, 3-13.
Haralick, R. M. and K. Shanmugam (1973). Textural features for image classification. IEEE Trans.
Systems Man Cybernet. 3 (6) 610-621.
Hawkins, J. K. (1970). Textural properties for pattern recognition In: B. S. Lipkin and A. Rosenfeld,
eds., Picture Processing and Psychopictorics. Academic Press, New York.
Horning, R. J. and J. A. Smith (1973). Application of Fourier analysis to multi-spectral/spatial
recognition. Management and Utilization of Remote Sensing Data A S P Syrup. Sioux Falls, SD.
Hsu, S. (1977). A texture-tone analysis for automated landuse mapping with panchromatic images.
Proc. Amer. Soc. for Photogrammetry, 203-215.
Julesz, B. (1962). Visual pattern discrimination. IRE Trans. Inform. Theory 8 (2) 84-92.
Julesz, B. (1975). Experiments in visual perception of texture. Sci. Amer. 232, 34-43.
Kaizer, H. (1955). A quantification of textures on aerial photographs. Tech. Note 121, AD 69484.
Boston Univ. Res. Labs.
Kirvida, L. (1976). Texture measurements for the automatic classification of imagery. IEEE Trans.
Electromagnetic Compatibility 18 (1) 38-42.
Landerweerd, G. H. and E. S. Gelsema (1978). The use of nuclear texture parameters in the automatic
analysis of leukocytes. Pattern Recognition 10, 57-61.
Lantuejoul, C. (1978). Grain dependence test in a polycristalline ceramic. In: J. L. Chernant, ed.,
Quantitative Analysis of Microstructures in Materials Science, Biology and Medicine, 40-50. Riederer,
Stuttgart.
Ledley, R. S. (1972). Texture problems in biomedical pattern recognition. Proc. 1972 IEEE Conf. on
Decision and Control and the l l t h Syrup. on Adaptive Processes. New Orleans, LA.
Lendaris, G. and G. Stanley (1969). Diffraction pattern sampling for automatic pattern recognition.
SPIE Pattern Recognition Studies Seminar Proceedings, 127-154.
414 Robert M. Haralick

Lendaris, G. G. and G. L. Stanley (1970). Diffraction pattem samplings for automatic pattern
recognition. Proc. IEEE 58 (2) 198-216.
Lu, S. Y. and K. S. Fu (1978). A syntactic approach to texture analysis. Comput. Graphics andlmage
Process. 7, 303-330.
Maleson, J., C. Brown and J. Feldman (1977). Understanding natural texture. University of Rochester,
Rochester, NY.
Matheron, G. (1967). Elements pour une Throrie des Milieux Poreux. Masson, Paris.
Matheron, G. (1975). Random Sets and Integral Geometry. Wiley, New York.
McCormick, B. H. and S. N. Jayaramamurthy (1974). Time series model for testure sythesis. Internat.
J. Comput. Informat. Sci. 3 (4) 329-343.
Miles, R. (1969). Random polygons determined by random lines in the plane. Proc. Nat. Acad. Sci.
U.S.A. 52, 901-907; 1157-1160.
Miles, R. (1970). On the homogeneous planar poisson point-process. Math. Biosci. 6, 85-127.
Mitchell, O., Ch. Myers and W. Boyne (1977). A max-min measure for image texture analysis. IEEE
Trans. Comput. 25 (4) 408-414.
Mitchell, O. R. and S. G. Carlton (1977). Image segmentation using a local extrema texture measure.
Pattern Recognition 10, 205-210.
Muller, W. (1974). The leitz-texture-analyzing system. Lab. Appl. Microscopy, Sci. and Tech. Inform.,
Suppl. I. 4, 101-136. Wetzler, West Germany.
Muller, W. and W. Hunn (1974). Texture Analyzer System. Industrial Res., 49-54.
Ohlander, R. (1975). Analysis of natural scenes. Ph.D. dissertation. Carnegie-Mellon Univ., Pitts-
burgh, PA.
Peucker, T. and D. Douglas (1975). Detection of surface-specific points by local parallel processing of
discrete terrain elevation data. Comput. Graphics' and Image Process. 4 (4) 375-387.
Pratt, W. K. (1978). Image feature extraction. Digital Image Processing, 471-513.
Pratt, W. K., O. D. Fangeras and A. Gagalowicz (1978). Visual discrimination of stochastic texture
fields. IEEE Trans. Systems Man Cybernet. 8 (11) 796-804.
Pressman, N. J. (1976). Markovian analysis of cervical cell imag,~s. J. Histochem. Cytochem. 24 (1)
138-144.
Pressman, N. J. (1976b). Optical texture analysis for automated cytology and histology: A Markovian
approach. Lawrence Livermore Lab. Rept. UCRL-52155. Livermore, CA.
Preston, K. (1972). Coherent Optical Computers. McGraw-Hill, New York.
Rosenfeld, A. (1975). A note on automatic detection of texture gradients. IEEE Trans. comput. 23 (10)
988-991.
Rosenfeld, A. and B. S. Lipkin, eds. (1970). Picture Processing and Psychopictorics. Academic Press,
New York.
Rosenfeld, A. and M. Thurston (1971). Edge and curve detection for visual scene analysis. IEEE
Trans. Comput. 20 (5) 562-569.
Rosenfeld, A. and E. Troy (1970). Visual texture analysis. Univ. of Maryland, College Park, MD;
Tech. Rept. 70-116; ibid., Conf. Record Syrup. on Feature Extraction and Selection in Pattern
Recognition, IEEE Publ. 70C-51C (1970) 115-124. Argonne, IL.
Schacter, B. J., A. Rosenfeld and L. S. Davis (1978). Random mosaic models for textures. IEEE
Trans. Systems Man Cybernet. 8 (9) 694-702.
Serra, J. (1974). Theoretical bases of the leitz texture analysis system. Leitz Sci. Tech. Inform., Suppl. 1
(4) 125-136. Wetzlar, Germany.
Serra, J. (1978). One, two, three ..... infinity. In: J. L. Chernant, ed., Quantitative Analysis of
Microstructures in Materials Science, Biology and Medicine, 9-24. Riederer, Stuttgart.
Serra, J. and G. Verchery (1973). Mathematical morphology applied to fibre composite materials. Film
Sci. Technol. 6, 141-158.
Shulman, A. R. (1970) Optical Data Processing. Wiley, New York.
Sutton, R. and E. Hall (1972). Texture measures for automatic classification of pulmonary disease.
IEEE Trans. Comput. 21 (1) 667-676.
Swanlund, G. D. (1971). Design requirements for texture measurements. Proc. Two Dimensional
Digital Signal Processing Conf. Univ. of Missouri, Columbia, MO.
Image texture survey 415

Switzer, P. (1967). Reconstructing patterns for sample data. Ann. Math. Stat&t. 38, 138-154.
Tomita, F. M. Yaehida and S. Tsuji (1973). Detection of homogeneous regions by structural analysis.
Proc. Third Internat, Joint Conf. on Artificial Intelligence, 564-571. Stanford Univ., Stanford, CA.
Toriwaki, J. and T. Fukumura (1978). Extraction of structural information from grey pictures.
Comput. Graphics and Image Process. 7 (1) 30-51.
Tou, J. T. and Y. S. Chang (1976). An approach to texture pattern analysis and recognition. Proe.
1976 IEEE Conf. on Decision and Control. Clearwater Beach, FL.
Tou, J. T. and Y. S. Chang (1977). Picture understanding by machine via textural feature extraction.
Proc. 1977 IEEE Conf. on Pattern Recognition and Image Processing. Troy, NY.
Tou, J. T,, D. B. Kao and Y. S. Chang (1976). Pictorial texture analysis and synthesis. Third lnternat.
Joint Conf. on Pattern Recognition. Coronado, CA.
Triendl, E. E. (1972). Automatic terrain mapping by texture recognition. Proc. Eighth Internat. Symp.
on Remote Sensing of Environment. Environmental Research Institute of Michigan Ann Arbor, MI.
Tsai, W. H. and K. S, Fu (1978). Image segmentation and recognition by texture discrimination: A
syntactic approach. Fourth Internat. Joint Conf. on Pattern Recognition. Tokyo, Japan.
Tsuji, S. and F. Tomita (1973). A structural analyzer for a class of textures. Comput. Graphics and
Image Process. 2, 216-231.
Watson, G. S. (1975). Geological Society of America Memoir 142, 367-391.
Weszka, J., C. Dyer and A. Rosenfeld (1976). A comparitive study of texture measures for terrain
classification. IEEE Trans. Systems Man Cybernet. 6 (4) 269-285.
Wied, G., G. Bahr and P. Bartels (1970). Automatic analysis of cell images. In: Wied and Bahr, eds.,
Automated Cell Identification and Cell Sorting. Academic Press, New York.
Yaglom, A. M. (1962). Theory of Stationary Random Functions. Prentice-Hall, Englewood Cliffs, NJ.
Zucker, S. W. (1976a). Toward a model of texture. Comput. Graphics and Image Process. 5, 190-202.
Zucker, S. W. (1976b). On the structure of texture. Perception 5, 419-436.
Zucker, S. (1974). On the foundations of texture: A transformational approach, Univ. of Maryland,
Tech. Rept. TR-331. College Park, MD,
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 417-449

Applications of Stochastic Languages*

K. S. Fu

1. Introduction

Formal languages and their corresponding automata and parsing procedures have been used in the modeling and analysis of natural and computer languages [2, 9] and in the description and recognition of patterns [15, 18]. One natural extension of one-dimensional string languages to higher dimensions is tree languages. A string can be regarded as a single-branch tree. The capability of having more than one branch often gives trees a more efficient pattern representation. Interesting applications of tree languages to picture recognition include the classification of bubble chamber events, the recognition of fingerprint patterns, and the interpretation of LANDSAT data [29-35]. In some applications, a certain amount of uncertainty exists in the process under study. For example, noise or distortion occurs in the communication, storage and retrieval of information, or in the measurement or processing of physical patterns. In these situations, in order to model the process more realistically, the approach of using stochastic languages has been suggested [11, 15, 18, 21]. For every string or tree x in a language L, a probability p(x) can be assigned such that $0 < p(x) \le 1$ and $\sum_{x \in L} p(x) = 1$. Thus, the function p(x) is a probability measure defined over L, and it can be used to characterize the uncertainty and randomness of L. In this paper, after a brief introduction of stochastic string languages, three applications are described.¹ They are (1) communication and coding, (2) syntactic pattern recognition, and (3) error-correcting parsing. Stochastic tree languages are then introduced, and their application to texture modeling is described.

2. Review of stochastic languages

Two natural ways of extending the concept of formal languages to stochastic languages are to randomize the productions of grammars and the state transitions

*This work was supported by the NSF Grant ENG 78-16970.


¹Other applications of stochastic languages include language learning [6, 40] and digital system design [5].


of recognition devices (acceptors), respectively. In this section some major results in stochastic grammars and stochastic syntax analysis are reviewed.

DEFINITION 2.1. A stochastic phrase structure grammar (or simply stochastic grammar) is a four-tuple $G_s = (V_N, V_T, P_s, S)$ where $V_N$ and $V_T$ are finite sets of nonterminals and terminals; $S \in V_N$ is the start symbol;² $P_s$ is a finite set of stochastic productions, each of which is of the form

$$\alpha_i \xrightarrow{p_{ij}} \beta_{ij}, \qquad j = 1, \ldots, n_i; \; i = 1, \ldots, k \tag{1}$$

where $\alpha_i \in V^* V_N V^*$ and $\beta_{ij} \in V^*$ with $V = V_N \cup V_T$, and $p_{ij}$ is the probability associated with the application of this stochastic production,

$$0 < p_{ij} \le 1 \quad\text{and}\quad \sum_{j=1}^{n_i} p_{ij} = 1.³ \tag{2}$$

$V^*$ denotes the set of all strings composed of symbols of V, including the empty string $\lambda$.

Suppose that $\alpha_i \xrightarrow{p_{ij}} \beta_{ij}$ is in $P_s$.⁴ Then the string $\xi = \gamma_1 \alpha_i \gamma_2$ may be replaced by $\eta = \gamma_1 \beta_{ij} \gamma_2$ with the probability $p_{ij}$. We shall denote this derivation by

$$\xi \xrightarrow{p_{ij}} \eta$$

and we say that $\xi$ directly generates $\eta$ with probability $p_{ij}$. If there exists a sequence of strings $\omega_1, \ldots, \omega_{n+1}$ such that

$$\xi = \omega_1, \quad \eta = \omega_{n+1}, \quad \omega_i \xrightarrow{p_i} \omega_{i+1}, \quad i = 1, \ldots, n,$$

then we say that $\xi$ generates $\eta$ with probability $p = \prod_{i=1}^{n} p_i$ and denote this

²In a more general formulation, instead of a single start symbol, a start symbol (probability) distribution can be used.
³If these two conditions are not satisfied, $p_{ij}$ will be denoted as the weight $w_{ij}$ and $P_s$ the set of weighted productions. Consequently, the grammar and the language generated are called weighted grammar and weighted language, respectively [12].
⁴In running text we will set indices superior and inferior to arrows instead of indices centered above or below the arrow.
derivation by

$$\xi \stackrel{p}{\Rightarrow} \eta.$$

The probability associated with this derivation is equal to the product of the probabilities associated with the sequence of stochastic productions used in the derivation. It is clear that $\Rightarrow$ is the reflexive and transitive closure of the relation $\rightarrow$.
The stochastic language generated by $G_s$ is

$$L(G_s) = \Big\{ (x, p(x)) \;\Big|\; x \in V_T^*,\; S \stackrel{p_j}{\Rightarrow} x,\; j = 1, \ldots, k,\; p(x) = \sum_{j=1}^{k} p_j \Big\} \tag{3}$$

where k is the number of all distinctly different derivations of x from S and $p_j$ is the probability associated with the jth distinct derivation of x.
In general, a stochastic language $L(G_s)$ is characterized by (L, p) where L is a language and p is a probability distribution defined over L. The language L of a stochastic language $L(G_s) = (L, p)$ is called the characteristic language of $L(G_s)$. Since the productions of a stochastic grammar are exactly the same as those of the non-randomized grammar except for the assignment of the probability distribution, the language L generated by a stochastic grammar is the same as that generated by its non-randomized version.

EXAMPLE 2.2. Consider the stochastic finite-state grammar $G_s = (V_N, V_T, P_s, S)$ where $V_N = \{S, A, B\}$, $V_T = \{0, 1\}$ and

$$P_s:\quad S \xrightarrow{1} 1A, \qquad A \xrightarrow{0.8} 0B, \qquad A \xrightarrow{0.2} 1, \qquad B \xrightarrow{0.3} 0, \qquad B \xrightarrow{0.7} 1S.$$

A typical derivation, for example, would be

$$S \to 1A \to 10B \to 100, \qquad p(100) = 1 \times 0.8 \times 0.3 = 0.24.$$

The stochastic language $L(G_s)$ generated by $G_s$ is illustrated in Table 1.

It is noted that

$$\sum_{x \in L(G_s)} p(x) = 0.2 + 0.24 + \sum_{n=1}^{\infty} (0.2 + 0.24)(0.56)^n = 1.$$

Table 1
String generated x        p(x)
11                        0.2
100                       0.24
(101)^n 11                0.2 × (0.56)^n
(101)^n 100               0.24 × (0.56)^n

DEFINITION 2.3. If, for a stochastic grammar $G_s$,

$$\sum_{x \in L(G_s)} p(x) = 1, \tag{4}$$

then $G_s$ is said to be a consistent stochastic grammar.

The condition for a stochastic context-free grammar to be consistent is briefly described below [6, 15, 18]. Whether or not a stochastic context-sensitive grammar is consistent under given conditions is not yet known. The consistency condition for stochastic finite-state grammars is trivial, since every stochastic finite-state grammar is consistent.
For a context-free grammar, all the productions are of the form

$$A \to \alpha, \qquad A \in V_N, \; \alpha \in V^+ = V^* - \{\lambda\}.$$

In this case a nonterminal on the left side of a production may directly generate zero or a finite number of nonterminals. The theory of multitype Galton-Watson branching processes [23] can be applied to study the language generation process of stochastic context-free grammars. The zero-th level of a generation process corresponds to the start symbol S. The first level is taken to be $\beta_1$, the string generated by the production $S \to \beta_1$. The second level corresponds to the string $\beta_2$ which is obtained from $\beta_1$ by applying appropriate productions to every nonterminal in $\beta_1$. If $\beta_1$ does not contain any nonterminal, the process is terminated. Following this procedure, the jth level string $\beta_j$ is defined to be the string obtained from the string $\beta_{j-1}$ by applying appropriate productions to every nonterminal of $\beta_{j-1}$. Since all nonterminals are considered simultaneously in going from the (j-1)th level to the jth level, only the probabilities $p_{ir}$ associated with the productions $A_i \to \alpha_r$ need to be considered. Define the equivalence class $C_{A_i} = \{\text{all productions of } P \text{ with left side } A_i\}$. Thus,

$$P = \bigcup_{i=1}^{k} C_{A_i}.$$

For each equivalence class $C_{A_i}$,

$$\sum_{C_{A_i}} p_{ir} = 1.$$

DEFINITION 2.4. For each $C_{A_i}$, $i = 1, \ldots, k$, define the k-argument generating function $f_i(s_1, s_2, \ldots, s_k)$ as

$$f_i(s_1, \ldots, s_k) = \sum_{C_{A_i}} p_{ir}\, s_1^{\mu_{i1}(\alpha_r)} s_2^{\mu_{i2}(\alpha_r)} \cdots s_k^{\mu_{ik}(\alpha_r)} \tag{5}$$

where $\mu_{il}(\alpha_r)$ denotes the number of times the nonterminal $A_l$ appears in the string $\alpha_r$ of the production $A_i \to \alpha_r$, and $S = A_1$.

DEFINITION 2.5. The jth-level generating function $F_j(s_1, \ldots, s_k)$ is defined recursively as

$$F_0(s_1, \ldots, s_k) = s_1, \qquad F_1(s_1, \ldots, s_k) = f_1(s_1, \ldots, s_k),$$
$$F_j(s_1, \ldots, s_k) = F_{j-1}\big(f_1(s_1, \ldots, s_k), \ldots, f_k(s_1, \ldots, s_k)\big). \tag{6}$$

EXAMPLE 2.6. $G_s = (V_N, V_T, P_s, S)$ where $V_N = \{A_1, A_2\}$, $V_T = \{a, b\}$, $S = A_1$ and

$$P_s:\quad A_1 \xrightarrow{p_{11}} aA_1A_2, \qquad A_1 \xrightarrow{p_{12}} b, \qquad A_2 \xrightarrow{p_{21}} aA_2A_2, \qquad A_2 \xrightarrow{p_{22}} aa.$$

The generating functions for $C_{A_1}$ and $C_{A_2}$ are, respectively,

$$f_1(s_1, s_2) = p_{11}s_1s_2 + p_{12}, \qquad f_2(s_1, s_2) = p_{21}s_2^2 + p_{22}.$$

The jth level generating functions for j = 0, 1, 2 are

$$F_0(s_1, s_2) = s_1, \qquad F_1(s_1, s_2) = f_1(s_1, s_2) = p_{11}s_1s_2 + p_{12},$$
$$F_2(s_1, s_2) = F_1\big(f_1(s_1, s_2), f_2(s_1, s_2)\big) = p_{11}f_1(s_1, s_2)f_2(s_1, s_2) + p_{12}$$
$$= p_{11}^2 p_{21} s_1 s_2^3 + p_{11}^2 p_{22} s_1 s_2 + p_{11}p_{12}p_{21}s_2^2 + p_{11}p_{12}p_{22} + p_{12}.$$

After examining the previous example we can express $F_j(s_1, \ldots, s_k)$ as

$$F_j(s_1, \ldots, s_k) = G_j(s_1, \ldots, s_k) + K_j$$

where $G_j(s_1, \ldots, s_k)$ denotes the polynomial in $s_1, \ldots, s_k$ without the constant term. The constant term, $K_j$, corresponds to the probability of all the strings $x \in L(G_s)$ that can be derived in j or fewer levels. This leads to the
following theorem.

THEOREM 2.7. A stochastic context-free grammar $G_s$ is consistent if and only if

$$\lim_{j \to \infty} K_j = 1. \tag{7}$$

Note that if the above limit is not equal to 1, there is a finite probability that a
generation process may never terminate. Thus, the probability measure defined
over L will be less than 1 and, consequently, G~ will not be consistent. On the
other hand, if the limit is equal to 1, then there exists no such infinite (non-
terminating) generation process since the limit represents the probability of all the
strings which are generated by applications of a finite number of productions.
Consequently, G~ is consistent. The problem of testing the consistency of a given
stochastic context-free grammar can be solved by using the following testing
procedure developed for branching processes.

DEFINITION 2.8. The expected number of occurrences of the nonterminal $A_j$ in the production set $C_{A_i}$ is

$$e_{ij} = \frac{\partial f_i(s_1, \ldots, s_k)}{\partial s_j}\bigg|_{s_1 = \cdots = s_k = 1}. \tag{8}$$

DEFINITION 2.9. The first moment matrix E of the generation process corresponding to a stochastic context-free grammar $G_s$ is defined as

$$E = [e_{ij}], \qquad 1 \le i, j \le k \tag{9}$$

where k is the number of nonterminals in $G_s$.

THEOREM 2.10. For a given stochastic context-free grammar $G_s$, order the eigenvalues (characteristic roots) $\rho_1, \ldots, \rho_k$ of the first moment matrix in descending order of their magnitudes; that is,

$$|\rho_i| \ge |\rho_j| \quad\text{if } i < j. \tag{10}$$

Then $G_s$ is consistent if $\rho_1 < 1$ and it is not consistent if $\rho_1 > 1$.

For the stochastic context-free grammar $G_s$ given in Example 2.6,

$$e_{11} = \frac{\partial f_1(s_1, s_2)}{\partial s_1}\bigg|_{s_1 = s_2 = 1} = p_{11}, \qquad e_{12} = \frac{\partial f_1(s_1, s_2)}{\partial s_2}\bigg|_{s_1 = s_2 = 1} = p_{11},$$
$$e_{21} = \frac{\partial f_2(s_1, s_2)}{\partial s_1}\bigg|_{s_1 = s_2 = 1} = 0, \qquad e_{22} = \frac{\partial f_2(s_1, s_2)}{\partial s_2}\bigg|_{s_1 = s_2 = 1} = 2p_{21}.$$

Thus

$$E = \begin{bmatrix} p_{11} & p_{11} \\ 0 & 2p_{21} \end{bmatrix}.$$

The characteristic equation associated with E is $\phi(x) = (x - p_{11})(x - 2p_{21})$. Hence $G_s$ will be consistent as long as $p_{21} < \tfrac{1}{2}$.
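The test of Theorem 2.10 is easy to mechanize. The following minimal Python sketch (our illustration, with assumed function names) builds the first moment matrix for the grammar of Example 2.6 and compares its largest eigenvalue magnitude with 1.

```python
# Sketch: consistency test of Theorem 2.10 for the grammar of Example 2.6.
import numpy as np

def first_moment_matrix(p11, p21):
    # e_ij = partial derivative of f_i with respect to s_j at s_1 = s_2 = 1.
    return np.array([[p11, p11],
                     [0.0, 2.0 * p21]])

def is_consistent(p11, p21):
    rho = max(abs(np.linalg.eigvals(first_moment_matrix(p11, p21))))
    return rho < 1.0        # the boundary case rho = 1 is not decided by the theorem

print(is_consistent(p11=0.3, p21=0.4))   # True:  2*p21 = 0.8 < 1
print(is_consistent(p11=0.3, p21=0.6))   # False: 2*p21 = 1.2 > 1
```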

3. Application to communication and coding

An information source of a communication system is defined as a generative device which produces, at a uniform rate, sequences of symbols from an alphabet.
In channel coding these symbols are usually assumed to be randomly generated
and therefore they do not represent any specific meanings other than their own
identifications. We shall call sources of this kind lexicographical, to emphasize the
total lack of regularity or 'structure' in the formation of source messages. The
realm of algebraic coding has hitherto been concerned almost exclusively with
these sources and we shall also call the encoder, decoder, as well as the coding
technique lexicographical, in order to emphasize the contrast between these and
the new type of syntactic decoder introduced in this section. Other types of
information sources, of course, do exist. In particular, Markov information source
models are widely accepted in source coding.
Several attempts have been made to treat the problem of human communi-
cation from the narrow, strictly probabilistic point of view, based on the measure
of information introduced by Shannon in 1948. Recently it was pointed out that
information theory fell quite short of an acceptable description for the process of
transference of ideas that occurs when intelligent beings communicate [39]. It has
been demonstrated that syntactic information about the source is useful for
designing a 'syntactic decoder' which supplements an ordinary decoder [19].
On the other hand, Hutchins [25] showed that we can do source coding for
some formal languages by removing a certain amount of 'redundancy' in the
languages. Smith [38] suggested that the existence of 'built-in redundancy' in
formal languages can be utilized in achieving error-detection without the obliga-
tory addition of redundancy. Problems of a similar nature were recently studied
by Hellman [24] who developed a method of combining source coding (i.e., data
compression) and channel coding.
A brief discussion about the maximum-likelihood syntactic decoder is given in
this section. Let the linguistic source be characterized by a context-free grammar
G = (VN, VT, P, S) which generates L ( G ) C V.}. The source generates a sequence
of sentences of L ( G ) . A special kind of synchronization mechanism is provided so
that there is no error in locating the beginning and the end of every sentence
issued by the source. With this assumption we may proceed to consider syntactic
decoding for individual sentences.
Referring to Fig. 1, the output of the source is encoded into binary form which is transmitted over a binary symmetric memoryless channel N with bit error rate p. Output symbols $V_T$ of the source are encoded by $C_n$. Suppose $V_T$ has l symbols, i.e., $V_T = \{\sigma_1, \sigma_2, \ldots, \sigma_l\}$; then $n \ge \lceil \log_2 l \rceil$. Let $u_i = C_n(\sigma_i)$ be the codeword for the ith symbol $\sigma_i$ of the alphabet, which can be expressed as a binary sequence

$u_i = (u_{i1}, u_{i2}, \ldots, u_{in})$. Then the conditional probability of a binary sequence $v = (v_1, v_2, \ldots, v_n)$ at the output of the channel, given that the input sequence is $u = (u_1, u_2, \ldots, u_n)$, is

$$P(v \mid u) = P(s_O = v_1 \mid s_I = u_1) \cdots P(s_O = v_n \mid s_I = u_n) = (1-p)^{\sum_{i=1}^{n}\delta(u_i, v_i)}\, p^{\,n - \sum_{i=1}^{n}\delta(u_i, v_i)} \tag{11}$$

where

$$P(s_O = v_i \mid s_I = u_i) = \begin{cases} p & \text{if } v_i \ne u_i, \; i = 1, \ldots, n,\\ 1 - p & \text{otherwise}, \end{cases}$$

and $\delta(\cdot,\cdot)$ is the Kronecker delta function.


At the receiver a decoder assigns a symbol $\sigma$ to every binary sequence $v \in \{0,1\}^n$. More precisely, let B be a block code defined by

$$B = \{(u_1, U_1), (u_2, U_2), \ldots, (u_l, U_l)\}$$

where the $u_i$'s are codewords for the alphabet $V_T$, and the $U_i$'s are disjoint subsets of $\{0,1\}^n$ such that $U_1 \cup \cdots \cup U_l = \{0,1\}^n$. A lexicographical decoder $D_0$ is a mapping from $\{0,1\}^n$ into $V_T$, defined by

$$D_0(v) = \sigma_j \quad\text{if and only if}\quad v \in U_j. \tag{12}$$

Designs of the code as well as its encoders and decoders are studied extensively in
information theory (e.g., [26]). We shall present in the following a decoding
scheme which in a certain sense is superior to the lexicographical decoding
scheme.
The encoder, the binary symmetric channel N and the decoder $D_0$ can be combined and represented by a mutational channel M characterized by the following transition probabilities:

$$Q_B(\beta = \sigma_j \mid \alpha = \sigma_i) = P_B(v \in U_j \mid u_i) = \sum_{v \in U_j} P_B\big(v = (v_1, v_2, \ldots, v_n) \mid u_i\big) = \sum_{v \in U_j} (1-p)^{\sum_{k=1}^{n}\delta(u_{ik}, v_k)}\, p^{\,n - \sum_{k=1}^{n}\delta(u_{ik}, v_k)} \tag{13}$$

for every $i, j = 1, 2, \ldots, l$. Here $\alpha$ and $\beta$ are respectively the input and output of the mutational channel M, and $Q_B(\beta = \sigma_j \mid \alpha = \sigma_i)$ is the probability of observing symbol $\sigma_j$ at the output of M when $\sigma_i$ is the true symbol at the input of M. Note that $\sum_j Q_B(\beta = \sigma_j \mid \alpha = \sigma_i) = 1$ for all i.
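Eq. (13) can be computed directly for small codes. The following Python sketch is our own illustration (the 2-symbol repetition code and the helper names are assumptions, not from the text): it tabulates the mutational-channel matrix $Q_B$ induced by a block code over a binary symmetric channel.

```python
# Sketch: the mutational channel Q_B of eq. (13) for a toy block code.
from itertools import product

def channel_prob(u, v, p):
    """P(v | u) of eq. (11) for codeword u and received block v."""
    matches = sum(uk == vk for uk, vk in zip(u, v))
    return (1 - p) ** matches * p ** (len(u) - matches)

def mutational_channel(codewords, decode_region, p):
    """Q_B[i][j] = sum over v in U_j of P(v | u_i)."""
    n = len(codewords[0])
    Q = [[0.0] * len(codewords) for _ in codewords]
    for v in product((0, 1), repeat=n):
        j = decode_region(v)                     # index of the region U_j containing v
        for i, u in enumerate(codewords):
            Q[i][j] += channel_prob(u, v, p)
    return Q

# Example: l = 2 symbols, n = 3 repetition code, majority-vote decoder D_0.
codewords = [(0, 0, 0), (1, 1, 1)]
majority = lambda v: int(sum(v) >= 2)
Q = mutational_channel(codewords, majority, p=0.1)
print(Q)          # each row sums to 1, as noted in the text
```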
As illustrated in Fig. 1, a syntactic decoder D is introduced to the output of the
mutational channel, i.e., to the output of the lexicographical decoder Do, before
the messages are conveyed to the user. Suppose the source grammar G generates a
sentence $w = c_1 c_2 \cdots c_N \in L(G)$. The symbols $c_1, \ldots, c_N$ are fed into the encoder $C_n$ sequentially, so that a sequence of binary codewords $C_n(c_1), \ldots, C_n(c_N)$ comes out of the encoder and is modulated and transmitted over the noisy channel N. Let $u_i = C_n(c_i)$, $i = 1, \ldots, N$, and let $v_i$ be the binary sequence of length n coming out of the channel N corresponding to the transmitted codeword $u_i$. These binary sequences $v_1, \ldots, v_N$ are decoded according to (12) so that the lexicographical decoder outputs a string of symbols in $V_T$, $z = a_1 a_2 \cdots a_N$ where $a_i = D_0(v_i)$, $i = 1, 2, \ldots, N$. We can symbolically describe this coding scheme by $z = a_1 a_2 \cdots a_N = M(c_1)M(c_2)\cdots M(c_N)$. The block code B is designed in such a way that B is


optimal in some sense. For example, B results in minimal average probability of
error for each individual symbol, among all possible block codes of codeword
length n. That is, for every symbol oi transmitted, the average probability of error
per symbol,

$$P(e \mid B) = \frac{1}{l}\sum_{i=1}^{l} Q_B(\beta \ne \sigma_i \mid \alpha = \sigma_i) \tag{14}$$

is minimized.
However, it is immediately obvious that the above optimal block code B is
based on the performance of the code for individual symbols. For linguistic
messages we may take advantage of our knowledge about the syntactic structure
among the transmitted symbols to improve the efficiency of the communication
system. Since the utility of the communicated linguistic messages depends mostly
on the correctness of individual sentences, the average probability of obtaining a
correct sentence for the source grammar will be considered to be a significant
factor in evaluating the efficiency of the communication system.
Let $z = a_1 a_2 \cdots a_N$ be the decoded sequence of symbols from the lexicographical decoder $D_0$. The syntactic decoder D is a mapping from $V_T^*$ into L(G), which is optimal according to the maximum-likelihood criterion:

$$P\big(z = a_1 a_2 \cdots a_N \mid D(z) = \hat{c}_1 \hat{c}_2 \cdots \hat{c}_N\big) \ge P\big(z = a_1 a_2 \cdots a_N \mid \hat{w} = c_1 c_2 \cdots c_N\big) \tag{15}$$

for all $\hat{w} \in L(G) \cap V_T^N$. Here $D(z) = \hat{c}_1 \hat{c}_2 \cdots \hat{c}_N$ is the output sequence of the syntactic decoder with input sequence z. Since the channel N is memoryless and since the symbols of a transmitted sentence are encoded and decoded sequentially, (15) can be written as

$$\prod_{i=1}^{N} Q(\beta = a_i \mid \alpha = \hat{c}_i) \ge \prod_{i=1}^{N} Q(\beta = a_i \mid \alpha = c_i). \tag{16}$$

Define $d(a_i, c_i) = -\log Q(\beta = a_i \mid \alpha = c_i)$. Then (16) is equivalent to

$$\sum_{i=1}^{N} d(a_i, \hat{c}_i) \le \sum_{i=1}^{N} d(a_i, c_i). \tag{17}$$
Extend the definition of d to the case of two strings $z = a_1 a_2 \cdots a_N$ and $w = c_1 c_2 \cdots c_N$ which are both of length N, N = 2, 3, .... Then

$$d(z = a_1 a_2 \cdots a_N;\; w = c_1 c_2 \cdots c_N) = \sum_{i=1}^{N} d(a_i, c_i). \tag{18}$$

Similarly, we can also define

$$d(z; L(G)) = \inf_{w \in L(G)} d(z; w). \tag{19}$$

Hence, using (17) and (18), we can rewrite (15) as

$$D(z) = \{\hat{w}^* \mid \hat{w}^* = \hat{c}_1 \cdots \hat{c}_N \in L(G) \text{ such that } d(z; \hat{w}^*) \le d(z; \hat{w}) \text{ for all sentences } \hat{w} \in L(G)\}. \tag{20}$$

It follows from (19) that $\hat{w}^* \in D(z)$ implies that

$$d(z; \hat{w}^*) = d(z; L(G)). \tag{21}$$

We say that $\hat{w}^*$ of the source language L(G) is the 'nearest' to the string z.

An algorithm based on the Cocke-Younger-Kasami parsing scheme has been proposed for the syntactic decoder [19]. The syntactic decoder is basically an error-correcting parser which attempts to produce a collection of strings $D(z) \subseteq L(G)$ for any string z not in the language L(G), so that every string $\hat{w}^*$ in D(z) is 'nearest' to z as compared with the other strings in L(G). An example of applying such a syntactic decoder to computer networks is given in [20].
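For a small finite language the nearest-string criterion of eqs. (17)-(21) can be evaluated by brute force. The sketch below is our own illustration (the toy language and channel matrix are assumptions); the chapter's actual decoder is the Cocke-Younger-Kasami-based error-correcting parser of [19], not this exhaustive search.

```python
# Sketch: maximum-likelihood syntactic decoding as a nearest-string search.
import math

def ml_syntactic_decode(z, language, Q):
    """Return the sentence(s) of the language nearest to z in the sense of eq. (20)."""
    def d(z, w):
        # d(z; w) = sum_i -log Q(beta = a_i | alpha = c_i), eq. (18)
        return sum(-math.log(Q[c][a]) for a, c in zip(z, w))
    candidates = [w for w in language if len(w) == len(z)]
    best = min(d(z, w) for w in candidates)
    return [w for w in candidates if math.isclose(d(z, w), best)]

# Mutational-channel transition probabilities Q[transmitted][observed].
Q = {"a": {"a": 0.9, "b": 0.1}, "b": {"a": 0.2, "b": 0.8}}
L = ["aab", "abb", "bbb"]                 # a toy finite language standing in for L(G)
print(ml_syntactic_decode("abb", L, Q))   # a legal sentence decodes to itself
print(ml_syntactic_decode("bab", L, Q))   # a corrupted string maps to its nearest sentence
```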

4. Application to syntactic pattern recognition

In syntactic or linguistic pattern recognition [15, 18], patterns are described by


sentences of a language characterized by a grammar. Typically, each class of
patterns is characterized by a grammar which generates sentences describing each
of the patterns in the class. The recognition of a pattern then becomes a syntax
analysis or parsing of the sentence describing the pattern with respect to the
grammar characterizing each pattern class.
In some practical applications, however, noise and distortion occur in the
measurement or processing of physical patterns. Languages used to describe the
noisy and distorted patterns under consideration are often ambiguous in the sense
that a sentence (a pattern) can be generated by more than one grammar which
characterizes the patterns (sentences) generated from a particular pattern class. In
terms of statistical pattern recognition [10], this is the case of overlapping pattern
classes; patterns belonging to different classes may have the same descriptions or

Fig. 2. Three sample chromosomes and their string representations:
(a) median string representation: cbbbabbbbdbbbbabbbcbbbabbbbdbbbbabbb
(b) submedian string representation: cbabbbdbbbbbabbbcbbbabbbbbdbbbab
(c) acrocentric string representation: cadbbbbbbabbbbbcbbbbbabbbbbbda

measurement values (but with different probabilities of occurrence). In these


situations, in order to model the process more realistically, the approach of using
stochastic languages has been suggested [11, 15, 18, 21].
For example, three types of chromosome patterns and their corresponding
string representations are shown in Fig. 2. It is easy to notice that with noise and
distortion one type of chromosome (e.g., submedian) could become another type
(e.g., median and acrocentric). In other words, the grammar characterizing
submedian chromosomes should sometimes (with small probabilities) also gener-
ate median and acrocentric chromosomes. Similar conclusions can also be drawn
for grammars characterizing median and acrocentric chromosomes. One approach
to model the situation is to use one grammar which generates all the three types
of chromosome patterns but to use three different sets of production probabili-
ties. The following is a grammar generating median, submedian, and acrocentric
chromosome patterns. However, three different production probability assign-
ments result in three different stochastic grammars which characterize median,
submedian, and acrocentric chromosome patterns, respectively. Presumably, the
stochastic grammar for median chromosome patterns will have a higher probability of generating median strings than submedian and acrocentric strings; the stochastic grammar for submedian chromosome patterns will have a higher probability of
generating submedian strings, etc. In practice, the production probabilities will
have to be inferred from string probabilities or assigned subjectively by the
designer [15, 16].

Chromosome grammar $G_s = (V_N, V_T, P_s, S)$, where $V_N = \{S, A, B, D, H, J, E, F, W, G, R, L\}$, $V_T = \{a, b, c, d\}$ (the boundary primitives), and $P_s$:

$$S \xrightarrow{1} AA, \quad A \xrightarrow{1} cb, \quad F \xrightarrow{1} b, \quad E \xrightarrow{1} b, \quad H \xrightarrow{1} a, \quad J \xrightarrow{1} a,$$
$$B \xrightarrow{p_1} FBE, \quad B \xrightarrow{p_2} HDJ, \quad B \xrightarrow{p_3} RE, \quad B \xrightarrow{p_3} FL,$$
$$D \xrightarrow{p_1} FDE, \quad D \xrightarrow{p_2} d, \quad D \xrightarrow{p_3} FG, \quad D \xrightarrow{p_3} WE,$$
$$R \xrightarrow{p_3} RE, \quad R \xrightarrow{p_4} HDJ, \quad W \xrightarrow{p_3} WE, \quad W \xrightarrow{p_4} d,$$
$$L \xrightarrow{p_3} FL, \quad L \xrightarrow{p_4} HDJ, \quad G \xrightarrow{p_3} FG, \quad G \xrightarrow{p_4} d,$$

with $p_1 + p_2 + 2p_3 = 1$ and $p_3 + p_4 = 1$.
By associating probabilities with the strings, we can impose a probabilistic
structure on the language to describe noisy patterns. The probability distribution
characterizing the patterns in a class can be interpreted as the probability

distribution associated with the strings in a language.

Fig. 3. Maximum-likelihood syntactic pattern recognition system: the input string x is parsed to obtain $p(x \mid G_1), \ldots, p(x \mid G_m)$, and a maximum detector selects the class with the largest value.

Thus, statistical decision
rules can be applied to the classification of a pattern under ambiguous situations
(for example, use the maximum-likelihood or Bayes decision rule). A block
diagram of such a recognition system using maximum-likelihood decision rule is
shown in Fig. 3. Furthermore, because of the availability of the information about
production probabilities, the speed of syntactic analysis can be improved through
the use of this information [29, 37].
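The decision stage of Fig. 3 is easy to sketch. The following Python illustration is ours (the right-linear grammar format, the two toy grammars and the helper names are assumptions, not the chromosome grammars of this section): it computes $p(x \mid G_i)$ for each class grammar and applies the maximum-likelihood rule.

```python
# Sketch: maximum-likelihood syntactic classification with stochastic
# finite-state (right-linear) class grammars.
from functools import lru_cache

def string_prob(grammar, x, start="S"):
    """p(x | G) for a grammar given as
    {nonterminal: [(prob, terminal, next_nonterminal_or_None), ...]}."""
    @lru_cache(maxsize=None)
    def prob(nt, i):
        if nt is None:                       # derivation finished
            return 1.0 if i == len(x) else 0.0
        if i == len(x):
            return 0.0
        total = 0.0
        for p, a, nxt in grammar[nt]:
            if x[i] == a:
                total += p * prob(nxt, i + 1)
        return total
    return prob(start, 0)

G1 = {"S": [(0.7, "a", "S"), (0.3, "b", None)]}   # favors long runs of a's
G2 = {"S": [(0.3, "a", "S"), (0.7, "b", None)]}   # favors short strings

def classify(x, grammars):
    scores = {name: string_prob(g, x) for name, g in grammars.items()}
    return max(scores, key=scores.get), scores

print(classify("aaab", {"class 1": G1, "class 2": G2}))
```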

5. Application to error-correcting parsing

The syntactic decoder discussed in Section 3 is actually implemented as an error-correcting parser for substitution errors. In language parsing, in addition to
substitution error; that is, the substitution of a symbol by another symbol, two
other types of syntactic error also occur. They are the insertion error (the
insertion of an extra symbol) and the deletion error (the deletion of a symbol).
Application examples of this kind include error-correcting compiler [28] and the
recognition of continuous speech [3, 32]. In this section, error-correcting parsing
for the three types of error is briefly described.
When the error statistics are known or can be estimated, the error transformations become probabilistic. For instance, we may know the probability of substituting terminal a by b, i.e., the probability $q_S(b \mid a)$ associated with $T_S$; the probability of inserting terminal b in front of a, $q_I(b \mid a)$; and the probability of deleting terminal a, $q_D(\lambda \mid a)$, where $\lambda$ is the empty sentence (the sentence of length zero). With the information of these error transformation probabilities, we can apply the maximum-likelihood decision criterion to obtain an error-corrected sentence in L. If the language L is also stochastic, the Bayes decision rule can be used [18, 33].

DEFINITION 5.1. The deformation probabilities associated with the substitution, insertion and deletion transformations $T_S$, $T_I$, $T_D$ are defined as follows:

(1) $xay \vdash_{T_S,\, q_S(b \mid a)} xby$, where $q_S(b \mid a)$ is the probability of substituting terminal a by b;
(2) $xay \vdash_{T_I,\, q_I(b \mid a)} xbay$, where $q_I(b \mid a)$ is the probability of inserting terminal b in front of a;
(3) $xay \vdash_{T_D,\, q_D(a)} xy$, where $q_D(a)$ is the probability of deleting terminal a from a string;
(4) $x \vdash_{T_I,\, q_I(a)} xa$, where $q_I(a)$ is the probability of inserting terminal a at the end of a string.


It can be shown that the deformation probabilities for both the single-error and multiple-error cases are consistent if

$$\sum_{\substack{b \in \Sigma \\ b \ne a}} q_S(b \mid a) + q_D(a) + \sum_{b \in \Sigma} q_I(b \mid a) + q(a) = 1$$

for all $a \in \Sigma$, where q(a) is the probability that no error occurs on terminal a [18, 33].
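A small sketch (our own, with assumed table names) of how this condition can be checked for a given set of deformation-probability tables:

```python
# Sketch: checking the consistency condition of the deformation probabilities.
def deformation_tables_consistent(sigma, q_sub, q_del, q_ins, q_none, tol=1e-9):
    """q_sub[a][b], q_ins[a][b]: substitution / insertion-in-front-of-a tables;
    q_del[a]: deletion; q_none[a]: probability of no error on terminal a."""
    for a in sigma:
        total = (sum(q_sub[a][b] for b in sigma if b != a)
                 + q_del[a]
                 + sum(q_ins[a][b] for b in sigma)
                 + q_none[a])
        if abs(total - 1.0) > tol:
            return False
    return True

sigma = ("a", "b")
q_sub = {"a": {"a": 0.0, "b": 0.05}, "b": {"a": 0.05, "b": 0.0}}
q_del = {"a": 0.05, "b": 0.05}
q_ins = {"a": {"a": 0.05, "b": 0.05}, "b": {"a": 0.05, "b": 0.05}}
q_none = {"a": 0.80, "b": 0.80}
print(deformation_tables_consistent(sigma, q_sub, q_del, q_ins, q_none))  # True
```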
Let $L(G_s)$ be a given stochastic context-free language and y be an erroneous string, $y \notin L(G_s)$. The maximum-likelihood error-correcting parsing algorithm searches for a string x, $x \in L(G_s)$, such that

$$q(y \mid x)\,p(x) = \max_{z \in L(G_s)} \; q(y \mid z)\,p(z)$$

where p(z) is the probability of generating z by $G_s$. By adopting the approach of constructing covering grammars used in [1], we present an algorithm of expanding the original stochastic context-free grammar to accommodate the stochastic deformation model as follows:

ALGORITHM 5.2 (Stochastic error-induced grammar)

Input. A stochastic context-free grammar $G_s = (N, \Sigma, P_s, S)$.
Output. $G_s' = (N', \Sigma', P_s', S')$, the stochastic grammar induced by error transformations for $G_s$.
Method.
Step 1. $N' = N \cup \{S'\} \cup \{E_a \mid a \in \Sigma\}$.
Step 2. $\Sigma' \supseteq \Sigma$.
Step 3. If $A \xrightarrow{p} \alpha_0 b_1 \alpha_1 b_2 \alpha_2 \cdots b_m \alpha_m$, $m \ge 0$, is a production in $P_s$ such that $\alpha_i$ is in $N^*$ and $b_i$ is in $\Sigma$, then add the production $A \xrightarrow{p} \alpha_0 E_{b_1} \alpha_1 E_{b_2} \cdots E_{b_m} \alpha_m$ to $P_s'$, where each $E_{b_i}$ is a new nonterminal, $E_{b_i} \in N'$.
Step 4. Add to $P_s'$ the productions
(a) $S' \xrightarrow{1 - q_I'} S$ where $q_I' = \sum_{a \in \Sigma'} q_I'(a)$;
(b) $S' \xrightarrow{q_I'(a)} S'a$ for all $a \in \Sigma'$.
Step 5. For all $a \in \Sigma$, add to $P_s'$ the productions
(a) $E_a \xrightarrow{q(a)} a$;
(b) $E_a \xrightarrow{q_S(b \mid a)} b$ for all $b \in \Sigma'$, $b \ne a$;
(c) $E_a \xrightarrow{q_D(a)} \lambda$;
(d) $E_a \xrightarrow{q_I(b \mid a)} bE_a$ for all $b \in \Sigma'$.
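To make the construction concrete, here is a rough Python sketch of Algorithm 5.2 under an assumed data layout (a production is a tuple (lhs, rhs, prob); the toy grammar and table names are ours, not from the text):

```python
# Sketch: building the stochastic error-induced production set of Algorithm 5.2.
def error_induced_grammar(productions, sigma, q_none, q_sub, q_del,
                          q_ins_front, q_ins_end):
    E = {a: "E_" + a for a in sigma}                  # Step 1: nonterminals E_a
    P = []
    for lhs, rhs, p in productions:                   # Step 3: replace terminals by E_a
        P.append((lhs, tuple(E.get(s, s) for s in rhs), p))
    q_total_end = sum(q_ins_end.values())
    P.append(("S'", ("S",), 1.0 - q_total_end))       # Step 4(a)
    for a in sigma:
        P.append(("S'", ("S'", a), q_ins_end[a]))     # Step 4(b): insert a at the end
    for a in sigma:                                   # Step 5
        P.append((E[a], (a,), q_none[a]))             # (a) no error
        for b in sigma:
            if b != a:
                P.append((E[a], (b,), q_sub[a][b]))   # (b) substitute a by b
        P.append((E[a], (), q_del[a]))                # (c) delete a (empty rhs)
        for b in sigma:
            P.append((E[a], (b, E[a]), q_ins_front[a][b]))  # (d) insert b before a
    return P

# Toy grammar S -> a S b | a b over the alphabet {a, b}.
toy = [("S", ("a", "S", "b"), 0.4), ("S", ("a", "b"), 0.6)]
tables = dict(q_none={"a": 0.9, "b": 0.9},
              q_sub={"a": {"b": 0.03}, "b": {"a": 0.03}},
              q_del={"a": 0.02, "b": 0.02},
              q_ins_front={"a": {"a": 0.02, "b": 0.03}, "b": {"a": 0.02, "b": 0.03}},
              q_ins_end={"a": 0.01, "b": 0.01})
for prod in error_induced_grammar(toy, ("a", "b"), **tables):
    print(prod)
```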

Suppose that y is an error-deformed string of x, $x = a_1 a_2 \cdots a_n$. By using the productions added to $P_s'$ by Step 3, we have $S \stackrel{p_i}{\Rightarrow} X$ where $X = E_{a_1} E_{a_2} \cdots E_{a_n}$ if and only if $S \stackrel{p_i}{\Rightarrow} x$, where $p_i$ is the probability of the ith derivation of x in $G_s$. Applying Step 4(a) first and then repeatedly applying Step 4(b), we can further derive $S' \stackrel{p_i'}{\Rightarrow} X\alpha_{n+1}$ where $p_i' = p_i q'(\alpha_{n+1})$. The productions in Step 5 generate $E_{a_i} \stackrel{q(\alpha_i \mid a_i)}{\Rightarrow} \alpha_i$ for all $1 \le i \le n$, where $\alpha_1, \alpha_2, \ldots, \alpha_{n+1}$ is a partition of y. Steps 5(a)-(d) correspond to the non-error, substitution, deletion and insertion transformations (the last allowing multiple insertions), respectively. Thus, in the stochastic language generated by $G_s'$, the probability assigned to y is obtained by summing, over all $x \in L(G_s)$ and over every way $r_i$ of deforming x into y, the probability of the corresponding derivation. The consistency of $L(G_s')$ can be proved [33].
It is proposed to use a modified Earley parser on $G_s'$ to implement the search for the most likely error correction. The algorithm is essentially Earley's algorithm [2] with a provision added to keep accumulating the probabilities associated with each step of the derivations.

ALGORITHM 5.3 (Maximum-likelihood error-correcting algorithm)

Input. A stochastic error-induced grammar $G_s' = (V', \Sigma', P_s', S')$ of $G_s$, and a string $y = b_1 b_2 \cdots b_m$ in $\Sigma'^*$.
Output. $q(y \mid G_s)$ and a string x such that $q(y \mid x)\,p(x) = q(y \mid G_s)$.
Method.
Step 1. Set j = 0 and add $[E \to \cdot S', 0, 1]$ to $I_j$.
Step 2. (a) If $[A \to \alpha \cdot B\beta, i, p]$ is in $I_j$ and $B \xrightarrow{q} \gamma$ is a production in $P_s'$, add the item $[B \to \cdot\gamma, j, q]$ to $I_j$.
(b) If $[A \to \alpha\cdot, i, p]$ is in $I_j$, then for each item $[B \to \beta \cdot A\gamma, k, q]$ in $I_i$: if no item of the form $[B \to \beta A\cdot\gamma, k, r]$ can be found in $I_j$, add the new item $[B \to \beta A\cdot\gamma, k, pq]$ to $I_j$; if $[B \to \beta A\cdot\gamma, k, r]$ is already in $I_j$, then replace r by pq if pq > r.
Step 3. If j = m go to Step 5. Otherwise, set j = j + 1.
Step 4. For each item in $I_{j-1}$ of the form $[A \to \alpha \cdot b_j\beta, i, p]$ add the item $[A \to \alpha b_j \cdot \beta, i, p]$ to $I_j$; go to Step 2.
Step 5. If the item $[E \to S'\cdot, 0, p]$ is in $I_m$, then $q(y \mid G_s) = p$; stop.

Algorithm 5.3 together with Algorithm 5.2 is called the Maximum-Likelihood Error-Correcting Parser.
For a multiclass classification problem, assume that there are K classes of patterns, denoted $C_1, C_2, \ldots, C_K$, each of which is described by a set of strings. The grammar $G_i$ inferred to characterize the strings in $C_i$ satisfies $L(G_i) \supseteq C_i$ for all i, $1 \le i \le K$. We call the strings in $L(G_i) - C_i$ unwanted or illegitimate strings caused by grammar error. By assigning very low probabilities to unwanted strings, a properly inferred stochastic grammar can discriminate between unwanted and wanted strings from their frequencies of occurrence [15].

Using Bayes rule, for a given string y the a posteriori probability that y belongs to class $C_j$ can be computed as

$$P(C_j \mid y) = \frac{q(y \mid C_j)\,P(C_j)}{\sum_{i=1}^{K} q(y \mid C_i)\,P(C_i)}$$

where $P(C_i)$ is the a priori probability of class $C_i$.
The Bayes decision rule classifies y as class $C_j$ if

$$P(C_j \mid y) = \max_{1 \le i \le K} P(C_i \mid y).$$
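As a minimal sketch (our own illustration; the likelihood values are placeholders assumed to come from the error-correcting parsers of Algorithms 5.2-5.3):

```python
# Sketch: Bayes decision rule over K classes given q(y|C_i) and priors P(C_i).
def bayes_classify(likelihoods, priors):
    """likelihoods[i] = q(y | C_i), priors[i] = P(C_i); returns (class index, posterior)."""
    joint = [q * p for q, p in zip(likelihoods, priors)]
    total = sum(joint)
    posteriors = [j / total for j in joint]
    best = max(range(len(joint)), key=posteriors.__getitem__)
    return best, posteriors[best]

print(bayes_classify(likelihoods=[2.1e-4, 7.5e-6, 9.0e-5], priors=[0.3, 0.3, 0.4]))
```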

6. Stochastic tree grammars and languages

In this section we present some major results on stochastic tree languages [4].

DEFINITION 6.1. A stochastic tree grammar $G_s$ is a 4-tuple $G_s = (V, r', P, S)$ over the ranked alphabet $\langle V_T, r \rangle$ where
- $\langle V, r' \rangle$ is a finite ranked alphabet such that $V_T \subseteq V$ and $r'|_{V_T} = r$. The elements of $V_T$ and of $V - V_T = V_N$ are called terminal and nonterminal symbols, respectively.
- P is a finite set of stochastic production rules of the form $\Phi \xrightarrow{p} \Psi$, where $\Phi$ and $\Psi$ are trees over $\langle V, r' \rangle$ and $0 < p \le 1$.
- $S \subseteq T_V$ is a finite set of start symbols, where $T_V$ is the set of all trees over V.

DEFINITION 6.2. $\alpha \xrightarrow{p}_a \beta$ in $G_s$ if and only if there is a production $\Phi \xrightarrow{p} \Psi$ in P such that $\alpha/a = \Phi$ and $\beta = (\alpha \leftarrow_a \Psi)$; i.e., $\Phi$ is a subtree of $\alpha$ at address a and $\beta$ is obtained by replacing the occurrence of $\Phi$ at a by $\Psi$. We write $\alpha \xrightarrow{p} \beta$ in $G_s$ if and only if there exists an $a \in D_\alpha$, the domain of $\alpha$, such that $\alpha \xrightarrow{p}_a \beta$.

DEFINITION 6.3. If there exists a sequence of trees $t_0, t_1, \ldots, t_m$ such that

$$\alpha = t_0, \quad \beta = t_m, \quad t_{i-1} \xrightarrow{p_i} t_i, \quad i = 1, \ldots, m,$$

then we say that $\alpha$ generates $\beta$ with probability $p = \prod_{i=1}^{m} p_i$ and denote this derivation by $\alpha \stackrel{p}{\Rightarrow} \beta$ (or $\alpha \vdash^p \beta$).

The sequence of trees $t_0, \ldots, t_m$ is called a derivation of $\beta$ from $\alpha$. The probability associated with this derivation is equal to the product of the probabilities associated with the sequence of stochastic productions used in the derivation.

DEFINITION 6.4. The language generated by a stochastic tree grammar $G_s$ is

$$L(G_s) = \Big\{ (t, p(t)) \;\Big|\; t \in T_{V_T},\; S \stackrel{p_j}{\Rightarrow} t,\; j = 1, \ldots, k,\; p(t) = \sum_{j=1}^{k} p_j \Big\}$$

where $T_{V_T}$ is the set of all trees over $V_T$, k is the number of all distinctly different derivations of t from S, and $p_j$ is the probability associated with the jth distinct derivation of t from S.

DEFINITION 6.5. A stochastic tree grammar $G_s = (V, r', P, S)$ over $\langle V_T, r \rangle$ is simple if and only if all rules of P are of the form

$$X_0 \xrightarrow{p} x(X_1 \cdots X_{r(x)}), \qquad X_0 \xrightarrow{q} X_1, \qquad\text{or}\qquad x(X_1 \cdots X_{r(x)}) \xrightarrow{r} X_0$$

where $X_0, X_1, \ldots, X_{r(x)}$ are nonterminal symbols, $x \in V_T$ is a terminal symbol, and $0 < p, q, r \le 1$. Here $x(X_1 \cdots X_{r(x)})$ denotes the tree with root x and ordered sons $X_1, \ldots, X_{r(x)}$; a rule of the first form can also be written as $X_0 \to x X_1 \cdots X_{r(x)}$.

LEMMA 6.6. Given a stochastic tree grammar $G_s = (V, r, P, S)$ over $\langle V_T, r \rangle$, one can effectively construct a simple stochastic tree grammar $G_s' = (V', r', P', S')$ over $V_T$ which is equivalent to $G_s$.

PROOF. To construct $G_s'$, examine each rule $\Phi_i \xrightarrow{p} \Psi_i$ of P. Introduce new symbols $U_a^i$ and $V_b^i$ for each $a \in D_{\Phi_i}$ and $b \in D_{\Psi_i}$.
Let $P'$ consist of:
(1) rules that contract the tree $\Phi_i$ a level at a time, having the form $x(U_{a\cdot1}^i \cdots U_{a\cdot n}^i) \to U_a^i$ where $\Phi_i(a) = x \in V_{T_n}$ and $V_{T_n}$ is the set of terminal symbols with rank n;
(2) the rule $U_0^i \xrightarrow{p} V_0^i$;
(3) rules of the form $V_a^i \xrightarrow{1} x(V_{a\cdot1}^i \cdots V_{a\cdot n}^i)$ that expand $V_0^i$ to the tree $\Psi_i$.
The construction of $G_s'$ is clearly effective. We must now show $L(G_s') = L(G_s)$.
Note that $\Phi_i \vdash^p \Psi_i$ in $G_s'$ for each rule $\Phi_i \xrightarrow{p} \Psi_i$ in P.
Suppose that $\alpha \xrightarrow{p} \beta$ in $G_s$. Then for some rule $\Phi_i \xrightarrow{p} \Psi_i$ in $G_s$, $\alpha/a = \Phi_i$ and $\beta = (\alpha \leftarrow_a \Psi_i)$; i.e., $\Phi_i$ is a subtree of $\alpha$ at a and $\beta$ is obtained by replacing the occurrence of $\Phi_i$ at a by $\Psi_i$ with probability p.
By the above argument $\alpha/a = \Phi_i \vdash^p \Psi_i = \beta/a$ in $G_s'$; hence $\alpha \vdash^p (\alpha \leftarrow_a \Psi_i) = \beta$ in $G_s'$. Thus $\alpha \vdash^p \beta$ in $G_s$ implies $\alpha \vdash^p \beta$ in $G_s'$, and $L(G_s) \subseteq L(G_s')$.
For the converse, suppose $\alpha \in L(G_s')$, i.e., $S \vdash^p \alpha$ in $G_s'$ and $\alpha \in T_{V_T}$. A deduction of $S \vdash^p \alpha$ in $G_s$ may be constructed as follows: examine the deduction $S \vdash^p \alpha$ in $G_s'$. Each time a rule $U_0^i \xrightarrow{p} V_0^i$ is applied at b, apply the rule $\Phi_i \xrightarrow{p} \Psi_i$ at b. The result will be a deduction of $S \vdash^p \alpha$ in $G_s$, since if the rule $U_0^i \xrightarrow{p} V_0^i$ can be applied at b, all contracting rules of $P^i$ (i.e., those involving $U_a^i$, $a \in D_i$) must have been applied previously at the corresponding addresses $b \cdot a$, and all expanding rules of $P^i$ (i.e., those involving $V_a^i$, $a \in D_i$) must be applied later at $b \cdot a$, since all symbols $U_a^i$ and $V_a^i$ are elements of $V' - V_T$.
Note that the application of a single rule $\Phi_i \xrightarrow{p} \Psi_i$ in $G_s$ simulates the application of all the rules of $P^i$. An example should make this clear.

EXAMPLE 6.7. Let $G_s = (V, r, P, S)$ where $V = \{+, x, S\}$, $V_T = \{+, x\}$, $r(+) = 2$, $r(x) = 0$ and

$$P:\quad (1)\; S \xrightarrow{p} +(S\,x), \qquad (2)\; S \xrightarrow{q} x, \qquad p + q = 1.$$

In this case $P'$ is:

$$(1')\; S \xrightarrow{p} U_0^1, \quad (2')\; U_0^1 \xrightarrow{1} V_0^1, \quad (3')\; V_0^1 \xrightarrow{1} +(V_1^1\,V_2^1), \quad (4')\; V_1^1 \xrightarrow{1} S, \quad (5')\; V_2^1 \xrightarrow{1} x,$$
$$(6')\; S \xrightarrow{q} U_0^2, \quad (7')\; U_0^2 \xrightarrow{1} V_0^2, \quad (8')\; V_0^2 \xrightarrow{1} x.$$

Note that productions (1'), (2'), ..., (5') in $P'$ are the result of production (1) in P, and productions (6'), (7'), (8') are due to production (2) in P.
A simple deduction in $G_s'$ is as follows:

$$S \xrightarrow{p} U_0^1 \xrightarrow{1} V_0^1 \xrightarrow{1} +(V_1^1\,V_2^1) \xrightarrow{1} +(S\,V_2^1) \xrightarrow{1} +(S\,x) \stackrel{q}{\Rightarrow} +(x\,x), \qquad S \stackrel{pq}{\Rightarrow} +(x\,x).$$

The corresponding deduction in $G_s$ is

$$S \xrightarrow{p} +(S\,x) \xrightarrow{q} +(x\,x), \qquad S \stackrel{pq}{\Rightarrow} +(x\,x).$$

Note that if the tree $\Phi_i$ on the left-hand side of the production rule is a single symbol of the alphabet V, we will have no contracting production rules in our grammar.

DEFINITION 6.8. A stochastic tree grammar $G_s = (V, r, P, S)$ over $\langle V_T, r \rangle$ is expansive if and only if each rule in P is of the form

$$X_0 \xrightarrow{p} x(X_1 \cdots X_{r(x)}) \qquad\text{or}\qquad X_0 \xrightarrow{p} x$$

where $x \in V_T$ and $X_0, X_1, \ldots, X_{r(x)}$ are nonterminal symbols contained in $V - V_T$.

EXAMPLE 6.9. The following is a stochastic expansive tree grammar. $G_s = (V, r, P, S)$ over $\langle V_T, r \rangle$ where $V_N = V - V_T = \{S, A, B, C\}$, $V_T = \{a, b, \$\}$, $r(a) = \{2, 0\}$, $r(b) = \{1, 0\}$, $r(\$) = 2$, and P:

$$(1)\; S \xrightarrow{1.0} \$(A\,B), \quad (2)\; A \xrightarrow{p} a(A\,B), \quad (3)\; A \xrightarrow{1-p} a,$$
$$(4)\; B \xrightarrow{q} b(C), \quad (5)\; B \xrightarrow{1-q} b, \quad (6)\; C \xrightarrow{1.0} a,$$

with $0 \le p \le 1$, $0 \le q \le 1$.

DEFINITION 6.10. Define a mapping $h: T_{V_T} \to V_{T_0}^*$ as follows:
(i) $h(t) = x$ if $t = x \in V_{T_0}$ (obviously, $p(t) = p(x)$);
(ii) $h\big(x(t_1 \cdots t_n)\big) = h(t_1) \cdots h(t_n)$ if $x \in V_{T_n}$, $n > 0$.
Obviously,

$$p\big(x(t_1 \cdots t_n)\big) = p(x)\,p(t_1) \cdots p(t_n).$$

The function h forms the string in $V_{T_0}^*$ obtained from a tree t by writing down the frontier of t. Note that the frontier is obtained by writing, in order, the labels of all end points (leaves) of the tree t.
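A small sketch (our own illustration, with trees written as nested tuples (label, child_1, ..., child_n) and a bare label standing for a leaf) of the frontier mapping h:

```python
# Sketch: the frontier mapping h of Definition 6.10 on nested-tuple trees.
def frontier(t):
    """h(t): the string of leaf labels of t, read left to right."""
    if not isinstance(t, tuple):
        return t                        # a rank-0 node contributes its own label
    _, *children = t
    return "".join(frontier(c) for c in children)

print(frontier(("$", "a", "b")))                      # "ab":  the tree $(a b)
print(frontier(("$", ("a", "a", "b"), ("b", "a"))))   # "aba": the tree $(a(a b) b(a))
```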

THEOREM 6.11. If $L_T$ is a stochastic tree language, then $h(L_T)$ is a stochastic context-free language with the same probability distribution on its strings as on the trees of $L_T$. Conversely, if $L(G_s')$ is a stochastic context-free language, then there is a stochastic tree language $L_T$ such that $L(G_s') = h(L_T)$ and both languages have the same probability distribution.

PROOF. By Lemma 6.6, if $L_T$ is a stochastic tree language, there is a simple stochastic tree grammar $G_s = (V, r, P, S)$ such that $L_T = L(G_s)$. Let

$$P' = \Big\{ X_0 \xrightarrow{p} X_1 \cdots X_n \;\Big|\; X_0 \xrightarrow{p} x(X_1 \cdots X_n) \in P,\; x \in V_{T_n},\; n > 0 \Big\} \cup \Big\{ X_0 \xrightarrow{p} x \;\Big|\; X_0 \xrightarrow{p} x \in P,\; x \in V_{T_0} \Big\}.$$

Then if $G_s'$ is the stochastic context-free grammar $(V - V_T, V_{T_0}, P', S)$,

$$L(G_s') = h(L(G_s)) = h(L_T).$$

For the converse, suppose $L(G_s')$ is generated by the stochastic context-free grammar $G_s' = (V' - V_T', V_T', P', S)$. It may be assumed that all rules of $G_s'$ are of the form (Chomsky normal form)

$$X_0 \xrightarrow{p} X_1X_2 \qquad\text{or}\qquad X_0 \xrightarrow{p} x$$

where $X_0, X_1, X_2 \in V' - V_T'$ and $x \in V_{T_0}'$.
Let $V_T = V_T' \cup \{+\}$ and $V = V' \cup \{+\}$ where $+ \notin V'$. Let $r(x) = 0$ for $x \in V_T'$ and $r(+) = 2$. Let

$$P = \Big\{ X_0 \xrightarrow{p} +(X_1\,X_2) \;\Big|\; X_0 \xrightarrow{p} X_1X_2 \in P' \Big\} \cup \Big\{ X_0 \xrightarrow{p} x \;\Big|\; X_0 \xrightarrow{p} x \in P' \Big\}.$$

Let $G_s = (V, r, P, S)$; then, with $L_T = L(G_s)$, $L(G_s') = h(L(G_s)) = h(L_T)$ and both languages have the same probability distribution. This completes the proof.

DEFINITION 6.12. By a consistent stochastic representation for a language $L(G_s)$ generated by a stochastic tree grammar $G_s$, we mean that the following condition is satisfied:

$$\sum_{t \in L(G_s)} p(t) = 1$$

where t is a tree generated by $G_s$, $L(G_s)$ is the set of trees generated by $G_s$, and p(t) is the probability of the generation of the tree t.
The set of consistency conditions for a stochastic tree grammar Gs is the set of
conditions which the probability assignments associated with the set of stochastic
tree productions in $G_s$ must satisfy such that $G_s$ is a consistent stochastic tree grammar. The consistency conditions of stochastic context-free grammars have been discussed in Section 2. Since nonterminals in an intermediate generating tree appear only at its frontier, they can be considered to be causing further branching. Thus, if only the frontier of an intermediate tree is considered at each level of branching then, by Theorem 6.11, the consistency conditions for stochastic tree grammars are exactly the same as those for stochastic context-free grammars, and the tree generating mechanism can be modelled by a generalized branching process [23].
Let $P = \Gamma_{A_1} \cup \Gamma_{A_2} \cup \cdots \cup \Gamma_{A_K}$ be the partition of P into equivalence classes such that two productions are in the same class if and only if they have the same premise (i.e., the same left-hand-side nonterminal). For each $\Gamma_{A_j}$ define the conditional probabilities $\{p(t \mid A_j)\}$ as the probability that the production rule $A_j \to t$, where t is a tree, will be applied to the nonterminal symbol $A_j$, where $\sum_{\Gamma_{A_j}} p(t \mid A_j) = 1$.
Let $\mu_l(t)$ denote the number of times the variable $A_l$ appears in the frontier of the tree t of the production $A_j \to t$.

DEFINITION 6.13. For each $\Gamma_{A_j}$, $j = 1, \ldots, K$, define the K-argument generating function $g_j(S_1, S_2, \ldots, S_K)$ as

$$g_j(S_1, S_2, \ldots, S_K) = \sum_{\Gamma_{A_j}} p(t \mid A_j)\, S_1^{\mu_1(t)} \cdots S_K^{\mu_K(t)}.$$

EXAMPLE 6.14. For the stochastic tree grammar $G_s$ in Example 6.9,

$$g_1(S_1, S_2, S_3, S_4) = p(\$(A\,B) \mid S)\,S_2S_3 = S_2S_3,$$
$$g_2(S_1, S_2, S_3, S_4) = p(a(A\,B) \mid A)\,S_2S_3 + p(a \mid A) = pS_2S_3 + (1 - p),$$
$$g_3(S_1, S_2, S_3, S_4) = p(b(C) \mid B)\,S_4 + p(b \mid B) = qS_4 + (1 - q),$$
$$g_4(S_1, S_2, S_3, S_4) = p(a \mid C) = 1.0.$$

These generating functions can be used to define a generating function that


describes all ith level trees.
Note that for statistical properties, two i th level trees are equivalent if they
contain the same number of non-terminal symbols of each type in the frontiers.

DEFINITION 6.15. The ith level generating function $F_i(S_1, S_2, \ldots, S_K)$ is defined recursively as

$$F_0(S_1, S_2, \ldots, S_K) = S_1, \qquad F_1(S_1, S_2, \ldots, S_K) = g_1(S_1, S_2, \ldots, S_K),$$
$$F_i(S_1, S_2, \ldots, S_K) = F_{i-1}\big(g_1(S_1, \ldots, S_K), g_2(S_1, \ldots, S_K), \ldots, g_K(S_1, \ldots, S_K)\big).$$

$F_i(S_1, S_2, \ldots, S_K)$ can be expressed as $F_i(S_1, S_2, \ldots, S_K) = G_i(S_1, S_2, \ldots, S_K) + C_i$ where $G_i(\cdot)$ does not contain any constant term. The constant term $C_i$ corresponds to the probability of all trees $t \in L(G_s)$ that can be derived in i or fewer levels.

THEOREM 6.16. A stochastic tree grammar $G_s$ with unrestricted probabilistic representation R is consistent if and only if

$$\lim_{i \to \infty} C_i = 1.$$

PROOF. If the above limit is not equal to 1, this means that there is a finite
probability that the generation process enters a generating sequence that has a
finite probability of never terminating. Thus, the probability measure defined
upon L(Gs) will always be less than 1 and R will not be consistent. On the other
hand, if the limit is 1, this means that no such infinite generation sequence exists
since the limit represents the probability measure of all trees that are generated by
the application of a finite number of production rules. Consequently R is
consistent.

DEFINITION 6.17. The expected number of occurrences of the nonterminal symbol $A_j$ in the production set $\Gamma_{A_i}$ is

$$e_{ij} = \frac{\partial g_i(S_1, S_2, \ldots, S_K)}{\partial S_j}\bigg|_{S_1 = S_2 = \cdots = S_K = 1}.$$

The first moment matrix E is defined as

$$E = [e_{ij}]_{1 \le i, j \le K}.$$

LEMMA 6.18. A stochastic tree language with probabilistic representation R is


consistent if all the eigenvalues of E are smaller than 1. Otherwise, it is not
consistent.

EXAMPLE 6.19. In this example the consistency condition for the stochastic tree grammar $G_s$ of Example 6.9 is found directly in part (a), and the consistency criterion of Lemma 6.18 is then verified in part (b).
(a) The set of trees generated by $G_s$ is as follows:
Tree t                     Probability of generation p(t)
$(a b)                     (1-p)(1-q)
$(a b(a))                  (1-p)q
$(a(a b) b)                p(1-p)(1-q)²
$(a(a b(a)) b(a))          p(1-p)q²
... etc.
In all of the above trees production (1) is always applied. If production (2) is applied (n-1) times, there will be one 'A' and n 'B's in the frontier of the tree so obtained. Production (3) is then applied when no further application of production (2) is needed. Of the n 'B's in the frontier, any one, two, three or all n may have production (4) applied, and to the remaining 'B's production (5) is applied. Production (6) always follows production (4). Thus we have

P( t ) = (1- p ) P°[ 'Co(1- q ) + 'C,q]


t ~ L(G s)

+(1-- p)pt[2Co(1-- q)2 + 2C,q(1-- q)+ ZC2q2]


+(1-p)p2[3Co(1-q)3 +3C,q(1-q) 2
+ 3C2q2(1 - q ) + 3C3q3]

+(1-p)p" '["Co(1-q)" +"C,(1-q)"-' q + ' "


+ n Cr(l_q) n
rqr+...+nC,,qn ] + . . .
Applications of stochastic languages 441

Note that the power of p in the above terms shows the number of times
production (2) has been applied before applying production (3). So

p(t)=(1-p)[~-q+q]
t@ L(Gs)

q+qt
+(1--p)p"-l[(~-q+q)"]+...
or
Z p(t)=(1-p)+(1-p)p+...(l--p)p" ~+...
t E L(Gs)

=(l_p)[l+pl+pZ+...p, 1+...]
=(1-p)X 1 (ifp<l)=l

Hence, Gs is consistent for all values of p such that 0 ~ p <~1.


(b) Let us find the consistency condition for the grammar Gs using Lemma 6.18
and verify the consistency criterion. From Example 6.7 we obtain

E=
0 ,01
0
0
p
0
p
0
0
q "
0 0 0 0

The characteristic equation for E is ~b(~-)=('r--p)~ -3. Thus, the probability


representation will be consistent as long as 0 ~<p < 1. The value of q is con-
strained only for the normalization of production probabilities.
Hence Gs is consistent.
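The consistency result can also be illustrated by simulation. The sampling sketch below is our own (the depth bound and function names are assumptions): it grows the frontier of nonterminals for the grammar of Example 6.9 and estimates how often a derivation terminates.

```python
# Sketch: Monte Carlo check of consistency for the grammar of Example 6.9.
import random

def sample_terminates(p, q, max_steps=10_000):
    """Grow the nonterminal frontier; True if it empties within max_steps pops."""
    frontier = ["S"]
    for _ in range(max_steps):
        if not frontier:
            return True
        symbol = frontier.pop()
        if symbol == "S":
            frontier += ["A", "B"]               # S -> $(A B)
        elif symbol == "A" and random.random() < p:
            frontier += ["A", "B"]               # A -> a(A B); otherwise A -> a
        elif symbol == "B" and random.random() < q:
            frontier += ["C"]                    # B -> b(C); otherwise B -> b
        # C -> a and the terminal alternatives add no nonterminals
    return False

random.seed(0)
for p in (0.3, 0.9, 0.999):
    rate = sum(sample_terminates(p, q=0.5) for _ in range(2000)) / 2000
    print(p, rate)   # near 1 for p well below 1; slower termination as p approaches 1
```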

7. Application of stochastic tree grammars to texture modelling

Research on texture modelling in picture processing has received increasing attention recently [43]. Most of the previous research has concentrated on the
statistical approach [22, 42]. An alternative approach is the structural approach
[31]. In the structural approach, a texture is considered to be defined by
subpatterns which occur repeatedly according to a set of well-defined placement
rules within the overall pattern. Furthermore, the subpatterns themselves are
made of structural elements.
We have proposed a texture model based on the structural approach [18, 34]. A
texture pattern is divided into fixed-size windows. Repetition of subpatterns or a
portion of a subpattern may appear in a window. A windowed pattern is treated
Fig. 4. Two tree structures for texture modeling: (a) Structure A; (b) Structure B.

as a subpattern and is represented by a tree. Each tree node corresponds to a single pixel or a small homogeneous area of the windowed patterns. A tree
grammar is used to characterize windowed patterns of the same class. Two
convenient tree structures and their corresponding windowed patterns are il-
lustrated in Fig. 4. The advantage of the proposed model is its computational
simplicity. The decomposition of a pattern into fixed-size windows and the use of
a fixed tree structure for representation make the texture analysis procedures and
its implementation very easy. However, the proposed model is very sensitive to
local noise and structural distortion such as shift, rotation and fluctuation. In this
section we will describe the use of stochastic tree grammars and high level syntax
rules to model local noise and structural distortions.
Figs. 5a and 5b are digitized pictures of the patterns D22 and D68 from
Brodatz' book Textures [8]. For simplicity we use only two primitives, black as
primitive '1', and white as primitive '0'. For pattern D22, the reptile skin, we may
consider that it is the result of twisting the regular tessellation such as the pattern
shown in Fig. 6. The regular tessellation pattern is composed of two basic

Fig. 5a. Texture pattern: D22, reptile skin.

subpatterns shown in Fig. 7. A distorted tessellation can result from shifting a


series of basic subpatterns in one direction. Let us use the set of shifted
subpatterns as the set of primitives. There will be 81 such windowed pattern
primitives. Fig. 8 shows several of them. A tree grammar can be constructed for
the generation of the 81 windowed patterns [18, 34]. Local noise and distortion of
the windowed patterns can be taken care of by constructing a stochastic tree
grammar. The procedure of inferring a stochastic tree grammar from a set of
texture patterns is described in [18, 35]. A tree grammar for the placement of the
81 windowed patterns can then be constructed for the twisted texture pattern. A
generated pattern D22 using a stochastic tree grammar is shown in Fig. 9.
The texture pattern D68, the wood grain pattern, consists of long vertical lines. It shows a higher degree of randomness than D22. No clear tessellation or subpattern exists in the pattern. Using vertical lines as subpatterns we can construct a stochastic tree grammar G68 to characterize the repetition of the subpatterns. The density of vertical lines depends on the probabilities associated with the production rules. Fig. 10 shows two patterns generated from G68 using different sets of production probabilities.

"]i i i
iH~i~'i !i=

=i i ~ - :

iii~il: ! :I!! g

i 71
Fig. 5b. Texture pattern: D68, woodgrain.

Fig. 6. The ideal texture of pattern D22.



Fig. 7. Basic pattern of Fig. 6.

Fig. 8. Windowed pattern primitives.

Fig. 9. Synthesis results for pattern D22.



Fig. 10. Synthesis results for pattern D68.

$G_{68} = (V, r, P, S)$ where $V = \{S, A, B, 0, 1\}$, $V_T = \{0, 1\}$, $r(0) = r(1) = \{0, 1, 2, 3\}$, and P is

$$S \xrightarrow{0.5} 0(A\,S\,A), \quad S \xrightarrow{0.05} 0(A\,A), \quad S \xrightarrow{0.09} 0(B\,S\,A), \quad S \xrightarrow{0.09} 1(B\,S\,A),$$
$$S \xrightarrow{0.09} 0(A\,S\,B), \quad S \xrightarrow{0.09} 1(B\,S\,B), \quad S \xrightarrow{0.09} 1(A\,S\,B),$$
$$A \xrightarrow{0.90} 0(A), \quad A \xrightarrow{0.05} 0(B), \quad A \xrightarrow{0.05} 0,$$
$$B \xrightarrow{0.85} 1(B), \quad B \xrightarrow{0.10} 1, \quad B \xrightarrow{0.05} 1(A).$$
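The following Python sketch is a much-simplified illustration in the spirit of this model and is not the G68 grammar itself: it assumes a spine that spawns one column nonterminal per position, with A-chains expanding downward into runs of 0 and B-chains into runs of 1, so that sampled windows show vertical streaks loosely resembling Fig. 10.

```python
# Simplified sketch: sampling vertical-streak textures from chain-like
# column productions (A -> 0(A), B -> 1(B), with occasional switches/stops).
import random

def sample_column(symbol, height, p_keep=0.9, p_switch=0.05):
    """Expand A (zeros) or B (ones) downward; occasionally switch or restart."""
    col, current = [], symbol
    for _ in range(height):
        col.append("0" if current == "A" else "1")
        r = random.random()
        if r < p_switch:                       # A -> 0(B) / B -> 1(A) style switch
            current = "B" if current == "A" else "A"
        elif r > p_keep + p_switch:            # terminal production: end the run
            current = random.choice("AB")      # start a fresh streak
    return col

def sample_window(width=40, height=16, p_line=0.3):
    # Spine: each position spawns a column nonterminal, B (a dark line) w.p. p_line.
    columns = [sample_column("B" if random.random() < p_line else "A", height)
               for _ in range(width)]
    return ["".join(col[row] for col in columns) for row in range(height)]

random.seed(1)
print("\n".join(sample_window()))
```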

8. Conclusions and remarks

Stochastic string languages are first introduced and some of their applications
to coding, pattern recognition, and language analysis are briefly described in this
paper. With probabilistic information about the process under study, maximum-
likelihood and Bayes decision rules can be directly applied to the coding/decod-
ing and analysis of linguistic source and the classification of noisy and distorted
linguistic patterns. Stochastic finite-state and context-free languages are easier to
analyze compared with stochastic context-sensitive languages; however, their
descriptive power for complex processes is less. The consistency problem of stochastic context-sensitive languages is still under investigation. This, of


course, directly limits the practical applications of stochastic context-sensitive
languages except in a few special cases (e.g., stochastic context-sensitive languages
generated by stochastic context-free programmed grammars [15]). Only very
limited results in grammatical inference are practically useful [16, 18]. Efficient
inference algorithms are definitely needed before we can design a system to
automatically infer a grammar from sample sentences of a language.
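As an illustration of the maximum-likelihood and Bayes decision rules mentioned above, the sketch below picks the class whose stochastic source assigns the highest (prior-weighted) probability to an input pattern. The callable models standing in for stochastic parsers, and the toy models in the example, are hypothetical placeholders, not procedures from this paper.

```python
import math

def bayes_classify(pattern, class_models, priors=None):
    """Return the class label maximizing log P(pattern | G_i) + log P(G_i).

    `class_models` maps a label to a callable returning the probability assigned
    to `pattern` by that class's stochastic grammar/automaton; such a callable
    (e.g. a stochastic parser) is assumed to exist and is not defined here.
    With uniform priors this is the maximum-likelihood rule.
    """
    if priors is None:
        priors = {label: 1.0 / len(class_models) for label in class_models}
    best_label, best_score = None, -math.inf
    for label, model in class_models.items():
        p = model(pattern)
        score = (math.log(p) if p > 0 else -math.inf) + math.log(priors[label])
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy stand-in models (hypothetical, for illustration only):
models = {
    "many_ones":  lambda s: 0.8 ** s.count("0") * 0.9 ** s.count("1"),
    "many_zeros": lambda s: 0.9 ** s.count("0") * 0.8 ** s.count("1"),
}
print(bayes_classify("0010001", models))   # -> "many_zeros"
```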
Stochastic tree grammars are also introduced and some of their properties
studied. Tree grammars have been used in the description and modelling of
fingerprint patterns, bubble chamber pictures, highway and river patterns in
LANDSAT images, and texture patterns. In order to describe and model noisy and
distorted patterns more realistically, stochastic tree grammars have been sug-
gested. We have briefly presented some recent results in texture modeling using
stochastic tree grammars. For a given stochastic tree grammar describing a set of
patterns we can construct a stochastic tree automaton which will accept the set of
patterns with their associated probabilities [4]. In the case of multiclass recogni-
tion problems, the maximum-likelihood or Bayes decision rule can be used to
decide the class label of an input pattern represented by a tree [15, 18]. In order to
characterize the patterns of interest realistically, it would be nice to have the
stochastic tree grammar actually inferred from the available pattern samples.
Such an inference procedure requires the inference of both tree grammar and its
production probabilities. Unfortunately, a general inference procedure for sto-
chastic tree grammars is still a subject of research. Only some very special cases in
practice have been discussed [7, 35].

References

[1] Aho, A. V. and Peterson, T. G. (1972). A minimum distance error-correcting parser for
context-free languages. SIAM J. Comput. 4.
[2] Aho, A. V. and Ullman, J. D. (1972). Theory of Parsing, Translation and Compiling, Vol. 1. (Vol.
2: 1973). Prentice-Hall, Englewood Cliffs.
[3] Bahl, L. R. and Jelinek, F. (1975). Decoding for channels with insertions, deletions and
substitutions with applications to speech recognition. IEEE Trans. Inform. Theory 21, 4.
[4] Bhargava, B. K. and Fu, K. S. (1974). Stochastic tree system for syntactic pattern recognition.
Proc. 12th Annual Allerton Conf. on Comm., Control and Comput., Monticello, IL, U.S.A.
[5] Booth, T. L. (1974). Design of minimal expected processing time finite-state transducers. Proc.
IFIP Congress 74. North-Holland, Amsterdam.
[6] Booth, T. L. (1969). Probability representation of formal languages. IEEE 10th Annual Symp.
Switching and Automata Theory.
[7] Brayer, J. M. and Fu, K. S. (1977). A note on the k-tail method of tree grammar inference.
IEEE Trans. Systems Man Cybernet. 7 (4) 293-299.
[8] Brodatz, P. (1966). Textures. Dover, New York.
[9] Chomsky, N. (1956). Three models for the description of language. IEEE Trans. Inform. Theory
2, 113-124.
[10] Fu, K. S. (1968). Sequential Methods in Pattern Recognition and Machine Learning. Academic
Press, New York.

[11] Fu, K. S. (1972). On syntactic pattern recognition and stochastic languages. In: S. Watanabe,
ed., Frontiers of Pattern Recognition. Academic Press, New York.
[12] Fu, K. S. and Huang, T. (1972). Stochastic grammars and languages. Internat. J. Comput.
Inform. Sci. 1 (2) 135-170.
[13] Fu, K. S. (1973). Stochastic languages for picture analysis. Comput. Graphics and Image
Processing 2 (4) 433-453.
[14] Fu, K. S. and Bhargava, B. K. (1973). Tree systems for syntactic pattern recognition. IEEE
Trans. Comput. 22, 1087-1099.
[15] Fu, K. S. (1974). Syntactic Methods in Pattern Recognition. Academic Press, New York.
[16] Fu, K. S. and Booth, T. L. (1975). Grammatical inference: Introduction and survey, I-II. IEEE
Trans. Systems Man Cybernet. 5, 95-111, 409-423.
[17] Fu, K. S. (1976). Tree languages and syntactic pattern recognition. In: C. H. Chen, ed., Pattern
Recognition and Artificial Intelligence. Academic Press, New York.
[18] Fu, K. S. (1981). Syntactic Pattern Recognition and Applications. Prentice-Hall, Englewood Cliffs.
[19] Fung, L. W. and Fu, K. S. (1975). Maximum-likelihood syntactic decoding. IEEE Trans.
Inform. Theory 21.
[20] Fung, L. W. and Fu, K. S. (1976). An error-correcting syntactic decoder for computer networks.
Internat. J. Comput. Inform. Sci. 5 (1).
[21] Grenander, U. (1969). Foundation of pattern analysis. Quart. Appl. Math. 27, 1-55.
[22] Haralick, R. M., Shanmugam, K. and Dinstein, I. (1973). Texture features for image classifica-
tion. IEEE Trans. Systems Man Cybernet. 3.
[23] Harris, T. E. (1963). The Theory of Branching Processes. Springer, Berlin.
[24] Hellman, M. E. (1973). Joint source and channel encoding. Stanford Electronics Lab., Stanford
University.
[25] Hutchins, S. E. (1970). Stochastic sources for context-free languages. Ph.D. Dissertation,
University of California, San Diego, CA.
[26] Jelinek, F. (1968). Probabilistic Information Theory. McGraw-Hill, New York.
[27] Keng, J. and Fu, K. S. (1976). A syntax-directed method for land use classification of LANDSAT
images. Proc. Symp. Current Math. Problems in Image Sci., Monterey, CA, U.S.A.
[28] Lafrance, J. E. (1971). Syntax-directed error-recovery for compilers. Rept. No. 459, Dept. of
Comput. Sci., University of Illinois, IL, U.S.A.
[29] Lee, H. C. and Fu, K. S. (1972). A stochastic syntax analysis procedure and its application to
pattern recognition. IEEE Trans. Comput. 21, 660-666.
[30] Li, R. Y. and Fu, K. S. (1976). Tree system approach to LANDSAT data interpretation. Proc.
Symp. Machine Processing of Remotely Sensed Data, Lafayette, IN, U.S.A.
[31] Lipkin, B. S. and Rosenfeld, A., eds. (1970). Picture Processing and Psychopictorics, 289-381.
Academic Press, New York.
[32] Lipton, R. J. and Snyder, L. (1974). On the optimal parsing of speech. Res. Rept. No. 37, Dept.
of Comput. Sci., Yale University.
[33] Lu, S. Y. and Fu, K. S. (1977). Stochastic error-correcting syntax analysis for recognition of
noisy patterns. IEEE Trans. Comput. 26, 1268-1276.
[34] Lu, S. Y. and Fu, K. S. (1978). A syntactic approach to texture analysis. Comput. Graphics and
Image Processing 7.
[35] Lu, S. Y. and Fu, K. S. (1979). Stochastic tree grammar inference for texture synthesis and
discrimination. Comput. Graphics and Image Processing 8, 234-245.
[36] Moayer, B. and Fu, K. S. (1976). A tree system approach for fingerprint pattern recognition.
IEEE Trans. Comput. 25 (3) 262-274.
[37] Persoon, E. and Fu, K. S. (1975). Sequential classification of strings generated by SCFG's.
Internat. J. Comput. Inform. Sci. 4 (3) 205-217.
[38] Smith, W. B. (1970). Error detection in formal languages. J. Comput. System Sci.
[39] Souza, C. R. and Scholtz, R. A. (1969). Syntactical decoders and backtracking S-grammars.
ALOHA System Rept. A69-9, University of Hawaii.
[40] Suppes, P. (1970). Probabilistic grammars for natural languages. Synthese 22, 95-116.

[41] Tanaka, E. and Fu, K. S. (1976). Error-correcting parsers for formal languages. Tech. Rept.
EE-76-7, Purdue University.
[42] Weszka, J. S., Dyer, C. R. and Rosenfeld, A. (1976). A comparative study of texture measures
for terrain classification. IEEE Trans. Systems Man Cybernet. 6.
[43] Zucker, S. W. (1976). Toward a model of texture, Comput. Graphics and Image Processing 5,
190-202.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 451-477

A Unifying Viewpoint on Pattern Recognition*

J. C. Simon, E. Backer and J. Sallentin

0. Introduction

Pattern Recognition is a burgeoning field of a rather unwieldy nature. Algo-


rithms and 'ad hoc' techniques are proposed (or even independently discovered)
in many applied fields and by scientists of different cultures: physicists, natu-
ralists, doctors, engineers, and also mathematicians, statisticians or computer
scientists. The same algorithms are often proposed (or 'discovered') under differ-
ent names. They also are presented with different explanations or justifications.
This paper is an effort to present from a single point of view a large class of
PR algorithms, through an approach similar to the structural approach that
was successful in mathematics. Of course the Theory of Groups did not help
anyone to perform additions or multiplications, but it helped to unify a very large
class of operations which had been considered quite different. The presentation is
now simpler and even new powerful properties have been shown, at least in
mathematics.
Without antagonizing anyone, we hope to show what is common to approaches
to PR considered now as quite different, such as the statistical approach, the
fuzzy set approach, the clustering approach and some others.

1. Representations and interpretations

1.1. Notations and basic definitions


We would like to recall some conventions, as we have used them already [1].

Information
Information is a loose concept, which is made of two parts: a representation
and one or more interpretations. Later we speak of couples or pairs of 'representa-
tion-interpretation'.
*An earlier version of this article has been published in the journal Signal Processing, Volume 2
(1980) pp. 5-22 under the title: "A Structural Approach of Pattern Recognition".


Representation
A representation is the material support of information: for example a string
of letters or bits, or the result of a measurement, such as an image. Let such a value
be represented by a string of italic letters,

X = (x_1, x_2, ..., x_n).

We represent a variable measurement by a string of roman letters, such as

x = (x_1, x_2, ..., x_n).

The set of all defined X is called the Representation Space X.

Interpretation
A representation may have many interpretations:
- trivial: the nature of the element of rank j in the representation string.
- an identification: the 'name' of the object represented by X. Such an interpreta-
tion is the most frequent in Pattern Recognition (PR) for a representation of an
object, measured by some physical sensors.
- a property, such as a truth value, an assertion, a term, an expression.
- an action: a program is represented by a string of 'instructions', themselves
translated into a string of bits, interpreted by a computer as actions.
- in practice, many others, as witnessed by man in everyday life...
Of course one may ask why we choose to call 'interpretation' the result of
a process applied to a representation. We believe that in our field of PR, related to
understanding and linguistics, the term is appropriate, as this frame of concepts has
been advocated quite early in linguistics [2].
We call the semantics of a representation the set of interpretations which may be
found from this representation. Again we underline that this set may be infinite
and/or ill defined. But after all the set of properties of a mathematical object
such as a number may also be infinite. To demonstrate a theorem we only use a
finite set of properties.
Thus 'information', which is the couple of a representation and of one (or
more) interpretation, may be quite ill defined. This we see in everyday life: a
representation is understood quite differently by different people, especially if they
belong to different cultures.

Identification
Let Ω be the set of names, Ω = {ω_1, ω_2, ..., ω_p}. An identification is a mapping
E from the representation space into the set of names,

E: X → Ω,

written also as E: (x_1, x_2, ..., x_n) → ω.


An identification is the simplest interpretation of a representation. But
many other interpretations may be considered as we will see later on.

PR operators or programs
Of course such a mapping is only a mathematical description. It has to be
implemented in a constructive way. A PR operator or algorithm effectively performs
the task of assigning a name when the input data is a representation. The PR specialists
are looking for such algorithms; they implement them on computer systems.
Finding such efficient PR algorithms and programs is the main goal of the
Pattern Recognition field.

Interpretations of an object
An identification is not the only interpretation sought for the representation of
an object:
(i) A feature is the result of a partial identification. For example, a phoneme in a
spoken word, a segment in a letter, a contour or a texture in an image. Sometimes
the term initial level feature is used instead of representation. It points out the
fact that such representations are obtained through physical sensors from the
outside universe. Thus a representation is already an interpretation of the outside
world.
(ii) A fuzzy identification is sometimes preferred to the identification by yes or
no. It may be defined as a 'multimapping' of X in Ω × f,

E_f: X → Ω × f  or  E_f: (x_1, ..., x_n) → {(ω_i, f_i) | ∀i}.

f_i is a membership function, with real values in the interval [0, 1].


(iii) A class or cluster is a name given to a set of elements.
(iv) More generally the symbolic description of a class may be:
-The most representative element, such as the center of gravity or the element
closest to the center of gravity.
- The n_j most representative elements, the skeleton for instance.
- A representation of the class, such as a linear manifold, a geometrical representa-
tion.
- A concept: " a mental representation of something perceived through the senses"
(Britannica World Language Dictionary).
It is represented usually by a sentence in a language.
- A statement, a logical proposition, an expression. These strings of symbols are
generated by a syntax; they may themselves be interpreted as 'true' or 'false'.
The only way we use symbolic descriptions is through a similarity measure,
between the representation of the object and its interpretation (the symbolic
description). As we see later these similarity measures may be called by very
different names. But in fact they play the same part in the determination of an
identification. The aim of this paper is to study and to show their common
properties.

Similarity, distance
Let X, Y, Z be entities that we wish to compare. Note that they are not always
of the same nature. Later on they may be taken as objects, classes or what we
called symbolic descriptions, expressions, operators, etc. Let ≻ be an order
operation on the set of couples, with the following interpretation:
(X, Y) ≻ (X, Z) means that X is more 'similar' to Y than to Z, or that the
'resemblance' is greater. More generally this operation may be interpreted as a
'natural association'.
A constructive procedure to build this order is to implement a similarity (or
dissimilarity) measure, i.e. a function of real values, the domain of which is the set
of couples.

Similarity (resemblance)

μ(X, X) = sup μ, (1.1)
μ(X, Y) = μ(Y, X), (1.2)
μ(X, Y) ≥ μ(X, Z) is equivalent to (X, Y) ≻ (X, Z). (1.3)

Dissimilarity (dissemblance)

λ(X, X) = inf λ, (1.4)
λ(X, Y) = λ(Y, X), (1.5)
λ(X, Y) ≤ λ(X, Z) is equivalent to (X, Y) ≻ (X, Z). (1.6)

Distance. A distance is a dissimilarity measure which also satisfies the 'triangle
inequality':

d(X, Y) ≤ d(X, Z) + d(Y, Z), (1.7)
d(X, Y) = 0 is equivalent to X = Y. (1.8)

1.2. Remarks
(I) Usually an identification may be described as a multilevel process. Let us
take the example of the identification of written words.
From the initial representation level, for instance the pixel level of an image, a
first group of interpretations is obtained. They result in the identification of a
certain number of 'features', such as segments, curves, crossings, extremities, etc.
From this level the letters are found. Then in another level the word is identified
from the letters.
Thus starting from a representation level, an identification process allows access to
an interpretation level, which then becomes the new representation level.
Such a scheme is more general than the 'historical' PR scheme: feature
identification followed by a 'classifier'. It is now currently said that image and
speech recognition are such multilevel processes and may be described as an
interactive, competitive system of procedures, either inductive (data driven) or
deductive (concept driven) [3].

(II) Any partial or intermediate identification process has to be implemented


by a program, i.e. a combination of primitive operators. The problem which a PR
specialist faces is to choose the appropriate operators and to combine them.
Some are chosen for their simplicity, such as the linear operator, or because
they pertain to some physical property of the problem, for instance the contour
filters or the texture detectors of digital images, the segmentation operators of
speech. Usually the first level uses arithmetic operators such as the filters in signal
processing. The upper levels rely on syntactic or linguistic operators. In speech
recognition, these techniques are now used even at the lower levels [4].
However, at the lowest level a large class of operators is directly inspired by the
properties of the representation space. They may be designated as the characteristic
function processes [5]. In fact these functions are similarity measures as we will see
later.
(III) Many PR specialists like to oppose the statistical approach to the syntactic
approach. In fact this distinction, which has a historical interest, does not seem
justified now.
Should we not say now that the syntactic approach relies on the properties of
a set of operators (which may be syntactic) and that the statistical approach relies
on the properties of the representation space? In fact, in the so-called statistical
approach it is customary that the statistical assumptions are not justified, and
sometimes they are even ignored. Even if the setting up of a 'probability density'
should be justified by the statistical properties of the PR problem, it is always used
as a similarity function.
As it has been underlined by Diday and Simon [6], clustering, which is typically
a lower level process, is determined only by the data of similarity functions.
Of course the statistical properties and techniques may be very interesting to
establish and justify these similarity measures. But as soon as these similarity
measures are determined, the identification is also determined.

1.3. Properties of a representation space


The representation X of an object has been defined as a list of measurements.
Let E be a finite set of m representations X.
Most often it is implied that E is a sample of a more general infinite set X into
which E is embedded. Intuitively it seems natural that any variable point X of X
may be obtained by a measurement; thus this idea of an infinite representation
space. But in certain instances it is not clear at all that such an infinite set is
defined (attainable) everywhere or even exists. Most of the efforts of the
statistical approach to PR are oriented towards the reconstruction of such a space X.
In the first place, let us consider what we may assert on a finite set of m
representations X^j (1 ≤ j ≤ m). The experimenter may see an order between the
couples (X^j, X^h). The usual way to implement this order is by the use of a
similarity or dissimilarity measure. In other words a triangular table may be set
up in which a real number corresponds to any couple. According to the strict
order property of real numbers, an order exists between the couples. It may be
assumed that such an order is a basic property of the data. Let δ(j, h) be such a
dissimilarity measure. It is clear that it does not always satisfy the triangle
inequality (1.7).

PROPERTY. A dissimilarity relation being given on a finite set E, it is always
possible to find a homeomorphic mapping of δ into R such that the resulting
dissimilarity measure is a distance and induces the same order on the set of pairs
of E.

Such a homeomorphism should map inf δ onto 0 and should add a constant to all
the values of δ(j, h), such that (1.7) is verified for the whole table.
Similar properties may be found for a similarity measure.
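A minimal sketch of one way to realize the property stated above: shifting all off-diagonal values of a finite dissimilarity table by a suitable constant yields a distance that induces the same order on the pairs. The particular shift used here is an assumption made for illustration, not the authors' construction.

```python
import itertools

def to_distance(delta):
    """Turn a finite dissimilarity table into a distance preserving the order.

    `delta` is a dict {(i, j): value} on a finite set.  Adding a constant
    c >= hi - 2*lo to every off-diagonal value guarantees
    d(i, j) <= d(i, k) + d(k, j); the shift is monotone, so the order on the
    set of pairs is unchanged.
    """
    points = sorted({p for pair in delta for p in pair})
    lo, hi = min(delta.values()), max(delta.values())
    c = max(hi - 2 * lo, 0.0)
    d = {}
    for i, j in itertools.combinations(points, 2):
        v = delta.get((i, j), delta.get((j, i)))
        d[(i, j)] = d[(j, i)] = v + c
    for p in points:
        d[(p, p)] = 0.0
    return d

table = {("a", "b"): 1.0, ("a", "c"): 5.0, ("b", "c"): 2.0}
print(to_distance(table))
```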
We deal now with the properties of an infinite representation space X , into
which E is embedded. Two basic properties have to be examined for such a space:
(a) is it a metric space?
(b) has it a density measure?

Topology and metric spaces


By referring to the work and language of mathematical topology we have a
clear way to state the properties relevant to our problem. For a reference textbook
see [7]. Let us recall some definitions.
Topology T on a set X. It is a non-empty collection of subsets of X, called open
sets, satisfying four axioms (using the union, intersection and complementation
operations).
Basis. It is a subcollection β of open sets, such that any open set is a union of
some open sets of β.
Countable basis. A basis β is countable if the number of open sets of β is
countable.
Neighbourhood. The neighbourhood of a point p is a set N containing p and
also some open set of X which contains p.
Hausdorff spaces. Every pair of distinct points has disjoint neighbourhoods.
Compact spaces. Every open cover has a finite subcover.
Many other concepts are defined and studied: basis of a topology, closed sets,
interior, exterior, frontier, limit point, continuous or homeomorphic maps, etc. Note
that no interesting property is obtained from a finite set X.
However, to define constructively a collection of open sets is not an easy job.
Usually it is done through the use of distances:
Metric spaces. Assuming that there exists a distance d(p, x) between any two
points p and x of X, an r-neighbourhood is N_r = {x ∈ X | d(x, p) < r}. Such
neighbourhoods allow one to form a collection of open sets. R, R^n and Hilbert space
are metric spaces.
Every metric space is a Hausdorff space.
Metrizable spaces. A topological space is metrizable if there exists an injective
mapping of the space into a metric space.
Important theorems allow one to bridge topological spaces in general and metric
spaces:
A compact Hausdorff space having a countable basis is metrizable.

One may question whether the Hausdorff condition is verified everywhere, espe-
cially with the finite precision of the measurements and of the computations.
However it seems quite reasonable to give the Hausdorff quality to an infinite
representation space, in other words to assume that two distinct points should
have separate neighbourhoods. Thus from now on we assume that a representa-
tion space X is a metric space. The problem is of course to find the metric.
We have seen how an experimental table of dissimilarities on E may be
transformed into a table of distances, without changing the order on the couples of
points. This table of distances provides a sampling of the general distance measure
on X. The problem is to find a distance algorithm which agrees with the
measured distance table. It is a generalization problem, of which there exist so
many in PR.

Density measure
Let us call (intentionally) μ(X) a density measure at X ∈ X, a function of X
taking its values in R^+.
Many efforts are made in 'statistical' PR to build up such a density from an
experimental distribution of representations X of objects. This density is used as a
similarity measure between an object X and a class.
These efforts are along two lines: either the objects are labeled or not labeled.

Labeled. (1) Probability densities are obtained through various statistical tech-
niques (parametric) and also by interpolating techniques (non parametric) [8].
(2) k-Nearest Neighbours (k-NN) [9], [10], [11]. Note also the Shared k-NN
approach of Jarvis [12] and the Mutual NN of Gowda [13].
(3) Potential or Inertia functions, [1].
(4) Fuzzy belonging [14].
All of these measures are interpreted and used as a similarity between an object
and a concept, such as a class (also called aggregate, taxon, OTU (operational
taxonomic unit), fuzzy set) [1].

Unlabeled. In clustering techniques the knowledge on the problem is usually
given as some unlabeled density function, obtained by the potential or inertia
function from the existing samples and the distance between points. It is assumed
that this density is the sum of the densities pertaining to each class: μ(X) = Σ_i μ_i(X)
[1]. Some practical ways to obtain this density are to build up the minimum
spanning tree (MST) or the k-NN graph.
The clustering techniques use this knowledge to build up the clusters from
the regions of high density.
It is important to note that the algorithms to obtain these densities use the
distance between points, i.e. the property that the representation space is a metric
space.
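The following sketch shows one crude way, assumed here only for illustration (it is not the estimator of [9-11]), in which a density at a point can be obtained from the distances to the k nearest labeled samples and then used as a similarity between an object and a class.

```python
def knn_density(x, samples, k=3):
    """Crude k-NN density estimate at x from a sample of labeled points.

    A minimal one-dimensional sketch: the density is taken as
    k / (m * volume of the smallest ball around x containing k samples),
    the 'volume' being twice the k-th smallest distance.
    """
    m = len(samples)
    dists = sorted(abs(x - s) for s in samples)
    r_k = dists[k - 1] or 1e-12          # avoid a zero radius
    return k / (m * 2.0 * r_k)

class_a = [0.1, 0.2, 0.25, 0.3, 0.4]
class_b = [1.0, 1.1, 1.3, 1.4]
x = 0.35
# The class-conditional densities act as similarity measures between x and each class.
print(knn_density(x, class_a), knn_density(x, class_b))
```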

Remark
It should be clear that the hypothesis that a representation space is a metric
space is a 'strong' one, even if it is the most frequent. Some finite data cannot be
Fig. 1.

embedded in such an infinite space. We should be careful about the experimental
validity of our data. For instance if the x_i can take only one of a few values (binary
for example), it is not legitimate to extend them in R. They are 'qualitative' values.
On the other hand, although measured amplitudes such as grey levels in an image
are always finite in number, according to the precision of the sensors, it is legitimate
to extend them in R, thus considering them as 'quantitative' values.

1.4. Interpretation spaces and their internal laws


We first examine the interpretation spaces directly deduced from the represen-
tation space.

The representation space is finite


Let us assume that the representation space is the finite set E. The first example
of an interpretation space is the set P = 2^E of the subsets of E. To 'classify' is to
separate E into k disjoint subsets C_i, with E = ∪_i C_i.
The basic operations on such a finite (but exponentially large) set P are the union ∪,
the intersection ∩, and the complementation ∁. Under these laws, the elements of P
form a distributive lattice or algebra.
Let L, L' ∈ P. The lattice relations may be represented by Fig. 1.
A class such as C_i may be obtained from the elements X_i of E by a succession
of union operations.

Hierarchies. Hierarchies are a special set H of subsets of E, such that if
L, L' ∈ H ⊂ P, then either L ∩ L' = ∅, or L ∩ L' ≠ ∅ and L ⊂ L' or L' ⊂ L.
Hierarchies may be represented by trees.
We later outline how hierarchies are obtained from E and a distance on the
elements of E.

The representation space is a metric space


Again the interpretation space P is the set of subsets of X. The basic operations
are ∪, ∩, ∁, as for finite sets, but now an infinite number of operations may be
considered. The symmetric difference ∆ is also utilized: A ∆ B = (A ∩ ∁B) ∪ (B ∩ ∁A).

These laws on P make a distributive lattice of this set, sometimes also called a
semi-ring or σ-algebra. The distributivity means that

A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C). (1.9)

A similar relation would be obtained, replacing ∪ by ∆.


Hierarchies are special subsets of P, with the same properties as for the finite
space P.
Fuzzy sets are built up on P. Apart from questions of language and terminology,
the algorithms that they suggest do not seem different from the usual ones. We
will come back to this question in the following paragraph.

Languages as interpretation spaces


We have looked up the interpretation spaces deduced directly from the repre-
sentation space, i.e. the set of subsets which, equipped with operations, becomes a
structure (the representation space is E or X).
But such interpretations are not the only possible ones. The interpretation of an
object may also be a sentence in a language. Let us outline different occurrences of
such interpretations:

Terms. We will call a term a sentence naming an object. The set of terms is a
generalization of the set Ω of names. The languages of terms are regular, i.e. may
be recognized by a finite automaton [15].

Logical expressions. The languages of different logics have been well defined,
for instance cf. [15]. A sentence of such a language is called an expression. It is
formed with terms and formulas. The formulas are built up with logical connec-
tives, according to a syntax and a number of axioms. An essential point of these
logical languages is that an expression may be interpreted as 'true' or 'false'.
In the classical sentential logic the basic connectives are 'and' ∧ and 'or' ∨. But many
others have been proposed, with different syntax and semantics. They have
allowed other logics to be proposed, such as the predicate logics (extensions of the
classical sentential logic), the modal, the intuitionistic, the fuzzy and the quantum logics.
Let us simply point out that the expressions of the sentential logic form an
algebra, called a boolean algebra, isomorphic to an algebra of sets; thus we may
obtain the same structure as that of a distributive lattice, where the connectives
'and' and 'or' play respectively the same role as the operations 'intersection' and
'union' for sets [15].
By suppressing some axioms, introducing new connectives, other logics are
formed in which the distributivity of the lattice structure is not any more certain.

Natural languages. A sentence in natural language may be considered as a


proposition or predicate on an object (of course not always...) [16]. Human beings
use the sentences of natural language as the interpretation space of their perceptions
of the world of objects. Of course the formalization is a lot more difficult.
The languages of different logics have in fact been proposed to model the natural
language.

Operators, programs, algorithms. A programming language is a set of instruc-


tions with syntactic rules to form the proper sentences of the language. Such a
sentence, i.e. a program, is interpreted by a computer as actions.
Usually the domain is formalized as 'recursive functions'. In PR the domain of
primitive recursive functions is of main interest [17]. The basic operations are
concatenation, composition and recursion.
The term operator denotes that these algorithmic functions may be imple-
mented by machines. Instead of speaking of the interpretation of an object it is
preferable in this context to speak rather of relevance, interest for a recognition.
An application of these ideas will be examined in connection with Information
Measures.

2. Laws and uses of similarity

Similarity, dissimilarity measures, distances have been defined. Some examples


have been given. It was advanced that these measures form the basis for an
interpretation at the first levels of an identification.
We will show now how the structure of the interpretation space induces laws on the
similarity or dissimilarity measures.

2.1. Laws and structure of a similarity


The Subsections 1.3 and 1.4 gave the properties of representation spaces and of
interpretation spaces. The first group of properties may be called 'data driven';
they come from the structure of the representation space and from the knowledge
of a similarity or dissimilarity measure (later on we call it simply a measure). The
second group of properties may be called 'concept driven'; they pertain to the
interpretation space.
The central issue of PR is how a problem can be translated into concepts, in other
words how to find an appropriate interpretation space related to the problem and the
set of operators which allow one to pass from the representation to the interpretation.
However, in this paper we are concerned with another fundamental issue: how
should the data driven representation structure be related in general to the concept
driven interpretation structure?
We will show that most of the time, an homomorphism exists between the
measure and the interpretation space.
Let us designate by f(X; A) a similarity or a dissimilarity. X is a variable
element (a point) of the representation space. A is an element of the structure of
the interpretation space.
We now come to the main object of this paper, answering the following question:

Knowing f(X; A) and f(X; B), what are the values of f(X; A ∪ B) and f(X; A ∩ B)?

Fig. 2. A PR problem: modelling of the data gives the representation space with its
data driven structure; formalization of the concepts gives the interpretation space with
its concept driven structure; the question is the homomorphism between the two.

The answer generally found by the users is to use on f two homomorphisms, i.e.
to find two laws on f such that
⊕ is an additive law, homomorphic to ∪:

f(X; A ∪ B) = f(X; A) ⊕ f(X; B). (2.1)

∗ is a multiplicative law, homomorphic to ∩:

f(X; A ∩ B) = f(X; A) ∗ f(X; B). (2.2)

If necessary the multiplicative law is distributive with respect to the additive
law. This will be true if the structure of the interpretation is itself distributive.
Fig. 2 illustrates the above viewpoints. We will see that for most PR problems
there exists an homomorphic correspondence between the concept driven struc-
ture and the data driven structure, in other words between the interpretation and
the knowledge given by the similarity or dissimilarity measure.

The range of f
f takes its values in the real domain R, but its interval of variation or range R is
usually a part of R. For example: {0, 1}, two values only, the range of the
characteristic function of a set; [0, 1], the range of probability and of fuzzy
belonging; R^+, the range of distances; etc.

Semi ring
Let us recall now the definition of a semi ring. It is a structure Z on a set
(range) R,

Z = ⟨R, ⊕, ∗, 0, 1⟩.

⊕ is an associative law, called addition.
0 is the identity element of this law; a ⊕ 0 = a.
∗ is an associative law, called multiplication.
1 is the identity of this law; a ∗ 1 = a.
0 is an 'absorbing' element; a ∗ 0 = 0 ∗ a = 0.
∗ is distributive with respect to ⊕; a ∗ (b ⊕ c) = (a ∗ b) ⊕ (a ∗ c).
Most of the time each law is commutative, but it is not always necessary.
Thus, with the two laws induced by the homomorphisms of the interpretation
space, most of the time f has the structure of a topologic semi ring; 'topologic'
because of the topology of R.
Of course, if the structure of the interpretation space is such that only one law
is considered, the structure on f has only one law. It is a topologic semi group. For
example, hierarchies consider only union; the corresponding law is unique (but
may be anything between INF and SUP, as we see later).

Examples of semi rings

Let us give some examples of semi rings.

Z_1 = ⟨{0, 1}, +, ×, 0, 1⟩,
Z_2 = ⟨{0, 1}, SUP, INF, 0, 1⟩,
Z_3 = ⟨[0, 1], SUP, INF, 0, 1⟩,
Z_4 = ⟨[0, 1], SUP, ×, 0, 1⟩,
Z_5 = ⟨[0, 1], x + y − xy, ×, 0, 1⟩,
Z_6 = ⟨R_∞^+, INF, +, +∞, 0⟩, with R_∞^+ = R^+ ∪ {+∞},
Z_7 = ⟨R_∞^+, SUP, INF, 0, +∞⟩,
Z_8 = ⟨R_∞^+, +, ×, 0, 1⟩,
Z_9 = ⟨R_∞, INF, +, +∞, 0⟩.

We will see their use on similarity measures.
Any of the first or second laws may generate a semi group.
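The sketch below combines two similarity values under the laws (2.1) and (2.2) for two of the semi rings listed above, Z_3 (SUP, INF) and Z_5 (probabilistic sum and product); the numerical values are invented for illustration.

```python
# Combination of similarity values under two of the semi rings listed above:
# Z3 = ([0,1], SUP, INF) and Z5 = ([0,1], x + y - xy, x*y).  The two functions
# implement (2.1) and (2.2): the additive law for A union B, the multiplicative
# law for A intersection B.
SEMIRINGS = {
    "Z3": {"add": max, "mul": min},
    "Z5": {"add": lambda x, y: x + y - x * y, "mul": lambda x, y: x * y},
}

def f_union(fa, fb, ring):
    return SEMIRINGS[ring]["add"](fa, fb)

def f_intersection(fa, fb, ring):
    return SEMIRINGS[ring]["mul"](fa, fb)

fa, fb = 0.7, 0.4        # f(X; A) and f(X; B), assumed given
for ring in ("Z3", "Z5"):
    print(ring, f_union(fa, fb, ring), f_intersection(fa, fb, ring))

# Idempotence check: under Z3 both laws return f(X; A) when A = B, whereas
# under Z5 they do not (0.7 + 0.7 - 0.49 = 0.91, 0.7 * 0.7 = 0.49), which is
# why Z5 is used together with an independence hypothesis.
print(f_union(fa, fa, "Z3"), f_intersection(fa, fa, "Z3"))
print(f_union(fa, fa, "Z5"), f_intersection(fa, fa, "Z5"))
```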

Remark
The difference between a group and a semi group comes from the fact that no
inverse is defined for an element x: no −x for the addition, no x^{-1} for the
multiplication, as there always is for a group. We will show that this has some
important consequences.

Idempotent operations
An operation is called idempotent if, applied to the same element twice, it gives as a
result the element itself. For instance if ∪ and ∩ are respectively the set
operations union and intersection,

A ∪ A = A  and  A ∩ A = A;

the two operations are idempotent. If ∆ is the symmetric difference,

A ∆ A = ∅;

∆ is not idempotent.
Let us suppose that (2.1) and/or (2.2) are satisfied. If ○ is the operation on f,

f(X; A) ○ f(X; A) = f(X; A); (2.3)

the homomorphism implies that the operation on f is also idempotent. Then we
may consider two possibilities:
(a) The law on f has an inverse; it is a group:

f − f = 0  or  f × f^{-1} = 1.

Then f ○ f = f is realized only by the identity elements 0 or 1. The
only possible semi ring is Z_1 or Z_2.
(b) The law on f has no inverse; it is a semi group.
An example of this is INF or SUP, which are idempotent for any element of R:
INF(A; A) = A and SUP(A; A) = A.
As we see later this explains the interest of these operations, introduced in the
fuzzy set concept. Some other semi group laws may also be considered of course.
We may characterize them by their idempotent properties.
An alternative to (a) is to use other semi rings but to forbid the use of
idempotence. Let us take the example of probability density. Then Z_5 is utilised,
but with the hypothesis of independence between elements, which excludes an
idempotent operation:

p(X; A ∩ B) = p(X; A) × p(X; B), (2.4)

p(X; A ∪ B) = p(X; A) + p(X; B) − p(X; A ∩ B). (2.5)

Similar operations may be performed on a probability P or an information J. If two
events are independent in probability,

P(A ∩ B) = P(A) × P(B). (2.6)

The information given by their simultaneous realisation is usually assumed to be

J(A ∩ B) = J(A) + J(B). (2.7)

The range of J is R^+. The semi ring may be Z_6 [18].



Let us come back to the range [0, 1]; other operations may be taken. Let ⊙ be a
law such that

⊙(A; B) ≤ INF(A; B). (2.8)

⊙ is a contracting law, and formulas similar to (2.4) and (2.5) may be used:

for intersection:  ⊙(A; B);
for union:  1 − ⊙(1 − A; 1 − B). (2.9)

EXAMPLE. ⊙ is INF. Then 1 − ⊙(1 − A; 1 − B) is nothing but SUP.

2.2. Application to clustering and first level recognition


As it has been underlined, the first level recognition (between the representa-
tion space, or initial level after the sensors, and the interpretation space of the first
level 'features') uses similarity or distance measures, sometimes called 'character-
istic functions', with the varieties of probability and fuzzy belonging.
The above frame allows us to unify the different techniques under a common
structural point of view.

2.2.1. Hierarchies
The representation space is either a finite set E or a metric space X.
A hierarchy is a finite set H belonging to the set of parts of E or X, with the
following conditions:

{X} ∈ H ⊂ P (or P),
E (or X) ∈ H,
if h_i, h_j ∈ H, then either h_i ∩ h_j = ∅ (2.10)
or h_i ∩ h_j ≠ ∅ and either h_i ⊃ h_j or h_i ⊂ h_j.

A hierarchy is a semi lattice, even more a tree, in which the elements are obtained
by the operation ∪. Two elements h_i, h_j of H being given, there always exists an
element h called the least upper bound (l.u.b.), such that it is the smallest set with
the property h_i ⊂ h and h_j ⊂ h. If h_i ∩ h_j ≠ ∅, it is clear that h_i = h or h_j = h.

Ultrametric distances. Let us define measures compatible with the above struc-
ture.
h_i ∩ h_j ≠ ∅: h_i and h_j are elements of a chain ordered by ⊂,

X ⊂ h_1 ⊂ ... ⊂ h_i ⊂ ... ⊂ h_j. (2.11)

h"

X1 X2 X3

Fig. 3

Let λ be a measure on such chains:

λ(X) = 0 < λ(h_1) < ... < λ(h_i) < ... < λ(h_j). (2.12)

h_i ∩ h_j = ∅. Let h be the l.u.b. of h_i and of h_j. Then λ(h_i), λ(h_j) < λ(h). Let X
and X' be leaves of the hierarchy (tree). The distance between X and X' is
δ(X, X') = λ(h), where h is the l.u.b. of X and X'. Let X_1, X_2, X_3 be three leaves of
the hierarchy, h be the l.u.b. of X_1, X_2, and h' be the l.u.b. of X_1, X_3 and of X_2,
X_3; then

δ(X_1, X_2) < δ(X_2, X_3) = δ(X_1, X_3). (2.13)

From (2.13) every triangle X_1, X_2, X_3 is isosceles with a base smaller than the two
equal sides. (See Fig. 3.)
It is shown that such a proposition is equivalent to the following relation,
proper to ultrametric distances,

δ(i, j) ≤ SUP[δ(i, k), δ(j, k)]. (2.14)

(δ(i, j) is an abbreviation of δ(X_i, X_j).)

The problem faced by the builder of a hierarchy is precisely to find δ, from the
data of an ordinary distance d(X, X'). Only such an ultrametric measure will
satisfy the structure of the interpreting hierarchy.

Indexed hierarchy. Any h of H is an equivalence class on E or X such that
δ(X, X') = λ(h) for all X, X' belonging to h.
Starting from elements X of E or from disjoint classes C_i of X, the operation
union (or symmetric difference) allows us to build up a tree on which may exist a
measure λ, such that, if h = h_i ∪ h_j = h_i ∆ h_j = l.u.b.(h_i, h_j),

λ(h) > SUP[λ(h_i), λ(h_j)]. (2.15)

This relation, applied to three elements X or classes C_i, yields (2.14). Such a
structure is called an indexed hierarchy. (See Fig. 4.) The Lance and Williams
algorithm is a technique to build H and λ(h) simultaneously.

Fig. 4.

Two elements h_i and h_j are united in h according to a criterion. A generalised
distance D(i, j) is computed when, in the course of the process, h_i and h_j become
leaves of the tree. h_i and h_j are united if D(i, j) is minimum, and λ(h) = D(i, j).
Then (2.15) is obeyed. To compute the current D(i, j), a new h being formed,
D(h, h') has to be computed for every leaf h' of the tree.
Various operations may be utilised for the computation of D(i, j):
- INF corresponds to single linkage,
- MEAN corresponds to average linkage,
- SUP corresponds to complete linkage.
In fact any operation which gives a result greater than or equal to INF may be
utilised; otherwise the relation (2.15) would not hold, as it is easy to verify.
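A minimal sketch of the agglomerative scheme just described, with INF (single linkage) as the generalized distance D(i, j); the data and the representation of the merges are illustrative assumptions, not the Lance and Williams formulation itself.

```python
import itertools

def single_linkage(points, dist):
    """Minimal sketch of the agglomerative scheme described above.

    `points` is a list of labels and `dist` a function on pairs.  At each step
    the two current clusters with the smallest generalized distance D(i, j)
    (here INF over member pairs, i.e. single linkage) are united; lambda(h) is
    set to that minimum, so (2.15) holds and the induced delta is ultrametric.
    Returns the list of merges as (members_i, members_j, lambda).
    """
    clusters = [frozenset([p]) for p in points]
    merges = []
    while len(clusters) > 1:
        (ci, cj), d = min(
            (((a, b), min(dist(x, y) for x in a for y in b))
             for a, b in itertools.combinations(clusters, 2)),
            key=lambda t: t[1])
        clusters = [c for c in clusters if c not in (ci, cj)] + [ci | cj]
        merges.append((set(ci), set(cj), d))
    return merges

coords = {"X1": 0.0, "X2": 0.3, "X3": 2.0}
d = lambda a, b: abs(coords[a] - coords[b])
for m in single_linkage(list(coords), d):
    print(m)
# X1 and X2 are united at 0.3, then {X1, X2} and X3 at 1.7: every triangle of
# the resulting ultrametric distances is isosceles, as in (2.13).
```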

Remark. It appears clearly now that the necessity of an ultrametric distance


comes from the interpretation structure; an ordinary distance would not respect
the homomorphism.

2.2.2. Adaptive partitions


Hierarchy formation may be seen as an inductive process (data driven). The
knowledge on the problem is essentially given by the distances d(X,X'); the
results depend on the technique to compute an ultrametric distance.
Adaptive partition techniques, such as the "Dynamic Cluster Algorithm" of
Diday [19], work in a different way. There the number of disjoint classes is
chosen; a criterion is minimized. Though the classes are not known a priori, this
technique may be considered as more 'concept driven' [1]. Let

C = (C_1, ..., C_i, ..., C_k) (2.16)

be the k disjoint classes, and

A = (A_1, ..., A_i, ..., A_k) (2.17)

be the k corresponding 'kernels' or symbolic descriptions.

(I) The representation set is the finite set E.
From the distance d(X, X') are usually deduced
- a distance between X and A_i,

D(X, A_i) = Σ_{X' ∈ A_i} d(X, X') μ(X'); (2.18)

- a distance between A_i and C_i,

R(A_i, C_i) = Σ_{X ∈ C_i} D(X, A_i) μ(X). (2.19)

D and R may be considered as inertia measures.


(II) The representation space is a metric space X.
(2.18) and (2.19) may be extended to the continuous problem [20]. They
become the usual inertia formulas.
The basic operations performed on the interpretation spaces (C or A are
respectively elements of these spaces) are union and difference. The usual ring on
the measure is Z_8. But other semi ring laws may be considered, such as SUP
(MAX) or INF (MIN). As an example see the works of Hansen and Delattre [21].
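The following sketch is a much simplified, one-dimensional illustration of an adaptive-partition scheme of this kind; taking the kernels A_i to be class means and D(X, A_i) = |X − A_i| are assumptions made only to keep the example short, and are not necessarily the choices of [19].

```python
def dynamic_clusters(points, k, n_iter=10):
    """Minimal sketch of an adaptive-partition scheme in one dimension.

    The kernels A_i are taken here to be single representative values (the
    class means), which is only one of the possible symbolic descriptions;
    D(X, A_i) is |X - A_i| and the criterion, in the spirit of (2.19), is
    decreased by alternating assignments and kernel updates.
    """
    kernels = points[:k]                       # crude initialization
    for _ in range(n_iter):
        classes = [[] for _ in range(k)]
        for x in points:                       # assign X to the nearest kernel
            i = min(range(k), key=lambda i: abs(x - kernels[i]))
            classes[i].append(x)
        kernels = [sum(c) / len(c) if c else kernels[i]
                   for i, c in enumerate(classes)]
    criterion = sum(abs(x - kernels[i]) for i, c in enumerate(classes) for x in c)
    return classes, kernels, criterion

data = [0.1, 0.2, 0.3, 2.0, 2.1, 2.3]
print(dynamic_clusters(data, k=2))
```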

2.2.3. Strong and weak patterns


Usually in 'classification' the classes are defined first, and thus the interpretation
structure. Then we look for a data-driven structure homomorphic to the concept-
driven structure. On the contrary, cluster analysis goes the other way round: the
interpretation of the problem is inferred automatically from a data driven
structure. The laws in the representation domain infer the laws of the interpreta-
tion domain. Irrespective of the procedures, almost all the clustering algorithms
provide some partitioning of the data on the basis of some (dis)similarity measure
and a criterion which has to be optimized. In any case the final data-driven
structure is such that the intersections of the final clusters are empty. The
concept-driven structure of the interpretations is one of disjoint subsets.
Let us recall this idea [6, 22]. Let us suppose that by any clustering method,
different stable optimized partitions may be obtained, either by changing the
thresholds or the initial conditions. These different partitions C^(1), ..., C^(q) not
only allow us to learn new facts about the problem but also to define
another interpretation structure.
Let

Π = C^(1) ∧ ... ∧ C^(q) (2.20)

be the 'cross partition'. If X and X' are classified together in all the C^(i)
(1 ≤ i ≤ q), they belong to the same class π of Π, which is called a strong pattern.
From these subsets π of Π it is possible to build a hierarchy H. The equivalence
relation corresponding to a set h of H is that two elements X and X' of h are
classified p times in the same classes of C^(i) (1 ≤ p ≤ q). They are called weak
patterns.

Of course p = q for a class π.
If the classes of C^(i) are k in number, the number of classes of Π may be much
larger, thus detailing or 'refining' the interpretation structure. As in fuzzy
belonging, the relation of an element X to a concept (here a class) may be
estimated as a number between 0 and 1.
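The sketch below computes strong patterns and a graded (weak) association from several partitions, following the cross-partition idea (2.20); the two toy partitions are invented for illustration.

```python
from collections import defaultdict

def strong_and_weak_patterns(partitions):
    """Cross partition (2.20) from q partitions of the same set of elements.

    `partitions` is a list of dicts element -> class label.  Elements that get
    the same tuple of labels in all q partitions form a strong pattern; the
    number p of partitions in which two elements co-occur, divided by q, gives
    a graded (weak) association between 0 and 1.
    """
    q = len(partitions)
    strong = defaultdict(set)
    for x in partitions[0]:
        signature = tuple(part[x] for part in partitions)   # labels across C(1)..C(q)
        strong[signature].add(x)

    def association(x, y):
        p = sum(part[x] == part[y] for part in partitions)
        return p / q

    return list(strong.values()), association

C1 = {"a": 1, "b": 1, "c": 2, "d": 2}
C2 = {"a": 1, "b": 2, "c": 2, "d": 2}
strong, assoc = strong_and_weak_patterns([C1, C2])
print(strong)                 # strong patterns: {a}, {b}, {c, d}
print(assoc("a", "b"))        # 0.5: classified together in C1 only
```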
On the other hand, a natural way to infer concepts when the intersections of
subsets are not empty is to use laws of similarity which are homomorphic to a
distributive lattice on the set of subsets L_i.
These approaches [14, 23, 24] have in common a multimapping

E: X → L_i × f, (2.21)

where the range of f is [0, 1] and the interpretations form a distributive lattice.

2.3. Probability and fuzzy sets


As we have repeatedly stated, we wish to show that from the algorithmic point
of view there is no deep difference between the fuzzy set approach and the usual
probability density approach; the axiomatic difference bears on the laws of the
similarity measure.
We will examine in parallel the homomorphic correspondences usually found
in either the probability or the fuzzy point of view.

The use of INF, SUP, ×

Let 𝕃 be a set of conceptual entities L and a distributive lattice be determined on
𝕃 by union ∪ (or symmetric difference ∆) and intersection ∩. Let f be a
measure with range [0, 1] and ⊙ be one of the laws on f: INF, SUP, ×. Let
a, b, c be X or L.
The following formulas satisfy the homomorphism:

f(a; b ∩ c) = ⊙[f(a; b), f(a; c)], (2.22)

f(a; b ∪ c) = 1 − ⊙[(1 − f(a; b)), (1 − f(a; c))]. (2.23)

Such formulas are currently used in fuzzy set formulations for object-concept
similarity, for concept-concept similarity and for object-object similarity [30, 31, 32].

The use of the ring Z_5

As we have seen, this ring, which uses + and ×, allows one to remain in the range
[0, 1]. It is used for probability with the hypothesis of independence, made
necessary by the problems of idempotence (see Subsection 2.1). It has also been
used sometimes for fuzzy sets. Let us give examples from both domains.

In probability
Object-concept:

p(X; L ∩ L') = p(X; L) × p(X; L'), (2.24)

p(X; L ∪ L') = p(X; L) + p(X; L') − p(X; L) × p(X; L'). (2.25)

Concept-concept:

P(L, L') = (1/N) Σ_X p(X; L) × p(X; L'). (2.26)

Object-object:

p(X, X') = 1 − Σ_{i=1}^{n} p(X; L_i) × p(X'; L_i). (2.27)

Bayes probability of error for two classes:

r(L, L') = ∫_X p(X; L ∩ L') p(X) dX. (2.28)

Bhattacharyya coefficient for two classes:

b(L, L') = ∫_X [p(X; L) p(X; L')]^{1/2} p(X) dX. (2.29)

Similar formulas are used in fuzzy set formulations; for examples see [14,
Chapter 2].

The use of the semi rings with idempotent operations

The interest of semi rings such as Z_3, Z_4, Z_6, Z_7 and Z_9 is that idempotent
operations are possible for any f(X). It seems to be the main interest of the fuzzy
set measures of similarity, which have extensively used the semi rings Z_3, Z_4.
Let us give some examples.

For fuzzy sets
Object-concept:

f(X; L ∩ L') = INF[f(X; L), f(X; L')], (2.30)

f(X; L ∪ L') = SUP[f(X; L), f(X; L')]. (2.31)

Concept-concept:

F(L, L') = (1/N) Σ_X INF[f(X; L), f(X; L')]. (2.32)

Object-object:

λ(X, X') = ½ Σ_{i=1}^{m} (SUP[f(X; L_i), f(X'; L_i)] − INF[f(X; L_i), f(X'; L_i)]). (2.33)

Remarks
(1) The operation SUP is often interchanged with the 'average' (1/N) Σ. However
the average is not an associative operation and some care should be taken to
preserve the homomorphic properties.
(2) Note that in probability the conditional Bayes error for two classes is
written as

f(X; L ∩ L') = INF[p(X; L), p(X; L')]. (2.34)
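The following sketch places the fuzzy laws (2.30)-(2.32) and the probabilistic laws (2.24)-(2.26) side by side on the same membership values, reading the concept-concept measures as normalized sums over the objects; the numerical values are invented for illustration.

```python
# Object-concept and concept-concept measures under the two sets of laws
# discussed above; the membership/probability values below are invented.
def fuzzy_and(f1, f2):   return min(f1, f2)           # (2.30)
def fuzzy_or(f1, f2):    return max(f1, f2)           # (2.31)
def prob_and(p1, p2):    return p1 * p2               # (2.24), independence
def prob_or(p1, p2):     return p1 + p2 - p1 * p2     # (2.25)

# f[X][L]: degree to which object X is associated with concept L
f = {"X1": {"L": 0.8, "Lp": 0.3},
     "X2": {"L": 0.4, "Lp": 0.6}}

def concept_concept(and_law):
    # normalized sum over the objects, as in (2.26)/(2.32)
    values = [and_law(f[x]["L"], f[x]["Lp"]) for x in f]
    return sum(values) / len(values)

for name, law in (("fuzzy", fuzzy_and), ("probabilistic", prob_and)):
    print(name, concept_concept(law))
```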

2.4. Information and feature evaluation

The measures of information


An important amount of work has been done to estimate 'the information of an event'.
Sometimes the information is considered as directly deduced from the probability
measure through the formula

J(A) = −log P(A). (2.35)

Then of course (2.7) is obtained.


But efforts have been made to define an information measure directly [25, 26].
Let p and q be logical statements. If J(p) and J(q) are known, what are J(p ∨ q)
and J(p ∧ q)?
In general one may assume that

J(p ∨ q) = F[J(p), J(q)], (2.36)

J(p ∧ q) = H[J(p), J(q)]. (2.37)

Kampé de Fériet [18] studies F. For instance he shows that if p → q, then
J(p) ≥ J(q) and

J(p ∨ q) ≤ INF[J(p), J(q)] ≤ SUP[J(p), J(q)] ≤ J(p ∧ q). (2.38)



Van der Pyl [27] studies H and proposes to use, instead of (2.7),

J(p ∧ q) = J(p) + J(q) + kJ(p)J(q). (2.39)

(2.35) implies (2.7) and thus k = 0.

Feature evaluations
Many measures have been proposed for the evaluation of features, or rather, referring
to [17], for the evaluation of the PR operators which detect the existence of
the features: Mutual Information, Quadratic Entropy, Bhattacharyya and Bayesian
distances, Patrick-Fisher, Bayes error, Kolmogorov's, etc. For example see
[28].
Among these, Mutual Information has some interesting properties. Let A, B be
two operators to be evaluated; let Ω be the 'ideal operator' given by the training
set. Knowing I(A; Ω) and I(B; Ω), what can be said if A and B are in series or
in parallel?

Series:

I(A ∩ B; Ω) ≤ INF[I(A; Ω), I(B; Ω)]. (2.40)

Parallel:

I(A ∪ B; Ω) ≥ I(A; Ω) + I(B; Ω) − I(A; B). (2.41)

These relations should be compared to (2.4) and (2.5) concerning probability.
They show clearly that it is not possible to find an homomorphism between the
laws on I and the series or parallel operations on the operators.
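As an illustration of evaluating an operator A against the ideal operator Ω, the sketch below computes the mutual information I(A; Ω) from a joint probability table; the tables are toy examples, not data from the text.

```python
import math

def mutual_information(joint):
    """I(A; Omega) from a joint probability table {(a, w): p}.

    A minimal sketch of evaluating a feature detector A against the ideal
    operator Omega given by the training set.
    """
    pa, pw = {}, {}
    for (a, w), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pw[w] = pw.get(w, 0.0) + p
    return sum(p * math.log(p / (pa[a] * pw[w]))
               for (a, w), p in joint.items() if p > 0)

# Feature detector output a in {0, 1} versus true class w in {0, 1}.
joint_A = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}      # informative
joint_B = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}  # useless
print(mutual_information(joint_A), mutual_information(joint_B))
```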

2.5. Declarative statements


A concept is usually represented by a sentence in a natural language, under the
form of a declarative statement. A simplified model of such a statement is a
statement in a logical language.
A logical language is generated by a set of syntactic rules. It is made up of:
(i) a language of terms, recursively defined from constants and variables with
functions. It is a regular language [15].
(ii) a number of logical connectives, which obey a set of rules called axioms.
With the terms and the connectives are built the well formed formulas (wff)
respecting the axioms. They are usually called logical propositions, expressions or
statements. The most frequent connectives are ∧, ∨, ¬.
(iii) An interpretation 'true' or 'false' is assigned to any logical proposition p or
q. Two connectives I and 0 are always interpreted as true and false.

Let us consider a similarity measure f(X, p) between an object X and a


proposition p.
Many semantic interpretations may be given to f such as a natural association,
an interest, a generalized belonging, a verification index, a propensity, or simply a
similarity.

Remarks
We consider logics different from the 'classical' (boolean) logic. But in all of
these logics no quantifiers are used such as 'there exist' or 'for all'. Thus Modal
logic is not envisioned here.
All these logics are sentential logics. Their propositions form a lattice under the
operations ∨, ∧. But this lattice is not always distributive. Let us recall, for ease
of reading, some useful definitions of sentential logics.

Properties of the logics


(1) Contraposition:

If p ∧ q = p, then ¬p ∧ ¬q = ¬q. (2.42)

(2) Morgan's rules:

¬(p ∨ q) = ¬p ∧ ¬q  and  ¬(p ∧ q) = ¬p ∨ ¬q. (2.43)

(3) Double negation:

¬¬p = p. (2.44)

(4) Excluded middle:

p ∨ ¬p = I. (2.45)

(5) No contradiction:

p ∧ ¬p = 0. (2.46)

(6) Distributivity:

p ∧ (q ∨ r) = (p ∧ q) ∨ (p ∧ r). (2.47)

(7) Pseudomodularity:

If p ∧ q = p, then q ∧ (p ∨ ¬q) = p ∨ (q ∧ ¬q). (2.48)

The different sentential logics are distinguished from one another according to
which of the above properties they verify.

Table 1
Properties of sentential logics

Logics                          Properties
                                1    2    3    4    5    6    7
Classical                       ×    ×    ×    ×    ×    ×    ×
Quantum                         ×    ×    ×    ×    ×         ×
Fuzzy                           ×    ×    ×              ×    ×
Non-distributive fuzzy          ×    ×    ×                   ×
Intuitionist                    ×    ×              ×    ×    ×
Non-distributive intuitionist   ×    ×              ×         ×

An × in Table 1 signifies that the corresponding logic verifies the property.


For instance the classical logic verifies all of the seven above properties.
Let us discuss the different logics according to their properties.

Distributivity

THEOREM OF STONE. All the distributive logics (property (6)) are homomorphic to
a distributive lattice of subsets of a set.

Thus for the distributive logics we are again in the situation of Subsection 2.1.
Let us consider a semi-ring Z having an additive law ⊥ and a multiplicative law
∗; then

f(X; p ∨ q) = f(X; p) ⊥ f(X; q), (2.49)

f(X; p ∧ q) = f(X; p) ∗ f(X; q). (2.50)

But then, as before, we have to consider the idempotence question, as p ∨ p = p
and p ∧ p = p.
If the usual + and × are taken as addition and multiplication, and if (2.49)
and (2.50) are true, then the only possible ring is Z_1. The similarity reduces to a
binary decision 'true' or 'false'.
But if we take INF and SUP as laws on f, then any f(X; p) is an idempotent.
The semi-rings Z_3 or Z_7 may be considered.

Negation
(1) If (2.44) is verified, a corresponding law on f has to be found. For instance,
if the range is [0, 1],

f(X; ¬p) = 1 − f(X; p). (2.51)

Such a relation on f for the negation has been chosen by Watanabe and, in
fuzzy logic, by Zadeh.

(2) If (2.46) is verified, and if the range is [0, 1], then we should have

f(X; ¬p) = 0 if f(X; p) > 0, and 1 otherwise. (2.52)

Fuzzy logic does not verify this last relation, but it is an essential property of
the intuitionistic logic. On the other hand it is clear that the intuitionistic logic
does not verify (2.44), the double negation property.
The only logic which verifies both the double negation (2.44) and the no
contradiction (2.46) is the classical logic. But then the only admissible rings on f are
the first two, leading to only two values for f.
The interest of logics other than the classical one now appears clearly.
Classical logic gave birth to Quantum logic; Fuzzy and Intuitionist to non-
distributive logics. Examples may be found in the 'real Universe', where distribu-
tivity is not verified.

Idempotence
An essential property of the basic connectives ∨ and ∧ is idempotence. But
modeling natural language, which describes the real world, we find that idempo-
tence should not always be verified by such connectives.
For example the repetition of a proposition is not always equivalent to the
proposition itself:

"This is a man"

and

"This is a man, who is a man"

are not equivalent propositions.


Let us introduce another connective □ such that p □ p = p is not always true
for all p. We wish of course to give to □ the same properties as ∧, except maybe
idempotence.
For this, let us consider a 'projective operator' on X, φ_p(X). This operator takes
into account the first proposition p, which has modified our knowledge on X.
Then (2.50) may be written as

f(X; p □ q) = f(X; p) ∗ f(φ_p(X); q). (2.53)

Then

f(X; p □ p) = f(X; p) ∗ f(φ_p(X); p). (2.54)

But only if p □ p = p do we have

f(X; p □ p) = f(X; p). (2.55)



It means that f(φ_p(X); p) is now the idempotent: "repeating once more will
not change our knowledge on X" [29].
The propositions p verifying (2.55) are special in the language; sometimes they
are called 'observables'. They form a lattice, which is not always distributive.

On the use of projective operators

Projective operators are introduced in many other instances: Fourier,
Hadamard and K.L. expansions, filtering, and others such as those pointed out by
Watanabe, Bongard and Ullman.
The main question is: knowing f(X; L), what is f(φ_L(X); L')?
The usual answer is to assume that the representation space X is a Euclidean
space and that the 'concepts' L, L' are built up as subspaces of this space. Thus
the situation is similar to the one of Subsection 1.4: the interpretation space is a
metric space. If ⟨· , ·⟩ is the scalar product,

f(X; L) = ⟨X, φ_L(X)⟩ / ⟨X, X⟩ (2.56)
and

f(X; LN L'): (X'X') (2.57)

A more general answer would be to assume on X a structure built with a


semi-ring operation and to obtain in a similar manner f(X; L) and f(X; L n L').
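A minimal numerical sketch of (2.56), under the Euclidean reading above: a 'concept' L is taken as a linear subspace, φ_L as the orthogonal projector onto it, and f(X; L) = ⟨X, φ_L(X)⟩/⟨X, X⟩. The code is my own illustration built with numpy; the function names and the concrete subspaces are assumptions, not the authors' implementation.

```python
import numpy as np

def projector(basis):
    """Orthogonal projector onto the column span of `basis`."""
    q, _ = np.linalg.qr(basis)
    return q @ q.T

def similarity(x, p_l):
    """f(X; L) = <X, phi_L(X)> / <X, X>, a value in [0, 1]."""
    return float(x @ (p_l @ x)) / float(x @ x)

x = np.array([1.0, 2.0, 0.5])
P_L  = projector(np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]))   # span{e1, e2}
P_L2 = projector(np.array([[1.0], [0.0], [0.0]]))                  # a smaller subspace
print(similarity(x, P_L), similarity(x, P_L2))
```

The similarity decreases as the concept subspace shrinks, which is the behaviour one expects from (2.56) and (2.57).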

3. Conclusion

By establishing a homomorphism between the representation space and an
interpretation space we have shown that a unifying point of view may be obtained
for many apparently different PR problems. The operational laws on the (dis)sim-
ilarity measures between representations of objects and concepts appear to be the
key factors in constructing this homomorphism with the structure of the concepts
(the interpretation space). In other words, a link has to be made between data-
driven structures and concept-driven structures through these operational laws.
It is the belief of the authors that a unifying point of view may lead to the
following issues.
Many Pattern Recognition problems are stated and treated as if they were all
quite different. However, a closer look at the fundamental problem may reveal
that these PR problems only differ in discourse and terminology, or differ
in the context in which they appear. We have emphasized the fact that the
underlying problem lies in how to link the representation space to the interpreta-
tion space, and that the construction of a homomorphism between them de-
mands a precise formulation of the linking key factor: a (dis)similarity

measure. Then, the number of admissible operators to construct similarity-based
homomorphisms between representation space and interpretation space appears
to be rather limited. In that sense, for example, fuzzy set theory may enlarge the
interpretation space but does not enlarge the operational formulation of the
homomorphisms between the two spaces involved. In other words, the laws of
fuzzy operation do not differ basically from the probabilistic ones. So the
diversity of problems is in some sense misleading.
The same can be said for the huge number of existing mapping algorithms
between the representation space and the interpretation space. Once the set of
admissible operational laws has been defined, many apparently different algo-
rithms appear to be intrinsically the same. They just construct the homomorphic
mapping in a different fashion, still based on the same laws of operation. Hence,
they do not differ fundamentally; they merely differ in details of computation
and in the discourse in which they are phrased.
This fact should be stated as such and should not be brought up as something
special. A unifying view should keep future researchers from 're-inventing the
wheel'. As soon as it becomes clear that constructing a homomorphism between
representation space and interpretation space is the basic issue in PR, the
solution space of 'true' homomorphisms is a very restricted one. Future designers
of PR algorithms should identify both their problems and their approaches within
the framework of the admissible operational laws on (dis)similarity.
In conclusion, an important outcome of this paper would be that the many
identification algorithms may be thought of or explained as special cases of a much
more general model, so that we escape the criticism that Pattern Recognition is
just a 'bag of tricks'.

References

[1] Simon, J. C. (1978). Some current topics in clustering in relation with pattern recognition. Proc. Third Internat. Conf. on Pattern Recognition, Coronado, pp. 19-29.
[2] Saussure, F. de (1972). Cours de Linguistique Générale. Payot, Paris.
[3] Haralick, R. M. (1978). Scene matching problems. Proc. 1978 NATO ASI on Image Processing, Bonas.
[4] De Mori, R. (1978). Recent advances in automatic speech recognition. Proc. Fourth Internat. Conf. on Pattern Recognition, Kyoto, pp. 106-124.
[5] Duda, R. O. and Hart, P. E. (1973). Pattern Classification and Scene Analysis. Wiley, New York.
[6] Diday, E. and Simon, J. C. (1976). Cluster analysis. In: Fu, K. S., ed., Digital Pattern Recognition. Springer, Berlin.
[7] Hu, S. T. (1966). Introduction to General Topology. Holden Day, San Francisco.
[8] Kanal, L. (1974). Patterns in pattern recognition, 1968-1974. IEEE Trans. Inform. Theory 20 (6), 697-722.
[9] Cover, T. M. and Wagner, T. J. (1976). Topics in statistical recognition. In: Fu, K. S., ed., Digital Pattern Recognition. Springer, Berlin.
[10] Devijver, P. A. (1977). Reconnaissance des formes par la méthode des plus proches voisins. Thèse de Doct., Univ. Paris VI, Paris.
[11] Cover, T. M. and Hart, P. E. (1967). Nearest neighbour pattern classification. IEEE Trans. Inform. Theory 13, 21-26.
[12] Jarvis, R. A. (1978). Shared near neighbour maximal spanning tree for cluster analysis. Proc. Third Internat. Conf. on Pattern Recognition, Coronado, pp. 308-313.
[13] Gowda, K. C. and Krishna, G. (1978). Agglomerative clustering using the concept of mutual nearest neighbourhood. Pattern Recognition 10, 105-112.
[14] Backer, E. (1978). Cluster Analysis by Optimal Decomposition of Induced Fuzzy Sets. Delft University Press, Delft.
[15] Lyndon, R. C. (1964). Notes on Logic. Van Nostrand Mathematical Studies 6. Van Nostrand, New York.
[16] Sabah, G. (1977). Sur la compréhension d'histoires en langage naturel. Thèse de Doct., Univ. Paris VI, Paris.
[17] Simon, J. C. (1975). Recent progress to a formal approach of pattern recognition and scene analysis. Pattern Recognition 7, 117-124.
[18] Kampé de Fériet, J. (1977). Les deux points de vue de l'information: information à priori, information à posteriori. Colloques du CNRS 276, Paris.
[19] Diday, E. (1973). The dynamic cluster algorithm and optimisation in non-hierarchical clustering. Proc. Fifth IFIP Conf., Rome.
[20] Miranker, W. L. and Simon, J. C. (1975). Un modèle continu de l'algorithme des nuées dynamiques. C.R. Acad. Sci. Paris Sér. A 281, 585-588.
[21] Hansen, P. and Delattre, M. (1978). Complete link cluster analysis by graph coloring. J. Amer. Statist. Assoc. 73, 397-403.
[22] Simon, J. C. and Diday, E. (1972). Classification automatique. C.R. Acad. Sci. Paris Sér. A 275, 1003.
[23] Ruspini, E. (1969). A new approach to clustering. Inform. Control 15, 22-32.
[24] Bezdek, J. C. (1973). Fuzzy mathematics in pattern classification. PhD Thesis, Cornell University, Ithaca.
[25] Carnap, R. and Bar-Hillel, Y. (1953). Semantic information. British J. Phil. Sci. 4, 147-157.
[26] Kampé de Fériet, J. (1973). La Théorie Généralisée de l'Information et de la Mesure Subjective de l'Information. Lecture Notes in Mathematics 398. Springer, Berlin.
[27] Van der Pyl, T. (1976). Axiomatique de l'information. C.R. Acad. Sci. Paris Sér. A 282.
[28] Backer, E. and Jain, A. K. (1976). On feature ordering in practice and some finite sample effects. Proc. Third Internat. Conf. on Pattern Recognition, Coronado, pp. 45-49.
[29] Sallentin, J. (1979). Représentation d'observations dans le contexte de la théorie de l'information. Thèse de Doct., Univ. Paris VI, Paris.
[30] Zadeh, L. A. (1971). Similarity relations and fuzzy orderings. Inform. Sci. 3, 177-200.
[31] Zadeh, L. A. (1977). Fuzzy sets and their application to pattern classification and clustering analysis. In: Van Ryzin, J., ed., Classification and Clustering, 251-299. Academic Press, New York.
[32] Zadeh, L. A. (1978). PRUF, a meaning representation language for natural languages. Internat. J. Man-Mach. Stud. 10, 395-460.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 479-491

Logical Functions in the Problems of
Empirical Prediction

G. S. Lbov

0. Introduction

An empirical table is considered whose elements are the results of measuring a
number of features on a subset of objects chosen from a general population Γ. A
vector x = (x_1, ..., x_j, ..., x_n) and a value x_0 in a space of features
X_1, ..., X_j, ..., X_n; X_0 may be put into correspondence with any object a ∈ Γ. The feature X_0 is
taken as a goal feature. For the feature X_j the range of its values D_j
(j = 1, ..., n; 0) has been defined, and the type of the scale [9] in which it has been measured is
given.¹
Let some set of objects be chosen, A = {a_1, ..., a_i, ..., a_N}, A ⊆ Γ. The table
v = {x_ij} (i = 1, ..., N; j = 1, ..., n; 0) corresponds to the set A. The table v is used
to solve the following types of empirical prediction problems.
(1) Pattern recognition (prediction of the value of the goal feature x_0 for any
object a ∈ Γ by its description x; in this case the feature X_0 is measured in the
name scale).
(2) Ordering objects according to their perspectiveness from the viewpoint of
some criterion (prediction of the order for the objects of a subset A' (A' ≠ A, A' ⊂ Γ)).
(3) Prediction of the value of the goal feature x_0 for a ∈ Γ by its description x.
The feature X_0 is serial or quantitative. Note that if X_1, ..., X_n; X_0 are quantitative
features, then the problem under consideration coincides with the classical
problem of function restoration.
(4) Automatic grouping of objects. In this case the values of the feature X_0 for the
subset of objects A are not assigned. They should be determined using the
'similarity' of the objects.
(5) Dynamic prediction of the value of the goal feature X_0 using time variations
in the values of the features X_1, ..., X_n.

¹Groups of scales intended for measuring features of three types are considered: quantitative
features (scales of intervals, relations, and the absolute one); serial features (scales of order, partial
order, rank); nominal ones (name scale).


Special attention is paid in the paper to algorithms for solving the above
types of problems in the case of empirical tables characterized either by a great
number of features (n = 30-150) and a small sample size (N ≈ n), or by
heterogeneity of the features (gaps are possible in the tables in both cases). Tables
for which at least one of these properties holds we call RP-tables. Such tables
are common in complex studies in medicine, sociology, and geology. The known
methods for solving prediction problems are mainly designed for the case when the
features X_1, ..., X_j, ..., X_n are either Boolean [1] or quantitative (for example,
the classical problem of restoring the function x_0 = f(x) assumes all the
features to be quantitative). Applying to heterogeneous features the
known algorithms [11, 13] that use hypotheses of 'closeness', 'compactness', etc.,
faces methodological difficulties: when calculating similarity between two
vectors one has to deal with components which are the results of measuring non-
comparable quantities. Therefore, in the present paper, for solving prediction
problems in the case of heterogeneous features a class of logical and linear-logical
decision rules is proposed. In addition, from theoretical studies [6, 10] it
follows that the given class of rules has the least measure of complexity as
compared to the known classes (e.g. the class of potential functions, the class of
polynomials), which allows one to construct reliable decision rules for a large
dimension of the feature space and a small training sample (a small number of objects).
This fact also leads to the necessity of using the given class of rules in the case when the
features X_1, ..., X_n are measured in the same scale. In medical, geological and
sociological research problems often arise in which both of these conditions occur.
To solve the problems the basic empirical hypothesis is used: the subset of
objects A is believed to be randomly chosen, i.e. a statistical set-up of the
problem is considered. All the prediction methods construct a decision rule
having the maximal prediction quality for the objects of the subset A. Only if
the given hypothesis is accepted will such a decision rule possess the best predicting
capacity for the rest of the objects of the population Γ.

1. Requirements for a class of decision rules

The quality of a decision rule of recognition is known to depend on the
information content of the initial system of features, on the choice of the class of
decision rules, and on the size and representativeness of the sample and the
optimization procedure of the quality criterion as well.
Let us formulate the basic requirements for a class of rules to be used with RP-tables.
The class of rules should satisfy the following conditions.
(1) Have a small complexity measure and, at the same time, be good enough for
an effective solution of applied problems when the functional form of the distribu-
tion is unknown.
(2) Be invariant to the permissible scale transformations.
(3) Have easily interpretable decision rules.
(4) Enable the construction of simple optimization procedures for the search for
the best rule.
(5) Contain rules which can be realized technically in a simple way.
(6) Permit the construction of algorithms operating with gaps present in empirical
tables.
Consider now each requirement in more detail.
Let us introduce the statistical criteria which have been used to determine the
complexity of a class of decision rules. Let distributions {p(ω, x) = p(ω) p_ω(x)}
be given on a set of features X_1, ..., X_n (ω is a pattern number; x is a point of the
n-dimensional feature space; p(ω) is the a priori probability of pattern ω; ω = 1, ..., K;
K is the number of patterns). One may say that the strategy of nature c ∈ Θ is
assigned by the set of probabilities c = {p(ω, x)} (Θ is the set of strategies of
nature). Denote by V the set of all possible samples of size N. A particular
empirical table {x_ij} is an element of this set: v ∈ V. The table v is brought into
correspondence with a recognition rule f from some class of decision rules Φ. This
rule is chosen in accordance with a training algorithm Q, i.e. Q(v) = f. The
operator Q is either a procedure for evaluating the parameters of the distributions
{p(ω, x)}, or an optimization procedure for the choice of the rule f from the class Φ; this
procedure minimizes the recognition quality criterion (e.g. the number of recogni-
tion errors).
A decision rule is a mapping which puts a pattern number into correspondence with a point x
of the feature space, i.e. f(x) = ω. For a fixed decision rule f and a fixed
strategy of nature c the probability of classification error P_f can be determined.
For each strategy of nature c the distribution of probabilities p(v) on the set
of samples is given. Hence the decision rule f will be chosen randomly from the class Φ
in accordance with some distribution p(f). Note that in the general case
p(f) ≥ p(v), since the algorithm Q may choose one and the same decision rule f
for a certain subset of samples V_i ⊆ V. The error probability P_f will be a random
quantity with a certain distribution of probabilities φ(P_f). With an increase in the
sample size (N → ∞) the quantity P_f tends to a limiting value and the distribution φ(P_f)
degenerates into the δ-function.
For the fixed distribution φ(P_f) one can calculate the probability Pr{P_f ∈
[P*, P_γ]} ≥ γ, P_γ ≥ P*. The magnitude γ is close to 1. For a fixed value of γ,
from the given inequality one can obtain the value P_γ. The quantity ε = P_γ − P*
shows the extent of the deviation of the sampling rule from the optimal rule, in the
sense of error probability, at the fixed strategy of nature c, at the fixed size of
training sample N, and for the fixed class of decision rules. Since the strategy of
nature c is unknown, we consider ε* = max_{c∈Θ} ε. The magnitude ε* at a fixed
sample size depends on the chosen class of decision rules.

DEFINITION 1.1. Of two classes of rules, Φ_1 and Φ_2, the latter is more com-
plicated than the former if the value ε*_2 for Φ_2 is greater than ε*_1 for Φ_1.

The quantity ε* represents a statistical measure of complexity for the class Φ.
In [5, 6] the values of ε* are given depending on the magnitudes N and γ for some
particular cases of decision rules.
In [10] another concept of complexity of the class of decision rules is intro-
duced, based on the principle of uniform convergence of frequencies to probabili-
ties over the class,

    Pr{ sup_{f∈Φ} |P̂_f − P_f| ≤ ε } ≥ γ,

where P̂_f is the estimate of the error probability on a training sample for a fixed rule
f, and P_f is the corresponding error probability. The magnitude

    ε = √( (ln|Φ| − ln(1 − γ)) / 2N )

represents a statistical measure of complexity of the class of rules Φ. The magnitudes
ε* and ε, at fixed values of N and γ, depend only on the number |Φ| of decision
rules in the class Φ. The number |Φ| is obtained as follows. Let a partition of the
feature space into M domains correspond to every decision rule f ∈ Φ. To describe
these domains one uses some fixed class of functions (for instance, linear,
quadratic, logical, etc.). The number |Φ| = K^M |Ψ|, where K is the number of
patterns and |Ψ| is the maximal number of possible partitions which can be obtained
on a finite sample. The smaller the numbers M and |Ψ|, the simpler the class of
decision rules. For the simplest class (the class of linear rules) |Ψ| = 2^N if N < n, or
|Ψ| = 2 Σ_{i=0}^{n−1} C_{N−1}^{i} if N ≥ n [2]. Further it will be shown that the class of logical
decision rules under consideration is no more complicated than that of linear
rules.
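A short sketch (my own, not the author's code) of the two quantities just introduced: the complexity measure ε for a finite class of rules, and Cover's count of linear dichotomies used for the linear-rule class. The particular numbers K, N, n below are illustrative assumptions.

```python
import math

def epsilon(card_phi, n_samples, gamma=0.95):
    """epsilon = sqrt((ln|Phi| - ln(1 - gamma)) / (2 N))."""
    return math.sqrt((math.log(card_phi) - math.log(1.0 - gamma)) / (2.0 * n_samples))

def linear_partitions(n_samples, n_features):
    """|Psi| for linear rules: 2^N if N < n, else 2 * sum_{i=0}^{n-1} C(N-1, i)."""
    if n_samples < n_features:
        return 2 ** n_samples
    return 2 * sum(math.comb(n_samples - 1, i) for i in range(n_features))

K, N, n = 2, 100, 30                                  # patterns, sample size, features
card_linear = K ** 2 * linear_partitions(N, n)        # |Phi| = K^M |Psi| with M = 2 domains
print(f"|Phi_linear| = {card_linear:.3e}, epsilon = {epsilon(card_linear, N):.3f}")
```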
Let us consider the other requirements on the class of decision rules.
The following methodological requirements seem natural. The results of recog-
nition should not depend on the numerical representation of the empirical tables. Let
groups of permissible transformations E_1, ..., E_n; E_0 correspond to the features
X_1, ..., X_n; X_0. Denote by φV the transformed table V, φ = (φ_1, ..., φ_j, ..., φ_n; φ_0).
Note that Q(V) = f, f(x) = ω. The formal representation of the invariance require-
ment takes the form

    φ_0^{-1}{ [Q(φV)](φx) } = [Q(V)](x).

This equality has to hold for each φ.
Special attention should be paid to the interpretability requirement for a
decision rule. In addition to its applicability for predicting goal feature values of new
objects, the rule obtained may be interesting by itself to experts in applied
fields. It should be easily interpreted. This may prove useful for forming models
of the objects under study, in a dialogue regime of recognition, etc. It should be noted
that the simplest decision rules (linear rules, or those giving pattern descriptions
as hyperspheres) do not possess this property.

In the case when a specialized recognition device is to be designed, a decision


rule has to be realized technically in a simple way.
When solving medical, geological, biological and other problems, empirical
tables with gaps are often encountered. In the case of any value of at least one of
the features being missing, most of the known recognition algorithms do not
use the corresponding realization. When the number of gaps is great enough
nearly all the realizations intended for training will not be used. The class of
decision rules should be such that all the non-missing elements of a table could be
taken into account while constructing rules.
In [4] the class of logical decision rules for the features measured in different
scales has been considered. This class satisfies all the above requirements as is
shown below.

2. Class of logical decision rules

Let us formulate a class of logical decision rules for recognition. For the sake of
simplicity, consider two patterns {ω, ω̄}.
Let us introduce all the necessary definitions and notations. As elementary statements
T_j we take expressions X_j(a) = x, X_j(a) ≠ x for the name scale, and X_j(a) ≤ x,
X_j(a) > x for the order scale, the scale of intervals, and the relations scale, where X_j is the
feature, x is its value and a is the name of an object.
A conjunction of elementary statements S = T_1 ∧ T_2 ∧ ⋯ ∧ T_m (m ≤ n) is called
a statement S. By the length of a statement we mean the number of elementary
statements involved. We say that a statement S is satisfied on an object if each
elementary statement involved in S is true on this object.

DEFINITION 2.1. Two elementary statements T_{j1} and T_{j2} are called equivalent if
they are satisfied on the same set of objects of A = A_ω ∪ A_ω̄.

On the basis of the sample, the set of possible elementary statements which can be
formulated for a feature X_j is partitioned into equivalence classes. Further we
consider only one elementary statement from each equivalence class. Denote the
number of equivalence classes for the feature X_j by L_j. A statement S is
a description of some domain of the feature space. Denote by B = {S_1, ..., S_t, ..., S_T} a
set of statements which corresponds to a partition of the feature space into T
non-intersecting domains. For each statement S one can determine the number of
objects N_{Sω} of pattern ω and the number of objects N_{Sω̄} of pattern ω̄ for which S
is satisfied.
A decision rule for the set B minimizing the value of the error probability on the
training sample is given as follows: if N_{Sω} ≥ N_{Sω̄}, then f(S) = ω, otherwise
f(S) = ω̄. The collection of all possible sets {B} with given decisions for the statements
gives the class of logical decision rules Φ_1 (a similar class of rules is considered in
the paper by Michalski [8] and in [4]).

The set B can be represented as a tree R = {b_1, ..., b_t, ..., b_M} (b_t is a branch of
the tree). For any object we first verify the truth of some elementary statement T_j.
If T_j is true, we verify the truth of some statement T_k; but if T_j is false, we verify
the truth of another statement, etc. From the very way of constructing the tree it
follows that the set of statements {b_1, ..., b_M} describes M non-intersecting
domains of the feature space. If we attach one of the solutions {ω, ω̄} to each branch
of the tree, we assign the decision rule f(R) = {f(b_t)}. The rule f(R) = {f(b_t)} realizes a
sequential procedure of recognition: for every object, only the subset of features
involved in the branch statement b_t fulfilled on it is measured. The sequential procedure allows us to
minimize the cost of feature measurement in object recognition. The set of decision
rules given by all possible trees {R} with a fixed number of branches M
determines the class Φ_2 of logical decision rules given as a decision tree. For
M_1 < M_2 < ⋯ < M_i < ⋯ we obtain the embedded classes of decision rules
Φ_2^1 ⊂ Φ_2^2 ⊂ ⋯ ⊂ Φ_2^i ⊂ ⋯. The number of decision rules is

    |Φ_2| = 2^M |Ψ| = 2^M M!(Ln)(Ln − 1)⋯(Ln − (M − 2)) < 2^M M!(Ln)^{M−1},

where L = max_j L_j, L_j is the number of elementary statements (the number of
equivalence classes) which can be formulated on the feature X_j from the training
sample, and n is the dimension of the initial feature space. The selection of the number M
determines the complexity of the class of decision rules. According to the
empirical table it is necessary to design a procedure Q for the choice of a
decision rule of the class Φ_2 whose estimate of the error probability P̂ is
minimal, or for which, at a fixed value of P̂, the number M is minimal. The decision rules
so chosen will at the same time correspond to the best sequential procedure of recogni-
tion in the case of dependent features. The latter fact is also of importance for
cases when it is necessary to minimize the cost of feature measurements (usually
M ≤ 7).
Note a number of important properties of the class of logical decision rules Φ_2.
(1) This class possesses a minor measure of complexity. Let us determine the number of
rules in the class for K = 2, M = 7, L = 20, n = 100. Then

    |Φ_2| < 2^7 · 7!(20 · 100)^6 < 5 · 10^25.

For the class of linear rules [2] at N < n, |Φ| = 2^2 · 2^100 ≈ 10^30. Thus the class
under consideration belongs to the class of the simplest decision rules. This
property is of primary importance for the case of a small sample and a large space
dimension. (A small arithmetic check of these magnitudes is sketched after this list of properties.)
(2) From the theoretical viewpoint the class Φ_2 has an important asymptotic
property: in [7] it was proved that with an increase in the sample size and in the number
M the decision rule f(R) tends to the optimal Bayes rule.
(3) The number and composition of the objects (on which this or that elementary
statement is satisfied) do not depend on the permissible transformations of the scales.
Therefore, any algorithm for constructing the best rule which takes into account only
the number and composition of objects when selecting elementary statements
is invariant to the above transformations.
(4) Statements represented as conjunctions of values and conditional intervals
of features are easily interpretable, and the decision rule can be technically
realized in a simple way as a threshold device.
(5) When describing known recognition algorithms one, as a rule, does not
show how a vector with omitted values of some features will be used in training.
In the case of constructing a logical rule such a realization is left unused only for those
conjunctions that involve a feature with a missing value.
(6) Simultaneously with choosing logical regularities the problem of reducing
the number of features of the initial system is solved. Indeed, in the solution of
applied problems, as a rule, it is sufficient to use only a small
number of features of the initial system to construct a tree R. In addition, the decision rule enables one
to perform an 'individual' approach: for each recognizable object its own subset
of features is used. The above properties are of significance for solving problems
where both the recognition error and the cost of measuring the features should be
minimized. Thus it is shown that the class of logical decision rules Φ_2 satisfies the
requirements formulated above for a class of decision rules of recognition.
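The arithmetic check promised in property (1), written as a small Python snippet of my own; the quantities are exactly those quoted in the text, and the linear-rule count follows the text's reading with N = n = 100.

```python
import math

M, L, n = 7, 20, 100
tree_bound = 2 ** M * math.factorial(M) * (L * n) ** (M - 1)
linear_count = 2 ** 2 * 2 ** 100        # |Phi| = 2^2 * 2^N with N = n = 100, as in the text

print(f"logical tree bound : {tree_bound:.2e}")    # about 4.1e+25, indeed below 5e+25
print(f"linear rule count  : {linear_count:.2e}")  # about 5e+30, i.e. of the order 10^30
```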
Together with the classes of decision rules Φ_1 and Φ_2 we also considered classes Φ_3
and Φ_4, which can also be referred to the class of logical decision rules. At
present the properties of the Φ_3 and Φ_4 classes are being studied.
If elementary statements of the form Σ_{j=1}^{m} a_j X_j(a) ≥ α_0, Σ_{j=1}^{m} a_j X_j(a) < α_0
(m ≤ n_1, n_1 is the number of quantitative features) are added to the above types
of elementary statements, then we obtain the class of linear-logical decision rules
Φ_3, given as a decision tree.
If we use conjunctions chosen from some set d = {S_1, ..., S_t, ..., S_r} instead of the
elementary statements of the above tree, then we get the class of decision rules
Φ_4. For this purpose let us bring a new system of features Y_1, ..., Y_t, ..., Y_r into
correspondence with the set d in such a way that, if for an object a the statement
S_t is fulfilled, then Y_t equals 1, otherwise it equals 0. Then the empirical tables can
be rewritten in the form {x_ij^ω} → {y_it^ω} and {x_ij^ω̄} → {y_it^ω̄}. The decision rule in the
form of a decision tree is constructed on the new features. From the set of
all possible statements, the set d involves only informative statements, which are
called regularities. By a regularity characterizing the pattern ω we mean a statement
S for which P̂_{Sω} ≥ δ and P̂_{Sω̄} ≤ β (e.g. δ = 0.6, β = 0.02). For a small fixed β the
magnitude δ is chosen experimentally: starting with δ = 1, this quantity is decreased
with a certain step (e.g. Δδ = 0.05) until a small number of regularities emerges
(usually this number is set to 5-10 per pattern). When solving the
applied problems the length of an informative statement (the number of elementary
statements in the conjunction) has not been found to exceed 5. In [5] an algorithm
TEMP was proposed with the help of which all regularities characterizing
RP-tables were discovered in reasonable machine time.

To construct decision rules of recognition of the classes Φ_2 and Φ_3, the algorithms
DW and LLRP (described in [6]) have been proposed. The description of the
recognition algorithm by a set of trees is given in [6]. A short description of the
above algorithms is also given in [12].
As the experience of solving applied problems shows, the classes of decision rules
considered in this section prove to be good enough for these problems to be
solved effectively.

3. Method of predicting object's perspectiveness

A problem of ordering objects according to their perspectiveness arises in plan-
ning the examination of patients, in geological prospecting for a certain region, etc.
This problem is considered in the following set-up. Let each object of some set
{a'_1, ..., a'_i, ..., a'_M} be described by a set of features X_1, ..., X_n and by the value of the
feature X_0 (X_0 acquires two values: ω and ω̄) which is unknown. Given the set of
vectors {x_1, ..., x_i, ..., x_M}, x_i = (x_{i1}, ..., x_{ij}, ..., x_{in}), it is necessary to find a strict
order on the given set of objects in accordance with an assigned criterion of
quality. For a particular order Π = (a'_{i1} ≻ a'_{i2} ≻ ⋯ ≻ a'_{iM}) the value of the
criterion

    Φ(Π) = Σ_{i=1}^{M} Δ(a'_i)

is determined, where

    Δ(a'_i) = π(a'_i) if X_0(a'_i) = ω,  and  Δ(a'_i) = 0 if X_0(a'_i) = ω̄,

and π(a'_i) is the object's place number in Π. The ordering is the better, the smaller the value
of the magnitude Φ(Π). Let us consider the meaningful interpretation of this crite-
rion. Let Γ be a set of geological areas. An area where deposits are discovered
(pattern ω) gives some profit per time unit starting from the moment of
discovery. If no deposits are discovered on it, the money spent on working it (e.g.,
drilling) proves wasted. The criterion in question can be shown to be connected with
the rate of compensation for the money spent on working.
If the strategy of nature c = {p(ω, x)} is known, one can determine the
mathematical expectation of the criterion MΦ(Π) = MΦ(F, c), where F is the ordering
rule, an order relation on the set R^n. It is necessary to minimize
MΦ(F, c) over the ordering rules F. The ordering rule F_0 is determined by the function
g(x) = p(x | ω)/p(x | ω̄) as follows:
- if g(x_i) > g(x_l), then x_i ≻ x_l,
- if g(x_i) = g(x_l), then x_i ~ x_l.

In [10] the optimality of F_0 is proved, i.e.

    MΦ(F_0, c) ≤ MΦ(F, c)

for any pair F, c.
In applied problems the strategy of nature c is unknown, but for training
some set of objects A = {a_1, ..., a_i, ..., a_N} is given. For each object a_i the vector
x_i = (x_{i1}, ..., x_{in}) and the value X_0(a_i) are known. One determines a decision
rule of recognition f(R) as a decision tree. The estimate of the likelihood function

    ĝ(b_t) = (N_{ωt} + 1) / (N_{ω̄t} + 1)

corresponds to each branch b_t of the tree (N_{ωt} is the number of objects of the set
A involved in b_t from the pattern ω, N_{ω̄t} is the number of objects from A involved
in b_t from the pattern ω̄). Each vector x_i falls into some branch b_t = b(x_i).
Define ĝ(x_i) = ĝ(b_t) = ĝ[b(x_i)]. If ĝ(x_i) > ĝ(x_l), then x_i ≻ x_l; if ĝ(x_i) = ĝ(x_l),
then x_i ~ x_l.
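A minimal sketch (my own, under stated assumptions) of this ordering rule: each object falls into one branch b_t of an already built tree, every branch carries the estimate ĝ(b_t) = (N_ω + 1)/(N_ω̄ + 1), and objects are ranked by decreasing ĝ. The branch counts and the assignment `branch_of` are illustrative stand-ins for the tree of Section 2.

```python
def g_hat(n_omega, n_omega_bar):
    """Likelihood estimate attached to a branch: (N_w + 1) / (N_wbar + 1)."""
    return (n_omega + 1) / (n_omega_bar + 1)

# counts (N_w, N_wbar) of training objects per branch -- assumed for the example
branch_counts = {"b1": (12, 1), "b2": (4, 6), "b3": (0, 9)}

def order_objects(objects, branch_of):
    """Return the objects sorted by decreasing g of the branch they fall into."""
    return sorted(objects, key=lambda a: g_hat(*branch_counts[branch_of[a]]), reverse=True)

branch_of = {"a1": "b2", "a2": "b1", "a3": "b3", "a4": "b1"}
print(order_objects(["a1", "a2", "a3", "a4"], branch_of))   # objects from branch b1 come first
```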

4. Algorithm of predicting the value of quantitative feature

In the previous sections the problem of predicting a feature value measured
in the name scale has been considered. This section deals with the description of
an algorithm for the case when the feature X_0 is measured in the scale of relations,
and the rest of the features can be measured in different scales. A function F is
chosen from a certain class Φ = {F} which puts an evaluation of the feature X_0
into correspondence with the vector x; i.e., the vector x corresponds to an object a, and this
vector determines the evaluation of the goal feature X̂_0(a) = F(x). The class of functions
in which the function F(x) is assigned by a tree R defined on the features measured
in the scales of names and order, together with a set of linear functions f_l of the features
measured in the scales of relations and intervals, is called the class of linear-logical
evaluation functions.
Let R be a function with values in the set of natural numbers from 1 to P, i.e.
R(x): x → {1, ..., P}. R divides the feature space into P non-intersecting classes
A_1, ..., A_P, where A_l = {a: R(x) = l}. For each of A_1, ..., A_P its own linear function
f_l is defined. According to the training sample {x_ij}, i = 1, ..., N, j = 1, ..., n; 0, one
minimizes the criterion

    L(F) = (1/N) Σ_{i=1}^{N} [X_0(a_i) − F(x_i)]².

To determine the functions R(x) and {f_l} from an empirical table, the following
algorithm is proposed.
First we construct over the entire sample, by the least squares method, a linear
regression of the feature X_0 on a feature X_{j1} well correlated with X_0. Find the
remainder X_0 − X̂_0 = X_0 − f_1(x_{j1}) = X_0^1. Then items 1 and 2 are fulfilled in
parallel.
(1) Keeping the linear function with the variable X_{j1}, we find in the table a
feature X_{j'} (and for this feature a boundary, or a name B in the case of a name
feature) such that the criterion

    L_1 = (1/N) [ Σ_{a_i∈A_1} (X_0(a_i) − f_1^1(x_i))² + Σ_{a_i∈A_2} (X_0(a_i) − f_1^2(x_i))² ]

is minimal over all possible partitions of the initial table into two taxons (A_1 is
the set of objects for which X_{j'}(a) ≥ B, A_2 is the complementary set of objects).
The coefficients of the function f_1^1 are obtained by constructing a linear
regression of the feature X_0 on X_{j1} over the objects of the taxon A_1, the coefficients of f_1^2
over the objects of the taxon A_2.
(2) Among the quantitative features not yet included in the equation we find the
feature X_{j2} which is most correlated with the remainder X_0^1. We obtain a
regression equation of the form

    X̂_0 = f_2(x) = b_0 + b_1 x_{j1} + b_2 x_{j2}.

The criterion value is calculated by the formula

    L_2 = (1/N) Σ_{i=1}^{N} [X_0(a_i) − f_2(x_i)]².

After steps 1 and 2 have been fulfilled, the values L_1 and L_2 are compared. If
L_1 > L_2, then the introduction of the variable X_{j2} into the regression equation is
preferable to the condition obtained in the tree R, i.e. as a linear
evaluation function we take f_2 of dimension 2. If L_1 ≤ L_2, then at this step we
complement R with the elementary statement X_{j'}(a) ≥ B; thus, the linear functions f_1^1
and f_1^2 corresponding to the taxons A_1 and A_2 are obtained.
In the latter case the initial sample (taxon 1) is partitioned into 2 taxons with numbers 2
and 3, respectively. The enumeration is performed as follows: if taxon j is divided,
then the resulting subsets get numbers 2j and 2j + 1.
Further steps proceed in the same way. Having completed a limited number of steps
K, the algorithm constructs a tree R(x), partitioning the set of objects into P taxons,
and a set of functions f_1, ..., f_P corresponding to the taxons. The constructed
function F evaluates the value of the feature X_0 for test objects by the rule
X̂_0(a_i) = f_l(x_i) where l = R(x_i).
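A simplified sketch of one step of this construction, written from my reading of the description rather than from the authors' program: either extend the regression with a second quantitative feature (criterion L_2) or split the sample by a threshold B and fit separate regressions in the two taxons (criterion L_1), keeping whichever criterion is smaller. The synthetic data, the quantile grid for B, and the reuse of the same feature as regressor and splitter are assumptions made to keep the example short.

```python
import numpy as np

def fit_ls(X, y):
    """Least-squares fit of y on the columns of X plus an intercept; returns predictions."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ coef

def mse(y, yhat):
    return float(np.mean((y - yhat) ** 2))

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)                        # plays the role of X_{j1}
x2 = rng.normal(size=200)                        # candidate second feature / split feature
y = np.where(x2 > 0, 2.0 * x1 + 1.0, -1.0 * x1)  # piecewise structure favours the split

# (2) extend the regression with x2
L2 = mse(y, fit_ls(np.column_stack([x1, x2]), y))

# (1) split on the best threshold B over x2 and fit x1-regressions in each taxon
def split_criterion(B):
    mask = x2 >= B
    err = 0.0
    for part in (mask, ~mask):
        if part.sum() < 2:
            return np.inf
        err += part.sum() * mse(y[part], fit_ls(x1[part][:, None], y[part]))
    return err / len(y)

L1 = min(split_criterion(B) for B in np.quantile(x2, np.linspace(0.1, 0.9, 17)))
print(f"L1 (split) = {L1:.3f}, L2 (extra variable) = {L2:.3f}",
      "-> add statement to tree R" if L1 <= L2 else "-> add variable to regression")
```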

5. Automatic grouping of objects

Let some set of objects A = {a_1, ..., a_N} and the corresponding empirical table
{x_ij}, i = 1, ..., N, j = 1, ..., n, be given. It is necessary to partition the set A into a
number of non-intersecting subsets (groups) so that each group contains objects
which are, in some sense, more 'similar' to one another. Such problems are known
as taxonomy problems (cluster analysis problems). They often arise at the early stages
of research when it is interesting to obtain some primary information on the
internal nature and structure of empirical data. As a rule, taxonomy of objects is
performed either in the entire initial space or in some subspace of features,
though it is natural to assume that the set of objects A can be partitioned into
groups such that to describe each of them its own subset of
features is required.²
Here an attempt is made to propose an approach to solving problems of
automatic grouping over heterogeneous features using logical functions.
Natural relations between the objects' features are considered to be absent if the
initial table of data was obtained randomly, i.e., if it may be considered as a
sample from a domain D of the initial space with the uniform distribution law.
In this case, for any logical statement S one can determine the probability P_S of its
fulfilment on such a 'random' table. The probability for S to be true on N_S
objects out of the total number N equals

    P(N_S) = C_N^{N_S} P_S^{N_S} (1 − P_S)^{N − N_S}.

The choice of a preference criterion of one logical statement over another in the search
for regularities for solving the given problem of table approximation is based on
the following hypothesis: the smaller the magnitude of P(N_S) for the statement S, the
more reason there is to consider this statement as a regularity.
The number of realizations N_S on which the statement is fulfilled on the
'random' table is, on the average, N·P_S. For grouping purposes we only consider
those statements for which N_S > N·P_S.
Besides, we are interested in statements fulfilled on the initial table not less
than δ times.
For simplicity of determining the preference order on a set of statements we
use the magnitude

    γ(S) = (z − N·P_S)² / (N·P_S(1 − P_S)),    z = N_S,

which results from approximating the binomial distribution by the normal density

    f(z) = (2πN P_S(1 − P_S))^{−1/2} exp[ −γ(S)/2 ].

It is clear that the smaller the magnitude of P(N_S), the larger that of γ(S).

²When describing objects for pattern recognition this effect was observed in all the solved
applied problems.

Consider statements S_1 and S_2, fulfilled on the given table N_{S_1} and N_{S_2} times,
respectively. The statement S_1 is considered to be preferable to S_2 if
γ(S_1) > γ(S_2).
To discover the best statements according to the criterion γ(S), the algorithm
TEMP is used. It proceeds as follows: once the first best statement S_1 has been
chosen, the N_{S_1} objects on which this statement is fulfilled are excluded from the N objects.
Then the best S_2 is determined, which is fulfilled only on the complemen-
tary subset of objects, and so on, until the partitioning of the initial set is
completed.
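A compact sketch of this grouping step, written by me to follow the description above rather than the published TEMP implementation: candidate statements are scored by γ(S), only over-represented statements (N_S > N·P_S) with enough support are kept, the best one is selected, the covered objects are removed, and the process repeats. The toy objects, predicates and P_S values are illustrative assumptions.

```python
def gamma(n_s, n_total, p_s):
    return (n_s - n_total * p_s) ** 2 / (n_total * p_s * (1.0 - p_s))

def greedy_grouping(objects, statements, min_support=3):
    """statements: dict name -> (predicate on an object, P_S under the uniform model)."""
    remaining, groups = set(objects), []
    n_total = len(objects)                      # N stays fixed, as in the text
    while remaining and statements:
        scored = []
        for name, (pred, p_s) in statements.items():
            covered = {o for o in remaining if pred(o)}
            if len(covered) >= min_support and len(covered) > n_total * p_s:
                scored.append((gamma(len(covered), n_total, p_s), name, covered))
        if not scored:
            break
        _, best, covered = max(scored, key=lambda t: t[0])   # largest gamma preferred
        groups.append((best, sorted(covered)))
        remaining -= covered
        del statements[best]
    return groups, sorted(remaining)

objs = list(range(20))                          # toy objects: the integers 0..19
stmts = {"x < 5": (lambda o: o < 5, 0.10), "x even": (lambda o: o % 2 == 0, 0.50)}
print(greedy_grouping(objs, stmts))             # one group is carved out, the rest remains
```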

6. Method of dynamic prediction

In the literature dealing with the analysis of dynamic objects special attention is
paid to the case when all the features are quantitative. However, in some applied
fields (e.g. medicine, sociology, economics) the different features characterizing a
dynamic object can be measured in different scales. In this case, to describe the
dynamic regularities we use the class of logical functions.
Depending on the aim of the research, different set-ups of the prediction problem
may arise. Consider one of them. Let a feature X_0 acquire two values {ω, ω̄},
where ω is the state we are interested in (e.g. ω is a certain disease of a patient, ω̄
is the absence of this disease). One needs to use the results of measuring the features
X_1, ..., X_n obtained at the moments t_1, ..., t_L (for these moments X_0 = ω̄) to
determine for the given object the period of time ΔT from the moment t_L up to
the moment when the value of X_0 becomes equal to ω. The problem of early diagnosis of
some diseases may serve as an example of this set-up. To solve this problem, a
training sample is used which consists of time measurements of the above features on N
objects. For the ith object (i = 1, ..., N) the values of all features are determined
simultaneously at R_i time points t_1^i, ..., t_l^i, ..., t_{R_i}^i at equal time intervals Δt. The sets of
time points for the objects considered may not coincide, but for each object the
following condition holds: at the final moment X_0(t_{R_i}^i) = ω, and at all the previous
moments

    X_0(t_l^i) = ω̄,    l = 1, ..., R_i − 1.

Let us choose as the start of time reckoning an arbitrary moment t_R. The results
of measuring the features of all objects obtained at the moments t_{R_1}^1 for the first object,
t_{R_2}^2 for the second one, and so on, are aligned with this start. Then the set of
moments {t_{R_i−1}^i} is aligned with the moment t_{R−1}, the set {t_{R_i−2}^i} with t_{R−2},
and so on (with t_l − t_{l−1} = Δt).
The training sample may then be represented as a succession of tables v_1, ..., v_l, ..., v_R,
where v_l = {x_{ij}^l} (i = 1, ..., N; j = 1, ..., n; l = 1, ..., R). Using these tables, a
succession of tables Δv_1, ..., Δv_l, ..., Δv_{R−1} is obtained, representing the varia-
tions of the feature values between the neighbouring moments.
To determine the magnitude ΔT being predicted, logical regularities are used
which are discovered in the given two successions of tables and which reflect regular (statistically
meaningful) variations in time of the values of the features X_1, ..., X_n while objects
approach the state ω. The algorithm for determining the dynamic regularities and for
localizing an object in time (determining the magnitude ΔT) is described
in [6].

References

[1] Bongard, M. M. (1970). Pattern Recognition. Spartan Books, New York.
[2] Duda, R. and Hart, P. (1973). Pattern Classification and Scene Analysis. Wiley, New York.
[3] Lbov, G. S. (1966). On the sample representativeness in choosing an effective feature system. Computer Systems 22, 39-58.
[4] Lbov, G. S. and Manokhin, A. N. (1973). On estimation of quality of decision rule based on training sample. Computer Systems 55, 48-107.
[5] Lbov, G. S., Kotyukov, V. I. and Masharov, Yu. P. (1976). Method of discovering logical regularities on empirical tables. Empirical Prediction and Pattern Recognition, Computer Systems 67, 29-42.
[6] Lbov, G. S. (1979). Logical functions in problems of empirical prediction. Empirical Prediction and Pattern Recognition, Computer Systems 76, 34-64.
[7] Manokhin, A. N. (1978). On one approach to predict objects' perspectiveness. Methods of Information Processing, Computer Systems 74, 108-128.
[8] Michalski, R. S. (1973). AQVAL/1--Computer implementation of a variable-valued logic system VL1 and examples of its application to pattern recognition. Proc. First Internat. Joint Conference on Pattern Recognition, Washington, DC.
[9] Pfanzagl, J. (1971). Theory of Measurement. Physica, Würzburg.
[10] Vapnik, V. N. and Chervonenkis, A. Ya. (1974). Theory of Pattern Recognition. Nauka, Moscow.
[11] Voronin, Yu. A. (1971). Introduction of similarity and connection measures for solving geological problems. Dokl. Akad. Nauk SSSR 199 (5), 1011-1014.
[12] Zagoruiko, N. G. and Lbov, G. S. (1978). Algorithms of pattern recognition in a package of applied programs "OTEKS". Proc. Fourth Internat. Joint Conference on Pattern Recognition, Tokyo.
[13] Zhuravlev, Yu. I. and Nikiforov, V. V. (1971). Recognition algorithms based on computation of estimations. Cybernetics 3, 1-12.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 493-500

Inference and Data Tables
with Missing Values

N. G. Zagoruiko and V. N. Yolkina

1. Algorithm ZET

The basic ideas of the method ZET [1] consist in the following. As in most
pattern recognition algorithms, we accept the hypothesis that the 'similar' or 'like'
objects are those having similar values of the representation parameters. The less the
divergence, the more 'similar' these objects are. It is also supposed that
the set of parameters describing the objects is not random, and reflects some
regularity relating to these objects. Then one may expect that objects which are similar
in n parameters will most probably also have similar values of the (n + 1)st
parameter. Under these conditions we have good reasons to undertake the task of
'prediction' of the missing values, i.e. the calculation of the most plausible
values in terms of the known elements of the matrix of initial data. As usual,
we will consider that the lines of the matrix of initial data contain the
information describing the object a_i, i = 1, ..., m, in the given system
of parameters X = {X_j}, j = 1, ..., n, and that the columns contain the information
on the values which the parameter X_j takes on the various objects a_i.
In algorithm ZET, to predict a missing element only the relevant groups of
lines or columns of the matrix under study are used. Relevance is defined as a
function of two variables: a measure of similarity |r| between the line (column)
containing the expected gap and the lines (columns) not having a blank in the place
corresponding to the expected gap; and the degree of their mutual fillingness L.
Naturally, the relevance of the predicting lines (columns) is highest if they are more
similar to the predicted one and if they contain the greatest number of mutually
non-empty elements. As a measure of similarity between lines (columns) of the
matrix, in the algorithm ZET we take the modulus of the coefficient of paired
correlation, calculated after normalization of all the columns of the matrix to the
interval [0, 1].
While calculating a predicted element we take into account the predictive value
of each line (column), which depends on its relevance and on some parameter α.
The parameter α is chosen during the process of decision, each gap being treated
separately. The criterion for the choice of α is the minimum of the mean error of prediction of all the known
elements of the line (column) containing the gap.

Table 1

 Objects \ Features   1    2    ⋯    k    ⋯    j    ⋯
   1
   ⋮
   i                            a_ik      a_ij
   ⋮
   l                            a_lk      a_lj
   ⋮
   m

The error δ_n of the prediction of the known elements of the table under the optimal
value of α is taken as an estimate of the expected 'quality of prediction'. If δ_n
exceeds a given threshold, the algorithm will leave the gap empty.
The algorithm ZET works as follows. Let there be a matrix of dimension n × m,
where n is the number of columns (features) and m is the number of lines
(objects); see Table 1.

Step 0. The normalization of all the columns of the matrix to the interval [0, 1]
is made.
Step 1. The next gap a_ij to be predicted is chosen.
Step 2. For each kth column having no gap in the ith line, we calculate its
measure of fillingness L_jk (j ≠ k) with respect to the jth column. It is equal to the
number of mutually non-empty pairs of elements of the jth and the kth columns.
Step 3. The measure of similarity |r_jk| of the columns j and k is computed.
Step 4. Under various α, the values of all known elements a_lj of the jth column
are predicted from the kth column, which has no gap in the ith line:

    â_lj^k = b_jk · a_lk + c_jk,

where b_jk and c_jk are the coefficients of an equation of linear regression.
Step 5. These particular predictions are averaged with weights proportional to
the relevance of the columns taking part in the prediction:

    â_lj = ( Σ_{k=1}^{P} â_lj^k · |r_jk|^α · L_jk ) / ( Σ_{k=1}^{P} |r_jk|^α · L_jk ),

l = 1, ..., m; l ≠ i; k = 1, ..., P; P < n − 1; here P is the number of columns par-
ticipating in the prediction of the element a_lj.
Step 6. We choose α⁰ such that δ_n(α⁰) = min_α δ_n(α).
Step 7. With α⁰ one calculates the expected predicted value â_ij⁰.
Step 8. A prediction procedure by lines, analogous to Steps 2-7, is made. As a
result the value of the parameter α* is found, under which the minimal
prediction error of the known elements of the ith line has been obtained. With
this value α* the element â_ij* is calculated.
Step 9. The conditions δ⁰_min ≤ δ_0, δ*_min ≤ δ_0 are tested, where δ_0 is given a priori.
If neither of the values δ⁰_min or δ*_min satisfies the condition of the given precision,
then the gap remains blank.
If both values satisfy the given precision, then for prediction we choose the one
which corresponds to

    δ_n = min(δ⁰_min, δ*_min).

Step 10. Having filled all the missing elements in the table, one repeats
Steps 2-9 (smoothing).
Step 11. After each regular iteration one estimates the mean
summary difference between the results predicted at this step and the prediction
results obtained at the previous one. The process is ended if this difference
becomes small.
This gives a second criterion to end the iterations.
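A condensed sketch of the column-wise part of ZET (my own code, not the published implementation): to fill a_ij, column j is regressed on each column k that has no gap in line i, and the particular predictions are averaged with weights |r_jk|^α · L_jk. The normalization of Step 0, the row-wise pass and the search over α (Steps 6-9) are omitted here, and the small matrix at the end is an invented example.

```python
import numpy as np

def zet_column_predict(A, i, j, alpha=1.0):
    """A: 2-D array with np.nan marking gaps; returns a prediction for A[i, j]."""
    m, n = A.shape
    num = den = 0.0
    for k in range(n):
        if k == j or np.isnan(A[i, k]):
            continue
        both = ~np.isnan(A[:, j]) & ~np.isnan(A[:, k])
        both[i] = False
        L_jk = int(both.sum())                                  # mutual fillingness
        if L_jk < 2:
            continue
        r_jk = abs(np.corrcoef(A[both, j], A[both, k])[0, 1])   # similarity measure
        b, c = np.polyfit(A[both, k], A[both, j], 1)            # a_lj ~ b * a_lk + c
        w = (r_jk ** alpha) * L_jk
        num += w * (b * A[i, k] + c)
        den += w
    return num / den if den > 0 else np.nan

A = np.array([[1.0, 2.1, 3.2],
              [2.0, 4.2, 6.1],
              [3.0, 5.9, np.nan],
              [4.0, 8.1, 12.2]])
print(zet_column_predict(A, 2, 2))    # roughly 9, from the nearly linear columns
```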

2. Algorithm VANGA

Algorithm ZET is designed to fill data tables measured in strong scales: of
intervals, relations, and absolute ones [2]. Below, the VANGA [3] algorithms for all
widespread types of scales are considered.
The general structure of the algorithms of this series is the following. Let A be a table
having a missing element a_ij. The rest of the elements of the table are all known.
Experience in using algorithm ZET shows that the information necessary to fill a
missing element correctly is distributed over the table irregularly; in this
connection, the variants of filling the gap ('prompts') obtained by using various elements
of the table should be taken into account in the final result with weights
proportional to the 'competence' of these elements. In algorithm ZET, a separate
column or a separate line of the table is the minimal 'prompting' element.
In the algorithms of the VANGA series, submatrices of various dimensions (starting
from dimension 2×2) are such minimal elements. We compute, for each
submatrix including the element a_ij to be filled, its 'prompt' (b) and 'competence' (C).
The methods of calculating the values of b and C are chosen according to the type of
the scale in which the tabular data have been measured.
(1) Let us consider the case when the data in Table 2 have been measured in the
relation scale (algorithm VANGA-R). The ratios of the values of any two elements of
the same column are invariant to the transformations permissible for this scale.
If for a pair of lines (the ith and the lth) the ratio of the elements in the kth
column were just the same as in the jth one, we would write

    a_ij = (a_lj · a_ik) / a_lk.

But in reality a submatrix, consisting of four elements that belong to the ith line

Table 2
1 ... k ... j ... n

1 1 3 1 6

2 1 2

3 4 3 5
4 2 4 4
5 5 5 3
6 6 6 1

and the jth column, and to the lth line and the kth column as well, could give
us only a variant ('prompt' b_lk) of the value of the element a_ij:

    b_lk = (a_lj · a_ik) / a_lk,    k ≠ j; l ≠ i.

Let us form a matrix B of the same dimension as matrix A, in which the prompt
b_lk stands at the crossing of the lth line and the kth column.
Let us now evaluate the mean value of the quantities b_lk (b̄) and their dispersion
(D) in each column and in each line:

    b̄_.k = (1/(m−1)) Σ_{l=1}^{m−1} b_lk,    b̄_l. = (1/(n−1)) Σ_{k=1}^{n−1} b_lk;

    D_.k = (1/(m−1)) Σ_{l=1}^{m−1} |b̄_.k − b_lk|,    D_l. = (1/(n−1)) Σ_{k=1}^{n−1} |b̄_l. − b_lk|.

It is evident that the smaller the dispersion D_.k (D_l.) is, the 'more competent' the
opinion of the kth column (of the lth line) is about the value of the missing
element; and the competence of an element a_lk will be defined as a function of the dispersion
of its column, D_.k, and of its line, D_l.. Let us denote the product D_.k · D_l.
by D_lk. Then the competence C_lk of the submatrix with the element a_lk will be found as

    C_lk = ( (D_max − D_lk) / (D_max − D_min) )^α,

where D_max and D_min are the maximum and the minimum of D_lk over the whole table. C_lk
approaches 1 when D_lk = D_min and it approaches 0 when D_lk = D_max. In the
interval between these extremes, 0 and 1, the quantity C_lk is a function of D_lk
and α. The value α, as in ZET, is chosen by the best prediction of the known
elements of the ith line and of the jth column of matrix A.

Table 3

        1      ⋯      j      ⋯      n
       2.0    0.33    -            3.00
       2.0    0.75    -            7.50
       2.0    2.00    -            8.00
       2.0    1.00    -            7.50
       2.0    1.00    -            3.00

The mean value â_ij of the expected element a_ij is computed as

    â_ij = ( Σ_{k,l} C_lk · b_lk ) / ( Σ_{k,l} C_lk ),    k = 1, ..., n; l = 1, ..., m.

The values of the prompts and of the competences under α = 0.5 are given in Table 3
(matrix B) and in Table 4 (matrix C).
The predicted value of a_ij, under these conditions, equals 1.998.
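A small sketch of VANGA-R written by me (not the authors' program): every known pair (l, k) yields a prompt b_lk = a_lj · a_ik / a_lk, the prompts get competences built from the mean absolute deviations by row and column, and the gap is filled with the weighted mean. The example matrix, with proportional columns, is an illustrative assumption.

```python
import numpy as np

def vanga_r(A, i, j, alpha=0.5):
    """Fill the single gap A[i, j] of a relation-scale table A (all other cells known)."""
    m, n = A.shape
    rows = [l for l in range(m) if l != i and not np.isnan(A[l, j])]
    cols = [k for k in range(n) if k != j and not np.isnan(A[i, k])]
    B = np.array([[A[l, j] * A[i, k] / A[l, k] for k in cols] for l in rows])   # prompts
    D_col = np.mean(np.abs(B - B.mean(axis=0)), axis=0)          # dispersion of each column
    D_row = np.mean(np.abs(B - B.mean(axis=1, keepdims=True)), axis=1)
    D = np.outer(D_row, D_col)                                   # D_lk = D_l. * D_.k
    if D.max() > D.min():
        C = ((D.max() - D) / (D.max() - D.min())) ** alpha       # competences
    else:
        C = np.ones_like(D)                                      # all prompts equally competent
    return float((C * B).sum() / C.sum())

A = np.array([[1.0, 3.0, 1.5, 6.0],
              [2.0, 6.0, np.nan, 12.0],
              [4.0, 12.0, 6.0, 24.0]])
print(vanga_r(A, 1, 2))    # proportional columns give the prompt 3.0
```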
(2) If the data in Table 2 have been measured in the interval scale, then in algorithm
VANGA-I a submatrix consisting of two columns (j and k) and of three lines
(the ith, lth, and qth) is the minimal prompting element. The ratio of the differences of
any two pairs of elements of the same column is the invariant of the interval scale.
Then, if one accepts the hypothesis of a direct connection between the jth and the
kth feature, one can write

    (a_ij − a_lj) / (a_lj − a_qj) = (a_ik − a_lk) / (a_lk − a_qk).

Hence the prompt for the element a_ij, obtained with the participation of the elements a_lk and
a_qk (let us denote it by b_lq,k), will be

    b_lq,k = a_lj + (a_ik − a_lk)(a_lj − a_qj) / (a_lk − a_qk).

Table 4

        1      ⋯      j      ⋯      n
 1    1.00    1.00    -            0.93
      1.00    0.97    -            0.00
      1.00    0.98    -            0.27
      1.00    0.98    -            0.25
 m    1.00    1.00    -            0.96

Computing the prompts, their mean values and dispersions in the columns and lines,
determining the competence of these prompts, and taking them into account in determin-
ing the value of the expected element is done in the same way as for the relation
scale. The value a_ij in matrix A is then predicted to be 2.3.
(3) In the order scale (algorithm VANGA-O) the minimal submatrix has dimension
2×2. Statements of the kind
"if a_ik ≥ a_lk, then a_ij ≥ a_lj too"
are invariant to the transformations of the order scale. Hence the prompt is b_lk ≥ a_lj if
a_ik ≥ a_lk, and b_lk < a_lj if a_ik < a_lk. If the rank correlation between the jth and the
kth columns is negative, then the prompts have the inverse sign: b_lk ≥ a_lj if a_ik <
a_lk, and b_lk < a_lj if a_ik ≥ a_lk.
In the columns of matrix B (Table 5), (d + 1) different variants of prompts may be
found if d is the number of known values in the jth column that do not coincide with
one another. In fact, one of d + 1 different events may take place; in our example
such events are

    (1) b_lk < 1,    (2) 3 > b_lk ≥ 1,
    (3) 4 > b_lk ≥ 3,    (4) 5 > b_lk ≥ 4,
    (5) 6 > b_lk ≥ 5,    (6) b_lk ≥ 6.

The prompts b_lk show whether the unknown quantity a_ij may fall in one or another
of the (d + 1) ranges of the order scale. If b_lk < 4, then such events as
a_ij < 1, 3 > a_ij ≥ 1, 4 > a_ij ≥ 3 are possible. Let us assign a 1 to each of these events.
If b_lk ≥ 6, then it is only possible that a_ij ≥ 6, and a 1 is assigned only to this
range. After looking over all the prompts of a column, we interpret the sum
of the ones assigned to each range, divided by the total number of ones, as the
probability (P_s) of the fact that a_ij will be in this range. The entropy

    H_.k = − Σ_{s=1}^{d+1} P_s ln P_s

may be used as a measure of uncertainty in this case.

Table 5

        1      ⋯      j      ⋯      n      H_l.
 1     ≥1     <1      -            ≥1      1.768
       <3     <3      -            ≥3      1.733
       <4     <4      -            ≥4      1.735
       <5     <5      -            ≥5      1.748
 m     <6     <6      -            ≥6      1.609
H_.k  1.685  1.488    -           1.685

Table 6

        1      ⋯      j      ⋯      n
      0.268   0.425   -           0.268
      0.303   0.444   -           0.303
      0.302   0.445   -           0.302
      0.288   0.436   -           0.288
      0.677   0.755   -           0.677

The line uncertainty (H_l.) is found by entropy as well. The competence of each
separate prompt b_lk is found from the product H_lk of the uncertainties H_l. × H_.k of the
lth line and of the kth column:

    C_lk = ( (H_max − H_lk) / (H_max − H_min) )^α,    H_max = 3.211, H_min = 0.

The quantities C_lk under α = 0.5 are given in Table 6. Now, using the weights C_lk,
one can find the sum of the weighted votes for each of the six possible events. For
the case under review these sums are

    P_1(a_ij < 1) = 0.207,    P_2(3 > a_ij ≥ 1) = 0.212,
    P_3(4 > a_ij ≥ 3) = 0.192,    P_4(5 > a_ij ≥ 4) = 0.173,
    P_5(6 > a_ij ≥ 5) = 0.154,    P_6(a_ij ≥ 6) = 0.062.

The strongest hypothesis states that 3 > â_ij ≥ 1.


(4) Now let us consider the nominal scale (algorithm VANGA-N). The name of the
missing element a_ij may be one of the d names presented in Table 2 (in our example
these names are 1, 2, 3, 4, 5, 6). Let us find an estimate of the probability of each of
these d possible events. For that, let us look over all prompting submatrices of size
2×2. We shall look for a prompt proceeding from the following assumption: if in
the kth column the ith and the lth objects are called by the same name, then in
the jth column they should also have the same name, i.e. if a_ik = a_lk, then a_ij = a_lj.
And vice versa, if a_ik ≠ a_lk, then a_ij ≠ a_lj too.
The matrix B of prompts b_lk for the example names matrix A (Table 2) is shown in
Table 7. There the values of the entropy of the lth line (H_l.) and of the kth column
(H_.k) are also given; from these, the competence of each element is computed in the same
way as for the order scale.
In this example the values of the competences C_lk of all prompts are the same: C_lk = 1.

Table 7

        1      ⋯      k      ⋯      j      ⋯      n      H_l.
 1     =1            ≠1             -            ≠1      1.79
       ≠3            ≠3             -            ≠3      1.79
       ≠4            ≠4             -            ≠4      1.79
       ≠5            ≠5             -            ≠5      1.79
 m     ≠6            ≠6             -            ≠6      1.79
H_.k  1.94          1.94            -           1.94

Having been weighted with the competences C_lk, the probabilities of the final
results were found to be

    P_1(a_ij = 1) = 0.031,    P_2(a_ij = 2) = 0.735,
    P_3(a_ij = 3) = P_4(a_ij = 4) = P_5(a_ij = 5) = P_6(a_ij = 6) = 0.047.

The value â_ij = 2 is predicted as the most probable event. (A compact sketch of this
name-voting scheme is given at the end of this section.)


If there is more than one gap in Table 2, then one can predict each missing
element independently of the others ('parallel' strategy) or in turn, using all
elements, both initial and those predicted in the previous steps ('consecutive' strategy).
Under the consecutive strategy one ought to begin by predicting the element for
which one succeeds in using the greatest number of prompting submatrices.
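The sketch of VANGA-N promised above, written by me under stated assumptions: each 2×2 submatrix votes "same name as a_lj" when a_ik = a_lk and "different name" when a_ik ≠ a_lk; in my reading the 'different' vote is spread uniformly over the other candidate names, and all competences are taken equal to 1, as in the worked example. The small table of names is invented.

```python
from collections import Counter

def vanga_n(A, i, j):
    """A: list of rows; None marks the gap. Returns probability estimates for the name of a_ij."""
    m, n = len(A), len(A[0])
    names = {A[l][j] for l in range(m) if l != i and A[l][j] is not None}
    votes = Counter()
    for l in range(m):
        if l == i or A[l][j] is None:
            continue
        for k in range(n):
            if k == j or A[i][k] is None or A[l][k] is None:
                continue
            if A[i][k] == A[l][k]:
                votes[A[l][j]] += 1.0                # vote: a_ij should equal a_lj
            else:
                others = names - {A[l][j]}           # vote: a_ij should differ from a_lj
                for name in others:
                    votes[name] += 1.0 / len(others)
    total = sum(votes.values())
    return {name: votes[name] / total for name in names}

A = [[1, 3, 1, 6],
     [1, 2, None, 2],
     [4, 3, 5, 5],
     [2, 4, 4, 4]]
probs = vanga_n(A, 1, 2)
print(max(probs, key=probs.get), probs)    # the name 1 collects the largest probability here
```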

3. Conclusion

Algorithms ZET and VANGA have been used in solving a great number of
tasks in the fields of geology, medicine, agriculture, etc. The redundancy of real tables
is such that its use allows good filling of the missing elements, even when their
number sometimes approaches 30% of the total number of elements in a table.

References

[1] Yolkina, V. N. and Zagoruiko, N. G. (1978). Some classification algorithms developed at
Novosibirsk. R.A.I.R.O. Informatique/Comput. Sci. 12 (1), 37-46.
[2] Scott, D. and Suppes, P. (1958). Foundational aspects of theories of measurement. J. Symbolic
Logic 29, 113-128.
[3] Zagoruiko, N. G. (1979). Empirical Prediction. Nauka, Novosibirsk.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
© North-Holland Publishing Company (1982) 501-526

Recognition of Electrocardiographic Patterns

Jan H. van Bemmel

1. Introduction

1.1. Objectives
In this chapter we will discuss some statistical and other operations on the
ECG, not only because the signal is representative of a time-varying biological
phenomenon, but also since the efforts made and the results obtained in this area
are illustrative of the methods applied to other biological signals.
The strategies and techniques to be described (preprocessing, estimation
of features, boundary recognition and pattern classification) can also be applied
to many other signals of biological origin, such as the electro-encephalogram, the
spirogram or hemodynamic signals (Cox, 1972).
The ultimate goal of such processing is the medical diagnosis or, in research, to
obtain insight into the underlying biological processes and systems. Our main
objective, therefore, will be to show the state-of-the-art in biological signal
processing and recognition by discussing specifically the processing of the electro-
cardiogram.

1.2. Acquisition
With modern technology and its micro-sensors, a host of different transducers,
instruments and data acquisition techniques became available to study the human
body and to examine the individual patient. Electrodes for the recording of the
ECG have been much improved by new materials and buffered amplifiers;
multi-function catheters can be shifted through veins and arteries, e.g. for the
intra-cardiac recording of the His-electrogram, flows and pressures; micro-
transducers can even be implanted for long-term monitoring of biological signals.
Many other, noninvasive, methods have been devised to study the organism under
varying circumstances, e.g. during physical exercise.
In the entire process of signal analysis, the data acquisition is of course a very
important stage to obtain signals with a high signal-to-noise ratio (SNR). For this
reason it must be stressed that signal processing starts at the transducer; it makes
no sense to put much effort into very intricate statistical techniques if the trans-
ducers are not properly located and if the disturbances are unacceptable, hamper-
ing detection, recognition and classification. During the last decade it has become
common for the signals to be digitized as soon as they have been acquired and
amplified, and to be recorded in digital form or processed in real time. Data acquisi-
tion equipment is presently often brought very near to the patient, even to the
bedside, for reasons of accuracy, speed and robustness against possible system failure. Depending on
the specific goal, the processing is done real-time (e.g. for patient monitoring on a
coronary care unit, CCU) or off-line (e.g. for the interpretation of the ECG for
diagnostic purposes).
After this rather general introduction we will concentrate on ECG analysis,
primarily by computerized pattern recognition and related statistical techniques.
After a brief discussion of the different research lines in electrocardiology, we will
mainly restrict ourselves to the interpretation of the standard ECG leads (12-leads
or vectorcardiogram), recorded at the body's surface. In these sections we will
follow the main steps by which computer interpretation is usually done.

2. Electrocardiology

Before discussing the processing of the electrocardiogram in more detail, we


want to treat briefly the place of ECG-interpretation by computers amidst the
many models and theories that have been developed to understand the ECG (see
e.g. McFee, 1972). This short survey will reveal the reasons why the ECG--being
the electrical signal generated by the activated cardiac muscle and of which the
origin is physically well understood and which can be expressed in mathematical
terms--is still being analyzed in rather heuristic ways, resulting sometimes in
vague final interpretations as "probable AMI" (anterior myocardial infarction) or
"LVH (left ventricular hypertrophy) cannot be excluded (p > 0.40)". There seems
to exist a large inconsistency between the potentially elegant physical and
mathematical descriptions and the apparently rather subjective solutions to ECG
interpretation.
The amount of information, generated by the heart and available for measure-
ment at the body's surface is so large and is redundant to such a degree, that one
has to curtail the information stream for practical and sensible reasons. Well-
chosen samples have to be taken from the total information stream: in space, by
measuring the field at only a restricted number of locations (e.g. by means of the
Frank VCG system), in time by recording the signals over a limited period of time
(e.g. 5 or 10 sec.), and in frequency by filtering the signals within a certain
bandwidth, so that we end up with a finite, but still rather large amount of
samples. From this set the interpretation must be made.

2.1. Spatial samples and models


In order to take the samples in space, scores of lead systems have been defined,
from the triangular system of Einthoven via the 12 conventional leads up to the

[Block diagram: cardiac generator and excitation model, forward modelling, and the inverse problem, with theoretical outputs (fixed dipoles, moving dipoles, multipoles) and phenomenological outputs (ECG/VCG components, diagnosis) tied to clinical findings and a statistical criteria data base.]

Fig. 1. Schematic representation of research in electrocardiology, all starting at the cardiac generator.
The central question is the inverse problem, which is tackled by a theoretical approach (the models
based on fixed or moving dipoles or multipoles) or a phenomenological method (ECG/VCG, surface
maps). The criteria for the theoretical inverse models are derived from electrophysiology and the
development of forward excitation models. The criteria for the phenomenological approach (logical
and/or statistical models) stem from clinical findings and well-documented ECG data bases and are
implemented as operational interpretation systems on a computer.

multilead (126 or more) systems, mainly for research purposes (see e.g. Barr,
1971). The vectorcardiographic lead systems are based on the physical assumption
that the heart's generator can be considered as a dipole with fixed location but
variable dipole moment. The main goal for the choice of the set of spatial samples
is to allow some kind of inverse computation, from the body's surface to the
electrical field within the heart, either by mathematical inverse solutions (based
on physical models) or by parameter estimation and logical reasoning (based on
statistical and pattern recognition models). Fig. 1 shows the different approaches
to the analysis of the ECG. For the development of the physical models, of
course, insight into the course of the electrical field through the myocardium is
necessary. Such models are again simplifications of reality: sometimes rather
crude as in vectorcardiography; moving-dipoles (Horan, 1971); multiple fixed
dipoles (Holt, 1969; Guardo, 1976); or, still more abstract, multipoles
(Geselowitz, 1971). It is not our intention to treat these different models in this
chapter, since almost none of them have demonstrated real clinical implications,
but we will restrict ourselves to the second type of solutions, that make use of
clinical experience and evidence.
Still, the advantage of theoretical electrocardiology is that it has provided us
with comprehensive knowledge about the coherence between the different ap-
proaches and with electrode positions that bear diagnostic significance (e.g.
Kornreich, 1974) without being too sensitive to inter-individual differences in
body shape, i.e. the volume conductor for the electric field.
We conclude this section by mentioning that the essential knowledge to build
models of whatever kind, is based on experiments with isolated or exposed hearts;

[Diagram: two parallel chains. Signal processing: measurement → pre-processing → estimation → interpretation (objects/signals → preprocessed signals → parameters → classified signals). Pattern recognition: measurement → transformation → feature selection → classification → action (objects/patterns → transformed patterns → features → classified patterns); stages (1)-(4).]

Fig. 2. Stages in pattern recognition and signal processing, which run fully parallel. In many instances,
however, the processing is not as straightforward as indicated here but includes several feed-back
loops. In this chapter several such feedback examples are mentioned.

[Diagram: modular ECG/VCG processing set-up, organized by package, tasks, groups, modules and error options. Packages: input, pattern analysis, classification, output; tasks: detection, typification, boundary recognition, parameter estimation, classification; input from tape, telephone or on-line sources; output as ECG diagnosis, VCG diagnoses or Minnesota code.]

Fig. 3. Example of the set-up of a modular processing system for the computer interpretation of
ECG's/VCG's. The different tasks can be clearly discerned, whereas these can themselves be
subdivided in groups and modules. The advantage of a structured set-up is its easy evaluation,
maintenance and implementation.

on the construction of the so-called forward models that simulate the real process;
and on the acquisition of (always large) populations of well-documented patient
data, ECG's. These, together, form the basis for the development of the inverse
models, of physical, statistical or mixed nature.

2.2. Stages in interpretation


The over-all purpose of ECG interpretation--and that of biological signals and
images in general--is the reduction and transformation of an often redundant but
disturbed transducer output to only a few parameters, which must be of signifi-
cance for subsequent human decisions. This interpretation can be subdivided into
the well-known stages in Pattern Recognition (e.g. Kanal, 1974), i.e. (1) measure-
ment, (2) preprocessing or transformation, (3) feature selection and (4) classifica-
tion (see Fig. 2). In this sense ECG signal processing runs parallel to pattern
recognition and image processing (Van Bemmel, 1979). Nevertheless, the interpre-
tation of the ECG can also be viewed as consisting of other steps, i.e.: detection,
typification, boundary recognition, feature selection and classification, and data
reduction. At every step it is possible to discern each of the four different stages,
earlier mentioned. This subdivision in steps and stages forms at the same time the
further division into paragraphs of the rest of this chapter. A practical realization
of ECG interpretation along such steps and stages has been realized in a few
operational systems. One of these (Talmon, 1974) is illustrated in Fig. 3, of which
the caption further explains the structural set-up. Most operational systems that
are in the public domain, have been reported by their authors in Van Bemmel and
Willems (1977) (6 American and 6 European systems for rest ECG's/VCG's; 4
for exercise ECG's; 5 for ECG's on CCU or for ambulatory care).

3. Detection

The ECG is composed of a series of consecutive events, coupled or not, of


stable, quasi-periodic nature. We can discern the activity from the sino-auricular
(SA) node and the atria, and the activities from the left and right ventricles.
Coupling between atrial and ventricular activity is mediated by the AV (atrio-
ventricular) node and the bundle of His with its two branches (see e.g. Nelson,
1974). The atrial activity is seen in the ECG as a train of P-waves at regular
distances, in case of no blockage of the impulses from the SA and AV-nodes,
followed by the QRS-complex, generated by the depolarization of the ventricles,
and the ST-T, its repolarization. Abnormal depolarization of the atria may be
seen as atrial flutter (triangular waveshapes in certain leads), or atrial fibrillation,
the latter resulting in signals of non-specific shape. Ventricular activity, if
triggered by the bundle branches, results in rather stable QRS waveforms. These
are more or less modulated by respiratory activity, especially in the chest leads,
from changes in the volume conductor and consequently in the so-called lead
vectors (for a theoretically sound description of the effect of the volume conduc-

[VCG recording trace, annotated with the interval (msec) and the cluster number of each detected complex.]

Fig. 4. Example of a VCG recording with up to 7 different QRS wave shapes. The points of QRS
detection have been indicated by vertical marks, the cluster number by numbers 1 to 7.

tor and its different time-varying conductances, see Horáček, 1974). In summary,
an ECG may show two wavetrains of rather stable shape (P and QRS), both
followed by repolarization signals, of which mainly the ST-T wave after the QRS
is clearly seen. In abnormal cases one may see almost any combination of atrial
and ventricular signals, sometimes consisting of different QRS-wave shapes
resulting from intra-ventricular pacemakers. In Fig. 4 an illustration is seen of a
rather chaotic signal where only sporadic periodic epochs are observed.
Such signals pose a big challenge to the development of generally applicable processing
methods. The first problem to be solved is to detect all QRS-complexes without
too many false positives (FP) or missed beats (FN). If this problem has been
solved, the question remains how to detect the tiny P-waves amidst the always
present disturbances, especially if the rhythm and QRS-waves are chaotic.

3.1. A priori knowledge


Before treating the specific problems of QRS- and P-wave detection we want to
mention the fact that any detector can operate optimally only if as much a priori
knowledge as possible about the signals (shape, occurrence) has been built
into it.
Fig. 5 illustrates this situation, often referred to as strong or weak coupling
between detection and estimation. We will wherever possible make use of this in

[Diagram: the signal feeds both a detector (yielding a point process) and an estimator (yielding parameters), with feedback between the two.]

Fig. 5. Illustration of the principles of strong and weak coupling for simultaneous detection and
estimation of signals. In case of weak coupling, only one feed-back loop is present. In ECG pattern
recognition, these principles are frequently used.

the following. Biological processes, such as the functioning of the heart, can
frequently be considered in the time domain as a series of coupled events, as we
have seen already for the P-wave and the QRS-complex. In analyzing such signals
we are interested in the occurrence of each of these events, which can be
expressed as a point process. In many instances, however, it is a complicated task
to derive the point process from the signal if we do not exactly know what event
(wave shape) to look for. On the other hand, the determination of the wave shape
itself is much facilitated if we are informed about the occurrence of the events
(the point process). Accordingly, a priori knowledge about one aspect of the
signal considerably simplifies the estimation of the other.
In practice, this process of simultaneous detection and estimation in ECG's is
done iteratively: a small part of the signal serves for a first, rough event detection
and estimation of the wave shape, and based upon this, an improved point
process can be computed and so on. However, if we have to deal with only rarely
occurring wave shapes as in intra-ventricular depolarization with wandering
pacemakers, such a priori knowledge is not available. We can improve the
estimation only if at least a few ECG-beats of identical shape are present for
analysis.
It is unnecessary to state that the optimum performance of a detector is
obtained only if we also have at our disposal the prior probabilities of the
occurrences of the different waves. Although these are seldom known for the
individual ECG, a good compromise is to optimize the detector's perfor-
mance on a library of ECG's and to test it on another, independent population.

3.2. QRS detection


The purpose of QRS detection is to detect all depolarization waves, including
the premature beats, resulting from a dipolar wave front travelling through the
ventricles. The detection of QRS-complexes is a typical example of a heuristic
approach, although the algorithms involved are usually trained on learning
populations or at least are evaluated by independent ECG's (see contributions in
Van Bemmel and Willems (1977)).

Preprocessing (QRS)
Since most ECG interpretative systems are processing at least 3 simultaneous
leads, the detection functions are also based on combined leads. The commonly
used detection functions d(i) (i the sample number, sampling rates taken accord-
ing to definitions given by the American Heart Association (AHA, 1975)) are
based on derivatives (i.e. comparable to band-pass filtered leads) or, in terms of
three-dimensional vectorcardiography, on the spatial velocity. If the ECG(i) is
expressed as

ECG(i) = (X1(i), X2(i), X3(i)),

then the detection function d(i) can be written as

d(i) = Σ_k T(Xk(i)),

with T a transformation of Xk(i). The simplest formula for computing the
derivative is the two-sided first difference, so that

T = |Xk(i+1) − Xk(i−1)|².

The spatial velocity is in this case just the square root of d(i). Another, simpler
form, saving processing time for d(i), uses absolute values,

T = |Xk(i+1) − Xk(i−1)|,

and a third detection function is computed from the original amplitudes,

T = |Xk(i)|².

The disadvantage of d(i) with the last transformation is that it is very sensitive to
changes in baseline. Other functions for T are, though sometimes more elaborate,
essentially identical to the ones mentioned here. Fig. 6 shows an example of a
detection function computed from absolute first differences.
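As an illustration, such a detection function can be computed from three simultaneous leads as in the sketch below (array shapes and names are assumed; the absolute and squared first-difference variants correspond to the formulas above).

```python
import numpy as np

def detection_function(leads):
    """leads: array of shape (3, n) holding X_1, X_2, X_3.
    Returns d(i) = sum_k |X_k(i+1) - X_k(i-1)|, with zeros at the end points."""
    d = np.zeros(leads.shape[1])
    d[1:-1] = np.abs(leads[:, 2:] - leads[:, :-2]).sum(axis=0)
    return d

def detection_function_squared(leads):
    """The squared variant, whose square root is the spatial velocity."""
    d = np.zeros(leads.shape[1])
    d[1:-1] = ((leads[:, 2:] - leads[:, :-2]) ** 2).sum(axis=0)
    return d
```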

Features and classification ( QRS)


As an example of the detection of the QRS in the latter signal we will discuss a
method reported by Plokker (1978). Detection is done by first computing the

[Three scalar leads X1, X2, X3 and the detection function d(i) (0-100%) over 10 seconds.]

Fig. 6. Detection of the QRS-complexes in an ECG recording. From the scalar leads Xk(i), the
detection function d(i) is computed. Three thresholds are applied after estimating the 100% level. A
candidate QRS is detected with these thresholds. Further refinement, i.e. the determination of a point
of reference, is done from the derivative of one of the leads, in this case X1. Vertical lines indicate the
points of reference of the detected QRS complexes.

averaged peak of all QRS-complexes in d(i). Next, thresholds are applied at 5, 25,
and 40% of this averaged peak.
If the detection function fulfills the following conditions, a QRS-complex is
labelled as a candidate wave:
(d(i) > 25% during > 10 msec.) ∧ (some d(i) > 40% during
100 msec. thereafter) ∧ (the distance to a preceding can-
didate > 250 msec.).
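Written out as code, this candidate rule could look like the sketch below; it is a schematic reading of the conditions above, not Plokker's implementation, and the sampling rate and scaling are assumptions.

```python
import numpy as np

FS = 500                       # assumed sampling rate (samples per second)

def ms(t):
    return int(round(t * FS / 1000.0))

def qrs_candidates(d):
    """d: detection function, scaled so that the averaged QRS peak equals 100."""
    candidates = []
    i, n = 0, len(d)
    while i < n - ms(10):
        # condition 1: d(i) > 25% during more than 10 msec
        if np.all(d[i:i + ms(10) + 1] > 25):
            # condition 2: some d > 40% during the 100 msec thereafter
            if np.any(d[i:i + ms(100)] > 40):
                # condition 3: more than 250 msec from the preceding candidate
                if not candidates or i - candidates[-1] > ms(250):
                    candidates.append(i)
                i += ms(100)               # skip past this complex
                continue
        i += 1
    return candidates
```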
Other systems apply different thresholds and rules for QRS finding, but all
approaches follow one or another logical reasoning or syntactic rule, based on
intrinsic ECG properties, expressed as statistics of intervals, amplitudes or other
signal parameters. We will proceed with Plokker's method.
If the candidates, mentioned above, are found, an algorithm is applied to
discriminate between different QRS wave shapes. First of all, the lead is
determined with the largest absolute value of the derivative (see also Fig. 6, where
the X-lead is chosen). In this selected lead, a point of reference is determined: the
zero-crossing with the steepest slope within a search interval of ±100 msec.
around the first rough indication. After the zero-crossing (i.e. the point of
reference) has been found, a template is matched to the filtered QRS-complex to
assure a stable reference point (the application of strong coupling in the detector).
This template matching is done with the aid of two levels at ±25% of the
extremum. Fig. 7 shows examples of the ternary signals that result from this
procedure, for typical QRS-shapes after band-pass filtering. Templates are
determined for all different QRS wave shapes in an ECG recording, in such a
way that each time that a new QRS-shape is detected, a template is automatically
generated. Since such templates have only the values 0 and ±1, matching itself
can be simplified to elementary arithmetics of only additions or subtractions. An
already known QRS is assumed to be present if the correlation is larger than 0.70,

[Ternary template examples for three wave shapes: R-type, RS-type, QRS-type.]

Fig. 7. Examples of ternary templates, derived from an individual ECG for the preliminary labelling
of QRS wave shapes and the determination of a point of reference. The ternary signals are computed
from the band-pass filtered QRS-complex and used for signal matching.

else a new template is generated. In the end all candidate complexes are matched
with all templates found in an individual recording to determine the highest
correlation factors. A method as described here, yields less than 0.1% FP (false
alarms) or FN (missed beats) as found in a population of 47 750 beats from 2769
ECG's in Plokker (1978).
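The ternary-template idea can be sketched as follows: the band-pass filtered complex is reduced to −1/0/+1 with levels at ±25% of its extremum, and a candidate is assigned to an existing type when the normalized match exceeds 0.70, otherwise a new template is generated. This is an illustration with assumed helper names, not the published code.

```python
import numpy as np

def ternary(qrs_filtered):
    """Reduce a band-pass filtered QRS segment to -1/0/+1 using levels
    at +/-25% of its extreme value."""
    level = 0.25 * np.max(np.abs(qrs_filtered))
    t = np.zeros_like(qrs_filtered)
    t[qrs_filtered > level] = 1
    t[qrs_filtered < -level] = -1
    return t

def match(a, b):
    """Normalized correlation of two ternary signals of equal length; with
    values -1, 0, +1 the sums reduce to additions and subtractions."""
    den = np.sqrt(np.dot(a, a) * np.dot(b, b))
    return float(np.dot(a, b)) / den if den else 0.0

def assign_type(templates, qrs_filtered, threshold=0.70):
    """Return the index of the matching template, generating a new one if needed."""
    t = ternary(qrs_filtered)
    scores = [match(tpl, t) for tpl in templates]
    if scores and max(scores) > threshold:
        return int(np.argmax(scores))
    templates.append(t)
    return len(templates) - 1
```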

3.3. P-detection
A similar, though more intricate, strategy can be followed for the location of
P-waves. They are small as compared to the QRS-complex, of the order of 100
μV or less, and of the order of magnitude of the noise in the bandwidth of 0.10 to
150 Hz, which can amount to 10 μV or more. The frequency spectrum of such
P-waves, however, lies far below this 150 Hz: roughly in between 0.10 and 8 Hz.
The P's can be coupled to QRS-complexes or not, and most of the time occur
with a repetition rate of less than 200 per minute. The duration of the P-wave is
about 60 to 100 msec., its shape may vary from person to person and from lead to
lead. If the P-wave is coupled to the QRS, the range of the PR interval
distribution is less than 30 msec.
The processing of P-waves may require much computer time unless we use data
reduction methods and optimized algorithms to speed up the processing. Often
we have to look for a compromise between what is theoretically (from the
viewpoint of signal processing) desirable and practically (from the standpoint of
program size and processing time) feasible. P-wave detection is an illustrative
example in this respect. First of all, we will discriminate between 'coupled' and
'non-coupled' P-waves. This is of importance, since regular rhythms are most
commonly seen (in more than 90% of all patients). For that reason the detector
has first of all to ascertain whether such coupling is present or not. The detection
of 'non-coupled' P-waves is much more cumbersome and requires considerable
computational effort. For both approaches we will discuss the processing stages
mentioned earlier (for illustration purposes we follow the lines of thought
published by Hengeveld (1976)).

Preprocessing (coupled Ps)


In order to enhance the probabilities of finding coupled P-waves, the pre-
processing is done only in a window before the onset of a QRS-complex, by
filtering each ECG signal in a proper bandwidth (0.10-8 Hz) to increase the SNR
of the P-wave. Since the upper frequency has been decreased it is permissible to
diminish the sampling rate in this interval as well (e.g. to 100 Hz).

Features and classification (coupled Ps)


The parameters on which the decision is based whether the P-waves are coupled
to the QRS-complex are computed as follows. The instants of maximal and
minimal amplitude within the window are determined and the differences are
computed between these instants and the estimated QRS-onset. This yields per
ECG-lead 2 intervals (for 3 leads, of course, 6 intervals). For the entire recording

t PR
l r l l l l l l ~ l l l l 1 1 1 I ' l l l l l l l l l l l l l l l I I I I I I I I I I I I I I I [ I

I I I I I I I I I I I I I T l l l F I I I I I I I I I I I I I I I I I I I I I I I 1 I I I [ [ I I I I

I l l l l l l l [ l l l 1 [ l l l I I i l 1 1 1 1 1 1 1 1 1 1 1 1 1 I I I I I I I I I I I I I I I I I

Fig. 8. Examples of PR, PP and RR interval histograms for 3 different ECG's, used for the
determination of the presence of coupled P-waves and stable sinus rhythms. In case of P-wave
coupling, the PR appears to be rather constant even if the PP and RR intervals are varying. Distances
between vertical marks are 1,/30 sec. apart. Left: stable sinus rhythm; middle and right histograms:
irregular sinus rhythm but stable PR intervals.

the ranges of the interval distributions are computed and, if at least one of these is
small enough (i.e. < 30 msec.) for at least 80% of all QRS-complexes present, the
decision 'coupled P-waves' is made. Fig. 8 shows examples of such distributions
for 3 simultaneous leads, for coupled as well as for non-coupled Ps.
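A sketch of this decision rule is given below. The window length, sampling rate, and the test of whether 80% of the intervals fall within a 30-msec band are assumptions about details the text leaves open; the overall logic follows the description above.

```python
import numpy as np

def coupled_p_decision(p_filtered, qrs_onsets, fs=100, window_ms=300,
                       band_ms=30, min_fraction=0.80):
    """p_filtered: (n_leads, n_samples) P-enhanced (0.1-8 Hz) signals at fs Hz.
    For each QRS onset the distances from the maximal and minimal amplitude in a
    window before the onset to the onset itself are collected per lead; the
    P-waves are called coupled if at least one of these interval series has at
    least `min_fraction` of its values inside a band of `band_ms` msec."""
    w = int(window_ms * fs / 1000)
    n_leads = p_filtered.shape[0]
    series = [[[], []] for _ in range(n_leads)]     # per lead: [max-based, min-based]
    for onset in qrs_onsets:
        if onset < w:
            continue
        seg = p_filtered[:, onset - w:onset]
        for lead in range(n_leads):
            series[lead][0].append(w - int(np.argmax(seg[lead])))
            series[lead][1].append(w - int(np.argmin(seg[lead])))
    half_band = band_ms * fs / 1000 / 2.0
    for lead in range(n_leads):
        for intervals in series[lead]:
            if intervals:
                dev = np.abs(np.array(intervals) - np.median(intervals))
                if np.mean(dev <= half_band) >= min_fraction:
                    return True
    return False
```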
If the P-waves cannot be classified as coupled to the following QRS, it still
remains to be investigated whether Ps are yet present and at what instants. For
that reason the entire processing is started once more, now by using shape
information. We will follow the stages in this second approach as well.

Preprocessing (non-coupled Ps)


First of all the search window is enlarged from the end of the preceding T-wave
until the onset of the next QRS-complex. Filtering and sampling rate reduction as
well as computation of the derivatives of the signals is done in the same way as
discussed before. The derivative itself is computed after having cut away the
QRS-complex (see Fig. 9(a)) in order to diminish the effect of the high-amplitude
QRS on the filter output.
Next, the signal is rectified and two thresholds are applied, at 75% and 50% of
the extreme value within the search area. By combining the outputs of the two
threshold detectors we construct a ternary signal (Fig. 9(e)), which is a highly
reduced version of the original signal and most of the time of zero amplitude. In
practice, only the instants of level crossing are stored and used for further
processing as will be made clear in the following.
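A sketch of this reduction step is given below; the particular 0/1/2 coding of the combined threshold detectors is an assumption, while the thresholds at 75% and 50% of the extreme value follow the text.

```python
import numpy as np

def ternary_trace(filtered_tq_segment):
    """Rectify the band-pass filtered TQ segment and combine two threshold
    detectors (75% and 50% of the extreme value) into a three-valued trace.
    Only the instants of the level crossings need to be stored."""
    r = np.abs(filtered_tq_segment)
    peak = r.max()
    trace = (r >= 0.50 * peak).astype(int) + (r >= 0.75 * peak).astype(int)
    crossings = np.flatnonzero(np.diff(trace) != 0) + 1
    return trace, crossings
```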

Features and classification (non-coupled Ps)


In the next stage the cross-correlation is computed between the ternary signal
and a template, previously computed from a learning set of P-waves, preprocessed

[Panels (a)-(g): original signal, signal with QRS cut away, band-pass filtered output, rectified signal, ternary signal, template, and matching function.]

Fig. 9. Steps in the recognition of non-coupled P-waves. In the original signal (a) the QRS is cut away
(b) in order to diminish the response of the high-amplitude QRS in the band-pass filtered output as
seen in (c). The latter signal is rectified (d) and thresholds are applied to derive a ternary signal (e) that
is supposed to give a response where P-waves are located. Signal (e) is crosscorrelated with a template
(f) that has been computed from a training population of P-waves. The matching function is seen in
(g). Again a level is applied to detect the presence of P-waves.

in the same manner (and not from the individual E C G recording, as in QRS
detection). In this template (Fig. 9(f)) the information about the set of P-waves is
condensed--albeit in a rather crude way for reasons of processing speed. Of
course it would in some instances be better to use the prior information about the
P-wave shapes of the individual ECG, but, as written in Section 3.1., this is not
always feasible so that in such cases the statistical properties of a population of
signals is used instead. So, the parameters that are used for recognition are the
instants of the different level crossings, which can be visualized as a ternary
signal. In practice, this cross-correlation or matching again does not imply any
multiplication, since it can be proven that the entire correlation can be carried out
by simple additions and subtractions of intervals, to be computed only at the
instants of level crossings, which is most advantageous for processing speed. An
example of the correlation as a function of time is shown in Fig. 9(g). If the
correlation reaches a maximum above 0.80, the presence of a P-wave is assumed.
The procedure is carried out for each individual lead available and for all
TQ-intervals.
Most interpretation systems for ECG's offer only overall evaluation results.
The available literature in this field only seldom gives the reasons for ECG-

Table 1
Evaluation of the P-wave detection method, described in Section 3. The
numbers are derived from 42240 P-waves from 2769 ECG recordings.
162 P's are missed and 1072 falsely detected. For these ECG's this gave
rise to less than 2.5% errors in the arrhythmia classification, of which
the majority were minor deviations

                        Reference
                        +         −
Computer    +           41168     1072
            −           162       n.a.

misclassification, which might have happened anywhere during the various steps
and stages of processing. For that reason it is of utmost importance to trace the
shortcomings of all intermediate steps involved in the interpretation, so that
possible weak links in the chain of steps can be improved. For the two different
approaches to P-wave detection this has been done by Plokker (1978). Evaluation
results from a population of 1769 patients can be seen in Table 1.
We will conclude this section on detection by mentioning that in processing
ECG's the finding of other events is important to avoid wrong (FP) detections.
This regards the detection of 'spikes' (sometimes with similar shapes as the QRS)
resulting from electrical disturbances in the environment of the patient; the
measurement and correction of 60 Hz (50 Hz) mains interference; the effect of electrode
polarization causing wandering baselines or even amplifier saturation. Further-
more there is the disturbance of biological origin: patient movements and their
effects on the ECG like baseline fluctuations, electromyographic signals and the
modulation of the signal (up to a modulation depth of 50%) caused by respira-
tion. In order to obtain a reliable interpretation of the ECG with a minimum of
FP and FN, all steps (the detectors, parameter estimators and classifiers) have to
reckon with these disturbances. In many systems special detectors and pattern
recognition algorithms have been built-in to find baseline drift, spikes, EMG and
so on. Especially in cases where a superposition of signal and nonstationary
disturbances exists, discrimination is very complicated or even impractical, given
the finite amount of time allowed for an ECG computer interpretation because of
economic implications.

4. Typification

Morphologic classification in ECG interpretation, as we will report in this


section, is done at 3 different steps: we have already treated the detection of the QRS,
where a ternary template is employed for a rough indication of wave shapes; in
this paragraph the morphologic classification is called typification, intended to

find the normal or modal beat by labelling all wave shapes; in the diagnostic
classification it is directed towards a discrimination between (degrees or combina-
tions of) disease patterns.

Preprocessing (typification)
In this step we depart from the original signal(s), given the fiducial points
found by the detection. As in all pattern recognition applications, here also the
question arises which set of features to use for typification. We repeat that as
much a priori information should be utilized as is known and available. For
illustration we discuss again one specific development (Van Bemmel, 1973). We
know that the duration of a QRS-complex is on the average not longer than 100
msec. and that most signal power is found in the bandwidth between about 8 to
40 Hz. For these reasons the QRS is filtered by a digital convolution procedure
around the reference point and the sampling rate is reduced to 100 Hz in such a
way that the instantaneous amplitudes (or, in VCG, vectors) are located in an
area around and phase-locked with the fiducial point at 10 msec. distances apart,
where most of the signal power is located. In practice, this could mean e.g. 3 or 6
parameters before and resp. 7 or 4 parameters after the reference point. The
location of the window of instantaneous amplitudes is therefore dependent on
QRS morphology. From this set of 10 filtered, instantaneous amplitudes per
QRS-complex, the features are computed for typification or labelling.

Features and classification (typification)


The parameters that are used for the discrimination between different types of
QRS-shapes within one ECG, are based on signal power computed by means of
the variance-covariance matrix. Let, for one or more leads, the set of 10
instantaneous amplitudes of complex k be represented by the vector vk. The series
of K consecutive QRS-complexes can then be represented by the set {vk}. For these
vectors v we determine the variance-covariance matrix by

cov(j, k) = vj^T vk      (T for transposed),

of which the instantaneous power can be written as P(k) = cov(k, k). The
cross-correlation matrix can be written as

ρ(j, k) = cov(j, k) [cov(j, j) cov(k, k)]^{−1/2}.

Both the ρ's and the instantaneous power P(k) are used as features for typifica-
tion. Two complexes with indices j and k are said to be identical if

(P(k)/w < P(j) < w·P(k)) ∧ (ρ(j, k) > λ0)

in which w is a weighting coefficient and λ0 a threshold, still to be determined
from a learning population. In Fig. 10 we see the effect of λ0 on the typification.
Based on such studies, suitable values for w and λ0 are: w = 2 and λ0 = 0.80. This

[Plot: QRS typification errors (in %, logarithmic scale from 0.1 to 50) versus the typification threshold (60-100); filled symbols: too many types, open symbols: too few types.]

Fig. 10. Effect of the typification threshold on the number of correctly labelled QRS-complexes. If the
level is too high, all complexes are called identical and vice versa. In case of parallel leads,
combinatory rules are applied for the optimization of typification.

means that in 10-dimensional feature space two complexes are called identical if
they fall within a cone with a spatial angle determined by λ0 and within two
spherical shells determined by w.
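In code, the similarity test of the last paragraph could be written as in this sketch (the feature vectors are the 10 filtered instantaneous amplitudes; names are illustrative).

```python
import numpy as np

W, LAMBDA_0 = 2.0, 0.80          # suitable values quoted above

def same_type(v_j, v_k, w=W, lam0=LAMBDA_0):
    """v_j, v_k: 10-dimensional vectors of instantaneous amplitudes.
    Identical type if the instantaneous powers differ by less than a factor w
    and the normalized cross-correlation exceeds lam0."""
    p_j, p_k = float(np.dot(v_j, v_j)), float(np.dot(v_k, v_k))
    rho = float(np.dot(v_j, v_k)) / np.sqrt(p_j * p_k)
    return (p_k / w < p_j < w * p_k) and (rho > lam0)
```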
To speed up the computation time for an ECG of, say, 20 beats, not the entire
matrix of 20×20 terms is computed, but in practice a sequential method is
employed which needs only a small part of the matrix. Starting with the first
complex, the first row of the matrix (i.e. the covariances with all other complexes)
is computed. Next, only those rows are computed for which the conditions of
similarity were not fulfilled, which brings the number of computations back to
about 10%. For all leads available this procedure is repeated. If more than one
lead is available, again a syntactic majority rule is applied which is optimized by a
learning population. Also the determination of the dominant beat is done by such
rules, making use of intervals between the different types of complexes. Table 2
presents a result of the algorithm for QRS-typification. The typification of ST-T
waves is done in similar ways also based on instantaneous amplitudes, falling,
however, in a much lower frequency bandwidth. The combined results of QRS-
and ST-T labelling finally gives the dominant complexes that can be used for
diagnostic shape classification, to be treated later on.

Supervised learning
Thus far we have restricted ourselves to the labelling of QRS-waves from ECG
recordings of rather short (e.g. 5-15 sec.) duration. If this problem has to be
solved for very long recordings as in a coronary care unit or for ambulatory ECG
monitoring, slightly different techniques can be used (Ripley, 1975; Feldman,
1977). Especially in the first situation all operations have to be executed in
real-time or even faster. If time allows, in such circumstances interactive

Table 2
Decision matrix for QRS typification. In 2769 records 47751 QRS-complexes were seen,
with up to 6 different wave shapes. The overall result was an error rate of less than
0.07%. The artefact typifications were done on 50 distorted waveforms and noise spikes,
not seen by the detection stage

                               Computer
              Type       1        2      3      4      5      6
              1          47209    24     1
              2          19       424    7
Reference     3                          50
              4                                 11
              5
              6
              Artefact   39       6      5

pattern recognition may offer great advantages (Kanal, 1972). Several systems,
therefore, make use of man-machine interaction for waveform recognition
(Swenne, 1973). As soon as the computer (e.g. by using the same methods as
explained before) finds an unknown QRS-shape, the user (nurse, physician) is
requested to indicate whether he wants to label this wave as normal or, e.g., as a
PVC or wants to ignore it. The computer stores the patterns belonging to the
indicated beats in memory for future comparison. In this way two goals are
served: the user determines the labelling for the individual patient himself and he
is alerted as soon as waves of strange or abnormal shape suddenly occur. During
training (supervised learning) and thereafter, the computer determines the gravity
point of the cluster Φk, belonging to type k, as follows:

mk = (1/nk) Σ_{i=1}^{nk} vi,

vi being the (10-dimensional) feature vector for complex i and nk being the
number of times the complex of type k has been observed. The dispersion of the
cluster is determined in the usual way:

sk² = Σ_{i=1}^{nk} (vi − mk)² / (nk − 1).

The distance from some new vector vj to the clusters {Φk} is computed by the
normalized Euclidean distance:

djk² = Σ (vj − mk)² / sk²,

the sum being taken over the 10 components. vj is allocated to Φk instead of Φl if
(djk < λa·djl) ∧ (djk < λk). Proper measures for the thresholds are λa = 5 and
λk = 3 or 4. In order to allow a gradual change in
wave shapes (letting the cluster Φk slowly float in feature space), we may use
recursive formulae as soon as nk > λn (during the training λn = nk):

mk(nk + 1) = (λn·mk(nk) + vj) / (λn + 1).

A suitable value for λn lies in the order of 20.
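These running estimates can be turned into a small bookkeeping routine, as sketched below. The initialization of the dispersion, the incremental variance update during training, and the scaffolding of the allocation are assumptions; the distance, the allocation condition and the recursive mean update follow the formulas above.

```python
import numpy as np

LAMBDA_A, LAMBDA_K, LAMBDA_N = 5.0, 3.0, 20.0

class Cluster:
    """Gravity point and dispersion of one wave-form type in feature space."""
    def __init__(self, v):
        self.m = np.array(v, dtype=float)       # gravity point m_k
        self.s2 = np.ones_like(self.m)          # dispersion, start value assumed
        self.n = 1

    def distance(self, v):
        """Normalized Euclidean distance d_jk of vector v to this cluster."""
        return float(np.sqrt(np.sum((v - self.m) ** 2 / np.maximum(self.s2, 1e-12))))

    def update(self, v):
        self.n += 1
        if self.n <= LAMBDA_N:
            # during training: running mean and (Welford-style) dispersion
            delta = v - self.m
            self.m += delta / self.n
            self.s2 += (delta * (v - self.m) - self.s2) / (self.n - 1)
        else:
            # thereafter: recursive update, letting the cluster float slowly
            self.m = (LAMBDA_N * self.m + v) / (LAMBDA_N + 1)

def allocate(clusters, v):
    """Allocate v to the nearest cluster if close enough, else start a new type."""
    v = np.asarray(v, dtype=float)
    if clusters:
        d = np.array([c.distance(v) for c in clusters])
        k = int(np.argmin(d))
        others_ok = all(d[k] < LAMBDA_A * d[j] for j in range(len(d)) if j != k)
        if d[k] < LAMBDA_K and others_ok:
            clusters[k].update(v)
            return k
    clusters.append(Cluster(v))
    return len(clusters) - 1
```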


Operational systems do make use of these and similar algorithms for the
labelling of wave forms (as reported e.g. in Nolle, 1977 and Thomas, 1979).

5. Boundary recognition

After detection and wave labelling it is necessary to determine those parts in


the signal that are of significance for diagnostic classification. In (biological)
signals and patterns such parts are searched for by segmentation methods or boundary
detection. Once we know already approximately the location of the waves of
interest in the E C G (P, QRS, ST-T), we have to locate as exactly as possible their
boundaries; 5 points for one ECG beat. We will discuss a few methods as found
in the literature, and also indicate possible advantages and inaccuracies: straight-
forward threshold detection; matching with a signal part; cross-correlation with
an amplitude-time template (e.g. Pipberger, 1972; Van Bemmel, 1973).
The detection signal used for boundary recognition is commonly one of the
functions d(i) as mentioned in Section 3 (a spatial velocity or vector magnitude
or a combination of both). For illustration purposes we will restrict ourselves to
d(i), as computed with

T = |AR_N(Xk(i))|,

where AR_N is an autoregressive digital filter, computing some bandpass filtered
version of the ECG Xk(i), intended to obtain the derivative (while increasing the
SNR) and based on ±N sample points around i.

Thresholds
Threshold detection is done by applying a fixed or relative amplitude level in
d(i) within a window where the wave boundary has to be expected. In some cases
feedback is built in the method in such a way that the threshold may be
adaptively increased if too many level crossings are seen within the window.

Signal matching
The second method that has been reported is the use of a standard wave form
around the point where the boundary is expected. This standard wave form s(k)
is computed from a learning set of functions d(k) with known boundaries,
indicated by human observers. The method then searches for the minimum of the

estimator es(i) within a given time window:

MIN_i { es(i) = Σ_{k=−N}^{M} [d(i+k) − s(k)]² / w(k) },

with N and M points before, resp. after the boundary. For the weighting factor
w(k), the dispersion of s(k) at point k is usually taken, so that es is the weighted
mean squares difference between d and s. The minimum of es(i) yields the
boundary at i = i0.
A disadvantage of this method is that it is rather sensitive to noise, so that in
such circumstances the function d(i) may remain at relatively high amplitude
levels.

Template matching
Two-dimensional (i.e. time and amplitude) templates have been developed for
wave form boundary recognition as well. In such applications a signal part is
considered as a pattern in 2-dimensional space, to be matched with another 2-D
template, constructed from a learning set. We will briefly explain the method.
Here again we start from the set of L functions {dl(i)}, reviewed by human
observers (boundaries were indicated in the original signals Xk(i)). Around the
boundaries, windows are applied (see Fig. 11), to be used later on in the
cross-correlation.
Within the window area we determine a multi-level threshold function f as

fλ^l(i) = sign(dl(i) − λ).

[Detection function d(i) over 1 sec, with a 25% threshold and windows around the QRS and T wave boundaries.]

Fig. 11. Example of the windows that are applied in the detection function d(i) for the recognition of
wave boundaries. Within these windows a template is matched to a function computed from d(i) (see
Section 3.2 for the definition of d(i)).

i is the sample number, l one of the L functions dl and λ the applied threshold.
'sign' takes the sign of the expression in brackets. So, the area where the function
dl(i) is larger than λ is given the value +1, otherwise the value −1. As a matter of
fact, see above, this is the most simple boundary detector, yielding a response
only at the place where dl(i) crosses λ.
We now define a template Tλ(i) in which the statistical properties of all fλ^l are
comprised:

Tλ(i) = (1/L) Σ_{l=1}^{L} fλ^l(i).

It can be shown that Tλ is a linear function of the cumulative density distributions
of the functions dl at points i. Once the template Tλ is obtained, it is cross-
correlated with all individual threshold functions fλ^l of the learning population.
This yields in general new boundary points at maximum correlation, which can
form the basis for the construction of a new template Tλ. This process of
convergence may be repeated a few times until a stable template results, no longer
influenced by the observer variation in the training set. An example of this
adaptation is shown in Fig. 12 for the onset of the P-wave.
In order to speed up the processing time, again a compromise is made between
theoretical and practical requirements. This is achieved by simplifying the tem-
plate Tλ, which may contain any value in between +1 and −1, to only the values
+1, 0 and −1. This is done by applying a threshold λT, to obtain the simplified
template Wλ:

Wλ(i) = sign(Tλ(i))   for |Tλ(i)| > λT,
Wλ(i) = 0             for |Tλ(i)| ≤ λT.

[Panels (a)-(c): successive P-wave onset templates (values −1, 0, +1) on a 20 ms scale; the region where the detection function matches the template narrows with each iteration.]

Fig. 12. Illustration of the adaptation of a P-wave template, after feedback of the computed template
to the learning population itself. The area where the detection function matches the template narrows
after a few iterations. The first template (a) is based on the learning population; (b) is computed from
the points estimated by (a), and so on for (c).

A suitable value for λT lies somewhere around 0.30. The advantage of Wλ instead
of Tλ is that in this case again only additions and subtractions are computed and
no multiplications are involved.
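A sketch of this template construction and its use is given below; the iteration control over the learning set is omitted and the helper names are assumed.

```python
import numpy as np

LAMBDA_T = 0.30     # threshold used to simplify T_lambda to the ternary W_lambda

def threshold_functions(d_windows, level):
    """d_windows: (L, n) array of detection functions d_l(i) inside the window.
    Returns f_lambda^l(i) = sign(d_l(i) - level), coded as +1/-1."""
    return np.where(d_windows > level, 1.0, -1.0)

def template(d_windows, level):
    """T_lambda(i): average of the threshold functions over the learning set."""
    return threshold_functions(d_windows, level).mean(axis=0)

def ternary_template(t, lam_t=LAMBDA_T):
    """W_lambda(i): sign of T_lambda where |T_lambda| exceeds lam_t, else 0."""
    w = np.sign(t)
    w[np.abs(t) <= lam_t] = 0.0
    return w

def boundary_by_matching(f_l, w):
    """Cross-correlate one threshold function with the ternary template; the lag
    of maximum correlation gives the refined boundary (additions and
    subtractions only, since both signals take the values -1, 0, +1)."""
    scores = np.correlate(f_l, w, mode='valid')
    return int(np.argmax(scores))
```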
Methods such as the ones described here are in routine use for a wide variety of
ECG interpretation systems. Only very few reports have appeared in the literature
giving evaluation results of the algorithms on well-documented ECG's. Yet, these
boundaries form the basis for all diagnostic procedures.
The inaccuracies that are still allowable for P-, QRS- and ST-T boundaries are
in the order of 15, 5 and 30 msec., resp., at both sides of the onsets or endpoints.
Some interpretation programs for ECG's apply the boundary detection to each
complex separately and next apply a majority rule to determine the most
probable locations of the wave edges (e.g. by the determination of the median of
the measured distribution of recognized boundaries). Other programs apply the
edge detection only after coherent averaging of the dominant beats by the
typification step. The outcomes of both approaches are in principle different,
because of the different influence of disturbances on both methods.
Promising wave parsing methods, also applied to ECG's, have been reported by
Stockman (1976) and Horowitz (1977). These methods are essentially syntactic
approaches to this problem. After many years of research in the wide field of
Pattern Recognition for general edge or boundary detection methods, still no
general method is yet available. The only common factor between all reported
techniques is that they at least strive for the maximization of the likelihood of the
same phenomenon and one can only hope that they converge to identical
solutions.
Problems which involve the segmentation of ECG's (in general: signals) have
much in common with the boundary recognition methods as treated here. Again,
it fully depends on the signal characteristics and the ultimate goal of the user
what strategy is followed.

6. Feature selection and classification

The proper selection of features is the basis for all pattern and signal classifica-
tion. As soon as we have determined in earlier steps the signal parts to be
classified, the question arises: which features?
This question is also a main issue for ECG interpretation. Some investigators
supposed that those parameters by which the original ECG can be reconstructed
(e.g. an orthonormal basis such as Nyquist samples; Fourier, Karhunen-Loève or
Chebyshev components) are a sufficient basis for a feature space (e.g. Young,
1963). This, however, is only seldom true, since these parameters are ideal for a
syntactic shape reconstruction but do not necessarily have a semantic information
content. Features that have diagnostic discriminatory power are very often
computed from non-linear combinations of the syntactic basic components, such
as products, ratios, squares or time intervals, which may be related to biological
events and phenomena. Such parameters will hardly ever automatically arise, even

by non-linear mapping techniques. For that reason only sound theoretical reason-
ing based on fundamental knowledge of the biological process (see also par. 2) is
the ideal way to obtain relevant features.
In many cases, however, the significance of the features is only a posteriori
demonstrated by means of operations on well-documented populations of ECG's,
often referred to as heuristic feature selection.

Logical vs. statistical


In ECG interpretation in general two approaches are followed to solve the
wave shape classification problem (as is usually also done for other problems):
logical methods (tables, trees, syntactic rules) and statistical techniques (linear,
non-linear, Bayes' rule) or combinations; all of them essentially multiparameter
methods. Logical decision trees are the most widely used, being better associated
with human reasoning than multivariate statistical methods, although the optimi-
zation of the latter techniques can be more easily done than for logical trees or
tables (see e.g. Cornfield, 1973; Wartak, 1969; Wolf, 1972; Van Bemmel and
Willems, 1977). A final answer to the optimum strategy to be followed in ECG
interpretation is certainly hampered by the fact that not always a one-to-one
relationship can be established between the ECG and the heart disease involved;
consequently, many different criteria exist for identical diseases. Further com-
plicating factors are the interindividual variability; the fact that a disease can
appear in varying degrees and the combination of diseases at different stages of
severity. If one realizes the huge amount of degrees of freedom in combinations
of, say, only 7 or 8 main groups of ECG-shape abnormalities (left or right
ventricular hypertrophy, anterior, posterior/diaphragmatic, inferior and lateral
myocardial infarction, and right or left ventricular conduction defects) for almost
all possible ages and races and for both sexes, almost no training population of
ECG's would be sufficiently large to solve the classification problem in a definite
way. Here again, the use of clinical information and knowledge is of utmost
importance, to be built into the classification rules. For these reasons it is not
surprising that most practical solutions of ECG interpretation make use of logical
reasoning and binary trees.

Classification of contours
Classification results for wave contours have been reported for almost all
existing ECG programs, but except for the study by Bailey almost no objective
evaluation study has been published based on the same ECG population (Bailey,
1974). Classification results, of course, differ widely from one application area to
the other (e.g. in screening or in a heart clinic). Some programs (Pipberger, 1975)
are primarily based on independent, i.e. non-ECG information (history, catheteri-
zation data, autopsy reports, etc.) instead of the diagnosis by cardiologists based
on ECG morphology.
This purely statistical approach to ECG diagnosis, however, has not received
the expected interest from the medical community. If the final diagnosis based on
the ECG itself and obtained from a team of cardiologists is used for reference

purposes (e.g. Bonner, 1972), the best results reported thus far claim a percentage
of > 95% of correctly classified wave forms. The evaluation of ECG contours is
usually done along three different lines: with respect to measurements that can be
verified by non-ECG data; for features that can only be derived from the ECG
itself (like conduction disturbances); and for purely descriptive parameters (like
ST-elevations). A detailed description of these three approaches can be found in
the results of The Tenth Bethesda Conference on Optimal Electrocardiography
(1977).
In this section we will briefly describe the method that is followed for statistical
classification and evaluation. Although the proper choice of features primarily
determines the results of the classification and the model that is being used for
discrimination can reveal the requested performance only on the basis of these
properly chosen feature vectors, the statistical approach, at least from a theoreti-
cal point, offers some advantages over a logical reasoning as has been amply
shown by Cornfield (1973) and Pipberger (1975). An advantage is, e.g., the fact
that in a multivariate approach we may easily and analytically take into account
prior probabilities for the different diseases and cost and utility factors. The a
posteriori probability of having a disease k out of K, given the prior probabilities
p(k) and the feature (symptom) vectors x with their conditional probabilities
p(x|k), can be expressed with Bayes as:

p(k|x) = p(x|k) p(k) [ Σ_{j=1}^{K} p(x|j) p(j) ]^{−1}.

The classification of vector x to a certain class is determined by the maximum of
p(k|x), if desired beforehand weighted with a matrix of cost factors. Assuming
that the vectors x have normal distributions for all diseases k and identical
variance-covariance matrices D for all distributions x|k, with mjk = mj − mk (mk
the mean of class k), we may write for the a posteriori probabilities

p(k|x) = [ 1 + Σ_{j=1, j≠k}^{K} exp( x^T D^{−1} mjk − ½(mj + mk)^T D^{−1} mjk ) p(j)/p(k) ]^{−1}.
Cornfield (1973) has shown the influence of the prior probabilities in such models
if used in clinical practice (see also Pipberger, 1975). If too many disease classes
(age, race, sex, etc.) in different degrees and combinations have to be discerned,
such statistical models require an impractically large population for training the
parameters. In such cases it is necessary to combine the advantage of the purely
statistical approach with that of the heuristic and logical solution to the classifica-
tion problem.
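For illustration, the posterior probabilities for equal-covariance Gaussian classes can be computed directly as in the sketch below (a generic example with made-up numbers, not taken from any of the cited programs).

```python
import numpy as np

def posteriors(x, means, D, priors):
    """x: feature vector; means: class means m_k; D: common variance-covariance
    matrix; priors: p(k).  Returns p(k|x) for all k via Bayes' rule with
    equal-covariance normal class-conditional densities."""
    D_inv = np.linalg.inv(D)
    log_lik = np.array([-0.5 * (x - m) @ D_inv @ (x - m) for m in means])
    log_post = log_lik + np.log(priors)
    log_post -= log_post.max()                  # numerical stabilization
    p = np.exp(log_post)
    return p / p.sum()

# hypothetical two-class example
means = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
print(posteriors(np.array([0.8, 0.9]), means, np.eye(2), np.array([0.7, 0.3])))
```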

Classification of rhythms
Rhythm diagnosis is based on the measured PP, RR and PR intervals as well as
P-wave and QRS morphology, found by the detection and typification or wave
form labelling steps. If no detailed diagnosis is given of complicated rhythms but

just categories of larger groups of certain arrhythmias, many programs claim a


percentage of correctly diagnosed ECG's well above 95%, being very acceptable in
clinical practice or for screening purposes (see Willems, 1972; Gustafson, 1978;
Plokker, 1978). It is clear that for the evaluation of cardiac rhythm mainly the
ECG itself serves as the reference.

Serial electrocardiography
In recent years many programs for ECG classification have incorporated
algorithms for serial analysis of ECG's (e.g., Macfarlane, 1975; Pipberger, 1977).
Improvement in the final classification is claimed, which is not surprising because
differences in morphology as compared to an earlier recording can be taken into
account. This requires, however, very standardized locations of the electrodes,
since especially in the chest leads a minor misplacement may cause large changes
in QRS shape.
Present research in contour classification is, besides the further investigation of
serial electrocardiography, primarily directed towards the derivation of features
from multiple leads (Kornreich 1973); the stability of classifiers (Willems, 1977;
Bailey, 1976); the use of fuzzy set theory (Zadeh, 1965) and syntactic approaches
to classification (Pavlidis, 1979).

Features for exercise ECG's


Processing of ECG's acquired during physical exercise bears elements of the
interpretation of short-lasting resting ECG's and of the analysis of ambulatory
ECG's. Usually a few special exercise leads are taken and analyzed as short
epochs during increasing stress, at maximal stress and during recovery. The noise
problem is solved by coherent averaging techniques. Further, the choice of
relevant complexes (detection and typification) and the recognition of boundaries
is similar to ECG analysis at rest. References to specific methods, parameters,
systems, and evaluation results can be found in Simoons et al. (1975) and
Sheffield (1977).

7. Data reduction

Electrocardiograms offer different possibilities for a considerable data reduc-


tion, required for digital transmission, compact recording as in ambulatory
monitoring or long-term storage for serial comparison of wave forms. We
mentioned already that the ECG, if consisting only of not too many different
waveshapes, can be represented by a combination of point processes and wave
contours. In practice, however, we never know beforehand what waveform is to
be expected.
Still, several 'silent' epochs are present in the ECG (the TP and PR intervals)
and the different waves have a significantly distinct frequency spectrum. These
properties may be used for data reduction. Cox (1968) was one of the first who
used these characteristics of the ECG to develop an algorithm to obtain a tenfold
data reduction. Especially in patient monitoring this and similar techniques are

being used. Essentially such methods replace a signal by samples at unequal time
intervals, only measured if a certain threshold of the first or second difference is
crossed. Bertrand (1977) applied this and other algorithms in a system for
transmission.
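The principle of keeping samples only at unequal time intervals can be sketched as follows; this is a plain first-difference variant for illustration, not the AZTEC algorithm itself, and the threshold is arbitrary.

```python
import numpy as np

def reduce_by_first_difference(x, threshold):
    """Keep a sample only when the signal has moved more than `threshold`
    since the last kept sample; returns indices and values at unequal intervals."""
    kept_i, kept_v = [0], [float(x[0])]
    for i in range(1, len(x)):
        if abs(float(x[i]) - kept_v[-1]) > threshold:
            kept_i.append(i)
            kept_v.append(float(x[i]))
    return np.array(kept_i), np.array(kept_v)

def reconstruct(indices, values, n):
    """Zero-order-hold reconstruction back to n equally spaced samples."""
    y = np.empty(n)
    bounds = list(indices[1:]) + [n]
    for (i0, v), i1 in zip(zip(indices, values), bounds):
        y[i0:i1] = v
    return y
```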
Other techniques for ECG compression make use of a series of orthogonal
basic functions for the reconstruction of the wave forms. Well known is the
Karhunen-Loève expansion (see Young, 1963) or a Chebyshev transform. The
first method, yielding the eigenvectors, was evaluated by Womble (1977) together
with reduction by spectral techniques. As has been observed already, such
methods do not take into account the semantic information comprised in the
ECG. For that reason they are most helpful in detecting trends in intervals or
sudden changes in wave shapes in individual patients; for contour classification
they are rather inefficient since, e.g., tiny Q-waves may be missed by the fact that
their signal power is less than the distortions allowed, if integrated over the
duration of the wave. This is the reason why most long-term storage systems store
either the samples of the dominant beat or even the entire recording eventually
sampled at 250Hz. Another reason is the fact that the technical means for
inexpensive storage and retrieval have gradually diminished the need for data
reduction algorithms that are always more or less increasing the signal entropy.
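A hedged sketch of compression with an orthogonal expansion, assuming NumPy and a matrix of time-aligned beats (the function names and the choice of k components are illustrative, not taken from the systems cited above): the eigenvectors of the beat covariance form the Karhunen-Loève basis, and each beat is stored as a few expansion coefficients.

    import numpy as np

    def kl_compress(beats, k):
        # beats: (n_beats, n_samples) matrix of time-aligned beats
        mean = beats.mean(axis=0)
        centered = beats - mean
        # right singular vectors = eigenvectors of the sample covariance
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        basis = vt[:k]                      # (k, n_samples) Karhunen-Loeve basis
        coeffs = centered @ basis.T         # (n_beats, k) stored coefficients
        return mean, basis, coeffs

    def kl_reconstruct(mean, basis, coeffs):
        return mean + coeffs @ basis        # approximate beats from k coefficients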

8. Discussion

Electrocardiology, which started with Einthoven, has been greatly stimulated during the last decades by the advent of the digital computer. Still, although most problems in the recognition and detection of waves have in principle been solved, no definitive solution to the classification problem is in sight. The major reasons are (1) the lead selection, which differs from center to center, (2) the lack of generally accepted diagnostic criteria based on non-ECG information and, related to this, (3) the non-existence of a well-documented data base of representative ECG's for the development of algorithms and criteria and for the evaluation of existing programs. It is to be expected that in the 1980's substantial progress will be made in all these areas, because of cooperation between the research centers involved in this field and because non-ECG information is now more readily available in most centers than in the past. Standardization of this patient information and of the signals is a prerequisite for the advancement of research in this field.

References

AHA Committee on Electrocardiography (1975). Recommendations for standardisation of leads and of specifications for instruments in electro- and vectorcardiography. Circulation 54, 11-31.
American College of Cardiology (1977). Conference Report: Optimal electrocardiography. Tenth Bethesda Conference.
Bailey, J. J., Itscoitz, S. B., Hirsfeld, J. W., Grauer, L. E. and Horton, M. R. (1974). A method for evaluating computer programs for electrocardiographic interpretation. Circulation 50, 73-93.
Bailey, J. J., Horton, M. and Itscoitz, S. B. (1976). The importance of reproducibility testing of
computer programs for electrocardiographic interpretation. Comput. and Biomedical Res. 9, 307-316.
Barr, R. C., Spach, M. S. and Herman-Giddens, G. S. (1971). Selection of the number and position of
measuring locations in electrocardiography. IEEE Trans. Biomedical Engrg. 18, 125-138.
Bertrand, M., Guardo, R., Roberge, F. A. and Blondeau, P. (1977). Microprocessor application for numerical ECG encoding and transmission. Proc. IEEE 65, 714-722.
Bonner, R. E., Crevasse, L., Ferrer, M. I. and Greenfield, J. L. (1972). A new computer program for
analysis of scalar electrocardiograms. Comput. and Biomedical Res. 5, 629-653.
Cornfield, J., Dunn, R. A., Batchlor, C. D. and Pipberger, H. V. (1973). Multigroup diagnosis of
electrocardiograms. Comput. Biomedical Res. 6, 97-120.
Cox, J. R., Nolle, F. M., Fozzard, H. A. and Oliver, G. C. (1968). AZTEC, a preprocessing program for real time ECG rhythm analysis. IEEE Trans. Biomedical Engrg. 15, 128-129.
Cox, J. R., Nolle, F. M. and Arthur, R. M. (1972). Digital analysis of the electroencephalogram, the
blood pressure wave and the electrocardiogram. Proc. IEEE 60, 1137-1164.
Feldman, C. L. (1977). Trends in computer ECG monitoring. In: J. H. van Bemmel and J. L. Willems,
eds., Trends in Computer-processed Electrocardiograms, 3-10. North-Holland, Amsterdam.
Geselowitz, D. B. (1971). Use of the multipole expansion to extract relevant features of the surface electrocardiogram. IEEE Trans. Comput. 20, 1086-1089.
Guardo, R., Sayers, B. McA. and Monro, D. M. (1976). Evaluation and analysis of the cardiac electrical multipole series on two-dimensional Fourier technique. In: C. V. Nelson and D. B. Geselowitz, eds., The Theoretical Basis of Electrocardiology, 213-256. Clarendon, Oxford.
Gustafson, D. E., Wilsky, A. S., Wang, J., Lancaster, M. C. and Triebwasser, J. H. (1978). ECG/VCG
rhythm diagnosis using statistical signal analysis. IEEE Trans. Biomedical Engrg. 25, 344-361.
Hengeveld, S. J. and van Bemmel, J. H. (1976). Computer detection of P-waves. Comput. and
Biomedical Res. 9, 125-132.
Holt, J. H., Barnard, A. C. L. and Lynn, M. S. (1969). The study of the human heart as a multiple
dipole source. Circulation 40, Parts I and II, 687-710.
Horacek, B. M. (1974). Numerical model of an inhomogeneous human torso. Advances in Cardiology
10, 51-57.
Horan, L. G. and Flowers, N. C. (1971). Recovery of the moving dipole from surface potential
recordings. Amer. Heart J. 82, 207-213.
Horowitz, S. L. (1977). Peak recognition in waveforms. In: K. S. Fu, ed., Syntactic Pattern Recognition
Applications, 31-49. Springer, New York.
Kanal, L. N. (1972). Interactive pattern analysis and classification systems: a survey and commentary.
Proc. IEEE 60, 1200-1215.
Kanal, L. N. (1974). Patterns in pattern recognition. IEEE Trans. Inform. Theory 20, 697-722.
Kornreich, F., Block, P. and Brismee, D. (1973/74). The missing waveform information in the
orthogonal electrocardiogram. Circulation 48, Parts I and II, 984-1004; ibid. 49, Parts III and IV,
1212-1231.
Macfarlane, P. W., Cawood, H. T. and Lawrie, T. D. V. (1975). A basis for computer interpretation of
serial electrocardiograms. Comput. and Biomedical Res. 8, 189-200.
McFee, R. and Baule, G. M. (1972). Research in electrocardiography and magnetocardiography. Proc.
IEEE 60, 290-321.
Nelson, C. V. and Geselowitz, D. B., eds. (1976). The Theoretical Basis of Electrocardiology. Clarendon, Oxford.
Nolle, F. M. (1977). The ARGUS monitoring system: a reappraisal. In: J. H. van Bemmel and J. L.
Willems, eds., Trends in Computer-processed Electrocardiograms, 11-19. North-Holland, Amster-
dam.
Pavlidis, T. (1979). Methodologies for shape analysis. In: K. S. Fu and T. Pavlidis, eds., Biomedical Pattern Recognition and Image Processing, 131-151. Chemie Verlag, Berlin.
Pipberger, H. V., Cornfield, J. and Dunn, R. A. (1972). Diagnosis of the electrocardiogram. In: J.
Jacquez, ed., Computer Diagnosis and Diagnostic Methods, 355-373. Thomas, Springfield.
Pipberger, H. V., McCaughan, D., Littman, D., Pipberger, H. A., Cornfield, J., Dunn, R. A., Batchlor, C. D. and Berson, A. S. (1975). Clinical application of a second generation electrocardiographic computer program. Amer. J. Cardiology 35, 597-608.
Pipberger, H. V., Sabharwal, S. C. and Pipberger, H. A. (1977). Computer analysis of sequential electrocardiograms. In: J. H. van Bemmel and J. L. Willems, eds., Trends in Computer-processed Electrocardiograms, 303-308. North-Holland, Amsterdam.
Plokker, H. W. M. (1978). Cardiac rhythm diagnosis by digital computer. Thesis. Free University
Amsterdam.
Ripley, K. L. and Arthur, R. M. (1975). Evaluation and comparison of automatic arrhythmia
detectors. In: Computers in Cardiology, 27-32. IEEE Comp. Soc.
Sheffield, L. T. (1977). Survey of exercise ECG analysis methods. In: J. H. van Bemmel and J. L.
Willems, eds., Trends in Computer-processed Electrocardiograms, 373-382. North-Holland, Amster-
dam.
Simoons, M. L., Boom, H. B. K. and Smallenburg, E. (1975). On-line processing of orthogonal
exercise electrocardiograms. Comput. and Biomedical Res. 8, 105-117.
Stockman, G., Kanal, L. N. and Kyle, M. C. (1976). Structural pattern recognition of carotid pulse
waves using a general waveform parsing system. Comm. ACM 19, 688-695.
Swenne, C. A., van Bemmel, J. H., Hengeveld, S. J. and Hermans, M. (1973). Pattern recognition for
ECG monitoring, an interactive method for the recognition of ventricular complexes. Comput. and
Biomedical Res. 5, 150-160.
Talmon, J. L. and van Bemmel, J. H. (1974). Modular software for computer-assisted ECG/VCG
interpretation. In: J. Anderson and J. M. Forsythe, eds., MEDINFO-74, 653-658. North-Holland,
Amsterdam.
Thomas, L. J., Clark, K. W., Mead, C. N., Ripley, K. L., Spenner, B. F. and Oliver, G. C. (1979).
Automated cardiac dysrhythmia analysis. Proc. IEEE 67, 1322-1337.
Van Bemmel, J. H., Talmon, J. L., Duisterhout, J. S. and Hengeveld, S. J. (1973). Template waveform
recognition applied to ECG/VCG analysis. Comput. and Biomedical Res. 6, 430-441.
Van Bemmel, J. H. and Hengeveld, S. J. (1973). Clustering algorithm for QRS and ST-T waveform
typing. Comput. and Biomedical Res. 6, 442-456.
Van Bemmel, J. H. and Willems, J. L., eds. (1977). Trends in Computer-processed Electrocardiograms.
North-Holland, Amsterdam.
Van Bemmel, J. H. (1979). Strategies and challenges in biomedical pattern recognition. In: K. S. Fu
and T. Pavlidis, eds., Biomedical Pattern Recognition and Image Processing, 13-16. Chemie Verlag,
Berlin.
Wartak, J. and Milliken, J. A. (1969). Logical approach to diagnosing electrocardiograms. J. Electro-
cardiology 2, 253-260.
Willems, J. L. and Pipberger, H. V. (1972). Arrhythmia detection by digital computer. Comput. and
Biomedical Res. 5, 263-278.
Willems, J. L. and Pardaens, J. (1977). Differences in measurement results obtained by four different
ECG computer programs. In: Computers in Cardiology, 115-121. IEEE Comp. Soc.
Wolf, H. K., Macinnis, P. J., Stock, S., Helppi, R. K. and Rautaharju, P. M. (1972). Computer analysis
of rest and exercise electrocardiograms. Comput. and Biomedical Res. 5, 329-346.
Womble, M. E., Halliday, J. S., Mitter, S. K., Lancaster, M. C. and Triebwasser, J. H. (1977). Data
compression for storing and transmitting ECG's/VCG's. Proc. IEEE 65, 702-706.
Young, T. Y. and Huggins, W. H. (1963). Intrinsic component theory of electrocardiograms. IEEE Trans. Biomedical Engrg., 214-221.
Zadeh, L. A. (1965). Fuzzy sets. Inform. and Control 8, 338-353.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 527-548

Waveform Parsing Systems

George C. Stockman

1. Introduction

The formal language theory that was developed by Chomsky and others for
modeling natural language turned out to be more useful for the modeling and
translation of programming languages. An algorithm for analyzing forms of a
language according to a grammar for the language is called a parsing algorithm.
The parsing of simple languages is well understood and is a common technique in
computer science. (See Gries, 1971, or Hopcroft and Ullman, 1969.) It is not
surprising that known parsing techniques were brought into play in attempts at
machine recognition of human speech (Miller, 1973; Reddy et al., 1973). How-
ever, even earlier the concept of parsing was applied to the analysis of the
boundary of 2-D objects; i.e. chromosome images (Ledley et al., 1973).
The attempts at automatic analysis of 1-D time or space domain signals have
been numerous and varied in approach. An excellent survey of early work on
bio-medical signals, such as the electrocardiogram and the blood pressure wave,
appears in Cox et al. (1972). Early approaches applied the constraints of a
structural model in an ad hoc manner; usually embedded in an application
program. As programming techniques developed waveform parameters were
placed in tables in computer programs. Later, decision rules were also placed in
tables and the evolution towards more formal analysis techniques had begun.
Waveform parsing systems (WPS) were developed to be applicable to several
problem domains. In order to achieve this, the model of a particular waveform
domain must be input to the WPS as data rather than being implemented by
programming. The loading of a WPS with a waveform model before analysis of
waveform data is pictured in Fig. 1. The structural model is presented to WPS in
a fixed language, the 'structural description language' (SDL). Typically this will
be a BNF grammar or connection tables defining a grammar. The waveform
primitives must also be selected from the set known to WPS, perhaps augmented
by specific numeric parameters. For instance, WPS may use a vocabulary of
shapes such as CAP, CUP, or FLAT which may be parameterized by duration,
amplitude, curvature, etc. The primitives chosen for use in the specific application


[Fig. 1(a): a WPS skeleton, consisting of a general parsing algorithm and feature detection algorithms, is loaded with the structural description of the particular waveform domain and with the description of its special features. (b): the resulting WPS, containing the general parsing algorithm, structural tables, primitive detectors, and parameter tables or prototypes, produces the structural analysis and interpretation of input signals.]
Fig. 1. (a) The WPS is tuned for a particular set of signals and (b) is then tasked with the analysis of signals for that set.

are described to WPS in a standard 'feature description language' (FDL). Design of the appropriate structural model and set of primitives requires a combination
of art and engineering and may require several iterations before success is
achieved.
Before proceeding with details it is important to establish the goals of the
waveform analysis. In the simplest case a binary decision is required, such as
'EKG healthy' versus 'EKG unhealthy' or 'enemy approaching' versus 'nothing
to worry about'. Usually, however, and particularly in cases where WPS is
appropriate, a more detailed output is required which will include descriptions
and interpretations of waveform parts as well as the whole. This is usually true for
medical waveforms. For instance, 'atrial hypertrophy' might be diagnosed by
measurement of the 'P wave' which only constitutes a portion of the entire
waveform of interest. Cardiologists have a very well developed theory of the heart
and can attach interpretations to several parts of its signal. In automatic speech
recognition it is obvious that individual words and phrases must be understood in
order for a complete utterance to be understood. The general goal of waveform parsing is thus to recognize and describe a waveform as an aggregate of its parts,
each one of which may offer significant interpretations.
The partitioning of a waveform into primitive parts is called segmentation.
Primitive parts may be combined into groups or 'phrases' which have significant
interpretation. Grouping can continue until the aggregate of all parts has been
interpreted.
The material presented in the following sections surveys some of the central
issues of linguistic waveform analysis. Its objective is to provide insight and
motivation rather than to be a definitive treatise. Hopefully, this can be a useful
guide to the large and varied literature on the subject.

2. Models for waveform analysis: SDL and FDL

Models for primitive and for aggregate structures of waveforms will be dis-
cussed. Assignment of meaning to the structures is clearly problem specific and
perhaps not one of the duties of WPS. A very simple example is presented first so
that the methodology can be established before real world complexity is intro-
duced.

2.1. A simple example


Fig. 2(a) shows a set of six waveforms, four of which are in the problem domain
to be modeled. On inspection all the samples are seen to be composed of at most
three primitive structures as shown in Fig. 2(b). If the application were evident
there would no doubt be an explanation for each of these observed shapes. In a
WPS each recognized primitive is symbolized for further structural processing.
For instance, the 'CUP', 'CAP', and 'S-TURN' of Fig. 2(b) could be symbolized
as 'a', 'b', and 'c' respectively. (It is important to note that a virtual infinity of
slight variations of these shapes is to be represented by only three possible
discrete symbols.) Specifying constraints for composing waveforms from the
primitives is the role of grammar.
The strings of symbols which represent the waveform samples and which are to
be modeled by a grammar are listed in Fig. 2(c). Notice that samples 5 and 6
cannot even be segmented using the primitive set {a,c}. Thus it can be seen that
recognition can fail even at the primitive level. This can be useful since these two
samples are not in the problem domain to be modeled. On the other hand, all six
waveform samples can be segmented using primitives {a,b}. Here it is necessary
that the grammar perform the discriminating role.
A generative BNF grammar which will generate the first four waveform
representations but not the last two is shown in Fig. 3. This BNF grammar merely
models how a certain set of strings of a's and b's can be generated recursively
replacing symbols starting with the non-primitive symbol S. The set of strings
[Fig. 2(b): the three primitive shapes, symbolized as 'a' (the CUP), 'b' (the CAP) and 'c' (the S-TURN). Fig. 2(c) lists the segmentation of the six samples with two different primitive sets:

    primitives 'c' and 'a':   1 caa    2 aaaa   3 acaa    4 cacaa     5 not possible   6 not possible
    primitives 'a' and 'b':   1 abaa   2 aaaa   3 aabaa   4 abaabaa   5 baa            6 abbb ]

Fig. 2. Example of waveform segmentation problem.
(a) Waveform samples 1-4 are in the problem domain and 5 and 6 are not.
(b) Three possible primitives for waveform segmentation.
(c) Segmentation of the six samples using two different sets of primitives.

which can be generated from S is actually infinite whereas in a practical situation there may be a limit on string length. An aggregate structure is imposed on the string by the process of generating (or deriving) it from the grammar model. This structure is shown for two waveform samples in Fig. 3(b).
Even if a good model of a waveform domain is available there is still the problem of starting from an instance of waveform data and analyzing that data with respect to the model. If grammar models are used this is the so-called parsing problem. There are many partial solutions to this problem resulting in a spectrum of different parsing algorithms (see Aho and Ullman, 1972) where efficiency of analysis can be traded off for generality of the grammar model. A very general
[Fig. 3(a): the generative grammar with productions S -> aAS, S -> a, A -> SbA, A -> ba, A -> SS. Fig. 3(b): derivation trees showing the structural descriptions imposed on two of the waveform samples.]
Fig. 3. (a) A generative grammar and (b) an example of structural description for some waveform samples.

algorithm which has been used in waveform parsing is Earley's algorithm (Earley,
1970).
Earley's algorithm is the culmination of a long line of research and offers a very simple paradigm for very flexible analysis. The algorithm uses the grammar to generate hypotheses about the primitive content of the input string. The input string is then checked for the hypothesized primitive. There is no backing up by the algorithm because alternate possibilities are pursued simultaneously. The input string is recognized and analyzed when the goal symbol of the grammar is recognized as the last input symbol is seen.
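The sketch below is not Earley's algorithm; it is a much simpler memoized top-down membership test for the toy grammar of Fig. 3(a), included only to make grammar-constrained recognition concrete (the grammar encoding and function names are illustrative).

    from functools import lru_cache

    # Toy grammar of Fig. 3(a):  S -> aAS | a ,  A -> SbA | ba | SS
    GRAMMAR = {
        "S": [("a", "A", "S"), ("a",)],
        "A": [("S", "b", "A"), ("b", "a"), ("S", "S")],
    }

    def derives(symbol, s):
        """True if `symbol` can derive the primitive string s."""
        @lru_cache(maxsize=None)
        def seq_derives(seq, i, j):
            # Can the symbol sequence `seq` derive the substring s[i:j]?
            if not seq:
                return i == j
            first, rest = seq[0], seq[1:]
            if first not in GRAMMAR:                    # terminal symbol
                return i < j and s[i] == first and seq_derives(rest, i + 1, j)
            for k in range(i + 1, j + 1):               # split point for `first`
                if any(seq_derives(p, i, k) for p in GRAMMAR[first]) \
                   and seq_derives(rest, k, j):
                    return True
            return False
        return seq_derives((symbol,), 0, len(s))

    # Samples 1-4 of Fig. 2(c) are accepted, 5 and 6 are rejected:
    for w in ["abaa", "aaaa", "aabaa", "abaabaa", "baa", "abbb"]:
        print(w, derives("S", w))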
The important property of the grammar model is that it defines and constrains
the context in which symbols of the vocabulary can occur. For example, study of
Fig. 3 shows that if the symbol 'b' appears it can appear only in the context of at
most one other 'b'. Use of context in human speech analysis will be discussed in
the next sections.
The theory of parsing is too broad and complex to treat formally here. Instead
the goals and procedures of parsing are intuitively presented. Fig. 4 shows several
attempts at achieving a structural analysis of a given input string representing a
waveform. As a whole, these attempts represent the unconstrained trials that a
human might make in attempting to match the grammar goal symbol to the input.
(a)  G:  S -> aAS | a ,   A -> SbA | ba | SS ;   abaabaa ∈ L(G).
[Panels (b)-(g): partial parse trees for the input string abaabaa, each shown together with its linear encoding.]
Fig. 4. Possible parse states for sample grammar of Fig. 2. A partial parse tree is shown together with linear encoding. '.' denotes a 'handle', or current focus of analysis. Integers indicate symbol positions. Square brackets indicate recognized structures while parentheses indicate open goals.

Parsing algorithms are very limited in their behavior and could not develop all parsing states shown in Fig. 4. States (b) and (c) are being developed 'bottom-up' with only one phrase being worked on at a time. State (g) shows bottom-up development of two phrases. States (d) and (e) show top-down development: a parse tree rooted at the grammar goal symbol is being extended to match the input. In (d) only one phrase is being worked on but in (e) there are three unmatched phrases. In Fig. 4, (f) shows 'N-directional' development. The practical meaning of these terms will become obvious in the detailed examples of Sections 3 and 4. The essentials of the waveform parsing paradigm are summarized below.

2.2. The waveform parsing paradigm: feature extraction


Under the waveform parsing paradigm a waveform is viewed as a sequence of
features each one from a finite set of possible features. The finite set of possible
waveform primitives is the vocabulary of the formal language describing aggre-
gate waveform structure. The process of assigning vocabulary elements to inter-
vals of a raw signal is called segmentation. If the vocabulary is the ASCII
character set and the raw signal is a PCM signal carried from one modem to
another, the segmentation will be easy. However, if the vocabulary is a set of
English words and the raw signal is connected human speech, the segmentation
will be difficult. It is generally agreed that in such cases of naturally generated
signals (i.e. speech, EKG's) segmentation cannot be done well without the aid of
syntactic and semantic constraints. Moreover, we are forced to assign a likelihood
or certainty to our extracted features since they may be wrongly assigned and we
have to accept the possibility that two distinct vocabulary elements can match
overlapping intervals of raw signal with competing positive likelihoods. For
example, it is difficult for voice recognition devices to distinguish between 'loop'
and 'hoop' from acoustic information alone, but context may provide a firm
decision later on in the processing. Feature extraction is covered in Section 3 for
speech waveforms and in Section 4 for EKG's and pressure waves.

2.3. The waveform parsing paradigm: structural analysis or parsing


Once the raw waveform, or parts of it, have been represented by elements of a
vocabulary, parsing techniques can be used to construct phrases hierarchically
until a parse tree is formed to represent the entire waveform. In the case of ASCII
signals introduced above, this analysis could be mundane. For example, the BASIC
language compiler of a dial-up system is a trivial waveform parsing system.
However, in cases where the segmentation is difficult and errors or ambiguity are
possible, complex control of structural analysis is required. This control has to
integrate segmentation, structural analysis, and perhaps semantic analysis to
resolve ambiguity and evaluate the likelihood of the representations. Particular
techniques for structural analysis will be presented in Sections 3 and 4.

2.4. Waveform parsing control: parse states


The assumptions are made here that segmentation is difficult and errors and
ambiguities must be handled. These assumptions force the analysis control to
have both bottom-up and top-down components and to allow multiple competing
(alternate) analyses to be simultaneously developed. Likelihoods of primitive
feature detections are to be combined to get joint likelihoods for aggregated
phrases. The parse state records all of the details of the partial analysis. There
may be several competing states, each one representing one consistent interpreta-
tion of the data analyzed so far. Two competing states in the analysis of a spoken
sentence are represented in Fig. 5. The sentence is assumed to be the description
[Fig. 5(a): candidate words ('MOVE' vs. 'MORE', 'THE' vs. 'TIME') matched against successive segments of the speech signal (time scale 0-800 ms).
(b) Rough parse states:
    State #1, merit 0.80:  (MOVE THE {PAWN | KING | QUEEN | KNIGHT})
    State #2, merit 0.65:  (MORE TIME {IS NEEDED | REQUIRED | WANTED})
(c) Refined parse states with syntactic labels, word positions and certainties:
    State #1, merit 0.80:  (Verb[MOVE,100,300,0.95] obj phrase[Adj[THE,305,420,0.80] .noun[{KING|QUEEN|KNIGHT|PAWN},420,-,-]])
    State #2, merit 0.65:  (subj[adj[MORE,95,305,0.80] noun[TIME,305,410,0.65]] .pred[{IS NEEDED|REQUIRED|WANTED},410,-,-]) ]
Fig. 5. Parse states representing alternative competing interpretations of a speech signal.
(a) Matching of possible words in vocabulary to speech segments.
(b) Rough concept of states of analysis.
(c) More refined concept of states of analysis including syntactic labels, phrase structure, location, and merit. ('.' denotes the 'handle' or current focus of processing.)

of a move which the human wants to make in a chess game with the computer.
The analysis encoded in state #1 records the following. The words 'MOVE THE'
have been recognized in the speech signal by their acoustic features. These words
are in the chess vocabulary. The chess grammar recognized an imperative sentence
to be in progress and predicted that an object phrase was the next segment. The
name of a chess piece was predicted; names of pieces which were not currently in
the game were removed from the set of possibilities. From previous acoustic
analysis the certainty of the detection of 'MOVE' was 0.95 and the certainty for
'THE' was 0.80. The merit of the state was assigned to be the minimum of these.
State #2 encodes an alternate partial interpretation as follows. The words
'MORE TIME' have been detected in the initial speech segment with certainty
Table 1
Operations on parse states ('Handle' of a parse state is the phrase where analysis is currently focused.)

Condition of current best parse state: handle is a recognized word or phrase.
    Action of feature extraction module: none.
    Action of syntactic module: halts if the global goal is recognized; otherwise generates new states using the grammar by setting the next(a) goal.
    Action of semantic module: can delete the state if unacceptable, or reset its merit.

Condition of current best parse state: handle is a non-terminal structural goal N.
    Action of feature extraction module: none.
    Action of syntactic module: creates a new state for each grammar rule applicable to N; the handle of each new state is the first(a) subgoal of N.
    Action of semantic module: none.

Condition of current best parse state: handle is a terminal structural goal T.
    Action of feature extraction module: attempts to match T in the input data at the appropriate location; returns a match value between 0 and 1.
    Action of syntactic module: none.
    Action of semantic module: none.

(a) 'next' and 'first' mean in left to right order here. This order can actually be defined arbitrarily.

0.80 and 0.65 respectively. The chess grammar allows only 3 possible continuations. Since state #1 has a higher merit than state #2, the analysis encoded there should be continued next. Thus the acoustic feature detection routines should be called to evaluate the predicted words against the rest of the input signal. This in turn will result in several possibilities. Only one word may be recognized and state #1 will be amended. More than one word may be recognized and state #1 will be split into competing states of differing likelihoods. If none of the predicted words are recognized with any certainty the record of analysis in state #1 will be
deleted. Provided that all of the details of analysis are encoded in the states, the
entire waveform analysis can be controlled by operations on the states defined as
in Table 1. Control is therefore rather simple--just allow the appropriate modules
(feature extractor, syntax analyzer, or semantic analyzer) their turn at accessing
the set of current states of analysis. How this solves the original practical problem
will become clear after Sections 3 and 4.
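A schematic sketch of this state-driven control (assumed Python data structures, not code from HEARSAY or WAPSYS; the expand callback stands for whichever module operates on the handle): each parse state carries a merit, here the minimum certainty of its recognized parts as in Fig. 5, and the state with the highest merit is extended first.

    import heapq

    class ParseState:
        """One consistent partial interpretation: recognized words with their
        certainties, plus the open (predicted) goals still to be matched."""
        def __init__(self, words, certainties, open_goals):
            self.words = words
            self.certainties = certainties
            self.open_goals = open_goals
            self.merit = min(certainties) if certainties else 1.0

    def best_first_control(initial_states, expand):
        # expand(state) asks the feature, syntax and semantic modules for
        # successor states (possibly none, i.e. the state is deleted).
        heap = [(-s.merit, i, s) for i, s in enumerate(initial_states)]
        heapq.heapify(heap)
        counter = len(initial_states)
        while heap:
            _, _, state = heapq.heappop(heap)       # current best parse state
            if not state.open_goals:                # global goal recognized
                return state
            for succ in expand(state):              # amend, split or re-rate
                heapq.heappush(heap, (-succ.merit, counter, succ))
                counter += 1
        return None                                 # no acceptable interpretation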

3. The HEARSAY speech understanding system

A brief description is given here of some speech recognition work done at Carnegie Mellon University (Reddy et al., 1973). While it does not represent the latest achievements of the group, the HEARSAY system presents a very interesting case study.
3.1. Overall control of HEARSAY processing


HEARSAY speech recognition was controlled by a system which managed parse states as discussed in Section 2. Analysis progressed via operations on the states by three system modules. An acoustic recognizer was used to detect the presence of given English words in the input speech signal. A syntax analyzer was used to make predictions about further waveform content given previous partial analyses on the basis of grammar and vocabulary constraints. There was also a semantic routine which 'knew' about the subject of discourse and used this knowledge for prediction or acceptance of words or alteration of the merit of states. Although HEARSAY was designed for different domains of discourse, the discussion here is limited to chess games played between a human and a computer. The human speaks his moves to the computer.

3.2. Primitive extraction from speech signals


Typically speech scientists identify 43 different primitive (or atomic) sounds for English called phonemes. Phonemes are characterized not only by local properties of the speech signal where they are said to occur but also by the use of the articulators which generate them (e.g. teeth, tongue position) and by the context with adjacent phonemes. At the lowest level of recognition, HEARSAY did assign a phoneme label to each 0.01 second interval of the speech signal on the basis of signal properties alone. The amplitude of the signal was measured for 6 bands, producing a vector V = (v1, v2, v3, v4, v5, v6) for each 0.01 second interval. This vector was then classified using a nearest-neighbor scheme to assign the phoneme label. The data with which V was matched could have been created from single speaker or multiple speaker samples of the set of 43 phonemes. Once the signal was segmented into fixed sized pieces, variable sized segments were then formed by runs of the same phoneme, and sonorant, fricated, and silent periods were identified. This acoustic information could then be used to match acoustic descriptions of words in the chess vocabulary such as 'MOVE', 'QUEEN', etc. When asked if a given word could be identified at a given place in the input, the acoustic model could then assign a numerical rating to the possibility.
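A hedged sketch of this lowest level of acoustic labeling (NumPy, with hypothetical prototype vectors; HEARSAY's actual feature set and distance measure are not reproduced here): each 0.01 s frame's 6-band amplitude vector is assigned the phoneme label of its nearest prototype, and runs of identical labels are then merged into variable-sized segments.

    import numpy as np

    def label_frames(band_amplitudes, prototypes, phoneme_labels):
        # band_amplitudes: (n_frames, 6) vector V per 0.01 s interval
        # prototypes:      (n_protos, 6) reference vectors (single- or
        #                  multiple-speaker samples of the 43 phonemes)
        # phoneme_labels:  label of each prototype
        labels = []
        for v in band_amplitudes:
            d = np.linalg.norm(prototypes - v, axis=1)   # nearest-neighbor rule
            labels.append(phoneme_labels[int(np.argmin(d))])
        # merge runs of the same label into variable-sized segments
        segments, start = [], 0
        for i in range(1, len(labels) + 1):
            if i == len(labels) or labels[i] != labels[start]:
                segments.append((labels[start], start, i))   # (label, first, last+1)
                start = i
        return segments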

3.3. HEARSAY syntax

A BNF grammar defined the official chess move language. This means that not only was the entire vocabulary specified but so was the syntax of all sentences. Note that the set of all possible nouns would be very constrained, i.e. {KING, QUEEN, PAWN, etc.}, as would the set of verbs, i.e. {MOVES, TAKES, CAPTURES, etc.}. The essential property of the grammar, as discussed in Section 2, is that it could accept or reject specific words in context and it could generate all possible legal contexts. The HEARSAY syntax analyzer behaved as described in Fig. 5 and Table 1.
3.4. HEARSAY chess semantics


The computer can be programmed to 'understand' the game of chess very well.
Representing the board condition is trivial. Determining legal moves is mod-
erately easy and determining the value of a legal move is more difficult but
practically achievable. Thus HEARSAY could know which pieces could be moved
and where, and which moves were good ones and which were bad. The semantic routine could therefore predict moves or eliminate words predicted by the
syntax module. Also the semantic module could lower the merit of a parse state
that represented a bad move. Thus in ambiguous cases HEARSAY could interpret
the input to be for a good move even when the speaker actually indicated a bad
one. In fact, in theory the acoustic module could be disconnected and the
computer could actually play chess with itself getting no information from the
speech signal.

3.5. Generality of HEARSAY

HEARSAY was never designed for anything but speech and so would not be
likely to process EKG's. The primary obstacle is the feature extraction module.
Certainly grammar models exist for EKG's and pulsewaves as shown in Section 4,
so the HEARSAY syntax analyzer could be used for structural analysis. Also, it is
likely that 'EKG semantics' could be loaded into HEARSAY in the same way as chess semantics were. The control of the analysis used by HEARSAY also appears to be adequate for any time-domain waveform analysis. Thus there are two obstacles which would prevent HEARSAY from being classified as a true waveform parsing
system. First, not enough variety is provided in the set of primitive waveform
features. Secondly, certain knowledge appears to have been implemented via
program rather than data tables and hence reprogramming would be required for
other applications.

4. Analysis of medical waveforms using WAPSYS

WAPSYS is a waveform parsing system developed for the analysis of medical waveforms such as pulse waves and EKG's (Stockman, 1977). The discussion here is limited to how WAPSYS was actually used to analyze and measure carotid pulse waves. WAPSYS extracts primitive features by curve fitting. Structural analysis is driven by a grammar for the waveform, input via tables. Other application-specific semantic processing is possible by interfacing special semantic subroutines.

4.1. The carotid pulse wave application


The carotid artery in the neck is the position closest to the heart where the
blood pressure wave can be easily sensed non-invasively. The pressure signal
produced is affected by the condition of the heart, arteries, and the tissue between
the vessel and the pressure sensor. Thus measurements made on the waveform
[Fig. 6(a): stereotyped carotid pulse wave for one heart cycle, with the upslope, the systolic and diastolic portions, and the cycle onsets F1 and F2 labeled. (b): a set of 10 different sample waveforms (records 0854-0867).]
Fig. 6. (a) Stereotyped carotid pulse wave and (b) a set of 10 different samples.

should provide information for disease diagnosis (Freis et al., 1966). Fig. 6 shows 10 sample waveforms and a stereotyped pattern for one heart cycle. The objective of analysis is to reliably detect the location of the labeled points so that various measurements can be made. For instance, heart rate is easily determined from F1 and F2. Location of the important points can only be reliably done when the context of the entire set of points is considered and related to the heart model.

4.2. Extracting primitives in WAPSYS by curve fitting


WAPSYS segmented carotid pulse waves using the five primitive shapes or morphs shown in Fig. 7. These shapes had been informally identified by medical personnel studying the data and were formally defined for WAPSYS analysis as follows. Let the interval [1, n] represent the entire set of n waveform amplitudes y(1), y(2),..., y(n) and let [a, b] and [l, r] represent two subintervals of these n data points.
(a) CAP:             y_m(x) = p_2 x^2 + p_1 x + p_0,   y(a) = y(b),     c_1 < p_2 < c_2 < 0,   c_3 <= b - a <= c_4
(b) CUP:             y_m(x) = p_2 x^2 + p_1 x + p_0,   y(a) = y(b),     c_1 > p_2 > c_2 > 0,   c_3 <= b - a <= c_4
(c) RIGHT SHOULDER:  y_m(x) = p_2 x^2 + p_1 x + p_0,   y_m'(a) = 0,     c_1 < p_2 < c_2 < 0,   c_3 <= b - a <= c_4
(d) LEFT SHOULDER:   y_m(x) = p_2 x^2 + p_1 x + p_0,   y_m'(b) = 0,     c_1 < p_2 < c_2 < 0,   c_3 <= b - a <= c_4
(e) STRAIGHT LINE:   y_m(x) = p_1 x + p_0,             c_1 < p_1 < c_2,   c_3 <= b - a <= c_4

Fig. 7. Five primitives defined by constrained fits of model y_m(x) to data y(x), x in [a, b] ⊆ [l, r].
(a) CAP,
(b) CUP,
(c) RIGHT SHOULDER,
(d) LEFT SHOULDER,
(e) STRAIGHT LINE.

Driven by the syntax analyzer, the WAPSYS segmentor is repetitively tasked with identifying a specific morph M in a specific interval of data [a, b] ⊆ [l, r]. [l, r] is the constraint interval to be searched and [a, b] is the match interval where the primitive is detected. The segmentor may, in fact, identify no occurrence, one
occurrence, or many occurrences of the morph (M, [a_j, b_j], p_j, e_j) existing on the constraint interval [l, r]. Here p_j is the parameterization of the morph M and e_j is an evaluation of its merit or certainty. The morph M is specified to the segmentor by a syntactic name and semantic constraints C which must be satisfied by the parameterization P. Morph M is formally defined as a functional form y_m = y_m(x) = f(x) to be fit to the data y(x), x in [a, b], under the set of constraints C.
For example, the 'CAP' morph of Fig. 7 is defined as y_m(x) = p_2 x^2 + p_1 x + p_0, subject to the constraints that y_m(a) = y_m(b), c_1 < p_2 < c_2 < 0 and c_3 <= b - a <= c_4. The parameterization P = {p_0, p_1, p_2} is determined by least squares fitting of y_m(x) to y(x) over x in [a, b]. All five morph definitions in Fig. 7 imply least squares estimation of 2 free parameters. Under the assumption of Gaussian noise distributed as N(0, σ²), the variable

    s = (1/σ²) Σ_{x=a}^{b} (y_m(x) - y(x))^2

is chi-square distributed with b - a - 1 degrees of freedom if the values of y(x) are interpreted as realizations of y_m(x) plus noise. By defining e as the probability that a chi-square variable with b - a - 1 degrees of freedom exceeds s, the merit or certainty of the fit is nicely bounded in [0, 1] and may be interpreted as a probability.
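A sketch of this fit-and-merit computation, assuming NumPy and SciPy (the constraint arguments, and the omission of the y(a) = y(b) symmetry condition, are simplifications of the CAP definition above): a parabola is least-squares fitted on [a, b], the scaled residual sum of squares is referred to a chi-square distribution with b - a - 1 degrees of freedom, and the upper tail probability serves as the merit e.

    import numpy as np
    from scipy.stats import chi2

    def fit_cap(y, a, b, sigma2, c1, c2, c3, c4):
        # Fit y_m(x) = p2*x**2 + p1*x + p0 to y[a..b] by least squares.
        x = np.arange(a, b + 1)
        p2, p1, p0 = np.polyfit(x, y[a:b + 1], 2)
        # CAP constraints on curvature and width (y(a)=y(b) not enforced here).
        if not (c1 < p2 < c2 < 0 and c3 <= b - a <= c4):
            return None
        resid = np.polyval([p2, p1, p0], x) - y[a:b + 1]
        s = np.sum(resid ** 2) / sigma2
        e = chi2.sf(s, df=b - a - 1)     # merit: P(chi-square_{b-a-1} > s)
        return (p0, p1, p2), e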
In classic curve fitting all n given points (x_i, y_i), x in [l, r], must be fit and the noise σ² is known. This allows variation only in the model. The approach in polynomial fitting is to vary upward the degree m of the model polynomial

    y = y_m(x) = Σ_{i=0}^{m} p_i x^i

until the hypothesis H(m), that this model generated the data, can be accepted at a given confidence and the hypothesis H(m + 1) does not increase significantly this confidence of fit. The approach of the WAPSYS segmentor is to keep the polynomial form y_m(x) fixed and to vary the subinterval [a, b] ⊆ [l, r] to find the best fit s. The reason for doing this is that it is desired that the data be represented by morphs of confined geometric shape whose parameters might have strong interpretation in the problem domain. For instance, if the rate of pressure rise in a certain region of a pulse wave were thought to be significant in disease diagnosis, it would be appropriate to estimate shape by fitting a straight line to the region and not a bell-shaped curve.
Fig. 11 shows some pulse wave data that was fitted with models from Fig. 7.
The morphs UPSLOP, MPS, MNG, LN, and HOR are defined and detected by
using constraints on the parameters of the straight line model. The CAP morph
and RSH morph are instances of the cap and right shoulder of Fig. 7. Constrain-
ing the juxtaposition of these morphs is in the domain of syntax discussed in later
sections.
4.3. Training WAPSYS primitive detectors


While general morphs such as the 'CAP' or the 'LINE' might be applicable in several different waveform domains, it is unlikely that the constraints would be defined in the same way in each domain. These constraints could be defined from a priori considerations or could be 'learned' from data samples. WAPSYS can be
used to define morph constraints by fitting training data under no constraints and
recording the parameterizations that result. The set of parameterizations for each
named morph can then be converted to constraints for use in automatic analysis.
WAPSYS was used to learn the constraints necessary for automatic parsing of
carotid pulse waves. 14 morphs were identified in this application and all were
expressible in terms of constraints on generic features as shown in Fig. 7. Eight
hours was spent by the author examining print plots of 20 sample pulse waves.
Aided only by a ruler, all the data was segmented, producing for each sample a list of triples (M_i, a_i, b_i) where M_i was a 3 character morph name and [a_i, b_i] was the interval of data on which the morph was declared to exist. 459 morphs were
identified in this manner for the 20 waveforms. A training routine was written
which called the fitting routine appropriate for each morph specified and forced it
to fit the specified data interval. The fit was forced in the sense that the noise
tolerance was varied upward until either a limit was reached or a fit of quality
above 0.5 was achieved. (The noise limit was useful for detecting human errors in
creation of the training items--7 or 8 errors were detected in this manner.) The
parameters of a successful fit were contributed to a running statistical summary
which was the final output of training.
The statistical summary contained, for each morph, the lowest, highest, and
mean value for each of 3 variables and the standard deviation. The variables were
interval width b - a, noise o 2, and curvature for parabolic morphs or slope for
linear ones. Three iterations of training were required to remove all human errors
in the 459 training items. The summary values from training were then used to
construct a table of constraints to be used in automatic detection control.
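A hedged sketch of this training step, assuming NumPy (the list of linear morph names and the variance-based noise estimate are illustrative assumptions, not the exact WAPSYS procedure): each hand-labeled triple is fitted without constraints, and the interval width, residual noise, and curvature or slope are accumulated into per-morph summaries from which a constraint table can be derived.

    import numpy as np
    from collections import defaultdict

    def train_constraints(y, labeled_triples):
        # labeled_triples: list of (morph_name, a, b) from hand segmentation
        samples = defaultdict(list)
        for name, a, b in labeled_triples:
            x = np.arange(a, b + 1)
            # straight-line morphs get a degree-1 fit, parabolic morphs degree 2
            deg = 1 if name in ("UPSLOP", "TREDGE", "LN", "HOR", "MPS", "MNG") else 2
            coeffs = np.polyfit(x, y[a:b + 1], deg)
            resid = np.polyval(coeffs, x) - y[a:b + 1]
            shape = coeffs[0]               # slope (deg 1) or curvature (deg 2)
            samples[name].append((b - a, float(np.var(resid)), float(shape)))
        summary = {}
        for name, rows in samples.items():  # lowest, highest, mean and std of
            arr = np.array(rows)            # width, noise and curvature/slope
            summary[name] = {"min": arr.min(axis=0), "max": arr.max(axis=0),
                             "mean": arr.mean(axis=0), "std": arr.std(axis=0)}
        return summary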
In addition to the five shape morphs shown in Fig. 7, WAPSYS also extracts point features. Three such point features are as follows.
GLBMIN extracts x where amplitude y is minimal on search interval [a, b],
GLBMAX as above for maximal amplitude,
LAMDAP extracts x = a on search interval [a, b]; a dummy morph used to end structural iteration.
A zero crossing detector would be another point feature detector that would be useful in EKG analysis but was not useful for the analysis of carotid pulse waves.

4.4. WAPSYS grammar models and control of analysis


The most prominent primitive in the carotid pulse wave is the steep upslope in
pressure during the beginning of the systolic phase. This upslope appears after a
long gradual pressure decrease at the end of the diastolic phase. The context of
these two primitives is modeled in Fig. 8. Fig. 8(a) shows a BNF production rule
showing that the structure <JYNT> (i.e. joint) has three substructures which
are respectively TREDGE (trailing edge), GLBMIN (global minimum), and
UPSLOP in time order of their appearance. Constraints to be used by the
curve-fitting routines are listed below each structure name. These constraints are
obtained using the training procedure discussed in Section 4.3 and are interpreted
as follows. MIN and MAX are the minimum and maximum width of the
primitive in number of points, ALO and AHI are the low and high limits of the
slope, and HGN and MNN are high and mean noise levels. The informal notion

(a)   <JYNT> -> TREDGE (2)   GLBMIN (3)   UPSLOP (1)

      MAX 250   MAX 100   MAX 5   MAX 15
      MIN  50   MIN  10   MIN 0   MIN  5
      ALO -32   ALO  30
      AHI   1   AHI 350
      HGN 512   HGN 2048
      MNN 112   MNN 2048

[(b): plot of a <JYNT> structure detected on waveform 0854; the point (FX, FY) found by GLBMIN and the sample positions 173, 207, 211 and 216 bounding the TREDGE, GLBMIN and UPSLOP matches are marked.]

(c)
[ 2=<JYNT>,1  FX=.210+03  FY=.807+02  Q=.800+00  RGT=.216+03  LFT=.173+03
  [ 6=GLBMIN,3  YEX=.807+02  XEX=.207+03  Q=.100+01  RGT=.207+03  LFT=.207+03
  ] 6=GLBMIN,3
  [ 9=TREDGE,2  C=.000  B=.293+03  A=-.613+01  Q=.800+00  RGT=.207+03  LFT=.173+03
  ] 9=TREDGE,2
  [ 4=UPSLOP,1  C=.000  B=.201+03  A=.102+03  Q=.100+01  RGT=.216+03  LFT=.211+03
  ] 4=UPSLOP,1
] 2=<JYNT>,1

Fig. 8. WAPSYS definition and detection of joint structure in carotid pulse wave.
(a) Grammatical representation for the joint structure found between two consecutive cycles of the carotid pulse wave.
(b) Graph of <JYNT> structure detected by WAPSYS in waveform 0854 of Fig. 6.
(c) Output from WAPSYS analyzer after detection of <JYNT> structure shows fit parameters and data intervals.
of attributes used here has already been formalized by others, yielding 'attribute grammars'. The production rule in Fig. 8(a) also tells the WAPSYS analyzer in what sequence to search for the substructures of <JYNT>. UPSLOP is to be sought first because it is reliably found with no other context information. TREDGE is to be found next, to the left of UPSLOP, of course, and then finally GLBMIN is to be gotten somewhere in between UPSLOP and TREDGE. The MIN and MAX parameters are required by WAPSYS so that it can assign a consistent search interval [l, r] to each hypothesized primitive. Fig. 8(b) shows graphically a <JYNT> structure identified on a pulse wave. A parenthesized form of the partial parse tree generated by WAPSYS is given in Fig. 8(c). The LFT and RGT attributes define the match interval [a, b] and Q is the chi-squared quality of fit.
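A hedged sketch of the ordered subgoal search implied by the <JYNT> rule (Python pseudostructure; detect stands for whichever primitive detector is invoked, and combining merits by the minimum follows the values shown in Fig. 8(c) but is an assumption here):

    def find_jynt(detect, l, r):
        # detect(name, lo, hi) is an assumed primitive detector returning
        # (a, b, params, merit) for the best match on [lo, hi], or None.
        up = detect("UPSLOP", l, r)            # sought first: reliable without context
        if up is None:
            return None
        tr = detect("TREDGE", l, up[0])        # next: to the left of UPSLOP
        if tr is None:
            return None
        gm = detect("GLBMIN", tr[1], up[0])    # finally: between TREDGE and UPSLOP
        if gm is None:
            return None
        merit = min(up[3], tr[3], gm[3])
        return {"TREDGE": tr, "GLBMIN": gm, "UPSLOP": up, "merit": merit}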
Fig. 9 shows a complete structural model of a carotid pulse wave via BNF.
Parameters of each syntactic structure are omitted for clarity but search sequence

(a)
<STRT> -> <JYNT> (1)  <TRAN> (2)
<TRAN> -> <CPW> (1)  <END> (2)
<END>  -> <TRN2> | LAMDAP
<CPW>  -> <CORE> (2)  <JYNT> (1)
<TRN2> -> <TRAN>
<CORE> -> <SPLX> (2)  <DPLX> (1)
<DPLX> -> LN (1)  <DWAV> (2)
<SPLX> -> <SPL1> | <SPL2> | <SPL3> | <SPL4>
<DWAV> -> HOR | <BUMP>
<SPL1> -> <M1> (2)  CUP (1)  <M3> (3)
<SPL2> -> MPS (1)  <M3> (2)
<SPL3> -> <M1> (1)  MNG (2)
<M1>   -> CAP | LSHOLD
<M3>   -> CAP | RSHOLD
<BUMP> -> CAP | LAMDAP
<SPL4> -> CAP (1)  CAP (2)
<JYNT> -> TREDGE (2)  GLBMIN (3)  UPSLOP (1)

(b)
TREDGE   long gradual negative slope
GLBMIN   point of minimum amplitude
UPSLOP   large straight line segment of positive slope
LAMDAP   dummy point morph to end iteration
LN       large straight line segment of negative slope
HOR      small horizontal straight line segment
MPS      straight line segment, moderate positive slope
MNG      straight line segment, moderate negative slope
CAP, CUP parabolic cap/cup as in Fig. 7
LSHOLD, RSHOLD  left and right shoulders as in Fig. 7

Fig. 9. BNF grammar (without parameters) used to drive analysis of carotid pulse waves.
(a) Set of productions for carotid pulse wave grammar; <STRT> is start symbol or global goal. Numbers in parentheses indicate search order.
(b) Terminal vocabulary of carotid pulse wave grammar.
[Fig. 10: WAPSYS parse tree listing for wave 0844 (overall merit .850+00). The listing first gives the global problem attributes computed by the semantic routines (among them RAT = .871+02, the heart rate) and then, for every recognized structure from <STRT> down to the terminal morphs, its quality Q, its match interval (LFT, RGT) and its fit parameters.]
Fig. 10. Parse tree showing the results of analysis of a complete carotid pulse wave cycle (first cycle shown in Fig. 11).
[Fig. 11: plot of two consecutive cycles of carotid pulse wave sample 0844, with the fitted morphs (e.g. UPSLOP) and labeled points (e.g. P2) marked; separate amplitude scales apply to the top and bottom cycles.]
Fig. 11. Carotid pulse wave sample 0844 shows interesting structural variations on consecutive cycles (analysis of first cycle shown in Fig. 10).

is indicated on each right hand side. This grammar was used by WAPSYS to drive the analysis of a few hundred pulse waves. Fig. 10 shows a complete parse tree obtained by analyzing the first cycle of the waveform shown in Fig. 11. Fig. 11 shows interesting structural variations present in two consecutive cycles of the
same pulse wave. The parse tree for the second cycle is not shown here. As Fig. 10 shows, a number of global attributes have been computed by WAPSYS from the parse tree: heart rate (RAT = .871+02) is only one of these. Global attributes are manipulated by 'semantic routines' which are called whenever syntactic structures are identified. The semantic routines are actually FORTRAN code segments. Although they are simply linked to WAPSYS, these routines are application specific and must be written by the user.
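As an illustration of the kind of global attribute such a semantic routine computes (in Python rather than the FORTRAN actually used, and with assumed argument names), the heart rate follows directly once the fiducial points F1 and F2 of two consecutive cycles are available from the parse:

    def heart_rate_bpm(f1_x, f2_x, sample_rate_hz):
        # f1_x, f2_x: sample indices of the onsets F1 and F2 of two
        # consecutive cycles, taken from the parse tree attributes
        period_s = (f2_x - f1_x) / sample_rate_hz
        return 60.0 / period_s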

4.5. Evaluation of WAPSYS


The methodology of using WAPSYS is straightforward. Primitive detectors can
be trained and tested independent of grammar structure. When structural models
come into play any substructure model can be tested without performing the total
structural analysis. The record of analysis produced by WAPSYS as shown in Figs.
10 and 11 is elaborate and descriptive. Much tedious human measurement work is
saved. An unsatisfactory analysis will be evident in the comparative graphical
output and could be rectified by interaction between WAPSYS and the user. (This
is not currently a feature of WAPSYS.)
Use of WAPSYS in carotid pulse wave analysis exposed some problems. There
were some waveforms, not seen during training, which exhibited structure not
modeled by the grammar and were thus rejected. Also, there was a problem
quantifying the noise. Noise seemed to vary not only across different waveforms
but also across different structures of the same pulse wave. Evaluation of fit
quality was not always done in a manner consistent with human interpretation.
For example, Figs. 10 and 11 show that a straight line fit HOR had higher quality than a parabolic CAP for fitting points 110 to 118 of the waveform. A human
would probably choose the CAP as best. There is often bias in the fit error but
there are too few points to detect this objectively using a run-of-signs test on the
residuals. Other criticisms of WAPSYS are given in the concluding section.

5. Concluding discussion

We close by summarizing the main features of waveform parsing systems and by mentioning other important related work.

5.1. Summary of the waveform parsing approach


The linguistic model views waveforms as a structured aggregate of primitive
parts. Information is contained in the individual primitives and in their structural
combination. Analysis of a waveform requires that a structural description be
generated which describes each part and its hierarchical relation to other parts.
A waveform parsing system (WPS) provides a standard set of primitive
detectors and a standard structural analysis algorithm. A waveform domain, e.g.
pulse waves, is described to WPS by the subset of primitives to be used and by a
Waveformparsing systems 547

structural model which is usually a grammar of some type. Both the primitives
and structural model can be input to WPS via tables of data and no programming
is required by the user for this. However, sometimes it is necessary for the user to
specify application specific semantic processing via code in a high level language.

5.2. Additional background work


Critical background work was also done by Pavlidis (1971, 1973) and Fu and
You (1974, 1979). Pavlidis has concentrated much of his effort in mathematical
techniques for segmentation of a waveform into primitives. Fu and his co-workers,
on the other hand, have done a large amount of work on grammatical models for
pattern structure and noise. The recent paper by You and Fu (1979) describes a
WPS based on attributed grammars which was used to recognize airplanes by
parsing their outline as if it were a waveform.

5.3. Criticism of the WPS approach


The linguistic approach will only apply to waveforms with rich time or space
domain structure. There is no question that the computer is a useful tool in
analyzing such data. Computer science has developed to a point where all new
programs are modularly designed and often are table-driven, even if an ad hoc
application-dependent approach is taken. Is the step toward a general WPS
worthwhile? The answer to this question is perhaps "yes" provided that there
exists a laboratory with continuing expertise in the WPS techniques and in
specific problem areas. In particular, research on waveforms could be done
quickly using the WPS tool. However, for a one-time application it would be
foolish for an uninitiated staff to first learn the WPS in order to specify the
waveform analysis. Also, because of its generality, WPS will be slower than a
tailored production system.
Whether or not a WPS is used, the linguistic model has advantages. As Figs. 10
and 11 show, the anthropomorphic results of analysis are readily available for
human verification and use. The WAPSYS described analyzed a pulse wave in
roughly 7 seconds using a large computer (Univac 1108). While this is not
real-time it is certainly quick enough for research and most laboratory analysis.
Human measurement and subsequent documentation could take a good fraction
of an hour.
Noise and ambiguity appear to remain as problems because of insufficient
modelling capabilities. Discrete structural models have trouble with continuous
variations. Work in error-correcting and stochastic grammars does not appear to
be fruitful. Heuristic techniques as used in HEARSAY and WAPSYS are partly
effective but without guarantee. But, as work with carotid pulse waves has shown,
perhaps one third of all computer-human disagreements could be due to error on
the part of the human (Stockman, 1976). A good solution appears to be interac-
tive processing where the computer performs automatic analyses and the human
selects a best one or makes minor corrections. The linguistic model used by WPS
appears to support this very well. With respect to speech recognition, most
current systems require an acknowledgement of some sort from the human to
indicate that the computer did indeed make the correct interpretation.

References

Aho, A. V. and Ullman, J. D. (1972). The Theory of Parsing, Translation and Compiling, Vol. I: Parsing. Prentice-Hall, New Jersey.
Cox, J., Nolle, F. and Arthur, R. (1972). Digital analysis of the electroencephalogram, the blood pressure wave, and the electrocardiogram. Proc. IEEE 60(10) 1137-1164.
Earley, J. (1970). An efficient context-free parsing algorithm. Comm. ACM 13(2) 94-102.
Freis, E. D., Heath, W. C., Fuchsinger, P. C. and Snell, R. E. (1966). Changes in the carotid pulse
which occur with age and hypertension. American Heart J. 71(6) 757-765.
Fu, K. S. (1974). Syntactic Methods in Pattern Recognition. Academic Press, New York.
Gries, D. (1971). Compiler Construction for Digital Computers. Wiley, New York.
Hall, P. A. (1973). Equivalence between and/or graphs and context-free grammars. Comm. ACM 16(7) 444-445.
Hopcroft, J. and Ullman, J. (1969). Formal Languages and Their Relationship to Automata. Addison-
Wesley, New York.
Horowitz, S. L. (1975). A syntactic algorithm for peak detection in waveforms with applications to
cardiography. Comm. ACM 18(5) 281-285.
Ledley, R. et al. (1966). Pattern recognition studies in the biometrical sciences. AFIPS Conf. Proc.
SJCC, 411-430. Boston, MA.
Miller, P. (1973). A locally organized parser for spoken output. Tech. Rept. 503, Lincoln Lab, M.I.T.,
Cambridge, MA.
Pavlidis, T. (1971). Linguistic analysis of waveforms. In: J. Tou, ed., Software Engineering, Vol. II,
203-225. Academic Press, New York.
Pavlidis, T. (1973). Waveform segmentation through functional approximation. IEEE Trans. Comput.
22(7) 689-697.
Reddy, D. R., Erman, L. D., Fennell, R. D. and Neely, R. B. (1973). The HEARSAY speech understanding system. Proc. 3rd Internat. Joint Conf. Artificial Intelligence, 185-193. Stanford, CA.
Stockman, G., Kanal, L. and Kyle, M. C. (1976). Structural pattern recognition of carotid pulse waves using a general waveform parsing system. Comm. ACM 19(12) 688-695.
Stockman, G. (1977). A problem-reduction approach to the linguistic analysis of waveforms. Com-
puter Science TR-538, University of Maryland.
You, K. C. and Fu, K. S. (1979). A syntactic approach to shape recognition using attributed
grammars. IEEE Trans. Systems Man Cybernet. 9(6) 334-344.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 549-573

Continuous Speech Recognition: Statistical Methods

F. Jelinek, R. L. Mercer and L. R. Bahl

1. Introduction

The aim of research in automatic speech recognition is the development of a
device that transcribes natural speech automatically. Three areas of speech
recognition research can be distinguished: (1) isolated word recognition where
words are separated by distinct pauses; (2) continuous speech recognition where
sentences are produced continuously in a natural manner; and (3) speech under-
standing where the aim is not transcription but understanding in the sense that
the system (e.g. a robot or a data base query system) responds correctly to a
spoken instruction or request. Commercially available products exist for isolated
word recognition with vocabularies of up to several hundred words.
Although this article is confined to continuous speech recognition (CSR), the
statistical methods described are applicable to the other areas of research as well.
Acoustics, phonetics, and signal processing are discussed here only as required to
provide background for the exposition of statistical methods used in the research
carried out at IBM.
Products which recognize continuously spoken digit sequences are beginning to
appear on the market but the goal of unrestricted continuous speech recognition
is far from being realized. All current research is carried out relative to task
domains which greatly restrict the sentences that can be uttered. These task
domains are of two kinds: those where the allowed sentences are prescribed a
priori by a grammar designed by the experimenter (referred to as artificial tasks),
and those related to a limited area of natural discourse which the experimenter
tries to model from observed data (referred to as natural tasks). Examples of
natural tasks are the text of business letters, patent applications, book reviews,
etc.
In addition to the constraint imposed by the task domain, the experimental
environment is often restricted in several other ways. For example, at IBM,
speech is recorded in a quiet room with high fidelity equipment; the system is
tuned to a single talker; the talker is prompted by a script, false starts are
eliminated, etc.; recognition often requires many seconds of CPU time for each
second of speech.

Fig. 1. A continuous speech recognition system.

The basic CSR system consists of an acoustic processor (AP) followed by a
linguistic decoder (LD) as shown in Fig. 1. Traditionally, the acoustic processor is
designed to act as a phonetician, transcribing the speech waveform into a string of
phonetic symbols, while the linguistic decoder translates the possibly garbled
phonetic string into a string of words. In more recent work [2,3,5, 8,9, 16] the
acoustic processor does not produce a phonetic transcription, but rather produces
a string of labels each of which characterizes the speech waveform locally over a
short time interval (see Section 2).
In Fig. 2 speech recognition is formulated as a problem in communication
theory. The speaker and acoustic processor are combined into an acoustic channel,
the speaker transforming the text into a speech waveform and the acoustic
processor acting as a data transducer and compressor. The channel provides the
linguistic decoder with a noisy string from which it must recover the message--in
this case the original text. One is free to modify the channel by adjusting the
acoustic processor but unlike in communications, one cannot choose the code
because it is fixed by the language being spoken. It is possible to allow feedback
from the decoder to the acoustic processor but the mathematical consequences of
such a step are not well understood. By not including feedback we facilitate a
consistent and streamlined formulation of the linguistic decoding problem.
The rest of this article is divided as follows. Section 2 gives a brief summary of
acoustic processing techniques. Section 3 formulates the problem of linguistic
decoding and shows the necessity of statistical modeling of the text and of the
acoustic channel. Section 4 introduces Markov models of speech processes.
Section 5 describes an elegant linguistic decoder based on dynamic programming
that is practical under certain conditions. Section 6 deals with the practical
aspects of the sentence hypothesis search conducted by the linguistic decoder.
Sections 7 and 8 introduce algorithms for extracting model parameter values
automatically from data. Section 9 discusses methods of assessing the perfor-
mance of CSR systems, and the relative difficulty of recognition tasks.

Fig. 2. The communication theory view of speech recognition.

Finally, in Section 10 we illustrate the capabilities of current recognition systems by describing the results of certain recognition experiments.

2. Acoustic processors

An acoustic processor (AP) acts as a data compressor of the speech waveform.
The output of the AP should (a) preserve the information important to recogni-
tion, and (b) be amenable to statistical characterization. If the AP output can be
easily interpreted by people, it is possible to judge the extent to which the AP
fulfills requirement (a).
Typically, an AP is a signal processor, which transforms the speech waveform
into a string of parameter vectors, followed by a pattern classifier, which
transforms the string of parameter vectors into a string of labels from a finite
alphabet. If the pattern classifier is absent, then the AP produces an unlabeled
string of parameter vectors. In a segmenting AP, the speech waveform is seg-
mented into distinct phonetic events (usually phones 1) and each of these varying
length portions is then labeled.
A time-synchronous AP produces parameter vectors computed from successive
fixed-length intervals of the speech waveform. The distance from the parameter
vector to each of a finite set of standard parameter vectors, or prototypes, is
computed. The label for the parameter vector is the name of the prototype to
which it is closest.
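As a concrete illustration of this labeling step, the sketch below assigns to each parameter vector the name of its nearest prototype. The prototype names, the two-band "spectra", and the use of Euclidean distance are illustrative assumptions; the chapter does not specify the distance measure.

```python
import numpy as np

def label_vectors(parameter_vectors, prototypes):
    """Assign to each parameter vector the name of the nearest prototype.

    parameter_vectors: array of shape (T, d), one vector per centisecond frame.
    prototypes: dict mapping a label (e.g. a phone-like symbol) to a d-vector.
    Euclidean distance is an assumption; the text does not fix the metric.
    """
    names = list(prototypes)
    proto = np.stack([prototypes[n] for n in names])   # (K, d) prototype matrix
    labels = []
    for v in parameter_vectors:
        dists = np.linalg.norm(proto - v, axis=1)      # distance to each prototype
        labels.append(names[int(np.argmin(dists))])    # closest prototype wins
    return labels

# Hypothetical usage: three prototypes described by a two-band "spectrum".
protos = {"AA": np.array([5.0, 1.0]), "IY": np.array([1.0, 5.0]), "S": np.array([0.5, 0.5])}
frames = np.array([[4.2, 1.3], [0.7, 0.6]])
print(label_vectors(frames, protos))                    # ['AA', 'S']
```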
In early acoustic processors, prototypes were obtained from speech data labeled
by an expert phonetician. In more recent acoustic processors, prototypes are
obtained automatically from unlabeled speech data [5, 8].
A typical example of a time-synchronous AP is the IBM centisecond acoustic
processor (CSAP). The acoustic parameters used by CSAP are the energies in
each of 80 frequency bands in steps of 100 Hz covering the range from 0 to 8000
Hz. They are computed once every centisecond using a 2-centisecond window.
The pattern classifier has 45 prototypes corresponding roughly to the phones of
English. Each prototype for a given speaker is obtained from several samples of
his speech which have been carefully labeled by a phonetician.

3. Linguistic decoder

The AP produces an output string y. From this string y, the linguistic decoder (LD) makes an estimate, ŵ, of the word string w produced by the text generator (see Fig. 1). To minimize the probability of error, ŵ must be chosen so that

$$P(\hat{w} \mid y) = \max_{w} P(w \mid y). \tag{3.1}$$


1 For an introductory discussion of phonetics, see [17, pp. 99-132].



By Bayes' rule:

$$P(w \mid y) = \frac{P(w)\,P(y \mid w)}{P(y)}. \tag{3.2}$$

Since P(y) does not depend on w, maximizing P(w | y) is equivalent to maximizing the likelihood P(w, y) = P(w) P(y | w). Here P(w) is the a priori probability that the word sequence w will be produced by the text generator, and P(y | w) is the probability with which the acoustic channel (see Fig. 1) transforms the word string w into the AP output string y.
To estimate P(w), the LD requires a probabilistic model of the text generator,
which we refer to as the language model. For most artificial tasks, the language
modeling problem is quite simple. Often the language is specified by a small
finite-state or context-free grammar to which probabilities can be easily attached.
For example, the Raleigh language (see Section 4.2) is specified by Fig. 7 where all
words possible at any point are considered equally likely.
For natural tasks the estimation of P(w) is much more difficult. Linguistics has
not progressed to the point that it can provide a useful grammar for a sizable
subset of natural English. In addition, the interest in linguistics has been in
specifying the sentences of a language, but not their probabilities. Our approach
has been to model the text generator as a Markov source, the parameters of which
are estimated from a large sample of text.
To estimate P(y | w), the other component of the likelihood, the LD requires a probabilistic model of the acoustic channel, which must account for the speaker's phonological and acoustic-phonetic variations and for the performance of the acoustic processor. Once models are available for computing P(w) and P(y | w), it is in principle possible for the LD to compute the likelihood of each sentence in the language and determine the most likely ŵ directly. However, even a small
artificial language such as the Raleigh language has several million possible
sentences. It is therefore necessary in practice to carry out a suboptimal search. A
dynamic programming search algorithm, the applicability of which is limited to
tasks of moderate complexity, is described in Section 5. A more general tree
search decoding algorithm is described in Section 6.
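The decoding objective can be made concrete with a small sketch that scores candidate word strings by P(w)P(y|w), in log form, and keeps the best. The candidate list and the two scoring callables are placeholders; the exhaustive loop is exactly what the searches of Sections 5 and 6 are designed to avoid.

```python
import math

def decode(candidates, log_p_w, log_p_y_given_w, y):
    """Return the candidate word string w maximizing P(w) P(y | w).

    candidates: iterable of word tuples (a stand-in for the task language).
    log_p_w, log_p_y_given_w: placeholder callables supplying the language
    model and acoustic channel log probabilities.
    """
    best, best_score = None, -math.inf
    for w in candidates:
        score = log_p_w(w) + log_p_y_given_w(y, w)   # log P(w) + log P(y | w)
        if score > best_score:
            best, best_score = w, score
    return best
```

In practice the candidate set is far too large to enumerate, which is why the Viterbi and stack searches of the following sections are needed.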

4. Markov source modeling of speech processes

4.1. Notation and terminology


By a Markov source, we mean a collection of states connected to one another
by transitions which produce symbols from a finite alphabet. Each transition t
from a state s has associated with it a probability q_s(t) which is the probability
that t will be chosen next when s is reached. From the states of a Markov source
we choose one state as the initial state and one state as the final state. The
Markov source then assigns probabilities to all strings of transitions from the
initial state to the final state. An example of a Markov source is shown in Fig. 3.

We define a Markov source more formally as follows. Let 𝒮 be a finite set of states, 𝒯 a finite set of transitions, and 𝒜 a finite alphabet. Two elements of 𝒮, s_I and s_F, are distinguished as initial and final states, respectively. The structure of a Markov source is a 1-1 mapping M: 𝒯 → 𝒮 × 𝒜 × 𝒮. If M(t) = (l, a, r), then we refer to l as the predecessor state of t, a as the output symbol associated with t, and r as the successor state of t; we write l = L(t), a = A(t), and r = R(t).
The parameters of a Markov source are probabilities q_s(t), s ∈ 𝒮 − {s_F}, t ∈ 𝒯, such that

$$q_s(t) = 0 \quad \text{if } s \ne L(t), \qquad \text{and} \qquad \sum_{t} q_s(t) = 1, \quad s \in \mathcal{S} - \{s_F\}. \tag{4.1}$$

In general, the transition probabilities associated with one state are different from those associated with another. However, this need not always be the case. We say that state s_1 is tied to state s_2 if there exists a 1-1 correspondence T_{s_1 s_2}: 𝒯 → 𝒯 such that q_{s_1}(t) = q_{s_2}(T_{s_1 s_2}(t)) for all transitions t. It is easily verified that the relationship of being tied is an equivalence relation and hence induces a partition of 𝒮 into sets of states which are mutually tied.
A string of n transitions t_1^n for which L(t_1) = s_I is called a path; if R(t_n) = s_F, then we refer to it as a complete path. 2 The probability of a path t_1^n is given by

$$P(t_1^n) = q_{s_I}(t_1) \prod_{i=2}^{n} q_{R(t_{i-1})}(t_i). \tag{4.2}$$

Associated with path t_1^n is an output symbol string a_1^n = A(t_1^n). A particular output string a_1^n may in general arise from more than one path. Thus the probability P(a_1^n) is given by

$$P(a_1^n) = \sum_{t_1^n} P(t_1^n)\,\delta(A(t_1^n), a_1^n) \tag{4.3}$$

where

$$\delta(a, b) = \begin{cases} 1 & \text{if } a = b,\\ 0 & \text{otherwise.} \end{cases} \tag{4.4}$$

A Markov source for which each output string a_1^n determines a unique path is called a unifilar Markov source.
In practice it is useful to allow transitions which produce no output. These null transitions are represented diagrammatically by interrupted lines (see Fig. 4). Rather than deal with null transitions directly, we have found it convenient to associate with them the distinguished letter φ. We then add to the Markov source

2 t_1^n is a short-hand notation for the concatenation of the symbols t_1, t_2, ..., t_n. Strings are indicated in boldface throughout.

Fig. 3. A Markov source.

Fig. 4. A generalized Markov source.

a filter (see Fig. 5) which removes φ, transforming the output sequence a_1^n into an observed sequence b_1^m where b_i ∈ ℬ = 𝒜 − {φ}. Although more general sources can be handled, we shall restrict our attention to sources which do not have closed circuits of null transitions.
If t_1^n is a path which produces the observed output sequence b_1^m, then we say that b_i spans t_j if t_j is the transition which produced b_i or if t_j is a null transition
Fig. 5. A filtered Markov source.

Fig. 6. A sequence of transitions to illustrate spanning: b_1 spans t_1; b_2 spans t_2, t_3, t_4; and b_3 spans t_5, t_6.

immediately preceding a transition spanned by b_i. For example, in Fig. 6, b_1 spans t_1; b_2 spans t_2, t_3, and t_4; and b_3 spans t_5 and t_6.
A major advantage of using Markov source models for the text generator and
acoustic channel is that once the structure is specified, the parameters can be
estimated automatically from data (see Sections 7 and 8). Furthermore, computa-
tionally efficient algorithms exist for computing P(w) and P(y | w) with such
models (see Sections 5 and 6). Markov source models also allow easy estimation
of the relative difficulty of recognition tasks (see Section 9).
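The definitions above translate directly into a small data structure. The sketch below (an illustrative rendering, not the authors' implementation) stores each transition as a record (L(t), A(t), R(t), q_{L(t)}(t)) and evaluates the path probability of eq. (4.2); null transitions are represented by the output value None, and the toy source is invented.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Transition:
    left: str       # predecessor state L(t)
    output: str     # output symbol A(t); None plays the role of the null symbol
    right: str      # successor state R(t)
    prob: float     # q_{L(t)}(t)

def path_probability(path):
    """Probability of a string of transitions, eq. (4.2): since L(t_i) = R(t_{i-1})
    along a path, it is the product of q_{L(t_i)}(t_i) over the path."""
    for prev, t in zip(path, path[1:]):
        assert prev.right == t.left, "transitions must chain: R(t_{i-1}) = L(t_i)"
    p = 1.0
    for t in path:
        p *= t.prob
    return p

# A toy source with initial state 'I' and final state 'F'.
t1 = Transition("I", "a", "S", 0.6)
t2 = Transition("S", None, "S", 0.1)    # a null transition
t3 = Transition("S", "b", "F", 0.9)
print(path_probability([t1, t2, t3]))   # 0.6 * 0.1 * 0.9 ≈ 0.054
```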

4.2. The language model


Since the language model has to assign probabilities to strings of words, it is
natural for its output alphabet to be the vocabulary of the language. However, the
output alphabet can include shorter units such as word stems, prefixes, suffixes,
etc., from which word sequences can be derived. Fig. 7 is the model of the
artificial Raleigh language which has been used in some of our experiments. The
output alphabet is the 250-word vocabulary of the language. For diagrammatic
convenience, sets of transitions between pairs of states have been replaced by
single transitions with an associated list of possible output words.
For natural languages, the structure of the model is not given a priori.
However,

$$P(w_1^n) = P(w_1)P(w_2 \mid w_1)P(w_3 \mid w_1^2)\cdots P(w_n \mid w_1^{n-1}) = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1}), \tag{4.5}$$

and so it is natural to consider structures for which a word string w_1^{k-1} uniquely determines the state of the model. A particularly simple model is the N-gram model where the state at time k − 1 corresponds to the N − 1 most recent words w_{k-N+1}, ..., w_{k-1}. This is equivalent to using the approximation

$$P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-N+1}^{k-1}).$$

N-gram models are computationally practical only for small values of N. In order
to reflect longer term memory, the state can be made dependent on a syntactic
analysis of the entire past word string w_1^{k-1}, as might be obtained from an
appropriate grammar of the language.
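A minimal bigram (N = 2) instance of such a model, estimated by relative frequency as anticipated by eq. (7.8), might look as follows; the sentence-boundary markers and the tiny training sample are assumptions of the sketch.

```python
from collections import defaultdict

def train_bigram(sentences):
    """Estimate P(w_k | w_{k-1}) by relative frequency from training text.

    sentences: list of lists of words.  '<s>' and '</s>' mark sentence
    boundaries (a convention assumed here, not fixed by the chapter).
    """
    counts = defaultdict(lambda: defaultdict(int))
    for words in sentences:
        history = "<s>"
        for w in words + ["</s>"]:
            counts[history][w] += 1          # count of w following the history word
            history = w
    probs = {}
    for h, ws in counts.items():
        total = sum(ws.values())
        probs[h] = {w: c / total for w, c in ws.items()}
    return probs

lm = train_bigram([["the", "laser", "emits"], ["the", "beam"]])
print(lm["the"])   # {'laser': 0.5, 'beam': 0.5}
```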

Fig. 7. The Markov source model of the Raleigh language.

4.3. The acoustic channel model


The AP is deterministic and hence the same waveform will always give rise to
the same AP output string. However, for a given word sequence, the speaker can
produce a great variety of waveforms resulting in a corresponding variation in the
AP output string. Some of the variation arises because there are many different
ways to pronounce the same word (this is called phonological variation). Other
factors include rate of articulation, talker's position relative to the microphone,
ambient noise, etc.
We will only consider the problem of modeling the acoustic channel for single
words. Models for word strings can be constructed by concatenation of these
simpler, single word models. Fig. 8 is an example of the structure of a Markov
source for a single word. The double arcs represent sets of transitions, one for
each symbol in the output alphabet. The straight-line path represents pronuncia-
tions of average length, while the transitions above and below can lengthen and
shorten the pronunciation respectively. Since the pronunciation of a word de-
pends on the environment in which it occurs, it may be necessary in practice to
make the parameters of the model depend on the phonetic environment provided
by the preceding and following words.
Since the same sounds can occur in many different words, portions of one
model will be similar to portions of many other models. The number of parame-
ters required to specify all the word models can be reduced by modeling sounds
or phones rather than words directly. This leads to a two-level model in which
word strings are transformed into phone strings which are then transformed into
AP output strings. Using this approach, the acoustic channel model is built up
from two components: a set of phonetic subsources, one for each word; and a set
of acoustic subsources, one for each phone.
Let P be the alphabet of phones under consideration. A phonetic subsource for a word is a Markov source with output alphabet P which specifies the pronunciations possible for the word and assigns a probability to each of them. Fig. 9 shows the structure of a phonetic Markov subsource for the word two. The structures of these subsources may be derived by the application of phonological rules to dictionary pronunciations for the words [12].
An acoustic subsource for a phone is a Markov source whose output alphabet is the set of AP output symbols; it specifies the possible AP output strings for that phone and assigns a probability to each of them. Fig. 10 shows the structure of an acoustic Markov subsource used with the IBM Centisecond Acoustic Processor.

Fig. 8. A word-based Markov subsource.

Fig. 9. A phonetic Markov subsource.

Fig. 10. An acoustic Markov subsource.

By replacing each of the transitions in the phonetic subsource by the acoustic subsource for the corresponding phone, we obtain a Markov source model for the acoustic channel. This embedding process is illustrated in Fig. 11.
Whereas the structure of the phonetic subsources can be derived in a principled way from phonological rules, the structures of the word model in Fig. 8 and the phone model in Fig. 10 are fairly arbitrary. Many possible structures seem reasonable; the ones shown here are very simple ones which have been used successfully in recognition experiments.

5. Viterbi linguistic decoding

In the preceding section we have shown that acoustic subsources can be embedded in phonetic subsources to produce a model for the acoustic channel. In a similar fashion we can embed acoustic channel word models in the Markov source specifying the language model by replacing each transition by the model of

Fig. 11. A phone-based Markov source based on the phonetic subsource of Fig. 9.

the corresponding word. The resulting Markov source is a model for the entire stochastic process to the left of the linguistic decoder in Fig. 1. Each complete path t through the model determines a unique word sequence w = W(t) and a unique AP output string y = Y(t), and has the associated probability P(t). Using well known minimum-cost path-finding algorithms, it is possible to determine, for a given AP string y, the complete path t which maximizes the probability P(t) subject to the constraint Y(t) = y. A decoder based on this strategy would then produce as its output W(t). This decoding strategy is not optimal since it may not maximize the likelihood P(w, y). In fact, for a given pair w, y there are many complete paths t for which W(t) = w and Y(t) = y. To
w, y there are many complete paths t for which W ( t ) = w and Y ( t ) - - y . To
minimize the probability of error, one must sum P(t) over all these paths and
select the w for which the sum is maximum. Nevertheless, good recognition results
have been obtained using this suboptimal decoding strategy [7, 9, 16].
A simple method for finding the most likely path is a dynamic programming scheme [11] called the Viterbi algorithm [13]. Let τ_k(s) be the most probable path to state s which produces output y_1^k. Let V_k(s) = P(τ_k(s)) denote the probability of the path τ_k(s). We wish to determine τ_n(s_F) (see Section 4.1). Because of the Markov nature of the process, τ_k(s) can be shown to be an extension of τ_{k-1}(s′) for some s′. Therefore, τ_k(s) and V_k(s) can be computed recursively from τ_{k-1}(s) and V_{k-1}(s), starting with the boundary conditions V_0(s_I) = 1 and τ_0(s_I) being the null string. Let C(s, a) = {t | R(t) = s, A(t) = a}. Then

$$V_k(s) = \max\Big\{ \max_{t \in C(s, y_k)} V_{k-1}(L(t))\, q_{L(t)}(t),\ \max_{t \in C(s, \phi)} V_k(L(t))\, q_{L(t)}(t) \Big\}. \tag{5.1}$$

If the maximizing transition t is in C(s, y_k), then τ_k(s) = τ_{k-1}(L(t)) ∘ t; otherwise t must be in C(s, φ) and τ_k(s) = τ_k(L(t)) ∘ t, where '∘' denotes concatenation.
Note that in (5.1) V_k(s) depends on V_k(L(t)) for t ∈ C(s, φ). V_k(L(t)) must therefore be computed before V_k(s). Because closed circuits of null loops are not allowed (see Section 4.1), it is possible to order the states s_1, s_2, s_3, ... such that t ∈ C(s_k, φ) and L(t) = s_j only if j < k. If we then compute V_k(s_1), V_k(s_2), etc., in sequence, the necessary values will always be available when required.
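A sketch of the recursion (5.1) is given below for the simplified case of a source with no null transitions, so that each observed symbol advances the recursion by exactly one step; the toy source in the example is invented.

```python
def viterbi(transitions, s_initial, s_final, observed):
    """Most probable complete path producing `observed`, per eq. (5.1),
    for a source without null transitions (a simplifying assumption)."""
    V = {s_initial: 1.0}              # V_0
    back = {s_initial: []}            # tau_0: the null string
    for y in observed:
        V_new, back_new = {}, {}
        for (l, a, r, q) in transitions:
            if a != y or l not in V:
                continue
            score = V[l] * q          # V_{k-1}(L(t)) q_{L(t)}(t)
            if score > V_new.get(r, 0.0):
                V_new[r] = score
                back_new[r] = back[l] + [(l, a, r)]
        V, back = V_new, back_new
    if s_final not in V:
        return None, 0.0
    return back[s_final], V[s_final]

# Toy source: two ways to explain the string "ab".
trans = [("I", "a", "S", 0.6), ("I", "a", "T", 0.4),
         ("S", "b", "F", 0.5), ("T", "b", "F", 0.9)]
path, prob = viterbi(trans, "I", "F", "ab")
print(path, prob)   # the I-T-F path wins: 0.4 * 0.9 = 0.36
```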
Many shortcuts to reduce the amount of computation and storage are possible
and we will briefly mention some of the more useful ones. If logarithms of
probabilities are used, no multiplications are necessary and the entire search can
be carried out with additions and comparisons only. Computation and storage
needs can be reduced by saving for each k, only those states having relatively
large values of V_k(s). This can be achieved by first computing V_k(max) = max_s V_k(s) and then eliminating all states s having V_k(s) < A V_k(max), where A is
an appropriately chosen threshold. This makes the search sub-optimal, but in
practice there is little or no degradation in performance if the threshold A is
chosen with care.
This type of search can be used quite successfully on artificial tasks such as the
Raleigh language task, where the number of states is of the order of 10^5.

In addition to its application to suboptimal decoding, the Viterbi algorithm can be used to align an AP output string y with a known word string w, by
determining the most likely path t which produces y when w is uttered. The path t
specifies a sequence of phones which the algorithm puts into correspondence with
the symbols forming the sequence y. Inspection of this alignment allows the
experimenter to judge the adequacy of his models and provides an intuitive check
on the performance of the AP.

6. Stack linguistic decoding

In the previous section we presented a decoding procedure which finds the most likely complete path t for a given AP output string y. This decoding method
is computationally feasible only if the state space is fairly small, as is the case in
most artificial tasks. However, in the Laser task (described in Section 10), the
number of states is of the order of 10^{11}, which makes the Viterbi search
unattractive. Furthermore, the procedure is suboptimal because the word string
corresponding to the most likely path t may not be the most likely word string. In
this section we present a graph-search decoding method which attempts to find
the most likely word string. This method can be used with large state spaces.
Search methods which attempt to find optimal paths through graphs have been
used extensively in information theory [14] and in artificial intelligence [19]. Since
we are interested in finding the most likely word string, the appropriate graph to
search is the word graph generated by the language model. When a complete
search of the language model graph is computationally impractical, some heuristic
must be used for reducing the computation. Here we describe one specific
heuristic method that has been used successfully. To reduce the amount of
computation, a left-to-right search starting at the initial state and exploring
successively longer paths can be carried out. To carry out this kind of search we
need to define a likelihood function which allows us to compare incomplete paths
of varying length. An obvious choice may seem to be the probability of uttering
the (incomplete) sequence w and producing some initial subsequence of the
observed string y, i.e.

$$\sum_{i=0}^{n} P(w, y_1^i) = P(w) \sum_{i=0}^{n} P(y_1^i \mid w). \tag{6.1}$$

The first term on the right-hand side is the a priori probability of the word
sequence w. The second term, referred to as the acoustic match, is the sum over i
of the probability that w produces an initial substring y_1^i of the AP output string
y. Unfortunately, the value of (6.1) will decrease with lengthening word sequences
w, making it unsuitable for comparing incomplete paths of different lengths.
Some form of normalization to account for different path lengths is needed. As in
the Fano metric used for sequential decoding [14], it is advantageous to have a
likelihood function which increases slowly along the most likely path, and

decreases along other paths. This can be accomplished by a likelihood function of the form

$$\Lambda(w) = \sum_{i=0}^{n} P(w, y_1^i)\, \alpha^{n-i} \sum_{w'} P(w', y_{i+1}^n \mid w, y_1^i). \tag{6.2}$$

If we consider P(w, y_1^i) to be the cost associated with accounting for the initial part of the AP string y_1^i by the word string w, then Σ_{w′} P(w′, y_{i+1}^n | w, y_1^i) represents the expected cost of accounting for the remainder of the AP string y_{i+1}^n with some continuation w′ of w. The normalizing factor α can be varied to control the average rate of growth of Λ(w) along the most likely path. In practice, α can be chosen by trial and error.
An accurate estimate of Σ_{w′} P(w′, y_{i+1}^n | w, y_1^i) is of course impossible in practice, but we can approximate it by ignoring the dependence on w. An estimate of E(y_{i+1}^n | y_1^i), the average value of P(w′, y_{i+1}^n | y_1^i), can be obtained from training data. In practice, a Markov-type approximation of the form

$$E(y_{i+1}^n \mid y_1^i) = \prod_{j=i+1}^{n} E(y_j \mid y_{j-k}^{j-1}) \tag{6.3}$$

can be used. Using k = 1 is usually adequate.
The likelihood used for incomplete paths during the search is then given by

$$\Lambda(w) = P(w) \sum_{i=0}^{n} P(y_1^i \mid w)\, \alpha^{n-i} E(y_{i+1}^n \mid y_1^i). \tag{6.4}$$

For complete paths the likelihood is

$$\Lambda(w) = P(w) P(y_1^n \mid w), \tag{6.5}$$

i.e., the probability that w was uttered and produced the complete output string y_1^n.
The likelihood of a successor path w_1^k = w_1^{k-1} w_k can be computed incrementally from the likelihood of its immediate predecessor w_1^{k-1}. The a priori probability P(w_1^k) is easily obtained from the language model using the recursion

$$P(w_1^k) = P(w_1^{k-1})\, P(w_k \mid w_1^{k-1}). \tag{6.6}$$

The acoustic match values P(y_1^i | w_1^k) can be computed incrementally if the values P(y_1^i | w_1^{k-1}) have been saved [1].
A search based on this likelihood function is easily implemented by having a stack in which entries of the form (w, Λ(w)) are stored. The stack, ordered by decreasing values of Λ(w), initially contains a single entry corresponding to the initial state of the language model. The term stack as used here refers to an ordered list in which entries can be inserted at any position. At each iteration of
the search, the top stack entry is examined. If it is an incomplete path, the
extensions of this path are evaluated and inserted in the stack. If the top path is a
complete path, the search terminates with the path at the top of the stack being
the decoded path.
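A skeletal version of this control structure is sketched below. A heap plays the role of the ordered stack, and the language model extension, the likelihood Λ(w) of eqs. (6.4)-(6.5), and the completeness test are assumed to be supplied as callables; none of the IBM system's actual routines are implied.

```python
import heapq

def stack_decode(initial_w, extend, likelihood, is_complete):
    """Best-first search over word strings (w is represented as a tuple of words).

    extend(w)      -> iterable of one-word extensions of w           (assumed callable)
    likelihood(w)  -> normalized likelihood Lambda(w), eq. (6.4)/(6.5) (assumed callable)
    is_complete(w) -> True when w accounts for the full AP output     (assumed callable)
    """
    # heapq is a min-heap, so store negative likelihoods to pop the best entry first.
    stack = [(-likelihood(initial_w), initial_w)]
    while stack:
        neg_like, w = heapq.heappop(stack)
        if is_complete(w):
            return w                      # top entry is a complete path: decoding ends
        for w_next in extend(w):
            heapq.heappush(stack, (-likelihood(w_next), w_next))
    return None
```

In the full system the acoustic match inside the likelihood is itself computed incrementally, as noted above.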
Since the search is not exhaustive, it is possible that the decoded sentence is not
the most likely one. A poorly articulated word resulting in a poor acoustic match,
or the occurrence of a word with low a priori probability can cause the local
likelihood of the most likely path to fall, which may then result in the path being
prematurely abandoned. In particular, short function words like the, a, and of,
are often poorly articulated, causing the likelihood to fall. At each iteration, all
paths having likelihood within a threshold A of the maximum likelihood path in
the stack are extended. The probability of prematurely abandoning the most
likely path depends strongly on the choice of A which controls the width of the
search. Smaller values of A will decrease the amount of search at the expense of
having a higher probability of not finding the most likely path. In practice, A can
be adjusted by trial and error to give a satisfactory balance between recognition
accuracy and computation time. More complicated likelihood functions and
extension strategies have also been used but they are beyond the scope of this
paper.

7. Automatic estimation of Markov source parameters from data

Let P_i(t, b_1^m) be the joint probability that b_1^m is observed at the output of a filtered Markov source and that b_i spans t (see Section 4.1). The count

$$c(t, b_1^m) = \sum_{i=1}^{m} P_i(t, b_1^m) / P(b_1^m) \tag{7.1}$$

is the Bayes a posteriori estimate of the number of times the transition t is used when the string b_1^m is produced. If the counts are normalized so that the total count for transitions from a given state is 1, then it is reasonable to expect that the resulting relative frequency

$$f_s(t, b_1^m) = \frac{c(t, b_1^m)\,\delta(s, L(t))}{\sum_{t'} c(t', b_1^m)\,\delta(s, L(t'))} \tag{7.2}$$

will approach the transition probability q_s(t) as m increases.


This suggests the following iterative procedure for obtaining estimates of q_s(t).
Step 1. Make initial guesses q_s^0(t).
Step 2. Set j = 0.
Step 3. Compute P_i(t, b_1^m) for all i and t using q_s^j(t).
Step 4. Compute f_s(t, b_1^m) and obtain new estimates q_s^{j+1}(t) = f_s(t, b_1^m).
Step 5. Set j = j + 1.
Step 6. Repeat from Step 3.

To apply this procedure we need a simple method for computing P_i(t, b_1^m). Now P_i(t, b_1^m) is just the probability that a string of transitions ending in L(t) will produce the observed sequence b_1^{i-1}, times the probability that t will be taken once L(t) is reached, times the probability that a string of transitions starting with R(t) will produce the remainder of the observed sequence. If A(t) = φ, then the remainder of the observed sequence is b_i^m; if A(t) ≠ φ, then, of course, A(t) = b_i and the remainder of the observed sequence is b_{i+1}^m. Thus if α_i(s) denotes the probability of producing the observed sequence b_1^i by a sequence of transitions ending in the state s, and β_i(s) denotes the probability of producing the observed sequence b_i^m by a string of transitions starting from the state s, then

$$P_i(t, b_1^m) = \begin{cases} \alpha_{i-1}(L(t))\, q_{L(t)}(t)\, \beta_i(R(t)) & \text{if } A(t) = \phi,\\ \alpha_{i-1}(L(t))\, q_{L(t)}(t)\, \beta_{i+1}(R(t)) & \text{if } A(t) = b_i. \end{cases} \tag{7.3}$$
The probabilities α_i(s) satisfy the equations [15]

$$\alpha_0(s) = \delta(s, s_I) + \sum_{t} \alpha_0(L(t))\, \gamma(t, s, \phi), \tag{7.4a}$$

$$\alpha_i(s) = \sum_{t} \alpha_{i-1}(L(t))\, \gamma(t, s, b_i) + \sum_{t} \alpha_i(L(t))\, \gamma(t, s, \phi), \qquad i \ge 1, \tag{7.4b}$$

where

$$\gamma(t, s, a) = q_{L(t)}(t)\, \delta(R(t), s)\, \delta(A(t), a). \tag{7.5}$$

As with the Viterbi algorithm described in Section 5, the absence of null circuits guarantees that the states can be ordered so that α_i(s_j) may be determined from α_{i-1}(s) and α_i(s_k), k < j.
The probabilities β_i(s) satisfy the equations

$$\beta_{m+1}(s_F) = 1, \tag{7.6a}$$

$$\beta_i(s) = \sum_{t} \beta_i(R(t))\, \xi(t, s, \phi) + \sum_{t} \beta_{i+1}(R(t))\, \xi(t, s, b_i), \qquad i \le m,\ s \ne s_F, \tag{7.6b}$$

where

$$\xi(t, s, a) = q_{L(t)}(t)\, \delta(L(t), s)\, \delta(A(t), a). \tag{7.7}$$
Step 3 of the iterative procedure above then consists of computing α_i in a forward pass over the data, β_i in a backward pass over the data, and finally P_i(t, b_1^m) from (7.3). We refer to the iterative procedure together with the method described for computing P_i(t, b_1^m) as the forward-backward algorithm.
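For a source without null transitions the forward-backward computation collapses to the familiar two-pass recursion sketched below; the routine returns the counts c(t, b_1^m) of eq. (7.1), from which Step 4 re-estimates the transition probabilities. The no-null-transition restriction is a simplification relative to the text, and the data representation is invented.

```python
def forward_backward_counts(transitions, s_initial, s_final, observed):
    """One pass of expected-count computation over a single observed string,
    for a source with no null transitions (a simplifying assumption).

    transitions: list of (left, symbol, right, prob) tuples.
    Returns a dict mapping each transition to its expected count, eq. (7.1).
    """
    m = len(observed)
    alpha = [dict() for _ in range(m + 1)]      # alpha[i][s] = P(b_1..b_i, end in s)
    alpha[0][s_initial] = 1.0
    for i in range(1, m + 1):
        for (l, a, r, q) in transitions:
            if a == observed[i - 1] and l in alpha[i - 1]:
                alpha[i][r] = alpha[i].get(r, 0.0) + alpha[i - 1][l] * q
    beta = [dict() for _ in range(m + 2)]       # beta[i][s] = P(b_i..b_m | start in s)
    beta[m + 1][s_final] = 1.0
    for i in range(m, 0, -1):
        for (l, a, r, q) in transitions:
            if a == observed[i - 1] and r in beta[i + 1]:
                beta[i][l] = beta[i].get(l, 0.0) + q * beta[i + 1][r]
    total = alpha[m].get(s_final, 0.0)          # P(b_1^m)
    if total == 0.0:
        return {}
    counts = {}
    for i in range(1, m + 1):
        for t in transitions:
            l, a, r, q = t
            if a == observed[i - 1] and l in alpha[i - 1] and r in beta[i + 1]:
                # P_i(t, b_1^m) / P(b_1^m), summed over i, gives c(t, b_1^m)
                counts[t] = counts.get(t, 0.0) + alpha[i - 1][l] * q * beta[i + 1][r] / total
    return counts
```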

The probability P(b_1^m) of the observed sequence b_1^m is a function of the probabilities q_s(t). To display this dependence explicitly, we write P(b_1^m, q_s(t)). L. E. Baum [10] has proven that P(b_1^m, q_s^{j+1}(t)) ≥ P(b_1^m, q_s^j(t)), with equality only if q_s^j(t) is a stationary point (extremum or inflexion point) of P(b_1^m, ·). This result also holds if the transition distributions of some of the states are known and hence held fixed, or if some of the states are tied 3 to one another, thereby reducing the number of independent transition distributions.
When applied to a Markov source language model based on N-grams as described in Section 4, the forward-backward algorithm reduces simply to counting the number of times K(w | w_1^{N-1}) that w follows the sequence w_1^{N-1}, and setting

$$q_{w_1^{N-1}}(w) = K(w \mid w_1^{N-1}) \Big/ \sum_{w'} K(w' \mid w_1^{N-1}). \tag{7.8}$$

This is equivalent to maximum likelihood estimation of the transition probabilities.
When applied to a Markov source model for the acoustic channel, the
forward-backward algorithm is more interesting. Let us first consider the word-
based channel model indicated in Fig. 8. A known text w_1^n is read by the speaker and processed by the acoustic processor to produce an output string y_1^m. The Markov source corresponding to the text is constructed from the subsources for the words with the assumption that states of the source which arise from the same subsource state are tied. The forward-backward algorithm then is used to estimate the transition probabilities of the subsources from the output string y_1^m. To obtain reliable estimates of the subsource transition probabilities, it is necessary that each word in the vocabulary occur sufficiently often in the text w_1^n. For
large vocabularies this may require an exorbitant amount of text.
The use of the phone-based model shown in Fig. 11 can overcome this problem.
The Markov source for the text is constructed from phonetic and acoustic
subsources as described in Section 4. States in the source arising from the same
acoustic subsource state are assumed to be tied. In addition, states from different
phonetic subsources are assumed to be tied if transitions leaving the states result
from the same phonological rules. With these assumptions the training text can be
considerably shorter since it needs only include sufficiently many instances of
each phone and each phonetic rule.

8. Parameter estimation from insufficient data

It is often the case in practice that the data available are insufficient for a
reliable determination of all of the parameters of a Markov model. For example,

3 For definition of tying, see Section 4.1. For details of the forward-backward algorithm extended to machines with tied states, see [15].

the trigram model for the Laser Patent Text corpus [4] used at IBM Research is
based on 1.5 million words. Trigrams which do not occur among these 1.5 million
words are assigned zero probability by maximum likelihood estimation, a degen-
erate case of the forward-backward algorithm. Even though each of these
trigrams is very improbable, there are so many of them that they constitute 23%
of the trigrams present in new samples of text. In other words, after looking at 1.5
million trigrams the probability that the next one seen will never have been seen
before is roughly 0.23. The forward-backward algorithm provides an adequate
probabilistic characterization of the training data but the characterization may be
poor for new data. A method for handling this problem, presented in detail in
[15], is discussed in this section.
Consider a Markov source model the parameters of which are to be estimated from data b_1^m. We assume that b_1^m is insufficient for the reliable estimation of all of the parameters.
Let q̂_s(t) be forward-backward estimates of the transition probabilities based on b_1^m and let *q̂_s(t) be the corresponding estimates obtained when certain of the states are assumed to be tied. Where the estimates q̂_s(t) are unreliable, we would like to fall back to the more reliably estimated *q̂_s(t), but where q̂_s(t) is reliable we would like to use it directly.
A convenient way to achieve this is to choose as final estimates of q_s(t) a linear combination of q̂_s(t) and *q̂_s(t). Thus we let q̃_s(t) be given by

$$\tilde{q}_s(t) = \lambda_s\, \hat{q}_s(t) + (1 - \lambda_s)\, {}^{*}\hat{q}_s(t) \tag{8.1}$$

with λ_s chosen close to 1 when q̂_s(t) is reliable and close to zero when it is not.
Fig. 12(a) shows the part of the transition structure of the Markov source related to the state s. Eq. (8.1) can be interpreted in terms of the associated Markov source shown in Fig. 12(b), in which each state is replaced by three states.

Fig. 12. (a) Part of transition structure of a Markov source; (b) the corresponding part of an associated Markov source.

In Fig. 12(b), s̄ corresponds directly to s in Fig. 12(a). The null transitions from s̄ to s and s* have transition probabilities equal to λ_s and 1 − λ_s, respectively. The transitions out of s have probabilities q̂_s(t) while those out of s* have probabilities *q̂_s(t). The structure of the associated Markov source is completely determined by the structure of the original Markov source and by the tyings assumed for
obtaining more reliable parameter estimates.
The interpretation of (8.1) as an associated Markov source immediately suggests that the parameters λ_s be determined by the forward-backward (F-B) algorithm. However, since the λ_s parameters were introduced to predict as yet unseen data, rather than to account for the training data b_1^m, the F-B algorithm must be modified. We wish to extract the λ values from data that was not used to determine the distributions q̂_s(t) and *q̂_s(t) (see (8.1)). Since presumably we have only b_1^m at our disposal, we will proceed by the deleted estimation method. We shall divide b_1^m into n blocks, and for i = 1, ..., n estimate λ_s from the ith block while using q̂_s(t) and *q̂_s(t) estimates derived from the remaining blocks.
Since the λ_s values should depend on the reliability of the estimate q̂_s(t), it is natural to associate them with the estimated relative frequency of occurrence of the state s. We thus decide on k relative frequency ranges and aim to determine corresponding values λ(1), ..., λ(k). Then λ_s = λ(i) if the relative frequency of s was estimated to fall within the ith range.
We partition the state space 𝒮 into subsets of tied states 𝒮_1, 𝒮_2, ..., 𝒮_r and determine the transition correspondence functions T_{s,s′} for all pairs of tied states s, s′. We recall from Section 4 that then *q_s(t) = *q_{s′}(T_{s,s′}(t)) for all pairs s, s′ ∈ 𝒮_i, i = 1, ..., r. If L(t) ∈ 𝒮_i, then 𝒯(t) = {t″ | t″ = T_{L(t),s′}(t), s′ ∈ 𝒮_i} is the set of transitions that are tied to t. Since T_{L(t),L(t)}(t) = t, t ∈ 𝒯(t).
We divide the data b_1^m into n blocks of length l (m = nl). We run the F-B algorithm in the ordinary way, but on the last iteration we establish separate counters

$$c_j(t, b_1^m) = \sum_{i=1}^{(j-1)l} P_i(t, b_1^m) + \sum_{i=jl+1}^{m} P_i(t, b_1^m), \qquad j = 1, 2, \ldots, n, \tag{8.2}$$

for each deleted block of data not contributing to the counter. The above values will give rise to detailed distributions

$$q_s(t, j) = \Big( c_j(t, b_1^m)\,\delta(s, L(t)) \Big) \Big/ \Big( \sum_{t'} c_j(t', b_1^m)\,\delta(s, L(t')) \Big) \tag{8.3}$$

and to tied distributions

$${}^{*}q_s(t, j) = \Big( \delta(s, L(t)) \sum_{t' \in \mathcal{T}(t)} c_j(t', b_1^m) \Big) \Big( \sum_{t'} \delta(s, L(t')) \sum_{t'' \in \mathcal{T}(t')} c_j(t'', b_1^m) \Big)^{-1}. \tag{8.4}$$

Note that q_s(t, j) and *q_s(t, j) do not depend directly on the output data

belonging to the jth block. Thus the data in the jth block can be considered new in relation to these probabilities.
We now run the F-B algorithm on data b_1^m to determine the λ values based on n associated Markov sources which have fixed distributions over transitions leaving the states s and s*. These λ values are obtained from estimates of probabilities of transitions leaving the states s̄ of the associated Markov source (see Fig. 12(b)). Only k counter pairs, pertaining to the values λ(i) and 1 − λ(i) being estimated, are established. When running on the data of the jth block, the jth associated Markov source is used, based on the probabilities q_s(t, j) and *q_s(t, j). The values λ_s used in the jth block are chosen by computing the frequency estimates

$$q(s, j) = \sum_{t} c_j(t, b_1^m)\,\delta(s, L(t)) \Big/ \sum_{t'} c_j(t', b_1^m) \tag{8.5}$$

and setting λ_s = λ(i) if q(s, j) belonged to the ith frequency range. Also, the λ_s counts estimated from the jth block are then added to the contents of the ith counter pair.
After λ values have been computed, new test data is predicted using an associated Markov source based on probabilities

$$q_s(t) = \Big( \delta(s, L(t)) \sum_{j=1}^{n} c_j(t, b_1^m) \Big) \Big( \sum_{t'} \delta(s, L(t')) \sum_{j=1}^{n} c_j(t', b_1^m) \Big)^{-1}, \tag{8.6}$$

$${}^{*}q_s(t) = \Big( \delta(s, L(t)) \sum_{t' \in \mathcal{T}(t)} \sum_{j=1}^{n} c_j(t', b_1^m) \Big) \Big( \sum_{t'} \delta(s, L(t')) \sum_{t'' \in \mathcal{T}(t')} \sum_{j=1}^{n} c_j(t'', b_1^m) \Big)^{-1}, \tag{8.7}$$

and λ_s values chosen from the derived set λ(1), ..., λ(k), depending on the range within which the estimate

$$q(s) = \sum_{t} \delta(s, L(t)) \sum_{j=1}^{n} c_j(t, b_1^m) \Big/ \sum_{t'} \sum_{j=1}^{n} c_j(t', b_1^m) \tag{8.8}$$

falls.
This approach to modeling data generation is called deleted interpolation. Several variations are possible, some of which are described in [15]. In particular, it is possible to have v different tying partitions of the state space corresponding to transition distributions ⁽ⁱ⁾q_s(t), i = 1, ..., v, and to obtain the final estimates by the formula

$$\tilde{q}_s(t) = \sum_{i=1}^{v} \lambda_i(s)\, {}^{(i)}q_s(t) \tag{8.9}$$

with λ_i(s) values determined by the forward-backward algorithm.
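As a concrete, much reduced instance of eq. (8.1), the sketch below interpolates a detailed and a tied (coarser) distribution with a single weight λ estimated on held-out data by an EM iteration, which is what the forward-backward computation on the associated source amounts to in this one-parameter case. The two distributions and the held-out sample are invented.

```python
def estimate_lambda(detailed, tied, held_out, iterations=20):
    """Estimate a single interpolation weight lam for
    q(x) = lam * detailed(x) + (1 - lam) * tied(x)     (cf. eq. (8.1))
    by EM on held-out data.  `detailed` and `tied` are dicts of probabilities."""
    lam = 0.5
    for _ in range(iterations):
        expected = 0.0
        for x in held_out:
            p_det = lam * detailed.get(x, 0.0)
            p_tied = (1.0 - lam) * tied.get(x, 0.0)
            if p_det + p_tied > 0.0:
                expected += p_det / (p_det + p_tied)   # posterior weight of the detailed branch
        lam = expected / len(held_out)
    return lam

def interpolated(x, lam, detailed, tied):
    return lam * detailed.get(x, 0.0) + (1.0 - lam) * tied.get(x, 0.0)

# Invented distributions over a tiny vocabulary.
det = {"laser": 0.7, "beam": 0.3}
bkg = {"laser": 0.4, "beam": 0.4, "light": 0.2}
lam = estimate_lambda(det, bkg, held_out=["laser", "light", "beam"])
print(lam, interpolated("light", lam, det, bkg))
```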



We illustrate this deleted interpolation algorithm with an application to the trigram language model for the laser patent text corpus used at IBM.
Let π(w) be the syntactic part of speech (e.g. noun, verb, etc.) assigned to the word w. Let φ_i, i = 1, ..., 4 be functions classifying the language model states w_1 w_2 as follows:

$$\begin{aligned}
\phi_1(w_1 w_2) &= \{(w_1 w_2)\},\\
\phi_2(w_1 w_2) &= \{(w\, w_2) \mid \pi(w) = \pi(w_1)\},\\
\phi_3(w_1 w_2) &= \{(w\, w') \mid \pi(w) = \pi(w_1),\ \pi(w') = \pi(w_2)\},\\
\phi_4(w_1 w_2) &= \{\text{all pairs of words}\}.
\end{aligned} \tag{8.10}$$

Let K(φ_i(w_1 w_2)) be the number of times that members of the set φ_i(w_1 w_2) occur in the training text. Finally, partition the state space into sets

$$\phi_5(w_1 w_2) = \{w w' \mid K(\phi_j(w w')) = K(\phi_j(w_1 w_2)) = 1,\ j = 1, 2, \ldots, i-1,\ K(\phi_i(w w')) = K(\phi_i(w_1 w_2)) > 1\} \tag{8.11}$$

which will be used to tie the associated states w_1 w_2 according to the frequency of word pair occurrence. Note that if K(φ_1(w_1 w_2)) ≥ 2, then φ_5(w_1 w_2) is simply the set of all word pairs that occurred in the corpus exactly as many times as w_1 w_2 did. A different λ distribution will correspond to each different set (8.11). The language model transition probabilities are given by the formula

$$P(w_3 \mid w_1 w_2) = \sum_{i=1}^{4} \lambda_i(\phi_5(w_1 w_2))\, P_i(w_3 \mid \phi_i(w_1 w_2)). \tag{8.12}$$

Fig. 13 illustrates this graphically. We use deleted interpolation also in estimating the probabilities associated with the acoustic channel model.


Fig. 13. A section of the interpolated trigram language model corresponding to the state determined by the word pair w_1, w_2.

9. A measure of difficulty for finite state recognition tasks

Research in continuous speech recognition has led to the development of a number of artificial tasks. In order to compare the performance of different
systems on sentences from different tasks, it is necessary to have a measure of the
intrinsic difficulty of a task. Although vocabulary size is almost always mentioned
in the description of an artificial task, by itself it is practically useless as a
measure of difficulty. In this section we describe perplexity, a measure of
difficulty based on well-established information theoretic principles. The experi-
mental results described in the next section show a clear correlation between
increasing perplexity and increasing error rate.
Perplexity is defined in terms of the information theoretic concept of entropy.
The tasks used in speech recognition can be adequately modeled as unifilar (see
Section 4.1) Markov sources. Let P(w | s) be the probability that word w will be produced next when the current state is s. The entropy H_s(w) associated with state s is

$$H_s(w) = -\sum_{w} P(w \mid s) \log_2 P(w \mid s). \tag{9.1}$$

The entropy, H(w), of the task is simply the average value of H_s(w). Thus if π(s) is the probability of being in state s during the production of a sentence, then

$$H(w) = \sum_{s} \pi(s) H_s(w). \tag{9.2}$$

The perplexity S(w) of the task is given in terms of its entropy H(w) by

$$S(w) = 2^{H(w)}. \tag{9.3}$$
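For a unifilar source with known state probabilities these three formulas are a few lines of code; the two-state example below uses invented numbers.

```python
import math

def perplexity(pi, word_probs):
    """Task perplexity S(w) = 2**H(w), eqs. (9.1)-(9.3).

    pi: dict mapping state -> probability of being in that state.
    word_probs: dict mapping state -> dict of P(w | s).
    """
    H = 0.0
    for s, p_s in pi.items():
        H_s = -sum(p * math.log2(p) for p in word_probs[s].values() if p > 0.0)
        H += p_s * H_s                      # eq. (9.2): average of the state entropies
    return 2.0 ** H                         # eq. (9.3)

# Invented two-state example: one state offers two equally likely words,
# the other offers four, and the two states are equally likely.
pi = {"s1": 0.5, "s2": 0.5}
wp = {"s1": {"a": 0.5, "b": 0.5}, "s2": {w: 0.25 for w in "cdef"}}
print(perplexity(pi, wp))                   # 2**1.5 ≈ 2.83 alternative words on average
```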

Often artificially constrained tasks specify the sentences possible without attaching probabilities to them. Although the task perplexity depends on the probabilities assigned to the sentences, Shannon [21] has shown that the maximum entropy achievable for a task with N possible sentences of average length l is (1/l) log_2 N. Hence the maximum perplexity is N^{1/l}. If all the sentences for the task could be arranged as a regular tree, the number of branches emanating from a node would be N^{1/l}. So, for artificially constrained tasks, perplexity can be
thought of as the average number of alternative words at each point. For the
Raleigh task of Fig. 7, the number of alternative words ranges from 1 to 24, and
the perplexity is 7.27.
For natural language tasks, some sentences are much more probable than
others, and so the maximum perplexity is not useful as a measure of difficulty.
However, the perplexity which can be computed from the probabilities of the
sentences remains a useful measure. Information theory shows that for a language with entropy H, we can ignore all but the most probable 2^{lH} strings of length l and still achieve any prescribed error rate.

The definition of perplexity makes no use of the phonetic character of the words in the vocabulary of the language. Two tasks may have the same perplexity
but one may have words that are substantially longer than the other, thereby
making recognition easier. This problem can be overcome by considering the
sentences of the task to be strings of phonemes rather than strings of words. We
can then compute the phoneme level perplexity of the two tasks and normalize
them to words of equal length. In this way the perplexity of the task with the
greater average word length will be lowered relative to that of the other task.
Some pairs of phones are more confusable than others. It is possible therefore
to have two tasks with the same phoneme level perplexity, one of which is much
easier to recognize than the other, simply because its words are acoustically
farther apart. We can take this into account by considering the joint probability
distribution P(w, y ) of word sequences w and acoustic sequences y and determin-
ing from it the conditional entropy H(w | y); y could be the output string from a particular acoustic processor or simply the time waveform itself. Unfortunately,
this is far too difficult to compute in practice.
Perplexity reflects the difficulty of recognition when a complete search can be
performed. The effect on the error rate of performing an incomplete search may
be more severe for one language than for another, even though they have the
same perplexity. However, as the results in the next section show, there is a clear
correlation between perplexity and error rate.

10. Experimental results

The results given in this section, obtained before 1980, are described in detail in
[2-6].
Table 1 shows the effect of training set size on recognition error rate. 200
sentences from the Raleigh language (100 training and 100 test) were recognized
using a segmenting acoustic processor and a stack algorithm decoder. We initially
estimated the acoustic channel model parameters by examining samples of
acoustic processor output. These parameter values were then refined by applying
the forward-backward algorithm to training sets of increasing size.

Table 1
Effect of training set size on the error rate

                        % of sentences decoded incorrectly
Training set size       Test        Training
0                       80%         --
200                     23%         12%
400                     20%         13%
600                     15%         16%
800                     18%         16%
1070                    17%         14%

Table 2
Effect of weak acoustic channel models

Model type                          % of sentences decoded incorrectly
Complete acoustic channel model     17%
Single pronunciation                25%
Spelling-based pronunciation        57%

While for small training set sizes performance on training sentences should be
substantially better than on test sentences, for sufficiently large training set sizes
performance on training and test sentences should be about equal. By this
criterion a training set size of 600 sentences is adequate for determining the
parameters of this acoustic channel model. Notice that even a training set size as
small as 200 sentences leads to a substantial reduction in error rate as compared
to decoding with the initially estimated channel model parameters.
The power of automatic training is evident from Table 1 in the dramatic
decrease in error rate resulting from training even with a small amount of data.
The results in Table 2 further demonstrate the power of automatic training.
In Table 2, three versions of the acoustic channel model are used, each weaker
than the previous one. The 'complete acoustic channel model' result corresponds
to the last line of Table 1. The acoustic channel model in this case is built up from
phonetic subsources and acoustic subsources as described in Section 4. The
phonetic subsources produce many different strings for each word reflecting
phonological modifications due to rate of articulation, dialect, etc. The 'single
pronunciation' result is obtained with an acoustic channel model in which the
phonetic subsources allow only a single pronunciation for each word. Finally, the
'spelling-based pronunciation' result is obtained with an acoustic channel model
in which the single pronunciation allowed by the phonetic subsources is based
directly on the letter-by-letter spelling of the word. This leads to absurd pronunci-
ation models for some of the words. For example, through is modeled as if the
final g and h were pronounced. The trained parameters for the acoustic channel
with spelling-based pronunciations show that letters are often deleted by the
acoustic processor reflecting the large number of silent letters in English spelling.
Although the results obtained in this way are much worse than those obtained
with the other two channel models, they are still considerably better than the

Table 3
Decoding results for several different acoustic processors with the Raleigh language

                        Error rate
Acoustic processor      Sentence    Word
MAP                     27%         3.6%
CSAP                    2%          0.2%
TRIVIAL                 2%          0.2%

results obtained with the complete channel model using parameters estimated by
people.
Table 3 shows results on the Raleigh Language for several different acoustic
processors. In each case the same set of 100 sentences was decoded using the
stack decoding algorithm. MAP is a segmenting acoustic processor, while CSAP
and TRIVIAL are non-segmenting acoustic processors. Prototypes for CSAP were
selected by hand from an examination of speech data. Those for TRIVIAL were
obtained automatically from a Viterbi alignment of about one hour of speech
data.
Table 4 summarizes the performance of the stack decoding algorithm with a
segmenting and a time-synchronous acoustic processor on 3 tasks of varying
perplexity. The Raleigh task has been described earlier in the paper. The Laser
task is a natural language task used at IBM. It consists of sentences from the text
of patents in laser technology. To limit the vocabulary, only sentences made
entirely from the 1000 most frequent words in the complete laser corpus are
considered. The CMU-AIX05 task [20] is the task used by Carnegie-Mellon
University in their Speech Understanding System to meet the ARPA specifica-
tions [18]. All these results were obtained with sentences spoken by a single talker
in a sound-treated room. Approximately 1000 sentences were used for estimating
the parameters of the acoustic channel model in each of the experiments.
In Table 4 we can see a clear correlation between perplexity and error rate. The
CMU-AIX05 task has the largest vocabulary but the smallest perplexity. Note
that for each of the tasks, the performance of the time-synchronous acoustic
processor is considerably better than that of the segmenting acoustic processor.

Table 4
Recognition results for several tasks of varying perplexity

              Vocabulary                  Word error rate
Task          size         Perplexity     Segmenting AP    Time-synchronous AP
CMU-AIX05     1011         4.53           0.8%             0.1%
Raleigh       250          7.27           3.1%             0.6%
Laser         1000         24.13          33.1%            8.9%

Acknowledgment

We would like to acknowledge the contributions of the following present and past members of the Continuous Speech Recognition Group at the IBM Thomas
J. Watson Research Center: James Baker, Janet Baker, Raimo Bakis, Paul Cohen,
Alan Cole, Rex Dixon, Burn Lewis, Eva Muckstein and Harvey Silverman.

References

[1] Bahl, L. R. and Jelinek, F. (1975). Decoding for channels with insertions, deletions and
substitutions with applications to speech recognition. IEEE Trans. Inform. Theory 21 (4)
404-411.
[2] Bahl, L. R., Baker, J. K., Cohen, P. S., Dixon, N. R., Jelinek, F., Mercer, R. L. and Silverman,
H. F. (1976). Preliminary results on the performance of a system for the automatic recognition
of continuous speech. Proc. IEEE Internat. Conf. on Acoustics, Speech and Signal Processing,
425-429.
[3] Bahl, L. R., Baker, J. K., Cohen, P. S., Cole, A. G., Jelinek, F., Lewis, B. L. and Mercer, R. L.
(1978). Automatic recognition of continuously spoken sentences from a finite state grammar.
Proc. IEEE Internat. Conf. on Acoustics, Speech and Signal Processing, 418-421.
[4] Bahl, L. R., Baker, J. K., Cohen, P. S., Jelinek, F., Lewis, B. L. and Mercer, R. L. (1978).
Recognition of a continuously read natural corpus. Proc. IEEE Internat. Conf. on Acoustics,
Speech and Signal Processing, 422-424.
[5] Bahl, L. R., Bakis, R., Cohen, P. S., Cole, A. G., Jelinek, F., Lewis, B. L. and Mercer, R. L.
(1979). Recognition results with several acoustic processors. Proc. IEEE Internat. Conf. on
Acoustics, Speech and Signal Processing, 249-251.
[6] Bahl, L. R., Bakis, R., Cohen, P. S., Cole, A. G., Jelinek, F., Lewis, B. L. and Mercer, R. L.
(1980). Further results on the recognition of a continuously read natural corpus. Proc. IEEE
Internat. Conf. on Acoustics, Speech and Signal Processing, 872-875.
[7] Baker, J. K. (1975). The DRAGON system--an overview. IEEE Trans. on Acoustics, Speech,
and Signal Processing 23 (1) 24-29.
[8] Baker, J. M. (1979). Performance statistics of the HEAR acoustic processor. Proc. of the IEEE
Internat. Conf. on Acoustics, Speech and Signal Processing, 262-265.
[9] Bakis, R. (1976). Continuous speech recognition via centisecond acoustic states. 91st Meeting of
the Acoustical Society of America. Washington, DC. (IBM Res. Rept. RC-5971, IBM Research
Center, Yorktown Heights, NY.)
[10] Baum, L. E. (1972). An inequality and associated maximization technique in statistical estima-
tion of probabilistic functions of Markov processes. Inequalities 3, 1-8.
[11] Bellman R. E. (1957). Dynamic Programming. Princeton University Press, Princeton, NJ.
[12] Cohen, P. S. and Mercer, R. L. (1975). The phonological component of an automatic speech-
recognition system. In: D. R. Reddy, ed., Speech Recognition, 275-320. Academic Press, New
York.
[13] Forney, G. D., Jr. (1973). The Viterbi algorithm. Proc. IEEE 61, 268-278.
[14] Jelinek, F. (1969). A fast sequential decoding algorithm using a stack. IBM J. Res. and
Development 13, 675-685.
[15] Jelinek, F. and Mercer, R. L. (1980). Interpolated estimation of Markov source parameters from
sparse data. Proc. workshop on Pattern Recognition in Practice. North-Holland, Amsterdam.
[16] Lowerre, B. T. (1976). The Harpy speech recognition system. Ph.D. Dissertation, Dept. of
Comput. Sci., Carnegie-Mellon University, Pittsburgh, PA.
[17] Lyons, J. (1969). Introduction to Theoretical Linguistics. Cambridge University Press, Cambridge,
England.
[18] Newell, A., Barnett, J., Forgie, J. W., Green, C., Klatt, D., Licklider, J. C. R., Munson, J.,
Reddy, D. R. and Woods, W. A. (1973). Speech Understanding Systems: Final Report of a Study
Group. North-Holland, Amsterdam.
[19] Nilsson, N. J. (1971). Problem-Solving Methods in Artificial Intelligence. McGraw-Hill, New
York.
[20] Reddy, D. R. et al. (1977). Speech understanding systems final report. Comput. Sci. Dept.,
Carnegie-Mellon University, Pittsburgh, PA.
[21] Shannon, C. E. (1951). Prediction and entropy of printed English. Bell System Tech. J. 30,
50-64.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
© North-Holland Publishing Company (1982) 575-593

Applications of Pattern Recognition in Radar

Alan A. Grometstein and William H. Schoendorf

1. Introduction

In this chapter we discuss the application of pattern recognition (PR) techniques to radar observations. Radars have grown greatly in sophistication and diversity during the years since World War II, and exceptions exist to almost any general statement made about them; in this chapter we will describe processing using a generic radar, recognizing that few specific radars will fit the description in all particulars.
It is important to keep in mind that a radar is an instrument for gathering
information, on the basis of which a decision of some type will be made. For
example:

An airport-surveillance radar displays information about air traffic so that a Traffic Controller can allocate landing and take-off priorities and corridors, and can learn of collision-threatening situations early enough to take remedial action.
Highway police use a Doppler radar to estimate the speed of an approaching vehicle so that they can decide whether to flag it down for a traffic violation.
Fire-control radar on a warship allows the Gunnery Officer to decide when and how to fire at the enemy.

2. A radar as an information-gathering device

A radar periodically transmits a pulse of electromagnetic energy in a direction determined by its antenna. The pulse, propagating at the speed of light, c, encounters a target and is reflected back to the radar, which detects its arrival. If the echo arrives at time t after transmission, the target is at a range R = ½ct. The direction of arrival of the echo may not coincide with the boresight of the


antenna: the difference can be sensed and the angular position of the target (say,
in azimuth and elevation relative to an oriented ground plane) estimated.
Special circuitry within a radar receiver can compare the radio frequency (r.f.) of the echo with that of the transmitted pulse. The difference frequency, f_d, can be ascribed to the component of the target velocity in the direction of the radar (i.e., to the target's Doppler speed) at the time of pulse reflection; thus, the radar can measure the instantaneous Doppler, D, of the target:

D = ½ λ f_d

where λ is the wavelength at which the radar operates.


The four quantities:
- range, R,
- elevation angle, El,
- azimuth angle, Az,
- Doppler, D,
are the metric parameters which a radar customarily measures on a pulse-by-pulse basis; they define the position of the target and a component of its velocity.
Related parameters, such as trajectory or lateral velocity, can be estimated from a
time-sequence of these basic metric quantities. Much work has been done on the
efficient estimation of such quantities (using, e.g., Kalman filtering), but we will
not dwell on them.
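To make the two relations above concrete, here is a small illustrative sketch in Python (the numerical values and helper names are my own, not taken from the chapter):

    # Convert a measured echo delay and difference frequency into range and
    # Doppler speed, using R = (1/2) c t and D = (1/2) lambda f_d.
    C = 3.0e8  # speed of light (m/s)

    def range_from_delay(t_echo_s):
        """Range in metres; the factor 1/2 accounts for the two-way path."""
        return 0.5 * C * t_echo_s

    def doppler_from_difference_frequency(f_d_hz, wavelength_m):
        """Doppler speed in m/s from the difference frequency f_d."""
        return 0.5 * wavelength_m * f_d_hz

    print(range_from_delay(1.0e-3))                        # about 150 km
    print(doppler_from_difference_frequency(2000.0, 0.1))  # 100 m/s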

3. Signature

When an echo returns from a target it is diminished in energy by a factor of R^4 over the two-way propagation path. Since R is known, this attenuation can be accounted for. More to the point, however, the echo has been distorted by its
accounted for. More to the point, however, the echo has been distorted by its
reflection in ways determined by the nature of the target. Information of this
type, related to the electromagnetic scattering characteristics of the target, is
known as signature information and is the subject of our further discussion.
The energy in the returned echo, P_e, can be used to calculate the radar cross-section (RCS) of the target through the relation:

RCS = K · R^4 · P_e

where K is a constant associated with the radar circuitry and the power in the transmitted pulse. The RCS of a target is a measure of its effective scattering area.
RCS is measured in terms of the ratio (energy returned in an echo) : (energy
density impinging on the target), and thus has the units of area. The projected
area presented by a target to a radar beam may be orders of magnitude smaller or
larger than its RCS: indeed, it is often a primary goal of radar observations to
estimate the physical size of a target from its electrical 'size'.

RCS is conventionally measured in units of square meters (m²); in practice, the RCS of a target varies widely with time (as the target changes its orientation in space, say, or under the influence of noise), and it is therefore common to express RCS, not in square meters, but in logarithmic fashion, in units of dBsm (decibels with respect to one square meter):

RCS_dBsm = 10 log_10(RCS / 1 m²)

Thus, an RCS of 100 m² is equivalent to one of 20 dBsm.


The RCS of an echo is proportional to the square of the electrical field
strength, A, received by the radar for that echo, and for some purposes it is
convenient to work with A (the amplitude) rather than with RCS. A, of course,
has the units of V/re.
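As a check on the dBsm convention, a two-line conversion (the helper names are my own) reproduces the figure quoted above:

    import math

    def rcs_to_dbsm(rcs_m2):
        """RCS in dBsm = 10 log10(RCS / 1 m^2)."""
        return 10.0 * math.log10(rcs_m2)

    def dbsm_to_rcs(dbsm):
        return 10.0 ** (dbsm / 10.0)

    print(rcs_to_dbsm(100.0))   # 20.0 dBsm
    print(dbsm_to_rcs(20.0))    # 100.0 m^2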

4. Coherence

As radars developed, more signature information became available from a single echo than the RCS. Improved stability of the master timing oscillator permits comparison of the phase of the echo with that of the transmitted pulse. (The total number of r.f. cycles between transmission and reception is not counted; rather, the relative phase, φ, of the echo is the quantity measured.) Nowadays, coherent radars routinely process relative phase as well as RCS (or amplitude) data. The relative phase carries information about the range of the target as well as about its electromagnetic scattering properties.
It is convenient to consider that a coherent radar receives in the pair, A and φ, a complex return. This return is symbolized as Ã, where the magnitude of Ã is A, and its phase is φ.

5. Polarization

The pulse transmitted by the radar has a polarization state characteristic of the
radar transmitter and antenna. Some of the energy in the echo reflected from a
target retains the transmitted polarization and some is converted to the orthogo-
nal polarization state. The depolarizing properties of the target are informative as
to its nature, and some radars are built which can separately receive and process
the polarized and depolarized components of the echo.

6. Frequency diversity

Some radars have the ability to change their r.f. during the transmission of a
single pulse, so that the frequency content of the pulse has a time-varying
structure. (The pulse, therefore, does not resemble a finite section of a pure
sinusoid.) If the frequency changes during the pulse in discrete jumps, the radar is
a frequency-jump radar, while if it changes smoothly, the radar is said to be of the
compressed-pulse type. In either case, the radar can be referred to as a wideband
radar.
Several reasons might lead to the incorporation of frequency diversity in a
radar; of particular interest to us is the fact that the way in which a target reflects
the different frequency components reveals something of its nature. If signature
information is to be processed in a wideband radar, the envelope of the returned
echo is sampled along its length, at intervals determined by the rapidity with
which the frequency changes. In this way, instead of collecting a single amplitude
and phase from the echo (i.e., a single complex return), such a radar collects a set
of complex returns from a pulse. This set, ordered along the pulse length, can be
thought of as the return vector for that echo.
The elaborations discussed (coherence, polarization diversity, frequency diversity) can be incorporated into a radar virtually independently of one another, although, for cost reasons, it is rare that all are found in a single instrument. In the extreme case, however, where all are present, the information recorded for a single echo takes the form of two amplitude vectors, one for each polarization channel, where each vector, as explained above, consists of the complex returns at a set of specific points along the length of the echo.
Efficient processing of such a complicated information set is important if the
complete information-gathering capacity of the radar is to be utilized.

7. Pulse sequences

It is rarely the case that a decision must be based on a single echo: ordinarily, a
sequence of echoes from a target can be collected before the decision must be
made, and this permits extraction of yet more information than would be
available in a single echo.
If the target is a constant factor, so that the only causes of variation in the
echoes are such extraneous influences as noise, changes in the propagation path,
etc., the sequence of echoes is frequently integrated, either coherently or incoher-
ently, with a view to improving the signal-to-noise ratio (SNR), so that a clearer
view can be had of the target echo itself.
If, however, the target is representable by a stationary process with a significant
frequency content, to integrate the pulses might destroy useful information. The
sequence might in such a case be treated as a finite, discrete time series--a
spectral analysis might be made, or some features extracted (say, peak values or
rate of zero-crossings) and processed.
In other cases, the target process cannot be treated as a stationary one: often a
change in the nature of the process is precisely what is being looked for, to trigger
a decision. In this case, the pulse sequence must be processed so as to detect the
change in the target characteristic, and integration may again be contraindicated.

8. Decisions and decision errors

As in any case where decisions are made on the basis of uncertain or incomplete information, erroneous decisions can be and will on occasion be made; the types and costs of such errors must be carefully balanced. Many texts, such as Van Trees (1968), Chernoff and Moses (1969) and Egan (1975), treat this aspect of the decision problem in detail.
In the simplest case a decision must be made as to which of two alternative
conditions prevails; we can often speak of one condition as the normal or safe
condition and the other as the dangerous or abnormal condition. A factor or score
is calculated by passing the radar observations through an algorithm (either
manually implemented or on a computer); the score is then contrasted with a
threshold. If the score exceeds the threshold, a decision is made that the abnormal
condition exists, otherwise, that the normal condition obtains.
In this situation two errors can arise:
(1) A decision is made that the abnormal condition prevails when, in fact, it
does not. This type of error is called a false alarm, or (a term borrowed from
medical practice) a false positive.
(2) Conversely, the abnormal condition may exist but not be recognized; this type of error is variously called a leakage or a miss, or a false negative.
As the threshold varies, the probability of a false alarm, P_f, and the probability of a leakage, P_l, change: the higher the threshold, the smaller is P_f and the larger P_l. The locus of the two errors as the threshold varies is known as the Operating Characteristic (OC) of the decision algorithm.
The operator will choose a value of threshold which gives a suitable balance between the expected cost of false alarms (= the cost of a false alarm times the probability of one occurring) and the expected cost of a leakage. The balance may be struck in different ways, of which three are common:
(1) Minimizing the total number of errors (this is equivalent to taking the cost of a false alarm as equal to that of a leakage).
(2) Minimizing the expected cost of errors (this takes into account the difference in cost between errors of the two types).
(3) Minimizing the occurrence of errors of one type while constraining errors of the other type not to exceed some acceptable level. (This is the Neyman-Pearson rule.)
These three ways of accommodating errors are similar in that each leads to a particular threshold level, taken from the OC (of course, the three levels are not the same in general).
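The following sketch (my own construction, not from the chapter) traces an empirical Operating Characteristic from scored examples and selects a threshold by the Neyman-Pearson rule, i.e. it keeps the false alarm probability below a ceiling while minimizing the leakage probability:

    import numpy as np

    def operating_characteristic(scores_normal, scores_abnormal, thresholds):
        """Return (P_f, P_l) at each threshold; 'abnormal' is declared when the score exceeds the threshold."""
        p_f = np.array([(scores_normal > t).mean() for t in thresholds])
        p_l = np.array([(scores_abnormal <= t).mean() for t in thresholds])
        return p_f, p_l

    def neyman_pearson_threshold(scores_normal, scores_abnormal, pf_ceiling=0.05):
        thresholds = np.unique(np.concatenate([scores_normal, scores_abnormal]))
        p_f, p_l = operating_characteristic(scores_normal, scores_abnormal, thresholds)
        admissible = p_f <= pf_ceiling          # the highest threshold always qualifies
        return thresholds[admissible][np.argmin(p_l[admissible])]

    rng = np.random.default_rng(0)
    normal_scores = rng.normal(0.0, 1.0, 1000)    # hypothetical scores under the normal condition
    abnormal_scores = rng.normal(2.0, 1.0, 1000)  # hypothetical scores under the abnormal condition
    print(neyman_pearson_threshold(normal_scores, abnormal_scores))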

9. Algorithm implementation

The decision algorithm may be automated completely, in part, or not at all. The choice is based on the speed at which decisions must be reached and implemented, on the complexity of the underlying calculations, and on the likelihood of

occurrence of a situation in which the operator ought to (and is capable of) intelligently overriding the algorithmic choice.
One choice of implementation is that in which a computerized algorithm
calculates the score and contrasts it with the threshold; the indicated course of
action is then presented to the operator who has the opportunity of overriding the
indicated course. If the operator does not override within a time limit, the
indicated course is then effectuated.
At times, especially when rapid decisions are required, the computerized algorithmic action is automatically implemented, without recourse to a human override.

10. Classifier design

After this discussion of radar characteristics, decision errors and algorithm implementation, we turn to issues concerning algorithm or classifier design. In
this section we discuss the design of classifiers for discriminating targets on the
basis of their radar signatures. A radar signature is a sequence of temporally
ordered returns from a target. Each return may consist of several scalar quantities
depending on the nature of the radar. For example, a dual-polarized, non-coher-
ent, narrow bandwidth radar would provide two quantities on each pulse: the
amplitudes in each of the two polarizations. Signatures from any target must be
treated as stochastic processes. A particular airplane viewed by a radar on several
approaches to an airport exhibits a different signature on each approach. These
differences are due to variations in its trajectory and dynamic motion, as well as
changes in the transmission medium, the radar's calibration and receiver noise,
etc. Thus, in order to properly characterize the radar signatures of a target we
must account for such unavoidable variability by dealing with the ensemble of
signatures representative of the conditions under which the target is viewed.
A sample radar signature is illustrated in Fig. 1. It consists of N pulses and can
be represented geometrically as a point in an N-dimensional space as shown in
the figure. Each coordinate axis is associated with the radar return at a given
time: the amplitude of the first return is plotted along axis 1, the amplitude of the
second return along axis 2, etc. If the signature has plural components
the dimensionality of the space is multiplied to account for the multiplicity of the
measurements. Since the signature is one sample from an ensemble, the entire
ensemble may be viewed as a collection of points in this multi-dimensional space
and can be described by a multivariate probability density function (pdf). We
refer to the space in which this pdf is defined as the observation space (O-space)
since its coordinates correspond to the raw radar observations made on the target.
Thus it is straightforward conceptually to consider the radar classification prob-
lem as the implementation of a Likelihood Ratio Test (LRT) in O-space, reducing
the problem to one of estimating the class-conditional pdfs or the likelihood ratio
directly. These estimates may be derived from collections of labelled signatures
from each class which are used as learning data.

Fig. 1. Representations of sample radar signature: (a) signature as discrete, finite time series; (b) signature as data vector; (c) signature as point in observation space.

It is at this point that the practical difficulties of dealing with finite and
ordinarily small numbers of signatures in high-dimensional spaces become evi-
dent. The dimensionality of the O-space can be quite large since radar signatures
are often as long as tens of pulses and under many circumstances can be
considerably longer.
Signatures can be obtained either directly from observations of flight tests or
from simulations by combining the calculated dynamics of the body with static
body measurements or theoretically produced static patterns. Since flight test data
are difficult and expensive to obtain, it is often necessary to use simulated
signatures to supplement the flight test signatures. In many cases, however, the
number of signatures required to estimate the class conditional pdf's or the
likelihood ratio is so large that even the use of simulations becomes too costly. In
these cases we must consider other approaches to the design of classification
algorithms, including reducing the dimensionality of the space in which classifi-
cations are made.
Parametric classification schemes, feature selection, feature extraction, and
sequential classification are possible alternatives to the non-parametric estimate
of the pdfs or the likelihood ratio in O-space, which diminish the required number
of sample signatures. In the first case, we may know or we may assume
parametric forms for the class-conditional pdf's. The pdf's are then completely
determined by the value of an unknown but nonrandom parameter vector. For
example, if each class is known or assumed to be Gaussian, the components of the
parameter vector are the class conditional mean and covariance matrix. The
learning data are then used to estimate these parameters for each target class and
the LRT can then be implemented using a smaller number of learning signatures
than would be required if a nonparametric estimate of the densities were used.
In the same vein, the number of parameters that must be estimated from the
data can be reduced by calculating features on the basis of prior knowledge or
intuition. For example, if we were trying to discriminate between a dipole and a
sphere on the basis of the returns from a circularly-polarized radar we might use

the ratio of the principally polarized (PP) and orthogonally polarized (OP)
returns. Since this ratio would be unity for an ideal dipole and infinite for a
perfect sphere, the ratio could be used as a discriminating feature rather than the
individual returns, reducing the dimensionality of the classification space by a
factor of 2. In general, however, realistic discrimination problems are not as
simple as sphere vs. dipole, and the selection of features becomes difficult.
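A toy version of this intuition-based feature (the helper is mine, not the authors') simply replaces the 2n individual returns by their element-wise ratio, halving the dimensionality of the classification space:

    import numpy as np

    def polarization_ratio_feature(pp_returns, op_returns, eps=1e-12):
        """Element-wise PP/OP amplitude ratio: near unity for a dipole-like
        scatterer, very large for a sphere-like (rotationally symmetric) one."""
        return np.asarray(pp_returns) / np.maximum(np.asarray(op_returns), eps)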
A third method that has been used for reducing the dimensionality of the space
in which classification is performed is mathematical feature extraction. Here we
attempt to find a transformation that maps the data to a lower-dimensional
feature space while minimizing the misclassification rate in the feature space. In
general, attempts to derive feature extraction techniques associated directly with
error criteria have been unsuccessful except when the underlying density of the
data is known. Nonparametric approaches to the problem have been computa-
tionally complex and require as many samples for the derivation of the transfor-
mation as would be required for the design of the classifier in the original
observation space. Because of this, criteria not directly associated with error rates
have been utilized, particularly those involving the means and covariances of each
class. Examples of these types of techniques are the Fukunaga-Koontz transfor-
mation, Fukunaga and Koontz (1970), and the sub-space method of Watanabe
and Pakvasa (1973). Therrien et al. (1975) extended the Fukunaga-Koontz
technique to the multiclass case and showed the applicability of the technique to
functions of the covariance matrix. They also showed that Watanabe's subspace
method is a special case of this extension. Applications to radar signature
classification of the Fukunaga-Koontz technique, the subspace method and
Therrien's extension are found in Therrien et al. (1975). The results presented in
that paper indicate that the performance when mapping down from high-dimen-
sional spaces is not outstanding. This is probably due to the irregular nature of
the underlying class-conditional pdf's of the radar signature data: these data are
rarely Gaussian or even unimodal. The multimodality of the data is due to the
different nature of the radar returns from a target when viewed from near
nose-on, broadside or the rear.
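The Fukunaga-Koontz idea can be sketched as follows (my own minimal code under standard assumptions, not the authors' implementation): whiten the sum of the two class scatter matrices and keep the eigenvectors whose eigenvalues lie farthest from 1/2, since a direction that carries much energy for one class necessarily carries little for the other.

    import numpy as np

    def fukunaga_koontz_basis(X1, X2, n_features):
        """X1, X2: (n_samples, dim) learning data for the two classes.
        Returns a (dim, n_features) projection matrix."""
        S1 = X1.T @ X1 / len(X1)
        S2 = X2.T @ X2 / len(X2)
        # Whitening transform W such that W.T @ (S1 + S2) @ W = I.
        evals, evecs = np.linalg.eigh(S1 + S2)
        evals = np.clip(evals, 1e-12, None)          # guard against numerical negatives
        W = evecs / np.sqrt(evals)
        lam, V = np.linalg.eigh(W.T @ S1 @ W)        # eigenvalues lie in [0, 1]
        order = np.argsort(np.abs(lam - 0.5))[::-1]  # most discriminating directions first
        return W @ V[:, order[:n_features]]

    # Usage: Z = X @ fukunaga_koontz_basis(X1, X2, n_features=4)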
A fourth technique that has been used to reduce the dimensionality of the
classification space in radar applications is implementation of a sequential
classifier. The classification schemes that utilize the hyper-space concepts de-
scribed previously operate on a predetermined and fixed number of returns to
produce a decision. The sequential classifier makes more efficient use of radar
resources by acting on groups of pulses, one group at a time. After each group of
pulses is fed into the classifier the target is either placed into one of M possible
classes or another group of pulses is fed into the classifier. In this manner targets
which can be easily discriminated will be classified using a small number of
returns, while more difficult targets will be observed for longer times, and the
mean number of pulses required to classify a set of targets may be significantly
reduced, compared to the demands of fixed sampling.
The foundations of sequential classification theory were laid by Wald (1974).
Therrien (1978) has recast the structure of the Gaussian classifier (quadratic

classifier) into a sequential form consisting of a linear predictive filter followed by an error accumulator and comparator. By this formulation the computational
requirements of the classifier are related only to the prediction order of the filter
and not to the full length of the observed signatures. Thus, in addition to
apportioning radar resources between easy and difficult targets, a sequential
classifier reduces the dimensionality of the decision space.
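In schematic form (a simplification of my own, not Therrien's filter formulation), a sequential classifier accumulates a log-likelihood-ratio score group by group and compares it with two thresholds, deciding early on easy targets and requesting more pulses for hard ones:

    def sequential_decision(pulse_groups, log_lr_of_group, upper=4.0, lower=-4.0):
        """pulse_groups: iterable of pulse-group arrays; log_lr_of_group: function
        returning a group's log-likelihood ratio for class A versus class B.
        Returns ('A' | 'B' | 'undecided', number of groups used)."""
        total, used = 0.0, 0
        for group in pulse_groups:
            used += 1
            total += log_lr_of_group(group)
            if total >= upper:
                return "A", used
            if total <= lower:
                return "B", used
        return "undecided", used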

11. Classifier performance

After a classifier has been designed, its error rates must be determined in order
to evaluate its performance. The preferred method of accomplishing this is to do
so experimentally by passing a set of test signatures through the classifier and
using the fraction of misclassifications as an estimate of the error rate. The test
signatures should be distinct from those used in the design of the classifier in
order to obtain an unbiased estimate of the error rates. Therrien (1974) and
Burdick (1978) show the results of classifying the dynamic radar signatures of
missile components in O-space and also after using feature extraction techniques.
These papers show results using parametric classifiers such as the Gaussian
quadratic as well as non-parametric classifiers such as the grouped and pooled
'nearest-neighbor'. They indicate that the nonparametric classifiers can give
excellent results when the class-conditional pdf's are multimodal or irregular.
Because of the non-parametric nature of the density, these classifiers work best
when the dimensionality of the O-space is low (less than 20) and the number of
training samples is large. If these classifiers are used in high-dimensional spaces
they perform well on the learning data but suffer degraded performance when
applied to an independent set of test data if there is an insufficient number of
training samples. Ksienski et al. (1975) examined the application of linear
parametric classifiers and nearest-neighbor classifiers to the problem of identify-
ing simple geometric shapes and aircraft on the basis of their low-frequency radar
returns. In this analysis, the temporal response of the target was ignored: it was
assumed that the target was viewed at a fixed aspect angle which was known to
within a specified tolerance. The data vector consisted of a sequence of multi-
frequency returns taken at identical aspect angles. A collection of these data
vectors was then used to design a classifier and learning data were corrupted by
noise to produce data for testing the classifier.

12. Examples

We now consider two examples of the application of pattern recognition to specific problems in radar discrimination. The first of these concerns an application of a Gaussian quadratic classifier to a problem in missile discrimination and illustrates the use of the algorithmic approach in bringing to light powerful

features of the data. The second concerns the discrimination of airborne targets
by a wideband radar, and illustrates the use of an array classifier.

EXAMPLE 1 (Discrimination with a quadratic classifier).


When an air-to-air missile is fired there is often debris which accompanies the
warhead at least through the initial portion of its flight. In this first example we
examine the problem of discriminating between the warhead and deployment
debris on the basis of their radar signatures.
In this case a classifier working on a fixed-length signature was implemented.
The signature consisted of 16 returns, each of which was composed of a pair of
amplitudes: one for the principally polarized (PP) return and one for the
orthogonally polarized (OP) return. Each data vector, X, thus took the form

X^T = [x_1^PP, x_2^PP, ..., x_16^PP, x_1^OP, x_2^OP, ..., x_16^OP].

The pulse spacing was 0.05 s so it required just under 0.8 s to collect the signature.
The Likelihood Ratio Test (LRT) for the quadratic classifier is expressed as

g(X) = (X - M_W)^T Σ_W^{-1} (X - M_W) - (X - M_D)^T Σ_D^{-1} (X - M_D)  ≷  T

where M_W, Σ_W, M_D and Σ_D are the mean vector and the covariance matrix of the warhead and debris classes, respectively, and the comparison of g(X) with the threshold T assigns the target to one class or the other. These means and covariance matrices, together with the threshold, T, were calculated from simulated learning data.
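A minimal sketch of such a Gaussian quadratic classifier is given below; it is my own construction (plain sample means and covariances stand in for the simulated learning data, and the sign convention for the threshold comparison is an assumption):

    import numpy as np

    class QuadraticClassifier:
        def __init__(self, X_warhead, X_debris):
            # X_*: (n_signatures, 32) learning matrices (16 PP + 16 OP amplitudes).
            self.m_w = X_warhead.mean(axis=0)
            self.m_d = X_debris.mean(axis=0)
            self.P_w = np.linalg.inv(np.cov(X_warhead, rowvar=False))
            self.P_d = np.linalg.inv(np.cov(X_debris, rowvar=False))

        def score(self, x):
            dw, dd = x - self.m_w, x - self.m_d
            return float(dw @ self.P_w @ dw - dd @ self.P_d @ dd)

        def classify(self, x, threshold):
            # Large g(X): far from the warhead class relative to the debris class.
            return "debris" if self.score(x) > threshold else "warhead"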
The major problem in designing the classifier was obtaining the learning data
for the debris class. Static measurements were made on objects thought to
resemble deployment debris, and these were combined with assumed motion
parameters to produce dynamic signatures. Since little was known of the true
debris motions a wide variety of tumble rates were employed to prevent the
resulting classifier from being tuned to a narrow spectrum of motions. Details of
the warhead shape and ranges of motion parameters were better known so there
was no difficulty in simulating its dynamic signatures.
A threshold was then chosen for the classifier and it was tested in real-time on a series of flight tests. Examples of debris and warhead signatures are shown in Figs. 2(a) and 2(b), respectively. The classifier was implemented on a radar and used to observe real warhead and debris targets. On a total of 132 warhead and 49 debris signatures, a leakage rate of 5% and a false alarm rate of 0% were obtained.
One of the more interesting aspects of this example involved the analysis of the
classifier. Examination of the coefficients of the classifier clarifies the characteris-
tics of the signatures which are most important for discrimination.
Prior to the actual flight test it had been postulated that the ratio of the mean
PP return to the mean OP return would be an effective feature for discrimination.
It was argued that a piece of debris, being sharp, irregular and edgy, would show
an OP return comparable to its PP return, much like a dipole, and thus its

Fig. 2(a). Sample signatures of debris (RCS vs. time, showing PP and OP returns).



Fig. 2(b). Sample signatures of warhead (RCS vs. time, showing PP and OP returns).

polarization ratio would be close to unity. The warhead, on the other hand, was
known to be rotationally symmetric and would, therefore, have a low OP return,
much like a sphere, and hence provide a high polarization ratio. Fig. 2 shows that
this conjecture is poor for one of the pieces of debris and quite incorrect for the
other two pieces which, on the average, exhibit large polarization ratios.
Examination of the quadratic classifier coefficients revealed that the ratio of
the second moment of the PP return to the second moment of the OP return was
the dominant feature for discrimination rather than the ratio of the means.
Inspection of Fig. 2 confirms this prediction, although this feature would not be
so evident to the eye in the shorter 0.8 s signatures that were fed into the classifier.
To confirm this analysis a new classifier was designed which used the second-
moment-ratio as the sole selected feature: it performed about as well as the full

classifier had done, thus confirming the interpretation of the coefficients of the
classifier.

EXAMPLE 2 (Discrimination with a linear classifier).


The second example is concerned with discrimination of targets by a wideband
radar. (For a wideband radar, the range resolution is small compared with the
dimensions of a typical target. Range resolution is, roughly, inversely propor-
tional to the bandwidth of the transmitted pulse.)
The problem arose of designing a radar to receive both polarizations (PP, OP), and to perform discrimination on a single pulse between two classes of airborne targets: remotely piloted vehicles (RPVs) and drones (D). The RPV class consisted of two types:
- SR: small RPVs;
- LR: large RPVs.
The drone class also consisted of two types of bodies:
- LD: large drones;
- SD: small drones.
It was believed that, to perform discrimination on a single pulse (hence:
without having to place a target into track), the radar would need a wideband
capability. But, since it is expensive to produce such a capability it was important
to determine how the performance would change with the bandwidth of the
transmitted pulse. This relation--between bandwidth and discrimination--was
the main subject of the study.
Learning data on the targets were obtained from flight data collected by a
radar whose operating characteristics were similar to those of the proposed
sensor. A signature in this case consists of the data collected on a single pulse:
since the radar is dual polarized, there are two such returns. And since the radar
is wideband, each return consists of a number of amplitudes, spaced according to
the range resolution of the waveform.
Fig. 3 shows how a signature was defined. The PP (above) and OP (below) returns are shown. The noise level in advance of the PP return is found and a threshold established at a level 5 dB above the noise. Then, beginning at the threshold station, n amplitudes are recorded from the PP return. (In Fig. 3, these are shown as x_1, x_2, ..., x_n.) The returns from the OP are recorded for the corresponding range intervals, giving rise to another set of amplitudes: x_{n+1}, ..., x_{2n}. The 2n amplitudes (half from PP, half from OP) are assembled in sequence, and form the data vector, X.
The value of n was chosen as a function of the bandwidth of the waveform in such a way that the n amplitudes in each return extended over a range which is greater than the physical length of the longest target.
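A rough sketch of this vector assembly (the helper names are mine) might look as follows:

    import numpy as np

    def build_signature(pp_db, op_db, n, noise_db, margin_db=5.0):
        """pp_db, op_db: single-pulse wideband returns (in dB) on the same range grid.
        Takes n samples of PP from the first threshold crossing and the same n
        range cells of OP, and concatenates them into the 2n-element data vector."""
        above = np.nonzero(np.asarray(pp_db) > noise_db + margin_db)[0]
        if len(above) == 0 or above[0] + n > len(pp_db):
            raise ValueError("no usable threshold crossing for this pulse")
        start = above[0]
        return np.concatenate([pp_db[start:start + n], op_db[start:start + n]])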
Fig. 3. Vector representation of wideband pulse shape (PP return above, OP return below; the threshold is set 5 dB above the noise level, and the samples x_1, ..., x_n are taken from PP and x_{n+1}, ..., x_{2n} from OP).

Wideband signatures are difficult to simulate. For this study, all learning signatures were taken from real data gathered on the four types of targets with a 500-MHz waveform. This was one value of bandwidth of interest; others were 200 and 100 MHz. To obtain signatures at these reduced bandwidths, the 500-MHz signatures were passed through low-pass filters with appropriate cutoff frequencies. In this way, a small (335 samples) but realistic set of signatures was obtained at the three bandwidths of interest.
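The chapter does not specify the filters used; purely for illustration, a crude moving-average smoother can stand in for them, with the window length set by the ratio of range resolutions (an assumption of mine):

    import numpy as np

    def reduce_bandwidth(signature, full_bw_mhz, reduced_bw_mhz):
        """Emulate a coarser range resolution by smoothing over roughly
        full_bw / reduced_bw neighbouring range cells."""
        window = max(1, int(round(full_bw_mhz / reduced_bw_mhz)))
        kernel = np.ones(window) / window
        return np.convolve(signature, kernel, mode="same")

    # e.g. emulate a 100-MHz return from a 500-MHz one: reduce_bandwidth(x, 500.0, 100.0)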
Fig. 4 shows examples of the time-averaged waveforms returned by the SR,
LD, and SD, at each of the bandwidths. For clarity, the OP return is displaced to
the right of the corresponding, simultaneous PP return. Time-averaged wave-
forms, rather than single-pulse waveforms, are plotted to present the underlying
structure. Single-pulse waveforms (on which the classifier operated) are much
noisier.
A novel problem posed by this study relates to the fact that the two classes,
RPV and drone, were not themselves homogeneous in content, since each was
comprised of two distinct types of target (LR and SR in the one case, LD and SD
in the other). What logic of separation ought to be employed to separate RPVs
from drones, there being no need to distinguish between types of RPV or between
types of drones? The following stratagem was found to be powerful:

Four simple linear (Fisher) classifiers were built, denoted by the vectors of weights, B (Fig. 5).¹ In particular:
B_1 discriminated between SR and LR,
B_2 discriminated between SR and LD,
B_3 discriminated between SR and SD,
B_4 discriminated between LR and either drone.

¹Other linear classifiers, representing alternative decision options, were examined but found to be comparatively ineffective.

Fig. 4. Time-averaged waveforms of three target types: small RPV (SR), large drone (LD) and small drone (SD), at 500, 200 and 100 MHz.

The outputs, h_i, of the four linear classifiers were fed into a nearest-neighbor classifier which made the final decision as to whether the signature in question was that of an RPV or a drone. Fig. 5 shows the logical arrangement of the linear and nearest-neighbor classifiers in the form of an array classifier. The four linear classifier outputs, h_i, can be thought of as features and the classifiers themselves as feature extractors. The nearest-neighbor classifier is then operating in a 4-dimensional space.
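The array-classifier logic of Fig. 5 can be sketched as follows (my own minimal code; the pooled-covariance form of the Fisher weights is an assumption): four linear classifiers produce the scalar scores h_1, ..., h_4, and a nearest-neighbor rule in that 4-dimensional feature space makes the RPV/drone decision.

    import numpy as np

    def fisher_weights(Xa, Xb):
        """B = S_pooled^{-1} (mean_a - mean_b), a common form of the Fisher weights."""
        S_pooled = 0.5 * (np.cov(Xa, rowvar=False) + np.cov(Xb, rowvar=False))
        return np.linalg.solve(S_pooled, Xa.mean(axis=0) - Xb.mean(axis=0))

    class ArrayClassifier:
        def __init__(self, pairs, proto_X, proto_labels):
            # pairs: four (Xa, Xb) learning sets, one per linear classifier.
            self.B = [fisher_weights(Xa, Xb) for Xa, Xb in pairs]
            self.proto_h = np.array([self._features(x) for x in proto_X])
            self.proto_labels = list(proto_labels)    # 'RPV' or 'drone' for each prototype

        def _features(self, x):
            return np.array([b @ x for b in self.B])  # h_1 .. h_4

        def classify(self, x):
            h = self._features(x)
            nearest = int(np.argmin(np.linalg.norm(self.proto_h - h, axis=1)))
            return self.proto_labels[nearest]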
The performance of the array classifier is shown in Fig. 6, which gives the OCs for the three bandwidths. A curve is shown for the bandwidth of 100 MHz, a single point for 200 MHz, and a point for 500 MHz.² The 100-MHz curve indicates that error rates of about 5 percent can be achieved on a single pulse for both P_l and P_f. Due to the logarithmic scale of the graph, the curve for 200 MHz appears as a single off-scale point: it indicates that a P_l of less than 1 percent can be achieved at virtually 0 percent P_f. Similarly, the position of the point for 500 MHz

2Historically, OCs are conventionally plotted with the two error variables arranged in linear form. A
log-log plot has been found to provide a more legible and readily interpretable shape to the OC curve.
This accounts for what might appear to be an unusual (viz. concave) shape, in contrast to the more
common (convex) shape of OCs.

Fig. 5. Structure of array classifier: the four linear classifiers (SR vs. LR, SR vs. LD, SR vs. SD, LR vs. LD or SD) produce outputs h_1, ..., h_4, which are fed to a nearest-neighbor classifier that makes the decision.

shows that error rates of close to 0 percent on each axis can be achieved: this is
essentially perfect discrimination on one pulse.
An interesting aspect of this problem arose from the use of linear classifiers as the first step in the full-array classifier. For the Fisher linear classifier, the LRT takes the form:

B^T X = Σ_{i=1}^{2n} b_i x_i.

That is, the LRT is a dot product between the vector of weighting coefficients, B, and the data vector, X. The weighting vector is computed from the learning data of the two target types (say 'R', 'D'), and is given by

B = Σ^{-1}(M_R - M_D),

where Σ is the pooled covariance matrix of the learning data and M_R, M_D are the class mean vectors.

Now, of the 2n components of B, those that are largest indicate the positions
within the pulse-form where the greatest amount of discrimination information
lies. Conversely, if a coefficient, b_i, is small, that position supplies little dis-
crimination information (since, whatever the pulse amplitude there, it contributes
little to the dot product after being multiplied by the small coefficient).
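The coefficient inspection described here amounts to ranking the 2n weights by magnitude; a small sketch (the names are mine) reports which PP and OP range cells carry the most discrimination information:

    import numpy as np

    def most_informative_cells(B, n, top_k=5):
        """B: weight vector of length 2n; the first n weights belong to PP range
        cells, the last n to OP. Returns the top_k cells by |b_i|."""
        order = np.argsort(-np.abs(B))[:top_k]
        return [("PP", int(i)) if i < n else ("OP", int(i - n)) for i in order]

    # e.g. most_informative_cells(B2, n=20) for the 40-component 500-MHz classifier B_2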
Fig. 6. Operating characteristic for array classifier.

Fig. 7 shows the PP and OP components of B_2, as an example, which distinguishes between the SR and the LD. For 500 MHz, there are 40 components to B_2 (20 for each polarization); for 200 MHz, 20 components; and for 100 MHz, 10 components. Several conclusions can be drawn:
(1) For all bandwidths, the leading edge of the PP return is important.
(2) For 500 and 200 MHz, the OP return supplies very little discrimination information.
(3) For 100 MHz, the trailing edge of the OP return is important.
These observations suggest that, for the 500- and 200-MHz bandwidths, the absence of the OP signature might not adversely affect the performance of the B_2 classifier, and the radar might as well not have the second polarization. However, this remark must be tempered by the realization that we have examined only B_2: the other three linear classifiers might tell a different story; further, a classifier structure more powerful than the linear structure shown might make superior use of the OP return.

Fig. 7. Weighting components of B_2 (PP and OP components plotted against increasing range, for 500, 200 and 100 MHz).

This example illustrates two interesting aspects of the design of classifiers:


(1) Powerful discrimination can sometimes be achieved by constructing an array classifier which depends on relatively simple classifiers acting in parallel and/or cascade.
(2) It is easy to attach a form of physical significance to the weighting
coefficients of a linear classifier (and, it should be remarked, of a quadratic
classifier as well). The magnitudes of weights suggest areas of importance to
discrimination. The weights, of course, must be interpreted with caution and, in
any case, it must be remembered that a linear classifier is less powerful than a
more sophisticated classifier whose composition might be less suggestive.

We have described some of the applications of pattern recognition to radar discrimination problems. Another aspect of the application of pattern recognition
to radar problems is in the design of radar decoys. Examples of these decoys
might be drones for the RPVs or decoys for warheads. The credibility of a
candidate decoy can be evaluated as described above and, if the resulting
performance is not satisfactory, the statistics of the classification process can be
analyzed to determine the features which were most important to the discrimina-
tion process. The decoy designer, knowing these statistics, tries to alter them by
changing either the flight or the scattering properties of the body. This iterative process (design, evaluate, analyze, modify) is repeated until a desired performance level is achieved.
Pattern recognition techniques have been successful not only in dealing with
discrimination algorithm design and real-time implementation but also as a
research tool in radar data analysis. They provide insight into the complex
relationships existing among the physical processes relevant to the characteristics
of a radar signature and also provide a means for placing the elements of
discrimination and target design on a quantitative basis.

References

Burdick, B. J. et al. (1978). Radar and penetration aid design. Proc. 1978 Comput. Soc. Conf. on Pattern Recognition and Image Processing, Chicago, IL, U.S.A.
Chernoff, H. and Moses, L. E. (1969). Elementary Decision Theory. Wiley, New York.
Egan, J. P. (1975). Signal Detection Theory and ROC Analysis. Academic Press, New York.
Fukunaga, K. and Koontz, W. L. (1970). Application of the Karhunen-Loeve expansion to feature selection and ordering. IEEE Trans. Comput. 19, 311.
Ksienski, A. A. et al. (1975). Low-frequency approach to target identification. Proc. IEEE 63, 1651.
Therrien, C. W. (1974). Application of feature extraction to radar signature classification. Proc. Second Internat. Joint Conf. on Pattern Recognition, Copenhagen, Denmark.
Therrien, C. W. et al. (1975). A generalized approach to linear methods of feature extraction. Proc. Conf. Comput. Graphics on Pattern Recognition and Data Structures, Beverly Hills, CA, U.S.A.
Therrien, C. W. (1978). A sequential approach to target discrimination. IEEE Trans. Aerospace Electron. 14, 433.
Van Trees, H. L. (1968). Detection, Estimation and Modulation Theory, Vol. I. Wiley, New York.
Wald, A. (1974). Sequential Analysis. Wiley, New York.
Watanabe, S. and Pakvasa, N. (1973). Subspace methods in pattern recognition. Proc. First Internat. Joint Conf. on Pattern Recognition, Washington, DC, U.S.A.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
© North-Holland Publishing Company (1982) 595-607

White Blood Cell Recognition

E. S. Gelsema and G. H. Landeweerd

1. Introduction

The white blood cell differential count (WBCD) is a test carried out in large
quantities in hospitals all over the world. In the U.S. and in many European
countries the complete blood count including the differential count is usually a
part of the routine hospital admitting procedure. Thus, it may be estimated that a
hospital will generate between 50 and 100 differential counts per bed per year.
For the U.S. alone Preston [16] estimates that annually between 20 and 40 × 10⁹
human white blood cells are examined. The average time it takes a technician to
examine a slide varies widely and is amongst other things dependent on the
standards set by the hospital administration. The actual 100-cell differential count
can take as little as two minutes, but given the need to spot even rare abnormali-
ties and taking into account the time to load each slide and to record the results,
it seems reasonable to assume an average examination time of 10 minutes per
slide. This, then, for a 2000 bed hospital corresponds to a workload of between 50
and 100 man-hours per day for the microscopic examination only.
For certain conditions the need to examine the blood smear is obvious from the
clinical information available and the information it provides is essential. Such
examples, however, are relatively few, and more often the information from the
differential is yet another important element which helps the physician to arrive
at a diagnosis.
The WBCD consists of estimating the percentages of the various types of white
blood cells present in a sample of peripheral blood. Normal values may vary
widely. The normal percentages and ranges as given by Wintrobe [20] are given in
Table 1.
A typical example of each of the normal cell types is given in Fig. 1.
In pathological cases, in addition to these normal cell types, a number of
immature or abnormal types may also be present. The number of such immature
types depends on the degree of subclassification one wants to introduce. More-
over, in an automated system, immature red blood cells have to be recognised as
well. The detection of immature and abnormal cells, even though in terms of
numbers they usually represent only a small proportion, is very important for
establishing a diagnosis.

Table 1
Normal values and ranges of the occurrence of the
different normal types of white blood cells accord-
ing to Wintrobe [20]
Cell type       Percentage   95% range
Neutrophils         53.0     34.6-71.4
Lymphocytes         36.1     19.6-52.7
Monocytes            7.1      2.4-11.8
Eosinophils          3.2      0.0-7.8
Basophils            0.6      0.0-1.8

The errors made in a WBCD may be attributed to various sources:
- variability in the preparation of the blood smear;
- variability and fatigue of the human observer;
- statistical errors.
Bacus [2] has shown that the variability between human observers is consider-
able, even when only normal types are considered. It is clear that at least in
principle automation may contribute in reducing such errors. If automation
includes the preparation stages of the process (slide smearing and staining) as well
as the decision making process, then the errors from the first two sources may in
principle be removed. The statistical errors in the WBCD are discussed by Rtimke
[18]. If an automated system can cost-effectively classify more cells per sample
(100 cells being the standard in non-automated procedures), then the statistical
error may be reduced as well.
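A back-of-the-envelope calculation (mine, not from the article) shows how this statistical error shrinks with the number of cells counted: the observed percentage of a cell type has a binomial standard error of sqrt(p(1-p)/n).

    import math

    def differential_standard_error(p, n_cells):
        """Standard error (as a fraction) of an observed proportion p based on n classified cells."""
        return math.sqrt(p * (1.0 - p) / n_cells)

    print(differential_standard_error(0.071, 100))    # ~0.026 for monocytes in a 100-cell count
    print(differential_standard_error(0.071, 1000))   # ~0.008 with ten times as many cells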
These arguments have constituted the rationale in the early 1960's to attempt
the automatic classification of white blood cells. A number of research groups
started to conduct experiments in this field, later followed by commercially
oriented endeavours. At the moment of writing there are at least five commercial
systems available for the automated WBCD. One of these [12] is not based on the
image processing principle and will not be discussed here. The degree to which
these systems are successful indicates the way to go in the future.
In the present article a review will be given of the experiments on WBCD
automation. The next section describes the experimental effort conducted in this
field. In Section 3 the commercial systems now on the market will be reviewed.
Section 4 contains some concluding remarks.

2. Experiments on the automation of the W B C D

Although complete automation of the WBCD must include standardization of slide preparation and staining, automation of slide loading, of movement of the microscope stage, of cell finding and focusing, the scope of this section will be limited to a discussion of the automation of the cell recognition task.

Fig. 1. Six normal types of white blood cells. Top row: segmented neutrophil and band; middle row: lymphocyte and monocyte; bottom row: eosinophil and basophil.

Table 2
Significant experiments on the automation of the white blood cell differential count

Year of                                                 Type of           No.    No.      No.
publication  Authors                            Ref.    features          cells  classes  features  % correct
1966         M. Ingram, P. E. Norgren,            8     Topology           117      3        6        72
             K. Preston                                                              3       10        84
1966         J. M. S. Prewitt,                   17     From opt. dens.     22      4        2       100
             M. L. Mendelsohn                            freq. dist.
1969         I. T. Young                         21     Geometry,           74      5        4        94
                                                         colour
1972         J. W. Bacus, E. E. Gose              1     Geometry,         1041      5        8        93.2
                                                         colour,                    7        8        73.6
                                                         texture                    8        8        71.2
1974         J. F. Brenner, E. S. Gelsema,        4     Geometry,         1296      7       20        86.3
             T. F. Necheles, P. W. Neurath,              colour,                   17       20        67.3
             W. D. Selles, E. Vastola                    texture
1978         G. H. Landeweerd, E. S. Gelsema     10     Texture            160      3       10        84.4

Work in this area started around 1960. It is still going on at the time of writing. Rather than describing in detail what has happened in the past 18 years, the progress made in this field will be illustrated using Table 2, where each entry marks a significant step in the direction toward the ultimate goal. Each entry contains a reference to one representative article from the research group involved. This article is not necessarily the last nor the 'best' one published by the researchers involved. It is chosen merely to indicate the progress in this field. In the accompanying text the significance of this article will be pointed out and more references may be given. Each entry of the table contains the year of publication of the key article, the names of the individuals involved (as evidenced by this reference), details on the experimental conditions (number of objects, number of classes, etc.) and finally the result achieved.

2.1

Ingram, N o r g r e n and Preston are a m o n g the first to enter in this field [8, 9].
Their C E L L S C A N system differs considerably from the systems used by the
workers to be listed below. It essentially consists of a TV camera, linked to a
special purpose digital computer through an A D C converter. A binary image of
the cell is thus read into the computer, which then applies a series of 'parallel
pattern transforms' [7], in each step reducing the image by shrinking operations.

A monochromator is used to preserve colour information. In the reference cited in Table 2 the authors present results on the classification of 117 cells into 3 classes (lymphocytes, monocytes and mature granulocytes).¹
Various subsets of features yield classification results varying from 72% to 84% correct. In a later article [9] the application of a subsequent system (CELLSCAN-GLOPR) to a much larger data set is described. Here, however, the
results are given in terms of the differential count.
The entries in Table 2 to be described below are all obtained using a general
purpose digital computer.

2.2

The second entry refers to an experiment by Prewitt and Mendelsohn [17] in 1966. In this experiment 4 classes (neutrophils, eosinophils, lymphocytes and monocytes) are considered. The features describing the cells (35 in total) are all extracted from the optical density frequency distribution obtained using the CYDAC flying spot microscope. An example of an optical density histogram (not obtained with CYDAC) is given in Fig. 2. The procedure is based on the

Fig. 2. Optical density histogram of the image of a white blood cell (number of points vs. film density). The peaks from left to right are generated mainly by points in the background, in the cytoplasm and in the nucleus, respectively.

¹Granulocytes are those cells which contain granules in the cytoplasm. The normal granulocytes are neutrophils, eosinophils and basophils.

observation that the optical density histogram is characteristic for the cell type
from which it is generated. Of course, parameters used by hematologists such as
nuclear area, cytoplasmic area, contrast, etc. may also be measured globally from
the histogram. The authors report 100% correct classification of 22 cells, using 2 parameters.
As the authors themselves remark, it is hardly surprising that with the enlargement of the blood cell sample and the number of types to be recognised the method of analysis will have to become more complicated.
All experiments to be described below use the 'segmentation' approach, which consists of finding the boundaries of the cell and of the nucleus prior to the estimation of parameters. Fig. 3 shows the digitised image of a cell and in Fig. 4 the two boundaries imposed on it are given. Measurements may now be performed on the two areas of interest, i.e. cytoplasm and nucleus. For convenience the resulting parameters may be subdivided in three classes:
- geometry,
- colour,
- texture.
Geometrical parameters describe e.g., the area of the cell, the cellular to nuclear
area ratio, the shape of the nucleus, etc. Colour parameters may be retrieved by
analyzing at least two images obtained through two different colour filters. They
include the average colour of the cytoplasm, the average colour of the nucleus, the
width of the colour distributions in these two areas, etc. Texture parameters
describe the local variations in optical density. They incorporate somehow classi-
cal subjective descriptions such as 'fine chromatine meshwork' or 'pronounced

Fig. 3. Grey level plot of the image of a white blood cell (eosinophil).

Fig. 4. Contours of the cell and of the nucleus of the cell in Fig. 3.

granularity', etc. With most texture parameters, however, the link between the
numerical values and the visual perception is much less evident than with the
parameters from the geometry and colour category.
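A minimal sketch of the histogram-based part of this segmentation step (my own simplification, assuming a clean three-peak density histogram and that the nucleus corresponds to the highest densities) is given below; the two valleys between the background, cytoplasm and nucleus peaks serve as thresholds:

    import numpy as np

    def two_thresholds_from_histogram(image, n_bins=64):
        """Return (background/cytoplasm, cytoplasm/nucleus) density thresholds."""
        counts, edges = np.histogram(image, bins=n_bins)
        smooth = np.convolve(counts, np.ones(5) / 5, mode="same")   # damp histogram noise
        peaks = [i for i in range(1, n_bins - 1)
                 if smooth[i] >= smooth[i - 1] and smooth[i] >= smooth[i + 1]]
        p1, p2, p3 = sorted(sorted(peaks, key=lambda i: smooth[i])[-3:])  # three largest peaks, in order
        v1 = p1 + int(np.argmin(smooth[p1:p2]))    # valley: background / cytoplasm
        v2 = p2 + int(np.argmin(smooth[p2:p3]))    # valley: cytoplasm / nucleus
        centers = 0.5 * (edges[:-1] + edges[1:])
        return centers[v1], centers[v2]

    # t1, t2 = two_thresholds_from_histogram(img)
    # labels = (img > t1).astype(int) + (img > t2)   # 0: background, 1: cytoplasm, 2: nucleus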

2.3
With the experiment by Young [21] systematic colour measurement is intro-
duced. He uses a flying spot colour scanner SCAD which scans colour slides of
the leukocytes through a set of dichroic mirrors. In this way three spectrally
filtered images are stored in the computer memory. The density histogram
provides threshold levels for the transition from background to cytoplasm and
from cytoplasm to nucleus, respectively. A set of quasi chromaticity coordinates
are then used to identify surrounding red cells and to define the 'average' colour
of the cytoplasm and of the nucleus. In his classification scheme Young uses 4
features (the estimated area and one chromaticity coordinate for both the
cytoplasm and the nucleus) to distinguish 5 cell types (the 5 normal types). Of the
74 cells used as a learning population 94% are correctly classified.

2.4
The work of Bacus et al. [1] is significant in more than one respect. First, the experiment is conducted on a sample size much larger than the previous ones. Blood smears from 20 people were used to constitute the data base. A total of 1041 cells is involved, half of which are used to train the eight-dimensional Gaussian classifier. This is in fact the first time that a testing set different from the learning population is used in the classification process. Secondly, Bacus has also extensively studied the human percentage of error, both on the basis of single cell classification and of the differential count. Surprisingly, in the estimation of the differential into eight classes (the lymphocytes are subdivided into three classes according to size and the neutrophils are subdivided into two classes according to age) the human error is as high as 12%. This figure sets an upper limit in the evaluation of any automatic classification device. It is to be expected that the human error in single cell recognition when immature cells are included is substantially higher than the 12% given above. Also, Bacus is the first to propose analytical expressions for the derivation of some texture parameters in the WBC application. The overall percentages of correct classification are 93.2, 73.6 and 71.2 for the five class, the seven class and the eight class problem, respectively.

2.5
Immature cells enter into the picture for the first time in the work by Brenner
et al. [4]. They do a large scale experiment on the classification of white blood
cells into 17 types. From the images of 1296 cells, divided equally among the 17
types, a total of about 100 parameters is extracted. Of these, an optimal set of 20
is retained for classification purposes. These include features from all three
categories listed above. They present their results at three levels of sophistication with respect to the types to be distinguished. Attempts at separation into 17 classes yield 67.3% correct classification. Of the misclassified cells, 30% are confused with adjacent stages in the maturation process. Treating all immature cells as one single class, a classification into 7 types results in 86.3% correct classification, with 8.7% false negatives (immature cells classified as normal cells) and 12.5% false positives (normal cells classified as immature cells). While the false negative rate is comparable to the Poisson error if between two and three immature cells are present in the sample of 100 cells, the false positive rate is judged too high for a practical system.
At this point (around 1974) in the history of the automated WBCD it becomes
clear that much work remains to be done in the reliable recognition of immature
cells as such and in the classification of the different types of such cells. Evidence
for this is also to be found in the performance of the commercial systems that at
this point in time have become available. (These systems will be described in the
next section.)
The problem is somewhat more complicated since, contrary to the situation in
normal types, the immature types as recognised by hematologists are not discrete
states but should rather be considered as subsequent stages in a continuum in the
evolution from stem cells to mature forms.

2.6
With the last entry in Table 2, representing an experiment by Landeweerd et al.
[10], the emphasis is on quantification rather than on automation. A certain
amount of interaction is introduced to ensure that measurements are taken in the
interesting portions of the cell image (i.e. in this case in the nucleus). Realising

that the differences between the immature cell types are mainly in the domain of
texture, as also evidenced by descriptions of such types in hematology textbooks
[20] and by the work of Lipkin et al. [11] and of Pressman [15], Landeweerd et al.
investigate the usefulness of various texture descriptions. They study a sample of
160 cells of three types (basophils, myeloblasts and metamyelocytes). They also
introduce the concept of hierarchical decision tree logic for this application. Indeed, with so many classes to distinguish in the total WBCD, some of which are rather arbitrarily defined stages in the maturation continuum, the concept of a single-level classifier becomes increasingly unrealistic. Using an optimal set of 10 parameters out of a set of 27 significant ones (based on t-tests at a 99% confidence level) they arrive at 84.4% correct classification based on texture parameters only.
In summarizing Table 2 it would be customary to indicate a correlation (positive or negative) between the figures in the first and the last columns (year of publication and percentage correct classification). The present case, however, calls for a more qualified approach, since the numbers reflect extremely dissimilar situations.
Future work will probably be directed toward better automation techniques to be realised in hardware. At the present state of the art, however, there is also scope for more sophisticated software to guide the decisions as to which parameters should be extracted. Software to improve the design of optimal hierarchical classifiers will be needed as well. In this respect interactive pattern analysis systems such as ISPAHAN [5, 6] may be of considerable value. In any case, a large scale experiment on several thousands of cells, including immatures, incorporating all promising techniques suggested so far, is what is needed at this point.

3. Developments in the commercial field

As has been described in the previous section, research in the field of the WBCD is at its peak in the late 1960's and early 1970's. Led (or misled) by the early success of these research efforts a number of companies enter the field with commercially available machines. There are now four different instruments on the market whose operation is based on the digital image processing principle.² They are listed in Table 3.
Before discussing the merits of these machines it is of interest to consider Table 4 from Preston [16], which gives a breakdown in time of the operations done by a technician in the manual execution of a WBCD. From these figures it is clear that for commercial machines to be cost-effective, automation of the task corresponding to visual inspection only is not sufficient. All other tasks together, when done manually, account for 60% of the total processing time, so that automation of these is an essential part of the design of an acceptable system.

²The HEMALOG-D of Technicon Instruments Corporation [12] is a differential machine based on the flow principle, not discussed here.

Table 3
Four commercially available white blood cell differential machines based on the digital image processing principle

Machine      Company                      Year of introduction
HEMATRAK     Geometric Data               1974
LARC         Corning                      1974
DIFF-3       Coulter Electronics, Inc.    1976
ADC-500      Abbott Laboratories          1977

Table 4
Breakdown of tasks in the manual white blood cell differential
according to Preston [16]
Task %Time
Slide staining 13
Data logging 28
Slide loading 12
Visual examination 40
Overhead 7

Moreover, within the task listed as visual inspection, the technician, while completing the WBCD, also assesses the red blood cell morphology and the platelet sufficiency. An automated system should therefore have the capability of doing the same.
It is not an easy task to evaluate the performance of such machines and it is
even more difficult to compare one device against another. First of all, the
specifications of the instruments in terms of the parameters and classifiers that
are used are not always available. Secondly, when results from test runs in a
clinical environment are given, they are usually completely or partly on the basis
of the total differential, rather than on a cell by cell basis. In view of the fact that
in a normal smear most of the cells (53%) are neutrophils, which is one of the
most easily recognised cell types, the total WBCD may hide serious errors in the
less frequently occurring types.
Some properties of the four different machines are listed in Table 5. They have
in principle been taken from the commercial announcements of the various
machines as far as this information was available in the various brochures.
Otherwise the source of the information is referenced with the corresponding
entry. From the number of times such additional sources had to be consulted it is
already clear that even such a simple product comparison is not a straightforward
job. Also, the enormous difference in the number of parameters used in the
different machines is at least surprising. Features such as automatic slide loading,
automatic data logging, etc. seem to be taken care of in the newer versions of
most of the instruments.
Of the two earlier systems listed in Table 3, reports of field trials on a cell by
cell basis are now available [14, 19].

Table 5
Some properties of the commercial white blood cell differential counters

                        HEMATRAK 480   LARC     DIFF-3   ADC-500
Type of hardware        FS             TV       TV       TV
Resolution (μm)         0.25           0.42 a   0.40     0.50 b
No. classes             7              7        10       13
No. parameters          96             9 a      50       8 b
Time/100 cell diff.     25"            60"      90"      11" c
No. slides/hour         60             44 a     25       40 c
Aut. RBC morphol.       +              -        +        +
Aut. platelet count     +                       +        +

a From [19].
b Private communication from J. Green.
c Classifying 500 cells/slide.

In Table 6 the percentages of correct classification for normal and immature cells for the HEMATRAK and the LARC are given, as far as they can be calculated from the numbers published in the references cited above. For comparison, the equivalent numbers for the Tufts NEMCH experiment described in [4] are also given.

Table 6
Percentage correct classification for two commercial machines, compared to results of the experiment described in [4]

             Percentage correct classification
             HEMATRAK    LARC     NEMCH
No. cells    21000       12456    1296

Normals
NEU          99.1        99.9     100
LYM          97.7        95.1     88
MON          92.8        97.6     100
EOS          93.1        87.0     100
BAS          80.3        100      96
Average      92.6        95.9     96.8

Immatures
MYE          75          --       85
PRO          99          --       96
NRC          87          --       96
BLA          88          --       85
PLA          89          --       94
Average      86.4        64.3     91.2

Unfortunately, for the HEMATRAK the number of normal cells classified as immatures is not broken down into the cell types involved. For the LARC the breakdown of the immature cell types is not available. Therefore, for ease of comparison, in the classification of normal cells in all three cases it was assumed that whenever such a cell was flagged as abnormal by the machine it would be presented to the operator, who would then classify it correctly. This amounts to stating that the percentage of false positives (normal cells classified as immature ones) is by definition equal to zero. In this respect it is of interest to know the percentage of normal leukocytes flagged as suspect by the machine. As stated
above, in [14] this number is given only globally, i.e. for a slide of 'normal'
composition. Since the cells in such a slide are dominated by neutrophils, for
which this percentage is extremely small, this figure is not very informative.
Nevertheless, the percentages are 1.9, 6.7 and 9.1 for the HEMATRAK, LARC
and the NEMCH experiment, respectively.
The immature cell types were considered to be correctly classified when the
machine would flag them as such. This reflects in fact the mode of operation of
both machines discussed here. Neither of them attempts to subclassify the
immature types automatically. For the LARC, the row and column corresponding
to artefacts in Table III of [19] were ignored in the calculation of the percentages.
The averages given in Table 6 are calculated for a hypothetical slide of homoge-
neous composition, i.e., a slide where all cell types are equally probable.
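As a check of this procedure, the average given for the HEMATRAK normal cells is simply the unweighted mean of the five class percentages,

$$\tfrac{1}{5}(99.1 + 97.7 + 92.8 + 93.1 + 80.3) = 92.6,$$

rather than a mean weighted by the class frequencies of an actual smear, in which neutrophils dominate.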
From Table 6 it may be concluded that for both machines considered here the
performance on normal cells is comparable to the performance achieved in [4] in
a controlled laboratory environment. This, in view of the much larger number of
cells involved, is a considerable achievement.
For the immature cell types, on the other hand, it looks as if commercial systems can still learn from the research that has been and still is going on. Especially in view of the fact that a rigorous treatment of nuclear texture seems to be promising, as indicated in [11], in [15] and in [10], it may be expected that the next generation of machines will improve in immature cell classification.
For the Coulter DIFF-3 and the Abbott ADC-500 machines, results of field trials comparable to the data in Table 6 are as yet non-existent. It is of utmost importance that data on a cell by cell basis become available. This is about the only way in which different machines, utilizing different parameters and different classification schemes, can be compared.

4. Conclusions

Almost 20 years of research effort has now been invested in the automation of
the white blood cell differential count. Starting from promising results obtained in
simple experiments this effort has led to the situation where for this test a number
of different machines is now on the market and in routine use. Even though there
is scope for improvement in this field, the application to white blood cell
differential counting represents one of the successes of image processing and
pattern recognition.
Improvements are to be expected in two directions: First, with the advent of
parallel image processing techniques the speed of differential systems is likely to
increase considerably. This is of importance since in view of the small percentages
of occurrence of some cell types, the 100-cell differential for these types is
statistically meaningless.
Secondly, the situation with respect to the immature and abnormal cell types is
as yet unclear. The optimal choice of parameters and classifiers has still to be
found experimentally. Moreover, even the a priori definition of the various types of leukemic cells is still debated among hematologists [3, 13]. It is quite possible that image processing and pattern recognition techniques, by virtue of their inherent consistency, may be useful in this respect as well.
Finally, whether differential machines based on the flow-through principle on the one hand and machines based on image processing techniques on the other will continue to compete, or whether they will eventually merge into one super-machine, is as yet hard to foresee.

References

[1] Bacus, J. W. and Gose, E. E. (1972). Leukocyte pattern recognition. IEEE Trans. System. Man.
Cybernet. 2, 513-526.
[2] Bacus, J. W. (1973). The observer error in peripheral blood cell classification. Amer. J. Clin.
Pathol. 59, 223-230.
[3] Bennett, J. M., Catovsky, D., Daniel, M. T., Flandrin, G., Galton, D. A. G., Gralnick, H. R. and
Sultan, C. (1976). Proposals for the classification of the acute leukaemias. Brit. J. Haematology
33, 451-458.
[4] Brenner, J. F., Gelsema, E. S., Necheles, T. F., Neurath, P. W., Selles, W. D. and Vastola, E.
(1974). Automated classification of normal and abnormal leukocytes, J. Histoch. and Cytoch. 22,
697-706.
[5] Gelsema, E. S. (1976). ISPAHAN, an interactive system for statistical pattern recognition. Proc.
BIOSIGMA Confer., 469-477.
[6] Gelsema, E. S. (1976). ISPAHAN users manual. Unpublished manuscript.
[7] Golay, M. J. E. (1969). Hexagonal parallel pattern transforms. IEEE Trans. Comput. 18,
733-740.
[8] Ingram, M., Norgren, P. E. and Preston, K. (1966). Automatic differentiation of white blood
cells. In: D. M. Ramsey, ed., Image Processing in Biological Sciences, 97-117. Univ. of
California Press, Berkeley, CA.
[9] Ingram, M. and Preston, K. (1970). Automatic analysis of blood cells. Sci. Amer. 223, 72-82.
[10] Landeweerd, G. H. and Gelsema, E. S. (1978). The use of nuclear texture parameters in the
automatic analysis of leukocytes. Pattern Recognition 10, 57-61.
[11] Lipkin, B. S. and Lipkin, L. E. (1974). Textural parameters related to nuclear maturation in the
granulocytic leukocytic series. J. Histoch. and Cytoch. 22, 583-593.
[12] Mansberg, H. P., Saunders, A. M. and Groner, W. (1974). J. Histoch. and Cytoch. 22, 711-724.
[13] Mathé, G., Pouillart, P., Sterescu, M., Amiel, J. L., Schwarzenberg, L., Schneider, M., Hayat,
M., De Vassal, F., Jasmin, C. and Lafleur, M. (1971). Subdivision of classical varieties of acute
leukemia: Correlation with prognosis and cure expectancy. Europ. J. Clin. Biol. Res. 16,
554-560.
[14] Miller, M. N. (1976). Design and clinical results of Hematrak: An automated differential
counter. IEEE Trans. Biom. Engrg. 23, 400-407.
[15] Pressman, N. J. (1976). Optical texture analysis for automatic cytology and histology: A
Markovian approach. Ph.D. Thesis, UCLA, UCRL-52155.
[16] Preston, K. (1976). Clinical use of automated microscopes for cell analysis. In: K. Preston Jr.
and M. Onoe, eds., Digital Processing of Biomedical Images. Plenum, New York.
[17] Prewitt, J. M. S. and Mendelsohn, M. L. (1966). The analysis of cell images. Ann. N.Y. Acad. Sci.
128, 1035-1053.
[18] Rümke, C. L. (1960). Variability of results in differential cell counts on blood smears. Triangle
4, 154-158.
[19] Trobaugh, F. E. and Bacus, J. W. (1977). Design and performance of the LARC automated
leukocyte classifier. Proc. Conf. Differential White Cell Counting, Aspen, CO.
[20] Wintrobe, M. M. (1967). Clinical Hematology. Lea and Febiger, Philadelphia, PA.
[21] Young, I. T. (1969). Automated leukocyte recognition. Ph.D. Thesis, MIT. Cambridge, MA.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 609-620

Pattern Recognition Techniques for Remote Sensing Applications

Philip H. Swain

1. Introduction: The setting

Remote sensing is the measurement of physical properties of an object without


coming into physical contact with the object. Today the term is used most
familiarly to describe the process of measuring, recording and analyzing electro-
magnetic radiation emanating from the surface of the earth, usually for the
purpose of identifying or otherwise characterizing the landcover. 1 Instruments
aboard aircraft and orbiting satellites are used to gather such data. If the
measured energy is of natural origin, reflected sunlight, for example, then
the remote sensing system is said to be passive. If the system itself provides the
energy source, as is the case for radar and laser systems, then the term active
remote sensor is applied.
The simplest and most familiar remote sensing system consists of camera and
film. Like the camera, many other--but by no means all--remote sensing
systems yield images of the scenes they observe. Through the use of filters,
dispersive devices, and selective detection/recording media, the more sophisti-
cated remote sensing instruments are able to measure in considerable detail not
only the spatial characteristics of the scene but the spectral characteristics as well.
As shown in Fig. 1, a multispectral scanner (MSS) performs a raster scan of the
area over which it is flown and, for each resolution element along a scan,
disperses the incoming energy into its spectral components (typically visible and
infrared wavelengths, sometimes ultraviolet as well), and records the magnitudes
of these spectral components. Thus both spatial and spectral information are
available for characterizing the ground cover. Still another dimension, variation of
the ground cover with time, may be added if the remote sensor makes temporally
separated passes over the same scene.
A widely used source of remote sensing data is the family of Landsat satellites,
the first of which was launched in 1972. Each Landsat has a multispectral scanner
aboard designed to record imagery in four or five spectral bands. The orbital

I There are other important applications, however, such as seismic prospecting in the extractive
industries.


Fig. 1. Multispectral scanner (MSS).

characteristics of these sun-synchronous satellites cause each of them to scan the


entire earth during daylight hours every eighteen days. The ground resolution
element of the Landsat MSS is roughly 80 meters across; a Landsat 'frame',
covering a nearly square ground area 110 nautical miles on a side, contains
approximately 7.5 million picture elements, or pixels. 2 For further details, the
reader may consult [10].
Quantitative analysis of data having the sheer volume and the number of
associated variables typified by the Landsat data clearly warrants the power of
high-speed computers. For some years now, pattern recognition and other multi-
variate statistical methods have been employed for this purpose, primarily with
the objective of classifying each scene pixel into one of a number of candidate
ground-cover categories. The basis for the classification is the set of spectral
measurements corresponding to the pixels. The goals of ongoing research in this
area are to develop effective methods for incorporating spatial and temporal
information in the analysis process and to use the spectral information with
increasing effectiveness.
²A similar instrument, called Thematic Mapper, aboard the fourth Landsat (launch date: 1982) has seven spectral bands and a 30-meter resolution element.

In this chapter we look at some pattern recognition methods currently applied


to remote sensing data and consider prospects for further work. Our focus here is
limited to the pattern recognition/classification problem, although many applica-
tions of remote sensing data require extensive use of sampling methodology and
other classical statistical tools as well. The reader interested in these aspects is
referred to the discipline-oriented literature.

2. The rationale for using statistical pattern recognition

Pattern recognition, as it has evolved in that amalgamation of computer science


and engineering sometimes called 'artificial intelligence', can take a variety of
forms. The one of greatest immediate interest in this discussion is classification or
applied statistical decision theory. As noted earlier a frequent goal of remote
sensing data analysis is to classify each pixel in order to identify the observed
ground cover, hence the motivation for applying classification. Both the interac-
tion of natural random processes in the scene and the not infrequent occurrence
of unresolvable uncertainty motivate the use of statistical methods in order to
minimize the probability of erroneous classification. There is almost always some
confusion, of greater or lesser degree, arising from the fact that distinct ground
cover classes are sometimes spectrally indistinguishable. Uncertainty may even
arise from unreliable 'ground truth' or reference data used to characterize the
classes of interest.

3. A typical data analysis procedure [11]

The ensemble of data available for accomplishing the classification of the


remote sensing data consists generally of three components: (1) the primary
(remote sensing) data obtained by means of the sensor system; (2) reference data,
often called 'ground truth', consisting of a sample of (hopefully) reliable, on-site
observations from the scene to be classified, made at or near the time the primary
data were acquired, and used for the dual purposes of providing pre-labeled
measurement sets typical of the ground cover types of interest and allowing a
means of evaluating the results of the classification; (3) ancillary data, which may
be any of a wide variety of data types other than remote sensing data, of potential
value for enhancement of the data analysis process. Reference data are often
collected by ground visitation when this is economically and physically feasible. A
useful form of ancillary data is topographic information, particularly when terrain
relief is significant. Meteorological data sometimes serve as reference data,
sometimes as ancillary data.
The details of the remote sensing data analysis process may vary considerably
depending on the application and on the quantity of reference data available.
When a generous quantity of reference data is at hand, the classes can be defined
by partitioning the multivariate measurement space based on the relative loca-

tions of the reference data in the space. This mode of analysis is called supervised
because the data analyst can 'supervise' the partitioning of the measurement
space through use of the reference data. In contrast to this, unsupervised analysis
is required when reference data are scarce. Apparent clustering tendencies of the
data are used to infer the partitioning, the reference data then being used only to
assign class labels to the observed clusters. Supervised classification is at once the
most powerful and most expensive mode of analysis. Most practical analysis
procedures generally consist of a tradeoff between the purely supervised and
purely unsupervised modes of analysis.
Assuming that the information has been specified which is required by the
application at hand and that all necessary data have been collected and pre-
processed, the data analysis procedure will consist of the following steps:
Step 1. Locate and extract from the primary data the measurements recorded
for those areas for which reference data are available.
Step 2. Define or compute the features on which the classification is to be
based. These may consist of all or a subset of the remote sensing measurements or
a mathematical transformation thereof (often a linear or affine transformation).
Ancillary variables may also be involved.
Step 3. Compute the mathematical/statistical characterization of the classes of
interest.
Step 4. Classify the primary data based on the characterization.
Step 5. Evaluate the results and refine the analysis if necessary.
Steps 1 through 3 are usually referred to as 'training the classifier', terminology
drawn from the pattern recognition technology. In practice there is considerable
overlap and interaction among all of the steps we have outlined. The two most
crucial aspects of the process are: (1) determining a set of features which will
provide sufficiently accurate discrimination among the classes of interest; and (2)
selecting a decision rule--the classifier--which can be implemented in such a
way as to provide the attainable classification accuracy at minimal cost in terms
of time and computational resources.

4. The Bayesian approach to pixel classification [11, Chapter 3]

The decision rule most commonly employed for classifying multispectral remote sensing data is based on classical decision theory. Let the spectral measurements for a point to be classified comprise the components of a random vector $X$. Assume that a pixel is to be classified into one of $m$ classes $\omega_i$, $i = 1, 2, \ldots, m$. The classification strategy is the Bayes (minimum risk) strategy by which $X$ is assigned to the class $\omega_i$ minimizing the conditional average loss

$$\sum_{j=1}^{m} \lambda_{ij}\, p(\omega_j \mid X), \qquad (1)$$

where
$\lambda_{ij}$ = the cost of classifying a pixel into class $\omega_i$ when it is actually from class $\omega_j$;
$p(\omega_j \mid X)$ = the a posteriori probability that $X$ is from class $\omega_j$.
Typically, a 0-1 cost function is assumed, in which case the discriminant functions for the classification problem are simply

$$g_i(X) = p(X \mid \omega_i)\, p(\omega_i), \qquad i = 1, 2, \ldots, m, \qquad (2)$$

where $p(X \mid \omega_i)$ is the class-conditional probability of observing $X$ from class $\omega_i$ and $p(\omega_i)$ is the a priori probability of class $\omega_i$.
It is also common to assume that the class-conditional probabilities are multivariate normal. If class $\omega_i$ has the Gaussian (normal) distribution with mean vector $U_i$ and covariance matrix $\Sigma_i$, then the discriminant functions become

$$g_i(X) = \log_e p(\omega_i) - \tfrac{1}{2}\log_e |\Sigma_i| - \tfrac{1}{2}(X - U_i)^{\mathrm{T}} \Sigma_i^{-1} (X - U_i), \qquad i = 1, 2, \ldots, m. \qquad (3)$$

Note that once the class parameters $p(\omega_i)$, $U_i$ and $\Sigma_i$ (and the determinant and inverse of $\Sigma_i$) are given, only the quadratic term in the discriminant functions depends on $X$ and must be recomputed as the pixel-by-pixel classification proceeds.
In practice, the sample means and sample covariance matrices are estimated during the training phase of the analysis procedure because the population parameters are unknown.
Although there are cases in which the Gaussian assumption may be question-
able (to begin with we recognize that the domain of X is discrete whereas we have
approximated its distribution by a continuous probability function), more than a
decade of experimental work has demonstrated that, for remote sensing applica-
tions, classifiers based on this assumption are relatively insensitive even to
moderately severe violations of the Gaussian assumption. In fact, from a practical
standpoint these classifiers represent, in general purpose use, a very good tradeoff
between classification performance (accuracy) and cost (speed and complexity of
the implemented classifier). It is important, however, that classes having clearly
multimodal distributions be partitioned into unimodal subclasses and that suffi-
cient 'training samples' be available to estimate adequately the class mean vectors
and covariance matrices.
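As a concrete illustration of how (2) and (3) might be implemented, the sketch below (not taken from the original text; the numpy-based function names are our own, and the priors are, purely for illustration, estimated as training-set proportions) trains a Gaussian classifier from labeled training pixels and assigns each pixel of a scene to the class with the largest discriminant value.

import numpy as np

def train_gaussian_classes(train_pixels, train_labels):
    # Estimate p(omega_i), U_i and Sigma_i for each class from labeled training pixels.
    params = []
    for c in np.unique(train_labels):
        Xc = train_pixels[train_labels == c]
        params.append((c,
                       Xc.shape[0] / train_pixels.shape[0],  # prior p(omega_i)
                       Xc.mean(axis=0),                      # sample mean vector U_i
                       np.cov(Xc, rowvar=False)))            # sample covariance Sigma_i
    return params

def classify_pixels(pixels, params):
    # Evaluate the discriminant functions of eq. (3) for every pixel and every class,
    # then pick the class with the largest value.
    scores = []
    for _, prior, mean, cov in params:
        inv = np.linalg.inv(cov)                     # Sigma_i^{-1}, computed once per class
        logdet = np.linalg.slogdet(cov)[1]           # log |Sigma_i|
        d = pixels - mean
        quad = np.einsum('ij,jk,ik->i', d, inv, d)   # (X - U_i)^T Sigma_i^{-1} (X - U_i)
        scores.append(np.log(prior) - 0.5 * logdet - 0.5 * quad)
    best = np.argmax(np.vstack(scores), axis=0)
    return np.array([p[0] for p in params])[best]

Only the quadratic term varies from pixel to pixel; the priors, log-determinants and inverse covariance matrices are fixed once training is complete, which is what keeps the pixel-by-pixel loop inexpensive.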

5. Clustering

'Clustering' or cluster analysis is often an important component of multispectral data analysis. It is used for decomposition of multimodal class distributions
when the Gaussian assumption is invoked and for partitioning the measurement
space based on natural groupings in the data. The latter use is often referred to as
'unsupervised classification'.
Unsupervised classification has two interesting and somewhat different applica-
tions in the analysis process. As noted in a previous section, in the early stages of
a supervised analysis procedure it is necessary to locate and extract from the
primary remote sensing data those areas for which reference data are available.
This process is greatly facilitated if the data can be displayed in such a way as to
enhance both the spectral similarities and differences in the data. Such enhance-
ment causes objects or fields and the boundaries between them to be more
sharply defined. Clustering techniques accomplish this nicely since in general they
aim to partition the data into clusters such that within-cluster variation is
minimized and between-cluster variation is maximized. After this unsupervised
classification is performed on every pixel in an area of interest, the results may be
displayed using contrasting tones or distinctive colors, making it easier to find
and identify landmarks in the scene.
On the other hand, unsupervised classification is the key process in an
unsupervised analysis procedure, used when reference data is in too short supply
to provide for adequate estimation of the class distributions. In this case,
clustering is applied after which such reference data as are available are used to
'label' the clusters, i.e., to infer the nature of the ground cover represented by
each cluster. Clearly this approach will suffice only when the clustering algorithm
can successfully associate unique clusters or sets of unique clusters with the
various ground cover classes of interest.
There are a great many multivariate clustering techniques available; many more
than we have space to describe here have been applied to remote sensing data
analysis. Most are iterative methods which depend on 'migration' of cluster
centers until a stopping criterion is satisfied. They differ primarily in two
respects:
(1) The distance measure used on each iteration for assigning each data vector
to a 'cluster center'.
(2) The method used to determine the appropriate number of clusters. This
usually involves splitting, combining and deleting clusters according to various
'goodness' measures.
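As a minimal concrete instance of such an iterative scheme, the following sketch (our own illustration, not from the text) performs a plain migrating-means (k-means) iteration with a fixed number of clusters and Euclidean distance; operational remote sensing algorithms add the splitting, combining and deleting steps just described.

import numpy as np

def kmeans_clusters(pixels, k, iters=20, seed=0):
    # Assign each pixel vector to the nearest cluster center (Euclidean distance),
    # then move each center to the mean of its assigned pixels; repeat.
    rng = np.random.default_rng(seed)
    centers = pixels[rng.choice(pixels.shape[0], size=k, replace=False)]
    for _ in range(iters):
        dist = np.linalg.norm(pixels[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([pixels[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):    # centers have stopped migrating
            break
        centers = new_centers
    return labels, centers

The quantities the user must supply here (the number of clusters k, the iteration limit, the initial centers) are examples of the user-specified parameters whose influence is discussed next.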
Unfortunately, the precise behavior of most clustering methods depends highly
on several user-specified parameters. The parameters provide implicitly the defini-
tion of a 'good' clustering, which is often very application-dependent. Since there
is no objective means for relating the characteristics of the application to the
parameter values, these values are usually determined by trial-and-error, the user
or data analyst playing an essential (subjective) role in the process. This is often
viewed as a significant shortcoming of the data analysis procedure when opera-
tional use requires a procedure which is strictly objective, repeatable and as
'automatic' as possible. A considerable amount of research on clustering has been
motivated by the needs of remote sensing data analysis [1].

6. Dimensionality reduction
Determining the feasibility of using computer-implemented data analysis for
any particular remote sensing application often reduces to assessing the cost of
the required computations. Judicious choice of the algorithms employed for the
analysis is essential. They must be powerful enough to achieve the required level
of accuracy (assuming the level is achievable) but not so complex as to be
prohibitively expensive in terms of computational resources they require. Some-
times the most effective way to achieve computational economy is to reduce the
dimensionality of the data to be analyzed, either by selecting an appropriate
subset of the available measurements or by transforming the measurements to a
space of lower dimensionality.
For selecting best feature subsets, an approach is to choose those $p$ features which maximize the weighted average 'distance' $D_{\text{ave}}$ between pairs of classes, where

$$D_{\text{ave}} = \sum_{i=1}^{m} \sum_{j=1}^{m} p(\omega_i)\, p(\omega_j)\, D_{ij}. \qquad (4)$$

$D_{ij}$ may be any of several measures of 'statistical distance'. Historically, the divergence was the first to be employed [7], but for multiclass problems a more suitable choice is the Jeffreys-Matusita distance [14]

$$D_{ij} = \left\{ \int_X \left[ \sqrt{p(X \mid \omega_i)} - \sqrt{p(X \mid \omega_j)} \right]^2 \mathrm{d}X \right\}^{1/2}. \qquad (5)$$

Under the Gaussian assumption, this reduces to a closed-form expression:

$$D_{ij} = \left[ 2\left( 1 - e^{-\alpha} \right) \right]^{1/2} \qquad (6)$$

where $\alpha$ is the Bhattacharyya distance between classes $\omega_i$ and $\omega_j$,

$$\alpha = \tfrac{1}{8}\, (U_i - U_j)^{\mathrm{T}} \left[ \frac{\Sigma_i + \Sigma_j}{2} \right]^{-1} (U_i - U_j) + \tfrac{1}{2} \ln \frac{\left| (\Sigma_i + \Sigma_j)/2 \right|}{\left( |\Sigma_i|\, |\Sigma_j| \right)^{1/2}}.$$

Notice from (6) that the Jeffreys-Matusita distance 'saturates' as the separability ($\alpha$) increases, the property which makes it behave functionally in a manner similar to classification accuracy and accounts for its good performance as a predictor of classification accuracy. The appropriate value of $p$ may be determined by finding the optimal value of $D_{\text{ave}}$ for each of several candidate values of $p$, plotting the results ($D_{\text{ave}}$ versus $p$) and observing the point beyond which little increased separability is gained by increasing $p$.
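Under the Gaussian assumption, (4)-(6) can be evaluated directly from the class statistics. The sketch below (our own illustration; the helper names are assumptions) computes the weighted average Jeffreys-Matusita distance of (4) for a candidate feature subset, the quantity one would evaluate for several values of p and plot.

import numpy as np
from itertools import combinations

def jm_distance(mean_i, cov_i, mean_j, cov_j):
    # Jeffreys-Matusita distance of eq. (6), with alpha the Bhattacharyya distance.
    d = mean_i - mean_j
    cov_avg = 0.5 * (cov_i + cov_j)
    alpha = (0.125 * d @ np.linalg.inv(cov_avg) @ d
             + 0.5 * np.log(np.linalg.det(cov_avg)
                            / np.sqrt(np.linalg.det(cov_i) * np.linalg.det(cov_j))))
    return np.sqrt(2.0 * (1.0 - np.exp(-alpha)))

def average_jm(class_stats, subset):
    # Weighted average distance of eq. (4) over the features listed in `subset`;
    # class_stats is a list of (prior, mean, cov) triples, one per class.
    idx = np.array(subset)
    total = 0.0
    for (p_i, m_i, c_i), (p_j, m_j, c_j) in combinations(class_stats, 2):
        d_ij = jm_distance(m_i[idx], c_i[np.ix_(idx, idx)],
                           m_j[idx], c_j[np.ix_(idx, idx)])
        total += 2.0 * p_i * p_j * d_ij   # each unordered pair appears twice in (4)
    return total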
An alternative to subset selection is to apply a dimensionality-reducing trans-
formation to the measurements before classification. The use of linear transfor-

mations for this purpose has been studied to a considerable extent [2]. In this
case, the p-dimensional feature vector Y is derived from the n-dimensional
measurement X through the transformation Y = BX, where B is a p × n matrix of
rank p (p < n). Note that:
(1) If $X$ is assumed to be a normally distributed random vector, then so is $Y$. Specifically, if $X \sim N(U, \Sigma)$, then $Y = BX \sim N(BU, B\Sigma B^{\mathrm{T}})$.
(2) Subset selection may be considered a special case of linear transformation for which the $B$ matrix consists only of 0's and 1's, one 1 per row.
(3) Numerical techniques are available (refer to [2]) for determining $B$ so as to extremize an appropriate criterion such as that defined by (4) and (5).
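To make notes (1) and (2) concrete, a small sketch (our own illustration in numpy): the class statistics transform as $U \to BU$ and $\Sigma \to B\Sigma B^{\mathrm{T}}$, and a 0/1 matrix with one 1 per row reproduces subset selection as a special case.

import numpy as np

def transform_class_stats(B, mean, cov):
    # Statistics of Y = BX for a Gaussian class: note (1).
    return B @ mean, B @ cov @ B.T

def selection_matrix(subset, n):
    # 0/1 matrix with one 1 per row that extracts the chosen features: note (2).
    B = np.zeros((len(subset), n))
    for row, col in enumerate(subset):
        B[row, col] = 1.0
    return B

# Example: keep bands 0 and 3 of a 4-band measurement vector.
# B = selection_matrix([0, 3], 4); new_mean, new_cov = transform_class_stats(B, U, Sigma)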
For nontrivial problems, both of these dimensionality reduction approaches
require considerable computation, and for this reason suboptimal procedures are
sometimes employed. However, in practice the computational expense may well
be warranted when the total area to be classified is large and the consequent
saving of computation in classification is substantial.
One final comment is appropriate before closing this section. A well-known
general method for dimensionality reduction is principal components analysis or
the Karhunen-Loève transformation [6]. On the face of it, this approach seems
attractive because to apply it one need not have previously computed the statistics
of the classes to be discriminated. All that is needed is the lumped covariance
matrix for the composite data. However, a fundamental assumption underlying
this approach is that variance (or covariance) is the most important information-
bearing characteristic of the data. While this assumption is appropriate for signal
representation problems, it is not appropriate when the final goal is discrimina-
tion of classes. In the former case the objective is to capture as much of the data
variability as possible in as few features (linear combinations of the measure-
ments) as possible. In the latter the requirement is to maintain separability of the
classes, and variability is only of incidental interest. 'Canonical analysis', a
somewhat similar approach in terms of the mathematical tools used, provides a
better method of determining linear combinations of features while preserving
class separability [8].

7. An extension of the basic pattern recognition approach [4]

In many applications of remote sensing, the ground covers of interest tend to


occur as 'objects' or contiguous collections of pixels exhibiting comparatively
homogeneous measurements. This tendency can be used to advantage. The larger
the objects relative to the resolution of the sensor, the more accurate and efficient
the classification process can be made by classifying the entire objects rather than
individually classifying their constituent pixels. Historically, this approach was
first used in remote sensing to classify agricultural fields and hence was called
'per-field classification'. Somewhat more generally, it is referred to either as
sample classification (as opposed to point classification) or object classification (as
opposed to pixel classification).

As a practical matter, however, sample classification cannot be very useful for


automated image data analysis unless the samples or objects can be isolated by a
computer-implemented method. In this section we shall describe techniques for
scene partitioning and sample classification which together comprise a scene
classification methodology exhibiting the potential benefits suggested above.

7.1. Sample classification


Let $X = (X_1, X_2, \ldots, X_s)$ represent a set of $s$ pixels in some object and therefore a 'sample' from a population characterized by one of the class-conditional probability functions such as we have already discussed. A maximum likelihood sample classification strategy is defined as follows:

Assign $X$ to class $\omega_i$ if $\ln p(X \mid \omega_i) = \max_j \ln p(X \mid \omega_j)$, \qquad (7)

where $p(X \mid \omega_j)$ is the joint class-conditional probability of observing the sample $X$ from class $\omega_j$.
If the sensor system is properly designed and operated so that adjacent pixels do not cover overlapping ground areas, it is reasonable to assume that the pixel measurements are class-conditionally independent. Thus

$$p(X \mid \omega_j) = \prod_{i=1}^{s} p(X_i \mid \omega_j). \qquad (8)$$

If, further, the multivariate Gaussian assumption is again invoked, it may be shown that

$$\ln p(X \mid \omega_j) = -\tfrac{1}{2}\,\mathrm{tr}\!\left( \Sigma_j^{-1} S_2 \right) + U_j^{\mathrm{T}} \Sigma_j^{-1} S_1 - \tfrac{1}{2} s\, U_j^{\mathrm{T}} \Sigma_j^{-1} U_j - \tfrac{1}{2} s \ln |2\pi \Sigma_j| \qquad (9)$$

where

$$S_1 = \sum_{i=1}^{s} X_i \quad \text{and} \quad S_2 = \sum_{i=1}^{s} X_i X_i^{\mathrm{T}}$$

are sums taken over all pixels in the object to be classified. Notice two facts:
(1) Only two terms in (9) depend on the object to be classified and need to be
computed for each classification.
(2) The expression for the 'log-likelihood' (9) is valid for the case s = 1, so that
no problem develops when a single-pixel object is encountered.
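A sketch of how (9) might be computed (our own illustration in numpy; all names are assumptions): the per-class constants are evaluated once, and only the terms involving $S_1$ and $S_2$ are evaluated per object, echoing fact (1) above.

import numpy as np

def precompute_class_terms(class_stats):
    # Per-class quantities of eq. (9) that do not depend on the object;
    # class_stats is a list of (mean U_j, covariance Sigma_j) pairs.
    terms = []
    for U, S in class_stats:
        Sinv = np.linalg.inv(S)
        n = U.shape[0]
        const = (-0.5 * U @ Sinv @ U
                 - 0.5 * (n * np.log(2 * np.pi) + np.linalg.slogdet(S)[1]))
        terms.append((U, Sinv, const))
    return terms

def classify_object(pixels, terms):
    # Maximum likelihood sample classification, eqs. (7)-(9);
    # `pixels` is an (s, n) array holding the object's pixel vectors.
    s = pixels.shape[0]
    S1 = pixels.sum(axis=0)      # S_1 = sum of the pixel vectors
    S2 = pixels.T @ pixels       # S_2 = sum of the outer products X_i X_i^T
    loglik = [-0.5 * np.trace(Sinv @ S2) + U @ Sinv @ S1 + s * const
              for U, Sinv, const in terms]
    return int(np.argmax(loglik))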

7.2. Scene partitioning


In general, scene partitioning methods may be categorized broadly as (1)
boundary seeking or object seeking, and (2) conjunctive or disjunctive. The
method to be described here is a two-level conjunctive object-seeking method [4].
Initially the scene is divided by a rectangular grid into small groups of pixels,
typically a 2 × 2 array of four pixels. At the first conjunctive level, each group

satisfying a relatively mild homogeneity test becomes a cell. If the group fails the
test, it is assumed to overlap an object boundary and the pixels are then classified
individually. At the second conjunctive level, adjacent cells which satisfy another
test are serially merged into an object. By successively 'annexing' adjacent cells,
each object expands as much as possible (as defined by the test criterion) and is
subsequently classified by maximum likelihood sample classification.
For practical reasons it is important that this scene partitioning algorithm,
together with the maximum likelihood sample classifier, can be implemented in a
sequential fashion, accessing the pixel data only once and in the raster order in
which they are stored on the tape.
The scene partitioning can be implemented in a 'supervised' mode which makes
use of the statistical characterization of the pattern classes, or in an unsupervised
mode which does not require such an a priori characterization. Given the
objectives of this section, we shall describe only the tests for the former.
Define the quantity

$$Q_j(X) = \sum_{i=1}^{s} (X_i - U_j)^{\mathrm{T}} \Sigma_j^{-1} (X_i - U_j) = \mathrm{tr}\!\left( \Sigma_j^{-1} \sum_{i=1}^{s} X_i X_i^{\mathrm{T}} \right) - 2\, U_j^{\mathrm{T}} \Sigma_j^{-1} \sum_{i=1}^{s} X_i + s\, U_j^{\mathrm{T}} \Sigma_j^{-1} U_j \qquad (10)$$

where $X_i$ is the $i$th pixel vector in the group being tested, $s$ is the number of pixels in the group, and $U_j$ and $\Sigma_j$ are, respectively, the mean vector and covariance matrix for the $j$th training class (again we have invoked the multivariate Gaussian model). Let $\omega^*$ be the class for which the log-likelihood of the group is maximum, i.e.,

$$\ln p(X \mid \omega^*) = \max_j \ln p(X \mid \omega_j) = \max_j \left[ -\tfrac{1}{2} Q_j(X) - \tfrac{1}{2} s \ln |2\pi \Sigma_j| \right] \qquad (11)$$

and let $Q^*(X)$ be the value of the corresponding quadratic form, as in (10). A group of pixels becomes a cell if $Q^*(X) \le c$, where $c$ is a user-specified threshold
value. This criterion tends to 'reject' both inhomogeneous groups and 'unrecog-
nizable' groups (unlikely to belong to any of the defined training classes). Also,
under the Gaussian model, the distribution of the Qj values is chi-square with
s × n degrees of freedom (recall that n is the dimensionality of the pixel vectors), a
fact which can be used in determining appropriate values for the threshold
parameter c.
Now let $Y$ be an object (one or more cells) and $X$ an adjacent cell not previously annexed to an object. Annexation of the cell by the object is based on the statistic

$$A = \frac{\max_i\, p(X \mid \omega_i)\, p(Y \mid \omega_i)}{\max_i\, p(X \mid \omega_i)\; \max_j\, p(Y \mid \omega_j)}. \qquad (12)$$

Observe that $0 \le A \le 1$ and that $A = 1$ when both $p(X \mid \omega_i)$ and $p(Y \mid \omega_i)$ are maximum for the same class. Thus, the cell is assumed to belong to the same class as the object and is annexed to the object if $A \ge T$, where $T$ is a user-specified threshold.
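The two tests can be sketched as follows (our own illustration, reusing the Gaussian log-likelihood of eqs. (9)-(11); the thresholds c and T and all function names are assumptions). As noted above, c might for instance be chosen as an upper quantile of the chi-square distribution with s × n degrees of freedom.

import numpy as np

def group_log_likelihoods(pixels, class_stats):
    # ln p(X | omega_j) of eq. (11) for a group of pixels, via Q_j of eq. (10);
    # class_stats is a list of (U_j, Sigma_j). Returns (log-likelihoods, Q values).
    s, n = pixels.shape
    loglik, Q = [], []
    for U, S in class_stats:
        Sinv = np.linalg.inv(S)
        d = pixels - U
        q = np.einsum('ij,jk,ik->', d, Sinv, d)        # eq. (10)
        logdet = n * np.log(2 * np.pi) + np.linalg.slogdet(S)[1]
        loglik.append(-0.5 * q - 0.5 * s * logdet)     # eq. (11)
        Q.append(q)
    return np.array(loglik), np.array(Q)

def is_cell(group_pixels, class_stats, c):
    # A group becomes a cell if Q*(X) <= c for its best-fitting class.
    loglik, Q = group_log_likelihoods(group_pixels, class_stats)
    return Q[np.argmax(loglik)] <= c

def annex(cell_pixels, object_pixels, class_stats, T):
    # Annexation statistic A of eq. (12), computed from log-likelihoods.
    lx, _ = group_log_likelihoods(cell_pixels, class_stats)
    ly, _ = group_log_likelihoods(object_pixels, class_stats)
    log_A = np.max(lx + ly) - (np.max(lx) + np.max(ly))
    return np.exp(log_A) >= T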
Naturally the scene partitioning requires computational overhead not required
by pixel-at-a-time classification and this is why the greater efficiency of the
sample classification approach depends on the size of the objects in the scene
being large in comparison to the resolution of the sensor or pixel size. The larger
the objects, the greater the saving in computation required for classification per
se. Significantly, it is also possible to show [5] that the expected accuracy of
classification improves rapidly as the object size increases.

8. Research directions

To date, the statistical pattern recognition techniques most widely applied for
remote sensing data analysis have been brought to bear primarily on the spectral
space, although there have been some notable successes in attempts to augment
the spectral feature vector with, for example, texture features. Basically, however,
the fundamental approach has been to assume that the relationships among the
features can be characterized in relatively simple terms and that the scene of
interest can be classified a pixel at a time using standard decision-theoretical
methods.
There is a great wealth of information as yet virtually untapped in the remote
sensing data and the supporting environmental data which are usually available
in conjunction with it. Spatial characteristics, temporal variations, meteorological
data, soil background data, to name only a few, are significant information-
bearing factors which could be used profitably in the analysis process. Syntactic
scene analysis [3], contextual classification [12], and temporal feature extraction
[9] are some approaches which have begun to prove fruitful but will require
substantial research to develop their practical utility. The very complex relation-
ships among the multivarious forms of data found in the typical remote sensing
data base will not yield easily to the simple classification methods currently in
use. Generalized decision-theoretical methods are called for, and work is in
progress to develop and apply compound decision theory and hierarchical deci-
sion processes.
Finally, in the face of limited high quality reference data ('ground truth') which
must serve the dual purpose of providing both for training and testing classifiers,
there remains the problem of understanding how to best evaluate the results of
the analysis process [13]. Predictions and/or posterior estimates of classification
accuracy as well as biases and precision of the results are needed. It is not well
understood how classification accuracy per se reflects the quality of the results
achieved or achievable, especially when the objective is to obtain large area
estimates from classification of only a sample of the area.

References

[1] Bryant, J. (1979). On the clustering of multidimensional pictorial data. Pattern Recognition 11,
115-126.
[2] Decell, H. P. and Guseman, L. F., Jr. (1979). Linear feature selection with applications. Pattern
Recognition 11, 55-63.
[3] Fu, K. S. (1977). Syntactic Pattern Recognition. Springer, New York.
[4] Kettig, R. L. and Landgrebe, D. A. (1976). Classification of multispectral image data by
extraction and classification of homogeneous objects. IEEE Trans. Geosci. Electronics 14, 19-26.
[5] Kettig, R. L. (1975). Computer classification of remotely sensed multispectral image data by
extraction and classification of homogeneous objects. Ph.D. Thesis, Purdue University, West
Lafayette, IN.
[6] Kittler, J. and Young, P. C. (1973). A new approach to feature selection based on the Karhunen-Loève expansion. Pattern Recognition 5, 335-352.
[7] Marill, T. and Green, D. M. (1963). On the effectiveness of receptors in recognition systems.
IEEE Trans. Inform. Theory 9, 11-17.
[8] Merembeck, B. F. and Turner, B. J. (1979). Directed canonical analysis and the performance of
classifiers under its associated linear transformation. Proc. Symp. Machine Processing of Re-
motely Sensed Data. IEEE Cat. No. 79CH1430-8 MPRSD, IEEE Single Copy Sales, Piscataway,
NJ.
[9] Misra, P. N. and Wheeler, S. G. (1978). Crop classification with Landsat multispectral scanner
data. Pattern Recognition 10, 1-13.
[10] National Aeronautics and Space Administration. Earth resources technology satellite data users
handbook. NASA Goddard Space Flight Center, Greenbelt, MD.
[11] Swain, P. H. and Davis, S. M. (1978). Remote Sensing: The Quantitative Approach. McGraw-Hill,
New York.
[12] Swain, P. H., Tilton, J. C. and Vardeman, S. B. (1982). Estimation of context for statistical
classification of multispectral image data. IEEE Trans. Geosci. Remote Sensing 20 (4).
[13] Todd, W. J., Gehring, D. G. and Haman, J. F. (1980). Landsat wildland mapping. Photogram-
metric Engrg. and Remote Sensing 46, 509-520.
[14] Wacker, A. G. (1971). The minimum distance approach to classification. Ph.D. Thesis, Purdue
University, West Lafayette, IN.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 621-649

Optical Character Recognition--Theory and Practice*

George Nagy

1. Introduction

This article presents an overview of optical character recognition (OCR) for


statisticians interested in extending their endeavours from the traditional realm of
pattern classification to the many other alluring aspects of OCR.
In Sections 2-5 the most important dimensions of data entry are described
from the point of view of a project manager considering the acquisition of an
OCR system; major applications are categorized according to the type of data to
be converted to computer-readable form; optical scanners are briefly described;
and the preprocessing necessary before the actual character classification can take
place is discussed.
Section 6 outlines the classical decision-theoretic formulation of the character
classification problem. Various statistical approximations to the optimal classifier,
including dimensionality reduction, feature extraction, and feature selection are
discussed with references to the appropriate statistical techniques. Parallel classifi-
cation methods are contrasted with sequential methods, and special software and
hardware considerations relevant to the various methods are mentioned.
Section 7 expands the scope of the previous discussion from isolated characters
to the use of the contextual information provided by sequences of characters.
References are given to available collections of statistical data on letter and word
frequencies.
The paramount importance of accurate estimation of the error and reject rates
is discussed in Section 8. A fundamental relation between the error rate and the
reject rate in optimal systems is described, and the advantages and disadvantages
of various experimental designs are discussed. Major sources of error in OCR are
pinpointed and operational error rates for diverse applications are cited.
The bibliography contains references to authors specifically cited in the text as
well as to selected background reading.

*Article submitted in August 1978.


2. OCR problem characterization

Among the elements to be considered in the conversion of manual data-entry


operations to OCR are the permissible error rate; data volume, load distribution
and displacement costs; document characteristics; and character style and qual-
ity.

2.1. Error rate


The constituents of the error rate are the undetected substitution rate and the
reject rate. The cost of substitution errors is difficult to assess. When queried,
customers respond that they cannot tolerate any undetected errors. A realistic
baseline figure may be provided by key data entry rates: on typed plain text the
substitution rate (before verification) runs to about 0.1%. It is considerably higher
on 'meaningless' data such as lists of part-numbers, and even higher on hand-
printed forms (where verification may not reduce the error rate appreciably).
On low-volume systems rejected characters may be displayed to the operator on
a screen for immediate correction. On high-volume systems, however, the result-
ing loss of throughput would be intolerable, and documents with rejected char-
acters are merely stacked in a separate bin for subsequent manual entry. Reject
rates on commercial systems are therefore frequently quoted on a per-line or
per-document basis, and the cost of rejects must be calculated from the cost of
manual entry of the entire line or document. The computer attached to most
current OCR systems takes care of merging the corrected material with the rest of
the data to provide an essentially error-free file.
Commercial systems are normally adjusted to run at an equivalent per-
character reject-to-substitution rate of 10 : 1 or 100 : 1. The substitution rates are
two or three orders of magnitude below those reported in the academic literature
for similar material.

2.2. Data volume


The early OCR systems, like early computers, ran to hundreds of thousands
and even millions of dollars, and were economical only if they were able to
displace dozens or hundreds of keypunch operators. Current systems are eco-
nomical for data volumes as low as those corresponding to four or five manual
stations. A rule of thumb for manual data entry, regardless of the type of device
employed (keypunch, key-to-disk, key-to-tape, display terminal), is two keystrokes
per second. At the low end of the OCR spectrum, hand-held scanners (quoted
currently at below $1000 per system!) are used for shelf-inventories, library check
out, retail sales tickets, and other applications requiring only a restricted character
set.

2.3. Document format


Since the document transport accounts for a major portion of the cost of the
larger OCR systems, the data density on the document is an important considera-

tion. Turn-around documents with stylized characters, such as credit-card slips,


have only two or three machine-readable lines per document, and the document
transport must be able to carry several dozen documents per second under the
read head. Fortunately such documents are small and have rigid constraints on
size and paper quality. The read-field format is usually a programmable function,
and may be changed for each batch of documents, sometimes by means of a
machine-readable format field.
Page readers move only one or two documents per second, and each typed page
may contain up to 2000 characters. The vertical motion of the page is often
combined with the line-finding function. Perhaps the most demanding transport
requirements are those imposed by mail-sorting applications: size, thickness,
paper quality, and address location on the envelope are virtually uncontrollable.
Document transports are often combined with imprinters and sorters. The
imprinter marks the document (using either an OCR font or a bar code) for
subsequent ease of automatic routing in a manner similar to the imprinting of
checks with MICR (magnetic ink character recognition) codes by the first bank
handling the check. The sorter divides the documents according to the subsequent
processing necessary for each type of document (including reject entry).

2.4. Print quality


The design of the transducer, character acquisition, and classification systems
depends mainly on the character quality (see I.S.O. Standard R/1831: Printing
Specification for OCR). Some aspects of quality are independent of the spacing
and shape of the characters: these include the reflective properties of the paper
and of the material to be read, the presence of extraneous material on the
document, and the characteristics of the printing mechanism itself such as edge
definition and density variations within the symbols. Other aspects include the
disposition of the symbols on the page (line and character spacing and alignment),
variations in character size, the ratio of character size to average stroke width, the
variability of stroke-width within the symbols, the number of classes, and the
degree of differentiation among the most similar character pairs.
The parameters discussed in this section tend to be relatively similar within a
given class of applications. Commercial systems are therefore developed for each
class, and only minor adjustments are necessary to tune them to a specific task.
The next section describes six major classes of data-entry applications in decreas-
ing order of suitability for automation using currently available techniques.

3. Applications

3.1. Stylized fonts


Stylized typefaces are designed specifically for ease of machine recognition.
Consequently they are used mainly on turn-around documents where the organi-
zation responsible for converting the data to computer-readable form has full

Fig. 1. OCR-A font.

control over document preparation. Typical applications are credit card slips,
invoices, and insurance application forms prepared by agents. Special accurately
machined typeheads for certain stylized fonts are available for ordinary type-
writers which must, however, be carefully aligned to satisfy the OCR-reader
manufacturers' specifications with regard to character skew, uniformity of impres-
sion, and spacing. Standards have also been promulgated with respect to accept-
able ink and paper reflectance, margins, and paper weight.
Among the typefaces most popular in the United States is the 64-symbol OCR-A font (Standard USAS X3.17) which is available in three standard sizes (Fig. 1). In the United States, OCR standards are set by the A.N.S.I. X3A1 Committee. In Europe the leading contender is the 113-symbol OCR-B font (Standard ECMA-11) which is aesthetically more pleasing, and which also includes lower-case letters (Fig. 2). Some OCR devices are capable of reading only subsets of the full character set, such as the numerals and five special symbols.

3.2. Typescript
With a well-aligned typewriter, carbon-film ribbon, and carefully specified
format conventions, single-font typewritten material can be readily recognized by
machine. Sans-serif typefaces generally yield a lower error rate than roman styles,
because serifs tend to bridge adjacent characters. When specifying the typeface, it

Fig. 2. OCR-B font (upper case only).

Fig. 3. Typewritten characters. (a) Elite, Adjudant, and Scribe fonts. (b) Digitized versions of '4' and
'F' from the three font styles shown above. (c) Examples of difficult to distinguish lower case letters.
Fig. 3 (continued).

is important to name the manufacturer as well: Underwood's Courier may not be
identical in design to Olympia's. Styles suitable for OCR will have a distinction
between the lower-case letter l and the numeral one, and between the upper-case letter O and zero.
Multi-font typewritten material commonly yields error rates in excess of 0.1%
because the shape of individual letters varies so much among the 2000 or so
typestyles available on U.S. typewriters (Fig. 3). Recognition logic has been,
however, designed for up to a dozen intermixed fonts. An example of a well-
established successful application is the OCR system operated by the State of
Michigan for reading typed State Medicaid billing forms submitted by upwards
of 25,000 individual providers of health services.
Only two different horizontal spacings are in common use. Elite or twelve-pitch
fonts have 12 characters to the inch; pica or ten-pitch fonts have 10 characters to
the inch. The words 'elite' and 'pica', incidentally, have a slightly different
meaning in the context of typeset text. Typewriters with proportional spacing
increase the difficulty of character segmentation, and normally yield error rates
many times higher than fixed-pitch material. The most common vertical spacing
is six single-spaced lines per inch, but material typed for OCR is almost always
double-spaced. A scanning resolution of 0.004" is sufficient for most typewritten
material.
Important applications include data entry, mail sorting, and manuscript
preparation using word-processing systems.

3.3. Typeset text


The major application for reading typeset text is automatic (computerized)
information retrieval. Examples are searching the three million U.S. patents
issued to date, retrieving relevant federal and state laws, and converting library
catalog cards to computer readable form. Most of these applications may,
however, eventually disappear when computerized typesetting becomes so preva-
lent that all newly published material is simultaneously made available in
computer readable form as a byproduct of typesetting.
The greatest problems with automatic reading of typeset material are the
immense number of styles (3000 typefaces are in common use in the United
States, each in several sizes and variants such as italic and boldface), and the
difficulty of segmenting variable-width characters. The number of classes in each
font is also larger than the 88 normally available on typewriters: combinations of
characters (called 'ligatures') such as fi, ffi, and fl usually appear on a single slug,
and there are many additional special symbols.
A resolution of about 0.003" is necessary for segmenting and recognizing
ordinary bookface characters (Fig. 4). The higher resolution is necessary because
the characters are smaller than in typescript, because of the variability in stroke
width within a single character (for instance, between the round and straight parts
of the letter 'e'), and because of the tight character spacing.

3.4. Handprinted characters


The recognition rate obtained on handprinted characters depends on the
number of writers, on the number of character classes, and on the training of the

Fig. 4. Typeset text. (a) Original. (b) Photoplotter output after digitization.

writers (see Handprint Standard X3.45). Applications include filling out short
forms such as driver's license applications, magazine renewals, and sales slips. At
one time automatic interpretation of coding forms was considered important, but
this application has lost ground due to the spread of interactive programming.
Individual experimenters can usually learn to print consistently enough to have
their machine recognize their own characters with virtually zero error. Although
many of the handprinted character recognition devices described in the literature
are trainable or adaptive in nature, in some successful applications most of the
adaptation takes place in the human. An example is the operational and success-
ful Japanese ZIP-code reader, where the characters are carefully printed in boxes
preprinted on the envelope.

3.5. Cursive writing


The recognition of cursive writing is not yet approaching commercial maturity
and serves primarily as an experimental vehicle for studying context-dependent
classification methods. It has many analogies with the recognition of continuous
speech, but because it lacks the wide applicability of the latter, the recognition of
cursive writing has had far less effort devoted to it. Recently, however, several
large-scale experiments have been published on signature verification with on-line
capture of the stylus motion. Typical results show a rejection rate of 3% and a
false acceptance rate of 0.2% among a population of 300 writers including
amateur forgers.

3.6. Special alphabets


Among special alphabets of interest in optical character recognition are cyrillic
characters, Chinese and Japanese characters, and map symbols. The classification
of printed and typed cyrillic characters differs little from that of latin characters;
the most difficult pair is usually Ш and Щ. The recognition of Chinese ideographs
is complicated by the immense number of classes (5000 symbols in ordinary
newspapers) and the complexity of individual characters. A simplifying factor,
however, is the fact that only half-a-dozen different typefaces are in common use.
Japanese OCR requirements include Kanji characters (essentially the Chinese
ideographs), Katakana characters (55 different symbols), as well as the latin
alphabet. Because typewriters are still much less prevalent in the Far East than in
the West, the classification of handprinted characters takes on added importance.

4. Transducers

On examination under a magnifying glass, most printed material intended for
optical character recognition appears well defined and relatively simple to clas-
sify. Digitization by means of an optical scanner, however, frequently obliterates
the significant distinctions between patterns corresponding to the different classes,
and introduces distortions which greatly complicate recognition. The technology
for accurate and faithful conversion of grey-scale material to digital form does of
course exist, but the necessary apparatus--flat-bed and rotating drum micro-
densitometers--is far too expensive for most academic research programs and far
too slow for commercial application. The challenge in optical character recogni-
tion is, in fact, to classify characters as accurately as possible with the lowest
possible spatial quantization and the minimum number of grey levels in the
transducer. Most OCR scanners therefore operate at a spatial resolution barely
sufficient to distinguish between idealized representations of the classes and
convert all density information to binary--black and white--form.
The remainder of this section outlines the principal characteristics of optical
transducers used for character recognition either in the laboratory or in the field.

4.1. Device types


Optical scanners may be divided into flying spot devices, where successive
portions of the document are illuminated in turn and all of the reflected or
transmitted light is collected to determine whether the illuminated spot was black
or white, and flying aperture devices, where the entire document is illuminated but
light is collected only from a single spot. It is also possible to combine the two
methods and both illuminate and observe only a single spot at a time; this
expensive arrangement results in greatly improved signal-to-noise ratio and
therefore more accurate grey-scale quantization.
Scanners may also be categorized according to whether they operate with light
reflected from a document or light transmitted through a transparent image of the
document. Since the optical design of transparency scanners is somewhat simpler
they are sometimes used in laboratory research, but photographing every docu-
ment is not generally practicable in a production environment.
The devices themselves may be crudely classified according to the mechanism
used to address successive portions of the document: mechanical, television
camera, cathode-ray tube, and solid state scanners are the most common. Some
scanners utilize hybrid combinations of these basic types.
Most commercial OCR systems use mechanical motion of the document in the
vertical direction to scan successive lines of print. The earlier machines also used
mechanical motion to scan across each line by means of a rotating or oscillating
mirror or a prism assembly. The information from a single vertical column in a
character, or even from an entire character, was collected by means of a photocell
array. The development of fiber optics added another option to the design of
mechanical systems. Well-designed mechanical scanners have excellent geometric
properties and virtually unlimited resolution, but are restricted to rigid scan
patterns. Because of the necessity for immunity to vibration and the need for
extremely rapid motion for high throughput, they are also very expensive.
Television cameras in a flying-aperture reflection-scanner configuration are a
favorite device for laboratory pattern recognition research, but their geometric
resolution is barely sufficient for page-reading applications. Most laboratory
cameras have poor linearity and a resolution of only about 400 × 400 elements, in
contrast to the 2000 × 2000 array required for a printed page. Attempts to
combine mechanical movement with commercially available television cameras
have not proved successful.
Both research and commercial systems have made use of cathode-ray tube
flying spot scanners, where each spot on the document is illuminated in turn by
the glow of phosphor from the corresponding spot on the screen of the tube and
all of the reflected or transmitted light is collected by means of a photomultiplier
tube array. Although the scan pattern of such systems is extremely versatile, since
the deflection plates can be placed under direct computer control, careful--and
expensive--optical design is required to provide acceptable geometric and
densitometric fidelity. The geometric resolution is just barely sufficient for page
scanners, but the positional linearity can be considerably increased by incorporat-
ing feedback from a built-in reseau pattern.
Most current systems use solid-state scanners. Flying spot devices use linear or
two-dimensional light-emitting diode (LED) arrays. Chips with 4096 × 1 and
256 × 256 arrays are already available commercially. Flying-aperture devices use
arrays of photodiodes or phototransistors in similar configurations. Self-scanned
arrays, where the addressing mechanism necessary to activate successive devices is
built into the chip itself, simplify the electronic design. The optical design is also
relatively simple since current microfabrication techniques ensure geometric fidel-
ity and produce devices small enough to be used at 1 : 1 magnification or less. The
amplitude response of all the elements within a chip is typically within a few
percent of the mean.

4.2. Geometric characteristics
The major geometric characteristics of optical scanners are:
(1) Resolution. The exact measurement of the two-dimensional optical transfer
function is complex, but for OCR purposes one may equate the effective spot size
to the spatial distance between the 10% and 90% amplitude-response points, as a
black-white knife-edge is advanced across the scanning spot. Standard test charts
with stripe patterns are also available to measure the modulation as a function of
spatial frequency.
(2) Spot shape. A circular Gaussian intensity distribution is normally preferred,
but a spot elongated in the vertical direction reduces segmentation problems.
(3) Linearity. This characteristic, which may be readily measured with a grid
pattern, is important in tracking margins and lines of print. Solid-state devices are
inherently linear, but cathode-ray tubes require pin-cushion correction to com-
pensate for the increased path-length to the corners of the scan field. Skew is
sometimes introduced by the document transport.
(4) Repeatability. It is important to be able to return exactly to a previous spot
on the document, for example to rescan a character or to repeat an experiment.
Short-term positional repeatability is generally more important--and better--than
long-term repeatability.

4.3. Photometric characteristics
(1) Resolution. The grey-scale resolution may also be measured in different
ways. To some investigators it means the number of grey levels that may be
reliably discriminated at a specific point on the document; to others it means the
number of discriminable grey levels regardless of location. The latter interpreta-
tion is considerably more stringent, since a flat illumination or light collection
field is difficult to achieve without stored correction factors. Quantum noise also
affects the grey-scale resolution, but the signal-to-noise ratio may be generally
increased at the expense of speed by integrating the signal.
(2) Linearity. Grey-scale linearity is meaningless for binary quantization, but if
several grey-levels are measured, then an accurately linear or logarithmic ampli-
tude response function may be desirable. Standard grey-wedges for this measure-
ment are available from optical suppliers.
(3) Dynamic range. For document scanners a 20:1 range of measurable
reflectance values is acceptable. Transparency scanners, however, may provide
adequate response over 3.0 optical density units (1000 : 1).
(4) Repeatability. It is important that the measured grey level be invariant with
time as well as position on the document.
(5) Spectral match. The spectral characteristics of the source of illumination
and of the detector should be closely matched to those of the ink and of the
paper. The peak response of many OCR systems is in the invisible near-infrared
region of the spectrum.

4.4. Control characteristics
Most OCR devices operate in a rigid raster scan mode where the scan pattern is
independent of the material encountered. Line followers, however, track the black
lines on the page and have been used principally for the recognition of hand-
printed characters. Completely programmable scanners are invaluable for experi-
mentation, but have become a strong commercial contender only with the advent
of microprocessors. Programmable scanners allow rescanning rejected characters,
increase throughput through lower scan resolution in blank areas of the page, and
reduce storage requirements for line-finding and character isolation.

5. Character acquisition

The conversion of the pattern of black-and-white areas constituting an entire
document into the set of smaller patterns which serve as the input to the character
classification algorithm has not received much attention in the literature. In
academic investigations the challenging problems associated with this area of
OCR are usually circumvented either through the utilization of specially prepared
and formatted documents or through manual methods of isolating the characters.
Needless to say, neither of these approaches is practicable for operational OCR
machines.

The preprocessing necessary for classification must satisfy two basic require-
ments. The first requirement is to locate each character on the page in an order
that 'makes sense' in the coded file which eventually results from classification.
The second requirement is to present each isolated character to the recognition
algorithm in a suitable form.

5.1. Format control and line finding


The difficulty of locating each character in succession depends, of course, on
the application at hand. With stylized characters the format and line spacing are
usually rigidly controlled. Material which need not be considered is often printed
in ink with spectral characteristics which are invisible to the optical scanner, and
special symbols guide the scanner to each field of data. Adequate spacing is
provided to eliminate any horizontal or vertical interference between characters.
With typeset material, however, format problems may be almost insurmount-
able without some form of human intervention. Consider, for example, a two-
column magazine article interrupted by a two-column wide illustration. Even if
the system recognizes correctly the presence of the illustration after processing the
top left column, should processing continue in the left-hand column below the
illustration or at the top line of the right-hand column? Other challenging
problems occur in scanning formulas, tables, and alphanumeric material on
technical drawings and maps.
None of these tasks can be performed with currently available commercial
systems. Satisfactory line-finding algorithms are, however, available. When the
printed lines are straight and widely spaced, almost any algorithm will work.
Closely spaced lines may be located by projecting the grey-level distribution on
the vertical axis. For typeset and typewritten serif material, there are usually four
peaks to every line: two major ones at the top and bottom of the lower case
characters, and two minor ones at the ends of the ascenders and descenders.
Because the peaks tend to spread out if the material is not perfectly aligned with
the axes of the digitizer, usually only a column of material a few inches wide is
used for line location. Modifications are then necessary to locate short lines.
Fourier transform methods have also been proposed for line and character
location, but the periodicities of printed matter are seldom regular enough for this
purpose. Adaptive line following algorithms are capable of compensating for
baseline drift in the lines of characters. Individual character misalignments are
usually addressed after character isolation.
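
The projection approach is straightforward to express. The sketch below is a minimal illustration in Python, not taken from the text; the array convention and the threshold parameter are assumptions. It sums the black picture elements in each scan row of a binarized page (which, as noted above, may be restricted to a column only a few inches wide) and reports the runs of rows that belong to lines of print.

import numpy as np

def find_text_lines(page, min_black=5):
    """Locate lines of print in a binary page array (1 = black, 0 = white).

    The horizontal projection profile is the count of black elements in each
    scan row; runs of rows whose count exceeds min_black are reported as
    (top_row, bottom_row) pairs.
    """
    profile = page.sum(axis=1)
    in_line = profile > min_black
    lines, start = [], None
    for r, black in enumerate(in_line):
        if black and start is None:
            start = r                      # top of a line of print
        elif not black and start is not None:
            lines.append((start, r - 1))   # bottom of the line
            start = None
    if start is not None:
        lines.append((start, len(profile) - 1))
    return lines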

5.2. Character isolation
Imperfect separation between adjacent characters accounts for a large number
of misclassifications, although in experimental studies segmentation errors are
often omitted from the reported classification error rate. With stylized characters
and with typewritten material produced by well-adjusted typewriters the expected
segmentation boundaries can be accurately interpolated by correlating an entire
line with the many easily detectable boundaries that occur between narrow
characters. The problem is, however, more complicated if the geometrical linearity
and repeatability of the transducer itself is low relative to the character dimen-
sions.
If the scanner geometry is undependable (CRT scanners) or if the characters
are not uniformly spaced (typeset material), then there is no alternative to
character-by-character segmentation. This, in turn, requires a scanning aperture
two or three times smaller than that required for recognition to find narrow
zig-zag white paths between adjacent characters. Ad hoc algorithms depending,
for instance, on 'serif suppression', are used to separate touching characters. The
expected fraction of touching character pairs--as measured directly on the
document using ten-fold magnification--depends on the type-style, type-size,
printing mechanism, and on the ink-absorbing characteristics of the paper, but
typescript or printed material (serif fonts) with up to 10% touching characters is
not uncommon. The worst offender in this respect is probably the addressograph
machine: with a well-inked ribbon one may seldom see an unbridged pair of
characters within the same word.
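
For well-separated material the search for white paths degenerates to a search for all-white columns within a line of print. The following sketch is an illustration only; it assumes a binary line image and simply returns touching characters as one unsegmented pattern, which would then need the ad hoc treatment described above.

import numpy as np

def segment_characters(line_img):
    """Cut a binary line image (rows x columns, 1 = black) into characters.

    A character is taken to be a maximal run of columns containing at least
    one black element; all-white columns are treated as inter-character gaps.
    """
    col_has_ink = line_img.sum(axis=0) > 0
    patterns, start = [], None
    for c, ink in enumerate(col_has_ink):
        if ink and start is None:
            start = c
        elif not ink and start is not None:
            patterns.append(line_img[:, start:c])
            start = None
    if start is not None:
        patterns.append(line_img[:, start:])
    return patterns
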
Some character pairs simply cannot be segmented unless the individual compo-
nents are first recognized. Current classification algorithms, however, generally
require isolated character input. The development of optimal 'on-the-fly' recogni-
tion, combining segmentation with classification, is one of the major challenges
facing OCR.

5.3. Normalization, registration, and centering


Once the characters are isolated, it is necessary to transform them into the form
expected by the recognition algorithm. Generally, the more sophisticated the
algorithm, the less it is affected by difference in the size, orientation, and position
of the characters. Simple template matching algorithms, such as those used in
many commercial OCR systems, require that variations of this type be eliminated
before classification.
Since the size of the array processed by the recognition system is usually of
fixed size (say 20 × 30), it is advantageous to have each pattern fill as much of the
array as possible. Size normalization algorithms either expand or contract the
pattern until it just fits into the array, or adjust the pattern so as to fix
the horizontal and vertical second moment (standard deviation). Size normaliza-
tion is useful for multifont type-script, for printed matter, and for handprinted
characters.
Most character recognition algorithms are neither rotation nor skew invariant.
Typewritten characters may, however, have sizeable rotation variance due to
typewriter misadjustments or to document misalignment, and handprinted char-
acters tend to exhibit very significant skew. Rotation and skew correction may be
based on principal-components analysis or on enclosure of the pattern in one of a
variety of geometrical figures.
Registration or centering algorithms translate each pattern to a standard
position according to the location of its centroid. Computation of the horizontal
and vertical medians (the imaginary lines which divide the pattern into an equal
number of black bits in each quadrant) is more economical than computation of
the centroid, but both methods suffer from vulnerability to misalignments of the
printing mechanism, which causes one side, or top or bottom, of the character to
be darker than the other. An equally unsatisfactory alternative is lower-left-corner
(or equivalent) registration after stray-bit elimination.
Misregistrations of two or three pels (i.e., picture elements) in the vertical
direction and one or two pels horizontally are not unusual with the normally used
scanning resolution for typewritten characters. If uncorrected, such misregistra-
tion necessitates that template matching be attempted for all possible shifts of the
template with respect to the pattern within a small window (say, 7 × 5) in order to
guarantee inclusion of the ideal position.
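
A centroid registration step of the kind described above can be sketched as follows; the 20 × 30 array size matches the example used later in this chapter, while the function name and the rounding convention are assumptions made for the illustration. Median-based registration differs only in replacing the coordinate means by medians of the black elements.

import numpy as np

def register_on_centroid(pattern, out_rows=30, out_cols=20):
    """Translate a binary pattern so that its centroid lies at the array centre."""
    rows, cols = np.nonzero(pattern)
    out = np.zeros((out_rows, out_cols), dtype=pattern.dtype)
    if rows.size == 0:
        return out                                   # empty pattern: nothing to place
    dr = out_rows // 2 - int(round(rows.mean()))     # vertical shift to the centre
    dc = out_cols // 2 - int(round(cols.mean()))     # horizontal shift to the centre
    for r, c in zip(rows, cols):
        rr, cc = r + dr, c + dc
        if 0 <= rr < out_rows and 0 <= cc < out_cols:
            out[rr, cc] = 1
    return out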

6. Character classification

In a classic paper, Chow derived the minimum risk character classification
function in terms of the loss W(a_i, d_j) incurred when decision d_j is made and the
true class is a_i, and of the conditional probability functions P(v | a_k) of observing
the signal v when the class of the pattern under consideration is a_k. The
derivation makes no assumptions regarding the form of the underlying distribu-
tion of v. The possibility of rejecting a character (i.e., not assigning it to any class)
is considered by including a 'reject' decision d_0. Specific examples were given for
the case where the observations consist of Gaussian statistically independent
noise added to the ideal signal representing each class.
In the same paper, the decision criteria corresponding to the minimum error
classification are also derived. It is shown that the optimum decision rule is not
randomized in either case.
If the cost of misclassification is uniform regardless of the particular error
committed, then it can be shown that the optimal decision consists of selecting the
class a_k for which the a posteriori probability P(a_k | v) is the largest. If v is a
vector with discrete-valued components, as is usually the case in optical character
recognition, then Bayes' formula allows the computation of the a posteriori class
probabilities for a given observation v in terms of the conditional probability of
that observation given the class P(v | a_k), the a priori probability of the class
P(a_k), and the overall probability of the observation P(v):

P(a_k | v) = P(v | a_k) P(a_k) / P(v).

Since we are seeking to maximize the a posteriori probabilities, we can
eliminate P(v), which is common to each term, from consideration. For the
following discussion we can also forget about the a priori probabilities P(a_k). If
they are not all equal we can always take them into account as a weighting factor
at the end. Hence the important term is P(v | a_k), the probability of observing v
when the true class of the character is a_k.

Let us now consider the computational problems of estimating P(v | a_k) for all
possible values of v and a_k in an OCR environment. For the sake of concreteness,
let us assume that each observation v corresponds to the digitized grey values of
the scan field. If the number of observable grey levels is S, the number of
elements in the scan field is N, and the number of character classes is M, then the
total number of terms required is M × S^N. In a practical example we may have
S = 16 (the number of differentiable reflectance values), N = 600 (for a 20 × 30
array representing the digitized character), and M = 64 (upper and lower case
characters, numerals, and special symbols). The number of probability estimates
required is thus 64 × 16^600 = 2^2406, i.e., more than 10^700. The goal of much of the work in
character recognition during the last two decades has been, explicitly or im-
plicitly, to find sufficiently good approximations, using a much smaller number of
terms, to the required probability density function.
Among the approaches tried are:
(1) Assuming statistical independence between elements of the attribute vector.
(2) Assuming statistical independence between subsets of elements of the
attribute vector (feature extraction).
(3) Computing only the most important terms of the conditional probability
density function (sequential classification).
In the next several paragraphs we will examine how these simplifications allow
us to reduce the number of estimations required, and the amount of computation
necessary to classify an unknown character. In order to obtain numerical com-
parisons of the relative number of computations, we shall stay with the above
example.

6.1. Binary observations


Assuming that the values of each component of the observation vector v are
restricted to two values does not in itself materially reduce the calculations (it
decreases the number of estimates required from M × S^N only to M × 2^N), but it
simplifies the subsequent discussion. We shall therefore henceforth assume that
the values of the reflectances are thresholded to black (1) and white (0). The value
of the threshold used to differentiate black from white may in itself depend on the
values of the neighboring elements or on other factors.
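
A minimal sketch of such a thresholding step is given below; the local-mean rule, the window size, and the offset are assumptions made for illustration rather than a prescription from the text.

import numpy as np

def binarize(grey, window=15, offset=8):
    """Threshold a grey-scale array to black (1) and white (0).

    The threshold at each element is the mean grey value over a local window
    minus a fixed offset, so that the black/white decision depends on the
    neighbouring elements, as mentioned in the text.
    """
    rows, cols = grey.shape
    out = np.zeros((rows, cols), dtype=np.uint8)
    h = window // 2
    for r in range(rows):
        for c in range(cols):
            patch = grey[max(0, r - h):r + h + 1, max(0, c - h):c + h + 1]
            out[r, c] = 1 if grey[r, c] < patch.mean() - offset else 0
    return out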

6.2. Statistical independence


If we assume that for a given class of characters the observations are statisti-
cally independent from one another, then we may express the a posteriori
probability in product form:
P(v | a_k) = Π_j P(v_j | a_k) (the product running over j = 1, ..., N), where v_j is the j-th component of v. Instead of
estimating 2^N values for each class, we need estimates only for N values (since
P(v_j = 1 | a_k) = 1 - P(v_j = 0 | a_k)). Furthermore, the multiplication operations nec-
essary to compute the a posteriori probabilities can be replaced by additions by
taking logarithms of both sides of Bayes' equation, as shown by Minsky. Since we
need determine only the largest of the a posteriori probabilities, and the logarithm
function is monotonic, taking logarithms preserves the optimal choice.
This derivation leads to the weighted mask approach, where the score for each
class is calculated by summing the black picture elements in the character under
consideration with each black point weighted by a coefficient corresponding to its
position in the digitized array. The contribution of the white points can be taken
into account by the addition of a constant term. The character with the highest
score is selected as the most likely choice. If none of the scores are high enough,
or the top two scores are too close, then the character is rejected.
It may be noted that the weighted mask approach implemented through resistor
networks or optical masks was used in experimental OCR systems long before the
theoretical development was published. A special case, called template matching
or prototype correlation, consists of restricting the values of the coefficients
themselves to binary or ternary values; when plotted, the coefficients of the black
points resemble the characters themselves. A further reduction in computation
may be obtained by discarding the coefficients which are least useful in dis-
criminating between the classes, leading to peephole templates.
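
The weighted-mask computation can be sketched in a few lines. In the sketch below the estimation of the coefficients from labelled samples, the variable names, and the two reject thresholds are illustrative assumptions; the bit probabilities are smoothed with an (m + 1)/(n + 2) rule of the kind quoted in Subsection 7.3.

import numpy as np

def train_masks(samples_by_class):
    """Estimate weighted masks from binarized training characters.

    samples_by_class maps a class label to an (n, N) array of 0/1 pattern
    vectors.  For each class the weight of a black element is
    log p - log(1 - p), and a constant term collects the contribution of
    the white elements, so that the score equals log P(v | a_k).
    """
    masks = {}
    for label, patterns in samples_by_class.items():
        p = (patterns.sum(axis=0) + 1.0) / (len(patterns) + 2.0)   # P(v_j = 1 | a_k)
        weights = np.log(p) - np.log(1.0 - p)
        constant = float(np.log(1.0 - p).sum())
        masks[label] = (weights, constant)
    return masks

def classify(pattern, masks, accept_score=-400.0, margin=5.0):
    """Weighted-mask classification with a simple reject rule."""
    scores = {k: float(w @ pattern) + c for k, (w, c) in masks.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    if scores[best] < accept_score or scores[best] - scores[runner_up] < margin:
        return 'reject'          # no score high enough, or top two too close
    return best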

6.3. Restricted statistical independence


An approach less restrictive than requiring complete statistical independence
consists of assuming that certain groups of observations vj are statistically
independent of each other, but that statistical dependencies exist within each
group. Each such group of variables may then be replaced by a new variable. If,
for instance, the a posteriori probability is expanded as follows:

P(v | a_k) = P(v_i | a_k) · P(v_j | v_i, a_k) · P(v_m | v_i, v_j, a_k) · ... ,

then one may assume that higher order dependences are insignificant, represent
the statistical relations in the form of a dependence tree, and restrict the computa-
tion to the most important terms, as shown by Chow.
This point of view also leads to a theoretical foundation for the ad hoc feature
extraction methods used in commercial systems. Straightline segments, curves,
corners, loops, serifs, line crossings, etc., correspond to groups of variables which
commonly occur together in some character classes and not in others, and are
therefore class-conditionally statistically dependent on one another. Even features
based on integral transform methods, such as the Fourier transform, can be
understood in terms of statistical dependences.
Feature selection and dimensionality reduction methods may also be considered
in the above context. Given a pool of features, each representing a group of
elementary observations vj, it is necessary to determine which set of groups most
economically represents the statistical dependences necessary for accurate estima-
tion of the a posteriori probability. The number of possible combinations of
features tends to be astronomical: if we wish to select 100 features from an initial
set of 1000 features, there are about 10^140 possible combinations. Most feature
selection methods therefore use ad hoc techniques, adding or eliminating one
feature at a time on the basis of some type of information-theoretic measure, such
as entropy.
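
The sketch below illustrates this style of selection in its simplest form, ranking binary features by the information gain each provides about the class label on its own; the measure, the data layout, and all names are assumptions for the example, and interactions between features are ignored.

import numpy as np

def entropy(counts):
    """Entropy (in bits) of the distribution given by an array of counts."""
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def select_features(X, y, how_many):
    """Rank binary features by information gain and keep the best ones.

    X is an (n_samples, n_features) 0/1 array and y the class labels.
    """
    classes = np.unique(y)
    base = entropy(np.array([(y == c).sum() for c in classes]))
    gains = []
    for j in range(X.shape[1]):
        gain = base
        for value in (0, 1):                     # condition on the feature value
            mask = X[:, j] == value
            if mask.any():
                cond = np.array([(y[mask] == c).sum() for c in classes])
                gain -= mask.mean() * entropy(cond)
        gains.append(gain)
    return list(np.argsort(gains)[::-1][:how_many])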

6.4. Sequential classification
In sequential classification only a subset of the observations vj is used to arrive
at a decision for the character identity; unused elements need not even be
collected. The classification is based on a decision tree (Fig. 5) which governs the
sequence of observations (picture elements) to be examined. The tree is fixed for a
given application, but the path traced through the tree depends on the character
under consideration.
The first element vj to be examined, called the root of the tree, is the same for
any new character, since no information is available yet as to which element
would provide the most information. The second element to be examined,
however, depends on whether the first element was black or white. The third
element to be examined depends, in turn, on whether the second element was
black or white. Each node in the tree, corresponding to a given observation vj,
thus has two offsprings, each corresponding to two other observations. No
observation v_j occurs more than once in a path. The leaves of the tree are labelled
with the character identities or as 'reject' decisions. Normally there are several
leaves for each character class (and also for the reject decision), corresponding to
the several character configurations which may lead to each classification.
Binary decision trees for typewritten characters typically have from 1000 to
10000 nodes. The path length through the tree, from root to leaf, may vary from
10 to 30 nodes. Character classes which are difficult to classify because of their
resemblance to other classes require longer paths through the tree since more
elements must be examined before a reliable decision can be reached. Several
different methods for designing decision trees are available; they differ consider-
ably with regard to the underlying model (deterministic or probabilistic) of
classification, design complexity, required sample size, and classification accuracy.

Fig. 5. Decision tree for character recognition. The numbers at the nodes represent the picture
elements. The left branch is taken if the element corresponding to the node is white in the character
under consideration, the right branch if it is black. The leaves yield either a definite identification of
the character class or a 'reject' decision.
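
Traversing such a tree is simple to express; the sketch below, with an assumed node representation, follows the white/black branching convention of Fig. 5 on a toy three-node tree.

def classify_by_tree(pattern, tree):
    """Classify a binary pattern with a fixed decision tree.

    pattern maps a picture-element index to 0 (white) or 1 (black).  An
    interior node is a tuple (element_index, white_subtree, black_subtree);
    a leaf is the name of a character class or 'reject'.
    """
    node = tree
    while isinstance(node, tuple):
        element, white_branch, black_branch = node
        node = black_branch if pattern[element] else white_branch
    return node

# A toy tree: element 103 is examined first, then one further element.
toy_tree = (103,
            (57, 'reject', 'A'),      # element 103 was white
            (211, 'B', 'reject'))     # element 103 was black

print(classify_by_tree({103: 1, 211: 0, 57: 0}, toy_tree))   # prints 'B'
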
Sequential classification based on decision trees may be applied, of course, also
to features other than black or white picture elements, but the design and
selection of features for sequential classification is even more complicated than
for the parallel methods discussed above. Other approaches to sequential classifi-
cation are based on Wald's sequential probability ratio test (SPRT), which
provides the optimum stopping rule to satisfy a required minimum probability of
correct classification.

6.5. Comparison of classification methods


Table 1 shows a comparison of several classification methods with respect to
the number of operations required for classification of a single character. The
table does not include any figures on recognition performance, which depends
very markedly on the extent to which the design assumptions correspond to the
data used for evaluation (see below). The estimates given for the example
mentioned earlier, with S = 2, M = 64 and N = 600 are the author's, and should at
best be considered tentative.

6.6. Implementation of classification algorithms


High-speed page readers require instantaneous classification rates of upward of
1000 characters per second for an average throughput of one document per
second (a double-spaced typewritten page contains about 1500 characters). If the
classification procedure is executed sequentially for each of 64 categories of
characters, then only 15 microseconds at most are available to carry out the

Table 1
Comparison of statistical classification algorithms. The comparison of the computational requirements
of the various classification methods is based on 64 character classes and 600 (20 × 30) binary picture
elements per character

Method                                              Computational requirement
Exhaustive search                                   2^600 ≈ 10^180 comparisons (600 bits each)
Complete pairwise dependence                        ½ × 64 × 600^2 ≈ 10^7 logical operations and additions
Complete second-order Markov dependence             64 × 1200 ≈ 80 000 logical operations and additions
Class-conditional independence (weighted masks)     64 × 600 ≈ 40 000 logical operations and additions
Mask-matching (binary masks)                        64 × 600 ≈ 40 000 logical operations and counts
Peephole templates (30 points each)                 64 × 30 = 1920 logical operations and counts
Complete decision tree                              600 one-bit comparisons
Decision tree--4000 nodes                           log_2 4000 ≈ 12 one-bit comparisons

necessary calculations for each category. Consequently most high-speed commer-
cial OCR machines use special hardware for recognition.
Prototype correlation and weighted masks are most often implemented by
means of optical comparisons, resistor summing networks, or high-speed parallel
digital logic circuitry. Position invariance is achieved by shifting the character to
three, five, fifteen, or twenty-five successive positions in a one- or two-dimen-
sional shift register. Character segmentation is performed either explicitly by
comparing successive vertical scans, or implicitly by looking for peaks in the
output of the recognition circuitry. On the newer machines a small general-pur-
pose computer acts as the control unit for the entire system, performs validity
checks, generates and monitors test inputs, and formats the output as required for
a particular application.
Lower-speed devices, including those used in conjunction with hand-held
wands, use a combination of hardwired digital logic and programmable micro-
processors. Features are usually extracted in hardware, but the final classification
or reject decision, which requires processing a much smaller amount of data than
does the front-end, may be relegated to a microprocessor with a read-only
memory. A separate user-programmable microprocessor is sometimes used for
checking and formatting the output. This microprocessor may then be pro-
grammed directly through the wand itself by means of codes in a typeface which
can be recognized by the unit.

7. Context

Characters that are difficult to classify by their shape alone can sometimes be
recognized correctly by using information about other characters in the same
document. For instance, in decoding a barely legible handwritten postcard, one
must frequently resort to locating different occurrences of the same misshapen
symbol. Contextual information may be used in a number of different ways to aid
recognition; Toussaint lists six major categories. In this section, however, we will
discuss only the use of the information available from the non-random sequenc-
ing of characters in natural-language text, including such relatively 'unnatural'
sequences as postal addresses, prices, and even social security numbers.
For a sequence of observed patterns V = v_1, v_2, ..., v_n, we may again use Bayes'
rule to obtain

P(A | V) = P(V | A) P(A) / P(V)

where A is a sequence of character identities A_i, i = 1, 2, ..., n, and each A_i takes on
values a_k, k = 1, 2, ..., M (M is the number of classes).
To estimate P(V | A) for all possible character sequences A of arbitrary length n
clearly requires an astronomical number of calculations. Approximations for
estimating the sequence are considered either n-gram oriented (Subsection 7.1) or
word (dictionary-lookup) oriented (Subsection 7.2). Sources of the necessary
statistical information about letter frequencies are discussed in Subsection 7.3,
while word frequencies are discussed in Subsection 7.4.

7.1. N-gram methods


A simplifying assumption for the calculation of the a posteriori probabilities
P(A | V), exploited independently by Abend and by Raviv, is to consider text as
an m-th order Markov source. It can be shown that under this assumption the a
posteriori probabilities for the i-th character in the sequence can be recursively
computed in terms of the conditional probabilities of the pattern (feature) vectors
of the (i-1)st, (i-2)nd, ..., and (i-m+1)st characters. A further simplification
is obtained by letting the decision for the i-th character depend only on the
decision (rather than the conditional probabilities of the feature vectors) for the
previous characters. Experiments show that on typed English text the error rate
based on feature vectors alone can be decreased considerably by considering only
the class attributed to the two characters immediately preceding the character
under consideration. The statistical data necessary to accomplish this is a table of
letter trigram probabilities.
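
The decision-directed variant is easy to sketch: the score of each candidate class combines its shape-based log probability with the trigram probability conditioned on the two classes already assigned. In the sketch below the data structures, the floor probability for unseen trigrams, and the way the shape evidence is supplied are assumptions.

import math

def contextual_decision(shape_logprobs, prev2, prev1, trigrams, floor=1e-6):
    """Choose the class of the current character using letter trigrams.

    shape_logprobs: dict mapping class -> log P(feature vector | class)
    prev2, prev1:   classes already decided for the two preceding characters
    trigrams:       dict mapping (c1, c2, c3) -> P(c3 | c1, c2)
    """
    best, best_score = None, -math.inf
    for c, lp in shape_logprobs.items():
        context = trigrams.get((prev2, prev1, c), floor)
        score = lp + math.log(context)
        if score > best_score:
            best, best_score = c, score
    return best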

7.2. Dictionary look-up


One of the earliest dictionary look-up methods is that reported by Bledsoe and
Browning in 1959. Here the compound a posteriori probability of every word in
the dictionary of the same length as the one under consideration is computed by
multiplying together the probabilities (based on the feature vectors) of its con-
stituent characters. The word with the highest a posteriori probability is chosen.
When the Markov assumption for letter occurrences can be justified, the Viterbi
algorithm provides an efficient method of computing the a posteriori probability
of each word in the dictionary. The computation can be further accelerated
without a noticeable increase in the error rate by considering only the most likely
candidates (based on the feature vectors) at each character position in the word.
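
A sketch of this style of look-up follows; the dictionary, the per-character probability tables, and the names are assumed for illustration. Every word of the right length is scored by the product, computed here as a sum of logarithms, of the probabilities of its constituent characters; pruning each position to its most likely candidates, as mentioned above, would speed this up further.

import math

def best_word(char_logprobs, dictionary):
    """Choose the dictionary word best supported by the character evidence.

    char_logprobs is a list with one entry per character position, each a
    dict mapping a letter to log P(feature vector | letter).  Only words of
    the same length as the observed sequence are considered.
    """
    n = len(char_logprobs)
    best, best_score = None, -math.inf
    for word in dictionary:
        if len(word) != n:
            continue
        score = sum(char_logprobs[i].get(letter, -math.inf)
                    for i, letter in enumerate(word))
        if score > best_score:
            best, best_score = word, score
    return best
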
The dictionary look-up algorithm can make use of the confusion probabilities
(obtained empirically) between each pair of character identities instead of the
calculated a posteriori character probabilities. Yet another variant is the combina-
tion of sequential feature extraction (Subsection 6.4) and the Markov assumption
on the character sequences. It has been shown that there are circumstances under
which an additional measurement on a previous character is preferable to an
additional measurement on the character about to be classified!

7.3. Letter frequencies


The distribution of letter frequencies is, of course, of interest in deciphering
cryptograms, and singlet and doublet letter frequencies are tabulated in most
texts on the subject. Table 2 shows the doublet frequencies from a corpus of
600 000 characters of legal text. Here again the frequencies of the less common pairs

Table 2
Bigram frequencies based on 600 000 characters of legal text (× 10)
(blank) A B C D E F G H I J K L
1 0.693 0.044 0.001 0.015 0.178 0.375 0.099 0.044 0.038 0.003 0.003 0.008 0.062
2 A 0.207 0.000 0.004 0.030 0.011 0.032 0.011 0.010 0.057 0.019 0.001 0.000 0.038
3 B 0.079 0.011 0.000 0.000 0.000 0.001 0.000 0.000 0.001 0.005 0.000 0.000 0.000
4 C 0.124 0.028 0.000 0.004 0.000 0.044 0.000 0.000 0.000 0.047 0.000 0.000 0.000
5 D 0.057 0.017 0.000 0.000 0.001 0.085 0.000 0.000 0.000 0.021 0.000 0.000 0.015
6 E 0.057 0.000 0.033 0.053 0.063 0.025 0.016 0.021 0.216 0.022 0.003 0.008 0.045
7 F 0.064 0.006 0.000 0.000 0.000 0.017 0.017 0.000 0.000 0.014 0.000 0.000 0.002
8 G 0.016 0.011 0.000 0.000 0.006 0.012 0.000 0.001 0.000 0.014 0.000 0.000 0.001
9 H 0.047 0.001 0.000 0.032 0.000 0.003 0.000 0.013 0.000 0.000 0.000 0.000 0.000
10 I 0.130 0.025 0.006 0.026 0.037 0.010 0.026 0.007 0.046 0.000 0.000 0.002 0.027
11 J 0.016 0.001 0.002 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
12 K 0.004 0.004 0.000 0.004 0.000 0.001 0.000 0.000 0.000 0.001 0.000 0.000 0.001
13 L 0.026 0.067 0.010 0.013 0.001 0.031 0.003 0.002 0.001 0.025 0.000 0.001 0.042
14 M 0.044 0.011 0.001 0.000 0.002 0.016 0.000 0.003 0.000 0.017 0.000 0.000 0.000
15 N 0.047 0.118 0.000 0.000 0.000 0.091 0.000 0.003 0.001 0.168 0.000 0.002 0.000
16 O 0.153 0.000 0.018 0.084 0.008 0.003 0.033 0.004 0.025 0.076 0.003 0.000 0.014
17 P 0.075 0.029 0.000 0.000 0.000 0.010 0.000 0.000 0.000 0.007 0.000 0.000 0.000
18 Q 0.004 0.000 0.000 0.000 0.000 0.005 0.000 0.000 0.000 0.000 0.000 0.000 0.000
19 R 0.057 0.063 0.005 0.006 0.003 0.140 0.009 0.008 0.003 0.017 0.000 0.000 0.000
20 S 0.116 0.060 0.003 0.001 0.005 0.075 0.000 0.004 0.001 0.081 0.000 0.001 0.007
21 T 0.307 0.099 0.001 0.043 0.000 0.022 0.004 0.001 0.008 0.078 0.000 0.000 0.005
22 U 0.024 0.010 0.011 0.010 0.006 0.000 0.006 0.005 0.003 0.000 0.013 0.000 0.006
23 V 0.023 0.008 0.000 0.000 0.001 0.018 0.000 0.000 0.000 0.010 0.000 0.000 0.002
24 W 0.075 0.007 0.000 0.000 0.001 0.007 0.000 0.000 0.001 0.000 0.000 0.000 0.001
25 X 0.000 0.005 0.000 0.000 0.000 0.015 0.000 0.000 0.000 0.001 0.000 0.000 0.000
26 Y 0.005 0.011 0.019 0.002 0.001 0.005 0.001 0.000 0.002 0.000 0.000 0.000 0.020
27 Z 0.003 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.002 0.000 0.000 0.000

depend greatly on the domain of discourse; in the example given, the frequency of
the pair 'ju' (from jury, judge, judicial, jurisprudence, etc.) is much higher than in
other types of material. The most common letter in English is 'e', accounting for
13% of all letters. The most common pair is 'he'.
In studying higher-order letter frequencies, upper and lower case, blanks, and
punctuation must also be considered. The letter 'q', for instance, is more common
at the beginning of words than in the middle or at the end--hence 'q-' has a
lower frequency than '-q'. The distribution of n-gram frequencies according to
the position within the word has been studied by Shinghal and Toussaint. A study
of the singlet frequencies of Chinese ideographs (which may be regarded as
complete words, but represent single characters as far as OCR is concerned) is
available in Chen. Japanese Katakana character frequencies are tabulated in Bird.
In deriving n-gram probabilities from a sample of text it is necessary to make
special provision for estimating the frequency of n-grams which do not occur in
the text at all. More generally, it is desirable to use a better estimator than the
sample frequency itself. By postulating an a priori distribution for each n-gram
frequency, an improved estimator may be derived either by minimizing the risk

Table 2 (continued)
Bigram frequencies based on 600 000 characters of legal text (× 10)
M N O P Q R S T U V W X Y
1 0.018 0.176 0.074 0.008 0.000 0.097 0.213 0.180 0.002 0.011 0.013 0.005 0.096
2 A 0.031 0.020 0.008 0.020 0.000 0.038 0.013 0.049 0.006 0.007 0.022 0.003 0.000
3 B 0.002 0.000 0.003 0.000 0.000 0.001 0.000 0.000 0.008 0.000 0.000 0.000 0.000
4 C 0.001 0.031 0.010 0.000 0.000 0.009 0.005 0.001 0.014 0.000 0.000 0.005 0.000
5 D 0.000 0.090 0.006 0.000 0.000 0.025 0.001 0.000 0.008 0.000 0.000 0.000 0.000
6 E 0.039 0.037 0.003 0.043 0.000 0.124 0.074 0.085 0.012 0.037 0.020 0.003 0.003
7 F 0.000 0.003 0.095 0.000 0.000 0.002 0.000 0.000 0.002 0.000 0.001 0.000 0.000
8 G 0.000 0.052 0.003 0.000 0.000 0.004 0.000 0.000 0.005 0.000 0.000 0.000 0.000
9 H 0.000 0.001 0.005 0.003 0.000 0.001 0.016 0.258 0.000 0.000 0.022 0.001 0.000
10 I 0.024 0.017 0.004 0.004 0.000 0.047 0.033 0.096 0.010 0.024 0.019 0.002 0.001
11 J 0.000 0.003 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
12 K 0.000 0.002 0.001 0.000 0.000 0.006 0.001 0.000 0.000 0.000 0.000 0.000 0.000
13 L 0.000 0.004 0.017 0.017 0.000 0.003 0.003 0.003 0.017 0.000 0.001 0.000 0.001
14 M 0.011 0.001 0.028 0.001 0.000 0.011 0.002 0.003 0.005 0.000 0.003 0.000 0.001
15 N 0.000 0.005 0.143 0.000 0.000 0.008 0.000 0.002 0.023 0.000 0.004 0.000 0.000
16 O 0.015 0.032 0.005 0.027 0.000 0.043 0.016 0.067 0.001 0.003 0.006 0.000 0.002
17 P 0.012 0.000 0.013 0.028 0.000 0.007 0.008 0.000 0.013 0.000 0.000 0.002 0.000
18 Q 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
19 R 0.000 0.001 0.094 0.039 0.000 0.008 0.000 0.028 0.043 0.000 0.003 0.000 0.000
20 S 0.004 0.035 0.013 0.001 0.000 0.023 0.029 0.025 0.032 0.000 0.003 0.000 0.005
21 T 0.000 0.087 0.032 0.007 0.000 0.043 0.072 0.011 0.024 0.000 0.000 0.002 0.000
22 U 0.004 0.004 0.051 0.006 0.010 0.011 0.032 0.013 0.000 0.000 0.000 0.000 0.000
23 V 0.000 0.003 0.011 0.000 0.000 0.005 0.000 0.000 0.000 0.000 0.000 0.000 0.000
24 W 0.000 0.002 0.018 0.000 0.000 0.001 0.001 0.004 0.000 0.000 0.000 0.000 0.000
25 X 0.000 0.000 0.001 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
26 Y 0.000 0.009 0.001 0.000 0.000 0.014 0.001 0.019 0.000 0.000 0.000 0.000 0.000
27 Z 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000

using a loss function or by the maximum likelihood consideration. The estimator
obtained with a square-loss function turns out to be identical to the a posteriori
probability obtained from the maximum likelihood formulation. In particular,
with a uniform a priori density function for each n-gram, the formula is

p = (m + 1)/(n + 2)

where m is the number of cooccurrences of the i-th n-gram in the sample and n is
the total number of n-grams. The actual a priori distributions may be more
realistically approximated by means of beta distributions with two parameters,
yielding a posteriori distributions of the type (m + a)/(n + b).
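
In code the estimator is a single expression; the sketch below (names assumed) covers both the uniform-prior case and the beta-prior generalization quoted above.

def ngram_probability(m, n, a=1.0, b=2.0):
    """Estimate the probability of an n-gram seen m times among n n-grams.

    With a = 1 and b = 2 this is the uniform-prior estimate (m + 1)/(n + 2);
    other beta-prior parameters give estimates of the form (m + a)/(n + b).
    """
    return (m + a) / (n + b)

# An n-gram that never occurs in a 600 000-character sample still receives
# a small nonzero probability rather than zero:
print(ngram_probability(0, 600000))
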
The estimates of n-gram frequencies obtained by various authors have been
compared by Suen in an article that includes also an excellent bibliography on the
statistical parameters of textual material.

7.4. Word distributions and frequencies


Word distributions depend, of course, both on the language and on the domain
of discourse under consideration. Vocabularies in children's books are small,
while the average word-length in legal documents is much longer than in
newspapers. In technical material the 500 most common words account for 76%
of all the words. The distribution of word lengths in a corpus containing ten
different types of text is tabulated in a report by Toussaint and Shinghal.

8. Error/reject rates

8.1. Prediction
Based on his 1957 work (see above), in 1970 Chow derived some very
interesting relations between the substitution error rate and the reject rate of a
character recognition system. The principal results are:
(1) The optimum rule for rejection is, regardless of the underlying distributions,
to reject the pattern if the maximum of the a posteriori probabilities is less than
some threshold.
(2) The optimal reject and error rates are both functions of a parameter t,
which is adjusted according to the desired error/reject tradeoff (which, in turn, is
based on the cost of errors relative to that of rejections).
(3) The reject rule divides the decision region into 'accept' and 'reject' regions;
the error and reject rates are the integrals over the two regions of the probability
P(v) of the observations.
(4) Both the error and reject rates are monotonic in t.
(5) t is an upper bound on the error rate.
(6) The slope dE/dR of the error-reject curve increases from -1 + 1/n (n is
the number of classes) to 0 as R increases from 0 to 1.
(7) The error/reject rate is always concave upwards (d²E/dR² ≥ 0).
(8) The optimum error rate may be computed from the reject rate according to
the equation E(t) = -∫_0^t s dR(s), regardless of the form of the underlying distribu-
tion.
The importance of (8) lies in the fact that in large-scale character-recognition
experiments it is generally impossible to obtain a sufficient number of labelled
samples to estimate small error rates directly. Eq. (8) allows the estimation of the
error rate using unlabelled samples.
Chow gives examples of the optimum error-reject curve for several forms of
the probability density function of the observations.
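
Relation (8) can also be applied numerically: if the reject rate R(t) has been measured on unlabelled samples over a grid of threshold values, the error rate is accumulated from the decrements of R. The sketch below is an illustration of this idea under the assumption that the grid starts at t = 0, where the error rate is zero; the midpoint summation and the names are not from the text.

def error_from_reject_curve(thresholds, reject_rates):
    """Estimate the error rate E(t) from an empirical reject-rate curve.

    thresholds must be increasing, starting at 0; reject_rates[i] is the
    measured reject rate R(thresholds[i]) obtained on unlabelled samples.
    E is accumulated as -sum of t dR over successive threshold intervals.
    """
    E = [0.0]
    for i in range(1, len(thresholds)):
        dR = reject_rates[i] - reject_rates[i - 1]          # dR <= 0
        t_mid = 0.5 * (thresholds[i] + thresholds[i - 1])
        E.append(E[-1] - t_mid * dR)
    return E                 # E[i] estimates the error rate at thresholds[i]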

8.2. Experimental estimation of the error rate


Because printed characters do not lend themselves to simple and accurate
syntactic or probabilistic description, theoretical predictions of the misclassifica-
tion rate, based on parametric characterizations of the underlying multidimen-
sional probability distributions, are primarily of academic interest. Consequently,
whether a given classification scheme meets the specified performance criteria
with respect to substitution error and reject rate must be determined by experi-
ment.
In the ideal situation, where an infinite number of character samples is
available (and infinite computational resources), there is no theoretical difficulty:
simply partition the sample into two (infinite) sets, design the classifier using the
method of choice on one set (the design set), and estimate the error rate by
observing the fraction of misclassified or rejected samples in the other set (the test
set).
If, however, the number of samples or the available computational resources
are finite, then restricting the size of the design set generally leads to an inferior
design. On the other hand, using all of the available samples for design leads to a
biased estimate of the error rate, because the classifier is tailored to the particular
samples used in its design. The problem is, therefore, how to partition the samples
in order to obtain both a good design and an acceptable estimate of the error rate.
The following discussion is based on an article by Toussaint which contains over
one hundred references to original work on error-rate estimation.
Method 1. In early experiments in character recognition, and occasionally in
more recent work, the design set and the test set were identical. This leads, as
mentioned, to too low an estimate of the error rate. The extent of bias is related to
the relative number of 'free parameters' in the classifier and to the size of the
sample set.
Method 2. Here the sample is divided into a design set and a test set, with the ratio of the number of samples in the design set to the total number of samples (N) equal to R. R × N must, of course, be an integer. R is often chosen equal to ½.
Method 3. The sample is partitioned into K sets of size (1 − R) × N. Each set from the first partition is then randomly paired with a set from the second partition, and K separate design-and-test experiments are performed. The estimate of the error rate is the average error rate on the K test sets. The variance of this estimate is lower than that of Method 2. Commonly used values of R are R = ½ and R = (N − 1)/N. The latter is known as the U-method; it provides the largest possible number of independent design samples for a given total sample size. The maximum possible value of K in the U-method is N.
The lowest expected error rate which can be obtained on an infinitely large, independent test sample using a finite design sample is, of course, higher than what could be obtained with an infinite design sample, but with a finite sample we can estimate the former. It can be shown that Method 1 yields, on the average, an optimistic estimate, while Methods 2 and 3 yield pessimistic estimates. Given N samples, the U-method yields the smallest underestimate of the expected performance using N characters for design.
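For concreteness, the holdout scheme of Method 2 and the U-method limit of Method 3 can be sketched in a few lines of code. This is only an illustration under stated assumptions: `train_and_classify` is a hypothetical stand-in for any classifier-design procedure that fits on a design set and labels a test set, and X, y are assumed to be NumPy arrays of feature vectors and class labels.

```python
import numpy as np

def holdout_error(train_and_classify, X, y, design_fraction=0.5, seed=0):
    """Method 2: split the N samples into a design set (fraction R of N)
    and a test set, and estimate the error rate on the held-out test set."""
    rng = np.random.default_rng(seed)
    N = len(y)
    idx = rng.permutation(N)
    n_design = int(round(design_fraction * N))
    design, test = idx[:n_design], idx[n_design:]
    predictions = np.asarray(train_and_classify(X[design], y[design], X[test]))
    return float(np.mean(predictions != y[test]))

def u_method_error(train_and_classify, X, y):
    """The U-method (R = (N-1)/N, K = N): leave each sample out in turn,
    design on the remaining N-1 samples, and average the N single-sample tests."""
    N = len(y)
    errors = 0
    for i in range(N):
        keep = np.arange(N) != i
        pred = np.asarray(train_and_classify(X[keep], y[keep], X[i:i + 1]))
        errors += int(pred[0] != y[i])
    return errors / N
```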
Another line of work in character recognition explores the improvement in
recognition performance in adaptive recognizers where each new sample contributes
information to tune the classifier. The main branches of this endeavour are
supervised classification where each new sample is identified, and unsupervised
classification ('learning without a teacher') where the true identities of the new
characters remain unknown. None of this work seems to have found much
application to practical OCR systems.
Commercial OCR manufacturers normally test their devices on samples of
several hundred thousands or millions of characters in order to make performance
estimates at realistic levels of substitution error and reject rate. Academic re-
searchers, on the other hand, usually test their algorithms on a few hundred or
thousand samples only, using general purpose computers. The IEEE Pattern
Recognition Data Base contains several dozen test sets, originally collected by the
IEEE Computer Society's Technical Committee on Pattern Recognition, which
are available to experimenters for the cost of distribution. These test databases
include script, hand-printed characters, single and multifont typewritten char-
acters, and also some non-character image data. Several articles have been written
comparing different studies using identical design and test data. Another source
of fairly large files of alphanumeric test data is the Research Division of the U.S.
Postal Service.

8.3. Major sources of classification errors


In a complete OCR system, the vast majority of the errors can usually be
attributed to the preprocessor or front-end. This includes, as we have seen, the
optical transducer itself and the character-acquisition stages. Photometric non-
linearities in the optical scanner and lack of contrast in the material lead to
troublesome and inconsistent black-white inversions. Geometric non-linearities in
the scanner and misadjusted printing mechanisms (such as bent type bars) result
in distortions in the shape of the characters and in misaligned and skewed
patterns. These problems can be readily detected by visual inspection of the
digitized pattern arrays, hence most systems include provisions for monitoring the
pattern 'video'.
Even with high quality printing, segmentation errors often result in mutilated or
incomplete characters presented to the classifier. This is due to the fact that while
a four-mil scanning resolution is considered acceptable for the classification of
most standard-size fonts, the amount of blank space between adjacent characters
is frequently much less than the scanner spot size, and the digitized arrays
corresponding to adjacent characters are therefore 'bridged'. Decreasing the
spot-size not only increases the amount of data to be processed but also, with
many serif fonts, a significant fraction of adjacent character-pairs is not separated
by any detectable white space. Character classification methods that do not
depend on accurate segmentation show a definite advantage on such material.
Even if the characters are correctly segmented, the presence of extraneous blots
may affect registration. Blots which are cleanly separated from the character are
not difficult to remove, but contiguous blots pose a severe problem. Since in
practical devices recognition is usually attempted in only a few shifted positions, a
two or three pixel misregistration may lead to additional rejects.


In the classification stage, errors on clean, well-segmented, and registered
characters are committed either because the structure and variability of the
material do not correspond to the model used to design the classifier (inap-
propriate choice of classification method), or because the data used to adjust the
parameters of the classifier (the training set) is not sufficiently representative of
the data used for testing or operational application. Except for the classification
of hand-printed characters, however, any effort invested in improving the quality
of the output of the digitizing and preprocessing stages tends to be more
cost-effective than fine-tuning the classifier itself.

8.4. Recognition rates in diverse applications


Because print quality is difficult to characterize accurately, few commercial
manufacturers are willing to be pinned down to specific performance rates.
Furthermore, once a system is installed the user has almost no way to find out its
operational substitution error rate: it is simply too costly to record the correct
identities of several hundreds of thousands of characters and to match them with
the string of identities assigned by the OCR system. The ranges of substitution
error and reject rates cited in this section should therefore be considered with
extreme caution (Table 3).
For handprinted characters the most important single factor affecting recogni-
tion performance is the motivation of the writers. The performance rate on an
untrained but motivated population is of the order of 99% correct recognition for
numerals only and 95%-98% for alphanumeric applications. The error rate is two
or three times lower with a captive population, such as clerks in retail stores,
where it is possible to provide feedback to correct misrecognized stylistic idiosyn-
crasies of the writers.
The performance rate on specially-designed stylized characters may reach
1/200000 substitution error rate and 1/20000 reject rate. Almost comparable
performance may be obtained with certain typewriter fonts, such as modified
Courier, when typed on specially adjusted typewriters using carbon film ribbon
and high quality paper. On ordinary typescript, using standard typefaces, error
and reject rates may be one to two orders of magnitude higher, depending on the
typeface and the print quality. Restricting the alphabet to one case or to numerals
only results in an improvement roughly proportional to the decrease in the
number of classification categories.

Table 3
'Typical' error and reject rates

                              Error rate    Reject rate
Stylized characters           0.000005      0.00005
Typescript (OCR quality)      0.00001       0.0001
Ordinary typescript           0.0005        0.005
Handprint (alphanumeric)      0.005         0.05
Bookface                      0.001         0.01

There are twenty-four large capacity mail sorters installed in the U.S. Post
Offices in large cities throughout the country. The older machines are effective
only in sorting out-going mail, where the contextual relations between the city, state, and zip-code on the last line of the address allow correct determination of the destination even with several misrecognized characters. On high quality
printed matter (most of the mail in the United States, as opposed to Japan, has a
printed or typewritten address) 70% to 90% of the mail pieces are routed
correctly, and the rest is rejected for subsequent manual processing. To use the machines effectively, obviously unreadable material is not submitted to them. Large mailers participate in the U.S. Postal Service's 'red tag'
program by affixing special markers to large batches of presorted mail which is
known to have the appropriate format for automatic mail sorting.
The recognition performance on variable-pitch typeset material is much lower
than on typewritten and special OCR fonts: up to 1% of the characters may not
be recognized correctly. Only a few manufacturers market machines for typeset
material. Some of these machines must be 'trained' on each new font. Mixed-font
OCR has been successfully applied to reading-aids for the blind. Here even if a
relatively high fraction of the characters is misrecognized, human intelligence can make sense of the output of the device. In one commercially available
system the reader is coupled to an automatic speech synthesizer which voices the
output of the machine at an adjustable rate between 100 and 300 words per
minute. When the rules governing its pronunciation fail, this machine can spell
the material one letter at a time.
Few major benefits can be expected from further improvement of current
commercial classification accuracy on clean stylized typescript and on stylized
fonts. Current developments are focussed, therefore, on enabling OCR devices to
handle increasingly complicated formats (such as technical magazine articles) in
an increasing variety of styles including hand-printed alphanumeric information,
ordinary typescript, and bookface fonts, all at a price allowing small decentralized
applications. When these problems are successfully solved, we may expect general
purpose OCR input devices to be routinely attached to even the smallest
computer systems, complementing the standard keyboard.

Acknowledgment

The author wishes to acknowledge the influence on the points of view expressed
in this article of a number of former colleagues, particularly R. Bakis, R. G.
Casey, C. K. Chow, C. N. Liu, and Glenn Shelton, Jr. He is also indebted to
R. M. Ray III for some specific suggestions.

Bibliography
[1] Abend, K. (1968). Compound decision procedures for unknown distributions and for dependent states of nature. In: L. N. Kanal, ed., Pattern Recognition. Thompson, Washington/L.N.K. Corporation, College Park, MD.

[2] Ascher, R. N. (et al.), (1971). An interactive system for reading unformatted printed text. IEEE
Trans. Comput. 20, 1527-1543.
[3] Bird, R. B. (1967). Scientific Japanese: Kanji distribution list for elementary physics. Rept. No.
33. Chemical Engineering Department, The University of Wisconsin.
[4] Bledsoe, W. W. and Browning, I. (1959). Pattern recognition and reading by machines. Proc.
E.J.C.C., 225-233.
[5] British Computer Society. (1967). Character Recognition. BCS, London.
[6] Casey, R. G. and Nagy, G. (1968). An autonomous reading machine. IEEE Trans. Comput. 17,
492-503.
[7] Casey, R. G. and Nagy, G. (1966). Recognition of printed Chinese characters. IEEE Trans.
Comput. 15, 91-101.
[8] Chen, H. C. (1939). Modern Chinese Vocabulary. Soc. for the Advancement of Chinese
Education. Commercial Press, Shanghai.
[9] Chow, C. K. (1957). An optimum character recognition system using decision functions. IRE
EC 6, 247-257.
[10] Chow, C. K. and Liu, C. N. (1966). An approach to structure adaptation in pattern recognition.
IEEE SSC 2, 73-80.
[11] Chow, C. K. and Liu, C. N. (1968). Approximating discrete probability distributions with
dependence trees. IEEE Trans. Inform. Theory 14, 462-467.
[12] Chow, C. K. (1970). On optimum recognition error and reject tradeoff. IEEE Trans. Inform.
Theory 16, 41-46.
[13] Deutsch, S. (1957). A note on some statistics concerning typewritten or printed material. IRE IT 3, 147-148.
[14] Doyle, W. (1960). Recognition of sloppy, hand-printed characters. Proc. W.J.C.C., 133-142.
[15] Fischer, G. L. (1960). Optical Character Recognition. Spartan Books, Washington.
[16] Genchi, H., Mori, K. I., Watanabe S. and Katsuragi, S. (1968). Recognition of handwritten
numeral characters for automatic letter sorting. Proc. IEEE 56, 1292-1301.
[17] Greanias, E. C., Meagher, P. F., Norman, R. J., and Essinger, P. (1963). The recognition of
handwritten numerals by contour analysis. IBM J. Res. Develop. 7, 14-22.
[18] Greenough, M. L. and McCabe, R. M. (1975). Preparation of reference data sets for character
recognition research. Tech. Rept. to U.S. Postal Service. Office of Postal Tech. Res., Pattern
Recog. and Comm. Branch, Nat. Bur. of Standards, NBSIR 75-746. Washington.
[19] Harmon, L. D. (1972). Automatic recognition of print and script. Proc. IEEE 60, 1165-1176.
[20] Hennis, R. B. (1968). The IBM 1975 optical page reader. IBM J. Res. Develop. 12, 345-371.
[21] Herbst, N. M. and Liu, C. N. (1977). Automatic signature verification. IBM J. Res. Develop. 21,
245-253.
[22] Hoffman, R. L. and McCullough, J. W. (1971). Segmentation methods for recognition of
machine-printed characters. IBM J. Res. Develop. 15, 101-184.
[23] Hussain, A. B. S., and Donaldson, R. W. (1974). Suboptimal sequential decision schemes with
on-line feature ordering. IEEE Trans. Comput. 23, 582-590.
[24] Kamentsky, L. A., and Liu, C. N. (1963). Computer automated design of multi-font print
recognition logic. IBM J. Res. Develop. 7, 2-13.
[25] Kanal, L. N. (1980). Decision tree design--the current practices and problems. Pattern
Recognition in Practice. North-Holland, Amsterdam.
[26] Kovalevsky, V. A. (1968). Character Readers and Pattern Recognition. Spartan Books, Washing-
ton.
[27] Liu, C. N. and Shelton, G. L., Jr. (1966). An experimental investigation of a mixed-font print
recognition system. IEEE Trans. Comput. 15, 916-925.
[28] Minsky, M. (1961). Steps towards artificial intelligence. Proc. IRE.
[29] Nadler, M. (1963). An analog-digital character recognition system. IEEE Trans. Comput. 12.
[30] Neuhoff, D. L. (1975). The Viterbi algorithm as an aid in text recognition. IEEE Trans. Inform.
Theory 21, 222-226.
[31] OCR Users Association (1977). OCR Users Association News. Hackensack, NJ.
[32] Ledley, G. (1970). Special issue on character recognition. Pattern Recognition 2. Pergamon Press,
New York.

[33] Raviv, J. (1967). Decision making in Markov chains applied to the problem of pattern
recognition. IEEE Trans. Inform. Theory 13, 536-551.
[34] Riseman, E. M. and Ehrich, R. W. (1971). Contextual word recognition using binary digrams.
IEEE Trans. Comput. 20, 397-403.
[35] Riseman, E. M. and Hanson, A. R. (1974). A contextual postprocessing system for error
correction using binary N-grams. IEEE Trans. Comput. 23, 480-493.
[36] Schantz, H. F. (1979). A.N.S.I. OCR Standards Activities. OCR Today 3. OCR Users Associa-
tion, Hackensack, NJ.
[37] Shannon, C. (1951). Prediction and entropy of printed English. BSTJ 30, 50-64.
[38] Shillman, R. (1974). A bibliography in character recognition: techniques for describing char-
acters. Visible Language 7, 151-166.
[39] Shinghal, R., Rosenberg, D., and Toussaint, G. T. (1977). A simplified heuristic version of
recursive Bayes algorithm for using context in text recognition. IEEE Trans. Systems Man
Cybernet. 8, 412-414.
[40] Shinghal, R., Rosenberg, D., and Toussaint, G. T. (1977). A simplified heuristic version of
Raviv's algorithm for using context in text recognition. Proc. 5th Internat. Joint Conference
Artificial Intelligence, 179-180.
[41] Stevens, M. E. (1961). Automatic character recognition-state-of-the-art report. Nat. Bureau
Standards, Tech. Note 112. Washington.
[42] Suen, C. Y. (1979). Recent advances in computer vision and computer aids for the visually-
handicapped. Computers and Ophthalmology. IEEE Cat. No. 79CH1517-2C.
[43] Suen, C. Y. (1979). N-gram statistics for natural language understanding and text processing.
IEEE PAMI 1, 164-172.
[44] Suen, C. Y. (1979). A study on man-machine interaction problems in character recognition.
IEEE Trans. Systems Man Cybernet. 9, 732-737.
[45] Thomas, F. J. and Horwitz, L. P. (1964). Character recognition bibliography and classification.
IBM Research Report RC-1088.
[46] Toussaint, G. T. (1974). Bibliography on estimation of misclassification. IEEE Trans. Inform.
Theory 20, 472-479.
[47] Toussaint, G. T. and Shinghal, R. (1978). Tables of probabilities of occurrence of characters,
character pairs, and character triplets in English text. McGill University, School of Computing
Sciences. Tech. Rept. No. SOCS 78-6, Montreal.
[48] Toussaint, G. T. (1978). The use of context in pattern recognition. Pattern Recognition 10,
189-204.
[49] Walter, T. (1971). Type design classification. Visible Language 5, 59-66.
[50] Wright, G. G. N. (1952). The Writing of Arabic Numerals. University of London Press, London.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 651-671

Computer and Statistical Considerations


for Oil Spill Identification

Y. T. Chien and T. J. Killeen

1. Introduction

One of the recent applications of probability theory and mathematical statistics involves the matching of a spilled oil sample with one or more suspect oil samples
in an attempt to track down the source of the spill. This problem is generally
referred to as oil spill identification. In the United States alone, for example, there
were 20 million gallons of oil spilled in coastal waters in 1977. In that same year,
a total of almost 13 thousand oil spill incidents were reported to the U.S. Coast
Guard and of these one thousand cases were significant spills. These spills are costly events: not only do they pose environmental dangers to people, but they are often difficult and expensive to clean up. As part of the effort to deal with this problem, the U.S. Coast Guard has been given the overall responsibility for investigating all oil spills, in hopes of determining the source of spills when they occur and eventually holding the spiller responsible for the clean-up costs.
Typically, the identification of an oil spill involves one or more standard
chemical tests which are used to determine the chemical composition of oil
samples. These tests include thin-layer chromatography, infrared and fluorescent
spectroscopy, and gas chromatography. In the identification system developed by
the U.S. Coast Guard Research and Development Center, the chemical tests
begin with thin-layer chromatography (TLC). TLC in general involves mixing the
material to be tested in a solvent, then moving the new solution along an
absorbent material. The different chemical components of the original material
soak in at different rates. In the oil spill situation, these different rates as
exhibited by various oil samples are chromatographed on specially processed
plates for visual comparison. Oil spill identification, in this case, is performed by a chemist working from a two-dimensional image.
Other chemical tests usually follow the TLC technique in order to improve the
reliability of identification. Both infrared (IR) and fluorescent (FL) spectroscopy
are powerful tools in 'fingerprinting' oil spills. While their complexity and details
of the fingerprinting process differ, their basic techniques are quite similar. The
process consists of examining the infrared or fluorescence spectra taken for the
spilled oil samples as well as the suspect samples. The peaks of the spectra are
then determined manually (or automatically, in a computerized system) and the
identification of the spiller is accomplished by overlaying two spectra (the spill
sample and a suspect sample) at a time for comparison. The suspect that results in
the closest match (according to some numerical measure of closeness) at the
designated peaks is deduced as the spiller.
Various oil identification schemes have been proposed and studied for their
potential applications in an integrated system which will determine the true
source of an oil spill with high reliability. For a more complete discussion of oil
spill identification schemes, the reader is referred to [4, 5, 6]. In this paper, a
number of mathematical models and techniques that make use of probability
theory and statistics in one form or another will be discussed.
Section 2 will deal with the various computer or statistical-oriented methods for
data analysis as applied to oil identification. Two mathematical models for
identifying oil spillers by matching probability data or by spectral data are
presented in Section 3. The final section gives a summary of the research and
development projects related to oil spill identification work. This summary, when
used with the bibliography, should be useful to the reader who wishes to obtain
additional information on the subject of oil spill identification.

2. Methods for oil data analysis

This section describes various computer and statistical techniques which have
been applied to the problem of identifying oil or chemical spectra. In Subsection
2.1 we discuss how computer graphics can be used to allow interactive analysis of
high dimensional data in a 2-dimensional space. The other two parts of this
section outline major work done by two investigators contracted to the United
States Coast Guard, Chris Brown [7, 8] from the University of Rhode Island,
Department of Chemistry, and James Mattson [35, 36] from the University of
Miami, Rosenthal School of Marine and Atmospheric Science. Each carried on
extensive studies of methods of matching oil spectra. These projects were carried
out in the early to mid 1970's and concentrated on two different aspects of the
problem.
Mattson applied the classical statistical and pattern recognition methodology to
the problem of oil identification, but was hampered by lack of real data to assess
the effects of weathering on types of oil samples. Brown, on the other hand,
collected voluminous data on weathering and simulated weathering. His analysis
was not done from a classical point of view but was developed specifically for the
oil identification problem.

2.1. Interactive data analysis


Matching an oil spill with a set of suspect oil samples often requires both the
computing power of a machine and the interpretive skills of a person. The
combination of these types of capabilities can be easily accomplished by implementing an interactive man-machine system where multi-dimensional data, such as the FL or IR spectral waveforms, can be displayed and/or manipulated by a
user to aid the decision-making process. In this section we describe two such
systems which allow man-machine interaction for (1) waveform matching or (2)
multi-dimensional data analysis.

2.1.1. Matching discrete waveforms


When a digitized oil spill sample is matched against a digitized suspect to
determine their closeness, it is very important that the person in question can
easily manipulate these waveforms according to their corresponding characteristic
features (such as peaks, etc.). An interactive computer graphics system, capable of
moving discrete waveforms in an attempt to aid the user in the oil identification
process, is described here as an illustration. The types of movements typically
needed are (1) shift the spill sample to the right, (2) shift the spill sample to the
left, and (3) move the spill sample so that a specified point of the spill sample
coincides with a specified point of the suspect sample. This interactive system has
been implemented at the University of Connecticut on an IBM 370/2250
graphics terminal.
The graphics terminal is equipped, among other things, with a refreshed display
screen, a display processor and a lightpen which allows the user to input graphical
instructions by pointing directly at a position on the screen. When the program
starts a grid is displayed on the screen and the first menu (for input) appears. The
menu consists of the following options: read suspects and spill sample, read spill
sample only, and stop which are chosen by using the lightpen. If the read suspects
and spill sample option is selected, the program will read and store the data for a
specified number of suspects and a spill sample. The spill sample is displayed
once it has been read and the suspect display, if one existed, is erased.
A second menu is displayed where the first one was. This second menu contains
the basic system commands. The options on the second menu are select a suspect,
move a spill sample, produce a hard copy plot of the screen, record a match, and
get a new sample. If the select a suspect option is chosen by the use of the
lightpen, the suspect, if there is one currently displayed, is erased and a series of
light buttons corresponding to the suspects read in are lit. Hitting one of these
lighted buttons causes the corresponding suspect to be displayed. Only one
suspect may be displayed at any time. The second menu is displayed when this
command has been executed.
If the move option is selected, another menu appears for selecting the type of
move desired. A lightpen hit on the shift right command causes the spill sample to be shifted four wavelengths (one unit on the grid) to the right. A lightpen detect on the shift left command moves the spill sample four wavelengths (one grid unit) to the left. A lightpen hit on 'specify' prompts the user to designate a sample point and a suspect point to be brought into coincidence. The sample and suspect points are specified by typing the position (number) on the x-axis of the graph currently corresponding to the desired points.
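The shift and 'specify' operations amount to simple index arithmetic on the digitized waveforms. A minimal sketch follows; the function names and the zero-filling of vacated positions are assumptions made for illustration, since the text does not say how the display handles points moved off the grid.

```python
import numpy as np

def shift_spectrum(values, n_points):
    """Shift a digitized spectrum n_points to the right (negative = left).
    One grid unit in the system described above corresponds to four
    wavelengths, so a 'shift right' command would use n_points = 4.
    Vacated positions are zero-filled (an assumption)."""
    values = np.asarray(values, dtype=float)
    out = np.zeros_like(values)
    if n_points >= 0:
        out[n_points:] = values[:len(values) - n_points]
    else:
        out[:n_points] = values[-n_points:]
    return out

def align_to_suspect(spill, spill_point, suspect_point):
    """'Specify' command: move the spill sample so that its spill_point
    coincides with the suspect's suspect_point on the shared x-axis."""
    return shift_spectrum(spill, suspect_point - spill_point)

# e.g. shift_spectrum(spill, 4) reproduces one 'shift right' grid unit.
```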

The record match and change sample commands return control to the first menu. The hard copy, select suspect, and move sample commands return control to the second menu.
The printout from this program is a record of all suspect selections, any
movement made, and any matches recorded. The printout also includes the
spectral data read in by the program for the suspects and sample. Fig. 1 and Fig.
2 illustrate typical manipulations allowed in this system to match FL spectra
corresponding to spill and suspect oil samples.

2.1.2. Display-oriented analysis


One of the widely used data analysis techniques is the use of a 2-dimensional
display for high-dimensional data samples. This display, when executed dynami-
cally, can be an effective visual aid for interactive man-machine data analysis.
The object of this display is to transform the high-dimensional vectors into
2-dimensional vectors which are then shown typically on a graphics screen. When
a collection of these vectors is shown at the same time, it is possible for the viewer
to identify a variety of interesting characteristics about the data collection and
proceed to perform data manipulation tasks accordingly. Cluster analysis, dis-
covering spurious samples, detecting changes in the data, and human-assisted
classification are just some of the examples of how an interactive display system
can help in the analysis of high-dimensional data.
As an illustration, consider a collection of oil samples in the form of FL
spectral wave forms similar to those shown in Figs. 1, 2. Those oil samples can be
considered as high-dimensional vectors. In order to analyze this collection of
data, one can apply the interactive techniques by projecting these vectors onto a 2-dimensional display. A possible result of this display is shown in Fig. 3. From the display, the user may determine that there are several types of oil in the collection, how they cluster within each type and between the types, etc.
Much of this information may be important for the design of an oil identification
system.
There are two important considerations in the use of display-oriented
techniques for data analysis. First, the transformation, which converts the high-
dimensional data into 2-dimensional form, must be chosen so that the interrela-
tionship among the data samples is preserved as much as possible. This is
important if the viewer is to 'visualize' the data on the screen which closely
approximates the data in the high-dimensional space. The second consideration is

Fig. 1. Graphical sample execution (',' for spill sample and '+' for suspect).
(a) The initial display.
(b) Display following light pen detection on READ SUSPECTS and SAMPLE.
(c) Display following light pen detection on SELECT SUSPECT.
(d) Display following light pen detection on MOVE SUSPECT.
(e) Display following light pen detection on SPECIFY while entering point numbers.
(f) Display following entering point numbers.
[Fig. 1(a)-(f): screen displays from the interactive session described in the caption; normalized spectra plotted against digitized point position (0-70).]

that, in order to have an effective interactive system, the transformations and other calculations needed to display the data must be sufficiently fast for real-time, on-line interactions between the viewer and the computer display. Such a requirement is often imposed on any system that involves man-machine interaction.
Many transformational methods have been used to efficiently display high-dimensional data. These include linear and non-linear mappings, functional transformations as well as various pictorial transformations. For a thorough discussion of these techniques, in general, see [9, 26]. A detailed treatment of some of these techniques applied to oil data (and other chemical data) can be found in [32].
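As one concrete example of such a transformation, the sketch below projects a set of digitized spectra onto the plane of their first two principal components, a standard linear mapping. It is offered only as an illustration of the display idea; it is not claimed to be the particular transformation used in the systems cited above, and the function name is hypothetical.

```python
import numpy as np

def project_to_2d(spectra):
    """Map high-dimensional digitized spectra to 2-D display coordinates
    using the first two principal components (one possible linear mapping).

    spectra: array of shape (n_samples, n_points).
    Returns an (n_samples, 2) array suitable for a scatter display."""
    X = np.asarray(spectra, dtype=float)
    X = X - X.mean(axis=0)                    # center the data
    # Right singular vectors give the directions of maximum variance.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:2].T
```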

2.2. Classical statistical methods

James Mattson [35, 36] proposed a classical statistical approach to determining the closeness of an infrared spill spectrum to its source. Simply stated, he

Fig. 2. Graphical output (',' for spill sample and '+' for suspect). Each of the plots (a) through (d) was obtained by light pen detection on HARD COPY. They are examples of the hard copy available to the user, and of the manipulation of spectra available to the user.
(a) A display of spill sample 1.
(b) A display of spill sample 1 and suspect 2, both in their original normalized positions.
(c) A display of spill sample 1 in normalized position and suspect 2 moved so that the 52nd point of the suspect coincides with the 45th point of the spill sample.
(d) A display in which the suspect spectral waveform has been shifted one position to the right relative to the display of (c).

Fig. 3. An example of how nonlinear transformations can be applied to identify crude oil clusters in a 2-dimensional plot (coordinates Y1, Y2); the plotted symbols distinguish Middle East, Mid-Continent, Alaska, and California crudes. Visually apparent clusters are circled.

considers repeated runs of a particular oil spectrum as repeated observations from a multivariate normal population. Then the hypothesis that both suspect and spill spectra come from the same multivariate normal population may be tested using the usual χ² statistic. In his study Mattson compares 204 oils with representatives from most oil types and obtains χ² values which are highly significant for all possible pairwise comparisons.
In order to run this test, several assumptions are made. Mainly, the population must be independent multivariate normal. Mattson's estimates of skewness and kurtosis indeed yield values near the normal parameters when replicate data are analysed. However, he points out [35] that correlations between various peaks exist, so the assumption of independence between dimensions of the multivariate normal distribution cannot be met. This drawback, coupled with the fact that no actual comparisons of weathered and unweathered spectra were attempted, leaves this method unusable for identifying weathered oil spectra.
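Under the independence and normality assumptions just described, a comparison of this kind might be sketched as follows. The statistic shown (a per-peak standardized sum of squares referred to a chi-square distribution) is only one plausible reading of 'the usual χ² statistic'; the function name and the details are assumptions rather than Mattson's exact procedure.

```python
import numpy as np
from scipy import stats

def chi_square_match(spill, suspect_replicates):
    """Compare a spill spectrum with replicate runs of a suspect spectrum,
    assuming independent, normally distributed peak heights.

    spill: length-p digitized spill spectrum.
    suspect_replicates: (n_runs, p) array of replicate suspect spectra."""
    reps = np.asarray(suspect_replicates, dtype=float)
    y = np.asarray(spill, dtype=float)
    n_runs, p = reps.shape
    mean = reps.mean(axis=0)
    var = reps.var(axis=0, ddof=1)
    # Variance of (spill - mean) if the spill behaved like one more replicate.
    scale = var * (1.0 + 1.0 / n_runs)
    chi2 = float(np.sum((y - mean) ** 2 / scale))
    p_value = float(stats.chi2.sf(chi2, df=p))
    return chi2, p_value
```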
The major emphasis of Mattson's work was not in the area of statistics but in
pattern recognition. In [37] he applies the k nearest neighbor technique and linear
discriminant function analysis with good classification success. Of course, he is
only classifying oils by type and not attempting to identify a particular spectrum.
Morton Curtis has also applied cluster analysis and pattern recognition to classify oils by type with reasonable success. With all of these techniques, there
seems to be about a 10% chance of misclassification. Therefore, if a 2-step
procedure, i.e. first classify and then identify, is attempted, then the investigator
will stand little chance of a correct identification when incorrect classification is
made. It seems more reasonable to allow the chemist the option of searching
across all oil types when seeking to identify an oil sample. Of course, when
confronted with a classification problem any of the aforementioned pattern
recognition techniques might be appropriate.

2.3. Log-ratio statistics


A very large study of the effects of weathering on oil samples has been
performed by Brown et al. [7, 8]. The analytical chemical technique used in this
study was infrared spectroscopy and literally thousands of experiments were
carried out and analyzed. These experiments were taken from both real and
artificial spill situations. The data were hand digitized and relative changes in
spectra due to weathering were calculated using a log-ratio sum of squares. The frequency counts of this S² statistic are stored in histograms.
We briefly describe the log ratio method used to measure the distance between infrared spectra. Each spectrum is digitized at eighteen pre-chosen wavenumbers. We denote these digitized values by A_i for i = 1, 2, ..., 18 and define the distance S² between spectra 1 and 2 by

S^2 = \sum_{i=1}^{18} \left[ \log(A_{i1}/A_{i2}) - \frac{1}{18} \sum_{j=1}^{18} \log(A_{j1}/A_{j2}) \right]^2.

Of course S² is merely the numerator of the sample variance of X_i = log(A_{i1}/A_{i2}). Two histograms of S² values were developed by Brown [7] and are
reproduced in Fig. 4. The first contains comparisons of weathered and un-
weathered versions of identical oils while the second compares different oils.
These histograms are utilized by Killeen and Chien [29] in the development of a
probabilistic model for oil matching discussed later in this chapter.
The statistic S² is motivated by the fact that differences in sample preparation for infrared spectroscopy cause 'thickness' differences in the resulting spectrum. These changes cause each digitized height A_i to be multiplied by the same untraceable constant C. The distance S² is invariant to this constant, i.e. S² erases differences in sample preparation. A similar motivation underlies the angular distance used by Killeen et al. [29], also discussed later in this chapter.
A second criterion for matching oil spectra, also developed by Brown [7], is that of counting how many of the log ratios in the calculation of S², log(A_{i1}/A_{i2}), are within a given percentage of the average log ratio. If two oils are identical, then the log ratio is constant over all peaks; that is, log(A_{i1}/A_{i2}) = log(C_1/C_2) for all i, where C_j is the thickness constant for sample j.
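For concreteness, both criteria can be computed in a few lines. This is a minimal sketch: the function names and the default tolerance are illustrative assumptions, not values taken from Brown's work.

```python
import numpy as np

def log_ratio_distance(a1, a2):
    """Brown's log-ratio distance between two digitized IR spectra a1, a2
    (the 18 pre-chosen peak heights): the numerator of the sample variance
    of the log ratios.  It is unchanged when either spectrum is multiplied
    by a constant 'thickness' factor."""
    x = np.log(np.asarray(a1, dtype=float) / np.asarray(a2, dtype=float))
    return float(np.sum((x - x.mean()) ** 2))

def fraction_near_average_ratio(a1, a2, tolerance=0.05):
    """Second criterion: the fraction of log ratios lying within a given
    relative tolerance of their average; identical oils give a constant
    log ratio log(C1/C2) over all peaks.  The 5% default is an assumption."""
    x = np.log(np.asarray(a1, dtype=float) / np.asarray(a2, dtype=float))
    return float(np.mean(np.abs(x - x.mean()) <= tolerance * np.abs(x.mean())))
```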

Fig. 4. Frequency histograms for the same oils (bold) and different oils (light), based on IR spectra.

Both of the criteria mentioned above have been found useful in identifying the
true source of a spill in most cases. However, when spills have experienced
weathering, especially in the case of light oils, the log-ratio criterion has some difficulty identifying the true source of the spill. This may be caused by the fact that it
is not able to compensate for the type of changes in the spectrum caused by
weathering.

3. Computational models for oil identification

3.1. The posterior probability model


Oil spill identification can also be accomplished by assigning probabilities of
guilt to suspects in an oil spill case. These probabilities can be revised by using
information contained in the spectra obtained from suspects and spill samples. A
method has been developed by Killeen and Chien [29] which utilizes this
approach and also considers the possibility that the true spiller escaped without
being sampled. In this section we discuss this method in detail and several
numerical examples are given to illustrate its use in an oil spill identification
problem involving infrared data.

The probability model


In an oil spill case samples are taken from n known possible suspects. We number the suspects 1, 2, ..., n and let the event

A_i: the ith suspect is the true spiller,   i = 1, 2, ..., n.

Now let the event that the spiller escaped without being sampled be denoted by
A_0. Implicit in our model is the assumption that either one of the identified
suspects caused the spill or the true spiller(s) escaped.
Notice that the sample space is simply

S = \bigcup_{i=0}^{n} A_i

and the A_i are disjoint. We have partitioned the set of all possible outcomes into n + 1 disjoint events.

Prior probabilities
Many times prior information about the suspects will allow an investigator to
assign prior probability of guilt to each suspect. This information may include eye
witness identification, oceanographic data, information from ships' logs, etc.
These probabilities are denoted by P(A_i). Certainly P(A_i) ≥ 0 for i = 0, 1, 2, ..., n and \sum_{i=0}^{n} P(A_i) = 1. When no prior information is known, it is usually reasonable to assign P(A_i) = 1/(n + 1). That is, all events A_i, i = 0, 1, ..., n, are equally likely.

Posterior information and revision of probabilities


For simplicity suppose that one sample has been obtained from each suspect
and one from the spill. Also assume that only one chemical method is to be used
in comparing suspects with the spill. However, we discuss how the method may be
applied to multiple samples and techniques.
A spectrum is run on each sample and a distance statistic, S_i², is calculated comparing the ith suspect with the spill sample. Typically, S_i² increases as the suspect and spill spectra become dissimilar. Now, the n statistics S_1², S_2², ..., S_n² take on values x_1, ..., x_n, respectively. Let B be the event

B = {S_1² = x_1, S_2² = x_2, ..., S_n² = x_n}.

If multiple samples or methods are involved, the event B will become more complicated. We would need additional statistics for each sample and method from each suspect. For example, if two spill samples and two methods were used and the distance statistic for the second method were denoted by D², then

B = {S_{11}² = x_{11}, ..., S_{n1}² = x_{n1}, S_{12}² = x_{12}, ..., S_{n2}² = x_{n2}, D_{11}² = y_{11}, ..., D_{n2}² = y_{n2}}.


If the statistics S², D², etc., have been carefully chosen to extract the pertinent
information from the respective spectra, then all of the available additional
information about the spill is contained in the event B. Therefore, the best
possible estimate of the probability of guilt of the ith suspect is now P(A_i | B), the conditional or posterior probability of guilt given the event B.
Since the events A_i, i = 0, 1, ..., n, are disjoint and exhaustive, Bayes' rule

P(A_i | B) = \frac{P(B | A_i) P(A_i)}{\sum_{j=0}^{n} P(B | A_j) P(A_j)}    (1)

is applicable. The revised probability of guilt, P(A_i | B), is our proposed solution to the problem. The probabilities P(A_j) are simply the prior probabilities previously assigned, while the conditional probabilities P(B | A_j) are not directly available and must be estimated.

Estimation of conditional probabilities


In order to obtain P(B | A_i) we must know the distribution of the statistic S² in each of the two situations:
(i) when the compared samples are from different oils, and
(ii) when the compared samples are of the same oil and differences are due only to weathering.
Brown et al. [7] have obtained estimates of these distributions for their S² statistic applied to infrared spectra. They have calculated S² for 235 000 pairs of different oils and also for 5500 pairs of identical oils which had undergone various degrees
of weathering. Relative frequency histograms for each of these cases were calculated and appear in Fig. 4. Note that we use 100 times Brown's value. These histograms are estimates of the actual distribution of S² under the two given situations, so that when two different oils are compared P(S² = x) ≈ D(x), and if the oils were the same but had weathering differences P(S² = x) ≈ S(x). Since each S_i² value in B depends on a different oil, these values are stochastically independent. Therefore,

P(B | A_i) = S(x_i) \prod_{j \neq i} D(x_j)   for i = 1, 2, ..., n,

and

P(B | A_0) = \prod_{j=1}^{n} D(x_j).

In the case when multiple samples are obtained, specific care should be taken
to insure statistical independence; for instance, spill samples should be taken
from different areas of the spill at different times. When multiple chemical
methods are used, histograms for each method must be tabulated and the
statistical independence of the statistics must be established. We have preliminary
results indicating independence of the infrared S² and D², a distance statistic for
fluorescence.

EXAMPLE. In this section a simple example is presented which demonstrates how


our theory may be applied to an oil spill identification problem using IR data.
Suppose that single samples are taken from two suspects and a spill. There is no prior knowledge, so we assign P(A_i) = 1/3 for i = 0, 1, 2. The following S² values are obtained: S_1² = 3.23 and S_2² = 25.60. Recall that the smaller the S_i² value, the closer suspect i is to the spill. Intuitively, suspect 1 is the most likely spiller. We obtain the following table:

Suspect    x_i      D(x_i)     S(x_i)
1          3.23     0.0019     0.0094
2          25.60    0.0017     0.0003

Letting B_1 = {S_1² = 3.23, S_2² = 25.60}, we immediately obtain

P(B_1 | A_0) ≈ D(x_1)D(x_2) = (0.0019)(0.0017) = 32.3 × 10^-7,
P(B_1 | A_1) ≈ S(x_1)D(x_2) = (0.0094)(0.0017) = 159.8 × 10^-7,
and
P(B_1 | A_2) ≈ D(x_1)S(x_2) = (0.0019)(0.0003) = 5.7 × 10^-7.

Notice that since all prior probabilities are equal and appear in each term, they
divide out of expression (1) and we obtain

P(A_0 | B_1) = \frac{P(B_1 | A_0)}{\sum_{i=0}^{2} P(B_1 | A_i)} = \frac{32.3}{197.8} ≈ 0.163.

Similarly we get P(A_1 | B_1) ≈ 0.808 and P(A_2 | B_1) ≈ 0.029. Our intuition has been justified and suspect 1 is most likely the spiller. However, there is still a good chance that A_0 occurred and we certainly do not have enough information to
accuse suspect number 1.
If a second sample from a different part of the spill is also available, we may
continue in the following manner. Since the sample is from another part of the
spill we assume that it is independent of the first sample and recalculate (1) using
the above revised probabilities as priors. Suppose that the S² values (using the original suspect samples vs. the new spill sample) are S_1'² = 2.67 and S_2'² = 16.73.
The table of D and S values becomes:

Suspect    x_i'     D(x_i')    S(x_i')
1          2.67     0.0015     0.0146
2          16.73    0.0023     0.0007

Here we let B_2 = {S_1'² = 2.67, S_2'² = 16.73} and take P(A_0) ≈ 0.163, P(A_1) ≈ 0.808, and P(A_2) ≈ 0.029 as our prior probabilities; we obtain

P(B_2 | A_i) ≈ 34.5 × 10^-7 if i = 0,   335.8 × 10^-7 if i = 1,   10.5 × 10^-7 if i = 2.

Further computation yields P(A_0 | B_2) ≈ 0.020, P(A_1 | B_2) ≈ 0.979 and P(A_2 | B_2) ≈ 0.001. Note that, after the second application of Bayes' rule using the second sample, the probability of suspect 1 matching the spill has increased from 0.808 to 0.979.

The above example might have been done in one step, letting B = B_1 ∩ B_2 and P(A_i) = 1/3 as our prior probabilities. The calculations are slightly more complex; for instance, P(B | A_1) = S(x_1)D(x_2)S(x_1')D(x_2') = (0.0094)(0.0017)(0.0146)(0.0023). However, the final revised probability estimates would be the same as those we obtained.
If another chemical method is employed and (i) the distance statistic for this method is independent of S² and (ii) histograms for identical and different oils have been tabulated, then we may revise our probability estimates again, using our old revised probabilities as priors. In this way it is possible to combine the results of two or more methods into our probability calculations.
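The revision defined by expression (1), with the conditional probabilities built from the histogram estimates D(·) and S(·), can be written compactly. The sketch below uses a hypothetical function name; run twice, it reproduces the two-stage example above.

```python
import numpy as np

def revise_probabilities(priors, D_values, S_values):
    """One application of expression (1) for a single spill sample compared
    with n suspects.  priors has length n + 1, with index 0 for the event
    A_0 that the spiller escaped; D_values[i-1], S_values[i-1] are the
    histogram estimates D(x_i), S(x_i) for suspect i = 1, ..., n."""
    D = np.asarray(D_values, dtype=float)
    S = np.asarray(S_values, dtype=float)
    n = len(D)
    # P(B | A_0) = prod_j D(x_j);  P(B | A_i) = S(x_i) * prod_{j != i} D(x_j)
    likelihoods = [np.prod(D)]
    likelihoods += [S[i] * np.prod(np.delete(D, i)) for i in range(n)]
    posterior = np.asarray(priors, dtype=float) * np.array(likelihoods)
    return posterior / posterior.sum()

# Reproducing the two-stage example above (two suspects, equal priors 1/3):
p = revise_probabilities([1/3, 1/3, 1/3], [0.0019, 0.0017], [0.0094, 0.0003])
# p is approximately [0.163, 0.808, 0.029]
p = revise_probabilities(p, [0.0015, 0.0023], [0.0146, 0.0007])
# p is approximately [0.020, 0.979, 0.001]
```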

Discussion
The method presented here is a general one; it may be applied to any of the
standard chemical tests as long as a suitable 'distance' measure can be established
and the required histograms are available.
Our technique gives a truly quantitative calculation of the probability of a
match, provides a reasonable probabilistic model for the oil identification prob-
lem, and possibly allows an investigator to systematically combine several chemi-
cal methods.

3.2. Spectral matching model


Another approach to compensating for differences in oil spectra due to
weathering has been developed for fluorescence and infrared spectra by Killeen et
al. [30] and Anderson et al. [1]. This technique utilizes at least one laboratory
weathered sample of the suspected oil which indicates the magnitude and direc-
tion of weathering changes for that oil.
The method considers a digitized spectrum as an n dimensional vector where n
is the number of digitized points or peaks on each spectrum. Two fluorescence or
infrared spectra (vectors) that differ in magnitude but not direction in n-space are
spectroscopically identical. Such differences are due to instrument and sample
preparation variability. Any distance measure which is 0 for all such pairs of
vectors might be reasonable for this problem. The distance used in [30] and [1] is
angular distance in n-space, but several others might also be considered here.
Besides using angular measurements as distances, the laboratory weathered
sample of the suspect oil is also utilized to better decide whether the suspect
matches the spill. Theoretically, the spectrum of a given oil changes continuously
as the oil is weathered. That is, the possible weathered spectra of a given oil form
a surface in n-space. If the suspect is the actual source of the spill, then the true
spill spectrum will be a point on this surface, while the observed spill spectrum
will only be near the surface due to random errors in the observed spectrum. This
weathering surface may be approximated by the hyperplane generated by the neat
suspect oil spectrum and its lab weathered counterpart, i.e. two points on or near
the surface. The angle between the spill spectrum and this surface appears to be a
much better indication of a match or nonmatch than merely considering some
distance measure between the spill spectrum and suspect spectra. An additional
benefit is realized when one considers the position of the projection of the spill
spectrum in the generated hyperplane relative to suspect spectra. Even though the
spill spectra may be close to the hyperplane, the position of its projection may
indicate that the spill is not a weathered version of the suspect. This will occur
when the projection indicates weathering in the opposite direction from the neat
spill than the lab weathered spill.
This same technique has been extended to be used with more than one
laboratory weathered version of a given suspect. The remainder of this section
presents the method in more detail.

Suppose that the n-dimensional vectors X_0, X_1, X_2, ..., X_k are the sequence formed by the neat digitized suspect spectrum and k laboratory weathered spectra of the same oil. We assume that X_{g+1} is more severely weathered than X_g. Let Y denote the digitized spill spectrum and A denote the space spanned by the vectors X_0, X_1, ..., X_k. Any vector S in A is of the form

S = a_0' X_0 + a_1' X_1 + ... + a_k' X_k,    (2)

where the a_i' are scalars. S may now be rewritten as

S = a_0 X_0 + a_1 (X_1 - X_0) + a_2 (X_2 - X_1) + ... + a_k (X_k - X_{k-1}).    (3)

For spectra weathered no more than X_k, the previously mentioned weathering surface may be approximated by vectors of the form (3) with the restriction that a_0 = a_1 = ... = a_j = 1, 0 ≤ a_{j+1} < 1 and a_{j+2} = a_{j+3} = ... = a_k = 0 for some j = 0, 1, ..., k − 1.
Of course here we are approximating the weathering surface by a polygonal arc. The remainder of the surface, that which is more heavily weathered than X_k, may be simply approximated by vectors of the form (3) with a_i = 1 for i < k and a_k ≥ 1.
One approach here might be to calculate the angular distance from the spill to
its projection on the simulated weathering surface described above. This has not
been done, mainly because the necessary software has not been developed. A
mathematically simpler approach, however, has been taken. The spill spectrum is
projected onto A, the entire subspace generated by X_0, X_1, ..., X_k, and this
projection is written in form (3). If the suspect is the true spiller, then two things
should occur. First, the angular distance of the spill to the projection should be
small and, second, a_0, a_1, ..., a_k should be approximately of the form of the
parameters of the weathering surface. How one decides whether angle measure-
ments are small depends heavily upon the oil type, spectral technique and
instrumentation. Therefore, the reader is referred to [30] and [1] for particular
examples.
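A minimal numerical sketch of this projection step is given below, assuming the spectra are available as NumPy arrays. The function name, the least-squares projection, and the conversion from the coefficients of form (2) to those of form (3) are spelled out only for illustration; they are not claimed to match the software of [30] or [1].

```python
import numpy as np

def angle_to_weathering_subspace(Y, X_list):
    """Project the spill spectrum Y onto the subspace A spanned by the neat
    and laboratory-weathered suspect spectra X_0, ..., X_k, and return
    (i) the angular distance from Y to its projection and
    (ii) the projection's coefficients a_0, ..., a_k in the difference
    form (3), obtained from the form (2) coefficients by suffix sums."""
    Y = np.asarray(Y, dtype=float)
    X = np.column_stack([np.asarray(x, dtype=float) for x in X_list])
    # Least-squares coefficients of Y on X_0, ..., X_k (form (2)).
    c, *_ = np.linalg.lstsq(X, Y, rcond=None)
    projection = X @ c
    cos_angle = Y @ projection / (np.linalg.norm(Y) * np.linalg.norm(projection))
    angle = float(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
    # a_i (form (3)) = sum_{m >= i} c_m, since expanding (3) gives
    # coefficient c_i = a_i - a_{i+1} on X_i.
    a_form3 = np.cumsum(c[::-1])[::-1]
    return angle, a_form3
```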
This technique has been found to be more effective than a simple distance comparison for the identification of simulated spills, but at this point it has not been tested in real-world situations.

4. Summary of oil identification research

The major portion of this chapter has concentrated upon a discussion of various contributions to the science of oil spill identification. However, our coverage has by no means been complete. In conclusion, Table 1 outlines more fully the areas of research and appropriate references of major contributors to this fast-growing science over the past decade.

Table 1
Summary of oil spill I.D. research

Investigators              | Sponsoring institution      | Method of data analysis          | Type of chemical data                 | References
A. Bentz                   | USCG (a)                    | CG ID system                     | Infrared                              | 1, 3, 4, 5, 6, 42, 43
C. Brown                   | USCG, Univ. of RI           | Log ratio statistics             | Infrared                              | 2, 7, 8, 34, 43
Y. T. Chien, T. J. Killeen | USCG, Univ. of CT           | Probability model, curve fitting | Infrared, fluorescence                | 1, 9, 10, 29, 30, 43
M. Curtis                  | USCG, Rice Univ.            | Cluster analysis                 | Fluorescence                          | 11, 43
D. Eastwood                | USCG                        | CG ID system, curve fitting      | Fluorescence, low temp. luminescence  | 14, 17, 30, 42
G. Flanigan, G. Frame      | USCG                        | Log ratio statistics             | Gas chromatography                    | 15, 16, 18, 42, 43
J. Frankenfeld             | USCG, Exxon                 | Multiple methods                 |                                       | 19, 20
P. Grose                   | NOAA (b)                    | Modeling oil spills              |                                       | 21
R. Jadamec, W. Saner       | USCG                        | CG ID system                     | Thin layer and liquid chromatography  | 23, 24, 39, 40, 42
F. K. Kawahara             | USEPA (c)                   | Ratios of absorbances, L.D.F.A.  | Infrared                              | 27, 28
B. Kowalski                | Univ. of Washington         | Pattern recognition              |                                       | 13, 31, 32, 33, 43
J. Mattson                 | Univ. of Miami, NOAA, USCG  | Multivariate analysis, L.D.F.A.  | Infrared                              | 35, 36, 43

(a) USCG: United States Coast Guard.
(b) NOAA: National Oceanic and Atmospheric Administration.
(c) USEPA: United States Environmental Protection Agency.

References

[1] Anderson, C. P., Killeen, T. J., Taft, J. B. and Bentz, A. P. Improved identification of spilled oils by infrared spectroscopy, in press.
[2] Baer, C. D., and Brown, C. W. (1977). Identifying the source of weathered petroleum: Matching
infrared spectra with correlation coefficients, Appl. Spectroscopy 6 (31) 524-527.
[3] Bentz, A. P. (1976). Oil spill identification bibliography. U.S. Coast Guard Rept. ADA 029126.
[4] Bentz, A. P. (1976). Oil spill identification. Anal. Chem. 6 (48) 454A-472A.
[5] Bentz, A. P. (1978). Who spilled the oil? Anal. Chem. 50, 655A.
[6] Bentz, A. P. (1978). Chemical identification of oil spill sources. The Forum 12 (2) 425.
[7] Brown, C. W., Lynch, P. F. and Amadjian, M. (1976). Identification of oil slicks by infrared
spectroscopy. Nat. Tech. Inform. Service, Rept. ADA040975 (CG 81-74-1099), 29, 36, 38.
[8] Brown, C. W., Lynch, P. F. and Amadjian, M. (1976). Infrared spectra of petroleum--Data base
formation and application to real spills. Proc. IEEE Comput. Soc. Workshop on Pattern
Recognition Applied to Oil Identification, 84-96.
[9] Chien, Y. T., (1978). Interactive Pattern Recognition. Dekker, New York.
[10] Chien, Y. T. and Killeen, T. J. (1976). Pattern recognition techniques applied to oil identifica-
tion. Proc. IEEE Comput. Soc. on Pattern Recognition Applied to Oil Identification, 15-33.
[11] Curtis, M. L. (1977). Use of pattern recognition techniques for typing and identification of oil
spills. Nat. Tech. Inform. Service, Rep. ADA043802 (CG-81-75-1383).
[12] Duda, R. and Hart, P. (1973). Pattern Classification and Scene Analysis. Wiley, New York.
[13] Duewer, D. L., Kowalski, B. R. and Schatzki, T. F. (1975). Source identification of oil spills by
pattern recognition analysis of natural elemental composition. Anal. Chem. 9 (47) 1573-1583.
[14] Eastwood, D., Fortier, S. H. and Hendrick, M. S. (1978). Oil identification--Recent develop-
ments in fluorescence and low temperature luminescence. Amer. Lab. 3, 10, 45.
[15] Flanigan, G. A. (1976). Ratioing methods applied to GC data for oil identification. Proc. IEEE
Comput. Soc. Workshop on Pattern Recognition Applied to Oil Identification, 162-173.
[16] Flanigan, G. A. and Frame, G. M. (1977). Oil spill 'fingerprinting' with gas chromatography.
Res. Development 9, 28.
[17] Fortier, S. H. and Eastwood, D. (1978). Identification of fuel oils by low temperature
luminescence spectrometry. Anal. Chem. 50, 334.
[18] Frame, G. M., Flanigan, G. A. and Carmody, D. C. (1979). The application of gas chromatogra-
phy using nitrogen selective detection to oil spill identification. J. Chromatography 168, 365-376.
[19] Frankenfeld, J. W. (1973). Weathering of oil at sea. USCG. Rept. AD87789.
[20] Frankenfeld, J. W. and Schulz, W. (1974). Identification of weathered oil films found in the
marine environment. USCG. Rept. ADA015883.
[21] Grose, P. L. (1979). A preliminary model to predict the thickness distribution of spilled oil.
Workshop on Physical Behavior of Oil in the Marine Environment at Princeton University.
[22] Grotch, S. (1974). Statistical methods for the prediction of matching results in spectral file
searching. Anal. Chem. 4 (46) 526-534.
[23] Jadamec, J. R. and Kleineberg, G. A. (1978). United States Coast Guard combats oil pollution.
Internat. Environment and Safety 9.
[24] Jadamec, J. R. and Saner, W. A. (1977). Optical multichannel analyzer for characterization of
fluorescent liquid chromatographic petroleum fractions. Anal. Chem. 49, 1316.
[25] Jurs, P. and Isenhour, T. (1975). Chemical Applications of Pattern Recognition. Wiley, New York.
[26] Kanal, L. N. (1972). Interactive pattern analysis and classification systems: A survey and
commentary. IEEE Proc. 60 (10) 1200-1215.
[27] Kawahara, F. K., Santer, J. F. and Julian, E. C. (1974). Characterization of heavy residual fuel
oils and asphalts by infrared spectrophotometry using statistical discriminant function analysis.
Anal. Chem. 46, 266.
[28] Kawahara, F. K. (1969). Identification and differentiation of heavy residual oil and asphalt
pollutants in surface waters by comparative ratios of infrared absorbances. Environmental Sci.
Technol. 3, 150.
[29] Killeen, T. J. and Chien, Y. T. (1976). A probability model for matching suspects with
spills--Or did the real spiller get away? Proc. IEEE Comput. Soc. Workshop on Pattern
Recognition Applied to Oil Identification, 66-72.
[30] Killeen, T. J., Eastwood, D. and Hendrick, M. S. Oil matching using a simple vector model, in
press.
[31] Kowalski, B. R. and Bender, C. F. (1973). Pattern recognition--A powerful approach to
interpreting chemical data. J. Amer. Chem. Soc. 94 (16) 5632-5639.
[32] Kowalski, B. R. and Bender, C. F. (1973). Pattern recognition II--Linear and nonlinear
methods for displaying chemical data. J. Amer. Chem. Soc. 95 (3) 686-692.
[33] Kowalski, B. R. (1974). Pattern recognition in chemical research. In: Klopfenstein and Wilkins,
eds., Computers in Chemical and Biochemical Research, Vol. 2. Academic Press, New York.
[34] Lynch, P. F. and Brown, C. W. (1973). Identifying source of petroleum by infrared spectros-
copy. Environmental Sci. Technol. 7, 1123.
[35] Mattson, J. S. (1976). Statistical considerations of oil identification by infrared spectroscopy.
Proc. IEEE Comput. Soc. Workshop on Pattern Recognition Applied to Oil Identification,
113-121.
[36] Mattson, J. S. (1971). Fingerprinting of oil by infrared spectroscopy. Anal. Chem. 43, 1872.
[37] Mattson, J. S. (1976). Classification of oils by the application of pattern recognition techniques
to infrared spectra. USCG Rept. ADA039387.
[38] Preuss, D. R. and Jurs, P. C. (1974). Pattern recognition applied to the interpretation of infrared
spectra. Anal. Chem. 46 (4) 520-525.
[39] Saner, W. A. and Fitzgerald, G. E. (1976). Thin layer chromatographic techniques for identifica-
tion of waterborne petroleum oils. Environmental Sci. Technol. 10, 893.
[40] Saner, W. A., Fitzgerald, G. E. and Walsh, J. P. (1976). Liquid chromatographic identification of
oils by separation of the methanol extractable fraction. Anal. Chem. 48, 1747.
[41] Ungar, A. and Trozzolo, A. N. (1958). Identification of reclaimed oils by statistical discrimina-
tion of infrared absorption data. Anal. Chem. 30, 187-191.
[42] United States Coast Guard (1977). Oil spill identification system. USCG. Rept. ADA044750.
[43] Proc. IEEE Comput. Soc. Workshop on Pattern Recognition Applied to Oil Identification (1976).
Catalogue No. 76CH1247-6C.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
© North-Holland Publishing Company (1982) 673-697

Pattern Recognition in Chemistry

Bruce R. Kowalski* and Svante Wold

1. Introduction

In terms of the types of measurements made on chemical systems (molecules,
atoms and mixtures thereof), chemistry is a multivariate science (see Tables 1 and
2). Atomic particles in physics are characterized by the values of very few
variables: charge, mass, momentum and spin. In contrast, chemical systems need
the specification of a large number of variables for their description. To be
concrete, in order to distinguish one structural type of molecule from another, say
an alcohol from an amine or a cis conformer from a trans conformer, one needs to
measure several types of spectra on molecules of both types (see Section 6). In
order to determine the source of a chemical mixture, say an oil spill, this mixture
and samples from possible sources must be characterized in terms of their content
of various chemical components, usually a rather large number.
The development of instrumental methods in chemistry during the last two
decades (see Table 2) has made it first possible and then common to measure a
large number of variables on chemical systems. This has, in turn, allowed
chemists to attack more complex problems than before, e.g. environmental,
medical and others, but also, rather paradoxically, has led to an information
handling problem, often nicknamed the data explosion problem. The reason is
that chemical theory has developed much more slowly than chemical practice and the
'old' theory is little suited for the analysis of multivariate data.
Hence chemists have turned to 'empirical', 'data analytic' types of methods for
handling their emerging masses of data. Multivariate methods of pattern recogni-
tion, classification, discriminant analysis, factor and principal components analy-
sis and the like have been found most useful in many types of chemical problems
(see Table 1). With increasing experience of multivariate data analysis, chemists
have reached a better understanding of the nature of chemical data and thereby
also have been able to specify more explicitly which kind of information one
wants to extract from the data. The present review tries to describe this under-

*The research work of this author is supported in part by the Office of Naval Research.


Table 1
Chemical systems (objects, cases, samples) and variables typically used for their characterization (see also Table 2), with pattern recognition classes and examples

Chemical compounds or complexes. Variables: spectral and thermodynamic. Classes: structural type. Examples: amines, alcohols, esters; quadratic, tetrahedral, ...

Chemical mixtures (minerals, alloys, water samples, oil samples, polymers). Variables: concentrations of constituents and trace elements. Classes: type (corrosive or not; cracking or noncracking plastic); source (mineral deposit 1, 2 or 3; oil tanker 1, 2 or 3).

Biological samples. Variables: amounts of organic constituents and fragmentation products of biopolymers, also trace element concentrations. Classes: type (taxonomic type such as family; allergenic or non-allergenic); source (blood of suspect 1, 2 or 3; fish from lake 1, 2 or 3); disease (differential diagnosis).

Chemical reactions. Variables: thermodynamic and kinetic, amounts of major and minor products. Classes: mechanistic type (SN1 or SN2; solvent assisted or not; kinetic or thermodynamic control).

Biologically active compounds. Variables: fragment or substituent descriptors, quantum mechanical indices. Classes: biologically active or non-active (drugs, toxic compounds); type of biological activity (carcinogens).

Table 2
Common instrumental methods giving multiple measurements for chemical systems (objects, cases, samples)

(1) Spectral methods (applied to compounds, sometimes mixtures, and polymers):
IR, NMR, UV, ESCA, X-ray: wave lengths or frequencies of characteristic absorptions (peaks), or digitized spectra (absorption at regular wave length intervals).
Mass spectra: ion abundances of fragments.
Atomic absorption, etc.: trace and major element concentrations.

(2) Separation methods (applied to mixtures, polymers and biological samples):
Gas chromatography (GC): amounts of volatile constituents.
Pyrolysis-GC: amounts of volatile fragments.
Liquid chromatography (LC): amounts of soluble constituents.
Pyrolysis-LC: amounts of soluble fragments.
Amino-acid analysis: amounts of different amino acids.
Electrophoresis: amounts of different macromolecules.
Gel filtration: amounts of different larger molecules.
Medical analyses (blood, urine, etc.): inorganic, organic, biochemical constituents.

standing, still far from complete, of how to apply pattern recognition in chem-
istry.

2. Formulation of chemical problems in terms of pattern recognition

As in other disciplines, chemists have borrowed pattern recognition methods
and associated terminology from the engineering, statistics and computer science
literature and made the necessary alterations for application to chemical prob-
lems. To the chemist trying to understand a particular chemical application of
pattern recognition, the new pattern recognition terminology is often somewhat
confusing. Conversely, to the statistician trying to understand how pattern
recognition is used in chemistry, the altered notation and terminology is a barrier
to understanding. This problem has been faced several times before, whenever
two disciplines are interfaced.
In chemistry, a general statement of the pattern recognition problem in its
broadest sense is: given a collection of objects characterized by a list of measure-
ments made on each object, is it possible to find and/or predict a useful property
of the objects that may not be measurable, but is thought to be related to the
measurements? The property to be predicted is often qualitative, e.g. the type or
class of an object (see Table 1 for example). In other cases, one desires a
quantitative prediction of a property, for instance the strength of a material or the
level of carcinogenicity of a chemical compound. This latter type of prediction is
outside the traditional domain of pattern recognition but can in our view
conveniently be seen as an extension of pattern recognition (see further
Section 5).
The objects can range from pure chemical compounds to complex mixtures and
chemical reactions. The measurements are usually laboratory measurements (see
Table 1) but can also be calculated theoretical values or structural properties
derived from molecular topology. For mixtures, examples of typical measure-
ments include quantitative measurements such as the fractional amount of each
component in a mixture as determined by gas chromatography and the semi-
quantitative sensory evaluation scores assigned to food products by panels of
judges. For pure compounds, the number of selected functional groups contained
in a molecular structure and the energy of a molecular orbital can be, and have
been, used to characterize the compounds for chemical applications of pattern
recognition.
The measurements used to represent a sample are often collectively called a
data vector. For all objects, the measurements are identically ordered in each data
vector that represents each sample and together they form the vector space and a
particular data structure. Briefly, pattern recognition can be summarized as a
collection of methods which help the chemist to understand the data structure of
the vector space. If only two measurements are made in a trivial application, the
chemist would plot each object as a point in a two-dimensional space where each
axis corresponds to one of the measurements. The two measurements for each
object serve to position each point in the space. The 'data structure' is the overall
relation of each object to every other object and, in this simple two-dimensional
vector space, the data structure is immediately available to the chemists once the
plot is made.
As mentioned in Section 1, chemistry is a multivariate science and in order for
the chemist to solve complex problems, more than two or three measurements per
object are often required. When the dimension of the vector space (henceforth
denoted M) becomes greater than three, the complete data structure is no longer
directly available to the chemist.
During the 1970's, chemists have solved this problem by applying the many
methods of pattern recognition to their own M-dimensional vector spaces. The
many methods under the general heading of "Preprocessing" (Andrews, 1972),
sometimes "Feature Selection", can aid the chemist by transforming the data
vector space to a new coordinate system and thereby enhancing the information
contained in the measurements. These methods can also be used to eliminate
measurements that contain no useful information in the context of the application
or weight those measurements containing useful information in proportion to
their usefulness.
Display methods (Kowalski and Bender, 1973) are useful for providing linear
projections or nonlinear mappings of the M-dimensional data structures into
two-dimensions with minimum distortion of data structure. These methods allow
an approximate view of the data structure and, used in conjunction with unsupervised
learning (Duda and Hart, 1973) or cluster analysis methods that can detect natural
groupings or clusters of data vectors, they are often effective at providing the
chemist with an understanding of the data structure. They can be thought of as
viewing aids to allow a visual, or at least conceptual, examination of M-dimen-
sional space.
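As a concrete illustration of what a linear display method does, the following minimal sketch (ours, not taken from the cited works) projects an N x M data matrix onto its first two principal components; each row of the result can then be plotted as one point, giving the approximate view of the data structure referred to above.

import numpy as np

def two_dimensional_display(X):
    # X: data matrix with N objects (rows) and M variables (columns).
    Xc = X - X.mean(axis=0)                  # mean-center each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                     # N x 2 coordinates for plotting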
The majority of chemical applications of pattern recognition have used super-
vised learning (Fukunaga, 1972) or classification methods. Here, the goal is to
partition the vector space so that samples from well-defined categories based on
chemically meaningful rules fall into the same partition. Examples of classifica-
tion methods applied to chemical problems are given later in this chapter. In these
applications, the property referred to in the general statement given earlier is a
class or category membership. Prototype samples with known class membership
are used to partition the data vector space and samples with unknown class
membership are then classified depending upon their location in the vector space.
The prototype samples from all categories are collectively called the training set.
In the following, the elements of a data vector are interchangeably referred to
as 'measurements' and 'variables'. The former reflects a terminology most
palatable to chemists and the latter terminology more often preferred by statisti-
cians. Besides these terms, the pattern recognition literature often refers to
single-valued functions of measurements or variables as 'features'. Likewise, a
data vector can also be referred to as a 'sample', 'pattern', 'object' or 'system'
usually depending upon the author's background. Reading the pattern recogni-
tion literature from the perspective of a statistician, engineer or chemist, the
reader is forced to be familiar with these different terminologies.

3. Historical development of pattern recognition in chemistry

Pattern recognition in chemistry had its start in 1969 when analytical chemists
at the University of Washington applied the Learning Machine (Nilsson, 1965) to
mass spectral data in attempts to classify molecules according to molecular
structural categories (Jurs, Kowalski and Isenhour, 1969). The objects, chemical
compounds, were characterized by normalized ion intensity measurements at
nominal mass/charge values as measured by low resolution, electron impact mass
spectrometry. For each application, objects with a particular functional group,
say a carbonyl (>C=O), were put into one class and all other molecules in
another. The learning machine was used to find a hyperplane that separated the
two classes in the vector space with the eventual goal of classifying compounds
with unknown molecular structure based on their location in the so-called mass
spectra space.
In the early 1970's, several improvements were made to preprocessing methods
applied to mass spectral data prior to classification (Jurs and Isenhour, 1975) and
the learning machine was applied to other types of molecular spectroscopy. It
wasn't until 1972 that chemists were introduced to the use of several pattern
recognition methods to solve complex multivariate problems. (Kowalski, Schatzki
and Stross, 1972; Kowalski and Bender, 1972).
Currently, a strong and useful philosophy of the application of pattern recogni-
tion and other areas of multivariate analysis in chemistry is developing (Albano,
Dunn, Edlund, Johansson, Norden, Sjöström and Wold, 1978). At the same time,
pattern recognition is taking its place with other areas of mathematical and
statistical analysis as powerful tools in a developing branch of chemistry: chemo-
metrics (Kowalski, 1977; 1980). Chemometrics is concerned with the application
of mathematical and statistical methods (i) to improve the measurement process,
and (ii) to better extract useful chemical information from chemical measure-
ments. In recognition of the importance of chemometrics in chemistry, the journal
that published the first account of pattern recognition in chemistry, Analytical
Chemistry, celebrated its 50th anniversary with a symposium that ended with a
review of the developments of chemometrics over the 1970's with a look into the
future (Kowalski, 1978).

4. Types of chemical data and useful preprocessing methods

4.1
The representation of the data vectors measured on the objects as points in the
multidimensional space demands certain properties of these data as given in the
following:
(1) The data should be continuous; i.e. a small chemical change between two
systems should correspond to a small distance between the corresponding points
in M-space and hence to a small change in all variables characterizing the
systems. Many chemical variables have this continuity property directly, but
others do not and therefore must be transformed to a better representation.
(2) Many methods of pattern recognition function better when the variables
are fairly symmetrically distributed within each class. Trace element concentra-
tions and sometimes chromatographic and similar data can display very skewed
distributions. It is therefore good practice to investigate the distributions of the
variables by means of, for instance, simple histograms and to transform skewed data
by the simple logarithm or the gamma-transformation of Box and Cox (1964).
(3) The variation in each variable over the whole data set should be of the same
order. Distance-based pattern recognition methods, including subspace methods
such as SIMCA, are sensitive to the scaling of the data. Initially, equal weighting
of the data should therefore be the rule. This is obtained by regularization
(autoscaling) to equal variance, as sketched after this list (see Kowalski and Bender, 1972).
(4) The variables should be used according to good chemical practice; logarith-
mic rate and equilibrium constants, internally normalized chromatograms, etc.
(5) Finally, for certain methods of pattern recognition there are bounds on the
number of variables (see Section 7.1). Also, there are dangers in preprocessing the
data by selecting the variables that differ most between the classes if the number
of variables exceeds the number of objects in the training set divided by three (see
Section 7.2). In this case, the only safe preprocessing methods are those which are
not conditioned on class separation, i.e. normalization, autoscaling, Fourier or
Hadamard transforms, autocorrelation transforms and the selection of variables
by cluster analysis.
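To make points (2) and (3) concrete, a minimal sketch of the log transformation and autoscaling might look as follows; which columns to log-transform is a choice the analyst makes from the histograms, and all names are illustrative.

import numpy as np

def preprocess(X, log_columns=()):
    # X: N objects x M variables; log_columns: indices of skewed, positive variables.
    X = np.asarray(X, dtype=float).copy()
    for j in log_columns:
        X[:, j] = np.log(X[:, j])            # point (2): reduce skewness
    X -= X.mean(axis=0)
    X /= X.std(axis=0, ddof=1)               # point (3): autoscaling to unit variance
    return X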

4.2
Some of the types of chemical measurements that have been used in chemical
applications of pattern recognition include the following:

4.2.1. Mass spectra


When a molecule in a gas phase is bombarded by an electron with sufficient
energy, a number of complex processes may occur that depend on several factors
(e.g., the energies of the bonds in the molecule). From an ensemble of identical
molecules, the several possible processes will give rise to a unique pattern of ions
with mass-to-charge ratios and ion abundances that can be used by the expert to
determine the molecular structure of the molecules.
Using the ion intensities at all or selected mass/charge ratios as measurements
in a pattern recognition problem leads to a data vector. Since the pattern of ion
abundances is reproducible and unique for a compound, and the pattern depends
on the bond energies and molecular topology, it follows that molecules that are
similar in molecular structure would be expected to be relatively close in the
vector space. This is the basis of numerous studies aimed at using pattern
recognition and training sets of mass spectra to develop classification strategies
that can extract molecular structural information directly from mass spectra
without the assistance of an expert (Jurs and Isenhour, 1975).
When representing a mass spectrum as a data vector, the direct use of
intensities at various mass numbers often is impractical. The reason is that similar
compounds differing by, say, a methyl group, have very similar mass spectra but
parts of the spectra are shifted in relation to each other by 15 units (the mass of a
methyl group). Therefore a representation of the mass spectrum which recognizes
similarities in shifted spectra is warranted; Fourier, Hadamard and autocorrelation
transforms are presently the best choices (McGill and Kowalski, 1978).
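A simple form of the autocorrelation transform mentioned above can be sketched as follows; because only products of intensities at fixed mass differences enter, a uniform shift of the whole spectrum (for example by the 15 mass units of a methyl group) leaves the representation essentially unchanged. The function name and the maximum lag are illustrative choices.

import numpy as np

def autocorrelation_transform(intensities, max_lag=50):
    # intensities: digitized mass spectrum (intensity at consecutive mass numbers).
    x = np.asarray(intensities, dtype=float)
    x = x / np.linalg.norm(x)                # scale out total intensity
    # Autocorrelation at mass differences 0, 1, ..., max_lag.
    return np.array([x[:len(x) - d] @ x[d:] for d in range(max_lag + 1)])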

4.2.2. Vibrational spectra


The absorption of electromagnetic radiation at frequencies in the infrared
region is associated with vibrational changes in the structure of a molecule. Most
molecules, therefore, give rise to an infrared absorption spectrum that, with few
exceptions, is unique to a molecular structure.
One of the earliest applications of pattern recognition in chemistry (Kowalski,
Jurs, Isenhour and Reilley, 1969) used the infrared absorptions at selected
frequencies as measurements and, in a manner analogous to the mass spectral
studies mentioned above, attempted to classify chemical compounds according to
molecular structural properties.
Recognizing that infrared spectra and mass spectra are complementary, another
study (Jurs, Kowalski, Isenhour and Reilley, 1969) combined the two types of
spectral data into a single data vector. In a fair (equal number of measurements)
comparison to the use of the individual spectra, classification results were indeed
improved.
As discussed below with NMR-spectra, the IR-spectra have properties which
make them unsuitable for direct digitization. When the peaks in the IR spectra
can be identified with specific functional groups (see Section 6.1), the precise
locations of the functional group peaks are recommended as variables. For the
'fingerprint' region and other parts with 'unassigned' peaks, the same transforms
are recommended for the IR spectra as for MS and NMR spectra.

4.2.3. Nuclear magnetic resonance (NMR) spectra


An NMR spectrum of a compound (or a mixture) contains rather narrow
absorption bands (peaks) at characteristic frequencies. Each peak corresponds to
a specific atom type in the molecule(s).
When the assignment of the peaks is known, i.e. when the connection between
each peak and atomic type is known, NMR spectra seem to be best represented in
pattern recognition using the frequency of each atom type as a variable. For an
example, see Sjöström and Edlund (1977).
When the assignment of peaks is uncertain or unknown or in cases when the
different systems have partly different types of atoms, the NMR spectrum must
be converted to variables without using the information of the assignment of the
peaks. The spectrum cannot be digitized directly, using the absorption at regular
intervals as variables. The reason is that a small shift of one peak from one
spectrum to another results in large jumps in the variables, thus making them lack
continuity properties as discussed in point one above. Instead, a transformation
of the spectrum should be used (see Kowalski and Reilley, 1971).

4.2.4. Electrochemical waveform


The reduction and oxidation of chemical species at the surface of an electrode
in a solution is the basis for a number of very important methods in electrochem-
istry. In stationary electrode polarography for example, voltage is scanned while
the current is measured between a working and a reference electrode. The
waveforms contain peaks that provide information as to the identity and amount
of each species undergoing an electrochemical change.
In applications conducted by Perone and co-workers (Burgard and Perone,
1978; Schachterle and Perone, 1981), a large number of features of the waveform
and the first and second derivative waveforms are measured and used to de-
termine the number of species undergoing electrochemical change within a given
voltage range.

4.2.5. Elemental composition


Any mixture sample can be characterized by its composition where each
measurement in the data vector is the amount of a specific component of the
mixture. When the samples are minerals, rocks or archaeological artifacts, chem-
ists can easily and accurately determine the concentrations of the elements which
can then be used as data vectors in pattern recognition applications. Using trace
element data vectors, chemists have related archaeological artifacts to their
sources, (Kowalski, Schatzki and Stross, 1972), classified food products (Kwan
and Kowalski, 1978) and solved important geochemical problems (Howarth,
1974).

4.2.6. Chromatography and other separation methods (See Table 2)


Chromatograms (GC, LC, electrophoresis, etc.) of a number of chemical
systems analyzed under reproducible conditions contain a number of peaks
occurring at approximately the same time index in each chromatogram. The peak
heights and areas differ between the samples, however. Either the peak heights or
the integrated areas can be used as variables in pattern recognition, the former
being preferable since they are easier to measure. The reason that both can be
used is that pattern recognition is concerned with relations between variables and
usually not with their absolute values. Since peak height and area are related by a
continuous function, these relations between the variables remain qualitatively
unchanged regardless of peak representation used.

A common difficulty in measuring chromatograms lies in the definition of the
base line, i.e. the position of zero peak values. When using chromatographic data
in pattern recognition, it is immaterial where the base line is drawn as long as the
base line is drawn in the same way in all chromatograms. The reason for this
insensitivity is that if the base line is changed, each peak will be changed by a
constant amount in all chromatograms and the relations between variables over
the data set will, qualitatively, be unchanged.
Peaks with large variations in size within a class are commonly associated with
skewed distributions. In such cases, the logarithms of the peak heights or areas
should be used as variables. Also, when no internal standard is included in the
chromatogram, it is customary to normalize each chromatogram so that the sum
over all variables in a single chromatogram is 1.0 (or 100 or 1000). This
normalization should precede the conversion to logarithmic values and normali-
zation to unit variance of each variable over the whole data set. The slight
correlations introduced by this internal normalization are unimportant as long as
the number of peaks exceeds 6 or 7 and as long as one peak is not very much
larger than all others. In the latter case a better internal normalization is to divide
all peaks by the largest peak value and throw out this largest peak which becomes
constant.
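The order of operations recommended above for chromatographic peak data (internal normalization first, then logarithms, then scaling of each variable to unit variance over the whole data set) can be sketched as follows, assuming strictly positive peak values; all names are illustrative.

import numpy as np

def preprocess_chromatograms(peaks):
    # peaks: N chromatograms x M peak heights (or areas), all strictly positive.
    P = np.asarray(peaks, dtype=float)
    P = 100.0 * P / P.sum(axis=1, keepdims=True)   # internal normalization to 100
    P = np.log(P)                                  # logarithms of the peak variables
    P -= P.mean(axis=0)
    P /= P.std(axis=0, ddof=1)                     # unit variance over the data set
    return P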
When samples are run in replicates, which of course is highly recommended,
the average of the replicate chromatograms should not be used for the pattern
recognition analysis. Considerable information on the correlation structure be-
tween the variables can be lost which, in turn, makes the pattern recognition
become much less efficient. Instead the single chromatograms should be entered
as single 'objects' in the data analysis. One must then remember that the actual
number of independent objects is smaller than the number of data vectors, a fact
of importance when selecting variables (Section 7.2) or using pattern recognition
methods conditioned on class separation (Section 7.1).

4.2.7. Other types of chemical data


When analyzing data not obtained from spectral or separation methods dis-
cussed above, experience is more limited and one can only follow the general
guidelines given in the beginning of this section. When waveforms or curves are
analyzed, for instance in electrochemistry and kinetics, these curves can usually
be digitized directly since they vary smoothly and slowly. Curve-forms showing
rapid changes occurring at different places in different systems (samples) must be
transformed to a more stable representation before digitization (see discussion of
N M R above), possibly using a more fundamental model (kinetic, etc.).
In all cases it is important to use representations of the data in which
spurious correlations have not been introduced by the transformation of the raw
data. A warning example is found in the field of extrathermodynamic relation-
ships, where the so-called isokinetic relationship is used to study similarities
between chemical reactions. There, variables are often derived from the tempera-
ture dependence of the single reactions in such a way that the variables become
strongly correlated due to the property of the transformation. Strange similarities
are found which completely disappear when variables not having these spurious
correlations are used (Exner, 1970; 1973).

5. Pattern recognition methods used

As mentioned before, we find it convenient to discuss methods of pattern
recognition using geometric concepts (planes, lines, distances) in the M-dimen-
sional space formed by giving each variable an orthogonal coordinate axis. This
space we henceforth call simply M-space.
The most common pattern recognition method used in chemical applications is
the so-called Linear Learning Machine (LLM) which is based on separating the
training set points with an M - 1 dimensional hyperplane (Nilsson, 1965; Jurs
and Isenhour, 1975). The probable reason for its popularity is the simplicity of
implementing it in computer programs. The very similar linear discriminant analysis
(LDA) is also often used because of its availability in standard statistical
packages such as BMD, SAS, OSIRIS and SPSS.
Unfortunately, the basic prerequisites for the applicability of these hyper-plane
methods are often forgotten, which makes many applications of these methods in
chemistry difficult to evaluate.
The prerequisites are: (i) The number of independent objects in the training set
(N) must be larger by at least a factor three than the number of variables (M). If,
initially, this condition is not fulfilled, the selection of variables by choosing such
variables that discriminate most between the classes (i.e. selection on the basis of
variance weights, Fisher weights, etc.) may lead to the selection of chemically
meaningless variables.
(ii) Each class is really a homogeneous collection of similar systems. Hence,
LLM and LDA are not useful in the asymmetric situation often encountered in
so-called binary classification (See Section 7.4).
In addition, these methods are less suitable for many chemical problems
because they perform merely a classification of the objects in the test set in terms
of a yes-no assignment to the available classes. No provision for finding outliers
in the training set or objects in the test set which lie far from the existing classes is
given. Furthermore, no information about the 'typical profile' of the objects in
the various classes is obtained, no pattern analysis (Sammon, 1968; Sjöström and
Kowalski, 1979) is possible. Together these drawbacks limit the applicability of
the hyper-plane methods in chemical problems (Albano et al., 1978). Moreover,
comparisons of common pattern recognition methods on a variety of chemical
data sets have shown LLM also to be inferior to other methods in the classifica-
tion results (Sjöström and Kowalski, 1979).
The second most popular method used in chemistry is the K nearest neighbor
method (KNN). This method classifies the vectors in the test set according to the
'majority vote' of its K nearest neighbors in M-space. Usually K-values of three or
one are used. The method has the advantage of being little dependent on the
homogeneity and distribution within each class and works well as long as the
classes have approximately equally many representatives in the training set. This
makes the method a good complement to other methods which are more depen-
dent on the homogeneity of the classes.
The KNN method, in its standard form, however, has the same drawbacks as
the hyper-plane methods discussed above: it gives no information about outliers or
deviating systems and provides no opportunity for pattern analysis. Generaliza-
tions of the method to remove these drawbacks seem possible, however.
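A minimal sketch of the K nearest neighbor rule just described, with Euclidean distances in the (autoscaled) M-space and a simple majority vote; the names and the default K = 3 are illustrative.

import numpy as np
from collections import Counter

def knn_classify(train_X, train_labels, test_x, k=3):
    # Assign test_x to the class holding the majority among its k nearest
    # training vectors in M-space.
    distances = np.linalg.norm(train_X - test_x, axis=1)
    nearest = np.argsort(distances)[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]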
The nature of chemical problems (i.e. the information the chemist wants to
extract from his data) make the so-called modelling methods presently the most
attractive for general use in chemistry. Here the Bayes method is the most widely
available and most well known (ARTHUR; Harper, Duewer, Kowalski and
Fasching, 1977; Fukunaga, 1972). In this method, the frequency distribution is
calculated for each variable within each class. An object from the test set is then
compared, variable by variable, with each class distribution to give a probability
that the variable value actually was drawn from that distribution. The probabili-
ties over all variables are then multiplied together to give a set of probabilities for
the object to belong to the available classes.
The Bayes method is based on the assumption that the variables are uncorre-
lated both over all classes and within each class. This is seldom fulfilled and
therefore the variables should be transformed to an orthogonal representation.
Otherwise the calculated probabilities can be grossly misleading. When using
orthogonalized variables in terms of principal components (Karhunen-Loève
expansion) disjointly for each class, the Bayes method becomes very similar to the
SIMCA method discussed below.
One disadvantage with the Bayes method seems to be that in order to get fair
estimates of the frequency distributions within each class, one needs rather many
objects in each class training set, say of the order of 30. With smaller training sets,
assumptions must be made concerning the shape of the distributions which
complicate the application and make it more risky.
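The variable-by-variable multiplication of probabilities that characterizes the Bayes method can be sketched as follows; for brevity each within-class distribution is taken to be Gaussian here, whereas the method as described would use the empirically estimated frequency distribution of each variable. All names are illustrative.

import numpy as np

def bayes_class_probabilities(class_data, x):
    # class_data: dict mapping class label -> (n_c x M) training matrix.
    # Returns relative probabilities for x under each class, multiplying
    # one-dimensional densities variable by variable (independence assumed).
    scores = {}
    for label, Xc in class_data.items():
        mean, std = Xc.mean(axis=0), Xc.std(axis=0, ddof=1)
        dens = np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))
        scores[label] = float(np.prod(dens))
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}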
A second modelling method developed for chemical use rather recently is the
SIMCA method (acronym for Soft Independent Modelling of Chemical Analogy).
This method is essentially a subspace method based on a truncated principal
components expansion of each separate class (Wold, 1976; Wold and Sjöström,
1977; Massart, Dijkstra and Kaufman, 1978). However, the modelling aspect of
the principal components analysis is emphasized more than is usual in the
subspace methods. This allows the dimensionality of the class PC models to be
estimated directly from the data (Wold, 1978).
Compared with the Bayes Classification method, the SIMCA method has the
advantage that it can function with correlated variables; in fact the principal
components models utilize differences in correlation structure between the classes
for the classification. Individual residuals for each variable and object are
calculated which facilitates the interpretation of outliers (Sjöström and Kowalski,
1979).
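A bare-bones sketch of the SIMCA idea: each class is represented by its own truncated principal components model, and a new object is judged by the size of its residuals against each class model. In this sketch the number of components per class is supplied by the user rather than estimated by cross-validation as in Wold (1978), and the residual scaling is simplified.

import numpy as np

class SimcaClassModel:
    # Truncated principal components model of one class (simplified).
    def __init__(self, X, n_components):
        self.mean = X.mean(axis=0)
        Xc = X - self.mean
        U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
        self.loadings = Vt[:n_components]                       # A x M
        residuals = Xc - (Xc @ self.loadings.T) @ self.loadings
        self.s0 = np.sqrt((residuals ** 2).mean())              # typical residual size

    def residual_sd(self, x):
        xc = x - self.mean
        r = xc - (xc @ self.loadings.T) @ self.loadings
        return np.sqrt((r ** 2).mean())

def simca_classify(models, x):
    # models: dict mapping class label -> SimcaClassModel.  A residual much
    # larger than every class's s0 would flag x as an outlier to all classes.
    return min(models, key=lambda label: models[label].residual_sd(x) / models[label].s0)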

The great advantage with modelling methods in chemical problems is that they
directly give a 'data profile' for each class, which allows the construction of
systems (objects) typical for each class. In particular, in structure/activity appli-
cations (see the example in Section 6.2) this is a primary goal of the data analysis.
The classification in terms of probabilities for each object belonging to each class
provides opportunities to find outliers in both the training and test sets. More-
over, this gives valuable information as to whether an object is sufficiently close
to a class to be considered a typical class member; information of particular
importance in the classification of structural types (see the example in Section 6.1)
and the assignment of the source of a sample (Section 6.3).
Another advantage with the modelling methods is that they operate well even if
some data elements are missing either in the training set or test set or both. The
distribution of each variable (Bayes method) or the parameters in a principal
components model for each class are well estimated even for incomplete training
set data matrices, albeit less precisely. Data vectors in the test set are classified on
the basis of the existing data, again giving consistent, but naturally less precise,
results compared with the complete data situation.
Finally, the possibility of pattern analysis given by the modelling methods is
most valuable in chemical applications where the interpretation of the differences
between the classes and the structure inside the classes often is desired. The
modelling methods give information about the position of each object inside each
class which can be related to theoretical properties of the objects (Section 6.1) or
variables for which a prediction is desired (Section 6.2). The grouping of variables

Table 3
Summary of the properties of the most common methods in chemical pattern recognition

Property                                        LLM   LDA   KNN   Bayes   SIMCA
Classification in terms of closest class         +     +     +     +       +
Class assignment in terms of probabilities                          +       +
Not dependent on the homogeneity of classes                  +
Works in asymmetric binary classification                    +     +       +
Not dependent on M < N/3                                     +     +       +
Detects outliers in training and test sets                          +       +
Gives typical data profile of each class               +            +       +
Possibilities for pattern analysis                                  +       +
Tolerates missing data                                       +     +       +

is another valuable piece of information available in the modelling methods.


The knowledge about chemical systems is, in general, rather detailed and the
quality of chemical data is correspondingly high. Pattern recognition methods are
therefore needed which extract a maximum of information from these data. This
need is presently best met by modelling methods (See Table 3).
As mentioned above, several methods of pattern recognition are available for
application to chemical problems. Reports of successful applications of these
methods abound in the chemical literature. For the past four years, a large
package of computer programs (ARTHUR), implementing pattern recognition
methods, has been distributed to over 400 laboratories around the world. In
addition to standard statistical and factor analysis methods the program, called
ARTHUR, contains three display methods, two cluster analysis methods, six
preprocessing methods, ten classification methods, a number of utility programs
and the ability to propagate the uncertainties in the measurements (if available) to
the results of many of the methods. ARTHUR has been found to be a powerful
general purpose tool for the analysis of multivariate data. Using the agreement, or
disagreement, between results of two or more complementary methods has led to
an enhanced understanding of the data structure in several applications (Duewer
et al., 1978; Boyd et al., 1978).

6. Some selected chemical applications

The two typical applications of pattern recognition in chemistry are (i) the
determination of the structure of a compound on the basis of spectra measured
on the compound and a number of reference compounds and (ii) the determina-
tion of the type or source of a chemical mixture--an alloy, a mineral or a
hydrocarbon mixture (say, an oil spill) or a macro-molecular mixture (say, a
micro-organism)--on the basis of the content of chemical components in the
mixture and in a number of reference mixtures. We discuss first an example of the
former type, then an 'inverted' problem where relations are sought between
the chemical structure and biological activity of a series of compounds. Second,
we discuss two examples of the second type where the type of a sample is inferred
from its content of trace elements or biochemical macro-molecules. Several other
applications are covered in recent reviews (Kowalski, 1980; Varmuza, 1980).

6.1. The structure of unsaturated carbonyl compounds


Infrared (IR) and ultraviolet (UV) spectra were measured on solutions of a
number of unsaturated carbonyl compounds of the general structure R1-CO-
C(R2)=C(R3)R4. The purpose was to learn if compounds with large groups R1
and R3 were twisted out of the normal trans conformation to an intermediate,
twisted, conformation or all the way to the semi-stable cis conformation. To get
reference data on compounds with known conformation, Mecke and Noack
(1960) included molecules which were locked in trans conformation via a carbon
bridge from R1 to R3 or locked in cis conformation by a carbon bridge from R1 to
R2.
Seven variables (the frequencies of CO and CC absorption in IR and the wave
length of maximum absorption in UV and the intensities of these absorptions)
were extracted from the spectra to characterize each compound. These data were
analyzed in terms of two classes:
Class 1. Compounds with 'known' trans conformation (four with 'locked'
conformation and two with R1 = R3 = hydrogen).
Class 2. Compounds with large substituents R1 and R3.
Sjöström and Kowalski (1979) have compared the five methods in Table 3 on
these data. All methods give consistent results in the crucial question, namely to
which class the three cis compounds are classified. They all fall closer to Class 2
implying that the compounds with large groups R1 and R3 are twisted fully over
to the cis conformation.
However, only Bayes and SIMCA provide the valuable information that the
three cis compounds indeed are so close to Class 2 that they can be considered to
be typical members of that class. Since it was not possible to include compounds
with intermediate twisted conformation in the study, the other methods cannot
tell whether the compounds with large groups R1 and R3 are twisted compounds
more similar to cis than trans or if they actually are rotated fully over into the cis
conformation.

6.2. Structure-activity study of some beta-receptor active compounds


A central area of chemistry is the investigation of the relationship between the
structure and reactivity of chemical compounds. When the studied reactivity
involves biological test systems, we have an area of particular interest. Since these
relations are potentially very complex, pattern recognition is one of the few useful
data analytic approaches (Kirschner and Kowalski, 1979).
The structures of chemical compounds can be described in terms of a 'back-
bone' and a number of 'substituents' at specific sites of the backbone
(Cammarata and Menon, 1976). Each site is then described by means of variables
characterizing the substituent actually occupying the site in the various com-
pounds. These variables can be those derived from the influence of the sub-
stituents on model reactions, Hansch (1973) scales, Hammett (1970) scales and
Taft (1952) scales. In addition, properties of the substituents such as length and
width according to quantum mechanical calculations are useful (Verloop,
Hoogenstraaten and Tipker, 1971).
Thus, each compound is translated into a data vector of numbers and the
pattern recognition problem is formulated as finding the difference between
classes of compounds showing different types and/or levels of biological activity.
Compounds with 'known' type or level of activity constitute the training set and
untested compounds constitute the test set. The latter need not have physical
existence; data vectors can be generated from the theoretical structure and tables
of substituent scales. Hence predictions of the biological activity can be obtained
on a theoretical basis and only potentially interesting compounds need to be
synthesized and further tested.
As an example, we discuss a series of compounds with the general structure
X(Y)-C6H4-CH(R)-CH(R1)-NH-R2. The substituents range from H through
CH(CH3)-CH2-C6H4-4-OH in the five sites of substitution X, Y, R, R1 and
R2. Describing each site by means of the variables discussed above and one
measured property (the binding constant to a test beta-receptor) a training set of
32 compounds was used with 13 variables per compound. For test compounds
only the values of the 12 'theoretical' variables are available and therefore a data
analytic method tolerating missing data must be used. The training set consisted
of two classes, compounds active as blockers (class 1, n = 17) and beta-receptor
stimulators, agonists (class 2, n = 15).
On each of the compounds in the training set, the level of the biological activity
is known. Hence, it was of interest also to seek relations between this level and the
'independent' structural descriptors in addition to an ordinary classification
based on the latter variables (Dunn, Wold and Martin, 1978).
A SIMCA analysis of the two training set class matrices revealed that 4 of the
13 variables were irrelevant. The remaining 9 were well described by two separate
three dimensional principal components models. These models classify 15 of the
blockers and all 15 agonists correctly in the validation analysis. The SIMCA
analysis gives coordinates of each object in terms of its principal components
values. In a second step of the analysis, it was found that these coordinates were
related to the measured level of activity. In other words, the position of the
compounds inside their class was correlated to their measured activity value.
This 'pattern analysis' of each class can, in summary, be used to predict the
type and activity of chemical test compounds on the basis of their structure only.
The same methodology has thereafter been applied to other structure-activity
problems (Dunn, Wold and Martin, 1978; Norden, Edlund and Wold, 1978;
Dunn and Wold, 1980).
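The second step described above, relating the within-class principal component coordinates to the measured activity level, amounts in its simplest form to a least-squares regression of the activity on the component scores; the sketch below assumes the scores of one class have already been computed, and all names are illustrative.

import numpy as np

def relate_scores_to_activity(scores, activity):
    # scores: n x A principal component coordinates of the objects in one class;
    # activity: n measured activity values.
    design = np.column_stack([np.ones(len(scores)), scores])
    coefs, *_ = np.linalg.lstsq(design, activity, rcond=None)
    # Predicted activity for a new object from its position inside the class.
    predict = lambda new_scores: coefs[0] + np.asarray(new_scores) @ coefs[1:]
    return coefs, predict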

6.3. Source identification of oil spills


When an oil spill occurs, establishing responsibility for the spill is not an easy
task. In the United States, the U.S. Coast Guard is responsible for monitoring the
waterways. In order to establish responsibility for the cleanup operation, the
Coast Guard has turned to sophisticated analytical chemistry methods to char-
acterize the oil spill samples and clean oil candidate samples. Then, using these
data, several methods of pattern recognition have been applied to relate oil spill
samples to their sources (Kowalski and Duewer).
Recognizing that weathering is an important factor in oil spill identification,
Duewer, Kowalski and Schatzki (1975) used the concentrations of 22 elements
measured by neutron activation analysis to characterize a total of 400 oil samples.

Twenty crude oil samples and twenty residual oil samples were first analyzed and
then portions of each oil were artificially weathered for different lengths of time
using fresh and salt water. The unweathered oil and nine weathered samples for
each of the 40 starting oils comprised 40 categories of oils with ten samples/
category in a 22-dimensional vector space.
The first goal in the study was to use the elemental concentrations to determine
whether or not the 40 oils could be separated even though the spread in each
category due to weathering effects was quite large in some cases. Using pre-
processing methods prior to classification analysis allowed classification accu-
racies as high as 99.3%, thereby establishing the feasibility of applying pattern
recognition to elemental concentration data to overcome weathering effects in oil
spill identification.
A second goal of the study was to identify, using weighting methods, the most
useful elements for discrimination. This is important as using fewer elements can
cut the cost of analysis and, in some cases, speed up the entire procedure.
Vanadium was by far the most discriminating element, a result that came as no
surprise to petroleum chemists. Vanadium is tightly complexed by very stable
high molecular weight compounds in oil that resist weathering very effectively.
Nickel was the second most discriminating element for the same reason. Sulfur
was third; it is known to be covalently bonded in large molecules that again are
quite stable to weathering.
Several other applications of pattern recognition to the material source identifi-
cation problem can be found in the chemical literature (Kowalski, Schatzki and
Stross, 1972; Howarth, 1974; Parsons, 1978; McGill and Kowalski, 1977). Also,
the use of pattern recognition to find useful patterns in industrial production data
in order to solve product quality and integrity problems is a most useful and
cost-effective type of application (Duewer et al., 1978).

6.4. The blind assay technique


The above application of pattern recognition hints at a powerful procedure for
solving problems associated with complex natural product mixtures. The identity
of the 22 elements used in the oil study was known prior to data analysis but this
need not be true in general.
In a manner analogous to the way the oil identification study was conducted,
Saxberg et al. (1978) analyzed whiskey samples by gas chromatography and
applied pattern recognition to solve a problem in forensic science: namely, the
identification of whiskey brands. In view of the variation inherent in the product
from each distiller, the problem was to select chemical measurements that
minimized this variation and were insensitive to such operations as allowing the
whiskey to be open to the atmosphere, dilution with ice and water, etc. The
important point of the study was that the identity of the measurements was
completely unknown during data analysis: blind assay. The identities of only the
most important measurements, as selected by the pattern recognition analysis, were
determined at the end of the study. For complex natural products containing
hundreds or even thousands of components, the blind assay method can solve
complex problems and save enormous amounts of time.

6.5. Clinical analytical chemistry


The analysis of several clinical chemistry measurements per patient is an
important and obvious application of pattern recognition. The methods can be
used to find interesting data structures in high-dimensional vector spaces
(Duewer et al., 1978; Boyd et al., 1978) and to evaluate new measurements in relation to existing
clinical measurements (Boyd et al., 1978).

7. Problems of current concern

7.1. Pitfalls
The pitfall most commonly encountered in chemical pattern recognition is that
of overestimating class separation. This is done mainly in two ways: (i) by
selecting variables that most differ between the classes and (ii) by using data
analysis methods which exaggerate class differences.
The two points are closely related to each other. Because the data set is finite,
there is a finite chance of finding a substantial difference in mean values in a
variable between two classes, even if the variable was drawn from a random
population. If we now have a large number of variables, the chance of finding,
just by accident, variables that differ between the classes becomes embarrassingly
large, especially for cases with few samples per class. Similarly, data analysis
methods which combine variables in order to enhance class differences are subject
to the risk that the resulting difference is just an artifact.
Empirically it has been found that for two-class problems the number of
variables must not exceed approximately a third of the number of independent
objects in the training set if variable selection methods or pattern recognition
methods conditioned on class separation are to be used (i.e. variance weights,
Fisher weights, etc. and hyper-plane methods, respectively, see further next
subsection).
A second pitfall related to the first is that one has fewer independent objects
(systems, cases, samples) than one realizes. In situations such as chromatography,
when each sample is analyzed in replicate and the raw data entered into the
analysis (this is strongly recommended to avoid a loss of information), the
number of independent objects is the same as the number of original samples, i.e.
half or a third of the number of objects in the analysis. If now the rule is
employed mechanically that the number of variables must not exceed N/3, one falls
into the 'too many variables trap'.

A similar situation can arise in structure/activity studies where several
substituents often are very similar to each other. Thus a series of compounds
with the substituents Cl, Br and I are usually so close to each other that they
essentially constitute a triplicate. In the same way the pairs CH3, C2H5;
CN, NO2; COOMe, COOEt are similar.
To get an idea about the real number of independent objects in each class, it is
recommended that separate eigenvector plots of each class are examined. Strong
groupings can be detected and accounted for. In such eigenvector projections one
also gets an idea of the homogeneity and the representability of each class, two
important points that cause much trouble if neglected.

7.2. Variable selection


In cases where each measurement is made independently of the others, a cost is
connected to each measurement. One is then interested in finding the smallest
subset of measurements which still solves a given classification problem on the
desired level of confidence. In other cases, a large number of variables is initially
included in the study, with some variables having doubtful relevance. Since
irrelevant variables introduce undesired noise into the mathematical description
of the classes, such variables should be found and deleted.
For the case where the number of variables is small, say M < ½N, the relevance of the variables can be judged from their discrimination power, i.e. their importance for class separation. A vast literature exists on this problem (Fukunaga, 1972; Kanal, 1974). The selection is usually done as a preprocessing step, choosing the variables differing most between the classes in some relative sense (variance weights, Fisher weights) (Harper et al., 1977).
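A minimal sketch of such a preprocessing step (our illustration; the exact weight definitions used in practice, e.g. in ARTHUR, may differ in detail): a Fisher-type weight is computed per variable and only the highest-weighted variables are retained.

```python
import numpy as np

def fisher_weights(x1, x2):
    """Fisher-type weight per variable for two classes.

    x1, x2 : arrays of shape (objects, variables).
    Larger weight = larger mean difference relative to within-class spread.
    """
    m1, m2 = x1.mean(axis=0), x2.mean(axis=0)
    v1, v2 = x1.var(axis=0, ddof=1), x2.var(axis=0, ddof=1)
    return (m1 - m2) ** 2 / (v1 + v2)

rng = np.random.default_rng(1)
class1 = rng.normal(0.0, 1.0, size=(15, 8))
class2 = rng.normal(0.0, 1.0, size=(15, 8))
class2[:, 0] += 2.0                      # only variable 0 really separates

w = fisher_weights(class1, class2)
keep = np.argsort(w)[::-1][:3]           # keep the three highest-weight variables
print("Fisher weights:", np.round(w, 2))
print("selected variables:", keep)
```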
In many chemical applications where, say M>½N, methods conditioned on
class separation should be used with extreme caution. In SIMCA, variables can
still be selected according to their modelling power, i.e. how much each variable
participates in the principal components modelling of the classes. When other
pattern recognition methods are used, the only reasonable alternative is a pre-
processing by (a) an R mode cluster analysis followed by (b) the selection of one
variable (or aggregate) from each variable cluster in the subsequent analysis
(Massart, 1978; Jain and Dubes, 1978). If the reduced number of variables is
thereafter smaller than ½N we are again back in the simpler-to-handle first
situation.

7.3. Validation
Most pattern recognition methods classify the training set with an over-
optimistic 'success' rate. Hence a validation of the classification should be made
in another way. The best method seems to be to divide the training set into a number of groups and then to delete one of these groups at a time, making it a 'test' set with known assignments. The pattern recognition method is then allowed to 'learn' on the reduced training set independently for each such deletion. When
summed up over all deletions, a lower bound of the classification rate is obtained
(see for example Kanal (1974)).
Several authors recommend that only one object at a time should be deleted
from each class in this validation. This might be true if one is certain that the
objects really are independent (see Section 7.1). In practice, however, one often
has at least weak groupings in the classes. It then seems safer to delete larger
groups at a time to get a fair picture of the real classification performance.
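The leave-one-group-out scheme can be sketched as follows (our illustration; a simple nearest-mean classifier stands in for whatever pattern recognition method is actually used, since the group deletion, not the classifier, is the point here).

```python
import numpy as np

def nearest_mean_fit(x, y):
    return {c: x[y == c].mean(axis=0) for c in np.unique(y)}

def nearest_mean_predict(model, x):
    classes = list(model)
    d = np.stack([np.linalg.norm(x - model[c], axis=1) for c in classes])
    return np.array(classes)[d.argmin(axis=0)]

def leave_group_out_rate(x, y, groups):
    """Delete one group at a time, 'learn' on the rest, classify the deleted group."""
    hits, total = 0, 0
    for g in np.unique(groups):
        test = groups == g
        model = nearest_mean_fit(x[~test], y[~test])
        hits += np.sum(nearest_mean_predict(model, x[test]) == y[test])
        total += np.sum(test)
    return hits / total

rng = np.random.default_rng(2)
x = np.vstack([rng.normal(0, 1, (20, 5)), rng.normal(1, 1, (20, 5))])
y = np.repeat([0, 1], 20)
groups = np.tile(np.arange(5), 8)        # five groups to delete in turn

print("leave-group-out classification rate:", leave_group_out_rate(x, y, groups))
```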

7.4. Asymmetric case


In the two-category classification problem, i.e. when one specified alternative is held against all others, the second class containing 'all others' is often non-homogeneous. For example, when trying to distinguish one type of compound, say aromatic hydrocarbons, from all other types on the basis of IR and mass spectra, the first class is well defined, corresponding to a small class grouping in M-space. The second class, however, contains in principle all other types of compounds (i.e. amines, alcohols, esters, ketones, nitriles, etc.) and is therefore not a well-defined class but is spread out in M-space. When one tries to apply a hyper-plane separation method to the two 'classes', the result is rather meaningless (Albano et al., 1978).
Only modelling methods, which describe each class separately, can handle this problem. The well-defined class is modelled, but not the 'all others' class. New objects are classified according to their position inside or outside the class 1 region as belonging to class 1 or not, respectively.
This asymmetric situation has been found in several chemical applications, including the structure-carcinogenicity study of polycyclic aromatic hydrocarbons (Albano et al., 1978; Norden, Edlund and Wold, 1978).
It seems that many of the apparent failures of pattern recognition in chemical
applications might be due to this asymmetric situation caused by the experimental
design.
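A minimal sketch of the class-modelling idea (in the spirit of SIMCA, though not the actual SIMCA implementation): only the well-defined class is modelled, here by its principal components, and new objects are assigned to class 1 or 'not class 1' according to their residual distance from that model.

```python
import numpy as np

def fit_class_model(x, n_components=2):
    """Principal-component model of a single class (mean + leading loadings)."""
    mean = x.mean(axis=0)
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean, vt[:n_components]

def residual_distance(model, x):
    mean, loadings = model
    centred = x - mean
    reconstructed = centred @ loadings.T @ loadings
    return np.linalg.norm(centred - reconstructed, axis=1)

rng = np.random.default_rng(3)
class1 = rng.normal(0, 1, (30, 6))            # the well-defined class
others = rng.normal(3, 2, (10, 6))            # heterogeneous 'all others'

model = fit_class_model(class1)
cutoff = np.percentile(residual_distance(model, class1), 95)

new_objects = np.vstack([class1[:3] + 0.1, others[:3]])
inside = residual_distance(model, new_objects) <= cutoff
print("classified as class 1:", inside)
```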

7.5. Method selection and testing


In terms of the number of pattern recognition methods that can be found in the
literature, chemists have barely scratched the surface in applying the existing
technology to chemical problems. The methods that have been applied in chem-
istry, for the most part, have been selected or developed by chemists to satisfy
particular needs for specific applications. During the development of the
ARTHUR system, 27 methods that were recognized as generally applicable were
programmed, tested for reliability, and either incorporated into the system or
discarded. In order to arrive at the methods comprising the current version of
ARTHUR, several additional methods were examined. As new areas of applica-
tion are opened in chemistry, the number of methods used by chemists is sure to
grow.
As is clearly evident in this chapter, the modelling methods of pattern recogni-
tion are presently preferred for chemical applications. Whenever chemical systems are studied with the prediction of chemical properties (e.g. class membership) as the goal, overestimation should be avoided. The danger of overestimation in chemical
pattern recognition is analogous to data extrapolation by fitting polynomial
functions to experimental data. A better fit can always be obtained by adding
more terms to the model function but the model may not be meaningful in the
context of the experiment. Extrapolation or interpolation results in this case may
be very inaccurate. This danger is efficiently avoided in modelling methods by the
use of a cross-validatory estimation of the model complexity (Wold, 1976; 1978).
When a new application is encountered, the application of a single pattern
recognition method, even the preferred modelling methods described in this
chapter, is not recommended. All pattern recognition methods perform data
reduction using some selected mathematical criteria. The application of two or
more methods with the same goal but with different mathematical criteria
provides a much greater understanding of the vector space. In classification for
instance, when the results of several methods are in agreement, the classification
task is quite an easy one with well separated classes indicating that excellent
chemical measurements were selected for analysis. When the classification meth-
ods are in substantial disagreement then, armed with an understanding of the
criteria of each method, the chemist can use the results to envision the data
structure in the vector space. The use of multiple methods with different criteria is
recommended as is the use of multiple analytical methods for the analysis of
complex samples.
The selection of a method for application to specific problems is rather
complex and can only be learned by experience since almost every vector space
associated with an application has a different structure and every application has
different goals. When a training set is not available, a simple preprocessing
method such as scaling followed by cluster analysis and display in two-space may
be all that is required to detect a useful structure of the vector space. At the other
extreme, when classification results from several methods do not agree, or when
modelling is poor, the application may require a transformation of the vector
space or even the addition of new measurements to the study. This latter case can
become quite complex and may require several iterations.
Multivariate analysis is quite often an iterative exercise. As interesting data
structures are detected and found to be chemically meaningful, the analysis may
be repeated with a new training set. Few problems are solved in any field by the
application of a single tool and the application of pattern recognition to chemical
data is no exception.

7.6. Standardization
In the present state of early development, the field of chemical pattern
recognition cannot be subject to standardization either in terms of methodology
or data structure (experimental design). However, we feel that the following
elements of data analysis should be included in all applications: (1) a graphical
examination of the data using display methods, (2) test of the homogeneity of
each class using, for instance, a graphical projection of each separate class and
finally (3) a validation of the classification results.

8. Present research directions

8.1. Proper representation of chemical data


There are several areas in chemistry where pattern recognition has yet to be
applied with any success. One reason for difficulties is that raw chemical data
often are analyzed without an understanding of which properties data should
have for pattern recognition to function (see Section 4). For example, the problem
that initially spurred chemists t o become interested in pattern recognition, the
direct extraction of molecular structures from spectral data, is far from solved. In
the authors' opinion the solution to this problem is not to be found in a search for
more powerful pattern recognition methods. Rather, a more appropriate com-
puter representation of molecular structure is required, which is indeed a problem
for the chemist.

8.2. Factor analysis, individualized difference scaling, and latent variable path models
It is interesting to note the close similarity between chemistry and behavioral sciences, such as psychology. In both fields the systems studied are discussed in a language referring to unobservable properties: in chemistry, electronic effects, charge distributions, steric effects, etc., and in the behavioral sciences, such concepts as IQ, verbal skill, motor ability and receptivity.
When a chemical compound is modified by changing a 'substituent', all
possible 'effects' may vary simultaneously. It is very difficult or impossible to
make a non-trivial modification that changes only one 'micro'-property of a
molecule. Similarly, the result of a psychological 'test' given to a number of
individuals measures combinations of all abilities. It is very difficult to construct
tests which measure only one 'basic' property of individuals.
This gap between experiment and theory in the behavioral sciences has led to factor analysis, the closely related principal components analysis, and the extensions to individualized difference scaling (Harper and Kowalski, unpublished) and latent variable path models (Wold, 1977) being used for the analysis of
observed data. The main purpose of these methods is to find out which 'factors'
or dimensions are necessary to describe the variation in multivariate data ob-
tained by giving several 'tests' to several individuals, followed by the interpreta-
tion of these 'factors'. In relation to pattern recognition, these methods can be
seen as tools for pattern analysis, i.e. finding out the data structure within each
category or class (see the modelling discussion above).
The analogy between behavioral science and chemistry makes us predict that
chemistry will soon be a field where multivariate data analysis is applied as
vigorously and enthusiastically as ever in psychology. Factor analysis has been
applied to chemical problems by Malinowski (1977), Rozett and Petersen (1975) and others (e.g. Howery, 1977). Reviews and handbook chapters have recently appeared (Massart, Dijkstra and Kaufman, 1978). Principal components analysis
is the basis for the SIMCA method discussed above. Other applications are found
in the analysis of fragrant oils by Hartmann and Hawkes (1970), chromatography
data (Wold and Andersson, 1973) and in the Linear Free Energy Relations
(Chapman and Shorter, 1974, 1978; Taft, 1952). Latent variable path models
(Wold, 1977) have been applied in the analysis of chemical data (Gerlach,
Kowalski and Wold, 1979).

8.3. Important applications


As discussed in the introduction of this chapter, there are two obvious areas of
application of pattern recognition in chemistry: the determination of the structure
of chemical compounds and the determination of the source or type of chemical
mixtures. The former is still developing in chemistry with an increasing number of
papers solving real chemical problems.
Applications in the second area involve a better use of chromatographic, trace
element and other data presently collected but not analyzed appropriately.
Challenging examples involve the classification of micro-organisms on the basis
of their 'chemical fingerprints', i.e. pyrolysis-gas chromatography (Reiner et al.,
1972; Blomquist et al., 1979) or mass fragmentography (Meuzelaar et al., 1973)
and classification of archeological artifacts by a multivariate trace element
analysis (McGill and Kowalski, 1977).
So far, chemical applications of pattern recognition have been almost exclu-
sively to static objects; i.e. the structure and other static properties of chemical
compounds and their mixtures. The only exception is the application of linear free energy relationships and the like to the study of chemical reaction mechanisms (Exner, 1970, 1973; Hammett, 1970; Taft, 1952; Chapman and Shorter, 1974, 1978), and organic chemistry abounds with problems relating to the determina-
tion of the type of reaction (e.g. reaction mechanism, solvent assistance or not,
thermodynamic or kinetic control,...). These problems could be approached
indirectly by means of pattern recognition and some examples are beginning to
appear in the literature (Van Gaal et al., 1979; Pijpers, Van Gaal, Van Der
Linden, 1979). Reactions of known types would constitute the training set and
isotope effects, solvent effects, temperature effects and substituent effects are
examples of variables that could be used to characterize each reaction. The same
variables would then be measured on reactions to be classified (the test set) and
the data set thus collected analyzed as discussed above.

9. Conclusions and prognosis

The multivariate nature of chemical measurements generated by modern chemical instrumentation, together with the nature of chemical theory which involves unobservable 'micro'-properties, makes a strong case for a rapidly increased use of multivariate data analysis, including various methods of pattern recognition, in chemistry. With the availability of fast, inexpensive and graphics-oriented computers, the large number of calculations is no longer a problem and we foresee the next decade as the decade when chemistry advances as a multivariate science.

References

Albano, C., Dunn, W. III, Edlund, U., Johansson, E., Norden, B., Sjöström, M. and Wold, S. (1978).
Four levels of pattern recognition. Anal. Chim. Acta 103, 429-443.
Andrews, H. C. (1972). Introduction to Mathematical Techniques of Pattern Recognition. Wiley, New
York.
ARTHUR, a computer program for pattern recognition and exploratory data analysis. INFOMETRIX, Seattle, WA, U.S.A.
Blomquist, G., Johansson, E., Söderström, B. and Wold, S. (1979). Reproducibility of pyrolysis-gas chromatographic analyses of the mould Penicillium brevi-compactum. J. Chromatogr. 173, 7-19.
Box, G. E. P. and Cox, D. R. (1964). An analysis of transformations. J. Roy. Statist. Soc. Ser. B 26,
211-214.
Boyd, J. C., Lewis, J. W., Marr, J. J., Harper, A. M. and Kowalski, B. R. (1978). Effect of atypical
antibiotic resistance on microorganism identification by pattern recognition. J. Clinical Microbiol-
ogy, 8, 689-694.
Burgard, D. R. and Perone, S. P. (1978). Computerized pattern recognition for classification of organic
compounds from voltammetric data. Anal. Chem. 50, 1366-1371.
Cammarata, A. and Menon, G. K. (1976). Pattern recognition. Classification of therapeutic agents
according to pharmacophores. J. Med. Chem. 19, 739-747.
Chapman, N. B. and Shorter, J., eds. (1974). Advances in Linear Free Energy Relationships. Plenum,
London.
Chapman, N. B. and Shorter, J., eds. (1978). Correlation Analysis in Chemistry. Plenum, London.
Duda, R. O. and Hart, P. E. (1973). Pattern Classification and Scene Analysis. Wiley, New York.
Duewer, D. L., Kowalski, B. R. and Schatzki, T. F. (1975). Source identification of oil spills by pattern
recognition analysis of natural elemental composition. Anal. Chem. 47, 1573-1583.
Duewer, D. L., Kowalski, B. R., Clayson, K. J. and Roby, R. J. (1978). Elucidating the structure of
some clinical data. Biomed. Res. 11, 567-580.
Dunn, W. J. III and Wold, S. (1978). A structure-carcinogenicity study of 4-nitroquinoline 1-oxides using the SIMCA method of pattern recognition. J. Med. Chem. 21, 1001-1011.
Dunn, W. J. III, Wold, S. and Martin, Y. C. (1978). Structure-activity study of beta-adrenergic agents using the SIMCA method of pattern recognition. J. Med. Chem. 21, 922-932.
Dunn, W. J. and Wold, S. (1980). Relationships between chemical structure and biological activity
modelled by SIMCA pattern recognition. Bioorg. Chem. 9, 505-521.
Exner, O. (1970). Determination of the isokinetic temperature. Nature 227, 366-378.
Exner, O. (1973). The enthalpy-entropy relationship. Progr. Phys. Org. Chem. 10, 411-422.
Fukunaga, K. (1972). Introduction to Statistical Pattern Recognition. Academic Press, New York.
Gerlach, R. W., Kowalski, B. R. and Wold, H. (1979). Partial least squares path modelling with latent
variables. Anal. Chim. Acta 112, 417-421.
Hammett, L. P. (1970). Physical Organic Chemistry, 2nd ed. McGraw-Hill, New York.
Hansch, C., Leo, A., Unger, S. H., Kim, K. H., Nikaitani, D. and Lien, E. J. (1973). 'Aromatic'
substituent constants for structure-activity correlations. J. Med. Chem. 16, 1207-1218.
Harper, A. M., Duewer, D. L., Kowalski, B. R. and Fasching, J. L. (1977). ARTHUR and experimental data analysis: The heuristic use of a polyalgorithm. In: B. R. Kowalski, ed., Chemometrics: Theory and Application. Amer. Chem. Soc. Symp. Ser. 52, 14-52.
Harper, A. M. and Kowalski, B. R., unpublished.
Hartmann, N. and Hawkes, S. J. (1970). Statistical analysis of multivariate chromatographic data on natural mixtures, with particular reference to peppermint oils. J. Chromat. Sci. 8, 610-625.
Howarth, R. J. (1974). The impact of pattern recognition methodology in geochemistry. Proc. Second
Internat. Joint Conf. Pattern Recognition, 411-412. Copenhagen.
Howery, D. G. (1977). The unique role of target-transformation factor analysis in the chemometric revolution. In: B. R. Kowalski, ed., Chemometrics: Theory and Application. Amer. Chem. Soc. Symp. Ser. 52, 73-79.
Jain, A. K. and Dubes, R. (1978). Feature definition in pattern recognition with small sample size.
Pattern Recognition 10, 85-89.
Jurs, P. C., Kowalski, B. R., Isenhour, T. L. and Reilley, C. N. (1969). An investigation of combined
patterns from diverse analytical data using computerized learning machines. Anal. Chem. 41,
1949-1953.
Jurs, P. C., Kowalski, B. R. and Isenhour, T. L. (1969). Computerized learning machines applied
to chemical problems. Molecular formula determination from low resolution mass spectrometry.
Anal. Chem. 41, 21-27.
Jurs, P. C., Isenhour, T. L. (1975). Chemical Applications of Pattern Recognition. Wiley, New York.
Kanal, L. (1974). Patterns in pattern recognition: 1968-1974. IEEE Trans. Inform. Theory 20,
697-703.
Kirschner, G. and Kowalski, B. R. (1979). The application of pattern recognition to drug design. In:
Drug Design, Vol. III. Academic Press, New York.
Kowalski, B. R., Jurs, P. C., Isenhour, T. L. and Reilley, C. N. (1969). Computerized learning
machines applied to chemical problems. Interpretation of infrared spectrometry data. Anal. Chem.
41, 1945-1949.
Kowalski, B. R. and Reilly, C. A. (1971). Nuclear magnetic resonance spectral interpretation by
pattern recognition. J. Phys. Chem. 75, 1402-1411.
Kowalski, B. R., Schatzki, T. F. and Stross, F. H. (1972a). Classification of archaeological artifacts by
applying pattern recognition to trace element data. Anal. Chem. 44, 2176-2180.
Kowalski, B. R. and Bender, C. F. (1972b). Pattern recognition. A powerful approach to interpreting
chemical data. J. Amer. Chem. Soc. 94, 5632-5639.
Kowalski, B. R. and Bender, C. F. (1973). Pattern recognition. II. Linear and nonlinear methods for
displaying chemical data. J. Amer. Chem. Soc 95, 686-693.
Kowalski, B. R., ed. (1977). Chemometrics: Theory and Application. Amer. Chem. Soc. Symp. Ser. 52.
Kowalski, B. R. (1978). Analytical chemistry: The journal and the science; the 1970's and beyond.
Anal. Chem. 50, 1309A-1322A.
Kowalski, B. R. (1980). Chemometrics. Anal. Chem. Rev. 52, 112R-122R.
Kowalski, B. R. and Duewer, D. L. IEEE Proc. Workshop on Pattern Recognition Applied to Oil
Identification. Catalog No. 76CH1247-6C.
Kwan, W. O. and Kowalski, B. R. (1978). Classification of wines by applying pattern recognition to
chemical composition data. J. Food Sci. 42, 1320-1330.
Malinowski, E. R. (1977). Abstract factor analysis: A theory of error and its application to analytical chemistry. In: B. R. Kowalski, ed., Chemometrics: Theory and Application. Amer. Chem. Soc. Symp. Ser. 52, 53-72.
Massart, D. L., Dijkstra, A. and Kaufman, L. (1978). Evaluation and Optimization of Laboratory
Methods and Analytical Procedures. Elsevier, Amsterdam.
McGill, J. R. and Kowalski, B. R. (1977). Recognizing patterns in trace elements. J. Appl. Spectros-
copy 31, 87-95.
McGill, J. R. and Kowalski, B. R. (1978). Classification of mass spectra via pattern recognition. J.
Chem. Info. Comput. Sci. 18, 52-55.
Mecke, R. and Noack, K. (1960). Strukturbestimmungen von ungesättigten Ketonen mit Hilfe von Infrarot- und Ultraviolett-Spektren. Chem. Ber. 93, 210-222.
Meuzelaar, H. L. C., Posthumus, M. A., Kistemaker, P. G. and Kistemaker, J. (1973). Curie point
pyrolysis in direct combination with low voltage electron impact ionization mass spectrometry.
Anal. Chem. 45, 1546-1560.
Nilsson, N. J. (1965). Learning Machines. McGraw-Hill, New York.
Norden, B., Edlund, U. and Wold, S. (1978). Carcinogenicity of polycyclic aromatic hydrocarbons
studied by SIMCA pattern recognition. Acta Chem. Scand. Ser. B. 21, 602-612.
Parsons, M. (1978). Pattern recognition in chemistry. Research and Development 29, 72-85.
Pijpers, F. W., Van Gaal, H. L. M., and Van Der Linden, J. G. M. (1979). Qualitative classification of
dithiocarbamate compounds from 13C-NMR and IR spectroscopic data by pattern recognition
techniques. Anal. Chim. Acta 112, 199-210.
Reiner, E., Hicks, J. J., Ball, M. M., and Martin, W. J. (1972). Rapid characterization of salmonella
organisms by means of pyrolysis-gas-liquid chromatography. Anal. Chem. 44, 1058-1063.
Rozett, R. W. and Petersen, E. M. (1975). Methods of factor analysis of mass spectra. Anal. Chem. 47,
1301-1310.
Sammon, J. W., Jr. (1968) On-line pattern analysis and recognition system (OLPARS). Rome Air
Develop. Center, Tech. Rept. TR-68-263.
Saxberg, B. E. H., Duewer, D. L., Booker, J. L. and Kowalski, B. R. (1978). Pattern recognition and
blind assay techniques applied to forensic separation of whiskies. Anal. Chim. Acta 103, 201-210.
Schachterle, S. D. and Perone, S. P. (1981). Classification of voltammetric data by computerized
pattern recognition. Anal. Chem. 53, 1672-1678.
Sjöström, M. and Edlund, U. (1977). Analysis of 13C NMR data by means of pattern recognition
methodology. J. Magn. Res. 25, 285-298.
Sjöström, M. and Kowalski, B. R. (1979). A comparison of five pattern recognition methods based on
the classification results from six real data bases. Anal. Chim. Acta. 112, 11-30.
Taft, R. W., Jr. (1952). Polar and steric substituent constants for aliphatic and o-benzoate groups from rates of esterification and hydrolysis of esters. J. Amer. Chem. Soc. 74, 3120-3125.
Van Gaal, H. L. M., Diesveld, J. W., Pijpers, F. W. and Van Der Linden, J. G. M. (1979). 13C NMR
spectra of dithiocarbamates. Chemical shifts, carbon-nitrogen stretching vibration frequencies, and
π bonding in the NCS2 fragment. Inorganic Chemistry 11, 3251-3260.
Varmuza, K. (1980). Pattern Recognition in Chemistry. Springer, New York.
Verloop, A., Hoogenstraaten, W. and Tipker, J. (1971). In: E. J. Ariens, ed., Drug Design, Vol. V.
Academic Press, New York.
Wold, H. (1977). Mathematical Economics and Game Theory. Essays in Honor of Oscar Morgenstern.
Springer, Berlin.
Wold, S. and Andersson, K. J. (1973). Major components influencing retention indices in gas
chromatography. J. Chromatogr. 80, 43-50.
Wold, S. (1976) Pattern recognition by means of disjoint principal components models. Pattern
Recognition 8, 127.
Wold, S. and Sjöström, M. (1977). SIMCA: A method for analyzing chemical data in terms of
similarity and analogy. In: B. R. Kowalski, ed., Chemometrics: Theory and Application. Amer. Chem.
Soc. Symp. Ser. 52, 243-282.
Wold, S. (1978). Cross validatory estimation of the number of components in factor and principal
components models. Technometrics 20, 397-406.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
© North-Holland Publishing Company (1982) 699-719

Covariance Matrix Representation and Object-Predicate Symmetry

T. Kaminuma, S. Tomita and S. Watanabe

1. Historical background

In the field of pattern recognition, a series of researches was carried out in the early 1960s utilizing the Karhunen-Loève expansion as a tool for feature extraction. The results were reported in a paper in 1965 [13], which pointed out an identical mathematical structure (diagonalization of the covariance matrix) in the Karhunen-Loève method and Factor Analysis. That paper also introduced and proved the entropy minimizing property of the covariance matrix method. Since then the method has proved to be one of the most powerful and versatile tools for feature extraction, and is today widely used in many practical problems in pattern analysis [1].
The idea of object-predicate reciprocity (or symmetry) dates back to 1958 [12],
but its computational advantages were emphasized only in a 1968 paper [15]. A
mathematical theorem that underlies this symmetry was demonstrated in a paper
in 1970 [10] and interesting applications were adduced [10, 11]. In the meantime, a
paper in 1970 [4] surveyed the mathematical core of the covariance method in a
very general perspective and pointed out parallel developments in physics and
chemistry. It traced back the mathematical theory to Schmidt, 1907 [8], although
the later authors may not have known that their works were only independent
rediscoveries. K. Pearson (1901) also had the idea of diagonalization of the
covariance matrix for a purpose related to Factor Analysis [7].
It is interesting to note that from the very beginning, Schmidt introduced two
products of an unsymmetric kernel of an integral equation and its conjugate
kernel. He pointed out that the eigenvalues of the two product kernels are
identical up to their degeneracy, and the two eigenfunction systems are related to each other in a symmetrical manner. The symmetrical character of the two optimal coordinate systems associated with the two products of unsymmetric kernels was rediscovered by quantum chemists [3] in the 1960s and was applied to obtain good approximate wave-functions in many-particle systems.
Time and again, in different contexts, the error minimizing property of eigen-
vectors of the covariance matrix was rediscovered and has been used, mainly under the name of Karhunen-Loève expansion, among electrical engineers. The method was transplanted into the field of pattern recognition in the early 1960s, but the symmetric nature, explained later in terms of the Karhunen-Loève expansion,
had not been fully exploited until the authors of this article rediscovered it in the
late 1960s.
The purpose of this article is to explain various aspects of the feature extraction
theory based on the covariance matrix representation starting from the minimum
entropy principle of pattern recognition. In particular, we shall emphasize the
object-predicate symmetry and illustrate it with some examples. In the last
section we shall sketch the mathematical theorem due to Schmidt without proof.

2. Covariance representation

The data set is a set of vectors $\{x_i^{(\alpha)}\}$, $\alpha = 1,2,\ldots,N$; $i = 1,2,\ldots,n$, with a weight function $\{W^{(\alpha)}\}$, where $x_i^{(\alpha)}$ is the result of the $i$th predicate measurement made on the $\alpha$th object. The G-matrix is defined by
$$G_{ij} = \sum_{\alpha=1}^{N} W^{(\alpha)} x_i^{(\alpha)} x_j^{(\alpha)} / \tau \qquad (2.1)$$
with
$$\tau = \sum_{\alpha=1}^{N} \sum_{i=1}^{n} W^{(\alpha)} \bigl(x_i^{(\alpha)}\bigr)^2. \qquad (2.2)$$

It is assumed that the Euclidean distance in the $n$-dimensional real space defined by the $x_i$'s ($i = 1,2,\ldots,n$) has the meaning of a measure of difference or dissimilarity between objects suitable for clustering or pattern recognition. The restriction implied by this Euclidean assumption is to some extent alleviated by dealing with some non-linear functions of the $x_i$'s as if they were independent measurements. Instead of explicitly writing $W^{(\alpha)}$, we can multiply each $x_i^{(\alpha)}$ by $\sqrt{W^{(\alpha)}}$ and simplify the G-matrix to
$$G_{ij} = \sum_{\alpha=1}^{N} x_i^{(\alpha)} x_j^{(\alpha)} / \tau \qquad (2.3)$$
with
$$\tau = \sum_{\alpha=1}^{N} \sum_{i=1}^{n} \bigl(x_i^{(\alpha)}\bigr)^2. \qquad (2.4)$$

The G-matrix can be considered as a normalized (i.e., divided by $\tau$) scatter matrix. If we shift the origin of each measurement so that
$$\sum_{\alpha=1}^{N} x_i^{(\alpha)} = 0, \qquad (2.5)$$
the G-matrix becomes the covariance matrix. Under the same assumption, if we
replace $\tau$ by
$$\Bigl(\sum_{\alpha=1}^{N} \bigl(x_i^{(\alpha)}\bigr)^2 \sum_{\alpha=1}^{N} \bigl(x_j^{(\alpha)}\bigr)^2\Bigr)^{1/2}, \qquad (2.6)$$
the G-matrix can be considered as the autocorrelation function. In the jargon of physicists the original G-matrix corresponds to the density matrix.
Obviously we have, for all $i$ and all $j$,
$$G_{ji} = G_{ij}, \qquad \text{(symmetric)} \qquad (2.7)$$
$$G_{ii} \ge 0, \qquad \text{(non-negative diagonal elements)} \qquad (2.8)$$
$$\sum_{i=1}^{n} G_{ii} = 1. \qquad \text{(normalized)} \qquad (2.9)$$

Without altering the Euclidean distance, we can change the coordinate system by an orthogonal transformation
$$G_{ij} \rightarrow G'_{ij} = \sum_{k,l} T_{ik} G_{kl} (T^{-1})_{lj} \qquad (2.10)$$
with
$$T_{ji} = (T^{-1})_{ij}. \qquad (2.11)$$
The three basic properties (2.7), (2.8), (2.9) of the G-matrix remain unchanged by this transformation. So does the value of $\tau$.
The invariant properties (2.8) and (2.9) allow us to define the entropy
$$S' = -\sum_{i=1}^{n} p'_i \log p'_i \qquad (2.12)$$
with
$$p'_i = G'_{ii}, \qquad (2.13)$$
where $p'_i$ can be interpreted as the 'importance' of the $i$th coordinate, or the degree to which the $i$th coordinate in the primed coordinate system participates in representing the ensemble of vectors.
The eigenvector $\varphi^{(k)}$ and the corresponding eigenvalue $\lambda^{(k)}$ of the G-matrix are defined by
$$\sum_{j=1}^{n} G_{ij} \varphi_j^{(k)} = \lambda^{(k)} \varphi_i^{(k)}, \qquad (2.14)$$
where the eigenvalues $\lambda^{(k)}$, $k = 1,2,\ldots,n$, are the $n$ roots of the algebraic equation
$$\det \bigl| G_{ij} - \lambda \delta_{ij} \bigr| = 0. \qquad (2.15)$$

We agree to label these $n$ roots by the convention
$$\lambda^{(1)} \ge \lambda^{(2)} \ge \cdots \ge \lambda^{(n)}. \qquad (2.16)$$
Although an $r$-ple root determines only an eigenspace of $r$ dimensions, we can agree to define the eigenvectors so that
$$\sum_{i=1}^{n} \varphi_i^{(k)} \varphi_i^{(l)} = \delta_{kl} \qquad (2.17)$$
and
$$\sum_{k=1}^{n} \varphi_i^{(k)} \varphi_j^{(k)} = \delta_{ij}. \qquad (2.18)$$
It is easy to see that we can write the G-matrix in the form
$$G_{ij} = \sum_{k=1}^{n} \lambda^{(k)} \varphi_i^{(k)} \varphi_j^{(k)}$$
if we take the primed coordinate system as the one determined by the eigenvectors; then we obviously have
$$G'_{ij} = \lambda^{(i)} \delta_{ij}. \qquad (2.19)$$
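In modern numerical terms, the construction of this section amounts to forming the normalized scatter matrix and diagonalizing it. The following sketch (ours, with unit weights $W^{(\alpha)} = 1$) also evaluates the entropy (2.12) in the optimal coordinate system, anticipating (3.1).

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 6))            # x_i^(alpha): 50 objects, 6 predicates

tau = np.sum(X ** 2)                    # eq. (2.2) with W^(alpha) = 1
G = X.T @ X / tau                       # eq. (2.3): n x n, trace(G) = 1

lam, phi = np.linalg.eigh(G)            # eigenvalues and eigenvectors, eq. (2.14)
lam = lam[::-1]                         # labeling convention (2.16): decreasing

entropy = -np.sum(lam[lam > 0] * np.log(lam[lam > 0]))   # S_min of eq. (3.1)
print("trace of G:", round(G.trace(), 6))
print("eigenvalues:", np.round(lam, 4))
print("minimum entropy:", round(entropy, 4))
```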

3. Minimum entropy principle
A good heuristic advice in pattern recognition is that we should formulate and
adjust our conceptual system to minimize the entropy definable by the data set
[17]. In our case this would mean that we should choose the coordinate system
that minimizes S' defined in (2.12). In such a coordinate system the degrees of
importance will be the most unevenly distributed over the n coordinates. This will
also lead to the possibility of reducing the dimensionality of the representation space, since we shall be able to ignore those coordinates whose 'importances' are very small.
(i) Optimal coordinate system. To obtain the 'optimum' coordinate system that
minimizes the entropy, we can take advantage of the following theorem.

THEOREM 3.1. The eigenvectors of the G-matrix provide the optimal coordinate
system.

In other words, the optimal coordinate system is the principal axes coordinate
system. The minimum entropy is, according to this theorem and (2.19),

$$S_{\min} = -\sum_{k=1}^{n} \lambda^{(k)} \log \lambda^{(k)}. \qquad (3.1)$$

By the definition of the entropy function, vanishing eigenvalues make no contribution to $S_{\min}$.
(ii) Uncorrelatedness. Since $G'_{ij}$ is proportional to the correlation coefficient between the two stochastic variables $x'_i$ and $x'_j$, instantiated by $x_i^{(\alpha)\prime}$ and $x_j^{(\alpha)\prime}$, provided we assume (2.5) for all $i$, we have the following.

THEOREM 3.2. The variables representing the optimal coordinates are uncorrelated.

(iii) Minimum error. If we approximate each $n$-dimensional vector $\{x_i^{(\alpha)\prime}\}$, $i = 1,2,\ldots,n$, by an $m$-dimensional vector $\{x_i^{(\alpha)\prime}\}$, $i = 1,2,\ldots,m$, with $m < n$, the total error is represented by
$$E_{n,m} = \sum_{\alpha=1}^{N} \sum_{i=1}^{n} \bigl(x_i^{(\alpha)\prime}\bigr)^2 - \sum_{\alpha=1}^{N} \sum_{i=1}^{m} \bigl(x_i^{(\alpha)\prime}\bigr)^2. \qquad (3.2)$$

We can then prove the following theorem.

THEOREM 3.3. No matter what value $m$ has ($m < n$), $E_{n,m}$ becomes minimal when we take as the primed coordinate system the optimal coordinates with the convention (2.16).

(iv) Karhunen-Loève expansion. We have assumed that the data about an object $\alpha$ are given by a finite set of measurements, $\{x_i^{(\alpha)}\}$, $i = 1,2,\ldots,n$. In case the data are given as square-integrable, piecewise continuous functions $f^{(\alpha)}(u)$, $\alpha = 1,2,\ldots,N$, on the continuous domain of $u$, $[a, b]$, we take an arbitrary base function system $g_i(u)$, $i = 1,2,\ldots,\infty$, such that
$$\int_a^b g_i(u) g_j(u)\, du = \delta_{ij} \qquad (3.3)$$
and define $x_i^{(\alpha)}$ by
$$x_i^{(\alpha)} = \int_a^b f^{(\alpha)}(u) g_i(u)\, du \qquad (3.4)$$
so that
$$f^{(\alpha)}(u) = \sum_{i=1}^{\infty} x_i^{(\alpha)} g_i(u). \qquad (3.5)$$
The continuous counterpart of the G-matrix is
$$G(u,v) = \sum_{\alpha=1}^{N} f^{(\alpha)}(u) f^{(\alpha)}(v) / \tau, \qquad (3.6)$$

which corresponds to (2.3) with (3.4), i.e.,
$$G_{ij} = \int\!\!\int g_i(u) G(u,v) g_j(v)\, du\, dv. \qquad (3.7)$$
The transformation from a base system $\{g_i(u)\}$ to another $\{g'_i(u)\}$ is translated into an orthogonal transformation from $\{x_i\}$ to $\{x'_i\}$. The optimal base system is defined by the minimum condition of the entropy (2.12) with $n = \infty$. The eigenvectors of $G_{ij}$ are now replaced by the eigenfunctions of $G(u,v)$: $\{\varphi^{(k)}(u)\}$. The total error committed by taking a finite number $m$ becomes minimum when we take the system of these eigenfunctions. The expansion of $f^{(\alpha)}(u)$ in terms of the eigenfunctions of $G(u,v)$ is called the Karhunen-Loève expansion.
(v) Factor analysis. Suppose that we are given $n$ measurements $\{x_i\}$, $i = 1,2,\ldots,n$, which are correlated to some extent, so that we can suspect that there exist $m$ independent variables $\{y_j\}$, $j = 1,2,\ldots,m$, with $m < n$, on which the $n$ first variables are strongly (linearly) dependent. We can then write
$$x_i = \sum_{j=1}^{m} a_{ij} y_j + b_i z_i \qquad (i = 1,2,\ldots,n) \qquad (3.8)$$

where the first $m$ variables are the common independent variables and the last term $b_i z_i$ represents a small variation of values specifically characteristic of $x_i$. It is usually assumed that each of the variables $x_i$, $y_j$, $z_i$ has mean value zero and standard deviation unity. Further, it is assumed that the pairs $(y_j, y_k)$, $(y_j, z_i)$ and $(z_i, z_j)$, with $i = 1,2,\ldots,n$ and $j = 1,2,\ldots,m$, are mutually uncorrelated. The $A_i$'s and $B_i$'s defined by
$$A_i = \sum_{j=1}^{m} (a_{ij})^2, \qquad B_i = b_i^2 \qquad (3.9)$$
with
$$A_i + B_i = 1 \qquad (3.10)$$
are respectively called the 'communality' and 'specificity' of variable $x_i$.
It is customary to suppose that we can evaluate the value of the total communality $A = \sum_{i=1}^{n} A_i$, and we are asked to find the desired $m$ common factors $y_j$. But, for a set of $n$ given variables (the $x_i$'s), we introduce $(n + m)$ new variables (the $y_j$'s and $z_i$'s), and as a result the solution is not unique unless some additional conditions are added.
To make a comparison with the method of the optimal coordinate system easy, we introduce the 'influence' of common factor $y_j$ on the observed variables $x_i$ by
$$q_{ij} = (a_{ij})^2 \Big/ \sum_{i=1}^{n} (a_{ij})^2 \qquad (3.11)$$

and define the 'generality of influence' of common variable $y_j$ by
$$S_j = -\sum_{i=1}^{n} q_{ij} \log q_{ij}. \qquad (3.12)$$
We provisionally assume that the $z$'s do not exist and obtain the $y_j$ as eigenvectors of the G-matrix, which is $(1/n)$ times the correlation matrix of the $x_i$'s. We can adopt those $y_j$'s whose generality of influence is larger than an appropriate threshold. We can then revive the $z_i$'s to satisfy (3.8).
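The quantities of (3.9)-(3.12) are easily computed from a loading matrix. The following sketch is our own numerical illustration, with an arbitrary loading matrix and with the influence $q_{ij}$ normalized over the observed variables as assumed in our reconstruction of (3.11).

```python
import numpy as np

# arbitrary loading matrix a_ij: 5 observed variables, 2 common factors
a = np.array([[0.9, 0.1],
              [0.8, 0.2],
              [0.1, 0.7],
              [0.2, 0.8],
              [0.5, 0.5]])

communality = np.sum(a ** 2, axis=1)            # A_i of eq. (3.9)
specificity = 1.0 - communality                 # B_i via eq. (3.10)

# influence of factor y_j on the observed variables, normalized over i
q = a ** 2 / np.sum(a ** 2, axis=0)
generality = -np.sum(q * np.log(q), axis=0)     # S_j of eq. (3.12)

print("communalities A_i:", np.round(communality, 3))
print("specificities  B_i:", np.round(specificity, 3))
print("generality of influence S_j:", np.round(generality, 3))
```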

4. SELFIC

The method of SELFIC (self-featuring information compression) is a strategy to decrease the dimensionality of the representation space without committing an appreciable overall error when the data are given as an ensemble of vectors. The qualification 'self' is inserted in the name SELFIC to emphasize that the discrimination to which the method leads depends on the ensemble itself and not on external criteria. This error, as a percentage, is $E_{n,m}$ of (3.2) divided by the first term of (3.2). The error depends on two factors: (1) the value of $m$, and (2) the choice of the primed framework for a given $m$. Suppose we have already decided to use the optimal coordinates; then the 'goodness' of the approximation is measured by 'one minus the percentual error', i.e., $\rho$, defined as the quotient of the second term of (3.2) and the first term, which in the optimal coordinate system equals
$$\rho = \sum_{i=1}^{m} \lambda^{(i)} \Big/ \sum_{i=1}^{n} \lambda^{(i)} = \sum_{i=1}^{m} \lambda^{(i)} \qquad (4.1)$$
with the agreement (2.16).


The curve of $\rho$ as a function of $m$ is a step-like, convex (upward), monotonically increasing function. In most of the actual applications, it was found that $\rho$ becomes, say, 0.95 for $m$ considerably smaller than $n$. This is a good indication of the effectiveness of SELFIC for dimensionality reduction. The $m$-dimensional subspace thus defined is the retrenched subspace which can replace the original $n$-dimensional representation space with an overall error less than $1 - \rho$.
A few remarks regarding the actual application may be in order here. If we do not apply (2.5) for each measurement, the first eigenvector tends to represent the general feature common to all the objects, which does not serve the purpose of discriminating one object from another in the ensemble. If we apply (2.5) to the variable $x_i$, there will be as many positive values of $x_i$ as negative values. Further, it should be noted that this formalism does not distinguish positive values from negative values, because the entire method depends on the squares of the $x_i$'s.
The SELFIC method makes it possible to define the 'representative objects' of an ensemble. The percentage $\mu^{(\alpha)}$ of the weight of the $\alpha$th object-vector in the $m$-dimensional subspace is
$$\mu^{(\alpha)} = \sum_{i=1}^{m} \bigl(x_i^{(\alpha)\prime}\bigr)^2 \Big/ \sum_{i=1}^{n} \bigl(x_i^{(\alpha)\prime}\bigr)^2 \qquad (4.2)$$
where the primed coordinates belong to the optimal coordinate system. We can call those objects which have $\mu^{(\alpha)}$ larger than a certain threshold $\theta$ 'representative objects':
$$\mu^{(\alpha)} \ge \theta. \qquad (4.3)$$

The above explanation concerns the case where the class-affiliation of vectors is
not known. That means that the method is used for the preprocessing for
clustering. The method can, however, be used in the case of paradigm-oriented
pattern recognition. We can obtain a separate retrenched subspace for each class.
This leads to the method of CLAFIC (class-featuring-information-compression).
The results described in Sections 2, 3 and 4 were first reported in [1]. The CLAFIC method was described in [14, 16].
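A sketch of the SELFIC reduction (our illustration): after the origin shift (2.5), the retained fraction $\rho$ of (4.1) is accumulated until it exceeds, say, 0.95, and the representative objects of (4.2)-(4.3) are then read off in the optimal coordinates.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(40, 10)) @ rng.normal(size=(10, 10))   # correlated data
X -= X.mean(axis=0)                       # origin shift, eq. (2.5)

G = X.T @ X / np.sum(X ** 2)              # normalized scatter matrix
lam, phi = np.linalg.eigh(G)
order = np.argsort(lam)[::-1]
lam, phi = lam[order], phi[:, order]

rho = np.cumsum(lam)                      # eq. (4.1), since trace(G) = 1
m = int(np.searchsorted(rho, 0.95) + 1)   # smallest m with rho >= 0.95
print("retained dimension m =", m, "  rho(m) =", round(rho[m - 1], 3))

Xp = X @ phi                              # coordinates in the optimal system
mu = np.sum(Xp[:, :m] ** 2, axis=1) / np.sum(Xp ** 2, axis=1)   # eq. (4.2)
representative = np.where(mu >= 0.9)[0]   # threshold theta = 0.9, eq. (4.3)
print("representative objects:", representative)
```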

5. Object-predicate reciprocity

The given data $\{x_i^{(\alpha)}\}$ form an $n \times N$ matrix. (In Section 7 this corresponds to the kernel $K$.) It is quite natural to consider this matrix either as a set of $N$ vectors of dimension $n$ or as a set of $n$ vectors of dimension $N$. This object-predicate symmetry was exploited for the first time for computational convenience in [12]
in the case of binary data, and the epistemological implications of this reciprocal
duality were explained in [15]. Relegating these methodological details to the
original papers, we only call attention here to the pragmatic fact that the
identification of an object depends also on a set of observations, and hence the
object-predicate table is, in the last analysis, nothing but a relation between two
sets of predicate variables.
Corresponding to the predicate correlation matrix $G_{ij}$, we can define an object correlation matrix
$$H^{(\alpha\beta)} = \sum_{i=1}^{n} x_i^{(\alpha)} x_i^{(\beta)} / \tau. \qquad (5.1)$$
Both $G_{ij}$ and $H^{(\alpha\beta)}$ can be derived from
$$\Phi_{ij}^{(\alpha\beta)} = x_i^{(\alpha)} x_j^{(\beta)} / \tau \qquad (5.2)$$
by
$$G_{ij} = \sum_{\alpha=1}^{N} \Phi_{ij}^{(\alpha\alpha)} \qquad (5.3)$$
and
$$H^{(\alpha\beta)} = \sum_{i=1}^{n} \Phi_{ii}^{(\alpha\beta)}. \qquad (5.4)$$

The eigenvectors $\psi^{(l)}$ and the corresponding eigenvalues $\kappa_l$ of $H^{(\alpha\beta)}$ are defined by
$$\sum_{\beta=1}^{N} H^{(\alpha\beta)} \psi_\beta^{(l)} = \kappa_l \psi_\alpha^{(l)} \qquad (5.5)$$
with
$$\sum_{\alpha=1}^{N} \psi_\alpha^{(l)} \psi_\alpha^{(m)} = \delta_{lm}. \qquad (5.6)$$
We agree on the labeling order
$$\kappa_1 \ge \kappa_2 \ge \cdots \ge \kappa_N. \qquad (5.7)$$

The following interesting theorem can be easily proved.

THEOREM 5.1. The matrices $G_{ij}$ and $H^{(\alpha\beta)}$ have the same rank and their corresponding eigenvalues are equal, $\lambda^{(\sigma)} = \kappa_\sigma$ ($\sigma = 1,2,\ldots$), with the labeling conventions (2.16) and (5.7).

The object-predicate reciprocity can be utilized to reduce the dimensionality $n$ in the following way. If $n > N$, the data set $\{x^{(\alpha)}\}$ does not have much statistical
meaning [6] when we regard the matrix as a set of N vectors (sample points) in the
n-dimensional space. Furthermore, an n-value larger than, say, 120 makes the
SELFIC method extremely cumbersome on the computer, because the inversion
of an $n \times n$ matrix is required. But, if we carry out the object-predicate reciprocity, using the H-matrix, we can carry out the SELFIC algorithm in the $N$-dimensional space. Following the same method as the extraction of representative objects, we can now extract $n'$ 'representative predicates', $n' < n$. Now returning to
the G-matrix, with the reduced predicate number, the standard SELFIC method
becomes feasible.
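Theorem 5.1 and the dimensionality argument are easy to verify numerically. The sketch below (ours, with unit weights and the $\tau$-normalization of (2.3) and (5.1)) builds both matrices for a case with $n > N$ and compares their spectra.

```python
import numpy as np

rng = np.random.default_rng(6)
n, N = 200, 15                    # many predicates, few objects (n > N)
X = rng.normal(size=(N, n))       # rows = objects, columns = predicates
tau = np.sum(X ** 2)

G = X.T @ X / tau                 # n x n predicate correlation matrix, eq. (2.3)
H = X @ X.T / tau                 # N x N object correlation matrix, eq. (5.1)

eig_G = np.sort(np.linalg.eigvalsh(G))[::-1][:N]
eig_H = np.sort(np.linalg.eigvalsh(H))[::-1]

print("largest eigenvalues agree:", np.allclose(eig_G, eig_H))
print("rank of G = rank of H =", np.linalg.matrix_rank(H))
```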

6. Applications to geometric patterns

Unfortunately the symmetric nature of the Karhunen-Loève system is not yet so well known among those who are working on practical pattern recognition problems. We here give a few simple additional applications of the symmetry theory to geometric patterns, leaving much to the reader's further investigations in other problem domains. It should be noted, however, that the applications in this section are useful not only for handling picture patterns in practice but also for gaining insight into the symmetry theory itself.

6.1. Digitization of geometric patterns


In computer processing of geometric patterns, we assume that picture patterns are a set of picture pattern functions $\{f^{(\alpha)}(u,v)\}$ where the $f^{(\alpha)}$ are bounded real continuous functions defined on a two-dimensional domain $D$. There are two ways to digitize the functions $f^{(\alpha)}$ into vector patterns $\{x_i^{(\alpha)}\}$. One method is to expand $f^{(\alpha)}$ by a base function system as in (3.5) and take the expansion coefficients. However, it is more common to make these vectors simply by geometric and grayness-level sampling. The geometric sampling decomposes the domain $D$ into $m \times m$ meshes $D_1, D_2, \ldots, D_{m \times m}$, and the grayness-level sampling assigns a representing digital number $x_i^{(\alpha)}$ to a function $f^{(\alpha)}$ at each mesh point $p_i \in D_i$. See Fig. 1. Let the range and domain be bounded between 0 and 1, i.e., $f^{(\alpha)}, u$ and $v \in [0,1]$, let the meshes all be of equal size, and let the grayness sampling be of $n$ levels:
$$k/n \le f^{(\alpha)}(p_i) < (k+1)/n, \qquad k = 0,1,\ldots,n-1, \quad p_i \in D_i. \qquad (6.1)$$

The picture patterns are now represented by picture pattern vectors $\{x_i^{(\alpha)}\}$ with $\alpha = 1,\ldots,N$; $i = 1,\ldots,m \times m$, and $x_i^{(\alpha)}$ being an integer between 0 and $n - 1$. (This $n$ has nothing to do with the $n$ of the foregoing sections; the number $m \times m$ corresponds to the $n$ of the foregoing sections.) We then define the two conjugate correlation matrices $G$ and $H$ as before:
$$G_{ij} = \sum_{\alpha=1}^{N} W^{(\alpha)} x_i^{(\alpha)} x_j^{(\alpha)} / \|x^{(\alpha)}\|^2 \qquad (6.2)$$
and
$$H^{(\alpha\beta)} = \sum_{i=1}^{m \times m} \bigl(W^{(\alpha)} W^{(\beta)}\bigr)^{1/2} x_i^{(\alpha)} x_i^{(\beta)} \big/ \bigl(\|x^{(\alpha)}\| \, \|x^{(\beta)}\|\bigr) \qquad (6.3)$$
where $\|x^{(\alpha)}\|$ is the norm of the vector $x^{(\alpha)}$ defined by
$$\|x^{(\alpha)}\|^2 = \sum_{i=1}^{m \times m} \bigl(x_i^{(\alpha)}\bigr)^2 \qquad (6.4)$$
and the weight function $\{W^{(\alpha)}\}$ must satisfy the conditions $W^{(\alpha)} > 0$ and
$$\sum_{\alpha=1}^{N} W^{(\alpha)} = 1. \qquad (6.5)$$
In the following discussion we take $W^{(\alpha)}$ to be $1/N$. In the above definitions of $G$ and $H$ we have explicitly taken the normalization condition into account, because
we must be careful when we examine their limiting values as will be shown in the
following section.
It was already noted that the two matrices $G$ and $H$ share the same eigenvalues and their degeneracies. Thus there are at most $\min\{N, m \times m\}$ non-zero positive eigenvalues $\lambda_i$. Moreover there also exist symmetrical relations between the eigenvectors $g_i$ of $G$ and the eigenvectors $h_i$ of $H$:
$$g_{ij} = (1/\lambda_i)^{1/2} \sum_{\alpha=1}^{N} \bigl(W^{(\alpha)}\bigr)^{1/2} x_j^{(\alpha)} h_{i\alpha} / \|x^{(\alpha)}\| \qquad (6.6)$$
and
$$h_{i\alpha} = (1/\lambda_i)^{1/2} \sum_{j=1}^{m \times m} \bigl(W^{(\alpha)}\bigr)^{1/2} x_j^{(\alpha)} g_{ij} / \|x^{(\alpha)}\|. \qquad (6.7)$$
These relations allow us to compute the eigenvectors of one of the two matrices, $G$ or $H$, without time-consuming matrix diagonalization once we already have the eigenvalues and eigenvectors of the other. This means that whenever we compute the Karhunen-Loève system, we can always choose the lower-dimensional matrix and its Karhunen-Loève system and then convert the eigenvectors by (6.6) or (6.7) to obtain the other set of eigenvectors.
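The computational shortcut offered by (6.6) can be illustrated as follows (a sketch with $W^{(\alpha)} = 1/N$ and the normalization of (6.2)-(6.4); the picture data are synthetic, not the authors' material): only the small $N \times N$ matrix $H$ is diagonalized, and the eigenvectors of the much larger $G$ are then recovered and checked.

```python
import numpy as np

rng = np.random.default_rng(7)
N, M = 12, 32 * 32                       # 12 pictures of 32 x 32 = 1024 meshes
pictures = rng.integers(0, 4, size=(N, M)).astype(float)   # grayness levels 0..3

W = 1.0 / N
Z = np.sqrt(W) * pictures / np.linalg.norm(pictures, axis=1, keepdims=True)

H = Z @ Z.T                              # small N x N matrix, eq. (6.3)
kappa, h = np.linalg.eigh(H)
order = np.argsort(kappa)[::-1]
kappa, h = kappa[order], h[:, order]

# eq. (6.6): eigenvectors of the (large) G recovered from those of H
g = Z.T @ h / np.sqrt(kappa)             # columns g[:, i] are eigenvectors of G

G = Z.T @ Z                              # built here only to check the claim
i = 0
print("G g_i = lambda_i g_i holds:",
      np.allclose(G @ g[:, i], kappa[i] * g[:, i]))
print("g_i has unit norm:", np.isclose(np.linalg.norm(g[:, i]), 1.0))
```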

Fig. 1. Decomposition of the picture function domain D.


Fig. 2. The $i$th eigenvector of $G$ is constructed as a linear combination of picture frames, where $a_j = (W^{(j)}/\lambda_i)^{1/2}/\|x^{(j)}\|$ and $h^i_j$ is the $j$th component of the $i$th eigenvector of the matrix $H$.

The relation (6.6) also allows us to interpret the Karhunen-Loève system geometrically. It shows that the $i$th eigenvector of $G$ is constructed as a linear combination of the original picture patterns $f^{(\alpha)}$ with some coefficients (Fig. 2). We may consider these eigenvectors as the two-dimensional Karhunen-Loève filters for picture patterns.
In many practical applications the number of meshes is large, while $N$ may be of the order of several hundred. If $m = 100$, for example, the $G$ matrix has dimension 10000, which may be too large for computing eigenvalues directly. Therefore it is easier to work with $H$ rather than $G$ itself in such cases.
Finally, if we denote the entropy derived from $G$ and $H$ by $S(G; m, n)$ and $S(H; m, n)$ respectively, then of course we have
$$S(G; m, n) = S(H; m, n) = -\sum_{i=1}^{\min\{N,\, m \times m\}} \lambda_i \log \lambda_i. \qquad (6.8)$$

6.2. Limiting process


If we increase m, the meshes become finer, and more detailed information
about the patterns are extracted as shown in Fig. 3a-g. Opposite phenomena will
be observed if we increase the greyness level n. Therefore the entropy defined by
(6.8) may also increase as m increases and n decreases. The dimension of the
matrix G, i.e., m X m, also increases more rapidly as the number m increases,
while that of H does not change. It is therefore wise to investigate H instead of G
in order to see the limiting behavior of the entropy function S(m, n).
In fact for a fixed n, if we take the limit m ~ o¢, then

lim NH~=[f(~)Af(~)]n/llfE(~)ll IIfE(~)ll (6.9)


m~oo

where fE(.~) are approximations of f(~) by quantizing them into n level step
Covariance matrix representation and object-predicate symmetry 711

9
Fig. 3, Digitization of two k a r y o t y p e p a t t e r n s where n is fixed as 64 for different m, (a) m - 2, (b)
m 4,(c) m--g,(d)m 16,(e) m - 3 2 , ( f ) m 64.(g) m -128.
712 T. Kaminuma, S. Tomita and S. Watanabe

functions, and

[ f(~)N f(#)] n = f f fE<.~l)f(.q) d u d v , (6.10)

11/(.3>11= ff (fl~l))2dudv. (6.11)

(Of course, when either $f_E^{(\alpha)}$ or $f_E^{(\beta)}$ is zero, $H^{(\alpha\beta)}$ is also zero.) Since the $f^{(\alpha)}$ are bounded, from (6.9) the limit of $H$ as $m \to \infty$ always exists for fixed $n$, and so do its eigenvalues and entropy. Furthermore, if we take $n \to \infty$, then
$$[f^{(\alpha)} \cap f^{(\beta)}]_\infty = \int\!\!\int f^{(\alpha)} f^{(\beta)}\, du\, dv \qquad (6.12)$$
so that we have the following theorems.

THEOREM 6.1. $H$ has the following limit when $m, n \to \infty$:
$$\lim_{m,n \to \infty} N H^{(\alpha\beta)} = \int\!\!\int f^{(\alpha)} f^{(\beta)}\, du\, dv \big/ \bigl(\|f^{(\alpha)}\| \, \|f^{(\beta)}\|\bigr). \qquad (6.13)$$

THEOREM 6.2. The entropy $S(G; m, n)$ has the limit value
$$\lim_{m,n \to \infty} S(G; m, n) = S(H; \infty, \infty). \qquad (6.14)$$

The corresponding $G$ matrix becomes an infinite-dimensional matrix when $m \to \infty$. In Hilbert space terminology, $G$ is actually an operator which corresponds to a kernel of an integral equation,
$$N G(u,v;\, u',v') = \sum_{\alpha=1}^{N} f^{(\alpha)}(u,v) f^{(\alpha)}(u',v') / \|f^{(\alpha)}\|^2, \qquad (6.15)$$
so that the finite matrices $G_{ij}$ are approximations to this $G$, and the infinite-dimensional $G$ should be identical to (6.15).

Fig. 4. Geometric convolution of two picture patterns: (a) is $f^{(\alpha)}$; (b) is $f_E^{(\alpha)}$; (c) corresponds to $[f_E^{(\alpha)} \cap f_E^{(\beta)}]_n$.

Comparing (6.2) and (6.3), we see that they
correspond to $G$ (7.6) and $H$ (7.7) in Schmidt's theory, which is described in the next section, where the unsymmetric kernel $K$ is $f^{(\alpha)}(u, v)$ and the weighting function is disregarded. The discussion in this section naturally evokes the curiosity whether the symmetry still holds even in the case when the index $\alpha$ of the pattern functions $f^{(\alpha)}$ becomes continuous. That it does will be shown in Schmidt's theory in the next section.
Eqs. (6.9) and (6.12) show that each element of $H$ is essentially an integral of the product of two picture pattern (step) functions $f_E^{(\alpha)}$ and $f_E^{(\beta)}$, which is a kind of geometric meet of the two (steptized) picture patterns as shown in Fig. 4a-c. For binary geometric patterns $n = 2$, and $f_E^{(\alpha)} = f^{(\alpha)}$ for all $\alpha$. In this particular case $H^{(\alpha\beta)}$ is interpreted as the geometric meet of the two picture patterns $f^{(\alpha)}$ and $f^{(\beta)}$ up to a factor.

6.3. Levels of quantization

Another application is to determine adequate sampling sizes $m$ and $n$ when digitizing geometric patterns. The discussion in the previous section showed that even if we increase $n$ infinitely or decrease $m$, there exists a limit for extracting information.

Fig. 5. Two calligraphy fonts of ten Japanese kana characters. (a) and (c) are zoomed-up pictures of the first characters of (b) and (d), respectively.

If we plot the entropy $S(n, m)$ as a function of $n$ and $m$, we may have an almost monotonically increasing curve which approaches a limit at infinity for each fixed $m$. It should be noted that the limits $S(\infty, \infty)$ and $S(n, \infty)$ do not depend on the sampling process; e.g., we may decompose the domain of the pattern functions non-uniformly. Therefore, the goodness of two different sampling processes can always be compared by the entropy curve plotted in terms of the greyness sampling level and the number of sampling points. In general, the entropy becomes large as the greyness sampling level $s$ decreases. Therefore, for geometric pattern recognition problems such as character recognition it is suggested to binarize the patterns and to make the sampling meshes fine.

Fig. 6. Graphs which show that the two entropy functions converge for two calligraphy fonts as $n$ goes to infinity, for $s = 2$ and 64. (Axes: entropy versus number of meshes.)
An early study was carried out for Japanese handwritten characters and alphabets in [10] by one of the authors. We here give some additional illustrative examples from our experiments [5]. We compared two different calligraphy fonts of 10 Japanese characters (see Fig. 5). Their entropy functions are plotted in Fig. 6 for fixed $s = 2$ and 64. In this example the curves reach a plateau between $n = 32 \times 32$ and $64 \times 64$. The results suggest that for classification problems of such patterns it seems useless to use a finer sampling than $n = 64 \times 64$. The same conclusion is
course to confirm this conclusion we must take into account possible variations of

Entropy

1.2

1.0 ¸

0.5

I I
2I 2 4~ 2 8I z ll6 z 3J2 z
64 z 1282

Number of meshes

Fig. 7. Entropy curve for 23 pairs of karyotypes.


716 T. Kaminuma, S. Tornita and S. Watanabe

each pattern. However, the above result gives a guideline on how to choose
appropriate quantization levels in relation to difference of patterns.
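The entropy curves of Figs. 6 and 7 can be reproduced in outline as follows (a sketch on synthetic binary patterns of our own making, with block averaging as the geometric sampling; it is not the authors' data): the entropy (6.8) is computed from the small matrix $H$ for a range of mesh sizes.

```python
import numpy as np

def digitize(picture, m):
    """Average a high-resolution picture down to an m x m mesh (geometric sampling)."""
    size = picture.shape[0]
    block = size // m
    return picture[:m * block, :m * block].reshape(m, block, m, block).mean(axis=(1, 3)).ravel()

def entropy_of_ensemble(vectors):
    N = len(vectors)
    Z = np.sqrt(1.0 / N) * vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    lam = np.linalg.eigvalsh(Z @ Z.T)        # eigenvalues of H, eq. (6.3)
    lam = lam[lam > 1e-12]
    return -np.sum(lam * np.log(lam))        # eq. (6.8)

rng = np.random.default_rng(8)
size = 128
# a few synthetic binary 'characters' built from coarse random blobs
base = rng.random((10, 8, 8))
patterns = np.array([np.kron(b, np.ones((size // 8, size // 8))) > 0.5 for b in base],
                    dtype=float)

for m in (2, 4, 8, 16, 32, 64, 128):
    vectors = np.array([digitize(p, m) for p in patterns])
    print(f"m = {m:3d}  entropy = {entropy_of_ensemble(vectors):.4f}")
```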

7. Schmidt's theory of unsymmetric kernels

Historically, the gist of the error-minimizing character of the Karhunen-Loève expansion was already introduced by E. Schmidt in his theory of integral equations [8]. We will briefly give the essence of Schmidt's theory in its simplest form, which may be useful for treating patterns represented by continuous functions. The proofs omitted in this section can be found in [9].
In order to eliminate spurious complications we assume that all functions and kernels are real, continuous and square integrable in the domain $R = \{x;\ a \le x \le b\}$ or $R^2 = R \times R$, as the case may be. We also use the functional space concepts defined as follows.
(i) Let $K$ be an operator defined by a mapping which makes a function $f$ correspond to a function $Kf$ defined by
$$Kf(s) = \int_R K(s,t) f(t)\, dt.$$
(ii) The 'real' conjugate operator $K^+$ of $K$ is defined by
$$K^+(s,t) = K(t,s).$$
(iii) The inner product of two functions $f$ and $g$ is defined by
$$f \cdot g = (f, g) = \int_R f(s) g(s)\, ds.$$
(iv) The norm of a function $f$ is defined by
$$0 \le \|f\| = (f, f)^{1/2} < \infty.$$
(v) An operator product $KL$ of two kernel operators $K$ and $L$ is defined by
$$KL(s,t) = \int_R K(s,u) L(u,t)\, du.$$
(vi) An operator product $fg$ of two functions $f$ and $g$ is defined by
$$fg(s,t) = f(s) g(t).$$
In this notation an integral equation
$$\lambda \int_R K(s,t) f(t)\, dt = f(s)$$
corresponds to the operator equation in the functional space,
$$\lambda K f = f, \qquad (7.1)$$
and the eigenfunction $f_i$ belonging to the eigenvalue $\lambda_i^2$ is defined by
$$\lambda_i^2 K f_i = f_i. \qquad (7.2)$$

A kernel operator $K$ is said to be symmetric if
$$K^+ = K \qquad (7.3)$$
and positive definite if
$$Kf \cdot f > 0 \quad \text{for arbitrary } f. \qquad (7.4)$$
The eigenvalues of a positive definite operator $K$ are all positive.
It is known as Mercer's theorem that a positive definite kernel $K$ can be expanded by a characteristic system in a series,
$$K = \sum_{i=1}^{\infty} f_i f_i / \lambda_i, \qquad (7.5)$$

which converges absolutely and uniformly. But such is not the case for unsymmetric kernels. However, for a given unsymmetric kernel $K$, Schmidt associated two symmetric operators
$$G = KK^+ \qquad (7.6)$$
and
$$H = K^+K, \qquad (7.7)$$
which are positive definite and whose eigenvalues are real. He has shown that:
(1) $G$ and $H$ defined by (7.6) and (7.7) share the same eigenvalues, as defined in (7.2), and their degeneracies.
(2) $K$ is expanded by the characteristic systems of $G$ and $H$ in a series
$$K \sim \sum_{i=1}^{\infty} f_i g_i / \lambda_i, \qquad (7.8)$$
where $f_i$ and $g_i$ are eigenfunctions of $G$ and $H$ belonging to the eigenvalues $\lambda_i^2$, respectively. By the swung dash in (7.8) we mean that the series converges in the mean.
(3) For two arbitrary sets of functions $X_i$ and $Y_i$,
$$\min_{\{X_i, Y_i;\, n\}} \Bigl\| K - \sum_{i=1}^{n} X_i Y_i \Bigr\|^2 = \Bigl\| K - \sum_{i=1}^{n} f_i g_i / \lambda_i \Bigr\|^2 = \sum_{i=n+1}^{\infty} 1/\lambda_i^2, \qquad (7.9)$$
where we have assumed that the positive eigenvalues are arranged such that
$$0 < \lambda_1^2 \le \lambda_2^2 \le \cdots \le \lambda_n^2 \le \cdots.$$

It was also shown that $f_i$ and $g_i$ are related by
$$f_i = \lambda_i K g_i, \qquad (7.10)$$
$$g_i = \lambda_i K^+ f_i. \qquad (7.11)$$
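In finite dimensions Schmidt's expansion (7.8) and the error formula (7.9) correspond to what is now computed by the singular value decomposition. In the sketch below (ours), the singular values $\sigma_i$ play the role of $1/\lambda_i$, and the truncation error is checked against the sum of the discarded $\sigma_i^2$.

```python
import numpy as np

rng = np.random.default_rng(9)
K = rng.normal(size=(8, 5))               # an 'unsymmetric kernel' as a matrix

# f_i, g_i and 1/lambda_i of (7.8) correspond to the singular triplets of K
f, sigma, gT = np.linalg.svd(K, full_matrices=False)

# eigenvalue equality for G = K K^+ and H = K^+ K, as in (7.6)-(7.7)
eig_G = np.sort(np.linalg.eigvalsh(K @ K.T))[::-1][:5]
eig_H = np.sort(np.linalg.eigvalsh(K.T @ K))[::-1]
print("G and H share their non-zero eigenvalues:", np.allclose(eig_G, eig_H))

# truncation after n terms: squared error equals the sum of discarded sigma_i^2,
# the analogue of the right-hand side of (7.9)
n = 2
K_n = (f[:, :n] * sigma[:n]) @ gT[:n]
err = np.linalg.norm(K - K_n, 'fro') ** 2
print("squared truncation error:", round(err, 6),
      " sum of discarded sigma^2:", round(np.sum(sigma[n:] ** 2), 6))
```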
With various modifications the previous theory becomes applicable to different areas of mathematical science. The first modification may be to extend $K(s,t)$ to a complex-valued function. As is well known, the concept of symmetry must then be replaced by the concept of hermiticity, and minor additional changes in the definitions are required. $K$ may also be extended to a complex function of several variables $K(x_1, x_2, \ldots, x_n)$, which is considered to be a kernel operator that transforms $p$-dimensional $L^2$ space vectors into $s$-dimensional $L^2$ space vectors, or its converse, with $n = s + p$. The application of this theory to physics, when $K$ is identified with the many-particle wave function, was thoroughly discussed in [8].
Secondly, $K$ may be reduced to a matrix $K_{ij}$ or a family of indexed functions $K_i(s)$. These are the cases which we encounter in pattern recognition, as discussed in the previous sections. In particular, we note that in pattern recognition theory the error-minimizing expansion (7.9) was further identified with the entropy-minimizing expansion [13]. However, it is necessary to include a normalization condition $\|K_i\| = 1$ for this purpose, where $K_i(s)$ is the $i$th pattern function when patterns are represented by continuous variables.
However, in any case the essential features of Schmidt's arguments remain
unchanged, and all proofs are almost identical to the proofs given in this section.

8. Conclusion

We have presented the covariance matrix representation, emphasizing the symmetric character of the Karhunen-Loève systems. The essence of the latter theory was discovered first by the mathematician E. Schmidt as a theory of unsymmetric kernels of integral equations. It is historically interesting that his theory was rediscovered in the fields of quantum chemistry and pattern recognition quite independently.
Although the error-minimizing and the entropy-minimizing character of the Karhunen-Loève system has been well known and extensively utilized by many pattern recognition researchers, the symmetric nature of the Karhunen-Loève system is not yet widely known. The present paper aimed to give a balanced view of the Karhunen-Loève system in the history of mathematical science. It also intended to draw attention to applications of this symmetric theory, for very little work has been done on this topic.

Acknowledgment

The authors wish to thank Mr. Isamu Suzuki at the Tokyo Metropolitan
Institute of Medical Science, who kindly provided experimental results and
produced the illustrations in Section 6. He also kindly read the original
manuscript and contributed notational corrections.

References

[1] Andrews, H. C. (1971). Multi-dimensional rotations in feature selection. IEEE Trans. Comput.,
1045-1051.
[2] Benzécri, J. P. (1976). L'Analyse des Données 2: L'Analyse des Correspondances. Bordas, Paris.
[3] Coleman, A. J. (1963). Structure of Fermion density matrices. Rev. Mod. Phys. 35, 668-687.
[4] Kaminuma, T. (1970). Informational entropy as a measure of correlation of interacting Fermi
particles. Ph.D. Thesis, University of Hawaii.
[5] Kaminuma, T. and Suzuki, I. (1980). Symmetry of Karhunen-Loève systems and its application
to geometric pattern analysis. Proc. 5th Internat. Conf. on Pattern Recognition, Vol. 2, pp.
1228-1231.
[6] Kanal, L. N. and Chandrasekaran, B. (1965). On dimensionality and sample size in statistical
pattern classification. Proc. Nat. Electronics Conf., Vol. 24, pp. 2-7; ibid., Pattern Recognition 3
(1971) 225-234.
[7] Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philos. Mag.
559-572.
[8] Schmidt, E. (1907). Zur Theorie der linearen und nichtlinearen Integralgleichungen. Math. Ann.
63, 433-476.
[9] Smithies, F. (1958). Integral Equations. Cambridge University Press, Cambridge.
[10] Tomita, S. et al. (1970). Theory of feature extraction for patterns by the Karhunen-Loève
orthogonal system. Systems Comput. Control 1, 55-62.
[11] Tomita, S. et al. (1973). On evaluation of handwritten characters by the K-L orthogonal system.
HICSS-3., 501-504.
[12] Watanabe, S. (1958). Lecture Notes of Summer School in Information Theory at Vienna; ibid.,
A note on the formation of concept and of association by information-theoretical correlation
analysis. Inform. Control 4, 291-296.
[13] Watanabe, S. (1965). Karhunen-Loève expansion and factor analysis - theoretical remarks and
applications. Trans. Fifth Prague Conf. on Information Theory, Statistical Decision Functions and
Random Processes, Prague, 1967. Publishing House of the Czechoslovak Academy of Sciences,
Prague, pp. 635-660.
[14] Watanabe, S. (1969). Knowing and Guessing. Wiley, New York.
[15] Watanabe, S. (1969). Object-predicate reciprocity and its applications to pattern recognition.
Inform. Processing 68; ibid., Proc. IFIP Congress Edinburgh, Scotland. North-Holland,
Amsterdam, pp. 1608-1613.
[16] Watanabe, S. and Kulikowski, C. A. (1970). Multiclass subspace methods in pattern recognition.
Proc. Nat. Electronics Conf., Vol. 26, pp. 468.
[17] Watanabe, S. Pattern recognition as quest for entropy minimization. Pattern Recognition, to
appear.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 721-745

Multivariate Morphometrics

Richard A. Reyment

1. Introduction

1.1. Definition
Morphometrics, in the statistical sense (the term is used differently in descrip-
tive zoology), means, as the name implies, the quantitative description of
variation in the morphology of organisms, plants and animals. Multivariate
Morphometrics, as defined by Blackith and Reyment (1971), is concerned with the
application of the theory and practice of multivariate statistical analysis to two or
more morphological characters considered simultaneously. Although the concept
of multivariate morphometrics can be generalized to cover a wide range of topics,
including numerical taxonomy (as developed by Sokal and Sneath, 1973), it seems
more realistic to restrict it to the ample field represented by the multivariate
statistical evaluation of morphological variation.
Blackith and Reyment (1971) concentrated largely on case histories drawn from
the fields of Botany, Palaeontology and Zoology with little attention paid to the
details of computing. Pimentel (1979) has chosen the opposite approach and
concerned himself with the step-by-step treatment of the arithmetic of the main
standard methods of multivariate statistical analysis of use in morphometric
work. Blackith (1965) gave an excellent general introduction to the subject which
can be consulted as an easy entrance to the ideas involved.
The present account concentrates mainly on newer developments in multi-
variate morphometrics and the special analysis of particular problems of great
interpretative significance. It owes much to the imaginative and far-sighted
research of N. A. Campbell.

1.2. Basic concepts


In order to introduce the concept of multivariate morphometrics, we shall
consider the highly schematic though didactically useful sketch used by Blackith
and Reyment (1971, p. 17), who cartooned the relationships in shape and size of
human beings in terms of major multivariate methods. The aim of the example is
to discuss leanness-fatness polarities in people, such as might arise in a regional
socio-medical survey.
Three characters sensitive to weight-variation are girths of the upper arm (x₁),
the waist (x₂) and the thigh (x₃). The three characters height (x₄), lengths of limbs
(here symbolized as x₅) and cranial measurements (here symbolized as x₆) may
serve to define variation in shape.
These six characters can be united into a linear compound by calculating the
covariance matrix between all characters, inverting this matrix and multiplying
the vector of differences in the means for the two categories of subjects, thus
providing the right-hand sides for a set of simultaneous linear equations which
have for their left-hand sides the coefficients required for constructing the desired
linear compound of diagnostic characters. These coefficients maximize the separa-
tion of fat and thin people for the six characters relative to the variation in each
group. The linear compound is of the following kind:
Z = b₁x₁ + b₂x₂ + b₃x₃ + b₄x₄ + b₅x₅ + b₆x₆. (1)
Eq. (1) is interpretable as the well-known discriminant function which here
separates individuals along a polarity of fatness and thinness.
Once the description of the fatness-thinness polarity is possible in terms of the
vector of coefficients in the appropriate discriminant function, we need to know
how far these two groups are separated by the function. That is, what is the
statistical distance between any given group of fat people and another group of
thin people? Many measures of similarity have been proposed. However, there is
one which stems from the same intellectual approach as the discriminant function
and is readily calculable from it: this is the generalized statistical distance of P. C.
Mahalanobis. It is found by multiplying the vector of differences between the
means for the two groups by the vector of discriminant coefficients. Because it is
related to the discriminant function, the
generalized distance allows each character to carry only its proper amount of
information about the separation of the groups and eliminates the effects of
correlation between the characters.
Once we have the coefficients for any particular discriminant function, all we
need do to obtain the score (denoting location) of an individual along this vector
is to multiply each coefficient by the value of its proper character for that
individual, with due regard for signs, and sum the products. If desired, we may
then proceed to express the score as a percentage of the total distance, along the
vector, say from 'thinness' to 'fatness'.
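As a purely numerical illustration of this scoring step (the coefficients,
measurements and group mean scores below are hypothetical and not taken from
any survey), the computation might be sketched as follows.

```python
import numpy as np

# hypothetical discriminant coefficients b1..b6 and one individual's measurements
b = np.array([0.8, 1.2, 0.9, -0.3, -0.2, -0.1])
x = np.array([31.0, 85.0, 57.0, 172.0, 78.0, 56.0])

score = b @ x                      # Z = b1*x1 + ... + b6*x6, as in (1)

# express the score as a percentage of the distance from the 'thin' group
# mean score to the 'fat' group mean score (both values hypothetical)
z_thin, z_fat = 95.0, 140.0
percent = 100.0 * (score - z_thin) / (z_fat - z_thin)
print(round(score, 2), round(percent, 1))
```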
The coefficients in the discriminant function are derived from a procedure
designed to maximize the separation of the two poles of the function (the 'fat' and
'thin' groups), but, once established, the particular nature of the groups chosen
for the initial orientation does not restrict the use to which the discriminant can
be put. Thus, if individuals fatter than the 'fat' group, or thinner than the 'thin'
group were subsequently to require a score on the scale of fatness-thinness, their
measurements could be fitted into the discriminator without difficulty.
Once we have established the multivariate-morphometric relationship between
'fatness' and 'thinness', and that between 'tallness' and 'shortness', we may be
interested in finding a way of giving a pictorial representation of the two
relationships. One may use the angles between the discriminant functions as one
method; this approach can, however, become cumbersome and it does not help us
to incorporate our knowledge of the distances between the defining groups of
people.
It is possible to make a geometrical figure of the arrangement of the various
groups 'fat', 'thin', 'tall', and 'short' by calculating the generalized distances
between each of the groups, choosing an arbitrary scale so that any one of these
distances can be drawn on a sheet of paper and then, by drawing arcs of a radius
determined by the other generalized distances, the positions of the remaining
groups can be localized.
Sometimes the attempt to construct such a chart of distances will be unsuccess-
ful; one possible reason is that the chart could, in fact, be three- (or more)
dimensional. If it is three-dimensional, a solid model may be made, but if it is
four-dimensional or of higher dimensionality, only imperfect representations will
result. A quite different reason for the failure of the construction may be that
variation, as expressed by the covariance matrices, differs substantially from one
group to another.
Rao (1952) recommends that instead of constructing the generalized distance
chart first, and then putting in, as it were, the underlying dimensions of variation
by inspection, we should rather calculate the underlying dimensions of variation
and then use these as the axes of charts on which the mean positions of the
groups may be plotted. Such underlying axes of variation are called the canonical
variate axes and the arrangement of the groups in the space defined by these
variates is called discriminatory topology. This technique has the advantage that,
should the arrangement of the groups require three or more dimensions for its
proper expression, the canonical variates can be taken two at a time so that
different aspects of the relationships can be examined in detail.
The canonical variates are, by definition, at right angles to one another; the
underlying axes of variation representing the polarities may not be orthogonal in
this way, so that these canonical variates act as a formal framework for the
groups, the interrelationships of which can be studied as before once their
positions have been plotted.
Pursuing further the example of the fat and thin people we might wish to
investigate the relationship between obesity and the external factors causing this
condition. An ideal model would be to correlate the morphometric characters we
have selected as being descriptive of adiposity with underlying causes. One would
therefore have a set of morphometric variables and a set of variables such as
measurable genetic factors, intake of nourishment, metabolic factors and the like.
A reasonable method of analysis would then require a measure of the correlation
between the two sets and the contribution of each variable to the correlation,
albeit in approximate terms; the multivariate statistical method of canonical
correlation was developed to treat this kind of problem.
So far, we have been concerned with the contrasts of form between two distinct
groups of people, fat and thin. The mass of the people will, however, fall between
the two extremes, being of what we commonly call 'normal build'. There is thus
no clear distinction between studying contrasts of form between two selected
types and studying the variation of form to be found within any one group of
people. Within the group the variation will still be canalized along vectors of
various kinds. The distinction between the situation in which we try to identify
these canalized patterns of variation in a supposedly homogeneous group and
that in which we deliberately select individuals showing one or other extreme
form (to act as terminal groups for discriminant functions) is in practice a matter
of degree.

2. Variation in a single sample

Canalized patterns of growth and form which exist within a homogeneous


population can be extracted from the covariance matrix S for a sample drawn
from the population by setting it in a characteristic equation, the roots of which
are the latent roots or eigenvalues:

|S - λI| = 0, (2)

where I is the unit matrix of the same rank as S. The solution of this equation
consists of a set of eigenvalues, as many as there were characters in the covariance
matrix, each of which is associated with an eigenvector, with as many elements as
there were characters measured.
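A minimal computational sketch of this eigen-analysis, on artificial data, might
look as follows; the data, the number of characters and the scaling are arbitrary,
and any numerical library offering a symmetric eigensolver would serve equally
well.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 6)) * np.array([3, 2, 1, 1, 0.5, 0.5])  # 50 organisms, 6 characters

S = np.cov(X, rowvar=False)                  # sample covariance matrix

# roots of |S - lambda*I| = 0 and the associated eigenvectors, cf. (2)
eigenvalues, eigenvectors = np.linalg.eigh(S)
order = np.argsort(eigenvalues)[::-1]        # largest root first
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# component scores of each organism on the principal axes
scores = (X - X.mean(axis=0)) @ eigenvectors
print(eigenvalues)
```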
Generally the first eigenvector, corresponding to the largest eigenvalue of the
characteristic equation, is taken to reflect the variation in size of the organisms--
for the above example it would be interpreted as reflecting the polarity of stature
between tall and short individuals. The size-variational interpretation of the first
eigenvector, or principal component, has been derived from work by the French
zoologist Teissier, who more than 45 years ago analyzed variation in crabs using
essentially the first principal component of the correlation matrix of features of
the carapace as a means of quantifying size variation. The ideas of Teissier were
carried further by Jolicoeur and Mosimann (1960) who added an interpretation of
the second principal component, the elements of which usually have plus and
minus signs, as a measure of variation in shape. Thus, in the fatness-thinness
example, the second principal component might reflect the fatness-thinness
polarity or the polarity between persons of stocky and slender build. Although
size and shape interpretations based on principal component analyses are very
common in morphometric studies, there is a disturbing arbitrariness to the
approach and it is by no means certain that all analyses are unchallengeable. Rao
(1964) has given attention to alternative methods of treating the problem. The
most complete probing of the entire field of size and shape variables has been
given by Mosimann (1970) and Mosimann and Malley (1979) (see Section 4).
There is another way of extracting canalized variation from the covariance
matrix. The components obtained by principal component analysis are always
orthogonal (at right angles) to each other. Various techniques of factor analysis
have been devised in order to 'reorient' the eigenvectors more closely with
assumed sources of variation.

2.1. Factor analysis


In all techniques for morphometric analysis, except one, the selection of a
suitable method is the only choice that has to be made, apart from technical
matters like the number of characters and the number of organisms on which to
assess them. With factor analysis, however, a new element of choice enters
inasmuch as the experimenter may set out to orient his axes of variation
according to his theoretical knowledge, real or assumed, about the nature of the
variation in his material.
In principal component analysis we work from the data towards a hypothetical
model, whereas in factor analysis things are done the other way round.
Normally, the nature of the a priori hypotheses will change from one applica-
tion of factor analysis to another; it would be foolish to expect a theory of the
intellectual capacities of the human mind, the original field of development of
factor analysis, to bear any useful relationship to theories about the growth of
insects, for instance.
In biology (and the geosciences) most applications of 'factor analysis' use, in
actual fact, principal components. There is much confusion as to what constitutes
factor analysis in almost all biological and geological applications that have been
made. Jöreskog et al. (1976) have given a detailed account of the way in which the
various methods for the eigen-analysis of a single sample, which they grouped
together arbitrarily under the title of Geological Factor Analysis are intertwined.
In this book it was pointed out that only very few applications in the natural
sciences belong to the classical factor-analytical model, such as it was conceived
and developed by psychometricians. Unfortunately, the message conveyed by that
book does not seem to have made a noticeable impact on the audience for which
it was intended.
Factor-analytical work, sensu lato, is mostly conceived of as being carried out
in R-mode or Q-mode representations. The mathematical details are
given elsewhere in this volume. Some form of factor analysis is called for when
the experimenter has a theory, which can be expressed in quantitative terms,
about the behaviour of his material, and feels that this theory should determine
the way in which the results are interpreted. Great restraint is needed in such
cases to avoid using the results so acquired as a justification for the underlying
theory. There is still much uncertainty in the minds of many statisticians and
biologists about the usefulness of factor analysis, some of which is justified and
some of which is due to lack of familiarity with the relationships between a set of
mathematically closely related techniques. Jöreskog et al. (1976) have attempted
to elucidate the question by treating all related methods in the one connexion.
What about the cases where no underlying theory is to be invoked in the
analysis? One such arises when the characters assessed are not the same in two
different experiments but the experimenter wishes to compare some notional axis
of variation. In psychometric work this situation seems to be common: the test
subjects are given a 'battery' of tests, differing from one laboratory to another,
and notional factors such as those related to intelligence or skills are to be
extracted from the matrices of correlations between the performances. An analo-
gous situation in morphometrics would arise when two experimenters wished to
compare vectors of, say, size and shape, but had, for some reason, been unable or
unwilling to measure the same characters of the organisms. There can be little
doubt that certain factor-analytical techniques can arrive at an assessment of
these axes of variation in terms which minimize the nuisance caused by the
different suites of characters used. Against this advantage has to be set the lack of
any way of testing whether factors so uncovered are significant. Such situations
seem to arise with great frequency in psychometric work but they are not
common in morphometrics.
Blackith and Reyment (1971, p. 147) discussed applications of principal
component analysis to morphometric problems, for example size-variation in
thrips, the identification of shells deriving from the life cycle of fossil for-
aminifers, variation in scale insects and the morphological variants of the com-
mon salamander.
Brief mention should be made of a method of 'factor analysis' known as
correspondence analysis (l'analyse factorielle des correspondances), which is
widely used (under the name of factor analysis) in French-language morphometri-
cal publications. The current version of the method is due to J. Benzécri (1973)
but its history is long and tortuous. An excellent account of the many re-
discoveries of the concept of combined R-Q-mode analyses has been published by
Hill (1975), starting with the original brilliant work of H. O. Hirschfeld (later
Hartley) in 1935.
Briefly stated, the aim of correspondence analysis is to obtain simultaneously,
and on equivalent scales, R-mode 'factor-loadings' and Q-mode 'factor-loadings'
that represent the same principal components of the data matrix. The goal has
been achieved in Benzécri's variant by a method of scaling the data matrix, an
appropriate index of similarity, and the Q-R-duality relationship of the Eckart-
Young theorem (Jöreskog et al., 1976, p. 107). The scaling procedure and analysis
are algebraically equivalent to Fisher's (1940) contingency table (Hill, 1975).
The monograph on Madagascan lemurs by Mahé (1971) and that dealing with
Jurassic brachiopods by Delance (1974) contain many examples of the morpho-
metrical application of correspondence analysis.

3. Homogeneity and heterogeneity of covariance matrices

Many of the multivariate statistical procedures used in morphometrical work


have the theoretical requirement of homogeneity in the covariance matrices. There
may also be biological significance attaching to heterogeneity in covariance
matrices. Particularly in connection with palaeontological morphometric analyses
it is frequently of importance to be able to pick up significant heterogeneity in
covariance matrices as it may be a reflection of sexual dimorphism in the
material, changes in breadth of variation deriving from different ontogenetic
phases (particularly in arthropods), evolutionary changes through time, sorting of
hard parts by geological agencies. Fortunately, most of the multivariate methods
used in morphometric studies are robust even to quite considerable deviations
from homogeneity.
There is no really satisfactory method of testing the homogeneity of covariance
matrices. Campbell (1979) has given the problem close attention, with particular
reference to graphical methods. The most widely used test is sensitive to depar-
tures from normality as well as to differences in the variances and covariances
due to heterogeneity and it therefore requires not a little care and forethought if a
useful analysis is to result. For a discussion of the likelihood ratio tests for the
equality of covariance matrices, the reader is referred to Krishnaiah and Lee
(1980).
There is as yet no definite method for testing multivariate normality of
dispersions, although approximate solutions are available (Mardia, 1970 (example
in Reyment, 1971); Campbell, 1979).
As in all statistical investigations it is highly desirable that a graphical appraisal
of the data be made. Even in a multivariate study, pronounced deviations from
normality in one or more variables will be readily picked up from a graph and, if
necessary, from univariate tests of normality. Such simple steps may be adequate
for most purposes and obviate the need for a multivariate test. Note that this
simplified approach can only indicate that the data deviate from multivariate
normality but not that the data are multivariate-normally distributed.
Gnanadesikan (1977) has given the use of graphical methods in multivariate
analysis close attention. Campbell (1979) has applied procedures for identifying
atypical multivariate observations through the graphical analysis of generalized
distances between observations, a method of analysis also advocated by
Gnanadesikan (op. cit.). Campbell and Reyment (1981) have given an example of
the palaeobiological significance of multivariate outliers, using results of
Campbell (1979).
At the practical level, outliers can cause severe difficulties in multivariate work,
and careful preliminary studies at the univariate level should form an essential
part of all morphometric analyses.
Campbell and Reyment (1981) applied robust statistical methods to for-
aminiferal data to detect observations atypical in the statistical sense but which
are biologically normal. The atypical specimens, disclosed by the robust proce-
dures involving plotting of the generalized statistical distances calculated from
robust estimates of means and covariances against the quantiles of a Gaussian
distribution, could be interpreted as follows: (1) microspheric tests (in relation to
the numerically greatly superior normal megalospheres), (2) pseudomegalospheric
tests, (3) morphologically aberrant megalospheres, (4) a crushed specimen, and (5)
a punching error.
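The simplest version of such a plot, with the squared generalized distances
ordered and set against chi-square quantiles, can be sketched as below; it uses
ordinary rather than robust estimates of the mean and covariance matrix, so it is
only a rough stand-in for the procedures of Campbell (1979), and the data are
simulated with one observation deliberately displaced.

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_sq(X):
    """Squared generalized distances of each row of X from the sample mean."""
    d = X - X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    return np.einsum("ij,jk,ik->i", d, S_inv, d)

rng = np.random.default_rng(2)
X = rng.multivariate_normal(np.zeros(4), np.eye(4), size=80)
X[0] += 6.0                                  # plant one atypical specimen

d2 = np.sort(mahalanobis_sq(X))
q = chi2.ppf((np.arange(1, len(d2) + 1) - 0.5) / len(d2), df=X.shape[1])

# in a d2-versus-q plot, points far above the 1:1 line flag atypical observations
for observed, expected in zip(d2[-3:], q[-3:]):
    print(round(observed, 2), round(expected, 2))
```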

4. Size and shape

One of the main areas of interest in multivariate morphometrics is that


constituted by the analysis of size and shape. Although size, shape and the study
of allometry have long been of interest to morphometricians, there is still no
really consolidated methodology for the analysis of shape owing largely to the
difficulty of giving a precise formulation of how shape can be meaningfully
quantified. Many attempts have been made in the past, beginning with the
continuingly useful graphical work of Thompson (1942), followed by Penrose's
ratios (1954), the principal components representation of Jolicoeur and
Mosimann (1960) and Mosimann's (1970) and Mosimann's and Malley's (1979)
geometrical interpretation.
The earliest attempts at a quantitative assessment of shape were almost all
made by means of ratios of characters; often there was the idea that one of the
characters could be regarded as indicative of some feature of primary interest,
whereas the other effectively standardized its variation by providing a measure of
absolute size.
Ratios can prove useful, and they lie at the root of Mosimann's (1970) method,
although one should be aware of certain shortcomings accompanying the indis-
criminate use of ratios of morphological characters. A ratio will not be constant
for organisms of the same species, unless these are also of the same size, owing to
the almost universal occurrence of allometric growth.
Much work in multivariate morphometrics has been directed towards the study
of allometric growth. If much of the emphasis in growth studies of recent years
has moved away from bivariate allometry towards multivariate methods, there is
still a substantial amount of research devoted to the classical allometrical relation-
ship. This work has been well summarized by Gould (1966, 1977). The results of
Jolicoeur (1963, 1968) and Jolicoeur and Mosimann (1960) have taken the
treatment of allometry far along the multivariate road, so that from the allometric
relationship we proceed to its generalization in terms of the major axis of the
covariance matrix, and thence to the 'reduced major axis', which is effectively a
ratio of two measures of dispersion. The reduced major axis suffers from the
defect that its slope cannot be estimated with precision.
There are some difficult, even paradoxical, situations in the theory of allometric
growth which were reviewed by Martin (1960). Sprent (1968) has noted that
Jolicoeur's and Mosimann's model, using essentially the first principal component
as a growth vector, is acceptable to workers who believe that the remaining
components define the error space.
Sometimes nuisance components may interfere with the analysis of only some
of the parts of the organisms under investigation. Hopkins (1966) reported that in
rats the liver, heart, spleen, lungs, skin and adrenal weights develop in a manner
consistent with a single allometric pattern, whereas the kidneys,
genitalia and brain develop in a manner which is size-sensitive. Although the
allometric model advanced by Hopkins (1966) seems to be logical and useful, in
actual practice the coefficients found by principal components may differ very
little from Hopkins' factor-analytical values.

Ideas for the separation of shape- and size-components of the various contrasts
of form with which one may have to deal date back to Penrose (1954). We shall
now consider the way in which Mosimann (1970) has taken up the general ideas
of Penrose for providing an acceptable quantification of shape variation.
Mosimann (op. cit.) and Mosimann and Malley (1979, p. 182) consider that the
most satisfactory solution to the problem of defining shape- and size-variables is
by means of quantities based on geometric similarity.
This approach considers only positive k-dimensional vectors. We denote the set
of such vectors by Pᵏ. Two vectors x₁, x₂, say, of Pᵏ have the same shape if they
are both on the same ray of Pᵏ. A size-variable is generally defined to be any
homogeneous function G of degree 1 from Pᵏ to P¹, the positive real numbers.
This may be expressed succinctly by saying that for every positive x,

G(ax) = aG(x)

for arbitrary a > 0.

With this definition, functions from Pᵏ such as the following are size-variables:
Σxᵢ, (Σxᵢ²)^(1/2), (Πxᵢ)^(1/k), and xₖ, since they take only positive values and for each
such G, G(ax) = aG(x).
For a particular size-variable G, Mosimann (op. cit.) defined a shape vector as
a function Z_G from Pᵏ where

Z_G(x) = x/G(x)

for all x. The direction cosines, x/(Σxᵢ²)^(1/2), proportions, x/Σxᵢ, and ratios,
x/xₖ, are examples of shape vectors. Two vectors x₁, x₂ have equal shape vectors
Z_G(x₁) = Z_G(x₂) if they both lie on the same ray, i.e. if x₁ = ax₂ for some a > 0.
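A small numerical check of these definitions (the measurement vectors are
hypothetical; the three size-variables coded are the examples listed above) may
help fix ideas.

```python
import numpy as np

def size_sum(x):  return x.sum()                     # G(x) = sum of the x_i
def size_norm(x): return np.sqrt((x ** 2).sum())     # G(x) = (sum x_i^2)^(1/2)
def size_geom(x): return x.prod() ** (1.0 / x.size)  # G(x) = (prod x_i)^(1/k)

def shape_vector(x, G):
    """Mosimann shape vector Z_G(x) = x / G(x)."""
    return x / G(x)

x1 = np.array([2.0, 4.0, 6.0])
x2 = 3.5 * x1                          # same ray of P^k, hence the same shape

for G in (size_sum, size_norm, size_geom):
    print(np.isclose(G(3.5 * x1), 3.5 * G(x1)),                   # homogeneity of degree 1
          np.allclose(shape_vector(x1, G), shape_vector(x2, G)))  # equal shape vectors
```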

5. Significance tests in morphometrics

In many of the situations with which an experimental scientist has to cope, tests
of significance are superfluous to the problem of ascertaining the structure of the
experiment in multidimensional space. If a group of organisms is not known to
consist of definable sub-groups, then a principal component, or principal coordi-
nate, analysis seems the appropriate tool for a preliminary probing of its
morphometric structure (so-called 'zapping'). Once the material is known to
comprise a number of sub-groups, an analysis along canonical axes is advisable.
When the decision to use canonical variates has been taken, tests of multivariate
significance lose much of their point, for we know in advance that the sub-groups
differ. In this respect, the interests of the biologist may deviate from those of the
theoretically oriented statistician--a large body of the literature on multivariate
statistics is concerned with divers aspects of the testing of significance.
Thus, the point at issue is not primarily a statistical one: an entomologist
investigating the form of insects in a bisexual species would rarely be well advised
to test the significance of the sexual dimorphism, for a glance at the genitalia will
normally settle the question of sex. A palaeontologist concerned with the question
of whether or not sexual dimorphism exists in fossil cephalopods might well, on
the other hand, take a quite opposite view.

6. Comparing two or more groups

The main multivariate statistical procedure for comparing and contrasting


samples from two or more populations is the method of canonical variate
analysis, which, on the one hand, can be seen as a generalization of two-sample
discriminant functions and, on the other, a generalization of the one-way analysis
of variance. The need for making such comparisons occurs in taxonomical
studies, analyses of geographical and ecological variability and the analysis of
phenotypic changes through time in an evolving species.

6.1. Discriminant functions and generalized distances

The statistical ideas underlying one of the original problems of the method of
discriminant functions may be discussed in terms of two populations π₁ and π₂
reasonably well known from samples drawn from them. A linear discriminant
function is constructed on the basis of v variables and two samples of sizes N₁ and
N₂. The coefficients of the sample discriminant function may be defined as

a = S⁻¹(x̄₁ - x̄₂), (5)

where a is the vector of discriminatory coefficients and x̄₁ and x̄₂ are the mean
vectors of the respective samples from the two populations. S⁻¹ is the inverse of
the pooled sample covariance matrix for the two samples.
The linear discriminant function between the two samples for the variables
x₁, ..., xᵥ may be written as

z = xᵀa. (6)

If the variances of the v variables are almost equal, the discriminator coefficients
give an approximate idea of the relative importance of each variable to the
efficiency of the function.
Considering now one of the classical problems of discriminatory analysis, we
have measurements on the same v variables as before on a newly found individual
which the researcher wishes to assign to one of the two populations with the least
chance of being wrong. (This presupposes that the new specimen does really come
from one of the populations.) Using a pre-determined cut-off value, the measure-
ments on the new specimen are then substituted into (6) and the determination is
made on the grounds of whether the computed value exceeds or is less than the
cut-off point. Usually this is taken to lie midway between the two samples, but
other values may be selected for some particular biological reason.

The supposition that the individual must come from one of the two populations is
necessary on purely statistical grounds, but it is one that may make rather poor
biological sense. Doubtless the specimen could have come from one of the two
populations, but it may equally well be morphometrically close to, but not
identical with, one of them. Such situations arise in biogeographical studies and in
the analysis of evolutionary series.
The linear discriminant function is connected with Mahalanobis' gener-
alized statistical distance by the relationship

D² = (x̄₁ - x̄₂)ᵀ S⁻¹(x̄₁ - x̄₂); (7)

D² is consequently the inner vector product of the vector of differences in mean


vectors and the vector of discriminatory coefficients.
The complete presentation of the statistics of generalized distances is given
elsewhere in this handbook. The two methods briefly reviewed in the present
section are among the most widely used in routine morphometrical work. Good
examples are given in Rao (1952) and Pimentel (1979). Blackith and Reyment (1971)
supply a comprehensive list of case histories. As is shown in later sections, the
routine application of discriminant functions and generalized distances in much
morphometrical work may be beset with problems of a biological nature which
require the use of special methods.
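A schematic illustration of (5)-(7), together with the midpoint cut-off rule, is
given below; the samples are simulated and are not drawn from any of the case
histories cited.

```python
import numpy as np

def pooled_cov(X1, X2):
    """Pooled within-sample covariance matrix of two samples (rows = specimens)."""
    n1, n2 = len(X1), len(X2)
    S1, S2 = np.cov(X1, rowvar=False), np.cov(X2, rowvar=False)
    return ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)

rng = np.random.default_rng(3)
X1 = rng.multivariate_normal([10.0, 6.0, 8.0, 4.0], 0.5 * np.eye(4), size=40)
X2 = rng.multivariate_normal([11.0, 7.0, 8.5, 4.5], 0.5 * np.eye(4), size=35)

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
a = np.linalg.inv(pooled_cov(X1, X2)) @ (m1 - m2)    # coefficients, eq. (5)
D2 = (m1 - m2) @ a                                   # generalized distance, eq. (7)

cutoff = 0.5 * (m1 + m2) @ a           # midpoint cut-off on the discriminant axis
x_new = np.array([10.4, 6.4, 8.2, 4.2])              # a newly found individual
z = x_new @ a                                        # its score, eq. (6)
print(D2, "sample 1" if z > cutoff else "sample 2")
```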

6.2. Canonical variate analysis


The simplest formulation of canonical variate analysis is the distribution-free
one of finding that linear combination of the original variables which maximizes
the variation between groups relative to the variation within groups. If B is the
between-groups sums of squares and cross products matrix for g groups and W
the within-groups sums of squares and cross products on n_w degrees of freedom,
the canonical vector c₁ maximizes the ratio

c₁ᵀBc₁/c₁ᵀWc₁. (8)

The maximized ratio yields the first canonical root f₁ with which is associated the
first canonical vector c₁. Subsequent vectors and roots may be obtained analo-
gously. The canonical vectors are usually scaled so that, for example, cᵢᵀWcᵢ = n_w.
An important reference for canonical variate analysis is Rao (1952).
Other derivations are given elsewhere in this handbook.
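In matrix terms the maximization of (8) leads to the generalized eigenproblem
Bc = fWc. A bare-bones sketch on simulated samples (the group means, sizes and
the absence of any rescaling of the vectors are all arbitrary choices) might read:

```python
import numpy as np
from scipy.linalg import eigh

def canonical_variates(groups):
    """Vectors maximizing c'Bc / c'Wc, eq. (8), for a list of samples."""
    grand = np.vstack(groups).mean(axis=0)
    W = sum((len(g) - 1) * np.cov(g, rowvar=False) for g in groups)
    B = sum(len(g) * np.outer(g.mean(axis=0) - grand, g.mean(axis=0) - grand)
            for g in groups)
    roots, vectors = eigh(B, W)               # solves B c = f W c
    order = np.argsort(roots)[::-1]
    return roots[order], vectors[:, order]

rng = np.random.default_rng(4)
groups = [rng.multivariate_normal(mu, np.eye(4), size=30)
          for mu in ([0, 0, 0, 0], [1.0, 0.5, 0, 0], [0.5, 1.0, 0.5, 0])]
roots, C = canonical_variates(groups)
print(roots[:2])                              # the first canonical roots
```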

6.3. Stability of canonical vectors


In many applications of canonical variate analysis, the relative sizes of the
coefficients for the variables standardized to unit variance by the pooled within-
groups standard deviations are valuable indicators of those variables which are
important for discrimination. If the relative magnitudes of the standardized
coefficients are to be employed in this way, stability (i.e., the sampling variation
of the coefficients over repeated sampling) of the coefficients is an important
factor. The following account of the role of stability in canonical variate analyses
is based on Campbell and Reyment (1978) and Campbell (1980). It is offered,
both as an example of canonical variate analysis as well as a case history of
interest in multivariate morphometrics.
In discriminant analysis it can be shown that high correlation within groups,
when combined with between-groups correlation of the opposite sign, leads to
greater group separation and a more powerful test than when the within-groups
correlation is low. However, if the instability inherent in regression analysis with
highly correlated regressor variables carries over to discriminant analysis, and
thence to canonical variate analysis, interpretation of the importance of variables
based on the relative sizes of the standardized coefficients may be misleading.
Inasmuch as morphometrical studies often include interpretations of the elements
of the canonical vectors, the desirability of achieving stable estimates can hardly
be over-emphasized.
For present purposes, canonical variate analysis may be regarded as a two-stage
rotational procedure. Firstly, one rotates to orthogonal variables which can here
be called the principal components of the pooled samples, using this terminology
in a broad sense. The second rotation corresponds to a principal component
analysis of the group means in the space of the orthogonal variables.
The first stage transforms the within-groups concentration ellipsoid into a
concentration circle by scaling each eigenvector by the square root of the
corresponding eigenvalue. Consider now the variation between groups along each
orthogonalized variable (or principal component). When there is little variation
between groups along a particular direction, and the corresponding eigenvalue is
also small, marked instability can be expected in some of the coefficients of the
canonical variates (effectively, the instability is under the influence of small
changes in the properties of the data set, although formally, this instability is
expressed in repeated sampling from the population).
A solution which tends to overcome the problem of instability in the canonical
coefficients is to add shrinkage or ridge-type constants to each eigenvalue before
this eigenvalue is used to standardize the corresponding principal component. The
use of such constants is, however, not essential to the method; it is no more than a
very useful tool.
When an infinitely large constant is added, this confines the solution to the
subspace orthogonal to the vector, or vectors, affected by the addition. In
particular, when the smallest eigenvalue, or eigenvalues, with the associated
vector, or vectors, is involved, a generalized inverse solution results. As a general
rule, one can say that when the between-groups sum of squares for a particular
principal component is small (say, less than 5% of the total between-groups
variation), and the corresponding eigenvalue is also small (say, less than 1-2%),
then marked shrinking of the principal component will be of value.
It is often observed that although some of the coefficients of the canonical
vectors corresponding to the canonical variates of interest change magnitude and
often sign, shrinkage has little effect on the corresponding canonical roots (which
indicates that little or no discriminatory information has been lost). When this
occurs, the obvious conclusion is that one or some of the variables contributing
most to the principal component that has been shrunk have little influence on the
discriminatory process. One or some of these redundant variables can profitably
be eliminated. In addition, variables with small standardized canonical variate
coefficients can be deleted.
For interpreting the morphometrical relationships in a taxonomic study, or an
equivalent kind of analysis, those principal components that contribute most to
the discrimination are of particular interest and the characteristics of the corre-
sponding eigenvectors should be examined. For example, if the first principal
component is involved, size-effects of some kind may occur.
The within-groups sums of squares and cross products matrix W on n_w degrees
of freedom and the between-groups sums of squares and cross products matrix B
are computed in the usual manner of canonical variate analysis, together with the
matrix of sample means. It is advisable to standardize the matrix W to correlation
form, with similar scaling for B. The standardization is obtained through pre- and
post-multiplying by the inverse of the diagonal matrix S, the diagonal elements of
which are the square roots of the diagonal elements of W.
Consequently,

W* = S⁻¹WS⁻¹ and B* = S⁻¹BS⁻¹. (9)

The eigenvalues eᵢ and eigenvectors uᵢ of W* are then computed; the correspond-
ing orthogonalized variables are the principal components. With

U = (u₁, ..., uᵥ), (10)

the eigenvectors are usually scaled by the square root of the eigenvalue; this
is a transformation for producing within-groups sphericity. Shrunken estimators
are formed by adding shrinking constants kᵢ to the eigenvalues eᵢ before scaling
the eigenvectors. The details of the mathematics are given in Campbell (1980).
Write

eᵢ* = eᵢ + kᵢ, i = 1, ..., v,

and define

U*(k₁,...,kᵥ) = U diag(e₁*, ..., eᵥ*)^(-1/2). (11)

Now form the between-groups matrix in the within-groups principal component
space, that is, form

G(k₁,...,kᵥ) = U*(k₁,...,kᵥ)ᵀ B* U*(k₁,...,kᵥ), (12)

and set dᵢ equal to the ith diagonal element of G. The ith diagonal element dᵢ is
the between-groups sum of squares for the ith principal component.
An eigen-analysis of the matrix G(0,...,0) yields the usual canonical roots fᵢ and
canonical vectors for the principal components, aᵘ. The usual canonical vectors
cᵘ = c(0,...,0) are given by

cᵘ = U*(0,...,0) aᵘ. (13)

Generalized shrunken (or generalized ridge) estimators are determined directly
from the eigenvectors a of G(k₁,...,kᵥ), with c = U*(k₁,...,kᵥ)a. A generalized inverse
solution results when kᵢ = 0 for i ≤ r and kᵢ = ∞ for i > r. This gives aᵢᴳᴵ = aᵢᵘ for
i ≤ r and aᵢᴳᴵ = 0 for i > r. The generalized inverse solution results from forming
G(0,...,0,∞,...,∞) = Uᵣ*ᵀ B* Uᵣ*, where Uᵣ* corresponds to the first r columns of
U*(0,...,0). The generalized canonical vectors cᴳᴵ = c(0,...,0,∞,...,∞) are given by
cᴳᴵ = Uᵣ* aᴳᴵ, where aᴳᴵ, of length r, corresponds to the first r elements of aᵘ. In
practice, marked instability is associated in particular with a small value of eᵥ and
a correspondingly small diagonal element dᵥ of G. Some general guide-lines for
the choice of kᵢ are given below. Note that extensive analyses by Campbell (1979)
indicate that a generalized inverse solution with r = v - 1 frequently provides
stable estimates under these conditions and is usually conceptually simpler than
using shrinking constants.
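The two-stage rotation with shrinking constants can be sketched in a few lines;
this is only an illustration of the sequence of steps described in this section,
with W and B taken as sums-of-squares-and-cross-products matrices and no
attempt made to reproduce the exact scaling conventions of Campbell (1980).

```python
import numpy as np

def shrunken_canonical_vectors(W, B, k):
    """Ridge-type canonical variates: constants k_i are added to the eigenvalues
    of the standardized within-groups matrix before the principal components
    are rescaled (two-stage rotation)."""
    s_inv = np.diag(1.0 / np.sqrt(np.diag(W)))
    W_star = s_inv @ W @ s_inv            # within-groups SSCP in correlation form
    B_star = s_inv @ B @ s_inv            # between-groups SSCP, same scaling

    e, U = np.linalg.eigh(W_star)         # eigenvalues e_i and eigenvectors u_i
    idx = np.argsort(e)[::-1]
    e, U = e[idx], U[:, idx]

    U_star = U / np.sqrt(e + k)           # shrunken scaling of the principal axes
    G = U_star.T @ B_star @ U_star        # between-groups matrix in PC space
    f, A = np.linalg.eigh(G)
    idx = np.argsort(f)[::-1]
    return f[idx], U_star @ A[:, idx]     # canonical roots and canonical vectors

# k = 0 for every component reproduces the usual solution; a very large k on the
# last component mimics the generalized inverse solution (k_v -> infinity).
rng = np.random.default_rng(5)
A1, A2 = rng.standard_normal((60, 4)), rng.standard_normal((10, 4))
W_demo, B_demo = A1.T @ A1, A2.T @ A2     # illustrative SSCP matrices
print(shrunken_canonical_vectors(W_demo, B_demo, np.zeros(4))[0])
```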
An easy rule to use is to examine the contribution of dᵥ to the total group
separation, trace(W⁻¹B); the latter is merely trace(G(0,...,0)) or Σᵢ₌₁ᵛ dᵢ. In
situations where one or two canonical variates describe much of the between-
groups variation, it may be more to the point to examine the relative magnitudes
of the first one or two canonical roots derived from G(0,...,0) and G(0,...,0,∞) rather
than a composite measure. Either way, if dᵥ/Σᵢdᵢ, or the corresponding ratio of
canonical roots, is small (say, less than 0.05), then little loss of discrimination will
result from excluding the smallest eigenvalue-eigenvector combination (kᵥ = ∞)
or, equivalently, from eliminating the last principal component.
When the group separation along a particular eigenvector(s) is small and the
corresponding eigenvalue(s) is also small, a marked improvement in the stability
of the canonical variate coefficients can be effected by shrinking the aᵢ corre-
sponding to the eigenvector(s) towards zero. Instability will be largely confined
to those variables which exhibit the highest loadings on this eigenvector(s).
Usually it will be advantageous to shrink those components corresponding to a
small eigenvalue and small contribution to tr(W⁻¹B). Since the sum of the
coefficients tends to be stable, deletion of one or some of the variables with
unstable coefficients may be a useful next step in the analysis. It should be noted
that in many investigations, there will be no advantage in shrinking; this occurs
when much of the variation between groups coincides with the directions of the
smallest eigenvectors. Whereas in regression, high correlations will almost cer-
tainly indicate instability of the coefficients of the variables with the highest
loadings in the smallest eigenvector, or eigenvectors, within-groups correlations as
high as 0.98 may still yield a stable discriminator. With high positive within-groups
Table 1
Means, pooled standard deviations and correlations for the gastropod Dicathais

Sample         1       2       3       4       5       6       7
Sample size    102     101     75      69      29      48      32
Sample means
L              39.36   33.39   35.54   33.86   27.43   51.73   37.47
LS             16.10   11.99   14.06   13.07   10.14   20.73   13.79
LA             28.04   25.58   25.81   25.10   20.42   37.21   28.55
WA             12.81   12.02   11.76   11.60   9.154   17.97   13.39

Sample         8       9       10      11      12      13      14
Sample size    83      88      44      34      33      82      60
L              40.11   38.43   33.17   32.39   44.02   33.34   55.94
LS             13.16   12.71   12.36   13.29   14.91   13.34   25.00
LA             31.94   30.40   24.67   23.12   33.51   24.92   38.93
WA             16.08   14.90   11.21   11.76   17.46   13.02   20.84

Correlations and standard deviations (diagonal elements)

        L       LS      LA      WA
L       9.73    0.97    0.98    0.98
LS              4.31    0.91    0.91
LA                      6.82    0.99
WA                              3.48

correlation and negative between-groups correlation pronounced shrinking will
never be necessary. However, with high positive within- and between-groups
correlations, marked shrinking will almost always be to the good.

6.4. Practical aspects

The analysis given by Campbell (1980) for the gastropod Dicathais from the
coasts of Australia and New Zealand provides a good idea of the morphometrical
consequences of instability in canonical variate coefficients. Four variables de-
scribing the size and shape of the shell were measured, to wit, length of shell (L),
length of spire (LS), length of aperture (LA) and width of aperture (WA). Means,
pooled standard deviations and correlations for W* are listed in Table 1. As is
not uncommon in highly integrated characters in molluscs, all correlations are
very high. The eigenvalues and eigenvectors of the correlation matrix of Table 1
are listed in Table 2.
As an outcome of the very high correlations, there are two small eigenvalues,
with the smaller of them accounting for less than 0.08% of the within-groups
variation; the corresponding eigenvector contrasts L with LS and LA. The
between-groups sums of squares corresponding to each combination of eigenval-
ues and eigenvectors (12) are supplied in the same table.

Table 2
Eigen-analysis of the within-groups correlation matrix for Dicathais

Eigenvector
No.     L       LS      LA      WA      Eigenvalue      uᵢ*ᵀB*uᵢ*
1       0.50    0.49    0.50    0.50    3.869           0.55
2       0.08    0.79   -0.42   -0.43    0.112           1.49
3      -0.33    0.15   -0.56    0.75    0.016           1.87
4       0.79   -0.33   -0.51    0.03    0.003           0.38

In Table 3, the coefficients aᵘ, the canonical roots and the canonical vectors cᵘ
(13) for the standardized original variables are given. It will be seen from a₁ᵘ and
a₂ᵘ that the third principal component dominates the first canonical variate and
the second principal component, the second canonical variate. The smallest
principal component makes a much greater contribution to the second canonical
variate than to the first (Table 3).
The coefficients for the first two canonical variates are, clearly, strongly
affected by shrinking. It can be shown that these changes in size and sign hold for
a wide range of values of k₄ with very little change in the first canonical root and
only minimal change in the second canonical root.
The shifts in the coefficients can be predicted if the least eigenvalue is relatively
very small. In the present example, the eigenvector corresponding to e₄ is
dominated by L in relation to LS and LA, the corresponding between-groups sum
of squares is relatively small (Table 2) and the contribution made by the smallest
principal component to the second canonical variate is large, which all unite in
bringing about the more marked changes in the coefficients for the second
canonical vector under shrinking.
Referring to Table 3, it will be seen that an interpretation of the morphometri-
cal significance of the canonical vectors, if based on the usual canonical vectors,
would differ so strongly from the stable values that it would lead to false
conclusions regarding the roles of the individual variables in discrimination.

Table 3
Summary of canonical variate analysis for Dicathais (from Campbell, 1980)ᵃ

             First canonical vector                Second canonical vector
                                       Canonical                             Canonical
             PC1    PC2    PC3    PC4  root        PC1    PC2    PC3    PC4  root
aᵘ          -0.32  -0.08  -0.93   0.17             0.09  -0.93  -0.02  -0.35

             L      LS     LA     WA               L      LS     LA     WA
cᵘ(k₄=0)     4.82   2.02  -2.41   5.64  2.13       4.65   4.28  -2.12   1.51  1.68
cᴳᴵ(k₄=∞)   -2.42   0.66  -3.78   5.91  2.09       0.17  -2.54   1.93   0.18  1.48

ᵃPC stands for principal component, cᵘ(k₄=0) is the usual canonical vector and cᴳᴵ(k₄=∞) denotes the
generalized inverse coefficients for the canonical variates.

6.4.1. Redundant variables


The lack of stability of the canonical variate coefficients for L, LS and LA
suggests that one or some of these variables may be redundant for discrimination.
The canonical roots are little affected by shrinking while the coefficients for L
and LA change markedly. This suggests that an analysis restricted to L, LS and
WA, or to LS, LA and WA would be worth trying. The contribution of the
eliminated variable can be assessed by a multivariate analysis of covariance by
performing (a) a canonical variate analysis on all variables, (b) the same analysis
on the v - 1 retained variables and (c) determining Wilks' Λ for each analysis.
The ratio Λᵥ/Λᵥ₋₁ can then be used to gauge the importance of the deleted
variable (cf. Rao, 1952; Campbell, 1980).
The problem briefly outlined here is of practical importance in multivariate-
morphometrical work. It is seldom possible to be sure of the diagnostic value of
all the variables selected for measuring. In a case such as the example given, a set
of variables with unstable canonical coefficients often indicates that one or more
of the variables are redundant and can be safely eliminated. Variables with small
standardized canonical coefficients can also be deleted. The variables amongst
those remaining with the largest standardized coefficients will then usually be the
more important variables for discrimination. Clearly, when variables are being
eliminated, care should be taken to ensure that discrimination is little affected.
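The ratio of Wilks' Λ values mentioned above can be computed directly from the
sums-of-squares-and-cross-products matrices; the following sketch uses simulated
groups whose structure is arbitrary and purely illustrative.

```python
import numpy as np

def wilks_lambda(groups):
    """Wilks' lambda det(W)/det(W+B) for a list of samples (rows = specimens)."""
    grand = np.vstack(groups).mean(axis=0)
    W = sum((len(g) - 1) * np.cov(g, rowvar=False) for g in groups)
    B = sum(len(g) * np.outer(g.mean(axis=0) - grand, g.mean(axis=0) - grand)
            for g in groups)
    return np.linalg.det(W) / np.linalg.det(W + B)

rng = np.random.default_rng(6)
groups = [rng.multivariate_normal(mu, np.eye(4), size=40)
          for mu in ([0, 0, 0, 0], [0.8, 0.4, 0, 0], [0.4, 0.8, 0, 0])]

lam_full = wilks_lambda(groups)                         # all v variables
lam_reduced = wilks_lambda([g[:, 1:] for g in groups])  # first variable deleted
print(lam_full / lam_reduced)   # the ratio used to gauge the deleted variable
```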

6.4.2. Further comments


The gastropod example is a rather extreme case of the effects of instability.
Most cases tend to be less spectacular. Some results for the Coniacian ammonite
genus Prionocycloceras (Cretaceous, Brazil), now briefly considered, are typical of
many situations.
Five samples of Prionocycloceras (original data), on which the four characteris-
tics: diameter of shell (D), diameter of umbilicus (U), maximum whorl height
(H) and maximum whorl breadth (B) were measured, were analyzed by the
methods of Section 6.3, using logarithmically transformed observations.
The within-groups correlations are small (< 0.2), except for r₁₃ and r₁₄, which
are greater than 0.5. All between-groups correlations are relatively high (0.5-0.7).
The contribution of the between-groups sum of squares for the fourth principal
component is only 3% of the total between-groups variation but the smallest
eigenvalue of the within-groups correlation matrix is a little more than 5% of
tr W*, which makes it uncertain whether improved stability can be attained by the
methods of Section 6. The essential results of the analysis are summarized in
Table 4.
The eigenvector associated with the smallest principal component of W*
contrasts D with H and B. The fourth principal component for the second
canonical vector is relatively large, which suggests that D, H and B are likely to be
unstable in the second canonical vector. This is indeed so. Even though shrinkage
affects D, H and B in the first canonical vector, the effects are more pronounced
in the second vector and also involve changes of sign. The canonical roots are
diminished by shrinking, but this is very slight.

Table 4
Summary of canonical variate analysis for Prionocycloceras

             First canonical vector                Second canonical vector
                                       Canonical                              Canonical
             PC1    PC2    PC3    PC4  root        PC1    PC2     PC3    PC4  root
aᵘ           0.94   0.04  -0.31   0.17             0.21  -0.44    0.78   0.38

             D      U      H      B                D      U       H      B
cᵘ(k₄=0)     3.67  10.51   9.03   8.67  5.68      -6.43  -15.19  18.93   4.32  0.10
cᴳᴵ(k₄=∞)    9.52  10.79   5.25   5.56  5.53       7.64  -15.46  11.45  -2.58  0.09

7. Morphometrics and ecology

Occasionally, in the course of morphometric work, and also in the course of a


wide range of ecological investigations, we may wish to decide whether one set of
variables, taken as a whole, varies with another set of variables and, if the answer
is positive, to uncover the nature of this joint variation.
In ecological investigations there may be special interest attached to the
changes of the elements of the fauna, such as the collembolan fauna of a peat bog,
as the bog is drained; a process which in itself will entrain a large number of
other changes such as the depth of the anaerobic layers, of root systems, etc. In
strictly morphometric work there are occasions when we need a linear compound
of the measurements of an organism which are, as far as possible, uncorrelated
with environmentally induced changes of shape. Canonical correlations afford a
means of testing hypotheses; if we set up the hypothesis that, say, the collembola
of a bog vary together with the nematodes of the bog, as the latter is drained, we
can estimate these two groups of animals in the bog at different stages of
draining, and determine whether there is or is not an association between the two
sets of estimates.
The mathematics of canonical correlation have been treated elsewhere in this
book. The interpretational difficulties of canonical correlation analysis seem to
have hindered the application of the method to many kinds of
problems for which it seems suitable. Cooley and Lohnes (1971) have developed
modifications that often prove useful in morphometric work. Gittins (1979) has
presented a comprehensive review, with examples, of canonical correlation ap-
plied to ecological studies in which its results achieve prominence. Reyment
(1975) analyzed the inter-set relationships between living ostracod species of the
Niger Delta and physico-chemical factors of the environment. Further examples
are given in Chapter 15 of Blackith and Reyment (1971).
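A bare-bones computation of the canonical correlations between a morphometric
set and an environmental set can be sketched as follows; the data are simulated
and the construction of the two sets is arbitrary.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations between two sets of variables measured on the same sites."""
    n = len(X)
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    Sxx, Syy = Xc.T @ Xc / (n - 1), Yc.T @ Yc / (n - 1)
    Sxy = Xc.T @ Yc / (n - 1)
    # squared canonical correlations are the eigenvalues of Sxx^-1 Sxy Syy^-1 Syx
    M = np.linalg.solve(Sxx, Sxy) @ np.linalg.solve(Syy, Sxy.T)
    r2 = np.sort(np.linalg.eigvals(M).real)[::-1]
    return np.sqrt(r2.clip(min=0.0))

rng = np.random.default_rng(7)
env = rng.standard_normal((60, 3))                 # e.g. environmental factors
morph = env @ rng.standard_normal((3, 4)) * 0.6 + rng.standard_normal((60, 4))
print(canonical_correlations(morph, env))
```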

8. Growth-free canonical variates

One of the serious weaknesses of taxonomically oriented canonical variate


analyses (as well as multivariate analyses of variance and generalized distances)
lies with comparisons made between samples consisting of organisms at different
stages of growth and/or confounded by sexual-dimorphic differences. This is a
very common situation in almost all morphometric work and one that is ignored
by the majority of workers, despite its fundamental significance.
This means that a sample will not be statistically homogeneous, being a mixture
of growth stages and growth-inhibited morphologies, even though it may be
biologically homogeneous in the broad sense.
The problem of extracting the element of differences in growth from, for
example, ecologically controlled size-differences was first given serious considera-
tion by the palaeontologist T. P. Burnaby (1966) in a significant contribution to
multivariate morphometrics. Rao (1966) gave a general treatment of the question,
but without taking up practical details of estimation and biological relevance.
Gower (1976) developed a general analysis of the growth-invariance concept
along the lines mapped out by Burnaby. Gower's (op. cit.) method of analysis was
applied to planktic foraminifers by Reyment and Banfield (1976).

8.1. Example of growth-invariant analysis


An example which epitomizes the new trends in multivariate morphometry and
at the same time illustrates the use of three common techniques will serve to
round off this section.
Gower (1976) described four methods for estimating a matrix K, the k columns
of which are growth vectors. The coefficients of these vectors give linear combina-
tions of the v variates that have been measured on each of the n organisms. Two
of the methods proposed by Gower for estimating these vectors, namely, the
'external' methods of estimation, where concomitant variables are required, could
not be applied to the fossil foraminifers analyzed by Reyment and Banfield
(1976) and discussed below. Concomitant variables have to be highly correlated
with age so as to express the growth-stages attained by the specimens. Such variables cannot be found for foraminifers, nor, indeed, for any invertebrate fossils, although some vertebrate fossil material might conceivably be relatable to external age-defining variables. Unfortunately, there are virtually no published data for an analysis via concomitant variables; the only case known to me is the material published by Delany and Healy (1964), who used tooth-wear in a species of rodent as an external age-indicator.
The method of internal estimation used in the ensuing example depends on the
principal-component method, which suggests that the k growth vectors can be
estimated by the first k eigenvectors of the pooled within-groups covariance
matrix W, calculated from the logarithms of the observational vectors. The
growth effects are then considered to be the major source of variation within each
group and can be represented by the first few principal components.
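A minimal Python sketch of this internal, principal-component estimation step is given below. It is not taken from the original study: the data matrix, the group labels and the simple orthogonal projector used to strip out the estimated growth space are illustrative assumptions only.

```python
import numpy as np

def growth_vectors(logX, groups, k=1):
    """Estimate k 'growth vectors' as the first k eigenvectors of the pooled
    within-groups covariance matrix W of log-transformed measurements.

    logX: (n, v) array of log measurements; groups: length-n label array.
    """
    groups = np.asarray(groups)
    v = logX.shape[1]
    W = np.zeros((v, v))
    dof = 0
    for g in np.unique(groups):
        Xg = logX[groups == g]
        Xc = Xg - Xg.mean(axis=0)
        W += Xc.T @ Xc
        dof += len(Xg) - 1
    W /= dof
    vals, vecs = np.linalg.eigh(W)      # eigenvalues in ascending order
    K = vecs[:, ::-1][:, :k]            # first k eigenvectors (orthonormal)
    return W, K

def remove_growth(logX, K):
    """Project the data onto the space orthogonal to the growth vectors.
    K has orthonormal columns, so the projector is simply I - K K'."""
    P = np.eye(K.shape[0]) - K @ K.T
    return logX @ P
```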
The material analyzed here is of the planktic foraminiferal species Subbotina
pseudobulloides (Plummer) from the Early Paleocene (Danian) of southern Sweden
(Malmgren, 1974). The samples have been derived from a borehole at Limhamn, Skåne, and come from levels 1.0 m, 3.0 m, 9.3 m, 33.3 m, 40.5 m and 67.2 m. The six

Fig. 1. Measurements made on Subbotina pseudobulloides.

variables measured are shown in Fig. 1. The basic statistics are listed in Tables 5
and 6.
Growth-free canonical variate analyses were made for each of the species for
k = 0 and k = 1. The analysis for k = 0 is the standard one of canonical variates
where no growth effects are extracted. The analysis with k = 1 corresponds to the
removal of one principal component as a 'growth vector', subject to the reserva-
tion of Section 2 regarding the arbitrary nature of such a principal-components
growth interpretation.
Tables 7 through 9 contain the squared generalized distances, the canonical
variate loadings and the canonical variate means resulting from the analysis.
Where no principal components were extracted (k = 0), the canonical variate
means are substantially different from those for k = 1. The sample illustrates the
comparatively large changes that may result in distances by removal of one

Table 5
Pooled within-groups covariance matrix and group means for Subbotina pseudobulloides from the
Early Paleocene of Sweden (logarithmically transformed data)
Pooled covariance matrix
0.0184
0.0173 0.0203
0.0170 0.0180 0.0226
0.0165 0.0146 0.0159 0.0206
0.0176 0.0198 0.0188 0.0157 0.0257
0.0199 0.0210 0.0199 0.0155 0.0204 0.0298

Group means N
5.1391 4.9476 4.6280 4.6165 4.4182 4.2415 20
5.1501 4.9542 4.5415 4.6714 4.4530 4.2696 29
4.9748 4.7771 4.4850 4.4833 4.2821 4.0888 30
5.1837 5.0079 4.6270 4.6666 4.5271 4.3238 39
5.0404 5.8598 4.5178 4.5127 4.3508 4.1376 60
5.1845 5.0206 4.6998 4.6813 4.5481 4.3389 100

Table 6
Eigenvalues and eigenvectors of the within-groups covariance matrix for Subbotina
pseudobulloides
1 2 3 4 5 6
Eigenvalues
0.1129 0.0094 0.0070 0.0047 0.0022 0.0011

Eigenvectors
0.3860 -0.1291  0.2063 -0.1784 -0.3120  0.8140
0.4032  0.1582  0.1402 -0.0539 -0.7730 -0.4386
0.4069 -0.1071 -0.0252  0.8992  0.1117  0.0363
0.3535 -0.7514  0.3035 -0.2505  0.1894 -0.3460
0.4288  0.0219 -0.7890 -0.2596  0.3463  0.0759
0.4626  0.6179  0.4717 -0.1630  0.3701 -0.1348

principal component (the 'size' component). This change may be an outcome of variation due to growth- and size-differences.
For k = 0, the means of samples 3 and 6 are relatively far apart (D² = 3.13), but once the presumed growth variation has been extracted, these two samples lie much closer together (D² = 0.29). In fact, all the distances are reduced by the
removal of the first principal component because the initial distances (k = 0) are
being partitioned into two parts which express the distances projected onto the
growth-space and those projected onto the space orthogonal to the growth-space.
Even though these distances are smaller, the samples are more distinct because
the minimum distances between sample means required for significance are based
on reduced within-sample variation. Thus, for purposes of discrimination as well
as other quantitative work, removal of the growth-effect is important if a valid
interpretation of the morphometrical relationships is to be made.
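Continuing the previous sketch, this partitioning can be illustrated by computing the squared generalized (Mahalanobis) distances between sample means before and after projecting out the estimated growth vectors; a pseudo-inverse replaces the ordinary inverse in the reduced (rank-deficient) space. This is only an assumed implementation for illustration, not the procedure that produced Tables 7 through 9.

```python
import numpy as np

def squared_generalized_distances(logX, groups, K=None):
    """Pairwise D^2 between group means.  If K (growth vectors with
    orthonormal columns) is supplied, both the means and the pooled
    within-groups covariance are first projected onto the complement
    of the growth space."""
    groups = np.asarray(groups)
    labels = np.unique(groups)
    v = logX.shape[1]
    W = np.zeros((v, v))
    dof = 0
    means = {}
    for g in labels:
        Xg = logX[groups == g]
        means[g] = Xg.mean(axis=0)
        Xc = Xg - means[g]
        W += Xc.T @ Xc
        dof += len(Xg) - 1
    W /= dof
    P = np.eye(v) if K is None else np.eye(v) - K @ K.T
    Winv = np.linalg.pinv(P @ W @ P)
    D2 = np.zeros((len(labels), len(labels)))
    for i, gi in enumerate(labels):
        for j, gj in enumerate(labels):
            d = P @ (means[gi] - means[gj])
            D2[i, j] = d @ Winv @ d
    return D2
```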

Table 7
Squared generalized distances for Subbotina pseudobulloides for k = 0, 1
k = 0
1 0.0000
2 2.8677 0.0000
3 1.9539 4.4016 0.0000
4 1.5322 0.8623 3.6189 0.0000
5 0.7524 2.9727 0.8738 1.9316 0.0000
6 1.3623 2.6748 3.1345 0.7847 2.2441 0.0000

k = 1
1 0.0000
2 2.8647 0.0000
3 0.7654 3.0867 0.0000
4 1.3496 0.7246 1.3157 0.0000
5 0.2759 2.4150 0.7146 0.6821 0.0000
6 1.0088 2.3848 0.2947 0.7563 0.5923 0.0000

Table 8
Canonical variate analysis for Subbotina pseudobulloides for k = 0
Latent roots
1 2 3 4 5
3.1008 1.3326 0.6374 0.2336 0.0235
Canonical variate loadings
1 2 3 4 5
-1.6494   -2.0139  -17.4327   -2.5077   15.2510
 4.7737    0.3278   -8.0457   -2.6102   16.2890
-8.4805    9.7136    0.9643    3.7573   -0.0948
 5.6673   -3.3072    7.2132    8.2769   -6.8715
 2.4804    0.9962    7.2153   -7.5067    2.5701
 2.4981   -1.1284    7.9632    2.5453    5.4067

Coordinates of means
1 2 3 4 5
 0.2762    0.3301   -0.5245    0.2498    0.0258
 0.9390   -0.7096    0.0179    0.1451   -0.0435
 1.0794   -0.3072    0.4137    0.1008    0.0420
 0.7064    0.1001    0.0143   -0.2162    0.1046
-0.6122   -0.1777   -0.2669   -0.3043   -0.0652
 0.3224    0.7643    0.3455    0.0249   -0.0636

Table 9
Canonical variate analysis for Subbotina pseudobulloides for k = 1
Latent roots
1 2 3 4 5
2.2298 0.6467 0.2667 0.0619 0.0001
Canonical variate loadings
1 2 3 4 5
  1.5126   17.9109    0.4724   15.5306  -249.7511
 -4.0535    7.2746    5.9664  -16.3598   -98.3832
 12.9584   -1.6057   -2.6835   -1.4948    19.6511
 -6.6390   -6.8857   -7.5410   -7.8315    52.2846
 -1.6893   -7.5132    6.7275    4.1577    79.9098
 -2.4882   -7.6466   -3.7076    4.7446   162.8214
Coordinates of means
1 2 3 4 5
  0.4676    0.5034   -0.1826   -0.0667    0.0043
 -1.1577    0.0329   -0.1904    0.0278   -0.0033
  0.5503   -0.3420   -0.2395    0.1348   -0.0010
 -0.4440   -0.0502    0.2503    0.0903    0.0069
  0.3066    0.2902    0.2569    0.0392   -0.0071
  0.2772   -0.4342    0.1052   -0.1698    0.0001

9. Applications in taxonomy

Many morphometric analyses are concerned with quantitative taxonomical studies (to which may be appended much anthropometrical work) and there are many case-histories in print. Space restrictions permit no more than a passing reference to some typical examples. Blackith and Reyment (1971) have reviewed a large number of case histories, to some of which reference is made here: morphologically similar beetles of the genus Halticus (op. cit., p. 50), Asiatic wild asses (op. cit., p. 51), whitefly on host plants (op. cit., p. 52), anthropometry (op. cit., pp. 54, 58, 259), phenotypic flexibility of shape in the desert locust (op. cit., p. 56), the insect Orthotylus on broom plants (op. cit., p. 93), the primate shoulder in locomotion (op. cit., p. 100), skeletal form of shrews (op. cit., p. 105), a phase vector for locusts (op. cit., p. 135), as well as chapters on principal components in size and shape studies, phytosociological studies, quantitative comparisons of faunal elements, and genetics. Analyses of the evolution of Cretaceous echinoids (op. cit., pp. 253-256) and of nematodes (op. cit., pp. 258-259) have also been made.
From the statistical aspect, multivariate morphometrics is mostly a straightfor-
ward application of standard methods of multivariate statistical analysis with
particular emphasis on applications of canonical variate analysis and principal
component analysis. In some cases of special biological pertinence, existing theory has been found inadequate, and specific adaptations of the standard techniques are being developed to treat such situations.
The particular claim to existence of multivariate morphometrics lies with the
interpretation of the biological significance of the statistical computations, and,
consequently, the analysis of problems normally beyond the ken of most statisti-
cians. Thus, the successful explanation of a multivariate-morphometric analysis
may require intimate knowledge of modern evolutionary theory, taxonomy,
functional morphology and various aspects of ecology. Finally, we note that some
methods of reduction of dimensionality in principal component analysis, canoni-
cal correlation analysis and discriminant analysis are discussed in a paper by
Krishnaiah (1978).

References
Benzécri, J. P. (1973). L'Analyse des Données. 2, L'Analyse des Correspondances. Dunod, Paris.
Blackith, R. E. (1965). Morphometrics. In: T. H. Waterman and H. J. Morowitz, eds., Theoretical and Mathematical Biology, 225-249. Blaisdell, New York.
Blackith, R. E. and Reyment, R. A. (1971). Multivariate Morphometrics. Academic Press, London.
Burnaby, T. P. (1966). Growth invariant discriminant functions and generalized distances. Biometrics
22, 96-110.
Campbell, N. A. (1978). The influence function as an aid in outlier detection in discriminant analysis.
Appl. Statist. 27, 251-258.
Campbell, N. A. (1979). Canonical variate analysis: some practical aspects. Ph.D. Thesis, Imperial
College, University of London.
Campbell, N. A. (1980). Shrunken estimators in discriminant and canonical variate analysis. Appl.
Statist. 29, 5-14.

Campbell, N. A. and Reyment, R. A. (1978). Discriminant analysis of a Cretaceous foraminifer using shrunken estimators. Math. Geology 10, 347-359.
Campbell, N. A. and Reyment, R. A. (1981). Robust multivariate procedures applied to the
interpretation of atypical individuals of a Cretaceous foraminifer. Cretaceous Res. 1, 207-221.
Cooley, W. W. and Lohnes, P. R. (1971). Multivariate Data Analysis. Wiley, New York.
Delance, J. H. (1974). Zeilleridés du Lias d'Europe Occidentale. Mémoires Géologiques de l'Université de Dijon, 2, Dijon.
Delany, M. J. and Healy, M. J. R. (1964). Variation in the long-tailed field mouse (Apodemus
sylvaticus L.) in north-west Scotland. II. Simultaneous examination of all the characters. Proc. Royal
Soc. London Ser B 161, 200-207.
Fisher, R. A. (1940). The precision of discriminant functions. Ann. Eugenics 10, 422-429.
Gittins, R. (1979). Ecological applications of canonical analysis. In: L. Orloci et al., eds., Multivariate
Methods in Ecological Work, 309-535. International Co-operative Publishing House, Maryland.
Gnanadesikan, R. (1977). Methods for Statistical Data Analysis of Multivariate Observations. Wiley,
New York.
Gould, S. J. (1966). Allometry and size in ontogeny and phylogeny. Biol. Revues Cambridge Philos.
Soc. 41, 587-640.
Gould, S. J. (1977). Ontogeny and Phylogeny. Belknap Press, Harvard.
Gower, J. C. (1966). Some distance properties of latent roots and vectors used in multivariate analysis.
Biometrika 53, 325-338.
Gower, J. C. (1976). Growth-free canonical variates and generalized inverses, Bull. Geological
Institutions University of Uppsala 7, 1-10.
Hill, M. (1975). Correspondence analysis: a neglected multivariate method, J. Roy. Statist. Soc. Ser. C.
23, 340-354.
Hopkins, J. W. (1966). Some considerations in multivariate allometry. Biometrics 22, 747-760.
Jolicoeur, P. (1963). The degree of generality of robustness in Martes americana. Growth 27, 1-27.
Jolicoeur, P. and Mosimann, J. E. (1960). Size and shape variation in the painted turtle. Growth 24,
335-354.
Jolicoeur, P. (1968). Interval estimation of the slope of the major axis of a bivariate normal
distribution in the case of a small sample. Biometrics 24, 679-682.
Jöreskog, K. G., Klovan, J. E. and Reyment, R. A. (1976). Geological Factor Analysis. Elsevier,
Amsterdam.
Krishnaiah, P. R. (1978). Some recent developments on real multivariate distributions. In: P. R.
Krishnaiah, ed., Developments in Statistics, Vol. 1, Academic Press, New York.
Krishnaiah, P. R. and Lee, J. C. (1980). Likelihood ratio tests for mean vectors and covariance
matrices. In: P. R. Krishnaiah, ed., Handbook of Statistics, Vol. 1. North-Holland, Amsterdam.
Mahé, J. (1974). L'Analyse factorielle des correspondances et son usage en paléontologie et dans l'étude de l'évolution. Bull. Soc. Géologique de France, Sér. 7, 16, 336-340.
Malmgren, B. (1974). Morphometric studies of planktonic foraminifers from the type Danian of southern Scandinavia. Stockholm Contributions in Geology 29, 1-126.
Mardia, K. V. (1970). Measures of multivariate skewness and kurtosis with applications. Biometrika
57, 519-530.
Martin, L. (1960). Homométrie, allométrie et cograduation en biométrie générale. Biomet. Z. 2, 73-97.
Mosimann, J. E. (1970). Size allometry: size and shape variables with characterizations of the
log-normal and generalized gamma distribution. J. Amer. Statist. Assoc. 65, 930-945.
Mosimann, J. E. and Malley, J. D. (1979). Size and shape variables. In: L. Orloci et al., eds., Multivariate Methods in Ecological Work, 175-189. International Co-operative Publishing House, Maryland.
Penrose, L. S. (1954). Distance, size and shape. Ann. Eugenics 18, 337-343.
Pimentel, R. A. (1979). Morphometrics. Kendall-Hunt, Iowa.
Rao, C. R. (1952). Advanced Statistical Methods in Biometric Research. Wiley, New York.
Rao, C. R. (1964). The use and interpretation of principal components analysis in applied research.
Sankhya 26, 329-358.

Rao, C. R. (1966). Discriminant function between composite hypotheses and related problems.
Biometrika 53, 339-345.
Reyment, R. A. (1972). Multivariate normality in morphometric analysis. Math. Geology 3, 357-368.
Reyment, R. A. (1975). Canonical correlation analysis of hemicytherinid and trachyleberinid ostra-
codes in the Niger Delta. Bull. Amer. Paleontology 65 (282) 141-145.
Reyment, R. A. (1976). Chemical components of the environment and Late Campanian microfossil frequencies. Geologiska Föreningens i Stockholm Förhandlingar 98, 322-328.
Reyment, R. A. (1978a). Graphical display of growth-free variation in the Cretaceous benthonic foraminifer Afrobolivina afra. Palaeogeography, Palaeoclimatology, Palaeoecology 25, 267-276.
Reyment, R. A. (1978b). Quantitative biostratigraphical analysis exemplified by Moroccan Cretaceous
ostracods. Micropaleontology 24, 24-43.
Reyment, R. A. (1979). Analyse quantitative des Vascocératidés à carènes. Cahiers de Micropaléontologie 4, 56-64.
Reyment, R. A. (1980). Morphometric Methods in Biostratigraphy. Academic Press, London.
Reyment, R. A. and Banfield, C. F. (1976). Growth-free canonical variates applied to fossil for-
aminifers. Bull. Geological Institutions University of Uppsala 7, 11-21.
Sneath, P. H. A. and Sokal, R. R. (1973). Numerical Taxonomy. Freeman, San Francisco.
Sprent, P. (1968). Linear relationships in growth and size studies. Biometrics 24, 639-656.
Thompson, D'A. W. (1942). On Growth and Form. Cambridge University Press, Cambridge.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 747-771

Multivariate Analysis with Latent Variables*

P. M. Bentler and D. G. Weeks

1. Introduction

Data encountered in statistical practice often represent a mixture of variables based on several levels of measurement (nominal, ordinal, interval, and ratio), but
the two major classes of variables studied with multivariate analysis are clearly
discrete (nominal) and continuous (interval or ratio) (e.g., Anderson, 1958;
Bishop, Fienberg, and Holland, 1975; Dempster, 1969; Goodman, 1978). Distinc-
tions have been made in both domains between manifest and latent variables:
manifest or measured variables are observable realizations of a statistical process,
while latent variables are not observable, even in the population. Although latent
variables are not observable, certain of their effects on manifest variables are
observable and hence subject to study. The range of possible multivariate models
is quite large in view of the fact that both manifest and latent variables can be
discrete or continuous. This chapter will concentrate on models in which both
manifest and latent variables are continuous; this restriction still generates a large
class of models when they are considered simultaneously in several populations,
and when certain variables are considered fixed rather than random.
Historically, latent variable models are most closely identified with latent
structure analysis (Lazarsfeld and Henry, 1968), mental test theory (Lord and
Novick, 1968), and factor analysis (Anderson and Rubin, 1956; Lawley and
Maxwell, 1971). Only the latter topic has any special relevance to this chapter.
The field has progressed far beyond the simple factor analytic approach per se, to
include such diverse areas as extensions of factor analysis to arbitrary covariance
structure models (Bentler, 1976; Jöreskog, 1973; McDonald, 1978), path analysis
(Tukey, 1954; Wright, 1934), simultaneous equation models (Geraci, 1977;
Hausman, 1977; Hsiao, 1976), structural equation models (Aigner and
Goldberger, 1977; Bentler and Weeks, 1980; Goldberger and Duncan, 1973;
Jöreskog, 1977), errors in variables models, including multiple regression
(Bhargava, 1977; Feldstein, 1974; Robinson, 1974), and studies of structural and

*Preparation of this chapter was facilitated by a Research Scientist Development Award (K02-DA00017) and a research grant from the U.S. Public Health Service (DA01070).


functional relations (Anderson, 1976; Gleser, 1979; Robinson, 1977). Although each of these areas of multivariate analysis utilizes concepts of latent variables,
and although each has its own specialized approaches to the statistical problems
involved, only recently has it been noted that certain basic ideas can serve a
unifying function so as to generate a field of multivariate analysis with latent
variables. This chapter focuses on those principles associated with model simplic-
itY and generality as well as large sample consistent, normal, and efficient
estimators of model parameters. Further references to some of the voluminous
literature on theory and applications in the above domains can be found in
Bentler (1980).
In order to coherently define a field of multivariate analysis with latent
variables, certain further limitations are imposed for convenience. We impose the
restriction of linearity and structure in models, that is, we deal with models whose
range of potential parameters, their interrelations, and their relations to variables
are specified such that manifest variables (MVs) and latent variables (LVs) are
linearly related to each other via explicit ('structured') matrix expressions. In view
of the existence of an explicit parameter structure that generates the MVs, the
first and second MV moments become explicitly structured in terms of the model
parameters; hence, the label moment structure models would also be appropriate
to define the field. As a consequence of this limitation, certain very general
models are not considered in this chapter. For example, consider a model of the
form
f_i(y_t, x_t, α_i) = u_{it},   i = 1, ..., j   (1.1)

(Amemiya, 1977). In this nonlinear simultaneous equation system for the tth
observation there are j equations, where y_t is a j-dimensional vector of dependent variables, x_t is a vector of independent variables, α_i is a vector of unknown parameters, and u_{it} is a disturbance whose j-dimensional vector has an independent and identically distributed multivariate normal distribution. It is the nonlin-
earity that excludes the current model from consideration, but some models
described below may be considered as nonlinear in parameters provided this
nonlinearity is explicit (see Jennrich and Ralston, 1978). Among structured linear
models, the statistical problems involved in estimating and testing such models
are furthermore considerably simplified if the assumption is made that the
random variables associated with the models are multivariate normally distrib-
uted. While such an assumption is not essential, as will be shown, it guarantees
that the first and second moments contain the important statistical information
about the data.

1.1. Concept of analysis with latent variables


Multivariate analysis with latent variables generally requires more theoretical
specification than multivariate analysis with manifest variables. Latent variables
are hypothetical constructs invented by a scientist for the purpose of understand-
ing a research area; generally, there exists no operational method for directly

measuring these constructs. The LVs are related to each other in certain ways as
specified by the investigator's theory. When the relations among all LVs and the
relation of all LVs to MVs are specified in mathematical form--here simply a
simultaneous system of highly restricted linear structural equations--one obtains
a model having a certain structural form and certain unknown parameters. The
model purports to explain the statistical properties of the MVs in terms of the
hypothesized LVs. The primary statistical problem is one of optimally estimating
the parameters of the model and determining the goodness-of-fit of the model to
sample data on the MVs. If the model does not acceptably fit the data, the
proposed model is rejected as a possible candidate for the causal structure
underlying the observed variables. If the model cannot be rejected statistically, it
may provide a plausible representation of the causal structure. Since different
models typically generate different observed data, carefully specified competing
models can be compared statistically.
As mentioned above, factor analysis represents the structured linear model
whose latent variable basis has had the longest history, beginning with Spearman
(1904). Although it is often discussed as a data exploration method for finding
important latent variables, its recent development has focused more on hypothesis
testing as described above (Jöreskog, 1969). In both confirmatory and exploratory
modes, however, it remains apparent that the concept of latent variable is a
difficult one to communicate unambiguously. For example, Dempster (1971)
considers linear combinations of MVs as LVs. Such a concept considers LVs as
dependent variables. However, a defining characteristic of LV models is that the
LVs are independent variables with respect to the MVs; that is, MVs are linear
combinations of LVs and not vice-versa. There is a related confusion: although
factor analysis is typically considered to be a prime method for dimension
reduction, in fact it is just the opposite. If the MVs are drawn from a p-variate
distribution, then LV models can be defined by the fact that they describe a
(p + k)-variate distribution (see Bentler, 1982). Although less than p of the LVs
are usually considered important, it is inappropriate to focus only on these LVs.
In factor analysis the k common factors are usually of primary interest, but the p
unique factors are an equally important part of the model.
It should not surprise the reader to hear that the concept of drawing inferences
about (p + k)-variates based on only p MVs has generated a great deal of
controversy across the years. While the MVs are uniquely defined, given the
hypothetical LVs, the reverse can obviously not be true. As a consequence, the
very concept of LV modeling has been questioned. McDonald and Mulaik (1979),
Steiger and Schönemann (1978), Steiger (1979), and Williams (1978) review some
of the issues involved. Two observations provide a positive perspective on the
statistical use of LV multivariate analysis (Bentler, 1980). Although there may be
interpretive ambiguity surrounding the true 'meaning' of a hypothesized LV, it
may be proposed that the statistical evaluation of models would not be affected.
First, although an infinite set of LVs can be constructed under a given LV model
to be consistent with given MVs, the goodness-of-fit of the LV model to data (as
indexed, for example, by a χ² statistic) will be identical under all possible choices

of such LVs. Consequently, the process of evaluating the fit of a model to data
and comparing the relative fit of competing models is not affected by LV
indeterminacy. Thus, theory testing via LV models remains a viable research
strategy in spite of LV indeterminacy. Second, although the problem has been
conceptualized as one of LV indeterminacy, it equally well can be considered one
of model indeterminacy. Bentler (1976) showed how LV and MV models can be
framed in a factor analytic context to yield identical covariance structures; his
proof is obviously relevant to other LV models. While the LVs and MVs have
different properties, there is no empirical way of distinguishing between the
models. Hence, the choice of representation is arbitrary, and the LV model may
be preferred on the basis of LV's simpler theoretical properties.
Models with MVs only can be generated from LV models. For example,
traditional path analysis or simultaneous equation models can be executed with
newer LV models. As a consequence, LV models are applicable to most areas of
multivariate analysis as traditionally conceived, for example, canonical correla-
tion, multivariate analysis of variance, and multivariate regression. While the LV
analogues or generalizations of such methods are only slowly being worked out
(e.g., Bentler, 1976; Jöreskog, 1973; Rock, Werts, and Flaugher, 1978), the LV
approach in general requires more information to implement. For example, LVs
may be related to each other via a traditional multivariate model (such as
canonical correlation), but such a model cannot be evaluated without the addi-
tional imposition of a measurement model that relates MVs to LVs. If the
measurement model is inadequate, the meaning of the LV relations is in doubt.

1.2. Importance of latent variable models


The apparent precision and definiteness associated with traditional multivariate
analysis masks a hidden feature of MV models and methods that makes them less
than desirable in many scientific contexts: their parameters are unlikely to be
invariant over various populations and experimental contexts of interest. For
example, the β weights of simple linear regression--a prototype of most MV models--can be arbitrarily increased or decreased in size by manipulation of the independent variables' reliabilities (or percent random error variance). Thus, questions about the relative importance of β_i or β_j cannot be answered definitively without knowledge of reliability (Cochran, 1970). While MV multivariate
analysis is appropriate to problems of description and prediction, its role in
explanation and causal understanding is somewhat limited. Since MVs only rarely
correspond in a one-to-one fashion with the constructs of scientific interest, such
constructs are best conceived as LVs that are in practice measured with impreci-
sion. Consequently, certain conclusions obtained from an MV model cannot be
relied upon since various theoretical effects will of necessity be estimated in a
biased manner. They will also not replicate in other studies that are identical
except for the level of precision (or error) in the variables. Thus the main virtues
of LV models are their ability to separate error from meaningful effects and the
associated parametric invariance obtainable under various circumstances.

2. Moment structure models: A review

The simplest model to be considered is the classical common factor model

x = μ + Λξ + ε   (2.1)

where x is a (p × 1) random vector of scores on p observed variables, μ is a vector of means, Λ is a (p × k) matrix of structural coefficients (factor loadings), ξ is a (k × 1) random vector of factor scores, and ε is a (p × 1) random vector of residual or unique scores. When the model is written for only a single population, as in (2.1), the means vector μ may be suppressed without loss of generality. It is assumed that E(ξ) = 0 and E(ε) = 0. The ξ and ε represent p + k LVs, as mentioned previously, and x represents the vector of MVs. The covariances of the MVs are given by

Σ = ΛΦΛ' + Ψ   (2.2)

where Φ = E(ξξ'), Ψ = E(εε'), and E(ξε') = 0. A further common assumption is that Ψ is diagonal, i.e., that the unique components of the LVs are uncorrelated. It is apparent that in this simple model there is no structure among the LVs. Rather, they are either independent or simply correlated. The MVs are functions of the LVs, rather than the reverse. Jöreskog (1969) provided the first successful applications of the factor analysis model considered as a hypothesis-testing LV model, based on the ideas of Anderson and Rubin (1956) and Lawley (1940). Multiple correlation with LVs as independent variables follows naturally from the model (see also Lawley and Maxwell, 1973). Bentler (1976) provided a parametrization for (2.1) and (2.2), such that Σ = Z(AΦA' + I)Z with Z² = Ψ. The parameters A, but not Λ, are scale invariant.
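To make the covariance structure (2.2) concrete, the following small Python sketch builds Σ = ΛΦΛ' + Ψ from hypothetical parameter values and checks that data simulated from (2.1) reproduce it; none of the numerical values comes from the chapter.

```python
import numpy as np

# Hypothetical two-factor model for six variables: Sigma = Lambda Phi Lambda' + Psi.
Lambda = np.array([[0.8, 0.0],
                   [0.7, 0.0],
                   [0.6, 0.0],
                   [0.0, 0.9],
                   [0.0, 0.7],
                   [0.0, 0.5]])
Phi = np.array([[1.0, 0.3],
                [0.3, 1.0]])                               # factor covariance matrix
Psi = np.diag([0.36, 0.51, 0.64, 0.19, 0.51, 0.75])        # diagonal unique variances

Sigma = Lambda @ Phi @ Lambda.T + Psi                      # eq. (2.2)

# Data generated as x = Lambda xi + eps reproduce Sigma in expectation.
rng = np.random.default_rng(1)
n = 100_000
xi = rng.multivariate_normal(np.zeros(2), Phi, size=n)
eps = rng.normal(scale=np.sqrt(np.diag(Psi)), size=(n, 6))
x = xi @ Lambda.T + eps
print(np.max(np.abs(np.cov(x, rowvar=False) - Sigma)))     # small for large n
```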
When model (2.1) is generalized to apply to several populations simultaneously, the means of the random vectors become an important issue. A model for factor analysis with structured means (Sörbom, 1974) generalizes confirmatory factor analysis. A random vector of observations for the gth group can be represented as

x^g = ν^g + Λ^g ξ^g + ε^g   (2.3)

with expectations E(ξ^g) = θ^g and E(x^g) = μ^g = ν^g + Λ^g θ^g, and covariance matrix Σ^g = Λ^g Φ^g Λ^g' + Ψ^g. The group covariance matrices thus have confirmatory factor analytic representations, with factor loading parameters Λ^g, factor intercorrelations or covariances Φ^g, and unique covariance matrices Ψ^g. It is generally necessary to impose constraints on parameters across groups to achieve an identified model, e.g., Λ^g = Λ for all g.
Similar models were considered by Jöreskog (1971) and Please (1973) in the
context of simultaneous factor analysis in several populations. In a model such as
(2.3), there is an interdependence of first and second moment parameters. The
MVs' means are decomposed into basic parameters that may also affect the

covariance structure. These models are particularly appropriate to studies of multiple populations or groups of subjects, and to the analysis of experimental data.
Jöreskog (1970) proposed a model for the analysis of covariance structures that may be considered as a confirmatory second-order factor analytic model that allows structured means. The model can be developed from the point of view of a random vector model

x = μ + BΛξ + Bζ + ε,   (2.4)

or as a model for n observations on p variables with data matrix X (Jöreskog, 1973). The variables have expectation E(X) = AΞP, where Ξ is a parameter matrix and A and P are known matrices, and have the covariance structure

Σ = B(ΛΦΛ' + Ψ²)B' + Θ².   (2.5)


The covariance structure thus decomposes the factor intercorrelation matrix of
first-order factors by a further confirmatory factor analytic structure. The model
is extremely general, containing such special cases as the models of Bock and
Bargmann (1966) and Wiley, Schmidt, and Bramble (1973), as well as the MIMIC
model of Jrreskog and Goldberger (1975) and numerous other models such as
MANOVA or patterned covariance matrices (Browne, 1977). This model intro-
duces the idea of higher order LVs, which was first explicated by Thurstone
(1947). In such a model some LVs have no 'direct' effect on MVs. Model (2.4)
allows two levels of LVs for the common factors, namely ~ and ~, as well as
unique LVs e as in model (2.1). These LVs may be interpreted as follows. The
unique factors e and the common factors ~ are seen to affect the MVs directly, via
the parameter matrix B in the latter case. However, the common factors ~ affect
the MVs indirectly via the product of parameter matrices BA. As before, there are
more LVs than MVs. As a covariance structure model, the model has had several
applications (Jrreskog, 1978), but it has seen few applications as a model with
interdependent first and second moment parameters.
A model developed by Bentler (1976) can be written for the gth population (g = 1, ..., m) as

x^g = μ^g + Σ_{j=1}^{k} (Π_{i=1}^{j} Λ_i^g) ξ_j^g,   (2.6)

where the notation Π_{i=1}^{j} Λ_i^g refers to the matrix product Λ_1^g Λ_2^g ··· Λ_j^g, and the (p × n) random matrix x^g of n observations on p variables has expectation

E(x^g) = μ^g = T^g Ξ^g U^g + V^g Ω^g W^g.   (2.7)

If n = 1, the model can be considered to be a random vector model with random vectors ξ_j^g; frequently these vectors are unobserved LVs such as factors. The parameters of the model are T^g, Ξ^g, Ω^g and the matrices Λ_i^g, while U^g, V^g, and W^g are known constant matrices. In some important applications the T^g can be written as functions of the Λ_i^g. The columns of x^g are independently distributed with covariance matrix Σ^g. For simplicity it may be assumed that the ξ_j^g have covariance matrices Φ_j^g and are independent of ξ_{j'}^g, where j ≠ j'. It follows that

Σ^g = Σ_{j=1}^{k} (Π_{i=1}^{j} Λ_i^g) Φ_j^g (Π_{i=1}^{j} Λ_i^g)'.   (2.8)

It is apparent that this model introduces LVs of arbitrarily high order, while allowing for an interdependence between first and second moment parameters. Alternatively, in the case of a single population one may write model (2.8) as Σ = Λ_1 Λ_2 ··· Λ_k Φ Λ_k' ··· Λ_2' Λ_1', where Φ is block-diagonal containing all of the Φ_j matrices in (2.8) (McDonald, 1978).
It is possible to write Tucker's (1966) generalization of principal components to three 'modes' of measurement via the random vector form as x = (A⊗B)Γξ, where x is a (pq × 1) vector of observations; A, B, and Γ are parameter matrices of order (p × a), (q × b), and (ab × c) respectively, and ξ is of order (c × 1). The notation (A⊗B) refers to the right Kronecker product of matrices, (A⊗B) = [a_{ij}B]. Bentler and Lee (1979a) have considered an extended factor analytic version of this model as

x = (A⊗B)Γξ + ζ   (2.9)

where ζ is a vector of unique factors, and they developed statistical properties of the model in both exploratory and confirmatory contexts. The covariance matrix of (2.9) is given by

Σ = (A⊗B)ΓΦΓ'(A'⊗B') + Z²,   (2.10)

where E(ξξ') = Φ, E(ζζ') = Z², and E(ξζ') = 0. A more specialized version of (2.10) is described by Bentler and Lee (1978a). These models are applicable to situations in which MVs are completely crossed, as in an experimental design, and the LVs ξ and ζ are presumed to have the characteristics of the ordinary factor model (2.1). Krishnaiah and Lee (1974) studied the more general Kronecker covariance structure Σ = (G_1⊗Σ_1) + ··· + (G_k⊗Σ_k), where the G_i are known matrices. This model has applications in testing block-sphericity and block-intraclass correlation, but structured latent random variable interpretations of this model have not yet been investigated.
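A short Python sketch of the Kronecker-structured covariance (2.10) follows; the orders and parameter values are arbitrary assumptions chosen only to show how the structure is assembled.

```python
import numpy as np

rng = np.random.default_rng(2)
p, a, q, b, c = 3, 2, 4, 2, 3                      # hypothetical orders
A = rng.normal(size=(p, a))
B = rng.normal(size=(q, b))
Gamma = rng.normal(size=(a * b, c))
Phi = np.eye(c)                                    # E(xi xi')
Z2 = np.diag(rng.uniform(0.2, 1.0, size=p * q))    # unique variances E(zeta zeta')

AB = np.kron(A, B)                                 # right Kronecker product [a_ij B]
Sigma = AB @ Gamma @ Phi @ Gamma.T @ AB.T + Z2     # eq. (2.10), of order (pq x pq)
print(Sigma.shape)
```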
An extremely powerful model growing out of the econometric literature is the structural equation system with a measurement structure developed by Keesling (1972), Wiley (1973), and Jöreskog (1977). In this model, there are two sets of observed random variables having the measurement structure

x = Λ_x ξ + δ   and   y = Λ_y η + ε,   (2.11)

where ξ and η are latent random variables and δ and ε are vectors of error of measurement that are independent of each other and the LVs. All vectors have as expectations the null vector. The measurement model (2.11) is obviously a factor-analytic type of model, but the latent variables are furthermore related by a linear structural matrix equation

η = B*η + Γξ + ζ,   (2.12)

where E(ξζ') = 0 and B = (I - B*) is nonsingular. It follows that Bη = Γξ + ζ, which is the form of simultaneous equations preferred by Jöreskog (1977). The matrices B* and Γ contain structural coefficients for predicting η's from other η and from ξ LVs. Consequently,

y = Λ_y B^{-1}(Γξ + ζ) + ε.   (2.13)

The covariance matrices of the observed variables are given by

Σ_xx = Λ_x Φ Λ_x' + Θ_δ,   Σ_yx = Λ_y B^{-1} Γ Φ Λ_x'

and   (2.14)

Σ_yy = Λ_y B^{-1} (Γ Φ Γ' + Ψ) B'^{-1} Λ_y' + Θ_ε,

where Φ = E(ξξ'), Ψ = E(ζζ'), Θ_δ = E(δδ'), and Θ_ε = E(εε'). Models of a similar nature have been considered by Hausman (1977), Hsiao (1976), Geraci (1977), Robinson (1977), and Wold (1980), but the Jöreskog-Keesling-Wiley model, also known as LISREL, has received the widest attention and application.
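The following Python sketch assembles the covariance structure (2.14) from a small, entirely hypothetical set of LISREL-type parameter matrices (two ξ's, two η's, four indicators of each); it is intended only to make the matrix algebra explicit.

```python
import numpy as np

Lam_y = np.array([[1.0, 0.0], [0.8, 0.0], [0.0, 1.0], [0.0, 0.7]])   # loadings of y on eta
Lam_x = np.array([[1.0, 0.0], [0.9, 0.0], [0.0, 1.0], [0.0, 0.6]])   # loadings of x on xi
Bstar = np.array([[0.0, 0.0], [0.4, 0.0]])                           # eta_2 regressed on eta_1
Gamma = np.array([[0.5, 0.2], [0.0, 0.3]])                           # eta regressed on xi
Phi   = np.array([[1.0, 0.4], [0.4, 1.0]])                           # cov of xi
Psi   = np.diag([0.5, 0.4])                                          # cov of zeta
Th_d  = np.diag([0.3, 0.3, 0.3, 0.3])                                # cov of delta
Th_e  = np.diag([0.2, 0.2, 0.2, 0.2])                                # cov of epsilon

Binv = np.linalg.inv(np.eye(2) - Bstar)                              # B = I - B*
Sig_xx = Lam_x @ Phi @ Lam_x.T + Th_d
Sig_yx = Lam_y @ Binv @ Gamma @ Phi @ Lam_x.T
Sig_yy = Lam_y @ Binv @ (Gamma @ Phi @ Gamma.T + Psi) @ Binv.T @ Lam_y.T + Th_e
Sigma = np.block([[Sig_yy, Sig_yx], [Sig_yx.T, Sig_xx]])             # eq. (2.14)
print(Sigma.shape)                                                   # (8, 8)
```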
The model represented by (2.11)-(2.14) is a generalization of econometric simultaneous equation models. When variables have no measurement
structure (2.11), the simultaneous equation system (2.12) can generate path
analytic models, multivariate regression, and a variety of other MV models.
Among such models are recursive and nonrecursive structures. Somewhat para-
doxically, nonrecursive structures are those that allow true 'simultaneous' or
reciprocal causation between variables, while recursive structures do not. Recur-
sivity is indicated by a triangular form for the parameters B* in (2.12). Recursive
structures have been favored as easier to interpret causally (Strotz and Wold,
1960), and they are less difficult to estimate. Models with a measurement
structure (2.11) allow recursivity or nonrecursivity at the level of LVs.
When the covariates in analysis of covariance (ANCOVA) are fallible--the
usual case--it is well known that ANCOVA does not make an accurate adjust-
ment for the effects of the covariate (e.g., Lord, 1960). A procedure for analysis of
covariance with a measurement model for the observed variables was developed
by Sörbom (1978). This model is a multiple-population structural equation model with measurement error, with

x^g = ν_x + Λ_x ξ^g + δ^g,   y^g = ν_y + Λ_y η^g + ε^g,   (2.15)

where common factor LVs are independent of error of measurement LVs. The vector of criterion variables in the gth group is y^g, and x^g is the vector of covariates. The latent variables η^g and ξ^g are related by

η^g = α^g + Γ^g ξ^g + ζ^g   (2.16)

where α^g provides the treatment effect. Consequently,

y^g = ν_y + Λ_y α^g + Λ_y (Γ^g ξ^g + ζ^g) + ε^g.   (2.17)

The expected values of the latent variables are E(ξ^g) = μ_ξ^g and E(η^g) = μ_η^g. Consequently one obtains the expectations E(x^g) = μ_x^g = ν_x + Λ_x μ_ξ^g and E(y^g) = μ_y^g = ν_y + Λ_y μ_η^g. Then, rewriting (2.15) and (2.17) one obtains

x^g = μ_x^g + Λ_x ξ^{g*} + δ^g,   (2.18)

y^g = μ_y^g + Λ_y (Γ^g ξ^{g*} + ζ^g) + ε^g,

where ξ^g = μ_ξ^g + ξ^{g*} and the expected values of ξ^{g*}, ζ^g, ε^g, and δ^g are null vectors. The covariances of (2.18) are taken to be

Σ_xx^g = Λ_x Φ^g Λ_x' + Θ_δδ^g,   Σ_xy^g = Λ_x Φ^g Γ^g' Λ_y' + Θ_δε^g

and   (2.19)

Σ_yy^g = Λ_y (Γ^g Φ^g Γ^g' + Ψ^g) Λ_y' + Θ_εε^g,

where the covariance matrices of the ξ^{g*} and ζ^g are given by Φ^g and Ψ^g, and where the various Θ^g matrices represent covariances among the errors.
The structural equation models described above have been conceptualized as
limited in their application to situations involving latent structural relations in
which the MVs are related to LVs by a first-order factor analytic model. Causal
relations involving higher-order factors, such as 'general intelligence', have not
been considered. Weeks (1978) has developed a comprehensive model that
overcomes this limitation. The measurement model for the gth population is given
by
x = μ_x + Λ_x ξ   and   y = μ_y + Λ_y η   (2.20)

where the superscript g for the population has been omitted for simplicity of
notation. The components of the measurement model (2.20) are of the form (2.6)
and (2.7), but they are written in supermatrix form. For example, Λ_x = [Λ_x^1 ··· Λ_x^k, Λ_x^1 ··· Λ_x^{k-1}, ..., Λ_x^1] and, similarly, ξ' = [ξ^k', ξ^{k-1}', ..., ξ^1'], where the superscripts 1, ..., k represent measurement levels. The lowest level corresponds to unique factors, which may be correlated; levels higher than k = 3, allowing
second-order common factor LVs, will not often be needed. The latent variables
at all levels are related by a structural matrix equation for the g th population
(superscript omitted)

T_y η = α + B_r (T_y η) + Γ (T_x ξ) + ζ   (2.21)

where E(ξ) = μ_ξ, E(η) = μ_η, E(ζ) = 0, α is a regression intercept vector, and ζ is a multivariate residual. The multivariate relation (2.21) is specified such that if T_y = I and T_x = I, the equations relate the variables η and ξ, which may be considered to be latent factors orthogonalized across factor levels. On the other hand, if T_y and/or T_x are specified in terms of various Λ_y^j and Λ_x^j matrices, (2.21) involves regression among the corresponding primary factors. That is, T_x is structured such that [T_x ξ]' = [π^k', ..., π^j', ..., π^1'], where π^j is either an orthogonalized factor ξ^j or else a primary factor τ^j which can be expressed by a factor model such as τ^2 = Λ^3 ξ^3 + ξ^2. The matrices B_r and Γ represent coefficients, as before, but in most instances B_r will be block-diagonal (thus not allowing cross-level regressions among the η's). The covariance matrix generated by (2.20) and (2.21) is represented by

Σ_xx = Λ_x Φ Λ_x',   Σ_yx = Λ_y T_y^{-1} B^{-1} Γ T_x Φ Λ_x'

and   (2.22)

Σ_yy = Λ_y T_y^{-1} B^{-1} (Γ T_x Φ T_x' Γ' + Ψ) B'^{-1} T_y'^{-1} Λ_y'

where E((ξ - μ_ξ)(ξ - μ_ξ)') = Φ, E(ζζ') = Ψ, B = (I - B_r), and where Φ and Ψ are typically block-diagonal. Although (2.22) has a relatively simple representation due to the supermatrix notation, quite complex models are subsumed by it. It may be noted that (2.22) is similar in form to the Jöreskog-Keesling-Wiley structure
(2.19), but the matrices involved are supermatrices and one has the flexibility of
using primary or multilevel orthogonalized factors in structural relations. See
Bentler (1982) for a further discussion and Weeks (1980) for an application of
higher-order LVs, and Bentler and Weeks (1979) for algebraic analyses that
evaluate the generality and specialization possible among models (2.1)-(2.22).
It is apparent that a variety of LV models exist, and that the study of
higher-level LVs and more complex causal structures has typically been associated
with increasingly complex mathematical representations. It now appears that
arbitrarily complex models can be handled by very simple representations, based
on the idea of classifying all variables, including MVs and LVs, into independent
or dependent sets. As a consequence, a coherent field of multivariate analysis with
latent variables can be developed, based on linear representations that are not
more complex than those of traditional multivariate analysis.

3. A simple general model

We shall develop a complete linear relations model for LV multivariate analysis by considering separately a structural equation model and a selection model, then
combining these parts into a single model. It has been shown that this model is
capable of representing all of the models discussed in the preceding section
(Bentler and Weeks, 1979, 1980).

3.1. The structural equation model


Consider the structural equation model

η = β₀η + γξ,   (3.1)

where η is an (m × 1) random vector of dependent variables, ξ is an (n × 1) random vector of independent variables, and where β₀ and γ are (m × m) and (m × n) parameter matrices governing the linear relations of all variables involved in the m structural equations. The parameters in γ represent weights for predicting dependent from independent variables, while the parameters in β₀ represent weights for predicting dependent variables from each other. Typically, but not necessarily, the diagonal of β₀ contains known zero weights. Letting β = (I - β₀), (3.1) yields the form βη = γξ. In general, the coefficients of (3.1) consist of (a) fixed parameters that are assigned given values, usually zero; (b) free parameters that are unknown; and (c) constrained parameters, such that for constraint constants w_i and w_j and any parameters θ_i and θ_j, w_i θ_i = w_j θ_j. These constraints, taken from Bentler and Weeks (1978), are more general than those found in Jöreskog (1977) but more restricted than those of Robinson (1977).
Eq. (3.1) is similar to (2.12), but it omits the residual variates ζ. This difference is extended in the very conceptualization of the random variables η and ξ. In the Jöreskog-Keesling-Wiley model, the simultaneous eq. (2.12) relates only latent variables. Eq. (3.1), on the other hand, relates all variables within the theoretical linear system under consideration, whether manifest or latent. Each variable in the system is categorized as belonging either to the vector η or ξ: it is included as one of the m dependent variables in the system if that variable is ever considered to be a dependent variable in any structural equation, and it is considered as one of the n independent variables in the system otherwise. Independent or nondependent variables are explanatory predictor variables that may be nonorthogonal. The vector η consists of all manifest variables of the sort described in (2.11), namely those variables that are presumed to have a factor analytic decomposition. In addition, η contains those latent variables or unmeasured (but measurable) variables that are themselves linear functions of other variables, whether manifest or latent. As a consequence, η contains 'primary' common factors of any level that are decomposed into higher-order and residual, orthogonalized factors. Thus, we might define η' = [y', τ'] where the random vector y represents MVs that are dependent variables and τ represents all other LV dependent variables in the

system. Obviously, the vector η represents more than the 'endogenous' variables discussed in econometrics, and β₀ represents all coefficients for structural relations among dependent variables, including the coefficients governing the relation of lower-order factors to higher-order factors, excepting those residuals and the highest-order factors that are never dependent variables in any equation. The vector ξ contains those MVs and LVs that are not functions of other manifest or latent variables, and typically it will consist of three types of variables, ξ' = [x', ζ', ε'], namely, the random vector x of MVs that are 'exogenous' variables as conceived in econometrics, residual LV variables ζ or orthogonalized factors, and errors of measurement or unique LV factors ε. Note that in a complete LV model, where every MV is decomposed into latent factors, there will be no 'x' variables. While the conceptualization of residual variables and errors of measurement as independent variables in a system is a novel one, particularly because these variables are rarely if ever under experimental control, this categorization of variables provides a model of great flexibility. In this approach, since γ represents the structural coefficients for the effects of all independent variables, the coefficients for residual and error independent variables are typically known, having fixed unit values.

3.2. The selection model


Since (3.1) includes measured and unmeasured variables, it is desirable to provide an explicit representation for the relation between the variables in (3.1) and the measured variables. We shall assume that this relation is given by

y = μ_y + G_y η   and   x = μ_x + G_x ξ   (3.2)

where G_x and G_y are known matrices with zero entries except for a single unit in each row to select y from η and x from ξ. For definiteness we shall assume that there are p observed dependent variables and q observed independent variables. Vectors μ_y (p × 1) and μ_x (q × 1) are vectors of means. Letting z' = [y', x'], the selection model (3.2) can be written more compactly as

z = μ + Gv   (3.3)

where μ' = [μ_y', μ_x'], v' = [η', ξ'], and G is a 2 × 2 supermatrix containing the rows [G_y, 0], [0, G_x].

3.3. The complete model


We assume that the expected values of (3.3) are given by

E(z) = μ + GTZU,   (3.4)

where E(v) = μ_v = TZU, with T and Z being parameter matrices of fixed, free, or constrained elements and with U being a known vector. The use of means that are

structured in terms of other parameters is useful in several applications (e.g., Jöreskog, 1973; Sörbom, 1978), but this topic will not be pursued here. Combining (3.1) with (3.2) yields the resulting expression y = μ_y + G_y β^{-1} γ ξ, where β = (I - β₀) is assumed to be nonsingular. The covariance matrix of the MVs is thus given by the matrix elements

Σ_yy = G_y β^{-1} γ Φ γ' β'^{-1} G_y',   (3.5)
Σ_yx = G_y β^{-1} γ Φ G_x'   and   Σ_xx = G_x Φ G_x',

where Φ is the covariance matrix of the independent variables ξ. Eq. (3.5) may be more simply represented as

Σ = G(I - B₀)^{-1} Γ Φ Γ' (I - B₀)'^{-1} G' = G B^{-1} Γ Φ Γ' B'^{-1} G',   (3.6)

where Γ' = [γ', I], B₀ has rows [β₀, 0] and [0, 0], and B = I - B₀. The orders of the matrices in (3.6) are given by G (r × s), B (s × s), Γ (s × n), and Φ (n × n), where r = p + q and s = m + n.
In general, a model of the form (3.1)-(3.6) can be formulated for each of several populations, and the equality of parameters across populations can be evaluated. Such a multiple-population model is relevant, for example, to factor analysis in several populations (e.g., Sörbom, 1974) or to the analysis of covariance with latent variables (Sörbom, 1978), but these developments will not be pursued here. We concentrate on a single population with the structure (3.4) and (3.6), with μ_v = 0.
It is possible to drop the explicit distinction between dependent and independent variables (Bentler and Weeks, 1979). All structural relations would be represented in β₀, and all variables with a null row in β₀ would be independent variables. The matrix Φ will now be of order equal to the number of independent plus dependent variables. The rows and columns (including diagonal elements) of Φ corresponding to dependent variables will be fixed at zero. The model is simpler in terms of number of matrices:

Σ = G B^{-1} Φ B'^{-1} G'.   (3.7)

It can be obtained from (3.6) by setting Γ = I.
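As an illustration of how a familiar special case is expressed in this representation, the Python sketch below casts a one-factor, three-indicator model in the form (3.6) and verifies that it reproduces the usual factor-analytic covariance ΛΦΛ' + Ψ; all numerical values are hypothetical.

```python
import numpy as np

# Dependent variables eta = (y1, y2, y3); independent variables
# xi = (common factor, e1, e2, e3).
lam = np.array([[0.9], [0.8], [0.7]])       # factor loadings
phi = np.array([[1.0]])                     # factor variance
Psi = np.diag([0.19, 0.36, 0.51])           # unique variances

m, n = 3, 4                                 # numbers of dependent / independent variables
beta0 = np.zeros((m, m))                    # no regressions among dependent variables
gamma = np.hstack([lam, np.eye(m)])         # unit weights attach the unique factors

G = np.hstack([np.eye(m), np.zeros((m, n))])    # selects the MVs from (eta, xi)
B0 = np.zeros((m + n, m + n)); B0[:m, :m] = beta0
B = np.eye(m + n) - B0
Gamma = np.vstack([gamma, np.eye(n)])           # so that Gamma' = [gamma', I]
Phi = np.zeros((n, n)); Phi[:1, :1] = phi; Phi[1:, 1:] = Psi

Binv = np.linalg.inv(B)
Sigma = G @ Binv @ Gamma @ Phi @ Gamma.T @ Binv.T @ G.T     # eq. (3.6)
print(np.allclose(Sigma, lam @ phi @ lam.T + Psi))          # True
```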

3.4. Comparison with alternative models


It is easy to demonstrate that model (3.1)-(3.6) incorporates the seemingly
more complex model (2.20)-(2.22) developed by Weeks (1978). First, it should be
noted that the measurement model (2.20) amounts to a nested series of linear
equations. That is, letting μ = 0 with k = 3, (2.20) yields x = Λ₁Λ₂Λ₃ξ₃ + Λ₁Λ₂ξ₂ + Λ₁ξ₁. But this structure can be generated by the equations x = τ₀ = Λ₁τ₁ + 0, τ₁ = Λ₂τ₂ + ξ₁, and τ₂ = Λ₃ξ₃ + ξ₂; thus, it is possible to reparameterize Weeks' measurement model to yield simple linear equations in which all variables can be

classified as independent or dependent. Next, it should be noted that (2.21) represents a linear structure as well, where the T_x and T_y matrices simply serve to
redefine, if desired, the variables in a given structural equation. In particular,
these matrices allow a choice among primary or residual factors in the structure
(2.21), which translates in the current model into a given definition for indepen-
dent and dependent variables and their relationship via structural equations.
Obviously the proposed model has a simpler mathematical form.
Although the basic definitions underlying (3.1)-(3.6) are radically different
from those of the Jöreskog-Keesling-Wiley model (2.11)-(2.14), it may be shown that when the current conceptualization is adopted under their mathematical structure, i.e., ignoring the explicit focus on a first-order factor analytic measurement model in (2.11) as well as their definitions of variables, the present model covariance structure (3.5) (but not the means structure (3.4)) is obtained. That is, if one takes G_y = Λ_y, β = B, γ = Γ, G_x = Λ_x, Ψ = 0, Θ_δ = 0, and Θ_ε = 0 in (2.14),
one obtains model (3.5). Thus it is possible to use the model (2.14) to obtain
applications that were not intended, such as linear structural relations among
higher-order factors. However, model (3.5) contains only three matrices with
unknown parameters while (2.14) contains eight. Consequently, the mathematics
involved in application are simpler in the current representation, and the model is
easier to communicate.
The Geraci (1977) model is of the form (2.11)-(2.12), with Λ_y = I, Θ_δ = 0, and Λ_x = I. Consequently, it is not obvious how it could be reconceptualized to yield (3.5). Similarly, Robinson's (1977) model is of the form (2.11)-(2.12), with Λ_x = I, Λ_y = I, and ζ = 0. Although this model allows nonlinear constraints on parameters in the linear relation (2.12), it does not seem to be able to be redefined so as to yield (3.5) as conceptualized here. The problem of imposing arbitrary constraints on parameters in LV models has been addressed by Lee and Bentler (1980), and is discussed further below. Krishnaiah and Lee (1974) studied covariance structures of the form Σ = U₁Σ₁U₁' + ··· + U_kΣ_kU_k', where the U_i are known matrices and the Σ_i are unknown. This structure, which arises in cases such as the multivariate components of variance model, can be obtained from (3.6) by setting G = I, B = I, Γ = [U₁, ..., U_k], and Φ as block-diagonal with elements Σ_i (see also Rao, 1971; Rao and Kleffe, 1980).

4. Parameter identification

LV models cannot be statistically tested without an evaluation of the identification problem. Identifiability depends on the choice of mathematical representa-
tion as well as the particular specification in a given application, and it refers to
the uniqueness of the parameters underlying the distribution of MVs. A variety of
general theoretical studies of identification have been made in recent years
(Deistler and Seifert, 1978; Geraci, 1976; Monfort, 1978), but these studies are
not very helpful to the applied researcher. While there exist known conditions
that an observable process must satisfy in order to yield almost sure consistent

estimability, it remains possible to find examples showing that identifiability does not necessarily imply the existence of a consistent estimator (Gabrielsen, 1978)
and that multiple solutions are possible for locally identified models (Fink and
Mabee, 1978). In practice, then, it may be necessary to use empirical means to
evaluate the situation: Jöreskog and Sörbom (1978) propose that a positive
definite information matrix almost certainly implies identification, and
McDonald and Krane (1977) state that parameters are unambiguously locally
identified if the Hessian is nonsingular. Such a pragmatic stance implies that
identification is, in practice, the handmaiden of estimation, a point of view also
taken by Chechile (1977) from the Bayesian perspective. Although such an
empirical stance may be adopted to evaluate complex models, it is theoretically
inadequate. Identification is a problem of the population, independent of sam-
pling considerations. Thus, data-based evaluations of identification may be
incorrect, as shown in Monte Carlo work by McDonald and Krane (1979) who
retract their earlier claims on unambiguous identifiability.
In the case of model (3.1)-(3.6), identifiability depends on the choice of
specification in a given application and refers to the uniqueness of parameters
underlying the distribution of observed variables, specifically, the second mo-
ments (3.6) when U of (3.4) is null and there is no interdependence of means and
covariance parameters. The moment structure model (3.6) must be specified with
known values in the parameter matrices B0, F, and q~ such that a knowledge of 2J
allows a unique inference about the unknown elements in these matrices. How-
ever, it is obvious that it is possible to rewrite (3.6) using nonsingular transforma-
tion matrices Tl, T2, and T3 as

Z= G*B*-IF*~*F*'B *' 1G*' (4.1)

where G*=GT1, B*-I=T~XB 1T2, F*=T21FT3, and ~*----T31!~T~-1. The


parameters of the model are identified when the only nonsingular transformation
matrices that allow (3.6) and (4.1) to be true simultaneously are identity matrices
of the appropriate order. A necessary condition for this to occur is that the
number of unknown parameters of the model is less than the number of different
elements in the variance-covariance matrix Σ, but even the well-known rank and
order conditions and their generalization (Monfort, 1978) do not provide a
simple, practicable method for evaluating identification in the various special
cases that might be entertained under the general model. Even the simplest special
cases, such as factor analysis models, are only recently becoming understood
(Algina, 1980).

5. Estimation and testing: Statistical basis

The statistical theory involved in multivariate LV models exists in rudimentary


form. Only large sample theory has been developed to any extent, and the
relevance of this theory to small samples has not been established. Although the

statistical theory associated with LV models based on multinormally distributed


MVs already existed (cf. Anderson and Rubin, 1956), Jöreskog (1967, 1973, 1977)
must be given credit for establishing that maximum likelihood (ML) estimation
could be practically applied to LV models. While various researchers were
studying specialized statistical problems and searching for estimators that might
be easy to implement, Jöreskog showed that complex models could be estimated
by difficult ML methods based on a standard covariance structure approach. The
most general alternative approach to estimation in LV and other models was
developed by Browne (1974). Building upon the work of Jöreskog and Goldberger
(1972) and Anderson (1973), who had developed generalized least squares (GLS)
estimators for the factor analytic model and for linear covariance structures,
Browne showed that a class of GLS estimators could be developed that have
many of the same asymptotic properties as ML estimators, i.e., consistency,
normality, and efficiency. He also developed the associated goodness of fit tests.
Lee (1977) showed that ML and GLS estimators are asymptotically equal. Swain
(1975), Geraci (1977) and Robinson (1977) introduced additional estimators with
optimal asymptotic properties. Some of the Geraci and Robinson estimators can
be easier to compute than ML or GLS estimators; the Browne and Robinson
estimators do not necessarily require multivariate normality of the MVs to yield
their minimal sampling variances. Unfortunately, the empirical meaning of
loosening the normality assumption is open to question, since simple procedures
for evaluating the less restrictive assumption (that fourth-order cumulants of the
distribution of the variables are zero) do not appear to be available. Although
certain GLS estimators are somewhat easier to compute than ML estimators,
there is some evidence that they may be more biased than ML estimators
(Jöreskog and Goldberger, 1972; Browne, 1974). Virtually nothing is known
about the relative robustness of these estimators to violation of assumptions or
about their relative small sample properties.
We now summarize certain asymptotic statistical theorems for multivariate
analysis with latent variables. The basic properties for unconstrained ML estima-
tors are well known; Browne (1974) and Lee (1977) developed parallel large
sample properties for GLS estimators. Let Σ₀ = Σ(θ₀) be a p by p population
covariance matrix whose elements are differentiable real-valued functions of a
true though unknown (q × 1) vector of parameters θ₀. Let S represent the sample
covariance matrix obtained from a random sample of size N = n + 1 from a
multivariate normal population with mean vector μ and covariance matrix Σ₀. We
regard the vector θ as mathematical variables and Σ = Σ(θ) as a matrix function
of θ.
The generalized least squares estimators, provided they exist, minimize the
function

Q(θ) = tr{[(S − Σ)W]²}/2                                     (5.1)


where the weight matrix W is either a positive definite matrix or a stochastic
matrix possibly depending on S which converges in probability to a positive

definite matrix as N tends to infinity. In most applications W is chosen so that it


converges to Σ⁻¹ in probability, e.g., W = S⁻¹. It was proven by Browne (1974)
and Lee (1977) that the estimator θ̂ that minimizes (5.1) based on a W that
converges to Σ⁻¹ possesses the following asymptotic properties:
(a) it is consistent;
(b) it is asymptotically equivalent to the maximum likelihood estimator;
(c) it is a 'best generalized least-squares' estimator in the sense that for any
other generalized least-squares estimator θ⁺, cov(θ⁺) − cov(θ̂) is positive semidef-
inite;
(d) its asymptotic distribution is multivariate normal with mean vector θ₀ and
covariance matrix 2n⁻¹[(∂Σ₀/∂θ)(Σ₀⁻¹ ⊗ Σ₀⁻¹)(∂Σ₀/∂θ)']⁻¹;
(e) the asymptotic distribution of nQ(θ̂) is chi-square with degrees of freedom
equal to p(p + 1)/2 − q.
The latter property enables one to test the hypothesis that Σ₀ = Σ(θ₀) against
the alternative that Σ₀ is any symmetric positive definite matrix.
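As a concrete illustration, the following minimal numpy sketch evaluates the GLS discrepancy (5.1) for a user-supplied model function. The name sigma_of_theta is a hypothetical stand-in for whatever covariance structure (3.6) takes in a given application, and W defaults to S⁻¹ as discussed above.

```python
import numpy as np

def gls_discrepancy(theta, S, sigma_of_theta, W=None):
    """GLS fit function (5.1): Q(theta) = tr{[(S - Sigma(theta)) W]^2} / 2."""
    Sigma = sigma_of_theta(theta)      # model-implied covariance matrix
    if W is None:
        W = np.linalg.inv(S)           # W = S^{-1}, which converges to Sigma^{-1}
    R = (S - Sigma) @ W
    return 0.5 * np.trace(R @ R)
```

Property (e) then supplies the goodness-of-fit statistic: nQ(θ̂) is referred to a chi-square distribution with p(p + 1)/2 − q degrees of freedom.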
More general statistical results deal with models whose parameters are subject
to arbitrary constraints. Although Aitchison and Silvey (1958) provided the
statistical basis for obtaining and evaluating constrained ML estimates, their
methods were not applied to moment structure models or to LV models until the
work of Lee and Bentler (1980). Their GLS results presented below extend
Aitchison and Silvey's ML results. Of course models without constraints on
parameters can be evaluated as a special case. Suppose θ₀ satisfies r ≤ q functional
relationships h(θ₀) = [h₁(θ₀),...,h_r(θ₀)] = 0 where h₁,...,h_r are real-valued inde-
pendent differentiable functions. We define the constrained GLS estimator θ̂ of θ₀
as the vector which satisfies h(θ̂) = 0 and minimizes Q(θ). It follows from the
first order necessary condition that if θ̂ exists, there exists a vector λ̂' = (λ̂₁,...,λ̂_r)
of Lagrange multipliers such that

Q̇(θ̂) + L̂'λ̂ = 0,    h(θ̂) = 0,                                 (5.2)

where Q̇ = (∂Q/∂θ) is the gradient vector of Q(θ), L = (∂h/∂θ) is an r × q


matrix of partial derivatives, and L̂ = L(θ̂). Such a definition is paralleled with
constrained ML estimators. The constrained ML estimator θ̃ of θ₀ is defined as
the vector which satisfies h(θ̃) = 0 and minimizes the function

F(θ) = log det(Σ) + tr(SΣ⁻¹) − log det(S) − p.                (5.3)

Similarly, from the first order necessary condition, if a θ̃ exists, there corresponds
a vector of Lagrange multipliers λ̃' = (λ̃₁,...,λ̃_r) such that

Ḟ(θ̃) + L̃'λ̃ = 0,    h(θ̃) = 0                                  (5.4)

where Ḟ = (∂F/∂θ) is the gradient of F(θ) and L̃ = L(θ̃).
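For comparison, the ML fit function (5.3) can be coded in the same style; sigma_of_theta is again the hypothetical model function used in the previous sketch.

```python
import numpy as np

def ml_discrepancy(theta, S, sigma_of_theta):
    """ML fit function (5.3): log det(Sigma) + tr(S Sigma^{-1}) - log det(S) - p."""
    Sigma = sigma_of_theta(theta)
    p = S.shape[0]
    _, logdet_Sigma = np.linalg.slogdet(Sigma)
    _, logdet_S = np.linalg.slogdet(S)
    return logdet_Sigma + np.trace(S @ np.linalg.inv(Sigma)) - logdet_S - p
```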


Lee and Bentler (1980) assume a number of standard regularity conditions that
are typically satisfied in practice in order to obtain their results. Major use is

made of the information matrix 2⁻¹M with elements M(θ)(i, j) = tr(Σ⁻¹Σ̇ᵢΣ⁻¹Σ̇ⱼ), Σ̇ᵢ = ∂Σ/∂θᵢ
(see Lee and Jennrich, 1979). The results include the following six propositions.
(a) The generalized least squares estimator θ̂ is consistent.
(b) The joint asymptotic distribution of the random variables n^{1/2}(θ̂ − θ₀) and
n^{1/2}λ̂ is multivariate normal with zero mean vector and covariance matrix

2 [ P₀  0 ; 0  R₀ ],   where   [ P₀  T₀' ; T₀  −R₀ ] = [ M₀  L₀' ; L₀  0 ]⁻¹

with M₀ = M(θ₀) and L₀ = L(θ₀).


(c) The generalized least squares estimator (θ̂, λ̂) is asymptotically equivalent to
the maximum likelihood estimator (θ̃, λ̃).
(d) The asymptotic distribution of nQ(θ̂) is chi-square with degrees of freedom
p(p + 1)/2 − (q − r).
Suppose h(θ) = (h*(θ), h**(θ)) where h*(θ) = (h₁(θ),...,h_j(θ)), h**(θ) =
(h_{j+1}(θ),...,h_r(θ)), and j ≤ r. Let θ̂* be the generalized least squares estimator
that is subject to h*(θ) = 0, and Q(θ̂*) be its final function value. Then:
(e) The asymptotic distribution of n(Q(θ̂) − Q(θ̂*)) is chi-square with degrees
of freedom r − j.
Proposition (d) provides a goodness-of-fit test statistic for the hypothesis that
the proposed model and relationships h(O) = 0 fit the observed sample data. The
validity of various constraints can be assessed by means of (e).
Let R̂ = [L̂M₁(θ̂)⁻¹L̂']⁻¹, let R̂⁻ be a generalized inverse, and let M₁(θ̂) = M(θ̂) + L̂'L̂.
Consider the null hypothesis H₀: Σ₀ = Σ(θ₀) (θ₀ ∈ Ω₀) against the specific alterna-
tive hypothesis H₁: Σ₀ = Σ(θ₀) (θ₀ ∈ Ω), where Ω₀ is a subset of Ω whose elements
satisfy h(θ₀) = 0. Then
(f) The asymptotic distribution of 2⁻¹nλ̂'R̂⁻λ̂ under H₀ is chi-square with
degrees of freedom equal to the rank of R₀.

6. Estimation and testing: Nonlinear programming basis

For any choice of error function and for almost any LV model with parameters
subject to no constraints, or to simple equality and proportionality constraints,
parameter estimates can be obtained by one of several nonlinear programming
algorithms. Certain algorithms commonly used in moment structures analysis will
be briefly considered. All algorithms to be considered here may be written as

θₖ₊₁ = θₖ − αNₖgₖ                                             (6.1)

where θₖ is the vector of parameter estimates at the kth iteration. The vector θ has
as the number of elements the number of nondependent parameters, i.e., the
number of free parameters after considering equality and proportionality con-
straints. Nₖ is a square symmetric positive definite matrix, and gₖ is the gradient

at iteration k. The stepsize parameter α is, in general, chosen to minimize f(θₖ₊₁)


where f(θₖ₊₁) = Q(θₖ₊₁) or F(θₖ₊₁), depending on the function chosen. Some
algorithms are defined with α = 1, but in practice one must generally allow the
option of reducing α in order to prevent divergence. The steepest descent
algorithm is defined by setting N = I. Steepest descent is usually very effective
when the initial parameter estimates are far from the solution, but it is very slow
to converge. The Newton-Raphson algorithm is defined by setting Nₖ =
[∂²f/∂θ∂θ']⁻¹, the inverse of the Hessian matrix (see, e.g., Bentler and Lee, 1979b).
Typically the Newton-Raphson algorithm converges very rapidly when near the
solution. However, with less than optimal starting values, the Hessian is often not
positive definite. For this reason, and since second derivatives may be difficult to
obtain and expensive to compute, the Newton-Raphson algorithm is relatively
unattractive. The Fletcher-Powell algorithm is defined by Nₖ₊₁ = Nₖ + ΔNₖ
where
ΔN = (Δθ'Δg)⁻¹ΔθΔθ' − (Δg'NΔg)⁻¹NΔgΔg'N',
and Δθ = −αNg, Δg = g(θ + Δθ) − g(θ), and typically N₁ = I. The Fletcher-Powell
algorithm is the algorithm most commonly used at this time in covariance
structure analysis. For maximum likelihood estimation the Fisher scoring algo-
rithm is defined by N as the Fisher information matrix. For the least squares error
function, the Gauss-Newton algorithm is defined by N = (HH')⁻¹ where H =
∂Σ(θ)/∂θ.
The basic Gauss-Newton algorithm can be modified to minimize the gener-
alized least squares error function by setting N = [H(W ⊗ W)H']⁻¹. It has been
shown that when W = Σ⁻¹, the information matrix is proportional to
H(W ⊗ W)H', in which case a step of the modified Gauss-Newton algorithm is
equivalent to a step of the Fisher scoring algorithm applied to the maximum
likelihood error function (Lee, 1977; Lee and Jennrich, 1979). A step in the
modified Gauss-Newton algorithm may be written as

θₖ₊₁ = θₖ + α[H(W ⊗ W)H']⁻¹H(W ⊗ W)Vec(S − Σ),                (6.2)

where Vec stacks the elements of the subsequent matrix into a vector. Thus under
an appropriate choice of W, one may obtain least-squares (W = I), generalized
least-squares (W = S⁻¹), or maximum-likelihood (W = Σ̂⁻¹) estimates from the
modified Gauss-Newton algorithm. In an empirical comparison of the algorithms
considered here (except steepest descent) for the orthogonal factor model, Lee
and Jennrich (1979) found the modified Gauss-Newton algorithm to be a
cost-efficient statistical optimizer.
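The update (6.2) is easy to sketch once the derivative matrix H has been formed; the code below is illustrative only, forming H by forward differences (an assumption of the sketch, not the analytic derivatives of the text) and taking α as fixed by the caller rather than chosen by a line search.

```python
import numpy as np

def numerical_H(theta, sigma_of_theta, eps=1e-6):
    """q x p^2 matrix whose i-th row is Vec(dSigma/dtheta_i)', by forward differences."""
    Sigma0 = sigma_of_theta(theta)
    rows = []
    for i in range(len(theta)):
        t = np.array(theta, dtype=float)
        t[i] += eps
        rows.append(((sigma_of_theta(t) - Sigma0) / eps).ravel())
    return np.vstack(rows)

def modified_gauss_newton_step(theta, S, sigma_of_theta, W, alpha=1.0):
    """One step of (6.2): theta + alpha [H(W kron W)H']^{-1} H(W kron W) Vec(S - Sigma)."""
    H = numerical_H(theta, sigma_of_theta)
    WW = np.kron(W, W)
    resid = (S - sigma_of_theta(theta)).ravel()    # Vec(S - Sigma)
    A = H @ WW @ H.T
    g = H @ WW @ resid
    return np.asarray(theta, dtype=float) + alpha * np.linalg.solve(A, g)
```

With W = I, W = S⁻¹, or an estimate of Σ⁻¹ this gives the least-squares, GLS, or ML-type steps described above; in practice α would be reduced whenever a full step fails to decrease the fit function.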
For most moment structure models, the constrained generalized least squares
estimator θ̂ and the corresponding Lagrange multipliers λ̂ cannot be solved in
closed form; thus, some nonlinear iterative procedure has to be used. Among the
other methods the penalty function technique developed by Fiacco and
McCormick (1968) has been accepted as an effective method in constrained
optimization. Based on this technique, Lee and Bentler proposed an algorithm as

follows: (a) Choose a scalar c₁ > 0 and initial values of θ. (b) Given cₖ > 0 and θₖ, by
means of the Gauss-Newton algorithm (6.2) search for a minimum point θₖ₊₁ of the
function

Qₖ(θ) = Q(θ) + cₖ Σ_{t=1}^{r} Φ(hₜ(θ))                        (6.3)

where Φ is a real-valued differentiable function such that Φ(x) ≥ 0 for all x and
Φ(x) = 0 if and only if x = 0. (c) Update k, increase cₖ₊₁ and return to (b) with
θₖ₊₁ as the initial values. The process is terminated when the absolute values of

max_i [θₖ₊₁(i) − θₖ(i)]   and   max_t [hₜ(θₖ₊₁)]              (6.4)

are less than ε, where ε is a predetermined small real number. The algorithm will
converge to the constrained generalized least squares estimator, if it exists. It has
been shown by Fiacco and McCormick (1968), and Luenberger (1973) that if the
algorithm converges to θₖ, the corresponding Lagrange multipliers are given by

λ̂' = (cₖΦ̇(h₁(θₖ)),...,cₖΦ̇(hᵣ(θₖ)))                            (6.5)

where Φ̇ denotes the derivative of Φ. An algorithm for obtaining the constrained


maximum likelihood estimator can be developed by a similar procedure. In this
case, during (b), the Fisher scoring algorithm is applied in minimizing an
appropriately defined Lₖ(θ) analogous to (6.3). Lee (1980) applied the penalty
function technique to obtain estimators in confirmatory factor analysis.
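A minimal sketch of the outer penalty-function loop (6.3)-(6.5) follows. It takes Φ(x) = x² (one function satisfying the stated conditions, so Φ̇(x) = 2x), accepts the fit function as an argument (e.g., the GLS or ML functions sketched earlier, wrapped so they depend only on θ), and substitutes a generic optimizer for the modified Gauss-Newton inner step; it illustrates the logic of steps (a)-(c), not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize   # generic inner optimizer, standing in for (6.2)

def penalty_fit(theta0, fit_fn, constraints, c1=1.0, growth=10.0,
                eps=1e-6, max_outer=20):
    """Outer penalty loop (6.3)-(6.4); fit_fn(theta) is Q or F, constraints = [h_1, ..., h_r]."""
    theta, c = np.asarray(theta0, dtype=float), c1
    for _ in range(max_outer):
        # (b): minimize fit_fn(theta) + c_k * sum_t Phi(h_t(theta)) from the current theta
        penalized = lambda t, c=c: fit_fn(t) + c * sum(h(t) ** 2 for h in constraints)
        new = minimize(penalized, theta).x
        done = (np.max(np.abs(new - theta)) < eps and
                max(abs(h(new)) for h in constraints) < eps)   # criterion (6.4)
        theta = new
        if done:
            break
        c *= growth                                            # (c): increase c_{k+1}
    lam = np.array([c * 2.0 * h(theta) for h in constraints])  # (6.5) with Phi'(x) = 2x
    return theta, lam
```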

Partial derivatives for a general model


The only derivatives required to implement (6.2) to estimate parameters of the
general model (3.6) are the elements of ∂Σ/∂θ. These can be obtained by various
methods, including matrix differentiation. See Bentler and Lee (1978b) for a
recent overview of basic results in this area and Nel (1980) for a comprehensive
discussion of recent developments in this field. The derivatives were reported by
Bentler and Weeks (1980) as

∂Σ/∂Φ = (Γ'B'⁻¹G' ⊗ Γ'B'⁻¹G'),

∂Σ/∂Γ = (B'⁻¹G' ⊗ ΦΓ'B'⁻¹G')(I + Eᵣᵣ),                        (6.6)

∂Σ/∂B₀ = (B'⁻¹G' ⊗ B⁻¹ΓΦΓ'B'⁻¹G')(I + Eᵣᵣ)

where Eᵣᵣ denotes ∂X'/∂X for X (r × r) with no constant or functionally
dependent elements. This matrix is a permutation matrix with (0, 1) elements. In
(6.6) the symmetry of Φ and other constraints on parameters have been ignored.
The complete set (6.6) of matrix derivatives can be stacked into a single matrix
∂Σ/∂θ* with matrix elements (∂Σ/∂θ*)' = [(∂Σ/∂Φ)', (∂Σ/∂Γ)', (∂Σ/∂B₀)']. It

follows that the elements of the unreduced gradient g* are stacked into the vector
g*' = [g(Φ)', g(Γ)', g(B₀)'], whose vector components are given by

g(Φ) = Vec[Γ'A'(Σ − S)AΓ],    g(Γ) = 2 Vec[A'(Σ − S)AΓΦ],

g(B₀) = 2 Vec[A'(Σ − S)AΓΦΓ'B'⁻¹]                             (6.7)

where A = WGB⁻¹ and the symmetry of Φ has not been taken into account. The
corresponding matrix N* is a 3 × 3 symmetric supermatrix

(∂Σ/∂θ*)(W ⊗ W)(∂Σ/∂θ*)'

whose lower triangular matrix elements, taken row by row, are

(Γ'VΓ ⊗ Γ'VΓ),
2(VΓ ⊗ C'VΓ),    2[(V ⊗ C'VC) + Eᵣᵣ(C'V ⊗ VC)],               (6.8)
2(VΓ ⊗ DVΓ),     2[(V ⊗ DVC) + Eᵣᵣ(DV ⊗ VC)],    2[(V ⊗ DVD') + Eᵣᵣ(DV ⊗ VD')].

In (6.8), C = ΓΦ, D = B⁻¹ΓΦΓ', and V = B'⁻¹G'WGB⁻¹.
The matrix ∂Σ/∂θ* contains derivatives with respect to all possible parameters
in the general model. In specific applications certain elements of Φ, Γ, and B₀ will
be known constants and the corresponding rows of ∂Σ/∂θ* must be eliminated.
In addition, certain parameters may be constrained, as mentioned above. For
example, Φ is a symmetric matrix so that off-diagonal equalities must be
introduced. The effect of constraints is to delete rows of ∂Σ/∂θ* corresponding
to constrained parameters and to transform a row i of ∂Σ/∂θ* to a weighted sum
of rows i, j for the constraint θᵢ = wⱼθⱼ. These manipulations performed on (6.7)
transform it into the (q × 1) vector g, and when carried into the rows and columns
of (6.8) transform it into the (q × q) matrix N, where q is the number of
nondependent parameters. The theory of Lee and Bentler (1980) for estimation
with arbitrarily constrained parameters, described above, can be used with the
proposed penalty function technique to yield a wider class of applications of the
general model (3.6) than have yet appeared in the literature.
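In practice one also needs a concrete sigma_of_theta mapping free parameters into Φ, Γ, and the structural matrix. The helper below is a hypothetical illustration of that bookkeeping for the structure Σ = GB⁻¹ΓΦΓ'B'⁻¹G', handling fixed entries and the off-diagonal symmetry of Φ in the spirit of the manipulations just described; the B versus B₀ bookkeeping is deliberately simplified.

```python
import numpy as np

def make_sigma(G, start, free_entries):
    """Build sigma_of_theta(theta) for Sigma = G B^{-1} Gamma Phi Gamma' B'^{-1} G'.

    `start` is a dict of starting matrices {'B': ..., 'Gamma': ..., 'Phi': ...};
    entries not listed in `free_entries` stay fixed at their starting values.
    `free_entries[k] = (name, i, j)` says where the k-th free parameter goes.
    """
    def sigma_of_theta(theta):
        M = {name: start[name].copy() for name in ('B', 'Gamma', 'Phi')}
        for k, (name, i, j) in enumerate(free_entries):
            M[name][i, j] = theta[k]
            if name == 'Phi' and i != j:
                M[name][j, i] = theta[k]       # keep Phi symmetric
        Binv = np.linalg.inv(M['B'])
        core = Binv @ M['Gamma'] @ M['Phi'] @ M['Gamma'].T @ Binv.T
        return G @ core @ G.T
    return sigma_of_theta
```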

7. Conclusion

The field of multivariate analysis with continuous latent and measured random
variables has made substantial progress in recent years, particularly from
mathematical and statistical points of view. Mathematically, clarity has been
achieved in understanding representation systems for structured linear random
variable models. Statistically, large sample theory has been developed for a
variety of competing estimators, and the associated hypothesis testing procedures

have been developed. However, much statistical work remains to be completed.


For example, small-sample theory is virtually unknown, and reliance is placed
upon Monte Carlo work (cf. Geweke and Singleton, 1980). A theory of estimation
and model evaluation that is completely distribution-free is only now being
worked out (Browne, 1982).
The applied statistician who is concerned with utilizing the above theory in
empirical applications will quickly find that 'causal modeling', as the above
procedures are often called, is a very finicky methodology having many pitfalls.
For example, parameter estimates for variances may be negative; suppressor
effects yielding unreasonable structural coefficients may be found; theoretically
identified models may succumb to 'empirical' underidentification with sampling
variances being undeterminable; iterative computer methods may be extremely
expensive to utilize; goodness of fit tests may be 'unduly' sensitive to sample size.
Many of these issues are discussed in the voluminous literature cited previously.
Alternative approaches to model evaluation, beyond those of the simple goodness
of fit chi-square test, are discussed by Bentler and Bonett (1980).

References
Aigner, D. J. and Goldberger, A. S., eds. (1977). Latent Variables in Socioeconomic Models. North-
Holland, Amsterdam.
Aitchison, J. and Silvey, S. D. (1958). Maximum likelihood estimation of parameters subject to
restraint. Ann. Math. Statist. 29, 813-828.
Algina, J. (1980). A note on identification in the oblique and orthogonal factor analysis models.
Psychometrika 45, 393-396.
Amemiya, T. (1977). The maximum likelihood and the nonlinear three-stage least squares estimator in
the general nonlinear simultaneous equation model. Econometrica 45, 955-968.
Anderson, T. W. (1958). An Introduction to Multivariate Statistical Analysis. Wiley, New York.
Anderson, T. W. (1973). Asymptotically efficient estimation of covariance matrices with linear
structure. Ann. Statist. 1, 135-141.
Anderson, T. W. (1976). Estimation of linear functional relationships: Approximate distributions and
connections with simultaneous equations in econometrics. J. Roy. Statist. Soc. Sec. B 38, 1-20.
Discussion, ibid 20-36.
Anderson, T. W. and Rubin, H. (1956). Statistical inference in factor analysis. Proc. 3rd Berkeley
Symp. Math. Statist. Prob. 5, 111-150.
Bentler, P. M. (1976). Multistructure statistical model applied to factor analysis. Multivariate Behav.
Res. 11, 3-25.
Bentler, P. M. (1980). Multivariate analysis with latent variables: Causal modeling. Ann. Rev. Psychol.
31, 419-456.
Bentler, P. M. (1982). Linear systems with multiple levels and types of latent variables. In: K. G.
Jöreskog and H. Wold, eds., Systems under Indirect Observation. North-Holland, Amsterdam [in
press].
Bentler, P. M. and Bonett, D. G. (1980). Significance tests and goodness of fit in the analysis of
covariance structures. Psych. Bull. 88, 588-606.
Bentler, P. M. and Lee, S. Y. (1978a). Statistical aspects of a three-mode factor analysis model.
Psychometrika 43, 343-352.
Bentler, P. M. and Lee, S. Y. (1978b). Matrix derivatives with chain rule and rules for simple,
Hadamard, and Kronecker products. J. Math. Psych. 17, 255-262.

Bentler, P. M. and Lee, S. Y. (1979a). A statistical development of three-mode factor analysis. British
J. Math. Statist. Psych. 32, 87-104.
Bentler, P. M. and Lee, S. Y. (1979b). Newton-Raphson approach to exploratory and confirmatory
maximum likelihood factor analysis. J. Chin. Univ. Hong Kong. 5, 562-573.
Bentler, P. M. and Weeks, D. G. (1978). Restricted multidimensional scaling models. J. Math. Psych.
17, 138-151.
Bentler, P. M. and Weeks, D. G. (1979). Interrelations among models for the analysis of moment
structures. Multivariate Behav. Res. 14, 169-185.
Bentler, P. M. and Weeks, D. G. (1980). Linear structural equations with latent variables. Psycho-
metrika 45, 289-308.
Bhargava, A. K. (1977). Maximum likelihood estimation in a multivariate 'errors in variables'
regression model with unknown error covariance matrix. Comm. Statist. A--Theory Methods 6,
587-601.
Bishop, Y. M. M., Fienberg, S. E. and Holland, P. W. (1975). Discrete Multivariate Analysis. MIT
Press, Cambridge, MA.
Bock, R. D. and Bargmann, R. E. (1966). Analysis of covariance structures. Psychometrika 31,
507-534.
Browne, M. W. (1974). Generalized least-squares estimators in the analysis of covariance structures.
South African Statist. J. 8, 1-24.
Browne, M. W. (1977). The analysis of patterned correlation matrices by generalized least squares.
British J. Math. Statist. Psych. 30, 113-124.
Browne, M. W. (1982). Covariance structures. In: D. M. Hawkins, ed., Topics in Applied Multivariate
Analysis. Cambridge University Press, London.
Chechile, R. (1977). Likelihood and posterior identification: Implications for mathematical psychol-
ogy. British J. Math. Statist. Psych. 30, 177-184.
Cochran, W. G. (1970). Some effects of errors of measurement on multiple correlation. J. Amer.
Statist. Assoc. 65, 22-34.
Deistler, M. and Seifert, H. G. (1978). Identifiability and consistent estimability in econometric
models. Econometrica 46, 969-980.
Dempster, A. P. (1969). Elements of Continuous Multivariate Analysis. Addison-Wesley, Reading, MA.
Dempster, A. P. (1971). An overview of multivariate data analysis. J. Multivariate Anal. 1, 316-346.
Feldstein, M. (1974). Errors in variables: A consistent estimator with smaller MSE in finite samples. J.
Amer. Statist. Assoc. 69, 990-996.
Fiacco, A. V. and McCormick, G. P. (1968). Nonlinear Programming. Wiley, New York.
Fink, E. L. and Mabee, T. I. (1978). Linear equations and nonlinear estimation: A lesson from a
nonrecursive example. Sociol. Methods Res. 7, 107-120.
Gabrielsen, A. (1978). Consistency and identifiability. J. Econometrics 8, 261-263.
Geraci, V. J. (1976). Identification of simultaneous equation models with measurement error. J.
Econometrics 4, 263-283.
Geraci, V. J. (1977). Estimation of simultaneous equation models with measurement error. Econometrica
45, 1243-1255.
Geweke, J. F. and Singleton, K. J. (1980). Interpreting the likelihood ratio statistic in factor models
when sample size is small. J. Amer. Statist. Assoc. 75, 133-137.
Gleser, L. J. (1981). Estimation in a multivariate "errors in variables" regression model: Large sample
results. Ann. Statist. 2, 24-44.
Goldberger, A. S. and Duncan, O. D., eds. (1973). Structural Equation Models in the Social Sciences.
Academic Press, New York.
Goodman, L. A. (1978). Analyzing Qualitative/Categorical Data. Abt Books, Cambridge, MA.
Hausman, J. A. (1977). Errors in variables in simultaneous equation models. J. Econometrics 5,
389-401.
Hsiao, C. (1976). Identification and estimation of simultaneous equation models with measurement
error. Internat. Econom. Rev. 17, 319-339.
Jennrich, R. I. and Ralston, M. L. (1978). Fitting nonlinear models to data. Ann. Rev. Biophys. Bioeng.
8, 195-238.

Jöreskog, K. G. (1967). Some contributions to maximum likelihood factor analysis. Psychometrika 32,
443-482.
Jöreskog, K. G. (1969). A general approach to confirmatory maximum likelihood factor analysis.
Psychometrika 34, 183-202.
Jöreskog, K. G. (1970). A general method for analysis of covariance structures. Biometrika 57,
239-251.
Jöreskog, K. G. (1971). Simultaneous factor analysis in several populations. Psychometrika 36,
409-426.
Jöreskog, K. G. (1973). Analysis of covariance structures. In: P. R. Krishnaiah, ed., Multivariate
Analysis III, 263-285. Academic Press, New York.
Jöreskog, K. G. (1977). Structural equation models in the social sciences: Specification, estimation and
testing. In: P. R. Krishnaiah, ed., Applications of Statistics, 265-287. North-Holland, Amsterdam.
Jöreskog, K. G. (1978). Structural analysis of covariance and correlation matrices. Psychometrika 43,
443-477.
Jöreskog, K. G. and Goldberger, A. S. (1972). Factor analysis by generalized least squares. Psycho-
metrika 37, 243-260.
Jöreskog, K. G. and Goldberger, A. S. (1975). Estimation of a model with multiple indicators and
multiple causes of a single latent variable. J. Amer. Statist. Assoc. 70, 631-639.
Jöreskog, K. G. and Sörbom, D. (1978). LISREL IV Users Guide. Nat. Educ. Res., Chicago.
Keesling, W. (1972). Maximum likelihood approaches to causal flow analysis. Ph.D. thesis. University
of Chicago, Chicago.
Krishnaiah, P. R. and Lee, J. C. (1974). On covariance structures. Sankhyā 38, 357-371.
Lawley, D. N. (1940). The estimation of factor loadings by the method of maximum likelihood. Proc.
R. Soc. Edinburgh 60, 64-82.
Lawley, D. N. and Maxwell, A. E. (1971). Factor Analysis as a Statistical Method. Butterworth,
London.
Lawley, D. N. and Maxwell, A. E. (1973). Regression and factor analysis. Biometrika 60, 331-338.
Lazarsfeld, P. F. and Henry, N. W. (1968). Latent Structure Analysis. Houghton-Mifflin, New York.
Lee, S. Y. (1977). Some algorithms for covariance structure analysis. Ph.D. thesis. Univ. Calif., Los
Angeles.
Lee, S. Y. (1980). Estimation of covariance structure models with parameters subject to functional
restraints. Psychometrika 45, 309-324.
Lee, S. Y. and Bentler, P. M. (1980). Some asymptotic properties of constrained generalized least
squares estimation in covariance structure models. South African Statist. J. 14, 121-136.
Lee, S. Y. and Jennrich, R. I. (1979). A study of algorithms for covariance structure analysis with
specific comparisons using factor analysis. Psychometrika 44, 99-113.
Lord, F. M. (1960). Large-sample covariance analysis when the control variable is fallible. J. Amer.
Statist. Assoc. 55, 307-321.
Lord, F. M. and Novick, M. R. (1968). Statistical Theories of Mental Test Scores. Addison-Wesley,
Reading, MA.
Luenberger, D. G. (1973). Introduction to Linear and Nonlinear Programming. Addison-Wesley,
Reading, MA.
McDonald, R. P. (1978). A simple comprehensive model for the analysis of covariance structures.
British J. Math. Statist. Psych. 31, 59-72.
McDonald, R. P. and Krane, W. R. (1977). A note on local identifiability and degrees of freedom in
the asymptotic likelihood ratio test. British J. Math. Statist. Psych. 30, 198-203.
McDonald, R. P. and Krane, W. R. (1979). A Monte-Carlo study of local identifiability and degrees
of freedom in the asymptotic likelihood ratio test. British J. Math. Statist. Psych. 32, 121-132.
McDonald, R. P. and Mulaik, S. A. (1979). Determinacy of common factors: A nontechnical review.
Psych. Bull. 86, 297-306.
Monfort, A. (1978). First-order identification in linear models. J. Econometrics 7, 333-350.
Nel, D. G. (1980). On matrix differentiation in statistics. South African Statist. J. 14, 137-193.
Olsson, U. and Bergman, L. R. (1977). A longitudinal factor model for studying change in ability
structure. Multivariate Behav. Res. 12, 221-242.

Please, N. W. (1973). Comparison of factor loadings in different populations. British J. Math. Statist.
Psychol. 26, 61-89.
Rao, C. R. (1971). Minimum variance quadratic unbiased estimation of variance components. J.
Multivariate Anal. 1, 445-456.
Rao, C. R. and Kleffe, J. (1980). Estimation of variance components. In: P. R. Krishnaiah and L. N.
Kanal, eds., Handbook of Statistics, Vol. I, 1-40. North-Holland, Amsterdam.
Robinson, P. M. (1974). Identification, estimation and large-sample theory for regressions containing
unobservable variables. Internat. Econom. Rev. 15, 680-692.
Robinson, P. M. (1977). The estimation of a multivariate linear relation. J. Multivariate Anal. 7,
409-423.
Rock, D. A., Werts, C. E. and Flaugher, R. L. (1978). The use of analysis of covariance structures for
comparing the psychometric properties of multiple variables across populations. Multivariate Behav.
Res. 13, 403-418.
Sörbom, D. (1974). A general method for studying differences in factor means and factor structure
between groups. British J. Math. Statist. Psych. 27, 229-239.
Sörbom, D. (1978). An alternative to the methodology for analysis of covariance. Psychometrika 43,
381-396.
Spearman, C. (1904). The proof and measurement of association between two things. Amer. J. Psych.
15, 72-101.
Steiger, J. H. (1979). Factor indeterminacy in the 1930's and the 1970's: Some interesting parallels.
Psychometrika 44, 157-167.
Steiger, J. H. and Schönemann, P. H. (1978). A history of factor indeterminacy. In: S. Shye, ed.,
Theory Construction and Data Analysis, Jossey-Bass, San Francisco.
Strotz, Robert H. and Wold, H. O. A. (1960). Recursive vs. nonrecursive systems: An attempt at
synthesis. Econometrica 28, 417-427.
Swain, A. J. (1975). A class of factor analytic estimation procedures with common asymptotic
sampling properties. Psychometrika 40, 315-335.
Thurstone, L. L. (1947). Multiple Factor Analysis. Univ. of Chicago Press, Chicago.
Tucker, L. R. (1966). Some mathematical notes on three-mode factor analysis. Psychometrika 31,
279-311.
Tukey, J. W. (1954). Causation, regression, and path analysis. In: O. K. Kempthorne, T. A. Bancroft,
J. W. Gowen and J. L. Lush, eds., Statistics and Mathematics in Biology, 35-66. Iowa State
University Press, Ames, IA.
Weeks, D. G. (1978). Structural equation systems on latent variables within a second-order measure-
ment model. Ph.D. thesis. Univ. of Calif., Los Angeles.
Weeks, D. G. (1980). A second-order longitudinal model of ability structure. Multivariate Behav. Res.
15, 353-365.
Wiley, D. E. (1973). The identification problem for structural equation models with unmeasured
variables. In: Goldberger and Duncan, eds., Structural Equation Models in the Social Sciences,
69-83. Academic Press, New York.
Wiley, D. E., Schmidt, W. H. and Bramble, W. J. (1973). Studies of a class of covariance structure
models. J. Amer. Statist. Assoc. 68, 317-323.
Williams, J. S. (1978). A definition for the common-factor analysis model and the elimination of
problems of factor score indeterminacy. Psychometrika 43, 293-306.
Wold, H. (1980). Model construction and evaluation when theoretical knowledge is scarce: An
example of the use of partial least squares. In: J. Kmenta and J. Ramsey, eds., Evaluation of
Econometric Models. Academic Press, New York.
Wright, S. (1934). The method of path coefficients. Ann. Math. Statist. 5, 161-215.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
© North-Holland Publishing Company (1982) 773-791

Use of Distance Measures, Information Measures


and Error Bounds in Feature Evaluation*

Moshe Ben-Bassat

1. Introduction: The problem of feature evaluation

The basic problem of statistical pattern recognition is concerned with the


classification of a given object to one of m known classes. An object is
represented by a pattern of features which, presumably, contains enough informa-
tion to distinguish among the classes. That is, the statistical variability (which is
not due to random noise) between patterns coming from different classes is high
enough to ensure satisfactory performance in terms of a correct classification rate.
For a given problem, the set of all of the potential features that can be measured
may be quite large, and in this case the designer is faced with the feature selection
problem, i.e., which features to select in order to achieve maximum performance
with the minimum measurement effort.
Procedures for feature selection may be discussed under different circum-
stances, e.g. sequential and nonsequential pattern recognition. The common need,
however, for all these procedures is an evaluation function by which the potency
of a feature, or a subset of features, to distinguish among the classes is assessed.
The nature and use of such evaluation functions is the subject matter of this
paper.
Adopting the Bayesian approach, the true class of a given object is considered
as a random variable C taking values in the set {1,2,...,m}, where C = i
represents class i. The initial uncertainty regarding the true class is expressed by
the prior probability vector Π = (π₁, π₂,...,πₘ). Let F denote the set of all
potential features, let Xⱼ denote feature j and let Pᵢ(xⱼ) denote the conditional
probability (density) function of feature j under class i for the value Xⱼ = xⱼ (Xⱼ
may be multidimensional, e.g. when a subset of features is considered). Once Xj
has been measured, the prior probability of class i is replaced by its posterior
probability which is given by Bayes' theorem:

π̃ᵢ(xⱼ) = πᵢPᵢ(xⱼ) / Σₜ₌₁ᵐ πₜPₜ(xⱼ).                            (1)

*This study was partially supported by the National Science Foundation Grant ECS-8011369 from
the Division of Engineering, and by United States Army Research Institute for the Behavioral and
Social Sciences contract number DAJA 37-81-C-0065.


If the cost for all types of correct classification is zero and the cost for all types
of incorrect classification is one, then the optimal Bayes decision rule assigns the
object to the class with the highest a posteriori probability. In this case, the Bayes
risk associated with a given feature X reduces to the probability of error, Pe(X),
which is expected after observing that feature:

Pe(X) = E[1 − max{π̃₁(X),...,π̃ₘ(X)}].                         (2)


Here, and throughout the paper, expectation is taken with respect to the mixed
probability of X which is given by

P(x) = Σᵢ₌₁ᵐ πᵢPᵢ(x).                                         (3)
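For a feature taking finitely many values, (1)-(3) amount to a few array operations. The sketch below is illustrative only; prior has length m and cond[i, v] = Pᵢ(X = v).

```python
import numpy as np

def posterior(prior, lik):
    """Eq. (1): posterior class probabilities after observing one value of a feature."""
    w = prior * lik                        # pi_i * P_i(x_j)
    return w / w.sum()

def prob_error(prior, cond):
    """Eqs. (2)-(3) for a discrete feature; cond[i, v] = P_i(X = v)."""
    mix = prior @ cond                     # P(x), eq. (3)
    post = (prior[:, None] * cond) / mix   # posterior for every value of x
    return float(np.sum(mix * (1.0 - post.max(axis=0))))
```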

If our objective is to minimize the classifier error rate,¹ and the measurement
cost for all the features is equal, then the most appealing function to evaluate the
potency of a feature to differentiate between the classes is the Pe(X) function.
Nevertheless, extensive research effort was devoted to the investigation of other
functions--mostly based on distance and information measures--as feature
evaluation tools.
Section 2 introduces formally feature evaluation rules and illustrates two of
them by a numerical example. In Section 3 the limitations of the probability of
error rule are pointed out, while in Section 4 the properties desired from a
substitute rule are discussed. Section 5 reviews the major categories of feature
evaluation rules and provides tables which relate these rules to error bounds. The
use of error bounds for assessing feature evaluation rules and for estimating the
probability of error is discussed in Section 6. Section 7 concludes with a summary
of the theoretical and experimental findings so far and also provides some
practical recommendations.

2. Feature evaluation rules

A feature evaluation rule is a real function U defined on F by which X and Y


are indifferent if U(X) = U(Y), and X is preferred to Y if U(X) < U(Y), for every
X and Y in F. A feature X is said to be not preferred to Y if U(Y) ≤ U(X).

REMARK. For some rules it may be more meaningful to define preference by


U(X)>U(Y). However, this does not detract from the generality of the above
definition since multiplying U by - 1 brings us back to the above definition.

¹For instance, in the early stages of sequential interactive pattern recognition tasks, the user may be
interested in a feature which maximally reduces the number of plausible classes. Such an objective is
not necessarily the same as minimizing the expected classifier error rate for the next immediate stages.

Table 1
An example with binary features

Prior                    Features
probabilities  Classes   X1     X2     X3     X4     X5     X6
0.25           1         0.75   0.90   0.05   0.40   0.10   0.05
0.25           2         0.10   0.45   0.52   0.40   0.90   0.07
0.25           3         0.80   0.45   0.60   0.40   0.75   0.90
0.25           4         0.85   0.01   0.92   0.40   0.80   0.90

This indifference relation is easily seen to be reflexive, symmetric and transi-


tive, and thus it is an equivalence relation which divides F into disjoint equiva-
lence groups defined by

F(u) = {X | X ∈ F, U(X) = u}.                                 (4)

For instance, by the probability of error rule--henceforth the Pe rule--X and


Y are indifferent if Pe(X) = Pe(Y), while X is preferred to Y if Pe(X) < Pe(Y).
The equivalence groups are given by F(pe) = {X | X ∈ F, Pe(X) = pe}. For a given
pe all the features in F(pe) are indifferent. For pe < pe′, any X ∈ F(pe) is preferred
to any Y ∈ F(pe′).

Before proceeding, let us introduce the following example which will be used
throughout the paper to demonstrate several of the concepts discussed.

EXAMPLE 1. Consider a classification problem with four classes and six binary
features which is presented in Table 1. For practical illustration the classes may
be considered as medical disorders, while the features are symptoms, signs or
laboratory tests which are measured as positive/negative. The entries of the table
represent the respective conditional probabilities for positive results of the fea-
tures given the classes. The ordering of features by the Pe rule is given in Table 2.
It should be noted that feature preference may be a function of the prior
probabilities. For instance, if the prior probabilities for Example 1 were
(0.1, 0.7, 0.1, 0.1), then Pe(X₆) = 0.259 while Pe(X₂) = 0.300, which means that
under these prior probabilities X₆ is preferred to X₂ (Table 3).
Table 2
Feature ordering by the Pe rule, Π = (0.25, 0.25, 0.25, 0.25)

Feature   X2      X3      X6      X5      X1      X4
Pe        0.527   0.532   0.537   0.550   0.562   0.750

Table 3
Feature ordering by the Pe rule, Π = (0.1, 0.7, 0.1, 0.1)

Feature   X6      X5      X1      X2      X3      X4
Pe        0.259   0.280   0.285   0.300   0.300   0.300

Another frequently used feature evaluation rule is derived from Shannon's
entropy, by which X is preferred to Y if the expected posterior uncertainty
resulting from X:

H(X) = E[−Σ π̃ᵢ(X) log π̃ᵢ(X)]                                 (5)

is lower than that for Y. In (5), and throughout the paper, Σ is on i from 1 to m
and log is to the base 2. Table 4 shows the ordering induced by the H rule. This
ordering is not consistent with the ordering induced by the Pe rule.

Table 4
Feature ordering by the H rule, Π = (0.25, 0.25, 0.25, 0.25)

Feature   X6      X2      X3      X5      X1      X4
H         1.399   1.640   1.666   1.673   1.698   2.000
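The two rules are easy to compare numerically for binary features. The sketch below evaluates Pe and H for the columns of Table 1 under uniform priors and should reproduce, up to rounding, the orderings of Tables 2 and 4.

```python
import numpy as np

def pe_binary(prior, p_pos):
    """Expected probability of error (2) for one binary feature; p_pos[i] = P_i(X = +)."""
    pe = 0.0
    for lik in (p_pos, 1.0 - p_pos):                 # the two outcomes, + and -
        mix = float(prior @ lik)                     # P(x), eq. (3)
        pe += mix * (1.0 - (prior * lik / mix).max())
    return pe

def h_binary(prior, p_pos):
    """Expected posterior Shannon entropy (5) for one binary feature."""
    h = 0.0
    for lik in (p_pos, 1.0 - p_pos):
        mix = float(prior @ lik)
        post = prior * lik / mix
        post = post[post > 0]
        h += mix * -(post * np.log2(post)).sum()
    return h

prior = np.array([0.25, 0.25, 0.25, 0.25])
table1 = {                       # columns X1..X6 of Table 1
    'X1': [0.75, 0.10, 0.80, 0.85], 'X2': [0.90, 0.45, 0.45, 0.01],
    'X3': [0.05, 0.52, 0.60, 0.92], 'X4': [0.40, 0.40, 0.40, 0.40],
    'X5': [0.10, 0.90, 0.75, 0.80], 'X6': [0.05, 0.07, 0.90, 0.90],
}
for name, col in table1.items():
    p = np.array(col)
    print(name, round(pe_binary(prior, p), 3), round(h_binary(prior, p), 3))
```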

3. What is wrong with the Pe rule

Although the Pe rule seems to be the most natural feature evaluation rule,
alternative feature evaluation rules are of great importance due to the following
reasons:
(1) The probability of error may not be sensitive enough for differentiating
between good and better features. That is to say, the equivalence groups partition
induced over F by the Pe rule is not sufficiently refined. For instance, in Table 3,
feature X₄, which can contribute nothing to differentiating among the classes
(since Pᵢ(X₄) is the same for every i, i = 1,2,3,4), is considered by the Pe function
as equivalent to features X₃ and X₂ which may certainly contribute to differentiat-
ing among the classes. For instance, if X₂ is selected and a positive result is
obtained, then class 4 is virtually eliminated [π̃₄(X₂ = +) = 0.0022], while the
likelihood of class 1 is doubled [π̃₁(X₂ = +) = 0.20]. If a negative result is obtained
for X₂, similar strong indications in the opposite direction are suggested
[π̃₄(X₂ = −) = 0.18, π̃₁(X₂ = −) = 0.0018]. On the other hand, if X₄ is selected,
the posterior probabilities of all the classes remain the same as the priors
regardless of the result observed for X₄. Yet the expected probability of error for
X₄ is 0.300, the same as for X₂. The main reason for the insensitivity of the Pe
function lies in the fact that, directly, the Pe function depends only on the most
probable class, and that under certain conditions, the prior most probable class
remains unchanged regardless of the result for the observed feature [8].
(2) For optimal subset selection of K features out of N, when exhaustive search
over all possible subsets is impractical, procedures based on the relative value of


individual features are suggested [20, 30, 44, 55, 59-61]. Using the probability of
error as the evaluation function for individual features does not ensure 'good'
error rate performance of the resulting subset, not even for the case of condition-
ally independent features [11, 13, 21, 56]. Alternative criteria for evaluating
individual features may provide better error rate performance of the resulting
subset, or may diminish the search efforts, see [59] for more details. For example,
Narendra and Fukunaga [45] have recently introduced a branch and bound
algorithm for optimal subset selection. However, although their algorithm is quite
general, it is more efficiently applied if the criterion function satisfies a recursive
formula which expresses the value for t - 1 features by means of its value for t
features. Such a recursive formula is not satisfied by the Pe function but it is
satisfied by other functions, e.g., the divergence and Bhattacharyya distance for
the case of normal distribution.
(3) In sequential classification, when dynamic programming procedures cannot
be used due to computational limitations, myopic policies are usually adopted, by
which the next feature to be tested is that feature which optimizes a criterion
function for just one or a few steps ahead. Usually, the objective is to reach a
predetermined level of the probability of error by a minimum number of features.
This objective is not necessarily achieved when the Pe function is used as the
myopic feature evaluation function, and substitute rules may perform better.
Experience with several myopic rules for the case of binary features is reported by
Ben-Bassat [4]. In fact, the author found that when ties are broken arbitrarily
for the Pe rule, this rule may be very inefficient under a myopic policy. The main
reason for the low efficiency of the Pe function in myopic sequential classification
is its low sensitivity for differentiating between good and better features, particu-
larly in advanced stages.
(4) Computation of the probability of error involves integration of the function
max{π̃₁(X),...,π̃ₘ(X)} which usually cannot be done analytically. Numerical
integration, on the other hand, is a tedious process which becomes particularly
difficult and inaccurate when continuous and multidimensional features are
evaluated. For certain class distributions alternative feature evaluation functions
may lead to closed formulas which greatly facilitates the feature evaluation task.
For instance, in the two class case with Gaussian features, Kullback-Leibler
divergence and the Bhattacharyya coefficient are simple functions of the mean
vectors and the covariance matrices [29].

4. Ideal alternatives for the Pe rule do not generally exist

When the reasons for considering alternative feature evaluation rules are the
insensitivity of the Pe rule and/or computational difficulties, and our objective is
still the minimization of the expected Pe, then an ideal alternative rule is one
which does not contradict the Pe rule and perhaps refines it. That is, if X is not
preferred to Y by the Pe rule, then it is not preferred to Y by that alternative rule
either. Among features which are indifferent by the Pe rule, it is possible, and

moreover it is desired, to have an internal order which differentiates between


good and better features. Formally, an ideal rule is a rule for which the
equivalence groups partition either coincides with, or is a refined partition of, the
equivalence groups partition induced by the Pe rule. (A partition F₁ is a refined
partition of F₂ if F₁ contains more groups than F₂, and whenever features X and Y
are clustered together in F₁ they are also clustered together in F₂.)
Many of the papers written on the subject of feature evaluation seem to be
motivated by the feeling that there exists a magic functional which, for arbitrary
class distributions, induces the same ordering as does the probability of error rule.
Ben-Bassat [6] proves that such a functional, if it exists, does not belong to the
f-entropy family which covers several feature evaluation functions, e.g., Shannon's
entropy, quadratic entropy and the entropy of order α. In fact, the proof given in
that paper applies to a wider family of feature evaluation functions which leads to
a conjecture that such a magic functional probably does not exist for the general
case.
For some special cases, however, alternative feature evaluation rules may
induce over F exactly the same preference-indifference relation induced by the Pe
rule. For instance, for the two class case and Gaussian features the feature
evaluation rule derived from Kullback-Leibler divergence induces the same
partition induced by the Pe rule [22, 40].
Since ideal rules could not be found, it was suggested to assess a feature
evaluation function by considering the tightness and the rate of change of either
the lower and upper bounds on the probability of error by means of the
evaluation function, or the lower and upper bounds on the evaluation function by
means of the probability of error, e.g. [9]. These bounds and other arguments for
justifying alternative feature evaluation rules will be discussed in the next section.

5. Taxonomy of feature evaluation rules

5.1. Overview
Feature evaluation rules may be classified into three major categories.
(1) Rules derived from information measures (also known as uncertainty
measures).
(2) Rules derived from distance measures.
(3) Rules derived from dependence measures.
The assignment of feature evaluation rules within these categories may be
equivocal, since several feature evaluation rules may be placed in different
categories when considering them from different perspectives. Moreover, we often
found that a certain feature evaluation rule in category i may be obtained as a
mathematical transformation of another rule in category j.
In the following sections we introduce a unified framework for each of these
categories and construct tables that contain representatives of each category along
with their relationships to the probability of error.

5.2. Rules derived from information (uncertainty) measures


Let Aₘ denote the set of all possible probability vectors of order m, m ≥ 2, and
let u denote a real nonnegative function on Aₘ. u will be referred to as the
uncertainty function concerning the true class, and is defined so that larger values
for u represent higher levels of uncertainty. Various axiomatic approaches describ-
ing the requirements from a function u to qualify as an uncertainty measure are
included in [1]. The uncertainty measures discussed below are listed in Table 5.
Given an uncertainty function u and a prior probability vector Π, the informa-
tion gain from a feature X, I(X), is defined as the difference between the prior
uncertainty and the expected posterior uncertainty using X, i.e.

I(X) = u(Π) − E[u(Π̃(X))]                                      (6)

A feature evaluation rule derived from the concept of information gain states that
X is preferred to Y if I(X) > I(Y). Since u(Π) is independent of the feature
under evaluation, this rule is equivalent to a rule which says: X is preferred to Y if
U(X) < U(Y) where

U(X) = E[u(Π̃(X))]                                             (7)

and Π̃(X) = (π̃₁(X),...,π̃ₘ(X)) denotes the posterior probability vector.

This unified framework was proposed by DeGroot as early as 1962 [15],


who also discussed general properties of uncertainty measures. Inciden-
tally, the Pe rule also fits in this framework by considering u(Π) = 1 −
max{π₁,...,πₘ}. Further details may be found in DeGroot's book [16].
Feature evaluation by Shannon's entropy was first investigated by Lindley [38].
In a series of papers, Renyi discusses in detail inequalities and convergence
relationships between Shannon's entropy and the probability of error in Bayes
decision rules [48-51]. Kovalesky's [34] paper is also a milestone in relating
Shannon's measure to the probability of error. Hellman and Raviv [27] improve
some of Renyi's results and give more references on this subject.
The quadratic entropy appeared first in the context of risk evaluation for the
nearest neighbor classification rule [12]. The term quadratic entropy was coined
by Vajda [65]. The use of the quadratic entropy as a tool for feature evaluation
and its relationship to the probability of error were also investigated by Toussaint
[58] and Devijver [17] who define the Bayesian distance as 1 − Q(Π) (which
means that it can also be considered within category 2).
Several generalizations of Shannon's entropy have been proposed by weakening
the axioms which uniquely determine Shannon's entropy. These include Renyi's
[47] entropy of order α and the entropy of degree α [14]. When α goes to 1, both
entropies go to Shannon's entropy.
The use of Renyi's [47] entropy for measuring the information content of a
feature is discussed by Renyi [52] and Good and Card [25]. The relationship
between Renyi's entropy and the probability of error is discussed by Ben-Bassat
and Raviv [7], where some new properties of this measure are presented as well.

J
I
o
T T
o

0 r~ --.---.
t q
+ I
+
q ,.-, ÷ I

t "@ o
<°+ I~ i
i I

'T~'

r~
i i
© i--.,.-i

~ ~1 ~

4- I
, ._.
0 ,.-, I
t
0

w ~ ~ + r~° -
...._............#,,~
+~ q ~ ~ -i~ x ~

o~
V
.g
I Atl,
0 W
0 ~w

T o r~

iii II t 2
0 ~ o "-4

o "~
.4,,,...,
*

Ben-Bassat [4] reports on experiments with this measure for a sequential multi-class
classification problem using conditionally independent binary features.
Devijver [18] and Toussaint [61, 64] relate the entropy of degree α to the
Bayesian probability of error and to the nearest neighbor probability of error.
Except for Renyi's entropy, all of the above functions are special cases of the
f-entropy family which is given by

u(Π) = Σᵢ f(πᵢ)                                                (8)

where f is strictly concave, f'' exists and f(0) = lim_{π→0} f(π) = 0. The tightest
upper and lower bounds on f-entropies by means of the probability of error are
presented in [6]. Substituting the appropriate f in these bounds, we obtain the
above mentioned bounds as special cases (see Table 5).
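As an illustration of (8), the sketch below evaluates two members of the family mentioned above, Shannon's entropy (f(π) = −π log₂ π) and the quadratic entropy (f(π) = π(1 − π)), for an arbitrary probability vector.

```python
import numpy as np

def f_entropy(probs, f):
    """u(Pi) = sum_i f(pi_i), eq. (8)."""
    return float(sum(f(p) for p in probs))

shannon = lambda p: -p * np.log2(p) if p > 0 else 0.0   # f for Shannon's entropy
quadratic = lambda p: p * (1.0 - p)                      # f for the quadratic entropy

probs = [0.6, 0.3, 0.1]
print(f_entropy(probs, shannon), f_entropy(probs, quadratic))
```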

5.3. Rulesderivedfrom distance measures


The rules in this category are derived from distance measures between probabil-
ity functions. Axiomatic characterization of such distance measures--also known
as separability, divergence or discrimination measures--are described by Ali and
Silvey [2] and by Mathai and Rathie [41]. Table 6 contains a list of the distance
functions discussed below.
Two major types of feature evaluation rules may be identified in this category.
The first type is derived from distances between the class-conditional density
functions, while the second type is derived from distances between the class-
posterior probabilities. Let us start with the first type for the two-class case.
For the two-class case, let D(X) denote a distance function between P₁(x) and
P₂(x) (the conditional probabilities of a feature X under the two classes). A
feature evaluation rule derived from D(X) states that X is preferred to Y if
D(X) > D(Y). The rationale behind this rule is that the larger the distance
between P₁(x) and P₂(x), the easier it will be to distinguish between C₁ and C₂ by
observing X. A property common to most of the distance functions which justify
this rationale is that D(X) attains its minimum when P₁(x) = P₂(x) and attains its
maximum when P₁(x) is orthogonal to P₂(x). For some special cases it can be
proven that Pe(X) is a monotone decreasing function of D(X). Such is the case,
for instance, with Kullback-Leibler divergence in the case of a Gaussian X [22].
In most cases, however, such a relationship could not be found and the use of a
distance function between the conditional class distributions is justified by
showing that an upper bound on the probability of error decreases as this
distance increases.
For the multiclass case, i.e. m > 2, this approach may be extended by consider-
ing a weighted function of the distances between the class conditional densities
over all possible pairs of classes. That is to say, if dij(X ) is the distance between
class i and class j under feature X, then the discrimination power of X for the
entire set of classes is a function of

D(X) = Σᵢ₌₁ᵐ Σⱼ₌₁ᵐ πᵢπⱼ dᵢⱼ(X).                                (9)

This extension for the multiclass case was used by Fu, Min and Li [22] for the
Kullback-Leibler divergence measure, by Lainiotis [36] for the Bhattacharyya
distance, and by Toussaint [57] for the Kolmogorov variational distance.
The major disadvantage of this approach is that one large value of dᵢⱼ(X) may
dominate the value for D(X) and impose a ranking which reflects only the
distance between the two most separable classes.
An alternative approach, also based on the dᵢⱼ values, suggests preferring X to Y
if X discriminates 'better' between the most confusable pair of classes, i.e. if

min_{i,j} dᵢⱼ(X) > min_{i,j} dᵢⱼ(Y).                           (10)

By its definition, the drawback of this approach is that it takes into account the
distance between the closest pair only. Greetenberg [26] compares the
methods of (9) and (10).
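The two aggregation strategies (9) and (10) can be written down directly from a matrix of pairwise distances dᵢⱼ(X); the sketch below assumes such a symmetric m × m matrix d has already been computed by whatever pairwise measure is in use.

```python
import numpy as np

def weighted_pairwise(prior, d):
    """Eq. (9): D(X) = sum_i sum_j pi_i pi_j d_ij(X) for a symmetric distance matrix d."""
    return float(prior @ d @ prior)

def closest_pair(d):
    """Criterion (10): the smallest off-diagonal d_ij, i.e. the most confusable pair."""
    m = d.shape[0]
    return float(d[~np.eye(m, dtype=bool)].min())
```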
Another distance measure between the conditional density functions for the
multiclass case is Matusita's [42, 43] extension of the affinity measure,

M(X) = ∫[P₁(x)P₂(x)···Pₘ(x)]^{1/m} dx.                        (11)

Its relationship to the probability of error and to the average of the Kullback-
Leibler divergence over all the possible pairs of classes is discussed by Toussaint
[62-64]. An axiomatic characterization of this measure is given by Kaufman and
Mathai [31].
Glick's work [24] presents some general results concerning distance measures
and the probability of error.
All of the distance functions that are used to derive feature evaluation rules by
looking at the distance between P₁(x) and P₂(x) may also be used to derive
corresponding versions of these rules by looking at the expected distance between
π̃₁(x) and π̃₂(x) with respect to the mixed distribution of X.
Using this approach, it can be shown, see, e.g. [39], that for the two-class case
Kolmogorov distance is directly related to the probability of error by

Pe(X) = ½[1 − V(X)]                                           (12)

where V(X) is defined by

V(X) = ∫|π̃₁(x) − π̃₂(x)|P(x) dx.                              (13)
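The relation (12)-(13) is easy to verify numerically for a two-class discrete feature; the probabilities below are arbitrary illustrative values.

```python
import numpy as np

prior = np.array([0.4, 0.6])
cond = np.array([[0.7, 0.2, 0.1],          # P_1(x) over three values of X
                 [0.1, 0.3, 0.6]])         # P_2(x)
mix = prior @ cond                          # P(x), eq. (3)
post = prior[:, None] * cond / mix          # posterior probabilities
V = np.sum(mix * np.abs(post[0] - post[1])) # Kolmogorov distance (13)
Pe = np.sum(mix * (1.0 - post.max(axis=0))) # probability of error (2)
print(Pe, 0.5 * (1.0 - V))                  # both print 0.18, as (12) requires
```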

Following this approach Lissack and Fu [39] propose the |π̃₁ − π̃₂|^α distance, which is a


generalization of the Kolmogorov distance; Toussaint [59-61] deals with the diver-
gence measure, while Toussaint [63] deals with Matusita's affinity measure.
Except for the rule derived from the Kolmogorov distance, none of the rules is
known to be equivalent to the Pe rule for the general case. However, in some

Table 6
Distance measures on the prior and posterior class-probabilities

Affinity                     ∫ [π̃₁(x)π̃₂(x)···π̃ₘ(x)]^{1/m} P(x) dx
Bayesian                     ∫ [Σᵢ π̃ᵢ²(x)] P(x) dx
Directed divergence          ∫ [Σᵢ π̃ᵢ(x) log(π̃ᵢ(x)/πᵢ)] P(x) dx
Divergence of order α > 0    ∫ [(1/(α−1)) log Σᵢ π̃ᵢ(x)^α πᵢ^{1−α}] P(x) dx
Variance                     ∫ [Σᵢ πᵢ(π̃ᵢ(x) − πᵢ)²] P(x) dx

special cases these rules may be expressed in a closed form. Such is the case, for
instance, with multivariate Gaussian features and the Bhattacharyya, Kullback-
Leibler and Matusita distances, see [23].
Distance functions between the prior and posterior class probabilities have also
been proposed as a tool for feature evaluation. The rationale behind this approach
is that a feature which may change more drastically our prior assessment
concerning the true class is a better feature. In principle this approach is similar
to the information gain approach except that distance functions are used instead
of information functions. Several examples are included in Table 7.
Let us note that the directed divergence in Table 7 equals the information gain
obtained from Shannon's entropy in Table 5 [25]. This illustrates the duplication of
rules in different categories.

5.4. Rules derived from dependence measures


Dependence measures, also known as association measures or correlation
measures, between two random variables W and Z are designed to quantitate our
ability to predict the value of W by knowing the value for Z and vice versa. The
classical correlation coefficient, for instance, is a measure for linear dependence
between two random variables.
Feature evaluation rules derived from dependence measures are based on the
dependence between the random variable C which represents the true class and
the evaluated feature X. Denoting by R(X) a dependence measure between X and
C, we prefer feature X to feature Y if R(X) > R(Y).
As will be seen later, feature evaluation rules derived from dependence mea-
sures are closely related to feature evaluation rules derived from information
measures or divergence measures, and, in principle, category 3 could be divided
between categories 1 and 2. However, since conceptually they represent a differ-
ent viewpoint, it is worthwhile considering them as a different group.
An axiomatic approach to dependence measures was proposed by Renyi [46],
who set seven postulates for an appropriate measure of dependence between two


Table 8
Dependence measures expressed as distance measures between the
class-conditional probabilities and the mixed probability [66]

Bhattacharyya:  $-\sum_i \pi_i \log \left[\int \sqrt{P_i(x)\, P(x)}\, dx\right]$
Matusita:  $\sum_i \pi_i \left[\int \left(\sqrt{P_i(x)} - \sqrt{P(x)}\right)^2 dx\right]^{1/2}$
Joshi (Kullback-Leibler):  $\sum_i \pi_i \int P_i(x) \log \frac{P_i(x)}{P(x)}\, dx$
Kolmogorov:  $\sum_i \pi_i \int \left|P_i(x) - P(x)\right| dx$

random variables. Silvey [54] and Ali and Silvey [2] discuss general dependence
measures with respect to Renyi's postulates. The use of dependence measures for
feature evaluation in pattern recognition started with Lewis' [37] work, where he
used Shannon's mutual information for expressing the dependence between
features and classes. This measure is given by

$R(X) = \sum_{i=1}^{m} \int P(X, C_i) \log \frac{P(X, C_i)}{P(X)\, P(C_i)}\, dX$    (14)

where $P(C_i) = \pi_i$, $P(X, C_i) = \pi_i P_i(X)$ and $P(X) = \sum_{i=1}^{m} \pi_i P_i(X)$.
Considering (14) as a distance measure between the probability functions
P(X, C) and P(X)P(C), Vilmansen [66] proposes a set of dependence measures
for feature evaluation which are based on various distance functions. He shows
that these dependence measures attain their minimum when X and C are
statistically independent and attain their maximum when each value of X is
associated with one value of C, i.e. for every x there exists t such that $\pi_t(x) = 1$
and $\pi_i(x) = 0$, $i \neq t$. Using some algebraic manipulation on the original formulation
of these dependence measures, Vilmansen also shows that they may be
expressed as the average of the distance measures between each of the class-conditional
probabilities $P_i(X)$ and the mixed distribution P(X), see Table 8. All
these properties provide a solid justification for using dependence measures as a
tool for feature evaluation. Vilmansen's paper also contains error bounds by
means of these dependence measures which are based on the bounds of Table 7.
Let us note that the dependence measure based on the Kullback-Leibler diver-
gence, also known as the Joshi measure, is mathematically identical to the Kullback-
Leibler distance measure between the conditional probabilities [33].
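As a rough illustration of this category of rules, the sketch below (Python with NumPy assumed; the simulated data and feature names are invented) computes a plug-in estimate of the mutual-information measure R(X) of (14) for discrete-valued features and ranks candidate features by it.

```python
import numpy as np

def mutual_information(x, c):
    """Plug-in estimate of the Shannon mutual information of (14)
    for a discrete-valued feature x and class labels c."""
    x, c = np.asarray(x), np.asarray(c)
    mi = 0.0
    for xv in np.unique(x):
        p_x = np.mean(x == xv)
        for cv in np.unique(c):
            p_c = np.mean(c == cv)
            p_xc = np.mean((x == xv) & (c == cv))
            if p_xc > 0:
                mi += p_xc * np.log(p_xc / (p_x * p_c))
    return mi

# Rank candidate features by R(X): prefer X to Y whenever R(X) > R(Y).
# X1 agrees with the class 80% of the time; X2 is generated independently of it.
rng = np.random.default_rng(0)
c = rng.integers(0, 2, size=2000)
features = {
    "X1": np.where(rng.random(2000) < 0.8, c, 1 - c),
    "X2": rng.integers(0, 3, size=2000),
}
scores = {name: mutual_information(f, c) for name, f in features.items()}
print(sorted(scores, key=scores.get, reverse=True))   # expected: ['X1', 'X2']
```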

6. The use of error bounds

6.1. For assessing feature evaluation rules


The use of error bounds as a tool for assessing feature evaluation rules (e.g. [9])
is based on the intuition that the tighter the bounds, the closer is the rule to be
ideal in the sense discussed in Section 4 above. However, mathematical justifica-
tion for this intuition has not yet been fully established.

For practical purposes, ideal rules are not stringently required. For, if Pe(X) is
only slightly smaller than Pe(Y), then we are not too concerned if an alternative
feature evaluation rule prefers Y to X. However, for a given rule it is desirable to
know a priori how far it may deviate from being ideal. Namely, if a given rule
may prefer Y to X while in fact $P_e(X) \le P_e(Y)$, then we would like to know to
what extent X is better than Y by means of their Pe. The concept of ε-equivalence
[5] is designed to answer this question, and it also provides some further
justification for using the upper and lower bounds as indicators for deviation
from ideality.
Briefly, two features X and Y are said to be ε-equivalent if the difference
between their corresponding probabilities of error is less than ε, i.e. $|P_e(X) - P_e(Y)| < \varepsilon$.
A feature X is said to be ε-preferred to Y if $P_e(Y) > P_e(X) + \varepsilon$;
namely, the expected probability of error from X is not just smaller than that for
Y, it is smaller by more than ε. For a given ε, the grouping and ordering of
features by the ε-equivalence and ε-preference relations may be considered as a
distortion of the original grouping and ordering induced by the Pe rule. As ε goes
up, the degree of distortion increases, and at a certain level this distortion is
maximized by grouping all of the features into a single ε-equivalence group.
Considering the example of Section 2, if we are willing to tolerate differences of at
most ε = 0.014, then the original ordering induced by the Pe rule (Table 2) is
distorted as shown in Table 9. In this table we see that for ε = 0.014 the
ε-equivalence groups are {X2, X3, X6}, {X6, X5}, {X5, X1} and {X4}. Looking at
the ranking induced by the H rule (Table 4) we conclude that for this example
with ε = 0.014 the Pe rule and the H rule may be considered equivalent.
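The grouping step can be made concrete with a small sketch (Python) that forms the maximal ε-equivalence groups from a table of Pe values. The numbers are those quoted above, except that the value for X4 is not reproduced in this copy, so a placeholder is used for it.

```python
def epsilon_equivalence_groups(pe, eps):
    """Maximal groups of features whose Pe values differ pairwise by at most eps,
    taken along the Pe-sorted order."""
    items = sorted(pe.items(), key=lambda kv: kv[1])
    groups = []
    for i in range(len(items)):
        j = i
        while j + 1 < len(items) and items[j + 1][1] - items[i][1] <= eps:
            j += 1
        group = {name for name, _ in items[i:j + 1]}
        if not any(group <= g for g in groups):   # keep only maximal groups
            groups.append(group)
    return groups

# Pe values quoted in the example above; the value for X4 was not reproduced
# here, so 0.60 is a placeholder keeping X4 more than eps worse than X1.
pe = {"X2": 0.527, "X3": 0.532, "X6": 0.537, "X5": 0.550, "X1": 0.562, "X4": 0.60}
print(epsilon_equivalence_groups(pe, eps=0.014))
# maximal groups: {X2, X3, X6}, {X6, X5}, {X5, X1}, {X4}
```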
For a given feature evaluation function let $\underline{\varepsilon}$ denote the lowest ε for which,
whenever X is not preferred to Y by that rule, either Y is ε-equivalent to X or Y is
ε-preferred to X. The value of $\underline{\varepsilon}$ may serve as a measure for the deviation
of a given function from being an ideal feature evaluation function. The smaller $\underline{\varepsilon}$,
the closer the function is to being ideal. An important result is that $\underline{\varepsilon}$ is directly
related to the tightest lower and upper bounds on the probability of error by
means of the feature evaluation function. Let

$\bar{p}_e(u) = \sup\{P_e(X) \mid X \in F,\ U(X) = u\}$,    (15)
$\underline{p}_e(u) = \inf\{P_e(X) \mid X \in F,\ U(X) = u\}$,    (16)
$d(u) = \bar{p}_e(u) - \underline{p}_e(u)$.    (17)

Table 9
Feature ranking by the Pe rule when differences of at most ε = 0.014
are tolerated; π = (0.25, 0.25, 0.25, 0.25)

X2       X3       X6       X5       X1       X4
0.527    0.532    0.537    0.550    0.562    —



Then it can be proven [5] that

$\underline{\varepsilon} = \sup_u d(u)$.    (18)

Practically, this result suggests that the maximum difference between the upper
and lower bounds is the tolerance level for a given feature evaluation function.
Table 1 from Ben-Bassat [5] lists several values of $\underline{\varepsilon}$ for the quadratic and
Shannon entropy functions. This table demonstrates that, for the two-class case, if we are
willing to consider two features as equal as long as the difference between their
corresponding probabilities of error is no more than 0.162, then Shannon's
entropy can be used to replace the Pe function in every problem. In most practical
problems a much lower tolerance level is required for exchanging the two rules.

6. 2. For estimating the probability of error


When the calculation of the expected probability of error is impractical, error
bounds derived from distance and information measures may be used to estimate
it. For this purpose two precautions must be noted. First, in order to be useful the
bounds must be tight. It can easily be shown that for every X, $0 \le P_e(X) \le 1 - 1/m$,
and therefore any bound outside the range $[0, 1 - 1/m]$ is useless. Unfortunately,
many of the published upper bounds for the multiclass case are numerically
useless since they are greater than 1, see [7]. Second, calculation of the bounds
must be practical. Again, the calculation of many of the published bounds is not
simpler than the calculation of Pe.
Recent representative references concerning the use of error bounds in para-
metric and non-parametric error estimation include [19, 28, 39].
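As a minimal illustration of this use of bounds, the following sketch (Python with SciPy assumed; the parameter values are invented) compares the Bhattacharyya-based upper and lower bounds on Pe, of the kind discussed in, e.g., [29], with the exact Bayes error for a two-class univariate Gaussian problem with equal priors and equal variances. Both bounds here stay inside the useful range [0, 1 - 1/m].

```python
import numpy as np
from scipy.stats import norm

# Two univariate Gaussian classes with equal priors and equal variance sigma.
pi1 = pi2 = 0.5
mu1, mu2, sigma = 0.0, 2.0, 1.0

# Bhattacharyya distance for the equal-variance univariate Gaussian case.
B = (mu2 - mu1) ** 2 / (8 * sigma ** 2)
rho = np.exp(-B)                                  # Bhattacharyya coefficient

upper = np.sqrt(pi1 * pi2) * rho                  # Pe <= sqrt(pi1*pi2) * rho
lower = 0.5 * (1 - np.sqrt(1 - 4 * pi1 * pi2 * rho ** 2))

# Exact Bayes error for this simple case (threshold midway between the means).
pe_exact = norm.cdf(-abs(mu2 - mu1) / (2 * sigma))

print(f"lower bound {lower:.4f} <= Pe {pe_exact:.4f} <= upper bound {upper:.4f}")
```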

7. Summary

The theoretical and experimental findings so far may be summarized as follows.
(1) Feature evaluation rules which induce over the set of all of the potential
features the same ranking induced by the Pe rule probably do not exist for the
general case [7]. If one is still interested in such a rule, it should be searched under
specific assumptions on the nature of the class-conditional probabilities of the
features.
(2) Via the framework of ε-equivalence [5], error bounds may serve as a useful
tool for measuring the deviation between the rankings induced by the Pe rule and
any other rule. Similar frameworks may be used to compare any pair of feature
evaluation rules.
(3) Most of the works that experimented with various feature evaluation rules
conclude that the feature rankings induced by the various rules are very similar.
Using Spearman's rank correlation coefficient as a measure of rankings similarity
and experimenting with Munson's handprinted character data, Toussaint [58]

compared the quadratic and Shannon's information gain rules and obtained 0.947
correlation; Vilmansen [66] compared various dependence measures and obtained
correlations above 0.96, and Backer and Jain [3] compared eleven rules from the
three categories and obtained correlations above 0.84 with most of them above
0.91. These findings suggest that if we decide to avoid the Pe rule, then computa-
tional efficiency should be the key factor in determining the feature evaluation
rule to be used.
(4) Using the probability of error as the evaluation function of individual
features in suboptimal algorithms for subset selection (forward, backward or
other) does not ensure that the resulting subset will be even close to optimal [13].
The author is not aware of experiments with these algorithms which used
evaluation functions other than Pe. The findings of the previous paragraph suggest
that not much difference will be detected among the various feature evaluation
rules, but perhaps they may perform better than the Pe rule.
(5) Experiments with sequential feature selection under myopic policy were
reported in [4] for the case of conditionally independent binary features. The
conclusion from these experiments using many simulated data sets was that no
rule is consistently superior to the others, and that no specific strategy for
alternating the rules seems to be significantly more efficient.
(6) In the past, computational difficulties with the Pe rule constituted the major
reason for recommending a substitute rule. In the author's opinion the insensitivity
of the Pe rule is of much greater significance, and justifies avoiding the Pe rule even when
a substitute rule does not offer any computational advantage. As was pointed out
in [8], the conditions under which the Pe rule becomes highly insensitive are quite
often satisfied when one class has a relatively high prior probability compared to
the other classes. In this case the use of a substitute rule is highly recommended
under all circumstances, i.e., for evaluating individual features in an algorithm for
subset selection, or for evaluating individual features in sequential feature selec-
tion.

References

[1] Aczel, J. and Daroczy, Z. (1975). On Measures of Information and Their Characterization.
Academic Press, New York.
[2] Ali, S. M. and Silvey, S. D. (1966). A general class of coefficients of divergence of one
distribution from another, J. Royal Statis. Soc. Ser. B. 28, 131-142.
[3] Backer, E. and Jain, A. K. (1976). On feature ordering in practice and some finite sample effects.
Proc. Third Internat. Joint Conf. Pattern Recognition 45-49. San Diego, CA.
[4] Ben-Bassat, M. (1978a). Myopic policies in sequential classification. IEEE Trans. Comput. 27, 170-174.
[5] Ben-Bassat, M. (1978b). ε-equivalence of feature selection rules. IEEE Trans. Inform. Theory 24, 769-772.
[6] Ben-Bassat, M. (1978c). f-entropies, probabilities of error and feature selection. Inform. and
Control 39, 227-242.
[7] Ben-Bassat, M. and Raviv, J. (1978). Renyi's entropy and the probability of error. IEEE Trans.
Inform. Theory 24, 324-331.

[8] Ben-Bassat, M. (1980). On the sensitivity of the probability of error rule for feature selection.
IEEE Trans. Pattern Anal. Machine Intell. 2, 57-60.
[9] Chen, C. H. (1971). Theoretical comparison of a class of feature selection criteria in pattern
recognition. IEEE Trans. Comput. 20, 1054-1056.
[10] Chen, C. H. (1976). On information and distance measures, error bounds and feature selection.
Inform. Sci. 10, 159-173.
[11] Cover, T. M. (1974). The best two independent measurements are not the two best. IEEE Trans.
Systems Man Cybernet. 4, 116-117.
[12] Cover, T. M. and Hart, P. E. (1967). Nearest neighbor pattern classification. IEEE Trans.
Inform. Theory 13, 21-27.
[13] Cover, T. M. and Van Campenhout, J. M. (1977). On the possible orderings in the measurement
selection problem. IEEE Trans. Systems Man Cybernet. 7, 657-660.
[14] Daroczy, Z. (1970). Generalized information functions. Inform. and Control 16, 36-51.
[15] DeGroot, M. (1962). Uncertainty, information and sequential experiments, Ann. Math. Statist.
33, 404-419.
[16] DeGroot, M. (1970). Optimal Statistical Decisions. McGraw-Hill, New York.
[17] Devijver, P. A. (1973). On a class of bounds on Bayes risk in multihypothesis pattern
recognition, IEEE Trans. Comput. 23, 70-80.
[18] Devijver, P. A. (1977). Entropies of degree fl and lower bounds for the average error rate.
Inform. and Control 34, 222-226.
[19] Devijver, P. A. (1978). Nonparametric estimation by the method of ordered nearest neighbor
sample sets. Proc. Fourth Internat. Joint. Conf. Pattern Recognition, 217-223.
[20] Duin, R. P. W. and van Haresma Buma, C. E. (1974). Some methods for the selection of
independent binary features. Proc. Second Internat. Joint Conf. Pattern Recognition, 65-70.
Copenhagen, Denmark.
[21] Elashoff, J. D., Elashoff, R. M. and Goldman, G. E. (1971). On the choice of variables in
classification problems with dichotomous variables. Biometrika 54, 668-670.
[22] Fu, K. S., Min, P. J. and Li, T. J. (1970). Feature selection in pattern recognition. IEEE Trans.
Systems Sci. Cybernet. 6, 33-39.
[23] Fukunaga, K. (1972). Introduction to Statistical Pattern Recognition. Academic Press, New York.
[24] Glick, N. (1973). Separation and probability of correct classification among two or more
distributions. Ann. Inst. Statist. Math. 25, 373-382.
[25] Good, I. J. and Card, W. I. (1971). The diagnostic process with special reference to errors. Math.
Inform. Med. 10, 176-188.
[26] Grettenberg, T. L. (1963). Signal selection in communication and radar systems. IEEE Trans. Inform. Theory 9, 265-275.
[27] Hellman, M. E. and Raviv, J. (1970). Probability of error, equivocation and the Chernoff bound.
IEEE Trans. Inform. Theory 16.
[28] Jain, A. K. (1976). On an estimate of the Bhattacharyya distance. IEEE Trans. Systems Man
Cybernet. 6, 763-766.
[29] Kailath, T. (1967). The divergence and Bhattacharyya distance in signal selection. IEEE Trans.
Comm. Tech. 15, 52-60.
[30] Kanal, L. (1974). Patterns in pattern recognition: 1968-1974. IEEE Trans. Inform. Theory 20, 697-722.
[31] Kaufman, H. and Mathai, A. M. (1973). An axiomatic foundation for a multivariate measure of
affinity among a number of distributions. J. Multivariate Anal. 3, 236-242.
[32] Kittler, J. (1975a). Mathematical methods of feature selection in pattern recognition. Internat. J.
Man-Mach. Stud. 7, 609-637.
[33] Kittler, J. (1975b). On the divergence and Joshi dependence measure in feature selection.
Information Processing Lett. 3, 135-137.
[34] Kovalevsky, V. A. (1968). The problem of character recognition from the point of view of
mathematical statistics. In: V. A. Kovalevsky, ed., Character Readers and Pattern Recognition.
Spartan, New York.

[35] Kullback, S. and Leibler, R. A. (1951). Information and sufficiency. Ann. Math. Statist 22,
79-86.
[36] Lainiotis, D. G. (1969). A class of upper bounds on probability of error for multihypothesis pattern recognition. IEEE Trans. Inform. Theory 15, 730-731.
[37] Lewis, P. M. (1962). The characteristic selection problem in recognition systems. IEEE Trans.
Inform. Theory 8, 161-171.
[38] Lindley, D. V. (1956). On a measure of the information provided by an experiment. Ann. Math.
Statist. 27, 986-1005.
[39] Lissack, T. and Fu, K. S. (1976). Error estimation in pattern recognition via L"-distance between
posterior density functions. IEEE Trans. Inform. Theory 22, 34-45.
[40] Marill, T. and Green, D. M. (1963). On the effectiveness of receptors in recognition systems.
IEEE Trans. Inform. Theory 9, 11-17.
[41] Mathai, A. M. and Rathie, P. N. (1975). Basic Concepts in Information Theory and Statistics.
Wiley, New York.
[42] Matusita, K. (1967). On the notion of affinity of several distributions and some of its
applications. Ann. Inst. Statist. Math. 19, 181-192.
[43] Matusita, K. (1973). Discrimination and the affinity of distributions. In: T. Cacoullos, ed.,
Discriminant Analysis and Applications, 213-223. Academic Press, New York.
[44] Mucciardi, A. N. and Gose, E. E. (1971). A comparison of seven techniques for choosing subsets
of pattern recognition properties. IEEE Trans. Comput. 20, 1023-1031.
[45] Narendra, P. M. and Fukunaga, K. (1977). A branch and bound algorithm for feature subset
selection. IEEE Trans. Comput. 26, 917-919.
[46] Renyi, A. (1959). On measures of dependence. Acta Math. Acad. Sci. Hungar. 10, 441-451.
[47] Renyi, A. (1960). On measures of entropy and information, Proc. Fourth Berkeley Symposium
Math. Statist. and Probability 1, 547-561.
[48] Renyi, A. (1964). On the amount of information concerning an unknown parameter in a
sequence of observations. Publ. Math. Inst. Hungar. Acad. Sci. 9, 617-624.
[49] Renyi, A. (1966). On the amount of missing information and Neyman-Pearson lemma. In F. N.
David, ed., Research Papers in Statistics, 281-288. Wiley, New York.
[50] Renyi, A. (1967a). On some problems of statistics from the point of view of information theory.
Proc. Fifth Berkeley Symposium on Math. Statist. 531-543.
[51] Renyi, A. (1967b). Statistics and information theory. Studia Sci. Math. Hungar. 2, 249-256.
[52] Renyi, A. (1970). Probability Theory. North-Holland, Amsterdam.
[53] Shannon, C. (1948). A mathematical theory of communication. Bell System Tech. J. 27, 379-423.
[54] Silvey, S. D. (1964). On a measure of association. Ann. Math. Statist. 35, 1157-1166.
[55] Stearns, S. D. (1976). On selecting features for pattern recognition. Proc. Third Internat. Joint
Conf. Pattern Recognition, 245-248. San Diego, CA.
[56] Toussaint, G. T. (1971a). Some upper bounds on error probability for multiclass pattern recognition. IEEE Trans. Comput. 20, 943-944.
[57] Toussaint, G. T. (1971b). Note on the optimal selection of independent binary valued features for pattern recognition. IEEE Trans. Inform. Theory 17, 618.
[58] Toussaint, G. T. (1972). Feature evaluation with quadratic mutual information. Information
Processing Lett. 1, 153-156.
[59] Toussaint, G. T. (1974a). Recent progress in statistical methods applied to pattern recognition.
Proc. Second Internat. Joint Conf. Pattern Recognition, 479-488.
[60] Toussaint, G. T. (1974b). On the divergence between two distributions and the probability of
misclassification of several decision rules. Proc. Second Internat. Joint Conf. Pattern Recognition.
[61] Toussaint, G. T. (1974c). On information transmission, nonparametric classification and mea-
suring dependence between random variables. In: Proc. Symp. Statist., Related Topics. Carleton
University, Canada.
[62] Toussaint, G. T. (1977). An upper bound on the probability of misclassification in terms of the
affinity. Proc. IEEE 65, 275-276.

[63] Toussaint, G. T. (1978a). Probability of error, expected divergence and the affinity of several
distributions. IEEE Trans. Systems Man Cybernet. 8, 482-485.
[64] Toussaint, G. T. (1978b). Probability of error and equivocation of order a. Unpublished
manuscript.
[65] Vajda, I. (1968). Bounds on the minimal error probability and checking a finite or countable
number of hypotheses. Inform. Transmis. Problems 4, 9-19.
[66] Vilmansen, T. R. (1973). Feature evaluation with measures of probabilistic dependence. IEEE Trans. Comput. 22, 381-388.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 793-803

Topics in Measurement Selection

Jan M. Van Campenhout

1. Introduction

One of the major problems one encounters during the design phase of an
automatic pattern recognition system is the identification of a good set of
measurements. These measurements, to be performed on future unclassified
patterns, should enable the recognition system to classify the patterns as correctly
as possible. At the same time, the cost of acquisition and processing of the
measurements in the classifier should be kept as low as possible.

1.1. Measurement identification and selection


The identification of such a measurement set is generally carried out in two
phases. The first phase, pattern analysis, uses a variety of techniques that allow
the designer to explore raw pattern data, and to infer some of its structure. The
information so obtained enables him to establish a set of measurements with
potentially high discriminating power.
In the second phase, commonly called 'feature selection' or 'measurement
selection', the preliminary measurement set must be reduced in size so as to meet
the cost/performance trade-off mentioned above.
Various techniques exist to achieve the data reduction called for. One can
characterize these techniques either according to the way in which the data
reduction was achieved, or according to the purpose of the reduced data.
According to the means of data reduction one distinguishes between a mapping
category and a selection category. In the mapping category the reduced data
consists of maps or functions of potentially all raw measurements. No reduction
in data acquisition effort is aimed for. Rather, emphasis is put on the ease and
performance of postprocessing the reduced data. On the other hand, in the
selection category one simply discards some of the raw measurements, leading to
economies in data acquisition after the design phase is terminated.
According to the usage of the reduced measurement data one distinguishes
between a representation category and a discrimination category. The first category


emphasizes low-complexity data representations that most faithfully represent the


original (complex) data. This representation is sometimes at the expense of class
discrimination. On the other hand, the second category puts emphasis on the
class separability after data reduction. The most frequently used criterion for class
separability is the rate of correct separation of a classifier designed with the
reduced data.
This paper considers the measurement selection techniques of the discrimination
category that use the selection method as a means for data reduction. In this
sub-area we are concerned with the following fundamental question.

QUESTION. Consider the Bayesian classification problem where the random


variables $X_1, \ldots, X_n$ are pattern measurements and the random variable C is the
unobservable pattern class membership. For simplicity we assume a two-class
problem, that is, $C \in \{c_1, c_2\}$. Given that it is only feasible to use at most k < n
measurements in a classification rule, which subset of $\{X_1, \ldots, X_n\}$ should be used
so that the Bayes misclassification probability

$P_e(S) = P\{C \neq C^*(S)\}$    (1)

is minimal? Here $C^*(S)$ represents the Bayes decision with respect to a probability-of-error
loss function, using only the measurements in S; $P_e(S)$ represents the
corresponding risk.

The naive solution to this problem is to try all possible k-element subsets of
$\{X_1, \ldots, X_n\}$, and then choose the subset with the lowest misclassification proba-
bility. Unfortunately, neither of the steps of the above 'solution' is actually
feasible in practice. To begin with, both the number of possible measurements n
and the number of measurements to be retained, k, are typically so large that an
exhaustive investigation of all the k-element measurement subsets is ruled out.
Furthermore, the determination of the Bayes rule and its corresponding risk
require knowledge of the underlying statistical structure. This knowledge we
seldom have, and even if we do, computing the Bayes risk for large values of k can be
a very difficult task.
Therefore, the naive approach is rarely used, except perhaps in the related
problem of measurement selection for regression analysis (Furnival, 1974). Here
the numbers n and k are typically much smaller than in pattern recognition, and
the figure of merit of the measurement subsets, the residual sum of squares, is
relatively easy to compute compared to the Bayes risk in classification problems.
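To make the naive solution concrete, the following sketch (Python; the error probabilities are invented, and in practice `error_prob` would itself have to be estimated) enumerates all k-element subsets and also shows how quickly the number of such subsets grows.

```python
from itertools import combinations
from math import comb

def best_k_subset(measurements, k, error_prob):
    """Naive exhaustive selection: evaluate Pe(S) for every k-element subset S
    and keep the minimiser. `error_prob` must return (an estimate of) the
    misclassification probability of a subset; obtaining it is the hard part."""
    return min(combinations(sorted(measurements), k), key=error_prob)

# Toy illustration with made-up error probabilities for 3 measurements.
pe = {("X1",): 0.40, ("X2",): 0.35, ("X3",): 0.30,
      ("X1", "X2"): 0.10, ("X1", "X3"): 0.22, ("X2", "X3"): 0.18}
print(best_k_subset(["X1", "X2", "X3"], 2, lambda s: pe[s]))   # ('X1', 'X2')

# The number of k-element subsets rules this approach out for realistic n and k:
print(f"{comb(50, 10):,} subsets of size 10 out of 50 measurements")
```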

1.2. Selection algorithms and their performance


In the pattern recognition field one takes two approaches to lessen the burden
of an exhaustive search. The first approach makes use as much as possible of the
prior structural information one has about Pe(S), viewed as a function of the
subsets S of {X1,...,Xn}. This information, together with the information ob-
tained so far by probing measurement subsets, can enable the search algorithm to

preclude large numbers of subsets from its subsequent search. For instance,
suppose that the measurement subsets S1 and S2 have been investigated, and also
suppose that Pe(S1) < Pe(S2). Then it is needless to investigate any of the subsets
of S2, because these subsets are known to be no better than S2, and hence worse
than S1. Dramatic savings in
search effort have been reported with these "Branch and Bound Algorithms", in
both pattern recognition (Narendra and Fukunaga, 1977) and regression analysis
(Furnival, 1974).
The second approach uses a heuristic reasoning to limit the search path to a
small fraction of the k-element measurement subsets. Various such algorithms
exist and have been in use for a long time, e.g. the forward and backward
sequential search techniques (Stearns, 1976). The forward sequential search
algorithm uses the heuristic that the best (k+1)-element subset frequently
contains the best k-element subset. The algorithm then first finds the best
individual measurement and then proceeds by adding to this set the conditionally
best element not yet in the set. Unfortunately, as was soon realized, the heuristic
search algorithms do not necessarily provide us with the best measurement subset
of its size. In fact, it has been conjectured (Kanal, 1974) that the only optimal
search method, i.e., a method guaranteeing to find the best k-element subset no
matter what the underlying distribution is, necessarily has to perform the exhaus-
tive search of the naive solution. Despite this conjecture some researchers claimed
a near-optimal performance of the heuristic algorithms, stating that the measure-
ment subsets selected by these algorithms, although perhaps not the best, are
always very good (Chen, 1975).
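A minimal sketch of the forward sequential search heuristic follows (Python; the error probabilities are invented and stand in for whatever evaluation function is actually plugged in). On these particular numbers the greedy path misses the best pair, illustrating the point made above.

```python
def forward_sequential_search(measurements, k, error_prob):
    """Greedy forward selection: start from the best single measurement and
    repeatedly add the measurement that most reduces the (estimated) error.
    `error_prob` maps a frozenset of measurements to its error probability."""
    selected = set()
    while len(selected) < k:
        candidates = [m for m in measurements if m not in selected]
        best = min(candidates, key=lambda m: error_prob(frozenset(selected | {m})))
        selected.add(best)
    return selected

# Toy usage with made-up error probabilities.
pe = {frozenset({"X1"}): 0.40, frozenset({"X2"}): 0.35, frozenset({"X3"}): 0.30,
      frozenset({"X1", "X2"}): 0.10, frozenset({"X1", "X3"}): 0.22,
      frozenset({"X2", "X3"}): 0.18}
print(forward_sequential_search(["X1", "X2", "X3"], 2, pe.__getitem__))
# greedy picks X3 first, then X2, and so misses the best pair {X1, X2}
```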

1.3. Some new results


In what follows we will investigate some properties of Pe(S) viewed as a
function of the measurement subsets S, when the underlying statistics of the
problem are kept the same. Based on these investigations we will formulate the
following conclusions:
(1) "More is Better". Increasing the amount of information in a measurement
subset through enlarging its size or complexity never worsens the error probability
of a truly Bayesian classifier. In the case of a known probability structure this is a
well-known consequence of the optimality of the Bayes decision rule. However,
when the underlying probability structure is only partially known (and thus when
the Bayes rule must be designed using a training set), a seemingly paradoxical
nonmonotone behavior of the Bayes risk has been reported (the Hughes Paradox
(Hughes, 1968)). We show that the peaking behavior arises from comparisons of
inconsistently specified stochastic models, and thus that these models are not
simulating the peaking observed in practice.
(2) Optimal selection algorithms are exhaustive. In general, i.e., when no
restrictions are imposed on the structure of the measurements and the underlying
statistics, the Bayes probability of error Pe(S) can induce any ordering on the
subsets of $\{X_1, \ldots, X_n\}$, provided that this ordering satisfies the above monotonic-
ity constraint (i.e., a measurement set is never worse than any of its subsets).

Moreover, no other restrictions exist on the numbers Pe(S). Consequently:


(a) No search algorithm can guarantee to always find the optimal
k-element measurement subset unless it is worst-case exhaustive. Thus
in some cases all k-element subsets must be investigated.
(b) Heuristic algorithms may select measurement subsets that are
arbitrarily bad compared to the true optimum.
In Section 2 we indicate why Hughes' stochastic models show a peaking of the
problem-average recognition rate and in Section 3 a closer look is taken at the
almost unrestricted relationship between measurement subsets and their Bayes
error probability.

2. The monotonicity of the Bayes risk

2.1. The problem of unknown statistics


In the quest for a good measurement subset one rarely knows the underlying
probability structure (if any) of the classification problem. Thus, not only do we
not know the Bayes risk of the selected measurement subset, but we even have no
means to determine the Bayes decision rule! The standard approach then is to
collect a training set or test sample, such as

$T = \left((X_{11}, \ldots, X_{n1}),\, (X_{12}, \ldots, X_{n2})\right),$    (2)

which consists of n measurements $X_{11}, \ldots, X_{n1}$ from class $c_1$ and n measurements
$X_{12}, \ldots, X_{n2}$ from class $c_2$.
This data, as well as any other prior information about the underlying
statistical structure, is then used to obtain a decision rule to classify future
patterns. Many techniques exist to obtain a decision rule from the test data T. For
example, the two-step procedures, which first estimate the prior class probabilities
$P(C = c_i)$ and the class-conditional measurement distributions $F(X \mid C = c_i)$ from
the training set, and then compute a Bayes decision rule with respect to these
estimates. Other well-known methods include nearest neighbor classification and
linear discriminants (Duda and Hart, 1973).
In general these rules, when designed using finite training set sizes, are not
optimal, although some (e.g. the two-step procedures that use consistent esti-
mates) are asymptotically optimal.
An annoying consequence of this suboptimality is that the information con-
tained in a new measurement is not always used in the best possible way for class
discrimination. In fact, extra information, as provided by, e.g., more complex
measurements, might even reduce the discriminating power of the decision rule.
In the literature many cases are reported where, given a finite training set of
vector-valued measurements, at first (i.e., for low measurement dimensionality)
the performance of the decision rule improves, but from a certain dimensionality
on, rules designed with this dimensionality have increasingly poorer performance.

So the following question arises: given a finite training set, how complex can the
measurements be before the performance degrades?

2.2. The Hughes paradox


Many researchers have tried to capture this 'optimal measurement complexity'.
Probably the first to do so was G. F. Hughes (1968), who analyzed the following
two-step procedure. Let X be the (real-valued) measurement and assume that it
has (unknown) class-conditional probability density functions fl(x) = f ( x IC = c 1)
and fz(x)=f(x[C=c2) respectively. A training set T is obtained from both
classes. In order to estimate fl(x) and f2(x) a finite partition is introduced on the
measurement space, and training set histograms are computed over this partition.
Then a Bayes decision rule is computed with respect to the histograms and the
known prior class probabilities.
Note that, unless we know the actual class-conditional cell probabilities, the
determination of the error probability of this rule, either given the training set or
averaged over all possible training sets, remains impossible. Instead of consider-
ing each individual case separately, Hughes takes a global approach. He models
his lack of knowledge about the class-conditional cell probabilities p =
((P 11,... ,Pro1),( P 12
.....Pm2)) by letting these quantities be random variables with
a known prior distribution F(p).
In this way the determination of the problem-average error rate becomes
possible. Hughes computed the value of this error rate as a function of the
training set size n and the number of histogram cells m. It turned out, seemingly
in accordance with experimental evidence, that the problem-average misclassifica-
tion is not monotone in the number of cells m, but reaches a minimum for some
finite cell number depending on the training set size. Hughes concluded that "if
insufficient data are available to estimate the pattern probabilities accurately,
then a Bayes recognizer is not necessarily optimal". As we observed earlier, after a
careful analysis of Hughes' paper, one could attribute his peaking result to the
fact that he indeed is not using globally optimal decision rules, that is, with
respect to all available structural and statistical prior information his rules are not
Bayesian. Soon this was realized and corrections to Hughes' analysis were
published (Abend and Harley, 1969; Chandrasekaran and Harley, 1969) to avoid
this pitfall.., but even with truly Bayesian procedures the peaking phenomenon
did not disappear! Hence the following apparent paradox arose: as the Bayes rule
is optimal, how can a finer partitioning of the measurement space be detrimental
to the recognition accuracy? After all, by merging cells the fine partition can be
reduced to the coarse partition, and thus the class of rules using the fine partition
contains the Bayes rule based on the coarse partition. Hence, the best rule in this
class cannot possibly perform worse than the coarse Bayes rule!
Several workers (Chandrasekaran, 1971; Waller and Jain, 1977) have tried to
explain the Hughes peaking phenomenon by considering models in which the
components of the measurement vector have varying degrees of class-conditional
dependence. The rationale behind their research is that the complexity of estimat-

ing the class-conditional measurement distribution depends strongly on the


degree of dependence among the measurement components and hence that
obtaining good distribution estimates in a high-dependence case (such as the
Hughes case) requires a much larger test sample than in a low-dependence case.
While this reasoning is certainly applicable in non-Bayesian two-step procedures,
it does not seem appropriate in a truly (overall) Bayesian approach, since the
optimality of the Bayes rule is not contingent on feature dependencies. In the next
section we will indicate how the peaking of the problem-average misclassification
rate in the Hughes model arises through incompatibilities in the model itself,
rather than through the sought-for cause of peaking in real-life situations.

2.3. On statistical consistency: The key to the Hughes paradox


The Hughes paradox arises from the logical contradiction in the following
statements.
(a) The problem-average error rate peaks with increasing complexity.
(b) The Bayes rule is optimal.
(c) More information becomes available when the measurement complexity
increases.
Clearly, statement (a) is an undeniably correct result of calculations and
statement (b) is a well-established theoretical result. Hence let us focus our
attention on statement (c). Whereas (a) and (b) have a precise mathematical
meaning, statement (c) is used rather informally. The notion of "amount of
information" is used in a very intuitive way. No ready-made measure of informa-
tion is available that captures the complex aggregate of (i) prior structural and
statistical information, (ii) the information contained in the training set, and (iii)
the pattern measurement into one figure of merit that would allow us to make
comparisons. So how can we tell whether a situation A provides more information
than a situation B? One good way of answering this question is to see whether the
information provided by B could also be obtained from the information provided
by A, for all elements that contribute to the information. Then we can indeed say
that A has more information than B and not only other information.
More formally, let $p^A$, $X^A$, $T^A$ be the distribution parameter vector, the pattern
measurement, and the training set respectively in case A, and let $p^B$, $X^B$, $T^B$ have
the corresponding meaning for case B. For comparability we require that there
exist functions $g_1$, $g_2$ such that

$X^B = g_1(X^A, T^A), \qquad T^B = g_2(X^A, T^A),$    (3)

and furthermore that

$F(X^B, T^B, C) = F\left(g_1(X^A, T^A),\, g_2(X^A, T^A),\, C\right),$    (4)

that is, we can map the measurements in A into the measurements in B in a way
that agrees with the statistical structure of B.

If case A and case B are related through (3) and (4), then one can easily prove
that $P_e(X^A) \le P_e(X^B)$ (Van Campenhout, 1978).
In the Hughes case the functions $g_1$ and $g_2$ could be defined as the merging of
cells in A's partitioning to obtain B's partitioning. But then the question must be
asked whether (4) is satisfied. The distribution $F(X^B, T^B, C)$ is the mixture distribution

$F(X^B, T^B, C) = \int F(X^B, T^B, C \mid p^B)\, dF(p^B)$    (5)

which clearly depends on the parametrization $p^B$ and its prior distribution $F(p^B)$.
Consequently, once $p^A$ and $F(p^A)$ are determined, the choice of $g_1$ and $g_2$ imposes
restrictions on the choice of $p^B$ and $F(p^B)$ if (4) is to be satisfied. Hughes does not
take into account any such comparability requirements. Instead he freely chooses
$p_1 = (p_{11}, \ldots, p_{m1})$ and $p_2 = (p_{12}, \ldots, p_{m2})$ to be independent and uniformly dis-
tributed over the parameter simplices $\sum_i p_{i1} = 1$ and $\sum_i p_{i2} = 1$, respectively, in both
the A and B case. By doing so, he actually changes the amount of prior
information contained in F(p).
One can satisfy the comparability requirement if one obtains $p^B$ from $p^A$ by the
map $g(p^A)$ that adds the cell probabilities of cells of partition A that are
merged. We then require that $F(p^B) = F(g(p^A))$. It is easy to verify that if $p^A$ has
the uniform simplex distribution specified by Hughes, then $p^B$ cannot have a
uniform simplex distribution but rather has a Dirichlet distribution.
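The Dirichlet claim is easy to check numerically. The sketch below (Python with NumPy and SciPy assumed; the choice of a four-cell partition merged into three cells is only for illustration) draws p^A from the uniform prior on the simplex, merges the first two cells, and compares the resulting marginal with the Beta(2, 2) marginal of a Dirichlet(2, 1, 1) distribution and with the Beta(1, 2) marginal that a uniform three-cell prior would have.

```python
import numpy as np
from scipy import stats

# If p^A = (p1, ..., p4) is uniform on the simplex (Dirichlet(1,1,1,1)), then
# merging the first two cells gives p^B = (p1 + p2, p3, p4), whose first
# coordinate follows Beta(2, 2) -- the marginal of a Dirichlet(2, 1, 1) prior --
# and not the Beta(1, 2) marginal of a uniform three-cell prior.
rng = np.random.default_rng(0)
pA = rng.dirichlet([1, 1, 1, 1], size=200_000)   # uniform prior on the 4-cell simplex
merged = pA[:, 0] + pA[:, 1]                     # first cell of the merged partition

grid = np.linspace(0.01, 0.99, 9)
empirical = np.array([np.mean(merged <= g) for g in grid])
print(np.max(np.abs(empirical - stats.beta.cdf(grid, 2, 2))))  # small: matches Beta(2, 2)
print(np.max(np.abs(empirical - stats.beta.cdf(grid, 1, 2))))  # large: not the uniform marginal
```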
So we conclude that the peaking established by the (modified) Hughes model
arises from an inconsistency in the specification of the models to be compared:
while the information in the measurements X increases by refining the partition-
ing of the measurement space, at the same time the amount of information
brought in through the prior parameter distribution F(p) (which, notably, reflects
our ignorance!), is reduced. Hence Hughes' model trades prior information for
measurement information. The existence of an optimal trade-off causes peaking.

2.4. The monotonicity of the Bayes risk: Conclusions and problems


The above analysis allows us to conclude that the recognition rate of a truly
Bayesian rule is monotonic when the measurement subset is enlarged. If the
statistical structure is known, the above statement holds for the recognition rate
of the individual problem at hand, whereas in the case of unknown statistics it
holds for the problem average recognition rate.
The reader should be cautioned against possible misinterpretations of the above
conclusions. In particular we remark:
(a) The decision rule need not be optimal for any particular distribution taken
from the class of possible measurement distributions. Hence, when one applies
the Bayesian problem-average decision rule to any fixed pair of class-conditional
measurement distributions from the set of possible distributions, peaking of the
misclassification rate (either for a particular training set or its expected value over
all training sets from this distribution) remains possible.

(b) The above analysis does not deal with typical situations encountered in
pattern recognition practice, where Bayes rules can simply not be obtained by
lack of statistical information. Determining the optimal measurement complexity
in practical situations remains a problem.

3. The arbitrary relation between probability of error and measurement subset

In Section 2 we concluded that a measurement set S is never worse than any of


its subsets. However, no such a priori conclusions could be made concerning
incomparable (non-nested) measurement sets. Unfortunately, in the search for
the best k-element measurement subset, precisely those comparisons are made by
selection algorithms. Thus any such comparison can only be performed through
the actual evaluation of the error probabilities of the subsets to be compared.
Hence, if we are able to derive (distribution-free) relationships between the
measurement subsets and their associated error probabilities, potentially great
savings could be made in the computational complexity of measurement selection
algorithms.

3.1. The general ordering theorem


We want to investigate which restrictions exist on the error probability Pe(S)
viewed as a function of the measurement subsets. One restriction we already
know is the monotonicity constraint $S' \subseteq S \Rightarrow P_e(S') \ge P_e(S)$. Measurement subset
orderings that satisfy this constraint are called isotone. Are there any other
restrictions on the possible measurement orderings? Our main result is negative.

THEOREM 3.1 (Van Campenhout, 1980). Any set of real numbers $\{P_e(S): S \subseteq \{X_1, \ldots, X_n\}\}$ giving rise to an ordering of the measurement subsets

$\tfrac{1}{2} = P_e(S_1 = \varnothing) > P_e(S_2) > \cdots > P_e(S_{2^n} = \{X_1, \ldots, X_n\}) > 0$

for which $S_j \subset S_k \Rightarrow P_e(S_j) > P_e(S_k)$ is inducible as a set of misclassification proba-
bilities. Moreover, there exist multivariate normal models $N(-\mu, K)$ vs. $N(\mu, K)$
with vector-valued measurements inducing these numbers.

As an illustration, suppose that we have a two-class classification problem with
three measurements $\{X_1, X_2, X_3\}$, but that we have no other knowledge as to its
structure or underlying statistics. We can raise the following question: is it
possible that the measurement subsets are ordered by their error probabilities
as follows:

$P_e\{X_1\} > P_e\{X_2\} > P_e\{X_3\} > P_e\{X_1, X_3\} > P_e\{X_2, X_3\} > P_e\{X_1, X_2\} > P_e\{X_1, X_2, X_3\}$

with the following (arbitrarily chosen) error probabilities:

$P_e\{X_1\} = 0.48$,    $P_e\{X_2, X_3\} = 0.24$,
$P_e\{X_2\} = 0.36$,    $P_e\{X_1, X_2\} = 0.025$,
$P_e\{X_3\} = 0.30$,    $P_e\{X_1, X_2, X_3\} = 0.001$,
$P_e\{X_1, X_3\} = 0.25$?

Since this ordering agrees with set inclusion, the theorem allows us to answer
"Yes" to the question. Note that the best pair $\{X_1, X_2\}$ consists of the individu-
ally worst measurements $X_1$ and $X_2$, and that the set $\{X_2, X_3\}$ is approximately
ten times worse than $\{X_1, X_2\}$. Yet, it is the set $\{X_2, X_3\}$ that is identified as the
best pair by the forward sequential search algorithm!
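Whether a given table of numbers is admissible under Theorem 3.1 is a purely mechanical check: the only requirement is isotonicity. A small sketch in Python, using the error probabilities quoted above:

```python
# Error probabilities quoted in the example above.
pe = {frozenset({"X1"}): 0.48, frozenset({"X2"}): 0.36, frozenset({"X3"}): 0.30,
      frozenset({"X1", "X3"}): 0.25, frozenset({"X2", "X3"}): 0.24,
      frozenset({"X1", "X2"}): 0.025, frozenset({"X1", "X2", "X3"}): 0.001}

def is_isotone(pe):
    """Check the monotonicity constraint: S' subset of S implies Pe(S') >= Pe(S)."""
    for a in pe:
        for b in pe:
            if a < b and pe[a] < pe[b]:   # proper subset with a smaller error
                return False
    return True

# True, so by Theorem 3.1 some multivariate normal model induces these numbers.
print(is_isotone(pe))
```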

3.2. An important implication


The theorem as stated above has far-reaching consequences as to the perfor-
mance and complexity of (distribution-free) measurement selection algorithms.
Consider the class of isotone subset orderings that agree with set cardinality, i.e.,
if $|S'| > |S|$, then $P_e(S') < P_e(S)$. Such orderings can be viewed as consisting of
n+1 separate 'layers' of subsets, each layer corresponding to a different set
cardinality. As a result of Theorem 3.1, within each layer the set ordering is
totally arbitrary (even numerically) and completely independent of the orderings
in other layers. Thus, to determine the best set of the kth layer, one necessarily has
to make at least $\binom{n}{k} - 1$ comparisons between subsets in that layer, and the error
probability of each subset of the layer must be evaluated. Hence our conclusion
is: any optimal distribution-free measurement selection algorithm is necessarily
exhaustive. Thus even very economical methods, such as the branch and bound
techniques, must perform a worst-case exhaustive search of the kth level.
Furthermore, as illustrated by the example in Section 3.1, suboptimal sets can
be arbitrarily bad, compared to the true optimum. Thus nonexhaustive algorithms
can come up with really bad solutions.
The significance of these results for the applications depends on how frequent
such pathological cases arise in practical situations. Therefore much work remains
to be done in the characterization of orderings (i.e., when is an ordering 'bad')
and in their rates of occurrence in practice. Also the characterization of families
of distributions according to their ordering capability remains an open problem.
The next subsection deals with the ordering capability of some distributions
that are more limited than those used in the proof of Theorem 3.1.

3.3. Considerations on more restricted distributions


The proof of Theorem 3.1 is constructive in that a family of multivariate
normal distributions with vector-valued measurements is exhibited that can
induce any given isotone sequence of real numbers as its error probabilities. The
measurements in these distributions are vector-valued and have intricate class-

conditional dependencies. Furthermore, the dimensionality of the measurements


is exponential in the number of measurements n. Thus, although the models are
normal they have a high complexity, and one could ask whether any of these
restrictions could be relaxed while retaining the full ordering capability of the
distributions.

3.3.1. Non-Gaussian scalar measurements


If one relaxes the assumption that the measurements be class-conditionally
jointly Gaussian, then it is easy to see that models with scalar measurements
having the full ordering capability exist. In fact, any invertible map that maps m
(the dimensionality of the measurements) real numbers into one number provides
a univariate model when applied to the Gaussian measurements of Theorem 3.1.
A typical example of such a map is digit interleaving. Direct derivations of
univariate models are also possible (Cover and Van Campenhout, 1977).

3.3.2. Class-conditionally independent binary measurements


Class conditional measurement dependence has long been considered a neces-
sary condition to generate anomalous measurement subset orderings. However,
some authors (Elashoff, 1967; Toussaint, 1971; Cover, 1973) showed that even
with the simplest of all cases, the binary class-conditionally independent measure-
ments, anomalous relationships could exist between measurement sets and their
elements. Nevertheless, it has been shown (Van Campenhout, 1978) that the
family of binary class-conditional independent measurements is not capable of
inducing all isotone orderings. The possible subset orderings are already restricted
in the way the measurement pairs can be ordered, given an ordering of the
individual measurements. The proof is tedious and goes by a careful analysis of
how individual measurements can interact when forming measurement pairs.

3.3.3. Gaussian univariate measurements with class-independent covariance structure
This is an important family of distributions, not only because it is frequently
used to model practical pattern recognition situations, but also because of the
correspondence of these classification problems and regression analysis problems.
Measurement selection in classification problems having this type of distribution
is equivalent to measurement selection in regression analysis (Van Campenhout,
1978).
In the area of regression analysis researchers often assume that optimal
selection algorithms must be exhaustive, as in the general classification case.
Again, it has been shown (Van Campenhout, 1978) that not all isotone orderings
can be generated by the family of univariate, class-conditionally Gaussian mea-
surements with class-independent covariance structure. So, at least conceptually,
non-exhaustive selection algorithms for this problem might exist. The settlement
of this question also remains an unsolved problem.

References

Abend, K. and Harley, T. J. (1969). Comments: On the mean accuracy of statistical pattern
recognizers. IEEE Trans. Inform. Theory 15, 420-421.
Beale, E. M. L., Kendall, M. G. and Mann, D. W. (1967). The discarding of variables in multivariate
analysis. Biometrika 54 (3,4) 357-366.
Chandrasekaran, B. and Harley, T. J. (1969). Comments: On the mean accuracy of statistical pattern
recognizers. IEEE Trans. Inform. Theory 15, 421-423.
Chen, C. H. (1975). On a class of computationally efficient feature selection criteria. Pattern
Recognition 7, 87-94.
Cover, T. M. (1974). The best two independent measurements are not the two best. IEEE Trans.
Systems Man Cybernet. 4 (1) 116-117.
Cover, T. M. and Van Campenhout, J. M. (1977). On the possible orderings in the measurement
selection problem. IEEE Trans. Systems Man Cybernet. 7 (9) 657-661.
Duda, R. O. and Hart, P. E. (1973). Pattern Classification and Scene Analysis. Wiley, New York.
Furnival, G. M. (1974). Regression by leaps and bounds. Technometrics 16 (4) 499-511.
Hughes, G. F. (1968). On the mean accuracy of statistical pattern recognizers. IEEE Trans. Inform.
Theory 14 (1) 55-63.
Kanal, L. N. (1974). Patterns in pattern recognition, 1968-1974. IEEE Trans. Inform. Theory 20,
697-722.
Narendra, P. M. and Fukunaga, K. (1977). A branch and bound algorithm for feature subset selection.
IEEE Trans. Comput. 26 (9) 917-922.
Stearns, S. D. (1976). On selecting features for pattern classifiers. Proc. Third Internat. Joint Conf.
Pattern Recognition. Coronado, CA, 71-75.
Toussaint, G. T. (1971). Note on optimal selection of independent binary-valued features for pattern
recognition. IEEE Trans. Inform. Theory 17, 618.
Van Campenhout, J. M. (1978). On the problem of measurement selection. Ph.D. Thesis, Department
of Electrical Engineering, Stanford University, Stanford, CA.
Van Campenhout, J. M. (1980). The arbitrary relation between probability of error and measurement
subset. J. Amer. Statist. Assoc. 75 (369) 104-109.
Waller, W. G. and Jain, A. K. (1977). Mean recognition accuracy of dependent binary measurements.
Proc. Seventh Internat. Conf. Cybernet. and Society, Washington, DC.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 805-820

Selection of Variables Under


Univariate Regression Models*

P. R. Krishnaiah

1. Introduction

The techniques of regression analysis have been used widely in the problems of
prediction in various disciplines. When the number of variables is large, it is of
interest to select a small number of important variables which are adequate for
the prediction. In this paper we review some methods of selection of variables
under univariate regression models. In Sections 3-5 we discuss the forward
selection, backward elimination and stepwise procedures. A description of these
procedures is given in Draper and Smith (1966). These procedures are widely used
by m a n y applied statisticians since computer programs are easily available for
their implementation. We will discuss some drawbacks of these procedures. In
view of these drawbacks, we have serious reservations about using these methods
for selection of variables. In Sections 6 - 8 we discuss the problems of selection of
variables within the framework of simultaneous test procedures. Section 6 is
devoted to a discussion of how the confidence intervals associated with the well
known overall F test in regression analysis can be used for the selection of
variables. The above confidence intervals are available in the literature (e.g., see
Roy and Bose, 1953; Scheffé, 1959). In this section we also discuss some
procedures based upon all possible regressions. In Section 7 we discuss how the
finite intersection tests (FIT) proposed by Krishnaiah (1963; 1965a, b; 1979) for
testing the hypotheses on regression coefficients simultaneously can be used for
selection of variables. It is known that the FIT is better than the overall F test in
terms of the shortness of the lengths of the confidence intervals. For an illustra-
tion of the FIT, the reader is referred to the chapter by Schmidhammer in this
volume. Reviews of the literature on some alternative procedures are given in
Hocking (1976) and Thompson (1978a, b). For a discussion of the procedures for
selection of variables when the number of variables is large the reader is referred
to Shibata (1981) and the references in that paper.

*This work is sponsored by the Air Force Office of Scientific Research under contract
F49629-82-K-001. Reproduction in whole or in part is permitted for any purpose of the U. S.
Government.

2. Preliminaries

Consider the regression model

$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_q x_q + \varepsilon$    (2.1)

where ε is distributed normally with mean zero and variance $\sigma^2$, $\beta_1, \ldots, \beta_q$ are
unknown, and $x_1, \ldots, x_q$ may be fixed or random. This chapter is devoted to a
review of some of the procedures for the selection of the above variables. Unless
stated otherwise we assume that $x_1, \ldots, x_q$ are fixed. Also, $x_i: n \times 1$ and $y: n \times 1$
denote respectively the vectors of n observations on $x_i$ and y. We now define the
multivariate t and multivariate F distributions since they are needed in the sequel.
Let $z' = (z_1, \ldots, z_p)$ be distributed as a multivariate normal with mean vector
$\mu' = (\mu_1, \ldots, \mu_p)$ and covariance matrix $\sigma^2 \Omega$ where Ω is the correlation matrix.
Also, let $s^2/\sigma^2$ be distributed independently of z as chi-square with m degrees of
freedom. Then the joint distribution of $t_1, \ldots, t_p$ is the central (noncentral)
multivariate t distribution with m degrees of freedom and with Ω as the correlation
matrix of the 'accompanying' multivariate normal when $\mu = 0$ ($\mu \neq 0$), where
$t_i = z_i \sqrt{m}/s$ ($i = 1, 2, \ldots, p$). The joint distribution of $t_1^2, \ldots, t_p^2$ is the central (non-
central) multivariate F with (1, m) degrees of freedom and with Ω as the correla-
tion matrix of the accompanying multivariate normal. The above distribution is
singular or nonsingular according as Ω is singular or nonsingular. Also, when
$\mu \neq 0$ and $s^2/\sigma^2$ is distributed as a noncentral chi-square, then $t_1^2, \ldots, t_p^2$ are
jointly distributed as the doubly noncentral F distribution with (1, m) degrees of
freedom. The multivariate t distribution was considered by Cornish (1954) and
Dunnett and Sobel (1954) independently. The multivariate F distribution with (1, m)
degrees of freedom is a special case of the multivariate F distribution proposed by
Krishnaiah (1965a). Cox et al. (1980) investigated the accuracy of various
approximations for the multivariate t and multivariate F distributions. For a
review of the literature on multivariate t and multivariate F distributions, the
reader is referred to Krishnaiah (1980).

3. Forward selection procedure

Consider the univariate regression model

$y = X\beta + \varepsilon$    (3.1)

where $\varepsilon' = (\varepsilon_1, \ldots, \varepsilon_n)$ is distributed as multivariate normal with mean vector 0 and
covariance matrix $\sigma^2 I_n$. Also, $X = [x_1, \ldots, x_q]$ where $x_i$ is a vector of n indepen-
dent observations on $x_i$, whereas y is a vector of n observations on the dependent
variable y. In this section we discuss the forward selection procedure.

In the forward selection procedure, we first select a single variable which is
supposed to be the most important. Let

$F_i = \dfrac{y' x_i (x_i' x_i)^{-1} x_i' y\,(n-1)}{y'\left[I - x_i (x_i' x_i)^{-1} x_i'\right] y}$.    (3.2)

If $\max(F_1, F_2, \ldots, F_q) \ge F_\alpha$, the variable corresponding to $\max(F_1, F_2, \ldots, F_q)$ is
declared to be the most important. For example, if $\max(F_1, \ldots, F_q) = F_1$, then $x_1$
is declared to be the most important. Here $F_\alpha$ is the upper 100α% point of the
central F distribution with (1, n-1) degrees of freedom. If $\max(F_1, F_2, \ldots, F_q) \le F_\alpha$,
we declare that none of the variables is important and we do not proceed
further; otherwise, we proceed further as follows and select the variable which is
the second most important. Let

$F_{1i} = y' Q_{1i}\, y\,(n-2) / y' Q_{10}\, y$    (3.3)

for $i = 2, 3, \ldots, q$, where

$Q_{1i} = \left(I - x_1 (x_1' x_1)^{-1} x_1'\right) x_i \left[x_i' x_i - x_i' x_1 (x_1' x_1)^{-1} x_1' x_i\right]^{-1} x_i' \left(I - x_1 (x_1' x_1)^{-1} x_1'\right)$,    (3.4)

$Q_{10} = I - [x_1, x_i]\left([x_1, x_i]'[x_1, x_i]\right)^{-1}[x_1, x_i]'$.    (3.5)
If $\max(F_{12}, \ldots, F_{1q}) \le F_{1\alpha}$, we conclude that none of the variables $x_2, x_3, \ldots, x_q$
is important and we do not proceed further; here $F_{1\alpha}$ is the upper 100α%
point of the central F distribution with (1, n-2) degrees of freedom.
If $\max(F_{12}, F_{13}, \ldots, F_{1q}) > F_{1\alpha}$, then the variable corresponding to
$\max(F_{12}, F_{13}, \ldots, F_{1q})$ is declared to be the second most important. Suppose $x_2$ is
the second most important variable according to the above procedure. Then we
proceed further to select the third most important variable. We continue this
procedure until a decision is made to declare that all variables are important or a
decision is made, at any stage, that none of the remaining variables is important.
Suppose we declare that r variables (say $x_1, x_2, \ldots, x_r$) are important according to
the above procedure, where $x_i$ is declared to be the ith most important variable.
Then we proceed further as follows. Let

$$ F_{ri} = \frac{y'Q_{ri}y\,(n-r-1)}{y'Q_{r0}y} \tag{3.6} $$

for $i = r+1,\ldots,q$, where

$$ Q_{ri} = \bigl(I - X_{(r)}(X_{(r)}'X_{(r)})^{-1}X_{(r)}'\bigr)x_i\bigl[\,x_i'x_i - x_i'X_{(r)}(X_{(r)}'X_{(r)})^{-1}X_{(r)}'x_i\,\bigr]^{-1}x_i'\bigl(I - X_{(r)}(X_{(r)}'X_{(r)})^{-1}X_{(r)}'\bigr), \tag{3.7} $$

$$ Q_{r0} = I - [\,X_{(r)}, x_i\,]\bigl([\,X_{(r)}, x_i\,]'[\,X_{(r)}, x_i\,]\bigr)^{-1}[\,X_{(r)}, x_i\,]', \qquad X_{(r)} = [x_1,\ldots,x_r]. \tag{3.8} $$


If $\max(F_{r,r+1},\ldots,F_{rq}) \le F_{r\alpha}$, we declare that none of the remaining variables $x_{r+1},\ldots,x_q$ are important; here $F_{r\alpha}$ is the upper $100\alpha\%$ point of the central F distribution with $(1, n-r-1)$ degrees of freedom. If $\max(F_{r,r+1},\ldots,F_{rq}) > F_{r\alpha}$, then we declare that the variable corresponding to $\max(F_{r,r+1},\ldots,F_{rq})$ is the $(r+1)$th most important variable.
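The mechanics of the procedure are easy to program. The sketch below is my own illustration in Python; it mirrors (3.2)-(3.8) by comparing residual sums of squares of nested least-squares fits, and it simply uses the upper-$\alpha$ point of the appropriate univariate F distribution at every stage, which is exactly the choice of critical values criticized in the discussion that follows.

```python
import numpy as np
from scipy.stats import f

def rss(y, X):
    """Residual sum of squares of the least-squares fit of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

def forward_selection(y, X, alpha=0.05):
    """Forward selection with per-stage F-to-enter tests (no intercept, as in (3.1))."""
    n, q = X.shape
    selected, remaining = [], list(range(q))
    while remaining:
        df_err = n - len(selected) - 1
        rss_cur = rss(y, X[:, selected]) if selected else float(y @ y)
        best_i, best_F = None, -np.inf
        for i in remaining:
            rss_new = rss(y, X[:, selected + [i]])
            F_i = (rss_cur - rss_new) * df_err / rss_new   # cf. (3.2) and (3.6)
            if F_i > best_F:
                best_i, best_F = i, F_i
        if best_F <= f.ppf(1 - alpha, 1, df_err):          # F_alpha with (1, df_err) d.f.
            break                                          # none of the remaining variables enter
        selected.append(best_i)
        remaining.remove(best_i)
    return selected
```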
Now, consider the statistic $F_i$ defined by (3.2). This is nothing but the F statistic used to test the hypothesis $H_i: \beta_i = 0$ under the normal regression model

$$ E(y) = x_i\beta_i. \tag{3.9} $$

Under the model (3.9), $F_i$ (for given $i$) is distributed as the central F distribution with $(1, n-1)$ degrees of freedom. If $H_i$ is true, $x_i$ is unimportant. If $H_i$ is not true, it does not necessarily mean that $x_i$ will be chosen at any stage for inclusion in the prediction equation. At the first stage, if all the variables are not unimportant, we pick only the most important variable and no decision is made about the selection of other variables which are not declared to be important, and so there is a region of indecision. In this procedure we are testing the $q$ hypotheses individually and not simultaneously, since the type I error for each of them is chosen separately subject to the condition

$$ P[\,F_i \le F_\alpha \mid H_i\,] = (1-\alpha). \tag{3.10} $$

In any situation all the $q$ models $E(y) = x_1\beta_1,\ldots,E(y) = x_q\beta_q$ may be wrong; at best, one and only one of the above $q$ models is correct. For any given $i$, let us assume that the model $E(y) = x_i\beta_i$ is not correct. Then $F_i$ is not distributed as the central F distribution with $(1, n-1)$ degrees of freedom even when $H_i$ is true, since $\beta_i$ is estimated under the assumption that $E(y) = x_i\beta_i$ is the correct model. So, the true type I error is not $\alpha$ even to test $H_i$ individually. For example, let the true model be $E(y) = x_2\beta_2$ or (say) $E(y) = x_2\beta_2 + x_4\beta_4$. Then $F_1, F_3,\ldots,F_q$ are distributed as doubly noncentral F distributions with $(1, n-1)$ degrees of freedom and the noncentrality parameters associated with these statistics are usually unknown. So, we will not be able to compute the exact type I error for testing $H_i$, even individually. Even if we assume that the noncentrality parameters are known, it is not meaningful to compare $F_1,\ldots,F_q$ with the same critical value $F_\alpha$, since the critical values for testing $H_1,\ldots,H_q$ change if we keep the type I error the same for testing each hypothesis $H_i$ individually, because the noncentrality parameters are different.
Next consider the models

$$ E(y) = x_1\beta_1 + x_2\beta_2 + \cdots + x_r\beta_r + x_i\beta_i \tag{3.11} $$

for $i = r+1,\ldots,q$. For any given $i$, the statistic $F_{ri}$ defined by (3.6) is nothing but the F statistic used to test $H_i$ under the model (3.11). So, in the forward selection procedure we are essentially testing the hypothesis $H_i$, for any given $i$ ($i = r+1,\ldots,q$), under the model (3.11) to decide whether the $i$th variable is unimportant. The criticism of the method used at the first stage applies here also.

At the $(i+1)$th stage ($i = 1,2,\ldots,q-1$), the critical value $F_{i\alpha}$ is chosen ignoring the decisions made at previous stages. Take, for example, the second stage. While computing the critical value $F_{1\alpha}$, we should compute the conditional probabilities

$$ P[\,F_{1i} \le F_{1\alpha} \mid H_i;\ \max(F_1,\ldots,F_q) \ge F_\alpha\,] $$

for $i = 2,3,\ldots,q$ to find the type I error to test $H_i$ for given $i$, and not $P[\,F_{1i} \le F_{1\alpha} \mid H_i\,]$, since we go to the second stage if and only if $\max(F_1,\ldots,F_q) \ge F_\alpha$.

The decision to select or not to select a variable at the first stage is made according as $\max(F_1,\ldots,F_q) \gtrless F_\alpha$. So, the critical value $F_\alpha$ should be chosen such that the probability of $\max(F_1,\ldots,F_q)$ being less than $F_\alpha$ is equal to $(1-\alpha)$ when $\bigcap_{i=1}^q H_i$ is true, instead of choosing it to satisfy (3.10). Similar criticism applies to subsequent stages also. When $\bigcap_{i=1}^q H_i$ is true and the model is of the form (3.9), the joint distribution of $F_1,\ldots,F_q$ is a multivariate F distribution; this multivariate F distribution is different from the one defined in Section 2 since the above $F_i$'s do not have a common denominator.

4. Stepwise regression

Stepwise regression is a modification of the forward selection procedure. In stepwise regression, after a new variable is included in the prediction equation, we test whether each of the variables entered at the previous stages is still important when we start with the model equation which includes these variables as well as the new variable. The procedure is described in detail here. At the first stage we consider the models

$$ y = x_i\beta_i + e \tag{4.1} $$

for $i = 1,2,\ldots,q$. Let $H_i: \beta_i = 0$ as before and let $F_i$, for any given $i$, denote the usual F statistic used to test $H_i$ under the model (4.1). If $\max(F_1,\ldots,F_q) \le F_\alpha$, we decide that none of the variables are important and we don't proceed further.

Otherwise, we make a decision that the variable corresponding to the largest $F_i$ is the most important and go to the second stage. For simplicity of notation, we assume that $x_1$ is declared to be the most important variable at the first stage. Then, we consider the following models:

$$ y = x_1\beta_1 + x_i\beta_i + e \tag{4.2} $$

for $i = 2,3,\ldots,q$. Let $F_{1i}$ and $F_{1\alpha}$ be as defined in Section 3. If $\max(F_{12},\ldots,F_{1q}) \le F_{1\alpha}$, then we declare that none of the variables $x_2, x_3,\ldots,x_q$ are important and we don't proceed further. If $\max(F_{12},\ldots,F_{1q}) > F_{1\alpha}$, then we conclude that the variable (say $x_2$) corresponding to $\max(F_{12},\ldots,F_{1q})$ is the second most important. Then, we consider the model

$$ y = x_1\beta_1 + x_2\beta_2 + e. \tag{4.3} $$

Under the above model we test the hypothesis $\beta_1 = 0$ by using the usual F test. If the hypothesis $\beta_1 = 0$ is accepted, we delete the variable $x_1$ and consider the following models:

$$ y = x_2\beta_2 + x_i\beta_i + e \tag{4.4} $$

for $i = 3,\ldots,q$. Then consider the maximum of the F statistics associated with testing $\beta_i$ ($i = 3,4,\ldots,q$) under the models (4.4). If this maximum is less than $F_{2\alpha}$, then we declare that none of the variables $x_3, x_4,\ldots,x_q$ is important and don't proceed further. Otherwise, we conclude that the variable corresponding to the maximum of the F statistics described above is important. Let this variable be (say) $x_3$. Then, we start with the model

$$ y = x_2\beta_2 + x_3\beta_3 + e. \tag{4.5} $$

Under the model (4.5) we test $H_2$. If $H_2$ is rejected, then we conclude that $x_2$ is important and go to the next stage to investigate whether one of the variables $x_4,\ldots,x_q$ is important under the models

$$ y = x_2\beta_2 + x_3\beta_3 + x_i\beta_i + e \tag{4.6} $$

for $i = 4,5,\ldots,q$. If $H_2$ is accepted under (4.5), then we conclude that $x_2$ is unimportant and continue the procedure by using the models

$$ y = x_3\beta_3 + x_i\beta_i + e \tag{4.7} $$

for $i = 4,5,\ldots,q$. If under the model (4.3) the hypothesis $H_1$ is rejected, then we consider the following models:

$$ y = x_1\beta_1 + x_2\beta_2 + x_i\beta_i + e \tag{4.8} $$



for $i = 3,4,\ldots,q$. If the maximum of the F statistics associated with testing $H_i$ ($i = 3,\ldots,q$) under the above models is less than $F_{2\alpha}$, then we conclude that none of the variables $x_3, x_4,\ldots,x_q$ are important and we don't proceed further. Otherwise, we conclude that the variable (say $x_3$) corresponding to the maximum of the above F statistics is an important variable and proceed to the next stage by using the following models:

$$ y = x_1\beta_1 + x_2\beta_2 + x_3\beta_3 + x_i\beta_i + e \tag{4.9} $$

for $i = 4,5,\ldots,q$. This method is continued or terminated as before.
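For completeness, here is a corresponding sketch of the inclusion/deletion cycle (my own illustration in Python, with fixed F-to-enter and F-to-remove thresholds in place of the stagewise critical values of the text); it differs from the forward-selection sketch above only in that previously entered variables are re-examined after each inclusion.

```python
import numpy as np
from scipy.stats import f

def partial_F(y, X, cols, i):
    """F statistic for dropping column i from the model that uses columns `cols`."""
    def rss(c):
        if not c:
            return float(y @ y)
        beta, *_ = np.linalg.lstsq(X[:, c], y, rcond=None)
        r = y - X[:, c] @ beta
        return float(r @ r)
    full, reduced = rss(cols), rss([c for c in cols if c != i])
    return (reduced - full) * (len(y) - len(cols)) / full

def stepwise(y, X, alpha_in=0.05, alpha_out=0.05):
    """Stepwise regression sketch: add the best candidate, then purge entered
    variables that no longer help.  No safeguard against cycling is included."""
    n, q = X.shape
    selected = []
    while True:
        remaining = [i for i in range(q) if i not in selected]
        if not remaining:
            break
        # inclusion step: largest partial F among the candidates
        cand = {i: partial_F(y, X, selected + [i], i) for i in remaining}
        best = max(cand, key=cand.get)
        df_in = n - len(selected) - 1
        if cand[best] <= f.ppf(1 - alpha_in, 1, df_in):
            break
        selected.append(best)
        # deletion step: re-test every previously entered variable
        df_out = n - len(selected)
        drop = [i for i in selected[:-1]
                if partial_F(y, X, selected, i) <= f.ppf(1 - alpha_out, 1, df_out)]
        selected = [i for i in selected if i not in drop]
    return selected
```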


Objections similar to those raised for the forward selection procedure are valid for stepwise regression. For example, when we are examining whether another variable should be included in the prediction equation, at most one of the F statistics is distributed as a central F distribution and the remaining F statistics are distributed as doubly noncentral F distributions, since at most one of the models considered at that stage is the true model. Also, a decision to go to the $(i+1)$th stage for inspection of the variables is made dependent upon the outcome of the $i$th stage, and so we should use conditional instead of unconditional probabilities to determine the critical values. Also, when we are comparing the maximum of certain F statistics with a critical value, the critical value should be chosen such that the probability that the above maximum (and not each individual F statistic) is less than the critical value equals a specified value. The stepwise regression procedure has the following additional undesirable feature which does not exist in the forward selection procedure: a variable may be declared important at one stage but may be purged as unimportant at a later stage. For example, at the first stage we may find that $x_1$ is the most important. Also, let us assume that $x_2$ is declared to be the most important at the second stage. Then we consider the model

$$ y = \beta_1 x_1 + \beta_2 x_2 + e \tag{4.10} $$

and test $H_1$ under the model (4.10). If $H_1$ is accepted under the above model, then we delete $x_1$ from the prediction equation. Obviously, at the second stage, we want to find whether $x_1$ is important when the effect of $x_2$ is eliminated. One may pose the question as to why we should not examine the importance of $x_1$ after eliminating the effect of $x_3$ or $x_5$, since it is possible that $x_2$ may be declared to be unimportant and purged at a later stage whereas $x_3$ and $x_5$ may be considered to be important.

5. Backward elimination procedure

In the forward selection procedure we added variables to the prediction equation at each stage of selection. In the backward elimination procedure we start the prediction equation with all variables and test in a sequential way to eliminate the unimportant variables. To discuss the procedure, we consider the following normal regression model with the usual assumptions:

$$ y = x_1\beta_1 + x_2\beta_2 + \cdots + x_q\beta_q + e. \tag{5.1} $$

Let $H_i: \beta_i = 0$ as before. Also, let $F_i^*$ denote the usual F statistic used to test $H_i$ against $A_i: \beta_i \neq 0$ under the model (5.1). In the backward elimination procedure we declare that none of the variables are unimportant if $\min(F_1^*, F_2^*,\ldots,F_q^*) \ge F_{\alpha 1}$, where $F_{\alpha 1}$ is the upper $100\alpha\%$ point of the central F distribution with $(1, n-q)$ degrees of freedom. If $\min(F_1^*,\ldots,F_q^*) \le F_{\alpha 1}$, we eliminate the variable associated with the smallest of $F_1^*,\ldots,F_q^*$. The above procedure is equivalent to the following procedure. Let $F_{\alpha 1}$ be chosen such that

$$ P[\,F_i^* \le F_{\alpha 1} \mid H_i\,] = (1-\alpha) \tag{5.2} $$

for $i = 1,2,\ldots,q$. In the usual individual tests we accept or reject $H_i$ according as

$$ F_i^* \lessgtr F_{\alpha 1}. \tag{5.3} $$

The hypothesis $H_i$ is equivalent to the hypothesis that $x_i$ is not important. If $\min(F_1^*,\ldots,F_q^*) \ge F_{\alpha 1}$, it implies that $H_1, H_2,\ldots,H_q$ are rejected when they are tested individually. If we eliminate the variable connected with the smallest $F_i^*$, then we go to the second stage to determine whether we should eliminate the second least important variable. The main drawback of this procedure at the first stage is the choice of $F_{\alpha 1}$. Since we are interested in finding out whether all variables are important, it is more meaningful to use the critical value $F_{\alpha 1}^*$ (instead of $F_{\alpha 1}$), where

$$ P\Bigl[\,F_i^* \le F_{\alpha 1}^*,\ i = 1,2,\ldots,q \,\Bigm|\, \bigcap_{i=1}^q H_i\,\Bigr] = (1-\alpha). \tag{5.4} $$

We now discuss the second stage of the backward elimination procedure. Suppose the variable that was found to be unimportant at the first stage is $x_q$. Then we consider the model

$$ y = x_1\beta_1 + x_2\beta_2 + \cdots + x_{q-1}\beta_{q-1} + e. \tag{5.5} $$

Now, let $F_{1i}^*$ ($i = 1,2,\ldots,q-1$) denote the usual F statistic used to test $H_i$ under the model (5.5). If $\min(F_{11}^*, F_{12}^*,\ldots,F_{1,q-1}^*) \ge F_{\alpha 2}$, we declare that none of the variables $x_1,\ldots,x_{q-1}$ are unimportant and we don't proceed further; here $F_{\alpha 2}$ is the upper $100\alpha\%$ point of the central F distribution with $(1, n-q+1)$ degrees of freedom. If $\min(F_{11}^*, F_{12}^*,\ldots,F_{1,q-1}^*) \le F_{\alpha 2}$, we conclude that the variable associated with the smallest of $F_{11}^*,\ldots,F_{1,q-1}^*$ is the second most unimportant variable

and discard it. The critical value $F_{\alpha 2}$ is chosen such that

$$ P[\,F_{1i}^* \le F_{\alpha 2} \mid H_i\,] = (1-\alpha) \tag{5.6} $$

for $i = 1,2,\ldots,q-1$. So the type I error is chosen at the second stage such that the probability of rejecting the hypotheses $H_i$ individually, when in fact they are true, is $\alpha$. But what is of more interest is to control the error of rejecting at least one of the hypotheses $H_1, H_2,\ldots,H_{q-1}$ when, in fact, all of them are true. So, the critical value should be $F_{\alpha 2}^*$, where

$$ P\Bigl[\,F_{1i}^* \le F_{\alpha 2}^*,\ i = 1,2,\ldots,q-1 \,\Bigm|\, \bigcap_{i=1}^{q-1} H_i\,\Bigr] = (1-\alpha), \tag{5.7} $$

if we had started at the second stage directly. But we arrive at the second stage if and only if $\min(F_1^*,\ldots,F_q^*) \le F_{\alpha 1}$. So, we should find the critical value $F_{\alpha 2}^*$ such that

$$ P\Bigl[\,F_{1i}^* \le F_{\alpha 2}^*,\ i = 1,2,\ldots,q-1 \,\Bigm|\, \bigcap_{i=1}^{q-1} H_i;\ \min(F_1^*,\ldots,F_q^*) \le F_{\alpha 1}\,\Bigr] = (1-\alpha). \tag{5.8} $$

It is quite complicated to compute this probability.


We now discuss the backward elimination procedure at the $(j+1)$th stage. Let us assume, for convenience of notation, that $x_q, x_{q-1},\ldots,x_{q-j+1}$ were found to be the unimportant variables. Then we discard these variables and consider the following model:

$$ y = x_1\beta_1 + x_2\beta_2 + \cdots + x_{q-j}\beta_{q-j} + e. \tag{5.9} $$

Now, let $F_{ji}^*$ denote the F statistic used to test $H_i$ under the model (5.9). Then, according to the backward elimination procedure, we decide that none of the variables $x_1,\ldots,x_{q-j}$ are unimportant if $\min(F_{j1}^*, F_{j2}^*,\ldots,F_{j,q-j}^*) \ge F_{\alpha,j+1}$, where $F_{\alpha,j+1}$ is the upper $100\alpha\%$ point of the central F distribution with $(1, n-q+j)$ degrees of freedom. If $\min(F_{j1}^*, F_{j2}^*,\ldots,F_{j,q-j}^*) \le F_{\alpha,j+1}$, then we conclude that the variable associated with the smallest of the statistics $F_{j1}^*,\ldots,F_{j,q-j}^*$ is the $(j+1)$th least important variable and proceed to the next stage. The critical value $F_{\alpha,j+1}$ here is chosen such that

$$ P[\,F_{ji}^* \le F_{\alpha,j+1} \mid H_i\,] = (1-\alpha) \tag{5.10} $$

for $i = 1,2,\ldots,q-j$. But the critical value should be chosen such that the conditional probability of rejecting at least one of the hypotheses $H_1,\ldots,H_{q-j}$ (when all of them are true) is equal to $\alpha$, given $\min(F_{j-1,1}^*,\ldots,F_{j-1,q-j+1}^*) \le F_{\alpha j}^*$, where $F_{\alpha j}^*$ is chosen in a similar way as $F_{\alpha 2}^*$. In summary, the critical values used in the backward elimination procedure are chosen somewhat arbitrarily.
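The mechanics of backward elimination can nevertheless be sketched very simply; the following is my own illustration in Python, using a single upper-$\alpha$ F point at every stage in place of the stagewise critical values $F_{\alpha 1}, F_{\alpha 2},\ldots$ whose choice is criticized above.

```python
import numpy as np
from scipy.stats import f

def backward_elimination(y, X, alpha=0.05):
    """Start from the full model (5.1) and repeatedly drop the variable with the
    smallest partial F statistic, as long as that minimum falls below the critical value."""
    n, q = X.shape
    selected = list(range(q))

    def rss(cols):
        if not cols:
            return float(y @ y)
        beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
        r = y - X[:, cols] @ beta
        return float(r @ r)

    while selected:
        df_err = n - len(selected)
        full = rss(selected)
        # partial F statistic for each variable still in the model
        F = {i: (rss([c for c in selected if c != i]) - full) * df_err / full
             for i in selected}
        worst = min(F, key=F.get)
        if F[worst] >= f.ppf(1 - alpha, 1, df_err):
            break        # every remaining variable is declared important
        selected.remove(worst)
    return selected
```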
Some of the drawbacks of the forward selection, backward elimination and stepwise procedures were also discussed in Pope and Webster (1972). In view of the various drawbacks of the above procedures, we do not recommend their use in the selection of variables. In the following sections we discuss some alternative procedures for the selection of variables.

6. Overall F test and methods based on all possible regressions

In this section we consider the problem of selection of variables using the overall F test under the model (3.1). According to the well-known classical overall F test, we accept or reject $H: \beta = 0$ according as

$$ F \lessgtr F_\alpha, \tag{6.1} $$
where
$$ F = \frac{y'Qy\,(n-q)}{q\,y'Q_0y}, \tag{6.2} $$
$$ Q_0 = I - X(X'X)^{-1}X', \tag{6.3} $$
$$ Q = X(X'X)^{-1}X', \tag{6.4} $$

and $F_\alpha$ is chosen satisfying

$$ P[\,F \le F_\alpha \mid H\,] = (1-\alpha). \tag{6.5} $$

If $H$ is accepted, none of the variables are selected for prediction. If $H$ is rejected, then it is of interest to find out which of the $\beta_i$'s are significantly different from zero. For any given $i$, the variable $x_i$ is concluded to be important or unimportant according as $\beta_i \neq 0$ or $\beta_i = 0$. The confidence intervals associated with the overall F test are known (see Roy and Bose, 1953; Scheffé, 1959) to be

$$ a'\hat\beta - \{qF_\alpha(y'Q_0y)\,a'(X'X)^{-1}a/(n-q)\}^{1/2} \le a'\beta \le a'\hat\beta + \{qF_\alpha(y'Q_0y)\,a'(X'X)^{-1}a/(n-q)\}^{1/2} \tag{6.6} $$

for all $a' = (a_1,\ldots,a_q) \neq 0'$, where $\hat\beta' = y'X(X'X)^{-1} = (\hat\beta_1,\ldots,\hat\beta_q)$. In particular, the confidence interval on $\beta_i$ is given by

$$ \hat\beta_i - \{qF_\alpha(y'Q_0y)\,e_{ii}/(n-q)\}^{1/2} \le \beta_i \le \hat\beta_i + \{qF_\alpha(y'Q_0y)\,e_{ii}/(n-q)\}^{1/2} \tag{6.7} $$

where $E = (e_{ij}) = (X'X)^{-1}$. For a given $i$, the hypothesis $H_i$ is accepted or rejected according as the confidence interval (6.7) covers or does not cover zero. This is equivalent to acceptance or rejection of $H_i$ according as

$$ F_i \lessgtr qF_\alpha, \tag{6.8} $$
where
$$ F_i = \frac{\hat\beta_i^2(n-q)}{e_{ii}\,y'Q_0y}. \tag{6.9} $$

We now interpret the statistics $F$ and $F_i$, where $F$ and $F_i$ are given by (6.2) and (6.9), respectively. The statistic $F$ can be written as

$$ F = \frac{R^2(n-q)}{q(1-R^2)} \tag{6.10} $$

where $R^2 = (y'y)^{-1}y'X(X'X)^{-1}X'y$. If $X$ is random, $R^2$ is nothing but the square of the sample multiple correlation of $y$ with $(x_1,\ldots,x_q)$. Since $X$ is fixed, we refer to $R^2$ here as the square of the quasi multiple correlation coefficient; it gives a measure of the predictability of $y$ using $x_1,\ldots,x_q$. The statistic $F_i$ can be written as

$$ F_i = \frac{(R^2 - R_{(i)}^2)(n-q)}{1-R^2} \tag{6.11} $$

where $R_{(i)}^2$ is the square of the sample quasi multiple correlation of $y$ with $x_1,\ldots,x_{i-1}, x_{i+1},\ldots,x_q$.
Next, let $H_{i_1\cdots i_t}$ ($t = 1,2,\ldots,q$) denote the hypothesis that $\beta_{i_1} = \cdots = \beta_{i_t} = 0$, where each of $i_1,\ldots,i_t$ takes one of the values $1,2,\ldots,q$ subject to the restriction $i_1 \neq i_2 \neq \cdots \neq i_t$. For fixed $t$ there are $\binom{q}{t}$ distinct hypotheses of the form $H_{i_1\cdots i_t}$. The total number of distinct hypotheses of the form $H_{i_1\cdots i_t}$ ($i_1 \neq \cdots \neq i_t$, $t = 1,2,\ldots,q$) is equal to $2^q - 1$. These hypotheses can be tested by examining the simultaneous confidence intervals given by (6.6). For example, the confidence intervals which measure departure from $H_{12}$ are given by (6.6) when $a_3 = \cdots = a_q = 0$. The hypothesis $H_{12}$ is accepted if (6.6) covers zero for all nonnull vectors $(a_1, a_2, 0,\ldots,0)$ and is rejected otherwise. This is equivalent to acceptance or rejection of $H_{12}$ according as

$$ 2F_{12} \lessgtr qF_\alpha \tag{6.12} $$

where $F_{12}$ is the F statistic associated with testing $H_{12}$. In general, the hypothesis $H_{i_1\cdots i_t}$ is accepted or rejected according as $tF_{i_1\cdots i_t} \lessgtr qF_\alpha$, where

$$ F_{i_1\cdots i_t} = \frac{(n-q)\,\hat\beta_{(t)}'V_{(t)}^{-1}\hat\beta_{(t)}}{t\,y'Q_0y}, \tag{6.13} $$

$\hat\beta_{(t)}' = (\hat\beta_{i_1},\ldots,\hat\beta_{i_t})$, and $V_{(t)}\sigma^2$ is the covariance matrix of $\hat\beta_{(t)}$. The above implications of the overall F test are well known. The subset $(x_{i_1},\ldots,x_{i_t})$ of the set of variables $x_1,\ldots,x_q$ may be considered important or unimportant for prediction of $y$ according as $H_{i_1\cdots i_t}$ is rejected or accepted. But, in the above procedure, we may run into situations where different subsets of variables may be declared important. Among these subsets we pick the subset which is associated with the largest F statistic. However, this method is cumbersome since we have to compute $(2^q - 1)$ F statistics. So, if we are using the overall F test, it is computationally easy to use the procedure described in the beginning of this section; according to this procedure, we include a variable $x_i$ in the selected subset if and only if $F_i > qF_\alpha$.
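A minimal sketch of this overall-F-based selection rule is given below (my own illustration in Python for the no-intercept model (3.1); the arguments are assumed to be a response vector y and an $n \times q$ design matrix X).

```python
import numpy as np
from scipy.stats import f

def overall_F_selection(y, X, alpha=0.05):
    n, q = X.shape
    E = np.linalg.inv(X.T @ X)                 # E = (e_ij) = (X'X)^{-1}
    beta_hat = E @ X.T @ y
    Q0y = y - X @ beta_hat                     # Q_0 y, with Q_0 = I - X(X'X)^{-1}X'
    yQ0y = float(Q0y @ Q0y)

    F_overall = (y @ y - yQ0y) * (n - q) / (q * yQ0y)   # cf. (6.2) and (6.10)
    F_alpha = f.ppf(1 - alpha, q, n - q)
    if F_overall <= F_alpha:
        return []                              # H: beta = 0 accepted; nothing is selected

    F_i = beta_hat**2 * (n - q) / (np.diag(E) * yQ0y)   # cf. (6.9)
    return [i for i in range(q) if F_i[i] > q * F_alpha]  # include x_i iff F_i > q F_alpha
```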
We now discuss an alternative method based upon all possible regressions. In the procedures discussed above we were interested in finding a subset of the variables $x_1,\ldots,x_q$ which is adequate for prediction, assuming that the true model is (3.1). For example, the true model may be a $q$th degree polynomial, but a lower degree polynomial may be adequate for prediction. Situations may arise, however, when we do not have any knowledge of the true model. The true model may be any one of the following models:

$$ E(y) = x_{i_1}\beta_{i_1} + \cdots + x_{i_t}\beta_{i_t} \tag{6.14} $$

where $t = 1,2,\ldots,q$ and $i_1,\ldots,i_t$ take values from 1 to $q$ subject to the restriction $i_1 \neq i_2 \neq \cdots \neq i_t$. In the above situation there are $2^q - 1$ possible models and we have to select one of them. For given $i_1,\ldots,i_t$, let $F_{i_1\cdots i_t}^*$ denote the usual F statistic used to test the hypothesis $H_{i_1\cdots i_t}$ under the model (6.14). Then the hypotheses $H_{i_1\cdots i_t}$, for all possible values of $i_1,\ldots,i_t$, can be tested simultaneously as follows. We accept or reject $H_{i_1\cdots i_t}$ for given $i_1,\ldots,i_t$ according as

$$ F_{i_1 i_2\cdots i_t}^* \lessgtr F_\alpha^*, \tag{6.15} $$
where
$$ P[\,F_{i_1 i_2\cdots i_t}^* \le F_\alpha^*;\ (i_1,\ldots,i_t) \in G \mid H\,] = (1-\alpha) \tag{6.16} $$

and $(i_1,\ldots,i_t) \in G$ indicates that $i_1,\ldots,i_t$ ($t = 1,2,\ldots,q$) take values from $1,2,\ldots,q$ subject to the restriction that $i_1 \neq i_2 \neq \cdots \neq i_t$. The joint distribution of the statistics $F_{i_1\cdots i_t}^*$ is complicated even when $H$ is true. One can use Bonferroni's inequality to obtain approximate values for $F_\alpha^*$. For any given $i_1,\ldots,i_t$, if the hypothesis $H_{i_1\cdots i_t}$ is rejected, then the corresponding model (6.14), involving $x_{i_1},\ldots,x_{i_t}$ only, is included in the set of selected models. Out of these selected models we may choose the model associated with the largest F value. The above procedure also involves computation of $(2^q - 1)$ F statistics. In this procedure we are controlling the overall type I error, whereas in the all possible regressions method described in Draper and Smith (1966) no attempt is made to control the overall type I error.

In the F statistics of the above procedure we can use the usual error mean square based upon the model (3.1) involving all $q$ variables as a common denominator, instead of using different error mean squares for the various F statistics.
In this case the distribution problem is less complicated. Apart from that, the estimate of the error variance is unbiased irrespective of which of the $(2^q - 1)$ models is the correct model, and the error sum of squares is distributed as a central chi-square. On the other hand, let us assume that the error variance is estimated by using the model based upon some of the variables, say $x_1, x_2,\ldots,x_{t+2}$. If the true model involves any of the variables $x_{t+3}, x_{t+4},\ldots,x_q$, then the above estimate is not unbiased and the error sum of squares is distributed as a noncentral chi-square variable.
Next, consider models of the form (6.14). Let $s^2_{i_1\cdots i_t}$ denote the usual error mean square under the above models. Then

$$ E(s^2_{i_1\cdots i_t}) = \sigma^2_{i_1\cdots i_t}. \tag{6.17} $$

Now consider the statistics $c(r,t,n)\,s^2_{i_1\cdots i_t} + d(r,t,n)$ ($t = 1,\ldots,q$), where $c(r,t,n)$ and $d(r,t,n)$ are constants which are chosen by using some criterion. Then we select the subset associated with the smallest of the above statistics, where $i_1 \neq i_2 \neq \cdots \neq i_t$, $t = 1,2,\ldots,q$, and the $i_j$'s take values from 1 to $q$. Another possible method is to select the subset associated with the smallest of $T(i_1,\ldots,i_t)$, where

$$ T(i_1,\ldots,i_t) = a(q,t,n)\,s^2_{i_1\cdots i_t}/\hat\sigma^2 + b(q,t,n), \tag{6.18} $$

$a(q,t,n)$ and $b(q,t,n)$ are properly chosen constants, and $\hat\sigma^2$ is the usual error mean square when the model is (3.1). It is complicated to compute the probability of correct selection in both of the above situations. Akaike (1973) and Mallows (1973) considered some special values of $a(q,t,n)$ and $b(q,t,n)$ in (6.18) while considering procedures for the selection of variables.
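As an illustration of a criterion of the form (6.18), the sketch below (my own, in Python, on data simulated purely for demonstration) enumerates all $2^q - 1$ subsets and evaluates Mallows' $C_p$, which corresponds to one particular choice of the constants $a(q,t,n)$ and $b(q,t,n)$.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
n, q = 50, 5
X = rng.standard_normal((n, q))
y = 2.0 * X[:, 0] + 1.5 * X[:, 3] + rng.standard_normal(n)   # true model uses x_1 and x_4

def rss(cols):
    beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
    r = y - X[:, cols] @ beta
    return float(r @ r)

sigma2_hat = rss(list(range(q))) / (n - q)   # error mean square from the full model (3.1)

best = None
for t in range(1, q + 1):
    for subset in combinations(range(q), t):
        # Mallows' C_p = RSS_subset / sigma2_hat - n + 2t, a special case of (6.18)
        Cp = rss(list(subset)) / sigma2_hat - n + 2 * t
        if best is None or Cp < best[0]:
            best = (Cp, subset)

print("selected subset (0-based indices):", best[1], "Cp =", round(best[0], 2))
```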

7. Finite intersection tests

In this section we discuss how the finite intersection test proposed by Krishnaiah (1963; 1965a, b) for testing hypotheses on various regression coefficients simultaneously can be used to select important variables. According to this procedure, we accept or reject $H_i$ ($i = 1,2,\ldots,q$) according as

$$ F_i \lessgtr F_\alpha^*, \tag{7.1} $$

where $F_i$ is defined by (6.9) and

$$ P[\,F_i \le F_\alpha^*;\ i = 1,2,\ldots,q \mid H\,] = (1-\alpha). \tag{7.2} $$

When $H$ is true, the joint distribution of $F_1,\ldots,F_q$ is the multivariate F distribution with $(1, n-q)$ degrees of freedom and with $C(X'X)^{-1}C'$ as the correlation matrix of the 'accompanying' multivariate normal, where $C = \mathrm{diag}(e_{11}^{-1/2},\ldots,e_{qq}^{-1/2})$ and $E = (e_{ij}) = (X'X)^{-1}$.
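The critical value $F_\alpha^*$ in (7.2) depends on the correlation matrix $C(X'X)^{-1}C'$ and is not available in closed form in general. One serviceable approach is to estimate it by simulating the multivariate F distribution directly, as in the sketch below (my own illustration in Python; the design matrix is randomly generated only to make the example self-contained).

```python
import numpy as np

rng = np.random.default_rng(2)
n, q, alpha = 40, 4, 0.05
X = rng.standard_normal((n, q))                      # fixed design, for illustration only
E = np.linalg.inv(X.T @ X)
R = E / np.sqrt(np.outer(np.diag(E), np.diag(E)))    # C (X'X)^{-1} C', the correlation matrix

n_rep = 200_000
# under H: beta_hat ~ N(0, sigma^2 E) and y'Q_0 y / sigma^2 ~ chi^2_{n-q}, independently
z = rng.multivariate_normal(np.zeros(q), R, size=n_rep)   # standardized beta_hat_i / sqrt(e_ii)
s2 = rng.chisquare(n - q, size=n_rep) / (n - q)
F = z**2 / s2[:, None]                                    # joint draws of (F_1, ..., F_q)

F_star = np.quantile(F.max(axis=1), 1 - alpha)   # critical value satisfying (7.2)
print("Monte Carlo estimate of F*_alpha:", round(F_star, 3))
```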
The FIT is better than the overall F test in the sense that the lengths of the confidence intervals associated with the FIT are shorter than those associated with the overall F test. Next, let $P_1$ and $P_2$ denote respectively the probabilities of correct selection of $r$ variables $x_1,\ldots,x_r$ when the overall F test and the FIT are used. Then

$$ P_1 = P[\,F_i \ge qF_\alpha;\ i = 1,2,\ldots,r \mid \beta_i^2 > 0;\ i = 1,2,\ldots,r\,], \tag{7.3} $$
$$ P_2 = P[\,F_i \ge F_\alpha^*;\ i = 1,2,\ldots,r \mid \beta_i^2 > 0;\ i = 1,2,\ldots,r\,]. \tag{7.4} $$

Since $qF_\alpha \ge F_\alpha^*$, we observe that $P_1 \le P_2$.


We now discuss the robustness of the FIT against violation of the assumption of normality. Let the distribution of $e$ belong to the family of distributions whose densities are of the form

$$ f(e) = \int g(e \mid \tau)\,h(\tau)\,d\tau \tag{7.5} $$

where $g(e \mid \tau)$ is the density of a multivariate normal with mean vector 0 and covariance matrix $\tau^2 I_n$, and $h(\tau)$ is the density of $\tau$. For example, let the density of $\tau$ be

$$ h(\tau) = [\,2/\Gamma(\nu/2)\,](\nu\sigma_0^2/2)^{\nu/2}\,\tau^{-(\nu+1)}\exp(-\nu\sigma_0^2/2\tau^2). $$

If $\tau$ has the above density, then $\nu\sigma_0^2/\tau^2$ is distributed as chi-square with $\nu$ degrees of freedom, i.e., $\tau$ follows an inverted chi distribution with $\nu$ degrees of freedom. In this case the density of $e$ is given by

$$ f(e) = c\,(\nu\sigma_0^2 + e'e)^{-(\nu+n)/2} \tag{7.6} $$
where
$$ c = (\nu\sigma_0^2)^{\nu/2}\,\Gamma\bigl(\tfrac{1}{2}(\nu+n)\bigr)\big/\pi^{n/2}\Gamma(\nu/2). \tag{7.7} $$

The density given by (7.6) is a special case of the multivariate t distribution with $\nu$ degrees of freedom and with $I_n$ as the correlation matrix of the 'accompanying' multivariate normal. By making different choices of the density of $\tau$ we get a wide class of distributions. Zellner (1976) pointed out that the type I error associated with the overall F test is not at all affected if the density of $e$ is of the form (7.5). A similar statement holds good for the FIT also. But the power functions of the overall F test and the FIT are affected when the assumption of normality of the errors is violated and their joint density is of the form (7.5).
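This scale-mixture argument can be checked numerically: if the error vector is generated as $\tau$ times a standard normal vector, with $\tau$ drawn from the density $h$ above, the null distribution of the overall F statistic is unchanged. The sketch below is my own illustration in Python; the draw of $\tau$ is implemented as $\tau = \sigma_0\sqrt{\nu/\chi^2_\nu}$, which is equivalent to the density $h$.

```python
import numpy as np
from scipy.stats import f

rng = np.random.default_rng(3)
n, q, nu, sigma0 = 30, 3, 5, 2.0
X = rng.standard_normal((n, q))
P_hat = X @ np.linalg.inv(X.T @ X) @ X.T          # hat matrix of the fixed design

def overall_F(e):
    """Overall F statistic (6.2) computed under H: beta = 0, i.e. with y = e."""
    fit = e @ P_hat @ e
    resid = e @ e - fit
    return (fit / q) / (resid / (n - q))

crit = f.ppf(0.95, q, n - q)
n_rep = 20_000
rej_normal = rej_mixture = 0
for _ in range(n_rep):
    e_norm = rng.standard_normal(n)                    # Gaussian errors
    tau = sigma0 * np.sqrt(nu / rng.chisquare(nu))     # tau with the density h of the text
    e_mix = tau * rng.standard_normal(n)               # multivariate-t errors via the mixture (7.5)
    rej_normal += overall_F(e_norm) > crit
    rej_mixture += overall_F(e_mix) > crit

print(rej_normal / n_rep, rej_mixture / n_rep)         # both should be close to 0.05
```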
Next, let us assume the model to be

$$ y = \beta_0 1 + X\beta + e \tag{7.8} $$

instead of (3.1). Here $\beta_0$ is unknown and $1$ is a column vector whose elements are equal to unity. In this case the methods discussed in this paper hold good with very minor modifications. In the F statistics we replace $y$ and $X$ with $y^*$ and $X^*$, respectively, where $y^* = y - \bar y\,1$ and $X^* = X - 1\bar x'$, with $\bar y = n^{-1}\sum_{i=1}^n y_i$ and $\bar x'$ the row vector of the column means of $X$. Also, the degrees of freedom of the error sum of squares in this case is $(n-q-1)$.


Next, let us assume that the joint distribution of $(y, x_1,\ldots,x_q)$ is multivariate normal with mean vector 0 and covariance matrix $\Sigma$. When $x_1,\ldots,x_q$ are held fixed, the conditional distribution of $y$ is normal and the conditional mean is of the form

$$ E_c(y) = \beta_1 x_1 + \cdots + \beta_q x_q. \tag{7.9} $$

So the conditional model is of the same form as (3.1), and we can select variables by using the FIT or the overall F test under the conditional model. But, when $H$ is true, the conditional distributions of $(F_1,\ldots,F_q)$ and $F$ are the same as the corresponding unconditional distributions.

References

[1] Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In: B. N. Petrov and F. Csaki, eds., 2nd International Symposium on Information Theory, 267-281. Akademiai Kiado, Budapest.
[2] Cornish, E. A. (1954). The multivariate small t-distribution associated with a set of normal sample deviates. Austral. J. Phys. 7, 531-542.
[3] Cox, C. M., Krishnaiah, P. R., Lee, J. C., Reising, J. and Schuurmann, F. J. (1980). A study on finite intersection tests for multiple comparisons of means. In: P. R. Krishnaiah, ed., Multivariate Analysis, Vol. V. North-Holland, Amsterdam.
[4] Draper, N. R. and Smith, H. (1966). Applied Regression Analysis. Wiley, New York.
[5] Dunnett, C. W. and Sobel, M. (1954). A bivariate generalization of Student's t-distribution with tables for certain cases. Biometrika 41, 153-169.
[6] Hocking, R. R. (1976). The analysis and selection of variables in linear regression. Biometrics 32, 1-49.
[7] Krishnaiah, P. R. (1963). Simultaneous tests and the efficiency of generalized incomplete block designs. Tech. Rept. ARL 63-174. Wright-Patterson Air Force Base, OH.
[8] Krishnaiah, P. R. (1965a). On the simultaneous ANOVA and MANOVA tests. Ann. Inst. Statist. Math. 17, 35-53.
[9] Krishnaiah, P. R. (1965b). Multiple comparison tests in multiresponse experiments. Sankhyā, Ser. A 27, 65-72.
[10] Krishnaiah, P. R. (1979). Some developments on simultaneous test procedures. In: P. R. Krishnaiah, ed., Developments in Statistics, Vol. 2, 157-201. Academic Press, New York.
[11] Krishnaiah, P. R. (1980). Computations of some multivariate distributions. In: P. R. Krishnaiah, ed., Handbook of Statistics, Vol. 1: Analysis of Variance, 745-971. North-Holland, Amsterdam.
[12] Mallows, C. L. (1973). Some comments on Cp. Technometrics 15, 661-675.
[13] Pope, P. T. and Webster, J. T. (1972). The use of an F-statistic in stepwise regression problems. Technometrics 14, 327-340.
[14] Roy, S. N. and Bose, R. C. (1953). Simultaneous confidence interval estimation. Ann. Math. Statist. 24, 513-536.
[15] Scheffé, H. (1959). The Analysis of Variance. Wiley, New York.
[16] Schmidhammer, J. L. (1982). On the selection of variables under regression models using Krishnaiah's finite intersection tests. In: P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2: Classification, Pattern Recognition and Reduction of Dimensionality [this volume]. North-Holland, Amsterdam.
[17] Shibata, R. (1981). An optimal selection of regression variables. Biometrika 68, 45-54.
[18] Thompson, M. L. (1978). Selection of variables in multiple regression: Part I. A review and evaluation. Internat. Statist. Rev. 46, 1-19.
[19] Thompson, M. L. (1978). Selection of variables in multiple regression: Part II. Chosen procedures, computations and examples. Internat. Statist. Rev. 46, 129-146.
[20] Zellner, A. (1976). Bayesian and non-Bayesian analysis of the regression model with multivariate Student-t error terms. J. Amer. Statist. Assoc. 71, 400-408.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
© North-Holland Publishing Company (1982) 821-833

On the Selection of Variables
Under Regression Models
Using Krishnaiah's Finite Intersection Tests*

James L. Schmidhammer

1. Introduction

The finite intersection test procedure under the univariate regression model was
first considered by Krishnaiah in 1960 in an unpublished report which was
subsequently issued as a technical report in 1963 and later published in Krishnaiah
(1965a). The problem of testing the hypotheses that the regression coefficients are
zero, as well as the problem of testing the hypotheses that contrasts on means, in the ANOVA setup, are equal to zero, are special cases of this procedure. Finite intersection tests under general multivariate regression models were proposed by Krishnaiah (1965b). In this chapter, we discuss some applications of Krishnaiah's
finite intersection tests for selection of variables under univariate and multivariate
regression models.
Section 2 gives some background material on the multivariate F distribution,
which is the distribution most commonly used in conjunction with the finite
intersection test. Section 3 describes the application of the finite intersection test
procedure to the univariate linear regression problem, while Section 4 discusses
the extension to the multivariate linear regression case. Finally, Sections 5 and 6
illustrate the use of the finite intersection test with univariate and multivariate
examples respectively.

2. The multivariate F distribution

Let $x_1,\ldots,x_n$ be distributed independently and identically as multivariate normal random vectors with mean vector $\mu$ and covariance matrix $\Sigma$, with $x_i' = (x_{i1},\ldots,x_{ip})$ for $i = 1,\ldots,n$, $\mu' = (\mu_1,\ldots,\mu_p)$, and $\Sigma = (\sigma_{ij})$. Also, let $z_j = (1/\sigma_{jj})\sum_{i=1}^n x_{ij}^2$ for $j = 1,\ldots,p$. Then the joint distribution of $z_1,\ldots,z_p$ is a central or noncentral multivariate chi-square distribution with $n$ degrees of freedom (central if $\mu = 0$, noncentral if $\mu \neq 0$). Also, let $z_0$ be distributed independently of $z' = (z_1,\ldots,z_p)$ as a central chi-square random variable with $m$ degrees of freedom, with $E(z_0) = m$. In addition, let $F_i = (z_i/n)/(z_0/m)$. Then the joint distribution of $F_1,\ldots,F_p$ is a multivariate F distribution with $n$ and $m$ degrees of freedom, with $P$ as the correlation matrix of the accompanying multivariate normal distribution, where $P = (\rho_{ij})$ and $\rho_{ij} = \sigma_{ij}/(\sigma_{ii}\sigma_{jj})^{1/2}$. This distribution was introduced by Krishnaiah (1963, 1965a). When $n = 1$, the multivariate F distribution is known as the multivariate $t^2$ distribution. For an exact expression for the density of the central multivariate F distribution when $\Sigma$ is nonsingular, see Krishnaiah (1964). Exact percentage points of the bivariate F distribution were given in Schuurmann, Krishnaiah and Chattopadhyay (1975), whereas approximate percentage points of the multivariate F distribution were constructed by Krishnaiah and Schuurmann and reproduced in Krishnaiah (1980). Percentage points in the equicorrelated case ($\rho_{ij} = \rho$ for $i \neq j$) were given in Krishnaiah and Armitage (1970), and in Krishnaiah and Armitage (1965) when $n = 1$. In general, exact percentage points of the multivariate F distribution are difficult to obtain when $\Sigma$ is arbitrary. However, bounds on these percentage points can be obtained using several probability inequalities.

*This work is sponsored by the Air Force Office of Scientific Research under Contract F49629-82-K-001. Reproduction, in whole or in part, is permitted for any purpose of the United States Government.
Upper and lower bounds on the critical values can be computed using Poincaré's formula,

$$ 1 - \sum_{i=1}^p P[F_i > F_\alpha] \le P[\,F_1 \le F_\alpha,\ldots,F_p \le F_\alpha\,] \le 1 - \sum_{i=1}^p P[F_i > F_\alpha] + \sum_{i<j} P[\,F_i > F_\alpha \cap F_j > F_\alpha\,]. \tag{2.1} $$
The left-hand side can be used to obtain an upper bound on $F_\alpha$, while the right-hand side can be used to obtain a lower bound. To describe other probability inequalities useful for obtaining bounds on the critical values, let $x = (x_1,\ldots,x_r)'$ be distributed as a multivariate normal random vector with mean vector 0 and covariance matrix $\sigma^2 P$, where $P = (\rho_{ij})$ is the correlation matrix of the $x_i$'s, and let $y = (y_1,\ldots,y_r)'$ be distributed as a multivariate normal random vector with mean vector 0 and covariance matrix $\sigma^2 I_r$. Also, let $s^2/\sigma^2 \sim \chi^2_m$, independent of $x$ and $y$. Then Sidak (1967) showed that

$$ P[\,|x_1| \le c_1 s,\ldots,|x_r| \le c_r s\,] \ge P[\,|y_1| \le c_1 s,\ldots,|y_r| \le c_r s\,], \tag{2.2} $$

$$ P[\,|y_1| \le c_1 s,\ldots,|y_r| \le c_r s\,] \ge \prod_{i=1}^r P[\,|y_i| \le c_i s\,]. \tag{2.3} $$
Inequality (2.2) is referred to here as Sidak's upper bound, and inequality (2.3) is referred to as the product upper bound. For additional discussion, see Krishnaiah (1979).
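For equal critical values and equal marginal $F(1,m)$ distributions these bounds reduce to one-dimensional computations. The sketch below is my own illustration in Python: the product bound (2.3) and a simple first-order Bonferroni bound (from the left-hand side of (2.1)) follow directly from univariate F quantiles, while the Sidak-type bound corresponding to (2.2) (independent numerators, common denominator) is obtained by numerical integration; the second-order Poincaré lower bound, which needs bivariate probabilities, is omitted. The parameter values are arbitrary.

```python
import numpy as np
from scipy.stats import f, norm, chi2
from scipy.integrate import quad
from scipy.optimize import brentq

p, m, alpha = 9, 23, 0.05     # number of statistics, error d.f., overall level

# Bonferroni (first-order Poincare) upper bound: 1 - p*P[F > c] = 1 - alpha
bonferroni = f.ppf(1 - alpha / p, 1, m)

# product upper bound (2.3): P[F <= c]^p = 1 - alpha
product = f.ppf((1 - alpha) ** (1.0 / p), 1, m)

# Sidak-type upper bound (2.2): independent numerators, common denominator s
def joint_prob(c):
    t = np.sqrt(c)
    integrand = lambda w: (2 * norm.cdf(t * np.sqrt(w / m)) - 1) ** p * chi2.pdf(w, m)
    return quad(integrand, 0, np.inf)[0]

sidak = brentq(lambda c: joint_prob(c) - (1 - alpha), 1.0, 50.0)

print("Sidak:", round(sidak, 4), "product:", round(product, 4), "Bonferroni:", round(bonferroni, 4))
```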

3. The finite intersection test -- A simultaneous procedure in the univariate case

Consider the usual univariate linear model given by

$$ y = X\beta + e $$

where $y$ is an $n$-vector of independently distributed normal random variables, $X = [1, X_1]$ with $1$ being an $n$-vector of all 1's and $X_1$ an $n \times q$ matrix of known constants representing the $n$ observations on the independent variables $x_1,\ldots,x_q$, $\beta' = (\beta_0, \beta_1,\ldots,\beta_q)$ is a $(q+1)$-vector of unknown parameters, and $e \sim N_n(0, \sigma^2 I_n)$. The least squares estimate of $\beta$ is $\hat\beta = (X'X)^{-1}X'y$, with covariance matrix $\sigma^2(X'X)^{-1} = \sigma^2(w_{ij})$. The usual error sum of squares is $S^2 = y'[\,I_n - X(X'X)^{-1}X'\,]y$.
The problem of selection of variables can be formulated within the framework of testing of hypotheses. For example, the variable $x_i$ may be declared to be important or unimportant according as $\beta_i \neq 0$ or $\beta_i = 0$. In order to test the hypothesis $H: \beta = 0$ using the finite intersection test procedure, we partition the overall hypothesis $H$ into a finite intersection of subhypotheses $H_i: \beta_i = 0$ for $i = 0,1,\ldots,q$. Then we accept or reject $H_i$ according as $F_i \lessgtr F_{i\alpha}$, where

$$ F_i = \frac{\hat\beta_i^2}{w_{ii}S^2/(n-q-1)} $$

and $P[\,\bigcap_{i=0}^q \{F_i \le F_{i\alpha}\} \mid H\,] = 1-\alpha$. In this paper we will only consider the case where $F_{i\alpha} = F_\alpha$ for $i = 0,1,\ldots,q$. Now, the joint distribution of $F_0, F_1,\ldots,F_q$ is a multivariate ($(q+1)$-variate) F distribution with 1 and $n-q-1$ degrees of freedom. Simultaneous confidence intervals associated with this procedure are given by

$$ \hat\beta_i - \sqrt{F_\alpha w_{ii}S^2/(n-q-1)} \le \beta_i \le \hat\beta_i + \sqrt{F_\alpha w_{ii}S^2/(n-q-1)} $$

for $i = 0,\ldots,q$.
For comparison, if we use the usual overall F test, we reject $H$ if $F > F^*$, where

$$ F = \frac{\hat\beta'(X'X)\hat\beta/(q+1)}{S^2/(n-q-1)} $$

and $P[\,F \le F^* \mid H\,] = 1-\alpha$, with $F \sim F_{q+1,\,n-q-1}$. The associated simultaneous confidence intervals are given by

$$ \hat\beta_i - \sqrt{(q+1)F^* w_{ii}S^2/(n-q-1)} \le \beta_i \le \hat\beta_i + \sqrt{(q+1)F^* w_{ii}S^2/(n-q-1)}. $$

Now, since $F_\alpha \le (q+1)F^*$ (see Krishnaiah, 1969), the lengths of the confidence intervals associated with the finite intersection test are never longer than the lengths of the corresponding confidence intervals associated with the overall F test.
In the procedure described above, $H = \bigcap_{i=0}^q H_i$, where $H_i: \beta_i = 0$. Thus, a test is performed on the importance of every independent variable simultaneously, including the intercept. However, it is usually the case that the test $H_0: \beta_0 = 0$ is of no interest, and it is often the case that only a subset of all possible independent variables is to be examined for importance. With this in mind, consider $r$ hypotheses of the form $H_i: c_i'\beta = 0$ for $i = 1,\ldots,r$, with $H^* = \bigcap_{i=1}^r H_i$. In the above context, $c_i' = (0,\ldots,0,1,0,\ldots,0)$, i.e., $c_i$ selects the particular $\beta_i$ of interest for testing, although the procedure described below works for arbitrary $c_i$. Using the finite intersection test procedure, we reject $H_i$ if $F_i > F_\alpha$, where

$$ F_i = \frac{(c_i'\hat\beta)^2}{c_i'(X'X)^{-1}c_i\,S^2/(n-q-1)} $$

and $P[\,\bigcap_{i=1}^r \{F_i \le F_\alpha\} \mid H^*\,] = 1-\alpha$, with the joint distribution of $F_1,\ldots,F_r$ being a multivariate F distribution with 1 and $(n-q-1)$ degrees of freedom. In this case simultaneous confidence intervals are given by

$$ c_i'\hat\beta - \sqrt{F_\alpha c_i'(X'X)^{-1}c_i\,S^2/(n-q-1)} \le c_i'\beta \le c_i'\hat\beta + \sqrt{F_\alpha c_i'(X'X)^{-1}c_i\,S^2/(n-q-1)}. $$
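Given a critical value $F_\alpha$ (obtained, for example, from one of the bounds of Section 2), the statistics and intervals above are straightforward to compute. The sketch below is my own illustration in Python for the intercept-plus-slopes model of this section; it takes the critical value as an argument rather than computing it.

```python
import numpy as np

def fit_intervals(y, X1, F_alpha):
    """FIT statistics and simultaneous confidence intervals for beta_1, ..., beta_q
    in the model y = [1, X1] beta + e, given a critical value F_alpha."""
    n, q = X1.shape
    X = np.column_stack([np.ones(n), X1])
    W = np.linalg.inv(X.T @ X)                 # (w_ij)
    beta_hat = W @ X.T @ y
    S2 = float(y @ y - y @ X @ beta_hat)       # error sum of squares
    mse = S2 / (n - q - 1)

    out = []
    for i in range(1, q + 1):                  # skip the intercept beta_0
        F_i = beta_hat[i] ** 2 / (W[i, i] * mse)
        half = np.sqrt(F_alpha * W[i, i] * mse)
        out.append((beta_hat[i], F_i, beta_hat[i] - half, beta_hat[i] + half))
    return out
```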

Table 1
Relative efficiency of overall F test to the
finite intersection test (α = 0.05, ν = 10)

  r     0.1     0.5     0.9
  1    1.00    1.00    1.00
  2    0.83    0.80    0.71
  3    0.72    0.68    0.57
  4    0.64    0.60    0.48
  5    0.58    0.54    0.42
  6    0.53    0.49    0.37
  7    0.49    0.43    0.33
  8    0.45    0.42    0.30
  9    0.43    0.39    0.28
 10    0.40    0.36    0.26

Table 2
Relative efficiency of overall F test to the
finite intersection test (α = 0.05, ν = 30)

  r     0.1     0.5     0.9
  1    1.00    1.00    1.00
  2    0.83    0.81    0.73
  3    0.72    0.70    0.60
  4    0.65    0.62    0.51
  5    0.59    0.56    0.45
  6    0.54    0.51    0.40
  7    0.50    0.47    0.37
  8    0.47    0.44    0.34
  9    0.44    0.41    0.31
 10    0.42    0.39    0.29

If instead the usual overall F test is used, then we reject $H$ if $F > F^*$, where

$$ F = \frac{(C\hat\beta)'[\,C(X'X)^{-1}C'\,]^{-1}(C\hat\beta)/r}{S^2/(n-q-1)} $$

and $P[\,F \le F^* \mid H^*\,] = 1-\alpha$, with $F \sim F_{r,\,n-q-1}$ and $C' = [c_1,\ldots,c_r]$. Furthermore, simultaneous confidence intervals are given by

$$ c_i'\hat\beta - \sqrt{rF^* c_i'(X'X)^{-1}c_i\,S^2/(n-q-1)} \le c_i'\beta \le c_i'\hat\beta + \sqrt{rF^* c_i'(X'X)^{-1}c_i\,S^2/(n-q-1)}. $$

Again, the lengths of the confidence intervals associated with the finite intersection test are shorter than the lengths of the corresponding confidence intervals associated with the usual overall F test.

Table 3
Relative efficiency of overall F test to the
finite intersection test (α = 0.01, ν = 10)

  r     0.1     0.5     0.9
  1    1.00    1.00    1.00
  2    0.84    0.82    0.76
  3    0.73    0.71    0.62
  4    0.66    0.63    0.53
  5    0.59    0.57    0.47
  6    0.55    0.52    0.42
  7    0.51    0.47    0.38
  8    0.47    0.44    0.34
  9    0.44    0.41    0.32
 10    0.42    0.39    0.30

Table 4
Relative efficiency of overall F test to the
finite intersection test (α = 0.01, ν = 30)

  r     0.1     0.5     0.9
  1    1.00    1.00    1.00
  2    0.85    0.84    0.79
  3    0.75    0.74    0.66
  4    0.68    0.66    0.50
  5    0.62    0.60    0.52
  6    0.57    0.55    0.47
  7    0.53    0.51    0.43
  8    0.50    0.48    0.40
  9    0.47    0.45    0.37
 10    0.44    0.42    0.35

A comparison of the lengths of the confidence intervals associated with the finite intersection test with the corresponding lengths of the confidence intervals associated with the overall F test is given in Tables 1-4. These tables give values of $R^2 = F_\alpha/(rF^*)$ for $\alpha$ (Type I error rate) = 0.01, 0.05 and $\nu$ (error degrees of freedom) = 10, 30. For similar tables when using the finite intersection test in a one-way ANOVA setup, see Cox et al. (1980). For discussions regarding the confidence intervals associated with the overall F test, the reader is referred to Roy and Bose (1953) and Scheffé (1959).
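The tabulated ratio can be approximated using only univariate F quantiles by substituting the product upper bound for $F_\alpha$, as in the short sketch below (my own, in Python). It will not reproduce the tabulated values exactly, since the true $F_\alpha$ depends on the correlation structure that the product bound ignores.

```python
from scipy.stats import f

def rel_efficiency(r, nu, alpha=0.05):
    """Approximate R^2 = F_alpha / (r F*) using the product upper bound for F_alpha."""
    F_alpha = f.ppf((1 - alpha) ** (1.0 / r), 1, nu)   # product-bound approximation
    F_star = f.ppf(1 - alpha, r, nu)                   # overall F critical value
    return F_alpha / (r * F_star)

for r in (1, 2, 5, 10):
    print(r, round(rel_efficiency(r, 10), 2), round(rel_efficiency(r, 30), 2))
```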

4. The finite intersection test -- A simultaneous procedure in the multivariate case

Analogous to the univariate linear model, the multivariate linear model is given by

$$ Y = XB + E $$

where $Y$ is an $n \times p$ matrix of $n$ observations on $p$ variables $y_1,\ldots,y_p$ whose rows are independently distributed as a $p$-variate normal distribution with covariance matrix $\Sigma$ and $E(Y) = XB$. Furthermore, $X$ is as described in the previous section, while $B = [\beta_0, \beta_1,\ldots,\beta_q]'$ is a $(q+1) \times p$ matrix of unknown regression parameters, and $E$ is an $n \times p$ matrix whose rows are independently and identically distributed as a $p$-variate normal distribution with mean vector 0 and covariance matrix $\Sigma$.

The problem of selection of variables under the multivariate regression model can again be formulated within the framework of simultaneous test procedures as in the univariate case. The problem of testing $H: B = 0$ is equivalent to the problem of testing $H_i: c_i'B = 0'$ for $i = 0,1,\ldots,q$ simultaneously, where $c_i' = [c_{0i}, c_{1i},\ldots,c_{qi}]$ for $i = 0,1,\ldots,q$, with

$$ c_{hi} = \begin{cases} 0 & \text{if } h \neq i, \\ 1 & \text{if } h = i, \end{cases} $$

i.e., $H = \bigcap_{i=0}^q H_i$. Note that with $c_i$ as described above, $c_i'B = \beta_i'$ for $i = 0,1,\ldots,q$. We declare that the variable $x_i$ is important or unimportant for prediction of $Y' = (y_1,\ldots,y_p)$ according as $H_i$ is rejected or accepted.

In order to develop the finite intersection test procedure in this case the following notation is needed. Let $\Sigma_k$ denote the $k \times k$ top left-hand corner of $\Sigma = (\sigma_{ij})$, and let $\sigma_{k+1}^2 = |\Sigma_{k+1}|/|\Sigma_k|$ for $k = 0,1,\ldots,p-1$, with $|\Sigma_0| = 1$. Also, let

$$ \gamma_k = \begin{bmatrix} \gamma_{1k} \\ \vdots \\ \gamma_{kk} \end{bmatrix} = \Sigma_k^{-1}\begin{bmatrix} \sigma_{1,k+1} \\ \vdots \\ \sigma_{k,k+1} \end{bmatrix} $$

with $\gamma_0 = 0$. Finally, let $Y = [y_1,\ldots,y_p]$, $Y_j = [y_1,\ldots,y_j]$ for $j = 1,\ldots,p$, $B = [\theta_1,\ldots,\theta_p]$, and $B_j = [\theta_1,\ldots,\theta_j]$ for $j = 1,\ldots,p$.

Now, when $Y_j$ is fixed, the elements of $y_{j+1}$ are distributed normally with common conditional variance $\sigma_{j+1}^2$ and with conditional means

$$ E(y_{j+1} \mid Y_j) = X\eta_{j+1} + Y_j\gamma_j \tag{4.1} $$

where $\eta_{j+1} = \theta_{j+1} - B_j\gamma_j$. Each of the hypotheses $H_0, H_1,\ldots,H_q$ can be expressed as

$$ H_i = \bigcap_{j=1}^p H_{ij} \qquad \text{for } i = 0,1,\ldots,q, $$

where $H_{ij}: c_i'\eta_j = 0$ and $c_i$ is as described previously. Thus the problem of testing $H$, which is equivalent to the problem of testing $H_0, H_1,\ldots,H_q$ simultaneously, is also equivalent to the problem of testing $H_{ij}$ ($i = 0,1,\ldots,q$; $j = 1,\ldots,p$) simultaneously. Noting that the model in (4.1) is just a univariate linear regression model, let $c_i'\hat\eta_j$ denote the best linear unbiased estimate (BLUE) of $c_i'\eta_j$, and let $S_j^2$ denote the usual error sum of squares. With the finite intersection test procedure, the test statistic for testing $H_{ij}$ against a two-sided alternative is

$$ F_{ij} = (c_i'\hat\eta_j)^2/[\,D_{ij}S_j^2/(n-j-q)\,] \tag{4.2} $$

for $i = 0,1,\ldots,q$ and $j = 1,\ldots,p$. In (4.2), $D_{ij}S_j^2/(n-j-q)$ is the sample estimate of the conditional variance of $c_i'\hat\eta_j$. We reject $H_{ij}$ if $F_{ij} > F_\alpha$, where

$$ P[\,F_{ij} \le F_\alpha;\ i = 0,1,\ldots,q;\ j = 1,\ldots,p \mid H\,] = \prod_{j=1}^p P[\,F_{ij} \le F_\alpha;\ i = 0,1,\ldots,q \mid H\,] = 1-\alpha. $$

If any $H_{ij}$ is rejected, then we declare the $i$th variable $x_i$ to be important.



When $H$ is true, the joint distribution of $F_{0j}, F_{1j},\ldots,F_{qj}$, for any given $j = 1,\ldots,p$, is a $(q+1)$-variate F distribution with 1 and $n-j-q$ degrees of freedom. The associated $100(1-\alpha)\%$ simultaneous confidence intervals are given by

$$ c_i'\hat\eta_j - \sqrt{F_\alpha D_{ij}S_j^2/(n-j-q)} \le c_i'\eta_j \le c_i'\hat\eta_j + \sqrt{F_\alpha D_{ij}S_j^2/(n-j-q)}. $$

Several comparisons have been made between the lengths of the confidence intervals for the finite intersection test in the multivariate case and the lengths of confidence intervals derived from other procedures. It is known (see Krishnaiah, 1965b) that the finite intersection test yields shorter confidence intervals than the step-down procedure of J. Roy (1958). Also, Mudholkar and Subbaiah (1979) made some comparisons of the finite intersection test with the step-down procedure and Roy's largest root test. Additional comparisons of interest are to be found in Cox et al. (1980).

Several remarks are in order at this time. First, the critical value $F_\alpha$ has been chosen to be the same for all hypotheses $H_{ij}$. This was done out of convenience but is certainly not necessary. Second, the hypothesis $H_{ij}$ was chosen such that $H_i = \bigcap_{j=1}^p H_{ij}$ and $H = \bigcap_{i=0}^q H_i$, with $H: B = 0$. However, the overall hypothesis $H$ need not be $H: B = 0$. We can just as easily consider any set of hypotheses $H_1,\ldots,H_r$, where $H_i: c_i'B = 0'$ for $i = 1,\ldots,r$, with $c_i$ being chosen as desired and $H = \bigcap_{i=1}^r H_i$. In the context of selection of variables, however, the $c_i$'s are to be chosen so as to select out the particular independent variables of interest (see discussion in the previous section).
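Since a multivariate regression can be expressed as $p$ conditional univariate regressions (a fact also exploited in the examples below), the statistics $F_{ij}$ of (4.2) can be obtained from ordinary least-squares fits. The sketch below is my own illustration in Python of this step-down decomposition; it tests only the coefficients of the independent variables (not the intercept) and takes the critical value $F_\alpha$ as given.

```python
import numpy as np

def multivariate_fit(Y, X1, F_alpha):
    """Step-down FIT statistics F_ij of (4.2) for Y = X B + E with X = [1, X1].
    Returns the q-by-p array of F_ij and the indices of variables declared important."""
    n, p = Y.shape
    q = X1.shape[1]
    X = np.column_stack([np.ones(n), X1])
    F = np.zeros((q, p))
    for j in range(p):
        Z = np.column_stack([X, Y[:, :j]])       # condition on y_1, ..., y_{j-1}
        G = np.linalg.inv(Z.T @ Z)
        coef = G @ Z.T @ Y[:, j]                 # (eta_j', gamma')'
        resid = Y[:, j] - Z @ coef
        S2j = float(resid @ resid)
        mse = S2j / (n - (j + 1) - q)            # error d.f. = n - j - q (j is 1-based in the text)
        for i in range(q):
            k = i + 1                            # skip the intercept column
            F[i, j] = coef[k] ** 2 / (G[k, k] * mse)
    important = [i for i in range(q) if (F[i] > F_alpha).any()]
    return F, important
```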

5. A univariate example

In order to illustrate the use of the finite intersection test in univariate regression, a data set appearing as Exercise E, p. 230 of Draper and Smith (1966) is analyzed.¹ The data consist of 33 years of observations on corn yield, preseason precipitation, and various monthly readings on temperature and precipitation for the state of Iowa. Corn yield (YIELD) is used as the dependent variable, while the remaining nine variables are used as independent variables. These independent variables are year number (YEAR), preseason precipitation (PSP), May temperature (MAY T), June rain (JUN R), June temperature (JUN T), July rain (JUL R), July temperature (JUL T), August rain (AUG R), and August temperature (AUG T). The data were input into a FORTRAN computer program written for use on the DEC-10 computer at the University of Pittsburgh. The results appear in Table 5.

¹The author is grateful to John Wiley & Sons for giving permission to use these data for illustrative purposes.

Table 5
Finite intersection test -- Univariate example

Critical values
  Error degrees of freedom             23
  Sample estimate of error variance    60.8
  Overall α-level                      0.05
  Overall F (9 F_{9,23}(0.95))         20.8809
  Poincaré's lower bound               9.0332
  Sidak's upper bound                  9.1699
  Product upper bound                  9.2969

Simultaneous confidence intervals
  i    β̂_i       F_i        Poincaré's lower bound   Sidak's upper bound   Product upper bound   Overall F test
  1    0.8769    21.96651   (0.31, 1.44)             (0.31, 1.44)          (0.31, 1.45)          (0.02, 1.73)
  2    0.7865     3.35373   (-0.50, 2.00)            (-0.51, 2.09)         (-0.52, 2.10)         (-1.18, 2.75)
  3    0.4549     1.13878   (-1.74, 0.83)            (-1.75, 0.84)         (-1.75, 0.84)         (-2.40, 1.49)
  4   -0.7644     0.52760   (-3.93, 2.40)            (-3.95, 2.42)         (-3.97, 2.44)         (-5.57, 4.04)
  5    0.4997     0.70392   (-1.24, 2.20)            (-1.25, 2.21)         (-1.26, 2.22)         (-2.13, 3.09)
  6    2.5828     3.53278   (-1.55, 6.71)            (-1.58, 6.74)         (-1.61, 6.77)         (-3.70, 8.86)
  7    0.0609     0.00723   (-2.09, 2.21)            (-2.11, 2.23)         (-2.12, 2.25)         (-3.21, 3.34)
  8    0.4017     0.15199   (-2.70, 3.50)            (-2.72, 3.52)         (-2.74, 3.54)         (-4.31, 5.11)
  9   -0.6639     0.88899   (-2.67, 1.34)            (-2.69, 1.36)         (-2.70, 1.37)         (-3.71, 2.39)

The model for the data is

$$ y = X\beta + e $$

where $\beta' = [\beta_0, \beta_1']$, $\beta_1' = [\beta_1,\ldots,\beta_q]$, and the overall hypothesis tested is $H: \beta_1 = 0$ against the alternative $A: \beta_1 \neq 0$. Thus, we test the hypothesis that none of the independent variables are related to the dependent variable, but do not test that the intercept is zero.
In Table 5, note that simultaneous confidence intervals are constructed using Poincaré's lower bound, Sidak's upper bound, and the product upper bound on the critical value for the finite intersection test, and also using the critical value associated with the overall F test. The confidence intervals associated with the overall F test are at least 50% wider than the corresponding confidence intervals using the finite intersection test. However, the confidence intervals constructed using the product upper bound are only 1.4% wider than those constructed using Poincaré's lower bound, while the confidence intervals constructed using Sidak's upper bound are only 0.75% wider than those using Poincaré's lower bound, indicating that a fairly precise estimate of the true critical value is available, at least in this case, using only some probability inequalities.

As for the results of the analysis, it is interesting that the only variable related to corn crop yield is year number, reflecting the well known fact that grain production in the United States has been steadily increasing for the past fifty years, primarily as a result of technological innovations, overwhelming any variation that might exist due to temperature and precipitation.

6. A multivariate example

As an example of the application of the finite intersection test in multivariate linear regression, a data set appearing as Table 4.7.1, p. 314 of Timm (1975) is used. These data are reproduced in Table 6.

Table 6*
SAT  PPVT  RPMT  N  S  NS  NA  SS
49 48 8 1 2 6 12 16
49 76 13 5 14 14 30 27
11 40 13 0 10 21 16 16
9 52 9 0 2 5 17 8
69 63 15 2 7 11 26 17
35 82 14 2 15 21 34 25
6 71 21 0 1 20 23 18
8 68 8 0 0 10 19 14
49 74 11 0 0 7 16 13
8 70 15 3 2 21 26 25
47 70 15 8 16 15 35 24
6 61 11 5 4 7 15 14
14 54 12 1 12 13 27 21
30 55 13 2 1 12 20 17
4 54 10 3 12 20 26 22
24 40 14 0 2 5 14 8
19 66 13 7 12 21 35 27
45 54 10 0 6 6 14 16
22 64 14 12 8 19 27 26
16 47 16 3 9 15 18 10
32 48 16 0 7 9 14 18
37 52 14 4 6 20 26 26
47 74 19 4 9 14 23 23
5 57 12 0 2 4 11 8
6 57 10 0 1 16 15 17
60 80 11 3 8 18 28 21
58 78 13 1 18 19 34 23
6 70 16 2 11 9 23 11
16 47 14 0 10 7 12 8
45 94 19 8 10 28 32 32
9 63 11 2 12 5 25 14
69 76 16 7 11 18 29 21
35 59 11 2 5 10 23 24
19 55 8 0 1 14 19 12
58 74 14 1 0 10 18 18
58 71 17 6 4 23 31 26
79 54 14 0 6 6 15 14

*Reproduced from p. 314, Timm (1975), with permission.



Table 7
Finite intersection test -- Multivariate example -- first dependent variable (RPMT)

Critical values
  Error degrees of freedom             31
  Sample estimate of error variance    8.65
  Overall α-level                      0.0169524
  Poincaré's lower bound               9.8828
  Sidak's upper bound                  10.0195
  Product upper bound                  10.0586

Simultaneous confidence intervals
  i    β̂_i       F_i       Poincaré's lower bound   Sidak's upper bound   Product upper bound
  1    0.2110    0.82773   (-0.52, 0.94)            (-0.52, 0.95)         (-0.52, 0.95)
  2    0.0646    0.24418   (-0.35, 0.48)            (-0.35, 0.45)         (-0.35, 0.45)
  3    0.2136    2.85731   (-0.18, 0.61)            (-0.19, 0.61)         (-0.19, 0.61)
  4   -0.0373    0.06725   (-0.49, 0.42)            (-0.49, 0.42)         (-0.49, 0.42)
  5   -0.0521    0.11646   (-0.53, 0.43)            (-0.54, 0.43)         (-0.54, 0.43)

The three dependent variables are scores on a student achievement test (SAT), the Peabody Picture Vocabulary Test (PPVT), and the Ravin Progressive Matrices Test (RPMT). The independent variables consisted of the sum of the number of items answered correctly out of 20 on a learning proficiency test on two exposures to five types of paired-associated learning proficiency tasks. These five tasks are named (N), skill (S), named skill (NS), named action (NA), and sentence skill (SS). The same FORTRAN program used for the analysis of the previous section was used for this analysis, since when using the finite intersection test, a multivariate linear regression can be expressed as several independent univariate linear regressions. The results appear in Tables 7-9.

The model for these data is

$$ Y = XB + E \tag{6.1} $$

where $B' = [\beta_0, B_1']$, $B_1' = [\beta_1,\ldots,\beta_q]$, and the overall hypothesis tested is $H: B_1 = 0$ against the alternative $A: B_1 \neq 0$. Again, a test on the intercept is not performed. As in the previous univariate example, Tables 7-9 display simultaneous confidence intervals constructed using the three bounds on the critical values. For these data the use of the product upper bound results in confidence intervals only 0.9% wider than the confidence intervals using Poincaré's lower bound, while the use of Sidak's upper bound produces confidence intervals only 0.7% wider than the confidence intervals using Poincaré's lower bound. Again, very satisfactory estimates of the true critical values have been obtained using probability inequalities.

Note that in each of Tables 7-9 the Type I error rate is given as $\alpha^* = 0.0169524$. This yields an experimentwise error rate of $\alpha = 0.05$, since $(1-\alpha^*)^3 = 1-\alpha$, there being 3 dependent variables.

Table 8
Finite intersection test -- Multivariate example -- second dependent variable (PPVT)

Critical values
  Error degrees of freedom                         30
  Sample estimate of conditional error variance    86.49
  Overall α-level                                  0.0169524
  Poincaré's lower bound                           9.9414
  Sidak's upper bound                              10.0781
  Product upper bound                              10.1172

Simultaneous confidence intervals
  i    η̂_i       F_i        Poincaré's lower bound   Sidak's upper bound   Product upper bound
  1   -0.2486     0.11201   (-2.59, 2.09)            (-2.61, 2.11)         (-2.61, 2.11)
  2   -0.7725     3.47015   (-2.08, 0.54)            (-2.09, 0.54)         (-2.09, 0.55)
  3   -0.4684     1.25916   (-1.78, 0.85)            (-1.79, 0.86)         (-1.80, 0.86)
  4    1.5001    10.85130   (0.06, 2.94)             (0.05, 2.95)          (0.05, 2.95)
  5    0.3655     0.57045   (-1.16, 1.89)            (-1.17, 1.90)         (-1.17, 1.90)

Also recall that Tables 8 and 9 display statistics on conditional means, variances, and regression coefficients, the results of Table 8 being conditioned on holding the first dependent variable (RPMT) fixed, and the results of Table 9 being conditioned on holding both the first and second dependent variables (RPMT and PPVT) fixed.

The results of Table 8 show that the overall hypothesis $H$ is rejected, since the hypothesis $H_{42}: \eta_{42} = 0$ is rejected. Thus, the independent variable named action (NA) is probably the only variable of importance in (6.1), and the other independent variables (N, S, NS, SS) can be regarded as unimportant.

Table 9
Finite intersection test -- Multivariate example -- third dependent variable (SAT)

Critical values
  Error degrees of freedom                         29
  Sample estimate of conditional error variance    435.38
  Overall α-level                                  0.0169524
  Poincaré's lower bound                           9.9805
  Sidak's upper bound                              10.1172
  Product upper bound                              10.1563

Simultaneous confidence intervals
  i    η̂_i       F_i       Poincaré's lower bound   Sidak's upper bound   Product upper bound
  1   -0.8567    0.26320   (-6.13, 4.42)            (-6.17, 4.45)         (-6.18, 4.46)
  2    0.1871    0.03624   (-2.92, 3.29)            (-2.94, 3.31)         (-2.94, 3.32)
  3   -1.8858    3.89072   (-4.91, 1.13)            (-4.93, 1.16)         (-4.93, 1.16)
  4   -0.1162    0.00950   (-3.88, 3.65)            (-3.91, 3.68)         (-3.92, 3.68)
  5    2.1723    3.92810   (-1.29, 5.63)            (-1.31, 5.66)         (-1.32, 5.67)

References

[1] Cox, C. M., Krishnaiah, P. R., Lee, J. C., Reising, J. and Schuurmann, F. J. (1980). A study on
finite intersection tests for multiple comparisons of means. In: P. R. Krishnalah, ed., Multi-
variate Analysis, Vol. V. North-Holland, Amsterdam.
[2] Draper, N. R. and Smith, H. (1966). Applied Regression Analysis. Wiley, New York.
[3] Krishnaiah, P. R. (1963). Simultaneous tests and the efficiency of generalized incomplete block
designs. Tech. Rept. ARL 63-174. Wright-Patterson Air Force Base, OH.
[4] Krishnalah, P. R. (1964). Multiple comparison tests in multivariate case. Tech. Rept. ARL
64-124. Wright-Patterson Air Force Base, OH.
[5] Krishnaiah, P. R. and Armitage, J. V. (1965). Probability integrals of the multivariate F
distribution, with tables and applications. Tech. Rept. ARL 65-236. Wright-Patterson Air Force
Base, OH.
[6] Krishnalah, P. R. (1965a). On the simultaneous ANOVA and MANOVA tests. Ann. Inst.
Statist. Math. 17, 35-53.
[7] Krishnalah, P. R. (1965b). Multiple comparison tests in multi-response experiments. SankhyS,
Ser. A 27, 65-72.
[8] Krishnaiah, P. R. (1969). Simultaneous test procedures under general MANOVA models. In:
P. R. Krishnaiah, ed., Multivariate Analysis, Vol. II. Academic Press, New York.
[9] Krishnaiah, P. R. and Armitage, J. V. (1970). On a multivariate F distribution. In: R. C. Bose
et al., eds., Essays in Probability and Statistics. Univ. of North Carolina Press, Chapel Hill, NC.
[10] Krishnaiah, P. R. (1979). Some developments on simultaneous test procedures. In: P. R.
Krishnaiah, ed., Developments in Statistics, Vol. 2. Academic Press, New York.
[11] Krishnaiah, P. R. (1980). Computations of some multivariate distributions. In: P. R. Krishnaiah,
ed., Handbook of Statistics, Vol. 1: Analysis of Variance. North-Holland, Amsterdam.
[12] Mudholkar, G. S. and Subbaiah, P. (1979). MANOVA multiple comparisons associated with
finite intersection tests. In: P. R. Krishnaiah, ed., Multivariate Analysis, Vol. V. North-Holland,
Amsterdam.
[13] Roy, S. N. and Bose, R. C. (1953). Simultaneous confidence interval estimation. Ann. Math.
Statist. 24, 513-536.
[14] Roy, J. (1958). Step-down procedure in multivariate analysis. Ann. Math. Statist. 29, 1177-1187.
[15] Scheffé, H. (1953). A method for judging all contrasts in the analysis of variance. Biometrika 40, 87-104.
[16] Scheffé, H. (1959). The Analysis of Variance. Wiley, New York.
[17] Schuurmann, F. J., Krishnaiah, P. R. and Chattopadhyay, A. K. (1975). Tables for a multivariate F distribution. Sankhyā, Ser. B 37, 308-331.
[18] Sidak, Z. (1967). Rectangular confidence regions for the means of multivariate normal distribu-
tions. J. Amer. Statist. Assoc. 62, 626-633.
[19] Timm, N. (1975). Multivariate Analysis with Applications in Education and Psychology. Brooks/Cole, Monterey, CA.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 835-855

Dimensionality and Sample Size Considerations


in Pattern Recognition Practice

A. K. Jain* and B. Chandrasekaran**

1. Introduction

The designer of a statistical pattern classification system is often faced with the
following situation: finite sets of samples, or paradigms, from the various classes
are available along with a set of measurements, or features, to be computed from
the patterns. The designer usually proceeds by estimating the class-conditional
densities of the measurement vector on the basis of the available samples and uses
these estimates to arrive at a classification function. Naive intuition suggests that
if the dimensionality of the measurement vector is increased, then the classifica-
tion error rate should generally decrease. In the case where the added measure-
ments do not contribute in any way to classification, then the error rate should at
least stay the same. For, after all, is not more information being utilized in the
design? However, in practice quite often the performance of the classifier based
on estimated densities improved up to a point, then started deteriorating as
further measurements were added, thus indicating the existence of an optimal
measurement complexity when the number of training samples is finite. The past
decade has seen much research devoted to elucidating this phenomenon under
various conditions [38].
The purpose of this paper is to discuss the role which the relationship between
the number of measurements (dimensionality of the pattern vector, or simply
dimensionality) and the number of training patterns (sample size) plays at various
stages in the design of a pattern recognition system. The designer of a pattern
recognition system can (and should) pose the following basic question: For a
given knowledge about the form of the underlying class-conditional densities and
the availability of certain numbers of training samples, how many measurements
should be used in designing the classifier? While no specific design equations are
available, our review below shows that general guidelines can be used to clarify
the situation, and help the designer be aware of several possible pitfalls.

*Research supported by NSF Grants ENG 76-11936 Aol and ECS 8007106.
**Research supported by AFOSR Grant 72-2351.


2. Classification performance

The most well-known example of the "curse of finite sample size" is the
peaking in the classification performance as the number of measurements is
increased for a fixed number of training samples. Consider a two-class pattern
recognition problem, where a total of N measurements is made on each pattern.
Let the prior probabilities of the two classes be equal for simplicity and let fᵢ(x) be the class-conditional density function of the measurement vector x from class cᵢ, i = 1, 2. Let us first consider the simple situation where the number of training
samples equals infinity, that is, the class-conditional density functions are com-
pletely known. Then, the probability of correct recognition based on the optimal
Bayes decision rule is given by

$$P_{cr}(N) = \int \max\bigl\{\tfrac{1}{2} f_1(x),\; \tfrac{1}{2} f_2(x)\bigr\}\, dx.$$

It is well known that P_cr(N) ≤ P_cr(N + 1), or, in the presence of perfect information, the classification accuracy would never decrease as the number of measurements is increased. Whether or not lim_{N→∞} P_cr(N) approaches unity, i.e., perfect discrimination is attained in the limit, is, however, a different issue and has been studied in [7, 17, 27], and more recently in [65]. For example, in [65] it is shown that given two classes with known densities f₁(x) and f₂(x), if

$$\lim_{N\to\infty} \frac{\bigl\{E_{f_1}\!\left[\log f_1(x)/f_2(x)\right]\bigr\}^2}{\sigma^2_{f_1}\!\left[\log f_1(x)/f_2(x)\right]} = \infty,$$

then the probability of correct classification for objects from c₁ tends to unity, where E_{f₁} and σ²_{f₁} denote, respectively, the expectation and variance with respect to the f₁ distribution. A similar result holds for c₂.
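For intuition, the sketch below (a hypothetical illustration of our own with equal priors and two unit-variance Gaussian classes, not taken from the chapter) estimates P_cr(N) by Monte Carlo when the densities are completely known; the estimate does not decrease as measurements are added.

import numpy as np

rng = np.random.default_rng(0)

def pcr_known_densities(delta, n_dims, n_mc=200_000):
    """Monte Carlo estimate of P_cr(N) for two equiprobable Gaussian classes
    N(0, I) and N(mu, I), mu_i = delta, with completely known densities.
    The Bayes rule assigns x to the class with the larger density."""
    mu = np.full(n_dims, delta)
    x1 = rng.standard_normal((n_mc, n_dims))            # test points from class 1: N(0, I)
    x2 = rng.standard_normal((n_mc, n_dims)) + mu       # test points from class 2: N(mu, I)
    def llr(x):                                         # log f1(x) - log f2(x) for identity covariance
        return 0.5 * np.sum((x - mu) ** 2 - x ** 2, axis=1)
    return 0.5 * (llr(x1) >= 0).mean() + 0.5 * (llr(x2) < 0).mean()

for N in (1, 2, 5, 10, 20):
    print(N, round(pcr_known_densities(delta=0.3, n_dims=N), 4))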
A more realistic and at the same time mathematically tractable pattern recogni-
tion problem involves the case where the form of the densities fᵢ(x) is known but parameter values are unknown. Let mᵢ be the number of samples available from class cᵢ, i = 1, 2, to train the recognizer. The class-conditional densities need to be estimated from these samples. In the Bayesian formulation, some a priori densities on the parameters of fᵢ(x) are assumed, and through the sample set Xᵢ, fᵢ(x | Xᵢ) can be calculated. On the other hand, one can arrive at maximum likelihood estimates of the density function (by, say, first obtaining the maximum likelihood estimates of the unknown parameters and then substituting these estimated values for the true parameters in fᵢ(x)) if one does not want to concern
oneself with a priori densities on parameters. These estimated density functions
are then used in the Bayes' decision function.
Given the knowledge of a priori densities on the unknown parameters, the
Bayesian method in the design of a classification system is optimal, that is, there

does not exist any other decision rule which has a higher recognition accuracy.
The fact that a priori densities are required to be known and the complexity
involved in computing the a posteriori densities even in the commonly occurring
case of multivariate Gaussian densities with unknown mean vectors and covari-
ance matrices [40] restrict and limit the scope of the Bayesian method. Therefore,
in most practical applications, a suboptimal procedure, such as using maximum likelihood estimates of the parameters in place of their true values, is preferred. In the statistical literature the Bayesian method is often referred to as the predictive method, and the terms estimative procedure and plug-in rule are used to denote the method in which the unknown parameter is replaced by its estimate. Recent
investigation by Aitchison [2] and Aitchison et al. [3] concerns the conditions
under which the predictive method of statistical discrimination has superior
properties. Clearly, the classification performance depends on the estimation
procedure used, and we need to study problems pertaining to the relationship
between dimensionality and sample size in the context of the method of estima-
tion.
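As a concrete, simplified illustration of the predictive/estimative distinction, assume a univariate Gaussian class with known variance and a flat prior on the mean: the predictive density then inflates the variance by a factor (1 + 1/m), whereas the plug-in density simply substitutes the sample mean. The sketch below is our own and is only meant to make the contrast tangible.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

m, true_mean, sigma = 10, 2.0, 1.0
sample = rng.normal(true_mean, sigma, size=m)
xbar = sample.mean()

x = 2.5  # point at which the class-conditional density is evaluated
# estimative (plug-in) density: substitute the ML estimate for the unknown mean
plug_in = norm.pdf(x, loc=xbar, scale=sigma)
# predictive density: average the Gaussian over the posterior of the mean
# (flat prior, known variance) -> N(xbar, sigma^2 * (1 + 1/m))
predictive = norm.pdf(x, loc=xbar, scale=sigma * np.sqrt(1 + 1 / m))
print(plug_in, predictive)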
Allais [4] pointed out an interesting relation between dimensionality, sample
size and recognition accuracy in the linear prediction problem. He considered an
unobservable random variable, y, an observable measurement vector, x, and a
linear predictor of y, g(x), represented as

$$g(x) = c + w^{\mathrm{T}}x,$$
where c is a scalar and wᵀ is a row vector. The performance of g(x) was evaluated
in terms of its mean square error defined as

$$\varepsilon^2 = E\bigl\{[\,y - g(x)\,]^2\bigr\},$$

where E denotes the expectation operator. Allais considered the case where the
joint distribution of predictor y and measurement vector x was multivariate
Gaussian. Since the parameters of this distribution were assumed not known,
Allais considered a maximum likelihood predictor ĝ(x) and derived its uncondi-
tional mean square error as

$$\hat\varepsilon^{\,2} =
\begin{cases}
\bar\varepsilon^{\,2}\left(1 + \dfrac{N}{m - N - 1}\right) & \text{for } N \le m - 2,\\[4pt]
\infty & \text{for } N = m - 1,\\
\text{undefined} & \text{for } N \ge m,
\end{cases}$$

where N is the number of observable measurements, m is the number of samples


and ε̄² is the ideal mean square error assuming all parameters of the distribution
are known. It can be easily verified from the above expression that for a fixed
sample size the mean square error of the maximum likelihood predictor decreases
at first, attains a minimum and starts increasing as the number of measurements
is gradually increased. Results reported by Allais raised the following question: Is

the existence of a finite measurement complexity due only to finite sample size, or
whether the kind of estimates (maximum likelihood vs. Bayesian) used had any
relationship to it?
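The peaking Allais observed is easy to reproduce empirically. The sketch below (an illustrative simulation of our own, with jointly Gaussian (y, x), measurement usefulness decaying down the list, and arbitrarily chosen parameters; it is not Allais' analytical computation) fits a least-squares linear predictor on m samples and estimates its mean square error on fresh data: the error first falls and then rises as N grows.

import numpy as np

rng = np.random.default_rng(2)

def mse_of_plug_in_predictor(n_feats, m_train=30, n_test=2000, n_rep=100):
    """Average test mean square error of a least-squares linear predictor of y
    built from the first n_feats of 40 independent Gaussian measurements whose
    regression coefficients decay as 2/i (so later measurements help little)."""
    total = 40
    beta = 2.0 / np.arange(1, total + 1)
    errs = []
    for _ in range(n_rep):
        X = rng.standard_normal((m_train, total))
        y = X @ beta + rng.standard_normal(m_train)
        Xt = rng.standard_normal((n_test, total))
        yt = Xt @ beta + rng.standard_normal(n_test)
        A = np.c_[np.ones(m_train), X[:, :n_feats]]        # design matrix with intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        pred = np.c_[np.ones(n_test), Xt[:, :n_feats]] @ coef
        errs.append(np.mean((yt - pred) ** 2))
    return np.mean(errs)

for N in (1, 5, 10, 15, 20, 25):
    print(N, round(mse_of_plug_in_predictor(N), 2))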

2.1. Optimal Bayes classification rule


Hughes [30] considered the behavior of a finite-sample Bayesian classifier with
respect to increasing dimensionality. The model considered in [30] and [1] can be
summarized as follows.
Let the measurement x take one of n possible values, where n is the measurement complexity. Let αᵢ = P[x = xᵢ | c₁] and βᵢ = P[x = xᵢ | c₂] such that

$$\sum_{i=1}^{n} \alpha_i = \sum_{i=1}^{n} \beta_i = 1.$$

If the pattern vector is an N-dimensional binary vector, for instance, n would be 2^N. The sets {αᵢ} and {βᵢ} specify a particular pattern recognition problem. If some a priori densities on {αᵢ} and {βᵢ} are assumed, and if one uses a Bayesian solution, then from available samples (say mᵢ in number from cᵢ) Bayesian estimates of {αᵢ} and {βᵢ} can be obtained, and the optimal decision function for x = xᵢ is then: decide class c₁ if α̂ᵢ > β̂ᵢ, class c₂ otherwise. Assuming the a priori distribution on {αᵢ} and {βᵢ} to be uniform in the respective simplexes, Hughes [30, 1] computed a quantity P̄_cr(n, m₁, m₂) that is the average probability of correct recognition over all problems generated by the assumed a priori densities on {αᵢ} and {βᵢ} for fixed sample sizes m₁ and m₂. This quantity increases with n up to a point n_opt, called the optimal measurement complexity, then starts decreasing until, as n → ∞, P̄_cr → ½. As the number of training samples increases, n_opt increases, and in the limit as m₁ → ∞ and m₂ → ∞, n_opt → ∞, indicating the absence of peaking in the infinite sample case. Kain [36] extended the results of Hughes to the multi-class case.
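The Hughes effect can be reproduced by simulation. The sketch below (our own Monte Carlo approximation under uniform Dirichlet priors on {αᵢ} and {βᵢ}, not Hughes' exact computation) averages the true recognition rate of the estimated rule over randomly generated problems and exhibits a peak at a finite measurement complexity.

import numpy as np

rng = np.random.default_rng(3)

def mean_recognition_rate(n_cells, m=10, n_problems=400):
    """Average probability of correct recognition when the cell probabilities
    {alpha_i}, {beta_i} are drawn uniformly on the simplex, m samples per class
    are observed, and x = x_i is assigned to class 1 whenever the Bayesian
    estimate of alpha_i exceeds that of beta_i."""
    rates = []
    for _ in range(n_problems):
        alpha = rng.dirichlet(np.ones(n_cells))
        beta = rng.dirichlet(np.ones(n_cells))
        # posterior-mean (Laplace-smoothed) estimates from m samples per class
        a_hat = (rng.multinomial(m, alpha) + 1) / (m + n_cells)
        b_hat = (rng.multinomial(m, beta) + 1) / (m + n_cells)
        decide_1 = a_hat > b_hat
        # true probability of correct recognition of the estimated rule
        rates.append(0.5 * alpha[decide_1].sum() + 0.5 * beta[~decide_1].sum())
    return np.mean(rates)

for n in (2, 4, 8, 16, 32, 64, 128):
    print(n, round(mean_recognition_rate(n), 3))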
The results of [30] and [1] showed that even if a Bayesian procedure was used
which is optimal in the sense that it minimizes the probability of misclassification,
one could get into trouble by using too many measurements when the number of
training samples was finite. This led to a number of investigations [11, 12, 69]
where attempts were made to find useful exceptions to this "curse of finite sample
size" in the Bayesian context. For example, it was shown that if the measurements
are independent and binary [11], or independent and quantized to k levels [12], or
first-order nonstationary Markov dependent and binary [69], then there was no
peaking of performance in the Bayesian context with respect to the measurement
complexity for finite sample size. Kanal and Chandrasekaran [39] emphasized the
relationship between the probability structure of the classification problem,
sample size and dimensionality to explain the peaking in Hughes' model and the monotonicity of P̄_cr in the model of Chandrasekaran [11]. They stated that for a
given sample size the optimum measurement complexity as well as the maximum

probability of correct recognition increases with the increased structure in the


classification problem.
The peaking of classification accuracy when the Bayesian procedure is used as
reported in [30] needs some explanation due to the optimality of the Bayesian
procedure. This apparent paradox was resolved recently by Van Campenhout [64]
and Waller and Jain [70], but was already implicit in the earlier work of Lindley
[44, 45]. They show that even with a finite set of training samples the performance
of a Bayesian classifier can not be degraded by increasing the number of features
as long as the old set of features is recoverable from the new set of features [70] or
the two classification problems (as a result of increasing the number of features)
are comparable [64]. The peaking phenomenon reported by Hughes [30] can be
attributed to incompatible prior densities. To be more specific, let us consider two
classification problems in the context of Hughes' model consisting of n and n'
measurement states, and let the corresponding prior densities on the unknown
parameters be gᵢ(θᵢ) and g′ᵢ(θ′ᵢ), respectively, for class cᵢ, i = 1, 2. Hughes assumed these prior densities to be uniform over the n- and n′-dimensional simplexes respectively, and for a fixed sample size mᵢ from class cᵢ, i = 1, 2, he computed the mean recognition accuracies P̄_cr(n, m₁, m₂) and P̄′_cr(n′, m₁, m₂). These mean recognition accuracies involve two types of averaging: first an average is taken over all of the possible sets of random samples of size mᵢ from class cᵢ, and then the resulting error rate is averaged over all possible problems generated by the prior densities. Both Van Campenhout [64] and Waller and Jain [70] point out that P̄_cr(n, m₁, m₂) and P̄′_cr(n′, m₁, m₂), corresponding to the two classification problems, are incomparable due to incompatible priors. That is, if the prior density gᵢ(θᵢ) for parameters in an n-measurement-states problem is uniform, then g′ᵢ(θ′ᵢ) for parameters in a problem with n′ states is not uniform as assumed in [30, 12].
In summary, the results of [64, 70] have proved that if the designer of a pattern
recognition system would be willing to take the Bayesian approach, then even for
a finite number of training samples the number of measurements can be increased
arbitrarily without the fear of peaking in the average recognition accuracy. In
other words, taking a Bayesian approach results in the optimum measurement
complexity being equal to infinity irrespective of the number of training samples
available. This does not imply that using Bayesian approach would always lead to
limu~ o~/~r(N, m l, m2)= 1, or that by the addition of arbitrarily large number of
measurements in the classifier design one would get perfect discrimination.
Particular cases in which perfect discrimination is possible in the limit as the
number of measurements is increased are investigated in [11, 12, 69]. Even though
one is guaranteed not to do worse by taking more measurements in the Bayesian
context, the cost and complexity due to additional measurements may not be
worth the slight improvement (if there is any) in the recognition accuracy. Thus it
will be interesting to know the rate of convergence of P̄_cr and the conditions for
perfect discrimination as the number of measurements is increased. What role
does the structure of the underlying classification problem play in the perfor-

mance of a Bayesian classifier? Waller and Jain [69] demonstrate that for a given
sample size and measurement complexity, P̄_cr increases as the problem becomes
more structured.
So far, in the Bayesian context, we have talked only about the average
performance over all possible problems generated by the a priori densities.
However, what about the performance P_cr in an individual problem (a specific set
of parameters), even though the parameter estimates are Bayesian for the given a
priori densities? This is the kind of performance in which a designer of a
recognition system is really interested. In this case the notion of the 'optimality'
of the estimate does not enter, since the optimality of the Bayesian estimates is
assured by averaging over a problem space generated by the a priori densities. In
[13], [14] and [65], examples are given that show that peaking is indeed possible in
this situation. More generally, not much is known about the conditions for perfect
discrimination or the convergence of Pcr as the number of measurements is
increased. If the measurements are independent, then the sufficient conditions of
[13], [14] and [65], can be used to determine if, for a specific problem, perfect
discrimination is possible in the limit. But in order to check for these conditions
one must know the true parameters of the class-conditional densities, which, if
available, would obviate the need for estimation in the first place. However, these
conditions do illuminate the fact that a finite sample size imposes greater
constraints on the measurement parameters to be 'good'. For example, if the
measurements are independent and binary with parameters pᵢ = P(xᵢ = 1 | c₁) and qᵢ = P(xᵢ = 1 | c₂), i = 1, …, N, then, if one has an infinite number of samples, a good measurement is one for which |pᵢ − qᵢ| ≥ δ > 0, a condition which is not sufficient for perfect discrimination for the finite sample case.
2.2. Suboptimal classification rules
The average recognition accuracy of a Bayesian classifier, as shown in [64, 70]
will never decrease as the number of measurements is increased. However, when
the approach to classifier design is non-Bayesian--typically in such a situation
the parameters of the distributions may be estimated by say maximum likelihood
methods--then peaking becomes theoretically possible. There is no longer any
obviously natural notion of 'optimality', i.e., while the original Bayes' rule is
optimal, the decision rule that results by substituting the maximum likelihood
estimates of the parameters is no longer optimal, and this is often the cause of
peaking observed in the performance of many practical pattern classifiers. In
some sense the errors caused by the nonoptimal use of added information
outweigh the advantages of extra information. The mechanism behind this
self-defeating behavior is the subject of this section.
Classification problems involving multivariate Gaussian densities have received
the most attention in the literatures both on statistics and on pattern recognition.
This is understandable due to the ease in mathematical analysis and because
class-conditional densities of many real-world classification problems can be
reasonably approximated as multivariate Gaussian. Let us consider two equiprob-
able pattern classes which are represented by multivariate Gaussian densities with

a common N × N non-singular covariance matrix Σ and mean vectors μ₁ and μ₂.
Assuming that these parameters (μ₁, μ₂ and Σ) are known, the decision rule which results in minimum classification error is a linear discriminant function given by: decide that vector x belongs to class c₁ if

$$x^{\mathrm{T}}\Sigma^{-1}(\mu_1 - \mu_2) > \tfrac{1}{2}(\mu_1 + \mu_2)^{\mathrm{T}}\Sigma^{-1}(\mu_1 - \mu_2),$$

otherwise decide class c₂. It is well known that the probability of error using this discriminant function is related to the Mahalanobis distance Δ²_N between the two populations, given as

$$\Delta^2_N = (\mu_1 - \mu_2)^{\mathrm{T}}\Sigma^{-1}(\mu_1 - \mu_2).$$

The larger the Mahalanobis distance, the smaller the probability of error. Usually
the parameters Σ, μ₁ and μ₂ are not known and the following estimates of these parameters based on mᵢ training samples from class cᵢ are commonly used:

$$\hat\mu_i = \frac{1}{m_i}\sum_{j=1}^{m_i} x_j^{(i)}, \qquad i = 1, 2,$$

$$S = \frac{1}{m_1 + m_2 - 2}\sum_{i=1}^{2}\sum_{j=1}^{m_i}\bigl(x_j^{(i)} - \hat\mu_i\bigr)\bigl(x_j^{(i)} - \hat\mu_i\bigr)^{\mathrm{T}},$$

where μ̂ᵢ is the sample mean of class cᵢ and S is the pooled unbiased estimate of the common covariance matrix Σ. In the above expressions x_j^(i) refers to the jth training sample from class cᵢ. In this situation, the most commonly used decision
rule is based on the W statistic proposed by Wald [5, 68] where

$$W = x^{\mathrm{T}}S^{-1}(\hat\mu_1 - \hat\mu_2) - \tfrac{1}{2}(\hat\mu_1 + \hat\mu_2)^{\mathrm{T}}S^{-1}(\hat\mu_1 - \hat\mu_2).$$

The decision rule is to assign a feature vector x to class c₁ if W ≥ 0, otherwise


assign it to class c₂. Clearly, this decision rule is suboptimal. Various authors have given the exact distribution of the W statistic and expressions for the associated probability of misclassification [9, 10, 28, 34, 35, 50, 57, 58, 59, 60]. Unfortunately, if the two classes have unequal covariance matrices (the resulting discriminant function is quadratic), then the determination of the expected probability of misclassification is extremely difficult and to our knowledge no such expression exists.
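A minimal sketch of the plug-in rule just described (plain numpy; the function and variable names are ours) computes μ̂₁, μ̂₂, the pooled estimate S and the W statistic for a new pattern.

import numpy as np

def fit_w_classifier(X1, X2):
    """Plug-in linear discriminant based on the W statistic.
    X1, X2: (m_i, N) arrays of training samples from classes c1 and c2."""
    m1, m2 = len(X1), len(X2)
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # pooled unbiased estimate of the common covariance matrix
    S = ((X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)) / (m1 + m2 - 2)
    S_inv = np.linalg.inv(S)
    def W(x):
        d = mu1 - mu2
        return x @ S_inv @ d - 0.5 * (mu1 + mu2) @ S_inv @ d
    return W

# toy usage: assign x to c1 when W(x) >= 0
rng = np.random.default_rng(4)
X1 = rng.multivariate_normal([0, 0, 0], np.eye(3), size=20)
X2 = rng.multivariate_normal([1, 1, 1], np.eye(3), size=20)
W = fit_w_classifier(X1, X2)
x_new = np.array([0.2, 0.1, 0.4])
print("class c1" if W(x_new) >= 0 else "class c2")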
Now that we have defined a commonly occurring classification problem
involving two multivariate Gaussian distributions with unknown mean vectors
and (common) covariance matrix, we are ready to discuss the role of dimensional-
ity and sample size in the performance of a suboptimal classifier. In fact, almost
all of the work which has been done pertaining to the relationship between
dimensionality, sample size and probability of misclassification is in the context
of the above mentioned model.

Rao [52] first illustrated the dangers of using too many measurements in a
classification problem involving two populations some thirty years ago. He stated
that " I t does not seem to be, always, the more the better...". It is unfortunate
that pattern recognition community is not aware of this pioneering work by Rao.
The example used by Rao to illustrate this problem involved discrimination
between Indian and Anglo-Indian skeletons based on two measurements--lengths
of Femur and Humerus. Rao took 20 samples of Indian skeletons and 27 samples
of Anglo-Indian skeletons and used the estimate of the Mahalanobis distance
(D² = (μ̂₁ − μ̂₂)ᵀS⁻¹(μ̂₁ − μ̂₂)) to test if the separation between the two popula-
tions is significant (at 5% level). It was found that while the two populations were
significantly different when only a single measurement (either Femur length or
Humerus length) was used, there was no significant separation when both the
measurements were used. Rao provides a test to determine whether the addition
of q more measurements to an existing set of N measurements increases the
distance between the two populations and concludes that if the Mahalanobis
distance (Δ²_N) increases proportionately with the number of measurements, then,
except in situations where the total number of samples is very small, the addition
of extra measurements does not result in a loss in discrimination.
More recently Jain and Waller [33] have studied the peaking phenomenon in an
effort to relate the optimum number of measurements to the number of available
training samples and the Mahalanobis distance between the two populations.
Their results can be related to those obtained by Rao [52]. For a classification
problem involving two equiprobable multivariate Gaussian densities with a
common covariance matrix they use the asymptotic expansion of the average
probability of error (good up to order 1/m²) derived in [59] to conclude the following (of course, we are assuming that the estimates μ̂ᵢ and S given earlier are
being used):
(1) The minimal increase in the Mahalanobis distance needed to keep the same
error rate when a measurement is added to a set of N features is

$$\delta\Delta^2 = \Delta^2_N/(2m - 3 - N),$$

where m is the number of training samples per class. In order to avoid peaking, δΔ² > Δ²_N/(2m − 3 − N). Note that increasing the sample size decreases the value of δΔ² and in the limit as m → ∞, δΔ²_N → 0 (a short numerical sketch of this condition follows after this list).
(2) If the Mahalanobis distance is proportional to the number of measurements or, equivalently, if all the features are equally good, then the peaking in the performance of the classifier is not a real problem because N_opt = m − 1.
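A short numerical sketch of condition (1) above (our own illustration, with arbitrarily chosen Δ²_N and sample sizes) tabulates the minimal increase δΔ² for a few values of m and N.

# Minimal increase in squared Mahalanobis distance needed to avoid peaking
# when a measurement is added to N features, per the condition quoted above.
def minimal_increase(delta_sq, m, N):
    return delta_sq / (2 * m - 3 - N)

for m in (20, 50, 200):
    print(m, [round(minimal_increase(2.0, m, N), 4) for N in (5, 10, 15)])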
In order to study the effect of the structure of the covariance matrix on the
optimum number of measurements, Jain and Waller [33] considered three types of
Toeplitz matrices for Σ. However, in their analysis, the classifier did not incorpo-
rate knowledge of the form or the structure of the true covariance matrix, and
thus they used the general covariance matrix estimator S in the W statistic. As
was expected, the performance of a classifier improved if it could utilize the
knowledge about the structure of the covariance matrix, as was recently con-

firmed by Morgera and Cooper [48]. They introduce the notion of 'effective
sample size' to show that the constrained Toeplitz estimator provides better
performance at a sample size m for which the generalized estimator has poor
performance. In other words, for the same performance a classifier using con-
strained Toeplitz estimates requires fewer samples than if the generalized estima-
tor is used, assuming that the true covariance matrix is of Toeplitz form.
Interestingly enough this reduction in required sample size (to maintain the same
performance) is an increasing function of the dimensionality N.
Several recent investigations have demonstrated the relationship between di-
mensionality, sample size and recognition accuracy based on Monte Carlo simula-
tions. Boullion et al. [8] show that if the decision rule is in the form of a linear
discriminant function, then a subset of measurements can yield better classifica-
tion results than the full set of measurements. A real-world example they
considered to demonstrate this consisted of data from a 12-channel sensor used in
remote sensing of agricultural crops (soybeans and corn) obtained from NASA. If
only thirty training samples are available from each of the two classes, then the
expected probability of misclassification is lowest when only six measurements
are used out of a possible total of twelve. The number of training samples per
class must exceed one hundred before all twelve measurements can be used safely.
Van Ness and Simpson [67] and Van Ness [66] confirm the results of [33, 52],
namely the Mahalanobis distance between the two populations must increase as
the number of measurements is increased for the classification performance to
stay at least the same. However, they do not provide any explicit expression for
this increase as a function of sample size, dimensionality and Mahalanobis
distance. Van Ness and Simpson [67] also compare experimentally the classifica-
tory power of five different discriminant functions (in the order of assuming less
and less knowledge about the true underlying distributions which were multi-
variate Gaussian): linear with unknown mean vectors and known covariance
matrices, linear with unknown mean vectors and unknown but common covari-
ance matrix, quadratic with unknown mean vectors and unknown covariance
matrices and finally two non-parametric decision rules involving Parzen window
estimates of the density function based on Gaussian and Cauchy window func-
tions. A surprising result of this comparison was that the two non-parametric
decision rules performed better than the linear and quadratic discriminant
functions (with unknown covariance matrices) even when the dimensionality was
small.
Results reported by Raudys [53] are similar to those in [8, 67]. Like Van Ness
and Simpson [67], Raudys uses Monte Carlo simulations to generate tables
showing the relationship between sample size, dimensionality, classification accu-
racy and complexity of the classification rule. As is understandable, this table
depends on the true underlying class-conditional densities, which in the case of
Raudys was assumed to be multivariate Gaussian with a common identity
covariance matrix. An important guideline proposed by Raudys is that the
number of training samples required to achieve a given recognition accuracy
should increase linearly with the number of measurements for linear discriminant

functions, and should increase quadratically for quadratic discriminant functions.


This is in general agreement with the acceptable code of good practice in pattern
recognition design, namely the number of training samples must be five to ten
times the number of measurements. Raudys suggests using his table to determine
the optimal number of measurements for a given sample size, but this requires
that the designer knows the true underlying distributions.
In an effort to isolate the effect of finite sample size on expected probability of
error, Duin [21] derives the following expression for a two-class classification
problem

$$E(P_e) \le P^* + E(\varepsilon_1 + \varepsilon_2),$$

where P* is the optimal Bayes error and

$$\varepsilon_i = \tfrac{1}{2}\int \bigl|f_i(x) - \hat f_i(x)\bigr|\, dx = 1 - \int \min\bigl(f_i(x), \hat f_i(x)\bigr)\, dx, \qquad i = 1, 2, \quad 0 \le \varepsilon_i \le 1.$$

In the above expression, f̂₁(x) and f̂₂(x) are the estimates of the class-conditional densities which are used in the design of the classifier. Thus the quantities εᵢ, i = 1, 2, are the errors made in estimating the densities fᵢ(x), i = 1, 2, respectively.
It is clear that E(εᵢ) is a function of the sample size, dimensionality and the true underlying density. For Gaussian densities, Duin computes E(εᵢ) for different values of m and N using Monte Carlo runs, and shows that as dimensionality increases, more samples are needed to keep E(εᵢ) at some fixed value. This
supports one of the main reasons given to explain peaking: for a fixed sample
size, as the dimensionality increases, the extra discriminatory power provided by
the added features is overcome, after a certain point, by the deterioration in the
estimates of the densities. More work needs to be done to establish this point of
crossover for underlying Gaussian densities with different parameter values.
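The density-estimation error εᵢ in Duin's bound can be approximated by simulation. The sketch below (our illustration for a univariate standard Gaussian estimated by an ML plug-in Gaussian, not Duin's original experiment) estimates E(εᵢ) for several sample sizes; the error shrinks as m grows.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
grid = np.linspace(-8, 8, 4001)
dx = grid[1] - grid[0]

def expected_density_error(m, n_rep=300):
    """Monte Carlo estimate of E(eps) = E[0.5 * integral |f - f_hat| dx] for
    f = N(0, 1) estimated by a plug-in Gaussian with ML mean and variance."""
    f = norm.pdf(grid)
    errs = []
    for _ in range(n_rep):
        sample = rng.standard_normal(m)
        f_hat = norm.pdf(grid, loc=sample.mean(), scale=sample.std())
        errs.append(0.5 * np.sum(np.abs(f - f_hat)) * dx)   # numerical L1 distance
    return np.mean(errs)

for m in (5, 10, 20, 50, 100):
    print(m, round(expected_density_error(m), 3))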
So far, most of the results we have summarized deal with Gaussian densities in
one way or another. Can some set of general conditions be obtained as a function
of f₁(x), f₂(x), m₁, m₂, and N such that for a two-class classification problem we
are guaranteed to have perfect discrimination and monotonicity of expected
probability of misclassification as the number of measurements is increased?
Chandrasekaran and Jain [13, 14] provide a partial answer to the question raised
above (see also [65] for some corrections to the results of [13]). To summarize the
results of [14] and [65], let

$$f_1(x) = \prod_{i=1}^{N} f_{1i}(x_i) \qquad\text{and}\qquad f_2(x) = \prod_{i=1}^{N} f_{2i}(x_i)$$

be the two class-conditional densities. The pattern vector x consists of N statisti-


cally independent measurements. Following the notation in [14] we use the

following abbreviations:

$$\hat d_i \equiv \log \hat f_{1i} - \log \hat f_{2i}, \qquad M_i^{(j)} \equiv E_{x \in c_j} E_X\, \hat d_i,$$

and

$$V_i^{(j)} \equiv E_{x \in c_j} E_X \bigl[\hat d_i - M_i^{(j)}\bigr]^2,$$

where f̂ stands for estimated densities, c₁ and c₂ are the two classes, E_{x∈c_j} stands for expectation with respect to class c_j, and E_X for expectation with reference to the training data set X. Now the probability of correct classification, given that c₁ is the true class, can be expressed as

$$P\Bigl\{\sum \hat d_i \ge 0 \,\Big|\, c_1\Bigr\} = P\left\{\frac{\sum \hat d_i - \sum M_i^{(1)}}{\bigl(\sum V_i^{(1)}\bigr)^{1/2}} \ge \frac{-\sum M_i^{(1)}}{\bigl(\sum V_i^{(1)}\bigr)^{1/2}}\right\}.$$

All the summations above and below are from 1 to N.


Using Chebyshev's inequality it can be shown that the probability of error, given c₁, is bounded above by

$$\frac{\sum V_i^{(1)}}{\bigl(\sum M_i^{(1)}\bigr)^2}.$$

Thus a sufficient condition for the probability of correct classification for elements of c_j to approach one as N → ∞ is

$$\lim_{N\to\infty} (-1)^{j-1}\, \frac{\sum M_i^{(j)}}{\bigl(\sum V_i^{(j)}\bigr)^{1/2}} = \infty.$$
These two sufficient conditions ( f o r j -- 1 and 2) are useful in determining whether
perfect discrimination is possible in the limit, but they do not assure us of the
monotonicity of the probability of correct recognition. However, if the Central
Limit Theorem can be applied so that the random variable Σd̂ᵢ can be viewed to have a Gaussian distribution, then the two sufficient conditions also become necessary. Consequently, as the quantity ΣMᵢ^(1)/(ΣVᵢ^(1))^{1/2} increases by the addition of measurements, the probability of correct classification, given c₁, increases
correspondingly. Similar arguments hold for the probability of correct classifica-
tion, given c 2. While the verification of the conditions for the applicability of the
Central Limit Theorem are not easy, let us consider the following example where
we know them to be true.
Let the measurements be independent, continuous and normally distributed, f₁ᵢ(xᵢ) = N(θᵢ, 1) and f₂ᵢ(xᵢ) = N(φᵢ, 1), where N(μ, σ²) is the normal density

with mean μ and variance σ². If θ̂ᵢ and φ̂ᵢ are the maximum likelihood estimates


based on m training samples per class, then it can be shown that

$$M_i^{(1)} = -M_i^{(2)} = (\theta_i - \varphi_i)^2,$$

$$V_i^{(1)} = V_i^{(2)} = K_1(m)(\theta_i - \varphi_i)^2 + K_2(m),$$

where K₁ and K₂ are positive for all positive m. Thus, the necessary and sufficient condition for perfect discrimination and monotonicity of recognition accuracy is that Δ²_N = Σ(θᵢ − φᵢ)² is of order √N as N → ∞. Note that in this example the covariance matrix is the identity matrix, which is known by the classifier, whereas in the models of Rao [52] and Jain and Waller [33] the common covariance matrix has to be estimated. This explains why, in the less structured models of [52, 33], the Mahalanobis distance Δ²_N is required to increase proportionally to N to avoid peaking, while in the above example it is sufficient that Δ²_N is of order √N. We can only hypothesize that if the two covariance matrices are unequal and have to be estimated, then Δ²_N must be of order N² to avoid peaking.
The conditions derived by Chandrasekaran and Jain [14] are, as mentioned, for
the case of statistically independent measurements. In [15] the same authors have
generalized the conditions for the case of dependent measurements with arbitrary
distributions. Let d(N; x) be the Bayes decision function such that x ∈ c₁ if d(N; x) > 0 and x ∈ c₂ otherwise, where the pattern vector x has N components, and let d̂(N; x) be the classifier obtained by using estimates. Let further

$$M_N^{(j)} = E_{x \in c_j} E_X\, \hat d \qquad\text{and}\qquad V_N^{(j)} = E_{x \in c_j} E_X \bigl\{\hat d - M_N^{(j)}\bigr\}^2.$$

Then by arguments similar to those of Gaffey [27] and Van Ness [65], it can be
shown that, if

$$\lim_{N\to\infty} M_N^{(j)}\big/\bigl[V_N^{(j)}\bigr]^{1/2} = (-1)^{j-1}\,\infty,$$

then lim_{N→∞} P_cr = 1 for elements of class c_j, j = 1, 2. However, applying these
conditions to actual cases will be more or less difficult depending upon the
tractability of the underlying distributions. For the case of multivariate normal
distributions with unknown mean vectors and known covariance matrices, more
compact conditions for perfect discrimination are given in [15], and experimental
and mathematical investigations of some aspects of performance as the dimen-
sionality is increased are provided in [62] and [55].

2.3. Balancing decision functions due to unequal sample size


There is an interesting aspect of statistical pattern classification that has not
received much attention; namely, the degradation in the classification perfor-
mance when the numbers of training samples from various classes are substan-
tially different. A few authors (Rao [52], Jain and Waller [33], and Levine, Lustick

and Saltzberg [43]) have outlined the advantages of having equal numbers of
samples. In the context of a linear discriminant function Rao [52] showed that for
given Δ²_N, N and (m₁ + m₂), and using maximum likelihood estimates, it is more profitable to have m₁ = m₂. Note that D² is often used as a feature selection criterion. Jain and Waller [33] used an asymptotic expansion to show that Okamoto's [50] probability of misclassification is minimum when m₁ = m₂, and
that the degradation in the classification performance due to an unbalanced set of
training samples is more severe for a large number of measurements. Levine et al.
[43] show that for a nearest-neighbor decision rule, best results are obtained when
m₁ = m₂. The above results suggest that if the designer can obtain only a small
number of samples from one class, it is not necessary to compensate by taking
large samples from the other class. In fact, Chandrasekaran and Jain [13]
demonstrate a counter-intuitive phenomenon whereby discarding the excess sam-
ples from the class containing the greater number of samples is profitable.
What should a designer do when confronted with unequal numbers of training
samples? In this situation the degrees of reliability associated with the estimates
of the different class density functions are clearly different, and one has the
intuitive feeling that this factor should somehow be taken into account, i.e., the
decision function must be 'balanced' with respect to the different sample sizes. To
be more concrete, while f̂₁(x) and f̂₂(x) might be individually the 'best' estimates of the density function, it is not at all clear that d̂(x) = {log f̂₁(x) − log f̂₂(x)} is the best decision function to use if the sample sizes are substantially different. Perhaps a modification such as log W[f̂₁(x), m₁] − log W[f̂₂(x), m₂], where W is a weighting function and m₁ and m₂ are the two sample sizes, might perform better.
Kaminuma and Watanabe [37] proposed the idea of a well-balanced adaptive
decision function in the context of a perceptron convergence algorithm, when
substantially different numbers of paradigms from the two classes are used. These
authors proposed to adjust the position and orientation of the resulting hyper-
plane to reflect the different numbers of paradigms. In [37] the problem was
posed in a non-probabilistic context, and the criterion was heuristic in nature.
Chandrasekaran and Jain [15] observed that this question of balancing does not
arise in the case where the classifier is Bayesian, i.e., prior distributions on
unknown parameters are assumed and the classifier essentially determines p(x | X)
where X denotes the sample set. In this approach the weighting is automatically
incorporated in the posterior probabilities.
In [15] Chandrasekaran and Jain suggested that the criterion for weighting the
estimated densities should be chosen so that the counter-intuitive phenomenon of
doing better by discarding excess samples from the class containing greater
number of samples, reported in [13], does not arise. Consider a two-class problem
involving multivariate Gaussian densities where the common covariance matrix is
known, and the mean vectors θ and φ are unknown. They showed that if maximum likelihood estimates of θ and φ are used based on m₁ and m₂ samples, respectively, then a balanced decision function based on the above criterion is

$$\hat d_b(x) = \frac{N}{2m_1} - \frac{N}{2m_2} + \hat d(x),$$



where

$$\hat d(x) = \bigl[x - \tfrac{1}{2}(\hat\theta + \hat\varphi)\bigr]^{\mathrm{T}}\Sigma^{-1}(\hat\theta - \hat\varphi).$$

Note that d̂(x) is the usual linear decision function. In case m₁ = m₂, d̂_b(x) = d̂(x), and as m₁ and m₂ approach infinity, d̂_b(x) approaches d(x), where d(x) is the Bayes decision surface for this problem. An interesting aspect of this particular weighting is that E_X d̂_b(x) = d(x), i.e., the weighting results in the expected value (over the samples) of the estimated decision surface being the same as the Bayes decision surface.
The results of [15] provide an efficient way to utilize information provided by
unequal number of samples when a non-Bayesian decision rule is employed.
These results need to be generalized to a larger class of distributions and
estimates.
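A small simulation (our own check, assuming the balanced form quoted above, a known identity covariance matrix and ML mean estimates) illustrates the unbiasedness property E_X d̂_b(x) = d(x).

import numpy as np

rng = np.random.default_rng(6)

N, m1, m2 = 4, 5, 50                              # dimensionality and (unequal) sample sizes
theta, phi = np.zeros(N), np.ones(N)              # true mean vectors; Sigma = I is known
x = np.array([0.3, -0.2, 0.8, 0.1])               # a fixed pattern to classify

d_true = (x - 0.5 * (theta + phi)) @ (theta - phi)          # Bayes decision surface d(x)

R = 100_000                                                  # training-set replications
t_hat = theta + rng.standard_normal((R, N)) / np.sqrt(m1)    # ML estimates of the two means
p_hat = phi + rng.standard_normal((R, N)) / np.sqrt(m2)
d_hat = (np.einsum('j,rj->r', x, t_hat - p_hat)
         - 0.5 * np.einsum('rj,rj->r', t_hat + p_hat, t_hat - p_hat))
d_bal = N / (2 * m1) - N / (2 * m2) + d_hat                  # balanced decision function

print("d(x)            :", round(d_true, 4))
print("E_X[d_hat(x)]   :", round(d_hat.mean(), 4))           # differs from d(x) when m1 != m2
print("E_X[d_hat_b(x)] :", round(d_bal.mean(), 4))           # close to d(x)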

2.4. Weaker conditions for perfect discrimination


Continuing with the case of statistically independent measurements, the suffi-
cient conditions for perfect discrimination were (Section 2, the first part) that
certain quantities (corresponding to some weighted distances between the class
distributions) approached ∞ as the number of measurements was increased. Of course these conditions were derived for the case of a particular (classical) approach to the design of a classifier; i.e., the decision function is Σd̂ᵢ (see before
for the notation). Schaafsma and Steerneman [56] (see also the article by
Schaafsma in this handbook) consider an interesting question: Do there exist
other classification rules for which perfect discrimination can be achieved under
weaker conditions, e.g., that the quantity

$$\lim_{N\to\infty} (-1)^{j-1}\, \frac{\sum M_i^{(j)}}{\bigl(\sum V_i^{(j)}\bigr)^{1/2}}$$

is simply > 0, rather than approaching ∞? We have already seen that by


weighting the estimated densities of each class by a function which reflects the
sample size, we can avoid some undesirable consequences of differential sample
sizes from different classes. Schaafsma and Steerneman similarly look for other
classifiers (than the classical ones) from the viewpoint of achieving perfect
discrimination under weaker conditions on the distance between densities.
They consider the familiar case of two classes, each with N independent and
normally distributed measurements, f₁ᵢ(xᵢ) = N(θᵢ, 1) and f₂ᵢ(xᵢ) = N(φᵢ, 1). As before, Δ²_N = Σ(θᵢ − φᵢ)², and m₁ and m₂ are the sample sizes from the two classes. Consider the use of the decision function

$$\hat d' = \frac{m_1}{1 + m_1}\sum_{i=1}^{N}\bigl(x_i - \hat\theta_i\bigr)^2 - \frac{m_2}{1 + m_2}\sum_{i=1}^{N}\bigl(x_i - \hat\varphi_i\bigr)^2.$$

Note that this rule weights the contribution of different measurements differently

in addition to taking into account different sample sizes. They show that if

$$\liminf_{N\to\infty}\; N^{-1/2}\,\Delta^2_N > 0,$$

then the probability of misclassification approaches 0. The important point is not


the particular procedure or decision function used, but the existence of a class of
procedures which can produce both perfect discrimination in the limit under
weaker conditions on Δ²_N, and elimination of certain peculiarities associated with
different sample sizes. This indicates that it might be quite fruitful to study the
general problem of nonclassical discrimination rules to provide some protection
from the "curse of finite sample size".

3. K-nearest neighbor procedures

The information contained in the K nearest neighbors of a sample pattern


vector x has been used extensively at least in two areas of pattern recognition:
classification of x into one of several possible categories, and estimating the
density at point x. The popularity of K-nearest neighbor procedures is due to
their simplicity and because their applicability is not conditioned on the knowl-
edge of the form of the underlying densities. Thus the designer may only have
information of the following kind: There are C pattern classes with mᵢ N-dimensional training samples from class cᵢ, i = 1, …, C. The designer who decides
to use the nearest-neighbor procedure almost always faces the question: what
value of K should be used?
Let us first consider the K-nearest-neighbor decision rule. It is well known [19,
20] that if the number of training samples is infinite, then
(i) the error rate of the nearest-neighbor rule is bounded from above by twice
the Bayes rate, and
(ii) the performance of the K-nearest-neighbor rule improves monotonically
with K, and in the limit as K goes to infinity, this rule becomes identical to the
optimal Bayes rule.
Unfortunately, no such statements can be made about the performance of the
K-nearest-neighbor decision rule when only a finite number of training samples is
available. Fix and Hodges [23] attempted to approximate the finite sample error
rate of the K-nearest-neighbor decision rule and compare its performance with the
linear discriminant function when N ≤ 2 and K ≤ 3. In practice it is quite often
observed [6] that for a fixed m (number of training samples per class) and N
(dimensionality), as K increases, the performance of the K-nearest-neighbor rule
improves, attains a maximum and then starts deteriorating. Thus, we have
another example of peaking of performance.
It is often recommended [20] that K should be only a small fraction of the
number of samples. The relationship between K, m and N needs to be further
investigated especially since the K-nearest-neighbor decision rule is so popular.
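For reference, a minimal K-nearest-neighbor decision rule (plain numpy with Euclidean distances; the names and the toy data are ours) looks as follows.

import numpy as np

def knn_classify(x, train_X, train_y, K=3):
    """Assign x to the majority class among its K nearest training samples."""
    dists = np.linalg.norm(train_X - x, axis=1)
    nearest = train_y[np.argsort(dists)[:K]]
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]

# toy usage with two Gaussian classes
rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 1, (25, 2)), rng.normal(2, 1, (25, 2))])
y = np.array([0] * 25 + [1] * 25)
print(knn_classify(np.array([1.0, 1.2]), X, y, K=5))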

One common method to obtain nonparametric density estimates is to use the


K-nearest-neighbor approach. Given m N-dimensional samples, the density estimate p_m(x) at point x can be obtained as

$$p_m(x) = \frac{K_m/m}{V_m},$$

where V_m is the volume of a region centered at x which captures exactly K_m samples. (The subscript m is used to denote the dependence of various quantities on the sample size m.) The necessary and sufficient conditions for p_m(x) to converge to p(x) are that lim_{m→∞} K_m = ∞ and lim_{m→∞} K_m/m = 0 [46]. Empirical results [46, 22] indicate that K_m proportional to √m is a good choice, although K_m should be a function of the dimensionality N also.
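A sketch of the K-nearest-neighbor density estimate just described (our own code, using Euclidean balls so that V_m is the volume of an N-ball) follows.

import numpy as np
from math import gamma, pi

def knn_density(x, samples, K):
    """p_m(x) = (K/m) / V_m, where V_m is the volume of the smallest Euclidean
    ball centred at x that contains exactly K of the m samples."""
    m, N = samples.shape
    r = np.sort(np.linalg.norm(samples - x, axis=1))[K - 1]   # distance to the Kth neighbour
    V = (pi ** (N / 2) / gamma(N / 2 + 1)) * r ** N            # volume of an N-ball of radius r
    return (K / m) / V

rng = np.random.default_rng(8)
data = rng.standard_normal((1000, 2))              # samples from a standard bivariate normal
K = int(round(np.sqrt(len(data))))                 # the common heuristic K_m ~ sqrt(m)
print(knn_density(np.zeros(2), data, K))           # true density at the origin is 1/(2*pi) ~ 0.159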
Fukunaga and Hostetler [25] derive an expression for the optimum value of K_m (in the mean-square-error sense) which is a function of N, m and the true underlying density p(x). If the underlying distribution is multivariate Gaussian, K_opt, the optimum value of K_m, is proportional to m^{4/(N+4)}. The constant of proportionality reduces from 0.75 for N = 4 to 0.42 for N = 32. Note that in one dimension (N = 1) K_opt is substantially larger than the commonly used heuristic value of √m. Also, for a fixed sample size, as the dimensionality increases, K_opt goes down, which perhaps can be attributed to sparseness of the data. Vajta and Fritz [63] make some corrections, for small values of N, in the expression for K_opt in [25]. Rejtő and Révész [54] point out that K_opt in the context of density estimation is also applicable to the K-nearest-neighbor decision rule. Thus the results of [25] and [63] can be used to select the optimum number of nearest neighbors for decision making.

4. Error estimation

The performance of a pattern classifier is usually evaluated by determining its


classification accuracy. In most theoretical studies in pattern recognition one
computes the expected recognition accuracy. In practice, since the true underlying
distributions are not known, one estimates the recognition accuracy from the
available samples. This generally involves dividing the total available samples into
design and test sets; the classifier is designed based on the samples in the design
set and is evaluated by estimating the error rate on samples in the test set [29, 42,
61].
Ideally one would like to use as many samples as possible in designing as well
as in testing the classifier. However, it is well known that the design-set error rate
(same set of samples are used to design as well as to test the classifier) is a biased
estimate of the true error rate [61]. In fact, Cover [18] showed that for a two-class
problem if a linear discriminant function is used, then the design-set error rate is
always zero as long as (m₁ + m₂) ≤ (N + 1). What then are the situations in which
the design-set error rate can be 'safely' used as an estimate of the true error rate?
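Cover's observation can be checked directly: with (m₁ + m₂) ≤ N + 1 points in general position a linear discriminant can reproduce any labelling, so the design-set (resubstitution) error is zero. The sketch below (our illustration, using a least-squares linear discriminant with an intercept and random labels) makes the point.

import numpy as np

rng = np.random.default_rng(9)

def design_set_error(m, N, n_rep=200):
    """Resubstitution error of a least-squares linear discriminant (intercept
    included) fitted to m points with random +/-1 labels."""
    errs = []
    for _ in range(n_rep):
        X = np.c_[np.ones(m), rng.standard_normal((m, N))]
        y = rng.choice([-1.0, 1.0], size=m)
        w, *_ = np.linalg.lstsq(X, y, rcond=None)
        errs.append(np.mean(np.sign(X @ w) != y))
    return np.mean(errs)

N = 10
for m in (5, 11, 20, 40):          # zero design-set error whenever m <= N + 1
    print(m, round(design_set_error(m, N), 3))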
Foley [24] considered this problem when the class-conditional densities are

multivariate Gaussian with unknown mean vectors μ₁ and μ₂, and known common covariance matrix Σ. He computed the expected design-set error rate as a function of m (number of samples per class), N and Δ²_N (the true Mahalanobis
distance). Foley's results can be summarized as follows:
(1) lim_{N→∞} E{E_d(m, N, Δ²_N = 0)} = 0, where E_d denotes the design-set error rate. That is, by adding more and more measurements it is possible to make the design-set error rate approach zero even if the true error rate is ½.
(2) The ratio (m/N) is critical to the bias in the design-set error rate. If (m/N) > 3, then the bias in the design-set error rate is close to zero.
Mehrotra [47] extended Foley's results to situations where the common covari-
ance matrix Σ is also unknown and concluded that the ratio (m/N) must be larger than five before the bias in the design-set error rate is sufficiently small. These results seem to confirm the hypothesis [38, 39]: "the less is known about the underlying probability structure, the larger is the ratio of sample size to dimensionality".
Intuitively it appears that one should know the class assignment of test samples
in order to estimate the error rate of a classifier. However, in many applications
of pattern recognition methodology, labelling of samples can be very expensive.
In view of this, Chow [16] proposed a procedure to estimate the error rate of a
classifier based on a set of unlabelled test samples. Chow established a relation-
ship between the error rate and the reject rate of a classifier, and since computing
the reject rate does not require the knowledge of class assignment of test samples,
the error rate can be determined with unlabelled test samples. Again, the
ratio of the number of training samples to dimensionality plays an important role
in this method of estimating the error rate. Fukunaga and Kessell [26] showed
that for two Gaussian distributions with unknown mean vectors and unknown
common covariance matrix, the estimate of error rate obtained from the empirical
reject rate is optimistically biased. This bias in the error rate is a function of the
ratio of sample size to dimensionality, and [26] recommends that this ratio must
be at least ten for the bias to be small.

5. Conclusions

We have shown the important role which dimensionality and sample size play
in various areas of pattern recognition, namely, classification accuracy, K-
nearest-neighbor approach and error estimation. There is no doubt that the
designer of a pattern recognition system should make every possible effort to
obtain as many samples as possible. As the number of samples increases, not only
does the designer have more confidence in the performance of the classifier, but
more measurements can be incorporated in the design of the classifier without the
fear of peaking in its performance. However, there are many pattern classification
problems where either the number of samples is limited (for example, in a medical
decision-making problem, there may be only a small number of patients available
who are suffering from a specific disease) or obtaining a large number of samples

is extremely expensive (as in the application of pattern recognition methodology


to nondestructive testing where specific defects in a piece of metal have to be
detected by, say, analyzing the return ultrasound waveform). It is in these small
sample size problems where the designer of a recognition system has to be
extremely careful. As we have pointed out in this paper, if the designer chooses to
take the optimal Bayesian approach, then the average performance of the classi-
fier improves monotonically as the number of measurements is increased. Most
practical pattern recognition systems employ a non-Bayesian decision rule be-
cause the use of the optimal Bayesian approach requires knowledge of prior densities and, besides, its complexity precludes the development of real-time recognition
systems. The peaking behavior of practical classifiers is caused principally by
their nonoptimal use of measurements.
Realizing the dangers of having too many measurements in the classifier design
when the number of samples is small, a designer must select a 'good' subset of
measurements. Various techniques of feature selection and extraction [32, 38, 41]
have been proposed in the literature for this purpose. However, most of these
techniques use some sort of a criterion function such as a distance or information
measure to select a good subset of measurements which must be estimated from
the available samples. If the number of available samples is small compared to
dimensionality, these estimates themselves will not be very reliable [31, 49, 71].
The same argument holds for finding the intrinsic dimensionality of a given set of
data [51].
In summary, while the ratio of sample size to dimensionality is a crucial factor
in the design of a recognition system, very little quantitative information is
available to a designer. If the designer knows that the underlying distributions are
multivariate Gaussian with unknown parameters, then some tables are available
to determine the optimal number of measurements to use for a given sample size.
However, since these tables are generated based on expected recognition accu-
racy, the designer must know the true Mahalanobis distance to use them. The
general guideline for having five to ten times as many samples as measurements
still seems to be a good practice to follow. We share the opinion expressed in [38,
39] that the ratio of sample size to dimensionality should be inversely propor-
tional to the amount of knowledge about the class-conditional densities.

References

[1] Abend, K., Harley Jr., T. J., Chandrasekaran, B. and Hughes, G. F. (1969). Comments on "On
the mean recognition accuracy of statistical pattern recognizers". IEEE Trans. Inform. Theory 15, 420-423.
[2] Aitchison, J. (1975). Goodness of prediction fit. Biometrika 62, 547-554.
[3] Aitchison, J., Habbema, J. D. F. and Kay, J. W. (1977). A critical comparison of two methods of
statistical discrimination. Appl. Statist. 26, 15-25.
[4] Allais, D. C. (1966). The problem of too many measurements in pattern recognition. IEEE Int. Conv. Rec. 7, 124-130.
[5] Anderson, T. W. (1951). Classification by multivariate analysis. Psychometrika 16, 31-50.

[6] Bailey, T. and Jain, A. K. (1978). A note on distance-weighted k-nearest-neighbor rules. IEEE
Trans. Systems Man Cybernet. 8, 311-313.
[7] Ben-Bassat, M. and Gal, S. (1977). Properties and convergence of a posteriori probabilities in
classification problems. Pattern Recognition 9, 99-107.
[8] Boullion, T. L., Odell, P. L. and Duran, B. S. (1975). Estimating the probability of misclassifica-
tion and variate selection. Pattern Recognition 7, 139-145.
[9] Bowker, A. H. (1961). A representation of Hotelling's T 2 and Anderson's classification statistic
W in terms of simple statistics. In: H. Solomon, ed., Studies in Item Analysis and Prediction,
271-284. Stanford University Press, Stanford, CA.
[10] Bowker, A. H. and Sitgreaves, R. (1961). An asymptotic expansion for the distribution function
of the W-classification statistic. In: H. Solomon, ed., Studies in Item Analysis and Prediction,
292-310. Stanford University Press, Stanford, CA.
[11] Chandrasekaran, B. (1971). Independence of measurements and the mean recognition accuracy.
IEEE Trans. Inform. Theory 17, 452-456.
[12] Chandrasekaran, B. and Jain, A. K. (1974). Quantization complexity and independent measure-
ments. IEEE Trans. Comput. 23, 102-106.
[13] Chandrasekaran, B. and Jain, A. K. (1975). Independence, measurement complexity and
classification performance. IEEE Trans. Systems Man Cybernet. 5, 240-244.
[14] Chandrasekaran, B. and Jain, A. K. (1977). Independence, measurement complexity and
classification performance: An emendation. IEEE Trans. Systems Man Cybernet. 7, 564-566.
[15] Chandrasekaran, B. and Jain, A. K. (1979). On balancing decision functions. J. Cybernet.
Inform. Sci. 2, 12-15.
[16] Chow, C. K. (1970). On optimum recognition error and reject trade-off. IEEE Trans. Inform.
Theory 16, 41-46.
[17] Chu, J. T. and Chueh, J. C. (1967). Error probability in decision functions for character
recognition. J. Assoc. Comput. Mach. 14, 273-280.
[18] Cover, T. M. (1965). Geometrical and statistical properties of systems of linear inequalities with
application in pattern recognition. IEEE Trans. Elec. Comput. 14, 326-334.
[19] Cover, T. M. and Hart, P. E. (1967). Nearest neighbor pattern classification. IEEE Trans.
Inform. Theory 13, 21-27.
[20] Duda, R. O. and Hart, P. E. (1973). Pattern Classification and Scene Analysis. Wiley, New York.
[21] Duin, R. P. W. (1976). A sample size dependent error bound. Proc. Third lnternat. Joint Conf.
Pattern Recognition, Coronado, CA., 156-160.
[22] Fischer, F. P., II. (1971). K-nearest neighbor rules. Ph.D. dissertation, School of Elect. Engr.,
Purdue University, Lafayette, IN.
[23] Fix, E. and Hodges, J. L. (1952). Discriminatory analysis, nonparametric discrimination: Small
sample performance. USAF School of Aviation Medicine, Randolph AFB, TX, Project 21-49-004,
Rep. 11. Also in: A. K. Agrawala, ed., Machine Recognition of Patterns, 280-322. IEEE Press,
New York, 1977.
[24] Foley, D. M. (1972). Considerations of sample and feature size. IEEE Trans. Inform. Theory 18,
618-626.
[25] Fukunaga, K. and Hostetler, L. D. (1973). Optimization of k-nearest neighbor density estimates.
IEEE Trans. Inform. Theory 19, 320-326.
[26] Fukunaga, K. and Kessell, D. L. (1972). Application of optimum error-reject functions. IEEE
Trans. Inform. Theory 18, 814-817.
[27] Gaffey, W. R. (1951). Discriminatory analysis: Perfect discrimination as the number of variables
increases. Report No. 5, Project No. 21-49-004, USAF School of Aviation Medicine, Randolph
Field, TX.
[28] Harter, H. L. (1951). On the distribution of Wald's classification statistics. Ann. Math. Statist.
22, 58-67.
[29] Highleyman, W. H. (1962). The design and analysis of pattern recognition experiments. Bell
System Tech. J. 41,723-744.
[30] Hughes, G. F. (1968). On the mean accuracy of statistical pattern recognizers. IEEE Trans.
Inform. Theory 14, 55-63.
854 A. K. Jain and B. Chandrasekaran

[31] Jain, A. K. (1976). On an estimate of the Bhattacharyya distance. IEEE Trans. Systems Man
Cybernet. 6, 763-766.
[32] Jain, A. K. and Dubes, R. (1978). Feature definition in pattern recognition with small sample
size. Pattern Recognition 10, 85-97.
[33] Jain, A. K. and Waller, W. (1978). On the optimum number of features in the classification of
multivariate Gaussian data. Pattern Recognition 10, 365-374.
[34] John, S. (1961). Errors in discrimination. Ann. Math. Statist. 32, 1125-1144.
[35] Kabe, D. G. (1963). Some results on the distribution of two random matrices used in
classification procedures. Ann. Math. Statist. 34, 181-185.
[36] Kain, R. Y. (1969). The mean accuracy of pattern recognizers with many pattern classes. IEEE
Trans. Inform. Theory 15, 424-425.
[37] Kaminuma, T. and Watanabe, S. (1972). Fast-converging adaptive algorithms for well-balanced
separating linear classifier. Pattern Recognition 4, 289-305.
[38] Kanal, L. (1974). Patterns in pattern recognition, 1968-1974. IEEE Trans. Inform. Theory 20,
697-722.
[39] Kanal, L. and Chandrasekaran, B. (1971). On dimensionality and sample size in statistical
pattern classification. Pattern Recognition 3, 225-234.
[40] Keehn, D. G. (1965). A note on learning for Gaussian properties. IEEE Trans. Inform. Theory
11, 126-132.
[41] Kittler, J. (1975). Mathematical methods of feature selection in pattern recognition, lnternat. J.
Man-Mach. Stud. 7, 609-637.
[42] Lachenbruch, P. A. and Mickey, M. R. (1968). Estimation of error rates in discriminant analysis.
Technometrics 10, 1- 11.
[43] Levine, A., Lustick, L. and Saltzberg, B. (1973). The nearest-neighbor rule for small samples
drawn from uniform distributions. IEEE Trans. Inform. Theory 19, 697-699.
[44] Lindley, D. V. (1968). The choice of variables in multiple regression. J. Roy. Statist. Soc. Ser. B
30, 31-66.
[45] Lindley, D. V. (1977). The Bayesian Approach. Seventh Scandin. Conf. in Math. Statist., 55-72.
[46] Loftsgaarden, D. O. and Quesenberry, C. P. (1965). A nonparametric estimate of a multivariate
density function. Ann. Math. Statist. 36, 1049-1051.
[47] Mehrotra, K. G. (1973). Some further considerations on probability of error in discriminant
analysis. Report on RADC contract no. f 30602-72-C-0281.
[48] Morgera, S. D. and Cooper, D. B. (1977). Structured estimation: sample size reduction for
adaptive pattern classification. IEEE Trans. Inform. Theory 23, 728-741.
[49] Murray, G. D. (1977). A cautionary note on selection of variables in discriminant analysis. Appl.
Statist. 26, 246-250.
[50] Okamoto, M. (1963). An asymptotic expansion for the distribution of the linear discriminant
function. Ann. Math. Statist. 34, 1286-1301. Correction: Ann. Math. Statist. 39 (1968) 1358-
1359.
[51] Pettis, K., Bailey, T., Jain, A. K. and Dubes, R. (1979). An intrinsic dimensionality estimator
from near-neighbor information. IEEE Trans. Pattern Anal. Mach. Intelligence 1, 25-37.
[52] Rao, C. R. (1949). On some problems arising out of discrimination with multiple characters.
Sankhyd 9, 343-364.
[53] Raudys, S. (1976). On dimensionality, learning sample size and complexity of classification
algorithms. Proc. Third Internat. Joint Conf. Pattern Recognition, Coronado, CA., 166-169.
[54] Rejto, L. and Revesz, P. (1973). Density estimation and pattern recognition. Problems Control
Inform. Theory 2, 67-80.
[55] Roucos, S. and Childers, D. G. (1980). On dimensionality and learning set size in feature
extraction. Proc. Internat. Conf. Cybernet., Soc., Cambridge, MA., 26-31.
[56] Schaafsma, W. and Steerneman, A. G. M. (1980). Proofs and extensions of the results in the
paper on classification and discrimination if p ~ 0o. Intemal Rept., Dept. Math, University of
Groningen, Groningen.
[57] Sitgreaves, R. (1952). On the distribution of two matrices used in classification procedures. Ann.
Math. Statist. 23, 263-270.
Dimensionality and sample size considerations in pattern recognition practice 855

[58] Sitgreaves, R. (1961). Some results on the distribution of the W-classification. In: H. Solomon,
ed., Studies in Item Analysis and Prediction, 241-251. Stanford University Press, Stanford, CA.
[59] Sitgreaves, R. (1973). Some operating characteristics of linear discriminant functions. In: T.
Cacoullos, ed., Discriminant Analysis and Applications, 365-374. Academic Press, New York.
[60] Teichroew, D. and Sitgreaves, R. (1961). Computation of an empirical sampling distribution for
the W-classification statistic. In: H. Solomon, ed., Studies in Item Analysis and Prediction,
271-284. Stanford University Press, Stanford, CA.
[61] Toussaint, G. T. (1974). Bibliography on estimation of misclassification. IEEE Trans. Inform.
Theory 20, 472-479.
[62] Trunk, G. V. (1979). A problem of dimensionality. IEEE Trans. Pattern Anal. Mach. Intelligence
1, 306-307.
[63] Vajta, M. and Fritz, J. (1974). Some remarks on optimal kn-nearest neighbor pattern classifica-
tion, Proc. Second Internat. Joint Conf. Pattern Recognition, Lyngby, Denmark, 547-549.
[64] Van Campenhout, J. M. (1978). On the peaking of the Hughes mean recognition accuracy: The
resolution of an apparent paradox. IEEE Trans. Systems Man Cybernet. 8, 390-395.
[65] Van Ness, J. W. (1977). Dimensionality and classification performance with independent
coordinates. IEEE Trans. Systems Man Cybernet. 7, 560-564.
[66] Van Ness, J. W. (1980). On the dominance of non-parametric Bayes rule discriminant algo-
rithms in high dimensions. Pattern Recognition 12, 355-368.
[67] Van Ness, J. W. and Simpson, C. (1976). On the effects of dimension in discriminant analysis.
Technometrics 18, 175-187.
[68] Wald, A. (1944). On a statistical problem arising in the classification of an individual into one of
two groups. Ann. Math. Statist. 15, 145-162.
[69] Waller, W. and Jain, A. K. (1977). Mean recognition accuracy of dependent binary measure-
ments. Proc. Seventh Internat. Conf. Cybernet. Soc., Washington, DC, 586-590.
[70] Waller, W. G. and Jain, A. K. 0978). On the monotonicity of the performance of Bayesian
classifiers. IEEE Trans. Inform. Theory 24, 392-394.
[71] Young, I. T. (1978). Further consideration of sample and feature size. IEEE Trans. Inform.
Theory 24, 773-775.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
© North-Holland Publishing Company (1982) 857-881

Selecting Variables in Discriminant Analysis for Improving upon Classical Procedures

Willem Schaafsma

1. Introduction

Choosing which of the ordered variables ξ₁,...,ξ_s should be included in the
analysis is crucial in many applications of multiple linear regression, multivariate
analysis, time series analysis and pattern recognition. Painstaking scientists are
often willing to extend their set of variables. By doing so they can get into trouble
if they apply 'standard' procedures. The theoretical explanation is that the
objectivistic optimum properties leading to the standard procedures are not
compelling. There will 'often' (that means for 'many' values of the underlying
parameter 0) exist a number p* (depending on 0 and the sample sizes) such that
the standard procedures show a degrading performance if the number of variables
is increased beyond the bound p* which is sometimes referred to as the optimal
measurement complexity. This bound will be an increasing function of the sample
sizes.
Thus standard procedures should be 'robustified' against the degradation
which appears if the set of variables is extended by introducing 'counterproduc-
tive' ones. This will be done by starting from the ordered set ξ₁,...,ξ_s of all
potential variables. The standard procedure based on all s variables will be
'improved' or 'robustified' by deleting a data-dependent set of variables or by
using a ridge-regression type modification.
We restrict our attention to discriminant analysis situations where the outcomes
of the independent s-dimensional random vectors X_{hi} (i = 1,...,m_h; h = 0,...,k)
have to be evaluated. Here X_{h1},...,X_{hm_h} is an i.r.s. from N_s(μ_h, Σ) characterizing
population h (h = 1,...,k), X_{0i} describes the vector of scores for the ith individual
under classification (in some problems such individuals are not needed) and it is
postulated that X_{0i} has the N_s(μ_{t(i)}, Σ) distribution where t(i) ∈ {1,...,k} denotes
the number of the population to which the ith individual actually belongs. Thus,
from a strict point of view, our theory is not applicable if the populations are not
well defined, or if the individuals under classification cannot be regarded as
selected at random from the relevant populations, or if the assumption of
homogeneity of the dispersions is violated. The assumption of s-variate normality
is also of some concern.

The solution to the 'dimensionality problem' might depend on (1) the specific
aim one has in mind, (2) the costs of measuring various potential variables, (3) a
priori knowledge. With respect to (1) we distinguish the following aims which can
be of main interest in discriminant analysis situations (aims of secondary interest
arise in a natural way when dealing with these main aims).

Aim 1. To construct one or more discriminant functions, either for the purpose
of dimension reduction to describe the data, or as a first step in the process of
evaluating the data.
Aim 2. To test whether the populations differ with respect to the variables ξ₁,...,ξ_s.
Aim 3. To assign the individuals under classification to one (or more) of the k
populations; this entails the secondary aim of estimating the various misclassifica-
tion probabilities of the classification procedure which has been used.
Aim 4. To estimate posterior probabilities, for the individuals under classifica-
tion, preferably by means of confidence intervals.
Aim 5. To select a subset of the set of all s potential variables for use in the
future.
Aim 6. To distinguish 'clusters' of the n₀ individuals under classification, or of
the k populations, or of the s variables.
With respect to measuring costs we remark that they will play (almost) no part
in this paper: we restrict the attention to the Aims 1, 3 and 4. The main
motivation for this paper is that we want to assist painstaking scientists by
providing protection against the introduction of too many variables. Protection will
be achieved by modifying the underlying standard procedures.
A priori knowledge will play an important part in the stage preliminary to the
actual data evaluation (and also for the determination of the a priori probabilities
for each of the individuals under classification). During this preliminary stage,
results of previous investigations can be incorporated. We distinguish the follow-
ing steps.

Step 1. Decide upon the variables η₁,...,η_q to be measured directly.
Step 2. Introduce some 'derived' variables η_{q+1},...,η_r as functions of the
original ones η₁,...,η_q.
Step 3. Delete those variables which became obsolete after the introduction of
the derived ones.
Step 4. Decide upon some initial ordering of the s variables obtained. Let
ξ₁,...,ξ_s denote the corresponding ordered set where ξ₁ is considered most
important, ξ₂ next most important, etc.
The usefulness of the initial ordering depends on the decisions made during this
preliminary stage, and on the reliability of the results of previous investigations. If
measuring and computation costs are regarded as unimportant, then one might
proceed as follows. During Step 2 some derived variables η_{q+1},...,η_{q+t} are
introduced, next η_{q+t+1},...,η_r are introduced as (standardized, unrotated) principal
components for η₁,...,η_{q+t} (estimated on the basis of pooled sample
dispersions obtained from previous investigations). Next the variables η₁,...,η_{q+t}
are deleted and ξ₁, ξ₂,... are chosen by taking η_{q+t+1}, η_{q+t+2},... if no reliable a
priori knowledge exists with respect to the discriminatory properties of these
variables. Otherwise such knowledge might be incorporated by rearranging the
principal components (the subject of ordering variables is very unpleasant from a
theoretical point of view; more details are given in [12]).
Standard procedures often show a degrading performance if the number of
variables involved is increased beyond a certain bound p*. This very interesting
phenomenon has been observed by many scientists. Note that various, intrinsi-
cally different, illustrations can be made because the underlying aims can be
different, or be specified differently. Of course, the illustrations will also depend
upon the underlying parameters and sample sizes. It is interesting to remark that
for some 'completely specified' aims it holds that the performance of the standard
procedure admits different specifications and that even the concept of a 'standard'
procedure can be doubtful. These variations are of course of almost no impor-
tance when compared with the influences of the sample sizes, the underlying
parameter and, in particular, the specification of the aim in mind.
The above-mentioned phenomenon implies that the standard procedure based
on all s variables can often be improved by deleting variables. It obviously
depends on the specific aim, performance, values of the underlying parameters
and sample sizes, which selection of variables should be made. A complication
will be caused by the fact that the underlying parameters are unknown. Some
estimation procedure has to be introduced.
In this paper we restrict the attention to the case that only two populations are
of real interest (the other ones, if available, are only exploited for estimating Σ).
At the beginning we had the intuitive feeling that the Aims 1, 3 and 4 are so
closely related that any technique for selecting variables that is natural for one of
these aims, will also be natural for the other two. Computations showed that this
relationship may exist between Aim 1 and certain specifications of Aim 3 but it
disappears if Aim 4 is taken into consideration (see Section 5 for an explanation).

REMARK 1. Aim 2 is of interest in situations which are essentially different from


those where the Aims 1, 3 and 4 are relevant. The latter aims are only interesting
if the existence of differences is beyond all doubt whereas Aim 2 is only of
interest if this existence is still a matter of concern. Note that when dealing with
Aim 2, the Hotelling test based on all variables is the classical standard proce-
dure. This test is uniformly most powerful among all invariant level-α tests. The
restriction by invariance is not very attractive from a practical point of view
because, e.g., it seems reasonable to conjecture that "many variables will be pretty
useless if s is large". A nice reference is [4]. An illustration of the degrading power
of Hotelling's test can also be found in [12].

REMARK 2. Measuring costs and tests for additional discrimination information


(Rao (1965) 8d.3, Stein (1966)) play a dominant part when dealing with Aim 5.

Notation
Most theory will be developed for an arbitrary number p of variables. If it is of
interest to indicate which p variables are taken from the set ξ₁,...,ξ_s, then this
will sometimes be indicated by a left upper-script p if (ξ₁,...,ξ_p) is meant and by
a set of left upper-scripts j(1),...,j(p) if (ξ_{j(1)},...,ξ_{j(p)}) is meant.

2. Illustrating the phenomenon when dealing with Aim 1 in the case k = 2

Suppose a discriminant function g: R^p → R has to be constructed for
discriminating between population 1 and population 2. If θ = (μ₁, μ₂, Σ) is known,
then various theoretical approaches have applications. One obtains functions g
which are affine-linear: g(x) = aw^Tx + b, where w is the vector ω = Σ^{-1}(μ₂ − μ₁)
of weights of Fisher's linear discriminant 'in the population' and the precise
choice of a and b depends on the approach chosen. Sometimes a = 1, b = 0; in
other situations a and b are such that var_Σ g(X) = 1 and E_{(μ₁,Σ)}g(X) +
E_{(μ₂,Σ)}g(X) = 0.
We are interested in situations where θ = (μ₁, μ₂, Σ) is unknown, with the
consequence that we have to use data for determining g. Our theory will be based
on the independent r.v.'s

X₁. ~ N_p(μ₁, m₁^{-1}Σ),   X₂. ~ N_p(μ₂, m₂^{-1}Σ),   S ~ W_p(f, Σ)

where usually f = m₁ + m₂ − 2 but where f > m₁ + m₂ − 2 may also be of interest.
We shall use the corresponding outcome (the data under evaluation) for
constructing some affine-linear discriminant function. Taking a(X₁., X₂., S) ≡ 1, we
have to specify w(X₁., X₂., S) and b(X₁., X₂., S). The choice

w(X₁., X₂., S) = fS^{-1}(X₂. − X₁.),
b(X₁., X₂., S) = −½ f(X₂. − X₁.)^T S^{-1}(X₁. + X₂.)
leads to the classification statistic

(X₂. − X₁.)^T (f^{-1}S)^{-1} {X − ½(X₁. + X₂.)}

considered in [18] and nowadays called the Anderson classification statistic,
because its distribution was studied extensively by Anderson, Sitgreaves a.o.
(f > p is needed).
Note that the restriction to affine-linear discriminant functions is not compelling:
any classification statistic based on the likelihood ratio criterion is not
affine-linear.
The performance of any fixed affine-linear discriminant function g(x) = aw^Tx
+ b can be characterized in various ways. If one wants to use the discriminant
function for direct classification purposes, e.g. by assigning to population 2 if and
only if g(x) > 0, then the values of a, ‖w‖ and b will all be of interest. However,
when dealing with Aim 1 for its own sake, and disregarding matters of scaling or
defining cut-off points, it seems of interest to choose the performance concept in
such a way that it neither depends on a, nor on b, nor on ‖w‖, while it is maximized
if w is any positive multiple of ω = Σ^{-1}(μ₂ − μ₁). The appropriate concept seems
to be the discriminatory value

δ(w) = {E_{(μ₂,Σ)}g(X) − E_{(μ₁,Σ)}g(X)}{var_Σ g(X)}^{-1/2}
     = w^T(μ₂ − μ₁)/(w^TΣw)^{1/2},

or some increasing function of δ(w), e.g. the square. Note that

|δ(w)| ≤ max_{w∈R^p} δ(w) = δ(ω) = {(μ₂ − μ₁)^TΣ^{-1}(μ₂ − μ₁)}^{1/2} = Δ.

This is a simple consequence of the Cauchy-Schwarz inequality; Δ is the usual
Mahalanobis distance.
The discriminatory value of any data-dependent discriminant function, with
W = w(X₁., X₂., S) as the underlying vector of weights, will be a random variable

D = δ(W) = W^T(μ₂ − μ₁)/(W^TΣW)^{1/2},

the distribution of which is concentrated on the interval [−Δ, +Δ]. A discriminant
function will be more satisfactory if this distribution is 'closer' to +Δ. One
might use E_θD, (E_θD²)^{1/2}, median_θ D or another quantile of D as a yardstick. Of
course one might equally well consider how much E_θD² or median_θ D² falls short
with respect to Δ². Guided by the simplicity of certain formulas, the fact that Δ²
admits a best unbiased estimator and by Ton Steerneman's simulation experiments
(see the end of the appendix), we decided to characterize the performance
of our affine-linear discriminant functions by means of E_θD². Note that this
defines a function of θ = (μ₁, μ₂, Σ) with values in (0, Δ²). In the very special case
that the vector W of weights is determined as a positive multiple w(X₁., X₂., S)
= cS^{-1}(X₂. − X₁.) (c > 0) of the plug-in estimator fS^{-1}(X₂. − X₁.) of ω =
Σ^{-1}(μ₂ − μ₁), it can be proved that E_θD² depends on θ = (μ₁, μ₂, Σ) only via the
Mahalanobis distance Δ. The approximation

E_θD² ≈ Δ² − Δ²(p − 1) {1 + (f − 1)m m₁^{-1}m₂^{-1}Δ^{-2}} / [(f − 1){1 + p m m₁^{-1}m₂^{-1}Δ^{-2}}],

where m = m₁ + m₂, is obtained in the appendix "by putting the expectation of
some ratio equal to the ratio of the corresponding exact expectations" (see the
formulas in the appendix where b is approximated by b^{(3)}). We emphasize that
this approximation for E_θD² is only applicable if the vector of weights is
determined by W = cS^{-1}(X₂. − X₁.) where c > 0.
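As a small illustration of how this approximation behaves, the following Python sketch (ours, not part of the original text) tabulates the right-hand side as a function of p; the function name approx_ED2 and the example values of m₁, m₂ and f are arbitrary choices, and the graph Δ²(p) = 4p/(p + 1) anticipates the example used further on in this section.

def approx_ED2(delta2, p, f, m1, m2):
    """Approximate E_theta D^2 for the plug-in weights c*S^{-1}(X2. - X1.)."""
    m = m1 + m2
    num = 1.0 + (f - 1) * m / (m1 * m2 * delta2)
    den = (f - 1) * (1.0 + p * m / (m1 * m2 * delta2))
    return delta2 - delta2 * (p - 1) * num / den

if __name__ == "__main__":
    m1 = m2 = 20
    f = m1 + m2 - 2
    for p in range(1, 11):
        delta2 = 4.0 * p / (p + 1)      # example graph Delta^2(p)
        print(p, round(approx_ED2(delta2, p, f, m1, m2), 4))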

It is interesting to draw some graphs of Δ²: {1,...,s} → [0, ∞) and to study the
performance E_θD² of the classical standard procedures with W = cS^{-1}(X₂. − X₁.)
as a function of p.
The basic question is which graphs of Δ² should be expected in actual practice.
Theoretical deliberations are not very useful in this connection. Know-how in the
area of application would be decisive but is usually only very scanty. In order to
do at least something, we discuss graphs of Δ² given by Δ²(p) = δ²p(p + 1)^{-1}.
Here δ² ∈ [1, 5] might be appropriate when discriminating between 'tribes' on the
basis of skull measurements, δ² ∈ [3, 4] might be appropriate when discriminating
between 'sexes' and δ² ∈ [4, 16] when discriminating between 'races'. If one has to
discriminate between 'successful' and 'non-successful' highschool records on the
basis of some pre-highschool psychological examinations and this task is regarded
as intrinsically very uncertain, then one might think of δ² ∈ [1, 3]. Note that
lim_{p→∞} Δ²(p) = δ² implies that Φ(−½δ) is an intrinsic lower bound for the
maximum of the two misclassification probabilities.
In order to get a nice illustration, we single out the very special situation where
(1) δ² = 4, (2) m₁ = m₂, (3) f = m + 1. Thus we have to study the graph of

δ^{-2}E_θD² ≈ p(p + 1)^{-1} − p(p + 1)^{-1} × {1 + (p + 1)p^{-1}}(m + p + 1)^{-1}(p − 1)
           = {−p² + (m + 2)p + 1}{p² + (m + 2)p + m + 1}^{-1}

as a function of p ∈ {1,...,min(s, m + 1)}. This function is concave and assumes
its maximum at

p* = [−½ + ½(1 + 2m)^{1/2}]

where [ ] is the entier or the entier + 1.
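A direct numerical check (ours) of this special case: the rational function above is maximized over the integers at a value agreeing with the closed-form p*, as the following sketch verifies for a few sample sizes m.

import math

def perf(p, m):
    # delta^{-2} E_theta D^2 for delta^2 = 4, m1 = m2 = m/2, f = m + 1
    return (-p * p + (m + 2) * p + 1) / (p * p + (m + 2) * p + m + 1)

def p_star(m):
    return -0.5 + 0.5 * math.sqrt(1 + 2 * m)

for m in (12, 24, 40, 112):
    best = max(range(1, m + 2), key=lambda p: perf(p, m))
    print(m, best, round(p_star(m), 2))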

Interpretation
If one knows the graph of Δ² as a function of p, then one will 'often' see that
the approximate performance of the classical vector of weights cS^{-1}(X₂. − X₁.)
increases for p ≤ p* and decreases for p ≥ p*, where p* will be an increasing
function of the sample sizes. This means in particular that the performance of the
classical procedure based on all s variables (the value of the performance for
p = s) can 'often' be improved upon by replacing this procedure by a classical
one based on a predetermined subset of the set of all variables. In Section 3 other
'non-classical' procedures will be considered which essentially make use of all
variables ξ₁,...,ξ_s but are based on 'very complicated functions w(X₁., X₂., S)'
(and possibly a(X₁., X₂., S) and b(X₁., X₂., S)). For these complicated procedures
the performance E_θD² will not depend on θ = (μ₁, μ₂, Σ) through the
Mahalanobis distance.

Some technical details


Note: the paper has been written so that the technical details at the end of
various sections can be omitted at first reading. In order to apply the theory of
the appendix, we introduce the following concepts. Let Ψ be any orthogonal
matrix with Δ^{-1}(μ₂ − μ₁)^TΣ^{-1/2} as its first row. Define

Y = (m₁ + m₂)^{-1/2} m₁^{1/2} m₂^{1/2} Ψ Σ^{-1/2}(X₂. − X₁.),
η = (m₁ + m₂)^{-1/2} m₁^{1/2} m₂^{1/2} Δ,   V = Ψ Σ^{-1/2} S Σ^{-1/2} Ψ^T,

and notice that the probabilistic assumptions in the beginning of this section
imply that those at the beginning of the appendix are satisfied. The following
identifications can be made:

S^{-1}(X₂. − X₁.) = m₁^{-1/2} m₂^{-1/2}(m₁ + m₂)^{1/2} Σ^{-1/2} Ψ^T R,
D = δ(W) = Δ R₁/‖R‖   if W = cS^{-1}(X₂. − X₁.) for c > 0.

The approximation b ≈ b^{(3)} at the end of the appendix leads to

E_θD² = Δ² − Δ²E_θ(1 − R₁²/‖R‖²) ≈ Δ² − Δ²b^{(3)},

which has been exploited in this section.


The shape of the graph of Δ²: {1,...,s} → [0, ∞) is a matter of concern. Even if
the initial ordering is 'perfect', Δ² need not be concave (consider the case where
μ₂₁ − μ₁₁ = μ₂₂ − μ₁₂ = ε > 0, μ₂ⱼ − μ₁ⱼ = 0 (j ≥ 3), σ₁₁ = σ₂₂ = 1, all correlations 0
except for ρ₁₂; note that the ordering is 'perfect' though ξ₁ and ξ₂ might be
interchanged; Δ²(1) = ε², Δ²(2) = Δ²(3) = ... = Δ²(s) = 2ε²/(1 + ρ₁₂); Δ²(2) −
Δ²(1) can be made arbitrarily large by taking ρ₁₂ close to −1). A referee for [10]
suggested to consider the case μ₂ⱼ − μ₁ⱼ = ε, σⱼⱼ = 1, σⱼₕ = ρ > 0 (j ≠ h = 1,...,s)
where all variables are equally important but Δ²(p) = ε²p{1 + (p − 1)ρ}^{-1} has a
decreasing slope because of the correlations. Schaafsma and van Vark [12] give
elaborations for the case ρ = ½ which leads to Δ²(p) = δ²p(p + 1)^{-1} with δ² = 2ε².
The same function Δ² has been discussed in the text.
Choosing the constant c in W = cS^{-1}(X₂. − X₁.) leads to an interesting theory
if attention is shifted from the construction of an affine-linear discriminant
function to the (auxiliary) problem of estimating ω = Σ^{-1}(μ₂ − μ₁). Theorem A.2
in the appendix implies that

E S^{-1}(X₂. − X₁.) = (f − p − 1)^{-1} Σ^{-1}(μ₂ − μ₁)

and that the dispersion matrix of S^{-1}(X₂. − X₁.) is equal to

{(f − p)(f − p − 1)(f − p − 3)}^{-1}
 × [{(f − 1)m₁^{-1}m₂^{-1}(m₁ + m₂) + Δ²}Σ^{-1} + (f − p + 1)(f − p − 1)^{-1} ω ω^T]

with the consequence that c = f − p − 1 yields the uniformly best unbiased
estimator for ω and that the corresponding exact dispersion matrix is (f − p − 1)²
times the above-mentioned one. This result was first derived by Das Gupta [3]
(see [6]).
The question arises whether the choices c = f − p − 1 (leading to the best
unbiased estimator), c = f (the usual plug-in estimator where Σ is replaced by
f^{-1}S), or c = f + 2 (the maximum likelihood estimator) lead to admissibility from
the m.s.e. point of view. For that purpose we consider

E{cS^{-1}(X₂. − X₁.) − ω}{cS^{-1}(X₂. − X₁.) − ω}^T
 = c²{(f − p)(f − p − 1)(f − p − 3)}^{-1}{(f − 1)m₁^{-1}m₂^{-1}(m₁ + m₂) + Δ²}Σ^{-1}
 + [c²(f − p + 1)(f − p)^{-1}(f − p − 1)^{-2}(f − p − 3)^{-1} + {c(f − p − 1)^{-1} − 1}²] ω ω^T

and we determine

c* = (f − p)(f − p − 3)(f − p − 1)^{-1}

by minimizing the coefficient of ωω^T. Note that c* is smaller than any of the
above-mentioned choices of c. This implies that each of these choices leads to an
inadmissible estimator because, by minimizing the coefficient of ωω^T, the coefficient
of Σ^{-1} is also decreased. Notice that c* ≈ f − p − 1 and that the improvement
over the best unbiased estimator will not be substantial. As ω and Σ^{-1} can
vary independently, one can show that no estimator of the form cS^{-1}(X₂. − X₁.)
with c ∈ [0, c*] can be improved upon uniformly by considering other estimators
of this form.
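The minimization can be checked numerically; the sketch below (ours) compares the coefficient of ωω^T for c*, for the unbiased choice f − p − 1, and for the plug-in and maximum likelihood choices f and f + 2, with f and p chosen arbitrarily.

import numpy as np

f, p = 30, 5

def coef(c):
    # coefficient of omega*omega^T in the m.s.e. expression above
    return (c * c * (f - p + 1) / ((f - p) * (f - p - 1) ** 2 * (f - p - 3))
            + (c / (f - p - 1) - 1.0) ** 2)

c_star = (f - p) * (f - p - 3) / (f - p - 1)
grid = np.linspace(0.0, f + 2, 100001)
print(grid[np.argmin(coef(grid))], c_star)                    # numerical vs. closed-form minimizer
print(coef(c_star), coef(f - p - 1), coef(f), coef(f + 2))    # c* gives the smallest coefficient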
The basic question arises whether one should go beyond the class of estimators
of the form cS^{-1}(X₂. − X₁.) when estimating ω = Σ^{-1}(μ₂ − μ₁). This question is
closely related to the problem whether one should go beyond the class of weight
vectors w(X₁., X₂., S) = cS^{-1}(X₂. − X₁.) when constructing an affine-linear
discriminant function. The last formulation suggests that one should be willing to
delete variables, or in other words to consider estimators for ω ∈ R^s in which
various components are equated to 0. The original formulation suggests that
modifications can be obtained along the lines of 'ridge-regression' or 'Stein
estimators'. Thus there are many ways to go beyond the class {cS^{-1}(X₂. − X₁.); c > 0}.
We shall restrict attention mainly to selection-of-variables modifications.

3. One particular rule for selecting variables

Section 2 provides evidence that from a practical point of view the standard
procedure based on all s variables can be substantially improved if s is large and
the sample sizes are small. The 'illustrations' in Section 2 and elsewhere in the
literature (see e.g. [12] for references in the area of pattern recognition) are based
on a given probabilistic structure: the graph of Δ² should be known. In practice
the a priori information with respect to this graph is very scanty. One will have to
use the data under evaluation for estimating Δ². It seems natural to proceed as
follows when dealing with Aim 1.
Suppose that the outcome of the independent random variables X₁., X₂. and S
has to be evaluated where X_h. ~ N_s(μ_h, m_h^{-1}Σ) and S ~ W_s(f, Σ). Then one should
start out by reconsidering the initial ordering ξ₁,...,ξ_s because the performance of
the procedure to be chosen will depend on the reliability of this ordering. We
distinguish among the following situations:
(1) the initial ordering is so reliable that the investigator is not willing to
consider any other subset than ξ₁,...,ξ_p if he has to select p variables from
ξ₁,...,ξ_s;
(2) the initial ordering is 'rather' reliable but 'some deviations from this
ordering are allowed';
(3) the initial ordering is 'very shaky', 'almost random'.
The distinction between these categories is not very clear and to some extent a
matter of taste. Situation (3) gives rise to nasty complications which will be
outlined in the technical details at the end of this section.
Assuming that situation (2) appears, we propose the following procedure.

Step 1. Estimate Δ²: {1,...,s} → [0, ∞) by means of the uniformly best unbiased
estimator

Δ̂²(p) = (f − p − 1)(ᵖX₂. − ᵖX₁.)^T(ᵖS)^{-1}(ᵖX₂. − ᵖX₁.) − p m₁^{-1}m₂^{-1}(m₁ + m₂).

Consider the corresponding outcomes. It may happen that for some p, the
difference Δ̂²(p + 1) − Δ̂²(p) is so large (small) that one would like to move ξ_{p+1}
to the left (right). This should be done very reluctantly (see the technical details at
the end of this section). One should do nothing unless some appropriate test(s)
lead to significance. In this context one will make use of the test for additional
discrimination information [7, 8d.3.7] or [16], the test for discriminant function
coefficients [7, 8d.3.9] or approximate tests for comparing standardized weights
[12, Section 4.9]. The procedure for rearranging the variables will not be described
explicitly here because various different specifications are reasonable. After
rearranging the variables, proceed as if the thus obtained ordered set had been
decided upon in advance, again using the notation ξ₁,...,ξ_s.
Step 2. Consider the ordered set ξ₁,...,ξ_s obtained after Step 1 and estimate Δ²
by means of the above-mentioned estimator Δ̂², which has lost its optimum
property because Step 1 contains some data-peeping. It is expected that the
resulting positive bias is so small that it does not lead to serious consequences.
Step 3. Estimate the 'performance' E_θD² of the classical vector of weights
c(ᵖS)^{-1}(ᵖX₂. − ᵖX₁.) by plugging Δ̂²(p) into the approximation of E_θD² in
Section 2, thus obtaining

Δ̂²(p) − Δ̂²(p) (p − 1){1 + (f − 1)m m₁^{-1}m₂^{-1}/Δ̂²(p)} / [(f − 1){1 + p m m₁^{-1}m₂^{-1}/Δ̂²(p)}].

Step 4. Determine p̂* = p̂*(X₁., X₂., S) such that the above-mentioned estimated
performance is as large as possible. The vector w*(X₁., X₂., S) of weights
in the affine-linear discriminant function to be constructed is determined by
putting the last s − p̂* coordinates equal to 0 while the first p̂* coordinates are
given by

^{p̂*}w* = (f − p̂* − 1)(^{p̂*}S)^{-1}(^{p̂*}X₂. − ^{p̂*}X₁.).

Step 5. If we are only interested in the estimated (squared) discriminatory value
as the relevant performance, then we are through because a*(X₁., X₂., S) and
b*(X₁., X₂., S) can be chosen arbitrarily. (This choice has no effect on the
discriminatory value.) If one likes to obtain a discriminant function which
satisfies var_Σ g(X) = 1 and E_{(μ₁,Σ)}g(X) + E_{(μ₂,Σ)}g(X) = 0, then obvious proposals
for a* and b* can be made, though a nasty bias will appear. This does not admit a
satisfactory treatment because the definition of w* is too complicated.
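The following Python sketch (ours) outlines Steps 1-4 under the simplifying assumption that the rearrangement of Step 1 is skipped: it computes Δ̂²(p) for p = 1,...,s, plugs it into the estimated performance of Step 3, and returns the weight vector of Step 4. Function and variable names are ours; x1bar, x2bar and S stand for the outcomes of X₁., X₂. and S.

import numpy as np

def select_and_weight(x1bar, x2bar, S, f, m1, m2):
    s = len(x1bar)
    m = m1 + m2
    d = x2bar - x1bar
    best_p, best_perf = 1, -np.inf
    for p in range(1, min(s, f - 4) + 1):                     # keep f - p - 3 > 0
        dp, Sp = d[:p], S[:p, :p]
        quad = dp @ np.linalg.solve(Sp, dp)
        delta2_hat = (f - p - 1) * quad - p * m / (m1 * m2)   # unbiased estimate of Delta^2(p), Step 1
        delta2_hat = max(delta2_hat, 1e-8)                    # guard against negative estimates
        num = 1.0 + (f - 1) * m / (m1 * m2 * delta2_hat)
        den = (f - 1) * (1.0 + p * m / (m1 * m2 * delta2_hat))
        perf = delta2_hat - delta2_hat * (p - 1) * num / den  # estimated performance, Step 3
        if perf > best_perf:
            best_p, best_perf = p, perf                       # Step 4: p-hat-star
    w = np.zeros(s)
    w[:best_p] = (f - best_p - 1) * np.linalg.solve(S[:best_p, :best_p], d[:best_p])
    return best_p, w

The guard against a negative Δ̂²(p) and the upper limit on p are practical safeguards which the text does not discuss.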

Interpretation
Incorporating a selection-of-variables technique is a very complicated matter.
The ultimate goal in this section is to achieve a performance E_θD² which is as
large as possible, as a function of θ = (μ₁, μ₂, Σ), in the region of θ's which seem
of interest. It is expected that the performance

E_θD² = E_θ[{W*^T(μ₂ − μ₁)}² / (W*^TΣW*)],

with D = δ(W*) and W* = w*(X₁., X₂., S) defined by Step 4, compares favorably
with

E_θD² = E_θ[{(X₂. − X₁.)^TS^{-1}(μ₂ − μ₁)}² / {(X₂. − X₁.)^TS^{-1}ΣS^{-1}(X₂. − X₁.)}],

with D = δ(W) and W = cS^{-1}(X₂. − X₁.), in the greater part of the region of θ's
which are more or less in agreement with the initial ordering. Note that the
last-mentioned E_θD² depends on θ = (μ₁, μ₂, Σ) through the Mahalanobis distance Δ
whereas the before-mentioned performance is not determined by Δ. A comparison
by means of simulation experiments will be a very complicated matter because the
region of θ's for which the initial ordering is 'correct' is not unambiguous (see
[12, Section 4.9]) and certainly very complicated to overview.

Some technical details with respect to Situation 3


The unbiased estimator Δ̂²(p) has been constructed by many authors and can
be derived by noting that

(X₂. − X₁.)^TS^{-1}(X₂. − X₁.) = m₁^{-1}m₂^{-1}(m₁ + m₂) Y^TV^{-1}Y,

where p^{-1}(f − p + 1)Y^TV^{-1}Y has the noncentral F distribution with p and
f − p + 1 degrees of freedom and noncentrality parameter η². Straightforward
computations lead to

var Δ̂²(p) = (f − p − 3)^{-1}
 × {2Δ⁴(p) + 4m₁^{-1}m₂^{-1}(m₁ + m₂)(f − 1)Δ²(p)
 + 2p m₁^{-2}m₂^{-2}(m₁ + m₂)²(f − 1)}.

The following proposals can be made with respect to situation (3).

Approach 1: The principal components approach
Use the outcome of (X₁., X₂., S) for the determination of unrotated standardized
principal components η₁,...,η_s. Ignore the fact that they were defined on the
basis of the data and treat η₁,...,η_s in the same manner as ξ₁,...,ξ_s in situation
(1).

Approach 2: The ridge-regression approach
The set of all ω such that

{ω − (f − p − 1)S^{-1}(X₂. − X₁.)}^T Γ̂^{-1}{ω − (f − p − 1)S^{-1}(X₂. − X₁.)} ≤ χ²_{p;α}

can be used as a confidence ellipsoid for Σ^{-1}(μ₂ − μ₁); here Γ̂ is some estimator for
(f − p − 1)² times the dispersion of S^{-1}(X₂. − X₁.), see the end of Section 2. If α is
large, say 0.70, then the confidence coefficient 1 − α is small and not much
harm is done if the unbiased estimator (f − p − 1)S^{-1}(X₂. − X₁.) for Σ^{-1}(μ₂ − μ₁)
is replaced by any other estimator which assumes values in the confidence
ellipsoid. The ridge-regression estimator ω̂ is obtained by minimizing the Euclidean
norm over the indicated confidence ellipsoid (elaboration along the lines of
Hocking [5]).

Approach 3: The subset-selection approach
For any value of p the sample-optimal subset (ĵ_p(1),...,ĵ_p(p)) can be defined by

Δ̂*²(p) = max_{(j₁,...,j_p)} Δ̂²(ξ_{j₁},...,ξ_{j_p}) = Δ̂²(ξ_{ĵ_p(1)},...,ξ_{ĵ_p(p)}).

Complications arise because Δ̂*²(p) will overestimate Δ²(ĵ_p(1),...,ĵ_p(p)) in a
very serious manner. The bias will depend on p and may be expected to be of the
same order of magnitude as c(p){var Δ̂²(ĵ_p(1),...,ĵ_p(p))}^{1/2}, where c(p) may be
about 3 or 4 if p = ½s and will show a tendency to decrease if |p − ½s| is increased
because the number of subsets of p elements from the set of all s variables
shows this behaviour. The bias implies that the estimator in Step 3 with Δ̂*²
instead of Δ̂² overestimates the performance of the classical procedure based on
(ĵ_p(1),...,ĵ_p(p)) in a very complicated manner. Thus we have no idea where to
stop when considering Δ̂*²: {1, 2,...} → [0, ∞).
It is interesting to study the bias from a theoretical point of view by considering
the very special situation, mentioned at the end of Section 2, where μ₂ⱼ − μ₁ⱼ =
ε, σⱼⱼ = 1, σⱼₕ = ρ > 0 (j ≠ h = 1,...,s) so that all variables are equally important.
Note that Δ²(p) = ε²p{1 + (p − 1)ρ}^{-1} and E_θ(D²) has the same (approximately
known) value for each selection (j_p(1),...,j_p(p)) of p variables. By considering
var Δ̂²(p) one gets some impression of the bias to be expected. This impression
should be confronted with simulation experiments for Δ̂*².
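A rough Monte Carlo sketch (ours) for this equicorrelated case is given below; it compares the average of Δ̂*²(p) with the common true value Δ²(p). The sample sizes, the dimension s and the number of replications are arbitrary choices, and scipy is assumed to be available for generating Wishart matrices.

import itertools
import numpy as np
from scipy.stats import wishart

rng = np.random.default_rng(0)
s, p, eps, rho = 6, 3, 1.0, 0.5
m1 = m2 = 15
f = m1 + m2 - 2
Sigma = rho * np.ones((s, s)) + (1 - rho) * np.eye(s)
mu1, mu2 = np.zeros(s), eps * np.ones(s)
true_delta2 = eps ** 2 * p / (1 + (p - 1) * rho)      # the same for every subset of size p

def delta2_hat(d, S, q):
    return (f - q - 1) * d @ np.linalg.solve(S, d) - q * (m1 + m2) / (m1 * m2)

vals = []
for _ in range(2000):
    x1 = rng.multivariate_normal(mu1, Sigma / m1)
    x2 = rng.multivariate_normal(mu2, Sigma / m2)
    S = wishart.rvs(df=f, scale=Sigma, random_state=rng)
    d = x2 - x1
    vals.append(max(delta2_hat(d[list(J)], S[np.ix_(J, J)], p)
                    for J in itertools.combinations(range(s), p)))
print("true Delta^2(p):", round(true_delta2, 3),
      "  mean of Delta-hat*^2(p):", round(float(np.mean(vals)), 3))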

Conclusion
We do not know precisely what to do when dealing with situation (3): all three
approaches look attractive, but large scale simulation experiments for comparing
the three approaches have not yet been carried out.

4. Dealing with Aim 3 in the case k = 2, m₀ = 1

Before adopting a procedure for evaluating the outcome of (X₀, X₁., X₂., S)
one should discuss the underlying decision situation, e.g. by posing the following
questions. Is it a situation with 'forced decision' or is one allowed to remain
undecided? Which loss function should be used? Is it more appropriate to keep
the error probabilities under control by imposing significance level restrictions?
Can a reasonable choice be made for the a priori probability P(T = 2) = τ that the
individual under classification belongs to population 2?
Complete specification of action space, loss structure and a priori probabilities
is often felt to be extremely restrictive. We have the opinion that in such
situations it might be more attractive to construct a confidence interval for the
posterior probability (in principle this requires that some a priori probability τ is
specified; it will be seen in Section 5 that the influence of τ can be studied easily).
In this section we restrict the attention to two-decision situations where one
individual has to be assigned either to population 1 or to population 2 (m₀ = 1)
while no reasonable a priori probability τ is available. With respect to the loss
structure we consider
(1) the case of 0-1 loss,
(2) the Neyman-Pearson approach where the probability of assigning to
population 2, whereas the individual belongs to population 1 (the probability of
an error of the first kind), is bounded from above by the significance level α.
Note that the case of 0-a-b loss (an error of the first kind costs b units, an
error of the second kind a) leads to the Neyman-Pearson formulation with
α = a/(a + b) if attention is restricted to the class of minimax procedures or,
equivalently, to that of all procedures which are unbiased in the sense of
Lehmann. ([8] is applicable: the problem is of type I because, using θ =
(t, μ₁, μ₂, Σ) as the unknown parameter, the subsets Θ₁ = {θ; t = 1} and Θ₂ =
{θ; t = 2} have Θ₀ = {θ; μ₁ = μ₂} as common boundary.)

The case of 0-1 loss
If m₁ ≠ m₂ then Anderson's classification statistic

A = f(X₂. − X₁.)^T S^{-1}{X₀ − ½(X₁. + X₂.)}

(usually called V or W) and John's classification statistic (usually called Z) yield
different 'standard' procedures. The last statistic seems to have some slight
advantages from a theoretical point of view. Nevertheless we restrict our attention
to the first statistic because this statistic is a bit less unmanageable. The exact
variance of A will be obtained. Approximations based on this exact variance
might yield an alternative to the asymptotic results of Okamoto, Sitgreaves,
Anderson a.o. presented, e.g., in [2, 14].
Thus we consider Wald's rule as standard procedure (see [18]). This rule assigns
to population 2 if and only if A > 0. The corresponding performance can be
characterized by means of the risk function. The risk in θ = (t, μ₁, μ₂, Σ) is equal
to the corresponding probability of {(−1)^t A < 0} of assigning to the wrong
population. The risk function can be summarized in the function of (μ₁, μ₂, Σ)
which is obtained by adding the two corresponding misclassification probabilities.
The performance of Wald's rule will be closely related to the performance of
Anderson's classification statistic, which performance can be characterized by
means of the following function

{E₂(A) − E₁(A)}² / [½{var₁(A) + var₂(A)}]

of (μ₁, μ₂, Σ), where the subscript t ∈ {1, 2} indicates that the moments have to be
computed for θ = (t, μ₁, μ₂, Σ). This function is closely related to the performance
E_θD² which was used in Sections 2 and 3, although the values of these two
functions differ more than we had expected (at least for the examples which we
considered).
Tedious computations, performed independently by A. Ambergen, provided the
result

E_t(A) = ½ f(f − p − 1)^{-1}{(−1)^t Δ² + m₁^{-1}m₂^{-1}(m₂ − m₁)p}

of Lachenbruch (see [6, p. 500]) and the result

var_t(A) = f²(f − p)^{-1}(f − p − 1)^{-2}(f − p − 3)^{-1}(aΔ⁴ + b_tΔ² + c)

where

a = ½(f − p),
b_t = (f − p − 1)(f − 1)(1 + m_{3−t}^{-1}) + m₁^{-1}m₂^{-1}(m_t − m_{3−t})(f − 1),
c = m₁^{-1}m₂^{-1}(m₁ + m₂)p(f − p − 1)(f − 1)
  + ½(m₁^{-2} + m₂^{-2})p(f − p)(f − 1) − m₁^{-1}m₂^{-1}p(f − 1).
Of course one recognizes the obvious limits E_t(A) → (−1)^t ½Δ², var_t(A) → Δ² as
min(m₁, m₂, f) → ∞ while p and Δ are kept fixed. It is interesting to consider the
normal approximation for the first misclassification probability P₁(A > 0). With
m₁ = m₂ = ½m, f = m − 2 and expanding we obtain

P₁(A > 0) ≈ Φ[E₁(A)/{var₁(A)}^{1/2}]
          ≈ Φ[−½Δ{1 + m^{-1}(½Δ² + p + 3 + 4pΔ^{-2}) + O(m^{-2})}^{-1/2}]
          ≈ Φ(−½Δ) + m^{-1}φ(−½Δ)(8^{-1}Δ³ + 4^{-1}(p + 3)Δ + pΔ^{-1}),

which formula does not agree with

P₁(A > 0) = Φ(−½Δ) + m^{-1}φ(−½Δ){4^{-1}pΔ + (p − 1)Δ^{-1}} + O(m^{-2})

presented, e.g., in [2, p. 21]. Our explanation of the difference is that there exists
no theory to support the idea that m[P₁(A > 0) − Φ[E₁(A)/{var₁(A)}^{1/2}]] tends
to 0 if m → ∞ and all other parameters are kept fixed. We expect that
Φ[E₁(A)/{var₁(A)}^{1/2}] yields the most accurate approximation of P₁(A > 0) if the
sample sizes are small or moderately large and p is not too small. However,
simulation experiments are badly needed.
The normal approximations for the misclassification probabilities P₁(A > 0)
and P₂(A ≤ 0), and in particular for their sum, can be used in a similar way as the
approximation for E_θD² in Sections 2 and 3.
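The comparison can be carried out numerically; the sketch below (ours) evaluates Φ[E₁(A)/{var₁(A)}^{1/2}] with the exact moments given above and the Okamoto-type expansion quoted from [2], for arbitrarily chosen m, p and Δ.

from math import sqrt
from scipy.stats import norm

def EA(t, Delta2, p, f, m1, m2):
    return 0.5 * f / (f - p - 1) * ((-1) ** t * Delta2 + (m2 - m1) * p / (m1 * m2))

def varA(t, Delta2, p, f, m1, m2):
    a = 0.5 * (f - p)
    mt, mo = (m1, m2) if t == 1 else (m2, m1)      # mo plays the role of m_{3-t}
    b = (f - p - 1) * (f - 1) * (1 + 1 / mo) + (mt - mo) * (f - 1) / (m1 * m2)
    c = ((m1 + m2) * p * (f - p - 1) * (f - 1) / (m1 * m2)
         + 0.5 * (1 / m1 ** 2 + 1 / m2 ** 2) * p * (f - p) * (f - 1)
         - p * (f - 1) / (m1 * m2))
    return f * f * (a * Delta2 ** 2 + b * Delta2 + c) / ((f - p) * (f - p - 1) ** 2 * (f - p - 3))

m, p, Delta = 50, 4, 2.0
m1 = m2 = m // 2
f = m - 2
moment_based = norm.cdf(EA(1, Delta ** 2, p, f, m1, m2) / sqrt(varA(1, Delta ** 2, p, f, m1, m2)))
okamoto = norm.cdf(-Delta / 2) + norm.pdf(-Delta / 2) * (p * Delta / 4 + (p - 1) / Delta) / m
print(round(moment_based, 4), round(okamoto, 4))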

EXAMPLE. We reconsider the very special case Δ²(p) = 4p(p + 1)^{-1}, m₁ = m₂
= ½m, f = m + 1 which leads to the optimal measurement complexity p* = [−½
+ ½(1 + 2m)^{1/2}] when dealing with E_θD² (see Section 2, above the Interpretation
subsection). Note that m₁ = m₂ implies E₁(A) = −E₂(A) and var₁(A) = var₂(A).
Thus all performances considered in this section are increasing functions of

{E₂(A)}²/var(A)
 = mp(m − p + 1)(m − p − 2) / [2mp(m + 1 − p) + m(m + 2)(m − p)(p + 1) + (m + 1)(m − p)(p + 1)²].

We conjecture that the Aims 1 and 3 are so closely related, when 0-1 loss is
used, that the corresponding values of p* are almost the same. This conjecture can
be verified for our very special example. For m = 12 we verified that the
above-mentioned function of p assumes its maximum for p = 2; m = 40 yields a
maximum for p = 4. These values are in perfect agreement with p* = [−½
+ ½(1 + 2m)^{1/2}] based on E_θD².
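The verification is easily repeated; the short sketch below (ours) maximizes the ratio above over p for m = 12 and m = 40.

def ratio(p, m):
    num = m * p * (m - p + 1) * (m - p - 2)
    den = (2 * m * p * (m + 1 - p) + m * (m + 2) * (m - p) * (p + 1)
           + (m + 1) * (m - p) * (p + 1) ** 2)
    return num / den

for m in (12, 40):
    print(m, max(range(1, m - 2), key=lambda p: ratio(p, m)))   # prints 2 and 4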

The Neyman-Pearson formulation
The standard procedure will be based on a studentized version of Anderson's
classification statistic. Anderson [2] worked with asymptotic expansions. We try
to use the exact expressions for E₁A and var₁A in order to arrive at an
appropriate studentized version. If we consider A − E₁A and replace E₁A by the
corresponding best unbiased estimator, then we obtain

f(X₂. − X₁.)^TS^{-1}(X₀ − X₁.) − f(f − p − 1)^{-1}m₁^{-1}p

and we become interested in the exact variance of this statistic if θ = (1, μ₁, μ₂, Σ).
Tedious computations, performed independently by A. Ambergen, yielded

var₁{(X₂. − X₁.)^TS^{-1}(X₀ − X₁.)}
 = {(f − p)(f − p − 1)(f − p − 3)}^{-1}
 × [(1 + m₁^{-1})(f − 1)Δ² + p(f − 1){m₁^{-1}m₂^{-1}(m + 1) + 2m₁^{-2}(f − p)(f − p − 1)^{-1}}]

with the consequence that the random variable

T_{Δ²} = [(f − p − 1)(X₂. − X₁.)^TS^{-1}(X₀ − X₁.) − m₁^{-1}p]
       / [(f − 1)(f − p − 1){(f − p)(f − p − 3)}^{-1}
          × {(m₁ + 1)m₁^{-1}Δ² + ((m + 1)m₁^{-1}m₂^{-1} + 2m₁^{-2}(f − p)(f − p − 1)^{-1})p}]^{1/2}
has exact expectation 0 and variance 1. The final step in constructing a studentized
version of A consists of replacing the unknown parameter Δ² by the
corresponding best unbiased estimator Δ̂². This yields the test statistic T = T_{Δ̂²}. We
are planning to try to derive appropriate approximations for E₁T and var₁T by
writing

T ≈ T_{Δ²}[1 − ½(Δ̂² − Δ²){Δ² + p(m₁^{-1} + m₂^{-1})}^{-1}].

Next we will try to derive approximations for the power function P₂[T > E₁T +
u_α{var₁T}^{1/2}]. These approximations will then be compared with the asymptotic
results in [2].
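For completeness, a sketch (ours) of the test statistic T = T_{Δ̂²} as it would be computed from data; x0, x1bar, x2bar and S denote the outcomes of X₀, X₁., X₂. and S, and the denominator uses the variance expression given above with Δ² replaced by Δ̂².

import numpy as np

def T_statistic(x0, x1bar, x2bar, S, f, m1, m2):
    p = len(x0)
    m = m1 + m2
    d = x2bar - x1bar
    quad = d @ np.linalg.solve(S, d)
    delta2_hat = max((f - p - 1) * quad - p * m / (m1 * m2), 1e-8)   # Delta-hat^2
    num = (f - p - 1) * d @ np.linalg.solve(S, x0 - x1bar) - p / m1
    var = ((f - 1) * (f - p - 1) / ((f - p) * (f - p - 3))
           * ((m1 + 1) / m1 * delta2_hat
              + ((m + 1) / (m1 * m2) + 2 * (f - p) / (m1 ** 2 * (f - p - 1))) * p))
    return num / np.sqrt(var)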
Finally, for various functions Δ²: {1,...,s} → [0, ∞), we will study the
(approximate) power as a function of p. We conjecture that the optimal measurement
complexities to be obtained will be considerably smaller than those based on
E_θD² or on 0-1 loss functions. The rationale behind this conjecture is that by
safeguarding oneself against wrong decisions by means of the level-α restriction
one avoids 'risky guessing', especially if α is small. Note that guessing may be
attractive from an averaging point of view. If the dimensionality is increased, then
evaluation becomes much more risky. The Neyman-Pearson approach and the
approach in Section 5 are more intended to avoid the corresponding consequences
than the 0-1 loss function approach, and will consequently often yield
smaller values of the optimal measurement complexity.

5. Dealing with Aim 4 in the case k = 2, m₀ = 1

If some reasonable value can be postulated for the a priori probability
P(T = 2) = τ that the individual under classification belongs to population 2, e.g.
τ = ½ if nothing is known, then one will be interested in a confidence interval for
the corresponding posterior probability

P(T = 2 | X₀ = x) = τ exp(ξ_x)/{1 − τ + τ exp(ξ_x)}

where

ξ_x = ω^T{x − ½(μ₁ + μ₂)} = (μ₂ − μ₁)^TΣ^{-1}{x − ½(μ₁ + μ₂)}.

Constructing a confidence interval for ξ_x becomes the problem of basic interest
because such intervals are transformed easily into intervals for the posterior
probability and because ξ_x does not depend on τ. The latter fact is of particular
interest if one wants to consider the consequences of varying τ.
The confidence interval for ξ_x will be based on the independent r.v.'s X₁., X₂.
and S, where X_h. has the N_p(μ_h, m_h^{-1}Σ) distribution (h = 1, 2) and S has the
W_p(f, Σ) one. The outcome x of X₀ is regarded as a fixed constant. The
construction of an exact confidence interval as an interval of non-rejected
hypotheses is beyond our capabilities. We shall content ourselves by constructing
an approximate confidence interval on the basis of an estimator ξ̂_x for ξ_x together
with an estimator for the corresponding standard deviation.
We start from

Z_x = (X₂. − X₁.)^TS^{-1}{x − ½(X₁. + X₂.)}
    = (X₂. − X₁.)^TS^{-1}(x − X..)
    + ½m^{-1}(m₂ − m₁)(X₂. − X₁.)^TS^{-1}(X₂. − X₁.)

where X.. = m^{-1}(m₁X₁. + m₂X₂.) is introduced in order to exploit that X..,
X₂. − X₁. and S are independent. ES^{-1} = (f − p − 1)^{-1}Σ^{-1} (see Theorem A.2(1)
in the appendix) implies that

EZ_x = (f − p − 1)^{-1}{ξ_x + ½p m₁^{-1}m₂^{-1}(m₂ − m₁)}

with the well-known consequence that

ξ̂_x = (f − p − 1)Z_x + ½p m₁^{-1}m₂^{-1}(m₁ − m₂)


is the uniformly best unbiased estimator for ξ_x. Tedious computations by
Ambergen provided the following exact expression for σ_x² = var_{(μ₁,μ₂,Σ)}Z_x:

σ_x² = {(f − p)(f − p − 1)²(f − p − 3)}^{-1}[(f − p + 1)ξ_x²
 + (f − p − 1)Δ²(x − μ̄)^TΣ^{-1}(x − μ̄)
 + m m₁^{-1}m₂^{-1}(f − 1)(f − p − 1){(x − μ̄)^TΣ^{-1}(x − μ̄) + 4^{-1}Δ²}
 + (m₁^{-1} − m₂^{-1})(f − 1)(f − p + 1)ξ_x
 + ½(m₁^{-2} + m₂^{-2})p(f − 1)(f − p) − m₁^{-1}m₂^{-1}p(f − 1)]

where the notation μ̄ = ½(μ₁ + μ₂) should not be confused with the notation
μ̄.. = m^{-1}(m₁μ₁ + m₂μ₂). The exact variance σ_x² is a substantial improvement
over the asymptotic variance which was obtained in [10] and which also appears if
one lets f = m₁ + m₂ tend to infinity with p and the proportions m^{-1}m_h fixed.
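A sketch (ours) of the resulting interval construction is given below. The unknown quantities entering σ_x² are replaced by plug-in estimates, so the σ̂_x used here is only one of the 'appropriate' estimators alluded to in the next paragraph, and the helper posterior_interval maps the bounds for ξ_x to bounds for the posterior probability given at the beginning of this section.

import numpy as np
from scipy.stats import norm

def xi_interval(x, x1bar, x2bar, S, f, m1, m2, alpha=0.05):
    p = len(x)
    m = m1 + m2
    d = x2bar - x1bar
    Sinv_d = np.linalg.solve(S, d)
    Zx = Sinv_d @ (x - 0.5 * (x1bar + x2bar))
    xi_hat = (f - p - 1) * Zx + 0.5 * p * (m1 - m2) / (m1 * m2)        # unbiased estimator of xi_x
    delta2 = max((f - p - 1) * d @ Sinv_d - p * m / (m1 * m2), 1e-8)   # plug-in for Delta^2
    mubar = 0.5 * (x1bar + x2bar)                                      # plug-in for (mu1 + mu2)/2
    q = (f - p - 1) * (x - mubar) @ np.linalg.solve(S, x - mubar)      # plug-in for (x-mubar)'Sigma^{-1}(x-mubar)
    s2 = ((f - p + 1) * xi_hat ** 2
          + (f - p - 1) * delta2 * q
          + m / (m1 * m2) * (f - 1) * (f - p - 1) * (q + delta2 / 4)
          + (1 / m1 - 1 / m2) * (f - 1) * (f - p + 1) * xi_hat
          + 0.5 * (1 / m1 ** 2 + 1 / m2 ** 2) * p * (f - 1) * (f - p)
          - p * (f - 1) / (m1 * m2)) / ((f - p) * (f - p - 1) ** 2 * (f - p - 3))
    half = norm.ppf(1 - alpha / 2) * (f - p - 1) * np.sqrt(max(s2, 0.0))
    return xi_hat - half, xi_hat + half

def posterior_interval(xi_lo, xi_hi, tau=0.5):
    post = lambda xi: tau * np.exp(xi) / (1 - tau + tau * np.exp(xi))
    return post(xi_lo), post(xi_hi)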
Of course we propose using ξ̂_x ± u_{α/2}(f − p − 1)σ̂_x as confidence bounds for ξ_x;
u_{α/2} denotes the ½α upper quantile of the N(0, 1) distribution and σ̂_x any
appropriate estimator for σ_x = (σ_x²)^{1/2} ([1] is devoted to the general case k > 2).
Thus the classical procedure for constructing a confidence interval for ξ_x is
available. Note that the concept of posterior probability depends upon the set of
variables considered. If one tries to remove this dependence by restricting the
attention to the posterior probability with respect to all s variables ξ₁,...,ξ_s, then
this will have certain optimum properties if μ₁, μ₂, Σ were known, but one may
make very serious estimation errors if (1) μ₁, μ₂ and Σ are unknown, (2) s is large,
(3) f, m₁ and m₂ are not large, (4) the classical estimator ξ̂_x is used. This follows
from

E_t var(ξ̂_{X₀} | X₀) = (f − p − 1)² E_t σ²_{X₀}
 = (f − p)^{-1}(f − p − 3)^{-1} × [½(f − p)Δ⁴
 + {2 + (p + 1)(f − p − 1) + m_{3−t}^{-1}(f − 1)(f − p + 1) − m₁^{-1}m₂^{-1}m(f − 1)}Δ²
 + m₁^{-1}m₂^{-1}(m + 1)(f − 1)(f − p − 1)p
 + ½(f − 1)(f − p)p(m₁^{-1} − m₂^{-1})²],

which is based on

E_t ξ̂_{X₀} = E_t E(ξ̂_{X₀} | X₀) = E_t ξ_{X₀} = (−1)^t ½Δ²,
E_t ξ²_{X₀} = var_t ξ_{X₀} + (E_t ξ_{X₀})² = Δ² + 4^{-1}Δ⁴,
E_t(X₀ − μ̄)^TΣ^{-1}(X₀ − μ̄) = p + 4^{-1}Δ².
Giving up the idea that we are only interested in estimating the posterior
probability (that means with respect to all s variables), we notice that one would
like to have the upper bound ξ̂_x + u_{α/2}(f − p − 1)σ̂_x of the confidence interval for
ξ_x as small as possible if the individual under classification belongs to population
1; the lower bound should be as large as possible if the individual belongs to
population 2. Accordingly we become interested in the corresponding expectations

β(t; μ₁, μ₂, Σ) = E_{(t; μ₁, μ₂, Σ)}{ξ̂_{X₀} − (−1)^t u_{α/2}(f − p − 1)σ̂_{X₀}}

and in the corresponding compromise

β(μ₁, μ₂, Σ) = β(2; μ₁, μ₂, Σ) − β(1; μ₁, μ₂, Σ)
 = E₂ξ̂_{X₀} − E₁ξ̂_{X₀} − u_{α/2}(f − p − 1)(E₁σ̂_{X₀} + E₂σ̂_{X₀}).

Using the crude approximation

½(E₁σ̂_{X₀} + E₂σ̂_{X₀}) ≈ {½(E₁σ̂²_{X₀} + E₂σ̂²_{X₀})}^{1/2},

we obtain the following approximation

b(μ₁, μ₂, Σ) = Δ² − 2u_{α/2}(f − p)^{-1/2}(f − p − 3)^{-1/2}[½(f − p)Δ⁴
 + [2 + (f − p − 1){p + 1 + ½(m₁^{-1} + m₂^{-1})(f − 1)}]Δ²
 + m₁^{-1}m₂^{-1}(m + 1)(f − 1)(f − p − 1)p
 + ½(f − 1)(f − p)p(m₁^{-1} − m₂^{-1})²]^{1/2}

for β(μ₁, μ₂, Σ).


Next we can study the graph of b(p)=b(Pl-~l,Pl~2,P2) as a function of p E
{1 ..... min(s, f - 4 ) } for given values of a, f, m l, m 2 and k 2. At first we expected
that the corresponding optimal measurement complexity p* would not differ much
from those obtained in Section 2 and in Section 4. Some examples deepened our
insight: if a = 0.50 (u~/2 2 2) then Aim 4 is so much related to Aim 1 and to Aim
3 (in the case of 0 - 1 loss) that one indeed should expect similar values for p*;
however, if a is small, say a = 0.05 (u~/2 ~ 2), then one is worrying much more
about random disturbances which become more troublesome if p gets larger;
accordingly one should expect smaller values of p* for a = 0.05 than for a = 0.50.

EXAMPLE. We single out the special situation A2(p) = 4 p ( p + 1) -1, m 1= m 2


= ½m, f = m + 1. This led top* = [-- ½+ ½(1 + 2 m ) 1/2] in Section 2 and to similar
values in Section 4.

For α = 0.50, m = 24 we expect p* = 2. Accordingly we computed b(1) =
0.92, b(2) = 1.20 and b(3) = 1.22. Hence p* = 3.
For α = 0.05, m = 24 we expect p* < 2. Accordingly we computed b(1) =
−1.25, b(2) = −1.83 and b(3) = −2.34. Hence p* = 1.
For α = 0.50, m = 112 we expect p* = 7. Accordingly we computed b(6) =
2.413, b(7) = 2.412 and b(8) = 2.396. Hence p* = 6 or 7.
For α = 0.05, m = 112 we expect p* < 7. It follows from b(1) = 0.568, b(2) =
0.747, b(3) = 0.732 > b(4) > b(5) > b(6) that the optimal measurement complexity
for constructing a 95% confidence interval equals p* = 2, which is much smaller
than the optimal measurement complexities for Aim 1 and Aim 3. For α = 0.50 a
nice agreement exists with p* = [−½ + ½(1 + 2m)^{1/2}], at least in these examples.
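These computations are easy to reproduce approximately; the sketch below (ours) evaluates the expression for b given above in this special situation and reports the maximizing p. It uses exact normal quantiles, whereas the values quoted above apparently round u_{α/2} (to about 2/3 for α = 0.50 and to 2 for α = 0.05), so individual b(p) values differ slightly.

from scipy.stats import norm

def b(p, m, alpha):
    m1 = m2 = m / 2
    f = m + 1
    D2 = 4.0 * p / (p + 1)
    u = norm.ppf(1 - alpha / 2)
    inner = (0.5 * (f - p) * D2 ** 2
             + (2 + (f - p - 1) * (p + 1 + 0.5 * (1 / m1 + 1 / m2) * (f - 1))) * D2
             + (m + 1) * (f - 1) * (f - p - 1) * p / (m1 * m2)
             + 0.5 * (f - 1) * (f - p) * p * (1 / m1 - 1 / m2) ** 2)
    return D2 - 2 * u * inner ** 0.5 / ((f - p) * (f - p - 3)) ** 0.5

for alpha in (0.50, 0.05):
    for m in (24, 112):
        print(alpha, m, max(range(1, m - 3), key=lambda p: b(p, m, alpha)))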

6. Incorporating a selection of variables technique when dealing with Aim 3
or Aim 4 in the case k = 2, m₀ = 1

We have seen in Sections 2, 4 and 5 that the performance of the classical
procedure based on all s variables can often be considerably improved by deleting
variables, especially if s is large (say ≥ 15) and m is small (say < 50).
If one knows Δ²(p), or more generally Δ²(ξ_{j(1)},...,ξ_{j(p)}), for each value of the
argument, then the appropriate selection will be made by maximizing the performance
which is relevant to the aim in mind. Aim 1, Aim 3 with 0-1 loss and Aim
4 with α = 0.50 will lead to approximately the same selection. The Neyman-Pearson
formulation for Aim 3 and usual values for the confidence coefficient
1 − α in Aim 4 will lead to smaller subsets of {ξ₁,...,ξ_s}.
In practice one does not know Δ²(p) or Δ²(ξ_{j(1)},...,ξ_{j(p)}). Two situations
should be distinguished:
(1) the data under evaluation need not be used for estimating Δ²(p) or
Δ²(ξ_{j(1)},...,ξ_{j(p)}) because estimation can be done on the basis of 'professional
knowledge' or 'other data',
(2) the data under evaluation has to be used because 'professional knowledge'
and 'other data' are not sufficiently relevant.
Situation (2) is very unpleasant because the data-dependent subset has a very
peculiar probabilistic structure. The conditional distribution of the scores for the
selected subset, conditionally given that this subset ξ_{j(1)},...,ξ_{j(p)} has been selected,
will differ from the corresponding unconditional distribution. If the outcomes for
the selected subset are treated by means of the classical standard procedure, then
one will overestimate Δ²(ξ_{j(1)},...,ξ_{j(p)}), overestimate the discriminatory properties
of any discriminant function, underestimate the misclassification probabilities
and construct confidence intervals which cover the true value with probability
smaller than 1 − α. The technical details at the end of Section 3 contain some
indications of the bias to be expected if all subsets ξ_{j(1)},...,ξ_{j(p)} of p elements are
allowed. We have the opinion that one should try to avoid situation (2) by
introducing as much a priori professional knowledge as seems reasonable. In
practice one will not be able to avoid situation (2) completely because the
investigation would not be of much use if the a priori professional knowledge is
already sufficiently relevant for estimating Δ²(ξ_{j(1)},...,ξ_{j(p)}).
We recall the distinctions drawn at the beginning of Section 3 between
(1) the initial ordering is 'very compelling',
(2) the initial ordering is 'rather compelling',
(3) the initial ordering is 'rather useless'.
Notice that the initial ordering was based on a priori professional knowledge,
see Step 4 of the preliminary stage discussed in Section 1.
If the initial ordering is (at least) rather compelling, then one might decide to
deviate from this initial ordering only in very exceptional situations. This will
imply that we need not worry much about the above-mentioned kinds of bias. We
propose to use either procedure 1 or procedure 2.

PROCEDURE 1. Proceed along the lines of Steps 1,...,4 of Section 3 with
appropriate modifications of Step 3. In Step 5 one proceeds as if the selected
variables constitute a fixed predetermined subset.

REMARK. In Procedure 1 no attention is paid to the outcome x ∈ R^s of X₀ in the
Steps 1,...,4 whereas the individual under classification is in some sense predominant.
If one is interested in Aim 3 with 0-1 loss then one might proceed as
follows, exploiting the outcome x also for selection purposes.

PROCEDURE 2. Apply Step 1 of Section 3 unless the initial ordering is completely
compelling. Use the outcome of (ᵖX₀,...,ᵖS) for constructing a confidence
interval, with bounds ᵖξ̂_x ± u_{α/2}(f − p − 1)ᵖσ̂_x, for ξ_x with respect to the first p
variables. For each value of p an interval is obtained. If the sum of the minimized
upper bound and the maximized lower bound is negative (positive), then assign to
population 1 (population 2). We recommend α = 0.50 (u_{α/2} ≈ 2/3).

In practice the initial ordering will often be rather useless and one will decide
to deviate from the initial ordering. We worry so much about the before-
mentioned kinds of bias that we prefer the following approach.

PROCEDURE 3. Return to the original data which consists of the outcome of


independent r.v.'s
Xo, Xl, ..... x , . . , , , x2~ ..... X2, m2
and possibly some extra random matrix S which has the Wishart W( f, ~)
distribution. The data is split up into two parts: (1) the training samples
Xl.n,+l ..... Xl,ml, X2, n2+ 1. . . . ,X2. rn2
and possibly S, (2) the holdout samples
Xo, Xll,...,Xl,nl, X21 .... ,X2, n2.
Selecting variables in discriminant analysis for improving upon classicalprocedures 877

Asymptotic theory in [12] leads to the recommendation
$$n_h/m_h = \bigl(1+(p-1)^{1/2}\bigr)^{-1}$$
if one is interested in Aim 3 with 0-1 loss.
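For example (an illustrative computation, not from the original text): with $p = 10$ variables the rule gives $n_h/m_h = (1+\sqrt{9}\,)^{-1} = 1/4$, so with $m_h = 60$ observations from population $h$ one would place about 15 of them in the holdout part and the remaining 45 in the training part.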


The training samples are used to construct a vector of weights $W^* =
w^*(X_{1,n_1+1},\ldots,S)$ along the lines of Section 3.
Next the $s$-variate holdout samples are made univariate by applying any
transformation $x \mapsto a w^{T}x + b$ where $a$, $w$ and $b$ are completely determined by the
training samples and $w$ is the outcome of $W^*$.
Conditionally given the training samples, or rather the corresponding outcome
$(a, w, b)$, the real-valued random variables thus obtained,
$$Y_0,\; Y_{11},\ldots,Y_{1,n_1},\; Y_{21},\ldots,Y_{2,n_2},$$
are independent, while $Y_{hi}$ has the univariate $N(\eta_h, \sigma^2)$ distribution and $Y_0$ the
$N(\eta, \sigma^2)$ one. Of course $\eta_h = a w^{T}\mu_h + b$ and $\sigma^2 = a^2 w^{T}\Sigma w$. Thus theory for
univariate classification problems is applicable (see [11] for a survey and some
new results).
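As an illustration of this split-and-reduce idea, the following is a minimal sketch (ours, not the authors' program): for simplicity the weight vector is the plain plug-in discriminant weight computed from the training part, and all variable names and sample sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 6                                    # number of variables used
m1, m2 = 40, 40                          # total sample sizes
n1 = round(m1 / (1 + np.sqrt(p - 1)))    # holdout sizes, following n_h/m_h = (1+(p-1)^(1/2))^(-1)
n2 = round(m2 / (1 + np.sqrt(p - 1)))

mu1, mu2 = np.zeros(p), np.r_[1.5, np.zeros(p - 1)]
X1 = rng.multivariate_normal(mu1, np.eye(p), m1)
X2 = rng.multivariate_normal(mu2, np.eye(p), m2)
x0 = rng.multivariate_normal(mu1, np.eye(p))       # individual to be classified

# split: first n_h rows form the holdout part, the rest the training part
T1, H1 = X1[n1:], X1[:n1]
T2, H2 = X2[n2:], X2[:n2]

# weights w from the training part only (plug-in discriminant weights here)
S = (np.cov(T1, rowvar=False) * (len(T1) - 1) +
     np.cov(T2, rowvar=False) * (len(T2) - 1)) / (len(T1) + len(T2) - 2)
w = np.linalg.solve(S, T1.mean(0) - T2.mean(0))
a, b = 1.0, -0.5 * w @ (T1.mean(0) + T2.mean(0))    # a, b fixed by the training part

# reduce the holdout samples and x0 to univariate scores y = a*w'x + b
y0, y1, y2 = a * w @ x0 + b, a * H1 @ w + b, a * H2 @ w + b
print(y0, y1.mean(), y2.mean())    # univariate classification theory now applies
```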

REMARK. It would be extremely interesting to compare (1) the classical classification
procedure based on all $s$ variables with (2) Procedures 1, 2 and 3, for
Aim 3 with 0-1 loss. Note that $\Phi(-\tfrac{1}{2}\Delta(s))$ is a natural lower bound for the
maximum of the two misclassification probabilities. Large-scale simulation experiments
will be needed because the differences are expected to be only a few
percent, at least between Procedures 1, 2 and 3. Much will depend on the
particular choice of $(\mu_1,\mu_2,\Sigma)$, especially with respect to the relevance of the
initial ordering.

7. Concluding remarks and acknowledgment

The degrading performance of classical procedures has been discussed by many
scientists, and many variable-selection techniques have been proposed. The
references in [12] will not be repeated here.
The author was introduced to the subject of selecting variables by the
physical anthropologist Dr. G. N. van Vark. His anthropological objects (skeletal
remains, in particular skulls) allow many measurements, whereas his sample sizes
are very limited for many populations. Van Vark [17] developed various data-analytic
procedures for selecting variables in discriminant analysis. By studying
the mathematical core of his proposals for the case $k = 2$, the papers [10] and [12]
came into being. These papers were based on asymptotic theory where the sample
sizes tend to infinity and $p$ is fixed. The corresponding results are not completely
satisfactory because in practice the sample sizes are fixed and one is interested in
the dependence on $p$; [13] is devoted to the case $p \to \infty$, $\Sigma = I$ and fixed sample sizes.
At a meeting in Oberwolfach (1978), Prof. Eaton made an indispensable
contribution by showing the author that various exact moments can be
obtained. Unfortunately the basic issues often still require that some unpleasant
approximations be made. Ton Steerneman performed indispensable
simulation experiments and Ton Ambergen verified various results by means of
independent computations. Many others have contributed by making comments,
referring to the literature and sending papers.

Appendix A

The following theoretical and simulation results constituted the core of the
previous discussions. Throughout the appendix $Y$ and $V$ are assumed to be
independent,
$$Y \sim N_p(\eta e_1, I_p), \qquad V \sim W_p(f, I_p),$$
where $e_1 = (1,0,\ldots,0)^{T}$, $f > p$, and $R = V^{-1}Y$.

THEOREM A.1. If $f \to \infty$, $\eta \to \infty$ while $\eta f^{-1/2} \to \delta$ and $p$ is fixed, then
(1) $\mathcal{L}\,f^{1/2}(f^{-1}V - I_p)$ and $\mathcal{L}\,f^{1/2}(fV^{-1} - I_p)$ tend to the same $p^2$-variate normal
distribution $\mathcal{L}B$ of a random matrix $B$ where $\operatorname{var} B_{ii} = 2$, $\operatorname{var} B_{ij} = 1$, $\operatorname{cov}(B_{ij}, B_{ji}) = 1$
if $i \neq j$ (because $B_{ij} = B_{ji}$) and all other covariances are zero,
(2) $\mathcal{L}\,f(R - f^{-1/2}\delta e_1) \to N_p\bigl(0,\ (1+\delta^2)I_p + \delta^2 e_1 e_1^{T}\bigr)$,
(3) $\mathcal{L}\,f(1 - R_1^2/\|R\|^2) \to (1+\delta^{-2})\chi^2_{p-1}$,
(4) $\mathcal{L}\,f(1 - R_1/\|R\|) \to \tfrac{1}{2}(1+\delta^{-2})\chi^2_{p-1}$.

PROOF. The first result in (1) is an immediate consequence of the central limit
theorem for Wishart distributions. The second result follows from the first one
by using
$$f^{-1}V = I_p + f^{-1/2}\{f^{1/2}(f^{-1}V - I_p)\},$$
$$fV^{-1} = \bigl[I_p + f^{-1/2}\{f^{1/2}(f^{-1}V - I_p)\}\bigr]^{-1} = I_p - f^{-1/2}\{f^{1/2}(f^{-1}V - I_p)\} + \cdots,$$
where the last expansion suggests that $f^{1/2}(f^{-1}V - I_p) + f^{1/2}(fV^{-1} - I_p) \to 0$ in
probability. This last result can be proved rigorously and (1) follows. Next
(1) $\Rightarrow$ (2) $\Rightarrow$ (3) $\Rightarrow$ (4); notice that $R_1/\|R\| \to 1$ in probability.

REMARK. Theorem A.1 is a reformulation and extension of the bottleneck of
[10]. The approximations based on Theorem A.1 are expected to be crude in
many practical situations (in fact it could have been worse, see [15]). Eaton
(1979, personal communication) taught me the following result (see also [6,
Section 6.5]).

THEOREM A.2.
(1) $EV^{-1} = (f-p-1)^{-1}I_p$,
(2) $EV^{-2} = \{(f-p)(f-p-1)(f-p-3)\}^{-1}(f-1)I_p$,
(3) $ER = (f-p-1)^{-1}\eta e_1$,
(4) $\operatorname{cov} R = \{(f-p)(f-p-1)(f-p-3)\}^{-1}\bigl[(f-1+\eta^2)I_p + \eta^2\{1 + 2(f-p-1)^{-1}\}e_1 e_1^{T}\bigr]$.

REMARK. McLachlan (1979, personal communication) noticed that $EV^{-1}$ and
$EV^{-2}$ can be found in Lachenbruch (1968). In fact Theorem A.1 and Theorem
A.2 are essentially contained in [3]. However, Eaton's proof is presented because
arguments in that proof constituted the core of all computations of exact moments
presented in this paper.

PROOF. Notice that $\mathcal{L}V$ is invariant under orthogonal transformations: if $\Psi \in
O(p)$, then $\mathcal{L}(\Psi V\Psi^{T}) = \mathcal{L}(V)$ and hence $\mathcal{L}V^{-1} = \mathcal{L}(\Psi V^{-1}\Psi^{T})$ and $EV^{-1} =
\Psi(EV^{-1})\Psi^{T}$ for all $\Psi \in O(p)$. Hence $EV^{-1} = cI_p$. But $(V^{-1})_{11} = V^{11} = (V_{11} -
V_{1,2}V_{2,2}^{-1}V_{2,1})^{-1}$ has the same distribution as $1/\chi^2_{f-p+1}$. Hence $c = EV^{11} = (f-p-1)^{-1}$.
This establishes (1) and (3). The same invariance considerations lead
to $EV^{-2} = dI_p$ with
$$d = E(V^{-2})_{11} = E(V^{11})^2 + EV^{(1,2)}V^{(2,1)}, \qquad V^{(1,2)} = -V^{11}V_{1,2}V_{2,2}^{-1},$$
according to standard theory for the inverse of a partitioned
matrix. Moreover $V^{11}$, $V_{1,2}V_{2,2}^{-1/2}$ and $V_{2,2}$ are independent, having
$1/\chi^2_{f-p+1}$, $N_{p-1}(0, I_{p-1})$ and $W_{p-1}(f, I_{p-1})$ distributions. Hence

$$E(V^{11})^2 = (f-p-1)^{-1}(f-p-3)^{-1},$$

$$EV^{(1,2)}V^{(2,1)} = E(V^{11})^2\, E\operatorname{trace} V_{2,2}^{-1}V_{2,2}^{-1/2}V_{2,1}\bigl(V_{2,2}^{-1/2}V_{2,1}\bigr)^{T}
= E(V^{11})^2 \operatorname{trace} EV_{2,2}^{-1} = E(V^{11})^2(p-1)(f-p)^{-1},$$

$$EV^{(2,1)}V^{(1,2)} = E(V^{11})^2\, EV_{2,2}^{-1/2}\bigl(V_{2,2}^{-1/2}V_{2,1}\bigr)\bigl(V_{2,2}^{-1/2}V_{2,1}\bigr)^{T}V_{2,2}^{-1/2}
= E(V^{11})^2\, EV_{2,2}^{-1} = E(V^{11})^2(f-p)^{-1}I_{p-1},$$

and we can compute $d$ in order to establish (2), while (4) follows from $\operatorname{cov} R =
ERR^{T} - (ER)(ER)^{T}$ where
$$ERR^{T} = EV^{-1}YY^{T}V^{-1} = EV^{-2} + \eta^2 EV^{-1}e_1e_1^{T}V^{-1}$$
and
$$V^{-1}e_1e_1^{T}V^{-1} = \begin{pmatrix} (V^{11})^2 & V^{11}V^{(1,2)}\\ V^{11}V^{(2,1)} & V^{(2,1)}V^{(1,2)} \end{pmatrix},$$
the expectation of which follows from the above-mentioned results and
$$EV^{11}V^{(1,2)} = -E(V^{11})^2\, V_{1,2}V_{2,2}^{-1} = 0.$$
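The exact moments in Theorem A.2 are easy to check numerically; the following is a small Monte Carlo sketch (ours, not part of the paper) comparing simulated moments with parts (1) and (3), under the distributional assumptions stated at the start of the appendix.

```python
import numpy as np
from scipy.stats import wishart

rng = np.random.default_rng(1)
p, f, eta = 4, 20, 3.0
e1 = np.eye(p)[0]

Vinv_sum = np.zeros((p, p))
R_sum = np.zeros(p)
reps = 20000
for _ in range(reps):
    V = wishart.rvs(df=f, scale=np.eye(p), random_state=rng)   # V ~ W_p(f, I_p)
    Y = eta * e1 + rng.standard_normal(p)                      # Y ~ N_p(eta e_1, I_p)
    Vinv = np.linalg.inv(V)
    Vinv_sum += Vinv
    R_sum += Vinv @ Y                                          # R = V^{-1} Y

print((Vinv_sum / reps)[0, 0], 1 / (f - p - 1))     # Theorem A.2(1): close to (f-p-1)^{-1}
print((R_sum / reps)[0], eta / (f - p - 1))         # Theorem A.2(3): close to (f-p-1)^{-1} eta
```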
Approximating $ED$
The random variables $\Delta - D$ and $\Delta^2 - D^2$ played a crucial part and are
immediately related to $1 - R_1/\|R\|$ and $1 - R_1^2/\|R\|^2$ via $D = \Delta R_1/\|R\|$. The
$(1-\alpha)$th quantiles $c_\alpha$ and $d_\alpha$ of the corresponding distributions are of interest; $c_\alpha$
and $d_\alpha$ are defined by $P[1 - R_1/\|R\| > c_\alpha] = \alpha$ and $P[1 - R_1^2/\|R\|^2 > d_\alpha] = \alpha$.
Especially $c_{1/2} = \operatorname{median}(1 - R_1/\|R\|)$ and $d_{1/2} = \operatorname{med}(1 - R_1^2/\|R\|^2)$ are of
utmost importance, just like the expectations $a = E(1 - R_1/\|R\|)$ and $b = E(1 -
R_1^2/\|R\|^2)$. Notice that (3) and (4) in Theorem A.1 establish that (under the
conditions mentioned and without mentioning the dependence on $f$ and $\eta$ in some
expressions)
$$fc_\alpha \to \tfrac{1}{2}(1+\delta^{-2})\chi^2_{p-1;\alpha}, \qquad fd_\alpha \to (1+\delta^{-2})\chi^2_{p-1;\alpha},$$
where $\chi^2_{p-1;\alpha}$ denotes the $(1-\alpha)$th quantile of $\chi^2_{p-1}$. Moreover Theorem A.1
suggests that
$$fa \to \tfrac{1}{2}(1+\delta^{-2})(p-1), \qquad fb \to (1+\delta^{-2})(p-1).$$
Thus Theorem A.1 leads to the approximations
$$c^{(1)}_\alpha = \tfrac{1}{2}(f^{-1}+\eta^{-2})\chi^2_{p-1;\alpha}, \qquad a^{(1)} = \tfrac{1}{2}(f^{-1}+\eta^{-2})(p-1)$$
by plugging in $\delta = \eta f^{-1/2}$; moreover $d^{(1)}_\alpha = 2c^{(1)}_\alpha$ and $b^{(1)} = 2a^{(1)}$.
The exact results in Theorem A.2 may be used to define other approximations,
by using a function of expectations as an approximation for the expectation of
the function. Two ideas were exploited: first
$$a = E(1 - R_1/\|R\|) \approx 1 - ER_1/\bigl(E\|R\|^2\bigr)^{1/2} = a^{(2)}$$
with the consequence that
$$a^{(2)} = 1 - \frac{\eta(f-p)^{1/2}(f-p-3)^{1/2}}{(f-1)^{1/2}(f-p-1)^{1/2}(p+\eta^2)^{1/2}}$$
and $b^{(2)} = 2a^{(2)}$, while the distribution of $1 - R_1/\|R\|$ is approximated by that of
$a^{(2)}(p-1)^{-1}\chi^2_{p-1}$ and hence
$$c^{(2)}_\alpha = a^{(2)}(p-1)^{-1}\chi^2_{p-1;\alpha}, \qquad d^{(2)}_\alpha = 2c^{(2)}_\alpha.$$

The second idea leads to
$$b = E(1 - R_1^2/\|R\|^2) \approx E(R_2^2 + \cdots + R_p^2)/E\|R\|^2 = b^{(3)}$$
with the consequences that
$$b^{(3)} = (p-1)(p+\eta^2)^{-1}\{1+\eta^2/(f-1)\}, \qquad a^{(3)} = \tfrac{1}{2}b^{(3)},$$
$$d^{(3)}_\alpha = b^{(3)}(p-1)^{-1}\chi^2_{p-1;\alpha}, \qquad c^{(3)}_\alpha = \tfrac{1}{2}d^{(3)}_\alpha.$$

Simulation experiments
The following cases were considered in [15]:
(1) $f = 25$, $\eta(p) = \{24p/(p+1)\}^{1/2}$, $2 \le p \le 15$;
(2) $f = 40$, $\eta(p) = 10 + 2p^{1/2}$, $2 \le p \le 20$;
(3) $f = 11$, $\eta(p) = \{112p/(p+1)\}^{1/2}$, $2 \le p \le 25$.
True values were estimated from 1000 independent repetitions. The approximations
for $c_{0.10}$ and $d_{0.10}$ were not very satisfactory. The approximations for $d_{1/2}$
and $b$ based on $b \approx b^{(3)}$ looked very accurate and better than any other approximation.
The approximations based on Theorem A.1 looked a bit worse than those
based on $b \approx b^{(3)}$. The approximations based on $a \approx a^{(2)}$ looked worst of all and
positively biased. We were not successful in our attempt to give a satisfactory
theoretical explanation of the bad behaviour of the last-mentioned approximations.
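Experiments of this kind are easy to redo; the following minimal sketch (ours, not the program of [15]) estimates $a$ and $b$ by simulation for one design point resembling case (1) and compares them with the approximation $b^{(3)}$.

```python
import numpy as np
from scipy.stats import wishart

rng = np.random.default_rng(2)
p, f = 5, 25
eta = np.sqrt(24 * p / (p + 1))            # roughly case (1) above
e1 = np.eye(p)[0]

vals_a, vals_b = [], []
for _ in range(1000):                      # 1000 repetitions, as in [15]
    V = wishart.rvs(df=f, scale=np.eye(p), random_state=rng)
    Y = eta * e1 + rng.standard_normal(p)
    R = np.linalg.solve(V, Y)              # R = V^{-1} Y
    vals_a.append(1 - R[0] / np.linalg.norm(R))
    vals_b.append(1 - R[0] ** 2 / (R @ R))

b3 = (p - 1) / (p + eta ** 2) * (1 + eta ** 2 / (f - 1))   # approximation b ~ b^(3)
print(np.mean(vals_a), np.mean(vals_b), b3, np.median(vals_b))
```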

References

[1] Ambergen, T. and Schaafsma, W. (1981). Interval estimates for posterior probabilities. Report
TW-224, Department of Mathematics, Groningen.
[2] Anderson, T. W. (1972). Asymptotic evaluation of the probabilities of misclassification by linear
discriminant functions. In: T. Cacoullos, ed., Discriminant Analysis and Applications, 17-35.
Academic Press, New York.
[3] Das Gupta, S. (1968). Some aspects of discriminant function coefficients. Sankhyā Ser. A 30 (4),
387-400.
[4] Das Gupta, S. and Perlman, M. (1974). Power of the noncentral F-test: effect of additional
variates on Hotelling's T 2 test. J. Amer. Statist. Assoc. 69, 174-180.
[5] Hocking, R. R. (1976). The analysis and selection of variables in linear regression. Biometrics 32,
1-49.
[6] Kshirsagar, A. M. (1972). Multivariate Analysis. Dekker, New York.
[7] Rao, C. R. (1965). Linear Statistical Inference and its Applications. Wiley, New York.
[8] Schaafsma, W. (1969). Multiple decision problems of type I, Ann. Math. Statist. 40, 1684-1720.
[9] Schaafsma, W. (1972). Classifying when populations are estimated. In: T. Cacoullos, ed.,
Discriminant Analysis and Applications, 339-364. Academic Press, New York.
[10] Schaafsma, W. (1976). The asymptotic distribution of some statistics from discriminant analysis.
Report TW-176, Department of Mathematics, Groningen.
[11] Schaafsma, W. and van Vark, G. N. (1977). Classification and discrimination problems with
applications, Part I. Statist. Neerlandica 31, 25-45.
[12] Schaafsma, W. and van Vark, G. N. (1979). Classification and discrimination problems with
applications, Part II a. Statist. Neerlandica 33, 91-125.
[13] Schaafsma, W. and Steerneman, A. G. M. (1981). Discriminant analysis when the number of
features is unbounded. IEEE Trans. Systems Man Cybernet. 11, 144-151.
[14] Solomon, H. (1961). Studies in Item Analysis and Prediction. Stanford University Press, Stan-
ford.
[15] Steerneman, A. G. M. (1979). Simulating the performance of a special linear discriminant.
Report SE-57/7907, Institute of Econometrics, Groningen.
[16] Stein, Ch. (1966). Multivariate analysis (mimeographed notes recorded by M. L. Eaton). Dept.
Statistics, Stanford.
[17] van Vark, G. N. (1976). A critical evaluation of the application of multivariate statistical
methods to the study of human populations. HOMO 28, 94-114.
[18] Wald, A. (1944). On a statistical problem arising in the classification of an individual into one of
two groups. Ann. Math. Statist. 15, 145-162.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 883-892

Selection of Variables in
Discriminant Analysis*

P. R. Krishnaiah

1. Introduction

In a number of disciplines, data analysts are confronted with the problem of
classifying an observation into one of several distinct groups when the number
of variables is very large. So, it is of interest to find a smaller number of
important variables which are adequate for discrimination. These variables may
be a subset of the original variables or certain linear combinations of the original
variables. The selection of variables is important since there are situations where
inclusion of unimportant variables may actually decrease the ability to discriminate.
Apart from this, it is more feasible to analyze the data from cost and
computational considerations if the number of variables is small. In this chapter
we discuss various procedures for the selection of variables in discriminant
analysis.
In Section 2 we discuss procedures to find out whether variables associated
with certain discriminant coefficients are important for discrimination between
two populations. In Section 3 generalizations of the above procedures for several
populations are discussed. The problems of testing the hypotheses on discrimi-
nant coefficients using simultaneous test procedures will be discussed in another
paper. The procedures in Sections 2 and 3 are based upon using conditional
distributions. In Section 4 we discuss various procedures to determine the number
of important discriminant functions.

2. Tests on discriminant functions using conditional distributions for two populations

In this section we discuss procedures for testing hypotheses on the coefficients
of the discriminant function associated with the discrimination between

*This work is sponsored by the Air Force Office of Scientific Research under contract F49629-82-
K-001. Reproduction in whole or in part is permitted for any purpose of the United States
Government.

two multivariate normal populations. Let the mean vector and covariance matrix
of the $i$th multivariate normal population be given by $\mu_i$ and $\Sigma$. Now, consider the
discriminant function $a'x$ for the two-population case, where $a' = (a_1,\ldots,a_p) =
(\mu_1-\mu_2)'\Sigma^{-1}$ and $x' = (x_1,\ldots,x_p)$. If any of the coefficients $a_i$ are zero, then the
corresponding variables $x_i$ make no contribution to the discrimination
between the two populations. So, it is of interest to find out which of the
coefficients are zero. Suppose we know a priori that $x_1,\ldots,x_q$ are important and
we are not sure of $x_{q+1},\ldots,x_p$. Then we are interested in testing the hypothesis
that $a_{q+1} = \cdots = a_p = 0$.
Let $x_1' = (x_1,\ldots,x_q)$, $x_2' = (x_{q+1},\ldots,x_p)$, $\delta = \mu_1 - \mu_2$ with $\delta' = (\delta_1,\ldots,\delta_p)$, and
let $\mu_1, \mu_2, \Sigma$ be partitioned as
$$\mu_i = \begin{pmatrix}\mu_{i1}\\ \mu_{i2}\end{pmatrix}, \qquad
\delta = \begin{pmatrix}\delta_1\\ \delta_2\end{pmatrix}, \qquad
\Sigma = \begin{pmatrix}\Sigma_{11} & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{22}\end{pmatrix}$$
and $a' = (a_1', a_2')$. Also, $\mu_{i1}$, $a_1$, $\delta_1$ are of order $q\times 1$ and $\Sigma_{11}$ is of order $q\times q$. Let
$H$ denote the hypothesis that the mean vector of $x_2$, after eliminating the effect of
$x_1$, is the same for both populations. Also, let $\beta = \Sigma_{21}\Sigma_{11}^{-1}$. Then
$$H: \mu_{12} - \beta\mu_{11} = \mu_{22} - \beta\mu_{21}. \tag{2.1}$$
Now, let $\Delta^2 = (\mu_1-\mu_2)'\Sigma^{-1}(\mu_1-\mu_2)$, $\Delta_1^2 = (\mu_{11}-\mu_{21})'\Sigma_{11}^{-1}(\mu_{11}-\mu_{21})$ and
$\Delta_{2\cdot 1}^2 = (\delta_2 - \beta\delta_1)'\Sigma_{22\cdot 1}^{-1}(\delta_2-\beta\delta_1)$, where $\Sigma_{22\cdot 1} = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}$; note that
$\Delta^2 = \Delta_1^2 + \Delta_{2\cdot 1}^2$. The hypothesis $H$ is equivalent to $\delta_2 - \beta\delta_1 = 0$.
It is known (see Rao (1946)) that $H$ is equivalent to the hypothesis that
$a_{q+1} = \cdots = a_p = 0$, and it is also equivalent to the hypothesis that $\Delta^2 = \Delta_1^2$. This
can be interpreted as follows: $H$ is equivalent to the hypothesis that the distance between
the two populations based on all $p$ variables is equal to the distance between the
populations based on the first $q$ variables.
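A small numerical check of the decomposition $\Delta^2 = \Delta_1^2 + \Delta_{2\cdot1}^2$ behind this equivalence (our sketch, with arbitrary values; not from the chapter):

```python
import numpy as np

rng = np.random.default_rng(3)
p, q = 5, 2
A = rng.standard_normal((p, p))
Sigma = A @ A.T + p * np.eye(p)            # an arbitrary positive definite covariance
delta = rng.standard_normal(p)             # delta = mu_1 - mu_2

S11, S12 = Sigma[:q, :q], Sigma[:q, q:]
S21, S22 = Sigma[q:, :q], Sigma[q:, q:]
beta = S21 @ np.linalg.inv(S11)
S22_1 = S22 - S21 @ np.linalg.inv(S11) @ S12

d2_full = delta @ np.linalg.solve(Sigma, delta)            # Delta^2
d2_1 = delta[:q] @ np.linalg.solve(S11, delta[:q])         # Delta_1^2
resid = delta[q:] - beta @ delta[:q]                       # delta_2 - beta delta_1
d2_21 = resid @ np.linalg.solve(S22_1, resid)              # Delta^2_{2.1}
print(d2_full, d2_1 + d2_21)   # agree; H (resid = 0) is equivalent to Delta^2 = Delta_1^2
```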
Let $x_i' = (x_{i1},\ldots,x_{iq}, x_{i,q+1},\ldots,x_{ip}) = (z_{i1}', z_{i2}')$ $(i = 1,2)$ be distributed as multivariate
normal with mean vector $\mu_i'$ $(i = 1,2)$ and covariance matrix $\Sigma$, where $z_{i1}$ is
of order $q\times 1$. Then the conditional distribution of $z_{i2}$ given $z_{i1}$ is multivariate
normal with covariance matrix $\Sigma_{22\cdot 1} = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}$ and mean vector $\eta_i + \beta z_{i1}$,
where $\eta_i = \mu_{i2} - \beta\mu_{i1}$. We wish to test $\eta_1 = \eta_2$, that is, $\delta_2 - \beta\delta_1 = 0$. This can be
done by using any of the known procedures. Let $(z_{i1t}', z_{i2t}')$ $(t = 1,2,\ldots,n_i)$ be the
$t$th observation on $(z_{i1}', z_{i2}')$. Then the conditional model is given by
$$E_c(z_{i2t}\mid z_{i1t}) = \eta_i + \beta z_{i1t} \tag{2.2}$$
for $i = 1,2$ and $t = 1,2,\ldots,n_i$.
We can test the hypothesis $H$ by using the following statistic:
$$F = \frac{c(\hat\delta_2 - \hat\beta\hat\delta_1)'S_{e22\cdot 1}^{-1}(\hat\delta_2 - \hat\beta\hat\delta_1)(n-p-1)}
{(1 + c\hat\delta_1'S_{e11}^{-1}\hat\delta_1)(p-q)} \tag{2.3}$$

where $S_{e22\cdot 1} = S_{e22} - S_{e21}S_{e11}^{-1}S_{e12}$, $\hat\beta = S_{e21}S_{e11}^{-1}$, $\hat\delta_j = \bar z_{1j\cdot} - \bar z_{2j\cdot}$,
$\bar z_{ij\cdot} = n_i^{-1}\sum_t z_{ijt}$, $n = n_1 + n_2$, $c = n_1 n_2/n$ and
$$S_e = \begin{pmatrix}S_{e11} & S_{e12}\\ S_{e21} & S_{e22}\end{pmatrix}
= \sum_{i=1}^{2}\sum_{t=1}^{n_i}
\begin{pmatrix}z_{i1t}-\bar z_{i1\cdot}\\ z_{i2t}-\bar z_{i2\cdot}\end{pmatrix}
\begin{pmatrix}z_{i1t}-\bar z_{i1\cdot}\\ z_{i2t}-\bar z_{i2\cdot}\end{pmatrix}'. \tag{2.4}$$

The statistic $F$ is distributed as the central $F$ distribution with $(p-q,\ n-p-1)$
degrees of freedom when $H$ is true. The hypothesis $H$ is accepted or rejected
according as
$$F \lessgtr F_\alpha \tag{2.5}$$
where
$$P[F \le F_\alpha \mid H] = 1-\alpha. \tag{2.6}$$
The above procedure for testing the hypothesis $a_{q+1} = \cdots = a_p = 0$ was proposed
by Rao (1946, 1966). It is known (e.g., see Kshirsagar (1972)) that Rao's $U$ statistic
is related to the $F$ statistic given in (2.3).
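A rough sketch of how (2.3) might be computed from two samples (ours, not from the chapter; scipy's F quantile supplies $F_\alpha$, and the data are simulated for illustration):

```python
import numpy as np
from scipy.stats import f as f_dist

def rao_additional_info_F(X1, X2, q, alpha=0.05):
    """Test a_{q+1} = ... = a_p = 0 via the conditional F statistic (2.3)."""
    n1, p = X1.shape
    n2 = X2.shape[0]
    n, c = n1 + n2, n1 * n2 / (n1 + n2)
    d = X1.mean(0) - X2.mean(0)                        # estimate of delta
    Se = ((X1 - X1.mean(0)).T @ (X1 - X1.mean(0)) +
          (X2 - X2.mean(0)).T @ (X2 - X2.mean(0)))     # pooled within-group SP matrix
    S11, S12 = Se[:q, :q], Se[:q, q:]
    S21, S22 = Se[q:, :q], Se[q:, q:]
    beta_hat = S21 @ np.linalg.inv(S11)
    S22_1 = S22 - beta_hat @ S12
    resid = d[q:] - beta_hat @ d[:q]
    num = c * (resid @ np.linalg.solve(S22_1, resid)) * (n - p - 1)
    den = (1 + c * (d[:q] @ np.linalg.solve(S11, d[:q]))) * (p - q)
    return num / den, f_dist.ppf(1 - alpha, p - q, n - p - 1)

rng = np.random.default_rng(4)
X1 = rng.multivariate_normal([1, 0.5, 0, 0], np.eye(4), 30)
X2 = rng.multivariate_normal([0, 0.0, 0, 0], np.eye(4), 30)
print(rao_additional_info_F(X1, X2, q=2))   # reject H when F exceeds the quantile
```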
The simultaneous confidence intervals associated with the above procedure are
known to be
$$\bigl|b'(\hat\delta_2 - \hat\beta\hat\delta_1 - \delta_2 + \beta\delta_1)\bigr|
\le \bigl\{F_\alpha\, b'S_{e22\cdot 1}b\,(p-q)\bigl(1 + c\hat\delta_1'S_{e11}^{-1}\hat\delta_1\bigr)/c(n-p-1)\bigr\}^{1/2} \tag{2.7}$$
for all nonnull $b$.

3. Tests on discriminant functions for several populations using conditional distributions

In this section we consider the problem of testing the hypotheses that the
discriminant coefficients associated with certain variables in the discriminant
functions are zero. Let $x_1,\ldots,x_k$ be distributed independently as multivariate
normal with mean vectors $\mu_1,\ldots,\mu_k$ and covariance matrix $\Sigma$. Also, let $x_{ij}$
$(j = 1,2,\ldots,n_i)$ denote the $j$th independent observation on $x_i$. Then the between-group
sums of squares and cross products (SP) matrix is given by
$$S = \sum_{i=1}^{k} n_i(\bar x_{i\cdot} - \bar x_{\cdot\cdot})(\bar x_{i\cdot} - \bar x_{\cdot\cdot})'
= \begin{pmatrix}S_{11} & S_{12}\\ S_{21} & S_{22}\end{pmatrix}$$

where $n_i\bar x_{i\cdot} = \sum_j x_{ij}$, $n = n_1 + \cdots + n_k$, and $n\bar x_{\cdot\cdot} = \sum_i\sum_j x_{ij}$.

Now, let $\theta_1 \ge \cdots \ge \theta_p$ denote the eigenvalues of the noncentrality matrix $\Omega =
\Lambda\Sigma^{-1}$ and let $\nu_i' = (\nu_{i1},\ldots,\nu_{ip})$ $(i = 1,\ldots,p)$ denote the eigenvector corresponding
to $\theta_i$, where
$$\Lambda = \sum_{i=1}^{k} n_i(\mu_i - \bar\mu)(\mu_i - \bar\mu)' \tag{3.1}$$
and $\bar\mu = (n_1\mu_1 + \cdots + n_k\mu_k)/n$. Suppose the rank of $\Omega$ is $r$. Then $\theta_{r+1} = \cdots =
\theta_p = 0$ and we have $r$ meaningful discriminant functions. The within-group SP
matrix is given by
$$S_e = \sum_i\sum_j (x_{ij} - \bar x_{i\cdot})(x_{ij} - \bar x_{i\cdot})'
= \begin{pmatrix}S_{e11} & S_{e12}\\ S_{e21} & S_{e22}\end{pmatrix}$$
where $S_{e11}$ is of order $q\times q$. Now, let us partition $\mu_i$, $\Lambda$ and $\Sigma$ as follows:
$$\mu_i = \begin{pmatrix}\mu_{i1}\\ \mu_{i2}\end{pmatrix}, \qquad
\Lambda = \begin{pmatrix}\Lambda_{11} & \Lambda_{12}\\ \Lambda_{21} & \Lambda_{22}\end{pmatrix}, \qquad
\Sigma = \begin{pmatrix}\Sigma_{11} & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{22}\end{pmatrix}$$
where $\mu_{i1}$ is of order $q\times 1$, and $\Lambda_{11}$ and $\Sigma_{11}$ are of order $q\times q$. Let $H_j$ denote the
hypothesis $\nu_{j2} = 0$ ($\nu_{j2}$ being the subvector of $\nu_j$ corresponding to $x_2$) for $j = 1,2,\ldots,r$
and $H = \bigcap_{j=1}^{r} H_j$. It is known (e.g. see McKay (1977) and
Fujikoshi (1980)) that the hypothesis $H$ and the following statements are equivalent:
$$\mu_{12} - \Sigma_{21}\Sigma_{11}^{-1}\mu_{11} = \cdots = \mu_{k2} - \Sigma_{21}\Sigma_{11}^{-1}\mu_{k1}, \tag{3.2}$$
$$\operatorname{tr}\Lambda\Sigma^{-1} = \operatorname{tr}\Lambda_{11}\Sigma_{11}^{-1}. \tag{3.3}$$
The problem of testing $H$ is equivalent to the problem considered by Rao (1948)
of finding out whether the variables $x_2$ are good for discrimination after eliminating
the effect of $x_1$.

4. Tests for the number of important discriminant functions

It is well known that Fisher's linear discriminant functions are the best for
discrimination among all linear functions of the original variables. In this section
we review some procedures for the selection of important discriminant functions.

We know that
$$(n-k-p-1)E(SS_e^{-1}) = (k-1)\Sigma^*, \qquad \Sigma^* = I_p + (k-1)^{-1}\Omega. \tag{4.1}$$
Let $\lambda_1 \ge \cdots \ge \lambda_p$ be the eigenvalues of $\Sigma^*$. Then $\lambda_i = 1 + (\theta_i/(k-1))$ where $\theta_i$
was defined in Section 3. The $p$ discriminant functions are $\nu_1'x,\ldots,\nu_p'x$ where $x$ is a
vector of observations on the $p$ variables, $\nu_i'x$ is the $i$th most important discriminant
function and $\nu_i$ was defined in Section 3. The problem of testing for the rank of
the noncentrality matrix $\Omega$ is equivalent to the problem of testing for the number
of important discriminant functions. It is also equivalent to testing the following
hypothesis on certain structural relations among the components of the mean
vectors:
$$A\mu_i = \xi \quad (i = 1,2,\ldots,k) \tag{4.2}$$
where $A: s\times p$ and $\xi: s\times 1$ are unknown and the rank of $A$ is $s$. The above
hypothesis implies that the points $\mu_1,\ldots,\mu_k$ lie in an $r$-dimensional space where
$r = p - s$.
Now, let $H_i$ denote the hypothesis that $\lambda_i = 1$. Then the hypothesis (4.2) is
equivalent to the hypothesis that $\bigcap_{i=r+1}^{p} H_i$ is true. The problem of testing for the
rank of $\Omega$ was originally considered by Fisher (1938). Fisher's test of the
hypothesis that the rank of $\Omega$ is $r$ is based upon the statistic
$$T_1 = (l_{r+1} + \cdots + l_p) \tag{4.3}$$
where $l_{r+1},\ldots,l_p$ are the $p-r$ smallest eigenvalues of $SS_e^{-1}$.


We can test for the rank of ~2 within the framework of simultaneous test
procedures as follows. Accept or reject H i (i = r + 1. . . . . p ) according as

I i ~ Co: (4.4)

where

e[ir+l ~<Ca I H r+,] = (1 - a ) . (4.5)

If H t is accepted but H t _ 1 is rejected, then the rank of f2 is ( t - l ) . But, the


distribution of Ir+ 1 involves )t 1. . . . . )t r as nuisance parameters even when Hr+ 1 is
true, and so the exact values of % cannot be computed. If we know in advance
that H t ..... Hp ( t > r + l ) are true, then we test Hr+ 1..... //t-1 only. In m a n y
situations it is of interest to test H~ ..... Hp simultaneously. In these situations we
accept or reject H i according as

l i X %1 (4.6)

where
$$P[l_1 \le c_{\alpha 1} \mid H_1] = 1-\alpha. \tag{4.7}$$
Here we note that the hypotheses $H_1,\ldots,H_p$ are nested. For example, $H_i$ implies
$H_{i+1},\ldots,H_p$. When $H_1$ is true, the exact distribution of $l_1$ was given in Krishnaiah
and Chang (1971). For percentage points of the distribution of $l_1$ in the null case,
the reader is referred to Krishnaiah (1980) and Pillai (1960). A review of the
literature on the exact distributions of individual roots and certain functions of
the eigenvalues is given in Krishnaiah (1978).
Fang and Krishnaiah (1982) derived asymptotic nonnull distributions of certain
functions of the eigenvalues of some random matrices when the underlying
distribution is not normal.
The likelihood ratio statistic for testing the hypothesis $H_{r+1}$ is known to be
$$L_1 = \prod_{i=r+1}^{p}(1+l_i)^{-n/2}. \tag{4.8}$$
For large samples $-2\log L_1$ is distributed approximately as $n(l_{r+1} + \cdots + l_p) =
nT_1$. It is known (see Hsu (1941a)) that $nT_1$ is distributed asymptotically as
chi-square with $(p-r)(k-1-r)$ degrees of freedom when the $n_i$'s tend to infinity.
Bartlett (1947) showed that $T_2 = C_1\sum_{i=r+1}^{p}\log(1+l_i)$ is distributed as chi-square
with $(p-r)(k-1-r)$ degrees of freedom, where the correction factor $C_1$ is given
by $C_1 = (n - 1 - (p+k)/2)$. The chi-square approximation to the distribution of
the statistic $T_2$ is better than the corresponding approximation to $nT_1$. Lawley
(1959) suggested a modified correction factor, but it involves nuisance parameters.
For details on the likelihood ratio test for the rank of $\Omega$ when $\Sigma$ is known, the
reader is referred to Rao (1973). Anderson (1951b) derived the likelihood ratio
test statistic for testing the rank of the regression matrix under the multivariate
regression model when $\Sigma$ is unknown.
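Bartlett's statistic can be computed directly from the group samples; the following is a minimal sketch (ours, not the chapter's code), using the between- and within-group SP matrices $S$ and $S_e$ of Section 3 and simulated data whose group means are collinear, so that the true rank of $\Omega$ is 1.

```python
import numpy as np
from scipy.stats import chi2

def bartlett_rank_test(groups, r, alpha=0.05):
    """Test that only r discriminant functions are needed, via
    T2 = C1 * sum_{i>r} log(1+l_i), approx chi^2 with (p-r)(k-1-r) d.f."""
    k, p = len(groups), groups[0].shape[1]
    n = sum(len(g) for g in groups)
    grand = np.vstack(groups).mean(0)
    S = sum(len(g) * np.outer(g.mean(0) - grand, g.mean(0) - grand) for g in groups)
    Se = sum((g - g.mean(0)).T @ (g - g.mean(0)) for g in groups)
    l = np.sort(np.linalg.eigvals(S @ np.linalg.inv(Se)).real)[::-1]  # l_1 >= ... >= l_p
    C1 = n - 1 - (p + k) / 2
    T2 = C1 * np.sum(np.log1p(l[r:]))
    df = (p - r) * (k - 1 - r)
    return T2, chi2.ppf(1 - alpha, df), df

rng = np.random.default_rng(5)
means = [np.zeros(4), np.r_[2.0, 0, 0, 0], np.r_[4.0, 0, 0, 0]]   # collinear means: rank 1
groups = [rng.multivariate_normal(m, np.eye(4), 50) for m in means]
print(bartlett_rank_test(groups, r=0))   # typically rejects r = 0
print(bartlett_rank_test(groups, r=1))   # typically does not reject r = 1
```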
In general, we test the hypothesis $H_{r+1}$ by using $\psi(l_{r+1},\ldots,l_p)$, a suitable
function of $l_{r+1},\ldots,l_p$, as follows. Accept or reject $H_{r+1}$ according as
$$\psi(l_{r+1},\ldots,l_p) \lessgtr c_{\alpha 2} \tag{4.9}$$
where
$$P[\psi(l_{r+1},\ldots,l_p) \le c_{\alpha 2} \mid H_{r+1}] = 1-\alpha. \tag{4.10}$$
Some special cases of $\psi(l_{r+1},\ldots,l_p)$ are $l_{r+1}$, $(l_{r+1} + \cdots + l_p)$, etc.
We now review some known results on asymptotic distributions of the $l_i$'s and
certain functions of these eigenvalues.

Let $l_1 \ge \cdots \ge l_v$ $(r \le v \le p)$ be the nonzero eigenvalues of $SS_e^{-1}$. Also, let the
eigenvalues of $\Omega$ have multiplicities as below:
$$\begin{aligned}
\theta_1 &= \cdots = \theta_{p_1^*} = n\delta_1,\\
\theta_{p_1^*+1} &= \cdots = \theta_{p_2^*} = n\delta_2,\\
&\ \ \vdots\\
\theta_{p_{t-1}^*+1} &= \cdots = \theta_{p_t^*} = n\delta_t,\\
\theta_{p_t^*+1} &= \cdots = \theta_p = 0
\end{aligned} \tag{4.11}$$
where $p_j^* = p_1 + \cdots + p_j$ $(j = 1,2,\ldots,t+1)$, $r = p_t^*$, $v = p_{t+1}^*$, and $p_0^* = 0$. In
addition, let
$$u_{i_h} = \sqrt{n}\,(2\delta_h^2 + 4\delta_h)^{-1/2}(l_{i_h} - \delta_h), \qquad u_{r+j} = n\,l_{r+j} \tag{4.12}$$
where $h = 1,2,\ldots,t$, $i_h = p_{h-1}^*+1,\ldots,p_h^*$ and $j = 1,\ldots,v-r$. Also, let $n_i = n_0 q_i$ for
$i = 1,2,\ldots,k$. Then the limiting distribution of $u_1,\ldots,u_v$, as $n_0 \to \infty$, derived by
Hsu (1941b), is given by
$$f(u_1,\ldots,u_v) = \prod_{j=1}^{t+1}\eta_j\bigl(u_{p_{j-1}^*+1},\ldots,u_{p_j^*}\bigr) \tag{4.13}$$
where $\eta_j(\cdot)$ $(j = 1,2,\ldots,t)$ denotes the joint density of the eigenvalues of $A_j$, and
the elements of $A_j$: $p_j\times p_j$ are distributed independently as normal with mean
zero. The variances of the diagonal elements of $A_j$ are equal to one whereas the
variances of the off-diagonal elements are equal to $1/2$. Also, $\eta_{t+1}(\cdot)$ is the joint
density of the eigenvalues of $A_{t+1}$: $(v-r)\times(v-r)$, where $A_{t+1}$ is distributed as
a central Wishart matrix with $(k-1-r)$ degrees of freedom and $E(A_{t+1}) =
(k-r-1)I_{v-r}$. Here $A_1,\ldots,A_{t+1}$ are distributed independently of each other.
Expressions for the densities of the eigenvalues of $A_j$ $(j = 1,2,\ldots,t)$ and $A_{t+1}$
were given in Hsu (1941b). The asymptotic joint density of $u_1,\ldots,u_v$ given by
(4.13) was derived by Anderson (1951a) by a different method.
Now let $\theta_i = (n-k-2\eta)\beta_i$ $(i = 1,2,\ldots,r)$, $\theta_{r+1} = \cdots = \theta_p = 0$, where $\eta$ and the $\beta_i$'s
are constants. Then Fujikoshi (1976) derived approximations to the distributions
of $m_iT_i$ $(i = 1,2,3)$ up to terms of order $m_i^{-2}$, where
$$T_1 = \sum_{j=r+1}^{p}\log(1+l_j), \tag{4.14}$$
$$T_2 = \sum_{j=r+1}^{p} l_j, \tag{4.15}$$
$$T_3 = \sum_{j=r+1}^{p}\{l_j/(1+l_j)\}, \tag{4.16}$$
and $m_1$, $m_2$ and $m_3$ are suitable correction factors. The first terms in these
approximations involve a chi-square distribution. Similar approximations can be
derived for various other functions of $l_{r+1},\ldots,l_p$. Asymptotic distributions of a
wide class of functions of $l_1,\ldots,l_p$ in the nonnull cases were given in Fujikoshi
(1978), Krishnaiah and Lee (1979) and Fang and Krishnaiah (1982).
In some situations we know in advance that the last few eigenvalues of $\Omega$ are
equal to zero. For example, when $p > k-1$, $\theta_k = \theta_{k+1} = \cdots = \theta_p = 0$. In these
situations it is of interest to test whether some of the $\theta_i$'s $(i = 1,2,\ldots,k-1)$ are
zero. We can also test the hypotheses $H_j$ $(j = t, t+1,\ldots,k-1)$ as follows. We
accept or reject $H_j$ $(j = t, t+1,\ldots,k-1)$ according as
$$T_{jk} \lessgtr c_\alpha \tag{4.17}$$
where
$$P[T_{jk} \le c_\alpha;\; j = t,\ldots,k-1] = 1-\alpha, \tag{4.18}$$
and $T_{jk} = l_j/(l_k + \cdots + l_p)$. As pointed out earlier, the joint distribution of
$l_t, l_{t+1},\ldots,l_p$, when $H_t$ is true and the sample sizes tend to infinity, is the same as
the joint distribution of the eigenvalues of a central Wishart matrix. The exact
distribution of the ratio of the largest root to the sum of the roots of a central
Wishart matrix was considered in Schuurmann, Krishnaiah and Chattopadhyay
(1973) and Krishnaiah and Schuurmann (1974).
We now discuss the problem of testing the hypotheses $H_1,\ldots,H_p$ in an ad hoc
sequential way using conditional distributions. The hypothesis $H_1$ is accepted or
rejected according as
$$l_1 \lessgtr c_{\alpha 1} \tag{4.19}$$
where
$$P[l_1 \le c_{\alpha 1} \mid H_1] = 1-\alpha_1. \tag{4.20}$$
If $H_1$ is accepted, we conclude that $\Omega = 0$ and don't proceed further. If $H_1$ is
rejected, we accept or reject $H_2$ according as
$$l_2 \lessgtr c_{\alpha 2}, \tag{4.21}$$
$$P[l_2 \le c_{\alpha 2} \mid l_1 \ge c_{\alpha 1};\ H_2] = 1-\alpha_2. \tag{4.22}$$
If $H_2$ is accepted, we do not proceed further. Otherwise we accept or reject $H_3$
according as
$$l_3 \lessgtr c_{\alpha 3}$$
where
$$P[l_3 \le c_{\alpha 3} \mid l_1 \ge c_{\alpha 1},\ l_2 \ge c_{\alpha 2};\ H_3] = 1-\alpha_3. \tag{4.23}$$
In general, if we accept $H_i$, we don't proceed further. Otherwise, we accept or
reject $H_{i+1}$ according as
$$l_{i+1} \lessgtr c_{\alpha,i+1}$$
where
$$P[l_{i+1} \le c_{\alpha,i+1} \mid l_j \ge c_{\alpha j},\ j = 1,2,\ldots,i;\ H_{i+1}] = 1-\alpha_{i+1}. \tag{4.24}$$
Then the overall type I error in testing $H_1, H_2,\ldots,H_{i+1}$ sequentially is given by $\alpha^*_{i+1}$,
where
$$\prod_{t=0}^{i} P[l_{t+1} \le c_{\alpha,t+1} \mid l_j \ge c_{\alpha j},\ j = 1,2,\ldots,t;\ H_{t+1}] = 1-\alpha^*_{i+1}. \tag{4.25}$$
For a review of the literature on the asymptotic joint conditional distribution of
$l_{t+1},\ldots,l_p$ given $l_1,\ldots,l_t$, the reader is referred to Muirhead (1978). Under certain
conditions, the above distribution is the same as the joint distribution of the
eigenvalues of $S_1(S_1+S_2)^{-1}$, where $S_1$ and $S_2$ are distributed independently as
central Wishart matrices with $(k-1-t)$ and $(n-k-t)$ degrees of freedom
respectively and $E(S_1/(k-1-t)) = E(S_2/(n-k-t)) = I_{p-t}$.
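The sequential scheme (4.19)-(4.24) amounts to a simple stopping rule; a schematic sketch (ours, with the critical values $c_{\alpha,i}$ treated as given rather than computed from the conditional distributions) is:

```python
# Schematic version of the sequential procedure (4.19)-(4.24):
# stop at the first accepted H_i; the returned index is the estimated rank of Omega,
# i.e. the number of important discriminant functions.
def estimated_rank(l, c):
    """l: ordered eigenvalues l_1 >= ... of S S_e^{-1}; c: critical values c_{alpha,1}, ..."""
    for i, (li, ci) in enumerate(zip(l, c)):
        if li <= ci:          # H_{i+1} accepted: do not proceed further
            return i
    return len(l)

print(estimated_rank([8.2, 3.1, 0.4, 0.1], [1.9, 1.5, 1.2, 1.0]))   # -> 2
```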

References

Anderson, T. W. (1951a). The asymptotic distribution of certain characteristic roots and vectors. In: J.
Neyman, ed., Proceedings of the Second Berkeley Symposium in Mathematical Statistics and Probabil-
ity, 103-130. University of California Press, Berkeley, CA.
Anderson, T. W. (1951b). Estimating linear restrictions on regression coefficients for multivariate
normal distributions. Ann. Math. Statist. 22, 327-351.
Bartlett, M. S. (1947). Multivariate analysis. J. Roy. Statist. Soc. Suppl. 9, 176-190.
Chou, R. J. and Muirhead, R. J. (1979). On some distribution problems in MANOVA and discriminant
analysis. J. Multivariate Anal. 9, 410-419.
Fang, C. and Krishnaiah, P. R. (1981). Asymptotic distributions of functions of the eigenvalues of the
real and complex noncentral Wishart matrices. In: M. Csorgo, D. A. Dawson, J. N. K. Rao, and A.
K. Md. E. Saleh, eds., Statistics and Related Topics. North-Holland, Amsterdam.
Fang, C. and Krishnaiah, P. R. (1982). Asymptotic distributions of functions of the eigenvalues of
some random matrices for nonnormal populations. J. Multivariate Anal. 12, 39-63.
Fisher, R. A. (1938). The statistical utilization of multiple measurements. Ann. Eugenics. 8, 376-386.
Fujikoshi, Y. (1976). Asymptotic expressions for the distributions of some multivariate tests. In: P. R.
Krishnaiah, ed., Multivariate Analysis-IV, 55-71. North-Holland, Amsterdam.
Fujikoshi, Y. (1978). Asymptotic expansions for the distributions of some functions of the latent roots
of matrices in three situations. J. Multivariate Anal. 8, 63-72.
Fujikoshi, Y. (1980). Tests for additional information in canonical discrimination analysis and
canonical correlation analysis. Tech. Rept. No. 12, Statistical Research Group, Hiroshima Univer-
sity, Japan.

Hsu, P. L. (1941a). On the problem of rank and the limiting distribution of Fisher's test functions.
Ann. Eugenics 11, 39-41.
Hsu, P. L. (1941b). On the limiting distribution of roots of a determinantal equation. J. London Math.
Soc. 16, 183-194.
Krishnaiah, P. R. and Schuurmann, F. J. (1974). On the evaluation of some distributions that arise in
simultaneous tests for the equality of the latent roots of the covariance matrix. J. Multivariate Anal.
4, 265-283.
Krishnaiah, P. R. (1978). Some developments on real multivariate distributions. In: P. R. Krishnaiah,
ed., Developments in Statistics, Vol. 1, 135-169. Academic Press, New York.
Krishnaiah, P. R. and Lee, J. C. (1979). On the asymptotic joint distributions of certain functions of
the eigenvalues of four random matrices. J. Multivariate Anal. 9, 248-258.
Krishnaiah, P. R. (1980). Computations of some multivariate distributions, In: P. R. Krishnaiah, ed.,
Handbook of Statistics, Vol. 1, 745-971. North-Holland, Amsterdam.
Kshirsagar, A. M. (1972). Multivariate Analysis. Dekker, New York.
Lawley, D. N. (1959). Tests of significance in canonical analysis. Biometrika 46, 59-66.
McKay, R. J. (1976). Simultaneous procedures in discriminant analysis involving two groups.
Technometrics 18, 47-53.
McKay, R. J. (1977). Simultaneous procedures for variable selection in multiple discriminant analysis.
Biometrika 64, 283-290.
Muirhead, R. J. (1978). Latent roots and matrix variates: a review of some asymptotic results. Ann.
Statist. 6, 5-33.
Pillai, K. C. S. (1960). Statistical Tables for Tests of Multivariate Hypotheses. Statistical Center,
University of Philippines, Manila.
Rao, C. R. (1946). Tests with discriminant functions in multivariate analysis. Sankhyā 7, 407-414.
Rao, C. R. (1948). Tests of significance in multivariate analysis. Biometrika 35, 58-79.
Rao, C. R. (1966). Covariance adjustment and related problems in multivariate analysis. In: P. R.
Krishnaiah, ed., Multivariate Analysis, 87-103. Academic Press, New York.
Rao, C. R. (1973). Linear Statistical Inference and its Applications. Wiley, New York.
Schuurmann, F. J., Krishnaiah, P. R. and Chattopadhyay, A. K. (1973). On the distribution of the
ratios of the extreme roots to the trace of the Wishart matrix. J. Multivariate Anal. 3, 445-453.
P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2
©North-Holland Publishing Company (1982) 893-894

Corrections to
Handbook of Statistics, Volume 1:
Analysis of Variance

The following corrections should be made in Handbook of Statistics, Vol. 1 :


Page 59: A 1 in (5.5) should read as

Al= (' !)
--1
0 •

Page 60: The vector a' after expression (5.6) should read as a ' = (1,0, - 1) and
not (1,0, 1).
Page 71: The matrix
$$A = D\begin{pmatrix}1 & 0\\ 0 & 1\\ -1 & 1\end{pmatrix}$$
in (7.9) should read as
$$A = D\begin{pmatrix}1 & 0\\ 0 & 1\\ -1 & -1\end{pmatrix}.$$

Page 167, lines 2-3 from bottom: quantiles and mechans should read as
quartiles and medians.
Page 518: Delete the last line.
Page 528, line below (4.14): Section 4.2 should read as Section 4.3.
Page 531, line 17: accepted should read as rejected.
Page 549: The table for $\alpha = 0.01$ is reproduced from a technical report by Lee,
Chang and Krishnaiah (1976).
Pages 556-557: M should read as N
Pages 558-559: The entries in Table 14 give the percentage points of the
likelihood ratio test statistic for the homogeneity of complex multivariate normal
populations instead of real multivariate normal populations. The correct entries
are given in Table II of the paper by Chang, Krishnaiah and Lee (1977). The

authors of this chapter have inadvertently reproduced Table IV of the above


paper.
Page 568, line 6 from bottom: Academic Press, New York should read as
North-Holland Publishing Company.
Page 633, line 8: Xi. should read as Xi
Page 671: Add the following references:
Mudholkar, G. S., Davidson, M. and Subbaiah, P. (1974a).
Extended linear hypotheses and simultaneous test procedures in multivariate analysis of variance.
Biometrika 61, 467-477.
Mudholkar, G. S., Davidson, M. and Subbaiah, P. (1974b).
A note on the union-intersection character of some MANOVA procedures. J. Multivariate Anal. 4,
486-493.

Page 784, line 11: (1975) should read as (1975b).


Page 786, line 11: (1975b) should read as (1975c).
Page 788, line 6: (1975) should read as (1975a).
Page 974, line 10 from bottom:

s)~. = ~ Yij should read as s)~. = Y, Yij.


i j

Page 975, line 13: Add

= E E
i j

Page 981, Eq. (3.5): p should read as P.


Page 983, Eq. (3.19): $\ge$ should read as $\le$.
Page 986: 01... should read as 01 . . . .
Page 987, Eq. (3.37): p should read as P.
Subject Index

Abnormal cells, 595 Azimuth angle, 576


Acoustic channel, 550, 557
Acoustic processor, 550, 551
Acoustic subsources, 557 Backward elimination procedures, 805-814
Adaptive partitions, 466 Balancing decision functions, 846
ADC-500 machine, 606 Bayes, 683
Additive difference model, 301 classifier, 353
Additive tree, 300 error, 194, 353
Admissible minimax rules, 47, 50, 57 estimate, 562
Affine-linear discriminant function, 860, 863 risk, 794, 796, 799
Affine measure, 782 rule, 8, 49, 50, 54, 56, 62, 83, 84, 196,
Agglomerative algorithms, 274 430, 797, 798
Algorithms Bayesian
EM, 202, 205 allocation (classification, diagnosis), 101,
k-means, 201 105, 111, 119
ALSCAL, 306, 307, 308, 309 classification, 122, 125, 612, 794
Alternating least squares, 328 estimators, 124
Amplitude, 577 inferences, 376
Analysis of covariance, 754 linear discriminants, 110, 112-114, 116,
Analysis of scenes, 379 117
Array classifier, 589, 592 quadratic discriminants, 118
Artificial intelligence, 560 separation (discriminants), 101, 109, 111,
problem-solving techniques, 370 112, 119
search process, 381 Best invariant similar test, 52
Association coefficients, 252 Best of class rules, 162
Asymptotic expansions, 61, 63, 64, 66, 68, 69, Bias-correction techniques, 183
73, 75, 77, 79, 81, 82, 84, 85, 87, 88, 870 Biological activity, 686, 687
Asymptotic mean squared error (AMSE), 69, Blind assay technique, 688
70 Bloodcell morphology, 604
ASYMSCAL procedure, 341 Blood pressure wave, 537
Attributed context-free grammars, 380 Blood smear, 596
Autocorrelation function, 391, 401, 402, 404 Boundary recognition, 501, 517
Automatic grouping of objects, 479, 488
Autoregressive model, 384, 400, 404, 412
Autoregressive moving average (ARMA), 25,
32 CANDECOMP model, 328, 333, 338
Average linkage, 251 CANDEL1NC model, 338, 341


Canonical analysis, 209, 616 Complete model, 758


Canonical correlation, 723, 738 Complex multivariate normal, 26
Canonical variates, 723, 729, 731, 738 Compressed-pulse type radar, 578
Cardinality, 268 Computational models, 663
Carotid pulse wave, 537, 542-546 posterior probability, 663
Cartographic feature extraction, 361, 374 spectral matching, 667
Cartographic feature recognition, 369 Concentration ellipsoid, 732
Cauchy-Schwarz inequality, 861 Conditional likelihood approach, 190
Cell recognition, 596 Consistent stochastic grammar, 420
CELLSCAN, 598 Continuous speech recognition, 549
CELLSCAN-GLOPR, 599 Control characteristics, 631
Chain length, 268 Control structures, 374
Character acquisition, 631 Co-occurrence, 404, 410, 411
Character classification, 634 Correspondence analysis, 726
Character isolation, 632 Covariances, unequal, 5, 9-11, 18-22, 24, 26,
Characteristic language, 419 29-31
Chemical systems, 674 Covariate discriminant analysis, 61, 78
Chernoff faces, 219, 223, 259 Curve-fitting, 538, 540
Chromaticity coordinates, 601 CYDAC, 599
Chromatography, 680
Chromosome patterns, 428
Classification, 105, 611, 676, 682 Data analysis
accuracy, 836 display-oriented, 654
errors, 645 human-assisted, 654
frequency domain, 11-25, 33-41 interactive, 652
joint, 108, 109 Data filters, 351
performance, 836 Decision rules, 480, 481-483, 485-487
rules, 3, 5 Decision support systems, 375
optimal Bayes, 838 Deleted estimation method, 566
suboptimal, 840 Deleted interpolation, 567
sequential, 108, 109 Dendrogram, 227-231, 270
time domain, 5-11 Density estimation, 161
Class-independent covariance structure, 802 Density measure, 457
Cluster, 453, 455, 457, 464 Dependence measures, 783
Cluster analysis, 210, 245, 676 Depolarizing properties, 577
Cluster analysis methods Determinantal density, 108
Bayesian approach, 200 Diagnosis, 595
classification maximum likelihood DIFF-3, 606
approach, 199-202, 204-206 Difference frequency, 576
connectedness, 267 Differential, 595, 604, 606
mixture maximum likelihood approach, Differential system, 606
199-202, 204-206 Digitised image, 600
nearest neighbor, 267 Dirac-delta function, 174
single-link, 267 Dirichlet distribution, 799
Coefficients of linear filter, 404 Discrete parameter time series, 12
Coherence, 577, 578 Discrete waveforms, 653
Colour parameters, 600 Discriminant analysis, 209
Commercial systems, 596, 602, 606 Discriminant coefficients, 883, 885
Communication theory, 550 Discriminant functions, 203, 722, 730, 883,
Complete linkage, 251 884, 885, 886

linear, 7, 10, 15-17, 21, 27 Error-correcting parsing, 430


multiple groups, 10, 17, 23 Error estimation, 850
quadratic, 9-11, 17-20, 22, 24, 26, 31, 38 Expert systems, 374
Discriminant updating, 170, 185, 187
Discrimination based on rank data, 164
Discrimination category, 793,794 Factor analysis, 209, 693,725,751
Discriminatory value, 861 Factor analytic models, 752, 755, 761
Disease diagnosis, 538 Feature
Display-oriented analysis, 654 description, 376
Dissimilarity, 268, 286, 288, 298, 317, 454, 460 description language, 528, 529
Distance geometry, 291 evaluation, 471
Distance measures, 252, 661, 667, 781, 782, evaluation rules, 773-791
784, 785 selection, 636, 776-778
angular, 667 space, 582
Divergence (distance) measures verification, 374
Bhattacharyya, 784, 785 Filter
Kolmogorov, 781-785 linear, 37, 39, 41
Kullback-Leibler, 781-785 quadratic detection (QDF), 37, 41
Matusita, 781-785 quadratic detection (QDF), 37, 41
of order ~, 782, 783 Finite intersection test (FIT), 805, 817, 818,
Doppler speed, 576 821, 823, 824, 826, 828
Dynamic cluster algorithm, 466 Fisher's linear discriminant function, 62, 97,
Dynamic prediction, 490 170, 590, 860, 886
Dynamic programming, 559 inadmissibility of estimators for, 864
Fletcher-Powell algorithm, 765
Fluorescence (FL) spectroscopy, 651
Earley's algorithm, 531 Flying devices, 629, 630
Earthquakes, explosions, 1, 2, 5, 6, 33-42 Folding frequency, 33
Eckart-Young theorem, 726 Forced classification, 144, 145, 149, 152
Edge-model, 377 Forward-backward algorithm, 563
Efficiency Forward selection procedures, 805, 807, 811,
asymptotic efficiency of mixture 814
approach, 204, 205 Fourier transform, 12
asymptotic relative, 204 discrete (DFT), 5, 11, 13, 14, 15, 19, 25,27
simulated efficiency of mixture approach, Frequency diversity, 577, 578
204, 205 Frequency domain methods, 11-33
Electrocardiogram (ECG), 501-507, 512, 514, Frequency-jump radar, 578
527 F test, 805, 814
Electrocardiography, 523 Fuzzy identification, 453
Electrocardiology, 502 Fuzzy relaxation, 379
Electro-encephalogram signal, 501 Fuzzy sets, 463, 468, 469, 476
Electroencephalographic (EEG), 3, 22, 42
Elevation angle, 576
Empirical prediction problems, 479 Galton-Watson branching processes, 420
Entropy measures Gaussian classifier, 601
f-entropy, 780 Gauss-Newton algorithm, 765, 766
of order a, 779, 780 Generalized distance, 722, 730, 731, 740
Quadratic, 779, 780 Generalized least square estimators, 762
Renyi's, 779, 780 Generalized multivariate analysis of variance,
Shannon's, 779, 780 121

Generalized ridge estimators, 734 Invariance, 111


Generating function, 421, 438, 439 Invariant rules, 51
Geometrical parameters, 600 Isolated word recognition, 549
Geophysics. See Seismology Isotone, 800
Global models, 386 Isotone orderings, 800
Goodness of fit tests, 768 ISPAHAN, 603
Gradient analysis, 406 Iterative processes, 200-202, 206
Grammar, 362-366, 369-371
attribute grammar, 543, 547
BNF grammar, 527, 529, 543 Jackknife-type approach, 321
error-correcting grammar, 547
stochastic grammar, 547
Growth curve classification, 121, 122 Karhunen-Loeve
expansion, 1, 347
system, 703, 707, 709
HEARSAY, 378, 535-537 transformation, 616
HEMATRAK, 605, 606 K-means method of classification, 255, 256
Hemodynamic signal, 501 K-nearest neighbor procedures, 457, 682, 849
Hermite polynomials, 63 Knowledge-based systems, 357
Hessian matrix, 765 Kolmogorov distance, 782
Hierarchical Krishnaiah's tests. See Finite intersection test
classification, 250-254 Kronecker delta function, 425
classifiers, 603 Kullback-Leibler information measure, 92
clustering, 226, 227
clustering scheme, 269
relaxation, 380 Lagrange multiplier, 356
tree, 300 LANDSAT data, 365
Hierarchies, 458, 459, 464, 465 LANDSAT satellite, 609
Land-water interface, 377
Idempotence, 462, 474 Language model, 552, 555
Identification, 452 LARC, 605, 606
IDIOSCAL, 297, 305, 306 Latent variables, 747-750, 767
IDIOSCAL model, 339, 340 Learning data, 580, 583
Image 'Least Commitment' principle, 379
analysis, 361, 374 Leitz Texture Analysis (LTA), 405
models, 383 Leukocytes, 601, 606, 607_
processing principle, 596, 603, 607 Likelihood ratio test, 206, 580, 584, 590
segmentation, 361 LINCINDS model, 339
Imbedding problem, 292, 296 Linear classifier, 587, 588, 591, 592
Immature cells, 595, 602, 605, 606 Linear discriminant analysis (LDA), 682
Individual differences scaling, 297, 298, 305, Linear discriminant function (LDF), 141
306, 309, 328, 330-340, 693 Linear Learning Machine (LLM), 682
Information, 451 Linguistic decoder, 550, 551
matrix, 176 Linkage diagrams, 232
measures, 778, 779, 783, 787 Links, 374
theory, 560 Lisp, 375
Infrared (IR) spectroscopy, 651 LLRP algorithm, 486
Interactive graphics, 653 Local models, 390
Interpretation, 452, 458, 459 Logical decision rules, 480, 483, 484
Intrinsic dimensionality, 347, 348, 351, 353 Logistic compound methods, 170, 185

Logistic discrimination, 169-171, 175, 176, global, 386


178, 180, 181, 184, 185, 187 local, 390
Logit function, 190 Markov, 392
Logit regression, 189 mixed, 385
Log-linear model, 171 mosaic, 394
Log-ratio statistics, 661 moving average, 385
multivariate normal, 388
noncausal, 388, 392
Mahalanobis distance, 7, 63, 86, 201,203, nonstationary, 385
841, 861 partial differential equations (PDE), 388
Manifest variables, (MV), 747-749 pixel based, 383
Map guided analysis, 377 random field, 386
Map image correspondence, 377 region based, 383, 393
Mapping category, 793 semicausal, 388
Markov meshes, 392 stationary, 385
Markov process, 387 statistical, 383
Mass spectral data, 677-679 structural, 383
Mathematical morphology, 405 syntactic, 394
Maximal invariants, 55, 57, 59 texture, 383
Maximum likelihood two-dimensional, 391
error-correcting algorithm, 431 Modular vision system, 378
estimation, 564, 765, 766 Moment generating function, 205
rule, 430 Monotone regression, 290
MDSCAL, 310 Morph, 538
Measurement complexity, 835 Morse code signals, 322
optimal, 835 Moving average model, 385
Measurement selection algorithm, 794 Multidimensional scaling (MDS), 209, 285,
Measure of information, 470 287, 288, 291,303, 305, 311, 317 319, 327,
Metric, 287, 288, 296, 298, 303, 304, 576 328, 333, 334, 336, 341
Metric INDSCAL analysis, 333 Multimodality, 206
Minimal complete class, 47 Multinomial
Minimax rules, 49, 50, 59, 62, 83, 84 density, 103
Minimum distance rule, 62, 90 method, 160
Minimum entropy principle, 702 probabilities, 173
Minimum spanning tree, 277 Multiple correlation, 815
MINISSA, 310 Multiple maxima, 203
Minkowski geometry, 301 Multipopulation classification, 142, 150, 152
Misclassification probabilities (errors), MULTISCALE procedure, 338
103 Multispectral scanner (MSS), 609
Mixed model, 385 Multivariate
Mixing proportions, 170, 200 F distribution, 821-823, 828
Mixture of normal, 61, 83, 806, 818, 819
distributions (general), 200 time series, 22-25
normal distributions, 200, 204, 205
Mixture sampling, 171, 172, 175
ML rule, 74 Nearest neighbor, 536, 583, 588, 589
Models classification rule, 358
autoregressive, 384 procedures, 849
causal, 388 rule, 193, 196
Gaussian, 388 Nearest neighbor methods, 169

Newton-Raphson algorithm, 765 Orthogonal transformations, 402


Newton-Raphson procedure, 176, 189 Outliers, 143, 153, 684
Neyman-Pearson approach, 872
Neyman-Pearson rule, 579
NINDSCAL model, 338 PARAFAC, 297
Node, 374 PARAFAC-2 model, 340
Non-directional parsing, 370 Parallel pattern transforms, 598
Nonlinear iterative partial least squares Parameter identification, 760
(NIPALS), 328, 334 Parse state, 533
Nonparametric discriminant functions, 159 Parse tree, 532, 533
Nonparametric estimate, 581 Parsing, 366, 371
Nonstationary models, 385 algorithm, 527, 530, 532
Nuclear magnetic resonance (NMR) spectra, bottom-up, 532, 533
679 problem, 530
top-down, 532, 533
Partial classifciation, 144, 146, 150
Object classification, 616 Partial identification, 453
Object-predicate symmetry, 706 Partitioned semantic networks, 372
Object-specific operators, 377 Partitioning methods of classification,
Oblique dimension, 341 254-258
Observation space, 580 Path models, latent variable, 693
Oil samples Pattern analysis, 793
histogram of, 662 Pattern recognition, 479, 575, 583, 609, 611,
multiple, 664, 665 673, 676, 691-693
unweathered, 662 archaeology, 680
weathered, 662 chemical applications, 675, 685-689, 694
Oil spectrum, 652 chromatography, 680
projection of, 667 clinical analytical chemistry, 689
Oil spill electrochemistry, 680
fingerprinting of, 651 elemental composition, 680
identification of, 651 forensic science, 688
matching of, 651 geochemistry, 680
source of, 651 polarography, 680
weathering of, 652 source identification of oil spills, 687
Optical character recognition (OCR), 621-623, structure of unsaturated carbonyl
629 structure of unsattrrated carbonyl
Optical density histogram, 599, 600 compounds, 685
Optimal PDL-picture description language, 364
allocation, 170 Peaking phenomenon, 839, 842
Bayes error, 844 Pearson type distribution, 96
coordinate system, 702 Perfect discrimination, conditions for, 846,
errors, 112-114, 116 848
measurement complexity, 857, 874 Per-field classification, 616
predictive linear discriminant function, Peripheral blood, 595
116 Perplexity, 569
Optimum Bayes rule, 484 Phonetic subsources, 557
Optimum classification rule, 62, 83 Photometric characteristics, 631
Ordering theorem, 800 Picture processing, 3, 25
Orthogonally invariant, 55 Pixel based models, 383, 394
Orthogonally polarized, 582, 584 Plane waves, 3

Platelet sufficiency, 604 Quadratic discriminant function, 141, 142


Plokker's method, 509 Quadratic form. See Discriminant functions
Plug-in rules, 48, 58 Quadratic logistic discrimination, 170, 182,
Polarization, 577, 591 189
diversity ratio, 578, 586 Quasi-Newton procedure, 176, 189
Polarization ratio, 586
Polygon-plots, 220, 221
Posterior distribution, 126, 130, 133-135 Radar, 1
Posterior expectation, 127, 130, 133 Radar cross-section, 576, 577
Posterior probabilities, 200-202 Radar signature, 580-582, 584, 593
Power Radio frequency, 576
between-group, 25, 27 Raleigh language, 552, 570
within-group, 25 Random field models, 386
Predictive density, 103, 104, 106, 107, 111, Range, 576
112, 115, 125, 127, 128, 130, 131, 133, 135, Rank procedure
136 forced classification, 149, 152
Predictive filter, 583 partial classification, 146, 150
Predictive probability, 103 Rao's simple structure, 132
Preprocessing, 676, 678 Rao's U statistic, 885
Primitives, 407 Reduction of dimensionality, 743
Principal components, 209, 299, 616, 683, 724, Region based models, 383, 394
753, 867 Regression analysis, 794, 805
Principally polarized, 582, 584 Reject rate, 643
Prior distribution, 56 Relative extrema density, 409
Prior probability density, 123, 125, 131, 134, Remote sensing, 609
136 Representation, 452, 455, 458
Probabilistic similarity coefficients, 252 Representation category, 793
Probability density estimation, 273 Ridge-regression approach, 864, 867
Probability of correct classification, 61, 95 R-mode, 725
Probability of error, 194 Road verification, 374
Probability of misclassification, 55, 57, 61, 63, Robust discriminant functions, 153, 157
66-68, 72, 73, 75, 77, 82, 83 Robust estimates, 154
Probit function, 190 Robustness of mixture approach, 206
Problem-average Rotated configuration, 322
error rate, 797, 798
recognition rate, 799
Problem-reduction representation, 370 Sample classification, 616, 617
Projective operators, 475 Sample reuse procedures, 118
PR operators, 453 SCAD, 601
Prototype correlation, 636 Scene partitioning, 617-619
Proximities, 317, 323 Schmidt's theory, 716
Pulse sequences, 578 Segmentation, 529, 533, 600
Segmentation errors, 645
Seismology, 1-5
Q-mode, 725 discrimination, 33-43
Q-R-duality, 726 Selection
QRS, 505-512, 514 algorithm, 794
Quadratic classification statistic, 84 category, 793
Quadratic classifier, 582, 583, 592 model, 758
Quadratic detector. See Filter Semantic nets, 362, 372

Semantic networks, 374 Stein estimators, 864


Semantic of a representation, 452 Step-down procedure, 828
Semi-Bayesian approach, 111, 113, 116 Stepwise procedures, 805, 814
Separate sampling, 171, 173 Stepwise regression, 809
Separation of a heterogeneous population, 199 Stepwise selection procedure, 178, 180
Sequential Stochastic grammar, 418, 422, 433, 436, 437,
classification, 581, 582, 637 439
pattern recognition, 773 Stochastic languages, 417, 419, 433,436, 437
search algorithms, 795, 801 STRAIN, 307
Sexual dimorphism, 727 STRESS, 309, 320
Shape, 540 Strong patterns, 467
Shrinking operations, 598 Structural
Shrunken estimators, 733 equation model, 757
Signal model, 527
detection, 4-10, 16, 17, 40 pattern recognition, 361, 379
deterministic, 15-17 primitive, 393
stochastic, 17-22 Studentized W statistic, 61, 62, 66, 67, 71, 75
Signal-to-noise ratio (SNR), 8, 10, 17, 501, Subset selection approach, 867
517, 578 Supervised analysis, 612
Signature information, 576 Supervised classification, 645
Sign-invariant rules, 51, 52 Supervised learning, 515, 676
SIMCA, 678, 683, 687 Symbolic matrix, 218
Similarities, 317 Symbolic plots, 217
Similarity, 453, 454, 457, 460 Syntactic, 45~
Similarity feature, 479 decoder, 423, 426
Simultaneous test procedures, 805 pattern recognition, 362, 427
Single linkage, 250 Syntax analysis, 367
Space-time process, 3, 25
Spatial information, 609
Speaker recognition, 3 Taxometric maps, 234, 235
Spectra, 12 TEMP algorithm, 485, 490
approximations, 5, 11-25 Template matching, 377, 636
estimation, 20, 31-32, 35, 36, 38, 39, 43 Templates, 518
unequal, 17-22, 24, 25, 29, 30, 36, 38 Texture, 399, 404, 408, 410, 412, 603, 606
Spectral measures, 408, 410
information, 609 modelling, 441
matching model, 667 Theory of minimal covers of sets, 376
matrix, 23-25 Time domain methods, 5-11
ratio, 36, 38, 41 TLC (Thin-layer chromatography), 651
Speech Training of classifiers, 612
recognition, 528, 535, 548 Training set, 676
understanding, 549 Translation-invariant rules, 51-53, 55
system, 378 Tree languages, 417
Spirogram signal, 501 Type I error, 808, 809, 813, 816, 818
SSTRESS, 307-309, 338
Starting values, choice of, 201-203, 205, 206
Stationary models, 385 Ultrametric, 271, 272
Stationary process, 6, 12, 25 Unfolding model problem, 290, 295, 305
Statistical consistency of classification models, Unifilar Markov Source, 553, 569
798 Uniform distribution, 55

Unsupervised analysis, 612, 614 Weak patterns, 467


Unsupervised classification, 645 Weathering
Unsupervised learning, 676 direction of, 667
Updating problem, 206, 207 magnitude of, 667
surface, 667
White blood cell differential count (WBCD),
Variogram, 389 595, 603, 604, 606
Vibrational spectra, 679 White blood cells, 595, 596
Wideband radar, 578, 584
Wishart distribution, 127
Wald's W statistic, 61-63, 68, 70, 75 Wishart matrix, 63, 889, 890, 891
WAPSYS, 537, 538, 541, 546, 547 Wroclaw diagrams, 232
Waveform parsing systems, 527, 528
Wavelength, 576
Wavenumbers, 25 X-ray analysis, 361
