
Springer Series In Statistics

Advisors:
P. Bickel, P. Diggle, S. Fienberg, K. Krickeberg,
I. Olkin, N. Wermuth, S. Zeger

Springer Science+Business Media, LLC


Springer Series in Statistics
Andersen/Borgan/Gill/Keiding: Statistical Models Based on Counting Processes.
Atkinson/Riani: Robust Diagnostic Regression Analysis.
Atkinson/Riani/Cerioli: Exploring Multivariate Data with the Forward Search.
Berger: Statistical Decision Theory and Bayesian Analysis, 2nd edition.
Borg/Groenen: Modern Multidimensional Scaling: Theory and Applications.
Brockwell/Davis: Time Series: Theory and Methods, 2nd edition.
Chan/Tong: Chaos: A Statistical Perspective.
Chen/Shao/Ibrahim: Monte Carlo Methods in Bayesian Computation.
Coles: An Introduction to Statistical Modeling of Extreme Values.
David/Edwards: Annotated Readings in the History of Statistics.
Devroye/Lugosi: Combinatorial Methods in Density Estimation.
Efromovich: Nonparametric Curve Estimation: Methods, Theory, and Applications.
Eggermont/LaRiccia: Maximum Penalized Likelihood Estimation, Volume I:
Density Estimation.
Fahrmeir/Tutz: Multivariate Statistical Modelling Based on Generalized Linear
Models, 2nd edition.
Fan/Yao: Nonlinear Time Series: Nonparametric and Parametric Methods.
Farebrother: Fitting Linear Relationships: A History of the Calculus of Observations
1750-1900.
Federer: Statistical Design and Analysis for Intercropping Experiments, Volume I:
Two Crops.
Federer: Statistical Design and Analysis for Intercropping Experiments, Volume II:
Three or More Crops.
Ghosh/Ramamoorthi: Bayesian Nonparametrics.
Glaz/Naus/Wallenstein: Scan Statistics.
Good: Permutation Tests: A Practical Guide to Resampling Methods for Testing
Hypotheses, 2nd edition.
Gouriéroux: ARCH Models and Financial Applications.
Gu: Smoothing Spline ANOVA Models.
Györfi/Kohler/Krzyżak/Walk: A Distribution-Free Theory of Nonparametric
Regression.
Haberman: Advanced Statistics, Volume I: Description of Populations.
Hall: The Bootstrap and Edgeworth Expansion.
Härdle: Smoothing Techniques: With Implementation in S.
Harrell: Regression Modeling Strategies: With Applications to Linear Models,
Logistic Regression, and Survival Analysis.
Hart: Nonparametric Smoothing and Lack-of-Fit Tests.
Hastie/Tibshirani/Friedman: The Elements of Statistical Learning: Data Mining,
Inference, and Prediction.
Hedayat/Sloane/Stufken: Orthogonal Arrays: Theory and Applications.
Heyde: Quasi-Likelihood and its Application: A General Approach to Optimal
Parameter Estimation.
Huet/Bouvier/Poursat/Jolivet: Statistical Tools for Nonlinear Regression: A Practical
Guide with S-PLUS and R Examples, 2nd edition.

(continued after index)


Anthony C. Atkinson
Marco Riani
Andrea Cerioli

Exploring Multivariate
Data with the
Forward Search

With 390 Figures

Springer

Anthony C. Atkinson
Department of Statistics
The London School of Economics
London WC2A 2AE
UK
a.c.atkinson@lse.ac.uk

Marco Riani and Andrea Cerioli
Dipartimento di Economia
Sezione di Statistica e Informatica
Università di Parma
Via Kennedy 6
43100 Parma
Italy
mriani@unipr.it
andrea.cerioli@unipr.it

Library of Congress Cataloging-in-Publication Data

Atkinson, A.C. (Anthony Curtis)
Exploring multivariate data with the forward search / Anthony Atkinson, Marco Riani,
Andrea Cerioli.
p. cm. - (Springer series in statistics)
Includes bibliographical references and index.

1. Multivariate analysis. I. Riani, Marco. II. Cerioli, Andrea. III. Title. IV. Series.
QA278.A85 2003
519.5'35-dc22          2003058614

ISBN 978-1-4419-2353-0    ISBN 978-0-387-21840-3 (eBook)
DOI 10.1007/978-0-387-21840-3
Printed on acid-free paper.

© 2004 Springer Science+Business Media New York
Originally published by Springer-Verlag New York, Inc. in 2004
Softcover reprint of the hardcover 1st edition 2004

All rights reserved. This work may not be translated or copied in whole or in part without the
written permission of the publisher (Springer Science+Business Media, LLC),
except for brief excerpts in connection with reviews or scholarly analysis. Use
in connection with any form of information storage and retrieval, electronic adaptation, computer
software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if
they are not identified as such, is not to be taken as an expression of opinion as to whether or not
they are subject to proprietary rights.

9 8 7 6 5 4 3 2 1    SPIN 10949610

To the memory of
Iris Atkinson
an enthusiastic grower of bearded irises

To my parents, to my brother Gianfranco and to my aunt Ada

To Chiara and Alessandra, the present, and to the baby girl yet to come


Preface

Why We Wrote This Book


This book is about using graphs to explore and model continuous multi-
variate data. Such data are often modelled using the multivariate normal
distribution and, indeed, there is a literature of weighty statistical tomes
presenting the mathematical theory of this activity. Our book is very dif-
ferent. Although we use the methods described in these books, we focus on
ways of exploring whether the data do indeed have a normal distribution.
We emphasize outlier detection, transformations to normality and the de-
tection of clusters and unsuspected influential subsets. We then quantify
the effect of these departures from normality on procedures such as dis-
crimination and cluster analysis.
The normal distribution is central to our book because, subject to our
exploration of departures, it provides useful models for many sets of data.
However, the standard estimates of the parameters, especially the covari-
ance matrix of the observations, are highly sensitive to the presence of
outliers. This is both a blessing and a curse. It is a blessing because, if we
estimate the parameters with the outliers excluded, their effect is appre-
ciable and apparent if we then include them for estimation. It is however a
curse because it can be hard to detect which observations are outliers. We
use the forward search for this purpose.
The search starts from a small, robustly chosen, subset of the data that
excludes outliers. We then move forward through the data, adding observa-
tions to the subset used for parameter estimation. As we move forward we
monitor statistical quantities such as parameter estimates, Mahalanobis
distances and test statistics. In this way we can immediately detect the
presence of outliers and clusters of observations and determine their effect
on inferences drawn from the data. We can then improve our models.
This book is a companion to "Robust Diagnostic Regression Analysis"
by Atkinson and Riani published by Springer in 2000. In the preface to
that book we wrote "This bald statement . . . masks the excitement we feel
about the methods we have developed based on the forward search. We
are continuously amazed, each time we analyze a new set of data, by the
amount of information the plots generate and the insights they provide".
Although more years have passed than we intended before the completion
of our new book, in which process we have become three authors rather
than two, this statement of our enthusiasm still holds.

For Whom We Wrote It


We have written our book to be of use and interest both to professional
statisticians and other scientists concerned with data analysis as well as
to postgraduate students. Because data analysis requires software we have
a web site http://stat.econ.unipr.it/riani/arc which includes pro-
grams and the data. The programming was done in GAUSS, with most
graphs for publication prepared in S-Plus.
The programs on our web site are in S-Plus. In addition Luca Scrucca of
the University of Perugia (Italy) has translated the forward search routines
into the R language (http://www.r-project.org). His routines are at
http://www.stat.unipg.it/luca/fwd. Also, Stanislav Kolenikov of the
University of North Carolina has created in STATA (http://www.stata
.com) a module for forward search in regression available on http://ideas.
repec.org/c/bocode/s414902.html. Links to forward search routines in
other languages will be put on the web site of the book as they become
known to us.
Our book is intended to serve as the text for a postgraduate course on
modern multivariate statistics. The theoretical material is complemented
by exercises with detailed solutions. In this way we avoid interrupting the
flow of our data analytical arguments. We give references to the statistical
literature, but believe that our book is reasonably self-contained. It should
serve as a textbook even for courses in which the emphasis is not on the
forward search. We trust such courses will decrease in number.

What Is In Our Book


The first chapter of this book introduces the forward search and contains
four examples of its use for multivariate data analysis. We show how out-
liers and groups in the data can be identified and introduce some important
plots. The second chapter, on theory, is in two parts. The first gives the dis-
tributional theory for a single sample from a multivariate normal distribu-
tion, with particular emphasis on the distributions of various Mahalanobis
distances. The second part of the chapter contains a detailed description
of the forward search and its properties. An understanding of all details of
this chapter is not essential for an appreciation of the uses of the forward
search in the later chapters. If you feel you know enough statistical theory
for your present purposes, continue to Chapter 3.
The next three chapters describe methods for a sample believed to be
from a single multivariate normal distribution. Chapter Three continues,
extends and amplifies the analyses of the four examples from Chapter 1.
In Chapter 4 we apply the forward search to multivariate transformations
to normality. Analyses of three of the examples from earlier chapters are
supplemented by the analysis of three new examples. Chapter 5 contains
our first use of the forward search in a procedure depending on multivariate
normality, that of principal components analysis. We are particularly inter-
ested in how the components are affected by outliers and other unsuspected
structure in the data.
The two following chapters describe the forward search for data in several
groups rather than one. In Chapter 6 the subject is discriminant analysis
and in Chapter 7 cluster analysis, where the number of groups, as well
as their composition, is unknown. Here the forward search enables us to
see how individual observations are distorting the boundaries between our
putative clusters. Finally, in Chapter 8 we consider the analysis of spatial
data, which has something in common with the regression analysis of our
earlier book.

Our Thanks
As with our first book, the writing of this book and the research on which
it is based, have been both complicated and enriched by the fact that the
authors are separated by half of Europe. Our travel has been supported
by grants from the Italian Ministry for Scientific Research, by the Depart-
ment of Economics of the University of Parma and by the Staff Research
Fund of the London School of Economics. We are grateful to our numerous
colleagues for their help in many ways. In England we especially thank Dr
Martin Knott at the London School of Economics, who has been a steadfast
source of help with both statistics and computing. Kjell Konis, currently at
Oxford University, helped greatly with the S-Plus programming. In Italy
we thank Professor Sergio Zani of the University of Parma for his contin-
uing support and his colleagues Aldo Corbellini and Fabrizio Laurini for
help with computing including LaTeX. We also thank our families who have
endured our absences and provided hospitality. Their support has been
vital.
Our book was read by three anonymous reviewers for Springer-Verlag.
We are very grateful both for their enthusiasm for our project and for
their detailed comments, many of which we have incorporated to improve
readability, flow and focus. Unfortunately one of their suggested objectives
escaped us - to produce a shorter volume. We trust that the 390 figures in
our book will make it seem not at all like a tome. In reviewing our first,
and shorter, book for the Journal of the Royal Statistical Society, Gabrielle
Kelly wrote "I read this (hardback) book, compulsive reading such as it
was, in three sittings". Even if it takes them more than three sittings, we
hope many readers will find this new book similarly enjoyable.

Anthony Atkinson
a.c.atkinson@lse.ac.uk
http://stats.lse.ac.uk/atkinson/
London, England

Marco Riani
mriani@unipr.it
http://www.riani.it
http://economia.unipr.it/docenti/riani
http://stat.econ.unipr.it/riani

Andrea Cerioli
andrea.cerioli@unipr.it
http://economia.unipr.it/docenti/cerioli
http://stat.econ.unipr.it/cerioli

Parma, Italy

June 2003
Contents

1 Examples of Multivariate Data 1


1.1 Influence, Outliers and Distances . . . . . 1
1.2 A Sketch of the Forward Search . . . . . . 3
1.3 Multivariate Normality and our Examples 5
1.4 Swiss Heads . . . . . . . . . . . . . . 6
1.5 National Track Records for Women . 10
1.6 Municipalities in Emilia-Romagna 16
1.7 Swiss Bank Notes . 22
1.8 Plan of the Book . . . . . . . . . . 30

2 Multivariate Data and the Forward Search 31


2.1 The Univariate Normal Distribution 32
2.1.1 Estimation . . . . . . . . . . . . . . 32
2.1.2 Distribution of Estimators . . . . . . 33
2.2 Estimation and the Multivariate Normal Distribution 34
2.2.1 The Multivariate Normal Distribution 34
2.2.2 The Wishart Distribution 35
2.2.3 Estimation of Σ . . . . . . . . 36
2.3 Hypothesis Testing . . . . . . . . . . 37
2.3.1 Hypotheses About the Mean 37
2.3.2 Hypotheses About the Variance 37
2.4 The Mahalanobis Distance . . . . . . . . 39
2.5 Some Deletion Results . . . . . . . . . . 40
2.5.1 The Deletion Mahalanobis Distance 40
2.5.2 The (Bartlett)-Sherman-Morrison-Woodbury Formula 41
2.5.3 Deletion Relationships Among Distances . . . . 42
2.6 Distribution of the Squared Mahalanobis Distance . . 43
2.7 Determinants of Dispersion Matrices and the Squared
Mahalanobis Distance . . . . . 44
2.8 Regression . . . . . . . . . . . . 46
2.9 Added Variables in Regression 49
2.10 The Mean Shift Outlier Model 51
2.11 Seemingly Unrelated Regression. 53
2.12 The Forward Search . . . . 55
2.13 Starting the Search . . . . . . . . 58
2.13.1 The Babyfood Data .. . 58
2.13.2 Robust Bivariate Boxplots from Peeling 59
2.13.3 Bivariate Boxplots from Ellipses 62
2.13.4 The Initial Subset ... . ... . 64
2.14 Monitoring the Search . . . . . . . . . . . 66
2.15 The Forward Search for Regression Data . 71
2.15.1 Univariate Regression 71
2.15.2 Multivariate Regression 73
2.16 Further Reading 73
2.17 Exercises 76
2.18 Solutions 78

3 Data from One Multivariate Distribution 89


3.1 Swiss Heads . . . . . . . . . . . . . . 89
3.2 National Track Records for Women . 100
3.3 Municipalities in Emilia-Romagna 108
3.4 Swiss Bank Notes 116
3.5 What Have We Seen? 138
3.6 Exercises 140
3.7 Solutions 142

4 Multivariate Transformations to Normality 151


4.1 Background . . . . . . . . . . . . . . . . . . 151
4.2 An Introductory Example: the Babyfood Data 152
4.3 Power Transformations to Approximate Normality 155
4.3.1 Transformation of the Response in Regression . 156
4.3.2 Multivariate Transformations to Normality 161
4.4 Score Tests for Transformations . 162
4.5 Graphics for Transformations . . . . . . . . . . . 164
4.6 Finding a Multivariate Transformation with the
Forward Search 165
4.7 Babyfood Data 166
4.8 Swiss Heads . . 169
4.9 Horse Mussels . 176
4.10 Municipalities in Emilia-Romagna 186
4.10.1 Demographic Variables 187
4.10.2 Wealth Variables . . . 191
4.10.3 Work Variables . . . . . 195
4.10.4 A Combined Analysis . 200
4.11 National Track Records for Women . 204
4.12 Dyestuff Data . . . . . . . . . . . . . 209
4.13 Babyfood Data and Variable Selection 214
4.14 Suggestions for Further Reading 218
4.15 Exercises 220
4.16 Solutions 221

5 Principal Components Analysis 229


5.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . 229
5.2 Principal Components and Eigenvectors . . . . . . . . . . 230
5.2.1 Linear Transformations and Principal Components 230
5.2.2 Lack of Scale Invariance and Standardized Variables 232
5.2.3 The Number of Components . . . . . 232
5.3 Monitoring the Forward Search . . . . . . . . 233
5.3.1 Principal Components and Variances . 233
5.3.2 Principal Component Scores . . . . . 234
5.3.3 Correlations Between Variables and
Principal Components . . . . . . . . 235
5.3.4 Elements of the Eigenvectors . . . . 236
5.4 The Biplot and the Singular Value Decomposition 236
5.5 Swiss Heads . . 239
5.6 Milk Data . . . . . 242
5.7 Quality of Life . . 252
5.8 Swiss Bank Notes . 260
5.8.1 Forgeries and Genuine Notes 261
5.8.2 Forgeries Alone . . . . . . . 263
5.9 Municipalities in Emilia-Romagna 265
5.10 Further reading 272
5.11 Exercises 274
5.12 Solutions 278

6 Discriminant Analysis 297


6.1 Background . . . . . . . . . 297
6.2 An Outline of Discriminant Analysis 298
6.2.1 Bayesian Discrimination .. . 298
6.2.2 Quadratic Discriminant Analysis 299
6.2.3 Linear Discriminant Analysis . . 300
6.2.4 Estimation of Means and Variances. 300
6.2.5 Canonical Variates . . . . . . . . 301
6.2.6 Assessment of Discriminant Rules 304
6.3 The Forward Search . . . . . . . . . . . . 305
6.3.1 Step 1: Choice of the Initial Subset . 306
6.3.2 Step 2: Adding Observations During the
Forward Search . . . . . . . . . . . . . . 306
6.3.3 Mahalanobis Distances and Discriminant Analysis in
Step 2 . . . . . . . . . . . . . . . . . . . . . . . . 307
6.4 Monitoring the Search. . . . . . . . . . . . . . . . . . . 307
6.5 Transformations to Normality in Discriminant Analysis 309
6.6 Iris Data . . . . . . . . 310
6.7 Electrodes Data . . . . 317
6.8 Transformed Iris Data 324
6.9 Swiss Bank Notes . . . 328
6.10 Importance of Transformations in Discriminant Analysis:
A Simulated Example . . . . . . . . . . . . . . . . . . . . 332
6.10.1 A Deletion Analysis . . . . . . . . . . . . . . . . . 332
6.10.2 Finding a Transformation with the Forward Search . 337
6.10.3 Discriminant Analysis and Confirmation of
the Transformation . 341
6.11 Muscular Dystrophy Data . . . . . 344
6.11.1 The Data . . . . . . . . . . 344
6.11.2 Finding the Transformation 345
6.11.3 Outliers and Discriminant Analysis . 349
6.11.4 More Data 351
6.12 Further reading 356
6.13 Exercises 357
6.14 Solutions . . 359

7 Cluster Analysis 367


7.1 Introduction . . . . . . . . . . . . . . . . 367
7.2 Clustering and the Forward Search . . . 368
7.2.1 Three Steps in Finding Clusters 368
7.2.2 Standardized Mahalanobis Distances and Analysis
with Many Clusters . . . . . . . . . . 369
7.2.3 Forward Searches in Cluster Analysis. . . . . 370
7.3 The 60:80 Data . . . . . . . . . . . . . . . . . . . . . 371
7.3.1 Failure of a Very Robust Statistical Method . 372
7.3.2 The Forward Search . . . . . . . . . . . . . . 373
7.3.3 Further Plots for the 60:80 Data . . . . . . . 375
7.4 Three Clusters, Two Outliers: A Second Synthetic Example 379
7.4.1 A Forward Analysis . . . . . . . . . . . . . . . . . . 379
7.4.2 A Very Robust Analysis 382
7.5 Data with a Bridge . . . . . . . 385
7.5.1 Preliminary Analysis . . 386
7.5.2 Further Preliminary Analysis: Mahalanobis Distances
for Groups and Individual Units . . . . . . . 392
7.5.3 Exploratory Analysis: Single Clusters for the
Bridge Data . . . . . . . . . . . . . . . . . . . 398
7.5.4 Confirmatory Analysis: Three Clusters for the
Bridge Data. . . . . . 401
7.6 Financial Data . . . . . . . . 406
7.6.1 Preliminary Analysis . 406
7.6.2 Exploratory Analysis . 410
7.6.3 Confirmatory Analysis 417
7.7 Diabetes Data . . . . . . . . . 420
7.7.1 Preliminary Analysis . 420
7.7.2 Exploratory Analysis . 428
7.7.3 Confirmatory Analysis 436
7.8 Discussion . . . . . . . . . . . 439
7.8.1 Agglomerative Hierarchical Clustering 441
7.8.2 Partitioning Methods . . . . . . . . . 443
7.8.3 Some Examples from Traditional Cluster Analysis 444
7.8.4 Model-Based Clustering 446
7.8.5 Further Reading 448
7.9 Exercises 450
7.10 Solutions 451

8 Spatial Linear Models 457


8.1 Introduction . . . . . 457
8.2 Background on Kriging . 459
8.2.1 Ordinary Kriging . 459
8.2.2 Isotropic Semivariogram Models 465
8.2.3 Spatial Outliers . . . . . . . . . . 467
8.2.4 Kriging Diagnostics . . . . . . . 468
8.2.5 Robust Estimation of the Variogram 471
8.3 The Forward Search for Ordinary Kriging 472
8.3.1 Choice of the Initial Subset 472
8.3.2 Progressing in the Search 474
8.3.3 Monitoring the Search . . 475
8.4 Contaminated Kriging Examples 477
8.4.1 Multiple Spatial Outliers 477
8.4.2 Pocket of Nonstationarity 479
8.5 Wheat Yield Data . . . . . . . . 482
8.6 Reflectance Data . . . . . . . . . 491
8.7 Background on Spatial Autoregression 495
8.7.1 Neighbourhood Structure and Edge Correction 498
8.7.2 Simultaneous Spatial Autoregression (SAR) Models 501
8.7.3 Spatial Outliers Under the SAR Model. . . . . 502
8.7.4 High Leverage Sites . . . . . . . . . . . . . . . 504
8.8 The Block Forward Search for Spatial Autoregression. 506
8.8.1 Subset Likelihood . . . . . 508
8.8.2 Defining the Blocks 509
8.8.3 Choice of the Initial Subset 510
8.8.4 Progressing in the Search . 511
8.8.5 Monitoring the Search . . . 511
8.9 SAR Examples With Multiple Contamination 513
8.9.1 Masked Spatial Outliers . . . 513
8.9.2 Estimation of ρ . . . . . . . . 516
8.9.3 Multiple High Leverage Sites 519
8.10 Wheat Yield Data Revisited . 522
8.11 Further Reading 524
8.12 Exercises 526
8.13 Solutions . . . . 528

Appendix: Tables of Data 551

Bibliography 597

Author Index 607

Subject Index 611


Notation

For convenience we gather together a summary of the notation we have
used. We have tried to square the circle by being consistent whilst adapt-
ing our notation to that predominant in the fields covered by the various
chapters. An important aspect is the difference between matrices $A$, vectors
$a_i$ and scalars $a_{ij}$.

Multivariate Data
$Y$ is the n x v matrix of observations with ith row $y_i$, the jth element of
which is $y_{ij}$ and, where essential, the jth column of $Y$ is $y_{cj}$.
$X$ is n x p, the matrix of explanatory variables in regression.
$J$ is an n x 1 vector of ones, even though it is a capital letter.
$q(i)$ is a vector, usually n x 1, of zeroes, with ith element equal to one:
$q_j(i) = 0$, $j \neq i$, $q_i(i) = 1$.
Statistical Operations
E is expectation, a Roman letter distinct from the matrix of residuals $E$.
var variance, v x v for multivariate data and
cov covariance.
Matrix Operations
tr the trace of a matrix.
diag a diagonal matrix.
$\|y_i - y_j\| = \{\sum_{k=1}^{v}(y_{ik} - y_{jk})^2\}^{0.5}$, the Euclidean distance between $y_i$
and $y_j$.
$L_{y_i} = \|y_i\| = \{\sum_{k=1}^{v} y_{ik}^2\}^{0.5}$, the length of the vector $y_i$.
Likelihood and the Normal Distribution
$N_v(\mu, \Sigma)$ is the v-dimensional multivariate normal distribution with mean
$\mu$ and covariance matrix $\Sigma$. When v = 1 we write
$N(\mu, \sigma^2)$, the univariate normal distribution.
$\mathrm{Lik}(\mu, \Sigma; y)$ is the likelihood.
$L(\mu, \Sigma; y)$ is the loglikelihood with
log the natural logarithm.
$T_{LR} = 2(L_1 - L_0)$ is the likelihood ratio test of the null hypothesis $H_0$,
where $L_1$ is the maximized loglikelihood under the alternative hypothesis
and $L_0$ is maximized under $H_0$.
Estimation for Univariate Data
If there are no explanatory variables,
$E(y) = \mu$ and $\hat\mu = \bar y = y^T J/n$.
$\hat y = J\hat\mu = JJ^T y/n$, the n x 1 vector of fitted values.
$e = y - \hat y$, the n x 1 vector of residuals.
$S(\hat\mu) = \sum_{i=1}^{n}(y_i - \bar y)^2 = \sum_{i=1}^{n} e_i^2$ is the residual sum of squares of the
observations about their mean.
$s^2 = S(\hat\mu)/(n - 1)$ is the unbiased estimator of the variance $\sigma^2$.
$\hat\sigma^2 = S(\hat\mu)/n$, the maximum likelihood estimator of $\sigma^2$.

Univariate Regression
$E(y) = X\beta$, where $\beta$ is p x 1.
$E(y_i) = x_i^T\beta$.
$\hat\beta = (X^T X)^{-1} X^T y$, the least squares estimator.
$e = y - \hat y = y - X\hat\beta$ and
$s^2 = e^T e/(n - p)$.
$S(\hat\beta) = (y - X\hat\beta)^T(y - X\hat\beta)$, the residual sum of squares.
$S_0 = (y - \bar y)^T(y - \bar y)$, the corrected sum of squares of the data.
$R^2 = R^2_{y|X} = \{S_0 - S(\hat\beta)\}/S_0$, the squared multiple correlation coeffi-
cient.
Estimation for Multivariate Data
If there is no matrix of explanatory variables,
$E(y_i) = \mu$ and $E(Y) = J\mu^T$.
$\hat\mu = \bar y = Y^T J/n$, the v x 1 vector of estimated means.
$\hat Y = J\hat\mu^T = JJ^T Y/n$, the n x v matrix of fitted values.
$E = Y - \hat Y = Y - JJ^T Y/n$, the n x v matrix of residuals.
$\Sigma$ is the v x v population covariance matrix of $Y$.
$S(\hat\mu) = \sum_{i=1}^{n}(y_i - \hat\mu)(y_i - \hat\mu)^T = E^T E$ is the v x v matrix of residual
sums of squares and products of the data.
$\hat\Sigma = S(\hat\mu)/n$ is the maximum likelihood estimator of $\Sigma$, with diagonal
elements $\hat\sigma_j^2$ and off-diagonal elements $\hat\sigma_{jk}$.
$\hat\Sigma_u = S(\hat\mu)/(n - 1)$ is the unbiased estimator of $\Sigma$.
$R$ is the estimated correlation matrix, with off-diagonal elements equal
to $\hat\sigma_{jk}/(\hat\sigma_j^2 \hat\sigma_k^2)^{0.5}$.
$d_i^2 = (y_i - \hat\mu)^T \hat\Sigma_u^{-1}(y_i - \hat\mu) = e_i^T \hat\Sigma_u^{-1} e_i$, the squared Mahalanobis distance
for observation i.
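For concreteness, a minimal R sketch of these estimators and of the squared distances $d_i^2$ is given below. It is our own illustration, not part of the book's notation, and the data matrix `Y` is simulated purely as a placeholder:

```r
## mu_hat, Sigma_u and the squared Mahalanobis distances for an n x v matrix Y
Y <- matrix(rnorm(200 * 6), nrow = 200, ncol = 6)  # placeholder data: n = 200, v = 6
mu_hat  <- colMeans(Y)                             # vector of estimated means
E       <- sweep(Y, 2, mu_hat)                     # n x v matrix of residuals
Sigma_u <- crossprod(E) / (nrow(Y) - 1)            # unbiased estimator of Sigma
d2      <- mahalanobis(Y, mu_hat, Sigma_u)         # d_i^2, i = 1, ..., n
```
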
Grouped Data
There are g groups of data with $n_g$ observations in group g.
$S_l(\hat\mu_l)$ is the v x v matrix of residual sums of squares and products of the
data in group l.
$W = \sum_{l=1}^{g} S_l(\hat\mu_l)$ is the within groups matrix of residual sums of squares
and products of the data.
$S(\hat\mu) = W + B$, where $B$ is the between groups sum of squares and
products matrix. The resulting estimators of the population covariance
matrices are:
$\hat\Sigma_W = W/(n - g)$ and
$\hat\Sigma_B = B/(g - 1)$.
Strictly, we should write $\hat\Sigma_{Wu}$ and $\hat\Sigma_{Bu}$, but we always divide these
sums of squares by the degrees of freedom rather than by the numbers
of observations.
Multivariate Regression

$E(Y) = XB$, where $B$ (capital beta) is p x v, with ith row $\beta_i$.
$E(y_i) = B^T x_i$ and $E(y_{ij}) = x_i^T \beta_{cj}$.
$\hat B = (X^T X)^{-1} X^T Y$, the p x v matrix least squares estimator when the
matrix $X$ is the same for each of the v responses.
$E = Y - \hat Y = Y - X\hat B$ and
$\hat\Sigma_u = E^T E/(n - p)$.
Forward Search
$S^{(m)}$ is the subset of size m produced by the forward search: $m_0 \le m \le n$.
$\hat\mu_m$ is the mean of the observations in $S^{(m)}$.
$\hat\Sigma_{um}$ is the unbiased estimator of the covariance matrix $\Sigma$ from $S^{(m)}$.
$d_{im}^2 = (y_i - \hat\mu_m)^T \hat\Sigma_{um}^{-1}(y_i - \hat\mu_m)$, the squared Mahalanobis distance for
observation i with parameters estimated from $S^{(m)}$, i = 1, ..., n.
Deletion
$\hat\mu_{(i)}$ ("mu hat sub i"), the estimator of $\mu$ when observation i is deleted.
$\hat\Sigma_{u(i)}$, the unbiased estimator of $\Sigma$ based on the n - 1 observations when
$y_i$ is deleted.
Spatial Data (General)
s is a 2 x 1 vector of spatial coordinates defining a site within
V, the study region.
$S$ is a network of n spatial locations, at which values $y_i = y(s_i)$, i =
1, ..., n, are observed. In kriging $S \subset V$, while in spatial autoregression
$S = V$.
$y$ is the n x 1 vector of values $y_i$. Observations in $y$ are not independent.

Ordinary Kriging
$s_0$ is a prediction site, i.e. the value $y(s_0)$ is to be predicted from $y$.
$h$ is a 2 x 1 vector giving the spatial lag between two sites, $s$ and $t$.
$N(h)$ is the number of sites at lag $h$ within $S$.
$c(h)$ is the covariogram, while
$2v(h)$ is the variogram ($v(h)$ is called the semivariogram). Both $c(h)$ and
$2v(h)$ may depend upon a parameter vector $\theta$.
$2\bar v(h)$ is a robust estimate of $2v(h)$.
$C$ is the n x n covariance matrix of $y$.
$T$ is the n x n matrix of variogram values of $y$.
$c$ is the n x 1 vector of covariances between $y(s_0)$ and $y_i$.
$v$ is the n x 1 vector of variogram values between $y(s_0)$ and $y_i$.
$\hat y(s_0|S)$ is the ordinary kriging predictor at site $s_0$, computed through
$\eta$, the n x 1 vector of ordinary kriging weights.
$\sigma^2(s_0|S)$ is the mean-squared prediction error associated with $\hat y(s_0|S)$.
$v^*$ and $\eta^*$ have the same meaning as $v$ and $\eta$, but they are defined under
a measurement error model; $\hat y^*(s_0|S)$ is the corresponding ordinary kriging
predictor.
$S_{(i)}$ is network $S$ with the ith location removed.
$e_{i,S_{(i)}}$ is the standardized prediction residual at site $s_i$, based on the n - 1
observations from $S_{(i)}$.
$C_{(i)}$ is the (n - 1) x (n - 1) covariance matrix of $y_{(i)}$. $c_{(i)}$ is the (n - 1) x 1
vector of covariances between $y_i$ and $y_{(i)}$.
$J_{(i)}$ is vector $J$ with the ith entry removed, i.e. an (n - 1) x 1 vector of
ones.
$e_{i,S^{(m)}}$ is the standardized prediction residual at site $s_i$ at step m of the
forward search. Here $m_0 \le m \le n - 1$ because the last step of the search
is uninformative for prediction purposes.
Spatial Autoregression
$X$ and $\beta$ are the same as in univariate regression.
$W$ is the n x n weight matrix defining the neighbourhood structure in $S$.
$\rho$ is the spatial interaction measure between neighbouring sites.
$\sigma^2$ is the variance of the independent disturbance terms $\epsilon_i$, i = 1, ..., n.
$\Sigma$ is the n x n covariance matrix of $y$.
$\Omega$ is $\sigma^2 \Sigma^{-1}$.
$\hat\beta$, $\hat\sigma^2$ and $\hat\rho$ are the maximum likelihood estimators of $\beta$, $\sigma^2$ and $\rho$.
$e = \hat\sigma^{-1}(I_n - \hat\rho W)(y - X\hat\beta)$, the n x 1 vector of standardized regression
residuals.
$B_t$ is a block of $b_t$ contiguous sites.
$n^*$ is the number of such blocks. Usually,
$b_t = b$, for $t = 1, \dots, n^*$.
$k_m$ is the progression index of the block forward search, i.e. $S^{(m)}$ is
updated to $S^{(m + k_m)}$. Usually, $k_m = b$ or $k_m = 1$.
$\hat\beta_m$, $\hat\sigma^2_m$ and $\hat\rho_m$ are the maximum likelihood estimators of $\beta$, $\sigma^2$ and $\rho$
from the fit to $S^{(m)}$; $e_m$ is the corresponding n x 1 vector of standardized
regression residuals.
Appendix: Tables of Data

A.1 Swiss heads data: six dimensions in millimetres of the heads
of 200 Swiss soldiers . . . 553
A.2 National track records for women . . . 558
A.3 Selection of data from municipalities in Emilia-Romagna 560
A.4 Swiss bank notes data: six dimensions in millimetres of 200
Swiss 1,000 Franc notes . . . 562
A.5 Babyfood data from Box and Draper (1987) . 567
A.6 Mussels data from Cook and Weisberg (1994) 568
A.7 Dyestuff data from Box and Draper (1987) 570
A.8 Milk data from Daudin, Duby and Trecourt (1988) 571
A.9 Indices of the quality of life in the provinces of Italy. Data
from Il Sole - 24 Ore 2001 . . . 573
A.10 Iris data from Anderson (1935) . . . 576
A.11 Electrodes data from Flury and Riedwyl (1988) 579
A.12 Muscular dystrophy data. Non-carriers . 581
A.12 Muscular dystrophy data. Carriers 584
A.13 The 60:80 data . . . 586
A.14 Three clusters, two outliers . . . 588
A.15 Data with a bridge . . . 590
A.16 Investment funds data from Il Sole - 24 Ore 1999 592
A.17 Diabetes data from Reaven and Miller (1979) . . . 594
1
Examples of Multivariate Data

1.1 Influence, Outliers and Distances


In our first example the data form a 200 x 6 matrix: six readings on the
dimensions of the heads of 200 young men. This rectangular array is the
form of all our data sets, an n x v matrix representing v observations on
each of n units, here people. In most examples we first look at a scatterplot
matrix of the data and then fit a multivariate normal distribution. This
model can be fitted to any such rectangular array of numbers, so we need
to explore the data to see whether this is an appropriate model for these
data. Some departures from the multivariate normal model include:

• The presence of a single outlier;

• The presence of a group of outliers;

• Two or more distinct groups in the data;

• A transformation is required to obtain approximate normality of the


data.

These departures are not exclusive; for example the data may need trans-
formation whilst consisting of three groups together with some outliers.
But it is useful to consider the first three departures on their own.
To explore the structure of our data we shall make much use of Ma-
halanobis distances. Although tests on large distances are sometimes sug-
gested to test for outliers, we make particular use of plots. Our argument
is that these distances from all n observations can fail to reveal some of
the departures listed above when all n observations are used to estimate
the parameters needed to calculate the distances. We will show that extra
information can be obtained when outliers are present by calculating all n
distances using parameter estimates from a subset of m observations which
exclude the outliers. To start our argument it is helpful to look at the case
of a simple sample, that is v = 1. This seemingly trivial instance is sur-
prisingly helpful, not only in describing the problems of outlier detection
using Mahalanobis distances but also in giving an informal description of
the forward search.
For univariate observations the Mahalanobis distance reduces to the
scaled residual

$$
d_i = e_i/s = (y_i - \bar y)/s \quad \mathrm{and} \quad s^2 = \sum_{i=1}^{n}(y_i - \bar y)^2/(n - 1), \qquad (1.1)
$$

where $\bar y = \sum_{i=1}^{n} y_i/n$. The squared distances $d_i^2$ have a scaled beta dis-
tribution which is asymptotically chi squared on 1 degree of freedom. (We
devote the first part of Chapter 2 to a discussion of such distributional
results). We can use a probability plot of the n values of $d_i$ to check the
distribution of the distances and so the adequacy of the model. Alterna-
tively, we can plot $d_i$ against the normal distribution.
Now suppose that there is a single outlier, formed by adding an amount
$s\Delta$ to observation $\ell$ which has the value $\bar y$, with $s$ as in (1.1). This changed
observation will affect both the value of $\bar y$ and of $s^2$. Then for this obser-
vation

$$
d_\ell = \frac{(1 - 1/n)\Delta}{\sqrt{1 + \Delta^2/n}}. \qquad (1.2)
$$

If observation $\ell$ has residual $e_\ell$ in the original sample, a slightly more com-
plicated formula results, which has a similar structure as a function of $\Delta$.
The relationship in (1.2) shows that, for large n and moderate $\Delta$, the
value of $d_\ell$ will be near to $\Delta$ since the single observation will not have a
large effect on the estimate of the variance. But, for moderate n, the value
of $d_\ell$ is not so large; for $\Delta = 3$ and n = 10, 50 and 100 we obtain values
of 1.959, 2.706 and 2.845 for $d_\ell$. An outlier test based on the approximate
normal distribution of $d_\ell$ would fail to detect anything strange about this
observation. Of course, with n = 100, a value of 3 is not particularly large.
However, for all n, information about the outlying nature of observation
$\ell$ can be obtained by plotting the values of $d_i$. In particular, although the
value of $d_\ell$ may not be especially large, the values of the other distances
will be shrunk by the inflated estimate of the variance in (1.2).
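The values just quoted can be checked directly from (1.2). A short R snippet, ours rather than part of the book's programs, is:

```r
## d_ell from (1.2): the distance of an observation shifted by s * Delta
## when that shifted observation is also used in estimating the mean and variance
d_ell <- function(Delta, n) (1 - 1/n) * Delta / sqrt(1 + Delta^2 / n)

round(d_ell(3, c(10, 50, 100)), 3)   # 1.959 2.706 2.845, as in the text
```
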
A powerful method of detecting single outliers (Cook and Weisberg 1982,
Atkinson 1985) is the deletion of single observations. If in our simple ex-
ample $y_\ell$ is removed from the estimation of the mean and variance, the
distance for that observation will have the value $\Delta$. For all other observa-
tions $y_i$ the distances from deletion of the observation will still include the
effect of $y_\ell$ in estimation of the parameters. The corresponding distances
will thus not be much changed by the deletion of each $y_i$ in turn and this
deletion procedure will lead to clear identification of the single outlier.
Now suppose there are $k \ge 2$ outliers formed in much the same way
as we formed $y_\ell$. The outliers will then form a small cluster consisting of
observations $\ell_1$ up to $\ell_k$. Single deletion in turn of each of these k obser-
vations will leave the estimates of the mean and variance affected by the
other k - 1 outliers so that the outlying units may not have particularly
large distances. This hiding of the effect of one outlier by another is called
"masking". This masking effect can be broken and the k outliers revealed if
we delete all k outliers and then calculate the parameter estimates. However
we have first to determine which set of k observations to delete; there are
$n!/\{k!(n - k)!\}$ possibilities, which can be a large number, even for moderate
sized samples; with 100 observations and five outliers there are 75,287,520
possibilities. The presence of masking may make it impossible to reduce
the number by a series of univariate deletions. An example of the failure of
such a "backwards" method for a binomial model is in Atkinson and Riani
(2000, §6.16.2). In addition, the exact number of outliers is not known;
there might be six outliers or four.
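The count of candidate deletion sets quoted above is simply a binomial coefficient, which can be evaluated in R; the second line only illustrates that each plausible number of outliers would have to be tried in turn:

```r
## number of ways of choosing k observations to delete from n = 100
choose(100, 5)   # 75287520, the value quoted in the text
choose(100, 4)   # 3921225, if there were four outliers instead
```
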
Of course, with a univariate sample it is not hard to determine which
observations to delete as they are ordered on a line; the ordering of the
observations is not changed by the estimates of the mean and variance. But
this is no longer so with multivariate observations. The problem with $v \ge 2$
is that the ordering of the data depends upon the parameter estimates.
Observations with the same Mahalanobis distance $d_i$ when $v \ge 2$ lie on
the same ellipsoid, centered at the origin, the shape of which is determined
by the estimated covariance matrix. Outliers may cause the shape of this
ellipsoid to change, making it perhaps more or less spherical. The result is
that, as the estimated covariance matrix changes, so does the ordering of the
units by their Mahalanobis distances. The forward search overcomes this
problem by finding outlier free subsets of the data from which parameters
and distances can be estimated and the observations ordered.

1.2 A Sketch of the Forward Search


The idea of deletion methods is, by detecting the k outliers, to divide the
data into a central portion of n - k observations that can be used for pa-
rameter estimation, together with the k outliers which can be examined for
any scientific characteristics once they have been identified. In the forward
search we identify subsets of m observations which generally contain far
fewer than n - k observations. These subsets are outlier free and so can be
used for parameter estimation; they provide Mahalanobis distances which
correctly order the data with gross outliers most remote. We start with
a small subset and then increase its size one observation at a time, pro-
viding improved parameter estimates and an ordering of the observations
by closeness to the fitted model. A full definition and discussion is in the
latter part of Chapter 2, starting in §2.12. We motivate the procedure by
reference to the univariate example of the previous section.
There are three aspects of importance:
Starting the Search. We find a small subset of size m 0 which is outlier
free. A size of 3v may be appropriate, although the size is not crucial.
In the univariate example we could take the median observation and one
observation either side and use these observations to estimate the mean
and variance. This subset will be at the center of a univariate distribution
with some outliers, from which it will be far. They will therefore be clearly
revealed by their large Mahalanobis distances.
In v dimensions we use medians as estimators of location and find obser-
vations that are close to the median in all bivariate plots of the data.
Progressing in the Search. Given a subset of size m we estimate
the parameters and calculate all n Mahalanobis distances. These are then
ordered from smallest to largest and the m + 1 observations with the m + 1
smallest distances form the new subset. Here m runs from m 0 to the fit to
all observations when m = n . Usually one observation is added at a time,
but the inclusion of an outlier can cause the ordering of the observations to
change, when more than one unit may enter. Of course, at least one unit
then has to leave the subset in order for the size to increase by one unit.
This change of order during the search is a feature of multivariate data
which we have stressed is absent in the analysis of univariate data.
Monitoring the Search. For each value of m we can look at the plot
of all n Mahalanobis distances. If there are outliers they will have large dis-
tances during the early part of the search that decrease dramatically at the
end as the outlying observations are included in the subset of observations
used for parameter estimation.
If our interest is in outlier detection we can also monitor, for example,
the minimum Mahalanobis distance among units not in the subset. If an
outlier is about to enter, this distance will be large, although it will decrease
again as the search progresses if a cluster of outliers join.
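A minimal sketch of how the progressing and monitoring steps might be coded is given below in R. This is our own illustration, not the authors' GAUSS or S-Plus programs; in particular the initial subset is chosen crudely from full-sample distances rather than by the robust bivariate method of §2.13, and the function name `forward_search` is ours:

```r
## Illustrative forward search for an n x v data matrix Y
forward_search <- function(Y, m0 = 3 * ncol(Y)) {
  n <- nrow(Y)
  ## crude start: the m0 units closest to the overall centre of the data
  d_start <- mahalanobis(Y, colMeans(Y), cov(Y))
  subset  <- order(d_start)[1:m0]
  min_dist <- rep(NA, n)                     # minimum distance of units not in the subset
  for (m in m0:n) {
    mu_m  <- colMeans(Y[subset, , drop = FALSE])
    Sig_m <- cov(Y[subset, , drop = FALSE])
    d2    <- mahalanobis(Y, mu_m, Sig_m)     # all n squared distances from the subset fit
    if (m < n) {
      min_dist[m] <- sqrt(min(d2[-subset]))  # monitor: smallest distance outside the subset
      subset <- order(d2)[1:(m + 1)]         # progress: keep the m + 1 closest units
    }
  }
  min_dist
}
## plot(forward_search(Y), type = "l", xlab = "Subset size m") gives a curve like Figure 1.3
```
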
If there are two clusters of roughly equal size and we start with units
from one of them, the units in the other cluster will all be remote and
have large distances until they start to join the subset half way through
the search. This example emphasizes the way in which the search can be
used to explore the structure of the data. It also highlights the importance
of flexibility in the choice of starting point.
The univariate example shows what can happen if there really are two
clusters of observations of equal size. The univariate median, included in
the initial subset, might be in a region with few observations, remote from
either cluster. Then units from both clusters would join the subset during
the search. Even if there were no very large distances, the presence of the
two clusters would be revealed by a shortage of very small distances. As we
see in a multivariate example in §3.4, rerunning the search from a different
starting point would reveal the clusters.
This brief introduction is complemented by the Exercises in §3.6. In the
solutions we give examples of forward plots of Mahalanobis distances and
of aspects of the estimated covariance matrix for a sample from a bivariate
normal distribution, both in its original state and after contamination by
a variety of patterns of outliers.
Our purpose is not only to identify outlying observations and clusters,
groups and subpopulations, but also to determine what effect they have
both on the model to be fitted and on the conclusions that can be drawn
from the data. An example is the need for power transformation of the data
to improve normality. The estimated transformation can be very much af-
fected by a few outliers. These need to be identified so that a transformation
can be found that is suitable for the bulk of the data.

1.3 Multivariate Normality and our Examples


The multivariate normal distribution is widely used in the modelling and
analysis of the continuous multivariate data that form the subject of this
book. There are two main reasons for this prevalence of the normal distri-
bution. The empirical one is that, as we shall see, many sets of multivariate
data closely follow the normal distribution, perhaps after a suitable power
transformation. The second one is that many powerful methods for the
analysis of multivariate data rely on multivariate normality. Examples in-
clude principal components analysis in Chapter 5, discriminant analysis in
Chapter 6 and cluster analysis in Chapter 7.
In our first example, measurements on Swiss heads, the data seem to be
well described by a six-dimensional multivariate normal distribution. There
may perhaps be two slight outliers that are not evident from plots of Maha-
lanobis distances when all the observations are used to estimate the means
and covariances - the outliers are "masked". In §4.8 we use the forward
search to show how important, and misleading, these two observations are
when transformation of the data is considered.
Our second example is of record times in athletics meetings. The struc-
ture of the data appears complicated, with many outliers. However a power
transformation, explored in Chapter 4, Ieads to a simpler model with fewer
outliers. The forward search, in this case, not only provides a suitable trans-
formation, but also ensures that the transformation we find is not being
unduly influenced by the results from a few units; in this case countries.
Data arising in the study of society are typically more complicated to
analyse than those arising in scientific laboratories. This is certainly true
of our third example, which are 28 measures of economic activity and liv-
ing standards in the 341 municipalities of the Italian region of Emilia-
Romagna. The municipalities range in population from 404,378 for the city
of Bologna to 155 for the mountain community of Zerba. There is thus a
broad spectrum of communities with, amongst other differences, the po-
tential for widely varying standards of data collection and presentation. In
this chapter we use the forward search to analyse these data, paying par-
ticular attention to outliers and groups of municipalities. Our purpose is to
find the variation in the substantive variables describing the communities
despite any variability in quality of the data. We continue our analysis in
Chapters 4 and 5 where, in the hope of finding a simpler structure for the
data, we look for transformations and apply principal components analysis.
These three examples all have the same structure; there is a single popu-
lation with a simple, or less simple, structure and a relatively small number
of outlying observations. However, the final example of the chapter, on 200
Swiss bank notes, contains at least two distinct populations. The notes
have been divided into two groups of 100, one believed to contain genuine
notes and the other forgeries. We start by fitting a single multivariate nor-
mal distribution to all 200 observations. Although we know this model is
incorrect, use of the forward search enables us to identify not only the two
main groups but also some finer structure as well.
The theory of the forward search is described in the next chapter. In this
chapter we exemplify the approach. The emphasis in our analyses is on the
graphical presentation of results.

1.4 Swiss Heads


Table A.1 in the Appendix gives six readings on the dimensions of the heads
of 200 twenty year old Swiss soldiers. The data are described by Flury and
Riedwyl (1988, p. 218) and also by Flury (1997, p. 6). The variables are:

Y1: minimal frontal breadth


Y2: breadth of angulus mandibulae
y3: true facial height
Y4: length from glabella to apex nasi
Y5: length from tragion to nasion
Y6: length from tragion to gnathion.

Diagrammatic front and profile views of a head illustrating these measure-
ments are on p. 223 of Flury and Riedwyl (1988).

The data were collected to determine the variability in size and shape
of heads of young men in order to help in the design of a new protection
mask for the Swiss army. Because of the variations in human heads, it was
clear that one mask could not be satisfactory for all soldiers. The aim was
to find a few typical head sizes and shapes which, it was hoped, would
make it possible to provide satisfactory masks for all soldiers. If the data
have a multivariate normal distribution, the techniques of multivariate data
analysis can be used to determine the best few standard types.
Accordingly we start with two plots to check whether the data are ap-
proximately normal. Figure 1.1 is the scatterplot matrix for the six vari-
ables , that is the matrix of scatterplots for all pairs of variables. The data
do seem to have the elliptical contours which would be expected from the
pairwise bivariate normal distributions. However, it is hard to tell by visual
inspection whether the scattering of more remote points are what is usually
found in the tails of normal distributions.
A conventional way to try to answer this question is to look at a plot of
the n Mahalanobis distances, the analogue of the residuals in regression.
We derive the scaled beta distribution of the squared distances in §2.6. But,
asymptotically, the squared distances have a $\chi^2$ distribution on v degrees of
freedom, where v is the dimension of the measurements, here six. We could
look at a QQ plot of the ordered squared distances against the percentage
points of $\chi^2_6$, but this plot is sparse for large values. Instead we look at the
plot for the distances, with percentage points that are the square roots of
the percentage points of the chi-squared distribution. The resulting plot is
Figure 1.2. In the absence of random fluctuations the calculated distances
should fall on the diagonal line given in the figure. They seem to do so, as
far as the unaided eye can tell, although the smaller distances perhaps lie
slightly, but systematically, off the line.
We use simulation to provide a guide as to what kind of fluctuations are
to be expected in such plots. In Figure 1.2 we also include a 90% envelope
formed from 99 simulations of samples of 200 six-dimensional normal ran-
dom variables. The Mahalanobis distances are calculated for each sample
and ordered. The ends of the point-wise confidence intervals in the plot are
the fifth and 95th largest value of each simulated order statistic of the Ma-
halanobis distances. The figure shows that the smallest and largest obser-
vations alllie on or within this narrow envelope. There is no evidence, when
all 200 observations are fitted, of any departure from the multivariate nor-
mal distribution of these measurements. Similar conclusions are provided
by the plot of the squared distances, which, however, due to the effect of
squaring, has a much more spread out upper tail and a more compact lower
tail.
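One way the envelope described above might be computed is sketched below in R (our own code; the book's published figures were prepared in S-Plus). The text takes the fifth and 95th largest of the 99 simulated order statistics, which the quantile() call here only approximates:

```r
## 90% simulation envelope for ordered Mahalanobis distances:
## 99 samples of n = 200 observations on v = 6 normal variables
set.seed(1)                    # arbitrary seed, for reproducibility only
n <- 200; v <- 6; nsim <- 99
sim <- replicate(nsim, {
  Z <- matrix(rnorm(n * v), n, v)
  sort(sqrt(mahalanobis(Z, colMeans(Z), cov(Z))))
})
envelope <- apply(sim, 1, quantile, probs = c(0.05, 0.95))
## abscissae: square roots of the percentage points of chi-squared on v degrees of freedom
x <- sqrt(qchisq(((1:n) - 0.5) / n, df = v))
## for a data matrix Y the observed ordered distances would be
## sort(sqrt(mahalanobis(Y, colMeans(Y), cov(Y)))), plotted against x with the envelope
```
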
It seems as if the data do indeed have a multivariate normal distribution.
However, it is possible to introduce outliers into a multivariate data set in
such a way that they are not particularly outlying in any of the two dimen-
sional projections onto the co-ordinate axes which form the scatterplots


FIGURE 1.1. Swiss heads: scatterplot matrix of the six measurements on 200
heads

of Figure 1.1. If there were one such outlier, it would have a significantly
large Mahalanobis distance in the QQ plot of Figure 1.2. But, if there were
several outliers close together, they might not show in the final plot of Ma-
halanobis distances because of their combined effect on estimation of the
mean and covariance matrix of the fitted distribution. Such observations
can however be detected from a forward plot of Mahalanobis distances cal-
culated for each subset size during the forward search. It may seem unlikely
that observations of this kind will occur in a well established subject like
the measurement of heads, where each number is well understood. However,
data entry and editing can produce bizarre errors.
One way of detecting the presence of outliers is to look at the distance
of the next observation to join the subset. Figure 1.3 is a plot of these dis-
tances, that is, at each m, of the minimum distance amongst the observa-
tions not in the subset. This is usually the next unit to join the subset. If the

FIGURE 1.2. Swiss heads: QQ plot of the ordered Mahalanobis distances against
the square root of the percentage points of $\chi^2_6$ with 90% envelope from 99 sim-
ulations. There do not seem to be any outliers when all observations are fitted

distance is large, then an outlier is being introduced into the subset. Once
an outlier has been introduced, it may distort the estimates of the mean
and covariance matrix in such a way that other similar outliers no longer
seem remote. Thus, the introduction of a cluster of outliers will be heralded
by a spike in the plot, which will decline thereafter. Something of this be-
haviour is visible in Figure 1.3. Initially the plot is virtually horizontal as
observations from the centre of a multivariate normal distribution join the
subset. The slope of the curve increases gently as more remote observations
are introduced. Towards the end there are several larger jumps. The penul-
timate jump is due to the introduction of observation 104. Once it has been
introduced, observation 111 is equally remote. Otherwise, Figure 1.3 does
not reveal the introduction of a cluster of outliers. The generally smooth
shape of the curve indicates that the forward search has ordered the obser-
vations from those at the centre of the multivariate distribution to those
most remote. We pursue this line of analysis in Chapter 3 where we look
at forward plots of the distances for individual units.
We now return to the purpose for which the data were collected. The idea
is to identify typical heads which could be used to decide which protection
masks should be manufactured. It would be most convenient if there were
just a range of standard sizes from small to large, so that only one number
were needed to specify a mask. They could then be selected in the way in


FIGURE 1.3. Swiss heads: forward plot of minimum distances of units not in the
subset. There may be a few outliers entering at the end of the search

which cheap shoes are bought solely by size, as are shirts in the United
Kingdom. If the authorities are unlucky, it may be necessary to specify two
or more dimensions. Expensive shoes are bought both by size and width
and shirts in the United States by both collar size and sleeve length. Both
pairs of measurements define a bivariate distribution of satisfactory items.
One could think of further variables which are necessary for a comfortable
shirt, for example chest size. However it may be that once the variation of
chest size with collar size and sleeve length has been accounted for, there
is no further appreciable variation in chest size. The same may be true
with head sizes. Are all six variables really needed to explain the observed
variation or are there a few combinations of the measurements which ex-
plain nearly all differences between people? If the data are multivariate
normal, appropriate linear combinations are simply found by the methods
of principal components analysis, described in Chapter 5. In the absence of
multivariate normality, it is much harder to discover suitable combinations
of the variables and to determine their importance.

1.5 National Track Records for Women


The data for our second example seem in comparison rather far from mul-
tivariate normality. In this case the units are countries, so that it should be

Distance Minimum Country Maximum Country Ratio


100m.(sec.) 10.79 USA 12.90 Cook Islands 1.20
200m.(sec.) 21.71 GDR 27.10 Cook Islands 1.25
400m.(sec.) 47.99 CZ 60.40 Cook Islands 1.26
800m.(min.) 1.890 USSR 2.330 Western Samoa 1.23
1500m.(min.) 3.870 USSR 5.810 Western Samoa 1.50
3000m.(min.) 8.450 USSR 13.040 Western Samoa 1.54
Marathon(min.) 142.7 USA 306.0 Western Samoa 2.14
TABLE 1.1. Track records: the maximum and minimum times for each race and
their ratio. The values of the ratios close to one indicate it may be hard to
establish a power transformation for these variables

easy to suggest reasons for any patterns which we find. Of course, it may be
too easy to explain random fluctuations and we need statistical procedures
to help us assess facile explanations of any seeming structure.
The data in Table A.2 are women's athletic records for 55 countries. The
variables are times for the following distances:

Y1: 100 metres in seconds


Y2: 200 metres in seconds
Y3: 400 metres in seconds
y4: 800 metres in minutes
Y5: 1500 metres in minutes
Y6: 3000 metres in minutes
y7: marathon.

The data were given by Johnson and Wichern (1997, pp. 44-45) and
are taken from a handbook prepared for the 1984 Olympic games in Los
Angeles. They therefore come from an interesting period in the history of
women's athletics, when there were, now authenticated, allegations about
the treatment of female athletes, especially in communist countries, with
male sex hormones. One aspect of the analysis is therefore to see what
evidence the data provide - perhaps these countries will appear as outliers.
Table 1.1 gives the minimum and maximum national records for each
race and the country concerned. Also given is the ratio of these times.
Both the minimum and maximum times are interesting. Apart from the
USA, the other countries with minimum times were all communist countries
at the time, which have all since disappeared as legal entities, which has
not been the fate of all communist countries. The German Democratic
Republic (GDR) is now part of the Federal Republic of Germany (FRG) ,
Czechoslovakia (CZ) has split into two countries and the USSR into Russia,
the Ukraine and many more.
The maximum times belong to two island nations in the Pacific (Cook
Islands, CI, and Western Samoa, WS). Figure 1.4 is a scatterplot matrix
of the data, in which both the times for the Cook Islands and Western


FIGURE 1.4. Track records: scatterplot matrix of the national records for women
from 55 countries for seven races. The results for the Cook Islands (CI) and
Western Samoa (WS) are Iabelied

Samoa have been labelled. The difference in the pattern of times for the two
countries is interesting. The times for the Cook Islands are always amongst
the largest, lying towards the end of the major axis of the elliptical clouds
of points. However, the very large tim es for Western Samoa for the last
three races cause the country to lie away from the scatter in many of the
bivariate plots.
Considering the minima and maxima is looking at each variable indi-
vidually, that is at one-dimensional projections along the coordinate axes.
The scatterplot matrix shows bivariate projections onto the same axes. We
now see what multivariate techniques reveal. Figure 1.5 is the QQ plot of
Mahalanobis distances at the end of the search, which shows three outliers.
The largest is, indeed, Western Samoa (55), with the next largest being
North Korea (the Democratic People's Republic of North Korea, 33). The
1.5 National Track Records for Women 13

<0


1.0 1.5 2.0 2.5 3.0 3.5 4.0

FIGURE 1.5. Track records: QQ plot of the ordered Mahalanobis distances


agairrst the square root of the percentage points of x? with 90% envelope from
99 simulations. There are three obvious outliers when all observations are fitted

third outlier is Mauritius (36), another island. There also seem tobe some
other indications of non-normality, with several other observations lying
on, or outside, the simulation envelope. More detailed information can be
obtained from forward plots of Mahalanobis distances during the search.
Figure 1.6, like Figure 1.3, is a forward plot of the minimum Mahalanobis
distance among the units not included in the subset, which is large when
outliers join. The two plots are quite different. For most of the search the
values in Figure 1.6 oscillate between four and six, but at the end of the
search there is a very large jump upwards, much larger than those at the
end of the search in Figure 1.3. The last increase, at m = 54 is for West-
ern Samoa, which joins the subset at m = n = 55. The preceding jumps
are indeed for Mauritius and North Korea which come in at steps 54 and
53. Because these distances are for units not included in the subset, they
are larger than those in the QQ plot of Figure 1.5 where all observations
are used in estimation of the parameters. However, these three extreme
observations are clear from either plot.
It is informative to refer these indications of outliers back to the data.
Accordingly, Figure 1.7 is a scatterplot matrix of the data, minus Western
Samoa, on which the results for the Cook Islands (CI), North Korea (DRK)
and Mauritius (M) have been labelled. The general performance of North
Korean athletes is such as to lie within the bulk of the data, but the plot
14 1. Examples of Multivariate Data

0
::;;
E
:::>
"'
E
.2'
~
CD

10 20 30 40 50
Subset size m

FIGURE 1.6. Track records: forward plot of minimum distances of units not in
the subset

shows that there is, in particular, a high value for y 2 which causes the
observation to fall away from the general distribution. This is particularly
clear in the panel plotting Y2 against Y3· The values for Mauritius also
stand away from the generally normal distribution of observations, in part
because of the large time for y 7 compared to some of the other times. lt
is important to be clear that the Cook Islands do not show as an outlier
because the times, although they are all large, fit in the general correlated
multivariate normal distribution - there is no combination of egregiously
high and low times for this country.
A feature ofFigure 1.6 is the peak at m = 40, which may be an indication
of the beginning of a duster of slightly outlying observations. A further in-
dication of some undetected structure comes from the QQ plot of Figure 1.5
which indicates general non-normality. We investigate these indications in
Chapter 3 using forward plots of individual Mahalanobis distances for each
country. As a result there seem to be about a dozen countries which do not
fit the general model of multivariate normality.
lt makes little sense to remove 12 observations out of 55. Instead we
should look for a model which includes all observations, or perhaps all
except one or two outliers, such as Western Samoa. One possibility is
to transform the data. For example, taking the reciprocals of t he times
would give average speeds, which seems as logical a measure of perfor-
mance as times. In some examples, such as the univariate data on survival
times of animals (Box and Cox 1964), often known as the "Poison Data",
1.5 National Track Records for Women 15

[][2] .
1.9 2 .1 2.3 8.5 9.5 10.5

" "'
t.• ....
ttt
..
ORK + -1 +- ORK ++ + + o~ + 0~ •
t<l•·
DRK

:f;J•J:F
.~:~~ t ...... ++/"J"~_..,++ M
~
• W. +
;.;..
~! +
t;.. . . .
•• + ~ .·;
~ ~·

[Z][J
~ +

DRK DRK DRK ORK ORK


++ ... ++ .... • + ...
t!'~\ . . ..
1if}*+~ ~ ·~· t • "f"" 4- ... ..... .; +t"

"\"
·:~l
.
+ • ,...+ +

+ ........... :;:·
••• + f~ .
•f

[]
+ v• t.
M+
'1 +M +M + M

.
·~ ;;1· fJtl}'+
• t.;.+•
...
~:r<-:tf+ + :,...t};-::·
~~~
.'t'; .t· +

.
ff~· ORK '$~+ DRK ~ ·~ ·'.t·
~... ~~K
*
.
[] .... :sr··........
.
+

..'$..;. ~·"
·~ !.M
... . +
.t+
t~jl( ...... \ rk o~- • .f
"'
+

~t:
.. ~:·
w.+~ + ++T~· ~ + .._.t+.~ +
.W+ ....
~~+ .lt!l
.~ []
ORK 4 AK

....
+,...++ ORK "K + K

:...
......
....
r~.: ~.:. 'M
•~ +

Jt
~ ~· + . ... ~

-#i
•• + + +
;..,
:~fr:
+ +

r:il: .~ ...
• + ......
citK..,_
....
*...
• r.tRK
~
.J.•ORK
.~ ... y5
Jf ~·
.:··
M +
+ •
~~
..,.+M
•: + .."'· .
+IK

~}... ."[] .,.


•1':\
+
';' ~

.... . .•++•::: +
.. ~ +
!i$+oRK oo.f'+~ OF\(t~!+
........ ...
!;.~;, .
"' ~tt• ~t·
+ ...
ht~K ~· ~A.'il· ·~ ·

:'[]
+ ti'"

. ...... - .
m

... .......·:
c c c

r.;;ß:..
+ G
+

•Jd:·tr
• + + ·...... y7

..
+
....~~
++ ~jORK
.~it· #'+..
...
ll ~+ •
tA.fo-w.•+"fttt. ~"W ··
11 .0 12.0 <18 52 56 60 4.0 4 .4 4.8 140 180 220 260

FIGURE 1.7. Track records with Western Samoa removed: scatterplot matrix of
the national records. The results for the Cook Islands (CI) , North Korea (DRK)
and Mauritius (M) are Iabelied

which were analysed using the forward search by Atkinson and Riani (2000,
p. 95-8), a simpler model is indeed obtained by use of the reciprocal trans-
formation. For such a transformation to be possible the data have to be
non-negative, which they are here. If a transformation is useful, the data
will have skew marginal distributions. Information on the correct transfor-
mation is stronger if the data cover several cycles, that is powers of ten.
However, the ratios of the value of the maximum time to the minimum
time for each race in Table 1.1 suggest that little information will be avail-
able on the correct transformation of the shorter races. It may well be
that the marathon times are the only ones for which a transformation will
have any effect on our analysis. We accordingly analyse the reciprocals of
these record times in Chapter 3 and see, in Chapter 4 , whether we can
improve on this particular transformation of the data. The importance of
16 1. Examples of Multivariate Data

Europe

FIGURE 1.8. Europe, Italy and the Emilia-Romagna region

the forward search in this analysis is that it enables us to identify, and


discount if necessary, the effect of individual observations on the estimated
transformation.
A final comment about these data is that the communist countries of
Eastern Europe have not been shown as in any way egregious. Perhaps
this is because the data were collected in advance of the Olympic games.
More generally, it seems plausible that performance in such sporting events
depends on the prosperity, and to some extent the size, of a country and also
on the proportion of the national wealth that is devoted to athletics. If drugs
are used to improve the performance of all women athletes, the effect is to
move the country nearer the lower part of the multivariate distribution of
times. To detect such occurrences is not easy. One possibility is to compare
the male and female athletic records for the countries. But the power of
such a procedure would be badly affected if drugs, albeit different ones,
were also being used by these countries to improve the performance of
male athletes.

1.6 Municipalities in Emilia-Romagna


As a third example we look at a larger data set with 341 observations and
28 variables. As we shall see, it is hard to visualise the data, for example by
the use of a scatterplot matrix and some form of data reduction is needed.
We attempt this in Chapter 5 by the use of the forward search combined
with principal components analysis. Here we see what is revealed by the
forward search on all the data.
The 341 observations are from all the municipalities of Emilia-Romagna,
a generally prosperous region of Italy (Figure 1.8), south of Milan and
Venice, but north of Florence (Figure 1.9) .
1.6 Municipalities in Emilia-Romagna 17

Prann.&
MILAN ThePoDtHi.
Piana Emiti.ana VERONA VENI CE VENI CE

ANCONA

FIGURE 1.9. Road connections of Emilia-Romagna

The region is roughly reetangular (Figure 1.9), bounded on the east by


the Adriatic sea (Mare Adriatico) and, on the south west, by the Appenine
mountains (Appennino Tosco-Emiliano). The province is bisected by the
Via Emilia, a Roman road, on which there is a string of cities: Piacenza,
Parma, Reggio nell'Emilia, Modena, Bologna, Forll and Rimini. There is a
fertile agricultural plain between the cities and the sea, the plain continuing
up to the mountains, which rise from it without noticeable foothills (Fig-
ure 1.10). Many mountain municipalities are included in the region. There
are also two further cities, Ravenna near the coast and Ferrara in the plain.
The data cover all municipalities, which, as Table 1.2 shows, range widely
in size. Nearly all of the variables are indices in which counts have been
divided by municipal population. We hope for a multivariate normal dis-
tribution of these variables and so do not include population as a regressor
variable. We can however expect a large difference between the cities and
the country.
The data, taken from the 1991 Italian census, with some additional in-
formation, are summarised in Table A.3. The full data are available on the
website for this book. The variables are:

Yl: % population aged less than ten


y2: % population aged 75 or more
y3: % single-member families
y4: % residents divorced
Y5: % widows and widowers
18 1. Examples of Multivariate Data

FIGURE 1.10. The topography of Emilia-Romagna

TABLE 1.2. The municipalities in Emilia-Romagna with the largest and smallest
populations
Municipality Unit Population
Bologna 6 404,378
Modena 159 176,990
Parma 210 170,520
Ferrara 68 138,015
Ravenna 292 135,844
Reggio nell 'Emilia 329 132,030
Rimini 121 127,960

Caminata 239 319


Cerignale 245 317
Zerba 277 155

Y6: % population aged over 25 who are graduates


y7: % of those aged over six having no education
Ys= activity rate; % of those of working age in full-time employment
y 9 : unemployment rate
Yw: standardised natural increase in population
Yu: standardised change in population due to migration
Y12: average birth rate over 1992-94
1.6 Municipalities in Emilia-Romagna 19

y 13 : fecundity: three-year average birth rate amongst women of child-


bearing age
y 14 : % occupied houses built since 1982
Y15: % occupied houses with two or more WCs
Y16: % occupied houses with fixed heating system
Y17: % TV licence holders
Y1s: number of cars per 100 inhabitants
Y19: % luxury cars
Y2o: % working in hotels and restaurants
Y21: % working in banking and finance
y22 : average declared income amongst those filing income tax returns
Y23 : % of inhabitants filing income tax returns
y 24 : % residents employed in factories and public services
y 25 : % employees employed in factories with more than ten employees
y 26 : % employees employed in factories with more than 50 employees
Y27: % artisanal enterprises
Y2s: % entrepreneurs a nd skilled self-employed among those of working
age.

These 28 variables were selected from 50 available. The first 13 are de-
mographic variables, the next three, y 14 - Y16 , measure housing quality, the
succeeding seven, Yl7- Yn, are measures of individual income and wealth
and the last five , Y24- Y2s, relate to industrial production. The definitions
(and translations from Italian) arenot all unambiguous. For example, y 16
is intended to distinguish dwellings with central heating from those with-
out, whether the heating to the radiators is provided centrally to a block
of flats or provided by individual households in the flats. It is also not clear
whether variables like y 25 refer to all employed in the municipality, wher-
ever they live, or to all residents of the municipality wherever they work.
However, such details would be more important if comparisons were tobe
made with similar data from other countries. Provided that the same rules
were applied in collecting the data and calculating the indices in all 341
municipalities (which may be a strong assumption) comparisons between
the municipalities are possible and interpretable in terms of these variables.
We start with a forward search through all the data. The forward plot
of the minimum Mahalanobis distance amongst units not included in the
subset is in Figure 1.11 . The very sharp spike at the end of the search is
caused by units 245 and 277 which are the two smallest municipalities in
the region. The magnitude of these distances is surprising; with 341 units it
might be expected that an individual unit would have only a small effect.
However, there are 28 variables, so that there may be directions in the
space of these variables which are very sparsely filled, hence making such
changes possible.
The very large distances associated with units 245 and 277 are so large
as to make it hard to identify any further appreciable distances in the
20 1. Examples of Multivariate Data

li5

0
10

Cl
::;:
E $f
:::>
.5
c
::E
g -

0
C\1

;=

50 100 150 200 250 300 350


Subset size m

FIGURE 1.11 . Municipalities in Emilia-Romagna: forward plot of minimum dis-


tances of units not in the subset

rest of the plot. If these two units are removed from the plot, the rest of
Figure 1.11 becomes as shown in Figure 1.12. It is now clear that a number
of other, lesser, outliers are also entering at the end of the search. These
are alllisted in Table 1.3, starting with unit 277 (Zerba), the last to enter
the subset. The table therefore works downwards from the most outlying
community. The last community to be listed is the first to enter after the
local minimum in distances in Figure 1.12 at m = 324.
The results of Table 1.3 show that the search has found five (out of ten)
of the communities with populations less than 1,000. No cities have been
found to be outlying and, indeed, all the communities at the end of the
search are fairly small, although the last entry in the table, Bellaria-Igea
Marina, has a population of over 12,000 and one other has a population
of over 15,000. So the search seems not merely to have found the smallest
communities. In Chapter 3 we Iook for ways in which these units are out-
lying. Part of the challenge is that, with 28 variables, we cannot, as we did
in the two previous examples, study the scatterplot matrix and highlight
units to see in what way they are outlying. For an understanding of the
general structure of the data we need to estimate any patterns once the
outliers have been detected. For example, many of the outlying commu-
nities fall in the mountainous area in Figure 1.10, as do many other poor
communities. One possible approach, which we do not explore, is to see
whether elevation above sea Ievel can be used to predict the properties of
the communities. Instead, in Chapter 4, we investigate transformation of
1.6 Municipalities in Emilia-Romagna 21

;:!

0
::;:
E
::> ~
E
·c:
~

<X>

50 100 150 200 250 300


Subset size m

FIGURE 1.12. Municipalities in Emilia-Romagna with the two least populous


units removed (245 and 277): forward plot of minimum distances of units not in
the subset

TABLE 1.3. Municipalities in Emilia-Romagna: the last 16 communities to enter


the forward search and their populations
Subset size m Unit No. Community Population
341 277 Zerba 155
340 245 Cerignale 317
339 70 Goro 4,410
338 239 Caminata 319
337 310 Casina 4,055
336 260 Ottone 891
335 250 Ferriere 2,675
334 30 Granarolo dell'Emilia 6,934
332 2 Argelato 7,727
331 133 Torriana 1,002
330 264 Piozzano 750
329 188 Bore 1,056
328 194 Compiano 1,080
327 149 Fiorano Modenese 15,644
326 238 Calendasco 2,170
325 88 Bellaria-Igea Marina 12,813
22 1. Examples of Multivariate Data

the data and, in Chapter 5, achieve appreciable simplification by the use


of principal components analysis on the transformed observations.

1. 7 Swiss Bank N otes


The three examples so far have all consisted of measurements on a single
population, even though several outliers might be expected. We now look
at some data in which there are at least two populations. The example illus-
trates how different aspects of the data are illuminated by various starting
points. In particular, starting the search within one of the groups clearly
reveals the existence of other groups.
The data are readings on 200 Swiss bank notes, 100 of which aregenuine
and 100 forged. However, the structure of the samples may not be quite
that simple. The forged notes have all been detected and withdrawn from
circulation. To provide a useful comparison, the genuine notes are likewise
used. So some of the notes in either group may have been misclassified. A
second complication is that the forged notes may not form a homogeneous
group. There may be more than one forger at work and a single forger
may have short print runs before repeatedly moving premises in order to
avoid detection. There may be systematic differences between print runs.
In general, we might expect that the quality control during the production
of genuine notes would be tighter than that on forged notes, so that the
variance in the production of a single run of forged notes would be higher
than that for genuine notes.
The data therefore present a number of statistical problems, which we
explore in later chapters. If the data can be divided into two groups, perhaps
after the removal of outliers, one problern is that of discriminant analysis,
Chapter 6, in which a rule is required for classifying any further suspect note
as genuine or not. A second problem, that of duster analysis, Chapter 7,
likewise tries to divide the data into groups, but without knowledge of what
the groups should be. Here we know that there are virtually 100 genuine
notes and 100 forged, and which they are, so that each banknote is labelled
as belonging to one of the two groups. But it may be that the group of
forged notes can be further divided into several subgroups which might, for
example, b e the work of different individuals. In this problern t he labels
are not known, nor are the numbers in the groups, nor how many groups
there are.
In this preliminary chapter we explore what information can be extracted
by fitting a single multivariate normal model to all 200 Observations. As we
shall see, this activity is surprisingly informative.
The data are in Ta ble A.4. The bank note in question is a 1,000 franc
note from before the Second World War. Within a broadly reetangular
frame, the edges of which are formed by arcs of overlapping circles, a group
1.7 Swiss Bank Notes 23

of men, in baroque poses, pour molten metal into moulds. The data, and
a reproduction of the bank note, are given by Flury and Riedwyl (1988,
pp. 4-8). The analysis of these data forms the central example of their book.
The six variables are measurements of the size of the bank notes:

Y1: length of bank note near the top


Y2: left-hand height of bank note
y3: right-hand height of bank note
y4 : distance from bottom of bank note to beginning of patterned border
y 5 : distance from top of bank note to beginning of patterned border
Y6: diagonal distance.

Flury and Riedwyl (1988, p. 4) illustrate where on the bank note the
measurements are taken: the first three are measurements of paper size
and the fourth and fifth measurements from the edge of the paper to the
printed area. Only y 6 is solely on the printed area: it measures the diagonal
distance across the frame of the central illustration.
A scatterplot matrix of the data is given in Figure 1.13 in which the
two groups have different symbols. This plot already shows much of the
structure. One feature is that y 6 seems to provide an almost complete
separation between the two groups. Complete separation occurs in several
of the scatterplots, particularly clearly in that of Ys against Y6· It seems to
be easier for forgers to get the paper size right - y 1 to y 3 - than to get the
image size correct. If the distance from the top of the image, y 5 is correct,
as is shown by the overlap of the first scatterplots in the row for y 5 , the
distance from the bottom to the image, y 4 is then wrong. As the plots
of Y6 show, the forgeries tend to have too small an image. Being slightly
too small is a feature of forgeries of bronze statues made from moulds
taken from the original statue, because the bronze shrinks on cooling after
casting. However, it is not clear what causes the shrinkage in the banknotes.
A second feature of the data is that the group of forgeries does indeed not
look homogeneous. Particularly in the plot of y4 against y 6 , it looks as if
the forgeries may split cleanly into two groups.
We begin our forward analysis with a search which treats the observa-
tions as if they came from a single population. In previous examples we have
looked at plots of Mahalanobis distances. Instead we here start by monitor-
ing the parameter estimates that go into the calculation of the distances.
Figure 1.14 is a forward plot of the estimates of the individual elements of
the 6 x 6 covariance matrix from a search that starts with an initial subset
containing units from both groups. This plot is extremely stable, giving no
indication of the presence of the two groups.
Although this search fails to reveal the two groups it does reveal many
outliers. Their existence is exhibited in Figure 1.15 by the forward plot
of the minimum Mahalanobis distance amongst observations not in the
subset. This suggests that there is a well defined duster of outliers, the
24 1. Examples of Multivariate Data

1300 131 .0 7 8 9 10 12 138 140 142

.
0

~~~~J~~~==~~r=~~~~~~~ N
i!
0

~
0

!!! 0

i!

a
0

214.0 21 5 5 129.0 130 0 131 0 8 g 10 11 12

FIGURE 1.13. Swiss bank notes: scatterplot matrix of the six measurements on
200 notes. The filled circles are units in Group 1, the notes believed genuine

first of which enters at m = 180, making 21 in all, the first sharp increase
in Figure 1.15 occurring at m = 179. That there is a duster of outliers is
indicated by the partial decline of these distances as similar observations
enter. At the very end of the search the larger values for the distances are
for more remote observations. It is surprising that the parameter estimates
in Figure 1.14 are so stable. However, the plot does show an increase in
elements (4,6) and (6,6) from around m = 180. Wehave already seen in the
scatter plot that the structure of the second group is most clearly displayed
in the plot of y4 against Y6 · The last observations to enterare indeed many
of those most remote in the second group, which are most extreme in y6
and so will affect these two elements of the variance-covariance matrix.
This search fails to reveal the two groups because it starts with a subset
containing observations from both groups. The fitted model is thus centred
between the two groups. We now see what happens if we start the search
1.7 Swiss Bank Notes 25

,_, ......... ,
........... , ....... --- _, ... _,,- .... --"-"" _.... _, _____ __..,_.,._, ___ ... ___ _
....

,,, '
I
I
I
I
I
I
,..-6,6
I __ ..,
r -~-------------------------------//
I _______________________ , __ _-------·

=
~-----

.,.".----
-: . _;. . -- .... ._ .......-"!.••------ .... - - - - - --..-- .........•.. .r ••••.

s; ___=< __
'-

,... ... .... ==


--- ....
-wma:

------ ~ --- --------------------------------

. . 4 ,6

..... ... ... . '.


L
50 100 150 200
Subset size m

FIGURE 1.14. Swiss bank notes: forward plot of elements of the covariance ma-
trix. A stable plot which does not suggest the presence of two groups

ll) ll)
o.ri o.ri

0 0
o.ri (/)
o.ri
c.

*
0
:::;: ,..:
ll) ll)
,..:
0
c: (")
:E u;
"'
0 0
,..: --' ,..:

ll) ll)
,..; ,..;

0 0
,..; ,..;
50 100 150 200 170 175 180 185 190 195 200
Subset size m Subset size m

FIGURE 1.15. Swiss bank notes: forward plot of minimum distances of units not
in the subset, left-hand panel, and a zoom taken in the last 30 steps

in one or other of the two groups. First we describe the results from a
forward search starting with the first 20 observations on supposedly genuine
notes. Figure 1.16 shows that, until m = 100, the elements of the variance-
covariance matrix are small, as the data are being fitted to a single group.
When m = 101 the search will have reached a point at which at least one
observation from Group 2 has to joint the subset. The figure clearly shows
the effect of the mingling of the two groups. As soon as units from the
26 1. Examples of Multivariate Data

"'
( --,------------------------ ~.·1
'
!'' _. . . . . .
-
--6.6

~ I :r---------------
t
(.) 1 'I
0
c"'
~/ L::--------------------------5.5
"'
E F~:~~-E~~~~.:-,.-.-:~---"··----------------"·~::-.:;;-. ·t:~
m 0 __..

~---------------------------------.:~
I .,
!i:~ '
J ·.~_.-- ---------------------- -- . . -.-s.s

6.4

l r- ---r
50 100 150 200
Subset size m

FIGURE 1.16. Swiss bank notes starting with the first 20 observations on genuine
notes: forward plot of elements of the covariance matrix. A plot which, unlike
Figure 1.14, clearly suggests the presence of two groups

second group enter, there is a rapid increase in the values of the elements,
especially those associated with variables 4 and 6. The increase is rapid
because many of the units from Group 1 leave the subset as an appreciable
number from Group 2 enter. This interchange occurs because, once the
model is fitted to units from both groups, many units in Group 1 now seem
to be outliers when judged by the common mean and variance-covariance
matrix.
Similar remarks apply to the forward plot of minimum Mahalanobis
distances in Figure 1.17. This shows that, just before m = 100, the last
few observations to join the subset are remote, as judged by the variance-
covariance matrix plotted in Figure 1.16. But, once units from Group 2,
the forgeries, enter the subset, and some from Group 1 leave, these, and
other, units are no Ionger so remote as measured using the larger elements
shown in Figure 1.16. The distances accordingly decrease. They only in-
crease again towards the end of the search as, in the main, the outliers from
Group 2 enter. The last third of Figure 1.17 is identical to the last third of
Figure 1.15, an indication of the stability of the end of the forward search
to very different starting points.
The next two plots we consider come from the complementary start of
the search solely with units from Group 2, which is less concentrated than
1.7 Swiss Bank Notes 27

L()

Cl
::;
E
..,.
::J
E
·c:
:E

"'

50 100 150 200


Subset size m

FIGURE 1.17. Swiss bank notes starting with the first 20 observations on genuine
notes: forward plot of minimum distances of units not in the subset. The two
groups areevident

Group 1. We can therefore expect that the evidence for two groups will be
slightly weaker in these new plots than it was in those we have just seen.
The forward plot of the elements of the variance-covariance matrix, Fig-
ure 1.18, is similar to that when the start of the search was in Group 1,
Figure 1.16, but reflects the more dispersed nature of Group 2, which is
evident in Figure 1.13. Initially the elements involving variables 4 and 6 are
larger in magnitude than they were before. The effect of including outliers
from Group 2 and observations from Group 1 also has a moregradual effect
than it did before. The forward plot of minimum Mahalanobis distances,
Figure 1.19, shows a sharp peak at m = 84, just before the first outlier from
Group 2 enters the subset. As several similar observations successively en-
ter, the distance to the next unit rapidly decreases. The search finishes as
before.
The forward plots of minimum Mahalanobis distances lead to the iden-
tification of 21 outliers. In order of entry from m = 180, these are units
50, 5, 13, 70, 194, 111, 168, 116, 138, 187, 148, 192, 162, 182, 160, 180,
161, 1, 167, 171 and 40. Of these 21 observations only 6 come from Group
1. Figure 1.20 repeats the scatterplot matrix of Figure 1.13 with these 21
outliers marked. Several of the Group 1 outliers, 1, 13, 40 and 50 show on
the plots of the first three variables. Unit 5 has a very low value of y 5 .
With the possible exception of unit 1, these all seern like clear outliers from
28 1. Examples of Multivariate Data

C\1
,'

''

~
, ''

>
; I
, ______________ ,,, ~-6,6

0
(.)

- ,, ___ ... _____ ,_.;
,,
,,'
/1
I
/-

_.",.--------------------------5.5
~~~~-=~~;;···.-:-~-----------·;. ---1:3
c"'
Q)
E
.!! 0
w
'
I
-- -·
................ ______ ..... ..,..,..- .. , . , . " ' " " ' ' ' ....... ..., _ _ _ _ _ _ _ _ _ _ _ _ .,. _ _ _ _ _ _ _ .., _ _ ..__
g-~

'-6.5

. . 6 ,4

50 100 150 200


Subset size m

FIGURE 1.18. Swiss bank notes starting with units in Group 2 (the forgeries):
forward plot of elements of the covariance matrix. The presence of two groups is
evident

"'

I()

0
:::;;:
E
:::1
E
....
·c:
~

C')

C\1

50 100 150 200


Subset size m

FIGURE 1.19. Swiss ba nk notes starting with units in Group 2 (the forgeries) :
forward plot of minimum distances of units not in the subset
1.7 Swiss Bank Notes 29

"...
N

.
0

.-~--~~~~~~====~~~==~~~==~~==~~ N

~ 0
..,
0

8
0

~
~

!!

::
!!

214 .0 215.S 8 9 10 11 12

FIGURE 1.20. Swiss bank notes: scatterplot matrix of the six measurements on
200 notes. The last 21 units to enter the forward search

Group 1 and Group 2. Unit 1 looks as if it may, in fact, be closer to Group


2. The 14 outliers from Group 2 areshownon the plot to form a coherent
group. The existence of three groups is clearest from the plots including Y6:
the separation is greatest on the plot ofthat variable against y 4 . It seems
as if unit 70 may have been misclassified as well as unit 1.
An important aspect of this example is that it shows that different infor-
mation can be obtained from different searches. If there are several groups,
starting in each group in turn may illuminate different aspects of the data.
Here, starting in the more compact first group was the most informative of
the three searches that we pursued. But we are left with many questions
about the structure of these data. One is whether there are two or three
clusters: the methods of duster analysis, Chapter 7, can help in answering
this question. Another refers to one use which could be made of the data,
that is to provide a rule to determine whether a bank note not from the
30 1. Examples of Multivariate Data

sample is genuine or a forgery. Providing a good rule involves answering the


question of whether some of the units have, in fact, been wrongly assigned
to the genuine or forged groups. These questions will be considered further
in Chapter 6 on discriminant analysis.

1.8 Plan of the Book


This chapter has given an introduction to the use and properties of the
forward search through the analysis of four examples. For each example we
have discussed the background to the data as well as illustrating the use of
the search. In Chapter 2 we discuss in detail the algorithm on which the
forward search is based and present distributional results both for Maha-
lanobis distances and for some test statistics. In Chapter 3 we return to
the analysis of data by fitting a single normal distribution. We extend the
analysis of the four examples we have just seen in this chapter. One of these
extensions is to use plots of the distances for individual units during the
search. Another extension is to an analysis of the reciprocals of the times in
the data on national track records for women. Chapter 4 is concerned with
the more general transformation of multivariate data. We concentrate on
the parametric family of transformations introduced, for univariate data,
by (Box and Cox 1964). Such transformations can be strongly influenced
by a few outliers. The contribution of the forward search is to establish a
transformation that is satisfactory for the bulk of the data while revealing
the influential outliers. Three new examples are analysed, including one
with a regression structure.
Chapter 5 is the last in which we assume that the model is a single
multivariate normal distribution. We try to reduce the dimension of this
distribution by the use of principal components analysis. The next two
chapters are both concerned with models containing several normal dis-
tributions. In Chapter 6 on discriminant analysis, the data come divided
into g groups and the purpose of the analysis is to provide a rule for al-
locating new observations to the appropriate group. Of course, the groups
may have been misspecified, so we use the forward search to establish a
clear classification of the data, determining the effect of, for example, any
outliers that may be present. The problern of duster analysis, which forms
the subject of Chapter 7 is similar, except that the number of groups and
their membership are both unknown. Again the forward search illuminates
the exploration of the structure in the data. In the final chapter we turn
to the rather different subject of the analysis of spatial data.
2
Multivariate Data and the Forward
Search

Unlike the other chapters in t he book, this chapter contains little data
analysis. The emphasis is on theory and on the description of the search. In
the first half of the chapter we provide distributional results on estimation,
testing and on the distribution of quantities such as squared Mahalanobis
distances from samples of size n . The second half of the chapter focuses on
the forward search
We start in §2. 1 by recalling the univariate normal distribution. Sections
2.2 and 2.3 outline estimation and hypothesis testing for the multivariate
normal distribution. As we indicated in Chapter 1, forward plots of Maha-
lanobis distances are one of our major tools. Since the distribution theory
for these distances seems, to us, not to be clear in the literature, we devote
§2.4 to §2.6 to deriving the distribution using results on the deletion of
observations. As a pendant, in §§2. 7 and 2.8, we derive this distribution
first using an often quoted result of Wilks and then for regression with a
multivariate response. The subject of the following three sections is also re-
gression. In §2.9 we introduce added variables which provide useful results
for t ests for transformations. These results are applied in §2.10 to the mean
shift outlier model to provide an alternative derivation of deletion results
which is useful in the analysis of spatial data, Chapter 8. This part of the
chapter closes in §2.11 where we outline seemingly unrelated regression, a
simplification of the results for multivariate regression when each model
contains the same explanatory variables.
A general discussion of the forward search is in §2.12. Three aspects
of the search require special attention: how to start, how to progress and
what to monitor. These three are treated in detail in §§2.13 a nd 2.14. The
32 2. Multivariate Data and the Forward Search

final theoretical section is on the modifications necessary, particularly to


the starting procedure, when we have multivariate regression data. The
chapter concludes with suggestions for background reading. We do not
discuss in detailalternatives to the forward search. In particular §§4.6 and
4. 7 of Atkinson and Riani (2000) contain examples in which the forward
search breaks the masking which defeats backwards deletion methods. In
this regard, we rest our case.
We close our introduction to this chapter, by recalling our advice in the
Preface: "If you feel you know enough statistical theory for your present
purposes, continue to Chapter 3."

2.1 The Univariate Normal Distribution


2.1.1 Estimation
Let y = (YI, ... , Yn)T be a random sample of n observations from a uni-
variate normal distribution with mean f.L and variance a- 2 . Then the density
of the ith observation is
2 1 2 2
f(yi; f.J,, O" ) = ~ exp{ -(Yi- f.L) /(2o- )}.
2rro- 2

Sometimes we write this distribution as Yi ,. . ., N(f.L, o- 2 ). The loglikelihood


of the n observations is
n
L(f.L,o- 2 ;y) = - 2)Yi- f.L) 2 /(2o- 2 ) - (n/2)log(2rra- 2 ).
i=l

The maximum likelihood estimator of f.L is


n
P,= Y = LYdn,
i=l

the sample mean. The sum of squares about the sample mean
n n n
S(p,) = L(Yi- P,) 2 = L(Yi- Y) 2 = LYl- nf)2 ,
i=l i=l i=l

leads to estimation of o- 2 . The residual mean square estimator is

s 2 = S(P,)/(n- 1),

which is unbiased. Maximum likelihood produces the biased estimator

if 2 = S(p,)fn.
2.1 The Univariate Normal Distribution 33

Obviously
s2 = _n_a2
n-1
An alternative way of writing the distribution of the Yi is

where now the independent errors €i ,....., N(O, a 2 ). These errors are estimated
by the least squares residuals

(2.1)

2.1. 2 Distribution of Estimators


The sample mean from a normal distribution is itself normally distributed

and the residual sum of squares has a scaled chi-squared distribution, so

(2.2)

The leastsquaresresidual ei (2.1) isalinear combination of normally dis-


tributed random variables, so is itself normally distributed, with mean zero.
The variance can easily be shown (Exercise 2. 2) to be a 2 ( 1-1 j n). Therefore

(2.3)

For multivariate data we are interested in the distribution of squared Ma-


halanobis distances which, in the univariate case, reduce to the squared
scaled residual
(2.4)
For a sample from a normal population, fi and s 2 are independently dis-
tributed, leading for example to the t distribution for the statistic for testing
hypotheses about the value of J.L. However Yi and s 2 are not independent
of each other and the distribution of dT
requires some derivation. Since

L er= L(Yi- fi) =(n- 1)s


n n
2 2
i=l i=l

it follows that
n n
LdT = :Le7/s2 = n -1,
i=l i=l

so the dr
must have a distribution with a limited range. The results of
Cook and Weisberg (1982, p. 19) show that this distribution is, in fact,
34 2. Multivariate Data and the Forward Search

a scaled beta. In §2.6 we obtain the related result for the multivariate
Mahalanobis distance. But a couple of preliminary distributional results
for the univariate case are helpful.
If both J-L and f7 2 are known

(2.5)

If now J-L is still assumed known, but f7 2 is estimated by s~, an estimate on


v degrees of freedom which is independent of Yi,

(2.6)

given the identity between the square of a t random variable and an F


variable on 1 and v degrees of freedom. Both the chi-squared and the F
distributions are often used as asymptotic approximations to the distri-
bution of the squared Mahalanobis distances, for example in probability
plotting. One source for s~ would be the results of a set of readings differ-
ent from those for which the Mahalanobis distances were being calculated,
but taken from the same population. A second source is to use the deletion
estimate szi)in which the ith Observation is excluded from the data. As a
result, the estimate of f7 2 is independent of Yi· This is the path we follow
to find the distribution of the Mahalanobis distance for multivariate data.
The deletion results that we need are gathered in §2.5.

2.2 Estimation and the Multivariate Normal


Distribution
2.2.1 The Multivariate Normal Distribution
For multivariate data let Yi be the v x 1 vector of responses forming the
observation from unit i, with Yii the observation on response j. There are
n observations, so the data form a matrix Y of dimension n x v, with ith
row y[. The mean of Yi, i = 1, . . . , n is the v x 1 vector J-L and the v x v
covariance matrix of the data is E. If Yi has a v-variate normal distribution,
the density is

The multiplicative constant before the exponent may also be written as


(27r)-vf 2 IE!- 112 . Sometimes we write this distribution as Yi ,...., Nv(J-L, E),
omitting the v if the dimension is obvious.
The loglikelihood of the n observations is
n
L(J-L, E; y) = -(n/2) log I27TEI- L(Yi- J-LfE- 1 (yi- J-L)/2. (2.8)
i=l
2.2 Estimation and the Multivariate Normal Distribution 35

r
The maximum likelihood estimator of the vector f.t is now

[< ~ ~ (t•ufn,
fj ,t···fn
Alternatively, if J is an n x 1 vector of ones with Y, as before, n x v,

(2.9)

This vector of sample means has a normal distribution

2. 2. 2 The Wishart Distribution


The matrix of sums of squares and products about the sample means is the
v x v matrix S (P,) with elements
n
sjk(fl) = L(Yij- flj)(Yik- flk). (2.10)
i=l

In (2.2) the residual sum of squares had a scaled chi-squared distribution.


The multivariate generalization is that

S(fl) "'Wv(~,n -1), (2.11)

the v-dimensional Wishart distribution on n- 1 degrees of freedom . Just


as the chi-squared distribution can be defined in terms of sums of squares
of independent normal random variables, so the Wishart distribution is de-
fined in terms of sums of squares and products of independent multivariate
normal random variables.
If the rows y'[ of the n x v matrix Y are distributed as Nv(O, ~), M =
yrcy"' Wv(~,n). We need two results to extend this definition to the
sample sum of squares and products matrix (2.10). The first isthat M =
yrcy"' Wv(L:,r) if and only if Cis symmetric and idempotent, where r
= tr C = rank C.
The second extension is to variables with non-zero mean. We now Iet Yi
be distributed Nv(J.ti, ~) . Then, in addition to the idempotency condition,
M = yrcy "' Wv (~, r) if and only if E( GY) = 0.
Some derivations are in Mardia, Kent, and Bibby (1979), particularly
§3.4 and Exercise 3.4.20.
To verify that this condition is satisfied for S(P,) (2.10) it is convenient
to use the matrix notation introduced in (2.9). We write
n
S(P,) = L(Yi- P,)(yi- fl)T = ET E, (2.12)
i=l
36 2. Multivariate Data and the Forward Search

where E is the n x v matrix of residuals. Then with Y the n x v matrix of


fitted values,

E = Y-Y
Y-Jp7
= Y- JJTY/n
(I- H)Y =GY,

where H = J JT / n. As a result

(2.13)

Since Cis symmetric and idempotent (Exercise 2.3)

S(fl) = ErE = yrcrcy = yrcY,

a quadratic form in Y, similar to those following the Wishart distribution.


But we also require that E( CY) = 0. Since

E(Y) = JJ.17, (2.14)

and
CJ = J- JJT Jjn = J- J = 0,

it follows straightforwardly that

E(CY) = CE(Y) = CJJ.17 = 0.

Further, since tr C = n- 1, the distributional result stated in (2.11), that


S(jl)"' Wv("E ,n- 1), holds.

2. 2. 3 Estimation of :E
The maximum likelihood estimator of "E is (Exercise 2. 7)

f: = S(fl)/n, (2.15)

which is biased. The unbiased, method of moments, estimator we denote

f:u = S(jl)j(n- 1). (2.16)

From (2.11) we have the distributional results that

nf: = (n- 1)f:u = S(jl) "'Wv("E, n- 1). (2.17)


2.3 Hypothesis Testing 37

2.3 Hypothesis Testing


2.3.1 Hypotheses About the Mean
The maximum likelihood estimator of the means [1, was defined in (2.9)
as the vector of sample means of each response. We derive the maximum
likelihood test of the hypothesis

DJ.L=C, (2.18)

where D is an s x v matrix of full row rank s and c an s x 1 vector of


constants, both specified by the null hypothesis. One hypothesis sometimes
of interest is that all means have the same unspecified value, when s = v- 1
(Exercise 2.8).
The maximum likelihood estimator of I: was defined in (2.15) as f:: =
S([l,)jn = ET E jn. Substitution of this estimator, together with [1,, into
(2.8) yields the maximised loglikelihood
n
L({L , f::; y) -(n/2) log l2nf::l- 2)Yi- {L)Tf:- 1 (Yi- [1,)/2
i=l

-(n/2) log l2nf::l- n tr E(ET E)- 1 ET /2


-(n/2) log l2nf::l - nv/2. (2.19)

Let the null hypothesis (2.18) be that J.L = /LO· The residuals under this
hypothesis are E 0 yielding via (2.12) a maximum likelihood estimator f:: 0
of I:. The maximised loglikelihood (2.19) becomes

L({Lo, f::o ; y) = -(n/2) log l2nf::ol- nv/2.

Then the differences of maximised loglikelihoods

TLR = 2{L([l,, f::; y)- L({Lo, f::o; y)} = nlog(lf::ol/lf::l), (2.20)

has asymptotically a chi-squared distribution on v degrees of freedom when


the null hypothesis is true. This statistic is the likelihood ratio test for the
hypothesis J.L = J.Lo. It is sometimes, here and elsewhere, referred to as the
likelihood ratio test and is particulary used in Chapter 4 where we are
testing hypotheses about transformations of the data.

2.3.2 Hypotheses About the Variance


Most of the examples in the first five chapters of the book are for data in
which there is one multivariate normal population, although the data on
Swiss bank notes appears to consist of two, or perhaps three, populations.
In Chapter 6 on discriminant analysis we have at least two populations, the
analysis of which is simplified if all populations have the same covariance
38 2. Multivariate Data and the Forward Search

matrix. One of the aims of data transformations in discriminant analysis is


to achieve such equality of covariance matrices. We now present a test of
this hypothesis.
Suppose there are g groups of v dimensional observations, with nt obser-
vations in the lth group. The maximum likelihood estimator of I: in group
l is denoted "t 1 and the pooled estimator over all groups is

where n = 2.:y= 1n1 . (2.21)

The likelihood ratio test for the hypothesis I: 1 = ... = I: 9 = I: is (Exer-


cise 2.10)

n log IL:t=~ nt I:t


g ' I- t; nt
g
I
log "ttl· (2.22)

Asymptotically (2.22) will have a null chi-squared distribution on (g-


1 )v( v + 1) /2 degrees of freedom. An asymptotically equivalent statistic
with improved distributional properties for small samples was found by
Box (1949) who scaled (2.22) with the numbers of observations replaced by
the degrees of freedom, giving the statistic

(2.23)

where
1 g
~ I:vl"tul
1=1

2.:T=1 .L:Z~1 (Yiz -Ih)(Yiz -Ih)r


2.:T=1 V!
is the within groups unbiased estimator of the covariance matrix. Strictly
we should write "twu, but we always divide this sum of squares by the
degrees of freedom. The notation Yit shows that observation i belongs to
group l. The factor r in equation (2.23), calculated to improve the chi-
squared approximation, is given by

2v 2 + 3v - 1 (
r=1-~
6(v + 1)(g- 1)
I:v -v-
!= 1
9
1- 1 1) . (2.24)

In (2.24) the degrees of freedom are


g

V! = n1 - 1 and v = '2:: V! = n - g
!=1
2.4 The Mahalanobis Distance 39

for a model in which only a constant J.L is fitted to each mean. Further
degrees of freedom are lost if the covariance matrices are calculated from
residuals from regression (§2.8).
With this result on the test of equality of covariance matrices we have
the results we need on estimation of J.L and E and for testing hypotheses
about their values. All are based on aggregate statistics summed over the
data. One use of the forward search is to see how these quantities vary
as we increase the number of observations in the subset. We shall look
at forward plots of several test statistics, particularly, in Chapter 4 on
transformations. But now we consider some statistical properties of the
Mahalanobis distances for individual observations.

2.4 The Mahalanobis Distance


This book contains many forward plots of Mahalanobis distances. As we
shall see, these can be highly informative about the structure of the data.
In this and the succeeding four sections, we derive a series of results about
the distribution of the squared distances which, unlike the distances them-
selves, have a tractable distribution. If we require numerical values for the
distribution of the distances themselves, we can proceed as we shall do
in the construction of the boundaries for the forward plot of distances in
Figure 3.6, using the square root of the values from the distribution of the
squared distances. In Figure 3.6 these are taken from the asymptotic chi-
squared distribution of the squared distances shown in Figure 3. 7. Here we
find the exact distribution of the squared distances.
The squared population Mahalanobis distance of the ith observation,
that is the distance when J.L and E are both known, is

(2.25)

the generalization of the univariate result in (2.5). If E in this expression is


replaced by f:v, an unbiased estimator of E on v degrees of freedom which
is independent of Yi, the distance

(2.26)

where T 2 (v, v) is Hotelling's T 2 with parameters v and v. This is the gen-


eralization of the result for the univariate squared scaled residual in (2.6)
which followed a squared t, or F, distribution. Here the F distribution
arises since
2 vv
T (v, v) = Fv v - v+l · (2.27)
v - v+l '
Hotelling's T 2 distribution is described in §3.5 of Mardia, Kent, and Bibby
(1979).
40 2. Multivariate Data and the Forward Search

In the analysis of examples in this book we use the squared Mahalanobis


distance
(2.28)

or its square root di, in which both the mean and variance are estimated.
As was argued above for the squared scaled residual (2.4), the distribution
of this squared distance is affected by the lack of independence between Yi
and the estimators of f.L and E .
We obtain the distribution of the squared Mahalanobis distance in two
steps using the deletion of observations. If f.L and E are estimated with ob-
servation i deleted, the results of (2.26) and (2.27) indicate that the deletion
distance will follow an F distribution. We find an expression for the squared
distance (2.28) as a function of this deletion distance and then rewrite the
F distribution as a scaled beta to obtain the required distribution. We start
with deletion results.

2.5 Some Deletion Results


2. 5.1 The Deletion M ahalanobis Distance
Let fl(i)> often read as "mu hat sub i", be the estimator of f.L when obser-
vation i is deleted and, likewise, let ~u(i) be the unbiased estimator of E
based on the n - 1 Observations when Yi is deleted. The squared deletion
Mahalanobis distance is

d2(i) = ( Yi- A
f.L(i)
)T~-1 (
u(i) Yi- A
f-L(i)
) T ~-1
= e(i)u(i)e(i) · (2.29)

We first find an expression for the residual e(i) .


Consider just the jth response. Then the residual for Yij, when Yi is
excluded from estimation, is

eij(i) Yij - Y·j(i)


n
Yij- (LYlj- Yij)/(n- 1)
1=1
= (nyij- nfJ-J)/(n- 1). (2.30)

So, for the vector of residuals in (2.29)

(2.31)
2.5 Some Deletion Results 41

We also need the residual for any other observation Yl, when Yi is ex-
cluded. In a similar manner to (2.30)

elj(i) Ylj - Y·j(i)


n
Ylj- (L:>Ij- Yij)/(n- 1)
1=1
Ylj- Y·j + (Yij- fi.j)/(n- 1)
elj + eij/(n- 1). (2.32)

Now, for the vector of responses

el(i) = e1 + ed(n- 1). (2.33)


These results yield expressions for the change in the sum of products
S(fl) and so in 'Eu(i) on the deletion of Yi· For all observations, an element
of S(fl) can, from (2.12), be written as
n
S(fl)jk = L eljelk· (2.34)
1=1

When observation i is deleted, the residuals change and one term is lost
from the sum. Then
n
S(jl)jk(i) L elj(i)elk(i)
I,.Oi=1
n
L {elj + eij/(n- 1)} {elk + eik/(n- 1)}
I,.Oi=1
n
L eljelk- neijeik/(n- 1),
1=1

since the residuals sum to zero over l. Therefore

(2.35)

The unbiased deletion estimator of I; is

(2.36)

To find f;~(~) we need a general preliminary result.

2.5.2 The (Bartlett)-Sherman-Morrison- Woodbury Formula


The estimator of I; is a function of the matrix of sums of squares and prod-
ucts S(fl)(i) defined in (2.35). Since we require the inverse of this matrix,
42 2. Multivariate Data and the Forward Search

we need an explicit expression for (ET E- aeief)- 1 . In the development of


deletion methods for diagnostics in regression similar results are provided
by the Sherman-Morrison-Woodbury formula, sometimes with the name of
Bartlett added. It is customary to state the result and then to confirm it
gives the correct answer. See, for example Atkinson and Riani (2000, p. 23
and Exercise 2.11). Herewe give a constructive derivation due to M. Knott.
Let X be n x v, with ith row xT, and let C = xr X. A simplified version
of the required inverse is

(2.37)

so that
BC-BxxT =I. (2.38)
Postmultiplication of (2.38) by c- 1 and x, followed by rearrangement leads
to
(2.39)
Substitution for Bx in (2.38) together with postmultiplication of both sides
by c- 1 ' leads to the desired result

(2.40)

on the assumption that all necessary inverses exist. A similar derivation


can be used to obtain the more general result in which xT is replaced by a
matrix of dimension m x v, resulting from the deletion of m rows of X.

2.5.3 Deletion Relationskips Among Distances


Let C = ET E. Then
Eu= Cj(n- 1) (2.41)
and, from (2.28), the Mahalanobis distance

dT = (n- 1)efC- 1ei. (2.42)

It is convenient to write

erc- 1 ei = dTJ(n- 1) as 9i and a = nj(n- 1). (2.43)

From (2 .35) and (2.40)

s-l (P)(i)
(2.44)

Then

(2.45)
2.6 Distribution of the Squared Mahalanobis Distance 43

Finally we combine the definition of dzi) (2.29) with that of 'Eu(i), the
unbiased deletion estimator of :E to obtain

(n- 2)n2 T - l ,
(n _ 1) 2 ei S (J.L)(i)ei
(n- 2)n 2 dr (2.46)
(n- 1)3 1- ndTf(n- 1)2 ·

The inversion of this relationship provides an expression for dr as a function


of the squared deletion distance

(2.4 7)

2.6 Distribution of the Squared Mahalanobis


Distance
In the deletion Mahalanobis distance :E is estimated on v = n - 2 degrees
of freedom. So, from (2.27), the distance with known mean dT(J.L, 'E,_,) in
(2.26) has the scaled F distribution

2 , v(n-2)
di (J.L, :Ev)"' Fv ,n-v-1· (2.48)
n-v-1
But the squared deletion distance is a quadratic form in Yi - Y(i) whereas
in (2.26) we have a quadratic in Yi - J.L. But, as in (2.3), the variance of
Yi- Y(i) is nj(n- 1) timesthat of Yi- J.L. The distribution of the deletion
Mahalanobis distance is then given by

2 n v(n-2)
d(i) "' (n- 1) (n- V - 1) Fv,n-v-1· (2.49)

To find a compact expression for the distribution of the squared Maha-


lanobis distance it is helpful to write the F distribution as a scaled ratio of
two independent chi-squared random variables

x~/v
Fv,n-v-1 = 2
Xn-v-l
/( _
n V
_ 1) ·

Then, from (2.49)


d2 "' n(n- 2) X~ (2.50)
(t) n- 1 x2n-v-1 '
44 2. Multivariate Data and the Forward Search

where, again, the two chi-squared variables are independent. It then follows
from (2.47) that the distribution of the Mahalanobis distance is given by

d 2 "" (n - 1 )2 X~
2 2 . (2.51)
t
n Xv + Xn-v-l
We now apply two standard distributional results. The first is that

Xv =
2 r (LI2' 21) .
The second is that if X 1 and X2 are independently Gamma(p, A) and
Gamma(q, A), then XI/(XI + X2) "" Beta(p, q) . We finally obtain

2
d . ""

(n-1)2
n
(v n-v-1)
Beta -
2' 2 ·
(2.52)

For moderaten the range of this distribution is approximately (0, n) rather


than the unbounded range for the F distribution of the deletion distances.
This result derives the exact distribution of the individual squared Ma-
halanobis distances. That the distribution is bounded follows from the re-
lationship
n
Ld~ = v(n -1) , (2.53)
i=l

the proof of which is left to the exercises. A summary of the distributional


results for various Mahalanobis distances is in Table 2.1

2. 7 Determinants of Dispersion Matrices and the


Squared Mahalanobis Distance
We now outline an alternative derivation of the distribution of the Maha-
lanobis distance due to Wilks (1963) . As a measure of outlyingness of a
multivariate observation he proposed the scatter ratio

(2.54)

and showed that


Ri "" Beta (
n-v-1
2 ,2
v) . (2.55)

To relate this ratio of determinants to the Mahalanobis distance we need


the standard result for the n x p matrix X

(2.56)
tv
~

t:l
~
C1)
...,
ss·
TABLE 2.1. Summary of distributional results for the squared Mahalanobis distances used in this book; y; is a v x 1 vector of responses
from Nv(J.L, E), x; is a p x 1 vector of regressionvariables and B is a p x v matrix of regression parameters ~
[ll

0
.....,
Reference J.L E Distribution 9.
[ll
(2.25) known known X~ "Ci
C1)
...,
(2.26) and (2.27) known unknown [ll
c;·
estimated independently of Yi T 2 (v, v) = 11 ~,;'+1 Fv,v-v+l :::1
on v degrees of freedom ~
~
...,
Exercise 2.4 known unknown (v = 1) (n - 1)Beta U, n 2:..!.) §'
o- 2 estimated by s 2 [ll

_n_vj_n-2Jp
(2.49) unknown unknown (n-1) n-v-1 v,n-v -1
~p,.
rn
(deletion distance) estimated by Y(i) estimated by Eu( i) ..0
.::
unknown unknown (n -1)•B t (!!. n - v -1)
(2.52) and n ea 2' 2 ~p,.
Exercise 2.5 estimated by jl estimated by Eu for v = 1 the distribution of s:::
the squared scaled residual (2.4) ~
(2.79) J.L = ßT Xi
unknown unknown (n- p)(1- hi) Beta(~ , ~) e:.
~
estimated by ßT Xi estimated by Eu 0
s:
[ll

The limiting distribution is X~ · t:l


~
~
ffi
.,.
CJl
46 2. Multivariate Data and the Forward Search

where here xT is one of the rows of X and it is assumed that (XT x)- 1
exists (Rao 1973, p. 32). We now apply this relationship to the matrix of
residuals.
We recall (2.35)

S(P)(i) = S(jl)-neief/(n-1) = ET E-neie'[ /(n -1) = C-neief/(n -1).


Then, in (2.56)

IS(fl)ciJI = IS(P) I{1- ne'[C- 1 ed(n- 1)},


so that Wilks' result becomes

Next we recall that ifthe random variable X ,..._,Beta(o:, ß), 1-X ,..._,Beta(ß, o:)
when, invoking the notation of (2.42), we obtain

which is (2.52).

2.8 Regression
In many of the examples in this book the data, perhaps after transforma-
tion and the removal of outliers, follow the multivariate normal distribution
(2.7) in which each observation Yij on the jth response has mean /-Lj· How-
ever, in some examples, the mean has a regression structure. The simplest,
considered in this section, is when the regressors for each of the v responses
are the same. Then (2.14) becomes

E(Y) = XB, (2.57)

where Y is n x v, X is the n x p matrix of regressionvariables and Bis a


p x v matrix of parameters. The v x v covariance matrix of the data remains
~- Then, for an individual observation

E(Yij) = /-Lij = x'[ ß1, (2.58)

where x'[ is the ith row of X and ß1 the jth column of B.


We now consider estimat ion of B a nd ~- If there are different sets of
explanatory variables for the different responses, so that we write X 1 , rather
than the common X, estimation of B requires knowledge, or an estimate,
2.8 Regression 47

of E. This is the subject of the next section. Here, with a common X, each
estimate /Jj is found from univariate regression of yc1 on X, that is
' T -1 T
ßj =(X X) X Yc1 . (2.59)

The matrix of residuals E has elements


(2.60)

and the sum of squares and products matrix is


(2.61)

In an extension of (2.11) this matrix again has a Wishart distribution

S(/3)"' Wv(E, n- p). (2.62)

To prove this result in a manner analogaus to that of §2.2.2 requires the


use of some standard results in regression. Here we use notation similar to
that of Atkinson and Riani (2000, Chapter 2).
The hat matrix H is defined as
(2.63)

so called because the matrix of fitted values Y = HY. The ith diagonal
element of H is
(2.64)
The theorems relating to the Wishart distribution of the matrix of the sum
of squares and products are analogaus to those in §2.2.2 with the matrix
C (2.13) replaced by the symmetric idempotent matrix I- H.
The maximum likelihood estimator of E is basically unchanged,

f: = S(/3)/n, (2.65)

which is biased. The unbiased, method of moments, estimator becomes

i=u = S(/J)/(n- p). (2.66)

To find the distribution of the Mahalanobis distance, we again use dele-


tion methods. A standard result in regression diagnostics for the change in
the residual sum of squares on deletion of an observation (Atkinson and Ri-
ani 2000, p. 26 , eq. 2.48) can be written in our notation for the jth response
as n n
L efj(i) = L efj - e7j/(1- hi),
l,.Oi=l l=l

so that (2.35) becomes

S(/J) = S(/J)(i)- eief /(1- hi) = ET E- eief /(1- hi)· (2.67)


48 2. Multivariate Data and the Forward Search

Foramodel in which only the mean is fitted, hi = 1/n and (2.67) reduces
to (2.35). The unbiased deletion estimator of E for the regression model is

Eu(i) = S(/J)(i)/(n- p- 1). (2.68)


The deletion Mahalanobis distance (2.29) is a function of this matrix and
of the residuals
AT
Yi - f..t(i) = Yi - ß(i)Xi-
A

The standard result for the deletion estimator /J(i) in regression (for exam-
ple (2.94) or Atkinson and Riani 2000, p. 23) shows that
AT
Yi- ß(i)Xi = ei/(1- hi) · (2.69)
In the deletion distance E is estimated with n - p - 1 degrees of freedom.
Then the distance with known mean dT(J.t, En-p- 1 ) in (2.26) has a scaled
F distribution
2 v(n-p-1)
(2.70)
A

di (J.t, En-p-1) '"'-' Fv,n-v-p·


n-v-p
But the squared deletion distance is a quadratic form in ei/(1- hi)· The
variance of each element of ei is (1 - hi) times that of the corresponding
element of Yi, so the vector has variance 1/(1- hi) timesthat of Yi- f..t in
(2.26). The distribution of the deletion Mahalanobis distance is therefore
now given by
2 1 v(n-p-1)
d(i) '"'-' (1- h i ) (n-v-p ) Fv,n-v-p· (2.71)

The next stage in the argument is to find the relationship between the
distance dT and the deletion distance. As before, let C = ET E . Then
Eu= Cj(n- p)
and the squared Mahalanobis distance is
dr = (n- p)e[c- 1 ei. (2.72)
If now we write
e[C- 1 ei=dTf(n-p) as 9i and a=1/(1-hi), (2.73)
application of (2.44) leads to
T -1 dr
(2.74)
A

ei s(i) (ß)ei = n- p- dr /(1- hi).

The combination of this result with the definition of the unbiased deletion
estimator Eu(i) in (2.36) tagether with the residuals ed (1- hi) (2.69) yields
the required relationship
2 (n-p-1) dr
(2.75)
d(i) = (1- hi) 2 (n- p) 1- dTf {(n- p)(1- hi)}'
2.9 Added Variables in Regression 49

which reduces to (2.46) when the linear model contains just a mean, that is
when p = 1 and h i = 1/n. Theinversion of this relationship again provides
a n expression for df as a function of the squared deletion distance

(2.76)

To find the distribution of this squared Mahalanobis distance we start


from (2. 71) proceeding as in §2.6. The distribution of the deletion distance
(2.71) is agairr written as the distribution of the ratio of two independent
chi-squared random variables

d2 ,...., n- p-1 X~
(2 .77)
C•) 1 - h' xn2 - v-p ·
It now follows from (2.76) that the distribution ofthe Mahalanobis distance
is given by
2
dfrv(n-p)(1-hi) 2 X~ (2.78)
Xv + Xn-v - p
Finally, we again use the relationship between beta and gamma random
variables employed in §2.6 to obtain

di2 ,...., (n - p) ( 1 - hi) Beta (v2, n-v-p) 2 · (2.79)

so that the range of support of the distribution of df depends upon hi. For
some balanced experimental designs, such as two-level factorials , all hi are
equal when all n observations are included in the subset and so L hi = p,
when each hi = pjn. Then the distribution (2.79) reduces to

d ,2 ,...., (n- p) 2 Beta (~ , n- v-


2
p) . (2.80)
n 2
However, the distances will not all have the same distribution in the rest
of the search. A more complicated instance of unequal leverages is when
we include in the model a constructed variable for transformation of the
response §4.4, the value for which depends on each observed Yij. But then
fitting the regression model requires estimation of ~ . We discuss the result-
ing regression procedure in §2 .11 .

2.9 Added Variables in Regression


The previous section concludes our work on the distribution theory of
squared Mahalanobis distances. In this and the next section we describe
50 2. Multivariate Data and the Forward Search

some more general properties of regression models. For the moment we


continue with multiple regression when the matrix of explanatory variables
is the same for all responses, that is when (2.57) holds. Each response is
then analysed separately using univariate regression. Our purpose is to pro-
vide some background to the development of approximate score tests for
transformation of the data developed in §4.4. Even if the means of the data
do not contain any regression structure, which is the usual situation, the
algebra of added variables described in this section provides a convenient
way of using the forward search to assess transformations.
The underlying idea is that multiple regression can be performed as a
series of simple linear regressions on single explanatory variables, although
both the response and the regression variable have to be adjusted for the
variables already in the model. We then perform a regression of residuals
on residuals. To begin the derivation of the results we extend the univari-
ate regression model to include an extra explanatory variable, the added
variable w, so that the model is

E(y) = Xß + w-y, (2.81)

where y is n x 1, ß is p x 1 and "f is a scalar. We find explicit expressions for


the least squares estimate of"' and the statistic for testing its value. Added
variables are important in the development of regression diagnostics, where
they are used to provide graphical representations ( added variable plots)
for the importance of individual observations to evidence for regression on
w. They arealso important in providing tests for transformationsandin the
development of regression diagnostics using the mean shift outlier model,
which is briefly introduced in the next section. These uses are described in
Atkinson and Riani (2000, §2.2), where full details are given. In this book,
since few of our examples include regression, we give a brief summary of
the method, which is used in §4.13 where we select a regression model.
An expression for the estimate .:Y of"' in (2.81) can be found explicitly
from the normal equations for this partitioned model

(2.82)

and
(2.83)
If the model without "' can be fitted, (XT x)- 1 exists and (2.82) yields

(2.84)

Substitutionofthis value into (2.83) leads, after rearrangement, to

(2.85)
2.10 The Mean Shift Outlier Model 51

Since A = (I- H) is idempotent, .:Y can be expressed in terms of the two


sets of residuals
e = y* = (I - H)y = Ay
and
w* = (I - H)w = Aw (2.86)
as
(2 .87)
Thus .:Y is the coefficient of linear regression through the origin of the resid-
uals e on the residuals of the new variable w, both after regression on the
variables in X.
To calculate the t statistic requires the variance of ,:Y. Since, like any
least squares estimate in a linear model, .:Y is a linear combination of the
observations,
, 2 wT AT Aw a2 2 *T *
var "( = a (wT Aw)2 = wT Aw = a j(w w). (2 .88)

Calculation of the t est statistic also requires s~ , the residual mean square
estimate of a 2 from regression on X and w , given by (Atkinson and Riani
2000, eq. 2.28)

(n-p-l)s~ YT y - f;T XT y - ,:YwT y


yT Ay- (yT Aw) 2j(wT Aw). (2.89)
The t statistic for testing that "( =0 is then
.:y
(2 .90)
tw = J{s~j(wT Aw)} ·
If w is the explanatory variable Xk, (2.90) is an alternative way of writing
the usual t test for Xk in multiple regression. But the usual regression t
tests are hard to interpret in the forward search, decreasing markedly as the
search progresses; Figure 3.4 of Atkinson and Riani (2000) is one example.
The problern arises in multiple regression because the search orders the
observations using allvariables including Xk. We obtain t tests for Xk with
the correct distributional properties by taking Xk as w in (2.81) with X all
the other explanatory variables. The forward search orders the data by all
variables except Xk· Because of the orthogonal projection in (2.86) , the t
test (2.90) is unaffected by the ordering of the observations in the search. A
fuller discussion is in Atkinson and Riani (2002a). Our example is in §4.7.

2.10 The Mean Shift Outlier Model


In §2.5 we obtained some deletion results for Mahalanobis distances using
results derived from the (Bartlett)-Sherman-Morrison-Woodbury formula
52 2. Multivariate Data and the Forward Search

(2.40) . In this section we sketch how the mean shift outlier model can be
used to obtain deletion results for the more general case of regression, using
the relationships for added variables derived in the previous section. The
standard results for deletion in regression are summarized, for example, by
Atkinson and Riani (2000, §2.3) .
Formally the model is similar tothat of (2 .81). We write

E(y) = Xß + q(i)c/>, (2.91)

where the n x 1 vector q( i) is all zeroes apart from a single one in the
ith position and 4> is a scalar parameter. Observation i therefore has its
own parameter and, when the model is fitted, the residual for observation
i will be zero; fitting (2.91) thus yields the sameresidual sum of squares as
deleting observation i and refitting.
To show this equivalence requires some properties of q(i) . Since it is a
vector with one nonzero element equal to one, it extracts elements from
vectors and matrices, for example:

XT q(i) = Xi and q(if H q(i) = hi. (2.92)

Then, from (2.85),

(2.93)

If the parameter estimate in the mean shift outlier model is denoted {3q , it
follows from (2.84) that

so that, from (2.93)

(2.94)

Comparison of (2.94) with standard deletion results shows that /3q = /3(i)>
confirming the equivalence of deletion and a single mean shift outlier.
The expression for the change in residual sum of squares comes from
(2.89) . If the new estimate of CJ 2 is s~ we have immediately that

(n-p-1)s~ YT Ay- {yT Aq(i)} 2 j{q(i)T Aq(i)}


(n- p)s 2 - eT/(1- hi), (2.95)

where s~ = s~i), the deletion estimate.


The mean shift outlier modellikewise provides a simple method of finding
the effect of multiple deletion. We first need to extend the results on added
variables in §2.9 to the addition of m variables, so that Q is an n x m
2.11 Seemingly Unrelated Regression 53

matrix and 'Y an m x 1 vector of parameters. We then apply these results


to the mean shift outlier model

E(y) = Xß + Qcp,
with Q a matrix that has a single one in each of its columns, which are oth-
erwise zero, and m rows with one nonzero element. These m entries specify
the observations that are to have individual parameters or, equivalently,
are to be deleted.

2.11 Seemingly U nrelated Regression


When there are different linear models for the v responses, the regression
model for the jth response can be written

(2.96)

where yci is the n x 1 vector of responses (jth column of matrix Y). Here X 1
is an n x p matrix of regression variables, as was X in (2 .57), but now those
specifically for the jth response, and ß1 is a p x 1 vector of parameters. In our
applications we do not need the more general theory in which the number
of parameters p1 in the model depends upon the particular response. The
extension of the theory to this case is straightforward, but is not considered
here.
Because the explanatory variables are no Ionger the same for all responses
the simplification of the regression in §2.8 no Ionger holds: the covariance :E
between the v responses has to be allowed for in estimation and independent
least squares is replaced by generalized least squares. The model for all n
observations can be written in the standard form of (2.57) by stacking the
equations und er each other. In this form the model is that for a vector of
nv observations on a heteroscedastic univariate response variable and the
vector of parameters ß is of dimension pv x 1. If we let \11 be the nv x nv
covariance matrix of the observations, generalized least squares yields the
parameter estimator

(2.97)

with covariance matrix

(2.98)

where X is nv x pv. In the particular form of generalized least squares


resulting from stacking equations of the form of (2.96) the parameters for
each response are different, the estimates being related only through co-
variances of the Yci. This special structure is known as seemingly unrelated
regression (ZeHner 1962).
54 2. Multivariate Data and the Forward Search

In all there are nv observations. When the data are stacked the covariance
matrix W is block diagonal with n blocks of the v x v matrix ~. As a result
of the block diagonal structure the calculation of the parameters ß can be
achieved without inversion of an nv x nv matrix.
There are p* = vp parameters to be estimated. Let X* be the n x p*
matrix of explanatory variables formed by copying each column of the Xj
in order - first all the elements of the first column of each Xj, then all the
second columns and so on up to the last columns of each Xj. The elements
of X* are xij. If ß* is the p* x 1 vector of parameters, calculation of the
least squares estimates requires the p* x p* covariance matrix W. Let J be
a vector of ones of dimension n x 1. Then

(2.99)

a matrix containing n x n copies of ~- 1 . In (2.99) ® denotes the Kronecker


product. The vector of parameter estimates can then be written in the
seemingly standardleast squaresform /3* = A- 1 B where

"'n * w-1 *
L..i=1 xij jk xik
(2.100)
'\;"""'n '\;"""'V
L..i=1 uk=1 xij
* w-1
jk Yik·

Although the pattern is clear, the matrices do not combine according to


dimensions and the summations are over the n observations rather than
the nv of the stacked data. Discussion of seemingly unrelated regression is
tobe found in many textbooks on econometrics, for example §2.9 of Harvey
(1990).
Because (2.100) contains IJ!, estimation of w, or equivalently ~' is re-
quired for the procedure to be operational. The estimation proceeds in two
or more steps:

t
1. Obtain 0 , an estimate of ~' from the independent regressions as in
(2.61), but with Ycj regressed on Xj.

2. Seemingly unrelated regression using (2.100) with ~- 1 in (2.99) cal-


culated using :t01 .

3. Iteration in the estimation of W is possible, starting with the esti-


mate of ~ obtained from Step 2 and repeating the seemingly unre-
lated regression calculations until there is no significant change in the
estimates of the covariance matrices.

Much of the emphasis so far in this chapter has been on the distribu-
tion of the statistics we have calculated, particularly the Mahalanobis dis-
tances. However, such results are not available for the seemingly unrelated
regression procedure of this section. U nder the assumption of normally dis-
tributed errors, the estimate in ß from generalized least squares in (2.97)
2.12 The Forward Search 55

has the normal distribution. But with \ll estimated from the data, the dis-
tribution is not readily determined. If the exact distribution is important,
recourse may be had to simulation. But, we use the asymptotic results
which apply when \ll is known.

2.12 The Forward Search


Examples of the forward search were given in Chapter 1. During these we
monitored the behaviour of the minimum Mahalanobis distance for units
not in the subset as the data were fitted to increasingly large subsets. In this
chapter we have introduced further quantities that can be monitored during
the forward search. We now briefl.y describe the search, which is made up
of three steps: the first is the choice of an initial subset, the second the way
in which we progress in the forward search and the third is the monitoring
of the statistics during the progress of the search. In subsequent sections
we discuss in some detail how to start the search and what quantities it is
interesting to monitor. Here we discuss its properties.
The purpose of the forward search is to identify observations which are
different from the majority of the data and to determine the effect of these
observations on inferences made about the correct model. There may be
a few outliers or it may be, as in the data on Swiss bank notes, that the
observations can be divided into groups, so that it is appropriate to fit a
different model to each group. Although it is convenient to refer to such
observations as "outliers", they may well form a large part of the data
and indicate unsuspected structure. Such structure is often impossible to
detect from a model fitted to all the data. The effect of the outliers is
masked and backwards methods using the deletion of observations fail to
show any important features.
If the values of the parameters of the model were known, there would
be no difficulty in detecting the outliers, which would have large Maha-
lanobis distances. The difficulty arises because the outliers are included in
the data used for fitting the model, leading to parameter estimates which
can be badly biased. In particular, the estimates of the elements of the
covariance matrix can be seriously infl.ated, so masking the existence of
outlying observations. Many methods for outlier detection therefore seek
to divide the data into two parts, a larger "clean" part and the outliers.
The clean data are then used for parameter estimation. The forward search
provides subsets of increasingly large size which are designed to exclude the
outliers until there are no clean data remaining outside the subset. At this
point outliers start to be used in estimation, when test statistics and Ma-
halanobis distances may change appreciably.
Some methods for the detection of multiple outliers therefore use very
robust methods to sort the data into a clean part and potential outliers.
56 2. Multivariate Data and the Forward Search

An example is the use of the resampling algorithm of Rousseeuw and van


Zorneren (1990) for the detection of multivariate outliers using the mini-
mum volume ellipsoid. The algorithm selects random samples of size v + 1
from which the vector of means f.1 and the covariance matrix I: are esti-
mated. The process is repeated many times, perhaps one thousand, and the
estimates chosen which give the smallest ellipsoid containing approximately
half the data. The resulting parameter estimates are very robust. However
Woodruff and Rocke (1994) show that such estimators, although very ro-
bust, have higher variance than those based on larger subsets. Such larger
subsets are therefore more reliable when used in outlier detection proce-
dures, provided they are outlier free. See also Hawkins and Olive (2002).
In the forward search for multivariate data we find such larger initial
subsets of outlier free observations by starting from m 0 observations which
are not outlying at a specified level in any univariate or bivariate boxplot.
The properties of robust bivariate boxplots are described in the next sec-
tion. We then increment this set starting subset by selecting observations
that have small Mahalanobis distances and so are unlikely to be outliers.
In some versions of the forward search, for example Hadi (1992) and Hadi
and Sirnonoff (1993), the emphasis is on using the forward search to find a
single set of parameter estimates and of outliers. These are determined by
the point at which the algorithm stops, which may be either determinis-
tic or data dependent. The emphasis in this book is very different: at each
stage of the forward search we use information such as parameter estimates
and plots of Mahalanobis distances to guide us to a suitable model.
At some stage in the forward search let the set of m observations used in
fitting be S~m). The mean and estimated covariance matrix of this subset
are fl':n and E~m. From these parameter estimates we can calculate a set of n
squared Mahalanobis distances d;_;..
Suppose that the subset sim) is clear of
outliers. There will then be n- m observations not used in fitting that may
contain outliers. We do not seek to identify these outliers by a formal test.
Our interest is in the evolution, as m goes from m 0 to n, of quantities such
as Mahalanobis distances, test statistics and other diagnostic quantities.
We also look at the sequence of parameter estimates and related quantities
such as the eigenvectors of E~m· We monitor changes that occur, which
can always be associated with the introduction of a particular group of
observations, in practice usually one observation, into the subset of size
m used for fitting. Interpretation of these changes is complemented by
examination of changes in the forward plot of Mahalanobis distances.
Given that we have fitted the model to a subset of dimension m :2: m 0 ,
the forward search moves to dimension m + 1 by selecting the m + 1 units
with the smallest squared Mahalanobis distances, the units being chosen
by ordering all squared distances di,;., i = 1, . . . , n . In most moves from m
to m + 1 just one new unit joins the subset. It may also happen that two
or more units join S~m) as one or more leave. However our experience is
2.12 The Forward Search 57

that such an event is unusual , only occurring when the search includes one
unit that belongs to a duster of outliers. At the next step the remaining
outliers in the duster seem less outlying and so several may be included at
once. Of course, several other units then have to leave the subset.
Remark 1: The search starts with a robustified estimator of p, and :E
found by use of a bivariate boxplot. Let this estimator of p, be flo and let
the estimator at the end of the search be fl~ = fl. In the absence of outliers
and systematic departures from the model
E(P,0) = E(P,) = p,;
that is, both parameter estimates are unbiased estimators of the same quan-
tity. The same property holds for the sequence of estimates p,;,_. produced
in the forward search. Therefore, in the absence of outliers, we expect esti-
mates of the mean to remain sensibly constant during the forward search.
However, because of the way in which we select the observations for in-
clusion in the subset, those with smaller Mahalanobis distances will be
selected first. As a result the estimate of :E, unlike that of p,, will increase
during the forward search. Therefore, unless outliers are present, the dis-
tances di:;, will trend steadily downwards during the search. The use of the
scaled distances defined in ( 2.104) overcomes this tendency. A comparison
of plots of scaled and unscaled distances is in Figure 2.5.
Remark 2: Now suppose there are k outliers. Starting from a clean subset,
the forward procedure will include these towards the end of the search,
usually in the last k steps. Until these outliers are included, we expect
that the conditions of Remark 1 will hold and that plots of Mahalanobis
distances will remain reasonably smooth until the outliers are incorporated
in the subset used for fitting. The forward plot of scaled distances for
the data on municipalities in Emilia-Romagna, Figure 3.24, is a dramatic
example in which the pattern is initially stable, but changes appreciably at
the end of the search when the two gross outliers enter.
Remark 3: If there are indications that the data should be transformed,
it is important to remernher that outliers in one transformed scale may
not be outliers in another scale. If the data are analyzed using the wrong
transformation, the k outliers may enter the search weH before the end.
The search avoids the initial inclusion of outliers and provides a natural
ordering of the data according to the specified null model. In our approach
we use a robust starting point combined with unbiased estimators during
the search that are multiples of the maximum likelihood estimators. The es-
timators are therefore fully efficient for the multivariate normal model. The
zero breakdown point of these estimators is an advantage for the forward
search. The introduction of atypical infl.uential observations is signalled by
sharp changes in the curves that monitor Mahalanobis distances and test
statistics at every step. In this context, the robustness of the method does
not derive from the choice of a particular estimator with a high breakdown
point, but from the progressive inclusion of units into a subset which, in the
58 2. Multivariate Data and the Forward Search

first steps, is outlier free. As a result of the forward search, the observations
are ordered according to the specified null model and it becomes clear how
many of them are compatible with a particular specification. Our approach
enables us to analyze the inferential effect of the atypical units ( "outliers")
on the results of statistical analyses.
Remark 4: The procedure is not sensitive to the method used to select
an initial subset; even if outliers are included at the start they are often
removed in the first few steps. For example, two forms of robust bivariate
boxplot are described in the next section, either of which can be used to
provide an initial subset. For speed of calculation we use the less robust.
Although the first steps of the search may depend on which of the two
methods is used to find the initial subset, the later stages are independent
of it. What is important in the procedure is that the initial subset is either
free of outliers or breaks the masking of outliers which are masked in the
complete set of n observations. The removal of outliers is visible in some
searches where there are sometimes numerous interchanges in the first few
steps. Examples in which the search recovers from a start that contains
outliers include Exercise 3.4 and Figure 7.20. An example for spatial data
in which the search recovers from a start that is not very robust is given
by Cerioli and Riani (1999).

2.13 Starting the Search


We now describe two methods for finding a "central" part of the data by
looking at two dimensional projections. Both methods fit curves to bivariate
scatter plots. These fitted curves can provide useful extra information when
they are included in scatterplot matrices. In the first section we describe
the babyfood data which are used here to illustrate the construction of the
boxplots and, in §§4.2 and 4.7, to illustrate the transformation of multivari-
ate data. The two methods of construction are described in §§2.13.2 and
2.13.3. Finally, in §2.13.4, we discuss the use of the intersection of these
bivariate central parts in starting the forward search.

2.13.1 The Babyfood Data


Box and Draper (1987, p. 265) present part of a }arger set of data on the
storage of a babyfood. The data are in Table A.5. Unlike other data that
we have so far seen, these include five explanatory variables. There are 27
readings on four responses which are the initial viscosity of the babyfood
and its viscosity after three, six and nine months storage. The distribution
of viscosity, a non-negative property, is highly skewed and we can expect
that the data will require transformation. The ratios of the maximum to
the minimum of each responsein Table 4.1 reinforce this expectation. We
2.13 Starting the Search 59

discuss the transformation of these data in some detail in Chapter 4. Here


we only look at a scatterplot of the first two responses, both transformed
and untransformed, to show the effect of data skewness on the construction
of the two kinds of robust contour.

2.13.2 Robust Bivariate Boxplots from Peeling


A method for using the peeling of points from convex hulls to find a central
part of the data is described by Zani, Riani, and Corbellini (1998). Once
the central part of the data has been found, smooth contours are found by
the use of B-splines (Micula 1998, de Boor 2002) . The method is virtually
non-parametric, in that almost no distributional assumptions are made in
deriving the fitted B-spline. The method can be described in three steps.
Step 1 The Inner Region. The inner region is the two dimensional
extension of the interquartile range of the univariate boxplot, where it is
often called a "hinge". In one dimension we take the length of the box
between the first and third quartiles, which therefore contains 50% of the
values. In two dimensions we look for a similar regioncentred on a robust
estimator of location, containing a fixed percentage of the data. A natural,
nonparametric way of finding a central region in two-dimensions is to use
convex hull peeling. The most extreme group of observations in a multi-
variate sample can be thought of as those lying on the convex hull, with
those on the convex hull of the remairring sample, the second most extreme
group and so on. The output of the peeling is a series of nested convex
polygons (hulls). We call the (1 - o:)%-hull the biggest hull containing not
more than (1- o:)% of the data (the points on the boundary belong to the
hull) . Usually, even if the outermost hulls assume very different shapes and
are infl.uenced by outliers, the 50%-hull seems to capture the underlying
correlation of the two variables.
Since each convex hull contains several observations, the nominal 50%-
hull found by peeling may contain less than 50% of the data, the effect being
greater if the sample size is small. It also might not be smooth. To overcome
this problern we fit a B-spline curve to the 50%-hull found by peeling. The
inner region is therefore formed by those units which lie inside or on the
boundary of the B-spline curve superimposed on the 50%-hull.
As an illustration of our method we use the scatterplot of the logged
values of the first two variables in the babyfood data. There are 27 Ob-
servations. Panel (a) of Figure 2.1 shows the outermost hull, which passes
through seven points. Panel (b) shows this hull together with the second
hull, also passing through seven points, so that 13 remain. The third hull,
of five points in Panel ( c), is the 50% hull, since it is the largest containing
not more than 50% of the data. Inside it are eight data points. In Panel
(d) a B-spline curve is fitted to the 50% hull. This inner region seems free
from outliers and is robust, while keeping the correlation in the data and
allowing for different spreads in the various directions. It is worth noting
60 2. Multivariate Data and the Forward Search

2 .5 3 .0 3 .5 4.0 4.5 5.0 5 .5 6.0 6.5 7 .0 2 .5 3 .0 3 .5 4 .0 4 .5 5 .0 5 .5 6 .0 6 .5 7 .0

(d) •
• •


N

0

2 .5 3 .0 3 .5 4. 0 4 .5 5.0 5 .5 6 .0 6.5 7. 0 2 .5 3.0 3. 5 4.0 4 .5 5.0 5 .5 6 .0 6 .5 7 .0

FIGURE 2.1. Logged babyfood data, Yl and y2 : the first three convex hulls con-
taining respectively 7, 7 and 5 points. Panel (d) shows the B-spline fitted to the
50% hull of five points and the robust centre, marked almost coincident with +
an observation

that the fitted spline contains only seven data points, since one observation,
with Coordinates 5.15 and 5.47, lies inside the 50% hull, but outside the
spline.
Step 2 The Robust Centroid. We find a robust bivariate centroid
using the componentwise arithmetic means of the Observations inside the
inner region defined by the fitted spline. In this way we exploit both the ef-
ficiency properties of the arithmetic mean and the natural trimming offered
by the hulls. This mean of the values of logged y 1 and logged y 2 is marked
with a cross in Panel (d) of Figure 2.1. This cross gives the appearance of
being near the centre of the nearly elliptical spline.
A useful requirement of estimators of location is affine invariance (for
example Woodruff and Rocke 1994) ensuring that different rescalings of
the individual variables leave the estimator of location unchanged. If we
require such a property of our estimator we need to take the mean of the
observations over the convex hull, rather than over the fitted B-spline.
References to other ways of finding robust bivariate centres are given at
the end of the chapter.
2.13 Starting the Search 61

2.5 3.0 3.5 4 .0 4.5 5~ 5.5 6.0 6.5 7.0 2.5 3.0 3.5 4.0 4 .5 5.0 5 .5 6 .0 6 .5 7.0

FIGURE 2.2. Logged babyfood data, Yt and y2: scaling the convex hull. The
resulting 99% hull indicates four outliers

Step 3 The Outer Region. Once we have found a robust bivariate


centre and a curve asymptotically containing half the data ( a bivariate
"hinge") we require a method for constructing contours at other levels. To
find a small subset for starting the search we may need contours with a
nominal content of less than 50%. But, if interest is in a contour which dis-
criminates between "good" and "bad" observations, a much higher nominal
content will be needed. If we were using Mahalanobis distances to mea-
sure the remoteness of the observations, these contours would be ellipses
which would be found analytically. However, with the metric provided by
the bivariate hinge, we have to proceed numerically. We find the contours
by scaling the hinge, following the procedure suggested by Goldberg and
lglewicz (1992) as modified by Zani et al. (1998).
The left-hand panel of Figure 2.2 shows the method of scaling. Let 0 be
the centre of the data found in Step 2 and let X, Y and Z be three points
on the 50% contour. Then, if X', Y' and Z' are three points on rays from
0 on the scaled contour, we require that the ratios
OX' OY' OZ'
OX = OY = OZ = c, (2.101)

say. For outlier detection c will be appreciably greater than one.


To find suitable values of c, Goldberg and lglewicz (1992) use an approx-
imate F distribution for the Mahalanobis distance

d7"' {2(n- 1)/(n- 2)}F2,n-2, (2.102)


which is close to the distribution of the deletion Mahalanobis distance
(2.49) . The theoretical 50% contour for the babyfood data in Figure 2.2
corresponds to an F value, on 2 and 25 degrees of freedom, of 0.7127. The
value for the 99% contour is 5.568. The ratio of these two is 7.813. But
62 2. Multivariate Data and the Forward Search

we are concerned with distance, not squared distances, so the value of c in


(2.101) is v7.813 = 2. 795.
In a conservative approach to outlier detection we might seek to declare
any ambiguous points as outliers, thereby obtaining a central part of the
data which has a reduced probability of being contaminated. A possibil-
ity here is to replace the F distribution with the asymptotic chi-squared
distribution of the distances. In this case the value of 2. 795 becomes 2.58.
Zani et al. (1998) use this approximation to the distribution of squared
Mahalanobis distances combined with Simulation to allow for the effect of
peeling on the content of the hinge. The value of 2.58 increases slightly to
2.68. This final value is a compromise between the 2. 795 based on the F
distribution and the 2.58 from the chi-squared approximation. (In Table 2
of Zani et al. (1998) the values are of c -1, the extension of the ray beyond
the 50% fitted spline) . This approximate 99% contour is plotted in both
panels of Figure 2.2. In the right-hand panel some possible outliers in this
two-dimensional projection of the data are identified.
Use of the exact beta distribution (2.52) for scaling is recommended if it
really is desired to use the bivariate boxplot for bivariate outlier detection.
However, we use the forward search for outlier detection for any dimension
v, at the sametime obtaining information on the inferential importance of
each observation. Our interest in the boxplots is to select an initial subset,
when several values of c may be tried, until a satisfactory value is found
for mo.
The convex hulls in Figure 2.1 and the nearly elliptical contours in
Figure 2.2 suggest that the logged data are approximately normally dis-
tributed. It is interesting to see what happens when we repeat the procedure
using untransformed data.
Figure 2.3 shows the four convex hulls fitted to the data in the peeling
process to find the 50% hull. These hulls are quite different in shape from
the hulls for the transformed data, several being quadrilateral. Since they
mostly only contain four observations, four hulls have to be peeled to obtain
the 50% hull, rather than three for the transformed data.
Figure 2.4 shows the fitted B-spline and the nominal 99% contour found
by scaling up. Because the data are concentrated in the lower-left hand
corner of the figure, the 50% curve is relatively small. As a result sev-
eral observations lie outside the 99% curve found by scaling up. The skew
distribution of the data leads to the detection of many apparent outliers.

2.13. 3 Bivariate Boxplots from Ellipses


The bivariate boxplots calculated from B-splines provide a useful tool for
a preliminary examination of the data. They are however over elaborate as
a means of finding a central part of the data which can serve as a starting
point for the forward search. In this section we present a computationally
simpler method in which ellipses with a robust centroid are fitted to the
2.13 Starting the Search 63

0
0

"
0
0
0

600 8 00 1000 1200

0
0
<0
d •
..
0
0

0
0
0 •
• •

800 1000 1 200

400 600 800 1000 1 200

FIGURE 2.3. Untransformed babyfood data, Yl and y2: four convex hulls have
to be peeled to obtain the 50% hull as opposed to three for the transformed data
in Figure 2.1

data. Our description follows Riani and Zani (1997) who use a version of
the "quelplot" of Goldberg and lglewicz (1992).
The robust centroid of the ellipse is found as the componentwise median
of the two variables in the scatterplot. Let this be P,. The shape of the con-
tours is based on a covariance matrix in which the univariate medians are
used, but which is otherwise calculated in the usual way. That is, the mean
in (2.10) is replaced by iJ, to give a 2 x 2 matrix with elementsproportional
to
n
Sjk(il) = L(Yij- P,j)(Yik- ilk).
i=l

The combination of centroid and covariance estimate gives Mahalanobis


distances for each observation and a family of ellipses which need to be
scaled. The 50% ellipse is that which passes through the point with the
median Mahalanobis distance, and so contains exactly 50% of the data. As
a matter of minor detail, we use the F distribution for scaling this ellipse.
As was stated above, the theoretical value of c for this 50% contour for
the babyfood data is 0.7127. Contours for other levels are then found by
scaling this ellipse.
64 2. Multivariat e Data and the Forward Search

0
0
OCl

...
0
0

0
0

-
0

0
0
OCl

0
0
<D

0
...
0

0
0
N

0
0 200 400 600 800 1000 1 200

FIGURE 2.4. Untransformed babyfood data, YI and y2: the scaled 99% convex
hull indicates seven outliers

A scatterplot matrix of these ellipses for the original babyfood data is


in Figure 4.6 and for the log-transformed data in Figure 4.5. The inter-
pretation of the plots is similar to that for Figures 2.4 and 2.2 , with the
untransformed data exhibiting many more outliers.
The method of constructing boxplots based on peeling was described
above in Section 2.13.2 as virtually non-parametric. It is not completely
non-parametric since the F or x2 distribution used to find the scaling c
is based on the assumption of bivariate normality. The method based on
ellipses in this section is hardly non-parametric at all; apart from the use
of the median as a measure of location, the theory is based entirely on
the normal distribution. Even so, this boxplot is both a useful tool for the
examination of data and a method for finding an initial subset, even in
data with many outliers.

2.13.4 The Initial Subset


We find an initial subset of mo observations from the intersection of units
inside a contour of specified content, where we have to adjust the content
to yield , at least roughly, the required number of observations. We also
eliminate any univariate outliers. The magnitude of the contour depends
Oll the parameter 0 giving the scaling

dt(B) = {2(n- 1)/(n- 2)}0. (2.103)


2.13 Starting the Search 65

The relationship between () and the scaling c follows from the approxi-
mate F distribution for Mahalanobis distances defined in (2 .102). For ex-
ample, a value of () = 1 corresponds to the 61.8% point of the F distribution
on 2 and 25 degrees of freedom. Usually we have to use smaller values to
obtain a sufficiently small value for m 0 . An example showing the variation
of m 0 with () is in §4.2.
The value of m 0 is not critical. It should be small enough so that the
initial subset contains no masked outliers, but large enough that the initial
stages of the search are fairly stable, apart from any initial interchanges.
For examples in which we are fitting multivariate models without any struc-
ture in the means, a value around 2v is often suitable. The procedure is
generally robust to the choice of the value of mo and allows us to start
with a somewhat larger subset if the percentage of contamination permits.
Since the method does not involve complicated iterative procedures, there
is no computational burden in finding the starting point. As the size of the
initial subset can easily be increased or decreased by changing the value
of () , we usually try several values and check whether the last third or so
of each search from the various starting points is the same. As we have
seen, it is often towards the end of the search that we obtain information
about unsuspected structure and outliers for observations basically from
a single normal population. However, if there are several populations, as
in the Swiss bank note data, the earlier parts of the search are also infor-
mative. For example, Figure 3.30 will show the effect of two populations
around m = 100 when n = 200. Larger initial subsets than 2v are required
for models in which there are more than v parameters to be estimated, for
example when we are determining transformations.
We find the initial subset from the intersection in all v( v- 1) /2 bivariate
scatterplots and v univariate boxplots of units within the contour specified
by (). This subset will exclude any observations which are outlying in one
or two dimensions. However it will not exclude observations that are not
outlying in one or two dimensions but are outlying in three or more. Al-
though it is not difficult to construct such observations, they seem to be
rare in practice. Any problern they might cause can be simply reduced by
decreasing the value of (). However, in general, even if one or two have been
included in the intimal subset, they are detected in the early stages of the
search, their large Mahalanobis distance causing them to leave the subset.
We do indeed sometimes observe several interchanges in the first two or
three steps of the search. All that we require is that the construction of the
initial subset reveals outliers which are masked in the whole data set. They
do not need to be excluded from the initial subset, merely to be unmasked
in it.
66 2. Multivariate Data and the Forward Search

2.14 Monitoring the Search


At each step in the forward search we calculate all squared distances di;,,
i = 1, ... , n for mos; m s; n. Many of our most informative plots are based
on the Mahalanobis distances dim, rather than on the squared distances.
We plot these as m increase from m 0 to n. Such plots are called "forward"
plots.
Mahalanobis Distances. We plot all n distances dim for each value
of m. This plot is informative about the behaviour of individual units, the
distances for which can be followed throughout the search.
Scaled Mahalanobis Distances. At the beginning of the search the
few central units may give a very small estimate of the covariance matrix.
Consequently, units not in the subset may have very !arge distances, which
decrease as the search proceeds. The result is that the eye tends to focus
on the early part of the search, whereas important information is usually
in the last third or so, where the outliers, if any, enter and cause changes
in inferences.
Virtually constant residual plots in regression were obtained by Atkin-
son and Riani (2000), for example Figure 1.4, by scaling the least squares
residuals eim at subset size m by Sn, the error mean square estimate of a 2
at the end of the search. These scaled residuals can be written as

The Mahalanobis distances


*
d im ( T f,-1
= eim L<um eim
)0.5

are scaled by the square root of the estimated covariance matrix. If we had
independent observations with constant variance a 2 , "E would be a diagonal
matrix and

So the generalization of the scaled residuals is the scaled distance

, ) l/2v
d* X ( /~um/ (2 .104)
•m /"Eun/ '

where we rename ~u as ~un to stress that the estimator is calculated at


the end of the search from all n observations.
As Figure 2.5 shows, for the Swiss bank note data, this rescaling increases
emphasis on the right-hand end of the forward plot. The upper panel is the
forward plot of the scaled distances: in the lower panel the distances are
not scaled. In this example the plot of scaled distances seems superior
in all parts of the search. Although quite stable, these scaled distances
2.14 Monitoring the Search 67

,--.1
50 100 150 200
Subset size m

.,.,
N

0
"'Cl>
0
N

c
ill .,.,

~0
c
"'
"iij
~ ~
"'
::;

.,.,

0
.---
50 100 150 200
Subset size m

FIGURE 2.5. Swiss bank notes, starting with the first 20 observations on genuine
notes: forward plots of Mahalanobis distances - upper panel scaled and, lower
panel, unscaled

are somewhat less stable than scaled residuals in regression. This is not
surprising since the regression structure means that the residuals fl.uctuate
much less than those here from a structureless sample. We discuss the upper
panel in some detail in Chapter 3 as Figure 3.30.
68 2. Multivariate Data and the Forward Search

Ordered Mahalanobis Distances. As we saw, for example in Fig-


ure 1.17, we find it informative to plot particular ordered Mahalanobis
distances. We discuss three useful plots. But we begirr with some results
about ordered distances and the forward search.
The progress of the search depends on ordering the squared Mahalanobis
distances. From the subset s~m- 1 ) we calculate the n
squared distances
di,; ,._ 1 , i = 1, ... , n, and order them to obtain the n distances d(iT,m- 1.
The new subset sim)at step m then consists of the units corresponding to
the m ordered distances d(il,m_ 1 to d(;,J,m- 1. The units corresponding to
the ordered distances are then divisible into two sets. Let U[i) ,m derrote the
unit with the ith ordered Mahalanobis distance at step m . Then

U[1J,m-1 , . .. , U[mJ,m-1 E S~m)

and
U[m+1J ,m-1 , . . . , U[nJ,m-1 ~ sim).
To move to the new subset S~m+l) we form the n distances di;. and
order them. It is not certain that all the units which were in S~m) will be
in S~m+ 1 ). There are three cases which need tobe distinguished :

1. Normal Progression. If

d *[m)m = maxd*i,m ,;• E S*(m) ,

the next unit to join will be U[m+ 1Jm which ~ S~mJ with distance
d(m+1)m;
2. Inversion. Now suppose that

d*[m+1)m = max d*i,m ,;• E S*(m)_

Then U[m+l)m will remain in the subset. But there must be a unit,
say UNEW ~ S~m) for which

dNEW,m s; d(mJm·
This unit will join the subset while U[m+l]m will remain in the subset.
The minimum distance among units not in the subset will obviously
be dNEW,m s; d(mJm ;
3. Interchange. An interchange occurs when two or more new units
enter the subset, when one or more must leave. Instead of the one new
unit U N EW when inversion occurs we have a set SN EW , containing
at least two members, such that

iESNEW if di,ms;d(m+l)mni~S~m)_
2.14 Monitoring the Search 69

Then the minimum distance among units not in the subset can be
written as
dNEW,m = mind7,m i E SNEW·
To obtain an upper bound for this distance let the number of units
in SNEW be nNEw(2: 2). Then

d*N EW,m < d*[m+2-


- nN EW )m ·

The smallest distance among units not in the subset is monitored up to


step n- 1.
Largest Distance among Units in the Subset. Herewemonitor

*
max di,m ,;• E S*(m),

the largest distance among units in the subset. For normal progression this
will be d(mJm· As we have seen above, for inversion the distance is d(m+l]m;
it will be larger than this when an interchange occurs . The forward plot
of this largest distance will show a peak when the first outlier is included.
The peak is therefore one step later than it is for the preceding plot of the
smallest distance not in the subset. The largest distance is monitored up
to step n.
In general, when there is one outlier, the size of the peak in the plot
of the largest distance amongst units in the subset is smaller than that
in the plot of the smallest distance among units not in the subset. This
arises because dj;, for units not belonging to the subset has an unbounded
distribution, wh~reas that for the maximum over units in the subset is the
maximum of m scaled beta distributions.
"Gap" Plot. The forward plots of the minimum and maximum dis-
tances trend upwards, which can sometimes obscure interpretation. In the
gap plot we look at the difference of the two preceding quantities, that is

(2.105)

For normal progression this difference is

d(m+l]m- d(mJm> (2.106)

where both distances are calculated using the same subset of size m. If
there is an inversion, an upper bound on the value is

d(mJm- d(m+l]m'
the negative of the value for normal progression. The bound is even more
negative if there is an interchange, the magnitude depending on the value
of nNEW· We plot both the true difference (2.105), which can be negative
and the difference in order statistics (2.106), which is always positive.
70 2. Multivariate Data and the Forward Search

<D

U)

0
::;;
E
::I
....
E
·c:
~

"'

C\1

50 100 150 200


Subset size m

FIGURE 2.6. Swiss bank notes starting with the first 20 observations on genuine
bank notes: forward plots of, dotted line, minimum distances of units not in the
subset and, solid line, of the ordered distance dim+I]m · There is an indicat ion of
appreciable interchange around m = 100

As an example of these plots, Figure 2.6 shows the forward plot of the
minimum distance amongst units not in the subset for the Swiss bank note
data, which we have seen in Figure 1.17, together with the forward plot of
d[m+l]m. These two are the same for much of the search, but different at the
beginning and around m = 100. Both regions in which some interchanges
are occurring. The earlier one is associated with instability at the b eginning
of the search. The later reflects the interchanges which occur as units from
the group of forgeries start to enter the subset, with the appreciable change
in covariance matrix and distances that we saw in Figures 1.16 and 2.5.
Covariance Matrix. The estimate of the covariance matrix :E does not
remain constant during the forward search as observations are sequentially
selected that have small Mahalanobis distances. To see how the variance is
increasing we can look at forward plots of the ratios

(2.107)

In the absence of outliers these ratios increase smoothly. Non-monotonie


increase of the curve is evidence of the existence of masked outliers which
are preventing the unambiguous ordering of the data by its closeness to the
model. Large increases at the end of the search are more easily interpreted
as being due to isolated outliers.
2.15 The Forward Search for Regression Data 71

In addition to the magnitude of the covariance matrices, we also look at


the evolution of their structure, through forward plots of the eigenvalues
and also of the eigenvectors for two dimensional subsets of the data. The
proportion of the total variation in the data explained by each eigenvalue
is an important property for principal components analysis in Chapter 5
Parameter Estimates. The means of most of the sets of multivari-
ate data analysed in this book are without structure. However, if there is
regression structure, it is interesting to look at forward plots of the param-
eter estimates of the linear models for the various responses. Forward plots
of the estimated transformation parameters are crucial in our strategy for
determining the correct transformation of data outlined in §4.6, whatever
the structure of the means.

2.15 The Forward Search for Regression Data


If, as in Section 2.8 , there is a regression structure in the means of the
multivariate observations, we need to allow for this in finding a central
set of observations to form the starting point for the forward search. We
describe a method for the case when the regressors for all responses are the
same.

2.15.1 Univariate Regression


An appealing feature of the multivariate regression model (2.57) was that
estimation was by independent least squares on each response. We proceed
by using a series of v forward searches, one on each of the responses, to
order the univariate observations by their closeness to the regression model.
We then take the intersection of the units in these ordered sets to give us
approximately the required m 0 observations in the initial subset. To start,
we sketch the forward search for univariate regression models. Many of the
principles are the same as those in the search for multivariate data. The
details are in Chapter 2 of Atkinson and Riani (2000).
For the univariate linear regression model E(Y) = Xß, with X of rank
p, let b be any estimate of ß. With n observations the residuals from this
estimate are ei(b) = Yi - x'[b, (i = 1, ... , n). The least median of squares
estimate /3; is the value of b minimizing the median of the squared residuals
er(b). Thus /3;
minimizes the scale estimate

(2.108)

where e[kJ (b) is the kth ordered squared residual. In order to allow for
estimation of the parameters of the linea r model the median is taken as

med = [(n + p + 1)/2], (2.109)


72 2. Multivariate Data and the Forward Search

the integer part of (n + p + 1) /2.


The parameter estimate satisfying (2.108) has, asymptotically, a break-
down point of 50%. Thus, for large n, almost half the data can be outliers,
or come from some other model and LMS will still provide an unbiased
estimate of the regression line. This is the maximum breakdown that can
be tolerated. For a higher proportion of outliers there is no longer a model
that fits the majority of the data.
The definition of β̂* in (2.108) gives no indication of how to find such
a parameter estimate. Since the surface to be minimized has many local
minima, approximate methods are used. Rousseeuw (1984) finds an ap-
proximation to β̂* by searching only over elemental sets, that is, subsets
of p observations, taken at random. We follow this procedure. Depending
on the dimension of the problem we find the starting point for the forward
search either by sampling 1,000 subsets or by exhaustively evaluating all
subsets. We take as our initial subset for each response that yielding the
minimum value in (2.108), so obtaining an outlier free start for our forward
search.
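A minimal sketch of this elemental-set approximation to LMS follows; the function name, the default of 1,000 random subsets and the random-number handling are illustrative choices, not the authors' code.

```python
import numpy as np

def lms_elemental_start(X, y, n_subsets=1000, seed=None):
    """Approximate the LMS fit by searching over random elemental sets of p
    observations, returning the set with the smallest median squared residual
    (the criterion (2.108) with med as in (2.109))."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    med = (n + p + 1) // 2                       # integer part of (n + p + 1)/2
    best_crit, best_idx = np.inf, None
    for _ in range(n_subsets):
        idx = rng.choice(n, size=p, replace=False)
        try:
            b = np.linalg.solve(X[idx], y[idx])  # exact fit to the p observations
        except np.linalg.LinAlgError:
            continue                             # skip singular elemental sets
        e2 = (y - X @ b) ** 2
        crit = np.sort(e2)[med - 1]              # med-th ordered squared residual
        if crit < best_crit:
            best_crit, best_idx = crit, idx
    return best_idx, best_crit
```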
For regression models we have v searches, one for each response. In a gen-
eralisation of the previous notation, suppose at some stage in the forward
search the set of m observations used in fitting response j is S_{*j}^{(m)}. The
parameters of the linear model are estimated by least squares yielding the
parameter estimates β̂_{jm}. From these parameter estimates we can calculate
a set of n residuals e_{ijm}. The forward search for the jth response moves to
dimension m + 1 by selecting the m + 1 units with the smallest squared least
squares residuals, the units being chosen by ordering all squared residuals
e²_{ijm}, i = 1, ..., n. As with the search for multivariate data, most moves
from m to m + 1 introduce just one new unit to the subset although it may
happen that two or more units join S_{*j}^{(m+1)} as one or more leave.
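One step of this univariate search could look as follows; a minimal sketch assuming ordinary least squares on the current subset, with illustrative names.

```python
import numpy as np

def univariate_forward_step(X, y, subset):
    """Fit least squares to the units in the current subset, compute all n
    squared residuals, and return the m + 1 units with the smallest values
    as the next subset (so units may leave as well as join)."""
    m = len(subset)
    b, *_ = np.linalg.lstsq(X[subset], y[subset], rcond=None)
    e2 = (y - X @ b) ** 2            # squared residuals for all n units
    return np.argsort(e2)[: m + 1]   # next subset, of size m + 1
```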
The procedure is again not sensitive to the method used to select an
initial subset, even if unmasked outliers are included at the start. For ex-
ample, the least median of squares criterion (2.108) for regression can be
replaced by that of least trimmed squares (LTS). This criterion provides
estimators with better properties than LMS estimators. They are found by
minimizing the sum of the smallest h squared residuals

    S_h(b) = Σ_{i=1}^{h} e²_{[i]}(b),                        (2.110)

for some h with [(n + p + 1)/2] ≤ h < n. The rate of convergence of LTS
estimates is n^{-1/2} as opposed to n^{-1/3} for LMS. But, for the moderate sized
datasets of the size considered in Atkinson and Riani (2000), the largest
having 200 observations, there seems to be little difference in the abilities
of the two methods to detect outliers and so to provide a clean starting
point for the forward search.

2.15.2 Multivariate Regression

To adapt the searches for univariate regression to multivariate regression
we need to find a starting point and to describe how the search moves
forward.
As a result of the univariate forward search on response j we have, for
each m, a subset of observations S_{*j}^{(m)} which are used for fitting the jth
model. For any particular value of m, say k, the subsets S_{*j}^{(k)}, j = 1, ..., v,
will contain some observations in common, but they will not in general
be identical, except when k = n. To find an initial subset of size m_0 we
consider the observations in common in these subsets. Let there be m(k)
such observations, that is

    m(k) = #{ S_{*1}^{(k)} ∩ ... ∩ S_{*v}^{(k)} }.           (2.111)

We start with k = m_0 and increase k until the first time when there are
at least m_0 common units in the intersection. These units form the initial
subset.
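A minimal sketch of the rule in (2.111), assuming that each univariate search has recorded, for every size k, the set of units it has used; the data structure is an illustrative assumption.

```python
def initial_multivariate_subset(univariate_subsets, m0):
    """Increase k from m0 until the units common to all v univariate subsets
    of size k number at least m0; those common units form the initial subset.

    univariate_subsets : list over responses j of dicts mapping k -> set of units
    m0                 : required size of the initial subset
    """
    v = len(univariate_subsets)
    k_max = max(max(d) for d in univariate_subsets)
    for k in range(m0, k_max + 1):
        common = set.intersection(*(set(univariate_subsets[j][k]) for j in range(v)))
        if len(common) >= m0:
            return sorted(common)
    raise ValueError("no common subset of the required size was found")
```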
The v forward searches order the observations by their closeness to the
fitted univariate models. If there were no interchanges during the search, we
would have a single list of the order in which observations on each response
enter the subset and S_{*j}^{(m)} would consist of the first m units on this list.
However, when there is an interchange, some units leave S_{*j}^{(m)}, and it is not
true that

    S_{*j}^{(m)} ⊂ S_{*j}^{(m+1)}.

The lists of units used in (2.111) to calculate m(k) therefore need to include
information on units which leave the subsets as the search progresses in
addition to those which enter.
Once we have an initial subset of m_0 units, the search progresses much
as it did in the absence of regression in §2.12. Given S_*^{(m)} individual regres-
sions are fitted to the v responses. From the parameter estimates we can
calculate the n × v matrix of residuals with elements e_{ijm} and so the set
of n squared Mahalanobis distances d²_{im}. The search moves to dimension
m + 1 by selecting the m + 1 units with the smallest squared Mahalanobis
distances, the units being chosen by ordering the squared distances d²_{im},
i = 1, ..., n.
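A minimal sketch of this step, assuming the same regressors X for all v responses; the divisor used by np.cov and the function names are simplifying assumptions, with the precise estimator of Σ as defined earlier in the chapter.

```python
import numpy as np

def multivariate_forward_step(X, Y, subset):
    """Fit the v regressions on the current subset, form the n x v residual
    matrix and order all units by the squared Mahalanobis distances of their
    residual vectors, returning the m + 1 units with the smallest distances."""
    m = len(subset)
    B, *_ = np.linalg.lstsq(X[subset], Y[subset], rcond=None)   # p x v coefficients
    E = Y - X @ B                                               # n x v residuals
    Sigma = np.cov(E[subset], rowvar=False)                     # estimated from the subset
    d2 = np.einsum('ij,jk,ik->i', E, np.linalg.inv(Sigma), E)   # squared distances
    return np.argsort(d2)[: m + 1]
```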

2.16 Further Reading


There are numerous books on multivariate analysis, many of which provide
important background reading for the multivariate normal distribution and
associated inferential and data analysis procedures on which our book is

based. Since the analysis of multivariate data requires numerical computing,
there is a time trend in the books towards data analysis and also the
plotting of data. There is also a trend away from mathematics, which may
reflect an increasing cultural impatience with mathematical manipulation
for its own sake. Whatever the reasons for the latter trend, our book seems
to us extreme in following both these tendencies.
The theory is presented by Morrison (1967). Anderson (1984) gives the
matrix algebra in great detail. Muirhead (1982) focuses on distribution the-
ory, without any mention of data. We have already mentioned the mathe-
matically less advanced books of Flury and Riedwyl (1988), Flury (1997)
and Johnson and Wichern (1997). A useful reference for mathematical re-
sults is Mardia, Kent, and Bibby (1979) as is Seber (1984). Krzanowski
(2000) describes both applications and theory. The two parts of Krzanowski
and Marriott (1994) and Krzanowski and Marriott (1995) cover respec-
tively distributions, ordination and inference and classification, covariance
structures and repeated measurements.
Throughout we deal with data which presumably arise from several mul-
tivariate normal distributions, perhaps, of course, with outliers. There may
also be explanatory variables, which may be discrete or continuous. How-
ever we do not consider data in which some of the multivariate responses
have discrete distributions. Much of the literature apparently about discrete
multivariate data analysis is concerned with the analysis of contingency
tables in which there is a Poisson response and categorical explanatory
variables. An example is Agresti (2002) . The forward search for Poisson
generalized linear models described in Atkinson and Riani (2000, §6.10 -
§6.12) extends to such data. Chapter 3 of Fahrmeir and Tutz (2001) de-
scribes methods for multicategorical responses, again based on generalized
linear models.
To conclude this section, we provide some references to detailed points.
An expression for the effect of deletion of observation k on the ith Maha-
lanobis distance, which is a generalization of (2.46) and (2.47), is given by
Riani and Zani (1997). The result of Wilks (1963) used in §2.7 as the start
of an alternative derivation of the distribution of the squared Mahalanobis
distance, is described by Barnett and Lewis (1994, p. 288) . It is used by
Penny (1996a) to derive the distribution of an outlier test. Further discus-
sion of her results is in Fung (1996b) and Penny (1996b). Campbell (1985)
presents a succinct summary of these deletion and distributional results.
Grubbs (1950) gives the univariate result on Beta distributions for a simple
sample and considers the distribution of the order statistics.
The methods of starting the search described in §§2.13.2 and 2.13.3 em-
ploy only two of the many methods that have been investigated for finding
bivariate centres. Small (1990) provides a survey on multidimensional me-
dians. The lengthy review of Liu, Parelius, and Singh (1999) and associated
discussion provides many references on finding robust centres using the idea

of data "depth". A more recent reference is Van Aelst, Rousseeuw, Hubert,
and Struyf (2002) who apply robust regression to this problem.

2.17 Exercises
In all exercises y is a v-variate random variable with E(y) = μ and cov(y) =
Σ. Unless otherwise stated, the normality of y may also be assumed.
Exercise 2.1 Show, without assuming normality, that E{(y − μ)^T Σ^{-1} (y − μ)} = v.
Exercise 2.2 Show that the variance of the residual e_i (2.1) is as given in
§2.1.2. What distributional assumptions did you make?
Exercise 2.3 The distribution of the sum of squares and products matrix
S(μ̂) (2.10) depends on the projection matrix C (2.13). Show that C is
symmetric and idempotent and prove the result claimed at the end of §2.2.2.

Exercise 2.4 When μ is known, the squared Mahalanobis distance d_i²(μ, Σ̂)
is defined in (2.26). Derive the distribution of this quantity when v = 1 and
Σ̂ = s².

Exercise 2.5 Find the distribution of the scaled squared residual about the
mean, which is called d_i² in (2.4).

Exercise 2.6 An extension of Exercise 2.5. Let e_i = y_i − x_i^T β̂ be the resid-
ual from univariate regression as in §2.9. Find the distribution of the scaled
squared residual e_i²/s². You may find equation (2.95) helpful. Relate your
answer to that you found for Exercise 2.5.
Exercise 2.7 Show that:
1) the loglikelihood of the n observations is

    L(μ, Σ; y) = −(n/2) log |2πΣ| − Σ_{i=1}^{n} (y_i − μ)^T Σ^{-1} (y_i − μ)/2;

2) the maximum likelihood estimators of μ and of the covariance matrix Σ
are given by:

    μ̂ = ȳ = ( Σ_{i=1}^{n} y_{i1}/n, ..., Σ_{i=1}^{n} y_{iv}/n )^T   and   Σ̂ = S(μ̂)/n;

3) the maximised multivariate normal loglikelihood is given by (equation 2.19)

    −(n/2) log |2πΣ̂| − nv/2.
Exercise 2.8 Find the form of the matrix D when the test of equality of
the v means is formulated as Dμ = c (equation 2.18). What is the row rank
of D?

Exercise 2.9 In order to test H_0: μ = μ_0 versus H_1: μ ≠ μ_0 the usual
test statistic is

    T² = n(ȳ − μ_0)^T Σ̂_u^{-1} (ȳ − μ_0).

The quantity T² has Hotelling's T² distribution with dimension v and de-
grees of freedom n − 1. We reject H_0 if T² ≥ T²_{v,n−1} and accept H_0 oth-
erwise. Show the connection between T² and the corresponding likelihood
ratio test (equation 2.20).

Exercise 2.10 Suppose there are g groups of v-dimensional normal ob-
servations. Find the likelihood ratio test of the equality of the covariance
matrices of the g groups. When is this important?

Exercise 2.11 Verify the (Bartlett)-Sherman-Morrison-Woodbury formula
(equation 2.40). Show that the inverse of A − UV^T is given by {A^{-1} +
A^{-1}U(I_m − V^T A^{-1}U)^{-1} V^T A^{-1}}, when the dimensions are: A is p × p,
with U and V p × m. Apply this formula for the deletion of m rows of X.

Exercise 2.12 Find Σ_{i=1}^{m} d²_{im}, where the distances are calculated for the
subset S_*^{(m)}. Give bounds for Σ_{i=1}^{n} d²_{im} when there is no inversion or in-
terchange in going from S_*^{(m)} to S_*^{(m+1)}. Suggest a data configuration for
which your lower bound is achieved. What happens to your lower bound
when there is an inversion and when there is an interchange?

Exercise 2.13 The hat matrix H is defined in equation (2.63). Prove it
is (a) symmetric and (b) idempotent. (c) Find tr(H). For what model is C
(equation 2.13) the hat matrix?

Exercise 2.14 The explanatory variables in the first 16 rows of the baby-
food data have coded levels of 1 and −1. The experimental design is a 2^{5−1}
fractional factorial. If x_5 is omitted, the design is a full 2^4 factorial. Suppose
a first-order model, including a constant term, is fitted to the results of a
full 2^k factorial experiment. Calculate the values of the leverage measures
h_i (equation 2.64) and confirm that the value of the sum of the leverage
measures Σ_{i=1}^{n} h_i agrees with the result you found in Exercise 2.13.
How does your answer change when some interaction terms of the form
x_i x_j are included in the model?

Exercise 2.15 Figure 1.16 in Chapter 1 is a forward plot, for the Swiss
bank note data, of the elements of the estimated covariance matrix for a
search starting from 20 observations on genuine notes. The left panel of
Figure 2.7 is a forward plot of the determinant of this matrix. The right
panel shows the trace. Relate these two figures to one another and give rea-
sons for the difference between the two panels of Figure 2.7. What different
features of the data are revealed by the two panels?

FIGURE 2.7. Swiss bank notes starting with the first 20 observations on genuine
bank notes : forward plots of the estimated covariance matrix; left panel, the
determinant and, right panel, the trace

2.18 Solutions
Exercise 2.1

    E{(y − μ)^T Σ^{-1} (y − μ)} = E tr{(y − μ)^T Σ^{-1} (y − μ)}
                                = tr{Σ^{-1} E(y − μ)(y − μ)^T}
                                = tr(Σ^{-1} Σ)
                                = tr I_v
                                = v.
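The identity can also be checked numerically; a small Python sketch, where the normal distribution is used only for convenience in generating data with the required mean and covariance (the identity itself does not need normality), and the covariance matrix chosen here is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
v = 4
A = rng.standard_normal((v, v))
Sigma = A @ A.T + v * np.eye(v)          # an arbitrary positive definite covariance
mu = rng.standard_normal(v)

y = rng.multivariate_normal(mu, Sigma, size=200_000)
q = np.einsum('ij,jk,ik->i', y - mu, np.linalg.inv(Sigma), y - mu)
print(q.mean())                          # close to v = 4, as shown above
```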

Exercise 2.2

    var(y_i − ȳ) = var(y_i) + var(ȳ) − 2 cov(y_i, ȳ)
                 = σ² + σ²/n − (2/n) cov(y_i, Σ_{j=1}^{n} y_j)
                 = σ² + σ²/n − 2σ²/n
                 = {(n − 1)/n} σ².

Note that no distributional assumptions have been made.



Exercise 2.3
A matrix C is symmetric when C = C^T. We have that

    C^T = (I − JJ^T/n)^T = I − JJ^T/n = C.

A matrix C is idempotent when CC = C. We have that

    CC = (I − JJ^T/n)(I − JJ^T/n) = I − JJ^T/n − JJ^T/n + nJJ^T/n² = I − JJ^T/n = C.

Note that

    rank(C) = tr C = tr I − tr J^T J/n = tr I − tr(n/n) = n − 1.

Given that C is symmetric and idempotent with rank (n − 1) we have that

    S(μ̂) = y^T C y

is distributed as W_v(Σ, n − 1).

Exercise 2.4
We have to find the distribution of

    d²(μ, s²) = (y_i − μ)²/s²,

which can be rewritten as

    (y_i − μ)²/s² = (n − 1)(y_i − μ)² / { (y_i − μ)² + Σ_{j≠i} (y_j − μ)² }.      (2.112)

It is straightforward to see that

    (y_i − μ)²/s² ~ (n − 1) χ²_1 / (χ²_1 + χ²_{n−1}).

Given that the χ²_1 in the denominator is independent of the χ²_{n−1}, the
resulting distribution is Beta, that is

    (y_i − μ)²/s² ~ (n − 1) Beta(1/2, (n − 1)/2).

Exercise 2.5
Equation (2.4) can be rewritten as

    d_i² = (y_i − ȳ)²/s² = (n − 1)(y_i − ȳ)² / {(n − 1)s²}.

Now, given that (n − 1)s² can be decomposed as the sum of two quantities,

    (n − 1)s² = Σ_{i=1}^{n} (y_i − ȳ)² = {n/(n − 1)}(y_i − ȳ)² + Σ_{j≠i} (y_j − ȳ_{(i)})²,      (2.113)

from equation (2.113) and Exercise 2.2 it follows that

    d_i² ~ {(n − 1)²/n} χ²_1 / (χ²_1 + χ²_{n−2}).

We now have to prove that the χ²_1 which appears both in the numerator
and the denominator,

    {n/(n − 1)} (y_i − ȳ)²/σ² ~ χ²_1,

is independent of the χ²_{n−2} of the denominator,

    Σ_{j≠i} (y_j − ȳ_{(i)})²/σ² ~ χ²_{n−2}.

The proof we give has two steps. First we write the χ² variables as idem-
potent quadratic forms. Then, we show that the product of the matrices
of the two quadratic forms is equal to zero so we conclude that the two
random variables are independent. The numerator of equation (2.113) can
be rewritten as

    {n/(n − 1)} (y_i − ȳ)² = y^T Q_1 y,

where y = (y_1, ..., y_n)^T, Q_1 = {n/(n − 1)} (I_n − JJ^T/n) q(i) q(i)^T (I_n − JJ^T/n), and q(i) is a vec-
tor which has a 1 in the ith position and 0 elsewhere: q(i) = (0, ..., 0, 1, 0, ..., 0)^T.
Q_1 is symmetric and idempotent with trace (rank) equal to 1. On the other
hand,

    Σ_{j≠i} (y_j − ȳ_{(i)})² = y^T Q_2 y,

where Q_2 = I_n − q(i)q(i)^T − {J − q(i)}{J − q(i)}^T/(n − 1). Q_2 is symmetric
and idempotent with trace (rank) equal to n − 2. Since

    {I_n − q(i)q(i)^T} q(i) = 0   and   {J − q(i)}^T q(i) = 0,


it follows that Q_1 Q_2 = 0. We thus conclude that the two χ² random vari-
ables are independent. Using the independence argument between the two
χ² random variables and the relationship between Gamma and Beta,

    {(n − 1)²/n} χ²_1 / (χ²_1 + χ²_{n−2}) ~ {(n − 1)²/n} Beta(1/2, (n − 2)/2).      (2.114)

Note that (2.114) is just the special case of (2.52) for v = 1.

Exercise 2.6
We now require the distribution of

    e_i²/s²,                                                 (2.115)

where, in what follows,

    k = (n − p)(1 − h_i).                                    (2.116)

Since var(e_i) = σ²(1 − h_i), we have e_i²/{σ²(1 − h_i)} ~ χ²_1. From (2.95),

    (n − p)s² = (n − p − 1)s²_{(i)} + e_i²/(1 − h_i).

The residual sum of squares (n − p − 1)s²_{(i)} ~ σ² χ²_{n−p−1} independently of
y_i and of β̂_{(i)}. But, from (2.94), e_i is a function only of y_i and β̂_{(i)}.
Thus e_i² and s²_{(i)} are independent and

    e_i²/s² ~ k χ²_1 / (χ²_1 + χ²_{n−p−1}) ~ k Beta(1/2, (n − p − 1)/2),

since the two χ² variables are independent, with k defined in (2.116). This
is the result in (2.79) for v = 1.
The result of Exercise 2.5 (2.114) is obtained when just the mean is fitted,
so that p = 1 and h_i = 1/n.

Exercise 2.7
Since the y_i's are independent (because they arise from a random sample)
the likelihood function (joint density) Lik(μ, Σ; y) is the product of the
densities of the y_i's:

    Lik(μ, Σ; y) = Π_{i=1}^{n} f(y_i | μ, Σ)
                 = Π_{i=1}^{n} (2π)^{-v/2} |Σ|^{-1/2} exp{−(y_i − μ)^T Σ^{-1} (y_i − μ)/2}
                 = (2π)^{-nv/2} |Σ|^{-n/2} exp{−Σ_{i=1}^{n} (y_i − μ)^T Σ^{-1} (y_i − μ)/2}.

The loglikelihood is given by:

    L(μ, Σ; y) = −(n/2) log |2πΣ| − Σ_{i=1}^{n} (y_i − μ)^T Σ^{-1} (y_i − μ)/2.      (2.117)

This solves part 1) of the exercise. As concerns part 2), in order to derive
the expressions for the maximum likelihood estimators of μ and Σ, we first
write the quadratic form in equation (2.117) in a way that facilitates finding
the maximum. Since a scalar quantity is equal to its trace,

    Σ_{i=1}^{n} (y_i − μ)^T Σ^{-1} (y_i − μ) = Σ_{i=1}^{n} tr{(y_i − μ)^T Σ^{-1} (y_i − μ)}
                                             = tr{Σ^{-1} Σ_{i=1}^{n} (y_i − μ)(y_i − μ)^T}.      (2.118)

Now, by adding and subtracting ȳ in the sum in the right hand side
of (2.118), we obtain

    Σ_{i=1}^{n} (y_i − μ)(y_i − μ)^T = Σ_{i=1}^{n} (y_i − ȳ + ȳ − μ)(y_i − ȳ + ȳ − μ)^T
                                     = Σ_{i=1}^{n} (y_i − ȳ)(y_i − ȳ)^T + n(ȳ − μ)(ȳ − μ)^T
                                     = S(μ̂) + n(ȳ − μ)(ȳ − μ)^T.      (2.119)

The other two terms in equation (2.119) vanish because Σ_{i=1}^{n} (y_i − ȳ) = 0.
Using (2.119) and (2.118) in (2.117) we obtain

    L(μ, Σ; y) = −(nv/2) log 2π − (n/2) log |Σ|
                 − (1/2) tr[Σ^{-1}{S(μ̂) + n(ȳ − μ)(ȳ − μ)^T}]                        (2.120)
               = −(nv/2) log 2π − (n/2) log |Σ|
                 − (1/2) tr{Σ^{-1} S(μ̂)} − (n/2)(ȳ − μ)^T Σ^{-1}(ȳ − μ).             (2.121)

To find the maximum likelihood estimator for μ we differentiate L(μ, Σ; y)
in (2.121) with respect to μ and set the resulting expression equal to 0:

    ∂L(μ, Σ; y)/∂μ = nΣ^{-1}(ȳ − μ) = 0,

which gives

    μ̂ = ȳ.

It is clear that μ̂ = ȳ maximizes log L(μ, Σ; y) with respect to μ because
the last term in (2.121) is ≤ 0 and the term vanishes for μ̂ = ȳ. Before
differentiating log L(μ, Σ; y) to find Σ̂, we substitute μ = ȳ in (2.121) and
rewrite log |Σ| in terms of Σ^{-1} to obtain

    L(μ̂, Σ; y) = −(nv/2) log 2π + (n/2) log |Σ^{-1}| − (1/2) tr{Σ^{-1} S(μ̂)}.      (2.122)

We now differentiate (2.122) with respect to Σ^{-1}, remembering that

    ∂ tr(AB)/∂A = B + B^T − diag(B)

and that

    ∂ log |A|/∂A = 2A^{-1} − diag(A^{-1})

for symmetric A. We obtain

    ∂L(μ̂, Σ; y)/∂Σ^{-1} = nΣ̂ − (n/2) diag(Σ̂) − S(μ̂) + (1/2) diag S(μ̂) = 0,      (2.123)

whence

    Σ̂ − (1/2) diag(Σ̂) = (1/n){S(μ̂) − (1/2) diag S(μ̂)}

or

    Σ̂ = S(μ̂)/n.

Note that we solved (2.123) for Σ rather than Σ^{-1}, even though we differen-
tiated with respect to Σ^{-1}. Otherwise we would have obtained {S(μ̂)/n}^{-1}
as the maximum likelihood estimator for Σ^{-1}. We have exploited the prop-
erty of invariance of maximum likelihood estimators.
For part 3) of the exercise, we have from equation (2.122) that the log-
likelihood maximized with respect to μ̂ and Σ̂ is

    L(μ̂, Σ̂; y) = −(nv/2) log 2π + (n/2) log |Σ̂^{-1}| − (1/2) tr(Σ̂^{-1} S(μ̂))
                = −(nv/2) log 2π + (n/2) log |Σ̂^{-1}| − nv/2
                = −(nv/2) log 2π − (n/2) log |Σ̂| − nv/2
                = −(n/2) log |2πΣ̂| − nv/2.

Exercise 2.8
In order to test the hypothesis of equality of means, the matrix D and the
vectors c and μ are

    D = [ 1  −1   0  ...   0   0
          0   1  −1  ...   0   0
          .   .   .        .   .
          0   0   0  ...   1  −1 ],

c = (0, ..., 0)^T and μ = (μ_1, ..., μ_v)^T. The first row of the matrix D
imposes the constraint μ_1 − μ_2 = 0, the second μ_2 − μ_3 = 0, ..., the last
μ_{v−1} − μ_v = 0. D in this case has dimension (v − 1) × v and has full row
rank.

Exercise 2.9
We start by rewriting the expression which defines the likelihood ratio test
(2.20),

    T_LR = n log( |Σ̂_0| / |Σ̂| ).

Given that nΣ̂_0 can be decomposed as

    nΣ̂_0 = Σ_{i=1}^{n}(y_i − μ_0)(y_i − μ_0)^T = Σ_{i=1}^{n}(y_i − ȳ)(y_i − ȳ)^T + n(ȳ − μ_0)(ȳ − μ_0)^T,

we obtain

    T_LR = n log{ |Σ̂ + (ȳ − μ_0)(ȳ − μ_0)^T| / |Σ̂| }.

Now, since |A + bb^T| = |A|(1 + b^T A^{-1} b), we can rewrite the former equation
as

    T_LR = n log[ |Σ̂|{1 + (ȳ − μ_0)^T Σ̂^{-1}(ȳ − μ_0)} / |Σ̂| ]
         = n log{ 1 + n(ȳ − μ_0)^T Σ̂_u^{-1}(ȳ − μ_0)/(n − 1) }
         = n log{ 1 + T²/(n − 1) }.

This implies that the likelihood ratio test is a monotone function of Hotelling's
T² statistic.

Exercise 2.10
We have a random sample of size n_l from each of N_v(μ_l, Σ_l), l = 1, 2, ..., g.
The likelihood function is

    Lik(μ_1, μ_2, ..., μ_g, Σ_1, Σ_2, ..., Σ_g; y) = Π_{l=1}^{g} Lik(μ_l, Σ_l; y ∈ group l)
    = (2π)^{-nv/2} Π_{l=1}^{g} |Σ_l|^{-n_l/2} exp[ −(1/2) tr Σ_l^{-1} { n_l Σ̂_l + n_l(ȳ_l − μ_l)(ȳ_l − μ_l)^T } ].

The maximum likelihood estimate of μ_l (the v × 1 vector of means for group l) is
ȳ_l under both H_0 and H_1 because there is no restriction on the population
means. The maximum likelihood estimate of Σ_l is Σ_{l=1}^{g} n_l Σ̂_l/n under H_0,
where n = Σ_{l=1}^{g} n_l. Under the alternative H_1, the maximum likelihood
estimate of Σ_l is Σ̂_l. So the maximized likelihood in the two cases is:

    max_{H_1} Lik = (2π)^{-nv/2} Π_{l=1}^{g} |Σ̂_l|^{-n_l/2} exp(−n_l v/2),

    max_{H_0} Lik = (2π)^{-nv/2} | Σ_{l=1}^{g} n_l Σ̂_l/n |^{-n/2} exp(−nv/2).

The maximized loglikelihoods are

    L_1 = max_{H_1} L = −(nv/2) log 2π − (1/2) Σ_{l=1}^{g} n_l log |Σ̂_l| − nv/2,

    L_0 = max_{H_0} L = −(nv/2) log 2π − (n/2) log | Σ_{l=1}^{g} n_l Σ̂_l/n | − nv/2.

The likelihood ratio test (2L_1 − 2L_0) is equal to

    n log | Σ_{l=1}^{g} n_l Σ̂_l/n | − Σ_{l=1}^{g} n_l log |Σ̂_l|.

Exercise 2.11
We must show that the product of (C − xx^T) with the right hand side
of (2.40) gives the identity matrix:

    (C − xx^T){ C^{-1} + C^{-1}xx^T C^{-1}/(1 − x^T C^{-1}x) }
    = I_p − xx^T C^{-1} + {xx^T C^{-1} − xx^T C^{-1}xx^T C^{-1}}/(1 − x^T C^{-1}x)
    = I_p − xx^T C^{-1} + xx^T C^{-1}(1 − x^T C^{-1}x)/(1 − x^T C^{-1}x)
    = I_p − xx^T C^{-1} + xx^T C^{-1} = I_p.

For the generalization of the Sherman-Morrison-Woodbury formula, we
have to show that the product of (A − UV^T) with {A^{-1} + A^{-1}U(I_m −
V^T A^{-1}U)^{-1} V^T A^{-1}} gives the identity matrix:

    (A − UV^T){A^{-1} + A^{-1}U(I_m − V^T A^{-1}U)^{-1} V^T A^{-1}}
    = I_p + U(I_m − V^T A^{-1}U)^{-1} V^T A^{-1} − UV^T A^{-1} − UV^T A^{-1}U(I_m − V^T A^{-1}U)^{-1} V^T A^{-1}
    = I_p − UV^T A^{-1} + U(I_m − V^T A^{-1}U)(I_m − V^T A^{-1}U)^{-1} V^T A^{-1}
    = I_p − UV^T A^{-1} + UV^T A^{-1} = I_p.

This generalization can be applied to the deletion of m rows of matrix X
because X_{(m)}^T X_{(m)} can be written as X_{(m)}^T X_{(m)} = X^T X − X_m X_m^T.

Exercise 2.12
We start with m = n. From (2.28)

    d_i² = (y_i − μ̂)^T Σ̂_u^{-1} (y_i − μ̂),

where, from (2.16), Σ̂_u = S(μ̂)/(n − 1). Then

    Σ_{i=1}^{n} d_i² = (n − 1) tr Σ_{i=1}^{n} (y_i − μ̂)^T S(μ̂)^{-1} (y_i − μ̂)
                     = (n − 1) tr S(μ̂)^{-1} Σ_{i=1}^{n} (y_i − μ̂)(y_i − μ̂)^T
                     = (n − 1) tr I_v = (n − 1)v.

Thus Σ_{i=1}^{m} d²_{im} = (m − 1)v.

Since there is no limit on how remote a unit not included in the subset
can be, the upper bound on Σ_{i=1}^{n} d²_{im} is ∞. If all units sit exactly on an
ellipsoid, all d²_{im} will be equal and the sum = n(m − 1)v/m. If there is no
inversion or interchange and all units in S_*^{(m)} sit on the ellipsoid, units not
in S_*^{(m)} must have distances greater than the average value (m − 1)v/m and
so Σ_{i=1}^{n} d²_{im} ≥ n(m − 1)v/m. If all units are on the ellipsoid, the choice of
units to include or exclude would be arbitrary, the decision having no effect
on Mahalanobis distances. With an inversion, exactly one unit will have a
smaller distance than the m comprising S_*^{(m)}. The minimum value of this
distance is zero, so Σ_{i=1}^{n} d²_{im} ≥ (n − 1)(m − 1)v/m. With an interchange,
if all units in the subset have the same Mahalanobis distance, at least two

units must have values less than the average; the maximum number of units
with zero distances is n- m. In this case, 2:::~= 1 di;. 2: (m- 1)v. The step
of the search going from sim)to sim+ 1 ) will then destroy this structure
and the sum of all the distances will increase.

Exercise 2.13
(a) H^T = {X(X^T X)^{-1} X^T}^T = X(X^T X)^{-1} X^T = H.
(b) HH = X(X^T X)^{-1} X^T X(X^T X)^{-1} X^T = X(X^T X)^{-1} X^T = H.
(c) tr H = Σ_{i=1}^{n} h_i = tr {X(X^T X)^{-1} X^T} = tr {(X^T X)(X^T X)^{-1}} = tr I_p = p.
When X contains only the constant term, that is X = J,

    H = J(J^T J)^{-1} J^T = (1/n) JJ^T.

The vector of residuals e can be written as

    e = (I − H)y = (I − JJ^T/n)y = Cy.

So, C is the hat matrix for the model which contains only the constant
term.

Exercise 2.14
There are p = k + 1 columns in X: the column for the constant term, which
is a vector of ones, and k columns, one for each variable in the model,
which contain 2^{k−1} entries of +1 and the same number of −1 entries. The
columns are mutually orthogonal, so

    X^T X = diag(n, ..., n)   and   (X^T X)^{-1} = diag(1/n, ..., 1/n),

where n = 2^k.
The hat matrix H = X(X^T X)^{-1} X^T is n × n. The leverage measures h_i
are the diagonal terms of H:

    h_i = Σ_{j=1}^{p} x²_{ij}/n = p/n = (k + 1)/n.

Then Σ_{i=1}^{n} h_i = p, in agreement with the results of Exercise 2.13.


The interaction terms give additional columns of X formed by multipli-
cation of columns i and j. These columns are, as before, orthogonal to all
others. So p increases to some larger value p⁺, when all h_i = p⁺/n.
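A small numerical check of these leverages, assuming a full 2^k factorial built in Python; k = 4 here is an arbitrary choice.

```python
import numpy as np
from itertools import product

k = 4
levels = np.array(list(product([-1, 1], repeat=k)))   # full 2^k design
X = np.column_stack([np.ones(2 ** k), levels])        # constant term plus k factors

H = X @ np.linalg.inv(X.T @ X) @ X.T
print(np.allclose(np.diag(H), (k + 1) / 2 ** k))      # h_i = p/n for every unit
print(np.isclose(np.trace(H), k + 1))                 # tr(H) = p
```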

Exercise 2.15
The trace of the estimated covariance matrix is a function only of the vari-
ances of the variables; the determinant also includes the correlations. The

right panel of Figure 2.7 shows that the inclusion of units from the sec-
ond group causes an appreciable increase in the variances of the variables
(signalled by a sudden change of slope in the trace). The left panel of Fig-
ure 2.7 shows that the increase of the variances due to the initial inclusion
of the units from the group of forgeries is partially counterbalanced by the
increase in the covariances. Due to this compensation, the overall effect on
the determinant seems to be negligible compared to that on the variances
(see the left panel of Figure 2.7). This conclusion is in agreement with
what we had already seen in Figure 1.16. This figure showed that around
m = 105 to m = 110 there was not only a big increase in the variances of
variables 4, 5, 6, but also an increase in absolute values of the covariances
between variables 6 and 4, and between variables 6 and 5.
3
Data from One Multivariate
Distribution

In this chapter we extend our analyses of the examples in Chapter 1 in


order to display further features of the forward search. We use our analysis
of the Swiss heads data to exemplify the properties of bivariate boxplots
for data analysis. As a preparation for material on transformations of data
in Chapter 4 we compare analyses of the data on national track records for
women when the response is the time for the race and also its reciprocal,
speed. This transformation leads to an appreciably simpler analysis. Our
further analysis of the data on municipalities in Emilia-Romagna focuses on
the last sixteen units to enter the forward search. For part of our analysis
we reduce the data to five selected variables that explain much of the
structure of the outliers. The last example is the data on Swiss bank notes.
We analyse all 200 observations together and also look at the two groups
separately. Forward plots of individual Mahalanobis distances, calibrated
by plots of a large number of units of known origin, are shown to be a
powerful tool for determining group membership.

3.1 Swiss Heads


To find the initial subset for the forward search we fit a robust ellipse to
each bivariate scatterplot, scale the ellipse and then take the observations in
the intersection of all scaled robust ellipses as our starting point (§2.13.3) .
Figure 3.1 shows a scatterplot matrix of the data with the ellipses super-
imposed. The inner ellipse is the robust ellipse containing exactly 50% of

the data. The outer ellipse is the same ellipse scaled using θ = 0.92, which
corresponds to a theoretical value of 60% of the data in a single boxplot.
Since the content of the ellipses is similar, they are hard to distinguish in
the plot, even though we have used different types of line for the two of
them. There are exactly 25 observations inside all the 60% ellipses; these
defined the initial subset for the search in the first chapter.


FIGURE 3.1. Swiss heads: scatterplot matrix of the six measurements on 200
heads. The outer (dotted) ellipse for which θ = 0.92 gives a starting point with
m0 = 25

Figure 3.2 replots Figure 3.1, except that the coefficient for the outer
threshold is now θ = 4.71. This larger ellipse gives some indication of
whether there will be any outliers, in so far as bivariate plots are enough
to establish this. Units 111 and 104 were the last to enter the search in
Chapter 1. They are the two highlighted points on the scatterplot matrix of
Figure 1.1. In particular they have the two very large, and almost identical,
values of y4 visible in the univariate boxplot in Figure 3.2 and lie far from
the ellipse in some of the panels.


FIGURE 3.2. Swiss heads: scatterplot matrix of the six measurements on 200
heads. The outer (dotted) ellipse for which θ = 4.71 indicates some potential
outliers

In Chapter 1 we monitored the forward plot of the minimum Mahalanobis
distance amongst the observations not in the subset at each stage, which was
shown in Figure 1.3. Figure 3.3 is the plot of the maximum Mahalanobis
distance among the units in the subset, again at each stage of the search,
together with the ordered distance d_{[m]m}. These two are the same, except
at inversions or interchanges, and show little difference in this example.

FIGURE 3.3. Swiss heads: forward plots of, dotted line, maximum distances of
units in the subset and, continuous line, the ordered distance d_{[m]m}. In normal
progression the two are identical. To be compared with Figure 1.3

Figure 3.3 is similar to Figure 1.3, except that the maximum distances are
smaller than the minimum distances among units not in the subset, in line
with the argument of §2.14. A small difference is that the plots differ by
one in the value of m at which events occur. An outlier outside the subset
at stage m in Figure 1.3 will give the largest distance at stage m + 1 in
Figure 3.3, which looks at units within the subset.
A third related plot is the gap plot in Figure 3.4, which indicates inver-
sions and interchanges, that is whether the difference between the minimum
distance of units not in the subset is less than the maximum distance of
units in the subset. A series of interchanges (§2.14) can indicate the pres-
ence of a group of similar outliers. Once one or two have been included
in the subset, they may so alter the parameter estimates that the rest of
the group no longer seem remote and many or all will enter at the next
step. There are two curves on the plot. The upper shows the gap between
the m + 1st and mth ordered Mahalanobis distances at a subset size of
m, regardless of which observations are in the subset. This quantity can-
not be negative. The second curve is the difference between the minimum
Mahalanobis distance amongst the units not in the subset and the max-
imum distance amongst those in the subset. It is therefore the difference
between the curve in Figure 1.3 and the upper curve in Figure 3.3. Usually,
these two differences are the same as the ordering of distances is the same,
whether or not the constraint of belonging to the subset is considered. But

FIGURE 3.4. Swiss heads: gap plot. Forward plots of, solid line, the difference
d_{[m+1]m} − d_{[m]m} and, dotted line, the difference between the minimum distance
of units not in the subset and the maximum distance of units in the subset. In
normal progression the two are identical

the ordering of distance will depend on this constraint when an inversion


or interchange occurs. As Figure 3.4 shows, the two gaps are the same for
most of the search. There seems to be some interchange at the beginning of
the search. This is often an active period, since the initial subset is selected
by one criterion, the presence of units in the intersection of ellipses, but
is then being judged by another, the ordering of Mahalanobis distances.
After this part of the search the largest difference between the two curves
is around m = 119.
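A minimal sketch of the two quantities plotted in the gap plot, assuming the distances of all units from the current subset have already been computed; the names are illustrative.

```python
import numpy as np

def gap_quantities(d, subset):
    """Return, for the current subset size m, the gap d_[m+1]m - d_[m]m between
    ordered distances and the difference between the smallest distance outside
    the subset and the largest inside (which can be negative at an interchange).

    d      : Mahalanobis distances of all n units from the current fit
    subset : indices of the m units currently in the subset
    """
    m = len(subset)
    d_sorted = np.sort(d)
    gap_ordered = d_sorted[m] - d_sorted[m - 1]
    outside = np.setdiff1d(np.arange(len(d)), subset)
    gap_subset = d[outside].min() - d[subset].max()
    return gap_ordered, gap_subset
```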
If anything of importance to our understanding of the data is happen-
ing around m = 119 we would expect that the parameter estimates would
change. Figure 3.5 is a forward plot of the elements of the estimated co-
variance matrix for standardized observations- it is therefore the estimated
correlation matrix between the responses. The plot is stable - there seem to
be no large changes as units enter, or leave, the subset. There is perhaps a
small increase in one variance in the last two steps, caused by the entry of
the two outliers, but this is negligible compared with the dramatic changes
we saw for the Swiss Bank Note data when we started the search from just
one group, for example in Figure 1.16. It is however interesting that the
change is in a variance, in fact in element (4,4), but not obviously in the co-
variances: this is one further piece of evidence that the last two observations
are outlying only in y4.

FIGURE 3.5. Swiss heads: forward plot of elements of the estimated correlation
matrix

We continue this analysis by looking at the forward plot of individual


Mahalanobis distances for all units calculated at each subset size during
the forward search. If outliers are present, they will have large Mahalanobis
distances. But, if the outliers are masked, they will have small distances
towards the end of the search where they are included in the data used in
estimation of the parameters of the model. They will, on the contrary, give
large distances earlier in the search when they are not used in fitting. Often
the distances for masked outliers are small at the end of the search, so that
one indication of masking is the crossing of the lines in the forward plot
joining the distances as they go from large to small. If the data contain no
outliers, the plots of individual distances should be relatively stable.
Figure 3.6 shows the forward plot of scaled distances for the head data,
the scaling (§2.14) being intended to remove the effect of subset size on the
distances. The plot confirms that there are no further outliers in addition
to units 104 and 111. There is however some structure in the plot which
requires discussion.
In the lower part of the plot are the distances for units already in the
subset. It is clear from the right-hand part of the plot that these are vir-
tually constant for each unit, apart from slight fluctuations caused by the
effect of joining units on the estimates of the vector mean and covariance
matrix. Above these distances is a rising diagonal band of decreasing dis-
tances for each unit as it joins the subset. Above this are the distances for
the units which have not yet joined the subset. There do not seem to be

FIGURE 3.6. Swiss heads: forward plot of scaled Mahalanobis distances

any particularly large distances. As we saw in Chapter 1, the last two units
to join are 104 and 111, which we showed on Figure 1.1. Their distances
are largest at m = 198, just before the first of them joins the subset.
The two horizontal lines superimposed on Figure 3.6 are the square roots
of the 2½% and 97½% points of the χ²_6 distribution. These show that, unlike
residuals from normal theory regression models with a single response, Ma-
halanobis distances do not necessarily have small values: gaps at the bottom
of the plot are not surprising. This point is reinforced by Figure 3.7 which
gives the density functions and 2½% and 97½% points of χ² distributions
with degrees of freedom from one to six. We can therefore expect that
95% of the observed distances should, at the end of the search, lie within
these regions. In Figure 3.6 the lower line is at 1.112 = √1.237. In fact
the boundaries at the end of the plot suggest a very slight skewness in the
distribution of distances - there are slightly too many large ones and too
few small ones. This might indicate that we need to transform the data.
Although the scatterplots in Figure 1.1 do appear elliptical, we return to
this question at the end of the section when we look at robust boxplots in
Figure 3.11.
Interpretation of Figure 3.6 can be aided by the use of simulation. For
example, the forward plot of Mahalanobis distances, Figure 3.6, had some
structure including the diagonal band of decreases in distance when a unit
entered the subset. To see whether this is indeed a feature which can be
expected with data from a multivariate normal distribution, we simulated
200 observations from a six-dimensional multivariate normal distribution.

FIGURE 3.7. Densities and 2½% and 97½% points of χ² distributions with degrees
of freedom from one to six

FIGURE 3.8. Swiss heads: forward plot of simulated scaled Mahalanobis distances
to be compared with Figure 3.6
FIGURE 3.9. Swiss heads: boxplots of six variables with univariate outliers la-
belled

Since the Mahalanobis distances are invariant to the mean and covariance
matrix, we do not need to estimate the parameters of the distribution.
Furthermore, we can take the six dimensions as independent, so we only
need to sample six times from a univariate standard normal distribution
for each observation. The resulting forward plot of scaled Mahalanobis
distances for one simulation is shown in Figure 3.8. This is remarkably
similar to Figure 3.6. Not only does the diagonal band again show clearly,
dividing the units which are in the subset from those which are not , but
there are a few units with large distances and a similar gap at the bottom
of the plot, indicating the absence of very small distances.
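A minimal sketch of such a simulation; as noted above, independent standard normal coordinates suffice because the distances are invariant to the mean and covariance. The subset used for illustration is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n, v = 200, 6
Y = rng.standard_normal((n, v))   # six independent standard normal coordinates per unit

def mahalanobis(Y, subset):
    """Distances of every unit from the centre and covariance fitted to the subset."""
    mu = Y[subset].mean(axis=0)
    Sigma = np.cov(Y[subset], rowvar=False)
    R = Y - mu
    return np.sqrt(np.einsum('ij,jk,ik->i', R, np.linalg.inv(Sigma), R))

print(mahalanobis(Y, np.arange(25)).round(2))   # distances at a subset of size 25
```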
Finally, we consider the outliers, if any. The last 20 observations to enter
the forward search, from m = 181 onwards, are units 179, 153, 29, 95, 158,
13, 33, 159, 3, 100, 125, 147, 57, 194, 133, 10, 80, 99, 104 and 111. This is
an ordering of the data in increasing distance from the fitted multivariate
normal distribution. To see whether any of these observations are also re-
mote in the univariate distributions of each variable we give the boxplots
for the six variables in Figure 3.9.
All units lying outside the whiskers have been labelled. There is not
much relationship between the univariate outliers and the list of the last
20 units. The last two to enter, units 111 and 104, have large values of

FIGURE 3.10. Swiss heads: scatterplot matrix of the six measurements on 200
heads. The last three units to enter the search are, from the end, 111, 104 and
99

y4, more remote from this distribution than any other variables, but they
are not outlying in any other marginal distribution. Unit 99, which enters
when m = 198, is not outlying in any boxplots while unit 147, which has
the only outlying value of y2, enters eighth from the end of the search. These
univariate boxplots do not seem, for these data, to give a clear idea as to
which are the more remote units.
Finally we return to bivariate plots. Figure 3.10 has three units high-
lighted. It shows that units 104 and 111 are indeed outlying in y4, but
are not particularly remote in the other variables. Unit 99 has one of the
largest values of y2 and although it has no other extreme values, is often
towards the edge of the point cloud. The conclusion seems to be that it is
an extreme, but not anomalous, unit.

FIGURE 3.11. Swiss heads: scatterplot matrix of the six measurements on 200
heads with virtually elliptical robust contours

To check whether multivariate normality provides a satisfactory model
for these data we show, in Figure 3.11, the scatterplot matrix with super-
imposed robust bivariate contours. These should be nearly elliptical if the
data have a normal distribution, or if they belong to some other member of
the elliptical family, such as the multivariate t. See Muirhead (1982, §1.5)
for an introduction to elliptical distributions.
The contours in Figure 3.11 were found by expanding the B-spline fitted
to the peeled data, as described in §2.13.3 with a value of 2.68 for θ giving
contours with a nominal 99% content. These contours do not show any
noticeable departure from ellipticity. In particular, the robust contours have
not been attracted by the two large values for y4. A more formal test of
normality is to see whether there is any evidence that the data require
transformation. We consider this step of the analysis in Chapter 4. Our
conclusions so far are that the data are well behaved, that multivariate

FIGURE 3.12. Track records: forward plot of scaled Mahalanobis distances. There
are three clear outliers towards the end of the search and a further cluster around
m = 30

normality provides a useful model and that there may be two people for
whom the measurement of y4 is incorrect. If it is not possible to check
these readings for transcription or other errors and then, if necessary, by
remeasuring the individuals, further readings, just on this one variable,
should be enough accurately to determine the population distribution of
y4 so as to decide whether some errors have been made.

3.2 National Track Records for Women


The analysis in Chapter 1 suggested that there were three gross outliers.
There was also an indication from the forward plot of minimum distances
of units not in the subset, Figure 1.6, of a further group of 9 units which
did not seem quite to agree with the majority of the data. We start in this
section by looking at the forward plot of scaled Mahalanobis distances for
each unit. We compare this plot with the results of a simulation to confirm
that the features we find are not artefacts of the forward procedure. We
then re-analyse the data on the reciprocal scale, so that the response is
speed rather than time.
The forward plot of scaled Mahalanobis distances is in Figure 3.12. This
immediately shows that there are two outliers for much of the search. The
largest is, indeed, Western Samoa (55), with the next largest, until the end

FIGURE 3.13. Track records, simulated data: forward plot of scaled Mahalanobis
distances. To be compared with Figure 3.12

of the search, being Mauritius (36), another island. A third outlier is North
Korea (the Democratic People's Republic of North Korea, 33).
For comparison, a forward plot of simulated scaled Mahalanobis distances
is in Figure 3.13. There are some similarities - if the two outliers at the
beginning of the search in Figure 3.12 are ignored, the largest distances in
both plots are of a similar order at the beginning of the search and both
plots show the diagonal band of decreases as each non-outlying unit joins
the subset. However, there are some very important differences.
For the simulated data the distances rapidly decrease to around six.
On the contrary, the distances for the real data have appreciably higher
values. Indeed, it looks as if at around m = 30 there are three groups
of observations: the two outliers, a group of 10 observations rather distant
from the rest and then the bulk of the data. The gap between this apparent
group and the central observations decreases after m = 40. This is the point
at which there is a local maximum in the forward plot of the minimum
distances among the observations not included in the subset in Figure 1.6.
Comparison with the same plot for the simulated data in Figure 3.14 shows
that this maximum, with a value just larger than six, is indeed large enough
to suggest a departure from multivariate normality.
From largest distance downward in the centre of Figure 3.12 the coun-
tries concerned are: the Dominican Republic (16), Papua-New Guinea (41),
Guatemala (23), Turkey (52), North Korea (33), the Philippines (42), In-
donesia (26), Burma (7), the Cook Islands (12) and Argentina (1). These

FIGURE 3.14. Track records, simulated data: forward plot of minimum distances
of units not in the subset. To be compared with Figure 1.6

hardly form a homogeneous group, culturally or economically, although
none is a developed country. For an interpretation we return to the scat-
terplot matrix of the data. Figure 3.15 is a plot without Western Samoa, in
which the group of 9 are highlighted. Mauritius (36) is also shown. The bot-
tom line of the plot shows that we have, to an appreciable extent, detected
those countries with large times in the marathon. They are, however, close
together in all bivariate plots. If they are removed, the remaining times
have an appreciably more elliptical bivariate scatter.
have an appreciably more elliptical bivariate scatter.
These 12 units, that are those clear at m = 40 in the forward plot of
scaled Mahalanobis distances, Figure 3.12, also gave rise to the peak at
m = 40 of the forward plot of minimum Mahalanobis distance among units
not in the subset, Figure 1.6. This peak comes at m = 41 in the forward plot
of maximum distances among units in the subset, Figure 3.16, but is more
striking in the gap plot, Figure 3.17, where there is a decline in the plot
after m = 40 as the units labelled in Figure 3.15 start to enter the subset.
These groups of units also have an effect on the estimated covariance
matrix. Figure 3.18 is the forward plot of the estimated covariance matrix
of standardised variables. There is one stable pattern from m = 26 to 36 and
another around m = 40. As the group of nine units enters the variances and
covariances increase, the responses becoming more highly correlated. The
final introduction of Western Samoa causes a general increase in variances
and covariances.

FIGURE 3.15. Track records with Western Samoa removed: scatterplot matrix
of the national records. The results for nine countries plus Mauritius are labelled

The QQ plot of Mahalanobis distances at the end of the search, Fig-


ure 1.5, likewise showed twelve outliers. Table 3.1 lists the units with the
12 largest Mahalanobis distances at the end of the search and at m = 40.
The twelve units are not the same and the order of those that are in both
groups is very different. The forward plot of Mahalanobis distances, Fig-
ure 3.12, of course shows the connection between the two sets of units, but
is hard to interpret in such detail. It is clear from the plot that Western
Samoa (55) is the most outlying at both steps of the search. Mauritius
(36) is second in one ranking and third in the other. However, as the fig-
ure shows, North Korea (33) is second at the end of the search, but only
seventh when m = 40. Another large change is for the Cook Islands (12)
which is much more outlying at the end of the search than it is earlier
on. Of the group which were outlying when m = 40, five units are not

FIGURE 3.16. Track records: forward plot of maximum distance of units in the
subset and ordered distance d_{[m]m}. The two curves overlap


FIGURE 3.17. Track records: gap plot. Forward plots of, solid line, the difference
d_{[m+1]m} − d_{[m]m} and, dotted line, the difference between the minimum distances
of units not in the subset and the maximum distance of units in the subset. There
is a notable decline after m = 40

FIGURE 3.18. Track records: forward plot of the elements of the estimated cor-
relation matrix. There is one change before m = 40 and others at the end of the
search

TABLE 3.1. National track records for women: the units with the twelve largest
Mahalanobis distances at two points in the search

Rank     1   2   3   4   5   6   7   8   9  10  11  12
m = 40  55  36  16  41  23  52  33  42  26   7  12   1
m = n   55  33  36  12  13  51  35  16  52  25   7  14

among the twelve most outlying at the end of the search. Such hiding of
outliers when all observations are fitted is an example of masking. It would
be difficult to detect the group of outliers at m = 40 using the backwards
deletion of observations starting from the information at the end of the
search provided by the QQ plot in Figure 1.5.
Figure 3.19 is a zoom of part of Figure 3.12 which shows visually the ef-
fect of masking. If outliers remain outliers and central observations remain
near the centre of the distribution throughout the search, the plot of Ma-
halanobis distances will not include many trajectories of individual units
which cross each other. Figure 3.19 does show several trajectories crossing.
After m = 43 the distances for several of the previously outlying units
steadily decrease, crossing the trajectories of other units, which increase
towards the end, in line with our discussion of Table 3.1.
If it is correct to fit a single multivariate normal distribution to the
data, the bivariate distributions should have elliptical contours, perhaps

FIGURE 3.19. Track records: forward plot of scaled Mahalanobis distances. Detail
of Figure 3.12 exemplifying masking

with a few outliers. This seemed to be the case for the data on Swiss
heads in Figure 3.11. Figure 3.20 shows similar robust contours fitted to
the scatterplot matrix of Figure 1.4. These nominally contain 99% of the
data. This new plot is slightly strange. The upper 3 x 3 submatrix for the
shorter races contains curves which are pretty much elliptical, as does the
lower 4 x 4 submatrix for the longer races. But the off-diagonal plots of
observations from both sets of responses appear very non-elliptical.
In our analysis of these data in Chapter 1 we mentioned that there might
be reasons for looking at speed as the response, rather than time: that
is analysing 1/y rather than y. Figure 3.21 shows the robust contours,
now applied to the data after this reciprocal transformation. Comparison
with the plot for the untransformed data suggests some improvement: in
particular, the bottom row of the matrix contains a set of appreciably more
elliptical contours. This is in agreement with the discussion of Table 1.1,
as it is these longer times about which we would expect to have more
information for transformation.
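The reciprocal transformation itself is immediate; a minimal sketch, where rescaling by the race distances to give speeds is optional since columnwise scaling leaves Mahalanobis distances, and hence the forward search, unchanged.

```python
import numpy as np

def reciprocal_response(times, distances_km=None):
    """Transform record times y to 1/y, optionally multiplied by the race
    distances so that the responses are speeds."""
    speeds = 1.0 / np.asarray(times, dtype=float)
    if distances_km is not None:
        speeds = speeds * np.asarray(distances_km, dtype=float)
    return speeds
```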
If the data have a more nearly multivariate normal distribution after
this transformation, the forward search may be more stable, showing, for
example, fewer outliers. Figure 3.22 is the forward plot of standardised Ma-
halanobis distances for the reciprocal data for a search starting from units
in an intersection of robust contours. It is quite different from the plot
from the untransformed data, Figure 3.12, and does indeed show many
fewer outliers. The distances overall for units not in the subset are smaller

FIGURE 3.20. Track records: scatterplot matrix with robust contours. Some
off-diagonal plots are far from elliptical

and the group of outliers associated with developing countries with long
times for the marathon is no longer evident. The four outliers are, in de-
creasing order at the end of the search, North Korea (33), Western Samoa
(55), Mauritius (36) and Czechoslovakia (14), which has not previously
appeared as outlying.
Comparison of Figure 3.22 with the simulated distances for these data
in Figure 3.13 shows how similar the two plots are, apart from the four
potential outliers. Both show the diagonal band of decreases in distance
as units join the subset. The distances in Figure 3.22 are, apart from the
outliers, rather smaller than those in the simulated data. This is an effect
of standardisation by an estimated covariance matrix inflated by outliers,
which makes non-outlying distances too small.
The four outliers show up clearly in the forward plots associated with
individual distances. Only one is given here, that of the gap plot in Fig-
ure 3.23 which is stable until almost the end of the search, when there are
four larger values associated with the four outliers.

FIGURE 3.21. Reciprocally transformed track records: scatterplot matrix with
robust contours. Many contours appear more elliptical than those of Figure 3.20

As a result of transforming the data we have obtained a set of readings


which seem to agree more closely with a single multivariate normal distribu-
tion than do the untransformed data. The structure of the outliers is much
simpler than it was before; there now appear to be four outliers, whereas
previously there were perhaps 12. These are good reasons for transform-
ing the data and the reciprocal transformation is also appealing because
of its physical interpretation of measuring speed. However we leave to the
next chapter a fuller analysis of transformations of these data to establish
the relationship between these four outliers, the bulk of the data and the
transformation that we use.

3.3 Municipalities in Emilia-Romagna


As our third example in Chapter 1 we looked at a relatively large data set
on communities in Emilia-Romagna. We found that there were two large
outliers and, in Table 1.3, listed the last sixteen communities to enter the

FIGURE 3.22. Reciprocal track records: forward plot of scaled Mahalanobis distances. Comparison with Figure 3.19 shows a stable pattern of outliers

FIGURE 3.23. Reciprocal track records: gap plot. Forward plots of, solid line, the difference d[m+1]m - d[m]m and, dotted line, the difference between the minimum distances of units not in the subset and the maximum distance of units in the subset. The four outliers are evident

FIGURE 3.24. Municipalities in Emilia-Romagna: forward plot of scaled Mahalanobis distances; Zerba (277) and Cerignale (245) are very outlying

forward search. In this section we first look at the forward plot of scaled
Mahalanobis distances and then explore the properties of the outlying units.
Part of the challenge is that there are too many variables for the scatterplot
matrix of all observations to be decipherable.
The scaled Mahalanobis distances from a search through all the data are
plotted in Figure 3.24. This shows the two clear outliers, as well as the
diagonal band associated with the change in distance when a unit enters
the subset. The two outliers, units 245 (Cerignale) and 277 (Zerba), were
identified in Chapter 1 as small and poor mountain communities. What
is surprising is the large change in the distance of unit 277 when it joins
the subset in the final step; this unit, which as we shall see is outlying
in almost all variables, must be having an appreciable influence on the
estimated parameters of the model. We now look for ways in which the
units are outlying.
With the moderate numbers of observations in the previous two exam-
ples, we could study the scatterplot matrix and highlight units to see in
what way they were outlying. This is not practicable with 28 variables. Even
if we take groups of 7 variables, so that it is possible to see the structure
of the scatterplots, we shall miss a large number of pairwise scatterplots.
We therefore start with a univariate analysis of our outlying observations.
We first looked at 28 boxplots, six of which are shown in Figure 3.25.
In this form of boxplot the central box covers the interquartile range, with
the median denoted by the central white stripe. The whiskers can extend
up to 1.5 interquartile ranges from the central box. They terminate at the

FIGURE 3.25. Municipalities in Emilia-Romagna: boxplots of six variables showing the outlying units from Table 1.3

most extreme observation inside this distance. Individual observations lying


outside the whiskers are plotted. In Figure 3.25 we give the numbers for
the units among the outlying observations in Table 1.3.
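
As a minimal sketch of the rule just described (the code and names are ours, not from the book), the units flagged in any one boxplot can be reproduced with the usual 1.5 interquartile-range convention:

    import numpy as np

    def boxplot_outliers(x):
        # Indices of observations lying beyond 1.5 interquartile ranges
        # from the central box, i.e. the points plotted individually.
        q1, q3 = np.percentile(x, [25, 75])
        iqr = q3 - q1
        return np.where((x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr))[0]

Applying this rule to each of the 28 columns in turn gives the univariate outlier counts summarised later in Table 3.2.
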
The boxplots in Figure 3.25 were selected as having the largest numbers
of labelled units, which are the units that the forward search has detected
as being outlying. A difficulty with this univariate approach is that we
may tend to select highly correlated variables in which the same units are
outlying. So we tried to select our variables to have distinctive patterns
of outliers. It seemed as if five variables would be enough. These, together
with the abbreviations used as labels in Figures 3.25 and 3.26 were:
y1: infantile population (pop.inf)
y12: birth rate (birth)
y18: number of cars (cars)
y27: artisanal enterprises (artisan)
y28: entrepreneurship (entrepreneur).
The resulting scatterplot matrix is shown in Figure 3.26.
These scatterplots show up many of our units as two-dimensional outliers.
The most remote villages, 277 and 245, both stand out. Both have small
populations of young people (y1), but differ in the value of y28, which
reflects entrepreneurship, possibly a difficult variable to measure. Almost

FIGURE 3.26. Municipalities in Emilia-Romagna: scatterplot matrix for five of the six variables of Figure 3.25

all the other communities show as outliers from the general clusters of
points. The most striking is unit 310, Casina, which has an amazingly
high birth rate, y12, about twice the usual value. This is hard to explain,
if, indeed, the number is correct. One possibility is that births have been
credited to both father and mother. Another might be the presence of a
maternity hospital, although this is not in fact the case. A more prosaic
explanation is that there is a transcription error; the value for y12 should
be 7.10 not 17.10. Apart from this one anomalous value, the measurements
for Casina fall in the centre of the observed values.
One unit, 238, which came in when m = 326, is not a univariate outlier
in any of our boxplots. However it shows up as a bivariate outlier in, for
example, the plot of y1 against y18. Car ownership is a little too high for a
community with such a low birth rate which, in many cases, is associated
with an impoverished and aging rural population. Projection of this observation
onto either of the co-ordinate axes produces an extreme observation, but not
one large enough to be classified as a univariate outlier. This unit
provides a nice illustration of how lower dimensional projections may fail
to detect higher dimensional outliers.
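
The phenomenon is easy to reproduce with simulated data. The sketch below (our own construction, not taken from the book) builds a strongly correlated bivariate sample plus one unit that is unremarkable in each coordinate separately, yet has by far the largest Mahalanobis distance because it breaks the correlation.

    import numpy as np

    rng = np.random.default_rng(0)
    cov = np.array([[1.0, 0.95], [0.95, 1.0]])
    Y = rng.multivariate_normal([0.0, 0.0], cov, size=200)
    Y = np.vstack([Y, [1.5, -1.5]])        # moderate in each margin, wrong sign jointly

    mu = Y.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(Y, rowvar=False))
    d = np.sqrt(np.einsum('ij,jk,ik->i', Y - mu, S_inv, Y - mu))
    print(d[-1], d[:-1].max())             # the added unit dominates the distances
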
Our five selected variables give a scatterplot matrix which reveals all
our outlying observations except unit 133, Torriana. The boxplots of Fig-
ure 3.25 show that this community has an outlying value of y19, the pro-
portion of luxury cars. Once this variable is included, this community is
clearly seen as an outlier.
The occurrence of the units as univariate and bivariate outliers is sum-
marised in Table 3.2. The counts for the number of occurrences as a uni-
variate outlier come from the 28 boxplots, of which Figure 3.25 displays
part. The counts for bivariate outliers come from the robust ellipses drawn
through the data as part of the starting procedure for the forward search de-
scribed in §2.13. Each unit can be a univariate outlier in up to 28 boxplots
and a bivariate outlier in up to 28 × 27/2 = 378 bivariate plots. Different
municipalities have very different patterns. The two large outliers, units
245 and 277, are outlying in around 10 univariate plots and around 295
bivariate plots. Units 250 and 260 are similar, but a little less extreme. The
plot in Figure 3.26 for y1 (infantile population) against y27 (artisan enter-
prises) shows these four units close together. The other extreme is shown
by unit 238, which is never a univariate outlier, although it is a bivariate
one 24 times. We have already discussed the position of this unit in one of
the panels of Figure 3.26. The other units are all outlying in at least 100
bivariate plots, except for unit 310, Casina. This is outlying in 2 boxplots
and only 59 bivariate plots. This supports our contention that, apart from
these two responses, all the others for this community are near the centre
of the distribution.
In order to explain the outlying properties of unit 133 we had to add an-
other variable to the five we chose for the scatterplot matrix of Figure 3.26.
It is not satisfactory to have to introduce a new variable for one unit. In-
stead of our ad hoc procedure here we need a statistically guided procedure
which will, with much less effort, yield variables descriptive of the whole set
of data and revelatory of the outliers. The principal components analysis
of Chapter 5 is intended to do this, at the same time providing a summary
of the data with variables which have a physical interpretation.
As well as the measured indices used so far as responses, communities
have physical location. It is reasonable to expect that many communities
will have characteristics similar to those of their neighbours. This is likely
to be true, for example, of the data on the agricultural communities in
the Apennines. We discuss the analysis of spatial data, although not this
example, in detail in Chapter 8. Here we consider the spatial distribution
of our 16 outliers.

TABLE 3.2. Municipalities in Emilia-Romagna: occurrences of univariate and


bivariate outliers among the last 16 communities to enter the forward search
(Table 1.3)
Unit Univariate Bivariate % %
Number Outliers Outliers Univariate Bivariate
277 11 297 39.3 78.6
245 9 293 32.1 77.5
70 3 172 10.7 45.5
239 3 222 10.7 58.7
310 2 59 7.1 15.6
260 8 246 28.6 65.1
250 8 288 28.6 76.2
30 4 130 14.3 34.4
2 4 177 14.3 46.8
133 1 101 3.6 26.7
264 1 115 3.6 30.4
188 4 154 14.3 40.7
194 1 156 3.6 41.3
149 3 186 10.7 49.2
238 0 24 0.0 6.3
88 4 152 14.3 40.2

Figure 3.27 is a map of Emilia-Romagna with the 16 communities labelled


and shaded. The municipalities shown up by the forward search and listed
in Table 1.3 essentially belong to two broad categories.
First, we have a group of small villages located in the mountains, many of
them in the province of Piacenza (PC on the map). These villages include
the extreme outliers 277 and 245, as well as 239, 260, 250 (which is the
largest community of this group), 264, 188 and 194, the last two being
located in the province of Parma. In some cases these communities are
close together: 277, 245, 260 and 250 form a spatial cluster at the southern
end of the western boundary of the region. All have similar problems, with
an aging population and low indexes of wealth, education, housing and
industrial development. Of course, such features are exacerbated for the two
extreme outliers, Cerignale (245) and Zerba (277). These two communities
are outliers in almost all dimensions. On the contrary, Piozzano (264) -
which is closer to Piacenza and is located more in the hills than in the
mountains - may be considered as a "borderline" community: it gets low
scores on housing and aging (although less extreme ones than the villages
in the mountains), but has a very low unemployment rate. So here it is a
combination of indicators with different "signs" that makes the outlier.
Secondly, there are a few villages near to Bologna (30 and 2) and Modena
(149) which are clearly outliers in opposite directions from the villages
in the Apennine mountains: they have generally good aging, housing and

FIGURE 3.27. Municipalities in Emilia-Romagna: the last 16 communities to enter the forward search (Table 1.3) are shaded and labelled

wealth indexes. These municipalities often stand out in bivariate plots, such
as those contrasting aging and unemployment, aging and housing and aging
and wealth, in Zani (1996, Chapter 7).
The additional communities in Table 1.3 are scattered throughout the
regional map. For these villages outlyingness is mainly attributed to local
instances, rather than to general socio-economic causes, as seemed to be
the case for unit 238 (Calendasco). Goro (70), which is the third last to
enter (and so is the most remote after the two extreme outliers) shows
up as an outlier in many of the scatterplots in Figure 3.26. It is a non-
touristic seaside community on the once malarial marshes near the mouth
of the River Po. In some of the scatterplots it is close to unit 245, but it
has an average birth rate and number of young children. It also, as the
boxplot in Figure 3.25 shows, has a high number of luxury cars. Recording
or definition errors may also, as we have suggested for unit 310 (Casina),
generate outliers. As a further example, the last community reported in
Table 1.3, Bellaria-Igea Marina (88), a seaside resort, has a relatively high
unemployment rate. However, it is well-known that many people working in
tourism (for example in hotels and restaurants or on beaches) are wrongly
registered as unemployed in censuses and other surveys.
This further discussion is a reminder that we may have several kinds
of outlier. In particular, atypical combinations of positive and negative
characteristics will lead to communities being detected as outlying. For
understanding the general structure of the data we are also concerned with
estimation of patterns once the outliers have been detected. For example,
many of the outlying communities shown shaded on the map of Figure 3.27
fall in the mountainous areas of the region, as do many other poor com-
munities. We return to these data in Chapter 4, where our focus is on
simultaneous transformation of the 28 variables.

3.4 Swiss Bank Notes


In our first analysis in Chapter 1 of the data on Swiss bank notes we were
able to identify at least two groups of observations: units 1-100 seemed
reasonably homogeneous, but perhaps with a few outliers, whereas units
101-200 seemed divisible into two groups. We obtained these results by
fitting a single distribution to all 200 observations, but using three searches,
one starting in both groups and the others starting with units in only one
group. In this chapter we first extract some more information from the joint
analysis of both groups. We then split the data into two groups, analysing
each separately.
The scatterplot matrix of Figure 1.20 shows that the two groups are
not well separated. Accordingly, the method of robust ellipses chooses an
initial subset that contains units from both groups, which are not removed
during the early steps of the search. We saw in Chapter 1 that this subset
of 20 observations initiates a search which does not reveal the two groups.
Figure 3.28 is the resulting forward plot of scaled Mahalanobis distances.
The plot suggests there are about 20 outliers, but there is indeed no obvious
indication of two groups.
Figure 3.29 is the corresponding gap plot for this search. It is unevent-
ful: there is, as is often the case, appreciable inversion and interchange at
the beginning of the search. Thereafter the gap remains constant, the two
curves superimposed one on another, so there is no evidence of any fur-
ther interchange of units in the subset. At the end of the search, around
m = 176, the gap increases as some outliers enter. This reaches a maximum
at m = 184, but then decreases as a cluster of outliers enter the subset one
by one.
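
For reference, a small sketch (our own code and names) of the two quantities plotted in these gap plots at a given subset size m: the difference between the (m+1)th and mth ordered Mahalanobis distances, and the difference between the smallest distance of the units outside the subset and the largest distance of the units inside it.

    import numpy as np

    def gap_quantities(d, in_subset):
        # d: distances of all n units at this step; in_subset: Boolean mask (m True).
        m = in_subset.sum()
        d_sorted = np.sort(d)
        ordered_gap = d_sorted[m] - d_sorted[m - 1]      # (m+1)th minus mth ordered distance
        interchange_gap = d[~in_subset].min() - d[in_subset].max()
        return ordered_gap, interchange_gap

When the subset consists exactly of the m units with the smallest distances the two quantities coincide, which is why the two curves are superimposed in the absence of interchanges.
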
The analysis starting with 20 observations in Group 1 is appreciably more
informative. Figure 3.30 shows the scaled Mahalanobis distances from a
forward search starting with the first 20 observations on supposedly genuine
notes. This is quite different from any plot we have so far seen. In the first
part of the search, up to m = 93, the observations seem to fall into two
groups. One has small distances and is composed of observations within,
or shortly to join, the subset. Above these there are some outliers and
then, higher still, a concentrated band of outliers, all of which are behaving
similarly.

FIGURE 3.28. Swiss bank notes: forward plot of scaled Mahalanobis distances from the search starting with a subset of units from both groups. There seem to be many outliers, but the group structure is not clear

FIGURE 3.29. Swiss bank notes: gap plot. Forward plots of, solid line, the difference d[m+1]m - d[m]m and, dotted line, the difference between the minimum distances of units not in the subset and the maximum distance of units in the subset

FIGURE 3.30. Swiss bank notes, starting with the first 20 observations on genuine notes: forward plot of scaled Mahalanobis distances. The three groups are evident

FIGURE 3.31. Swiss bank notes, starting with the first 20 observations on genuine notes: gap plot. A series of interchanges starts at m = 106

When m = 101 the search has reached a point at which at least one
observation from Group 2 has to join the subset. In fact, as the figure
shows, due to the presence of outliers, this inclusion of units from both
groups starts a little earlier. From m = 95 the distances to the group of
outliers decrease rapidly, as remote observations from Group 1, the genuine
notes, join the subset. Around m = 105 we can see that many of these
former outliers are joining the subset (their distances decrease), while many
of the units formerly in the subset leave (their distances increase). The
crossing of two bands of outliers, seen here between m = 105 and m = 115,
is typical of distances in a forward search when one multivariate normal
distribution is fitted to data that contain two or more groups of appreciable
size. Once the subset contains units from both groups, the search continues
in a way similar and then identical to that of the earlier search. The right
hand thirds of Figures 3.28 and 3.30 show identical patterns of Mahalanobis
distances, the only difference being the vertical scale of the plots.
The structure of the data is also exposed in the gap plot for the search
starting in the first group. Figure 3.31 shows that the gap is steady until
m is near 100, when a few outlying observations enter. But, most informa-
tively, at m = 105, there is a large gap between the two plots, indicating
that there is an interchange of units at m = 106. This happens a little after
m = 100 because a few units from the second group need to be in the sub-
set before there is an appreciable change in the parameter estimates. The
interchanging of units persists for another ten steps of the forward search.
The last third of the search is the same as that in Figure 3.29 which started
from both groups.
Very detailed information about the interchange and the group mem-
bership of each unit comes from looking at forward plots of individual
distances for the one or more units entering the subset at each stage of
the search. Figure 3.32 shows a series of such plots. The first panel plots
all 98 distances for the units in the subset at m = 98. This distribution
of distances is used to judge the behaviour of successive units as they are
included in the subset. The second panel of the figure shows the percentage
points of the estimated distribution of these 98 distances. The plotted per-
centage points are at 2.5%, 5%, 12.5%, 25% and 50% and the symmetrical
upper points of the empirical distribution. Superimposed upon this distri-
bution is the forward plot of the distance for the unit which enters when
m = 99, which is unit 125. This is just outside the distribution at m = 99,
and so is the appropriate unit to enter. But the shape of the trace of the
Mahalanobis distance is very different from units already in the subset: it
is initially too high, becoming atypically low later in the search. This is the
first unit from the second group to enter the subset. Units 104 (m = 100)
and 127 (m = 101) show a similar transition from high to low. However,
when m = 102, unit 70 has a rather different profile, being much smaller
around m = 70 to 90, and much higher after m = 100. This unit does not
behave in the same way as the three earlier ones to join the subset.
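
A sketch of how such a reference band can be constructed (the code, array layout and names are ours): the percentage points are taken, at each step of the search, across the trajectories of the 98 units in the subset at m = 98.

    import numpy as np

    def reference_band(trajectories):
        # trajectories: array of shape (n_steps, 98), one column per reference unit.
        probs = [2.5, 5, 12.5, 25, 50, 75, 87.5, 95, 97.5]
        return {p: np.percentile(trajectories, p, axis=1) for p in probs}

The trajectory of each newly entering unit is then drawn on top of this band, as in the panels of Figure 3.32.
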

FIGURE 3.32. Swiss bank notes, starting with the first 20 observations on genuine notes: nine panel plot of forward Mahalanobis distances for specific units, starting from m = 98

The next three units to join, 129, 107 and 105, all show the pattern
associated with units from Group 2. The interchange starts at m = 106
when three units from Group 2 enter, all with profiles which are initially
very high and finally very low. This pattern continues in Figure 3.33. All
nine panels on the plot show at least two, but mostly three or more units
from Group 2 entering the subset at each stage of the forward search. For
the units to enter together, they have to have Mahalanobis distances which
are very close to one another at the value of m for which they enter. In
the last panel of Figure 3.33 this is well inside the reference distribution
of units from Group 1, many of which will have left the subset during the
interchanges.
At each stage one or two or more units which are already in the subset
have to leave. As the parameter estimates change with the inclusion of
large numbers of units from Group 2, some units from Group 1 become
increasingly outlying. However, as Figure 3.34 shows, they do have finally
to rejoin the subset, since all units are included by the end of the search.

FIGURE 3.33. Swiss bank notes, starting with the first 20 observations on genuine notes: nine panel plot of forward Mahalanobis distances for specific units, starting from m = 107. Reference distribution: m = 98

The first panel of the figure shows four units joining, three from Group 2
and one from Group 1. However the succeeding panels show the pattern
more clearly. When m = 117 and 118 the units included are from Group
1 and lie well within the distribution, as does unit 17 which joins when
m = 121. The rest of the units, with profiles which are initially high and
finally low, clearly belong to Group 2.
These three figures show very clearly the amount of information on group
membership which can be found by looking at individual profiles against a
background of profiles from a known group. We find this a very powerful
technique in cluster analysis as we judge units against the clusters we are
developing. Further examples are in Chapter 7. But, for the moment, we
see what happens for a search starting with members of the second group.
The forward plot of scaled Mahalanobis distances, Figure 3.35, is broadly
similar to that when we started with Group 1, Figure 3.30, but the differ-
ences are informative. Before the interchange of units in the subset, the two
groups seem more clearly separated than when we started from Group 1,

FIGURE 3.34. Swiss bank notes, starting with the first 20 observations on genuine notes: nine panel plot of forward Mahalanobis distances for specific units, starting from m = 116. Reference distribution: m = 98

although the non-fitted observations have a higher dispersion than before.


The interchange starts earlier, because some units in Group 1 are closer to
the centre of Group 2 than some of the outliers from Group 2. The dis-
tances for these outliers appear unaffected by the change in subset: there
is little of the increase in distances that we saw in Figure 3.30. The last
third of the search is again the same as before.
Figure 3.36 is the gap plot for the search starting with units from Group
2. It has some features in common with Figure 3.31 for the search starting
from Group 1, but is less dramatic; the presence of the outliers from Group
2 means that the interchange starts earlier for the search starting in Group
2, at m = 97, than for that shown in Figure 3.31, m = 106. Before the
divergence between the two curves in Figure 3.36 at m = 105, there is a
spike caused by a single outlier. The last part of the curve is again the same
as that in Figures 3.29 and 3.31.
Plots of individual distances start in Figure 3.37. The first panel shows
the 80 distances which are used to provide the distribution for calibration.

FIGURE 3.35. Swiss bank notes, starting with units in Group 2 (the forgeries): forward plot of scaled Mahalanobis distances. Compare with Figure 3.30

These are all units from Group 2 and so are the first five to enter. The first
four to enter have profiles similar to those in the calibrating distribution,
although they are a little extreme and indeed should be: otherwise they
would have entered the subset earlier. The first unit that looks rather dif-
ferent is unit 125, entering when m = 85, which is the outlier causing the
spike at m = 84 in Figure 3.36. It is also the first unit to join Group 1 in
Figure 3.32. The three remaining units to enter, in the bottom panels of
Figure 3.37, are all from Group 1 and have similar profiles: they are initially
too high, but finally have increasingly small distances in the panels from
left to right in the last row of the figure. These profiles are reminiscent of
those for the Group 2 units in Figure 3.32 when starting from Group 1
because the units are initially remote from the subset used in fitting, but
finally have small distances from the model fitted to most of the data. Unit
125, the outlier in Figure 3.37, has a profile somewhere between those for
units in Group 2, too high initially, and those in Group 1 being finally too
high.
The interchanges start in Figure 3.38 when m = 97. All units joining are
from Group 1 and show similar profiles. Figure 3.39 is similar, although
interchanges do not occur at all steps of the forward search. However, in
the final panel, when m = 106, five units from Group 1 enter the subset.

FIGURE 3.36. Swiss bank notes starting with units in Group 2 (the forgeries): gap plot. Figures 3.38 and 3.39 show a series of interchanges starting at m = 97

The final plot of this series, Figure 3.40, shows further interchanges and
the inclusion of units from Group 1. It is not until m = 114 that another
unit from Group 2 is included. This interesting panel, in the centre of the
bottom row of the figure, shows the inclusion of two units: unit 2, from
Group 1, has a profile which starts high before merging into the general
distribution of distances. On the other hand, unit 128, from Group 2, has a
distribution which is within the Group 2 distribution throughout. Finally,
at m = 114, the distances for the two units are very close.
These series of plots of distances for individual units against a calibratory
background provide a powerful tool for assigning units to groups and we
make appreciable use of them in Chapter 7 on cluster analysis. But, to
conclude this chapter, we analyse each group individually.
Figure 3.41 is a forward plot of the scaled Mahalanobis distances for
the units only in Group 1. This seems to be a well-behaved plot: the last
five units to enter are, from last backwards, 1, 40, 70, 71 and 5. Of these
1 and 40 seem to be outlying for much of the search. The forward plot
of the minimum Mahalanobis distance among the units not in the subset,
Figure 3.42, has the largest distances at the end, suggesting four or five
potential outliers. There are no sharp spikes in the curve, so we seem to have
ordered the units satisfactorily. There is thus no evidence to suggest that
the data are more complicated than a sample from one multivariate normal
distribution with, perhaps, a few outliers. This suggestion is supported by
Figure 3.43 which shows that the elements of the estimated covariance

FIGURE 3.37. Swiss bank notes starting with units in Group 2 (the forgeries): nine panel plot of forward Mahalanobis distances for specific units, starting from m = 80

matrix are stable from m = 38 until almost the end of the search: the last
five units to enter do cause a noticeable change in some of the variances
and covariances. The visually most obvious involve y4, y5 and y6.
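
A minimal sketch of how the curves in such a plot can be assembled (our own code and names): at each step the distinct elements of the covariance matrix estimated from the current subset are recorded.

    import numpy as np

    def covariance_elements(Y, subsets):
        # subsets: list of index arrays, one per step of the forward search.
        # Returns one row per step with the variances and covariances
        # (upper triangle) of the subset covariance matrix.
        iu = np.triu_indices(Y.shape[1])
        return np.array([np.cov(Y[s], rowvar=False)[iu] for s in subsets])
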
We now conclude our analysis of the genuine notes by looking once more
at the data. Figure 3.44 is the set of univariate boxplots for each variable,
with all univariate outliers marked. Units 1 and 5 are each outlying in
two boxplots and units 40, 70 and 71 in one each. These are all the units
identified by the boxplots as outlying and are the last five to enter the
forward search. However, appreciation of the relative importance of these
units is partly clarified by the scatterplot matrix of Figure 3.45 in which
the five units are highlighted.
For example unit 1 has high values of y2 and y3 and lies at the upper ex-
treme of the major axis of the bivariate scatter; unit 40 is likewise extreme,
but with the opposite sign, in the plot of y1 and y3. But the bivariate scat-
terplots fail to show any clustering of outliers, a conclusion which agrees
with the structure of the forward plots for Group 1.

FIGURE 3.38. Swiss bank notes starting with units in Group 2 (the forgeries): nine panel plot of forward Mahalanobis distances for specific units, starting from m = 89. Reference distribution: m = 80

The structure of Group 2 is more complicated, but is also readily revealed


by the forward search. Figure 3.46 is a forward plot of the scaled Maha-
lanobis distances. In the centre of the plot, around m = 70, this shows a
clear structure of a central group, one outlier from that group and a second
group of 15 outliers which are the cluster we found in Chapter 1. The plot
of minimum Mahalanobis distances not in the subset in Figure 3.47 reflects
this structure: there is an increase at m = 84 for the outlier which enters at
m = 85. The peak in the figure at this value of m is caused by the first unit
in the cluster of outliers. As successive units from this cluster enter, they
become less remote and the distances decrease. This effect can be seen in
Figure 3.48, which shows the elements of the estimated covariance matrix.
These are stable for much of the search until m = 80. Thereafter several
of the elements change: in particular the estimated variance of y4 increases
steadily as does the variance of y6 and their covariance, which changes sign.
We now consider the profiles for individual distances during the search.
Figure 3.49 captures these in the region where the second group starts to

FIGURE 3.39. Swiss bank notes starting with units in Group 2 (the forgeries): nine panel plot of forward Mahalanobis distances for specific units, starting from m = 98. Reference distribution: m = 80

enter. The first panel shows the distances up to m = 80 which are used to
define the reference distribution. The next four panels show units entering
which, although extreme, have profiles similar to those in the reference
set. The outlier, unit 125, enters when m = 85. Its distance is too large
throughout, but it has a similar profile to the other units in the central
group. Thereafter the units forming the second cluster start to enter. The
three panels in the bottom row of the plot show units 111, 194 and 168.
These profiles, high for most of the search, are clearly different from those
in the central group. The profiles for the remaining 12 units of this cluster,
which we do not show, are similar although gradually larger at the end of
the search: the forward plot of all distances, Figure 3.46, shows that finally
none of the distances is very large.
This analysis clarifies details of the structure that was found in Chapter 1.
We again conclude by looking at plots of the data. The univariate boxplots,
Figure 3.50, do not reveal a cluster of outliers, although they do show that
there are some present.

FIGURE 3.40. Swiss bank notes starting with units in Group 2 (the forgeries): nine panel plot of forward Mahalanobis distances for specific units, starting from m = 107. Reference distribution: m = 80

The scatterplot matrix of Figure 3.51 clearly shows the second cluster
and the clear separation in the plot of y4 against y6.
In passing we note that the overwritten labels in the boxplots occur
because of the rounding of the data, which is clear in the panels of the
scatterplot matrix, perhaps most clearly in that of y2 against y3. It is also
clear in the plot of y4 against y6. As Figure 3.52 shows, the structure
of all 200 observations is revealed by this one scatterplot. An interesting
question is whether the other four measurements help in deciding whether
a new note is genuine or forged. We consider such questions in Chapter 6
on discriminant analysis.

FIGURE 3.41. Swiss bank notes, units in Group 1 (genuine notes): forward plot of scaled Mahalanobis distances; a single multivariate population with perhaps five outliers

FIGURE 3.42. Swiss bank notes, units in Group 1 (genuine notes): forward plot of maximum distances of units in the subset. There is some evidence of outliers

FIGURE 3.43. Swiss bank notes, units in Group 1 (genuine notes): forward plot of elements of estimated covariance matrix. The effect of the five outliers can be seen towards the end of the search

FIGURE 3.44. Swiss bank notes, units in Group 1 (genuine notes): boxplots showing univariate outliers

FIGURE 3.45. Swiss bank notes, units in Group 1 (genuine notes): scatterplot matrix showing the five univariate outliers of Figure 3.44

FIGURE 3.46. Swiss bank notes, units in Group 2 (the forgeries): forward plot of scaled Mahalanobis distances. The structure of two groups and an outlier is clear around m = 70

FIGURE 3.47. Swiss bank notes, units in Group 2 (the forgeries): forward plot of minimum distances of units not in the subset. The peak is at m = 84

FIGURE 3.48. Swiss bank notes, units in Group 2 (the forgeries): forward plot of elements of estimated covariance matrix. The variances of y4 and y6, together with their covariance, change appreciably at the end of the search

FIGURE 3.49. Swiss bank notes, units in Group 2 (the forgeries): nine panel plot of forward Mahalanobis distances for specific units, starting from m = 80

FIGURE 3.50. Swiss bank notes, units in Group 2 (the forgeries): boxplots showing univariate outliers

FIGURE 3.51. Swiss bank notes, units in Group 2 (the forgeries): scatterplot matrix. The separation of the two groups is particularly clear in the panel for y4 and y6

FIGURE 3.52. Swiss bank notes. Scatterplot of y6 against y4 which reveals most of the structure of all 200 observations: there are three groups and an outlier from Group 1, the crosses. Unit 125, the open circle, lies within Group 2 for these variables

3.5 What Have We Seen?


The four examples in this chapter illustrate the wide variety of information
that can be obtained by fitting a single multivariate normal distribution to
data when this is combined with the plots from the forward search.
The data on Swiss heads showed a well-behaved sample. There were
two outliers, but otherwise the data behaved like a sample from a nor-
mal distribution. We continue our analysis in Chapter 4 by looking at the
transformation of these data and the effect of the outliers on the estimated
transformation.
The national track records for women showed an appreciably more com-
plicated structure when analysed in the original scale of time. The forward
plot of Mahalanobis distances showed evidence of masking and of a group
of outliers as well as three nations which were outlying throughout. Anal-
ysis of the data in the reciprocal scale, that is in terms of speed, gave a
structure more like that in the previous example: most of the data were
well-behaved with four outliers. Again we look at transformations of these
data in the next chapter. Although the reciprocal transformation has a
physically meaningful interpretation, it may be that some other transfor-
mation will lead to a closer approximation to multivariate normality.
The main difficulty in the analysis of the data on municipalities in
Emilia-Romagna was the size of the data set, particularly the number of
variables. The forward search indicated two gross outliers. More impor-
tantly, it ordered the data by closeness to the multivariate normal model,
leading us to focus on the properties of 16 units. We also consider trans-
formation of these data in the next chapter where, again, the appreciable
number of variables severely complicates the analysis.
The fourth example, the Swiss bank notes, which was known to consist of
at least two groups, showed no such simple structure. The final conclusion
of our analysis was that there were three groups and a few outliers. The
plots of Mahalanobis distances for all 200 units showed a structure which
we shall see again in the chapter on cluster analysis. This structure was
particularly evident when we started with a subset of units from the more
compact group, that of genuine notes. Half way through that search, as
units from the second group were introduced into the subset, there was,
for example in Figure 3.30, a dramatic series of interchanges, leading to
a pattern of crossing bands in the forward plot of distances. The effect
of the interchange was also clear in other plots, such as that of estimates
of the elements of the covariance matrix, Figure 1.16. But a particularly
informative tool was to look at the profiles of the distances for individual
units, judged against the behaviour of a large group of units of known
provenance, for example Figure 3.39.
Analysis of the two sets of 100 units yielded plots more like those seen in
the earlier examples. The genuine group consisted of a well-behaved central
set of units and, perhaps, a few outliers. The forged notes split cleanly into
a third group of 15 units and the main second group with one outlier. The
scatterplots showed no evidence of non-elliptical contours, so that there
was no obvious reason to consider transformation of these data.

3.6 Exercises
Exercise 3.1 Table 3.3 gives 20 simulated observations Y from a bivariate
normal population. What do you expect will be the form of the forward plots
of the determinant |Σ̂m|, of the maximum Mahalanobis distance within the
subset, of the minimum distance outside the subset and of the gap plot
(equation 2.105 or 2.106)? What do you expect from the forward plot of
scaled Mahalanobis distances?

TABLE 3.3. 20 simulated observations from a bivariate normal distribution


Number y1 y2
1 11.510 9.673
2 9.183 9.539
3 8.494 12.000
4 9.422 11.500
5 10.730 10.630
6 10.990 10.570
7 8.592 10.160
8 10.290 8.532
9 9.213 11.930
10 10.530 9.966
11 10.950 10.380
12 9.187 10.170
13 9.914 9.258
14 9.546 9.593
15 9.747 10.070
16 9.776 8.515
17 10.880 9.413
18 10.910 8.978
19 9.417 10.990
20 9.832 11.610

Exercise 3.2 A constant value of six is added to the last three rows of the
matrix Y in Table 3.3. What do you expect for the form of the five forward
plots you described in Exercise 3.1? What is the effect of the inclusion of
the three contaminated units on the covariance matrix?
Exercise 3.3 If instead the constant value of six is added to the last six
rows of the matrix Y in Table 3.3, what are the forms of the five forward
plots?
Exercise 3.4 As in Exercise 3.3 the constant value six is added to the last
six rows of the matrix Y in Table 3.3. Describe the form of the five forward
plots when now the initial subset contains four good units (say for example
11, 12, 13 and 14) and two contaminated units (say for example 15 and
16).

FIGURE 3.53. Exercise 3.7: two sets of simulated bivariate data. Which of the two has to be transformed to achieve bivariate normality?

Exercise 3.5 Again, as in Exercise 3.3, we add six to the last six rows of
the matrix Y in Table 3.3, but now start the search with a subset formed
only by the contaminated units. Describe the form of the five forward plots.

Exercise 3.6 Track records: can you assess from Figures 3.19 and 3.22
which units enter the search in the last three steps? What is their order of
entry?

Exercise 3.7 Figure 3.53 shows a scatterplot of two bivariate data sets


each of 250 units. What can be said a priori about the need to transform
the data? Consider forward searches in which you monitor the likelihood
ratio test for the hypothesis of no transformation. Describe the plot you
would expect to get for each of the two sets of data.

Exercise 3.8 Figure 3.8 in Chapter 1 is a forward plot of simulated Ma-


halanobis distances of the Swiss heads data. Show that simulation of these
distances does not require the estimates of the mean and covariance matrix
of the data. Describe how you would simulate the distances.

Exercise 3.9 The QQ plot of Figure 1.5 shows three large outliers. The
next nine distances lie outside or on the simulation envelope. Determine
whether these, together with one of the three clear outliers, form the group
of ten developing countries identified as being rather different from the rest
in §1.3. Discuss the implications of your finding for masking.

FIGURE 3.54. Exercise 3.1, simulated uncontaminated data from a bivariate normal distribution: forward plots of Wilks' ratio, maximum and minimum distances and of the gap. The scales of the y axes are the same as those in Figure 3.56

3.7 Solutions

Exercise 3.1
To plot determinants we use Wilks' ratio, defined as |Σ̂m|/|Σ̂n| (equa-
tion 2.107), which is one at the end of the search. When there are no
outliers we expect that the curve which monitors this ratio will increase
steadily from 0 to 1 throughout the forward search. Similarly, we expect
that the curves of the maximum (mth) and minimum ((m + 1)th) Maha-
lanobis distances will increase slightly with small random fluctuations dur-
ing the search. Figure 3.54 shows that this is the case.
When there are no outliers we expect that all scaled Mahalanobis dis-
tances fall within the theoretical horizontal asymptotic confidence bands
throughout the forward search (see Figure 3.55).
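
A sketch of the ratio just defined (our own code and names): Σ̂m is estimated from the m units in the subset and Σ̂n from all the data.

    import numpy as np

    def wilks_ratio(Y, subset):
        # |Sigma_hat_m| / |Sigma_hat_n|: near zero early in the search and
        # equal to one at the final step, when the subset is the full sample.
        S_m = np.cov(Y[subset], rowvar=False)
        S_n = np.cov(Y, rowvar=False)
        return np.linalg.det(S_m) / np.linalg.det(S_n)
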

Exercise 3.2 In the presence of three outliers we expect a change in the


slope of the curve which monitors the determinant when the first outlier
is included (in this case step m = n - 2 = 18). At the same step we
expect a jump in the plot which monitors the maximum (mth) Mahalanobis
distance. But in the plot of minimum ((m + 1)th) distance or gap we expect
to see a jump in the step prior to the inclusion of the first outlier (in this
case step m = 17). Since these three units form a cluster of outliers we

FIGURE 3.55. Exercise 3.1, simulated uncontaminated data from a bivariate normal distribution: monitoring of scaled Mahalanobis distances. The two horizontal lines are the 2.5 and 97.5 percentiles of the χ² distribution on 2 degrees of freedom. The scale of the y axis is the same as that in Figure 3.57

FIGURE 3.56. Exercise 3.2, simulated data from a bivariate normal distribution with three contaminated units: forward plots of Wilks' ratio, maximum and minimum distances and of the gap

FIGURE 3.57. Exercise 3.2, simulated data from a bivariate normal distribution with three contaminated units: forward plot of scaled Mahalanobis distances

expect there will be a decrease due to masking after the jump. Figure 3.56
shows all these properties.
In the plot of all scaled Mahalanobis distances we expect that the curves
associated with the 3 outliers will be remote from the rest up to step
m = 17. In the final step we expect them to have Mahalanobis distances
comparable with the other units (see Figure 3.57).
The effect of the three outliers is to increase considerably all the entries
of the variance covariance matrix. The plot which monitors the elements of
the covariance matrix (not given here) shows a big change in the slope of
the curves for all three elements, that is for σ̂1², σ̂2² and σ̂12, when m = 18.

Exercise 3.3
With six outliers we expect a change in the slope of the curve which moni-
tors the Wilks' ratio when m = 15 and a jump in the curve of the maximum
(mth) Mahalanobis distance at the same point. At step m = n - 6 = 14
we expect a jump in the plots of the minimum distance and in the gap
plot. Figure 3.58 confirms our expectations. Since these six units form a
cluster of outliers, we expect a considerable decrease after the jump due to
masking.
In the forward plot of scaled Mahalanobis distances we expect that the
curves associated with the six outliers will be remote up to m = 14. With
this high level of contamination (30%), we expect that the outliers in the

FIGURE 3.58. Exercise 3.3, simulated data from a bivariate normal distribution with six contaminated units: forward plots of Wilks' ratio, maximum and minimum distances and of the gap. Note the masking effect after the peak

final steps will have Mahalanobis distances in agreement with those of the
other units (see Figure 3.59).

Exercise 3.4
If the percentage of contaminated units in the initial subset is not too high
the outliers are usually removed in the initial steps of the forward search,
so that the final part of the search is exactly equal to the one we obtain
starting from an initial subset free from contamination. Figure 3.60 shows
that initially there is a considerable overlapping among the curves. The
first steps are very active, the contaminated units being removed from the
subset. They are clearly distinct from the majority of the data in the central
part of the search. Comparison of Figure 3.60 with Figure 3.59 indicates
correctly that the order of entry in the central and final parts of the search
(from m = 10 onwards) is exactly the same.

Exercise 3.5
When we start the search in the group of contaminated units the two groups
are clearly visible in the forward plot of scaled distances, Figure 3.61, until
some uncontaminated units enter the subset. Generally, after the initial
inclusion of units from the other group, we will observe an interchange
because the centroid of the fitted observations will lie in between the two
groups. Figure 3.62 shows that this interchange occurs around steps 8-12.

FIGURE 3.59. Exercise 3.3, simulated data from a bivariate normal distribution with six contaminated units: forward plot of scaled Mahalanobis distances. Note the masking at the end of the search

FIGURE 3.60. Exercise 3.4, simulated data from a bivariate normal distribution with six contaminated units starting with an initial subset containing both good units and outliers: forward plot of scaled Mahalanobis distances. The outliers are immediately removed from the subset in the first step of the search

FIGURE 3.61. Exercise 3.5, simulated data from a bivariate normal distribution with six contaminated units, starting in the group of contaminated units: forward plot of scaled Mahalanobis distances. The continuous lines are the "good" units, the dotted lines are for the outliers. The two groups are clear at the beginning of the search. From m = 5 to m = 10 the contaminated units have increasing Mahalanobis distances while the good units show a clearly decreasing pattern

After the period of interchange we expect the two groups of curves to be


intermingled, as they are in Figure 3.61. However, the trajectories over
the whole forward search of the units belonging to the two groups will be
considerably different. Figure 3.61 shows that, in the central part of the
search, while good units tend to have a decreasing Mahalanobis distance,
contaminated units have a Mahalanobis distance which increases steadily.

Exercise 3.6
If there are no interchanges the unit which enters the subset in the last
step of the forward search is the one which has the largest Mahalanobis
distance when m = n - 1. Similarly, the unit which enters the subset in
step (n - 1) of the forward search is the one which has the Mahalanobis
distance ranked (n - 1) when m = n - 1. So it is clear from Figure 3.19
that, in steps n - 2, n - 1 and n, the units which join the subset are 36,
33 and 55. Applying a similar reasoning it is clear from Figure 3.22 that
starting with n - 2 the order of entry of the units is: 36, 55 and 33.
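
The same reasoning in a one-line sketch (our own code): with no interchanges, ranking the distances computed at m = n - 1 gives the order in which the final units enter.

    import numpy as np

    def last_entrants(d_at_final_step, k=3):
        # Units with the k largest distances at m = n - 1, in order of entry;
        # the final element of the returned array is the last unit to enter.
        return np.argsort(d_at_final_step)[-k:]
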

Exercise 3.7
The left panel of Figure 3.53 shows that the data have a symmetrical ellip-
tical shape. On the other hand, the data in the right panel show a differing

FIGURE 3.62. Exercise 3.5, simulated data from a bivariate normal distribution with six contaminated units starting in the group of contaminated units: forward plots of Wilks' ratio, maximum and minimum distances and of the gap. Interchanges are evident in the plots of the distances around m = 10, after the peaks at m = 6 or 7

spread of the data for one variable at various values of the other variable.
It is clear that in this case the variability increases with the mean for both
variables. This is a typical shape of data which have to be transformed to
achieve approximate normality.
The 250 points in the left panel of Figure 3.53 were simulated from a
bivariate normal distribution with μ = (6, 6)^T and covariance matrix

    ( ·    2.6 )
    ( 2.6   2  ).

The data in the right panel are a multiple of the exponential of those in
the left panel. The pattern in the right-hand panel of Figure 3.53 is very
similar to the babyfood data which will be analyzed in Chapter 4.
The information provided by plots of the likelihood ratio test for the null
hypothesis of no transformation during the forward search is different for
the two sets of data. That from the left-hand panel of Figure 3.63 yields a
horizontal line centred around E(χ²₂) = 2 and below the critical rejection
points. The data in the right-hand panel of Figure 3.63 show an increasing
curve always well above the critical percentage points of the χ²₂ distribution.
Figure 3.63 shows the monitoring of the likelihood ratio test in both cases.

FIGURE 3.63. Data from Figure 3.53: likelihood ratio tests for the hypothesis of no transformation. The left and right panels are for the data in the respective panels of Figure 3.53

Exercise 3.8
The Mahalanobis distance is invariant under linear transformations, so the
simulation of the distances does not require the sample estimates of the
values of μ and Σ. In other words, it is possible to use the standard nor-
mal distribution to generate samples from the multivariate distribution
N_v(0, I).
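
A sketch of such a simulation (our own code and names): samples of the same size and dimension are drawn from N_v(0, I) and the distances recomputed from each sample's own estimates, exactly as for the data.

    import numpy as np

    def simulate_sorted_distances(n, v, n_sim=99, seed=0):
        # n_sim sets of n ordered Mahalanobis distances simulated from N_v(0, I).
        rng = np.random.default_rng(seed)
        out = np.empty((n_sim, n))
        for s in range(n_sim):
            Y = rng.standard_normal((n, v))
            diff = Y - Y.mean(axis=0)
            S_inv = np.linalg.inv(np.cov(Y, rowvar=False))
            out[s] = np.sort(np.sqrt(np.einsum('ij,jk,ik->i', diff, S_inv, diff)))
        return out

Pointwise extremes or quantiles of the sorted rows give envelopes such as the 90% band used in Figure 3.64.
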

Exercise 3.9
As Figure 3.64 shows, beyond units 33, 36 and 55, the next nine distances
which lie outside the simulation envelope are associated with units 12, 13,
51, 35, 16, 52, 25, 7 and 14. This list is very different from the group
of ten developing countries identified as being rather different from the
rest in §1.3. As Figure 3.13 showed, during the final steps of the search the
Mahalanobis distances of this group of units seems to decrease substantially.
The conclusion is that working backwards it is impossible to detect the
group of units found in section §1.3.
150 3. Data from One Multivariate Distribution

••

.. r--
1.0 1.5 2 .0 2.5 30 35 40

FIGURE 3.64. TI-ack records: QQ plot of the ordered Mahalanobis distances


against the square root of the percentage points of x? with 90% envelope from
99 Simulations. The numbers are associated with the rows of matrix Y. Compare
with Figure 1.5
4
Multivariate Transformations to
Normality

4.1 Background
The analysis of data is often improved by using a transformation of the re-
sponse, rather than the original response itself. There are physical reasons
why a transformation might be expected to be helpful in some examples.
If the data arise from a counting process, they often have a Poisson distri-
bution and the square root transformation will provide observations with
an approximately constant variance, independent of the mean. Similarly,
concentrations are nonnegative variables and so cannot strictly be subject
to additive errors of constant variance. Such effects are most noticeable
if there are observations both close to, and far from, zero as they are for
the viscosity measurements of the babyfood data introduced in §2.13.2. In
this chapter we analyze such data using the multivariate version of the
parametric family of power transformations introduced by Box and Cox
(1964).
We begin in §4.2 with another look at the babyfood data, comparing
the forward plots of Mahalanobis distances from transformed and untrans-
formed analyses. The theory for transformation of the response in univari-
ate regression is in §4.3.1, with the extension to multivariate data in §4.3.2.
Before we consider the forward search, we describe relatively easily calcu-
lated score tests in §4.4. Graphical procedures, including fan plots of score
tests, are in §4.5.
The procedure for finding multivariate transformations with the forward
search is described in §4.6. This is exemplified in three relatively straight-
152 4. Multivariate Transformations to Normality

forward examples. For the babyfood data in §4. 7 we show that the log
transformation is most satisfactory, a conclusion we can demonstrate is
unaffected by the influence of individual observations. In our analysis of
the data on Swiss heads in §4.8 we show the strong effect of just two units
on the estimated transformation. Our third introductory example, in §4.9,
is of data on horse mussels. As a result of our transformation we are able
to separate a set of six observations from the majority of the data.
The major example of the chapter is an analysis of the data on munici-
palities in Emilia-Romagna in §4.10. Amongst other features, we find that,
with 341 units, we need a finer grid than previously of values of trans-
formation parameters. We also need to replaces zeroes in the data with
minimum values. The forward search enables us to monitor the influential
impact of such changes. The analysis is complicated because there are 28
responses - it is not possible to have all outliers in all responses entering
at the end of the search. We follow this in §4.11 with a shorter analysis of
the transformation of the data on national track records for warnen. Again,
our transformation leads to a simplified structure for the data.
An example of transformations when there is a regression structure is in
§4.12. The data used come from an experiment on dyestuff manufacture. In
§4.13 we show how the forward search can be used to provide information
about the variables to be included in the linear model for the babyfood
data. The chapter concludes with suggestions for further reading.

4.2 An Introductory Example: the Babyfood Data


The purpose of this section is to show the effect of transformation on scat-
terplot matrices, forward plots of Mahalanobis distances and the proportion
of the data lying within given bivariate contours. These data were used in
§2.13.2 to introduce the bivariate boxplot. As Figure 2.4 showed, one bi-
variate plot gave contours which were definitely not elliptical, indicating a
distribution which was far from bivariate normal. Of course, it is well known
not only that viscosity is non-negative but that it has a highly skewed dis-
tribution. Table 4.1 gives the maximum and minimum values of the four
responses and their ratio.
The table shows that there is a ratio of at least 100 between the smallest
and largest value of each response, all of which are positive. We might ex-
pect that the distribution will be highly skewed and, indeed, the bivariate
scatter plot matrix of Figure 4.1 shows that this is so. The univariate box-
plots on the diagonal of the plot reflect the skewed marginal distributions
of Table 4.1. The bivariate plots show that the observations duster in the
bottarn left-hand corner of each panel: the asymptotic 50% and 99% con-
tours from the fitted bivariate boxplots exclude several of the observations
and there seem to be many outliers.
4.2 An Introductory Example: the Babyfood Data 153

TABLE 4.1. Babyfood data: minimum and maximum value of each response and
their ratio

Variable Minimum Maximum Ratio


Y1 9.8 1020 104.1
Y2 5.8 1950 336.2
Y3 5.0 2070 414.0
Y4 12.5 3030 242.4

Figure 4.2 shows the same plot for the data after logarithms have been
taken of all four responses. The observations now appear much more nor-
mal: the univariate boxplots seem symmetrical and the robust contours,
which are plausibly elliptical, include a higher proportion of the observa-
tions. Those observations that are outside the outer contour are much less
remote than they werein Figure 4.1.
The beneficial effect of the transformation in moving the data closer to
normality can also be seen in forward plots of Mahalanobis distances. Fig-
ure 4.3, the plot of scaled distances for the original data, is quite irregular
with severallarge outliers for much of the search. By comparison, the plot in
Figure 4.4, for the logarithmically transformed data, is stable, the pattern
of distances not changing much as the search progresses.
There are three further informative contrasts between these two figures.
The firstisthat Mahalanobis distances are scale free: they consist of residu-
als divided by an estimated standard deviation. The plots of Figure 4.3 and
4.4 are on the same vertical scale, showing that the large distances for the
untransformed data are much greater, until the final step of the search, than
those for the transformed data. This is a reflection of the difference in the
presence and structure of outliers between the untransformed and trans-
formed data which we saw in the scatterplot matrices. The second contrast
between the two figures is less evident. Figure 4.4 for the transformed data
starts with m 0 = 11 , found from the intersection of units inside ellipses
for which 0 (2.103) is one. This ellipse for the transformed data is shown
in Figure 4.5 together with the 50% ellipse passing through the median
Mahalanobis distance. For n = 27 the F value (2.102) corresponding to 0
= 1 gives an ellipse with an asymptotic content of 61.8%. The figure shows
that the two ellipses are close together, an indication of normality of the
data. However the ellipses in Figure 4.6 for the untransformed data are far
apart. The inner one again contains exactly 50% of the data. But the outer
ellipse, again for 0 = 1, is much larger, because the estimate of E is inflated
by the presence of outliers, even though the ellipse has a robust centre. As
a result of the skewed distribution of the Mahalanobis distances the ellipse
contains appreciably more of the observations than that in Figure 4.5. If we
use the same value of e for the starting point of the untransformed data as
154 4. Multivariate Transformations to Normality

.8 0 0 0

.6 0 0 0

..

.
0 0

0 0 0 0

:lroo
• (!)

(!) 0

0 .8 0

0 0

(!) ( 0
·"'

..
0 0 0 .6
0 0 .8

:t.:t
0
0 •
0
[5

FIGURE 4.1. Babyfood data: scatterplot matrix with asymptotic 50% and 99%
spline curves. There are many outliers and the data are far from multivariate
normality

we did for the transformed, we obtain m 0 = 19. A much smaller value of e,


0.1, was needed to obtain the value of 13 for m 0 used in the search yielding
Figure 4.3. The final point is the constant gap at the bottom of the forward
plot of Mahalanobis distances for the transformed data in Figure 4.4: there
are no very small distances, in line with what we would expect from the
asymptotic x~ distribution of t he distances. On the contrary, the forward
plot for the untransformed data in Figure 4.3 indicates the existence of
several very small distances for much of the search; these arise from the
duster of units with small values of all responses which are shown in the
scatterplot matrix of Figure 4.1.
In this analysis of the effect of transformations on the babyfood data
we have ignored the regression model in order to emphasize the effect of
the logarithmic transformation on these highly skewed distributions. In the
calculation of the Mahalanobis distances in the forward searches leading to
4.3 Power Transformations to Approximate Normality 155

FIGURE 4.2. Logged babyfood data: scatterplot matrix with asymptotic 50%
and 99% spline curves. The data are much closer to multivariate normality after
transformation than the untransformed data in Figure 4.1

Figures 4.3 and 4.4 we fitted a constant Jlj to each response. We return
later in this chapter, in §4. 7, to an analysis in which we fit a model in the
explanatory variables. In this example, although not in all, introduction of
a linear model does not affect our choice of transformation.

4.3 Power Transformations to Approximate


Normality
From this analysis of the babyfood data it is apparent that the logarithmic
transformation improves the closeness of the data to the normal distri-
bution as shown, for example, by symmetry of the univariate distributions
and by the elliptical contours in the bivariate plots. It may however be that
156 4. Multivariate Transformations to Normality

"'
.,
<J)

0
c: ~
"''6
<i)

V>
:0
g ~

..
"'
;;;
.<:
:::;:

"'

0
---
--------~-::'9'-

14 16 18 20 22 24 26
Subset size m

FIGURE 4.3. Babyfood data: forward plot of scaled Mahalanobis distances. There
are some large distances for much of the search, which are masked at the end

other transformations, such as the square root or the reciprocal, would give
even better properties. In order to test whether this is so, we embed the
various transformations in the single parametric family due to Box and
Cox (1964). An advantage of this embedding isthat standard methods of
inference are then available for the choice of the best transformation.
In the next section we present a short description of methods for transfor-
mation of the response in univariate regression; the multivariate extension
is in §4.3.2. Unfortunately the estimated transformation and related test
statistics may be sensitive to the presence of one, or several, outliers. Ex-
amples for univariate data are in Chapter 4 of Atkinson and Riani (2000).
We use the forward search to see how estimates and test statistics evolve
as we move through the ordered data. Since observations that appear as
outlying in untransformed data may not be outlying once the data have
been transformed, and vice versa, we employ the forward search on data
subject to several transformations, as well as on untransformed data.

4.3.1 Transformation of the Response in Regression


For transformation of just the n x 1 response y in the univariate linear
regression model, Box and Cox (1964) analyze the normalized power trans-
formation
(y>.- 1)/(.\y>.-1) .\~0
z(.\) = { y logy .\ = 0,
(4.1)
4.3 Power Transformations to Approximate Normality 157

0
N

"'8 ~
~
üi
'6
~
0
c e
"'
<0
.s:
"'
:::;
/
"'
:~ -------=~- ··-... ~ -
o ~ A~~~~~;~~~--------------------~~~~~~~-~~---
~-----------., ~

15 20 25
Subset size m

FIGURE 4.4. Log transformed Babyfood data: forward plot of scaled Maha-
lanobis distances. Compare with Figure 4.3

where the geometric mean of the observations is customarily written as iJ


= exp('E log yi/ n). The model fitted is multiple regression with response
z(>.); that is,

z(>.) = Xß+ f. (4.2)

Here z(>.) is n x 1, X is n x p , ß is p x 1 and the errors E are independently


normally distributed with constant variance a 2 . When >. = 1, there is no
transformation: >. = 1/2 is the square root transformation, >. = 0 gives the
log transformation and >. = -1 the reciprocal. These are the most widely
used transformations, frequently supported by some empirical reasoning.
For example, measurements of concentration often have a standard devia-
tion proportional to the mean, so that the variance of the logged response is
approximately constant. For this form of transformation to be applicable,
all observations need tobe positive. For it tobe possible to detect the need
for a transformation the ratio of the largest observation to the smallest
should not be too close to one.
The purpose of the analysis is to find an estimate of >. for which the
errors in the z(>.) (4.2) are, at least approximately, normally distributed
with constant variance and for which a simple linear model adequately
describes the data. This is achieved by finding the maximum likelihood
estimate of >., assuming anormal theory linear regression model.
158 4. Multivariate Transformations to Normality

. .

#'
.

~5
~
: 4 '

.l-.r .•:fr· I

1!
:7 B 2
· f-
I

85

.
. I

}}/"'
5 5

tf ~
ß

22
. '3-- - ·
.t·
) 'f' ,)

.
I
5ßß I %5 I ß 5

.P~
.. 14

~
2 I -

·'~~
.1
.(.

FIGURE 4.5 . Logged babyfood data: scatterplot matrix with fitted ellipses. The
inner ellipse contains 50% of the data, the outer, for (} = 1, asymptotically con-
tains 61.8%. For (} = 1, mo = 11

Once a value of .X has been decided upon, the analysis is the same as
that using the simple power transformation

y(.X) = { (y>.- 1)/.X (4.3)


logy

However the difference between the two transformations is vital when a


value of .X is being found to maximize the likelihood, since allowance has
to be made for the effect of transformation on the magnitude of the obser-
vations.
The likelihood of the transformed observations relative to the original
observations y is
4.3 Power Transformations to Approximate Normality 159

I ß
8
- ß
6 5
.:z~-·t14
•••
--/ '#_L
•' -

~I --···(

I
. 5
-
•• .L~ ß
ß . 8
/ . I • ~4
,.-:_;_:Jl.2....
,, -
....··
,' r-'-
:..

5 5

ß
I
~<'
8

-
,.
5 5
- ß
..l;r:,:
ß 6
8
~~ #
//
:'~,,~

FIGURE 4.6. Babyfood data: scatterplot matrix with fitted ellipses. The inner
ellipse contains 50% of the data, the outer, for (} = 1, asymptotically contains
61.8% . Outtiers cause the ellipses to be far apart, with the inner ellipse virtually
indistinguishable from the central duster of data points. Here, for (} = 1, mo = 19

where, in this section, we use J for the Jacobian, so that

(4.4)

allows for the change of scale of the response due to transformation.


A simpler, but identical, form for the likelihood is found by working with
the normalized transformation, defined in general as

z(.X) = y(.X)/Jlfn,
for which the Jacobian is one. The likelihood is therefore now
160 4. Multivariate Transformations to Normality

a standard normal theory likelihood for the response z(.-\) . For the power
transformation (4.3),

so that
log J = (.-\- 1) 2)ogyi = n(.-\- 1) logy.

The maximum likelihood estimates of the parameters are found in two


stages. For fixed ,\ the likelihood (4.5) is maximized by the least squares
estimates

with the residual sum of squares of the z(.-\),

R(.-\) = z(.-\f (J- H)z(.-\) = z(.-\)T Az(.-\), (4.6)

where the projection matrix A = I- H. Division of (4.6) by n yields the


maximum likelihood estimator of cr 2 as

Replacement of this estimate by the unbiased mean square estimate s 2 ( ,\)


in which n is replaced by (n - p) does not affect the development that
follows .
For fixed ,\ we find the loglikelihood maximized over both ß and cr 2 by
Substitution of /3(.-\) and s 2 (.-\) into (4.5). If an additive constant is ignored
this partially maximized, or profile, loglikelihood of the observations is

Lmax(.A) = -(n/2) log{R(.A)/(n- p)} (4.7)

so that 5. minimizes R(.-\).


For inference about the transformation parameter .-\, Box and Cox sug-
gest likelihood ratio tests (2.20) using (4.7), that is, the statistic

TLR = 2{Lmax(5.)- Lmax(A 0 )} = nlog{R(.-\o)/R(5.)}. (4.8)

Asymptotically the null distribution of TLR is chi-squared on one degree of


freedom. An asymptotically standard normal statistic is found by taking
the signed square root of TLR

(4.9)

Although the two statistics have the same properties for testing the value
of .-\, we sometimes prefer to plot the asymptotically normal TN: including
the sign of the difference 5. - A0 gives an indication of the direction of any
departure from the hypothesised value.
4.3 Power Transformations to Approximate Normality 161

4.3.2 Multivariate Transformations to Normality


In the extension ofthe Box and Cox (1964) family to multivariate responses
there is a vector >. of v transformation parameters Aj, one for each of the v
responses. We give the results for the general multivariate regression model
of §§2.8 and 2.11. As before, Yi is the v x 1 vector of responses at observation
i with Yij the observation on response j. The normalized transformation of
Yij is given by

(>.. =I= 0) (4.10)


(>.. = 0),
where ih is the geometric mean of the jth response. The value Aj = 1 (j =
1, .. . , v) corresponds to no transformation of any of the responses. If the
transformed observations are normally distributed with vector mean /-Li for
the ith observation and covariance matrix I:, twice the profile loglikelihood
of the observations is given by
n
2Lmax(.A) const- nloglf:(>..)l- L{zi(>..) (4 .11)
i= l

n
const - n log If:(>..) I - L ei(>.)Tf:(>.) - l ei (>..). (4.12)
i=l

In (4.12) the parameter estimates Pi(>..) and f:(>.) are found for fixed >. and
ei(>..) is the v x 1 vector of residuals for observation i for the same value
of >... As in (4.8) for univariate transformations, it makes no difference in
likelihood ratio tests for the value of >.. whether we use the maximum likeli-
hood estimator of I:, or the unbiased estimator f:u. Suppose the maximum
likelihood estimator is used, so

(4.13)

When this estimate is substituted in (4.12), the profile loglikelihood reduces


to
2Lmax(>..) = const'- nlog lf:(>..)l. (4.14)
So, to test the hypothesis >.. = >..o, the statistic

(4.15)

is compared with the x2 distribution Oll V degrees of freedom. In (4.15) >. is


the vector of v parameter estimates maximising (4.12), which is found by
numerical search. Replacement of f:(>.) in (4.15) by the unbiased estimator
162 4. Multivariate Transformations to Normality

f:u(.A) results in the multiplication of each determinant by a factor which


cancels, leaving the value of the statistic unchanged.
There are two special cases when tests of the form of (4.15) are on one
degree of freedom. In the first we test that all responses should have the
same, unspecified, transformation, that is that ..\1 = ..\2 = ... = Av = ..\.
The second is the test of just one component of ..\ when all others are kept
at some specified value. In both cases we sometimes plot TN, the signed
square root form of these tests defined in (4.9).

4.4 Score Tests for Transformations


A disadvantage of the likelihood ratio tests of the previous sections is that
numerical maximization is required to find the value of A. This may be
particularly onerous for multivariate transformations in the forward search,
where, at each step of the search, a numerical maximization is required in v
dimensions. A computationally simpler alternative test is the approximate
score statistic (Atkinson 1973). We first derive the results for regression on
a univariate response.
The approximate score test comes from Taylor series expansion of ( 4.1)
as
8z(..\)
z(..\) ~ z(..\o) + (..\ - .Ao) ---w;- A=Ao
I
= z(..\o) + (..\- .Ao)w(.Ao) , (4.16)
which only requires calculations at the hypothesized value .Ao. In (4.16)
w(..\ 0 ) is the constructed variable for the transformation. Differentiation of
z(..\) for the normalized power transformation yields (Exercise 4.1)
az(..\)
w(..\)
8..\
yA log y yA - 1 .
_.\yA-1 - _.\yA-1 ( 1 /..\+logy). (4.17)

The combination of (4.16) and the regression model y = xT ß + E leads to


the model
z(..\o) = xT ß- (..\ - .Ao)w(.Ao) +E
= XT ß + "/ w(..\o) + E, (4.18)
where 'Y = - (..\- ..\0 ). The approximate score statistic for testing the trans-
formation, Tp(..\ 0 ), is the t statistic for regression on w(..\ 0 ) in (4.18) . Thus

Tp(..\o) = i'/(estimated standard error of )'), (4.19)

which can either be calculated directly from the regression in (4.18), or


from the formulae for added variables in which multiple regression on x
4.4 Score Tests for Transformations 163

is adjusted for the inclusion of an additional variable. The details are in


Atkinson and Riani (2000, §4.2.1). In either case, the t test for "( = 0 in
(4.18) is the test of the hypothesis .X = .A 0 .
If, as is usually the case, X contains a constant, any constant in w(.X) can
be disregarded in the regression yielding the t test for the transformation
(Atkinson and Riani 2000, Exercise 4.3). Under these conditions (4.17)
becomes
(.X)= y>-{log(yjy) -1/.A} (4.20)
w >,y>-- 1 ,
with the special cases

w(1) y{log(yjy)- 1} (.X= 1)


(4.21)
w(O) ylogy(logy/2 -logy) (.X= 0).

These are the two most frequently occurring values in the analysis of data:
either no transformation, the starting point for most analyses, or the log
transformation. For other values of .X the constructed variables are found by
evaluation of (4.20). Because Tp(.A) is the t test for regression on -w(.A),
large positive values of the statistic mean that A0 is too low and that a
higher value should be considered.
When the Box-Cox transformation is applied to multivariate data, lin-
earization of the transformation (4.10) leads to the nv values of the v
constructed variables

Wij (.Xo)

in which the jth response is differentiated with respect to Aj.


Suppose that, as in §2.11, the explanatory variables for the jth response
are Xj . Then the regression model for that response when the constructed
variable is included is

(4.22)

where "/j is a scalar. Using the approximate score test to determine whether
.A0 is the correct transformation of the v responses is equivalent to testing
that the v parameters "/j, j = 1, ... , v are all zero. Even if the matrix of ex-
planatory variables X is the same for all responses, so that the regression
model is (2.57), (4.22) shows that inclusion of the constructed variables
means that the variables for regression are no Ionger the same for all re-
sponses. As a result, the simplification of the regression in §2.11 no Ionger
holds and the covariance :E between the v responses has to be allowed for in
estimation. U nder the alternative hypothesis, although not under the null,
independent least squares is replaced by seemingly unrelated regression
described in §2.11.
164 4. Multivariate Transformations to Normality

To test the hypothesis >. = >. 0 , we can use an approximate score test. As
in the likelihood ratio test given by (4.15), this is a function of the ratio
of determinants of estimated covariance matrices. Under the hypothesised
transformation the estimated covariance matrix is f;(>. 0 ). Regressionon the
constructed variables Wj(>.o) in (4.22) gives a vector of parameter estimates
i and an estimated covariance matrix from the resid uals which we call f;( i).
The approximate score test is then

Tsc = nlog{lt(>.o)l/lt(i)l}, (4.23)

which again is compared with the x2 distribution on v degrees of freedom.


If only one component of >. is of interest, we again use the signed square
root of this test. Since, from (4.18), i estimates -(>.- >. 0 ), the sign isthat
of -i.

4.5 Graphics for Transformations


In the previous sections we defined a number of quantities relating to the
transformation of observations to normality. They include the parameter
estimate 5., likelihood ratio tests and the approximate score test Tp(>.)
for testing the value of a single transformation parameter. All of these
quantities are based on aggregate statistics. To apply the forward search to
the choice of a transformation we monitor the evolution of these quantities
during the search.
For transformations of a single variable we use forward plots of the score
statistic Tp(>.). However, influential observations may only be evident for
some transformations of the data, but not others. Therefore, we employ
the forward search on untransformed data and on data subject to various
transformations. In Riani and Atkinson (2000) and Atkinson and Riani
(2000, Chapter 4) five values of >. forming the vector >. 8 were used: 1, 0.5,
0, -0.5 and -1, thus running from no transformation through the square
root and the logarithmic to the reciprocal square root and the reciprocal.
The forward plot of Tp(>.) for these five values of >., using five separate
searches through the data, is called a "fan" plot. It is a major tool in the
analysis of transformations in Atkinson and Riani (2000) who give several
examples of its use.
For the analysis of multivariate transformations we monitor forward plots
of parameter estimates and likelihood ratio tests for the vector parameter
>.. We produce the fan plot using the multivariate version of the signed
square root form of the univariate likelihood ratio test defined in (4.9). We
calculate a set of tests by varying each component of >. about >. 0 . Suppose
we require a fan plot for >.1. Let Ao(j) be the vector of all parameters in
>.o except Aj· Then Aoj = (>.o(j) : >.!]) is the vector of parameter values in
which Aj takes one ofthe five standard values >. 8 while the other parameters
4.5 Finding a Multivariate Transformation with the Forward Search 165

keep their values in .\0 . To form the likelihood ratio test we also require
the estimator ~Oj found by maximization only over Aj. More explicitly, we
can write ~Oj = ( Ao(j) : ~j). Then the version, for multivariate data, of the
signed square root likelihood ratio test defined in (4.9) is

(4.24)
We produce v fan plots, one for each variable, by letting Aj, j = 1, ... , v
take each of the five standard values. Alternatively, particularly if numerical
maximization of the likelihood is time consuming, we could produce the fan
plot from the signed square root of the score test in (4.23).

4.6 Finding a Multivariate Transformation with


the Forward Search
With just one variable for transformation it is easy to use the fan plot from
the five forward searches with standard values of .\ to find satisfactory
transformations, if such exist, and to discover the observations that are
infiuential in their choice. However, with v variables for transformation,
there are 5v combinations of the five standard values to be investigated.
Whether or not the calculations aretime consuming, trying to absorb and
sort the information would be difficult. We therefore suggest three steps to
help structure the search for a multivariate transformation:
1. Run a forward search through the untransformed data, ordering the
observations at each m by Mahalanobis distances calculated from
untransformed observations. Estimate .\ at each value of m. Use the
forward plot of ~ to select a set of transformation parameters.
2. Rerun the forward search, now ordering the observations by distances
calculated with the parameters selected in the first step ; A is again
estimated for each m. As the search is now on transformed data, the
order in which the observations enter the subset will have changed
from that in Step 1. Again monitor the values of ~ and of the likeli-
hood ratio test for the transformation. If a correct transformation has
been found, the parameter estimates, if well defined, will be stable
until near the end of the search, when any outliers start to enter. At
this point, the value of the test statistic may increase rapidly. How
well defined the parameter estimates are can be determined by plots
of profile loglikelihoods against the individual values of .\ for various
values of m. A fiat likelihood will explain an estimate of .\ which is
behaving erratically.
If some change is suggested in .\, perhaps because outliers appea r
to be entering before the end of the search, repeat Step 2 until a
reasonable set of transformations has been found. Let this be AR.
166 4. Multivariate Transformations to Normality

3. Gonfirmatory testing of the suggested transformation. We expand


each transformation parameter in turn araund the five common val-
ues of >. (-1,-0.5,0,0.5,1), using the values of the vector >..R for
transforming the other v - 1 variables. In this way we turn a mul-
tivariate problern into a series of v univariate ones. In each search
we can test the transformation by comparing the likelihood ratio test
with x 2 on 1 degree of freedom. But we use the signed square root of
the likelihood ratio (4.24) in order to learn whether lower or higher
values of >.. are indicated. The plot is thus a version of the fan plot.

In Step 1 the search using untransformed data could be replaced by one


using a preliminary estimate of the vector of transformations, perhaps from
univariate estimation on each response separately. Some iteration may be
needed between Steps 2 and 3.

4. 7 Babyfood Data
Our preliminary analysis of the babyfood data indicated that the log trans-
formation greatly improved the normality of the data. We reached this con-
clusion without fitting a linear model. We now fit a model and see what
transformation is needed, using the forward search to detect the effect of
individual observations.
There are five explanatory variables. In their analysis of the data Box
and Draper (1987, page 572) find a linear model with terms in x 2 , x 3 and
X5 as weil as, surprisingly, the interaction X3X4 in the absence of x 4 . This
model was suggested for ail four responses. It is generaily agreed that such
models, violating a marginality constraint, are undesirable: if the variables
in this model are rescaled, the model will apparently change, a term in x 4
appearing. Suggestions for determining the importance of this interaction
are on p. 574 of Box and Draper (1987), tagether with an analysis of the
fitted coefficients in the model, which suggests that perhaps a common
model can be fitted to ail four responses.
For the present we choose a linear model with ail five first-order terms as
weil as the interaction of x 3 and x 4 , estimating the parameters separately
for each of the four responses. Our purpose at this point is to demonstrate
methods for choosing a multivariate transformation. We discuss the use of
the forward search in selecting this model in §4.13.
The first step in §4.6 is a forward search through the untransformed
data, estimating >.. at each step. The resulting forward plot of the four
elements of ~ is in Figure 4. 7. It is clear from this stable forward plot of
values close to zero that we should try the logarithmic transformation in
Step 2. The high significance of t he transformation is shown in the forward
plot of the likelihood ratio test for >..0 = 1 in Figure 4.8. This increases
steadily throughout the search; even initiaily the values are significant when
4. 7 Babyfood Data 167

C>
I
I
I
CX) I
0 I
I
I
I
<0 I
0

'<t

"'
"'0
.0
0
E
.!!! C\J
0

0
0 ,,........
..
,_______
\',
I -------
."
1 /

14 16 18 20 22 24 26 28
Subset size m

FIGURE 4.7. Babyfood data: forward plot of the four elements of the maximum
likelihood estimate >.. The log transformation is indicated for all responses and
there are no obvious outliers

compared with the percentage points of the x~ distribution shownon the


plot.
In Step 2 we repeat the analysis using the log transformation from Step
1. The forward plot of the maximum likelihood estimates of .X in Figure 4.9
is similar to that of Figure 4.7 fromm = 18, moving gently up and down
around the value of zero, with a slight tendency tobe below. The differences
in the two plots are caused by the different order in which the observations
enter the searches for the two values of .X. How well defined these estimates
of .X are can be determined from plots of the profile loglikelihood.
Figure 4.10 shows the plots of the profile loglikelihoods for the four pa-
rameters when m = 26. In each panel the values of the three parameters
that are not being varied are kept at their maximum likelihood estimates
when m = 26. The loglikelihoods are roughly parabolic close to zero, al-
though not necessarily log concave further away from the centre of the plots,
where we are not concerned with making inferences. The pairs of lines give
asymptotic 95% confidence intervals for each element of .X, based on the
asymptotic XI distribution of twice the loglikelihood ratio. All panels show
a sharp definition of the estimates: zero is acceptable for y 1 and y 2 , marginal
for Y3 and slightly unacceptable for Y4· The plot of the likelihood ratio in
Figure 4. 11 shows support for the logarithmic transformation. There is no
168 4. Multivariate Transformations to Normality

0
0
(')

"'
C\1

0
0 0
C\1
~
"C
0
0 0
,!:
c;;
..><
"'
::J
8

"'
0

14 16 18 20 22 24 26
Subset size m

FIGURE 4.8. Babyfood data: forward plot of the likelihood ratio test for the
hypothesis of no transformation. The horizontallines are the 95% and 99% points
of X~

evidence of the influence of individual observations on the test statistic,


nor Oll the parameter estimates in Figure 4.9. We take AR= (0, 0, 0, o)T.
It remains, as Step 3, to confirm the transformation AR· Figure 4.12
is a fan plot of the signed square root of the likelihood ratio test for the
hypothesis that all values of A are the same, against the hypothesized values
in the plot, again for all four responses. The test is therefore on one degree
of freedom. The top curve is for A = -1. The plot shows that this value
and the three others, -0.5, 0 .5 and 1 are all rejected, lying throughout
outside the central 99% band for the normal distribution at ±2.58. The
log transformation is accepted at the end of the curve, although there is
some evidence that a slightly lower value of A is to be preferred, perhaps
-0.1. This indication is in line with the forward plots of the maximum
likelihood estimates in Figure 4.9 and the plots of the profile loglikelihoods
in Figure 4.10. It is however unlikely that such a value would be appropriate
for the analysis of data and we recommend the log. What the forward search
does show is that our conclusions about the correct transformation do not
depend critically on any one observation.
It may seem surprising that the same transformation has been found
for all four responses. However, in this example, the four responses are
measurements of viscosity after 0, 3, 6 and 9 months. It is therefore to
be expected that these viscosity measurements would all require t he same
tra nsformation. This argument also explains why the linea r model for each
response includes the same terms: the various factors would have broadly
4.8 Swiss Heads 169

CO
0

CO
0

"<t

"'
-c
.0
E
0

.!!l N ..................
0
·..\\······· · ····· ······················-· ........·····················-···-····-··-·--......
0 ·················•··

_]
0 -::::..-..::~

15 20 25
Subset size m

FIGURE 4.9. Log transformed babyfood data: forward plot of the four elements
of the maximum likelihood estimate i The log transformation, or a slightly lower
value of .\, is indicated for all responses from this search on the logged data

similar effects on all responses. We return, in §4.13, to the selection of vari-


ables in this model. But, before leaving this analysis of transformations, we
note how well the data satisfy the normal theory assumptions for regres-
sion, once they have been logarithmically transformed. This is gratifying,
given how very skew the distributions were that we saw in Figure 4.1. In
fact, they are so far from normality that we were not able, for all values of
m and .X 0 , to obtain convergence to the estimates 5. required for the loglike-
lihood ratio tests plotted in Figure 4.12. Hence the shortness of the plotted
curves, especially that for .X 0 =1 , the hypothesis of no transformation.

4.8 Swiss Heads


As a second example of the use of transformations in multivariate data we
return to the analysis of the data on six measurements on Swiss heads.
Our conclusion so far isthat there are 198 units which appear to follow a
multivariate normal distribution and two outliers, units 104 and 111. These
two units have not been shown to have much effect on the fitted model:
for example in the forward plot of the covariance matrix, Figure 3.5, the
introduction of the last two units has only a slight effect on the estimates.
Westart in Figure 4.13 with a forward plot of the likelihood ratio test for
testing that all six values of A are equal to one. This is based on a search on
170 4. Multivariate Transformations to Normality
oo;j- oo;j-,....~~~~~..."-,TTT~~~~~~.-,

I' I'
I I

"<1- "<1-
L{) L{)

"<~- m=26 y1 oo;~- m=26 y2


Ol Ol

l-0.9 - 0 .5 -0.1 0.3 0.6 0 .9 l-0. 9 -0.5 - 0.1 0.3 0. 6 0 .9


"<1- "<1-
(!) (!)
I I

"<1- "<1-
0 0

"<1- "<1-
"<1- "<1-

"<1- "<1-
~ m=26 y3 ~ m=26 y4
I I
-0.9 - 0.5 - 0.1 0.3 0.6 0.9 - 0 .9 - 0 .5 - 0 1 0 .3 0 .6 0. 9

FIGURE 4.10. Babyfood data: profile loglikelihoods for the four transformation
parameters when m = 26. Search on logged data

"'
"'

0
"'

~
"0
!;e
0
0
,!:
]l 0 \

\
::J

"'

15 20 25
Subset size m
FIGURE 4.11. Logged babyfood data: forward plot of the likelihood ratio test for
the hypothesis of the log transformation. The horizontal lines are the 95% and
99% points of x~: this transformation is supported
4.8 Swiss Heads 171

-1

~
;;; ...- - ·05
.l!l
············-·-····-···-···-·····/"/ •..
e
0

·- .. . . ·--- .. --···-···· . ··· .... ........······--·-····


~
---
0
t: -------------------------- ... ------------ 0
CT
'
------ .................
U)

"0 ''

------- --- ---


Q)
c:

------
Cl
ü5 0
-----05

-------------
--. 1
0
C)J
15 20 25
Subset size m

FIGURE 4.12. Babyfood data: fan plot of signed square root likelihood ratio tests
that all responses should have the same transformation . The incomplete curves
result from numerical problems with the convergence of parameter estimates for
data which Figure 4.1 shows are far from normal

untransformed data, so that the order of entry of the units is the same as in
earlier chapters. In particular, units 104 and 111 are the last two to enter.
The figure shows the enormous impact these two observations have on the
evidence for transforming the data. At m = 198 the value of the statistic
is 7.218, only slightly above the expected value for a x~ random variable.
This rises to 15.84 after the two outliers have entered, a value above the
95% point of the distribution, which is included in the plot. Without the
information provided by the forward plot it would be easy to be misled into
believing that the data need transformation.
Evidence for a transformation is provided by a skewed distribution. The
only skewed distribution in the scatterplots of the data such as those in
Figure 3.10 is the marginal distribution of y 4 , caused by the outlying values
of units 104 and 111. To test whether all the evidence for the transformation
is due to y 4 we repeat the calculation of the likelihood ratio, but now
only testing whether -\ 4 = 1. The other five values of ,\ are kept at one,
both in the null parameter vector Ao and in the m.l.e. i The search is
therefore the same as before, but now gives rise to Figure 4.14, showing
a test statistic to be compared with xi. It is now even clearer that all
evidence for transformation of y 4 is provided by the inclusion of units 104
and 111. At the end of the search the test statistic has a value of 8. 789,
compared with 15.84 in Figure 4.13 for transforming all six variables. The
difference, 7.05, is not significant for a xg random variable, so that the
172 4. Multivariate Transformations to Normality

~
1ii
.l!l

~
";.
:.::;

LO

80 100 120 140 160 180 200


Subset size m

FIGURE 4.13. Swiss heads: forward plot of the likelihood ratio test for the hy-
pothesis of no transformation. The horizontal line is the 95% point of X~· The
last two units to enter provide all the evidence for a transformation

evidence of the tests at the end of the search is that >. 4 -j. 1, whereas all
other variables are equal to one.
The forward plot of the maximum likelihood estimate of >.4 , when all
other variables remain untransformed, is less revealing than the plots of
the estimates for the babyfood data in Figures 4. 7 and 4.9. The estimate
is poorly defined until m = 140, being larger than one. It then gradually
drifts down to around 0.5, before suddenly decreasing in the last two steps
and becoming -1.108 at the end of the search. This evolution needs tobe
judged against tests of the parameter value.
We use the fan plot based on the signed square root of the multivariate
likelihood ratio test which was defined in (4.24) as a confirmatory test
of the value of >.. Since there are six transformation parameters, the fan
plot in Figure 4.15 shows the result of 30 forward searches with Ao(j) = 1.
In all panels, except that for y 4 , there is no evidence of any need for a
transformation. In fact, any of the five transformations would be acceptable
for Yl, Y2, y 3 and y 5 since all statistics lie within the 99% interval for the
asymptotic normal distribution throughout the searches. For y6 the statistic
for no transformation is closest to zero, with the reciprocal and reciprocal
square root being rejected at the end of the search. The transformation for
Y4 is more interesting and the panel is reproduced enlarged in Figure 4.16.
The main feature of the fan plot for y 4 in Figure 4.16 is the effect of the
two outlying units which enter in the last two steps of the search when >. =
1 or 0.5. Before the entry of the outliers, at m = 198, all transformations
4.8 Swiss Heads 173

CO

C\1

~ -~
0

80 100 120 140 160 180 200


Subset size m

FIGURE 4.14. Swiss heads: forward plot of the likelihood ratio test for the hy-
pothesis of no transformation of Y4· The horizontal !irres are the 95% and 99%
points of xi. The last two units to enter provide all the evidence for this trans-
formation

TABLE 4.2. Swiss heads: entry of outliers into the subset m

A4 Step of entry of obs. 104 Step of entry of obs. 111


-1 187 193
-0.5 192 195
0 194 197
0.5 199 200
1 199 200

are acceptable. But at the end of the search, the hypothesis of no transfor-
mation is clearly rejected with a value for the statistic of -2.965. Although
the outliers also enter at the end of the search when >. = 0.5, these two
observations are less outlying on other scales and so enter earlier in some
other searches. For example, for >. = -1, there are two downward jumps
in the value of the statistic, caused by these outliers entering when m =
187 and 193. For the other transformations the outliers enter later in the
search, their effect being apparent on all curves. The value of m for entry
of the two outliers into the subset is given in Table 4.2 for each search. As
the transformation moves from the reciprocal to no transformation, there
is a smooth increase in the value of m at which the observations enter.
174 4. Multivariate Transformations to Normality
M r--------------------------,

t:;;ii!::~S§}:!~tf-~i;;J?-~ ~6: ~ o

~ L-------------------------~ ~ L-------------------------~
M r--------------------------,

~A
~~~~~~-~ ~
J ~--~~ \-i·-0.5
Cl- 0
1\0.5
~ L-------------------------~ ~ L-------------------------~

"'

120 140 160 180 200 120 140 160 180 200
Subset size m Subset size m

FIGURE 4.15. Swiss heads: fan plots when Ao = 1. Only Y4 shows a need for any
transformation. See also Figure 4.16

TABLE 4.3. Swiss heads: minimum and maximum value of each response and
their ratio

Variable Minimum Maximum Ratio


Y1 96.8 130.5 1.35
Y2 100.6 127.8 1.27
Y3 108.7 139.1 1.28
Y4 47.7 74.2 1.56
Y5 112.1 134.7 1.20
Y6 122.6 153.3 1.25

This example clearly shows the effect of the two outliers on the estimated
transformation. In this case unthinking reliance on the aggregate statistics
when all observations are fitted would lead to a model which was inap-
propriate for most of the data. Although here the two outliers are readily
detected by using simple tools, such as scatterplots, our analysis quanti-
fies the effect of the two observations on a specific aspect of parameter
estimation.
Apart from the transformation of y 4 these data provide no evidence that
any other variables need transformation. This is typical of data in which the
observations have a narrow range, are far from the origin and have roughly
4.8 Swiss Heads 175

150 160 170 180 190 200

Subset size m

FIGURE 4 .16. Swiss heads: fan plot for Y4 when >.o = 1. The effect of the two
outliers can be seen at different points in the search for different transformations
(detail of Figure 4.15)

symmetrical marginal distributions. To emphasize this point Table 4.3 gives


the minimum and maximum observations for each variable and their ratio ,
which may be compared with Table 4.1 for the babyfood data.
The ratio of maximum to minimum is around 1.25 for most variables,
quite different from the values of at least 100 for the babyfoods. Analysis on
the reciprocal scale, for example, is here likely to yield results very similar
to analysis of untransformed data. As we saw in Figure 4.15 this was true
for all responses except y 4 , which is the variable with the largest ratio in
Table 4.3. After deletion of the two outliers this ratio reduces slightly to
1.46, still the largest. Similar points about the importance of the ratio of
maximum to minimum values were made in connection with the record
times for women in Table 1.1, where it was suggested that the shorter
races would not be as informative about the need for transformations as
the Ionger races. As we have already seen, the reciprocal transformation
for all times does increase multivariate normality. A third set of data for
which there may not be strong information about transformation is that
on Swiss bank notes, in which the observations again have a small range.
These data are the subject of Exercise 4.3.
176 4. Multivariate Transformations to Normality

4.9 Horse Mussels


In the previous example the two outliers entered in the last two steps of
the forward search when the data were not transformed. It was therefore
easy to determine a transformation that was appropriate for the bulk of the
data. This is not so in our next example, in which the effect of the outliers is
masked unless the search uses a suitable transformation. We therefore have
to use the procedure suggested in §4.6 of employing a series of searches,
the parameter values for the transformation in successive searches being
guided by the results of previous searches.
The data are in Appendix Table A.6. There are 82 observations on Horse
mussels from New Zealand. The five variables are:

Y1 : shell length, mm
y2: shell width, mm
y3 : shell height, mm
Y4: shell mass, grams
Y5: muscle mass, grams.

The data were introduced by Cook and Weisberg (1994, p. 161) who treat
them as regression with muscle mass, the edible portion of the mussel, as
response. They focus on independent transformations of the response and
of one of the explanatory variables which we now call y4 . In Atkinson and
Riani (2000, p. 116), the focus is on the joint transformation of the response
and one explanatory variable in the regression model. Here we see whether
multivariate normality can be obtained by joint transformation of all five
variables.
We begin by looking at the data. Figure 4.17 is the scatterplot matrix
of the data with superimposed robust contours. Several of the contours are
not elliptical and many cover the data poorly: for example in the plots of y5
against y 1 or y3 the scatter of points is decidedly curved, lying to one side
of the contour. It therefore seems that there is plenty for a transformation
to do in achieving scatterplots with elliptical contours.
We start with a forward search with untransformed data. Figure 4.18
is the forward plot of the resulting likelihood ratio test, to be compared
with X~· The value at the end of the search is 160.56 and the statistic is
significant throughout the range shown in the figure. The data need to be
transformed.
To obtain a first idea of a better transformation consider the forward
plot of the estimates of ,\in Figure 4.19. These estimates alltrenddown at
the end of the search, indicating the continuing introduction of units which
are further and further from the untransformed multivariate normal model.
The jump at m = 71 is caused by the introduction of unit 78. The general
shape of the curves towards the end of the search suggests we might try ,\ =
(1, 0.5, 1, 0, 1/3)T. Although 1/3 is not one of the standard five values we
4.9 Horse Musseis 177

·rn·~·~·~·~
t T" i i 0 0 I 0 1 0o

~ ° :
0
°

:[2]~~j~0°_
: : 0 0 : : 0

:[21:[IJ
I

o
0
~ O 0

o

0
o 0

o

0 0

0
)

0 .. .. •

• 0 ° • 0 • . 0
0 • • • •

[ZJ[~Jrn~J:~
+: ~
·~·~·[l]'ITJ'lZJ
t I I 0 I 0
I I t ~ I I 0 QQ

: 0 0 : 00 : 0 : 0

-~-~-~-~-rn
. o . 0 • o • o0 · ~

• . J 'h . o o 8, • o . oo o :
. . . . 0 0 .

. . . .
• C' • • • •
.
• • • • f

FIGURE 4.17. Horse mussels: scatterplot matrix with superimposed robust con-
tours. These non-elliptical curves indicate that transformation might be beneficial

have previously used, it is sometimes appropriate for transforming volume,


or mass, to have units of length, which are here the dimension of the first
three variables. The exact values of the transformation parameters are not
crucial, since we can continue to try a variety of forward searches.
The forward plot of the likelihood ratio statistic for testing .A. = (1, 0.5,
1, 0, 1/3)T is in Figure 4.20. This is an appreciable improvement over
Figure 4.18 and, indeed, the transformation is almost acceptable at the
end of the search, that is for all the data. However, it is not acceptable
at some earlier stages in the search. So there are some outliers entering
earlier on which are causing the preferred transformation to change. The
forward plot of parameter estimates in Figure 4.21 shows, by the changes
in estimates near the end of the search, that at least some of the outliers
are now entering towards the end. For example, unit 78 now enters when
m = 80. To find parameter values for our third search we move to slightly
178 4. Multivariate Transformations to Normality

e
0 0
~
"0
0
0
,5
]l
:.:J

0
."

50 60 70 80
Subset size m
FIGURE 4.18. Horse mussels: forward search on untransformed data. Likelihood
ratio test for the hypothesis of no transformation. The horizontal lines are the
95% and 99% points of X~· The data need tobe transformed

, .. "'' ...
__ ,,"'' ...... _,,'
'"'
q

CU
"0 ."
.0 ci
E
.!!!

0
ci

."
9

50 60 70 80

Subset size m

FIGURE 4.19. Horse mussels: forward search on untransformed data. The five
elements of the estimate j_, suggesting that at least three variables should be
transformed
4.9 Horse Mussels 179

"'
0
-~
'C l{)
0
0
,!:
Q)
""
::::;

LI)

50 60 70 80
Subset size m
FIGURE 4.20. Horse mussels: forward search with AR = (1, 0.5 , 1, 0, 1/3)T.
Likelihood ratio test for AR, which is rejected during the latter part of the search

~
-- ,--· •• 3

\ \ ...... -- ... _.... '


Cll l{)
'C
..0 0
E I
..!!! I
I
,I
0 ,I
ci

50 60 70 80
Subset size m

FIGURE 4.21. Horse mussels: forward search with AR = (1, 0.5, 1, 0, 1/3) T. The
five elements of the estimate 5.. Some outliers are entering towards the end of the
search
180 4. Multivariate Transformations to Normality

0
~ Lt>

"0
0
0
,s
o;
..><
::::; ~

Lt>

50 60 70 80
Subset size m
FIGURE 4.22. Horse mussels: forward search with AR = (0.5, 0, 0.5, 0, O)T.
Likelihood ratio test for AR, which is only rejected at the end of the search

smaller values of m. At m = 75 an outlier appears to be entering which is


causing an appreciable jump in the value of ).5 . The estimates just before
this suggest we try >. = (0.5, 0, 0.5, 0, o)r, a stronger transformation than
before.
This transformation does very well. Figure 4.22 is the forward plot of the
likelihood ratio test for >. = (0.5, 0, 0.5, 0, o)r. This is now similar in shape
to the plot for no transformation of the data on Swiss heads in Figure 4.14.
We have found a transformation which is supported by all the data except
the outliers, which enter at the end of the search. The forward plot of the
parameter estimates in Figure 4.23 shows that, until m = 79, three of the
parameter estimates are stable around zero and two around 0.5. We use the
fan plot of Figure 4.24 to confirm these individual values of >.. Each panel
gives the approximate score statistic for each parameter in turn when we
take six values of >. 8 including 1/ 3; the remairring four parameters are held
at their values in the vector (0.5, 0, 0.5, 0, O)r. The figure, like Figure 4.15,
shows the evolution of the score statistics during 30 forward searches. The
panels confirm the transformation we have found, but also show what other
transformation is acceptable for each variable. The transformation for y 1
is not very tightly defined, with the statistics for 0.5 and 1/ 3 close to zero
throughout the search and the log transformation within the 99% band.
For Y2 the statistic for the log transformation is closest to zero throughout,
although 1/3 is also acceptable. For y 3 the value of 1/3 is a little better
than is 0.5: the other four values are unacceptable. The transformation
for Y4 is unambiguously the log. The statistics for the other five values
of the parameter increase steadily in magnitude throughout the search.
4.9 Horse Musseis 181

q
//'' ·3

<U
I ',,.,.,'\\
''
... ,
"' ,'
,----, I
'0
.0 0 ' __ , ,,
' ' ........ '
E
..!!!
/1~
I I
0
I J..--4
0 .......
...
~-

/'···~...//
--···········~--

"'
c? ··.. /
................J

50 60 70 80
Subset size m

FIGURE 4.23. Horse mussels: forward search with AR= (0.5 , 0, 0.5, 0, O)r. The
five elements of the estimate >.. The estimates are stable until the outliers enter
at the end of the search

Interestingly, there is no evidence of an effect of the outliers. The effect of


the outliers is however evident in the last panel, that for transformation of
y 5 . The statistic for ..\ 5 = 0 is close to zero until the end of the search when
the outliers enter. The one third transformation gives a statistic which is
stable and close to zero at the end of the search. However it has negative
values earlier on, which are significant at the 5% level.
We now compare the data after the transformation ..\ = (0.5, 0, 0.5, 0,
O)T with that without transformation. Figure 4.25, to be compared with
Figure 4.17, is the scatterplot matrix of the transformed data with robust
contours superimposed . These are now much more elliptical and evenly
filled. Clearly the data are now much closer to multivariate normality.
A second feature is that the transformation has separated the outliers
from the bulk of the data. The outliers are highlighted in Figure 4.25. The
effect is strongest for y 5 where the log transformation has a large effect
on the smallest observations. The last to enter the subset is unit 48, for
which y 5 = 1. The next to last is unit 8 for which y 5 has the second
smallest value, 4. Three units, 25, 50 and 76 share the next smallest value,
5. The very small values of y 5 for units 48 and 4 explain why the indicated
transformation of y 5 is seen in Figure 4.23 to change rapidly at the end of
the search. The fan plot for ..\ 5 in Figure 4.24 confirms this interpretation;
the test for ,.\ = 0 responds strongly to the introduction of the last two
observations. The tests for ,.\ = -0.5 and -1 are even more sensitive to
182 4. Multivariate Transformations to Normality

.... ~
~
- / . -o.
"' ..--------~- ...-•. ..•••.---;o>"
........- ••.•. _-.,0
.:

~-- .. ··········-;:,'·<·~-~-~-~-~-~-~-~--~-~ -- - - -~/ - .. ' 0

;z:=~~~~-..::=--=-=--=~ - - - -r.L:.-=-~~3
---------------~---~?"'---~-- ;
~ L------------~ ~ L------------~

"' }------------~

~ 0 --- -- - - - ---- - - 0
'l'

~
U.C><

"'
~
"'
'l'

~
------------ ............
- -/

55 60 65 70 75 80 1
Subset size m

FIGURE 4.24. Horse mussels: fan plots for six values of .A including 1/3, with .Ao
= (0.5,0,0.5,0,0)T . The outliers have most effect on the transformation of Y5

the presence of these two observations. They however lie outside the range plotted in the figure.

The comparison of forward plots of scaled Mahalanobis distances also shows how the transformation has improved normality for most units and made the outliers more remote. Figure 4.26 shows that, for the untransformed data, there is a large number of outliers without any apparent structure, which do not reduce in number as the search progresses. Due to these outliers, the distances for the central units are too small and are in the lower half of the theoretical distribution. After transformation, Figure 4.27, there is a much clearer group of six outliers, two of which, units 8 and 48, are particularly remote in the scatter plots in Figure 4.25. The same structure is clear in the forward plots of the minimum distance amongst units not in the subset. Figure 4.28, for the original data, shows an irregular increase in distance as the search progresses. On the other hand, the left-hand panel of Figure 4.29 for the transformed data initially shows a set of distances which only increase very gradually, with few fluctuations. At the end of the search the distances are large as the six outliers enter, although, due to masking, the distance for the last observation is the same as that of the preceding observation. The right-hand panel of Figure 4.29 is the gap plot, showing the difference between the smallest squared Mahalanobis distance for units not in the subset and the largest distance for units within the subset. This shows a first peak at m = 76, just before the first outlier enters the subset at m = 77. There is a second peak at m = 80 as the
FIGURE 4.25. Horse mussels: scatterplot matrix of transformed data, λ = (0.5, 0, 0.5, 0, 0)^T, with superimposed robust contours. These elliptical curves, to be compared with those of Figure 4.17, show the effect of transformation

last two, more extreme, outliers enter with some masking. Other forward
plots, not reproduced here, likewise show the effect of the transformation
in partitioning the data into a large central part and half a dozen outliers.
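As a concrete illustration of the quantities behind these plots, the following minimal sketch (Python with numpy, written for illustration and not taken from the authors' software) computes, for a given subset, the squared Mahalanobis distances from the subset fit, the minimum distance amongst units not in the subset, the maximum within the subset, and their difference, which is what the gap plot monitors.

```python
import numpy as np

def gap_quantities(Y, subset):
    """Squared Mahalanobis distances from the mean and covariance of the
    current subset, together with the ingredients of the gap plot."""
    Ys = Y[subset]
    mu = Ys.mean(axis=0)
    Sinv = np.linalg.inv(np.cov(Ys, rowvar=False))
    dev = Y - mu
    d2 = np.einsum('ij,jk,ik->i', dev, Sinv, dev)   # squared distances, all units
    outside = np.setdiff1d(np.arange(len(Y)), subset)
    d_min_out, d_max_in = d2[outside].min(), d2[subset].max()
    return d_min_out, d_max_in, d_min_out - d_max_in  # the gap
```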
It is now time to interpret the transformation. The first three measurements are length, the fourth and fifth mass, which is related to volume and so to the cube of length. The log transformation of mass leads to responses with dimension approximately the logarithm of length. It might be hoped that the first three variables should also be logged, to give a dimensionally homogeneous model. However the model with logarithms of all five variables is not acceptable, as is indicated by the fan plots of Figure 4.24. In our final model one of the variables is logged, but the square root is taken of the other two. Since all three measurements are of length, the same transformation might have been anticipated for all variables. In fact, our discussion of the fan plots of Figure 4.24 suggested that the 1/3 transformation was
FIGURE 4.26. Horse mussels; untransformed data. Forward plot of scaled Mahalanobis distances, showing many outliers

"'
§
"'
u;

...
~c
"'
o;

i
z::;
"'
:::;:

"'

40 50 60 70 80
Subset size m

FIGURE 4.27. Horse mussels; transformed data, >. = (0.5, 0, 0.5, 0, O)T. Forward
plot of scaled Mahalanobis distances, showing six well-separated outliers

acceptable for the first three responses. Figure 4.30 is the forward plot of the likelihood ratio test for the null value λ = (1/3, 1/3, 1/3, 0, 0)^T. It is similar in shape to Figure 4.22 for λ = (0.5, 0, 0.5, 0, 0)^T, but the values throughout are higher, although not significantly so. Although the transformation with three values of 1/3 is therefore also acceptable, that with the log of y2 and the square root of the other two variables is statistically
FIGURE 4.28. Horse mussels: untransformed data. Forward plots of Mahalanobis distances. Left-hand panel, minimum distance among units not in the subset and, right-hand panel, gap plot

FIGURE 4.29. Horse mussels: transformed data, λ = (0.5, 0, 0.5, 0, 0)^T. Forward plots of Mahalanobis distances. Left-hand panel, minimum distance among units not in the subset and, right-hand panel, gap plot. The effect of the outliers is now apparent at the end of the search

preferable. Although the three variables are all lengths, there is no reason for them to be subject to the same transformation, particularly if the mussel shells change shape as they grow; the shapes of the three distributions of length would then not be the same.
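To make such comparisons of candidate vectors of transformation parameters concrete, the following sketch (Python with numpy and scipy, written for illustration and not taken from the authors' software) evaluates the profile loglikelihood of the multivariate normal model for the normalized Box-Cox transformed data and forms the likelihood ratio test of a specified null vector λ0 against the unrestricted maximum likelihood estimate; the statistic is referred to a chi-squared distribution on as many degrees of freedom as there are responses.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

def norm_boxcox(y, lam):
    """Normalized Box-Cox transform of one positive variable."""
    g = np.exp(np.mean(np.log(y)))                     # geometric mean
    return g * np.log(y) if lam == 0 else (y**lam - 1.0) / (lam * g**(lam - 1.0))

def profile_loglik(Y, lam):
    """Profile loglikelihood (constants omitted) of the multivariate normal
    model for the normalized transformed data."""
    Z = np.column_stack([norm_boxcox(Y[:, j], lam[j]) for j in range(Y.shape[1])])
    S = np.cov(Z, rowvar=False, bias=True)             # ML covariance estimate
    return -0.5 * len(Y) * np.log(np.linalg.det(S))

def lr_test(Y, lam0):
    """Likelihood ratio test of H0: lambda = lam0 against the unrestricted MLE."""
    lam0 = np.asarray(lam0, dtype=float)
    res = minimize(lambda l: -profile_loglik(Y, l), x0=lam0, method='Nelder-Mead')
    stat = 2.0 * (-res.fun - profile_loglik(Y, lam0))
    return stat, chi2.sf(stat, df=Y.shape[1])
```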
This is a canonical example of our approach to finding a multivariate
transformation in the potential presence of outliers and influential observa-
tions. We start with a search with untransformed data and use information
from the forward plots of the estimated transformation parameters to sug-
gest a transformation, which we then use in a second forward search, repeat-
ing the analysis until we find an acceptable transformation. In this example
FIGURE 4.30. Horse mussels: forward search with λ = (1/3, 1/3, 1/3, 0, 0)^T. Likelihood ratio test for this value, which is only rejected at the end of the search. Compare with Figure 4.22

only three searches were needed in all to find a transformation which was stable for nearly all the search, any changes being at the end where the outliers entered and the likelihood ratio test became highly significant. We can use the same procedure for the methods in our next chapters to find transformations for principal components analysis, discriminant analysis and cluster analysis. But now we conclude this chapter with some examples of transformations of data from single populations that show some special features.
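The repeated searches referred to above all rest on the same elementary step: fit the multivariate normal model to the current subset, rank all n units by their Mahalanobis distances from that fit, and let the subset grow by one. The sketch below (illustrative Python, assuming the data have already been transformed, and ignoring refinements of the authors' algorithm such as the robustly chosen starting subset) shows that single step.

```python
import numpy as np

def forward_step(Y, subset):
    """One step of a simplified forward search: estimate the mean and
    covariance from the current subset, compute squared Mahalanobis
    distances for all units, and return the m + 1 units with the smallest
    distances as the next subset (so interchanges are possible)."""
    mu = Y[subset].mean(axis=0)
    Sinv = np.linalg.inv(np.cov(Y[subset], rowvar=False))
    dev = Y - mu
    d2 = np.einsum('ij,jk,ik->i', dev, Sinv, dev)
    return np.argsort(d2)[: len(subset) + 1]
```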

4.10 Municipalities in Emilia-Romagna


Compared with the sets of data analysed so far in this chapter, the data on municipalities in Emilia-Romagna, introduced in §1.6, are large and messy. The difference is not so much the number of observations, 341, as the number of variables, 28, and their nature. In the earlier examples we had precise physical measurements from Swiss and other scientific laboratories. Now we have responses to surveys, from a wide variety of Italian communities with differing standards in the provision of data. We can expect that the analysis of transformations will not be completely straightforward.

Because of the number of responses, we divide them into three groups, partially for ease of interpretation of our plots. The analysis of all 28 responses together would require numerical optimization in 28 dimensions. This could be circumvented by use of the score statistics introduced in §4.4,
TABLE 4.4. Municipalities in Emilia-Romagna: number of zeroes for each variable replaced with the minimum value of that variable

Variable type    Variable    Number
Demographic      y6               1
                 y10              1
                 y12              1
                 y13              1
Wealth           y19              1
                 y21             11
Work             y25              3
                 y26             92

a route we follow in the analysis of transformations for the data on national track records for women in §4.11.

Use of the Box-Cox transformation requires that all observations be positive. A few of the observations in these data are zero. We replaced them with the minimum value for the particular response. Table 4.4 lists the number of replacements made. For two of the variables the numbers replaced were appreciable. We discuss the effect of replacements of these variables later under the appropriate groups.
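A minimal sketch of this preprocessing step is given below (Python with numpy, for illustration only): zeroes in a variable are replaced by its smallest positive value so that the Box-Cox transformation, which needs strictly positive data, can be applied.

```python
import numpy as np

def replace_zeros_with_min(y):
    """Replace zero entries by the smallest positive value of the variable,
    so that the Box-Cox family can be applied."""
    y = np.asarray(y, dtype=float)
    return np.where(y == 0, y[y > 0].min(), y)

def boxcox(y, lam):
    """Ordinary Box-Cox transformation of a positive variable."""
    y = np.asarray(y, dtype=float)
    return np.log(y) if lam == 0 else (y**lam - 1.0) / lam
```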
We first analyse the three groups separately, finding appropriate transfor-
mations which give searches in which observations influential for the trans-
formation enter at the end. Then, in §4.10.4, we merge the three groups
and see how the transformation and inferences change for our combined
analysis.

4.10.1 Demographic Variables

As the group of demographic variables we take y1 - y5 and y10 - y13. As in Chapter 1, the variables are:

y1: % population aged less than ten
y2: % population aged 75 or more
y3: % single-member families
y4: % residents divorced
y5: % widows and widowers
y10: standardised natural increase in population
y11: standardised change in population due to migration
y12: average birth rate over 1992-94
y13: fecundity: three-year average birth rate amongst women of child-bearing age.
We start by monitoring the maximum likelihood estimates of the transformation parameters for the nine variables. The results are in Figure 4.31. Because we have so many responses, we divide the forward plots of λ̂ into two. The upper panel of Figure 4.31 gives the estimates for variables 1, 2, 4, 5, 10 and 11. It is clear that these variables all require transformation. The variables in the lower panel, 3, 12 and 13, likewise all require transformation, but are affected by outliers at the very end of the search. The hypothesis of no transformation is rejected by the likelihood ratio statistic virtually from the beginning of the search.

From Figure 4.31 it seems that the square root transformation is suitable for many, although not all, of the responses. As a second stage we try

FIGURE 4.31. Emilia-Romagna data, demographic variables: forward search on untransformed data. Elements of the estimate λ̂, suggesting the square root transformation for several variables. Note the different scale in the two panels

λ = 0.5 for all variables. This hypothesis is rejected for m greater than almost exactly 200, an improvement on the results for untransformed data, but one that can be further improved. As a result of the forward plot of the elements of λ̂ from this search we try

λ_R = (0.5, 0.5, -0.5, 0.5, 0.5, 0, 0, 0.5, 0.5)^T,

that is the square root for six variables, the logarithm for two and the reciprocal square root for one. The forward plots of the parameter estimates from this search are more stable than those for the search on untransformed data in Figure 4.31.
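The profile loglikelihoods discussed next, and the forward plots of λ̂ themselves, can be obtained by evaluating the profile loglikelihood of the transformed data over a grid of values for one element of λ while the others are held fixed. A self-contained sketch (illustrative Python, not the authors' code) is:

```python
import numpy as np

def profile_over_grid(Y, lam_fixed, j, grid):
    """Profile loglikelihood for the j-th transformation parameter over a
    grid, holding the remaining elements of lambda at lam_fixed."""
    def tr(y, l):                                   # normalized Box-Cox
        g = np.exp(np.mean(np.log(y)))
        return g * np.log(y) if l == 0 else (y**l - 1.0) / (l * g**(l - 1.0))
    values = []
    for l in grid:
        lam = np.array(lam_fixed, dtype=float)
        lam[j] = l
        Z = np.column_stack([tr(Y[:, k], lam[k]) for k in range(Y.shape[1])])
        S = np.cov(Z, rowvar=False, bias=True)
        values.append(-0.5 * len(Y) * np.log(np.linalg.det(S)))
    return np.array(values)
```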
The plots of the profile loglikelihoods for each transformation at m = 331 in Figure 4.32, just before the estimates start to respond to outliers, show that the value of λ1 is not very well determined. A search with λ_R

FIGURE 4.32. Emilia-Romagna data, demographic variables: forward search with λ_R = (0.5, 0.5, -0.5, 0.5, 0.5, 0, 0, 0.5, 0.5)^T. Profile loglikelihoods at m = 331.

but with λ1 = 0 indicated that, with the larger number of observations in these data, a finer grid of parameter values is needed. We therefore add values of one quarter and one third to the values we consider. Of course we need to check whether this finer division of the scale of λ is necessary. With this finer division the profile plots indicate the tentative vector of transformation parameters

λ_R2 = (0, 0.25, 0, 0.5, 0.5, 0, 0, 0.5, 0.25)^T.

The forward plot of λ̂ is changed from the earlier plot in Figure 4.31. The upper panel of Figure 4.33 again shows forward plots of six estimates. The major improvement over the earlier plot is that now virtually all change
FIGURE 4.33. Emilia-Romagna data, demographic variables: forward search with λ_R2 = (0, 0.25, 0, 0.5, 0.5, 0, 0, 0.5, 0.25)^T. Elements of the estimate λ̂

in λ̂1 is concentrated at the end of the search. The value of λ̂2 is plausibly one third or a quarter, as we expected. In the lower panel, 0.25 appears to be a good value for λ13. We now need to test these values.

The forward plot of the likelihood ratio test for λ_R2 now behaves as we would hope - the transformation is acceptable for nearly all the units. Figure 4.34 shows the great increase caused by the last units to enter.

To confirm the details of this transformation we look at the more interesting of the fan plots generated from the signed square root of the likelihood ratio test (4.15). Four panels are shown in Figure 4.35.

The panel for y1 confirms 0 as the best transformation for this variable. The transformation is sensitive to the last observations to enter, a point we pursue further in Figure 4.36. For λ3 values between 0 and -0.5 are acceptable, lying within the central 99% of the distribution, uninfluenced by individual observations. The panel for y5 shows a very stable plot from which the value of 0.5 is indicated throughout the search. The last panel we show is for the transformation of y13, which is stable until the very end of the search, when the introduction of the last observation causes a sudden change. Until then 0.25 provides a satisfactory transformation.

The panels we do not show also confirm the individual values in λ_R2, including a transformation of 0.25 for y2. Only for the transformation of y1 is there a steady change at the end of the search. The evolution of the last 22 values of λ̂1 is shown in the left-hand panel of Figure 4.36. The last ten observations to enter, from the end of the search, are 277, 310, 188, 260,
FIGURE 4.34. Emilia-Romagna data, demographic variables: forward search with λ_R2 = (0, 0.25, 0, 0.5, 0.5, 0, 0, 0.5, 0.25)^T. Likelihood ratio test for λ_R2. The outliers produce a strong effect at the end of the search. The horizontal line is the 95% point of χ²_9

245, 264, 261, 239, 246 and 250. These observations cause the estimated transformation to move steadily from 0 to 1. Such behaviour would be undetectable by the backwards deletion of outliers. Even if observations 277 and 310 were correctly identified as the pair needing deletion, their removal would only cause the estimated value of λ1 to move closer to one. As the right-hand panel of Figure 4.36 shows, many of these observations correspond to the smallest values of y1. The six smallest values at the bottom of the boxplot are for units 260, 188, 245, 277, 261 and 246.

4.10.2 Wealth Variables

As a second group, this time of ten variables, we take y14 - y23, which are measures of the prosperity of the communities.

y14: % occupied houses built since 1982
y15: % occupied houses with two or more WCs
y16: % occupied houses with fixed heating system
y17: % TV licence holders
y18: number of cars per 100 inhabitants
y19: % luxury cars
y20: % working in hotels and restaurants
FIGURE 4.35. Emilia-Romagna data, demographic variables: fan plots confirming λ = (0, 0.25, 0, 0.5, 0.5, 0, 0, 0.5, 0.25)^T. Only the panels for y1, y3, y5 and y13 are shown

FIGURE 4.36. Emilia-Romagna data, demographic variables: forward search with λ_R2 = (0, 0.25, 0, 0.5, 0.5, 0, 0, 0.5, 0.25)^T. Left-hand panel, the last 22 values of λ̂1; right-hand panel, boxplot of y1: small values of y1 influence λ̂1

y21: % working in banking and finance
y22: average declared income amongst those filing income tax returns
y23: % of inhabitants filing income tax returns.
FIGURE 4.37. Emilia-Romagna data, wealth variables: histograms of, left-hand panel, y16 and, right-hand panel, the modified variable 100 - y16

If we proceed as we did for the demographic variables by looking at forward plots of parameter estimates and test statistics over a finer grid of λ values we obtain as a final vector of parameter values

λ_R = (0, 1, 3, 1, 1, 0.5, -0.5, 0.25, 0.25, 3)^T,    (4.25)

so that y16 and y23 should both be cubed. These are surprising transformations since customarily we find that -1 ≤ λ ≤ 1.

Usually we are transforming variables with a long right-hand tail, which require values of λ < 1 to give symmetry and approximate normality. But the histogram of y16 in the left-hand panel of Figure 4.37 shows that this variable has a long left-hand tail. The variable itself is the percentage of occupied houses with a fixed heating system. We could, with equal logic, consider 100 - y16, the percentage of occupied houses without a fixed heating system. As the right-hand panel of Figure 4.37 shows, this new variable has a long right-hand tail, of the kind that requires a value of λ less than one for transformation. The plot of y23 is similarly skewed to the left and we also work with 100 - y23, the percentage of inhabitants not filing an income tax return.
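This reverse coding of left-skewed percentages is simple to automate; the small sketch below (illustrative Python using scipy's sample skewness, not part of the original analysis) reflects a percentage variable about its upper limit whenever its long tail is on the left.

```python
import numpy as np
from scipy.stats import skew

def reflect_if_left_skewed(y, upper=100.0):
    """Return (possibly reflected variable, flag): a percentage with a long
    left-hand tail is replaced by upper - y, so that the usual lambda < 1
    transformations become appropriate."""
    y = np.asarray(y, dtype=float)
    if skew(y) < 0:
        return upper - y, True
    return y, False
```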
With these two modified variables, our analysis based on forward plots of the estimates λ̂_j now leads to

λ_R2 = (0, 1, 0.25, 1, 1, 0.5, -0.5, 0.25, 0.25, -1)^T.    (4.26)

The important difference from the previous estimate λ_R in (4.25) is that the new variables y16 and y23 require transformations of 0.25 and -1 rather than the previous values of three.

We now need to test these parameter estimates against the data. The left-hand panel of Figure 4.38 is a forward plot of the likelihood ratio test of λ_R2. The most obvious feature is a great increase at the end of the search,
FIGURE 4.38. Emilia-Romagna data, modified wealth variables: forward search with λ_R2 = (0, 1, 0.25, 1, 1, 0.5, -0.5, 0.25, 0.25, -1)^T. Left-hand panel, likelihood ratio test for λ_R2; right-hand panel, detail showing effects of three units on the test statistic

causing this value to be rejected. The last six units to enter the search, working backwards, are 70, 277, 250, 264, 191 and 117. The large jumps upwards in the value of the statistic are associated with units 277, 191 and 117, which have the three smallest values of y14. The large effects of these three units are shown in the right-hand panel of the figure.

We now look at a few fan plots of expansions of the signed square root of the likelihood ratio test for λ_R2, which are presented in the four panels of Figure 4.39. The first panel, for y14, shows that our chosen value of zero is acceptable almost to the end of the search, but that the last six observations to enter cause this value to be rejected. The effect of these units on the likelihood ratio test for λ_R2 has already been illustrated in the right-hand panel of Figure 4.38. These units also cause 0.5 to become an acceptable value for the transformation parameter. The panel for y16 shows that either 1/3 or 1/4 is acceptable, but that 0.5 is not. The inferences about the value of λ16 are unaffected by the outliers so evident in the panel for transforming y14.

The two remaining panels of Figure 4.39 show that some transformations are sharply defined, others not. The top right-hand panel for y20 indicates clearly that -0.5 is the only correct transformation whereas the last panel, for y22, shows that any value between 0 and 0.5 gives a satisfactory transformation. The plot for y21, which is not given, shows that 0.25 is the only acceptable value of λ for this response; even the nearby value of 1/3 gives a test statistic which occasionally wanders out of the acceptance region. The plot for y21, which we do not give, does not show any effect of the replacement of 11 zeroes by the minimum observation which was noted in Table 4.4.
FIGURE 4.39. Emilia-Romagna data, modified wealth variables: fan plots confirming λ = (0, 1, 0.25, 1, 1, 0.5, -0.5, 0.25, 0.25, -1)^T. Only the panels for y14, y16, y20 and y22 are shown; in the upper row the values of λ plotted are (-1, -0.5, 0, 0.5, 1), in the lower row (0, 0.25, 1/3, 0.5)

The major conclusion from this analysis is that, on replacing y16 and y23 by 100 - y16 and 100 - y23, we have obtained stable transformations within the customary range of minus one to one. In addition, the finer grid of λ values is needed for some of this group of responses as well as for some of the demographic variables. The only evidence for the effect of outliers is on the transformation of y14 and it is with this that we end our analysis.

The first panel of Figure 4.39 showed that 0.5 was an acceptable value for λ14 at the end of the search. If we replace the first element of λ_R2 by this value of 0.5, the forward plot of the likelihood ratio test for this new hypothesis when it is used as the basis of the search is as given in Figure 4.40, which is quite different from that in Figure 4.38. Now the value of the statistic is roughly double what it was during the search with two peaks on the boundary of significance. There is also a non-significant increase at the end of the search. This plot shows how the effect of the outliers can be masked by the choice of just one of the ten elements of λ_R.

4.10.3 Work Variables


The remaining nine variables in the data on municipalities all measure
aspects of the proportion of the population who are working and how their
employment is organised:
FIGURE 4.40. Emilia-Romagna data, modified wealth variables: forward search with λ = (0.5, 1, 0.25, 1, 1, 0.5, -0.5, 0.25, 0.25, -1)^T. Likelihood ratio test for λ, to be compared with Figure 4.38. Changing the value of λ14 from 0 to 0.5 causes appreciable masking

y6: % population aged over 25 who are graduates
y7: % of those aged over six having no education
y8: activity rate - % of those of working age in full-time employment
y9: unemployment rate
y24: % residents employed in factories and public services
y25: % employees employed in factories with more than ten employees
y26: % employees employed in factories with more than 50 employees
y27: % artisanal enterprises
y28: % entrepreneurs and skilled self-employed among those of working age.

There are several new features to the analysis of these data. One is that, as Table 4.4 showed, there were 92 zero values for y26 which had to be replaced by the minimum value. There were also three zeroes in y25. We need to determine whether these two variables are having an effect on any inferences drawn from the data.

We start our analysis by replacing these zeroes and performing the customary analysis of looking at maximum likelihood estimates of the parameters during the forward search, combined with likelihood ratio tests, to obtain the estimated transformation

λ_R = (0.25, 0, 2, -1, 0, 1.5, 0.5, 1, 1)^T.

For one variable we found it necessary to use the finer grid of λ values which included 0.25.

FIGURE 4.41. Emilia-Romagna data, work variables: forward search with λ_R = (0.25, 0, 2, -1, 0, 1.5, 0.5, 1, 1)^T. Elements of the estimate λ̂

Forward plots of the individual parameter estimates are in Figure 4.41. There are three noteworthy features in the upper panel. One is that y8 needs to be squared; this transformation seems steady from around m = 200. A second feature is that λ̂9 is minus one for much of the search, before starting to drift upwards, with a final jump at the end to -0.5. The third is that the value of λ̂7 increases steadily from -1 to zero. The bottom panel has fewer features of interest: the plot for λ̂27 drifts up above one towards the end whereas that for transforming y26 is virtually constant around 0.5. This is the variable for which 92 zeros had to be replaced.

The upward drift in λ̂7 suggests that the observations are not ordered with the outliers, if any, in y7 entering at the end of the search. Instead the units are being ordered more with respect to some of the other variables. That the ordering is not for all variables is evident in the forward plot of the likelihood ratio test for λ_R in Figure 4.42. Here there is a sharp peak at the end, where a few outliers cause a significant rejection of λ_R. Before that, the values from around m = 220 are not significant, lying well below the 95% point of the distribution. However, there is some evidence of significance around m = 200. The dependence of this peak on the transformation of y7
FIGURE 4.42. Emilia-Romagna data, work variables: left-hand panel, forward search with and likelihood ratio test for λ_R = (0.25, 0, 2, -1, 0, 1.5, 0.5, 1, 1)^T; right-hand panel, the same with the second element of λ_R, λ7, replaced by -1

FIGURE 4.43. Emilia-Romagna data, work variables: fan plots confirming λ_R = (0.25, 0, 2, -1, 0, 1.5, 0.5, 1, 1)^T. Only the panels for y6, y7, y8 and y9 are shown

is shown in the right-hand panel of Figure 4.42 in which the value of 0 for λ7 in λ_R is replaced by -1. This is the value of λ̂7 at the beginning of the search plotted in Figure 4.41. The effect on the likelihood ratio test is to remove the peak at m = 200, but to produce a more gradual increase of the statistic to significance towards the end of the search.
FIGURE 4.44. Emilia-Romagna data, modified work variables: fan plots confirming λ = (0.25, 0, 2, -1, 0, 0, 1.5, 1, 1)^T. Only the panels for the modified variables y25 and y26 are shown

We now briefly consider four panels of the fan plots for these nine variables. The expansions are around λ_R. The first panel in Figure 4.43 is for y6, showing that neither 0 nor 0.5 are acceptable values for λ6, but that the finer grid of values is needed. The second panel is for y7, which shows a trend which is to be expected from the argument in the previous paragraph. Initially the value of 0 is rejected, although it is acceptable for much of the search and is preferred towards the end. The plots of the estimates of the transformation parameters in Figure 4.41 also indicate that λ8 and λ9 behave in a way which requires further investigation. The bottom left panel of Figure 4.43 shows that 2 is the best value for λ8. A value of one is unacceptable only at the end of the search, so it is not certain that this variable has to be transformed; the evidence may be solely due to the presence of outliers. The plot for y9, the bottom right panel of Figure 4.43, shows that -1 is the best value for λ9 until close to the end of the search, when it is rejected due to a series of upward jumps in the test statistic. These are caused by the introduction of units 264, 252 and 165, the three units with the smallest values for y9. This is a measure of unemployment which is low in these small rural communities, but not zero.

We have not, so far, discussed y25 and y26. These are the two variables in which zeroes had to be replaced. Both are percentages of employees working in factories of a particular size, in the case of y26, in factories with more than fifty employees. It is not surprising that there were 92 zero entries for y26. In analysing the wealth variables we replaced two variables by their values subtracted from 100, in order to obtain variables for which the estimated transformation parameters lay between -1 and one. Here the estimated transformations are 1.5 and 0.5, rather than the values of three we obtained for the two wealth variables. However, if we replace y25 by 100 - y25, with a similar transformation for y26, we obtain variables for which zeroes no longer have to be replaced.

Once this replacement has been made we obtain 0 and 1.5 for the parameter estimates instead of 1.5 and 0.5, the other values of λ_R remaining as

they were. Most of the previous plots also remain as they were. For example the forward plot of the likelihood ratio test for the new λ_R is similar to the left-hand panel of Figure 4.42 with a significant peak around m = 200. The fan plots for λ25 and λ26 are in Figure 4.44. The left-hand panel of the figure shows that 0 is a satisfactory transformation for y25, which is the value we had already found. However the right-hand panel of Figure 4.44 shows that, although 1.5 is close to the maximum likelihood estimate of λ26, one is also an acceptable value. It is this value that we use for our further analysis.

4.10.4 A Combined Analysis


We now use the estimates of the transformation parameters from the three
preceding analyses as a starting point of a combined analysis of all 28
responses. We look at fan plots of the expansion of the likelihood ratio
tests and adjust a few parameter estimates. We then look at forward plots
of Mahalanobis distances and compare the patterns and apparent outliers
with those found in §1.6.
Table 4.5 gives the estimates of the parameters we have arrived at in the three separate analyses when y16, y23, y25 and y26 were replaced by 100 minus their observed values. We continue to use these four changed variables and run a forward search with all 28 responses transformed according to the first column of estimates in the table. In each of the three separate searches in the previous sections we obtained a different ordering of the units. Now we have included all responses, the order of entry of the units will be different again and we can expect slight changes in some of our forward plots.
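A rough sketch of how such an ordering can be extracted is given below (illustrative Python only; the starting subset is chosen crudely from the full-sample fit, whereas the authors use robustly chosen units, and interchanges are handled in a simplified way). It records the order in which units first enter the subset, which is how lists such as the one given later in this section can be produced.

```python
import numpy as np

def entry_order(Y, m0):
    """Run a simplified forward search from a subset of size m0 and return
    the units in the order in which they first enter the subset."""
    def distances(subset):
        mu = Y[subset].mean(axis=0)
        Sinv = np.linalg.inv(np.cov(Y[subset], rowvar=False))
        dev = Y - mu
        return np.einsum('ij,jk,ik->i', dev, Sinv, dev)

    n = len(Y)
    subset = np.argsort(distances(np.arange(n)))[:m0]   # crude, non-robust start
    entered = list(subset)
    for m in range(m0, n):
        subset = np.argsort(distances(subset))[:m + 1]
        for u in subset:
            if u not in entered:
                entered.append(u)
    return entered
```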
Figure 4.45 gives four panels out of the 28 fan plots for testing the value of each element of λ. The first panel is for y8, one of the work variables. The third panel of Figure 4.43 showed that 2 was a good value for λ8, with a value of one rejected towards the end of the search due to the inclusion of outliers. The new search shows that, after all, one is acceptable. The second panel of Figure 4.45 likewise provides more evidence about the transformation of the work variable y9. The fourth panel of Figure 4.43 showed that -1 was a satisfactory transformation for this variable, until the introduction of units 264, 252 and 165 at the end of the search. Now, with these units entering earlier in the search, the second panel of Figure 4.45 indicates the log transformation, despite the effect of the group of observations entering around m = 280.

The third panel of Figure 4.45 is for y14, a wealth variable. The first panel of Figure 4.39 showed that the last units to enter caused the suggested transformation to change from 0 to 0.5. The effect of these units, in particular 277, 191 and 117, is illustrated in the right-hand panel of Figure 4.38. In preparing the third panel of Figure 4.45 we have used a finer grid of values of λ. The panel now indicates that 0.25 is the preferred value
TABLE 4.5. Modified Emilia-Romagna data: estimates of transformation parameter for the combined analyses: λ_C1 and the elements that change in going to λ_C2

Variable        λ_C1      λ_C2
Demographic variables
y1              0.00      0.5
y2              0.25
y3              0.00
y4              0.50      1
y5              0.50      0.25
y10             0.00
y11             0.00
y12             0.50      0.25
y13             0.25      0.5
Wealth variables
y14             0.00      0.25
y15             1.00      0.5
y16             0.25      0.5
y17             1.00
y18             1.00
y19             0.50
y20            -0.50     -1/3
y21             0.25
y22             0.25
y23            -1.00
Work variables
y6              0.25
y7              0.00
y8              2.00      1
y9             -1.00      0
y24             0.00
y25             0.00      0.25
y26             1.00
y27             1.00
y28             1.00

for λ14. The effect of the entry of outliers is apparent in the figure, for example around m = 280, but they do not affect this choice of value for the parameter. A finer grid is also used in the final panel of Figure 4.45 for y25, again a work variable. From the upper panel of Figure 4.44 it seemed that 0 was a satisfactory value for λ25. However the panel in Figure 4.45
FIGURE 4.45. Modified Emilia-Romagna data: fan plots confirming λ_C1 in Table 4.5. Only the panels for y8, y9, y14 and y25 are shown

indicates that 0.25 is a better value. This panel makes the general point that with 341 units, a finer grid of values of λ is necessary; neither 0 nor 0.5 are satisfactory as values for this transformation.

As a result of the fan plots of Figure 4.45, and those for the remaining 24 variables, we obtain a series of slight adjustments of our vector of estimates that are listed as λ_C2 in Table 4.5. It is these values that we use for our remaining analyses. That it has taken appreciable effort and several pages to reach this vector of estimates is a reflection of the complexity of trying to find a satisfactory transformation simultaneously for 28 variables. With the univariate transformations described in Atkinson and Riani (2000, Chapter 4) it was possible for the forward search to achieve an ordering of the data in which any outliers influential for the transformation entered at the end of the search. But with multivariate data, the influential observations for one transformation may be different from those for another. The final ordering of the observations will be such that outliers enter at the end, as we shall see from forward plots of Mahalanobis distances. But, as the panels of Figure 4.45 and other plots show, the values of the estimates of the transformation parameters and the test statistics may vary during the search as observations important for transformation of a particular response enter the subset.

The last sixteen observations to enter the forward search are:

277 70 239 245 260 250 310 264 188 133 238 194 252 278 315 327.
FIGURE 4.46. Modified Emilia-Romagna data transformed using λ_C2 in Table 4.5: forward plot of minimum Mahalanobis distances among units not in the subset

Comparison of these results with those in Table 1.3 for the search on untransformed data shows that the seven most outlying communities are the same in both searches, although, apart from unit 277, Zerba, they enter in a different order. The four communities in Table 1.3 that do not appear are units 2, 30, 88 and 149, those with the largest populations in the table. How many of these sixteen communities can be thought of as outlying can be determined from plots of Mahalanobis distances.

Figure 4.46 is a forward plot of the minimum Mahalanobis distance amongst the observations not in the subset. This clearly shows the remoteness of the last unit, as well as indicating two other groups of outliers. The plot is similar in general form to the same plot for the untransformed data in Figure 1.12, except that it is smoother and the outliers are perhaps more separated at the end of the search. Figure 4.47 is a forward plot of scaled Mahalanobis distances, to be compared with Figure 3.24. In both plots the obvious feature is the effect of Zerba. However, the effect of the transformation has been to make unit 245 appreciably less remote.

Figure 4.48 repeats Figure 4.47, but cut to ignore the curve for Zerba. It is now evident that there is a set of five units: 70, 239, 245, 260 and 250 which form a clear group especially visible around m = 300. Perhaps also there is a slight grouping of units 310, 264 and 188 visible around m = 320.

We now consider the location of these units. Figure 4.49 shows the location of the last 21 communities to enter the forward search using transformed data. Comparison with Figure 3.27 shows that, as a result of the
FIGURE 4.47. Modified Emilia-Romagna data transformed using λ_C2 in Table 4.5: forward plot of scaled Mahalanobis distances, to be compared with Figure 3.24

transformation, all except two of the last 21 are poor rural communities in the Apennines. The two that are not are units 310 (Casina) and 70 (Goro), which we discussed at the end of §1.6. The transformation has led to a sharpening of the separation of units; in Figure 3.27, six of the last 16 units to enter were on the plain rather than in the mountains.

4.11 National Track Records for Women


In Chapter 3 we compared the analysis of these data on the original scale with that on the reciprocal scale, that is of speeds rather than times. There was evidence of several kinds that the transformed data were more nearly normal than the original data. One comparison was that of the scatterplot matrices of Figures 3.20 and 3.21. The robust contours were more nearly elliptical for the transformed data. Further, the transformation reduced the number of apparent outliers from 12 to four. We now explore other transformations to see whether even further improvement is possible.

Since all responses are measurements of time, we hope that the same transformation can be applied to all. Figure 4.50 shows forward plots of the likelihood ratio statistic for testing a common transformation for all seven races, which is to be compared with χ²_1. The panel for λ = -1 shows that this common transformation is rejected at the 5% level from m = 41. However, increasingly negative values of λ are supported by the data. The
FIGURE 4.48. Modified Emilia-Romagna data transformed using λ_C2 in Table 4.5: forward plot of scaled Mahalanobis distances. Detail of Figure 4.47

values of -3 and -4 are acceptable throughout the search; for λ = -2 there is a slight peak in the forward plot at m = 45 and there seem to be two clear outliers at the end of the search.

Although we seem to have found a satisfactory model, the suggested values of λ are surprising, lying outside the customary range of minus one to one as did the transformations for y16 and y23 in the wealth variables for the Emilia-Romagna data. In that analysis we were able to remove the egregious values of λ since the variables concerned were percentages and we could replace y by 100 - y. However, here the variables are times, so that no such replacement is possible. The distribution of these times has a fairly sharp lower limit, where the best performances cluster, and then a tail of weaker performances. Taking, as we did in §3.2, the reciprocal of these times, gives speeds, which are variables with a concentration of values at the right-hand end of the distribution, similar to that for y16 in the left-hand panel of Figure 4.37. The value of -3 for λ is then equivalent to raising this variable to the third power.

We now look at transformation of the individual responses, to see how consistent is the evidence of a common transformation. We take λ_R as -3 for each response and calculate the signed square root of the likelihood ratio test for the values 0, -1, -2, -3 and -4 for each λj, for j = 1, ..., 7. Four of the seven fan plots are shown in Figure 4.51. All panels show that -3 is an acceptable value for λj when all the other variables also have this transformation. There is scant evidence of the effect of any outliers on this selection of a transformation. Although, for most panels, the curve
FIGURE 4.49. Municipalities in Emilia-Romagna: the last 21 communities to enter the forward search using transformed data are shaded

for λj = -3 lies near the centre of the plot, that for y3 is near the upper end of the central region; a transformation of -4 for this variable is only just acceptable at the end of the search and is beyond the 1% point of the distribution for some preceding steps. Only for the transformation of y5 is there evidence of change of the score statistic at the end of the search as outliers enter. The conclusion from this figure is again that a common value of -3 is acceptable for all transformation parameters, a conclusion which is hardly affected by any outliers.

We could have started our analysis of transformations of these data by finding maximum likelihood estimates of the individual transformation parameters, as we did, for example, in Figure 4.7 for the babyfood data. For the present data this would require a seven-dimensional optimization at each stage of the forward search. A numerically simpler alternative is to use the score test described in §4.4. The test statistic T_SC is defined in (4.23). We might hope that this will have a chi-squared distribution on seven degrees of freedom. However we shall see that this is not the case at all steps of the search.

Two versions of this statistic are possible. They differ in the estimated covariance matrix Σ̂(λ̂) which can either be calculated from the residuals of independent regressions or iterated from this starting point using SUR regression. For this example the statistics are close in value throughout the search. Figure 4.52 is a forward plot of the values of the non-iterated test
FIGURE 4.50. Track records: forward plots of likelihood ratio statistics for testing a common transformation for all races. Upper two panels, λ = -1 and -2; lower two panels, λ = -3 and -4

for the hypothesis that all λj = -3. It seems as if the statistic is significant, which we do not necessarily expect since the fan plots in Figure 4.51 supported a common value of minus three for all transformations. But the null distributions of score tests for transformations are not always close to the asymptotic null distributions, here chi-squared. We have therefore added to the plot the results of a simulation of 1,000 test statistics. These show that the distribution has longer upper tails than the chi-squared both at the beginning of the search and at the end. This phenomenon has been analysed for the score test Tp(λ0) for transformations of univariate data by Atkinson and Riani (2002b). The longer-tailed distribution at the beginning of the search in the univariate case is due to the statistic having approximately a t distribution, rather than a normal one. The longer tails at the end of the search, which they call a "trumpet effect", are due to the presence of the observations in the constructed variables on which the response is regressed. A similar logic applies here. The simulations show that the score test is not quite significant at the end of the search and that a common value of -3 is indeed appropriate for all responses.
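The simulation envelope used here is conceptually simple; the sketch below (illustrative Python, with `statistic_along_search` a hypothetical user-supplied function that runs the forward search on a data matrix and returns the statistic at each subset size) generates data under the null of multivariate normality and records pointwise quantiles of the simulated statistics.

```python
import numpy as np

def simulation_envelope(statistic_along_search, n, v, n_sim=1000,
                        quantiles=(0.05, 0.5, 0.95, 0.99), seed=0):
    """Pointwise envelope for a forward-search statistic under the null of
    multivariate normality; `statistic_along_search` is assumed to map an
    n x v data matrix to the vector of statistic values over subset sizes."""
    rng = np.random.default_rng(seed)
    sims = np.array([statistic_along_search(rng.standard_normal((n, v)))
                     for _ in range(n_sim)])
    return {q: np.quantile(sims, q, axis=0) for q in quantiles}
```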
We now compare plots of the observations transformed with λ = -3 to those we have already seen for no transformation and for the reciprocal. The forward plot of the scaled Mahalanobis distances in Figure 4.53 shows more regular behaviour than that for the reciprocal transformation in Figure 3.22. The last four observations to enter, from the last, are North
FIGURE 4.51. Track records: fan plots confirming that all elements of λ equal minus three. Only the panels for y2, y3, y5 and y7 are shown

Korea (33), Czechoslovakia (14), Western Samoa (55) and Mauritius (36), the same as for λ = -1 but in a different order. These four observations have clearly large Mahalanobis distances at the end of the search.

The scatterplot matrix of the transformed data with superimposed robust contours, Figure 4.54, is to be compared with Figure 3.20 for the untransformed data and Figure 3.21 for the reciprocals of the data. The improvement in normality from this further transformation is shown both by the increasing ellipticity of the contours and by the proportion of each panel enclosed by the outer contour. The scale of the panels is set to include all the data and the contour should asymptotically include 99% of the data if normality holds.

Finally we look at two plots which focus on the behaviour of the last four units to enter the search. These have been labelled on the scatterplot matrix of Figure 4.55 and highlighted in the parallel coordinate plot of Figure 4.56. In interpreting these plots we are looking at a response which is speed cubed, called z, so that large values are desirable. North Korea (33) is outlying because of its poor performance in the two shortest races compared with relatively good performance in middle distance races. It shows as a bivariate outlier on, for example, the plot of z2 against z3. Czechoslovakia (14) has relatively poor performance in longer races, compared with its world record for the 400m race (z3). It is outlying in the plot of z3 against z5. Western Samoa (55) has a particularly low value of z5, although all its speeds are low. It plots at the bottom left-hand corner of all panels,
FIGURE 4.52. Track records: forward plot of non-iterated score test that all λj = -3. The dotted lines of the simulation envelope show the value is not significant, even though it seems to be when judged by the continuous lines of the asymptotic χ²_7 distribution

particularly so in the plot of z5 against z6. This extreme value of z5 explains the effect in Figure 4.51, where λ̂5 was the parameter estimate most affected by the inclusion of outliers, particularly when m = 53, which is when Western Samoa enters the subset. Finally Mauritius (36) has a performance which decreases with the length of the race: it is most outlying on the plot of z1 against z7.

One conclusion of our analysis is that, although λ = -3 is an unlikely transformation, it is one which applies to all responses, which are all measurements of time. This is very different from the previous example, where variables of different kind required very different transformations. Once we have found a transformation giving approximate normality we see a clear structure of the data with just four outliers, which is much simpler and more informative than an analysis of the untransformed data.

4.12 Dyestuff Data

We now consider a second example in which there is a regression structure in the data. As with the babyfood data, we therefore have to use the method based on residuals to start the forward search which was described in §2.15.2. The data, arising in a study of the manufacture of a dyestuff, are taken from Box and Draper (1987, pp. 114-5). We give them in Table A.7. There are 64 observations at the points of a 2^6 factorial and three
FIGURE 4.53. Track records: forward plot of scaled Mahalanobis distances when all λj = -3. To be compared with Figure 3.22 for the reciprocal transformation

responses - strength, hue and brightness. Box and Draper find that only three of the six explanatory variables have a significant effect on the three responses. Their plots of residuals (p. 123) arguably indicate that y2 should be transformed, but no transformation of either y1 or y3 is suggested by their univariate analyses of each response separately.

We start our multivariate analysis with a forward search on untransformed data. Throughout we use the three variable model (x1, x4 and x6) used by Box and Draper, the evidence for which is not affected by the transformations we consider. The results of our forward search are in the left-hand panel of Figure 4.57 which shows the evolution of the estimates of the transformation parameters. The values of λ̂1 and λ̂3 oscillate around one, which is also the starting value for λ̂2. However, as the search progresses the estimate of λ2 decreases to around one half. We therefore repeat the search with λ2 = 0.5, obtaining the right-hand panel of Figure 4.57; as before, λ̂1 and λ̂3 oscillate around one. But now λ̂2 fluctuates around a half from the start of the search. We seem to have found a satisfactory transformation and take

λ_R = (1, 0.5, 1)^T.

We have already seen a wide variety of forward plots in the analysis of transformations of the earlier examples in this chapter, so do not give further examples here. For example, the forward plot of the likelihood ratio
FIGURE 4.54. Transformed track records with all λj = -3: scatterplot matrix with asymptotic 50% and 99% spline curves. The contours are more elliptical than those shown in Figure 3.21 for the reciprocally transformed data

test for λ_R lies well within the 95% point of the chi-squared distribution on three degrees of freedom throughout the search: there are no outliers, although there is a peak around m = 35, indicating a compromise in the ordering of the units between the three responses. The forward plots of minimum and maximum Mahalanobis distances likewise increase steadily, with no particular peaks to suggest any omitted structure.

One plot we do give is of profile loglikelihoods, together with asymptotic 95% confidence intervals, for the three parameters, when m = 54. The left and right-hand panels of Figure 4.58 indicate that the values of one for λ1 and λ3 are less precisely determined than the value of 0.5 for λ2 in the central panel. Similar information comes from the three panels of fan plots in Figure 4.59, where only 0.5 is acceptable for λ2 in the central panel, whereas several values are acceptable for the other two parameters.
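Such asymptotic intervals can be read off a profile loglikelihood evaluated on a grid: the interval contains the values of the parameter whose loglikelihood lies within half the chi-squared critical value of the maximum. A minimal sketch (illustrative Python, assuming the profile has already been computed) is:

```python
import numpy as np
from scipy.stats import chi2

def profile_interval(grid, loglik, level=0.95):
    """Asymptotic confidence interval for one transformation parameter from
    its profile loglikelihood evaluated over a grid of values."""
    grid, loglik = np.asarray(grid), np.asarray(loglik)
    cutoff = loglik.max() - 0.5 * chi2.ppf(level, df=1)
    inside = grid[loglik >= cutoff]
    return inside.min(), inside.max()
```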
FIGURE 4.55. Transformed track records with all λj = -3: scatterplot matrix with the last four units to enter highlighted and labelled; 33 - DNK, 14 - CZ, 55 - WS, 36 - MA

In our analysis of the babyfood data we found a common transformation for all three variables. Finally we try the same technique here. Figure 4.60 is a fan plot of the expansion of the signed square root of the likelihood ratio test for the five standard parameter values, but now of the common transformation parameter. A common value of one is just rejected at the 1% level, whereas 0.5 is acceptable throughout the search. The other three values for λ are rejected. This simple model is therefore an alternative to the model in which y1 and y3 are not transformed. In the analysis of the babyfood data all responses were measurements of viscosity and so had the same physical units; it was plausible that they should have the same transformation. Here the three responses, strength, hue and brightness, are not measurements of the same quantity; there is therefore no prior reason to expect they will have the same transformation. Our analysis shows that
4.12 Dyestuff Data 213

2 3 4 5 6 7

FIGURE 4.56. Transformed and standardized track records with all AJ = -3:
parallel coordinate plot with the last four units to enter labelled; 33 - DNK, 14
- CZ, 55 - WS , 36 - MA

FIGURE 4.57. Dyestuff data: estimates of transformation parameters from forward searches. Left-hand panel, untransformed data; right-hand panel, λ = (1, 0.5, 1)^T

the choice between these models can be made in the knowledge that the
correctly transformed data do not contain any outlying observations. In
this way our analysis sharpens the information that comes from the residual
plots when m = n shown by Box and Draper (1987, p. 123).
214 4. Multivariate Transformations to Normality

:V
---...._
0 0

1 m=5 4 1 m=5 4 I
y1 m=5 4 y3
"'<0 "' ~~~~~~~~~ :2
<0
~~0~9~
-~0 .~
3 ~0~.2~0~
. 6~1.0~1~.4L1~.8~ ~-0.9 -0.3 0.2 0 .6 1.0 1. 4 1.8

FIGURE 4.58. Dyestuff data: profile loglikelihoods for the three transformation
parameters when m =54. Search with AR= (1 , 0.5, 1)T. The value of >.2 is weil
determined

- . - '\ ,.

... _)-~:::.... --- . . . -......... ".. _ _... .... - 0.5


I - /
r- I
=-V~.)><-~- - ~--

~ L_--~----~----~----~--~
30 40 50 60 30 40 50 60
Subset size m Subset size m

".,-- . . _, . . -- -:.:o--

0 ;;~j:!;;_;~~::.:.-----~-j" 1

30 40 50 60
Subset size m

FIGURE 4.59. Dyestuff data: fan plots for individual transformations confirming
AR= (1, 0.5 , 1)T

4.13 Babyfood Data and Variable Selection


In §4.7 we analysed the logged babyfood data with a linear model containing
all first-order terms and the interaction x 3 x 4 . We now briefl.y indicate what
the forward search can contribute to the selection of this model. We work
with logged responses throughout.
In §2.9 we mentioned that forward plots of the t tests for individual
variables in multiple regression were hard to interpret as the ordering of
the units in the search affected the values of the statistics. The two panels
4.13 Babyfood Data and Variable Selection 215

C>

0
~
0
~
ö '
e ''
I!! "' ''
'
"':> '' '
''
CT

--
(/)

alc
"'
(ij

C>

--, ___, __ .,.,..

20 30 40 50 60
Subset size m

FIGURE 4.60. Dyestuff data: fan plots for a n overall transformation. Only 0.5 is
admissible

C>

"'
C>

"' ''
\V ' ' ',
' · 3
~ ~
'
(/) , ___ _ (/)
u u
~
.... ·····- 1
~ -······ ···· ~ .... 1
C> C>
~ .,!.
--------- -4
~
.,!.
--4

C> C>
"';"

C> C>
~ ~

10 15 20 25 10 15 20 25
Subset size m Subset size m

FIGURE 4.61. Logged babyfood data: forward plotsoft statistics for the terms
of the linear model. Left-hand panel Yl, right-hand panel Y2· The values of the
statistics shrink as the estimate of a 2 increases with m

of Figure 4.61 illustrate the point. The overall impression is that the t
values start large and decrease. The left-hand panel of the plot, for Yl,
shows that, at the end of the search, the significant variables, in order, are
x3, xs, X3X4 and xz. For Yz the right-hand panel shows that the order of
variables changes slightly to x 3 , x 5 , x 2 and then the interaction x 3 x4. A
216 4. Multivariate Transformations to Normality

l()

l3 0
~ ~

1 \
\
\
c:
0
l()
~
Ci)
0

10 15 20 25
Subset size m

FIGURE 4.62. Logged babyfood data: forward plot of added variable t statistics
for the terms of the linear model for YI. Six separate forward searches were needed
to produce this plot

puzzling feature of these plots is the behaviour of the statistic for X2: for Yl
it starts large and positive, but finishes significantly negative, whereas, for
y 2 it is negative throughout. lt is hard to obtain much information on the
effect of individual observations on the values of this and other t statistics.
We now contrast these plots with those produced by the use of added
variables when we delete each x in turn. The method was described in §2.9.
The plot for y 1 is in Figure 4.62. The significant variables in these plots
now, as we would expect, have t statistics which diverge steadily from zero
as the sample size increases. Of course, the curve for each t statistic finishes
with the same value as in Figure 4.61. But now we can see that only the
last few observations make x 2 and the X3X 4 interaction significant; there
is nothing eise noteworthy about the behaviour of x 2 , which initially is
non-significant. That the curve for x 3 starts late in the plot is a reminder
that we have a separate search for each added variable. The design points
chosen by the search using the other variables were singular when x 3 was
included until m = 18.
Figure 4.63 for y 2 is similar, although now the significance of both x 4 and
the X3X4 interaction change towards the end of the search. The absence of
sharp jumps in these plots shows that no individual observation is having
a large effect on significance of the variables. Forward plots of residuals for
the individual variables provide a useful supplement to the Mahalanobis
4.13 Babyfood Data and Variable Selection 217

1.()

rn
'-' 0
~
~
.,!.
c:
0

---------
1.() __ /_."...--5
~
Q)
Cl

10 15 20 25
Subset size m

FIGURE 4.63. Logged babyfood data: forward plot of added variable t statistics
for the terms of the linear model for y2

distances we have already seen for discovering outlying individual compo-


nents of y.
The relatively smooth shape of these plots suggest that no individual
observations are having a large effect on the choice of model. To check this
we look at forward pots of the added variable t tests for the model with all
first order terms as well as all two-factor interactions. Of course, with 16
parameters and only 27 observations, the data will tend to be well fitted.
Even so, we do not expect there to be great difficulty in selecting variables
since, at the end of the search, the design is orthogonal; the presence of one
variable will not affect the parameter estimates for the others. However, in
the earlier stages of the search, the absence of some observations will mean
that the exact orthogonality of the design is destroyed.
The forward plot for y 1 is in the left-hand panel of Figure 4.64. As we
would hope, it is similar to Figure 4.62 with the addition of many trajec-
tories which are never significant. A slight effect of the extra nine variables
in the model is to reduce the significance of some terms such as x 3 x 4 in
the earlier stages of the search. The plot for y 2 in the right-hand panel can
likewise be compared with Figure 4.63. Now the significance of both the
interaction and of x2 are reduced in the earlier stages. However, none of
the new variables gives any indication of being significant at any point in
the search.
218 4. Multivariate Transformations to Normality

LO
3

U)
u ~
./!/'
~
1ii
u;
..!.
c:
0
LO
__ ,
c:
0
\ ----, ___ ,
/
-/
// 5

~ ~
a; a; 0
Cl Cl
0

"?
18 20 22 24 26 28 18 20 22 24 26 28
Subset size m Subset size m

FIGURE 4.64. Logged babyfood data: forward plots of added variable t statistics
for the terms of the 16-term linear model including all two factor interactions.
Left-hand panel YI, right-hand panel Y2· Few of the variables are significant

The forward plots of the added variable t tests for this example show
that there is no evidence of masking or other difficulties in model selection.
Our recommendation for a model would be to drop x 1 from the model
used in this section, since it is not significant, but to keep x 4 because of its
presence in the x 3 x 4 interaction. A more complicated example in which the
forward added variable plots reveal masking by outliers of the importance
of explanatory variables is given by Atkinson and Riani (2002a).

4.14 Suggestions for Further Reading


The parametric family of power transformations introduced by Box and
Cox (1964) was extended to multivariate data by Andrews, Gnanadesikan,
and Warner (1971) and by Gnanadesikan (1977). Velilla (1993) compares
marginal and joint transformations and gives further references to related
work. In these papers concern is with likelihood analysis of the data, that is
with a procedure using statistics aggregated over all observations. Methods
for detecting the influence of individual observations on transformations
of univariate data, particularly in regression, are described by Atkinson
(1985) who uses the deletion of individual observations to determine ef-
fects on parameter estimation and test statistics. Deletion procedures and
score tests for multivariate data are developed in Atkinson (1995). Chapter
4 of Atkinson and Riani (2000) applies the forward search to univariate
transformations, particularly of the response in regression. Atkinson and
Riani (2002b) show the dependence of the distribution of the score statistic
Tp(.\o) of §4.4 on the multiple correlation coefficient for the regression.
Our analysis of the data on horse mussels in §4.9 showed that a trans-
formation was necessary and that there are two outlying cases, 8 and 48.
4.14 Suggestions for Further Reading 219

Analysis of these data using regression is set as an exercise by Cook and


Weisberg (1999 , p. 351), who suggest first removing these two units, la-
belled 7 and 4 7 in their Lisp system which starts counting at zero.
220 4. Multivariate Transformations to Normality

TABLE 4.6. Swiss bank notes: minimum, maximum and ratio of maximum to
minimum value for each variable and each group of notes

Genuine notes
Minimum Maximum Ratio
Y1 213.8 215.9 1.010
Y2 129.0 131.0 1.016
Y3 129.0 131.1 1.016
Y4 7.2 10.4 1.444
Y5 7.7 11.7 1.520
Y6 139.6 142.4 1.020
Forged notes
Minimum Maximum Ratio
Y1 213.9 216.3 l.Oll
Y2 129.6 130.8 1.009
Y3 129.3 131.1 1.014
Y4 7.4 12.7 1.716
Y5 9.1 12.3 1.352
Y6 137.8 140.6 1.020

4.15 Exercises
Exercise 4.1 Derive the expression for w(.\) , equation (4.17).
Exercise 4.2 Swiss heads. Do you expect that units 104 and 111 have a
stronger or weaker effect on the univariate forward test for transformation
of variable y 4 on its own than on the multivariate test? Discuss the power of
univariate or multivariate tests in the presence of univariate or multivariate
outliers.
Exercise 4.3 Swiss bank notes. a) Table 4.6 gives the minimum, maxi-
mum and ratio of the maximum to minimum for each variable and each
group of notes. A priori, what can you say about the need for transforming
the data? b) Does the group of genuine not es ( observations 1-100) need
transformation? Try a common transformation for all variables. c) Would
you expect the evidence for transformation to increase or decrease when
the two groups of notes are considered together? What do you expect from
the plot monitaring the likelihood ratio test for the common transformation
.\ = 1 using: I) an unconstrained search; II) a search which starts in the
group of genuine notes, and III) a search which starts in the group of forged
notes?
Exercise 4.4 The analysis of the data on national track records for women
in §4.11 showed that evidence for the common transformation .\ = -3 is
spread throughout the data. Describe what you think will be the shape of the
4.16 Salutions 221

Jorward plot oJ the common transJormation parameter ( a) when the search


uses reciprocal data (>.. = -1) and (b) when the search uses >.. = -3?

Exercise 4.5 The last Jour observations to enter the Jorward search Jor the
transJormed data on national track records Jor women are, Jrom the last,
North Korea (33}, Czechoslovakia (14), Western Samoa (55) and Mauritius
(36). Their proflies are shown in the parallel coordinate plot, Figure 4.56.
What effect do you expect the introduction oJ these Jour units to have on
the Jorward plots oJ ( a) the maximum Mahalanobis distance among those
in the subset and (b) the minimum distance among those not in the subset?
You should consider the possibility oJ masking.

Exercise 4.6 Emilia-Romagna data. Figure 4.32 shows the profile loglike-
lihoods at m = 331 Jor the 9 demographic variables. In this step the maxi-
mum likelihood estimate oJ the transJormation parameter Jor variables Y10
and y 11 is very close to zero. From this figure, what are your expectations
about the Jan plots oJ the expansion oJ the signed square root oJ the likelihood
ratio test Jor >..R2 Jor variables y 10 and y 11 araund >.. = (-0.5, 0, 0.25, 0.5)T?

Exercise 4. 7 Dyestuff data. Using the library oJ routines Jor the Jorward
search mentioned in the PreJace, plot the Jan plots Jor the three responses
separately. What are your conclusions?
The left and right-hand panels oJ Figure 4.58 indicated that the values oJ
one Jor >..1 and A3 were less precisely determined than the value oJ 0. 5 Jor
>.. 2. What do you expect Jrom the likelihood ratio tests Jor the null hypothe-
ses >.. = (1, 1, 1)T against the unrestricted alternative (>.. 1 , >..2, A.3)T and the
restricted alternative >.. = (1, >.. 2, 1)T?

4.16 Salutions

Exercise 4 .1

>..y>-- 1y>-logy- (y>-- 1 + >..y>-- 1 logy)(y>-- 1)


w(>..) = d:~>..) (>..y>--1 )2
y>- log y y>- - 1 .
>..y>-- 1 - >..y>-_ 1 (1/A.+logy). (4.27)

Exercise 4.2
Observations 104 and 111 are basically univariate outliers, not multivariate
atypical observations. We expect that marginal univariate tests are more
powerful when there are univariate outliers. Figure 4.65 shows the plot of
222 4. Multivariate Transformations to Normality

I'
I' 0.
I
I
I 1

60 80 100 120 140 160 180 200


Subset size m
FIGURE 4.65 . Swiss heads data: forward plot of the signed square root of the
likelihood ratio test when Y4 is considered on its own. Compare with Figure 4.16

the signed square root likelihood ratio test when Y4 is considered on its own.
If we compare this figure with Figure 4.16 we see that the jump caused by
the inclusion of units 104 and 111 is much stronger in Figure 4.65 than
in Figure 4.16. As we expected, univariate tests are more appropriate in
presence of univariate outliers.

Exercise 4.3
a) Table 4.6 shows that for Yb Y2, Y3 and Y6 the values of the ratio in
both groups are very close to one. This implies that we expect to see very
flat profile likelihoods for transformation parameters and that these vari-
ables will not need any transformation. Also, because of the flatness in the
likelihood, we expect that while the most remote units may change the
maximum likelihood estimate of the transformation parameter, they will
not cause significant alterations in the value of the likelihood ratio test. For
variables y 4 and y 5 , the ratio is around 1.5, which is still small, so we also
expect that these variables will not need transformation.
b) Figure 4.66 is a forward plot of 5. when testing the common transfor-
mation .X = 1 for the group of genuine notes. It shows values near one for
the last 20 steps of the search. Figure 4.67, the forward plot of the likeli-
hood ratio test statistic for the hypothesis Ho : .X = 1, has small values,
when compared with xi, confirming that one is an acceptable value. But a
wide range of values is also acceptable for .X. The two panels of Figure 4.68
are twice the profile loglikelihoods for .X at m = 90 and m = 100. These
4.16 Solutions 223

"'
"'
"0
.c
E
.l\1
0
~
E
~
<I>
~ 0

1;i
:;

'7

75 80 85 90 95 100
Subset size m

FIGURE 4.66. Genuine Swiss bank notes: forward plot of the common maximum
likelihood estimate of >. when testing >. = 1

(i)
2 "'
0
~
.:.!.
:::J
"'

75 80 85 90 95 100
Subset size m

FIGURE 4.67. Genuine Swiss bank notes: forward plot of the likelihood ratio test
statistic for the hypothesis Ho : >. = 1, tobe compared with xi. The figure shows
that one is an acceptable value
224 4. Multivariate Transformations to Normality

v---- ---
m=90 m=100

0 0
C\1
:;:: ~

"'
.Q
0 0

~"'
CO CO
0
~
a..

0
g ....
0

-3 -2 -1 0 2 3 -3 -2 -1 0 2 3
Iambda Iambda

FIGURE 4.68. Genuine Swiss bank notes: profile loglikelihood for transformation
parameter when testing >. = 1 when m = n - 10 = 90 (left panel) and m =
n = 100 (right panel). The outer verticallines define the 95% confidence interval.
The central vertical line gives the maximum likelihood estimate of >.

virtually parabolic curves are, from Figure 4.66, centred near one. But the
95% confidence intervals, which are the points where the curves have de-
creased by 3.84/2, are ( -0.29, 1.96) at m = 90 and (0.24, 1.88) at m = 100.
They cover a wide range of values of .X. Not only is the transformationnot
well defined, but there is no evidence of any effect of the last observations
on this inference. So the outliers detected in §3.4 do not affect this trans-
formation and there is no reason not to analyse the data in the original
scale.
c) The evidence for transformation will increase when the two groups of
notes are considered jointly, because the values of the ratio of the maximum
to the minimum foreachvariable will generally increase (see Table 4.6).
N ow consider the plot monitoring the likelihood ratio test for the common
transformation A = 1. If we start our search with units coming from both
groups (unconstrained search) we expect many ßuctuations throughout the
search. If we start in the group of genuine notes or in the group of forgeries,
we expect to see a jump in the values of the likelihood ratio after step
m = 100. Finally, irrespective of the starting point used in the search, we
expect that the final part of the plot will be the same. Since the ratios of
maximum to minimum are not high for each variable, we do not expect
significant values of the likelihood ratio test in the final part of the search.
Figure 4.69, which gives the likelihood ratio test for the hypothesis of the
common transformation (.X= 1) using an unconstrained search (left panel),
a search starting in the group of genuine notes (centre panel) and a search
starting in the group of forgeries (right panel), shows all these aspects.
4.16 Solutions 225

<0

5'!- 100

Subset size m Subset size m Subset size m

FIGURE 4.69 . Swiss bank notes: forward likelihood ratio test for the hypothesis
of a common transformation ..\ = 1. The left panel gives an unconstrained search
starting with units belonging to both groups. The centre panel gives a search
starting with the first 20 genuine notes and the right panel a search starting with
the first 20 forged notes

~
'7

0
"'
"0
.0 ~
E
.!!!
0
UJ
--'
:::;:
0
<?

0
"1
20 30 40 50 20 30 40 50
Subset size m Subset size m
FIGURE 4. 70. Track records: forward plots of maximum likelihood estimates of
..\ when testing ..\ = -1 (left panel) and ..\ = -3 (right panel)

Exercise 4.4
(a) When the search uses reciprocal data although the evidence for the
common transformation A = -3 is spread throughout the data, the forward
plot of the maximum likelihood estimate of >. will lie around -1 in the
central part of the search and then trend downwards steadily towards -3.
In a search using a value of A close to the true value we expect to see
fiuctuations around the true value throughout the search.
The two searches for A = -1 and A = -3 are given in Figure 4.70.
The left-hand panel, in agrement with our expectation, shows the estimate
declining steadily to -3. Initially those observations are included which
most closely agree with the reciprocal transformation, although even then
the highest value is a little less than -1. The right-hand panel, as expected,
shows fiuctuations around -3 throughout.
226 4. Multivariate Transformations to Normality

0
Lri

1.0
.j <0

0 0
::;: ::;:
E 0 E
::> ::> 1.0
E .j
"1;i ·c:E
::;: ~
1.0
"; ...
0
";

20 30 40 50 20 30 40 50
Subset size m Subset size m
FIGURE 4.71. Transformed track record data: forward plots of maximum (left
panel) and minimum (right panel) Mahalanobis distances. There is no indication
of masking at the end of the search. This implies that the profiles of the y values
for the 4 units which enter the search in the last four steps should be dissimilar

Exercise 4.5
Figure 4.56 showed that the four profiles are dissimilar and so we do not
expect any masking. In other words, we expect that the plots of minimum
and maximum Mahalanobis distances of the transformed data will show a
constant increase rather than a peak and a sudden decrease due to mask-
ing in the final four steps. The two panels of Figure 4.71 which show the
monitoring of maximum and minimum distances confirm our expectations.

Exercise 4.6
Figure 4.32 shows that the profile likelihood surface for y 10 is sharply
peaked around 0 and the value 0.25 seems to be outside the confidence
interval, so we expect that in the expansion 0 is the only one acceptable
value. The profile likelihood for y 11 is much less sharply peaked and the
confidence interval covers 0.25, so we expect in the expansion to see that
both 0 and 0.25 are equally plausible throughout the search. Figure 4. 72
confirms our expectations.

Exercise 4. 7
The three resulting fan plots of the score test are gathered together in
Figure 4.73. For the first response y 1 , strength, there is a jump in four
out of five score statistics at the end of the search, due to the inclusion of
observation 1, the smallest observation. The effect is largest on the recipro-
cal transformation and negligible on the acceptable >. values of 1 and 0.5.
The conclusion isthat Yl does not require transformation. For y 2 , hue, the
structure of the plot is similar, except that the square root transformation
is indicated. The large increases in the statistics for >. = 0, -0.5 and -1
4.16 Solutions 227

~ -0.

0
>-0
0
>- 0
-, ••• ----------··- .·, 0.2
\...---... ,., --...,__
, __ ...... -'- ....
'-J\ 0 .

200 250 300 350 200 250 300 350


Subset size m Subset size m

FIGURE 4.72. Municipalities in Emilia-Romagna: forward plots of the signed


square root likelihood ratio test expansion around >. = ( -0.5, 0, 0.25, 0.5)T for
variables Yw and yu

at the end of the search are caused by inclusion of the two smallest obser-
vations. Only >. = 0 .5 is acceptable throughout the search. The plots for
brightness, y3, are devoid of sudden jumps, all observations indicating no
need for transformation.
Given that all evidence for transformation seems to be due to y 2 we ex-
pect the forward curve associated with the likelihood ratio test for the null
hypothesis >. = ( 1, 1, 1) T against the unrestricted alternative ( )q, >. 2 , >. 3 ) T
to be very close to the curve which has the restricted alternative >. =
(1 , A2 , l)T.
Figure 4. 74 shows forward plots of the two likelihood ratio tests from
searches on untransformed data. The upper curve is the likelihood ratio
for testing >. = (1 , 1, 1) , against an unrestricted alternative. The plot also
shows the 95% and 99% points of the asymptotic x~ distribution: the hy-
pothesis of no transformation is clearly rejected. Since the search is on
untransformed data, the initial part of the search includes observations
which support the null hypothesis. The lower curve is again for testing the
hypothesis of no transformation, but with the alternative >. = (1 , >. 2 , 1), so
that only transformation of y 2 is considered. The two tests are virtually
indistinguishable, showing that all the evidence for transformation is in y 2 .
The general shape of the curves shows that this conclusion does not depend
on one or a few observations.
228 4. Multivariate Transformations to Normality

-1
0 ~
~ -0.
~
u;
2
"' 0

w
o"'
~ ~
~ ~==============================================~==~
____J /
.4

~U")~~~~~~~~~--~-----~----~----~/~~-0
(J)

u;>
--~ -----.=---:.:-- ...-- O.E

u;> L __ _ _ --_--_-~-~~-·
-·----------------------------------------~
30 40 50 60

FIGURE 4.73. Dyestuff data. Fan plots of score statistics Tp(Ao) for marginal
power transformation of each response. Toppanel y1, bottom panel Y3 · Individual
searches for each ,\. Only Y2 needs transforming

0
C\1

U")
~

u;
2
0
~
8
,5
~

a;
-"'
5
U")

10 20 30 40 50 60
Subset size m

FIGURE 4.74. Dyestuff data. Forwardplots of likelihood ratio tests for the null
hypothesis ,\ = (1, 1, 1) agairrst (continuous line) the unrestricted alternative
(-\1, -\2, -\3) and (dotted line) the restricted alternative ,\ = (1, ,\ 2, 1) . The hori-
zontal lines are the 95% and the 99% points of the :d distribution. All evidence
for a transformation is provided by Y2
5
Principal Components Analysis

5.1 Background
Principal components analysis is a way of reducing the number of variables
in the model. It may be that some of the variables are highly correlated
with each other, so that not all are needed for a description of the subject
of study; perhaps a few linear combinations of the variables would suffice.
Other variables may be unrelated to any features of interest. The data on
communities in Emilia-Romagna offer many such possibilities. In Chapter 4
we arbitra rily divided the variables into three groups. But do we need all
the nine demographic variables in order to describe the variation in the
communities or would a few variables suffice, or a few combinations of
variables? Then the other variables would be contributing nothing but noise
to the measurements.
If we are dealing with normally distributed random variables, any linear
combinations that we take will themselves be normally distributed. One
consequence of the transformations to normality of the previous chapter
is that, once the data have been transformed to approximate normality
we can use the methods of principal components analysis to reduce the
dimension of the problern if this is possible.
Principal component analysis has also sometimes been suggested as a
method of outlier detection. However, if there are outliers or unidentified
subsets, these may infl.uence the estimation of the principal components,
which are functions solely of the estimates of the mean and covariance
matrix of the data. What is needed for outlier detection is a form of prin-
230 5. Principal Components Analysis

cipal components analysis in which the major components are unaffected


by anomalaus observations. However, the minor components may respond
strongly to any outliers.
We introduce the algebra of principal components in the next section. In
§5.3 we derive some quantities which it is useful to monitor in the forward
search. In the last section of theory, §5.4, we describe the biplot, which can
usefully be monitared during the forward search.
The analysis of examples begins in §5.5 with a continuation of the analy-
sis of the data on Swiss heads in which there is no evidence of the effect of
any individual observations on the principal components. However, the milk
data analysed in §5.6 does show how the components respond to groups
of observations which are not evident from other forward plots. In §5. 7 we
analyse six measurements on the quality of life in 103 provinces of Italy.
Our analysis shows the importance of transformation to approximate nor-
mality in improving the effectiveness of a principal components analysis in
capturing the structure in just a few components.
The examples considered up to that point consist of a sample from a
single population, perhaps with a few outliers. We use the data on Swiss
bank notes in §5.8 to see how principal components analysis responds to
the presence of more than one group. For the final example we return in
§5.9 to the data on municipalities in Emilia-Romagna in which there are
28 responses. Principal components analysis enables us to answer many
questions about the structure of the data by looking at just two compo-
nents. The forward search makes it possible, as in Figure 5.39, to see this
structure clearly, free of the masking effects of multiple outliers.

5.2 Principal Components and Eigenvectors


5. 2.1 Linear Transformations and Principal Components
First consider a linear transformation aTYi of the ith observation Yi , where
a is a v x 1 vector of constants. If Yi has a v-variate normal distribution
with mean J-l and variance E,

(5.1)

a univariate normal random variable. The n x 1 vector Y a is then a sample


from this univariate distribution if Y is a sample from the multivariate
normal distribution. Departures from multivariate normality of the distri-
bution of Y may be reflected in the univariate distribution of Y a.
Ifthe rank ofY is v, we can find v linearly independent combinations Y A,
where A is a v x v full rank matrix of constants. In principal components
analysis we find a set of v linear combinations of the observations which
are orthogonal and normalized, that is, for each component, aJ a
1 = 1
5.2 Principal Components and Eigenvectors 231

and aJ ak = 0, j -j. k = 1, ... , v. These combinations are determined by


the covariance matrix of the observations. We first assume that both the
mean f.l and the variance L: are known and define the population version of
the principal components. We then use parameter estimates to define the
sample version of the components.
From the spectral representation of symmetric matrices we can write the
covariance matrix as
(5.2)
where r is a v x v orthonormal matrix and A, also v x v, is a diagonal
matrix of the eigenvalues )q ::;=: ,\ 2 ::;=: · · · ::;=: Av of L:. It is unfortunate that
the standard notation for eigenvalues is the same as that for the Box and
Cox transformation parameter of this and other chapters. We trust that
confusion will be minimised by staying with this standard notation, relying
on the context for the appropriate definition. The population principal
components of the data are formed by the centred rotation

Z = (Y- )f.LT)r, (5.3)


where J is the n x 1 vector of ones.
To find the sample principal components we again remove the effect of the
mean, now estimated, and work with linear combinations of the residuals
defined in (2.12) as

E Y-Y
y- Jfj?
Y- JJTYjn
(I- H)Y =GY= Y, (5.4)
where H = J JT j n. The spectral representation of the estimated covariance
matrix is written as
T
"Eu= GLG , (5.5)
A

where G is again a v x v orthonormal matrix

G = (gl, · · .gj, · · · ,gv),


with L, also v x v, a diagonal matrix of the eigenvalues h ::;=: l2 ::;=: · · · ::;=: lv
of tu. The v x 1 vector gj is the standardized eigenvector corresponding
to the eigenvalue lj.
The required rotation is then

Z = (I- H)YG = YG. (5.6)


The columns of the n x v matrix Z are the principal components of the
data and are uncorrelated linear combinations of the residuals. The jth
principal component is
Zj =(I- H)Ygj.
232 5. Principal Components Analysis

These linear combinations of the observations lie along the eigenvectors of


the data, that is along the axes of the ellipsoidal contours of the fitted mul-
tivariate normal distribution (Exercise 5.4). The first principal component
is the standardized linear combination of the residuals with maximum vari-
ance: from the geometrical point of view, it is the direction about which
the orthogonal sum of squares of the residuals is minimized. Successive
principal components cause a maximum reduction (Exercise 5.7) in this
orthogonal sum of squares.
If the maximum likelihood estimator t is used instead of tu to estimate
:E, the same principal components are obtained, but the eigenvalues l are
all multiplied by (n- 1)/n.

5.2.2 Lack of Scale Invariance and Standardized Variables


The principal components defined above appear to provide a powerful tool
for data analysis. There are however some snags. One is that, unfortu-
nately, the principal components are not invariant to changes of scale in
the columns of Y. If just the jth column of Y is multiplied by dj > 1, the
contribution of Ycj will increase in the higher order eigenvectors. Division
of the jth column of the matrix of eigenvectors by dj will not recover the
eigenvectors for Y.
More formally, suppose that Y is multiplied by D, a diagonal matrix
with diagonal elements dj, with at least one element different from the
others, so that the relative scaling of the columns of the data is changed.
The covariance matrix of the new variables is DtuD. However, if z is an
eigenvector of tu, D- 1 z is not an eigenvector of DtuD.
Two different approaches are available to overcome the resultant lack
of uniqueness of the eigenvectors and so of the principal components. If
all columns of Y have similar and physically meaningful units, these can
be used in the analysis: percentages of the components in some mixtures
are an example. In the absence of such natural meanings to the variables,
they can all be scaled to have unit variance, when the estimated covariance
matrix tu is replaced by the correlation matrix R. This is the method used
in the examples in this chapter.

5. 2.3 The Number of Components


In regression there is a clear distinction in the model between the deter-
ministic structure and an additive error term. lt is therefore possible to
test hypotheses about the structure of the model, particulary the inclusion
or exclusion of terms in the deterministic part. But the model underly-
ing principal components analysis is that of multivariate normality, with
no division between deterministic and random parts. The model does not
indicate a way of testing how many components to include: the more com-
5.3 Monitaring the Forward Search 233

ponents are included, the greater the percentage of total variation in the
data that is explained.
There is usually also no obvious hypothesis that can be tested about the
values of the population eigenvalues >. 1 , ... , >-v . In particular, to test that
>-v = 0, is to test that the data lie in a v - 1 dimensional subspace, which
is a rather special structure. If the principal components are effectively
explaining the general structure of relationships between the variables, the
fitted multivariate normal distribution will have contours that are sensibly
ellipsoidal. For those variables which are independent, but with differing
variances, the principal components will be the variables themselves, maybe
with small contributions from other variables arising from sampling error.
But, if the independent variables have similar variances, perhaps due to
scaling, the contours will be roughly spherical and the components will
be poorly defined, with each variable explaining a similar amount of the
total variation. Under such conditions, the principal components are not
achieving any simplification in the structure of the data. It has therefore
been suggested (for example by Mardia, Kent, and Bibby 1979, p. 235 and
by Flury 1997, p. 622) to test for sphericity, that is for equality of the last k
eigenvalues. Even if the test is not significant, so that the last k eigenvalues
can all be taken as being the same, this does not mean that the variables
concerned can automatically be dropped from the analysis. Failure to reject
the hypothesis of equality only indicates that no further structure can b e
extracted by the use of principal components.

5.3 Monitaring the Forward Search


To use the forward search to extract diagnostic information from a principal
components analysis, we run the search as in earlier chapters, performing
a principal components analysis on the included units for each m . We then
generate forward plots of the quantities customarily calculated when all the
data are fitted, that is when m = n. In this section we describe the most
informative of these quantities.

5. 3.1 Principal Components and Variances


An important quantity is the amount of the total variance of the observa-
tions explained by the successive principal components. By construction,
the mean of the components is zero, so their variance-covariance matrix is

zTzj(n -1) = GTYT(I - Hf(! - H )YGj(n -1) (5.7)


= GTYT(I- H)YG/(n- 1) = cT~uG = GTGLGTG = L. (5.8)
The columns of Z = (zc1 , . . . , zcv) are seen to be uncorrelated with the
variance of zci equal to lj. The sum of the variances of the first r principal
234 5. Principal Components Analysis

components is h + · · · + lr and the total variance of all components is

Lli =
V

tr f:u. (5.9)
i=l

It is therefore natural, in the context of principal components, to use the


trace of f:u, rather than its determinant, as a measure of the total variance
of the data. If the principal components are calculated using f: rather than
f:u, the sum of the eigenvalues is tr f:. Further, if, as in our book, the
variables are each scaled to have unit variance, the sum of the eigenvalues
is tr R = v, where R is the estimated correlation matrix.
In the analysis of data it is hoped that a few components will describe
most of the variance in the data. From (5.9) the proportion of the total
variation explained by the first r principal components, r < v, is

(5.10)

or, if the data are scaled,

(h + · · · + lr)/ tr R = (h + · · · + lr)fv. (5.11)


There are two equivalent ways of plotting this information. We find the
more informative is a forward plot of the variance explained by each com-
ponent, that is a plot of the v values lrfv. Alternatively, for some values
of m, we can look at a "scree" plot of the individual lr against r. If a few
principal components explain much of the structure of the data, the scree
plot will show an elbow, with a few large, but decreasing values, standing
out from the remaining small values.
A guide to the interpretation of scree plots comes from the empirical
finding in many fields, especially in the analysis of social science data, that
successive principal components explain half of the variance explained by
the preceding component. Thus, approximately,

(j=1, ... ,v). (5.12)

Some numerical values are given in Table 5.1. Our hope isthat the variance
explained by the first few components will be greater than those in the
table.

5.3.2 Principal Component Scores


The principal components are a new set of axes for the data Y. For the ith
observation the projection on the jth component is

(5.13)
5.3 Monitoring the Forward Search 235

TABLE 5.1. Halving rule for percentage of variance explained by successive prin-
cipal components

Component
Number Dimension of Data v
r 2 3 4 5 6 7 8
1 66.67 57.14 53.33 51.61 50.79 50.39 50.20
2 33.33 28.57 26.67 25.81 25.40 25.20 25.10
3 14.29 13.33 12.90 12.70 12.60 12.55
4 6.67 6.45 6.35 6.30 6.27
5 3.23 3.17 3.15 3.14
6 1.59 1.57 1.57
7 0.79 0.78
8 0.39

the score for the ith observation on the jth component. If the original
data are normally distributed, the transformed data and the scores of the
observations on each principal component will also be normally distributed.
Forward plots of the scores for units once they have joined the subset are
informative about the presence and effect of clusters and outliers.

5.3.3 Gorrelations Between Variables and Principal


Components
It is helpful in interpretation of the principal components if they consist of
simple contrasts in a few variables. The correlation between the components
and the individual columns of Y indicates the contribution of the variables
to each component and so may help in formulating this interpretation. A
forward plot of these correlations shows how the fitted multivariate normal
distribution rotates, if at all, as the search progresses.
From the definition of the principal components in (5.3) the covariance
matrix of the components and observations is given by
E{YT (Y - J J.LT)r}
E{(Y- JJ.LT)T(Y- JJ.LT)r}
:Er= rArrr = rA. (5.14)

The covariance between column j of Y (ycj) and column r of Z (zcJ is


therefore /jrAr. But

var Ycj = <TI and var zcr = Ar.


So the correlation between column j of Y and column r of Z is

Pycj ,zcr = /jrAr /-/(<TI Ar) = /jr( )Ar/ <Tj ). (5.15)


236 5. Principal Components Analysis

If the variables are standardized, we agairr replace ~ by the correlation


matrix R , when O"j = 1 and

(5.16)

estimated as
(5.17)

5. 3.4 Elements of the Eigenvectors


Instead of the correlations between the columns of Y and Z we can look di-
rectly at the elements of the estimated eigenvectors. For the rth eigenvector
we would look at a forward plot of the v elements gjr· If the observations
are not scaled, such a plot may provide rather different information from
the plot of the correlations between variables and principal components
derived in the preceding section. However, when the variables are scaled,
(5.17) shows that the correlations are just the elements of the eigenvector
multiplied by the square root of the estimated eigenvalue lr. So, for each
value of m, the forward plot of the correlations with component r is a
scaled version of the forward plot of the elements of the rth eigenvector.
But the scaling may change during the search as the values of the estimate
lr change.

5.4 The Biplot and the Singular Value


Decomposition
As weH as forward plots we also see how some standard plots change during
the forward search. One instance is the biplot, which we display for selected
values of m.
A biplot is a graphical representation of the information in an n x v data
matrix. The bi refers to the two kinds of information contained in such
a matrix. The information in the rows is for sampling units and that in
the columns for variables. A direct approach to obtaining a biplot starts
from the singular value decomposition which expresses the n x v matrix of
residuals Y, given in (5.4), as the product of three matrices:

(5.18)

where r is the rank of the matrix Y, L 1 12 = diag(li/ 2 , ... , l~/ 2 ) is the di-
agonal matrix containing the square root of the r non zero eigenvalues (in
non-descending order) of matrices (YTY)j(n- 1) =tu or (YYT) j (n- 1).
U = ( u1, . . . , Ur) is the orthorrormal matrix containing the r normalized
eigenvectors of yyT' (ur u = Ir). G = (gl' . .. ' gr ) is the orthorrormal
5.4 The Biplot and the Singular Value Decomposition 237

matrix containing the r normalized eigenvectors of yry = (n - 1)f:u


(GTG =Ir)· Usually r = v. Note that since the eigenvectors of a generic
matrixAarealso the eigenvectors of kA (k -=1- 0), U and Gin equation (5.18)
could have been defined as the matrices containing the standardized eigen-
vectors of yyT j(n- 1) and yry j(n- 1) = f:u.
The importance of the singular value decomposition for principal com-
ponents analysis is twofold: firstly, it provides a computationally efficient
method of finding principal components. Secondly, as a bonus, we get in
U standardized versions of the principal component scores. To see this,
multiply equation (5.18) on the right by G and divide by Vn=!. We
see that UL 1 12 = YG/Vn=!. Therefore the matrix UL 1 12 contains in
the columns the principal component scores divided by ,.;n=l. Similarly,
U = YGL- 1 12 /Vn=! contains the principal component scores standard-
ized to have variance 1 j (n - 1).
Equation (5.18) can also be written as the sum of r matrices each of rank
one
r

Ynxv = vn=IL-zJ 12 uig[, (5.19)


i=1

where ui and 9i are respectively the n x 1 and v x 1 vectors associated with


the ith column of U and G respectively. The bestrank 2 approximation to
Y is obtained by replacing L 112 with diag(zi1 2 , t;l 2 ,0, ... , 0), that is taking
just the first two elements in the sum defined in equation (5.19). This result,
known as the Eckart-Young theorem, is proved in Exercise 5.7. So,

2
Ynxv ~
i Ui9ir
vn=IL-z1/2

)(
i=1

)
11;2 0 g'[
~ vn=-1 ( u1 U2 ) (
1
121;2 g'f
0
~ vn=!U(2)Lg)2 G(z)· (5.20)

Equation (5.20) can be written as the product of two matrices A and BT


defined as
(1-a)/2
A = (n- 1)(1-a)/2 ( u1 u2 ) ( Z
01 0 )
[2
_ ( _ 1)(1-a)/2u L(1-a)/2
- n (2) (2)

(5.21)

(n - 1)a/2 ( l01 0 ) <>/ 2 ( 9T1 ) _ a/2 ~;2 T


BT --
[2 g'f - (n - 1) L~(2) G (2)' (5.22)

where usually 0 :::; a :::; 1 and 0 :::; a :::; 1. The biplot consists of plotting
the n + v two-dimensional vectors which form the rows of A and B. In the
238 5. Principal Components Analysis

biplot each row of the n x 2 data matrix

a[ )
af
A= (
a'f:
is represented by a point. The jth row (j 1, . .. , v) of the v x 2 data
matrix

bf )
b[
B= (
b~
is represented as an arrow from the origin to the point with coordinates
bj = (bj 1 , bj2f· While every row ofthe n x 2 matrix Ais associated with a
row of Y, every row of the v x 2 matrix B is associated with a column of Y.
If the scale of the two sets of coordinates is not compatible we can introduce
an arbitrary multiplier which adjusts all the variables by the same amount,
or we can use two scales.
The most popular choice is a = 0 and o: = 1. In this case A = Jn=l Uc 2 )
= YGc 2 )L;~/ 2 contains the first two principal components scaled to have
unit variance. It is easy to show that in this case the Euclidean distances
between the points in the biplot (rows of the matrix A) are the best rank
two approximations of the Mahalanobis distances between the correspond-
ing rows of the centred matrix Y (Exercise 5.13). On the other hand, the
v arrows associated with the v rows of B = Gc 2 )L~g, will provide the
best two dimensional approximation of the elements of the covariance ma-
trix f:u. In other words, the lengths bJbj of the vectors bj (element j,j of
BBT) are the best rank two approximations of the variances of the vari-
ables (s]) and, similarly, the cosines of the angles between the bj represent
correlations between the variables. Finally, if principal components analysis
is applied to standardized variables, the length of the jth arrow is equal
to the percentage of variance of the jth variable explained by the first two
principal components (Exercise 5.13).
When a = 0 and o: = 0 the approximation becomes
- ~ 1/2--
A- vn- 1U(2)L( 2 ) - YG(2), (5 .23)

BT = G'(;).
In this last case the ith row of A simply contains the principal component
scores for unit i and the jth row of B contains the elements of the first
two eigenvectors for the jth variable (gj 1 and gj 2 ). When a = 0 and o: = 0 ,
the properties of the rows of the matrices A and B separately will change.
In this case the distance between two points in the biplot (rows of matrix
5.5 Swiss Heads 239

A) is the best rank two approximation of the Euclidean distance between


the corresponding rows of the matrix Y. In this last case, however, BBT
is no Ionger the best two dimensional approximation of :Eu and the scalar
product b'[bj is not the best approximation of Crij (Exercise 5.14). Finally,
irrespective of the value of o: used, the i, jth element of the matrix Y, that
is the value for the ith observation of the jth variable measured about its
sample mean, can be written as

(5.24)

Since the length of the projection of a vector a onto a vector bis given by
aTb/llbll (see Exercise 5.2), it follows that Yij is represented as the length
ofthe projection of a; onto bj , multiplied by the scalar llbill/vn=T which
does not depend on i. If the vectors ai and bj are nearly orthogonal the
value of Yij will be approximately zero. Conversely, observations for which
Yij is very far from zero will have a; lying in a similar direction to bj.
The relative positions of the points defined by ai and bj will therefore give
information about which observations take !arge, average and small values
on each variable.
The biplot works well if the percentage of variance explained by the
first two principal components is high and the data do not contain out-
liers. However, if influential observations are present they may influence
the correlations between variables. The biplot will then give misleading in-
formation and willlead to wrong inferences. As we see in §5.6, it is highly
informative to draw the biplot in selected steps of the forward search. In
this way we can easily monitor how the angles and the relative lengths of
the arrows are modified when a group of outliers is introduced into the
subset.

5.5 Swiss Heads


Our analysis in §1.4 suggested that units 104 and 111 were slightly outlying.
In §3.1, particularly Figure 3.9, it was apparent that this was due to large
values of y4 for these two units. The effect on forward plots of Mahalanobis
distances, for example, Figure 3.6, was not such as to prepare us for the
!arge effect in the estimated transformation parameter exhibited by these
two units in Figure 4.14: on their own they caused the hypothesis of no
transformationtobe rejected. We now see how strong is their influence on
the estimated principal components.
The initial hope in analysing the data on Swiss heads was that there
would be a low dimensional, preferably one dimensional, projection of the
data which would explain most of the variation. Then it would be possible
to find just a few masks that would protect all soldiers. Figure 5 .1 is the
forward plot of the percentage variation explained for an analysis using
240 5. Principal Components Analysis

C>
U)

....
C>

-... _,.. ---- .. -------------- . -- .. _--- .. ----- . --- ...... -.. -- .. _-- _.... - ..

~-,-~/----, ________________ /----------------~---_, ___ _

C> -

100 120 140 160 180 200


Subset size m

FIGURE 5.1. Swiss heads: forward plot of the percentage of variance explained
by the six principal components: the first two components explain only around
63% of the total variation

standardised variables. This plot established two things. The first is that
the first principal component explains only 42.7% of the total variation in
the data at the end of the search. The next component explains a further
20.4%, making just over 63% in all. The remaining four components are
similar in the amount they explain, around 10% each. Two components
are needed to give a not very complete representation of the structure. The
second point is that the percentages shown in the plot are stable, unaffected
by any outliers. In particular, units 104 and 111, entering at the end of the
search, have little effect on the percentages explained.
As well as the percentage of variation explained by the principal compo-
nents, a global property, it is interesting to see how the scores of individual
units change during the search. Figure 5.2 gives the scores for the first
component for units included in the subset. These seem like a sample from
a normal distribution, as they should if the data are normally distributed,
and are stable as the search progresses. The same is true for Figure 5.3
which shows the scores for the second component. The unit with the most
positive score on the first component is 159, which has an outlying value
of Yl in the boxplot of Figure 3.9. Otherwise extreme values in the original
variables do not seem extreme on the principal components.
This example does not show the forward search leading to new discoveries
about the structure of the data. What the forward search has achieved is to
show how conclusions from an analysis of all the data, such as that in Flury
5.5 Swiss Heads 241

...

"'
"'0~ 0 j
cX
~

-----==:-:~~::: . . - - ----- -- -- - ----~~= .--~~-_-::-~~~-1~


'r _, ' .,. ..... -

'9
.----
100 120 140 160 160 200
Subset size m

FIGURE 5.2. Swiss heads: forward plot of scores for included ullits Oll the first
prillcipal compollent

... J

"'

"'0~ 0
cX

'r l_ --r
-"J
100 120 140 160 180 200
Subset size m

FIGURE 5.3. Swiss heads: forward plot of scores for illcluded Ullits Oll the secolld
prillcipal compollellt
242 5. Principal Components Analysis

and Riedwyl (1988, p. 218) are, in this example, unaffected by individual


units such as units 104 and 111. Suchstability is not always the case.

5.6 Milk Data


We now look at an example in which there are a few observations which
have a clear effect on several aspects of the principal components analysis.
Daudin, Duby, and Trecourt (1988) give data on the composition of 85
containers of milk, on each of which eight measurements were made. We
print the data in Table A.8. The variables are:

Y1: density
y 2 : fat content, grams/litre
y3: protein content, grams/litre
y4: casein content, grams/litre
y5 : cheese dry substance measured in the factory, grams/ litre
y6 : cheese dry substance measured in the laboratory, grams/ litre
y7: milk dry substance, grams/ litre
y8 : cheese produced, grams/ litre.

Although Daudin, Duby, and Trecourt (1988) state that there are 85 Ob-
servations, their Table 1 contains 86 rows, as row 63 is repeated as row 64.
Atkinson (1994) analysed these 86 rows.
A scatterplot matrix of the data is in Figure 5.4. The panel for y 5 and Y6
shows clearly that one unit is remote in this bivariate projection. Otherwise
several panels show a strong rising diagonal structure which we can hope
will be well explained by a single principal component. There arealso sev-
eral plots with an almost circular scatter, which will not be well explained.
But, before we try principal components, we consider transformation of the
data to multivariate normality.
Figure 5.5 is a forward plot of the likelihood ratio test for the hypothesis
of no transformation, to be compared with the chi-squared distribution on
eight degrees of freedom. Up to m = 80 there is no evidence of any need
for a transformation. The next four units to enter make the test significant
and then there is a final peak caused by the last unit to enter. The gap
plot, Figure 5.6, illuminates this structure: there is a sharp peak at m = 81 ,
followed by a decline until the last unit enters. This structure suggests that
the four units entering towards the end form a duster. This suggestion is
strengthened by the two scatterplots of Figure 5.7, which are details of
Figure 5.4 with a few points marked. Unit 69 is the last to enter and is
the clear outlier already mentioned. The group of four units are numbers
1, 2, 41 and 44, which are particularly evident in the right-hand panel of
the figure.
5.6 Milk Data 243

32 36

l:j

gj ~:::=;~ :=:::~~ :. . .__ __. ""-----' .----.,.--.,;=:::;:==: ~==;~ ~:::::;:=: ~


:<:

~-----~~==~~==;~,_~-,~

:<:
.-------.. "'------' ;:==;~ l!l.:J[C=-----> ~

"'
..,
0

10.26 10.30 30 32 34 22 26 30 34 115 125

FIGURE 5.4. Milk data: scatterplot matrix of the eight measurements on 85 milk
samples

The forward plot of scaled Mahalanobis distances, Figure 5.8, confirms


these analyses. Until m = n the distance for unit 69 is off the top of the
plot. Unit 44 becomes less outlying as the search progresses and is the first
of the group of four units to enter the subset. The others, units 1, 2 and 41
are clearly outlying until m = 81 when unit 44 joins. The distances then
decrease until there is some masking at the end of the search, with unit 73
having the second largest distance.
Our conclusions so far are that there is one outlier, unit 69 and a duster
of four units of which 44 is the least outlying, as is shown by its position on
the right-hand panel of Figure 5.7. Theseare the only units which indicate
the need for a transformation to improve the normality of the data. We
now proceed to a principal components analysis of the data, which we do
244 5. Principal Components Analysis

0
<D

0
~
'80 ..,.
0
,!;;
o;
""'
::::;

0
N
I
I

0 -
- '- ~
40 50 60 70 80
Subset size m

FIGURE 5.5. Milk data: forward plot of the likelihood ratio statistic for testing
transformation of the data. To be compared with X~

20 30 40 50 60 70 80
Subset size m

FIGURE 5.6. Milk data: gap plot - forward plot of differences in Mahalanobis
distances. One outlier and a duster of four observations are indicated
5.6 Milk Data 245

~
..-
69o "' 69•

1ll "'
"'
0 0

"' "'
<g_ <0 <g_ <0

"' "' 41 ~ 1 •
44• • • •

.... ~-
<0 <0

"' "'

. -....
..- ..-
"' "'
."..
"'
"' "'"'
30 31 32 33 34 35 24 25 26 27 28
y3 y4

FIGURE 5.7. Milk data: scatterplots of Y6 against Y3 and Y6 against y4, showing
the single outlier and the duster indicated in Figure 5.6

~
(I)

!'lc:
~
'6
(I)
~ 4

0
c:
"'
<V
.s::;

"'
:::;:
1/)

40 50 60 70 80
Subset size m

FIGURE 5.8. Milk data: forward plot of scaled Mahalanobis distances. The single
outlier enter in the last step of the search; the duster of four observations labelled
in Figure 5. 7 enter immediately before it

not transform. One focus of the analysis is the effect of the five units we
have identified on the principal components.
Figure 5.9 is a forward plot of the percentage of variance explained by
each principal component. For the ten steps of the search before the last
the first component explains about 71% of the variance, with the second
component explaining a further 13%. The remaining components make
only a small contribution. The power of the first component fulfills our
246 5. Principal Components Analysis

"0
Q)
c::
'(ij
Ci.
X
Q)
....
0

lii
>

;!?.
0
"' ----··-··--·--·--··----······-·--··-··-·-·-·--···-.........-·-··-····-··--·-·········--....--..
-------------------- ------------ ··-.....................____.....................__ _.....
------------------------- --------------- _____________
________________ .,.
-----------------------------
/
/

0
-----~;

40 50 60 70 80
Subset size m

FIGURE 5.9. Milk data: forward plot of the percentage of variance explained
by the eight principal components: after the duster of six units enters, the first
component explains around 71% of the total variation

expectations from the initial inspection of Figure 5.4. However, the forward
structure of the plot does not particularly refl.ect the structure we have
already found.
Figure 5.9 shows that, in the last step of the search, there is a dedine
in the percentage of variation explained by the first component. This is
caused by the entry of unit 69. However, there is no indication of any
effect of the previous four observations. Instead, there is an increase from
m = 71 to 76 as units 11, 76, 15, 14, 12 and 13 successively enter. These
six units are shown highlighted on the scatterplot matrix of Figure 5.10.
In general they form a duster of low values in these scatterplots, although
in some, such as the left-hand panel of Figure 5.7 they are joined by some
other units. However, most importantly, they extend the major axis of the
duster of points in those plots with a strong diagonal pattern. Their effect
is to increase the variance explained by the first principal component. The
effect of the introduction of these units is similar to that of "good" leverage
points in regression. The units are remote in the space of the observations,
but they reinforce the model already fitted to the data.
We can augment the forward plot of the percentage of variation explained
by the components, Figure 5.9, by looking at scree plots for particular val-
ues of m. Four such plots are shown in Figure 5.11. Theseare just alter-
native representations of the information in Figure 5.9, to which we have
added curves using the values from the halving rule in Table 5.1. The panel
for m = 70 seems to follow this rule rather well, whereas those for m = 76
5.6 Milk Data 247

1026 1030 30 32 34 22 24 26 115 125

FIGURE 5.10. Milk data, omitting observation 69: scatterplot matrix of the Ob-
servations. The second duster, of six observations, is highlighted

and 84 show that much more of the total variance is explained by the first
component after the duster of six units has entered. The plot for the end of
the search, m = 85, shows that the effect of including unit 69 is to increase
appreciably the contribution of the smaller principal components. The con-
tribution of the first component is, of course, correspondingly reduced.
We now consider the effect of individual units on the composition of the
principal components. The left-hand panel of Figure 5.12 shows a forward
plot of the correlations between the variables and the first principal com-
ponent. At the beginning of the search this is a combination of the mean of
four variables (3, 4, 5 and 6) and a less equal combination of the remaining
four. However, by the end of the search, the weight of the eight variables is
more nearly equal. This component represents the general positive correla-
tion of all variables. It is little changed by the effect of the last observation.
248 5. Principal Components Analysis

m~76

.... 0722

Ote:l Ott
0

Comp. 1Comp. 2Comp. 3Comp. 4Comp. SComp. 6 Comp. 1 Comp 2Comp. 3Comp. 4Comp. SComp. 6

0717
0032

0087 ....
Comp. 1 Comp. 2Comp. 3Comp. 4Comp. SComp. 6 Comp. , Comp. 2Comp. 3Comp. 4Comp. 5Comp. 6

FIGURE 5.11. Milk data: scree plots at four values of m. The superimposed
curves are for the halving rule, Table 5.1

"=! 3 . 4. s. 6

a)
()
0.. 0
~
-=
:5 0
<D

-~ 7,
<J) 8·"
c "<t
0 1
~ 0 2 ·"\
~
0
() ""
0

0
0
40 50 60 70 80 40 50 60 70 80
Subset size m Subset size m

FIGURE 5.12. Milk data: forward plots of correlations of principal components


with variables. Left-hand panel, first component; right-hand panel, second com-
ponent

Component two, in the right-hand panel of the figure, has some positive
correlations and some negative. It is a cantrast between the mean of the
same four variables (3 , 4, 5 and 6) and again a looser, positively weighted
combination of the remaining four variables. Compared with the left-hand
panel, the plots in the right-hand panel fl.uctuate more, showing a greater
5.6 Milk Data 249

LO
ci

....ci
~~~--- --..... ______ / __, ___,_________ \ .

~> "'ci -,_"----~- ....... 8

1~:
<::
Q)

"'
·a;
"! 1 ·········\_...... //··.. ........······
1!? 0
2···,
ü:
... : :;c-~~--::o.~-5:.:.---..~-:=\6
ci \ -···. .',

5
0
ci
40 50 60 70 80 40 50 60 70 80
Subset size m Subset size m

FIGURE 5.13. Milk data: forward plots of components of eigenvectors. Left-hand


panel, first eigenvector; right-hand panel, second eigenvector. To be compared
with Figure 5.12

effect of the random variation in the data on estimation of the component.


They are also more affected by the introduction of the last observation.
Figure 5.13, tobe compared with Figure 5.12, gives forward plots of the
elements of the first two eigenvectors. As we argued in §5.3.3, the two plots
are similar since Pjr = 9jrVAr. The most noticeable difference between the
two plots is in the left-hand panels: that for the correlations with variables
3, 4, 5 and 6 is virtually constant until the last unit enters, whereas, in
Figure 5.13, the components of the eigenvector decrease. This decrease is
however offset by the increase of h during the search, seen in Figure 5.9.
With our scaled variables it appears to make little difference in this example
whether we plot correlations or elements of the eigenvectors. We choose to
plot the correlations.
In this example the first two components explain around 84% of the
variability in the data, so that there is not much left for the remaining
components to explain. Some do react strongly to the outliers. The left-
hand panel of Figure 5.14 shows the correlations between the variables and
the fifth principal component. These are sensibly constant until the last
step of the search, when unit 69 causes large changes in the correlations
with variables 1, 2, 4 and 7. The right-hand panel of the figure is for the
sixth component. It reacts strongly to the group of four units entering from
m = 81.
These lower components explain little of the total variation in the data,
so we return to the first component. Figure 5.15 shows the scores for the
individual observations. This interesting plot shows how important is the
group of six observationsentering from m = 71. Units 11, 12, 13, 14 and
15 have the lowest score, together with unit 76, until unit 73 enters, which
has a score higher in absolute value than unit 76, but lower than those
250 5. Principal Components Analysis

·8
C\J
(.) c:i
0..

"fi
·;;;
0
E c:i

"'c:0"
~
-e0 C\J
9
(.)

....
9
40 50 60 70 80 40 50 60 70 80
Subset size m Subset size m

FIGURE 5.14. Milk data: forward plots of correlations of principal components


with variables. Left-hand panel, fifth component, reacting to the outlying Ob-
servation 69; right-hand panel, sixth component, reacting to the group of four
outliers

:I
0

~L
40 50 60 70 80
Subset size m

FIGURE 5.15. Milk data: forward plot of scores for included units on the first
eigenvector. The group of six outliers highlighted in Figure 5.10 have extreme
negative values

of the other five units. The effect of the introduction of units 11-15 is to
shrink the range of variation of the scores of the units already included in
the subset.
We conclude this section with a comparison of the biplot at two steps of
the forward search. Figure 5.16 shows the biplot with a = 0 and a = 1 when
m = 70 (left-hand panel) and m = 76 (right-hand panel). When m = 70 the
5.6 Milk Data 251

m=70 m=76
·6 -4 -2 0 2 4 6 6 -10 -5 0 5
<')
ci r---- 50
42
<D 17 50
-,
6464
17 ~ 4 <D
(\j
ci 206%f0 4y242
"'
ci 64 yeY7 7~ li)

V ci
15 11 76 »%~~ :sy7

,~'
ci 29
"' 0

0
ci 0 52 y§
9

"
<)' 1413
:HI 136
~
3Y 28
~ J12
21
26 'f
18
25 9 0
~
21
-0.2 0 .0 0.1 0.2 0.3 ·0.3 ·0.1 0 .1 0 .2

FIGURE 5.16. Milk data: biplot when , left-hand panel, m = 70 and , right-hand
panel, m = 76, after the introduction of units 11, 12, 13, 14, 15 and 76. The first
component is on the horizontal axis and the second on the vertical axis

length of the arrow associated with the first variable (yt) is much shorter
than those for the other variables and has an orientation towards the second
quadrant. After the inclusion of units 11 , 12, 13, 14, 15 and 76 the length
of this arrow seems similar to that of the other variables and it now points
towards the first quadrant. This is in agreement with what we observed
in Figure 5.12 about the evolution of the curve for the first variable. In
both panels of Figure 5.12 we can observe that, in going from m = 70 to
m = 76 , the magnitude of the correlation between the components and
y 1 increases in absolute value and that the sign of the correlation with
the second principal component changes. Finally, the left-hand panel of
Figure 5.16 indicates that variables 2, 7 and 8 are almost orthogonal to
variables 3, 4, 5 and 6 and only lightly correlat ed with the horizontal axis ,
that is the first principal component. After the inclusion of the six units
the cosine of the angle for variables 2, 7 and 8 and the first component is
considerably reduced. This is in agreement with what we have already seen
in the left-hand panel of Figure 5.12, that is the increase of the correlation
between the first principal component and variables 2, 7 and 8. All units
that enter have an extreme negative value for the first principal component
together with different signs on the second component. Figure 5.12 also
shows that, throughout the search, apart from the end, y 2 is the variable
with the highest correlation with the second principal component. If we
project the six units on the direction of the arrow for y 2 we can see that
while units 12, 13 and 14 will have strongly negative values, the values for
units 11 , 15 and 76 will be only slightly negative. This is, of course, in
agreement with what can be seen in the second column of the scatterplot
252 5. Principal Components Analysis

matrix of Figure 5.10. Units 12, 13 and 14 are associated with the duster
of 3 points having by far the smallest values for y2 , whereas units 11, 15
and 76 (the three other points highlighted in this figure) all have small, but
not particularly low, values for Y2·
Our analysis shows that the forward search can illuminate not only the
structure of the data, but also the effect of individual units on the structure
of the principal components. Most importantly, units which are important
for the principal components are not necessarily those important for deter-
mining other aspects of the structure, such as the presence of outliers or
the need for a transformation. In this case our analysis has also revealed
a duster of six units, five of which intriguingly have consecutive numbers
from 11 to 15. Regrettably, the data description in Daudin, Duby, and Tre-
court (1988) does not say anything about the numbering of the units. It
is tempting to think that these units, with lower values of the variables,
correspond to a different breed of cow, or a different geographicallocation.
As so often, the forward search is rich in suggesting further questions about
the structure of the data.

5. 7 Quality of Life
In the two preceding examples we had samples that were dose to those
from a single multivariate normal distribution, perhaps with a few outliers.
We now consider an example in which the data need to be transformed to
achieve normality. We compare analyses of the untransformed data with
those after transformation and show in what ways multivariate normality
improves the principal components analysis.
Since 1990, the Italian financial journal Il Sole - 24 Ore has promoted
a survey on the quality of life in Italian provinces. Provinces are aggre-
gations of municipalities, but at a finer level than regions. For instance,
the Emilia-Romagna region discussed in Chapter 1 is currently made up
of nine provinces: Piacenza, Parma, Reggio nell'Emilia, Modena, Bologna,
Forll, Ferrara, Ravenna and Rimini. This survey on quality of life is not re-
stricted to Emilia-Romagna, however, but covers all103 provinces of Italy.
It is conducted yearly. We use the data published in 2001, which mainly
refers to the previous year.
The survey is intended to provide a synthetic measure of "quality of life" ,
which is then used to rank the provinces. We are not greatly interested in
such a ranking, which· relies on questionable social premises. Instead, we
look at the variables collected and show the effect of transformations. In
the 2001 survey there were 36 responses dealing with 6 different aspects of
quality of life. The complete data set can be found in the web site of the
book. Specifically, the areas of interest were:
5.7 Quality of Life 253

• welfare
• wealth and work
• services and environment
• crime
• population
• leisure.

We focus our analysis on 6 variables, selected within the different areas,


which are:

y 1 : average amount of bank deposits per inhabitant (Bank Deposits)


y 2 : number of robberies per 100,000 inhabitants (Robberies)
y 3 : number of housebreakings per 100,000 inhabitants (Housebreaking)
y4 : number of suicides, committed or attempted, per 100,000 inhabitants
(Suicides)
y 5 : number of gyms per 100,000 inhabitants (Gyms)
y6 : average expenditure on theatre and concerts per inhabitant (Exp.
Theatre).

Data are not provided on the original variables listed above, but rather
on a scaled version of them. Scaling is performed by dividing each response
value by the maximum reading for that response. Of course, this procedure
is dramatically affected by outliers. It is likely to produce highly skewed
distributions which will be shrunk towards the origin compared with those
for variables without outliers. We might expect that our robust approach
to multivariate transformations will greatly improve the performance and
usefulness of principal components analysis in this application.
The data are in Table A.9, the scatterplot matrix for which is given
in Figure 5.17. The main features of the data are an upward trend in all
variables and a dispersion which appears to increase with the magnitude of
the observations. We expect that power transformations of the variables will
be appropriate. An interesting question is how the transformation affects
the principal components.
Before plunging into principal components analysis, we indulge in a little
data analysis of the sort exemplified in earlier chapters. Figure 5.18 is a
forward plot of the scaled Mahalanobis distances. There seem to be several
outliers and the distribution of distances at the beginning of the plot is
Ionger tailed than comparable plotssuch as Figure 3.6 or Figure 3.13. The
two panels of Figure 5.19 confirm this impression. The left-hand panel of the
figure shows the trace of the estimated covariance matrix, which increase
at m = 95 and again at m = 99. There seem to be four serious outliers
and four lesser ones. This impression is confirmed by the right-hand panel
of Figure 5.19 which is the minimum distance amongst units not in the
subset. It confirms the four large outliers, so clear in the forward plot of
Mahalanobis distance, Figure 5.18, as well as showing an upward trend
254 5. Principal Components Analysis

FIGURE 5.17. Quality of life: scatterplot matrix of the six indices from 103 ltalian
provinces

for the lesser outliers. We now see how these observations influence the
principal components analysis.
The left-hand panel of Figure 5.20 shows the percentage of the variance
explained by each principal component. For most of the plot the first com-
ponent explains around 50% and the second around 20%. But, in the last
eight steps of the search the variance explained by the first component
drops to being only a little above 40% while that explained by the third
increases from 10% to almost 20%. The right-hand panel of Figure 5.20
shows the elements of the first eigenvector. This is a remarkably stable
mean of all six variables; only the weight for y 2 changes, first increasing
and then decreasing slightly in the last eight steps. The scores for the first
principal component in Figure 5.21 belie all this seeming stability. They
shrink towards the end of the search and some of the outliers visible in Fig-
ure 5.18 have extreme scores. More importantly, for the use to which such
5.7 Quality of Life 255

"'
"'
0
V> "'
"'c:<.>
1Jl

~
0
.
c:

..

.s::
:::<
~

"'
0

75 80 85 90 95 100
Subset size m

FIGURE 5.18. Quality of life: forward plot of scaled MahaJanabis distances . There
seem to be several outliers

~ 0

0>
0 Cl CO
:::<
Q)
(.) E
:::>
~
I-
CO E
0 ·;::
CD
~
.....
0
....
CD
0
75 80 85 90 95 100 75 80 85 90 95 100 105
Subset size m Subset size m

FIGURE 5.19. Quality of life: left-hand panel, forward plot of trace of the es-
timated covariance matrix; right-hand panel, forward plot of minimum Maha-
lanobis distance among units not in the subset

an analysis will be put, the rankings of many towns change markedly in the
last six or eight steps. According to these scores the half dozen best places
are 36 ('frieste), 17 (Milan), 27 (Verona), 42 (Bologna), 65 (Rome) and
42 (Genoa). Apart from Rome, all these towns are located in the northern
part of ltaly.
We now transform the data. The procedure of Chapter 4 leads to the
transformation >. = ( -0.5, 0, 0, 0.5, 0.5, O)r. The least well established of
256 5. Principal Components Analysis

0
CD

0
Ll"l

"0
Q)
c::
"(ii ....0
a.
X
Q) 0
<"')
(ij
>
0 0
C\J
"!!!
;;: 0
C\J

<f!.
~ 0
f\...··----···./ \2
0
0 0
75 80 85 90 95 100 75 80 85 90 95 100 105
Subset size m Subset size m

FIGURE 5.20. Quality of life: left-hand panel, forward plot of percentage of total
variation explained by each principal component; right-hand panel, forward plot
of elements of the first eigenvector

0 '

~-~,. ~ -.-,. ,.-,-.-,. ,.-.-.-.,..., -.--- -.-.-,. ..,-.-.- ,...,-._,..,. .,-.-,...,-.-.- ,. "'-.-,-,. -,-.- 88
--r
75 80 85 90 95 100
Subset size m

FIGURE 5.21. Quality of life: score of individual units on the first principal
component. There is an apparent Iack of stability, with the ranking of many
towns changing in the last steps of the search

these transformation parameters is the value of -0.5 for transforming y 1 .


A value of 0, the log transformation, would also be acceptable. However,
for our present purpose of demonstrating the effect of transformation on
principal components analysis, t he exact value and interpretation of >. 1 do
not much matter. However , because >. 1 is negative, what were previously
the la rgest values of y 1 will now be the smallest.
5.7 Quality of Life 257
BANK DEPOSIT$ ROBBERIES HOUSEBREAKING SUICIDES GYM$ EXP. THEATRE

~
" ~ 70
~f I 38
I ...
75
127

I'
02
~
,.........,

.
j;! ~
~ ' 38 -

,.........,
~ ~ ~

I
!:!
!I Ia
89 - 17

,.
u-

I
!! n - e
2

I
~ •• ""

.w
~
e
SI e

~· II j;!

L'--'
FIGURE 5.22. Quality of life: boxplots of the six indices on the original scale

Before we analyse the transformed data we look at a series of univariate


boxplots to see how the data have changed. Figure 5.22 shows the plots
for the original data and Figure 5.23 those after transformation. As we
would hope, the immediate impression is of the removal of outliers and an
increase in symmetry. In particular, the central part of the distribution of
y 2 (robberies) is greatly enlarged, incorporating several previously outlying
observations. We now repeat our analysis on these data.
Figure 5.24 is the forward plot of the scaled Mahalanobis distances, plot-
ted on the same vertical scale as Figure 5.18. It is clear that the transforma-
tion has virtually abolished any multivariate outliers. We therefore do not
repeat the forward diagnostic plots of Figure 5.19, but move immediately
to the principal components analysis. Figure 5.25 is to be compared with
Figure 5.20. The first principal component now explains around 55% of
the variance throughout most of the search and over 50% at the end. The
second and third components are stable at around a further 20 and 10%,
but now the variance explained by the third component hardly changes at
the end of the search. The right-hand panel of the figure shows how sta-
ble are the components of the first principal component over the last 25
steps of the search. Again the variable is virtually a simple average of the
six readings, when we recall that the sign of YI has been reversed by the
transformation.
It is however the scores on the first principal component, in Figure 5.26,
that show a great improvement when compared with their untransformed
counterparts in Figure 5.21. They do not contract sharply, but provide
258 5. Principal Components Analysis

BANK DEPOSITS ROBBERIES HOUSEBREAKING SUICIDES


____,,
GYMS EXP. THEATRE
r;-
.
0
0
.
M
0
M

flll -

I I
..
0
.,
0

I
0
N

0
0
0

.. -
..... -
0

FIGURE 5.23. Transformed quality of life data: boxplots of the six indices. Com-
parison with Figure 5.22 shows the removal of numerous outliers and an increase
in symmetry

"'
"'
0

"'g "'
"'
.!!l
'0 "' ~
:i5 "'0
c
"'
=
(ij 0

"'
:::;; ------'~--------

"'

0
J
60 80 90 100
Subset size m

FIGURE 5.24. Transformed quality of life data: forward plot of scaled Maha-
lanobis distances on the same scale as Figure 5.18
5.7 Quality of Life 259

0
CD

"0
"'
....
Q)
c: 0
·a;
c.
X
Q) 0
C")

~
>
ö 0
1ft "'
~

60 70 80 90 100 60 70 80 90 100
Subset size m Subset size m

FIGURE 5.25. Transformed quality of life data: left-hand panel , forward plot of
percentage of total variation explained by each principal component; right-hand
panel, forward plot of correlations between first principal component and the
variables. To be compared with Figure 5.20

....

"' 1
0
==
"'~

~J ~~

.., = ~ ~~::: J
~-
------ ~- --~ ~---- ~ --.--- --~ ll!l
--------~-----~----
"?

60 70 80 90 100
Subset size m

FIGURE 5.26. Transformed quality of life data: score of individual units on the
first principal component . There is greater stability than in Figure 5.21

stable orderings throughout the plot. Now the best half dozen towns are
42 (Bologna), 12 (Genoa), 17 (Milan), 36 (Trieste), 46 (Rimini) and Rome
(65). However the same two towns of the South, 89 ( Crotone) and 90 (Vibo
Valentia), are still ranked worst on this first component.
The stability of the scores after transformation is an important improve-
ment which comes from the greater normality of the data. A second con-
260 5. Principal Components Analysis

TABLE 5.2. Quality of life: comparison of untransformed and transformed analy-


ses. Percentage of total variance explained by the first three principal components

m=93 m = 103
First Second Third Total First Second Third Total
U ntransformed 50.8 20.1 10. 8 81.7 41.8 18.8 17.2 77.8

Transformed 56.7 19.1 8 .8 84.6 51.2 21.3 11.4 83.9

'Halving' 50.8 25.4 12.7 88 .9

sequence of the transformation is that the percentage of the total variance


explained by the main components increases. Table 5.2 shows the per-
centage of variance explained by the first three principal components for
the untransformed and transformed data, both at the end of the search
and ten steps from the end when m = 93. At both points in the search
the transformation slightly increases the total variance explained by the
first three components. The percentage explained by the first component
changes more dramatically. At m = 93 this isover 5% greater for the trans-
formed data than for the untransformed data, rising to nearly 10% at the
end of the search, when the outliers in the untransformed data seriously
degrade the performance of the principal components analysis. We have
already seen, in Figure 5.26 how stable are the scores for the transformed
data, right up to the end of the search.
At the end of the search both the first and second components for the
transformed data explain appreciably more variance than those for the un-
transformed data, a total of 72.5% as opposed to 60.6%, that isarelative
increase of almost 20%. Since the first three components for the untrans-
formed data explain only 77.8% of the variance, the first two components
of the transformed data provide almost as good a summary of the data as
the first three components of the untransformed data. An additional ad-
vantage of the analysis of the transformed data is the increase in normality,
leading to improved inferences when normal theory procedures are used.
For example, with a closer approach to normality, the significance of the
value an observed test statistic can be more accurately assessed.

5.8 Swiss Bank Notes


In the first two examples we found individual units or a small group of units
which had a noticeable effect on the principal components analysis, with-
out changing the overall interpretation. In the previous section we saw how
the analysis was sharpened by transformation to normality, providing more
stable inferences during the forward search. We now explore further prop-
5.8 Swiss Bank Notes 261

erties of the forward search combined with principal components analysis


through an analysis of the standardised data on Swiss bank notes.
Since, by now, we have a detailed understanding of these data, we do not
expect the principal components analysis to provide new insights about the
data. Rather, we use our knowledge of the data to exhibit properties of the
analysis. We start with an analysis of all 200 observations and see the
effect of two large groups. We then analyse the forgeries alone, observing
the effect of a smaller, but still significant, group.

5. 8.1 Forgeries and Genuine N otes


As before, we start the search with twenty units in Group 1, the genuine
notes. Figure 5.27 is a forward plot of the percentage of variance explained
by each principal component. The main feature of this plot is the great
increase for the first component just after the second group starts to enter
the subset. At this point there is appreciable interchange of units, those
from both groups being present in the subset. There is then a well defined
direction, roughly between the centres of the two groups, along which most
of the variation in the data lies, so that the first principal component,
lying along this direction, explains much of this variation. As further units
enter, the scatter increases away from this direction and the percentage
of variation explained by the eigenvector decreases. There is no particular
indication of any effect towards the end of the search of the second group
of forgeries.
The explanation of the related plots of the elements of the first two
eigenvectors in Figure 5.28 is a little more complicated. In the first two
or three steps of the search, there is some interchange of units since we
start with the first 20 units in order, rather than 20 selected to have small
Mahalanobis distances. Thereafter the elements of the first eigenvector,
in particular, are unstable, reßecting the near sphericity of the units in
Group 1 in some subspaces, which is visible in the scatterplot matrix of
Figure 1.13. As we mentioned in §5.2.3 the eigenvectors are not unique
if the distribution has complete spherical symmetry. The components of
the second eigenvector seem more stable, but both eigenvectors change
appreciably as the second group of units enter the subset. Both panels of
Figure 5.28 show that the eigenvectors are stable after m = 120, with only
slight changes in the later part of the search as the third group enter from
m = 180; in particular element one of the first eigenvector decreases from
being small to being almost identically zero.
It is clear from the comparison of these plots with those we have seen
earlier that the presence of two groups manifests itself unambiguously. The
components also provide a division of the units into clusters.
Figure 5.29 shows the scores of the first hundred units on the first prin-
cipal component once the unit has entered the subset. In the first half of
the plot there is appreciable variability, corresponding to the poorly de-
262 5. Principal Components Analysis

FIGURE 5.27. Swiss bank notes: forward plot of variance explained by the six
principal components. The effect of the entry of the second group is manifest
shortly after m = 100

FIGURE 5.28. Swiss bank notes: forward plots of correlations of principal compo-
nents with variables. Left-hand panel, first component; right-hand panel, second
component

termined first eigenvector in Figure 5.27. But, around m = 100, there is


an increase in all scores which thereafter are stable. Figure 5.30 shows the
complementary plot for units in the second group. These curves start just
before m = 100 and soon increase to another stable pattern. Initially the
scores for Group 1 are symmetrically distributed around zero, refiecting
the approximate multivariate normality of the measurements on the first
5.8 Swiss Bank Notes 263

...
j
N

1
.,
Vl

0
0 .
.X

'!' ~,''I

50 100 150 200


Subset slze m

FIGURE 5.29. Swiss bank notes: scores, after inclusion, of the first one hundred
units on the first principal component

group. Comparison of the scores from the two plots towards the end of the
search shows that the first component achieves almost complete separation
of the two groups. But Figure 5.30 shows that the separation decreases
slightly from m = 180 as the third group enter.

5. 8. 2 Forgeries Alone
When we fit just the forgeries on their own, the 15 observations of Group 3
enter fromm= 86 onwards. The left-hand panel of Figure 5.31 shows the
percentage of the variance explained by the eigenvectors; that for the first
eigenvector decreases from m = 86 whereas that explained by the third
and fourth increases- first one and then the other. The right-hand panel of
Figure 5.31 shows the correlations of the variables with the first principal
component. This is stable in four variables but the coefficient of Y6 decreases
markedly from m = 86, the sign changing from positive to negative before
the search ends. The coefficient for y 4 becomes more negative over the same
period. This reflects the outlying nature of the last 15 observations in y 4
and Y6 , which we highlighted in the scatterplot matrix of Figure 3.51. The
correlations with the second component are not particularly interesting,
but a similar sensitivity from m = 86 of other responses, is seen in the two
panels of Figure 5.32 for higher components. The left-hand panel, for the
third principal component, shows how y 1 is eliminated from this compo-
nent , while the weight for y 5 increases. Both respond appreciably to the
264 5. Principal Components Analysis

";"

e
Cl)

~
~
<?

..,
u;> ----r ,.--'
50 100 150 200
Subset size m

FIGURE 5.30. Swiss bank notes: scores, after inclusion, of the second one hundred
units on the first principal component

ü l()
Q. 0
.-···--....-~.. _......
"Pi
""
.<::
~ 0
c
Cl) 0
--- - ~
0

-- /
/ ~
0
ü
l()

9
"......---------
0
----------- -- -- ---
60 70 80 90 100 60 70 80 90 100
Subset size m Subset size m

FIGURE 5.31. Swiss bank notes - forgeries: left-hand panel, forward plot of
percentage of total variance explained by each principal component; right-hand
panel, forward plot of correlations of variables with the first principal component

entry of the last, outlying, observation. The correlations with the fourth
component in the right-hand panel of Figure 5.32 show both some further
trends from m = 86 and some jumps at the very end of the search. These
plots show how changes in the structure of the data during the forward
search can simultaneously affect several components.
5.9 Municipalities in Emilia-Romagna 265

(.)
c.. "'
0
"E
~
.c
~ 0
"'c:0 0
~
~
0 ""?
(.)
9

60 70 80 90 100 60 70 80 90 100
Subset size m Subset size m

FIGURE 5.32. Swiss bank notes: forward plots of correlations of principal compo-
nents with variables. Left-hand panel, third component; right-hand panel, fourth
component

5.9 Municipalities in Emilia-Romagna


The final example of this chapter deals with the survey data on municipal-
ities in Emilia-Romagna, introduced in Chapter 1 and analyzed in Chapt er
4. This is a n intriguing application of principal components analysis since
we have observations on as many as 28 dimensions, relat ed to very different
asp ects of social a nd economic life. We already made this point earlier in
the book, e.g. at the beginning of §4.10.
Here we focus on all 28 variables and hope t hat the first three or four
components will represent the data cloud satisfactorily. Furthermore, we
want the principal component estimates not to be affected by t he several
outliers that were highlighted both on t he scale of original variables and
after multivariate transformation. For instance, recall Table 1.3 and Fig-
ure 4 .47. As we will see, this is actually the case a nd we end up with
four components explaining a large portion of the observed variability and
having a clear interpretation.
In Chapter 4 we saw that transformations of the individua l varia bles
greatly improved normality of t hese data. Therefore, our analysis is per-
formed on standardised transformed data. We use the modified variables
YI6, Y23, Y25 and Y26, where the observed rea dings are replaced by 100 mi-
nus their values. We apply the t ransformation parameter obtained after
the combined analysis of all 28 responses and denoted by .Ac 2 in Table 4.5.
Listing the variables in their order from y 1 to Y2s, the multiva riat e trans-
formation parameter is

.A (0.5, 0.25, 0, 1, 0.25, 0.25, 0, 1, 0, 0, 0, 0.25, 0.5, 0. 25,


0.5, 0.5, 1, 1, 0.5, - 1/3, 0.25, 0.25, -1 , 0, 0.25, 1, 1, 1)T.
266 5. Principal Components Analysis

0
C')

"0

"'
c:
·a;
Ci.
)(

"'
iii
0
N

>
ö (r--········-~·····-······-··-··-···-·~--......_..J······~·../···-·····-······~-·~--..-·········---·······-········,···········-················-··--··..
;!!

0
,,... --------------------- --------- ----------------------- _____ ,
>-~~~~~~~~~-----------~--------~---------------
~ --=====--=~--:-_--=-----;;;:_==:;;::-_~,:_-==-:;;::_.
:... :.....:._·_:. ..:..._-_._ .:._: -·- :_ ..:~ -·-.:....:. _._._,:. ..:-·-:... ..
-
!!:3&~B&aiii5D!iJinn~ . . ~.:....: _._._ -=--=~~~
0

200 250 300


Subset size m

FIGURE 5.33. Transformed Emilia-Romagna data: forward plot ofthe percentage


of variance explained by the principal components

A major effect of these transformations to multivariate normality is to


improve the behaviour of our forward diagnostics based on Mahalanobis
distances, which resulted in smooth plots with some outliers entering at
the end of the search, as in Figure 4.46. We expect the sametobe true for
forward plots for principal components analysis. Details of the search have
already been given in Chapter 4.
We start by looking at Figure 5.33, the forwa rd plot of the percentage
of variation explained by the first four principal components in an analy-
sis using standardised transformed variables. The plot is indeed stable and
exhibits only a gentle change in the percentages explained by the first two
components towards the end of the search, from about m = 300 onwards.
There does not seem to be any appreciable effect of the very remote outlier
entering at the last step (unit 277, Zerba), nor of the two other groups
of outliers indicated in Figure 4.46. Despite this smooth change, the total
percentage of variation explained by the first two components is approxi-
mately constant at about 50% over the displayed range of the search. The
third and fourth components explain 7.9% and 6.6% of the global variation
in the final step, contributions that do not change appreciably along the
search. Starting with 28 transformed variables, we conclude that four com-
ponents provide a satisfactory representa tion of the data cloud, accounting
for almost 2/3 of the variability, and this percentage is not affected by
outliers.
5.9 Municipalities in Emilia-Romagna 267

.2 "'ci
~0
go
~
0"'
9 "'
9
,16.------------------16

"'
0

~~
~ 0

:;:"' 0

"'
9 t . ~

~- --,----------17

200 250 300 350 200 250 300 350 200 250 300 350
Subset size m Subset size m Subset size m

FIGURE 5.34. Transformed Emilia-Romagna data: forward plots of correlations


between the first two components and the variables. Upper panel, first compo-
nent; lower panel, second component

We interpret the first four principal components by looking at forward


plots of correlations between each component and the 28 responses. For
simplicity in Figure 5.34 and Figure 5.35 demographic, wealth and work
variables are represented in different panels. Furthermore, we have changed
the signs of correlations for variables with a negative transformation param-
eter in >. and so kept the original and easier interpretation of relationships.
It appears that the first component is strongly related to having a rela-
tively young population (y 10 and y 12 ), with low unemployment (yg), high
activity rate (ys) and generally good wealth conditions (Yls and Y22, for
example). These are all indicators of a high quality of life, at least from an
economic point of view. The first principal component can then be inter-
preted as a synthetic indicator of welfare, whose measurement was in fact
one of the purposes of the study for which the data were originally collected
(Zani 1996). Recall that variables Yl6 (houses with fixed heating), Y25 and
y 26 ( employees in not very small factories) have been modified, so it is not
surprising that their correlations with welfare are now negative.
The correlation plots for the first component are also generally stable
along the search and outliers have an effect on only a few variables of
scant infl.uence on welfare. We might expect that this stability will not be
shared by forward plots for subsequent components that explain a lower
portion of the total variability. Nevertheless, the forward plot of correlations
between the variables and the second principal component is fairly stable,
268 5. Principal Components Analysis

28
8. ........
....
0
! .... , .. _ - - .•/ _ , _.-- - - - _ _ ,.

"'
0
0 0
:;:
~ 5 ...·~·················· · · ....
L~~-- ··._:
9

....
0

0
~ 0
:;:"
~

"'
9
200 250 300 350 200 250 300 350 200 250 300 350
Subset size m Subset size m Subset size m

FIGURE 5.35. Transformed Emilia-Romagna data: forward plots of correlations


between, upper row, the third and, lower row, the fourth components and the
variables

although the last fifty or so units to enter produce a slight trend. Their
infl.uence is particularly evident in the panel of Figure 5.34 for demographic
variables, for which many outlying municipalities have extreme readings as
is shown in the first two rows of the scatterplot matrix in Figure 3.26. An
additional feature of the second component is the steady increase in the
trajectory of its correlation with y 9 , the unemployment rate, shown in the
lower right panel. Although not of great relevance for the interpretation
of this component (the correlation ranges between -0.2 and +0.2), it is
interesting to investigate why the correlation changes its sign along the
search. Figure 5.36 shows the boxplots of the distribution of y 9 for all
communities and then separately for the first 241 municipalities to enter
the search and for those entering in the last 100 steps; communities with
higher unemployment tend to enter later. As we know, many of these are
the poorer and more remote communities.
Figure 5.34 also provides guidance in interpreting the meaning of the
second principal component. This component is also positively correlated
with indicators of population youth (y 1 and, to a lesser extent, YIO, y 1 2
and YI3)· This is the reason why the forward plots related to demographic
variables decrease towards the end of the search, due to the inclusion in the
last steps of several small and aging communities located in the mountains
(the last 14 units which enter are the same as those reported in §4.10.4).
However, in contrast to the interpretation of the first component, economic
conditions seem to be generally poor for municipalities with high scores
5.9 Municipalities in Emilia-Romagna 269

All units Un its in at step m=241 Last 100 units to enter

t(l t(l t(l

~1
--.
~ -- ~ < ~

e S!
-
..----.=...,
e

!!
0 0
t
FIGURE 5.36. Transformed Emilia-Romagna data: boxplots of the distribution
J
of yg, the unemployment rate, for, left-hand panel, all units, centre panel, the
units included when m = 214 and, right-hand panel, the last 100 units to enter
the search

on this component, as we see from its negative correlations with many


wealth indicators (recall that y 23 has been transformed). Correspondingly,
there is evidence of a labour force with low skills, containing relatively few
graduates (y6), as well as people working in bank and finance (y21), and
relatively many artisans (y2 7 ), as well as uneducated people (y 7). Again,
these conclusions are stable along the search and are not affected by the
inclusion of the outliers.
The plots of Figure 5.35 (which show the correlations with the third and
fourth principal component) are considerably more wiggly. However, some
useful information can still be gained from these forward plots. For instance,
the third component is positively related to y 28 , a measure of entrepreneur-
ship, and to y 20 , the percentage of workers in hotels and restaurants. Out-
liers have only a slight impact on these relationships, which are stable along
the search. It might argued that both variables are related to tourism, al-
though we already made the point (at the end of §3.3) that the number of
people working in tourism is often underestimated. Also entrepreneurship
is possibly a difficult variable to measure, as the two outlying values for the
rural municipalities 277 and 260 showed in the last boxplot on Figure 3.25.
These features could detract from the clarity of interpretation of the scores
of individual municipalities on the third component.
The variables having a major effect on the fourth component are y 9 , the
unemployment rate, and Y27, the percentage of artisanal enterprises. For
270 5. Principal Components Analysis

both of them the absolute values of the correlations are fairly high in the
last 100 steps of the search.
Before we leave Figures 5.34 and 5.35 we pointout the special effect of the
final few outliers which had such a dramatic effect in figures like 4.46. These
effects are all small, but sharp, changes in some correlations, for example for
y 28 in the work variables of component one and Y23 in the wealth variables
of component two in Figure 5.34. There is also some effect on these two
variables in the third and fourth components of Figure 5.35. The effects
are however slight, showing the stability of the principal components to
the presence of a few outliers with a data set of this size.
Individual scores on the first two components are monitored in Fig-
ures 5.37 and 5.38. Both forward plots show firm patterns along the search,
which are not infiuenced by the final indusion of outliers. As was illustrated
at the end of §5. 7 in our analysis of data on the quality of life, such sta-
bility is one beneficial consequence of transforming data to approximate
normality
We first consider Figure 5.37. Although there are many extreme negative
scores, projecting the 28 variables on the first principal component does
not seem to reveal any particular duster of outliers. In fact, we already
noted that it is the multivariate combination of several extreme responses
that produces the impressive plots of Mahalanobis distances in Figure 4.47
and Figure 4.48. The first principal component explains 38% of the total
variation in the last step of the search, which is certainly a good result
starting from 28 responses, but we could not expect this projection to
be fully representative of the multivariate structure of the data. Highly
negative scores tend to be less so towards the end of the search, when
an increasing number of aging and poor municipalities are induded in the
fitted subset.
Projection onto the second principal component (Figure 5.38) shows, on
the contrary, a marked negative outlier, corresponding to Bologna (unit
6), the largest and richest (according to the available data) municipality in
Emilia-Romagna, but also a community with an aging population. Hence,
the outlyingness of Bologna is not unexpected in this plot, in view of our
interpretation of the second principal component. There also seems to be
a duster of a few less extreme outliers on the second principal component,
induding the towns of Parma (unit 210), Modena (159), Piacenza (262)
and Ferrara (68), and the municipalities of Casalecchio di Reno (11, in the
suburbs of Bologna) and Porretta Terme (49, a touristic and spa resort in
the Apennines). This duster is particularly dear fromm= 280 onwards.
We end our principal components analysis of municipalities in Emilia-
Romagna by looking at Figure 5.39, the scatterplot of individual scores on
the first two components computed at step m = 321 before the indusion of
the outliers. The last units to enter are numbered in the plot, which shows
that many of them have low scores on the first component whilst their
scores on the second component are unremarkable. The plot shows even
5.9 Municipalities in Emilia-Romagna 271

5!

II')

j"' u;>

II')

200 250 300


Subset slze m

FIGURE 5.37. Transformed Emilia-Romagna data: forward plots of first principal


component scores

- ·6

0 ,---~--------------------------------
200 250 300
Subset slze m

FIGURE 5.38. Transformed Emilia-Romagna data: forward plots of second prin-


cipal component scores
272 5. Principal Components Analysis

0
u
Q.
-g
~
cn
'l'

'!"
Porretta.. C~salecchii_>, d
Ferrart Piacent'cl' 0 ena

"' +Parma

"\'
+ Bologna

-10 -5 0 5

First PC

FIGURE 5.39. Transformed Emilia-Romagna data: scatterplot of individual


scores on the first two components, computed at step m = 321 before the in-
clusion of the outliers. The numbers of the last units to enter are given together
with the names of the units with low scores on the second principal component
in Figure 5.38

more clearly the outlying nature of Bologna and the other six communities
mentioned in the last paragraph as having low scores on the second com-
ponent in Figure 5.38. This plot shows how principal components analysis,
combined with the forward search, can reveal the structure of the data.
In many examples the presence of outliers obscures the structure when all
observations are fitted. Plots, as here, of quantities calculated earlier in
the search are more informative about outliers and, more importantly, the
structure of the majority of the data.

5.10 Further reading


Principal components analysis for normal data is described in classical
books on multivariate analysis, for example Mardia, Kent, and Bibby (1979,
Chapter 8) and Krzanowski (2000) in several chapters. A book length treat-
ment of the subject is Jolliffe (2002).
A mathematically detailed introduction to principal components analy-
sis is Flury (1997, Chapter 8). As weil as the data we have analysed on
the heads of young Swiss men, Flury also presents, on pp. 626-7, data on
the heads of 59 young Swiss women. Flury and Riedwyl (1988, Chapter
10) analyse the data on Swiss bank notes. Since they do not use standard-
5.10 Further reading 273

ised variables, their numbers for percentages of variance explained by the


components are slightly different from ours.
Principal components is of continuing importance because it offers a way
of compressing the large amounts of data available on computer databases.
Recent developments include Hubert, Rousseeuw, and Verboven (2002)
who use projection pursuit to find components when the number of vari-
ables is very much greater than the number of observations. Croux and
Haesbroeck (2000) find the properties of robust methods of principal com-
ponents analysis in the situation we have considered when n > v.
Other analyses of the milk data include Atkinson (1994) and Caussinus
and Ruiz-Gazen (1995) , who use robust methods. Although such a stan-
dard robust analysis may find principal components unaffected by subsets
and outliers and can also determine the outliers, it cannot provide the con-
nection between individual units and estimates which is provided by our
analysis using the forward search.
274 Ii. Princinal Comnonents Analvsis

FIGURE 5.40. The angle() between x = (x1,x2)T and y = (y1,Y2f

5.11 Exercises
Exercise 5.1 Show that the cosine of the angle () between vectors x
(x1, x2)T and y = (Y1, Y2)T in Figure 5.40 is given by
X1Y1 + X2Y2
cos(O) = llxiiiiYII . (5.25)

Exercise 5.2 Given two vectors x and y, find an expression for the vector
x which represents the projection of x onto y (see Figure 5.41). What is the
expression which defines the length of x? Show that the vector x minimizes
the function llx- xll 2.

Exercise 5.3 Given a set of independent vectors (y1 , Y2, ... , Yk) , find a set
of mutually orthonormal vectors (z 1 , z2 , . .. , zk) with the same linear span.
Prove that the expression for the vector fj which represents the projection
of Yk on the linear span of Y1, Y2, . . . , Yk-1 is given by fj = Z zr
Yk where
Z = (z1, Z2, .. . , Zk-1)·
Exercise 5.4 Let the squared distance of a point y from the origin be given
by
(5.26)
5.11 Exercises 275

X - X
A

x=?
Length of the projection =?
FIGURE 5.41. The projection of vector x on y

(5 .27)

is a 2 x 2 symmetric positive definite matrix. Show that all points at a dis-


tance c lie on an ellipse whose axes are given by the eigenvectors of A with
length proportional to the reciprocal of the square roots of the eigenvalues.
Sketch a picture of the result. Generalize it to p dimensions.

Exercise 5.5 Draw the ellipse given by the equation

(5.28)

Give the equation of the ellipse in canonical form, the equation of the
straight lines which define the major and minor axes of the ellipse, and
calculate the length of the semiaxes.
If a point on the ellipse in canonical form has a value of 0.3 for z 1 , the
first canonical coordinate, find its z 2 coordinate. H ence find the coordinates
of the point on the original ellipse.

Exercise 5.6 Let B be a positive definite matrix of dimension p x p with


eigenvalues Al ;:::: A2 ;:::: ... ;:::: Ap and associated normalized eigenvectors
276 5. Principal Components Analysis

')'1, ... , 'Yp· Show that

xTßx
max - T - = >.1, attained when x = 'Y1 (5.29)
xfO X X
xTßx
min-T-
xfO X X
= Ap, attained when x = ')'p. (5.30)

M oreover, show that

xTBx
max - T - = Ak+l, attained when x = 'Yk+l, k = 1, ... ,p- X5.31)
x,t'O.l./'1,···,/'k X X

Exercise 5. 7 Consider the approximation A = (a 1 , a 2 , ... , an)T of rank r


to the matrix of residuals Y =(I- JJT jn)Y. The error approximation is
quantified as the sum of squares of the differences between the elements of
the two matrices

(5.32)
i=l i=l j=l

where f)i = (Yi- y) = (Yil - Y1, · · ·, Yiv - Yv)T ·


a) Show that the error approximation is minimized when A is defined as

(5.33)

where G = (g 1 , ... , 9r) is the matrix which contains the eigenvectors corre-
sponding to the r biggest eigenvalues of "Eu = y T y / (n - 1).
b) Show that the sum of squares of the errors is given by
n V

(5.34)
i=l i=r+l

where lr+l ;::: l r +2 ;::: · · · ;::: lv are the (v- r) smallest eigenvalues of "Eu.
Note that ai = GGT Yi is the projection of the Yi into the space spanned
by 91, ... , 9r and 2:: a'[ ai is the sum of the squared lengths of the projected
deviations. So, what are the geom etrical interpretations of principal com-
ponent analysis ?
c) What are the coefficients of the best approximating plane g 1 , ... , 9r?
d} Why in equation {5.32}, without loss of generality, can we consider
vectors Yi instead of Yi ? What is the geometrical interpretation of this as-
pect?
5oll Exercises 277

Exercise 5.8 If the matrix Y is defined as

10 1.5
y = ( 6 005
5 2
what can you say, a priori, about the correlations between variables and
principal components using standardized and unstandardized variables? What
are your expectations about the lengths and the orientation of the arrows
which represent the variables in the biplot using both standardized and un-
standardized variables?
Using a computer program calculate the matrices A and B which form
the basis for the construction of the biplot using unstandardized variables
when a = 0 and o: = 10
Exercise 5.9 Let Zk = (zc 1 , 000, zck) be the matrix containing the first
k columns of the matrix Y G Show that the percentage of variance of the
0

variable iJc J explained by the first k principal components (R~c IZ) can be
J
partitioned as R~Ycj IZk = r~
Yc J 1zc 1
+ r~YcJ 1z c 2 + 000+ r~Yc j 1z c k 0 Show that

k 2 l
"" gji i (5035)
L So2 '
i=l J

where 9ii is the jth element of the ith eigenvectoro


Exercise 5.10 Show that the Euclidean distance between vectors Yi and Yi
(rows i and j of matrix Y) can be written as

{ q( i) - q(j)} T yyT { q( i) - q(j)}

where q( i) is an n x 1 vector with all elements equal to 0 except that in


the i th position which is equal to 10 Show that the M ahalanobis distance
between vectors Yi and Yi can be written as

{q(i)- q(j)}TYf:~lyT {q(i)- q(j)} o

Exercise 5.11 Show that Euclidean distance is invariant under orthogonal


transformationso
Exercise 5.12 Show that the Euclidean distance between two units in the
space of the standardized principal components is equivalent to the Maha-
lanobis distance between vectors Yi and Yi
Exercise 5.13 How can you interpret the distance between units i and j,
the lengths of the arrows associated with vectors bj, and the cosine of the
angles between vectors bi in the biplot when a = 0 and o: = 1 ?
278 5. Principal Components Analysis

Exercise 5.14 How can you interpret the distance between units i and j,
the lengths of the arrows associated with vectors bj, and the cosine of the
angles between vectors bj in the biplot when a = 0 and a = 0?

5.12 Salutions

Exercise 5.1
Westart by noticing that from Figure 5.40, by definition, cos(BI) = xdllxll
and cos(B2) = vdllvll, sin(Bt) = x2/llxll and sin(B2) = Y2/IIYII· Now

cos(B) = cos(B2- BI)= cos(B2) cos(Bt) + sin(B2) sin(Ot). (5.36)

Using the former expressions we can write

Yl X! Y2 X2 XtYl + X2Y2 XT y
cos(B) = cos(B2 - Bt) = TIYIT~ + TIYIT~ = llxll IIYII = llxll IIYII ·
(5.37)
Since cos(90°) = cos(270°) = 0 and cos(B) = 0 only if xT y = 0, x and y
are perpendicular when xT y = 0.

Exercise 5.2
lf B is the angle between x and y (see Figure 5.42), the length of the
projection is given by
jxTyj
Lxl cosBj = llxll IcosBj = llxll llxll IIYII (5.38)

The vector x which defines the projection of x on y can be written as


x = ty where t is a real number such that llxll = iityii = itlliYII = ~~~:~I
(see equation 5.38). It is easy to checkthat t = TIBf lliTI· With this choice of
t, llxll = iityii = itiiiYII = 1 fl:~l lliTIIIYII = '11:~ 1 . To summarize, the vector
which represents the projection of x on y is given by
, xTy y
(5.39)
X = TIYIT TIYiT '
while the length of the projection is given by

(5.40)

Another way to derive the expression for the vector x comes from noticing
that vectors x- x = x- ty and x = ty are orthogonal (see Figure 5.42).
5.12 Solutions 279

A
X - X

Length of the projection =


llxll I cos 8 I= I xTyl I II yll

FIGURE 5.42. Projection of the vector x on y.

The orthogonality requirement implies that

(X - Xf X = (X - ty) T ty = (X - ty f y = 0. (5.41)

From the equation ( x - ty) T y = 0 we find that t = ~ Tiin.


In order to prove that the vector x = ty minimizes llx - xW we must
compute the minimum, with respect to t, of the function

(5.42)
Differentiating with respect to t we obtain

f'(t) = -2xr y + 2ti1YW·


Setting the former expression to zero we end up with the same expression
for t found before
xTy 1
t = TTYTT ITYTI ·
Given that f"(t) = 2IIYII 2 > 0, the value oft which has been found corre-
sponds to a minimum.

Exercise 5.3
Starting from a generic set of vectors y 1 , .. . , Yk, a set of orthogonal vectors
280 5. Principal Components Analysis

u 1 , ... , uk which span the same linear space can be constructed sequentially
as follows ( Gram-Schmidt orthogonalization process):

Note that uf Uj = 0 for i "I j = 1, ... , k. In order to convert the u's to


unit length we can define Zj = Uj / ;;:r;:;. Since (y'[ Zj )zj is the projection
of Yk on Zj (see Exercise 5.2), the projection of Yk on the linear span of
Y1, ... , Yk-1 can be written as

k-1
L_)y'[zj)Zj (5.43)
j=l

where Z = (z1, ... , Zk-d·


Exercise 5.4
By the spectral decomposition theorem A = r ArT = A1 11 'Y'f' + A2/21f, so

yr(A1/11'f' +A2121!)y
Al (yT 11 ) 2 + A2 (yT 12) 2.

Now, given that A1 and A2 are positive (because A is positive definite),


A1Yr + A2Y~ = 1 is an ellipse in Y1 = yT 11 and Y2 = yT /2 · The lengths of
the semiaxes of the ellipse are given by A;:- 1/ 2 and A2 1/ 2 .
It is easily verified that y = cA;:- 112 1 1 satisfies yT Ay = c2:

Al {(cA;:- 112 /l)T1d 2 + A2{(cA;:- 11211f12} 2


Al(cA;:-1/2)2(/f ld2 + A2(cA;:-l/2)2(1f /2)2
c2 + 0
c2.

Similarly, y = cA2 112 12 satisfies yT Ay = c 2. This implies that points at


distance c lie on an ellipse whose axes are given by the eigenvectors of
5.12 Solutions 281

Yz

FIGURE 5.43. Points with constant distance from the origin (p = 2, 1 :::; >.1 < >.2)

A with length proportional to the reciprocal of the square roots of the
eigenvalues. The constant of proportionality is c (see Figure 5.43). If the
number of dimensions p is greater than 2, the points y = (y_1, ..., y_p)^T which
have a constant distance c = √(y^T A y) from the origin lie on hyperellipsoids
c² = λ_1 (y^T γ_1)² + λ_2 (y^T γ_2)² + ··· + λ_p (y^T γ_p)² whose axes are given by
the eigenvectors of A. The half length in the direction γ_i, i = 1, ..., p,
is equal to c λ_i^{-1/2}, where λ_1, ..., λ_p are the eigenvalues of A. Note that
z_1 = γ_1^T y, ..., z_p = γ_p^T y can be recognized as the principal components
of y. This implies that the principal components lie in the directions of
the axes of a constant density ellipsoid. Any point on the ith axis of the
ellipsoid has y coordinates proportional to γ_i and principal component coordinates of the
form

(0, ..., 0, z_i, 0, ..., 0)^T.   (5.45)



Naturally, if μ ≠ 0, it is the mean centred principal component z_i = γ_i^T (y - μ)
which has zero mean and lies in the direction of γ_i.

Exercise 5.5
The purpose of this exercise is to apply in practice what we have learnt in
Exercise 5.4. Equation (5.28) can be written as a quadratic form

(y - μ)^T A (y - μ) = (y_1 - 1.5, y_2 - 1) [[4, 2], [2, 3]] (y_1 - 1.5, y_2 - 1)^T = 1,   (5.46)

where μ = (1.5, 1)^T. Using the spectral decomposition, A = ΓΛΓ^T can be decomposed as

A = [[4, 2], [2, 3]] = (γ_1  γ_2) diag(λ_1, λ_2) (γ_1  γ_2)^T
  = [[0.615, 0.788], [-0.788, 0.615]] [[1.44, 0], [0, 5.56]] [[0.615, -0.788], [0.788, 0.615]].

From the results of Exercise 5.4, the equation of the ellipse in canonical
form is

λ_1 z_1² + λ_2 z_2² = 1,   that is   1.44 z_1² + 5.56 z_2² = 1,

where z = (z_1, z_2)^T = Γ^T (y - μ), z_1 = (y - μ)^T γ_1 = 0.615(y_1 - 1.5) -
0.788(y_2 - 1) and z_2 = (y - μ)^T γ_2 = 0.788(y_1 - 1.5) + 0.615(y_2 - 1). The
lengths of the semiaxes are

1/√λ_1 = 0.834   and   1/√λ_2 = 0.424.

The equation of the major axis (the one associated with λ_1), remembering
equation (5.45), can be found by putting z_2 = 0 and is

y_2 = -(0.788/0.615)(y_1 - 1.5) + 1 = -1.28(y_1 - 1.5) + 1.

Similarly, the equation of the straight line which defines the minor semiaxis
can be obtained by putting z_1 = 0 and is given by

y_2 = (0.615/0.788)(y_1 - 1.5) + 1 = 0.78(y_1 - 1.5) + 1.   (5.47)

Finally, the z_2 coordinates of the point (say A') which has z_1 = 0.3 are given
by

z_2 = ±√{(1 - 1.44 × 0.3²)/5.56} = ±0.3956.

The corresponding coordinates of point A' in terms of y (point A say) can
be found using y = Γz + μ. We have that

(y_1, y_2)^T = [[0.615, 0.788], [-0.788, 0.615]] (0.3, 0.3956)^T + (1.5, 1)^T = (1.996, 1.007)^T.

It is easy to check that

(1.996 - 1.5, 1.007 - 1) [[4, 2], [2, 3]] (1.996 - 1.5, 1.007 - 1)^T = 1.

All these quantities are illustrated in Figure 5.44.

FIGURE 5.44. Ellipse in canonical and non-canonical form
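The numbers above can be reproduced with a few lines of numpy; the sketch below is not part of the book, and the signs of the computed eigenvectors may differ from those printed, but the point obtained still lies on the ellipse (5.46).

```python
# Numerical check of Exercise 5.5 (illustrative sketch; eigenvector signs are arbitrary).
import numpy as np

A = np.array([[4.0, 2.0], [2.0, 3.0]])
mu = np.array([1.5, 1.0])

lam, Gamma = np.linalg.eigh(A)            # eigenvalues in increasing order
print(np.round(lam, 2))                   # [1.44 5.56]
print(np.round(1 / np.sqrt(lam), 3))      # semiaxis lengths: [0.834 0.424]

z1 = 0.3
z2 = np.sqrt((1 - lam[0] * z1**2) / lam[1])   # the point A' with z_1 = 0.3 on the ellipse
print(np.round(z2, 4))                        # 0.3956

y = Gamma @ np.array([z1, z2]) + mu           # back to the original coordinates, y = Gamma z + mu
print(np.round((y - mu) @ A @ (y - mu), 3))   # 1.0, so y lies on the ellipse (5.46)
```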



Exercise 5.6
For a fixed x_1 ≠ 0, x_1^T B x_1 / (x_1^T x_1) has a constant value of x^T B x where x =
x_1 / √(x_1^T x_1) has unit length. This implies that, without loss of generality,
we can prove the result for any normalized vector x^T x = 1. Now, let Γ
be the orthogonal matrix whose columns are the eigenvectors γ_1, ..., γ_p of
matrix B and Λ the diagonal matrix with the eigenvalues along the main
diagonal. Finally, let y = Γ^T x. Note that x ≠ 0 implies y ≠ 0. Now, using
the spectral decomposition of the matrix B, the quadratic form x^T B x
can be written

x^T B x = x^T Γ Λ Γ^T x = y^T Λ y = Σ_{i=1}^{p} λ_i y_i² ≤ λ_1 Σ_{i=1}^{p} y_i² = λ_1.

In order to prove that the maximum value is attained when x = γ_1, note
that setting x = γ_1 gives y = Γ^T γ_1 = (1, 0, ..., 0)^T, since

γ_k^T γ_1 = 1 if k = 1 and γ_k^T γ_1 = 0 if k ≠ 1.   (5.48)

For this choice of x

x^T B x = γ_1^T B γ_1 = γ_1^T Γ Λ Γ^T γ_1 = y^T Λ y
        = (1, 0, ..., 0) diag(λ_1, λ_2, ..., λ_p) (1, 0, ..., 0)^T = λ_1.

To prove the final part of the exercise, note that when x is perpendicular
to the first k eigenvectors γ_i, the vector y becomes

y = Γ^T x = (γ_1^T x, ..., γ_k^T x, γ_{k+1}^T x, ..., γ_p^T x)^T = (0, ..., 0, y_{k+1}, ..., y_p)^T.   (5.49)

Consequently, the quadratic form x^T B x can be written

x^T B x = Σ_{i=k+1}^{p} λ_i y_i² ≤ λ_{k+1} Σ_{i=k+1}^{p} y_i² = λ_{k+1}.

The maximum of this quadratic form (λ_{k+1}) is obtained when y_{k+1} = 1
and y_{k+2} = ··· = y_p = 0. It is easily verified from equation (5.49) that
these constraints are satisfied when x = γ_{k+1}.
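The inequality can be illustrated numerically; the following sketch (an invented positive definite matrix, not from the book) checks that the quadratic form over unit vectors never exceeds the largest eigenvalue and attains it at the corresponding eigenvector.

```python
# Numerical illustration of Exercise 5.6 with a random positive definite matrix B.
import numpy as np

rng = np.random.default_rng(1)
M = rng.normal(size=(4, 4))
B = M @ M.T + 4 * np.eye(4)               # a positive definite matrix

lam, Gamma = np.linalg.eigh(B)            # increasing eigenvalues, orthonormal eigenvectors

x = rng.normal(size=4)
x /= np.linalg.norm(x)                    # an arbitrary unit vector
print(x @ B @ x <= lam[-1] + 1e-12)                          # True for any unit x
print(np.isclose(Gamma[:, -1] @ B @ Gamma[:, -1], lam[-1]))  # maximum attained at the eigenvector: True
```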

Exercise 5.7
a) Consider a set of orthonormal vectors U = (u_1, ..., u_k) and, for fixed ỹ_i,
consider the approximation given by an arbitrary vector Ub_i. We start by
noticing that

ỹ_i - Ub_i = ỹ_i - UU^T ỹ_i + UU^T ỹ_i - Ub_i = (I - UU^T) ỹ_i + U(U^T ỹ_i - b_i).   (5.50)

Using equation (5.50), the error sum of squares becomes

(ỹ_i - Ub_i)^T (ỹ_i - Ub_i) = ỹ_i^T (I - UU^T) ỹ_i + (U^T ỹ_i - b_i)^T (U^T ỹ_i - b_i).   (5.51)

Note that the cross product vanishes because (I - UU^T)U = U - UU^T U =
U - U = 0. The final term in equation (5.51) is positive unless b_i is chosen
so that b_i = U^T ỹ_i. With this choice of b_i, Ub_i = UU^T ỹ_i is the projection
of ỹ_i on the plane spanned by the orthonormal vectors u_1, ..., u_k (see Exercise 5.3).
In other words, for fixed U, the vector ỹ_i is best approximated
by its projection onto the space spanned by u_1, ..., u_r. When a_i is chosen
as UU^T ỹ_i, the sum of the nv squared errors becomes

Σ_{i=1}^{n} Σ_{j=1}^{v} (ỹ_{ij} - a_{ij})² = Σ_{i=1}^{n} (ỹ_i - UU^T ỹ_i)^T (ỹ_i - UU^T ỹ_i)   (5.52)
  = Σ_{i=1}^{n} ỹ_i^T ỹ_i + Σ_{i=1}^{n} ỹ_i^T UU^T ỹ_i - 2 Σ_{i=1}^{n} ỹ_i^T UU^T ỹ_i
  = Σ_{i=1}^{n} ỹ_i^T ỹ_i - Σ_{i=1}^{n} ỹ_i^T UU^T ỹ_i.   (5.53)

The first term in equation (5.53) does not depend on U, therefore the
sum of squares of the errors can be minimized by maximizing the last term
in the former equation. From the geometrical point of view

Σ_{i=1}^{n} ỹ_i^T UU^T ỹ_i = Σ_{i=1}^{n} ‖UU^T ỹ_i‖²   (5.54)

is the sum of the squared lengths of the projections of the deviations. In other
words, seeking the plane which minimizes the squared distances between
the v dimensional observations and the plane is equivalent to looking for
the plane in which the projections of the observations ỹ_i have the largest
spread.
Using the properties of the trace we obtain

Σ_{i=1}^{n} ỹ_i^T UU^T ỹ_i = Σ_{i=1}^{n} tr(U^T ỹ_i ỹ_i^T U) = tr{U^T (Σ_{i=1}^{n} ỹ_i ỹ_i^T) U} = (n - 1) tr(U^T Σ̂_u U) = (n - 1) Σ_j u_j^T Σ̂_u u_j.

From Exercise 5.6, we know that the quadratic form u_1^T Σ̂_u u_1 is maximized
when u_1 = g_1, where g_1 is the first eigenvector corresponding to the first
eigenvalue of matrix Σ̂_u. For u_2 perpendicular to u_1, u_2^T Σ̂_u u_2 is maximized
by g_2. In r dimensions U = (u_1, ..., u_r) = (g_1, ..., g_r) and a_i = GG^T ỹ_i.
Consequently A^T_{(v×n)} = GG^T (ỹ_1, ..., ỹ_n). In other words, the r dimensional
plane which minimizes the sum of squares of the distances between the
observations ỹ_i and the plane is determined by g_1, ..., g_r, a new basis.
b) In this part of the exercise we derive the error bound for the sum of
squares of the approximation. We start by noticing that when u_i = g_i,

u_i^T Σ̂_u u_i = g_i^T Σ̂_u g_i = l_i g_i^T g_i = l_i.   (5.55)

So, tr(U^T Σ̂_u U) = l_1 + l_2 + ··· + l_r. Now, using equation (5.53) and result (5.55),

Σ_{i=1}^{n} Σ_{j=1}^{v} (ỹ_{ij} - a_{ij})² = Σ_{i=1}^{n} ỹ_i^T ỹ_i - Σ_{i=1}^{n} ỹ_i^T UU^T ỹ_i
  = (n - 1) tr Σ̂_u - (n - 1) Σ_{i=1}^{r} l_i
  = (n - 1) Σ_{i=1}^{v} l_i - (n - 1) Σ_{i=1}^{r} l_i
  = (n - 1) Σ_{i=r+1}^{v} l_i.

c) The matrix A^T_{(v×n)} = GG^T (ỹ_1, ..., ỹ_n) can be written

A^T_{(v×n)} = (g_1, ..., g_r) [[ỹ_1^T g_1, ..., ỹ_n^T g_1], [ỹ_1^T g_2, ..., ỹ_n^T g_2], ..., [ỹ_1^T g_r, ..., ỹ_n^T g_r]]_{(r×n)}.   (5.56)

So, the ith element (column) of A^T can be written

a_i = (ỹ_i^T g_1) g_1 + (ỹ_i^T g_2) g_2 + ··· + (ỹ_i^T g_r) g_r.   (5.57)

It follows that the coefficients of g_k (k = 1, ..., r) are given by g_k^T ỹ_i,
the kth sample principal component evaluated at the ith observation. In other
words, the coefficients of the approximating plane (new basis) (g_1, ..., g_r)
are the sample principal components.
d) In the final part of this exercise we show why, without loss of generality,
we can consider vectors ỹ_i instead of y_i. Using vectors y_i, equation (5.51)
can be written as

Σ_{i=1}^{n} (y_i - a - Ub_i)^T (y_i - a - Ub_i).   (5.58)

From the geometrical point of view a + Ub_i is the plane determined by U =
(u_1, ..., u_r), consisting of all points passing through a for some b_i. We look
for the best approximating plane of dimension r determined (spanned) by
U = (u_1, ..., u_r), which passes through a and minimizes the sum of squared
distances between the observations y_i and the plane.

Without loss of generality, we can assume that Σ_{i=1}^{n} b_i = 0, because, if
Σ_{i=1}^{n} b_i = n b̄ ≠ 0, we can use a* and b_i* where a* = (a + U b̄) and b_i* = (b_i - b̄),
so that

a + Ub_i = (a + U b̄) + U(b_i - b̄) = a* + Ub_i*.   (5.59)

Now, after adding and subtracting ȳ in equation (5.58), we obtain

Σ_{i=1}^{n} (y_i - a - Ub_i)^T (y_i - a - Ub_i)
  = Σ_{i=1}^{n} (y_i - ȳ - Ub_i + ȳ - a)^T (y_i - ȳ - Ub_i + ȳ - a)
  = Σ_{i=1}^{n} (y_i - ȳ - Ub_i)^T (y_i - ȳ - Ub_i) + n(ȳ - a)^T (ȳ - a)   (5.60)
  ≥ Σ_{i=1}^{n} {y_i - ȳ - GG^T (y_i - ȳ)}^T {y_i - ȳ - GG^T (y_i - ȳ)}.

The last term in equation (5.60) is positive unless a is chosen as ȳ. In
other words, the best approximating plane must necessarily pass through
the sample mean, so without loss of generality we can work with the vectors ỹ_i
instead of y_i.
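The result of parts a) and b) is easy to verify numerically. The sketch below (simulated data, not from the book) projects the centred observations onto the first r eigenvectors of the sample covariance matrix and checks that the residual sum of squares equals (n - 1) times the sum of the discarded eigenvalues.

```python
# Numerical check of Exercise 5.7, parts a) and b), on simulated data.
import numpy as np

rng = np.random.default_rng(2)
Y = rng.normal(size=(30, 5))
Ytilde = Y - Y.mean(axis=0)                     # deviations from the mean
S = Ytilde.T @ Ytilde / (Y.shape[0] - 1)        # sample covariance matrix

lam, G = np.linalg.eigh(S)
lam, G = lam[::-1], G[:, ::-1]                  # decreasing eigenvalue order

r = 2
Gr = G[:, :r]
A = Ytilde @ Gr @ Gr.T                          # projections a_i = G G' y_i (as rows)
err = np.sum((Ytilde - A) ** 2)
bound = (Y.shape[0] - 1) * lam[r:].sum()        # (n - 1) * sum of the discarded eigenvalues
print(np.isclose(err, bound))                   # True, as in part b)
```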

Exercise 5.8
The purpose of this exercise is to show in geometric terms why we must
make a preliminary standardization of the data when the variables have
different magnitudes.
Ỹ can be written in terms of deviations from the mean as Ỹ = (ỹ_{c1}, ...,
ỹ_{cv}), where ỹ_{cj} is the n × 1 vector forming the jth column of Ỹ.
As we have seen from Exercise 5.7, the best r dimensional approximation
of Ỹ which minimizes the equation

‖Ỹ - A‖² = Σ_{i=1}^{n} Σ_{j=1}^{v} (ỹ_{ij} - a_{ij})²   (5.61)

can be found by applying principal components analysis.
From the point of view of the variables, equation (5.61) can be rewritten
as

Σ_{j=1}^{v} L_j² = Σ_{j=1}^{v} ‖ỹ_{cj} - a_{cj}‖²,   (5.62)

where a_{cj} is the jth column of the n × v matrix A. As we have seen in Exercise 5.7
if, for example, A is of rank 1, a_{cj} is the jth column of the matrix
Ỹ g_1 g_1^T. In this case the jth column of Ỹ is approximated by a multiple g_{j1}
(j = 1, ..., v) of the n dimensional vector (line) Ỹ g_1 which represents the
first principal component. This implies that the first principal component
Ỹ g_1 minimizes the sum of the squared distances L_j² from the deviation

TABLE 5.3. Centred matrix Ỹ and squared lengths of the vectors associated with
the three variables

  ỹ_{c1}    ỹ_{c2}    ỹ_{c3}
   3        0.17     -1.33
  -1       -0.83      1.67
  -2        0.67     -0.33

‖ỹ_{c1}‖² = 14    ‖ỹ_{c2}‖² = 1.17    ‖ỹ_{c3}‖² = 4.67

TABLE 5.4. Correlation between variables and first principal component using
unstandardized and standardized variables

Variable number   Unstandardized variables   Standardized variables
      1                  0.983                     0.619
      2                  0.183                     0.786
      3                 -0.752                     1

vector ỹ_{cj} = y_{cj} - ȳ_j 1 to a line, and so on. Naturally, the longer deviation
vectors ỹ_{cj} = (ỹ_{cj1}, ..., ỹ_{cjn})^T (those with larger s_j) have the most
effect on the minimization of Σ L_j². Table 5.3 gives the centred matrix Ỹ
and the squared lengths of the vectors associated with the columns of Ỹ.
Note that the length of ỹ_{c1} is much greater than that of the vectors associated
with the second and third columns of Ỹ. This implies that the
first column of Ỹ will exert a great influence on the minimization of equation (5.62).
Figure 5.45, which represents the 3 vectors associated with the
3 columns of the matrix Ỹ and the line associated with the first principal
component, shows that, in this example, the first principal component is
highly attracted by the first variable ỹ_{c1}. In other terms, the angle between
ỹ_{c1} and the vector which represents the first principal component is very
small. Table 5.4, which gives the correlations between the variables and the
first principal component using standardized and unstandardized variables,
shows that using unstandardized variables the correlation with variable 1
is much greater than those with the other variables. Finally, the ordering of the magnitudes
of the correlations between the first principal component and the variables
exactly matches the ordering of their lengths.
On the other hand, if the variables are standardized, they have equal
lengths and exert equal influence in the minimization of Σ_{j=1}^{v} L_j². Figure 5.46,
which represents (using the same 3 dimensional point of view as
Figure 5.45) the 3 vectors associated with the 3 columns of Ỹ and the line

FIGURE 5.45. Principal components using unstandardized variables. Each arrow is associated with a column of the matrix Ỹ given in Table 5.3. The line associated with the first principal component is strongly correlated with the longer vectors

associated with the first principal component, shows that the first principal
component is virtually equidistant from the three vectors.
Let us now see what we can say a priori about the biplot. The biplot
for the unstandardized variables will show one arrow (the one for variable
1) much longer than the others. When the variables are standardized, the
matrix which contains all the standardized variables will have rank 2. This
implies that the biplot for standardized variables will give a perfect representation
of the original rank 2 matrix. In this case, the lengths of the
arrows will be exactly the same. Finally, one of the arrows in the biplot
will be parallel to one of the axes because the matrix Ỹ has rank 2.
Figure 5.47, which shows the biplot for unstandardized (left panel) and
standardized data (right panel) when a = 0 and a = 1, illustrates graphically
all the concepts just described.

FIGURE 5.46. Principal components using standardized variables. Each arrow is associated with a column of the matrix Ỹ given in Table 5.3. In this case each variable exerts equal influence on the choice of the direction which minimizes the sum of the squared lengths

If the matrix Ỹ is unstandardized, the matrices U, L and G are

U = [[-0.8165, 0.004683, 0.5774], [0.4042, -0.7094, 0.5774], [0.4123, 0.7048, 0.5774]],

L = diag(8.1044, 1.8123, 0),

G = [[-0.9136, -0.3603, -0.1883], [-0.0492, 0.5577, -0.8286], [0.4026, -0.7477, -0.5273]].
FIGURE 5.47. Biplot using unstandardized (left) and standardized (right) variables when a = 0 and a = 1. Note that this figure has two scales. The upper right is for the values of the rows of B, which are shown by arrows. The lower left is for the values of the rows of A, which are shown by points. On the right panel we have also superimposed a circle with radius 1 to show that all arrows have the same length

When a = 0 and a = 1,

A = √(n - 1) U_(2) = √2 [[-0.8165, 0.004683], [0.4042, -0.7094], [0.4123, 0.7048]].   (5.63)

The three rows of the matrix A evaluated in equation (5.63) correspond
to the 3 points which have been numbered 1, 2 and 3 in the left panel of
Figure 5.47. Also,

B = G_(2) L_(2)^{1/2} = [[-0.9136, -0.3603], [-0.0492, 0.5577], [0.4026, -0.7477]] diag(√8.1044, √1.8123).   (5.64)

The three rows of B evaluated in (5.64) correspond to the 3 arrows which
in the left panel of Figure 5.47 have been labelled y1, y2 and y3.
Now, given that every entry of Ỹ can be interpreted as an inner product
(see equation 5.24), consider our expectations about the position of the
units in the biplot. Unit 1, the first row of Ỹ, had a high value for ỹ_{c1} (first
column of matrix Ỹ). So we expect that its position in the biplot will be
close to the arrow for ỹ_{c1}. Similarly, unit 3 was characterized by high values
of ỹ_{c3}, so its position in the biplot will be close to the arrow associated
with ỹ_{c3}. Figure 5.47 shows that this is indeed the case.
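The decomposition and the biplot coordinates above can be reproduced from Table 5.3 with a singular value decomposition; the sketch below is illustrative only, and the signs of the singular vectors returned by numpy may differ from those printed.

```python
# Reproducing the decomposition of the centred matrix of Table 5.3 (signs may differ).
import numpy as np

Ytilde = np.array([[ 3.0,  0.17, -1.33],
                   [-1.0, -0.83,  1.67],
                   [-2.0,  0.67, -0.33]])
Ytilde = Ytilde - Ytilde.mean(axis=0)     # re-centre to remove rounding in the printed table
n = Ytilde.shape[0]

U, s, Gt = np.linalg.svd(Ytilde)          # Ytilde = U diag(s) G'
L = s**2 / (n - 1)
print(np.round(L, 4))                     # about [8.1044 1.8123 0.]

A = np.sqrt(n - 1) * U[:, :2]             # row points when a = 0 and a = 1, equation (5.63)
B = Gt.T[:, :2] * np.sqrt(L[:2])          # arrows B = G_(2) L_(2)^{1/2}, equation (5.64)
print(np.allclose(A @ B.T, Ytilde))       # the rank two biplot reproduces the rank two matrix: True
```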

Exercise 5.9
In the multiple regression model y = Xβ + ε, R²_{y|X} is defined as

R²_{y|X} = regression sum of squares / total corrected sum of squares
        = (β̂^T X^T y - n ȳ²) / (y^T y - n ȳ²)
        = {y^T X (X^T X)^{-1} X^T y - n ȳ²} / (y^T y - n ȳ²).   (5.65)

In this exercise we have to find the expression which defines the percentage
of variance of the jth variable (dependent variable) extracted by the first
k principal components (explanatory variables). In this case y corresponds
to the jth column of the matrix Ỹ, while the matrix X corresponds to the
first k columns of the matrix ỸG, say Z_k = (z_{c1}, ..., z_{ck}).
In this case (X^T X) = (n - 1) cov(Z_k) = (n - 1) diag(s²_{z1}, ..., s²_{zk}) =
(n - 1) diag(l_1, ..., l_k), and X^T y = (n - 1) cov(Z_k, ỹ_{cj}). Now, given that
cov(Z_k) is diagonal, and that Ỹ and ỸG have zero mean, equation (5.65)
can be rewritten as

R²_{ỹ_{cj}|Z_k} = Σ_{i=1}^{k} {cov(z_{ci}, ỹ_{cj})² / (l_i s_j²)}.

Since cov(ỹ_{cj}, z_{ci}) = g_{ji} l_i, where g_{ji} is the jth element of the ith eigenvector,
and s²_{zi} = l_i,

R²_{ỹ_{cj}|Z_k} = (1/s_j²) Σ_{i=1}^{k} g²_{ji} l_i.

If the variables are standardized, the percentage of variance of the jth
variable extracted by the first k principal components can simply be written
as

Σ_{i=1}^{k} g²_{ji} l_i.

Exercise 5.10
The Euclidean distance between rows i and j of the matrix Y is defined as

d²_{ij} = (y_i - y_j)^T (y_i - y_j).

But the vector y_i^T can be written as q(i)^T Y, or y_i = Y^T q(i), so that

d²_{ij} = (y_i - y_j)^T (y_i - y_j) = {Y^T q(i) - Y^T q(j)}^T {Y^T q(i) - Y^T q(j)}
        = {q(i) - q(j)}^T Y Y^T {q(i) - q(j)}.   (5.66)

Given that the Mahalanobis distance between rows i and j of the matrix
Y is defined as

d²_{M,ij} = (y_i - y_j)^T Σ̂_u^{-1} (y_i - y_j),   (5.67)

it follows from equation (5.66) that equation (5.67) can be rewritten as

d²_{M,ij} = {q(i) - q(j)}^T Y Σ̂_u^{-1} Y^T {q(i) - q(j)}.   (5.68)

Exercise 5.11
If Z = YG, where G is an orthogonal matrix such that G^T G = GG^T = I,
the distance between rows i and j of the matrix Z can be written as

d²_{ij} = (z_i - z_j)^T (z_i - z_j) = {q(i) - q(j)}^T Z Z^T {q(i) - q(j)}
        = {q(i) - q(j)}^T Y G G^T Y^T {q(i) - q(j)}
        = {q(i) - q(j)}^T Y Y^T {q(i) - q(j)}.

This implies that the Euclidean distance between two units in the space of
the principal components (if all components are considered) is equal to the
Euclidean distance in the original space. Note that if not all components are
considered, (g_1, ..., g_r)^T (g_1, ..., g_r) = I_r, but (g_1, ..., g_r)(g_1, ..., g_r)^T ≠ I_v.

Exercise 5.12
The matrix which contains the standardized principal components can be
written as U* = YGL^{-1/2}. The squared distance between row i and row j
of U* can then be written as:

d²_{ij} = {q(i) - q(j)}^T YGL^{-1/2} (YGL^{-1/2})^T {q(i) - q(j)}
        = {q(i) - q(j)}^T Y G L^{-1} G^T Y^T {q(i) - q(j)}
        = {q(i) - q(j)}^T Y (G L G^T)^{-1} Y^T {q(i) - q(j)}
        = {q(i) - q(j)}^T Y Σ̂_u^{-1} Y^T {q(i) - q(j)}.

The last expression is the Mahalanobis distance between rows i and j in
the original space (see Exercise 5.10).
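A small numerical check of Exercises 5.10-5.12 (simulated data, not from the book): the Euclidean distance between two rows of the standardized principal components equals the Mahalanobis distance between the corresponding rows of the original matrix.

```python
# Standardized principal component distances versus Mahalanobis distances (illustrative sketch).
import numpy as np

rng = np.random.default_rng(3)
Y = rng.normal(size=(20, 4))
Ytilde = Y - Y.mean(axis=0)
S = np.cov(Ytilde, rowvar=False)               # sample covariance matrix
Sinv = np.linalg.inv(S)

lam, G = np.linalg.eigh(S)
Ustar = Ytilde @ G / np.sqrt(lam)              # standardized principal components Y G L^{-1/2}

i, j = 2, 7                                    # any pair of rows
d_pc = np.sum((Ustar[i] - Ustar[j]) ** 2)                          # squared Euclidean distance in U*
d_mah = (Ytilde[i] - Ytilde[j]) @ Sinv @ (Ytilde[i] - Ytilde[j])   # squared Mahalanobis distance
print(np.isclose(d_pc, d_mah))                 # True
```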

Exercise 5.13
When a = 0 and a = 1, A = √(n - 1) U_(2) and B = G_(2) L_(2)^{1/2}. Using the
result in Exercise 5.10, the distance between rows i and j of A can be
written as

d²_{ij} = (n - 1) {q(i) - q(j)}^T U_(2) U_(2)^T {q(i) - q(j)}
        = {q(i) - q(j)}^T Y G_(2) L_(2)^{-1/2} (Y G_(2) L_(2)^{-1/2})^T {q(i) - q(j)}
        = {q(i) - q(j)}^T Y G_(2) L_(2)^{-1} G_(2)^T Y^T {q(i) - q(j)}.   (5.69)

Now, from the spectral decomposition of the matrix Σ̂_u,

Σ̂_u^{-1} = G L^{-1} G^T.

This implies that the matrix G_(2) L_(2)^{-1} G_(2)^T in equation (5.69) can be recognized
as the best rank two approximation of the matrix Σ̂_u^{-1}. In conclusion,
the Euclidean distance between two points in the biplot (rows of matrix
A), when a = 0 and a = 1, can be interpreted as the best rank two approximation
of the Mahalanobis distance between the corresponding rows
in the original space.
Let us now check what interpretation we can give to the lengths and
cosines of the arrows associated with the rows of the matrix B = G_(2) L_(2)^{1/2}.
We must determine to what extent the (i, j)th element of BB^T (the scalar
product b_i^T b_j) approximates the (i, j)th element s_{ij} of the sample covariance
matrix Σ̂_u. In this case BB^T = G_(2) L_(2) G_(2)^T. Now, given that Σ̂_u
can be decomposed as G L G^T, it is easy to recognize that G_(2) L_(2) G_(2)^T
is the best rank two approximation of the matrix Σ̂_u. This implies that
the squared lengths b_j^T b_j of the arrows (diagonal elements of BB^T) are the best
rank two approximations of s_j². Similarly, the cosine of the angle between
two arrows, b_i^T b_j / (‖b_i‖ ‖b_j‖), can be interpreted as the best rank two approximation
of the correlation coefficient between the two corresponding
variables, that is s_{ij}/(s_i s_j). In this case the jth diagonal element of the
matrix BB^T is

b_j^T b_j = Σ_{i=1}^{2} g²_{ji} l_i.   (5.70)

From equation (5.35) of Exercise 5.9, it follows that if the variables have
been standardized, the squared length of vector (arrow) b_j is equal to the
percentage of variance of the jth variable explained by the first two principal
components.

Exercise 5.14
When a = 0 and a = 0, A = YG_(2) = √(n - 1) U_(2) L_(2)^{1/2}, and B = G_(2). The
distance between rows i and j of the matrix A can be written as

d²_{ij} = (n - 1) {q(i) - q(j)}^T U_(2) L_(2)^{1/2} (U_(2) L_(2)^{1/2})^T {q(i) - q(j)}
        = (n - 1) {q(i) - q(j)}^T U_(2) L_(2) U_(2)^T {q(i) - q(j)}.

The matrix (n - 1) U_(2) L_(2) U_(2)^T is the best rank two approximation of
YY^T = (n - 1) U L U^T. This implies that when a = 0 and a = 0, the distance
between two units in the biplot can be interpreted as the best rank
two approximation of the Euclidean distance in the original p dimensional
space.
However for B, BB^T = G_(2) G_(2)^T is not the best rank two approximation
of Σ̂_u.
6
Discriminant Analysis

6.1 Background
In discriminant analysis the multivariate observations are divided into g
groups, the membership of which is assumed known without error. The
purpose of the analysis is to develop a rule for the allocation of a new
observation of unknown origin to the most likely group. For example, in
the case of the Swiss bank notes there are two groups, genuine notes and
forgeries. The purpose of the analysis would be to develop a rule for
determining whether or not a new note was genuine.
We start in §6.2 with an outline of some theory for discriminant analysis.
The assumptions are not only that group membership is known, but also
that the observations have a multivariate normal distribution. In the more
general case the observations in each group have both a distinct mean and a
distinct covariance matrix. Application of maximum likelihood theory leads
to a classification rule which, in the space of the variables, has quadratic
boundaries. The more usual case is that of linear discriminant analysis
which arises when the groups have the same covariance matrix although,
of course, differing means.
Mention of the Swiss bank note data as an example was not fortuitous.
Although these data were believed to have two groups, our analysis has
shown that there are three groups and at least one misclassified note. Use
of such data as a training set on the assumption that all observations are
correctly categorised will not lead to optimal discrimination and may, of
course, lead to a very poor rule. Accordingly, we use the forward search to

see how the behaviour of the allocation rule changes as we add observations
to those used in discrimination.
With one group of observations we have seen how the search progresses,
ordering all observations by their Mahalanobis distances. In §6.3 we extend
the forward search to ordering and including units from several populations.
We then, in §6.4, describe the properties of the analysis which it is
informative to monitor during the forward search. These again include Ma-
halanobis distances as well as the probabilities of correct classification of
units and, for linear discriminant analysis, the composition of the planes
dividing the groups. The final theoretical material is in §6.5 where we ex-
tend the material on multivariate Box Cox transformations of Chapter 4
to discriminant analysis.
Our analyses of data start in §6.6 where we present a first analysis of
data on irises popularised by Fisher. Although the groups have differing
variances, we begin with a linear discriminant analysis. In the following
section we compare linear and quadratic discriminant analyses on some
data on electrodes where the two groups have very different variances.
We return to the iris data in §6.8 where we transform the data to obtain
more nearly equal variances in all groups. Despite the strong evidence for a
transformation, the performance of the linear discriminant analysis is little
affected by the transformation. This group of analyses concludes in §6.9
with the investigation of the effect of the three groups of the Swiss bank
note data on two group discriminant analysis.
The second half of the chapter covers two related analyses. In §6.10 we
analyse a set of simulated data. The data are more complicated than those
analysed earlier and are designed to have a structure similar to data on
muscular dystrophy that are analysed in the succeeding section. Both sets of
data require transformation: in the case of the data on muscular dystrophy
the transformation increases the discriminatory power of easily measured
variables compared with those that are more difficult to measure. We use
the analysis of the simulated data to highlight ways in which a diagnostic
analysis, starting from a fit to all the data, can fail when a complicated
structure of outliers is present in the data. The chapter concludes with
comments on the literature and suggestions for further reading.

6.2 An Outline of Discriminant Analysis


6.2.1 Bayesian Discrimination
We begin with a brief definition of discriminant analysis, which serves to
establish notation. Initially we assume that the prior probabilities of units
coming from a particular population are known, as are the parameters of
the statistical models for the populations.

Let π_l denote the prior probability of an individual coming from population
or group P_l, l = 1, ..., g, where g is the number of populations
considered. If we indicate by f(y|l) the density of the distribution of the
observations for population l, then the posterior probability that unit i
belongs to population l after observing y_i is:

p(l|y_i) = π_l f(y_i|l) / Σ_{k=1}^{g} π_k f(y_i|k),   i = 1, 2, ..., n.   (6.1)

Following the Bayes rule, we choose the population with maximum posterior
probability p(l|y_i). If we assume that P_l is a multivariate normal
population with mean μ_l and dispersion matrix Σ_l, the log of the numerator
of equation (6.1) becomes

-(v/2) log 2π - (1/2) log|Σ_l| - (1/2)(y_i - μ_l)^T Σ_l^{-1} (y_i - μ_l) + log π_l.   (6.2)

We allocate the unit to that population for which the posterior probability
is highest.

6.2.2 Quadratic Discriminant Analysis


In the absence of knowledge of the prior probabilities π_l, allocation is made
to that population for which the likelihood, or loglikelihood, is greatest.
That is, from (6.2), we allocate to the group for which

-(v/2) log 2π - (1/2) log|Σ_l| - (1/2)(y_i - μ_l)^T Σ_l^{-1} (y_i - μ_l)   (6.3)

is a maximum. Throughout this chapter we shall take these prior probabilities
as unknown, that is, the same for all populations.
To investigate the form of the boundary between groups we consider two
arbitrary groups which, without loss of generality, we take as groups one
and two. From (6.3) we allocate the observation y_i to Group 1 rather than
Group 2 if

-log|Σ_1| - (y_i - μ_1)^T Σ_1^{-1} (y_i - μ_1) > -log|Σ_2| - (y_i - μ_2)^T Σ_2^{-1} (y_i - μ_2),   (6.4)

a quadratic in y. The effect on the allocation rule of unequal prior probabilities
π_1 and π_2 is to change this boundary by a constant amount.
It is convenient to rewrite (6.4) in terms of squared population Mahalanobis
distances d_l². The observation is allocated to population 1 if

log|Σ_1| + d_1² < log|Σ_2| + d_2²,   where d_l² = (y_i - μ_l)^T Σ_l^{-1} (y_i - μ_l).   (6.5)

The allocation therefore depends on the log determinants of the dispersion
matrices and on the squared Mahalanobis distances.
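As an illustration of the rule (6.3)/(6.5), the following sketch (an invented two-group example with known parameters; it is not code from the book) allocates an observation to the group with the smaller value of log|Σ_l| plus the squared Mahalanobis distance.

```python
# A minimal sketch of quadratic discriminant allocation with known parameters.
import numpy as np

mu = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]
Sigma = [np.array([[1.0, 0.3], [0.3, 1.0]]),
         np.array([[2.0, -0.5], [-0.5, 1.5]])]

def quadratic_score(y, mu_l, Sigma_l):
    """log|Sigma_l| plus the squared Mahalanobis distance from group l."""
    d = y - mu_l
    return np.log(np.linalg.det(Sigma_l)) + d @ np.linalg.solve(Sigma_l, d)

def allocate(y):
    scores = [quadratic_score(y, m, S) for m, S in zip(mu, Sigma)]
    return int(np.argmin(scores)) + 1        # allocate to the group with the smallest score, as in (6.5)

print(allocate(np.array([0.5, 0.2])))        # 1
print(allocate(np.array([2.8, 1.1])))        # 2
```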

6.2.3 Linear Discriminant Analysis


If the covariance matrices of the groups are all equal, that is Σ_l = Σ, l =
1, ..., g, the quadratic rule (6.5) becomes that we allocate to Group 1 if

d_1² < d_2²,   (6.6)

that is, y_i is allocated to the group for which its Mahalanobis distance is
least.
It is informative to rewrite (6.6) as

(y_i - μ_1)^T Σ^{-1} (y_i - μ_1) < (y_i - μ_2)^T Σ^{-1} (y_i - μ_2).

The quadratic terms in y cancel and the rule can be rewritten as allocation
to Group 1 if

(μ_1 - μ_2)^T Σ^{-1} (y_i - μ) > 0,   (6.7)

where μ = (μ_1 + μ_2)/2. Thus, when the covariance matrices in the groups
are equal, the allocation boundary between groups is linear in y: the regions
of allocation are determined by v-dimensional hyperplanes.

6.2.4 Estimation of Means and Variances


The preceding discussion of discriminant analysis assumes that the parameters
μ_l and Σ_l are known. In practice they will usually have to be
estimated. Since the group memberships of all observations in the training
sets are known, we estimate the parameters of each group separately,
unless linear discriminant analysis is appropriate, when there is a common
covariance matrix Σ. Given training sets of size n_l from each population P_l, the
maximum likelihood estimates of the parameters μ_l and Σ_l are the means
and covariance matrices of these training sets: μ̂_l and Σ̂_l. The squared
Mahalanobis distance for observation y_i from population P_l is then

d²_{il} = (y_i - μ̂_l)^T Σ̂_l^{-1} (y_i - μ̂_l).   (6.8)

These estimates are then used in place of the known values in the quadratic
discrimination rule (6.3).
Since in this chapter the n sample members are divided into g groups it
is sometimes convenient to relabel the y values. Let y_{il} = (y_{i1l}, ..., y_{ivl})^T
be the v × 1 vector containing the readings for unit i belonging to group l
(i = 1, ..., n_l) and l = 1, ..., g.
If the hypothesis of equality among covariances is true, that is Σ_1 = Σ_2 =
··· = Σ_g = Σ, the training sets (n_1, ..., n_g) are pooled for estimation of Σ
to give an overall training set of size n = Σ_{l=1}^{g} n_l. The estimated within
groups covariance matrix is

Σ̂_w = Σ_{l=1}^{g} Σ_{i=1}^{n_l} (y_{il} - μ̂_l)(y_{il} - μ̂_l)^T / (n - g),   (6.9)

and the squared Mahalanobis distances (6.8) become

d²_{il} = (y_{il} - μ̂_l)^T Σ̂_w^{-1} (y_{il} - μ̂_l).   (6.10)
These Mahalanobis distances are, of course, important in the forward search
as well as in the assessment of cluster membership.
When there are two groups the linear maximum likelihood rule in (6.7)
with estimated parameters allocates y to Group 1 if and only if

(μ̂_1 - μ̂_2)^T Σ̂_w^{-1} {y - (μ̂_1 + μ̂_2)/2} > 0,   (6.11)

where Σ̂_w is the pooled within groups estimator of Σ defined in (6.9).
A likelihood ratio test for the equality of covariance matrices is given
in (2.23). In practice, even if the covariance matrices are not equal, linear
discriminant analysis is often preferred to quadratic discrimination; the
reduction in the number of parameters to be estimated to determine the
discrimination rule leads to a decrease in the variance of the estimates and
to an improvement in the performance of the estimated discrimination rule.
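The estimation step and the two-group linear rule can be sketched as follows (simulated training sets; the divisor n - g for the pooled estimate follows (6.9) as written above and is an assumption of this sketch, not code from the book).

```python
# Pooled within-groups covariance and the linear allocation rule (6.11), illustrative sketch.
import numpy as np

rng = np.random.default_rng(4)
y1 = rng.normal(loc=[0, 0], size=(40, 2))        # training set for group 1
y2 = rng.normal(loc=[2, 1], size=(60, 2))        # training set for group 2

mu1, mu2 = y1.mean(axis=0), y2.mean(axis=0)
Sw = ((y1 - mu1).T @ (y1 - mu1) + (y2 - mu2).T @ (y2 - mu2)) / (len(y1) + len(y2) - 2)

def allocate_linear(y):
    # allocate to group 1 if (mu1 - mu2)' Sw^{-1} {y - (mu1 + mu2)/2} > 0
    return 1 if (mu1 - mu2) @ np.linalg.solve(Sw, y - (mu1 + mu2) / 2) > 0 else 2

print(allocate_linear(np.array([0.1, -0.2])))    # 1, close to the first group mean
print(allocate_linear(np.array([2.2, 1.3])))     # 2, close to the second group mean
```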

6.2.5 Canonical Variates
In contrast to the probabilistic approach of §6.2.1, Fisher (1936) tackled
discrimination from a purely data-based standpoint. He supposed that one
was presented with g independent random samples, of sizes n_1, n_2, ..., n_g,
from g multivariate populations and that a method of best distinguishing
among these samples was required. The only assumption he made was
that the dispersion matrices of these populations were equal; otherwise the
populations were completely unspecified. With this assumption the data
can be summarized by computing the sample mean vectors ȳ_l, l = 1, ..., g,
and the pooled within-sample covariance matrix Σ̂_w in (6.9).
Fisher then looked for the linear combination z_{il} = a^T y_{il}, with a =
(a_1, ..., a_v)^T, that gave maximum separation of the group means, when
measured relative to the within group variance of the data. This is possible
since, if we specify the vector a, we convert each v-variate observation
y_{il} = (y_{i1l}, ..., y_{ivl})^T into a univariate observation z_{il}.
Given that the total sum of squares of the z_{il} can be partitioned into the
sum of between groups (SSB) and within groups (SSW) components,

Σ_{l=1}^{g} Σ_{i=1}^{n_l} (z_{il} - z̄)² = SSB(a) + SSW(a) = Σ_{l=1}^{g} n_l (z̄_l - z̄)² + Σ_{l=1}^{g} Σ_{i=1}^{n_l} (z_{il} - z̄_l)²,

our purpose is to choose the vector a which maximizes the ratio

SSB(a)/SSW(a) = Σ_{l=1}^{g} n_l (z̄_l - z̄)² / Σ_{l=1}^{g} Σ_{i=1}^{n_l} (z_{il} - z̄_l)².

If we use the unbiased estimators of the within groups and between groups
variances, we can rewrite the former ratio as:

F = {SSB(a)/(g - 1)} / {SSW(a)/(n - g)}.

The larger the value of this ratio, the more variability there is between
groups relative to within groups. The notation SSB(a) and SSW(a) emphasizes
that the choice of a determines the value of F; different choices of
the coefficients a = (a_1, ..., a_v)^T yield different values for the two sums of
squares and hence different values of F. The best choice of a will clearly be
the one which yields the largest F value. With this choice of a the resulting
values z_{il} will yield the one-dimensional projection of the sample points that
shows up differences among groups as much as possible.
To analyse this rule we need the unbiased between groups estimator of
the covariance matrix

Σ̂_B = Σ_{l=1}^{g} n_l (ȳ_l - ȳ)(ȳ_l - ȳ)^T / (g - 1).   (6.12)

It is possible to show (Exercise 6.6) that F is maximized when a is the
eigenvector corresponding to the largest eigenvalue of Σ̂_w^{-1} Σ̂_B.
This eigenvector determines the required linear combination z = a^T y.
In geometrical terms a gives the direction in the v-dimensional data space
along which the between group variability is greatest relative to the within
group variability. So far we have concentrated on finding a single direction
in the multivariate space in which to examine the differences between the
g groups. However, if g is large, a single direction will generally provide
a gross over-simplification of the true multivariate configuration, and between
group differences may still be obscured. In this case we have to find
a suitable two, three, or even higher dimensional space for adequate representation.
We can extend the argument in an analogous way to that for
principal components analysis in §5.2. Let l_1 > l_2 > ··· > l_s > 0 be the
eigenvalues of Σ̂_w^{-1} Σ̂_B, with a_1, a_2, ..., a_s the corresponding eigenvectors.
Then a_1 gives the direction in the v-dimensional data space along which
the between group variability is greatest relative to the within group variability;
a unit has score z_{il} = a_1^T y_{il} in this direction. The vector a_2 gives
the direction along which the between group variability is second greatest
relative to the within group variability, and so on. If we define new variables
z_1 = Y a_1, z_2 = Y a_2, ..., these new variables z_j are termed canonical
variables or linear discriminant functions. The best r-dimensional representation
of the differences between the groups is obtained by plotting the
sample units against the first r canonical variables.
The matrix Σ̂_w^{-1} Σ̂_B has in general s = min(v, g - 1) non-zero eigenvalues.
If the number of groups is less than or equal to the number of original
variables the matrix Σ̂_B is not of full rank and there will be (v - g + 1)

zero eigenvalues. The maximum dimensionality for a canonical variate representation
is thus s and the relevant canonical variables are z_j = Y a_j,
j = 1, 2, ..., s.
Since the eigenvalues l_j measure how much (between group/within group)
variability is taken up by each canonical variate, the minimum dimensionality
r necessary for adequate representation can usually be judged from
consideration of these eigenvalues, in similar fashion to the determination
of dimensionality in principal component analysis.
For interpretation of canonical variables, we can consider the v coefficients
of each a_j and identify the variables with large coefficients as important
in distinguishing between groups, in a way similar to that already
seen for principal components analysis. However, the matter is now complicated
by the fact that between group variability is being assessed relative
to within group variability; large coefficients may be a reflection either of
large between group variability or of small within group variability in the
corresponding variate. For interpretation it is usually advocated to consider
the modified coefficients a*_{kj} = a_{kj} √w_{kk}, where w_{kk} is the kth
diagonal element of Σ̂_w (k = 1, 2, ..., v, j = 1, 2, ..., s). This standardization
puts all variables on a comparable footing as regards within group
variability, and allows the modified canonical variables to be interpreted as
suggested above. Finally, if we wish to go further and interpret or "name"
a discriminant function, the signs can be taken into account.
The discriminant functions are subject to the same limitations as other
linear combinations such as a regression equation: the coefficient for a variable
may change notably due to the presence of atypical observations or
if some variables are added or deleted. We found it useful to monitor the
elements of the scaled eigenvectors a* to gain insight about the importance
of the variables.
It is possible to show (Exercise 6.7) that the coefficients a_i, a_j satisfy the
property

a_i^T Σ̂_w a_j = 0   for all i ≠ j.

In order to overcome the arbitrary scaling of the a_j it is usual to adopt the
normalization

A^T Σ̂_w A = I.   (6.13)

With this normalization the canonical variables are not only arranged to
be uncorrelated within groups and between groups (and consequently over the
whole sample) but also share the property of having equal variance (Exercise 6.7).
Once the linear discriminant function (first canonical variable) has been
calculated, an observation y_i can be allocated to one of the g populations on
the basis of its "discriminant score" a^T y_i. The sample means have scores
a^T ȳ_l = z̄_l. Then y is allocated to that population whose mean score is

closest to a^T y. The rule is: allocate y_i to population P_j if

|a^T y_i - a^T ȳ_j| < |a^T y_i - a^T ȳ_l|   for all l ≠ j = 1, ..., g.

It is possible to show (Exercise 6.4) that when there are only two groups
the first and unique canonical eigenvector of the matrix Σ̂_w^{-1} Σ̂_B is given by
a ∝ Σ̂_w^{-1} (ȳ_1 - ȳ_2). Then the discriminant rule becomes

allocate y to P_1 if   (ȳ_1 - ȳ_2)^T Σ̂_w^{-1} {y - (ȳ_1 + ȳ_2)/2} > 0   (6.14)

and to P_2 otherwise.
The allocation rule given by (6.14) is exactly the same as the sample
maximum likelihood rule for two groups from the multivariate normal dis-
tribution with the same covariance matrix given in (6.11). However, the
justifications for this rule are quite different in the two cases. In (6.11)
there is an explicit assumption of multivariate normality, whereas in (6.14)
we have merely sought a sensible rule based on a linear function of y. Thus
we might hope that this rule will be appropriate for populations where
the hypothesis of multivariate normality is not exactly satisfied. However,
Fisher based his rule solely on the first two moments of the data. Results
on the characterization of distributions show that these are the sufficient
statistics for members of the elliptical family, provided any other parame-
ters are known. For example, if the family is multivariate t, the degrees of
freedom would need to be known. Since the normal distribution is a mem-
ber of this family, the relationship is perhaps not so surprising. Details of
the elliptical family are given, amongst others, by Muirhead (1982, p. 34).
For g ≥ 3 groups the allocation rule based on the first canonical variate
and the sample maximum likelihood rule for multivariate normal popu-
lations with the same covariance matrix will not be the same unless the
sample means are collinear.
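The computation of the canonical variates can be sketched in a few lines (simulated three-group data; this is an illustration of the eigen-analysis of Σ̂_w^{-1} Σ̂_B and of the normalization (6.13), not code from the book).

```python
# Canonical variates: eigenvectors of Sw^{-1} Sb scaled so that a' Sw a = 1 (illustrative sketch).
import numpy as np

rng = np.random.default_rng(5)
groups = [rng.normal(loc=m, size=(30, 3)) for m in ([0, 0, 0], [2, 1, 0], [1, 3, 1])]
n, g = sum(len(y) for y in groups), len(groups)
ybar = np.vstack(groups).mean(axis=0)

Sw = sum((y - y.mean(axis=0)).T @ (y - y.mean(axis=0)) for y in groups) / (n - g)
Sb = sum(len(y) * np.outer(y.mean(axis=0) - ybar, y.mean(axis=0) - ybar) for y in groups) / (g - 1)

evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
order = np.argsort(evals.real)[::-1]
A = evecs.real[:, order[:g - 1]]                 # at most s = min(v, g - 1) useful directions
A = A / np.sqrt(np.diag(A.T @ Sw @ A))           # normalization A' Sw A = I, equation (6.13)
print(np.round(A.T @ Sw @ A, 6))                 # approximately the 2 x 2 identity matrix
```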

6.2.6 Assessment of Discriminant Rules


One final aspect to consider is the evaluation of the performance of the allocation
rule. The simplest data-based method is to apply the given allocation
rule to the sample and to estimate the error rate for each population by
the proportion of individuals that are misclassified or, in a Bayesian setting,
to compute the sum of the probabilities of not belonging to the true group.
These estimates are often called apparent error rates and this method of
estimation is generally referred to as the resubstitution method because
the units which are used to find the allocation rule are then resubstituted
into it to estimate its performance. This method will generally provide an

over optimistic assessment of the success rate of the allocation rule and
may give misleading results unless sample sizes are very large.
A reliable estimate of the error rate will only be obtained if the data used
in the assessment of the rule are different from the data that are used in the
formulation of the rule. This is essentially the principle of cross validation.
The simplest implementation of this principle is to split each training set
randomly into two portions and then to use one portion of each training set
for estimation of the allocation rule itself and the other portion to assess its
performance by finding the proportion of individuals misallocated by the
rule. This approach is known as sample splitting. The main drawback of this
approach is that unless initial sample sizes are very large, the estimation
of the allocation rule and the assessment of its performance will be based
on small samples and hence will be subject to large sampling fluctuations.
In addition, we have to remember that any future allocations will be made
according to a rule based on the whole of the training set, not just on
a random portion of it. Thus, the rule whose performance is being
assessed by sample splitting is not the rule that will be used in the future.
In order to overcome the problems associated with the two previous
methods the leave one out method has been suggested. A review is given
by Krzanowski and Hand (1997). The technique consists of determining
the allocation rule using the sample data minus one observation, and then
using the consequent rule to classify the omitted observation. Repeating
this procedure by omitting each of the units in the training sets in turn
yields, as estimates of the error rates, the proportions of misclassified
observations in the training sets. The problem of all these approaches is that
they may produce biased estimates if multiple outliers are present in the
data. We therefore monitor the misclassification rate of all units throughout
the forward search, that is, at each subset size m.

The forward search can be seen as providing a unification of the different


approaches for the evaluation of the misclassification rate. In the final step
of the search we obtain an estimate using the resubstitution method. In
the central part of the search we obtain a series of robust and efficient
cross-validation estimates.
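A leave-one-out estimate of the kind described above can be written directly; the sketch below (simulated two-group data, linear rule with pooled covariance) is an illustration of the principle rather than the book's own implementation.

```python
# Leave-one-out estimate of the misclassification rate for the two-group linear rule (sketch).
import numpy as np

def fit_and_allocate(y1, y2, ynew):
    mu1, mu2 = y1.mean(axis=0), y2.mean(axis=0)
    Sw = ((y1 - mu1).T @ (y1 - mu1) + (y2 - mu2).T @ (y2 - mu2)) / (len(y1) + len(y2) - 2)
    return 1 if (mu1 - mu2) @ np.linalg.solve(Sw, ynew - (mu1 + mu2) / 2) > 0 else 2

rng = np.random.default_rng(6)
y1 = rng.normal(loc=[0.0, 0.0], size=(50, 2))
y2 = rng.normal(loc=[1.5, 1.0], size=(50, 2))

errors = 0
for i in range(len(y1)):                         # omit each unit of group 1 in turn
    errors += fit_and_allocate(np.delete(y1, i, axis=0), y2, y1[i]) != 1
for i in range(len(y2)):                         # and each unit of group 2
    errors += fit_and_allocate(y1, np.delete(y2, i, axis=0), y2[i]) != 2
print(errors / (len(y1) + len(y2)))              # leave-one-out misclassification rate
```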

6.3 The Forward Search


With the g groups of observations in discriminant analysis we need an initial
subset which contains sufficient units from each group to allow estimation
of both the means μ_l and, if they are different, the covariance matrices
Σ_l, for each group. We then have a choice of how we move forward in the
search:

• Standard or unconstrained search. As with one group, we order
the Mahalanobis distances, ignoring group membership, and include
in the new subset the m + 1 units giving the smallest distances;
• Constrained or balanced search. We treat each group separately
for the ordering; the group membership of the next unit to be added
is chosen to keep the m_l, that is the numbers of units from the various
groups in the subset, in as close a proportion as possible to the
numbers n_l in the training sample.

It is often informative to compare analyses from the two searches, rather
than choosing one in advance as being appropriate. The implementation
and consequences of the procedures are discussed more fully in the following
sections.

6.3.1 Step 1: Choice of the Initial Subset


As in §2.13.4 for one group, we find an initial subset of moderate size
by robust analysis of the matrix of bivariate scatter plots, but now for
each group independently. The initial subset of m_{0,l} observations for group
l, which we denote S^(0)(l), consists of those observations which are not
outlying on any scatter plot for group l, found as the intersection of all
points lying within a robust contour containing a specified portion of the
data and inside the univariate boxplot for that group. The overall initial
subset is found as S^(0) = ∪_{l=1}^{g} S^(0)(l), of size m_0 = Σ_{l=1}^{g} m_{0,l}.

6.3.2 Step 2: Adding Observations During the Forward Search


In every step of the forward search, given a subset S^(m) of size m, for
m = m_0, ..., n - 1, where m = m_1 + ··· + m_g, we move to a subset of size
(m + 1). The selection of these (m + 1) units can be unconstrained or it
can be constrained to balance across groups.
Unconstrained search. In every step of the forward search we select
the (m + 1) units with the (m + 1) smallest Mahalanobis distances. The
only complication is that the number of units for any group must not be
less than m_{0,l}, the number for group l in the initial subset.
Constrained or balanced search. Let R_l be the ratio between the
number of units of population l in the subset and in the full sample: R_l =
m_l/n_l, l = 1, ..., g. Initially we select the groups with the smallest R_l.
Among these we increase by one unit the group which has the smallest
(m_l + 1)th ordered Mahalanobis distance, which we denote by d_{[m_l+1]}. For
example, if d_{[m_s+1]} is in group s, the new subset is formed by the units with
the following distances:

d_{[1]l}, ..., d_{[m_l]l},   l ≠ s = 1, ..., g;
d_{[1]s}, ..., d_{[m_s+1]s}.

Thus, in the constrained search, the subset in every step of the forward
search must contain proportions of units which agree, as closely as possible,
with the proportions in the overall sample.
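One step of the balanced search can be sketched as follows. The function below is a hypothetical helper, not the book's software: it assumes the current Mahalanobis distances of every unit from its own group have already been computed, and it returns the group to be grown and its new subset of units.

```python
# One step of the constrained (balanced) forward search, illustrative sketch only.
import numpy as np

def balanced_step(m_l, distances, n_l):
    """m_l: units of each group currently in the subset; distances: per-group arrays of the
    current Mahalanobis distances of all units; n_l: group sizes in the full training sample."""
    ratios = [m / n for m, n in zip(m_l, n_l)]
    candidates = [l for l, r in enumerate(ratios) if np.isclose(r, min(ratios))]
    # among the candidate groups, grow the one with the smallest (m_l + 1)th ordered distance
    s = min(candidates, key=lambda l: np.sort(distances[l])[m_l[l]])
    new_subset_s = np.argsort(distances[s])[:m_l[s] + 1]
    return s, new_subset_s

rng = np.random.default_rng(8)
m_l, n_l = [5, 4], [20, 20]
distances = [rng.chisquare(df=3, size=n) for n in n_l]
print(balanced_step(m_l, distances, n_l))   # grows the second group (index 1), which has the smaller ratio
```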

6.3.3 Mahalanobis Distances and Discriminant Analysis in Step 2
We use Mahalanobis distances to order the observations during the for-
ward search. One of the quantities that we monitor is the evolution of the
posterior probabilities as observations are included in the subset. We now
consider the link between Mahalanobis distances and these probabilities.
The link is most clearly seen for quadratic discriminant analysis. From
equation (6.2) the posterior probabilities are positively correlated with the
prior probabilities but are negatively related both to the Mahalanobis dis-
tances from the various populations and to the determinant of the covari-
ance matrix. Consider the move from a subset of size m - 1 to one of size
m by increasing group l by one observation. As we make this move from
S^(m-1) to S^(m), the only covariance matrix to change is Σ̂_l. The determinant
of this matrix is linked to the Mahalanobis distance by the deletion
relationship

(6.15)

A large increase of Mahalanobis distance due to inclusion of unit m in group
l will therefore automatically also produce an increase in |Σ̂_{l,m_l}|, which is
likely to produce a big change in the posterior probability of unit m. Thus
a forward search on the Mahalanobis distance of every observation from its
own population leads to inclusion in the last steps of the search of those
units which most affect the posterior probabilities. That is, equations (6.2)
and (6.15) show that the units which have large Mahalanobis distances are
also those which are likely to produce jumps in the plot of the posterior
probabilities. If the covariance matrices for all groups are the same, the
determinants in equation (6.2) become equal for all groups. Then we have
linear discriminant analysis, where the posterior probabilities depend just on
the Mahalanobis distances and prior probabilities. Here, however, inclusion
of unit m from group l causes a change in the estimated common covariance
matrix Σ̂_w (6.9) and so in all Mahalanobis distances, not just those in group
l.

6.4 Monitoring the Search


To use the forward search to extract diagnostic information from a dis-
criminant analysis, we run the search, performing a discriminant analysis

on the included units for each m. We then generate forward plots of the
quantities customarily calculated when all the data are fitted, that is when
m = n . In this section we describe the most informative of these quantities.
Outliers and influential observations can be detected by simple graphical
displays of statistics involved in the forward search. It is extremely useful
to monitor particular Mahalanobis distances such as:

d_{[m_l]},   m = m_0, ..., n - 1,   l = 1, ..., g   (6.16)

and

d_{[m_l+1]},   m = m_0, ..., n - 1,   l = 1, ..., g.   (6.17)

Statistics in equations (6.16) and (6.17) respectively refer, for each group,
to the maximum Mahalanobis distance in the subset and the minimum
Mahalanobis distance among the units not belonging to the subset.
If the dispersion among the groups is markedly different, the curve of
d_{[m_l+1]} never overlaps that of d_{[m_t+1]}, l ≠ t = 1, ..., g. Moreover, these
curves give, for each group, a series of outlier tests comparing the observation
about to be introduced with those already in. If one or more atypical
observations are present in the data, the plot of d_{[m_l+1]} must show a peak
in the step prior to the inclusion of the first outlier. On the contrary, the
plot which monitors d_{[m_l]} shows a sharp increase when the first outlier joins
S^(m). This curve may also show a subsequent decrease due to the masking
effect.
The details of these curves depend importantly on whether the search is
constrained or unconstrained and on what departures, if any, are present.
The search progressively includes units with small Mahalanobis distances.
First suppose that there are no outliers and that quadratic discriminant
analysis is used if appropriate. Then, with a constrained search, the number
of units in the subsets will be balanced and, for example, for each group,
the minimum distance of units not in that group will be similar for all
groups. The progress and output of the search will be similar in all groups.
Now suppose one group contains a set of k outliers. In the unconstrained
search these k units will enter last, so that, before they enter, the proportion
of units from this group in the subset will be lower than the ratio R_l.
Alternatively, if a balanced search is used, these units will be forced to
enter to keep the ratio close to R_l. However, before they enter, the minimum
distance within group l of units not in the subset will be larger than for
outlier free groups.
The same effect is seen if linear discriminant analysis is used when the
variances of the groups are different. Then, in an unconstrained search,
the units of a group with small variance will tend to be included by the
unconstrained search before those from a group with larger variance. These
effects are illustrated in our analysis of the electrode data, for example in
Figures 6.8 and 6.9.

This situation also arises when we are considering transformations, when,


before transformation, the groups may have very different variances. In such
cases the analysis of the group to which the last units belong, when an un-
constrained search is used, provides indirect information about differences
in variance between the groups. As we see in the next sections, both con-
strained and unconstrained searches provide useful information about the
structure of the data.
Another way to examine the differences in variability between the two
groups is to monitor

log|Σ̂_w|,   m = m_0, ..., n;   (6.18)

that is, the logarithm of the determinant of the estimated within groups
covariance matrix defined in equation (6.9). Whether or not a balanced
search is used, this plot will be an approximately straight line when the
two groups have the same variability: each observation, regardless of group,
will make much the same contribution to Σ̂_w. However, if the variability
in the groups is different and a balanced search is used, the plot will have
a zig-zag form.
In the following sections we will refer to equations (6.16), (6.17) and
(6.18) as monitoring the "maximum distance", the "minimum distance"
and the "pooled determinant".

6.5 Transformations to Normality in Discriminant Analysis
Power transformations of the data to obtain multivariate normality were
extensively studied in Chapter 4. A major purpose of the transformation
in discriminant analysis is to achieve a common covariance matrix for all
groups as well as approximate normality. Although it makes sense to fit
different means and, perhaps, different variances for the different groups,
each variable must have the same transformation in all groups. If this were
not so, different transformations of a new observation would have to be tried
before it could be assigned to a group. We start with the more general case
in which each group has its own covariance matrix. Since the procedure is
that of §4.3.2 applied to more than one group, only an outline is given.
Let y_{ijl} be the ith observation on response j for group l (i = 1, ..., n_l;
j = 1, ..., v; l = 1, ..., g). In the extension of the Box and Cox (1964)
family to multivariate responses the normalized transformation of y_{ijl} is

z_{ijl}(λ_j) = (y_{ijl}^{λ_j} - 1) / (λ_j ȳ_j^{λ_j - 1})   (λ_j ≠ 0)   (6.19)
z_{ijl}(λ_j) = ȳ_j log y_{ijl}   (λ_j = 0),   (6.20)

where ȳ_j is the geometric mean of the jth variable. If the transformed
observations are normally distributed with mean μ_l and covariance matrix
Σ_l (l = 1, ..., g), twice the loglikelihood of the observations is given by:

2L(λ) = - Σ_{l=1}^{g} n_l log|2πΣ_l(λ)| - Σ_{l=1}^{g} Σ_{i=1}^{n_l} {z_{il} - μ_l(λ)}^T Σ_l^{-1}(λ) {z_{il} - μ_l(λ)},   (6.21)

where z_{il} = (z_{i1l}, ..., z_{ivl})^T is the v × 1 vector which denotes the transformed
data for unit i coming from group l, and μ_l(λ) and Σ_l(λ) are respectively
the mean vector and the covariance matrix for population l. Substituting
the maximum likelihood estimates μ̂_l(λ) and Σ̂_l(λ) for given λ in equation (6.21),
twice the profile loglikelihood can be written as:

2L_max(λ) = constant - Σ_{l=1}^{g} n_l log|Σ̂_l(λ)|.   (6.22)
To test the hypothesis λ = λ_0, the likelihood ratio test

T_LR = Σ_{l=1}^{g} n_l log{|Σ̂_l(λ_0)| / |Σ̂_l(λ̂)|}   (6.23)

can be compared with the χ² distribution on v degrees of freedom. If the
hypothesis of equality of covariances is true then equation (6.23) becomes

T_LR = n log{|Σ̂(λ_0)| / |Σ̂(λ̂)|},   (6.24)

where n = Σ_{l=1}^{g} n_l. In equations (6.23) and (6.24) λ̂ is found by numerical
search.
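Under the equal-covariance hypothesis, the profile loglikelihood (6.22)/(6.24) is easily evaluated on a grid. The sketch below (simulated positive data, and a single common λ for all variables for brevity; not the book's code) illustrates the numerical search for λ̂.

```python
# Grid evaluation of the profile loglikelihood for a common Box-Cox parameter (sketch).
import numpy as np

def z_normalized(y, lam):
    """Normalized Box-Cox transformation (6.19)-(6.20) of a single positive column y."""
    gm = np.exp(np.mean(np.log(y)))                  # geometric mean
    return gm * np.log(y) if lam == 0 else (y**lam - 1) / (lam * gm**(lam - 1))

def profile_loglik(groups, lam):
    """Up to a constant: -n log|Sigma_hat(lambda)|, with the pooled MLE of the common Sigma."""
    Z = [np.column_stack([z_normalized(y[:, j], lam) for j in range(y.shape[1])]) for y in groups]
    n = sum(len(z) for z in Z)
    pooled = sum((z - z.mean(axis=0)).T @ (z - z.mean(axis=0)) for z in Z) / n
    return -n * np.log(np.linalg.det(pooled))

rng = np.random.default_rng(7)
groups = [np.exp(rng.normal(size=(40, 3))), np.exp(rng.normal(loc=0.5, size=(50, 3)))]
grid = np.linspace(-1, 1, 21)
best = grid[np.argmax([profile_loglik(groups, lam) for lam in grid])]
print(best)       # grid maximizer of (6.22); for lognormal data it should be close to 0
```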
To find a transformation using the forward search we again follow the
prescription in §4.6, running a series of searches first on untransformed data
and then iteratively, using a tentative transformation and refining it until
all significant changes in the estimated transformation occur at the end of
the search. We find it preferable to use unconstrained searches because, if
the data are appropriately transformed, a cluster of k outliers from one
group will enter the subset in the last k steps of the forward search.

6.6 Iris Data


As a first example of discriminant analysis we look at measurements on
three species of iris. The species are:

1. Iris setosa
2. Iris versicolor

3. Iris virginica.

Four measurements of characteristic dimensions of the flowers were made


on fifty flowers from each species. The readings, in centimetres, were:

y_1: sepal length
y_2: sepal width
y_3: petal length
y_4: petal width.

The data have been much analysed; they were published by Anderson
(1935) from measurements taken on plants in the Gaspe Peninsula, Quebec.
The three species are blue-flowered water loving irises, or flags, similar
to the European yellow flag. Iris versicolor is the emblematic flower of
Quebec province. The data were analysed by Fisher (1936) as an example
of discriminant analysis and are often known as "Fisher's Iris Data". They
are in Table A.10 and are also given, for example, by Krzanowski (2000,
pp. 46-47) and by Mardia, Kent, and Bibby (1979, pp. 6-7).
Although the data are frequently taken as a standard example for dis-
criminant analysis, they have several interesting features. They are often
analysed on the original scale (Venables and Ripley 1994, p. 307) but some-
times logs are taken (Venables and Ripley 1994, p. 316) . It is customary to
use linear discriminant analysis which assumes that the three groups have
equal covariance matrices, but there is strong evidence that this is not the
case, for example from the test for equality of variances (2.23).
Figure 6.1 is a scatterplot matrix of the four variables, plotted with a
symbol for each of the three species. The plot of y_3 against y_4, that is petal
length and petal width, shows that one species (iris setosa) is completely
separated from the other two. The robust bivariate boxplots in this panel
of the figure enable us to see that there is also good separation, in these
two dimensions, between the other two species. We may suspect that dis-
crimination will not be very much affected by whether we use the original
or transformed data. That the variances of the measurements in the three
groups are not the same seems evident from the plot and is emphasized by
the three univariate boxplots in panels (3,3) and (4,4) of the plot, which
summarize the values of y separated by group. The bivariate boxplot for
y_3 and y_4 clearly shows that the variability of the three groups increases
with the size of the measurements on petals.
In our first analysis we use linear discriminant analysis and a common
covariance matrix for all groups. We compare linear and quadratic discrim-
inant analyses for our second example, the electrodes data which are the
subject of §6.7. Because of the differing variances in the three groups of the
iris data we use a constrained forward search. With the same number of
individuals in each species, this means that, when m is a multiple of three,
the subset will contain equal numbers of observations from each species.

FIGURE 6.1. Iris data: scatterplot matrix with univariate and bivariate boxplots

Figure 6.2 shows two plots of Mahalanobis distances from this search.
The first panel shows the maximum Mahalanobis distances of those units
which belong to the subset for each group. The second panel shows the
minimum distance for each group of those units not in the subset - apart
from the constraint caused by the need to keep group sizes equal, these
would be the next units to be included in the subset. The first thing to
notice is the very different sizes of the distances for the groups. If a different
covariance matrix were used for each group, we know from Exercise 2.12
that the distances for the m_l units in the subset for the lth group would
sum to v(m_l - 1). The pattern shown here is further evidence that the
covariance matrices of the three groups are not equal. More important,
however, is the behaviour of the distances for Group 1. The observations
entering Group 1 from m = 137 onwards (and which give large distances in the

FIGURE 6.2. Iris data. Maximum Mahalanobis distances, for the three groups,
of units belonging to the subset and minimum distances for those units not
belonging: reading upwards in the left half of the plots, Groups 1, 2 and 3

FIGURE 6.3. Iris data. Three scatterplots of pairs of variables showing, for Group
1 (diamonds), the last units to be included in the forward search. These are the
observations yielding the large Mahalanobis distances in Figure 6.2

right-hand panel from m = 133 on as the next to enter) are 33, 34, 15, 16 and
42. If these were outliers and the covariances were estimated independently
for each group, the effect of these additions on the Mahalanobis distances
would rapidly die down; inclusion of these units in the estimation of the
covariance matrix would lead to masking. But here, a succession of outliers
enters from one group only and so has a partial effect on the common
covariance matrix. They therefore remain visible in the plot.
Figure 6.3 shows scatterplots including the numbers of the units which
give the large increases in Mahalanobis distances for Group 1. The last of all
to enter is unit 42, very much an outlier from Group 1. The other four units
form a cluster at the other end of the group. It may seem surprising that
these units appear so outlying. This is caused by the common covariance

matrix which is being fitted to the three groups. If these five units are
excluded, the bivariate scatters of Group 1 are more like those of the other
groups. This reconciliation is strongest in Panel 1 of the figure. However,
as the bivariate boxplots of Figure 6.1 emphasize, the orientation of the
bivariate distribution, particularly for y2 and y4 in Group 1, is quite different
from that of the readings in the other two groups. The oblique common
covariance matrix increases the outlyingness of these five observations.
The discussion of Figures 6.2 and 6.3 shows the similarities and differ-
ences in the forward search and the Mahalanobis distances when there are
several groups fitted simultaneously rather than the one fitted distribution
in the previous five chapters. Now we turn to the main feature of discrim-
inant analysis, which is the assessment of the probabilities of group mem-
bership and the establishment of boundaries between the groups. What
we are particularly interested in, of course, is how these properties change
during the search and what such changes, if any, reveal about the structure
of the data.
We use the forward search to monitor the evolution of the posterior
probabilities as observations are included in the subset. We can then both
detect influential observations and determine the effect of each unit on the
posterior probabilities, so monitoring the performance of the allocation rule.
The forward search is on the Mahalanobis distances, which we showed in
equation (6.15) are strongly linked to changes in the posterior probabilities.
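The link is easy to make explicit. With normal populations sharing a covariance matrix, the posterior probability of group l is proportional to its prior probability times exp(-d_l^2/2), where d_l^2 is the squared Mahalanobis distance of the observation from the group mean. A minimal sketch, with equal priors assumed by default, is:

import numpy as np

def posterior_probs(y, means, S_inv, priors=None):
    """Posterior group probabilities for one observation y under normal
    populations with a common covariance matrix (S_inv is its inverse)."""
    means = np.asarray(means, dtype=float)
    if priors is None:
        priors = np.full(len(means), 1.0 / len(means))    # equal prior probabilities
    d2 = np.array([(y - m) @ S_inv @ (y - m) for m in means])
    logw = np.log(priors) - 0.5 * d2
    logw -= logw.max()                                    # guard against underflow
    p = np.exp(logw)
    return p / p.sum()

Monitoring then amounts to re-estimating the group means and the pooled covariance matrix from the subset at each value of m and recording these probabilities for every unit.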
Figure 6.4 monitors the calculated posterior probabilities that observa-
tions in Groups 2 and 3 belong to those groups. The plots start with a
subset size of 102, out of the total 150 observations. During this period
only four units from Group 2 ever have posterior probabilities less than
0.6: two, units 71 and 84, are finally misclassified. For Group 3 only unit
134 is misclassified. In the earlier part of the search shown in the left-hand
panel most units have a classification probability close to one; some of these
values decrease slightly towards the end of the search as more extreme ob-
servations enter the subset and the distinction between groups becomes
slightly blurred. This effect is much more noticeable in the upper right-
hand part of the right-hand panel of the figure. The difference between
the two panels in this respect is caused by the differing variances of the
observations in the two groups to which we are fitting the same covariance
matrix. We do not show the plot for Group 1 as all units in every step of the
search are correctly classified with posterior probabilities of at least 0.99.
As with most analyses, the pattern in Figure 6.4 is stable to the contour
of the robust boxplot used to choose the initial subset.
In line with our contention of the importance and usefulness of returning
from the forward analysis to further inspection of the data, we now give
interpretations of these findings in the space of the original data. Figure 6.5
shows the scatterplot for petal width and sepal width for units in Groups
2 and 3. Those which were sometimes misclassified are represented by filled
symbols. For two of the three units (71, 84 and 134) which are misclassified

FIGURE 6.4. Iris data. Posterior probabilities, as a function of subset size, that
observations in Groups 2 and 3 respectively belong to those groups

at the end of the forward search, there are units in the other group which
have identical observed values for these two variables. More precisely unit
71 (coordinates 3.2 and 1.8) presents the same values as unit 126 and unit
134 (2.8 and 1.5) overlaps with unit 55. Examination of the scatterplot ma-
trix with brushing shows that units 134 and 55 are very close to each other
in all bivariate scatterplots. On the left side of Figure 6.5, unit 69 (coordi-
nates 2.2 and 1.5) overlaps with unit 120. Among the units of Group 3, 120
is, apart from 134, the one which shows the smallest posterior probability
(0.779) in the last step of the forward search.
The discriminant line dividing the two groups is not shown in Figure 6.5,
but passes close to units 73 (2.5, 1.5), 78 (3, 1.7) and 139 (3, 1.8). As Fig-
ure 6.4 shows, the posterior probability of observation 73 fluctuates appre-
ciably even though this unit is always categorised correctly from m = 136
onwards. In all steps of the forward search, unit 78 always has a posterior
probability around 0.65. Unit 139 is, apart from 134, the one in Group 3
showing the smallest posterior probability in almost all steps of the forward
search.
The two remaining crosses in Figure 6.5 which appear close to triangles
refer to units 135 (2.6 and 1.4) and 130 (3.0 and 1.6). Unit 135 is the last of
the third group to be included in the forward search: it has a final posterior
probability of 0.934. During the forward search unit 130 generally shows a
posterior probability around 0.95 (the final value is 0.896).
There remains unit 84 (2.7, 1.6), the third to be misclassified at the end
of the forward search. It is included when m = 142. Thereafter the posterior
probability that this unit belongs to Group 2 tends generally to increase.
Its final posterior probability is 0.143. An analysis of the scatterplot matrix
reveals that, in almost all the bivariate plots, this unit is surrounded by
some observations belonging to Group 3.

FIGURE 6.5. Iris data. Scatterplot of two variables showing, by filled symbols, the
units sometimes misclassified. Triangles are for Group 2, crosses and diamonds
for Group 3

Our analysis of the iris data shows, we believe, that the forward search
technique in discriminant analysis is an extremely useful tool. As a result
we can:
1. Highlight the units which are always classified correctly with high
posterior probability in each step of the search. These can be sepa-
rated from those units which are declared correctly only when they
are included in the allocation rule;
2. See the evolution of the degree of separation or overlapping among
the groups as the subset size increases and determine the relationship
with those units which have a posterior probability close to 0.5;
3. Monitor the stability of the allocation rule with respect to different
sample sizes;
4. Determine the influence of observations by separating the units with
the biggest Mahalanobis distances into two groups: those which have
an effect on the posterior probabilities and those which leave them
unaltered.
In our example, monitoring the posterior probabilities enables us to dis-
tinguish the units whose posterior probabilities tended to increase as the
sample size grew (e.g. units 69 and 73), those whose posterior probability
was close to 0.5 (e.g. units 78, 71 and 134) and those which were always
completely misclassified (e.g. unit 84). If we have to classify a new unit we
can monitor its posterior probability at each step of the forward search. In

this way we can have an idea about the stability of the associated alloca-
tion and therefore which and how many observations are responsible for its
allocation to a particular group.

6.7 Electrodes Data
The main purpose of this section is to investigate the contrasting proper-
ties of linear and quadratic discriminant analysis. We also study the effect
of using a balanced as opposed to an unbalanced forward search. For our
comparisons to be effective we again need a set of data in which the groups
have differing variances. For this purpose we use data from an unpublished
University of Berne Ph.D. thesis by Kreuter. The data, given and described
by Flury and Riedwyl (1988, pp. 128- 132), are measurements from two ma-
chines manufacturing supposedly identical electrodes. We give the numbers
in Table A.11.
The electrodes are shaped rather like nipples. There are five measure-
ments on each: y1, y2 and y5 are diameters, while y3 and y4 are lengths
(there is a trivial misprint in Flury and Riedwyl's description of the data)
and fifty electrodes from each machine have been measured. For reasons of
commercial secrecy, the data have been transformed by subtracting con-
stants from the variables. Flury and Riedwyl comment that this shift in
location does not affect discrimination. Whilst this is true in a limited way,
the subtraction of these unknown quantities means that it is not at all
straightforward to investigate power transformations of the data to achieve
homogeneity of variance. The difficulty arises because estimation of the
subtracted constants leads to a distribution where the range of the data
depends on the parameter values. The resulting likelihood is unbounded
when it is estimated that the smallest observed values of each Yi have been
added to the original readings. There may also be local maxima of the
likelihood. A fuller discussion of the difficulties of the Box and Cox family
when shift parameters have to be estimated is given by Atkinson, Pericchi,
and Smith (1991).
Figure 6.6 shows the scatterplot matrix of the data with superimposed
robust boxplots. The data have been slightly jittered for this plot as there
is appreciable overlap of values in some of the variables. The univariate
boxplots on the diagonal of the figure show that some variances are larger
for Group one, others for Group two. Furthermore, the readings with the
larger means do not always have the larger variances, so that power trans-
formations would be unlikely to yield constancy of variance even if the data
did not have a shifted location. The bivariate boxplots in the figure show
that not only are the variances of the variables of differing magnitudes, but
also the covariances in the two groups sometimes differ. For example, in the
panel for Y3 and y 5 , the major axes of the covariance matrices are virtually
FIGURE 6.6. Electrodes data: scatterplot matrix with univariate and bivariate
boxplots (data have been slightly jittered). Units in Group one are represented
by circles

orthogonal. The plot also shows an almost clear separation of the groups
on y4. A question of interest will therefore be whether the remaining four
variables provide any extra information for discrimination.
We start with linear discriminant analysis and initially compare the bal-
anced and unbalanced searches. Figure 6.7 shows forward plots of the poste-
rior probabilities that units in Group 1 belong to that group: the probabili-
ties for the balanced search are in the upper panel, those for the unbalanced
search in the lower. The most obvious difference in the two panels is that
units 6 and 8 are correctly classified around 15 steps earlier in the balanced
search than in the unbalanced one. The behaviour of the other appreciable
outlier, unit 9, is broadly similar in both panels. This unit is visible in the
plot for y4 and y5 as a circle on the edge of a cluster of triangles. Whichever

FIGURE 6.7. Electrodes data: linear discriminant analysis. Posterior probabilities,
as a function of subset size, that observations in Group one belong to that
group, using balanced (top panel) and unbalanced (bottom panel) search


FIGURE 6.8. Electrodes data: linear discriminant analysis. Maximum Mahalanobis
distances, for the two groups, of units belonging to the subset (left panel)
and minimum distances for those units not belonging (right panel). Balanced
search. The solid line is associated with Group one

search is used there is a slight spreading of probabilities at the very end of
the search as relative outliers enter and make the discrimination less sharp.
This feature is more apparent in Figure 6.4 for the iris data.
The difference in the results of the two searches is caused by the order in
which the units enter the subset. The Mahalanobis distances guiding this

FIGURE 6.9. Electrodes data: linear discriminant analysis. Maximum Mahalanobis
distances, for the two groups, of units belonging to the subset (left panel)
and minimum distances for those units not belonging (right panel). Unbalanced
search. The solid line is associated with Group one

selection are plotted in Figures 6.8, for the balanced search, and 6.9, for
the unbalanced one. The left-hand panel of Figure 6.8 shows the maximum
distance, at each step of the search, of units belonging to the subset; the
right-hand panel shows the minimum distance for units not belonging. The
two plots are similar - for most of the search the distances for units in Group
1 are larger, reflecting the more dispersed nature of this group which gives
rise to larger distances when a common covariance matrix is used.
In the absence of any constraint on the entry of units, those with smaller
Mahalanobis distances enter earlier. Of course, the entry of units alters the
estimates of the mean and variance and so changes the relative distances
of the units. But, from the discussion of Figure 6.8, we might expect that
units from Group 2 would enter earlier in the absence of the constraint. The
right-hand panel of Figure 6.9 shows that this is so, since the dotted line
associated with the minimum Mahalanobis distance for Group 2 terminates
when m = 94. From m = 96 only units from Group 1 enter the subset.
Figure 6.9 also shows the resulting effect on the distances in the two groups.
Whether these are the maximum of those inside or the minimum of those
outside, they are now much more equal. A consequence of the later entry
of some units from Group 1 has already been seen in the lower panel of
Figure 6.7 where the delayed entry of units 6 and 8 causes them to be
misclassified until much later in the search.
It is not necessary to choose between the two searches; our purpose is
to explore ways in which individual observations affect discrimination. But
we do need to choose between linear and quadratic discriminant analysis
when finally establishing a discrimination procedure.
The posterior probabilities for units in Group one of belonging to that
group are plotted in Figure 6.10. Those for the balanced search are in the

FIGURE 6.10. Electrodes data: quadratic discriminant analysis. Posterior probabilities,
as a function of subset size, that observations in Group one belong to
that group using balanced (top panel) and unbalanced (bottom panel) search

upper panel and those for the unbalanced search are in the lower panel.
As in Figure 6.7, several units are only correctly classified later in the
search when the search is unbalanced. However, the important comparison
between Figures 6.7 and 6.10 is the much higher misclassification rate for
quadratic discrimination until, particularly for the unbalanced search, six
or seven observations from the end of the search. The effect is caused by the
increased number of parameters to be estimated when a different covariance
matrix is fitted to each group. Although observations included in the subset
will be better fitted by the individual matrices, the variance of prediction
for observations outside the subset is increased by the increased number of
parameters.
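The two allocation rules differ only in the covariance matrix inserted in the normal log density, so the contrast can be made concrete in a few lines. The sketch below, written with equal prior probabilities, is our own schematic comparison, not the code behind Figures 6.7 and 6.10.

import numpy as np

def fit_groups(Y, labels):
    """Group means, individual covariance matrices and the pooled matrix."""
    groups = np.unique(labels)
    means = {g: Y[labels == g].mean(axis=0) for g in groups}
    covs = {g: np.cov(Y[labels == g], rowvar=False) for g in groups}
    pooled = sum(((labels == g).sum() - 1) * covs[g] for g in groups) \
        / (len(Y) - len(groups))
    return groups, means, covs, pooled

def allocate(y, groups, means, cov):
    """Allocate y to the group with the largest normal log density (equal priors).
    Pass the pooled matrix for the linear rule, or the dictionary of individual
    matrices for the quadratic rule."""
    scores = []
    for g in groups:
        S = cov[g] if isinstance(cov, dict) else cov
        d = y - means[g]
        scores.append(-0.5 * (np.log(np.linalg.det(S)) + d @ np.linalg.solve(S, d)))
    return groups[int(np.argmax(scores))]

# groups, means, covs, pooled = fit_groups(Y, labels)
# linear rule:    allocate(y_new, groups, means, pooled)
# quadratic rule: allocate(y_new, groups, means, covs)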
We saw in Figure 6.6 that there is good, although not perfect, separa-
tion between the two groups from the values of y4. We now turn to the
question as to whether any extra discrimination is achieved by including
the remaining four variables.
Figure 6.11 is a forward plot of the elements of the standardized canonical
eigenvector of the plane separating the two populations. For this we have
reverted to linear discriminant analysis. The left-hand plot is for a balanced
search and the right-hand panel is for an unbalanced one. The two panels
are very similar.
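For two groups the canonical direction is proportional to the inverse of the pooled within-groups covariance matrix applied to the difference of the two group means. The sketch below computes one standardized version of it; the particular standardization (elements multiplied by the pooled within-group standard deviations and then scaled to unit length) is an assumption of ours and need not coincide exactly with the scaling used for Figure 6.11.

import numpy as np

def canonical_vector(Y, labels):
    """Discriminant direction for two groups, standardized elementwise by the
    pooled within-group standard deviations and scaled to unit length."""
    g1, g2 = np.unique(labels)
    Y1, Y2 = Y[labels == g1], Y[labels == g2]
    n1, n2 = len(Y1), len(Y2)
    S = ((n1 - 1) * np.cov(Y1, rowvar=False) +
         (n2 - 1) * np.cov(Y2, rowvar=False)) / (n1 + n2 - 2)
    w = np.linalg.solve(S, Y1.mean(axis=0) - Y2.mean(axis=0))   # S^{-1}(mean1 - mean2)
    w_std = w * np.sqrt(np.diag(S))                             # standardized elements
    return w_std / np.linalg.norm(w_std)

Recomputing this vector from the units in the subset at each step of the search gives trajectories of its elements like those plotted in the figure.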
In both the panels there is a large contribution, as we would expect, from
y4. The next largest contribution, around -0.5, is from y1. The panel for
y1 and y4 in the scatterplot of Figure 6.6 shows that a straight line can

FIGURE 6.11. Electrodes data: linear discriminant analysis. Elements of standardized
canonical eigenvector using balanced (left) and unbalanced (right) search

FIGURE 6.12. Electrodes data: estimated densities of the two populations for y2,
using means and variances based on S*(89) (panels on the left) and on S*(100)
(panels on the right). Panels on top show the estimated densities in the interval
μ̂2 ± 3σ̂2. Panels at the bottom show the estimated densities in the range of
variation of the data

completely separate the two groups. The other panels in the column of the
plot for y4 also show virtually complete separation, except for unit 9 which
is sometimes surrounded by units from Group 2.
FIGURE 6.13. Electrodes data: estimated densities of the two populations for
variables y1, y3, y4 and y5 using means and variances based on S*(n). The greatest
separation by far is in y4

An intriguing feature of Figure 6.11 is the contribution of y2. At the be-
ginning of the range plotted in the figure, this is close to zero, but increases
substantially towards the end of the figure to be as appreciable as the con-
tribution of y1, although of an opposite sign. We conclude our discussion
of the electrode data by investigating this phenomenon.
The top row of Figure 6.12 shows the estimated normal density for y2 at
m = n - 11 and at m = n. Although the values for Group 2 are on average
to the right of those for Group 1, there does not seem to be an appreciable
change between the two panels. The reason for the change in the importance
of y2, visible in Figure 6.11, is evident in the lower panel of Figure 6.12,
where the fitted normal curves are plotted over the range of data included
in S*(m). Towards the end of the search the larger positive values of y2 enter,
moving the fitted curves apart and providing more powerful discrimination.
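The mechanism is easy to mimic: fit a normal density to the y2 readings of each group, once from a subset lacking the largest values and once from all the data, and compare the curves. The numbers below are invented stand-ins, since the real readings are in Table A.11, and the choice of which units to hold back is purely illustrative.

import numpy as np
from scipy import stats

def fitted_curve(values, grid):
    """Normal density with mean and standard deviation estimated from values."""
    values = np.asarray(values, dtype=float)
    return stats.norm.pdf(grid, loc=values.mean(), scale=values.std(ddof=1))

rng = np.random.default_rng(1)
y2_g1 = rng.normal(58.0, 1.0, 50)    # stand-ins for the y2 readings of machine 1
y2_g2 = rng.normal(59.5, 1.2, 50)    # and machine 2 (the real values are in Table A.11)

grid = np.linspace(min(y2_g1.min(), y2_g2.min()),
                   max(y2_g1.max(), y2_g2.max()), 200)       # range of variation of the data
curve_g1 = fitted_curve(y2_g1, grid)
curve_g2_subset = fitted_curve(np.sort(y2_g2)[:39], grid)    # largest readings held back
curve_g2_full = fitted_curve(y2_g2, grid)
# with these stand-in numbers, re-including the largest Group 2 readings shifts its
# fitted curve to the right, away from Group 1, sharpening the discrimination on y2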
The final plot in this discussion, Figure 6.13, shows the fitted densities for
the other four variables at the end of the search. The third panel confirms
that the separation is greatest for y4. The other variable which was seen to
have some effect in Figure 6.11 was y1. After y4, this is indeed the variable
showing the most separation in the panels of Figure 6.13.
In this section we have seen the difference in effect between balanced
and unbalanced searches: in the unbalanced search the units for the group
with smaller variance tend to enter earlier. We have also seen that, for this
example with two groups of fifty observations and five variables, quadratic

discriminant analysis is much less reliable than linear discrimination. This
conclusion extends, in a systematic manner, the results of Fung (1996a)
who found that classification probabilities in a similarly sized example of
quadratic discriminant analysis were highly sensitive to the deletion of a
few observations. Such analyses support the standard practice of preferring
linear discriminant analysis even when variances are different. We now re-
turn to the Iris data to consider transformation of the data in an attempt
to achieve equality of covariance matrices and so conditions under which
linear discriminant analysis is known to be optimum.

6.8 Transformed Iris Data


We now consider power transformations of the iris data. It seems clear from
the univariate boxplots on the diagonal of Figure 6.1, especially those for
variables three and four, that there is a relationship between the means of
the readings and their univariate scatters. The bivariate boxplot for y3 and
y4 shows that this relationship extends to joint distributions. These are con-
ditions under which we would expect that a transformation might improve
normality and provide more nearly equal variances within the groups. But,
the important question for this chapter is whether such a transformation
improves the performance of the discriminant analysis.
Figure 6.14 shows, in the upper panel, a forward plot of the estimated
transformation parameters for the four variables. The lower panel is the
likelihood ratio test for the hypothesis of no transformation, which is to
be compared with the chi-squared distribution on four degrees of freedom.
There is strong evidence throughout that the data should be transformed
- at the end of the search the statistic has a value of 84.19. The plot is
stable throughout the part of the search shown, without any evidence of
outliers being influential on the evidence for a transformation. The upper
panel for the estimated parameters is also stable suggesting the square root
transformation for variables two, three and four and the reciprocal square
root, λ = -0.5, for y1. The estimate of λ2 is the one that varies most
throughout the search.
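The statistics in both panels derive from the normalized Box-Cox transformation discussed in Chapter 4. A simplified sketch of the profile log-likelihood, written here with a pooled within-groups covariance matrix and with additive constants ignored, shows the shape of the calculation; it is our own reconstruction under these assumptions, not the exact code behind Figure 6.14.

import numpy as np

def boxcox_column(y, lam):
    """Normalized Box-Cox transformation of one positive variable."""
    gm = np.exp(np.mean(np.log(y)))            # geometric mean of the column
    if abs(lam) < 1e-12:
        return gm * np.log(y)
    return (y ** lam - 1.0) / (lam * gm ** (lam - 1.0))

def profile_loglik(Y, labels, lam):
    """Profile log-likelihood of the vector of transformation parameters lam,
    up to an additive constant, using the pooled within-groups covariance
    matrix of the transformed data."""
    Z = np.column_stack([boxcox_column(Y[:, j], lam[j]) for j in range(Y.shape[1])])
    n = len(Z)
    W = sum(((labels == g).sum() - 1) * np.cov(Z[labels == g], rowvar=False)
            for g in np.unique(labels))
    return -0.5 * n * np.log(np.linalg.det(W / n))

def lr_test(Y, labels, lam0, lam_hat):
    """Likelihood ratio test of H0: lambda = lam0 against the estimate lam_hat."""
    return 2.0 * (profile_loglik(Y, labels, lam_hat) - profile_loglik(Y, labels, lam0))

The estimate maximizing profile_loglik can be found numerically, for example with scipy.optimize.minimize applied to its negative, and recomputing the test from the units in the subset at each m gives the forward plot.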
To interpret these parameter estimates we look, in Figure 6.15, at the pro-
file loglikelihood surfaces for the four parameters at the end of the search,
together with approximate 95% confidence intervals. These show that the
parameters for the sepal dimensions, λ1 and λ2, are poorly defined, whereas
λ3 and λ4 are comparatively precisely estimated. These are the two esti-
mates giving the stable traces in Figure 6.14. Since all measurements are of
length, we would like to have the same transformation for all. Accordingly
we try a search with all λj = 0.5.
The forward plot of the likelihood ratio test for all λj individually equal
to 0.5 is in the left-hand panel of Figure 6.16. Comparison with the 5%

FIGURE 6.14. Iris data: forward plots from linear discriminant analysis. Up-
per panel, transformation parameters; lower panel, likelihood ratio test for the
hypothesis of no transformation

point of χ²4 shows that this transformation is acceptable throughout, even
though the value of the test statistic does increase towards the end of
the search. The right-hand panel of the plot is for the hypothesis λ =
(-0.5, 0.5, 0.5, 0.5)^T. This is also acceptable, with slightly smaller values
of the test statistics. We however proceed with taking the square root of
all measurements since a common transformation makes more sense for
measurements with the same physical units. We note that the log transfor-
mation is not indicated - in fact the statistic for a log transformation of all
observations is significant throughout the search. In the last 40 steps the
value is always above 42, increasing steadily to a final value of 124.8 when
m = n. The variance of the observations is proportional to the mean rather
than the standard deviation being proportional to the mean.
Before returning to the discriminant analysis, we look at the transformed
data. The scatterplot is in Figure 6.17. The bivariate boxplot for y3 and
y4 in Group 1 is now closer in size to that of the variables in Groups
2 and 3 than it was in Figure 6.1 for the untransformed data. Although
the groups are now on a more equal scale, so that the estimation of a
common covariance matrix makes more sense, the orientations of the groups
are not always similar and the contours for Group 1 are sometimes far
from elliptical. This shape is caused, in part, by the rounded nature of the
readings.

FIGURE 6.15. Iris data: profile loglikelihood surfaces for the four transformation
parameters at the end of the search


FIGURE 6.16. Iris data: forward plots of likelihood ratio tests for transformation.
Left panel, H0: λ = (0.5, 0.5, 0.5, 0.5)^T; right panel, H0: λ = (-0.5, 0.5, 0.5, 0.5)^T.
Both hypotheses are acceptable

FIGURE 6.17. Iris data: scatterplot matrix with univariate and bivariate boxplots
(after transforming the data with all λj = 0.5). Comparison with Figure 6.1 shows
that the effect of transformation in equalizing variances is most evident in the
bivariate boxplot for variables 3 and 4; the univariate boxplots for variables 3
and 4 now show spreads closer to those of variables 1 and 2

We now analyse the data with a constrained forward search. The for-
ward plot of the Mahalanobis distances is in Figure 6.18. Comparison with
Figure 6.2 shows the effect of the transformation. Because the three groups
now have more nearly equal variances, both the maximum distances for
units within the group and the minimum to those outside the group are
more nearly equal. This is shown by the increased closeness of the dis-
tances for the three groups. A second effect of the transformation is that
the outliers in Group 1 are now much less severe.
The scatterplot of Figure 6.17 and the forward plots of Figure 6.18 are
both indications that the data after transformation more nearly satisfy the

FIGURE 6.18. Transformed iris data: linear discriminant analysis. Forward plots
of maximum and minimum Mahalanobis distances for each group. In Figure 6.2
the untransformed data in Group 1 show some outliers. Here, after transforma-
tion, the maxima and minima for the three groups are comparable and the jumps
in the curve for Group 1 are absent

assumptions of multivariate normality with a common covariance matrix
on which linear discriminant analysis is based. However, the scatterplot
matrix of Figure 6.17 suggests that there is not likely to be much change
in the discriminant analysis as a result of the transformation: Group 1 is
well separated from the other two, which have some overlap. The forward
plots of the posterior probabilities of belonging to the correct population
are very similar to those in Figure 6.4 and so are not given. There is no
misclassification in Group 1 and a very similar structure of units misclas-
sified in Groups 2 and 3. The only slight difference is in the probabilities,
towards the end of the search, of the units which are correctly classified.
In Figure 6.4 these units in Group 2 have probabilities much nearer one
than many of those in Group 3. This is a reflection of the greater variance
in Group 3, giving rise to larger Mahalanobis distances when a common
covariance matrix is used. But, after transformation, the variances in the
two groups are more nearly equal, so that the distribution of Mahalanobis
distances is also. As a result, the spread of probabilities for units correctly
classified is more similar in the two groups.

6.9 Swiss Bank Notes


In discriminant analysis it is assumed that both the number of groups and
the group membership of each unit are correctly known. Now we consider

FIGURE 6.19. Swiss bank notes: linear discriminant analysis. Posterior proba-
bility, as a function of subset size, that, upper panel, observations in Group 1
belong to that group and, lower panel, that observations in Group 2 belong to
Group 2

what happens when the number of groups is larger than specified. For this
purpose we look once again at the data on Swiss bank notes. Our earlier
analysis showed that there was one observation from Group 1, unit 70,
which should have been classified as a forgery, and that the second group,
the forgeries, could be split into two groups, the smaller containing 15 units.
We see how the forward search applied to discriminant analysis reveals this
structure.
We start with linear discriminant analysis and, as throughout this sec-
tion, a balanced search. The uneventful upper panel of Figure 6.19 shows
that all units in Group 1 are correctly classified throughout, except for unit
70 which is, correctly we believe, classified as a forgery towards the end of
the search. The lower panel of the figure, for Group 2, is more eventful:
all units are eventually classified as belonging to Group 2, but two are
consistently misclassified until, in the case of unit 116, m = 171.
Figure 6.19 does not reveal the two groups of forgeries. These are clearly
revealed in Figure 6.20 which is a forward plot of maximum and minimum
Mahalanobis distances for the two groups. The two panels are similar, with
the behaviour of the distances in the two groups being quite different . The
lower curve is for Group 1. This increases towards the end of the search as
more remote units enter, with unit 70 being the last to enter. The curves
for Group 2 are more dramatic and clearly show the entry of the third
group of observations. The peak in the curve has its maximum at m = 173.

FIGURE 6.20. Swiss bank notes: linear discriminant analysis, forward plots of
Mahalanobis distances. Left panel, maximum distance for units in the subset;
right panel, minimum distance of units not in the subset. Solid line, Group 1

In Figure 3.47, for units in Group 2, the peak was at m = 84, before the
group of 15 outliers started to enter. But now units from Groups 1 and 2
are entering alternately, with the first of the group of outliers being less
remote than the others. After the peak in the panels of Figure 6.20, there
is a decline as similar units enter and influence the estimated covariance
matrix. However, since a common covariance matrix is being fitted to both
groups of observations, the decline is less than when just Group 2 is fitted
in Figure 3.47. Comparison with the posterior probabilities of classification
in Figure 6.19 shows that the changes of classification for units 70 and 116
occur as the units in the second group of forgeries start to enter the subset.
We now briefly repeat the analysis using quadratic discriminant analysis.
The upper panel of Figure 6.21 shows that all units in Group 1 are again
correctly classified during the search, except for unit 70, which now has
a rapid change to the second group at m = 172. This panel is similar to
that for linear discriminant analysis in Figure 6.19. The lower panel of
Figure 6.21 however shows much more activity than the comparable figure
for linear discriminant analysis. In particular, at m = 172, ten units change
from being classified in Group 1 and move to Group 2, with two other
units changing shortly afterwards. The changes at m = 172 are therefore
important for the probabilities in both groups. This comparison of analyses
shows again that quadratic discriminant analysis is appreciably less stable
than linear discriminant analysis.
The dramatic change in classification in the lower panel of Figure 6.21
is caused by fitting an individual covariance matrix to each group: until
the outliers are included in Group 2 and affect the covariance matrix, they
seem to be far from the group. How far can be seen from the forward
plots of Mahalanobis distances in Figure 6.22. As in Figure 6.20, there are
[Figure 6.21: unit 70 is labelled in the upper panel; units 167, 182, 116, 192,
161, 111, 160, 171, 162, 180, 148 and 138 are labelled in the lower panel.]

FIGURE 6.21. Swiss bank notes: quadratic discriminant analysis. Posterior prob-
ability, as a function of subset size, that, upper panel, observations in Group 1
belong to that group and, lower panel, that observations in Group 2 belong to
Group 2


FIGURE 6.22. Swiss bank notes: quadratic discriminant analysis, forward plots
of Mahalanobis distances. Left panel, maximum distance for units in the subset;
right panel, minimum distance of units not in the subset. Solid line, Group 1

sharp peaks in both panels around m = 174. But now the distances decline
more rapidly after the peak. This is because, with individual covariance
matrices for each group, the estimated matrices are strongly influenced by
the outlying observations, which begin to seem less remote.

Since the number of groups is incorrect, we repeated the analysis with
three groups, the third containing 15 units. We used a balanced search.
With linear discriminant analysis all units were consistently classified in
these groups except, of course, for unit 70, which behaved in a manner
similar to that plotted in Figure 6.19, any differences being due to the
forward search being balanced over three groups.
This analysis of the Swiss bank note data reinforces our conclusions from
the earlier analyses. The conclusion specific for discriminant analysis is
that quadratic discriminant analysis responds more sensitively to groups
of outliers, or an underestimate of the number of clusters, than does linear
discriminant analysis. The cause of the change in classification probabilities
is revealed by monitoring the plots of maximum or minimum Mahalanobis
distances within or without the subset. This plot is effective whether the
discrimination uses a linear or a quadratic analysis.

6.10 Importance of Transformations in Discriminant Analysis: A Simulated Example

6.10.1 A Deletion Analysis
The analysis of the transformed iris data in §6.8 was straightforward: it
was easy to find the transformation, which was statistically highly signifi-
cant. The analysis was, however, rather disappointing, the transformation
having little effect on the performance of the linear discriminant analysis.
In the next section we analyse some data on muscular dystrophy, for which
the transformation is less easily established. Once it has been found, it
does however have a positive effect on discrimination. In this section we
use an example with simulated data. This has been chosen to illuminate
many of the inferential problems which arise in the analysis of the muscular
dystrophy data.
One motivation for the transformation of data in discriminant analysis
is to help satisfy the assumptions of normality and of equality of covari-
ance matrices. If these assumptions are not satisfied, that is if the data are
not appropriately transformed, the probability of mis-allocation may sig-
nificantly increase. Another possibility is that the discriminating power of
the variables may change on transformation. This can be important when,
as in the muscular dystrophy data, some variables are easier and cheaper
to measure than others.
We generated two groups of fifty observations with the same covariance
matrix, so that linear discriminant analysis is appropriate. Group one con-
sisted of a 46 x 4 matrix from a multivariate normal population with mean
equal to 7.9 for all variables. The remaining four units were generated from
a multivariate normal population with a mean of 10.5 and were included as

TABLE 6.1. Simulated data: correct transformation and index numbers of contaminated units

True transformation: λ^T = (0.5, -0.5, -0.5, -0.5)
Outliers, original scale: 1, 5, 10, 13
Outliers, transformed scale: 51, 84, 92, 99

observations 1, 5, 10 and 13. The 50 x 4 matrix of observations in Group 2


were also from a multivariate normal population, but with mean equal to
6.3. The normal data were transformed by squaring the first variable and
raising the remaining variables to the power -2. So the true transformation
vector is λ = (0.5, -0.5, -0.5, -0.5)^T. Finally, in this transformed scale we
contaminated four units of the second group by subtracting 1.7 from the
values of y3. A summary of the data structure is given in Table 6.1.
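The construction can be sketched in a few lines of code. Two details are not stated explicitly above and are therefore assumptions of the sketch: the common covariance matrix, taken here to be the identity, and the reading that the contamination of y3 acts on the normal scale, before the powers are applied.

import numpy as np

rng = np.random.default_rng(2003)
v = 4
sigma = np.eye(v)    # the common covariance matrix is not stated; identity assumed

# normal-scale observations: Group 1 has 46 units with mean 7.9 and four units
# with mean 10.5 placed as observations 1, 5, 10 and 13; Group 2 has mean 6.3
regular = rng.multivariate_normal(np.full(v, 7.9), sigma, size=46)
shifted = rng.multivariate_normal(np.full(v, 10.5), sigma, size=4)
group1 = np.insert(regular, [0, 3, 7, 9], shifted, axis=0)   # rows become units 1, 5, 10, 13
group2 = rng.multivariate_normal(np.full(v, 6.3), sigma, size=50)

# contaminate units 51, 84, 92 and 99 (rows of group2) by subtracting 1.7 from y3,
# read here as acting on the normal scale
for unit in (51, 84, 92, 99):
    group2[unit - 51, 2] -= 1.7

# move to the scale of the data actually analysed: square y1 and raise y2, y3, y4
# to the power -2, so the true normalizing transformation is (0.5, -0.5, -0.5, -0.5)
Y = np.vstack([group1, group2])
Y[:, 0] = Y[:, 0] ** 2
Y[:, 1:] = Y[:, 1:] ** (-2.0)
labels = np.repeat([1, 2], 50)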
A correct analysis should find the true transformation and reveal those
outliers which are influential either on the transformation or on the mis-
classification probabilities. We show in the next section that our forward
method does just that. In this section we show the markedly less successful
outcome of an analysis using standard techniques, including some diagnos-
tics. Before that we need to quantify the effect on the discriminant analysis
of getting the right model. Some results are in Table 6.2. The nomenclature
of the data is important. We intend:
Original or untransformed data to mean the data set we generated
which needs the transformation of Table 6.1 to achieve normality for 96 of
the observations.
Correctly transformed data are the data after this transformation,
usually consisting of the n = 92 uncontaminated observations.
The results of Table 6.2 show that finding the correct transformation
leads to a three-fold reduction in the average mis-classification probability.
The discriminant function changes markedly, particularly in the importance
of y1. The indication of the modified Box test for equality of the covariance
matrices (2.23) in the untransformed data is that there is clear evidence that
they are not equal - the statistic is to be compared with χ²10. Quadratic
discriminant analysis might then be indicated. The results of this table
show clearly how one might be misled by failing to transform.
Before finding a transformation, we first look at the data. Figure 6.23
shows the untransformed data - they look very non-normal, but the four
outliers on this scale, units 51, 84, 92 and 99 are not at all evident. The
outliers on the transformed scale, 1, 5, 10 and 13 are evident in Figure 6.24.
A question we have to consider is what effect the two sets of outliers have
on estimation of a normalising transformation.
Throughout our analysis we assume that one of the purposes of the
transformation is to find a scale in which the two covariance matrices are
the same and so calculate likelihoods for such a model. The maximum

TABLE 6.2. Simulated data: comparison of discrimination results using transformed and untransformed observations

Test for Equality of Covariances
Transformed Data (n = 92)         12.58
Untransformed Data (n = 100)     313.1

Average Mis-classification Probabilities
                                 Group 1   Group 2
Transformed data (n = 92)          0.04      0.08
Untransformed data (n = 100)       0.11      0.22

Elements of Standardised Canonical Eigenvector
                                   y1      y2      y3      y4
Transformed Data (n = 92)         0.08    0.45    0.38    0.48
Untransformed Data (n = 100)     -0.23    0.23    0.32    0.63

TABLE 6.3. Simulated data: deletion likelihood ratio analysis

                                      min(LR(i))      LR      max(LR(i))
H0: λ = (0, 0, 0, 0)                    24.489      34.954      40.099
H0: λ = (0.5, -0.5, 0.5, 0)              8.706      11.996      15.518
H0: λ = (0.5, -0.5, -0.5, -0.5)         38.556      55.882      62.112
H0: λ = (1/3, -0.5, 1/3, 0)              3.042       4.847       6.326

likelihood estimate of the transformation parameter, found by numerical
search, is λ̂ = (0.26, -0.49, 0.28, -0.24)^T. The value of the likelihood ratio
test for the hypothesis of no transformation is equal to 360 and strongly
suggests that the data must be transformed. As we stressed in Chapter 4,
it is usual, and scientifically sensible, to try to find a simple transformation
close to λ̂. The values -1, -0.5, 0, 0.5 and 1 are commonly considered
together with 1/3, particularly if the variable is a volume.
Table 6.3 gives the likelihood ratio tests for several values of λ which
seem plausible and close to λ̂. The tests are to be compared to the χ²4
distribution. In order to detect the effect of individual observations on
this inference we consider the effect of the deletion of single observations
on these tests. The table gives just the minimum and maximum values
from this computationally exhausting procedure - for each deletion the
new parameter estimate λ̂(i) has to be found by a numerical search in
four dimensions. The results presented in Table 6.3, therefore, required
404 maximizations. A computationally less demanding alternative to these
exact calculations is the use of constructed variables to find approximate

FIGURE 6.23. Scatterplot matrix of untransformed simulated data. The filled
triangles show the four outliers on this scale - observations 51, 84, 92 and 99 -
which are not easily detected.

deletion statistics as outlined in §4.4. Since our main interest is not in
deletion diagnostics, we do not explore this method here.
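The deletion calculation, though computationally heavy, is simple to organise: remove observation i, re-maximize the profile likelihood over the transformation parameters numerically, and recompute the test of the null hypothesis. The loop below is a schematic version; it assumes a function profile_loglik(Y, labels, lam) of the kind sketched in §6.8 and uses a general-purpose optimizer for the four-dimensional search.

import numpy as np
from scipy.optimize import minimize

def deletion_lr(Y, labels, lam0, profile_loglik):
    """Deletion likelihood ratio statistics LR(i) for H0: lambda = lam0.
    Each statistic needs a fresh numerical maximization over lambda with
    observation i removed (n maximizations in all)."""
    lam0 = np.asarray(lam0, dtype=float)
    lr = np.empty(len(Y))
    for i in range(len(Y)):
        keep = np.arange(len(Y)) != i
        Yi, li = Y[keep], labels[keep]
        res = minimize(lambda lam: -profile_loglik(Yi, li, lam),
                       lam0, method='Nelder-Mead')
        lr[i] = 2.0 * (-res.fun - profile_loglik(Yi, li, lam0))
    return lr

# lr = deletion_lr(Y, labels, [1/3, -0.5, 1/3, 0], profile_loglik)
# lr.min() and lr.max() give the kind of extremes tabulated in Table 6.3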
We now consider the results of Table 6.3 line by line. The maximum
likelihood estimates lead one to test whether a common transformation is
possible for all variables. The results of the first line of the table show that
the log transformation for all the variables is incompatible with the data.
If the maximum likelihood estimates are rounded to the five most common
values of λ, one obtains the hypothesis λ = (0.5, -0.5, 0.5, 0)^T. The second
line of the table shows that the minimum value of the deletion likelihood
ratio test is below the χ²4 95% threshold of 9.49. One could then start
a backwards procedure, deleting the observation associated with the mini-
mum value of the deletion likelihood ratio. The third row of Table 6.3 shows
the results when one tries to validate the correct transformation, that is
the hypothesis that λ = (0.5, -0.5, -0.5, -0.5)^T. The smallest value of the

FIGURE 6.24. Scatterplot matrix of transformed simulated data. The filled
squares show the four outliers on this scale - observations 1, 5, 10 and 13. All
outliers are now detectable

deletion likelihood ratio is well above the χ²4 99% threshold suggesting that
this combination of values of λ must be firmly rejected. Finally, the last line
shows the results when the null hypothesis is that λ = (1/3, -0.5, 1/3, 0)^T,
a combination which comes from rounding the maximum likelihood es-
timates of the first and third transformation parameters to 1/3 and the
others to one of the five most common values of λ. The maximum deletion
value of the likelihood ratio is below the 95% point of χ²4, while the final
value of 4.847 is very close to the expectation of the χ²4 distribution.
As a result of this diagnostic analysis one might think that λ = (1/3,
-0.5, 1/3, 0)^T would be a good transformation, since the values of the
statistic are always within the 95% confidence boundary. We give the plot
of this deletion statistic in Figure 6.25.

FIGURE 6.25. Incorrectly transformed simulated data: single deletion likelihood
ratio test, which supports the hypothesis λ = (1/3, -0.5, 1/3, 0)^T. To be compared
with χ²4

As the summary in the table indicated, there are no obviously influential
observations.
One can now check this selected transformation for outliers, which might
have an inflationary effect on the estimation of the covariance matrix. The
left-hand panel of Figure 6.26 is a plot of the deletion Mahalanobis distances
for the transformation λ = (1/3, -0.5, 1/3, 0)^T. There are no obviously
large values, although some are clearly larger than others. One could then
assess the distribution of these quantities, to see whether the larger values
are really too large. Since squared Mahalanobis distances in this example
should asymptotically follow the x~ distribution we give in the right-hand
panel of Figure 6.26 a QQ plot of the squared distances against the chi-
squared order statistics. One would be hard put to it to see any evidence
of outliers from this smooth and uneventful plot.
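Such a plot is straightforward to construct. The sketch below uses ordinary within-group Mahalanobis distances rather than the deletion distances of the figure, purely for simplicity, and is our own illustration rather than the code behind Figure 6.26.

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def squared_mahalanobis(Y, labels):
    """Squared distance of each unit from its own group mean, using the
    pooled within-groups covariance matrix."""
    groups = np.unique(labels)
    means = {g: Y[labels == g].mean(axis=0) for g in groups}
    S = sum(((labels == g).sum() - 1) * np.cov(Y[labels == g], rowvar=False)
            for g in groups) / (len(Y) - len(groups))
    S_inv = np.linalg.inv(S)
    diffs = Y - np.array([means[g] for g in labels])
    return np.einsum('ij,jk,ik->i', diffs, S_inv, diffs)

def chi2_qqplot(d2, df=4):
    """QQ plot of squared distances against chi-squared order statistics."""
    d2 = np.sort(d2)
    q = stats.chi2.ppf((np.arange(1, len(d2) + 1) - 0.5) / len(d2), df)
    plt.plot(q, d2, 'o')
    plt.plot(q, q)                      # reference line of unit slope
    plt.xlabel('Chi-squared quantiles')
    plt.ylabel('Squared Mahalanobis distances')
    plt.show()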
This simple example shows how extremely difficult it is, when outliers are
present, to find the correct transformation in discriminant analysis when
the starting point is the maximum likelihood estimates of the parameters
based on all the observations.

6.10.2 Finding a Transformation with the Forward Search


To find a transformation for these data using the forward search, we follow
the three steps of the procedure outlined in §4.6. In Step 2 of the proce-

FIGURE 6.26. Incorrectly transformed simulated data: deletion Mahalanobis dis-
tances. Left panel, index plot; right panel, QQ-plot of squared distances against
the chi-squared distribution on 4 degrees of freedom.

dure we use an unconstrained search because, if the data are appropriately
transformed, a cluster of k outliers from one group will enter the subset
in the last k steps of the forward search, rather than being spread over
many steps. Although we use unconstrained searches, the conclusions we
reach are not affected by this choice; finding the correct transformation is
however simplified.
Step 1. We start with a forward search with the ordering of the Maha-
lanobis distances based on untransformed observations. The curves for the
parameter estimates, which we do not show, are stable apart from that for
λ3 which shows a jump when the subset size m is between 70 and 80. If we
consider only the five most common values of λ, this plot suggests taking
λ = (0.5, -0.5, 0.5, 0)^T in the second search.
Step 2. Figure 6.27 shows the results of this new search. The maximum
likelihood estimates in the upper panel of the figure are much more stable
during the central part of the search than they were for the search in
Step 1. The jump in values for 5.3 now occurs between m = 95 and m =
98. This means that we are moving towards a transformation in which
the outliers are clearly identified and enter near the end of the search.
If the transformation were correct the outliers would enter at the end.
The plot of the likelihood ratio test for the null hypothesis that the value
used in ordering is correct is in the lower panel of Figure 6.27. This shows
that evidence against these values accumulates steadily during the forward
search until the very end, when the introduction of the last few observations
causes the transformation to become seemingly acceptable.
The next combination of values of λ which is suggested by the stable
part of the search in the upper panel of Figure 6.27 up to m = 95 is
λ = (0.5, -0.5, -0.5, -0.5)^T. The forward search using these new values is

FIGURE 6.27. Simulated data: forward search ordering Mahalanobis distances
based on λ = (0.5, -0.5, 0.5, 0)^T. Upper panel, maximum likelihood estimates of
transformation parameters; lower panel, likelihood ratio test for the parameter
value used in the search

TABLE 6.4. Order of inclusion of observations for search with λR = (0.5, -0.5, -0.5, -0.5)^T

Step (m)         92   93   94   95   96   97   98   99   100
Unit included    69    5   13   10    1   99   51   92    84

shown in Figure 6.28. In this case all the estimates along the forward search
are as specified in our vector λ until m = 97. The plot of the likelihood ratio
in the lower panel of Figure 6.28 shows that it is the last three observations
to enter that cause rejection of our initial estimate. We therefore take λR
= (0.5, -0.5, -0.5, -0.5)^T for further diagnostic calculations.
Step 3. As a result of our forward analysis we have easily recovered
the transformation in Table 6.1 which leads back to normality. Equally
importantly, we can now identify the outliers which were causing difficulty
in finding these transformations. In Table 6.4 we give the order of inclusion
of the last nine units in the forward search with λR. The last four are the
units which are outliers on the scale in which we started to analyse the
data. The four before are outliers once we have correctly transformed the
data.

FIGURE 6.28. Simulated data. Forward search ordering Mahalanobis distances
based on λ = (0.5, -0.5, -0.5, -0.5)^T. Upper panel, maximum likelihood esti-
mates of transformation parameters; lower panel, likelihood ratio test for the
parameter value used in the search. The outliers now enter at the end of the
search

We now identify the effect of the individual influential observations by
running searches for the five standard values for each parameter in turn.
Figure 6.29 shows fan plots of the signed square root of the likelihood
ratio test for 20 forward searches (5 for each variable) around the vector
λR. This plot (using asymptotic 99% confidence bands) shows that for the
first variable the square root transformation is the best. Even though the null
hypothesis of the log transformation cannot be rejected at the 1% level,
the corresponding curve is very close to the rejection line. For the second
and fourth variables only λ = -0.5 seems to be compatible with the data.
Finally, the most interesting plot is the one for variable 3. The hypothesis
λ = -0.5 is perfectly in agreement with the data up to the inclusion of the
last 4 units which are observations 99, 51, 92 and 84. The effects of these
4 units in the five searches are very different. When λ = -1 or λ = -0.5
they are included in the last 4 steps and cause a big jump in the value
of the statistic. When λ = 0 and λ = 0.5 they are included around the
end of the search and cause the value of the statistic to re-enter inside the
confidence bands. Using λ = 1 these 4 observations are included in the
middle of the search and cause an upward jump. In this scale, however, the

FIGURE 6.29. Simulated data: fan plots of the signed square root of the likelihood
ratio test for transformation, confirming λR = (0.5, -0.5, -0.5, -0.5)^T

steady downward trend caused by all the other observations again brings
the value of the statistic below the lower threshold.

6.10.3 Discriminant Analysis and Confirmation of the Transformation
We now look at the effect of our transformation on the behaviour of the dis-
criminant analysis. Since there is no explicit connection between the Box-
Cox transformation and discrimination, we repeated our analysis of trans-
formations by finding values of λ which minimized the mis-classification
probabilities, rather than maximizing the likelihood (6.21). The answers
were not very clear. The transformations of some variables were reasonably
defined, but for those with a small weighting in the canonical eigenvector
the transformation was very poorly defined. Transformation of such a vari-
able has little effect on the discrimination, even if the small value is caused
by outliers. We therefore did not pursue this procedure.
The results of Table 6.2 stress the importance, for discriminant analysis,
of correctly transforming the data. We now consider the much more detailed
information on discrimination which is gathered from the forward search.
Figure 6.30 gives the probabilities of correct classification of units in
Group 1 during the last third of the search. There are three units, 6, 45
and 15, which are not well classified. The classification of the other units

FIGURE 6.30. Simulated data: posterior probabilities of correct classification of
units of Group one after correct transformation λR = (0.5, -0.5, -0.5, -0.5)^T

remains stable and good until the search reaches m = 92. Then the four
outliers from Group 1 enter one after the other and cause the probabilities
to worsen. At the end of the search units from Group 2 enter and do not
have much effect on the probabilities in Group 1.
The plot for Group 2 is much more dramatic. We see from Figure 6.31
that the outliers on the original scale of our data are even more outlying
after the data have been transformed - as would be expected from their
large influence on the transformation. When these observations are intro-
duced, the probabilities change appreciably and the units move to being
correctly classified. But there remain two observations 74 and 86 on the
boundary of the two groups, for which the probabilities oscillate during the
search. Also the naturally outlying 69 is continually mis-classified.
Finally in Figure 6.32 we look at the monitoring of the Mahalanobis
distances and Box statistic (2.23) during the search. The left-hand panel
of Figure 6.32 shows the maximum distance monitoring plot. In the first
stages the plot shows that the maxima are very close - there is no evidence
of any difference in variance in the two groups. When m = 92, observation
69 enters Group 2 and the distance jumps up. After that four outlying
units enter Group 1, causing a jump in the maximum distance for that
group. As successive units enter there is a slight decrease due to masking,
but observation 1 causes a slight increase. After this the last four units
enter Group 2 - there is again a big increase in the distance for Group 2,
which then drops back a bit due to masking. The distances for Group 1 are
hardly altered by the introduction of these four units, despite the common
6.10 Transformations in Discriminant Analysis: A Simulated Example 343

CX>
0

"'
~ CD
~ 0
"'
.0
0
Q.
0
....0
·~
(ij
0
c.. C\1
0

0
0

75 80 85 90 95 100

Subset size m

FIGURE 6.31. Simulated data: posterior probabilities of correct classification of


units of Group two after correct transformation >..n = (0.5 , -0.5, - 0.5, -0 .5)T

....'
... .
First group
Second group ''
'
0 . ''
'
"'
...f \
.
--·.

75 80 85 90 95 100 75 80 85 90 95 100
Subset size m Subset size m

FIGURE 6.32 . Correctly transformed simulated data: forward plots. Left panel,
maximum Mahalanobis distance of units included in the subset; right panel, Box
test of equality of covariance matrices (2.23)

estimate of the covariance matrix used. Two conclusions from this plot are
that initially the within group variances are very similar and that we do not
need to constrain the search to be balanced - a situation very different from
that of the example in the next section. And, secondly, that the outliers,
infiuential or not, are entering at the end of the search.
344 6. Discriminant Analysis

We finish with the right-hand panel of Figure 6.32 in which we monitor


the Box statistic for equality of variances (2.23). This is sensibly constant
in the earlier stages of the search, below the 5% point, showing no evidence
of any inequality. But, at the end of the search, starting with observation
97, the evidence for non-equality of covariance matrices builds steadily.
The starting point of m = 97 for this information agrees with the plot
in the left-hand panel of the figure where, although there is an increase
in distances from m = 92, the distances for both groups initially increase
together. Only from m = 97 are the two curves markedly different.
Our exemplary analysis of these data has enabled us, in a Straightforward
manner, to recover all the features that we built into the data and to assess
their effects on discrimination. We now use this experience to analyse a set
of data about which we have no such prior knowledge.

6.11 Muscular Dystrophy Data


6.11.1 The Data
In this section we apply our method to a real data set. As we would ex-
pect from the preceding example, the misclassification rate decreases after
appropriate transformation of the data. But we also see that the order of
importance of the variables may change substantially after the data have
been appropriately transformed. This is important in this example since
the variables that increase in classificatory power are those which are more
easily measured.
Duchenne Muscular Dystrophy (DMD) is a genetically transmitted dis-
ease passed from a mother to her children. Affected male offspring may
unknowingly carry the disease but male offspring with the disease die at
a young age. Although Carriers of DMD usually have no physical symp-
toms, they tend to exhibit elevated levels of serum markers. In addition,
the levels of these enzymes may also depend on age and season. Levels of
the enzymes were measured in non carriers and in a group of Carriers using
standard laboratory procedures. The variables used are:

Y1: age (AGE)


Y2: month of the year (M)
y 3 : level of creatine kinase ( CK)
y4: level of hemopexin (H)
y5: level of lactate dehydrogenase (LD)
y5: level of pyruvate kinase (PK).

The first two serum markers, y 3 and y 4 , may be m easured rather inexpen-
sively from frozen serum. The second two, y 5 and y 6 , require fresh serum.
An important scientific problern is whether use of the expensive second pair
6.11 Muscular Dystrophy Data 345

of readings causes an appreciable increase in the detection rate. A further


feature of the data is that the water supply to the laboratory was changed
in the course of the study, although when is not recorded. It is therefore
likely that some outliers are present in the data.
We now apply our forward method to the analysis of these data. We
start with the 73 observations used by Reneher (1995, p. 170), find a trans-
formation, look for and identify outliers and consider discrimination. We
then move to the complete data set given by Andrews and Herzberg (1985 ,
pp. 223-228) and see how well our model fits. Wehave thus the advantages
of a confirmatory sample splitting, without having, ourselves, to make an
arbitrary split of the data.

6.11. 2 Finding the Transformation


We first , as ever, look at a plot of the data, which is given in Figure 6.33.
For clarity on the printed page we only include variables 1, 3, 4, 5 and 6.
It certainly looks as if the data should benefit from transformation - some
marginal distributions are skew, the variances in the two groups appear
different and there may be some outliers.
Step 1. As before, we start with a forward search using untransformed
data to obtain a preliminary idea of a good transformation. The resulting
forward plot of parameter estimates, using an unbalanced search, is in
Figure 6.34. For most of the plot the estimates of all A's except A1 are
reasonably stable, with some downward drift, but no sudden jumps. The
exception is the estimate for A1 which decreases gradually to 0.5 when at
m = 62 it drops abruptly to -1. Our previous experience suggests that a
group of outliers may be entering at this point. To determine whether this
is so, we use A1 = 0.5 in Step 2.
Remark. One feature of Figure 6.34 that should be stressed is the effect
of performing an unbalanced search. In this search we used untransformed
data. If we are far from the correct transformation, small observations will
enter first, as they do in transformation for a single group of observations.
But if as here there are two groups with the smaller observations predom-
inantly in one group on the untransformed scale, observations from that
group will tend to enter the search earlier. Whether or not we use a bal-
anced search, the later observations to enter the search will be those that
are more informative about the transformation. The curves of the estimates
will then tend to drift away from one if a transformation is needed. When
we search on the correct scale, if there is such, the two groups will have
similar variances and observations from one group will not enter the search
preferentially in the earlier stages.
Another feature of Figure 6.34 also deserves comment. We have shown
more of the search than is needed to determine the transformation, in order
to exhibit the amount of movement that may be found in the initial stages
of the search. With such small values of m there may not be sufficient infor-
346 6. Discriminant Analysis

FIGURE 6.33. Small muscular dystrophy dataset: scatterplot matrix of untrans-


formed data

mation to determine the transformation precisely. Because we are searching


with >. = 1 this value is initially preferred as observations agreeing with
this scale are preferentially selected. Also, at the very beginning, one or
two unmasked outliers may have been included, but are rapidly removed
by the forward search. If it is felt that the results of the search are being
infl.uenced by the starting conditions, several searches can be run from a
variety of starting points. We see here, as we have seen several times, that
the last third of the plot is unchanged for all starts with the same >..
Step 2. We now run the forward search from the starting point suggested
in Step 1, that is with >. = (0.5, 1, -0.5, 1, 0, o)T. The resulting forward
plot of estimates is in Figure 6.35. The change in 5. 1 is now right at the end
of the search, starting with m = 71. The last three observations to enter -
46, 67 and 68- have a large effect on the estimated transformation.
The effect of the last three observations on the evidence for the trans-
formation is visible in the left-hand panel of Figure 6.36 , which shows the
6.11 Muscular Dystrophy Data 347

.... ____ _

0
ci ---------------------------
1.0 --------- 3
9

40 50 60 70
Subset size m

FIGURE 6.34. Small muscular dystrophy dataset: forward plot of estimates of


transformation parameters from an unbalanced search on untransformed data

/ - - - - - - - - - " _ _.... .... 4

.......... -- .. -----~~-~---- ....... ----"


......... --------... ---- ... ___ _
1.0
ci

0
ci

1.0 , ____________ /-----~--------


//----------------
9

40 50 60 70
Subset size m

FIGURE 6.35. Small muscular dystrophy dataset : forward plot of estimates of


transformation parameters from a search with ,\ = (0.5, 1, -0.5, 1, 0, O)T

manner in which the likelihood ratio statistic for this transformation jumps
up at the end of the search, so that the transformation is rejected by the
test. Our initial conclusion is that we have found the correct transformation
and that there are three outliers.
348 6. Discriminant Analysis

0
N

40 50 60 70 0 20 40 60
Subset size m Observation number

FIGURE 6.36. Small muscular dystrophy dataset: results of search with >. =
(0.5, 1, -0.5, 1, 0, o)T. Left panel, forward plot of likelihood ratio test of the trans-
formation; right panel, deletion likelihood ratio test, which fails to identify the
importance of the outliers

The masked nature of these outliers is demonstrated in the right-hand


panel of Figure 6.36, which shows the deletion version of the likelihood ratio
test of. Although individual deletion of our three outliers (46, 67 and 68)
causes the three largest decreases in the statistic, the decrease is nothing
like significant. One would be hard put to it to identify the importance of
these three observations from this plot.
Step 3. We now confirm the suggested transformation. Figure 6.37 shows
a signed square root likelihood ratio expansion for each variable around AR
=(0.5, 1 ,-0.5, 1, 0, o)T. Since we are interested in the fine structure of the
plot, even for wrong transformations, we used a balanced search.
The plot shows that only -0.5 is reasonable for Y3· For Y2 and Y4 >. = 1
seems to be best even if the data do not provide significant evidence against
the square root transformation for y 4 . The plot for Y6 confirms that the best
transformation for this variable is the logarithm, as it is for Y5· Notice a
small jump at the end of the curve for y 5 when >. = -0.5. This is caused
by unit 24 (the smallest for y5 ). It is only for y 1 that the three outliers we
have identified are highly influential for the transformation. The panel for
Yl shows that the presence of units 46, 67 and 68 is incompatible with >. = 0
(at the 5% level), 0.5 and 1. The other two values appear tobe compatible
with all the data but the inclusion of the three observations earlier in the
search causes breaks in the Ievels of the curves
A final comment on these plots, especially that for y 1 , is that the three
outliers are all in Group 2, the carriers of the disease. With the balanced
search they cannot enter the subset consecutively, so that their inclusion is
spread over the last five values of m.
6.11 Muscular Dystrophy Data 349

"' I
--~

20 30 40 50 60 70 20 30 40 50 60 70

.---......,.--
.......... ... .... _.. - ..
........
_.." __ _
.._ ... ____ , ___ -0.

20 30 40 50 60 70

- .:-~~ -------------- .......... -'


-· .>...--- -0

20 30 40 50 60 70 20 30 40 50 60 70

FIGURE 6.37. Small muscular dystrophy dataset: fan plots of the signed
square root of the likelihood ratio test for transformation, confirming AR =
(0.5, 1, -0.5, 1, 0, O)T

6.11. 3 Outtiers and Discriminant Analysis


Our procedure has suggested three outliers. It is important to refer these
back to the original data. Their infl.uence was on the transformation of y 1 ,
age. The three outliers are by far the oldest people present. For the other
70 observations the ages range from 29-42. The three outliers have ages at
measurement of 58, 58 and 59. Once our attention has been drawn to them
by the forward analysis, not only are they clearly evident in the scatterplot
matrix of transformed data in Figure 6.38, but they are also evident at the
top of the top row of plots in the scatterplot matrix of untransformed data,
Figure 6.33. Finally, if we compare Figure 6.38 with Figure 6.33 we can
clearly see how much more normal the data have become.
The effect of the transformation on the discriminant analysis is appre-
ciable. For our comparisons we use the transformation AR = (0.5, 1 , -0.5,
1, 0, O)T for all 73 observations. Some results are in Table 6 .5, both for all
73 observations and for the 70 observations when the outliers are deleted.
Failure to use the transformation results in an approximately 25% in-
crease in mis-classification, whether or not the outliers are excluded. Under
similar conditions, use of the transformation gives almost 80% of the units
a decreased probability of mis-classification. These results on probabilities
can be visualized by ordering the probabilities of mis-classification for each
350 6. Discriminant Analysis

!:. !:. !:. ~

.
~

:::
-:

FIGURE 6.38. Small muscular dystrophy dataset : scatterplot matrix of trans-


formed data showing observations 46, 67 and 68 in the Yl - Y3 scatterplot

TABLE 6.5. Small muscular dystrophy dataset (n = 73) : overall mis-classification


results for discrimination using untransformed and transformed observations with
AR = (0.5, 1 , -0.5, 1, 0, o)T; n = 70, Observations 46 , 67 and 68 deleted

n = 73 n = 70
Transformed data 0 .153 0.159
Untransformed data 0 .188 0.199
Percentage of units improved 79.5 77.1

group and then plotting the logits ofthe probabilities against order number.
Figure 6.39 shows that the improvement isover almost all units.
6.11 Muscular Dystrophy Data 351

CX)

..ci <0
eCl.
(i) ~
0
Cl.
0 C\1
.l!!
"C»
0
......1 0

0 20 40 60
order number

FIGURE 6.39. Small muscular dystrophy dataset : classification probabilities. The


upper curve is the logit of the posterior probability of correct classification after
transformation, the lower curve that before

6.11.4 More Data


We now extend our analysis from the 73 units studied by Reneher to the 194
complete observations originally given by Andrews and Herzberg (1985).
An interesting question is the stability of our conclusions to the extended
set. If observations 46, 67 and 68 are treated as outliers, 0.5 is a good
estimate of A1 . If they are treated as part of the data, then -0.5 is a better
value. These three individuals appeared as outliers because they are much
older than the remairring 70. But the results of Table 6.5 show how little
effect these observations have on the discriminant analysis.
The data with 194 units are no langer balanced - there are 127 non-
carriers whereas Group 2 contains only 67 carriers. With the new data we
may expect that there may be some additional outliers. Also there are 14
new units with ages of at least 43 years. Although these are all carriers
of the disease , they can be expected to teil us something about the three
outliers identified in the smaller data set.
We do not display here the results of the three stages in finding a suit-
able multivariate transformation, but move straight to the confirmatory
expansion in five values of A. The conclusion is that the previous trans-
formation holds except that now A1 = -0.5. The 14 newly included units
for subjects at least 43 years old thus provide information to support the
stronger transformation of y 1 indicated by the three units out of the initial
73. We therefore want to confirm the value AR = (-0.5, 1, -0.5, 1, 0, o)T.
The signed square-root tests from the expansionaraund this value are plot-
352 6. Discriminant Analysis

__.........,_.-.
>. 0 __ .. -
-
--------------
-
0
0 L_------------~------~
120 140 160 180 120 140 160 180
~ r-----------------------,

"'

__:_-----------,
-----=:.::::._'-'-. . . . . . . . . . . --.. . . ___ 0

120 140 160 180 120 140 160 180

0
L_------------~---------
120 140 160 180 120 140 160 180

FIGURE 6.40. Large muscular dystrophy dataset: fan plots of signed square
roots of the likelihood ratio test for transformation, confirming AR
( -0.5, 1, -0.5, 1, 0, O)T

TABLE 6.6. Large muscular dystrophy dataset: order of inclusion of observations


for balanced forward search with AR = (-0.5, 1 , -0.5, 1, 0, O)T

Step (m) 184 185 186 187 188 189


Unit lncluded 156 101 117 155 95 27
Group 2 1 1 2 1 1

Step (m) 190 191 192 193 194


Unit Included 140 118 53 130 78
Group 2 1 1 2 1

ted in Figure 6.40, in which we have used a balanced search as some of the
30 forward searches are for values far from AR· The order of inclusion of
the units for the search with AR is given in Table 6.6.
The plots for variables 1, 2 and 3 are uneventful. The inclusion of obser-
vation 78 at the end of the search causes a jump in the value of A4 as it
does for A6 , which is also slightly influenced by observation 53. But there
is no reason to question the suggested value of AR·
We now examine the data for any other outliers and the effect of trans-
formation on their presence. If there are a ny further outliers, the plots in
6.11 Muscular Dystrophy Data 353

Figure 6.40 show that they will not have had an influence on the choice of
transformation.
We start with the untransformed data. Because we are interested in the
differences between the groups we use a balanced search so that we can mon-
itor the Mahalanobis distances from the two groups as the search evolves.
Figure 6.41(a), the minimum distance monitaring plot, shows that before

FIGURE 6.41. Large muscular dystrophy dataset: (a) minimum Mahalanobis


distances of units not in the subset from a balanced search on untransformed
data - the difference in variance in the two groups is evident; (b) determinant of
the estimated common covariance matrix - the zig-zag pattern is caused by the
constraint imposed by balance; (c) distances as for (a), but from an unbalanced
search on transformed data- the outliers are now revealed; (d) covariance matrix
as in (b), but now the data have been transformed and there is no Ionger an effect
of different variances

transforming the groups seem to have a completely different structure in


terms of Mahalanobis distances. The different structure of the two groups
is also immediately evident from the zig-zag in Figure 6.41(b), a feature
of pooled determinant monitaring discussed below (6.18). The addition of
a first and second unit from Group 1 does not have as big an effect on
this statistic as adding a single unit from Group 2. It is clear from these
354 6. Discriminant Analysis

two panels that, in the absence of transformation, one would be forced to


consider quadratic discriminant analysis or non-parametric procedures.
The other half of the plot is for a search using our transformation AR.
The contrast with Figure 6.41(a) and Figure 6.41(b) is fascinating. Fig-
ure 6.41(d) shows the new pooled determinant monitoring. This curve,
now that the data have been transformed , has become a straight line. It is
also interesting to notice the initial decrease of this curve in the first steps
of the forward search. This refl.ects the exchange of units in and out of the
subset during the first steps of the search. As an aid to outlier detection

TABLE 6.7. Large muscular dystrophy dataset: order of inclusion of observations


for unbalanced forward search with AR = ( -0.5, 1 , -0.5, 1, 0, O)T

Step (m) 184 185 186 187 188 189


Unit Included 117 27 95 156 155 146
Group 1 1 1 2 2 2

Step (m) 190 191 192 193 194


Unit Included 118 53 140 130 78
Group 1 1 2 2 1

TABLE 6.8. Large muscular dystrophy dataset (n = 194) : results for discrimina-
tion using untransformed and transformed observations with AR =(-0.5, 1, -0.5,
1, 0, O)T: n = 189, observations 53, 78, 118, 130 and 140 deleted

Overall mis-classification rates

n = 194 n = 189
Transformed data 0.132 0.119
Untransformed data 0.170 0.166
Percentage of units improved 72.7 78.3

Tests of equality of covariance matrices

n = 194 n = 189
U ntransformed data 605 .6 602.9
Transformed data 48 .059 51.401

Figure 6 .41(c) shows the minimum distance monitaring plot now using an
unbalanced search. The order of inclusion of the observations is given in
Table 6.7. The plot shows that, for most of the search, the transformation
has made the variances in the two groups comparable, but that there are
five outliers. For Group 1 observations 118, 53 and 78 are shown to be
6.11 Muscular Dystrophy Data 355

outlying by the upward jump in the plot. The two outliers for Group 2 are
130 and 140. One interesting feature is that these outliers are for subjects
with ages between 22 and 39. The inclusion of the additional 14 units with
ages at least 43 has rendered the previous three outliers extreme but not
highly atypical. We now consider the effect of transformations and outlier
detection on the properties of the discriminant analyses.
We compare all 194 observations and the 189 observations left after out-
Her detection, both on the original and on the transformed scales. Table 6.8,
tobe compared with Table 6.5, shows the overall mis-classification proba-
bilities for the four analyses. The general difference between the two tables
is that, with more observations used in estimation, the overall rates have
gone down. The highest average probability of mis-classification now is
0.170 for the original data, dropping to 0.119 for the transformed data
without the outliers, a 43% increase in mis-classification from failure to use
the transformation. The percentage of units for which the probability of
mis-classification decreases because of the transformation is around 75%
and is slightly increased when the outliers are removed.
The table also gives the results of the test for equality of covariances. lt
is clear from the table, as it was from the figures, that the transformation
has an enormaus effect in improving the equality of the variances of the
two groups, although, with the 99% point of x~ 1 being 38.93, equality has
not quite been obtained.

TABLE 6.9. Large muscular dystrophy dataset: comparison of the coefficients of


the canonical eigenvector using untransformed and transformed observations

Elements of standardised ACE M CK H LD PK


canonical eigenvector Y1 Yz Y3 Y4 Ys Y6
Untransformed data
(n = 194) 0.593 0.001 0.160 0.319 0.304 0.488

Transformed data
(n = 194) 0.490 -0.052 0.582 -0.387 -0.250 -0.276

Transformed data
(n = 189) 0.542 -0.073 0.650 -0.310 -0.202 -0.211

The final table gives information on the discriminant function. We list the
coefficients of the standardized canonical eigenvector in Table 6.9. Apart
from YI and Y2, age and month, the other four variables are serum mark-
ers: Y3 and Y4 are inexpensive to measure, y5 and Y6 expensive. For the
untransformed data the inexpensive variables are second and fourth in im-
portance amongst the markers whereas, after transformation, they are first
and second. Removal of the outliers further increases the weighting on Y3
356 6. Discriminant Analysis

and y 4 relativetothat on y 5 and y 6 . There is thus an indication t hat use of


the transformed data could Iead to the development of a eheaper medical
test.

6.12 Further reading


Books on the analysis of multivariate data customarily include material on
discriminant analysis, for example Krzanowski (2000, Chapters 12 and 13)
and Mardia, Kent, and Bibby (1979, Chapter 11). A book length treatment
is McLachlan (1992).
In general, there is a lack of concern in many publications about verifying
the conditions under which the normal theory discriminant analysis of this
chapter is applicable. However, Krzanowski (2000, p. 364) applies the uni-
variate Box-Cox transformation to improve variance homogeneity and the
last two decades have seen the publication of many articles about outlier
detection in discriminant analysis. Campbell (1980) and Campbell (1982),
in the context of an M -estimation scheme, suggested iterat ive methods in
order to downweight the influence of outliers. Critchley and Vitiello (1991)
considered the sensitivity of mis-classification probability estimates to the
deletion of one case in two population linear discriminant analysis. The
quantities suggested in and in a series of papers by F\mg (1992) , Fung
(1995a), Fung (1995b), Fung (1996a) and Fung (1998) all depend on two
fundamental statistics, which are analogaus to the residual and leverage
measures in regression. However, as Critchley and Vitiello explain "exam-
ining the joint influence of several observations is complicated by the com-
putational burden and by possible masking effects". Our analysis of the
simulated data in §6.10 reinforces this point. Very robust methods have
been applied to discriminant analysis by Hawkins and McLachlan (1997) ,
by Croux and Dehon (2001) and by Hubert and Van Driessen (2003).
6.13 Exercises 357

6.13 Exercises
Exercise 6.1 Describe similarities and differences in terms of projection
of points between principal components analysis and canonical variate anal-
ysis.

Exercise 6.2 Show that for two groups

2
B = (g- l)Es = L nl('fh- fi)(fh- yf (6.25)
l=l
can be expressed as

(6 .26)

where

Exercise 6.3 Relationship between two group discriminant analysis and


multiple regression.
Show that if we define a dummy group variable w as

nz
w for each Yu, Y21, · · ·, Yn 1 l in sample 1
n1 + nz
n1
for each Y12, Yzz , · · ·, Yn22 in sample 2,
n1 +nz

the vector of regression coefficients /3 when w is jitted to the matrix GY is


proportional to the discriminant function coefficient vector a = Ew1 (y 1 -y 2 )
and is given by
ß= n1n2 a
(n1 + nz)(n 1 + nz- 2 + T 2)

where T 2 = (y1 -Jlz)Tf:~(Jh- y 2 )n1nz/(n1 + n2).


Also show that the squared multiple correlation coefficient R~IY is given
by {F(y 1 - '!Jz).

Exercise 6.4 Show that in the case of two groups the classification rule
based on the first canonical variate is exactly the same as the classification
rule based on multivariate normality with equal covariance matrices.

Exercise 6.5 Show that in the two group case, the within group sample
correlation of each variable yc1 with the discriminant function (ryc ,z) is
J
358 6. Discriminant Analysis

directly proportional to the two sample t-statistic for that variable

k Y11- Y12
(j=l, ... ,v),
ryci ,z = . f(_L + _L) w··
y n1 n2 JJ

where w11 is the jth diagonal element of the pooled covariance matrix Ew
and
k=

Exercise 6.6 Let B = (g -l)EB and W = (n- g)Ew be, respectively, the
between and within groups matrices of residual sum of squares and products
of the data. Show that the maximum of aT Baj(aTWa) is obtained when a
is the eigenvector corresponding to the Zargest eigenvalue >. of the matrix
w- 1B. This gives the so-called first discriminant function (or canonical
variate) z1 = Ya1 = auyc 1 + a21YC2 + · · · + av1YCv. Show that the dis-
criminant function which has the Zargest discriminant criterion achievable
by any linear combination of the y 's that is uncorrelated with z 1 is obtained
when >. 2 is the second Zargest eigenvalue of the matrix w- 1 B and a2 is
the corresponding eigenvector. Show that the discriminant function which
has the Zargest discriminant criterion achievable by any linear combination
of the y 's that is uncorrelated with z 1 and z2 is obtained when A3 is the
third Zargest eigenvalue of the matrix w- 1B and a 3 is the corresponding
eigenvector. Show that this property extends to z 4 , z 5 , ... , z 8 • How do your
answers change if the matrices B and W are replaced by the estimates EB
and Ew?
Exercise 6. 7 Show that the canonical variates are uncorrelated, between
groups, within groups and over the whole sample.
Exercise 6.8 For two populations the allocation rule which has been sug-
gested in this chapter is of the form

Allocate y to P1, if fr(y)/ h(y) > k


and to P2, if Jr(y)j h(y) ::; k . (6.27)

Let c(ll2) be the non-negative cost incurred whenever an individual from


P2 is incorrectly allocated to P1 and let c(2ll) the cost incurred whenever
an individual from P 1 is incorrectly allocated to P 2 . Let n 1 and n 2 be re-
spectively the prior probabilities that an observed value y is from P1 or P2.
Suppose the classification rule is defined by a partition of the v-dimensional
sample space R into two exhaustive and mutually exclusive regions R 1 and
R2. Also, suppose we have decided to assign to P 1 observations falling into
R1 and to P2 units falling into R 2 . Now, define as p(ll2) the conditional
6.14 Solutions 359

probability of classifying a unit as P 1 when it is really from P 2 and similarly


as p(2/1) the conditional probability that an observation y comes from P 1
and is misallocated. The expected cost due to misallocation, EC M, is given
by
ECM = c(2/1)p(2/1) + c(1J2)p(1/2). (6.28)
Derive the allocation rule that yields minimum expected cost due to mis-
allocation by finding the regions R 1 and R2 that minimize ECM in equa-
tion (6.28).

6.14 Salutions

Exercise 6.1
Principal components analysis tries to find new orthogonal directions (i.e.
linear combination z = aljYC 1 + · · · + avjYCv) in which the projected points
exhibit maximum spread. The purpose of canonical variates analysis is to
provide a low dimensional representation of the data that highlights as
accurately as possible the true differences between the g subsets of points
in the full configuration. For example, the first principal component is such
that when all units are projected onto this direction they exhibit maximum
spread. It is evident that although the overall projection of points along
this direction may be maximum there is no indication of any difference
between the two groups along this direction.
If 9] contains the coefficients of the jth principal component eigenvec-
tor and if G = (g 1 , . . . , 9v), the condition GT G = I established in Section
5.2.1 implies that the principal components are uncorrelated over the whole
sample and that the principal component transformation from the original
variates y to the new variates z is orthogonal. On the other hand, the
constraint imposed to the set of canonical eigenvectors A = ( a 1 , ... , a 8 ),
ATi;wA =I (equation 6.13) shows that the canonical variate transforma-
tion from y to z is not orthogonal. In geometric terms this means that the
principal component axes are at right angles to each other and that the
frame of reference for the principal component space is obtained by a rigid
rotation of the original frame of reference. On the other hand, the canonical
variate axes are not at right angles to each other and the frame of reference
for the canonical variate space involves some deformation of the original
frame of reference, with some axes pressed closer to each other and others
pulled further apart.

Exercise 6.2
'fh - fi can be written as
n1fi 1 + n2'!h - n1fi1 - n2'fh n2('!h - fi2)
n1 + n2 n1 +n2
360 60 Discriminant Analysis

The first term in the sum of equation (6025) is


n n2
(n 1 ~ ~ 2 ) 2 (fh - 'ih)('ih - 'fh)To (6029)

The second term in the sum is


n2n
(n 1 ~ ~ 2 ) 2 Cfh- 'ih)(fh- fh)T o (6030)

It is easy to verify that the sum of equations (6029) and (6 030) gives
B = + Y1 - _Y2 )(-Y1 - _Y2 )T
n1n2 (-
n1 n2
°

Exercise 6.3
Note that w = 0 for all n 1 + n 2 thus owe work in terms of deviations from
the pooled means both in terms of explanatory variables (columns of the
matrix Y) and dependent variable (dummy variable w) o Thus
xrx =nEO
Using equation (6026) we can decompose the matrix nE between groups
and within groups as

n~
, = (n1 + n2- 2) {'~w + n1n2 ddT } ,
n1 + n2 n1 + n2 - 2
where d = Cfh - 'fh)o Using the Sherman-Morrison-Woodbury inversion
formula (equation 2044) , with o: = n 1n 2/{(n 1 + n2)(n 1 + n2- 2)}, we can
easily obtain the inverse

(xrx)-1 = 1 {tw1- o:tw1d~tw1}


+ n2- 2 1 + o:dT~R}d
0

n1
Finally
n1n2 _
T _ n1n2 d o
X Y= (y1 - Y2) = -=----=-
n1 + n2 n1 + n2
Putting these pieces together we obtain
{3 (XTX)-1XTy
tR}d + o:~tR}dtR}d- o:Ew1 d~Ew1 d n1n2 1
1 + o:dTEw1d n1 + n2 n1 + n2- 2
' 1
d
~w n1n2 1
1 + o:dTER}d n1 + n2 n1 + n2- 2
n1n2 ~- 1 d
, 1 L..lw
(n1 + nz)(n1 + nz- 2) + n1nzdT~w d
n1n2
6.14 Solutions 361

As concerns the squared multiple correlation coeffi.cient, if as in this case


we work in terms of deviations from the mean we can use the formula

Now, given that

we can write

2 ßT d !!1!!2.
A

n 'T
Rw!Y = ~ =ß d.
n

The link between two group discriminant analysis and regression was first
noted by Fisher (1936) . Flury and Riedwyl (1985) give further insights into
the relationship.

Exercise 6.4
When there are 2 populations s = min(v , g -1) = 1, thus f:w1 f:B has only
one non zero eigenvalue which can be found explicitly. Given that the trace
is the sum of the eigenvalues, this eigenvalue equals

(6.31)

So, the unique non zero eigenvalue .X of the matrix f:w1 f:B is given by
equation (6.31). For the corresponding eigenvector, it is easy to verify that
the equation

.Xa
2 CJh -Ih)f:w1 0h -Ih)a
n~ n2
n1

holds when a = f:HJ(Ih -'fh).


Note that, without loss of generality, we can assume that z1 > z2 because

since is positive definite. If a were of the form a = f:w1 Oh - y1 ), then


f:H}
(z2 - zl) would be positive. Given that (z1 + z 2)/2 is the midpoint and
362 6. Discriminant Analysis

z1 > z2 , z > (z1 + z2 )/2 implies that z is closer to Z1, the rule in terms of
z is to assign z to P1 if

Given that z = aT y, in terms of y the rule becomes: assign y to P 1 if

> aT('fh +fh)/2


aTy
(Yh -1h)Ttl,;}y > (Yh -1h)Tt\,;}(1h + '!h)/2
('fh -1hft~,l{y- 0.5(1/t + !h)} > 0.

Thus, in the two group case, Fisher classification rule is exactly equal
to the classification rule we obtain for two normal populations and equal
covariance matrices.

Exercise 6.5
Let q(j) be the v x 1 vector with a one in the jth position and zeroes in
all other positions. Then, following the results in (2.92), the jth column
of the matrix Y can be written as YcJ = Y q(j). The within group sample
correlation between YcJ and z = Ya is given by

BycJ ,z q(j)Ttwa
J 8 ~cJ, s; J q(j)Ttwq(j) aTtwa
q(j)Ttwti\}Oh- Yh)
Jwjj(]h -]h)Ttl,;}twt\,;}(yl- Y2)
Y11- Y12
J Wjj (yl - y2)Tt\,;} (Yl - Y2)
n1 + n2 Y11 - Yj2
Y2)t\,;}(yl- Y2) . f(..L + ..L)w·.
n1n2(f11-
V n1 n2 JJ

Exercise 6.6
We first convert the maximization problern to one already solved. Spec-
tral decomposition yields "E = r ArT. The symmetric square root matrix
"E 1 12 = rA 112 rT and its inverse "E- 1 12 = rA- 1 12 rT satisfy "E 112 I:; 112 = "E,
I:;l/2I:;-1/2 = I= I:;-l/2I:;l/2 and I:;-l/2I:;-l/2 = I:;-1 .
If we set u = W 112 a, the ratio

(6.32)
6.14 Solutions 363

can be rewritten as

aTW1/2W1 /2a
aTw1/2w-1/2 Bw-1/ 2w1/2a
uTu
uTw-1/2 Bw-1 f2u
(6.33)
uTu
Consequently, the problern reduces to maximizing equation (6.33) over u.
From Exercise 5.6 we know that the maximum of this ratio is >'1, the
largest eigenvalue of w- 1 / 2 Bw- 112 .
This maximum occurs when u = ')'1 , the normalized eigenvector associ-
ated with .>q. By equation (5.31) , u orthogonal to /'I maximizes the pre-
ceding ratio when u = ')'2 , the normalized eigenvector of w - 112 BW- 112
corresponding to the second largest eigenvalue .\2. We can continue in this
fashion for the remairring linear canonical variates. Now, note that if .\ and
1' are an eigenvalue-eigenvector pair of w - 112Bw- 112 , then by definition

Multiplication Oll the left by w- 1 12 gives

or
w - 1B(w-112") = .\(w- 1/ 21'),

thus w- 1 B has the same eigenvalues as w - 1 12 BW- 112 , but the corre-
sponding eigenvector is proportional to w- 112 1' = a. Thus, the vector a
which maximizes equation (6.32) can be found by taking the eigenvector 1'1
corresponding to the largest eigenvalue of the matrix w- 1 12 BW- 112 and
then premultiplying it by w- 1 12 . We can avoid the computation of the
square root W 1 12 followed by premultiplication and find a = w- 1 12 ')' more
directly by taking the eigenvector corresponding to the largest eigenvalue
of the matrix w- 1 B . A similar argument applies to the other eigenvectors.
Note that 'Ew1'EB = { (n - g)j(g - 1)} w- 1 B. Hence eigenvectors of
w - 1 B are t he same as those of 'tii)'tB, but any eigenvalue of w - 1 B is
(g - 1)/(n- g) times the corresponding eigenvalue of 'tiiJ'tB.

Exercise 6. 7
As in the preceding exercise, we work with B = (g - 1)'EB and W =
(n - g )'tw. Given that the canonical vectors a 1 , ... , a 8 are the eigenvectors
of the matrix w - 1 B we must have that
364 6. Discriminant Analysis

for j = 1, . . . , s; or
(B -l1 W)a1 = 0.
From this equation we obtain that any two particular eigenvalue/ eigenvector
pairs (li, ai) and (Z1 , a1 ) satisfy

(6.34)

and
Ba1 = Z1Wa 1 . (6.35)
Premultiplying (6.34) by aJ and (6 .35) by a[ , we obtain

and
a[ Ba1 = l1afW a1.
Since B is symmetric, the scalar aJ Bai = a[ Baj , so it follows that

(6.36)

Given that W is symmetric we have that aJW ai = a[W a1 , so that if li =I= lj


the only way in which equation (6.36) can be satisfied is when a[W aj =
a[Wa1 = 0. Clearly theseargumentshold for all i =I= j . Considering all
vectors a1 together as columns of the matrix A, this property is equivalent
to ATW A being a diagonal matrix. As we said in 6.2.5, to overcome the
arbitrary scaling ofthe a1 we impose the normalization ATi;w A = I 8 • With
this additional constraint the canonical variates are not only uncorrelated
between groups but also have equal variance within groups. Now, since
from equation (6.35)
BA=WAL ,
where L = diag(h, ... , l 8 ) is the diagonal matrix containing all eigenvalues,
it follows that AT BA= ATW AL= (n- g)L. Thus, the canonical variates
are also uncorrelated between groups. Finally, since the total sum of squares
and products matrix is the sum of within groups and between groups com-
ponents, we conclude that the canonical variates are uncorrelated over the
whole sample. Algebraically

(n- l)var(Y A) (YA)rCYA) ATfTYA


ATWA+ATBA (n- g)(I + L).

Exercise 6.8
If y has a probability density function f(y), then the probability that an
observed value falls in a region R 1 of the sample space is

r
jR1
J(y)dy.
6.14 Solutions 365

So, the probability that y comes from population P 2 and is misallocat ed is

p(112) = 1r2 r h(y)dy.


JR1
(6.37)

The integral sign in equation (6.37) represents the volume formed by the
density function h(y) over the region R 1 . Similarly, the probability that y
comes from population P 1 and is misallocated is

p(211) = 1T1 r fi(y)dy.


jR2
The expected cost EC M due to misallocation is given by

ECM c(211)p(2 11) + c(112)p(112)


c(211)7rl
JR2
r fi(y)dy + c( 112)7r2 JR1r h(y)dy. (6.38)

Now, since the two regions R 1 and R 2 are a partition of R, we have that
R = R1 u R2 and R1 n Rz = 0, so that

JR
r fi(y)dy = JR1r fi(y)dy + JR2
r fi(y)dy = 1

or
r
JR2
fi(y)dy = 1 _ r fi(y)dy.
JR1
Substituting for JR 2 fi(y)dy in equation (6.38) yields

ECM = c(211)nl- c(211)nl { fi(y)dy + c(112)n2 { f2(y)dy


JR1 JR1
c(211)7rl + r {c(112)n2/2(y) - c(211)ndl(y)}dy.
JR1
fi (y) and h (y) are nonnegative for all y and are the only quantities which
depend on y. Thus, ECM is minimized if R 1 includes those values y for
which the integrand

(6.39)
In other words, EC M is minimized by choosing the region R 1 to be the set
of all those points and only those points that give a negative contribution
to the expression {c(112)n2f2(y)- c(211)ndl(y)}. This is because with this
choice the largest possible amount will be subtracted from c(211)n1 to yield
ECM. Hence, from equation (6.39), it follows that the optimal rule is
associated with the region R 1 composed of all those points y for which
366 6. Discriminant Analysis

This argument provides a theoretical justification for a rule of the form (6.27).
Note that if the two costs c(ll2) and c(2ll) are equal, the optimal rule for
the classification criterion becomes the one of assigning y to the population
with the larger posterior probability.
7
Cluster Analysis

7.1 lntroduction
In duster analysis the multivariate observations are to be divided into g
groups. The membership of the groups is not known, nor is the number of
groups. The situation is seemingly different from that of discriminant anal-
ysis considered in Chapter 6 where both the number of groups and group
membership are known. However, there is much in common between our
procedure for dustering and the methods we used in the earlier chapters.
We start by treating the observations as if they, perhaps after trans-
formation, come from a single multivariate population. We monitor the
forward search to see whether there is any evidence of groups. If there is,
we t entatively divide the observations into dusters and then use the tech-
nique of search with several groups employed in discriminant analysis to see
how stable the proposed duster membership is and what is the allocation
of unallocated units.
We describe the stages of our search in §7.2.1 , dividing the data analysis
into three stages: preliminary, exploratory and confirmatory. In the last of
these unassigned units are assigned to dusters during the forward search on
the basis of comparisons of Mahalanobis distances. This comparison is not
entirely Straightforward when the distances are calculated for populations
with differing covariance matrices. In §7.2.2 we discuss the use of standard-
ized distances to help with these comparisons. The extensions needed to
the forward search are outlined in §7.2.3
368 7. Cluster Analysis

In §7.3 we start a series of simulated examples which show what can be


achieved by applying the forward search to dustering. The data in §7.3
contain two clusters, one of 60 and one of 80 observations. We first show
how use of a very robust method , here the minimum covariance determi-
nant, fails to reveal that there are two populations in the data rather than
one. The forward plot of scaled Mahalanobis distances in §7.3.2 however
does reveal all the structure. In §7.3 .3 we use further plots, such as the
"entry plot" to reveal the structure in other ways which are useful in more
complicated examples.
The simulated examples in §§7.4 and 7.5 extend the 60:80 data in differ-
ent ways. In §7.4 we add a third duster to the data, together with two out-
liers. This structure is revealed by the forward search, but not by the min-
imum covariance determinant, when just one multivariate normal model
is fitted to the data. In §7.5 we add 30 units as a "bridge" between the
two dusters in order to see how our method behaves when duster bound-
aries are not sharp. The absence of sharp duster boundaries is a common
occurrence in applications of dustering methods.
The next two sections are concerned with the analysis of examples which
are not simulated. The first, in §7.6, is of financial data in which most of
the observations fall relatively easily into two dusters . Our last example,
in §7.7 is of some data on diabetes, in which the duster boundaries arenot
at all sharp. There is in fact appreciable overlap between dusters. Use of
the forward search makes it possible to divide the data into firm clusters,
with a few units left whose group membership is uncertain.
In all examples we fit different multivariate normal distributions to each
duster. There are however other possibilities. For example we could fit
multivariate normal distributions with a common, but unknown, covariance
matrix as we did in linear discriminant analysis. We mention these and
other possibilities in the last section of the chapter, where we also give
references to the large Iiterature on dustering.

7.2 Clustering and the Forward Search


7. 2.1 Three Steps in Finding Clusters
We find it convenient to think of our analysis as broken into three steps,
although the boundaries between the three are not rigid.

1. Preliminary Analysis. We start by exploring plots of the data sim-


ilar to those used in Chapter 3 where we were looking at what was
expected to be data from a single multivariate population. An im-
portant tool is the relationship between scatterplots of the data and
forward plots of Mahalanobis distances. As well as plots of all dis-
tances, we look at the behaviour of the distances for groups of units
7.2 Clustering and the Forward Search 369

selected by magnitude at seemingly interesting points in the forward


search. These often correspond to apparent separations in forward
plots of distances, or of peaks in plots such as that of the maximum
distance of units within the subset. As a result we divide the units into
tentative dusters. We also find it helpful to look at plots of the order
of entry of the units into the subset and of increases and decreases of
Mahalanobis distances during the forward search.

2. Exploratory Analysis. The next stage is to ensure that the ma-


jority of the units are assigned to the correct duster. We again use
forward searches in which a single multivariate population is fitted to
all the data, but now we start in turn in each of the tentative dusters,
ensuring that all units from the duster under study enter the subset
before any units tentatively assigned to other clusters. With g groups
we can compare the results of g searches, seeing how the distances
of units in each duster behave. If the units do indeed form a duster,
the distances will change together in all g searches. In this way we
explore a number of informative paths through the space of the pa-
rameters. Individual units which arenot correctly dassified, perhaps
because they are outlying from all dusters, will showrather different
trajectories of their Mahalanobis distances.

3. Confirmatory Analysis. As a result of the exploratory analysis we


are able to duster the majority of units. We confirm this dustering
and attempt to duster those units about which we remain uncertain
by running a search with g populations, initially similar to those
we used for discriminant analysis. We start with an initial subset
containing most, if not all, of the certainly classified units. We then
move forward, adding units to the duster to which they are dosest. If
the covariance matrices of the different dusters are not too different
we measure doseness by Mahalanobis distance. But, if some clusters
are more dispersed than others, we can instead use the standardised
distances described in the next section.

7.2.2 Standardized Mahalanobis Disfances and Analysis with


M any Clusters
In our confirmatory analysis we run a forward search fitting an individual
model for each duster. For all unassigned observations we calculate the
distance from each duster centre. Observations are induded in the subset
of the duster to which they are nearest and the distances to all duster
centres are monitored.
A difficulty with this simple prescription is that we may be comparing
Mahalanobis distances for compact and dispersed clusters. An observation
that is on the edge of a tight duster may have a relatively high Maha-
370 7. Cluster Analysis

lanobis distance from that duster, due to the low dispersion of the duster.
Because of the higher variance of a dispersed duster, the observation may
have a large Euclidean distance from the dispersed duster, but a Maha-
lanobis distance slightly less than that from the tight duster, and so be
wrongly allocated. Such a wrong allocation would increase the difference
in dispersions and might lead to a dispersed group "invading" a compact
group during the search, leading to the wrong allocation of several units.
The problern is mentioned by, amongst others, Gordon (1999, p. 48).
We attack this problern by introducing standardized distances, which
are adjusted for the variance of the individual duster. The customary Ma-
halanobis distance for the ith observation from the lth group at step m
is
T f,-1
dilm = (eilm )0.5
L.Julm eilm , (l = 1, ... , g).
A form of generalized distance can then be defined as

dilm(r)
'
= dilm { IEutml
1/2v}r , (7.1)

which agrees with a suggestion of Maronna and Jacovkis (1974). When


r= 0 in (7.1) we recover the customary Mahalanobis distance. But with
r= 1 we have the standardized distance
*
d ilm = d ilm lf,
L.Julm ll/2v · (7.2)

In this standardized distance, the effect of differing variances between


groups has been completely eliminated, so that d;lm is an analogue of the
ordinary least squares residual in regression with a single response. It will
produce measures dose to Euclidean distance, with the consequence that
compact dusters may tend to "invade" dispersed ones, the reverse of the
behaviour with the usual Mahalanobis distance, obtained when r = 0.
The potential difference between the behaviour of searches with the
two distances suggests that some intermediate value of r in (7.1) might
be preferable. However, we can overcome this problern by comparing two
searches, one using standardized Mahalanobis distances and the other un-
standardized distances. Because by this stage in the analysis we are sure
of nearly all allocations to dusters, we monitor the allocation of each unit
during the search. If Contradietory analyses are suggested, we rely on the
forward searches for single dusters in the exploratory analysis.

7.2.3 Forward Searches in Cluster Analysis


The three approximate stages of our analysis described above require rather
different forward searches.

1. Prelirninary Analysis. In the preliminary analysis we use standard


forward searches for a single duster as we did in Chapter 3. We may
7.3 The 60:80 Data 371

try a variety of starting points, but there are no novel aspects to the
choice of units in the subset as we move forward.

2. Exploratory Analysis. We again fit just one distribution during


the forward search, but now we start with units in a specified tenta-
tive duster. We constrain the search so that all units from the duster
of interest enter before units from any other duster. U sually we con-
strain the remainder of the search so that these units, once they have
entered the subset, cannot leave it.

3. Confirmatory Analysis. In the confirmatory stage we fit g distri-


butions, one to each duster. There are several options for this for-
ward search, rather as there were for discriminant analysis. The initial
subset for each group consists of the first lOOa% of the observations
to join the forward search for each group when we performed the
exploratory searches just for observations in each tentative duster.
These observations are not allowed thenceforth to leave their sub-
sets. We have found that larger values of a, 0.9 or 1, give dustering
at this stage which is more in agreement with the condusions of the
exploratory searches. The search proceeds by allocating the remaining
100(1-a)% of units from the exploratory analysis and all unallocated
observations to the dosest group as judged by Mahalanobis distance.
These can be either the usual distances or the standardized distances
introduced in §7.2.2. The search can be constrained so that no Ob-
servations can leave once they have joined a subset. We could also
constrain the growth of the subsets to be balanced, perhaps in the
ratios of group sizes found in the exploratory analysis, although this
may force some implausible allocations towards the end of the search.
The dosest duster to each unit may change during the course of the
search. We find it helpful to monitor the potential allocation of each
unit not in the initial subset as the search progresses. Units for which
allocation remains stable can be certainly allocated and the search
repeated with a larger subset of certainly allocated units. In the most
complicated examples we find that there may be a few units which it
is impossible to duster with certainty.

7.3 The 60:80 Data


We start with a simple example with two clusters. In §7.3.1 we fit one
normal model and show that diagnostics derived from standard and very
robust methods fail to reveal the two dusters. The forward plot of Maha-
lanobis distances in §7.3.2 does however reveal the structure of the data.
We finish the analysis in §7.3.3 with plots we find useful in the analysis of
372 7. Cluster Analysis

0
0 0 0
0 0 0
0 0
0 oo 0 0
0 0 0 0 0
0 0 0
0 0
0 0
o o'!', 0
0 0
0 0 0 0
0000 0
0 0
0
0 'b 0 0 0 'b 0
0 0
0

0
0~
0
0 0

-10 -5 0 5 10

y1

FIGURE 7.1. The 60:80 data: scatterplot showing the two clusters

more complicated examples. These plots include the entry plot and plots
of the fitted ellipses containing all the m observations in the subset.

7. 3.1 Failure of a Very Robust Statistical M ethod


Figure 7.1 shows a scatterplot of 140 data points: units 1-80 form a rather
diffuse group, whereas the remairring 60 units, numbered 81-140, form a
tight duster. The pattern is obvious to the unaided eye. The purpose of
duster analysis is to detect such groups. This can be a difficult task for
high dimensional data. But it would not seem tobe particularly challeng-
ing when there are just two dimensions. However, when we fit a single model
to these data, neither standard classical methods nor very robust methods
yield Mahalanobis distances which unambiguously show that there are two
different groups of Observations. We would like a robust estimator to fit
the !arger, more diffuse, group whilst revealing t he smaller group as out-
liers through !arge Mahalanobis distances. Thus we would like the ellipse
containing "half' the data, found by a very robust method, to contain only
observations from the !arger group. However this does not happen.
As an example, Figure 7.2 shows the results from the "fast" MCD (Min-
imum Covariance Determinant) algorithm of Rousseeuw and Van Driessen
(1999). This finds the half of t he data for which the det erminant of the es-
timated covariance matrix is minimised. The ellipse in the figure was found
by scaling up these very robust parameter estimates to give a contour which
would contain 97.5% percent of the data for a single normal population.
Unfortunately, the half of the data used in fitting includes elements from
7.3 The 60:80 Data 373

MVE Tolerance Ellipse (97.5%)

~ ,------------------------------------------------ ------ ,
057

oo 045

0
0 0 0

:.
0 0
0 0 0
0
0
0 00
~
oq, 0 0
0
0
~
027

-10 -5 0 5 10
y1

FIGURE 7.2. The 60:80 data: ellipse from very robust fit nominally containing
97.5% of the data. The half of the data used for estimation contains observations
from both clusters

both groups. It must be stressed that this is not a failure of the algorithm to
find the minimum of a multimodal function. The ellipse found is of smaller
area than that can be found containing the same number of observations
solely from the larger group. What has happened is obvious here in two
dimensions. We need to be able to check whether something similar has
occurred in more than two dimensions when the Separation into groups is
not obvious by plotting along the coordinate axes.

7.3.2 The Forward Search


In contrast to the preceding analysis, Figure 7.3 is a plot of the scaled
Mahalanobis distances from a forward search. The starting point is the
robust ellipse given by the simpler of the two methods of Zani, Riani, and
Corbellini (1998). Like Figure 1.16 for the Swiss bank note data, this figure
clearly shows the two main groups. But there are important differences of
detail from this earlier example.
The starting point for this search, like the final MCD fit illustrated in
Figure 7.2, includes units from both groups. At the very beginning of the
search there is appreciable interchange as the units from the larger, more
dispersed, group are rejected by the algorithm, which first selects those
from the tight duster. Up to m = 60 the plot shows a clear division of the
distances into two groups. There are small distances for the tight group
374 7. Cluster Analysis

20 40 60 60 100 120 140


Subset size m

FIGURE 7.3. The 60:80 data: forward plot of scaled Mahalanobis distances

from which the parameters are estimated and a distinct set of much larger
distances for the more dispersed group.
When m = 61 the first unit from the dispersed group is included in
the subset. Immediately there is a change in the plot; some units from the
larger group have decreasing distances, while those for other units initially
increase. There is a further period of activity around m = 90 and then, as
the search progresses, the distances for the small group tend to increase.
In the middle of this process of fitting to units from both groups , around
m =105, there arenosmall Mahalanobis distances, even though the asymp-
totic distribution of the distances is x~, that is exponential. Unit 57, visible
in the top right-hand corner of Figure 7.2 is particularly outlying at the
beginning and end of the search.
Figure 7.4 is a QQ plot of squared Mahalanobis distances for the x~
distribution during the forward search for two values of m . The left-hand
panel, form= 37, shows the two groups with a gap between the 60small
distances and the 80 large, just as there is in Figure 7.3 for this value of
m. The plot in the right-hand panel, for m = 101, shows, in the left-hand
tail, the absence of very small distances; since the centroid of the fitted
observations is not near any data point, the smallest distance plots well
above zero.
The forward plot thus shows that there is a tight group of observations
and around 80 observations which are different. We now consider other plots
7.3 The 60:80 Data 375

m=37 m=101
0

"'
....
_.

.
I
~ ""..-
.-··

0
____/
0.0 0.5 1.0 1.5 2.0 2.5 3.0 0 .0 0.5 1.0 1.5 2.0 2.5 3.0
Quantiles Ouantiles

FIGURE 7.4. The 60:80 data: QQ plots of squared Mahalanobis distances against
X~· Left panel, m = 37, showing the two groups; right panel, m = 101- there are
no small distances

which complement the information from the forward plot of Mahalanobis


distances.

7.3.3 Further Plots for the 60:80 Data


In this section we introduce the "entry plot" and monitor the ellipses con-
taining the subset of the data used in fitting.
Entry Plot. Figure 7.5 is an entry plot for the forward search described
in §2.14. The initial subset size was 18. Dots in the plot indicate the presence
of an observation in the subset, so that the number of dots increases towards
the right of the graph as the subset size does.
In Figure 7.5 the observations are ordered so that 1-80 are those from the
dispersed group with the 60 observations 81-140 coming from the compact
group. The plot at m = 18 shows that the initial subset includes observa-
tions from both groups. The first step of the forward search, to m = 19,
results in a large amount of interchange. As the figure shows, at this point
only one observation from the dispersed group remains in the subset. It too
is eliminated at the next step. Thereafter, until m = 60 the subset consists
solely of observations from the compact group. From m = 61 only obser-
vations from the dispersed group join the subset. None of the observations
from the compact group is eliminated. This plot also explains some of the
activity around m = 90 in Figure 7.3, where there are interchanges as some
of the units from the dispersed group leave the subset.
The entry plot clearly shows which groups of observations are in the sub-
set. Its usefulness depends on the ordering of the observations. lt would be
less useful if we permuted the observations. But, in a confirmatory stage we
376 7. Cluster Analysis

0
'<t

0
C\1

0
0
ä)
1/)
.0
:::l 0
1/) Cl)
Q)
"0
.iii
.!: 0
<0
.l!l
·c:
:::>
0
'<t

0
C\1

20 40 60 80 100 120 140


Subset size m

FIGURE 7.5. The 60:80 data: entry plot from mo = 18.

can plot by order of entry in the forward search. We can also use tentative
duster labels as plotting symbols.
Ellipses. The entry plot teils us which observations form the subset as
m increases. With two variables, information on the nature of the subset
and the fitted model comes from ellipses containing a specified fraction
of the data, as in Figure 7.2. We now Iook at the ellipses containing all
m points and m/2 points as the search progresses, where the eigenvectors
of the ellipse are those of f;um, the unbiased estimator of the covariance
matrix based on the subset of size m. The outer ellipse will therefore pass
through the most remote observation in the subset, that is the one with the
largest Mahalanobis distance. In labelling the axes of these plots we scale
the variables by the standard deviations estimated from the subset, that is
by the square root of the elements of the diagonal of f;m · With more than
two variables the ellipses can be superimposed on the scatterplot matrix of
the data.
The upper left-hand panel of Figure 7.6 shows the ellipses for the starting
value m 0 = 18. In the plot we have used filled symbols to indicate the
observations in the subset. The figure shows that the initial subset contains
observations from both the compact and dispersed groups. It also shows
that the ellipse contains observations which are not in the subset. Since
these have smaller Mahalanobis distances than at least one observation in
the subset, some will be included at the first step of the forward search.
7.3 The 60:80 Data 377

The upper right-hand panel of Figure 7.6 for m = 19 shows that this is
exactly what happens: now the subset contains just one observation from
the more dispersed group, an effect we noted in the entry plot ofFigure 7.5.
One more step of the forward search leads to the elimination of this last
observation and so to a subset consisting solely of observations from the
compact group.
The ellipse in the lower left-hand panel of Figure 7.6 for m = 20 is
comparatively very small so gives large scaled distances for units not in the
subset. As the first three panels of Figure 7.6 show, the ellipse shrinks and
the scaled distances grow at the start of this search. We have seen in the
forward plot of Mahalanobis distances, Figure 7.3, that large Mahalanobis
distances are obtained near this point in the forward search.
The orientation of the ellipses in the three figures changes a little as
observations from the dispersed group are excluded. There is also some
slight change in the orientation as they are re-introduced after m = 60
(lower right-hand panel of Figure 7.6) . The left-hand panel of Figure 7. 7
shows the ellipseform = 80. This is now much larger than previous ellipses,
since 20 observations are included from the more dispersed group. As the
forward search progresses the ellipses grow and rotate slightly in an anti-
clockwise direction until the right-hand panel of Figure 7.7 is obtained for
m = 140. The outer ellipse is large because it passes through unit 57, the
outlier in Figure 7.3.
Forward Plots of Minimum and Maximum Mahalanobis Dis-
tances. The entry plot and plots of ellipses illuminate the structure of the
data. We now consider plots of Mahalanobis distances which pinpoint the
position in the forward search at which changes in the structure of the
subset occur.
The left-hand panel of Figure 7.8 is the forward plot of the minimum
Mahalanobis distances amongst units not in the subset. There is a needle
sharp peak at m = 60 indicating that the next unit to be introduced is
remote from the group of observations so far fitted. The high value at the
end of the search belongs to the last observation to enter, unit 57, which
the outer ellipse in the right-hand panel of Figure 7.7 showed tobe remote
even from the dispersed data cloud. There is no indication of any other
structure.
The right-hand panel of Figure 7.8 shows the complementary plot of the
maximum Mahalanobis distance among the units in the subset. This has a
peak one observation later when m = 61. This peak is less clearly defined
than that in the left-hand panel, since the minimum Mahalanobis distance
among units not in the subset is the deletion version of the maximum dis-
tance within the subset one step later, unless there is an interchange. This
plot, like that in the left-hand panel, has a high value at the end of the
search caused by the single outlier. Both plots start with low values because
distances are calculated over observations included in the outer ellipses of
the upper two panels of Figure 7.6. The minimum distance among units
378 7. Cluster Analysis

m=18 m=19

m=20 m=61


FIGURE 7.6. The 60:80 data: plots of ellipses for the first three steps with
mo = 18 and for m = 61 when the first diffused observation enters the sub-
set. The search initially includes units from both groups. Filled symbols are used
for units in the subset

..

-10 -5 0 -6 -4 -2 0 4
y1 y1

FIGURE 7.7. The 60:80 data: plots of ellipses for two further steps when m 0 = 18.
Left-hand panel, m = 80; right-hand panel, m = 140. Inclusion of unit 57 when
m = 140 causes an appreciable increase in the size of the ellipse
7.4 Three Clusters, Two Outliers: A Second Synthetic Example 379

C\1
""
~
<D

Cl
::;: "" Cl
::;:
E
E
::> <D ::>
E
....
E ·;.:
·c:
....
<0
~ ::;:
C\1

C\1

0 0

20 40 60 80 100 120 140 20 40 60 80 100 120 140


Subset size m Subset size m

FIGURE 7.8. The 60:80 data: forward plots of Mahalanobis distances. Left panel,
minimum distance amongst units not in the subset; right panel, maximum dis-
tance among units in the subset. The interchanges at the beginning of the search
and around m = 90 are evident

not in the subset will be for some point lying inside the ellipse. The maxi-
mum distance amongst those in the subset is also too small because several
observations are lying on or near to the ellipse: the extreme case would be
when all observations in the subset lay on the ellipse, when the maximum
distance would be the same as any in the subset (see Exercise 2.12).

7.4 Three Clusters, Two Outliers: A Second


Synthetic Example
We continue to consider analyses of simulated data sets in order to under-
stand the properties of the forward search when one multivariate distribu-
tion is fitted to data containing groups. The intention is to train our eyes
to the interpretation of the plots.

1.4.1 A Forward Analysis


The first example considered was relatively straightforward, with two well-
separated dusters. These new data are similar. They again contain the two
dusters of the 60:80 data, but now with the addition of a third duster, units
141-158 and two outliers, units 159 and 160. The sizes of the groups are
therefore 80, 60, 18 and 2. Figure 7.9 shows the data. The second compact
duster of 18 observations is near the longer axis of the dispersed duster of
80. The two outliers are tagether, approximately across the centroid of the
dispersed group from the duster of 60.
Webegin the forward search with mo = 28, the observations b eing chosen
by the method of robust ellipses. Figure 7.10 is the entry plot. This shows
380 7. Cluster Analysis

..
. ., . •

. . . .. : ·.· . .:
.. . •

••
.. " . .. .. • .•
I • ••• •• • •.•.. • •

·•
I • •• •

• •

~ rA•
• •
-10 -5 0 5 10

y1

FIGURE 7.9. Three Clusters, Two Outliers: scatterplot of the data. The numbers
of units in the groups are 60, 80 and 18, with two outliers

that the starting ellipse has mostly chosen observations from the group of
60. The few observations from the diffuse group are eliminated at the first
forward step and the search then continues solely with observations from
the compact group until m = 61 when observations from the diffuse group
enter. Just before all the diffuse observations are included the two outliers,
observations 159 and 160 enter the subset, although they are soon rejected,
rejoining again at the very end of the search.
This behaviour can be explained by studying ellipses similar to those
considered in detail earlier for the 60:80 data_ However we leave this study
to Exercise 7.1 , passing on instead to the forward plot of the Mahalanobis
distances, Figure 7.11. This indicates all the structure in the data; up to
m = 60 the second group is clearly separated from the first with the dis-
tances of the compact group of 18 evident at the top of the plot. The two
outliers follow an independent path. Once the second group starts to enter
at m = 61 the smaller distances are not so readily interpreted, although
the group of 18 remains distinct. The two outliers re-emerge at the very
end.
The forward plot of the minimum Mahalanobis distances amongst units
not in the subset in the left-hand panel of Figure 7.12 clearly shows the end
of the firstduster through the sharp spike at m = 60. The second spike at
m = 142 is a rather less clear indication of the end of the second group.
The spike at the end of the plot clearly shows the presence of, now, two
outliers. The forward plot of the maximum Mahalanobis distances amongst
units included in the subset, right-hand panel of Figure 7.12, is similar, but
with a slightly less strong, although still appreciable, indication of the end
7.4 Three Clusters, Two Outliers: A Second Synthetic Example 381

1
tl)

Qj
0
"':::>
1
.0 0

.,"'
"0
· ~n
.s
2
·c:

1
::::> 0
<f)

20 40 60 80 100 120 140 160


Subset size m

FIGURE 7.10. Three Clusters, Two Outliers: entry plot from mo = 28 . The two
largest groups enter in succession

40 60 60 100 120 140 160


Subset size m

FIGURE 7.11. Three Clusters, Two Outliers: forward plot of scaled Mahalanobis
distances. The three clusters and two outliers are revealed
382 7. Cluster Analysis

0 IX>
:;
E
::::> CD
E
·;;:
"' ....
:;

40 60 80 100 140 40 60 80 100 140


Subset size m Subset size m

FIGURE 7.12. Three Clusters, Two Outliers: left panel, forward plot of the min-
imum Mahalanobis distances amongst units not in the subset; right panel, max-
imum Mahalanobis distances amongst units in the subset

of the first duster. The latter part of the plot gives a slightly stronger
indication than that of the minimum distance of a third duster. The plot
again signals the two outliers.
The conclusion of this analysis is that forward plots of Mahalanobis dis-
tances and the entry plot are both useful. To condude our look at these
data we contrast our plots from the forward search with those from the min-
imum covariance determinant (MCD) procedure inS-Plus and with a plot
of dassical Mahalanobis distances from fitting all the data non-robustly.

7.4.2 A Very Robust Analysis


As Figure 7.13 shows, the MCD function from S-Plus settles over two dus-
ters, just as it did for the 60:80 data. Comparison with Figure 7.2 shows
how similar the two fits are. The major difference is that now the group
of 18 outliers are all remote from the ellipse and can be expected to yield
large robust Mahalanobis distances. These are shown as an index plot in
Figure 7.14, in which all the structure is dear: there isadiffuse group of 80
observations, 60 similar observations with small distances from one duster ,
18 similar large distances from another duster and the two outliers. Allare
certainly recognisable. The horizontallirre in the plot is at the 97.5% point
of the vx; distribution which the distances would asymptotically follow if
there were one normal population.
The index plot of the dassical, non-robust Mahalanobis distances is sim-
ilar and is shown in Figure 7.15 . The four sets of observations are again
evident, although the distances from the duster of 60 are larger and more
variable than before. The two outliers now have appreciably greater dis-
tances than the other units. However, all distances are much smaller than
7.4 Three Clusters, Two Outliers: A Second Synthetic Example 383

MVE Tolerance Ellipse (97.5%)

0 0
0 0
0
0 <9
0 0' 0 0
0

"' o>,g
00 0 0
oo 0 0
0 0 0 0
0 0 oo 00 0
00
~ 0
0
<9 8 oo 0
o 0o
0
@0 OOo 0
oo
0 0 0
oo 0
0 0
0
0
«;> 0

0-

0
~
~~

·10 -5 0 5 10
y1

FIGURE 7.13. Three Clusters, Two Outliers: ellipse from very robust fit nom-
inally containing 97.5% of the data. The half of the data used for estimation
contains observations from two clusters

.ß.J.& oo 1 57
f9 0~ 00
0

0
0

0
0 0
0 0

0 0
0 0 0
0
0 0 0
0~ 0

:o
0 n
o o oo <b o
oo oooooooo oco o ~ oo

o
o
o:ooo
o Oo o o
0
~o oo o o
o ~
d?o~o~~~~oi!Sib
0

0 50 100 150
Index

FIGURE 7.14. Three Clusters, Two Outliers: index plot of robust Mahalanobis
distances from the MCD fit . The horizontalline is the 97.5% point of the asymp-
totic distribution of distances
384 7. Cluster Analysis

015
016

0
0
0

0
0 0
0 0
0 0 0 0
0 0
0 0

0 50 100 150
Index

FIGURE 7.15. Three Clusters, Two Outliers : index plot of classical Mahalanobis
distances. The horizontal line is the 97.5% point of the asymptotic distribution
of distances

o156 0 8 146 ~144, 0159


0 ($ß 0
0 0 0 160
0

'll
"'
0 <SO
0 &146 0
0 0 0 0
0 00 0
0 0
00 0 0
'0 0 0
0
0

0 0 00 0 0 0 0 0 0

o o o o o o no o Do
~ ~o o o "oo~ -ct o q, -,.,
0 0 000 0 0 cj)O
o,.p~o~~-~~ 0
0
0 0
0
0

0 50 100 150 0 50 100 150


Index Index

FIGURE 7.16. Three Clusters, Two Outliers: index plots of Mahalanobis dis-
tances after permuting observation numbers: left, robust; right, non-robust. The
horizontallines are the 97.5% points of the asymptotic distribution of distances

the robust distances. Now only the two outliers h ave distances lying outside
the 97.5% point of the nominal distribution. vx;
7.5 Data with a Bridge 385

0 0

0
0 0
0 0

o'i '0 o o o u
o o Q;)o~ooo o ooo o _ ...
o q, oo;~o~~ro ·····
. ....o- ··· ·· ··

0 2 3
Classical Mahalanobis Distance

FIGURE 7.17. Three Clusters, Two Outliers: classical and robust Mahalanobis
distances

Further insight can be obtained from QQ plots of both kinds of Ma-


halanobis distances, but it is not as unambiguous as that obtained in the
previous section from the forward search.
These plots are even less informative if the index numbers do not indicate
group membership. Figure 7.16 gives the index plots of both sets of dis-
tances after the observation numbers have been permuted. The structures
are now much less evident. The robust distances in the left-hand panel
seem to have too many small and large distances, with a gap below the
larger distances. The non-robust distance in the right-hand panel do not
show too many small distances, but do dearly show the two outliers.
A figure which is not affected by this relabelling is Figure 7.17, of ro-
bust and non-robust Mahalanobis distances. The two outliers and the third
duster of 18 observations show up as extreme on both axes. The tight dus-
ter of 60 observations is visible as a short black line below the diagonal of
expected order statistics. This plot underplays the importance of the tight
duster.

7.5 Data with a Bridge


The groups of observations in the previous two synthetic examples were
dearly separated. Despite their relative scatters and orientations it was
possible to identify the different dusters. However such complete separa-
386 7. Cluster Analysis

0
LO oo

0
0 0 0
0
0 oo 0
0 0 cP 0 0
~ 0
0
0
o 10o 0 0 0 0
0
0
0 '8 oCbo10 0
0
0
0
oo~ 0 0
0 0
o qpo 0 0
0
00

o<o 0
0
0
0 0 '0 0 '8 r:P 0
0 0 0

u;>
o'6 q,oo
0
o~lb~0 0
0

-10 -5 0 5 10
y1

FIGURE 7.18. Bridge data: the 60:80 data with a further 30 observations joining
the two clusters

tion is rare in real examples, where often clusters almost overlap. We now
consider the modification of the 60:80 data to reduce the sharpness of di-
vision between groups.
Our analysis is broken into the three steps described in §7.2.1: prelim-
inary analysis using the plots from fitting one normal distribution, ex-
ploratory analysis where we break the data into tentative clusters and con-
firmatory analysis where we adjust and try to confirm the clusters we have
found.

7. 5.1 Preliminary Analysis


Webegin with the observations. Figure 7.18 shows the "bridge" data formed
by adding a further 30 observations to those plotted in Figure 7.1 in such
a way as to form a bridge between the two clusters, roughly in a direction
from one duster centre to the other. There are no obvious breaks between
the three sets of observations. 8ince the new data run in a similar direction
to the main axis of the ellipse fitted by the 8-Plus MCD procedure, these
new data reinforce this uninformative fit.
We commence our analysis with a forward search from a starting point of
31 observations found by the method of robust ellipses. The entry plot is in
Figure 7.19. The new observations are numbered 141-170. The plot shows
that the ellipse has virtually selected only these observations. In doing so
it has behaved like the 8-Plus MCD, see Exercise 7.3. However the entry
plot shows that, after four steps of the forward search, there is a period
7.5 Data with a Bridge 387

0
l{)

o;
(/)
.0 0
::> 0
(/) ~
Q)
u
·c;;
.s
2
·c:
::::> 0
l{)

J
40 60 80 100 120 140 160
Subset size m

FIGURE 7.19. Bridge data: entry plot. The initial subset consists mostly of units
from the "bridge", numbered 141-170. These are rapidly eliminated by the for-
ward search

of rapid interchange in which the compact group enters, driving out the
points from the bridge. After m = 60, when all the compact group have
entered, the Observations from the bridge are re-introduced, followed by the
observations from the dispersed duster. Once the procedure has found the
compact group, there are few interchanges during the rest of the search.
Those that there are occur within the dispersed group towards the end of
the search.
This description of the movement of the forward search can be encapsu-
lated by a few plots of ellipses. Figure 7.20 shows, in the upper left-hand
panel, the starting point (m0 = 31) with the outer ellipse containing points
from the bridge. There are also many from the compact group, but most
are not included in the initial subset. There then follows a period of rapid
interchange as the subset changes to consist solely of points within the
ellipse. The upper right-hand panel of Figure 7.20 shows a small outer el-
lipse for m = 36, which contains points both from the bridge and from
the ellipse. The two lower panels show the progress, through m = 39 to a
subset, form= 42, consisting solely of points in the compact group. Here
the ellipse is very small. We can expect that, around this value of m, there
will be very large Mahalanobis distances for many of the observations.
The second set of ellipses, in Figure 7.21, starts with m = 60, which
is the last set of ellipses for which the subset contains only observations
from the compact group. The remaining three panels, up to m = 81, show
the rotation and extension of the ellipse as observations from the bridge
388 7. Cluster Analysis

m=31 m=36

0 0

oo-ao o
00
o qpo
oQ, o
0 0 0

m=39 m=42

0 0' 0
0 0 0
0
00
oo~O o
o qpo 00
oQ,
0
0 0 0

FIGURE 7.20. Bridge data: plots of ellipses in the earlier stages of the search,
showing how the subset moves to the tight duster

m=60 m=67

oO oo

00

0 'S 0

"' ·~
:~q,~ 0
.:.."..
m=74 m=81

oO oO

0 0 0 0

oo~o o oo-ao o
0 qpo 00 0 qpo
oQ, 0"' 0
0 0 931Sl 0 0 0 0

0~ ~

FIGURE 7.21. Bridge data: plots of ellipses showing the reintroduction of units
from the bridge into the subset
7.5 Data with a Bridge 389

m=90 m=116

m=142 m=169

• • .:: ..... ~ 0

.·.·~~·• 0
0 oo
OQ) O
•"
0 • • e ~ e"" •" •"

0 00 ··~-'-

FIGURE 7.22. Bridge data: plots of ellipses showing the relatively constant ori-
entation of the ellipse as the dispersed group is included in the subset

join those from the compact group. The rotation of the ellipse will result
in changes in the ordering of the observations by Mahalanobis distance.
The increase in the size of the subset leads to a general shrinkage in the
distances of units not in the subset (Exercise 2.12) .
The final set of ellipses, Figure 7.22, shows that the ellipse hardly changes
its orientation during the rest of the search. Fromm= 90 to m = 169 the
effect of the tight duster and of the points in the bridge is sufficient to keep
the orientation of the ellipse constant. During this period, Mahalanobis
distances will shrink, but there will be little change in the ranking of the
observations. The final plot, for m = 169 in the lower right-hand panel of
Figure 7.22, shows that the last observation to enter will seem to be an
outlier.
A summary of the behaviour of these ellipses is given by the changes in
the principal components of the ellipse during the forward search. The left-
hand panel of Figure 7. 23 shows the percentage of variance of the complete
data explained by the two principal components and the right-hand panel
the eccentricity of the ellipse, which is 1- y'(.A 2 / .Al), where .A 2 is the smaller
of the two eigenvalues. For a circle, the value is zero. The two panels of
Figure 7.24 plot the elements of the eigenvectors. Four phases are evident,
to some extent, in each plot. These reflect the initial rejection from the
subset of units in the bridge, the rotation of the ellipse from m = 40 up to
the reinclusion of units from the bridge after m = 60 and changes in the
ellipse towards the end of the search as the ellipse becomes more circular.
390 7. Cluster Analysis

40 60 80 100 140 40 60 80 100 140


Subset size m Subset size m

FIGURE 7.23. Bridge data: forward plots of principal components of ellipses.


Left panel, variability explained by the two components; right panel, eccentricity
of the ellipse

<=! <=!

s~ s
0
Q)
~~-'
'/·
LO > LO
c:
> 0
c: Q) 0
C>
Q)
C> ·a;
·a; "0
c:
~
0 0 0
0 0
Q) 0
ö U>

ö
~ LO
'.
,,,"., .l1 LO
Q)
c:
E
Q) 9 '~ .-- ...
Q)
E
9
iii Q)
iii
<=! <=!
";" ";"

40 60 80 100 140 40 60 80 100 140


Subset size m Subset size m

FIGURE 7.24. Bridge data: forward plots of the first and second eigenvectors,
showing the rotation of the fitted ellipse as units not from the tight duster leave
or enter the subset

It is now time to consider how these suggestions of the cha nging structure
of the subset are revealed in other plots. Figure 7.25 is the forward plot
of scaled Mahalanobis distances. There is a maximum around m = 38;
many distances start to decline after m = 60, with some crossing of the
7.5 Data with a Bridge 391

CX>
57
<fl

__ ..,-
Q)
0
c: / ..... ,..,
"'
u; <rJ
I
I ....__,
'5 /

:s"'0
I
./' ....... "/ /AI'ö=..:;:::-;::::::~
c:
"'
(ij ....
.c
"'
::::;;

"'

0
~
40 60 80 100 120 140 160
Subset size m

FIGURE 7.25. Bridge data: forward plot of scaled Mahalanobis distances. The
three groups seem to be separated around m = 50

lines. Around m = 120 there is a set of rather dispersed larger distances,


together with the band of small distances produced by the tight duster ,
which increase together as the search progresses. The single outlier, unit
57, is clear towards the end of the plot. However, there is more specific
information than this. The distances around m = 50 may be suggestive of
structure. Working up from the bottom of the plot, there seems to be a first
group giving rise to a large number of small distances. There is perhaps a
second group giving distances with a maximum of just less than two. There
is then a small gap and a third, more intensive group of larger distances.
The few largest distances look as if they come from outliers. We utilise
these hints of structure in the further analysis of the next section.
There remain the forward plots of specific Mahalanobis distances. The
left panel of Figure 7.26 is the forward plot of the minimum Mahalanobis
distance amongst the units not in the subset. We can see a rapid increase
at m = 60, followed by a few other high values as units from the bridge
are included. Eventually these have a sufficient effect on the estimated
covariance matrix that the distances decrease. There is perhaps a second
diffuse peak at m = 90, which we know is when all units from the compact
group and the bridge are in the subset. But, without this knowledge it
is hard to teil this peak from that at around 120. The outlier stands out
clearly.
The forward plot of these distances is less informative than it is when
there are sharply differentiated groups - the effect of the bridge is to make
392 7. Cluster Analysis

0
·' '
::,:
:~::
'••''
V

40 60 80 100 140 40 60 80 100 140


Subset size m Subset size m

FIGURE 7.26. Bridge data, forward plots of Mahalanobis distances: left panel,
minimum distance of units not in the subset; right panel, gap plot

less severe any changes in distances. The same is true of the gap plot on the
right of Figure 7.26. This shows that, initially, there are many interchanges.
There are also a couple of peaks around m = 60, but otherwise the plot is
featureless, apart from the outlier at the end.
We now move to a further preliminary analysis, where we rely heavily
on the forward plots of Mahalanobis distances for groups of units and for
individual units.

7.5.2 Further Preliminary Analysis: Mahalanobis Distances


for Groups and Individual Units
The preliminary analysis in the preceding section indicated two things: that
something happens in the search around m = 60 and that there seem to be
three groups of observations. We now use plots of Mahalanobis distances
to explore these indications.
In the forward plot of all Mahalanobis distances, Figure 7.25, the clear-
est Separation seems to be between m = 45 and 55. We now look at the
behaviour of distances for observations which are roughly dustered in this
interval. If units are close together in the space of the observations, their
Mahalanobis distances will move around together as the centroid and ori-
entation of the fitted ellipse change.
Figure 7.27 is derived from the plot of scaled Mahalanobis distances in
Figure 7.25 by dividing the distances between nine panels, according to
the ordering of the distances at m = 53. We divide the observations into
three groups by choosing thresholds where the gaps seem to be, which
is at scaled Mahalanobis distances of 0.4 and 2.0. Since our analysis is
exploratory, there are no fixed consequences of this choice. We can try at a
variety of values of m and several different numbers of different t hresholds.
7.5 Data with a Bridge 393

Un its= 20 O.O<th<=0 .4 Units=20 O.O<th<=0 .4 Units=20 0 .0(th<=0 .4

Units=11 0 .4<th<=2 .0 Units= 11 0 .4<th<=2 .0 Units= 11 0 .4<th<=2 .0

.....•.,:-..:_·.:·.:·.:·..:•..;v--- - -.-: ~ -:---:----

Units=26 th>2 .0 Units=26 th>2 .0

FIGURE 7.27. Bridge data: forward plots of scaled Mahalanobis distances divided
into three tentative groups at m = 53

The three groups we have selected are then subdivided, by order of dis-
tance, into three panels each. There are 20 curves on the first three plots
and 11 each on those in the second row. There is a remarkable progression
from the first panel to the last. The distances in the first three panels ap-
pear, on this scale, small throughout. The distances in the last six panels
are increasingly large.
As weil as a difference in scale, these plots also show different shapes.
Figure 7.28 replots the nine panels, each with their own scale. The plots
of distances now are revealed as having very different shapes. In the first
three panels the plots are high at the end, whereas in the later panels the
distances decrease as m increases. So the plots of distances differ both in
size and in shape.
The large distances at the beginning of the plots may not help in in-
terpretation. They arise during the period of interchange evident in the
gap plot in the right panel of Figure 7.26. We could therefore rerun the
forward search from a later starting point, to get a different perspective on
the distances. Or we could just omit the earlier part of the plot.
These plots certainly show that there is a progression in the shapes of the
curves, based on the ordering at m = 53. It is however hard to know where,
394 7. Cluster Analysis

Units= 25 th>2.0

~
FIGURE 7.28. Bridge data: forward plots of scaled Mahalanobis distances divided
into three tentative groups at m =53. As Figure 7.27, but with each row rescaled

if at all, there is a sharp change in shape. We now add to our interpretation


of the plots the information from, for example the gap plot of Figure 7.26,
that something changes around m = 60, which augments the impression
from these multipanel plots of the importance of m = 60.
Figure 7.29 shows nine panels of forward plots of Mahalanobis distances
for units which join the forward search around m = 60. Here we have used
the original distances. Plots of scaled distances lead to a similar analysis.
The first plot gives the curves for the first 58 units to join. The pointwise
quantiles of these curves are used as a reference in the successive panels:
the central curve is the median and, working outwards, successive curves
contain 50%, 75%, 90% and 95% of the distribution, with some rounding.
Typically, new units which do not agree with the cluster established at
the beginning of the search will initially have high distances, because they
are remote from the cluster. At the end of the search they will have low
distances, since they are closer to the final cluster centre.
The unit which joins for m = 59 has initially a slightly higher distance
than the units already in the subset: necessarily so, or it would have joined
earlier. It also eventually has a slightly lower distance, although the general
shape of the curve is not distinct from some of the others. Note t hat the
7.5 Data with a Bridge 395
~ r---m--~~8~1-nc_l_l_l•------~ ~ r--m---6~0~1-nc~l~l~~------,
: \

14 1 1~9

FIGURE 7.29. Bridge data: nine panel plot of forward Mahalanobis distances for
specific units, starting from m = 58

plots are rescaled so that, during the course of the part of the search cap-
tured in the nine panels, the curve for unit 114, which is the one included
at m = 58 in the first panel and has the highest initial distance in that
panel, would shrink to occupying less than half the plotting space of the
panels in the third row.
The curve for m = 60 is a little high in the middle, but it is the curve for
m = 61 which is the first to be really different. Initially it is high, finally
low, changing rapidly from one to the other, indicating that it does not
belong to the duster found so far. The same pattern, but more extreme, is
shown by the units which join for m = 63 to 66. The unit which joins when
m = 62 is slightly different in behaviour: initially the distance is high, but
it is not extreme at the end of the search.
The study of these individual curves confirms the importance of the in-
dications from the gap plot and the plot of minimum Mahalanobis distance
in Figure 7.26, that there is a duster of 60 Observations which enter the
search first. Although 60 is indeed the number of observations in our orig-
inal tight duster, observation 168, which enter when m = 60, is not from
that duster, whereas observation 118, entering at m = 61 is, even though it
appears not to belong. Although these observations are incorrectly dassi-
fied by our procedure, they are drawn from simulated samples and so have
randomness which may take them into a neighbouring group. Both are, in
396 7. Cluster Analysis

i·"·111 : ...
.
.
I•
:i
0

w i
Li) I

;r
'I
.I

0
::;,;; 0
0
.5:
Vl
Q)
Ol
c
ro
r.
ü
0
I!)

40 60 80 100 120 140 160


Subset size m
FIGURE 7.30. Bridge data: changes in forward plot of Mahalanobis distances,
units ordered by final entry to the subset

fact, extreme in their groups and are shown, by the scatterplot of the data,
Exercise 7.4, tobe in the "wrong" positions.
Our analysis has provided strong evidence for the existence of a t ight
duster of 60 observations. The forward plot of scaled distances in Fig-
ure 7.25 suggested that there were three dusters. The evidence from this
plot is summarised in Figure 7.30 which shows changes in successive Ma-
halanobis distances, in this case unscaled: if, for a particular observation,
there is an increase in the distance when m increases to m + 1, a symbol
is plotted, but not otherwise. Distauces which move up and down together
will then have a similar pattern of dark symbolsandlight spaces. The units
can be ordered on the vertical axis in any way found to be informative, for
example by the magnitude of t he distances a t some point in t he search.
Here we h ave used the order of final entry to the subset. Like t he other
plots, this one suggests the existence of three groups. We now eliminate
the 60 observations we have found to be in a duster, one incorrectly, and
analyse the remaining 110 observations.
The forward plot of all 170 distances, Figure 7.25, indicated that, in
addition to the tight group of 60 observations, there was a relatively little
group (which we know to be the bridge) and one which was more highly
dispersed. We proceed by again running a forward sea rch, but now one in
which we influence the initial subset. If this is chosen in the more dispersed
7.5 Data with a Bridge 397

U">
..,:
""!
0
..,:
Cl 10
:::;: "'!
<-?
E 0..

"'
:::>
E 0 Cl
")(
<-?
"'
:::;: 10
ci
10
C\i
0
0 ci
C\i

20 40 60 80 100 20 40 60 80 100
Subset size m Subset size m

FIGURE 7.31. Bridge data, llO observations: forward plots of Mahalanobis dis-
tances: left panel , maximum distance of units in the subset ; right panel, gap plot

group that seems to have been identified, rather uninformative plots result,
as there is now no structure of extreme values of Mahalanobis distances.
But, if the search starts with the less dispersed group, informative plots
result. Accordingly we start with those units in Figure 7.25 which, at m =
45, have distances between 0.4 and 2. This procedure yields a starting
subset with m 0 = 17.
The forward plot of the maximum Mahalanobis distance in the left panel
of Figure 7.31 has a series of high values starting at m = 31, indicative of a
duster ending at m = 30. After several rather different observations have
entered, the covariance matrix has changed sufficiently that the entering
observations no Ionger have high distances. The gap plot in the right-hand
panel of Figure 7.31has a sharp peak at m = 30.
To investigate whether m = 30 does indeed reflect anything interesting
we go straight to a nine-panel plot of the curves of the Mahalanobis dis-
tances for individual observations. The first panel of Figure 7.32 shows the
group of 28 reference curves. The points entering when m = 29 and 30
agree with this general form. However the remaining observations do not.
That for m = 31 is of quite different shape, being around twice as large
as any other distances around m = 50. The remairring curves behave like
observations from another duster: initially the values are too high, and fi-
nally too low. The high values arealso increasingly high, as is evidenced by
the shrinking of the reference set in successive panels. These observations
are initially far from the fitted duster centre, but are finally nearer the
centroid of the fitted ellipse than the 30 observations found here.
This time the group of 30 have been found correctly, apart from observa-
tion 168 which was induded in the first duster. We now consider methods
of exploring this dustering into three groups. For example, should there be
two groups or four? Have some units been put in the wrong duster?
398 7. Cluster Analysis

~ ,--m--~2S~rn-c~rr~r8~----, ~ r---
m~-J~O~rn-c~
, ~
,,~
J ------~

FIGURE 7.32. Bridge data, 110 observations; nine panel plot of forward Maha-
lanobis distances for specific units, starting from m = 28

7.5.3 Exploratory Analysis: Single Clusters for the Bridge


Data
The analysis of the previous section has led to the establishment of three
tentative dusters containing respectively 80, 60 and 30 units. In this section
we look at searches starting in the three dusters in turn and particularly
monitor the distances of the units in the duster.
For these searches with individual dusters we plot scaled Mahalanobis
distances. We expect that forward plots of these will be roughly constant
during the first part of the search as units from the particular duster enter.
Thereafter the mean and covariance matrix used in estimating the Maha-
lanobis distances will not be those for the particular duster, since other
observations will be induded, and the distances will often increase. Con-
versely, the distances for other members of other dusters may decrease as
the centre of the ellipse moves towards those units later in the search. Since
we monitor separately the scaled distances for each tentative duster, for
the three dusters here we can thus study the distance of each unit for three
different forward searches through the data. If the duster is indeed homo-
geneous we expect that the units from the duster will have similar traces
of Mahalanobis distances for each individual search, even though the plots
of the distances will depend on the duster from which we started.
7.5 Data with a Bridge 399

0
.,.;

117, 110, 129, 106, 131


~

"'c:~

*
'5
.!!!
LJ
g
q

.!!!
"'
J:.
"'
:<
II)
ci

~ ~ _j
50 100 150
Subset size m

FIGURE 7.33. Bridge data: forward plot of scaled Mahalanobis distances for
tentative Cluster 2 when the search starts in that duster

For the bridge data we start with the tight duster of 60 observations,
Cluster 2. Figure 7.33 shows the scaled distances for a search starting with
Cluster 2. The distances are indeed nearly stable to begin with, but then in-
crease, virtually all together, with m. The observations most different from
the rest are 109, which initially has the highest distance, but which changes
to the lowest for the last 30 steps of the search, and 168, which is highest
around m = llO. At m = llO the other observations with anomalously
high distances are, reading down, l17, 100, 129, 106 and 131.
We can use nine-panel plots similar to Figure 7.28 to compare the dis-
tances from this starting point for all three clusters. We can also look at
plots of individual distances. Figure 7.34 showsindividual curves agairrst a
background of Cluster 2 units. The first panel in the figure gives all curves
for the first 54 units to enter the search, that is 90% of the units forming
Cluster 2. As in Figure 7.29 the quantiles of this distribution are used as
a reference set in the other panels. These show that, up to m = 59, the
included units have trajectories of distances much like those of units in the
reference set. When m = 60, unit 168 is included, which has a high distance
for m > 60. Unit l18, included when m = 61, is more extreme, especially
up to m = 60, but does have a trajectory very much like unit l14, included
when m = 58. As we shall see, these two units are close together in Eu-
clidean space. Continuation of this plot for units not in Cluster 2, which
we do not give, shows trajectories which look very different from those of
the first 60 units to be included. They more resemble more extreme ver-
400 7. Cluster Analysis

FIGURE 7.34. Bridge data: forward plots of scaled Mahalanobis distances from
m = 54 for individual units tentatively in Cluster 2 from a search starting in that
duster

sions of the trajectory for unit 118 in the centre panel of the bottom row
of Figure 7.34.
The 30 curves of scaled Mahalanobis distances for the bridge, that is
Cluster 3, in Figure 7.35 seem to tell a less coherent story. However, up to
m = 87, most of the curves move together, the principal exceptions being
those for units 118 and 167, which are the two lowest around m = 87 and
two of the highest towards the end of the search. Again, these two units
are close in Euclidean space. Traces of distances for the three groups show
that units 118 and 167 show up in the last panel for units in Cluster 3.
The curves for the individual units of Cluster 3 show that, although units
118 and 167 are not the last to enter the duster, the trajectories of their
distances do have something in common with units from Cluster 2 which
enter from m = 30.
Finally, Figure 7.36 shows the results of the search starting from Cluster
1, the dispersed group. The distances form a coherent pattern up to m = 80.
Thereafter most rise gently, including that for observation 57, which is the
outlier we have noticed before. However, the curves for units 17, 24, 43 and
67 curve downwards, cutting across the general trend. Plots of individual
trajectories show that unit 43 behaves very similarly to units from the
7.5 Data with a Bridge 401

V>
Q)
u
c CD
~ 0
'6
V>
J5
0
c
<II
äi
s:;
0""'"
<II
::?!

C\1
0

0
0

50 100 150
Subset size m

FIGURE 7.35. Bridge data: forward plot of scaled Mahalanobis distances for
tentative Cluster 3 (the bridge) when the search starts in that duster

bridge which enter after m = 80. In fact, observation 43 is very close to the
bridge.

1.5.4 Gonfirmatory Analysis: Three Clusters for the Bridge


Data
As a result of the analyses outlined in the previous sections we can be
confident about the assignment of all observations except the 13 found to
be a little different from the majority of the observations in their supposed
duster. Asthelast part of our analysis we, very briefl.y, consider the fitting
of three ellipses to the data and what this can tell us about the structure
of the observations.
There are several options for this forward search, rather as there were
for discriminant analysis. There are 170- 13 = 157 observations of whose
allocation we are sure. The three initial subsets consist of the first 100o:%
of these observations to join the forward search for each group when we
performed the searches for groups in the last section. These observations
are not allowed thenceforth to leave their subsets. We then have an unallo-
cated set which contains the 13 observations about which we feel uncertain,
plus those others not yet allocated. The search proceeds by allocating the
unallocated observations to the closest group of the three , as judged by
402 7. Cluster Analysis

II>

"'<.> ....
Q)

c:
<0
(i)

"'0
:ö ('")

c:
<0
(ij
.<:
<0 (\J
::lö

50 100 150

Subset size m

FIGURE 7.36. Bridge data: forward plot of scaled Mahalanobis distances for
tentative Cluster 1 (the dispersed group) when the search starts in that duster

Mahalanobis distance. For this purpose we can use either the usual dis-
tances or standardized distances. The search can be constrained so that no
observations can leave once they have joined a subset.
We can agairr plot distances for each unit, but now there are three dis-
tances at each stagein the search. We can also project these plots to show
how the closeness of the units to each duster varies during the forward
search. However, for the bridge data, the extra information found from
the search with three centres can be summarized by a forward plot of the
population to which the unit is allocated, tagether with scatterplots of the
data.
Figure 7.37 shows the allocation of units to populations for all units about
which we were undecided, tagether with those for which the allocation
changed at least once during the search, using the customary Mahalanobis
distances. Reading up from the bottarn of the plot we have a key giving the
symbols for the three groups. Then we have a group of units from Cluster
2 which had never been thought tobe atypical of their group in our earlier
analyses, but which were classified in other clusters at least once in this
search. Above them is unit 156 from Cluster 3. Then, in a single group, are
the units which we have singled out earlier for further consideration.
We now consider the individual units in turn. Unit 109 is correctly clas-
sified in Cluster 2 for most of the search. Units 111 and 114 are indicated
7.5 Data with a Bridge 403

3 HIS 6
3 167 6
••••••
6 6 6 6 6 6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
2 131
• • • • • • • • • • • • • • • • • ••• • • • • • • •
2 129
2 118 6
• • • • • • • • • • • • • • • • • • •• • • • • • • •
6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
2 117
• • • • ••• • • • • •• ••• • • •• • • • • • • •
2 lOS
•••••••••••••••••••••••••••
2 100
I 67
•+ •+ •+ •+ •+ •+ •+ •+ •+ •+ •+ •+ •+ •+ •+ •+ •+ •+ •• •••••••
+ + + + + + + + +
I ~7 + + + + + + + + + + + + + + + + + + + + + + + + + + +
I 43 + + + + + + + + + + + + + + + + + + + + + + + + + + +
I 24 + + + + + + + + + + + + + + + + + + + + + + + + + + +
I 17 + + + + + + + + + + + + + + + + + + + + + + + + + + +
31~ 6 6 6 6 6 + 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
2 114
•••• •••• •• •••• 6 6 6 6 A 6 A 6 6 6 6 6 6
2 111
•••••••••••••• e • 6 6 6 A 6 6 6 6 6 6 6
2 109
gr. 3
••• •••••• ••••
6 6 A 6
e
6
•••••••
6 6 6 6 6 6 6

A A A 6
• 6 6
A 6
6 A
A 6 A 6 6 6 A 6 6
gr. 2

gr. 1
•••• •••• ••••••••••••• ••• •••
+ + + + + + + + + + + + + + + + + + + + + + + + + + +

1 42 14 4 1 46 1 48 1 50 1 52 1 54 1 56 1 58 1 60 1 62 1 64 1 66 1 68 170

FIGURE 7.37. Bridge data: confirmatory search with t hree clusters. Allocation
of units in t he last steps of the search

as belanging to Cluster 3, rather than Cluster 2. Of the remairring units


only 118 is misclassified, also being in Cluster 3 rather than Cluster 2. Fig-
ure 7.38 shows a scatterplot of the units from Cluster 2, the tight group,
and Cluster 3, the bridge. We can see that the t hree misclassified units are
on the edge of the tight group. Because this group has a small variance
relative to the bridge, Cluster 3, it t ends to be eaten into by the bridge
for which Mahalanobis distances may be less, due to the larger variance.
Interestingly, observation 109 is also in this region.
The results for the search using the standardized distances are in Fig-
ure 7.39. Herethereis more misclassification than before, now of previously
impeccable units from Cluster 1. Units 12 and 64 are correctly classified as
being in Cluster 1 for part of the search and incorrectly as being in Clus-
ter 3, the bridge. Units 36 and 72 are firmly misclassified. Of the suspect
units, 17, 24, 43 and 67 areallalso given to Cluster 3. Unit 168 now goes
to Cluster 2. From these results we see that groups with smaller variance
are indeed favoured when standardized distances are used for comparisons.
Where these units are is shown in the scatterplot of Figure 7.40. A ll, apart
from 168, are on the boundary of Clusters 1 and 3: units 17, 24 and 67 are
close together, with 36, 43 and 72 also between the two groups, but in a
slightly different position.
404 7. Cluster Analysis

0 +
+

+
+

A +
A
A
A A
A A
A
A A d>
A A

..
A
AA A A A
A A
A AA
A
A
118 ••
109 .,.
A
A
114 •• •
~~. •
.,.•
111 ~·
•••••• •
168

2 3 4 5 6 7
y1

FIGURE 7.38. Bridge data: units not certainly classified in Figure 7.37. • units
in Cluster 2 (81-140)

3 168
3 US7 A A
•••••••••••••••••••••••••••
6 A 6 6 6 6 6 6 6 A A A A A A A A A 6 6 6 A 6 A A
2 131
•• •• •• •• •• •• • •• •• •• •• •• •• •• •• •• •• •• •• •• •• •• •• •• •• •• ••
•• • •• • • •• •• • •• • •• •• • •• • •• •• •• •• •• •• •• •• •• •• •• ••
2 129
2 118

•• ••• •• ••• ••• •• ••• ••• •• ••• •• •• ••• •• ••• •• •• •• •• •• •• •• •• •• •• •• ••


2 117
2 106
2 100
I 67 A A 6 A A 6 A A 6 A A A A A A A A A A A A 6 A A 6 A A
I '7 + + + + + + + + + + + + + + + + + + + + + + + + + + +
I 43 A A 6 A 6 A A A A A A A A A A A A A A A A 6 A A 6 A A
I 24 A A 6 6 6 A 6 6 6 6
6 A A A A A A 6 A A 6 6 6 6 A A A
I 17 A A 6 A A 6 A 6 6 A A A A A 6 6 6 6 A 6 6 6 6 6 6 6 A
I 72 6 A 6 6 6 6 6 6 6 6 A A 6 6 6 6 6 6 6 6 6 6 6 6 6 A 6
I 64 6 6 6 6 6 6 6 6 6 6 6 6 6 6 + + + + + + + 6 6 6 6 6 6
I 36 A A 6 6 6 6 6 6 6 6 6 A 6 6 6 6 6 6 6 6 6 6 6 6 A 6 6
I 28 A 6 6 6 6 6 6 6 6 + + + + 6
6 6 6 + 6 A 6 6 + + + + +
I 23 A 6 6 6 6 6 6 + + + + + 6 6 + + + + + + + + + + + + +
I 12 + + + + + 6 6 6 6 6 6 A A A + + + + + + + + + + + + 6
111 A 6 6 6 6 6 6 6 6 6 6 + + + + 6 6 6 6 6 + + + + + + +
9'· 3 6 6 6 6 6 6 6 6 6 6 6 A A 6 6 6 6 6 6 6 6 6 6 6 6 6 A
gr . 2
gr. 1
•+ •+ • • • •+ •+ •+ •+ •+ •+ •+ •+ • •+ •+ •+ •+ •+ •+ •+ •+ •+ •+ •+ •+ •+
+ + + +

14 2 14 4 146 1 48 1 50 15 2 1 54 156 1 58 1 60 1 62 16 4 1 66 1 68 1 70

FIGURE 7.39. Bridge data: confirmatory search with three clusters using stan-
dardized distances. Allocation of units in the last steps of the search
7.5 Data with a Bridge 405

57

L{)
+

+
0
+ + +
+
++1\:. +
+ + ++++ +
+ iJ. +
+ + +

-10 -5 0 5 10

y1

FIGURE 7.40. Bridge data: units not certainly classified in Figure 7.39. + units
in Cluster 1 (1-80); • units in Cluster 2 (81-140)

What are we to condude? One condusion is that the forward search,


free of masking and other problems of outliers, shows us which units are
on the boundary between dusters. As we have changed between unstan-
dardized and standardized distances, different units which have never been
called into question are attracted to one group or the other. Units that are
undear follow the same pattern. If a firm decision has to be made about
duster membership, it seems sensible to rely on the forward searches for
the individual dusters, which do not depend on the kind of distance used.
It is also sensible to place more reliance on the multiduster analysis which
causes changes of allocation in the fewest units about which we have not
previously been concerned. Here this means that we would prefer the anal-
ysis using unstandardized distances. Units 109, 111 and 114 would remain
allocated to Cluster 2, as they are for most, but not all, the search, whereas
unit 118 will be allocated to Cluster 3. Given the scatterplot of Figure 7.38
the misallocation of this unit is not at all surprising. We leave it to the
reader to experiment with values of r other than 0 or 1 in (7.1) to see if
the number of units unexpectedly changing allocation can be minimized.
406 7. Cluster Analysis

7.6 Financial Data


7. 6.1 Preliminary Analysis
As a first example with data that are not simulated, we consider some
financial data introduced by Zani (2000, p. 194), taken from the Italian
financial journal Il Sole - 24 Ore for May 7th, 1999. An introduction to the
data in English is given by Cerioli and Zani (2001). There is information
on 103 investment funds operating in Italy since April 1996. The variables
are:

Y1: short term (12 month) performance


Y2: medium term (36 month) performance
y3: medium term (36 month) volatility.

Figure 7.41 is a scatterplot matrix of the data. There seem to be two


10 20 30 40 50
0

0 0
0 0 "'
fb~
y1 =Short term
performance ~ 0
0 0 ll'>
0 0
0
0 0 o 0 0 00 0

~ ~--------------~
0 r---------------~ r----------~0~~

0 0
0

"0
y2=Medium term
0 0 t~~ 0
0 oo\~
performance

~
"'
N

0 0 "'
y3=Medium term
8
-"'
0
0 0 0
volatility
0 ?~OcP 0
0 oar~go 0
0 rrDPo
0 5 10 15 20 25 10 15 20 25
FIGURE 7.41. Financial data: scatterplot matrix

clusters, with a few observations in between. This impression is also that


7.6 Financial Data 407

l
.,"'<.>
<Xl

~
<0
c
"'
iii
'5
"'0
:.0
c ~

"'
Oi
.c
"'
:::0

--r
20 40 60 80 100
Subset size m

FIGURE 7.42. Financial data: forward plot of scaled Mahalanobis distances with
mo = 17. There seem to be two groups

obtained from the forward plot of Mahalanobis distances from m 0 = 17,


Figure 7.42, which has all the features we associate with the search when
there are two groups. At around m = 38 there seems a clear separation,
although some units seem later to drift away from the upper group. F'rom
m around 53, the two groups start to become intermingled, with a rapid
decrease in many distances and an increase in some others. This is a period
of interchange of units as the shape of the fitted ellipsoid changes. F'rom
m = 65 onwards there are no very small distances, indicating that the
centroid of the fitted ellipsoid is somewhere in between the two clusters.
Plots of aspects of the principal components of the fitted ellipsoid also
suggest there are two clusters. The left panel of Figure 7.43 shows that the
percentage of total variation explained by the first two principal compo-
nents changes markedly around m = 60, as does the eccentricity plotted
in the right-hand panel, but that is the only period of appreciable change.
The components of the eigenvectors in Figure 7.44 also change from one
pattern to another, but around m = 58. These changes occur later than
those associated with individual observations as several observations have
to be included in the subset before the fitted ellipsoid changes appreciably.
Other plots reinforce these suggestions. The left panel of Figure 7.45 is
the forward plot of minimum Mahalanobis distances amongst units not in
the subset. lt shows a broad peak with its maximum at m = 54, suggesting
that several remote observations are being included in the basic subset
408 7. Cluster Analysis

20 40 60 80 100 20 40 60 80 100
Subset size m Subset size m

FIGURE 7.43. Financial data: left panel, percentage of variability explained by


the first two principal components; right panel, the eccentricity of the ellipse from
the first two eigenvalues

--- . ....... -------"'''..


''\-.... ----- --
,----------....
0
0
0
.l!l 0
5i U) E
E9
ll)
E
Q)
9
ü:i Q)
ü:i
C!
";-
20 40 60 80 100 20 40 60 80 100
Subset size m Subset size m

FIGURE 7.44. Financial data: left panel, forward plot of components of the first
eigenvector; right panel, the components of the second eigenvector

at this point. The gap plot in the right-hand panel of Figure 7.45 shows
that there is a large amount of interchange shortly after the peak, as the
ellipsoid changes appreciably as it has to straddle two groups. Figure 7.46
is an ordered entry plot. The observations are ordered by their first entry
into the subset. If there are no interchanges, the plot has the shape at the
top right-hand corner, as units successively enter.

FIGURE 7.45. Financial data, forward plots: left panel, minimum Mahalanobis distance among units not in the subset; right panel, gap plot

FIGURE 7.46. Financial data: ordered entry plot - units ordered by first entry to the subset. The interchanges around m = 60 cause the white area in the centre of the plot

Below this we see that most of these units, the last to enter the subset,
were previously included, but were removed by the interchanges starting at
m = 60.
Further evidence about the cluster structure of the data can be found
by looking at the trajectories of groups of distances during the search.
Figure 7.47 shows six panels of Mahalanobis distances separated at m = 38
by cutting on distances less than six.

FIGURE 7.47. Financial data: forward plots of Mahalanobis distances. The six panels of distances, cut at m = 38, have been rescaled to fill the panels

The top three panels contain 51 units
all of which have very similar shaped trajectories, although the magnitude
of the distances increases. The lower left-hand panel contains two kinds of
shape - one which jumps up in the middle like those of the top panels and
one which is low towards the end. Most, but not all, of the remaining units
follow this trajectory.

7.6.2 Exploratory Analysis
Figure 7.47 indicates that there are more than 51 units belonging to the first
cluster. This corroborates the evidence from the plot of minimum distances
amongst observations not in the subset in the left panel of Figure 7.45,
which shows a broad peak leading up to m = 53. We therefore now look
at the forward plots of individual Mahalanobis distances for some units
entering around this peak. Figure 7.48 shows nine panels of individual
distances starting from m = 47. The first five panels are uneventful; the
observations joining up to m = 51 agree with those already in the subset.
But the remaining four panels show observations entering which are remote,
but seem to have much the same trajectory as those already in the subset.
The second plot of nine panels, Figure 7.49, starts from m = 56, past the
peak in the plot of minimum distances in the left-hand panel of Figure 7.45,
but still in a region of high values. Indeed, the units entering are again remote
from those that have already entered.

FIGURE 7.48. Financial data: nine panel plot of forward Mahalanobis distances for specific units, starting from m = 47

The curves for m = 56 and 57, units
54 and 50, have an intermediate shape rather similar to that for observation
21 which enters when m = 54. From m = 58, when observation 72 enters,
onwards, the shapes look more like those in the final panels of Figure 7.47
for units which enter at the end of the search. In the panels from m = 60,
several curves are given as there is appreciable interchange in this region
of the search.
We now start the search from the other putative cluster, which consists
of units with large distances in the initial part of the plot of forward Ma-
halanobis distances, Figure 7.42. We select those units with the 20 largest
distances at m = 46. The forward plot of Mahalanobis distances in Fig-
ure 7.50 again shows that there are two groups. The plot is indeed similar
in structure to Figure 7.42, where we started in the first tentative cluster,
except that the changes associated with the introduction of units probably
not from the second cluster now start a little earlier, around m = 45. The
summary of the forward plot of Mahalanobis distances giving the changes
in the distances is shown in Figure 7.51 in which the units are ordered by
their distances at m = 46. There appears to be an almost complete separa-
tion into two groups around this value of m, although the last part of the
search is uninformative.

FIGURE 7.49. Financial data: further nine panel plot of forward Mahalanobis
distances for specific units, starting from m = 56 with reference distribution for
m = 47. The multiple curves in the last five panels show the effect of interchanges


FIGURE 7.50. Financial data: forward plot of Mahalanobis distances starting with 20 units in the second cluster
FIGURE 7.51. Financial data, changes in forward plot of Mahalanobis distances starting with units in the second cluster. Units ordered by distances at m = 46

Ii"!

0
"' ~

::;;
E
.... a. ci "'
::>
E
·c: "'
(.!)
0
ci
~

"' ' '


~.,,:·,:
,,
~
'7 ~
"'
20 40 60 80 100 20 40 60 80 100
Subset size m Subset size m
FIGURE 7.52. Financial data, forward plots of Mahalanobis distances starting
with units in the second cluster: left panel, minimum distance of units not in the
subset; right panel, gap plot

Forward plots of specific distances enable us to pinpoint this change more
clearly. The forward plot of minimum Mahalanobis distances amongst units
not in the subset in the left panel of Figure 7.52 has a peak at m = 46,
followed by a second similar sized peak at m = 49. Presumably a series of
remote observations are entering in this region.

FIGURE 7.53. Financial data: forward plots of Mahalanobis distances starting with units in the second cluster. The six panels of distances, cut at m = 45, have been rescaled to fill the panels

The two peaks at m = 46
and 49 show up in the gap plot on the right of Figure 7.52, which shows
the interchanges starting at m = 53.
We now look at some plots of individual curves and of groups of curves.
The forward plot of all Mahalanobis distances, Figure 7.50, seems to have
two groups and little in the middle at m = 45, so we divide the units into
two tentative clusters at this point, the more compact consisting of those
with distances less than four. Figure 7.53 is the six-panel plot of these
distances. There are 46 units in the first row. The first two panels in the
first row and those in the second row are very different in both shape and
scale and form two distinct groups. The division is however less clear in the
third panel of the top row, where some of the curves are of an intermediate
shape and magnitude, which unfortunately is to be expected if there is not
a sharp cluster boundary.
The plots of individual distances from m = 44, given in Figure 7.54, are
helpful in this respect. The units added up to m = 46 appear to belong
together naturally. However, those added when m = 47 and 48 are appre-
ciably larger, although not very different in shape. This is the area of the
peaks in the plot of minimum distances amongst units not in the subset,
left-hand panel of Figure 7.52, so that it is to be expected that distant
observations will be added.
FIGURE 7.54. Financial data, search starting in the second cluster: nine panel plot of forward Mahalanobis distances for specific units, starting from m = 44

The curves for the observations in the remaining
panels are all initially high, but finally decreasing, the shape associated
with the lower row of panels in Figure 7.53 and so with the other cluster.
This analysis does not enable us to cluster all units unambiguously. How-
ever it does provide an excellent starting point for a confirmatory analysis.
The peak in the plot of minimum distances for the first cluster in the left
panel of Figure 7.45 and the gap plot in the right-hand panel of the same
figure are at m = 53. If we take the 52 units before this peak and likewise
the 45 units before the first peak in the plots in Figure 7.52 for the second
cluster, no units have been incorporated in both clusters and only units 21,
50, 52, 54, 77 and 89 remain to be clustered. These six units together with
the two tentative clusters are shown with different symbols in Figure 7.55.
The plot clearly shows why these units are the last to be clustered. Obser-
vation 52 is an outlier from Cluster 1. The other observations are shown, in
the two lower panels, to lie between the two clusters found so far. However,
in the panel for y1 and y2, observations 77 and 89 appear to lie in Cluster
1.
Before we proceed to our confirmatory analysis, it is interesting to re-
call where these remaining units enter the two clusters in the two forward
searches, which we can do by considering the plots of individual distances.
FIGURE 7.55. Financial data: scatterplot matrix; two tentative clusters and six unclustered units. + units in Cluster 1; • units in Cluster 2

Figure 7.48 gave distances from m = 47 when the search started in the first
cluster. The units to enter from m = 53 are successively 52, 21 and 89, none
of which had previously been assigned. Figure 7.49 shows that the next two
units to enter are 54 and 50, also unassigned. After this, units enter which
were included in the second group before the peak in distances. The story
is similar if we start from m = 45 in the second group. As Figure 7.54
shows, four units in the group, 89, 54, 50 and 21, enter before observation
26 which was assigned in the search from Cluster 1. Only observation 77
fails to be included in either group.
We now interpret these results in the light of the scatterplot matrix,
Figure 7.55, where observation 52 appears as an outlier from Cluster 1, but
away from Cluster 2. This observation is not close to either cluster centre,
but is much closer to that of the first cluster. The next two observations to
enter the search from Cluster 1 are 21 and 89. These observations also enter
Cluster 2 in the steps after m = 45. Together with units 50 and 54, they
are very much between the groups. It is only unit 77 that is apparently far
from both groups.

FIGURE 7.56. Financial data: forward plot of scaled Mahalanobis distances for
the 52 observations in Cluster 1. The distances for the six unclustered units are
highlighted

These impressions are based on inspection of the scatterplot and so tend
to rely on Euclidean distance. In making visual comparisons of the posi-
tions of units it is hard to allow for the elliptical structure of the contours
of constant Mahalanobis distance. We accordingly look at such distances
from both cluster centres to establish the final clustering, if any, of these
six units.
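Such a comparison is easy to automate. The fragment below is a sketch, assuming that the two tentative clusters are available as the arrays cluster1 and cluster2 and that the measurements of the six undecided units are collected in the array undecided; it returns, for each undecided unit, its Mahalanobis distance from the centre of each cluster, computed with that cluster's own mean and covariance matrix.

```python
import numpy as np

def distance_to_cluster(units, cluster):
    """Mahalanobis distance of each row of `units` from the centre of `cluster`."""
    mean = cluster.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(cluster, rowvar=False))
    diff = units - mean
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, inv_cov, diff))

# cluster1, cluster2: arrays of the tentatively clustered funds (assumed given);
# undecided: the three measurements for units 21, 50, 52, 54, 77 and 89.
# d1 = distance_to_cluster(undecided, cluster1)
# d2 = distance_to_cluster(undecided, cluster2)
# Units for which both d1 and d2 are large lie between, or away from, the groups.
```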

7.6.3 Confirmatory Analysis


As a result of the preliminary and exploratory analyses we have a classifi-
cation of the units into two groups, with in addition six undecided units,
numbers 21, 50, 52, 54, 77 and 89. The first group consists of 52 observa-
tions, units 1 - 56 less four undecided, and the second of 45 observations,
units 53 - 103, less two undecided.
We approach our confirmatory analysis in two ways. First we see which of
the six unassigned units can be included in either tentative cluster without
changing the shape of the cluster. Secondly we establish to which clusters
the units would be assigned if we insist on clustering all units and see how
this assignment changes during the forward search.
Figure 7.56 presents scaled Mahalanobis distances for the first group
plus those units about which we are undecided for a search through all the
data, starting from Cluster 1.

FIGURE 7.57. Financial data: forward plot of scaled Mahalanobis distances for the 45 observations in Cluster 2. The six unclustered units are highlighted

Throughout this search the largest distances
generally come from the six unassigned observations. At m = 28 the three
most remote units are 77, 52 and 89: the peak at m = 53 is caused by
these six units and, at the end of the search, the largest distances are due,
reading downwards, to observations 52, 50 and 77. Figure 7.57 is the same
plot, but for Cluster 2. Now the initially most remote units are 52, 50, 77
and 21. Towards the end of the search the remote units are 52, 77 and 50.
Of course, at the last stage of the search, the two figures have to agree
for the six unclustered units. It is clear from these two plots that the six
unassigned units do not naturally belong to either cluster.
Finally we fit two ellipsoids to these data. Before we discuss the results
we record that units 1 - 56 are all stock funds whereas units 57 - 103 are
balanced funds, so that we have roughly found the correct clustering. The
plot of Figure 7.58 shows the allocations achieved using unstandardized
Mahalanobis distances when two ellipsoids are fitted. Units 21 and 77 are
misclassified only at the beginning of the search, units 89 and 52 are cor-
rectly clustered, unit 52 being an outlier from Cluster 1. The remaining
two units, 50 and 54 are both misclassified as belonging to Cluster 2. The
scatterplot matrix of Figure 7.55 shows that these two observations do in-
deed lie very much between the two clusters. Since the scatters of the two
groups are similar, these results are unchanged if standardized distances
are used instead of the unstandardized ones of Figure 7.58.
In the period to which the data refer, the Italian Stock Exchange experi-
enced a remarkable increase in most stock prices.

FIGURE 7.58. Financial data: confirmatory search with two clusters using unstandardized distances. Allocation of units in the last steps of the search

This increase is paralleled
by positive short and medium term performances of many funds, especially
so for stock funds. Stock funds also exhibited higher volatility, which is syn-
onymous with higher risk. The six undecided units show different features,
which we expect can explain their behaviour along the forward search.
First, we note that unit 52 is an outlier from Cluster 1 (the stock funds),
showing excellent short and medium term performances and average volatil-
ity. Thus it is really a "good" outlier! On the contrary, units 50 and 54 have
negative short term performances, an exception even in the less remunera-
tive group of balanced funds, and remarkably low volatility. Also the short
term performance of unit 21 is poor, although it is not far from that of
other stock funds.
Among the balanced funds, both units 77 and 89 perform relatively well,
especially so for the former. Both these observations lie well in the middle
of the scatter of Cluster 1 in the two dimensional projection relating short
and medium term performance, Figure 7.55. However, they exhibit smaller
volatility than all stock funds, with only units 50 and 54 showing compa-
rably low figures. Hence we conclude that observations 77 and 89 are not
completely indistinguishable from observations in Cluster 1; we can expect
that they should be somehow identified as different.
Finally we briefly consider the results of the confirmatory search when
two ellipsoids are fitted as shown in Figure 7.58. This confirmatory search
does indeed agree with the conclusions above (although it is perhaps too
easy to say so after seeing the results). In particular, although unit 21 is
misclassified at the beginning of that part of the search shown, it is ulti-
mately included in the correct cluster. This happens as the search includes
in Cluster 1 other units with comparably low short term performance.

7.7 Diabetes Data
7.7.1 Preliminary Analysis
Our final example is of 145 observations on diabetes patients, which have
been used in the statistical literature as a difficult example of cluster anal-
ysis. A discussion is given, for example, by Fraley and Raftery (1998). The
data were introduced by Reaven and Miller (1979). There are three mea-
surements on each patient:

y1: Plasma glucose response to oral glucose
y2: Plasma insulin response to oral glucose
y3: Degree of insulin resistance.

Figure 7.59 is a scatterplot matrix of the data which are in Table A.17.
There seems to be a central cluster and two "arms" forming separate clus-
ters. The first cluster is appreciably more compact than the other two.
However, the plot of y1 against y2 is diagonal with increasing variance as
values of y1 and y2 increase. In the absence of the third variable we would
expect the forward search to start at the bottom left-hand corner of the
plot and to include units with increasing values of both variables. Although
there would seem to be no obvious breaks between clusters, each observa-
tion to enter would be slightly and increasingly remote from the cluster
already established, so units might be expected to enter with increasingly
large Mahalanobis distances. It is hard, from visual inspection, to tell what
the effect of the third variable is on this argument.
We start our forward analysis with a subset found by the method of ro-
bust ellipses, for which m0 = 23. The resulting forward plot of Mahalanobis
distances in Figure 7.60 has some strange features. There is a shortage
of very small distances and overall the distances perhaps fall into three
groups: there is a gap between the largest distances and the rest around
m = 70, these largest distances being rather uniformly distributed. Around
m = 45 it looks as if there is a gap between the smallest distances and those
which are somewhat larger, which again have a rather uniform distribution.
These impressions are confirmed by the QQ plot of Mahalanobis distances
at m = 30 in the left-hand panel of Figure 7.61 which shows some evidence
of three regimes as well as of three outliers.
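A QQ plot of this kind is obtained by plotting the ordered distances against quantiles of a reference distribution. The sketch below uses, purely as an illustrative choice, the square roots of chi-squared quantiles with v = 3 degrees of freedom; this is an assumption, not necessarily the reference distribution behind Figure 7.61, and the vector d of Mahalanobis distances at m = 30 is taken as given.

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def qq_plot_mahalanobis(d, df=3):
    """Plot ordered Mahalanobis distances against chi-squared reference quantiles."""
    n = len(d)
    probs = (np.arange(1, n + 1) - 0.5) / n          # plotting positions
    quantiles = np.sqrt(stats.chi2.ppf(probs, df))   # reference quantiles on the distance scale
    plt.plot(quantiles, np.sort(d), 'o')
    plt.xlabel('Quantiles')
    plt.ylabel('Mahalanobis distances')
    plt.show()
```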
We repeat the forward search with a starting point consisting of the units
giving the 15 smallest Mahalanobis distances at m = 45 in Figure 7.60.
FIGURE 7.59. Diabetes data: scatterplot matrix. There seem to be three clusters

The resulting forward plot of distances in Figure 7.62 is not much different
from that we have seen earlier, but the gap between the first and second
groups around m = 30 seems a little clearer. The QQ plot of Mahalanobis
distances at this point in the right panel of Figure 7.61 is similar to that
in the left panel, but shows slightly more division into three groups; there
are three approximately linear regions of increasing slope. The plot of the
variance explained by the first two principal components in the left panel
of Figure 7.63 also suggests three groups, with an early peak in the curve
indicating some further structure. As we have noted before, the changes
in this plot occur later than the introduction of the first unit from a new
group. The first two principal components seem to explain virtually all the
variability in the data.
Other plots also suggest three groups without clear boundaries. The for-
ward plot of minimum Mahalanobis distances among the units not in the
subset is in the left panel of Figure 7.64. This starts approximately hor-
izontally, then increases around m = 70 to a larger slope.

FIGURE 7.60. Diabetes data: forward plot of Mahalanobis distances starting with m0 = 23

FIGURE 7.61. Diabetes data: QQ plots of Mahalanobis distances at m = 30 from two starting points. Left panel, m0 = 23; right panel, m0 = 15

FIGURE 7.62. Diabetes data: forward plot of Mahalanobis distances starting with m0 = 15 chosen at m = 45

FIGURE 7.63. Diabetes data: left panel, percentage of variability explained by the first two principal components; right panel, the eccentricity of the ellipse from the first two eigenvalues

From about

m = 110 the distances are all large but oscillate appreciably. This is not
a pattern we have seen before. For the 60:80 data the comparable plot in
Figure 7.8 was initially horizontal: there was a sharp peak as the second
cluster entered, followed by a second, roughly horizontal, period before the
final outlier.

FIGURE 7.64. Diabetes data, forward plots: left panel, minimum Mahalanobis distance among units not in the basic subset; right panel, gap plot

Neither the plot for the data with three groups and two out-
liers, Figure 7.12, nor that for the data with a bridge, Figure 7.26, show
a monotonic trend in the latter part of the plot. However, both the plots
for the financial data, Figure 7.45 and Figure 7.52, do show an increase in
the second half of the search as units from the next group are included.
Together with our earlier discussion of the y1, y2 panel of the scatterplot
matrix of Figure 7.59, this suggests that the initial group is being extended
by the inclusion of units from a second group. The last 30 observations look
different again.
The gap plot in the right-hand panel of Figure 7.64 shows that the search
proceeds steadily up to around m = 110, without interchanges. Although
evidence from the minimum Mahalanobis distances suggests that a second
cluster is being included, the absence of interchanges indicates that the
change is gradual. The forward plot of the eccentricity of the ellipse from
the first two eigenvalues in the right panel of Figure 7.63 shows both the
change in the shape of the ellipsoid at the beginning of this process and,
very clearly, the effect of the introduction of the third group.
A nine panel plot of individual Mahalanobis distances is given in Fig-
ure 7.65 from a cut at m = 45. As is usually the case, there is a progression
from small distances to large, here without any sharp breaks, although the
shape of the distances does change steadily. There are 68 distances in the
first two panels and 28 in the last row. The plot of distances after rescal-
ing in Figure 7.66 shows that the first 68 distances have in common that
they are horizontal towards the end. The 47 distances in the next four
panels have a common decreasing shape, which is shared by, perhaps, four
curves in the last row. The curves for the remaining 24 observations in the
bottom row seem to share a horizontal shape after the initial peak before
they decline.

FIGURE 7.65. Diabetes data: forward plots of Mahalanobis distances. The nine panels of distances come from a cut at m = 45

The grouped nature of these last observations is shown very
clearly in the plot of changes in scaled Mahalanobis distances, Figure 7.67,
ordered by first appearance in the subset. The figure also shows the other
two groups, although the transition between the two is not clear: it might
be anywhere between 55 and 75 on the scale of ordered units.
Although none of these pieces of information yields, on its own, an un-
ambiguous division of the data into clusters, it does allow us to proceed
to a first division into three groups, which can then be refined using our
confirmatory techniques. As the tightest cluster we take the 70 observa-
tions with smallest Mahalanobis distances at the cut when m = 45 for the
forward search starting from m0 = 15. The third group is the last 30 ob-
servations to enter this search. It seems clear from the patterns at the top
of Figure 7.67 that this group will contain at least one outlier. The second
group is the complement of these two non-intersecting groups, a group of
45 possibly heterogeneous units.
We close our preliminary analysis by returning to the scatterplot matrix
of the data. Figure 7.68 shows our first group of 70 observations. These
form a plausible cluster, roughly ellipsoidal in outline, with symmetrical
marginal distributions. However there is some detailed structure which in-
dicates non-normality. The first is that the values of y1 have been rounded
to the nearest integer. The second is that there are some small groups of
observations close together, an effect which is increased by the rounding,
as well as some gaps within the data.
FIGURE 7.66. Diabetes data: forward plots of Mahalanobis distances. Nine panels of distances from a cut at m = 45. Here the distances in Figure 7.65 have been rescaled to fill the panels

FIGURE 7.67. Diabetes data: changes in forward plot of scaled Mahalanobis distances starting with m0 = 15. Units ordered by first entry to the subset

FIGURE 7.68. Diabetes data: scatterplot matrix of tentative first cluster of 70 observations from a cut at m = 45

With m0 = 15 the starting subset is
sufficiently small that the search responds slightly to this local irregular-
ity, producing, for example, the initial peak in the plot of the percentage
of variance explained in the left panel of Figure 7.63. Overall, the 70 ob-
servations are more uniformly than normally distributed over the marginal
ellipses: the density of the observations seems not to increase sufficiently to-
wards the centre. This feature explains the absence of very small distances
in forward plots of Mahalanobis distances, such as Figure 7.62.
The scatterplot of the second tentative cluster in Figure 7.69 shows that
these observations also form a loose group, again without a noticeably nor-
mal structure. In Figure 7.70 we finally give the scatterplot matrix when the
observations are divided into three clusters. The scatterplot for y1 against
y2 shows that a sensible division has been obtained, although it is not
always clear whether units belong to one cluster or to the adjacent one.
However, the other panels which include y3 show that one observation in
the third cluster seems to be wrongly clustered. We have already noted
that this cluster will contain at least one outlier.
FIGURE 7.69. Diabetes data: scatterplot matrix of tentative second cluster of 45 observations

7.7.2 Exploratory Analysis


We start our exploratory analysis by once more looking at forward searches
based on our preliminary three clusters of 70, 45 and 30 observations. Our
initial purpose is to remove from these groups any observations which seem
at all different. We can then check their group membership from a search
in which three ellipsoids are fitted.
Figure 7.71 shows the scaled Mahalanobis distances for Cluster 1 for
the search in which that cluster is included first. Overall these distances
are roughly horizontal until all the cluster has been fitted. Thereafter they
increase towards the end of the search. An exception to this is unit 79,
which, with units 83 and 72, has the smallest distance at the end of the
search. Observation 79 is also clearly shown to be different in Figure 7.72,
a nine-panel plot of the distances for all clusters. The third panel in the
first row shows two further observations, 72 and 83, which behave rather
like observation 79. All three units are highlighted in Figure 7.71.
FIGURE 7.70. Diabetes data: scatterplot matrix of all observations in three tentative clusters

Clusters 2 and 3 have curves which are mostly very different from those for
Cluster 1.
Next we look at the curves for individual units in tentative Cluster 1.
Figure 7.73 shows the curves starting with m = 50, since we want to
monitor the curve for unit 79, which enters at m = 54. The central panel
of the figure shows that the trajectory for this unit is different from that
for the cluster not only at the end, but also in the centre of the search, as
the distance decreases from being large to small. Other units which seem
a little different are 10, entering when m = 56, and 43 when m = 58.
The continuation of this analysis in Figure 7.74 suggests several more units
which behave differently. In order of entry these are 83, 2, 81, 3, 42, 33
and 105. Only two units in this figure seem typical of the cluster. The
last three units to enter are 40, 59 and 60, which may also be categorized
as behaving atypically although, of course, the reason that they enter the
cluster later in the search is that they have larger Mahalanobis distances
than the observations which entered earlier.
FIGURE 7.71. Diabetes data: forward plot of scaled Mahalanobis distances in the first tentative cluster, starting in that cluster. Units 79, 83 and 72 have the smallest distances at the end of the search

FIGURE 7.72. Diabetes data: forward plots of scaled Mahalanobis distances for the three clusters from a search starting in the first cluster

FIGURE 7.73. Diabetes data: nine panel plot of forward scaled Mahalanobis
distances for specific units, starting from m = 50

The second cluster is rather heterogeneous, being formed from units we
did not assign to the first or third clusters in the preliminary analysis. De-
spite this, the forward plot of scaled Mahalanobis distances in Figure 7.75
does show a common structure for all units. Initially there is much activity
as the units in Cluster 2 are introduced. Then Cluster 1 enters, providing a
period of stable growth in nearly all distances. At the end, the units from
the third cluster enter and the behaviour is once more less homogeneous. In
particular, four units have profiles which decrease around m = 125, before
finishing with the smallest distances. These are, reading upwards at the
end of the search, units 103, 88, 134 and 96.
The five units highlighted in Figure 7.75 have distances which increase
appreciably as other units from Cluster 2 enter the subset. Such units are
becoming increasingly far from the cluster centre as the subset grows. Study
of this part of the curve also suggests a number of further observations
whose classification is not certain. The nine-panel plot of the distances,
separated at m = 40, is in Figure 7.76. The first panel of the second row
shows that unit 70 has a trajectory which decreases in the centre of the
search and so is closer to the shape associated with Cluster 1. Another tool
for investigating the homogeneity of this cluster is the plot of increases and
decreases in distances. The plot for a search starting in Cluster 2 is shown in
Figure 7.77.
FIGURE 7.74. Diabetes data: further nine panel plot of forward scaled Mahalanobis distances for specific units, starting from m = 59 with reference distribution for m = 50

FIGURE 7.75. Diabetes data: forward plot of scaled Mahalanobis distances for the 45 units of Cluster 2, starting from a search in Cluster 2

FIGURE 7.76. Diabetes data: forward plots of scaled Mahalanobis distances for the three clusters from a search starting in the second cluster; distances separated at m = 40

Three units in the centre of the plot for Cluster 2, namely 50, 25
and 38, show a white patch for m > 45, corresponding to decreasing values
of the distances. The remaining units show either increases in distance, or a
shorter decrease. The distances for these three units are highlighted against
the other distances for the second cluster in Figure 7.78. The comparison
shows that these curves are not typical of Cluster 2.
The exploratory search starting from Cluster 3 is simpler. Figure 7.79
shows the forward plot of scaled Mahalanobis distances. Observation 86 is
a clear outlier and observation 124 has a trace rather different from the
others, the distance decreasing steadily as m increases until almost the end
of the search. The nine-panel plot in Figure 7.80 shows observation 86 in
the first panel of the third row and observation 124 in the second. These are
indeed the two observations most unlike the rest of the cluster. The plot also
shows, in the last panel of the first row, the three observations 72, 79 and
83 which showed as outlying in the comparable plot for Cluster 1, namely
Figure 7.72. It is interesting to compare the general shape of the curves for
the various clusters in these two figures.
FIGURE 7.77. Diabetes data: changes in forward plot of scaled Mahalanobis distances from a search starting with the second cluster. Units ordered by final entry to the subset

FIGURE 7.78. Diabetes data: forward plot of scaled Mahalanobis distances for the 45 units of Cluster 2, starting from a search in Cluster 2 as in Figure 7.75. Units 25, 38 and 50 may be atypical for the cluster

FIGURE 7.79. Diabetes data: forward plot of scaled Mahalanobis distances for the 30 units of Cluster 3, starting from a search in Cluster 3

FIGURE 7.80. Diabetes data: forward plots of scaled Mahalanobis distances for the three clusters from a search starting in the third cluster

For example, in the search starting
with Cluster 1, the distances in Figure 7.72 for Cluster 1 are initially low,
but then increase as the centre of the fitted ellipsoid moves away from the
cluster. In Figure 7.80, on the other hand, the distances for Cluster 1 are
initially large, fluctuate together as the centre for Cluster 3 changes and
then decrease as the centre moves towards fitting all observations and so
towards Cluster 1.

7.7.3 Confirmatory Analysis


As a result of these analyses from single clusters we have detected 29 units
with uncertain allocation. In addition to these unassigned units we have 56
observations in Cluster 1, 32 in Cluster 2 and 28 in Cluster 3. We now run a
forward search fitting three ellipsoids. Since the variances of the clusters are
not equal, we use standardized distances for this search, which was run with
the parameter a = 1, so that all seemingly certainly allocated observations
were fitted before any of the unassigned observations were allocated.
The resulting allocations are in Figure 7.81 for the search with m0 = 103.
The lower set of units contains those we had believed to be firmly classified,
but which change classification at least once during the search. We also
include in this part of the display any units, such as 64 and 75, for which
our preliminary classification differs from the classification of a panel of
doctors. They put units 1 - 76 in Cluster 1, 77 - 112 in Cluster 2 and 113
- 145 in Cluster 3. We do not use this information during the search.
The upper set of units in the figure are the 29 for which we were unable
to decide a classification. We start by considering units in this set which
do not change classification. These can then be used in a second search
to augment the clusters of which we are certain. Units 2, 3, 10, 40, 42, 43,
60, 69, 70, 72, 79 and 96 are in Cluster 1 throughout. We add all of them
to Cluster 1 except observation 79, which we have already noticed in our
earlier analysis as probably belonging to Cluster 2 (see Figures 7.71, 7.72
and 7.73). Thus we "misclassify" observation 96. Likewise we add units
86, 88, 103 and 131 to Cluster 2 and 124 to Cluster 3. We have also then
"misclassified" unit 131.
With these additions to the clusters m0 now becomes 117. The allocations
in the forward search from this starting value are in Figure 7.82, with a
summary in Table 7.1. This table gives the doctors' allocation and then
the way in which our clustering differs from it. The "certain" line includes
all units which have the same allocation, with only occasional changes,
throughout the search shown in Figure 7.82 where this differs from the
doctors' allocation. The "uncertain" line contains all units for which the
allocation changed appreciably during the search, classified by their final
allocation. We do not include in this subset any of the units, such as 135 and
137, which our previous analysis had firmly allocated to a particular cluster
but which were then reallocated in the final search.
FIGURE 7.81. Diabetes data: cluster membership during a confirmatory search with three clusters using standardized distances starting with m0 = 103

FIGURE 7.82. Diabetes data: cluster membership during the second confirmatory search with three clusters using standardized distances starting with m0 = 117
FIGURE 7.83. Diabetes data: scatterplot matrix of all observations in three clusters. The clustering of the five numbered units is uncertain

TABLE 7.1. Diabetes data: final clustering of units

Cluster      I         II                       III
Doctors      1:76      77:112                   113:145
Certain      96        64, 75, 115, 131, 136
Uncertain    79, 83    81, 105                  134

Indeed, we prefer to
monitor these units separately and keep the established classification, which
originated from our conservative exploratory analysis on single clusters.
A general feature of Table 7.1 is that Cluster 2 appears to have "stolen"
several units from the other clusters and indeed we would expect that, using
standardized distances, units would be attracted to a less dispersed cluster.
But it is instructive to compare our clustering with that of the doctors.
FIGURE 7.84. Diabetes data: scatterplot of y2 and y3 as in Figure 7.83. The numbered units were unambiguously clustered, but changed during the search represented in Figure 7.82

Figure 7.83 is a scatterplot matrix of the allocation from the forward
search, with the five uncertain units represented by their actual number.
In addition, Figure 7.84 reproduces the y2, y3 panel of Figure 7.83, showing
the units which our analysis had put unambiguously in clusters but which
changed allocation in the final search of Figure 7.82. We see that these
units are indeed on the boundaries between clusters.
We finish with a graphical comparison of our clustering with that of the
doctors, the scatterplot matrix for which is shown in Figure 7.85. This
shows that our allocation provides the more coherent clustering. For exam-
ple, units 64, 75 and 131, evident in the y1 - y3 panel, are more naturally
included in Cluster 2 than in Clusters 1 and 3. Figure 7.86 represents the
differences between the two cluster analyses: the six units in Table 7.1 where
we disagree with the doctors are shown with the symbols for the doctors'
classification; the five units with uncertain classification are represented by
squares. As we saw in Figure 7.83, the uncertainly classified units lie on
the boundaries between clusters.

7.8 Discussion
Our robust classification methods are notably different from those that ap-
pear in the literature and that are currently used in applications. Most
traditional methods of cluster analysis allow automatic classification of the
available data.

FIGURE 7.85. Diabetes data: scatterplot matrix of all observations with the doctors' clustering

This means that, after the user has selected a specific dis-
tance definition and what algorithm to use, the procedure ends with a sharp

classification of individuals into non-overlapping groups. However, auto-
matic classification implies that meaningless groups are obtained when a
clustering algorithm is applied to structureless data. In other examples the
structure may not be detected because the clusters are not well separated,
or because the chosen algorithm is not the most powerful tool for identi-
fying them. In contrast, our approach emphasizes the diagnostic power of
monitoring distances, as well as other quantities, along the forward search.
Our view of cluster analysis thus focuses on robust diagnostic classification,
which relates the group membership of each observation to its behaviour
during the search.
In this section we sketch the main ideas of the most widely adopted
procedures of cluster analysis, namely agglomerative hierarchical clustering
and partitioning methods, and we see how poorly these procedures perform
with some of our data sets. We also give a brief overview of the more recent
techniques of model based clustering. The section ends with some references
to the vast literature on cluster analysis.
FIGURE 7.86. Diabetes data: scatterplot matrix of all observations. Units which we classify differently from the doctors are highlighted with the doctors' clustering symbol. Units uncertainly clustered in Table 7.1 are represented with squares

7.8.1 Agglomerative Hierarchical Clustering


Given n multivariate observations, agglomerative procedures of cluster anal-
ysis produce a hierarchy of n partitions of these units. The first partition
is the trivial one where each cluster is made up of a single unit. Then, the
two closest observations (according to the specified distance measure) are
fused into a single group, so that there are n - 2 clusters containing only
one object and one cluster of two units. The subsequent step is to join the
closest of the n - 1 clusters to obtain a reduced partition of n - 2 groups.
The procedure is repeated until, at the last step, the two clusters remain-
ing are fused into a single one, to form the complementary trivial partition
made up of only one group containing all units.
Different agglomerative procedures can be obtained following different
definitions of distance between clusters. Let $d_{i_1 i_2}$ denote the distance
between observations $i_1$ and $i_2$ and let $d_{C_1 C_2}$ be the distance between
clusters $C_1$ and $C_2$. The single linkage method defines

$$d_{C_1 C_2} = \min\{d_{i_1 i_2} : i_1 \in C_1, i_2 \in C_2\},$$

that is the distance between two clusters is taken to be the smallest distance
between a member of the first group and a member of the second one. In
the complete linkage procedure

$$d_{C_1 C_2} = \max\{d_{i_1 i_2} : i_1 \in C_1, i_2 \in C_2\},$$

while in the average linkage algorithm

$$d_{C_1 C_2} = \sum_{i_1=1}^{n_1} \sum_{i_2=1}^{n_2} d_{i_1 i_2} / (n_1 n_2),$$

where $n_1$ and $n_2$ are the number of units belonging to $C_1$ and to $C_2$,
respectively. Another popular agglomerative criterion is the Ward method,
which joins the two clusters that minimize the increase in the within-cluster
sum of squares of the distances from the respective cluster means.
The most common choice for $d_{i_1 i_2}$ with metric data is the Euclidean
distance, although the more robust $L_1$ (or Manhattan) distance

$$d_{i_1 i_2} = \sum_{j=1}^{v} |x_{i_1 j} - x_{i_2 j}|$$

is currently implemented in many software packages. The Mahalanobis dis-
tance between two clusters cannot be defined, unless they have the same
covariance matrix.
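These linkage rules are available in most statistical software. The fragment below, a sketch using the Python library scipy rather than the software used elsewhere in this book, builds the single, complete, average and Ward hierarchies from Euclidean distances for two artificial groups and then cuts each dendrogram into two clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(5, 1, (30, 2))])  # two artificial groups

d = pdist(X, metric='euclidean')                      # pairwise distances d_{i1 i2}
for method in ('single', 'complete', 'average', 'ward'):
    Z = linkage(d, method=method)                     # the full hierarchy of n - 1 fusions
    labels = fcluster(Z, t=2, criterion='maxclust')   # "cut" the dendrogram into two groups
    print(method, np.bincount(labels)[1:])            # sizes of the resulting clusters
```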
The results of agglomerative clustering can be shown through a den-
drogram, a rooted tree which illustrates the joins made at each successive
stage of the hierarchy. The nodes of this tree correspond to clusters, from
individual observations to the whole dataset. The height of each node is
proportional to the distance value at which the corresponding fusion takes
place. In practice, however, users of cluster analysis are often interested only
in those steps that are presumably most helpful for detecting the underly-
ing structure of the data. The dendrogram is then "cut" at a certain step,
for instance when a large difference between fusion distances is observed,
and the corresponding partition is taken as the cluster analysis solution.
Alternative agglomerative methods have different properties. For instance,
single linkage shows a tendency for early clustering of observations linked
by a series of intermediates, a property often called the chain effect. On
the contrary, the complete linkage and the Ward methods tend to be bi-
ased towards the detection of spherical clusters. A common feature of all
agglomerative algorithms is that the joins made at each step cannot be
split at subsequent steps. Therefore, there is no guarantee of optimality of
the whole set of partitions, although each step is performed according to a
minimum distance criterion. Another disadvantage of hierarchical methods
is that the resulting dendrogram becomes difficult to read as n increases.

7.8.2 Partitioning Methods


Another stream of cluster analysis techniques, often called partitioning
methods, takes a different approach to cluster analysis and looks for the
best classification, given some optimality criterion. Optimality is achieved
(at least approximately) at the expense of fixing the number of groups,
say g, in advance. Although these methods can be applied to the same
data iteratively for different values of g, cluster membership in the g-group
classification has no influence on that in the (g + 1)-group classification.
Therefore, "bad" allocation of an observation in one solution does not af-
fect its allocation in subsequent partitions, as is the case in hierarchical
algorithms.
Unfortunately, the optimal partition of n observations into g groups can-
not be obtained by exhaustive enumeration, except in trivial examples. The
number of distinct partitions of n individuals into g groups is

$$\frac{1}{g!} \sum_{i=0}^{g} (-1)^{g-i} \binom{g}{i} i^n,$$

which gets amazingly large even for moderate values of n and g. For in-
stance, even in a small size problem where n = 100 and g = 5 this number
is as large as $10^{68}$ (Everitt, Landau, and Leese 2001). In practice most par-
titioning methods operate through iterative algorithms which approximate
the global objective function. These iterative algorithms are computation-
ally fast and can be applied effectively to large databases.
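The size of this number is easily checked directly. The short Python computation below evaluates the sum given above with exact integer arithmetic for the example with n = 100 and g = 5.

```python
from math import comb, factorial

def n_partitions(n, g):
    """Number of distinct partitions of n units into g non-empty groups
    (the Stirling number of the second kind)."""
    return sum((-1) ** (g - i) * comb(g, i) * i ** n for i in range(g + 1)) // factorial(g)

print(n_partitions(100, 5))   # roughly 6.6 x 10^67, i.e. of the order 10^68
```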
The most widely adopted partitioning technique is the k-means algorithm
(we should say 'g-means' in our notation). Although a number of variants
of this method are currently available, its basic version works through the
following steps:

1. Given g initial duster centres (called seeds), the algorithm starts with
a t entative g-group dassification by assigning each observation to the
dosest seed;

2. Given the current classification, cluster means are computed. Cluster means are also called centroids. Then, the following steps are iterated for i = 1, ..., n:

(a) Compute the distance between the ith observation and each centroid;

(b) If the closest centroid is not that of the cluster to which the ith observation currently belongs, reassign the observation to the nearest cluster;

(c) If cluster membership has changed at Step 2b, recompute the cluster means.

3. Step 2 is repeated until convergence, i.e. until there is no change in cluster centroids, or until a fixed number of iterations is reached.

Distance is usually measured through the Euclidean norm, which ensures convergence (Anderberg 1973, p. 166). The k-means algorithm then seeks to obtain a partition for which the average Euclidean distance between each observation and the corresponding centroid is minimum. A robust variant, which is currently implemented in S-Plus (Struyf, Hubert, and Rousseeuw 1997), is called Partitioning Around Medoids (PAM). It considers cluster medians, called medoids, instead of means, and it allows the computation of the L1 distance between each unit and the medoids.
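To make the two algorithms concrete, here is a minimal R sketch, not taken from the book's software: it simulates a small two-group data set loosely in the spirit of the examples below, standardizes the variables and then runs both k-means and PAM, the latter via the cluster package, which descends from the S-Plus functions of Struyf, Hubert, and Rousseeuw (1997).

library(cluster)                                 # pam() and clusplot()

set.seed(1)
X <- rbind(matrix(rnorm(160, mean = 0, sd = 2), ncol = 2),    # a dispersed group (80 units)
           matrix(rnorm(120, mean = 5, sd = 0.3), ncol = 2))  # a tight group (60 units)
Xs <- scale(X)                                   # standardize the variables first

km <- kmeans(Xs, centers = 2, nstart = 20)       # k-means with g = 2
pm <- pam(Xs, k = 2, metric = "manhattan")       # PAM with the L1 distance

table(kmeans = km$cluster, pam = pm$clustering)  # compare the two partitions
clusplot(pm, main = "PAM solution, g = 2")       # display on the first two principal components

The standardization step in this sketch is deliberate: as noted next, both algorithms can give different answers on raw and standardized variables.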
Both standard k-means and PAM show the unappealing feature of yielding different results according to whether the variables are standardized or not prior to analysis, a feature we also noted in principal components analysis in §5.2.2. This undesirable effect could be overcome if any of the generalized Mahalanobis metrics (7.1) were adopted at Step 2a of the k-means algorithm, a solution that has not received much attention in the classification literature.
Although it relies on optimizing a minimum distance criterion, the k-means method suffers from some theoretical and practical shortcomings. First, it is often unclear what is the best value of g, given a number of tentative choices. A popular strategy for judging the effectiveness of a partition is to compute an ANOVA table similar to that for multivariate regression. However, formal inference on sums of squares is difficult in the context of k-means clustering, so the final decision is often left to personal judgment. Another drawback is that the algorithm can reach a local minimum of the objective function, so making the results sensitive to the choice of the initial cluster centres. Usual solutions to this problem include selecting the seeds as widely separated as possible in Euclidean space, or resorting to a preliminary hierarchical cluster analysis. Even more subtly, there are several versions of the k-means algorithm available in commercial software, which may yield (slightly) different answers when applied to the same data.
Finally, we note that updating the cluster means at Step 2c of the algorithm can make the results sensitive to the order in which the units are listed (Anderberg 1973, p. 163). Since the effect of the list sequence on the overall clustering solution is usually limited, this is not seen as a major drawback for the purpose of detecting broad clusters of units, such as consumer segments in market analysis. However, it is clearly a very disappointing feature when the actual classification of each individual observation is of primary concern (Cerioli 1999).

7.8.3 Some Examples from Traditional Cluster Analysis


We start by seeing how popular cluster analysis tools perform with the 60:80 data set analyzed in §7.3. As we noted when introducing this simple

FIGURE 7.87. The 60:80 data: dendrogram from the average linkage algorithm
using Euclidean distances

example, the pattern of the data is obvious and we expect that any sensible clustering technique should be able to recover it. In Figure 7.2 we showed that this is not the case if classification is performed through robust Mahalanobis distances. We now see that the performance of traditional methods of cluster analysis may be slightly better, but is still largely unsatisfactory.
Figure 7.87 gives a dendrogram of the 60:80 data, obtained by hierarchical classification through the average linkage algorithm and Euclidean distances on standardized variables. The partition with g = 2 (the true number of groups) is clearly useless for classification purposes, since it separates the natural outlier (unit 57) from the rest of the data. Even if we put aside this atypical observation, the picture seems to suggest a partition with four clusters and an additional outlier (unit 27). Hence, hierarchical cluster analysis would split the dispersed group into three different subgroups plus some noise.
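A dendrogram of this kind is obtained in R with hclust(); the sketch below is a minimal illustration in which X60_80 is only a stand-in for the 60:80 data, which are not listed here.

# Agglomerative hierarchical clustering with average linkage.
set.seed(2)
X60_80 <- rbind(matrix(rnorm(120, 0, 2), ncol = 2),      # placeholder for the dispersed group
                matrix(rnorm(160, 5, 0.3), ncol = 2))    # placeholder for the tight group

d  <- dist(scale(X60_80), method = "euclidean")
hc <- hclust(d, method = "average")              # "single" or "complete" also possible

plot(hc, labels = FALSE, hang = -1)              # the dendrogram
groups <- cutree(hc, k = 2)                      # "cut" the tree into g = 2 clusters
table(groups)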
On the partitioning side, Figure 7.88 shows the 2-group clustering of the 60:80 data obtained through the robust PAM algorithm and the L1 distance on standardized variables. In this picture, called a clusplot (Pison, Struyf, and Rousseeuw 1999), each observation is represented by a point in two-dimensional space, using principal components. Of course, in this bivariate example the first two components explain all of the variability. Around each cluster an ellipse is drawn. Even in the ideal situation where g is known in advance (an overly optimistic assumption in most applications), the algorithm wrongly allocates three observations from the large group (units 36, 43, 72). The k-means method provides very similar results, with
FIGURE 7.88. The standardized 60:80 data: clusplot from the PAM algorithm with g = 2 and the L1 metric

two units misclassified. Furthermore, variable scaling is not neutral with respect to the computed classification, and the misclassification rate increases remarkably if the raw rather than the standardized 60:80 data are analyzed (see Exercise 7.2).
As we might expect, the performance of traditional methods of cluster analysis is even worse with more complex data structures, such as the bridge in the data of §7.5 or the "arms" in the diabetes data of §7.7. For instance, Figure 7.89 is a clusplot for the bridge data, where different symbols are used to distinguish the clusters. This plot has been obtained through the PAM algorithm, setting g = 3 (the true number of clusters) and adopting the L1 distance. It is apparent that the algorithm is not able to separate the observations in the bridge from those in the tight clusters, thus forcing an inappropriate partitioning of the dispersed group. Using the Euclidean distance or resorting to the k-means algorithm does not noticeably improve the classification performance, simply yielding different splits of the large group.

7.8.4 Model-Based Clustering


The usefulness of traditional (hierarchical and partitioning) methods of cluster analysis is often justified on exploratory grounds. They do not require specification of an explicit statistical model, which may be inappropriate for data sets whose structure is unknown prior to analysis. However, these methods are not neutral with respect to the data structure they
FIGURE 7.89. Standardized bridge data: clusplot from the PAM algorithm with g = 3 and the L1 metric

highlight. For instance, the decisions made about variable scaling, or about what distance and what algorithm should be used, can lead to remarkably different results.
Model-based clustering methods take a different route and make an explicit link between cluster analysis and formal statistical models. Specifically, it is assumed that observations y_i, i = 1, ..., n, are a sample from a mixture of g multivariate distributions, each with mean μ_l and covariance matrix Σ_l, l = 1, ..., g. This is the formulation we used in Chapter 6 for discriminant analysis, but, to repeat, here neither g nor the cluster labels are known. Let f_l(y_i; μ_l, Σ_l) be the density of observation y_i from the lth distribution. In most applications f_l(y_i; μ_l, Σ_l) is taken to be the density of a multivariate normal distribution, so from (2.7)

f_l(y_i; \mu_l, \Sigma_l) = (2\pi)^{-v/2} |\Sigma_l|^{-1/2} \exp\left\{ -\frac{1}{2}(y_i - \mu_l)^T \Sigma_l^{-1} (y_i - \mu_l) \right\}.   (7.3)
Let θ = (θ_1, ..., θ_n)^T be the n-dimensional vector containing the cluster label of each observation. That is, θ_i = l if y_i belongs to the lth population (i = 1, ..., n; l = 1, ..., g). Similarly, let π = (π_1, ..., π_g)^T be the g-dimensional vector of mixture weights. That is, π_l ≥ 0 is the probability that an observation comes from the lth population (\sum_{l=1}^{g} π_l = 1). There are essentially two alternative approaches to model the clusters, according to whether the focus is on θ or on π.

The classification likelihood approach tries to maximize the likelihood

\mathrm{Lik}_{\mathrm{Cl}}(\mu_1, \ldots, \mu_g, \Sigma_1, \ldots, \Sigma_g, \theta; y) = \prod_{i=1}^{n} f_{\theta_i}(y_i; \mu_{\theta_i}, \Sigma_{\theta_i})   (7.4)

with respect to the unknown parameters. This method emphasizes estimation of the classification parameter θ, yielding an estimated cluster membership for each observation. On the contrary, the mixture likelihood approach maximizes

\mathrm{Lik}_{\mathrm{Mix}}(\mu_1, \ldots, \mu_g, \Sigma_1, \ldots, \Sigma_g, \pi; y) = \prod_{i=1}^{n} \sum_{l=1}^{g} \pi_l f_l(y_i; \mu_l, \Sigma_l),   (7.5)

leading to an estimated probability of cluster membership for each observation.
The normality assumption (7.3) implies that for each cluster the surfaces of constant density are ellipsoidal. Shape, volume and orientation of the clusters depend on the covariance matrices Σ_1, ..., Σ_g. The least demanding instance is when these matrices are unconstrained, a situation equivalent to computing group-specific Mahalanobis distances (see §7.2.2). However, this very general implementation of model-based clustering has rarely been used in applications, due to its lack of parsimony. The situation is analogous to the choice between linear and quadratic discriminant analysis, where we showed in §6.7 how the large number of parameters in quadratic discrimination may lead to poor performance. Some more restrictive definitions include Σ_l = Σ and Σ_l = D, l = 1, ..., g, for D a diagonal matrix. In the latter all clusters are spherical, although of different sizes, the assumption implicit in the use of the Euclidean distance. A more general parametrization of each Σ_l can be through its singular value decomposition (Banfield and Raftery 1993), which is used to represent the geometry of the lth group in multidimensional space.
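Model-based clustering along these lines is implemented, for instance, in the mclust software of Fraley and Raftery; the following R sketch is indicative only (the iris data are used merely to make it runnable) and shows how both a hard classification and mixture membership probabilities are returned.

library(mclust)                     # normal mixture modelling (Fraley and Raftery)

X   <- iris[, 1:4]                  # any numeric data matrix would do here
fit <- Mclust(X, G = 1:5)           # fit mixtures with g = 1,...,5 components

summary(fit)                        # selected number of groups and covariance parametrization
head(fit$classification)            # hard cluster labels (classification-type output)
head(round(fit$z, 3))               # estimated membership probabilities (mixture approach)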

7.8.5 Further Reading
Anderberg (1973) and Hartigan (1975) are two classical references of the early literature on cluster analysis, from which the basic algorithms originate. Kaufman and Rousseeuw (1990), Gordon (1999) and Everitt, Landau, and Leese (2001) provide more recent accounts, where a number of emerging topics are also considered. Among these we mention the important issue of cluster validation, that is assessment of the validity of classifications that have been obtained from the application of a clustering algorithm. The validation step can be performed to judge the effect of any choice made along the cluster analysis application. For instance, the same algorithm might be applied using different distance measures, or after removal or transformation of some variables. As a diagnostic tool, it can be applied to highlight

the effect of one or several observations on the cluster analysis solution, thus providing influence measures for classification (Jolliffe, Jones, and Morgan 1995, Cheng and Milligan 1996). However, these measures can suffer from masking and swamping precisely as any other backward method based on case deletion. It also seems desirable that the clustering of a set of units should remain unchanged after the removal of clusters from which they are absent. Kaufman and Rousseeuw (1990, p. 239) call this property cluster omission admissibility. Some robustness issues of partitioning methods (especially k-means) are examined in Cuesta-Albertos, Gordaliza, and Matrán (1997) and Garcia-Escudero and Gordaliza (1999).
Maronna and Jacovkis (1974) were among the first to exploit a version of the k-means algorithm based on the Mahalanobis distance. In the context of hierarchical cluster analysis, the use of Mahalanobis metrics is considered by Gnanadesikan, Harvey, and Kettenring (1993), at the cost of the assumption of a common covariance matrix for all groups. Alternative partitioning methods based on minimizing different functions of the within-groups matrix of residual sums of squares and products of the data are described in Friedman and Rubin (1967) and Marriott (1982).
Cluster analysis methods based on maximizing a likelihood function originate from the work of Scott and Symons (1971). Fraley and Raftery (2002) give an overview of recent advances in model-based clustering techniques and applications, with emphasis on the classification approach. Symons (1981), Banfield and Raftery (1993) and Fraley and Raftery (1998) provide applications to the diabetes data of §7.7. McLachlan and Peel (2000) is a book-length treatment of the mixture approach to statistical modelling, with component distributions possibly different from the normal one.
Finally, we mention that cluster analysis methods (especially k-means) are a key tool in what is nowadays called data mining, that is the "analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner". We refer to Hand, Mannila, and Smyth (2001) (from which the quoted definition of data mining is taken) and to Hastie, Tibshirani, and Friedman (2001) for a broad description of this research area.
FIGURE 7.90. Three Clusters, Two Outliers: ellipses when (left) m = 60 and (right) m = 61, starting from m_0 = 28. Filled symbols are used for units in the subset

7.9 Exercises
Exercise 7.1 Figure 7.90 shows the ellipses at m = 60 and m = 61 for the Three Clusters, Two Outliers example of §1.4, when the search is started from m_0 = 28. Figure 7.91 gives the ellipses at steps 149 ≤ m ≤ 152 for the same search. Comment on the pictures and compare them to the entry plot of Figure 7.10.

Exercise 7.2 Apply the PAM algorithm from S-Plus to the 60:80 data of §7.3, using the L1 distance. Set g = 2 and do not standardize the variables. Obtain the clusplot and comment on it. Which units are now incorrectly classified?

Exercise 7.3 Perform a very robust analysis of the bridge data of §7.5 by the MCD procedure from S-Plus.

Exercise 7.4 Highlight observations 118 and 168 in the scatterplot of the bridge data of §7.5. Explain why they are in the 'wrong' position, thus being incorrectly classified by the forward search.

Exercise 7.5 Apply hierarchical clustering to the bridge data of §7.5. Adopt average and single linkage methods based on the Euclidean distance after variable standardization.
FIGURE 7.91. Three Clusters, Two Outliers: ellipses when m = 149, 150, 151 and 152. Filled symbols are used for units in the subset

7.10 Solutions

Exercise 7.1
The tight ellipse in the left panel of Figure 7.90 surrounds all the units of the tight group, labelled from 81 to 140 in the entry plot, which form the subset when m = 60. When m = 61 the first observation from the dispersed group (unit 68, as we can see from the entry plot) enters the subset, thus changing the orientation of the ellipse and increasing its size. The rotation of the ellipse is responsible for the sudden change in the ordering of Mahalanobis distances apparent in Figure 7.11.
From Figure 7.91 we note that when m = 149 both outliers are in the subset, with one on the outer ellipse. When m = 150, the more distant outlier has left the subset. All other observations now have smaller Mahalanobis distances than the remaining outlier, which is on the ellipse. In particular, not all members of the third cluster of observations are in the subset. So, in the next step, the lower left-hand panel of Figure 7.91, there is an interchange and the ellipse changes orientation, reflecting the exclusion of the two outliers and the continuing inclusion of members of the third group. This interchange is also clear around m = 150 in the top two rows of the entry plot of Figure 7.10, corresponding to observations 159 and 160 (the two outliers).
FIGURE 7.92. The 60:80 data: clusplot from the PAM algorithm with g = 2 and the L1 metric. The variables are not standardized. Different symbols are used for the two clusters identified by the algorithm

Exercise 7.2
Figure 7.92 is the clusplot from S-Plus. Comparing it with Figure 7.88, we see that more units from the diffuse group are now misclassified. The resulting ellipse for the cluster containing the tight group has changed orientation and has increased its size, so this cluster is even more dispersed. Furthermore, the two ellipses now overlap. This is an indication that the detected groups are not well separated.
There are 9 observations from the diffuse group which are classified incorrectly. From the output of the S-Plus PAM procedure, we see that these are units 13, 22, 33, 36, 43, 45, 57, 71 and 72.

Exercise 7.3
Figure 7.93 is the output from the S-Plus MCD function. As for the 60:80 data set, the very robust procedure settles over different clusters, including all the observations from the tight group, the bridge and part of the diffuse group. The orientation and shape of the ellipse are very similar to that of Figure 7.2. Its centre lies approximately in the middle of the bridge, so the index plot of Figure 7.94 now also contains some very small distances.
Exercise 7.4
Figure 7.95 replots Figure 7.18, with different symbols for the three groups
FIGURE 7.93. Bridge data: ellipse from very robust fit nominally containing 97.5% of the data. The half of the data used for estimation contains observations from all clusters

FIGURE 7.94. Bridge data: index plot of robust Mahalanobis distances from the MCD fit. The horizontal line is the nominal 97.5% point of the distribution of distances
FIGURE 7.95. Bridge data: scatter plot with observations 118 and 168 labelled. Different symbols are used for the three groups and also for the two labelled observations

and with labels for observations 118 and 168. Both observations lie on the borders of the tight cluster, although in different directions. Unit 118, which actually comes from this cluster, is in the same direction as the bridge and is indeed very close to one unit from it. On the contrary, observation 168 is extreme with respect to the bridge to which it belongs and lies well on the incorrect side of the boundary line separating the two groups. The positioning of units 118 and 168 in the scatter plot thus explains why they exhibit forward Mahalanobis distances (see Figure 7.29) that are more akin to those of the cluster to which they do not belong.

Exercise 7.5
Figure 7.96 shows the dendrogram obtained through the average linkage algorithm. This picture is similar to the corresponding plot for the 60:80 data (Figure 7.87), with one unit (57) very far, in the Euclidean metric, from the bulk of the data; it suggests that the dispersed group should be split into a few clusters. There is now evidence of one or two additional clusters, more akin to the tight group of observations pictured on the left of the dendrogram, although less compact. In fact, these clusters include
FIGURE 7.96. Standardized bridge data: dendrogram from the average linkage algorithm, using the Euclidean distance

all the units from the bridge with the exception of observation 168, which is still interchanged with observation 118.
As a contrast, Figure 7.97 gives an example of the "chain effect" often exhibited by the single linkage method. In this algorithm clusters are joined at each stage if their boundaries are sufficiently close. This may result in sequences of individual aggregations, starting from the nearest objects and adding observations with increasing distance from the existing cluster centres. Two clusters merge as soon as one gains a single observation between them that is sufficiently close to the boundary of the other cluster. This behaviour is apparent in Figure 7.97. The agglomerative process starts by fusing together the units belonging to the tight group, to which the units in the bridge are then added. After one single observation from the dispersed group has joined this cluster, it merges with one relatively large cluster coming from the dispersed group. Then other units from the diffuse group join, including another relatively large cluster. It is also instructive to note that the two main clusters obtained from splitting the dispersed group (pictured one on the left and the other on the right of the dendrogram) do not fuse together, as in the average linkage method, before joining the large cluster coming from the tight group. This example shows the poor performance of the single linkage algorithm for the purpose of distinguishing poorly separated groups.
FIGURE 7.97. Standardized bridge data: dendrogram from the single linkage algorithm, using the Euclidean distance
8 Spatial Linear Models

8.1 Introduction
The main goal of spatial modelling is to provide a description of continu-
ous or categorical phenomena observed at locations (i.e. points or surfaces)
in space. By far the most common applications have been on the earth's
surface, where each location can be described by a two-dimensional vec-
tor of geographical coordinates. One example is the analysis of yield from
spatially contiguous plots in an experimental design, when the plots are
subject to different treatments the effects of which are to be estimated.
Another major example in environmental sciences is the study of pollution
data recorded at a number of monitoring stations within the same area.
In all applications considered in previous chapters of this book, as well
as in all analyses performed in the companion volume by Atkinson and
Riani (2000), we made the basic assumption that the data consisted of
independent observations from the same distribution, possibly after some
transformation and removal of outliers. However, the independence assump-
tion is usually not realistic when analysing spatially indexed variables. In
a spatial context it is often reasonable to expect that the response read-
ings on a sample element will be similar to the readings on other elements
close to it. Statistically speaking, this means that observations collected
at neighbouring sites are usually more similar than would be predicted by
an independent outcomes process. This phenomenon is known as (positive)
spatial dependence and its recognition has been a fundamental advance
towards a more realistic description of geo-referenced variables.

In this chapter we explore how the forward search can be extended to


cope with linear models for a continuous response which explicitly allow
for spatial dependence. In fact, analysis will be univariate, as we focus on a
single response. However, there is one basic difference from the univariate
regression models already described in Atkinson and Riani (2000), which
makes the forward search for spatial linear models akin to the multivariate
procedure described in this book. The fundamental issue is that spatial
observations are conceived as a single realization of a random mechanism
acting on the plane. Spatial dependence implies that the joint likelihood of
the whole sample cannot be factorized as the product of individual marginal
terms. The likelihood is then a multivariate distribution and the forward
search must recognize this fact. The multivariate normal distribution again
plays a major role, as in previous chapters of this book, although here
standardised residuals are employed in place of Mahalanobis distances. An
additional difficulty comes from the way in which sample values are iden-
tified. On the plane a two-dimensional vector index is attached to each
observation, so that spatial data do not share the familiar time-ordering
structure of longitudinal studies and time series.
We consider two popular approaches to modeHing spatially dependent
variables. The first one, named kriging, is a geostatistical model for predic-
tion of spatial variables observed over a continuous domain. In the second
instance we address spatial autoregressive models, which have been widely
applied to the study of regional data. Although both methods allow for
spatial dependence in the response variable, there is a crucial distinction
between them for the forward search. In kriging it is possible to compute a
robust measure of the spatial dependence structure based on all the data,
so the forward search for kriging can be run conditionally on it. In the
class of spatial autoregression models that we consider in this book, on the
other hand, such a fortunate coincidence does not happen. Therefore we
describe an alternative forward algorithm, called the block forward search,
where spatial dependence parameters are estimated jointly with regression
parameters at each step of the search. This novel algorithm yields robust
estimates of both parameter sets in spatial autoregression models.
The structure of the chapter is as follows. In §8.2 we give some back-
ground on kriging and introduce standard exploratory methods for spatial
prediction based on case-deletion diagnostics. We also show the effects of
masking and swamping on these methods. In §8.3 we explain how the for-
ward search can be extended to the kriging model and describe a number
of diagnostic plots useful for spatial prediction purposes. We also highlight
a few specific differences between the forward search for kriging and the
basic algorithm for regression models with independent errors (Atkinson
and Riani 2000, Chapter 2, and this book, Chapter 2). These differences
arise because kriging focuses on prediction rather than on estimation of
model parameters. The three succeeding sections contain analyses of data

sets for which the kriging model may be appropriate and serve the purpose
of describing the information potential of the forward search for kriging.
We move to spatial autoregression in §8.7. After giving some theoretical
background, we describe a few standard diagnostics for these models. Since
explanatory variables might be at hand, we focus also on leverage measures.
Again the effects of masking and swamping are paramount. In §8.8 we describe our block forward search algorithm for autoregressive models. In §§8.9 and 8.10 we apply the block forward search principle to several data sets, again showing the power of the forward search.
The chapter ends with some suggestions for further reading.

8.2 Background on Kriging


Kriging is a generic term that covers different least-squares methods of
spatial prediction. It is named after D.G. Krige, a South African mining engineer who contributed to the early development of techniques for the
prediction of ore grades in gold and other mineral deposits. Over the years,
the term kriging has often been used as a synonym for best linear unbiased
prediction of spatial variables observed over a continuous region.
There are several different versions of kriging. The one that we use in
this book is called ordinary kriging and assumes a constant, but unknown,
mean level throughout the study region. In practice there might be reasons
also for considering a nonstationary version of the model, named universal
kriging, where the mean level is allowed to vary from site to site. However,
we believe that for exploratory purposes the ordinary kriging model is
usually the most helpful choice, as it provides the basic benchmark to
which the data are to be contrasted, in order to detect spatial outliers,
trends and other relevant features of the data.

8.2.1 Ordinary Kriging


In order to introduce the ordinary kriging model, it is instructive to conceive
two-dimensional spatial data on a univariate response as a realization of
the stochastic process
{y(s): s E V}, (8.1)
where y(s) derrotes the response value at site s , V C ~ 2 is the study region
and y( s) has a correlation structure over space. In the present context s =
(s 1 , s 2)T corresponds to the two-dimensional (column) vector of coordinates
defining a spatial site within V.
The usual distinction between kriging and autoregressive models to be
described in §8.7 comes from the very nature of V. If it is reasonable to
think that s varies continuously through V, then the kriging approach
applies. This is the typical context of statistical applications in the earth

and natural sciences, where features such as soil properties or pollutant concentrations may be naturally defined at each point of the study region.
Our basic kriging model is written as

y(s) = \mu + \delta(s), \qquad s \in D,   (8.2)

where \mu is a fixed but unknown constant and the errors \{\delta(s) : s \in D\} follow a spatial stochastic process with mean 0. This representation assumes the absence of large-scale variation, that is of a spatial trend in the expected value of the response variable. However, spatial dependence of the errors implies that observations taken at different sites are correlated.
For inference from a finite realization of process (8.1), we have to impose some constraints on the spatial dependence structure of the response variable. In kriging and other geostatistical models it is customary to restrict attention to the second-order properties of y(s). The most familiar measure of dependence between pairs of random variables is their covariance

c(s, t) = \mathrm{cov}\{y(s), y(t)\} = E\{y(s)y(t)\} - \mu^2,

for the constant-mean model (8.2). In the present context, however, we represent spatial dependence in terms of the difference variance

\mathrm{var}\{y(s) - y(t)\} = 2v(s, t).

The notation 2v(s, t) is justified in view of the property

2v(s, t) = 2\sigma^2_y - 2c(s, t),

valid if \mathrm{var}\{y(s)\} = \mathrm{var}\{y(t)\} = \sigma^2_y. Hence, the factor 2 vanishes from both sides and we obtain

v(s, t) = \sigma^2_y - c(s, t).   (8.3)
We suppose that for all pairs s, t \in D

E\{y(s)\} = \mu   (8.4)

and

2v(s, t) = 2v(s - t) = 2v(h),   (8.5)

where h = s - t is the two-dimensional lag vector between s and t. Under assumptions (8.4) and (8.5), the process \{y(s) : s \in D\} is said to be intrinsically stationary. The function 2v(h) is called the variogram, and v(h) is the semivariogram. If in addition 2v(h) is a function only of the distance between s and t, that is 2v(h) = 2v(\|h\|), where \|\cdot\| is the Euclidean norm, then the variogram is said to be isotropic. The ordinary kriging model is the basic kriging model (8.2) with \{y(s) : s \in D\} intrinsically stationary.
A theoretical advantage of modelling spatial dependence through the variogram is that intrinsic stationarity is slightly more general than the more familiar assumption of second-order stationarity, under which

c(s, t) = c(s - t) = c(h).


FIGURE 8.1. Example of a simple network of spatial locations: the unknown value y(s_0) is to be predicted from observations y(s_1), ..., y(s_4)

Second-order stationarity implies intrinsic stationarity, while the converse is not necessarily true.
An additional advantage is that the sample variogram to be defined in equation (8.26) is an unbiased estimator of v(h) under the ordinary kriging model, whereas the corresponding estimator of c(h) is not. In fact,

2v(h) = E\{y(s) - y(t)\}^2 \qquad \text{for } s - t = h,   (8.6)

does not involve the unknown mean \mu. Hence the sample analogue of (8.6) is not affected by the finite-sample bias introduced by estimation of \mu in c(h). Furthermore, the potential bias originating from misspecification of a constant mean in the ordinary kriging model (8.2) is usually smaller for the estimator of 2v(h) than for the covariance estimator (Cressie 1993, pp. 70-73).
In practice, measurements on the response variable are collected at a network of n spatial locations, say S = (s_1, ..., s_n). The available sample is then the n × 1 vector y = (y(s_1), ..., y(s_n))^T. To simplify notation, sometimes we write y_i = y(s_i) for the observation at site s_i, i = 1, ..., n.
Prediction rather than parameter estimation is the main goal of kriging applications. That is, given the observations in S, one is usually interested in prediction of the value y(s_0) at the unsampled site s_0 \in D. A simple example with n = 4 observation sites is shown in Figure 8.1.
For the purpose of spatial prediction, we have to consider the amount of spatial dependence between y(s_0) and all sample elements y(s_i), i = 1, ..., n, as well as among all pairs of observations in the sample. Let v = (v(s_0 - s_1), ..., v(s_0 - s_n))^T. Define Y to be the n × n symmetric matrix whose elements are given by v(s_i - s_j), for i, j = 1, ..., n. Given the

observations in S, the best linear unbiased predictor of the random variable y(s_0) is (Exercise 8.1)

\hat{y}(s_0|S) = \eta^T y,   (8.7)

where

\eta = Y^{-1}\left\{ v + J\,\frac{1 - J^T Y^{-1} v}{J^T Y^{-1} J} \right\}   (8.8)

and J denotes an n × 1 vector whose elements are all equal to 1. The associated mean-squared prediction error (or kriging variance) is

\sigma^2(s_0|S) = E\left\{ y(s_0) - \hat{y}(s_0|S) \right\}^2 = v^T Y^{-1} v - \frac{(1 - J^T Y^{-1} v)^2}{J^T Y^{-1} J}.   (8.9)

We leave it to Exercise 8.2 to show that analogous expressions can be obtained through the covariance function c(s, t) under the assumption of second-order stationarity.
If Y is known, the ordinary kriging predictor \hat{y}(s_0|S) is best in the sense that it has the minimum mean-squared prediction error among all linear functions a^T y such that

E(a^T y) = E\{y(s_0)\},

for a an n × 1 vector of constants. Furthermore, if \{y(s) : s \in D\} is a Gaussian process, then the ordinary kriging predictor is best among all linear and nonlinear functions of y. Optimal prediction of an average, instead of a point value at a specific location, is also feasible but will not be addressed in this book. Also note (Exercise 8.3) that

\hat{y}(s_0|S) = y(s_0) \qquad \text{if } s_0 \in S,   (8.10)

that is, the predictor \hat{y}(s_0|S) is an exact interpolator of the observed process under the ordinary kriging model (8.2).
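As a small numerical illustration of formulae (8.7)-(8.9), the R sketch below computes the ordinary kriging weights, the prediction and the kriging variance by direct matrix algebra; the semivariogram, the site coordinates and the responses are all invented for the example and are not taken from the book.

# Ordinary kriging prediction from a known (here: assumed) semivariogram.
sv <- function(h, theta = c(0.5, 2, 3)) {            # hypothetical exponential model
  ifelse(h == 0, 0, theta[1] + theta[2] * (1 - exp(-h / theta[3])))
}

set.seed(2)
S  <- cbind(runif(10, 0, 10), runif(10, 0, 10))      # n = 10 observation sites
y  <- rnorm(10, mean = 10)                           # observed responses at the sites
s0 <- c(5, 5)                                        # unsampled prediction site

Y  <- sv(as.matrix(dist(S)))                         # n x n matrix of v(s_i - s_j), zero diagonal
v0 <- sv(sqrt(colSums((t(S) - s0)^2)))               # vector v of v(s_0 - s_i)
J  <- rep(1, nrow(S))

Yi   <- solve(Y)
lagr <- (1 - sum(J * (Yi %*% v0))) / sum(J * (Yi %*% J))   # (1 - J'Y^{-1}v)/(J'Y^{-1}J)
eta  <- Yi %*% (v0 + J * lagr)                       # kriging weights, equation (8.8)

y_hat   <- sum(eta * y)                              # predictor (8.7)
sig2_ok <- sum(v0 * (Yi %*% v0)) - lagr^2 * sum(J * (Yi %*% J))  # kriging variance (8.9)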
In many experimental applications it is important also to consider the
effect of measurement error. For example, measurement error occurs when
a measurement is taken several times and different results are obtained be-
cause of instrumental error. If we allow for this extra source of uncertainty,
then the ordinary kriging model becomes

y(s) = J-t + 8(s) + c(s) s E D, (8.11)

where c( s) is a white-noise process representing measurement error. That


is, c( s) is taken to be independent of 8 (s) and such that

c(s) rv N(o,a;),
where the measurement error variance a; does not depend on s, and
cov{c(s),c(t)} =0

for all pairs of sites s \neq t \in D. Again, \mu is unknown and \{y(s) : s \in D\} is assumed to be intrinsically stationary.
The quantity \sigma^2_\varepsilon constitutes a part of what is usually called the nugget effect of the variogram, a positive limit to which 2v(h) may tend as \|h\| \to 0. Therefore, measurement error is one reason for the possible discontinuity at the origin allowed in many variogram models (see §8.2.2 below), since by definition

2v(h) = 0 \qquad \text{if } \|h\| = 0.

Another possible cause of the nugget effect is discontinuity at very small scales, which is exhibited by many geophysical phenomena.
Under the measurement error model (8.11), interest usually lies in knowledge of the smoothed version

y^*(s) = \mu + \delta(s)

of y(s), rather than in prediction of y(s) itself. Proper appreciation of the role of measurement error is important not only from the modeler's point of view, but also for its implications for the performance of the forward search (see the remark at the end of §8.3.1).
If there is measurement error, the variogram of the response variable can be decomposed as

\mathrm{var}\{y(s) - y(t)\} = 2\sigma^2_\varepsilon + \mathrm{var}\{y^*(s) - y^*(t)\}.   (8.12)

The above decomposition holds also when s = t if we consider pairs of independent measurements at the same location, say y(s)_1 and y(s)_2. Therefore,

\mathrm{var}\{y(s)_1 - y(s)_2\} = 2\sigma^2_\varepsilon > 0

and equation (8.8), which gives the vector of prediction coefficients \eta, must be modified as follows.
Let v^* = (v^*(s_0 - s_1), ..., v^*(s_0 - s_n))^T be the n × 1 vector that represents spatial dependence between y(s_0) and y(s_i), i = 1, ..., n, under the measurement error model (8.11). The elements of v^* are defined as

v^*(s_0 - s_i) = v(s_0 - s_i) \qquad \text{if } s_0 \neq s_i,   (8.13)

and

v^*(s_0 - s_i) = \frac{1}{2}\mathrm{var}\{y(s_i)_1 - y(s_i)_2\} = \sigma^2_\varepsilon \qquad \text{if } s_0 = s_i.   (8.14)

Equation (8.13) yields the same semivariogram value as under the basic model (8.2), since measurement error has no effect on prediction of a new observation at an unsampled location. On the contrary, equation (8.14) takes into account the variability of repeated measurements at location s_i. The matrix Y = [v(s_i - s_j)] remains unchanged and still has zeros on its main diagonal, since observation of the noise-corrupted process y(s) is the

only information source on S. As in (8.7), the kriging predictor of y^*(s_0) is then

\hat{y}^*(s_0|S) = \eta^{*T} y,   (8.15)

where

\eta^* = Y^{-1}\left\{ v^* + J\,\frac{1 - J^T Y^{-1} v^*}{J^T Y^{-1} J} \right\}.   (8.16)

The mean-squared prediction error of (8.15) becomes

(8.17)

because of equation (8.12).
Under the measurement error model (8.11) we typically have

\hat{y}^*(s_0|S) \neq y(s_0) \qquad \text{if } s_0 \in S,

so that the kriging predictor now smooths the observed data instead of being an exact interpolator. On the contrary,

\hat{y}^*(s_0|S) = \hat{y}(s_0|S) \qquad \text{if } s_0 \notin S.   (8.18)

That is, measurement error does not affect prediction at locations where no data have been observed.
In practice it is difficult to obtain sample information about the true value of the measurement error variance \sigma^2_\varepsilon. Furthermore, measurement error might also be confounded with a microscale component, that is a structure of the observed phenomenon with a range shorter than the sampling support. In the absence of actual replications of the measurement process or very close sampling points, a tentative estimate of \sigma^2_\varepsilon can be obtained by linear extrapolation of variogram estimates near the origin or by subject-matter information. Knowledge about the physical nature of the problem is essential at this stage.
It might be argued that the ordinary kriging models (8.2) and (8.11) provide a simplistic representation of natural phenomena, as they imply the absence of large-scale variation in the response variable. The more general approach of universal kriging allows for the definition of a space-varying mean function, say \mu(s), at each location. However, for the purpose of our diagnostic analysis through the forward search, models (8.2) and (8.11) are precisely what is needed: a simple benchmark to which all observations are to be contrasted. After a brief description of some isotropic variogram models, in §8.2.3 we introduce a simple example of a spatial data set containing a few known outliers which serves the purpose of illustrating our main ideas in kriging.

8.2.2 Isotropic Semivariogram Models

Before turning to the description of our first kriging data set, we give a brief account of some commonly adopted parametric models for the semivariogram in the isotropic case. The topic of (semi)variogram modelling has been of paramount importance in much of the geostatistical literature, especially in its early stages. We do not emphasize the role of such parametric models as a tool for proper identification and description of the spatial features of y(s). Rather, we approach variogram models with the aim of deriving a robust starting point from which our forward search for ordinary kriging can be initialized.
A basic theoretical property that the variogram function 2v(h) = 2v(s_i - s_j) must satisfy is

\sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j \, 2v(s_i - s_j) \le 0,   (8.19)

for any finite number of spatial sites s_1, ..., s_n and real numbers a_1, ..., a_n summing to 0 (Exercise 8.4). A variogram function satisfying property (8.19) is said to be conditionally negative definite. Valid parametric models for 2v(h), or equivalently for the semivariogram v(h), are then built so as to guarantee this basic requirement.
One of the most frequently adopted models for an isotropic semivariogram in two dimensions is defined as

v(\|h\|; \theta) =
\begin{cases}
0 & \|h\| = 0, \\
\theta_0 + \theta_1 \{1.5\|h\|/\theta_2 - 0.5(\|h\|/\theta_2)^3\} & 0 < \|h\| \le \theta_2, \\
\theta_0 + \theta_1 & \|h\| \ge \theta_2,
\end{cases}   (8.20)

where \theta_0, \theta_1 and \theta_2 are nonnegative parameters and \theta = (\theta_0, \theta_1, \theta_2)^T. The value of \theta_0 is the nugget effect, since v(\|h\|) \to \theta_0 as \|h\| \to 0. Furthermore, \theta_0 + \theta_1 gives the variance of the process (also including measurement error), while \theta_2 defines the range of spatial dependence. In fact, under this model observations are uncorrelated if their distance \|h\| \ge \theta_2. Representation (8.20) is known as the spherical semivariogram model.
Another popular semivariogram model is the exponential

v(\|h\|; \theta) =
\begin{cases}
0 & \|h\| = 0, \\
\theta_0 + \theta_1 \{1 - \exp(-\|h\|/\theta_2)\} & \|h\| > 0,
\end{cases}   (8.21)

where the interpretation of \theta_0 \ge 0, \theta_1 \ge 0 and \theta_2 \ge 0 is similar to that given above. The main difference between the exponential and the spherical model is that the exponential function (8.21) reaches its supremum \theta_0 + \theta_1 only asymptotically, when \|h\| \to \infty. In other words, the exponential model does not assume that there exists a finite range for spatial autocorrelation, that is a finite distance above which observations are uncorrelated. Despite this theoretical difference, however, models (8.20) and (8.21) often exhibit

FIGURE 8.2. Spherical and exponential semivariogram models, for selected parameter values. All models show a nugget effect

similar fits in practice. The spherical and exponential semivariogram models are plotted in Figure 8.2 for selected parameter values. In all pictures the nugget effect is evident.
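For reference, the two models can be coded directly as R functions; the sketch below is illustrative and simply evaluates and plots the spherical and exponential semivariograms for one arbitrary choice of theta.

# Spherical and exponential semivariogram models, equations (8.20)-(8.21).
sph <- function(h, theta) {                      # theta = (theta0, theta1, theta2)
  ifelse(h == 0, 0,
         ifelse(h <= theta[3],
                theta[1] + theta[2] * (1.5 * h / theta[3] - 0.5 * (h / theta[3])^3),
                theta[1] + theta[2]))
}
expo <- function(h, theta) {
  ifelse(h == 0, 0, theta[1] + theta[2] * (1 - exp(-h / theta[3])))
}

h <- seq(0, 10, by = 0.1)
plot(h, sph(h, c(1, 2, 5)), type = "l", ylim = c(0, 4),
     xlab = "||h||", ylab = "semivariogram")     # solid: spherical
lines(h, expo(h, c(1, 2, 5)), lty = 2)           # dashed: exponential, same parameters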
The simplest possible specification for v(\|h\|) is provided by the linear model

v(\|h\|; \theta) =
\begin{cases}
0 & \|h\| = 0, \\
\theta_0 + \theta_1 \|h\| & \|h\| > 0,
\end{cases}   (8.22)

where now \theta = (\theta_0, \theta_1)^T. Again, \theta_0, \theta_1 \ge 0 and \theta_0 is the nugget effect.
Strictly speaking, a linear function does not provide a valid semivariogram model for an intrinsically stationary process, since it shows unbounded variation. That is,

\lim_{\|h\| \to \infty} (\theta_0 + \theta_1 \|h\|) = \infty,

corresponding to a deterministic trend in the data. However, our experience with the forward search has shown that the robust fit of a linear semivariogram model can be a simple and effective starting point for the forward analysis of data sets with a complex spatial structure, such as a regional trend or a large number of clustered outliers. We give examples of this behaviour in §§8.4.2 and 8.6.
Many more semivariogram models exist which satisfy property (8.19).
However, in the rest of this chapter we confine our attention to the three

TABLE 8.1. Simulated kriging data before contamination: readings are at the
nodes of a 9 x 9 regular lattice
Row Column
1 2 3 4 5 6 7 8 9
1 11.39 13.43 12.79 13.20 9.85 11.37 11.97 10.79 10.53
2 11.83 12.70 10.77 13.79 14.43 11.31 8.73 6.79 7.21
3 11.72 12.26 15.64 12.31 12.34 10.21 8.95 7.68 10.58
4 12.78 9.95 12.79 9.70 10.43 8.36 5.46 7.19 10.00
5 11.64 11.86 13.78 9.98 8.89 8.46 7.53 10.59 9.11
6 11.45 11.42 14.55 12.57 11.63 11.56 8.35 8.69 10.35
7 11.43 13.06 11.30 12.38 11.25 7.43 11.04 11.41 9.49
8 15.57 12.28 12.23 14.03 11.59 11 .36 10.78 11.49 10.18
9 12.68 12.61 13.79 15.96 12.83 11.62 12.22 10.85 10.85

functions given above. Further details on these and alternative models abound in the geostatistical literature. A good and extensive reference is the book by Chiles and Delfiner (1999).

8.2.3 Spatial Outliers
The purpose of this section is to describe a simple example of a spatial data set containing a few known outliers which serves the purpose of illustrating our main ideas. The dataset is given in Table 8.1. It refers to measurements at the nodes of a 9 × 9 regular lattice. Sites on the lattice are indexed in lexicographical order, so that s_1 = (1, 1)^T, s_2 = (1, 2)^T, ..., s_{81} = (9, 9)^T.
The data were produced by simulation of the measurement-error ordinary kriging model (8.11), under a normal distribution for both \delta and \varepsilon. Spatial dependence in y(s) was modelled through the isotropic spherical semivariogram (8.20). In this example \mu = 10, \theta_0 = 2, \theta_1 = 4, \theta_2 = 8 and \sigma^2_\varepsilon = 0.1. Figure 8.3 shows a selection of three-dimensional views of these simulated data from different perspectives.
In spatial statistics outlier detection often aims at highlighting observations which are unusual with respect to their surrounding values. We call them spatial outliers, to emphasize that the anomaly is intended with respect to the spatial distribution of the response variable. Thus a spatial outlier might or might not be anomalous in the traditional sense, that is in the analysis of the univariate distribution of the sample observations y_1, ..., y_n.
In the present example a cluster of spatial outliers is introduced in the northwest corner of the grid by changing the observations at sites s_1, s_2 and s_3. This modification is performed by adding the constants 6, 4 and 5, respectively, to the original readings. As a result, the contaminated values clearly become outliers with respect to the bulk of the data, both from a spatial and a distributional (univariate) point of view.
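A simulation of this kind is straightforward to reproduce; the R sketch below is a rough illustration (not the authors' original code) that draws the Gaussian process directly from the covariance matrix implied by the spherical semivariogram with the parameter values given above, adds measurement error, and then contaminates the three northwest sites.

library(MASS)                                     # for mvrnorm()

# Spherical semivariogram, equation (8.20), with theta = (2, 4, 8) as in the example.
sph <- function(h, theta = c(2, 4, 8)) {
  ifelse(h == 0, 0,
         ifelse(h <= theta[3],
                theta[1] + theta[2] * (1.5 * h / theta[3] - 0.5 * (h / theta[3])^3),
                theta[1] + theta[2]))
}

set.seed(123)
grid  <- cbind(rep(1:9, each = 9), rep(1:9, times = 9))   # s1 = (1,1), s2 = (1,2), ..., s81 = (9,9)
H     <- as.matrix(dist(grid))                    # inter-site distances
sill  <- 2 + 4                                    # theta0 + theta1 = process variance
Sigma <- sill - sph(H)                            # implied covariance c(h) = sill - v(h)

y <- 10 + mvrnorm(1, mu = rep(0, 81), Sigma = Sigma) +    # mu + delta(s)
     rnorm(81, sd = sqrt(0.1))                            # + measurement error, variance 0.1
y[1:3] <- y[1:3] + c(6, 4, 5)                     # contaminate sites s1, s2, s3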
FIGURE 8.3. Simulated kriging data (before contamination): three-dimensional views of the data from six perspectives

8.2.4 Kriging Diagnostics

Before computing the optimal weight vector (8.8) or (8.16), it is important to identify observations which might have an undue influence on it. As emphasis is mainly on prediction, case-deletion diagnostics for kriging are usually based on standardized prediction residuals computed by cross validation.
The leave-one-out cross-validation principle implies that each case is deleted in turn from the entire sample and its value is estimated from the remaining n − 1 cases. Let S_{(i)} = (s_1, ..., s_{i-1}, s_{i+1}, ..., s_n) denote network S with the ith location removed. Standardized prediction residuals are then defined as

e_{i,S_{(i)}} = \frac{y_i - \hat{y}_{i,S_{(i)}}}{\hat{\sigma}_{i,S_{(i)}}} \qquad i = 1, \ldots, n,   (8.23)

where \hat{y}_{i,S_{(i)}} = \hat{y}(s_i|S_{(i)}) stands for the kriging predictor of y_i based only on the observations in S_{(i)}, and \hat{\sigma}^2_{i,S_{(i)}} is a robust estimate of the corresponding kriging variance \sigma^2_{i,S_{(i)}}. In §8.2.5 we show how the unknown ingredients required for computing \hat{\sigma}^2_{i,S_{(i)}} (i.e. v and Y) can be estimated robustly from the available sample. Note that definition (8.23) is valid irrespective of the role of measurement error. In fact, even under the measurement

error model (8.11), y_i is the only observable quantity at site s_i when prediction is performed from the reduced network S_{(i)}, with s_i deleted. Hence \hat{y}^*_{i,S_{(i)}} = \hat{y}_{i,S_{(i)}}, in view of the interpolation property (8.18). Some theoretical properties of prediction residuals, extending the deletion algebra of §2.5, are provided in Exercise 8.6.
Cook's distance is another popular diagnostic measure in regression models with independent errors (see, e.g., Atkinson and Riani 2000, p. 25). In the kriging context, where prediction is of paramount importance, it can be computed to provide a measure of influence on the predicted values. Assume that the measurement-error ordinary kriging model (8.11) holds. Define the n × 1 prediction vectors

\hat{Y}^* = (\hat{y}^*(s_1|S), \ldots, \hat{y}^*(s_n|S))^T

and

\hat{Y}^*_{(i)} = (\hat{y}^*(s_1|S_{(i)}), \ldots, \hat{y}^*(s_n|S_{(i)}))^T.

Then, we suggest computing a spatial version of Cook's distance for prediction at site s_i as

C_{i,S_{(i)}} = -\left(\hat{Y}^*_{(i)} - \hat{Y}^*\right)^T \tilde{Y}^{-1} \left(\hat{Y}^*_{(i)} - \hat{Y}^*\right), \qquad i = 1, \ldots, n,   (8.24)
where \tilde{Y} denotes a robust estimate of the matrix Y (see §8.2.5). The minus sign on the right-hand side of equation (8.24) is necessary to guarantee that nonnegative distances are obtained, in view of the conditional negative definiteness property of the variogram function (see §8.2.2).
Figure 8.4 shows boxplots of both the standardized prediction residuals e_{i,S_{(i)}} and the square roots of the Cook distances for the simulated data of §8.2.3 with a cluster of spatial outliers. None of the contaminated values has prediction diagnostics which can be judged extreme with respect to the specified model. Therefore, there is clear evidence of masking due to the presence of multiple spatial outliers. Indeed, the only apparent outlier displayed in both Figure 8.4(a) and 8.4(b) is the observation at site s_{12}, where the relatively low value e_{12,S_{(12)}} (and hence the high value C_{12,S_{(12)}}) has been swamped by spatial proximity to the contaminated corner.
Such undesirable effects may be present even if we apply other exploratory techniques specifically devised for the purpose of locating spatial anomalies. For instance, Table 8.2 shows, for each row and column of the grid, the absolute value of the standardized (mean − median) difference (Cressie 1993, p. 38)

u = n_1^{1/2} (\bar{y} - \tilde{y})/(0.7555\,\zeta),   (8.25)

where n_1 is the number of spatial sites located on a specific row or column of the grid, and \bar{y} and \tilde{y} are respectively the average and the median value computed on that row or column. In addition, \zeta is a resistant measure of dispersion computed as

\zeta = (\text{interquartile range})/1.349.
FIGURE 8.4. Simulated kriging data with multiple outliers: boxplots of (a) standardized prediction residuals given in (8.23) and (b) square roots of Cook distances for spatial prediction given in (8.24)

TABLE 8.2. Simulated kriging data with multiple outliers: absolute values of the standardized (mean − median) difference u. Values |u| > 3 are usually of interest for the purpose of detecting spatial outliers

Row number    1    2    3    4    5    6    7    8    9
|u|           1.1  0.6  1.1  0.8  0.4  1.2  4.4  3.4  0.0

Col. number   1    2    3    4    5    6    7    8    9
|u|           5.2  2.2  0.4  0.3  0.3  2.1  1.0  1.8  1.9

Values of |u| around 3 or greater are usually adopted to highlight atypical rows and columns with gridded data.
It is clear from Table 8.2 that in our example the summary diagnostic picks out rows 7 and 8 and column 1 as having potential outliers. In fact, rows 7 and 8 do not have outliers and only column 1 does. On the contrary, columns 2 and 3 and row 1, which contain outliers, are not chosen by the mean-median summary method. An additional disadvantage with small data sets is that the specific algorithm used for computing the median and the interquartile range may have some effect on this diagnostic.
Of course, some simple diagnostic methods may occasionally provide helpful guidance in the identification of multiple spatial outliers. For instance, bivariate plots of y(s_i) and y(s_i + e), for i = 1, ..., n and e a unit vector in a specified direction, can display the three contaminated values in this example for a careful choice of e. However, conclusions are usually influenced by what direction is actually chosen, so that even the use of such bivariate scatterplots may not result in a satisfactory tool for the exploratory analysis of spatial data. In §8.4.1, on the contrary, it will be seen how clearly the plots from the forward search are able to reveal the three outliers in this example.

8.2.5 Robust Estimation of the Variogram

Before moving to the description of the forward search algorithm for the ordinary kriging model, in this section we briefly address the issue of robust variogram estimation. This step must be taken prior to the application of our forward approach. Hence, it is important to ensure that outliers do not have an undue influence on estimates of 2v(h), especially at short lags.
The simplest estimator of 2v(h) is the sample analogue of (8.6)

2\hat{v}(h) = N(h)^{-1} \sum_{N(h)} \{y_i - y_j\}^2,   (8.26)

where N(h) is the number of pairs of sites \{s_i, s_j\} at lag h and \sum_{N(h)} denotes summation over such pairs. The estimator 2\hat{v}(h) is called the sample variogram. When spatial locations are irregularly spaced within S, computation of 2\hat{v}(h) is usually smoothed by inclusion of the points lying in some specified (small) tolerance region around h.
The sample variogram 2\hat{v}(h) is unbiased under the ordinary kriging models (8.2) and (8.11). Unfortunately, it is biased (and sometimes badly so) when these models are contaminated by outliers. For this reason, we adopt the robust variogram estimator of Cressie and Hawkins (1980)

2\bar{v}(h) = \frac{\left\{ N(h)^{-1} \sum_{N(h)} \sqrt{|y_i - y_j|} \right\}^4}{0.457 + 0.494/N(h)}.   (8.27)

The denominator of (8.27) is defined in order to make 2\bar{v}(h) approximately unbiased.
An even more robust estimator of 2v(h) can be obtained by replacing the average of the square-rooted differences \sqrt{|y_i - y_j|} in formula (8.27) with their median. However, this estimator can be highly inefficient for moderate sample sizes and, according to our experience, it should be adopted only when huge contamination is suspected. In all the kriging examples that follow we make use only of the robust variogram estimator defined by (8.27) and do not apply such a "very robust" variogram estimator.
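The Cressie and Hawkins estimator is simple to compute from scratch; the following R sketch is a minimal illustration that groups inter-site distances into unit-width lag classes (one of several possible conventions) and returns the robust semivariogram estimates for gridded data such as those of Table 8.1.

# Robust Cressie-Hawkins variogram estimate, equation (8.27),
# with distances grouped into unit-width lag classes (an arbitrary choice here).
robust_variogram <- function(coords, y, max_lag = 6) {
  D   <- as.matrix(dist(coords))
  out <- data.frame(lag = 1:max_lag, gamma_bar = NA, n_pairs = NA)
  for (k in 1:max_lag) {
    pairs <- which(D > k - 0.5 & D <= k + 0.5 & upper.tri(D), arr.ind = TRUE)
    r     <- sqrt(abs(y[pairs[, 1]] - y[pairs[, 2]]))       # sqrt(|y_i - y_j|)
    Nh    <- nrow(pairs)
    two_v <- mean(r)^4 / (0.457 + 0.494 / Nh)                # 2 * vbar(h_k)
    out$gamma_bar[k] <- two_v / 2                            # semivariogram vbar(h_k)
    out$n_pairs[k]   <- Nh
  }
  out
}

# Example with the simulated 9 x 9 lattice data generated above (vector 'y'):
# robust_variogram(cbind(rep(1:9, each = 9), rep(1:9, times = 9)), y)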
A common problem with estimated measures of dependence is that sample estimates often do not satisfy the same theoretical properties as their population counterpart. As a consequence, the robust estimates \bar{v}(h) will not usually satisfy the requirement of being conditionally negative definite, as described in §8.2.2. For this reason we fit a valid parametric model v(h; \theta) to the sample values \bar{v}(h), leading to model-based estimates v(h; \hat{\theta}) and

\tilde{Y} = \{v(s_i - s_j; \hat{\theta})\}. Here \hat{\theta} denotes a robust estimate of \theta. In what follows we restrict ourselves to the three parametric models described in §8.2.2. As in §8.2.4, the resulting robust estimate of \sigma^2_{i,S_{(i)}} is denoted by \hat{\sigma}^2_{i,S_{(i)}}.
A simple and popular technique for fitting semivariogram models to sample estimates is through weighted least squares. In our robust setting the weighted least squares algorithm minimizes

\sum_{k=1}^{K} N(h_k) \left\{ \frac{\bar{v}(h_k)}{v(h_k; \theta)} - 1 \right\}^2   (8.28)

with respect to the parameter vector \theta. Here, K is the total number of lags for which the robust variogram \bar{v}(h) is computed and h_1, ..., h_K denote such different lags. The vector \hat{\theta} is then the minimizer of the weighted least squares fit (8.28). More complex likelihood-based methods for estimating \theta are described in many books, including Cressie (1993) and Stein (1999).
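Given a table of robust estimates, a weighted least squares fit of this kind can be carried out with a general-purpose optimizer; the R sketch below is illustrative only, assumes a spherical model, and takes as input a data frame emp with columns lag, gamma_bar and n_pairs such as the one produced by the robust_variogram() sketch above.

# Cressie-type weighted least squares fit of a parametric semivariogram (cf. (8.28)).
sph <- function(h, theta) {                          # spherical model, equation (8.20)
  ifelse(h == 0, 0,
         ifelse(h <= theta[3],
                theta[1] + theta[2] * (1.5 * h / theta[3] - 0.5 * (h / theta[3])^3),
                theta[1] + theta[2]))
}

wls_loss <- function(theta, emp) {
  if (any(theta < 0)) return(Inf)                    # keep the parameters nonnegative
  fitted <- sph(emp$lag, theta)
  sum(emp$n_pairs * (emp$gamma_bar / fitted - 1)^2)  # weighted least squares criterion
}

# fit <- optim(c(1, 3, 4), wls_loss, emp = emp)      # starting values are arbitrary
# fit$par                                            # robust WLS estimate of theta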

8.3 The Forward Search for Ordinary Kriging


In this section we describe how the forward search can be applied to or-
dinary kriging models, both with and without measurement error. Recall
that in ordinary kriging the main features of the analysis are spatial depen-
dence in the response variable and emphasis on prediction rather than on
parameter estimation. These are also major sources of difference between
the forward search for kriging and the basic algorithm for regression models
with independent errors, described at length in Atkinson and Riani (2000,
Chapter 2) and reviewed in Chapter 2 of this book.
To summarize, one major difference between the forward search for or-
dinary kriging and that for standard univariate regression is that spatial
dependence in the response variable must be allowed for. We overcome this
difficulty by "conditioning" the search on the robust estimate of the var-
iogram function described in §8.2.5. On the progression side, the forward
search for ordinary kriging moves to a larger subset by considering stan-
dardized, instead of raw, prediction residuals. An additional change is that
all plots now run up to n - 1 instead of n, since measurement error can
be the only source of uncertainty in prediction of each sample value y(si)
when the subset size is equal to n .

8. 3.1 Choice of the Initial Subset


As in all applications of the forward search we have encountered so far, our algorithm starts with the definition of a subset of m_0 observations which are intended to be outlier free. For this purpose, let S_*^(m_0) = (s_{i_1}, ..., s_{i_{m_0}}) be a set of m_0 spatial locations and i = (i_1, ..., i_{m_0})^T be the corresponding m_0-dimensional vector of indices. Let ŷ_{i,S_*^(m_0)} denote the kriging predictor at site s_i given observations in S_*^(m_0). With a slight abuse of notation, if s_i ∈ S_*^(m_0) then ŷ_{i,S_*^(m_0)} stands equivalently for the predictor of the observed value y(s_i) or of the noiseless value y*(s_i), according to the assumed model. The corresponding standardized prediction residual is

e_{i,S_*^{(m_0)}} = \frac{y_i - \hat{y}_{i,S_*^{(m_0)}}}{\hat{\sigma}_{i,S_*^{(m_0)}}}.

Our initial subset is selected by a least median of squares criterion similar to the one sketched in §2.15.1 and detailed in Chapter 2 of Atkinson and Riani (2000). Specifically, we take as our initial subset S_*^(m_0) the set of m_0 spatial locations which satisfies

(8.29)

where e²_{[l],S_*^(m_0)} is the lth ordered squared residual among e²_{i,S_*^(m_0)}, i = 1, ..., n. Here we take l = ⌈(n − p)/2⌉, where ⌈(n − p)/2⌉ is the rounded value of (n − p)/2.


The main difference between our robust criterion for kriging and the one for independent errors given in equation (2.108) is that here the standardized prediction residuals e_{i,S_*^(m_0)} are computed instead of the raw residuals y_i − ŷ_{i,S_*^(m_0)}.

We prefer to use standardized residuals in the least median of squares fit (8.29) because these residuals provide our basic criterion for progressing in the search (see the next section). Furthermore, as we have already seen in §8.2.4, standardized prediction residuals are the main diagnostic tool in kriging. Standardization is usually regarded as important with correlated data, since the site-specific estimate σ̂_{i,S_*^(m_0)} reflects the influence of spatial dependence on prediction uncertainty at location s_i. For the same reason, we adopt standardized instead of raw residuals in our diagnostic analysis of the spatial autoregressive models to be described in §§8.7 and 8.8.
Remark: Despite our preference for diagnostics based on standardized residuals with correlated data, we emphasize that the forward search for kriging is not sensitive to the specific method adopted to select S_*^(m_0), provided that the initial subset is either outlier free or contains unmasked outliers which are immediately removed by the forward procedure. We have already encountered many examples of this behaviour, especially in Chapter 7: see, e.g., Figures 7.6 and 7.20, Exercise 3.4, and also Remark 4 in §2.12. In the specific context of ordinary kriging, Cerioli and Riani (1999) showed that allowing for measurement error can ensure a high degree of interchange at the very first steps of the search. Hence, under the measurement-error model (8.11), the method is often resistant to the inclusion of some outliers in the starting set of locations, as they are immediately ejected from it. On the contrary, the requirement that S_*^(m_0) be outlier free is essential if we assume that measurement error does not exist. In fact, in that instance, equation (8.10) shows that

\hat{y}_{i,S_*^{(m_0)}} = y_i   if   s_i ∈ S_*^{(m_0)},

and units cannot leave the subset once they have joined it.
In our applications to ordinary kriging, where μ is the only large-scale parameter to be estimated from the data, we start from m_0 = 2, as this is the smallest dimension for which ŷ_{i,S_*^(m)} can be computed. For more general trend surface models, where E{y(s)} = μ(s) is a function of a number of unknown parameters, m_0 has to be increased accordingly. If the number of candidate subsets \binom{n}{m_0} is too large, minimization (8.29) is performed over some large number of samples, although approximate algorithms usually lead to inferior properties of robust estimators in multiple regression (Hawkins and Olive 2002). Alternative methods for selecting the initial subset in kriging models when n or m_0 are large are described in Cerioli and Riani (1999) and Riani and Cerioli (1999).
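The sketch below shows, under simplifying assumptions, how the pieces of this subsection fit together: an ordinary kriging predictor with its kriging variance, the standardized prediction residuals, and complete enumeration of all pairs for the least median of squares choice of S_*^(2). It assumes that sites is an n x 2 numpy array of coordinates and that a covariance function cov(d) is available (for instance the sill minus a fitted semivariogram). The code does not reproduce the exact measurement-error formulas (8.11) and (8.16), and all function names are ours.

import numpy as np
from itertools import combinations

def ok_predict(sites, y, subset, i, cov):
    # ordinary kriging predictor and kriging variance for site i, using only
    # the observations whose indices are listed in `subset`
    S = list(subset)
    xs, ys = sites[S], y[S]
    d_SS = np.linalg.norm(xs[:, None, :] - xs[None, :, :], axis=-1)
    d_iS = np.linalg.norm(xs - sites[i], axis=-1)
    m = len(S)
    A = np.zeros((m + 1, m + 1))
    A[:m, :m] = cov(d_SS)
    A[:m, m] = 1.0                         # unbiasedness constraint
    A[m, :m] = 1.0
    b = np.concatenate([cov(d_iS), [1.0]])
    sol = np.linalg.solve(A, b)
    lam, mu = sol[:m], sol[m]
    y_hat = lam @ ys
    var = cov(0.0) - lam @ cov(d_iS) - mu  # ordinary kriging variance
    return y_hat, max(float(var), 1e-12)

def standardized_residuals(sites, y, subset, cov):
    # e_{i,S} = (y_i - yhat_{i,S}) / sigmahat_{i,S} for every site i
    e = np.empty(len(y))
    for i in range(len(y)):
        y_hat, var = ok_predict(sites, y, subset, i, cov)
        e[i] = (y[i] - y_hat) / np.sqrt(var)
    return e

def initial_pair_lms(sites, y, cov):
    # complete enumeration of all pairs, keeping the pair that minimizes a
    # median-type ordered squared standardized prediction residual
    n = len(y)
    l = int(np.ceil((n - 2) / 2))          # crude stand-in for the book's l
    best, best_pair = np.inf, None
    for pair in combinations(range(n), 2):
        e2 = np.sort(standardized_residuals(sites, y, pair, cov) ** 2)
        if e2[l] < best:
            best, best_pair = e2[l], pair
    return best_pair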

8.3.2 Progressing in the Search


Given a subset of dimension m ≥ m_0, say S_*^(m), the forward search for ordinary kriging moves to dimension m + 1 by selecting the m + 1 spatial locations with the smallest squared standardized prediction residuals, the locations being chosen by ordering all squared residuals e²_{i,S_*^(m)}, i = 1, ..., n. The algorithm yields a forward search predictor of y(s) at each observation site. For instance, the forward search predictor at site s_i is defined as

\hat{y}_{FS}(s_i) = (\hat{y}_{i,S_*^{(m_0)}}, \ldots, \hat{y}_{i,S_*^{(n)}})^T,     (8.30)

that is, as the collection of the ordinary kriging predictors at that site in each step of the forward search.
Apart from the beneficial activity that may be observed in the very first steps of the procedure, in most moves from m to m + 1 just one new site joins the subset. Under the measurement error model (8.11) it may also happen that two or more locations join S_*^(m) as one or more leave. Then an interchange occurs. We gave details about interchanges with multivariate data in §2.14. Note that leaving the subset is not possible under model (8.2), since in that case

\hat{y}_{i,S_*^{(m)}} = y_i   if   s_i ∈ S_*^{(m)}.
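A skeleton of this progression, written so that any leave-subset-out residual function (such as the kriging sketch of §8.3.1 above) can be plugged in, might look as follows; the signature is of our choosing.

import numpy as np

def forward_search(n, initial_subset, resid_fn):
    # resid_fn(subset) must return the n standardized prediction residuals
    # computed from the locations currently in `subset`
    subset = sorted(initial_subset)
    history = []
    while len(subset) < n:
        e = resid_fn(subset)
        history.append((tuple(subset), e.copy()))
        m = len(subset)
        # keep the m+1 smallest squared residuals; interchanges are allowed
        subset = sorted(np.argsort(e ** 2)[: m + 1].tolist())
    return history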

8.3.3 Monitoring the Search


As usual, the forward search progresses until all locations are included in
the subset. However, the last stage of the search is uninformative for the
purpose of kriging analysis, as measurement error is the only source of
variability in prediction when m = n.
As with regression models, if just one site enters S_*^(m) at each move, the algorithm provides an ordering of the data according to the specified null model, with observations furthest from it joining the subset at the last stages of the procedure. Spatial outliers and other observations which are potentially anomalous at a local scale are detected by graphical displays
of the statistics computed along the search. The diagnostic power of the
method thus parallels what we have already seen in the preceding chapters
of this book, as well as in Atkinson and Riani (2000).
Standardized prediction residuals. We plot the standardized prediction residuals e_{i,S_*^(m)} at each step of the forward search. Every location is monitored until it joins the subset, because from that point onwards positive prediction residuals are to be attributed to measurement error. This forward plot for kriging typically shows a scissors shape, since the curves corresponding to sites with small prediction errors are visible only in the first steps of the search. On the contrary, the curves related to atypical observations typically stand apart in the plot.
The forward plot of standardized prediction residuals is very informative
about the behaviour of individual sites, in the same way as the forward plot
of Mahalanobis distances is for multivariate data. In addition, recall that
the ordinary kriging models (8.2) and (8.11) assume first-order stationarity
ofthe response variable. Fitting them when E{y(s)} is not constant over V
leads to biased predictions. If E{y( s)} grows along a preferential direction,
that is a monotonic large-scale component is present in the data, we expect
long patches of residuals with the same sign in the order of entrance in the
subset. This monotonic large-scale component is also likely to cause an
asymmetric display of the trajectories in the forward plot of standardized
prediction residuals, with a larger number of curves either in the positive
or in the negative half-plane of the plot.
If E{y(s)} ≠ μ a question arises as to which variogram has actually to be computed. The usual suggestion (e.g., Cressie 1993) is to first detrend the data and then estimate the variogram on the detrended residuals. The rationale behind this approach is that the ordinary kriging solutions (8.8) and (8.16) require a constant mean. However, the underlying spatial trend is usually unknown and must itself be estimated. We see in §8.5 that prior detrending has several disadvantages and may produce spurious information.

On the contrary, the robust fitting of standard semivariogram models to the raw data of §8.5 does not affect the forward search and leads to a clear description of important features of the data. Our forward search approach is thus more robust than ordinary kriging even to trend contamination. A partial explanation of this greater robustness is that only the behaviour of the variogram near the origin has a major impact on our diagnostic tools.
Ordered residuals and kriging variances. Further useful plots for outlier detection are those that monitor selected order statistics of standardized prediction residuals and estimated kriging variances. Such diagnostic measures are defined as:

e^*_m = |e|_{[m+1],S_*^{(m)}}     m = m_0 + 1, ..., n − 1,     (8.31)

e_m = |e|_{[n],S_*^{(m)}}     m = m_0 + 1, ..., n − 1,     (8.32)

and

\hat{\sigma}^*_m = \hat{\sigma}^2_{[m+1],S_*^{(m)}}     m = m_0 + 1, ..., n − 1,     (8.33)

\hat{\sigma}_m = \hat{\sigma}^2_{[n],S_*^{(m)}}     m = m_0 + 1, ..., n − 1.     (8.34)

If there is no interchange (see §2.14) the values of e^*_m and σ̂^*_m correspond to the minimum absolute standardized prediction residual and estimated kriging variance, respectively, among the units not belonging to the subset. On the contrary, e_m and σ̂_m show the largest absolute residual and variance among all units.
The forward plots of e^*_m and σ̂^*_m will show a peak in the step prior to the inclusion of the first outlier, as happened in previous chapters to the smallest Mahalanobis distance not in the subset. On the other hand, with a cluster of spatial outliers the curves of e_m and σ̂_m will have a sharp decrease when the first outlier joins S_*^(m), due to the masking effect. The same decrease will also be apparent in the plots of e^*_m and σ̂^*_m at subsequent steps, as further outliers enter the subset.
Average residuals and kriging variances. Similar but smoother information is provided by plots of average residuals and variances, such as

e^a_m = \frac{1}{n-m} \sum_{i=m+1}^{n} |e|_{[i],S_*^{(m)}}     m = m_0 + 1, ..., n − 1,     (8.35)

and

\hat{\sigma}^a_m = \frac{1}{n-m} \sum_{i=m+1}^{n} \hat{\sigma}^2_{[i],S_*^{(m)}}     m = m_0 + 1, ..., n − 1.     (8.36)
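Given the residuals and kriging variances computed at a single step, the quantities (8.31) to (8.36) are simple order statistics. The small helper below assumes the forms reconstructed above and uses 0-based indexing; it is only a sketch.

import numpy as np

def monitoring_statistics(e, s2, m):
    # e: standardized prediction residuals; s2: estimated kriging variances;
    # both of length n and computed from the subset of size m
    abs_e = np.sort(np.abs(e))                   # |e|_[1] <= ... <= |e|_[n]
    s2_ord = np.sort(s2)
    e_star, e_max = abs_e[m], abs_e[-1]          # (8.31) and (8.32)
    s_star, s_max = s2_ord[m], s2_ord[-1]        # (8.33) and (8.34)
    e_avg, s_avg = abs_e[m:].mean(), s2_ord[m:].mean()   # (8.35) and (8.36)
    return e_star, e_max, s_star, s_max, e_avg, s_avg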

Cycle statistics. High-order spatial trends and cyclic components may be difficult to identify through plots specifically devised for the purpose

of outlier detection. For simplicity, suppose that observations are collected at the nodes of a regular grid, as in the example of §8.2.3, the extension to more complex spatial structures being straightforward. At each step m = m_0, ..., n − 1 of the forward search for ordinary kriging we consider the standardized prediction residual, say e_{*,S_*^(m)}, of the unit joining the subset at that step. Furthermore, let s_*^(m) = (s_{*1}^(m), s_{*2}^(m))^T denote the two-dimensional coordinate vector of the corresponding spatial location. If e_{*,S_*^(m)} and e_{*,S_*^(m+1)} have the same sign we compute the coordinate differences

d_{1m} = |s_{*1}^{(m+1)} − s_{*1}^{(m)}|     (8.37)

and

d_{2m} = |s_{*2}^{(m+1)} − s_{*2}^{(m)}|.     (8.38)

The frequency distribution of the distances d_{1m} and d_{2m} provides an indication of whether a cyclic component is present in the row and column directions, respectively. The rationale behind (8.37) and (8.38) is that, when a row or a column cycle is present, observations at a lag equal to the cycle length tend to be similar. Hence, they will presumably be included in two consecutive steps of the algorithm. In the unusual situation where more than one location joins the subset in the same step, we compute d_{1m} and d_{2m} for all units that enter.
The specific order in which units join S_*^(m) at successive steps of the search might have some effect on the results of cycle analysis through d_{1m} and d_{2m}. Therefore, we strongly recommend the adoption of exhaustive enumeration of all possible starting subsets in the initial step of our algorithm, when spatial cycle analysis is to be performed.
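The bookkeeping needed for this cycle analysis is modest, as the sketch below illustrates; it assumes the absolute-difference form of (8.37) and (8.38) and that the order of entry and the residual of each entering unit were stored during the search.

import numpy as np
from collections import Counter

def cycle_distances(entry_order, coords, entry_resid):
    # entry_order: site indices in order of inclusion; coords: n x 2 array of
    # (row, column) coordinates; entry_resid: residual of each entering unit
    d1, d2 = [], []
    for t in range(len(entry_order) - 1):
        a, b = entry_order[t], entry_order[t + 1]
        if np.sign(entry_resid[t]) == np.sign(entry_resid[t + 1]):
            d1.append(abs(coords[b, 0] - coords[a, 0]))  # row lag, cf. (8.37)
            d2.append(abs(coords[b, 1] - coords[a, 1]))  # column lag, cf. (8.38)
    return Counter(d1), Counter(d2)

The two Counter objects give the frequency distributions plotted, for the wheat yield data, in Figure 8.17.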

8.4 Contaminated Kriging Examples


8.4.1 Multiple Spatial Outliers
The purpose of this section is to show how our forward approach for or-
dinary kriging works in practice. For this purpose, we apply the forward
search algorithm based on squared standardized prediction residuals to
the simulated data set of §8.2.3, after contamination at sites s_1, s_2 and s_3. Figure 8.5 shows the robustly estimated isotropic semivariogram values v̄(||h||_k) for k = 1, ..., 10, as well as the fitted spherical model with parameter estimates θ̂ = (1.37, 7.31, 11.38)^T.
Since n is small and m_0 = 2, we perform complete enumeration of all distinct pairs of spatial sites {s_i, s_j}, i, j = 1, ..., 81, for the purpose of selecting the initial subset S_*^(2). We fit the measurement-error model (8.11) with σ²_ε = 0.1. The specific value assumed for the measurement error variance does not affect the conclusions in this example, although some blurring is introduced when σ²_ε approaches the estimate of θ_0.


FIGURE 8.5. Simulated kriging data with multiple outliers: solid line, robust semivariogram estimates; dotted line, fitted spherical model with θ̂ = (1.37, 7.31, 11.38)^T

We start our analysis by looking at Figure 8.6, the forward plot of standardized prediction residuals e_{i,S_*^(m)}. The three contaminated values now show up clearly, as they have the largest standardized residuals for most of the search. The effect of masking is also apparent when the first outlier joins the subset at m = 79. Table 8.3 reports the units included in the last 10 steps of the forward search for both contaminated and original data. Contaminated locations move into S_*^(m) at the last three steps, with an ordering which reflects their degree of outlyingness.
Also moves at previous steps can be motivated by inspection of the data given in Table 8.1. From the uncontaminated readings we see that they refer to locations whose values are less in agreement with the postulated spatial model. In addition, the prediction residual corresponding to the largest uncontaminated value y(s_21) = 15.6 has an upward increase at m = 68, when the relatively low observation at site s_40 is included. Locations entering in steps m < 79 are also included in the last stages of the forward search run on the original data without contamination. As usual, the forward search algorithm provides a "natural" ordering of the observations according to the spatial structure implied by the measurement-error ordinary kriging model (8.11), even in the case of well-behaved data.
Information about the presence of three spatial outliers is reinforced by
the inspection of Figure 8.7, which shows the forward plots of statistics
(8.31) through (8.34), restricted for ease of presentation to steps m ≥ 17.


FIGURE 8.6. Simulated kriging data with multiple outliers: forward plot of stan-
dardized prediction residuals. Units are monitored until they join the subset

TABLE 8.3. Simulated kriging data with multiple outliers: units included in the
last 10 steps of the forward search for contaminated and original data

Steps 72 73 74 75 76 77 78 79 80 81
Contaminated data 17 76 5 34 64 60 21 1 2 3
Original data 48 29 76 17 18 34 5 64 21 60

The sharp peak at m = 78 in the plots of e^*_m and σ̂^*_m (panels a and c) anticipates the inclusion of the first outlier, unit s_1. This observation is also responsible for the elbows at m = 79 in the plots of e_m (panel b) and σ̂_m (panel d). Apart from the first stages, where results may be unstable, all such plots lead to the same conclusions and clearly unveil the three contaminated values.

8.4.2 Pocket of Nonstationarity


The simulated data of Table 8.1 are now modified in such a way that E{y(s_i)} = 18 for i = 1, ..., 6, and i = 10, 11, 12. Therefore, contamination here does not concern only a few spatial outliers as in §8.4.1, but rather


FIGURE 8.7. Simulated kriging data with multiple outliers: forward plot of (a) e^*_m, (b) e_m, (c) σ̂^*_m, (d) σ̂_m. All plots indicate the existence of 3 spatial outliers

involves a specific area within the study region where the postulated model
does not apply. We call this area a "nonstationary pocket".
Figure 8.8 displays three-dimensional views of the contaminated data
from different perspectives. Visual inspection of the data may suggest a
steep gradient towards the end of some columns, although it is difficult
to tell where this gradient actually begins and which sites are involved.
Furthermore, comparison of Figure 8.8 and Figure 8.3 (where no contam-
ination occurs) shows little difference, suggesting that random noise could
also contribute to the perceived spatial trend.
Classical spatial exploratory techniques are again of no help in identifying the presence of contaminated values. For instance, in the first two rows of the grid, where the atypical area is actually located, the standardized (mean − median) difference (8.25) takes the values u = −1.4 and u = −0.2, respectively. Hence we move to our forward approach.
Figure 8.9 is the forward plot of standardized prediction residuals for this dataset. Now the 9 contaminated observations are clearly visible, as they have the largest standardized prediction residuals along the search. The effect of masking also appears when the first outlier, unit s_5, joins the subset at m = 72. This site has the smallest observation in the nonstationary pocket. Other sites in that area follow at subsequent steps, as Table 8.4


FIGURE 8.8. Simulated kriging data with a nonstationary pocket: three-dimensional views of the data from six perspectives

TABLE 8.4. Simulated kriging data with a nonstationary pocket: units included
in the last 12 steps of the forward search. Two fitted semivariogram models

Fitted Steps
semivariogram 70 71 72 73 74 75 76 77 78 79 80 81
Spherical 21 64 5 1 3 2 12 4 10 11 60 6
Linear 5 1 3 2 12 21 4 64 11 10 60 6

shows. In this example, the larger contamination in the northwest corner of the grid worsens the prediction of the low value at s_60, which joins S_*^(m) in step 80. Nevertheless, its forward residual trajectory is still seen to be in agreement with the bulk of the data for most of the search. The wider dispersion of curves in the top half-plot of Figure 8.9 also suggests that some large-scale effect may now be present in the data.
The consequences of contamination are also clear from Figure 8.10, which displays the forward plots of diagnostics (8.35), (8.32), (8.36) and (8.34), from m = 17 onwards. Again, there is clear evidence of an abrupt change in all curves when passing from m = 71 to m = 72 as s_5 enters S_*^(m).
Residuals in Figure 8.9 and the corresponding diagnostics in the forward
search have been obtained by forcing sample values v̄(||h||) to fit a spherical


FIGURE 8.9. Simulated kriging data with a nonstationary pocket and estimated
spherical semivariogram: forward plot of standardized prediction residuals

semivariogram function, which in this case turns out to be very similar to a linear function. Hence, analogous results are reached if we dismiss any prior information about the true form of v(||h||) and fit a simple linear semivariogram in the range 0 ≤ ||h|| ≤ ||s_1 − s_n||. Inspection of the plots based on the fitted linear semivariogram (Figure 8.11) shows that the only effect of this misspecification is a slight advance in the abrupt change experienced by the diagnostic curves, with locations s_21 and s_64 now entering the subset after s_5. The units included in the last 12 steps of the forward search are given in Table 8.4 and are the same as those for the spherical semivariogram function, although in a slightly different order.

8.5 Wheat Yield Data


So far, we have applied our forward algorithm for ordinary kriging to two
simulated examples, where multiple spatial outliers were already known
to exist. In this section we consider an example based on a real data set.
Furthermore, this example has been studied widely in the spatial statistical
literature, so we are able to compare our findings with results obtained
through different exploratory techniques.


FIGURE 8.10. Simulated kriging data with a nonstationary pocket and estimated spherical semivariogram: forward plot of (a) e^a_m, (b) e_m, (c) σ̂^a_m, (d) σ̂_m

Table 8.5 gives the wheat yield data, taken from Cressie (1993, p. 455)
and also available through the module SPATIALSTATS of S-Plus (Math-
soft 1996). These data consist of 500 observations on the production of
wheat grain (in pounds). Measurements refer to a 20 x 25 lattice of plots.
Row indices run in the north-south direction, while column indices run in
the east-west direction. The global size of the study area is 1 acre. Although
there is some ambiguity as regards the actual plot size, plot dimensions are taken to be 3.30 meters (10.82 feet) in the east-west direction and 2.51 me-
ters (8.25 feet) in the north-south direction, as in Cressie (1993). The data
come from a uniformity trial, that is an agricultural experiment in which
there are no differences in treatment (e.g., Hinkelman and Kempthorne
1994). The purpose of this study was to assess natural variation in soil
fertility and so to determine the optimal plot size for future wheat yield
trials.
The wheat yield data have been much studied in the statistical literature,
since their introduction by Mercer and Hall in 1911. Several analyses have
tried to account for spatial autocorrelation between neighbouring plots,
through application of spatial autoregressive models (see §§8. 7 and 8.8)
and spectral methods for spatial processes. Cressie (1993, §4.5) provides a
brief account of this literature and performs a detailed exploratory study
TABLE 8.5. Mercer and Hall wheat-yield data: readings are at the nodes of a 20 x 25 regular lattice
Row Column
-
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
1 3.63 4.15 4.06 5.13 3.04 4.48 4.75 4.04 4.14 4 4.37 4.02 4.58 3.92 3.64 3.66 3.57 3.51 4.27 3.72 3.36 3.17 2.97 4.23 4.53
2 4.07 4.21 4.15 4.64 4.03 3.74 4.56 4.27 4.03 4.5 3.97 4.19 4.05 3.97 3.61 3.82 3.44 3.92 4.26 4.36 3.69 3.53 3.14 4.09 3.94
3 4.51 4.29 4.4 4.69 3.77 4.46 4.76 3.76 3.3 3.67 3.94 4.07 3.73 4.58 3.64 4.07 3.44 3.53 4.2 4.31 4.33 3.66 3.59 3.97 4.38
4 3.9 4.64 4.05 4.04 3.49 3.91 4.52 4.52 3.05 4.59 4.01 3.34 4.06 3.19 3.75 4.54 3.97 3.77 4.3 4.1 3.81 3.89 3.32 3.46 3.64
5 3.63 4.27 4.92 4.64 3.76 4.1 4.4 4.17 3.67 5.07 3.83 3.63 3.74 4.14 3.7 3.92 3.79 4.29 4.22 3.74 3.55 3.67 3.57 3.96 4.31
6 3.16 3.55 4.08 4.73 3.61 3.66 4.39 3.84 4.26 4.36 3.79 4.09 3.72 3.76 3.37 4.01 3.87 4.35 4.24 3.58 4.2 3.94 4.24 3.75 4.29
7 3.18 3.5 4.23 4.39 3.28 3.56 4.94 4.06 4.32 4.86 3.96 3.74 4.33 3.77 3.71 4.59 3.97 4.38 3.81 4.06 3.42 3.05 3.44 2.78 3.44
8 3.42 3.35 4.07 4.66 3.72 3.84 4.44 3.4 4.07 4.93 3.93 3.04 3.72 3.93 3.71 4.76 3.83 3.71 3.54 3.66 3.95 3.84 3.76 3.47 4.24
9 3.97 3.61 4.67 4.49 3.75 4.11 4.64 2.99 4.37 5.02 3.56 3.59 4.05 3.96 3.75 4.73 4.24 4.21 3.85 4.41 4.21 3.63 4.17 3.44 4.55
10 3.4 3.71 4.27 4.42 4.13 4.2 4.66 3.61 3.99 4.44 3.86 3.99 3.37 3.47 3.09 4.2 4.09 4.07 4.09 3.95 4.08 4.03 3.97 2.84 3.91
11 3.39 3.64 3.84 4.51 4.01 4.21 4.77 3.95 4.17 4.39 4.17 4.17 4.09 3.29 3.37 3.74 3.41 3.86 4.36 4.54 4.24 4.08 3.89 3.47 3.29
12 4.43 3.7 3.82 4.45 3.59 4.37 4.45 4.08 3.72 4.56 4.1 3.07 3.99 3.14 4.86 4.36 3.51 3.47 3.94 4.47 4.11 3.97 4.07 3.56 3.83
13 4.52 3.79 4.41 4.57 3.94 4.47 4.42 3.92 3.86 4.77 4.99 3.91 4.09 3.05 3.39 3.6 4.13 3.89 3.67 4.54 4.11 4.58 4.02 3.93 4.33
14 4.46 4.09 4.39 4.31 4.29 4.47 4.37 3.44 3.82 4.63 4.36 3.79 3.56 3.29 3.64 3.6 3.19 3.8 3.72 3.91 3.35 4.11 4.39 3.47 3.93
15 3.46 4.42 4.29 4.08 3.96 3.96 3.89 4.11 3.73 4.03 4.09 3.82 3.57 3.43 3.73 3.39 3.08 3.48 3.05 3.65 3.71 3.25 3.69 3.43 3.38
16 5.13 3.89 4.26 4.32 3.78 3.54 4.27 4.12 4.13 4.47 3.41 3.55 3.16 3.47 3.3 3.39 2.92 3.23 3.25 3.86 3.22 3.69 3.8 3.79 3.63
17 4.23 3.87 4.23 4.58 3.19 3.49 3.91 4.41 4.21 4.61 4.27 4.06 3.75 3.91 3.51 3.45 3.05 3.68 3.52 3.91 3.87 3.87 4.21 3.68 4.06
18 4.38 4.12 4.39 3.92 4.84 3.94 4.38 4.24 3.96 4.29 4.52 4.19 4.49 3.82 3.6 3.14 2.73 3.09 3.66 3.77 3.48 3.76 3.69 3.84 3.67
19 3.85 4.28 4.69 5.16 4.46 4.41 4.68 4.37 4.15 4.91 4.68 5.13 4.19 4.41 3.54 3.01 2.85 3.36 3.85 4.15 3.93 3.91 4.33 4.21 4.19
20 3.61 4.22 4.42 5.09 3.66 4.22 4.06 3.97 3.89 4.46 4.44 4.52 3.7 4.28 3.24 3.29 3.48 3.49 3.68 3.36 3.71 3.54 3.59 3.76 3.36


FIGURE 8.11. Simulated kriging data with a nonstationary pocket and estimated linear semivariogram: forward plot of (a) e^a_m, (b) e_m, (c) σ̂^a_m, (d) σ̂_m. To be compared with Figure 8.10

of the Mercer and Hall wheat-yield data through geostatistical techniques. Although the observations are given on a spatial lattice, kriging is feasible because it is possible to think about a continuously varying spatial index running over the study area.
As in our previous kriging analyses, we start by looking at the spatial structure of the Mercer and Hall wheat-yield data through the six-panel perspective plot given in Figure 8.12. This display is not particularly informative, mainly because of the relatively large number of spatial sites involved. On the contrary, the exploratory geostatistical analysis performed by Cressie (1993, §4.5) shows that standard assumptions such as stationarity and absence of outliers seem to be questionable for this data set. His approach relies on prior removal of possible trends in the row and column directions (see Figure 8.13). As some observations may have an exaggerated influence at this stage, he uses a robust detrending algorithm based on iterative computation of row and column medians. This algorithm is known as median polish (Emerson and Hoaglin 1983). Its application suggests that the readings at locations s_290 = (12, 15)^T, s_376 = (16, 1)^T and s_430 = (18, 5)^T, where indices are in lexicographical order, should be considered as spatial outliers, being unusual with respect to their surrounding


FIGURE 8.12. Wheat yield data: three-dimensional views of the data from six
perspectives

values. However, this exploratory approach is not entirely satisfactory for two reasons which are outlined below.
First, case-deletion statistics are based on computation of standardized prediction residuals (8.23). We have already seen in §8.2.4 that such backwards diagnostics are prone to masking and swamping when multiple spatial outliers are present in the data. To unveil such undesirable effects we look at Figure 8.14, the forward plot of standardized prediction residuals computed on the detrended wheat-yield data. Although masking does not appear to be a problem in this example, the forward plot reveals that the three outliers have a different behaviour along the search. Observations at sites s_290 and s_376 are markedly anomalous for most of the search, while y_430 becomes so only when m ≈ 440 and several units with a negative residual are included in the fitting subset. Furthermore, for most of the search there is an additional unit, namely site s_191, for which the value of e_{i,S_*^(m)} is close to that of the "declared" spatial outlier s_430. The observed trajectories also suggest that the value at location s_208 = (9, 8)^T has some peculiar features not highlighted by standard exploratory techniques. In fact, coming back to Table 8.5, we see that the low reading y(s_208) is surrounded by relatively much larger observations.



FIGURE 8.13. Wheat yield data: boxplots of values in each column (top panel)
and each row (bottom panel) of the lattice. Waves in the east-west direction?

More importantly, trend removal is performed through an iterative pro-


cedure whose theoretical properties are difficult to establish. The resulting
technique of robust prediction is based on a kriging procedure which re-
quires the highly subjective step of data winsorization (Hawkins and Cressie
1984), that is modification of extreme (detrended) values in order to down-
weight their influence. In what follows we see how the forward search ap-
plied to the original wheat-yield readings can overcome such difficulties and
provide a more sensible representation of their spatial structure, as well as
some useful additional insights.
To start our analysis we take 13 distance classes as in Cressie (1993, p. 253), and compute the isotropic version of the robust variogram estimator (8.27). Then we fit an exponential model to sample values v̄(h_k), k = 1, ..., 13, yielding parameter estimates θ̂ = (0.12, 0.11, 15.99)^T. Choice of the specific parametric family is not crucial and a spherical model would lead essentially to the same search. The measurement error variance σ²_ε is set equal to θ̂_0 = 0.12, since in this example it seems sensible to ascribe the nugget effect solely to measurement error. Recall that each datum y(s_i) is obtained by aggregation over a plot, so that variation at very small scales should not exist. As in our analyses of §8.4, we initialize the search from


FIGURE 8.14. Detrended wheat-yield data: forward plot of standardized prediction residuals. The search is run on residuals from median polish. Units are monitored until they join the subset

m_0 = 2 and select the initial subset S_*^(2) by exhaustive enumeration of all possible pairs of sites.
Figure 8.15 is the resulting forward plot of standardized prediction residuals and shows only one outlier, unit s_290. On the contrary, the other two observations highlighted by Cressie (1993) and in our Figure 8.14 are in good accordance with the bulk of the data, especially y(s_430), whose trajectory is pictured in boldface in Figure 8.15. It appears that trend removal through median polish can produce spurious outliers. Other locations have large absolute standardized prediction residuals at the end of the search, but none of them has an outstanding trajectory, apart perhaps from the observation at site s_208.
Additional information on possible outliers is gained through Figure 8.16, which provides the plots of e^*_m and σ̂^*_m against m, as well as a zoom taken at the end of the forward search. The left-hand panel shows a steady increase without any remarkable change in slope. This suggests a fairly stable transition between well and badly predicted spatial locations. The large values of e^*_m in the last steps of the search mean that for a few units the ordinary kriging model does not fit well. From the right-hand panel we see that the shape of σ̂^*_m parallels that of e^*_m for most of the search. However, the estimated kriging variance experiences an upward jump in step m = 495,

Subset size m

FIGURE 8.15. Wheat yield data: forward plot of standardized prediction residuals. Units are monitored until they join the subset. The trajectory for location s_430 is displayed in black

before inclusion of unit s_174. The relative contribution of each observation may thus be fairly different on pointwise prediction and on the uncertainty assessment of that prediction. The succeeding units which enter are s_84, s_5, s_208 and s_290. It is worth noting that the curves in Figure 8.16 do not show a decrease in the last steps. In this example there is no masking of spatial outliers, as the worst predicted values are scattered throughout the study region.
From the plots we have constructed starting from the raw wheat-yield
data, we have thus learned that a constant-mean kriging model is not ap-
propriate for a number of observations, although only one or perhaps two of
them may be declared as spatial outliers. However, the even separation of
curves between the top and bottom half plots of Figure 8.15 does not sup-
port the presence of a monotonic spatial trend in the area. For this reason
we check for more complex departures from the stationarity assumption,
as might be also suggested by inspection of the boxplots in Figure 8.13.
Figure 8.17 shows the frequency distribution of row and column distances
(8.37) and (8.38). The high peak at lag 1 in both directions is attributable
to the effect of spatial autocorrelation. In the absence of spatial cycles we
expect these distributions to fall off quickly at distances greater than 1,
with some noise in their right-hand tail. The secondary peak at lag 3 in


FIGURE 8.16. Wheat yield data: forward plot of, left-hand panel, e^*_m and, right-hand panel, σ̂^*_m, together with zooms taken at the end of the search


FIGURE 8.17. Wheat yield data: frequency distribution of, left-hand panel, row distances d_1m and, right-hand panel, column distances d_2m. Note the secondary peak at lag 3 in the right panel

the right-hand panel of Figure 8.17 thus suggests the presence of a cyclic
component in the east-west direction, with length equal to 3 columns. It
is interesting to note that McBratney and Webster (1981) reached a sim-
ilar conclusion through more complex and less intuitive spectral analysis
techniques.


FIGURE 8.18. Wheat yield data: forward plots of e^*_m, left-hand panel, before and, right-hand panel, after perturbation with Gaussian noise in the second decimal digit. A constant variogram model is fitted in both cases

Finally, we supplement our discussion on the wheat yield data by highlighting an additional point of interest. Comparison of the curve of e^*_m for our simulated data set of §8.4.1 with that for the wheat yield data shows the latter to be considerably rougher, especially in its central part. This effect may be ascribed to rounding error. To show this in greater detail we fit a constant semivariogram model v(||h||; θ) to the sample values v̄(||h||), that is we ignore for a while the effect of spatial dependence. The zoom in the left-hand panel of Figure 8.18 clearly shows that e^*_m jumps at certain steps, while slightly decreasing at others. This behaviour occurs because y(s_i) = y(s_j) for at least one location s_i ∈ S_*^(m) and one location s_j ∉ S_*^(m). The right-hand panel displays the same curve after a small perturbation in the data, which is performed by adding Gaussian noise to the second decimal digit of the readings in Table 8.5. The trajectory of e^*_m is now much smoother.

8.6 Reflectance Data


In this section we consider what might be claimed to be a "difficult" example, since two different and almost equally plausible structures seem to be present in the data. We see how a simple and effective description of these alternative spatial structures can be obtained through our forward search for ordinary kriging.

TABLE 8.6. Reflectance data (×10⁻¹): observations are at the nodes of a 9 × 9 regular lattice

Row Column
1 2 3 4 5 6 7 8 9
1 32 35 36 37 38 47 34 35 31
2 38 39 43 41 55 42 38 34 37
3 50 62 46 39 55 37 40 32 28
4 45 50 43 33 24 38 44 42 39
5 40 36 16 18 31 37 52 30 24
6 37 14 10 21 26 30 35 41 19
7 10 12 5 12 17 18 20 24 23
8 50 62 19 6 14 17 17 5 6
9 46 35 0 4 5 5 6 0 0

The observations in Table 8.6, taken from Haining (1990, p. 217), give
reflectance values extracted from an aerial survey along the south coast
of England. The purpose of the survey was to monitor pollution levels
arising from the pumping of waste material into the English Channel, in
a coastal area where sewage disposal had taken place. Higher reflectance
values indicate higher levels of pollution.
The spatial locations where such values were collected form the nodes
of a 9 x 9 regular grid, so that n = 81. As in our previous examples,
sites on the grid are indexed in lexicographical order. For the purpose of
statistical analysis, the sides of the lattice were standardized to have unit length and reflectance values were multiplied by 10⁻¹. The last row of the
data set was missing in the original source, although additional information
(i.e. residuals from median polish) was provided about it. We have thus
reconstructed all missing values from this supplementary information.
These data are likely to contain a large-scale effect (trend surface), an autocorrelation component and local-scale effects (nonstationary pockets).
The closer an observation is to the source of pollution, the greater its re-
flectance value is likely to be. In this case the parameter values of the trend
component indicate the dispersal gradient. One reason for the presence of
spatial correlation is that the reflectance value recorded in any pixel is a
partial averaging of reflectance values in neighbouring pixels. This appears
to be a general problem with remotely sensed data. Furthermore, at a process level, pollution might be affected by local mixing and local dispersal due to small-scale turbulence and wave action.
The complicated spatial structure of the reflectance data set is revealed
by Figure 8.19. In a first study, Haining (1987) concluded that reflectance
in this sample area can be described by a "first order trend surface model
with autoregressive errors" . Subsequently, the same author (Haining 1990,


FIGURE 8.19. Reflectance data (×10⁻¹): three-dimensional views of the data from six perspectives

pp. 216-219) seemed to reach a somewhat different conclusion, and suggested that the data contain "high order trends" both in the horizontal and in the vertical direction and "show evidence that outliers are present". We now apply the forward search to detect nonstationary pockets and spatial outliers, and to throw light on the presence of other systematic components. For the reasons described in §8.5, we analyse the original readings instead of detrended residuals from median polish.
First, we fit a linear semivariogram model with 10 distance classes, yielding θ̂ = (1.38, 53.86)^T. The measurement error variance is set to σ²_ε = 0.1. As in our previous examples, the search is started after selection by exhaustive enumeration of an initial subset of size 2. Sites on the grid are ordered lexicographically. As in previous kriging applications, we start our analysis by looking at Figure 8.20, monitoring standardized prediction residuals. This forward plot reveals two outlying trajectories, corresponding to locations s_65 = (8, 2)^T and s_64 = (8, 1)^T. The third and fourth last units to enter are locations s_73 = (9, 1)^T and s_74 = (9, 2)^T, respectively. For these units |e_{i,S_*^(m)}| > 3 for most of the search. Furthermore, the shapes of their residual curves are very similar to those of the two extreme trajectories. This suggests the existence of a cluster of spatial outliers in the bottom left-hand corner of the grid.


FIGURE 8.20. Reflectance data: forward plot of standardized prediction residuals for the search run with σ²_ε = 0.1


FIGURE 8.21. Reflectance data: curves of (a) e^*_m and (b) σ̂^*_m for m ≥ 40 and σ²_ε = 0.1. A remarkable peak at m = 77 in both panels

The remarkable peak at m = 77 in Figure 8.21, which contains the forward plots of e^*_m and σ̂^*_m for m ≥ 40, also confirms the peculiarity of the reflectance values recorded at the last four locations which join the subset. Indeed, turning back to the values in Table 8.6, we see that such readings, although not extreme on a global scale, are much larger than the relatively low surrounding ones. These values correspond to the upright peak towards the background of the upper left panel in Figure 8.19.

Another striking feature of Figure 8.20 is the uneven distribution of pos-


itive and negative residuals. This is caused by the fact that from m = 42
onwards the search includes only locations with a positive residual. Accord-
ing to our discussion in §8.3.3, this feature strongly suggests the presence
of a trend component.
Raising the value of the measurement error variance, however, leads to dramatic changes in the plots produced by the forward search. For instance, Figure 8.22 displays the forward plot of standardized prediction residuals obtained from the search assuming σ²_ε = θ̂_0 = 1.38. Note that setting the measurement error variance to the largest possible value may be justified on the same grounds as in the analysis of the wheat yield data set in §8.5. This alternative solution does not show a clear-cut trend and only two outliers, namely y(s_65) = 62 with a large positive residual, and y(s_55) = 5 with a large negative residual. A similar result could also be obtained by reducing the value of K, the number of distance classes in the computation of v̄(||h||).
Unlike our previous examples, the forward search for the reflectance data thus shows appreciable sensitivity to some preliminary choices, such as the degree of measurement error and the way of estimating v̄(||h||). Nevertheless, two clearly distinct data structures emerge according to such choices. The first solution, corresponding to Figures 8.20 and 8.21, emphasizes the peaks of the spatial distribution of the reflectance values in the study area. On the contrary, the second solution (Figure 8.22) points at the bumps in that distribution. Both structures are almost equally plausible given the available data, and this is why the initial conditions matter. We conclude that more information about the physical features of the problem should be available in order to make a definite choice between these two alternative models for the reflectance data.
Instability to local perturbations in the data is exhibited by many robust techniques, such as the least median of squares method. A cogent description of this behaviour is provided, for instance, by Hettmansperger and Sheather (1992) and in the discussion following that paper (Stromberg 1993; Appa, Land, and Rousseeuw 1993). As our application to the reflectance data shows, a bonus of the forward search is that repeated analyses from different starting points lead to a simple and effective description of alternative spatial structures emerging from the observations at hand.

8.7 Background on Spatial Autoregression


In the rest of this chapter we address the problem of extending our forward approach to a different class of popular models for spatially autocorrelated observations, namely spatial autoregressive models with simultaneous specification. In contrast to our previous kriging analyses, explanatory variables


FIGURE 8.22. Reflectance data: forward plot of standardized prediction residuals for the search run with σ²_ε = 1.38
will be present in some of the examples that follow. So, strictly speaking,
we should name our models as "regressive-autoregressive". For simplicity,
however, we use the term "autoregressive" throughout.
In spatial autoregression the available data are conceived as a realization
of the stochastic process
{y(s) : s ∈ V},     (8.39)
as in the geostatistical specification (8.1), but now V represents a countable
collection of spatial regions. In practical applications the number of such
regions is finite, say n, and

V = S = (s_1, ..., s_n).

The study region V then coincides with the network of observation locations S.
Sites in S may either correspond to the nodes of a regular grid, as in
our kriging examples, or be irregularly spaced, such as in the case of n
administrative units. In both situations a fundamental difference arises
with respect to the kriging approach, where the observation network S is
a sample of n individual points from a continuous surface V. In spatial
autoregression each site s_i ∈ S is representative of a wider area (such as a grid tile or an administrative unit) to which it belongs. If S is a regular grid, s_i = (s_i1, s_i2)^T is the two-dimensional (column) vector of coordinates

defining the centre of area i. When S is a network of administrative units, s_i = (s_i1, s_i2)^T is usually taken to be the coordinate vector of the ith unit seat. For simplicity, as in previous sections, we often write y_i = y(s_i) for the random variable defining the response observation at site s_i, i = 1, ..., n.
Spatial information is introduced in model (8.39) through a set of relationships among sites. This is given in terms of a neighbourhood structure, which is attached to S. Some simple examples of neighbourhood structures are outlined in §8.7.1.
There are two popular classes of spatial autoregressive models extending
autoregression procedures of time series analysis. These are the simultane-
ous spatial autoregression and the conditional spatial autoregression mod-
els, which are usually referred to as SAR and CAR models respectively.
An introduction to SAR models is given in §8.7.2, while we refer to Exercise 8.13 for the definition of a CAR model. Both approaches assume multivariate normality of the observations y_i, i = 1, ..., n, but with different covariance matrices. Although both the SAR and CAR specifications have been widely adopted to represent regional data, we restrict ourselves only
to the SAR approach for one basic reason which is now outlined.
For a CAR model it is possible to obtain robust estimates of spatial inter-
action parameters based on all the data, as we did for kriging. The reason is
that the ordinary least squares method is consistent under the CAR specification, so it is straightforward to derive consistent estimators of spatial autocorrelation which adopt more robust objective functions than the least
squares one. One possibility is described by Haining (1990, pp. 384-385).
Extension of the forward approach for regression models with independent
errors, described in Atkinson and Riani (2000), to CAR models then follows
by use of the generalized least squares principle in regression parameter es-
timation. The method of generalized least squares takes advantage of the
robust estimate of spatial interaction parameters computed from all the
data and the forward search can be run conditionally on this estimate.
The resulting approach is similar to that of the forward search for kriging,
introduced in §8.3, and thus we will not pursue it further.
Under a SAR model, on the contrary, such a fortunate coincidence does
not happen (§8.7.2) and robust estimators of spatial interaction based on all the data are not easily derived. Hence, we have to resort to a different approach where spatial autoregression parameters are estimated consistently and efficiently at each step of the algorithm, together with regression pa-
rameters. For this purpose, we argue that the basic forward algorithm for
independent observations under a regression structure must be suitably
modified and run over blocks of contiguous spatial locations. This makes
the forward search for SAR models more complex and intriguing than the
one for CAR models sketched above. The resulting algorithm, which we call
the Block Forward Search, is introduced in §8.8, after a brief description of
the mathematical formalism of SAR models, their maximum likelihood fit
and the ubiquitous problems raised by masking and swamping.

8.7.1 Neighbourhood Structure and Edge Correction


Spatial relationships between pairs of locations are represented through a neighbourhood structure which is attached to the observation network S. This structure is displayed through an n × n matrix of nonnegative weights w_{i,j} ≥ 0 for i, j = 1, ..., n, where w_{i,i} = 0. The simplest neighbourhood structure arises in the case of a regular grid, where we define

w_{i,j} = 1  if s_j is immediately to the north, south, east or west of s_i,     (8.40)
w_{i,j} = 0  otherwise.

Similarly, for n administrative units

w_{i,j} = 1  if units s_i and s_j share a common boundary,     (8.41)
w_{i,j} = 0  otherwise.

Figure 8.23 shows an elementary example of the neighbourhood structure provided by (8.40) on a 4 × 4 regular square grid, where locations in S are labelled in lexicographical order and neighbouring sites are joined by a line. Of course, this structure is symmetric and w_{i,j} = w_{j,i}.
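For a regular grid the weight matrix (8.40) can be built explicitly; the following sketch, with names of our choosing, uses row-major (lexicographical) ordering of the sites and no edge correction.

import numpy as np

def rook_weights(n_rows, n_cols):
    # w_ij = 1 when site j lies immediately to the north, south, east or west
    # of site i; sites are indexed in lexicographical (row-major) order
    n = n_rows * n_cols
    W = np.zeros((n, n))
    for r in range(n_rows):
        for c in range(n_cols):
            i = r * n_cols + c
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < n_rows and 0 <= cc < n_cols:
                    W[i, rr * n_cols + cc] = 1.0
    return W

A call such as rook_weights(4, 4) reproduces the uncorrected structure of Figure 8.23.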
Different neighbourhood structures could be perfectly legitimate choices. For instance, when analysing observations on a lattice, one could also take diagonal adjacencies into account, or give decreasing weights to first, second or higher order neighbours. In the alternative case of regional data, two areas could be defined to be neighbours if their distance (computed between their administrative centres) is shorter than a fixed threshold. Cliff and Ord (1981, pp. 17-19) describe a number of alternative and potentially useful definitions of the weight matrix W, including cases where each w_{i,j} is computed as a function of both the distance between s_i and s_j, and the proportion of the perimeter of s_i which is shared by s_j. It might be argued that neighbourhood systems such as those corresponding to (8.40) and (8.41) provide less detailed spatial information than, say, the Euclidean distances used for the purpose of estimating an isotropic variogram model. However, this information loss is usually unimportant in the context of model (8.39), where each site represents a wider surface and not merely a single point.
An important question arising in the statistical analysis of spatial systems is whether the same specification of weights w_{i,j} should apply both to interior and edge sites. The basic problem is that, according to definitions (8.40) and (8.41), sites on the boundary of S will typically have fewer


FIGURE 8.23. Symmetric neighbourhood structure (8.40) on a 4 × 4 square grid S = (s_1, ..., s_16): two sites s_i and s_j are neighbours (w_{i,j} = 1) if they are joined by a line. The smaller dots correspond to unobserved sites outside S, two of which are labelled as u_1 and u_2

neighbours than interior points. This is immediately seen in the simple example of Figure 8.23. For instance, location s_1 has only two neighbours in S (i.e., sites s_2 and s_5) instead of four, since no information is available about the response values at the unobserved sites u_1 and u_2. This deficiency is the source of non-negligible bias in the statistical properties of estimators, which is particularly severe if n is small. Furthermore, the theoretical properties of autoregression models are affected by the existence of boundary sites with a reduced number of neighbours. For example, even the simplest form of SAR model, such as (8.43) in §8.7.2 with p = 1, is not second-order stationary on a finite lattice S under the neighbourhood scheme (8.40) (Exercise 8.7).
For these reasons, we modify the weight matrix W in order to take edge effects into account. In our applications we explore two simple but widely adopted techniques of edge correction. The first one is toroidal correction (Ripley 1981, p. 152), which wraps a rectangular region onto a torus. Edge points on opposite borders are thus considered to be close, and all sites have the same number of neighbours. For example, in the 4 × 4 grid of Figure 8.23, location s_1 becomes a neighbour also of sites s_4 and s_13, so

that toroidal correction implies

w_{1,4} = w_{1,13} = 1.

Toroidal correction can be interpreted by viewing S as a part of a larger grid of identical rectangles, each showing the same values y_1, ..., y_n as S. The unobserved value located immediately to the north of s_1 in Figure 8.23 is then assumed to be y_13, the one on the left is y_4, and so forth.
In the second instance we apply the asymmetric Neumann correction (Moura and Bairam 1992, p. 338), where the off-region neighbours of a boundary site have the same response value as the site itself. That is, in Figure 8.23

y(u_1) = y(u_2) = y_1.     (8.42)

The asymmetric Neumann correction has the advantage of being easily extended to encompass the case of non-lattice data, where the assumption of a toroidal boundary is often not very realistic and may be difficult to implement.
We have also devised a symmetric version of the Neumann correction, which might be called "mirror correction". This correction is appropriate only for symmetric interaction schemes and assumes that the off-region neighbours of a boundary site have the same response value as the in-region neighbours of the site. Recalling Figure 8.23 again, the mirror correction implies that the unobserved values at the off-region neighbours of s_1 are set equal to y_2 and y_5, the readings at its in-region neighbours.
However, in our applications of §§8.8 and 8.10 the results obtained through
the mirror correction are usually very similar to those produced under the
standard Neumann boundary assumption (8.42). For this reason we will
not report them in detail.
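As an illustration of the toroidal correction, the grid construction sketched after (8.41) needs only a wrap-around of the neighbour indices, so that every site has exactly four neighbours. The Neumann and mirror corrections act instead on the values assigned to the off-region neighbours and are not shown; the function name is ours.

import numpy as np

def rook_weights_torus(n_rows, n_cols):
    # rook contiguity with toroidal edge correction: opposite borders of the
    # rectangle are treated as adjacent, as if the grid were wrapped on a torus
    n = n_rows * n_cols
    W = np.zeros((n, n))
    for r in range(n_rows):
        for c in range(n_cols):
            i = r * n_cols + c
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = (r + dr) % n_rows, (c + dc) % n_cols
                W[i, rr * n_cols + cc] = 1.0
    return W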
Adjusting the weight matrix W to allow for edge effects is a step that
must be taken prior to our forward analysis. Hence, it is important to
judge what is the consequence of different choices on the results from the
search. It is hoped that alternative methods of edge correction will give rise
approximately to the same conclusions. We shall see that this is in fact the
case in all our applications.

8.7.2 Simultaneous Spatial Autoregression (SAR) Models


Let y = (y_1, ..., y_n)^T be the response vector. At each location we might have additional (non stochastic) information related to the values of p − 1 spatial covariates. Let X denote the corresponding design matrix of dimension n × p and rank p, allowing also for the mean effect. We consider first-order Gaussian spatial autoregressive models with simultaneous representation (SAR, for short). In compact notation they are defined as follows

(I_n − ρW)(y − Xβ) = ε,     (8.43)

where β = (β_0, ..., β_{p−1})^T is a p-dimensional parameter vector fitting the mean and the spatial covariates, I_n is the n × n identity matrix, ρ is a measure of spatial interaction between neighbouring sites, and ε = (ε_1, ..., ε_n)^T is an n-dimensional vector of disturbances. Errors ε_i are taken to be independent and normally distributed with mean 0 and common variance σ². For (8.43) to be meaningful, it is also assumed that (I_n − ρW) is nonsingular. It is not necessary to assume that W is symmetric, although in our applications we usually consider symmetric weight matrices.
To see the effect of neighbouring observations on the response value at each site, we write the individual equations making up model (8.43). Let x_i^T be the ith row of X. After some elementary manipulation of (8.43), we see that the ith equation of the first-order SAR model is

y_i = x_i^T β + ρ \sum_{j=1}^{n} w_{i,j}(y_j − x_j^T β) + ε_i,     i = 1, ..., n.

Hence, if ρ = 0 the SAR model reduces to the standard normal-theory regression model with independent errors (see §2.8). A value ρ > 0 corresponds to the case of positive spatial dependence between responses at neighbouring sites, while the reverse is true in the less common situation ρ < 0. As in standard univariate regression E(y_i) = x_i^T β. However, under model (8.43) the n × n covariance matrix of y becomes

Σ = σ²{(I_n − ρW)^T(I_n − ρW)}^{-1},

which is not diagonal if ρ ≠ 0. The range of possible values of ρ is constrained by the specific form of W (Exercise 8.8).
strained by the specific form of W (Exercise 8.8).
Estimation of the unknown parameters ß, cr 2 and p in a SAR model
follows by maximization of the loglikelihood function

n 1
L(ß, a 2 , p; y) = -2log(27rcr 2 ) +log! In- pWI- 2a 2 (y- Xß)TS(y- Xß),
(8.44)
where
502 8. Spatial Linear Models

This function is simply the loglikelihood of a single observation y from an


n-variate normal distribution with mean X ß and covariance matrix L: (see
§2.2).
Maximization of the loglikelihood function L(ß, a 2 , p; y) with respect to
the unknowns is usually performed in stages. For a given p, the maximum
likelihood estimates of ß and a 2 are the generalized least squares estimates
(Exercise 8.9)
(8.45)
and
(8.46)

Given the values of /3 and a2 , the maximum likelihood estimate of p, say


p, is then obtained by numerical optimization of the profile loglikelihood
derived from (8.44), subject to the constraint that

B =(In- ßW)T(In- ßW)


is positive definite.
It is worth remarking that maximum likelihood estimation is strongly to
be preferred in the present context. In fact, the least squares estimator of
the autocorrelation parameter p is not even consistent under model (8.43)
(Exercise 8 .11). This explains why a robust and consistent estimate of p
cannot be easily derived from all the data, and the forward search for krig-
ing does not extend to the spatial autoregression models with simultaneaus
specification defined in (8.43). To solve this problem, we introduce the block
forward search algorithm in §8.8, after a brief description of the effects of
masking and swamping on some standard case-deletion diagnostics for SAR
models.

8. 7.3 Spatial Outtiers Under the SAR Model


Recall from §8.2.3 that a spatial outZier is defined to be an observation
which is unusual with respect to its neighbouring values. In the context of
the SAR model (8.43), a common way to assess spatial outlyingness is to
compute individual departures from the fitted model. This is accomplished
through the n x 1 vector of standardized regression residuals

e = ;,- 1 (In- ßW)(y- X/J), (8.4 7)

which defines the lack of fit statistic eT ej(n- p).


Other definitions of residuals can be useful for correlated data, although
our experience with the forward search has shown that alternative choices
usually provide similar guidance in practice. We prefer standardized over
raw residuals as they allow for spatial autocorrelation, which we believe is
a sensible property for the purpose of detecting spatial outliers. Wemade a
8.7 Background on Spatial Autoregression 503

TABLE 8.7. Selection of cases from simulated SAR data with multiple outliers:
readings are at the nodes of a 12 x 12 regular lattice
Row Column i Yi XT
z
1 1 1 85.75 0.384 1.239 0.324
1 2 2 75.86 -0.017 0.844 -1.048
1 3 3 81.17 -0.296 0.333 1.452
... ... ... ... ... ... . ..
1 12 12 74.14 1.507 -1.691 0.017
2 1 13 74.60 -0.293 -1.006 1.678
... ... ... ... . .. . .. . ..
12 12 144 80.61 2.122 -0.697 -0.452

similar pointalso in §8.3.1, when describing the forward search for ordinary
kriging. Furthermore, as in the familiar case of independent observations,
the globallack of fit statistic eT ej(n- p) can be decomposed into a sum of
individual contributions (the standardized residuals) which can be easily
identified with the elements of y.
After so many examples of masking and swamping for backward diagnos-
tics, one might expect that these shortcomings will strongly affect also the
standardized residuals in (8.47). To confirm this expectation we simulated
a data set on a 12 x 12 regular grid S. The response vector y was generated
according to the SAR model (8.43) with p = 4, ß = (20, 5, 4, 3)T, p = 0.2
and cr 2 = 0.5. Simulation of (In- pW)- 1 c was performed after generation
of a normally distributed disturbance vector c and addition of the constant
value 20. The 144 x 3 design matrix X was also obtained by simulation
from a multivariate normal distribution.
Response values Yi were then modified at a 4 x 4 block of sites at the
crossings of rows 1, . .. , 4 and columns 1, ... , 4, plus an additionallocation
on its border (s 49 , in lexicographical order, corresponding to the first Ob-
servation in row 5). Since contamination was performed after simulation of
y, we can think of the 17 modified values as a duster of additive spatial
outliers. Cantamination was not very marked in this example, as it simply
amounted to subtracting small constants (ranging from 0.05 to 2.2) from
the original readings. The full simulated data set after contamination is re-
ported in the website of the book, while Table 8. 7 just shows some selected
rows of it. Figure 8.24 shows a selection three-dimensional views of the re-
sponse values from different perspective angles. Visual inspection does not
provide any particular evidence of contamination.
Figure 8.25 shows a boxplot and histogram of standardized residuals
(8.47) for this simulated data set with multiple spatial outliers. Not sur-
prisingly, masking affects both diagnostic plots and does not allow proper
understanding of the features of the data. Indeed, all contaminated values
remain undetected and only a natural spatial outlier at site s 36 is high-
504 8. Spatial Linear Models

X An le = 120 XAn le = 180

XAn le = 80

X An le = 240

i
.1;
I

FIGURE 8.24. Simulated SAR example with multiple outliers: three-dimensional


view of the data from different perspectives

lighted due to its low standardized residual. The unsatisfactory behaviour


of standardized residuals refl.ects the poor breakdown properties of maxi-
mum likelihood estimators in spatiallinear models. An additional problern
comes from the fact that the unknown autocorrelation parameter p is re-
placed by its sample non-robust estimate p in the computation of e.

8. 7.4 High Leverage Bites


In §2.8 we briefl.y encountered the hat matrix

The ith diagonal element of H is

hi = xf(XT X)- 1 xi ,

usually called the leverage of the ith observation. The leverage measures
how far the regressor values for that unit are from the bulk of the data in
the space spanned by the columns of X. Detection of high leverage points
is an important step of model building because they may exert an undue
inftuence on the computed fit. The effect of a high leverage point is to force
8.7 Background on Spatial Autoregression 505

N
g

36 - - - - 0

- -3 -2 -1 0 2

FIGURE 8.25 . Simulated SAR example with multiple outliers: boxplot and his-
togram of standardized residuals

the fitted model close to the observed value of the response variable. Hence,
high leverage points typically have small residuals, even in the absence of
masking (see, e.g. , Atkinson and Riani 2000, Chapter 2).
The concept of leverage can be generalized in two useful ways which take
account of spatial dependence in the response variable. The first extension
is to define individualleverages as the diagonal elements of

(8.48)

The second possibility is to consider the diagonal elements of

(8.49)

which are called complementary leverage8. Martin (1992) shows the prop-
erties of both generalized leverage measures. He makes the point that with
correlated data the most useful definitionisthat of equation (8.49).
To see the performance of these generalized leverage measures with mul-
tiple high leverage points, we introduce a new simulated data set, similar
to the one described in Section 8. 7.3. Here S corresponds to the tiles of
a 20 x 20 regular grid. The response vector y was simulated according to
model (8.43) with p = 4, ß = (20 , 5, 4, 3)T as before, p = 0.1 and a 2 = 1.
The 400 x 3 design matrix X was also obtained by simulation from a multi-
variatenormal distribution, but the covariate values at sites 8 2 , 8 3 and 8 24
506 8. Spatial Linear Models

TABLE 8.8. Selection of cases from simulated SAR data with multiple high lever-
age points: readings are at the nodes of a 20 x 20 regular lattice

Row Column i Yi XT
1 1 1 26.64 -1.128 '
0.434 -0.269
1 2 2 61.10 2.163 2.92 2.012
1 3 3 64.30 3.486 1.350 2.453
... ... ... ... ... ... . ..
1 20 20 29.64 -1.231 0.494 0.121
2 1 21 36.03 1.241 -1.529 0.887
... .. . ... ... ... ... .. .
20 20 400 33.98 0.099 0.615 -0.890

(in lexicographical order) were slightly modified in order to increase their


leverages. The full data set (after modification) is reported in our website,
while Table 8.8 gives a selection from it.
Westart our leverage analysis of this example by looking at Figure 8.26,
the scatterplot matrix of the response and regressor variables. The three
artificial high leverage sites (also highlighted on the plot) lie on the bor-
derline of all the data clouds, but they are never so extreme as to lead to
unambiguous findings. Furthermore, the scatterplot matrix does not con-
sider spatial information about each observation site, which we know is
particularly importantunder the autocorrelated SAR model.
On the other hand, spatial information is allowed for in Figure 8.27,
which displays a boxplot and histogram of the complementary leverages
computed from (8.49). Even though in this example the contamination
rate is very low (less than 1%), the effects of masking and swamping appear
again, as one sees a plethora of potentially suspect influential observations.
In this situation it is difficult to tell what a sensible conclusion would be.
Furthermore, the definition of both leverage diagnostics (8.48) and (8.49)
implies knowledge of p. We have thus to face again the consequences of
substituting the non-robust estimate p for the unknown autocorrelation
parameter p.

8.8 The Block Forward Search for Spatial


Autoregression
We now outline a new forward search method for regression data with
dependent errors, which can be applied to the SAR model (8.43). The
main improvement of this algorithm over the one described in Atkinson
and Riani (2000) and sketched in Chapter 2 of this book is that it allows
joint estimation of regression and autocorrelation parameters at each step
, l.,. •
8.8 The Block Forward Search for Spatial Autoregression 507

·2 ·I 0 I 2 3

+l

y
.. ~- .. "., . ••
..
.·... .•• ..~
~

l_ . .._
• • g

·. J~
+ 3

:.=~"\.
x1

.. *
.
• I
..
~:.
3+

x2 . ..
.:·,..
-

,...
.
....
+2

• + 3

~
M

"'
0

';-

.
• • '~"'

:r····~. !" 2
2
.......
:\+.2;
. ... . .••' .· .... '..•
0 • ,.• •
x3

....
';-

~
\
. ....
20 30 40 50 so -3 -2 -1 0 I 2 3

FIGURE 8.26. Simulated SAR example with multiple high leverage points: scat-
terplot matrix. The crosses correspond to the three modified high-leverage sites

of the search. For this purpose we pick up n * < n blocks of contiguous


spatiallocations within S and consider these blocks as the basic sets of our
algorithm. Hence, the algorithm is named the block forward search. Blocks
are intended to retain the spatial dependence properties of the whole study
region and are defined to resemble as closely as possible the shape of S (see
§8.8.2). Confining attention to subsets of neighbouring locations ensures
that spatial relationships are preserved by the forward search, so that p
can be estimated within each block together with ß.
The block forward search algorithm for spatial autoregressive models was
first introduced by Cerioli and Riani (2002). Its steps are summarized as
follows.
508 8. Spatial Linear Models

"'
8

3 ~~:::
2==== ...
0

"'"'d
24 - - - -
0

0.98 0.99 1.00 1.01 1.02

FIGURE 8.27. Simulated SAR example with multiple high leverage points: box-
plot and histogram of complementary leverages. The three artificial high-leverage
sites are labelled in the boxplot

8. 8.1 Subset Likelihood


Let A be a collection of sites Si E S. Define SA to be the subset of S
indexed by locations in A and denote by a the cardinality of SA. Similarly,
take YA, XA, WA and 3A tobe the blocks of y, X, Wand 3 corresponding
to locations in SA. Also W A has to be corrected for edge effects through one
of the methods described in §8.7.1. This adjustment might be particularly
important when a is small as compared to n.
The exact loglikelihood function based only on observations in S A is
described in Martin (1984) and is given by

However, in our applications we often work with the simpler approxima-


tion

-~log(27ra 2 ) + loglla- pWAI

-2:2 (YA- XAß)T=:A(YA- XAß), (8.51)


8.8 The Block Forward Search for Spatial Autoregression 509

which is defined in terms of

instead of {(3- 1)A}- 1. Ofcourse, both (8.50) and (8.51) correspond to


the full likelihood function L(ß, a 2 , p; y) given in (8.44) if A = S. If A C
S, LA. (ß, a 2 , p; YA) yields an approximation which can be evaluated much
more quickly than its exact counterpart. Hence, it is particularly appealing
for repeated optimization at subsequent steps of the block forward search
algorithm. In what follows we denote by /JA, <7~ and PA the maximizers
of either the exact likelihood (8.50) or its approximation (8.51), depending
on the context.

8. 8. 2 Defining the Blocks


The initial subset is selected among n* blocks B< of contiguous sites. Let b<
be the cardinality of the Lth block, ~ = 1, . . . , n*. Each block is made up of
neighbouring sites, so it can serve as an eiemental sample for the purpose of
estimating the model parameters. In accordance with the notation set up
in §8.8.1, we write ßs,, <7~, and PB, for the maximum likelihood estimates
of ß, a 2 and p computed from block B<.
To mirnie the spatial properties of model (8.43), blocks are defined to
retain as closely as possible the same shape as S . For instance, when S
is a regular square lattice each B< is simply a square of b< = !ivfn x !ivfn
contiguous tiles of S, for 0 < Ii < 1 a fixed constant.
In the general situation where S does not have a regular shape, blocks can
be obtained as follows . Define D(v1 , v2 ) = (0 , vl) x (0, v2 ) tobe the interior
of a reetangle of edges v 1 and v 2 . Edges are chosen in such a way that S ~
D( v 1, v2). Choose a pair of subshape sizes, say <P( vl) and <P( v2), depending
on the values of v1 and v2. Define D{ <P(v1), <P(v2)} = {0, <P(v1)} x {0, <fy(v2)}
to be the interior of a reetangle of edges <P( v1) and <P( v2 ), which serves as a
template for obtaining subshapes of S . Then define block B 1 as the subset
of spatiallocations whose coordinate vector Si lies within V{ <P(v1), <fy(v2)}.
Figure 8.8.2 shows an example of this procedure. Subsequent blocks are
defined in a similar way by simply translating the origin of the template
reetangle V{ <P(vl), <fy(v2)}. Since there arenot data outside S, we only use
the rectangles that provide a non-empty intersection with S. Finally, the
neighbourhood structure of block B<, ~ = 1, ... , n*, is given by WB,.
Other inferential procedures for spatially autocorrelated data make use
of blocks of close spatial locations. Subsampling methods provide a major
example of this Iiterature (Politis, Romano, and Wolf 1999). As in subsam-
pling, we suggest selecting the subshape sizes <P(vl) and <P(v2) in such a way
that each b< is roughly proportional to n 112 , to ensure a balance between
efficiency of block estimates and robustness of the method. However, since
robustness is our main focus , the proportionality constants are likely to
510 80 Spatial Linear Models

v2
.----------------------------.

ss
Sg

810
sn

FIGURE 80280 Exemplification of block definition for an irregular network


S = (s1, 000, s2o) with n = 200 Subshape sizes are taken as f/>(vi) = 0025vl
and 4>(v2) = Oo25v2o The template reetangle D{4>(vl),4>(v2)} is showno The cor-
responding block is B1 = (s16, s1s, s19)

be smaller than in subsampling techniqueso In fact, blocks are taken tobe


the actual eiemental sets of our algorithm and n* must belarge enough to
ensure that there is at least a block without contaminated unitso We also
avoid the use of overlapping blocks of locations, as this would destroy the
diagnostic power of (partially) ordering spatiallocations through our block
forward search algorithm (see §80805)0 Some numerical results showing the
effect of different choices of cjJ(v1 ) and cjJ(v2) are given in §8°9°
For simplicity, we restriet our description of the block forward search
algorithm to the situation where S is a regular lattice and b. = b for
~ = 1, 000, n* 0 This is also the setting of our examples in §§809 and 80100

80 80 3 Choice of the Initial Subset


As in §80301, we resort to the least median of squares method to find the
starting subseto Let

be the n x 1 vector of standardized residuals computed from the fit to


observations in block B.o This fit is obtained by maximization of either the
exact likelihood function (8050) or its approximate version (8 051), with A =
B.o Furthermore, let ei,B,, i = 1, 000, n, be the ith standardized residual in
e B, 0 We select as our initial subset the block, say B*, which satisfies
2 2 }
e[med],B. = m.ln e[med],B,
° {
'

where the notation is the same as in equation (8029)0


8.8 The Block Forward Search for Spatial Autoregression 511

8.8.4 Progressing in the Search


Let m be the number of spatiallocations used for fitting model (8.43) at a
step oftheblock forward search. As usual, denote by sim} the correspond-
ing subset of S. In the first step we take m 0 =band simo} = B*.
The subset likelihood in each step is derived from either the exact form
(8.50) or from its approximation (8.51), with A = siml. This yields pa-
rameter estimates /3m, a-;,_ and Pm· siml is then updated to sim+km) by
taking the subset of m + km spatial locations with the smallest absolute
standardized residuals in the n x 1 vector
(8.52)
excluding those already in the initial subset. In this search all units but
those forming B* can thus leave the subset. Of course, 1 :::; km :::; b. For
m :2:: b we take either km = b or km = 1. In the examples that follow both
choices give similar results , although the former is often to be preferred as
it provides smoother residual trajectories and estimates of p.
We have also implemented an alternative updating scheme where sites
from B* can be removed from the fitting subset at subsequent steps of the
search, a property which was found useful in standard regression and in
kriging. However, its interpretation is less intuitive for the block forward
search algorithm described here, since blocks and not individual locations
are now the eiemental sets.
The block forward search estimator of the multidimensional parameter
{) = (ßr,a 2 ,p)T is defined as the sequence of its (approximate) maximum
likelihood estimators dm = ({3;;,,ii?r,,ßm)T. That is,
- -r -r -r T
{)BFS = ({)b, ... ,{)m,··· , {)n) ·

8.8.5 Monitaring the S earch


In spatial autoregression the main focus is on parameter estimation, rather
than prediction of unobserved values. Hence, our forward residual plots run
again up to n, as they did in multiple regression (Atkinson and Riani 2000)
and in the multivariate applications described in previous chapters of this
book.
Standardized residuals. We plot all n standardized residuals in the
vector em for each value of m . This is usually a most informative plot,
showing how the postulated SAR model fit varies throughout the search.
Generalized leverage measures. Following the rationale of §8.7.4, if
p > 1 we also display the diagonal elements of the m x m matrices
(8.53)

and
(8.54)
512 8. Spatial Linear Models

for the m sites belonging to si"'l . In both (8.53) and (8.54)

B8 iml = (Irn- Prn W 8 im) )T (Irn - Prn W 8 im) ).

These diagnostics are computed for each value of m. Their display extends
the forward leverage plot of Atkinson and Riani (2000, p. 34) to spatially
correlated observations.
Spatial interaction parameter. The estimate of the spatial interac-
tion parameter p does not remain constant during the search as new blocks
of spatial locations join the fitting subset. The forward plot of the (ap-
proximate) maximum likelihood estimates Prn provides a diagnostic tool
for detecting the effect of spatial outliers on estimation of p.
Likelihood ratio test. At each step of the search we also compute the
signed square-root of the likelihood ratio statistic (see §2.3) which tests the
null hypothesis
Ho: p =Po,
for a number of plausible null values Po.
To display the form of this statistic, consider the approximate loglikeli-
hood (8.51) and let ßrno and a-;, 0 be the corresponding estimates of ß and
o- 2 computed with A = si=) and p = PO· The test statistic at step m is
then (Exercise 8.12)

sign(/Jrn -Po) { 2L ~im) (/-Jrn , a-;,., Prn; Ysim) )


1/ 2
- 2L ~iml (ßrno, &?no, Po; y si"'l)
A }

. ('Prn- Po ){1og I~ ~-1 I l (' 2 /' 2)


Sign • s.(ml 0 - m og crrn o-rno
.::. 8 tml.::.

~ ~ } 112
A - 2 T
- o-rn r 8 (m) .::.S(m)T s<m)
• • *
+ O"rnors(m)
A-2 T
* 0
.::.8
*
(m) 0 Ts<=lo

.(8.55)

In (8.55) r si"'l = y siml - X siml ßrn is the m x 1 ordinary residual vector


computed from ßrn for the m units belonging to si=), r si"'lo = Ysiml -
X siml ßrno is the corresponding vector under H 0 , and

If Ho is true the signed square-root of the likelihood ratio statistic be-


haves approximately as a standard normal random variable. Hence its com-
putation allows some confirmatory statements about the true value of p,
as soon as m gets reasonably large. Furthermore, we might expect that the
likelihood approximation (8.51) affects both L * (ml (/-Jrn, a-;,, ßrn; Ys<ml) and
s. •
L ~iml (/Jrno, a-;;,o, Po; Ysi"'l ), so its effect should b e smaller on TN(P) than on
the point estimator Prn.
8.9 SAR Examples With Multiple Contamination 513

We display forward plots of the signed square-root likelihood ratio statis-


tic TN(P) for a number of plausible null values po, in a way similar to the
fan plot of §4.5. For the reasons outlined above, they provide a useful sup-
plement to the plot of Pm for the purpose of detecting the effect of spatial
outliers on estimation of p.
Ordering the data. As we have seen with all our forward algorithms,
the block forward search often allows us to rank the locations in S following
Si
their entrance step into m). This gives an ordering of the data according to
their degree of agreement with the postulated SAR model (8.43). However,
if the algorithm is run with km = b this ordering is only partial, since all
sites belonging to the same block aretobe regarded as equivalent. For the
same reason we do not support the choice of overlapping blocks of spatial
locations, as in that situation each site would belong to more than one
block and ordering would become impossible.

8.9 SAR Examples With Multiple Cantamination


In this section we show the power of the block forward search algorithm
in a few simulated examples where we know the nature and the amount
of contamination. We first analyse the simulated data set with masked
spatial outliers introduced in §8. 7 .3. Then we consider a new example where
outliers have a large influence on the estimate of the spatial interaction
parameter p. Finally, we move to the data set with masked high leverage
points described in §8. 7.3. In all instances, we can appreciate the ability of
the block forward search to depict the relevant spatial features of the data.

8. 9.1 M asked Spatial Outliers


We begin this section with the data set described in §8. 7.3. Recall that this
data set was simulated under the SAR model (8.43) with an additional
duster of 17 additive spatial outliers. Furthermore, contamination was not
very marked in this example and came undetected both by visual inspection
of the data and by standard spatial diagnostic techniques.
First, we adopt toroidal edge corrections and the fast approximation to
the subset likelihood function given in (8.51). We start our analysis by
performing a block forward search with non-overlapping square blocks of
dimensions 4 x 4 and rule km = b. Figure 8.29 is the resulting forward plot
of standardized residuals (8.52). The effect oftheblock search is dear in the
segmented nature of the forward plot. The trajectories corresponding to the
contaminated duster are marked in black in the picture. They dearly stand
apart from the others, as these sites have the largest standardized residuals
for most of the search. In addition, the effect of masking is apparent toward
the end of the algorithm, when all residuals have similar magnitude.
514 8. Spatial Linear Models

"'

----- -------------~----
20 40 60 80 100 120 140
Subset size m

FIGURE 8.29. First simulated SAR example with multiple spatial outliers: for-
ward plot of standardized residuals for b = 16. Curves corresponding to con-
taminated sites are shown in black. Toroidal edge corrections and approximate
maximum likelihood

It is interesting to see that in this example masking becomes a problern


for m < n, as the modified observations are not particularly different from
the bulk of the data and the contaminated corner joins sim) at the step
prior to the final one. Also residuals from a singlerobust fit of model (8.43)
might fail to detect all spatial outliers, as examination of the magnitude
of the least median of squares residuals from the initial subset shows. It is
just the trajectory of each residual in the forward plot that allows proper
understanding of the spatial features of the corresponding location.
To evaluate how our preliminary choices affect the results from the
search, we also ran the block forward search algorithm under different
settings. Figure 8.30 provides the forward plot of standardized residuals
computed from the approximate likelihood function (8.51) and the asym-
metric Neumann edge-correction method introduced in §8.7.1. Figure 8.31
is the same plot obtained through the exact likelihood (8.50) with toroidal
edge corrections. In both cases b = 16 as before. It is reassuring to see
that the main findings provided by the forward search are the same in all
instances, with the contaminated corner clearly standing apart from the
remairring trajectories. It is also worth noting that the effect of the likeli-
hood approximation is negligible even if in this example the true value of p
is close to its upper limit (see Exercise 8.8), a situation where the validity
of (8.51) might be questioned.
8.9 SAR Examples With Multiple Cantamination 515

C\1

------- -------------
20 40 60 80 100 120 140
Subset size m

FIGURE 8.30. First simulated SAR example with multiple spatial outliers: for-
ward plot standardized residuals for b = 16. Curves corresponding to contami-
nated sites are shown in black. Asymmetrie Neumann edge corrections and ap-
proximate maximum likelihood. To be compared with Figure 8.29

C\1

----- ------------ 36

20 40 60 80 100 120 140


Subset size m

FIGURE 8.31. First simulated SAR example with multiple spatial outliers: for-
ward plot of standardized residuals for b = 16. Curves corresponding to contam-
inated sites are shown in black. Toroidal edge corrections and exact maximum
likelihood. To be compared with Figure 8.29
516 8. Spatial Linear Models

20 40 60 80 100 120 140


Subset size m

FIGURE 8.32. First simulated SAR example with multiple spatial outliers: for-
ward plot of standardized residuals for a reduced block size (b = 9). Curves cor-
responding to contaminated sites are shown in black. Toroidal edge corrections
and approximate maximum likelihood. Tobe compared with Figure 8.29

Also, the choice of a different block size does not appreciably change the
results from the search. For instance, Figure 8.32 displays the forward plot
of standardized residuals for b = 3 x 3, toroidal correction and approxi-
mate likelihood. Again, this plot depicts essentially the same information
as Figure 8.29, although there is more variability in the first stages of the
search, due to the smaller size of the fitting subset. The effect of masking
also shows up a bit earlier, as spatial outliers are now spread over a larger
number of blocks.

8.9.2 Estimation of p
We introduce a new example where multiple spatial outliers have a dispro-
portionate effect on the estimate of p, not only on the residuals from the
fitted model. For this purpose, we simulated a data set from the SAR model
(8.43) with S a 12 x 12 grid, ß = (20 , 5, 4, 3)T as in §8.9.1, p = 0.1 and
<7 2 = 1. Then we modified Yi at a block of 16 sites located at the crossings
of rows 1, ... , 4 and columns 1, ... , 4, by subtracting 8 from all of them.
The full simulated data set after contamination is reported in our website,
while Table 8.9 provides a selection of it. Fixed contamination increases
the similarity of the response values in the outlying corner and thus has
a larger infl.uence on estimation of p. Different three-dimensional views of
the contaminated data set are shown in Figure 8.33.
8.9 SAR Examples With Multiple Contamination 517

TABLE 8.9. Selection of cases from the second simulated SAR example with
multiple outliers: readings are at the nodes of a 12 x 12 regular lattice

Row Column i Yi XT

1 1 1 32.78 0.548 0.822 0.453
1 2 2 22.47 -0.816 0.717 -0.933
1 3 3 20.31 -0.168 -1.25 0.112
. .. . .. .. . ... ... . .. . ..
1 12 12 41.13 -0.011 1.509 0.488
2 1 13 14.05 -2.078 0.367 -0.882
. .. ... ... .. . ... ... . ..
12 12 144 26.53 0.089 -0.905 -1.404

X e ~ 120 X ~ 180

XA e~ 60

X = 240

FIGURE 8.33. Second simulated SAR example with multiple outliers:


three-dimensional views of the data from six perspective angles

Although in this example contamination on y is marked and might be


detected also by other methods, the block forward search still provides
important additional information through the display of residual trajec-
tories and the monitoring of diagnostic quantities. Furthermore, it is the
only robust method that is able to highlight the inferential effect of each
observation on the estimate of the spatial interaction parameter p.
518 8. Spatial Linear Models

20 40 60 80 100 120 140


Subset size m

FIGURE 8.34. Second simulated example with multiple spatial outliers: trajec-
tories of standardized residuals for individuallocations. Curves corresponding to
contaminated sites are shown in black. Toroidal edge corrections and approximate
maximum likelihood

The block forward search algorithm is applied with b = 4 x 4 and km = b,


although similar results are obtained with different block sizes. First, we
adopt toroidal correction and the approximation (8.51) to the likelihood
function of blocks. As in our example of §8.9.1, we start by looking at the
forward plot of standardized residuals, Figure 8.34. Contaminated locations
stand out very clearly in this plot due to their outlying trajectories. Despite
the large amount of contamination, masking is still present in the final step,
when sim) = S, and particularly so for the Observations at sites S14 and
s1s (in lexicographical order).
An interesting picture in this example is Figure 8.35. The left-hand panel
of this figure shows the forward plot of Pm· The right-hand panel gives the
forward plot of the signed square-root likelihood ratio statistic TLR ,m for a
few null values Po 2: 0, tagether with 99% pointwise confidence bands. The
limits of these bands are taken from the asymptotic N(O, 1) distribution
of TLR,m· The effect of including the contaminated subset is paramount in
the final step of both displays, raising the estimate of the autocorrelation
parameter from p = 0.103 based on si 128 ) to p = 0.223 based on all the
data. Multiple outliers grossly mislead confirmative analysis based on the
likelihood ratio statistic, with the true value p = 0.1 being wrongly rejected
only after their inclusion. On the other hand, it is seen that the most
plausible value at the end, p = 0.2, lies well outside the 99% confidence
bands for most of the search.
8.9 SAR Examples With Multiple Cantamination 519

1.0

20 40 60 80 100 120 140 20 40 60 80 100 140


Subset size m Subset size m

FIGURE 8.35. Second simulated example with multiple spatial outliers: left-hand
panel, forward plot of the maximum likelihood estimate of p; right-hand panel,
forward plot of the signed square-root likelihood ratio statistic for testing
Ho : p = po , for a few null values po 2': 0, with 99% asymptotic confidence
bands. Toroidal edge corrections and approximate maximum likelihood

As in the example of §8.9.1, our findings arenot affected by the specific


edge correction method, nor by use of the approximate likelihood function
(8.51) instead of its exact counterpart. Figure 8.36 displays the forward
plots of Pm and the signed square-root likelihood ratio statistic TLR,m com-
puted under asymmetric Neumann correction and approximate likelihood.
Similarly, Figure 8.37 shows the same graphs for toroidal correction and
exact likelihood. It is clearly seen that these pictures provide essentially the
same information as Figure 8.35. The only difference is the more accurate
estimate of p obtained through the exact likelihood function (8.10) when
m ~ n/2.
We conclude this example by stressing the value of the block forward
search algorithm for the purpose of detecting the infl.uence of multiple out-
liers on estimation of p. Furthermore, outliers enter toward the end of the
algorithm in a step that can be easily identified by our diagnostic plots: see
the sharp increase in the curves of Pm and TLR ,m when m = 128. Quantities
computed before outlier inclusion can thus be regarded as robust against
them.

8. 9. 3 Multiple High Leverage Sites


We now turn to the simulated data set with masked high-leverage points
described in §8.7.4. The block forward search is run over non-overlapping
square blocks of dimensions 4 x 4, so that b = n/9. For each value of m we
calculate the generalized leverage measures (8.48) and (8.49) only for the
520 8. Spatial Linear Models

20 40 60 80 100 120 140 20 40 60 80 100 140


Subset size m Subset size m

FIGURE 8.36. Second simulated example with multiple spatial outliers: left-hand
panel, forward plot of the maximum likelihood estimate of p; right-hand panel,
forward plot of the signed square-root likelihood ratio statistic for testing
Ho : p = po, for a few null values Po ?: 0, with 99% asymptotic confidence bands.
Asymmetrie Neumann edge corrections and approximate maximum likelihood.
To be compared with Figure 8.35

I!)

I!)

20 40 60 80 100 120 140 20 40 60 80 100 140


Subset size m Subset size m

FIGURE 8.37. Second simulated example with multiple spatial outliers: left-hand
panel, forward plot of the maximum likelihood estimate of p; right-hand panel,
forward plot of the signed square-root likelihood ratio statistic for testing
Ho : p = po , for a few null values po ?: 0, with 99% asymptotic confidence
bands. Toroidal corrections and exact maximum likelihood. To be compared with
Figure 8.35
8.9 SAR Examples With Multiple Cantamination 521
(0
0
0

-------
-------------
--------
===---:.-
.......
N
0
0

0
0
385 390 395 400 385 390 395 400
Subset size m Subset size m

FIGURE 8.38. Simulated SAR example with multiple high leverage points: for-
ward plots of, left-hand panel, leverages and, right-hand panel, complementary
leverages in the last 20 steps of the search. Curves corresponding to contaminated
high-leverage sites are shown in black. Toroidal edge corrections and approximate
maximum likelihood

observations in the subset. Rule km = 1 is then adopted for progressing in


the search, as we need leverage information with respect to other locations
within the same block. Repeated evaluation of the exact loglikelihood func-
tion (8.50) becomes computationally demanding with the large subset sizes
towards the end of the search. For simplicity, we thus restriet ourselves to
the fast approximation given in (8.51) with toroidal edge corrections.
Figure 8.38 displays the resulting forward plots of estimated leverages
and complementary leverages in the last 20 steps of the search. Each site
is monitared following its inclusion into sim).
The curves corresponding
to the three artificial high-leverage points are highlighted in the plot. The
effect of masking appears towards the end of the algorithm, when these
sites have leverage measures comparable to those of other locations.
Masking may be a bit surprising for site 8 24 , which is not a neighbour of
8 2 and 8 3 according to our definition of the weight matrix W . Nevertheless,
monitaring leverage diagnostics at subsequent steps of the block forward
search clearly shows the different behaviour of all the three sites that were
contaminated in the regressor variable space. Although both panels of Fig-
ure 8.38 convey essentially the same information, the picture on the left
shows smoother trajectories for uncontaminated sites.
It is worth noting that the estimated leverage of site 8 3 is always less ex-
treme than the estimated leverage of 8 16 , a natural high-leverage site, and
becomes almost unnoticeable in the last few steps of the search. As in pre-
vious forward residual plots, as weil as in the forward plots of Mahalanobis
distances we have encountered in earlier chapters, it is thus the shape of
522 8. Spatial Linear Models

the corresponding trajectory that clearly Ieads to proper recognition of the


peculiarity of this point in the space spanned by the columns of X si"'l. On
the other hand, the trajectory of 8 16 is quite similar to that of the bulk of
the other locations, thus suggesting that the remoteness of this site does
not depend on inclusion of the units which enter in the last steps of the
algorithm.

8.10 Wheat Yield Data Revisited


As a final example, we reconsider the Mercer and Hall wheat-yield data
set analysed in §8.5. Herewe dismiss the specific (although a bit ambigu-
ous) spatial information about plot size that was required by kriging, and
simply apply the basic SAR model (8.43). Furthermore, we have shown in
§8.5 that this data set does not seem to be affected by masking, but only
contains a few isolated spatial outliers. Hence it is useful to check that the
block forward search does not introduce spurious information with rela-
tively "well-behaved" spatial data.
In this example there are no explanatory variables. Hence the SAR model
is purely spatial, p = 1 and the leverage measures discussed in §§8.7.4
and 8.9.3 are not relevant. We run the block forward search over non-
overlapping blocks of dimensions b = 4 x 5, to refiect the reetangular shape
of the study area. We update S~m) according to rule km = b. For compu-
tational simplicity, we again restriet ourselves to the fast approximation to
the likelihood function given in (8 .51) with toroidal edge corrections. We
obtained similar results under different assumptions.
Figure 8.39 is the resulting forward plot of standardized residuals (8.52).
Individual trajectories have a regular and homogeneous behaviour, with
only three possible spatial outliers. Two of them (locations 8 2 go and 8 3 6 0 )
were already described as anomalous by the forward search for ordinary
kriging. The low response at site 8 5 now becomes an outlier because we
restriet spatial information to neighbouring sites and dismiss information
at larger lags. For instance, the ordinary kriging predictor at 8 5 was also
infiuenced by other comparably low values in column 5, so making y 5 less
unusual.
Since the outlying locations are scattered within S, there is no masking
towards the end of the search. The number of curves in the top half-plane
of the plot is approximately equal to n/2, so that there is no evidence of a
spatial trend . We conclude that the information gained through the block
forward search is broadly similar to that obtained within the geostatistical
framework of §8.5 .
Finally, we check the infiuence of individual blocks of sites on estimation
of the spatial interaction parameter p. Figure 8.40 shows the forward plots
of the maximum likelihood estimate Pm and of the likelihood ratio statis-
8.10 Wheat Yield Data Revisited 523

V /;,~~~~~------ ~:~~~~~~--------
,;·=·:> ~~=~-==-::-: :-.:-::-::-.:-: :-:::-.:-:-:::-.:-~;-:-;-:;~-- =---==---=-=---:::-.:- ~7

~:~~;~~:-~~:~~-:.::.:::::==~:~~~=-=~::-~=~:~ 5
--- -----------
100 200 300 400 500
Subset size m

FIGURE 8.39. Wheat yield data: forward plot of standardized residuals from the
SAR model. Toroidal edge corrections and approximate maximum likelihood

C\1
C\1
c:i
~0

0
0
C\1
c:i

(X)

c:i

<0
~

c:i

100 200 300 400 500 100 200 300 400 500
Subset size m Subset size m

FIGURE 8.40. Wheat yield data: left-hand panel, forward plot of the maximum
likelihood estimate of p; right-hand panel, forward plot of the signed square-root
likelihood ratio statistic for testing Ho : p = po, for a few null values po 2:: 0, with
99% asymptotic confidence bands. Toroidal edge corrections and approximate
maximum likelihood

tic TLR,m, for a number of null values Po > 0. In the right-hand panel,
the corresponding 99% pointwise confidence bands are displayed. The fi-
nal estimate Pn = 0.16 corresponds to that obtained through standard
maximum likelihood. As the search stabilizes, values of p belonging to the
524 8. Spatial Linear Models

interval (0.15; 0.175) become increasingly plausible. We see that the blocks
included in the last steps of the search have only a minor effect on the
estimated autocorrelation parameter p. This is sensible behaviour in the
absence of clusters of masked spatial outliers.

8.11 Further Reading


Kriging was developed within the framework of geostatistics and its aim
was prediction of ore grades in mineral deposits. Despite the origin of the
term, Matheron (1963) is usually credited for the mathematical develop-
ment of this methodology. Since its early development, the scope of kriging
models has broadened appreciably and their applications now extend to
many fields in the earth and natural sciences (Goovaerts 1997, Chilesand
Delfiner 1999 and Webster and Oliver 2001). Part I of Cressie (1993) is
a comprehensive overview of both kriging theory and application from a
statistical point of view, while Christensen (2001) provides a more concise
account. Stein (1999) gives a book length treatment of the theoretical prop-
erties of spatial prediction, also detailing the effect of variagram estimation
on the theoretical properties of kriging predictors. The work by Diggle,
Tawn, and Moyeed (1998) extends the potential of kriging applications to
more general error structures than the usual normal distribution.
Note that best linear unbiased prediction can be embedded within a
number of alternative theoretical frameworks, including generalized linear
regression (Goldberger 1962) and the geometry of Hilbert spaces (Brock-
well and Davis 1991). From a computational point of view, commonly used
software for fitting kriging models under standard assumptions is avail-
able through packages S-Plus SPATIALSTATS (Mathsoft 1996) and GSLIB
(Deutsch and Journel 1998); see also Venables and Ripley (2002).
The popularity of spatial autoregressive models mainly came after the
work of Besag (1974), Ord (1975), Cliff and Ord (1981) and Ripley (1981).
In the last twenty years such models have been adopted for represent-
ing spatial variation in a variety of fields. For instance, Haining (1990),
Cressie (1993), Griffith and Layne (1999), Nair, Hansen, and Shi (2000)
and Müller (2001) give a number of interesting applications in agriculture,
epidemiology, socioeconomic analysis, manufacturing and experimental de-
sign. Commercial software for computing the maximum likelihood fit of
spatial autoregressive models again includes the package SPATIALSTATS
of S-Plus (Mathsoft 1996).
The importance of exploratory methods in both spatial prediction and
autoregression models has been stressed by many, including Cressie (1986),
Haining (1990) and Haslett, Bradley, Craig, and Unwin (1991) . However,
the usual approach relies on computation of case-deletion statistics. The
theoretical properties of such backward diagnostics are described in Martin
8.11 Further Reading 525

(1992), Christensen, Johnson, and Pearson (1992) and Haslett and Hayes
(1998).
526 8. Spatial Linear Models

8.12 Exercises
Ordinary kriging
Exercise 8.1 Prove equations (8.8} and (8.9}, giving the ordinary krig-
ing predictor f;(soiS) and its mean-squared prediction error a- 2 (soiS) under
model (8.2}.
Exercise 8.2 Obtain the analogues of equations (8.8} and (8.9}, giving the
ordinary kriging predictor f;(soiS) and its mean-squared prediction error
a- 2 (s 0 IS) under model (8.2}, in terms of the covariance function c(h).
Exercise 8.3 Show that the ordinary kriging predictor f;(soiS) is an exact
interpolator under model (8.2) .
Exercise 8.4 Show that the variagram function 2v(h) must be condition-
ally negative definite.
Exercise 8.5 Consider the measurement error model (8.11}. If 8(s) is a
second-order stationary process, write the ordinary kriging predictor as a
function of the measurement error variance a-~ (Cerioli and Riani 1999).
Exercise 8.6 Under the ordinary kriging model, let C be the n x n covari-
ance matrix of y and J be an n x 1 vector of ones. Furthermore, define Y(i)
to be vector y with the i th observation excluded, Ce i) to be the (n -1) x (n-1)
covariance matrix of Y(i)> c(i) be the (n- 1) x 1 vector of covariances be-
tween Yi and Y(i)> and J(i) to be vector J with the ith entry removed. Show
that the variance of the prediction residual e(i) = ei,S<,> = Yi - Yi,S<,l is
b2
var(e(i)) = --•---,
bi- hi

wh ere bi = a-y- r c-1


2 c(i) h-
(i) C(i)> i = xi
- = 1 - c(i)
-2(Jrc-1J)-1 , Xi r c-1 J
(i) (i) an
d
a-z = var(yi), i = 1, ... , n (Christensen, Johnson, and Pearson 1992}.

Spatial autoregression
Exercise 8. 7 Show that the first order BAR model (8.43) is not stationary
without edge corrections.
Exercise 8.8 Let Wi be the ith eigenvalue of W . Show that the spatial
interaction parameter p of a jirst-order BAR model with W = wr must
satisfy
pwi < 1 for all i = 1, ... , n.

Exercise 8.9 Show that /3 in (8.45) and a2 in (8.46} are the maximum
likelihood estimates of ß and a- 2 under the jirst-order BAR model (8.43}.
Give the profile loglikelihood for p.
8.12 Exercises 527

Exercise 8.10 Give the asymptotic covariance matrix of the maximum


likelihood estimates /3, a2 and p under the first-order SAR model (8.43).
Exercise 8.11 Show that the least squares estimator of p is inconsistent
under the jirst-order SAR model (8.43).
Exercise 8.12 Obtain the expression ofthe likelihood mtio statistic (8.55).

Exercise 8.13 Define Y(i) tobe the (n-1) x 1 vector of observations in the
reduced network S(i)> excluding location si. Let E(yiiY(i)) and var(yiiY(i))
be the conditional expectation and the conditional variance of Yi, given Y(i).
Furthermore, as in model (8.43), let X denote a covariate matrix of di-
mension n x p and rank p, allowing also for the mean effect, and ß be
a p-dimensional pammeter vector. A first-order conditional autoregressive
model (GAR, for short) is defined by the following assumptions.
• A uforegressive structure of conditional expectations
n
E(yiiY(i)) = xT ß + P L Wi,j(Yj- xJ ß) i = 1, . .. ,n, (8.56)
j=l

where xf is the ith row of X and the nonnegative weights Wi,j provide
the neighbourhood structure of Si. These weights are defined as in
§8. 1.1, so that wi ,i = 0, but with the additional constmint that

Wi,j = Wj,i i,j=1, ... ,n. (8.57)

• Gonditional homoschedasticity

i = 1, ... ,n.

• Normality of the conditional distribution

The conditional density ofyi given Y(i), say f(YiiY(i);ß;a 2 ;p) is uni-
variate normal.
To summarize, a first order GAR model assumes that, for i = 1, ... , n,

(8.58)

where J.Li = E(yiiY(i)) is given in equation (8.56). The conditional density


(8.58) depends on Y(i) only through the neighbouring sites of si, for which
Wi,j > 0. Therefore, a first order GAR model exhibits a spatial analogue of
the Markov property for time series.
(a) Show that, under a first order GAR model,

y"' Nn(Xß, L:cAR), (8.59)


528 8. Spatial Linear Models

where
~CAR = a 2 (In- pW)- 1
and W = (wi ,j), i,j = 1, ... , n, is the n x n weight matrix.
(b) Write the loglikelihood function of a first order GAR model and show
why it is different from the loglikelihood of the first order SAR model {8.43}.
Write the resulting profile loglikelihood for p.
(c) Let f-l* = (f-li, ... ,f-l~)T be the n x 1 vector of conditional expectations
f-l: = E(YiiY(i)) and define E:* = y- f-l*. Show that

cov(c*, y) = a 2 In.

8.13 Salutions
Exercise 8.1
Under the ordinary kriging model (8.2), we look at a predictor which is both
linear in the sample values, y(soiS) = 2::~= 1 "liYi, and uniformly unbiased,
i.e. E{y(soiS)} = E{y(s 0 )} = f-l for all f-l· The unbiasedness condition yields
2::~= 1 r]iE(yi) = 2::~ 1 "lif-l = f-l, that is

(8.60)

The corresponding squared prediction error, spe for short, is

by property (8.60). Adding and subtracting 2::~= 1 'fJiYT and rearranging


terms gives

n n n n n
L ryi{y(so)- yi} 2 +L L 'fJi'fJjYiYj - L L 'fJi'rJiY'f(8.61)
i=1 i=1 j=1 i=1 j=1

again by property (8.60). Furthermore, if the last term of (8.61) is split


into two identical parts, we obtain
nn nn 1 nn
LL'fJi'f/jYiYj- LL'fJi'fJjY'f = -2 LL'fJi'fJj(Yi- Yj) 2 . (8.62)
i=1j=l i=1j=l i=lj=l
8.13 Solutions 529

The mean-squared prediction error u 2 (s0 IS) is the expectation of {y(s 0 ) -


y(soiS)F. Hence, from equations (8.61) and (8.62)
n n n
u 2 (soiS) = L"Ji2v(so-si)- LL'TJi'T}jv(si - sj) = 2ryTv-ryTY'T} (8.63)
i=l i=l j=l

in the notation of §8.2.1, where v(h) denotes the semivariogram. Equa-


tion (8.63) has to be minimized with respect to '7} under the unbiasedness
constraint (8.60), which can also be written as

r? J = 1,

for J an n x 1 vector of ones. This constrained optimization problern is


solved by introducing a Lagrange multiplier, say -2a, and by minimizing
the function
r.p = 2'7}TV - 'T}Ty'T}- 2a(ryT J- 1),
with respect to the n + 1 unknowns '7]1, ... , 'T}n and a. Taking the partial
derivatives of r.p with respect to '7} and a and equating them to zero leads
to the system of n + 1 linear equations

Yry+aJ =v
{ (8.64)
'T}T J = 1.

Let 'TJ+ and v + be the (n + 1) x 1 vectors


and

respectively. If we define Y + tobe the (n + 1) x (n + 1) matrix

the linear equation system (8.64) becomes

(8.65)

This system has the unique solution

(8.66)

if Y + is nonsingular. The optimal weight vector '7} then corresponds to the


first n rows of 'TJ+.
Under the ordinary kriging model (8.2), uniqueness of 'TJ+ follows if Y
is strictly conditionally negative definite, that is if condition (8.19) holds
with a strict inequality provided that a 1 , ... , an are not all equal to zero.
This is usually achieved by adopting a valid variagram model and avoiding
duplicate points. For simplicity, here and in what follows we assume that
530 8. Spatial Linear Models

1+ 1 exists. Nevertheless, note that when 1 + is singular there are many


solutions to system (8.65), but y(soiS) is still unique.
We now recall a standard matrix result that leads to an explicit formula
for 'Tl· Let A be a (n + 1) x (n + 1) symmetric nonsingular matrix which is
partitioned as

where A 11 is n x n , a 12 is n x 1 and a22 is a scalar. Theinverse of Ais

- A 11
-1 a12 )
1 ) (8.67)

y+-1 = b-1 ( b
1-1 + 'V'-1JJT'Y'-1
.l
-JT1-1
.l
(8.68)

with b = -JTY- 1J.


From (8.66) and (8.68), we obtain

'TJ y-1v + b-1 1 -1 J Jr 1 -1v _ b-1 1 -1 J


= y-1{v + b-1J(JTy-1v- 1)}

1 _1 ( v + 1 ~;;~~1 v J) , (8.69)

given that both band JTY- 1v are scalars.


Write a = (1- JT1- 1v)(JTY- 1J)- 1. Substituting the optimum 'TJ from
equation (8.69) into equation (8.63) yields

a 2 (soiS) = 2(v + aJ)TY- 1v- (v + aJ)TY- 1(v + aJ)


VT1-1v + aJT1-1v- avT1-1 J- a2 JT1-1 J.

Noting that JTY- 1v = vT1- 1J finally gives

Exercise 8.2
We now assume second-order stationarity, that is

E{y(s)} = 11 and c(s, t) = c(s- t).

Again, we look forthebest linear predictor y(soiS) = 2:~= 1 'TJiYi , under the
unbiasedness constraint 2::~ 1 'TJi = 1. Working in terms of the covariance
8.13 Solutions 531

function, it is convenient to write the squared prediction error spe at site


so as

'Pe ={y(,o)- ii(soiS)}' ~ { y(so)- ~- t, ";(y;- ~)}',


by the unbiasedness property I:~=l 'T/i = 1. Hence,
n n n
(8.70)
i=l i=l j=l

where O"~ = E{y(so) -11Fand c(so- si) = E[{y(so)- ILHYi -11}].


Minimization of (8.70) with respect to TJ under the constraint I:~=l 'T/i =
1 leads to a system of n + 1 linear equations similar to (8.64), but with
v(h) replaced by -c(h). Hence, the optimum weight vector becomes

(8.71)

where c = (c(s 0 - s 1 ), ... , c(s 0 - sn))T and Cis the nxn symmetric matrix
whose elements are given by c(si- Sj), for i,j = 1, ... , n. Again, here and
in subsequent exercises we consider only the case where C is non singular.
From equations (8.70) and (8.71), the corresponding mean-squared pre-
diction error is

since TJT J = I:~=l 'T/i = 1.


Intuitively, relationship (8.3), which holds under the assumption of second-
order stationarity, highlights the equivalence between the formulae for TJ and
0" 2 ( s 0 IS) given above and those displayed in Exercise 8.1. A formal proof

of this statement is provided, e.g., by Christensen (2001).

Exercise 8.3
The exact interpolation property of the ordinary kriging predictor (without
measurement error) is perhaps best seen from representation (8.71), where
'T} is written in terms of the covariance function c( h).
First, note that
532 8. Spatial Linear Models

since both JTC- 1J and cTc- 1J are scalars. Hence,

where In is the n x n identity matrix.


Next, let
(8. 72)
denote the generalized least-squares estimator of J.L under the ordinary krig-
ing model (8.2); see, e.g., equation (2.97) in Chapter 2, with p = v = 1 and
X = J. The best linear unbiased predictor at site so, y(soiS) = TJT y, then
becomes

y(soiS) (JTC-l J)-l JTC-ly + cTC-l{fn _ J(JTC-l J) - l JTC-l }y


fj, + qT(y _ fj,J), (8.73)

where vector qT = cTc- 1 satisfies


Cq = c. (8.74)
Note that equation (8.73) is the estimated conditional expectation of y(s 0 )
given y, if {y(s) : s E D} is a Gaussian process. This explains why the
ordinary kriging predictor is best also among all non linear functions of y
under the normality assumption.
Now suppose that the value to be predicted is the observation at the
sample site Si· That is, let s 0 = Si E S in equation (8. 73). Since the
ordinary kriging model (8.2) does not allow for measurement error, vector
c becomes
c = (c(s1- Si), ... , c(sn- Si)f,
which is just the ith column of matrix C.
As in equation (2.91) andin Exercise 2.5, let q(i) be the n x 1 vector

q(i) = (0, ... ,0,1,0, ... ,0)T, (8.75)


where the 1 is in the ith place. Vector q(i) clearly solves system (8.74),
since it extracts the ith column of C:
0
c(s1 - si) c(s1 -Sn)

)
( c("u'
"s,) c(s2- Si) c(s2 - sn)
Cq(i) 1

c(sn-sl) c(sn-si) a2y

c(s,-
0

s,) )
c(s2- Si)
= c.

c(sn- Si)
8.13 Salutions 533

From equation (8.73), we thus see that the bestlinear unbiased predictor
at site si E S is

(1 + q(i)T(y- (1J) = (1 + q(if y- (1q(i)T J


(1 + Yi - (1 = Yi, (8.76)

since premultiplication by q(i)T extracts the ith element from a vector; see
equation (2.92). Result (8. 76) implies that the ordinary kriging predictor
(without measurement error) is an exact interpolator when Si E S.

Exercise 8.4
We consider an intrinsically stationary process {y(s) : s E D} satisfying
conditions (8 .4) and (8.5), a finite collection of spatial sites s 11 ... , Sn and
real numbers a 1 , ... , an such that
n
(8.77)

For any n, the variance of the linear combination L:~=l aiYi is

var (taiYi)
•=1

(8 .78)

by property (8.77). Of course, this variance must be nonnegative. Further-


more,

given that
n n n n

L L aiajy; = L L aia1yJ = 0
i=l j=l i=l j=l

again by property (8.77). As a consequence,


534 8. Spatial Linear Models

and, from (8.78) ,

Hence,

if and only if
n n
L I>iajv(si- sj)::::; 0.
i=l j=l

Exercise 8.5
If 8(s) is a second-order stationary process, the optimum weight vector TJ
can be expressed in terms of the covariance function c( h), as in equation
(8.71). However, under the measurement error model (8.11), the covari-
ance matrix C has to be modified to allow for the variability of repeated
measurements at the same location. The appropriate covariance matrix is
denoted by c* and its elements are given by

c(si- Bj) if i # j,
a y2 + a E:2 if i = J..

That is,
C*=C+a;In. (8.79)
The ordinary kriging predictor is then

y(soiS) = TJ~ y, (8.80)

where
1- JTC- 1 c )
TJ* = C;l ( c+ JTC* tJ J (8.81)

and the n x 1 vector c is defined as in equation (8.71).


Since we want to highlight the contribution of $\sigma_\varepsilon^2$ to $\hat{y}(s_0 \mid S)$, we need an explicit expression for $C_*^{-1}$. The following result is an extension of the Sherman-Morrison-Woodbury formula of equation (2.40). It can be proved either by a constructive reasoning as in §2.5.2, or by showing that multiplication of matrix $A + B$ by the stated inverse gives the identity matrix (see Exercise 2.11). Let $A$ be an $n \times n$ nonsingular matrix, $a$ a positive scalar, and $B = a I_n$. Then,
$$(A + B)^{-1} = (A + a I_n)^{-1} = a^{-1} I_n - a^{-2}(A^{-1} + a^{-1} I_n)^{-1}.$$

In the setting of equation (8.79), $A = C$ and $a = \sigma_\varepsilon^2$. The required inverse is then
$$C_*^{-1} = \sigma_\varepsilon^{-2} I_n - \sigma_\varepsilon^{-4}(C^{-1} + \sigma_\varepsilon^{-2} I_n)^{-1} = \sigma_\varepsilon^{-2}(I_n - \sigma_\varepsilon^{-2} D), \qquad (8.82)$$
where $D = (C^{-1} + \sigma_\varepsilon^{-2} I_n)^{-1}$. Substituting formula (8.82) into equation (8.81) yields
$$\eta_* = \sigma_\varepsilon^{-2}(I_n - \sigma_\varepsilon^{-2} D)\left\{ c + \frac{1 - \sigma_\varepsilon^{-2} J^T c + \sigma_\varepsilon^{-4} J^T D c}{\sigma_\varepsilon^{-2}(n - \sigma_\varepsilon^{-2} J^T D J)}\, J \right\}.$$

Hence, the ordinary kriging predictor (8.80) becomes
$$\hat{y}(s_0 \mid S) = \sigma_\varepsilon^{-2}\left\{ c + \frac{1 - \sigma_\varepsilon^{-2} J^T c + \sigma_\varepsilon^{-4} J^T D c}{\sigma_\varepsilon^{-2}(n - \sigma_\varepsilon^{-2} J^T D J)}\, J \right\}^T (I_n - \sigma_\varepsilon^{-2} D)\, y.$$
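The modified inverse (8.82) is straightforward to confirm numerically. The following sketch, added as an illustration, uses an arbitrary positive definite matrix $C$ and a hypothetical measurement-error variance; both are assumptions made only for this check.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 8
    A = rng.normal(size=(n, n))
    C = A @ A.T + n * np.eye(n)                           # a generic positive definite covariance matrix
    s2_eps = 0.7                                          # hypothetical measurement-error variance

    C_star = C + s2_eps * np.eye(n)                       # equation (8.79)
    D = np.linalg.inv(np.linalg.inv(C) + np.eye(n) / s2_eps)
    C_star_inv = (np.eye(n) - D / s2_eps) / s2_eps        # equation (8.82)
    print(np.allclose(C_star_inv, np.linalg.inv(C_star))) # True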

Exercise 8.6
In order to allow for spatial dependence between observations at different sites, we need more general deletion formulae than those given in our §2.5 and in §2.3 of Atkinson and Riani (2000). However, a useful simplification occurs, since $\mu$ is the only parameter to be estimated in the ordinary kriging model and $J = (1, \ldots, 1)^T$ is the corresponding explanatory variable. Also note that the results derived in this exercise are valid under both models (8.2) and (8.11), since measurement error has no effect on prediction of $y_i$ given the reduced network $S_{(i)}$.
In what follows we assume that the spatial dependence structure is known, so that the matrix $C$ is made up of known constants. Studying the effect of variogram (or covariogram) estimation on the theoretical properties of kriging predictors is an important research area, but goes beyond the scope of this book. We refer the interested reader to the monograph of Stein (1999) for further details.
To simplify notation, we write $e_{(i)}$ instead of $e_{i,S_{(i)}}$, as in §2.5. We also let $y_{(i)}$ denote the $(n-1) \times 1$ vector of observations when $y_i$ is excluded, $C_{(i)}$ be the $(n-1) \times (n-1)$ covariance matrix of $y_{(i)}$, $J_{(i)}$ be vector $J$ with the $i$th entry removed, i.e. an $(n-1) \times 1$ vector of ones, and $c_{(i)}$ be the $(n-1) \times 1$ vector of covariances between $y_i$ and $y_{(i)}$. Accordingly, with the $i$th site ordered last, it is convenient to partition the $n \times n$ covariance matrix $C$ as
$$C = \begin{pmatrix} C_{(i)} & c_{(i)} \\ c_{(i)}^T & \sigma_y^2 \end{pmatrix}. \qquad (8.83)$$

First, we note from equation (8.73) that
$$e_{(i)} = y_i - \hat{y}_{i,S_{(i)}} = y_i - \hat{\mu}_{(i)} - c_{(i)}^T C_{(i)}^{-1}\left( y_{(i)} - \hat{\mu}_{(i)} J_{(i)} \right),$$
where
$$\hat{\mu}_{(i)} = (J_{(i)}^T C_{(i)}^{-1} J_{(i)})^{-1} J_{(i)}^T C_{(i)}^{-1} y_{(i)} \qquad (8.84)$$
is the generalized least-squares estimate of $\mu$ based on the reduced network $S_{(i)}$; see equation (8.72). Therefore, we write

$$e_{(i)} = \tilde{y}_i - \hat{\mu}_{(i)}\, \tilde{x}_i, \qquad (8.85)$$
with $\tilde{y}_i = y_i - c_{(i)}^T C_{(i)}^{-1} y_{(i)}$ and $\tilde{x}_i = 1 - c_{(i)}^T C_{(i)}^{-1} J_{(i)}$.
We know from formula (8.67) that the inverse of $C$ is
$$C^{-1} = b_i^{-1}\begin{pmatrix} b_i C_{(i)}^{-1} + C_{(i)}^{-1} c_{(i)} c_{(i)}^T C_{(i)}^{-1} & -C_{(i)}^{-1} c_{(i)} \\ -c_{(i)}^T C_{(i)}^{-1} & 1 \end{pmatrix}, \qquad (8.86)$$
where $b_i = \sigma_y^2 - c_{(i)}^T C_{(i)}^{-1} c_{(i)}$.

$$J^T C^{-1} J = J_{(i)}^T C_{(i)}^{-1} J_{(i)} + b_i^{-1}\, \tilde{x}_i^2, \qquad (8.87)$$
since $J_{(i)}^T C_{(i)}^{-1} c_{(i)} = c_{(i)}^T C_{(i)}^{-1} J_{(i)}$.
We have thus shown that
$$(J_{(i)}^T C_{(i)}^{-1} J_{(i)})^{-1} = (J^T C^{-1} J - b_i^{-1} \tilde{x}_i^2)^{-1}.$$
A simple adaptation of the Sherman-Morrison-Woodbury formula (2.40) yields, for scalars $\beta$ and $a$,
$$(\beta - a^2)^{-1} = \beta^{-1} + \frac{a^2 \beta^{-2}}{1 - a^2 \beta^{-1}}.$$
Setting $\beta = J^T C^{-1} J$ and $a = b_i^{-1/2}\, \tilde{x}_i$, we obtain
$$(J_{(i)}^T C_{(i)}^{-1} J_{(i)})^{-1} = (J^T C^{-1} J)^{-1} + \frac{b_i^{-1} \tilde{x}_i^2 (J^T C^{-1} J)^{-2}}{1 - b_i^{-1} \tilde{x}_i^2 (J^T C^{-1} J)^{-1}} = \frac{b_i}{b_i - h_i}\,(J^T C^{-1} J)^{-1}, \qquad (8.88)$$
where $h_i = \tilde{x}_i^2 (J^T C^{-1} J)^{-1}$.


We now look at $J^T C^{-1} y$. Reasoning as in (8.87) yields
$$J^T C^{-1} y = J_{(i)}^T C_{(i)}^{-1} y_{(i)} + b_i^{-1} J_{(i)}^T C_{(i)}^{-1} c_{(i)}\, c_{(i)}^T C_{(i)}^{-1} y_{(i)} - b_i^{-1} c_{(i)}^T C_{(i)}^{-1} y_{(i)} - b_i^{-1} y_i\, J_{(i)}^T C_{(i)}^{-1} c_{(i)} + b_i^{-1} y_i = J_{(i)}^T C_{(i)}^{-1} y_{(i)} + b_i^{-1}\, \tilde{x}_i \tilde{y}_i.$$
Therefore,
$$J_{(i)}^T C_{(i)}^{-1} y_{(i)} = J^T C^{-1} y - b_i^{-1}\, \tilde{x}_i \tilde{y}_i. \qquad (8.89)$$

Putting equations (8.88) and (8.89) into (8.84) gives the change in the generalized least-squares estimate of $\mu$ due to deletion of site $s_i$:
$$\hat{\mu}_{(i)} = \hat{\mu} - \frac{\tilde{x}_i(\tilde{y}_i - \tilde{x}_i \hat{\mu})}{(b_i - h_i)\, J^T C^{-1} J}. \qquad (8.90)$$
Substituting back equation (8.90) into (8.85) and remembering that $h_i = \tilde{x}_i^2 (J^T C^{-1} J)^{-1}$ gives
$$e_{(i)} = \frac{(\tilde{y}_i - \tilde{x}_i \hat{\mu})\, b_i}{b_i - h_i}. \qquad (8.91)$$

We now have to compute the variance of (8.91), that is
$$\mathrm{var}\{e_{(i)}\} = \left(\frac{b_i}{b_i - h_i}\right)^2 \left\{ \mathrm{var}(\tilde{y}_i) + \mathrm{var}(\tilde{x}_i \hat{\mu}) - 2\,\mathrm{cov}(\tilde{y}_i, \tilde{x}_i \hat{\mu}) \right\}. \qquad (8.92)$$
We evaluate the components of (8.92) separately.


Recall that $\tilde{y}_i = y_i - c_{(i)}^T C_{(i)}^{-1} y_{(i)}$, $C_{(i)}$ is the covariance matrix of $y_{(i)}$ and $c_{(i)}$ is the $(n-1) \times 1$ vector of covariances between $y_i$ and $y_{(i)}$. Therefore,
$$\mathrm{var}(\tilde{y}_i) = \sigma_y^2 + \mathrm{var}(c_{(i)}^T C_{(i)}^{-1} y_{(i)}) - 2\,\mathrm{cov}(y_i,\, c_{(i)}^T C_{(i)}^{-1} y_{(i)}) = \sigma_y^2 + c_{(i)}^T C_{(i)}^{-1} C_{(i)} C_{(i)}^{-1} c_{(i)} - 2\, c_{(i)}^T C_{(i)}^{-1} c_{(i)} = \sigma_y^2 - c_{(i)}^T C_{(i)}^{-1} c_{(i)} = b_i.$$
In a similar fashion,
$$\mathrm{var}(\tilde{x}_i \hat{\mu}) = \tilde{x}_i^2\, \mathrm{var}\{(J^T C^{-1} J)^{-1} J^T C^{-1} y\} = \tilde{x}_i^2 (J^T C^{-1} J)^{-2} J^T C^{-1} C C^{-1} J = \tilde{x}_i^2 (J^T C^{-1} J)^{-1} = h_i.$$
To obtain the covariance term in (8.92), we recall from §2.10 and Exercise 8.3 that
$$y_i = q(i)^T y,$$
where the $n \times 1$ vector $q(i) = (0, \ldots, 0, 1, 0, \ldots, 0)^T$ has a 1 in the $i$th place; see definition (8.75). Similarly, we define the $(n-1) \times n$ matrix
$$Q(i) = \begin{pmatrix} q(1)^T \\ \vdots \\ q(i-1)^T \\ q(i+1)^T \\ \vdots \\ q(n)^T \end{pmatrix},$$
which is obtained by stacking the $n-1$ row vectors $q(l)^T = (0, \ldots, 1, \ldots, 0)$, for $l \neq i$.
It is easily verified that
$$Q(i)\, y = y_{(i)}.$$
Therefore,
$$\tilde{y}_i = y_i - c_{(i)}^T C_{(i)}^{-1} y_{(i)} = q(i)^T y - c_{(i)}^T C_{(i)}^{-1} Q(i)\, y = \{ q(i)^T I_n - c_{(i)}^T C_{(i)}^{-1} Q(i) \}\, y$$
and
$$\begin{aligned}
\mathrm{cov}(\tilde{y}_i, \tilde{x}_i \hat{\mu}) &= \mathrm{cov}\left[ \{ q(i)^T I_n - c_{(i)}^T C_{(i)}^{-1} Q(i) \} y,\; \tilde{x}_i (J^T C^{-1} J)^{-1} J^T C^{-1} y \right] \\
&= \tilde{x}_i (J^T C^{-1} J)^{-1} \{ q(i)^T I_n - c_{(i)}^T C_{(i)}^{-1} Q(i) \} C C^{-1} J \\
&= \tilde{x}_i (J^T C^{-1} J)^{-1} \{ q(i)^T J - c_{(i)}^T C_{(i)}^{-1} Q(i) J \} \\
&= \tilde{x}_i (J^T C^{-1} J)^{-1} (1 - c_{(i)}^T C_{(i)}^{-1} J_{(i)}) \\
&= \tilde{x}_i^2 (J^T C^{-1} J)^{-1} = h_i,
\end{aligned}$$

noting that $q(i)^T J = 1$ and $Q(i) J = J_{(i)}$.


Collecting these pieces, we finally obtain the desired result from equation (8.92):
$$\mathrm{var}\{e_{(i)}\} = \left(\frac{b_i}{b_i - h_i}\right)^2 (b_i + h_i - 2 h_i) = \frac{b_i^2}{b_i - h_i}.$$
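A direct numerical check of the deletion formula (8.91) can be sketched as follows; the exponential covariance and the site layout are again illustrative assumptions introduced only for this check.

    import numpy as np

    rng = np.random.default_rng(3)
    n = 12
    sites = rng.uniform(0, 10, size=(n, 2))
    d = np.linalg.norm(sites[:, None, :] - sites[None, :, :], axis=2)
    C = np.exp(-d / 3.0)                               # assumed covariance matrix of y
    y = rng.multivariate_normal(np.zeros(n), C)
    J = np.ones(n)
    Cinv = np.linalg.inv(C)
    mu_hat = (J @ Cinv @ y) / (J @ Cinv @ J)           # full-data estimate (8.72)

    i = 3
    keep = np.arange(n) != i
    Ci_inv = np.linalg.inv(C[np.ix_(keep, keep)])
    c_i = C[keep, i]
    mu_i = (J[keep] @ Ci_inv @ y[keep]) / (J[keep] @ Ci_inv @ J[keep])   # estimate (8.84)
    e_direct = y[i] - mu_i - c_i @ Ci_inv @ (y[keep] - mu_i * J[keep])   # deletion residual

    y_t = y[i] - c_i @ Ci_inv @ y[keep]                # y_i tilde
    x_t = 1.0 - c_i @ Ci_inv @ J[keep]                 # x_i tilde
    b_i = C[i, i] - c_i @ Ci_inv @ c_i
    h_i = x_t ** 2 / (J @ Cinv @ J)
    e_formula = (y_t - x_t * mu_hat) * b_i / (b_i - h_i)   # formula (8.91)
    print(np.allclose(e_direct, e_formula))            # True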

Exercise 8.7
We recall that $\{y(s) : s \in D\}$ is said to be second-order stationary if for all pairs $s, t \in D$
$$E\{y(s)\} = \mu \quad \text{and} \quad c(s,t) = c(s - t),$$
where $c(s,t) = \mathrm{cov}\{y(s), y(t)\}$. This also implies that $\mathrm{var}\{y(s)\}$ is constant over $D$. It is then enough to give a numerical example showing that under model (8.43) the elements of the covariance matrix
$$\Sigma = \sigma^2 (I_n - \rho W)^{-1} (I_n - \rho W^T)^{-1}$$
do not depend only on $s - t$.


For instance, let $S$ be a $5 \times 5$ regular grid where the simple weighting scheme (8.40) is adopted without boundary adjustments. Sites in $S$ are numbered lexicographically. The first row of $W$ is then
$$w_1^T = (0, 1, 0, 0, 0, 1, 0, \ldots, 0),$$
given that $w_{1,j} = 1$ only for $j = 2$ and $j = 6$. The other rows of $W$ are obtained similarly (see, e.g., Figure 8.23). Suppose that model (8.43) holds with $\rho = 0.1$ and $\sigma^2 = 1$. Table 8.10 gives the diagonal elements of the resulting $25 \times 25$ covariance matrix
$$\Sigma = \sigma^2 (I_{25} - \rho W)^{-1} (I_{25} - \rho W^T)^{-1},$$
computed through a standard matrix inversion routine. It is apparent that


$\mathrm{var}(y_i)$ is site specific. In a similar fashion, the non-diagonal elements of $\Sigma$ could be seen to depend on the pairs $s_i, s_j$, not only on the difference $s_i - s_j$. On the contrary, if toroidal edge correction is applied we obtain
$$\mathrm{var}(y_i) = 1.1418 \quad \text{for } i = 1, \ldots, n.$$
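These numbers are easy to reproduce. The Python sketch below is an added illustration; the Kronecker-product construction is simply a compact way of writing the binary weights (8.40) in lexicographic order. It prints the diagonal of the SAR covariance matrix with and without toroidal edge correction, and the first block should match Table 8.10.

    import numpy as np

    def lattice_W(m, torus=False):
        """Binary rook-contiguity weights (8.40) on an m x m grid, lexicographic order."""
        P = np.diag(np.ones(m - 1), 1) + np.diag(np.ones(m - 1), -1)
        if torus:
            P[0, m - 1] = P[m - 1, 0] = 1.0        # toroidal edge correction
        return np.kron(np.eye(m), P) + np.kron(P, np.eye(m))

    rho, sigma2 = 0.1, 1.0
    for torus in (False, True):
        W = lattice_W(5, torus)
        A = np.eye(25) - rho * W
        Sigma = sigma2 * np.linalg.inv(A.T @ A)    # SAR covariance of model (8.43)
        print(np.round(np.diag(Sigma).reshape(5, 5), 4))
    # the first block reproduces Table 8.10; the toroidal version gives 1.1418 everywhere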

Exercise 8.8
Here $W = W^T$, so that
$$\Sigma = \sigma^2 \{(I_n - \rho W)^2\}^{-1}.$$

TABLE 8.10. Variances for individual sites on a 5 x 5 regular lattice without boundary adjustments under a first-order SAR model with ρ = 0.1 and σ² = 1

Row Column
1 2 3 4 5
1 1.0656 1.1007 1.1014 1.1007 1.0656
2 1.1007 1.1397 1.1405 1.1397 1.1007
3 1.1014 1.1405 1.1413 1.1405 1.1014
4 1.1007 1.1397 1.1405 1.1397 1.1007
5 1.0656 1.1007 1.1014 1.1007 1.0656

We require $\Sigma$ to be positive definite, which happens if $I_n - \rho W$ is nonsingular. In fact, any matrix which is the square of a nonsingular symmetric matrix is positive definite and the same is true for its inverse. Hence, we must have
$$|I_n - \rho W| \neq 0.$$

An additional constraint is imposed by the loglikelihood function (8.44), which is well defined only if
$$|I_n - \rho W| > 0, \qquad (8.93)$$
that is if $I_n - \rho W$ is strictly positive definite.


We now look at the eigenvalues of In - pW, say Ai, i = 1, ... , n . They
are all real, since W is symmetric. By their definition, we have

From equation (8.94) we see that

1- Ai
- - - =Wi
p
is an eigenvalue of W. We thus establish the relationship

between the eigenvalues of In- pW and those of W .


The eigenvalues of $I_n - \rho W$ must be positive, to satisfy condition (8.93). Hence, we must have
$$1 - \rho\, w_i > 0 \quad \text{for all } i = 1, \ldots, n. \qquad (8.95)$$
In practice, the largest eigenvalue of $W$, say $w_{\max}$, is usually positive, while the smallest, say $w_{\min}$, is negative. Therefore, condition (8.95) becomes
$$w_{\min}^{-1} < \rho < w_{\max}^{-1}.$$

This result shows that the range of permissible values of $\rho$ depends on the form and size of $W$. For instance, for square lattices with weighting (8.40), $w_{\max} \to +4$ and $w_{\min} \to -4$ as $n \to \infty$, so $-0.25 < \rho < +0.25$.
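These limiting bounds are easy to check numerically. The sketch below is an added illustration; the Kronecker construction is just a compact way of writing the lattice weights (8.40) without boundary adjustment.

    import numpy as np

    for m in (5, 10, 20, 40):
        P = np.diag(np.ones(m - 1), 1) + np.diag(np.ones(m - 1), -1)
        W = np.kron(np.eye(m), P) + np.kron(P, np.eye(m))   # rook weights, no boundary adjustment
        w = np.linalg.eigvalsh(W)                           # real eigenvalues, since W = W^T
        print(m * m, round(1 / w.min(), 4), round(1 / w.max(), 4))
    # the admissible interval (1/w_min, 1/w_max) shrinks towards (-0.25, +0.25) as n grows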

Exercise 8.9
The details of maximum likelihood estimation from a multivariate normal distribution have been outlined in Exercise 2.7. Two major differences arise in the framework of spatial autoregression. First, the mean is not constant; we assume instead that $\mu = X\beta$. More importantly, the vector $y$ in model (8.43) is a single observation from the $n$-variate normal distribution $N_n(X\beta, \Sigma)$. No independent replication occurs within the sample and the likelihood function is simply the $n$-variate density of $y$,
$$\mathrm{Lik}(\mu, \Sigma; y) = \frac{1}{|2\pi\Sigma|^{1/2}} \exp\left\{ -(y - X\beta)^T \Sigma^{-1} (y - X\beta)/2 \right\}.$$
Here $\Sigma$ is an $n \times n$ matrix, so the fraction before the exponent becomes
$$(2\pi)^{-n/2}\, |\Sigma|^{-1/2}.$$
Under the first-order SAR model,
$$\Sigma^{-1} = \frac{1}{\sigma^2}(I_n - \rho W)^T (I_n - \rho W) = \frac{1}{\sigma^2}\,\Xi.$$
The resulting loglikelihood function is shown in (8.44). For a given $\rho$, the partial derivatives of this function with respect to $\beta$ and $\sigma^2$ are
$$\frac{\partial L(\beta, \sigma^2, \rho; y)}{\partial \beta} = \frac{1}{\sigma^2}\, X^T \Xi (y - X\beta) \qquad (8.96)$$

and
$$\frac{\partial L(\beta, \sigma^2, \rho; y)}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}(y - X\beta)^T \Xi (y - X\beta). \qquad (8.97)$$
The maximum likelihood estimates of $\beta$ and $\sigma^2$ are then obtained as the solutions of the estimating equations
$$X^T \Xi (y - X\beta) = 0 \qquad (8.98)$$
and
$$-n\sigma^2 + (y - X\beta)^T \Xi (y - X\beta) = 0. \qquad (8.99)$$
In (8.98), with slight abuse of notation, 0 denotes a $p \times 1$ vector of zeros. Equation (8.98) gives (8.45), while equation (8.99) yields (8.46). Conditional on
$$\hat{\beta} = (X^T \Xi X)^{-1} X^T \Xi y$$
and
$$\hat{\sigma}^2 = \frac{1}{n}(y - X\hat{\beta})^T \Xi (y - X\hat{\beta}),$$

the profile loglikelihood for $\rho$ is
$$L_{\max}(\rho) = -(n/2)\log(2\pi\hat{\sigma}^2) + \log|I_n - \rho W| - \frac{n\hat{\sigma}^2}{2\hat{\sigma}^2} = \text{constant} - (n/2)\log(\hat{\sigma}^2) + \log|I_n - \rho W|.$$
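These estimating equations are easily put to work numerically. The sketch below is an added illustration with hypothetical parameter values: it simulates one realisation of a first-order SAR model on a 7 x 7 lattice (using the compact Kronecker construction of the weights (8.40)) and locates the maximum of the profile loglikelihood over a grid of $\rho$ values.

    import numpy as np

    rng = np.random.default_rng(0)
    m = 7
    P = np.diag(np.ones(m - 1), 1) + np.diag(np.ones(m - 1), -1)
    W = np.kron(np.eye(m), P) + np.kron(P, np.eye(m))        # lattice weights (8.40)
    n = m * m
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    beta, sigma2, rho = np.array([1.0, 0.5]), 1.0, 0.15      # hypothetical true values
    y = X @ beta + np.linalg.solve(np.eye(n) - rho * W,
                                   rng.normal(scale=np.sqrt(sigma2), size=n))

    def profile_loglik(r):
        A = np.eye(n) - r * W
        Xi = A.T @ A                                         # (I - rho W)^T (I - rho W)
        b = np.linalg.solve(X.T @ Xi @ X, X.T @ Xi @ y)      # estimate (8.45)
        res = y - X @ b
        s2 = res @ Xi @ res / n                              # estimate (8.46)
        return -(n / 2) * np.log(s2) + np.linalg.slogdet(A)[1]   # L_max(rho) up to a constant

    grid = np.linspace(-0.24, 0.24, 97)
    print(grid[np.argmax([profile_loglik(r) for r in grid])])    # close to the true rho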

Exercise 8.10
We know from maximum likelihood theory that the asymptotic covariance matrix of $\hat{\beta}$, $\hat{\sigma}^2$ and $\hat{\rho}$ is the inverse of the expected information matrix, that is
$$\left( \begin{array}{ccc}
-E\,\frac{\partial^2 L}{\partial\beta\,\partial\beta^T} & -E\,\frac{\partial^2 L}{\partial\beta\,\partial\sigma^2} & -E\,\frac{\partial^2 L}{\partial\beta\,\partial\rho} \\
-E\,\frac{\partial^2 L}{\partial\sigma^2\,\partial\beta^T} & -E\,\frac{\partial^2 L}{\partial(\sigma^2)^2} & -E\,\frac{\partial^2 L}{\partial\sigma^2\,\partial\rho} \\
-E\,\frac{\partial^2 L}{\partial\rho\,\partial\beta^T} & -E\,\frac{\partial^2 L}{\partial\rho\,\partial\sigma^2} & -E\,\frac{\partial^2 L}{\partial\rho^2}
\end{array} \right)^{-1}, \qquad (8.100)$$
where $L = L(\beta, \sigma^2, \rho; y)$. Under the SAR model (8.43), the first-order partial derivatives of the loglikelihood $L(\beta, \sigma^2, \rho; y)$ with respect to $\beta$ and $\sigma^2$ are given in equations (8.96) and (8.97). Hence,

$$\frac{\partial^2 L(\beta, \sigma^2, \rho; y)}{\partial\beta\,\partial\beta^T} = -\frac{1}{\sigma^2}\, X^T \Xi X \qquad (8.101)$$
and
$$\frac{\partial^2 L(\beta, \sigma^2, \rho; y)}{\partial(\sigma^2)^2} = \frac{n}{2\sigma^4} - \frac{1}{\sigma^6}(y - X\beta)^T \Xi (y - X\beta). \qquad (8.102)$$
Equation (8.101) does not involve any random quantity, so it is equal to its expected value. Equation (8.102) yields

$$E\,\frac{\partial^2 L(\beta, \sigma^2, \rho; y)}{\partial(\sigma^2)^2} = \frac{n}{2\sigma^4} - \frac{1}{\sigma^6}\,\mathrm{tr}\left\{ \sigma^2\, \Xi\, (I_n - \rho W)^{-1}(I_n - \rho W^T)^{-1} \right\} = -\frac{n}{2\sigma^4}. \qquad (8.103)$$

The first-order partial derivative of $L(\beta, \sigma^2, \rho; y)$ with respect to $\rho$ is
$$\frac{\partial L(\beta, \sigma^2, \rho; y)}{\partial \rho} = \frac{\partial \log|I_n - \rho W|}{\partial \rho} - \frac{1}{2\sigma^2}\,\frac{\partial (y - X\beta)^T \Xi (y - X\beta)}{\partial \rho}. \qquad (8.104)$$
In Exercise 8.8 we noted that the eigenvalues of $I_n - \rho W$ are
$$1 - \rho\, w_i, \quad i = 1, \ldots, n,$$
where $w_i$ denotes the $i$th eigenvalue of $W$. Hence,
$$\log|I_n - \rho W| = \sum_{i=1}^{n} \log(1 - \rho w_i)$$
and
$$\frac{\partial \log|I_n - \rho W|}{\partial \rho} = -\sum_{i=1}^{n} \frac{w_i}{1 - \rho w_i}.$$

Furthermore,
$$\frac{\partial (y - X\beta)^T \Xi (y - X\beta)}{\partial \rho} = (y - X\beta)^T(-W^T - W + 2\rho W^T W)(y - X\beta) = -2(y - X\beta)^T (I_n - \rho W)^T W (y - X\beta),$$
noting that $(y - X\beta)^T W^T (y - X\beta) = (y - X\beta)^T W (y - X\beta)$. Collecting the pieces, equation (8.104) becomes

$$\frac{\partial L(\beta, \sigma^2, \rho; y)}{\partial \rho} = -\sum_{i=1}^{n}\frac{w_i}{1 - \rho w_i} + \frac{1}{\sigma^2}(y - X\beta)^T (I_n - \rho W)^T W (y - X\beta). \qquad (8.105)$$
Let
$$a = -\sum_{i=1}^{n} \frac{w_i^2}{(1 - \rho w_i)^2}.$$
Differentiating (8.105) again with respect to $\rho$,
$$\frac{\partial^2 L(\beta, \sigma^2, \rho; y)}{\partial \rho^2} = a - \frac{1}{\sigma^2}(y - X\beta)^T W^T W (y - X\beta).$$
Taking the expectation yields
$$\begin{aligned}
E\,\frac{\partial^2 L(\beta, \sigma^2, \rho; y)}{\partial \rho^2} &= a - \frac{1}{\sigma^2}\, E\{(y - X\beta)^T W^T W (y - X\beta)\} \\
&= a - \frac{1}{\sigma^2}\,\mathrm{tr}\{\sigma^2 W^T W (I_n - \rho W)^{-1}(I_n - \rho W^T)^{-1}\} \\
&= a - \mathrm{tr}\{W (I_n - \rho W)^{-1}(I_n - \rho W^T)^{-1} W^T\} \\
&= a - \mathrm{tr}(B B^T),
\end{aligned}$$
say, where $B = W(I_n - \rho W)^{-1}$.
In a similar way, we can compute the remaining expectations in (8.100) (not shown for brevity). The resulting asymptotic covariance matrix is then

where 0 denotes a $p \times 1$ vector of zeros.

Exercise 8.11
For simplicity we consider the first-order SAR model (8.43) with $\beta$ known. The least squares estimate of $\rho$ is obtained by minimizing the expression
$$D(\rho) = (y - X\beta)^T \Xi (y - X\beta)$$

with respect to $\rho$. Recalling the definition of $\Xi$, the derivative of $D(\rho)$ with respect to the unknown parameter is
$$D'(\rho) = (y - X\beta)^T(-W^T - W + 2\rho W^T W)(y - X\beta) = -2(y - X\beta)^T W (y - X\beta) + 2\rho (y - X\beta)^T W^T W (y - X\beta) = -2(y - X\beta)^T (I_n - \rho W^T) W (y - X\beta),$$
since $(y - X\beta)^T W^T (y - X\beta) = (y - X\beta)^T W (y - X\beta)$. The estimating equation for $\rho$ is then
$$(y - X\beta)^T (I_n - \rho W^T) W (y - X\beta) = 0. \qquad (8.106)$$
Now we compare the least-squares estimating equation (8.106) with the score function (8.105). It is well known that, under mild regularity conditions, maximum likelihood yields consistent parameter estimates. Hence, the least-squares estimating equation (8.106) will result in a consistent estimate of $\rho$ only if
$$\sum_{i=1}^{n} \frac{w_i}{1 - \rho w_i} \to 0 \qquad (8.107)$$
as $n \to \infty$.
Condition (8.107) is verified only in very special cases, e.g. when $\rho = 0$, so that $\sum_{i=1}^{n} w_i = \mathrm{tr}(W) = 0$, or when $W$ is upper or lower triangular (corresponding to a one-sided form of spatial dependence analogous to time dependence), so that $w_i = 0$ for $i = 1, \ldots, n$. However, condition (8.107) does not hold in general, even asymptotically, so the least squares estimate of $\rho$ is not consistent.
Estimating functions which resemble the score function but do not make full distributional assumptions are often called "quasilikelihood" estimating functions (McCullagh and Nelder 1989).
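The size of the term neglected by least squares is easily gauged numerically; the lattice weights below are built with the same compact Kronecker construction used in the earlier sketches, an illustrative assumption.

    import numpy as np

    m = 10
    P = np.diag(np.ones(m - 1), 1) + np.diag(np.ones(m - 1), -1)
    W = np.kron(np.eye(m), P) + np.kron(P, np.eye(m))    # lattice weights (8.40)
    w = np.linalg.eigvalsh(W)
    for rho in (0.0, 0.1, 0.2):
        print(rho, round(float(np.sum(w / (1 - rho * w))), 2))
    # the sum in (8.107) vanishes only at rho = 0, where it equals tr(W) = 0;
    # for rho != 0 it grows with n, so the least squares estimate is not consistent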

Exercise 8.12
We are testing the null hypothesis
$$H_0: \rho = \rho_0 \qquad (8.108)$$
against the two-sided alternative $H_1: \rho \neq \rho_0$, using information from a subset of $m$ spatial locations. For simplicity, in what follows we suppress the subscript corresponding to $S(m)$.
If we adopt the loglikelihood approximation $L^*(\beta, \sigma^2, \rho; y)$ given in (8.51), the signed square-root of the likelihood ratio statistic is defined as
$$T_N(\rho_0) = \mathrm{sign}(\hat{\rho} - \rho_0)\left\{ 2L^*(\hat{\beta}, \hat{\sigma}^2, \hat{\rho}; y) - 2L^*(\hat{\beta}_0, \hat{\sigma}_0^2, \rho_0; y) \right\}^{1/2},$$
where $\hat{\beta}_0$ and $\hat{\sigma}_0^2$ are the maximum likelihood estimates of $\beta$ and $\sigma^2$ under the constraint (8.108). From (8.51), we have
$$L^*(\beta, \sigma^2, \rho; y) = -\frac{m}{2}\log(2\pi) - \frac{m}{2}\log\sigma^2 + \frac{1}{2}\log|\Xi| - \frac{1}{2\sigma^2}(y - X\beta)^T \Xi (y - X\beta),$$

since 121 = li - pWI 2 . Therefore,

-mlog(2n) - mloga 2 + logiBI


1 A TA A

-~(y- Xß) 2(y- Xß),


a
where 3 =(I- pW)T(I- pW). In a similar way,

$$-2L^*(\hat{\beta}_0, \hat{\sigma}_0^2, \rho_0; y) = m\log(2\pi) + m\log\hat{\sigma}_0^2 - \log|\Xi_0| + \frac{1}{\hat{\sigma}_0^2}(y - X\hat{\beta}_0)^T \Xi_0 (y - X\hat{\beta}_0),$$
where $\Xi_0 = (I - \rho_0 W)^T (I - \rho_0 W)$. Hence,
$$T_N(\rho_0) = \mathrm{sign}(\hat{\rho} - \rho_0)\left\{ \log\frac{|\hat{\Xi}|}{|\Xi_0|} - m\log\frac{\hat{\sigma}^2}{\hat{\sigma}_0^2} - \frac{1}{\hat{\sigma}^2}(y - X\hat{\beta})^T \hat{\Xi} (y - X\hat{\beta}) + \frac{1}{\hat{\sigma}_0^2}(y - X\hat{\beta}_0)^T \Xi_0 (y - X\hat{\beta}_0) \right\}^{1/2}.$$

Exercise 8.13
(a) A first-order CAR model is defined through the set of conditional densities $f(y_i \mid y_{(i)}; \beta, \sigma^2, \rho)$, for $i = 1, \ldots, n$. The model is well defined if these conditional densities are mutually consistent and yield a proper joint distribution for $y$. Therefore, part (a) of the exercise is proved if we can show that the joint distribution of $y$ exists under condition (8.58), and that this joint distribution is precisely of the form given in (8.59).
A general key result relating joint and conditional distributions is a factorization theorem noted by Brook (1964) and fully developed by Besag (1974) in the context of spatial processes. This theorem is proved also in several books, including Cressie (1993, pp. 412-413) and Guttorp (1995, p. 7). We introduce the following notation. Let $a = (a_1, \ldots, a_n)^T$ and $b = (b_1, \ldots, b_n)^T$ denote two possible realizations of $\{y(s) : s \in S\}$. Let $f(a)$ and $f(b)$ be the joint densities of $a$ and $b$, while $f(a_i \mid \cdot)$ and $f(b_i \mid \cdot)$ are the conditional densities of $a_i$ and $b_i$, $i = 1, \ldots, n$. The factorization theorem says that, under a mild regularity condition, the conditional densities must satisfy the relationship
$$\frac{f(a)}{f(b)} = \prod_{i=1}^{n} \frac{f(a_i \mid a_1, \ldots, a_{i-1}, b_{i+1}, \ldots, b_n)}{f(b_i \mid a_1, \ldots, a_{i-1}, b_{i+1}, \ldots, b_n)}. \qquad (8.109)$$
The regularity condition (not stated here) implies that each term in the denominator of (8.109) is positive. This is clearly true if $f(b_i \mid \cdot)$ is the normal density function.
Equation (8.109) is important because it relates the joint probability structure of $\{y(s) : s \in S\}$, given in terms of the density ratio $f(a)/f(b)$, to the set of conditional distributions defined at each site. Furthermore, it shows the restrictions that these conditional distributions must satisfy in order to give a mathematically consistent joint distribution over the whole study area. In fact, since the labelling of individual sites within $S$ is arbitrary, there are as many as $n!$ possible factorizations of the density ratio $f(a)/f(b)$, which must all be equal.
We exploit the factorization theorem for a first-order CAR model, by taking $a = y = (y_1, \ldots, y_n)^T$, i.e. the observed realization of $\{y(s) : s \in S\}$, and by setting $b = X\beta = (\mu_1, \ldots, \mu_n)^T$, where $\mu_i = x_i^T \beta$, $i = 1, \ldots, n$. For simplicity, we suppress dependence on the parameters $\beta$, $\sigma^2$ and $\rho$ in our notation for densities and work with the logarithm of the density ratio $f(y)/f(X\beta)$.
Under the normal conditional density (8.58), we have
$$f(y_i \mid y_1, \ldots, y_{i-1}, \mu_{i+1}, \ldots, \mu_n) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{1}{2\sigma^2}\left\{ y_i - \mu_i - \rho\sum_{j=1}^{i-1} w_{i,j}(y_j - \mu_j) - \rho\sum_{j=i+1}^{n} w_{i,j}(\mu_j - \mu_j) \right\}^2 \right].$$
In a similar way,
$$f(\mu_i \mid y_1, \ldots, y_{i-1}, \mu_{i+1}, \ldots, \mu_n) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{1}{2\sigma^2}\left\{ \mu_i - \mu_i - \rho\sum_{j=1}^{i-1} w_{i,j}(y_j - \mu_j) \right\}^2 \right] = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{\rho^2}{2\sigma^2}\left\{ \sum_{j=1}^{i-1} w_{i,j}(y_j - \mu_j) \right\}^2 \right].$$
From equation (8.109), the logarithm of the density ratio is then
$$\log\frac{f(y)}{f(X\beta)} = \sum_{i=1}^{n}\left[ -\frac{1}{2\sigma^2}\left\{ y_i - \mu_i - \rho\sum_{j=1}^{i-1} w_{i,j}(y_j - \mu_j) \right\}^2 + \frac{\rho^2}{2\sigma^2}\left\{ \sum_{j=1}^{i-1} w_{i,j}(y_j - \mu_j) \right\}^2 \right].$$

This gives
$$\log\frac{f(y)}{f(X\beta)} = -\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu_i)^2 + \frac{\rho}{\sigma^2}\sum_{i=1}^{n}\sum_{j=1}^{i-1} w_{i,j}(y_i - \mu_i)(y_j - \mu_j). \qquad (8.110)$$

We note that
$$\sum_{i=1}^{n}\sum_{j=1}^{i-1} w_{i,j}(y_i - \mu_i)(y_j - \mu_j) = \sum_{i=1}^{n}\sum_{j=i+1}^{n} w_{i,j}(y_i - \mu_i)(y_j - \mu_j)$$
since $w_{i,j} = w_{j,i}$ by condition (8.57). Furthermore,
$$\sum_{i=1}^{n} w_{i,i}(y_i - \mu_i)^2 = 0$$
since $w_{i,i} = 0$. Hence,
$$2\sum_{i=1}^{n}\sum_{j=1}^{i-1} w_{i,j}(y_i - \mu_i)(y_j - \mu_j) = \sum_{i=1}^{n}\sum_{j=1}^{n} w_{i,j}(y_i - \mu_i)(y_j - \mu_j).$$
Substituting into equation (8.110), the logarithm of the density ratio becomes

$$\log\frac{f(y)}{f(X\beta)} = -\frac{1}{2\sigma^2}\left\{ \sum_{i=1}^{n}(y_i - \mu_i)^2 - \rho\sum_{i=1}^{n}\sum_{j=1}^{n} w_{i,j}(y_i - \mu_i)(y_j - \mu_j) \right\},$$
which is a quadratic form in the residuals $y_i - \mu_i = y_i - x_i^T\beta$. In matrix notation,
$$\log\frac{f(y)}{f(X\beta)} = -\frac{1}{2\sigma^2}\left\{ (y - X\beta)^T(y - X\beta) - (y - X\beta)^T(\rho W)(y - X\beta) \right\} = -\frac{1}{2\sigma^2}(y - X\beta)^T(I_n - \rho W)(y - X\beta), \qquad (8.111)$$
where $W = (w_{i,j})$, $i, j = 1, \ldots, n$, is the $n \times n$ matrix of weights.


It is now easy to see that the logarithm of the density ratio shown in equation (8.111) is precisely the one which is obtained under the multivariate normal distribution (8.59). In fact, if the joint distribution of $y$ is $n$-variate normal with mean $X\beta$ and covariance matrix $\sigma^2(I_n - \rho W)^{-1}$,
$$f(y) = \frac{1}{(\sqrt{2\pi})^n\, |\sigma^2(I_n - \rho W)^{-1}|^{1/2}} \exp\left\{ -\frac{1}{2\sigma^2}(y - X\beta)^T(I_n - \rho W)(y - X\beta) \right\} \qquad (8.112)$$
and
$$f(X\beta) = \frac{1}{(\sqrt{2\pi})^n\, |\sigma^2(I_n - \rho W)^{-1}|^{1/2}}.$$
Equation (8.111) is then the logarithm of the resulting density ratio. We conclude that for a first-order CAR model
$$y \sim N_n(X\beta, \Sigma_{CAR}),$$
where
$$\Sigma_{CAR} = \sigma^2 (I_n - \rho W)^{-1}. \qquad (8.113)$$
In order to be a valid covariance matrix, $\Sigma_{CAR}$ must be symmetric and positive definite. Symmetry is ensured by assumption (8.57). Under the binary weighting scheme (8.40), positive definiteness of $(I_n - \rho W)$ and thus of $\Sigma_{CAR}$ follows if $-0.25 < \rho < +0.25$ (see Exercise 8.8).
As a final remark, we point out why the density ratio $f(y)/f(X\beta)$ (or equivalently its logarithm) gives a complete characterization of the joint density $f(y)$. This is easily seen by noting that
$$f(y) = \frac{f(y)/f(X\beta)}{\int f(y)\,dy\, /\, f(X\beta)} = \frac{f(y)/f(X\beta)}{\int \{f(y)/f(X\beta)\}\,dy},$$
since $\int f(y)\,dy = 1$. Hence, knowledge of $f(y)/f(X\beta)$ as a function of $y$ is equivalent to knowledge of the joint density function.
The validity of the factorization theorem (8.109) is general and it is by no means limited to the case of normal conditional distributions. Therefore, this theorem provides a basic tool for deriving valid models of spatial processes with Markov structure, known as Markov random fields, also under non-Gaussian assumptions (Besag 1974; Kaiser and Cressie 2000).

(b) The loglikelihood of a first-order CAR model is simply the logarithm of the multivariate normal density (8.112), seen as a function of $\beta$, $\sigma^2$ and $\rho$. Hence,
$$L(\beta, \sigma^2, \rho; y) = -\frac{n}{2}\log(2\pi\sigma^2) + \frac{1}{2}\log|I_n - \rho W| - \frac{1}{2\sigma^2}(y - X\beta)^T(I_n - \rho W)(y - X\beta). \qquad (8.114)$$
The form of the loglikelihood (8.114) is similar to that of a first-order SAR model, given in (8.44), but with one important difference. This difference is the coefficient of $\log|I_n - \rho W|$, which in the SAR model is twice as large as in the CAR model. Therefore, the two models are not equivalent.
Since the term involving $|I_n - \rho W|$ does not depend on $\beta$ and $\sigma^2$, the maximum likelihood estimates $\hat{\beta}$ and $\hat{\sigma}^2$ are the same as under a SAR model; see equations (8.45) and (8.46), and Exercise 8.9. Conditional on $\hat{\beta}$ and $\hat{\sigma}^2$, the profile loglikelihood for $\rho$ is then
$$L_{\max}(\rho) = \text{constant} - (n/2)\log(\hat{\sigma}^2) + (1/2)\log|I_n - \rho W|.$$

(c) From (8.56), we write
$$\varepsilon^* = y - \mu^* = y - X\beta - \rho W(y - X\beta) = (I_n - \rho W)(y - X\beta).$$
Furthermore, we know from part (a) of this exercise that $E(y) = X\beta$ and $E\{(y - X\beta)(y - X\beta)^T\} = \sigma^2(I_n - \rho W)^{-1}$. Hence,
$$E(\varepsilon^*) = 0,$$
an $n \times 1$ vector of zeros, and
$$\mathrm{cov}(\varepsilon^*, y) = E\{\varepsilon^*(y - X\beta)^T\} = E\{(I_n - \rho W)(y - X\beta)(y - X\beta)^T\} = (I_n - \rho W)\, E\{(y - X\beta)(y - X\beta)^T\} = (I_n - \rho W)\,\sigma^2(I_n - \rho W)^{-1} = \sigma^2 I_n.$$
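The non-equivalence of the CAR covariance (8.113) and the SAR covariance of model (8.43) is readily seen numerically. The sketch below, an added illustration, uses the binary weights (8.40) on a 5 x 5 lattice built with the same compact Kronecker construction as in the earlier sketches (an assumption made only for these illustrations).

    import numpy as np

    m = 5
    P = np.diag(np.ones(m - 1), 1) + np.diag(np.ones(m - 1), -1)
    W = np.kron(np.eye(m), P) + np.kron(P, np.eye(m))        # binary weights (8.40)
    n, rho, sigma2 = m * m, 0.2, 1.0
    A = np.eye(n) - rho * W
    Sigma_CAR = sigma2 * np.linalg.inv(A)                    # CAR covariance (8.113)
    Sigma_SAR = sigma2 * np.linalg.inv(A.T @ A)              # SAR covariance of model (8.43)
    print(np.all(np.linalg.eigvalsh(Sigma_CAR) > 0))         # True: valid for |rho| < 0.25
    print(np.allclose(Sigma_CAR, Sigma_SAR))                 # False: the two models differ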
Appendix: Tables of Data

TABLE A.1. Swiss heads data: six dimensions in millimetres of the heads of 200 Swiss soldiers

Number Y1 Y2 Y3 Y4 Y5 Y6
1 113.2 111.7 119.6 53.9 127.4 143.6
2 117.6 117.3 121.2 47.7 124.7 143.9
3 112.3 124.7 131.6 56.7 123.4 149.3
4 116.2 110.5 114.2 57.9 121.6 140.9
5 112.9 111.3 114.3 51.5 119.9 133.5
6 104.2 114.3 116.5 49.9 122.9 136.7
7 110.7 116.9 128.5 56.8 118.1 134.7
8 105.0 119.2 121.1 52.2 117.3 131.4
9 115.9 118.5 120.4 60.2 123.0 146.8
10 96.8 108.4 109.5 51.9 120.1 132.2
11 110.7 117.5 115.4 55 .2 125.0 140.6
12 108.4 113.7 122.2 56.2 124.5 146.3
13 104.1 116.0 124.3 49.8 121.8 138.1
14 107.9 115.2 129.4 62.2 121.6 137.9
15 106.4 109.0 114.9 56.8 120.1 129.5
16 112.7 118.0 117.4 53 .0 128.3 141.6
17 109.9 105.2 122.2 56.6 122.2 137.8
18 116.6 119.5 130.6 53.0 124.0 135.3
19 109.9 113.5 125.7 62.8 122.7 139.5
20 107.1 110.7 121.7 52.1 118.6 141.6
21 113.3 117.8 120.7 53.5 121.6 138.6
22 108.1 116.3 123.9 55.5 125.4 146.1
23 111.5 111.1 127.1 57.9 115.8 135.1
24 115.7 117.3 123.0 50.8 122.2 143.1
25 112.2 120.6 119.6 61.3 126.7 141.1
26 118.7 122.9 126.7 59.8 125.7 138.3
27 118.9 118.4 127.7 64.6 125.6 144.3
28 114.2 109.4 119.3 58.7 121.1 136.2
29 113.8 113.6 135.8 54.3 119.5 130.9
30 122.4 117.2 122.2 56.4 123.3 142.9
31 110.4 110.8 122.1 51.2 115.6 132.7
32 114.9 108.6 122.9 56.3 122.7 140.3
33 108.4 118.7 117.8 50.0 113.7 131.0
34 105.3 107.2 116.0 52.5 117.4 133.2
35 110.5 124.9 122.4 62.2 123.1 137.0
36 110.3 113.2 123.9 62.9 122.3 139.8
37 115.1 116.4 118.1 51.9 121.5 133.8
38 119.6 120.2 120.0 59.7 123.9 143.7
39 119.7 125.2 124.5 57.8 125.3 142.7
40 110.2 116.8 120.6 54.3 123.6 140.1

TABLE A.1. Swiss heads data (continued)

Number Y1 Y2 Y3 Y4 Y5 Y6
41 118.9 126.6 128.2 63.8 125.7 151.1
42 112.3 114.7 127.7 59.4 125.2 137.5
43 113.7 111.4 122.6 63.3 121.6 146.8
44 108.1 116.4 115.5 55.2 123.5 134.1
45 105.6 111.4 121.8 61.4 117.7 132.6
46 111.1 111.9 125.2 56.1 119.9 139.5
47 111.3 117.6 129.3 63.7 124.3 142.8
48 119.4 114.6 125.0 62.5 129.5 147.7
49 113.4 120.5 121.1 61.5 118.1 137.2
50 114.7 113.8 137.7 59.8 124.5 143.3
51 115.1 113.9 118.6 59 .5 119.4 141.6
52 114.6 112.4 122.2 54.5 121.2 126.3
53 115.2 117.2 122.2 60.1 123.9 135.7
54 115.4 119.5 132.8 60.3 127.8 140.3
55 119.3 120.6 116.6 55.8 121.5 143.0
56 112.8 119.3 129.6 61.0 121.1 139.4
57 116.6 109.6 125.4 54.6 120.2 122.6
58 106.5 116.0 123.2 52.8 121.7 134.9
59 112.1 117.4 128.2 59.9 120.3 131.5
60 112.8 113.0 125.4 64.8 119.4 136.6
61 114.6 119.0 116.8 57.4 123.8 140.0
62 110.9 116.5 125.8 53.5 124.8 142.9
63 109.1 117.0 123.7 60.0 120.1 137.7
64 111 .7 117.3 121.0 51.5 119.7 135.5
65 106.4 111.1 124.4 59.1 122.4 138.4
66 121.2 122.5 117.8 54.8 121.5 143.9
67 115.2 121.2 117.4 54.9 121.9 144.0
68 123.2 124.2 120.0 57.9 119.4 138.4
69 113.1 114.5 118.9 56.9 121.8 135.0
70 110.3 108.9 115.2 55.9 119.0 138.0
71 115.0 114.7 123.5 66 .7 120.3 133.6
72 111.9 111 .1 122.3 63 .8 117.1 131.6
73 117.2 117.5 120.2 60.5 119.5 129.6
74 113.8 112.5 123.2 62.0 113.5 132.4
75 112.8 113.5 114.3 53 .8 128.4 143.8
76 113.3 118.4 123.8 51.6 122.7 141.7
77 123.9 120.5 118.3 54.3 122.0 133.8
78 119.8 119.6 126.1 57.5 124.7 130.9
79 110.9 113.9 123.7 62.7 124.8 143.5
80 111.9 125.1 121.8 58.1 112.1 134.8

TABLE A.1. Swiss heads data (continued)

Number Y1 Y2 Y3 Y4 Y5 Y6
81 114.0 120.8 131.2 61.0 124.7 152.6
82 113.6 110.4 130.9 60.2 118.5 132.5
83 118.9 126.1 121.9 56.1 127.3 145.6
84 119.4 127.8 128.0 61.8 120.6 141.3
85 121.0 121.1 116.8 56.3 124.2 140.9
86 109.0 105.7 126.2 59.4 121.2 143.2
87 117.9 125.1 122.6 58.2 128.4 151.1
88 124.8 123.8 128.3 60.4 129.1 147.2
89 120.6 124.3 120.0 59.5 123.4 144.1
90 115.6 115.9 117.2 54.0 119.9 135.3
91 116.6 119.1 131.0 58.0 123.3 136.4
92 118.7 118.9 129.6 68.6 123.0 141.4
93 114.3 117.1 127.1 55.7 119.1 139.8
94 110.9 113.1 124.1 60.6 115.7 132.1
95 119.2 120.0 136.9 55.1 129.5 142.0
96 117.1 123.7 108.7 53.2 125.6 136.6
97 109.3 110.2 129.3 58.5 121.0 136.8
98 108.8 119.3 118.7 58.9 118.5 132.7
99 109.0 127.5 124.6 61.1 117.6 131.5
100 101.2 110.6 124.3 62.9 124.3 138.9
101 117.8 109.0 127.1 53.9 117.9 135.8
102 112.4 115.6 135.3 55.8 125.0 136.1
103 105.3 109.8 115.4 59.6 116.6 137.4
104 117.7 122.4 127.1 74.2 125.5 144.5
105 110.9 113.7 126.8 62.7 121.4 142.7
106 115.6 117.5 114.2 55.0 113.2 136.6
107 115.4 118.1 116.6 62.5 125.4 142.1
108 113.6 116.7 130.1 58.5 120.8 140.3
109 116.1 117.6 132.3 59.6 122.0 139.1
110 120.5 115.4 120.2 53.5 118.6 139.4
111 119.0 124.1 124.3 73.6 126.3 141.6
112 122.7 109.0 116.3 55.8 121.8 139.4
113 117.8 108.2 133.9 61.3 120.6 141.3
114 122.3 114.2 137.4 61.7 125.8 143.2
115 114.4 117.8 128.1 54.9 126.5 140.6
116 110.6 111.8 128.4 56.7 121.7 147.5
117 123.3 119.1 117.0 51.7 119.9 137.9
118 118.0 118.0 131.5 61.2 125.0 140.5
119 122.0 114.6 126.2 55.5 121.2 143.4
120 113.4 104.1 128.3 58.7 124.1 142.8

TABLE A.1. Swiss heads data (continued)

Number Y1 Y2 Y3 Y4 Y5 Y6
121 117.0 111.3 129.8 55.6 119.5 136.1
122 116.6 108.3 123.7 61.0 123.4 134.0
123 120.1 116.7 122.8 57.4 123.2 145.2
124 119.8 125.0 124.1 61.8 126.9 141.2
125 123.5 123.0 121.6 59.2 115.3 138.4
126 114.9 126.7 131.3 57.3 122.7 139.2
127 120.6 110.8 129.6 58.1 122.7 134.7
128 113.0 114.8 120.7 54.1 119.7 140.9
129 111.8 110.2 121.0 56.4 121.4 132.1
130 110.8 114.9 120.5 58.7 113.4 131.6
131 114.8 118.8 120.9 58.4 119.7 135.9
132 122.5 122.3 116.7 57.4 128.1 147.3
133 105.9 105.6 129.3 69.5 123.6 136.6
134 108.0 111.3 116.9 53.8 117.8 129.6
135 114.4 111.7 116.3 54.3 120.2 130.1
136 117.9 112.9 119.1 54.2 117.9 134.8
137 110.7 113.9 114.5 53.0 120.1 124.5
138 112.3 110.4 116.8 52.0 121.0 133.4
139 110.9 110.0 116.7 53.4 115.4 133.0
140 126.6 127.0 135.2 60.6 128.6 149.6
141 116.2 115.2 117.8 60.8 123.1 136.8
142 117.2 117.8 123.1 61.8 122.1 140.8
143 114.5 113.2 119.8 50.3 120.6 135.1
144 126.2 118.7 114.6 55.1 126.3 146.7
145 118.7 123.1 131.6 61.8 123.9 139.7
146 116.2 111.5 112.9 54.0 114.7 134.2
147 113.9 100.6 124.0 60.3 118.7 140.7
148 114.4 113.7 123.3 63.2 125.5 145.5
149 114.5 119.3 130.6 61.7 123.6 138.5
150 113.3 115.9 116.1 53.5 127.2 136.5
151 120.7 114.6 124.1 53.2 127.5 139.1
152 119.1 115.3 116.6 53.5 128.2 142.6
153 113.2 107.7 122.0 60.6 119.4 124.2
154 113.7 110.0 131.0 63.5 117.3 134.6
155 116.3 119.3 116.6 57.3 122.0 141.6
156 117.6 117.8 122.5 59.9 119.4 136.3
157 114.8 115.0 115.2 58.9 122.5 135.2
158 127.3 123.9 130.3 59.8 128.3 138.7
159 130.5 125.5 127.4 62.1 130.1 153.3
160 110.4 105.4 122.1 56.2 114.6 122.8

TABLE A.1. Swiss heads data (concluded)

Number Y1 Y2 Y3 Y4 Y5 Y6
161 108.5 105.4 119.1 59.4 120.4 134.7
162 121.6 112.1 126.5 60.6 122.7 142.9
163 117.9 115.2 139.1 59.6 125.5 141.3
164 112.7 111.5 114.9 53.5 113.9 132.6
165 121.8 119.0 116.9 56.5 120.1 139.2
166 118.5 120.0 129.8 59.5 127.8 150.5
167 118.3 120.0 127.5 56.6 122.0 139.4
168 117.9 114.4 116.4 56.7 123.1 136.3
169 114.2 110.0 121.9 57.5 116.1 126.5
170 122.4 122.7 128.4 58.3 131.7 148.1
171 114.1 109.3 124.4 62.8 120.8 133.4
172 114.6 118.0 112.8 55.6 118.5 135.6
173 113.6 114.6 127.1 60.8 123.8 143.1
174 111.3 116.7 117.7 51.2 125.7 141.9
175 111.4 120.4 112.1 56.4 120.3 137.1
176 119.9 114.4 128.8 69.1 124.9 144.3
177 116.1 118.9 128.3 55.8 123.7 139.7
178 119.7 118.2 113.5 59.5 127.0 146.5
179 105.8 106.7 131.2 61.3 123.7 144.3
180 116.7 118.7 128.2 55.8 121.2 143.9
181 106.4 107.3 122.9 57.6 122.3 132.9
182 112.2 121.3 130.1 65.3 120.3 137.9
183 114.8 117.3 130.3 60.9 125.6 137.4
184 110.0 117.4 114.1 54.8 124.8 135.1
185 121.5 121.6 125.4 59.5 128.5 144.7
186 119.8 119.4 119.6 53.9 122.3 143.6
187 107.7 108.4 125.1 62.3 122.7 137.2
188 118.4 115.7 121.1 57.8 124.9 140.5
189 119.8 113.9 132.0 60.8 122.4 137.6
190 114.1 112.8 119.3 52.7 114.2 136.9
191 117.7 121.8 120.0 59.1 122.6 138.3
192 111.1 117.7 117.7 60.2 124.6 139.2
193 111.1 117.7 117.7 59.1 124.7 141.9
194 128.1 118.3 129.4 61.0 134.7 148.6
195 120.4 118.7 126.4 59.4 133.1 147.1
196 112.9 112.0 123.5 57.2 121.3 133.3
197 118.2 114.4 114.8 55.3 126.1 149.1
198 119.0 112.7 129.1 62.0 127.6 146.6
199 111.8 116.0 117.8 60.9 114.4 128.7
200 116.6 111.4 115.6 60.9 117.8 137.4

TABLE A.2. National track records for women

Number Country Y1 Y2 Y3 Y4 Y5 Y6 Y7
secs secs secs mins mins mins mins
1 Argentina 11 .61 22.94 54.50 2.15 4.43 9.79 178.52
2 Australia 11.20 22.35 51.08 1.98 4.13 9.08 152.37
3 Austria 11.43 23.09 50.62 1.99 4.22 9.34 159.37
4 Belgium 11.41 23.04 52.00 2.00 4.14 8.88 157.85
5 Bermuda 11.46 23.05 53.30 2.16 4.58 9.81 169.98
6 Brazil 11.31 23.17 52.80 2.10 4.49 9.77 168.75
7 Burma 12.14 24.47 55.00 2.18 4.45 9.51 191.02
8 Canada 11.00 22.25 50.06 2.00 4.06 8.81 149.45
9 Chile 12.00 24.52 54.90 2.05 4.23 9.37 171.38
10 China 11.95 24.41 54.97 2.08 4.33 9.31 168.48
11 Colombia 11.60 24.00 53.26 2.11 4.35 9.46 165.42
12 Cook Is. 12.90 27.10 60.40 2.30 4.84 11.10 233.22
13 Costa Rica 11.96 24.60 58.25 2.21 4.68 10.43 171.80
14 CZ 11.09 21.97 47.99 1.89 4.14 8.92 158.85
15 Denmark 11.42 23.52 53.60 2.03 4.18 8.71 151.75
16 Dominica 11.79 24.05 56.05 2.24 4.74 9.89 203.88
17 Finland 11 .13 22.39 50.14 2.03 4.10 8.92 154.23
18 France 11.15 22.59 51.73 2.00 4.14 8.98 155.27
19 GDR 10.81 21.71 48.16 1.93 3.96 8.75 157.68
20 FRG 11.01 22.39 49 .75 1.95 4.03 8.59 148.53
21 GB 11.00 22.13 50.46 1.98 4.03 8.62 149.72
22 Greece 11.79 24.08 54.93 2.07 4.35 9.87 182.20
23 Guatemala 11 .84 24.54 56.09 2.28 4.86 10.54 215.08
24 Hungary 11.45 23.06 51.50 2.01 4.14 8.98 156.37
25 India 11.95 24.28 53.60 2.10 4.32 9.98 188.03
26 Indonesia 11.85 24.24 55.34 2.22 4.61 10.02 201.28
27 Ireland 11.43 23.51 53.24 2.05 4.11 8.89 149.38
28 Israel 11.45 23.57 54.90 2.10 4.25 9.37 160.48

TABLE A.2. National track records for women (concluded)

Number Country Y1 Y2 Y3 Y4 Y5 Y6 Y7
secs secs secs mins mins mins mins
29 Italy 11.29 23.00 52.01 1.96 3.98 8.63 151.82
30 Japan 11 .73 24.00 53.73 2.09 4.35 9.20 150.50
31 Kenya 11.73 23.88 52.70 2.00 4.15 9.20 181.05
32 Korea 11.96 24.49 55.70 2.15 4.42 9.62 164.65
33 DRK 12.25 25.78 51.20 1.97 4.25 9.35 179.17
34 Luxembourg 12.03 24.96 56.10 2.07 4.38 9.64 174.68
35 Malaysia 12.23 24.21 55.09 2.19 4.69 10.46 182.17
36 Mauritius 11 .76 25.08 58.10 2.27 4.79 10.90 261.13
37 Mexico 11.89 23.62 53.76 2.04 4.25 9.59 158.53
38 Netherlands 11.25 22.81 52.38 1.99 4.06 9.01 152.48
39 New Zealand 11.55 23.13 51.60 2.02 4.18 8.76 145.48
40 Norway 11.58 23.31 53.12 2.03 4.01 8.53 145.48
41 Papua NG 12.25 25.07 56.96 2.24 4.84 10.69 233.00
42 Philippines 11.76 23.54 54.60 2.19 4.60 10.16 200.37
43 Poland 11.13 22 .21 49.29 1.95 3.99 8.97 160.82
44 Portugal 11.81 24.22 54.30 2.09 4.16 8.84 151.20
45 Rumania 11.44 23.46 51.20 1.92 3.96 8.53 165.45
46 Singapore 12.30 25.00 55.08 2.12 4.52 9.94 182.77
47 Spain 11.80 23.98 53.59 2.05 4.14 9.02 162.60
48 Sweden 11.16 22 .82 51.79 2.02 4.12 8.84 154.48
49 Switzerland 11.45 23.31 53.11 2.02 4.07 8.77 153.42
50 Taiwan 11.22 22.62 52.50 2.10 4.38 9.63 177.87
51 Thailand 11.75 24.46 55.80 2.20 4.72 10.28 168.45
52 Turkey 11.98 24.44 56.45 2.15 4.37 9.38 201.08
53 USA 10.79 21.83 50.62 1.96 3.95 8 .50 142.72
54 USSR 11.06 22.19 49.19 1.89 3.87 8.45 151.22
55 W. Samoa 12.74 25.85 58.73 2.33 5.81 13.04 306.00

TABLE A.3. Selection of data from 341 municipalities of Emilia-Romagna

Number Municipality Province Y1 Y2 Y6 Y9 Y12


2 Argelato BO 7.83 5.60 2.78 5.7 6.95
6 Bologna BO 5.28 10.43 11.67 6.5 5.63
9 Calderara di Reno BO 8.30 5.03 2.98 4.3 8.77
11 Casalecchio di Reno BO 5.62 7.67 5.50 5.1 6.22
30 Granarolo dell'Emilia BO 6.79 6.81 3.04 5.6 10.10
31 Grizzana Morandi BO 6.43 9.60 2.01 7.0 8.59
32 Imola BO 7.15 9.45 5.48 6.1 7.68
49 Porretta Terme BO 6.84 11.25 4.42 7.5 6.86
64 Cento FE 7.29 7.88 4.24 8.3 7.59
65 Codigoro FE 6.22 8.37 2.10 16.9 5.18
66 Comacchio FE 8.41 5.15 1.67 24.5 7.66
68 Ferrara FE 5.34 8.72 8.19 9.3 5.51
70 Goro FE 8.37 5.22 0.70 11.0 8.01
71 Jolanda di Savoia FE 6.70 6.98 0.96 10.0 4.36
72 Lagosanto FE 7.27 7.00 1.07 19.2 6.76
88 Bellaria-Igea Marina RN 8.34 6.13 2.84 14.3 8.48
98 Forlì FO 6.88 9.08 6.56 8.4 6.54
113 Montescudo RN 8.96 8.77 3.16 11.7 9.00
120 Riccione RN 8.40 6.65 4.21 16.3 8.41
121 Rimini RN 7.93 7.07 6.40 12.9 8.41
133 Torriana RN 9.78 6.99 2.99 9.4 7.98
141 Carpi MO 7.04 7.59 3.76 4.8 7.12
147 Fanano MO 6.28 14.39 2.16 6.1 7.05
148 Finale Emilia MO 7.45 7.92 3.06 8.7 6.80
149 Fiorano Modenese MO 10.76 3.98 1.77 6.6 9.31
159 Modena MO 7.07 8.04 8.75 6.2 7.61
184 Albareto PR 6.70 16.41 1.81 7.2 5.40
185 Bardi PR 6.42 18.66 1.59 7.8 4.62
188 Bore PR 2.18 15.06 0.46 9.1 4.10
194 Compiano PR 6.39 15.74 1.84 13.2 5.56
210 Parma PR 6.49 8.98 9.40 5.6 6.42
228 Varano de' Melegari PR 6.41 12.92 2.85 4 .8 4.82
238 Calendasco PC 5.16 8.62 0.54 8.1 7.37
239 Caminata PC 4.08 19.44 2.94 9.8 3.13
240 Caorso PC 6.58 10.51 0.96 6.7 6.89
241 Carpaneto Piacentino PC 7.30 9 .33 0.87 6.3 7.09
245 Cerignale PC 2.52 27.76 0.70 11.1 0.00
250 Ferriere PC 3.85 20.52 0.43 10.0 2.37
259 Nibbiano PC 5.76 11.02 1.30 7.8 7.76
260 Ottone PC 1.80 24.47 1.01 6.1 3.37
261 Pecorara PC 3.04 19.52 0.86 8.3 4.30
262 Piacenza PC 6.82 8.30 7.97 8.7 6.97
264 Piozzano PC 4.53 12.53 0.98 2.9 6.22
277 Zerba PC 2.58 30.97 0.00 12.2 4.30
310 Casina RE 8.41 10.88 2.30 6.1 17.10
329 Reggio nell'Emilia RE 7.50 9.05 7.13 6.0 8.40
Average 7.25 10.08 2.50 7.26 7.13
Median 7.25 9.46 2.32 6.3 7.19

TABLE A.3. Selection of data from 341 municipalities of Emilia-Romagna (concluded)

Number Municipality Province Y18 Y19 Y23 Y26 Y27 Y28

2 Argelato BO 70.87 18.82 64.84 20.38 26.38 5.05


6 Bologna BO 60.81 21.44 67.49 42.10 23.01 7.27
9 Calderara di Reno BO 67.34 12.79 62.31 28.56 39.67 6.31
11 Casalecchio di Reno BO 61.12 14.66 67.50 27.69 34.13 4.68
30 Granarolo dell'Emilia BO 71.40 17.46 63.15 34.66 32.40 6.56
31 Grizzana Morandi BO 55.26 3.43 58.69 12.34 50.26 4.13
32 Imola BO 59.34 11.93 63.18 40.24 30.66 5.32
49 Porretta Terme BO 57.62 10.32 64.42 42.76 21.43 6.11
64 Cento FE 58.81 11.45 62.75 33.36 44.37 5.99
65 Codigoro FE 50.73 7.50 62.04 11.68 46.54 5.22
66 Comacchio FE 48.45 11.35 50.53 9.12 20.99 5.77
68 Ferrara FE 58.92 15.76 61.91 35.19 30.22 5.99
70 Goro FE 43.88 20.70 48.64 7.29 10.88 3.37
71 Jolanda di Savoia FE 49.17 5.21 52.94 0.00 40.56 3.40
72 Lagosanto FE 52.25 8.48 57.83 0.00 54.65 5.75
88 Bellaria-Igea Marina RN 53.59 9.48 59.89 2.50 20.73 12.87
98 Forlì FO 63.75 19.13 62.45 28.33 34.73 6.83
113 Montescudo RN 52.52 7.74 50.43 0.00 44.62 6.92
120 Riccione RN 57.36 11.72 58.05 13.85 25.69 11.42
121 Rimini RN 56.82 14.38 58.02 26.06 25.83 8.99
133 Torriana RN 58.18 22.67 51.00 0.00 37.11 6.64
141 Carpi MO 63.13 19.35 66.72 18.86 45.18 8.37
147 Fanano MO 49.15 16.47 64.32 9.96 41.74 4.63
148 Finale Emilia MO 53.68 8.89 64.24 30.15 41.62 5.17
149 Fiorano Modenese MO 64.91 15.13 61.58 49.95 42.81 1.45
159 Modena MO 66.17 22.14 64.43 36.07 32.64 7.90
184 Albareto PR 51.29 6.48 57.68 0.00 51.92 3.39
185 Bardi PR 46.04 9.62 59.24 0.00 47.51 2.88
188 Bore PR 48.20 4.55 68.37 0.00 28.21 4.77
194 Compiano PR 48.80 7.09 51.67 0.00 43.30 9.55
210 Parma PR 62.22 20.14 63.88 38.66 28.23 7.83
228 Varano de' Melegari PR 55.47 8.33 62.89 9.23 52.53 4.20
238 Calendasco PC 62.76 9.72 63.96 4.62 51.50 5.26
239 Caminata PC 47.65 0.00 63.32 0.00 31.58 2.78
240 Caorso PC 53.50 10.56 60.17 34.85 38.60 3.41
241 Carpaneto Piacentino PC 54.21 13.64 59.73 14.67 36.27 6.34
245 Cerignale PC 45.43 5.88 49.84 0.00 8.00 3.85
250 Ferriere PC 36.26 16.67 49.72 0.00 27.37 5.35
259 Nibbiano PC 52.59 14.93 61.98 13.53 50.85 3.21
260 Ottone PC 40.18 9.26 56.34 0.00 35.82 12.15
261 Pecorara PC 48.43 6.31 66.67 0.00 42.37 2.51
262 Piacenza PC 59.58 19.10 62.24 34.79 28.12 7.42
264 Piozzano PC 63.07 11.11 68.80 0.00 44.83 7.87
277 Zerba PC 39.35 6.67 45 .81 0.00 38.89 15.79
310 Casina RE 57.36 6.08 58.72 10.37 50.12 5.13
329 Reggio nell'Emilia RE 64.28 23.12 62.48 35.90 32.38 6.96
Average 55.18 11.28 60.48 15.55 42.75 5.71
Median 55.01 11.06 61.04 14.63 43.30 5.73

TABLE A.4. Swiss bank notes data: six dimensions in millimetres of 200 Swiss
1,000 Franc notes

Number Yl Y2 Y3 Y4 Y5 Y6
1 214.8 131.0 131.1 9.0 9.7 141.0
2 214.6 129.7 129.7 8.1 9.5 141.7
3 214.8 129.7 129.7 8.7 9.6 142.2
4 214.8 129.7 129.6 7.5 10.4 142.0
5 215.0 129.6 129.7 10.4 7.7 141.8
6 215.7 130.8 130.5 9.0 10.1 141.4
7 215.5 129.5 129.7 7.9 9.6 141.6
8 214.5 129.6 129.2 7.2 10.7 141.7
9 214.9 129.4 129.7 8.2 11.0 141.9
10 215.2 130.4 130.3 9.2 10.0 140.7
11 215.3 130.4 130.3 7.9 11.7 141.8
12 215.1 129.5 129.6 7.7 10.5 142.2
13 215.2 130.8 129.6 7.9 10.8 141.4
14 214.7 129.7 129.7 7.7 10.9 141.7
15 215.1 129.9 129.7 7.7 10.8 141.8
16 214.5 129.8 129.8 9.3 8.5 141.6
17 214.6 129.9 130.1 8.2 9.8 141.7
18 215.0 129.9 129.7 9.0 9.0 141.9
19 215.2 129.6 129.6 7.4 11.5 141.5
20 214.7 130.2 129.9 8.6 10.0 141.9
21 215.0 129.9 129.3 8.4 10.0 141.4
22 215.6 130.5 130.0 8.1 10.3 141.6
23 215.3 130.6 130.0 8.4 10.8 141.5
24 215.7 130.2 130.0 8.7 10.0 141.6
25 215.1 129.7 129.9 7.4 10.8 141.1
26 215.3 130.4 130.4 8.0 11.0 142.3
27 215.5 130.2 130.1 8.9 9.8 142.4
28 215.1 130.3 130.3 9.8 9.5 141.9
29 215.1 130.0 130.0 7.4 10.5 141.8
30 214.8 129.7 129.3 8.3 9.0 142.0
31 215.2 130.1 129.8 7.9 10.7 141.8
32 214.8 129.7 129.7 8.6 9.1 142.3
33 215.0 130.0 129.6 7.7 10.5 140.7
34 215.6 130.4 130.1 8.4 10.3 141.0
35 215.9 130.4 130.0 8.9 10.6 141.4
36 214.6 130.2 130.2 9.4 9.7 141.8
37 215.5 130.3 130.0 8.4 9.7 141.8
38 215.3 129.9 129.4 7.9 10.0 142.0
39 215.3 130.3 130.1 8.5 9.3 142.1
40 213.9 130.3 129.0 8.1 9.7 141.3

TABLE A.4. Swiss bank notes data (continued)

Number Y1 Y2 Y3 Y4 Y5 Y6
41 214.4 129.8 129.2 8.9 9.4 142.3
42 214.8 130.1 129.6 8.8 9.9 140.9
43 214.9 129.6 129.4 9.3 9.0 141.7
44 214.9 130.4 129.7 9.0 9.8 140.9
45 214.8 129.4 129.1 8.2 10.2 141.0
46 214.3 129.5 129.4 8.3 10.2 141.8
47 214.8 129.9 129.7 8.3 10.2 141.5
48 214.8 129.9 129.7 7.3 10.9 142.0
49 214.6 129.7 129.8 7.9 10.3 141.1
50 214.5 129.0 129.6 7.8 9.8 142.0
51 214.6 129.8 129.4 7.2 10.0 141.3
52 215.3 130.6 130.0 9.5 9.7 141.1
53 214.5 130.1 130.0 7.8 10.9 140.9
54 215.4 130.2 130.2 7.6 10.9 141.6
55 214.5 129.4 129.5 7.9 10.0 141.4
56 215.2 129.7 129.4 9.2 9.4 142.0
57 215.7 130.0 129.4 9.2 10.4 141.2
58 215.0 129.6 129.4 8.8 9.0 141.1
59 215.1 130.1 129.9 7.9 11.0 141.3
60 215.1 130.0 129.8 8.2 10.3 141.4
61 215.1 129.6 129.3 8.3 9.9 141.6
62 215.3 129.7 129.4 7.5 10.5 141.5
63 215.4 129.8 129.4 8.0 10.6 141.5
64 214.5 130.0 129.5 8.0 10.8 141.4
65 215.0 130.0 129.8 8.6 10.6 141.5
66 215.2 130.6 130.0 8.8 10.6 140.8
67 214.6 129.5 129.2 7.7 10.3 141.3
68 214.8 129.7 129.3 9.1 9.5 141.5
69 215.1 129.6 129.8 8.6 9.8 141.8
70 214.9 130.2 130.2 8.0 11.2 139.6
71 213.8 129.8 129.5 8.4 11.1 140.9
72 215.2 129.9 129.5 8.2 10.3 141.4
73 215.0 129.6 130.2 8.7 10.0 141.2
74 214.4 129.9 129.6 7.5 10.5 141.8
75 215.2 129.9 129.7 7.2 10.6 142.1
76 214.1 129.6 129.3 7.6 10.7 141.7
77 214.9 129.9 130.1 8.8 10.0 141.2
78 214.6 129.8 129.4 7.4 10.6 141.0
79 215.2 130.5 129.8 7.9 10.9 140.9
80 214.6 129.9 129.4 7.9 10.0 141.8

TABLE A.4. Swiss bank notes data (continued)

Number Y1 Y2 Y3 Y4 Y5 Y6
81 215.1 129.7 129.7 8 .6 10.3 140.6
82 214.9 129.8 129.6 7.5 10.3 141.0
83 215.2 129.7 129.1 9.0 9.7 141.9
84 215.2 130.1 129.9 7.9 10.8 141.3
85 215.4 130.7 130.2 9.0 11.1 141.2
86 215.1 129.9 129.6 8.9 10.2 141.5
87 215.2 129.9 129.7 8.7 9.5 141.6
88 215.0 129.6 129.2 8.4 10.2 142.1
89 214.9 130.3 129.9 7.4 11.2 141.5
90 215.0 129.9 129.7 8.0 10.5 142.0
91 214.7 129.7 129.3 8.6 9.6 141.6
92 215.4 130.0 129.9 8.5 9.7 141.4
93 214.9 129.4 129.5 8.2 9.9 141.5
94 214.5 129.5 129.3 7.4 10.7 141.5
95 214.7 129.6 129.5 8.3 10.0 142.0
96 215.6 129.9 129.9 9.0 9.5 141.7
97 215.0 130.4 130.3 9.1 10.2 141.1
98 214.4 129.7 129.5 8.0 10.3 141.2
99 215 .1 130.0 129.8 9.1 10.2 141.5
100 214.7 130.0 129.4 7.8 10.0 141.2
101 214.4 130.1 130.3 9 .7 11.7 139.8
102 214.9 130.5 130.2 11.0 11.5 139.5
103 214.9 130.3 130.1 8.7 11 .7 140.2
104 215.0 130.4 130.6 9.9 10.9 140.3
105 214.7 130.2 130.3 11.8 10.9 139.7
106 215.0 130.2 130.2 10.6 10.7 139.9
107 215.3 130.3 130.1 9.3 12.1 140.2
108 214.8 130.1 130.4 9.8 11.5 139.9
109 215.0 130.2 129.9 10.0 11.9 139.4
110 215 .2 130.6 130.8 10.4 11.2 140.3
111 215.2 130.4 130.3 8.0 11.5 139.2
112 215.1 130.5 130.3 10.6 11.5 140.1
113 215.4 130.7 131.1 9.7 11.8 140.6
114 214.9 130.4 129.9 11 .4 11.0 139.9
115 215 .1 130.3 130.0 10.6 10.8 139.7
116 215.5 130.4 130.0 8.2 11.2 139.2
117 214.7 130.6 130.1 11.8 10.5 139.8
118 214.7 130.4 130.1 12.1 10.4 139.9
119 214.8 130.5 130.2 11.0 11.0 140.0
120 214.4 130.2 129.9 10.1 12.0 139.2

TABLE A.4. Swiss bank notes data (continued)

Number Y1 Y2 Y3 Y4 Y5 Y6
121 214.8 130.3 130.4 10.1 12.1 139.6
122 215.1 130.6 130.3 12.3 10.2 139.6
123 215.3 130.8 131.1 11.6 10.6 140.2
124 215.1 130.7 130.4 10.5 11.2 139.7
125 214.7 130.5 130.5 9.9 10.3 140.1
126 214.9 130.0 130.3 10.2 11.4 139.6
127 215.0 130.4 130.4 9.4 11 .6 140.2
128 215.5 130.7 130.3 10.2 11.8 140.0
129 215.1 130.2 130.2 10.1 11.3 140.3
130 214.5 130.2 130.6 9.8 12.1 139.9
131 214.3 130.2 130.0 10.7 10.5 139.8
132 214.5 130.2 129.8 12.3 11.2 139.2
133 214.9 130.5 130.2 10.6 11.5 139.9
134 214 .6 130.2 130.4 10.5 11.8 139.7
135 214.2 130.0 130.2 11.0 11.2 139.5
136 214.8 130.1 130.1 11.9 11.1 139.5
137 214.6 129.8 130.2 10.7 11.1 139.4
138 214.9 130.7 130.3 9.3 11.2 138.3
139 214.6 130.4 130.4 11.3 10.8 139.8
140 214.5 130.5 130.2 11.8 10.2 139.6
141 214.8 130.2 130.3 10.0 11.9 139.3
142 214.7 130.0 129.4 10.2 11 .0 139.2
143 214.6 130.2 130.4 11.2 10.7 139.9
144 215.0 130.5 130.4 10.6 11 .1 139.9
145 214 .5 129.8 129.8 11.4 10.0 139.3
146 214.9 130.6 130.4 11.9 10.5 139.8
147 215.0 130.5 130.4 11.4 10.7 139.9
148 215 .3 130.6 130.3 9.3 11.3 138.1
149 214.7 130.2 130.1 10.7 11 .0 139.4
150 214.9 129.9 130.0 9.9 12.3 139.4
151 214.9 130.3 129.9 11.9 10.6 139.8
152 214.6 129.9 129.7 11.9 10.1 139.0
153 214.6 129.7 129.3 10.4 11.0 139.3
154 214.5 130.1 130.1 12.1 10.3 139.4
155 214.5 130.3 130.0 11.0 11.5 139.5
156 215.1 130.0 130.3 11.6 10.5 139.7
157 214.2 129.7 129.6 10.3 11.4 139.5
158 214.4 130.1 130.0 11.3 10.7 139.2
159 214.8 130.4 130.6 12.5 10.0 139.3
160 214.6 130.6 130.1 8.1 12.1 137.9

TABLE A.4. Swiss bank notes data (concluded)

Number Y1 Y2 Y3 Y4 Y5 Y6
161 215.6 130.1 129.7 7.4 12.2 138.4
162 214.9 130.5 130.1 9.9 10.2 138.1
163 214.6 130.1 130.0 11.5 10.6 139.5
164 214.7 130.1 130.2 11.6 10.9 139.1
165 214.3 130.3 130.0 11.4 10.5 139.8
166 215.1 130.3 130.6 10.3 12.0 139.7
167 216.3 130.7 130.4 10.0 10.1 138.8
168 215.6 130.4 130.1 9.6 11.2 138.6
169 214.8 129.9 129.8 9.6 12.0 139.6
170 214.9 130.0 129.9 11.4 10.9 139.7
171 213.9 130.7 130.5 8.7 11.5 137.8
172 214.2 130.6 130.4 12.0 10.2 139.6
173 214.8 130.5 130.3 11.8 10.5 139.4
174 214.8 129.6 130.0 10.4 11.6 139.2
175 214.8 130.1 130.0 11.4 10.5 139.6
176 214.9 130.4 130.2 11.9 10.7 139.0
177 214.3 130.1 130.1 11.6 10.5 139.7
178 214.5 130.4 130.0 9.9 12.0 139.6
179 214.8 130.5 130.3 10.2 12.1 139.1
180 214.5 130.2 130.4 8.2 11.8 137.8
181 215.0 130.4 130.1 11.4 10.7 139.1
182 214.8 130.6 130.6 8.0 11.4 138.7
183 215.0 130.5 130.1 11.0 11.4 139.3
184 214.6 130.5 130.4 10.1 11.4 139.3
185 214.7 130.2 130.1 10.7 11.1 139.5
186 214.7 130.4 130.0 11.5 10.7 139.4
187 214.5 130.4 130.0 8 .0 12.2 138.5
188 214.8 130.0 129.7 11.4 10.6 139.2
189 214.8 129.9 130.2 9.6 11.9 139.4
190 214.6 130.3 130.2 12.7 9.1 139.2
191 215.1 130.2 129.8 10.2 12.0 139.4
192 215.4 130.5 130.6 8.8 11.0 138.6
193 214.7 130.3 130.2 10.8 11.1 139.2
194 215.0 130.5 130.3 9.6 11.0 138.5
195 214.9 130.3 130.5 11.6 10.6 139.8
196 215.0 130.4 130.3 9.9 12.1 139.6
197 215.1 130.3 129.9 10.3 11.5 139.7
198 214.8 130.3 130.4 10.6 11.1 140.0
199 214.7 130.7 130.8 11.2 11.2 139.4
200 214.3 129.9 129.9 10.2 11.5 139.6

TABLE A.5. Babyfood data from Box and Draper (1987). The responses are the viscosity in centipoises at the time of manufacture and when measured 3, 6 and 9 months later

Number X1 X2 X3 X4 X5 Y1 Y2 Y3 Y4

1 -1 -1 -1 -1 -1 9.8 7.5 12.5 41.5


2 1 -1 -1 -1 1 30.2 35.0 22.5 45.0
3 -1 1 -1 -1 1 17.5 17.5 12.5 20.0
4 1 1 -1 -1 -1 12.5 10.0 7.5 12.5
5 -1 -1 1 -1 1 512.5 1950.0 2070.0 3030.0
6 1 -1 1 -1 -1 655.0 670.0 450.0 1700.0
7 -1 1 1 -1 -1 342.5 262.5 410.0 322.5
8 1 1 1 -1 1 1020.0 1050.0 970.0 1230 .0
9 -1 -1 -1 1 1 82.5 145.0 162.5 145.0
10 1 -1 -1 1 -1 19.0 22.0 17.5 25.0
11 -1 1 -1 1 -1 9.3 5.8 5.0 12.5
12 1 1 -1 1 1 27.5 22.5 15.0 20.0
13 -1 -1 1 1 -1 270.0 237.5 337.5 717.5
14 1 -1 1 1 1 282.5 710.0 650.0 547.5
15 -1 1 1 1 1 172.5 237.5 210.0 190.0
16 1 1 1 1 -1 172.5 155.0 257.5 435.0
17 -1 0 0 0 0 45.8 52.5 62.5 57.5
18 1 0 0 0 0 77.5 62.5 70.0 113.8
19 0 -1 0 0 0 195.8 262.5 252.5 276.3
20 0 1 0 0 0 33.0 22.5 15.0 27.5
21 0 0 -1 0 0 20.0 15.0 17.5 17.5
22 0 0 1 0 0 337.5 117.5 105.0 177.5
23 0 0 0 -1 0 70.0 147.5 60.0 147.5
24 0 0 0 1 0 83.8 62.5 132.5 105.0
25 0 0 0 0 -2 40.0 40.0 22.5 60.0
26 0 0 0 0 2 287.5 450.0 482.5 495.0
27 0 0 0 0 0 67.5 77.5 45.0 107.5

TABLE A.6. Mussels data from Cook and Weisberg (1994). The lengths are in millimetres, the masses in grams

Number y1: width y2: height y3: length y4: shell mass y5: mass

1 318 68 158 345 47


2 312 56 148 290 52
3 265 46 124 167 27
4 222 38 104 67 13
5 274 51 143 238 31
6 216 35 99 68 14
7 217 34 109 75 15
8 202 32 96 54 4
9 272 44 119 128 23
10 273 49 123 150 32
11 260 48 135 117 30
12 276 47 133 190 26
13 270 50 126 160 24
14 280 52 130 212 31
15 262 50 134 208 31
16 312 61 120 235 42
17 220 34 94 52 9
18 212 32 102 74 13
19 196 28 85 42 7
20 226 38 104 69 13
21 284 61 134 268 50
22 320 60 137 323 39
23 331 60 140 359 47
24 276 46 126 167 40
25 186 30 92 33 5
26 213 35 98 51 12
27 291 47 130 170 26
28 298 54 137 224 32
29 287 55 140 238 40
30 230 40 106 68 16
31 293 57 135 208 33
32 298 48 135 167 28
33 290 47 134 187 28
34 282 52 135 191 42
35 221 37 104 58 15
36 287 54 135 180 27
37 228 46 129 188 33
38 210 33 107 65 14
39 308 58 131 299 29
40 265 48 124 159 26
41 270 44 124 145 25

TABLE A.6. Mussels data (concluded)

Number y1: width y2: height y3: length y4: shell mass y5: mass

42 208 33 99 54 9
43 277 45 123 129 18
44 241 39 110 104 23
45 219 38 105 66 13
46 170 27 87 24 6
47 150 21 75 19 6
48 132 20 65 10 1
49 175 30 86 36 8
50 150 22 69 18 5
51 162 25 79 20 6
52 252 47 124 133 22
53 275 48 131 179 24
54 224 36 107 69 13
55 211 33 100 59 11
56 254 46 126 120 18
57 234 37 114 72 17
58 221 37 108 74 15
59 167 27 80 27 7
60 220 36 106 52 14
61 227 35 118 76 14
62 177 25 83 25 8
63 230 47 112 125 18
64 288 46 132 138 24
65 275 54 127 191 29
66 273 42 120 148 21
67 246 37 110 90 17
68 250 43 115 120 17
69 290 48 131 203 34
70 226 35 111 64 16
71 269 45 121 124 22
72 267 48 121 153 24
73 263 48 123 151 19
74 217 36 104 68 13
75 188 33 93 51 10
76 152 25 76 19 5
77 227 38 112 88 15
78 216 25 110 53 12
79 242 45 112 61 12
80 260 44 123 133 24
81 196 35 101 68 15
82 220 36 105 64 16

TABLE A.7. Dyestuff data from Box and Draper (1987). The responses are strength (y1), hue (y2) and brightness (y3)

Number X1 X2 X3 Y1 Y2 Y3 Number X1 X2 X3 Y1 Y2 Y3
1 -1 -1 -1 3.4 15 36 33 -1 -1 1 12.6 32 32
2 1 -1 -1 9.7 5 35 34 1 -1 1 10.5 10 34
3 -1 -1 -1 7.4 23 37 35 -1 -1 1 11.3 28 30
4 1 -1 -1 10.6 8 34 36 1 -1 1 10.6 18 24
5 -1 -1 -1 6.5 20 30 37 -1 -1 1 8.1 22 30
6 1 -1 -1 7.9 9 32 38 1 -1 1 12.5 31 20
7 -1 -1 -1 10.3 13 28 39 -1 -1 1 11.1 17 32
8 1 -1 -1 9.5 5 38 40 1 -1 1 12.9 16 25
9 -1 1 -1 14.3 23 40 41 -1 1 1 14.6 38 20
10 1 1 -1 10.5 1 32 42 1 1 1 12.7 12 20
11 -1 1 -1 7.8 11 32 43 -1 1 1 10.8 34 22
12 1 1 -1 17.2 5 28 44 1 1 1 17.1 19 35
13 -1 1 -1 9.4 15 34 45 -1 1 1 13.6 12 26
14 1 1 -1 12.1 8 26 46 1 1 1 14.6 14 15
15 -1 1 -1 9.5 15 30 47 -1 1 1 13.3 25 19
16 1 1 -1 15.8 1 28 48 1 1 1 14.4 16 24
17 -1 -1 -1 8.3 22 40 49 -1 -1 1 11 31 22
18 1 -1 -1 8 8 30 50 1 -1 1 12.5 14 23
19 -1 -1 -1 7.9 16 35 51 -1 -1 1 8.9 23 22
20 1 -1 -1 10.7 7 35 52 1 -1 1 13.1 23 18
21 -1 -1 -1 7.2 25 32 53 -1 -1 1 7.6 28 20
22 1 -1 -1 7.2 5 35 54 1 -1 1 8.6 20 20
23 -1 -1 -1 7.9 17 36 55 -1 -1 1 11.8 18 20
24 1 -1 -1 10.2 8 32 56 1 -1 1 12.4 11 36
25 -1 1 -1 10.3 10 20 57 -1 1 1 13.4 39 20
26 1 1 -1 9.9 3 35 58 1 1 1 14.6 30 11
27 -1 1 -1 7.4 22 35 59 -1 1 1 14.9 31 20
28 1 1 -1 10.5 6 28 60 1 1 1 11.8 6 35
29 -1 1 -1 9.6 24 27 61 -1 1 1 15.6 33 16
30 1 1 -1 15.1 4 36 62 1 1 1 12.8 23 32
31 -1 1 -1 8.7 10 36 63 -1 1 1 13.5 31 20
32 1 1 -1 12.1 5 35 64 1 1 1 15.8 11 20

TABLE A.8. Milk data from Daudin, Duby and Trecourt (1988). Eight measurements of properties of 85 milk samples

Number Y1 Y2 Y3 Y4 Y5 Y6 Y7 Y8
1 10.318 37.7 35.7 26.5 27.1 27.4 127.1 15.35
2 10.316 37.5 35.3 26.0 27.2 27.2 128.7 14.72
3 10.314 37.0 32.8 25.3 24.8 23 .9 124.1 14.61
4 10.311 39.5 33.7 26.8 25.6 25.8 127.5 14.56
5 10.309 36.0 32.8 25.9 25 .1 24.9 121.6 13.74
6 10.322 36.0 33.8 26.9 25.6 25.7 124.5 14.31
7 10.311 36.0 33.8 26.9 25.8 25.4 125.3 14.13
8 10.314 36.7 34.1 27.0 25.9 25.9 124.9 14.16
9 10.292 37.2 31.5 24.8 23.6 23.9 122.5 14.13
10 10.297 35.0 31.6 24.9 23.9 23.8 121.0 14.58
11 10.282 34.7 29.9 23.5 22.7 22.5 114.7 13.83
12 10.262 31.5 30.1 23.6 22 .8 22 .7 111.1 13.18
13 10.270 30.5 30.1 23.8 22.7 22 .6 115.0 13.45
14 10.269 31.6 29.8 23.3 22.4 22.3 112.7 12.82
15 10.264 34.9 29.7 23.2 22.2 22.3 113.5 13.36
16 10.275 35.7 32.5 25 .7 24.4 23 .8 120.1 14.61
17 10.275 37.9 31.8 25.0 23.4 23.5 122.6 14.74
18 10.293 34.6 32.9 26.1 25.3 24.4 120.8 13.74
19 10.282 36.6 32.2 25.3 24.4 24.1 121.1 14.63
20 10.300 37.2 32.1 25.6 25.0 24.2 123.4 14.74
21 10.300 34.0 33.1 26.4 25.3 25.1 119.7 13.80
22 10.300 35.3 33.3 26.0 25.1 25.0 121.5 14.07
23 10.295 35.8 33.9 26.6 25.9 25.5 121.7 14.57
24 10.295 35.9 33.8 26.5 25.2 25.3 121.4 14.88
25 10.288 34.8 32.9 26.2 25 .2 25.0 118.4 13.99
26 10.290 35.9 33.4 26.3 25.5 25.4 121.1 14.12
27 10.290 35.5 32.7 25.4 24.0 23.8 119.9 13.81
28 10.301 35.8 35.7 28.0 26.7 27.2 122.8 14.69
29 10.302 37.0 34.9 27.7 26.6 26.6 125.5 15.31
30 10.300 35.8 33.9 26.8 26.0 25.8 122.6 14.37
31 10.305 34.2 34.5 27.2 26.1 25.8 123.9 14.63
32 10.302 34.8 33.7 26.1 25 .2 25.2 124.2 14.59
33 10.300 36.5 33.9 26.6 25.5 25.4 124.1 14.89
34 10.300 36.6 33.2 25.9 24.6 24.9 123.4 14.60
35 10.300 34.8 34.2 26.8 25.5 25.4 121.4 14.66
36 10.300 37.3 34.9 27.4 26.7 26.7 125.1 14.75
37 10.300 34.6 34.6 27.0 26.2 25.8 122.6 14.33
38 10.310 35.7 34.8 27.4 26.1 26.3 123.4 14.58
39 10.300 35.7 33.4 26.1 24.9 24.9 120.8 14.67
40 10.302 37.9 33.6 26.1 25.3 24.9 126.1 15.20
41 10.300 36.8 35.4 26.0 27.1 27.0 125.1 14.94
42 10.300 39.3 34.8 27.4 26.1 26.2 127.8 15.43
43 10.310 35.6 34.0 26.8 26.4 25.7 125.5 14.40

TABLE A.8. Milk data (concluded)

Number Y1 Y2 Y3 Y4 Y5 Y6 Y7 Y8
44 10.300 37.8 34.1 27.2 25.9 27.0 126.9 15.97
45 10.310 35.6 33.3 26.4 25.1 24.9 122.6 14.77
46 10.310 35.7 34.1 26.8 25.8 25.7 121.7 14.59
47 10.305 33.3 33.2 26.6 25.1 25.3 126.7 14.17
48 10.300 34.4 33.3 26.5 25.2 25.4 123.0 14.79
49 10.300 33.3 32.2 25.6 24.6 24.3 121.5 14.70
50 10.305 38.4 32.5 25.8 24.9 24.9 126.2 15.20
51 10.305 34.0 31.9 25.5 24.4 24.3 122.6 14.11
52 10.300 33.1 31.1 24.8 23.7 23.9 119.0 14.06
53 10.305 34.8 32.3 25.7 25.2 24.7 122.6 14.09
54 10.305 35.5 32.7 26.0 25.1 24.9 122.7 14.41
55 10.310 35.6 33.3 26.5 25.6 25.6 123.3 14.16
56 10.300 36.1 32.5 25.7 24.9 24.8 123.1 14.34
57 10.295 36.2 32.6 25.5 24.7 24.9 121.4 14.04
58 10.300 36.0 33.5 26.7 25.6 25.7 121.9 14.02
59 10.290 35.2 31.8 25.1 24.1 24.1 122.2 14.00
60 10.300 35.4 32.1 25.5 24.3 24.2 122.1 13.78
61 10.300 36.6 32.1 25.3 24.5 24.4 123.4 14.14
62 10.300 37.5 32.6 25.6 24.8 24.9 125.4 14.86
63 10.300 36.9 33.8 27.0 25.6 25.6 124.9 14.29
64 10.300 37.3 31.5 24.9 23.7 23.8 123.4 14.46
65 10.300 35.2 31.7 25.0 24.0 23.9 121.5 13.82
66 10.300 36.4 32.3 25.8 24.6 24.4 123.5 14.54
67 10.300 34.5 31.4 24.8 24.1 24.0 123.9 14.24
68 10.300 35.8 32.1 25.9 24.7 24.4 122.6 14.14
69 10.295 36.0 31.2 24.6 33.8 33.7 122.9 14.05
70 10.300 35.8 31.3 24.9 23.7 23.9 121.7 14.04
71 10.305 36.5 33.1 26.6 25.4 25.4 125.1 14.20
72 10.305 35.5 31.2 24.7 23.3 23.0 122.8 14.15
73 10.300 34.5 30.2 23.8 22.8 22.7 112.3 14.00
74 10.305 35.9 32.0 25.3 23.3 24.3 124.2 14.35
75 10.301 34.3 32.2 25.4 24.6 24.4 122.4 13.98
76 10.295 34.6 30.8 23.9 22.7 22.7 115.7 13.93
77 10.310 36.3 32.6 25.8 24.8 24.7 124.7 14.26
78 10.305 35.2 32.6 25.5 24.8 24.6 123.3 14.07
79 10.300 34.9 32.7 25.8 24.7 24.9 119.0 14.42
80 10.300 35.9 33.7 26.6 25.6 25.7 123.4 14.20
81 10.301 37.6 33.2 26.3 24.9 24.8 125.6 14.94
82 10.310 36.1 33.6 26.2 25.6 25.6 125.3 14.43
83 10.300 34.1 33.2 26.0 25.4 25.2 124.2 14.37
84 10.305 39.0 32.3 25.4 24.4 24.7 123.5 14.40
85 10.300 34.4 33.1 26.5 25.6 25.5 123.5 14.18

TABLE A.9. Indices of the quality of life in the provinces of Italy. Data from Il
Sole - 24 Ore 2001

Number Y1 Y2 Y3 Y4 Y5 Y6
1 9962 .65 105.60 507.78 13.50 10.75 9.45
2 8468 .69 19.93 421.21 18.27 13.24 1.84
3 9770.61 47.25 540.63 9.57 11.89 3.78
4 9706.74 30.60 504.93 16.10 9.13 3.38
5 8275.02 59.37 733.30 28.50 12.82 3.11
6 9121.25 35.83 480.45 11.63 12.80 4.59
7 8288 .02 24.31 479.30 23.78 14.80 4.14
8 7174.69 18.67 451.22 14.94 14.94 4.06
9 10525.64 19.90 317.61 19.07 7.46 3.93
10 6403 .96 23.57 589.65 10.17 16.64 10.43
11 7421.80 35.39 997.12 17.16 16.80 4.65
12 9572.63 44.94 495.60 25.90 15.83 13.19
13 7172 .86 38.82 363.79 14.89 13.09 3.23
14 9277.42 31.08 484.54 17.91 12.55 4.36
15 8761.72 38.33 434.94 10.50 10.87 3 .37
16 9154.08 3.38 166.69 21.96 6.76 2.36
17 20890.66 102.47 324.47 12.85 12.77 17.24
18 9406 .42 31.92 299.26 9.75 6.36 4.48
19 10046.40 57.16 551.31 10.25 8.45 4.47
20 9626.65 30.45 412.06 17.03 9.62 4.23
21 9141.12 28.89 250.82 20.26 10.43 4.15
22 9433.11 30.04 319.42 13.03 13.82 3.84
23 8816.74 28.88 520.42 12.83 10.59 1.91
24 9846.11 22.81 231.13 15.21 8.11 1.81
25 12051.57 16.55 148.95 29.23 8.38 9.16
26 11100.26 14.02 243.59 11.93 6.49 4.74
27 9786 .08 41.11 440.26 16.40 10.73 37.48
28 8962 .15 31.33 419.45 17.24 11.95 4.40
29 7350.34 6.16 229.80 30.80 8.05 1.41
30 8558.84 37.68 489.69 10.08 10.84 4.22
31 8159.78 42.56 536.16 19.63 14.11 7.46
32 9922.04 35.74 353.19 10.55 13.01 3.79
33 7507.03 23.43 342.39 13.97 9.04 2.37
34 10359.46 19.60 283.79 16.72 10.18 6.55
35 8941.28 20.89 159.18 20.89 10.08 5.03

TABLE A.9. Quality of life data (continued)

Number Y1 Y2 Y3 Y4 Y5 Y6
36 13643.31 43.41 165.54 45.85 11.36 28.33
37 8759.12 13.44 387.85 24.75 7.42 3.69
38 10012.27 34.83 401.52 20.97 14.98 6.87
39 11417.11 31.00 344.01 23.00 12.75 8.81
40 10264.58 33.11 308.11 19.30 17.11 5.56
41 10179.11 32.25 312.82 17.70 12.01 7.81
42 12600.70 90.79 510.68 19.20 13.34 16.44
43 7636.55 18.41 413.69 15.54 12.95 5.86
44 9408.58 67.57 578.04 24.98 9.94 8.88
45 10598.16 39.25 432.91 27.20 15.42 5.52
46 8788.12 89.56 618.20 13.47 21.48 5.89
47 9697.86 16.70 353.76 19.00 11.51 9.33
48 8740.72 28.00 313.56 10.53 18.14 7.87
49 8317.71 25.95 323.59 15.11 14.45 7.80
50 7661.13 22.38 335.67 13.75 15.91 4.77
51 7164.87 24.08 495.05 12.04 10.53 3.67
52 8838.89 38.07 454.14 30.08 12.51 10.76
53 8680.71 24.76 434.51 7.76 13.30 6.88
54 11232.08 51.54 508.41 11.50 11.81 16.34
55 6727.87 22.75 399.06 9.88 15.87 4.78
56 8471.09 28.89 448.82 23 .21 15.73 4.76
57 10523.71 27.19 194.65 14.21 9.58 2.61
58 11336.44 33.45 332.18 24.80 13.38 4.99
59 7618.48 13.45 327.47 23.37 10.20 3.40
60 11022.14 59.04 357.82 23.87 12.15 4.33
61 7977.96 32.40 318.29 22.68 11.99 8.76
62 6801.26 30.48 241.58 13.45 9.41 3.81
63 5748.72 25.19 381.90 8.85 9.19 2.59
64 5411.53 52.90 402.67 10.58 5.95 1.42
65 15558.38 102.04 472.97 10.26 11.72 14.31
66 5085.80 34.67 463.92 4.28 7.99 2.32
67 4503.98 29.74 344.31 5.26 9.71 0.65
68 4056.48 122.19 350.35 5.84 5.49 1.91
69 4142.85 17.42 353.11 9.90 4.10 2.68

TABLE A.9. Quality of life data (concluded)

Number Y1 Y2 Y3 Y4 Y5 Y6
70 6049.62 251.53 161.30 2.68 8.29 5.85
71 4825 .67 16.81 221.94 10.68 5.91 0.37
72 4960.91 54.37 317.17 8.33 6.77 2.38
73 6139.46 23.72 221.74 10.54 13.18 2.50
74 7431.26 27.05 352.27 13.35 18.49 1.72
75 7423 .60 31.85 357.80 10.50 22.36 5.54
76 5841.90 13.57 226.36 10.75 11.52 1.75
77 4550.97 8.48 167.50 8.48 7.63 0.94
78 4119.42 15.33 172.96 14.23 7.66 0.77
79 5311.58 56.76 181.54 6.93 5.05 1.27
80 6527.75 45.30 272.95 6.52 7.15 3.38
81 4693.84 26.41 257.59 12.95 5.28 2.25
82 4835.48 50.60 218.95 7.30 4.62 1.53
83 4200.45 36.78 281.60 12.01 5.76 2.15
84 4610.62 10.03 130.86 16.80 3.76 1.27
85 5893.22 12.14 169.02 14.08 5.34 0.91
86 4122.81 24.77 273.15 5.38 5.79 2.04
87 4443.40 14.41 197.78 7.07 7.86 1.21
88 4093.96 47.36 129.28 5.26 7.54 1.72
89 4123.18 17.90 179.00 2.31 1.73 0.94
90 3418.24 20.51 174.37 3.42 3.42 0.76
91 4727.70 25.41 454.35 5.08 6.70 3.22
92 5506.50 232.05 323.24 22.69 7.70 5.22
93 4279.83 42.58 289.28 9.35 8.75 3.08
94 4593.08 16.07 175.31 14.57 2.36 1.58
95 5202.02 22.30 174.43 18.41 5.66 0.91
96 3805.09 13.87 170.89 18.86 5.55 0.66
97 4465.26 118.97 248.28 7.71 9.07 5.97
98 4761.98 31.37 342.73 10.24 9.91 1.60
99 4408.57 40.32 383.02 14.93 7.72 2.13
100 5762.56 25.70 440.60 7.41 8.49 2.36
101 5509.60 34.70 223.51 9.70 4.85 1.45
102 6669.89 32.84 283.81 15.70 8.77 5.13
103 5310.75 19.15 257.91 17.87 4.47 1.17

TABLE A.10. Iris data from Anderson (1935). Iris setosa: Y1 sepal length, Y2 sepal width, Y3 petal length and Y4 petal width, all in centimetres

Number Y1 Y2 Y3 Y4 Number Y1 Y2 Y3 Y4
1 5.1 3.5 1.4 0.2 26 5.0 3.0 1.6 0.2
2 4.9 3.0 1.4 0.2 27 5.0 3.4 1.6 0.4
3 4.7 3.2 1.3 0.2 28 5.2 3.5 1.5 0.2
4 4.6 3.1 1.5 0.2 29 5.2 3.4 1.4 0.2
5 5.0 3.6 1.4 0.2 30 4.7 3.2 1.6 0.2
6 5.4 3.9 1.7 0.4 31 4.8 3.1 1.6 0.2
7 4.6 3.4 1.4 0.3 32 5.4 3.4 1.5 0.4
8 5.0 3.4 1.5 0.2 33 5.2 4.1 1.5 0.1
9 4.4 2.9 1.4 0.2 34 5.5 4.2 1.4 0.2
10 4.9 3.1 1.5 0.1 35 4.9 3.1 1.5 0.2
11 5.4 3.7 1.5 0.2 36 5.0 3.2 1.2 0.2
12 4.8 3.4 1.6 0.2 37 5.5 3.5 1.3 0.2
13 4.8 3.0 1.4 0.1 38 4.9 3.6 1.4 0.1
14 4.3 3.0 1.1 0.1 39 4.4 3.0 1.3 0.2
15 5.8 4.0 1.2 0.2 40 5.1 3.4 1.5 0.2
16 5.7 4.4 1.5 0.4 41 5.0 3.5 1.3 0.3
17 5.4 3.9 1.3 0.4 42 4.5 2.3 1.3 0.3
18 5.1 3.5 1.4 0.3 43 4.4 3.2 1.3 0.2
19 5.7 3.8 1.7 0.3 44 5.0 3.5 1.6 0.6
20 5.1 3.8 1.5 0.3 45 5.1 3.8 1.9 0.4
21 5.4 3.4 1.7 0.2 46 4.8 3.0 1.4 0.3
22 5.1 3.7 1.5 0.4 47 5.1 3.8 1.6 0.2
23 4.6 3.6 1.0 0.2 48 4.6 3.2 1.4 0.2
24 5.1 3.3 1.7 0.5 49 5.3 3.7 1.5 0.2
25 4.8 3.4 1.9 0.2 50 5.0 3.3 1.4 0.2

TABLE A.10. Iris data (continued). Iris versicolor: Y1 sepal length, Y2 sepal width, Y3 petal length and Y4 petal width, all in centimetres

Number Y1 Y2 Y3 Y4 Number Y1 Y2 Y3 Y4
51 7.0 3.2 4.7 1.4 76 6.6 3.0 4.4 1.4
52 6.4 3.2 4.5 1.5 77 6.8 2.8 4.8 1.4
53 6.9 3.1 4.9 1.5 78 6.7 3.0 5.0 1.7
54 5.5 2.3 4.0 1.3 79 6.0 2.9 4.5 1.5
55 6.5 2.8 4.6 1.5 80 5.7 2.6 3.5 1.0
56 5.7 2.8 4.5 1.3 81 5.5 2.4 3.8 1.1
57 6.3 3.3 4.7 1.6 82 5.5 2.4 3.7 1.0
58 4.9 2.4 3.3 1.0 83 5.8 2.7 3.9 1.2
59 6.6 2.9 4.6 1.3 84 6.0 2.7 5.1 1.6
60 5.2 2.7 3.9 1.4 85 5.4 3.0 4.5 1.5
61 5.0 2.0 3.5 1.0 86 6.0 3.4 4.5 1.6
62 5.9 3.0 4.2 1.5 87 6.7 3.1 4.7 1.5
63 6.0 2.2 4.0 1.0 88 6.3 2.3 4.4 1.3
64 6.1 2.9 4.7 1.4 89 5.6 3.0 4.1 1.3
65 5.6 2.9 3.6 1.3 90 5.5 2.5 4.0 1.3
66 6.7 3.1 4.4 1.4 91 5.5 2.6 4.4 1.2
67 5.6 3.0 4.5 1.5 92 6.1 3.0 4.6 1.4
68 5.8 2.7 4.1 1.0 93 5.8 2.6 4.0 1.2
69 6.2 2.2 4.5 1.5 94 5.0 2.3 3.3 1.0
70 5.6 2.5 3.9 1.1 95 5.6 2.7 4.2 1.3
71 5.9 3.2 4.8 1.8 96 5.7 3.0 4.2 1.2
72 6.1 2.8 4.0 1.3 97 5.7 2.9 4.2 1.3
73 6.3 2.5 4.9 1.5 98 6.2 2.9 4.3 1.3
74 6.1 2.8 4.7 1.2 99 5.1 2.5 3.0 1.1
75 6.4 2.9 4.3 1.3 100 5.7 2.8 4.1 1.3

TABLE A.10. Iris data (concluded). Iris virginica: Y1 sepal length, Y2 sepal width, Y3 petal length and Y4 petal width, all in centimetres

Number Y1 Y2 Y3 Y4 Number Y1 Y2 Y3 Y4
101 6.3 3.3 6.0 2.5 126 7.2 3.2 6.0 1.8
102 5.8 2.7 5.1 1.9 127 6.2 2.8 4.8 1.8
103 7.1 3.0 5.9 2.1 128 6.1 3.0 4.9 1.8
104 6.3 2.9 5.6 1.8 129 6.4 2.8 5.6 2.1
105 6.5 3.0 5.8 2.2 130 7.2 3.0 5.8 1.6
106 7.6 3.0 6.6 2.1 131 7.4 2.8 6.1 1.9
107 4.9 2.5 4.5 1.7 132 7.9 3.8 6.4 2.0
108 7.3 2.9 6.3 1.8 133 6.4 2.8 5.6 2.2
109 6.7 2.5 5.8 1.8 134 6.3 2.8 5.1 1.5
110 7.2 3.6 6.1 2.5 135 6.1 2.6 5.6 1.4
111 6.5 3.2 5.1 2.0 136 7.7 3.0 6.1 2.3
112 6.4 2.7 5.3 1.9 137 6.3 3.4 5.6 2.4
113 6.8 3.0 5.5 2.1 138 6.4 3.1 5.5 1.8
114 5.7 2.5 5.0 2.0 139 6.0 3.0 4.8 1.8
115 5.8 2.8 5.1 2.4 140 6.9 3.1 5.4 2.1
116 6.4 3.2 5.3 2.3 141 6.7 3.1 5.6 2.4
117 6.5 3.0 5.5 1.8 142 6.9 3.1 5.1 2.3
118 7.7 3.8 6.7 2.2 143 5.8 2.7 5.1 1.9
119 7.7 2.6 6.9 2.3 144 6.8 3.2 5.9 2.3
120 6.0 2.2 5.0 1.5 145 6.7 3.3 5.7 2.5
121 6.9 3.2 5.7 2.3 146 6.7 3.0 5.2 2.3
122 5.6 2.8 4.9 2.0 147 6.3 2.5 5.0 1.9
123 7.7 2.8 6.7 2.0 148 6.5 3.0 5.2 2.0
124 6.3 2.7 4.9 1.8 149 6.2 3.4 5.4 2.3
125 6.7 3.3 5.7 2.1 150 5.9 3.0 5.1 1.8

TABLE A.11. Electrodes data from Flury and Riedwyl (1988). Machine one: Y1, Y2 and Y5, scaled diameters; Y3 and Y4, scaled lengths

Number Y1 Y2 Y3 Y4 Y5 Number Y1 Y2 Y3 Y4 Y5
1 40 58 31 44 64 26 43 60 35 38 62
2 39 59 33 40 60 27 40 59 29 41 60
3 40 58 35 46 59 28 40 59 37 41 59
4 39 59 31 47 58 29 40 60 37 46 60
5 40 60 36 41 56 30 40 58 42 45 61
6 45 60 45 45 58 31 42 63 48 47 64
7 42 64 39 38 63 32 41 59 37 49 60
8 44 59 41 40 60 33 39 58 31 47 60
9 42 66 48 20 61 34 42 60 43 49 61
10 40 60 35 40 58 35 42 59 37 53 62
11 40 61 40 41 58 36 40 58 35 40 59
12 40 58 38 45 60 37 40 59 35 48 58
13 38 59 39 46 58 38 39 60 35 46 59
14 42 59 32 36 61 39 38 59 30 47 57
15 40 61 45 45 59 40 40 60 38 48 62
16 40 59 45 52 59 41 44 60 36 44 60
17 42 58 38 51 59 42 40 58 34 41 58
18 40 59 37 44 60 43 38 60 31 49 60
19 39 60 35 49 59 44 38 58 29 46 60
20 39 60 37 46 56 45 39 59 35 43 56
21 40 58 35 39 58 46 40 60 37 45 59
22 39 59 34 41 60 47 40 60 37 44 61
23 39 60 37 39 59 48 42 62 37 35 60
24 40 59 42 43 57 49 40 59 35 44 58
25 40 59 37 46 60 50 42 58 35 43 61

TABLE A.11. Electrodes data (concluded). Machine two: Y1, Y2 and Y5, scaled diameters; Y3 and Y4, scaled lengths

Number Y1 Y2 Y3 Y4 Y5 Number Y1 Y2 Y3 Y4 Y5
51 44 58 32 25 57 76 44 57 33 11 59
52 43 58 25 19 60 77 44 60 25 10 59
53 44 57 30 24 59 78 44 58 22 16 59
54 42 59 36 20 59 79 44 60 36 18 57
55 42 60 38 29 59 80 46 61 39 14 59
56 43 56 38 32 58 81 42 58 36 27 57
57 43 57 26 18 59 82 43 60 20 19 60
58 45 60 27 27 59 83 42 59 27 23 59
59 45 59 33 18 60 84 43 58 28 12 58
60 43 58 29 26 59 85 42 57 41 24 58
61 43 59 39 22 58 86 44 60 28 20 60
62 43 59 35 29 59 87 43 58 45 25 59
63 44 57 37 19 58 88 43 59 35 21 59
64 43 58 29 20 58 89 43 60 29 2.0 60
65 43 58 27 8 58 90 44 59 22 11 59
66 44 60 39 15 60 91 44 58 46 25 58
67 43 58 35 13 58 92 43 60 28 9 60
68 44 58 38 19 58 93 43 59 38 29 59
69 43 58 36 19 58 94 43 58 47 24 57
70 43 58 29 19 60 95 42 58 24 19 59
71 43 58 29 21 58 96 43 60 35 22 58
72 42 59 43 26 58 97 45 60 28 18 60
73 43 58 26 20 58 98 43 57 38 23 60
74 44 59 22 17 59 99 44 60 31 22 58
75 43 59 36 25 59 100 43 58 22 20 57

TABLE A.12. Muscular dystrophy data. Noncarriers: Y1 age, Y2 month, Y3-Y6 levels of four serum markers. Units 1-39 are included in the "small" data set

Number Y1 Y2 Y3 Y4 Y5 Y6
1 27 6 22 84.0 2.8 145
2 26 7 30 76.0 17.1 145
3 26 3 35 76.7 10.9 105
4 26 7 34 78.0 8.0 140
5 31 10 27 90.0 15.6 167
6 31 3 22 71.5 11.8 98
7 31 6 22 73.5 5.1 184
8 25 9 72 80.5 12.0 225
9 35 7 51 70.0 16.6 146
10 36 2 30 66.7 15.3 124
11 36 7 23 66.3 4.4 142
12 33 4 67 98.0 9.3 225
13 27 9 50 69.0 15.1 160
14 27 3 92 68.0 16.5 115
15 36 10 55 78.2 21.8 188
16 25 6 38 82.0 15.8 161
17 33 3 27 100.0 10.3 169
18 22 8 34 84.0 12.0 175
19 23 2 44 81.3 10.5 159
20 25 7 32 86.5 6.7 149
21 22 10 35 59.4 11.3 130
22 31 5 35 90.3 15.3 124
23 33 8 31 75.5 13.7 160
24 33 11 25 78.9 12.2 127
25 27 6 52 77.0 17.9 198
26 30 10 34 75.0 15.4 171
27 20 10 53 93.2 22.3 349
28 30 2 69 66.7 8.7 119
29 30 7 25 70.5 5.3 123
30 27 5 24 89.5 16.1 176
31 35 8 21 108.5 9.8 148
32 26 6 51 82.0 12.9 149
33 27 6 37 77.3 3.9 141
34 24 7 24 82.0 14.2 123
35 25 2 30 77.0 16.2 124
36 26 7 34 81.3 9.7 158
37 20 9 22 102.0 10.3 177
38 22 10 32 79.2 5.8 190
39 34 10 24 70.4 10.6 181
40 22 7 20 72.0 11.9 110
41 22 6 34 91.0 14.5 144
42 24 9 25 92.0 14.0 166
43 38 12 26 109.0 8.9 163

TABLE A.12. Muscular dystrophy data. Noncarriers (continued)

Number Y1 Y2 Y3 Y4 Y5 Y6
44 39 1 28.0 102.3 17.1 146
45 39 3 21.0 92.4 10.3 197
46 39 3 23.0 111.5 10.0 133
47 39 4 26.0 92.6 12.3 196
48 39 4 25.0 98.7 10.0 174
49 39 6 21.0 93.2 5.9 181
50 32 2 56.0 72.0 9.9 227
51 34 10 48.0 83.0 13.7 228
52 22 6 51.0 91.0 12.7 149
53 39 3 18.0 95.0 11.3 66
54 33 7 28.0 104.0 6.9 169
55 33 11 41.0 105.5 15.1 252
56 20 6 40.0 81.0 6.1 167
57 22 7 21.0 74.5 12.2 163
58 25 2 95.0 69.8 7.3 169
59 25 6 59.0 72.5 10.7 314
60 36 2 40.0 72.7 7.0 131
61 36 8 30.0 79.5 11.9 130
62 22 6 48.0 76.0 16.6 133
63 39 6 39.0 88.5 7.6 168
64 30 3 30.0 82.7 18.1 124
65 37 7 38.0 85.0 21.6 198
66 27 3 27.0 87.2 12.5 99
67 28 7 32.0 76.3 5.6 159
68 29 2 74.0 80.4 8.9 207
69 30 6 33.0 86.0 3.8 149
70 31 7 34.0 80.5 11.1 149
71 31 2 45.0 86.5 10.8 169
72 32 6 52.0 79.0 10.7 187
73 32 7 28.0 82.5 17.4 144
74 32 12 35.0 97.0 14.5 137
75 25 6 37.0 93.0 15.3 167
76 26 2 44.0 81.3 15.3 166
77 26 6 68.0 82.8 11.9 177
78 37 10 97.5 34.0 12.0 203
79 25 7 37.0 98.0 16.4 198
80 25 7 34.0 92.0 12.1 217
81 20 6 30.0 80.0 12.9 129
82 25 6 37.0 98.0 11.7 177
83 34 9 24.0 100.5 14.0 231
84 32 4 41.0 78.5 10.9 191
85 32 6 43.0 87.5 6.0 136

TABLE A.12. Muscular dystrophy data. Noncarriers (concluded)

Number Yl Y2 Y3 Y4 Y5 Y6
86 32 10 30 90.5 15.3 136
87 33 7 30 85.0 11.4 176
88 33 10 43 88.5 20.3 175
89 22 6 52 83.5 10.9 176
90 32 8 20 77.0 11.0 200
91 36 7 28 86.5 13.2 171
92 22 11 30 104.0 22.6 230
93 23 1 40 83.0 15.2 205
94 30 5 24 78.8 9.6 151
95 27 8 15 87.0 13.5 232
96 30 11 22 91.0 17.5 198
97 25 10 42 65.5 13.3 216
98 26 2 130 80.3 17.1 211
99 26 3 48 85.2 22.7 160
100 27 7 31 86.5 6.9 162
101 26 10 47 53.0 14.6 131
102 27 3 36 56.0 18.2 105
103 27 7 24 57.5 5.6 130
104 31 4 34 92.7 7.9 140
105 31 9 38 96.0 12.6 158
106 35 10 40 104.6 16.1 209
107 28 4 59 88.0 9.9 128
108 28 8 75 81.0 10.1 177
109 28 9 72 66.3 16.4 156
110 27 7 42 77.0 15.3 163
111 27 3 30 80.2 8.1 100
112 28 6 24 87.0 3.5 132
113 24 9 26 84.5 20.7 145
114 23 8 65 75.0 19.9 187
115 27 3 34 86.3 11.8 120
116 25 2 37 73.3 13.0 254
117 34 3 73 57.4 7.4 107
118 34 7 87 76.3 6.0 87
119 25 7 35 71.0 8.8 186
120 20 7 31 61.5 9.9 172
121 20 5 62 81.0 10.2 181
122 31 6 48 79.0 16.8 182
123 31 7 40 82.5 6.4 151
124 26 7 55 85.5 10.9 216
125 26 7 32 73.8 8.6 147
126 21 11 26 79.3 16.4 123
127 27 6 25 91.0 10.3 135

TABLE A.12. Muscular dystrophy data. Carriers: the first set of unit numbers are those for the "small" data set
Number Number Y1 Y2 Y3 Y4 Y5 Y6
40 128 30 10 167 89.0 25.6 364
41 129 41 10 104 81.0 26.8 245
42 130 22 8 30 108.0 8.8 284
43 131 22 8 44 104.0 17.4 172
44 132 20 10 65 87.0 23.8 198
45 133 42 9 440 107.0 20.2 239
46 134 59 8 58 88.2 11.0 259
47 135 35 9 129 93.1 18.3 188
48 136 36 6 104 87.5 16.7 256
49 137 35 2 122 88.5 21.6 263
50 138 29 4 265 83.5 16.1 136
51 139 27 4 285 79.5 36.4 245
52 140 27 9 25 91.0 49.1 209
53 141 28 4 124 92.0 32.2 298
54 142 29 8 53 76.0 14.0 174
55 143 30 2 46 71.0 16.9 197
56 144 30 7 40 85.5 12.7 201
57 145 30 8 41 90.0 9.7 342
58 146 31 6 657 104.0 110.0 358
59 147 32 2 465 86.5 63.7 412
60 148 32 5 485 83.5 73.0 382
61 149 37 2 168 82.5 23.3 261
62 150 38 6 286 109.5 31.9 260
63 151 39 1 388 91.0 41.6 204
64 152 39 9 148 105.2 18.8 221
65 153 34 6 73 105.5 17.0 285
66 154 35 4 36 92.8 22.0 308
67 155 58 8 19 100.5 10.9 196
68 156 58 2 34 98.5 19.9 299
69 157 38 1 113 97.0 18.8 216
70 158 30 8 57 105.0 12.9 155
71 159 42 8 78 118.0 15.5 212
72 160 43 11 73 104.0 20.6 201
73 161 29 3 69 111.0 16.0 175

TABLE A.12. Muscular dystrophy data. Carriers (concluded)

Number Y1 Y2 Y3 Y4 Y5 Y6
162 30 10 177 103.5 19.8 241
163 35 6 48 98.0 16.4 233
164 35 7 34 96.5 10.4 122
165 35 9 42 100.1 17.1 184
166 44 9 109 81.0 25.3 227
167 35 9 925 81.0 62.9 279
168 35 4 1288 82.0 51.6 368
169 36 9 325 76.3 33.9 413
170 53 6 59 93.0 22.2 240
171 54 4 69 92.6 20.9 243
172 30 4 363 91.3 36.0 325
173 35 11 37 84.0 12.8 156
174 53 6 101 77.5 11.7 280
175 41 3 99 93.2 18.6 156
176 40 9 125 90.5 19.4 438
177 42 8 52 93.3 11.2 272
178 59 6 560 106.0 21.0 345
179 31 8 85 94.0 20.1 198
180 32 6 72 88.0 8.3 166
181 52 6 197 91.5 25.2 236
182 52 3 242 85.5 16.6 168
183 53 8 245 89.5 22.7 269
184 39 10 154 103.5 21.3 296
185 39 6 228 104.0 10.2 236
186 43 8 80 90.5 12.1 269
187 44 6 28 104.0 22.0 142
188 45 6 35 86.3 14.4 184
189 33 5 57 88.0 8.9 190
190 26 11 326 98.0 27.1 358
191 26 6 700 90.0 49.1 343
192 61 9 100 101.0 11.8 301
193 61 2 80 97.5 15.1 262
194 48 6 115 79.0 14.2 258

TABLE A.13. The 60:80 data

Number Y1 Y2 Number Y1 Y2
1 -5.012 -0.739 36 5.575 -1.425
2 -5.535 -0.876 37 3.338 0.705
3 -1.016 -0.314 38 -0.929 2.217
4 -9.488 0.449 39 0.400 0.019
5 3.434 1.369 40 -2.274 -0.602
6 0.455 1.958 41 -6.458 -1.084
7 -1.416 0.679 42 -5.623 -0.922
8 -5.544 -1.809 43 4.600 -2.675
9 -7.615 -2.717 44 -9.047 -1.510
10 -7.078 -2.640 45 6.944 5.484
11 2.402 5.193 46 -2.049 -0.523
12 -0.564 -3.159 47 -3.544 0.466
13 7.103 2.199 48 -3.965 -4.043
14 5.275 0.482 49 -2.074 -2.499
15 -5.661 1.042 50 -2.083 -0.930
16 5.325 3.274 51 -6.175 -2.212
17 0.879 -2.706 52 3.176 1.211
18 4.345 1.376 53 2.396 2.057
19 -1.234 1.464 54 -5.982 -1.003
20 0.025 -0.350 55 -5.354 -2.102
21 -0.575 0.063 56 -4.732 -1.773
22 8.103 2.426 57 11.639 9.161
23 3.720 -0.530 58 -0.957 0.746
24 0.989 -2.850 59 -2.667 2.636
25 1.768 0.066 60 -1.817 0.892
26 4.399 1.344 61 -2.956 -1.730
27 -7.954 -5.681 62 1.063 -0.646
28 2.065 5.042 63 1.939 0.912
29 -4.554 -3.484 64 0.027 -2.777
30 -3.260 -3.660 65 -0.024 -1.929
31 -5.093 -1.782 66 0.300 0.155
32 -4.163 -0.377 67 1.093 -2.275
33 7.246 0.552 68 -0.427 -0.202
34 0.964 1.931 69 -4.791 0.375
35 -0.395 -1.249 70 -4.956 -2 .937

TABLE A.13. The 60:80 data (concluded)

Number Y1 Y2 Number Y1 Y2
71 5.926 1.355 106 6.759 -5.894
72 5.228 -1.806 107 5.845 -6.799
73 3.895 -0.023 108 6.704 -6.189
74 -0.592 0.097 109 5.902 -5.774
75 -1.349 0.472 110 6.618 -6.566
76 -5.471 -1.188 111 5.301 -6.897
77 -6.910 -2.812 112 6.169 -6.784
78 -1.660 -1.560 113 5.861 -7.152
79 4.715 2.412 114 5.463 -6.305
80 2.151 1.496 115 6.787 -6.698
81 6.138 -7.120 116 6.675 -6.660
82 6.002 -6.742 117 7.132 -6.113
83 6.227 -6.788 118 5.281 -5.725
84 5.895 -6.612 119 5.956 -7.255
85 6.237 -6.754 120 6.686 -6.598
86 6.178 -6.021 121 6.604 -6.482
87 6.761 -6.821 122 6.163 -6.335
88 6.748 -6.501 123 5.814 -7.142
89 6.800 -6.308 124 6.931 -6.749
90 6.295 -7.015 125 6.490 -6.674
91 7.157 -6.809 126 6.143 -6.455
92 6.259 -7.497 127 5.717 -6.803
93 6.215 -6.484 128 6.243 -7.318
94 6.547 -7.371 129 6.870 -5.965
95 5.806 -6.889 130 6.050 -6.719
96 6.244 -5.959 131 6.602 -5.782
97 6.195 -6.246 132 6.186 -6.208
98 6.968 -6.612 133 5.626 -6.905
99 6.055 -6.954 134 5.688 -6.903
100 6.566 -5.645 135 6.479 -5.913
101 6.324 -6.704 136 6.376 -6.523
102 6.324 -6.606 137 6.210 -6.970
103 6.144 -6.535 138 5.698 -6.570
104 5.746 -7.259 139 6.318 -6.417
105 5.885 -6.875 140 6.703 -6.993

TABLE A.14. Three clusters, two outliers

Number Y1 Y2 Number Y1 Y2
1 1.682 0.932 41 5.906 1.715
2 -3.716 0.026 42 0.762 -0.696
3 -6.653 -0.751 43 11.267 4.463
4 -9.166 -2.120 44 0.869 -4.606
5 1.153 2.829 45 -0.828 -1.242
6 5.679 1.283 46 3.884 3.720
7 -2.375 -0.251 47 2.597 3.473
8 -0.219 -0.495 48 -2.351 0.045
9 1.545 -2.629 49 -3.123 0.786
10 3.298 -0.232 50 5.260 1.596
11 8.033 5.651 51 2.985 -0.543
12 -2.142 -1.198 52 -0.871 0.404
13 6.907 2.628 53 -0.717 2.866
14 3.186 1.108 54 3.256 1.875
15 -6.594 -1.747 55 2.572 -0.363
16 0.337 1.857 56 -2.127 -1.428
17 4.748 0.208 57 5.911 0.445
18 4.526 1.462 58 -3.877 -0.277
19 -0.749 -2.873 59 4.908 4.052
20 2.499 -2.227 60 2.306 1.785
21 -1.284 -1.254 61 1.317 3.234
22 2.338 0.247 62 0.725 -0.670
23 -9.096 -0.316 63 10.644 5.478
24 -0.387 2.785 64 1.131 -0.138
25 8.332 4.728 65 3.695 3.078
26 -3.184 -3.332 66 -4.991 1.467
27 3.201 0.967 67 -1.589 1.694
28 -0.245 -2.648 68 2.175 -3.422
29 -3.337 2.500 69 5.496 1.910
30 4.656 -0.901 70 -5.380 -2.006
31 4.029 1.718 71 4.104 3.952
32 -0.546 -0.558 72 -0.026 0.276
33 -0.450 1.908 73 -6.982 -1.564
34 1.543 0.386 74 -2.369 -3.255
35 -1.908 -1.891 75 -1.892 0.831
36 -4.493 1.269 76 -0.409 -1.461
37 4.298 0.725 77 2.792 0.351
38 2.223 1.987 78 5.785 1.071
39 0.569 -1.264 79 -6.592 -4.130
40 -4.569 -4.819 80 2.593 2.910

TABLE A.14. Three clusters, two outliers (concluded)

Number Y1 Y2 Number Y1 Y2
81 6.590 -6.348 121 5.980 -7.278
82 6.897 -6.334 122 6.389 -6.913
83 6.656 -5.906 123 5.557 -6.441
84 6.142 -6.002 124 5.798 -6.304
85 6.719 -6.827 125 6.572 -6.290
86 6.333 -6.499 126 5.504 -6.898
87 6.642 -6.703 127 6.365 -5.376
88 6.154 -5.958 128 5.599 -5.577
89 6.310 -6.706 129 6.391 -6.220
90 6.085 -6.233 130 6.497 -6.739
91 6.525 -6.410 131 5.282 -6.437
92 5.991 -6.212 132 5.986 -6.776
93 6.485 -6.279 133 7.151 -6.489
94 6.529 -6.663 134 6.200 -5.995
95 6.852 -6.335 135 6.748 -6.358
96 5.993 -6.373 136 5.793 -6.919
97 6.181 -6.749 137 6.387 -6.492
98 5.734 -6.360 138 6.971 -6.490
99 6.110 -6.515 139 7.050 -6.104
100 6.675 -7.288 140 6.247 -6.210
101 6.002 -6.023 141 -11.541 -9.499
102 6.753 -6.808 142 -11.123 -10.368
103 6.455 -7.162 143 -10.139 -10.070
104 6.750 -6.262 144 -11.810 -10.152
105 6.233 -6.302 145 -11.124 -11.049
106 5.853 -7.013 146 -11.999 -10.550
107 6.019 -6.462 147 -10.918 -10.602
108 6.690 -6.270 148 -11.548 -10.489
109 6.691 -6.710 149 -10.814 -10.284
110 6.004 -6.080 150 -11.033 -10.100
111 6.410 -6.634 151 -11.359 -10.029
112 6.401 -6.614 152 -10.080 -10.895
113 6.275 -6.086 153 -10.902 -10.486
114 6.924 -7.248 154 -10.778 -10.507
115 6.511 -6.648 155 -11.402 -10.590
116 4.727 -6.225 156 -11.129 -10.052
117 5.953 -5.145 157 -11.906 -10.140
118 5.526 -6.154 158 -11.212 -10.178
119 6.618 -6.723 159 -5.873 11.385
120 6.201 -6.924 160 -5.416 10.832

TABLE A.15. Data with a bridge

Number Y1 Y2 Number Y1 Y2
1 -5.012 -0.739 44 -9.047 -1.510
2 -5.535 -0.876 45 6.944 5.484
3 -1.016 -0.314 46 -2.049 -0.523
4 -9.488 0.449 47 -3.544 0.466
5 3.434 1.369 48 -3.965 -4.043
6 0.455 1.958 49 -2.074 -2.499
7 -1.416 0.679 50 -2.083 -0.930
8 -5.544 -1.809 51 -6.175 -2.212
9 -7.615 -2.717 52 3.176 1.211
10 -7.078 -2.640 53 2.396 2.057
11 2.402 5.193 54 -5.982 -1.003
12 -0.564 -3.159 55 -5.354 -2.102
13 7.103 2.199 56 -4.732 -1.773
14 5.275 0.482 57 11.640 9.161
15 -5.661 1.042 58 -0.957 0.746
16 5.325 3.274 59 -2.667 2.636
17 0.879 -2.706 60 -1.817 0.892
18 4.345 1.376 61 -2.956 -1.730
19 -1.234 1.464 62 1.063 -0.646
20 0.025 -0.351 63 1.939 0.912
21 -0.576 0.063 64 0.027 -2.777
22 8.103 2.426 65 -0.024 -1.929
23 3.720 -0.530 66 0.300 0.155
24 0.988 -2.850 67 1.093 -2.275
25 1.768 0.066 68 -0.427 -0.202
26 4.399 1.344 69 -4.791 0.375
27 -7.954 -5.681 70 -4.956 -2.937
28 2.065 5.042 71 5.926 1.355
29 -4.554 -3.484 72 5.228 -1.806
30 -3.260 -3.660 73 3.895 -0.023
31 -5.093 -1.782 74 -0.592 0.097
32 -4.163 -0.377 75 -1.349 0.472
33 7.246 0.552 76 -5.471 -1.188
34 0.964 1.931 77 -6.910 -2.812
35 -0.395 -1.249 78 -1.660 -1.560
36 5.575 -1.425 79 4.715 2.412
37 3.338 0.704 80 2.151 1.496
38 -0.930 2.217 81 6.138 -7.120
39 0.400 0.019 82 6.002 -6.742
40 -2.274 -0.602 83 6.227 -6.788
41 -6.458 -1.084 84 5.895 -6.612
42 -5.623 -0.922 85 6.237 -6.754
43 4.600 -2.675 86 6.178 -6.021

TABLE A.15. Data with a bridge (concluded)

Number Y1 Y2 Number Y1 Y2
87 6.761 -6.821 129 6.870 -5.965
88 6.748 -6.501 130 6.050 -6.719
89 6.800 -6.308 131 6.602 -5.782
90 6.295 -7.015 132 6.186 -6.208
91 7.157 -6.809 133 5.626 -6.905
92 6.259 -7.497 134 5.688 -6.903
93 6.215 -6.484 135 6.479 -5.913
94 6.547 -7.371 136 6.376 -6.523
95 5.806 -6.889 137 6.210 -6.970
96 6.244 -5.959 138 5.698 -6.570
97 6.195 -6.246 139 6.318 -6.417
98 6.968 -6.612 140 6.703 -6.993
99 6.055 -6.954 141 3.042 -4.594
100 6.566 -5.645 142 3.663 -4.623
101 6.324 -6.704 143 2.124 -4.524
102 6.324 -6.606 144 3.809 -6.481
103 6.144 -6.535 145 4.891 -5.099
104 5.746 -7.259 146 4.084 -5.492
105 5.885 -6.875 147 3.562 -5.194
106 6.759 -5.894 148 4.234 -3.757
107 5.845 -6.799 149 4.065 -4.472
108 6.704 -6.189 150 2.541 -4.430
109 5.902 -5.774 151 3.614 -3.448
110 6.618 -6.566 152 3.716 -5.075
111 5.301 -6.897 153 3.969 -5.732
112 6.169 -6.784 154 4.303 -5.312
113 5.861 -7.152 155 2.467 -4.155
114 5.463 -6.305 156 2.036 -3.381
115 6.787 -6.698 157 3.654 -5.486
116 6.675 -6.660 158 3.410 -2.913
117 7.132 -6.113 159 3.738 -5.630
118 5.281 -5.725 160 4.029 -3.836
119 5.956 -7.255 161 4.541 -6.641
120 6.686 -6.598 162 3.991 -5.016
121 6.604 -6.482 163 3.628 -3.185
122 6.163 -6.335 164 3.250 -4.873
123 5.814 -7.142 165 2.747 -2.800
124 6.931 -6.749 166 4.012 -4.499
125 6.490 -6.674 167 4.986 -5.499
126 6.143 -6.455 168 5.178 -7.342
127 5.717 -6.803 169 4.184 -5.203
128 6.243 -7.318 170 3.306 -5.657

TABLE A.16. Investment funds data from Il Sole 24 Ore 1999: Y1 and Y2 performance, Y3 volatility

Number Y1 Y2 Y3 Number Y1 Y2 Y3
1 13.20 29.40 21.30 27 14.40 31.60 22.10
2 12.10 27.10 20.00 28 10.40 29.50 21.60
3 15.50 32.40 21.70 29 10.30 34.10 22.90
4 10.60 28.30 21.00 30 11.50 36.20 23.30
5 12.40 28.90 21.20 31 12.60 28.70 21.20
6 14.60 34.90 20.70 32 11.80 25.70 20.10
7 9.60 30.50 21.60 33 12.70 34.30 24.00
8 8.70 28.30 22.20 34 9.20 30.80 24.00
9 14.80 32.90 20.30 35 18.30 28.30 19.80
10 17.30 36.90 22.30 36 11.20 30.40 22.10
11 7.70 25.80 21.10 37 21.00 36.40 20.40
12 12.10 27.60 23.50 38 9.20 25.40 21.90
13 13.50 32.50 22.40 39 17.00 45.10 23.20
14 5.50 32.40 25.50 40 10.20 30.20 22.40
15 10.90 32.40 23.00 41 15.50 35.70 22.60
16 15.40 34.60 20.20 42 10.90 32.20 22.80
17 11.60 29.90 22.00 43 16.70 33.20 23.20
18 13.00 27.60 22.40 44 5.50 27.50 23.10
19 10.00 32.50 20.60 45 12.60 31.00 21.40
20 14.60 40.80 23.40 46 10.00 27.20 20.00
21 3.80 28.70 18.20 47 11.60 30.00 21.50
22 15.00 30.90 22.00 48 10.00 28.80 22.30
23 10.70 30.10 21.20 49 10.60 26.40 20.00
24 12.70 31.60 20.40 50 -0.30 28.00 16.70
25 11.40 26.10 19.80 51 13.80 31.40 22.80
26 9.60 28.00 19.50 52 24.60 49.10 22.00

TABLE A.16. Investment funds data (concluded)

Number Y1 Y2 Y3 Number Y1 Y2 Y3
53 12.20 34.60 20.90 79 11.40 19.60 12.00
54 -0.40 22.10 15.90 80 -0.60 10.80 9.10
55 15.80 34.00 23.20 81 9.00 19.70 11.90
56 14.80 32.10 20.30 82 7.10 15.00 8.80
57 7.70 10.80 9.50 83 10.50 14.10 9.30
58 13.40 18.60 10.70 84 9.20 15.70 9.30
59 15.50 13.00 10.50 85 7.60 20.20 11.70
60 11.40 13.80 8.50 86 8.40 17.10 12.60
61 10.30 19.20 11.60 87 9.90 16.40 9.10
62 6.70 17.20 11.00 88 13.20 19.90 12.20
63 9.60 17.20 10.30 89 10.80 31.10 15.50
64 8.70 13.60 8.60 90 12.90 26.60 12.20
65 9.20 20.00 12.20 91 11.40 18.00 11.40
66 8.10 16.50 12.10 92 8.30 17.40 12.20
67 5.80 22.70 12.50 93 11.90 20.80 12.10
68 15.20 23.50 13.40 94 7.40 18.00 10.90
69 7.90 18.80 12.20 95 8.90 12.40 8.70
70 11.60 23.20 12.70 96 0.70 18.90 12.90
71 9.70 22.80 12.70 97 3.90 19.40 10.80
72 6.50 18.20 13.50 98 11.60 17.10 10.30
73 16.20 20.50 12.40 99 10.50 17.90 12.30
74 10.30 19.80 10.80 100 8.40 16.30 13.30
75 9.80 16.50 10.30 101 9.30 21.20 12.30
76 9.00 19.00 11.20 102 10.30 18.70 10.30
77 17.00 34.30 12.70 103 10.20 21.20 12.10
78 7.80 13.20 10.00

TABLE A.17. Diabetes data from Reaven and Miller (1979): Y1 and Y2 responses to oral glucose, Y3 insulin resistance

Number Y1 Y2 Y3 Number Y1 Y2 Y3
1 80 356 124 38 78 335 241
2 97 289 117 39 106 396 128
3 105 319 143 40 98 277 222
4 90 356 199 41 102 378 165
5 90 323 240 42 90 360 282
6 86 381 157 43 94 291 94
7 100 350 221 44 80 269 121
8 85 301 186 45 93 318 73
9 97 379 142 46 86 328 106
10 97 296 131 47 85 334 118
11 91 353 221 48 96 356 112
12 87 306 178 49 88 291 157
13 78 290 136 50 87 360 292
14 90 371 200 51 94 313 200
15 86 312 208 52 93 306 220
16 80 393 202 53 86 319 144
17 90 364 152 54 86 349 109
18 99 359 185 55 96 332 151
19 85 296 116 56 86 323 158
20 90 345 123 57 89 323 73
21 90 378 136 58 83 351 81
22 88 304 134 59 100 398 122
23 95 347 184 60 110 426 117
24 90 327 192 61 80 333 131
25 92 386 279 62 96 418 130
26 74 365 228 63 95 391 137
27 98 365 145 64 82 390 375
28 100 352 172 65 84 416 146
29 86 325 179 66 100 385 192
30 98 321 222 67 86 393 115
31 70 360 134 68 93 376 195
32 99 336 143 69 107 403 267
33 75 352 169 70 112 414 281
34 90 353 263 71 93 364 156
35 85 373 174 72 93 391 221
36 99 376 134 73 90 356 199
37 100 367 182 74 99 398 76

TABLE A.17. Diabetes data (concluded)

Number Y1 Y2 Y3 Number Y1 Y2 Y3
75 93 393 490 111 114 643 155
76 89 318 73 112 103 533 120
77 98 478 151 113 300 1468 28
78 88 439 208 114 303 1487 23
79 100 429 201 115 125 714 232
80 89 472 162 116 280 1470 54
81 91 436 148 117 216 1113 81
82 90 413 344 118 190 972 87
83 94 426 213 119 151 854 76
84 85 425 143 120 303 1364 42
85 96 465 237 121 173 832 102
86 111 558 748 122 203 967 138
87 107 503 320 123 195 920 160
88 114 540 188 124 140 613 131
89 101 469 607 125 151 857 145
90 108 486 297 126 275 1373 45
91 112 568 232 127 260 1133 118
92 105 527 480 128 149 849 159
93 103 537 622 129 233 1183 73
94 99 466 287 130 146 847 103
95 102 599 266 131 124 538 460
96 110 477 124 132 213 1001 42
97 102 472 297 133 330 1520 13
98 96 456 326 134 123 557 130
99 95 517 564 135 130 670 44
100 112 503 408 136 120 636 314
101 110 522 325 137 138 741 219
102 92 476 433 138 188 958 100
103 104 472 180 139 339 1354 10
104 75 455 392 140 265 1263 83
105 92 442 109 141 353 1428 41
106 92 541 313 142 180 923 77
107 92 580 132 143 213 1025 29
108 93 472 285 144 328 1246 124
109 112 562 139 145 346 1568 15
110 88 423 212
Bibliography

Agresti, A. (2002). Categorical Data Analysis, 2nd Edition. New York: Wiley.
Anderberg, M. R. (1973). Cluster Analysis for Applications. New York:
Academic Press.
Anderson, E. (1935). The irises of the Gaspé peninsula. Bulletin of the American Iris Society 59, 2-5.
Anderson, T. W. (1984). An Introduction to Multivariate Statistical
Analysis, 2nd Edition. New York: Wiley.
Andrews, D. F., R. Gnanadesikan, and J. L. Warner (1971). Transformations of multivariate data. Biometrics 27, 825-840.
Andrews , D. F. and A. M . Herzberg (1985). Data. New York: Springer-
Verlag.
Appa, G. M., A. H. Land, and P. J. Rousseeuw (1993). Discussion of paper by Hettmansperger and Sheather. The American Statistician 47, 160-163.
Atkinson, A. C. (1973). Testing transformations to normality. Journal of
the Royal Statistical Society, Series B 35, 473- 479.
Atkinson, A . C. (1985). Plots, Transformations, and Regression. Oxford:
Oxford University Press.
Atkinson, A. C. (1994). Fast very robust methods for the detection of
multiple outliers. Journal of the American Statistical Association 89,
1329- 1339.

Atkinson, A. C. (1995). Multivariate transformations, regression diagnostics and seemingly unrelated regression. In C. P. Kitsos and W. G. Müller (Eds.), MODA 4 - Advances in Model-Oriented Data Analysis, pp. 181-192. Heidelberg: Physica-Verlag.
Atkinson, A. C., L. R. Pericchi, and R. L. Smith (1991). Grouped like-
lihood for the shifted power transformation. Journal of the Royal
Statistical Society, Series B 53, 473-482.
Atkinson, A. C. and M. Riani (2000). Robust Diagnostic Regression Anal-
ysis. New York: Springer-Verlag.
Atkinson, A. C. and M. Riani (2002a) . Forward search added vari-
able t tests and the effect of masked outliers on model selection.
Biometrika 89, 939- 946.
Atkinson, A. C. and M. Riani (2002b). Tests in the fan plot for robust, diagnostic transformations in regression. Chemometrics and Intelligent Laboratory Systems 60, 87-100.
Banfield, J. D. and A. E. Raftery (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics 49, 803-821.
Barnett, V. and T. Lewis (1994). Outliers in Statistical Data, 3rd Edition. New York: Wiley.
Besag, J. E. (1974). Spatial interaction and the statistical analysis of lattice systems (with discussion). Journal of the Royal Statistical Society, Series B 36, 192-236.
Box, G. E. P. (1949). A general distribution theory for a class of likelihood criteria. Biometrika 36, 317-346.
Box, G. E. P. and D. R. Cox (1964). An analysis of transformations (with discussion). Journal of the Royal Statistical Society, Series B 26, 211-246.
Box, G. E. P. and N. R. Draper (1987). Empirical Model-Building and
Response Surfaces. New York: Wiley.
Brockwell, P. J. and R. A. Davis (1991). Time series: Theory and Meth-
ods, 2nd Edition. New York: Springer-Verlag.
Brook, D. (1964). On the distinction between conditional probability and joint probability approaches in the specification of nearest neighbour systems. Biometrika 51, 481-483.
Campbell, N. A. (1980). Shrunken estimators in discriminant and canon-
ical variate analysis. Applied Statistics 29, 5-14.
Campbell, N. A. (1982). Robust procedures in multivariate analysis II:
Robust canonical variate analysis. Applied Statistics 31, 1-8.
Campbell, N. A. (1985) . Updating formulae for allocation of individuals.
Applied Statistics 34, 235- 236.

Caussinus, H. and A. Ruiz-Gazen (1995). Metrics for finding typical structures by means of principal components analysis. In Data Science and its Applications, pp. 177-192. Japan: Academic Press.
Cerioli, A. (1999). Measuring the influence of individual observations and variables in cluster analysis. In M. Vichi and O. Opitz (Eds.), Classification and Data Analysis, pp. 3-10. Berlin: Springer-Verlag.
Cerioli, A. and M. Riani (1999). The ordering of spatial data and the de-
tection of multiple outliers. Journal of Computational and Graphical
Statistics 8, 239- 258.
Cerioli, A. and M. Riani (2002). Robust methods for the analysis of
spatially autocorrelated data. Statistical Methods and Applications -
Journal of the Italian Statistical Society 11, 335- 358.
Cerioli, A. and S. Zani (2001). Exploratory methods for detecting high density regions in cluster analysis. In S. Borra, R. Rocci, M. Vichi, and M. Schader (Eds.), Advances in Classification and Data Analysis, pp. 11-18. Berlin: Springer-Verlag.
Cheng, R. and G. W. Milligan (1996). Measuring the influence of individual data points in a cluster analysis. Journal of Classification 13, 315-335.
Chiles, J. P. and P. Delfiner (1999). Geostatistics. New York: Wiley.
Christensen, R. (2001). Advanced Linear Modelling, 2nd Edition. New
York: Springer-Verlag.
Christensen, R., W. Johnson, and L. M. Pearson (1992). Prediction diagnostics for spatial linear models. Biometrika 79, 583-591.
Cliff, A. D. and J. K. Ord (1981). Spatial Processes. Models and Appli-
cations. London: Pion.
Cook, R. D. and S. Weisberg (1982). Residuals and Infiuence in Regres-
sion. London: Chapman and Hall.
Cook, R. D. and S. Weisberg (1994). An Introduction to Regression
Graphics. New York: Wiley.
Cook, R. D. and S. Weisberg (1999). Applied Regression Including Com-
puting and Graphics. New York: Wiley.
Cressie, N. (1986). Kriging nonstationary data. Journal of the American
Statistical Association 81, 625- 634.
Cressie, N. and D. M. Hawkins (1980). Robust estimation of the vari-
ogram. Mathematical Geology 12, 115- 125.
Cressie, N. A. C. (1993). Statistics for Spatial Data, Revised Edition.
New York: Wiley.

Critchley, F. and F. Vitiello (1991). The influence of observations on misclassification probability estimates in linear discriminant analysis. Biometrika 78, 677-690.
Croux, C. and C. Dehon (2001). Robust linear discriminant analysis
using S-estimators. Canadian Journal of Statistics 29, 473- 492.
Croux, C. and G. Haesbroeck (2000). Principal component analysis based
on robust estimators of the covariance or correlation matrix: influence
functions and efficiencies. Biometrika 87, 603- 618.
Cuesta-Albertos, J. A., A. Gordaliza, and C. Matrán (1997). Trimmed k-means: an attempt to robustify quantizers. The Annals of Statistics 25, 553-576.
Daudin, J. J., C. Duby, and P. Trecourt (1988). Stability of principal
component analysis studied by the bootstrap method. Statistics 19,
241- 258.
de Boor, C. (2002). A Practical Guide to Splines, Revised Edition. New
York: Springer-Verlag.
Deutsch, C. V. and A. G. Journel (1998). GSLIB. Geostatistical Software Library and User's Guide, 2nd Edition. New York: Oxford University Press.
Diggle, P. J., J. A. Tawn, and R. A. Moyeed (1998). Model-based geostatistics (with discussion). Applied Statistics 47, 299-350.
Emerson, J. D. and D. C. Hoaglin (1983). Analysis of two-way tables
by medians. In D. C. Hoaglin, F. Mosteller, and J. W. Tukey (Eds.),
Understanding Robust and Exploratory Data Analysis, pp. 166- 210.
New York: Wiley.
Everitt, B. S., S. Landau, and M . Leese (2001). Cluster Analysis, 4th
Edition. London: Arnold.
Fahrmeir, L. and G. Tutz (2001). Multivariate Statistical Modelling Based on Generalized Linear Models. New York: Springer-Verlag.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics 7, 179-188.
Flury, B. (1997). A First Course in Multivariate Statistics. New York: Springer-Verlag.
Flury, B. and H. Riedwyl (1985). Tests, the linear two group discriminant function and their computation by linear regression. American Statistician 39, 20-25.
Flury, B. and H. Riedwyl (1988). Multivariate Statistics: A Practical Approach. London: Chapman and Hall.
Fraley, C. and A. E. Raftery (1998). How many clusters? Which clustering method? - Answers via model-based cluster analysis. Computer Journal 41, 578-588.

Fraley, C. and A. E. Raftery (2002). Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association 97, 611-631.
Friedman, H. P. and J. Rubin (1967). On some invariant criteria for
grouping data. Journal of the American Statistical Association 62,
1159- 1178.
Fung, W. K. (1992). Some diagnostic measures in discriminant analysis.
Statistics and Probability Letters 13, 279- 285.
Fung, W. K. (1995a). Diagnostics in linear discriminant analysis. Journal
of the American Statistical Association 90, 952- 956.
Fung, W. K. (1995b). Influence on classification and probability of mis-
classification. Sankhya, Series B 57, 377-384.
Fung, W. K. (1996a). The influence of observations for local log-odds in linear discriminant analysis. Communications in Statistics, Theory and Methods 25, 257-268.
Fung, W. K. (1996b). Letter to the editor: Critical values for testing multivariate statistical outliers. Applied Statistics 45, 496-497.
Fung, W. K. (1998). On the equivalence of two diagnostic measures in discriminant analysis. Communications in Statistics, Theory and Methods 27, 1923-1935.
Garcia-Escudero, L. A. and A. Gordaliza (1999). Robustness properties
of k means and trimmed k means. Journal of the American Statistical
Association 94, 956- 969.
Gnanadesikan, R. (1977). Methods for Statistical Data Analysis of Mul-
tivariate Observations. New York: Wiley.
Gnanadesikan, R., J. W. Harvey, and J. R. Kettenring (1993). Mahalanobis metrics for cluster analysis. Sankhya, Series A Special Volume 55, 494-505.
Goldberg, K. M. and B. Iglewicz (1992). Bivariate extensions of the boxplot. Technometrics 34, 307-320.
Goldberger, A. S. (1962). Best linear unbiased prediction in the gen-
eralized linear regression model. Journal of the American Statistical
Association 57, 369- 375.
Goovaerts, P. (1997). Geostatistics for Natural Resources Evaluation.
New York: Oxford University Press.
Gordon, A. D. (1999). Classification, 2nd Edition. Boca Raton: Chapman
and Hall/CRC.
Griffith, D. A. and L . J. Layne (1999). A Casebook for Spatial Statistical
Data Analysis. New York: Oxford University Press.

Grubbs, F. E. (1950). Sample criteria for testing outlying observations. Annals of Mathematical Statistics 21, 27-57.
Guttorp, P. (1995). Stochastic Modeling of Scientific Data. London:
Chapman and Hall.
Hadi, A. S. (1992). Identifying multiple outliers in multivariate data. Journal of the Royal Statistical Society, Series B 54, 761-771.
Hadi, A. S. and J. S. Simonoff (1993). Procedures for the identification of multiple outliers in linear models. Journal of the American Statistical Association 88, 1264-1272.
Haining, R. (1990). Spatial Data Analysis in the Social and Environmental Sciences. Cambridge: Cambridge University Press.
Haining, R. P. (1987). Trend-surface models with regional and local scales of variation with an application to aerial survey data. Technometrics 29, 461-469.
Hand, D., H. Mannila, and P. Smyth (2001). Principles of Data Mining.
Cambridge, Mass.: MIT Press.
Hartigan, J . A. (1975). Clustering Algorithms. New York: Wiley.
Harvey, A. C. (1990) . The Econometric Analysis of Time Series, 2nd
Edition. Cambridge, Mass.: MIT Press.
Haslett, J., R. Bradley, P. Craig, and A. Unwin (1991). Dynamic graphics
for exploring spatial data with application to locating global and local
anomalies. The American Statistician 45, 234- 242.
Haslett, J. and K. Hayes (1998). Residuals for the linear model with
general covariance structure. Journal of the Royal Statistical Society,
Series B 60, 201- 215.
Hastie, T., R. Tibshirani, and J. Friedman (2001). The Elements of Sta-
tistical Learning. Data Mining, Inference and Prediction. New York:
Springer-Verlag.
Hawkins, D. M. and N. Cressie (1984). Robust Kriging - A proposal.
Mathematical Geology 16, 3- 18.
Hawkins, D. M. and G. J . McLachlan (1997). High-breakdown linear
discriminant analysis. Journal of the American Statistical Associa-
tion 92, 136- 143.
Hawkins, D. M. and D. J. Olive (2002). Inconsistency of resampling algorithms for high-breakdown regression estimators and a new algorithm (with discussion). Journal of the American Statistical Association 97, 136-159.
Hettmansperger, T. P. and S. J. Sheather (1992). A cautionary note on
the method of least median squares. American Statistician 46, 79- 83.

Hinkelman, K. and O. Kempthorne (1994). Design and Analysis of Experiments. Volume I. Introduction to Experimental Design. New York: Wiley.
Hubert, M., P. J. Rousseeuw, and S. Verboven (2002). A fast method for robust principal components with applications to chemometrics. Chemometrics and Intelligent Laboratory Systems 60, 101-111.
Hubert, M. and K. Van Driessen (2003). Fast and robust discriminant
analysis. Computational Statistics and Data Analysis 43. (To ap-
pear).
Johnson, R. A. and D. W. Wichern (1997). Applied Multivariate Statis-
tical Analysis, 4th Edition. New Jersey: Prentice-Hall.
Jolliffe, I. T . (2002) . Principal Component Analysis, 2nd Edition. New
York: Springer-Verlag.
Jolliffe, I. T., B. Jones, and J. T. Morgan (1995). Identifying influential observations in hierarchical cluster analysis. Journal of Applied Statistics 22, 61-80.
Kaiser, M. S. and N. Cressie (2000). The construction of multivariate distributions from Markov random fields. Journal of Multivariate Analysis 73, 199-220.
Kaufman, L. and P. J. Rousseeuw (1990). Finding Groups in Data. An
Introduction to Cluster Analysis. New York: Wiley.
Krzanowski, W. J. (2000). Principles of Multivariate Analysis, 2nd Edition. Oxford: Clarendon Press.
Krzanowski, W. J. and D. J. Hand (1997). Assessing error rate estima-
tors: the leave-one-out method reconsidered. Australian Journal of
Statistics 39, 35- 46.
Krzanowski, W. J. and F. H. C. Marriott (1994). Kendall's Library of Statistics 1: Multivariate Analysis, Part 1. London: Edward Arnold.
Krzanowski, W. J. and F. H. C. Marriott (1995). Kendall's Library of Statistics 2: Multivariate Analysis, Part 2. London: Edward Arnold.
Liu, R. Y., J. M. Parelius, and K. Singh (1999). Multivariate analysis
by data depth: descriptive statistics, graphics and inference (with
discussion). Annals of Statistics 27, 783 - 858.
Mardia, K. V. , J. T. Kent, and J. M. Bibby (1979). Multivariate Analysis.
London: Academic Press.
Maronna, R. and P. M. Jacovkis (1974). Multivariate clustering procedures with variable metrics. Biometrics 30, 499-505.

Marriott, F. H. C. (1982). Optimization methods of cluster analysis. Biometrika 69, 417-421.
Martin, R. J. (1984). Exact maximum likelihood for incomplete data from a correlated Gaussian process. Communications in Statistics, Theory and Methods 13, 1275-1288.
Martin, R. J. (1992). Leverage, influence and residuals in regression models when observations are correlated. Communications in Statistics, Theory and Methods 21, 1183-1212.
Matheron, G. (1963). Principles of geostatistics. Economic Geology 58, 1246-1266.
Mathsoft (1996). S+SPATIALSTATS. User's Manual. Seattle: Mathsoft.

McBratney, A. B. and R. Webster (1981). Detection of ridge and furrow pattern by spectral analysis of crop yield. International Statistical Review 49, 45-52.
McCullagh, P. and J. A. Nelder (1989). Generalized Linear Models, 2nd Edition. London: Chapman and Hall.
McLachlan, G. and D. Peel (2000). Finite Mixture Models. New York: Wiley.
McLachlan, G. J. (1992). Discriminant Analysis and Statistical Pattern Recognition. New York: Wiley.
Micula, G. (1998). Handbook of Splines. Dordrecht: Kluwer.
Morrison, D. F. (1967). Multivariate Statistical Methods, 2nd Edition.
New York: McGraw-Hill.
Moura, J. M. F. and N. Balram (1992). Recursive structure of noncausal Gauss-Markov random fields. IEEE Transactions on Information Theory 38, 334-354.
Muirhead, R. J. (1982). Aspects of Multivariate Statistical Theory. New York: Wiley.
Müller, W. G. (2001). Collecting Spatial Data, 2nd Edition. Berlin:
Springer-Verlag.
Nair, V., M. Hansen, and J. Shi (2000). Statistics in advanced manufacturing. Journal of the American Statistical Association 95, 1002-1005.
Ord, K. (1975). Estimation methods for models of spatial interaction.
Journal of the American Statistical Association 70, 120- 126.
Penny, K. I. (1996a). Appropriate critical values when testing for a sin-
gle multivariate outlier by using the Mahalanobis distance. Applied
Statistics 45, 73- 81.

Penny, K. I. (1996b). Author's response to letter by Fung: Critical values for testing multivariate statistical outliers. Applied Statistics 45, 497.

Pison, G., A. Struyf, and P. J. Rousseeuw (1999). Displaying a clustering with CLUSPLOT. Computational Statistics and Data Analysis 30, 381-392.
Politis, D. N., J. P. Romano, and M. Wolf (1999). Subsampling. New
York: Springer-Verlag.
Rao, C. R. (1973) . Linear Statistical Inference and its Applications, 2nd
Edition. New York: Wiley.
Reaven, G. M. and R. G. Miller (1979). An attempt to define the nature
of chemical diabetes using a multidimensional analysis. Diabetolo-
gia 16, 17- 24.
Rencher, A. C. (1995). Methods of Multivariate Analysis. New York:
Wiley.
Riani, M. and A. C. Atkinson (2000). Robust diagnostic data analysis: Transformations in regression (with discussion). Technometrics 42, 384-398.
Riani, M. and A. Cerioli (1999). Graphical tools for the detection of multiple outliers in spatial statistics models. In W. Gaul and H. Locarek-Junge (Eds.), Classification in the Information Age, pp. 233-240. Berlin: Springer-Verlag.
Riani, M. and S. Zani (1997) . An iterative method for the detection of
multivariate outliers. Metron 55, 101-117.
Ripley, B. D . (1981). Spatial Statistics. New York: Wiley.
Rousseeuw, P . J. (1984). Least median of squares regression. Journal of
the American Statistical Association 79, 871- 880.
Rousseeuw, P. J. and K. Van Driessen (1999). A fast algorithm for the minimum covariance determinant estimator. Technometrics 41, 212-223.
Rousseeuw, P. J. and B. C. van Zomeren (1990). Unmasking multivariate outliers and leverage points. Journal of the American Statistical Association 85, 633-639.
Scott, A. J. and M. J. Symons (1971). Clustering methods based on likelihood ratio criteria. Biometrics 27, 387-397.
Seber, G. A. F. (1984). Multivariate Observations. New York: Wiley.
Small, C. G. (1990). A survey of multidimensional medians. International Statistical Review 58, 263-277.
Stein, M . L. (1999) . Interpolation of Spatial Data. New York: Springer.

Stromberg, A. J. (1993). Discussion of paper by Hettmansperger and Sheather. The American Statistician 47, 87-88.
Struyf, A., M. Hubert, and P. J. Rousseeuw (1997). Integrating robust clustering techniques in S-Plus. Computational Statistics and Data Analysis 26, 17-37.
Symons, M. J. (1981). Clustering criteria and multivariate normal mixtures. Biometrics 37, 35-43.
Van Aelst, S., P. J. Rousseeuw, M. Hubert, and A. Struyf (2002). The
deepest regression method. Journal of Multivariate Analysis 81, 138
- 166.
Velilla, S. (1993). A note on the multivariate Box-Cox transformation to
normality. Statistics and Probability Letters 17, 259-263.
Venables, W. N. and B. D. Ripley (1994). Modern Applied Statistics with S-Plus. New York: Springer-Verlag.
Venables, W. N. and B. D. Ripley (2002). Modern Applied Statistics with S, 4th Edition. New York: Springer-Verlag.
Webster, R. and M. Oliver (2001). Geostatistics for Environmental Sci-
entists. Chichester: Wiley.
Wilks, S. S. (1963). Multivariate statistical outliers. Sankhya, Series
A 25, 407-426.
Woodruff, D. and D. M. Rocke (1994). Computable robust estimation of
multivariate location and shape in high dimension using compound
estimators. Journal of the American Statistical Association 89, 888-
896.
Zani, S. (1996). Misure della Qualità della Vita. Milan: Angeli.
Zani, S. (2000). Analisi dei Dati Statistici. Vol. II. Osservazioni Multidimensionali. Milano: Giuffrè Editore.
Zani, S., M. Riani, and A. Corbellini (1998). Robust bivariate boxplots
and multiple outlier detection. Computational Statistics and Data
Analysis 28, 257- 270.
Zellner, A. (1962). An efficient method of estimating seemingly unrelated
regressions and tests of aggregation bias. Journal of the American
Statistical Association 57, 348-368.
Author Index

Agresti, A., 74, 597 Brook, D., 545, 598


Anderberg, M.R. , 444, 448, 597
Anderson, E. , 311 , 576, 597 Campbell, N.A. , 74, 356, 598
Anderson, T.W. , 74, 597 Caussinus, H., 273, 599
Andrews, D.F., 218, 345, 351, 597 Cerioli, A., 58, 406, 444, 474, 507,
Appa, G.M., 495, 597 526, 599, 605
Atkinson, A.C ., 2, 3, 15, 32, 42, Cheng, R. , 449, 599
47, 48, 50- 52, 66, 71, 72, Chiles, J.P. , 467, 524, 599
74,156,162- 164,176,202, Christensen, R. , 524- 526,531, 599
207, 218, 242, 273, 317, Cliff, A.D. , 498 , 524, 599
457, 458, 469, 472, 473, Cook, R.D ., 2, 33, 176, 219, 568,
475, 497, 505, 506, 511, 599
512, 535, 597, 598, 605 Corbellini, A., 59, 61, 62, 373, 606
Cox, D.R., 14, 30, 151 , 156, 161,
Balram, N., 500, 604 218, 309, 598
Banfield, J.D. , 448, 449, 598 Craig, P., 524, 602
Barnett, V., 74, 598 Cressie, N.A.C ., 461,469,471,472,
Besag, J.E., 524, 545, 548, 598 475, 483, 485, 487, 488,
Bibby, J.M., 35, 39, 74, 233, 272, 524, 545 , 548, 599, 602,
311, 356, 603 603
Box, G.E.P., 14, 30, 38, 58, 151, Critchley, F., 356, 600
156, 161, 166, 209, 213, Croux, C., 273, 356, 600
218, 309, 567, 570, 598 Cuesta-Albertos, J.A., 449, 600
Bradley, R., 524, 602
Brockwell, P.J., 524, 598 Daudin, J.J. , 242, 252, 571, 600

Davis, R.A., 524, 598 Hawkins, D.M., 56, 356,471,474,


de Boor, C., 59, 600 487, 599, 602
Dehon, C., 356, 600 Hayes, K., 525, 602
Delfiner, P., 467, 524, 599 Herzberg, A.M., 345, 351, 597
Deutsch, C.V. , 524, 600 Hettmansperger, T.P., 495, 602
Diggle, P.J. , 524, 600 Hinkelman, K., 483, 603
Draper, N.R., 58, 166, 209, 213, Hoaglin, D.C. , 485, 600
567, 570, 598 Hubert, M. , 75, 273, 356, 444, 603,
Duby, C., 242, 252, 571, 600 606

Emerson, J.D. , 485, 600 Iglewicz, B., 61, 63, 601


Everitt, B.S., 443, 448, 600
Jacovkis, P.M., 370, 449, 603
Johnson, R.A ., 11, 74, 603
Fahrmeir, L., 74, 600
Fisher, R.A., 301, 311 , 361, 600 Johnson, VV., 525, 526, 599
Jolliffe, I.T., 272, 449, 603
Flury, B. , 6, 23, 74, 233, 242, 272,
317, 361, 579, 600 Jones , B., 449, 603
Journel, A.G., 524, 600
Fraley, C. , 420, 449, 600, 601
Friedman, H.P. , 449, 601 Kaiser, M.S ., 548, 603
Friedman, J. , 449, 602 Kaufman, L., 448, 449, 603
Fung, VV.K., 74, 324, 356, 601 Kempthorne, 0 ., 483, 603
Kent , J.T., 35, 39, 74, 233, 272,
Garcia-Escudero, L.A., 449, 601
311, 356, 603
Gnanadesikan, R., 218, 449, 597, Kettenring, J.R. , 449, 601
601 Knott , M., 42
Goldberg, K.M., 61 , 63, 601 Krzanowski, VV.J. , 74, 272, 305,
Goldberger, A.S. , 524, 601 311 , 356, 603
Goovaerts, P., 524, 601
Gordaliza, A., 449, 600, 601 Land, A.H., 495, 597
Gordon, A.D., 370, 448, 601 Landau, S. , 443, 448, 600
Griffith, D.A., 524, 601 Layne, L.J. , 524, 601
Grubbs, F.E., 74, 602 Leese, M., 443, 448, 600
Guttorp, P., 545, 602 Lewis, T., 74, 598
Liu, R.Y., 74, 603
Hadi, A.S., 56, 602
Haesbroeck, G., 273, 600 Müller, VV.G., 524, 604
Haining, R.P. , 492, 493, 497, 524, Mannila, H., 449, 602
602 Mardia, K.V., 35, 39, 74, 233,272,
Hand, D.J., 305, 449, 602, 603 311, 356, 603
Hansen, M. , 524, 604 Maronna, R., 370, 449, 603
Hartigan, J .A., 448, 602 Marriott, F .H.C., 74,449,603,604
Harvey, A.C. , 54, 602 Martin, R.J., 505, 508, 525, 604
Harvey, J.VV., 449, 601 Matheron, G., 524, 604
Haslett, J. , 524, 525, 602 Mathsoft, 483, 524, 604
Hastie, T., 449, 602 Matnin, C., 449, 600

McBratney, A.B., 490, 604 Rocke, D.M., 56, 60, 606


McCullagh, P., 544, 604 Romano, J.P., 509 , 605
McLachlan, G.J., 356, 449, 602, Rousseeuw, P .J., 56, 72, 75, 273,
604 372, 444, 445, 448, 449,
Micula, G. , 59, 604 495, 597, 603, 605, 606
Miller, R.G., 420, 594, 605 Rubin, J. , 449, 601
Milligan, G.W., 449, 599 Ruiz-Gazen, A., 273, 599
Morgan, J.T., 449, 603
Morrison, D.F., 74, 604 Scott, A.J., 449, 605
Moura, J.M.F., 500, 604 Seber, G.A.F., 74, 605
Moyeed, R.A. , 524, 600 Sheather, S.J., 495, 602
Muirhead, R.J., 74, 99, 304, 604 Shi, J., 524, 604
Simonoff, J.S. , 56, 602
Nair, V., 524, 604 Singh, K., 74, 603
Nelder, J.A., 544, 604 Smith, R.L., 317, 598
Smith, R.L., 317, 598
Olive, D.J., 56, 474, 602 Smyth, P., 449 , 602
Oliver, M., 524, 606 Stein, M.L ., 472, 524, 535, 605
Ord, J.K., 498, 524, 599, 604 Stromberg, A.J., 495, 606
Struyf, A., 75 , 444, 445, 605, 606
Parelius, J.M., 74, 603 Symons, M.J., 449, 605, 606
Pearson, L.M. , 525, 526, 599
Peel, D. , 449, 604 Tawn, J.A. , 524, 600
Penny, K.I., 74, 604, 605 Tibshirani, R., 449, 602
Pericchi, L.R., 317, 598 Trecourt, P., 242, 252, 571, 600
Pison, G., 445, 605 Tutz, G. , 74, 600
Politis, D.N., 509, 605 Unwin, A. , 524, 602
Raftery, A.E., 420, 448, 449, 598, van Aelst, S., 75, 606
600, 601 van Driessen, K., 356, 372 , 603,
Rao, C.R., 46, 605 605
Reaven, G.M., 420, 594, 605 van Zomeren, B.C., 56, 605
Rencher, A.C., 345, 605 Velilla, S., 218, 606
Riani, M., 3, 15, 32, 42, 47, 48, Venables, W.N., 311, 524, 606
50- 52,58,59,61-63,66, Verboven, S., 273, 603
71, 72, 74, 156, 163, 164, Vitiello, F., 356, 600
176, 202, 207, 218, 373,
457, 458, 469, 472- 475, Warner, J .L., 218, 597
497, 505- 507, 511, 512, Webster, R., 490, 524, 604, 606
526, 535, 598, 599, 605, Weisberg, S., 2, 33, 176, 219, 568,
606 599
Riedwyl, H., 6, 23, 74, 242, 272, Wichern, D.W., 11, 74, 603
317, 361 , 579,600 Wilks, S.S., 44, 74, 606
Ripley, B.D., 311, 499, 524, 605, Wolf, M., 509, 605
606 Woodruff, D., 56, 60, 606

Zani, S., 59, 61- 63, 74, 115, 267,


373, 406, 599, 605, 606
Zellner, A., 53, 606
Subject Index

added variable, 50, 162 PAM algorithm, 444


partitioning methods, 443- 444
best linear unbiased prediction, see sensitivity to the list order,
kriging
444
biplot, 236- 239, 290 comfortable shirt, 10
at selected steps of the for-
conditional expectation, 527, 532
ward search, 250
confirmatory analysis
boxplot
for clustering, 369,371,401-
bivariate
405, 436- 439
from ellipse, 62- 64
for transformation, 166
from peeling, 59-62
constructed variable, 162, 163
univariate, 110
convex hull, 59
centroid Cook's distance, 469
robust, 60, 63 cosine of the angle between vec-
cluster analysis, 367, 372 tors, 238, 251, 278, 295
k-means algorithm, 443 cross validation
agglomerative hierarchical clus- discriminant analysis, 305
tering, 441- 442 kriging, 468
automatic classification, 439
clusplot, 445 data depth, 75
cluster validation, 448 data mining, 449
dendrogram, 442, 445 datasets, see examples
difficult example, 420, 491 deletion diagnostics, 2, 53, 218
model-based clustering, 446- cluster analysis, 448
448 discriminant analysis, 356

failure, 3, 55, 191, 337 training sets, 300


kriging, 468- 471, 524 with estimated parameters, 300-
failure, 469, 480 301
multivariate transformation,
334 eccentricity, 389
failure, 348 Eckart-Young theorem, see Singu-
deletion formulae, 40- 43 lar value decomposition
kriging, 469, 526, 535- 539 eigenvalues
multiple, 52 In − ρW, 540
regression, 48, 52 Σ_W^{-1}Σ_B, 302, 361
discriminant analysis, 298- 305 correlation matrix, 234
F ratio, 302 covariance matrix, 231, 236
allocation of a new observa- eigenvectors
tion, 297 and canonical variables, 302
expected cost of misalloca- and extrema of quadratic forms,
tion, 359, 365 275
using the forward search, poorly determined, 261
316 standardized, 231, 237
worsened by a larger num- elemental set, 72
ber of parameters, 321 ellipse, 280
Bayesian, 298, 366 at selected steps of the for-
canonical variables (linear dis- ward search, 376- 377,387,
criminant functions), 301- 451
304, 358 changes in principal com-
correlation between variables ponents, 389
and discriminant functions, eccentricity, 389
362 effect of transformation, 153
error rate, 304- 305 in canonical form, 282
geometry, 302 robust, 89
compared to principal com- scaling, 61, 63
ponents analysis, 359 elliptical family of distributions,
improved by transformation, 99, 304
327, 333, 349, 355 estimating equation, 541, 544
linear, 300 Euclidean distance, 239, 293
compared to canonical vari- in cluster analysis, 442, 444
ables, 304, 362 influence on visual compari-
preferred to quadratic, 301, son of units, 417
324, 332 invariance under orthogonal
posterior probability of group transformations, 294
membership, 299 low rank approximation, 296
quadratic, 299 relationship with standardized
relationship with multiple re- Mahalanobis distance, 370
gression, 357 examples
sample splitting, 345 "bridge" data, 385- 405, 446,
scores, 303 452- 455, 590

60:80 data, 371- 379,444,452, simulated transformed data


586 from different populations,
babyfood data, 58, 59, 62, 64, 332- 344
152- 155, 166- 169, 214- Swiss bank notes, 6, 22- 30,
218, 567 70,87,116- 128,222- 224,
diabetes data, 420- 439, 594 260- 264, 328- 332, 562
dyestuff data, 209- 213, 226- Swiss heads, 1, 5- 10,89- 100,
227, 570 169- 175, 221- 222, 239-
electrodes data, 317- 324, 579 242, 553
financial data, 406- 420, 592 three clusters, two outliers,
horse mussels, 176- 186, 218, 379- 385, 451, 588
568 wheat yield data, 482- 491,522-
iris data, 310- 317, 324- 328, 524
576 experimental design, 77, 209
milk data, 242- 252, 273, 571 uniformity trial, 483
municipalities in Emilia-Romagna,
forward plot, 56, 66- 71
6, 16- 22, 108- 116, 186-
t statistics, 214
204, 226, 265- 272, 560
added variable, 216
muscular dystrophy data, 344-
allocation of units in duster
356, 581 analysis, 370, 371, 402
national track records for women, average kriging variance, 476
5, 10- 16, 100- 108, 204- average prediction residual , 4 76
209,225- 226, 558 biplot, see biplot
quality of life, 252- 260, 573 changes in Mahalanobis dis-
refiectance data, 491- 495 tances, 369, 396
simulated bivariate data from showing groups, 425
anormal population, 140, correlations between variables
142 and principal components,
with outliers, 142- 147 235, 247
simulated kriging data with stable, 267
a nonstationary pocket, eccentricity of an ellipse, 389,
479- 482 424
simulated kriging data with elements of an eigenvector, 236
multiple outliers, 467, 477- compared to correlations,
479 249
simulated SAR data with mul- in discriminant analysis, 321
tiple high leverage points, showing rotation of the fit-
505- 506, 519- 522 ted ellipse, 389
simulated SAR data with mul- unstable, 261
tiple outliers (I), 503-504, elements of the covariance ma-
513- 514 trix
simulated SAR data with mul- showing groups, 25
tiple outliers (II), 516- stable, 23, 93
519 ellipses, see ellipse

entry, 369, 375-376 maximum prediction residual


first, 408 among all units, 476
fan, 164, 168 minimum distance not in the
stable, 190 subset, 4, 8
with a finer grid of values, compared to ordered dis-
200 tance, 70
gap, 69, 92 for groups, 308, 312
generalized leverage measures, large jump, 13, 19
511, 521 showing changes in the sub-
illuminating, 242, 377 set structure, 377, 421
likelihood ratio test for equal- simulated data, 101
ity of covariance matri- minimum kriging variance not
ces, 344 in the subset, 476
likelihood ratio test for the minimum prediction residual
hypothesis p = po, 513, not in the subset, 476
518 percentage of variance explained
likelihood ratio test for trans- by principal components,
formation, 148, 165, 166 234, 245
common, 204 showing groups, 261 , 421
Mahalanobis distances, 4, 94 stable, 240, 266
absence of very small dis- pooled determinant, 309
tances,97,154,374,420, zig-zag form, 353
427 posterior probabilities in dis-
effect of scaling, 66, 94 criminant analysis, 314
effect of transformation, 153, ratios from covariance matri-
182, 203 ces, 70
for groups of units, 368, 392- score test for transformation,
394 206
for individual units against scores in principal components
a background, 119, 394- analysis, 235, 249
395 stable, 240
showing groups, 373, 380, scree plot, see principal com-
407 ponents analysis
simulated data, 97, 101 standardized prediction resid-
stable, 66, 94 uals, 4 75, 4 78
standardized, 403 standardized residuals in SAR
maximum distance in the sub- model, 511, 513
set, 91 transformation parameter, 165
for groups, 308, 312 stable, 166
showing changes in the sub- Wilks' ratio, 142
set structure, 377 forward search, 3-5
maximum kriging variance among balanced, 306, 311, 348
all units, 476 compared to unbalanced, 318-
maximum likelihood estimate 320
of p, 512, 518 effect on forward plots, 308
        in cluster analysis, 371
    behaviour in the initial stages, 345, 425
    block, 458, 497, 506-513
        stability to preliminary choices, 514-516, 518
    CAR model, 497
    cluster analysis, 367-371
        allocation of units, 436
        boundary units, 405, 439
        common sense suggestions, 405
        confirmatory, see confirmatory analysis
        constrained, 371, 402
        effect of rounding, 425
        emphasis on robust diagnostic classification, 440
        gradual transition, 424
        leading to profitable investments, 419
        refinement of groups, 428
        unallocated observations in exploratory analysis, 371, 401, 415, 436
        with individual clusters, 398-401, 428-436
    discriminant analysis, 303, 305-307
        compared to that for one multivariate population, 314
        exciting results, 316
        exemplary analysis, 344
        explanation for misclassified units, 314
        posterior probability of group membership, 307, 314
        stability of conclusions, 351
        with wrong number of groups, 328-332
    emphasis on monitoring, 6, 56
    estimator
        for SAR model parameters, 511
    generalized linear models, 74
    monitoring, see forward plot
    multivariate data, 55-58
    multivariate regression, 73
    multivariate transformation, 165-166, 185
        before discriminant analysis, 324, 337, 345
        before principal components analysis, 242, 252, 260, 265
        leading to a cheaper medical test, 356
        with a finer grid of values, 189, 202
        with grouped data, 345
    ordinary kriging, 472-477
        advantages over other exploratory methods, 486
        detection of cyclic components, 489
        effect of measurement error, 477, 495
        effect of rounding, 491
        effect of trend removal, 488
        effect of variogram model, 482
        showing alternative structures in the same data, 495
    predictor, 474
    principal components analysis, 233-236
    progression rule, 68
        interchange, 26, 68-70, 73, 86, 474
        inversion, 68-70, 86
        normal, 68-70
    recovery from poor start, 58, 65, 145, 373, 377
        effect of measurement error, 474
    regression
        compared to ordinary kriging, 472
    robustness, 57
    SAR model, see forward search, block
    stability to different starting points, 26, 72, 346, 473
    stable, 65, 106, 186, 242
    univariate regression, 51, 71
    usefulness in data interpretation, 13, 29, 125, 314, 349, 416, 478, 486, 495
    with grouped data, see also forward search, cluster analysis
        comparing the results of g searches, 369
        fitting one distribution, 4, 24, 116-124, 379
        fitting several distributions, 369, 401, 436
    with messy data, 186, 265
generalized least squares, 53, 497, 502, 532
geographical map, 18
    and outliers, 113, 203
geometric mean, 157
geostatistics, 459, 465, 524
Gram-Schmidt orthogonalization process, 280
group membership
    identified by the forward search, 121, 301, 405
        compared to medical classification, 439
    preliminary, 369, 425
    uncertain, 371, 403, 436
    wrong, 370, 395
grouped data, 22, 38, 85, 116, 297, 367
    effect on forward plots, 374, 377
    homogeneous, 398
    overlapping, 386
        effect on forward plots, 391
hat matrix, 47, 87, 504
Hotelling's T², 39, 84
impatient authors
    evidence of, 311, 606
initial subset, 4, 56, 64-65, 73
    in kriging, 473
    in the block forward search, 510
    with grouped data, 306, 371
        constrained, 396
inverse of a partitioned matrix, 530
kriging, 458, 459, 524
    ordinary, 459-464
        detection of cyclic components, 476
        equations, 529, 531
        exploratory methods, 485, 524
        mean-squared prediction error (kriging variance), 462, 529, 531
        measurement error, 462-465, 468, 534
        optimality property, 462, 532
        predictor, 462, 528, 530
        unbiasedness condition, 528
    universal, 459, 464
least median of squares, 71, 473, 510
least trimmed squares, 72
leverage, 87, 246, 504
    generalized for dependent observations, 505
        complementary, 505
        failure, 506
        forward, 511
        trajectory, 522
likelihood ratio test
    for the hypothesis μ = μ0, 37, 84
    for the hypothesis ρ = ρ0, 512, 544
    for the hypothesis Σ1 = ... = Σg = Σ, 38, 85
    for transformation, 160, 161
        common, 162
linear transformation
    distribution, 230
    giving maximum separation of groups, 301
Mahalanobis distance
    deletion, 40, 43, 337
    distribution, 45, 95
        chi-squared, 7, 34, 39, 62, 154, 337, 374
        scaled F, 34, 39, 43, 61, 63, 65
        scaled beta, 2, 33, 44, 49, 62, 79
    forward, 56, 66
        and posterior probability in discriminant analysis, 307, 316
        from a group, 370
        scaled, 57, 66
        shape of, 119, 393, 424
        standardized, 370
    in cluster analysis, 442, 448, 449
    in regression, 48
    index plot, 382
    ordered, 68-69
    population, 39
        in discriminant analysis, 299-300
    relationship with Euclidean distance, 238, 294, 295
    robust, 382
    standardized, 403
    univariate, 2
    with estimated parameters, 40
        from a group, 300-301
        relationship with deletion distance, 43
Manhattan (L1) distance, 442
Markov random field, 548
masking, 3, 55, 105, 141, 313, 356, 449, 469, 486, 503, 506
    revealed by forward plots, 94, 243, 308, 342, 478, 480, 513, 518, 521
maximum likelihood
    and the forward search, 57
    for CAR model, 549
    for SAR model, 501-502, 541-543
        subset of spatial locations, 508
    in cluster analysis, 448, 449
    in power transformations, 157
        with grouped data, 310
    with multivariate data, 35, 36, 82
    with univariate data, 32
mean shift outlier model, 52
minimum covariance determinant, 372
minimum volume ellipsoid, 56
multivariate analysis
    trends in, 74
multivariate data
    normal model, 5, 99
        departures from, 1
        effect of transformation, 106, 169, 181, 208, 260
        in discriminant analysis, 299, 304
        in principal components analysis, 232
        linear combination, 229
    ordered by the forward search, see ordering multivariate data
normal distribution
    multivariate, 34, 82
        defined through its conditional distributions, 545
        density ratio, 548
        ellipsoidal contours, 232
        for spatial data, 458, 502
        mixture with grouped data, 447
    univariate, 32
        estimated density, 323
normal equations
    partitioned, 50
ordering multivariate data
    by Mahalanobis distances, 3
    by the forward search, 4, 51, 57, 58, 138
        effect of balancing, 319
        effect of changes in the fitting subset, 389
        effect of transformation, 57, 202
ordering spatial data
    by the forward search, 475
        effect of contamination, 478
        partial, 513
ordinary least squares
    inconsistent estimator in SAR model, 502, 544
outlier
    broad definition, 55
    cluster of, 9, 14, 23, 242, 345
    detection, 4, 55, 62, 90, 229
    effect of including more data, 355
    effect on discrimination, 314, 319, 349
    effect on estimates, 55, 107, 331
    effect on forward plots of
        Mahalanobis distances, 57, 105
        maximum distance in the subset, 69, 92, 308
        minimum distance not in the subset, 9, 24, 308
        ratios from covariance matrices, 70
        Wilks' ratio, 142
    effect on inference, 5, 58
    effect on principal components, 246, 249, 263, 270
        compared to transformation, 252
    effect on transformation, 171-174, 181, 201, 243, 340, 346
        masked, 195
    in cluster analysis, 369, 380, 425
        masked, 5, 348
        unmasked by the forward search, 94, 176, 339
    multivariate, 3, 113
        projection on principal components, 270
    origin of, 100, 112, 115, 252, 345
    single, 2
    spatial, see spatial outlier
    test, 74
        with grouped data, 314
parallel coordinate plot, 208
parameter estimates
    t statistics, 51
    of μ, 32, 35, 83
        in kriging, 532
    of Σ, 36, 41, 47, 83
        low rank approximation, 238, 295
        with transformed data, 206
    of σ², 32
    of ΣB, 302
    of ΣW, 38, 300
    stable, 24, 57, 165
    transformation, 160
    variogram models, see variogram
    with grouped data, 300
poor communities, 20, 204, 268
principal components analysis, 230-233
    and singular value decomposition, see biplot
    correlation between variables and components, 235
    covariance matrix of the components, 233
    dimension reduction, 10, 113, 229
    effect of random variation, 249
    geometry, 232, 281, 286
    improved by transformation, 259, 270
    interpretation of the components, 235, 247, 267
    low rank approximation of the centred data matrix, 237, 276
    lower components, 249
    number of components, 232
        halving rule, 234
    on standardized variables, 232, 234, 236, 238, 289
    poorly defined, 233
    population, 231
    proportion of variance explained, 234, 293
    sample, 231
    scores, 235, 237, 238
        not affected by outliers, 270
    scree plot, 234
        at selected steps of the forward search, 246
    tests in, 233
    total variance, 234
    with grouped data, 261
profile loglikelihood, 160, 161
    CAR model, 549
    plot, 165, 167, 211, 324
    SAR model, 502, 542
    with grouped data, 310
profiles of Mahalanobis distances, 121
projection, 234, 278, 280, 285
    in discriminant analysis, 302
QQ plot, 7, 337, 374, 420
regression
    R², 293, 361
    multivariate, 46
    seemingly unrelated, 53-55, 163
    transformed data, 157
    with correlated observations, see kriging, spatial autoregression
residual sum of squares, 33, 160
    partitioning with grouped data, 301
residual sum of squares and products matrix, 35, 47
    between groups, 357, 358
    within groups, 358
residuals
    deletion, 40
        in multivariate regression, 47
    least squares, 33
    matrix of, 36, 231, 236
    prediction, see standardized prediction residuals
    scaled, 2, 33, 66, 79, 81
    standardized
        forward, 511
        in spatial autoregression, 502
        trajectory, 514
robust methods
    cluster analysis, 444, 449
    discriminant analysis, 356
    failure in SAR model, 514
    failure with grouped data, 382
    instability to local perturbations, 495
    multivariate analysis, 55, 59
        failure with grouped data, 372, 446, 452
    principal components analysis, 273
rotation, 231
scatterplot matrix, 7
    and outliers, 13, 27, 125, 242, 349
    and robust contours, 99, 176
        with grouped data, 317
    and robust ellipses, 89
    suggestions for dimension reduction, 242
    with brushing, 315
    with grouped data, 23, 311, 420, 427, 439
score test for transformation, 162-164, 207
semivariogram, see variogram
Sherman-Morrison-Woodbury formula, 42, 86
    extended, 534, 536
simulation envelope, 7
simulation inference, 95, 207
singular value decomposition, 236
spatial autoregression, 458, 496, 524
    CAR model, 527-528, 548
    compared to kriging, 496
    conditional specification, see CAR model
    edge correction, 499-500, 508
        effects on the forward search, 500
    edge sites, 498
    neighbourhood structure, 497-500
        of a block of spatial locations, 509
    parameter estimation
        based on a subset of spatial locations, 509
        constraints, 526, 548
        difference between CAR and SAR models, 497
        forward, 511
        robust through the forward search, 506, 519
    SAR model, 501-502
    simultaneous specification, see SAR model
spatial data
    complicated structure, 492
    lack of ordering structure, 458
    normal model
        in spatial autoregression, 497
        on a continuous domain, 459
        on a discrete domain, 496
        on a lattice, 467, 498
    ordered by the forward search, see ordering spatial data
spatial dependence, 457, 492
    in autoregressive models, 501
    in kriging, 460, 465
    preserved by the forward search, 507
spatial location, 457, 459
    administrative unit, 497, 498
    block of, 507
        definition, 509
        overlapping, 510
    network, 461, 496
spatial outlier, 467, 489, 502
    cluster of, 467, 493, 503
    effect on estimates, 471, 504, 516
    effect on forward plots, 475, 489
        inference on ρ, 518
        ordered residuals and kriging variances, 476, 479
    masked, 469, 503
        unmasked by the forward search, 479, 513
spatial prediction, 461
    and exact interpolation, 462, 531, 533
    implications for the forward search, 474, 475
    with measurement error, 464
spatial trend, 460, 466, 492
    effect on forward plots, 475
    perceived by visual inspection, 480
    removed by median polish, 485
        disadvantages, 486
    revealed by forward plots, 481, 495
spectral representation of a symmetric matrix, 231
standardization of variables, 288
    dramatically affected by outliers, 253
    effect on cluster analysis, 444, 446
    effect on principal components analysis, 232
standardized prediction residuals, 468
    compared to raw residuals, 473
    forward, 476
    shape of, 481, 486
stochastic process, 459, 496
    stationary
        intrinsically, 460
        second-order, 460, 530, 539
subsampling, 509
SUR, see regression
survey data
    problems with, 19, 186, 265
swamping, 469, 506
transformation to normality, 148, 151
    Box and Cox power family, 156, 218
        common, 168, 185, 325
        in discriminant analysis, 309-310, 332
        in multivariate regression, 161-162
        in univariate regression, 156-160
        normalized, 156
        simple, 158
        with estimated shift parameters, 317
    cube root, 176
    easily recovered by the forward search, 339
    improving the equality of covariance matrices with grouped data, 328, 333, 355
    in principal components analysis, 229
    Jacobian of, 159
    log, 157, 163
    physically meaningful, 138
    reciprocal, 14, 108
    trumpet effect, 207
variable selection
    through the forward search, 214-218
variogram, 460
    compared to the covariogram, 461
    conditionally negative definite, 465, 526
    estimation, 471-472
        with spatial trend, 475
    isotropic, 460
    models, 465-467
        fitted by weighted least squares, 472
    nugget effect, 463, 465, 466, 487
    robust estimator, 471
    with measurement error, 463
Wilks' ratio, 44
Wishart distribution, 35, 47