
COMPSTAT 2004

Proceedings in Computational Statistics

16th Symposium Held in Prague, Czech Republic, 2004

Edited by
Jaromir Antoch

With 151 Figures and 38 Tables

Physica-Verlag
A Springer Company
Prof. Dr. Jaromir Antoch
Charles University
Faculty of Mathematics and Physics
Department of Statistics and Probability
Sokolovská 83
186 75 Prague 8 - Karlín
Czech Republic
antoch@karlin.mff.cuni.cz

Additional material to this book can be downloaded from http://extras.springer.com


ISBN 3-7908-1554-3 Physica-Verlag Heidelberg New York

Cataloging-in-Publication Data
Library of Congress Control Number: 2004108446

This work is subject to copyright. All rights are reserved, whether the whole or part of the
material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, reci-
tation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks.
Duplication of this publication or parts thereof is permitted only under the provisions of the Ger-
man Copyright Law of September 9, 1965, in its current version, and permission for use must
always be obtained from Physica-Verlag. Violations are liable for prosecution under the German
Copyright Law.
Physica is a part of Springer Science+Business Media
springeronline.com
© Physica-Verlag Heidelberg 2004
for IASC (International Association for Statistical Computing), ERS (European Regional Section of the IASC) and ISI (International Statistical Institute).
Printed in Germany
The use of general descriptive names, registered names, trademarks, etc. in this publication does
not imply, even in the absence of a specific statement, that such names are exempt from the rel-
evant protective laws and regulations and therefore free for general use.
Softcover-Design: Erich Kirchner, Heidelberg
SPIN 11015154 88/3130-5 4 3 2 1 0 - Printed on acid-free paper
Foreword
Statistical computing provides the link between statistical theory and applied statistics. As at previous COMPSTATs, the scientific programme covered all aspects of this link, from the development and implementation of new statistical ideas through to user experiences and software evaluation. Following extensive discussions, a number of changes have been introduced: giving more focus to the individual sessions, involving more people in the planning of sessions, and making links with other societies involved in statistical computing, such as the Interface or the International Federation of Classification Societies (IFCS). The proceedings should appeal to anyone working in statistics and using computers, whether in universities, industrial companies, government agencies, research institutes or as software developers.
These proceedings would not exist without the help of many people. Among them I would like to thank especially the SPC members D. Banks (USA), H. Ekblom (S), P. Filzmoser (A), W. Härdle (D), J. Hinde (IRE), F. Murtagh (UK), J. Nakano (JAP), A. Prat (E), A. Rizzi (I), G. Sawitzki (D) and E. Wegman (USA); the session organizers D. Cook (USA), D. Banks (IFCS, USA), C. Croux (B), L. Edler (D), V. Esposito Vinzi (I), F. Ferraty (F), V. Kůrková (CZ), M. Müller (D), J. Nakano (ARS IASC, JAP), H. Nyquist (S), D. Peña (E), M. Schimek (A), G. Tunnicliffe-Wilson (GB) and E. Wegman (Interface, USA); as well as all who contributed and/or refereed the papers.
Last but not least, I must sincerely thank my colleagues from the Department of Statistics of the Charles University, the Institute of Computer Science of the Czech Academy of Sciences, the Czech Technical University, the Technical University of Liberec, and Mme Anna Kotesovcova from Conforg Ltd. Without their substantial help neither this book nor COMPSTAT 2004 would exist.
My final thanks go to Mme Bilkova and Mme Pickova, who retyped most of the contributions and prepared the final volume, and to Mme G. Keidel from Springer-Verlag, Heidelberg, who checked the final printing extremely carefully.

Prague, May 15, 2004


Jaromir Antoch
Contents

Invited papers
Grossmann W., Schimek M.G., Sint P.P., The history of COMPSTAT and key-steps of statistical computing during the last 30 years ... 1
Ali A.A., Jansson M., Hybrid algorithms for construction of D-efficient designs ... 37
Amari S., Park H., Ozeki T., Geometry of learning in multilayer perceptrons ... 49
Braverman A., Kahn B., Visual data mining for quantized spatial data ... 61
Carr D.B., Sung M.-H., Graphs for representing statistics indexed by nucleotide or amino acid sequences ... 73
Chen C.H. et al., Matrix visualization and information mining ... 85
Cramer K., Kamps U., Zuckschwerdt C., st-apps and EMILeA-stat: Interactive visualizations in descriptive statistics ... 101
Critchley F. et al., The case sensitivity function approach to diagnostic and robust computation: A relaxation strategy ... 113
Cuevas A., Fraiman R., On the bootstrap methodology for functional data ... 127
Deistler M., Ribarits T., Hanzon B., A novel approach to parametrization and parameter estimation in linear dynamic systems ... 137
Fung W.K. et al., Statistical analysis of handwritten Arabic numerals in a Chinese population ... 149
Gather U., Fried R., Methods and algorithms for robust filtering ... 159
Gentleman R., Using GO for statistical analyses ... 171
Ghosh S., Computational challenges in determining an optimal design for an experiment ... 181
Groos J., Kopp-Schneider A., Visualization of parametric carcinogenesis models ... 189
Heinzl H., Mittlboeck M., Design aspects of a computer simulation study for assessing uncertainty in human lifetime toxicokinetic models ... 199
Held L., Simultaneous inference in risk assessment; a Bayesian perspective ... 213
Hofmann H., Interactive biplots for visual modelling ... 223
Hornik K., R: The next generation ... 235
House L.L., Banks D., Robust multidimensional scaling ... 251
Høy M., Westad F., Martens H., Improved jackknife variance estimates of bilinear model parameters ... 261
Huh M.Y., Line mosaic plot: Algorithm and implementation ... 277
Kafadar K., Wegman E.J., Graphical displays of Internet traffic data ... 287
Kiers H.A.L., Clustering all three modes of three-mode data: Computational possibilities and problems ... 303
Kneip A., Sickles R.C., Song W., Functional data analysis and mixed effect models ... 315
Martinez A.R., Wegman E.J., Martinez W.L., Using weights with a text proximity matrix ... 327
Min W., Tsay R.S., On canonical analysis of vector time series ... 339
Neuwirth E., Learning statistics by doing or by describing: The role of software ... 351
Ostrouchov G., Samatova N.F., Embedding methods and robust statistics for dimension reduction ... 359
Peña D., Rodríguez J., Tiao G.C., A general partition cluster algorithm ... 371
Priebe C.E. et al., Iterative denoising for cross-corpus discovery ... 381
Ramsay J.O., From data to differential equations ... 393
Riani M., Atkinson A., Simple simulations for robust tests of multiple outliers in regression ... 405
Saporta G., Bourdeau M., The St@tNet project for teaching statistics ... 417
Schimek M.G., Schmidt W., An automatic thresholding approach to gene expression analysis ... 429
Schölkopf B., Kernel methods for manifold estimation ... 441
Scott D.W., Outlier detection and clustering by partial mixture modeling ... 453
Shibata R., InterDatabase and DandD ... 465
Swayne D.F., Buja A., Exploratory visual analysis of graphs in GGobi ... 477
Tenenhaus M., PLS regression and PLS path modeling for multiple table analysis ... 489
Theus M., 1001 graphics ... 501
Torsney B., Fitting Bradley Terry models using a multiplicative algorithm ... 513
Tunnicliffe-Wilson G., Morton A., Modelling multiple time series: Achieving the aims ... 527
Van Huffel S., Total least squares and errors-in-variables modeling: Bridging the gap between statistics, computational mathematics and engineering ... 539

Author Index ... 557

COMPSTAT 2004 Section Index ... 565

Contributed papers (on CD)


Achcar J.A., Martinez E.Z., Louzada-Neto F., Binary data in the presence of misclassifications ... 581
Adachi K., Multiple correspondence spline analysis ... 589
Almeida R. et al., Modelling short term variability interactions in ECG: QT versus RR ... 597
Amendola A., Niglio M., Vitale C., The threshold ARMA model and its autocorrelation function ... 605
Araki Y., Konishi S., Imoto S., Functional discriminant analysis for microarray gene expression data via radial basis function networks ... 613
Arhipov S., Fractal peculiarities of birth and death ... 621
Arhipova I., Balina S., The problem of choosing statistical hypotheses in applied statistics ... 629
Arteche J., Reducing the bias of the log-periodogram regression in perturbed long memory series ... 637
Bartkowiak A., Distal points viewed in Kohonen's self-organizing maps ... 647
Bastien P., PLS-Cox model: Application to gene expression ... 655
Bayraksan G., Morton D.P., Testing solution quality in stochastic programming ... 663
Beran R., Low risk fits to discrete incomplete multi-way layouts ... 671
Bertail P., Clémençon S., Approximate regenerative block-bootstrap for Markov chains ... 679
Betinec M., Two measures of credibility of evolutionary trees ... 689
Biffignandi S., Pisani S., A statistical database for the trade sector ... 697
Binder H., Tutz G., Localized logistic classification with variable selection ... 705
Bognar T., Komorník J., Komorníková M., New STAR models of time series and their application in finance ... 713
Bouchard G., Triggs B., The trade-off between generative and discriminative classifiers ... 721
Boudou A., Caumont O., Viguier-Pla S., Principal components analysis in the frequency domain ... 729
Boukhetala K., Ait-Kaci S., Finite spatial sampling design and "quantization" ... 737
Brewer M.J. et al., Using principal components analysis for dimension reduction ... 745
Brys G., Hubert M., Struyf A., A robustification of the Jarque-Bera test of normality ... 753
Burdakov O., Grimvall A., Hussian M., A generalised PAV algorithm for monotonic regression in several variables ... 761
Cardot H., Crambes Ch., Sarda P., Conditional quantiles with functional covariates: An application to ozone pollution forecasting ... 769
Cardot H., Faivre R., Maisongrande P., Random effects varying time regression models with application to remote sensing data ... 777
Ceranka B., Graczyk M., Chemical balance weighting designs for v + 1 objects with different variances ... 785
Choulakian V., A comparison of two methods of principal component analysis ... 793
Chretien S., Corset F., A lower bound on inspection time for complex systems with Weibull transitions ... 799
Christodoulou C., Karagrigoriou A., Vonta F., An inference curve-based ranking technique ... 807
Conversano C., Vistocco D., Model based visualization of portfolio style analysis ... 815
Cook D., Caragea D., Honavar V., Visualization in classification problems ... 823
Costanzo G.D., Ingrassia S., Analysis of the MIB30 basket in the period 2000-2002 by functional PC's ... 831
Croux C., Joossens K., Lemmens A., Bagging a stacked classifier ... 839
Csicsman J., Fenyes C., Developing a microsimulation service system ... 847
Cwiklinska-Jurkowska M., Jurkowski P., Effectiveness in ensemble of classifiers and their diversity ... 855
Čapek V., Test of continuity of a regression function ... 863
Čížek P., Robust estimation of dimension reduction space ... 871
Dabo-Niang S., Ferraty F., Vieu P., Nonparametric unsupervised classification of satellite wave altimeter forms ... 879
Debruyne M., Hubert M., Robust regression quantiles with censored data ... 887
Derquenne C., A multivariate modelling method for statistical matching ... 895
Di Bucchianico A. et al., Performance of control charts for specific alternative hypotheses ... 903
Di Iorio F., Triacca U., Dimensionality problem in testing for noncausality between time series ... 911
Di Zio M., Guarnera U., Rocci R., A mixture of mixture models to detect unity measure errors ... 919
Di Zio M. et al., Multivariate techniques for imputation based on Bayesian networks ... 927
Dodge Y., Kondylis A., Whittaker J., Extending PLS1 to PLAD regression ... 935
Doray L.G., Haziza A., Minimum distance inference for Sundt's distribution ... 943
Dorta-Guerra R., González-Dávila E., Optimal 2² factorial designs for binary response data ... 951
Downie T.R., Reduction of Gibbs phenomenon in wavelet signal estimation ... 959
Dufour J.-M., Neifar M., Exact simulation-based inference for autoregressive processes ... 967
Duller C., A kind of PISA-survey at university ... 975
Eichhorn B.H., Discussions in a basic statistics class ... 981
Engelen S., Hubert M., Fast cross-validation in robust PCA ... 989
Escabias M., Aguilera A.M., Valderrama M.J., An application to logistic regression with missing longitudinal data ... 997
Fabian Z., Core function and parametric inference ... 1005
Fernandez-Aguirre K., Mariel P., Martin-Arroyuelos A., Analysis of the organizational culture at a public university ... 1013
Fort G., Lambert-Lacroix S., Ridge-partial least squares for GLM with binary response ... 1019
Francisco-Fernandez M., Vilar-Fernandez J.M., Nonparametric estimation of the volatility function with correlated errors ... 1027
Frolov A.A. et al., Binary factorization of textual data by Hopfield-like neural network ... 1035
Fujino T., Yamamoto Y., Tarumi T., Possibilities and problems of the XML-based graphics ... 1043
Gamrot W., Comparison of some ratio and regression estimators under double sampling for nonresponse by simulation ... 1053
Gelnarova E., Safarik L., Comparison of three statistical classifiers on a prostate cancer data ... 1061
Gibert K. et al., Knowledge discovery with clustering: Impact of metrics and reporting phase by using KLASS ... 1069
Giordano F., La Rocca M., Perna C., Neural network sieve bootstrap for nonlinear time series ... 1077
Gonzalez S. et al., Indirect methods of imputation in sample surveys ... 1085
Grassini L., Ordinal variables in economic analysis ... 1095
Gray A. et al., High-dimensional probabilistic classification for drug discovery ... 1101
Grendar M., Determination of constrained modes of a multinomial distribution ... 1109
Grün B., Leisch F., Bootstrapping finite mixture models ... 1115
Gunning P., Horgan J.M., An algorithm for obtaining strata with equal coefficients of variation ... 1123
Hafidi B., Mkhadri A., Schwarz information criterion in the presence of incomplete-data ... 1131
Hanafi M., Lafosse R., Regression of a multi-set based on an extension of the SVD ... 1141
Harper W.V., An aid to addressing tough decisions: The automation of general expression transfer from Excel to an Arena simulation ... 1149
Hayashi A., Two classification methods for educational data and its application ... 1157
Heitzig J., Protection of confidential data when publishing correlation matrices ... 1163
Hennig C., Classification and outlier identification for the GAIA mission ... 1171
Hirotsu C., Ohta E., Aoki S., Testing the equality of the odds ratio parameters ... 1179
Hlubinka D., Growth curve approach to profiles of atmospheric radiation ... 1185
Ho Y.H.S., Calibrated interpolated confidence intervals for population quantiles ... 1193
Hoang T.M., Parsons V.L., Bagging survival trees for prognosis based on gene profiles ... 1201
Honda K. et al., Web-based analysis system in data oriented statistical system ... 1209
Hrach K., The interactive exercise textbook ... 1217
Hušková M., Meintanis S., Bayesian like procedures for detection of changes ... 1221
Iizuka M. et al., Development of the educational materials for statistics using Web ... 1229
Ingrassia S., Morlini I., On the degrees of freedom in richly parameterised models ... 1237
Jalam R., Chauchat J.-H., Dumais J., Automatic recognition of key-words using n-grams ... 1245
Jarosova E. et al., Modelling of time of unemployment via log-location-scale model ... 1255
Jerak A., Wagner S., Semiparametric Bayesian analysis of EPO patent opposition ... 1263
Juutilainen I., Röning J., Modelling the probability of rejection in a qualification test ... 1271
Kaarik E., Sell A., Estimating ED50 using the up-and-down method ... 1279
Kalina J., Durbin-Watson test for least weighted squares ... 1287
Kannisto J., The expected effective retirement age and the age of retirement ... 1295
Katina S., Mizera I., Total variation penalty in image warping ... 1301
Kawasaki Y., Ando T., Functional data analysis of the dynamics of yield curves ... 1309
Klaschka J., On ordering of splits, Gray code, and some missing references ... 1317
Klinke S., Q&A - Variable multiple choice exercises with commented answers ... 1323
Kolacek J., Use of Fourier transformation for kernel smoothing ... 1329
Komarkova L., Rank estimators for the time of a change in censored data ... 1337
Koubkova A., Critical values for changes in sequential regression models ... 1345
Kropf S., Hothorn L.A., Multiple test procedures with multiple weights ... 1353
Krecan L., Volf P., Clustering of transaction data ... 1361
Kukush A., Markovsky I., Van Huffel S., Consistent estimation of an ellipsoid with known center ... 1369
Kůrková V., Learning from data as an inverse problem ... 1377
Kuroda M., Data augmentation algorithm for graphical models with missing data ... 1385
Lazraq A., Cleroux R., Principal variable analysis ... 1393
Lee E.-K. et al., GeneGobi: Visual data analysis aid tools for microarray data ... 1397
Leisch F., Exploring the structure of mixture model components ... 1405
Lin J.-L., Granger C.W.J., Testing nonlinear cointegration ... 1413
Lipinski P., Clustering of large number of stock market trading rules ... 1421
Luebke K., Weihs C., Optimal separation projection ... 1429
Malvestuto F.M., Tree and local computation with the multiproportional estimation problem ... 1439
Manteiga W.G., Vilar-Fernandez J.M., Bootstrap test for the equality of nonparametric regression curves under dependence ... 1447
Marek L., Do we all count the same way? ... 1455
Masicek L., Behaviour of the least weighted squares estimator for data with correlated regressors ... 1463
Matei A., Tillé Y., On the maximal sample coordination ... 1471
McCann L., Welsch R.E., Diagnostic data traces using penalty methods ... 1481
Michalak K., Lipinski P., Prediction of high increases in stock prices using neural networks ... 1489
Miwa T., A normalising transformation of noncentral F variables with large noncentrality parameters ... 1497
Mizuta M., Clustering methods for functional data: k-means, single linkage and moving clustering ... 1503
Mohammadzadeh M., Jafari Khaledi M., Bayesian prediction for a noisy log-Gaussian spatial model ... 1511
Monleon T. et al., Flexible discrete events simulation of clinical trials using LeanSim(r) ... 1519
Mori Y., Fueda K., Iizuka M., Orthogonal score estimation with variable selection ... 1527
Mucha H.-J., Automatic validation of hierarchical clustering ... 1535
Müller W.G., Stehlik M., An example of D-optimal designs in the case of correlated errors ... 1543
Munoz M.P. et al., TAR-GARCH and stochastic volatility model: Evaluation based on simulations and financial time series ... 1551
Murtagh F., Quantifying ultrametricity ... 1561
Naya S., Cao R., Artiaga R., Nonparametric regression with functional data ... 1569
Necir A., Boukhetala K., Estimating the risk-adjusted premium for the largest claims reinsurance covers ... 1577
Neykov N. et al., Mixture of GLMs and the trimmed likelihood methodology ... 1585
Niemczyk J., Computing the derivatives of the autocovariances of a VARMA process ... 1593
Novikov A., Optimality of two-stage hypothesis tests ... 1601
Ocana-Peinado F.M., Valderrama M.J., Modelling residuals in dynamic regression: An alternative using principal components analysis ... 1609
Ortega-Moreno M., Valderrama M.J., State-space model for system with narrow-band excitations ... 1615
Oxley L., Reale M., Tunnicliffe-Wilson G., Finding directed acyclic graphs for vector autoregressions ... 1621
Payne R.W., Confidence intervals and tests for contrasts between combined effects in generally balanced designs ... 1629
Peifer M., Timmer J., Studentised blockwise bootstrap for testing hypotheses on time series ... 1637
Pham-Gia T., Turkkan N., Sample size determination in the Bayesian analysis of the odds ratio ... 1645
Plat P., The least weighted squares estimator ... 1653
Porzio G.C., Ragozini G., A parametric framework for data depth control charts ... 1661
Praskova Z., Some remarks to testing of heteroskedasticity in AR models ... 1669
Quinn N., Killen L., Buckley F., Statistical modelling of lactation curve data ... 1677
Renzetti M. et al., The Italian judicial statistical information system ... 1685
Roelant E., Van Aelst S., Willems G., The multivariate least weighted squared distances estimator ... 1693
Rueda García M. et al., Quantile estimation with calibration estimators ... 1701
Ruiz M. et al., A Bayesian model for binomial imperfect sampling ... 1709
Ruiz-Castro J.E. et al., A two-system governed by PH distributions with memory of the failure phase ... 1717
Řezanková H., Húsek D., Frolov A.A., Some approaches to overlapping clustering of binary variables ... 1725
Saavedra P. et al., Homogeneity analysis for sets of time series ... 1733
Saito T., Properties of the slide vector model for analysis of asymmetry ... 1741
Sakurai N., Watanabe M., Yamaguchi K., A statistical method for market segmentation using a restricted latent class model ... 1751
Samé A., Ambroise Ch., Govaert G., A mixture model approach for on-line clustering ... 1759
Savicky P., Kotrc E., Experimental study of leaf confidences for random forests ... 1767
Scavalli E., Standard methods and innovations for data editing ... 1775
Sharkasi A., Ruskin H., Crane M., Interdependence between emerging and major markets ... 1783
Shimamura T., Mizuta M., Flexible regression modeling via radial basis function networks and Lasso-type estimator ... 1791
Shin H.W., Sohn S.Y., EWMA combination of both GARCH and neural networks for the prediction of exchange rate ... 1799
Siciliano R., Aria M., Conversano C., Tree harvest: Methods, software and some applications ... 1807
Sima D.M., Van Huffel S., Appropriate cross-validation for regularized errors-in-variables linear models ... 1815
Simoes L., Oliveira P.M., Pires da Costa A., Simulation and modelling of vehicle's delay ... 1823
Skibicki M., Optimum allocation for Bayesian multivariate stratified sampling ... 1831
Storti G., Multivariate bilinear GARCH models ... 1837
Sung J., Tanaka Y., Influence analysis in Cox proportional hazards models ... 1845
Sidlofova T., Existence and uniqueness of minimization problems with Fourier based stabilizers ... 1853
Tarsitano A., Fitting the generalized lambda distribution to income data ... 1861
Tatsunami S. et al., An application of correspondence analysis to the classification of causes of death among Japanese hemophiliacs with HIV-1 ... 1869
Tressou J., Double Monte-Carlo simulations in food risk assessment ... 1877
Triantafyllopoulos K., Montana G., Forecasting the London metal exchange with dynamic model ... 1885
Tsang W.W., Wang J., Evaluating the CDF of the Kolmogorov Statistics for normality testing ... 1893
Tsomokos I., Karakostas K.X., Pappas V.A., Making statistical analysis easier ... 1901
Turmon M., Symmetric normal mixtures ... 1909
Tvrdík J., Křivý I., Comparison of algorithms for nonlinear regression estimates ... 1917
Vanden Branden K.V., Hubert M., Robust classification of high-dimensional data ... 1925
Vandervieren E., Hubert M., An adjusted boxplot for skewed distributions ... 1933
Verboven S., Hubert M., MATLAB software for robust statistical methods ... 1941
Víšek J.A., Robustifying instrumental variables ... 1947
Vos H.J., Simultaneous optimization of selection-mastery decisions ... 1955
Waterhouse T.H., Eccleston J.A., Duffull S.B., On optimal design for discrimination and estimation ... 1963
Wilhelm A.F.X., Ostermann R., Encyclopedia of statistical graphics ... 1971
Willems G., Van Aelst S., A fast bootstrap method for the MCD estimator ... 1979
Wimmer G., Witkovsky V., Savin A., Confidence region for parameters in replicated errors ... 1987
Witkovsky V., Matlab algorithm TDIST: The distribution of a linear combination of Student's t random variables ... 1995
Yamamoto Y. et al., Parallel computing in a statistical system Jasp ... 2003
Yokouchi D., Shibata R., DandD: Client server system ... 2011
Zadlo T., On unbiasedness of some EBLU predictor ... 2019
Zarzo M., A graphical procedure to assess the uncertainty of scores in principal component analysis ... 2027
COMPSTAT'2004 Symposium © Physica-Verlag/Springer 2004

THE HISTORY OF COMPSTAT AND KEY-STEPS OF STATISTICAL COMPUTING DURING THE LAST 30 YEARS

Wilfried Grossmann, Michael G. Schimek and Peter Paul Sint

Key words: COMPSTAT symposium, computational statistics, history of statistics, statistical computing, statistical languages, statistical software.
COMPSTAT 2004 section: Historical keynote.

1 Introduction
First of all we try to trace the situation and the ideas that culminated in the first COMPSTAT symposium in the year 1974, held at the University of Vienna, Austria. Special emphasis is given to the memories of our founding member P. P. Sint, who had been the driving force behind early COMPSTAT and had served it for twenty years.
At the time COMPSTAT was established computing technology was in its infancy. Yet it was well understood that computing would play a vital role in the future progress of statistics. The impact of the first digital computer in the Department of Statistics at the University of Vienna on the local statistics community is described. After the first computational statistics event in 1974 it was anything but clear that the COMPSTAT symposia would go on for decades as an international undertaking, to be incorporated as early as 1978 into the International Association for Statistical Computing (IASC, http://www.iasc-isi.org/), a Section of the International Statistical Institute (ISI).
After the description of the background against which the COMPSTAT idea emerged, the subject area of computational statistics is critically discussed from a historical perspective. Key steps of development are pointed out. Special consideration is given to the impact of statistical theory, computing (algorithms), computer science, and applications. Further we provide an overview of the symposia and trace the topics across 30 years, the period of historic interest. Finally we draw conclusions, also with respect to recent developments.

2 The early history of electronic computing


To start off we describe the situation of computing technology in post-war Vienna and the prominent role of the Department of Statistics (later on Statistics and Informatics) at the University of Vienna. Also the Mathematics Department of this university is of historic interest. There, a well-attended seminar was held in summer 1962 by the Viennese mathematician N. Hofreiter, "Zur Programmierung von elektronischen Rechenmaschinen" ("On the programming of electronic calculators"). Topics were one and two address machines, connectors, and programming of simple loops. The treatment was purely theoretical and no specific machine was envisioned. The highlight was an excursion to the first-ever electronic computer at the university, a Burroughs Datatron 205, installed in 1960 at the Department of Statistics. The same professor held classes in computing which started from slide rules and did not go beyond mechanical calculators (Brunsviga type) because electro-mechanical machines for teaching were lacking.
While finishing his studies in physics, Sint became a scholar of the Institute of Advanced Studies (Institut für Höhere Studien, IHS) in Vienna and ended up rather by chance in the Sociology Department. (A planned formal science department had not been realized.) There he learned, besides the basics of sociology, to handle card counting machines, especially the IBM Electronic Statistical Machine Type 101. With such machines one could not only count but also perform simple calculations. Only much later he learned from a historical article [72] that and how such a machine could even have been used to invert matrices.
The punching machines did not produce printouts on the cards. Hence users had to learn the encoding. This was simple for numeric codes but more demanding for alphabetic characters. Sorting of the cards by alphabet or numerically was done sequentially, starting from the last digit or character in the field. Sorted output card stacks were stapled and resorted for the next digit respectively character. This procedure also made it possible to perform multiplications of punched multi-digit numbers by "progressive digiting".
digiting" . Statistical machines had been popular in Vienn a since the late
19th cent ury. Programming was carr ied out on plugboard tablets - an in-
vention of the Austrian O. Schaffler [75] - based on telephone swit chboard
te chnology [54]. This technology was adopted in the census of the Austro-
Hung ari an Monarchy in 1890 (at the same time also in the USA). Scheffler
later sold his patents [88] to Hollerith's Tabulating Machine Comp any (which
end ed up in International Business Machines, IBM) . By some tri cky program-
ming of the boards sorting was possible by two columns (i.e. cha rac ters) of
the card in one run. Not knowing this, the famous economist F. Machlup
destroyed one of Sint's nearly finished sortings while explaining to him the
"proper" way of doing the job (with appropriate excuses afterwards).

3 The institutional environment in which COMPSTAT was born
During these early years so-called statistical machines were also used in sociology. A front runner of the use of formal mathematical methods in social sciences was the Institute for Advanced Studies. It was founded with essential financial help from the US Ford Foundation (hence locally known as "Ford Institute"). The famous sociologist P. Lazarsfeld, founder of the Institute for Applied Social Psychology at the University of Vienna in 1929 and its director until his emigration to the USA in 1933, later professor at Columbia University, and O. Morgenstern (together with J. von Neumann), the father of game theory and a former director of the Austrian Institute of Trade Cycle Research, were the driving forces behind the foundation of the Ford Institute [39]. At that time formal-mathematical as well as empirical methods were practically absent from the syllabus of economics and sociology in most academic institutions in Austria.
S. Sagoroff was a key person during the foundation of the Ford Institute and also its first director. He already had an interesting personal history: after receiving his doctor degree from the University of Leipzig (Germany) and studying in the USA under the supervision of J. A. Schumpeter in 1933/34 on a Rockefeller grant, he became professor of statistics, president of the statistical office, and director of the Rockefeller Institute for Economic Research in Bulgaria before World War II. Later he was Bulgarian Royal Ambassador to Germany in Berlin until 1942 (when Bulgaria joined the Allies). In that function he was involved in the delay of the delivery of Bulgarian Jews. While in Berlin, and with a broad interest in science, he befriended some of Germany's intellectual elite, including a number of Nobel laureates who cherished his dinner parties. After liberation from his internment in Bavaria he worked for the US Ambassador R. D. Murphy and spent some time at Stanford University, before becoming professor of statistics at the University of Vienna.
Sagoroff was certainly an able organizer for the start-up of IHS but might not have been the best choice for running the institution in a way ensuring high scientific standards. Still, the Ford Institute was a tremendous place to learn and to get acquainted with current thoughts in social and economic sciences, offering contacts to researchers of high reputation. In the following decade IHS played an important role in the reversal of the former situation at Viennese academic institutions, advocating mathematical and statistical approaches.
Sagoroff's USA experience had also been crucial to the fact that he was successful in receiving a Rockefeller grant for the University of Vienna to buy a digital computer. The foundation paid for half of the price (83,500 US$) and the computer company gave an educational grant covering the other half. The university had to pay just for transportation and installation. That Sagoroff was interested in computers and on the lookout for one was most likely fueled by the fact that at the very time H. Zemanek was constructing the first transistorized computer in Europe at the Technische Hochschule (now University of Technology) in Vienna. At that same time Sagoroff's assistant at the Statistics Department, A. Adam, also tried to build a simple electronic statistical calculator and also obtained a patent on this device. But he definitely did not have the technical expertise of Zemanek and his machine was never used in practice. Nevertheless his historical findings on the early history of computing remain a landmark in the historiography of the area ([18], widely distributed during the 1973 ISI Session in Vienna).
The arrival of the first "electronic brain" in Vienna in 1960 was not only of interest for the scientific community but meant also a major event for the Austrian media. The electronic tube-based machine needed a special powerful electricity generator to convert the 50 Hz alternating current in Austria to the 60 Hz used in the USA. It was installed in the cellar of the new university annex building. The windows of the computer room had to be equipped with specially coated glass to ensure constant temperature.
This Datatron 205 was a one address machine with one command (or address) and two calculating registers. The machine owned a drum storage with 4000 cells. Each cell held 10 binary-decimal digits (each digit was represented by 4 bits and the uppermost values beyond 0-9 were not used). The 11th digit was used for signs and as a modifier in some commands. The 4000 cells were divided in 40 cylinders on the drum, each containing 100 words, with an average access time (half turn of the drum) in the millisecond domain. It possessed a feature later reinvented by IBM and marketed in a more elaborate form under the name virtual memory: two cylinders could accept repeated identical runs of 20 words (commands), which reduced access time to one fifth. The critical parts of the program code were shifted to this 'fast storage' with one block command and the program execution shifted (often simultaneously) to the first command in this storage, which meant it was transferred into command register A.
The implementation was in digital code: each command was a two digit number acting on one address. For instance the command "64" imported a number into register A:

    0000641234   Import the content of cell 1234 (on the drum) into calculating register A

while "74" added to it:

    0000741235   Add the content of cell 1235 to the content of register A

60 stood for multiplication, 61 for division. Other arithmetic operations, floating point operations, shift operations, logical operations, conditional jumps, and printing of registers were performed similarly. An additional register could be used independently or to enlarge the number of digits in register A. 02 stored results back to the drum. 08 stopped the run.
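To make the one-address scheme concrete, the following minimal sketch interprets just the handful of operation codes mentioned above (64 load, 74 add, 60 multiply, 61 divide, 02 store, 08 stop); the drum is modelled as a plain dictionary, the word layout is simplified to an (opcode, address) pair, and the cell addresses in the example are invented:

    def run(program, drum):
        # Execute a list of (opcode, address) words against a drum (dict: cell -> value).
        a = 0                      # calculating register A
        for code, addr in program:
            if code == 64:         # import the content of a cell into register A
                a = drum[addr]
            elif code == 74:       # add the content of a cell to register A
                a += drum[addr]
            elif code == 60:       # multiply register A by the content of a cell
                a *= drum[addr]
            elif code == 61:       # divide register A by the content of a cell
                a /= drum[addr]
            elif code == 2:        # store register A back to the drum
                drum[addr] = a
            elif code == 8:        # stop the run
                break
        return drum

    # (content of 1000 + content of 1001) * content of 1002, stored in cell 1003
    drum = {1000: 3, 1001: 4, 1002: 10, 1003: 0}
    run([(64, 1000), (74, 1001), (60, 1002), (2, 1003), (8, 0)], drum)
    print(drum[1003])              # 70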
In principle there existed an assembler with mnemonic alphabetic codes; however, there was no tape punching device to enter alphabetic characters. Because one had to know the digit codes for operating the machine (entering and changing commands bit by bit, only guided by a display of the registers on the console), the direct way was definitely faster. As one could actually see each bit stored in the registers during programming and debugging, one could also spot a malfunctioning hardware unit if one of the bits did not show up properly. In this case one had to open the machine and take out the concerned unit (a flip flop with four tubes). Usually it was easy to spot the culprit by visual inspection or alternatively by exchanging the tubes one by one. Only the (preliminary) finished program was printed or punched out on a paper tape. As space was scarce and each letter had to be encoded by two decimal digits, comments accompanying the results were kept to a minimum.
The arrival of this computer was essential to the fact that the Statistics Department became the hub of computing inside the University of Vienna.
Sint's first experiences with real computing in the early nineteen sixties are connected to a programming course for digital computers held by the mathematician J. Roppert, assistant in the Department of Statistics. As one of the few who took an exam in computer programming, and as a scholar of IHS, Sint was offered an assistantship at this department. His statistical qualifications were elementary probability theory (not based on measure theory) and some statistics for sociologists. (The type of statistics used in quantum physics was not of much help in a statistics department.) At the IHS he also obtained a first training in game theory from O. Morgenstern. Later, while spending a year in Oxford, he learned more statistics and got interested in cluster analysis. This contact with English statistics helped him do "real" statistics in the following.
Before Sint could use the new generation of computers (an IBM/360-44 was installed at the University of Vienna in 1968) he had to learn his first programming language, Fortran. For W. Winkler, a professor emeritus of statistics, he wrote his first Fortran program for the calculation of a Lexis-type population distribution on an off-site computer. When he had finished, Winkler remarked that it would have been much faster to do the job on a mechanical calculator. At that time correcting card decks and working on a remote machine was extremely time consuming.
About that time IBM had started developing and distributing statistical software. Most developments were open source Fortran code. Naturally Fortran was a large step forward, going along with third generation digital computers. Programme codes for algorithms were published by the US Association of Computing Machinery (ACM). About that time also the first commercial packages arrived. In statistics one could choose between OSIRIS, BMD, P-STAT, and SPSS. The Department of Statistics at the University of Vienna decided for SPSS in December 1973. SPSS, like BMD and P-STAT, was implemented in Fortran, offering high portability. All the implementations of statistical methods at the department were programmed in Fortran, not a user-friendly environment from today's perspective. This included the first administrative program for the enrollment of students and production of corresponding statistics.

4 The first symposium and the Compstat Society


Access to elaborate algorithms on computers increased the awareness of more recent methodological developments in statistics, primarily in the Anglo-American world. In the Statistics Department at the University of Vienna, with its tradition in conventional economic and demographic statistics, the younger members tried hard to establish contacts with the international statistical community. Not having had access to sufficient travel funds, Sint and his colleague J. Gordesch, a trained mathematician, encouraged by A. Liebing, the publisher of the journal Metrika, envisioned a conference on an up-to-date statistical topic in Vienna. Sint was interested in cluster analysis and Gordesch rather in computational probability and model building. These and other topics were ventilated until one settled on a conference on computers and statistics. As for the name in English they took the Journal of the Royal Statistical Society as a model: it comprised series A for Theoretical Statistics and series B for Applied Statistics. Thus they assumed Symposium on Computational Statistics would be a proper name. Sint came up with the acronym COMPSTAT, arguing that one needs a short name which would still be near to an understandable expression to be easily remembered (this is what is called a logo now).
For the first call for papers the word COMPSTAT was embedded in an arrow like graph derived from the symbols used in analog computing: several input lines ending in a triangle (the statistical engines or algorithms). The condensed final result we are still using is displayed in the left figure. Sint and his colleagues were thinking about statistical methods (they were the hub of our ideas about the conference) as means of compressing a large number of inputs in a few meaningful results and COMPSTAT as an input to improve the algorithms (being quite aware of the recursivity of these processes).

[Figure: the condensed COMPSTAT logo (left) and the original design idea with input lines ending in a triangle (right)]

The original design idea was rather something like the right figure. A sketched drawing similar to this one (without the small arrows and with a smaller number of input lines) had been dropped by the graphics designer of the publisher.
As we know now this was the first freely accessible international conference with an open call for papers in this area. The first COMPSTAT meeting was announced in the American Statistician (attracting some participants from the USA), which helped later to defend the right of name in that country. The only preceding international conference of that kind was organized and financed by IBM. Preceding were also the at first rather local North American Interface symposia, starting in Southern California in 1967, sponsored by the local chapters of both the American Statistical Association and the ACM, obtaining an international flavor as late as 1979 (twelfth Interface symposium held at the University of Waterloo, Ontario, Canada). For the Interface Foundation of North America, Inc., and its history see http://www.galaxy.gmu.edu/stats/IFNA.html.
Any organizer of a new kind of conference is uncertain about its success and the number of participants he/she might attract. According to the preface of the proceedings [1], Sint and Gordesch were not sure whether "mathematicians specialized in probability theory or statistics, or experts in electronic data processing would look at computational statistics as a serious subject". As the deadline of the call for papers came nearer the organizers became increasingly anxious and started to muster locals for participation. Fortunately, in the first few days after the deadline had expired, a reasonable number of additional abstracts appeared, all together enough to give them peace of mind.
In 1972 Sint had attended a conference where the proceedings papers had to be retyped by clerical staff, which turned out to be a disaster. With this experience in mind it was decided to ask for camera-ready copies. For the COMPSTAT proceedings it worked out smoothly and the copies could be distributed during the symposium, a practice that has survived till now.
The formal invitation to the conference was signed by G. Bruckmann and L. Schmetterer, both professors of statistics at the department, because the young colleagues hoped that the appearance of internationally known personalities would be more acceptable to participants and to the potential buyers of the proceedings (Sint and Gordesch just signed the preface; F. Ferschl was added as an editor by the publisher).
Gordesch had at the time of the conference already left Vienna, and Sint had moved to the Austrian Academy of Sciences. Thus, although the latter was still around (his new boss was Schmetterer, the successor of Sagoroff as professor of statistics), a lot of the preparatory work had to be done by the young colleagues W. Grossmann, G. Pflug, and W. Schimanovich. M. G. Schimek, a first-year student of statistics and informatics in 1974, learning Fortran and SPSS at that time, was a keen observer of all these activities going on in the Department of Statistics and Informatics at the University of Vienna.
The interest of Gordesch in COMPSTAT had remained awake and so the next conference was naturally held in Berlin. From that time onwards it has never been a problem to find places to go. Someone has always been willing to organize the symposium.
To have a permanent platform a Compstat Society was created in 1976. Membership was by invitation only. Mainly organizers and chairpersons of the first conferences were approached. Sint recalls that only selected members were asked (no formal board decision) when COMPSTAT was transferred to the International Association for Statistical Computing (IASC) in 1978. It was an initiative of N. Victor (1991-1993 IASC President). Readers interested in the history of the IASC are referred to the Statistical Software Newsletter, edited for almost three decades by A. Hörmann, and since 1990 integrated as special section into the official journal of the IASC, Computational Statistics and Data Analysis. Furthermore we want to mention P. Dirschedl and R. Ostermann (1994 [32]) as a valuable reference for developments in computational statistics (including IASC activities in Germany, the history of the legendary Reisensburg Meetings and of the Statistical Software Newsletter).
Formally the Compstat Society was dissolved by the Austrian Registra-
tion Office due to inactivity. Numerous members reappeared in the newly
founded European Regional Chapter (now European Section) of the IASC.
The main stumbling block in the transfer was Physica-Verlag and its owner
A. Liebing. He had contributed a lot to the planning of the first symposium
to make it a success and was then afraid that, if the conference is taken over
by a large organization, other publishers would get interested and grab the
proceedings and the then started COMPSTAT Lectures (a series of books
apart from the proceedings). The result of the heated discussions during
COMPSTAT 1978 in Leiden was a most favourable treatment clause which
gave Liebing an advantage over competitors. This worked out satisfactorily
until he sold Physica-Verlag to the Springer company because of his retire-
ment as a publisher.
Sint's continued active involvement ceased after 20 years at the second COMPSTAT symposium that took place in Vienna, organized by R. Dutter (University of Technology, Vienna) and W. Grossmann. The 1994 anniversary was also marked by a COMPSTAT Satellite Meeting on Smoothing - smoothing having been a hot topic at that time - held in the famous alpine spa Semmering (on the border between Lower Austria and Styria), bringing additional audiences mainly from outside Europe to COMPSTAT. It was organized by M. G. Schimek (Karl-Franzens-University, Graz; currently IASC Vice President). The COMPSTAT baby had come of age and a new generation was following the tradition of P. P. Sint.

5 Some remarks on the development of computational statistics
The idea of COMPSTAT was born at the University of Vienna in an envi-
ronment typical for statistics departments in continental Europe at that time
against the background of new computer technology, rather specific with re-
spect to statistical methodology. In order to obtain a more detailed picture of
the role of COMPSTAT we need to sketch some important issues in the devel-
opment of computational statistics in connection with other topics. Starting
point for our considerations is the following working definition of the term
Computational Statistics, which is according to a statement of N. Victor in
1986 (cf. Antoni et al., 1986 [7], p. vi) ".....not an independent science but
rather an important area of statistics and indispensable tool for the statisti-
cian" . This statement is made more precise in a definition proposed by A.
Westlake (cf. Lauro, 1996 [61]): "Computational statistics is related to the
advance of statistical theory and methods through the use of computational
methods. This includes both the use of computation to explore the impact
of theories and methods, and development of algorithms to make these ideas
available to users". This definition gives on the one hand a clarification of the term "area of statistics" in Victor's statement, on the other hand it emphasizes also the instrumental aspect of statistical methods with respect to their application.
Starting from this definition it is quite clear that we have to consider the
progress of computational statistics in connection with developments in sta-
tistical theory, developments in computation and algorithms, developments
in computer science, and last but not least developments in the application of
statistics. In many ways there has always been an exchange of ideas, impor-
tant for the understanding of computational statistics, stemming from these
four areas. In the following we sketch some of these ideas and discuss their
interplay.

5.1 Computational statistics and statistical theory


According to B. Efron in 2002 [36] the development of statistics in general
can be divided into a theory area and a methodology area. Efron illustrates
the theory area as a journey from applications towards the mathematical for-
mulation of statistical ideas. According to him it all starts around 1900 with
the work of K. Pearson and goes on to the contributions of J. Neyman and Sir
R. Fisher, finally approaching the decision-theoretic framework for statistical
procedures due to A. Wald. A key feature in this development is the foun-
dation of statistical theory on optimality principles. This decision-theoretic
framework is capable of bolstering statistical methods by a sound mathemat-
ical theory, provided that the problems are stated in precise mathematical
form by a number of assumptions. In that sense the theoretical background
is a prerequisite for the application of statistics and for the computations
in connection with the statistical models. Obviously computation meant in
early times paper and pencil calculations or using rather simple (mechanical)
computing devices.
To some extent the early investigations were oriented more towards the
analysis of mathematical properties of procedures and less towards the anal-
ysis of data. A milestone in the shift from the theory area towards the
methodology area was the paper of J. W. Tukey in 1962 [83] about the future
of data analysis. It emphasizes a number of important aspects, in particular
the distinction between confirmatory and explanatory analysis, the iterative
and dynamic nature of data analysis, the importance of robustness, and the
use of graphical techniques for data analysis. In this paper Tukey is not so
enthusiastic about the computer with respect to data analysis. He states
that the computer is in many instances "important but not vital", in others "vital". However, due to the technological development, the computer has
definitely become more important for the methodology area than one could
foresee 40 years ago.
In fact, the methodology area is in many aspects characterized by a strong
interplay between statistics and computing, ranging from the implementation
of procedures over the definition of new types of models up to the discovery of new aspects of statistical theory. A typical example is Bayesian data analysis, the progress of which has been driven to a considerable extent by new computational techniques (cf. Gelman et al., 1996 [44]). High computing
power is needed for these methods, hence they are often summarized under
the heading computer intensive methods. Another interesting feature of many
of these developments is the fact that optimality principles are not necessarily
applied in a closed form by defining one objective function in advance, but
rather by outlining a number of optimization problems in an iterative and
more dynamic way than in traditional statistics. This iterative process is
rather statistical in nature compared to the iterative numerical solutions of
nonlinear equations. Hence, from a statistical (data analytic) point of view
one is sometimes not solely interested in the final solution but also in the
behaviour of the algorithm.
In many instances theoretical insight into methods and the development
of models go hand in hand with the implementation of these methods respec-
tively models . In the following we list (in alphabetical order) a number of
key developments that have resulted in standard approaches of applied statis-
tics (together with early references): Bootstrap Methods (Efron, 1979 [35]), EM-Algorithm (Dempster, Laird and Rubin, 1977 [30]), Exploratory Data Analysis (EDA; Tukey, 1970 [84]), Generalized Additive Models (GAM; Buja, Hastie and Tibshirani, 1989 [22], Hastie and Tibshirani, 1990 [50]), Generalized Linear Models (GLM; Nelder and Wedderburn, 1972 [70]), Graphical Models (Lauritzen and Wermuth, 1989 [60]), Markov Chain Monte Carlo (MCMC) - in particular Gibbs Sampling - (Hastings, 1970 [52], Geman, 1984 [45]), Nonparametric Regression (Stone, 1977 [77]), Projection Pursuit (Fisherkeller et al., 1974 [38], Friedman and Tukey, 1974 [43]), Proportional Hazard Models (Cox, 1972 [28]), Robust Statistics (Huber, 1964 [56]), and Tree Based Methods (Breiman et al., 1982 [21]). Besides these developments inside statistics we wish to point out that new aspects of statistical data analysis have in addition occurred in connection with Data Mining (Frawley et al., 1992 [41]), recently explored from a statistical learning perspective by T. Hastie, R. Tibshirani and J. Friedman (2001 [51]).
Apart from these examples that are all characterized by a strong interplay
between statistical theory and computational statistics in the sense of West-
lake's definition, it should be noted that there are also methods which had
been formulated long before they were feasible to compute. An interesting
example with respect to the interplay between theory and computation are
rank procedures. According to R. A. Thisted (1988 [80]) the motivation of
F. Wilcoxon for defining his rank test was the fact that for moderate sample
sizes calculation of the rank sum by hand is easier than calculation of the
sum and the variance. However, the situation is completely changed in case
of large sample sizes and machine calculation. Other examples of theoretical
models introduced long before it was feasible to numerically evaluate them
are conditional inference for logistic regression as formulated by Sir D. J. Cox
(Mehta and Patel, 1992 [62]) or the empirical Bayes approach of H. Robbins
(1956 [74]) that nowadays sees interesting applications in microarray analysis
(Efron, 2003 [37]).
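Returning to the rank procedures discussed above, the statistic Wilcoxon had in mind is simple enough to write down directly; the following hypothetical Python sketch (ties ignored for simplicity, illustration only) computes the rank sum that, for moderate samples, was easier to obtain by hand than a mean and a variance.

```python
def rank_sum(x, y):
    """Wilcoxon rank-sum statistic: sum of the ranks of the x-observations
    in the pooled, jointly ranked sample (ties ignored for simplicity)."""
    pooled = sorted((value, group) for group, sample in (("x", x), ("y", y))
                    for value in sample)
    # Ranks start at 1; add up those belonging to the first sample.
    return sum(rank for rank, (_, group) in enumerate(pooled, start=1)
               if group == "x")

# Hypothetical data: by hand one would order the 11 values and add the x-ranks.
print(rank_sum([1.8, 2.4, 3.1, 2.9, 3.6], [2.0, 2.2, 1.5, 1.9, 2.6, 2.1]))
```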
Besides these new developments in statistical theory, the advance of com-
puters has also influenced other areas of statistical theory in the sense of
providing tools for experimental checking of statistical models under vari-
ous scenarios. Such types of computer experiments are of interest even in
cases where the methods are well underpinned from a theoretical point of
view. A well known early example is the Princeton study on robust statistics
(Andrews et al., 1972 [20]). Today in theoretical investigations it is rather
common to support the results by simulation and graphical displays. In
this context one should know that according to H. H. Goldstine (1972 [48])
such computer experiments were already envisioned by J. von Neumann and
S. Ulam in 1945 at the very beginning of digital computing. This led to the
development of simulation languages, rather independently of conventional
statistics, but with an important impact on computer science (see also [65]).
Note that Simula was the first object-oriented language ever (Dahl and Ny-
gaard, 1966 [29]). A good overview of simulation from a statistical perspective
can be found in B. Ripley's book of 1987 [73].

5.2 Computational statistics and algorithms


Computation in statistics is based on algorithms which have their origin
either in numerical mathematics or in computer science. Such methods are
summarized under the topic statistical computing. Usually textbooks
emphasize the numerical aspects (for instance Monahan, 2001 [67]). However,
in the following we want to review briefly some important developments in
numerical mathematics as well as in computer science.
For mainstream statistics the most important area is numerical analysis.
The core topics are numerical linear algebra and optimization techniques, but
practically all areas of modern numerical analysis may be useful. Approximation
techniques applying specific classes of functions, for example splines
or wavelets, play an important role in smoothing. Numerical integration is
essential for the calculation of probability distributions, and for time series
analysis Fourier transforms are of utmost importance (note that the fast
Fourier transform, which is one of the most important algorithms of numerical
analysis, was invented by J. Tukey in connection with statistical problems
(Tukey and Cooley, 1965 [85])). Recursive algorithms and filtering are
traditionally linked to time series, but recently these methods are also of interest
in connection with data streams [86]. However, it seems that statisticians
often apply these methods more like a tool from the shelf. New innovative
aspects occur on the one hand in the theoretical analysis of algorithms in the
context of statistical models, on the other hand as adaptation of methods
according to statistical needs, which is in fact one of the key issues in
computational statistics. The organization of the recent textbook by
J. E. Gentle (2002 [46]) is a good example.
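As a small illustration of how the fast Fourier transform enters time series analysis, the hypothetical Python/NumPy sketch below computes a raw periodogram of a noisy sinusoid; it illustrates the general technique only and is not taken from any of the cited texts.

```python
import numpy as np

def periodogram(x):
    """Raw periodogram of a series via the FFT (no tapering or smoothing)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    dft = np.fft.rfft(x - x.mean())         # Fourier transform of the centred series
    freqs = np.fft.rfftfreq(n, d=1.0)       # frequencies in cycles per observation
    power = (np.abs(dft) ** 2) / n
    return freqs, power

# Hypothetical example: sinusoid with period 20 plus noise.
rng = np.random.default_rng(0)
t = np.arange(200)
series = np.sin(2 * np.pi * t / 20) + 0.5 * rng.standard_normal(200)
freqs, power = periodogram(series)
print(freqs[np.argmax(power[1:]) + 1])      # dominant frequency, close to 0.05
```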
Another core topic is generation of random numbers which is conceptu-
ally close to computational statistics and computational probability theory
and is the basic technique for discrete event simulation. Most early applica-
tions concerned the generation of random variates of different distributions
for sampling as well as for numerical integration. Nowadays this technique is
fundamental to many new developments in statistics like bootstrap methods,
Bayesian computation, or multiple imputation techniques. However, also
in this field it seems that statisticians are mainly interested in using these
techniques for their own purposes; in particular, the theory of uniform random
number generation is traditionally linked rather to number theory and
computer science. An important contribution inside statistics is the quite
exhaustive monograph on the generation of non-uniform random variates by
L. Devroye (1986 [31]).
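One of the oldest recipes for non-uniform random variate generation, inversion of the distribution function, can be sketched in a few lines; the exponential example below is a hypothetical Python illustration of the general principle, not code from the cited monograph.

```python
import math
import random

def exponential_variate(rate, rng):
    """Non-uniform variate by inversion: if U is uniform on (0, 1),
    then -log(1 - U) / rate follows the exponential distribution."""
    u = rng.random()
    return -math.log(1.0 - u) / rate

rng = random.Random(123)
draws = [exponential_variate(2.0, rng) for _ in range(10000)]
print(sum(draws) / len(draws))   # should be close to the mean 1/2
```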
Apart from numerical analysis, there are algorithms of statistical interest
for sorting, searching and combinatorial problems sometimes summarized un-
der the heading semi-numerical algorithms. They are of utmost importance
for exact nonparametric test procedures and for exact logistic regression as
implemented in StatXact and LogXact (see for example [63] and [64]). Com-
binatorial algorithms are also used in the context of experimental design.
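The combinatorial flavour of such semi-numerical algorithms can be seen in an exact two-sample permutation test, which enumerates all possible group assignments and is therefore feasible only for small samples; the sketch below is a hypothetical Python illustration of the general idea, not of the StatXact or LogXact implementations.

```python
from itertools import combinations

def exact_permutation_pvalue(x, y):
    """Exact two-sided p-value for a difference in means,
    enumerating every way of splitting the pooled sample."""
    pooled = x + y
    n, m = len(x), len(pooled)
    observed = abs(sum(x) / n - sum(y) / len(y))
    total = sum(pooled)
    count = hits = 0
    for idx in combinations(range(m), n):      # every possible "x" group
        s = sum(pooled[i] for i in idx)
        diff = abs(s / n - (total - s) / (m - n))
        hits += diff >= observed - 1e-12
        count += 1
    return hits / count

# Hypothetical data; 7 choose 3 = 35 rearrangements are enumerated.
print(exact_permutation_pvalue([3.1, 2.7, 3.5], [2.0, 2.2, 1.8, 2.4]))
```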
There is another group of algorithms highly relevant for computational
statistics. Their origin is mainly in computer science, in particular we are
thinking of machine learning, artificial intelligence (AI), and knowledge dis-
covery in data bases. Neural Networks, Genetic Algorithms, Decision Trees,
Belief Networks or Boosting are important and topical examples. These de-
velopments have given rise to a new research area on the borderline between
statistics and computer science. New challenges arise from the need to inter-
pret these non-statistical approaches in a statistical framework. In addition
to [51], papers by D. Hand (1996 [49]), and R. Coppi (2002 [27]) discuss some
of these issues.
All the above mentioned computational topics cover methods that are
also adopted in other areas of mathematical modelling. If one looks into
a book of mathematical modelling one might find similar algorithms and
techniques as in a textbook about computational statistics. For example the
book of N. Gershenfeld (1999 [47]) distinguishes between Analytical Models,
Numerical Models and Observational Models. Analytical Models (mainly
difference and differential equations) occur also in statistical applications,
in particular in finance and epidemiology, but, as Sir D. J. Cox had stated
in the preface to COMPSTAT 1992 [10], these topics are not core topics
in computational statistics. It seems that the situation has not changed
since. Obviously, in the area of Observational Models there is large overlap
with methods used in statistical modelling but the focus is a different one.
This had already been noticed in the early days of computational statistics
by Sir J. A. Nelder (1978 [69]) who identified the following peculiarities of
computing in statistics compared to other areas: (i) Complex data structures:
Problems analyzed by statisticians often have a rather complex data structure,
and adaptation of this structure towards the requirements of an algorithmic
procedure is many times a genuine statistical task; (ii) Exploratory nature of
statistical analysis: Usually in a statistical analysis we have not only a pure
algorithmic cycle (defined by: get data, do algorithm, put results, stop) but
rather a cycle of different computations, which are to some extent defined
according to the interpretation of the previous results; (iii) Competence of
users: Users of statistical methods are not necessarily experts in the area of
statistics or in the area of numerical mathematics, but experts in a domain,
and want to interpret their methods according to their domain knowledge.
With these specific points in mind it is not surprising that graphical com-
putation plays a more prominent role in statistics than in other areas of
modelling. J. Tukey is one of the statistical pioneers, in particular with re-
spect to dynamic graphics (Friedman and Stuetzle, 2002 [42]). Statistics has
contributed to the development of graphical computation complementary to
computer science. L. Wilkinson et al. (2000 [87]) stress the following three
key ideas in the progression of statistical graphics, which may be seen as main
driving factors behind most genuine statistical innovations: (i) Graphics are
not only a tool for displaying results but rather a tool for perceiving statisti-
cal relationships directly; (ii) Dynamic interactive graphics are an important
tool for data analysis; and (iii) Graphics are a means of model formalization
reflecting quantitative and qualitative traits of its variables.

5.3 Computational statistics and computer science


Due to the specific needs of statistical data analysis mentioned in the pre-
vious section it was quite natural that even in the early days of computers
statisticians were interested in developing specific software tools tailored more
towards their needs than mathematical subroutine libraries like NAGLIB or
IMSL. As early as 1965 Sir J. A. Nelder started with the development of
GENSTAT in Adelaide (South Australia) on a CDC 3000 computer (Nelder,
1974 [68]). The data structure was at that time the data matrix, but in
the further developments at Rothamsted Experimental Station (UK) the de-
sign was changed towards increasingly statistics-oriented data structures like
variates, vectors, matrices or tables with main emphasis on the variate, as
well as the development of a statistical language. Around the same time other
projects had also been started that resulted in major packages: BMD (later
BMDP) was developed by W. J. Dixon and M. B. Brown from 1964 onwards
at the University of California at Los Angeles as a coherent combination of
different analysis subroutines with a common control language (first manual
in 1972 [33]). SAS was designed by J. Goodnight and A. J. Barr starting in
1966 (the commercial SAS Institute was founded in 1976 by J. Goodnight,
J. Sall, A. Barr and J. Helwig;
http://www.sas.com/presscenter/bgndrJrristory.html,
http://www.theexaminer.biz/Software/Goodnight.htm).
Finally, in 1967 N. H. Nie, C. H. Hull and D. H. Bent commenced the SPSS
project at Stanford University
(http://www.spss.com/corpinfo/history.htm).
The latter two packages still flourish as products of service companies.
Many other statistical packages were designed in the subsequent years
with the aim of supporting data manipulation and statistical computing.
The major developers tried to keep track of the progress made in computing
infrastructure in order to improve their products with respect to data stor-
age and data management and to offer numerically more reliable statistical
analysis methods. The book of I. Francis (1981 [40]) provides an overview
over this early period of statistical software. It describes more than 100 pack-
ages available at the beginning of the nineteen eighties. The scope of these
programs ranged from data management systems and survey programs to
general purpose statistical programs and programs for specific analysis tasks.
With respect to programming Fortran was the dominant source language and
most of the products were offered for different hardware configurations and
operating systems. Today for a number of reasons most of these products are
only of historical interest. For special-purpose packages at the forefront of
statistical methodology it was difficult to keep their competitive advantage
after their methods had become widespread. For other products it was any-
thing but easy to keep pace in their program design with the fast progress of
computer technology. Only the major producers were able to follow the de-
velopments which also meant a switch from Fortran to other languages like
C or C++, an adaptation to new computer architectures, and integration
of modern user interfaces as well as of graphic facilities into their packages.
Their new orientation towards customized analysis procedures made these
products increasingly attractive for statisticians as well as non-statisticians.
More important for computational statistics were other developments
aiming at the design of statistical languages as basis for statistical program-
ming environments. Based on the conceptual formulation of the Generalized
Linear Model , GLIM seems to have been the first system that was oriented
towards the definition of an interactive analytical language for a large class
of statistical problems in a unified manner, taking advantage of the previous
GENSTAT experiences. The most important step in this direction was the
S language, a project starting in 1976. The goal was the definition of a pro-
gramming language for the support of data analysis processes
(http://cm.bell-labs.com/cm/ms/departments/sia/S/history.html).
The computer science oriented concepts of the S language are best described
in the so-called "green" S book by J. Chambers (1998 [23]). For the statistical
aspects we refer to the "white" S book of J. Chambers and T. Hastie (1992
[24]). The general approach, a clever combination of functional programming
and object oriented programming, supports perfectly the iterative nature of
the statistical data analysis process and forms a new paradigm for comput-
ing, which is independent of the statistical application. The ACM honored
this contribution to computer science: In 1998 Chambers received the ACM
Software System Award for his seminal work which "has forever altered the
way people analyze, visualize and manipulate data" [17].
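The paradigm rewarded here - functions that return self-describing model objects which can in turn be handed to further functions - can be caricatured outside S; the following hypothetical Python sketch (an illustration of the idea only, not S code) wraps a least-squares line fit in such an object.

```python
class LineFit:
    """Tiny stand-in for a statistical model object: it stores data summaries
    and estimates together, so further functions can operate on the fit itself."""

    def __init__(self, x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        sxx = sum((xi - mx) ** 2 for xi in x)
        sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
        self.slope = sxy / sxx
        self.intercept = my - self.slope * mx
        self.residuals = [yi - self.predict(xi) for xi, yi in zip(x, y)]

    def predict(self, xi):
        return self.intercept + self.slope * xi

fit = LineFit([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 8.1, 9.8])   # hypothetical data
print(fit.slope, fit.intercept, max(abs(r) for r in fit.residuals))
```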
In 1992, based on the S language, R. Ihaka and R. Gentleman started
the R-project at the University of Auckland (New Zealand; cf. Gentleman
and Ihaka, 1996, [59] for the early history of R). Due to free availability the
R-community grew rather fast and in 1996 the Comprehensive R Archive
Network (CRAN) was established at the University of Technology in Vienna
(cf. Hornik and Leisch, 2002, [55] for recent developments). A further impor-
tant step in the development of statistical environments, closely related to R,
was the formation of the Omegahat-project (http://www.omegahat.org/)
for statistical computing in 1998. It serves as an umbrella for a number of
other recent open source projects. Its goal, as described in detail by D. Tem-
ple Lang [79], is to meet the challenges for statistical computing resulting
from new developments in computer science like distributed computing or
Web-based services. Examples are extensions of existing systems such as
StatDataML (Meyer et al., 2002 [66]) offering an XML interface for data ex-
change or embedding R into a spreadsheet environment (Neuwirth and Baier,
2000 [71]).
Besides S and R there were a number of other important projects in the
area of statistical software development. For instance we want to mention
W. Härdle's XploRe [53], an interactive statistical computing environment,
realizing new concepts of nonparametric curve and density estimation as well
as statistical graphics in the mid nineteen eighties. In connection with XploRe
recent efforts to extend its scope to statistical teaching and to Web applica-
tions are worth mentioning. Another project of interest, due to L. Tierney in
the late nineteen eighties, was XLISP-STAT ([81], [82]), a statistical environ-
ment based on the public X-LISP language freely available from the statlib
archive.
A further line of development are efforts to use parallel architectures in
statistical computing. Such computer architectures are typically used for the
implementation of demanding numerical algorithms. In recent years com-
puter science has widened the scope of parallel computing towards distributed
computing. We expect this research area to grow quite rapidly in the future,
with an impact on statistical computing.
Another statistically relevant area of computer science is data manage-
ment. While data structures in statistical computing are usually closely re-
lated to formal specifications of data types (e.g. lists, vectors, or matrices),
the interpretation of an analysis process often makes use of conceptual and
relational structures. Traditionally this topic is treated in the theory of data
bases. A major breakthrough in this area was the introduction of the re-
lational data model by E. F. Codd (1970 [26]). It offers the opportunity
to describe complex real world problems from a conceptual point of view in
a unified manner. The description of data by data models is nowadays cap-
tured under the heading metadata. In this context it is worth mentioning
that the term metadata occurred for the first time in connection with official
statistics in a book by B. Sundgren (1975 [78]). Modern data base systems of-
fer not only tools for storage and retrieval, but also statistical functionalities,
in particular for tabulation (core instruments for official statistics). Despite
the fact that they are rather simple with respect to statistical methodology,
there are numerous pitfalls from a conceptual point of view. The latter raise
interesting operational questions which are treated in the context of data
warehouses. An interesting reference which helps to understand the con-
nections as well as the differences between the statistical approach and the
computer science approach to multi-dimensional tables is [76].
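The tabulation facilities mentioned above amount to aggregation over the cross-classification of categorical variables; the following hypothetical Python sketch (illustration only, with made-up records) builds a simple two-way count table of the kind such systems produce.

```python
from collections import Counter

def crosstab(records, row_key, col_key):
    """Two-way frequency table: counts of records by (row, column) category."""
    counts = Counter((rec[row_key], rec[col_key]) for rec in records)
    rows = sorted({r for r, _ in counts})
    cols = sorted({c for _, c in counts})
    return {r: {c: counts.get((r, c), 0) for c in cols} for r in rows}

# Hypothetical survey records.
records = [
    {"region": "north", "employed": "yes"},
    {"region": "north", "employed": "no"},
    {"region": "south", "employed": "yes"},
    {"region": "south", "employed": "yes"},
]
print(crosstab(records, "region", "employed"))
```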

5.4 Computational statistics and applications


With respect to the interplay between applications and computational statis-
tics we want to discuss now the challenges that arise from application prob-
lems. Besides the difficulties resulting from new problems in various research
areas, for example analysis of microarrays in biology, one can identify - rather
independently from the field of research - the following three interwoven chal-
lenges for computational statistics: handling of problems stemming from new
data capture techniques, from the complexity of data structures, and from
the size of data.
Since the early times of computational statistics a major effort has been
the development of tools for automatic data capture and of interfaces to
data management systems. This has led to the development of computer-
aided survey information collection (CASIC) tools, an area which seems to
be nowadays more a topic in official statistics and management of statistical
data base systems. Inside computational statistics we observe an increasing
interest in the handling of efficient data generation systems. Many times
such systems occur in connection with automated monitoring of networks, in
particular the Internet. Such data streams are of interest from a computer
science as well as a statistical point of view. The statistical perspective is
treated in a number of papers in a recent issue of the Journal of Computa-
tional and Graphical Statistics (e.g. Wegman and Marchette, 2003 [86]).
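For data streams of this kind all observations can neither be stored nor revisited, so summaries have to be updated one record at a time; a standard device is Welford's running update of mean and variance, sketched below as a hypothetical Python illustration of the streaming idea rather than of any cited system.

```python
class RunningMoments:
    """Single-pass (streaming) mean and variance via Welford's update."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0          # sum of squared deviations from the running mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")

stream = RunningMoments()
for value in [4.2, 5.1, 3.8, 4.9, 5.4]:   # hypothetical measurements arriving one by one
    stream.update(value)
print(stream.mean, stream.variance)
```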
With respect to the data structures the traditional model was character-
ized by the relation between sample and universe or by a properly designed
measurement process. Such data structures can be represented quite well in
a relational scheme and appropriate statistical models can be formulated for
the analysis, for instance hierarchical models. In connection with data min-
ing applications statisticians are confronted with new data structures which
do not fit into the standard model. One has to analyze data combined from
different sources which are often rather inhomogeneous with respect to qual-
ity (e.g. problem of missing values) and have no immediate interpretation
in a traditional statistical framework. Combination of such data sources is
a statistical problem in its own right.
The last difficulty is the size of the data. P. J. Huber (1994 [57]) clas-
sified data sets from tiny (about 100 bytes) up to huge (about 10^10 bytes).
One can definitely argue that size is always an issue relative to computing
power and storage capacity, and problems practically intractable 30 years
ago are nowadays routine applications. Nevertheless, today's statisticians
and computer scientists have to solve problems for huge datasets. Specific
problems concerning the data structure, the data base management, and the
computational complexity are discussed in Huber (1999 [58]).
A second important topic for computational statistics with respect to ap-
plications is the statistical analysis process itself. The ubiquitous availability
of the computer and of statistical software packages has changed the con-
text in many ways. On the one hand statistical software packages support
statisticians in the phase of exploratory data analysis and allow them the
evaluation of numerous tentative models for the data without careful plan-
ning in advance. On the other hand they enable non-statisticians to perform
rather complex analyses for their data, in former times solely carried out by
professional statisticians. This evolution has weakened in some sense the role
of statisticians as custodians of the data and has caused many discussions
inside the statistical profession. Here we only want to mention Y. Dodge
and J . Whittaker (1992 [34]) who raised the point that this development
might bring about a de-skilling of certain parts of the profession. However
they also argued that the democratization of facilities does not automatically
mean a threat to the profession in the long run. We claim that statistical
analysis is definitely more than the application of certain algorithms because
an analysis strategy is required too. For instance in the current scientific
development of the bio-sciences we see an explosion of highly complex data
problems that can only be managed in part with the resources at hand.
In the nineteen eighties the question of automated analysis strategies was
intensively discussed in connection with the issue of statistical expert systems.
This undertaking ended without substantial success, making it clear that it
is rather implausible to assume that statisticians can be easily substituted by
machines in the near future. To put it in a nutshell, not even standard data-
analytic problems can be handled easily via routine applications and simple
rule systems. Another area of interest in this context is certainly the role of
computers in statistical education, in particular for non-professionals, taking
advantage of the various opportunities offered in the field of computational
statistics.

6 The COMPSTAT symposia


In this section we review the COMPSTAT symposia, giving a tabulated sum-
mary of the occurrence of topics and a verbal description of the meetings and
proceedings.
As for the summary of topics covered in the COMPSTAT symposia we
have produced two self-explaining tables, Table 1 for the period 1974-1988,
and Table 2 for the period 1990-2002. The notation in these tables is the
following: "p" denotes that a topic was present in the proceedings, "f" de-
notes that a topic was frequently present in the proceedings (i.e. more than
3 times), "K" represents a keynote paper, "I" represents one or two invited
papers, and finally "T" signifies a tutorial. We suggest reading the respective
table in parallel with the verbal description of the chronologically ordered
COMPSTAT symposia.
The very first COMPSTAT symposium was held at the University of Vi-
enna in 1974, initiated by P. P. Sint and J. Gordesch. Both were also in fact
the editors of the proceedings [1]. There were about 50 presentations orga-
nized according to five subject areas, reflecting to some extent the interests
of the organizers: Computational Probability, Automatic Classification, Nu-
merical and Algorithmic Aspects of Statistical Computing, Simulation and
Stochastic Processes, and last but not least Software Packages. In 1974 there
were neither formal keynotes nor invited lectures. However, during the open-
ing session a special lecture was delivered by the well-known mathematical
statistician L. Schmetterer on stochastic approximation (not in the proceed-
ings).
Naturally the topics within the subject areas were rather scattered, but
some of them remained popular across the whole period of 30 years such
as Robustness (note that P. J. Huber was present at the first symposium),
Time Series Analysis, and Modelling (the latter in its beginning primarily
meaning factor analysis and dimension reduction techniques). It is remark-
able that a number of statistical packages popular at the time were already
covered: R. Buhler's P-STAT and Sir J. A. Nelder's GENSTAT. The pre-
sentation of a SAS system, not to be confused with the later much more
successful namesake [25], should also be mentioned. Further, as in succeeding
conferences, APL (for details see e.g. [19]) appeared as a popular statistical
environment.
With all this in mind Gordesch and Sint speculated in the preface of [1]
about a spectacular growth of the field, writing "which as we hope will
now result in techniques of model building being very different today from
what it was in pre-computer days".
The second COMPSTAT symposium took place in Berlin 1976, organized
by J . Gordesch and P. Naeve (also the editors of the volume [2]). Altogether
58 papers were presented. The subject areas were more or less the same as
at the first meeting but the names had changed somewhat: Computational
Probability, Automatic Classification and Multidimensional Scaling, Numer-
ical and Algorithmic Aspects of Statistical Models (with subtopics Linear
Models, Multivariate Analysis and Sampling), Simulation and Stochastic Pro-
cesses, and finally Software. A new section "Applications" was introduced
(mainly in economics and biology). This selection reflects the understanding
of computational topics in the mid nineteen seventies: Multivariate Analy-
sis comprised mainly ANOVA as well as Factor Analysis and Computational
Probability meant random number generators and the calculation of distribu-
tions in statistics. Apart from statistical computing Software also comprised
recent developments in data bases. In addition there was a dedicated interest
in the comparison of software packages with respect to certain technical as
well as practical criteria.
The third COMPSTAT symposium in Leiden 1978 was organized by the
Department of Medical Statistics in cooperation with the Computer Cen-
tre (both Leiden University) and headed by L. C. A. Corsten and J. Her-
mans. 68 papers were presented and published in the proceedings [3]. For
the first time two keynotes were included, delivered by Sir J. A. Nelder and
J. Tinbergen. The main topics consisted of Linear and Nonlinear Regression,
Time Series, Discriminant Analysis, Contingency Tables, Cluster Analysis,
Exploratory Techniques, Simulation and Optimization, Teaching Statistics,
and Statistical Software. It is interesting to note that Exploratory Techniques
was mainly an umbrella for problems in connection with multidimensional
scaling. The topics Simulation and Optimization as well as Computational
Probability also comprised contributions which would nowadays hardly find
their way into a statistical meeting.
The fourth COMPSTAT symposium was organized by and held at the
University of Edinburgh in 1980 with a record number of about 750 partici-
pants. Four invited and 82 (out of 250 submissions) contributed papers were
presented and published in the proceedings volume [4], edited by M. M. Barritt
and D. Wishart. This meeting clearly marks the beginning of the transition
from batch to interactive computer processing, reflected in a special ses-
sion. Invited lectures were given by J. Tukey on styles of data analysis, by
E. M. L. Beale on branch and bound methods for optimization, by R. Tomas-
sone on survey management of large data sets, and by I. Francis on a tax-
onomy of statistical software. Other topics were Sampling Methods, Data
Base Management, Education, Analysis of Variance/Covariance, Interactive
Computing, Linear and Nonlinear Regression, Multivariate Analysis, Opti-
mization and Simulation, Cluster Analysis, Statistical Software, and Time
Series Analysis.
The diffusion of interactive personal computing (marking the shift from
mainframe to personal computers in the early nineteen eighties) can be clearly
identified in COMPSTAT 1982 held in Toulouse (the fifth symposium) with
about 500 participants. H. Caussinus, who also published the proceedings [5]
together with P. Ettinger and R. Tomassone, chaired the program committee.
One finds several new features at this COMPSTAT: the number of invited
speakers was increased to 15 in order to cover new developments in compu-
tational statistics like Experimental Design, Computing Environments, Nu-
merical Methods, EDA, Parallel Processing in Statistics, and Artificial Intel-
ligence. In contrast to previous proceedings volumes comprising also papers
at the border of statistics to other areas, the focus was now less theoretical
and more computing oriented (60 papers out of 250 submissions).
COMPSTAT Symposium 74 76 78 80 82 84 86 88

Algorithms f f f p p p fI fI
Applications f f p p p fI p
Bayes/MCMC/EM p p
Categorical Data p p f p p f pI
Classification/Discrimination p p f p p fI f p
Cluster Analysis f f f f p f fI
Computational Probability f f f p p
Data Bases/Metadata p p f fI p
Data Imput./Survey Design p f p fI f pI
Data Visualization/Graphics p p pI pT
Dimension Reduction f f f p p f pI pI
Experimental Design p p p pI p p
Expert Systems/AI I pI fI fIT
Exploratory Data Analysis p p pI p p p
Foundations/History p K pI I I
Graphical Models p pT
Handling of Huge Data p
Image Analysis p
Internet-based Methods
MANOVA p p p p f p
Modelling/GLM/GAM p f f f p fI
Neural Networks p p f
Numerics/Optimization p f p f I fI f
Parallel Computing pI K
Reliability and Survival p p
Regression (linear/nonlinear) p f p p f pI
Resampling p fK
Robustness p p p fI f p
Simulations f p f f
Smoothing/Curve Estimat. p f p f p pI
Spatial Statistics p p p p
Statistical Software f f fK fI pI f f p
Stat. Learning/Data Mining p
Stochastic Systems p f f p p
Teaching Statistics f p p pI
Time Series Analysis f p p f pI fI p
Tree-based Methods p
Wavelets

Table 1: Topics in the proceedings of the COMPSTAT symposia 1974-1988.
p-present, f-frequent (> 3), K-Keynote, I-Invited, T-Tutorial.
COMPSTAT Symposium 90 92 94 96 98 00 02

Algorithms p f fI fI f
Applications fI p p fI K f
Bayes/MCMC/EM f pK f fI fI f
Categorical Data I
Classification/Discrimination f fI fI fK f f f
Cluster Analysis p p
Computational Probability I I p p
Data Bases/Metadata pI fI pT f
Data Imput./Survey Design I p fKI p
Data Visualization/Graphics p p pI p pI
Dimension Reduction I p p p
Experimental Design f f f p p
Expert Systems/AI fI p
Exploratory Data Analysis p p
Foundations/History pI K p
Graphical Models p fI p p p
Handling of Huge Data K p p
Image Analysis pI pI p
Internet-based Methods I p fI
MANOVA p I p
Modelling/GLM/GAM p fI p pK fI p
Neural Networks I p
Numerics/Optimization pI p fI p
Parallel Computing pI p
Reliability and Survival p f p pI p
Regression (linear/nonlinear) p fI fI p p p
Resampling f f p p p f
Robustness fI fI f fI p
Simulations p p p p
Smoothing/Curve Estimat. f f pI fI p pI
Spatial Statistics p pI p p fI
Statistical Software p fI f p f IT
Stat. Learning/Data Mining f p p p pI fK
Stochastic Systems pI I pI
Teaching Statistics p pK pI pI fI
Time Series Analysis fI f fI fI pI fI fI
Tree-based Methods p pI p p
Wavelets p pI pI K p

Table 2: Topics in the proceedings of the COMPSTAT symposia 1990-2002.
p-present, f-frequent (> 3), K-Keynote, I-Invited, T-Tutorial.
Many of them reflect the trends of the time, especially the penetration of per-
sonal computers and improved graphical displays into the world of statistics.
The wish of statisticians to apply these new technologies, not yet covered
by commercial software packages, can be clearly seen. Another novelty was
the production of a complementary volume with short communications and
posters.
The sixth symposium took place in Prague in 1984, extending the scope
of COMPSTAT to the Eastern European countries. As a matter of fact IASC
had planned for a meeting in Bratislava (a Slovakian town only 65 kilometers
from Vienna) but the (communist) Czechoslovakian Academy of Sciences de-
cided for the central location of Prague. Luckily there were several dedicated
statisticians, among them T . Havranek, Z. Sidak and M. Novak, the organiz-
ers of the meeting. Many colleagues, who at that time did not have the chance
to participate in Western meetings, could attend. Out of a record number of
about 300 submissions 65 papers were selected. T. Havranek, Z. Sidak and
M. Novak also edited the proceedings [6] and a companion volume of short
communications and posters, following the example of 1982. Commemorat-
ing the tenth anniversary of the COMPSTAT symposia, P. P. Sint was invited
to deliver a lecture entitled "Roots in Computational Statistics". The main
topics covered in invited talks were Computational Statistics in Random Pro-
cesses, Computational Aspects of Robustness, Discriminant Analysis, Statis-
tical Expert Systems, Optimization Techniques, Linear Models, and Formal
Computation in Statistics. Besides these topics also the traditional COMP-
STAT themes like Cluster Analysis, Multivariate Analysis, Statistical Mod-
elling and Software were present. It is worth mentioning that also a number
of more computer science-oriented papers on data management and data pre-
processing had found their way into the proceedings, reflecting some of the
local interests.
COMPSTAT 1986 (the seventh symposium) was held in Rome and at-
tracted a record of about 900 participants. From around 300 submis-
sions for contributions about 60 contributed papers as well as 13 invited pa-
pers were published in the proceedings [7], edited by F. De Antoni, N. Lauro
and A. Rizzi. A keynote lecture was given by E. B. Andersen about informa-
tion, science and statistics, discussing the challenges for statistics resulting
from the development of statistical software, graphics, interactive computing,
and new methods and styles of data analysis. Apart from the invited program
the proceedings volume presents itself well-balanced between statistically
oriented themes, computer science oriented topics and novel applications.
The main statistical themes comprised the traditional COMPSTAT topics
like Probabilistic Models in Exploratory Data Analysis, Computational Ap-
proaches of Inference, Numerical Aspects of Statistical Computation, Cluster
Analysis and Robustness, but also a rather specialized topic entitled Three
Mode Data Matrices. The more computer science oriented topics reflect the
trend towards Expert Systems and Artificial Intelligence, typical for the mid
nineteen eighties. Altogether 9 papers on statistical expert systems were pre-
sented. Not so much in the mainstream of the time we identify sections on
Computer Graphics, Data Representation, Statistical Software and Statisti-
cal Data Base Management. Main application areas were Clinical Trials and
Econometric Computing. Additionally there was a section about Teaching
Statistics.
Also for COMPSTAT 1988 (the eighth symposium), taking place in Co-
penhagen, the number of participants remained high with more than 800.
It was organized by D. Edwards who also published the proceedings (co-
editor N. E. Raun, [8]) and the additional volume of short communica-
tions and posters. There were two keynotes delivered by G. W. Stewart
on parallel linear algebra in statistical computations and by B. Efron on
computer-intensive statistical inference, and 7 invited papers. They were
related to Non-Parametric Estimation, Projection Pursuit, Expert Systems,
Algorithms, Statistical Methods, Statistical Data Bases, and Survey Pro-
cessing. Out of approximately 300 submissions 51 contributed papers were
selected. At that time computational statistics had become an integrated
part of statistics research with new emerging areas, especially graphical tech-
niques and models, Bayes methods, and smoothing techniques. Nonpara-
metric curve estimation and dimension reduction techniques are discussed at
COMPSTAT for the first time. At the same time the COMPSTAT evergreen
Expert Systems is still quite present. A real innovation were tutorials in the
programme. They covered the fields Dynamic Graphics (R. Becker), Artifi-
cial Intelligence (W. Gale), and Graphical Models (N. Wermuth). The new
availability of modern computing also makes itself visible in the appearance
of the proceedings volume with a relatively larger number of electronically
produced papers.
The ninth meeting in Dubrovnik 1990 marks a dramatic change in the pos-
itive development of the COMPSTAT symposia seen so far. Submissions were
down to 115 (43 contributed papers selected). After six years COMPSTAT
was back in a communist country, however when this decision was taken,
nobody could foresee the disintegration of Yugoslavia. During the conference
around Dubrovnik first road barricades were erected and soon after the civil
war broke out (the conference hotel on the sea shore was destroyed in the
following years) . Anticipating the unrest many participants and speakers did
not show up (an audience of around 180 was present). Thus the proceedings
volume [9], edited by the organizer K. Momirovic does not really represent
the conference (many more papers than presentations). The programme was
dominated by the subject areas Expert Systems, Multivariate Data Analysis
and Model Building, and Computing for Robust Statistics. Special topics
were Optimization and Analysis of Spatial Data. All these comprised in-
vited talks (6 invited papers in the proceedings). In addition some of the
traditional COMPSTAT topics such as Algorithms, Time Series (with an in-
vited paper) and Computational Inference were present. Despite all external
problems it is noteworthy that aspects of modelling and appropriate software
played an important part in this meeting, establishing a new COMPSTAT
focus. T. Hastie (replacing J. Chambers) presented statistical models in S
for the first time and new strategies for GLIM4 were outlined by B. Francis.
As a matter of fact it was for the first time that the statistical and graph-
ical environment S (the S-Plus package) was discussed in the COMPSTAT
community.
In 1992 the tenth symposium was held in Neuchatel. It was the gen-
eral hope that COMPSTAT would recover from the Dubrovnik adventure,
however the problems went on. Submissions remained low with about 115.
Despite the fact that participation was only around 200, some participants
had to stay in remote accommodations, forced to use the cable car to get from
Chaumont (great views!) down to the conference site and back, reducing the
audience even further.
Y. Dodge, the organizer and proceedings editor (co-editor J. Whittaker),
decided to reshape the symposium and the volume. In response to the un-
expectedly low number of submissions and the fact that Physica-Verlag had
been sold to the Springer company, he changed the format of the proceed-
ings, giving up the established layout and format as well as the tradition of
a complementary volume for short communications and posters, accepting
almost all submitted papers as full contributions (Computational Statistics
Volume 1 and 2 [10] of a new Springer-Verlag series).
There is an interesting foreword by Sir D. J. Cox with the title "The
Role of Computers in Statistics". In a prologue Y. Dodge and J. Whittaker
feared that a de-skilling of the profession due to the dissemination of commer-
cial software packages could take place. When studying the two volumes of
COMPSTAT 1992, one dedicated to statistics and modelling and the other
to computation, we were astonished by the broad range of topics. The main
subject areas in Volume 1 are Statistical Modelling, Multivariate Analysis,
Classification and Discrimination, Symbolic and Relational Data, Graphi-
cal Models, Time Series, Nonlinear Regression, Robustness and Smoothing
Techniques, Industrial Applications and Bayesian Statistics. Volume 2 com-
prises Programming Environments, Computational Inference, Package De-
velopments, Experimental Design, Image Processing and Neural Networks,
Meta Data, Survey Design and Data Bases. Almost all these topics included
an invited lecture. There were neither official keynotes nor tutorials.
The new proceedings format had not been approved by the European
Regional Section of the IASC and was changed back to the previous COMP-
STAT appearance for the year 1994, remaining in this style up to the present.
The twentieth anniversary of COMPSTAT was celebrated at the eleventh
symposium held in Vienna 1994. The program committee, chaired by R. Dut-
ter, tried to find a compromise between the traditional COMPSTAT topics
and current topics when selecting the keynote speaker and the invited speak-
ers. The keynote was given by P. Huber and concerned the treatment of
huge data sets. The themes of the invited papers were Multivariate Analysis,
Classification and Discrimination, Dynamic Graphics, Numerical Analysis,
Nonparametric Regression, MCMC, Selection Procedures, Neural Networks,
Change Point Problems, Wavelet Analysis, and Time Series Forecasting. Be-
sides these invited lectures two tutorials were organized: W. Schachermayer
introduced statistical problems in finance and insurance and B. Sundgren
gave an overview on metadata. Furthermore a discussion about the nature of
computational statistics was organized. Altogether about 280 participants
attended this meeting. The organizers returned to the traditional format of
publishing the proceedings and an additional volume of short communications
and posters. The proceedings [11] were edited by R. Dutter and W. Gross-
mann and contained the invited and 60 contributed papers, selected from
approximately 200 submissions. With respect to statistical software the in-
creasing dominance of S for the development of computational statistics was
evident. Other more commercially oriented products were presented during
the conference and documented in a separate booklet.
After the symposium in Vienna there was a COMPSTAT Satellite Meeting
on Smoothing held at Semmering, attracting almost 50 participants. Because
of the COMPSTAT anniversary a historic train brought COMPSTAT
participants and accompanying persons on the oldest mountain railroad in the
world (now a World Cultural Heritage) from Vienna to the spa of Semmering
in the Austrian Alps.
The meeting was organized by M. G. Schimek and comprised 7 invited
lectures (presenters were B. Cleveland, M. Delecroix, R. Eubank, Th. Gasser,
R. Kohn, A. van der Linde, and W. Stuetzle) and two software presentations
(S-Plus and for the first time XploRe). W. Härdle and the organizer edited a
proceedings volume [12] consisting of 10 papers (not published elsewhere) out
of 26 given at the meeting. It also includes an expository discussed paper by
J. S. Marron ("A Personal View of Smoothing and Statistics") and two other
discussed contributions by W. S. Cleveland and C. Loader ("Smoothing by
Local Regression: Principles and Methods") and by B. Seifert and Th. Gasser
("Variance Properties of Local Polynomials and Ensuing Modifications"). It
is worth mentioning that local regression smoothing is now a principal tool for
normalization of microarray data in genetic research. Since the symposium in
Copenhagen 1988 nonparametric smoothing techniques and relevant software
had played a steadily increasing role in COMPSTAT.
The twelfth COMPSTAT symposium was organized under the auspices
of A. Prat in Barcelona 1996, attracting an estimated number of 300 par-
ticipants. An opening keynote was delivered by G. Box entitled "Statistics,
Teaching, Learning and the Computer" and a closing keynote "Information
Markets" was presented by A. G. Jordan. Eleven invited papers covered
topics like Time Series, Functional Imaging Analysis, Applications of Statis-
tics in Economics, Classification and Computers, Image Processing, Optimal
Design, Wavelet Analysis, Profile Methods, Web-based Computing, and Mul-
tidimensional Nonparametric Regression. Apart from the invited lectures the
proceedings [13] edited by A. Prat present also 56 contributed papers selected
from about 250 submissions, arranged in alphabetical order and grouped ac-
cording to subjects at the end of the proceedings. From the subject areas
one gets the overall impression that the main emphasis was on statistical
modelling, in particular Bayesian Methods, Classification, Experimental De-
sign and Time Series from the classical areas, and Neural Networks, Genetic
Algorithms, Wavelets and Classification Trees as more recent methodologies.
Also of interest is a rather broad spectrum of applications presented at the
conference. A novelty was the introduction of awards for the best papers of
young researchers.
The thirteenth symposium held at the University of Bristol in 1998 had
seen fewer participants than the previous COMPSTAT. Organizers were R. Pay-
ne and P. Green. There was a methodological keynote on wavelets delivered
by B. W. Silverman and in addition an applied keynote on the analysis of
clustered multivariate data in toxicity studies presented by G. Molenberghs.
Three of the 10 invited lectures dealt with various statistical techniques in
connection with applications like Mortality Pattern Prediction, Covariance
Structures in Plant Improvement Data, and Markov Models in Modeling
Bacterial Genomes. The other invited lectures consider rather methodological
issues like Design Algorithms, Scaling for Graphical Display, MCMC for La-
tent Variable Models, Decision Trees, Semi- and Nonparametric Techniques
in Time Series, and Time Series Forecasting. In addition there was an invited
lecture on teaching in network environments. The 58 contributed papers con-
tained in the proceedings volume (edited by R. Payne and P. Green, [14]) were
selected from about 180 submissions. Taking R. Payne's affiliation (IACR
Rothamsted) into account, it is not surprising that the proceedings show
a strong orientation towards statistical modelling and applications. However ,
there are also papers dealing with more computer science oriented aspects
of computational statistics, in particular computing environments and soft-
ware packages for special problems. The proceedings were accompanied by
a volume comprising the short communications and posters, edited by IACR
Long Ashton.
The fourteenth COMPSTAT symposium was held in Utrecht 2000. It
was organized by P. van der Heijden (Utrecht University) and J . G. Bethle-
hem (Statistics Netherlands). The number of participants was around 220.
It had a substantial applied focus on the social sciences and official statis-
tics. There were two keynotes (one on multiple imputation by D. B. Rubin
and the other on official statistics in the IT-era by P. Kooiman) and 13 in-
vited papers. The invited lectures concerned Algorithms, Bayesian Model
Selection, GLMs, HGLMs (a further generalization of GLMs), Imputation,
Data Mining, Spatio-Temporal Modelling, Survival Techniques, Time Series,
and Teaching. Further there were 60 contributed papers (out of around
250 submissions) following mostly the conventional subject areas of COMP-
STAT. A proceedings volume [15] and a supplement comprising the short
communications and posters were published (editors P. van der Heijden and
J. G. Bethlehem).
The last (fifteenth) COMPSTAT symposium we can report on took place
at Humboldt-Universität zu Berlin in 2002. It was organized by W. Härdle
and attracted approximately 220 submissions. This time the primary fo-
cus was on business applications, especially in connection with the Internet
(such as E-Commerce and Web-Mining) and on the handling of massive and
complex data sets (e.g. in genetic research). The idea was to expand the
traditional scope of COMPSTAT and to make it attractive for new audi-
ences. However, the number of about 260 participants made it clear that this
endeavour was not sufficient to substantially enlarge the audience for such
a meeting. However, it is only fair to mention that many young researchers
showed up for the first time, also joining IASC because of a special promotion
scheme.
There was a keynote delivered by T. Hastie entitled "Supervised Learn-
ing from Microarray Data". The other 8 invited talks concerned the topics
Bayes Methods, Graphical Methods, Internet Traffic, Smoothing, Teaching,
and Time Series. Further there were 90 contributed papers connected to the
above topics as well as to Algorithms, Classification, Computational Infer-
ence, Computing Environments, Data Mining, Meta Data, and Multivariate
Methods. Two additional areas of interest have emerged because of sub-
missions received, the statistical language R and functional data analysis.
Innovations were that the printed proceedings volume (edited by W. Härdle
and B. Rönz [16]) also appeared as a Springer-Verlag e-book and that the
companion volume of short communications and posters was published on
a CD. Moreover several prizes were granted (among them a new one for
software innovation).

7 Conclusions
The evolution of computational statistics has always been strongly influenced
by developments in statistical theory, in algorithms, in computer science, and
by the problems statisticians are confronted with. In statistical theory many
current topics are connected to concepts and methods of computational statis-
tics requiring definitely more than the proper implementation of well-defined
algorithms. With respect to computation we can observe a shift from pure
numerical analysis to more graphically oriented techniques and algorithms
developed in computer science. This brings about a new quality of coopera-
tion between statistics and computer science with a high potential for future
development. The traditional knowledge transfer from computer science to
computational statistics was primarily in the areas of statistical packages,
statistical languages, statistical graphics and statistical data management
systems. Yet these conventional areas are still open to new developments, in
particular with regard to statistical Web services and the seamless integration
of various tools. Concerning applications the main challenges for computa-
tional statistics are complex data structures and very large or huge data, as
well as the demand for new analysis strategies. Due to the penetration of all
areas of our life by computers one can expect an ever increasing number of
challenging tasks.
Although we can identify many inter-connections between computational
statistics and computer science, symbolic computation has not received the
attention it deserves. Mathematica has been used to implement a num-
ber of statistical approaches applying general mathematical notation, this
way making it feasible to calculate the results with (at least in principle
arbitrarily) high precision. One might envisage a development where simi-
lar approaches are introduced in environments like S and R or complement
these environments. The ability to use abstractions, symbolic representations
and/or general objects/classes in everyday work while having access to low
level constructs to improve statistical methods based on experiments or to
solve non-anticipated practical problems, could be a promising way for the
future.
The review of the COMPSTAT symposia has shown that the meetings
and proceedings reflect clearly the international developments of computa-
tional statistics of the last 30 years, although with some delay in certain
subject areas (e.g. Bayesian Methods, Resampling, Statistical Environments,
Smoothing Techniques, Statistical Learning, Tree Based Methods, Wavelets) .
On the other hand the anticipation of new ideas in connection with Dimen-
sion Reduction Methods, Expert Systems, and Robust Techniques was very
fast . More recently, with respect to content, there seems to be a shift of
focus towards topics related to statistical modelling and at the same time
less interest in computer science contributions useful in statistics.
There was a continuous uptrend in conference participation during the
first 16 years with symposia covering a rather broad spectrum of compu-
tational statistics topics. The nineteen eighties have certainly seen the high
time of COMPSTAT with many innovations in statistical computing, a boost
in algorithms and professional software (emphasizing personal computing in
the second half of the decade), and the early adoption of expert systems.
This is also reflected in the size of the meetings, going beyond 300 submis-
sions and ranging between 800 and 900 participants in Rome 1986 and in
Copenhagen 1988. After the problems with COMPSTAT 1990 in Dubrovnik
and COMPSTAT 1992 in Neuchatel, the symposia have stabilized since at
a lower level of participation. In recent years we have typically seen around
200 submissions and about 250 participants (i.e. very few non-contributing
participants). This has already led to new formats of presentation to keep
the number of parallel sessions as low as possible.
In general, COMPSTAT was probably not the main forum for the presen-
tation of state-of-the-art research results in computational statistics. It was
rather an important forum for the exchange of relevant information in the
European statistical community about current developments from all over
the world as well as on practical aspects such as new algorithms and statisti-
cal software. This was largely achieved by a dedicated invitation policy. One
can say that the organizers of the symposia always have given their best to
identify distinguished personalities for keynotes and invited lectures. This
way the European research community has received a great deal of valuable
impulses that often have proven influential for subsequent projects in Europe.
Occasional tutorials were another means of this successful policy, not only
attracting young researchers.
COMPSTAT has always been an international undertaking. However,
most recently there have been discussions at IASC business meetings focus-
ing on strategies for opening up the European COMPSTAT symposia even
further to make them world-wide events in the future, integrating other re-
gional sections such as the Asian Section and the planned African Section
(an initiative of S. Azen as President). As far as the Interface Foundation
of North America, Inc., is concerned, there was a formal proposal in 1987 to
transform it into the North American Section of IASC, however it was voted
down by the Interface Board. E . J . Wegman (1997-1999 IASC President)
from George Mason University (USA) initiated an informal connection be-
tween IASC and Interface which has finally led to the establishment of an
IASC-Interface Liaison Committee, chaired by M. G . Schimek as IASC Vice
President, to foster mutual interests and to organize invited sessions at each
others symposia.
We are all looking forward to this year's COMPSTAT symposium in
Prague, celebrating the thirtieth anniversary, chaired by J. Antoch (Charles
University of Prague). Perhaps the first step towards the next generation of COMPSTAT meetings has already been taken, as it is organized in compli-
ance with new guidelines. According to its scientific programme on the Web
we can already say that the Prague symposium is going to be truly inter-
national with contributions from all IASC regional sections, from Interface,
and beyond.


Acknowledgement: First of all the authors wish to thank Prof. Jaromir Antoch (Charles University of Prague) for giving them the opportunity to present a historical keynote. Further, the authors appreciate valuable hints and comments from the following colleagues: Dr. Lutz Edler (German Cancer Research Center Heidelberg), Dr. Karl A. Froschl (Electronic Commerce Competence Center Vienna), Dr. Walter Grafendorfer (Austrian Computer Society), Prof. Kurt Hornik (Wirtschaftsuniversität Wien), and Prof. Edward J. Wegman (George Mason University). However, all errors and omissions remain the responsibility of the authors.
Address: W. Grossmann, University of Vienna, Institute for Statistics and Decision Support Systems, Universitätsstraße 5, A-1010 Wien, Austria
M.G. Schimek, Medical University of Graz, Institute for Medical Informatics, Statistics and Documentation, Auenbruggerplatz 2, A-8036 Graz, Austria
P.P. Sint, Austrian Academy of Sciences, Institute for European Integration Research, Prinz Eugen Straße 8-10/2, A-1040 Wien, Austria
E-mail: wilfried.grossmann@univie.ac.at, michael.schimek@meduni-graz.at, sint@oeaw.ac.at

HYBRID ALGORITHMS FOR CONSTRUCTION OF D-EFFICIENT DESIGNS
Abdul Aziz Ali and Magnus Jansson

Key words: Exact D-optimal designs, genetic algorithms, local search.

COMPSTAT 2004 section: Design of experiments.

Abstract: We construct exact D-efficient designs for linear regression models using a hybrid algorithm that consists of genetic and local search components. The genetic component is a genetic algorithm (GA) with a 100% mutation rate and ranking selection. The local search methods we use are based on the G-bit improvement and a combination of the Powell multidimensional and Brent line optimization techniques. Computational results show that the hybrid algorithm generates designs that are comparable in efficiency to those found using the modified Fedorov algorithm (MFA), but without being limited to using a given set of candidate points.

1 Introduction
An experimental design is said to be optimal if it meets predefined criteria that determine the precision with which the model parameters or the response are estimated. The D-optimality criterion of Kiefer and Wolfowitz [12] puts emphasis on the precision with which the model parameters are estimated by maximizing the determinant of the model's information matrix. This criterion has the intuitively appealing interpretation of minimizing the volume of the joint confidence ellipsoid of the least squares regression parameter estimates.
Exact D-optimal designs are calculated using optimization algorithms such as those given by Cook and Nachtsheim [6] and Johnson and Nachtsheim [11], among others. These algorithms iteratively maximize the determinant of the information matrix by sequentially, or simultaneously, adding points to and deleting points from the design. Many of the most widely used algorithms require an explicit set of candidate points to work with, thus putting heavy demands on prior domain-specific knowledge of the optimization problem. Although not as common, evolutionary algorithms have also been used to calculate D-optimal designs. Govaerts and Sanchez [8] were the first to use genetic algorithms (GAs) to find exact D-optimal designs. However, their algorithm incorporated the use of a candidate set of design points, much like the more traditional algorithms. Poland et al. [17] used a GA to improve on the standard Monte Carlo algorithms by applying DETMAX and k-exchange as the mutation operator. Compared to the exchange algorithms, their algorithm
was slower but yielded better results. Broudiscou et al. [5] successfully applied a purely genetic algorithm to the exact D-optimal design problem in a chemometrics setting. GAs have since been used by Montepiedra et al. [15], who omitted the mutation operator in favor of faster convergence, and by Heredia-Langner et al. [10], who used real-value encoding in place of the more traditional binary encoding. The latter authors also give an excellent introduction to the use of GAs in calculating optimal designs.
This paper presents the use of hybrid algorithms for calculating D-efficient or near D-optimal designs. The hybrid algorithms considered here consist of a genetic component with a 100% mutation rate and local search methods. The mutation operator is used extensively in order to escape from local optima. The hybrid algorithm is therefore implemented in two stages: the genetic component finds a point in the neighborhood of a local optimum and the local search finds the local optimum itself. The genetic component is then updated with the coordinates of the local optimum and the process is repeated until some termination condition is met.

1.1 Model and the exact D-optimal design problem


In many experimental situations, the experimenters usually approximate the relationship between the response variable and the input factors with the linear model

y = Xβ + e,

where X is the (n × p) matrix of factor levels (the design matrix), β is the (p × 1) vector of unknown regression parameters, y is the (n × 1) vector of observations and e is the (n × 1) vector of error terms that are assumed to be iid (possibly normally distributed) with E(e) = 0 and E(ee^T) = σ²I.
When the goal is to construct exact designs, the problem becomes one of how to determine the points x_i, i = 1, 2, ..., n, from the region defined by all the level combinations of the factors, called the design region X, so that the resulting design will estimate some function of β with a precision that is at least as good as that provided by any other design in X. The exact n-point design is denoted by

ξ_n = { x_1, x_2, ..., x_k ; r_1/n, r_2/n, ..., r_k/n },

where Σ_{i=1}^{k} r_i = n and r_i is the number of trials at x_i. The standardized predictor variance is given by

d(x, ξ_n) = n x^T (X^T X)^{-1} x,

a function of the design ξ_n and the point at which the prediction is made.


The design ξ* is an exact D-optimal design if M is a non-singular matrix and the following is satisfied:

|M(ξ*)| = max over all designs ξ_n on X of |M(ξ_n)|,

where M(ξ_n) denotes the information matrix of the design ξ_n.
A measure of efficiency is the D-efficiency, which is defined as follows: a design ξ_1 has a D-efficiency relative to a design ξ_2 given by

D-eff = 100 · [ |M(ξ_1)| / |M(ξ_2)| ]^{1/p}.

This comparison is valid even when the designs being compared are of different sizes, because the comparison is based on the information per point for each design. For the interested reader, an excellent review of optimum design theory is given by Ash and Hedayat [3] and in the books by Atkinson & Donev [2] and Silvey [18].
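To illustrate the D-efficiency calculation (this example is not part of the original paper), the following Python sketch compares two hypothetical 9-point designs under the quadratic response surface model used later in Section 4.1; the helper names and example designs are ours.

import numpy as np

def model_matrix(design):
    # Expand 2-factor design points into the quadratic response surface model
    # terms (1, x1, x2, x1*x2, x1^2, x2^2).
    x1, x2 = design[:, 0], design[:, 1]
    return np.column_stack([np.ones(len(design)), x1, x2, x1 * x2, x1**2, x2**2])

def d_efficiency(design1, design2):
    # D-efficiency (in %) of design1 relative to design2, using the
    # per-point information matrices M = X'X/n.
    X1, X2 = model_matrix(design1), model_matrix(design2)
    M1 = X1.T @ X1 / len(design1)
    M2 = X2.T @ X2 / len(design2)
    p = M1.shape[0]
    return 100.0 * (np.linalg.det(M1) / np.linalg.det(M2)) ** (1.0 / p)

# Hypothetical example: a 3^2 factorial versus a randomly generated 9-point design.
factorial = np.array([(a, b) for a in (-1, 0, 1) for b in (-1, 0, 1)], dtype=float)
random_design = np.random.default_rng(1).uniform(-1, 1, size=(9, 2))
print(d_efficiency(random_design, factorial))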

1.2 Commonly used algorithms and genetic search


Exact D-optimal designs are calculated using optimization algorithms such as those given by Dykstra [7], Cook and Nachtsheim [6], Mitchell [14], Wynn [20] and Johnson and Nachtsheim [11], among others. These algorithms are search heuristics that iteratively maximize the determinant of the information matrix by sequentially adding points to and deleting points from the design, or by exchanging points between the existing design and a candidate set of points. The algorithms update the design matrix with rank-one matrices derived from the candidate points, as shown by the following formula, which is often used for computational efficiency. Upon the addition of a point x to an n-point design ξ_n, the change in the information matrix is a rank-one update, so that

|M(ξ_{n+1})| ∝ |M(ξ_n)| (1 + d(x, ξ_n)),

where d(x, ξ_n) is the standardized predictor variance introduced above. As a consequence, the point x whose addition to the design maximizes the determinant of the information matrix is the point whose standardized predicted response variance, calculated from the current design, is largest.
A major drawback of these algorithms is that at each iteration the sequential algorithms have to calculate the variance functions of the current designs. Exchange algorithms calculate the variance functions of all possible pairs of candidate points and current design points, a process that puts heavy demands on memory and speed even for moderately large designs.
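The determinant update behind the exchange algorithms can be verified numerically. The following sketch (illustrative only, not the authors' code) checks the matrix determinant lemma det(M + x x^T) = det(M)(1 + x^T M^{-1} x) for a randomly generated design matrix and candidate point.

import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(8, 3))            # current design matrix (8 runs, 3 terms)
x = rng.uniform(-1, 1, size=3)                 # candidate point (already model-expanded)

M = X.T @ X                                    # unnormalized information matrix
d = x @ np.linalg.inv(M) @ x                   # predicted variance of the candidate point

det_before = np.linalg.det(M)
det_after = np.linalg.det(M + np.outer(x, x))  # rank-one update after adding the point

# Matrix determinant lemma: det(M + x x') = det(M) * (1 + x' M^{-1} x)
print(det_after, det_before * (1 + d))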
Although not as commonly used as the exchange algorithms, evolutionary algorithms have also been used to calculate D-optimal designs. Genetic algorithms (GAs) have been successfully used to search for optimal or near-optimal solutions in large-scale optimization because of their versatility. GAs do not require convexity or even continuity of a function and are a powerful computational tool for function optimization because they are less susceptible to being trapped in local optima than many other numerical optimization techniques.
GAs usually, but not always, encode the possible solutions to an optimization problem using binary strings. For example, if the range of possible
solutions lies in the interval [-a, a], then the 8-bit binary string 00000000 will represent -a and 11111111 will represent +a. A randomly generated set of strings forms the initial population from which the GA starts its search. Initial candidate solutions (strings) are usually uniformly sampled from the search space in order to introduce variability in the set of candidate solutions. This initialization process is a random search whereby a number of possible solutions are randomly generated and the best solutions (the fittest strings) are remembered.

2 GA implementation for finding D-efficient designs


2.1 Encoding the designs
The GA is implemented by encoding each complete design, including the number of experimental runs, as one bit-string. Binary encoding is the most widely used form of representation because of its flexibility and also because its theoretical framework is well developed (Goldberg [9]). Binary encoding also allows for a simple way to apply the mutation and recombination operators. Consider the m-bit representation of a single-factor design at the high and low levels. The base 10 (decimal) representations of the coordinate points will be 0 and 2^m - 1 respectively. This design region is transformed to the familiar [-1, 1] by the function f : x -> 2x/(2^m - 1) - 1, where x is the decimal representation.
The length of the bit-string is determined by the number of bits required to code the coordinates of the levels taken by each factor, the number of factors, and the number of trials. For example, an n-trial experimental design with k factors, each requiring p bits to code its coordinates, would require npk bits.
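A minimal sketch of this decoding step (assumed implementation, with illustrative names): each p-bit group is read as a decimal integer and mapped to [-1, 1] with the transformation f given above.

import numpy as np

def decode(bitstring, n_trials, n_factors, p):
    # Decode an n_trials*n_factors*p bit-string into an (n_trials x n_factors)
    # design with coordinates in [-1, 1].
    bits = np.array(list(bitstring), dtype=int).reshape(n_trials, n_factors, p)
    weights = 2 ** np.arange(p - 1, -1, -1)       # most significant bit first
    decimals = (bits * weights).sum(axis=2)       # integer in {0, ..., 2^p - 1}
    return 2.0 * decimals / (2 ** p - 1) - 1.0    # map to [-1, 1]

# A 2-trial, 1-factor design coded with 8 bits per coordinate:
print(decode("00000000" + "11111111", n_trials=2, n_factors=1, p=8))
# -> [[-1.], [ 1.]]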

2.2 Initialization and selection


The initial population of strings (at iteration 0) consists of N strings. Because the D-optimality criterion pushes the design points to the edges or vertices of the design region, initial designs are generated by drawing random variates from a U-shaped distribution which puts more mass on the edges. The Beta distribution with α = 0.35, β = 0.35, transformed from [0, 1] to cover the design region of each factor, is used. This is done in order to sample fitter strings than would otherwise have been found using the commonly used uniform distribution. The designs are then evaluated, ranked according to their fitness, and encoded as bit-strings. The first iteration produces the N fittest strings.
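The initialization step can be sketched as follows (illustrative code, not from the paper): coordinates are drawn from a Beta(0.35, 0.35) distribution and rescaled from [0, 1] to [-1, 1], so most starting points lie near the edges of the design region.

import numpy as np

def initial_designs(n_strings, n_trials, n_factors, seed=None):
    # Sample n_strings random designs whose coordinates follow a U-shaped
    # Beta(0.35, 0.35) distribution rescaled to [-1, 1].
    rng = np.random.default_rng(seed)
    u = rng.beta(0.35, 0.35, size=(n_strings, n_trials, n_factors))
    return 2.0 * u - 1.0

designs = initial_designs(n_strings=6, n_trials=9, n_factors=2, seed=0)
print(designs.shape)          # (6, 9, 2)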
In the second and subsequent iterations, the N fittest strings from the earlier iteration are selected, N mutated copies of these are made, and M strings which result from their recombination are generated. These 2N + M strings are evaluated and ranked according to their fitness and the N fittest
strings are kept. This type of selection leads to what is known as an elitist algorithm. It ensures that the fittest strings are preserved from one iteration to the next and removes the possibility that all strings found in iteration i + 1 are poorer than the fittest string found in iteration i. Other methods of selection, such as selection with probability proportional to fitness, may result in the loss of the fittest strings, as there is a positive probability that any one string could be lost.

2.3 Recombination
Recombination when applied to strings with binary coding is usually per-
formed by single or multi-point crossover. Single point cross-over is used in
this application because of its simplicity and ease of execution. This is done
by sampling, without replacement, a pair of strings with probability propor-
tional to their fitness. A point is randomly chosen and each string is divided
into two segments. The strings then swap their segments and a new pair of
strings is created. In this way, strings with high fitness are paired with each
other and exchange sub-strings. Those that inherit segments which result in
high fitness (also called building blocks) are kept for the next iteration.
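A sketch of single-point crossover on bit-strings (illustrative, not the authors' implementation): a cut point is chosen at random and the two parents exchange their tail segments.

import random

def single_point_crossover(parent1, parent2, rng=random):
    # Swap the segments of two equally long bit-strings after a random cut point.
    assert len(parent1) == len(parent2)
    cut = rng.randrange(1, len(parent1))          # cut point inside the string
    child1 = parent1[:cut] + parent2[cut:]
    child2 = parent2[:cut] + parent1[cut:]
    return child1, child2

print(single_point_crossover("00000000", "11111111"))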

2.4 Mutation
Mutation relocates the candidate solutions to some other points in the search
space. Although it is common to use mutation with low probability so as not
to destroy highly fit strings and prolong the computation times, we always
apply mutation with probability Pm = 1. The reason for mutating in this
way is that copies of the strings are made prior to mutating them so that
strings are not lost because of mutation. Also a ranking selection which
results in the elitist algorithm is used. This algorithm implements mutation
by switching one randomly selected bit per string. The inversion operator is
a generalization of the mutation operator. Whereas the mutation operator
switches one bit per string, the inversion operator flips a whole string segment.
The start and end positions for the inversion are randomly decided. Inversion
is used when there is no improvement in fitness in at least one iteration.
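The two operators can be sketched as follows (illustrative code, not from the paper): mutation switches one randomly chosen bit, and inversion, as defined above, switches every bit in a randomly chosen segment.

import random

def mutate(s, rng=random):
    # Switch one randomly selected bit of the string.
    i = rng.randrange(len(s))
    return s[:i] + ('1' if s[i] == '0' else '0') + s[i + 1:]

def invert(s, rng=random):
    # Switch every bit in a randomly chosen segment of the string.
    i, j = sorted(rng.sample(range(len(s) + 1), 2))
    flipped = ''.join('1' if b == '0' else '0' for b in s[i:j])
    return s[:i] + flipped + s[j:]

print(mutate("00000000"), invert("00000000"))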
The GA search process is thus iterative: evaluation, selection and recombination using the basic operators (selection, cross-over and mutation) until some termination condition is met. The basic algorithm is given by the pseudo code below.
If s(i) is the set of strings processed by the GA at iteration i and f is the
objective function then,

i = 0; initialize s(i);
evaluate s(i);
do while (termination condition is not met);
    select s(i + 1) from s(i);
    recombine s(i + 1);
    mutate s(i + 1);
    evaluate s(i + 1);
    i = i + 1;
end;

3 Local search methods


Local search is a strategy of searching a neighborhood until a gradient is found, moving along the gradient, then updating the starting point and generating a new neighborhood. We will examine two local search methods. The first method is local improvement on the genetic algorithm using a modified version of the G-bit improvement (Goldberg [9]). The G-bit improvement is implemented in the following manner.

1. Select the fittest string which the genetic algorithm generates.
2. Sweep the string bit by bit, evaluating the fitness of every string that results from one-bit switches. If a bit change results in a violation of any of the constraints then discard the string.
3. When a string is found that has a better fitness than the first (starting) string, replace the starting string with the fitter string.
4. Repeat the process until no further improvement is made after sweeping through the fittest string.

An objective function evaluation is required for every switch, which makes the method somewhat slow. The method is therefore most useful when the genetic algorithm converges to a point on the search grid that is very close to the optimum and there is a steep gradient between the two points. This method is only used on the fittest string found after the termination condition has been met by the GA.
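A sketch of the G-bit sweep described above (assumed implementation; fitness and feasible are placeholder functions supplied by the caller): every one-bit switch of the current best string is evaluated, and the sweep is repeated until no switch improves the fitness.

def g_bit_improvement(best, fitness, feasible):
    # Repeatedly sweep the fittest bit-string; accept any one-bit switch that is
    # feasible and improves fitness, until a full sweep yields no improvement.
    improved = True
    while improved:
        improved = False
        for i in range(len(best)):
            candidate = best[:i] + ('1' if best[i] == '0' else '0') + best[i + 1:]
            if feasible(candidate) and fitness(candidate) > fitness(best):
                best = candidate
                improved = True
    return best

# Toy usage: maximize the number of ones, with all strings feasible.
print(g_bit_improvement("0101", fitness=lambda s: s.count("1"), feasible=lambda s: True))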
Because of the difficulty of computing the directional derivatives of poorly characterized functions, we use methods that do not require differentiability. Local search is traditionally done using greedy algorithms such as those of Lawler [13] and Syslo et al. [19]. We implement local search by a combination of Powell's method and Brent line optimization as given in Press et al. [16].
Powell's method is given below. Readers interested in the technical details are referred to Numerical Recipes in C, available on-line at www.library.cornell.edu/nr. The algorithm establishes the direction along which the optimization takes place and then the Brent line optimization is used iteratively. Because minimization and maximization are trivially related, we consider the optimization problem as the minimization of a function f without loss of generality.
The algorithm begins by initializing the direction set to the basis vectors of the n-dimensional space, i.e.,
u_i = e_i,  i = 1, ..., n.

1. Save the starting position as P_0.
2. For i = 1, ..., n, move P_{i-1} to a minimum of f along the direction u_i and call this point P_i.
3. Set u_{n+1} <- P_n - P_0.
4. Move P_n to a minimum along the direction u_{n+1} and call this point P_0.
5. Set u_k <- u_{n+1}, 1 <= k <= n, where k is the index where the objective function made its greatest decrease.

In addition to the design region itself, which is a constrained space in R^n, it is not unusual to encounter constraints in design problems. The main difficulty when using these methods in constrained spaces is that the direction set degenerates to vectors of null norm at the edges of the search space. Bracketing minima may be impossible because one of the points needed to bracket the minimum may not be within the limits of the constraints. To overcome these limitations we have modified the algorithm to re-initialize the direction set along the edges of the design region. When local search leads to a point that violates any constraint, a new search is initiated closer to the starting point.
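A readily available substitute for the Numerical Recipes routines is SciPy's derivative-free Powell minimizer. The sketch below is not the authors' implementation; it polishes a GA design by minimizing -log|X^T X|, with a simple penalty standing in for the constraint handling described above, and it assumes a model_matrix builder such as the one sketched earlier.

import numpy as np
from scipy.optimize import minimize

def neg_log_det(theta, n_trials, model_matrix):
    # Objective for local search: -log|X'X| of the design encoded in theta,
    # with a large penalty if any coordinate leaves the [-1, 1] design region.
    design = theta.reshape(n_trials, -1)
    if np.any(np.abs(design) > 1.0):
        return 1e10                                  # crude constraint penalty
    X = model_matrix(design)
    sign, logdet = np.linalg.slogdet(X.T @ X)
    return 1e10 if sign <= 0 else -logdet

def local_search(design, model_matrix):
    # Polish a design found by the GA with Powell's direction-set method.
    res = minimize(neg_log_det, design.ravel(),
                   args=(len(design), model_matrix), method="Powell")
    return res.x.reshape(design.shape)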

4 Examples
4.1 Response surface design in two factors
Box and Draper [4] analytically determined D-optimum designs for a second order response surface model in two factors using 6 to 9 design points. Exact D-efficient designs for their model are found using the hybrid algorithm and the genetic component of the hybrid algorithm used alone, for comparison as well as for validation and for testing the performance of the algorithms.
The second order response surface model in two factors is given by

y = β0 + β1 x1 + β2 x2 + β12 x1 x2 + β11 x1^2 + β22 x2^2 + ε.

The design region is given as X = [-1, 1]^2. Eight bits were used to encode each coordinate point and 6 strings were used to initialize the algorithm.
The hybrid and genetic algorithms were run 10,000 times and the average efficiencies of the resulting designs computed. Details of the performance and the average D-efficiencies of the designs found using the algorithms are summarized in Table 1.
The hybrid algorithm required an initial population of only 6 strings and 12 iterations to calculate exact D-efficient designs for the response surface model with 6 to 9 points. This indicates that the local search component of the hybrid algorithm was used to a large extent to find the designs that minimize |(X^T X)^{-1}|.

              Analytical         Hybrid Algorithm            Genetic Algorithm
 N    i       |(X^T X)^{-1}|     |(X^T X)^{-1}|    D-eff*    |(X^T X)^{-1}|    D-eff*
 6    12      3.7350E-3          3.7698E-3         99.84     3.6233E-2         68.47
 7    12      1.0196E-3          1.0790E-3         99.06     8.0309E-3         70.89
 8    12      4.2340E-4          4.5345E-4         98.86     2.9834E-3         72.22
 9    12      1.9290E-4          2.2701E-4         97.32     1.3027E-3         72.73

N = number of design points, i = number of iterations, D-eff* = average D-efficiency

Table 1: Comparison of the Hybrid Algorithm and the GA with the analytically calculated values for the response surface model.
Exact D-efficient designs are rarely found using analytical function optimization as shown above. When the design region is poorly characterized and/or constrained, it is usual practice to generate efficient designs using computerized algorithms. The next two examples are mixture designs with both linear and non-linear as well as single- and multi-component constraints imposed on their design regions.

4.2 Mixture experiment with quadratic constraints


This example is found in Atkinson & Donev [2, pp. 186-187]. Using a three-component mixture experiment, models have first been fitted to two responses, after which measurements are made on a third response, but only in the region where the other two responses have satisfactory values. The requirements that y1 ≥ c1 and y2 ≥ c2 for specified c1 and c2 lead to the following quadratic constraints:

-4.062 x1^2 + 2.962 x1 + x2 ≥ 0.6075
-1.174 x1^2 + 1.057 x1 + x2 ≥ 0.5019

The D-optimum continuous design for the second order canonical polynomial uses 6 support points with equal weight and is given in Atkinson & Donev [2]. We applied the hybrid and genetic algorithms to finding exact D-efficient designs for this problem using 12 design points and compared them to the designs found using 200 iterations of the MFA with a randomly generated set of 72 points that satisfy all the constraints. The candidate points were generated by one execution of the GA. The MFA used the value ε = 1.0E-7 as the smallest value that is considered to be non-zero when the search no longer yields an improved design.
The hybrid and GA were initialized using 6 strings (designs) assembled from the same set of candidate points that were used by the MFA. Each coordinate point was encoded using 16 bits, which gives a search grid step of size 1/2^16 = 1.52587E-5. This search grid is finer than that used for the previous example because the design region for this example is not as regular and symmetric. The termination condition was when 200 iterations had been completed, regardless of when the last improvement was made. The genetic and hybrid algorithms were run 10,000 times and the average efficiencies and times are shown in Table 2.

Algorithm            i      |(X^T X)^{-1}|    D-eff*    Time*(s)
Modified Fedorov     200    7.5698E5          100       0.5
Hybrid Algorithm     200    8.9377E5          97.29     22.33
Genetic Algorithm    200    1.6356E6          87.97     20.88

i = number of iterations, D-eff* = average D-efficiency, Time* = average time

Table 2: Results for the example in Section 4.2.

The results show that the combination of the GA and local search finds efficient designs in a relatively short time using few iterations, as seen from the optimized objective function value, even in the presence of non-linear constraints on the design region.

4.3 Resin vehicle characterization


Altekar and Scarlatti [1] designed an experiment to characterize gel vehicles for use in lithographic inks. A combination of a factorial and a mixture design was used to study the effects of varying the ratio of two resins and other formulation variables on the viscosity of the inks. Each formulation consisted of two resin solids, gelling agent, ink oil and alkyd varnish. The amount of alkyd varnish was fixed at 7% in each formulation and the ink oil was an inert variable used as a filler. The ratio of the two resins was varied as follows: Resin A/Resin B ratio 60/40, 50/50 and 40/60, coded as [-1, 0, 1] for the low, mid and high levels respectively.
In order to test the hybrid and genetic algorithms, the same mixture proportions were used to find D-efficient designs. To make the problem more challenging, the ratio of the solids was allowed to vary continuously between 6/10 and 10/6. Table 3 shows the constraints on the design region for the mixture experiment.
Because the resin solids, gelling agent and ink oil had to add up to 93%, the amount of ink oil was automatically restricted to 37-47.67%. In addition to the constraint that all the mixture proportions sum to unity, the following multi-component constraints are also imposed:

Resin solids: 0.45 ≤ x1 + x2 ≤ 0.55.

Ratio of solids: 6/10 ≤ x1/x2 ≤ 10/6.
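For illustration (not part of the paper), a candidate formulation can be screened against the single- and multi-component constraints of this example as follows; the component order x1, ..., x5 follows Table 3.

import numpy as np

LOWER = np.array([0.0000, 0.0000, 0.0033, 0.3700, 0.0700])
UPPER = np.array([0.5500, 0.5500, 0.0100, 0.4767, 0.0700])

def feasible(x, tol=1e-9):
    # True if the mixture x = (x1,...,x5) satisfies all constraints of Section 4.3.
    x = np.asarray(x, dtype=float)
    return (abs(x.sum() - 1.0) <= tol                         # proportions sum to unity
            and np.all(x >= LOWER - tol) and np.all(x <= UPPER + tol)
            and x[1] > 0.0                                    # Resin B strictly positive
            and 0.45 - tol <= x[0] + x[1] <= 0.55 + tol       # resin solids
            and 6 / 10 - tol <= x[0] / x[1] <= 10 / 6 + tol)  # ratio of solids

print(feasible([0.25, 0.25, 0.01, 0.42, 0.07]))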
Component              Minimum     Maximum
x1 - Resin A           0.0000      0.5500
x2 - Resin B           > 0.0000    0.5500
x3 - Gelling agent     0.0033      0.0100
x4 - Ink oil           0.3700      0.4767
x5 - Alkyd varnish     0.0700      0.0700

Table 3: Restrictions on the design region.

Algorithm            i      |(X^T X)^{-1}|    D-eff*    Time*(s)
Modified Fedorov     200    5.3805E26         79.02     1.32
Hybrid Algorithm     200    8.1824E25         100       31.84
Genetic Algorithm    200    1.0412E27         72.76     29.32

i = number of iterations, D-eff* = average D-efficiency, Time* = average time

Table 4: Results for the example in Section 4.3.

Let the solids be given by x_s = x1 + x2. The following model was considered for the purposes of evaluating the algorithm:

A 24-point design was generated using the hybrid algorithm. For comparison purposes, 200 iterations of the MFA with a candidate set of 144 points which satisfy all the constraints was used. The candidate set was again generated using the GA. The hybrid algorithm and the GA were later re-initialized using the same set of points assembled into 6 designs. Each coordinate point was coded using 16 bits and the termination condition was when 200 iterations had been completed. The GA and hybrid algorithm were run 10,000 times. Details of the average efficiencies and times are shown in Table 4.
Table 4 shows that the hybrid algorithm finds, on average, designs with higher relative efficiency than those found using the MFA for this problem. Whereas the MFA can only be as good as the quality of its candidate points, the hybrid algorithm generates new design points through local search, selection, and recombination. As a result, the hybrid algorithm arrives at efficient designs without the benefit of using a specific set of candidate points.

5 Conclusions
A hybrid algorit hm used t o find D- efficient designs for linear regression mod-
els is present ed in t his pap er. The genetic component of the hybrid algorit hm
allows for a high mutation pr obability without necessaril y prolonging the time
t o convergence. This is possible because mutated copies of the st rings are
re-inj ect ed into t he population of st rings during every iteration and only the
fittest strings are selected for the succeeding iterations. This greatly increases
the chances of escaping local optima when applied to poorly characterized
functions with many local extrema. Genetic algorithms are very efficient and
are designed to search large spaces. However, they require a large initial pop-
ulation of strings to work with and the resulting variation inevitably leads
to long computing times if the search domain is to be thoroughly explored.
Searching the neighborhood of each point and updating the population of
strings at every iteration of the GA with fitter strings that result from local
search leads to much faster convergence than using the GA alone. The hybrid
algorithm presented here therefore uses a small population of strings to search for efficient designs. It also requires relatively few iterations and, as a consequence, less computing time is required to find efficient designs. The computing times for the examples used in this paper are real times (not CPU times) when using a 2.0 GHz Pentium PC. It should be noted that although the hybrid algorithm provides designs that are as efficient as those obtained using the MFA, it is usually slower, depending on the number of candidate points supplied to the MFA, but it has a distinct advantage when the candidate set of points is not of high quality or is not available at all. This relieves the experimenter from having to start with some previous knowledge of the search domain.
The algorithm presented in this paper is coded in Pascal using Borland
Delphi version 4 and is available as a .exe file upon contacting the authors.
The application that runs the algorithm allows the user to customize all the GA and local search parameters and generates the design points, the design matrix, the information matrix and its eigenvalues, and the variance function plots, as well as records and a graphical history of the optimization process, among other things.

References
[1] Altekar M., Scarlatti A.N. (1997). Resin vehicle characterization using statistically designed experiments. Chemometrics and Intelligent Laboratory Systems 36, 207-211.
[2] Atkinson A.C., Donev A.N. (1992). Optimum experimental designs. Oxford: Oxford University Press.
[3] Ash H., Hedayat A. (1978). An introduction to design optimality with an overview of the literature. Comm. Statist. Theory Methods 7, 1259-1325.
[4] Box G.E.P., Draper N.R. (1971). Factorial designs, the |X'X| criterion and some related matters. Technometrics 13, 731-742.
[5] Broudiscou A., Leardi R., Phan-Tan-Luu R. (1996). Genetic algorithm as a tool for selection of D-optimal design. Chemometrics and Intelligent Laboratory Systems 35, 105-116.
[6] Cook R.D., Nachtsheim C.J. (1980). A comparison of algorithms for constructing exact D-optimum designs. Technometrics 22, 315-324.
[7] Dykstra O. (1971). The augmentation of experimental data to maximize |X'X|. Technometrics 13, 682-688.
[8] Govaerts B., Sanchez R.P. (1992). Construction of exact D optimal designs for linear regression models using genetic algorithms. Belgian Journal of Operations Research, Statistics and Computer Science 1-2, 153-174.
[9] Goldberg D.E. (1989). Genetic algorithms in search, optimization, and machine learning. Addison Wesley.
[10] Heredia-Langner A., Carlyle W.M., Montgomery D.C., Borror C.M., Runger G.C. (2003). Genetic algorithms for the construction of D-optimal designs. Journal of Quality Technology 35, 28-46.
[11] Johnson M.E., Nachtsheim C.J. (1983). Some guidelines for constructing exact D-optimal designs on convex design spaces. Technometrics 25, 271-277.
[12] Kiefer J., Wolfowitz J. (1959). Optimum designs in regression problems. Ann. Math. Statist. 30, 271-294.
[13] Lawler E.L. (1976). Combinatorial optimization: networks and matroids. New York: Holt, Rinehart and Winston.
[14] Mitchell T.J. (1974). An algorithm for the construction of D-optimal experimental designs. Technometrics 20, 211-220.
[15] Montepiedra G., Myers D., Yeh A.B. (1998). Application of genetic algorithms to the construction of exact D-optimal designs. Journal of Applied Statistics 6, 817-826.
[16] Press W.H., Teukolsky S.A., Vetterling W.T., Flannery B.P. (1992). Numerical recipes in C, second edition: the art of scientific computing. Cambridge: Cambridge University Press.
[17] Poland J.A., Mitterer K., Knodler A., Zell A. (2001). Genetic algorithms can improve the construction of D-optimal experimental designs. In: Mastorakis N. (Ed.), Advances in Fuzzy Systems and Evolutionary Computation, WSES 2001, 227-231.
[18] Silvey S.D. (1980). Optimum design. London: Chapman & Hall.
[19] Syslo M.M., Deo N., Kowalik J.S. (1983). Discrete optimization with Pascal programs. Englewood Cliffs, NJ: Prentice-Hall.
[20] Wynn H.P. (1970). The sequential generation of D-optimum experimental designs. Ann. Math. Statist. 41, 1655-1664.

Acknowledgement: The authors would like to thank Professor Hans Nyquist of Stockholm University for his review and useful suggestions on this paper.
Address: A. Aziz Ali, Clinical Information Management, AstraZeneca R&D Södertälje, S-151 85 Södertälje, Sweden
E-mail: Abdul.Aziz.Ali@astrazeneca.com

GEOMETRY OF LEARNING IN
MULTILAYER PERCEPTRONS
Shun-ichi Amari, Hyeyoung Park and Tomoko Ozeki
Key words: Learning, information geometry, singular statistical model, neural networks.
COMPSTAT 2004 section: Neural networks and machine learning.

Abstract: Neural networks provide a good model of learning from statistical data. The multilayer perceptron is regarded as a statistical model in which a nonlinear input-output relation is realized. The set of multilayer perceptrons forms a statistical manifold in which learning and estimation take place. This is a Riemannian manifold with the Fisher information metric. However, such a hierarchical model includes algebraic singularities at which the Fisher information matrix degenerates. This causes various difficulties in learning and statistical estimation. The present paper elucidates the structure of the singularities and how they influence the behavior of learning. The paper describes a new learning algorithm, named the natural gradient method, to overcome such difficulties. Various statistical problems in singular models are discussed, and the model selection criteria (AIC and MDL) are studied in this framework.

1 Introduction
The multilayer perceptron is a simple feedforward model of neural networks, which transforms input signals to output signals nonlinearly. It is a universal approximator in the sense that any nonlinear transformation is approximated sufficiently well by an adequate perceptron, if the number of hidden units is large.
In order to realize a good approximator, examples of input-output pairs are used. On-line learning receives a series of training examples one by one, and modifies the parameters of a perceptron each time one example is given. Usually old examples are then discarded. Batch learning keeps all the examples and modifies the parameters in a batch mode.
A multilayer perceptron is an old model of learning machines, and the error-correcting learning algorithm was established for simple perceptrons in the sixties. Amari proposed a gradient descent learning method for multilayer perceptrons [2], which was rediscovered later independently and became popular under the name of backpropagation [23].
We study the set of multilayer perceptrons of a fixed architecture, which include a number of modifiable parameters called connection weights and biases. The set forms a multi-dimensional manifold, where all these parameters play the role of an admissible coordinate system. Learning takes place in the manifold, drawing a trajectory.
It is important to study the geometrical structure of the manifold, which we call a neuromanifold. We will show by statistical considerations that the neuromanifold is Riemannian, with a metric specified by the Fisher information matrix [3]. Moreover, it has a pair of affine connections [4], but we do not discuss them in the present paper. The neuromanifold has singularities where the Fisher information (or the Riemannian metric) degenerates [5]. This is an interesting statistical model, because the conventional Cramér-Rao paradigm excludes such a model, assuming the existence and non-degeneracy of the Fisher information matrix as regularity conditions.
It is known that the convergence speed of a multilayer perceptron is usually very slow. This is caused by the Riemannian character, in particular by its degeneracy, because the conventional backprop learning method does not take the Riemannian nature into account. The state of a network is often attracted to singularities by the conventional algorithm and it takes a long time to get rid of them. The natural gradient learning algorithm was proposed to overcome this flaw; it takes the Riemannian gradient instead of the conventional gradient [3]. We show in the present paper the reasons why it works so well. We also explain an adaptive method of implementing the natural gradient [8]. In the case of the squared error criterion under Gaussian noise, the natural gradient algorithm coincides with the adaptive version of the Gauss-Newton method, but they differ in more general models (see [17]).
We finally study the dynamics of learning and the nature of singularities and explain the reason why learning trajectories are attracted to, and stay long in, a neighborhood of singularities. The statistical analysis of the behavior of estimators in a neighborhood of singularities is another important problem to be studied. We show that the conventional criteria of model selection such as AIC and MDL fail in this case.

2 Neuromanifold of multilayer perceptrons


Let us consider a multilayer perceptron of h hidden units and one output unit, which receives n-dimensional input signals x = (x1, ..., xn). A hidden unit, say the i-th unit, receives x and takes its weighted sum, resulting in the potential

u_i = w_i · x = Σ_j w_ij x_j.   (1)

Here w_i = (w_i1, ..., w_in) is the weight vector of the i-th unit, and we neglect the bias term for the sake of simplifying the notation. The unit calculates the nonlinear transform of the potential, φ(u), where the nonlinear function is

φ(u) = tanh u.   (2)

The final unit collects all the outputs of the hidden units, and its final output is their weighted sum, if no noise intervenes. We put

f(x, θ) = Σ_i v_i φ(w_i · x),   (3)
where we have summarized all the modifiable parameters in a large vector θ = (v_1, ..., v_m; w_1, ..., w_m). The final output of the perceptron is disturbed by noise, so that

y = f(x, θ) + ε,   (4)

where we assume that ε is Gaussian noise with mean 0 and variance 1. Therefore, its behavior is represented by the conditional probability of y given x,

p(y | x, θ) = c exp{ -(1/2)(y - f(x, θ))^2 },   (5)

or by the joint probability distribution of (x, y),

p(y, x; θ) = q(x) p(y | x, θ),   (6)

where q(x) is the probability distribution of the inputs x. The set of all the perceptrons is a manifold, called a neuromanifold M, where θ plays the role of the coordinate system. Each point of the neuromanifold corresponds to the probability distribution (5) or (6).
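To make equations (3)-(5) concrete, the following numpy sketch (not part of the paper) evaluates the perceptron output and the conditional density for an arbitrary parameter setting; the parameter shapes and names are assumptions of this illustration.

import numpy as np

def f(x, v, W):
    # Perceptron output f(x, theta) = sum_i v_i * tanh(w_i . x); W has one row per hidden unit.
    return v @ np.tanh(W @ x)

def cond_density(y, x, v, W):
    # Conditional density p(y | x, theta) for unit-variance Gaussian output noise, eq. (5).
    return np.exp(-0.5 * (y - f(x, v, W)) ** 2) / np.sqrt(2 * np.pi)

rng = np.random.default_rng(0)
v, W = rng.normal(size=2), rng.normal(size=(2, 3))   # 2 hidden units, 3 inputs
x = rng.normal(size=3)
print(f(x, v, W), cond_density(0.0, x, v, W))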

3 Fisher information matrix and the Riemannian metric


The Fisher information matrix G is given by

G(θ) = E[ ∇ log p(y, x; θ) ∇ log p(y, x; θ)^T ],   (7)

which is further calculated as

G(θ) = E[ ∇f(x, θ) ∇f(x, θ)^T ],   (8)

where E denotes expectation, ∇ = (∂/∂θ_i) is the gradient and T denotes the transpose of a vector. Let us define the square of the distance between two nearby perceptrons whose parameters are θ and θ + dθ. Information geometry gives the squared distance by the quadratic form

ds^2 = dθ^T G(θ) dθ.   (9)

This is the Riemannian metric, where the Fisher information matrix is used as the Riemannian metric tensor [19]. This is the only invariant metric that can be introduced in the manifold of probability distributions.
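Equation (8) suggests a simple Monte Carlo approximation of the Fisher information: average the outer products of the output gradients over a sample of inputs. A sketch (illustrative, not from the paper, using a numerical gradient for brevity):

import numpy as np

def fisher_information(theta, f, xs, eps=1e-6):
    # Monte Carlo estimate of G(theta) = E[ grad f  grad f^T ], eq. (8),
    # using central finite differences for the gradient of f(x, theta).
    d = len(theta)
    G = np.zeros((d, d))
    for x in xs:
        g = np.array([(f(x, theta + eps * e) - f(x, theta - eps * e)) / (2 * eps)
                      for e in np.eye(d)])
        G += np.outer(g, g)
    return G / len(xs)

# Toy usage with a one-hidden-unit perceptron f(x, theta) = v * tanh(w * x):
f = lambda x, th: th[0] * np.tanh(th[1] * x)
xs = np.random.default_rng(0).normal(size=200)
print(fisher_information(np.array([1.0, 0.5]), f, xs))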
Given a (large) number N of independently generated input-output pairs (x_1, y_1), ..., (x_N, y_N), the maximum likelihood estimator (or any other first-order efficient estimator) satisfies the Cramér-Rao bound. Hence, the distance is large when two perceptrons are well separated in the sense that their estimation can be done precisely. However, different from the ordinary statistical model, the neuromanifold includes points at which the Fisher information degenerates and its inverse diverges. This is related to the unidentifiability of network parameters.
4 Identifiability of perceptrons and singularity


The behavior of a perceptron is invariant under the following two operations [10]:

1. Change of the signs of v_i and w_i at the same time.
2. Permutation of the hidden units, which causes permutation of the weight vectors {w_i} and the output weights {v_i} at the same time.
This causes the following unidentifiability:

1. When v_i = 0 or w_i = 0, the behavior is the same whatever value w_i or v_i takes.
2. When w_i = w_j (or w_i = -w_j), the behavior is the same when

v_i + v_j = v_i' + v_j'   (10)

holds for two perceptrons {w_i, v_i} and {w_i, v_i'}.


We call the set

C = { θ : v_i = 0, or w_i = 0, or w_i = ±w_j for some i ≠ j }   (11)

the critical set, on which unidentifiability takes place. The Fisher information degenerates on the critical set, because the unidentifiability implies that the estimation error does not converge to 0 even when N goes to infinity. Hence the statistical model is non-regular, and the Riemannian metric is singular. See also [7], [8], [9], [15], [24], [27].
Let us introduce the equivalence relation ~, by which two perceptrons with different parameters are equivalent when their input-output behaviors are the same. Then the quotient set

M / ~   (12)

includes algebraic singularities, and dimensions are reduced on the critical set. The conventional theory of statistical estimation does not hold in a neighborhood of the singularities.

5 Natural gradient learning algorithm


Let

{(x_1, y_1), ..., (x_N, y_N)}   (13)

be the set of input-output examples, which we call the training set. Here we assume that the examples are generated independently by using the true perceptron whose parameters are given by θ_0.
Given the training set, we want to obtain the estimated parameters θ̂ which are closest to the true ones. The performance of the estimator θ̂ is
measured by the generalization error, which is the expectation of the squared error for a new example (x, y),

E[ (y - f(x, θ̂))^2 ].   (14)

The conventional on-line learning algorithm uses the gradient of the instantaneous error at time t,

∇e(x_t, y_t, θ_t) = (1/2) ∇{ y_t - f(x_t, θ_t) }^2,   (15)

to update the current parameters θ_t to the new ones,

θ_{t+1} = θ_t - η ∇e.   (16)
The gradient of a function is believed to be the stee pest direction of cha nge.
This is true only when the coordinate system () is orthonormal in a Euclidean
spa ce. The st eepest dir ection of e is given by
'\7 e = G- 1\7e (17)
in a Riemannian space, where G-1 is the inverse of th e Riemannian metric
matrix.
Amari [3] proposed to use the Riemannian gradient for learning,
$$\hat\theta_{t+1} = \hat\theta_t - \varepsilon\, G^{-1}\nabla e, \qquad (18)$$
which is called the natural gradient method. The natural gradient method is proved to give a Fisher-efficient estimator, even though examples are used only once when they are observed, and then discarded.
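
As a concrete illustration, the following minimal sketch (in Python with NumPy; it is not the authors' implementation) applies one natural gradient step of Eq. (18) to a two-hidden-unit perceptron f(x, θ) = v1 tanh(w1·x) + v2 tanh(w2·x). The Fisher matrix of Eq. (8) is approximated by a Monte Carlo average over inputs drawn from an assumed standard normal input distribution q(x), and a small ridge term is added because G becomes singular near the critical set.

import numpy as np

def grad_f(theta, x):
    """Gradient of f(x, theta) with respect to theta = (w1, w2, v1, v2)."""
    d = x.shape[0]
    w1, w2 = theta[:d], theta[d:2 * d]
    v1, v2 = theta[2 * d], theta[2 * d + 1]
    h1, h2 = np.tanh(x @ w1), np.tanh(x @ w2)
    return np.concatenate([v1 * (1 - h1 ** 2) * x,   # d f / d w1
                           v2 * (1 - h2 ** 2) * x,   # d f / d w2
                           [h1, h2]])                # d f / d v1, d f / d v2

def fisher(theta, d, n_mc=2000, rng=None):
    """Monte Carlo estimate of G(theta) = E[grad_f grad_f^T], Eq. (8)."""
    rng = rng or np.random.default_rng(0)
    G = np.zeros((theta.size, theta.size))
    for _ in range(n_mc):
        g = grad_f(theta, rng.standard_normal(d))
        G += np.outer(g, g)
    return G / n_mc

def natural_gradient_step(theta, x, y, eps=0.05, ridge=1e-6):
    """One step of Eq. (18): theta <- theta - eps * G^{-1} grad e."""
    d = x.shape[0]
    f = theta[2 * d] * np.tanh(x @ theta[:d]) + theta[2 * d + 1] * np.tanh(x @ theta[d:2 * d])
    grad_e = -(y - f) * grad_f(theta, x)               # gradient of (y - f)^2 / 2
    G = fisher(theta, d) + ridge * np.eye(theta.size)  # regularized near singularities
    return theta - eps * np.linalg.solve(G, grad_e)

In an on-line loop the step is simply theta = natural_gradient_step(theta, x_t, y_t) for each new example.
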
The performance of the natural gradient method is largely different from that of the conventional method when the Riemannian structure is very different from the Euclidean one. It will be seen that this is indeed the case with multilayer perceptrons, because they include singularities where the Riemannian metric degenerates.
It is known that the learning trajectory is often trapped in so-called plateaus, at which the parameters change very slowly and from which it takes a long time to escape. The statistical physical approach made it clear that the parameters are first attracted to the critical set of the neuromanifold, so that the set becomes a plateau of learning [25], [21], [18]. Rattray, Saad and Amari [20] analyzed the dynamics of the natural gradient learning method, and showed that it has an ideal characteristic for avoiding plateaus. See also [14].

6 Implementation of natural gradient - Adaptive natural gradient method

In order to implement the natural gradient method, one needs to use the inverse $G^{-1}$ of the Fisher information matrix. However, it is in general difficult to calculate the Fisher information matrix, because it uses the expectation with respect to the unknown distribution $q(x)$ of inputs. Moreover, it is computationally heavy to invert the matrix $G$ when the number of parameters is large.
Amari, Park and Fukumizu [8] proposed an adaptive method to obtain an estimate of the inverse of the Fisher matrix. It is an iterative method, and the estimate $\hat G^{-1}(\hat\theta_t)$ is calculated by
$$\hat G^{-1}(\hat\theta_t) = (1 + \varepsilon_t)\,\hat G^{-1}(\hat\theta_{t-1}) - \varepsilon_t\,\tilde f_t\,\tilde f_t^{\,T}, \qquad (19)$$
where $\tilde f_t = \hat G^{-1}(\hat\theta_{t-1})\,\nabla f(x_t, \hat\theta_t)$ and $\varepsilon_t$ is another learning constant which may depend on $t$. One should choose $\varepsilon$ and $\varepsilon_t$ carefully. By using this estimate $\hat G^{-1}(\hat\theta_t)$, we can obtain the update rule of the adaptive natural gradient method of the form
$$\hat\theta_{t+1} = \hat\theta_t - \varepsilon\,\hat G^{-1}(\hat\theta_t)\,\nabla e. \qquad (20)$$
Park, Amari and Fukumizu [17] generalized the idea to be applicable to more general cost functions.
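
A minimal sketch (again in Python; not the authors' code) of the adaptive scheme: the inverse Fisher matrix estimate of Eq. (19) is maintained by a rank-one correction, so neither the expectation over the unknown q(x) nor an explicit matrix inversion is needed, and Eq. (20) then uses it in place of the exact inverse. The function grad_f is the one from the previous sketch; the learning constants are illustrative values.

import numpy as np

def adaptive_natural_gradient_step(theta, G_inv, x, y, eps=0.05, eps_t=0.01):
    """One on-line step of Eqs. (19)-(20)."""
    g = grad_f(theta, x)                        # gradient of f at the new example
    u = G_inv @ g
    # Eq. (19): rank-one update of the running inverse-Fisher estimate
    G_inv = (1.0 + eps_t) * G_inv - eps_t * np.outer(u, u)
    # Eq. (20): parameter update using the estimated inverse Fisher matrix
    d = x.shape[0]
    f = theta[2 * d] * np.tanh(x @ theta[:d]) + theta[2 * d + 1] * np.tanh(x @ theta[d:2 * d])
    grad_e = -(y - f) * g
    theta = theta - eps * (G_inv @ grad_e)
    return theta, G_inv
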

7 Dynamics of learning in the neighborhood of the critical set

In order to see the dynamics of learning, let us consider the special case of perceptrons consisting of two hidden units. Let us consider the set $Q(w, v)$,

$$Q(w, v) = \{w_1 = w_2 = w,\; v_1 + v_2 = v\}, \qquad (21)$$

which is a part of the critical set. This corresponds to the set of all the perceptrons which have only one hidden unit, where the weight vector is $w$ and the output weight is $v$. Let the true parameters be $\theta_0 = \{w_1, w_2, v_1, v_2\}$, where $w_1 \neq w_2$ so that it needs two hidden units.
Let $\theta = (w, v)$ be the best perceptron with one hidden unit that approximates the input-output function $f(x, \theta_0)$ of the true perceptron. Then, all the perceptrons of two hidden units on the line

(22)

correspond to the best approximation by a one-hidden-unit perceptron. Let us transform the two weights as

(23)

Then, the derivative of $L(\theta)$ along the line is 0, because all the perceptrons are equivalent along the line. The derivatives in the directions of changing $w$ and $v$ are zero, because they are the best approximator. The derivative in the direction of $u$ is again 0, because the perceptron having $u$ is equivalent to that having $-u$, which is derived by exchanging the two hidden units. Hence the line forms critical points of the cost function. This implies that it is very difficult to get rid of it once the parameters are attracted to $Q(w, v)$.
Fukumizu and Amari [12] calculated the Hessian of $L$. When it is positive definite, the line is really attracting. When it includes negative eigenvalues, the state eventually escapes in these directions. They showed that, in some cases, a part of the line is really attracting in some region, while it is really a saddle having directions of escape (although the derivative is 0). In such a case, the perceptron is first truly attracted to the line, and stays close to the line, fluctuating around it because of random noise, until it finds the place from which it can escape from the line. This is clearly a plateau. This explains the plateau phenomenon. In order to show why the natural gradient works well, we need to evaluate the natural gradient in the neighborhood of the critical points. We can then prove that the natural gradient has a large magnitude in the neighborhood of the critical set, so that the plateau phenomena will disappear. Computer simulations confirm this observation.

8 Estimation and testing in the neighborhood of the critical set

The Fisher information matrix degenerates on the critical set. Therefore, the Cramér-Rao paradigm cannot be valid in the neighborhood of the critical set.
Let us consider the statistical test
$$H_0:\ \theta = \theta_0 \qquad (24)$$
against
$$H_1:\ \theta \neq \theta_0. \qquad (25)$$
The likelihood ratio statistic is given by

(26)

where $\hat\theta$ is the maximum likelihood estimator. When the true point $\theta_0$ is a regular point, that is, it is not in the critical region, the mle (maximum likelihood estimator) is asymptotically subject to the Gaussian distribution with mean 0 and the variance-covariance matrix $G^{-1}(\theta_0)/N$, where $N$ is the number of observations. In such a case, the log likelihood-ratio statistic is expanded in a Taylor series, giving

(27)

Hence it is subject to the $\chi^2$-distribution with degrees of freedom equal to the number $k$ of parameters. Its expectation is
$$E[\lambda] = \frac{k}{N}. \qquad (28)$$

However, when the true distribution $\theta_0$ lies on the critical set, the situation changes. The Fisher information matrix degenerates, and $G^{-1}$ diverges, so that the expansion is no longer valid. The expectation of the log likelihood-ratio statistic is asymptotically written as
$$E[\lambda] = \frac{c(N)\,k}{N}, \qquad (29)$$
where the term $c(N)$ takes various forms depending on the nature of the singularities. Fukumizu [11] showed that
$$c(N) = \log N \qquad (30)$$
in the case of multilayer perceptrons under a certain condition. In the case of the Gaussian mixture,
$$c(N) = \log\log N \qquad (31)$$
holds [13], [16].
Since the parameters are not identifiable, we cannot estimate the parameters when the true one is on the critical set. However, we can estimate its equivalence class, and consistency holds with the order of $\sqrt{N}$. When the true one is close to a singular point, the estimation of the parameters suffers from a similar difficulty. For fixed $N$, the variance of the estimators diverges in inverse proportion to the square of the distance from the critical set. We need a new framework to analyze such singular cases.

9 Bayesian estimator
The Bayesian estimator is used in many cases where an adequate prior distribution is assumed for the purpose of penalizing complex models based on data. It is empirically known that the Bayesian posterior distribution or its maximizer behaves well in the case of large scale neural networks. In such a case, one uses a non-zero smooth prior on the neuromanifold.
However, a smooth prior is not regular in the equivalence class $M$ of the neuromanifold, because a point in the equivalence class includes infinitely many equivalent parameters when it is a critical point. This implies that the Bayesian smooth prior is in favor of singular points (perceptrons with a smaller number of hidden units) with an infinitely large factor. Hence the Bayesian method works well in such a case to avoid overfitting. One may use a very large perceptron with a smooth Bayesian prior, and an adequate smaller model is selected.
The Bayesian estimator of singular models was studied by Watanabe [28], [29] by using the method of algebraic geometry, in particular Hironaka's theory of resolution of singularity and Sato's formula in the theory of algebraic analysis.

10 Model selection
In order to obtain an adequate model, one should select a good class of models based on data, that is, one should determine the number of hidden units. This is the problem of model selection. AIC, BIC and MDL have been widely used as criteria of model selection.
AIC [1] is the criterion to minimize the generalization error. The model that minimizes
$$\mathrm{AIC} = \text{training error} + \frac{k}{N} \qquad (32)$$
is selected by this criterion. This is derived from the asymptotic statistical analysis, where the mle $\hat\theta$ is subject to the Gaussian distribution asymptotically.
MDL [22] is the criterion to minimize the length of encoding the observed data by using a family of parametric models. It is given asymptotically by the minimizer of
$$\mathrm{MDL} = \text{training error} + \frac{\log N}{2N}\,k. \qquad (33)$$
The Bayesian criterion BIC [26] gives the same criterion as MDL.
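
For concreteness, here is a small sketch (not from the paper) of how the two criteria would be applied to choose the number of hidden units. The routine train_and_evaluate is hypothetical: it is assumed to return the training error and the number k of parameters for a perceptron with m hidden units fitted to N examples, and the AIC penalty follows Eq. (32) as written above.

import numpy as np

def select_model(train_and_evaluate, candidate_units, N):
    """Return the numbers of hidden units chosen by AIC and by MDL/BIC."""
    scores = {}
    for m in candidate_units:
        train_err, k = train_and_evaluate(m)
        aic = train_err + k / N                       # Eq. (32)
        mdl = train_err + (np.log(N) / (2 * N)) * k   # Eq. (33), identical to BIC
        scores[m] = (aic, mdl)
    best_aic = min(scores, key=lambda m: scores[m][0])
    best_mdl = min(scores, key=lambda m: scores[m][1])
    return best_aic, best_mdl, scores
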
However, in the case of multilayer perceptrons, the neuromanifold of perceptrons with a smaller number of hidden units is included in that with a larger number, but the former is the critical set of the larger neuromanifold. Therefore, the maximum likelihood estimator (or any other efficient estimator) is no longer subject to the Gaussian distribution even asymptotically. Model selection is required when the estimator is close to the critical set, and hence the validity of AIC and MDL fails to hold. One should evaluate the log likelihood-ratio statistics more carefully in such a case [6].
Many computer simulations of applications of AIC and MDL have been reported. Sometimes AIC works better, while MDL does better in other cases. Such confusing reports seem to arise from the difference between regular and singular models and also from the different nature of the singularities.

11 Conclusions
Multilayer perceptrons are popular nonlinear models for nonlinear regression analysis of observed data. A class of perceptrons is specified by the number of hidden units, and a smaller class is included in a larger class. A class of multilayer perceptrons forms a manifold named the neuromanifold, where the modifiable parameters play the role of the coordinate system.
The neuromanifold is a Riemannian space, where the Fisher information matrix plays the role of the Riemannian metric. A remarkable point is that it is singular in the sense that the Riemannian metric degenerates on a subset of the manifold, in which the neuromanifolds of smaller numbers of hidden units are embedded.
We proposed the natural gradient learning method, which takes the Riemannian nature into account. It works well because it avoids the plateaus existing in the critical set corresponding to the neuromanifold of a smaller number of hidden units.
Conventional statistical analysis assumes the existence and non-degeneracy of the Fisher information matrix. However, in the case of multilayer perceptrons, as well as other similar hierarchical models, the singularity is unavoidable in its nature. The criteria of model selection such as AIC and MDL lose their validity under such circumstances.
The present paper reviews such aspects of learning in neural networks, which require a new statistical analysis of singular models. Geometry will be useful for this purpose.

References
[1] Akaike H. (1974). A new look at the statistical model identification. IEEE Trans. Automatic Control AC-19, 716-723.
[2] Amari S. (1965). Theory of adaptive pattern classifiers. IEEE Trans. Elect. Comput. EC-16, 299-307.
[3] Amari S. (1998). Natural gradient works efficiently in learning. Neural Computation 10, 251-276.
[4] Amari S., Nagaoka H. (2000). Information geometry. AMS and Oxford University Press, New York.
[5] Amari S., Ozeki T. (2001). Differential and algebraic geometry of multilayer perceptrons. IEICE Trans. E84-A, 31-38.
[6] Amari S. (2003). New consideration on criteria of model selection. Neural Networks and Soft Computing (Proceedings of the Sixth International Conference on Neural Networks and Soft Computing), L. Rutkowski and J. Kacprzyk (eds.), 25-30.
[7] Amari S., Ozeki T., Park H. (2003). Learning and inference in hierarchical models with singularities. Systems and Computers in Japan 34 (7), 701-708.
[8] Amari S., Park H., Fukumizu K. (2000). Adaptive method of realizing natural gradient learning for multilayer perceptrons. Neural Computation 12, 1399-1409.
[9] Amari S., Park H., Ozeki T. (2002). Geometrical singularities in the neuromanifold of multilayer perceptrons. Advances in Neural Information Processing Systems, T.G. Dietterich, S. Becker, and Z. Ghahramani (eds.) 14, 343-350.
[10] Chen A.M., Liu H., Hecht-Nielsen R. (1993). On the geometry of feedforward neural network error surfaces. Neural Computation 5, 910-927.
[11] Fukumizu K. (2003). Likelihood ratio of unidentifiable models and multilayer neural networks. The Annals of Statistics 31 (3), 833-851.
[12] Fukumizu K., Amari S. (2000). Local minima and plateaus in hierarchical structures of multilayer perceptrons. Neural Networks 13, 317-327.
[13] Hartigan J.A. (1985). A failure of likelihood asymptotics for normal mixtures. Proc. Berkeley Conf. in Honor of J. Neyman and J. Kiefer 2, 807-810.
[14] Inoue M., Park H., Okada M. (2003). On-line learning theory of soft committee machines with correlated hidden units - Steepest gradient descent and natural gradient descent -. J. Phys. Soc. Jpn 72 (4), 805-810.
[15] Kurkova V., Kainen P.C. (1994). Functionally equivalent feedforward neural networks. Neural Computation 6, 543-558.
[16] Liu X., Shao Y. (2003). Asymptotics for likelihood ratio tests under loss of identifiability. The Annals of Statistics 31 (3), 807-832.
[17] Park H., Amari S., Fukumizu K. (2000). Adaptive natural gradient learning algorithms for various stochastic models. Neural Networks 13, 755-764.
[18] Park H., Inoue M., Okada M. (2003). Learning dynamics of multilayer perceptrons with unidentifiable parameters. J. Phys. A: Math. Gen. 36 (47), 11753-11764.
[19] Rao C.R. (1945). Information and accuracy attainable in the estimation of statistical parameters. Bulletin of the Calcutta Mathematical Society 37, 81-91.
[20] Rattray M., Saad D., Amari S. (1998). Natural gradient descent for on-line learning. Physical Review Letters 81, 5461-5464.
[21] Riegler P., Biehl M. (1995). On-line backpropagation in two-layered neural networks. J. Phys. A: Math. Gen. 28, L507-L513.
[22] Rissanen J. (1978). Modelling by shortest data description. Automatica 14, 465-471.
[23] Rumelhart D.E., Hinton G.E., Williams R.J. (1986). Learning internal representations by error propagation. In D.E. Rumelhart, J.L. McClelland, and the PDP Research Group (eds.), Parallel distributed processing (Vol. 1, 318-362), Cambridge, MA: MIT Press.
[24] Ruger S.M., Ossen A. (1995). The metric of weight space. Neural Processing Letters 5, 63-72.
[25] Saad D., Solla S.A. (1995). On-line learning in soft committee machines. Phys. Rev. E 52, 4225-4243.
[26] Schwarz G. (1978). Estimating the dimension of a model. The Annals of Statistics 6, 461-464.
[27] Sussmann H.J. (1992). Uniqueness of the weights for minimal feedforward nets with a given input-output map. Neural Networks 5, 589-593.
[28] Watanabe S. (2001a). Algebraic analysis for non-identifiable learning machines. Neural Computation 13, 899-933.
[29] Watanabe S. (2001b). Algebraic geometrical methods for hierarchical learning machines. Neural Networks 14 (8), 1049-1060.

Address: S. Amari, T. Ozeki, RIKEN Brain Science Institute, 2-1 Hirosawa, Wako, Saitama, 351-0198, Japan
H. Park, Dept. of Computer Science, College of Natural Science, Kyungpook National University, Sangyuk-dong, Buk-gu, Daegu, 702-701, Korea
E-mail: {amari, tomoko}@brain.riken.jp, hypark@knu.ac.kr

VISUAL DATA MINING FOR QUANTIZED


SPATIAL DATA
Amy Braverman and Brian Kahn
Key words: Massive data sets, cluster analysis, multivariate visualization.
COMPSTAT 2004 section: Applications.

Abstract: In previous papers we've shown how a well known data compression algorithm called Entropy-constrained Vector Quantization (ECVQ; [3]) can be modified to reduce the size and complexity of very large, satellite data sets. In this paper, we discuss how to visualize and understand the content of such reduced data sets. We developed a Java tool to facilitate this using simple multivariate visualization and interactively performing further data reduction on user-selected spatial subsets. This enables analysts to compare reduced representations of the data for different regions and varying spatial resolutions. The ultimate aim is to explain physically observed differences, trends, patterns and anomalies in the data.

1 Introduction
This work came about because of challenges posed by NASA's Earth Observing System (EOS). EOS is a long-term data collection program for studying climate change, its consequences for life on Earth, and effects of human activities on it. The centerpieces of EOS are three satellites, Terra, Aqua and Aura. Terra and Aqua are already in orbit, and Aura is due for launch in 2004. Each carries a suite of instruments that collect massive amounts of observational data; so massive that it is difficult to take full advantage of them. Different instruments have different sampling strategies, resolutions, file naming conventions, and collect data about different physical processes. The information is provided to users in files corresponding to individual spacecraft orbits or parts of orbits, each of which can be very large, and must be stitched together properly to provide a global or even a regional picture. To make these data more accessible, NASA produces global summary data sets called Level 3 data products.
Traditionally, Level 3 products are simple maps of mean quantities and standard deviations at coarse spatial resolution, by month. In [2], we proposed methods for constructing nonparametric, multivariate distribution estimates to replace traditional maps. For instance, the Multi-angle Imaging SpectroRadiometer (MISR) aboard Terra collects data about clouds. A key goal is to better understand the spatial distribution of clouds since they have great influence on Earth's energy budget. The information MISR collects includes three variables seen at high resolution: scene albedo, height, and cloud presence indicator. Albedo is a measure of scene reflectivity measured roughly on a scale of zero to one. Scene height is measured in meters above the Earth's surface ellipsoid. The cloud indicator is a binary variable taking value one if the scene is cloudy, and zero otherwise. To summarize this information traditional Level 3 products are created by partitioning one month's data into spatial subsets corresponding to one degree latitude-longitude grid cells. Six maps are then produced: mean and standard deviation of albedo, mean and standard deviation of height, and mean and standard deviation of cloud indicator.
The Level 3 product we proposed regards each triplet of albedo, height and cloud indicator as a three-element vector, and uses ECVQ to cluster the data in each grid cell. We report a set of cluster representatives, the number
of original data points belonging to each cluster, and within-cluster mean
squared error, also called distortion. We call this a summary, or a compressed
or quantized version of the grid cell's data. Figure 1 illustrates. For one grid
cell it shows a three dimensional scatterplot of the original data in light
gray. Positions of cluster representatives are shown by the embedded balls,
and ball shading shows cluster population according to the color bar on the
right. Two key features of the summary are that i) cluster representatives are centroids of cluster members, and ii) data vectors must be assigned to clusters with the nearest (Euclidean distance) representatives. This ensures that the mean squared error between grid cell data points and their representatives is at least locally minimized, and that representatives and mean squared errors
resulting from aggregation to coarser resolutions will be properly preserved.
Details of the algorithm like the one used to produce these summaries can
be found in [1] .
Starting with a monthly summary of MISR cloud data at one degree resolution, our challenge is to discover and understand how relationships among
grid cell distributions change spatially, and over different resolutions. In
other words, instead of examining spatial patterns of average behavior and
variability only, we want to examine spatial patterns of other distributional
characteristics such as the number of modes, presence of outliers, and nonlin-
ear regressions. This requires interactively comparing summaries of different
grid cells, and of aggregated spatial areas. Thus, we want to quickly visual-
ize summaries, and construct summaries of summaries in hierarchical fashion.
The main subject of this paper is the Java tool L3View, written to facilitate
this.

2 L3View
The basic data structure underlying L3View is a 180 x 360 array of objects
called L3Cell's. An L3Cell contains a variable-length vector of Cluster ob-
jects, with the number of objects depending on grid cell data complexity.
A Cluster records a three-dimensional cluster representative, a cluster count,
and a within-cluster mean squared error. L3View presents a map of the
world, and when the user clicks on it with the mouse, L3View translates the mouse position into geographic coordinates. L3View opens a separate window, and displays a simple, multivariate visualization of the summary for the one degree grid cell at that location. Further, the user can select a subregion of the map with a rubberband box, and choose to summarize the summaries of all grid cells within the box. This too is shown in a new window using the simple multivariate display.

Figure 1: Three-dimensional scatterplot of MISR albedo, height and cloudiness data, in light gray, for a one degree grid cell in northern Oklahoma (southwest corner 38°N, 98°W) in March 2000. The embedded balls show the locations of cluster representatives. The ball colors show cluster populations using the gray-scale color bar on the right.

2.1 Main map and control panel


The left panel of Figure 2 is a screenshot of the main L3View control panel. L3View uses Java Swing components to interact with users. The image displayed is constructed from information in the grid cells' clusters and combined with a GIF file containing continental outlines using Java image processing functions. L3View knows the position of the mouse in a graphic coordinate space native to the underlying Java object type, JPanel. L3View has methods to convert back and forth between this coordinate system, the 180 x 360 grid, and latitude and longitude. Latitude and longitude are displayed interactively as the mouse is moved, and the tool knows when the mouse is clicked, dragged, or leaves the map area. Clicking on a grid cell spawns a GraphView window, which contains three graphics for visualizing the clusters representing that grid cell's data.

Figure 2: L3View main control panel showing MISR cloud fraction for March 2000.
If the mouse is used to isolate a rectangular geographic region with a rub-
berband box, L3View calculates the corresponding geographic and index lim-
its. These are subsequently used in two cases. First, if the Zoom button is
pushed, a new window containing a magnified image of the isolated area is
spawned. Second, if the Aggregate button is pushed, all clusters from all
grid cells inside the box are summarized, and the result is displayed in a new
GraphView window. The lambda text box accepts user specified values for
a parameter of the summarization algorithm that specifies how much data
reduction is applied. This is discussed in Section 3.
Finally, the Set Maximum and Set Split sliders are used to study spatial
patterns in the cumulative distribution function of the display variable. Set
Maximum truncates the upper end of the color scale so that all grid cells with
display values at or above the maximum display white. Set split is similar:
all values above the split value are displayed white, while all values below the
split value display in black.
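
The coordinate conversion L3View performs can be illustrated with a short sketch (written in Python here rather than the tool's Java; the row/column origin, row 0 at 90°N and column 0 at 180°W, is an assumption for illustration):

def latlon_to_grid(lat, lon):
    """Return the (row, col) of the one degree grid cell containing (lat, lon)."""
    row = min(int(90.0 - lat), 179)     # 90N .. 90S   -> rows    0 .. 179
    col = min(int(lon + 180.0), 359)    # 180W .. 180E -> columns 0 .. 359
    return row, col

print(latlon_to_grid(36.5, -97.5))      # a cell over northern Oklahoma -> (53, 82)
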

2.2 The GraphView window


The GraphView window is a simple, three panel multivariate visualization
of a set of clusters. A typical GraphView window is shown in Figure 3. It
includes two bar plots and a parallel coordinate plot. The bar plots are two
instances of the same class, instantiated to display cluster counts and mean
squared errors (distortions). Each has one bar per cluster, and bars are sorted
in order of increasing cluster count. Actual values of counts and distortions
relative to the norms of corresponding cluster representatives, are shown at the bars' left edges. Though not apparent in these black and white figures, bars are colored using a scheme that transitions smoothly from blue to red with increasing count.

Figure 3: A GraphView window showing the summary of MISR albedo, height and cloud indicator for the grid cell with southwest corner 36°N, 98°W over northern Oklahoma. A zoom-in view of the parallel coordinate plot legend is shown in the superimposed box.
The parallel coordinate plot occupies the right side of the window. Each
line plot shows the representative values of albedo, scene height, and cloudi-
ness for a single cluster on scales normalized using the global means and
standard deviations. These are shown at the bottom of the parallel coordi-
nate plot area. Lines are color coded to match bars in the other two panels
so users can see which representatives belong to which clusters. In addi-
tion, clicking on any bar or any line highlights the bars and line in all plots corresponding to that cluster.
GraphView windows are spawned to visualize a set of clusters, either for
a single grid cell or when a set of grid cells are to be summarized collectively.
In the latter case, one could simply display the entire collection of clusters,
but that would become more unwieldy for large areas as more clusters are
included. Complexity of the parallel coordinate plots could grow to the point where it is impossible to resolve individual lines. Therefore, distributions
represented by cluster sets must be summarized before they are displayed.
The next section describes the theoretical rationale for this.
66 Amy Braverman and Brian Kahn

3 Hierarchical aggregation and quantization


Braverman [2] described how entropy-constrained vector quantization (ECVQ; [3]) is modified to function as a data reduction tool for large, spatial data sets. The basic idea is to partition these data into one degree spatial subsets, and use ECVQ to cluster the subsets in a coordinated way. ECVQ is a randomized, iterative algorithm similar to K-means, except it minimizes the expected value of the penalized loss function,

$$L_\lambda(X, \alpha(X)) = \|X - q(X)\|^2 + \lambda\left[-\log\frac{N_{\alpha(X)}}{N}\right]. \qquad (1)$$

X represents a randomly drawn observation from the empirical distribution of the grid cell's data. $\alpha(X)$ is an integer that specifies the id number of the cluster to which X is assigned, and $q(X)$ is the corresponding cluster centroid. N is the total number of data points in the grid cell, $N_{\alpha(X)}$ is the number of data points assigned to the same cluster as X, and the logarithm is base two. $\lambda$ is a fixed parameter that specifies how important the second term on the right in Equation (1) is. For K-means, one must specify K, the number of clusters, a priori. For ECVQ, one must specify K, the maximum allowable number of clusters, and $\lambda$. The algorithm then determines the number of clusters and the assignment of data points to them. We added a final step in which each data point is subsequently reassigned to the cluster with the nearest Euclidean-distance representative, and the representatives updated again. This ensures cluster representatives are centroids of cluster members, and mean squared errors between data points and their representatives are minimized. In [1] we introduced a further modification of ECVQ in which X is a random variable having the distribution of $q(X)$ rather than the original empirical distribution of the data. In other words, we allow realizations to have unequal mass. That is precisely the situation in which we find ourselves when summarizing sets of clusters formed by combining multiple grid cells.
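
A minimal sketch (in Python, not the production implementation) of the penalized assignment and centroid update implied by Equation (1): each point is assigned to the cluster minimizing squared distance plus lambda times the code length -log2(N_k / N), and representatives are then recomputed as the centroids of their members.

import numpy as np

def ecvq_pass(X, reps, counts, lam):
    """One assignment/update pass. X: (N, d) data, reps: (K, d) representatives,
    counts: (K,) current cluster sizes, lam: the penalty weight lambda."""
    p = counts / counts.sum()
    sq_dist = ((X[:, None, :] - reps[None, :, :]) ** 2).sum(axis=2)
    cost = sq_dist + lam * (-np.log2(np.maximum(p, 1e-12)))[None, :]   # Eq. (1)
    assign = cost.argmin(axis=1)
    for k in range(reps.shape[0]):                 # centroid update
        members = X[assign == k]
        if len(members) > 0:
            reps[k] = members.mean(axis=0)
    return reps, np.bincount(assign, minlength=reps.shape[0]), assign
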
Consider Figure 4. It shows a schematic representation of a one degree spatial grid. Each grid cell contains a summary instantiated as an L3Cell object. Figure 4 also shows a two degree grid cell superimposed, and suppose we want to summarize the four one degree L3Cell's inside. Let $X_{1uv}$ be a random variable having the distribution of the summary for the one degree grid cell with southwest corner at row $u$ and column $v$. Suppose this grid cell is the lower-left most grid cell in the light box in Figure 4, and denote the other three grid cells' random variables by $X_{1(u+1)v}$, $X_{1u(v+1)}$, and $X_{1(u+1)(v+1)}$. At coarser, two degree resolution the light box is represented by $X_{2uv}$,

$$X_{2uv} = \sum_{i=0}^{1}\sum_{j=0}^{1} X_{1(u+i)(v+j)}\, I\!\left[V = v_{(u+i)(v+j)}\right],$$

with

$$P\!\left(V = v_{(u+i)(v+j)}\right) = \frac{N_{(u+i)(v+j)}}{\sum_{i=0}^{1}\sum_{j=0}^{1} N_{(u+i)(v+j)}},$$

where $N_{(u+i)(v+j)}$ is the total number of data points represented by the summary of the corresponding grid cell. In other words, $X_{2uv}$ is a mixture of $X_{1uv}$, $X_{1(u+1)v}$, $X_{1u(v+1)}$, and $X_{1(u+1)(v+1)}$ with weights equal to the proportions of the total count represented by $X_{2uv}$ contributed by each one degree cell. The idea is illustrated on the left side of Figure 5, which shows the mixture distribution positioned directly above the four component distributions. Any nesting of fine-scale grid cells in a coarser grid can be represented in a similar way, and ensures mass, expectation, and mean squared error are all properly preserved between resolutions.

Figure 4: Schematic representation of a gridded map. The large rectangle represents a 180 x 360 array shown broken into 3 x 6 = 18, 60 x 60 arrays. Each of these is further subdivided into a 6 x 6 array. Each cell in the 6 x 6 array is a 10 x 10 arrangement of one degree grid cells. The lighter box illustrates how four one degree grid cells can make up a grid cell at coarser, two degree resolution.
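
A minimal sketch (in Python rather than the tool's Java; the Summary container below is a stand-in for the L3Cell object) of how several one degree summaries pool into the coarse-resolution mixture: the union of their cluster representatives becomes the support, and each representative carries mass proportional to its cluster count, which is exactly the weighting above.

from dataclasses import dataclass
import numpy as np

@dataclass
class Summary:
    reps: np.ndarray    # (K, 3) cluster representatives (albedo, height, cloud)
    counts: np.ndarray  # (K,) number of original data points in each cluster

def aggregate(cells):
    """Pool the summaries of several grid cells into one mass-weighted mixture."""
    reps = np.vstack([c.reps for c in cells])
    counts = np.concatenate([c.counts for c in cells])
    weights = counts / counts.sum()    # mixture probabilities P(V = v_(u+i)(v+j))
    return reps, counts, weights
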
If data reduction were not a concern, we could proceed directly to visualizing mixture distributions like the middle layer in Figure 5. However, the greater the number of grid cells being aggregated, the greater the number of support points in the mixture, and the number of corresponding clusters. So, we compress the mixture distribution using a mass-weighted version of ECVQ described in [1], but implemented here in Java with the user specifying λ directly via the "Set Lambda" button and text box in the main control panel. K, the maximum number of clusters, is nominally set to 10, and the default value of λ is zero, thus essentially implementing K-means. If λ is changed to a positive value, the algorithm becomes ECVQ.

Figure 5: A hierarchy of distributions within a two degree spatial region. The bottom square on the right corresponds to the 2° x 2° area, and shows conceptual representations of cluster sets for constituent one degree grid cells as histograms. The middle layer on the right depicts the mixture distribution formed by the union of the cluster sets from the one degree cells. The top layer is the reduced distribution after summarization.

By first considering aggregated distributions for large areas, and then systematically summarizing subregions, we can begin to understand how the prevalence of various types of phenomena changes spatially. The next section demonstrates how this can be done.

4 Visual data mining


As an example of how a scientist might use L3View for data exploration, we focus on an area in central Africa shown in Figure 6. The rectangular region extends from latitude 1°S to latitude 9°N, and from longitude 11°E to 31°E. The background L3View image shows cloud fraction. There is a clear difference between the northern and southern parts of this region, approximately demarcated by the horizontal dashed line in the embedded, zoomed-in view. The southern area is very cloudy, and the northern area contains grid cells of varying cloudiness. This is consistent with the climatological location of a persistent band of clouds called the Inter-Tropical Convergence Zone (ITCZ). The lower panel of Figure 6 shows the GraphView window of the summary of the entire 10 x 20 degree region.

Figure 6: Screenshots from L3View visualization of central Africa. Top: Main


L3View map with embedded zoom of central Africa. Bottom: GraphView
window for the entire, aggregated central area. All the parallel coordinate
plots are annotated to show the cluster id numbers corresponding to the
individual line plots.

The region contains 200 grid cells, with a total of 2,099 clusters. These
represent 6,205,769 original MISR albedo-height-cloud indicator vectors. We
begin by aggregating the whole region using the default value of λ = 0 and
the number of clusters, K, set to 15. The resulting GraphView summary is
shown in the lower panel of Figure 6. The figure is small making the graph
labels difficult to see, but we can see from the bar chart of cluster counts
that one cluster dominates in size. Using L3View interactively, we find that
this is cluster 9, and it contains about 30 percent of the distribution's mass .
Cluster 9 corresponds to one of two clear clusters, 6 being the other. Cluster
6 accounts for another eight percent of the distribution's mass. 9 and 6 have
representatives with low albedo, low height, and are clear cloud indicators.

This is a dark, vegetated region of jungle. Areas to its north show significant
numbers of low altitude, bright, clear scenes. This is the Sahara desert.
The remaining clusters have cloudy representatives, and form three sub-
groups. Clusters 8 and 4 constitute a subgroup with low albedo and below
average height. Clusters 1, 2, and 7 form a second subgroup. Their heights
are nearly one standard deviation above the mean, but their albedos range
from nearly one standard deviation below to one standard deviation above
average. The final subgroup is characterized by very large heights, two stan-
dard deviations above the mean at least. They too show a range of albedos
similar to that of the second subgroup. These high clouds are likely the tops
of thunderstorms prevalent in central Africa at this time of year, and the
surrounding cloud formations. The first two subgroups are more mysterious.
Clusters 1, 2, and 7 could be low and mid-level cumulus and stratus clouds.
8 and 4 are possibly dust, clear land surface misclassified as cloud, or simply
dark, low clouds, as implied by the classification.
Noting the relatively sharp difference in cloud fractions between the northern and southern areas, we summarize them separately; the results are shown in Figure 7.
Signatures of southern region representatives look much like signatures for
the region as a whole . The north's representatives also look roughly like
those of the whole region except clusters similar to 0 and 4 are missing. The
absence of clusters similar to cluster 0 in the lower panel is encouraging,
since this cluster represents deep convective clouds. Corroborating sources
indicate these are in fact absent in this region at this time.
These distributional differences are summarized in Table 1. Not sur-
prisingly the joint distribution shows that the south is cloudier than the
north. The fact that the south is dominated by low clouds while the north is
dominated by mid-level clouds is less obvious but clear from the conditional
distribution.

Type               North    South    Total    North    South
Clear
Low cloud          0.060    0.171    0.231    0.273    0.442
Mid-level cloud    0.094    0.114    0.207    0.426    0.295
High cloud         0.066    0.101    0.168    0.301    0.263
Total cloudy       0.220    0.386    0.606    1.000    1.000
Total              0.550    0.450    1.000

Table 1: Joint and conditional distributions of cloud type/clear and location.


Columns 2, 3, and 4 show the full, joint distribution. Columns 5 and 6 show
the conditional distribution of cloud type given cloudy scene, and location.
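
The conditional columns of Table 1 follow from the joint columns by dividing each cloudy-type probability by the total cloudy mass for that location; the short check below (Python) recomputes them from the rounded joint entries, so last digits may differ slightly from the table.

import numpy as np

types = ["Low cloud", "Mid-level cloud", "High cloud"]
joint = np.array([[0.060, 0.171],    # rows: cloud type; columns: north, south
                  [0.094, 0.114],
                  [0.066, 0.101]])
conditional = joint / joint.sum(axis=0, keepdims=True)
for t, row in zip(types, conditional):
    print(f"{t:16s} north {row[0]:.3f}   south {row[1]:.3f}")
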

The presence of low, dark clouds in both the north and the south at this time of year is something of a surprise. To see if these clouds can be attributed to specific areas, we subdivided the north and south regions into east and west. We found no distributional differences related to the east-west division for either the north or the south. We then investigated areas along the prominent clear/cloudy boundary in Figure 6, and contrasted them to areas away from the boundary. We did this separately for east and west, but none of these visualizations revealed definitive distributional differences. We are therefore reasonably confident that Table 1 tells a complete story.

Figure 7: Top: GraphView window for the aggregated area in the northern half of the region (above the dashed line). Bottom: GraphView window for the aggregated area in the southern half of the region (below the dashed line).

5 Discussion
The example of t he pr evious section is a small scale, simple exa mple of one
way we t hink L3View may be useful for exploring spatial summaries of satel-
lite data. Guided by the background map in the main L3View window, we
focused on an area of interest, and hierarchically examined the data distribution summaries. We discovered that, in addition to the cloud fraction
differences apparent from the background image, there is also a difference in
the types of cloud present in the northern and southern parts of the region.
We will have to perform many more exercises like this one to gain confidence
that summarized data have enough detail to be scientifically useful, and to
gain experience interpreting physically what we see in L3View.
Two main computational issues are brought to light in this exercise. First, L3View's implementation of ECVQ/K-means is not fast enough to summarize large geographic regions in reasonable time. One would ideally like to summarize whole hemispheres in the same sort of hierarchical exploration performed here. Second, we have not yet made use of the tool's ability to summarize the same data for different values of λ. We would like to look
at data at different quantization resolutions as well as different spatial ones.
We believe there is important information in how distributions collapse as
greater data reduction is imposed. To achieve greater interactivity both these
issues must be addressed. We look forward to working on these and other
improvements to L3View as it matures. We also eagerly anticipate working
with our geoscience colleagues to better understand the connection between
global physical processes and their expression through rich, Earth Observing
System data sets.

References
[1] Braverman Amy, Fetzer Eric, Eldering Annmarie, Nittel Silvia, Leung Kelvin (2003). Semi-streaming quantization for remote sensing data. Journal of Computational and Graphical Statistics 12, 4, 759-780.
[2] Braverman Amy (2002). Compressing massive geophysical datasets using vector quantization. Journal of Computational and Graphical Statistics 11, 1, 44-62.
[3] Chou P.A., Lookabaugh T., Gray R.M. (1989). Entropy-constrained vector quantization. IEEE Transactions on Acoustics, Speech, and Signal Processing 37, 31-42.

Acknowledgement: The authors would like to thank Eric Fetzer, Annmarie


Eldering and Barbara Gaitley for their helpful comments. This work was per-
formed at the Jet Propulsion Laboratory, California Institute of Technology,
under contract with the National Aeronautics and Space Administration.
Address: A. Braverman, Jet Propulsion Laboratory, California Institute of Technology, Mail Stop 169-237, 4800 Oak Grove Drive, Pasadena, CA 91109-8099
B. Kahn, Department of Atmospheric Science, UCLA, 405 Hilgard Avenue, Los Angeles, CA 90095
E-mail: Amy.Braverman@jpl.nasa.gov, kahn@atmos.ucla.edu.

GRAPHS FOR REPRESENTING


STATISTICS INDEXED BY NUCLEOTIDE
OR AMINO ACID SEQUENCES
Daniel B. Carr and Myong-Hee Sung
Key words: Letter sequences, graphics, layouts.
COMPSTAT 2004 section: Graphics.

Abstract: This paper develops coordinates and layouts for graphs that represent statistics indexed by repetitive letter sequences. The need for such graphics arises in a variety of applications. The examples in this paper concern sequences of nucleotides, such as AGTGGC, and sequences of amino acids.

1 Introduction
In contrast to maps that represent statistics indexed by geospatial coordinates, the development of graphics methodology for statistics indexed by repetitive letter sequences has been modest. One interesting exception is the sequence logo display [13] that can show a sequence of categorical frequencies. Statistical graphics methods for categorical data [7], [8] are relevant for relatively simple multivariate combinations but so far have seen little use in nucleotide and amino acid indexing examples.
Journal articles typically show short tables with one column giving the sequence of letters and one more column providing statistics. The rows are often sorted by one of the statistical columns. Both the one-dimensional linear ordering and the restriction to a modest number of rows reduce the opportunity to see patterns that may lead to new understanding. One-dimensional linear orderings produced by clustering, the first principal component, minimal spanning tree traversal, space filling curves or other methods do not exploit the human ability to see multivariate patterns based on 2-D and 3-D connectedness and proximity. Connectedness and proximity are among the most powerful of human perceptual grouping principles [18]. Thus this paper seeks to develop 2-D and 3-D coordinates for representing letter-indexed statistics.
The graphical design objectives include providing an overview along with interactive focusing and re-expression methods. For long sequences, combinatorics grow exponentially with sequence length and quickly lead to an overwhelming number of statistics. Overviews require substantial statistical summarization. The modest research here concerns developing representations for short sequences.
One approach not investigated here is the use of pixel oriented visualization [9]. It is possible to encode univariate statistics on all nucleotide sequences of length ten (4^10 = 1024 x 1024) in a pixel plot on a 1280 x 1024 monitor. Large high resolution prints will handle somewhat longer length sequences. Interactive pan and zoom methods can support layouts for longer sequences but showing all the values at once is problematic. The color of individual pixels is hard to identify with increased monitor resolution. The use of multiple monitors cannot keep up with the exponentially growing combinations.
Layout details are an issue. A convenient layout for a pixel plot may use lexicographic order for the first half (last half) of the sequence along the x-axis (y-axis). A following section on fractal coordinates provides another approach to layouts. In both cases indexing regularity helps keep the analyst oriented when interpreting the plots. However, indexing that is convenient for human memory may be poor at bringing out meaningful patterns. Maps often work well for showing geospatially-indexed statistics because geospatial attributes often have locally similar values. This applies to covariates as well as to the primary variables of interest. Proximity that reflects scientific relationships can be crucial to seeing meaningful patterns.
The layouts in this paper have limitations because they are primarily based on indexing regularity. However, the layouts provide some opportunities to rearrange letter order or axis placement either for perceptual simplification (such as reducing line crossings) or for incorporating physical/chemical properties (such as hydrophobicity) of the sequence constituents. Interestingly these two objectives can lead to the same display. Axis ordering problems are in general NP complete [1]. While many people prefer 2-D layouts, 3-D layouts not only allow better preservation of interpoint distances of higher dimensional points, they also provide more opportunities for arranging axes. Thus the layout options are not as restrictive as might be assumed at first glance.
This paper develops three approaches to constructing coordinates while mentioning some alternatives along the way. Three different data sets motivate the development of the coordinates. Section 2 describes self-similar coordinates at different scales. Section 3 concerns self-similar coordinates at the same scale with focus on a 3-D extension of parallel coordinates. The application shows cell statistics from a 4-D table. Section 4 illustrates the use of simple additive vector coordinates for showing all quadruples of amino acids (ignoring order). This approach can be useful despite some substantial overplotting problems. The section also hints at other 2-D layouts that avoid overplotting. Comments appear along the way about software and interactive tools used for rendering.

2 Self-similar coordinates, dimensionality, and fractals


The regularity of self-similar coordinates speeds learning and helps analysts devote short-term memory to other issues. A natural approach to developing multivariate coordinates encodes letters as integers and then uses Cartesian coordinate product sets with a coordinate for each position in the sequence. With A=1, C=2, G=3, and T=4 the sequence ATCG is located at (1, 4, 2, 3). The similar treatment of each position in the sequence and the same ordering of nucleotides for each axis motivates the description as a self-similar coordinate system. With coordinates in hand, multivariate glyphs can encode multivariate statistics associated with the sequence. The most immediate problem with this approach is that straightforward graphical representation of points is only available through three dimensions.

2.1 Rendering approaches and difficulties


There are many approaches to rendering multivariate data with more than three coordinates. As further mentioned in Section 4, nested coordinate plots provide one approach. Another approach encodes some coordinates as glyph features. For example, with four coordinates the ray angle of a stereo-ray glyph can encode the value for the fourth coordinate [3]. Ray length and color can encode more coordinates. Good perceptual accuracy of extraction for the angle encoding and modest use of "ink" make glyphs a good choice for revealing hyperplanes in 4-D data and other tasks. However, in the current context all 3-D coordinate glyphs lose self similarity when rendering more than three coordinates.
Before developing 3-D coordinates to represent sequences longer than three, brief comments about the limits and merits of 3-D graphics are appropriate. Many people prefer 2-D graphics to 3-D graphics. Common arguments for 2-D rendering are that people only see surfaces, that occlusion is a problem in 3-D, and that motion and/or binocular parallax depth cues are impossible or inconvenient to convey on a printed page. The position here is that many humans are endowed with the cognitive ability to see 3-D images based on motion and binocular parallax. They should be allowed to utilize this capability whenever it helps in dealing with difficult scientific challenges. Three dimensions provide a richer environment for conveying relationships and produce less distortion than 2-D and 1-D plots when scaling multivariate data into lower dimensions.

2.2 Weighted vector addition and fractals


Vector addition provides an enticing starting point for developing 3-D coordinates. For nucleotides, associate each letter with a vector from the origin to the vertices of a tetrahedron. Let A=(1,1,1), C=(1,-1,-1), G=(-1,1,-1), and T=(-1,-1,1). Then using vector addition for each letter in a sequence produces a 3-D coordinate for representing the sequence. However, vector addition is commutative, so all permutations of the same set of letters yield the same point. When the goal is to represent all hexamers (six letter sequences) the result should be 4^6 = 4096 distinct plotting points, but instead there are only 84 points, corresponding to the multinomial terms in (A + C + G + T)^6.

Figure 1: Fractal coordinates for nucleotide sequences. Sphere size (and color) show counts. Small rectangles show the plotting locations of low count spheres.

Weighted vector addition provides an approach that can produce unique points for plotting. Consider power of two weights 2^(6-i)/63 where i is the position along the sequence and the weights sum to 1. The sequence ACGTTC then maps into the point (.555, .270, .206).
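
A small sketch (in Python, not the GLISTEN implementation) of this weighted vector addition; it reproduces the hexamer example above up to rounding.

import numpy as np

VERTEX = {"A": (1, 1, 1), "C": (1, -1, -1), "G": (-1, 1, -1), "T": (-1, -1, 1)}

def fractal_coordinate(seq):
    """Map a nucleotide sequence of length n to a unique point in R^3."""
    n = len(seq)
    total = 2 ** n - 1                                   # 63 for hexamers
    weights = [2 ** (n - i) / total for i in range(1, n + 1)]
    point = np.zeros(3)
    for w, letter in zip(weights, seq):
        point += w * np.array(VERTEX[letter], dtype=float)
    return point

print(np.round(fractal_coordinate("ACGTTC"), 3))   # [0.556 0.27  0.206], i.e. (.555, .270, .206) up to rounding
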
Rendering and rotating reveals the coordinates creating a Sierpinski Gasket similar to one shown in Mandelbrot [11]. The self-similarity provides means of decoding the indexing of a point based on its location. A point in the large tetrahedron toward the C attractor has C as the first letter. If a point within this C tetrahedron is as close as possible to the T attractor, then all the remaining letters are T. A tetrahedron zooming widget can reveal the sequence of conditioning letters. Similarly, zooming can be controlled by entering the first letters of the sequence.
A delight occurs when rotating the gasket. In some orthographic projections the appearance is a square lattice. Construction using a pair of coordinates indicated earlier makes this clear. However, this is not intuitive from looking at the gasket in other views. The 2-D layout is also self-similar and extends immediately to the 10-mer by 10-mer pixel display mentioned earlier.

The motivation for Figure 1 was an early effort to find transcription regulation docking sites for the Stanford yeast genes [5]. The study clustered genes based on their expression levels. This produced groups of seemingly co-regulated genes. For the genes in a group, the 300 (nucleotide)-letter regions upstream of the protein-coding regions of the genes were scanned with a sliding window of length six. This produced the basic statistics on the occurrence frequencies of the different hexamers encountered. The sphere glyphs in the plot encode counts using size and color. A glance reveals that most of the higher count hexamers appear along the AAAAAA to TTTTTT edge. Relatively little was known about transcription regulation when the Stanford yeast data was first made available and the statistics for Figure 1 were produced. Today many transcription regulation sites of various lengths have been identified, and the regions as far as 800 nucleotides upstream are relevant for some genes. The plots could be improved by obtaining better data and by highlighting the hexamers known to be associated with transcription regulation.
Different kinds of software can produce figures similar to Figure 1. With a little work most standard statistical software can produce projected static views. Software that provides rotation, filtering and brushing, such as Xgobi and CrystalVision, provides better visualization environments. Efforts to produce multilayer 3-D visualization methodology similar to GIS software led to the development of software called GLISTEN (geometric letter-indexed statistical table encoding). GLISTEN supports point and path layers that are used in the graphs below.
Efforts to extend the fractal layout to amino acids were not very successful. One generalization used the 20 face centers of the icosahedron as attractors and adapted the weights so the clouds of points associated with each of the 20 attractors would be separated. Only three letter sequences were shown to restrict the view to 20^3 = 8000 points. The high density of points for the smallest scale icosahedra and the occlusion, partly due to more points, made this layout less desirable.

2.3 Connecting coordinates for longer sequences


Paths that connect points can represent longer letter sequences. For example, a path through three points in Figure 1 can represent nucleotide sequences 18 letters long. The use of paths is advantageous since occlusion problems do not grow too quickly. Experimenting with translucent triangles and tetrahedra for showing triples and quadruples was not successful due to the inability to see through much more than two layers.
Paths can encode statistics using path thickness and color. The path direction also needs to be encoded unless the sequence is explicit or intended to be reversible. There are limitations to this approach. If the data contains a large number of sequences, overplotting precludes providing an overview. Filtering widgets can then help to cope with very large databases. A second limitation is that a single coordinate system with each point representing a sequence of p letters does not accommodate sequence lengths that are not a multiple of p. A third issue that especially applies to fractal coordinates is that the apparent distance between paths is heavily influenced by the subset of coordinates receiving heavy weight. Fourth, there can be ambiguity when multiple paths go through the same point. Still, such displays can often turn up meaningful patterns that are otherwise missed.

Figure 2: Capless hemisphere coordinates. Sphere size shows counts from the 1-D table. Path thickness, color, and filtering enable focus on high counts from 2-D tables.

3 Parallel coordinates escape the plane


Parallel coordinates (PC) plots also provide self-similar representations. Analysts are increasingly using these plots to show multivariate data and to provide interactive input in a multivariate context. A limitation of parallel coordinates is the lack of a natural way to connect non-adjacent axes. Figure 2 shows a capless hemisphere coordinate system that partially addresses the problem. The coordinate system encodes the 20 natural amino acids as longitude and nine positions along a sequence as latitude. The gap creates perceptual groups that facilitate focusing on subsets. The path connecting 12 and 19 and the path between 12 and its neighbor 13 do not overlap due to the hemisphere curvature. The 3-D setting with curvature means the axes are no longer parallel, but there is a simple mapping to parallel coordinates.
The data motivating Figure 2 comes from a database of peptides [2], or

in this case amino acid sequences of length 9 known to bind to important immune molecules called HLA. This binding reaction is crucial in initiating the recognition by the human body of peptides from 'foreign' sources such as viruses or cancer. When a T-cell finds the peptide-HLA combination on the target cell surface, it activates coordinated processes for the purpose of clearing the infected cells. The immune system is an incredibly complex 'search-and-destroy' system. Autoimmune diseases are examples of false positives, where the immunity is mistakenly directed toward normal tissue. An example of an immune system false negative is the inability to detect and clear certain infections. Bioinformatic prediction of peptides from pathogenic proteomes such as HIV has emerged as a valuable tool in vaccine development and cancer immunotherapy [16].
The data used in the example concerns peptides binding to the HLA A-2 molecule (a specific form among many different genetic versions of this HLA). Most of the binding peptides listed were 9-mers. Typical statistics would be just the counts of amino acids for each of the nine positions. These can be represented by sequence logo displays or as a sequence of bar charts. The sphere size and color (when not shown in gray level) in Figure 2 convey this information just as effectively. The paths in Figure 2 encode counts from the nine choose two (36) two-way (20 x 20) tables. A filtering widget removed all but the highest count cells. The counts are encoded by the line color (when not shown in gray level) based on a color ramp. The layout also requires some sorting considerations. Putting the hydrophobic amino acids adjacent to each other reduces line crossing.
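To make the counting behind these displays concrete, the following Python sketch (not the authors' GLISTEN code) tabulates the C(9, 2) = 36 two-way 20 x 20 tables of amino acid pairs from a list of 9-mer peptides; the three example sequences are hypothetical stand-ins for entries of the peptide database.

```python
from itertools import combinations
from collections import Counter

# Hypothetical example 9-mers standing in for database entries.
peptides = ["LLFGYPVYV", "GILGFVFTL", "NLVPMVATV"]

# One Counter per pair of positions: C(9, 2) = 36 two-way tables.
pair_tables = {pos: Counter() for pos in combinations(range(9), 2)}
for pep in peptides:
    for (i, j) in combinations(range(9), 2):
        pair_tables[(i, j)][(pep[i], pep[j])] += 1

# The thick, filtered paths in Figure 2 correspond to the highest-count
# cells of these tables.
top_cells = {pos: tab.most_common(3) for pos, tab in pair_tables.items()}
```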
Since paths can have more than one line segment, the capless hemisphere
coordinate framework can also represent statistics from higher dimensional
tables. The three-segment paths in Figure 3 show high count (frequency)
cells from the nine choose four (126) four-way (20 x 20 x 20 x 20) tables (see
also [10]). The lowest count path shown goes through L2, A7, A8, and V9
as might be expected from the 1-D margin counts. There are large counts
for L2 and V9. The rest of the high frequency paths shown go through G4.
This is not apparent from the 1-D margin counts.

4 Additive vector coordinates and overplotting


When the ordering of letters in a sequence is not important, the additive
vector coordinate approach mentioned in Section 2 is more appropriate.
A 3-D tessellation application provides such data [15], [17] . The data arise
from tessellating the space of proteins based on the location of their backbone
Carbon alpha atoms. The Delaunay tessellation yields tetrahedra indexed by the associated amino acid residues at the vertices. The ordering of the amino acids for a tetrahedron is not considered important. There are (20 - 1 + 4) choose 4 = 8855 distinct tetrahedra (see [6] for a discussion of classical occupancy problems).
Figure 3: Paths with three segments show frequently occurring quadruples, i.e. high counts from 4-way tables.

Figure 4 provides an additive vector coordinate example with the vectors pointing to twenty points evenly spaced around a circle. Again point size and color encode the counts, and dynamic filtering has removed low count tetrahedra. High count patterns jump out. One is a circle of points corresponding to tetrahedra with three cysteines and one each of the twenty amino acids.
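A minimal sketch of the vector-addition layout, assuming unit vectors to twenty evenly spaced points on a circle (the amino acid ordering and scaling are illustrative choices, not necessarily those behind Figure 4):

```python
import numpy as np
from itertools import combinations_with_replacement

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")          # 20 one-letter codes
angles = 2 * np.pi * np.arange(20) / 20
unit = {a: np.array([np.cos(t), np.sin(t)]) for a, t in zip(AMINO_ACIDS, angles)}

def additive_coords(residues):
    """Plotting location of a tetrahedron: the sum of its residue vectors."""
    return sum(unit[r] for r in residues)

# All (20 - 1 + 4) choose 4 = 8855 order-free tetrahedra.
tetrahedra = list(combinations_with_replacement(AMINO_ACIDS, 4))
print(len(tetrahedra))                               # 8855

# Tetrahedra with three cysteines and one varying residue trace a circle:
# three vectors are fixed while the fourth sweeps around the circle.
ccc_circle = np.array([additive_coords(("C", "C", "C", x)) for x in AMINO_ACIDS])
```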
While Figure 4 reveals a lot of structure, there are at least three problems worth noting. First, some 2000 points are overplotted. This is partly related to the symmetric construction with equal angles between vectors. Second, zooming reveals many points that are too close together to be seen in an overview. Third, for the over 4000 points involving four distinct amino acids the connection between plotting location and the indexing is almost impossible to untangle without mouseovers. Figure 4 is mostly useful for points in an outer annulus of the circle.
There are several possibilities for alternative views. It is possible to show a statistic encoded by color in a casement display of 20^4 = 160000 points [12]. In this example the casement display is a 20 x 20 layout of 20 x 20 matrices. However it remains desirable to study plots with a factor of 18 fewer points. The 8855 points can be placed in a 4-D simplex. (See also pentagonal numbers [4].) Space prohibits showing a layout composed of two-dimensional slices of the simplex. There is also a layout in the plane for all tetrahedra with 2 or more of the same amino acid. While this layout involves duplicates, the regularity makes it easier to study. Such a layout can provide a starting point for drilling down to conditioned views of the 1-1-1-1 combinations.

Figure 4: Vector addition coordinates. Sphere size (and color) encode statistics for protein tetrahedra. Small rectangles show plotting locations of low count spheres.

5 Closing remarks

Just as map projections have been devised to serve different purposes, coordinate systems for encoding statistics can be developed to serve different purposes. A worthy goal is to develop coordinate systems with a regularity that minimizes memory burdens and helps analysts keep oriented with respect to the coordinates. A tension arises when one desires to show complex relationships faithfully in some abstract sense while keeping the relationships cognitively accessible. In many cases there are no easy answers and the graphics are a compromise. Still, analysts can make discoveries from imperfect graphs. It is worthwhile to consider graphics that lean toward cognitive accessibility while working toward incorporating as much scientific structure as possible. Accessible graphics enable analysts to look and, if they look, they have a chance to see.

References
[1] Ankerst M., Berchtold S., Keim D.A. (1998). Similarity clustering of dimensions for an enhanced visualization of multidimensional data. Proceedings IEEE Symposium on Information Visualization, IEEE Computer Society, Washington, 51-60.
[2] Brusic V., Rudy G., Harrison L.C. (1998). MHCPEP, a database of MHC-binding peptides: update 1997. Nucleic Acids Research 26 (1), 368-371.
[3] Carr D.B., Nicholson W.L. (1988). EXPLOR4: a program for exploring four-dimensional data. Dynamic Graphics for Statistics, W.S. Cleveland and M.E. McGill (eds.), Wadsworth, Belmont, California, 309-329.
[4] Conway J.H., Guy R.K. (1996). The book of numbers. Copernicus Books, Inc., New York.
[5] DeRisi J.L., Iyer V.R., Brown P.O. (1997). Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 278, 680-686.
[6] Feller W. (1968). An introduction to probability theory and its applications. Third Edition. John Wiley and Sons, New York.
[7] Friendly M. (1999). Extending mosaic displays: marginal, conditional, and partial views of categorical data. Journal of Computational and Graphical Statistics 8 (3), 373-395.
[8] Hoffman H. (2000). Exploring categorical data: interactive mosaic plots. Metrika 51, 11-26.
[9] Keim D.A. (1996). Pixel-oriented visualization techniques for exploring very large databases. Journal of Computational and Graphical Statistics, 58-77.
[10] Lee J.P., Carr D., Grinstein G., Kinney J., Saffer J. (2002). The next frontier for bio- and cheminformatics visualization. T.-M. Rhyne, Ed. IEEE Computer Graphics and Applications, 6-11.
[11] Mandelbrot B.B. (1983). The fractal geometry of nature. W.H. Freeman and Company.
[12] Munson P.J., Singh R.K. (1997). Statistical significance of hierarchical multi-body potentials based on Delaunay tessellation and their application in sequence-structure alignments. Protein Science 6, 198-201.
[13] Schneider T.D., Stephens R.M. (1990). Sequence logos: a new way to display consensus sequences. Nucleic Acids Research 18, 6097-6100.
[14] Segal M., Cummings R., Hubbard A. (2001). Relating amino acid sequences to phenotype: analysis of peptide binding data. Biometrics 57, 632-643.
[15] Singh R.K., Tropsha A., Vaisman I.I. (1996). Delaunay tessellation of proteins: four body nearest neighbor propensities of amino acid residues. J. Computational Biology 3 (2), 213-221.

[16] Sung M.-H., Simon R. (2004). Genome-wide conserved epitope profiles of HIV-1 predicted by biophysical properties of MHC binding peptides. J. Computational Biology 11 (1), 125-145.
[17] Vaisman I.I., Tropsha A., Zheng W. (1998). Compositional preferences in quadruplets of nearest neighbor residues in protein structures: statistical geometry analysis. Proceedings of the IEEE Symposia on Intelligence and Systems, 163-168.
[18] Ware C. (2000). Information visualization: perception for design. Morgan Kaufmann Publishers, New York.

Acknowledgement: The work was supported by NSF cooperative grant #9983461. Duoduo Liao and Yanling Liu implemented versions of GLISTEN.
Address: D.B. Carr, George Mason University, Dept. AES MS 4A7, Fairfax, VA 22030, USA
M.-H. Sung, Biometric Research Branch, National Cancer Institute, Room 8146, 6130 Executive Plaza, Rockville, MD 20852
E-mail: dcarr@gmu.edu, sungm@mail.nih.gov

MATRIX VISUALIZATION AND INFORMATION MINING
Chun-Houh Chen, Hai-Gwo Hwu, Wen-Jung Jang, Chiun-How Kao, Yin-Jing Tien, ShengLi Tzeng, Han-Ming Wu
Key words: Dimension free visualization, effect-ordered data display, generalized association plots, heat-map, matrix map, relativity of statistical graph, re-orderable matrix, sufficient statistical graph.
COMPSTAT 2004 section: Data visualization.

Abstract: Many statistical techniques, particularly multivariate methodologies, focus on extracting information from data and proximity matrices. Rather than rely solely on numerical characteristics, matrix visualization allows one to graphically reveal structure in a matrix. This article reviews the history of matrix visualization, then gives a more detailed description of its general framework, along with some extensions. Possible research directions in matrix visualization and information mining are sketched. Color versions of figures presented in this article, together with software packages, can be obtained from http://gap.stat.sinica.edu.tw/.

1 Introduction
The seminal work of Tukey [34] states a basic principle of Exploratory Data Analysis (EDA):

It is important to understand what you CAN DO before you learn to measure how WELL you seem to have DONE it.

In his concept of EDA, Tukey would allow the data to speak for themselves prior to adoption of any standard assumptions or formal modeling. Much of this preliminary work can be achieved with graphics-oriented tools - the box and whisker plot, the scatterplot, etc.
Many visualization techniques have now been developed to assist us in looking at data. Much of this literature has been devoted to dimension reduction: multidimensional scaling [11], projection pursuit [20], self-organizing maps [24], and sliced inverse regression [26]. These techniques are very useful for exploring data structure when the number of variables is of moderate size and when the structure is not too complex. Yet, with striking advances in computing, communication, and high-throughput biomedical instruments, the number of variables can easily reach tens of thousands, and the need for practical data analysis remains. Dimension reduction tools generally lose effectiveness when it comes to visual exploration of information structure embedded in very high dimensional data sets. On the other hand, matrix

visualization, integrated with computing, memory, and display, has great potential for visually exploring the structure that underlies massive and complex data sets.
A brief review of matrix visualization is provided in Section 2. Section 3 introduces the general framework of matrix visualization and some extensions of it appear in Section 4. An outline of possible research directions is sketched in Section 5 and there are concluding remarks in Section 6.

2 Matrix visualization (MV)


The technique of matrix visualization discussed in this article is deemed dimension-free [3]. The only limitation of size for a given data set is the resolution of a computer display or the size of the printing device used. Matrix visualization is not a new technique but the CPU speed, memory size, and display capability of modern computers give it a brand-new platform.
Given a data matrix X = [x_ij] of dimension n x p, the idea is to graphically present all numerical values x_ij in a matrix map. With small values of n and p, this map can be the numerical matrix itself. Bertin [1] considered re-orderable or permutation matrices, Hartigan [19] introduced the direct clustering of a data matrix, Lenstra [25] and Slagel et al. [33] linked the traveling-salesman problem and shortest spanning path to matrix reordering, Wegman [35] proposed the idea of a color histogram, while Minnotte and West [30] implemented the Data Image package. The Cluster and TreeView packages by Eisen et al. [12] are probably the most popular visualization packages because of their wide application to gene expression profiling for cDNA microarray experiments. Heatmaps with classification tree partitioning are commonly used to summarize stock market structure (http://www.smartmoney.com/marketmap/).
On the other hand, Carmichael and Sneath (taxometric maps [2]) and Ling [27] dealt with visualization of a proximity matrix D = [p_ij] of dimension n x n, whose elements are measures of the degree of relationship between pairs of a set of n objects. Murdoch and Chow [31] used elliptical glyphs to represent correlation matrices while Friendly [15] called them Corrgrams. Church and Helfman [9] developed Dotplot for exploring self-similarity in millions of lines of text and code. Chen [4], [5], [6] and Chang et al. [3] integrated visualization for data and proximity matrices into the framework of generalized association plots (GAP). We use matrix visualization (MV) to refer to all the aforementioned terminologies.
The basic principle of MV, then, is to effectively present a complete data or proximity matrix on a computer display or printout. We use the psychosis disorder data set described in Hwu et al. [22] to illustrate the framework of an MV analysis. Thirty-three Positive and Negative Syndrome Scales, PANSS [23], for one hundred and sixty-three schizophrenic patients were used in the original analysis. We randomly sampled 40 patients with 17 symptoms for our illustration. PANSS symptoms are rated on an ordinal scale from 1 (normal) to 7 (severe), but we treat them on a continuous scale for simplicity.

Psychiatrists [22] have addressed three fundamental issues: the grouping structure among the symptoms, the clustering structure of patients, and the general behavior of every patient-cluster in each symptom-group. These three issues are closely related to the three major pieces of information contained in any multivariate data set: the linkage amongst the n subject points in the p-dimensional space; the linkage between the p variable vectors in the n-dimensional space; and the interaction linkage between the sets of subjects and variables. Factor analysis and clustering related methods are commonly applied to answer the first two issues, but there is no general technique for studying the interaction effects for subjects and variables. With appropriate presentation (permutation and color/shape coding) and integration of the raw data and proximity matrices, MV can be used to effectively display all three pieces of information with many types of data formats and sampling schemes.

3 Four components of matrix visualization


Most existing MV methods deal with data and proximity matrices separately. Chen [6] integrated them in a framework of generalized association plots with four major components. We focus our introduction on these four components.

3.1 Presentation of raw data matrix and selection of proximity matrices
The first task of MV is to convert a numerical matrix into a matrix map with a color dot (symbol) representing each entry. Then each row vector, call it a patient's symptom profile, is converted into a horizontal color band and every column vector, a symptom's patient distribution, is replaced by a vertical color strip. The information in a numeric matrix is thus comprehensively displayed in a matrix map (Figure 1a).
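As a rough illustration (a sketch with simulated scores, not the GAP software itself), the conversion of a numerical matrix into a matrix map can be written in a few lines:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.integers(1, 8, size=(40, 17))      # simulated stand-in: 40 patients x 17 items

fig, ax = plt.subplots()
im = ax.imshow(X, aspect="auto", cmap="rainbow", vmin=1, vmax=7)
ax.set_xlabel("symptoms")
ax.set_ylabel("patients")
fig.colorbar(im, label="score")
plt.show()
```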

3.1.1 Color spectrum and variable transformation  The selection of a suitable color spectrum is crucial in an MV analysis [1], [13]. With the PANSS symptoms, we need only find a color spectrum capable of expressing their ordinal nature, a rainbow spectrum in Figure 1a for example. For measurements with a bi-directional structure, such as the logarithmic gene expression profiles used for cDNA microarray experiments, an integration of two monotonic color spectrums is needed (Figure 2).
For the PANSS example, the rainbow spectrum is the same for each symptom since they share the same scale. When the variable structure becomes more complicated, transformation of variables may be necessary before the MV can effectively present the data structure. In particular, in order to make simultaneous visualization of multiple variables in MV meaningful, it is essential to standardize variables (sometimes for subjects) with different scales. When outliers are present, the relationship between outlying observations and

Figure 1: Presentation of the PANSS data matrix with proximity matrices: (a) PANSS data map; (b) correlation map (scale -1 to 1); (c) distance map (scale Min. to Max.).

Figure 2: Bi-directional color spectrum for gene expression profile. The scale is log2(ratio of gene expression): (-) down regulated, (0) not differentially expressed, (+) up regulated.

the main body of the data set can exhaust the color spectrum and only the relative structure of outliers and the main body can be observed [3], [28]. A logarithm or similar transformation can be applied to variables or proximities to diminish the outlier effect. Transformation of variables (symptoms), also termed the column conditioned transformation, is commonly practiced. Row (patient) and matrix conditioned transformations are used from time to time.

3.1.2 Selection of proximity measures  The second important task is to identify appropriate measurements for representing between-variables and between-subjects association. The importance of this choice for proximity matrices is two-fold: it is used to directly assess the strength of variable-interaction with subject-relationship; it will be used to permute the raw data or proximity matrix. Suitable color spectrums are also needed to project

numerical proximity matrices to matrix maps. Euclidean distance is used in the psychosis disorder example to measure the patient-to-patient dissimilarity so a uni-directional grey-scale spectrum (Figure 1c) is adopted, while a bi-directional blue-white-red spectrum is applied to illustrate the between-symptom correlation coefficients (Figure 1b). Appropriate transformations of variables before computing proximities, and of proximity measurements directly, are necessary for both numerical and visual considerations.
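A simple sketch of this step, under the assumption that the rows of X hold patients and the columns hold symptoms (the array here is simulated, not the Hwu et al. data):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(1)
X = rng.integers(1, 8, size=(40, 17)).astype(float)

R = np.corrcoef(X, rowvar=False)               # 17 x 17 between-symptom correlations
D = squareform(pdist(X, metric="euclidean"))   # 40 x 40 patient-to-patient distances
```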

3.2 Seriation of proximity matrices and raw data matrix


Although Figure l a has already convert ed t hree num erical matrices int o MV
form at, not much informati on can be obtain ed from it since vari abl es and
subjects are randomly permuted in these matrices. In ord er for a statis t i-
cal graph (including MV) t o reveal struct ure embedded in the data being
displayed it is necessary t o place obj ects with similar (different ) properties
at closer (dist ant) positions in t he gra ph. Chen [6] called this concept t he
relativity of a st atist ical graph. The corresponding mechanism in an MV
display is t o identify the best seriations (permutations) for the two proximity
ma trices. Friendly and Kwan [16] used a similar t erm , effect -ordered dat a
display, for ord erin g informa t ion in genera l visual displays.

3.2.1 Robinson matrix  Criteria are necessary to identify "good" seriations (permutations) for a given matrix. Seriation is a data analytic tool for finding a permutation or ordering of a set of objects using a data matrix (symmetric or asymmetric). Hubert [21] and Marcotorchino [29] discussed the seriation problem from the aspect of problem setting, methodology and algorithms. Two major considerations in permuting a matrix are global pattern identification and local cluster formation, see [6] and [16] for more detailed discussions. The global and local criteria usually conflict with each other unless the embedded structure has a simple uni-dimensional pattern. One familiar global criterion is the Robinson form [32], [8], and [6]. A matrix is said to be a Robinson (anti-Robinson) matrix if the elements in its rows and columns do not increase (decrease) when moving horizontally or vertically away from the main diagonal (Figure 3a). A permuted Robinson matrix is a pre-Robinson matrix (Figure 3b & c). A Robinson matrix satisfies both the relativity condition [6] and the effect-ordered requirement [16] and it is optimized for global and local criteria. A Robinson proximity matrix may be obtained from a raw data matrix with a monotonic structure for variables and/or for subjects. For our PANSS data, this monotonic structure can be a positive-neutral-negative pattern. Robinson form takes relative positions of any two columns and rows into consideration, so it focuses more on global consideration than on local structure. Visualization is usually more globally focused to reflect our ability to perceive the information in a given graph and organize it into a global pattern. There is no practical algorithm which optimizes the Robinson criterion because of its computing complexity.
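Although the exact criterion used in GAP is more refined, a crude anti-Robinson check for a permuted dissimilarity matrix can be sketched as follows (a simple violation count, offered only as an illustration of the global criterion):

```python
import numpy as np

def anti_robinson_violations(D):
    """Count entries of a symmetric dissimilarity matrix D that decrease when
    moving away from the main diagonal within a row; a perfectly seriated
    anti-Robinson matrix scores 0."""
    n = len(D)
    violations = 0
    for i in range(n):
        right = D[i, i:]        # values moving right from the diagonal
        left = D[i, i::-1]      # values moving left from the diagonal
        violations += np.sum(np.diff(right) < 0) + np.sum(np.diff(left) < 0)
    return int(violations)
```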


Figure 3: Robinson (a) and pre-Robinson (b & c) matrices.

The Iris data [14] is used to compare the performance of seriations with
several commonly used sorting algorithms. The target proximity matrix is the
Euclidean distance matrix of the 150 iris flowers on four variables. Using the
convergence properties of a series of Pearson correlation matrices, Chen [6]
proposed an elliptical seriation which identifies permutations with very good
near-Robinson structure (Figure 4g & h).


Figure 4: Permuted Euclidean distance maps for Iris data with eight seriation
algorithms: (a) farthest insertion spanning; (b) nearest insertion spanning;
(c) single linkage tree; (d) complete linkage tree; (e) average linkage tree; (f)
GAP rank-one tree; (g) GAP rank-two ellipse; (h) GAP double ellipse.

3.2.2 Tree seriation  Most of the seriation algorithms try to optimize some local properties. Travelling-salesman [25] and minimal spanning path [33] algorithms orient toward local optimization. Figure 4a & b show the distance matrix of the Iris data sorted by two minimal spanning algorithms. The hierarchical cluster analysis with a tree-architecture (dendrogram) is the most popular permutation fulfilling criteria for local optimization [27], [12] (Figure 4c, d, & e). Relative positions of the terminal nodes in the final tree
grown are employed as the permutation to sort the input objects. The two
dendrograms in Figure 5 are generated from the correlation matrix for symp-
toms and the distance matrix for patients in Figure 1. The two proximity
matrix maps are permuted accordingly. The data matrix is two-way sorted
using the corresponding permutations. After the permutations, relativity for
both subjects and variables is satisfied. That is, patients with similar symp-
tom profiles are placed in closer rows while symptoms with comparable pa-
tient distributions correspond to columns nearby each other. Patient-clusters
and symptom-groups can now be easily identified using the sorted proximity
matrix maps with the tree-architectures. These pieces of information are usu-
ally summarized through factor analysis for variables and clustering analysis
for subjects. The two-way sorted data map in Figure 5a is actually a con-
densed version of two sets of scatter-plot matrix with C(17,2) = 136 and
C(40,2) = 780 pair-wise plots each for studying the interaction of patient-
clusters and symptom-groups.
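A minimal sketch of this two-way sort using standard hierarchical clustering (average linkage and the distance choices below are assumptions made for illustration, not necessarily the settings behind Figure 5):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import pdist

rng = np.random.default_rng(2)
X = rng.integers(1, 8, size=(40, 17)).astype(float)    # simulated patients x symptoms

row_order = leaves_list(linkage(pdist(X), method="average"))
col_order = leaves_list(linkage(pdist(X.T, metric="correlation"), method="average"))

X_sorted = X[np.ix_(row_order, col_order)]              # two-way sorted data map
```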
One fundamental problem arises in applying the dendrogram for matrix permutation. There are n - 1 intermediate nodes for a dendrogram with n terminal nodes. Each of these n - 1 intermediate nodes can be flipped independently, resulting in 2^(n-1) possible final permutations for a given dendrogram. Figure 6 shows the results of different flipping mechanisms for intermediate nodes applied to the same tree-architecture (dendrogram) for a given correlation coefficient matrix. As can be seen, different flipping mechanisms result in totally different visual perceptions and grouping effects. Figure 6a is the only permuted matrix map with perfect scores on both local and global criteria such as the minimum spanning and anti-Robinson scores [6]. It is also possible to use external and internal references [17], [36] for identifying desired flipping patterns.

3.3 Partitions of permuted matrix maps


The next step after the matrices have been permuted is to identify clusters in
the resulting maps. This is a constrained clustering problem since the vari-
ables and subjects are sorted and listed in a one-dimensional manner. The
goal becomes one of finding partitioning points on the two permutations. For
a matrix map sorted with a dendrogram, the dendrogram structure can help
in identifying suitable cutting points. Two purple lines are drawn in Fig-
ure 7b & c to partition the two dendrograms (and matrix maps) into three
symptom-groups coded in (red, green, and blue) and four patient-clusters

Figure 5: Proximity and raw data maps with dendrograms after permutation: (a) PANSS data map (scale 1 to 7); (b) correlation map; (c) distance map (scale Min. to Max.).

coded in (cyan, magenta, yellow, and grey). Without dendrograms, charac-


teristics in a permuted data matrix (map) and in proximity matrices (maps)
must be employed for identifying possible partitions. Chen [6] contains some
discussion on matrix partition using a convergent sequence of Pearson corre-
lation matrices.

3.4 Sufficient statistical graph


Symptoms and patients are partitioned into three groups and four clusters
in Figure 7. These two partitions thus cut the correlation map , the distance
map, and the raw data map into nine, sixteen, and twelve blocks accordingly.
Blocks for proximity maps can be categorized as within-group blocks on the
main diagonal and between-group blocks off the diagonal. Blocks for the
raw data map have different combinations of symptom-groups and patient-
clusters. Chen [6] proposed a concept of sufficient statistical graph for these
partitioned blocks. The purpose is to comprehensively and effectively sum-
marize information embedded in the raw data matrix and two proximity ma-
trices with a simplified version of MV. Individual values within a block are
replaced by a single summary statistic such as mean, median, or standard
deviation to represent the information for that particular block. Figure 8


Figure 6: Results of different intermediate-node flipping mechanisms applied to one tree-architecture (dendrogram) for a given proximity matrix.

displays the mean sufficient statistical graph for Figure 7. This presentation
clearly illustrates the within-groups strength and the between-group rela-
tionship for symptoms and patients. More importantly, the sufficient graph
for the raw data map effectively summarizes the interaction patterns of four
patient-clusters on three symptom-groups. These three mosaic-displays of
MV in Figure 8 can now easily reveal all three components of linkages for
a given multivariate data set.
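A sketch of the block summary behind such a sufficient statistical graph; the group label vectors are hypothetical inputs that would come from the partitioning step:

```python
import numpy as np

def block_means(X, row_groups, col_groups):
    """Replace each (patient-cluster, symptom-group) block of a sorted data
    map by its mean, giving one cell per block."""
    r_labels = np.unique(row_groups)
    c_labels = np.unique(col_groups)
    M = np.empty((len(r_labels), len(c_labels)))
    for a, r in enumerate(r_labels):
        for b, c in enumerate(c_labels):
            M[a, b] = X[np.ix_(row_groups == r, col_groups == c)].mean()
    return M

# Hypothetical usage: four patient-clusters and three symptom-groups
# would yield a 4 x 3 summary map.
# summary = block_means(X_sorted, patient_clusters, symptom_groups)
```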

4 Generalization and flexibility of matrix visualization


Matrix visualization is very flexible and can be easily generalized for various
purposes and situations. Two examples are illustrated in this section.

4.1 Sediment MV
Regular MV preserves the identity of each subject and variable; each dot in Figure 9a is the score of a specific symptom for a particular patient. It is
possible to ignore symptom identity and sort the symptom profile for each
patient according to severity. This results in the sediment MV for patients,
as seen in Figure 9b, to express severity structure. One could also omit
patients' identities and create the sediment MV for symptoms, as in Figure 9c.
This is a side-by-side bar-chart and box-plot which displays the distribution
structure for all symptoms simultaneously.
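The sediment displays amount to sorting the data matrix along one dimension at a time; a minimal sketch with simulated scores:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.integers(1, 8, size=(40, 17))             # simulated patients x symptoms

sediment_patients = np.sort(X, axis=1)[:, ::-1]   # per-patient sort by severity (Fig. 9b)
sediment_symptoms = np.sort(X, axis=0)            # per-symptom sort (Fig. 9c)
```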

4.2 Sectional MV
The goal of a sectional MV is to display only those numerical values that
satisfy certain conditions in the original MV display. Each sub-figure in
Figure 10 exhibits correlation coefficients with p-values smaller than certain

Figure 7: Partitions of permuted matrix maps with dendrograms: (a) PANSS data map (scale 1 to 7); (b) correlation map; (c) distance map (scale Min. to Max.).

significance levels for a Student t-test. Figures with smaller p-values preserve more significant correlation coefficients along the main diagonal, revealing the major (tight) symptom-groups, since the matrix maps have already been permuted.
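A rough sketch of a sectional correlation map (a plain Pearson test per pair; the exact test and layout used for Figure 10 may differ):

```python
import numpy as np
from scipy import stats

def sectional_correlation_map(X, alpha):
    """Keep only correlation coefficients whose two-sided p-value is below
    alpha; all other cells are masked with NaN."""
    n, p = X.shape
    R = np.full((p, p), np.nan)
    for i in range(p):
        for j in range(p):
            r, pval = stats.pearsonr(X[:, i], X[:, j])
            if pval < alpha:
                R[i, j] = r
    return R

# One map per threshold, as in Figure 10:
# maps = [sectional_correlation_map(X, a) for a in (1, 0.10, 0.05, 0.01, 0.005)]
```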

5 Future directions
Matrix visualization is not a new research field but there are still many topics
to be explored. All the available MV methods focus on seriation algorithms
or coloring (shading) schemes for a data or proximity matrix with entries
along a continuous scale. This is insufficient for exploring more complicated
information structures in the statistical modelling of longitudinal, categorical,
dependent or other complex data. We discuss several possible MV related
issues in this section.

5.1 Categorical data


Two major difficulties arise in applying MV to categorical data (especially
nominal): computing of proximities - given a categorical data matrix, it is
necessary to define the proximity measure so that the numerical version of

Figure 8: Sufficient statistical graph in GAP: (a) PANSS data map (scale 1 to 7); (b) correlation map (scale -1 to 1); (c) distance map (scale Min. to Max.).

relativity is valid; selection of color spectrum - categories sharing similar (different) subject-distributions should be assigned comparable (distinct) colors in order to satisfy the color version of the relativity concept. Chen [5] and Chang et al. [3] introduced a categorical version of GAP which can resolve these two difficulties.

5.2 Longitudinal multivariate data


The PANSS symptom profiles were collected when patients were admitted. There are also PANSS profiles collected at discharge and follow-ups. How to use a single or multiple MV displays to summarize and present the integration of structure for patients, symptoms, and time is a complicated and challenging task.

5.3 Multi-conditioned multivariate data


There are symptom tables other than PANSS in Hwu et al. [22] that can be analyzed simultaneously. The setup is similar to the longitudinal MV problem. Both data structures have identical sets of subjects across multiple time points or multiple tables. The longitudinal one has only one set of variables measured multiple times while the multi-conditioned one has various sets of variables. Statistical methods and theories from canonical correlation are discussed in Chen [5] and Chi [7] for a multi-conditioned version of GAP.


Figure 9: Sedimented MV for patients and symptoms.

Figure 10: Sectional MV for the permuted correlation coefficient map; the panels show coefficients with p-value < 1, < 0.10, < 0.05, < 0.01, and < 0.005.

5.4 MV with covariates adjustment


When effects of covariates such as gender or age are of concern in an MV
analysis, covariate adjustment has to be taken into consideration. When
gender acts as the covariate, it is not easy to create an MV display. One pos-
sibility is to decompose the correlation matrix into two component matrices,
one for between-group (gender) structure, one for within-group pattern. The
between-group correlation matrix can then be used to study the covariate
effects on the original correlation matrix.

5.5 MV with dependent variables


The MV problems discussed so far do not include dependent variables. MV
for a regression context with dependent variables is similar but not identi-

cal to MV with adjusting covariates. Sliced inverse regression by Li [26] is a natural starting point. The design matrix X can be row-wise sorted first by the magnitude of the dependent variable y, with or without slicing. The proximity for variables can be the original one or the sliced version. Finally the reduced variables can be treated as the raw data and fed to the regular MV environment. There are certainly many different kinds of MV that can be developed for various regression problems.
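A minimal sketch of the suggested preprocessing, with a simulated regression data set (the split into ten slices is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 6))
y = X @ rng.normal(size=6) + rng.normal(scale=0.5, size=100)

order = np.argsort(y)
X_sorted = X[order]                              # rows ordered by the response

slices = np.array_split(np.arange(100), 10)      # ten slices of nearly equal size
slice_means = np.vstack([X_sorted[s].mean(axis=0) for s in slices])
```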

5.6 Data with dependent (clustered) structure


When samples are collected with a dependent structure, such as familial data for genetics studies, two difficulties emerge. First, it is numerically difficult to define the proximities with multi-levels of relationship and two issues must be addressed: the definition of between-cluster distance or similarity; whether or not within-cluster structure should be preserved while computing the between-cluster relationship. Second, it is graphically hard to display proximity and data maps simultaneously for the individuals' relationship and the clusters' (families) structure. Generation of indexing variables for clusters may be necessary in forming MVs for data with dependent structure.

5.7 Mixed data


Categorical variables introduce difficulties in the computation of proximities and the selection of color spectrums. Problems get even more complicated when the variables collected are mixed with quantitative, ordinal, binary, and nominal types. General similarity coefficients proposed by Gower [18] and general weighted two-way dissimilarity coefficients introduced by Cox and Cox [10] may aid the calculation of proximity matrices for variables and subjects in constructing the MV display with mixed data. Color coding for a data matrix with mixed data is a more difficult task.

5.8 MV for huge data


When data size (variable or subject) exceeds the limitations of computers used, such as thousands of variables and millions of subjects, burdens may come from the CPU speed, computer memory and display. Parallel and distributed computation with PC clusters may speed up computation time. When hardware support is not available, procedures from sampling techniques, sequential analysis, smoothing methods, and image processing all can be of help in creating MV for studying structures of huge data sets.
There are many interesting research areas not yet mentioned - MV for colorblind people, MV for spatial data, and MV with missing observations are good examples.

6 Conclusion
MV tools are not created to replace existing mathematical or statistical procedures. Instead, they can be applied in advance to obtain a general picture of the information structure and build up confidence for choosing and using more rigorous and appropriate mathematical and statistical operations. Of course it is possible that a good MV display alone can answer all the questions a user has in mind and reveal more comprehensive understanding about a data set than formal mathematical operations and statistical modellings.

References
[1] Bertin J. (1967). Semiologie graphique. Paris: Editions Gauthier-Villars. English translation by William J. Berg as Semiology of Graphics: Diagrams, Networks, Maps. The University of Wisconsin Press, Madison, WI, 1983.
[2] Carmichael J., Sneath P. (1969). Taxometric maps. Systematic Zoology 18, 402-415.
[3] Chang S.C., Chen C.H., Chi Y.Y., Ouyoung C.W. (2002). Relativity and resolution for high dimensional information visualization with generalized association plots (GAP). Proceedings in Computational Statistics 2002 (Compstat 2002), Berlin, Germany, 55-66.
[4] Chen C.H. (1996). The properties and applications of the convergence of correlation matrices. In: 1996 Proceedings of the Section on Statistical Computing, 49-54, American Statistical Association.
[5] Chen C.H. (1999). Extensions of generalized association plots (GAP). In: 1999 Proceedings of the Section on Statistical Graphics, 111-116, American Statistical Association.
[6] Chen C.H. (2002). Generalized association plots: information visualization via iteratively generated correlation matrices. Statistica Sinica 12, 7-29.
[7] Chi Y.Y. (1999). Information visualization for comparing two sets of variables. Master Thesis. Division of Biomedical Statistics, Graduate Institute of Epidemiology, College of Public Health, National Taiwan University.
[8] Chepoi V., Fichet B. (1997). Recognition of Robinsonian dissimilarities. Journal of Classification 14, 311-325.
[9] Church K.W., Helfman J.I. (1993). Dotplot: a program for exploring self-similarity in millions of lines of text and code. Journal of Computational and Graphical Statistics 2, 153-174.
[10] Cox T.F., Cox M.A.A. (2000). A general weighted two-way dissimilarity coefficient. Journal of Classification 17, 101-121.
[11] Cox T.F., Cox M.A.A. (2001). Multidimensional scaling. 2nd ed. Chapman & Hall/CRC.

[12] Eisen M.B., Spellman P.T., Brown P.O., Botstein D. (1998). Cluster analysis and display of genome-wide expression patterns. Proc. Nat'l. Acad. Sci. U.S.A. 95, 14863-14868.
[13] Encarnacao J., Fruhauf M. (1994). Global information visualization: the visualization challenge for the 21st century. In: Scientific Visualization Advances and Challenges, L. Rosenblum et al. (eds), Academic Press.
[14] Fisher R.A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics 7, 179-188.
[15] Friendly M. (2002). Corrgrams: exploratory displays for correlation matrices. Amer. Statist. 56, 316-324.
[16] Friendly M., Kwan E. (2003). Effect ordering for data displays. Computational Statistics & Data Analysis 43, 509-539.
[17] Gale N., Halperin C.W., Costanzo C.M. (1984). Unclassed matrix shading and optimal ordering in hierarchical cluster analysis. J. Classification 1, 75-92.
[18] Gower J.C. (1971). A general coefficient of similarity and some of its properties. Biometrics 27, 857-874.
[19] Hartigan J.A. (1972). Direct clustering of a data matrix. Journal of the American Statistical Association 67, 123-129.
[20] Huber P.J. (1985). Projection pursuit. The Annals of Statistics 13, 435-475.
[21] Hubert L. (1976). Seriation using asymmetric proximity measures. British J. Math. Statist. Psych. 29, 32-52.
[22] Hwu H.G., Chen C.H., Hwang T.J., Liu C.M., Cheng J.J., Lin S.K., Liu S.K., Chen C.H., Chi Y.Y., Ouyoung C.W., Lin H.N., Chen W.J. (2002). Symptom patterns and subgrouping of schizophrenic patients: significance of negative symptoms assessed on admission. Schizophrenia Research 56, 105-119.
[23] Kay S.R., Fiszbein A., Opler L.A. (1987). The positive and negative syndrome scale (PANSS) for schizophrenia. Schizophr. Bull. 13, 261-276.
[24] Kohonen T. (1995). Self-organizing maps. Berlin, Heidelberg: Springer.
[25] Lenstra J.K. (1974). Clustering a data array and the traveling salesman problem. Operations Research 22, 413-414.
[26] Li K.C. (1991). Sliced inverse regression for dimension reduction (with discussion). Journal of the American Statistical Association 86, 316-342.
[27] Ling R.F. (1973). A computer generated aid for cluster analysis. Communications of the ACM 16, 355-361.
[28] Marchette D.J., Solka J.L. (2003). Using data images for outlier detection. Computational Statistics and Data Analysis 43, 541-552.
[29] Marcotorchino F. (1991). Seriation problems: an overview. Applied Stochastic Models and Data Analysis 7, 139-151.

[30] Minnotte M., West W. (1998). The data image: a tool for exploring high dimensional data sets. In: 1998 Proceedings of the ASA Section on Statistical Graphics, Dallas, Texas, 25-33.
[31] Murdoch D.J., Chow E.D. (1996). A graphical display of large correlation matrices. Statistical Computing 50, 178-180.
[32] Robinson W.S. (1951). A method for chronologically ordering archaeological deposits. American Antiquity 16, 293-301.
[33] Slagel J.R., Chang C.L., Heller S.R. (1975). A clustering and data reorganizing algorithm. IEEE Transactions on Systems, Man, and Cybernetics 5, 125-128.
[34] Tukey J.W. (1977). Exploratory Data Analysis. Addison-Wesley.
[35] Wegman E. (1990). Hyperdimensional data analysis using parallel coordinates. Journal of the American Statistical Association 85, 664-675.
[36] Ziv B.J., David K.G., Tommi S.J. (2001). Fast optimal leaf ordering for hierarchical clustering. Bioinformatics 17, S22-S29.

Acknowledgement: This work is partially supported by the National Science Council, Taiwan, R.O.C. (NSC-90-2118-M-001-030), and the National Health Research Institutes, Taiwan, R.O.C. (NHRI-GT-EX908825PP and NHRI-EX92-9113PP). The authors are grateful to Chen-Hsin Chen and Donald Ylvisaker for valuable suggestions. The first author also wants to express his gratitude to the Institute for Pure and Applied Mathematics (IPAM), UCLA, where part of the current work was carried out during his visit.
Address: C.-H. Chen, W.-J. Jang, C.-H. Kao, S. Tzeng, H.-M. Wu, Institute of Statistical Science, Academia Sinica, Taipei, Taiwan
H.-G. Hwu, Department of Psychiatry, National Taiwan University Hospital and College of Medicine, National Taiwan University, Taipei, Taiwan
Y.-J. Tien, Institute of Statistics, National Central University, Chung-Li, Taiwan
E-mail: cchen@stat.sinica.edu.tw

st-apps AND EMILEA-STAT: INTERACTIVE VISUALIZATIONS IN DESCRIPTIVE STATISTICS
Katharina Cramer, Udo Kamps, and Christian Zuckschwerdt
Key words: Applied statistics, descriptive statistics, interactive visualizations, multimedia, teaching and learning environment, web-based.
COMPSTAT 2004 section: Teaching statistics.

Abstract: Within the "New Media in Education Funding Programme" the German Federal Ministry of Education and Research (bmb+f) has supported the project e-stat to develop and to provide a multimedia, web-based, and interactive learning and teaching environment in applied statistics called EMILeA-stat. After briefly sketching the structure of EMILeA-stat, its scope and objectives, we focus on interactive visualizations in descriptive statistics as a specific and typical aspect of the system. Alternatively, the visualizations are available off-line as a graphical package called st-apps.

1 Introduction
Within the "New Media in Education Funding Programme" the German Federal Ministry of Education and Research (bmb+f) supports the project e-stat (project period April 2001 - June 2004) to develop and to provide a multimedia, web-based, and interactive learning and teaching environment in applied statistics called EMILeA-stat, which is a registered brand name. It is accessible via the internet (emilea-stat.uni-oldenburg.de).
The project was set up by 13 partners at that time working at seven German universities: Bonn, Berlin (Humboldt-University), Dortmund, Karlsruhe, Munster, Oldenburg (leading university), and Potsdam. Other universities are also involved in the test and evaluation phases of EMILeA-stat. The project is further supported by advisory partners and it cooperates with economic partners such as SPSS Software, Springer-Verlag, MD*Tech Method & Data Technologies (XploRe-Software), and AON Re. Including the group of associated partners who are providing additional content, about 70 people are co-working in developing and realizing EMILeA-stat at the present time. For more detail about the project we refer to its web page www.emilea.de.

2 The system EMILeA-stat


Statistical and quantitative thinking and acting have become fundamental skills in several branches of natural sciences, life sciences, social sciences,

economics, and engineering. Models, tools, and methods, which have been developed in statistics, are applied in modelling and data analysis, e.g., in business and industry, in order to obtain decision criteria and to gain more insight into structural correlations. Owing to these various applications and the necessity of using statistical methodology in so many fields, there have to be consequences for the processes of learning and teaching: Pupils, for example, should get to know elementary and application-oriented statistics. Therefore, statistics and data analysis, theoretically and practically, have to become part of teachers' studies at university and of in-service training courses. Moreover, students of many different disciplines with a statistics impact should be familiar with basic and advanced statistics. These goals gave the main impetus to develop EMILeA-stat

• as one system suitable for teaching statistics at schools, universities, and in further vocational training,

• as one system which supports supervised and self-directed learning, and

• as one system which is accessible anywhere, anytime, and for anyone.

The basic concept offers on the one hand the opportunity to tailor individual courses covering specific learning needs. On the other hand, EMILeA-stat serves as an intelligent statistical encyclopaedia.
Basic statistical contents are presented on three levels of abstraction in order to take into account that different types of users have - owing to their individual mathematical and theoretical backgrounds - different needs. If sensible, the contents are written on level

A (elementary level): presentation in a popular scientific way, assuming no or only a low previous (mathematical) knowledge,

B (basic level): like undergraduate courses in applied statistics for students, e.g., of economics, psychology, and social sciences, and

C (advanced level): containing deeper material and special topics within the broad field of statistics and applied probability.

Furthermore, user-oriented views and scenarios, which are near to real world applications, are integrated.
The following fields and subjects of quantitative methodology are or will be contained in EMILeA-stat: Descriptive and inductive statistics, exploratory data analysis, interactive statistics, graphical representations and methods, basic mathematics needed in statistics, probability theory, statistical methods in finance and insurance mathematics, modelling and prediction of data in financial markets, statistical methods in marketing, virtual productions and virtual company, experimental design, statistical quality management, and business games.

3 Interactive visualizations
The theoretical statistical content in EMILeA-stat is supplemented by interactive visualizations which are programmed as Java applets. By offering a variety of interactive options (for a detailed description see below) they support the learner in her/his learning process by offering the possibility to explore the explained method, to experiment with data, and to gain her/his own experience with the discussed topic. Due to the fact that many places where teaching takes place, e.g., universities or schools, still do not have access to the internet, these visualizations are not only part of the system but are also realized as an off-line graphical package called st-apps. A German version of this tool - including an additional textbook with explanations, instructions, proposals for the use in teaching, etc. - is available via the publishing company Springer. An English edition is planned.
In the following we give an overview of the visualizations included in st-apps and present this tool by giving some examples. Finally the differences between the on-line version and the graphical package are briefly sketched.

3.1 Types of visualizations


Due to their use the interactive visualizations are divided into two groups: The first one works like a simple statistical engine. It is used for analyzing one's own data and preparing presentations. The others are designed in the first place with didactical aspects in mind, in order to support the exploratory learning process for becoming familiar with the explained content or method. By offering many interactive elements the user is invited to experiment with and explore the presented content by her/his own activity. Examples are given below.

3.2 The range concerning descriptive statistics


Starting with elementary visualizations like bar charts (vertical and horizontal), pie and ring charts, line charts, box plots, stem-and-leaf plots, histograms, and plots of the empirical and the approximate empirical distribution function, traditional measures like mean, standard deviation, quantiles, etc. are considered as well as measures of relationships between measurement or categorical variables. Methods for the description of economic data like price index numbers and measures of concentration, e.g., Lorenz curves and Gini coefficients, are also illustrated by this kind of interactive illustration. The field of univariate approaches is completed by regression and time series analysis. Furthermore, applets with scatterplots and scatter matrices are realized. All the mentioned visualizations are available on-line as part of EMILeA-stat, while in st-apps the following subjects focusing on analyzing and presenting data are included:

Bar charts
Pie and ring charts
Line charts
Location parameters (mean and median, quantiles)
Scale parameters
Box plots
Empirical distribution function
Lorenz curves
Histograms and approximate empirical distribution function
Scatter plot and scatter matrix
Regressions (linear regression, quadratic regression)

3.3 The structure


Each interactive visualization consists of three parts: the plotting area, a table for the data, and the menu.


When opening a visualization in st-apps a given data set is loaded au-


tomatically into the table and presented in the plotting area.
The size of the diagram and the table can be changed manually. Further-
more, it is possible to close one of these two components, a facility which is,
e.g., useful for presentations.
The "user interface" - the aspects of interactivity in the plot and the
table or the menu - is standardized such that the frequent user should be
able to work with a new visualization easily. Instructions accompanying the
visualizations indicate the concept of learning by discovery enabled via the
available interactivity.
Owing to the fact that the menu is organized similarly to the menus of known standard software, we will not go into details about this part, while the plotting area and the table are described in more detail in the next paragraphs.

3.4 The diagram


Depending on the type of visualization different aspects of interactivity are
implemented. Some of them will be explained in the following.
The automatically loaded data set can be modified by adding new data
points. They can be given numerically by adding them in the table (see
below) or by clicking the axis with the right mouse button.


Moreover, existing data (points on the axis) can be moved to the right or
left with the left mouse button. The axes are automatically rescaled.

These options are included in many visualizations such as those concerning location and scale parameters, box plots, scatter plots, regressions, histograms and the approximate empirical distribution function.
Furthermore, there are interactive aspects which are matched only with specific visualizations. Three examples are given in the following:

Histogram  The histogram applet offers the most interactivity. Each bar can, for example, be split into two bars by clicking with the mouse into the respective bar. By shifting the endpoints of the bars, the width of the classes and possibly the number of classes change.

Fitting a straight line Concerning linear regression there is, e.g., one
visualization available where a straight line has to be fitted manually to the
data. The correct linear regression function obtained by least squares can
also be added for checking the manually fitted line.
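For reference, the least-squares line that the applet overlays can be computed directly; the data points below are made up for illustration:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

slope, intercept = np.polyfit(x, y, deg=1)   # least-squares fit of y = slope*x + intercept
print(f"y = {slope:.3f} * x + {intercept:.3f}")
```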

Lorenz curve The Lorenz curve is offered in two interactive versions. In


the applet shown in the illustration different market situations are modelled
by shifting the marked buttons.

3.5 The table


The part which includes the table is composed of three parts:

a drop-down menu for selecting a data set, the data table itself, and a set of buttons.
In some cases a fourth component with further information such as pa-
rameters or coefficients is realized.
In the drop-down-menu a selection of data sets suitable for visualization is
offered. If a data set is loaded, the accompanying table can, e.g., be modified
in the following ways:

Symbol | Action
(icon) | Add a column
(icon) | Delete the marked column(s)
(icon) | Add a row
(icon) | Delete the marked row(s)
(icon) | Shift the marked row(s) up
(icon) | Shift the marked row(s) down

An error in the table is indicated by a warning symbol, and a reset button restores the original data set. If a functionality makes no sense for the actual applet, the corresponding button does not appear.
Depending on the specific visualization, further buttons are offered. The button Korrelation, for example, calculates the correlation coefficient in the regression applets, while in the histogram applet a further button generates a histogram with equidistant classes. The already mentioned optional fourth part with information about the parameters or coefficients used is also inserted by pushing a button.
Some visualizations consist of two tables: one for the original data and one
for the frequency table. The latter one depends on the first one. Therefore,
it can only be modified as explained if the original data has been deleted.

3.6 Interactive visualizations available online


As already mentioned, a wider range of applications, such as price index numbers or visualizations, e.g., concerning time series analysis, are offered on-line in EMILeA-stat. Moreover, in contrast to the off-line tool st-apps, each interactive visualization is - similar to the theoretical content - available on

three levels of abst rac t ion. The elementary level A offers at least interactivity
whereas on level C (advanced) t he full rang e of functionality as described
is accessible. Concerning the data sets load ed, this level dep end ent design
means t hat the systems offers to a user working on level A only one data set
(given by the teacher) , while on level B she/he can choose between a wide
rang e of data sets . On level C ana lyzing own data is possible. In other words
the describ ed "user int erface" of t he off-line tool is available only on level C to
its full extent . On t he other hand st-apps offers - becau se of these facilities
- a vari ety of helpful and powerful tools for ana lyzing and presenting data
which are also useable without an access to the int ernet .


Address: K. Cramer, U. Kamps, C. Zuckschwerdt, University of Oldenburg,
Institute of Mathematics, D-26111 Oldenburg, Germany
E-mail: e-stat@uni-oldenburg.de

THE CASE SENSITIVITY FUNCTION
APPROACH TO DIAGNOSTIC AND
ROBUST COMPUTATION:
A RELAXATION STRATEGY

Frank Critchley, Michael Schyns, Gentiane Haesbroeck,
David Kinns, Richard A. Atkinson and Guobing Lu

Key words: Combinatorial optimisation, convexity, diagnostics, Euclidean
geometry, masking, multiple case effects, relaxation, robustness.
COMPSTAT 2004 section: Robustness.

Abstract: The present paper focuses on the case sensitivity function ap-
proach to diagnostic and robustness problems that are combinatorial by
definition and hard to solve exactly. Attention is also given to visual displays.

1 Overview and organisation


Central to both diagnostics and robustness are a range of optimisation prob-
lems that are combinatorial by definition and correspondingly hard to solve
exactly. A variety of multiple case effects - such as masking - may be present,
further complicating appropriate inference.
The present paper offers a computation-focused progress report on the
case sensitivity function approach to diagnostics and robustness introduced
in [4], on which we draw. A key idea here is that relaxation brings benefits.
Specifically, the strategy outlined below shows how such high-dimensional
($O({}^nC_m)$) discrete optimisation problems can be embedded in low-dimen-
sional ($O(n)$) smooth reformulations, in which both the insights of geometry
and the power of analysis are available. In particular, informative plots be-
come possible, while additional convexity and derivative information can be
exploited.
Overall motivation for the case sensitivity function approach derives from
considerations of (A) unity, (B) insight and (C) innovation, examples includ-
ing - in order of appearance:

• (A1): an emphasis (throughout) on the connectivity of diagnostics and
robustness,

• (A2): a single setting for a range of optimisation problems (Sections 3
and 4),

• (B1): visual displays affording insight into the nature and variety of
multiple case effects (Section 5),

• (C1): new diagnostic methodologies (Section 6),

• (B2): insight into the performance of existing algorithms (Section 7),

and:

• (C2): enhanced (potentially, encompassing) sets of algorithms for a
class of robustness problems (Section 7).

A few preliminaries are established in Section 2.

2 Preliminaries
To gain focus, attention is restricted to one-sample contexts, with $\{z_i : i \in N\}$,
$N := \{1, \ldots, n\}$, denoting a random sample of $n > 1$ cases from
an unknown distribution $F$ in $\dim(z)$ dimensions. The associated empirical
distribution is $\hat F := \sum_{i \in N} n^{-1} F_i$, where $F_i$ denotes the distribution degen-
erate at $z_i$. Throughout, analysis is conducted conditional on the observed
$\{z_i\}$.
Assuming, as we do, that no further information is available about the ob-
served cases, it is desirable that any analysis of these data should be invariant
under permutation of the arbitrary labels attached to them. Given $n$, this in-
variance is achieved - without loss of information - by replacing $\{z_i : i \in N\}$
by $\hat F$. In particular, every statistic of interest here is of the form $T[\hat F]$, for
some functional $T[\cdot]$. This may, for example, be (the observed significance
level of) a test statistic, a parameter estimate, a prediction of future values
of an observable, or a nonparametric density or regression function estimate.
In particular, $T[\cdot]$ may be scalar, vector or function valued.
Let $Z := (z_i^T)$. In multivariate contexts where all the random variables
in $z \sim F$ are on the same footing, we put $\dim(z) = k$, $z = x$, $z_i = x_i$ and
$Z = X$. In the usual linear model $y = X\beta + \varepsilon$, we put $\dim(z) = 1 + k$,
$z^T = (y, x^T)$ and $z_i^T = (y_i, x_i^T)$, so that $Z = (y\,|\,X)$ (a constant term being
assumed and accommodated by supposing that the distribution of the first
element of $x$ is degenerate at the value 1).

3 A combinatorial optimisation problem


Two integers $h > 0$ and $m > 0$ are called $n$-complementary if $h + m = n$, in
which case:

$\mathcal{N}_h = \{A^c : A \in \mathcal{N}_m\}$ (1)

where, for any integer $0 < a < n$, $\mathcal{N}_a := \{\emptyset \subset A \subset N : |A| = a\}$. In
particular, $|\mathcal{N}_h| = |\mathcal{N}_m|$ or, in the familiar combinatorial identity, ${}^nC_h =
{}^nC_m$.
Throughout, $\{H, M\}$ denotes a bipartition of $N$. That is, $H$ and $M$ are
nonempty, complementary subsets of $N$. In particular, $|H|$ and $|M|$ are $n$-
complementary. Of course, holding onto the cases labelled by $H$ is the same
thing as missing out those labelled by $M$. That is,

$\hat F_H = \hat F_{-M},$ (2)

where, for any $\emptyset \subset A \subset N$, $\hat F_A := \sum_{i \in A} |A|^{-1} F_i$ and $\hat F_{-A} := \hat F_{A^c}$.

As is well known, diagnostics and robustness meet at the influence func-
tion. The simple but general relations (1) and (2) provide a second, global
connection between these two areas of statistics, as we now discuss. For
brevity, each scalar target functional $t[\cdot]$ below is implicitly assumed to be
defined wherever it is evaluated, and its possible dependence on $\hat F$ or $T$ is
suppressed notationally.
A general problem arising in diagnostics is to identify subsets $M$ of given
size $m$ whose omission causes maximal changes $T[\hat F] \to T[\hat F_{-M}]$ in a statistic
of interest, as measured by $t[\hat F_{-M}]$ for some appropriate target functional $t[\cdot]$.
A lead example is Cook's (squared) distance in the linear model. With $T[\hat F] =
\beta[\hat F] := (\mathbb{E}_{\hat F}(xx^T))^{-1}\mathbb{E}_{\hat F}(xy)$, $\hat\beta := \beta[\hat F]$ and $\hat\beta_{-M} := \beta[\hat F_{-M}]$, we have:

$t_{Cook}[\hat F_{-M}] := (ks^2)^{-1}(\hat\beta_{-M} - \hat\beta)^T X^T X (\hat\beta_{-M} - \hat\beta),$

where $s^2$ is the usual estimate of error variance. Again, a range of robust
estimates are defined in terms of subsets $H$ of given size $h$ which optimise
a specified target functional $t[\hat F_H]$. A lead example is minimum covariance de-
terminant (MCD) estimation in multivariate analysis, based on minimisation
of $t_{MCD}[\hat F_H] := \log(\det(\mathrm{cov}[\hat F_H]))$. These two lead examples are developed
below.
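For concreteness, the first lead target functional can be evaluated directly by refitting after deletion; the sketch below (NumPy only; the function and variable names are ours, not taken from [4]) computes $t_{Cook}[\hat F_{-M}]$ for an arbitrary index set $M$ in a linear model with intercept.

import numpy as np

def cook_subset(X, y, M):
    """t_Cook for deletion of the index set M: (k s^2)^{-1} (b_{-M}-b)' X'X (b_{-M}-b)."""
    n, k = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]                 # full-data estimate
    keep = np.setdiff1d(np.arange(n), M)
    beta_m = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]   # estimate omitting M
    resid = y - X @ beta
    s2 = resid @ resid / (n - k)                                # usual error variance estimate
    d = beta_m - beta
    return float(d @ (X.T @ X) @ d / (k * s2))

# toy example: 10 cases, intercept plus one regressor
rng = np.random.default_rng(0)
x = rng.normal(size=10)
X = np.column_stack([np.ones(10), x])
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=10)
print(cook_subset(X, y, M=[3, 7]))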
Summarising, a range of optimisation problems arising naturally in both
diagnostics (D) and robustness (R) have combinatorial complexity and en-
tirely equivalent (D) ↔ (R) forms, expressed in Problem 3.1, in which $h$ and $m$
are given $n$-complementary integers:

Problem 3.1. (Combinatorial optimisation problem)

(D) Optimise $t[\hat F_{-M}]$ over $M \in \mathcal{N}_m$.
(R) Optimise $t[\hat F_H]$ over $H \in \mathcal{N}_h$.

We note in passing that a variety of other combinatorial problems, not nec-
essarily linked to diagnostics and robustness, can also be formulated in this
way.
This high-dimensional discrete problem can be embedded in a low-dimen-
sional smooth one, as follows. It suffices to express such a relaxation strategy
in, say, the (D) form, that in the (R) form following at once via (1) and (2).

4 A relaxation strategy
Throughout this section, $h$ and $m$ denote given $n$-complementary integers.
Again, $M$ denotes a general member of $\mathcal{N}_m$, and $H$ its complement in $N$.

4.1 Probability vectors as labels for weighted empirical distributions

The first step in the relaxation strategy adopted here is to use probability
vectors as labels for weighted empirical distributions.
For any $p \equiv (p_i) \in \mathbb{P}^n := \{$all probability $n$-vectors$\}$, let $F(p) :=
\sum_{i \in N} p_i F_i$ denote the distribution attaching probability $p_i$ to $z_i$, and $\mathbb{F} :=
\{F(p) : p \in \mathbb{P}^n\}$. For brevity, the $\{z_i\}$ are assumed distinct (this avoids an
elaboration required in the general case). Accordingly (indeed, equivalently),

$p \leftrightarrow F(p)$ is a bijection between $\mathbb{P}^n$ and $\mathbb{F}$. (3)

In particular, every weighted empirical distribution corresponds to one and
only one probability vector, which provides a convenient label for it. For
example, $p_0 := (n^{-1})$ labels $\hat F$.
Moreover, $p_{-M}$ labels $\hat F_{-M}$, where the $i$th element of $p_{-M}$ is zero if $i \in M$
and $h^{-1}$ otherwise. That is, (3) specialises to:

$p_{-M} \leftrightarrow \hat F_{-M}$ is a bijection between $\mathcal{V}^n_{-m}$ and $\mathbb{F}_{-m}$,

where $\mathcal{V}^n_{-m}$ comprises the ${}^nC_m$ distinct probability vectors arising from per-
mutation of $h^{-1}(0_m^T, 1_h^T)^T$ and $\mathbb{F}_{-m} := \{\hat F_{-M} : M \in \mathcal{N}_m\}$ is the set of
distributions optimised over in Problem 3.1.
The (R) form is immediate, writing $p_{-M}$, $\mathcal{V}^n_{-m}$ and $\mathbb{F}_{-m}$ as $p_H$, $\mathcal{V}^n_h$ and
$\mathbb{F}_h$ respectively. Of course, in the limit when $m = (n - 1)$ (equivalently,
$h = 1$), $\mathcal{V}^n_{-m}$ comprises the $n$ unit vectors in $\mathbb{P}^n$ which label the degenerate
distributions $\{F_i\}$ in the obvious way.
Again, with $0 < \lambda_a := a/n < 1$ denoting the proportion of cases in
$\emptyset \subset A \subset N$, the identity:

$\hat F = \lambda_a \hat F_A + (1 - \lambda_a)\hat F_{-A}$ (4)

has an exactly analogous probability vector form:

$p_0 = \lambda_a p_A + (1 - \lambda_a) p_{-A}.$ (5)

Finally, let $T[\cdot]$ denote any statistic of interest. Following [4], pertur-
bation is defined here as movement $p \to p^*$ between probability $n$-vectors,
with primary effect (corresponding to the identity functional $T$) the induced
change $F(p) \to F(p^*)$ in distribution, and general effect $T[F(p)] \to T[F(p^*)]$.

4.2 Size and direction of perturbations


Again following arguments set out in [4], the second relaxation step embeds
$\mathbb{P}^n$ in $n$-dimensional Euclidean space $E^n$, this choice of geometry assigning
both size and direction to perturbations.
In particular, the size $r^n_{-m} \equiv r^n_h = \sqrt{m/(nh)}$ of the perturbation $p_0 \to
p_{-M}$ (not, note, of its primary effect $\hat F \to \hat F_{-M}$):
(i) does not depend on which $m$ cases are deleted,
(ii) increases with $m$ for fixed $n$, and
(iii) decreases with $n$ for fixed $m$,
each of which is intuitive.
Again, for any nonzero vector $v$ in $E^n$, let $d(v) := v/\|v\|$ denote its
direction. Then, for any $\emptyset \subset A \subset N$, $d_A := d(p_A - p_0)$ and $d_{-A} := d_{A^c}$ are
the directions of the perturbations (from $p_0$) which hold onto and miss out $A$,
respectively. In particular, (4) and (5) can be tellingly re-expressed as $d_A =
-d_{-A}$. In words, for any nonempty proper subset of cases, the perturbation
which holds onto it is in the opposite direction to that which misses it out.
Finally, let $\{M_r : r = 1, 2, 3\}$ denote a tripartition of $N$. Then it is easy
to see that the perturbations $\pm d_{M_1}$ (from $p_0$) holding onto and missing out
$M_1$ are orthogonal to those, $\pm d(p_{M_2} - p_{M_3})$, which trade probability weight
between the cases labelled $M_2$ and $M_3$, exactly similar relations holding under
cyclic permutation of subscripts.
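These elementary facts are easy to verify numerically; the sketch below (our own illustration, not code from the papers cited) builds $p_0$, $p_{-M}$ and $p_A$ for a small example and checks the size formula $\sqrt{m/(nh)}$ and the orthogonality of $d_{M_1}$ to $d(p_{M_2} - p_{M_3})$.

import numpy as np

def p_subset(n, A):
    """Probability vector holding onto the cases in A (equal weight on A, zero elsewhere)."""
    p = np.zeros(n)
    p[list(A)] = 1.0 / len(A)
    return p

n, M = 6, {4, 5}                      # miss out m = 2 cases, hold onto h = 4
h, m = n - len(M), len(M)
p0 = np.full(n, 1.0 / n)
p_minus_M = p_subset(n, set(range(n)) - M)

print(np.linalg.norm(p_minus_M - p0), np.sqrt(m / (n * h)))   # identical sizes

def direction(v):
    return v / np.linalg.norm(v)

M1, M2, M3 = {0, 1}, {2, 3}, {4, 5}   # a tripartition of N
d_M1 = direction(p_subset(n, M1) - p0)
d_trade = direction(p_subset(n, M2) - p_subset(n, M3))
print(np.dot(d_M1, d_trade))          # ~ 0: orthogonal directions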

4.3 Convexification of the feasible region


Recalling that $\mathcal{V}^n_{-m}$ labels the distributions over which an optimum is sought,
the third relaxation step is to embed $\mathcal{V}^n_{-m}$ in its convex hull, $\mathbb{P}^n_{-m}$ say, this
larger set serving (below) as the feasible region for the smooth embedding of
Problem 3.1.
It follows that $\mathbb{P}^n_{-m} = \{p \in \mathbb{P}^n : p_i \le h^{-1}\ (i \in N)\}$, a closed convex
polyhedron of maximal dimension $(n - 1)$ in $E^n$. And, dually, that $\mathcal{V}^n_{-m}$ is
the set of all vertices (extreme points) of $\mathbb{P}^n_{-m}$. That is, all those members
of $\mathbb{P}^n_{-m}$ which cannot be written as a strict convex combination of two other
members. Geometrically, all those points in $\mathbb{P}^n_{-m}$ which do not lie in the
interior of a line segment joining two others. Again, we have:

$\{p_0\} = \{p \in \mathbb{P}^n : p_i \le n^{-1}\ (i \in N)\} \subset \mathbb{P}^n_{-1} \subset \mathbb{P}^n_{-2} \subset \cdots \subset \mathbb{P}^n_{-(n-1)} = \mathbb{P}^n$ (6)

while, writing $\mathbb{P}^n_{-m}$ as $\mathbb{P}^n_h$, the (R) form is immediate.

4.4 Examples
Figure 1 illustrates the $n = 3$ case. $\mathbb{P}^3 = \mathbb{P}^3_{-2}$ is the outer equilateral triangle,
whose vertices $\mathcal{V}^3_{-2}$ are the unit vectors. $\mathbb{P}^3_{-1}$ is the inverted, inner equilateral
triangle, whose vertices $\mathcal{V}^3_{-1}$ are the midpoints of the sides of $\mathbb{P}^3$. Both
triangles are centred on $p_0$. All perturbations (from $p_0$) which miss out
a single case are the same size, and smaller than all which miss out two.
Again, each perturbation (from $p_0$) that holds onto a given case is in the
opposite direction to that which misses it out, and orthogonal to that which
trades weight between the other two.

Figure 1: $\mathbb{P}^3$ and some of its key features: the vertices (1,0,0), (0,1,0), (0,0,1) of the outer triangle and the midpoints (1/2,1/2,0), (1/2,0,1/2), (0,1/2,1/2) of its sides.

Figure 2: $\mathbb{P}^4$ and some of its key features.

The $n = 4$ case is illustrated in the 3-D polyhedra of Figure 2. The
leftmost of these is the regular triangular pyramid $\mathbb{P}^4 = \mathbb{P}^4_{-3}$, whose vertices
$\mathcal{V}^4_{-3}$ (again, the unit vectors) are shown as solid circles. The four square
symbols shown there are the vertices $\mathcal{V}^4_{-1}$, each $p_{-\{i\}}$ being the centroid of
the face of $\mathbb{P}^n$ opposite to $p_{\{i\}}$ (a result that holds for any $n > 1$). Again,
the six oval symbols at the mid-points of the edges of $\mathbb{P}^4$ are the vertices $\mathcal{V}^4_{-2}$.
The convex hulls $\mathbb{P}^4_{-1}$ and $\mathbb{P}^4_{-2}$ of these two vertex sets comprise the other
two polyhedra shown, all three being centred on $p_0$. The inclusions (6) are
clear.
Overall, the three sides of $\mathbb{P}^3$ are scaled copies of $\mathbb{P}^2$, each being the region
where zero weight is attached to a given case. For the same reason, the four
faces of $\mathbb{P}^4$ are scaled copies of $\mathbb{P}^3$, similar results holding in general.

4.5 A smooth reformulation


Now, exploiting (3), we define the case sensitivity function $T(\cdot)$ for the statis-
tic $T[\cdot]$ via $T(p) := T[F(p)]$. Similarly, we define the smooth target function
$t(\cdot)$ via $t(p) := t[F(p)]$. In particular, $t_{MCD}(p) = \log(\det(\mathrm{cov}[F(p)]))$, while
$t_{Cook}(p) = (ks^2)^{-1}(\hat\beta(p) - \hat\beta)^T X^T X(\hat\beta(p) - \hat\beta)$, where $\hat\beta(p) := \beta[F(p)]$.
The final relaxation step is to embed Problem 3.1 in:

Problem 4.1. ($O(n)$ smooth reformulation of Problem 3.1)

Optimise $t(p)$ over $p \in \mathbb{P}^n_{-m} \equiv \mathbb{P}^n_h$.

It follows at once that any concave (respectively, convex) smooth target func-
tion $t(\cdot)$ attains its minimum (respectively, maximum) over the feasible region
$\mathbb{P}^n_{-m} \equiv \mathbb{P}^n_h$ of Problem 4.1 at a member of the feasible region $\mathcal{V}^n_{-m} \equiv \mathcal{V}^n_h$ of
Problem 3.1 and, in the strict case, only at such a vertex.
In particular, [7] show that $t_{MCD}(\cdot)$ is concave, exploiting this in their
smooth-MCD algorithms. Although its convexity in a neighbourhood of $p_0$
need not extend to the whole feasible region of Problem 4.1, [4] present numer-
ical results which support the conjecture that $p$-generalised Cook's distance
$t_{Cook}(\cdot)$ enjoys similar extremal properties (as they note, it would be helpful
to have either a proof of - or counterexample to - such a conjecture). We
note, in passing, that further positive evidence for it turns up in Figure 3 of
the following section.
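Both smooth target functions are straightforward to evaluate at any $p$ in the feasible region; the sketch below (a minimal illustration with function names of our own, not the authors' code) computes $t_{MCD}(p)$ as the log-determinant of the $p$-weighted covariance and $t_{Cook}(p)$ from the $p$-weighted least squares fit.

import numpy as np

def t_mcd(p, Z):
    """log det of the covariance of the weighted empirical distribution F(p)."""
    mu = p @ Z
    Zc = Z - mu
    cov = Zc.T @ (p[:, None] * Zc)
    return np.log(np.linalg.det(cov))

def t_cook(p, X, y):
    """p-generalised Cook's distance relative to the full-data fit (p = p0)."""
    n, k = X.shape
    beta0 = np.linalg.lstsq(X, y, rcond=None)[0]
    W = np.diag(p)
    beta_p = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)    # beta[F(p)]
    s2 = np.sum((y - X @ beta0) ** 2) / (n - k)
    d = beta_p - beta0
    return float(d @ (X.T @ X) @ d / (k * s2))

# p = p0 recovers the classical (full-data) quantities
rng = np.random.default_rng(1)
Z = rng.normal(size=(8, 2))
p0 = np.full(8, 1.0 / 8)
print(t_mcd(p0, Z))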

5 Visual displays of multiple case effects


One outcome of the above relaxation strategy is the availability of visual
displays offering insight into the nature and variety of multiple case effects
that can occur in different contexts. We focus here on graphs of $t_{Cook}(\cdot)$ in
the linear model, following [10], from which Figure 3 is taken.
For all but the smallest values of $n$, direct visualisation of the graph of
any smooth target function $t(\cdot)$ over $\mathbb{P}^n$ - or one of its subsets $\mathbb{P}^n_{-m}$ - is pre-
vented by the fact that each has dimension $(n - 1)$. Instead, the approach
adopted here uses tripartitions of $N$ as devices providing informative trian-
gular subsets of $\mathbb{P}^n$, over which the graph of $t(\cdot)$ can then be displayed. The
key idea is to attach equal probability weight to cases in the same member of
a tripartition. This turns out to be a rich enough structure to provide insight
into a range of multiple case effects - allowing us, in effect, to see the nature
of each, and their variety.

5.1 Tripartitions
Suppose then that $\mathcal{M} := \{M_r : r = 1, 2, 3\}$ is a given partition of $N$ into
three disjoint subsets, with $m_r := |M_r| > 0$ and $\sum_r m_r = n$, and let

$\mathbb{T} = \mathbb{T}(\mathcal{M}) := \{p \in \mathbb{P}^n : [i \in M_r,\ j \in M_r] \Rightarrow p_i = p_j\}.$

It follows that $\mathbb{T}$ is the convex hull of $\{p_{M_r} : r = 1, 2, 3\}$. That is, $\mathbb{T}$ is the
triangle which has these three points as vertices which, when convenient, we
abbreviate to $\{M_r\}$. Otherwise said, $p \in \mathbb{P}^n$ belongs to $\mathbb{T}$ if and only if, for
some $\pi \equiv (\pi_r) \in \mathbb{P}^3$, $p = \sum_r \pi_r p_{M_r}$. In this case, $\pi = \pi(p)$ is unique, $\pi_r(p)$
being the total probability assigned (equally) by $p$ to the $m_r$ cases in $M_r$.
Accordingly, we may identify $\mathbb{T}$ with $\mathbb{P}^3$ via the bijection $p \mapsto \pi(p)$. For
example, $p_0 \mapsto (\kappa_r)$, where $\kappa_r := m_r/n$ is the proportion of cases in $M_r$.
However, whereas $\mathbb{P}^3$ is a fixed equilateral, the shape and size of $\mathbb{T}$ vary with
the $\{m_r\}$. Nevertheless, important inclusions, collinearities and orthogonali-
ties in $\mathbb{P}^3$ survive in $\mathbb{T}$ for every $\mathcal{M}$.
Two obvious cyclic permutations applying, the identity:

$p_{-M_1} = \frac{m_2}{m_2 + m_3}\, p_{M_2} + \frac{m_3}{m_2 + m_3}\, p_{M_3}$

shows that $p_{-M_1}$ lies on the $M_2 M_3$ side of $\mathbb{T}$, being closer to whichever
vertex labels the larger number of cases. In particular, writing $p_r(\lambda) :=
(1 - \lambda)p_{-M_r} + \lambda p_{M_r}$, the line segment $\mathbb{L}_r := \{p_r(\lambda) : \lambda \in [0, 1]\}$ lies in $\mathbb{T}$,
all three such meeting at $p_0$ by (5). Again, using Section 4.2, each $\mathbb{L}_r$ is
orthogonal to the side of $\mathbb{T}$ containing $p_{-M_r}$, $\mathbb{S}_{-r}$ say, along which proba-
bility weight is traded between the other two subsets. Thus, the probability
attached to $M_r$ increases linearly along $\mathbb{L}_r$ from zero at the $p_{-M_r}$ end to
unity at the other. Indeed, for each $\lambda \in [0, 1]$, this probability is constant at
the value $\lambda$ for all points in $\mathbb{T}$ on the line through $p_r(\lambda)$ parallel to $\mathbb{S}_{-r}$. In
particular, it vanishes on $\mathbb{S}_{-r}$.

5.2 Four multiple case effects in the linear model


[3] and [11] discuss a variety of possible effects that a pair of cases may have
on Cook's distance. Here, with $M_3$ representing a convenient 'null' data
set, and restricting ourselves to the special case $m_1 = m_2 = 1$ (for a fuller
account, see [10]), we consider four effects defined in the table below, and
illustrated in the corresponding rows of Figure 3:

| Effect | Joint presence of $M_1$ and $M_2$ ... |
| (a) Masking | conceals presence of either |
| (b) Cancellation | has no effect on fitted line |
| (c) Swing | swings fitted line (intercept ≈ unchanged) |
| (d) Raise & Lower | translates fitted line (slope ≈ unchanged) |

For clarity, stylised simple linear regression data sets are used, shown in the
middle column of Figure 3. In each case, $M_3$ contains $m_3 = 20$ points,
comprising five replicates at each corner of the square $\{\pm 1\}^2$, whose fitted
line is the horizontal axis. Both $M_1$ and $M_2$ consist of a single point at the
corner of $\{\pm 4\}^2$ indicated.
The righthand column of Figure 3 gives the corresponding graph of $t_{Cook}(\cdot)$
over $\mathbb{T}$, limits being used where needed (since, of course, a line cannot be
fitted to a single case). Some linear rescaling between plots has been applied,
both vertically and horizontally, to enhance their visual clarity (a minor cost
being some loss of visual perception that the angle at $M_3$ exceeds 87°). Note
that $p_0$ (corresponding to $\hat F$) is close to $M_3$, being just one-eleventh of the
way along the line $\mathbb{L}_3$ joining $M_3$ to the midpoint of the opposite side. The
inbuilt $M_1$-$M_2$ symmetry is evident throughout. Overall, the four graphs
have visibly different shapes, discussed next:

Figure 3: Four multiple case effects in the linear model: (a) masking, (b) can-
cellation, (c) swing and (d) raise & lower. (In each row, $|M_1| = |M_2| = 1$ and
$|M_3| = 20$.)

(a) Masking. The 'spike' at $M_3$ reflects the dominant effect of removing
both $M_1$ and $M_2$, while the parallelism of the contours to $\mathbb{S}_{-3}$ corresponds
to the fact that there is, of course, no effect here in trading weight between
these sets.
(b) Cancellation. The contours of $t_{Cook}(\cdot)$ here are straight lines fanning
out from $M_3$. In particular, $\mathbb{L}_3$ is the zero height contour, since varying $\pi_3$
while keeping $\pi_1 = \pi_2$ has no effect on the fitted line. Trading weight between
$M_1$ and $M_2$ now has a quadratic, globally dominant, effect.
(c) Swing. The overall shape of the surface here is very similar, but not
identical, to that in the masking case. The 'spike' at $M_3$ remains dominant,
but the surface contours are no longer parallel to $\mathbb{S}_{-3}$.
(d) Raise & Lower. This is perhaps the most interesting graph. As is
intuitive from the data, the dominant global effect occurs along $\mathbb{S}_{-3}$. Looking
at the surface, we see two 'troughs'. These run along $\mathbb{L}_1$ and $\mathbb{L}_2$, showing
that varying the weight on one of these subsets alone has little effect. The
contours of $t_{Cook}(\cdot)$ are parallel to $\mathbb{S}_{-3}$ when there is little weight on $M_3$, but
become more curved as $\pi_3$ increases. Locally to $p_0$, trading weight between
$M_1$ and $M_2$ produces the largest effects.

6 A relaxed diagnostic approach to detecting heavy mutual masking
Multiple case effects can be strong and yet intrinsically hard to detect with
standard diagnostic procedures, while the burden of full enumeration in-
creases combinatorially with $m$. Heavy mutual masking is a well-known ex-
ample, challenge data sets comprising 60% of cases from one distribution and
40% from a second, suitably remote from the first. [4] present a widely appli-
cable, relaxed, two-stage approach to detecting such effects (cf. [2]), briefly
reviewed here.
Adopting the standard assumption in the literature that at most half
the cases are discordant from a common pattern followed by the rest, Stage I
consists of maximising (say) a suitable target function $t(\cdot)$ over $\mathbb{P}^n_{-m}$, with $m$
the integer part of $n/2$, the optimum being known or assumed to occur at
a vertex. This corresponds precisely to missing out a specified subset $\hat M$ of
$m$ cases. The (in)equality constraints defining $\mathbb{P}^n_{-m}$ being linear, this relaxed
optimisation can be carried out with standard software (or some alternative,
as indicated in Section 7). The assumed internal consistency of the cases in
$\hat H := \hat M^c$ may also be checked.
Stage II back-checks for swamping. That is, for cases in $\hat M$ which are not
inconsistent with the pattern followed by the majority. [4] envisage doing this
separately for each case in $\hat M$, although a sequential approach is possible.
Having augmented $\hat H$ with any such cases, a final check on their internal
consistency can be made while, if required, the possibility of further structure
within the cases in $\hat M$ may be explored.
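As an illustration of the Stage I computation, the sketch below (our own minimal example using SciPy's general-purpose SLSQP solver, not the software used in [4]) maximises a smooth target function over $\mathbb{P}^n_{-m} = \{p : \sum_i p_i = 1,\ 0 \le p_i \le h^{-1}\}$ and reads off $\hat M$ as the $m$ cases receiving the smallest (near-zero) weight. The toy target is linear, so its maximum is attained at a vertex, in line with the text above.

import numpy as np
from scipy.optimize import minimize

def stage_one(t, n, m):
    """Maximise t over the polytope {p : sum p = 1, 0 <= p_i <= 1/h}, h = n - m."""
    h = n - m
    p0 = np.full(n, 1.0 / n)
    res = minimize(lambda p: -t(p), p0, method="SLSQP",
                   bounds=[(0.0, 1.0 / h)] * n,
                   constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1.0}])
    p_hat = res.x
    M_hat = np.argsort(p_hat)[:m]          # the m cases missed out at the optimum
    return p_hat, M_hat

# toy example: a linear target, whose optimum over the polytope is a vertex
x = np.arange(20, dtype=float)
t = lambda p: p @ x
p_hat, M_hat = stage_one(t, n=20, m=10)
print(sorted(M_hat))                       # the ten smallest-x cases are missed out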
[4] report encouraging results for this general strategy, using regression
as a test problem and several forms of challenge data set. Specifically, they
maximise $t_{Cook}(\cdot)$ in Stage I, using the mean shift outlier test in Stage II.
Finally, a remark on local maxima. On those occasions when the final
check for a common pattern fails, the possibility that this is because omission
of $\hat M$ is a particular form of non-trivial local maximum can easily be explored
as follows. The value of $t(\cdot)$ there can be compared to that where $\hat M$ is held
onto. If this is greater, replacing $\hat M$ by its complement, and then continuing
as before, is indicated. On the relatively few occasions where it was needed in
their regression study, [4] report that this simple strategy was successful. The
original $\hat M$ containing no mutually masked cases, moving to its complement
produced a large increase in $t_{Cook}$ and led again to correct identification of
the structure in the data.

7 Developments in relaxed robust computation


Consider now minimisation over $\mathbb{P}^n_h$ of the particular function $t(\cdot) = t_{MCD}(\cdot)$
as an exemplar of the class of robust estimation procedures that can be
defined in this way. Algorithms for the MCD problem include those reported
in [1], [8], [9] and [12]. These are all discrete in the sense that they address
Problem 3.1, iteratively 'jumping' between members of $\mathcal{V}^n_h$.
We briefly sketch here some of the work reported in [6] and, more fully,
in [7], recalling that these papers show that $t_{MCD}(\cdot)$ is, indeed, concave.
Collectively, the new approaches reported therein are referred to as smooth-
MCD algorithms.
Figure 4 shows two views of the same $t_{MCD}$ surface over $\mathbb{P}^3_h$ for univariate
data. This simple example offers some general geometric insight: the graph of
$t_{MCD}$ contains multiple local minima, separated by hills, with corresponding
limitations for any purely descent algorithm. In particular, it motivates the
use of swapping strategies aimed at 'getting you over a hill to a lower valley'.
At the same time, the swapping strategy employed by the feasible subsets
algorithm - while optimal in its own terms - is relatively expensive to perform
and may not always be needed, in the sense that not every vertex is a local
minimum.
Again, [4] note the benefits of using explicit gradient information, when
this is available. [5] develop local projected (here, centred) Taylor expansions
in generality. They show that such expansions are possible even when, as
here, one or more constraints (here, $p^T 1_n = 1$) imply that there are no open
sets in a function's domain (here, a subset of $\mathbb{P}^n$). Indeed, they exist uniquely
under mild conditions and can be used to guide algorithms downhill, in the
usual way. They also provide a useful necessary and sufficient condition
for a vertex in $\mathcal{V}^n_h$ to be a local minimum, for any $t$. In the $t_{MCD}$ case, it is
shown that these are precisely the points where the C-steps of FAST-MCD
converge.
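To illustrate how such derivative information can be used, the following rough sketch (our own construction; it is not one of the smooth-MCD algorithms of [6], [7]) computes a finite-difference gradient of $t_{MCD}$ at $p$ and projects it onto directions satisfying the constraint $1_n^T v = 0$, giving a centred descent direction.

import numpy as np

def t_mcd(p, Z):
    """log det of the covariance of the weighted empirical distribution F(p)."""
    mu = p @ Z
    Zc = Z - mu
    return np.log(np.linalg.det(Zc.T @ (p[:, None] * Zc)))

def projected_gradient(t, p, eps=1e-6):
    """Finite-difference gradient of t at p, projected onto {v : 1'v = 0}."""
    g = np.array([(t(p + eps * e) - t(p - eps * e)) / (2 * eps)
                  for e in np.eye(len(p))])
    return g - g.mean()                      # centring = projection onto the 1-orthocomplement

# example: univariate data (k = 1), n = 6 cases, evaluated at p0
rng = np.random.default_rng(2)
Z = rng.normal(size=(6, 1))
p = np.full(6, 1.0 / 6)
g = projected_gradient(lambda q: t_mcd(q, Z), p)
print(g, g.sum())                            # descent direction; components sum to ~0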
Figure 4: Two views of a $t_{MCD}$ surface ($n = 3$; $k = 1$).

Now, conditional on robustness, there are two key performance criteria
in any problem such as this: speed and optimality. Perfection (i.e. instant,
global optimality!) being unachievable, different algorithms aim for it while
striking different trade-offs between these criteria. Accordingly, the state-of-
the-art can be thought of as a boundary of limiting speed/optimality trade-
offs that are currently feasible, the different algorithms appearing at different
points along it.
[6] and [7], to which the reader is referred for further details, exploit
features of the case sensitivity function approach - in particular, insights
from (convex) geometry, the power of analysis, and a unifying structure -
both to understand better why current algorithms occur where they do along
this boundary, and to add new algorithms that fill it out and/or nudge it
nearer to perfection.

References
[1] Agulló J. (1998). Computing the minimum covariance determinant estimator. Technical report, Universidad de Alicante.
[2] Atkinson A.C. (1986). Masking unmasked. Biometrika 73, 533-541.
[3] Barrett B.E. and Gray J.B. (1997). Leverage, residual, and interaction diagnostics for subsets of cases in least squares regression. Computational Statistics and Data Analysis 26, 39-52.
[4] Critchley F., Atkinson R.A., Lu G. and Biazi E. (2001). Influence analysis based on the case sensitivity function. J. Royal Statistical Society, B 63, 307-323.
[5] Critchley F., Lu G., Atkinson R.A. and Wang D.Q. (2003). Projected Taylor expansions for use in Statistics. Under consideration.
[6] Critchley F., Schyns M. and Haesbroeck G. (2003). Smooth optimization for the MCD estimator. International Conference on Robust Statistics, Antwerp, 29-30.
[7] Critchley F., Schyns M., Haesbroeck G., Lu G., Atkinson R.A. and Wang D.Q. (2004). A convex geometry approach to algorithms for the MCD method of robust statistics. Under consideration.
[8] Hawkins D.M. (1994). A feasible solution algorithm for the minimum covariance determinant estimator in multivariate data. Computational Statistics and Data Analysis 17, 197-210.
[9] Hawkins D.M. and Olive D.J. (1999). Improved feasible solution algorithms for high breakdown estimation. Computational Statistics and Data Analysis 30, 1-11.
[10] Kinns D.J. (2001). Multiple case influence analysis with particular reference to the linear model. PhD thesis, University of Birmingham.
[11] Lawrance A.J. (1995). Deletion influence and masking in regression. J. Royal Statistical Society, B 57, 181-189.
[12] Rousseeuw P.J. and Van Driessen K. (1999). A fast algorithm for the minimum covariance determinant estimator. Technometrics 41, 212-223.

Acknowledgement: The UK authors are grateful for EPSRC support under
research grant GR/K08246 and to D.Q. Wang for helpful discussions.
Address: F. Critchley, M. Schyns, G. Haesbroeck, D. Kinns, R.A. Atkinson,
G. Lu, The Open University, Milton Keynes; University of Namur; University
of Liège; (formerly) University of Birmingham; University of Birmingham and
University of Bristol
E-mail: F.Critchley@open.ac.uk

ON THE BOOTSTRAP METHODOLOGY
FOR FUNCTIONAL DATA

Antonio Cuevas and Ricardo Fraiman

Key words: Bootstrap validity, bootstrap consistency, bounded Lipschitz
metric, differentiable functionals, functional data analysis.
COMPSTAT 2004 section: Functional data analysis.

Abstract: The current theory of statistics with functional data provides
only a few results [21] on the asymptotic validity of the bootstrap methodology.
Roughly speaking, these validity results guarantee that the bootstrap versions
of the sampling distribution of a statistic tend (as the sample size increases) to
the same limit as the true sampling distributions. From a computational and
practical point of view, such results have a special interest when dealing with
functional data, as the distributional properties of the statistics are usually
difficult to handle in this setup. Of course, the point is that while the true
sampling distributions are usually very difficult to handle, the corresponding
bootstrap versions can be approximated with arbitrary precision.
In this work, a uniform inequality is obtained for the Bounded Lipschitz
distance between the empirical distribution of a function-valued random vari-
able and the corresponding underlying distribution that generates the sample.
As a consequence, a result of bootstrap validity (consistency) is obtained for
functional statistics defined from differentiable operators.
Our proof is based on the use of a differential methodology for operators,
similar to that used by Parr [19], and relies also on a result from empirical
process theory proved by Yukich [29].

1 Introduction
We deal here with statistical setups where the available sample informa-
tion consists of (or can be considered as) a set of functions. Depending on
the approach and on the assumed structure of the data (which often come in
a discretized version), this statistical field is called "longitudinal data analy-
sis" or "functional data analysis" (FDA). We will follow here a purely func-
tional approach, which entails considering the available data as true functions
and, as a consequence, defining and motivating the methods in a functional
framework.
The books by Ramsay and Silverman [22], [23] have greatly contributed to
popularize the FDA techniques among users, offering a number of appeal-
ing case studies and practical methodologies. Simultaneously, this increasing
popularity motivates the need for a solid theoretical foundation for the FDA
methods, as many basic issues (concerning, e.g., the asymptotic behavior)
are often rather involved in the FDA setup.

In general terms, the FDA theory is still incomplete as many topics remain
unexplored from the mathematical point of view. Some theoretical develop-
ments with functional data have been made in fields such as principal component
analysis ([5], [11], [17], [20], [25]), linear regression ([6], [7], [8], [9], [13]), data
depth [14], clustering [1] and anova models ([12], [18], [10]).
An important issue in this field has to do with the asymptotic validity
(usually called consistency) of bootstrap procedures for functional data. This
looks like an interesting research line since the exact calculation of sampling
distributions in FDA problems presents an obvious difficulty, so that the boot-
strap methodology often turns out to be the only practical alternative. Of
course, the point is that while the sampling distribution of a function-valued
statistic can be formally defined in the same way as the analogous concept for
a real-valued statistic, the effective calculation and handling of such "func-
tional" sampling distributions is usually very difficult since they are in fact
probability measures defined on function spaces. Thus the case for using
bootstrap versions is quite strong as they are discrete measures which can be
in turn approximated by resampling with arbitrary precision. An example of
the use of resampling methods in a functional data framework can be found
in [10].
The classical works by Bickel and Freedman [3], Singh [26] and Parr [19],
among others, have established the validity of the bootstrap methodology, in
the case of real variables, for a number of useful statistics, including the sam-
ple mean and those generated by differentiable statistical functionals. The
functional counterpart of this theory is much less developed. However, Giné
and Zinn [15] have proved, in a very general setup, a bootstrap version of
Donsker's theorem for empirical processes. A partial extension of this result
is given in [24]. Politis and Romano [21] have proved the consistency of the
bootstrap for the sample mean in the case of uniformly bounded functional
variables taking values in a separable Hilbert space, imposing very general as-
sumptions on the dependence structure which include the independent case
to be considered here. The main purpose of this paper is to partially extend
this consistency result to (function-valued) statistics defined from differen-
tiable operators. So we are concerned here with a functional version of some
classical validity theorems, such as those in [19] or [2], where the methodology
based on functional differentiation plays a relevant role.
More precisely, we want to get a bootstrap validity result for statistics
of type $T(P_n)$ where $T$ is a differentiable operator (taking values in a func-
tional space) and $P_n$ is the empirical distribution associated with a sample
$X_1, \ldots, X_n$ of $n$ functions drawn from a common distribution $P$. In practical
terms, this result will establish that the distribution of $\sqrt{n}(T(P_n) - T(P))$
can be approximated by its corresponding bootstrap version $\sqrt{n}(T(P_n^*) -
T(P_n))$, where $P_n^*$ is the empirical distribution based on an artificial (boot-
strap) sample drawn from the original sample. Our approach is much in
the spirit of Theorem 4 in [19], although the fact that we are dealing with
functional data entails some additional technical complications.
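In practice the bootstrap approximation is obtained by resampling whole curves; the sketch below (a minimal illustration with discretised curves and the sample mean function as the statistic $T$; all names are ours) draws bootstrap samples from the observed functions and collects realisations of $\sqrt{n}(T(P_n^*) - T(P_n))$.

import numpy as np

rng = np.random.default_rng(0)
n, grid = 50, np.linspace(0.0, 1.0, 101)

# observed sample: n curves X_1, ..., X_n evaluated on a grid (rows = curves)
X = np.sin(2 * np.pi * np.outer(np.ones(n), grid)) + rng.normal(0, 0.3, (n, grid.size))

T = lambda sample: sample.mean(axis=0)        # T(P_n): the sample mean function
T_n = T(X)

B = 500
boot = np.empty((B, grid.size))
for b in range(B):
    idx = rng.integers(0, n, size=n)          # bootstrap sample X*_1, ..., X*_n
    boot[b] = np.sqrt(n) * (T(X[idx]) - T_n)  # realisation of sqrt(n)(T(P_n*) - T(P_n))

# e.g. pointwise bootstrap quantiles of the approximating distribution
lo, hi = np.percentile(boot, [2.5, 97.5], axis=0)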
Our main result establishes that $\sqrt{n}(T(P_n) - T(P))$ and $\sqrt{n}(T(P_n^*) -
T(P_n))$ converge (weakly) to the same limit. It is proved in Section 3 below.
An essential auxiliary step in the proof of this theorem is a uniform (universal)
bound, similar to the classical Dvoretzky-Kiefer-Wolfowitz (DKW) inequality
(see, e.g., [27]), for the distance $d(P_n, P)$ between $P_n$ and $P$. It will be
established in Section 2. This bound is universal in the sense that it does
not depend on the underlying distribution $P$; this is crucial in a bootstrap
setup as the bound for $d(P_n, P)$ will also hold for its bootstrap counterpart
$d(P_n^*, P_n)$. Let us recall that $P$ stands here for a probability distribution
on a function space, so that in order to establish a DKW-type inequality we
need a distance $d(P_n, P)$ compatible with weak convergence and
making sense in a functional framework. We will use the so-called Bounded
Lipschitz metric defined by

$d(P_n, P) = \sup_{f \in \mathcal{F}} \left| \int f\, dP_n - \int f\, dP \right|,$ (1)

$P$ being a probability on a normed space $\mathcal{X}$, $P_n$ an empirical distribution drawn from $P$,
and

$\mathcal{F} = \{f : \mathcal{X} \to \mathbb{R} : f \text{ is Lipschitz with } \|f\|_\infty \le 1 \text{ and Lipschitz constant } 1\}.$ (2)
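The supremum in (1) cannot in general be computed exactly, but restricting it to a tractable subfamily of $\mathcal{F}$ gives a computable lower bound. The sketch below (our own crude illustration, not part of the paper) uses test functions $f_g(x) = \max(-1, \min(1, \langle x, g\rangle))$ with $\|g\| \le 1$, which are bounded by 1 and 1-Lipschitz, to bound from below the distance between the empirical distributions of two samples of discretised curves on $[0,1]$.

import numpy as np

def bl_lower_bound(X, Y, n_dirs=200, seed=0):
    """Lower bound on the bounded Lipschitz distance between the empirical
    distributions of two samples of discretised curves (rows), using
    f_g(x) = clip(<x, g>, -1, 1) with ||g||_{L2} <= 1."""
    rng = np.random.default_rng(seed)
    dx = 1.0 / (X.shape[1] - 1)                      # grid spacing on [0, 1]
    best = 0.0
    for _ in range(n_dirs):
        g = rng.normal(size=X.shape[1])
        g /= np.sqrt(np.sum(g ** 2) * dx)            # unit L2[0, 1] norm
        fX = np.clip(X @ g * dx, -1.0, 1.0)          # f_g evaluated at each curve
        fY = np.clip(Y @ g * dx, -1.0, 1.0)
        best = max(best, abs(fX.mean() - fY.mean()))
    return best

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, (40, 51))                   # two samples of rough curves
Y = rng.normal(0.5, 1.0, (40, 51))
print(bl_lower_bound(X, Y))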

2 A uniform inequality for the Bounded Lipschitz metric

Let $P$ be a probability on a Banach space $\mathcal{X}$ whose support is included in
the ball $B(0, r) \subset \mathcal{X}$. Let $P_n$ be the empirical distribution associated with
a sample $X_1, \ldots, X_n$ drawn from $P$. Let $d$ denote the Bounded Lipschitz
metric defined in (1). We next show a version of the DKW inequality for
$d(P_n, P)$.

Theorem 2.1. For all $\varepsilon > 0$ there exists $K = K(\varepsilon)$ such that

$\mathbb{P}\{\sqrt{n}\, d(P_n, P) > K\} < \varepsilon$, for all $n$, for all $P$ with support in $B(0, r)$. (3)

Proof: This will result as a direct consequence of an exponential bound ob-
tained by Yukich ([29], Theorem 1) by an empirical process methodology.
Define the $\varepsilon$-entropy $N(\varepsilon, \mathcal{F})$ by

$N(\varepsilon, \mathcal{F}) = \min\{m : \text{there exist } f_1, \ldots, f_m \in \mathcal{F} \text{ such that } \sup_Q \min_i \|f - f_i\|_Q^2 < \varepsilon^2,\ \forall f \in \mathcal{F}\},$ (4)

where the supremum on $Q$ is taken over the set of all probability distribu-
tions with finite support and $\|f\|_Q^2 = \int f^2\, dQ$ is the $L^2(Q)$-norm.
Yukich's theorem establishes that if the envelope function $F := \sup\{|f(x)| :
f \in \mathcal{F}\}$ fulfills $F \le 1$ and there are constants $0 < \varepsilon_0 \le 1$, $0 < \delta < 1$, and
$C \ge 1$ such that

$N(\varepsilon, \mathcal{F}) \le \exp(C/\varepsilon^{2-\delta}), \quad \forall \varepsilon,\ 0 < \varepsilon \le \varepsilon_0,$ (5)

then (6) holds
for all $M$ greater than, or equal to, some constant $M(\delta, C, \varepsilon_0)$ whose explicit
expression is given in the statement of Theorem 1 in [29].
In fact, the proof will be simpler and more intuitive if we replace the dis-
tances $\|f - f_i\|_Q^2$ in (5) by the supremum distances $\|f - f_i\|_\infty$. As a con-
sequence, we will prove a stronger version of condition (5), by taking the
supremum in (5) over all possible probability measures (instead of just
considering those of finite support). The reason is that we will in fact provide
a bound for $\|f - f_i\|_\infty$ and, as the $Q$'s are probability measures and the $f$'s
are bounded, we will also get bounds for the $L^2(Q)$ norms.
Given $0 < \varepsilon < 1$, divide the interval $[-1, 1]$ (where the functions $f \in \mathcal{F}$
take values) into $q = [2/\varepsilon] + 1$ subintervals with extreme points in the set

$R_\varepsilon = \{0, \varepsilon, -\varepsilon, 2\varepsilon, -2\varepsilon, \ldots, -1, 1\}.$

Let us also consider a finite sequence of $q_1$ balls defined by

$B(0, \varepsilon) \subset B(0, 2\varepsilon) \subset \cdots \subset B(0, r).$

Observe that $q_1$ is either $r/\varepsilon$ or $r/\varepsilon + 1$.
Let $\mathcal{F}_m = \{f_1, \ldots, f_m\}$ be a class of functions taking values in $R_\varepsilon$ such
that every $f_i$ is constant on the domains $B(0, \varepsilon)$, $B(0, 2\varepsilon) \setminus B(0, \varepsilon)$, $B(0, 3\varepsilon) \setminus
B(0, 2\varepsilon), \ldots$ and the differences between the values of $f_i$ on two adjacent
domains (for example on $B(0, \varepsilon)$ and $B(0, 2\varepsilon) \setminus B(0, \varepsilon)$) are at most $\varepsilon$. Note
that $\#(\mathcal{F}_m) = m \le q\, 3^{q_1}$.
We have that for every $f \in \mathcal{F}$ there exists $i \in \{1, \ldots, m\}$ such that
$\|f - f_i\|_\infty \le 3\varepsilon$. Indeed, given $f \in \mathcal{F}$, there exists $y_0 \in R_\varepsilon$ such that
$|f(0) - y_0| < \varepsilon$. Now let $\mathcal{G}_0$ be the set of all functions $f_i$ in $\mathcal{F}_m$ such that
$f_i(0) = y_0$. As $f$ has Lipschitz constant 1, we have $\sup_{x \in B(0,\varepsilon)} |f(x) - g(x)| \le
2\varepsilon$ for all $g \in \mathcal{G}_0$. On the other hand, as

$\sup_{x \in B(0,\varepsilon)} f(x) \le f(0) + \varepsilon \quad \text{and} \quad \inf_{x \in B(0,\varepsilon)} f(x) \ge f(0) - \varepsilon,$

the class $\mathcal{G}_1$ of all functions $g \in \mathcal{G}_0$ such that $\sup_{x \in B(0,2\varepsilon)} |f(x) - g(x)| \le 3\varepsilon$
is not empty. In a similar way, by the Lipschitz property of $f$, we can choose
a non-empty class $\mathcal{G}_2 \subset \mathcal{G}_1$ such that $\sup_{x \in B(0,3\varepsilon)} |f(x) - g(x)| \le 3\varepsilon$, for all
$g \in \mathcal{G}_2$. By recurrence, define the (non-empty) class $\mathcal{G}_{q_1 - 1}$ of functions such
that $\sup_{x \in B(0,r)} |f(x) - g(x)| \le 3\varepsilon$ for all $g \in \mathcal{G}_{q_1 - 1}$.
Thus we have shown

$N(3\varepsilon, \mathcal{F}) \le q\, 3^{q_1} \le \left(\frac{2}{\varepsilon} + 1\right) 3^{r/\varepsilon + 1} \le \exp\left(\frac{C}{\varepsilon^{1+\eta}}\right),$ (7)

for all $\eta \in (0, 1)$, and $C = (2r)^{1+\eta}\log 3$. Finally, using Yukich's [29] Theo-
rem 1 (observe that $2 - \delta$ in (5) has been denoted $1 + \eta$ in (7)), we conclude (6).

3 A bootstrap validity result for functional data

We establish now a validity result for function-valued statistics defined on
functional data. The methodology will be based on differentiability argu-
ments very much in the line of [19]. As pointed out in the introduction, the
functional version of the DKW inequality obtained in the previous section
will be a crucial step in the proof.

Theorem 3.1. Let $H$ be a bounded set in a Banach space (endowed with
the Borel $\sigma$-algebra). Let $\mathcal{P}(H)$ be the set of all probability measures whose
support is included in $H$. Let $T$ be an operator defined on $\mathcal{P}(H)$ with values in
another Banach space $C$. Denote by $P_n$ the empirical measure corresponding
to i.i.d. $H$-valued variables with distribution $P$. Let $P_n^*$ be the corresponding
empirical measure associated with a bootstrap sample $X_1^*, \ldots, X_n^*$.

(a) Assume that $T$ satisfies the following differentiability condition for some
given $P \in \mathcal{P}(H)$,

$T(Q) = T(P) + T_P'(Q - P) + o(d(Q, P)),$ (8)

where the remainder term $o(d(Q, P))$ denotes, as usual, an operator
such that

$\lim_{Q \to P} \frac{o(d(Q, P))}{d(Q, P)} = 0,$

and $T_P' : \mathcal{P}(H) \to C$ is a linear (not necessarily continuous) operator
for which the bootstrap for the sample mean is valid, in the sense that

$\sqrt{n}\, T_P'(P_n^* - P_n)$ converges weakly a.s. to the same limit
as $\sqrt{n}\, T_P'(P_n - P)$. (9)

Then,

$\sqrt{n}\,(T(P_n^*) - T(P_n))$ converges weakly a.s. to $Z$, (10)

$Z$ being the weak limit of $\sqrt{n}\,(T(P_n) - T(P))$.

(b) Assume that the operator $T$ takes values in a separable Hilbert space
$C$ and that it is differentiable in the sense of (8). If the function $w(x) =
T_P'(\delta_x - P)$ is bounded ($\delta_x$ being the degenerate distribution at $x$), then
condition (9) is fulfilled and therefore (10) holds.
Proof: (a) The result is a simple consequence of Theorem 2.1. Indeed, using
the differentiability assumption (8),

$T(P_n) = T(P) + T_P'(P_n - P) + o(d(P_n, P))$

and

$T(P_n^*) = T(P) + T_P'(P_n^* - P) + o(d(P_n^*, P)).$

Hence

$\sqrt{n}\,(T(P_n^*) - T(P_n)) = \sqrt{n}\, T_P'(P_n^* - P_n) + \sqrt{n}\, o(d(P_n^*, P)) + \sqrt{n}\, o(d(P_n, P)).$ (11)

The first term on the right-hand side tends, by assumption (9), to the same
limit as $\sqrt{n}\,(T(P_n) - T(P))$. Also, from the triangle inequality, $\sqrt{n}\, d(P_n^*, P)$
is bounded in probability (uniformly on $P$), as both $\sqrt{n}\, d(P_n^*, P_n)$ and
$\sqrt{n}\, d(P_n, P)$ are. Therefore the remainder terms in (11) tend to zero in prob-
ability almost surely, which concludes the proof of (a).

(b) Since the operator $T_P'$ is linear,

$T_P'(P_n - P) = \frac{1}{n}\sum_{i=1}^{n} T_P'(\delta_{X_i} - P) = \frac{1}{n}\sum_{i=1}^{n} w(X_i).$

Then, we may apply Theorem 3.1 in [21] to conclude that (9), and there-
fore (10), holds in this case.

Some final remarks:

(i) The hypothesis of uniform boundedness is not very restrictive in prac-
tice. It is in some sense similar to the assumption of compact support
in nonparametric estimation. If one is willing to renounce the usual
gaussian models (which is also the case in nonparametrics), the hypothe-
sis of boundedness looks quite natural, as every observable phenomenon
provides in fact observations taking values in a bounded domain (whose
limits are imposed by the measurement instruments). From a technical
point of view, boundedness is required for Theorem 2.1 (in order to be
able to apply the entropy argument involved in the proof) and also for
the result by Politis and Romano ([21], Theorem 3.1) used in the proof
of part (b). Note also that the boundedness condition must be fulfilled
in the metric of the space where the random elements $X_i$ take values.
For example, if this space is $L^2[a, b]$, the assumption that $X_i \in H$, where
$H$ is bounded in $L^2[a, b]$, does not entail that the realizations of $X_i$ have
to be bounded in the supremum sense.

(ii) The above theorem can be applied, for example, to show the validity
of the bootstrap for statistics of type $g(X)$ which may arise in different
problems, theoretical and applied. In particular, this type of statis-
tic could appear if we are looking for robust alternatives (similar to
M-estimators) to the sample mean in a functional data setup. Such
functional statistics are often called Z-estimators; see [28], ch. 3.3.
Since they are usually defined in an implicit way (as the solution of a
functional equation), the effective use of our validity theorem for them
would require an additional result in order to ensure that the required
differentiability conditions are fulfilled. A detailed study of the asymp-
totic behavior of Z-estimators can be found in [30].

(iii) As an example of a differentiable operator $T = T(P)$, let us consider
the variance operator

$T(P)(t) = \int X^2(t, \omega)\, dP(\omega) - \mu_P^2(t),$

where $X(t) = X(t, \omega)$ is a process with distribution $P$ and mean func-
tion $\mu_P(t)$. It can easily be seen that the differential $T_P'$ is the linear
operator given by

$T_P'(Q)(t) = \int X^2(t, \omega)\, dQ(\omega) - \mu_P^2(t).$


References
[1] Abraham C., Cornillon P.A., Matzner-Løber E., Molinari N. (2003). Unsupervised curve clustering using B-splines. Scandinavian Journal of Statistics 30, 581-595.
[2] Arcones M.A., Giné E. (1992). On the bootstrap of M-estimators and other statistical functionals. In Exploring the Limits of Bootstrap (edited by Raoul LePage and Lynne Billard), Wiley, New York, 13-47.
[3] Bickel P.J., Freedman D.A. (1981). Some asymptotic theory for the bootstrap. The Annals of Statistics 9, 1196-1217.
[4] Billingsley P. (1968). Convergence of Probability Measures. Wiley, New York.
[5] Boente G., Fraiman R. (2000). Kernel-based functional principal components. Statistics and Probability Letters 48, 335-345.
[6] Cardot H., Ferraty F., Sarda P. (1999). Functional linear model. Statistics and Probability Letters 45, 11-22.
[7] Cardot H., Ferraty F., Mas A., Sarda P. (2003). Testing hypotheses in the functional linear model. Scandinavian Journal of Statistics 30, 241-255.
[8] Cardot H., Sarda P. (2003). Estimation in generalized linear models for functional data via penalized likelihood. Journal of Multivariate Analysis, to appear.
[9] Cuevas A., Febrero M., Fraiman R. (2002). Linear functional regression: the case of fixed design and functional response. Canadian Journal of Statistics 30, 285-300.
[10] Cuevas A., Febrero M., Fraiman R. (2004). An anova test for functional data. Computational Statistics and Data Analysis, to appear.
[11] Dauxois J., Pousse A., Romain Y. (1982). Asymptotic theory for the principal component analysis of a vector random function: some applications to statistical inference. Journal of Multivariate Analysis 12, 136-154.
[12] Fan J., Lin S.K. (1998). Test of significance when the data are curves. Journal of the American Statistical Association 93, 1007-1021.
[13] Ferraty F., Vieu P. (2002). The functional nonparametric model and application to spectrometric data. Computational Statistics 17, 545-564.
[14] Fraiman R., Muniz G. (2001). Trimmed means for functional data. Test 10, 419-440.
[15] Giné E., Zinn J. (1990). Bootstrapping general empirical measures. The Annals of Probability 18, 851-869.
[16] Kneip A., Gasser T. (1992). Statistical tools to analyze data representing a sample of curves. The Annals of Statistics 20, 1266-1305.
[17] Locantore N., Marron J.S., Simpson D.G., Tripoli N., Zhang J.T., Cohen K.L. (1999). Robust principal component analysis for functional data (with discussion). Test 8, 1-74.
[18] Muñoz-Maldonado Y., Staniswalis J.G., Irwin L.N., Byers D. (2002). A similarity analysis of curves. Canadian Journal of Statistics 30, 373-381.
[19] Parr W.C. (1985). The bootstrap: some large sample theory and connections with robustness. Statistics and Probability Letters 3, 97-100.
[20] Pezzulli S., Silverman B.W. (1993). Some properties of smoothed principal components analysis for functional data. Computational Statistics 8, 1-16.
[21] Politis D.N., Romano J.P. (1994). Limit theorems for weakly dependent Hilbert space valued random variables with application to the stationary bootstrap. Statistica Sinica 4, 461-476.
[22] Ramsay J.O., Silverman B.W. (1997). Functional Data Analysis. Springer-Verlag, New York.
[23] Ramsay J.O., Silverman B.W. (2002). Applied Functional Data Analysis. Springer-Verlag, New York.
[24] Sheehy A., Wellner J.A. (1992). Uniform Donsker classes of functions. The Annals of Probability 20, 1983-2030.
[25] Silverman B.W. (1996). Smoothed functional principal components analysis by choice of norm. The Annals of Statistics 24, 1-24.
[26] Singh K. (1981). On the asymptotic accuracy of Efron's bootstrap. The Annals of Statistics 9, 1187-1195.
[27] van der Vaart A. (2000). Asymptotic Statistics. Cambridge University Press, Cambridge.
[28] van der Vaart A., Wellner J. (1996). Weak Convergence and Empirical Processes. Springer-Verlag, New York.
[29] Yukich J.E. (1986). Uniform exponential bounds for the normalized empirical process. Studia Mathematica 84, 71-78.
[30] Zhan Y. (2002). Central limit theorems for functional Z-estimators. Statistica Sinica 12, 609-634.

Acknowledgement: The first author has been partially supported by grant
BFM2001-0169 from the Spanish Ministry of Science and Technology.
Address: A. Cuevas, Departamento de Matemáticas, Facultad de Ciencias,
Universidad Autónoma de Madrid, 28049 Madrid (Spain).
R. Fraiman, Departamento de Matemática, Universidad de San Andrés, Vito
Dumas 284, Victoria, Provincia de Buenos Aires (Argentina).
E-mail: antonio.cuevas@uam.es, rfraiman@udesa.edu.ar

A NOVEL APPROACH TO PARAMETRIZATION
AND PARAMETER ESTIMATION IN
LINEAR DYNAMIC SYSTEMS

Manfred Deistler, Thomas Ribarits and Bernard Hanzon

Key words: Identification, parametrization, multivariate state space systems.
COMPSTAT 2004 section: Time series analysis.

Abstract: We describe a novel approach, called data driven local coordi-
nates (DDLC), for parametrizing linear systems in state space form, and we
analyze some of its properties which are relevant for e.g. maximum like-
lihood estimation. In addition we describe how this idea can be used for
a concentrated likelihood function, obtained by a least squares type concen-
tration step, which gives the so-called sls (separable least squares) DDLC
approach. Both approaches give favourable results in numerically optimizing
the likelihood function in simulation studies.

1 Introduction
Despite the fact that identification (in the sense of model selection and pa-
rameter estimation) of linear dynamic systems is a quite mature subject now,
there still exist severe problems in applying identification procedures, in par-
ticular in the multivariable case.
As is well known, one of the major problems is the 'curse of dimensional-
ity'; in the (linear) multivariable case the dimension of the parameter space is
a quadratic function of the number of outputs, unless additional restrictions,
e.g. of factor analysis or reduced rank regression type or of 'structural'
type, are imposed.
In this contribution our main focus will be on another important issue.
For simplicity of notation, we only consider linear systems with unobserved
white noise inputs. Then the most common models are AR, ARMA and
state space (StS) models. In applications AR models still dominate, mainly
for two reasons:

(i) The structure of parameter spaces for AR models is much simpler than
in the case of ARMA and StS models. In particular, in the most com-
mon parametrization of AR(p) models (where the coefficient matrix of
the present output is the identity) the entries of all other coefficient
matrices are free parameters (of course satisfying the stability condi-
tion) and identifiable, including the parameters corresponding to the
lower dimensional systems.
(ii) The maximum likelihood method gives least squares-type estimators,
which are asymptotically efficient and numerically robust and fast; in
other words, parameter estimation is simple.

On the other hand, ARMA and StS systems are more flexible and thus in
many cases fewer parameters may be required.
As is well known, every causal (stable) rational transfer function (describ-
ing the input-output behaviour of a linear system) can be described by an
ARMA or a StS system; in this sense ARMA and state space systems are
equivalent. However, when embedded in 'naive' parameter spaces, typically
the classes of observational equivalence are larger in the state space case. For
instance, in the univariate case, for ARMA(n, n) systems, the equivalence
classes are singletons in $\mathbb{R}^{2n}$ for the ARMA case (unless common factors oc-
cur), whereas they are $n^2$-dimensional manifolds for (minimal) state space
systems in the embedding $\mathbb{R}^{2n+n^2}$. Identifiability is obtained by selecting rep-
resentatives from equivalence classes, and the advantage of large equivalence
classes lies in the possibility to select (in some sense) better representatives.
This is the reason why we here restrict ourselves to StS systems.
Both typical ARMA and StS model classes suffer from the fact that the
parametrization problem is non-trivial and that in general no explicit formula
for the maximum likelihood estimator exists. For instance, in general, the
boundary of the identifiable parameter spaces contains lower dimensional
systems, which are not identifiable, and algorithmic problems occur if the
true system is close to the boundary. Some of these problems cannot be fully
understood in the framework of the usual asymptotic analysis or are even
better reflected by numerical rather than by statistical analysis. In a certain
sense, asymptotic properties are parametrization independent; to be more
precise:

(i) Under general assumptions, consistency can be shown for transfer func-
tions in a coordinate-free way (see e.g. [2]); if we have identifiable pa-
rameter spaces and the function attaching parameters to transfer func-
tions is continuous, then the corresponding parameter estimates are
consistent, independent of the choice of the particular parametrization.

(ii) Under certain conditions the asymptotic variances of the maximum
likelihood estimators change in a well defined way.

On the other hand, a number of numerical properties are parametriza-
tion dependent. Numerical problems may arise for instance if the grid is too
coarse in relation to the curvature of the likelihood function or if the likeli-
hood function has 'long valleys' in relevant parts of the parameter space. It
can be shown (see e.g. [4], [8]) that the choice of the parametrization has
a severe impact on e.g. success rates or the number of iterations in numerical
optimization of the likelihood function.
In the following we present two 'data driven' parametrizations as a con-
tribution to the aim of increasing the 'market penetration' of state space
modelling in applications.

2 Parametrization by state space systems


A common approach is to commence from the model class U_A of all causal and rational s × s transfer functions

k(z) = Σ_{j=0}^{∞} K_j z^j.    (1)

For a number of reasons, e.g. in order to obtain finite dimensional parameter spaces, U_A has to be broken into bits, where each bit is parametrized separately. In many cases, in a first step, the subclasses M(n) of all transfer functions of order n are considered. Here we deal with parametrizations of M(n) via state space systems (in innovations form):

x_{t+1} = A x_t + B ε_t,    (2)
y_t = C x_t + ε_t,    (3)

where y_t is the s-dimensional observed output, x_t is the n-dimensional state and ε_t is (unobserved) s-dimensional white noise with E ε_t ε_t' = Σ > 0. Usually it is assumed that

|λ_max(A)| < 1   (stability)    (4)

and

|λ_max(A - BC)| < 1   (strict minimum phase assumption)    (5)

hold. Here λ_max(D) denotes an eigenvalue of D of maximal modulus. However, mainly for the sake of notational simplicity, we here do not impose (4) and (5). For the stable case, the steady state solution of (2), (3) is given by

y_t = Σ_{j=1}^{∞} K_j ε_{t-j} + ε_t,    K_j = C A^{j-1} B.    (6)
Let S(n) denote the set of all (A, B, C) ∈ R^{n²+2ns} (we identify (A, B, C) with (vec A, vec B, vec C)). Clearly, S(n) = R^{n²+2ns}, and it can be shown that the set S_m(n) ⊆ S(n) of all minimal (A, B, C) is open and dense in R^{n²+2ns}. Let us endow U_A with the pointwise topology, i.e. the topology corresponding to the product topology in the space (R^{s×s})^N of power series coefficients (K_j | j ∈ N) of the transfer functions. As can be shown, the closure M̄(n) of M(n) satisfies M̄(n) = ∪_{i=0}^{n} M(i).
Finally, we define the mapping

π : S(n) → M̄(n)    (7)

by

π(A, B, C) = C(z^{-1} I - A)^{-1} B = k(z).    (8)

For describing M(n) by state space systems the following approaches (see e.g. [1] and [5]) may be used:

(i) Full state space parametrizations, i.e. M(n) is described by S_m(n). The drawback of this approach is that S_m(n) is non-identifiable. The classes of observational equivalence are given by

E(A, B, C) = {(TAT^{-1}, TB, CT^{-1}) | T ∈ GL(n)}    (9)

and are real analytic manifolds of dimension n². Thus there are n² unnecessary parameters.

(ii) M(n) can be shown to be a real analytic manifold of dimension 2ns, which in general cannot be described by one coordinate system. One approach is to use so-called overlapping parametrizations, an alternative approach is the use of canonical forms, such as the echelon form. In both cases a model selection procedure has to be applied in order to select a subclass of M(n) from a fixed finite number of subclasses.

(iii) The approach described here, namely data driven local coordinates (DDLC, see [3], [4]), is as follows: We commence from an initial (minimal) (A, B, C) ∈ S_m(n) and the tangent space to the equivalence class E(A, B, C) at (A, B, C). (A, B, C) may be obtained by an initial estimate, using e.g. a subspace or an instrumental variable estimation method. Then we take the orthocomplement (in S(n)) of the tangent space as (preliminary) parameter space: Let Q⊥ denote an (n² + 2ns) × 2ns matrix whose columns form a basis for this orthocomplement. Then we have the parametrization (a small numerical sketch of the construction is given below):

φ_D : R^{2ns} → S(n)    (10)

τ_D ↦ ( vec A(τ_D), vec B(τ_D), vec C(τ_D) )' = ( vec A, vec B, vec C )' + Q⊥ τ_D

The corresponding parameter space T_D ⊆ R^{2ns} is defined by removing the non-minimal systems, and the corresponding space for transfer functions is V_D = π(φ_D(T_D)).
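To illustrate the construction behind (10), the following Python/NumPy sketch computes, for a given minimal (A, B, C), a matrix Q⊥ whose orthonormal columns span the orthocomplement of the tangent space to E(A, B, C), and then evaluates φ_D. It is only a minimal illustration of the idea: the function names ddlc_basis and phi_D, the column-major vec convention and the SVD-based rank tolerance are choices made here and are not taken from the papers cited above.

import numpy as np

def ddlc_basis(A, B, C):
    # DDLC construction at a minimal (A, B, C): theta0 = (vec A', vec B', vec C')'
    # and an orthonormal basis Q_perp of the orthocomplement (in R^{n^2+2ns})
    # of the tangent space to the equivalence class E(A, B, C) at (A, B, C).
    n = A.shape[0]
    theta0 = np.concatenate([A.ravel(order="F"), B.ravel(order="F"),
                             C.ravel(order="F")])
    # For T = I + eps*Z the first order perturbation of (T A T^{-1}, T B, C T^{-1})
    # is (Z A - A Z, Z B, -C Z); these directions span the tangent space.
    tangent = []
    for k in range(n):
        for l in range(n):
            Z = np.zeros((n, n))
            Z[k, l] = 1.0
            dA, dB, dC = Z @ A - A @ Z, Z @ B, -C @ Z
            tangent.append(np.concatenate([dA.ravel(order="F"),
                                           dB.ravel(order="F"),
                                           dC.ravel(order="F")]))
    T_mat = np.column_stack(tangent)              # (n^2 + 2ns) x n^2
    U, sv, _ = np.linalg.svd(T_mat, full_matrices=True)
    rank = int(np.sum(sv > 1e-10 * sv.max()))     # equals n^2 for minimal systems
    return theta0, U[:, rank:]                    # Q_perp has 2ns columns

def phi_D(tau, theta0, Q_perp, n, s):
    # The DDLC parametrization (10): map tau in R^{2ns} to a system (A, B, C).
    theta = theta0 + Q_perp @ tau
    A = theta[:n * n].reshape((n, n), order="F")
    B = theta[n * n:n * n + n * s].reshape((n, s), order="F")
    C = theta[n * n + n * s:].reshape((s, n), order="F")
    return A, B, C

For τ_D = 0 the map returns the initial system; a likelihood can then be optimized over the 2ns free coordinates τ_D.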

The intuitive motivation behind the DDLC approach is that, due to orthogonality to the tangent space, the numerical properties of optimization based estimators, such as the maximum likelihood estimator, are at least locally favourable. Comparisons with other parametrizations corroborate this notion (see e.g. [4] and [8]). In particular these comparisons show that echelon forms (whose parameters correspond to the usual ARMA parameters) are clearly outperformed. DDLC is now the default option in the system identification toolbox in MATLAB 6.x. The success of DDLC was the motivation for a careful investigation of the topological and geometrical properties of DDLC relevant for estimation, described in the next section.

3 Topological and geometrical properties of DDLC


Important properties of DDLC are summarized in the following theorem; see [5], [9].

Theorem 3.1. Let an initial minimal system (A, B, C) be given. Then the parametrization by DDLC as given in (10) has the following properties:

(i) T_D is an open and dense subset of R^{2ns}.

(ii) There exist open neighborhoods T_D^{loc} ⊆ T_D of 0 ∈ T_D and V_D^{loc} of π(A, B, C) in M(n) such that T_D^{loc} is identifiable, V_D^{loc} = π(φ_D(T_D^{loc})) and the mapping ψ_D^{loc} : V_D^{loc} → T_D^{loc} defined by ψ_D^{loc}(π(φ_D(τ_D))) = τ_D is a homeomorphism.

(iii) For n > 0, π(φ_D(T̄_D)) contains transfer functions of lower McMillan degree, where T̄_D denotes the closure of T_D.

(iv) There exists an open and dense subset V_D^{fin} of V_D such that for every k ∈ V_D^{fin} the corresponding equivalence class in T_D consists of a finite number of points.

(v) V_D° is dense in V_D, where V_D° denotes the interior of V_D in M(n). Additionally, V_D is open (and trivially dense) in π(φ_D(T̄_D)), but not necessarily open in M(n).

(vi) π(φ_D(T̄_D)) ⊆ V̄_D, where equality can hold, but the inclusion may also be strict.

In a certain sense this theorem is an analogue of the theorems given in [2] for the overlapping description of M(n) and for echelon forms. We give a short discussion of the consequences of the results of Theorem 3.1:

(i) Openness means that the parameters are free and in particular not restricted to a thin subset of R^{2ns}. This is an important requirement for gradient-type optimization procedures to work properly. Note that openness also holds if the stability assumption (4) and the miniphase assumption (5) are imposed. Clearly, denseness will then not hold.

(ii) states that there exist neighborhoods T_D^{loc} and V_D^{loc} where the parametrization is well-posed in the sense of being injective (and thus identifiable) and the parameters are attached to transfer functions in a continuous way. In particular, 'coordinate free' consistency of transfer function estimates in V_D^{loc} (see [2]) then implies consistency of the corresponding parameter estimates. However, we have no statements concerning the size of T_D^{loc} and V_D^{loc}, respectively.

(iii) For n > 0, the following holds: The closure T̄_D of the parameter space T_D (note that T̄_D = R^{2ns}) corresponds to transfer functions of equal and lower McMillan degrees. The equivalence classes in T̄_D \ T_D are generally given by nonlinear restrictions and are thus difficult to describe.

(iv) In general, T_D is not identifiable; as a 'second best' result, the equivalence classes for a generic subset V_D^{fin} are at least finite and thus consist of isolated points in T_D.

(v) deals with the structure of the set V_D; for a discussion of the relevance of (v) see [9].

(vi) The fact that V̄_D may contain more transfer functions than those described by the closure of the parameter space T_D can affect the actual estimation procedure. In that case the norm of the parameter vector may diverge to infinity whereas the corresponding sequence of transfer functions converges to a well defined transfer function estimate in M̄(n). Problems of nonconvergence of algorithms due to this phenomenon have actually puzzled researchers in the past when using echelon canonical forms.

4 Separable least squares for ML-type estimation


One way t o redu ce t he dimension of the par ameter space over which a likeli-
hood function has to be num erically optimized is to concent ra te out par am-
ete rs which ente r linearl y by an (ordinary or genera lized) least squa res ste p.
For the concent rated likelihood aga in the DDLC- approach is used ; see [5],
[8] and [7] .
Here, we commence from the inverse state space system

x_{t+1} = (A - BC) x_t + B y_t =: Ā x_t + B̄ y_t,
ε_t = -C x_t + y_t =: C̄ x_t + y_t,    (11)

and the corresponding parameters (Ā, B̄, C̄) are in a one-to-one relation with (A, B, C). The (Gaussian) conditional likelihood function is of the form

L_T(Ā, B̄, C̄, Σ) = log det Σ + (1/T) Σ_{t=1}^{T} tr{ ε_t(Ā, B̄, C̄) ε_t(Ā, B̄, C̄)' Σ^{-1} }    (12)

Substituting a consistent estimate Σ̂ for Σ, we get an approximation of this criterion function, which we again denote by L_T. Because B̄ enters linearly, we obtain

vec B̄ = (X'X)^{-1} X' y_T    (13)

by minimizing L_T with respect to B̄ for fixed Ā and C̄. Here, y_T = (y_1', ..., y_T')' is the stacked vector of observations and X depends on y_T, Ā, C̄ and Σ̂, and we assume that X has full column rank. This leads to the following new criterion function depending on Ā and C̄ only:
L_T(Ā, C̄) = (1/T) Σ_{t=1}^{T} tr{ ε_t(Ā, C̄) ε_t(Ā, C̄)' Σ̂^{-1} }    (14)

Given the (observable) pair (Ā, C̄), y_T and Σ̂, we obtain the original system by Δ_{y_T}(Ā, C̄) = (A, B, C) = (Ā - B̄C̄, B̄, -C̄). The pairs (Ā_1, C̄_1) and (Ā_2, C̄_2) are called observationally equivalent if they correspond to the same transfer function, i.e. if π(Δ_{y_T}(Ā_1, C̄_1)) = π(Δ_{y_T}(Ā_2, C̄_2)). If (Ā, C̄) is observable, then, under certain additional assumptions, all observationally equivalent pairs are given by E(Ā, C̄) = {(TĀT^{-1}, C̄T^{-1}) | T ∈ GL(n)}. E(Ā, C̄) is a real analytic manifold of dimension n², and the DDLC construction is performed again by taking the orthocomplement in R^{n²+ns} of the tangent space to E(Ā, C̄) at an initial point (Ā, C̄). Let us denote the new parameter space, where the non-minimal systems have been removed, again by T_D ⊆ R^{ns}, and let us put V_D = π(Δ_{y_T}(φ_D(T_D))). Here, φ_D is given by

τ_D ↦ ( vec Ā(τ_D), vec C̄(τ_D) )' = ( vec Ā, vec C̄ )' + Q⊥ τ_D,    (15)

where Q⊥ ∈ R^{(n²+ns)×ns} is now a matrix with orthonormal columns spanning the new orthocomplement of E(Ā, C̄) at the point (Ā, C̄); (15) is called the slsDDLC parametrization. For the following theorem see [7]:

Theorem 4.1. Let y_T and an initial (A, B, C) be given and let (Ā, B̄, C̄) denote the corresponding inverse system in (11). The parametrization by slsDDLC as given in (15) has the following properties:

(i) T_D is an open and dense subset of R^{ns}.



(ii) There exist open neighborhoods T_D^{loc} ⊆ T_D of 0 ∈ T_D and V_D^{loc} of π(A, B, C) in V_D such that T_D^{loc} is identifiable, V_D^{loc} = π(Δ_{y_T}(φ_D(T_D^{loc}))) and the mapping ψ_D^{loc} : V_D^{loc} → T_D^{loc} defined by ψ_D^{loc}(π(Δ_{y_T}(φ_D(τ_D)))) = τ_D is a homeomorphism.

(iii) π(Δ_{y_T}(φ_D(T_D ∩ T_X))) may (but need not necessarily) contain transfer functions of lower McMillan degree.

(iv) There exists an open and dense subset V_D^{fin} of V_D such that for every k ∈ V_D^{fin} the corresponding equivalence class in T_D consists of a finite number of points.

(v) π(Δ_{y_T}(φ_D(T_D ∩ T_X))) ⊆ V_D, where equality can hold, but the inclusion may also be strict.

Here, T_X denotes the (generic) set of parameters τ_D such that X has full column rank.

Note that here, as opposed to ordinary DDLC, V_D^{loc} is not open in M(n). An alternative procedure is to concentrate out C̄. In this case, ML estimation of Σ can also be incorporated; see [8].
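As a rough illustration of the separable least squares idea of (11)-(14), the following Python/NumPy sketch concentrates B̄ out of the criterion for fixed (Ā, C̄). It assumes the inverse system is started at x_1 = 0 and builds the regression via the identity vec(M B y) = (y' ⊗ M) vec(B); the naive quadratic-time loop, the function name concentrate_B and the treatment of Σ̂ are choices made here for illustration only.

import numpy as np

def concentrate_B(A_bar, C_bar, Y, Sigma_inv=None):
    # Concentrate out B_bar for fixed (A_bar, C_bar); Y is a T x s data array.
    # The state of x_{t+1} = A_bar x_t + B_bar y_t, eps_t = C_bar x_t + y_t
    # (with x_1 = 0) is linear in B_bar, so vec(B_bar) follows from least squares.
    T, s = Y.shape
    n = A_bar.shape[0]
    if Sigma_inv is None:
        Sigma_inv = np.eye(s)
    L = np.linalg.cholesky(Sigma_inv)             # Sigma_inv = L L'
    powers = [np.linalg.matrix_power(A_bar, j) for j in range(T)]
    rows, rhs = [], []
    for t in range(T):
        # C_bar x_t = ( sum_{j>=1} y_{t-j}' kron (C_bar A_bar^{j-1}) ) vec(B_bar)
        block = np.zeros((s, n * s))
        for j in range(1, t + 1):
            block += np.kron(Y[t - j][None, :], C_bar @ powers[j - 1])
        rows.append(L.T @ block)                  # whitened regressors
        rhs.append(L.T @ Y[t])                    # whitened y_t
    X = np.vstack(rows)                           # (T s) x (n s) design matrix
    y_stacked = np.concatenate(rhs)
    vecB, *_ = np.linalg.lstsq(X, -y_stacked, rcond=None)  # eps = y + X vec(B_bar)
    eps = y_stacked + X @ vecB                    # whitened residuals
    crit = float(eps @ eps) / T                   # concentrated criterion, cf. (14)
    return vecB.reshape((n, s), order="F"), crit

The returned criterion value can then be handed to a numerical optimizer that searches over the slsDDLC coordinates of (Ā, C̄) only.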

5 A numerical comparison
Eight different minimal, stable and strictly minimum phase state space models (A, B, C) with two outputs are specified. The models are denoted by M_1, ..., M_8 and are of order 2, 4, ..., 16. The poles and zeros are quite close to each other and close to the unit circle, but they do not cancel.

Simulation data for models M_1, ..., M_8 comprising T = 500 output observations are created, where the white noise sequence (ε_t) is chosen to be Gaussian with Σ = I_2.

In the next step, 50 random initial state space models are created by randomly perturbing the matrices (A, B, C) corresponding to the true systems. It is ensured that the perturbed models remain minimal, stable and minimum phase.

All computations are carried out using the system identification toolbox of the software package MATLAB, version 6.5.0.180913a (R13). The identification procedure itself is performed by using the built-in function pem. The option 'SearchDirection' is set to 'Gn' (a plain Gauss-Newton type algorithm is used for minimizing the criterion function). For a more detailed discussion of the simulation results presented below we refer to [6]. We confine ourselves to the following statements: slsDDLC leads to

• better success rates. Note that an identification experiment is considered to have failed if the algorithm yields a final parameter estimate with a likelihood value larger than 1.2 times the value of the asymptotic likelihood at the true system; see Table 1 (A). Note e.g. that for

slsDDLC only 1 out of 400 estimation runs failed, whereas usage of DDLC leads to 8 failed runs.

• fewer iterations until convergence due to the reduction of the dimension of the parameter space; see Table 1 (B).

• the lowest condition numbers of the Gauss-Newton approximation to the Hessian of the criterion function. Note that the echelon canonical form can lead to the highest condition numbers; see Table 1 (D).

• better estimates, i.e. better or at least equally good values of the likelihood function at convergence; see Table 1 (F).

In total, the echelon canonical form performs worst, and slsDDLC is slightly better than DDLC. However, the actual computations turn out to be more time-consuming for slsDDLC, but still remain within a feasible range.

A          M1     M2     M3     M4     M5     M6     M7     M8
Can         0     78     18      8     50     28     74     68
DDLC        0      0      4      0      0      0      4      8
slsDDLC     0      0      0      0      0      2      0      0

B          M1     M2     M3     M4     M5     M6     M7     M8
Can         8     24     18     20     27     35     35     28
DDLC        6     10     13      9     21     18     12     12
slsDDLC     8      9     10      8     16     13      8      8

C          M1     M2     M3     M4     M5     M6     M7     M8
Can         0     39     22     23     23     32     33     28
DDLC        0      0     12      0      0      0      6      8
slsDDLC     0      0      0      0      0      6      0      0

D
Can        1.4e+12   1.3e+10   4.9e+16   1.1e+17
DDLC       9.7e+4    1.1e+5    8.0e+3    1.0e+6
slsDDLC    3.5e+1    5.4e+2    2.8e+2    7.8e+2

D          M6        M7        M8
Can        1.3e+16   4.2e+16   6.1e+20
DDLC       7.9e+5    1.1e+6    3.8e+6
slsDDLC    3.3e+2    6.3e+2    3.2e+3

E          M1        M2        M3        M4
Can        0.        1.3e+17   3.9e+19   1.4e+19
DDLC       0.        0.        3.1e+5    0.
slsDDLC    0.        0.        0.        0.

E          M5        M6        M7        M8
Can        1.3e+21   3.3e+18   8.1e+21   1.6e+23
DDLC       0.        0.        1.1e+6    3.4e+5
slsDDLC    0.        5.8e+2    0.        0.

F          M1        M2        M3        M4
Can        8.02e-1   1.13      1.16      8.88e-1
DDLC       7.88e-1   1.09      1.11      8.18e-1
slsDDLC    7.75e-1   1.09      1.11      8.18e-1

F          M5        M6        M7        M8
Can        1.17      1.14      1.03      1.1
DDLC       1.08      1.04      9.03e-1   9.46e-1
slsDDLC    1.08      1.02      8.67e-1   9.35e-1

G          M1     M2     M3     M4     M5     M6     M7     M8
Can        0.     1.42   2.68   1.77   2.96   2.28   3.12   3.06
DDLC       0.     0.     1.34   0.     0.     0.     1.33   2.18
slsDDLC    0.     0.     0.     0.     0.     1.21   0.     0.

Table 1: Identification of ARMA-type models


(A) Percentage of failed runs out of 50 identification experiments.
(B) Average number of iterations for successful runs. Test cases with no successful runs are indicated by 0.
(C) Average number of iterations for failed runs. Test cases where no run failed are denoted by 0.
(D) Average maximum condition number of the Gauss-Newton approximations to the Hessians for successful runs. Test cases with no successful runs are indicated by 0.
(E) Average maximum condition number of the Gauss-Newton approximations to the Hessians for failed runs. Test cases where no run failed are indicated by 0.
(F) Average criterion value for successful runs.
(G) Average criterion value for failed runs.

References
[1] Deistler M. (2000). System identification - general aspects and structure. In G. Goodwin (ed.), System Identification and Adaptive Control, Springer, London, 3-26. (Festschrift for B.D.O. Anderson.)
[2] Hannan E.J., Deistler M. (1988). The statistical theory of linear systems. John Wiley & Sons, New York.
[3] McKelvey T., Helmersson A. (1997). System identification using an over-parametrized model class - improving the optimization algorithm. In Proc. 36th IEEE Conference on Decision and Control, San Diego, California, USA 3, 2984-2989.
[4] McKelvey T., Helmersson A., Ribarits T. (2004). Data driven local coordinates for multivariable linear systems and their application to system identification. Forthcoming in Automatica.
[5] Ribarits T. The role of parametrizations in identification of linear dynamic systems. PhD thesis, TU Wien.
[6] Ribarits T., Deistler M. (2003). A new parametrization method for the estimation of state-space models.
[7] Ribarits T., Deistler M., Hanzon B. (2004). An analysis of separable least squares data driven local coordinates for maximum likelihood estimation of linear systems. Submitted to Automatica.
[8] Ribarits T., Deistler M., Hanzon B. (2004). On new parametrization methods for the estimation of state-space models. Forthcoming in Intern. Journal of Adaptive Control and Signal Processing.
[9] Ribarits T., Deistler M., McKelvey T. (2004). An analysis of the parametrization by data driven local coordinates for multivariable linear systems. Automatica 40 (5), 789-803.

Address: M. Deistler, T. Ribarits, Institute for Mathematical Methods in Economics, Research Unit Econometrics and System Theory (EOS), Vienna University of Technology, Argentinierstrasse 8, 1040 Vienna, Austria
B. Hanzon, Mathematical Institute, Leiden University, P.O. Box 9512, 2300 RA Leiden, The Netherlands
E-mail: {Manfred.Deistler,Thomas.Ribarits}@tuwien.ac.at, bhanzon@math.leidenuniv.nl

STATISTICAL ANALYSIS OF HANDWRITTEN ARABIC NUMERALS IN A CHINESE POPULATION

Wing K. Fung, C.T. Yang, C.K. Li and N.L. Poon

Key words: Writing habits, Arabic numerals, statistical study, classification system, test for independence, probability of occurrence.
COMPSTAT 2004 section: E-statistics.

Abstract: A sample of 187 subjects from the Chinese population in Hong Kong was selected to participate in a handwriting study of Arabic numerals. Characteristic features such as slant, direction of writing, angularity of turnings, directions of initial and/or ending strokes etc. were developed. A set of characteristic codes representing the profile of writing habits was assigned to each writer. Hierarchical cluster analysis was conducted on the characteristic features, which were on a nominal scale, and hence subjects who had similar writing habits for Arabic numerals were grouped. Pearson's χ² tests for independence for pairs of feature variables were constructed. The independence property allows us to estimate the probability of occurrence for certain characteristic features of an Arabic numeral. An alternative way for estimation is suggested when some of the features are not statistically independent.

1 Introduction
Writing habit, being a product of long-term adaptation to the needs and abilities of the writer, is believed to be unique. Various classification systems for handwriting have been suggested; see [3] for a review. A system for the classification of handwritten numerals has been developed by Ansell and Strach [1], and Strach [6]. Recently, computer algorithms for extracting features from scanned images of handwriting were used by Srihari et al. [5] for the analysis of the individuality of handwriting.

In this paper, we analyse the characteristic features and codes of the Arabic numeral writings, i.e., 0, 1, ..., 9, of 187 subjects. We give a detailed description of the data collection and the methods of statistical analysis for the study. Hierarchical cluster analysis (Section 4.1) is conducted on the characteristic codes of the single numerals and the paired numerals. We define a cluster as the set of subjects having rescaled distance at the minimal level. From that, we can obtain the number of clusters in the dendrogram, which is useful for measuring the variability of the numeral(s) in question.

We are also interested in investigating whether the characteristic features within each numeral are statistically independent. If the features are independent, it would provide a simple method to estimate the relative frequency

or probability of occurrence of certain characteristic features, by simply multiplying the marginal probabilities for the individual features. χ² tests would be conducted for checking independence.

2 Data and methods


A sample of 187 subjects from the local Chinese population was selected to participate in the handwriting study. The subjects were asked to follow the instructions on the questionnaire provided to them. They had to write the Arabic numerals 0, 1, ..., 9 in their normal way using their own pens. The specimens were collected by the staff of the Hong Kong Government Laboratory for further examination. The collected numerals were then examined microscopically with the aid of a Nikon SMZ-2B microscope by professional document examiners of the Hong Kong Government Laboratory. Characteristic features such as slant, direction of writing, angularity of turnings, directions of initial and/or ending strokes etc. were selected and a code was then assigned to each characteristic feature. Each feature normally has 2-5 possible assignments of characteristic codes.
Hierarchical cluster analysis is a set of statistical techniques that is particularly useful for separating a set of objects into constituent groups or clusters which minimize the variation between members of the same group [8], without making assumptions about the number of groups or the group structure. Grouping of objects into clusters is done on the basis of similarities or distances [8], [2]. The analysis is a powerful exploratory method commonly employed in many disciplines. The clustering method processes the values of the measure of similarity among pairs of objects, generating a tree or dendrogram that shows the hierarchy of similarities among all pairs of objects.

Since hierarchical cluster analysis is mainly designed for quantitative measurements, the codes of the selected numeral were re-coded into a number of binary variables, with 1 referring to the presence of the code status and 0 otherwise. Take for example numeral "0": the original data size is 187 × 9 (number of subjects × number of characteristic features for "0"), and the recoded data has size 187 × 25 (25 codes for "0"). A proximity measure of dissimilarity that ranges from 0 to 1 is generated for the binary data [4], [7], [2]. The dissimilarity measure is taken as the pattern difference of the fourfold table, which is computed as bc/n², where b and c refer to the cells corresponding to features present on one object but absent on the other, and n (= 187) is the total number of objects involved in the study [7]. A hierarchical cluster analysis for binary data was then employed, using an algorithm that starts with all objects apart and merges the two nearest clusters until only one cluster is left [4]. The default measure of pattern difference in SPSS was adopted for the binary codes to quantify dissimilarity (notice that other similarity/dissimilarity measures have been attempted, and they give similar results to those reported in Section 4.1). Average linkage between objects was then taken to demonstrate the procedure of statistical classification. It

defines the distance between two clusters as the average of the distances between all pairs of cases in which one member of the pair is from each of the clusters. This method uses information about all pairs of distances, not just the nearest or the farthest, and so it is usually preferred in cluster analysis [7]. Subjects having a similar (or the same) way of numeral writing were grouped/clustered together. A tree diagram or dendrogram was selected to present the results of the cluster analysis; it is depicted horizontally with each row representing a case, and cases with high similarity are adjacent. The number of clusters and the cluster sizes were also measured.
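The recoding and clustering steps described above can be sketched as follows in Python with SciPy. The toy codes, the function names and the normalization of the pattern-difference measure by the number of binary variables in the fourfold table are illustrative assumptions and not taken from the study; the normalizing constant does not affect the resulting dendrogram in any case.

import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

def one_hot(codes):
    # Recode a nominal characteristic feature into 0/1 indicator columns.
    levels = sorted(set(codes))
    return np.array([[1.0 if c == lev else 0.0 for lev in levels] for c in codes])

def pattern_difference(u, v):
    # Pattern difference b*c / m^2 of the fourfold table of two binary profiles;
    # b and c count the cells where a code is present on one writer but not the other.
    b = np.sum((u == 1) & (v == 0))
    c = np.sum((u == 0) & (v == 1))
    return b * c / len(u) ** 2

# Toy data: 6 writers, two nominal features of numeral "4" (turning, loop).
raw = [("r", "n"), ("r", "n"), ("a", "n"), ("a", "y"), ("r", "y"), ("a", "n")]
X = np.hstack([one_hot([row[j] for row in raw]) for j in range(2)])

D = pdist(X, metric=pattern_difference)          # condensed dissimilarity matrix
Z = linkage(D, method="average")                 # average linkage between groups
print(fcluster(Z, t=0.0, criterion="distance"))  # writers with identical profiles

scipy.cluster.hierarchy.dendrogram(Z) can then be used to draw a tree like the ones in Figures 2 and 3.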
Another question of interest is whether the quantified features are statistically independent of one another. Each feature is a nominal variable which can take different possible codes (normally 2-5). Pearson's χ² independence test for each pair of features is conducted. The evaluation of the probability of occurrence of certain characteristic features will be much simplified if the independence assumption is found to be valid.

3 Summary statistics
The majority of the participants were of young to middle age (20-49). Only 4% and 2% were of age < 20 and age > 50, respectively. Nearly all (99%) of them were right-handed.

The assignment of characteristic features and codes is an important process in the project. Figure 1 shows an example of such an assignment for numeral 4. In this particular numeral, there are 8 characteristic features and each feature has 2-3 possible assignments of characteristic codes. Some of the features, such as features 1, 2, 3, and 8, are relatively easy to distinguish, while others may need some comparisons of the lengths of the measurements. In the lower half of Figure 1, we have identified the code for each feature of that particular numeral. Moreover, one other code for each of the features is also provided for easy understanding. Numeral 4 has 8 characteristic features with 19 characteristic codes.

Table 1 gives an overall summary of the number of characteristic features and codes for the studied numerals; the detailed characteristic features and codes are omitted for brevity. We can see that numeral "1" is the simplest numeral, as expected, and it has 4 characteristic features (numeral "6" too), while numerals 5, 8 and 9 have the most features and codes.

Numeral    0   1   2   3   4   5   6   7   8   9
Feature    9   4   7   8   8   9   4   9  12  10
Code      25  11  19  20  19  30  10  23  33  31

Table 1: Numbers of characteristic features and codes for numerals 0-9.

Feature of numeral 4 - code of the above numeral / one other code:

1. Turning to the left - Round (r) / Angular
2. Loop on the left - No (n) / Yes
3. Connection between horizontal and vertical strokes - Open (o) / Closed
4. Relation between slanting and vertical strokes - Open (o) / Closed
5. Ratio of vertical stroke above & below the horizontal stroke a/b - a Shorter (sh) / a Longer
6. Top part of vertical stroke relative to left slanting stroke - Shorter (sh) / Taller
7. Left slanting stroke (a) relative to portion of vertical stroke below horizontal stroke (b) - Shorter (sh) / a Longer
8. Ending of vertical stroke - Tapered (t) / Blunted

Figure 1: Assignment of characteristic features and codes for numeral 4.

4 Statistical analysis
4.1 Hierarchical cluster analysis
We employ hierarchical cluster analysis for grouping subjects in the writing of numeral "4". Figure 2 gives the dendrogram for the clustering of the subjects; for clarity, only the last 20 subjects were selected for classification. As noted from Figure 2, we have identified four tightly linked clusters, namely cluster a: subjects (14, 20, 1, 8); cluster b: (4, 7); cluster c: (10, 11); and cluster d: (3, 18, 9). The subjects within each cluster are very similar to each other and so they are grouped together. For example, we can see from the figure that subjects 14, 20, 1 and 8 are grouped together, and after checking the original data we found that they in fact wrote in (exactly) the same pattern when writing numeral "4". The same situation also happens to subjects 4 and 7, 10 and 11, and 3, 18 and 9. The dendrogram of the cluster analysis can give us information on the similarity/dissimilarity in the writing of "4" amongst the subjects.
[Dendrogram using average linkage (between groups); rescaled distance on the horizontal axis; cases from top to bottom: 14, 20, 1, 8, 19, 4, 7, 10, 11, 9, 18, 3, 12, 16, 17, 6, 2, 5, 13, 15.]

Figure 2: Clustering of the last 20 subjects for numeral 4.

We can also count the number of clusters formed in Figure 2. A cluster is defined as follows: a cluster is of size two or more subjects if its rescaled distance is at the minimal level, i.e. at the lowest part of the dendrogram, and a cluster is of size 1 otherwise. The clusters in Figure 2 are identified as C1: (14, 20, 1, 8), C2: (19), C3: (4, 7), C4: (10, 11), C5: (9, 18, 3), and subjects

12, 16, 17, 6, 2, 5, 13 and 15 each form the remaining clusters C6, C7, ..., C13, respectively. Using the same procedure, 16 clusters are identified for the paired numerals 4 and 7 in the dendrogram of Figure 3, which is discussed in more detail below.

Hierarchical cluster analysis is again conducted for the two numerals "4" and "7" and the results are shown in Figure 3. As we can see, there are only two tightly linked clusters, fewer than formed in Figure 2 for the single numeral "4". The clusters are cluster i: subjects (3, 18, 8, 12) and cluster ii: subjects (1, 20). In fact, after checking the original data, we found that there were some differences in the way of writing of numerals "4" and "7" for subjects 3, 18, 8 and 12 of cluster i. The two subjects 1 and 20 of the other cluster also did not write exactly the same numerals "4" and "7".
[Dendrogram using average linkage (between groups); rescaled distance on the horizontal axis; cases from top to bottom: 3, 18, 8, 12, 1, 20, 19, 4, 2, 6, 5, 14, 9, 11, 16, 7, 10, 15, 13, 17.]

Figure 3: Clustering of the last 20 subjects for numerals 4 and 7.

Table 2 summarizes the findings obtained from the hierarchical cluster analysis of single Arabic numerals. The number of clusters formed and the maximum and second maximum cluster sizes are listed for reference. According to the cluster analysis, numeral "1" is the simplest handwriting character amongst all: in total, there are only 36 clusters formed via the classification procedure, with merely 8 clusters containing 5 or more subjects and the largest cluster involving 63 homogeneous subjects. On the contrary, numeral "5", with 9 features and 30 codes, is the most informative character for distinguishing subjects' handwriting.

Numeral   No. of     No. of     No. of clusters   Max. size    Second
          features   clusters   with size ≥ 5     of cluster   size
0            9          85            7               13         13
1            4          36            8               63         27
2            7          82            8               20         18
3            8         123            3               19          7
4            8          81            8               23         14
5            9         139            0                4          4
6            4          69           10               23         19
7            9          97            8               17         10
8           12         108            4               20          7
9           10         115            6               10         10

Table 2: Summary findings for cluster analysis of single Arabic numerals.

The combined numerals increase the number of characteristic features and codes available for handwriting discrimination and enhance the heterogeneity among subjects. Table 3 summarizes the findings of the cluster analysis of two Arabic numerals, 4 combined with each of the others, demonstrating the reinforcement of dissimilarity between subjects in handwriting. Compared with the findings in Table 2, the combined Arabic numerals overall increase the number of clusters formed; that is, the subjects are more heterogeneous from one another than in the writing of a single numeral. It is to be noted that for the combined numerals 4 and 5, 175 clusters (out of 187 subjects) are identified, which indicates that the handwritings of the 187 subjects for numerals 4 and 5 together are nearly all different (in one or more characteristic features).

Numerals   No. of     No. of     No. of clusters   Max. size    Second size
           features   clusters   with size ≥ 5     of cluster   of cluster
0, 4         17         133            2                7            5
1, 4         12          98            6               11           11
2, 4         15         147            0                4            4
3, 4         16         147            2                6            5
4, 5         17         175            1                5            2
4, 6         12         135            3                6            6
4, 7         17         140            3                7            6
4, 8         20         166            1                6            4
4, 9         18         149            1                8            4

Table 3: Summary findings for cluster analysis of two Arabic numerals.

4.2 Tests for independence and probability assessment


Next we would like to investigate whether the features within each numeral are statistically independent. Let us consider the simplest numeral "1" first. The common χ² test for independence is employed. There are four characteristic features measured for numeral 1, including slant, initial hook, serif and ending position. Since a large majority of the subjects (99%) wrote numeral 1 without the serif feature, this feature is excluded from the χ² test, based on the common rule-of-five in applying the test. The other three features form 3 pairs of features, and the features of each pair are found to be statistically independent (each at the α = 5% significance level; details omitted). This sort of independence can be very important for evaluating the relative frequency or probability of occurrence of certain characteristic features. For example, because of independence, we may evaluate the probability as

P_123[Slant (S) = forward (f), Initial Hook (IH) = right (r), Ending Position (EP) = hook (h)]
  = P_1(S = f) P_2(IH = r) P_3(EP = h)
  = 0.69 × 0.14 × 0.25
  = 0.02415,

where the marginal probabilities 0.69, 0.14 and 0.25 are obtained by direct counting. Of course there are many assumptions behind this estimate, which might be rather crude. It is to be noted that pairwise independence does not imply that the features are all mutually independent. The estimate may also be adjusted upward if we want to make it conservative (forensic document examiners like to take the conservative approach in practice).
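The pairwise χ² tests and the product-rule estimate can be reproduced along the following lines in Python with SciPy; the toy feature codes below are invented for illustration and do not come from the study.

import numpy as np
from scipy.stats import chi2_contingency

def independence_test(f1, f2):
    # Pearson chi-square test of independence for two nominal feature columns.
    levels1, levels2 = np.unique(f1), np.unique(f2)
    table = np.array([[np.sum((f1 == a) & (f2 == b)) for b in levels2]
                      for a in levels1])
    chi2, p_value, dof, _ = chi2_contingency(table)
    return chi2, p_value

# Toy codes of numeral "1" for 10 writers: slant, initial hook, ending position.
slant = np.array(list("ffbfffbffb"))     # f = forward, b = backward
hook = np.array(list("rnnnrnnnnn"))      # r = right, n = none
end = np.array(list("hnnnhnnhnn"))       # h = hook, n = none

print(independence_test(slant, hook))    # test one pair of features

# Under (assumed) independence the joint relative frequency is the product of
# the marginal relative frequencies, as in the P_123 calculation above.
p_joint = np.mean(slant == "f") * np.mean(hook == "r") * np.mean(end == "h")
print(p_joint)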

       2     3     4     5     6     7     8
1    ind     D   ind     D     D     D   ind
2          ind   ind   ind   ind   ind   ind
3                  D     D     D   ind   ind
4                        D   ind   ind     D
5                              D     D   ind
6                                    D   ind
7                                         ind

Table 4: Results of χ² independence tests for pairs of features 1, ..., 8 of numeral "0" (for the meaning of the features see the text), each at the 5% significance level. D and ind stand for the features being statistically dependent and independent, respectively.

Next we investigate the numeral 0, which has nine features: (1) slant, (2) initial and ending strokes, (3) starting position, (4) ending position at

right, middle or left, (5) stroke crossing position, (6) ending position being tapering or blunt, (7) shape, (8) ending position at upper half, middle or lower, and (9) the writing direction, which is however omitted in our analysis because of the rule-of-five. The independence test results for the feature pairs are summarized in Table 4. It is interesting to note that only feature 2, initial and ending strokes (being open or closed), is statistically independent of all other features considered. Feature 8 is (pairwise) independent of all other features except 4. Thus, it seems difficult to use a (simple) product rule, as presented for numeral 1 where the assumption of feature independence is made, to estimate the relative frequency or probability of occurrence of particular characteristic features for numeral 0.
We suggest below an alternative way to (conservatively) estimate the probability of occurrence of the following feature codes for numeral 0:

P_12345678(f, c, l, m, l, t, e, u)
  = P_2(c) P_1345678(f, l, m, l, t, e, u)
  ≤ P_2(c) P_135678(f, l, l, t, e, u)
  = P_2(c) P_13567(f, l, l, t, e) P_8(u),

where P_2(c) and P_8(u) can be evaluated easily, and P_13567(f, l, l, t, e) can also be estimated by direct counting from the sample. It is noted that assumptions similar to those for numeral 1 may have to be made as well. Furthermore, we need to be aware that the overall level of significance is not equal to 5%, even though the individual level is set to 5% for each paired comparison, because of the multiple comparisons. Moreover, it may also not be reasonable to regard the features as being either all statistically independent or all dependent.
Another question of interest is the estimation of the probability of occurrence of characteristic feature codes of two or more numerals. We shall not attempt to answer this question, due to the possibly very complex dependence structure in the data.

5 Concluding remarks
We have investigated characteristic features of numerals 0, ..., 9. The Arabic numerals are chosen because they are commonly found in daily life. Hierarchical cluster analysis is used to classify subjects with similar handwriting features into groups. As expected, a subject is more difficult to cluster/group with others when more numerals are considered. In fact, the individuality of handwriting features may be identified in our sample when we consider 2 or 3 numerals together, such as 5, 8, and 9, which have more characteristic features. This phenomenon may also be of interest to document examiners.

The χ² tests are constructed to see if the features are statistically pairwise independent. The features (except the serif feature) in numeral 1 are independent, while some features in numeral 0 are dependent on one another.

This dependence structure would also be found in other numerals. However, it is still possible to find some independence structure in the features such that the probability of occurrence of some characteristic features can be estimated, whose possible limitations have to be kept in mind. Furthermore, it is suggested that the probability should be estimated for a single numeral, and not for two or more combined numerals.

References
[1] Ansell M., Strach S.J. (1975). The classification of handwriting numerals. Proceedings of the 7th Meeting of the IAFS, Zurich.
[2] Everitt B.S., Landau S., Leese M. (2001). Cluster analysis. 4th edition. Oxford University Press, New York.
[3] Huber R.A., Headrick A.M. (1999). Handwriting identification: facts and fundamentals. CRC Press, 152-164.
[4] Kaufman L., Rousseeuw P.J. (1990). Finding groups in data: an introduction to cluster analysis. Wiley, New York.
[5] Srihari S.N., Cha S.H., Arora H., Lee S. (2002). Individuality of handwriting. J. Forensic Sci. 47, 1-17.
[6] Strach S.J. (1998). Proposed research areas on handwriting comparison. International Journal of Forensic Document Examiners 4, 312.
[7] SPSS Inc. (1997). SPSS Base 7.5 applications guide. SPSS Inc., Chicago.
[8] Spath H. (1984). Cluster analysis algorithms. Ellis Horwood, Chichester.

Acknowledgement: The authors would like to thank a referee for helpful comments that improved the presentation of the paper, and D.G. Clarke, Government Chemist, and S.C. Leung, Assistant Government Chemist of Hong Kong, for their support and permission to use the data.
Address: W.K. Fung, C.T. Yang, Department of Statistics and Actuarial Science, The University of Hong Kong, Pokfulam Road, Hong Kong; C.K. Li and N.L. Poon, Questioned Documents Section, Hong Kong Government Laboratory.
E-mail: wingfung@hku.hk

METHODS AND ALGORITHMS FOR ROBUST FILTERING

Ursula Gather and Roland Fried

Key words: Signal extraction, drift, edge, outlier, update algorithm.
COMPSTAT 2004 section: Robustness.

Abstract: We discuss filtering procedures for the robust extraction of a signal from noisy time series. Moving averages and running medians are standard methods for this, but they have shortcomings when large spikes (outliers) or trends occur, respectively. Modified trimmed means and linear median hybrid filters combine advantages of both approaches, but they do not completely overcome the difficulties. Improvements can be achieved by using robust regression methods, which work even in real time because of increased computational power and faster algorithms. Extending recent work, we present filters for robust online signal extraction and discuss their merits for preserving trends, abrupt shifts and extremes and for the removal of spikes.

1 Introduction
In speech recognition, video transmission and intensive care monitoring the basic task is to extract a signal from the observed noisy time series. The signal is assumed to vary smoothly most of the time, with a few abrupt shifts. Besides the attenuation of normal observational noise and the removal of outlying spikes for recovering smooth sequences, the preservation of the locations and heights of shifts and local extremes is important. All this needs to be done automatically and in real time with short delays. This increases the risk of confusing outlier sequences with shifts or local extremes. For distinguishing extremes and outliers we rely on the smoothness of the underlying signal, i.e. observations which are far away from an estimated signal value are treated as outliers and not as being due to a signal peak. We can identify shifts by their duration, setting a lower limit for the length of a relevant shift.

Moving averages and other linear filters are popular for signal extraction as they recover trends and are very efficient in Gaussian samples, but they are highly vulnerable to outliers and they blur level shifts. Tukey [16] suggests running medians for removing outliers and preserving level shifts, but standard medians have deficiencies in trend periods [8]. Linear median hybrid filters [10], [11] have been suggested as they are computationally more efficient than running medians, and preserve shifts similarly well or even better. These filters track polynomial trends, but they can only remove single isolated outliers. Modified trimmed mean filters are another compromise between running means and running medians. They choose an adaptive amount of trimming, but like running medians they also deteriorate in trend periods.

A better solution for tracking trends is to replace the median, a robust location estimator, by the estimated intercept obtained by robust regression of the data in a moving window against time. Based on a comparison of functionals with high breakdown point, Davies, Fried and Gather [8] recommend Siegel's [15] repeated median because of its robustness against outliers and its stability. Since larger outliers have stronger effects on the repeated median, we can add automatic rules for online trimming of outliers and construct procedures which are almost as bias-robust as filters based on least median of squares regression [13], but considerably faster and more efficient for Gaussian samples [5]. The Q_n method [3], [14] has very nice properties for scale estimation even when a level shift occurs [9].

Robust regression also allows us to construct hybrid filters which have similar benefits as linear median hybrid filters, while being considerably more robust [6]. Procedures applying adaptive trimming which do not deteriorate in trend periods can also be derived [7].

In Section 2 we introduce the filtering procedures. In Section 3 we discuss computational and other aspects. In Section 4 we propose a robust rule for the adaptive choice of the window widths. In Section 5 we analyze real and simulated data for further comparison before we give some conclusions.

2 Methods for robust filtering


We assume a component model for the sequence (x_t) of observed data:

x_t = μ_t + u_t + v_t,   t ∈ Z.    (1)

The underlying signal μ_t is the level of the time series, which is assumed to vary smoothly with a few sudden changes, while u_t is additive noise from a symmetric distribution with mean zero and variance σ², and v_t is impulsive (spiky) noise from an outlier generating mechanism. For online signal extraction we move a time window of width n = 2k + 1 through the series and use x_{t-k}, ..., x_{t+k} to approximate μ_t. This causes a time delay of k observations. Firstly we fix k to a given value for all filters.

2.1 Filters based on robust regression


A standard median filter (running median) approximates the signal μ_t by the median of the observations {x_{t-k}, ..., x_{t+k}} within a moving time window, where μ_t is regarded as the level of the series at time point t, which is assumed to be locally constant. For tracking trends, Davies et al. [4] suggest fitting a local linear trend μ_{t+i} = μ_t + iβ_t, i = -k, ..., k, to {x_{t-k}, ..., x_{t+k}} by robust regression and recommend Siegel's [15] repeated median (RM). When applied to the data (i, x_{t+i}), i = -k, ..., k, the RM reads

β̂_t^RM = med_{i=-k,...,k} { med_{j=-k,...,k, j≠i} (x_{t+i} - x_{t+j}) / (i - j) },
μ̂_t^RM = med{ x_{t+i} - i β̂_t^RM, i = -k, ..., k },

and the filter output is RM(x_t) = μ̂_t^RM.
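A direct (non-optimized) Python implementation of the repeated median fit and of the resulting delayed filter output might look as follows; it recomputes the fit from scratch in every window and is therefore only a sketch, not one of the update algorithms discussed in Section 3.

import numpy as np

def repeated_median(window):
    # Repeated median slope and level for one window x_{t-k}, ..., x_{t+k}.
    n = len(window)
    k = n // 2
    idx = np.arange(-k, k + 1)
    slopes = []
    for i in range(n):
        pairwise = [(window[i] - window[j]) / (idx[i] - idx[j])
                    for j in range(n) if j != i]
        slopes.append(np.median(pairwise))
    beta = np.median(slopes)                       # RM slope
    mu = np.median(window - idx * beta)            # RM level at the window centre
    return mu, beta

def rm_filter(x, k):
    # Signal extraction with delay k: RM level for every complete window.
    x = np.asarray(x, dtype=float)
    out = np.full(len(x), np.nan)
    for t in range(k, len(x) - k):
        out[t], _ = repeated_median(x[t - k:t + k + 1])
    return out

# Example: linear trend plus noise and a patch of three spikes.
rng = np.random.default_rng(0)
x = 0.1 * np.arange(100) + rng.normal(0.0, 1.0, 100)
x[40:43] += 15.0
print(np.round(rm_filter(x, k=10)[35:50], 2))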

2.2 Filters based on trimming

Lee and Kassam [12] suggest modified trimmed mean (MTM) filtering as a compromise between running means and running medians. MTM filters regulate the amount of trimming depending on the data. Firstly the local median μ̃_t and the local median absolute deviation about the median (MAD) σ̃_t are calculated; then all observations farther away from the median than a multiple q_t = d σ̃_t of the MAD are trimmed. Finally, the average of the remaining observations is taken as filter output:

MTM(x_t) = (1/n_t) Σ_{i=-k}^{k} x_{t+i} · 1_{[μ̃_t - q_t, μ̃_t + q_t]}(x_{t+i}),
n_t = #{ x_{t+i} ∈ [μ̃_t - q_t, μ̃_t + q_t], i = -k, ..., k },
q_t = d · c_n · med{ |x_{t-k} - μ̃_t|, ..., |x_{t+k} - μ̃_t| }.

Here, c_n is a correction factor, which is chosen to achieve unbiasedness for Gaussian noise. For a very large window width we get c_n = 1.483, while e.g. for n = 21 we have c_n = 1.625. For d = 0, MTM(x_t) is a running median, while for d = ∞ we get a moving average.
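A corresponding Python sketch of the MTM filter is given below; the default d = 2 is an example choice, and the asymptotic correction factor c_n = 1.483 is used although, as noted above, c_n depends on n (e.g. c_n = 1.625 for n = 21).

import numpy as np

def mtm_filter(x, k, d=2.0, c_n=1.483):
    # Modified trimmed mean: trim observations farther than q_t = d * c_n * MAD
    # from the local median and average the remaining ones.
    x = np.asarray(x, dtype=float)
    out = np.full(len(x), np.nan)
    for t in range(k, len(x) - k):
        w = x[t - k:t + k + 1]
        med = np.median(w)
        q = d * c_n * np.median(np.abs(w - med))
        keep = np.abs(w - med) <= q
        out[t] = w[keep].mean()                  # keep is never empty (|med - med| = 0)
    return out

# d = 0 reproduces the running median, a very large d a moving average.
rng = np.random.default_rng(1)
x = np.sin(np.arange(200) / 30.0) + rng.normal(0.0, 0.2, 200)
x[90:92] += 5.0
print(np.round(mtm_filter(x, k=10)[85:100], 2))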

MTM filters implicitly assume a location model, as standard median filters do. A straightforward modification is to fit a local linear trend by the repeated median and trim those observations having large residuals in this regression setting. The local variability can be estimated by applying the MAD to the regression residuals [7]. The filter output can then be derived either by least squares regression or by the repeated median of the observations with moderately large residuals. We denote the resulting filters by

TRM and MRM, respectively:

J_t = { j = -k, ..., k : |x_{t+j} - μ̂_t^RM - j β̂_t^RM| ≤ q_t },
TRM(x_t) = x̄_t - j̄_t β̂_t^LS,   β̂_t^LS = Σ_{j∈J_t} (j - j̄_t)(x_{t+j} - x̄_t) / Σ_{j∈J_t} (j - j̄_t)²,
MRM(x_t) = med{ x_{t+j} - j β̂_t^MRM, j ∈ J_t },
β̂_t^MRM = med_{i∈J_t} { med_{j∈J_t, j≠i} (x_{t+i} - x_{t+j}) / (i - j) },

where j̄_t and x̄_t denote the averages of j and x_{t+j} over j ∈ J_t, and (μ̂_t^RM, β̂_t^RM) is the repeated median level and slope estimate for the current time window {x_{t-k}, ..., x_{t+k}}.

2.3 Hybrid filters


Linear median hybrid filters are combinations of linear and median filters [10], [11]. Linear subfilters are applied to the input data before taking the median of their outcomes as final filter output. This reduces computation time and increases the flexibility compared to standard median filters due to the variety of linear subfilters. Linear median hybrid filters with finite impulse response, briefly FMH filters, are characterized by subfilters which respond to a finite number of impulses only.

A simple FMH filter corresponds to a location model and applies two one-sided moving averages and the current observation x_t as central subfilter for edge preservation:

FMH(x_t) = med{ Φ_1(x_t), x_t, Φ_2(x_t) },   Φ_1(x_t) = (1/k) Σ_{i=1}^{k} x_{t-i},   Φ_2(x_t) = (1/k) Σ_{i=1}^{k} x_{t+i}.

Predictive FMH filters correspond to a linear trend model and apply predictive FIR subfilters for one-sided extrapolation of a trend:

PFMH(x_t) = med{ Φ_F(x_t), x_t, Φ_B(x_t) },   Φ_F(x_t) = Σ_{i=1}^{k} h_i x_{t-i},   Φ_B(x_t) = Σ_{i=1}^{k} h_i x_{t+i}.

Choosing the weights h_i = (4k - 6i + 2) / (k(k - 1)), i = 1, ..., k, results in the minimal mean square error (MSE) predictions for a linear trend which is disturbed by white noise [11].

Combined FMH filters use predictions of different degrees,

CFMH(x_t) = med{ Φ_F(x_t), Φ_1(x_t), x_t, Φ_2(x_t), Φ_B(x_t) },

with Φ_1(x_t), Φ_2(x_t), Φ_F(x_t) and Φ_B(x_t) being the subfilters for forward and backward extrapolation of a constant signal and a linear trend as given above.
In view of increased computational power and because of improved algorithms, computation time is nowadays not a great problem. Fried, Bernholt and Gather [6] use half-window medians and repeated medians to construct robust hybrid filters:

PRMH(x_t) = med{ RM_F(x_t), x_t, RM_B(x_t) },
CRMH(x_t) = med{ RM_F(x_t), μ̃_t^F, x_t, μ̃_t^B, RM_B(x_t) }.

Here, μ̃_t^F = med{x_{t-k}, ..., x_{t-1}} and μ̃_t^B = med{x_{t+1}, ..., x_{t+k}} are half-window medians, while RM_F(x_t) and RM_B(x_t) estimate the level at time t using the repeated median of x_{t-k}, ..., x_{t-1} and x_{t+1}, ..., x_{t+k}, respectively:

RM_F(x_t) = med{ x_{t-k} + k β̂_t^F, ..., x_{t-1} + β̂_t^F },
β̂_t^F = med_{i=-k,...,-1} { med_{j=-k,...,-1, j≠i} (x_{t+i} - x_{t+j}) / (i - j) },
RM_B(x_t) = med{ x_{t+1} - β̂_t^B, ..., x_{t+k} - k β̂_t^B },
β̂_t^B = med_{i=1,...,k} { med_{j=1,...,k, j≠i} (x_{t+i} - x_{t+j}) / (i - j) }.
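The predictive repeated median hybrid filter PRMH can be sketched in Python as follows; the helper half_window_rm recomputes the half-window repeated median slopes naively and is, again, only an illustration.

import numpy as np

def half_window_rm(values, positions):
    # Repeated median slope for observations at the given (signed) positions.
    slopes = []
    for i in range(len(values)):
        pairwise = [(values[i] - values[j]) / (positions[i] - positions[j])
                    for j in range(len(values)) if j != i]
        slopes.append(np.median(pairwise))
    return np.median(slopes)

def prmh(window):
    # PRMH(x_t) = med{ RM_F(x_t), x_t, RM_B(x_t) } for one window x_{t-k}, ..., x_{t+k},
    # with half-window repeated median extrapolations from the past and the future.
    k = len(window) // 2
    past, future = window[:k], window[k + 1:]
    pos_past, pos_future = np.arange(-k, 0), np.arange(1, k + 1)
    beta_f = half_window_rm(past, pos_past)
    beta_b = half_window_rm(future, pos_future)
    rm_f = np.median(past - pos_past * beta_f)      # forward extrapolation to time t
    rm_b = np.median(future - pos_future * beta_b)  # backward extrapolation to time t
    return np.median([rm_f, window[k], rm_b])

# A single spike at the centre of the window is removed, the linear trend is kept.
rng = np.random.default_rng(2)
x = 0.05 * np.arange(31) + rng.normal(0.0, 0.5, 31)
x[15] += 10.0
print(round(prmh(x), 2))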

3 Comparison of different filtering procedures


In the following we compare the previous filtering procedures w.r.t. computation time and their analytical properties.

3.1 Computation
The time needed for the filtering is crucial in real time applications. Fast algorithms for the update of the filter output are needed for online signal extraction. Denoting the length of the time window by n, the median of the proceeding window can be updated in logarithmic time (O(log n)) using linear space if the data in the window are stored in sorted order using a red-black tree [2, Section 15.1]. This improves on the linear time needed for calculating the median from scratch.

An algorithm for the update of the repeated median in linear time using quadratic space based on a hammock graph is proposed by Bernholt and Fried [1], and another update algorithm needing only linear space and running in O(n log n) average time is presented by Fried, Bernholt and Gather [6].

Updating the residuals and calculating the MAD can be done in linear time. Hence, the MTM and the TRM can both be calculated in linear time. For the MRM, however, O(n²) time is needed, at least for the second repeated median. Detailed descriptions of the update algorithms can be found in Fried, Bernholt and Gather [6], [7].

The table given below summarizes the time and the space needed for the updates of the filtering procedures. Note that the space for the repeated median and both repeated median hybrid filters can be reduced to O(n), but at the expense of larger computation times.

         SM         RM       MTM     MRM     TRM      FMH     RMH
time     O(log n)   O(n)     O(n)    O(n²)   O(n)     O(1)    O(n)
space    O(n)       O(n²)    O(n)    O(n)    O(n²)    O(n)    O(n)

Table 1: Time and space needed for the update of the filters.

3.2 Analytical properties


For a discussion of the filtering procedures we concentrate on the analytical properties within a single time window when being applied to data generated from the component model (1).

Equivariance and invariance are important properties of statistical procedures. Location equivariance means that adding a constant to all observations in a window changes the filter output accordingly. Scale equivariance means that multiplication of all observations by a constant changes the estimate in the same way. All the above procedures possess these two properties.

Only some of the procedures are trend invariant, however [6], [7]. This property means that the extracted level does not change when adding a linear trend as long as the central level is fixed. The RM, the PFMH, the PRMH, the TRM and the MRM are trend invariant, while the median, the MTM, the CFMH and the CRMH are not. Therefore, for the latter methods the efficiency, the removal of spikes and the preservation of shifts are influenced by underlying trends.

Filters which are not trend invariant blur e.g. upward shifts within downward trends. Although the median and the MTM can remove k spikes completely in a single time window from a constant signal if there is no observational noise (σ² = 0), even a single positive outlier within a downward trend causes smearing. The predictive FMH can remove a single spike and preserve a shift exactly within a linear trend irrespective of the directions, as it is trend invariant, while the combined FMH does so only if the outlier (shift) has the same direction as the trend. The RMH filters improve on the FMH filters as they can remove up to ⌊k/2⌋ subsequent spikes without any effect. Furthermore, the predictive RMH preserves shifts exactly, while for the combined RMH this is true only if the shift is in the same direction as the trend,

just like for the combined FMH. The RM, the TRM and the MRM can even remove k - 1 spikes completely within a single time window irrespective of a linear trend if σ² = 0.

The previous results hold when there is no observational noise. Lipschitz continuity restricts the influence of minor changes in the data due to small noise or rounding. The standard median, the FMH, the RM and the RMH filters are Lipschitz-continuous. The median is Lipschitz-continuous with constant 1, like all order statistics, while the repeated median and the repeated median hybrid filters are Lipschitz-continuous with constant 2k + 1. An FMH filter is Lipschitz-continuous with constant max |h_i|, the maximal absolute weight given by a subfilter. MTM, MRM and TRM filters, however, are not Lipschitz-continuous, which can cause instabilities when there are small changes in the data. The discontinuity is caused by the trimming of observations. Application of continuous M-estimators is preferable for this reason, but computationally more expensive. Nevertheless, we investigate simpler trimming based methods in order to obtain information about the possible gain by further iterations.

The finite-sample breakdown point (FSBP) is the fraction of observations which have to be put into worst case positions in order to make the estimate take arbitrarily wrong values. For the median the breakdown point becomes (k + 1)/n when applied to n = 2k + 1 data points, meaning that at least half of the window needs to be outlying in order to cause an arbitrarily large spike in the extracted signal. Since for the explosion of the local MAD also at least k + 1 observations need to be modified, the MTM has the same breakdown point, while for the FMH filters two outliers are sufficient to make them break down. From the following table we see that the RMH filters are considerably more robust than the FMH filters, and that the RM, TRM and MRM are almost as robust as the median in the sense of breakdown.

Table 2: Fraction of outliers in a window causing breakdown.

Simulations show the effect of the second step in the derivation of the
TRM and the MRM on their MSE as compared to that of the RM filter .
Application of least squares to the trimmed observations (TRM) increases
the efficiency for Gaussian noise , but almost preserves the robustness of the
repeated median, while application of the repeated median (MRM) further
reduces the bias caused by outliers [7].

4 Adaptive choice of the window width


From the previous discussion we see that only the repeated median and the
predictive hybrid filters PFMH and PRMH are both trend invariant and

continuous, i.e. stable w.r.t. the occurrence of both trends and small changes
in the data. The hybrid filters tend to preserve shifts and extremes, while the
repeated median smoothes them considerably when being applied with a large
window width [6], [7] . This means that on the one hand we should choose
a short window width, but on the other hand a large window width is better
for removing outlier patches and for the attenuation of the observational
noise. This is a robust variant of the common problem of bandwidth selection
in nonparametric smoothing.
Fried [5] investigates rules for online shift detection based on the most recent residuals in the time window. Similarly, we can formulate rules for the automatic choice of the window width using the regression residuals. Often least squares criteria are used to assess the local model fit and to find the bandwidth, but this is not suitable when outliers are present. Instead, a robust criterion is needed. Remembering that the median is the value which balances the signs of the residuals and that the repeated median is a regression analogue, it is natural to use the signs of the residuals. In this way we give the same weight to all observations irrespective of their magnitude. However, note that there are always as many positive as negative residuals in the window for the repeated median fit. Therefore, we have to apply this idea to a suitable subset.
Figure 1 below visualizes the smoothing of a maximum by fitting a line with a too large window width. The residuals in the center will typically be positive, while most of the residuals at the start and the end of the window will be negative. These signs are simply all reversed for a minimum. Therefore it is natural to use the total number of positive residuals at the start and the end of the window for assessing the model fit. We divide the window into three sections, namely the first ⌊(k + 1)/2⌋ observations, the central n − 2⌊(k + 1)/2⌋ observations and the last ⌊(k + 1)/2⌋ observations. If the total number T of positive residuals in the first and the last section is much larger than the average ⌊(k + 1)/2⌋, we should shorten the window width since the signal slope might be decreasing substantially within the window. If T is much smaller than ⌊(k + 1)/2⌋, the window width should also be shortened since the signal slope might be increasing.
However, this reduction should not result in a window width which is too small to resist outlying patterns. Results of previous studies [6], [7] show that the repeated median resists up to between 25% and 30% outliers without being substantially affected. Therefore, the minimal window width should be about four times the maximal length of outlier patches to be removed. For patches of length three, e.g., we use the constraint n ≥ 11. Since longer time windows allow better attenuation of observational noise and also robustness against many outliers, we increase the window width after each step whenever possible.
Figure 1: Smoothing of a maximum by fitting a line to the filled points.

The proposed repeated median algorithm with robust adaptive selection of the window width is as follows: Let k_l < k_u be lower and upper bounds for k, and 0 ≤ d_l < 1 < d_u ≤ 2 be constants. Set k = k_l and t = k + 1.


1. Calculate the repeated median fit (μ̂_t, β̂_t) for x_{t−k}, ..., x_{t+k} to obtain RM(x_t) = μ̂_t.
2. Get the residuals r_i = x_{t+i} − μ̂_t − i·β̂_t, i = −k, ..., k, and set T = #{ i = −k, ..., −k − 1 + ⌊(k + 1)/2⌋, k + 1 − ⌊(k + 1)/2⌋, ..., k : r_i > 0 }.
3. If k > k_l and T < d_l·⌊(k + 1)/2⌋ or T > d_u·⌊(k + 1)/2⌋, set k = k − 1 and go to 1.
4. If k < k_u, set k = k + 1.
5. Set t = t + 1 and go to 1.

The same or similar approaches can be used for the other robust filters; we just need to modify the window sections for the hybrid filters, possibly obtaining asymmetric filters.
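To make the procedure concrete, the following minimal Python sketch implements Siegel's repeated median fit and the adaptive window rule of steps 1-5 above. The default constants are those used in the example of Section 5; the naive O(n²) fit per window (rather than the linear-time update of [1]) and the purely offline looping are simplifying assumptions of this sketch, not the authors' implementation.

import numpy as np

def repeated_median_fit(window):
    """Siegel's repeated median regression fit on a window x_{t-k}, ..., x_{t+k}.
    Returns the level and slope estimates at the window centre (index i = 0)."""
    window = np.asarray(window, dtype=float)
    n = len(window)
    k = n // 2
    idx = np.arange(-k, k + 1)
    slopes = [np.median([(window[j] - window[i]) / (idx[j] - idx[i])
                         for j in range(n) if j != i]) for i in range(n)]
    beta = np.median(slopes)                  # repeated median slope
    mu = np.median(window - beta * idx)       # level at the window centre
    return mu, beta

def adaptive_rm_filter(x, k_lo=5, k_hi=15, d_lo=0.7, d_hi=1.3):
    """Repeated median filter with robust adaptive choice of the window width,
    following steps 1-5 above (offline sketch, no boundary handling)."""
    x = np.asarray(x, dtype=float)
    k, t, out = k_lo, k_lo, []
    while t + k < len(x):
        while True:
            idx = np.arange(-k, k + 1)
            window = x[t - k:t + k + 1]
            mu, beta = repeated_median_fit(window)
            res = window - mu - beta * idx
            half = (k + 1) // 2               # floor((k + 1) / 2)
            # positive residuals in the first and in the last window section
            T = np.sum(res[:half] > 0) + np.sum(res[-half:] > 0)
            if k > k_lo and (T < d_lo * half or T > d_hi * half):
                k -= 1                        # shorten the window and refit
            else:
                break
        out.append(mu)                        # RM(x_t)
        if k < k_hi:
            k += 1                            # grow the window again if possible
        t += 1
    return np.array(out)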

5 Application
We now apply the filt ering procedures to two data set s. The first exa mple is a
time series simul at ed from an underlying sawt oot h signal, which is overlaid by
Gaus sian whit e noise with zero mean and unit variance , and t here are t hree
isolated , three pair s and two triples of outliers of size -5. The Fi gur e below
shows t he outputs of the CRMH and the adap ti ve RM filter wit h kl = 5,
ku = 15, dl = 0.7 and du = 1.3. The CRMH with n = 21 pr eserves t he local
ext remes very well, but it is rather vari abl e. The adapt ive RM is almost as
good at the extremes while being much smoo ther . Most of t he t ime a width
close to the maximal n = 31 is chosen , but close t o the three local ext remes
and at about t = 280 the width decreases even to th e minimal n = 11. The
PRMH not shown here is similar to the CRMH, but it is mor e affected by
the out liers, while t he ordinar y RM and the median cut t he extremes.
Figure 2: Simulated time series (dotted), underlying signal (dashed) and outputs of the CRMH (thin solid) and the RM with adaptive window width (bold solid).

As a second example we analyze five hours of measurements of the arterial blood pressure of an intensive care patient. Figure 3 visualizes these data
along with the outcomes of the MRM with a window width of n = 21 and of
the adaptive RM filter with the same constants as before. The MRM resists
some aberrant patterns very well, but it oversmoothes the local extremes
at t = 70 and at t = 290. The adaptive RM again chooses the largest
width n = 31 most of the time, but the width drops down to n = 17 around t = 175 and t = 225, and even to the minimal n = 11 around t = 60 and t = 130. It performs better at the extremes than the MRM, but it is affected by two subsequent outlying patterns around t = 180. The RM with fixed window width
also shows a spike there and performs in between the adaptive RM and the
MRM at the extremes.

Figure 3: Arterial blood pressure (dotted) and outputs of the MRM (bold solid) and the RM with adaptive window width (thin solid).

6 Conclusion

Improved numerical algorithms render the real-time application of robust procedures for time series filtering possible. Methods for robust regression like the repeated median allow the construction of filters which have benefits similar to those of classical linear or location-based approaches when these perform well, but which overcome their deficiencies w.r.t. the removal of spiky noise (outliers) or the tracking of trends. We find the repeated median procedure with robust adaptive choice of the window width particularly promising. First applications show that this algorithm can be modified even for online filtering without any time delay by estimating the intercept at the right-hand side of the time window, but more experience is needed to optimize the automatic choice of the window width in that case.

References
[1] Bernholt T., Fried R. (2003). Computing the update of the repeated median regression line in linear time. Information Processing Letters 88, 111-117.
[2] Cormen T.H., Leiserson C.E., Rivest R.L. (1990). Introduction to algorithms. MIT Press, Cambridge, Massachusetts, and McGraw-Hill Book Company, New York.
[3] Croux C., Rousseeuw P.J. (1992). Time-efficient algorithms for two highly
robust estimators of scale. COMPSTAT 1992, Physica-Verlag, Heidelberg,
411-428.
[4] Davies P.L., Fried R., Gather U. (2004). Robust signal extraction for on-line monitoring data. J. Statistical Planning and Inference 122, 65-78.
[5] Fried R. (2004). Robust filtering of time series with trends. Current Advances and Trends in Nonparametric Statistics, special issue of the J. of Nonparametric Statistics, to appear.
[6] Fried R., Bernholt T., Gather U. (2004a). Repeated median and hybrid filters. Technical Report, SFB 475, University of Dortmund, Germany.
[7] Fried R., Bernholt T., Gather U. (2004b). Modified repeated median filters. Preprint, Department of Statistics, University of Dortmund, Germany.
[8] Fried R., Gather U. (2002). Fast and robust filtering of time series with trends. COMPSTAT 2002, Physica-Verlag, Heidelberg, 367-372.
[9] Gather U., Fried R. (2003). Robust estimation of scale for local linear temporal trends. Proceedings of PROBASTAT 2002, Tatra Mountains Mathematical Publications 26, 87-101.
[10] Heinonen P., Neuvo Y. (1987). FIR-median hybrid filters. IEEE Transactions on Acoustics, Speech, and Signal Processing 35, 832-838.
[11] Heinonen P., Neuvo Y. (1988). FIR-median hybrid filters with predictive FIR substructures. IEEE Transactions on Acoustics, Speech, and Signal Processing 36, 892-899.
[12] Lee Y., Kassam S. (1985). Generalized median filtering and related nonlinear filtering techniques. IEEE Transactions on Acoustics, Speech, and Signal Processing 33, 672-683.
[13] Rousseeuw P.J. (1984). Least median of squares regression. J. American Statistical Association 79, 871-880.
[14] Rousseeuw P.J., Croux C. (1993). Alternatives to the median absolute deviation. J. American Statistical Association 88, 1273-1283.
[15] Siegel A.F. (1982). Robust regression using repeated medians. Biometrika 68, 242-244.
[16] Tukey J.W. (1977). Exploratory data analysis. Addison-Wesley, Reading, Mass. (preliminary ed. 1971).

Acknowledgement: The financial support of the Deutsche Forschungsgemeinschaft (SFB 475, "Reduction of complexity in multivariate data structures") is gratefully acknowledged.
Address: U. Gather, R. Fried, Department of Statistics, University of Dortmund, 44221 Dortmund, Germany
E-mail: gather@statistik.uni-dortmund.de

USING GO FOR STATISTICAL ANALYSES


Robert Gentleman
Key words: Ontology, bioinformatics, graphs.
COMPSTAT 2004 section: Biostatistics.

Abstract: In this paper we use meta-data packages from the Bioconductor Project to carry out statistical analyses of gene expression data. We would like to note, however, that the potential scope of these applications is much broader: many of the methods described here could be applied to other types of high-throughput data. To provide context we make use of data from an investigation into acute lymphoblastic leukemia.

1 Introduction
While there are a number of different definitions of an ontology, we will use the notion of a restricted vocabulary as the basis for the discussion here. Ontologies and related concepts are becoming increasingly important tools for organizing and navigating information. Initiatives in biology (our main focus) as well as the semantic web are providing a variety of resources and interesting problems related to ontologies.
For genes and gene products the Gene Ontology Consortium, or GO (www.geneontology.org), is an initiative that is designed to address this problem. GO provides a restricted vocabulary as well as clear indications of the relationships between terms. GO is clearly a valuable tool for data analysis; however, its structure (as a DAG) and the complex nature of the relationships that it represents make appropriate use of this tool challenging.

1.1 The graph structure of GO


The GO ontologies are structured as directed acyclic graphs (DAGs) that represent a network in which each term may be a child of one or more parents. We use the expressions GO node and GO term interchangeably. Child terms are more specific than their parents. The relationship between a child and a parent can be either an is-a relation or a has-a (part-of) relation. Each term in the ontology is associated with a unique identifier, and the relationships between the GO terms (parent/child) as well as other relevant data are provided by GO. The GO package provides six sets of mappings, two for each ontology.
In general, given a set of most specific terms of interest we can find the graph that consists of those terms and any less specific terms (parents). We will refer to this graph as the induced GO graph for the specific set of child nodes.
GO itself is strictly the ontology. The mapping of genes to GO terms is carried out separately. The actual mappings are provided by GOA [1] and are mappings between GO terms and LocusLink IDs, which are modified to account for the multiplicity of mappings between the manufacturer IDs and LocusLink IDs.
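Purely as a toy illustration (not part of the GO or GOstats software), an induced graph of this kind can be assembled from child-to-parent mappings along the following lines; the term identifiers and the parents dictionary in this Python sketch are hypothetical placeholders.

def induced_go_graph(terms, parents):
    """Given most specific GO terms and a mapping term -> list of parent terms,
    return the induced graph: the given terms plus all ancestors, with edges
    directed from more specific (child) to less specific (parent) terms."""
    nodes, edges, stack = set(terms), set(), list(terms)
    while stack:
        term = stack.pop()
        for parent in parents.get(term, []):
            edges.add((term, parent))
            if parent not in nodes:
                nodes.add(parent)
                stack.append(parent)
    return nodes, edges

# toy ontology fragment with placeholder identifiers
parents = {"GO:c1": ["GO:p1"], "GO:c2": ["GO:p1", "GO:p2"],
           "GO:p1": ["GO:MF"], "GO:p2": ["GO:MF"]}
nodes, edges = induced_go_graph({"GO:c1", "GO:c2"}, parents)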

2 An example
To demonstrate some of the tools that are included in the GOstats package
we consider expression data from 79 samples from patients with acute lym-
phoblastic leukemia (ALL) that were investigated using Affymetrix GeneChip
arrays [2] . The data were normalized using quantile normalization and ex-
pression estimates were computed using RMA [4]. Of particular interest is
the comparison of 37 samples from patients with the BCR/ABL fusion gene
resulting from a chromosomal translocation (9;22) with the 42 samples from
the NEG group.

Figure 1: The induced GO graph for the selected genes.

To reduce the set of genes for consideration we applied two different sets
of filters (gene filtering is considered in more detail in [5] and the interested
reader is referred there) . A non-specific filter was used to remove genes
that showed little or no change in expression level across experiments. The
resulting data set had 2391 probes remaining. To select genes whose ex-
pression values were associated with the phenotypes of interest (BCR/ABL
and NEG) we used the mt.maxT function from the multtest package, which computes a permutation-based t-test for comparing two groups.
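mt.maxT is an R function; purely to illustrate the idea of a permutation-based, maximum-statistic adjustment, here is a hedged Python sketch (single-step max-T with Welch-type t-statistics; the function and argument names are hypothetical and the actual multtest implementation differs in detail).

import numpy as np

def maxt_permutation_pvalues(X, groups, n_perm=1000, seed=0):
    """Single-step max-T permutation adjustment for two-group comparisons.
    X: genes x samples expression matrix; groups: boolean array marking group 1."""
    rng = np.random.default_rng(seed)

    def tstats(g):
        a, b = X[:, g], X[:, ~g]
        se = np.sqrt(a.var(axis=1, ddof=1) / a.shape[1] +
                     b.var(axis=1, ddof=1) / b.shape[1])
        return (a.mean(axis=1) - b.mean(axis=1)) / se

    groups = np.asarray(groups, dtype=bool)
    t_obs = np.abs(tstats(groups))
    exceed = np.zeros_like(t_obs)
    for _ in range(n_perm):
        perm = rng.permutation(groups)
        # count permutations whose maximal |t| exceeds each observed |t|
        exceed += (np.abs(tstats(perm)).max() >= t_obs)
    return exceed / n_perm   # FWER-adjusted p-values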
After adjustment for multiple testing there were only 19 probes (which
correspond to 16 genes) with an adjusted p-value below 0.05. Using those
genes we obtain the set of most-specific GO terms in the MF ontology that
they are annotated at and compute the induced GO graph which is rendered
in Figure 1. No labels have been added to the nodes in this plot since there is
not sufficient room to provide informative ones. Notice that the most specific
terms are at the top of the graph and that arrows go from more specific nodes
to less specific ones. The node in the bottom center is the MF node. Clearly
some sort of interactivity (e.g. tooltips) would be beneficial. We will return to this plot in the next section and use it to provide a more detailed view of the data.

3 Statistical analyses

3.1 Finding interesting GO terms


If genes have been partitioned into distinct sets, say by finding those with small p-values (as was done above) or by some form of clustering, then one of the questions that arises is whether the genes that comprise a cluster have a common function, process or location in the cell. A second, related application of this idea is to provide meaning to a list or set of genes that were selected according to some criteria. For example, in our microarray experiment we selected genes that were differentially expressed between the BCR/ABL group and the NEG group. We might then wonder whether these genes have a common function, are involved in common processes, or perhaps are co-located in some region of the cell.
We can ask whether there are more interesting genes at a node than one might expect by chance. If that is true, then that term can be thought of as being overrepresented in the data. This question can be answered using a Hypergeometric distribution. Suppose that there are N total genes annotated for the ontology of interest and that our list of interesting genes contains m distinct genes. Then we can imagine an urn with N balls in it, where N − m are black and m are white. If we draw k balls from the urn, where k is the number of genes annotated at a node, we are asking whether the number of white balls in that drawn sample is unusually large. Suppose that there are q white balls (interesting genes) in the drawn sample; we then ask what is the probability that X ≥ q, where X is a Hypergeometric random variable with the parameters we have described. This probability constitutes a p-value since it is the probability of seeing something as extreme or more extreme than what was observed. This functionality is provided in the function GOHyperG available in the GOstats package.
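The tail probability itself is a direct Hypergeometric computation; the following short Python sketch (using scipy, with illustrative names and toy numbers) shows the calculation performed for each node.

from scipy.stats import hypergeom

def go_term_pvalue(N, m, k, q):
    """P(X >= q) for X ~ Hypergeometric: N annotated genes in total,
    m of them 'interesting', k genes annotated at the GO term,
    q interesting genes among those k."""
    # survival function: sf(q - 1) = P(X > q - 1) = P(X >= q)
    return hypergeom.sf(q - 1, N, m, k)

# hypothetical example: 10000 annotated genes, 16 interesting,
# 12 genes annotated at the node, 5 of them interesting
p = go_term_pvalue(N=10000, m=16, k=12, q=5)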
In Figure 2 we reproduce the plot from Figure 1, except that we have now colored the nodes according to the p-value obtained from the Hypergeometric test described above. The nodes in Figure 2 are colored either red or blue depending on whether the unadjusted Hypergeometric p-value was less than 0.10 or not (for those viewing this document in black and white the nodes should be dark and light grey, respectively). The GO terms for the nodes colored red are printed below. The relevant biology suggests that these are quite reasonable. We note that while the smallest p-values are associated with nodes that have few genes annotated at them, there are some nodes with a reasonable number of genes annotated at them (counts) and small p-values.

GO ID Term p-value No. of Genes


1 0005148 prolactin recepto... 0.003 2
2 0005131 growth hormone re... 0.003 2
3 0005159 insulin-like grow... 0.008 5
4 0004715 non-membrane span... 0.017 11
5 0030693 caspase activity 0.019 12
6 0005126 hematopoietin/int . 0.057 37
7 0004197 cysteine-type end . 0.085 56
8 0005515 protein binding 0.095 1165
9 0008234 cysteine-type pep... 0.096 63
10 0005200 structural consti... 0.097 64
11 0003714 transcription cor. .. 0.097 64
12 0005198 structural molecu ... 0.099 343

Table 1: GO terms, p-values and counts.

Figure 2: The induced GO graph colored according to unadjusted Hypergeometric p-values.

3.2 Selecting genes according to GO term


GO can also be used as a method of data reduction. Here one might carry
out an analysis focusing on a particular subset of genes , say those associated
with the GO term transcription factor.
Many of the effects due to the BCR/ABL translocation are mediated by tyrosine kinase activity. It will therefore be of interest to examine genes that are known to have tyrosine kinase activity. We examine the set of GO terms and identify the term GO:0004713 from the molecular function portion of the GO hierarchy as referring to protein-tyrosine kinase activity. We see that for the Affymetrix HGU95av2 chip 230 probe sets are annotated at this particular term. Of these only 32 were selected by the non-specific filtering step. We focus our attention on these probes and carry out a permutation t-test analysis.
In this analysis of the GO-filtered data, 4 probe sets have FWER-adjusted p-values less than 0.1. They are printed below, together with the adjusted p-values from an analysis that used all probes that passed our non-specific filter and hence involved 2391 genes.

GO analysis
40480_s_at  2039_s_at  36643_at  2057_g_at
   0.00002    0.00025   0.02146    0.07481

All genes
40480_s_at  2039_s_at  36643_at  2057_g_at
     0.001      0.018     0.473      0.823
Due to the reduced number of tests in the analysis focused on tyrosine kinases, we are left with more significant genes after correcting for multiple testing. For instance, the probe set 36643_at, which corresponds to the gene DDR1, was not significant in the unfocused analysis, but would be if instead the investigation was oriented towards studying tyrosine kinases a priori.

3.3 Using shortest paths

[6] consider some interesting applications of GO in conjunction with microarray expression data. In this section we consider a related idea and apply it to the ALL data. At their most basic level the ideas of [6] consist of forming a graph between genes (which are the nodes) based on some relevant distance. This distance might be correlation distance or it could be any other relevant distance. Then all edges in the graph that correspond to distances that are larger than some threshold are removed. Next, genes are grouped according to some specific categorization (they used GO biological process terms) and the shortest paths (using Dijkstra's algorithm) between all pairs of nodes are computed. Those shortest paths can then be examined to see whether they provide information of relevance.
In the ALL experiment we are most interested in comparing patients that have the BCR/ABL defect to those that have no measured cytogenetic abnormalities. Our adaptation of the shortest path technology is as follows. We use the output of the first filtering step described previously; that is, we select genes that show some level of expression and some variation in expression across samples. We then separate the data into two sets (BCR/ABL and NEG) and within each group we define the distance between two genes as one minus the Pearson correlation (other approaches, such as that used by [6], or some other robust correlation estimate could also be used). We then used an edge weight of d(u, v) = (1 − c_{u,v})^k with k = 1 and r = 0.6 as the cutoff for correlations: if c_{u,v} < r then no edge exists (Zhou et al. used k = 6 in their analysis and some experimentation may be warranted).
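As a hedged illustration of this construction (not the authors' code), the following Python sketch builds such a graph for one phenotype group and computes Dijkstra shortest-path lengths between two genes; the expression matrices, gene indices and thresholds are placeholders.

import numpy as np
import networkx as nx

def correlation_graph(expr, r=0.6, k=1):
    """expr: genes x samples matrix for one phenotype group.
    Edge weight (1 - c_uv)**k; no edge when the correlation c_uv is below r."""
    corr = np.corrcoef(expr)
    g = nx.Graph()
    g.add_nodes_from(range(expr.shape[0]))
    for u in range(expr.shape[0]):
        for v in range(u + 1, expr.shape[0]):
            if corr[u, v] >= r:
                g.add_edge(u, v, weight=(1.0 - corr[u, v]) ** k)
    return g

# shortest path length between two genes in each group graph (placeholder indices)
# g_bcr = correlation_graph(expr_bcr); g_neg = correlation_graph(expr_neg)
# d_bcr = nx.dijkstra_path_length(g_bcr, 0, 1)
# d_neg = nx.dijkstra_path_length(g_neg, 0, 1)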
Our interest in this particular example is in transcription factors. Hence we use the GO term GO:0003700, which maps to the molecular function transcription factor activity, to identify all genes with transcription factor activity. We used only genes for which this was a most specific annotation and obtained 814 mappings and 531 unique LocusLink ids. Of these, 152 were among those probes selected for our analysis. Of these we found that there were 6 with duplicate entries. A visual inspection (not reported) suggested that the correlation between these duplicate probes was quite high and so only one of each was used in the subsequent analysis. This left us with 146 distinct transcription factors for our study.
For every pair of transcription factors we compute two quantities: the shortest path between the pair under each of the two conditions. For example, in our ALL example we compute the shortest paths between all transcription factors using a graph based only on data from those with BCR/ABL, and secondly the same set of values based only on data from those without any noticeable genomic defects. Then for each pair the distances are compared (plotted) and those pairs for which the distance has changed the most are identified and further explored.
We first consider those transcription factors that are not connected to the others in their respective graphs. There are three sets: those that are not connected in either graph, those that are not connected only in the NEG graph, and those that are not connected only in the BCR/ABL graph. They are reported in Table 2.

    Affymetrix ID   Symbol   Which graph
 1  34730_g_at      TRO      Both
 2  1106_s_at       TRA@     NEG only
 3  34850_at        UBE2E3   NEG only
 4  1185_at         IL3RA    BCR/ABL only
 5  32186_at        SLC7A5   BCR/ABL only
 6  33641_g_at      AIF1     BCR/ABL only

Table 2: Genes not connected in the different graphs.

We now consider the finite pairwise distances. First, a simple t-test can be carried out to see if there is any difference between the distances in one graph versus the other. We took each pairwise distance in the NEG graph and subtracted from it the same pairwise distance computed on the BCR/ABL graph. The t-test is for whether the mean is zero; the test statistic was 0.179 with an extremely small p-value. So we see that distances in the NEG graph seem to be longer than those in the BCR/ABL graph. Further evidence of this difference comes from the observation that the proportion of values that were larger in the NEG graph was 0.589.

We will focus our attention on those differences that are large in absolute value. We chose a value of 2.5 as our cut-off and found that there were 66 differences that were larger than 2.5. These corresponded to 26 distinct genes.
While all may be interesting, and a particular investigator may want to expend considerable effort studying transcription factors that are of particular interest, we will center our analysis on the set of genes that appear most frequently in this list.
There are three genes that have high counts, namely MYC, MPO and GADD45A. This fact suggests that perhaps the expression patterns of these three different transcription factors are substantially different in the two phenotypes we are studying.
For each of the three transcription factors we can compute the average distance, separately within each graph, to all the other selected genes. We find that the results are quite consistent and that in all cases the path length is much shorter in the BCR/ABL group than it is in the NEG group. For MYC the means were 5 for NEG and 2 for BCR/ABL, for MPO they were 4 for NEG and 2 for BCR/ABL, and for GADD45A the means were 5 for NEG and 2 for BCR/ABL. It is rather interesting to observe that amongst the pairwise distances that have changed the most are those between these three specific genes.
Specific paths between transcription factors can also be examined. Recall that we compute our distance between two transcription factors based on the shortest path length between them in each of the two graphs. In our examples we focus on MYC and the distances between it and MPO and GADD45A.
We print out the different shortest paths for genes connecting MYC to both MPO and GADD45A for each of the two phenotypes, respectively (first the paths for BCR/ABL, then for the NEG samples). The MYC to MPO results are:

BCR/ABL
MYC->EIF4G1->HMG20B->MPO
NEG
MYC->CDC25B->TRAP1->FLJ10326->LANCL1->EMP3->S100A4
->LGALS1->MPO

If we then make use of the results in Figure 3 we see that there are positive correlations between MYC and EIF4G1, and as well between EIF4G1 and HMG20B, but that for HMG20B and MPO the correlation is negative. Positive correlations are suggestive of shared transcriptional activity while negative correlations are suggestive of transcriptional inhibition.
Figure 3: Pairwise scatterplots of gene expression for those genes on the shortest path between MYC and MPO from patients with the BCR/ABL translocation.

The results comparing MYC to GADD45A are:

BCR/ABL
MYC->UBE2A->BAZ1A->CD53->GADD45A
NEG
MYC->CDC25B->TRAP1->SSBP1->SMC1L1->TK1->HCK->
SH3PB1->PVRL2->GADD45A

We do not have space to present the other pairwise scatterplots here, but readers who are making use of the compendium version of this paper can easily explore those different plots on their own.
We notice that the path lengths for the NEG samples are longer (involve more genes) than those for the BCR/ABL samples. We might also want to ask whether the distances are also larger (that is, whether the correlations are smaller). To do this we need to obtain the edge weights from the respective graphs and compare them. We found that there appeared to be no difference (all averaged around a distance of about 0.65), but the number of edges is quite small and one might expect to see systematic differences if a larger study were undertaken.
We can check our results, at least to some extent, by examining pairwise scatterplots of the gene expressions. In Figure 3 the genes on the path from MYC to MPO are plotted. We see quite strong correlations along the diagonal and note that HMG20B and MPO have a negative correlation.
Finally, we finish our examination of these data by considering some of the specific paths between the different transcription factors. We see, in
Figure 4, the actual shortest path between the genes MYC and MPO. The two end points have been colored red; genes along the path are colored blue.

Figure 4: Shortest path between MYC and MPO in the NEG samples.

4 Discussion
GO and the mappings from genes to specific terms in each of the three ontologies provide a number of important and unique data analytic opportunities. In this paper we have considered three separate applications of these resources to the problem of analysing gene expression data, and in all cases the GO-related data have provided new and important insights into the data. Using GO mappings to select certain terms for further study and reference has the possibility of providing meaning to sets of genes that have been selected according to different criteria. An equally important application is to use GOA mappings to reduce the set of genes under consideration. As the capacity of microarrays increases it is important that we begin developing tools and strategies that directly address specific questions of interest. P-value correction methods are at best a band-aid and do not represent an approach that has long term viability [5].
In our final example we adapted the method proposed by [6] to a different problem, one where we consider only transcription factors and where we are interested in understanding their interrelationships. The results are promising and in our example reflect a fundamental difference between those with the BCR/ABL translocation and those patients with no observed genetic abnormalities. Ideally these and other observations will lead to a better understanding of transcriptional regulation and from that to a better understanding of modalities of efficacy for drug treatments.
Perhaps more important than the statistical presentation is the fact that we have also provided software implementations for all tools described and discussed in this paper. They are available from the Bioconductor Project in the form of the GOstats package. GOstats makes substantial use of software infrastructure from the Bioconductor Project in carrying out this analysis, in particular the graph, Rgraphviz and RBGL packages, together with the different meta-data packages.
Finally, this document itself represents an approach to reproducible research in the sense discussed by [3] and it can be reproduced on any user's machine equipped with R and the appropriate set of R packages. We encourage the interested reader to avail themselves of the opportunity to explore the data and the methods in more detail on their own computer.

References
[1] Camon E., Magrane M., Barrell D., Lee V., Dimmer E., Binns D., Maslen J., Harte N., Lopez R., Apweiler R. (2004). The Gene Ontology Annotation (GOA) database: sharing knowledge in UniProt with Gene Ontology. Nucleic Acids Research 32, D262-D266.
[2] Chiaretti S., Li X., Gentleman R., Vitale A., Vignetti M., Mandelli F., Ritz J., Foa R. (2004). Gene expression profile of adult T-cell acute lymphocytic leukemia identifies distinct subsets of patients with different response to therapy and survival. Blood 103, 2771-2778.
[3] Gentleman R., Temple Lang D. (2003). Statistical analyses and reproducible research.
[4] Irizarry R.A., Hobbs B., Collin F., Beazer-Barclay Y.D., Antonellis K.J., Scherf U., Speed T.P. (2003). Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4, 249-264.
[5] von Heydebreck A., Huber W., Gentleman R. (2004). Differential expression with the Bioconductor Project. In: Encyclopedia of Genetics, Genomics, Proteomics and Bioinformatics. John Wiley and Sons.
[6] Zhou X., Kao M.-C.J., Wong W.H. (2002). Transitive functional annotation by shortest-path analysis of gene expression data. PNAS 99, 12783-12788.

Acknowledgement: I would like to thank Vincent Carey for many helpful discussions about these, and very many other, topics. I would like to thank Drs. J. Ritz and S. Chiaretti of the DFCI for making their data available and for helping me to understand how it relates to ALL. I would like to thank J. Zhang and J. Gentry for a great deal of assistance in preparing the data and writing software in support of this research.
Address: R. Gentleman, Department of Biostatistics, Harvard University
E-mail: rgentlem@jimmy.harvard.edu

COMPUTATIONAL CHALLENGES IN
DETERMINING AN OPTIMAL DESIGN
FOR AN EXPERIMENT
Subir Ghosh
Key words: Balanced arrays, computational challenges, factorial designs, interactions, orthogonal arrays, robust designs, search designs, search linear models, search probabilities, unavailability of data.
COMPSTAT 2004 section: Design of experiments.

Abstract: In this paper we present some computationally challenging problems for finding an optimum design in an experiment. We consider the problem of finding an optimum design when one model from a set of possible models would describe the data better than other models in the set but we do not know this model a priori. We also consider the robustness of optimum designs under a model when some observations are unavailable.

1 Introduction
In the early development of designing a statistically efficient experiment, considerable attention was given to the computational simplicity of the analysis and to some desirable properties of the inferences drawn on the comparisons (parameters) of interest [2]. The concepts of orthogonality and balance in experimental designs were developed. With the progress in methodological research and the development in computing technology, the concepts of optimum designs and various optimality criteria were proposed [10]. The experiment could be performed at a single stage or at many stages over time. The data could be continuous, discrete, univariate, multivariate, time series, spatial, and other kinds, or some combinations of them. Inference procedures could be parametric, nonparametric, semiparametric, frequentist, Bayesian, and others. The most amazing aspect of design research is the enormous contributions of all kinds of researchers, from extreme theorists to extreme practitioners [8]. We do not attempt to make any futile effort to list all the contributors and their research. In this paper we examine some aspects of determining optimal designs and discuss some challenging problems.

2 Optimum designs
An optimum design is normally obtained by satisfying one or more optimality properties (minimizing variance, maximizing power and many others) for the comparisons (parameters) of interest under an assumed model. The choice between a best design with respect to (w.r.t.) one criterion and a best design w.r.t. another criterion is always an issue at the time of the selection of an
optimum design. With the change in the computing environment, this is-
sue has become much more complex. For example, the orthogonal fractional
factorial plans may be best w.r.t. many optimality criteria but they require
more runs in most situations than nonorthogonal plans and furthermore may
not perform well compared to nonorthogonal plans when the assumed model
is really inadequate. If we decide to give up orthogonality and opt for opti-
mal balanced fractional factorial plans as our nonorthogonal plans, then we
may cut down the cost of running the experiment as well as improve the per-
formance when the assumed model is inadequate. Finding optimal balanced
fractional factorial plans as nonorthogonal plans is always computationally
challenging but it is possible to find such plans in the modern computing
environment. Many such plans are already available in the design literature.
The list of references is available in Ghosh and Rao [7], [8].

3 Robust designs
The unavailability of data that we often encounter in conducting an exper-
iment should be a concern at the design stage. Ghosh [3] introduced the
concept of robustness of design against the unavailability of any t (a positive
integer) observations in the sense that the unbiased estimation of all the pa-
rameters of interest is still possible when any t observations are unavailable.
For n observations, there are (n choose t) possible sets of t observations. Ghosh and
Namini [5] gave several criteria and methods for determining the influential
set of t observations for robust designs. There are numerous such practi-
cal issues including the presence of outliers, time trend in observations, and
others in real life experiments. Such practical issues give rise to challenging
computational problems in the selection of designs.

4 Model identification using search designs


The problem of finding a best design or a class of best designs satisfying one
or more optimality criteria under an assumed model is a challenging task.
Analytical methods are not often sufficient for resolving this task. Compu-
tational methods are very powerful in addition to the applicable analytical
methods in resolving this problem. When we are not absolutely sure about
the assumed model that will fit the experimental data adequately, the prob-
lem becomes daunting. In reality we are rarely sure about a particular model
in terms of its effectiveness in describing the data adequately. However, we
are normally sure about a set of possible models that would describe the
data better than other models in the class . The pioneering work of Srivas-
tava [13] introduced the search linear model with the purpose of searching
for and identifying the best model from a set of possible models that includes
the best model. We now focus on finding a best design or a class of best
designs for model identification through the use of the search linear models.
Computational methods are indispensable for this purpose.
In factorial experiments, the lower order effects are normally important and the higher order effects are all assumed to be negligible. In main effect plans, the main effects are important and the interaction effects are assumed to be zero. Such an assumption may or may not hold true in reality because of the possible presence of a few significant non-negligible interactions. The standard linear models cannot identify these non-negligible effects using a small number of runs or treatments considerably smaller than the total number of possible runs for an experiment. This motivates the use of search designs under the search linear model in searching for and identifying non-negligible interaction effects. We consider the problem of comparing search designs with the ability of searching for and identifying k (a positive integer) non-negligible interaction effects.

5 Search linear model

Consider the search linear model [13]

E(y) = A_1 ξ_1 + A_2 ξ_2,   (1)

where y(n × 1) is the vector of observations, and A_1(n × v_1) and A_2(n × v_2) are matrices known from the underlying design. The elements of the vector ξ_1(v_1 × 1) are unknown parameters. About the elements of ξ_2(v_2 × 1) we know that at most k elements are nonzero, but we do not know which elements are nonzero. The k is small compared to v_2. The goal is to search for and identify the nonzero elements of ξ_2 and then estimate them along with the elements of ξ_1. Such a model is called a search linear model. When ξ_2 = 0, the search linear model becomes the ordinary linear model. For the search linear model, we have ξ_2 ≠ 0.
Let A_22 be any (n × 2k) submatrix obtained by choosing 2k columns of A_2. A design is called a search design [13] if, for every submatrix A_22,

rank[A_1 : A_22] = v_1 + 2k.   (2)

The rank condition (2) allows us to fit and discriminate between any two models in the class of possible models described earlier. Any two models in the class have v_1 common parameters, which are the elements of ξ_1, and at most 2k uncommon parameters, which are the elements of ξ_2. Note that n ≥ v_1 + 2k. A search design allows us to search for and identify the nonzero elements of ξ_2 and then estimate them along with the elements of ξ_1.
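Condition (2) can be verified numerically by exhaustive enumeration of the 2k-column submatrices of A_2; the Python sketch below (with design matrices assumed to be supplied as numerical arrays) is one straightforward, if brute-force, way to do so.

import numpy as np
from itertools import combinations

def is_search_design(A1, A2, k):
    """Check rank condition (2): rank[A1 : A22] = v1 + 2k for every
    n x 2k submatrix A22 formed from columns of A2."""
    v1 = A1.shape[1]
    for cols in combinations(range(A2.shape[1]), 2 * k):
        A22 = A2[:, list(cols)]
        if np.linalg.matrix_rank(np.hstack([A1, A22])) < v1 + 2 * k:
            return False
    return True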

6 Computationally challenging problems

Consider a class of (v_2 choose k) linear models from (1) with the parameters being ξ_1 and k elements of ξ_2. The (v_2 choose k) possible sets of k elements of ξ_2 give rise to (v_2 choose k) such models. For any two models in this class, the elements of ξ_1 are common parameters, but in the two sets of k elements of ξ_2 some common parameters may or may not be present. A search procedure identifies the model which best fits the data generated from the search design. To identify this model, the sum of squares of errors (SSE) of each model is used [13]. If the SSE for the first model (M1) is smaller than the SSE for the second model (M2), then M1 provides a better fit and is selected over M2. For a fixed value of k, all (v_2 choose k) models are fitted to the data and the search procedure selects the model with the smallest SSE as the best model for describing the data.
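A minimal Python sketch of this search procedure, under the simplifying assumption that every candidate model is fitted by ordinary least squares, might look as follows (the function name is illustrative).

import numpy as np
from itertools import combinations

def search_best_model(y, A1, A2, k):
    """Fit all C(v2, k) candidate models [A1 : k columns of A2] by least
    squares and return the column set with the smallest SSE."""
    best_sse, best_cols = np.inf, None
    for cols in combinations(range(A2.shape[1]), k):
        X = np.hstack([A1, A2[:, list(cols)]])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        sse = float(np.sum((y - X @ beta) ** 2))
        if sse < best_sse:
            best_sse, best_cols = sse, cols
    return best_cols, best_sse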

6.1 Optimal search designs


For each model in the class of (v_2 choose k) linear models from (1), we consider the variance-covariance matrix of the least squares estimators of the parameters. We calculate the values of the determinant (D), trace (T), and maximum characteristic root (MCR). So we obtain (v_2 choose k) sets of values of D, T, and MCR. We calculate the arithmetic means and the geometric means of D, T, and MCR and denote them by AD, AT, AMCR, GD, GT, and GMCR. The smaller the values of AD, AT, AMCR, GD, GT, and GMCR, the better is the search design. Note that the minimization of D, T, and MCR alone corresponds to the D-, A-, and E-optimality criteria, respectively [10]. The arithmetic mean is more meaningful than the geometric mean in some areas of application and vice versa. We use these six criteria for comparing search designs with the same number of runs. This is computationally a huge task.
For a factorial experiment with four factors each at two levels (+) and (−), suppose that ξ_1 consists of the general mean and main effects and ξ_2 consists of only two-factor interactions. Consider two designs, d1 and d2, with 8 runs. Design d1 has the ability of searching for one nonnegligible two-factor interaction and, furthermore, this plan is optimal w.r.t. the AD, GD, AT, and GT criteria. Design d2 also has the ability of searching for one nonnegligible two-factor interaction and, furthermore, this plan is optimal w.r.t. the AMCR and GMCR criteria. These new plans are obtained by first finding all the search designs with 8 runs and 4 factors and then calculating their AD, AT, AMCR, GD, GT, and GMCR values. Finding d1 and d2 is indeed a computer intensive task. Table 1 presents d1 and d2.
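The six criteria can be evaluated by looping over all candidate models; the following Python sketch (with an assumed unit noise variance and an illustrative function name) shows one way to compute AD, AT, AMCR, GD, GT and GMCR for a given pair of design matrices.

import numpy as np
from itertools import combinations

def search_design_criteria(A1, A2, k, sigma2=1.0):
    """Compute D, T and MCR of the LS covariance matrix for every model
    [A1 : k columns of A2], then their arithmetic and geometric means."""
    D, T, M = [], [], []
    for cols in combinations(range(A2.shape[1]), k):
        X = np.hstack([A1, A2[:, list(cols)]])
        cov = sigma2 * np.linalg.inv(X.T @ X)
        D.append(np.linalg.det(cov))
        T.append(np.trace(cov))
        M.append(np.linalg.eigvalsh(cov).max())
    gmean = lambda v: float(np.exp(np.mean(np.log(v))))
    return {"AD": np.mean(D), "GD": gmean(D), "AT": np.mean(T),
            "GT": gmean(T), "AMCR": np.mean(M), "GMCR": gmean(M)}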

6.2 Search probabilities


The probability of selecting one model over another model depends on σ², the noise variance, which we refer to as the noise. To see this dependence, we consider three cases: σ² = 0, σ² = ∞, and 0 < σ² < ∞. Let M0 be the true model in the class of models described above. Furthermore, let M1 be a competing model where M1 ≠ M0. In the noiseless case, σ² = 0, the SSE for M0, SSE(M0), is zero, which is always smaller than SSE(M1). Hence, M0 will definitely be selected over M1. Therefore, the correct nonzero interaction will always be identified with probability one. Thus, P[SSE(M0) < SSE(M1) | M0, M1, σ² = 0] = 1. In reality σ² > 0 and SSE(M0) may

      d1            d2
- - - - - - - -
- + + + + + + +
- - + + - + + +
- + - + + - - -
- + + - - - + -
+ - - + + + - +
+ - + - - - + +
+ + - - + + - -
Table 1: d1 and d2 with 8 runs and 4 factors.

not be less than SSE(M1). Therefore, M0 may not necessarily be selected over M1. Hence, the probability of correctly identifying the nonzero interaction is less than one and we write P[SSE(M0) < SSE(M1) | M0, M1, σ² > 0] < 1. In the case of infinite noise, M0 and M1 are equally likely to be selected and so the probability of selecting M0 over M1 is 1/2, and we write P[SSE(M0) < SSE(M1) | M0, M1, σ² = ∞] = 1/2. For 0 < σ² < ∞, P[SSE(M0) < SSE(M1) | M0, M1, σ²] is called the search probability for a given M0, M1, and σ². Note that the search probability is between 1/2 and 1. Shirakura et al. [12] presented the search probability for searching one nonnegligible effect (k = 1) based on the normality assumption for observations under the search linear model (1).
There are many of these search probabilities to consider. We note that for a given true model M0, there are (v_2 − 1) competing models M1 for k = 1. Since the true model M0 is unknown, we consider all v_2(v_2 − 1) possible pairs (M0, M1) and calculate all the search probabilities for a given σ². From these search probabilities, Ghosh and Teschmacher [9] presented a v_2 × v_2 search probability matrix (SPM) where the columns correspond to the possible true models and the rows correspond to the possible competing models. The off-diagonal elements of the SPM represent the search probabilities corresponding to all possible pairs of M0 and M1 for a given σ². Since the true model M0 is different from the competing model M1, the diagonal elements of the SPM are not meaningful and are therefore left blank. When comparing two designs, we would like to determine which design has a greater chance of identifying the true nonzero interaction term. A method for doing this is to compare the SPMs of the two designs for a given σ². The SPM for a design depends on a parameter, ρ, which is the ratio of the magnitude of the true unknown interaction term (signal) and σ (noise). In other words, the SPM depends on σ² through ρ. Let SPM_i(ρ) be the SPM of the ith design for a given ρ, where the columns and rows correspond to the possible true and competing models, respectively.
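Such search probabilities can also be approximated by simulation; the following Python sketch is a hedged Monte Carlo stand-in (assuming ξ_1 = 0 and σ = 1 for simplicity, with illustrative function and argument names), not the exact expressions of Shirakura et al. [12].

import numpy as np

def search_probability_mc(A1, a0, a1, rho, n_sim=2000, seed=0):
    """Monte Carlo estimate of P[SSE(M0) < SSE(M1)] for signal-to-noise rho,
    where the true model M0 uses column a0 of A2 and M1 uses column a1."""
    rng = np.random.default_rng(seed)
    n = A1.shape[0]
    X0 = np.hstack([A1, a0[:, None]])
    X1 = np.hstack([A1, a1[:, None]])
    H0 = X0 @ np.linalg.pinv(X0)          # hat (projection) matrices
    H1 = X1 @ np.linalg.pinv(X1)
    wins = 0
    for _ in range(n_sim):
        y = rho * a0 + rng.standard_normal(n)   # data generated from M0
        sse0 = np.sum((y - H0 @ y) ** 2)
        sse1 = np.sum((y - H1 @ y) ** 2)
        wins += sse0 < sse1
    return wins / n_sim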
Shirakura et al. [12] proposed a criterion for comparing search designs for a specific value of ρ. This criterion is based on the minimum value of all the elements of the SPM: the higher this minimum value, the better is the design. Ghosh and Teschmacher [9] defined the SPM, proposed two other criteria, and presented methods of comparing search designs for all values of ρ using all three criteria. One of the two criteria proposed in Ghosh and Teschmacher [9] is based on the element-by-element comparison of two SPMs and the other one is based on comparing two minimum search probability vectors (MSPVs) whose elements are the minimum values of the columns of two SPMs. The comparisons are then made by using a majority rule, in the sense that fifty percent or more of the elements of one SPM are greater than the corresponding elements of the other SPM. Similar comparisons are also made for two MSPVs. The methods proposed in Ghosh and Teschmacher [9] have opened up a new direction of computationally challenging problems for finding optimum designs.
Orthogonal designs have many well-known optimality properties under the ordinary linear model. However, balanced designs can perform better than orthogonal designs under the search linear model. Consider two search designs, D1 and D2, each with 12 runs and 4 factors each at two levels (−) and (+). Design D1 is a balanced array of full strength and design D2 is an orthogonal array of strength 2 obtained from the 12-run Plackett-Burman design [11] by choosing the first four columns. Table 2 presents D1 and D2. Design D1 performs better than design D2 under the ordinary linear model with ξ_2 = 0. However, D2 performs better than D1 under the search linear model when the vector ξ_2 consists of two- and three-factor interactions only one of which is nonzero, so that k = 1. This is a really striking example illustrating the fact that an orthogonal design is not necessarily the best in all situations.

6.3 Robust designs


The optimal designs may no longer be optimal when some observations become unavailable during the experiment. Determining the robustness of optimal designs against the unavailability of data is a computationally difficult problem [3], [5]. Ghosh and Al-Sabah [6] presented some efficient composite plans for response surface experiments with surprisingly higher efficiency than existing comparable plans in the literature w.r.t. all three criteria D, T, and MCR. For example, under the second order response surface model with ten factors, the MCR, T, and D × 10^70 values are 7.1, 31.2, and .033 for the Ghosh–Al-Sabah plan and 6791.4, 6850.0, and 1.6 for the existing Draper–Lin plan [1]. Ghosh–Al-Sabah plans were obtained while studying the robustness properties of some existing designs.

7 Conclusions
In this paper we have described some challenging computational problems in finding a best design for an experiment. The modern computing environment has

D1 D2
+ + + + + - + -
- - - - + + - +
- - - + - + + -
- - + - + - + +
- + - - + + - +
+ - - - + + + -
- - + + - + + +
- + - + - - + +
+ - - + - - - +
- + + - + - - -
+ - + - - + - -
+ + - - - - - -
Table 2: D1 and D2 with 12 runs and 4 factors.

helped us in attempting to resolve these problems. Many other challenging problems and some of their solutions are indeed available in the work of other researchers. Many new computationally challenging problems are also constantly emerging with the modern development of science and technology.

References
[1] Draper N.R., Lin D.K.J. (1990). Small composite designs. Technometrics 32, 187-194.
[2] Fisher R.A. (1935). The design of experiments. First edition. Oliver and Boyd, London.
[3] Ghosh S. (1979). On robustness of designs against incomplete data. Sankhya B 40, 204-208.
[4] Ghosh S. (1980). On main effect plus one plans for 2^m factorials. Ann. Statist. 8, 922-930.
[5] Ghosh S., Namini H. (1990). Influential observations under robust designs. In: Coding Theory and Design Theory, Part II: Design Theory, D.K. Ray-Chaudhuri (ed.), Springer-Verlag, New York, 86-97.
[6] Ghosh S., Al-Sabah W.S. (1996). Efficient composite designs with small number of runs. J. Statist. Plann. Inference 53, 117-132.
[7] Ghosh S., Rao C.R. (1996). Design and analysis of experiments. North-Holland, Elsevier Science B.V., Amsterdam.
[8] Ghosh S., Rao C.R. (2001). An overview of developments in statistical designs and analysis of experiments. In: Recent Advances in Experimental Designs and Related Topics, S. Altan and J. Singh (eds.), Nova Science Publishers, Inc., New York, 1-24.
[9] Ghosh S., Teschmacher T. (2002). Comparisons of search designs using search probabilities. J. Statist. Plann. Inference 104, 439-458.
[10] Kiefer J. (1959). Optimum experimental designs. J. Roy. Statist. Soc. B 21, 272-319.
[11] Plackett R.L., Burman J.P. (1946). The design of optimum multifactorial experiments. Biometrika 33, 305-325.
[12] Shirakura T., Takahashi T., Srivastava J.N. (1996). Searching probabilities for nonzero effects in search designs for the noisy case. Ann. Statist. 24, 2560-2568.
[13] Srivastava J.N. (1975). Designs for searching non-negligible effects. In: A Survey of Statistical Design and Linear Models, J.N. Srivastava (ed.), North-Holland, Elsevier Science B.V., Amsterdam, 505-519.

Acknowledgement: The author would like to express his sincere gratitude to three reviewers for their critical reading of the earlier version of this paper.
Address: S. Ghosh, University of California, Riverside, CA 92521-0138, USA
E-mail: ghosh@ucrac1.ucr.edu

VISUALIZATION OF PARAMETRIC
CARCINOGENESIS MODELS
Jutta Groos and Annette Kopp-Schneider
Key words: Hepatocarcinogenesis, color-shift model, maximum likelihood estimate.
COMPSTAT 2004 section: Biostatistics.

Abstract: This paper concentrates on effective tools to compare different carcinogenesis models with respect to their ability to predict numbers and radii of foci in hepatocarcinogenesis experiments. Especially the CSM-GUI (Color-Shift graphical user interface) proves to be a powerful instrument to test a new model before starting the very time-intensive procedure of finding the maximum likelihood parameters.

1 Introduction
Hepatocarcinogenesis experiments identify focal lesions consisting of intermediate cells at different preneoplastic stages. Several hypotheses have been established to describe the formation and progression of preneoplastic liver foci. A common model of hepatocarcinogenesis is the multi-stage model, which is based on the assumption that cells have to undergo multiple successive changes on their way from the normal to the malignant stage. In this model single cells change their phenotype through mutation into the next stage and proliferate according to a linear stochastic birth-death process [4], [5].
In contrast, the Color-Shift Model (CSM) was introduced by Kopp-Schneider and colleagues [4] to describe that whole colonies of altered cells simultaneously alter their phenotype. In this model, preneoplastic foci are assumed to grow exponentially with deterministic rate and to change their phenotype ('color') after an exponentially distributed waiting time [1], [3].
To take into account that the assumption of deterministic growth rates for foci in the CSM seems to oversimplify the real process, a CSM with stochastic growth rates is introduced.
In order to compare different models with respect to their ability to predict numbers and radii of foci in a rat hepatocarcinogenesis experiment, maximum likelihood estimates for the model parameters are used and the predicted and empirical distributions are visualized.

2 Color-shift model with stochastic growth rates in the case of 2 colors

The assumption of deterministic growth rates for the foci in the CSM seems to oversimplify the real process. Therefore, a CSM with stochastic color-dependent growth rates is introduced, which assumes that foci change their color when reaching a deterministic radius r_switch. As in the CSM, the formation of spherical foci with initial radius r_0 is described by a homogeneous Poisson process with rate μ. Let B_1 and B_2 be independent positive random variables with densities f_{B_1} and f_{B_2}. The random variables B_1 and B_2 describe the exponential growth of foci of color 1 and color 2.
Given that a focus is present at time t, the timepoint of its formation, τ_0, is a realisation of a random variable τ uniformly distributed on [0, t], where τ, B_1 and B_2 are independent.
Consider exemplarily a focus generated at time τ = τ_0 which grows in color C = 1 with rate B_1 = b_1 until it reaches the radius r_switch, where it changes its color and grows in color C = 2 with rate B_2 = b_2.

Color 1: R(t) < r_switch.
The radius at time t > τ_0, R(t), is described by

R(t) = r_0 exp(b_1 (t − τ_0)).

Color 2: R(t) ≥ r_switch ⟺ t > ln(r_switch/r_0)/b_1 + τ_0.
Define

τ_1 := ln(r_switch/r_0)/b_1

as the time spent in color 1 until the change to color 2. The radius of a focus of color C = 2 at timepoint t > τ_0 + τ_1, R(t), is described by

R(t) = r_switch exp(b_2 (t − τ_1 − τ_0)).
So an expression for the joint distribution of radius R(t) and color C(t) = 1 at time t can be derived:

P(R(t) ≤ r, C(t) = 1) = P(R(t) ≤ r, R(t) ≤ r_switch)
  = 0 for r ≤ r_0,
  = P(R(t) ≤ r) for r ∈ (r_0, r_switch],
  = P(R(t) ≤ r_switch) for r > r_switch.

Explicitly, since the formation time is uniformly distributed on [0, t], this equals

  0   for r ≤ r_0,
  (ln(r/r_0)/t) ∫_{ln(r/r_0)/t}^{∞} [f_{B_1}(b_1)/b_1] db_1 + F_{B_1}(ln(r/r_0)/t)   for r ∈ (r_0, r_switch],
  (ln(r_switch/r_0)/t) ∫_{ln(r_switch/r_0)/t}^{∞} [f_{B_1}(b_1)/b_1] db_1 + F_{B_1}(ln(r_switch/r_0)/t)   for r > r_switch,

where F_{B_1} and f_{B_1} are the distribution function and density of the random variable B_1.
Therefore the joint density of radius R(t) and color C(t) = 1 at time t is (the boundary terms arising in the differentiation cancel):

f_{R(t),C(t)}(x, 1) = (1/(xt)) ∫_{ln(x/r_0)/t}^{∞} [f_{B_1}(b_1)/b_1] db_1 · 1_{(r_0, r_switch]}(x),

with the indicator function 1_{(a,b]}(x) = 1 for x ∈ (a, b] and 0 for x ∉ (a, b].

The joint distribution of radius R(t) and color C(t) = 2 at time t is
\[
P(R(t) \le r,\, C(t) = 2) = P(R(t) \le r,\, R(t) > r_{switch})
= P(R(t) \le r \mid R(t) > r_{switch})\, P(R(t) > r_{switch}),
\]
which equals 0 for r \le r_{switch}. Therefore, if r > r_{switch}:
\[
P(R(t) \le r,\, C(t) = 2)
= \int_{\ln(r_{switch}/r_0)/t}^{\infty} \int_{\ln(r/r_{switch})/(t-\tau_1)}^{\infty}
\frac{\ln(r/r_{switch})}{b_2 t}\, f_{B_2}(b_2)\, f_{B_1}(b_1)\, db_2\, db_1
\;+\;
\int_{\ln(r_{switch}/r_0)/t}^{\infty}
F_{B_2}\!\left(\frac{\ln(r/r_{switch})}{t-\tau_1}\right) \frac{t-\tau_1}{t}\, f_{B_1}(b_1)\, db_1 ,
\]
where F_{B_1} and F_{B_2}, f_{B_1} and f_{B_2} are the distribution functions and densities of the random variables B_1 and B_2 (recall that τ_1 = ln(r_switch/r_0)/b_1 depends on b_1). Hence the following expression for the joint density of radius R(t) and color C(t) = 2 at time t is obtained:
Let x > r_{switch}. Differentiating with respect to x again makes the boundary terms cancel, and
\[
f_{R(t),C(t)}(x, 2) = \frac{1}{xt}
\int_{\ln(r_{switch}/r_0)/t}^{\infty} \int_{\ln(x/r_{switch})/(t-\tau_1)}^{\infty}
\frac{f_{B_2}(b_2)}{b_2}\, f_{B_1}(b_1)\, db_2\, db_1 .
\]
For x \le r_{switch}, f_{R(t),C(t)}(x, 2) = 0.

Therefore the joint density of radius R(t) and color C(t) at time t is:

Color 1:
\[
f_{R(t),C(t)}(x, 1) = \frac{1}{xt} \int_{\ln(x/r_0)/t}^{\infty} \frac{f_{B_1}(b_1)}{b_1}\, db_1 \;\cdot\; 1_{(r_0,\, r_{switch}]}(x).
\]

Color 2:
\[
f_{R(t),C(t)}(x, 2) = \frac{1}{xt}
\int_{\ln(r_{switch}/r_0)/t}^{\infty} \int_{\ln(x/r_{switch})/(t-\tau_1)}^{\infty}
\frac{f_{B_2}(b_2)}{b_2}\, f_{B_1}(b_1)\, db_2\, db_1 \;\cdot\; 1_{(r_{switch},\, \infty)}(x).
\]

3 Application to rat liver foci data

In a typical hepatocarcinogenesis experiment animals, e.g. rats, are treated with a carcinogen, and liver sections are stained with special histological markers to observe foci of altered hepatocytes, which are known to be precursor lesions of carcinoma. Measurements are made in two-dimensional liver sections, and inference about the reality in the three-dimensional liver is limited by the stereological problem, described briefly by the fact that the probability of a focus being cut increases with its size. The model describes the three-dimensional situation. Moolgavkar and colleagues [5] suggested translating the expressions for the distributions of size and number of foci in 3D into the corresponding expressions in 2D by the Wicksell transformation and then applying the model to the two-dimensional measurements by maximum likelihood methods.
Consider that only focal transections with radii larger than ε can be detected and that one liver section per animal is evaluated. Kopp-Schneider and colleagues [4] derived the following expressions for the expected number of focal transections of color j at timepoint t in two dimensions (the additional index 2 indicates two dimensions and distinguishes the number of focal transections from the number of foci),

\[
n_{2,j} = 2 \mu t \int_{\varepsilon}^{\infty} \sqrt{x^2 - \varepsilon^2}\; f_{R(t),C(t)}(x, j)\, dx, \qquad (1)
\]
and the density of the size distribution of focal transections of color j at timepoint t in two dimensions
\[
f_{R^{(2)}(t) \mid C(t)}(y \mid j) =
\frac{y}{\displaystyle\int_{\varepsilon}^{\infty} \sqrt{x^2 - \varepsilon^2}\; f_{R(t),C(t)}(x, j)\, dx}
\;\int_{y}^{\infty} \frac{f_{R(t),C(t)}(x, j)}{\sqrt{x^2 - y^2}}\, dx. \qquad (2)
\]

Assume that the foci of each animal grow and change their color independently of other foci. Let n_{2,k} denote the number of focal transections of color k observed in a liver section of area A, and let r_{2,k,j} denote the radius of the j-th focal transection of color k. This liver section contributes the loglikelihood
\[
\sum_{k=1}^{2} \left[ \bigl( n_{2,k} \ln(A\, n_{2,k}) - A\, n_{2,k} \bigr)
+ \sum_{j=1}^{n_{2,k}} \ln f_{R^{(2)}(t) \mid C(t)}(r_{2,k,j} \mid k) \right] + C, \qquad (3)
\]
where C is a data-dependent constant. Assuming that the liver sections of one experiment are independent of each other, the loglikelihood of the complete data set is the sum of the contributions of every section.
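Purely as an illustration of how (1) and the Poisson part of (3) can be evaluated numerically (the paper's own implementation uses MATLAB with NAG quadrature routines, described in Section 4.1 below), a Python sketch with a placeholder joint density might look as follows; the density and all constants here are invented for the example.

import numpy as np
from scipy.integrate import quad

def expected_transections(f_joint, j, mu, t, eps, r_max):
    """Expected number of focal transections of color j per unit area, eq. (1).
    f_joint(x, j) is a user-supplied joint density of radius and color."""
    integrand = lambda x: np.sqrt(x**2 - eps**2) * f_joint(x, j)
    val, _ = quad(integrand, eps, r_max)
    return 2.0 * mu * t * val

def poisson_loglik_part(n_obs, n2_expected, area):
    """Poisson contribution of one color to the section loglikelihood, eq. (3)."""
    lam = area * n2_expected
    return n_obs * np.log(lam) - lam

# Toy check with an entirely hypothetical exponential radius density:
f_toy = lambda x, j: np.exp(-(x - 1e-3) / 0.05) / 0.05 if x > 1e-3 else 0.0
n2 = expected_transections(f_toy, 1, mu=2.0, t=37.0, eps=5e-3, r_max=1.0)
print(round(n2, 3), round(poisson_loglik_part(12, n2, area=100.0), 3))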

4 Example

Data from an NNM experiment published by Weber and Bannasch in 1994 [8] are chosen to illustrate the methodology. In this study rats were treated with 6 mg NNM (N-nitrosomorpholine, a chemical carcinogen administered in the drinking water) per kg body weight continuously during six different time periods, 7, 11, 15, 20, 27 and 37 weeks, with each group consisting of five animals. After this time period one liver section of each rat was stained with the marker H&E (Hemalum & Eosin, a biological marker identifying acidophilic and basophilic cell structures) and different types of focal transections were observed. Here only two different types of foci are considered. The morphometric evaluation of the stained liver sections generated a data set consisting of the area of every liver section and the type and area of every focal transection detected in this section.
A Color-Shift-Model with color-dependent and Beta-distributed growth rates is applied to this data set. The random variables B_1 and B_2, which describe the exponential growth in color 1 and color 2, are Beta-distributed with parameters p_1, q_1, a_1 and p_2, q_2, a_2. Form parameters a_i are introduced in addition to the parameters of the standard Beta distribution, p_i and q_i (p_i, q_i, a_i > 0, i = 1, 2), to modify the support of the distribution function. Hence the growth rate in color i, B_i, is a positive random variable with the following density:
\[
f_{B_i}(b_i) = \frac{b_i^{(p_i - 1)} (a_i - b_i)^{(q_i - 1)}}{B(p_i, q_i)\, a_i^{(p_i + q_i - 1)}} \cdot 1_{[0, a_i]}(b_i),
\qquad p_i, q_i, a_i > 0,\; i = 1, 2,
\]
where B(p, q) is the Beta function
\[
B(p, q) = \int_0^1 z^{(p-1)} (1 - z)^{(q-1)}\, dz .
\]
Inserting this expression into the joint density of radius R(t) and color C(t) at time t, double integrals are obtained in equations (1) and (2) which cannot be solved analytically. The loglikelihood function (3) depends on eight parameters.
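The scaled Beta density above is straightforward to evaluate; the following sketch (illustrative only, with made-up parameter values) computes it and checks numerically that it integrates to one.

import numpy as np
from scipy.special import beta as beta_fn
from scipy.integrate import quad

def f_B(b, p, q, a):
    """Density of a growth rate B with Beta(p, q) shape scaled to support [0, a]."""
    if b < 0.0 or b > a:
        return 0.0
    return b**(p - 1) * (a - b)**(q - 1) / (beta_fn(p, q) * a**(p + q - 1))

# Hypothetical parameter values, not estimates from the paper:
p, q, a = 2.0, 5.0, 0.3
total, _ = quad(f_B, 0.0, a, args=(p, q, a))
print(round(total, 6))   # should be close to 1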

4.1 Implementation

The MATLAB environment is used to compute the loglikelihood function, find the maximum likelihood parameters and visualize the results. Numerical double integration with singularities has to be performed for every single detected focal transection. As about 1000 focal transections are detected, the computation of the likelihood is a very time-intensive procedure. Using the MEX interface, functions for numerical double integration from the Fortran NAG library (subroutine D01DAF of the Numerical Algorithms Group (NAG) Fortran Library, Mark 18 [6]) are included to improve the performance.

Figure 1: A time-point can be chosen via the pop-up menu and the parameters can be varied via their corresponding sliders. Depending on the parameters, the three axes show the theoretical distributions of size and number of focal transections of type 1 and 2 (dotted lines) compared with the empirical data taken from the NNM experiment (solid lines). One slider is provided for the Poisson parameter μ, six sliders for the parameters p_1, q_1, a_1 and p_2, q_2, a_2 corresponding to the Beta-distributed growth rates in type 1 and 2, and one slider for r_switch.
To find the maximum likelihood parameters it is necessary to define a set of eight starting parameters for the fmincon function (a MATLAB function for nonlinear minimization under constraints, used here to minimize the negative loglikelihood [7]) and to define proper intervals for the range of the eight model parameters. For this purpose a graphical user interface (CSM-GUI) is implemented in MATLAB to test the theoretical distributions of size and number of focal transections in 2D under variation of the parameters (Figure 1). After minimizing the negative loglikelihood with fmincon, the theoretical results can be compared with the empirical data.
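For readers more familiar with Python than MATLAB, the corresponding bounded minimisation step could be sketched as below; the starting values and the stubbed-out objective are placeholders, not the authors' code.

import numpy as np
from scipy.optimize import minimize

def neg_loglik(theta, data):
    """Placeholder: would return -1 times equation (3) summed over all liver
    sections, using the double integrals for the joint densities."""
    mu, p1, q1, a1, p2, q2, a2, r_switch = theta
    return 0.0  # stand-in; the real computation is the expensive part

start = np.array([2.0, 2.0, 5.0, 0.3, 2.0, 5.0, 0.1, 0.05])  # made-up starting values
bounds = [(1e-6, None)] * 8                                   # all parameters positive
res = minimize(neg_loglik, start, args=(None,), method="L-BFGS-B", bounds=bounds)
print(res.x)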

4.2 Results

Figures 2 and 3 illustrate typical visualizations of the results of the modelling of the NNM experiment.

[Figure 2 about here: CDF of type 1 foci after 37 weeks of NNM; x-axis: radius in mm.]
Figure 2: The result of the CSM (dashed line) and the CSM with Beta-distributed growth rates (dotted line) applied to foci of type 1 after 37 weeks. The solid line represents the empirical data.
[Figure 3 about here: CDF of type 2 foci after 37 weeks of NNM; x-axis: radius.]
Figure 3: The result of the CSM (dashed line) and the CSM with Beta-distributed growth rates (dotted line) applied to foci of type 2 after 37 weeks. The solid line represents the empirical data.
The empirical size distribution is compared with the theoretical size distributions obtained from two different Color-Shift-Models using maximum likelihood estimates for the parameters. The CSM without modifications is represented by the dashed line, the CSM with Beta-distributed growth rates is illustrated by the dotted line, and the solid line stands for the empirical data from the NNM experiment. Considering only type 1 foci, the modified CSM seems to predict the size distribution better than the CSM. But the visualizations for the focal transections of type 2 show that the modified CSM expects too large foci of the second type, so that there is an advantage for the CSM without modification in this case. The deterministic switch-radius makes the model highly sensitive to outliers in type 1: a single large type 1 focus leads to a large estimate of r_switch. To make the model more robust against these outliers, a model assuming a stochastic switch-radius has to be formulated.
5 Conclusions
The forms of visualization described above are effective tools to compare different carcinogenesis models with respect to their ability to predict numbers and radii of foci in hepatocarcinogenesis experiments. Especially the CSM-GUI (Color-Shift graphical user interface) is a powerful instrument to test a new model before starting the very time-intensive procedure of finding the maximum likelihood parameters. To improve the CSM with stochastic growth rates, a Color-Shift-Model with stochastic color-dependent growth rates and a stochastic switch-radius has to be introduced. The next step could be the integration of the whole process, i.e. finding the starting parameters, maximizing the loglikelihood function and visualizing the results, into one GUI, which could simplify the modelling.
References
[1] Burkholder I., Kopp-Schneider A. (2002). Incorporating phenotype-dependent growth rates into the Color-Shift-Model for preneoplastic hepatocellular lesions. Math. Biosci. 179, 145.
[2] Geisler I. (2001). Stochastische Modelle fuer den Mechanismus der Entstehung und der Progression von Krebsvorstufen in der Leber. Doctoral thesis.
[3] Geisler I., Kopp-Schneider A. (2000). A model for hepatocarcinogene-
sis with clonal expansion of three successive phenotypes of preneoplastic
cells. Math. Biosci. 168, 167.
[4] Kopp-Schneider A., Portier C., Bannasch P. (1998). A model for hepatocarcinogenesis treating phenotypical changes in focal hepatocellular lesions as epigenetic events. Math. Biosci. 148, 181.
[5] Moolgavkar S., Luebeck E., de Gunst M., Port R., Schwarz M.(1990).
Quantitative analysis of enzyme-altered foci in rat hepatocarcinogenesis
experiments. I. Single agent regimen. Carcinogenesis 11, 1271.
[6] NAG Ltd. (1997). NAG Fortran Library Manual, Mark 18.
[7] The Mathworks Inc. (2003). Matlab Documentation CD, Release 13.
[8] Weber E., Bannasch P. (1994). Dose and time dependence of the cellular
phenotype in rat hepatic preneoplasia induced by continuous oral exposure
to N-Nitrosomorpholine. Carcinogenesis 15, 6.
[9] Wicksell S.D. (1925). The corpuscle problem. A mathematical study of a biometrical problem. Biometrika 17, 87.
Address: J. Groos, A. Kopp-Schneider, German Cancer Research Center, Biostatistics, Im Neuenheimer Feld 280, D-69120 Heidelberg, Germany
E-mail: j.groos@dkfz.de

DESIGN ASPECTS OF A COMPUTER SIMULATION STUDY FOR ASSESSING UNCERTAINTY IN HUMAN LIFETIME TOXICOKINETIC MODELS
Harald Heinzl and Martina Mittlboeck
Key words: Dioxin, indeterminability, occupational cohort, Monte Carlo simulation study.
COMPSTAT 2004 section: Biostatistics.

Abstract: The general paradigm for risk assessment of exposures to toxic agents in the human environment is the identification and characterization of hazard, assessment of exposure and characterization of risk. Risk assessment as performed in practice particularly addresses the outcome of integrating the data available from epidemiology, long-term mortality and morbidity studies and mechanistic research with information on the type and extent of exposure, together with properly used statistical analysis. Various technical and non-technical aspects of the design process of the Monte Carlo simulation study will be reported and discussed. Finally, a Monte Carlo computer simulation study, designed in order to examine in detail the influences of various sources of uncertainty and their potential implications on the risk estimates from the Boehringer cohort data, is presented.

1 Introduction
The need for risk assessment of exposures to toxic agents in the human environment has increased steadily over the last decades. The general paradigm for risk assessment is the identification and characterization of hazard, assessment of exposure and characterization of risk. Risk assessment as performed in practice particularly addresses the outcome of integrating the data available from epidemiology, long-term mortality and morbidity studies and mechanistic research with information on the type and extent of exposure, together with properly used statistical analysis. A sound, scientifically based risk assessment is an essential tool for risk managers and legislators responsible for the security and safety of humans.
The use of toxicokinetic models makes it possible to construct exposure indices that may be more closely related to the individual dose than traditional exposure measures. However, the process introduces a wide array of sources of uncertainty, which inevitably makes risk assessment more difficult. In addition, representing population heterogeneity in the assessment of risks and the identification of sensitive sub-populations is of great concern.
The analysis of uncertainty is becoming an integral part of many scientific evaluations. For example, in the risk assessment process, an uncertainty analysis has been recognized as an important component of risk characterization by regulatory agencies [29]. Uncertainty is prevalent in the process of risk assessment of chemical compounds at various levels. Uncertainty of the exposure assessment influences dose estimates. Such effects are exaggerated further by uncertainty in dose-response modelling, mainly caused by limited knowledge about the functional dose-response relationship. Finally, uncertainty is propagated to the risk estimation procedure, which provides the basis for risk management decisions.
It is vital to distinguish uncertainty from variability: the latter is a phenomenon in the physical world to be measured, analysed and, where appropriate, explained. By contrast, uncertainty is an aspect of knowledge (Sir David Cox as quoted in Vose [28]). Total uncertainty is the combination of variability and uncertainty. To avoid confusion it was suggested to rename total uncertainty as indeterminability [28], a terminology adopted in our work.
Our example focusses on the risk assessment process of whether 2,3,7,8-tetrachlorodibenzo-p-dioxin (TCDD, "Seveso dioxin") is a potential human carcinogen. In 1997 TCDD was evaluated as a human carcinogen [19], [22]. The decision substantially relied on empirical studies of highly exposed occupational cohorts. The so-called Boehringer cohort was amongst them, and its data were thoroughly analysed [1], [2], [13], [14], [24]. These statistical analyses were a rather delicate task as, amongst other things, individual lifetime TCDD exposures starting in the 1950s had to be reconstructed from TCDD measurements in the 1980s and 1990s, when such measurements became feasible and affordable. Inevitably, a lot of uncertainty remained due to lack of longitudinal physiological data, the possibility of measurement errors and workplace misclassification errors, disagreement about the appropriate statistical analysis strategy, limited knowledge about the functional dose-carcinogenic property relationship and the advent of new toxicokinetic insight, just to name a few circumstances.
Now, it is quite common that results of large-scale statistical or epidemiological analyses will be questioned and disputed. However, the goal of an uncertainty analysis is to tell us how much we can be wrong and still be okay [7]. Therefore we designed a computer simulation study to be able to examine in detail the influences of various sources of uncertainty and their potential implications on the risk estimates from the Boehringer cohort data.
The paper is organized as follows. In Section 2 our adopted view of uncertainty analysis is defined in brief. Section 3 is devoted to dioxin, that is, general characteristics of the compound, features of the Boehringer cohort data set and various approaches to model lifelong human toxicokinetics are described. Section 4 contains technical and non-technical design aspects of the intended computer simulation study. In Section 5 a brief discussion is given.

2 Indeterminability, variability and uncertainty

Indeterminability (or total uncertainty) denotes the inability to precisely predict what the future holds. The two components of indeterminability are variability and uncertainty [28]. According to Hodges [18], a statistical, a structural and a technical part of indeterminability can be distinguished (see also [12], [17]). The statistical part corresponds to variability, whereas the other two parts correspond to uncertainty. The statistical part of indeterminability is variation given structure or, in other words, residuals given a model, a common statistical technique to describe variability in a regression model.
Structural uncertainty emerges from the fact that the model itself, the assumed structure, may be uncertain either due to incomplete or insufficient knowledge about biological, physiological or toxicological mechanisms, or due to the existence of more than one way to explain a specific phenomenon, that is, there are several plausible models. A special and very important aspect of structural uncertainty is the so-called model parameter uncertainty [12], i.e. uncertainty about model assumptions and model constants. In toxicokinetic models, e.g., the total lipid volume of the body may be assumed non-varying over human lifetime, or the elimination half-life of a certain toxin may be considered known in one approach, whereas it may not in another.
The third part of indeterminability in Hodges' classification is technical uncertainty, which mainly comprises the ordinary and unspectacular circumstances of everyday scientific work. It is usually neglected, although occasionally it may account for a considerable fraction of indeterminability. Examples of technical uncertainty are poor quality of raw data (e.g. typos, rounding errors), numerical estimation problems, in particular in connection with complex nonlinear models, or research limitations due to lack of resources (e.g. software, time, human expertise), which may artificially restrict the spectrum of considered scientific models or employed statistical analysis methods.

3 Dioxin at a glance
3.1 Polychlorinated dibenzodioxins and -furans
(PCDD/Fs)
PCDD/Fs are highly lipophilic synthetic chemicals which arise primarily
from the production and combustion process of chlorinated chemicals and
as a byproduct of chlorinated bleaching and waste incineration. Environ-
mental contamination by PCDD/Fs has been documented worldwide and is
ubiquitous. In industrialised countries the PCDD/F burden of the population
is assumed to result mainly from intake of contaminated food . Improvements
in the analytical techniques used to measure PCDD/F concentrations have
allowed for the concentration of these compounds to be assessed in reasonable
amounts of human tissue, most notably in adipose tissue, blood serum and
plasma. Repeated determinations in humans allow the investigation of the kinetics of these toxins.
TCDD is believed to be the most potent of the PCDD/Fs. Numerous effects in humans have been observed from exposure to TCDD; amongst them are lung cancer and soft-tissue sarcoma. Observed adverse health effects other than cancer include chloracne, altered sex hormone levels, altered developmental outcomes, altered thyroid function, altered immune function, cardiovascular diseases and neurological disorders, to name just a few; see also the survey of Grassman et al. [15]. The establishment of a causal relationship between exposure to dioxins and diseases in humans is of outstanding significance in public health and disease prevention. To establish such a causal link is extremely difficult since chronic diseases may occur a long time after the actual exposure has ceased, and this extended lag time (latency period) between exposure and disease onset may obscure a causal link. This implies the need for proper modelling of the individual intoxication process in order to construct appropriate dose metrics (like the area under the concentration-time curve) for quantitative representation of the disease-exposure relationship.
Obviously it is essential to relate the occurrence of diseases to dioxin levels experienced during the exposure before disease onset. Previous levels have to be estimated from present ones. Retrospective determination of dioxin levels in humans and their subsequent use in risk assessment are strongly connected to the toxicokinetics of the dioxins. Chronic environmental exposure, route of exposure, storage in adipose tissue, and mechanism of elimination are important determinants of the level of TCDD in serum years after possibly high occupational exposures. Currently available physiologically based pharmacokinetic (PBPK) models try to meet these requirements at least partly.
Occupationally exposed cohorts are an important source of information due to more pronounced effects (occupational exposures are higher in general) and improved ability to control for confounders (easier and more reliable information retrieval among workers registered in files of companies or insurance agencies). For workers in the chemical industry, where occupational exposure to dioxins has occurred in past production periods, the establishment of causal relationships is also connected to insurance and compensation issues, which requires an individually-based assessment of exposure, disease onset and their relationship. In 1997 the International Agency for Research on Cancer (IARC) re-evaluated TCDD as carcinogenic to humans (IARC group 1 classification) on the basis of limited evidence of carcinogenicity in humans and sufficient evidence of carcinogenicity in experimental animals [19], [22]. The most important studies, which gave evidence with respect to human carcinogenicity, were four cohort studies with adequate follow-up times of herbicide producers, one each in the United States and the Netherlands, two in Germany. The largest and most heavily exposed German cohort is the so-called Boehringer cohort [13], [14], [1], [2]. Main features of the Boehringer cohort are described in the next subsection.

Overall, the strongest evidence for TCDD carcinogenicity is for all cancers
combined, not for a specific site. Due to the lack of a clearly predominating
site it was considered by the IARC that there is limited evidence in humans
for the carcinogenicity of TCDD [19], [22] . This could be due to still limited
power of those epidemiological studies requiring cautious appreciation, or due
to an unspecific non-standard carcinogenic action of dioxin. The evidence in
humans for the carcinogenicity of all other PCDDs is even more diffuse and
was rated inadequate by the IARC in 1997.

3.2 The Boehringer cohort


The Boehringer cohort consists of around 1600 workers occupationally ex-
posed to PCDD/Fs. About a quarter of the workers are women. The co-
hort members came from two plants operated by the C.H. Boehringer Sohn
Chemical Company, one in Ingelheim and the other in Hamburg, Germany.
In Ingelheim 2,4,5-trichlorophenol (TCP) was produced from 1950 to 1954; in Hamburg TCP was produced from 1957 until contamination with dioxins was stopped in April 1983, and the plant was finally closed in October 1984 [5], [21]. Since 1984, an investigation programme independent of the C.H. Boehringer Sohn Chemical Company has been performed by the Institute of Occupational and Social Medicine of the University of Mainz [5]. Biomonitoring data on TCDD, the major PCDD/F congeners and several polychlorinated biphenyl congeners have been obtained from samples of adipose tissue or blood serum lipids, comprising 186 persons evaluable for health evaluation in the first phase from 1984 until 1989 and 192 persons in a second medical investigation programme started in 1992 [3]. This cohort was further investigated in a follow-up study using dioxin concentration measurements for 88 persons [4].
The Ingelheim and Hamburg plants can be subdivided into about 20 work-
ing areas corresponding to different involvement in the production processes
(e.g. bromophos production, trichlorophenol production, 2,4,5-trichlorophe-
noxyacetic acid production, repair, laundry, administration, etc.), believed
to result in different exposures levels to dioxins . Work histories were docu-
mented using a recall questionnaire asking for the start of employment, end
of employment and sojourn times in the working areas.

3.3 Available toxicokinetic models

A series of PBPK models for lifelong TCDD exposure in humans is available in the literature. Nearly all of them assume a linear elimination kinetic; they only differ in the sophistication with which time-dependent physiological variables such as body weight, body fat volume or liver fat volume are considered (e.g. [11], [10], [23], [20], [14], [26]). The model of Carrier et al. [8], [9] is an exception in terms of the elimination function, which is based on a modified Michaelis-Menten function.

Of course, more biologically complex mechanistic models could be suggested. Phenomena such as TCDD absorption, distribution, binding to liver receptors, enzyme induction, and synthesis of binding proteins could be considered. However, such phenomena occur on a much faster time scale (hours to days) than TCDD elimination (years in humans), which finally justifies the assumption of a quasi-equilibrium between TCDD in the lipid fraction of blood, liver and adipose tissue. Note that this assumption (or variations of it) is made either explicitly or implicitly in all of the lifelong TCDD models for humans mentioned above.

4 The computer simulation study

The planning of a large computer simulation study comprises technical and non-technical issues. The technical issues coincide to a large extent with the problem analysis and design steps of the common three-step software development process (where the third step is implementation).
The non-technical issues consist of various essential prerequisites and fundamental decisions. Treating them lightly could seriously jeopardise the success of the whole project.

4.1 Problem analysis

At first, it is necessary to analyse plausible PBPK models for human lifetime toxicokinetics of TCDD and to integrate them into more comprehensive models. Among others, these models should allow for multiple exposure to different toxins with similar kinetics (PCDD/Fs instead of just TCDD alone), chronic exposure (both background and workplace) and pointwise exposure (e.g. through accidents). These models are used to establish a dose-response relationship for a proper risk assessment of TCDD. The models also have to allow the construction of individual human exposure profiles over longer time periods. Ideally, one wide-ranging model could be found, from which all others can be deduced as special cases.
This approach, or these different approaches, to modelling individual human lifetime toxicokinetics could be mechanistically compared under various realistic scenarios, e.g. temporal change in background exposure, spatial change in workplace exposure, high accidental exposure over a short time, effects of gaining and losing weight during lifetime, lifetime effects of breast-feeding (both in contaminated women and in persons who during childhood have been breast-fed by a contaminated woman), sensitivity of model parameters, effects of congeners other than TCDD, effects of confounder variables like smoking status (in particular effects of ignoring them), and effects of ignoring interaction terms in the model (interactions among two mutually different congeners or among a congener and a confounder variable). The construction of exposure indices from individual concentration-time curves could also be studied.

The main part of the project is Monte Carlo computer simulations in order to assess uncertainty in the toxicokinetic modelling process up to its implications on risk assessment. The main issues to be studied are, amongst other things:

• Uncertainty in the choice of PBPK model assumptions: e.g. assume a nonlinear kinetic for toxin elimination to generate data and use a linear kinetic for the analysis. The goal is to identify those model assumptions to which dose level prediction is particularly sensitive.

• Sensitivity of model parameters to interindividual variation: e.g. individualise age-related changes of body fat volume.

• Uncertainty caused by measurement of toxin levels: different laboratories report different dioxin levels for the same sample. In the Boehringer data differences of 50% or more occur frequently [12, Figure 4b].

• Uncertainty caused by workplace misclassifications: participants of the Boehringer study have been asked about their working history. These interviews have been repeated at a later time point. Comparisons revealed that 50% of the reported working times and 30% of the reported working areas did not match between the two interviews [12].

• Uncertainty caused by different approaches to model the covariance structure of repeated measurements.

• Uncertainty due to the choice of the statistical estimation method.

• Effects of missing values and unknown confounders.

• Uncertainty in the choice of an appropriate exposure index, lag time and dose-response relationship: this form of uncertainty concerns the subsequent processing of the toxicokinetic results in dose-response models. Even if the former yielded absolutely correct values, uncertainty in the latter would still distort the results of the risk assessment process.

• Selection effects: these could easily have occurred in the Boehringer cohort data as participation in the dioxin measurement programme was on a voluntary basis. A specific form of selection bias is the so-called "healthy worker survivor effect" (see e.g. [25]).

To meet these requirements, a computer program library with a flexible modular structure has to be designed and implemented (see next subsection). Thereby note that uncertainty analysis can only shed light on overlooked issues, underrated issues or issues which have not been known at the time of the original analysis itself. It is probable that some time after the completion of the uncertainty analyses new scientific theories may evolve, e.g. a new toxicokinetic TCDD model for humans. The design of the computer program library should allow a flexible and smooth integration of currently unknown but conceivable future developments.
There are numerous adequate software products available in which the computer program library could be implemented, so the actual decision is mainly a matter of personal preference. In the current case the computer program library is implemented in the form of SAS macros (SAS Institute Inc., Cary, NC, USA).

4.2 Program library for Monte Carlo simulations

The main goal of the simulation study is to mimic the essential features of both the Boehringer cohort and the corresponding statistical analyses. Four main computer program modules can be distinguished.

4.2.1 Simulation of whole cohort. The simulated plant is operating between 1950 and 1985. Amongst other things, five main working areas with different TCDD working exposure levels are assumed. The exposure levels are assumed to follow a lognormal distribution with mean intakes of 3500, 150, 40, 5 and 0 TCDD units/year, respectively. Mean background exposure is set to 1 unit/year. The mean values closely resemble the actual exposure estimates as reported in Becher et al. [1]. The highest exposure occurs solely in the 1950s. Determination of TCDD concentrations in the simulated workers happens in 1990 and 1995. The willingness of the workers to participate in the TCDD screening programme is simulated as well. The numbers of workers in the simulated cohort and in the simulated TCDD screening programme should approximately resemble the corresponding numbers in the Boehringer cohort.
Individual change of working area, termination of work contract, retirement and death of the virtual workers are randomly simulated, as well as the hiring of new workers. The TCDD elimination kinetic is generated according to four different scenarios, that is, simple linear kinetic with constant total lipid volume (TLV) over lifetime, simple linear kinetic with TLV varying with the worker's age, linear kinetic according to Thomaseth and Salvan [26] with TLV and liver lipid volume varying with the worker's age, and modified Michaelis-Menten kinetic with body weight varying with the worker's age [8], [9]. During their lifetime the simulated workers are subject to developing one of two kinds of cancer. Development of cancer will increase the mortality of a simulated worker and will entail his retirement. The functional dose-cancer response relationship of TCDD is modelled by increasing the hazard for the first kind of cancer proportionally to the individual TCDD exposure during lifetime. Various TCDD exposure indices can be explored (e.g. area under the concentration-time curve (AUC), lagged AUC, etc.).
Due to the hazard increase in the first out of the two kinds of cancer, the existence of a predominating cancer site is simulated.
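As a toy illustration of this module (the actual library consists of SAS macros; the coefficient of variation and the worker numbers below are assumptions made for the sketch, not settings from the study), annual intakes per working area could be drawn from lognormal distributions with the stated means:

import numpy as np

rng = np.random.default_rng(1)

# Mean annual TCDD intakes (units/year) for the five working areas plus background.
AREA_MEANS = np.array([3500.0, 150.0, 40.0, 5.0, 0.0])
BACKGROUND_MEAN = 1.0
CV = 1.0  # assumed coefficient of variation of the lognormal intake distribution

def lognormal_with_mean(mean, cv, size):
    """Draw lognormal intakes with a given mean and coefficient of variation."""
    if mean <= 0.0:
        return np.zeros(size)
    sigma2 = np.log(1.0 + cv**2)
    mu = np.log(mean) - 0.5 * sigma2
    return rng.lognormal(mu, np.sqrt(sigma2), size)

# One simulated year of intake for 100 workers in each working area:
intakes = {a: lognormal_with_mean(m, CV, 100) + lognormal_with_mean(BACKGROUND_MEAN, CV, 100)
           for a, m in enumerate(AREA_MEANS)}
print({a: round(v.mean(), 1) for a, v in intakes.items()})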

4.2.2 Measurement errors. In module 4.2.1 simulated true values are recorded. These will be contaminated with TCDD measuring errors, workplace misclassification errors, etc. in order to get simulated observed values.
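A sketch of such a contamination step is given below; the multiplicative error model and the error magnitudes are illustrative assumptions, loosely motivated by the 50% inter-laboratory differences and the 30% misclassified working areas reported in Section 4.1.

import numpy as np

rng = np.random.default_rng(2)

def contaminate(true_conc, true_area, n_areas=5,
                meas_cv=0.5, misclass_prob=0.3):
    """Turn simulated true values into simulated observed values."""
    # Multiplicative lognormal measurement error on the concentration.
    obs_conc = true_conc * rng.lognormal(0.0, meas_cv, size=true_conc.shape)
    # With probability misclass_prob, report a randomly chosen working area instead.
    flip = rng.random(true_area.shape) < misclass_prob
    random_area = rng.integers(0, n_areas, size=true_area.shape)
    obs_area = np.where(flip, random_area, true_area)
    return obs_conc, obs_area

conc, area = contaminate(np.array([10.0, 200.0, 35.0]), np.array([0, 1, 2]))
print(np.round(conc, 1), area)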

4.2.3 Workplace exposure backcalculation. TCDD measurements are available only a long time after the actual workplace exposure. Under plausible assumptions (concerning background exposure, fat fraction of the body, form of elimination from the body, etc.) the exposure levels in the different working areas can be estimated by backcalculation. There have been two different main attempts to perform a backcalculation; one is described in detail by Becher et al. [1], the other is due to Portier et al. [24]. Both attempts can be compared with this program module [16].
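To make the idea concrete, here is a minimal sketch of backcalculation under a simple first-order (linear) elimination kinetic; the half-life, background level and dates are placeholders, and the actual approaches of Becher et al. [1] and Portier et al. [24] are considerably more elaborate.

import numpy as np

def backcalculate(conc_measured, years_since_exposure_end,
                  half_life_years=7.0, background=1.0):
    """Back-extrapolate a measured lipid concentration to the end of workplace
    exposure, assuming first-order elimination and a constant background level."""
    lam = np.log(2.0) / half_life_years
    excess_now = max(conc_measured - background, 0.0)
    return background + excess_now * np.exp(lam * years_since_exposure_end)

# A worker measured at 40 units in 1992 whose exposure ended in 1983:
print(round(backcalculate(40.0, 1992 - 1983), 1))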

4.2.4 Risk estimates. Various individual time-dependent exposure indices are extracted for all members of the simulated cohort. The dose-response relationship between these time-dependent exposure indices and cancer incidence and mortality is assessed by use of Cox regression models, Poisson regression models and standardised mortality ratio analyses [1]. The final results of this simulation module are cancer risk estimates, which in reality would provide the decision basis for risk managers.
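The Cox regression step could, for instance, look roughly as follows; this is only a sketch that assumes the simulated data have been collected into a pandas data frame and uses the third-party lifelines package, neither of which is part of the paper's SAS implementation.

import pandas as pd
from lifelines import CoxPHFitter

# Hypothetical simulated data: follow-up time, cancer indicator and a lagged
# AUC exposure index per worker.
df = pd.DataFrame({
    "followup_years": [30.2, 12.5, 25.0, 8.7, 40.1],
    "cancer":         [0,    1,    0,    1,   0],
    "lagged_auc":     [12.0, 830.0, 410.0, 45.0, 3.0],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="followup_years", event_col="cancer")
print(cph.params_)   # log hazard ratio of the exposure index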

4.3 Miscellaneous non-technical issues

When prearranging uncertainty investigations, their time demand should be taken into account accordingly. The availability of a detailed and profound documentation of the statistical analyses in question is an important prerequisite. Risk assessment for dioxins is an interdisciplinary effort. The integration of research results from various scientific disciplines such as toxicology, molecular biology, biochemistry, medicine, epidemiology and biostatistics is required. It is self-evident that each isolated effort would be doomed to failure. An uncertainty analysis is no exception. Arrangements have to be made in order to allow the permanent discussion of assumptions and results with exponents of the other scientific disciplines.
It is an open question who should do the uncertainty analysis. Two options are obvious: the uncertainty analysis is performed within the team which did the original statistical analysis, or outside this team. The pros of the former case are evident, that is, already existing knowledge of the matter will result in efficient work (and usually there will be some kind of uncertainty assessment already during the performance of a statistical analysis). However, the cons are evident as well. That is, if somebody works over a longer period of time on a certain problem, then some sort of factory blindness will be hardly avoidable. On the other hand, if somebody from outside the statistical analysis team performs the uncertainty assessment, then this person will usually have another main focus on the research problem, and new ideas may be developed due to the non-involvement in the original analysis. The cons of this approach are the greater effort to familiarise with the subject and a possibly difficult relationship with the team members of the original analysis. These considerations should be made an integral part of the project's statistical analysis schedule from the beginning.
Here a rather traditional Monte Carlo simulation study is utilised for uncertainty assessment. It mainly consists of the exploration and evaluation of different interesting scenarios. Alternatively, an uncertainty assessment could be performed within a fully Bayesian framework (see e.g. [6], [7]). A detailed comparison of the pros and cons of both approaches is beyond the scope of this paper.

5 Discussion
Risk assessment is a vital activity in modern society because it provides the scientific basis for efforts to identify and control hazards to health and life. However, risk assessment is generally subject to great uncertainty. The scientific knowledge available in this field is far from sufficient. Uncertainty in risk assessment is at present a major but largely unsolved problem to be faced with solid research.
The goal of uncertainty analysis is to provide an evaluation of the limits of our knowledge, or in other words, an uncertainty analysis should tell us how much we can be wrong and still be okay [7].
Uncertainty assessment of large-scale statistical analyses is obviously a reasonable and essential task in the empirical research process. In our view it is useful to consider the idea of indeterminability, which can be subdivided into statistical variability, structural and technical uncertainty [18], [12], [17].
Analytical approaches to assess structural and technical uncertainty will easily be limited by the complexity of the underlying problems. However, elaborate computer simulation studies have evolved as an appropriate tool for the investigation of these types of indeterminability [28].
Obviously, analysis of uncertainty comprises uncertainty itself. During an uncertainty analysis various decisions about parameter settings (e.g. constant or random, distribution type and distribution parameters, etc.) have to be made. Actually, these settings would require an uncertainty analysis of their own. That is, there would be meta-uncertainty, the uncertainty of the uncertainty analysis. And then there would be meta-meta-uncertainty, the uncertainty of the meta-uncertainty analysis, such that we would build one layer of uncertainty on another and finally miss the goal. The loophole in this catch is the insight that uncertainty analyses are not done on their own, but are part of the scientific research process. Accordingly, the results of an uncertainty analysis should be communicated to the scientists who posed the research question, collected the data and performed the statistical analysis on the one hand, as well as to other experts in the field on the other hand. Together these researchers will be able to assess the validity of the uncertainty analysis and to discuss the consequences of the results [17].

References
[1] Becher H., Flesch-Janys D., Gurn P., Steindorf K. (1998a). Berichte 5/98, Krebsrisikoabschätzung für Dioxine. Risikoabschätzungen für das Krebsrisiko von polychlorinierten Dibenzodioxinen und -furanen (PCDD/Fs) auf der Datenbasis epidemiologischer Krebsmortalitätsstudien. Forschungsbericht im Auftrag des Umweltbundesamtes, Erich Schmidt Verlag, Berlin.
[2] Becher H., Steindorf K., Flesch-Janys D. (1998b) . Quantitative cancer
risk assessment for dioxins using an occupational cohort. Environ Health
Perspect 106 (Suppl 2), 663-670.
[3] Beck H., Eckart K., Mathar W., Wittkowski R (1989). Levels of PCDD 's
and PCDF's in adipose tissue of occupationally exposed workers. Chemo-
sphere 18, 507-516.
[4] Benner A., Edler L., Mayer K., Zober A. (1993). Untersuchungspro-
gramm "Dioxin" der Berufsgenossenschaft der chemischen Industrie .
Ergebnisbericht - Teil II. Arbeitsmedizin, Sozialmedizin, Umweltmedi-
zin 29 , 11-16.
[5] BG Chemie . (1990). Untersuchungsprogramm 'Dioxin', Ergebnis-
bericht - Teil 1. Berufsgenossenschaft der Chemischen Industrie. BG
Chemie (Ed .), Heidelberg, ISBN: 3-88338-302-9.
[6] Bois F .Y. (1999). Analysis of PBPK models for risk characterization.
Annals of the New York Academy of Sciences 895, 317- 337.
[7] Bois F.Y., Diack C. (2004). Uncertainty analysis. In: Quantitative Meth-
ods for Cancer and Human Health Risk Assessment, Edler L., Kitsos
C.P. (Eds .), Wiley, Chichester, to appear.
[8] Carrier G., Brunet R.C., Brodeur J. (1995a). Modeling of the toxicokinetics of polychlorinated dibenzo-p-dioxins and dibenzofurans in mammalians, including humans. I. Nonlinear distribution of PCDD/PCDF body burden between liver and adipose tissues. Toxicology and Applied Pharmacology 131, 253-266.
[9] Carrier G., Brunet R.C., Brodeur J. (1995b). Modeling of the toxicokinetics of polychlorinated dibenzo-p-dioxins and dibenzofurans in mammalians, including humans. II. Kinetics of absorption and disposition of PCDDs/PCDFs. Toxicology and Applied Pharmacology 131, 267-276.
[10] Caudill S.P., Pirkle J.L., Michalek J.E. (1992). Effects of measurement
error on estimating biological half-life. Journal of exposure analysis and
environmental epidemiology 2, 463-476.
[11] Craig T .O., Grzonka RB . (1991). A time-dependent 2,3,7,8-
tetrachlorodibenzo-p-dioxin body-burden model. Arch. Environ. Contam.
Toxicol. 21 ,438-446.
[12] Edler L. (1999). Uncertainty in biomonitoring and kinetic modeling. An-
nals of the New York Academy of Sciences 895, 80-100.
210 Harald Heinzl and Martina Mittlboeck

[13] Flesch-Janys D., Berger J., Gurn P., Manz A., Nagel S., Waltsgott
H., Dwyer J .H. (1995) . Exposure to polychlorinated dioxins and fu-
rans (PCDD/F) and mortality in a cohort of workers from a herbicide-
producing plant in Hamburg, Federal Republic of Germany. American
Journal of Epidemiology 142, 1165-1175. Published erratum in Amer-
ican Journal of Epidemiology (1996) 144, 716.
[14] Flesch-Janys D., Steindorf K., Gurn P., Becher H. (1998). Estimation of
the cumulated exposure to polychlorinated dibenzo-p-dioxins/furans and
standardized mortality ratio analysis of cancer mortality by dose in an
occupationally exposed cohort. Environ Health Perspect 106 (Suppl 2),
655 -662.
[15] Grassmann J .A., Masten S.A., Walker N.J ., Lucier G.W. (1998). Ani-
mal models of human response to dioxins . Environ Health Perspect 106
(Suppl 2), 761 -775.
[16] Heinzl H., Edler 1. (2002). Assessing uncertainty in a toxicokinetic model
for human lifetime exposure to TCDD. Organohalogen Compounds 59,
355-358.
[17] Heinzl H., Edler L. (2003). Evaluating and assessing uncertainty of
large-scaled statistical analyses exemplified at the Boehringer TCDD co-
hort. Proceedings of the second workshop on research methodology,.
Ader H.J., Mellenbergh G.J. (Eds) , VU University, Amsterdam, ISBN
90-5669-071-X, 87-94.
[18] Hodges J .S. (1987). Uncertainty, policy analysis and statistics. Statisti-
cal Science 2, 259 - 291.
[19] IARC. (1997). IARC Monographs on the Evaluation of Carcinogenic
Risks to Humans . Vol. 69 : Polychlorinated Dibenzo-para-dioxins and
Polychlorinated Dibenzofurans. International Agency for Research on
Cancer, Lyon.
[20] Kreuzer P.E ., Csanady Gy.A., Baur C., Kessler W ., Papke 0 ., Greim
H., Filser J.G. (1997). 2,3,7,8-Tetrachlorodibenzo-p-dioxin (TCDD) and
congeners in infants. A toxicokinetic model of human lifetime body bur-
den by TCDD with special emphasis on its uptake by nutrition. Arch .
Toxicol. 71,383 -400.
[21] Manz A., Berger J. , Dwyer J.H ., Flesch-Janys D., Nagel S., Waltsgott H.
(1991). Cancer mortality among workers in chemical plant contaminated
with dioxin. Lancet 338, 959 -964.
[22] McGregor D.B., Partensky C., Wilbourn J ., Rice J.M . (1998). An
IARC Evaluation of Polychlorinated Dibenzo-p-dioxins and Polychlori-
nated Dibenzofurans as Risk Factors in Human Carcinogenesis . Environ
Health Perspect 106 (Suppl 2), 755 -760.
[23] Michalek J.E., Pirkle J.L., Caudill S.P., Tripathi R.C., Patterson D.G. Jr., Needham L.L. (1996). Pharmacokinetics of TCDD in veterans of Operation Ranch Hand: 10-year follow-up. Journal of toxicology and environmental health 47, 209-220.

[24] Portier C.J., Edler L., Jung D., Needham L., Masten S., Parham F.,
Lucier G. (1999). Half-lives and body burdens for dioxin and dioxin-like
compounds in humans estimated from an occupational cohort in Ger-
many. Organohalogen Compounds 42, 129-137.
[25] Steenland K., Deddens J., Salvan A., Stayner L. (1996) . Negative bias in
exposure-response trends in occupational studies: modeling the healthy
worker survivor effect. American Journal of Epidemiology 143, 202-
210.
[26] Thomaseth K., Salvan A. (1998). Estimation of occupational exposure to 2,3,7,8-tetrachlorodibenzo-p-dioxin using a minimal physiologic toxicokinetic model. Environ Health Perspect 106 (Suppl 2), 743-753. Published erratum in Environ Health Perspect (1998) 106 (Suppl 4), CP2.
[27] Van der Molen G.W. , Kooijman S.A.L .M., Slob W . (1996). A generic
toxicokinetic model for persistent lipophilic compounds in humans: an
application to TCDD. Fundamental and applied toxicology 31 , 83-94.
[28] Vose D. (2000). Risk analysis: a quantitative guide. 2nd ed., Wiley,
Chichester.
[29] WHO. (1995). Application of risk analysis to food standard issues. Re-
port of the Joint FAO/WHO Expert Consultation. World Health Orga-
nization, Geneva.

Acknowledgement: We particularly emphasise the generous support of Lutz Edler and his colleagues of the Biostatistics Unit of the German Cancer Research Center in Heidelberg, Germany. Furthermore, the study was supported in part by grant J 1823 of the Austrian Science Fund.
Address: H. Heinzl, M. Mittlboeck, Department of Medical Computer Sciences, Medical University of Vienna, Spitalgasse 23, A-1090 Vienna, Austria
E-mail: harald.heinzl@meduniwien.ac.at, martina.mittlboeck@meduniwien.ac.at

SIMULTANEOUS INFERENCE
IN RISK ASSESSMENT;
A BAYESIAN PERSPECTIVE
Leonhard Held

Key words: Risk assessment, Monte Carlo, simultaneous credible bands, simultaneous inference.
COMPSTAT 2004 section: Biostatistics.

Abstract: We consider the problem of making simultaneous inferential statements in risk assessment from a Bayesian perspective. We review a generic algorithm for computing a two-sided simultaneous credible band based on Monte Carlo samples from a multidimensional posterior distribution. A simple modification leads to an upper or lower simultaneous credible bound, which will be described. Such simultaneous credible bands and bounds have attractive properties: they are easy to calculate, completely non-parametric and invariant to monotone component-wise transformations of the variables. We illustrate the proposed approach through an example from low-dose risk estimation, previously analysed in the literature with frequentist methods.

1 Introduction
Statistical risk assessment deals with the probabilistic quantification of potential damaging effects of an environmental hazard. Of particular importance is the formulation and estimation of dose-response relationships based on data from controlled toxicological studies. This paper takes a Bayesian view of the statistical problem of estimating the dose-response relationship and derived quantities. Such an approach has at least two useful features: first, the posterior distribution of any function of the original parameters can be derived exactly using Monte Carlo simulation; secondly, pointwise and simultaneous credible bands and bounds can be computed exactly up to Monte Carlo error.
From a frequentist perspective, the calculation of simultaneous confidence bands has been developed in Pan, Piegorsch and West [8], and has been applied to risk assessment estimation in Al-Saidy et al. [1] and Piegorsch et al. [9]. Al-Saidy et al. [1] consider quantal response data with a binomial likelihood, while Piegorsch et al. [9] apply the methods to continuous measurements based on a quadratic regression model. In this paper we re-analyze the data from Piegorsch et al. [9], but use a Bayesian approach based on Monte Carlo sampling. In particular, we develop methods to calculate simultaneous credible bounds for the benchmark dose at various benchmark risks.
The paper is organized as follows. In Section 2 we review an algorithm to calculate (two-sided) simultaneous credible bands based on Monte Carlo samples from a posterior distribution and outline a straightforward modification to obtain one-sided simultaneous credible bounds. In Section 3 we apply these methods to a problem from low-dose risk assessment and compare our results with those obtained by Piegorsch et al. [9] using frequentist methods. We close with some discussion in Section 4.

2 Monte Carlo estimation of simultaneous credible bands and bounds

2.1 Two-sided credible bands
Assume that we have a sufficiently large sample θ^(1), ..., θ^(n) from a posterior distribution p(θ|y), obtained through simple Monte Carlo or more advanced Markov chain Monte Carlo (MCMC) simulation. Here θ is an unknown parameter of dimension p, perhaps obtained after suitable transformation of the original parameters in the model.
The approach proposed in Besag et al. [2, Section 6.3] starts with sorting and ranking the samples separately for each parameter of interest θ_i, i = 1, ..., p. Let θ_i^[j] denote the corresponding order statistic and r_i^(j) the rank of θ_i^(j), j = 1, ..., n. Let j* be the smallest integer such that the hyperrectangle defined by
\[
\left[ \theta_i^{[n+1-j^*]},\; \theta_i^{[j^*]} \right], \qquad i = 1, \ldots, p, \qquad (1)
\]
contains at least k of the n values θ^(1), ..., θ^(n). Besag et al. point out that j* is equal to the kth order statistic of the set
\[
S = \left\{ \max\left\{ n + 1 - \min_i r_i^{(j)},\; \max_i r_i^{(j)} \right\},\; j = 1, \ldots, n \right\}. \qquad (2)
\]
By construction, the credible region (1) will then contain (at least) 100k/n% of the empirical distribution.
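A compact sketch of this construction (written for this text as an illustration, not code from the paper) is the following.

import numpy as np

def simultaneous_band(samples, level=0.95):
    """Two-sided simultaneous credible band from an (n, p) array of posterior
    samples, following the order-statistics construction of Besag et al. [2]."""
    n, p = samples.shape
    k = int(np.ceil(level * n))
    order = np.sort(samples, axis=0)                     # theta_i^[1..n] per column
    ranks = samples.argsort(axis=0).argsort(axis=0) + 1  # r_i^(j), ranks 1..n
    s = np.maximum(n + 1 - ranks.min(axis=1), ranks.max(axis=1))  # set S, eq. (2)
    j_star = np.sort(s)[k - 1]                           # kth order statistic of S
    lower = order[n - j_star, :]     # theta_i^[n+1-j*] (1-based indexing)
    upper = order[j_star - 1, :]     # theta_i^[j*]
    return lower, upper

rng = np.random.default_rng(3)
theta = rng.normal(size=(10000, 10))
lo, hi = simultaneous_band(theta, level=0.95)
print(np.round(lo[:3], 2), np.round(hi[:3], 2))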
Figure 1 illustrates the construction of simultaneous credible bands for simulated data with n = 25 and p = 10. Each line corresponds to one sample θ^(j), while each column represents a parameter θ_i. The yellow bands are simultaneous credible bands of empirical coverage 84% and 72%, respectively. The set (2) is in this example
\[
S = \{16, 17, 17, 18, 19, 19, 20, 20, 20, 20, 22, 22, 22, 22, 23, 23, 23, 23, 24, 24, 24, 25, 25, 25, 25\}. \qquad (3)
\]
It is straightforward but tedious to re-calculate (3) based on Figure 1 and formula (2).
Note that the simultaneous credible band is a product of symmetric univariate credible intervals of the same level (2j*/n − 1) · 100%. Besag et al. [2] also note that the method is slightly conservative in the sense that, for n fixed, the credible region (1) will typically contain slightly more than 100k/n% of the empirical distribution because of ties in the set (2); this is evident from our small example, where the set (3) has many ties.

[Figure 1 about here: two panels titled '84% simultaneous credible band' and '72% simultaneous credible band'; the x-axis shows the parameter index 1-10.]
Figure 1: Illustration of the construction of simultaneous credible bands for simulated data with n = 25 and p = 10.

This problem increases to an extent with p increasing, because the number of ties will then typically increase. However, the method is still consistent as n → ∞. Empirical evidence shows that these credible bands tend to get rather unstable for credibility levels close to unity. In other words, the Monte Carlo error will be quite large in these circumstances, but this problem can easily be attacked by taking a larger sample. However, the method requires the storage of all samples from all components of θ, which can be prohibitive if p and n are large.

Furthermore, the sorting and ranking of the samples from each component can be computationally intensive if n is extremely large. However, in our experience, for n = 10,000 samples the method gives stable estimates at the usual credibility levels (95 and 99%) in just a few seconds.
Also note that ranking and sorting has to be done only once, even if simultaneous credible bands are required at more than one level. Only the set (2), the ordered samples θ_i^[j] and the ranks r_i^(j) need to be available to calculate simultaneous credible bands at additional levels. The computational effort to calculate these additional simultaneous credible bands is negligible compared to the initial ranking and sorting.

2.2 One-sided credible bounds


Besag et al. [2] note that "one-sided and other asymmetric bounds can be constructed analogously", but do not give further details. We will now look at this problem more closely. Clearly, the general idea of the approach described above can be easily applied to calculate, say, an upper confidence bound: Let j* be the smallest integer such that the area defined by

(−∞, θ_i^[j*]],   i = 1, ..., p   (4)

contains at least k of the n values θ^(1), ..., θ^(n). This procedure thus defines a one-sided upper credible bound of credibility level 100k/n%. The only question remaining is whether there is also an analogous formula to (2). Indeed, j* now simply equals the kth order statistic of the set

{ max_i r_i^(j),   j = 1, ..., n }.   (5)

Similarly, a lower bound can be obtained by

[θ_i^[j*], ∞),   i = 1, ..., p   (6)

where j* now equals the kth order statistic of the set

{ min_i r_i^(j),   j = 1, ..., n }.   (7)

A completely equivalent way to calculate a lower simultaneous credible bound is of course to compute the negative upper simultaneous credible bound of the negative samples.
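As a hedged illustration (again assuming the n × p sample matrix theta used in the sketch above), the one-sided upper bound of (4)-(5) can be computed along the same lines:

upper.bound <- function(theta, level = 0.95) {
  n <- nrow(theta)
  k <- ceiling(level * n)
  r <- apply(theta, 2, rank, ties.method = "first")
  jstar <- sort(apply(r, 1, max))[k]     # k-th order statistic of the set (5)
  apply(theta, 2, sort)[jstar, ]         # componentwise upper bounds theta_i^[j*]
}

A lower bound is then available as -upper.bound(-theta), in line with the remark above.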
Given the general applicability of the method proposed by Besag et al. [2] described above, it is surprising how rarely it has been used in practice. We will now describe an application taken from the area of low-dose risk estimation, where simultaneous credible bounds are useful.

3 Applications in low-dose risk estimation


Here we look at a specific problem in low-dose risk estimation, where the observed data Y(x) are continuous, reflecting the adverse effect of some toxic exposure x. In other words, Y(x) is expected to decrease with increasing x. The data come from a study originally described in Chapman et al. [4], where x is a particular concentration of copper and Y(x) is the germination tube length of giant kelp exposed to copper at dose x. There were up to five replicate observations for each of six copper concentrations between 0 and 180 µg/L.
Let Y(x_i) = μ(x_i) + ε_i, where ε_i ~ N(0, σ²), i = 1, ..., m, are independent. We follow Piegorsch et al. [9] and assume a simple quadratic regression model μ(x) = β0 + β1 x + β2 x². It may perhaps be useful to impose a further monotonicity constraint on the regression coefficients β = (β0, β1, β2)^T such that the function μ(x) is decreasing with increasing x. A weaker requirement is to assume that μ(x) is monotone at least within the observed range of x values. We will comment on such modifications in the discussion but use for the moment the unconstrained model.
A key quantity in risk assessment is the so-called risk function R(x) = P(Y(x) ≤ μ(0) − δσ), where δ is a constant, typically chosen as δ = 2 or δ = 3. The idea is that a response which is more than δ standard deviations below the control mean is considered adverse, and R(x) quantifies the probability of such an event as a function of the dose x. Furthermore, the additional risk is defined as R_A(x) = R(x) − R(0), which under the normal model becomes

R_A(x) = Φ(−δ − (β1 x + β2 x²)/σ) − Φ(−δ),   (8)

where Φ(·) is the standard normal distribution function. Finally, a key concept in risk assessment is the notion of the benchmark risk and benchmark dose. This is often used to establish the low-dose level, the benchmark dose x_B, needed to generate a specific additional risk, the benchmark risk z ∈ (0, 1). Hence model (8) is inverted to find the benchmark dose x_B for a fixed benchmark risk z, i.e. solve R_A(x_B) = z for x_B(z).
Piegorsch et al. [9] develop sophisticated methodology to compute a frequentist simultaneous upper confidence bound for R_A(x). The estimated function is then inverted based on equation (8) to obtain a simultaneous lower confidence bound for x_B(z). Here we will devise an alternative Bayesian approach based on Monte Carlo sampling. For notational convenience, we set κ = 1/σ². A non-informative reference prior p(β, κ) ∝ κ⁻¹ is assumed for the unknown parameters (e.g. [3]) and hence the posterior distribution is of the usual normal-gamma form, known from standard linear model theory:

p(β, κ|y) = p(κ|y) p(β|κ, y).

Here p(κ|y) is a gamma distribution with parameters (m − p)/2 and s² · (m − p)/2, where p = 3 is the dimension of β and s² is the classical (unbiased)

estimate of the variance σ². Furthermore, p(β|κ, y) is normal with mean equal to the least squares estimate β̂ = (X'X)⁻¹X'y and covariance matrix κ⁻¹(X'X)⁻¹. We can thus easily generate independent samples from this posterior distribution by first sampling κ^(j) from p(κ|y) and then sampling β^(j) from p(β|κ^(j), y).
A Bayesian approach using Monte Carlo sampling has the advantage that samples from any function of the parameters can be obtained without any need for approximations, such as, for example, the delta method. In the current context, R_A(x) as defined in (8) is a simple function of the parameters β1, β2 and σ². Hence we are able to compute the posterior distribution of R_A(x) for a range of values of x, say x_1 < x_2 < ... < x_M, and then compute simultaneous credible bounds for the parameters R_A(x_1), R_A(x_2), ..., R_A(x_M). For illustration, Figure 2 displays the first n = 100 samples from the posterior distribution of R_A(x) for δ = 3.
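A purely illustrative R sketch of this simulation (not the author's code) could look as follows, assuming X is the m × 3 design matrix with columns (1, x, x²) and y the response vector:

sample.RA <- function(X, y, n = 10000, delta = 3, xgrid = 0:180) {
  m <- nrow(X); p <- ncol(X)
  bhat <- solve(crossprod(X), crossprod(X, y))              # least squares estimate
  s2 <- sum((y - X %*% bhat)^2) / (m - p)                   # unbiased variance estimate
  kappa <- rgamma(n, (m - p) / 2, rate = s2 * (m - p) / 2)  # kappa^(j) from p(kappa | y)
  L <- t(chol(solve(crossprod(X))))
  beta <- t(sapply(kappa, function(k)                       # beta^(j) from p(beta | kappa^(j), y)
    as.vector(bhat + L %*% rnorm(p) / sqrt(k))))
  sigma <- 1 / sqrt(kappa)
  ## posterior samples of R_A(x), using (8), at every x on the grid
  RA <- sapply(xgrid, function(x)
    pnorm(-delta - (beta[, 2] * x + beta[, 3] * x^2) / sigma) - pnorm(-delta))
  list(beta = beta, sigma = sigma, RA = RA)                 # RA is an n x length(xgrid) matrix
}

The simultaneous upper credible bound shown in Figure 3 then corresponds to applying the one-sided construction of Section 2.2 (e.g. the upper.bound sketch) to the columns of RA.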

[Figure 2: 100 samples from R_A(x) for x ∈ [0, 180], δ = 3. x-axis: Dose (mg/kg).]

Figure 3 now displays the posterior median of R_A(x), as well as the 95% simultaneous upper credible bound for R_A(x), calculated using (4) and (5). These have been obtained using n = 10,000 samples and 181 equally spaced values of x ∈ {0, 1, ..., 180}. For comparison, we also display the frequentist estimate of R_A(x) as well as the corresponding 95% simultaneous upper confidence bound described in Piegorsch et al. [9].
Note that the Bayesian point estimates are slightly above the frequentist ones. A more pronounced difference can be seen for the simultaneous upper bound, which is again larger in the Bayesian approach.
Piegorsch et al. [9] go on to construct lower simultaneous confidence bounds

,
,
0::>
ci

<0
ci
-e: Bayesian estimate(posteriormedian)
a: Frequentistestimate
<t Bayesian simultaneous crediblebound
ci Frequentist simultaneous confidence bound

C\I
ci

0
...',.
ci

0 50 100 150

Dose (mg/kg)

Figure 3: Est imat ed RA func t ion and simult aneous upper 95% credible
bound, 0 = 3.

for x_B = x_B(z) as a function of the benchmark risk z ∈ (0, 1) by simply inverting the upper simultaneous bounds obtained for R_A(x_B). Here we look at an alternative Bayesian sample-based solution.
Again, we view x_B(z) as a function of the original parameters, i.e. for each benchmark risk z and each sample β1^(j), β2^(j) and σ^(j), we solve equation (8) for x_B^(j). The two solutions are

x_B(z) = (−β1 ± √(β1² − 4 c1 β2 σ)) / (2β2),   (9)

where c1 = δ + Φ⁻¹(z + Φ(−δ)). If there are two real positive solutions, we take the smaller value as x_B^(j). Note that we have dropped the upper index (j) in (9) for simplicity. For each value of z, this defines samples from the posterior distribution of x_B(z).
If x_B(z) were well-defined for every value of z and every sample of the parameters β1, β2 and σ, we could indeed just invert the simultaneous credible bound for R_A(x_B) to obtain one for x_B(z), just as in the frequentist case. However, there will not always be a positive real solution (9). Here it turns out that for z = 0.99, 11% of the samples do not have a real solution. For smaller values of z, far fewer samples lack a real solution; in fact less than 1% are "missing" for z ≤ 0.83. For illustration, consider Figure 4, which displays the first 100 samples from the posterior of x_B(z).
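Continuing the illustrative sketch (with the assumed objects beta and sigma from above), posterior samples of x_B(z) according to (9), including the handling of "missing" samples without a positive real solution, could be drawn as follows:

benchmark.dose <- function(beta, sigma, z, delta = 3) {
  c1 <- delta + qnorm(z + pnorm(-delta))
  disc <- beta[, 2]^2 - 4 * c1 * beta[, 3] * sigma
  disc[disc < 0] <- NA                           # no real solution: sample is "missing"
  roots <- cbind((-beta[, 2] + sqrt(disc)) / (2 * beta[, 3]),
                 (-beta[, 2] - sqrt(disc)) / (2 * beta[, 3]))
  roots[!is.na(roots) & roots <= 0] <- NA        # discard non-positive solutions
  ## take the smaller positive root, as described in the text; NA if none exists
  xb <- suppressWarnings(apply(roots, 1, min, na.rm = TRUE))
  ifelse(is.finite(xb), xb, NA)
}

Applying this over a grid of z values, followed by the one-sided lower bound construction of Section 2.2 after complete-case deletion or median imputation of the missing samples, yields the bounds shown in Figure 5.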
Inverting the credible bound for R_A(x_B) may hence induce bias, and we therefore consider a sample-based, direct solution to construct a lower

simultaneous bound for a series of benchmark risks z_l, l = 1, ..., L. We choose L = 100 equally-spaced values for z between 0 and 0.99. Note that the upper bound on z_L is Φ(δ) < 1.0, which in our case equals 0.9987.
We now have two options. First, we may delete all those samples where x^(j)(z_l) is missing for at least one of the benchmark risks z_1, ..., z_L considered. Due to monotonicity properties this corresponds to deleting all those samples where x^(j)(z_L) is missing. However, this may induce bias in the estimation of simultaneous credible bounds, because larger values of the benchmark dose are typically missing for larger values of z. Alternatively, we may impute all missing values with their corresponding median, say, and proceed as if the sample had been observed completely. The two different types of credible bounds are shown in Figure 5, denoted as "complete case" and "median imputed". There seems to be no substantial difference. Also displayed is the lower simultaneous confidence bound proposed by Piegorsch et al. [9], obtained by simply inverting the upper simultaneous confidence bound for R_A(x).

[Figure 4: 100 samples of x_B(z) for δ = 3. x-axis: BMR.]

4 Discussion
The Bayesian approach to simultaneous inference in risk assessment has much to offer. It does not rely on approximations, is completely general and easy to implement. For example, it will be straightforward to calculate a Bayesian simultaneous credible bound in the application considered in Al-Saidy et al. [1], where the response variable is binomial.
We now close with two final comments. In the current application it

[Figure 5: Estimated benchmark dose function and simultaneous lower 95% credible bound, δ = 3. Legend: Bayesian estimate (posterior median), frequentist estimate, complete case simultaneous credible bound, median imputed simultaneous credible bound, inverted frequentist simultaneous confidence bound; x-axis: BMR.]

turned out that a large number of samples corresponded to non-monotone dose-response relationships in the observed range of x. Certainly, the quadratic regression approach to the problem is open to question, and perhaps a monotone model, such as a logistic growth curve model or a nonparametric regression model under monotonicity constraints (e.g. [7]), could have been useful. However, we should mention that we could have easily incorporated monotonicity constraints on β by simply ignoring all samples that do not fulfill the restriction imposed; see Gelfand, Smith and Lee [5] for more details in the context of Markov chain Monte Carlo simulation.
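In terms of the illustrative sketches above (with the assumed objects beta, sigma and xgrid), such a rejection step could be written as:

mu <- cbind(1, xgrid, xgrid^2) %*% t(beta)             # mu(x) over the grid, one column per sample
keep <- apply(mu, 2, function(m) all(diff(m) <= 0))    # decreasing over the observed range
beta.mono <- beta[keep, ]                              # retained samples; sigma[keep] accordingly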
An alternative way to obtain simultaneous probability statements from Monte Carlo output is based on highest posterior density estimation and has been described in Held [6]. This approach has the advantage that the simultaneous region does not need to be hyper-rectangular, so it is more realistic. Indeed, Held [6] has shown through examples that simultaneous credible bands using the method described in Besag et al. [2] may include regions in the parameter space which are not supported by the posterior at all. The difference between the two methods is related to the distinction between credible intervals based on quantiles and highest posterior density intervals in the one-dimensional case. The former intervals may include areas of low posterior density, for example if the posterior is bi-modal, whereas the latter will - by definition - only include regions of high posterior density.
However, the method by Held [6] can only be applied to calculate the posterior support for a series of reference points, but there is no easy way to visualize these credible regions in higher dimensions. In the current application there does not seem to be an obvious reference point for R_A(x), say, so the method by Besag et al. [2] is the obvious choice for simultaneous Bayesian inference in risk assessment.

References
[1] Al-Saidy O.M., Piegorsch W.W., West R.W., Nitcheva D.K. (2004). Confidence bands for low-dose risk estimation with quantal response data. Biometrics, to appear. Available at http://dostat.stat.sc.edu/bands.
[2] Besag J.E., Green P.J., Higdon D.M., Mengersen K.L. (1995). Bayesian computation and stochastic systems (with discussion). Statist. Sci. 10, 3-66.
[3] Box G.E.P., Tiao G.C. (1973). Bayesian inference in statistical analysis. Reading, MA: Addison-Wesley. Reprinted by Wiley in 1992 in the Wiley Classics Library Edition.
[4] Chapman G.A., Denton D.L., Lazorchak J.M. (1995). Short-term methods for estimating the chronic toxicity of effluents and receiving waters to West coast marine and estuarine organisms. Technical Report EPA/600/R-95-136. U.S. Environmental Protection Agency, Cincinnati, Ohio.
[5] Gelfand A.E., Smith A.F.M., Lee T.M. (1992). Bayesian analysis of constrained parameter and truncated data problems using Gibbs sampling. Journal of the American Statistical Association 87, 523-532.
[6] Held L. (2004). Simultaneous posterior probability statements from Monte Carlo output. Journal of Computational and Graphical Statistics 13, 20-35.
[7] Holmes C.C., Heard N.A. (2003). Generalized monotonic regression using random change points. Statistics in Medicine 22, 623-638.
[8] Pan W., Piegorsch W.W., West R.W. (2003). Exact one-sided simultaneous confidence bands via Uusipaikka's method. Annals of the Institute of Statistical Mathematics 55, 243-250.
[9] Piegorsch W.W., West R.W., Pan W., Kodell R.L. (2004). Low-dose risk estimation via simultaneous inferences. Applied Statistics, to appear. Available at http://dostat.stat.sc.edu/bands.

Address: L. Held, Department of Statistics, University of Munich, Ludwigstrasse 33, 80539 Munich, Germany
E-mail: held@stat.uni-muenchen.de
COMPSTAT'2004 Symposium © Physica-Verlag/Springer 2004

INTERACTIVE BIPLOTS FOR VISUAL MODELLING

Heike Hofmann

Key words: Data visualisation, biplots, univariate linear models, category level points, biplot axis, visual modelling.
COMPSTAT 2004 section: Data visualisation.

Abstract: The link between statistical models and visualisation techniques is not very well explored, even though strong connections do exist. This paper describes how biplots - interactive biplots in particular - can be used for visual modelling. By slightly adjusting the way biplots are constructed, they provide the means to display linear models. The goodness of fit of a particular model becomes instantly visible. This makes them a useful addition to the standard set of visualization tools for linear models.
Biplots show predicted values and residuals. This helps, firstly, to assess a model far beyond the mere statistics and to detect structural defects in it. Secondly, biplots provide a link between the modelling statistics and the original data. Additional interactive methods such as hotselection also allow the analysis of outlier effects and behaviour.

1 Introduction
Biplots are a very promising tool for visualising high-dimensional data which include both continuous and categorical variables. The strategy of biplots is to choose a linear subspace (usually a 2-dimensional space - in order to be able to plot the result using standard techniques) which is in some respect optimal, and to project the high-dimensional data onto this space. One criterion for optimality is, for instance, to minimise the discrepancy between the high- and the two-dimensional representations of the data. Biplots show only one projection out of infinitely many. They therefore cannot be exact representations of the data but only approximations.
What gave the biplots their prefix "Bi-" (βι is the Greek syllable for "two") is the simultaneous representation of both data points and original axes within the projection space.
The biplot axis of a continuous variable is represented by a straight line (in the case of linear models, to which we will restrict ourselves) with unit points marked by small perpendicular lines. One unit of a variable X_i corresponds to one times the standard deviation of X_i. If the data matrix X is centered and standardized, these units are therefore directly comparable for all i, and the length of a unit vector gives a measure for how well a variable is represented in the chosen projection plane.
Instead of continuous axes, so-called category level points (CLPs) are used to display a categorical variable X. Using a binary dummy variable for each

of the categories of X, an imaginary axis is found as in the continuous case. A CLP is given as the 1-unit point of this axis. Each CLP therefore represents one category of X.
The different gray shades of the points in the figure are the effect of a crude graphical density estimate - light areas in the display correspond to a high number of observations.
[Figure 1: Biplot (left) and corresponding mosaicplot (right) of the Titanic Data [3], based on the variables Class, Age, Sex and Survived. Each dot on the left side corresponds to a cell on the right hand side. Highlighted are survivors.]

Figure 1 shows a biplot of categorical variables, based on a Multiple Correspondence Analysis (MCA). Next to the biplot a mosaicplot of the same variables is drawn.
Biplots were first introduced by [5]. A recent monograph on biplots by [6]
summarises different types of biplots and embeds various models in the con-
cept. Possibilities for interactive extensions have been examined in [7].

Biplot representation
The graphical representation of a biplot is dot based. This means for categorical variables that each combination is shown as one single dot. Of course, this does not allow conclusions about this combination's size any more. One solution to this problem is the use of density estimates. This also covers the problem of over-plotting, which, especially in large data sets, is always present in dot based representations.
The graphical representation of a biplot has two components (a small sketch of the corresponding computations is given below):
• Data points are projected onto the plane spanned by the first two principal components and visualised as dots. The center of the plot is given by the projection of the p-dimensional mean ((1/n) X_1'𝟙, ..., (1/n) X_p'𝟙), i.e. the vector of variable means.
• The unit vectors e_i' corresponding to the (dummy) variables are also projected onto this plane.
The graphical representation differs for continuous and categorical variables: For continuous variables, an arrow is drawn from the plot center to the projection of the variable, which marks the direction of the original variable. These directions are called the biplot axes. The arrowheads mark the unit points on the biplot axes.
For a categorical variable its projection onto the biplot is marked by a small square, the CLP.
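As referenced above, the two components can be sketched in R for a continuous data matrix X using principal components; this is only an illustration of the construction, not the implementation used for the figures (categorical variables would first be expanded into binary dummy variables as described):

biplot.coords <- function(X) {
  X <- scale(X)                        # center and standardize, as assumed in the text
  pc <- prcomp(X)
  list(points = pc$x[, 1:2],           # data points in the plane of the first two PCs
       axes   = pc$rotation[, 1:2])    # projected unit vectors e_i: the biplot axes
}

For a categorical variable, applying the same construction to its 0/1 dummy columns gives the CLPs as the 1-unit points of the corresponding imaginary axes.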

"Reading" a biplot
In a biplot the most important source of information is the distance between objects. The distance gives a measure of how similar or how closely related objects are.
The distance of a CLP to the plot's center (in the middle of the plot) or the length of a unit on a biplot axis reflects how good the projection of the underlying variable is, i.e. with increasing distance the goodness of fit - and with it the "importance" - of this variable increases.
The meaning of objects lying close to each other varies according to their type:
• point - point: close points reflect high-dimensional "neighbours".
• axis - axis: axes with a small angle between them indicate a high positive correlation between the variables; angles near 180° indicate a high negative correlation.
• CLP - CLP: neighbouring CLPs are a hint that the corresponding variables are associated, i.e. that these categories frequently occur together in the data.
• points - axis/CLPs: the data values for a point are found by orthogonal projection onto an axis. The axes closest to a point therefore represent the strongest influence for a data point. Accordingly, points are assigned to those categories with the closest lying CLPs. In doing so, one has to remember that a biplot of more than just two variables cannot be anything but an approximation.

2 Interactive methods
Based on the construction and interpretation of a biplot, interactive methods have to be provided for in the display to facilitate interpretability and ease of use.

2.1 Interactive querying

Interactive querying is context sensitive - querying different objects provides different information. Examples for several querying results are given in Figures 2 to 4.

[Figure 2: Querying a point or "empty space" of the plot results in drawing perpendicular lines onto the biplot axes. Estimated values of the variables are given for the point in the projection plane.]
[Figure 3: Querying a CLP highlights the other CLPs and the prediction regions of the underlying categorical variable.]
[Figure 4: Drag-query: dragging from one point of the plot to another draws circles around the starting point as visual aid for estimating distances between objects.]

Figure 3 shows the prediction regions corresponding to the variable 'Class'. All categories corresponding to a single variable divide the biplot area into a set of mutually exclusive prediction regions. The prediction region of a CLP is defined as the space closest to the CLP, i.e. no other CLP is closer. From the prediction regions in Figure 3 it becomes obvious that the representation from the MCA does not fit well: almost all dots are predicted to be second class passengers - there are no combinations predicted as third class passengers.

2.2 Logical zooming and hotselection


The difference between logical and "normal" zooming lies in the fact that
by logical zooming an obj ect is not only enlarged but mor e det ails appear.
Logical zooming in biplots has two main applicat ions: logical zooming in

large data sets gives a tool to drill down the data set into smaller parts,
which are - hopefully - more homogeneous and therefore easier to analyze.
Another advantage of logical zooming is its possibility of excluding outliers. By focussing on the "main" part, i.e. not regarding outliers, their influence on the model becomes apparent. This is particularly useful for models with a poor behaviour with respect to outliers. If in fact the effect outliers have on a model is of foremost interest, we will want to use hotselection [8] instead of logical zooming. The boundary between these two tools is fluent - but essentially, the concept of hotselection is less permanent than logical zooming: changes are more readily made and taken back again. In the setting of modelling, hotselection is used to compute a new model based on highlighted values only.
Figure 5 shows a biplot of a correspondence analysis taking all of the de-
scriptive variables into account. Several clearly distinguished groups appear
in the plane spanned by the first and second principal component axis . High-
lighting shows poisonous mushrooms. These clusters are marked by numbers
in the graphic. Using a Mosaicplot of all the descriptive variables, we want to
find descriptions (as short as possible) for these groups. The following table
gives a short summary of our results:


[Figure 5: Biplot of an MCA of all of the mushroom variables. Highlighted are poisonous mushrooms. Some distinct clusters appear (marked by the numbers).]
[Figure 6: Zoom into group 8.]

Zooming is equivalent to hierarchical clustering via MCA. The eight poisonous mushrooms in cluster 7 all have stalk color y, setting them off from the rest. Figure 6 shows a zoom into the largest cluster, cluster 8. Several more groups show up in the projection plane. Clusters 9 to 12 consist of edible mushrooms only. For all of these clusters simple descriptions among the explanatory variables exist. Cluster 10 e.g. consists of mushrooms with stalk color o. All of the descriptions are only valid for the zoomed data (i.e. only in combination with all of the descriptions for cluster 8 above). Cluster 13, consisting of 2512 mushrooms, is the only one which needs further inspection - using further logical zooming. After two more steps all poisonous mushrooms can be separated from the edible ones.

Group  Count  Class   Description
1      1296   pois.   ring type = l
2      1728   pois.   gill type: b
3      36     pois.   stalk color above ring: c (or ring type: n)
4      32     pois.   stalk surface below ring: y, population: v
5      16     edible  stalk surface above ring: y, stalk color above ring: n
6      120    edible  ring type = f or ring type = p, stalk surface: k (below and above the ring)
7      56     mixed   48 edible
8      4826   mixed   4010 edible

3 Univariate linear models with continuous response


Based on the graphical representation of a biplot and its interactive features, we will try another approach to visualise linear models among the data. The biplot representation provides a possibility to draw conclusions from a linear model in such a way that the goodness of fit as well as the most important explanatory variables become instantly visible with one biplot representation.
Let us assume a situation where we are dealing with a continuous response variable Y and several independent variables X_1, ..., X_p. The X_i do not necessarily have to be continuous - but we also do not work with categorical variables directly. Instead, for a categorical variable X_i a set of binary dummy variables is used as explained before.
Let X_1, ..., X_p be a set of independent variables which has been produced in this way, i.e. a variable is either continuous by default or it is a variable corresponding to a single category.
A linear regression model then has the form

Y = Xβ + ε,   ε ~ N(0, σ²I)

where X = (𝟙, X_1, ..., X_p) is the design matrix and β the vector of parameters β_i.
If some of the variables are dummy variables, we have to use a further condition for the parameters of these variables in order to get a unique result. Let Z_1, ..., Z_l be the dummy variables for the categorical variable X and β_1, ..., β_l the corresponding parameters of the linear model; then a commonly used constraint (null-sum coding) on the estimates for these parameters is that they sum to zero, i.e.

β̂_1 + β̂_2 + ... + β̂_l = 0,

or one of the categories is used as basis and the parameters of the resulting model show the influence a category has with respect to the basis. The constraint (effect coding) on the parameters then is

β̂_i = 0,

if Z_i is the dummy variable corresponding to the basis category.


It is well known that the hat matrix H := X(X'X)⁻¹X' is the projection matrix which minimizes the least squares problem of Σ_i ε_i² and gives the predicted values Ŷ as

Ŷ = HY.

Accordingly, the LS-estimator for β is

β̂ = (X'X)⁻¹X'Y.

One of the favourite methods for looking for structure among the residuals ε_i = Y_i − Ŷ_i is to plot residuals versus their predicted values, i.e. the data points are projected into the plane spanned by Ŷ and Y − Ŷ. These vectors are indeed orthogonal to each other, since the scalar product vanishes:

(Ŷ, Y − Ŷ) = (HY, Y − HY) = (Y, H'(Y − HY)) = (Y, HY − HY) = 0,

where the last steps use H' = H² = H.

3.1 Finding the biplot axes and comparing effects


The biplot axes are constructed in this situation in the same way as for standard biplots of PCA or MCA. By projecting the data into the plane spanned by Ŷ and Y − Ŷ we get β̂_i as the coordinate of e_i' = (0, ..., 1, ..., 0) ∈ R^p in the direction of Ŷ.
However, while projecting e_i' in the direction of Y − Ŷ a problem appears: generally we do not have a value along the Y-axis for any given value of X, particularly for e_i'. A direct calculation therefore is not possible. But we do know that the whole data space X is orthogonal to the direction of the residuals Y − Ŷ, since

X'·(Y − Ŷ) = X'Y − X'·X(X'X)⁻¹X'Y = X'Y − X'Y = 0.

Therefore the coordinate of the ith biplot axis in the direction of the residuals is also zero.
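To make the construction concrete, here is a small hedged sketch in R, using the built-in chickwts data as a stand-in for the barley analysis of variance discussed below (the data set and variable names are therefore illustrative):

fit  <- lm(weight ~ feed, data = chickwts)   # one-way analysis of variance
yhat <- fitted(fit)                          # axis of predicted values
res  <- residuals(fit)                       # orthogonal residual direction
beta <- coef(fit)                            # coordinates of the biplot axes along yhat
tval <- coef(summary(fit))[, "t value"]      # alternative axis units: beta_i / SE(beta_i)
plot(yhat, res)                              # the plane into which the data are projected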

[Figure 7: Axis of predicted values together with the five biplot axes for variables A, B, C, D and E.]
[Figure 8: Analysis of variance of the Barley Data. Predicted values are plotted vs. residuals. Six different sites of barley cultivation (Duluth, Grand Rapids, University Farm, Morris, Crookston, Waseca) are drawn as biplot axes. The results of Duluth are used as basis values in the Anova.]

Figure 7 shows the vector of the predicted values Ŷ together with biplot axes for five variables A, B, C, D and E. We can re-establish the relation of projected data points and their original values by orthogonal projections of the points onto the biplot axes. In the case of an analysis of variance this means that we get very informative "labels" for the predicted values. Figure 8 shows an analysis of variance of the Barley Data [4].
We see not only parallel dot plots of the barley yields, but also a natural ordering of the six categories, even (roughly) their distance or closeness. The last point has a caveat: the lengths of their units are not directly comparable, i.e. an axis with large units is not by default a more important factor, since the "importance" of an axis also depends on the variability of β̂_i. The standard test of judging whether the ith parameter is significantly different from 0, i.e. β_i = 0 vs β_i ≠ 0, uses the estimate's variability. The test statistic β̂_i/SE_β̂_i, where SE²_β̂_i = σ̂² e_i'(X'X)⁻¹e_i, is approximately t distributed with n − p − 1 degrees of freedom.
A second choice of units on the biplot axes therefore is the term β̂_i/SE_β̂_i. This re-scales the biplot axes in a way that their lengths are proportional to the values of the t-statistic. More important variables in the regression model now have larger parameters, whereas biplot axes with insignificant parameters remain short. Graphically we can support this by highlighting an interval on the axis of predicted values which corresponds to the 5% level of a t-test. See Figure 9: in this example the SE_β̂_i are of the same order of magnitude, and the distances do not change compared to Figure 8.

Figure 9: Comparison of effects: on the top the graphical test via the interval of non-significant values is shown (re-scaled biplot axes; interval of non-significant differences to Morris), on the bottom is a table of the corresponding pairwise tests:

                            Difference   std. err.   Prob
  Morris - Crookston          -2.02000       2.309   0.978724
  Morris - Duluth              7.40333       2.309   0.075985
  Morris - Grand-Rapids       10.4683        2.309   0.001805
  University-Farm - Morris    -2.73333       2.309   0.923061
  Waseca - Morris             12.7083        2.309   0.000052

When setting the origin of this interval the exact coding, which we used
for a categorical variable is important: if we use effect-coding, the origin of
the 5% interval will be placed on the predicted value of the basis. When
using a null-sum-coding the origin of the interval is set to the expected value
of Y.
Figure 9 shows the (re-scaled) biplot axes of the example above. The
category Morris is set as basis value. Around this value the interval of non-
significant values is shown as a gray-shaded rectangle. The categories Uni-
Farm and Crookston fall into this rectangle, indicating that these categories
have parameters, which are not significantly different from the parameter for
Morris.
Since the differences between the parameters are not affected by the choice
of the coding, we may use these differences for more than one comparison
(and with that, multiple tests) in each plot. From a statistical point of view
this multiple test situation suggests the use of Bonferroni-confidence inter-
vals for each parameter rather than the use of the above significance intervals.
The difference between the above intervals and Bonferroni's intervals is es-
sentially a factor, calculated from the level of significance and the number of
comparisons made.
The price we have to pay for the re-scaling of the biplot axes with the
parameter's variability is that we lose the quantitative connection between
data points and biplot axes.
In order to avoid re-scaling we may try another approach to visualise the
tests between the effects: the software JMP suggests the use of circles of dif-
ferent size around the parameter values. The size of each circle is given by the

standard deviation of the parameter times t_{α/2}. Whether two parameters are significantly different is decided by the angle: if the angle at the intersection of their circles is less than 90° the two values are not significantly different, otherwise they are (see Figure 10). For a more detailed explanation of the underlying statistics see JMP's "Statistics and Graphics Guide", p. 94-95.
The disadvantage of this approach is that angles have to be compared. This makes the decision between significant and not-significant differences between the parameters rather difficult visually.

[Figure 10: Confidence circles around parameter values. Depending on the angles at the circles' intersections the difference between the parameters is significantly different (left), borderline significantly different (middle) and not significantly different (right).]

3.2 Projection of the response variable Y


Since we may write Y as the sum of the projection axes Ŷ and Y − Ŷ,

Y = 1 · Ŷ + 1 · (Y − Ŷ),

Y has the coordinates (1, 1) in the new coordinate system.
[Figure 11: Response variable Y in the projection plane spanned by predicted and residual values; |Y − Ŷ|² = RSS.]

The units on the projection axes are given as |Y − Ŷ| and |Ŷ|, where |Y − Ŷ|² = Σ_i (Y_i − Ŷ_i)² = RSS and |Ŷ|² = TSS − RSS. RSS is the residual sum of squares and TSS is the total sum of squares.
The coordinate of Y in the direction of Y − Ŷ shows the square root of the residual sum of squares, √RSS; the coordinate in the direction of Ŷ gives the square root of the difference between the total sum of squares, TSS, and the

residual sum. The angle α between Y and Ŷ is therefore related to the goodness of fit statistic R² of the regression model:

cos²(α) = (|HY| / |Y|)² = (TSS − RSS) / TSS = R²,

i.e. the smaller α is, the better is the fit of the regression model. Of course, the angle depends on the aspect ratio of the display. By fixing the aspect ratio to 1, different plots (and thereby different models) can be compared: a plot with large width and little height indicates a good fit (the residuals are small with respect to the predicted values), while a quadratic plot or, even worse, a tall and thin plot indicates a very bad fit, see Figure 12.

[Figure 12: Example of regressions with good fit (above: regression sum of squares 133.914, residual sum of squares 1.04788) and bad fit (below: regression sum of squares 132.345, residual sum of squares 426.808). The goodness of fit is emphasized by the shape of the display. The angle between Y and Ŷ also corresponds to R².]
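As a small numerical check of this relation (continuing the illustrative chickwts fit from the sketch in Section 3.1), the following lines should return TRUE for any linear model with an intercept:

Y <- model.response(model.frame(fit))
cos2 <- sum((fitted(fit) - mean(Y))^2) / sum((Y - mean(Y))^2)   # (TSS - RSS) / TSS
all.equal(cos2, summary(fit)$r.squared)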

4 Conclusions
Biplots can be used to visualize univariate linear models. They allow, at the
same time, an assessment of the model's goodness of fit. Additional interactive methods such as interactive querying provide the analytic goodness of fit statistics, too. This allows a tight link of visual display and the corresponding model. Another interactive method, hotselection, gives a way of examining the influence of single points or groups of points on the model, which can be used as a very efficient way of outlier spotting.
In the paper only one-dimensional models are shown - this is just for
illustration purposes. The approach itself is, of course, not limited to one
dimension.
If using scatterplots for a biplot representation, biplots are restricted to a 2d display - with graphics that allow display of higher dimensionality, such as a tour ([1], [2]) for example, more precise displays are possible. In a tour the described approach would mean to fix the z-axis artificially to Y − Ŷ (equivalent to fixing Y to be fully included while touring the data) and to tour through the X space. This also allows dealing with higher-dimensional Y.

References
[1] Asimov D. (1985). The grand tour: a tool for viewing multidimensional data. SIAM J. Sci. Stat. Comput. 6, 128-143.
[2] Buja A., Swayne D., Cook D. (1996). Interactive high-dimensional data visualization. Journal of Computational and Graphical Statistics 5, 78-99.
[3] Dawson R.J.M. (1995). The "unusual episode" data revisited. Journal of Statistics Education 3.
[4] Fisher R. (1935). The design of experiments. Edinburgh UK: Oliver and Boyd.
[5] Gabriel K. (1971). The biplot graphic display of matrices with application to principal component analysis. Biometrika 58, 453-467.
[6] Gower J.C., Hand D.J. (1996). Biplots. London: Chapman and Hall Ltd.
[7] Hofmann H. (1998). Interactive biplots. In New Techniques & Technologies for Statistics (NTTS) 98, Sorrento, Italy: Eurostat, 127-136.
[8] Velleman P. (1995). Data Desk 5.0, Data Description. Ithaka, New York.

Address: Heike Hofmann, Department of Statistics, Iowa State University, Ames IA, USA
E-mail: hofmann@iastate.edu
COMPSTAT'2004 Symposium © Physica-Verlag/Springer 2004

R: THE NEXT GENERATION


Kurt Hornik
K ey words: R , CRAN, 8 , 8-PLU8, 8weave, J,j\1EX.
COMPS TA T 2004 section: 8tatistical software.

Abstract: Version 2.0 of R will be released in the course of 2004. Following


t he 1.0 release on 2000-02-29, the advent of t his "next genera tion" of R mostly
indicates the view of the R developers that R has now moved substanti ally
beyond being a reference implement ation of 8. In this pap er , we look at
several of these key enha ncements. We start with a review of some key facts
on "R and 8" . 8ections 2 to 5 then describ e th e name space mechanism , the
new grid graphics syste m, t he packaging syst em , and 8weave, a tool which
allows to embed R code for data analysis into J,j\1EX documents.

1 Introduction
S is a very high level language and an environment for data analysis and graphics which has been developed at Bell Laboratories for about 30 years. In 1998, the Association for Computing Machinery (ACM) presented its Software System Award to John M. Chambers, the principal designer of S, for "the S system, which has forever altered the way people analyze, visualize, and manipulate data ...". The evolution of the S language is characterized by four books by John Chambers and coauthors, which are also the primary references for S. The "Brown Book" [1] is of historical interest only. The "Blue Book" [2] describes the "New S" language. The "White Book" [5] documents a concerted effort to add functionality to facilitate statistical modeling in S, introducing data structures such as factors, time series, and data frames, a formula notation for compactly expressing linear and generalized linear models, and a simple system for object-oriented programming in S allowing users to define their own classes and methods. Together with the Blue Book, it describes S version 3 ("S3"). [4], the "Green Book", introduces version 4 of S ("S4"), a major revision of S designed by John Chambers to improve its usefulness at every stage of the programming process, introducing in particular a new "formal" OOP system supporting multiple dispatch and multiple inheritance, and a unified input/output model via "connections". Today, a commercial implementation of the S language called "S-PLUS" is available from Insightful Corporation (http://www.insightful.com).
What is now the R project started in 1992 in Auckland, New Zealand, as an experiment by Ross Ihaka and Robert Gentleman "in trying to use the methods of LISP implementors to build a small testbed which could be used to trial some ideas on how a statistical environment might be built" [8]. The decision to use an S-like syntax for this statistical environment, being motivated by both familiarity with S and the observation that the parse trees generated

by S and LISP are essentially identical, resulted in a system "not unlike S". In fact, basing the R evaluation model on Scheme (a member of the LISP family) has given R lexical scoping as the most prominent difference between R and other implementations of the S language [7]. Since mid-1997 there has been a core group (the "R Core Team") who can modify the R source code CVS archive. The group currently consists of Doug Bates, John Chambers, Peter Dalgaard, Robert Gentleman, Kurt Hornik, Stefano Iacus, Ross Ihaka, Friedrich Leisch, Thomas Lumley, Martin Maechler, Duncan Murdoch, Paul Murrell, Martyn Plummer, Brian Ripley, Duncan Temple Lang, and Luke Tierney. R version 1.0, released on 2000-02-29, provided an implementation of S version 3. The key innovations of S4 were introduced in 1.x series releases (connections in 1.3, a first implementation of the S4 OOP system in version 1.4).
An R distribution provides a run-time environment with graphics, a debugger, access to certain system functions, and the ability to run programs stored in script files, and contains functionality for a large number of statistical procedures. This "base system" is highly extensible through so-called packages (see Section 4) which can contain R code and corresponding documentation, data sets, code to be compiled and dynamically loaded, and so on. In fact, the R distribution itself provides its functionality via "base" packages such as base, stats, grid, and methods. The data analytic techniques described in such popular books as [23], [16], or [21] have corresponding R packages (MASS, nlme, and survival). In addition, there are packages for bootstrapping, various state-of-the-art machine learning techniques, and spatial statistics including interactions with GIS. Other packages facilitate interaction with most commonly used relational databases, importing data from other statistical software, and dealing with XML. Currently, more than 300 packages are available via the Comprehensive R Archive Network (CRAN, http://CRAN.R-project.org), a collection of sites which carry identical material, consisting of the R distribution(s), contributed extensions, documentation for R, and binaries.
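To give a flavour of how such extensions are used in practice (the package name is merely an example), a contributed package can be installed from a CRAN mirror and loaded from within R:

install.packages("lattice")   # fetch and install the package from CRAN
library("lattice")            # attach it to the search path for the current session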
It is important to realize that the "R Project" is really a multi-tiered large scale software development effort, with the R Core Team delivering the basic distribution which mostly provides the computational infrastructure on which others can build special-purpose data analysis solutions. In this paper, we discuss four of the key additions to this infrastructure relative to the S reference standard.

2 Name spaces
Name spaces allow package authors to control how global variables in their code are resolved. To see why this is important, suppose that package foo defines the function

mydnorm <- function(x) 1 / sqrt(2 * pi) * exp(-x^2 / 2)



and has been attached to the search path so that evaluating the expression mydnorm(0) uses the above function when looking up a value for the symbol 'mydnorm'. Now suppose that the user enters

pi <- 1

at the prompt, so that the symbol 'pi' is bound to the value 1 in the R workspace ("global environment", .GlobalEnv). With the "usual" dynamic look-up mechanism for bindings (of symbols to values) in place, going through the collections of bindings represented by the search path and starting with the global environment, evaluating mydnorm(0) would not give the result that the foo package author had intended - namely, using the value bound to pi in the base package. More generally, top level assignments as well as attaching packages to the search path can insert shadowing definitions ahead of the ones intended. Name spaces ensure that this does not happen.
In the above example, all global variables were intended to refer to the definitions provided by the base package, which is always attached (at the end of the search path). Suppose that foo wanted to make use of functionality provided by another package bar which is not necessarily always attached. Traditionally, the package author would then arrange for bar to be attached at some point. This is not only subject to shadowing as described above, but also has the effect of forcing a possibly undesired change to the search path onto the user. Using name spaces, one can import the required functionality (more precisely, exported variables) from other packages. Such imports then cause the other packages to be loaded if necessary, without attaching them.
Finally, name spaces also allow the package author to control which definitions provided by a package are visible to a package user and which ones are private and only available for internal use. By default, a definition is private; it is made public by an explicit export of the name of the defined variable. Similar to proving mathematical theorems, good programming practice for a high-level language such as R typically suggests providing functionality based on small building blocks which perform simple tasks and are readily comprehended. If all these blocks correspond to functions with a few lines of code, and all these functions are visible to users, these will find determining the key functionality provided by a package rather challenging. (Thus far, coding practices suggested using names starting with a '.' for "internal" variables, based on the fact that listing variable names in elements of the search path by default excludes names with a leading dot. This reduces clutter, but does not prevent shadowing.)
A package is given a name space by placing a NAMESPACE file containing name space directives into the top level source directory of the package. This mechanism makes it possible to obtain the information on the package code interface as part of the package meta-data, without the need of processing the package code. The main directives control export and import of variables, and superficially resemble R function calls, with the arguments being syntactic names or string constants (i.e., quoting is only necessary for non-standard names). For example, the directive

export(mydnorm, "[<-.myclass")

exports two variables. Import directives are used to import definitions from other packages with name spaces. The directive

import("survival")

imports all exported variables from package survival;

importFrom("survival", "Surv")

would only import its Surv() function. There is also a useDynLib directive
for specifying that external code compiled into a DLL is to be loaded when
the package is loaded.
As syntactic sugar, variables exported by a package with a name space
can also be referenced using fully qualified references which are obtained by
concatenating the package and variable name, separated by a double colon
(e.g., f 00: : mydnormin the above example) . This is less efficient than a formal
import and also loses the advantage of separating the dependency meta-data
from the package code, so this approach is usually not recommended.
Name spaces are sealed. This means that once a package with a name
space is loaded, one can no longer change the bindings (add or remove vari-
ables, or change the values) . If it is necessary to record state information on
the package level, one can use dynamic variables (functions allowing to get
and set state information maintained in their environment). Sealing ensures
that the bindings cannot be changed at run time, which has been instrumen-
tal to the development of a byte code compiler for R.
R supports both the S3 and S4 paradigms for object oriented programming. In the former, there are no "formal" data structures representing the class information, and method dispatch is based on a naming convention (methods are functions the name of which is obtained by concatenating the names of the generic and the class of the argument on which dispatch is based, separated by a period). With the advent of name spaces, this creates a problem: if a package is imported (hence loaded) but not attached to the search path, the S3 methods it provides may not be found for dispatch. The name space mechanism therefore also provides facilities for registering S3 methods for dispatch. The directive

S3method("print", "foo")

registers the function print.foo defined in the package as the S3 method for generic print and class "foo". (This mechanism in fact pertains to the cases where the generic is defined in a package with a name space. In this case, S3 methods only need to be registered, but not exported.)
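Putting these pieces together, a NAMESPACE file for the hypothetical foo package used in the examples above might contain (purely for illustration):

export(mydnorm, "[<-.myclass")
importFrom("survival", "Surv")
S3method("print", "foo")
useDynLib(foo)     # only needed if the package ships compiled code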
The "formal" S4 OOP paradigm provides classes and generics with more structure than their S3 counterparts, and hence conceptually allows better integration with name spaces. S4 classes are private by default; they can be made public using the exportClasses directive. As of writing this article, all generics for which formal methods are defined need to be declared in an exportMethods directive, and where the generics are formed by taking over existing functions, those functions need to be imported (explicitly unless they are defined in the base name space). These mechanisms may be different in R 2.0; the current
development efforts will most likely bring the mechanisms in R more in line
with those in "related" functional languages in the LISP family which provide
both name spaces and a "formal" OOP system (such as Common Lisp or
Dylan).
By giving package developers the tools to control the package code inter-
face and the resolution of global variables in their code, name spaces substan-
tially enhance the potential of R for dealing with complex data analysis tasks
based on combinations of "many" extension packages, in particular providing
a way of resolving conflicts among definitions in these.

3 Grid graphics
Traditional S graphics ("base graphics" in R, although now provided by pack-
age graphics) divides pages of graphics output into outer margins and pos-
sibly several figure regions which in turn each consist of figure margins and
plot regions. This places severe limitations on the possibilities for access-
ing the whole graphics page, e.g. when annotating a high-level plot. (The
standard example is that one cannot have arbitrarily rotated text in axis labels, as text() supports arbitrary rotation but can only draw inside the plot region, whereas mtext() can only write horizontally or vertically.) Each region has one or more coordinate systems associated with it, as controlled via "graphical parameters" (par()).
Grid graphics is an alternative graphics engine provided by package grid
in the R distribution. One of its goals is to remove some of the inconvenient
constraints imposed by the base graphics system. In addition, it aims at the
development of functions to produce high-level graphical components which
would not be very easy to produce using traditional S graphics (such as Trel-
lis graphics [3], [6], where the more natural building block is a "panel" which
consists of a plot plus one or more "st rips" around it), and the rapid devel-
opment of new graphics ideas . It serves these aims by providing functionality
for the production of low-level to medium-level graphical components, such
as lines, rectangles, data symbols, and axes, and sophisticated support for
arranging graphical components. Grid does not provide high-level graphical
components such as scatterplots or barplots, and hence is primarily targeted at graphics developers rather than "users", with the usual remark that in S there is at most a gradual transition between these groups, if any such distinction exists at all.
In grid, there can be any number of graphics regions. A graphics region is referred to as a viewport and is created using the viewport() function. A viewport can be positioned anywhere on a graphics device (page, window, ...), it can be rotated, and it can be clipped to. For example,

viewport(x = 0.5, y = 0.5, width = 0.5, height = 0.25, angle = 45)

describes a viewport which is centered within the page, is half the width and one quarter of the height of the page, and is rotated 45°. Note that the above is only a description of a graphics region - it is created on a graphics device only when the viewport is "pushed" onto that device using the function push.viewport(). Each device maintains a stack of viewports, with the top one being the current one. Pushing places a viewport on top of the stack, while "popping" (using pop.viewport()) removes it from there. When several viewports are pushed onto the viewport stack, later viewports are located and sized within the context of the earlier ones. Graphics output is always relative to the current viewport (on the current graphics device). Hence, selecting the region desired for output is simply a matter of pushing and popping the appropriate viewports.
The viewport mechanism makes it very simple to divide a graphics page into areas as desired. For example, to create a plot and a legend taking up 80% and 20%, respectively, of the width of the current viewport, one can simply use

push.viewport(viewport(x = 0, width = 0.8, just = "left"))

to set up the plot region, plot to it and pop it from the stack, and then use

push.viewport(viewport(x = 1, width = 0.2, just = "right"))

to set up the legend region. Grid also provides an alternative way for positioning viewports within each other based on layouts, which allows a simple emulation of the multi-figure array mechanism in base graphics. Any viewport pushed immediately after a viewport containing a layout may specify its location with respect to that layout.
Each viewport has a number of coordinate systems available. There are four main types: absolute (e.g., "inches"), normalized (e.g., "npc"), relative (e.g., "native"), and referential coordinates, which allow locations and sizes to be specified in terms of physical coordinates, as proportions of the page size (or the current viewport), relative to a user-defined set of x- and y-ranges, and based on the size of some other graphical object, respectively. The selection of which coordinate system to use within the current viewport is made using the unit() function, which creates an object which combines coordinate value and system information. For example,

viewport(x = unit(60, "native"),
         y = unit(0.5, "npc"),
         width = unit(1, "strwidth", "coordinates for everyone"),
         height = unit(3, "inches"))

describes a viewport which is centered at the x-value 60 of, and half-way up, the preceding viewport, is 3 inches high and as wide as the text "coordinates for everyone".
Grid provides a standard set of graphical primitives: lines, text, points, rectangles, polygons, and circles (the names of the corresponding functions

are obtained by prefixing the names of the corresponding base graphics functions with 'grid.'). There are also two higher-level components: x- and y-axes. These functions are mostly similar to their base counterparts, but differ in the way graphical parameters, such as line colour and thickness, are specified.
In grid, there is a much smaller set of graphical parameters, consisting of col (the "foreground" color for drawing lines and borders), fill (the "background" color for filling shapes), lty and lwd (line type and width), fontfamily, fontface (such as bold or italic), fontsize (the size of text in points), lineheight (the height of a line as a multiple of the size of text), and cex (a multiplier applied to fontsize: the size of text is fontsize * cex and hence the size of a line is fontsize * cex * lineheight). Settings of graphical parameters are represented by "gpar" objects, and may be specified for both viewports and graphical objects. A setting for a viewport will apply to all graphical output within that viewport and all viewports subsequently pushed onto the viewport stack, unless the graphical object or viewport specifies a different setting. A description of graphical parameter settings is created using the gpar() function, which can be associated with a viewport or graphical object via their gp slots (as accessed by the gp argument to the functions creating viewports and graphical objects). The following piece of code illustrates these mechanisms.

push.viewport(viewport(gp = gpar(fill = "grey",
                                 fontface = "italic")))
grid.rect()
grid.rect(width = 0.8, height = 0.6, gp = gpar(fill = "white"))
grid.text(paste("This text and the inner rectangle",
                "have their own gpar settings", sep = "\n"),
          y = 0.75, gp = gpar(fontface = "plain"))
pop.viewport()

One of the key applications of grid graphics is in the R implementation of
Trellis graphics, provided by the lattice package (a so-called recommended
package that is available from CRAN, and included in every binary distribution
of R), see [20], which illustrates how high-level graphics functionality
can be built on top of grid. Because lattice consists of grid calls, it is possible
to both add grid output to lattice output, and vice versa.
There is also limited support for combining base and grid graphics using
functionality provided by the gridBase package. [14] shows how to annotate
base graphics using grid (e.g., to add axis labels at arbitrary rotations), and
to embed base graphics in grid viewports.
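The following sketch illustrates the kind of annotation described in [14] (assuming the gridBase package is available; baseViewports() returns grid viewports corresponding to the base "inner", "figure" and "plot" regions):

library(gridBase)
plot(1:10)                         # an ordinary base graphics plot
vps <- baseViewports()
push.viewport(vps$inner)
push.viewport(vps$figure)
push.viewport(vps$plot)
grid.text("rotated label", x = unit(7, "native"),
          y = unit(3, "native"), rot = 30)
pop.viewport()
pop.viewport()
pop.viewport()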
Grid provides more features not discussed here. For information on these,
and examples of the use of grid, see in particular the documentation at
http://www.stat.auckland.ac.nz/paul/grid/grid.html.

4 Packages
The R package system provides a standardized interface to extending R's
functionality. In source form, packages can contain
• "core" meta-information, currently serialized as a DESCRIPTION file in
Debian Control File format (tag-value pairs)
• additional meta-data, such as a NAMESPACE file defining the package
code interface
• code and documentation for R
• foreign code to be compiled/dynloaded (C, C++, Fortran, ...) or
interpreted (Shell, Perl, Tcl, ...)
• additional material such as data sets, demos, vignettes, package-specific
tests, ...
Only the core meta-information must be present. Mandatory meta-data include
the name and version of the package, and information on the license and
the package maintainer. In a file system "representation", a source package
consists of a subdirectory containing the DESCRIPTION and possibly other
"top-level" files, and several pre-defined subdirectories, some of which may be
missing, such as R for R code and src for foreign source code to be compiled
and dynloaded.
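A minimal DESCRIPTION file in this tag-value format might look as follows (all field values are invented for this sketch):

Package: mypkg
Version: 0.1-0
Title: Miscellaneous Utility Functions
Author: A. Developer
Maintainer: A. Developer <a.developer@example.org>
Description: A small collection of utility functions, packaged for illustration.
License: GPL (>= 2)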
To be available for extending R, packages must be installed to libraries,
which are simply locations where R knows to find (installed) packages.
Installing from source performs a variety of tasks as needed or desired, such
as preformatting R documentation in plain text and HTML formats, creating
DLLs from foreign code, generating a binary image of the R code, and setting
up several data structures with package index information. This process
is plug'n'play if the packages are "self-contained" (so that only the standard
tools for processing them are required). Developers can provide configuration
scripts for automatically dealing with situations where packages depend on
the availability of functionality "outside of R", such as libraries for dealing
with XML or accessing a database management system.
Creating packages is straightforward: developers simply need to gather
the material to be packaged into the appropriate locations relative to the
package source directory. If R code is the starting point, R provides a
convenience function package.skeleton() which creates the basic file structures
as well as documentation skeletons for the R objects.
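For example, if the current workspace contains a function myfun and a data set mydata (names invented here), a call along the lines of

package.skeleton(name = "mypkg", list = c("myfun", "mydata"))

should create a source directory mypkg containing a DESCRIPTION skeleton together with R, data and man subdirectories pre-filled from these objects.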
Packages are distributed as single files archiving their contents. For source
packages, gzipped tar files are used. These are created via the build utility
(currently, a Perl script) which essentially performs necessary cleanups, adds
front-matter information, and creates the archive with a canonical file name
obtained from the package name and version (as recorded in the DESCRIPTION
file). One can also build and install binary packages, which are already
set up for use on a particular platform (so that only minimal processing is
needed when installing). E.g., CRAN provides binary packages for the 32-bit
Windows platforms, because the tools needed for processing the source
packages (Make, Perl, compilers, ...) might not be available to all users on
such systems.
Packages can be distributed over the web through repositories, which are
suitably indexed collections of packages. The package management tools
provided by R allow for directly installing packages from repositories and au-
tomatically updating installed packages when newer versions are made avail-
able in the repositories. This versioning facility, together with the generality
of the package mechanism, makes packages an ideal vehicle for distributing
many kinds of R-related material which needs to be kept up-to-date, such
as e.g. data sets or manuals (preferably implemented as package vignettes,
see Section 5). The Bioconductor project (http://www.bioconductor.org),
an open source and open development software initiative for the collective
creation of extensible software infrastructure for computational biology and
bioinformatics which uses R as its primary implementation language, is work-
ing on providing the next generation of client and server side tools for repos-
itory management, featuring in particular a multi-level package dependency
mechanism similar to the ones found in popular GNU/Linux distributions
such as Debian (http://www.debian.org). These tools are already avail-
able via the R extension package reposTools from Bioconductor, and will
eventually be integrated into the R distribution.
Packages can be submitted to unit testing using the check utility (cur-
rently, a Perl script). When run on a package source directory, this first veri-
fies that the package can be installed as the basic test of whether it "works",
and then goes on to perform a variety of other tests, such as checking
• availability and correctness of meta-information (as recorded in the
DESCRIPTION file mentioned above);
• R code, including syntactic correctness, common coding problems (e.g.,
when loading DLLs or defining replacement functions), consistency of
S3 generics and methods, etc.;
• R documentation, including correctness (syntax, presence of all required
documentation slots), consistency (of code and documentation), and
completeness (all user level objects must be documented);
• whether the package is able to run the code in the examples of its
documentation (which is required). In addition, there are mechanisms
for regression and certification testing of code: package maintainers can
provide files with R code that will be run and if necessary compared to
already certified output.

Repository maintainers can use the package testing facilities for control-
ling the quality of the packages in the repository, and hence the repository
itself. For example, the CRAN repository tracks the R release process by
only providing packages which pass the tests against the version of R be-
ing released. In addition, the effects of changes in (the development and
patched version of) R and updates to contributed packages are monitored
on a daily basis. It is this continuous improvement process which markedly
distinguishes the R project from most other software initiatives which use
repositories for distributing extensions.
Most of the testing tools used by the check utility are in fact implemented
in R (in particular to ensure portability and availability to all users of R) and
distributed in the tools package contained in the R distribution. It is very
important to realize that whereas check is a rather inflexible utility for cre-
ating standardized reports on the package "quality" status, the underlying
functions from package tools provide a flexible and extensible toolbox for
computing on packages. For example, codoc() is a function for checking
code/documentation consistency. More precisely, it analyzes the (usually,
function synopsis) information of the \usage sections of R documentation
files, and compares the documented synopses to what the code actually con-
tains. (Currently, code and documentation for functions in a package are
not generated from common sources, and hence may be inconsistent.) What
codoc() returns is an object containing a variety of information, including
the information on mismatches found. Printing this object gives a status
report on mismatches intended for human readers; if no mismatches were
found, nothing is printed. This mechanism is used by check to assess and
report the basic codoc status. But the object returned contains additional
data as well, such as information on \usage entries not corresponding to valid
R syntax after eliminating special markup for indicating synopses for S3 or
S4 methods, or on functions for which documentation was registered (via the
\alias meta-data markup) without providing a synopsis (which "might" be
a problem, and in the case of non-method functions in packages with a name
space typically is one). Even though this information is not printed, it is
available in the result of the codoc computations, and hence can be used for
further processing.
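A typical interactive use of this toolbox might look as follows (the package name and path are placeholders for this sketch):

library(tools)
res <- codoc(dir = "path/to/mypkg")  # or codoc(package = "mypkg") for an installed package
res                                  # printing reports code/documentation mismatches, if any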

5 Sweave
Sweave [9] is a tool that allows the R code for complete data analyses to be
embedded in LaTeX documents. (In fact, we shall see that the underlying principles are
much more general.) In the process of generating the displayed version of the
document, first the code in the Sweave source file is processed (by R) and its
textual or graphical output inserted as appropriate to create a LaTeX source
file. Then, a DVI or PDF file is created (by latex or pdflatex).
A small Sweave source file is shown in Figure 1. The file contains two R
code chunks embedded in a simple LaTeX document. At the beginning of a
line, '<<...>>=' and '@' mark the start of a code and a documentation chunk,
respectively. Sweave translates this to a regular LaTeX document, which is
then compiled to give Figure 2. The results of the Kruskal-Wallis test as well
as the box plot have nicely been integrated into the final version.
\documentclass[a4paper]{article}

\title{Sweave Example}
\author{Friedrich Leisch}

\begin{document}

\maketitle

In this example we embed parts of the examples from the
\texttt{kruskal.test} help page into a \LaTeX{} document:

<<>>=
data(airquality)
kruskal.test(Ozone ~ Month, data = airquality)
@
which shows that the location parameter of the Ozone
distribution varies significantly from month to month. Finally we
include a boxplot of the data:

\begin{center}
<<fig=TRUE,echo=FALSE>>=
boxplot(Ozone ~ Month, data = airquality)
@
\end{center}

\end{document}

Figure 1: A small Sweave source file: example.Snw.

The Sweave source file shown in Figure 1 uses the syntax of noweb [18],
a simple literate programming tool which allows program source code and the
corresponding documentation to be combined into a single file. This syntax is
particularly useful if Emacs is used for authoring Sweave documents: then,
using ESS [19, Emacs Speaks Statistics], an Emacs extension package, one
can connect the document to a running R process while writing it. Code
chunks can be sent to R and evaluated using simple keyboard shortcuts or
popup menus. Syntax highlighting, automatic indentation and keyboard
shortcuts depend on the location of the pointer: in code and documentation
chunks one gets the same behavior as when editing "simple" R code or
LaTeX files, respectively. Using Emacs or the noweb syntax is not necessary
for using Sweave. There is also a LaTeX-based syntax, where 'Scode' environments
are used for marking code chunks. Using this syntax, the box plot code chunk
in our example file would be typeset as
\begin{Scode}{fig=TRUE,echo=FALSE}
boxplot(Ozone ~ Month, data = airquality)
\end{Scode}

Sweave offers fine control over how the code chunks are processed. By
default, both the S code itself and its console output are inserted, inside suitable

[Figure 2 shows the typeset document: the title "Sweave Example" by Friedrich Leisch, the echoed R input, the Kruskal-Wallis rank sum test output (chi-squared = 29.2666, df = 4, p-value = 6.901e-06), the explanatory text, and the box plot of Ozone by Month.]

Figure 2: The final document created from example.Snw.

verbatim-style environments, into the generated LaTeX file. This emulates an
interactive session. One can suppress either input to or output from the R
process, or indicate that output is already in LaTeX format (e.g., when using
one of the CRAN extension packages xtable or Hmisc to create "pretty" tables),
or completely suppress the evaluation of the code chunk. In addition,
Sweave can replace S expressions inside \Sexpr markup in documentation
chunks by their values (provided that these can be coerced into a character
string).
Sweave is written entirely in S, and contained in package utils in the R
distribution. From a user's view, there are two basic functions. Sweave()
translates Sweave source files into LaTeX files as described above. Stangle()
simply extracts only the code.
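For the source file of Figure 1, a typical session would thus be along the following lines (the output file names follow the defaults):

Sweave("example.Snw")    # runs the code chunks and writes example.tex
Stangle("example.Snw")   # extracts the code chunks into example.R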

As apparent from the above description, what Sweave really does is perform
certain computations on integrated text documents which contain both
code and documentation chunks. S4weave, a re-implementation of Sweave
using S4 classes and methods currently under way, enforces this view [11].
Providing more structure also makes it possible to compute a directed graph
of chunk dependencies, and hence process chunks conditionally. There is
also an XML DTD for Sweave source files for document exchange with other
dynamic document systems.
To assess the importance of facilities such as Sweave, one should keep in
mind how reports as part of a statistical data analysis project are traditionally
written. First, the data are analyzed, and afterwards the results of the
analysis (numbers, graphs, ...) are used as the basis for a written report. In
larger projects the two steps may be repeated alternately, but the basic procedure
remains the same. The basic paradigm is to write the report around
the results of the analysis. Using Sweave, one can create dynamic reports,
which can be updated automatically if data or analysis change. In particular,
the code for reproducing the displayed results is always available, which makes
Sweave an ideal vehicle for disseminating reproducible research, see e.g. [13].
Sweave also greatly aids in the creation and deployment of documentation
for "aggregated" functionality of S code, such as manuals for packages (where
the traditional function-based S documentation methods cannot easily deliver
a comprehensive view), or books on statistical analysis using S. Using Sweave,
there is the additional benefit that one can always extract the code from the
document (the term vignettes has been introduced for documents with this
property) and use it for subsequent manipulation and processing. Vignettes
have enough structure to allow for an integrated and interactive presentation
of the code they contain. For example, vExplorer() from the Bioconductor
tkWidgets package allows one to view vignettes and interact with their code
chunks, see e.g. [12] for more details.

6 Summary
In this paper, we have discussed four of the key innovations in the "next
generation" of R. There are of course many more, including a new system
for exception handling, a byte code compiler, external pointer objects, a
mechanism for serialization and unserialization of R objects to and from
connections, mathematical annotation of plots [15], as well as many refinements
to the S language (such as a thorough distinction of the character string "NA"
from a missing value for a character string). The NEWS file in the top-level
directory of the R distribution has more information.

References
[1] Becker R.A., Chambers J.M. (1984). S. An interactive environment for data analysis and graphics. Monterey: Wadsworth and Brooks/Cole.
[2] Becker R.A., Chambers J.M., Wilks A.R. (1988). The new S language. Chapman & Hall, London.
[3] Becker R.A., Cleveland W.S., Shyu M.-J. (1996). The visual design and control of trellis displays. Journal of Computational and Graphical Statistics 5, 123-155.
[4] Chambers J.M. (1998). Programming with data. Springer, New York. http://cm.bell-labs.com/cm/ms/departments/sia/Sbook/.
[5] Chambers J.M., Hastie T.J. (1992). Statistical models in S. Chapman & Hall, London.
[6] Cleveland W.S. (1993). Visualizing data. Hobart Press.
[7] Gentleman R., Ihaka R. (2000). Lexical scope and statistical computing. Journal of Computational and Graphical Statistics 9, 491-508. http://www.amstat.org/publications/jcgs/.
[8] Ihaka R. (1998). R: Past and future history. In S. Weisberg (ed.), Proceedings of the 30th Symposium on the Interface, the Interface Foundation of North America, 392-396.
[9] Leisch F. (2002). Sweave: Dynamic generation of statistical reports using literate data analysis. In Wolfgang Härdle and Bernd Rönz (eds), Compstat 2002 - Proceedings in Computational Statistics, Physika Verlag, Heidelberg, Germany, 575-580. http://www.ci.tuwien.ac.at/leisch/Sweave.
[10] Leisch F. (2002). Sweave, part I: Mixing R and LaTeX. R News 2 (3), 28-31. http://CRAN.R-project.org/doc/Rnews/.
[11] Leisch F. (2003). Sweave and beyond: Computations on text documents. In Kurt Hornik, Friedrich Leisch, and Achim Zeileis (eds), Proceedings of the 3rd International Workshop on Distributed Statistical Computing, Vienna, Austria. http://www.ci.tuwien.ac.at/Conferences/DSC-2003/Proceedings/.
[12] Leisch F. (2003). Sweave, part II: Package vignettes. R News 2 (2), 21-24. http://CRAN.R-project.org/doc/Rnews/.
[13] Leisch F., Rossini A.J. (2003). Reproducible statistical research. Chance 16 (2), 46-50.
[14] Murrell P. (2003). Integrating grid graphics output with base graphics output. R News 3 (2). http://CRAN.R-project.org/doc/Rnews/.
[15] Murrell P., Ihaka R. (2000). An approach to providing mathematical annotation in plots. Journal of Computational and Graphical Statistics 9, 582-599. http://www.amstat.org/publications/jcgs/.
[16] Pinheiro J.C., Bates D.M. (2000). Mixed-effects models in S and S-Plus. Springer. http://nlme.stat.wisc.edu/MEMSS/.
[17] R Development Core Team (2004). Writing R extensions. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org.
[18] Ramsey N. (1998). Noweb man page. University of Virginia, USA. http://www.cs.virginia.edu/nr/noweb. Version 2.9a.
[19] Rossini A.J., Heiberger R.M., Sparapani R., Mächler M., Hornik K. (2004). Emacs speaks statistics: A multi-platform, multi-package development environment for statistical analysis. Journal of Computational and Graphical Statistics 13 (1), 1-15.
[20] Sarkar D. (2002). Lattice. R News 2 (2), 19-23. http://CRAN.R-project.org/doc/Rnews/.
[21] Therneau T.M., Grambsch P. (2000). Modeling survival data: extending the Cox model. Springer.
[22] Tierney L. (2003). Name space management for R. R News 3 (1), 2-6. http://CRAN.R-project.org/doc/Rnews/.
[23] Venables W.N., Ripley B.D. (2002). Modern applied statistics with S. Fourth edition. Springer. http://www.stats.ox.ac.uk/pub/MASS4/.

Acknowledgement: Section 2 is based on material in [22] and the Writing
R Extensions manual [17], Section 3 on a primer on "Grid Graphics" by Paul
Murrell. Section 5 draws from [10].
Address: K. Hornik, Institut für Statistik, Wirtschaftsuniversität Wien, Austria
E-mail: Kurt.Hornik@wu-wien.ac.at

ROBUST MULTIDIMENSIONAL SCALING


Leanna L. House and David Banks
Key words: Statistical computing, data reduction, robust, multidimensional
scaling.
COMPSTAT 2004 section: Statistical software.

Abstract: Modern technology enables the collection of vast quantities of
data. Smart automatic data selection algorithms are needed to discover
important data structures that are obscured by other structure or random noise.
We suggest an efficient and flexible algorithm that chooses the "best"
subsample from a given dataset. We avoid the combinatorial search over all
possible subsamples and efficiently find the data points that describe the
primary structure of the data. Although the algorithm can be used in many
analysis scenarios, this paper explores the application of the method to
problems in multidimensional scaling.

1 Introduction
Although modern technology enables the collection of huge amounts of data,
it also exacerbates the problem of data quality control. Spurious or erroneous
information caused by either the random nature of the data or human error
will inevitably exist within large datasets. But the task of sifting through
millions of observations and removing those that are not representative of the
true population borders on the impossible. Smart, automated data cleaning
algorithms or robust analysis tools that work in tandem with the collection
technologies are needed.
From a statistical perspective, robust analysis methods, including L, M, S,
and R estimators, serve as appropriate means to account for contaminated
data. However, such methods arguably apply only to parametric approaches
and do not extend to unsupervised learning problems or multidimensional
scaling. Furthermore, analyzing the data directly, without first reducing the
number of observations, may exceed computer software or memory limitations.
To address this problem, we present an efficient data reduction algorithm
that actively seeks the primary underlying structure of the data while
removing spurious observations. Rather than use graphical methods to hunt
for erroneous data as described by Karr, Sanil, and Banks [6], we systematically
search among strategically chosen subsets of the collected sample.
Ultimately, we find the subsample that provides the best statistical signal,
as measured in terms of fit, compared to other subsets of comparable size.
The algorithm we propose does not require the evaluation of every subset
within a sample. Instead, it performs a series of greedy searches that allow
the method to scale to large datasets. And the algorithm is flexible since it
can be applied to any situation in which there is some measure of goodness-
of-fit. In this paper, we describe how the method applies in the context of
linear regression and multidimensional scaling, where the measures of fit are
R^2 and stress, respectively.
We understand that specifying an acceptable degree of lack-of-fit or re-
quired statistical signal for a chosen subsample is unclear. Since one is trying
to cherry-pick the best possible subset of the data, we consider two options.
The first entails the prespecification of the final subset size. The subset with
the highest statistical signal (of the specified size) is chosen, regardless of the
magnitude of the signal, or lack thereof. The second approach requires the
inspection of the plot, signal versus subset size. A knee in the plotted curve
points to the subset size at which one is forced to include bad data.
In the context of previous statistical work, our approach is most akin to
the S-estimators introduced by Rousseeuw and Yohai [9], which built upon
Tukey's proposal of the shorth as an estimate of central tendency [2], [9].
Our key innovations are that instead of focusing upon parameter estimates
we look at complex model fitting, and also we focus directly upon subsample
selection. See [3], [4] for more details on the asymptotics of S-estimators and
the difficulties that arise from imperfect identification of bad data.
In the context of previous computer science work, our procedure is related
to one proposed by Li [7]. That paper also addresses the problem of find-
ing good subsets of the data, but it uses a chi-squared criterion to measure
lack-of-fit and applies only to discrete data applications. Besides offering
significant generalization, we believe that the two-step selection technique
described here enables substantially better scalability in realistically hard
computational inference.
Section 2 describes the algorithm in detail within the context of regres-
sion. Section 3 illustrates the flexibility of the algorithm and applies it to
a simulated, multidimensional scaling scenario. Section 4 concludes the paper
with a discussion and a description of additional applications.

2 Proposed algorithm
Because of the wide familiarity with regression, we describe the steps of the
algorithm while referring to the following scenario:

Given n observations, {Y_i, X_i}, we assume that the expected
structure within the data is a multivariate linear model
with independent error terms, $\epsilon_i \sim N(0, \sigma^2)$. And we want to
protect our analysis against the possibility that as much as
1 - Q percent of the data either do not have a common linear
relationship, or are random noise, or follow a different functional
relationship with Y. The choice of Q requires domain knowledge
or a good sense of the errors in the data collection protocol.

Typical regression analyses fit all the data, and then attempt to identify
outliers or high-leverage points. Some robust methods, such as S-estimation,
attempt to find the best fit to some prespecified fraction of the data, but those
methods do not generalize to, say, nonparametric multivariate regression. In
contrast, we search among the data to find a large subset that produces
a good fit. This entails random selection of starting-point subsamples and the
comparison of fits from subsamples of the data.
In a linear regression setting, the coefficient of determination, R^2,
provides a natural choice for assessing and comparing the statistical signal of
subsamples. The statistic relies on sums of squared deviations to assess
lack-of-fit and does not penalize subsets for including more or fewer observations.
Simply, a subsample with a high R^2 is better than another with a low R^2.
In general, it is desirable that the measure of fit not depend upon the
size of the subsample. This is true for the coefficient of determination and
also for stress in multidimensional scaling. The algorithm, however, can be
modified to accommodate other situations, usually by a normalization that
allows one to measure the "average" goodness-of-fit. That technique allows
one to broaden the field of fit criteria to include average absolute deviation
or average complexity, as measured by Mallows' C_p statistic [8] or Akaike's
Information Criterion [1].
The remainder of this section describes how we randomly select a set of
subsamples from which we ultimately choose the best. We do not enumerate
or test all possible subsamples of size Qn. Rather, we propose starting with
a series of small, randomly chosen datasets and growing each until they are of
size Qn. Done properly, we can ensure that with some prespecified probability
at least one of the original subsamples will eventually grow to contain nearly
all good data.

2.1 Choosing the initial subsamples


To begin we select the minimum number, d, of subsamples S_i needed to
guarantee, with probability c, that at least one S_i contains only "good" data;
i.e., data for which the assumption of a linear model is correct. The size of the
initial subsamples depends on the scenario and should equal the minimum
number of observations needed to calculate the chosen lack-of-fit measure;
for the case of multivariate regression in ℝ^p using R^2 as the criterion, one
needs p + 2 observations in each starting subsample.
Assuming Q percent of the data are good, then the probability of selecting
(with replacement) a starting subsample that contains bad data is $1 - Q^{p+2}$.
Hence, after specifying c, we may solve for d using

$$
\begin{aligned}
c &= \mathbb{P}[\,\text{at least one of } S_1,\dots,S_d \text{ is all good}\,]\\
  &= 1 - \mathbb{P}[\,\text{all of } S_1,\dots,S_d \text{ are bad}\,]\\
  &= 1 - \prod_{i=1}^{d} \mathbb{P}[\,S_i \text{ is bad}\,]\\
  &= 1 - \left(1 - Q^{p+2}\right)^{d}.
\end{aligned}
$$
For example, if we want the probability of selecting at least one good initial
sample to equal 95% (c = .95) and we assume that 20% of the data are
spurious (Q = .8), then we have $.95 = 1 - [1 - (.8)^{p+2}]^d$. Setting p = 1 for
simple linear regression, the smallest integer greater than d is 5. Thus we
need five starting-point subsamples to ensure with probability .95 that one
of them will work as we want.
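This calculation is easily carried out in R; a small sketch reproducing the example above:

# number of starting subsamples needed so that, with probability c,
# at least one contains only good data
c <- 0.95; Q <- 0.8; p <- 1
d <- ceiling(log(1 - c) / log(1 - Q^(p + 2)))
d   # 5 for this example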
We assume the probability of choosing the same observation twice for one
subsample is small enough to justify selecting S_i with replacement. However,
one may use finite population methods if necessary (e.g., when the total
sample size is small). In that case the calculation of d becomes slightly
more complicated when p is very large. Such cases might necessitate the use
of numerical techniques to find d.

2.2 Select subsamples


Since the exact value for Q is unknown, let k equal the desired proportion
of data we wish to select from the large dataset. (The value for k does
not necessarily have to equal Q.) One subsample at a time, we sequentially
append observations that improve (or cause little reduction in) the goodness-of-fit
measure until S_i contains the target number of kn data points.
To balance the need for computational speed against the risk of adding
bad data, we suggest a two-step rule for adding observations. For the sake of
creating a time-efficient algorithm, we accept the risk of suboptimal selections,
but we want to avoid the possibility of a "slippery slope." Specifically, we do
not want a selection that only slightly increases the lack-of-fit to lower the
standard so that we get a subsequent selection that also slightly increases
the lack-of-fit, with the end result that a chain of marginally satisfactory
selections eventually produces a subsample that contains bad data.
The addition process begins with a fast search that adds data points as
the algorithm sweeps through the data (Step 1). Starting with the statistical
fit measured in an original subsample, S_i, we consider the addition of each
of the remaining observations in succession. If the union of an observation
with S_i either increases the statistical signal or only decreases it by a minute,
prespecified amount η, then the observation is added to the subsample. Hence
the next candidate data point in the sequence is considered with regard to
a new, slightly larger S_i. Setting n_i to represent the number of observations
in the current S_i, the algorithm stops when n_i equals kn.

If after sweeping through the data one time we have n_i < kn, our algorithm
moves to the second, significantly slower step. Here, we search over
all data not already in the subsample to find the observation which, when
added, reduces the goodness-of-fit by the smallest amount. We then add that
observation, which at best improves the fit measure for S_i and otherwise
decreases the statistical measure by the smallest possible amount (regardless of η).
Notice that step 2, unlike step 1, guarantees the addition of one observation on each
pass through all of the data (excluding observations already in S_i). Step 2 is
repeated until n_i = kn.
The following pseudo-code describes this two-step algorithm. We use
GOF(·) to denote a generic goodness-of-fit measure.

Pseudocode for a Two-Step Selection

Step 1: Fast Search

Initialize: Draw d random samples S_i of size p + 2 (with replacement).
Search over all observations:
  Do for all samples S_i:
    Do for observations Z_j = (Y_j, X_j):
      If Z_j is already in S_i, go to next j.
      If GOF(S_i) - GOF(Z_j ∪ S_i) < η, add Z_j to S_i.
      If n_i = [kn], stop.
    Next j
  Next i.

Step 2: Slow Search

Search over all observations:
  Do for all samples S_i:
    Over all observations Z_j = (Y_j, X_j) not already in S_i,
      find the Z_j that maximizes GOF(Z_j ∪ S_i) and add it to S_i.
    If n_i = [kn], stop; otherwise repeat the pass.
  Next i.
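An R sketch of this selection for a single starting subsample might look as follows (function and argument names are ours; gof is any goodness-of-fit function taking a vector of row indices):

grow_subsample <- function(idx, n_target, gof, eta, data) {
  # Step 1: fast sweep, appending observations that do not decrease the
  # goodness-of-fit by more than eta
  for (j in setdiff(seq_len(nrow(data)), idx)) {
    if (length(idx) >= n_target) break
    if (gof(idx, data) - gof(c(idx, j), data) < eta) idx <- c(idx, j)
  }
  # Step 2: slow search, adding the single best remaining observation per pass
  while (length(idx) < n_target) {
    cand <- setdiff(seq_len(nrow(data)), idx)
    fits <- sapply(cand, function(j) gof(c(idx, j), data))
    idx <- c(idx, cand[which.max(fits)])
  }
  idx
}

In the regression setting, gof could for instance be function(idx, data) summary(lm(y ~ ., data = data[idx, ]))$r.squared, assuming the response column is named y.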

The algorithm requires two vital inputs: the goodness-of-fit measure and
the choice of η, the tolerated increase in lack-of-fit during step 1. As mentioned
previously, we recommend that the goodness-of-fit measure not depend
upon the sample size; the lack-of-fit values should be comparable as n_i
increases. However, the choice of η offers one way to force comparability, by
making it depend upon n_i as well.
If one can achieve independence between the lack-of-fit measure and sample
size, then the selection of η depends upon one's willingness to accept bad
observations. In the regression setting, when η = 0, step 1 only appends
data points that strictly improve the R^2. On the other hand, the value of η
can be determined empirically by inspection of a histogram of 100 lack-of-fit
values obtained by adding 100 random data points to an initial subsample of
size p + 2.
After repeating Steps 1 and 2 for d subsamples, the final task is to select
one S_i as the best or most representative of the underlying structure. If the
purpose for implementing the proposed algorithm is strictly to reduce the
dataset to kn, then one could select the subsample with the lowest lack-of-fit,
regardless of its size. On the other hand, if the inclusion of bad observations
is worrisome or the magnitude of the goodness-of-fit measure for the best
subsample is unsatisfactory, then we recommend plotting the goodness-of-fit
against the order of entry of the observations. Given an initial subsample
with only good data, the graph should depict a long plateau with a sudden
knee in the curve when bad observations begin to enter the subsample. One
may choose the best size for the subsample according to the size at which the
knee occurs.
Note that the proposed algorithm entails a stochastic choice of starting sets,
followed by a deterministic extension algorithm. Even though we can guarantee,
with a specified probability, a clean starting set, we cannot make the
same guarantee at the conclusion of the algorithm. Since the extension procedure
depends slightly upon the order in which the cases are considered, the
final result does not quite enjoy the same probabilistic properties as the initial
starting sets. Nevertheless, simulation results indicate that the proposed
procedure does lead, with probability near the nominal level specified in the
initial calculation that determined the number of starting-point subsamples,
to the selection of a subsample of good data.

3 Application: multidimensional scaling


The robustness problem in the linear regression example could have been
addressed through other means, such as S-estimators, but it provides a
convenient test-bed for developing and assessing the proposed methodology. Our
real interest lies in more complicated problems, such as arise in nonparametric
regression or classification with mislabeled data or non-metric multidimensional
scaling.
Here we demonstrate the strengths of the two-step algorithm within the
context of multidimensional scaling (MDS). A practical concern in using MDS
is that a relatively small proportion of outliers or similar data quality problems
can distort the fit into uninterpretability. Essentially, a multidimensional
analysis attempts to force a fit that is driven largely by the bad data,
and thus simple low-dimensional structure in the good data can be overlooked
or not represented at all. Our procedure for cherry-picking the best
sample allows the fitting procedure to ignore points that cause large increases
in lack-of-fit, which in this context is most naturally measured by the stress
function.

Given a clean dataset that consists of the latitudes and longitudes of
99 major cities in the eastern United States, we generated six (three groups
of two) unclean datasets. The datasets differ with respect to the proportion
of bad data and their degree of badness (refer to Table 1). The first set
distorts one distance between two cities by 150% and 500%. The remaining
sets increase the number of distortions to 10 and 30 interpoint distances.
For the latter two groups, some altered distances might share one end-point.
Thus we consider the percent of unclean cities, or 1 - Q, to be greater than
or equal to 2%, 10% and 30% for each set respectively.

1-Q (%)   True Distance Distortion (%)   Original Stress   n_a   n*   Final Stress
   2                 150                       1.028        80    80    4.78e-12
   2                 500                       2.394        80    80    4.84e-12
  10                 150                       1.791        80    80    4.86e-12
  10                 500                      28.196        80    80    4.81e-12
  30                 150                       3.345        80    77    4.86e-12
  30                 500                       9.351        80    78    4.78e-12

Table 1: Comparison of the 6 data quality scenarios for MDS.

Using Kruskal-Shepard non-metric scaling, we assess the statistical signal
of a given dataset by using the stress function

$$ \mathrm{Stress} = \left( \frac{\sum_{i<i'} \left[ d_{ii'} - g(\delta_{ii'}) \right]^2}{\sum_{i<i'} d_{ii'}^2} \right)^{1/2}, $$

where the d_{ii'} are the distances between the two-dimensional embeddings of
the points X_i and X_{i'}, the δ_{ii'} are the observed interpoint dissimilarities,
and g(·) is an arbitrary monotonically increasing function (this implies that
the fit depends only upon the ranks of the interpoint distances). The fitting
is done by alternating isotonic regression to find an estimate of g with gradient
descent to find an estimate of the d_{ii'}; our
implementation used the procedure in the R software package.
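For instance, isoMDS() from the MASS package provides one such implementation of Kruskal's non-metric scaling in R (a sketch only; D denotes the matrix of interpoint dissimilarities for a candidate subsample):

library(MASS)
fit <- isoMDS(D, k = 2)   # two-dimensional non-metric scaling
fit$points                # the embedded configuration
fit$stress                # the stress of the fit (reported as a percentage), usable as the GOF criterion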
For each of the six datasets, the algorithm attempts to find the subset that
minimizes the stress function the most. Since the cities lie on the surface of
the globe and do not embed perfectly onto a two-dimensional Euclidean space,
some stress exists even within the clean dataset. The total stress measures
for the perturbed datasets are listed in the third column of Table 1, whereas
the stress for the original, clean dataset equals 8.42 × 10^{-12}. Additionally, we
chose to set η = 1.0 × 10^{-12}, a value slightly greater than zero and commensurate
with the stress in the undistorted sample.
In a real situation, Q, the percent of clean observations, is unknown.
Thus, using expert information we must estimate Q in order to calculate d,
the required number of starting-point subsamples. Furthermore, k, the percent
by which we wish to reduce the original dataset, is typically unclear as
well. In this example, for all of the datasets we assumed that Q = .9 and we
set k = .8.
Table 1 describes the effect of implementing the algorithm using each
dataset. The columns labeled "Original Stress" and "Final Stress" provide
the stress measures for the complete datasets and the chosen subsamples
respectively. The column labeled "n_a" gives the number of observations
in the best subsample chosen from the direct application of the algorithm.
And the column labeled "n*" gives the number of observations in the chosen
subsample after inspecting graphs that plot stress against sample size. Notice
n_a ≠ n* in the last two rows, when 30 interpoint distances are perturbed.
This is due to the fact that k is greater than the true value of Q. Figure 1
displays the plots of stress against sample size.

[Figure 1 shows two panels plotting the stress measure (vertical axis) against sample size (horizontal axis, 0 to 80).]

Figure 1: Plot of stress measure versus sample size (in the order of entry)
when 30 distances are distorted: (left) 150% distortion; (right) 500% distortion.
Notice the plateau in the graph while good observations are being included in the
subsample, but at sample size = 77 (left) and sample size = 78 (right) we start
to append bad data.

4 Discussion
In order to take advantage of the full potential of a larg e dataset, we pro-
pose a straightforward method to remove bad data. In essence, we robustify
t he data using a two-step algorit hm to select the subsample that is in best
agreement with the assumed structure in the data.
We demonstrat e the benefits of the algorit hm within the cont ext of mul-
tidimensiona l scalin g. In MDS scenario s, even small proportions of bad data
can ent irely distort the apparent geomet ric relationships among t he cases.
Our algorit hm successfully isolates the primary st ructure of six distorted
datasets. The st ress measures of th e final chosen subsampies are dr amati-
cally lower than t hose of the corresponding original dataset s.
One distinguishing feature of the algorit hm is t hat it does not require the
complete enumeration of all possible subs amples. This saves an enormous
amount of computer time, and ensures t hat the algorit hm is essent ially of
order O(n) (if one avoids or minimizes the slow-sear ch phase). However , t he
spirit of our two-step algorithm could be implemented in other ways. For


example, solely running the slow search in step 2 might be optimal in terms
of only choosing the very best observations to include within a subsample.
However, this requires d(n - p - 2) separate reviews of the entire pool , which
is hard when n is large or the calculation of the lack-of-fit measure is complex.
The procedure we describe extends easily to almost any statistical ap-
plication, requiring only som e measure of fit. In fact, it can even address
multiple structures within a dataset. By applying the algorithm repeatedly,
each time removing the data that fit the most recently discovered underlying
structure, one can retrieve disjoint subsamples representing different models.
Subsequent work will extend this technique to such situations and provide
a more thorough study of the performance of the search procedure.

References
[1] Akaike H. (1973). Information theory and an extension of the maximum likelihood principle. Second International Symposium on Information Theory, 267-281.
[2] Andrews D.F., Bickel P.J., Hampel F.R., Huber P.J., Rogers W.H., Tukey J.W. (1972). Robust estimates of location: survey and advances. Princeton University Press, Princeton, NJ.
[3] Davies P.L. (1987). Asymptotic behavior of S-estimates of multivariate location parameters and dispersion matrices. Annals of Statistics 15, 1269-1292.
[4] Davies P.L. (1990). The asymptotics of S-estimators in the linear regression model. Annals of Statistics 18, 1651-1675.
[5] Hawkins D.M. (1993). A feasible solution algorithm for the minimum volume ellipsoid estimator in multivariate data. Computational Statistics 9, 95-107.
[6] Karr A.F., Sanil A.P., Banks D.L. (2002). Data quality: a statistical perspective. National Institute of Statistical Sciences, Research Triangle Park, NC.
[7] Li X.-B. (2002). Data reduction via adaptive sampling. Communications in Information and Systems 2, 53-68.
[8] Mallows C.L. (1973). Some comments on C_p. Technometrics 15, 661-675.
[9] Rousseeuw P.J., Leroy A.M. (1987). Robust regression and outlier detection. Wiley, New York.
[10] Rousseeuw P.J., Yohai V. (1984). Robust regression by means of S-estimators. In Robust and Nonlinear Time Series Analysis, J. Franke, W. Härdle, R.D. Martin (eds.), Lecture Notes in Statistics 26, Springer-Verlag, New York, 256-272.
Address: L.L. House, D. Banks, Institute of Statistics and Decision Sciences,
Duke University, Durham, North Carolina, 27708 U.S.A.
E-mail: house@stat.duke.edu, banks@stat.duke.edu

IMPROVED JACKKNIFE VARIANCE


ESTIMATES OF BILINEAR MODEL
PARAMETERS

Martin Høy, Frank Westad and Harald Martens

Key words: PLSR, PCR, bilinear model, jackknife, variance.

COMPSTAT 2004 section: Partial least squares.

Abstract: This paper puts focus on some of the remaining issues concerning
jackknifing of centred bilinear models. A method improvement is proposed,
describing how all the bilinear model parameters can be rotated in order
to estimate the uncertainties of all model parameters. The mean values of
centred models are also included in the rotation scheme.
The uncertainty information of the bilinear model parameters can be used
to perform variable selection, variable weighting and detection of outliers.

1 Introduction
Crossvalidation [1] and especially jackknife [2] can be used in order to estimate
the uncertainty of the parameters in a bilinear model [3]. This technique is
currently used in commercial software (e.g. The Unscrambler) to estimate
the uncertainty in the reduced-rank regression coefficients b_A in the multiple
linear approximation model at rank A,

$$ \hat{\mathbf{y}} = \mathbf{X}\mathbf{b}_A + b_{0,A} \qquad (1) $$

or for multiple y-variables

$$ \hat{\mathbf{Y}} = \mathbf{X}\mathbf{B}_A + \mathbf{1}\mathbf{b}_{0,A}' \qquad (2) $$

Preliminary versions of stability information of the bilinear loadings P_A, Q_A
and scores T_A for the underlying bilinear regression models (see equation (6)
for definitions) are also available. The uncertainty in the regression coefficients
is used for e.g. variable selection, while the uncertainty in the scores is
used to make "stability plots" and e.g. spot sample outliers.
In this article, the method of calculating uncertainty of regression coefficients
is expanded to also include the uncertainty of the bilinear model parameters,
the loadings and loading weights (P_A, Q_A, W_A) and the scores (T_A).
The mean values of centred models are also included in the proposed rotation
scheme. This has been lacking in commercial applications, and has not yet
been described in the literature.

2 Theory
2.1 Notation
Matrices are written as uppercase bold letters (X), while vectors are written
as lowercase bold letters (x). Unless transposed (written as x'), vectors are
always columns. Uppercase letters (A) denote constants, while lowercase
letters are counters or indexes (a = 1 ... A).

2.2 Jackknife and segmentation


When crossvalidating or jackknifing a model, the dataset with N samples
(objects) is divided into M segments. M sub-models are estimated, where
model m = 1 ... M is estimated from the slightly smaller dataset where
the objects in segment m are left out. In the special case of leave-one-out
crossvalidation, M = N with N - 1 samples in each subset. We have chosen
to label the segment that is left out with a subscript m, and the reduced
dataset with segment m missing is labelled with a subscript -m.
When jackknife is used in statistical literature, the data are often considered
to be drawn from the same distribution, and focus is then on creating
as many "independent" estimates as possible. The most common way to
perform jackknife-validation is the leave-one-out, which gives N estimates of
each parameter. One can also perform delete-d jackknife, where d samples are
removed in each subset, giving $\binom{N}{d}$ estimates. For d > 1 the delete-d
jackknife thus shifts the jackknife estimate towards the bootstrap estimate,
which is based on random sampling of errors or samples. The statistical formulae
and properties of these estimates are well documented in statistical
literature [4], [5], [6].
If the dataset is generated by e.g. a factorial design, it may contain vari-
ability on different levels. Take as (a hypothetical) example an experimenter
who has tested four different levels (doses) of a treatment on 20 patients
twice (two replicates), giving a total of 40 experiments. She might be inter-
ested in both the variation between the dose-levels, the variation between the
patients, and the variation for a given patient over time. Traditionally, one
would use ANOVA to obtain this information, but the same can be achieved
by using cross-validated or jackknifed PLSR with the right segmentation of
the data (see also [7]).
In this example, one could first place all samples with the same dose-level
in the same segment. This would give M = 4 segments, and the validation
would then show the ability of three of the dose-levels to predict the fourth,
i.e. how different the responses to the dose-levels are. One could also remove
one patient at a time, giving M = 20 segments, to validate how similar or
different the patients reacted to the doses. This would be a good segmentation
in order to look for outliers between the patients, i.e., whether one (or more)
of the patients reacted to the treatment in a very different way than the
others. Yet another possibility would be to remove one replicate at a time,
giving M = 2 segments. The validation would then show whether the patients
changed over time. One could also use the leave-one-out method giving M =
40 segments. The validation would then be a mix of the above, testing
both the dose-levels, replicates and patients at once. These four examples of
segmentation will in general give quite different estimates of the variances in
the model parameters. Thus, it is very important to be aware of on what
level one is validating the results [8].
Even though the jackknife-formulae for different segmentations are given
in statistical literature, the authors feel the need for documenting these also
in the chemometric literature. The most general expression is that of delete-d
jackknife, where one explores all $\binom{N}{d}$ combinations of data where d samples
are removed. The variance of a parameter θ can then be estimated as

,2
S (())
N - d "'" ('
= -(N) Z:: ()-m
_)2
- () (3)
d d m

where $\hat\theta_{-m}$ is the value of θ estimated when segment m is removed, and $\bar\theta$ is
the mean value of all the estimated values.
Like in the example above with treatments and patients, we often don't
explore all the combinatorial possibilities of removing d samples at a time.
Instead, we only use the M = N/d possible subsets given by removing each
of the M segments one at a time. For d = 1 these two methods are the
same, namely the leave-one-out validation. But for d > 1, we have $\binom{N}{d} \gg M$.
When only M of the possible subsets is used, equation (3) reduces to

$$ \hat{s}^2(\hat\theta) = \frac{M-1}{M} \sum_{m=1}^{M} \left( \hat\theta_{-m} - \bar\theta \right)^2 \qquad (4) $$

When doing significance-testing based on variance estimates from jack-


knife, one needs to know how many degrees of freedom to use. When using
estimates from equation (4), the degrees of freedom in the variance estimate
is M - 1. To illustrate both the correctness of equation (4) and the M - 1
degrees of freedom, the authors performed a Monte-Carlo simulation. The
results are documented in section (3.1) . The theory and results presented
here are in contrast to [8], where the factor (N-1)/N is used.
In all the above, it is assumed that the size of the different segments
is equal (or not very different). For segmentation schemes with unequal
segment-sizes, the above formulae are more complicated.
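As a small numerical illustration of equation (4) in R (a sketch with invented data; stat may be any statistic of interest, here simply the mean):

jack_var <- function(x, M, stat = mean) {
  # split the N observations into M (roughly) equally sized segments
  segments <- split(seq_along(x), rep(1:M, length.out = length(x)))
  # estimate the statistic with each segment left out in turn
  theta_m <- sapply(segments, function(idx) stat(x[-idx]))
  (M - 1) / M * sum((theta_m - mean(theta_m))^2)
}
jack_var(rnorm(40), M = 4)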

2.3 Variance of regression coefficients


From each of the M bilinear submodels (perturbations of eq. (1), each time
using A latent variables or factors) we estimate regression coefficients
b_{-m,A}, and from the complete dataset we estimate b_A. One approach to
estimate the uncertainty in b_A is then to sum all the squared deviations
from b_A [3]:

$$ \hat{s}^2(\hat{\mathbf{b}}_A) = \frac{M-1}{M} \sum_{m=1}^{M} \left( \hat{\mathbf{b}}_{-m,A} - \hat{\mathbf{b}}_A \right)^2 \qquad (5) $$

The correction-factor outside the summation is reduced to the more well-known
(N-1)/N for leave-one-out crossvalidation or ordinary jackknife [5].
Note also another difference to the jackknife as described in statistical
literature [4], where each b_{-m,A}-estimate is compared to the mean of all
the M submodel estimates instead of using the value from the complete
dataset. The idea behind using b_A as in equation (5) is that this is the
"best" estimate we can get, using all the samples that we have available. In
most cases, this is also the estimate that would be used as the final model, and
we are interested in the variation around that estimate. This bias-including
mean squared error estimate eliminates the mean of the perturbed submodel
parameter estimates from the jackknife expressions. Since the reduced-rank
PLSR models deviate from the theoretical properties of the well understood
traditional full-rank OLS regression models, the authors consider the known
theoretical properties in full-rank OLS regression models non-applicable for
the PLSR solution. Examples will be given in the section "Results and
Discussion" that substantiate this choice.

2.4 Rotation of bilinear models


It would be nice to calculate the uncertainty of all the other PCR/PLSR
model parameters in the same simple way as the regression coefficients in
equation (5), but this is complicated due to certain properties of the bilinear
model. The bilinear model as in both PCR and PLSR can be seen as a sum
of outer-products, one for each factor:

$$ \mathbf{X} = \mathbf{1}\bar{\mathbf{x}}' + \sum_{a=1}^{A} \mathbf{t}_a \mathbf{p}_a' + \mathbf{E}_A \quad \text{and} \quad \mathbf{Y} = \mathbf{1}\bar{\mathbf{y}}' + \sum_{a=1}^{A} \mathbf{t}_a \mathbf{q}_a' + \mathbf{F}_A \qquad (6) $$

where x̄' and ȳ' contain the mean value of each variable, t_a is a vector
of scores (a linear combination of the X-variables), p_a and q_a are loadings
for X and Y respectively, and E_A, F_A contain unmodelled residuals. The
only difference between the PCR- and PLSR-algorithms lies in the way t_a is
defined.
A property of bilinear models is that the scores and loadings have rotational
freedom. We can rotate the scores in any direction, as long as the
corresponding loadings are rotated the same amount in the opposite direction.
The model will still contain the same information, and the regression
coefficients will be the same.
Scores- and loading-vectors for the different submodels m may appear
to be quite different due to trivial translations, rotation and mirroring. If
e.g. the sign of each element in both t_{-m,a} and p_{-m,a} changes, the information
explained by their product in that factor will still be the same, but it will
be meaningless to compare each value in those vectors to other score- or
loading-vectors with a different alignment. One way to solve this problem is to
rotate all the M sub-models towards the model calculated from the complete
dataset before we compare them.
Equation (6) represents the model calculated from the complete dataset
with all N samples. Rewriting that model using matrix notation, we get

$$ \mathbf{X} = [\mathbf{1}\ \mathbf{T}_A]\,[\bar{\mathbf{x}}\ \mathbf{P}_A]' + \mathbf{E}_A $$
$$ \mathbf{Y} = [\mathbf{1}\ \mathbf{T}_A]\,[\bar{\mathbf{y}}\ \mathbf{Q}_A]' + \mathbf{F}_A $$

where

$$ \mathbf{T}_A = (\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')\,\mathbf{W}_A(\mathbf{P}_A'\mathbf{W}_A)^{-1} \qquad (7) $$

and W_A is the internal loading weight matrix. For each consecutive factor,
the corresponding column in W_A is defined as the first eigenvector of the residual
X-X covariance (in PCR) or X-Y covariance (in PLSR). The linear
regression coefficients in eqs. (1), (2) are then defined as

$$ \mathbf{B}_A = \mathbf{W}_A(\mathbf{P}_A'\mathbf{W}_A)^{-1}\mathbf{Q}_A' \qquad (8) $$
Similarly, we can write each of the M sub-models in matrix notation, where
the index -m denotes that segment m has been left out.

$$ \mathbf{X}_{-m} = [\mathbf{1}\ \mathbf{T}_{-m,A}]\,[\bar{\mathbf{x}}_{-m}\ \mathbf{P}_{-m,A}]' + \mathbf{E}_{-m,A} $$
$$ \mathbf{Y}_{-m} = [\mathbf{1}\ \mathbf{T}_{-m,A}]\,[\bar{\mathbf{y}}_{-m}\ \mathbf{Q}_{-m,A}]' + \mathbf{F}_{-m,A} \qquad (9) $$
Without changing equation (9), we can insert an invertible matrix C and its
inverse C^{-1}, since CC^{-1} = I.

$$ \mathbf{X}_{-m} = [\mathbf{1}\ \mathbf{T}_{-m,A}]\,\mathbf{C}_{-m}\mathbf{C}_{-m}^{-1}\,[\bar{\mathbf{x}}_{-m}\ \mathbf{P}_{-m,A}]' + \mathbf{E}_{-m,A} $$
$$ \mathbf{Y}_{-m} = [\mathbf{1}\ \mathbf{T}_{-m,A}]\,\mathbf{C}_{-m}\mathbf{C}_{-m}^{-1}\,[\bar{\mathbf{y}}_{-m}\ \mathbf{Q}_{-m,A}]' + \mathbf{F}_{-m,A} \qquad (10) $$

Comparing equation (7) and equation (10), we can define C_{-m} as a rotation matrix, where we try to rotate [1 T_{-m,A}] towards [1 T_A]. Similarly, we then interpret C_{-m}^{-T} as a rotation of [x̄_{-m} P_{-m,A}] towards [x̄ P_A]. Thus, if we wanted to estimate the matrix C_{-m}, we could use either the relation between the scores or the relation between one of the loadings as targets.
If the data were without noise, perfectly behaved and contained sufficient redundant information, the only difference between the submodel and the total model would be reflections and possibly reorderings (permutations) of the factors. It would then be possible to map the submodel onto the total model with a matrix C_{-m} containing only one ±1 per column/row, with the rest of the elements 0. But when the data contain noise and insufficient redundant information, rotation at angles that are not multiples of 90° and possibly rescaling of the axes will be necessary to map the submodel perfectly onto the total model.
In order to consume as few degrees of freedom in Y as possible in the estimation of C, we have chosen to use the score matrices as targets. Since cross-validation/jackknife segment m has been removed in T_{-m,A}, it has fewer rows than T_A. In order to estimate C_{-m}, the samples in segment m must also be removed from T_A before comparing them. This shortened version of [1 T_A] is denoted [1 T_A]_{\m}. Since the samples in segment m are now removed from both matrices, fewer degrees of freedom in Y are consumed than if e.g. the loading matrices were used as targets. Note that even if the samples in segment m are not used directly when estimating C_{-m}, they are not completely left out, since they have influenced Y and W in the total model.
In order to estimate the matrix C_{-m}, the criterion to be minimised is the difference between [1 T_A]_{\m} from the total model and the rotated [1 T_{-m,A}] from the reduced model. The difference is here denoted G_{-m,A}:

    G_{-m,A} = [1 T_A]_{\m} − [1 T_{-m,A}] C_{-m}        (11)

There are many possible ways to estimate C_{-m} from equation (11). To reduce the degrees of freedom consumed in the rotation, we have chosen to use an orthogonal rotation, which means that the columns in C_{-m} are orthogonal with length one. The procedure for estimating C_{-m} starts with performing an SVD:

    U B V' = [1 T_{-m,A}]' [1 T_A]_{\m}        (12)

and then C_{-m} is estimated as

    C_{-m} = U V'        (13)

There are many possible ways to estimate C_{-m} from equation (11) (or even without using the score matrices); the above is just one solution. Other possible procedures are discussed in section 3.2.
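As an illustration of equations (12)-(13), the orthogonal rotation can be computed with a few lines of matrix code. The following minimal Python/NumPy sketch assumes the two score matrices are already available as arrays; the function and variable names are illustrative and not taken from the authors' implementation.

import numpy as np

def estimate_rotation(T_sub, T_tot_reduced):
    # T_sub         : [1 T_{-m,A}] from the submodel (samples of segment m excluded)
    # T_tot_reduced : [1 T_A]_{\m}, the total-model scores with segment m removed
    U, s, Vt = np.linalg.svd(T_sub.T @ T_tot_reduced)   # SVD of the cross-product, cf. eq. (12)
    return U @ Vt                                       # orthogonal rotation, cf. eq. (13)

The rotated submodel scores are then [1 T_{-m,A}] C_{-m}; this is the usual orthogonal Procrustes solution of the least-squares problem implied by equation (11).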

2.4.1 Rotating the scores For each left-out segment m = 1 ... M, we estimate C_{-m} using equation (13). With the appropriate matrix C_{-m}, we can then calculate values for the rotated versions of the scores in each submodel.

Augmenting the submodel score matrix: Since the score matrix of submodel -m is calculated with the samples of segment m left out, we would only be able to re-estimate parts of the total score matrix by rotating the scores from submodel -m. In order to fix this, we first insert estimated score values of the left-out sample set m into the score matrix of submodel -m before we rotate it. These estimated values are calculated in the usual way:

    T̂_{m,A} = (X_m − 1 x̄_{-m}') W_{-m,A} (P_{-m,A}' W_{-m,A})^{-1}        (14)

By inserting these values into T_{-m,A} at the right positions, we can now calculate the full rotated score matrix of submodel -m. We denote the rotated matrix with a tilde, and the augmented score matrix from submodel -m is denoted with a subscript -m,m:

    [1 T̃_{-m,m,A}] = [1 T_{-m,m,A}] C_{-m}        (15)

Using the rotated versions of the score matrix as calculated in equation (15), we can estimate the variance of each element in the same way we did for the regression coefficients in equation (5). For the elements of the score matrix, the corresponding equation is

    s²(t_{ia}) = (M−1)/M · Σ_{m=1}^{M} (t̃_{-m,ia} − t_{ia})²        (16)

where t_{ia} is the score value of sample i in factor a of the total model. This equation gives an estimate of the variance of the score value for each sample in each factor. This can be used e.g. to draw approximate confidence regions around each sample in the score plot, and thus determine whether two samples are far enough apart to be considered different. Such an approximate confidence region could e.g. be created by using ±2 s(t_{ia}), but it is important to emphasise that the statistical properties of the variance estimate (16) are not known, and that the "confidence region" should be regarded as approximate. (Further improvements might be attained by a degrees-of-freedom correction to compensate for the estimation of the rotation parameters.)
The rotated score values t̃_{-m,ia} are also interesting in themselves. By plotting these values together with the score values from the total model in the score plot, the user gets a visual image of the stability of each sample; such plots are often referred to as stability plots. Samples that are outliers will tend to get a very different score value when they are not used in the calibration, and thus will be easily visible in the stability plot.
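A hedged sketch of this computation, assuming the augmented score matrices of equation (14) and the rotation matrices of equation (13) have already been computed, is given below. The (M−1)/M scaling mirrors the form used for the loading weights in equation (20) and should be adapted if a different scaling of the jackknife variance is preferred; names are illustrative.

import numpy as np

def rotated_score_variance(T_aug_list, C_list, T_total):
    # T_aug_list : M augmented submodel score matrices [1 T_{-m,m,A}] (eq. (14) inserted)
    # C_list     : M rotation matrices C_{-m} from equation (13)
    # T_total    : total-model score matrix [1 T_A]
    M = len(T_aug_list)
    T_rot = [T_aug @ C for T_aug, C in zip(T_aug_list, C_list)]   # equation (15)
    dev2 = sum((T - T_total) ** 2 for T in T_rot)                 # deviations from the total model
    return (M - 1) / M * dev2                                     # elementwise variances, cf. eq. (16)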

2.4.2 Rotating the loadings The matrices of X- and Y-loadings for submodel -m have the same dimensions as the loading matrices of the full model. They can therefore be rotated without augmentation:

    [x̃_{-m} P̃_{-m,A}] = [x̄_{-m} P_{-m,A}] C_{-m}^{-T}
    [ỹ_{-m} Q̃_{-m,A}] = [ȳ_{-m} Q_{-m,A}] C_{-m}^{-T}        (17)

In the present case, C_{-m} is an orthogonal matrix, and thus C_{-m}^{-T} = C_{-m}. The notation in equation (17) is general, and also valid for matrices with other properties. In the same way as with the scores, the variance of each element in the loading matrices can now be estimated:

    s²(p_{ka}) = (M−1)/M · Σ_{m=1}^{M} (p̃_{-m,ka} − p_{ka})²   (and analogously for the Y-loadings q_{ka})        (18)

As with the score values, these variances can be used to draw approximate confidence regions in the loading plot and determine whether or not two variables overlap and thus contain the same information.

2.4.3 Rotation of the loading weights Rotation of the loading weights W_A (equation (7)) is a little more complicated than rotation of scores and loadings. The rotated version of the loading weights is proposed as:

    W̃_{-m,A} = [x̄_{-m}  W_{-m,A} W_{-m,A}' P_{-m,A}] C_{-m}^{-1} [0  (P_A' W_A)^{-1}]'        (19)

where the column of zeros is needed because the matrix C_{-m} was estimated from equation (12), where an extra column is appended.
Similar to the other model parameters, the variance of each element in the loading weight matrices can now be estimated as:

    s²(w_{ka}) = (M−1)/M · Σ_{m=1}^{M} (w̃_{-m,ka} − w_{ka})²        (20)

Having variance estimates of the individual loading weights opens up a new possibility in variable selection. It will then be possible to do a significance test of each variable k in each factor a. Values w_{ka} that are not significantly different from zero can be forced to zero, after which the vector w_a is re-orthogonalised. This procedure will then yield variable selection where it is possible to remove variables only in some of the factors, while leaving them in for other factors.
As further factors are calculated and the information left in the dataset decreases, more and more variables will become insignificant, with their corresponding loading weight set to zero. Finally, the loading vector w_a will be reduced to the zero vector, and no further factors need to be calculated. Thus, the procedure would yield automatic selection of the number of factors to calculate, with integrated variable selection. The automatic deletion of insignificant variables is expected to yield more stable models that are also easier to interpret due to the reduced number of variables in each factor.
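A minimal sketch of this idea follows. It assumes jackknife standard errors of the loading weights are available, and it uses a critical ratio of 2 as the significance rule, which is an assumption on our part since the paper does not fix a particular test.

import numpy as np

def prune_loading_weights(W, s_W, t_crit=2.0):
    # W      : loading-weight matrix (K variables x A factors)
    # s_W    : jackknife standard errors of the same shape, cf. equation (20)
    # t_crit : assumed critical ratio |w| / s(w) below which a weight is zeroed
    W = W.copy()
    W[np.abs(W) < t_crit * s_W] = 0.0          # force insignificant weights to zero
    for a in range(W.shape[1]):                # re-orthogonalise factor by factor
        for b in range(a):                     # remove components along earlier weight vectors
            W[:, a] -= (W[:, b] @ W[:, a]) * W[:, b]
        norm = np.linalg.norm(W[:, a])
        if norm == 0.0:                        # loading-weight vector has vanished:
            return W[:, :a]                    # stop extracting further factors
        W[:, a] /= norm
    return W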

3 Results and discussion

3.1 Jackknife and segmentation

To confirm that equation (4) gives consistent estimates of variance with M−1 degrees of freedom for different segmentation sizes, a Monte-Carlo simulation was carried out. The parameter of interest in the simulation was the variance of the regression coefficients in a full-rank OLS solution to MLR regression, i.e. a bilinear PCR or PLSR model with the maximum possible number of factors.
A matrix X with 300 samples and 3 variables was drawn with random, evenly distributed values between 0 and 1. The regressand y was calculated from the true regression coefficients β = [0 1 2]' and random noise e drawn from the distribution N(0, 1²).
The dataset was then split up in several different ways, with M ranging from 2 to 300, corresponding to the extremes of splitting in two and leave-one-out. For each value of M, the regression coefficients b were estimated and the variance of the second element in b was estimated from equation (4). The whole procedure was then repeated 500 times with different noise e added each time.
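One replicate of this simulation can be written in a few lines. The Python/NumPy sketch below assumes that the segmented jackknife variance of equation (4) has the usual form (M−1)/M · Σ_m (b_{-m} − b)²; if equation (4) differs, the last line of the function should be adapted accordingly.

import numpy as np

rng = np.random.default_rng(0)
beta = np.array([0.0, 1.0, 2.0])
X = rng.uniform(0.0, 1.0, size=(300, 3))          # fixed design, as described above

def jackknife_var_b2(X, y, M):
    n = len(y)
    b = np.linalg.lstsq(X, y, rcond=None)[0]      # full-data OLS fit
    dev2 = 0.0
    for seg in np.array_split(np.arange(n), M):   # M cross-validation segments
        keep = np.setdiff1d(np.arange(n), seg)
        b_m = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
        dev2 += (b_m[1] - b[1]) ** 2              # second element of b
    return (M - 1) / M * dev2                     # assumed form of equation (4)

y = X @ beta + rng.normal(0.0, 1.0, size=300)     # one replicate: new noise, same X
print(jackknife_var_b2(X, y, M=10))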
Since the true variance of the added noise (e) was known, it was possible to compare the jackknife-estimated values of s²(b) with the theoretically expected values. The theoretical variance of the regression coefficients from MLR (given that X is noise-free) is

    Var(b) = σ² (X'X)^{-1}        (21)
Figure 1 shows the jackknife-estimated variance of the regression coefficient (5) as a function of the number of segments M, together with the theoretically expected variance value (21).
As could be expected, one can see that the variance estimate is more uncertain when it is based on only a few segments. But as the number of segments increases, the variance estimate stabilises towards the theoretical value, and its own variance gets smaller.
Given that s²(b) is the estimated jackknife variance of b based on M segment "observations", and assuming the underlying distribution is normal with variance σ²(b), then

    ν s²(b) / σ²(b)        (22)

is chi-square distributed with ν = M − 1 degrees of freedom and a variance of 2ν. Reordering this, the variance of the variance estimate s²(b) is

    Var(s²(b)) = 2 σ⁴(b) / ν        (23)

where ν is the degrees of freedom in the estimate of s²(b). Since the variance of the regression coefficient was estimated a large number of times in the Monte-Carlo simulations, it was also possible to estimate the variance of the variance estimate, s²(s²(b)). If we then "guess" that the degrees of freedom in s²(b) is ν = M − 1, we can plot the variance of our variance estimate as a function of 2/(M − 1). If M − 1 is the correct number of degrees of freedom, this should give a straight line with intercept zero and slope σ⁴.
Figure 1: Variance of the regression coefficient as a function of the number of segments M, shown with ±2 standard errors of the variance and the theoretical value.

As figure 2 shows, this is indeed the case. The above was also repeated with ν = M and ν = N (not shown here), but these (and other) alternatives gave a line with incorrect slope. Thus, we can conclude that equation (4) gives consistent estimates of the variance of b with M − 1 degrees of freedom.

3.2 Alternative rotation schemes

The estimation of the orthogonal rotation matrix in equation (13) can be made even more conservative. A simpler matrix that only corrects for reflections and permutations can be calculated as

    C_{-m}^{reflect} = round(C_{-m})        (24)

where the operator round() means rounding each element in C_{-m} towards the nearest integer: −1, 0 or 1. This approach would consume even fewer degrees of freedom than the orthogonal rotation in equation (13). When using the simple rounding procedure above, the norm of C_{-m}^{reflect} must be monitored (C_{-m}^{reflect} should have norm 1). If e.g. the angle between the submodel and the main model is around 45 degrees, the rounding can result in more than one element per row/column being different from zero.
Figure 2: Variance of the variance of the regression coefficients as a function of 2/(M − 1), together with the theoretical value.

A procedure that solves this problem is to calculate the correlation between a factor in the total model and the factors in the submodel. If the highest absolute value is the diagonal element in the correlation matrix, then set the element in C_{-m}^{reflect} to −1 or 1 depending on the sign of the correlation. All other elements for that factor are set to zero, both for the total-model and the submodel elements. Thereafter, the highest absolute correlation for each total-model factor with respect to the submodel is found and set to −1 or 1 in C_{-m}^{reflect}. This avoids that two factors in the submodel are assigned to the same factor in the total model, and yields a matrix C_{-m}^{reflect} that is guaranteed to have norm one and only account for reflections and reorderings.
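One way to implement this assignment is a greedy search over the correlation matrix between the submodel and total-model score vectors (evaluated on a common set of samples, e.g. the augmented submodel scores). The sketch below is a simplified version of the procedure described above; names are illustrative.

import numpy as np

def reflection_matrix(T_sub, T_tot):
    # T_sub, T_tot : score matrices of the submodel and the total model (A columns each)
    A = T_sub.shape[1]
    R = np.corrcoef(T_sub, T_tot, rowvar=False)[:A, A:]   # factor-by-factor correlations
    C = np.zeros((A, A))
    for _ in range(A):
        i, j = np.unravel_index(np.nanargmax(np.abs(R)), R.shape)
        C[i, j] = np.sign(R[i, j])     # -1 or +1 depending on the correlation sign
        R[i, :] = np.nan               # submodel factor i is now assigned
        R[:, j] = np.nan               # total-model factor j is now assigned
    return C

Each row and column of the resulting matrix contains exactly one ±1, so it has norm one and accounts only for reflections and reorderings.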
One could also envision other approaches that would consume more degrees of freedom. Starting from equation (11), the matrix C_{-m} could also be estimated by an OLS regression. This corresponds to projecting the total model onto the reduced model. A similar but more numerically stable approach would be to use a rank-reduced regression like PLSR instead of OLS regression in the estimation step.

3.3 The rank of the rotation

An important question regarding the rotation of submodels is how many bilinear factors to use in the rotation. The simulations performed suggest that the best solution is to perform the rotation of the models after A_opt factors have been calculated. This introduces a problem, as A_opt is not known a priori but typically estimated from a cross-validated Root Mean Square Error of Prediction (RMSEP) curve showing estimated prediction errors in Y. Thus, the current implementation starts with an ordinary PLSR in order to establish A_opt. Then, in a second step, the rotations are performed and the variances are estimated.
Such a two-step procedure causes problems for the "dynamic" variable selection scheme suggested in section 2.4.3. Further research in this area might solve this problem.

3.4 Large matrices with many predictor variables

If there are many predictor variables in the input data Z, one might consider doing an SVD, Z = U B V', and then using the much smaller X = U B as input to the PLSR algorithm instead of Z. This will greatly reduce the time consumed in the calibration, especially when doing leave-one-out cross-validation. The variable-dependent parameters from the PLSR (like B_A, P_A and W_A) will then have to be multiplied with V' in order to correspond to the original X-variables. Since e.g. the regression coefficients are then rotated, it is necessary to estimate covariance uncertainties (not just variances) in order for the rotated uncertainty estimate to be applicable to the original variables. It appears that this can be done by modifying equation (5): let d_{-m,A} = b_{-m,A} − b_A. The covariance between the regression coefficients can then be calculated as

    Cov(b_A) = (M−1)/M · Σ_{m=1}^{M} d_{-m,A} d_{-m,A}'        (25)

The diagonal of this covariance matrix contains the values calculated from equation (5). This covariance will be applicable to the regression coefficients B from the regression Y = XB + F. In order to be applicable to the regression with the large input matrix, Y = ZC + F, the covariance matrix must be multiplied with V:

    Cov(c_A) = V Cov(b_A) V'        (26)
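A compact NumPy sketch of this preprocessing and of equations (25)-(26), under the stated assumptions about their exact form, is given below; names are illustrative.

import numpy as np

def compress_predictors(Z, tol=1e-12):
    # SVD compression Z = U B V'; X = U B is used as PLSR input instead of Z
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    keep = s > tol * s[0]                       # drop numerically zero directions
    return U[:, keep] * s[keep], Vt[keep].T     # X and the rotation V (X = Z V)

def jackknife_covariance(b_subs, b_total):
    # covariance of the regression coefficients, cf. equation (25)
    M = len(b_subs)
    D = np.stack([b_m - b_total for b_m in b_subs])   # rows are d_{-m,A}
    return (M - 1) / M * (D.T @ D)

def covariance_back_rotation(cov_b, V):
    # map the covariance back to the original Z-variables, cf. equation (26)
    return V @ cov_b @ V.T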

3.5 Examples of score plots

A matrix of X-data with seven samples and three variables was generated by sampling from a normal distribution.
Figure 3: Original score plot (Factor 1 versus Factor 2) with perturbations from leave-one-out jackknife. The centre of each "star" is the value from the complete model, and the circle denotes the value when that sample is kept out of the model calibration.

The y-data were then calculated by multiplying X with some predefined regression coefficients and adding normally distributed noise with variance 0.1. Before subjecting these X- and y-data to a PLSR with full leave-one-out cross-validation, normally distributed noise with variance 0.1 was also added to X.
Figure 3 shows a score plot with all the values from each cross-validation segment, sometimes referred to as a stability plot. In the centre of each "star" are the score values from the model calculated with all the samples. The lines going out from each "star" show the score value of that sample in each of the cross-validated models. The value with a circle on it denotes the value of that sample in the segment where the sample itself was left out, and thus had no influence on the model. Samples that are outliers will tend to get a very different score value when they are not included in the model, and thus the score value denoted with a circle will be further away from the centre than the other score values.
Note that several samples flip over, change sign or otherwise show large deviations that are not related to the uncertainty of the sample. This is due to the rotational freedom of bilinear models as described in the beginning of section 2.4.
Figure 4: Rotated score plot (Factor 1 versus Factor 2). The centre of each "star" is the value from the complete model, and the circle denotes the value when that sample is kept out of the model calibration.
As a consequence, the variations between the values in Figure 3 are unsuitable for calculating uncertainties.
Figure 4 shows the score plot after each of the submodels has been rotated as described in equation (15). The picture is now much clearer, and the variance left can be assumed to be due to the uncertainty of the score values. Note that for each sample there can be a quite large difference between the mean of all the obtained values and the value from the total model. This is the rationale for choosing the total model as the reference value and not the mean (c.f. the discussion after equation (5)).

4 Conclusion

An improvement of the jackknife rotation method by Martens & Martens [3] has been proposed for estimating the uncertainty of the bilinear model parameters with the use of the jackknife. The method works by rotating each of the submodels towards the main model before the values are used to estimate variances. The rotation matrix can be estimated in several ways, and some of the alternatives are discussed.

Further research is needed to establish the statistical properties of the obtained variance estimates, and alternative procedures for estimating the rotation matrix should be compared.

References
[1] Stone M. (1974). Cross-validatory choice and assessment of statistical prediction. J. Roy. Stat. Soc. B Met. 36 (1), 111-147.
[2] Efron B. (1982). The Jackknife, the Bootstrap, and other resampling plans. CBMS-NSF Regional Conference Series in Applied Mathematics. Society for Industrial and Applied Mathematics, Philadelphia, Pennsylvania.
[3] Martens H., Martens M. (2000). Modified Jack-knife estimation of parameter uncertainty in bilinear modelling by partial least squares regression (PLSR). Food Qual. Prefer. 11 (1), 5-16.
[4] Tukey J.W. (1958). Bias and confidence in not quite large samples. Ann. Math. Stat. 29, 614.
[5] Shao J., Wu C.F.J. (1989). A general theory for jackknife variance estimation. Ann. Stat. 17 (3), 1176-1197.
[6] Efron B., Tibshirani R.J. (1998). An introduction to the Bootstrap. Chapman & Hall, New York.
[7] Martens H., Høy M., Westad F., Folkenberg D., Martens M. (2001). Analysis of designed experiments by stabilised PLS Regression and jack-knifing. Chemometr. Intell. Lab. 58 (2), 151-170.
[8] Martens H., Martens M. (2001). Multivariate Analysis of Quality. An Introduction. J. Wiley & Sons Ltd, Chichester, UK.

Address: M. Høy, Norwegian Meteorological Institute, Pb 43 Blindern, N-0313, Norway
F. Westad, Matforsk, Osloveien 1, N-1430 Ås, Norway
H. Martens, CIGENE, Norwegian Agricultural University, N-1432 Ås, Norway
E-mail: martin.hoy@pvv.ntnu.no
COMPSTAT'2004 Symposium © Physica-Verlag/Springer 2004

LINE MOSAIC PLOT: ALGORITHM AND IMPLEMENTATION

Moon Yul Huh

Key words: Mosaic plot, line mosaic plot, statistical graphics, visual inference, statistical algorithms.
COMPSTAT 2004 section: E-statistics.

Abstract: The conventional mosaic plot graphically represents contingency tables by tiles whose size is proportional to the cell count. The plot is informative when we are well trained in reading it. This paper introduces a new approach to the mosaic plot, called the line mosaic plot, which uses lines instead of tiles to represent the size of the cells in contingency tables. We also give a general straightforward algorithm to construct the plot directly from the data set, while the conventional approach is to construct the plot from the cross tabulation. We demonstrate the effectiveness of this tool for visual inference using a real data set.

1 Introduction

Mosaic display, introduced by Hartigan and Kleiner [6], has been generalized to multi-way tables and has been extensively developed for visual inference of independence using mosaic plots by Friendly [4], [5]. Meyer et al. [11] considered visual inference of contingency tables using association plots, mainly for the case of 2-way tables. Other sources for work on mosaic plots are Hofmann [7], [8] and Unwin [12]. Most of the statistical packages available today have implemented mosaic displays (SAS, S-Plus, R, Minitab, and others).
The conventional mosaic plot graphically represents contingency tables using tiles whose size is proportional to the cell count. Figure 1 gives the mosaic plot of the Titanic data [3] as implemented in R [10]. This data will be explained in more detail in the next section. The plot is informative when we are well trained in reading it. Our experiments with graduate students showed that the features in the mosaic plot are confusing and misleading if more than 2 variables are involved in the plot. The reason behind this could be the limitation of human perception. Firstly, this could be explained by Stevens' law of dimensionality. Stevens' law states that the perceived scale in absolute measurements is the actual scale raised to a power, where the power is as follows: for linear features, 0.9-1.1; for area features, 0.6-0.9; for volume, 0.5-0.8.
Stevens' law suggests that physical relationships that are not represented in linear features can be grossly misperceived. For example, a lake represented on a map with an area graphically 10 times larger than another will be perceived as only 5 times larger, as noted in Catarci et al. [1].
Figure 1: Conventional mosaic plot of the Titanic data using R.

Since the mosaic plot presents all the features using 2-dimensional bars, the perceived scale of the features may be underestimated according to the law.
Secondly, the misperception of the mosaic plot could be due to the fact that the columns and rows of the bars of the plot are not aligned, which causes "errors in perception" as explained by Cleveland and McGill [2]. They state that the errors in perception from graphs are in the following order:
• Position along identical, non-aligned scales.
• Length.
• Angle/Slope (though error depends greatly on orientation and type).
• Area.
• Volume.
• Color Hue, Saturation, Density (only informal testing).
The above observations suggest using lines instead of bars to represent the cell sizes in contingency tables, and plotting the lines along a common aligned scale. Figure 2 gives the 'line' mosaic plot for the Titanic data. Details of the construction and interpretation of this plot will be given in the next section. In the line mosaic plot, each cell of the contingency table is given an equal-sized rectangle, and the size of the frequency of each cell is represented using the total length of the lines drawn inside the rectangle. All the rectangular boxes are aligned horizontally and vertically so that the comparison of the relative size of the lengths in each rectangle can be perceived more easily.
In section 2, we give the algorithm for the line mosaic plot. In section 3, we present the implementation of the algorithm, and demonstrate the usefulness of the plot using a real data set.
Figure 2: Line mosaic plot of the Titanic data. (Class = {1st, 2nd, 3rd, Crew}; Age = {Child, Adult}.)

2 Algorithm for mosaic array

Algorithms to generate mosaic plots have been approached in two ways, as far as the author is aware at the present time. The first approach is to construct the plot for a specific setting. In other words, the suggested algorithm builds the plot for the contingency table of a specific dimension, and a similar method is applied for other dimensions. Wang [13] and Friendly [4] give algorithms for 4 dimensions. The second approach is to use a recursive structure as implemented in R [10]. To use these algorithms, we need contingency tables.
In this paper, we suggest a simple straightforward algorithm to construct the line mosaic plot directly from the data set. Figure 2 suggests that the line mosaic plot is simply a 2-dimensional array of the frequencies, what we call the mosaic array. The mosaic array is the basic building block for our work, and in the next section we give an algorithm to construct this array directly from the data set. Also, the algorithm for the converse operation, constructing data from the mosaic array, is given. When the problem considered is supervised learning, or when there is a target variable, the mosaic array will be 3-dimensional. The 3rd dimension corresponds to the target variable, and the number of levels of this dimension will be equal to the number of categories of the target variable. The construction of this case will become clear in section 3, where we give the implementation of the line mosaic plot.
We assume that all the variables are discrete, and let p be the number of variables, and n' = (n_1, ..., n_p) be the vector of the numbers of values that each variable can adopt, i.e. the number of categories of each categorical variable. Without loss of generality, we can assume that the values of each variable are transformed into integer values starting from 1. For example, the values of the variable sex will be 1, 2. Also, let X be the data matrix of dimension n × p, where
n is the number of observations. For convenience and for simplicity of notation, let v be the p-length vector denoting an observation or an instance from the data matrix X. Using this notation, we can write v_j, j = 1, ..., p, as the realization of the j-th variable of an instance from the data matrix X. We finally assume that the variables are ordered according to some measure of importance for the mosaic plot. Hence, the first variable will be the first choice, the second one the next choice, and so on for the mosaic plot.
We now build a 2-dimensional mosaic array F which is a representation of the multidimensional cross-table form of the data matrix X, or the array of the form of Figure 2. The size of F will be Π_{i=1}^{[p/2]} n_{2i} rows and Π_{i=0}^{[(p−1)/2]} n_{2i+1} columns.
An instance of X, denoted v above, will add 1 to the cell F_{I,J}, where I and J are determined as follows:

    I = Σ_{i=1}^{[p/2]−1} (v_{2i} − 1) Π_{j=i+1}^{[p/2]} n_{2j}  +  v_{2[p/2]}

    J = Σ_{i=0}^{[(p−1)/2]−1} (v_{2i+1} − 1) Π_{j=i+1}^{[(p−1)/2]} n_{2j+1}  +  v_{2[(p−1)/2]+1}

where [x] denotes the integer not exceeding x.
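The mapping from an observation to its cell can be coded directly from these formulas. The following Python sketch (illustrative, not the author's implementation) returns 1-based indices and reproduces the Titanic example of section 3, where v = (1, 1, 1, 1) with n' = (4, 2, 2, 2) gives I = J = 1.

def cell_index(v, n):
    # v : 1-based category codes of one observation (length p)
    # n : numbers of categories of the p variables
    p = len(v)
    even, n_even = [v[i] for i in range(1, p, 2)], [n[i] for i in range(1, p, 2)]
    odd,  n_odd  = [v[i] for i in range(0, p, 2)], [n[i] for i in range(0, p, 2)]

    def mixed_radix(values, bases):
        idx = 0
        for val, base in zip(values, bases):
            idx = idx * base + (val - 1)     # nest the categories hierarchically
        return idx + 1                       # back to 1-based indexing

    return mixed_radix(even, n_even), mixed_radix(odd, n_odd)

print(cell_index((1, 1, 1, 1), (4, 2, 2, 2)))   # -> (1, 1)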
We now give the algorithm to construct the values of the variables, v, when an instance belongs to a cell F(I, J). From the row index I, the variables of even indices, v_2, v_4, ..., v_{2[p/2]}, will be constructed, and from the column index J, the variables of odd indices, v_1, v_3, ..., v_{2[(p−1)/2]+1}, will be constructed. The algorithm follows.
Values of odd indices, v_1, v_3, ..., v_{2[(p−1)/2]+1}, from J are:

    v_i = 1 + Mod([ (J − 1) / Π_{j=(i+1)/2}^{[(p−1)/2]} n_{2j+1} ], n_i),    i = 1, 3, ..., 2[(p−1)/2] − 1
    v_i = 1 + Mod(J − 1, n_i),    i = 2[(p−1)/2] + 1

Values of even indices, v_2, v_4, ..., v_{2[p/2]}, from I are:

    v_i = 1 + Mod([ (I − 1) / Π_{j=i/2+1}^{[p/2]} n_{2j} ], n_i),    i = 2, 4, ..., 2[p/2] − 2
    v_i = 1 + Mod(I − 1, n_i),    i = 2[p/2]

where Mod(x, y) = x − y[x/y]. We have shown above, algorithmically, that a unique F is constructed for a given data set. Now, to draw a mosaic plot, we need to construct ||F|| rectangles in total, where ||F|| denotes the number of cells of F, equal to Π_{i=1}^{[p/2]} n_{2i} × Π_{i=0}^{[(p−1)/2]} n_{2i+1}. The rectangles are separated by some gaps, and it is conventional to leave larger gaps for the variables with higher hierarchy. Our implementation of the construction of the rectangles and the gaps between them is given in the following section.
To complete the algorithm, we need to consider several details. At first, we need to standardize the mosaic array F according to some criterion. We can consider several options for standardization. In this work, we standardize each cell with respect to the maximum cell frequency, i.e. we use F(I, J)/max_{I,J} F(I, J). Secondly, we need to set some gaps between the rectangles, so that the plot is easier to perceive. An option for this is suggested in Friendly [4]. In this work, we apply the following method.
For the horizontal direction, there will be Π_{i=0}^{[(p−1)/2]} n_{2i+1} − 1 gaps between the rectangles, and for the vertical direction, there will be Π_{i=1}^{[p/2]} n_{2i} − 1 gaps between the rectangles. To implement the horizontal gaps, we leave 1 unit space between the rectangles for the lowest hierarchy, 2 units for the next hierarchy, ..., and [(p+1)/2] units for the highest hierarchy. Here, 'unit' is arbitrary; we may set 5 pixels, for example, as the unit space. For the columns, we leave 0.5 unit space between the rectangles for the lowest hierarchy, 1.5 units for the next hierarchy, ..., and [p/2] − 0.5 units for the highest hierarchy. An algorithm for the gaps is given in Figure 3.

For row bars:

• Let G ← Π_{i=0}^{[(p−1)/2]} n_{2i+1} − 1, which is the total number of gaps.
• Let the i-th gap g_i = 1, for i = 1, ..., G. Let the number of variables for the row bars be m = [(p−1)/2], and initialize d to 1.
• if (m == 1) break;
• for i = m, ..., 1, step -1 {
    d = d * n_{2i+1};
    for j = d, ..., G, step d {
      g_j++;
    }}

For column bars:

• Let G ← Π_{i=1}^{[p/2]} n_{2i} − 1, which is the total number of gaps.
• Let g_i = 1, for the i-th gap, where i = 1, ..., G. Let the number of variables for the column bars be m = [p/2], and initialize d as 1.
• if (m == 1) break;
• for i = m, ..., 1, step -1 {
    d = d * n_{2i};
    for j = d, ..., G, step d {
      g_j++;
    }}

Figure 3: Algorithm for the gaps between the rectangular bars.
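One reading of this algorithm in runnable form might look as follows (Python sketch; the total number of gaps is taken as the number of bars in that direction minus one, and the half-unit spacing used for the other direction in the text is left out for brevity).

def gap_sizes(levels):
    # levels : category counts of the variables mapped to one direction,
    #          ordered from highest to lowest hierarchy
    G = 1
    for n in levels:
        G *= n
    G -= 1                            # number of gaps = number of bars - 1
    g = [1] * G                       # every gap gets at least one unit
    d = 1
    for n in reversed(levels[1:]):    # walk up the hierarchy, innermost first
        d *= n
        for j in range(d, G + 1, d):  # gaps on block boundaries grow by one unit
            g[j - 1] += 1
    return g

print(gap_sizes([4, 2]))              # two nested variables -> [1, 2, 1, 2, 1, 2, 1]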

The above procedure works for unsupervised learning. With supervised learning, we have a target variable. We assume here that the last variable, variable p, is the target. In this case, we build F with p − 1 variables. The frequencies in the cell (I, J), or F(I, J), will be divided into n_p different categories. In this case, it will be convenient to express F in 3-dimensional form such that F(I, J, K), K = 1, ..., n_p.

3 Implementation and demonstration of the line mosaic plot

We illustrate the implementation of the line mosaic plot using the Titanic data introduced by Dawson (1995, http://ssLumh.ac.be/titanic.html). The Titanic data consist of 2201 cases and 4 variables {Class, Gender, Age and Survival}. The values of each variable are: Class = {1st, 2nd, 3rd, crew}; Gender = {male, female}; Age = {adult, child}; Survived = {yes, no}. Hence, p = 4, n' = (4, 2, 2, 2). When a case (1st, adult, male, yes) is given, v' = (1, 1, 1, 1), and the above algorithm gives {I = 1, J = 1}. A case such as (crew, male, child, no) is mapped to its cell of F in the same way. The mosaic array F of the Titanic data is given in Table 1.

 57   5   14  11   75  13  192   0
118   0  154   0  387  35  670   0
140   1   80  13   76  14   20   0
  4   0   13   0   89  17    3   0

Table 1: Mosaic array F of the Titanic data.

Table 2 gives the mosaic array F for the Titanic data when survival is the target variable. Implementation of this mosaic plot can be accomplished by assigning different colors to different categories. For the Titanic data, we may assign survived as the target variable. The conventional mosaic plot and the line mosaic plot of the Titanic data for this case are given in Figure 4 and Figure 5, respectively.

when survive = yes, or k = 1
 57   5   14  11   75  13  192   0
140   1   80  13   76  14   20   0

when survive = no, or k = 2
118   0  154   0  387  35  670   0
  4   0   13   0   89  17    3   0

Table 2: Mosaic array F of the Titanic data with survive as the target variable.
Figure 4: Mosaic plot of the Titanic data when survived is the target variable.

Figure 5: Line mosaic plot of the Titanic data when survived is the target variable. (Class = {1st, 2nd, 3rd, Crew}; Age = {Child, Adult}; Survived = {No, Yes}.)

From Figure 5, it is easy to see that most of the passengers are males, and there are very few child passengers. The largest group of passengers is the crew, then 3rd class, then 2nd class, and 1st class passengers are the fewest. Gender-wise, there are very few female crew members, and the largest class group for females is seen to be 3rd class, then 1st, and then 2nd class. We can visually estimate that the number of female 3rd class passengers is about twice the number of female 2nd class passengers. Turning our attention to survival, it is straightforward to observe that most of the 3rd class and crew passengers could not survive, but most of the 1st and 2nd class female passengers survived. The proportion of survivors in the 3rd and crew classes can even be estimated visually by reading the number of bars in the plot. For the {crew, adult, male} combination, the proportion can be estimated as 2/7. For the {3rd, adult, male} combination, the proportion is less than 1/4. For the female case, we can observe directly from the plot that the survival proportion is much higher except for the {3rd, adult} combination. Although there are few child passengers, the plot clearly shows that most of the child passengers survived except for the 3rd class cases.
Figure 6 gives the process of obtaining a line mosaic plot as implemented in DAVIS [9]. DAVIS is freely available on the following website: http://stat.skku.ac.kr/~myhuh/davis.html.

Figure 6: Line mosaic plot implemented in DAVIS.

References
[1] Catarci T., D'Amore F., Janecek P., Spaccapietra S. (2001). Interacting with GIS: from paper cartography to virtual environments. Unesco Encyclopedia on Man-Machine Interfaces, Advanced Geographic Information Systems, Unesco Press.
[2] Cleveland W.S., McGill R. (1985). Graphical perception and graphical methods for analyzing scientific data. Science 229, 828-833.
[3] Dawson R.J.M. (1995). The "unusual episode" data revisited. J. Statistics Education 3 (3), 1-7.
[4] Friendly M. (1994). Mosaic displays for multi-way contingency tables. Journal of the American Statistical Association 89, 190-200.
[5] Friendly M. (1999). Extending mosaic displays: marginal, partial, and conditional views of categorical data. Journal of Computational and Graphical Statistics 8, 373-395.
[6] Hartigan J.A., Kleiner B. (1981). Mosaics for contingency tables. In: Eddy W.F. (ed.), Computer Science and Statistics: Proceedings of the 13th Symposium on the Interface, 268-273. Springer-Verlag, New York, NY.
[7] Hofmann H. (2000). Exploring categorical data: interactive mosaic plots. Metrika 51 (1), 11-26.
[8] Hofmann H. (2003). Constructing and reading mosaic plots. Computational Statistics & Data Analysis 43, 565-580.
[9] Huh M.Y., Song K.R. (2002). DAVIS: A Java-based data visualization system. Computational Statistics 17 (3), 411-423.
[10] Ihaka R., Gentleman R. (1996). R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics 5, 299-314.
[11] Meyer D., Zeileis A., Hornik K. (2003). Visualizing independence using extended association and mosaic plots. DSC 2003 Working Paper, Institut für Statistik & Wahrscheinlichkeitstheorie, Technische Universität Wien, and Institut für Statistik, Wirtschaftsuniversität Wien.
[12] Unwin A. (2003). Variations on mosaic plots. Workshop on Modern Statistical Visualization and Related Topics (I) at ISM, 13-14 November 2003, ISM, Tokyo, Japan.
[13] Wang C.M. (1985). Applications and computing of mosaics. Computational Statistics & Data Analysis 3, 89-97.

Acknowledgement: This work was supported by the Samsung Research Fund (2003) of Sungkyunkwan University.
Address: M.Y. Huh, Department of Statistics, Sungkyunkwan University, Chongro-Ku, Seoul, Korea
E-mail: myhuh@skku.ac.kr
COMPSTAT'2004 Symposium © Physica-Verlag/Springer 2004

GRAPHICAL DISPLAYS OF INTERNET TRAFFIC DATA

Karen Kafadar and Edward J. Wegman

Key words: Logarithmic transformation, computational methods, recursive computation, graphical displays, exploratory data analysis.
COMPSTAT 2004 section: Data visualisation.

Abstract: The threat of cyber attacks motivates the need to monitor Internet traffic data for potentially abnormal behavior. Due to the enormous volumes of such data, statistical process monitoring tools, such as those used traditionally on data in product manufacturing departments, are inadequate. The detection of "exotic" data, which may indicate a potential attack, requires a characterization of "typical" behavior. We propose some simple graphical tools that permit ready visual identification of unusual Internet traffic patterns in "streaming" data. These methods are illustrated on a moderate-sized data set (135,605 records) collected at George Mason University.

1 Introduction

Cyber attacks on computer networks or personal computers have become major threats to nearly all operations in society. Methods to thwart such attacks are seriously needed. The problem of detecting unusual behavior in data streams occurs in many fields, such as disease surveillance, nuclear product manufacturing, and phone and credit card use. Historically, manufacturing and financial industries have relied on conventional statistical process monitoring tools, such as control charts and process flow diagrams. Such tools are reliable and appropriate, because the data streams can be stratified into reasonably independent series. For example, monitoring a customer's credit card use relies on an analysis of the data from the customer's past charging amounts and frequencies. This data stream is a much smaller data set than the entire database, with events occurring irregularly but not frequently; moreover, one customer's data stream can be considered as independent of other customers' data streams. In contrast, Internet traffic data are virtually continuous (limited only by the resolution of the time clock that captures them), and the data for one system involve hundreds of thousands of other computer or network systems.
Tools for monitoring such data are essential. Conventional statistical analysis often assumes that data follow a mathematically tractable probability distribution function and will yield valid estimates of the parameters of this distribution. Such approaches cannot be used on millions of data points. Graphical tools for streaming data offer hope of identifying potential cyber-attacks, particularly when the tools are tailored for the application. Features of Internet traffic data are described in Section 2.
Even with novel graphical displays for massive data streams, however, a characterization of "typical" behavior is still needed, so relevant graphical tools can be made more sensitive to capturing exotic or abnormal patterns. Two approaches to the detection problem through visualization are discussed in this article. Section 3 describes a "drill-down" approach to viewing large data sets, illustrated on a data set of 135,605 records collected over a one-hour period at George Mason University. Section 4 describes a second approach, "evolutionary graphical displays", which present the data only within a narrow time window (e.g., 10 minutes); early data disappear as new, more recent data come into view. Two examples are the "waterfall diagram" and the "skyline display". Section 5 offers a summary and proposals for further work.

2 Features of Internet traffic data

To monitor Internet traffic data for potential attacks, organizations will install anonymous surveillance machines outside a "firewall" to monitor incoming and outgoing traffic. For a discussion of the types of programs that monitor traffic flow, see Marchette [1, Ch. 4]. Data collected during an Internet session include many features; key features include source and destination addresses, source and destination ports, and measures of size and duration of the session.

IP addresses
Internet traffic proceeds from one machine to another, using a protocol for data transfer known as Internet Protocol (IP), which directs the transmission of data among machines during an Internet session. The "IP header" contains several important pieces of information. Since each IP address is a 32-bit number represented in four 8-bit fields (e.g., 127.0.0.1), 2^32 = 4,294,967,296 machines can be addressed. Multiplied by the volume of traffic during a given day, this means that conventional static graphs cannot display such tremendous volumes of data on a system with finite resolution. The IP header captures the two addressable machines involved in an Internet session.

Transmission Control Protocol
A common communication protocol is Transmission Control Protocol (TCP). TCP implements a two-way connection between machines and contains the necessary instructions for delivering and sequencing packets. The instructions are captured in a file whose header includes the source and destination port numbers, useful for monitoring traffic flow and detecting potential attacks.
Each host machine has 2^16 = 65,536 ports, divided into three ranges. The first range includes 1024 (2^10) "well-known ports" numbered 0 to 1023; for example, file transfer protocol (ftp) uses port 21; secure shell (ssh) uses port 22; telnet uses port 23; smtp mail operates from port 25; web service (http) operates from port 80; pop3 mail operates from port 110; secure web
encryption (https) operates from port 443; real time stream control protocol (rtsp) uses port 554 for QuickTime streaming movies. The second range consists of registered ports, numbered 1024 to 49151; for example, Sun has registered port 2049 for its network file system (nfs). The remaining 16384 (2^14) ports, numbered 49152 to 65535, are dynamic or private ports. Unprotected ports (source ports or destination ports) are prime candidates for intrusion; too much traffic on a given port within a short time frame may indicate a potential attack. In this data set, all ports numbered 10000 or above were coded simply as "port 10000".
Size of session
Internet traffic data are sent in "packets". The "size" of an Internet session can be measured in several ways: duration (e.g., number of seconds), number of packets, and number of bytes. Typically, these numbers will be correlated, but not in any specific deterministic way. However, a machine may send many packets with few bytes, or rather fewer full-sized packets; either situation may signal a potential attack on a system.

Sample data
Internet traffic data are being collected at George Mason University; a sample of ten records from a data set over the course of one hour is shown in Table 1. Column 1, labeled time, denotes the clock time (in number of seconds from an origin) at which the Internet session began; duration or len represents the duration or length of the session in seconds; SIP and DIP are the source and destination IP addresses, respectively; DPort and SPort are the destination and source port numbers, respectively; and Npacket and Nbyte indicate the number of packets and number of bytes transferred in the session. In the plots below, the variable time is shifted by 39603 seconds and scaled by 1/60, so that the first session starts at 0.01067 minutes past the start of the hour, and the last session starts at 59.971 minutes past the start of the hour. Table 2 summarizes the distribution of the values in each column with the five-number summary [4] supplemented with the 10th and 90th percentiles for each column (minimum, lower 10%, lower fourth, median, upper fourth, upper 10%, maximum). The "size" variables are all very highly skewed towards the upper end; the distance between the 90th percentile and the maximum is 2-3 orders of magnitude greater than the distance from the 90th percentile to the minimum. One session involved over 35 million bytes and almost 66,000 packets, although sessions of 1,832 bytes and 12 packets were more typical. The next section provides some displays of these data, with the objective of trying to characterize "typical" behavior, so that "atypical" behavior can be noted more readily.

3 Viewing Internet traffic data

Most features collected on Internet traffic data are highly skewed, as seen for the size variables. Thus, a plot of any pair of these variables has a very high density of points in the first quadrant near the origin.
         time   duration    SIP    DIP  DPort  SPort  Npacket      Nbyte
 1   39603.64       0.23   4367  54985    443   1631        9       3211
 2   39603.64       0.27  18146   9675   3921     25       15         49
 3   39603.65       0.04  18208  28256   1255     80        6        373
 4   39603.65    1389.10  24159  17171     23   1288      845       5906
 5   39603.65     373.99  60315  37727   2073     80     1759     834778
 6   39603.65       0.13  28256  18208     80   1256       10        816
 7   39603.65    1498.11  25699   4837   9593     80    65803   35661821
 8   39603.65       0.04  18208  28256   1251     80        5        373
 9   39603.66     122.38  54985   4179   1298    443       99      85559
10   39603.66       0.13  28256  18208     80   1257       10        816

Table 1: Sample of Internet traffic data from George Mason University.

                      time   duration     SIP     DIP
minimum           39603.64       0.00     259     259
lower 10%         39937.68       0.20    4930    4024
lower 4th         40507.09       0.32    9765    8705
median            41435.55       0.58   20258   25164
upper 4th         42326.46       3.77   41282   45900
upper 10%         42857.49      21.45   62754   58202
maximum           43201.26    3482.50   65276   65262
#(unique values)    104268       9101    2504    5139

                     DPort    SPort  Npacket       Nbyte
minimum                 20       20        2           0
lower 10%               80     1187        9         568
lower 4th               80     1369       10         860
median                  80     1849       12        1832
upper 4th               80     3681       21        7697
upper 10%               80    10000       45       25161
maximum              10000    10000    65803    35661821
#(unique values)       380     6742     1056       29876

Table 2: Summary statistics from the Internet traffic data set (135,605 sessions).

"zooming in" , or "dr illing down" int o t his regio n, as one does on a geograph-
ical map, specific feat ures can be bet t er obse rved . An alte rnative to t his
"dr ill-down" approach (steps of power magnificat ion) is a logarit hmic trans-
formation, which allows one to view the point s by sca nn ing across t he screen
rather than by magnifiying regions of t he space. We describe t his approach
below.
Figure 1: Kernel density estimates of log(1 + √Nbyte) in four separate ranges (Nbyte < 1000, 700 to 2000, 1800 to 9000, > 7000).

Density plots
Figure 1 is a kernel density estimate [3] of log.byte = f(Nbyte), where f(x) = log(1 + √x). We use the transformation f(x) = log(1 + √x) for all three size variables to spread out their values (values of x near the low end of the scale are not spread out as far as they would be with the simple log(x) transformation; f'(x) < 1/x, much more so for small x). Likewise, log.len = f(duration) and log.pkt = f(Npacket). All calculations and graphs are made using the open-source software R, available from http://www.cran.r-project.org. A small peak at 0 reflects 2611 zeroes; the next largest byte size is 147. The data are clearly skewed, and local peaks of high density appear where log.byte ≈ 3.4, 3.8, 4.1, 4.5, and 5.1 (Nbyte ≈ 840, 1400, 3500, 8000, 26000).
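The authors' computations were done in R; an equivalent sketch of the transformation and of one of the density estimates in Python is given below, with scipy's gaussian_kde standing in for R's density() and an assumed plain-text file of session records.

import numpy as np
from scipy.stats import gaussian_kde

def f(x):
    return np.log1p(np.sqrt(x))                   # f(x) = log(1 + sqrt(x))

nbyte = np.loadtxt("sessions.txt", usecols=7)     # assumed column layout as in Table 1
log_byte = f(nbyte)

subset = log_byte[(nbyte >= 700) & (nbyte <= 2000)]   # one of the four ranges of Figure 1
kde = gaussian_kde(subset)
grid = np.linspace(subset.min(), subset.max(), 200)
density = kde(grid)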
Distributions of session size variables
Boxplots can be useful to display the relationship between two variables, as in Figure 2 for the two variables log.len = f(duration) (y-axis) and log.byte = f(Nbyte). The first box contains the 2611 values for which Nbyte is zero; the second box contains the next 1216 values, where Nbyte ranges from 1 to 365 (0 < log.byte ≤ 3); subsequent bins are 0.1 wide, except the last five bins. This display shows a relatively stable trend up until the last few bins, but is otherwise not very useful for outlier detection, since outliers are prevalent in each bin.
Figure 2: Boxplots of log.duration = log(1 + √duration) versus log.Nbyte = log(1 + √Nbyte).

The boxplot display does confirm general trends: sessions with more bytes tend to last longer, and most sessions are short.
The preponderance of relatively short sessions can be seen in Figure 3(a), which displays the session durations as horizontal lines that extend from the start time to the end time. Because these sessions are reported in the order in which they began, the session start times range from time 0 (bottom line) to 59.971 (nearly the end of the hour). Figure 3(b) shows the same information, but each line is shifted back to 0. With continuously monitored data, the session duration lines would continue past the censoring point (illustrated as a red dotted line in Figure 3b). Relatively few sessions are "censored" (i.e., not ended within the hour), reflecting the fact that most sessions are short: 93% of the sessions lasted less than 30 seconds. Figure 4 shows a barplot of the number of active sessions during each 30-second subset of this one-hour period (a time frame of 30 seconds is selected to minimize the correlation between counts in adjacent bars). The mean number of active sessions in any one 30-second interval during this hour is 923, with standard deviation 140, suggesting a rough upper "3-sigma limit" of 1343 sessions.
[Because these numbers are counts, a square root transformation may be appropriate; see Tukey [4]. The mean and standard deviation of the square roots of the counts are 30.29 and 2.23, respectively, resulting in an approximate upper "3-sigma limit" of (30.29 + 3 · 2.23)² = 1367, very close to the limit on the raw counts, since the Poisson distribution with a high mean is approximately Gaussian.] The maximum number of sessions in any one of these 120 30-second intervals is 1299, below the "3-sigma limit". This plot could be monitored continuously in time, dropping older bars off the left side of the plot and adding new bars on the right; the upper 3-sigma limit could depend upon the hour, day, or week of the year.
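A sketch of this monitoring computation, assuming arrays of session start times (in minutes past the hour) and durations (in seconds) named start_min and duration_sec, is:

import numpy as np

def active_session_counts(start_min, duration_sec, n_bins=120, bin_len=0.5):
    # number of sessions active in each 30-second (0.5 minute) interval
    end_min = start_min + duration_sec / 60.0
    counts = np.empty(n_bins, dtype=int)
    for k in range(n_bins):
        lo, hi = k * bin_len, (k + 1) * bin_len
        counts[k] = np.sum((start_min < hi) & (end_min >= lo))   # session overlaps the bin
    return counts

counts = active_session_counts(start_min, duration_sec)
limit = counts.mean() + 3 * counts.std(ddof=1)     # rough upper "3-sigma" limit
flagged = np.nonzero(counts > limit)[0]            # intervals worth inspecting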

Figure 3: Session duration plot. Panel (a) displays the length of each session, starting at the actual start time. Panel (b), a shifted version of panel (a), displays the length of each session, starting at time 0.

The distribution of session duration by session start time can also be displayed as a scatter plot of points (time, duration), as shown in Wegman and Marchette [7, p. 14]. Because the short sessions dominate this plot, Figure 5(a) shows this same plot, but using the log-transformed ordinate instead; i.e., (time, log.len). This plot shows a nearly straight line of points at log.len = 2.2685 between session start times of 23 to 57 minutes. Figure 5(b) expands this part of the plot and marks the identified points with red "+" symbols; they fall into two groups: an early group of 268 points (mean duration = 75.05 seconds), and a later group of 152 points (mean duration = 75.15 seconds). All 420 points are web sessions (destination port 80) and arise from a single source IP 65246, destination IP 45900, and source ports numbered 10000 or higher.
Figure 4: Barplot of the number of sessions in successive 30-second non-overlapping intervals during the hour (120 intervals). The mean number is 923 (standard deviation = 140), yielding an upper 3-sigma limit of approximately 1343. The maximum of these 120 counts is 1299.

A major challenge is the development of statistical "screening" algorithms to identify such "interesting" patterns in data plots such as this one, so that potential attacks on networks can be identified in real time. Since an infinite number of patterns can occur, a collection of likely patterns must be catalogued, so that statistical significance of their detection can be quantified. Algorithms that flag too many false positives would result in an unnecessary number of shutdowns and service denials.
Relationships between pairs of size variables
Figure 6 shows a series of plots of the transformed Nbyte variable, log.byte, versus the transformed Npacket variable, log.pkt = log(1 + √Npacket), in four separate ranges. Panel (a), observations for which log.pkt is between 1 and 2 (Npacket between 3 and 40), shows a generally increasing trend, simplified in Panel (b) with boxplot displays (the labels on the x-axis are the same as those in panel (a), multiplied by 10). Panel (c) shows one line of 293 points around log.pkt = 2.77 (Npacket ≈ 226) and log.byte = 5.82 (Nbyte ≈ 112,792) [all come from destination port 80 (web), source IP 23070, and destination IP 443 (https)]; and another set of 39 points around log.pkt = 2.84 (Npacket ≈ 259) and log.byte = 5.64 (Nbyte ≈ 78,341) [all have SIP 4837, DIP 56612, DPort 80]. Panels (c) and (d) show many points at high values of log.pkt along two lines with approximately unit slope but with different intercepts.
Figure 5: Plot of log-transformed duration, log.len, as a function of session start time. The left panel (all data) shows an almost perfectly horizontal line of points at log.len = 2.2685, between 23 and 57 minutes (expanded in the right panel). The "+" symbols mark sessions with SIP = 65246, DIP = 45900, DPort = 80, SPort = 10000.

with different intercepts; th e upp er set corres ponds mainl y to destination


IP addresses 25 (smtp mail) , 80 (web) , and 443 (secur e web) ; the lower set
corresponds mostly to DIP 554 (rtsp) . The dense set of 55 points points in
Figure 6(d) (3.4 < log . pkt < 3.8) lie near t he line log . byte = 3 + log . pkt ,
and 43 of them corre spond to DPort 43. The exte nt t o which such a pattern
could occur by chance alone should be investi gated.
The relationship between the number of bytes, NByte, and session duration, duration, is shown with plots of log.byte versus log.len, first for the entire data set (Figure 7), and then in 4 subranges defined by the intervals of log.byte and log.len (Figure 8). This approach to viewing the data can be considered as roughly equivalent to a "drill-down" approach, where all data are displayed in translated regions of the logged variables. Because longer sessions are associated with more bytes, and most sessions are short, plots of log.byte = log(1 + √Nbyte) versus log.len = log(1 + √duration) should be dense near the low end of each scale but much less dense near the upper end. In fact, the points in Figure 7 are especially dense around log.len = 0.5, then less dense until log.len = 1.1.
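As a small illustration (ours, not part of the original analysis), the transformed size variables used throughout these plots are obtained directly from the raw duration, packet and byte counts; the example values below are arbitrary.

import numpy as np

def log_sqrt(x):
    # The transform used for the size variables: log(1 + sqrt(x)), natural log.
    return np.log(1.0 + np.sqrt(np.asarray(x, dtype=float)))

duration = np.array([2.0, 60.0, 75.0])              # seconds (illustrative)
npacket  = np.array([3, 40, 226])
nbyte    = np.array([512, 10_000, 112_792])
log_len, log_pkt, log_byte = log_sqrt(duration), log_sqrt(npacket), log_sqrt(nbyte)
print(log_len, log_pkt, log_byte)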

[Figure 6 about here; title: NByte vs Npacket; x-axes: log.pkt; panels based on 119,322, 14,612, and 800 observations.]

Figure 6: Plots of log.byte versus log.pkt, in 4 subranges of log.pkt.

Figure 7 also reveals a set of points in the upper right corner of the plot, 2.5 ≤ log.len ≤ 3 and 6 ≤ log.byte ≤ 7, which is discussed below in connection with Figure 9(c). Figure 8 shows three uncommonly straight lines of points: 377 points in Figure 8(a), in the region where 1.17 ≤ log.len ≤ 1.27 and log.byte ≈ 5.0; 292 points in Figure 8(b), where 1.69 ≤ log.len ≤ 1.79 and log.byte ≈ 5.8; and 60 points in Figure 8(d), where 2.7 ≤ log.len ≤ 2.9 and log.byte increases from 6.4 to 7. The points in these three sets of lines correspond, respectively, to (1) SIP = 1681, SPort = 10000, DPort = 25 (smtp mail) (recall that SPort 10000 actually refers to all source ports numbered 10000 or higher); (2) SIP = 23070, DIP = 336, DPort = 80 (web); and (3) DPort = 554 (rtsp), SPort = 1276 to 2070. For a given session, initial ports are assigned at random, but subsequent ones are assigned by an incrementing pattern characteristic of the operating system. Hence, a string of SPort numbers may signal a potential attacker seeking information about the operating system it intends to invade.
Stratification by groups of destination ports
This hour of Internet activity involved 380 unique destination ports (Table 2). DPort 80 (web) is the most common, comprising 116,134 of the 135,605 records. The next most common destination port is DPort 443 (secure web, https), utilized 11,627 times, followed by DPort 25 (mail, SMTP), accessed 6,186 times. Ports 554 (rtsp), 113, 10000 (or higher), and 8888 occur 200, 128, 97, and 94 times, respectively.

[Figure 7 about here; title: NBytes vs Duration; x-axis: log(1 + sqrt(duration)).]

Figure 7: log.byte versus log.duration.

[Figure 8 about here; x-axes: log.len; legends: x = SIP 1681, DPort 25, SPort 10000, 43-50 packets; 292 points: SIP 23070, DIP 336, DPort 80; + = DPort 554, SPort 1276 to 2070, 1000-2000 packets.]

Figure 8: log.byte versus log.duration, in 4 subranges.



[Figure 9 about here; panel titles: Dest Port 25 (6186), Dest Port 443 (11627), Dest Ports 113, 554, 8888, 10000 (519), Other Dest Ports (1139); x-axes: log.len.]

Figure 9: log.byte vs log.len for other destination ports.

Twelve destination port numbers during this hour occurred between 5 and 29 times in the file; 5 ports occurred only 4 times, 8 occurred only 3 times, 47 destination ports occurred only twice, and 293 destination ports occurred only once. Displaying all 135,605 points on one plot is not very informative, so instead we subdivide the session records into groups according to their destination ports. Because over 85% of these data are web sessions (DPort = 80), a plot of log.byte versus log.len for only the web sessions looks like Figure 7 (all data). Figure 9 shows scatterplots of two variables, conditioned on values of a third (non-web DPorts): DPort 25 (smtp mail) in panel (a); 443 (https) in panel (b); 113, 554, 8888, and 10000 in panel (c); and the remaining 310 destination ports in panel (d). Panel (c) shows that the line of points in the upper right corner of Figure 7 arises from sessions with DPort 554 (rtsp), and that the sessions from DPort 8888 occur in a small cluster near log.len = 2 and log.byte = 5. Forty of the 52 points in the upper right corner of Figure 9(d), where log.byte ≈ 4 + 0.5 log.len, correspond to DPort numbers 119 and 1755, but are otherwise unrelated (some "patterns" can be spurious).
Monitoring frequency of source IP addresses
These same plots can be constructed when the data are subsetted by source IP address (SIP), as opposed to destination port number (DPort). The number of source IP addresses that may be active during a given hour of activity is likely to be very much higher than the number of destination ports; in this data set, only 380 unique destination ports were accessed, versus 3548 unique source ports. A plot of log.Nbyte versus log.len shows clumps of observations for certain source IP addresses, often because they correspond to heavy web traffic.

[Figure 10 about here; title: EWMA on T-Squared (lambda = 0.5); x-axis: Time of session.]

Figure 10: Multivariate EWMA plot of Hotelling's T², last 10,000 values.
Multivariate charts
The three session "size" variables, log .len, log . pkt, log. byte, being
somewhat correlated, are amenable to a "control chart' where the statis-
tic being plotted is a weighted linear combination of the previously plotted
variable (>') and the current value of Hotelling's T 2 statistic (1 - >.) . Varde-
man and Jobe (1999) provide tables for the optimal choices of >.. Calculating
a Hotelling's T 2 statistic on three successive observations, denoted H t , a mul-
tivariate exponentially weighted moving average (MEWMA) chart using >.
= 0.5 is shown in Figure 10 (last 10,202 observations only) . Most values
(99.7%) are below 60; a successive run of observations above 60 might sug-
gest abnormal session sizes. To minimize the effect of outliers on Hotelling's
T2 statistic, location and scale are estimated using medians and trimmed
standard deviations instead of classical sample means and standard devia-
tions (SDs) . The SDs were estimated as 1.85, 0.34, 0.74, and the pairwise
correlations are 0.53 (log.len, log .pkt), 0.56 (log . len, log.pkt) , 0.90
(log. pkt, log. byte).
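A minimal sketch of such a chart is given below (our illustration, not the authors' implementation). It simplifies the scheme slightly by computing a Hotelling-type T² for each incoming observation of (log.len, log.pkt, log.byte), rather than on groups of three successive observations, using medians and a simple trimmed standard deviation for location and scale, and then smoothing with an EWMA at λ = 0.5. The generated data and the limit of 60 are assumptions for illustration.

import numpy as np

def trimmed_std(x, trim=0.1):
    # Standard deviation after trimming a fraction `trim` from each tail.
    x = np.sort(np.asarray(x, dtype=float))
    k = int(trim * len(x))
    x = x[k:len(x) - k] if k > 0 else x
    return x.std(ddof=1)

def robust_t2(X, trim=0.1):
    # Hotelling-type T^2 using medians / trimmed SDs and the sample correlation.
    X = np.asarray(X, dtype=float)
    center = np.median(X, axis=0)
    scale = np.array([trimmed_std(X[:, j], trim) for j in range(X.shape[1])])
    Z = (X - center) / scale
    Rinv = np.linalg.inv(np.corrcoef(Z, rowvar=False))
    return np.einsum('ij,jk,ik->i', Z, Rinv, Z)        # quadratic form per row

def ewma(h, lam=0.5):
    # Q_t = lam * Q_{t-1} + (1 - lam) * H_t, started at the first value.
    q = np.empty_like(h)
    q[0] = h[0]
    for t in range(1, len(h)):
        q[t] = lam * q[t - 1] + (1.0 - lam) * h[t]
    return q

rng = np.random.default_rng(1)                          # illustrative data only
X = rng.multivariate_normal([1, 2, 4], [[1, .5, .5], [.5, 1, .9], [.5, .9, 1]], size=10_000)
chart = ewma(robust_t2(X), lam=0.5)
print((chart > 60).mean())                              # fraction above a tentative limit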

4 Evolutionary displays
Wegman and Marchette [6] advocate a new approach to visualizing massive data sets, called "evolutionary displays." Massive data sets are too large to display using graphs and plots that are designed for moderate data sets of fixed size. The concept behind evolutionary displays is to exhibit data within the most current time frame, dropping off old data and making room for the most recent data. For example, in Figure 10, new data come in on the right as old data on the left are pushed off the screen. Wegman and Marchette [6, p. 906, Figure 4] use this concept to define a waterfall display, useful for monitoring the frequency of source ports.

[Figure 11 about here; panel (a) x-axis: DPort number; panel (b) x-axis: Source IP address.]

Figure 11: Skyline plots. (a): DPort access; (b): Source IP access.
Skyline plots
Most destination port numbers occur only once or twice during the hour; of the 380 distinct DPorts, 293 occurred only once, 47 occurred twice, 8 occurred 3 times, and 5 occurred 4 times. The remaining 27 ports occurred over 4 times; the top five are DPort 80 (web, 116,134 times), 25 (mail-smtp, 6,186 times), 443 (secure web, 11,627 times), 554 (rtsp, 200 times), and 113 (128 times). Setting aside the "well-known" ports 0-1023, we plot the occurrence of destination ports numbered 1024 and above, which should arise more or less at random, and flag as unusual any DPort that is referenced over 10 times. Figure 11 shows two such plots; one for DPort (color changes indicate DPort access counts greater than 10, indicative of potentially high traffic on this destination port), and one for SIP in the first 10,000 session records (color changes indicate SIP occurrences of more than 50). Four unusually frequent source IP addresses are immediately evident: 4837, 13626, 33428, and 65246,

which occur 371, 422, 479, and 926 times, respectively, in the first 10,000 sessions. The construction of this plot resembles the tracing of a skyline, so we call it a "skyline plot." Limits on skyline plots may depend upon time of day, day of week, month, or season.
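The bookkeeping behind a skyline plot is just an occurrence count per port (or address) with a flag above a threshold; a minimal sketch (ours, with illustrative data and thresholds taken from the description above) follows.

import numpy as np

def skyline_counts(values, threshold):
    # Occurrence count per distinct value, plus the values exceeding the threshold.
    uniq, counts = np.unique(np.asarray(values), return_counts=True)
    return uniq, counts, uniq[counts > threshold]

rng = np.random.default_rng(2)                          # illustrative record stream
dports = rng.choice([80, 443, 25, 554, 8888, 1755, 3072], size=10_000,
                    p=[.85, .08, .04, .01, .005, .005, .01])
# In the text, only ports >= 1024 are plotted, with thresholds of 10 (DPort) and 50 (SIP).
uniq, counts, hot = skyline_counts(dports[dports >= 1024], threshold=10)
print(dict(zip(uniq.tolist(), counts.tolist())), "flagged:", hot)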

5 Summary and further work

This article has highlighted several of the challenges that arise in analyzing and displaying massive data sets. Some simple statistics based on robust quantities are useful for characterizing typical behavior (e.g., number of source and destination ports, source and destination IP addresses, and frequency of access). These characterizations suggest graphical displays which highlight unusual usage or access. We discussed the role of "evolutionary graphics" for such data, specifically the use of "waterfall diagrams", and proposed "skyline plots" as a means of monitoring ports and IP addresses. Future work will include massive data sets from Internet sessions and other fields.

References
[1] Marchette D.J. (2001). Computer intrusion detection and network monitoring. Springer.
[2] Khumbah N.-A., Wegman E.J. (2003). Data compression by geometric quantization. In: Akritas M., Politis D.N. (eds.) Recent Advances and Trends in Nonparametric Statistics. North Holland Elsevier, Amsterdam.
[3] Silverman B.W. (1986). Density estimation. Chapman and Hall, London.
[4] Tukey J.W. (1977). Exploratory data analysis. Addison-Wesley, Reading, Massachusetts.
[5] Vardeman S.B., Jobe J.M. (1999). Statistical quality assurance methods for engineers. Wiley, New York.
[6] Wegman E.J., Marchette D.J. (2003). On some techniques for streaming data: A case study of Internet packet headers. J. Comput. Graph. Stat. 12 (4), 893-914.
[7] Wegman E.J., Marchette D.J. (2004). Statistical analysis of network data for cybersecurity. Chance, 9-19.
Acknowledgement: Funding from Grant No. F49620-01-1-0274 from the Air Force Office of Scientific Research, awarded to George Mason University, is gratefully acknowledged. Part of this research was conducted during the first author's appointment as faculty visitor at the National Institute of Standards and Technology.
Address: K. Kafadar, E.J. Wegman, University of Colorado-Denver and George Mason University
E-mail: kk@math.cudenver.edu; ewegman@galaxy.gmu.edu
COMPSTAT'2004 Symposium © Physica-Verlag/Springer 2004

CLUSTERING ALL THREE MODES OF THREE-MODE DATA: COMPUTATIONAL POSSIBILITIES AND PROBLEMS
Henk A.L. Kiers
Key words: Cluster analysis, multiway analysis.
COMPSTAT 2004 section: Clustering.

Abstract: For the analysis of three-mode data sets (i.e., data sets pertaining to three different sets of entities), various component analysis techniques are available. These yield components that are summaries of the entities of each mode. Because such components are often interpreted in a more or less binary way in terms of the entities related strongest to them, it seems logical to actually constrain these components to have binary values only. In the present paper, such constrained models are proposed and algorithms for fitting these models are provided. In one of these variants, the components are constrained such that they correspond to nonoverlapping clusters of entities. Finally, a procedure is proposed for steering component values towards binary values, without actually imposing them to be binary, using penalties.

1 Analysis of three-mode data

Three-mode data sets are data sets pertaining to three different sets of entities. An example of a three-mode data set is a set of scores of a number of individuals, on a number of variables, each obtained under a number of different conditions. For the analysis of three-mode data, various exploratory three-way methods are available. The two most common methods for the analysis of three-mode data are CANDECOMP/PARAFAC [1], [6] and Tucker3 analysis [16], [10]. Both methods summarize the data by components for all three modes, and for the entities pertaining to each mode they yield component weights; in the case of Tucker3 analysis, in addition a so-called core array is given, which relates the components for all three modes to each other.
If we denote our I × J × K three-mode data array by X, then the two methods can be described as fitting the model

x_ijk = Σ_{p=1}^P Σ_{q=1}^Q Σ_{r=1}^R a_ip b_jq c_kr g_pqr + e_ijk,   (1)

where a_ip, b_jq and c_kr are referred to as the component weights, which are elements of the component matrices A (for mode A), B (for mode B), and C (for mode C), of orders I × P, J × Q, and K × R, respectively; g_pqr denotes the element (p, q, r) of the P × Q × R core array G, and e_ijk denotes the error term for element x_ijk; P, Q, and R denote the numbers of components for the three

respective modes. The difference between CANDECOMP/PARAFAC and Tucker3 analysis is that in CANDECOMP/PARAFAC the core is actually set equal to a superidentity array (i.e., g_pqr = 1 if p = q = r, g_pqr = 0 otherwise). As a consequence, in the case of CANDECOMP/PARAFAC, for all modes we have the same number of components, and (1) actually reduces to

x_ijk = Σ_{r=1}^R a_ir b_jr c_kr + e_ijk.   (2)

Clearly, when these models are fitted to data, we end up with component matrices A, B, and C, and, in the case of Tucker3 analysis, we also get a three-mode core array G as outcome of the analysis.
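For concreteness, the structural parts of models (1) and (2) can be written as tensor contractions; the following is a small illustrative sketch (ours, not part of the paper) using NumPy's einsum, with arbitrary dimensions chosen only for the example.

import numpy as np

def tucker3(A, B, C, G):
    # x_ijk = sum_pqr a_ip b_jq c_kr g_pqr  (model (1) without the error term)
    return np.einsum('ip,jq,kr,pqr->ijk', A, B, C, G)

def candecomp_parafac(A, B, C):
    # CP model (2): Tucker3 with a superidentity core, so P = Q = R
    R = A.shape[1]
    core = np.einsum('pq,qr->pqr', np.eye(R), np.eye(R))   # 1 iff p = q = r
    return tucker3(A, B, C, core)

I, J, K, P, Q, R = 6, 5, 4, 3, 2, 2
rng = np.random.default_rng(0)
A, B, C = rng.standard_normal((I, P)), rng.standard_normal((J, Q)), rng.standard_normal((K, R))
G = rng.standard_normal((P, Q, R))
print(tucker3(A, B, C, G).shape)        # (6, 5, 4)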
The result of a three-mode analysis is a summary of the observation units, the variables and the conditions by means of a number of components, and possibly a core array describing the relations between them. The componentwise interpretation, however, is not very easy, because it requires one to think in dimensions along which the observation units, variables or conditions vary. Here the component weights indicate to what extent, for instance, the individuals can be described by the property defined by the component. Likewise, variables are related to the components for the variables to different extents. Now the interpretation of the components usually proceeds conversely: from the strengths of the relations of the variables to the components, one can interpret the meaning of the components. This interpretation is rather cumbersome if one discriminates precisely between different strengths of relations. Therefore, in practice, one tends to interpret components on the basis of the variables related strongest to them, and one tends to ignore the less related variables. In fact, one thus binarizes the relations into sufficiently strong and not sufficiently strong. Thus one could say that the components are interpreted as if they refer to clusters of variables consisting of those variables that have the strongest relations with them. Similar cluster-based interpretations can be given to components describing individuals and conditions, if a priori information on the individuals and conditions is available. To enhance the interpretability of the component matrices, they are often subjected to simple structure rotations such as varimax [7], see also [8], but the clusters will always remain somewhat fuzzy (i.e., relations are never entirely binarized).
Now if, in practice, components tend to be interpreted as clusters, would it not seem more rational to model the data in terms of cluster membership, and discard the information on strengths of relations? The idea of clustering all three modes simultaneously has been pursued by various authors. Clustering approaches involving the CANDECOMP/PARAFAC model have been proposed by Chaturvedi and Carroll [3] and Leenen et al. [11], where the latter authors use Boolean products rather than ordinary products. An extension of the latter Boolean model to the Tucker3 situation has been proposed by Ceulemans, Van Mechelen and Leenen [2].
Surprisingly, except for a recent paper by Rocci and Vichi [13], straightforward (non-Boolean) generalizations of the Tucker3 model do not seem to have received attention yet, and no algorithms seem to have been published for handling this case. The present paper, therefore, focuses on that particular case. The models described here are in fact three-mode generalizations of the GENNCLUS model [4], PENNCLUS [5], and the double k-means clustering model by Vichi [17].

2 Clustering variants of the Tucker3 model

As has been mentioned above, in the Tucker3 model the elements of the component matrices are, in practice, often interpreted in a more or less binary way. That is, when interpreting a component for, say, the variables, for each variable it is specified whether it is associated with the component or not. Thus, a Tucker3 model that fully complies with this binary way of interpretation would simply have binary component weights for the variables: 1 for the variables associated with a component, and 0 for those not associated with the component. In fact, one might want to specify the strength of the association by a value different from 1, but if the same value is used for all variables related to a component, then one can always scale such values to 1 anyway. Therefore, it is here proposed to constrain the elements of each component matrix to be binary, that is, to be equal to 0 or 1.
When all elements of the component matrices are binary, one could say that the components refer to clusters of, for example, variables. Without further constraints, such clusters may very well overlap, in the sense that some entities are associated with more than one cluster. The overlap of clusters is nonproblematic for the interpretation of the clusters themselves, but does make the overall model relatively difficult to interpret. Therefore, it can be attractive to impose a further constraint, namely the constraint that clusters do not overlap. Specifically, this constraint implies that each entity is assigned to one and only one cluster.
Models for these two constrained variants of the Tucker3 model are described below, and it is also indicated how each model can be fitted to data. In the next section, algorithms for actually carrying out such fitting procedures are given.

2.1 Tucker3 with overlapping clusters

The Tucker3 model with overlapping clusters is defined as the model

x_ijk = Σ_{p=1}^P Σ_{q=1}^Q Σ_{r=1}^R a_ip b_jq c_kr g_pqr + e_ijk,   (3)

where a_ip, b_jq, and c_kr are constrained to be binary (0 or 1). To avoid summation notation, we write the above model in terms of matrices as follows:

X_a = A G_a (C′ ⊗ B′) + E_a,   (4)

where X_a, G_a, and E_a denote the A-mode matricized versions of the three-way arrays X, G, and E (i.e., the matrices obtained upon putting the frontal slabs next to each other, see [9]), and ⊗ denotes the Kronecker product. To fit this model to an empirical data set, it is proposed here to minimize the sum of squared residuals, hence to minimize

f(A, B, C, G) = ||X_a − A G_a (C′ ⊗ B′)||²,   (5)

over A, B, C, and G, subject to the constraint that the elements of A, B, and C are binary. Note that the core array is left fully unconstrained.
It is well known that the Tucker3 model is not unique. That is, nonsingular transformations of the component matrices can be compensated by the inverse transformations in the core, and thus do not affect the model estimates. For example, suppose we transform A by multiplying it by a nonsingular matrix S; premultiplying G_a by S^{-1} yields exactly the same model estimates since (AS)(S^{-1}G_a)(C′ ⊗ B′) = A G_a (C′ ⊗ B′). In the case of binary constraints, this nonuniqueness is limited to those cases where nonsingular transformations do not affect the binary constraint. This is possible when there are columns in, for instance, matrix A that do not overlap: upon replacing one such column by the sum of such columns, the binary constraint will still be satisfied. Specifically, suppose A has only two columns that do not overlap (i.e., do not have unit elements at the same position); then replacing the second by the sum of the two comes down to postmultiplying A by the nonsingular matrix S = (1 1; 0 1). Clearly, then AS satisfies the binary constraint, and upon replacing A by AS and G_a by S^{-1}G_a we get the same estimates as with A and G_a. Similar nonuniquenesses can be identified upon describing model (3) using B- or C-mode matricized versions as

X_b = B G_b (A′ ⊗ C′) + E_b   (6)

and

X_c = C G_c (B′ ⊗ A′) + E_c,   (7)

where subscripts b and c indicate B- and C-mode matricized versions of the three-way arrays at hand, which are obtained by other ways of positioning slices of the three-way arrays next to each other; see Kiers [9].

2.2 Tucker3 with nonoverlapping clusters

The Tucker3 model with nonoverlapping clusters is the same model as that for overlapping clusters described above, in Section 2.1, with the additional constraint on the matrices A, B, and C that in all rows one and only one element is 1, and all others are 0. The procedure to fit this model is hence to minimize (5) over A, B, C, and G, subject to the constraint that the elements of A, B, and C are binary with exactly one unit element in each row. As a consequence of the minimization subject to these constraints, the core array will now contain the within-cluster average scores in X; hence the core effectively summarizes the data in such a way that it gives the average score of the individuals in each cluster, averaged across the variables associated with the variable cluster at hand, and averaged across conditions associated with the condition cluster at hand. When the clusters can be interpreted well, then the core has a very easy interpretation too, simply in terms of 'cluster scores'.
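As a small illustration (ours, not part of the paper), when the memberships are binary and nonoverlapping the least-squares core is just the array of within-cluster cell averages; a sketch of that computation, assuming no empty clusters, is given below.

import numpy as np

def cluster_mean_core(X, A, B, C):
    # Core of within-cluster averages for binary, nonoverlapping A (IxP), B (JxQ), C (KxR);
    # assumes every cluster is non-empty, so the counts below are positive.
    sums = np.einsum('ijk,ip,jq,kr->pqr', X, A, B, C)              # cluster cell sums
    sizes = np.einsum('p,q,r->pqr', A.sum(0), B.sum(0), C.sum(0))  # cluster cell counts
    return sums / sizes

rng = np.random.default_rng(3)
X = rng.standard_normal((6, 4, 5))
A = np.eye(2)[[0, 0, 1, 1, 0, 1]]      # each row assigned to exactly one cluster
B = np.eye(2)[[0, 1, 1, 0]]
C = np.eye(2)[[0, 0, 1, 1, 1]]
print(cluster_mean_core(X, A, B, C))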

3 Algorithm for Tucker3 with overlapping clusters

As mentioned in Section 2.1, fitting the Tucker3 model with overlapping clusters comes down to minimizing (5) over A, B, C, and G, subject to the constraint that the elements of A, B, and C are binary. To find solutions for this minimization problem, it is proposed here to use an alternating least squares algorithm which, starting from initial values for A, B, C, and G, finds updates for A keeping the other matrices fixed, then for B keeping the other matrices fixed, next for C keeping the other matrices fixed, and finally for G keeping the other matrices fixed. After one complete cycle, the function value is evaluated, and if it has decreased considerably, a new cycle is started. This process is repeated until the function value no longer changes. Each update is found such that it decreases the function value or, at least, does not increase it. Because the function value is bounded below by 0, it is thus guaranteed to converge to a stable value.

3.1 Updating procedures

The choice of initial values for A, B, C, and G will be discussed later. Given that such values are available, the first step is to find improved values for A, keeping the other matrices fixed. Hence the problem is to minimize

g(A) = ||X_a − A F||²,   (8)

where F is written for G_a(C′ ⊗ B′). Now the columns of A are updated column after column, keeping the other columns of A fixed. Specifically, to update column j of A, we find the minimum of

g(a_j) = ||X_a − Σ_{l≠j} a_l f_l′ − a_j f_j′||² = ||X_{−j} − a_j f_j′||²,   (9)

where X_{−j} is written for X_a − Σ_{l≠j} a_l f_l′, a_j denotes the jth column of A, and f_j′ denotes the jth row of F. A solution for minimizing (9) is given by Chaturvedi and Carroll [3]. A computationally slightly different procedure (with the same solution) can be derived as follows. Function (9) can be written as the sum of independent functions elaborated as

g(a_ij) = constant − 2 a_ij (X_{−j} f_j)_i + a_ij² f_j′ f_j
        = constant + (f_j′ f_j − 2(X_{−j} f_j)_i) a_ij,   i = 1, ..., I,   (10)

where in the second line it is used that a_ij² = a_ij because each element of A is constrained to be binary. Each of the functions g(a_ij) is now minimized over binary a_ij by taking a_ij = 0 if (f_j′ f_j − 2(X_{−j} f_j)_i) > 0, and a_ij = 1 if (f_j′ f_j − 2(X_{−j} f_j)_i) ≤ 0, hence

a_ij = 0 if 2(X_{−j} f_j)_i < f_j′ f_j,
a_ij = 1 if 2(X_{−j} f_j)_i ≥ f_j′ f_j,   i = 1, ..., I.   (11)
In practice, it may happen that all elements of column j become zero by the above updates of the elements of column j. This would imply that the Tucker3 model would not use the jth A-mode component. Hence all core elements related to this component (in the jth row of G_a), and therefore also the elements in the jth row of F, do not have any contribution to fitting the data; in other words, then the term a_j f_j′ = 0. However, in practice, this will almost never be the optimal solution for a_j f_j′, since it would imply that no contribution is better than any conceivable contribution. Furthermore, zero columns in A will cause computational problems later on in the algorithm. Therefore, whenever a_j = 0, a special fixing procedure seems in order. Here we use the following. If a_j = 0, first the jth row of G_a, and hence also the jth row of F, is multiplied by −1. This does not affect the fit, because when a_j f_j′ = 0, then also a_j(−f_j′) = 0. Next, a_j is updated again according to (11), and this is used as the update for a_j. If it so happens that the updated a_j again is a vector with zeros only, then a_j is set back to its original values before updating column j, and likewise the core is set back to its original values.
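A sketch of this columnwise update is given below (our illustration of rule (11), not the author's program); Xa denotes the A-mode matricized data and F = G_a(C′ ⊗ B′). The zero-column fixing procedure described above would be applied whenever a column comes out entirely zero.

import numpy as np

def update_A_overlapping(Xa, A, F):
    # One pass of the binary columnwise update (11) for the overlapping-clusters model.
    # Xa : (I, JK) A-mode matricized data, A : (I, P) binary, F : (P, JK).
    A = A.copy()
    for j in range(A.shape[1]):
        fj = F[j]                                                  # j-th row of F
        X_minus_j = Xa - np.delete(A, j, axis=1) @ np.delete(F, j, axis=0)
        crit = 2.0 * (X_minus_j @ fj) - fj @ fj                    # 2(X_{-j} f_j)_i - f_j' f_j
        A[:, j] = (crit >= 0).astype(float)                        # a_ij = 1 iff criterion >= 0
    return A

Each column update can only decrease (or leave unchanged) the loss (8), which is what guarantees the monotone convergence mentioned above.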
To update matrix B, a completely analogous procedure is followed. Specifically, noting that (4) has equivalently been written as (6), X_b = B G_b (A′ ⊗ C′) + E_b, it can be seen that, using this version of the model, the process of updating B is the same as that described for A above, after replacing A by B, B by C, C by A, and G_a by G_b in the above description. Likewise, updating matrix C can be carried out by using the procedure for updating A, after replacing A by C, B by A, C by B, and G_a by G_c in the above description.
Finally, updating the core array can be carried out as follows. The problem now is to minimize

||X_a − A G_a (C′ ⊗ B′)||²   (12)

over G, which in A-mode matricized form is written as G_a. Because there is no constraint on G_a, the solution to this problem is given by

G_a = (A′A)^{−1} A′ X_a (C ⊗ B)(C′C ⊗ B′B)^{−1}
    = (A′A)^{−1} A′ X_a (C(C′C)^{−1} ⊗ B(B′B)^{−1}),   (13)

see [12], see also [15]. Note that, if the inverses do not exist (as may come about when any of the component matrices has incomplete rank), then the inverse is replaced by a generalized inverse.
The above described steps for updating A, B, C, and G are followed by the computation of the loss function value. If this has decreased, then a new cycle of updates is started; if it has remained the same, then the ensuing solution is considered a candidate for the minimum of the loss function. Depending on how the procedure is started, this may be a local minimum of the function rather than the global minimum. It is therefore recommended to run the algorithm from several starts. One approach is to start from (very) many random starts, hoping thus to cover a wide range of (at least) locally optimal solutions for which the chance that it contains the global minimum is high. Alternatively, or in addition, one may use a few starts that can be expected to have a high chance of leading to the global minimum. A suggestion for such 'rational' starts is given in the next subsection.

3.2 Rational starts

Because the algorithm described above very easily leads to local optima, it is important to run the algorithm from various different starts, among which preferably are starts that have a high chance of leading to the global optimum. Experience so far has indicated that a useful starting configuration can be obtained as follows. First, analyze the data by ordinary Tucker3 analysis, leading to columnwise orthonormal component matrices. Next rotate all three component matrices by means of varimax, and multiply all columns that have a negative sum of elements by −1. Then one starting configuration is obtained by setting all values that are higher than their column average to 1 and all others to 0. An alternative is to set, for each matrix, all values above a particular threshold to 1, and all others to 0. The threshold should depend on the number of elements in the component matrix at hand, and it can be varied systematically to yield different starts. By systematically varying the threshold value for A between I^{−1/2} and 0 (not including 0), different starts can be obtained, which in practice seem to lead to at least reasonably good solutions; likewise, for B the threshold is to be chosen between J^{−1/2} and 0, and for C the threshold is to be chosen between K^{−1/2} and 0. More experience is needed, however, to evaluate the usefulness of these starts.
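A minimal sketch of this thresholding step (ours) is given below; it assumes a varimax-rotated, columnwise orthonormal component matrix A_rot is already available, e.g. from an ordinary Tucker3 run, and only illustrates the sign-flipping and thresholding described above.

import numpy as np

def binary_start(A_rot, threshold=None):
    # Turn a rotated component matrix into a binary starting configuration:
    # flip columns with a negative sum, then set entries above the column mean
    # (or above `threshold`, if given) to 1 and the rest to 0.
    A = A_rot * np.where(A_rot.sum(axis=0) < 0, -1.0, 1.0)
    cut = A.mean(axis=0) if threshold is None else threshold
    return (A > cut).astype(float)

# Several starts for mode A could be obtained by varying the threshold in (0, I^{-1/2}]:
#   starts = [binary_start(A_rot, th) for th in np.linspace(1e-6, A_rot.shape[0] ** -0.5, 5)]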

4 Algorithm for Tucker3 with nonoverlapping clusters

Fitting the Tucker3 model with nonoverlapping clusters comes down to minimizing (5) over A, B, C, and G, but now subject to the constraint that the elements of A, B, and C are binary, and that each row of these matrices has one and only one unit element. To find solutions for this minimization problem, it is proposed to use an alternating least squares algorithm similar in setup to that for the overlapping clusters situation. The updates for the component matrices A, B, and C are, obviously, different, while the update for the core is the same, but its computation can now be simplified somewhat. This is because the inverses in the updating formula (13) are now very easy to compute, because, due to the constraints on the component matrices, A′A, B′B, and C′C now are diagonal matrices with on the diagonal simply the number of unit elements in the corresponding columns of the component matrices.
Below only the updating procedure for A is described. Those for B and C are obtained analogously, after letting the component matrices switch roles (compare Section 3.1), and the update for the core does not need further description.

4.1 Updating procedure for A

To update A subject to the constraints at hand, we now minimize

g(A) = ||X_a − A F||²,   (14)

over A, where F is again written for G_a(C′ ⊗ B′). This function can be written as the sum of the independent functions

g(a_i′) = ||x_i′ − a_i′ F||² = ||x_i′ − Σ_l a_il f_l′||²,   (15)

where x_i′ and a_i′ denote the ith rows of X_a and A, respectively, subject to the constraint that one of the elements of a_i′ is 1 and all others are 0. Thus, due to the constraint, in Σ_l a_il f_l′ all but one term are 0, while the nonzero term (the jth) equals f_j′. Hence, the problem is simply to find the value j for which ||x_i′ − f_j′||² is minimal, and set the associated value a_ij equal to 1, and all other elements of a_i′ equal to 0. In formulas, the updates for the elements of a_i′ are given by

j = arg min_l ||x_i′ − f_l′||²,
a_ij = 1,   a_il = 0 for l ≠ j.   (16)

If a column of A turns out to have zero elements only, a slightly modified version of the fixing procedure described for the overlapping clusters case can be used. That is, in this case all rows of F corresponding to zero columns in A are multiplied by −1, and the whole matrix A is updated again. If this again results in one or more zero columns in A, then A is set back to its original values.
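The rowwise update (16) is essentially a k-means-type assignment step; a minimal sketch (ours, not the author's program) is as follows, with Xa the A-mode matricized data and F = G_a(C′ ⊗ B′).

import numpy as np

def update_A_nonoverlapping(Xa, F):
    # Assign each row x_i' of Xa to the nearest row f_j' of F, as in update (16).
    d2 = ((Xa[:, None, :] - F[None, :, :]) ** 2).sum(axis=2)   # (I, P) squared distances
    assign = d2.argmin(axis=1)
    A = np.zeros((Xa.shape[0], F.shape[0]))
    A[np.arange(Xa.shape[0]), assign] = 1.0                    # exactly one 1 per row
    return A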
The problem of fitting the Tucker3 model with nonoverlapping clusters has recently been addressed also by Rocci and Vichi [13], but at the time of writing, their algorithm had not yet been published. Even more recently, Schepers and van Mechelen [14] have proposed an algorithm for fitting this model, which also has not been published yet. It is planned to compare these algorithms in the near future.

4.2 Rational starts

As possibly useful rational starts for the nonoverlapping clusters algorithm, again the results from Tucker3 analysis applied to the data, followed by varimax rotation of the component matrices, can be used. This time, after multiplying columns having negative sums by −1, starts are obtained simply by setting all rowwise highest elements to 1, and all other elements to 0. Other rational starts are used in the algorithms by Schepers and van Mechelen [14] and by Rocci and Vichi [13]. Their relative advantages are still to be studied.

5 Should we fully constrain components to be binary?

In the present paper, procedures have been described for constraining components to be binary. However, it is known that fitting models under binary constraints is very difficult, in the sense that it is very hard to find the globally optimal solution. Moreover, the constraints of binarity may for some situations be too strong. In some situations, it may be necessary to allow for nonzero component weights with clearly different values within columns. For such purposes, special algorithms are needed which, to the author's knowledge, are not yet available.
An alternative route to avoid the very strong constraint of binarity could be to require component matrices to be close to binarity rather than exactly binary. This can be achieved by imposing the binarity constraint as a soft constraint, in such a way that it penalizes (rather than prohibits) nonbinarity. In other words, soft constraints can be imposed by minimizing the ordinary Tucker3 loss function to which penalty terms are added whose values increase with increasing deviations from binarity. One procedure for attaining this is to minimize the function

f(A, B, C, U, V, W, G) = ||X_a − A G_a (C′ ⊗ B′)||² + λ||U − A||² + μ||V − B||² + ν||W − C||²   (17)

over arbitrary A, B, C, and G, and over binary auxiliary matrices U, V, and W; λ, μ and ν are penalty parameters that regulate the strength of the constraint, and that have to be specified in advance. Without further constraints, one will find degenerate solutions in which the component matrices tend to 0 (thus annihilating the penalty terms), while the core elements tend to infinity in such a way that the product A G_a (C′ ⊗ B′) still fits the data well. One way to avoid such degeneracies, which in practice turned out to work reasonably well, is to constrain the auxiliary binary matrices to have at least one nonzero element in each column.
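For clarity, evaluating the penalized loss (17) for given parameter values is straightforward; the following small sketch (ours, purely illustrative, not the fitting algorithm itself) shows the computation for A-mode matricized data Xa and core Ga.

import numpy as np

def penalized_loss(Xa, A, B, C, Ga, U, V, W, lam, mu, nu):
    # Penalized Tucker3 loss (17): fit term plus penalties pulling A, B, C
    # towards the binary auxiliary matrices U, V, W.
    fit = np.linalg.norm(Xa - A @ Ga @ np.kron(C, B).T) ** 2
    return (fit + lam * np.linalg.norm(U - A) ** 2
                + mu * np.linalg.norm(V - B) ** 2
                + nu * np.linalg.norm(W - C) ** 2)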

An alternating least squares algorithm for minimizing (17) has been devised and programmed. The algorithm tends to require many iterations, but does indeed give solutions with the required properties. For instance, for data constructed on the basis of component matrices that were binary up to a few elements, the method indeed singled out these elements as different from the others. However, much more experience is needed to assess its usefulness in actual practice.

6 Conclusion
The present paper has offered methods for Tucker3 analysis with the component matrices constrained to be binary, and, in a special case, also such that the components have no overlap. The algorithms proposed work in the sense that they decrease the loss function value, but they appear, as usual with binary optimization problems, to be prone to hitting local optima. Some starting procedures have been proposed that worked well in some contrived examples, but the algorithms, as well as their starting procedures, need further testing, as well as comparison to competitors that have been proposed recently for the nonoverlapping case.
In addition to the methods where components are constrained to be fully binary, a procedure has been proposed for weakly imposing binarity, by using penalty terms. Again, this procedure needs further testing. If it turns out to work well in practice, and if it is not very prone to hitting local optima, it could also be used for fitting the fully constrained model by gradually increasing the penalty parameters that regulate the strength of the constraints. Whether this or other procedures work best in dealing with the local optimum problem of Tucker3 with binary constraints is subject to further research.

References
[1] Carroll J.D., Chang J.-J. (1970). Analysis of individual differences in multidimensional scaling via an N-way generalization of "Eckart-Young" decomposition. Psychometrika 35, 283-319.
[2] Ceulemans E., Van Mechelen I., Leenen I. (2003). Tucker3 hierarchical classes analysis. Psychometrika 68, 413-433.
[3] Chaturvedi A., Carroll J.D. (1994). An alternating combinatorial optimization approach to fitting INDCLUS and Generalized INDCLUS models. Journal of Classification 11, 155-170.
[4] DeSarbo W.S. (1982). GENNCLUS: New models for general nonhierarchical clustering analysis. Psychometrika 47, 449-475.
[5] Gaul W., Schader M. (1996). A new algorithm for two-mode clustering. In: Bock H.-H., Polasek W. (eds.) Data analysis and information systems. Springer, Heidelberg.

[6] Harshman R.A. (1970). Foundations of the PARAFAC procedure: models and conditions for an "explanatory" multi-mode factor analysis. UCLA Working Papers in Phonetics 16, 1-84.
[7] Kaiser H.F. (1958). The varimax criterion for analytic rotation in factor analysis. Psychometrika 23, 187-200.
[8] Kiers H.A.L. (1998). Joint orthomax rotation of the core and component matrices resulting from three-mode principal components analysis. Journal of Classification 15, 245-263.
[9] Kiers H.A.L. (2000). Towards a standardized notation and terminology in multiway analysis. Journal of Chemometrics 14, 105-122.
[10] Kroonenberg P.M., De Leeuw J. (1980). Principal component analysis of three-mode data by means of alternating least squares algorithms. Psychometrika 45, 69-97.
[11] Leenen I., Van Mechelen I., De Boeck P., Rosenberg S. (1999). INDCLAS: A three-way hierarchical classes model. Psychometrika 64, 9-24.
[12] Penrose R. (1956). On best approximate solutions of linear matrix equations. Proceedings of the Cambridge Philosophical Society 52, 17-19.
[13] Rocci R., Vichi M. (2003). Three-mode clustering of a three-way data set. CLADAG 2003, University of Bologna, Bologna.
[14] Schepers J., Van Mechelen I. (2004). Three-mode partitioning: Method and application. Paper presented at the meeting of the GfKl, Dortmund, March 9-11.
[15] Ten Berge J.M.F. (1993). Least squares optimization in multivariate analysis. DSWO Press, Leiden.
[16] Tucker L.R. (1966). Some mathematical notes on three-mode factor analysis. Psychometrika 31, 279-311.
[17] Vichi M. (2001). Double k-means clustering for simultaneous classification of objects and variables. In: Borra S., Rocci R., Schader M. (eds.) Advances in classification and data analysis. Springer, Heidelberg.

Acknowledgement: The author is obliged to Roberto Rocci, Jan Schepers, Marieke Timmerman, Iven van Mechelen, and Maurizio Vichi.
Address: Henk A.L. Kiers, Heymans Institute, University of Groningen, Grote Kruisstraat 2/1, 9712 TS Groningen, The Netherlands
E-mail: h.a.l.kiers@ppsw.rug.nl
COMPSTAT '2004 Symposium © Physica-Verlag/Springer 2004

FUNCTIONAL DATA ANALYSIS AND MIXED EFFECT MODELS
Alois Kneip, Robin C. Sickles and Wonho Song
Key words: Mixed effects model, functional principal component analysis, nonparametric regression.
COMPSTAT 2004 section: Functional data analysis.

Abstract: Panel studies in econometrics as well as longitudinal studies in biomedical applications provide data from a sample of individual units where each unit is observed repeatedly over time (age, etc.). In this context, mixed effect models are often applied to analyze the behavior of a response variable in dependence on a number of covariates. In some important applications it is necessary to assume that individual effects vary over time (age, etc.).
In the paper it is shown that in many situations a sensible analysis may be based on a semiparametric approach relying on tools from functional data analysis. The basic idea is that time-varying individual effects may be represented as a sample of smooth functions which can be characterized by its Karhunen-Loève decomposition. An important application is the estimation of time-varying technical inefficiencies of individual firms in stochastic frontier analysis.

1 Introduction
Panel studies in econometrics as well as longitudinal studies in biomedical applications provide data from a sample of individual units where each unit is observed repeatedly over time (age, etc.). Statistical analysis then usually aims to model the variation of some response variable Y. In addition to its dependence on some vector of explanatory variables X, the variability of Y between different individual units is of primary interest.
For simplicity, we will assume a balanced design with T equally spaced repeated measurements per individual. The resulting observations of n individuals can then be represented in the form (Y_it, X_it), where t = 1, ..., T and i = 1, ..., n. The simplest form of analysis is based on mixed effect models of the form

Y_it = β_0 + Σ_{j=1}^p β_j X_itj + U_i + ε_it,   (1)

where ε_it are i.i.d. error terms, while U_i represents individual random effects.
An important example in econometrics are stochastic frontier models. Then Y_it represents production output of an individual firm i in time period t, while X_it is a corresponding vector of production inputs. The U_i are then

interpreted as technical inefficiencies. Firm i is more efficient than firm j if U_i > U_j.
However, in many applications it is too simple to assume constant individual effects U_i. A straightforward generalization is to suppose that U_i ≡ U_i(t) is a function of t:

Y_it = β_0 + Σ_{j=1}^p β_j X_itj + U_i(t) + ε_it.   (2)

In the following we will assume that the U_i(t) can be considered as smooth random functions. In many biometrical applications, where for example t indicates the age of an individual unit, smoothness can be considered as a standard assumption. In econometrics, where t usually indicates time, for a given unit i the corresponding data {Y_it, X_it}, t = 1, ..., T, represent an individual time series. In this situation model (2) assumes that the residual time series {Y_it − β_0 − Σ_{j=1}^p β_j X_itj}, i = 1, ..., n, can be decomposed into a smooth stochastic trend U_i and i.i.d. white noise.
Traditional analysis relies on parametric models. Very often polynomial approximations to the functions U_i are used. More generally, for some prespecified basis functions b_1, ..., b_L the U_i are modelled by U_i(t) = Σ_r ϑ_ir b_r(t), where ϑ_i1, ..., ϑ_iL are individual random coefficients. Analysis is then based on the well-known methodology of mixed effect models. If additionally normality is assumed and if X and ε are uncorrelated, likelihood estimation based on the EM algorithm is often applied. In stochastic frontier analysis such an approach has been used by Battese and Coelli [1] or Cornwell, Schmidt, and Sickles [2] in order to model time-dependent individual inefficiencies.
In this paper we consider a nonparametric approach based on ideas from functional data analysis as proposed by Kneip, Sickles and Song [6]. The functions U_i can be decomposed into U_i = w + v_i, where w(t) is a general mean function and v_i(t) = U_i(t) − w(t). Model (2) can then be rewritten in the form

Y_it = Σ_{j=1}^p β_j X_itj + w(t) + v_i(t) + ε_it.   (3)

Note that the constant β_0 is incorporated into w(t), and that the mean of v_i(t) is zero.
For a given L, functional principal component analysis is then used to estimate a best possible basis g_1, ..., g_L for approximating v_i by v_i(t) ≈ Σ_{r=1}^L θ_ir g_r(t). The approach possesses a number of advantages:
• The basis g_1, ..., g_L to be estimated corresponds to the best possible basis for approximating the v_i by an L-dimensional linear function space. Any approximation v_i(t) ≈ Σ_{r=1}^L ϑ_ir b_r(t) based on prespecified basis functions b_1, ..., b_L (e.g. polynomials or splines) possesses a higher systematic error.

• All n · T observations are used to estimate g_1, ..., g_L. Compared to a completely nonparametric analysis based on simply estimating all v_i by nonparametric regression, these functions can be estimated with a much higher degree of accuracy.
Functional principal components are widely used in functional data analysis (see for example [7]). It must be emphasized, however, that the present situation is different from the usual setup in this domain, since the functions v_i of interest are not directly observed. This constitutes a major complication.
The paper is organized as follows. Section 2 presents the theoretical basis of our approach, relying on the Karhunen-Loève decomposition. An algorithm for determining g_r and coefficients β̂_j, θ̂_ir as proposed by Kneip, Sickles and Song [6] is described in Section 3. Section 3.2 presents a new procedure which may be considered as a promising alternative. Section 4 is devoted to the problem of choosing an optimal dimension L.

2 Functional principal components

Let v_1, ..., v_n generally be i.i.d. smooth random functions on L²[0,1] and suppose that E(v_i) = 0. Furthermore, let ||f|| = (∫ f(t)² dt)^{1/2} denote the usual L²-norm for f ∈ L²[0,1], and set ⟨f*, f⟩ = ∫ f*(t) f(t) dt. The covariance operator then is a generalization of the concept of a covariance matrix in the multivariate analysis of random vectors. The so-called covariance kernel is defined as

σ(s, t) = E(v_i(s) v_i(t)),

and the corresponding covariance operator Γ is defined by the relation

Γv = E(⟨v_i, v⟩ v_i) = ∫ σ(s, t) v(s) ds

for any function v ∈ L²[0,1]. Γ is a Hilbert-Schmidt operator and possesses finite eigenvalues l_1 ≥ l_2 ≥ ... ≥ 0 as well as corresponding orthonormal eigenfunctions γ_1, γ_2, ... such that ||γ_r|| = 1 and ⟨γ_r, γ_s⟩ = 0 for r ≠ s. A precise mathematical discussion of properties of Γ can, for example, be found in Gihman and Skorohod [4].
The well-known Karhunen-Loève decomposition states that the functions v_i can be decomposed in terms of the eigenfunctions:

v_i = Σ_r θ_ir γ_r,   (4)

where θ_ir = ⟨v_i, γ_r⟩. This decomposition possesses the following properties (see for example [4]):
a) E(θ_ir) = 0, r = 1, 2, ..., and Var(θ_i1) = l_1 ≥ Var(θ_i2) = l_2 ≥ Var(θ_i3) = l_3 ≥ ...

b) θ_ir is uncorrelated with θ_is if r ≠ s.
c) For each L = 1, 2, ...

E(||v_i − Σ_{r=1}^L θ_ir γ_r||²) ≤ E(min_{α_1,...,α_L} ||v_i − Σ_{r=1}^L α_r b_r||²)   (5)

for any possible choice of basis functions b_1, ..., b_L ∈ L²[0,1].

Uncorrelatedness of the random coefficients θ_ir for different r simplifies further analysis, which may, for example, rely on the EM algorithm. Note that this is a specific property of the Karhunen-Loève basis. For any prespecified basis b_1, ..., b_L one will have to take into account that the resulting coefficients are usually correlated.
Property c) may be seen as the most important feature of (4). For any possible dimension L the decomposition provides the best possible basis γ_1, ..., γ_L for approximating the random functions v_i by a linear combination of L functions. Indeed, it is well known that in many situations a relatively small number L of components is sufficient to model the underlying functions such that a model of the form

v_i(t) = Σ_{r=1}^L θ_ir γ_r(t)   (6)

holds in good approximation.


Of course, the major problem of (6) consists in the fact that the functions γ_r as well as an appropriate dimension L are unknown. In functional data analysis it is usually assumed that n functional realizations can be observed, or at least can be approximated with a negligible error. Estimates γ̂_r can then be determined from the empirical covariance operator Γ_n v = (1/n) Σ_{i=1}^n ⟨v_i, v⟩ v_i. Some asymptotic theory is given in Dauxois, Pousse and Romain [3]. Under some additional conditions it is shown that rates of convergence of estimated eigenvalues and empirical eigenfunctions γ_r,n are of order n^{−1/2}.
The present situation is different, since one has to deal with n · T noisy observations. The major point of interest is modelling the functions v_i(t) at the design points t = 1, ..., T. We may formalize smoothness of v_1, ..., v_n by requiring that there are i.i.d. smooth random functions V_1, ..., V_n ∈ L²[0,1] with v_i(t) = V_i(t/T).
Discretizing (6) then leads to the model

v_i(t) = Σ_{r=1}^L θ_ir g_r(t),   t = 1, ..., T,   i = 1, ..., n.   (7)

Empirical versions of properties a) and b) as well as of orthonormality of γ_1, γ_2, ... are then obtained by requiring
(α) Σ_i θ²_i1 ≥ Σ_i θ²_i2 ≥ ...,
(β) Σ_i θ_ir θ_is = 0 for r ≠ s,
(γ) (1/T) Σ_{t=1}^T g_r(t)² = 1 and Σ_{t=1}^T g_r(t) g_s(t) = 0 for all r, s ∈ {1, ..., L} with r ≠ s.
Moreover, a discretized version of (5) is given by

(1/n) Σ_{i=1}^n Σ_{t=1}^T (v_i(t) − Σ_{r=1}^L θ_ir g_r(t))² ≤ (1/n) Σ_{i=1}^n min_{α_i1,...,α_iL} Σ_{t=1}^T (v_i(t) − Σ_{r=1}^L α_ir b_r(t))²   (8)

for any possible choice of b_r(t), t = 1, ..., T, r = 1, ..., L.
Note that conditions (α)-(γ) do not impose any restriction; they only introduce a suitable normalization which ensures identifiability of the components up to sign changes (instead of θ_ir, g_r one may also use −θ_ir, −g_r). If (6) holds for some suitable L, then there exist some g_r such that (7) as well as (α)-(γ) and (8) are satisfied.
Obviously the components g_r depend on the realized v_i and on the sample size n. Due to the different normalization, usually g_r(t) ≠ γ_r,n(t/T). This does not constitute a serious drawback for an empirical analysis based on (3) and (7). In fact, in model (7) only the L-dimensional linear space spanned by g_1, ..., g_L is identifiable. There are infinitely many possible choices of basis functions, and by using conditions (α)-(γ) we select a particularly well-interpretable basis. Asymptotically, as n, T → ∞, g_r(t) as well as γ_r,n(t/T) will both converge to γ_r(t/T) in probability. Under (6), the linear subspaces of ℝ^T spanned by the vectors {(g_r(1), ..., g_r(T))′}_{r=1,...,L}, {(γ_r,n(1/T), ..., γ_r,n(1))′}_{r=1,...,L} and {(γ_r(1/T), ..., γ_r(1))′}_{r=1,...,L} will coincide with high probability for large samples.
How can the functional components g_r in (7) be determined? There are essentially two straightforward procedures which could immediately be applied if the realized functions v_i were known. These algebraic methods will serve as a basis for the practical, data-based methods to be presented in Section 3.

Method 1: Some simple algebra shows that, if the v_i were known, the components g_r could be determined from the eigenvectors of the empirical covariance matrix Σ_n of v_1 = (v_1(1), ..., v_1(T))′, ..., v_n = (v_n(1), ..., v_n(T))′:

Σ_n = (1/n) Σ_{i=1}^n v_i v_i′.   (9)

Let λ_1 ≥ λ_2 ≥ ... ≥ λ_T as well as γ_1, γ_2, ..., γ_T denote the resulting eigenvalues and orthonormal eigenvectors of Σ_n. Then

λ_r = (T/n) Σ_i θ²_ir   for all r = 1, 2, ..., L,   (10)

g_r(t) = √T · γ_rt   for all r = 1, ..., L,  t = 1, ..., T.   (11)

Also note that Σ_{j=L+1}^T λ_j = (1/n) Σ_{t=1}^T Σ_{i=1}^n (v_i(t) − Σ_{r=1}^L θ_ir g_r(t))². If (7) holds, then obviously Σ_{j=L+1}^T λ_j = 0.

Method 2: A second possibility is to consider the n × n matrix M_n defined by

(M_n)_ij = (1/T) Σ_{t=1}^T v_i(t) v_j(t),   i, j = 1, ..., n.   (12)

By using some further algebra, see for example [5], one can then deduce that all nonzero eigenvalues λ_r and h_r of the empirical covariance Σ_n and of the matrix M_n are related by h_r = Σ_i θ²_ir = (n/T) λ_r. Moreover, the eigenvectors p_1 = (p_11, ..., p_n1)′, p_2 = (p_12, ..., p_n2)′, ... of M_n corresponding to nonzero eigenvalues h_1 ≥ h_2 ≥ ... are closely related to the parameters θ_ir, since

θ_ir = h_r^{1/2} p_ir.   (13)

Finally, g_r can be computed from λ_r and p_ir:

g_r(t) = h_r^{−1/2} Σ_{i=1}^n p_ir v_i(t) = ((n/T) λ_r)^{−1/2} Σ_{i=1}^n p_ir v_i(t).   (14)
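A minimal numerical sketch of the two routes (ours, not part of the paper) is given below; it assumes the v_i(t) are known and collected in an n × T matrix V, whereas in practice they are replaced by the estimates v̂_i of Section 3. The random rank-3 example data only illustrate that both methods recover the same basis up to sign.

import numpy as np

def components_method1(V, L):
    # Method 1: eigendecomposition of Sigma_n = (1/n) sum_i v_i v_i' (T x T).
    # Returns the leading eigenvalues lambda_r and g_r(t) = sqrt(T) * gamma_rt.
    n, T = V.shape
    lam, Gam = np.linalg.eigh(V.T @ V / n)
    idx = np.argsort(lam)[::-1][:L]
    return lam[idx], np.sqrt(T) * Gam[:, idx].T            # g has shape (L, T)

def components_method2(V, L):
    # Method 2: eigendecomposition of M_n (n x n); theta_ir = h_r^{1/2} p_ir (13),
    # and g_r(t) = h_r^{-1/2} sum_i p_ir v_i(t) (14).
    n, T = V.shape
    h, P = np.linalg.eigh(V @ V.T / T)
    idx = np.argsort(h)[::-1][:L]
    h, P = h[idx], P[:, idx]
    theta = P * np.sqrt(h)
    g = (P.T @ V) / np.sqrt(h)[:, None]
    return h, theta, g

rng = np.random.default_rng(0)
V = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 20))   # n = 50, T = 20, rank 3
lam, g1 = components_method1(V, 3)
h, theta, g2 = components_method2(V, 3)
print(np.allclose(np.abs(g1), np.abs(g2)))                        # same basis up to sign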

3 Algorithms
When combining (3) and (7) one obtains

Y_it = Σ_{j=1}^p β_j X_itj + w(t) + Σ_{r=1}^L θ_ir g_r(t) + ε_it.   (15)

The optimal basis functions g_r satisfying (7)-(11), as well as w, β_j and θ_ir, are unknown.
Based on the mathematical framework of Section 2, different algorithms can be applied in order to estimate the components w and g_r of (15). In this section we will rely on a prespecified dimension L. The important question of determining an appropriate L will be considered in Section 4.

3.1 An algorithm based on estimating the covariance matrix Σ_n

In the following we will discuss a straightforward method which can be seen as a simple version of a somewhat more general algorithm proposed by Kneip, Sickles and Song [6].
The idea is easily described: in a first step, partial spline methods as introduced by Speckman [8] are used to determine estimates β̂_j and v̂_i. The mean function w is estimated nonparametrically, and then estimates ĝ_r are determined from the empirical covariance matrix Σ̂_n of v̂_1, ..., v̂_n.

Let us first introduce some additional notation. Let Ȳ_t = (1/n) Σ_i Y_it,
Ȳ = (Ȳ_1, ..., Ȳ_T)', Y_i = (Y_i1, ..., Y_iT)' and ε_i = (ε_i1, ..., ε_iT)'. Furthermore,
let X_ij = (X_i1j, ..., X_iTj)', X̄_tj = (1/n) Σ_i X_itj, and X̄_j = (X̄_1j, ..., X̄_Tj)'. We
will use X_i and X̄ to denote the T × p matrices with elements X_itj and X̄_tj.
The algorithm now can be described as follows:

Step 1: Determine estimates β̂_1, ..., β̂_p and v̂_i(t) by minimizing

Σ_i Σ_t (Y_it − Ȳ_t − Σ_{j=1}^p β_j (X_itj − X̄_tj) − v_i(t))² + κ Σ_i ∫_1^T (v_i''(s))² ds,    (16)

where κ > 0 is a preselected smoothing parameter and v_i'' denotes the second derivative of v_i.
Spline theory implies that any solution v̂_i, i = 1, ..., n of (16) possesses an expansion v̂_i(t) = Σ_j ζ_ji z_j(t) in terms of a natural spline basis z_1, ..., z_T. If Z and A denote the T × T matrices with elements z_j(t) and ∫_1^T z_j''(s) z_{j'}''(s) ds, the above minimization problem can be reformulated in matrix notation: determine β̂ = (β̂_1, ..., β̂_p)' and ζ̂_i = (ζ̂_1i, ..., ζ̂_Ti)' by minimizing

Σ_i { ||Y_i − Ȳ − (X_i − X̄)β − Zζ_i||² + κ ζ_i' A ζ_i },    (17)

where ||·|| denotes the usual Euclidean norm in R^T, ||a||² = a'a.

It is easily seen that with

Z_κ = Z (Z'Z + κA)^{−1} Z'

the solutions are given by

β̂ = ( Σ_i (X_i − X̄)'(I − Z_κ)(X_i − X̄) )^{−1} Σ_i (X_i − X̄)'(I − Z_κ)(Y_i − Ȳ)    (18)

as well as

v̂_i = Z_κ (Y_i − Ȳ − (X_i − X̄)β̂).    (19)

Therefore, (19) provides the estimates v̂_i = (v̂_i(1), ..., v̂_i(T))'.
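A schematic implementation of Step 1 may help to fix ideas. The sketch below uses the identity matrix in place of the natural spline basis Z and a discrete second-difference penalty as a stand-in for the roughness penalty matrix A; this simplification, as well as the function name, is ours and does not reproduce the exact spline construction of [8].

import numpy as np

def step1_partial_spline(Y, X, kappa):
    # Y: n x T responses; X: n x T x p regressors; kappa: smoothing parameter
    n, T, p = X.shape
    D = np.diff(np.eye(T), n=2, axis=0)        # second-difference operator
    A = D.T @ D                                # discrete roughness penalty (stand-in for the integral of (v'')^2)
    Z = np.eye(T)                              # stand-in for the natural spline basis evaluated at t = 1, ..., T
    Z_k = Z @ np.linalg.solve(Z.T @ Z + kappa * A, Z.T)   # smoother matrix Z_kappa
    Yc = Y - Y.mean(axis=0)                    # Y_i - Ybar
    Xc = X - X.mean(axis=0)                    # X_i - Xbar
    I = np.eye(T)
    lhs = sum(Xc[i].T @ (I - Z_k) @ Xc[i] for i in range(n))
    rhs = sum(Xc[i].T @ (I - Z_k) @ Yc[i] for i in range(n))
    beta = np.linalg.solve(lhs, rhs)           # beta_hat as in (18)
    V_hat = np.vstack([Z_k @ (Yc[i] - Xc[i] @ beta) for i in range(n)])  # v_hat_i as in (19)
    return beta, V_hat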

Remarks:

• An obvious problem is the choice of κ. A straightforward approach is to use (generalized) cross-validation procedures in order to estimate an optimal smoothing parameter κ_opt. Note, however, that the goal is not to obtain optimal estimates of the v_i(t) but to approximate the functions g_r in (15). Estimating g_r in the subsequent steps of the algorithm involves a specific way of averaging over individual data which substantially reduces variability. In order to reduce bias, a small degree of undersmoothing, i.e. choosing κ < κ_opt, will usually be advantageous.

• Our setup is based on assuming a balanced design. However, in practice one will often have to deal with the situation that there are missing observations for some individuals. In principle, the above estimation procedure can easily be adapted to this case. If for an individual k observations are missing, then only the remaining T − k are used for minimizing (16). Estimates of v̂_i(t) at all t = 1, ..., T are then obtained by spline interpolation.

Step 2: An estimate ŵ of the mean function w is calculated by minimizing

Σ_t ( Ȳ_t − Σ_{j=1}^p β̂_j X̄_tj − w(t) )² + κ ∫ (w''(s))² ds.

Step 3: Determine the empirical covariance matrix Σ̂_n of
v̂_1 = (v̂_1(1), v̂_1(2), ..., v̂_1(T))', ..., v̂_n = (v̂_n(1), v̂_n(2), ..., v̂_n(T))' by

Σ̂_n = (1/n) Σ_i v̂_i v̂_i',

and calculate its eigenvalues λ̂_1 ≥ λ̂_2 ≥ ... ≥ λ̂_T and the corresponding eigenvectors γ̂_1, γ̂_2, ..., γ̂_T.

Step 4: Set ĝ_r(t) = √T · γ̂_rt, r = 1, 2, ..., L, t = 1, ..., T, and for all i = 1, ..., n determine θ̂_1i, ..., θ̂_Li by minimizing

Σ_t ( Y_it − Ȳ_t − ((X_i − X̄)β̂)_t − Σ_{r=1}^L θ_ri ĝ_r(t) )²

with respect to θ_1i, ..., θ_Li.

Based on this algorithm the unknown model components w and g_r in (15) can be replaced by ŵ and ĝ_r. Further analysis may then be based on the "estimated" model

Y_it = Σ_{j=1}^p β_j X_itj + ŵ(t) + Σ_{r=1}^L θ_ir ĝ_r(t) + ε_it.    (20)

The algorithm automatically also yields estimates β̂_j and θ̂_ir. However, the variability of these estimates may be reduced by re-estimating the coefficients by relying on (20):

Step 5: Re-estimate the coefficients β_j and θ_ir by fitting the estimated model
Y_it = Σ_{j=1}^p β_j X_itj + ŵ(t) + Σ_{r=1}^L θ_ir ĝ_r(t) + ε_it to the data.

Kneip, Sickles and Song [6] also study the asymptotic behavior of the resulting estimators as n, T → ∞.
Let κ_T = Tκ. If the underlying function v_i, as discussed in Section 2, is twice continuously differentiable, then the squared bias in estimating v_i is of order κ_T, while the variance is of order 1/(κ_T^{1/4} T). Choosing κ_T to be of order T^{−4/5} then leads to the optimal individual rates of convergence (1/T) Σ_t (v̂_i(t) − v_i(t))² = O_p(T^{−4/5}).
Under some technical assumptions (mainly concerning smoothness as well as the correlation between X_it and v_i(t)), a theorem by Kneip, Sickles and Song [6] then implies that for all r = 1, ..., L

(1/T) Σ_{t=1}^T (ĝ_r(t) − g_r(t))² = O_p( κ_T² + 1/(κ_T T²) + 1/(κ_T^{1/4} nT) ).    (21)

Further results concern rates of convergence and asymptotic distributions of parameter estimates. As can be seen from (21), the variance of ĝ_r also decreases with the number n of individual units. By undersmoothing, i.e. choosing κ_T = o(T^{−4/5}), the components g_r can be estimated with better rates of convergence than those obtainable for the individual functions v_i.
In Kneip, Sickles and Song [6] the finite sample performance of the estimators is additionally examined via Monte Carlo simulations. The method is then applied to the analysis of technical efficiency of the U.S. banking industry.

3.2 An algorithm based on estimating the matrix M_n

Model (3) obviously implies that

v_i(t) = Y_it − Σ_{j=1}^p β_j X_itj − w(t) − ε_it.

Hence, if the parameters β_j were known, the matrix

(M_n)_{i,j} = (1/T) Σ_{t=1}^T ( Y_it − Ȳ_t − Σ_{l=1}^p β_l (X_itl − X̄_tl) )( Y_jt − Ȳ_t − Σ_{l=1}^p β_l (X_jtl − X̄_tl) ),   i, j = 1, ..., n,    (22)

provides an estimate of M which, by Method 2 discussed in Section 2, can be used to calculate estimates of g_r.

The basic idea of the following algorithm is now easily described: under (15) the "true" matrix M possesses only L nonzero eigenvalues, and therefore Σ_{j>L} λ_j = 0. Based on (22), different matrices M_n(β̃) can be determined in dependence of all possible values β̃_j of β_j. Estimates β̂_j and M̂_n can be obtained by minimizing the sum of the smallest n − L eigenvalues of M_n(β̃) with respect to β̃.
The precise algorithm now can be described as follows:

Step 1*: For all possible values β̃_j of β_j, j = 1, ..., p, compute

(M_n(β̃))_{i,j} = (1/T) Σ_{t=1}^T ( Y_it − Ȳ_t − Σ_{l=1}^p β̃_l (X_itl − X̄_tl) )( Y_jt − Ȳ_t − Σ_{l=1}^p β̃_l (X_jtl − X̄_tl) ),   i, j = 1, ..., n,

and its eigenvalues h(β̃)_1 ≥ h(β̃)_2 ≥ ... ≥ h(β̃)_n. Then determine estimates β̂_1, ..., β̂_p by minimizing

Σ_{r=L+1}^n h(β̃)_r

with respect to β̃.
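Step 1* amounts to a nonlinear optimization over β̃. One possible implementation (not necessarily the one used in [6]) evaluates M_n(β̃) directly and minimizes the sum of its n − L smallest eigenvalues with a general-purpose optimizer, as sketched below.

import numpy as np
from scipy.optimize import minimize

def step1_star(Y, X, L, beta0):
    # Y: n x T; X: n x T x p; L: assumed dimension; beta0: starting value for beta
    n, T, p = X.shape
    Yc = Y - Y.mean(axis=0)
    Xc = X - X.mean(axis=0)

    def M_n(beta):
        R = Yc - Xc @ beta                     # residuals Y_it - Ybar_t - sum_j beta_j (X_itj - Xbar_tj)
        return R @ R.T / T                     # n x n matrix of Step 1*

    def objective(beta):
        h = np.linalg.eigvalsh(M_n(beta))      # ascending eigenvalues
        return h[: n - L].sum()                # sum of the n - L smallest eigenvalues

    res = minimize(objective, beta0, method="Nelder-Mead")
    beta_hat = res.x
    return beta_hat, M_n(beta_hat)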


Step 2*: Set M̂_n = M_n(β̂) and determine eigenvalues ĥ_1 ≥ ĥ_2 ≥ ... and corresponding orthonormal eigenvectors p̂_1, ..., p̂_n.
Estimates ĝ_r are then calculated by a weighted sum of residuals:

ĝ_r(t) = ĥ_r^{−1/2} Σ_{i=1}^n p̂_ir ( Y_it − Ȳ_t − Σ_{j=1}^p β̂_j (X_itj − X̄_tj) ).    (23)

In spite of averaging over individuals, (23) may lead to fairly noisy estimates of g_r. Some additional smoothing will usually improve the performance of the estimator. Using a spline approach, an estimate of g_r may thus alternatively be determined by minimizing a spline-penalized version of the corresponding least squares criterion instead of using (23) directly.


Step 3*: An estimate ŵ of the mean function w is calculated by minimizing

Σ_t ( Ȳ_t − Σ_{j=1}^p β̂_j X̄_tj − w(t) )² + κ ∫ (w''(s))² ds.

As in the procedure of Section 3.1, the accuracy of the coefficient estimates may be improved by a final re-estimation:

Step 4*: Re-estimate the coefficients β_j and θ_ir by fitting the estimated model Y_it = Σ_{j=1}^p β_j X_itj + ŵ(t) + Σ_{r=1}^L θ_ir ĝ_r(t) + ε_it to the data.

Recall that the procedure of Section 3.1 requires smoothing of the individual data of each of the n units in order to estimate v_i, i = 1, ..., n. An important advantage of the above algorithm thus is that it only requires some global smoothing over weighted averages of observations in Steps 3* and 4*. The choice of the smoothing parameter κ will thus be less critical, and a possible smoothing bias will not affect the estimates of the parameters β_j. One may expect a superior behavior of this method if the number T of repeated measurements is fairly small.
On the other hand, a drawback is the fact that already for estimating β_j in Step 1* a sensible selection of the dimension L in (15) has to be made. Indeed, (15) will usually have to be satisfied to a very good approximation in order to avoid biased estimates of the parameters. In practice, one may apply the algorithm for different values of L and choose an appropriate dimension by using some goodness-of-fit criterion.
Theoretical properties of the above algorithm have not yet been studied and remain a topic of future research.

4 Choice of dimension
Any analysis based on (15) requires a sensible choice of the dimension L. If L is too small, there may exist a large systematic error in approximating the v_i. On the other hand, if L is too large, then estimates will possess an unnecessarily large variance.
Note that for a given sample the eigenvalues of the estimated covariance matrix Σ̂_n will usually satisfy λ̂_r > 0 for r > L. This will even be true if (15) holds exactly and if therefore the eigenvalues of the true matrix Σ_n are such that λ_r = 0 for r > L. In other words, the noise term ε_it will "create" additional (small) components in the PCA decomposition. It is obvious that any component generated or strongly influenced by noise should not be included into model (15).
From this point of view one may tend to choose L in such a way that each component g_r, r = 1, ..., L, possesses an influence on the model fit which is significantly larger than that of any noise component. This idea has been adopted by Kneip, Sickles and Song [6] in order to estimate a dimension L. Under the hypothesis that (15) holds for some L, i.e. Σ_{r=L+1}^T λ_r = 0, they derive asymptotic approximations of the mean m(L) and the variance s(L)² of Σ_{r=L+1}^T λ̂_r, and it is shown that

C(L) = ( Σ_{r=L+1}^T λ̂_r − m(L) ) / s(L)

asymptotically possesses a standard normal distribution. For any possible value of L, m(L) and s(L) can be approximated from the data.
An estimate of L is then obtained by choosing the smallest l = 1, 2, ... such that

C(l) ≤ z_{1−α},

where z_{1−α} is the 1 − α quantile of the standard normal distribution.
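The approximations m(L) and s(L) are derived in [6] and are not reproduced here; treating them as given functions, the selection rule can be sketched as follows (m_hat and s_hat are hypothetical placeholders supplied by the user).

import numpy as np
from scipy.stats import norm

def choose_dimension(lam_hat, m_hat, s_hat, alpha=0.05, L_max=None):
    # lam_hat: estimated eigenvalues (in descending order) of Sigma_hat_n
    # m_hat(l), s_hat(l): data-based approximations of mean and s.d. of sum_{r>l} lambda_hat_r (from [6])
    T = len(lam_hat)
    z = norm.ppf(1 - alpha)
    for l in range(1, (L_max or T - 1) + 1):
        C_l = (lam_hat[l:].sum() - m_hat(l)) / s_hat(l)
        if C_l <= z:
            return l                            # smallest l with C(l) <= z_{1-alpha}
    return None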

References
[1] Battese G.E., Coelli T.J. (1992). Frontier production functions, technical efficiency and panel data: With application to paddy farmers in India. Journal of Productivity Analysis 3, 153-169.
[2] Cornwell C., Schmidt P., Sickles R.C. (1990). Production frontiers with cross-sectional and time-series variation in efficiency levels. Journal of Econometrics 46, 185-200.
[3] Dauxois J., Pousse A., Romain Y. (1982). Asymptotic theory for the principal component analysis of a vector random function: some applications to statistical inference. Journal of Multivariate Analysis 12, 136-154.
[4] Gihman I.I., Skorohod A.V. (1970). The theory of stochastic processes. New York: Springer.
[5] Good I.J. (1969). Some applications of the singular value decomposition of a matrix. Technometrics 11, 823-831.
[6] Kneip A., Sickles R.C., Song W. (2004). On estimating the mixed effects model. Manuscript.
[7] Ramsay J.O., Silverman B.W. (1997). Functional data analysis. New York: Springer.
[8] Speckman P. (1988). Kernel smoothing in partial linear models. Journal of the Royal Statistical Society, Series B 50, 413-436.

Address: A. Kneip, Fachbereich Rechts- und Wirtschaftswissenschaften, Universität Mainz, 55099 Mainz, Germany
R.C. Sickles, W. Song, Department of Economics - MS 22, Rice University, 6100 S. Main Street, Houston, TX 77005-1892, USA
E-mail: kneip@uni-mainz.de
COMPSTAT'2004 Symposium © Physica-Verlag/Springer 2004

USING WEIGHTS WITH A TEXT PROXIMITY MATRIX
Angel R. Martinez, Edward J. Wegman and Wendy L. Martinez
Key words: Bigram proximity matrix, k nearest neighbors classifier, natural language processing.
COMPSTAT 2004 section: Applications, Classification.

Abstract: In previous work, we introduced a way of encoding free-form documents called the bigram proximity matrix (BPM). When this encoding was used on a corpus of documents, where each document is tagged with a topic label, results showed that the documents could be classified based on their tagged meaning. In this paper, we investigate methods of weighting the elements of the BPM, analogous to the weighting schemes found in natural language processing. These include logarithmic weights, augmented normalized frequency, inverse document frequency and pointwise mutual information. Results presented in this paper show that some of the weights increased the proportion of correctly classified documents.

1 Introduction
The bigram proximity matrix (BPM) was first developed by Martinez and Wegman [8], [9], [10] as a way of encoding text so it can be used in applications such as document clustering, classification or information retrieval. Previous studies with the BPM indicated that documents can be successfully classified using k nearest neighbors and other methods when they are encoded in this way. The objective of the current work is to define bigram weights analogous to the term weights found in natural language processing and to investigate the utility of using them in document classification.
In Section 2, we present some background information on the BPM and include an illustrative example. We then provide definitions of the bigram weights in Section 3. Section 4 contains information about the experiments that were conducted, as well as the results. Finally, we offer a summary and some comments about future work in Section 5.

2 Bigram proximity matrix

The BPM is a non-symmetric matrix that captures the number of word co-occurrences in a moving 2-word window. It is a square matrix whose column and row headings are the alphabetically ordered entries of the lexicon, plus one more element for end of sentence punctuation. The BPM matrix element ij is the number of times word i appears immediately before word j in the unit of text. The size of the BPM is determined by the size of the
          .      crowd  his    in     father man    sought the    wise   young
 .
 crowd    1
 his                                  1
 in                                                        1
 father                        1
 man                                                1
 sought          1
 the             1                                                1
 wise                                                                    1
 young                                        1

Table 1: Example of Bigram Proximity Matrix. (Note: Zeros in empty boxes are removed for clarity.)

lexicon created by listing alphabetically the unique occurrences of the words in the text. Additionally, it should be noted that all end of sentence punctuation is replaced with a period, and the period is treated as a word. By convention, the period is designated as the first word in the ordered lexicon. It is asserted that the BPM representation of the semantic content preserves enough unique features to be semantically separable from BPMs of other thematically unrelated collections.
The rows in the BPM represent the first word in the pair, and the second word is given by the column. For example, the BPM for the sentence or text stream,

The wise young man sought his father in the crowd.

is shown in Table 1. We see that the matrix element located in the third row (his) and the fifth column (father) has a value of one. This means that the pair of words 'his father' occurs once in this unit of text. It should be noted that in most cases, depending on the size of the lexicon and the size of the text stream, the BPM will be very sparse. So, while the dimensionality of the BPM can be very large, sparse matrix techniques make the analysis fast and the storage requirements small.
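A minimal sketch of the encoding is given below (Python; the tokenization details are simplified assumptions, and dense storage is used for brevity, whereas sparse matrices would be used in practice).

import numpy as np

def bigram_proximity_matrix(text):
    # Replace end-of-sentence punctuation by a period and treat the period as a word.
    for ch in "!?":
        text = text.replace(ch, ".")
    tokens = text.lower().replace(".", " . ").split()
    lexicon = sorted(set(tokens))
    if "." in lexicon:                          # by convention the period is the first lexicon entry
        lexicon.remove(".")
        lexicon = ["."] + lexicon
    index = {w: i for i, w in enumerate(lexicon)}
    bpm = np.zeros((len(lexicon), len(lexicon)))
    for w1, w2 in zip(tokens[:-1], tokens[1:]):
        bpm[index[w1], index[w2]] += 1          # word i immediately followed by word j
    return bpm, lexicon

bpm, lex = bigram_proximity_matrix("The wise young man sought his father in the crowd.")
print(bpm[lex.index("his"), lex.index("father")])   # 1.0: 'his father' occurs once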

3 Definition of weights
We can see from the definition of the BPM that the elements of the matrix represent the number of times that a bigram or word pair occurs in the document. Some of the measures of semantic similarity for classification cited in Martinez [8] employed the raw frequencies, others used binary values (if the frequency is non-zero, then it is replaced with a 1), and some required conversion to probabilities or relative frequencies. In this paper, we will only be concerned with the first case, where raw bigram frequencies are compared to weighted values. Because of this, we will use one measure of semantic similarity - the normalized correlation coefficient (NCC). This is similar to the cosine measure used in information retrieval [7].
Let A represent a BPM that has been converted to a column vector by concatenating the columns, one on top of the other. We do this conversion so the usual definition of the normalized correlation coefficient can be used. Let C denote another BPM that has been similarly converted to a vector. The cosine of the angle between these two 'vectors' is given by

NCC = cos θ_AC = A^T C / ( ||A|| ||C|| ),    (1)

where ||A|| denotes the magnitude of vector A, and M is the number of words in the lexicon squared, i.e., the total number of elements in the BPM.
The NCC given in Equation 1 is a similarity measure, whose range in this case is between 0 and 1. Larger values of the NCC correspond to observations that are close together. For example, the NCC similarity between a document BPM and itself is 1. If the two document BPM 'vectors' are orthogonal to each other, then the NCC similarity is 0. We convert the NCC similarity values to Euclidean distance using the following transformation

(2)

where s_ij represents the similarity between documents i and j, and d_ij is the distance between document i and document j.
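For two vectorized BPMs, the NCC of Equation 1 can be computed as in the following sketch (the variable names are illustrative).

import numpy as np

def ncc(bpm_a, bpm_b):
    # Normalized correlation coefficient (cosine) between two vectorized BPMs, cf. Equation 1.
    a = bpm_a.ravel(order="F")                  # concatenate the columns into one long vector
    b = bpm_b.ravel(order="F")
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))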

3.1 Local - global - document weights

We will denote the ij-th element (the ij-th bigram or word pair) of the k-th weighted BPM as a_ijk. We can write this in terms of local, global and document components as follows

a_ijk = l_ijk g_ij d_k,    (3)

where l_ijk is the local weight for bigram ij that occurs in document k, g_ij is the global weight for bigram ij in the corpus, and d_k is a document normalization factor. We represent the frequency or the number of times bigram ij appears in document k as f_ijk. We use the following to indicate the conversion of a frequency f to a binary value:

I(f) = 1 if f > 0,   I(f) = 0 if f = 0.    (4)

The two local weights we use are called the logarithmic and the augmented normalized bigram frequency. Before we define these, we make one small change in notation for ease of understanding. We denote the ij-th bigram with the subscript b, where some arbitrary order or labeling has been imposed on the bigrams (elements of the BPM). The logarithmic weight is defined as

l_bk = l = log(1 + f_bk),    (5)

and the augmented normalized bigram frequency is given by

(6)

If no local weights are used, then we denote that as just the bigram frequency

l_bk = t = f_bk.    (7)

Note that the letters l, t, and n are used in the information retrieval literature to denote the type of local weight [1].
We use only one global weight in this study, called the inverse document frequency (IDF); others can be found in Berry and Browne [1]. The IDF for bigrams is defined as

g_b = f = log( K / Σ_{k=1}^K I(f_bk) ),    (8)

where K is the total number of documents in the corpus. When choosing a global weight, one needs to consider the state of the corpus. If the corpus changes, the BPM changes first and then the global weight must be revised. Thus, if the corpus is unstable or constantly changing, then using a global weight might not be a good idea.
We now come to the document normalization factor. The cosine normalization seems to be used often with term-document matrices [1], so this is what we use here. For our bigrams, this is given by

d_k = ( Σ_b (l_bk g_b)² )^{−1/2}.    (9)

This simply normalizes the BPMs, or one could think of this as ensuring that the magnitude of the BPM 'vector' is 1. We note that with the normalized correlation coefficient, the document normalization does not really qualify as a weight, because this normalization would take place anyway with the distance measure. What it means is that the denominator in Equation 1 is one, so we do not need to calculate it for the similarity measure.
We can designate the weighting scheme by using a three letter code as follows:

txx   bigram frequency - no weights
nfc   augmented normalized frequency - IDF - cosine normalization
tfc   bigram frequency - IDF - cosine normalization
lfc   logarithmic - IDF - cosine normalization
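The sketch below applies these schemes to a matrix of raw bigram frequencies. Since the exact form of the augmented normalized bigram frequency in Equation 6 is not reproduced above, the common variant 0.5 + 0.5 f_bk / max_b f_bk is assumed here; the cosine normalization divides each weighted vector by its Euclidean norm.

import numpy as np

def weight_bpms(F, scheme="nfc"):
    # F: K x M array of raw bigram frequencies f_bk (documents in rows, vectorized BPMs in columns).
    K, M = F.shape
    if scheme.startswith("l"):
        local = np.log1p(F)                        # logarithmic local weight, Equation 5
    elif scheme.startswith("n"):
        # augmented normalized bigram frequency (assumed form, see lead-in above)
        max_f = np.maximum(F.max(axis=1, keepdims=True), 1)
        local = np.where(F > 0, 0.5 + 0.5 * F / max_f, 0.0)
    else:                                          # 't': raw bigram frequency, Equation 7
        local = F.astype(float)
    if scheme[1] == "f":                           # inverse document frequency, Equation 8
        df = np.maximum((F > 0).sum(axis=0), 1)
        glob = np.log(K / df)
    else:
        glob = np.ones(M)
    W = local * glob
    if scheme[2] == "c":                           # cosine document normalization
        W = W / np.maximum(np.linalg.norm(W, axis=1, keepdims=True), 1e-12)
    return W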

3.2 Mutual information

In general, mutual information is a measure of the common information between two random variables [7]. Pointwise mutual information is defined on two particular points in the distributions. In natural language processing, pointwise mutual information is often calculated between elements and is used for clustering words and word sense disambiguation. We define a pointwise mutual information for bigrams, following the work by Pantel and Lin [11], where they discuss the pointwise mutual information between a word and a context (i.e., the words around it). We use documents in place of contexts to define pointwise mutual information between a bigram and a document. The idea of using contexts as analogous to documents has been explored by Gale, Church and Yarowsky [5].
The pointwise mutual information between bigram b and document k is denoted as MI_bk. The idea is to substitute this value for each corresponding element in the document's BPM. Recall that the number of times bigram b occurs in document k is represented by f_bk. We then calculate the number of times bigram b occurs across all documents in the corpus, which is given by

f_b. = Σ_{i=1}^K f_bi.    (10)

Next we need the total number of bigrams occurring in document k. This is given as

f_.k = Σ_{i=1}^M f_ik.    (11)

The pointwise mutual information is defined as

MI_bk = log( (f_bk / N) / ( (f_b. / N)(f_.k / N) ) ) = log( (N × f_bk) / (f_b. × f_.k) ),    (12)

where N is the total number of bigrams and contexts, given by

N = Σ_{i=1}^M Σ_{j=1}^K f_ij.

One of the problems with pointwise mutual information is that it is biased toward infrequent words (bigrams) and contexts [11], so Pantel and Lin recommend multiplying Equation 12 with a discounting factor. For bigram b and document k, this is

c_bk = ( f_bk / (f_bk + 1) ) × ( min{f_b., f_.k} / (min{f_b., f_.k} + 1) ).

We did not use this factor in our research; only Equation 12 was implemented.
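Equation 12 can be applied to a bigram-by-document frequency matrix as follows (a sketch; the treatment of zero frequencies is our assumption).

import numpy as np

def pointwise_mutual_information(F):
    # F: M x K array of bigram frequencies f_bk (bigrams in rows, documents in columns), cf. Section 3.2.
    f_b = F.sum(axis=1, keepdims=True)             # f_b. : bigram totals over the corpus, Equation 10
    f_k = F.sum(axis=0, keepdims=True)             # f_.k : bigram totals per document, Equation 11
    N = F.sum()                                    # total number of bigram occurrences
    with np.errstate(divide="ignore", invalid="ignore"):
        MI = np.log(N * F / (f_b * f_k))           # Equation 12
    return np.where(F > 0, MI, 0.0)                # zero frequencies are left at 0 (an assumed convention)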

Topic Number   Topic Description

 4   Cessna on the White House
 5   Clinic Murders (Salvi)
 6   Comet into Jupiter
 8   Death of Kim Jong Il's Father
 9   DNA in OJ Trial
11   Hall's Copter in N. Korea
12   Flooding Humble, TX
13   Justice-to-be Breyer
15   Kobe, Japan Quake
16   Lost in Iraq
17   NYC Subway Bombing
18   Oklahoma City Bombing
21   Serbians Down F-16
22   Serbs Violate Bihac
24   US Air 427 Crash
25   WTC Bombing Trial

Table 2: List of 16 topics.

4 Experiments
The goal of our experiments is to assess the usefulness of weighting the BPMs; in particular, to answer the question: can documents be classified more successfully using weighted bigrams? In the next subsections, we describe some of the background and details of the experiments, followed by results. All experiments and analyses, including reading the documents and creating the BPMs, were done on a PC using MATLAB(TM), Version 6.5.

4.1 Description of corpus

We use the Topic Detection and Tracking (TDT) Pilot Corpus (Linguistic Data Consortium, Philadelphia, PA) to evaluate the utility of weighting the BPMs. This corpus contains over 16,000 news stories from various wire services, and the documents were classified in terms of their meaning in the following way. A set of 25 topics was initially chosen, and documents were tagged as either belonging to one of those topics (yes), partially belonging (brief) or not belonging (no). We chose a set of 503 documents encompassing 16 topics, as shown in Table 2, and created a BPM for each one with the weighting schemes described in the previous section.
As for pre-processing the documents, we remove all punctuation (except for the end of sentences) and symbols such as hyphens, etc. As stated previously, all end of sentence punctuation is converted to a period, which is then treated as a word. We also investigate the effect of another pre-processing scheme - removing noise or stop words [8]. For the full text case, the size of the lexicon is 11,103. When noise words are removed, the lexicon contains 10,997 words.

4.2 Classification and dimensionality reduction

We are interested in seeing whether or not weighting the bigrams improves the results when we try to classify documents from the TDT corpus. To this end, we use a simple k nearest neighbor (k-nn) classifier [3]. This type of classifier works in the following way. We have a document with an unknown classification. We find its k nearest neighbors using the normalized correlation coefficient and look at their class labels. The document is assigned the class label that corresponds to the class that occurs with the highest frequency among the k nearest neighbors.
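A minimal version of this classifier, taking precomputed NCC similarities as input, is sketched below (the function name is ours).

import numpy as np
from collections import Counter

def knn_classify(sim_row, labels, k=5):
    # sim_row: NCC similarities between the unlabelled document and all labelled documents
    # labels : topic labels of the labelled documents
    nn = np.argsort(sim_row)[::-1][:k]             # the k most similar documents
    votes = Counter(labels[i] for i in nn)
    return votes.most_common(1)[0][0]              # majority topic among the k nearest neighbours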
The k nearest neighbor classifier is easy to use and is suitable for high-dimensional data. It would be interesting to reduce the dimensionality of the space, so we can use some other method of investigation such as clustering or being able to visualize the data. In keeping with Martinez [8], we use the Isometric Feature Mapping or ISOMAP [12] procedure to reduce the dimensionality of the BPMs and repeat our classification experiments. This is particularly useful in our case, because it requires the interpoint distance matrix as its only input. Before we explain ISOMAP, we first briefly describe multidimensional scaling.
The purpose of multidimensional scaling is to represent points or observations in a lower dimensional space (usually 2-D or 3-D) in such a way that points that are close together in the higher dimensional space will also be close together in the lower dimensional space [2]. However, if the observations live along a lower dimensional nonlinear manifold, then the Euclidean distance between the points might not be the best measure of the distance between the points along the manifold. To illustrate this idea, we show a 2-D nonlinear manifold embedded in 3-D in Figure 1. The Euclidean distance between 2 random points on this manifold is shown in Figure 2, and we see that a better measure of the distance between them would be along this manifold.
ISOMAP seeks a mapping from a higher dimensional space to a lower dimensional one such that the mapping preserves the distances between observations, where the distance in the higher dimensional space is measured along the geodesic path of the nonlinear manifold. The first step in the ISOMAP algorithm is to convert the interpoint Euclidean distance matrix into geodesic distances. The geodesic distances are then used as input to classical multidimensional scaling. Besides the interpoint distance matrix, ISOMAP requires a value for the number of nearest neighbors (k) that is used in determining the geodesic distance. We use a value of k = 10 in this body of work.

'. " .
. .

Figure 1: This illustrates a 2-D manifold (or surface) embedded in a 3-D


space.

4.3 Results
To summarize, we varied the weights and other parameters and performed the following experiments with the weighted BPM.

• Text pre-processing conditions were full and denoised lexicon.

• Bigram weights were MI_bk, lfc, nfc, tfc, and txx.

• Dimensionality of the space for using k-nn was either full dimensionality or 4-D and 6-D from ISOMAP.

• The values of k for the k-nn classifier were k = 1, 3, 5, 7, 10.

• The Euclidean distance was used for the 4-D and 6-D k-nn classification.

The results from the experiments are shown in Tables 3 through 5. Several things can be noted from these results. First, we see that in the full BPM case, using the pointwise mutual information increases the proportion of documents correctly classified. Secondly, denoising the data seems to produce poorer or similar results in the weighted case, but better results in the unweighted case (txx). Finally, it is interesting to note that the weighting scheme tfc allows us to compare the use of the IDF global weight alone. By comparing the tfc* and txx* entries, we see that using the IDF global weight increases the correct classification.
Figure 2: This is a data set randomly generated according to the manifold given in Figure 1. The Euclidean distance between two points is given by the straight line shown here. If we are seeking the neighborhood structure along the manifold, then it would be better to use the geodesic distance (the distance along the manifold or the roll) between the points.

          k = 1   k = 3   k = 5   k = 7   k = 10

lfc       0.90    0.92    0.93    0.93    0.94
lfc-den   0.87    0.87    0.86    0.87    0.87
MI        0.98    0.99    1.00    1.00    0.99
MI-den    0.98    0.98    0.99    0.99    0.99
nfc       0.99    0.99    0.99    1.00    1.00
nfc-den   0.98    0.99    0.99    0.99    0.99
tfc       0.98    0.98    0.99    0.99    0.99
tfc-den   0.99    0.98    0.98    0.99    0.98
txx       0.90    0.90    0.91    0.92    0.93
txx-den   0.93    0.93    0.93    0.93    0.92

Table 3: Proportion of documents correctly classified - full BPMs.

5 Summary
In this paper, we defined bigram weights for the BPMs that are similar to term weights used in natural language processing and information retrieval. After the BPMs are weighted, we applied the k-nn classification method to

          k = 1   k = 3   k = 5   k = 7   k = 10
lfc       0.74    0.74    0.75    0.77    0.76
lfc-den   0.71    0.71    0.73    0.73    0.72
MI        0.82    0.81    0.83    0.83    0.84
MI-den    0.81    0.83    0.85    0.87    0.86
nfc       0.84    0.84    0.85    0.86    0.85
nfc-den   0.85    0.85    0.87    0.87    0.87
tfc       0.88    0.87    0.87    0.86    0.87
tfc-den   0.86    0.86    0.87    0.86    0.86
txx       0.66    0.65    0.65    0.64    0.65
txx-den   0.73    0.72    0.74    0.73    0.75

Table 4: Proportion of documents correctly classified - BPMs reduced to 4-D.

          k = 1   k = 3   k = 5   k = 7   k = 10

lfc       0.83    0.84    0.85    0.85    0.84
lfc-den   0.78    0.79    0.81    0.80    0.80
MI        0.91    0.93    0.93    0.95    0.95
MI-den    0.91    0.93    0.93    0.93    0.93
nfc       0.92    0.92    0.92    0.93    0.93
nfc-den   0.92    0.93    0.94    0.94    0.94
tfc       0.92    0.93    0.93    0.93    0.92
tfc-den   0.92    0.91    0.94    0.92    0.90
txx       0.67    0.67    0.68    0.69    0.67
txx-den   0.83    0.81    0.83    0.82    0.82

Table 5: Proportion of documents correctly classified - BPMs reduced to 6-D.

determine whether or not weighting the BPMs improves document recognition. The results show that in some cases the local weights that were used, such as the augmented normalized frequency, did improve the classification performance. Additionally, using the pointwise mutual information, which takes the context into account, significantly improved the results.
A lot of work in this area of weighting the BPMs remains to be done. One interesting possibility is to change the pointwise mutual information to include the topic. In other words, instead of using the document as the context, we might use the topic or class as the context. Of course, in this case we would have to use a training set of documents that are tagged with their topic to estimate the context. This can then be used with new untagged documents and their BPMs. Additionally, we could use the discounting factor with the mutual information. Other bigram weights can be defined and examined, such as entropy, probabilistic inverse and pivoted-cosine normalization [1]. We might also examine other real-valued measures of distance or similarity other than the NCC.
We looked at pre-processing the text by removing noise words. We could
also perform some experiments using a stemmed and denoised lexicon [8], [1]. We could also examine the effect of the dimensionality reduction procedure. As stated previously, ISOMAP seeks a nonlinear manifold; we might try something like classical multidimensional scaling [2] (using the NCC similarity directly rather than the geodesic distance). Finally, we could use some other methods to analyze the reduced BPMs, such as model-based clustering [4], linear or quadratic classifiers [3], non-metric multidimensional scaling, self-organizing maps [6], etc.

References
[1] Berry M.W., Browne M. (1999). Understanding search engines: mathematical modeling and text retrieval. SIAM.
[2] Cox T.F., Cox M.A.A. (2001). Multidimensional scaling, 2nd edition. Chapman and Hall - CRC.
[3] Duda R.O., Hart P.E., Stork D.G. (2000). Pattern classification, 2nd edition. Wiley-Interscience.
[4] Fraley C., Raftery A.E. (1998). How many clusters? Which clustering method? Answers via model-based cluster analysis. The Computer Journal 41, 578-588.
[5] Gale W.A., Church K.W., Yarowsky D. (1992). A method for disambiguating word senses in a corpus. Computers and the Humanities 26, 415-439.
[6] Kohonen T. (2001). Self-organizing maps, third edition. Springer-Verlag.
[7] Manning C.D., Schütze H. (2000). Foundations of statistical natural language processing. The MIT Press.
[8] Martinez A.R. (2002). A framework for the representation of semantics. Ph.D. Dissertation, George Mason University.
[9] Martinez A.R., Wegman E.J. (2002). A text stream transformation for semantic-based clustering. Proceedings of the Interface.
[10] Martinez A.R., Wegman E.J. (2002). Encoding of text to preserve meaning. Proceedings of the Army Conference on Applied Statistics.
[11] Pantel P., Lin D. (2002). Discovering word senses from text. Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 613-619.
[12] Tenenbaum J.B., de Silva V., Langford J.C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science 290, 2319-2323.

Address: A.R. Martinez, W.L. Martinez, NAVSEA Dahlgren, USA
E.J. Wegman, School of Information Technology and Engineering, George Mason University, Fairfax, Virginia
E-mail: marinwe@onr.navy.mil
COMPSTAT'2004 Symposium © Physica-Verlag/Sprin ger 2004

ON CANONICAL ANALYSIS OF VECTOR TIME SERIES
Wanli Min and Ruey S. Tsay
Key words: Hankel matrix, Kronecker index, canonical correlation.
COMPSTAT 2004 section: Time series analysis.

Abstract: In this paper, we establish some asymptotic results for canonical analysis of vector linear time series when the data possess conditional heteroscedasticity. We show that for correct identification of a vector time series model, it is essential to use a modification, which we prescribe, to a commonly used test statistic for testing zero canonical correlations. A real example and simulation are used to demonstrate the importance of the proposed test statistics.

1 Introduction
Since proposed in [13], canonical correlation analysis has been widely applied in many statistical areas, especially in multivariate analysis. Time series analysis is no exception. [6] proposed a canonical analysis of vector time series that can reveal the underlying structure of the data to aid model interpretation. In particular, they showed that linear combinations of several unit-root non-stationary time series can become stationary. This is the idea of co-integration that was popular among econometricians in the 1990s after the publication of [10]. [22] applied canonical correlation analysis to develop the smallest canonical correlation method for identifying univariate ARMA models for a stationary and/or non-stationary time series. [17] introduced the concept of scalar component models to build a parsimonious VARMA model for a given vector time series. Again, canonical correlation analysis was used extensively to search for scalar component models. Many other authors also used canonical analysis in time series analysis. See, for instance, [15].
To build a model for a k-dimensional linear process, it suffices to identify the k Kronecker indexes or k linearly independent scalar component models, because we can use such information to identify those parameters that require estimation and those that can be set to zero within a dynamic linear vector model. Simply put, the Kronecker indexes and scalar component models can overcome the difficulties of the curse of dimensionality, parameter explosion, exchangeable models, and redundant parameters in modelling a linear vector time series. For simplicity, we shall consider the problem of specifying Kronecker indexes in this paper. The issue discussed, however, is equally applicable to specification of scalar component models. The method of determining Kronecker indexes of a linear vector process with Gaussian innovations has been studied by [1], [7], [18], [20], among others. These studies show that canonical correlation analysis is useful in specifying the Kronecker indexes

under normality. On the other hand, the assumption of Gaussian innovations is questionable in many applications, especially in the analysis of economic and financial data that often exhibit conditional heteroscedasticity. See, for instance, the summary statistics of asset returns in Chapter 1 of [21]. In the literature, a simple approach to model conditional heteroscedasticity is to apply the generalized autoregressive conditional heteroscedastic (GARCH) model of [9] and [3]. We shall adopt such a model for the innovation series of multivariate time series data.
In this paper, we continue to employ canonical analysis in vector time series. However, we focus on statistical inference concerning canonical correlation coefficients when the distribution of the innovations is not Gaussian. Our main objective is to identify a vector model with structural specification for a given time series that exhibits conditional heteroscedasticity and has high kurtosis. Specifically, we study canonical correlation analysis when the innovations of the series follow a vector GARCH model.

1.1 Preliminaries
Based on the Wold decomposition, a k-dimensional stationary time series Z_t = (Z_1t, ..., Z_kt)' can be written as Z_t = μ + Σ_{i=0}^∞ ψ_i a_{t−i}, where μ = (μ_1, ..., μ_k)' is a constant vector, the ψ_i are k × k coefficient matrices with ψ_0 = I_k being the identity matrix, and {a_t = (a_1t, ..., a_kt)'} is a sequence of k-dimensional uncorrelated random vectors with mean zero and positive-definite covariance matrix Σ. That is, E(a_t) = 0, E(a_t a'_{t−i}) = 0 if i ≠ 0, and E(a_t a'_t) = Σ. The a_t process is referred to as the innovation series of Z_t. If Σ_{i=0}^∞ ||ψ_i|| < ∞, then Z_t is (asymptotically) weakly stationary, where ||A|| is a matrix norm, e.g. ||A|| = √trace(AA'). Often one further assumes that a_t is Gaussian. In this paper, we assume that

sup_{i,t} E(|a_it|^η | F_{t−1}) < ∞ almost surely for some η > 2,    (1)

where F_{t−1} = σ{a_{t−1}, a_{t−2}, ...} denotes the information available at time t − 1.
Writing ψ(B) = Σ_{i=0}^∞ ψ_i B^i, where B is the backshift operator such that B Z_t = Z_{t−1}, then Z_t = μ + ψ(B) a_t. If ψ(B) is rational, then Z_t has a VARMA representation

Φ(B)(Z_t − μ) = Θ(B) a_t,    (2)

where Φ(B) = I − Σ_{i=1}^p Φ_i B^i and Θ(B) = I − Σ_{j=1}^q Θ_j B^j are two matrix polynomials of order p and q, respectively, and have no common left factors. For further conditions of identifiability, see [8] for more details. The stationarity condition of Z_t is equivalent to all zeros of the polynomial |Φ(B)| being outside the unit circle.
The number of parameters of the VARMA model in Eq. (2) could reach (p+q)k² + k + k(k+1)/2 if no constraint is applied, making parameter estimation unnecessarily difficult in some applications. Several methods are available in the literature that can simplify the use of VARMA models when the innovations {a_t} are Gaussian. For instance, specification of the Kronecker indexes of a Gaussian vector time series can lead to a parsimonious parametrization of the VARMA representation, see [19]. In many situations, the innovational process a_t has conditional heteroscedasticity. In the univariate case, [3] proposed a GARCH(r_1, r_2) model to handle conditional heteroscedasticity. The model can be written as

a_t = √g_t ε_t,   g_t = α_0 + Σ_{i=1}^{r_1} α_i a²_{t−i} + Σ_{j=1}^{r_2} β_j g_{t−j},    (3)

where α_0 > 0, α_i ≥ 0, β_j ≥ 0, and {ε_t} is a sequence of independent and identically distributed random variables with mean zero and variance 1. It is well known that a_t is asymptotically second order stationary if Σ_{i=1}^{r_1} α_i + Σ_{j=1}^{r_2} β_j < 1. Generalization of the GARCH models to the multivariate case introduces additional complexity to the modelling procedure because the covariance matrix of a_t has k(k+1)/2 elements. Writing the conditional covariance matrix of a_t given the past information as Σ_t = E(a_t a'_t | F_{t−1}), where F_{t−1} is defined in Eq. (1), we have a_t = Σ_t^{1/2} ε_t, where Σ_t^{1/2} is the symmetric square root of the matrix Σ_t and {ε_t} is a sequence of independent and identically distributed random vectors with mean zero and identity covariance matrix. Often ε_t is assumed to follow a multivariate normal or Student-t distribution. To ensure the positive definiteness of Σ_t, several models have been proposed in the literature. For example, consider the simple case of order (1,1). [11] consider the BEKK model Σ_t = CC' + A a_{t−1} a'_{t−1} A' + B Σ_{t−1} B', where C is a lower triangular matrix and A and B are k × k matrices. [4] discusses the diagonal model Σ_t = CC' + AA' ⊙ (a_{t−1} a'_{t−1}) + BB' ⊙ Σ_{t−1}, where ⊙ stands for the matrix Hadamard product (element-wise product).
When GARCH effects exist, the time series Z_t is no longer Gaussian. Its innovations become a sequence of uncorrelated, but serially dependent random vectors. It is well known that such innovations tend to have heavy tails, see [9] and [21], among others. The performance of canonical correlation analysis under such innovations is yet to be investigated. This is the main objective of this paper. Sections 2 and 3 review and introduce the problem considered in the paper. Section 4 establishes the statistics to specify Kronecker indexes for a VARMA+GARCH process. Section 5 presents some simulation results, and Section 6 applies the analysis to a real financial time series.

2 Kronecker index and vector ARMA representation

2.1 Vector ARMA model implied by Kronecker index
For simplicity, we assume that μ = 0. Given a time point t, define the past and future vectors P_t and F_t of the process Z_t as P_t = (Z'_{t−1}, Z'_{t−2}, ...)', F_t = (Z'_t, Z'_{t+1}, ...)'. The Hankel matrix of Z_t is defined as H = E(F_t P'_t).
It is obvious that for a VARMA model in Eq. (2) the Hankel matrix H is of finite rank. In fact, it can be shown that Rank(H) is finite if and only if Z_t has a VARMA model representation, see [12] and [20].
The Kronecker indexes of Z_t consist of a set of non-negative integers {K_i | i = 1, ..., k} such that for each i, K_i is the smallest non-negative integer such that the (kK_i + i)th row of H is either a null vector or a linear combination of the previous rows of H. It turns out that Σ_{i=1}^k K_i is the rank of H, which is invariant under different VARMA representations of Z_t. In fact, the set of Kronecker indexes, {K_i}_{i=1}^k, of a given VARMA process is invariant under various forms of model representation. [20] illustrates how to construct an Echelon VARMA form for Z_t using the Kronecker indexes {K_i}_{i=1}^k. For a stationary process Z_t with specified Kronecker indexes {K_1, ..., K_k}, let p = max{K_i | i = 1, ..., k}. Then Z_t follows a VARMA(p, p) model

Φ_0 Z_t − Σ_{i=1}^p Φ_i Z_{t−i} = δ + Φ_0 e_t − Σ_{j=1}^p Θ_j e_{t−j},    (4)

where δ is a constant vector, the ith rows of Φ_j and Θ_j are zero for j > K_i, and Φ_0 is a lower triangular matrix with ones on the diagonal. Furthermore, some elements of Φ_i can be set to zero based on the Kronecker indexes. A VARMA model in Eq. (4) provides a unique ARMA representation for Z_t, see Theorem 2.5.1 in [12].

2.2 Specification of Kronecker index

If the smallest canonical correlation between the future and past vectors F_t and P_t is zero, then X_t = V'_1 F_t is uncorrelated with P_t, i.e. Cov(X_t, P_t) = V'_1 E(F_t P'_t) = V'_1 H = 0. This leads to a row dependency of the Hankel matrix, so that the analysis is directly related to the Kronecker indexes. Testing for zero canonical correlation thus plays an important role in specifying Kronecker indexes. [7] used the traditional χ² test to propose a modelling procedure:

Step 1: Select a large lag s so that the vector P_t = (Z'_{t−1}, ..., Z'_{t−s})' is a good approximation of the past vector, and choose the initial future sub-vector F*_t = {Z_1t}. If a vector AR approximation is used, then s can be selected by information criteria such as AIC or BIC.

Step 2: Let ρ̂ be the smallest sample canonical correlation in modulus between F*_t and P_t. Denote the canonical variates by X_t = V'_1 F*_t and Y_t = V'_2 P_t, and compute the test statistic

S = −n log(1 − ρ̂²) ~ χ²_{ks−f+1},    (5)

where n is the number of observations, and f and ks are the dimensions of F*_t and P_t, respectively.

Step 3: Denote the last element of F*_t as Z_{i,t+h}. If H_0: ρ = 0 is not rejected, then the Kronecker index for the ith component Z_it of Z_t is K_i = h. In this case, update the future vector F_t by removing Z_{i,t+j} for j ≥ h. If all k Kronecker indexes have been found, the procedure is terminated. Otherwise, augment F*_t by adding the next available element of the updated F_t and return to Step 2.
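For concreteness, the computation in Step 2 can be sketched as follows (Python with NumPy/SciPy; the data matrices F and P are assumed to be built by stacking the future sub-vectors and past vectors over time). Note that the χ² calibration used here is the uncorrected one of Eq. (5); the corrections discussed below, Eqs. (7) and (9), replace ρ̂² by ρ̂²/d̂.

import numpy as np
from scipy.stats import chi2

def smallest_canonical_correlation_test(F, P):
    # F: n x f matrix of future sub-vectors, P: n x (k*s) matrix of past vectors (rows are time points).
    n, f = F.shape
    ks = P.shape[1]
    Sff, Spp = np.cov(F, rowvar=False), np.cov(P, rowvar=False)
    Sfp = np.cov(F, P, rowvar=False)[:f, f:]
    A = np.linalg.solve(Sff, Sfp) @ np.linalg.solve(Spp, Sfp.T)
    rho2 = np.sort(np.linalg.eigvals(A).real)[0]       # squared smallest canonical correlation
    S = -n * np.log(1 - rho2)                          # statistic (5)
    return rho2, S, chi2.sf(S, df=ks - f + 1)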
The asymptotic χ² distribution of the S-statistic in Eq. (5) of Step 2 is derived under the independence sampling assumption. [18] showed that the canonical correlations cannot be treated as the cross correlations of two white-noise series, since the corresponding canonical variates are serially correlated.
Suppose F*_t = (Z_1,t, ..., Z_i,t+h)'. The smallest sample canonical correlation ρ̂ is the lag-(h+1) sample cross-correlation ρ̂_xy(h+1) of the corresponding canonical variates X_t = V'_1 F*_t and Y_t = V'_2 P_t, because Y_t is observable at time t − 1 whereas X_t is observable at time t + h. Under H_0: ρ_xy(m) = 0, the asymptotic variance of ρ̂_xy(m) is, as shown in [5],

var[ρ̂_xy(m)] ≈ n^{−1} Σ_{v=−∞}^{∞} { ρ_xx(v) ρ_yy(v) + ρ_xy(m+v) ρ_yx(m−v) }.    (6)

Making use of the result mentioned above, [18] proposed a proper test statistic

T = −(n − s) log(1 − ρ̂²/d̂) ~ χ²_{ks−f+1},    (7)

where d̂ = 1 + 2 Σ_{v=1}^h ρ̂_xx(v) ρ̂_yy(v). In Eq. (7) it is understood that d̂ = 1 if h = 0, ρ̂_xx(v) and ρ̂_yy(v) are the lag-v sample autocorrelations of X_t and Y_t, respectively, and n is the sample size. The Bartlett formula in Eq. (6) is valid for independent Gaussian innovations {a_t}. This is not the case when the innovations follow a GARCH(r_1, r_2) model. We shall study in the next section properties of sample auto-covariances in the presence of GARCH innovations. All proofs can be found in [14].

3 Sample auto-covariance functions of a linear process

Lemma 3.1. Suppose {a_t} is a stationary GARCH(r_1, r_2) process of Eq. (3) with finite fourth moment and ε_t is symmetrically distributed. Then E(a_i a_j a_k a_l) = 0 for all i ≤ j ≤ k ≤ l unless i = j and k = l both hold.

Proposition 3.1. Suppose {a_t} is a GARCH(r_1, r_2) process with E(a_t²) = σ² and E(a_t⁴) < ∞, and the process X_t is defined as X_t = Σ_{i=0}^∞ ψ_i a_{t−i} with Σ_i |ψ_i| < ∞ and Σ_i i ψ_i² < ∞. Let γ_xx(0) = σ² Σ_{i=0}^∞ ψ_i². Then the next inequality holds: Σ_{t=1}^∞ ||E(X_t² − γ_xx(0) | F_0)|| < ∞, where F_0 = σ{ε_0, ε_{−1}, ...} and ||Y|| denotes the L_2-norm of a random variable Y.

Defining the norm of a random matrix as ||A|| := √E(tr AA'), we can generalize Proposition 3.1 to a linear process whose innovational process follows a multivariate GARCH model.

Proposition 3.2. Assume a_t = (a_1t, ..., a_mt)' follows a pure diagonal multivariate GARCH model, i.e. a_it follows a univariate GARCH(r_1, r_2) model and is stationary with finite fourth moment for each i = 1, ..., m. Consider the process X_t = Σ_{i=0}^∞ Ψ'_i a_{t−i}, where the Ψ_i are m-dimensional vectors. Assume further that Σ_{i=0}^∞ ||Ψ_i|| < ∞ and Σ_{i=0}^∞ i ||Ψ_i||² < ∞. Let F_0 = σ{a_0, a_{−1}, ...}. Then the next inequality holds: Σ_{t=1}^∞ ||E(X_t² − γ_xx(0) | F_0)|| < ∞, where γ_xx(0) = Σ_{i=0}^∞ Ψ'_i Σ Ψ_i and Σ = E(a_t a'_t) = diag(σ_1², ..., σ_m²).

Observing X_t Y_{t+h} = [ (X_t + Y_{t+h})² − (X_t − Y_{t+h})² ] / 4, we have by the triangle inequality the next corollary.

Corollary 3.1. Suppose X_t = Σ_{i=0}^∞ Ψ'_i a_{t−i} and Y_t = Σ_{i=0}^∞ Φ'_i a_{t−i} both satisfy the conditions in Proposition 3.2. Let γ_xy(h) = E(X_t Y_{t+h}), where h is an integer. We have Σ_{t=1}^∞ ||E(X_t Y_{t+h} − γ_xy(h) | F_0)|| < ∞.

To generalize the result to the case that X_t is multivariate, we define Vec(A) = (A'_1, ..., A'_k)' for a matrix A. We also use a lemma in [23].

Proposition 3.3. Let X_t = (X_1t, ..., X_kt)' = Σ_{i=0}^∞ Ψ_i a_{t−i}, where the Ψ_i are matrices of dimension k × m and a_t is m-dimensional and follows a pure diagonal stationary GARCH(r_1, r_2) model with finite 4th moment. Further, Σ_{i=0}^∞ ||Ψ_i|| < ∞ and Σ_{i=0}^∞ i ||Ψ_i||² < ∞. Letting Σ_h = E(X_t X'_{t+h}), where h is an integer, we have Σ_{t=1}^∞ ||Vec( E(X_t X'_{t+h} | F_0) − Σ_h )|| < ∞.

Proposition 3.4. Let X_t = (X_1t, ..., X_kt)' = Σ_{i=0}^∞ Ψ_i a_{t−i} and Y_t = (Y_1t, ..., Y_lt)' = Σ_{i=0}^∞ Φ_i a_{t−i}, where Ψ_i and Φ_i are matrices of dimension k × m and l × m, respectively. Suppose both X_t and Y_t satisfy the conditions in Proposition 3.3. Denote Σ_xy(h) = E(X_t Y'_{t+h}). Then (1/√n) Σ_{t=1}^n Vec( X_t Y'_{t+h} − Σ_xy(h) ) → N(0, Σ) in distribution, where h is any integer and Σ ∈ R^{kl×kl}.

Remark 3.1. For a causal, stationary VARMA(p, q) process Φ(B)(Z_t − μ) = Θ(B) a_t, its MA(∞) representation Z_t = μ + Σ_{i=0}^∞ Ψ_i a_{t−i} satisfies the conditions Σ_{i=0}^∞ ||Ψ_i|| < ∞ and Σ_{i=0}^∞ i ||Ψ_i||² < ∞, since ||Ψ_i|| ~ r^i with r ∈ (0, 1) being the largest root (in magnitude) of Φ(B^{−1}). Consequently, if a_t follows a pure diagonal GARCH model with finite fourth moment, the sample auto-covariance matrix of Z_t has an asymptotic joint normal distribution.

Theorem 3.1. Suppose that Z_t is a k-dimensional stationary VARMA process of model (2), where the innovation series a_t follows a GARCH(r_1, r_2) model with finite 4th moment. Let P_t = (Z'_{t−1}, ..., Z'_{t−s})' be a past vector with a prespecified s > 0 that contains all the information needed in predicting the future observations of Z_t, and let F_t = (Z_1,t, ..., Z_i,t+h)' be the future subvector of Z_t constructed according to the procedure described in Section 2. Let ρ̂ be the smallest sample canonical correlation between P_t and F_t. Under the null hypothesis that the smallest canonical correlation ρ between P_t and F_t is zero but all the other canonical correlations are nonzero, ρ̂²/var(ρ̂) has an asymptotic χ² distribution with ks − f + 1 degrees of freedom, where f is the dimension of F_t.

4 Asymptotic variance of sample cross correlation

Next we consider the variance of the sample cross-correlation coefficient for the case that gives rise to a zero canonical correlation between the past and future vectors of Z_t. To this end, we make use of the delta method. Suppose Y_t and X_t are stationary moving-average processes. More specifically, Y_t = Σ_{i=0}^h φ_i a_{t−i} and X_t = Σ_{i=0}^∞ ψ_i a_{t−i}, with a_t being a GARCH(r_1, r_2) process of Eq. (3). By Lemma 3.1, E(a_i a_j a_k a_l) = 0 for all i ≤ j ≤ k ≤ l unless i = j and k = l both hold. Let U = γ̂_xx(0), V = γ̂_yy(0), and W = γ̂_xy(q) = (1/(n−q)) Σ_{t=1}^{n−q} X_t Y_{t+q}. Given q > h, where h corresponds to a Kronecker index, we have γ_xy(q) = γ_yy(q) = 0, and on applying the delta method the following result holds:

Var(ρ̂_xy(q)) ≈ (1/n) Σ_{|d|≤h} [ ρ_xx(d) ρ_yy(d) + Cum(X_0, X_d, Y_q, Y_{q+d}) / (γ_xx(0) γ_yy(0)) ],    (8)

where Cum(X_0, X_d, Y_q, Y_{q+d}) = Σ_{i≥0} Σ_{k=0}^{h−d} ψ_i ψ_{i+d} φ_k φ_{k+d} Cov(a_0², a²_{q−k+i}).
Therefore, the fourth order cumulants of {X_t} depend on the auto-covariance function of {a_t²}. Compared to γ_xx(d) γ_yy(d), Cum(X_0, X_d, Y_q, Y_{q+d}) has a non-negligible impact on Var(ρ̂_xy(q)) if Cov(a_0², a²_{q−k+i}) / E²(a_0²) is large. For instance, if a_t is a GARCH(1,1) process, then

Cov(a_0², a_1²) / σ⁴ = 2α_1 (1 − α_1β_1 − β_1²) / ( 1 − 2α_1² − (α_1 + β_1)² ).

This ratio is 86 given α_1 = 0.5 and β_1 = 0.2. Considering the 4th order cumulant correction term in Var(ρ̂), one can modify the T statistic proposed by Tsay as

T* = −(n − s) log(1 − ρ̂²/d̂*),    (9)

where d̂* = n·Var̂(ρ̂_xy(q)) is obtained from (8).

5 Simulation study
We conduct some simulations to study the finite sample performance of the modified test statistics. We focus on a bivariate ARMA+GARCH(1,1) model chosen to have GARCH parameters similar to those commonly seen in empirical asset returns. The model is

Z_t − [ 0.8  0  ]          [ −0.8  1.3 ]
      [ 0    0.3] Z_{t−1} = a_t − [ −0.3  0.8 ] a_{t−1},    (10)

where t = 1, ..., n and a_t = diag(√g_1t, √g_2t) ε_t with ε_t ~ i.i.d. N_2(0, I), where g_it satisfies the GARCH(1,1) model g_it = 0.5 + 0.2 a²_{i,t−1} + 0.7 g_{i,t−1} for i = 1 and 2. For a given sample size n, each realization was obtained by generating 5n observations. To reduce the effect of the starting values Z_0 and a_0, we only use the last n observations. For this model, the two future subvectors which in theory give a zero canonical correlation are F_t(1) = (Z_1t, Z_2t, Z_1,t+1)' and F_t(2) = (Z_1t, Z_2t, Z_2,t+1)'. A value of s = 5 was selected according to the AIC criterion in a preliminary analysis using pure vector AR models. The corresponding past vector is P_t = (Z'_{t−1}, ..., Z'_{t−5})'.
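A sketch of the data-generating process used in the simulations is given below (the function name and the seed are ours).

import numpy as np

def simulate_model_10(n, seed=0):
    # Simulate Z_t - Phi Z_{t-1} = a_t - Theta a_{t-1} with diagonal GARCH(1,1) innovations, cf. Eq. (10).
    rng = np.random.default_rng(seed)
    Phi = np.array([[0.8, 0.0], [0.0, 0.3]])
    Theta = np.array([[-0.8, 1.3], [-0.3, 0.8]])
    total = 5 * n                                   # generate 5n observations, keep the last n
    Z = np.zeros((total, 2))
    a = np.zeros((total, 2))
    g = np.full((total, 2), 0.5 / (1 - 0.2 - 0.7))  # start the volatility at its unconditional mean
    for t in range(1, total):
        g[t] = 0.5 + 0.2 * a[t - 1] ** 2 + 0.7 * g[t - 1]   # g_it = 0.5 + 0.2 a_{i,t-1}^2 + 0.7 g_{i,t-1}
        a[t] = np.sqrt(g[t]) * rng.standard_normal(2)
        Z[t] = Phi @ Z[t - 1] + a[t] - Theta @ a[t - 1]
    return Z[-n:]

Z = simulate_model_10(2000)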
Let S(1) and S(2) be the test statistics S = −n log(1 − ρ̂²) of [7] when the future subvectors are F_t(1) and F_t(2), respectively. Similarly, let T(1) and T(2) be the corresponding test statistics T = −(n − s) log(1 − ρ̂²/d̂) of [18], and let T*(1) and T*(2) be the test statistics T* = −(n − s) log(1 − ρ̂²/d̂*) proposed in Eq. (9). In particular, we adopt the approach of [2] to estimate the variance of the sample cross-covariance Var[γ̂_xy(q)] by

n Var[γ̂_xy(q)] ≈ σ̂*(0) + 2 Σ_{i=1}^{n−q} (1 − i/n) K(i b_n) σ̂*(i),

where σ̂*(i) = Σ_t X_t Y_{t+q} X_{t+i} Y_{t+i+q} / n − γ̂²_xy(q), K(x) = 1_{|x|≤1}, and b_n ∝ n^{−1/4}.
However, to improve the robustness of the variance estimate in finite samples, we employ a modified estimate of σ̂*(i). The modification is to use a trimmed sequence {X_t Y_{t+q}}, obtained by trimming both the lower and upper 0.2 percentiles of X_t Y_{t+q}.
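The kernel estimate and the trimming can be sketched as follows; whether the extreme products are dropped or winsorised is not specified above, so winsorising is assumed here.

import numpy as np

def sigma_star(W, i, gamma_q, n):
    # sigma*_i = sum_t X_t Y_{t+q} X_{t+i} Y_{t+i+q} / n - gamma_hat_xy(q)^2
    m = len(W) - i
    return (W[:m] * W[i:]).sum() / n - gamma_q ** 2

def n_var_gamma_xy(X, Y, q, trim=0.002):
    # Estimate of n * Var[gamma_hat_xy(q)] in the spirit of [2], with the paper's trimming
    # implemented here as winsorising (an assumption).
    n = len(X)
    W = X[: n - q] * Y[q:]
    lo, hi = np.quantile(W, [trim, 1 - trim])
    W = np.clip(W, lo, hi)
    gamma_q = W.mean()
    b_n = n ** (-0.25)                                 # b_n proportional to n^{-1/4}
    total = sigma_star(W, 0, gamma_q, n)
    i = 1
    while i < len(W) and i * b_n <= 1:                 # kernel K(x) = 1 for |x| <= 1
        total += 2 * (1 - i / n) * sigma_star(W, i, gamma_q, n)
        i += 1
    return total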
As an alternative, we also applied the stationary bootstrap method of [16] to estimate Var(ρ̂). Each bootstrap step was repeated 1000 times. Let B(1)

Statistic   Mean    S.D.    90%     95%     99%     Rej. at χ²_8(0.95), percentage

S(1)        10.81   5.81    18.20   21.67   30.30   20.3
S(2)        11.63   8.91    20.94   25.94   37.48   22.2
T(1)        10.88   6.89    18.26   22.04   32.5    17.3
T(2)         9.14   6.66    15.94   19.27   28.66   10.8
χ²_8         8      4       13.36   15.51   20.10    5.0
T*(1)        8.13   4.29    13.82   16.35   21.60    6.5
T*(2)        7.01   3.93    11.99   14.01   20.11    3.4
B(1)         7.72   4.1     13.05   15.32   20.81    4.8
B(2)         6.31   3.66    11.03   13.65   18.47    4.0

Table 1: Empirical quantiles of various test statistics for testing zero canonical correlations, based on 2,000 replications with sample size 2,000.

and B(2) be the corresponding test statistics −(n − s) log(1 − ρ̂²/d̂), where d̂ is obtained from the bootstrap.
Table 1 compares empirical percentiles and the size of the various test statistics discussed above for the model in Eq. (10) when the sample size is 2000, which is common among financial data. The corresponding quantiles of the asymptotic χ²_8 are also given in the table. Other sample sizes were also considered. From the table, we make the following observations. First, the T* and bootstrap B statistics perform reasonably well when the sample size is sufficiently large. The bootstrap method outperforms the other test statistics. However, it requires intensive computation. For instance, it took several hours to compute the bootstrap tests in Table 1 whereas it only took seconds to compute the other tests. Second, the T statistics underestimate the variance of the cross-correlation, so that the empirical quantiles exceed their theoretical counterparts. Third, as expected, the S statistics perform poorly for both sample sizes considered. Fourth, the performance of the proposed test statistic T* indicates that the method of [2] for estimating the variance of the cross-covariance is reasonable in the presence of GARCH effects provided that robust estimators σ̂*(i) are used.

6 An illustrative example
In this section we apply the proposed test statistics to a 3-dimensional financial time series. The data consist of daily log returns, in percentages, of stocks for Amoco, IBM, and Merck from February 2, 1984 to December 31, 1991, with 2000 observations. The series are shown in Figure 1. It is well known that daily stock return series tend to have weak dynamic dependence, but strong conditional heteroscedasticity, making them suitable for the proposed test. Our goal here is to provide an illustration of specifying a vector ARMA

s ~1 :-~"'~~T~~\"~-I
· ~ 1~~'~'-T~~~'-~~'~ I

, ~1 -· ~· ~~--:·~
Figure 1: Time series of Amoco, IBM and Merck stocks daily return
I

(2/2/1985-12/31 /1991 ).

model with GARCH innovations rather than a thorough analysis of the term
structure of stock returns.
Denote the return series by Z_t = (Z_{1t}, Z_{2t}, Z_{3t})' for the Amoco, IBM, and
Merck stocks, respectively. Following the order specification procedure of
Section 2.2, we apply the proposed test of Eq. (9), denoted by T*, to the data
and summarize the test results in Table 2. We also include the test statistic
T of Eq. (7) for comparison purposes. The past vector P_t is determined by
the AIC as P_t = (Z'_{t-1}, Z'_{t-2})'. The p-value is based on a χ²_{ks-f+1} test, where
k = 3, s = 2, and f = dim(F_t).
From Table 2, the proposed test statistic T* identified {1, 1, 1} as the
Kronecker indexes for the data, i.e. K_i = 1 for all i. On the contrary, if
one assumes that there are no GARCH effects and uses the test statistic T,
then one would identify {1, 1, 2} as the Kronecker indexes. More specifi-
cally, the T statistic specifies K_1 = K_2 = 1, but finds the smallest canonical
correlation between F_t = (Z_{1,t}, Z_{2,t}, Z_{3,t}, Z_{3,t+1})' and P_t to be significant
at the usual 5% level. To determine K_3, one then needs to consider the canon-
ical correlation analysis between F_t = (Z_{1,t}, Z_{2,t}, Z_{3,t}, Z_{3,t+1}, Z_{3,t+2})' and
the past vector P_t. The corresponding test statistic is T = 4.05, which is
insignificant with p-value 0.134 under the asymptotic χ²_2 distribution. There-
fore, without considering GARCH effects, the identified Kronecker indexes are
(K_1 = 1, K_2 = 1, K_3 = 2), resulting in an ARMA(2,2) model for the data.
Consequently, by correctly accounting for the GARCH effect, the proposed
test statistic T* was able to specify a more parsimonious ARMA(1,1) model
for the data. In summary, we entertain a vector ARMA(1,1) model with diago-
nal GARCH(1,1) innovations for the data. The estimated VARMA-GARCH
model is given below:


        [ .0   2.0***  .4  ]            [ -.1 ]          [ -.1   2.1***  .4   ]
  Z_t - [ .0   0.3     .2* ] Z_{t-1}  = [ -.0 ] + a_t +  [  .1   0.2     .2** ] a_{t-1}     (11)
        [ .1   0.9**   .1  ]            [ .1* ]          [  .2*  0.9**   .1   ]
where the superscripts *, **, and *** indicate significance at the 10%, 5% and
1% levels, respectively, and the volatility g_t = E(a_t² | F_{t-1}) follows the model

        [ 1.59 ]   [ .28  0    0   ]            [ .00  0    0   ]
  g_t = [ 0.23 ] + [ 0    .14  0   ] a²_{t-1} + [ 0    .76  0   ] g_{t-1}
        [ 0.05 ]   [ 0    0    .06 ]            [ 0    0    .91 ]
where all estimates except the (1,1)th element of the coefficient matrix of
g_{t-1} are significant at the 1% level. Model checking shows that the fitted
model appears to be adequate in handling the serial dependence in the data.

Future subvector F_t                            sm. can. cor.   T*      d.f.   p-value   Remark     T
(Z_{1,t})                                       .130            33.96   6      0                    33.96
(Z_{1,t}, Z_{2,t})                              .116            26.97   5      0                    26.97
(Z_{1,t}, Z_{2,t}, Z_{3,t})                     .101            20.68   4      0                    20.68
(Z_{1,t}, Z_{2,t}, Z_{3,t}, Z_{1,t+1})          .051             5.59   3      .13       K_1 = 1     5.95
(Z_{1,t}, Z_{2,t}, Z_{3,t}, Z_{2,t+1})          .032             1.52   3      .68       K_2 = 1     4.48
(Z_{1,t}, Z_{2,t}, Z_{3,t}, Z_{3,t+1})          .055             5.98   3      .11       K_3 = 1    11.38

Table 2: Model specification for the three daily stock returns.

References
[1] Akaike H. (1976). Canonical correlation analysis of time series and the
use of an information criterion. In Systems Identification: Advances and
Case Studies, eds R.K. Mehra and D.G. Lainiotis. New York: Academic
Press, 27-96.
[2] Berlinet A., Francq C. (1997). On Bartlett's formula for non-linear pro-
cesses. Journal of Time Series Analysis 18, 535-552.
[3] Bollerslev T. (1986). Generalized autoregressive conditional heteroscedas-
ticity. Journal of Econometrics 31, 307-327.
[4] Bollerslev T., Engle R.F., Nelson D.B. (1994). ARCH models. In Handbook
of Econometrics IV. Elsevier Science B.V., 2959-3038.
[5] Box G.E.P., Jenkins G.M. (1976). Time series analysis: forecasting and
control. San Francisco, CA: Holden-Day.
[6] Box G.E.P., Tiao G.C. (1977). A canonical analysis of multiple time
series. Biometrika 64, 355-365.
[7] Cooper D.M., Wood E.F. (1982). Identifying multivariate time series
models. Journal of Time Series Analysis 3, 153-164.

[8] Dunsmuir W., Hannan E.J. (1976). Vector linear time series models.
Advances in Applied Probability 8, 339-364.
[9] Engle R.F. (1982). Autoregressive conditional heteroscedasticity with es-
timates of the variance of U.K. inflation. Econometrica 50, 987-1008.
[10] Engle R.F., Granger C.W.J. (1987). Co-integration and error-correction:
representation, estimation and testing. Econometrica 55, 251-276.
[11] Engle R.F., Kroner K.F. (1995). Multivariate simultaneous generalized
ARCH. Econometric Theory 11, 122-150.
[12] Hannan E.J., Deistler M. (1988). The statistical theory of linear systems.
John Wiley, New York.
[13] Hotelling H. (1936). Relations between two sets of variates. Biometrika
28, 321-377.
[14] Min W.L., Tsay R.S. (2004). On canonical analysis of multivariate time
series. Working paper, GSB, University of Chicago.
[15] Quenouille M.H. (1957). The analysis of multiple time series. London:
Griffin.
[16] Romano J.P., Thombs L.A. (1996). Inference for autocorrelations under
weak assumptions. Journal of the American Statistical Association 91,
590-600.
[17] Tiao G.C., Tsay R.S. (1989). Model specification in multivariate time
series (with discussion). Journal of the Royal Statistical Society, Ser. B
51, 157-213.
[18] Tsay R.S. (1989a). Identifying multivariate time series models. Journal
of Time Series Analysis 10, 357-371.
[19] Tsay R.S. (1989b). Parsimonious parametrization of vector autoregres-
sive moving average models. Journal of Business and Economic Statistics
7, 327-341.
[20] Tsay R.S. (1991). Two canonical forms for vector ARMA processes. Sta-
tistica Sinica 1, 247-269.
[21] Tsay R.S. (2002). Analysis of financial time series. John Wiley: New
York.
[22] Tsay R.S., Tiao G.C. (1985). Use of canonical analysis in time series
model identification. Biometrika 72, 299-315.
[23] Wu W.B. (2003). Empirical processes of long-memory sequences.
Bernoulli 9, 809-831.

Acknowledgement: We thank Dr. G. Tunnicliffe-Wilson for helpful comments
and the U.S. National Science Foundation and the Graduate School of Business,
University of Chicago, for partial financial support.
Address: W. Min, R.S. Tsay, Graduate School of Business, University of
Chicago, 1101 East 58th Street, Chicago, IL 60637, U.S.A.
E-mail: ruey.tsay@gsb.uchicago.edu

LEARNING STATISTICS
BY DOING OR BY DESCRIBING:
THE ROLE OF SOFTWARE
Erich Neuwirth
Key words: Statistical computing, statistics education, teaching statistics.
COMPSTAT 2004 section: Teaching statistics.

Abstract: The paper discusses several key questions connected with the teach-
ing and learning of statistics. Among the problems covered are: whom to
teach, what type of presentation to choose, and how and to what extent to use
computers.

1 Teaching statistics: for whom?

Statistics is possibly the discipline used by most nonspecialists as part of their
work. Psychologists, medical doctors, journalists, and people from many
more fields all use statistics, or at least have to be able to interpret
statistical data. At election times, newspapers and TV report about opinion
polls, and most of the public has problems in judging the reliability of forecasts
for the election based on samples. So the need to educate a rather broad
audience in statistics is generally accepted.
When discussing statistics education under these aspects, it is clear that
we have to face different audiences with different statistical needs, and we also
have to take into account quite different levels of formal training outside of
statistics.

Statistics education may target the following knowledge levels:

• Basic statistical knowledge: understanding simple statistical summaries
  and graphs, numeracy.

• Basic statistical skills: selecting appropriate simple statistical methods
  for one's own analyses, ability to immediately identify misuses of statistics.

• Advanced statistical knowledge: understanding complex methods, es-
  pecially multivariate analytical and graphical methods.

• Advanced statistical skills: selecting appropriate complex methods and
  understanding their role in gaining insights.

We also need to distinguish the level of presentation for statistics education:

• No formal prerequisites, just data as numbers and graphs.



• Basic mathematical knowledge and skills, simple algebraic formulas ad-
  missible as tools for explaining.

• College-level mathematical background.

Finally, the level of computer expertise of the educatees also plays an
important role in designing courses and activities for statistics education.

2 Demographical modelling: a success story

Let us begin with a success story. In Austria, as in many other countries,
there is an ongoing discussion about different options for financing the retire-
ment system. At http://sunsite.univie.ac.at/Projects/demography/
we have published a manipulable statistical model forecasting the popula-
tion's age structure for Austria for the next 30 years. The model is imple-
mented as an Excel sheet, and it looks like this:

[Screenshot of the Excel model: age distribution of the population (census
results), Austria 1991; percentage of the 1991 population: 100.0%; sliders for
the first year of retirement age (60) and the first year of workforce age (20).]

The most important details in this model are the "sliders"; they allow the
user to change the graph dynamically. The horizontal slider turns the graph into
a movie. The graph always displays the population pyramid for a given year;
when the slider is moved, the year changes and the change of the age structure
becomes dynamically visible.
The other sliders allow the user to change different model parameters like retire-
ment age, and will immediately display changes in the system resulting from
changes in the parameters. The model also allows the use of data from different
countries (currently we have Austria, Germany, the USA, and Japan) to analyze
how different population structures can become.

This model can be used in two ways:


• As a ready made tool for experiments in a given framework, one might
say as a demographical microworld.
This is the "consumer mode" for the model.
• As a project to be developed by the learner. This is the "producer
mode" for the model.
A nice story illustrates how the model was used for statistical education
in "consumer mode". The author received an email message from a member
of the Austrian parliament, essentially stating that the MP had found the
model accidentally when browsing the web. Being involved in discussions
about retirement legislation, he started playing with it and found
that he could easily analyze some consequences of changes in retirement laws.
The final statement was: "now I understand the problem much better".
The author regularly teaches a course about computer-based demographic
modelling for sociology students. In this course, the students are shown the
model at the beginning. Then there are two days of intensive computer-based
modelling, and at the end all the students are able to implement the model
themselves. They are also guided towards further investigations, e.g. the in-
fluence of changing birth rates on demographic developments. They imple-
ment different scenarios and study possible changes with hands-on modelling
and parameter variation. The students really enjoy this course because they
finish with the feeling that they have acquired knowledge and skills allowing
them to add statistical modelling to their personal toolkit.

3 Learning statistics for data analysis: how?


The didactic success of the demography model just described is very much
tied to information technology. The finished version can be downloaded from
the web; the user only needs Microsoft Excel on the computer. So a very
widely used tool serves as the computational infrastructure of the model. This
also carries an additional important message: serious statistical modelling can
be done with software available on almost any desktop computer; quite often
there is no immediate need for highly specialized software for models of higher
complexity.
When the model is used in producer mode, the statistical and mathemat-
ical theory for the model is not too complicated. Mathematically speaking,
this is a simple linear first-order difference equation. It may, however, be
described using only basic arithmetic. Since it is implemented in Excel and
since the students know Excel already, the important message is that seri-
ous modelling can be done with widely available general-purpose software.
As in the consumer-mode use of the model, a case is made for modelling as
a mental process and not a function of highly specialized software.
Using spreadsheet programs also has another important didactical aspect.
Spreadsheets always display the data; data are not hidden. One of the most
important concepts of statistics is the data matrix, also called the data frame.
In a spreadsheet, the data are always visible, and it becomes a very physical
experience that doing statistics is operating on data. This fact is much more
obscured when a statistical programming language like S, R, SPSS, or SAS
is used as the basic tool in statistics courses.
The main difference between the spreadsheet approach and the statistical
programming language approach might be characterized as direct manipu-
lation vs. description. The programming language approach is much more
formula based; the data are not as omnipresent as in the spreadsheet ap-
proach. For introductory statistics courses, this constant reminder that "statis-
tics is about data" can be quite helpful. Many students, after their first course
of non-computer-based statistics, have the impression that statistics is about
certain types of formulas, and not so much about data. Programming languages
still somewhat support this mindset, whereas the spreadsheet approach re-
ally emphasizes the data analysis point of view. More topics about modelling
with spreadsheets can be found in [6].
The direct manipulation approach is not solely restricted to spreadsheets.
Programs like Fathom (available from Key Curriculum Press) also emphasize
the "manipulate the data with the mouse" approach as opposed to the "write
a program to manipulate the data" approach.
Spreadsheets are not the answer to all statistical problems. Excel has
some flaws concerning statistics. The most inconvenient ones are some in-
accuracies with distribution functions, the not too high quality of its random
number generators, inconsistent handling of missing data, and the unavail-
ability of some of the most important types of statistical graphs (like histograms
with unequal bin widths).
Therefore, it makes sense to use a more advanced statistical toolbox than
just a spreadsheet program. This does not, however, imply that the spread-
sheet paradigm has to be thrown overboard. The RExcel program (part of the
R COM server project accessible at http://sunsite.univie.ac.at/rcom/
and described in [5]) allows one to use practically all the functionality of R from
within Excel. This way, the student can still operate on the data with the
direct manipulation method, but use statistical methods not available from
the spreadsheet program alone.
This also demonstrates an important message about software in general:
software should adapt to the user's needs. If possible, one should not be
forced to switch programs; it is better if a standard package can be enhanced
by extending its functionality.
RExcel is not the only statistical extension of Excel. PopTools (available
from http://sunsite.univie.ac.at/Spreadsite/poptools) also is an ex-
ample of how additional statistics functions can be integrated into the spread-
sheet paradigm.
Statistical graphics is another extremely important concept to be dis-
cussed in the context of statistics education. [1] and [9] make a very con-
vincing case for graphical methods. The statistics package R (available
from http://www.r-project.org) comes with many data sets, including data
about the age, sex, class and survival of the Titanic passengers and crew. Quite
a few statistics teachers investigate this data set with mosaic plots (and with-
out any formulas visible to the students). Again, this illustrates the point we
already made: statistics should help in gaining insights from data, and not be
a way of just applying formulas to data. Similarly, trellis plots are a relatively
new technique for multivariate analysis using arrays of graphs arranged
according to statistical variables.
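As a small illustration of this kind of formula-free exploration, the following
R lines (a sketch; the particular plot options are our choice, not prescribed by
the text) draw mosaic plots of the Titanic data shipped with R:

   data(Titanic)                       # 4-way table: Class x Sex x Age x Survived
   mosaicplot(Titanic, color = TRUE,   # tile areas proportional to the counts
              main = "Survival on the Titanic")
   mosaicplot(~ Class + Survived,      # condition on class and survival only
              data = Titanic, color = TRUE)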
So far, we have only discussed software supporting statistical education
running on desktop or notebook computers. Additionally, there is a whole
range of web sites for statistics education, offering course material and applets
for experimenting. http://wise.cgu.edu/ offers a good overview of such
sites.
Some of these sites are just online resources, not offering much more than
printable static material to support statistics courses. The more interactive
sites follow a philosophy similar to the one exemplified by consumer mode
use of our demography example. They offer the students opportunities to
analyze data interactively. Projects like the XploRe eBooks (available from
http://www.xplore-stat.de) combine the printed material approach and
the applet approach by directly embedding applets into electronically dis-
tributed static course materials.
One of the central problems of teaching statistics mostly as a data anal-
ysis course is to find data which are interesting to analyze for students. For
this purpose, the WWW is a really powerful resource. The Journal of Statis-
tics Education at http://www.amstat.org/publications/jse/ has an ex-
tensive collection of data especially selected for educational purposes, and
StatLib (at http://lib.stat.cmu.edu) has a large collection of datasets
cited in the statistical literature, especially in textbooks for introductory
statistics.
All these datasets have the disadvantage that the students "do not con-
nect" with them. For the last 10 years, the author has therefore collected data
from his students with a questionnaire and used these data throughout the
statistics courses. The questions are what one would expect: subject area,
weight, height, height of parents, grades in some school subjects, and so on.
The advantage of using this data set is that in each analysis each student sees
his or her place in the result, and therefore feels that he or she has learned
something about a group he or she belongs to. In the author's experience the
students become quite interested in the final report they have to produce, and
sometimes they take up the challenge of designing statistical questions which
can be analyzed with this data set.
Information technology plays an important role in collecting these data
quickly. If the group is small enough, a Palm handheld with ques-
tionnaire software (Pendragon Forms from Pendragon Software) is used to
collect the data in the classroom. At the end of the class period, the hand-
held is connected to a notebook computer, the data are transferred, and then
a first step of the analysis can immediately be performed in front of the stu-
dents. The message of doing it this way is that collecting data can be set up
quite conveniently, and therefore, with good planning, statistics can be used
very quickly. For larger classes, a browser-based questionnaire is used. As part
of this project, students also start asking questions about the privacy of their
data and so are exposed to the problems of collecting data through their own
experience as part of the course.
All the projects and tools so far mostly have been concerned with analyz-
ing data. An important area in statistics education we have not considered
yet is probability. This is the topic of the next section.

4 Learning probability for statistics


As most statistics teachers have experienced, probability is important as one
of the foundations of statistics, but it is rather hard to teach if the students
are supposed to learn more than just a few formulas. One of the main prob-
lems is that students may misunderstand probability as a somewhat strange
packaging of combinatorics. Information technology in this case allows us to
add something which is not so easy without computers: experiments through
Monte Carlo simulation. Chapter 7 in [6] demonstrates the basic techniques
of such simulations with spreadsheets. Again, the important message is that
this can be done with readily available software.
The danger when using a Monte Carlo approach to teach probability is
that students only learn that "computer generated randomness" behaves like
probability theory predicts, and do not connect this with "everyday" random-
ness. Therefore, it is very important to perform experiments using physical
randomness with a device like a Galton Board (sometimes called Quincunx)
and then build a Monte Carlo simulation for the same phenomenon. Com-
paring the outcomes of "real" randomness and simulated randomness can con-
vince the students that computer simulations are close enough to reality, and
therefore problems which are more or less inaccessible to real experiments
can be studied with Monte Carlo simulations.
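As a concrete illustration of such a simulation (written here in R rather than
a spreadsheet; the numbers of balls and pin rows are arbitrary choices for the
sketch), one can simulate a Galton board and compare the simulated bin
frequencies with the binomial probabilities that theory predicts:

   set.seed(1)
   n.balls <- 5000
   n.rows  <- 10                                       # rows of pins on the board
   bins <- rbinom(n.balls, size = n.rows, prob = 0.5)  # right turns of each ball
   observed <- table(factor(bins, levels = 0:n.rows)) / n.balls
   expected <- dbinom(0:n.rows, size = n.rows, prob = 0.5)
   round(rbind(observed = as.numeric(observed), expected = expected), 3)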
A software category we have not discussed at all so far is CAS, Computer
Algebra Systems. The best-known programs in this category are Mathemat-
ica, Maple, MuPAD and Derive. There are special toolkits for doing statistics
and probability with CAS; see for example [2] and [8]. The approach there is
somewhat different from the spreadsheet approach. The CAS program is used
as a specialized programming language, and the experiments are performed
by using custom-made functions in this programming language.
Monte Carlo simulations can be considered as computer implementations
of the law of large numbers. A difficult topic when dealing with probabil-
ity is the relation between the law of large numbers and the central limit
theorem. Using computers for both Monte Carlo simulations and numerical
calculations of the probabilities for sums of independent random variables allows
us to connect numerical-analytical models with simulated randomness and show
that probability is able to model randomness reasonably well.
Once the trust in simulations is built, they can be used to empirically ver-
ify facts about statistical tests and confidence intervals. Without computers,
it is practically impossible to illustrate concepts like the errors of first and
second kind of a test and confidence levels of confidence intervals empirically.
Monte Carlo simulations once again allow us to study the empirical error
rates of simulated tests and compare them with the theoretical values .
Sampling also is a very important concept in statistics. In Monte Carlo
simulations, the machinery in the background produces a sequence of num-
bers. It does not select from a given set, it produces a new number each
time it is asked for one . We might say that the random number generator
is spitting out an infinite sequence of random numbers. When sampling is
investigated, it is very helpful if for experimental activities we can see the
whole sample space and then select the sample from this set. Spreadsheets
allow us to make this process very visual. From a didactical point of view, it
seems very important to clearly model the process of selecting from a given
well defined finite set and not blur the lines to the production of random
numbers by some unpredictable machinery.
When probability is studied, combinatorics also has to be investigated.
The relationship between probability and randomness is the equal probability
assumption. This is something that cannot be proved analytically. Therefore,
helping to build trust in the assumption is very important for the learner.
Monte Carlo experiments can play a key role in that. In this area, computers
can be used not only for simulations; they can also play an important role in
better understanding combinatorics.
Just read the following description: Let us build a table. The first column
is filled with 1s. The rest of the first row is filled with 0s. All the other cells
contain the sum of the number above and the number above and to the left.
This is a complete and completely operational description of the binomials.
This description is not only a description, it is the complete instruction
for computing the binomials with a spreadsheet. Additionally, it tells us that each
number in each row migrates down into the next row exactly twice, once
vertically and once diagonally. Therefore, row sums double from row to row,
and this description contains the proof of the fact that the row sums of the
binomials are the powers of 2.
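As a sketch of the same construction outside a spreadsheet (the table size of
8 rows is an arbitrary choice here), the description translates directly into a
few lines of R:

   n <- 8
   tab <- matrix(0, nrow = n, ncol = n)
   tab[, 1] <- 1                                  # first column filled with 1s
   for (i in 2:n)
     for (j in 2:n)                               # cell = above + above-and-left
       tab[i, j] <- tab[i - 1, j] + tab[i - 1, j - 1]
   tab              # row i holds the binomial coefficients choose(i - 1, j - 1)
   rowSums(tab)     # 1, 2, 4, 8, ... : the row sums double, the powers of 2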
Expressing this more formally, the binomials can be described by a two-
term recursion. It turns out that this kind of recursion covers most of the
combinatorics problems needed for basic probability models. Therefore, the
table approach to combinatorics covers most of the ground needed in an
introductory course. Once again, the readily available spreadsheet tool can
be used to analyze structures and to help understand concepts, not just as
a more convenient kind of pocket calculator.

5 Some final thoughts

Statistics and probability have their origin in methodology for analyzing em-
pirical data and gaining insights. So at the beginning of these subjects, there
often is an experiment. Without computers, it is very hard to create this
experiment-based situation as a general setting. Some example highlights are
possible, but overall, statistics courses without computers are paper-and-pen-
cil based theory courses (or not too interesting computation courses for very
small data sets). With computers, we can analyze real or at least realistic
data sets, and we can study probability with an experimental approach as well.
Therefore, for many learners who are not mostly interested in theory but
in methods they can apply in their daily lives, this approach is much more
promising than computer-free statistics. As a consequence, it might still be
reasonable to avoid computers as an aid to learning in some specialized areas
of statistics. But overall, information technology allows us to make statistical
concepts and methods both more accessible and more useful for a very wide
audience.

References
[1] Friendly M. (2000). Visualizing categorical data. SAS Institute, 2000.
[2] Hastings K. (2000). Probability with Mathematica. Lewis Publishers.
[3] Neuwirth E. (2002). Recursively defined combinatorial functions: extend-
ing Galton's board. Discrete Math. 239, 33-51.
[4] Embedding R in standard software, and the other way round.
In Hornik K., Leisch F. (eds.), DSC 2001 Proceedings,
http://www.ci.tuwien.ac.at/Conferences/DSC-2001
[5] Neuwirth E., Baier T. (2001). Embedding R in standard software, and the
other way round. In Hornik K., Leisch F. (eds.), DSC 2001 Proceedings,
http://www.ci.tuwien.ac.at/Conferences/DSC-2001
[6] Neuwirth E., Arganbright D. (2003). The active modeler: mathematical
modeling with Excel. Brooks-Cole.
[7] Neuwirth E. Probabilities, the US electoral college, and generating func-
tions considered harmful. To appear in International Journal of Comput-
ers for Mathematical Learning.
[8] Rose C., Smith D. (2002). Mathematical statistics with Mathematica.
Springer Verlag.
[9] Tufte E. (2001). The visual display of quantitative information. Graphics
Press.

Acknowledgement: Thanks to Jaromir Antoch and Marlene Müller for their
patience.
Address: E. Neuwirth, University of Vienna, Austria
E-mail: erich.neuwirth@univie.ac.at

EMBEDDING METHODS
AND ROBUST STATISTICS
FOR DIMENSION REDUCTION
George Ostrouchov and Nagiza F. Samatova
Key words: Dimension reduction, convex hull, FastMap, principal compo-
nents, multidimensional scaling, robust statistics, Euclidean distance.
COMPSTAT 2004 section: Dimensional reduction.

Abstract: Recently, several non-deterministic distance embedding meth-
ods that can be used for fast dimension reduction have been proposed in
the machine learning literature. These include FastMap, MetricMap, and
SparseMap. Among them, FastMap implicitly assumes that the objects are
points in a p-dimensional Euclidean space. It selects a sequence of k ≤ p or-
thogonal axes defined by distant pairs of points (called pivots) and computes
the projection of the points onto the orthogonal axes. We show that FastMap
picks all of its pivots from the vertices of the convex hull of the data points in
the original implicit Euclidean space. This provides a connection to results
in robust statistics, where the convex hull is used as a tool in multivariate
outlier detection and in robust estimation methods. The connection sheds
new light on some properties of FastMap and provides an opportunity for
a robust class of dimension reduction algorithms that we call RobustMaps,
which retain the speed of FastMap and exploit ideas in robust statistics. One
simple RobustMap algorithm is shown to outperform principal components
on contaminated data both in terms of clean variance captured and in terms
of time complexity.

1 Introduction

Dimension reduction starts with n objects as points in a p-dimensional vector
space and maps the objects onto n points in a k-dimensional vector space,
where k < p. A more general situation arises when the point coordinates are
not known and only pairwise distances (or a distance function to compute
them) are available. This mapping of objects, based on their distances only,
into a k-dimensional vector space is called finite metric space embedding [8].
Several embedding methods and their properties are discussed in [8], includ-
ing FastMap, MetricMap, and SparseMap. The discussion centers mostly on
whether the embeddings are contractive, a property of importance in similar-
ity searching that guarantees no missed items. In this paper, we concentrate
on FastMap and its properties that connect the technique to ideas in robust
statistics.
FastMap is first introduced in [6] as a fast alternative to Multidimen-
sional Scaling (MDS) [14] and a generalization of Principal Component Anal-

ysis (PCA) [9]. Given dimension k and Euclidean distances between n ob-
jects, FastMap maps the objects onto n points in k-dimensional Euclidean
space. An implicit assumption by FastMap that the objects are points in
a p-dimensional Euclidean space (p ≥ k) is noted in [8]. Because of this
assumption, FastMap is usually viewed as a dimension reduction method.
When FastMap begins with Euclidean distances between the n objects, it
has time complexity O(n). If the Euclidean distances must be explicitly com-
puted from a p-dimensional vector representation, FastMap time complexity
is O(np).
We show how FastMap operates within the implicit or explicit p-di-
mensional Euclidean space containing the points of a data set. FastMap
selects a sequence of k ≤ p orthogonal axes defined by distant pairs of points
(called pivots) and computes the projections of the points onto the orthogonal
axes. We show that FastMap picks all of its pivots from the convex hull vertices
of the original data set. This provides a connection to results in robust statistics,
where the convex hull is used as a tool in multivariate outlier detection and
in robust estimation methods. The connection sheds new light on some
properties of FastMap, in particular its sensitivity to outliers, and provides
an opportunity for a new class of dimension reduction algorithms that retain
the speed of FastMap and exploit ideas in robust statistics.
We begin in Section 2 by defining the convex hull and some of its prop-
erties. In Section 3 we describe the FastMap algorithm. The main result,
showing that FastMap pivots are pairs of vertices of the convex hull, is in
Section 4. Section 5 discusses the implications of this result and finally Sec-
tion 6 presents an algorithm, RobustMap, that results from these implica-
tions. Some further comments and conjectures about connections to QR and
QLP factorizations [13] are also made.

2 Convex hull of a data set

Let S be a set of n points in p-dimensional Euclidean space. The convex hull
of S, denoted by C(S), is the smallest convex set (a polytope) that contains S
[5], [7]. We can visualize a convex hull in two or three dimensions as a rubber
band or an elastic bag stretched around the points. In higher dimensions,
we must rely on more formal properties of hyperplanes, and the notion of
half-space support. Our definitions below are mostly from [5], [7].

Definition 2.1. A hyperplane is an affine subspace (a translation of a linear
subspace) of R^p with dimension p - 1.

The set of points

   h(u, v) = {x ∈ R^p : (u - v)ᵀ(x - v) = 0}, for u, v ∈ R^p,          (1)

is a hyperplane perpendicular to the vector u - v and passing through v. The
closed half-space that is defined by this hyperplane and that contains u is
given by

   H(u, v) = {x ∈ R^p : (u - v)ᵀ(x - v) ≥ 0}, for u, v ∈ R^p.          (2)

Definition 2.2. If S intersects h(u, v) and S lies in H(u, v) for some u, v ∈
R^p, then h(u, v) is a supporting hyperplane of S and H(u, v) is a supporting
half-space of S.
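The definition can be checked numerically; the following small R function (a
sketch with an assumed name and tolerance, not part of the paper) tests
whether h(u, v) is a supporting hyperplane of a finite point set S given as the
rows of a matrix:

   is_supporting <- function(S, u, v, tol = 1e-8) {
     # value of (u - v)'(x - v) for every row x of S
     d <- as.matrix(S) %*% (u - v) - sum((u - v) * v)
     all(d >= -tol) && any(abs(d) <= tol)   # S lies in H(u, v) and touches h(u, v)
   }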

We use Ziegler's [15, Section 2.1] definition of a face of a polytope and
state it in terms of a supporting hyperplane.

Definition 2.3. A face of a polytope C(S) is any set of the form

   C(S) ∩ h(u, v),

where h(u, v) is a supporting hyperplane of S for some u, v ∈ R^p. Further,
for a p-dimensional polytope, facets are (p-1)-dimensional, ridges are (p-2)-
dimensional, edges are 1-dimensional, and vertices are 0-dimensional.

The above characterization of a vertex as a single point (a 0-dimensional
face) of C(S) that lies in the supporting hyperplane will be used in Section 4
to link FastMap pivots to vertices of the convex hull.

3 FastMap overview

Given the Euclidean distance between any two points (objects) of S, k it-
erations of FastMap produce a k-dimensional (k ≤ p) representation of S.
Each iteration selects from S a pair of points, called pivots, that define an
axis and computes coordinates of the S points along this axis. The pairwise
distances for S can then be updated to reflect a projection of S onto the
subspace (a hyperplane passing through the origin) orthogonal to this axis.
The next iteration implicitly operates on the projected S in the subspace.
However, these projections are accumulated and jointly performed only for
the distances that are needed. In this manner, after k iterations, the S points
end up with k coordinates giving their k-dimensional representation.
To provide details of the FastMap algorithm, we first introduce some
notation. Let (a_i, b_i) be the pair of pivot elements from S at iteration i. Let
d_i(x, y) be the Euclidean distance between points x and y of S after their
ith projection onto a pivot-defined hyperplane, so that d_0(x, y) is the initial
Euclidean distance. Also, let x_i be the ith coordinate of x in the resulting
k-dimensional representation of x ∈ S.
Pivot elements are chosen by the Choose-distant-objects heuristic shown in
Fig. 1. Initially, i = 0. After selecting a pivot pair (a_i, b_i), the ith coordinate
of each point x ∈ S is computed as

   x_i = [d_{i-1}(a_i, x)² + d_{i-1}(a_i, b_i)² - d_{i-1}(b_i, x)²] / [2 d_{i-1}(a_i, b_i)].     (3)

Choose-distant-objects ( S, d_i(·,·) )

1. Choose an arbitrary object s ∈ S.

2. Let a_{i+1} be the a ∈ S that maximizes d_i(a, s).

3. Let b_{i+1} be the b ∈ S that maximizes d_i(b, a_{i+1}).

4. Report a_{i+1} and b_{i+1} as the distant objects.

Figure 1: Choose-distant-objects heuristic for iteration i.

This projection is based on the law of cosines and current distances from the
two pivot points. The distances are updated whenever needed in Choose-
distant-objects or in (3). An update for a single iteration is presented in [6]
and we extend this in [1] to a combined update

   d_i²(x, y) = d_0²(x, y) - Σ_{j=1}^{i} (x_j - y_j)².                    (4)

This is based on the Pythagorean theorem and the sequence of i projections
onto hyperplanes perpendicular to the pivot axes.
There are k iterations, each requiring O(n) distance computations of O(p).
The resulting total time complexity is O(npk). Note that if all the original
distances are already available, the total time complexity is O(nk²) due to
the sum in (4). If k is a small constant compared to n and p, as is usually
the case, k is dropped from the above complexity statements, giving those we
provided in the Introduction.
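For concreteness, the following is a minimal sketch of these iterations in R,
assuming the objects are given as rows of a numeric matrix X so that the
Euclidean distances are computed explicitly; the function name and the use of
the first object as the arbitrary starting point are our own choices:

   fastmap <- function(X, k = 2) {
     n <- nrow(X)
     coords <- matrix(0, n, k)
     # squared distance after removing the coordinates extracted so far, Eq. (4)
     d2 <- function(x, y) sum((X[x, ] - X[y, ])^2) - sum((coords[x, ] - coords[y, ])^2)
     for (i in seq_len(k)) {
       s <- 1                                    # arbitrary starting object
       a <- which.max(sapply(1:n, d2, y = s))    # most distant from s
       b <- which.max(sapply(1:n, d2, y = a))    # most distant from a
       dab <- d2(a, b)
       if (dab <= 0) break                       # remaining distances are all zero
       # law-of-cosines projection onto the axis of the pivot pair (a, b), Eq. (3)
       coords[, i] <- sapply(1:n, function(x) (d2(a, x) + dab - d2(b, x)) / (2 * sqrt(dab)))
     }
     coords
   }
   # Example: a 2-dimensional FastMap representation of the iris measurements
   # head(fastmap(as.matrix(iris[, 1:4]), k = 2))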

4 FastMap and vertices of the convex hull

Here we prove the main result of this paper, namely that all pivot points are
selected from the vertices of the convex hull of the data set. We do this in two
steps. First we show that the Choose-distant-objects heuristic pivot pair is
a pair of convex hull vertices within the current working subspace. Then we
show that if a point is a vertex in a subspace projection, it is also a vertex
in the original p-dimensional space.
The Choose-distant-objects heuristic first takes an arbitrary point b ∈ S
and finds a ∈ S, the most distant point from b. Because a is the most distant
point in S from b,

   (s - b)ᵀ(s - b) ≤ (a - b)ᵀ(a - b),  ∀ s ∈ S.                         (5)

Now, for any point s ∈ S distinct from a, we have


   0 < (s - a)ᵀ(s - a)
     = (s - b + b - a)ᵀ(s - b + b - a)
     = (s - b)ᵀ(s - b) + 2(s - b)ᵀ(b - a) + (b - a)ᵀ(b - a)
     ≤ 2(s - b)ᵀ(b - a) + 2(b - a)ᵀ(b - a)                  by (5)
     = 2(s - b + b - a)ᵀ(b - a)
     = 2(s - a)ᵀ(b - a).                                                (6)

If we add s = a in (6), we have

   0 ≤ (s - a)ᵀ(b - a),  ∀ s ∈ S,

which defines a supporting half-space H(a, b) for all points in S. Since a is
the only point in the supporting hyperplane h(a, b) of S, it must be a single-
point face of C(S). This, by Definition 2.3, is a vertex of C(S).
Next, the Choose-distant-objects heuristic finds the point in S most distant
from a. By the same argument this is again a vertex of C(S). We state this
as a lemma.
as a lemma.

Lemma 4.1. A single application of the Choose-distant-objects heuristic to
a set of points S returns a pivot pair of points that are among the vertices of
C(S).

After choosing a pair of vertices, FastMap projects the set S into a sub-
space orthogonal to the vector defined by the pivot pair (a, b) and repeats the
Choose-Distant-Objects heuristic in the subspace of dimension p - 1. Pivot
pairs and projections are computed until suitably many orthogonal vectors
are extracted to be used as the principal axes of the lower dimensional rep-
resentation of S . So far, we have shown that a pivot pair is a pair of convex
hull vertices within its current working subspace. Are they all also vertices
of C(S) in the original space? The answer is yes, subject to a uniqueness
caveat requiring that no pair of points (except the current pivot points) get
projected onto the same point. Assuming that the points S are in sufficiently
general position [15] takes care of this. Because we have a finite set of points,
we can perturb them by an arbitrarily small amount to achieve such a general
position. We show that a vertex in a subspace projection is a vertex in the
original p dimensional space.
Let P_H be a symmetric projection matrix into a subspace H ⊂ R^p and
let S_H = {P_H u : u ∈ S} be the set of image points of S in this subspace.
We also need to assume that S are in sufficiently general position so that all
vertices of C(S_H) are projections of distinct points of S.

Lemma 4.2. If P_H s is a vertex in the convex hull of S_H and S are in general
position, then s is a vertex in the convex hull of S.

Proof. Since P_H s is a vertex of C(S_H), by Definition 2.3

   {P_H s} = C(S_H) ∩ h(u, v),

where h(u, v) is a supporting hyperplane of C(S_H) for some u, v ∈ H. Be-
cause P_H s ∈ h(u, v), there is a u' ∈ H such that h(u, v) = h(u', P_H s). Now,
P_H s is the only point of S_H that is in the supporting hyperplane, so that

   (u' - P_H s)ᵀ(P_H x - P_H s) > 0

for all P_H x ∈ S_H distinct from P_H s. Because S are in general position,
P_H x ≠ P_H s for every x ∈ S distinct from s, so the inequality holds for all
such x. Then,

   (u' - P_H s)ᵀ[x - (I - P_H)x - s + (I - P_H)s] > 0

and

   (u' - P_H s)ᵀ(x - s) - (u' - P_H s)ᵀ(I - P_H)(x - s) > 0.

Since P_H(u' - P_H s) = (u' - P_H s) (because u' ∈ H), the second term vanishes
and

   (u' - P_H s)ᵀ(x - s) > 0,  ∀ x ∈ S distinct from s.

Equality holds for x = s, so s is the unique point of S on this supporting hyper-
plane and thus it is a vertex of the convex hull of S.                      □
Letting S_V ⊆ S be the vertices of C(S), Lemmas 4.1 and 4.2 lead to the
main result:

Theorem 4.1. FastMap pivot pairs are a subset of the vertices of the convex
hull of the data. That is,

   {a_i, b_i} ⊆ S_V,   i = 1, ..., k.

5 Implications

Convex hull computations in statistics are mostly associated with robust
multivariate estimation. Loosely, an estimator of some parameter is said
to be robust if it performs well even when the assumed model (implicit or
explicit) is not satisfied by the data. For example, when estimating a location
parameter, an implicit assumption is that the data are generated by one
process that has a location. If more than one process generated the data,
a robust estimator would still estimate the location of the dominant process
rather than some meaningless location between the processes. The median,
for example, is a robust estimator of location while the mean is not. A classic
reference on robust estimation is [11].
The concept of trimming extremes is often used in reducing dependence
on outliers in data [10]. Tukey is attributed with coining the term peeling as

the multivariate extension of trimming [10], where one peels off the vertices
of the convex hull before using the remaining points for estimating a location
parameter. This is based on a generalization of the simple practice of remov-
ing the maximum and minimum before computing the mean, which dates at
least to the early 19th century [10] . Here, with the aim of robustness, the
very points on which FastMap depends are discarded! Clearly, FastMap is
very sensitive to outliers in the data.
In situations where the data generation system is known to work smoothly,
such as machine generated data, outliers may not be of concern. For example,
we have recently found that in analyzing climate simulation and astrophysics
simulation data, methods that are sensitive to extremes often produce the
most compelling results. Here, the extremes are not outliers and may be of
most interest. On the other hand, massive data sets are often the result of
a long run with several checkpoint restarts where anomalies may occur. For
example, in [4], instrument-generated Atmospheric Radiation Measurement
data [2] contains many instrument restarts that appear as zeros in data with
high positive values. Although it is easy to discover these, an automated ap-
plication of FastMap would be driven by the zero coordinate outliers. Clearly,
there are situations where an extremes-sensitive method like FastMap is ap-
propriate or even preferable as well as situations where it will fail.
Outlier sensitivity of FastMap is mentioned in [8] and PCA is presented as
more robust. Although PCA is less sensitive to outliers than FastMap, it too
is not considered a robust technique. A measure of estimator sensitivity to
changes in extreme values of data is the notion of breakdown point [3]. Loosely
speaking, the breakdown point is the smallest proportion of data that needs
to be contaminated to make arbitrarily large changes to the estimator. By
this definition, the breakdown point of FastMap is 1/n, which is asymptotically
zero. Principal Components Analysis, the most popular dimension reduction
method, also has a breakdown point of 1/n. In both cases, taking one point
arbitrarily far in some direction will rotate the first axis in that direction.
Some robust PCA methods begin by computing a robust covariance matrix
estimate and then proceeding with standard PCA as usual. The classical example
of a high-breakdown estimator is the median, with a .5 breakdown point. That
is, half of the data must be moved to make an arbitrarily large change in the
median. A multivariate extension of the median is proposed in [12]. This
extension uses the notion of half-space support to define the depth of a data
point so that, ignoring ties, the point with maximal depth is the multivariate
median.
The main lesson from robust statistics is that the most distant points are
often not the best choice for defining a projection axis . The key to new fast
and robust methods is a replacement of the Choose-distant-object heuristic by
something that considers more than just the maximum distance from a point.
One should back-off a little from the maximum, while considering the entire
distance distribution. This distribution is already available within the O(np)

complexity. A closer examination, even with more complex algorithms such


as clustering, of the distance distribution tail can yield much more robust
results, still within the O(np) complexity. In fact, such methods will be
more robust than standard PCA. Clearly there are many directions that this
methodology can be taken and undoubtedly many such algorithms will be
proposed. We provide a simple example in the section that follows.
We would like to note another implication on an algorithm, DFastMap
[1], that we recently developed for fast dimension reduction across distributed
data sets. Our initial insights that led to DFastMap produced the main idea
for the present paper. Formalizing the convex hull connection to FastMap
gives an explanation of why an application of DFastMap to distributed data
performs as well as the serial FastMap on a centralized data set. The union
of local convex hull vertices necessarily includes all convex hull vertices of the
centralized data set. This assertion can be proved using arguments similar
to those we used in Section 4. DFastMap centralizes the pivots, arguably
a very good subset of the local convex hull vertices (see [1] for more details).
This provides a key subset of the combined data convex hull vertices so that
little information about extremes is lost when compared to centralizing all
the data.
Finally, we also mention an implication on the complexity of FastMap and
convex hull computations. Because all the FastMap projection axes are com-
puted from points in S_V, the convex hull vertices are sufficient for all distant-
point searches. Clearly FastMap could be faster if S_V were available. Erick-
son [5] reports that finding S_V by the "gift-wrapping" algorithm takes O(nf)
time, where f = |S_V| is the number of vertices. Since FastMap completes in
O(np) time, this is not helpful, as f > p for any non-degenerate data set.

6 A RobustMap algorithm

The FastMap algorithm computes all distances from one object but uses only
the maximum, resulting in an outlier-sensitive method. From a statistical
viewpoint, the distribution of the distances contains information on potential
outlier candidates. In essence, we are trimming the extremes of this distance
distribution. A complication is that two objects with a similar distance to
the reference object can be very far apart in the full p-dimensional space.
Selecting a small number of extreme objects and clustering them in the full
p-dimensional space can provide much more information on a robust choice
of a distant object. Keeping the selection of a few objects fast and their
number small lets us remain within the O(np) time complexity of FastMap.
We provide a simple variant of this idea. Take a constant number, say
r ≪ n, of the largest distances, cluster the corresponding objects, and choose a
central point of the largest cluster as a pivot. This affords protection against
a small number, about r/2, of outliers. Fig. 2 gives the Choose-distant-objects
heuristic for RobustMap. The parameter r can be some small number that
depends on the level of contamination we expect in the data. A second pa-

RobustMap: Choose-distant-objects ( S, d_i(·,·) )

1. Choose an arbitrary object s ∈ S.

2. Select the r largest distances in d_i(·, s).

3. Cluster the r corresponding objects.

4. Let a_{i+1} be the object nearest the center of the largest cluster.

5. Similarly, choose b_{i+1} as above, replacing s with a_{i+1}.

6. Report a_{i+1} and b_{i+1} as the distant objects.

Figure 2: RobustMap Choose-distant-objects heuristic for iteration i.

rameter controls the number of clusters. Clusters can be considered different


at some fixed percentage of the largest distance. Our prototype implementa-
tion in R uses single linkage clustering, where a distance of more than 10% of
the maximum distance implies a separate cluster.
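The following is a minimal sketch in R of this pivot choice, following the
description above. The use of hclust and cutree for single-linkage clustering,
the cut at 10% of the largest distance from the reference object, the default
r, and the function name are assumptions of the sketch, not the authors'
prototype code.

   robust_pivot <- function(X, s, r = 10) {
     # distances from the reference object s to all objects (rows of X)
     d <- sqrt(colSums((t(X) - X[s, ])^2))
     cand <- order(d, decreasing = TRUE)[1:r]          # the r most distant objects
     # single-linkage clustering of the candidates in the full p-dimensional space
     hc <- hclust(dist(X[cand, ]), method = "single")
     grp <- cutree(hc, h = 0.10 * max(d))              # split clusters at 10% of max distance
     big <- as.integer(names(which.max(table(grp))))   # label of the largest cluster
     members <- cand[grp == big]
     ctr <- colMeans(X[members, , drop = FALSE])       # center of that cluster
     members[which.min(colSums((t(X[members, , drop = FALSE]) - ctr)^2))]
   }

A full RobustMap then simply uses this pivot choice in place of the distant-
object steps of the FastMap heuristic.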
To test the behavior of RobustMap, we use the Longley data in R and
add an observation that blends the origin and the first observation. This is
a small data set, but it allows us to move the outlier in and out of the data
and quickly explore the behavior of RobustMap, PCA, and FastMap on the
contaminated data. To measure the effect of the contamination, we report
captured variability within the clean data, while giving the contaminated
data to the algorithm. Our reference is PCA on the clean data. Fig. 3 shows
typical results and we discuss how the outlier position and non-determinism
of RobustMap and FastMap affect the results.
As the outlier moves farther from the data, the FastMap and PCA lines
move together but remain well below RobustMap. This is reasonable, as
both are highly affected by outliers. The non-determinism of RobustMap
and FastMap does not change the order of the methods in Fig. 3 with Ro-
bustMap leading and FastMap coming last. Half of a 95% confidence interval
around RobustMap would roughly fill the distance between the Reference and
RobustMap.
Other, more complex and more robust approaches can consider multi-
variate distance distributions from two or more objects. Formal tests may
be developed on the basis of distributional assumptions for the objects and
derivations of the resulting distance distributions. At the same time, more
thorough testing is needed to explore aspects beyond capture of variability.
For example, RobustMap projections are different from PCA and can provide
alternate data views based on distance distributions.

[Grouped bar chart of the proportion of variability captured per axis by
Reference, RobustMap, PCA and FastMap; plot omitted.]

Figure 3: Proportion of clean variability captured by each component axis,
when presented with contaminated data. Reference is PCA on the clean data
only.

We also see some preliminary evidence that these methods are related to
pivoting strategies in QR factorization and the recent QLP factorization [13]
that provides a fast approximation to the Singular Value Decomposition.
Our prototype implementation of RobustMap and FastMap differs from the
original [6] by using Householder reflections applied to the rows, somewhat
like the QLP factorization. We conjecture that FastMap, RobustMap, and
their connection to the convex hull provide a geometric explanation for the
success of the QLP factorization and may be sources of new pivoting strategies
for QR factorization. This is another direction where these methods may
provide new insights.

References
[1] Abu-Khzam F.N., Samatova N., Ostrouchov G., Langston M.A., Geist
A. (2002). Distributed dimension reduction algorithms for widely dis-
persed data. In Parallel and Distributed Computing and Systems, ACTA
Press, 174-178.
[2] D.O.E. (1990). Atmospheric radiation measurement program plan.
Technical Report DOE/ER-0441, U.S. Department of Energy, Office of
Health and Environmental Research, Atmospheric and Climate Research
Division, National Technical Information Service, 5285 Port Royal Road,
Springfield, Virginia 22161.
[3] Donoho D.L., Huber P.J. (1983). The notion of breakdown-point. In
Bickel, Doksum, and Hodges (eds), Festschrift für Erich L. Lehmann,
Belmont, CA, Wadsworth, 157-184.
[4] Downing D.J., Fedorov V.V., Lawkins W.F., Morris M.D., Ostrouchov
G. (2000). Large data series: Modeling the usual to identify the unusual.
Computational Statistics & Data Analysis 32, 245-258.
[5] Erickson J. (1999). New lower bounds for convex hull problems in odd
dimensions. SIAM J. Comput. 28 (4), 1198-1214.
[6] Faloutsos C., Lin K. (1995). FastMap: A fast algorithm for indexing,
data-mining and visualization of traditional and multimedia datasets. In
ACM SIGMOD Conference, San Jose, CA, May 1995, 163-174.
[7] Gallier J.H. (2000). Geometric methods and applications for computer
science and engineering. Springer.
[8] Hjaltason G.R., Samet H. (2003). Properties of embedding methods for
similarity searching in metric spaces. IEEE Transactions on Pattern
Analysis and Machine Intelligence 25, 530-549.
[9] Hotelling H. (1933). Analysis of a complex of statistical variables into
principal components. J. Educ. Psych. 24, 417-441, 498-520.
[10] Huber P.J. (1972). Robust statistics: A review. Annals of Mathematical
Statistics 43 (4), 1041-1067.
[11] Huber P.J. (1981). Robust statistics. John Wiley & Sons, New York.
[12] Ruts I., Rousseeuw P.J. (1996). Computing depth contours of bivariate
point clouds. Computational Statistics & Data Analysis 23, 153-168.
[13] Stewart G.W. (1999). The QLP approximation to the singular value de-
composition. SIAM J. Sci. Comput. 20 (4), 1336-1348.
[14] Torgerson W.S. (1952). Multidimensional scaling I: Theory and method.
Psychometrika 17, 401-419.
[15] Ziegler G.M. (1995). Lectures on polytopes. Springer-Verlag.

Acknowledgement: Research sponsored by the Laboratory Directed Research
and Development Program of Oak Ridge National Laboratory (ORNL), man-
aged by UT-Battelle, LLC for the U.S. Department of Energy under Contract
No. DE-AC05-00OR22725.

Address: G. Ostrouchov, N.F. Samatova, Computer Science and Mathemat-
ics Division, Oak Ridge National Laboratory, P.O. Box 2008, Oak Ridge,
Tennessee 37831-6367, U.S.A.
E-mail: ostrouchovg@ornl.gov

A GENERAL PARTITION CLUSTER
ALGORITHM
Daniel Peña, Julio Rodriguez and George C. Tiao
Key words: Predictive distribution, robust estimation, SAR procedure.
COMPSTAT 2004 section: Clustering.

Abstract: A new cluster algorithm based on the SAR procedure proposed
by Peña and Tiao [9] is presented. The method splits the data into more
homogeneous groups by putting together observations which have the same
sensitivity to the deletion of extreme points in the sample. As the sample
is always split by this method, the second stage is to check whether observations
outside each group can be recombined one by one into the groups by using
the distance implied by the model. The performance of this algorithm is
compared to some well-known cluster methods.

1 Introduction
Finding groups in data is a key activity in many scientific fields. Gordon [8] is a good general reference. Classical partition and hierarchical algorithms have been very useful in many problems, but they have four main limitations. First, the criteria used are not affine equivariant, and therefore the results obtained depend on the changes of scale and/or rotation applied to the data. Second, the usual heterogeneity measures based on the Euclidean metric do not work well for highly correlated observations forming elliptical clusters or when the clusters overlap. Third, we have to specify the number of clusters or decide on a criterion for choosing it. Fourth, there is no general procedure to deal with outliers. Some advances have been made to solve these problems; see [4], [5] and [16].
An alternative approach to clustering is to fit mixture models. This idea has been explored from both the classical and the Bayesian points of view. Banfield and Raftery [3] and Dasgupta and Raftery [6] have proposed a model-based approach to clustering which finds an initial solution by hierarchical clustering, then assumes a mixture of normals model and uses the EM algorithm to estimate the parameters. A clear advantage of fitting normal mixtures is that the implied distance is the Mahalanobis distance, which is affine equivariant. From the Bayesian point of view, the parameters of the mixture are estimated by Markov chain Monte Carlo methods, and several procedures have been proposed to allow for an unknown number of components in the mixture; see [12] and [14]. A promising approach to cluster analysis, which can avoid the curse of dimensionality, is projection pursuit, where low-dimensional projections of the multivariate data are used to provide the most interesting views of the full-dimensional data. Peña and Prieto [11] have proposed an
algorithm where the data are projected on the directions of maximum heterogeneity, defined as those directions in which the kurtosis coefficient of the projected data is maximized or minimized. Then they used the spacings to search for clusters on the univariate variables obtained by these projections.
Finally, Peña and Tiao [9] propose the SAR (split and recombine) procedure for detecting heterogeneity in a sample with respect to a given model. This procedure is general, affine equivariant, does not require specifying a priori the number of clusters, and is well suited for finding the components in a mixture of models. The idea of the procedure is first to split the sample into more homogeneous groups and second to recombine the observations one by one in order to form homogeneous clusters. The SAR procedure has two important properties that are not shared by many of the most often used cluster algorithms: (i) it does not require an initial starting point; (ii) each homogeneous group is obtained independently from the others, so that each group does not compete with the others to incorporate an observation. The first property implies that the algorithm we propose can be used as a first solution for any other cluster algorithm; the second, that the procedure may work well even if the groups are not well separated. This paper analyzes the application of the SAR procedure to cluster analysis and is organized as follows. Section 2 presents the main ideas of the procedure. Section 3 compares it in a Monte Carlo study to Mclust (model-based clustering, [7]), k-means, pam (partition around medoids, [15]) and Kpp (kurtosis projection pursuit, [11]).

2 The SAR procedure


Suppose we define a measure H(x, X) of the heterogeneity between an observation, x, and a set of data, X. We are going to use this measure to split the sample iteratively into homogeneous groups and to recombine observations into the groups. We assume that the heterogeneity measure H(x, X) is equivariant, that is, invariant to linear transformations, and is coherent with the assumed model. As the true structure of the data is unknown, we start the process by assuming that the data are homogeneous and have been generated by a normal distribution, N_p(μ, V). Then we propose a heterogeneity measure based on out-of-sample prediction as follows. The predictive distribution for a new observation x_f generated by a normal model using Jeffreys' prior p(μ, V) ∝ |V|^{-(p+1)/2} is (see, for instance, [2])
$$p(x_f \mid X) \propto \left(1 + \frac{Q_f}{n-p}\right)^{-n/2},$$
where $Q_f = \frac{n}{n+1}(x_f - \bar{x})'\hat{V}^{-1}(x_f - \bar{x})$, $\bar{x}$ is the sample mean, and $\hat{V}$ is the sample covariance matrix, given by $\hat{V} = (X - \mathbf{1}\bar{x}')'(X - \mathbf{1}\bar{x}')/(n-p)$. Following Peña and Tiao [9] we will use as the measure of heterogeneity of an observation x_i with respect to a group X_(i), which does not contain this observation, the standardized predictive value given by

$$H(x_i, X_{(i)}) = -2\ln\left\{\frac{p(x_i \mid X_{(i)})}{p(\bar{x}_{(i)} \mid X_{(i)})}\right\} = (n-1)\ln\left\{1 + \frac{Q_{i(i)}}{n-1-p}\right\}, \qquad (1)$$
where $Q_{i(i)} = \frac{n-1}{n}(x_i - \bar{x}_{(i)})'\hat{V}_{(i)}^{-1}(x_i - \bar{x}_{(i)})$, and $\hat{V}_{(i)}$ and $\bar{x}_{(i)}$ are the covariance matrix and the mean computed from the sample X_(i) without the ith case. Note that H(x_i, X_(i)) is a monotonic function of the Mahalanobis distance Q_i(i), which is usually used to check the heterogeneity of a point x_i with respect to the sample X_(i).
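As an illustration only (not the authors' code), the following sketch computes the leave-one-out heterogeneity measure of equation (1) with NumPy, assuming the normal model described above; the function name and the data layout are our own choices.

```python
import numpy as np

def heterogeneity(X, i):
    """Standardized predictive value H(x_i, X_(i)) of eq. (1), sketch."""
    n, p = X.shape
    x_i = X[i]
    X_del = np.delete(X, i, axis=0)                 # sample X_(i) without case i
    xbar = X_del.mean(axis=0)
    # covariance normalized by (n - 1 - p), matching V_(i) in the text
    V = (X_del - xbar).T @ (X_del - xbar) / (n - 1 - p)
    d = x_i - xbar
    Q = (n - 1) / n * d @ np.linalg.solve(V, d)     # Mahalanobis-type distance Q_i(i)
    return (n - 1) * np.log(1.0 + Q / (n - 1 - p))
```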
The splitting of the sample is made as follows. For each observation, x_i, we define the discriminator of this point as the observation which, when deleted from the sample, makes the point x_i as heterogeneous as possible with respect to the rest of the data. The discriminator of x_i is the point x_j if
$$H(x_i, X_{(ij)}) \geq H(x_i, X_{(ik)}) \quad \text{for all } k \neq i,$$
where X_(ik) is the sample without the ith and kth cases.
Each sample point must have a unique discriminator, but several sample points may share the same discriminator. It can be proved (see [10]) that the discriminators are members of the convex hull of the sample. That is, a discriminator must be an extreme point. An intuitive procedure to split the sample into groups is to put together observations which share the same discriminator, as they are affected in the same way by modifications of the sample obtained by deleting some extreme values. It is obvious that if two observations are identical they will have the same discriminator, and if they are close they will also have the same discriminator. The number of points in the sample which share the same discriminator is called the order of the discriminator. We consider as special points discriminators of order larger than K, where K = f(p, n), and we will put them in a special group of extreme observations. However, discriminators of order smaller than K are considered as usual points and are assigned to the group defined by all the observations that share a common discriminator. We need to define the minimum size of a set of data to be considered as a group. We will say that we have a group if we can compute the mean and covariance matrix of the group and, therefore, the minimum group size must be n_0 = p + h, where h > 0 and p is the number of variables. Usually h = f(p, n), and in the examples we have taken h = log(n - p). In the procedure which follows we have considered as special points those discriminators of order larger than K, where K = p + h - 1. This value seems to work well in the simulations we have made. Based on these considerations the sample is split as follows: 1) observations which have the same discriminator are put in the same group, and the discriminator is only included in the group if it has order smaller than K; 2) discriminators of order bigger than K are allocated to a specific group of isolated points; 3) if two groups formed by the previous rules have any observation in common, the

two groups are joined into one group. These three rules split the sample into more homogeneous groups. Each group is now considered as a new sample and the three rules are applied again, until splitting the sample further would lead only to isolated points because all of the resulting groups are of size smaller than the minimum group size n_0. A group of data is called a basic group if splitting it would lead to subgroups of size smaller than the minimum size, p + h.
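A rough sketch of one splitting pass follows; it reuses the hypothetical heterogeneity helper above. The grouping below covers rules 1) and 2) only; the merging of overlapping groups (rule 3) and the recursive reapplication of the rules are left out.

```python
def discriminator(X, i):
    """Index j whose deletion makes x_i most heterogeneous (sketch)."""
    n = X.shape[0]
    scores = {}
    for j in range(n):
        if j == i:
            continue
        X_minus_j = np.delete(X, j, axis=0)       # drop candidate discriminator j
        i_pos = i if i < j else i - 1             # position of x_i after the deletion
        scores[j] = heterogeneity(X_minus_j, i_pos)
    return max(scores, key=scores.get)

def split_once(X, K):
    """One splitting pass: group points that share a discriminator."""
    n = X.shape[0]
    disc = [discriminator(X, i) for i in range(n)]
    order = np.bincount(disc, minlength=n)        # order of each discriminator
    isolated = {j for j in range(n) if order[j] > K}
    groups = {}
    for i, j in enumerate(disc):
        groups.setdefault(j, set()).add(i)
        if j not in isolated:
            groups[j].add(j)                      # low-order discriminators join their group
    return list(groups.values()), isolated
```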
When the sample cannot be split further, the recombining process is applied, starting from any of the basic groups obtained. The recombining process is the one suggested by Peña and Tiao [9]. Each group is enlarged by incorporating observations one by one. For a given group, we begin by testing the observation outside the group which is the closest to the group in terms of the measure H(y_f, X_g), where y_f is the observation outside the group formed by data X_g. If H(y_f, X_g) is smaller than some cut-off value, namely the 99th percentile of the distribution of the statistic H(y_f, X_g), this observation is incorporated into the group and the process of testing the closest observation to the group is repeated for the enlarged group. The enlarging process continues until either the threshold is crossed or the entire sample is included. A similar idea of recombining points has been used for robust estimation (see, for instance, [1]). We may have one of three possible cases. First, the enlarging of all the basic groups leads to the same group, which includes all the observations apart from some outliers. Then we have a homogeneous sample with some isolated outliers and the procedure ends. Second, the enlarging of the basic groups leads to a partition of the sample into disjoint groups and we conclude that we have some groups in the data; again the procedure ends. Third, we obtain more than one possible solution because the partition obtained differs depending on the basic group from which we start. The final solutions found are called possible data configurations (PDC). The selection among them is made by a model selection criterion.
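The recombination stage can be sketched as follows (again purely illustrative; the fixed `cutoff` argument stands in for the 99th percentile mentioned above).

```python
def recombine(X, group_idx, cutoff):
    """Enlarge a basic group one observation at a time (sketch)."""
    group = set(group_idx)
    outside = set(range(X.shape[0])) - group
    while outside:
        Xg = X[sorted(group)]
        # heterogeneity of each outside observation with respect to the group
        h = {f: heterogeneity(np.vstack([Xg, X[f]]), Xg.shape[0]) for f in outside}
        f_best = min(h, key=h.get)                # closest outside observation
        if h[f_best] >= cutoff:                   # threshold crossed: stop enlarging
            break
        group.add(f_best)
        outside.remove(f_best)
    return group
```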

3 Monte Carlo results

The properties of the algorithm have been studied in a Monte Carlo experiment similar to the one used by Peña and Prieto [11] to illustrate the behavior of their cluster procedure. Sets of 10 × p × k random observations in dimension p = 2, 4, 8 have been generated from a mixture of k = 2, 4 components of multivariate distributions. In all data sets the number of observations from each distribution has been determined randomly, but ensuring that each cluster contains a minimum of p + 1 observations. The mean for each distribution is chosen at random from the multivariate normal distribution N_p(0, f I). The factor f (see Table 1) is selected to be as small as possible while ensuring that the probability of overlap between groups is roughly equal to 0.01. We generated data sets in six different scenarios.

a) Mixture of k multivariate normal distributions. In each group the covariance matrix is generated as S = UDU', from a random orthogonal matrix U and a diagonal matrix D with entries generated from a uniform distribution on (a1): [10^-3, 5√p], so that the covariance matrices are well conditioned, and (a2): [10^-3, 10√p], so that the covariance matrices are ill-conditioned.

b) Mixture of k multivariate uniform distributions with (b1) covariance generated as in (a1) and (b2) covariance generated as in (a2).

c) Mixture of k multivariate normal distributions generated as indicated in scenario (a1), but with 10% of the data being outliers, (c1): generated by N_p(0, f I), and (c2): for each cluster in the data, 10% of its observations have been generated as a group of outliers at a distance 4χ²_{p,0.99} in a group along a random direction, and a single outlier along another random direction.

a1) Covariance matrices well conditioned

p  k   f     SAR    Kpp    k-means  Mclust  pam
2  2   55    1.65   7.33   45.35    16.73   34.98
2  4  140    1.29   0.95   24.90     1.54    1.86
4  2   14    4.83   9.90   47.15    12.38   32.11
4  4   20    5.58   9.39   27.20     6.75   10.76
8  2   12   15.43  13.13   43.29    12.28   55.61
8  4   18    7.52  12.58   15.81     3.75   14.42
Average      6.05   8.88   33.95     8.90   24.96

a2) Covariance matrices ill-conditioned

p  k   f     SAR    Kpp    k-means  Mclust  pam
2  2   55    1.58   9.38   46.38    14.23   33.95
2  4  140    1.00   0.61   25.14     0.60    1.83
4  2   14    0.99   4.96   48.54    11.64   32.89
4  4   20    1.39   5.07   30.99     6.55    5.38
8  2   12    0.64   5.19   44.83     0.66   50.94
8  4   18    0.87   6.01   22.92     4.36   11.01
Average      1.08   5.20   36.47     6.34   22.66

Table 1: Percentages of mislabeled observations for the SAR, the Kpp, the k-means, the Mclust and the pam procedures. Normal observations with: (a1) covariance matrices well conditioned, (a2) covariance matrices ill-conditioned.

To provide a better understanding of the behavior of the new procedure, in each table we compare the proposed method with the Kpp, k-means, Mclust and pam algorithms. The Mclust algorithm has been run with the function 'EMclust' with models EI, VI, EEE, VVV, EEV and VEV and number of clusters between 1 and 8, and the final configuration is selected by the BIC (see [7] for a description of the different models used in the function 'EMclust'). The rule to select the number of clusters in the algorithm pam is the maximum of the silhouette statistic for k = 1, ..., 8, and in k-means the stopping rule used is the one proposed by Calinski and Harabasz.
Table 1 gives the average percentage of observations which have been labeled incorrectly in scenarios a1) and a2), obtained from 200 replications for each value using the same data sets for all procedures. In scenario a1) the SAR procedure has the best performance, and Kpp and Mclust are second, having similar behavior. In scenario a2), when the covariance matrix is ill-conditioned, the SAR procedure is again the best, followed by Kpp and Mclust. This result is quite consistent, as the SAR procedure is the best in eight out of the twelve comparisons included in the two scenarios of Table 1, and in the four cases in which it is not the best it is not far from the best one. The k-means and pam show poor results.

b1) Covariance matrices well conditioned

p  k   f     SAR    Kpp    k-means  Mclust  pam
2  2   55    0.45  11.53   51.40    21.08   44.75
2  4  140    0.58   0.38   29.25     0.84    1.16
4  2   14    0.85   4.81   51.71    12.48   51.41
4  4   20    1.58   4.33   33.15     9.11    7.68
8  2   12    6.24   5.45   41.83     7.38   60.80
8  4   18    2.33   4.93   20.07     5.58   16.93
Average      2.00   5.24   37.90     9.41   30.46

b2) Covariance matrices ill-conditioned

p  k   f     SAR    Kpp    k-means  Mclust  pam
2  2   55    1.55  11.78   48.65    20.53   41.95
2  4  140    0.56   0.99   34.30     1.75    2.06
4  2   14    0.79   4.06   53.23     6.00   46.45
4  4   20    0.38   3.13   34.39     7.54    7.28
8  2   12    0.34   5.76   45.96     0.00   62.13
8  4   18    0.46   4.21   27.32     4.74   12.61
Average      0.68   4.99   40.64     6.76   28.75

Table 2: Percentages of mislabeled observations for the SAR, the Kpp, the k-means, the Mclust and the pam procedures. Uniform observations with: (b1) covariance matrices well conditioned, (b2) covariance matrices ill-conditioned.

Table 2 shows the outcome for scenarios b1) and b2), where we analyze the same structure as in scenarios a1) and a2) but now using mixtures of uniform distributions. Table 2 shows the percentages of mislabeled observations for both scenarios b1) and b2). The behavior of the SAR procedure is again the best on average, and the best in ten of the twelve cases. The second best behavior corresponds to Kpp, which is better than Mclust in eleven out of the twelve cases.

c1) Non-concentrated contaminations

p  k   f     SAR    Kpp    k-means  Mclust  pam
2  2   55    1.25   0.68    3.00     6.47    0.69
2  4  140    0.83   1.30   12.31     3.50    2.85
4  2   14    8.58   9.46   14.55     6.71    7.21
4  4   20    5.66  11.89   22.64     5.27    6.13
8  2   12   12.64  14.48   16.88    12.58   16.46
8  4   18    9.47  16.67   44.08     6.78    4.59
Average      6.40   9.08   18.91     6.89    6.32

c2) Concentrated contaminations

p  k   f     SAR    Kpp    k-means  Mclust  pam
2  2   55    0.98   4.03   26.25    12.61   17.50
2  4  140    0.40   0.65   12.88     0.49    2.04
4  2   14    3.58   6.29   35.46    17.90   28.46
4  4   20    3.21  10.01   17.69    15.47    7.50
8  2   12   15.03  13.41   38.66    23.42   53.08
8  4   18    8.15  13.73   17.72     6.93   14.71
Average      5.22   8.02   24.78    12.80   20.55

Table 3: Percentages of mislabeled observations for the SAR, the Kpp, the k-means, the Mclust and the pam procedures. Normal observations with 10% outliers: (c1) non-concentrated contaminations, (c2) concentrated contaminations.

A final simulation study has been conducted (see Table 3) to determine the behavior of the methods in the presence of outliers. Scenarios c1) and c2) contain 10% of data contaminated by, first, a non-concentrated contamination and, second, a concentrated contamination, as defined in scenario c). The criterion to obtain the mislabeled observations is based only on the 90% of observations that are not contaminated. Table 3 shows the percentage of mislabeled observations for scenarios c1) and c2). The maximum number of clusters k has been increased to ten in the algorithms k-means, Mclust and pam, so that the concentrated contamination can be considered as isolated clusters. In scenario c1) the best methods, on average, are, with a very small difference, the pam algorithm and the SAR procedure. However, for concentrated contamination, scenario c2), the SAR procedure is again clearly the best, followed by Kpp. As a summary of this Monte Carlo study we may conclude that the SAR procedure has the smallest classification error rate in 22 out of the 36 situations considered and the best average number of mislabeled observations in 5 of the six scenarios considered. The only scenario in which the SAR is not the best is scenario c1), but the difference with respect to the best method, pam, is very small: a misclassification percentage of 6.4% versus 6.32% for pam. The Kpp is the second best in five out of the six scenarios. Ordering the methods by average classification error over all the scenarios from best to worst, the order is: SAR, Kpp, Mclust, pam and k-means.

References
[1] Atkinson A.C. (1994). Fast very robust methods for detection of multiple outliers. Journal of the American Statistical Association 89, 1329-1339.
[2] Box G.E.P., Tiao G.C. (1973). Bayesian inference in statistical analysis. Addison-Wesley.
[3] Banfield J.D., Raftery A. (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics 49, 803-821.
[4] Cuesta-Albertos J.A., Gordaliza A.C., Matran C. (1997). Trimmed k-means: an attempt to robustify quantizers. The Annals of Statistics 25, 553-576.
[5] Cuevas A., Febrero M., Fraiman R. (2000). Estimating the number of clusters. Canadian Journal of Statistics 28, 367-382.
[6] Dasgupta A., Raftery A.E. (1998). Detecting features in spatial point processes with clutter via model-based clustering. Journal of the American Statistical Association 93, 294-302.
[7] Fraley C., Raftery A.E. (1999). MCLUST: Software for model-based cluster analysis. Journal of Classification 16, 297-306.
[8] Gordon A. (1999). Classification. 2nd edn. London: Chapman and Hall-CRC.
[9] Peña D., Tiao G.C. (2003). The SAR procedure: A diagnostic analysis of heterogeneous data. (Manuscript submitted for publication).
[10] Peña D., Rodriguez J., Tiao G.C. (2004). Cluster analysis by the SAR procedure. (Manuscript submitted for publication).
[11] Peña D., Prieto J. (2001). Cluster identification using projections. Journal of the American Statistical Association 96, 1433-1445.
[12] Richardson S., Green P.J. (1997). On Bayesian analysis of mixtures with an unknown number of components. Journal of the Royal Statistical Society B 59, 731-758.
[13] Rousseeuw P.J., Leroy A.M. (1987). Robust regression and outlier detection. New York: John Wiley.
[14] Stephens M. (2000). Bayesian analysis of mixture models with an unknown number of components - an alternative to reversible jump methods. The Annals of Statistics 28, 40-74.

[15] Struyf A., Hubert M., Rousseeuw P.J. (1997). Integrating robust clustering techniques in S-PLUS. Computational Statistics & Data Analysis 26, 17-37.
[16] Tibshirani R., Walther G., Hastie T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society B 63, 411-423.

Address: D. Peña, Departamento de Estadística, Universidad Carlos III de Madrid, Spain
J. Rodriguez, Laboratorio de Estadística, Universidad Politécnica de Madrid, Spain
G.C. Tiao, Graduate School of Business, University of Chicago, USA
E-mail: dpena@est-econ.uc3m.es

ITERATIVE DENOISING
FOR CROSS-CORPUS DISCOVERY

Carey E. Priebe, David J. Marchette, Youngser Park,
Edward J. Wegman, Jeffrey L. Solka,
Diego A. Socolinsky, Damianos Karakos,
Ken W. Church, Roland Guglielmi,
Ronald R. Coifman, Dekang Lin,
Dennis M. Healy, Marc Q. Jacobs, Anna Tsao

Key words: Text document processing, statistical pattern recognition, dimensionality reduction.

COMPSTAT 2004 section: Dimensional reduction, Classification.

Abstract: We consider the problem of statistical pattern recognition in a heterogeneous, high-dimensional setting. In particular, we consider the search for meaningful cross-category associations in a heterogeneous text document corpus. Our approach involves "iterative denoising" - that is, iteratively extracting (corpus-dependent) features and partitioning the document collection into sub-corpora. We present an anecdote wherein this methodology discovers a meaningful cross-category association in a heterogeneous collection of scientific documents.

1 Introduction
The "int egrat ed sensing and pr ocessin g decision trees" introdu ced in [9] pr o-
ceed acco rding to t he following philosophy. Assume t hat t here is a het ero-
geneo us collectio n of ent it ies X = X l , '" , X n which can, in principle, be
measured (sensed) in a lar ge number of ways. Becau se the sensor cannot
make all measurement s simultaneously - eit her due to phy sical sensor con-
st raints or becau se of t he high int rinsic dimension of t he complete feature
collect ion - only a su bset of the possible measur ements is t o be mad e at any
one time.
Thus, for t he ent ire entity collection X a first set of measurement s is
made. Based on the features obtained, X is partit ioned int o {Xl , . . . , XJi },
each Xj l being (pr esumably) mor e hom ogeneous than t he original enti ty col-
lect ion X. Then , for each partition cell XJt a new set of measurements is
considered. This process cont inues , generating br an ches consist ing of "iter-
at ively denoised" ent ity collect ions {Xj l l , " . , Xj l h}, {Xj lJ2I, ' " , X j lJ2Ja },
and so forth, until a collect ion (say, Xj lJ2h) is deemed sufficiently coherent
for inference to pr oceed . Such collect ions are the leaves of t he t ree.
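A schematic of this recursion, purely illustrative: `extract_features`, `partition` and `is_coherent` are hypothetical stand-ins for the corpus-dependent steps made concrete in Section 2.

```python
def denoising_tree(X, extract_features, partition, is_coherent):
    """Grow an iterative denoising tree: re-featurize, split, recurse,
    until a cell is coherent enough to become a leaf (sketch)."""
    if is_coherent(X):
        return {"leaf": X}
    F = extract_features(X)          # features recomputed for this cell only
    return {"children": [denoising_tree(Xj, extract_features, partition, is_coherent)
                         for Xj in partition(X, F)]}
```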

2 Iterative denoising for cross-corpus discovery


The example application we consider herein is that of discovering meaningful associations in a heterogeneous text document corpus. See, for example, [1] for a survey of text mining.

2.1 Feature extraction & dimensionality reduction


Let C be a collection of text documents. The corpus-dependent feature extraction of Lin & Pantel [6], [8] can be described as
$$\mathcal{L}_C(\cdot): \text{DocumentSpace} \to [\text{MutualInformationFeature}]^{d_{\mathcal{L}}(C)}.$$
Both the features themselves and the number of features d_L(C) depend on the corpus C. Thus L_C(C) is a |C| × d_L(C) mutual information feature matrix. Each of the features is associated with a word (after stemming and removal of stopper words), as follows. For document x in corpus C, and associated word w, the mutual information between x and w is given by
$$m_{x,w} = \log\left(\frac{f_{x,w}}{\sum_{x'} f_{x',w}\,\sum_{w'} f_{x,w'}}\right).$$
Here f_{x,w} = c_{x,w}/N, where c_{x,w} is the number of times word w appears in document x and N is the total number of words in the corpus C. This information is discounted to reduce the impact of infrequent words via
$$m_{x,w} \leftarrow m_{x,w}\cdot\frac{c_{x,w}}{1+c_{x,w}}\cdot\frac{\min(\sum_{x'} c_{x',w},\,\sum_{w'} c_{x,w'})}{1+\min(\sum_{x'} c_{x',w},\,\sum_{w'} c_{x,w'})}.$$
The mutual information feature vector, then, for document x in corpus C, is given by
$$\mathcal{L}_C(x) = [m_{x,w_1}, \ldots, m_{x,w_{d_{\mathcal{L}}(C)}}].$$
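A minimal sketch of this discounted mutual-information feature extraction, under the assumption that a document-by-word count matrix has already been built (tokenisation, stemming and stop-word removal are not shown):

```python
import numpy as np

def mi_features(counts):
    """counts: |C| x V matrix with counts[x, w] = c_{x,w}.
    Returns the discounted mutual-information feature matrix (sketch)."""
    counts = np.asarray(counts, dtype=float)
    f = counts / counts.sum()                              # f_{x,w}
    row = f.sum(axis=1, keepdims=True)                     # sum over words in document x
    col = f.sum(axis=0, keepdims=True)                     # sum over documents for word w
    with np.errstate(divide="ignore", invalid="ignore"):
        m = np.log(f / (row * col))
    m[~np.isfinite(m)] = 0.0                               # words absent from a document
    # discount factors reducing the impact of infrequent words
    cmin = np.minimum(counts.sum(axis=0, keepdims=True),
                      counts.sum(axis=1, keepdims=True))
    return m * counts / (1.0 + counts) * cmin / (1.0 + cmin)
```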
Given two documents x, y ∈ C, a distance ρ (we use the term loosely; it is in fact a pseudo-dissimilarity) is computed from their mutual information feature vectors. Thus
$$\rho \circ \mathcal{L}_C(C)$$
is a |C| × |C| interpoint distance matrix. All subsequent processing will be based on these interpoint distances, as discussed in [7]. However, the features, and hence the interpoint distances themselves, are corpus dependent and so, as the iterative denoising tree is built, based on the evolving partitioning, these distances change.
Multidimensional scaling [2] is used to embed the interpoint distance matrix ρ ∘ L_C(C) into a Euclidean space R^{d_mds(C)}. Notice first that, if the feature vectors were Euclidean - that is, if we were using an actual distance in the d_L(C)-dimensional space - then the features could be represented with no distortion in R^{d_L(C)-1}. Alas, they are not, and cannot be. So
$$\text{mds} \circ \rho \circ \mathcal{L}_C(C)$$
is a |C| × d_mds(C) Euclidean feature matrix representing the corpus C. The choice of d_mds(C) represents a distortion/dimensionality tradeoff.
Finally, the Euclidean representation mds ∘ ρ ∘ L_C(C) produced by multidimensional scaling is reduced, via principal component analysis [5], to a lower dimensional space for subsequent processing. Again we face a model selection choice of dimensionality. The combination feature extraction/dimensionality reduction we propose, then, is given by
$$\text{pca} \circ \text{mds} \circ \rho \circ \mathcal{L}_C(C),$$
yielding a |C| × d_pca(C) LSI feature matrix which can be seen as akin to a (generalized) latent semantic indexing (LSI) [4].
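Assuming the interpoint dissimilarity matrix D = ρ ∘ L_C(C) has already been computed, the embedding chain can be sketched with scikit-learn; this illustrates the composition only, with scikit-learn's SMACOF-based MDS standing in for whatever scaling variant the authors used.

```python
from sklearn.manifold import MDS
from sklearn.decomposition import PCA

def embed(D, d_mds, d_pca):
    """D: |C| x |C| dissimilarity matrix. Returns a |C| x d_pca matrix,
    the analogue of pca o mds o rho o L_C(C) (sketch)."""
    Y = MDS(n_components=d_mds, dissimilarity="precomputed",
            random_state=0).fit_transform(D)          # Euclidean embedding
    return PCA(n_components=d_pca).fit_transform(Y)   # LSI-like principal components
```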

2.2 Science news corpus


A heterogeneous corpus of text documents obtained from the Science News web site is used in this example. The Science News (SN) corpus C consists of |C| = 1047 documents in eight classes. Table 1 provides a breakdown of the corpus by number of documents per class. Our goal is to find two documents in different classes which have a meaningful association.

Class                 Number of Documents
Anthropology           54
Astronomy             121
Behavioral Sciences    72
Earth Sciences        137
Life Sciences         205
Math & CS              60
Medicine              280
Physics               118

Table 1: Science News corpus.

For this Science News corpus C, feature extraction via L_C(C) yields a feature dimension d_L(C) = 10906. That is, there are 10906 distinct meaningful words in the corpus, and the Lin & Pantel feature extraction produces a 1047 × 10906 feature matrix.
Multidimensional scaling (Figure 1, left panel) on the 1047 × 1047 interpoint distance matrix ρ ∘ L_C(C) yields d_mds(C) = 898. (Numerical issues in the multidimensional scaling algorithm make 898 the largest dimension into which the interpoint distance matrix can be embedded. So, while Figure 1 suggests that perhaps 200, and certainly 400, dimensions are sufficient to adequately fit the documents into Euclidean space, we avoid the first model selection quandary by choosing the largest numerically stable multidimensional scaling embedding.)

Figure 1: Multidimensional scaling (left panel) for the original 1047 10906-dimensional SN feature vectors. The largest numerically stable multidimensional scaling embedding is d_mds(C) = 898. (This left curve suggests that perhaps 200, and certainly 400, dimensions are sufficient to adequately fit the documents into Euclidean space.) Principal components (right panel) for the 898-dimensional Euclidean embedding of the original 1047 10906-dimensional SN feature vectors. (The "elbow" of this scree plot occurs, perhaps, in the range of 10-50 principal components.)

A subsequent principal component analysis of the 898-dimensional Euclidean features mds ∘ ρ ∘ L_C(C) yields the scree plot presented in Figure 1, right panel. This scree plot suggests that a latent semantic index dimension of perhaps 10-50 is appropriate for the SN corpus.
Figure 2 displays the projection of the data set onto the first two principal components of
$$\text{pca} \circ \text{mds} \circ \rho \circ \mathcal{L}_C(C) \qquad (1)$$
for the Science News corpus. Notice that this plot suggests that the combination feature extraction/dimensionality reduction we have employed (eq. 1) has captured well some of the information concerning the eight classes, despite the fact that we are viewing just two dimensions (as opposed to, say, the 10-50 dimensions suggested by the scree plot in Figure 1). To wit: there are two groups extending from and distinguishable from the main body of documents. These two groups are dominated by medicine (the upper left arm) and astronomy (the upper right arm). Additionally, some physics documents are present in the astronomy arm and some life sciences and behavioral sciences documents are present in the medicine arm. That physics should have some similarity with astronomy, and that life sciences and behavioral sciences should have some similarity with medicine, agrees with intuition.


Figure 2: The first two principal components of pca ∘ mds ∘ ρ ∘ L_C(C) for the Science News corpus. The eight symbols represent the eight classes; the three clusters generated via hierarchical clustering correspond roughly to the main body and the two arms. Notice that there are two groups extending from and distinguishable from the main body of documents. These two groups are dominated by medicine (the upper left arm) and astronomy (the upper right arm). The documents selected as our anecdotal "meaningful association" are indicated throughout by the solid dots and document number.

2.3 Example result


Recall that the SN corpus C has |C| = 1047 with class label vector
v = [54, 121, 72, 137, 205, 60, 280, 118].
The iterative denoising tree for cross-corpus discovery is illustrated on the SN corpus in Figure 3. This figure provides a coarse depiction of one path, from root to leaf, of the tree; a row-by-row description thereof follows.

Row 1: At the root, we have
$$\text{pca} \circ \text{mds} \circ \rho \circ \mathcal{L}_C(C).$$
Recall that these 1047 documents yield a feature dimension d_L(C) = 10906 and an mds dimension d_mds(C) = 898. We display the first two principal components; thus the root (row 1) in Figure 3 is presented in detail in Figure 2.

Figure 3: One path in an iterative denoising tree for the SN corpus.

Row 2: In the same space as for Row 1, we have simply split out three clusters obtained via hierarchical clustering, for display convenience.
(We choose in this manuscript to avoid model selection details, e.g., the choice of three vs. two clusters at the root. In general, we recommend that this issue be avoided by generating a binary tree unless user intervention is possible. In this example, the root begs for three clusters: a core and two arms.)

To illustrate an anecdotal meaningful cross-corpus discovery, we will follow cluster 2, C2, which contains 166 documents. This subset of the original corpus is denoised in the sense that it is primarily physics and astronomy. The class label vector is
v2 = [2, 113, 0, 10, 4, 0, 1, 36].
Thus, C2 contains nearly all (113 of 121) of the astronomy documents, nearly one third (36 of 118) of the physics documents, and but a smattering from the other classes. So while the original feature extraction was done in the context of a corpus containing medicine, behavioral sciences, and mathematics documents, these topics are not a part of the context for the feature extraction for C2, and this feature extraction can therefore focus on features germane to physics and astronomy.

Row 3: Here we display
$$\text{pca} \circ \text{mds} \circ \rho \circ \mathcal{L}_{C_2}(C_2).$$
(See Figure 4 for more detail.) These 166 documents yield a feature dimension d_L(C2) = 3037 and an mds dimension d_mds(C2) = 162. Since L involves corpus-dependent feature extraction, this display is different from the "cluster 2" display in Row 2. This difference is due to denoising. The indicated partition represents the clusters generated via hierarchical clustering. Notice that one of the clusters (C22, lower right, containing 91 documents) contains approximately half of C2's astronomy documents (52 of 113) and nearly all of C2's physics documents (35 of 36). In continuing pursuit of our anecdotal meaningful cross-corpus discovery, we follow C22.

Row 4: The class label vector for C22 is
v22 = [0, 52, 0, 1, 2, 0, 1, 35].
The left display in Row 4 (see Figure 5 for more detail) depicts
$$\text{pca} \circ \text{mds} \circ \rho \circ \mathcal{L}_{C_{22}}(C_{22}).$$
These 91 documents yield a feature dimension d_L(C22) = 1981 and an mds dimension d_mds(C22) = 89. Again, recall that the feature extraction is corpus-dependent. Now consider altering the geometry via the document subset
S22 = {10500, 10651} ⊂ C22.
(These documents were chosen arbitrarily, for the purposes of illustration: they consist of a physics document about neutrinos and an astronomy document about black holes.) In the display, the two black squares represent S22.

Figure 4: Node N2 in the iterative denoising tree for the SN corpus.

The right display in Row 4 (see Figure 6 for more detail) depicts the altered geometry after consideration of S22. That is, here we have added a new (90th) feature κ_c d(·, S22) to the 89 multidimensional scaling features, and are displaying
$$\text{pca} \circ [(\text{mds} \circ \rho \circ \mathcal{L}_{C_{22}}(C_{22}));\ \kappa_c d(\cdot, S_{22})].$$
In the display, the two black squares again represent S22. The distance-to-subset used for the additional "tunnelling" feature (see, for instance, [3]), d(·, S22), is the minimum Euclidean distance to an element of the subset in the LSI-space defined by the selected principal components; in this case, the scree plot suggests d_pca(C22) = 20. The coefficient κ_c used for the tunnelling feature is obtained by scaling the values d(·, S22) so that the variance of the tunnelling feature κ_c d(·, S22) is some pre-specified positive multiple c of the maximum multidimensional scaling feature variance. We use c = 10000 in this example so that this new feature dominates the multidimensional scaling features in the subsequent principal component analysis. (Note that the scale presented in N'22 in Figure 6 is such that the ordinate has no impact on the subsequent clustering; the abscissa dominates.) Rather than use the automatic clustering (depicted), we illustrate user intervention via manual clustering based on a vertical line (recall that the abscissa dominates) at 700 in N'22. We follow the rightmost cluster obtained thusly, C221.
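The tunnelling feature construction can be sketched as follows (hypothetical code: `Y_mds` is the node's multidimensional-scaling feature matrix, `Z_pca` its selected-principal-component representation, and `S_idx` indexes the anchor documents playing the role of S22).

```python
import numpy as np

def add_tunnelling_feature(Y_mds, Z_pca, S_idx, c=10000.0):
    """Append a scaled distance-to-subset feature to the mds features (sketch)."""
    # minimum Euclidean distance from each document to the subset, in LSI space
    dists = np.min(np.linalg.norm(Z_pca[:, None, :] - Z_pca[S_idx][None, :, :], axis=2),
                   axis=1)
    # scale so the new feature's variance is c times the largest mds feature variance
    kappa = np.sqrt(c * Y_mds.var(axis=0).max() / dists.var())
    return np.hstack([Y_mds, (kappa * dists)[:, None]])
```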
Figure 5: Node N22 in the iterative denoising tree for the SN corpus.


Figure 6: Node N'22 in the iterative denoising tree for the SN corpus.



Figure 7: Node N221 in the iterative denoising tree for the SN corpus.

Row 5: The document collection C221 is, again, almost entirely astronomy and physics, with
|C221| = 17
and
v221 = [0, 8, 0, 1, 0, 0, 0, 8].
These 17 documents yield a feature dimension d_L(C221) = 367 and an mds dimension d_mds = 16. After recalculating the features for C221, we display
$$\text{pca} \circ [(\text{mds} \circ \rho \circ \mathcal{L}_{C_{221}}(C_{221}));\ \kappa_c d(\cdot, S_{22})].$$
(See Figure 7 for more detail.) (A value of c = 100 is used here; the impact of the tunnelling feature is lessened.)

Row 6: Here we consider one of the two clusters, C2212, from N221 via
$$\text{pca} \circ [(\text{mds} \circ \rho \circ \mathcal{L}_{C_{2212}}(C_{2212}));\ \kappa_c d(\cdot, S_{22})]$$
and
v2212 = [0, 6, 0, 1, 0, 0, 0, 5].

These 12 documents yield a feature dimension d_L(C2212) = 215 and an mds dimension d_mds = 11. This, in turn, clusters into C22121 and C22122.
Let us finally consider C22121. This leaf contains eight documents, with class label vector
v22121 = [0, 4, 0, 0, 0, 0, 0, 4].
Pairs of documents from different classes which fall to the same leaf of the iterative denoising tree are candidate associations. Thus this example yields 16 candidate associations, at least one of which (astronomy #10422 = "X-Ray Universe: Quasar's jet goes the distance" by R. Cowen, Science News Online, Feb. 16, 2002, and physics #10516 = "Glimpses inside a tiny, flashing bubble" by I. Peterson, Science News Online, Oct. 5, 1996) is plausibly a meaningful association.

3 Conclusion
We have presented an anecdote - not an experiment! - suggesting that an iterative denoising methodology can be a useful tool in discovering meaningful cross-corpus associations. Corpus-dependent feature extraction is an essential part of the methodology, providing features which are iteratively fine-tuned to ever more homogeneous subsets of documents as one progresses down the tree. The specific approaches to feature extraction, dimensionality reduction, and partitioning may be profitably altered within the framework of the general methodology. The adaptive geometry provided by employing distance-to-subset "tunnelling" features allows the user to alter the details of tree growth. Experimental design to allow for statistical evaluation of the performance of the methodology provides some interesting hurdles, and will be reported elsewhere.
Finally, we note that the methodology described is not specific to text document processing, and may have application in many disparate discovery scenarios. The fundamental idea, as in [9], is to address the problem of there being more measurements that can be made than should be made at any one time.

References
[1] Berry M.W., editor (2004). Survey of text mining: clustering, classification, and retrieval. Springer-Verlag.
[2] Borg I., Groenen P. (1997). Modern multidimensional scaling: theory and applications. Springer-Verlag.
[3] Cowen L.J., Priebe C.E. (1997). Randomized nonlinear projections uncover high-dimensional structure. Advances in Applied Mathematics 9, 319-331.
[4] Deerwester S., Dumais S.T., Furnas G.W., Landauer T.K., Harshman R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science 41 (6), 391-407.

[5] Jolliffe I.T. (1986). Principal component analysis. Springer-Verlag.
[6] Lin D., Pantel P. (2002). Concept discovery from text. In Proceedings of Conference on Computational Linguistics 2002, Taipei, Taiwan, 577-583.
[7] Maa J.-F., Pearl D.K., Bartoszynski R. (1996). Reducing multidimensional two-sample data to one-dimensional interpoint comparisons. The Annals of Statistics 24, 1069-1074.
[8] Pantel P., Lin D. (2002). Discovering word senses from text. In Proceedings of ACM SIGKDD Conference on Knowledge Discovery and Data Mining 2002, Edmonton, Canada, 613-619.
[9] Priebe C.E., Marchette D.J., Healy D.M. (2004). Integrated sensing and processing decision trees. IEEE Trans. PAMI, to appear.

Acknowledgement: Sponsored by the Defense Advanced Research Projects Agency under "Novel Mathematical and Computational Approaches to Exploitation of Massive, Non-physical Data", ARPA Order No. P246, Program Code 3E20. Issued by DARPA/CMO under Contract No. MDA972-03-C-0014 to AlgoTek, Inc. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either explicitly or implied, of DARPA or the U.S. Government. Approved for Public Release, Distribution Unlimited.
Address: C.E. Priebe, E.J. Wegman, D.A. Socolinsky, K.W. Church, R. Guglielmi, R.R. Coifman, D. Lin, M.Q. Jacobs, A. Tsao, AlgoTek, Inc., 3811 N. Fairfax Dr., Suite 700
D.J. Marchette, J.L. Solka, NSWCDD B10, Dahlgren, VA
Y. Park, D. Karakos, Johns Hopkins U., Balt., MD
D.M. Healy, DARPA, Arlington, VA 22203
E-mail: cep@jhu.edu

FROM DATA TO DIFFERENTIAL EQUATIONS

Jim O. Ramsay
Key words: Functional data analysis, differential equation, smoothing, nonparametric regression, lupus.
COMPSTAT 2004 section: Functional data analysis, Spatial statistics.

Abstract: Differential equations are the natural way to model systems with functional inputs and functional outputs. They allow us to study the system's dynamics in the sense of explicitly modelling how the output changes in response to sudden changes in input. For example, engineers developing control systems for industrial processes routinely use DIFE's as modelling tools.
A new method is described for going directly from noisy discrete data, not necessarily sampled at equally spaced times, to a system of differential equations of arbitrary orders, linear or nonlinear, that describes the data. The method involves a generalization of nonparametric curve estimation in which the penalty functional rather than the smoothing function is estimated. Examples are drawn from chemical engineering and medicine.

1 Introduction: Three main themes


Differential equations (here shortened to DIFE's) make explicit the relation between one or more derivatives and the function itself. For example, the general first order equation for function x(t),
$$Dx = f(t, x),$$
defines a dependency of the first derivative Dx on the function x as well as, possibly, other direct dependencies on argument t.
The talk for which this paper is a summary aims to make three general points:

• DIFE's are powerful tools for modeling data. Indeed, they are already routinely used in the chemical, physical and biological sciences as well as in engineering. They are important primarily because they model the dynamics of an observed process; that is, rates of change are modeled along with the observed function. This is especially important in input/output systems, where how the system responds to an abrupt change in input can be as important as the long-term change that results.

• We have new methods for fitting differential equations or dynamic models to raw noisy data that appear to be substantially more effective than
existing techniques. These methods are based on developments in functional data analysis [1], [2], a collection of methods for the analysis of curves and images as data.

• Some important applications are outlined to show the potential for these developments in chemical engineering and medicine. In chemical engineering, there are many possibilities for the use of these techniques in process control applications. In medicine, the methods are applied to some data on lupus, where the dynamics of the disease are the central issue for developing effective treatments.

2 Why consider differential equation models?


The behavior of a derivative is often of more interest than the function itself.
The classic example is mechanics, where Newton's second law for force F(t) as a function of mass m, F(t) = mD^2 x(t), as well as its descendant, e = mc^2, shows that energy exchange takes place at the level of acceleration, not position. How rapidly a system responds, rather than its final level of response, is often what matters.
Since a DIFE links the behavior of a derivative to the behavior of the function, it implies that derivatives will exhibit the same smoothness and regularity that characterizes the function. Consequently, a DIFE can be an important method for computing stable estimates of derivatives.
Natural scientists often deliver theory to biologists and engineers in the
form of DIFE's. Moreover, many other fields, such as pharmacokinetics and
industrial process control, routinely use DIFE's as models for real-life sys-
tems. DIFE's are especially important when feedback systems must be de-
veloped to control the behavior of systems.
Although linear differential equations are much easier to work with than
nonlinear systems, nonlinear DIFE's are often compact and elegant mod-
els for systems exhibiting exceedingly complex behavior, especially in the
biosciences. Indeed, chaotic systems and systems exhibiting catastrophic
changes are usually modelled with nonlinear dynamics.
It is the business of statistics to model random variation. We usually
model random behavior in functions by assuming a fixed underlying process
with superimposed noisy variation. But a DIFE allows a much richer range
of ways in which stochastic behavior can be introduced:

• random coefficient functions

• random forcing functions

• random initial, boundary and other constraints

• system time t unfolding at a random rate



3 A simple input/output system


We begin by looking at a first order linear DIFE for a single output function x(t) and a single input function u(t), although our ultimate goal is to link multiple outputs to multiple inputs.
Figure 1 is an example: the fluid level in a tray within a distillation column of an oil refinery is shown as a function of the flow of a fluid into the tray. We must explain two things: by how much does the fluid level ultimately change in response to the change in input flow indicated, and how rapidly does this change take place?


Figure 1: The upper panel shows the level of material in a tray of a distillation column in an oil refinery, and the lower panel shows the flow of material being distilled into the tray. The points are measured values, and the solid lines are smooths of the data using regression splines.

The DIFE has the general form
$$Dx(t) = -\beta(t)x(t) + \alpha(t)u(t). \qquad (1)$$
The homogeneous part of the equation,
$$Dx(t) = -\beta(t)x(t),$$
describes the endogenous or internal dynamics of the system; the forcing function u is an exogenous functional independent variable that perturbs

these int ernal dynamics. The funct ions a and (3 are the coefficient functions
that define the DIFE. The syst em is linear in t hese coefficient fun ctions, and
also in the input and output funct ions.
One way to understand the separate roles of α and β is to study a simpler constant coefficient model with an input that steps from 0 to 1 at time 1 and for which x(0) = 1. The solution to the equation in this case is
$$x(t) = \begin{cases} e^{-\beta t}, & 0 \le t \le 1, \\ e^{-\beta t} + (\alpha/\beta)\left[1 - e^{-\beta(t-1)}\right], & 1 \le t. \end{cases}$$
We see that β controls the rate of change and that the ultimate level or gain is α/β. We can compare α to the volume control on a radio playing a song carried by radio signal u; the bigger α, the louder the sound. The bass/treble control, on the other hand, corresponds to β; the larger β, the higher the frequency of what we hear.
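This step response is easy to check numerically; the sketch below integrates Dx = -βx + αu for a unit step at t = 1 with arbitrary illustrative values of α and β, and confirms that the level approaches the gain α/β.

```python
import numpy as np
from scipy.integrate import solve_ivp

alpha, beta = 2.0, 0.5                       # illustrative coefficients only
u = lambda t: 1.0 if t >= 1.0 else 0.0       # input stepping from 0 to 1 at t = 1

sol = solve_ivp(lambda t, x: -beta * x + alpha * u(t),
                t_span=(0.0, 20.0), y0=[1.0], max_step=0.01)
print(sol.y[0, -1], alpha / beta)            # long-run level is close to the gain alpha/beta
```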

4 Fitting a differential equation to data


The basic idea is to use profiled least squares to estimate unknown parameters, such as α and β in the above example. We do this by replacing the smoothing function x used to smooth a sequence of noisy functional observations y_j, j = 1, ..., n, by the equations defining the fit to the data conditional on a roughness penalty defined by a differential operator L. Then we optimize the fit with respect to only the unknown parameters; the fitted values x(t_j) are computed as a byproduct of the process, but do not themselves require additional parameters.
Focussing for simplicity on the first order linear equation (1), and defining the coefficient functions to be constants, we define the linear differential operator L_{α,β} to be
$$L_{\alpha,\beta}\, x(t) = \beta x(t) + Dx(t) - \alpha u(t).$$
Function x is a solution to (1) if and only if L_{α,β} x = 0.
Now define the penalized least squares fitting criterion as
$$\text{PENSSE}(y \mid \lambda, \alpha, \beta) = \sum_j [y_j - x(t_j)]^2 + \lambda \int [L_{\alpha,\beta}\, x(t)]^2\, dt. \qquad (2)$$

If x has the basis function expansion x(t) = c'φ(t), where φ(t) is a functional vector of basis functions of length K, then criterion (2) is minimized with respect to coefficient vector c by
$$\hat{c} = \left[\Phi'\Phi + \lambda R(\alpha,\beta)\right]^{-1}\left[\Phi' y + \lambda s(\alpha,\beta)\right], \qquad (3)$$

where

• Φ is the n by K matrix of basis function values φ_k(t_j),

• λ is a smoothing parameter,

• y is the vector of noisy observations to be smoothed,

• the penalty matrix R(α, β) is ∫ [βφ(t) + Dφ(t)][βφ(t) + Dφ(t)]' dt, and

• the penalty vector s(α, β) is α ∫ [βφ(t) + Dφ(t)] u(t) dt.

Substituting (3) into (2), we may now minimize the un-penalized profiled error sum of squares
$$\text{PROFSSE}(y \mid \lambda, \alpha, \beta) = \sum_j [y_j - x(t_j \mid \alpha, \beta)]^2 \qquad (4)$$

with respect to parameters α and β. Our experience is that the smoothing parameter λ can usually be selected by minimizing the generalized cross-validation (GCV) criterion.
This process may be extended to equations of an arbitrary order, nonlinear equations, and systems of equations.
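A minimal sketch of the profiling idea for the constant-coefficient first-order model follows. The monomial basis, the rectangle-rule quadrature and the use of scipy.optimize are our own assumptions for illustration, not details given in the paper.

```python
import numpy as np

def profiled_sse(theta, t, y, u, lam, K=8):
    """Profiled error sum of squares (4) for Dx = -beta*x + alpha*u,
    using a simple monomial basis phi_k(t) = t**k (sketch)."""
    alpha, beta = theta
    tq = np.linspace(t.min(), t.max(), 400)               # quadrature grid
    w = tq[1] - tq[0]                                     # rectangle-rule weight
    Phi = np.vander(t, K, increasing=True)                # phi_k(t_j)
    Phiq = np.vander(tq, K, increasing=True)
    Dq = np.hstack([np.zeros((len(tq), 1)),
                    Phiq[:, :-1] * np.arange(1, K)])      # derivative of the basis on the grid
    Lq = Dq + beta * Phiq                                 # L_{alpha,beta} applied to the basis
    R = w * Lq.T @ Lq                                     # penalty matrix R(alpha, beta)
    s = w * alpha * Lq.T @ u(tq)                          # penalty vector s(alpha, beta)
    c = np.linalg.solve(Phi.T @ Phi + lam * R, Phi.T @ y + lam * s)   # eq. (3)
    return np.sum((y - Phi @ c) ** 2)                     # eq. (4)

# usage sketch, given data arrays t, y and a vectorized input function u:
# from scipy.optimize import minimize
# result = minimize(profiled_sse, x0=[0.0, 0.1], args=(t, y, u, 1.0))
```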

5 Two simulated data examples


5.1 Twenty tilted sinusoids
How well can we recover derivatives using this process? Consider a tilted sinusoid
$$x_i(t) = c_{i1} + c_{i2} t + c_{i3}\sin(6\pi t) + c_{i4}\cos(6\pi t)$$
that is annihilated by the operator
$$L = (6\pi)^2 D^2 + D^4.$$
We generated N = 20 of these by randomly generating coefficients from N(0, 1) and adding noise to x_i from the same distribution.
As a point of comparison, when we smoothed with L = D^4, best results were obtained with λ = 10^-10, and the integrated root-mean-squared errors for the function and the first two derivatives were 0.32, 9.3 and 315.6, respectively.
When we estimated all four constant coefficients of the order four linear differential operator L, best results were obtained for λ = 10^-5, and the integrated root-mean-squared errors for the function and the first two derivatives were 0.18, 2.8 and 49.3, respectively. These represent improvements in precision of estimation by factors of 1.8, 3.3 and 6.4, respectively. We estimated β_2 to be 353.6, whereas the right value was 355.3.
The most dramatic improvement in derivative estimation occurred at the boundaries. Estimating the linear differential operator virtually eliminated the usual instability of derivative estimates in these regions because these estimates are linked by the DIFE to the behavior of the function values, which are only mildly more unstable at the boundaries than within the interior. But even in the interior, for example, the precisions of the estimates of D^1 x and D^2 x were at least doubled.

5.2 A single forced harmonic


How well does the method do when applied to a single functional observation? A second order equation with coefficients β_0 = 4.04 and β_1 = 0.4 was forced by a step function u that was zero up to t = 2π and one after, and multiplied by coefficient α = -2.0. Noise sampled from N(0, 0.04) was added. One hundred trials were conducted, and in each λ was chosen by minimizing GCV.
The mean estimates of coefficients β_0, β_1 and α were 4.041 ± 0.007, 0.397 ± 0.005 and -1.998 ± 0.0009, respectively, indicating no detectable bias.

6 The oil refinery data


After some experimentation with first and second order models, and with constant and varying coefficient models, the clear conclusion was that the constant coefficient model Dx = -0.02x - 0.19u was preferred. The standard error for β was estimated to be 0.0004 by both the bootstrapping and the delta methods. The corresponding estimates for the standard error of α were 0.0024 and 0.0025, respectively.
Figure 2 shows the data for the tray level along with the fit to the data implied by the differential equation.

7 The lupus data


7.1 The disease
Systemic lupus erythematosus (SLE), or simply "lupus", is an auto-immune disease in the same family as rheumatoid arthritis. The body's immune system attacks itself, producing a wide spectrum of symptoms and affecting many organs. These attacks, called flares, occur suddenly and unpredictably, last for varying periods, and then disappear, sometimes for long periods. "Erythematosus" means reddening, referring to a characteristic skin rash by which it was first identified.
The disease is incurable. Around 9 times as many women as men get the disease, and blacks and some Asian groups are more susceptible. Incidence


Figure 2: The fit to the data defined by the differential equation is shown as a solid line, and the data as points.

ranges from 3 to 400 per 100,000. Lupus can appear at any age, and the earlier it appears, the more severe it tends to be. Lupus is on the increase, and in some places is now more common than rheumatoid arthritis. Genetic, environmental, and hormonal factors are all involved. Exposures to chemicals and ultra-violet light are suspected triggers for flares.
Symptoms range from mild to severe, and can cause permanent damage or be fatal. A rash on the face and chest, pain and swelling in the joints, and fatigue are common and early signs of a flare. The kidneys are often affected, with swelling and loss of function, and end-stage renal failure is a real risk. The heart, arteries, lungs, eyes and central nervous system may also be involved; and the psychological effects of lupus are receiving more and more attention. A typical flare goes from just noticeable to acute in the order of ten days or less.
The variation in the nature and severity of symptoms combined with the unpredictability of flares makes treating this disease a huge challenge. Mild symptoms are treated with anti-inflammatory drugs (aspirin, etc.), and more severe symptoms require the use of corticosteroids, usually prednisone. The response time to an increase in prednisone dose is usually of the order of a few days. However, corticosteroids are toxic if taken over long periods at high doses, with common side effects being weight gain, sleeplessness and

osteoporosis. Sudden decreases in dose can trigger a new flare; consequently, high dose levels must be tapered down gradually.
Patients are assessed at regular intervals. Although lupus symptoms are multidimensional, long term treatment requires some overall measure of disease severity. A number of symptom severity scales have been proposed, and the SLEDAI scale is now widely used. SLEDAI is a check list of 24 symptoms, each given a numerical weight ranging from 1 for fever to 8 for seizures.
A flare has been defined by an international committee as a SLEDAI score increase of 3 or more to a level of 8 or higher. During flares SLEDAI scores of 25 to 30 are common.
A joint McGill/University of Toronto team headed by Dr. Paul Fortin has complete histories for about 300 patients spanning, in many cases, around 20 years. This is one of the largest and highest quality sets of patient records in the world.
Figure 3 shows the data for a single patient over a three-year period. Notice the strong flare that coincides with the reduction in prednisone dose just after the seventh year.


Figure 3: The data for a single patient over a three-year period. Heavy lines join times and values of SLEDAI measurements. A flare is indicated by a solid heavy line joining the first SLEDAI measurement within the flare to the previous measurement extended to a time 0.02 years back. The light solid line joins times and values at which prednisone doses were fixed.
7.2 The statistical challenges

We require a model for:

• flare timings (a point process)
• flare intensities (a marked point process)
• flare durations (a marked interval process)
• flare dynamics: rate of onset and rate of recovery
• how flare characteristics depend on prednisone level, and
• prednisone dynamics or rate of change
• individual differences in all of the above

7.3 Data issues

The SLEDAI scale score has limited reliability. The dates at which these scores are assessed are themselves haphazard. Some data may be actually missing, e.g.: does SLEDAI = 0 always mean "no symptoms"?
We can, however, work closely with the physicians who work with these patients to identify flare characteristics, including flare onset times and flare durations, and to answer some questions. For example, a SLEDAI score may not change, but the fact that prednisone was increased at that point suggests that the disease has nonetheless become acute. We can also return to patient records to retrieve other information as required.

7.4 A simple model for flare dynamics

Let u(t) be an indicator function for when lupus is in its active state and a flare is taking place.

• u(t) takes only the values 0 and 1.
• The times t_i at which u(t) becomes positive can be estimated directly from the data, and therefore assumed known.
• The duration of an active state will be b, and may vary from flare to flare.

We might propose to model symptom level s(t) as a first order differential equation:

    Ds(t) = −β s(t) + α(t) u(t).                                   (5)

This, however, is too simple in an important way. It predicts that the rate of increase in symptom level is equal to its rate of decrease when u(t) returns to zero. In fact, however, symptoms rise far more rapidly than they decay.
We can imagine that the disease also affects the body's capacity to respond to the disease itself, as well as its capacity to recover. That is, β is also affected by the disease, and therefore must be replaced by the function β(t). When the patient is healthy between flares, β(t) is high, leading to rapid response to the onset of the disease. When the patient is experiencing a flare, β(t) is near zero, implying a slow recovery.
We tried this differential equation for β(t):

    Dβ(t) = −γ β(t) + δ [1 − u(t)].

When u(t) switches on, β(t) decays to zero, and Ds(t) tends to equal α(t)u(t); that is, s(t) increases linearly while u(t) = 1. When u(t) switches off, β(t) returns to the level of its gain, δ/γ, and s(t) tends to decay exponentially with rate equal to β(t)'s gain.
This gives us the general shape of a lupus flare. The increase in symptoms is essentially linear because β(t) decays rapidly to 0 inside a flare; when β(t) ≈ 0, the gain becomes αb. But after a flare, when u(t) returns to zero, β returns to its healthy level, and there is an exponential decrease in symptoms.
Actually, preliminary results indicated large values for the rate parameter γ, implying that β(t) moved extremely rapidly between virtually zero and its maximum value, defined by δ. We decided to simplify the differential equation for β(t) to

    Dβ(t) = δ [1 − u(t)].

This implies linear increase within a flare episode, and exponential decrease afterwards with a rate constant δ.
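To show the kind of trajectory these equations produce, the following sketch integrates the pair Ds(t) = −β(t)s(t) + α u(t), Dβ(t) = −γ β(t) + δ[1 − u(t)] with simple Euler steps in R. The parameter values, the single flare window and the step size are assumptions chosen only to reproduce the qualitative shape (a near-linear rise during the flare, exponential recovery afterwards); they are not the estimates reported later in this section.

    ## Euler integration of the flare dynamics (illustrative, assumed parameters)
    dt    <- 0.005
    times <- seq(0, 2, by = dt)
    u     <- as.numeric(times >= 0.5 & times <= 0.7)    # one flare lasting 0.2 years
    alpha <- 60; gamma <- 50; delta <- 100              # assumed values; gamma taken large
    s <- beta <- numeric(length(times))
    beta[1] <- delta / gamma                            # healthy level of beta(t)
    for (k in seq_along(times)[-1]) {
      beta[k] <- beta[k - 1] + dt * (-gamma * beta[k - 1] + delta * (1 - u[k - 1]))
      s[k]    <- s[k - 1] + dt * (-beta[k - 1] * s[k - 1] + alpha * u[k - 1])
    }
    plot(times, s, type = "l", xlab = "Time (years)", ylab = "Symptom level s(t)")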

7.5 The data analysis

Order 4 B-spline basis functions, with a knot at every data point and three coincident knots at the times of onset and offset of flares, were used to represent the symptom function s(t). Coincident knots allow the first derivative to be discontinuous at flare boundaries, as required by the model. Coefficient α(t) was made nonconstant, and represented as a basis function expansion in terms of four order 3 B-splines.
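A minimal sketch of how such a basis could be set up in R with splineDesign from the splines package is given below. The observation times and flare boundaries are invented for illustration, and the flare boundaries are assumed to coincide with data points, so that adding two extra copies of each gives knots of multiplicity three.

    library(splines)
    t_obs  <- seq(5, 8, by = 0.1)               # assumed SLEDAI measurement times (years)
    flares <- c(5.8, 6.4, 7.1, 7.5)             # assumed flare onset/offset times
    ## a knot at every data point, plus two extra copies at each flare boundary
    inner  <- sort(c(t_obs[-c(1, length(t_obs))], rep(flares, each = 2)))
    knots  <- c(rep(min(t_obs), 4), inner, rep(max(t_obs), 4))
    B <- splineDesign(knots, t_obs, ord = 4)    # order 4 B-spline basis matrix
    dim(B)                                      # length(knots) - 4 basis functions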
Figure 4 shows the fitting function s, the solution to the differential equation and the coefficient function α for smoothing parameter λ = 10^{-0.5}. For this value of λ, s fits the data quite well everywhere, and certainly adequately given the reliability of the SLEDAI score. But the solution of the differential equation climbs with each successive flare because the estimated rate constant δ = 0.61 is too small to allow enough decay in symptoms between flares.
Figure 5 shows these results for the higher smoothing parameter λ = 10^{0.5}. Now we see that the fit s(t) and the solution to the differential equation are very close. The estimate of δ is now 2.24, and this rate constant is sufficient for the recovery from a flare before the next flare begins.

Figure 4: Results for the analysis of the data in Figure 3 using a smoothing parameter λ = 10^{-0.5}. The circles are SLEDAI measurements. The heavy solid line is the fit to the data s(t) that minimizes criterion (4) and the dashed line is the solution to the differential equation (5). The light solid line plots the value of δα(t).

What we lose, however, is the capacity to fit lower values of SLEDAI; the range of variation within a flare is too limited to permit this.
On the whole, however, these fits are quite satisfactory and capture well the main dynamic features of this segment of a lupus record.


Acknowledgement: The statistical investigations of these data are funded by the Canadian Institutes of Health Research (CIHR). The co-investigators on the lupus project are Dr. Michal Abrahamowicz, McGill University, and Dr. Paul Fortin, University of Toronto. I would also like to recognize the contributions of my graduate students, Mr. Jiguo Cao, Ms. Carlotta Fok and Ms. Wen Zhang. The example from chemical engineering was supplied by Dr. James McLellan of Queen's University, and this research also benefited from the research collaboration with Andrew Poyton, a graduate student at Queen's.

Figure 5: Results for the analysis of the data in Figure 3 using a smoothing parameter λ = 10^{0.5}. The circles and lines are as in Figure 4.

Address: J. O. Ramsay, McGill University, 1205 Dr. Penfield Ave., Montreal, Quebec, Canada H3A 1B1
E-mail: ramsay@psych.mcgill.ca

SIMPLE SIMULATIONS FOR ROBUST TESTS OF MULTIPLE OUTLIERS IN REGRESSION
Marco Riani and Anthony Atkinson
Key words: Forward search, large data sets, simultaneous inference, trimmed estimators.
COMPSTAT 2004 section: Robustness.

Abstract: The null distribution of the likelihood ratio test for outliers in regression depends on the distributional properties of trimmed samples. Approximations to the distribution of the statistic that are simple to simulate are described and applied to three examples.

1 Introduction
Tests of outliers in regression need estimates of both the parameters of the linear model and of the error variance σ². If the outliers are included in the set used for estimation, inconsistent estimates of the parameters will be obtained and the existence and the effect of the outliers will be masked. We therefore consider procedures in which the observations are divided into two groups: those believed to be 'good' and the outliers. The good observations are used to provide estimates of the parameters to be used in the test for outliers.
Let there provisionally be m good observations out of n. We are interested in the null distribution of the outlier test. We therefore need to perform our calculations as though there were no outliers. If we were interested in the simplest case when, instead of regression, the focus is the location parameter of a random sample from a symmetrical distribution, we would base our estimates on the m central observations, trimming the remaining n − m. The properties of our estimators would then be those coming from this trimmed sample of n observations, rather than from m observations taken at random from the parent population. We use this insight to provide excellent approximations to the distribution of the outlier test in regression.
The literature on the detection of outliers in regression is vast. The test we study here is the likelihood ratio test, that is, the test based on the prediction residuals used, for example, by Hadi and Simonoff [13] for the detection of multiple outliers. Two useful surveys of methods for multiple outliers in regression are Beckman and Cook [9] and Barnett and Lewis [8]. An important point is that, if several outliers are present, single deletion methods (for example, Cook and Weisberg [12], Atkinson [1]) may fail. Hawkins [14] argues for exclusion of all possibly outlying observations, which are then
tested sequentially for reinclusion. This corresponds to our description in which m observations are used for estimation.
The drawback to Hawkins's procedure is that it is unclear how many observations should be deleted, and, because of masking, which ones, before reinclusion and testing begin. However, the forward search is an objective procedure of this type: it starts from a small, robustly chosen, subset of the data and fits subsets of increasing size. Each newly introduced observation can be tested for outlyingness before it is included in the fitted subset.
The use of the forward search in regression is described in Atkinson and Riani [4] where, as in Atkinson [2], the emphasis is on informative plots and their interpretation. The extension to multivariate data is described by Atkinson [3], with a book length treatment in Atkinson, Riani and Cerioli [7]. Although the forward search is a powerful general method for the detection of multiple outliers and unidentified clusters, the references do not describe inferential procedures based on the quantities plotted. Atkinson and Riani [6] use the forward search as a means of generating a series of outlier tests with decreasing amounts of trimming; m increases from slightly more than the number of parameters to n. The values of the statistics are assessed by simulation and by analytical approximations to the robust tests. The interest in the present paper is in the application of the tests. We use both simulations of forward searches and two simple simulated approximations to the distribution to analyse three sets of data. As a result we are able to combine the power of the forward search with precise statistical procedures.
The paper is organised as follows: in §2 we briefly review the forward search and robust estimation; both depend on estimators from trimmed samples. In §3 we write the outlier test explicitly in terms of such samples and show how simulations using samples from trimmed distributions can be used to approximate the distribution of the statistic. Examples in §4 show how well our approximation works. The final section briefly describes further work.

2 Least squares and outlier detection

2.1 Least squares
In the regression model

    y = Xβ + ε,                                               (1)

y is the n × 1 vector of responses, X is an n × p full-rank matrix of known constants, with ith row x_i^T, and β is a vector of p unknown parameters. The normal theory assumptions are that the errors ε_i are i.i.d. N(0, σ²).
With β̂ the least squares estimator of β, the vector of least squares residuals is

    e = y − ŷ = y − Xβ̂ = (I − H)y,                            (2)


where H = X(X^T X)^{-1} X^T is the 'hat' matrix, with diagonal elements h_i and off-diagonal elements h_{ij}. The mean square estimator of σ² can be written

    s² = e^T e / (n − p) = Σ_{i=1}^{n} e_i² / (n − p).        (3)

We define the standardized residuals

    q_i = e_i / √(1 − h_i).                                   (4)

Like the errors ε_i, the q_i are distributed N(0, σ²), although they are not independent.
The likelihood ratio test for agreement of a new observation y_new, observed at x_new, with the sample of n observations providing β̂ and s² is the prediction residual

    d_new = (y_new − x_new^T β̂) / ( s √{1 + x_new^T (X^T X)^{-1} x_new} ),      (5)

which, when the observation y_new comes from the same population as the other observations, has a t distribution on n − p degrees of freedom.
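The quantities in (2)-(5) are straightforward to compute; the following R sketch on simulated data with assumed dimensions is included only to fix the notation, and is not part of the authors' procedure.

    set.seed(1)
    n <- 30; p <- 3
    X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
    y <- X %*% c(1, 2, -1) + rnorm(n)
    beta_hat <- solve(crossprod(X), crossprod(X, y))    # least squares estimate
    H  <- X %*% solve(crossprod(X), t(X))               # hat matrix
    e  <- y - X %*% beta_hat                            # residuals, equation (2)
    s2 <- sum(e^2) / (n - p)                            # mean square estimate (3)
    q  <- e / sqrt(1 - diag(H))                         # standardized residuals (4)
    ## prediction residual (5) for a new observation
    x_new <- c(1, rnorm(p - 1)); y_new <- sum(x_new * c(1, 2, -1)) + rnorm(1)
    d_new <- (y_new - sum(x_new * beta_hat)) /
      (sqrt(s2) * sqrt(1 + t(x_new) %*% solve(crossprod(X), x_new)))
    drop(d_new)    # compare with quantiles of t on n - p degrees of freedom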

2.2 The forward search

Let M be the set of all subsets of size m of the n observations. The forward search fits subsets of observations of size m to the data, with m_0 ≤ m ≤ n. We discuss the starting point of the search in §2.3.
Let S_*^(m) ∈ M be the optimum subset of size m. Least squares applied to this subset yields parameter estimates β̂(m*) and s²(m*), the mean square estimate of σ² on m − p degrees of freedom. Residuals can be calculated for all observations, including those not in S_*^(m). The n resulting standardized residuals can, from (4), be written as

    q_i(m*) = ( y_i − x_i^T β̂(m*) ) / √{1 − h_i(m*)}.         (6)

The notation h_i(m*) serves as a reminder that the leverage of each observation depends on S_*^(m). The search moves forward with the subset S_*^(m+1) consisting of the observations with the m + 1 smallest absolute values of the e_i, that is, the numerator of q_i(m*).
In order to simulate the distribution of the outlier test of §2.4 we need a simple way of simulating variables with the same distribution as the q_i(m*). When m = n these residuals are those in (4) and the distribution is N(0, σ²). But with m < n the estimates of the parameters are based on only those observations giving the central m residuals: β̂(m*) and s²(m*) are calculated from truncated samples.
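One step of this process is easy to code. The sketch below (plain R; the helper name is ours, not the authors') refits the model on the current subset and returns the indices of the m + 1 observations with the smallest absolute residuals, that is, the next subset.

    forward_step <- function(X, y, subset) {
      Xs <- X[subset, , drop = FALSE]; ys <- y[subset]
      beta_m <- solve(crossprod(Xs), crossprod(Xs, ys))   # fit on the current subset
      e_all  <- as.vector(y - X %*% beta_m)               # residuals for all n observations
      order(abs(e_all))[seq_len(length(subset) + 1)]      # indices of the subset of size m + 1
    }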
2.3 Robust estimation and the start of the search

The search starts from a subset of p observations S_*^(p) that is chosen to provide a very robust estimator of the regression parameters. For example, if Least Median of Squares (LMS, Rousseeuw [16]) is used, the subset of p observations is found minimizing the scale estimate

    ŝ²(p*) = e²_[h](p*),                                       (7)

where e²_[k](p*) is the kth ordered squared residual and h is the integer part of (n + p + 1)/2, corresponding to 'half' the observations when allowance is made for fitting. Typically the search either examines all subsets of size p, if this is not too large, or several thousand subsets are examined at random. These starting methods destroy masking; any remaining outliers are then removed in the initial steps of the search. Consequently, the search is insensitive to the exact starting procedure. What is important for our present purpose is that the search again uses parameter estimates based on a central part of the sample.
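A sketch of such a start, assuming random resampling of subsets of size p and the criterion (7); the function name and the number of subsets are illustrative choices only.

    lms_start <- function(X, y, nsamp = 3000) {
      n <- nrow(X); p <- ncol(X); h <- floor((n + p + 1) / 2)
      best <- Inf; best_subset <- NULL
      for (i in seq_len(nsamp)) {
        idx <- sample(n, p)
        b <- tryCatch(solve(X[idx, , drop = FALSE], y[idx]), error = function(e) NULL)
        if (is.null(b)) next                        # skip singular p x p systems
        crit <- sort(as.vector(y - X %*% b)^2)[h]   # h-th ordered squared residual
        if (crit < best) { best <- crit; best_subset <- idx }
      }
      best_subset
    }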

2.4 Testing for outliers

Let the observation 'nearest' to those constituting S_*^(m) be i_min, where

    i_min = arg min_{i ∉ S_*^(m)} |d_i(m*)|,                   (8)

with d_i(m*) the prediction residual of observation i computed from β̂(m*) and s(m*); i_min is thus the observation with the minimum prediction residual among those not in S_*^(m). If observation i_min is an outlier, so will be all other observations not in S_*^(m).
To test whether observation i_min is an outlier we use the predictive residual (5). The test for agreement of the observed and predicted values is

    |d_{i_min}| = | y_{i_min} − x_{i_min}^T β̂(m*) | / ( s(m*) √{1 + h_{i_min}(m*)} ).      (9)

It is the distribution of this statistic that is the subject of this paper. In (5), when all observations were used in fitting and a new observation was being tested, the distribution was t_{n−p}. Now the estimates β̂(m*) and s(m*) are based on the central part of the distribution. Even under the null hypothesis that the sample contains no outliers, the distribution is no longer t.

3 Simulating the distribution


The empirical distribution of the series of test statistics can be found by
repeated simulations of forward searches. In this section we describe this
method and then describe two alternative simulation-based methods. The
first replaces the series of simulations and forward searches with independent
simulations for each value of m. The second uses a series of orderings of simulated data, but avoids the forward search.
Both of these methods are for the statistics calculated for simple samples. In §3.4 we introduce a correction for the dependence of the distribution of the statistics on p.

3.1 The empirical distribution


In order to find the distribution of the test statistic during the forward search, the most straightforward method is to simulate samples of all n observations and repeat the forward search a number of times. In order to capture any special features of the hat matrix, the matrix of explanatory variables is that of the data under study. Observations are simulated using the fitted values at the end of the search, that is x_i^T β̂(n), and the estimated standard deviation s(n).
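In outline, the empirical envelopes are pointwise quantiles over the repeated simulations. The R sketch below assumes a hypothetical function forward_search(X, y) that returns the outlier statistic (9) for each subset size; that helper is not defined here.

    simulate_envelopes <- function(X, beta_n, s_n, nsim = 1000) {
      stats <- replicate(nsim, {
        y_star <- as.vector(X %*% beta_n) + rnorm(nrow(X), sd = s_n)
        forward_search(X, y_star)    # hypothetical: statistic (9) for m = m0, ..., n - 1
      })
      ## pointwise percentage points for each subset size m
      apply(stats, 1, quantile, probs = c(.01, .025, .05, .5, .95, .975, .99))
    }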

3.2 Method 1: Truncated samples

We are interested in approximations to the null distribution of (9) for given m which can easily be found. The statistic is a function of the m residuals q_i(m*) ∈ S_*^(m) and of q_{i_min}(m*). In the absence of outliers, these will be the observations with the m + 1 smallest values of |q_i(m*)|. Since the q_i(m*) are residuals, their distribution does not depend on the parameters β of the linear model. They have also been standardised to have constant variance, which is then estimated. To find the required distribution we therefore simulate from a truncated normal distribution and calculate the value of the outlier test for such samples. The steps are:
Step 1. Obtain a random sample of m + 1 observations U_i from the uniform distribution on [0.5 − (m + 1)/(2n), 0.5 + (m + 1)/(2n)].
Step 2. Use the inversion method to obtain a sample of m + 1 from the truncated normal distribution:

    z_i = Φ^{−1}(U_i),                                          (10)

where Φ is the standard normal c.d.f.
Step 3. Find the most outlying observation:

    z_{i_min} = max |z_i|,   i = 1, ..., m + 1.                 (11)

Then S_*^(m) = {z_i}, i ≠ i_min, i = 1, ..., m + 1.
Step 4. Estimate the parameters. Let z̄(m) be the mean of the m observations in S_*^(m) and s²_z(m) be the mean square estimate of the variance.
Step 5. Calculate the simulated value of the outlier test in (9):

    d_{i_min} = ( z_{i_min} − z̄(m) ) / ( s_z(m) √{(m + 1)/m} ).    (12)
The simulation of the truncated normal distribution using the inversion method in Steps 1 and 2 is straightforward in S-Plus or R.
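For example, one simulated value of (12) can be obtained as in the following R sketch; the function name and the illustrative values of m and n are ours.

    sim_truncated <- function(m, n) {
      u <- runif(m + 1, 0.5 - (m + 1) / (2 * n), 0.5 + (m + 1) / (2 * n))  # Step 1
      z <- qnorm(u)                                  # Step 2: inversion, truncated normal
      imin <- which.max(abs(z))                      # Step 3: most outlying observation
      zbar <- mean(z[-imin]); s_z <- sd(z[-imin])    # Step 4: estimates from the m retained
      (z[imin] - zbar) / (s_z * sqrt((m + 1) / m))   # Step 5: simulated value of (12)
    }
    d_sim <- replicate(10000, sim_truncated(m = 50, n = 128))
    quantile(abs(d_sim), c(0.95, 0.975, 0.99))       # approximate percentage points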

3.3 Method 2: Ordered observations

In the forward search the n observations are ordered for each value of m. In the absence of outliers we might expect that this order would not change much during the search. As a second method of approximating the distribution of the statistics, we simulate sets of n observations from the normal distribution, correct for the mean and order the absolute values of the observations. For our calculations for each value of m we use the m smallest absolute residuals to estimate the parameters. The procedure is repeated several times, typically 1,000, to give the empirical distribution of the statistics.
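A corresponding R sketch for one simulated search path is given below; mean-correcting over all n observations and computing a statistic of the same form as (12) for each m is our reading of the description above.

    sim_ordered <- function(n, m_vals) {
      z <- rnorm(n); z <- z - mean(z)                # simulate and mean-correct
      z <- z[order(abs(z))]                          # order the absolute values once
      sapply(m_vals, function(m) {
        zin <- z[1:m]                                # m smallest absolute residuals
        (z[m + 1] - mean(zin)) / (sd(zin) * sqrt((m + 1) / m))
      })
    }
    paths <- replicate(1000, sim_ordered(128, 10:127))   # columns are simulated searches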

3.4 Adjustment for regression

In both Method 1 and Method 2 we estimate the sample mean, rather than a regression model, so h_{i_min}(m) = 1/m. Simulations show that the resulting upper percentage points of the distribution are too small when we are analysing regression data. Good agreement is obtained by using the adjusted statistic

    |d_{i_min}| = √{(m + δp)/m} · | y_{i_min} − x_{i_min}^T β̂(m*) | / ( s(m*) √{1 + h_{i_min}(m*)} ),    (13)

with δ = 0.7. As m increases, the effect of the correction becomes less.
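In code the adjustment is a single rescaling of the observed regression statistic before it is compared with the envelopes simulated for the sample mean; a one-line R sketch:

    adjust_stat <- function(d_reg, m, p, delta = 0.7) sqrt((m + delta * p) / m) * abs(d_reg)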

4 Examples
4.1 Hawkins's data
This set of simulated data was analysed by Atkinson and Riani [4], §3.1. There are 128 observations and nine explanatory variables. The data were intended by Hawkins to be misleading for standard regression methods. Figure 1 shows a forward plot of the minimum deletion residual among observations not in the subset, that is, the outlier test statistic (13), together with two sets of simulated percentage points of the distribution, both based on 1,000 simulations. We first consider these simulation envelopes.
The envelopes plotted with continuous lines in the figure are the 1, 2.5, 5, 50, 95, 97.5 and 99% points of the empirical distribution of the outlier test during forward searches simulated without outliers. The dotted lines are from our second approximate simulation method in which random samples of observations are ordered once. Agreement between the two envelopes is excellent during the second half of the search; agreement between the two sets of upper envelopes is also good during the first half of the search for m > 20. The envelopes are of a kind we shall see in all simulations. Initially they are very broad, corresponding to distributions with high trimming and few degrees of freedom for the estimation of error. In the central part of the search the band is virtually horizontal and gradually narrows. Towards the end of the search there is rapid increase as we test the few largest residuals.

Figure 1: Hawkins's data: forward plot of minimum deletion residuals (the outlier test). The four groups of observations are clearly separated by the three large peaks signalling the first observation from each new group immediately before it enters the subset. The dotted lines are envelopes simulated by Method 2.

The continuous line showing the plot of the outlier test in the figure reveals all the features that Hawkins put in the data. There are 86 observations with very small variance. The plot shows a huge jump in the value of the statistic when the first observation of the next group enters. This process is repeated two more times, clearly identifying the four separate groups of data that are present, the decline after each peak being due to the effect of masking. The forward plot of this test statistic is the same as that in the lower panel of Figure 3.6 of Atkinson and Riani [4]; the new confidence bands calibrate inferences about the significance of the peaks.
The envelopes rise rapidly at the end of the search and we can see that the outlier test finishes up being non-significant. Thus Hawkins has succeeded in constructing a data set with many outliers, all of which are masked. The curve of the statistic starts to rise just before m = 86. If we take only the first 86 observations and provide simulation envelopes for them, the envelopes rise at the end as the envelopes do here from m around 125. The last few observations do not then lie outside the simulation bands for this reduced set of data.

Figure 2: Ozone data: forward plot of minimum deletion residuals (the outlier test). There are some mild outliers towards the end of the search and some evidence of masking. The dotted lines are envelopes simulated by Method 1.

4.2 Ozone data

Hawkins's data are a synthetic example in which there are many outliers. We now consider two examples of real data.
The first is the ozone data from Breiman and Friedman [10], which give readings of ozone concentration on 300 consecutive days. The results for the first 80 days were extensively analysed by Atkinson and Riani [4], §3.4. Here we follow their analysis.
As a result of the use of the forward search combined with response transformation, the final model found by Atkinson and Riani had a logged response with five of Breiman and Friedman's original variables augmented by a linear trend in time. Figure 2 shows a forward plot of the outlier test for this model together with simulation envelopes from the forward search (continuous lines) and the approximate envelopes from the first method, of sampling from a truncated distribution. The agreement between the two sets of envelopes is again good, particularly for the upper envelope.
The evidence from this plot is much less dramatic than that of Figure 1. Apart from the very beginning of the search, the plot lies near or within the bounds for all values of m up to the introduction of the 76th observation. Thereafter there seem to be four mild outliers, a conclusion in line with the forward plot of residuals in Figure 3.37 of Atkinson and Riani [4].

Figure 3: Surgical Unit data: forward plot of minimum deletion residuals (the outlier test). The appreciable maximum of the statistic in the centre of the search suggests there may be two equal sized groups of observations that differ in some systematic way. The dotted lines are envelopes simulated by Method 1.

At m = 76 this plot shows four appreciable residuals, three negative and one positive: these lie apart from the general cloud of residuals throughout the whole search. The plot also shows some evidence of masking, the residuals decreasing somewhat in magnitude at the end of the search. The effect of masking is also evident in Figure 2, where the test statistic lies within the simulation envelopes for the last two steps of the search. Although the masking here is not as misleading about the structure of the data as that in Figure 1, there are again outliers whose presence would be overlooked by an analysis based on all the data, or on single deletion diagnostics.

4.3 Surgical unit data

Neter, Kutner, Nachtsheim and Wasserman [15] introduce, on p. 334, data on the survival time of 54 patients undergoing liver surgery, together with four explanatory variables that may be used to predict survival time. Their preferred model regresses y on three of the explanatory variables, X4 being excluded. On p. 437 another 54 observations are introduced to check the model fitted to the first 54. Their Table 10.9 compares parameter estimates from the two sets for the preferred regression model. The conclusion is that there is no systematic difference between the two sets and that the same model is acceptable for all the data.

Figure 4: Surgical Unit data: forward plot of minimum deletion residuals (the outlier test) for the first and second 54 observations. There is strong evidence that there are three groups amongst the first 54 observations. The dotted lines are envelopes simulated by Method 1.

Atkinson and Riani [5] analysed the combined set of all 108 observations using the forward search to assess the influence of individual observations on the estimated regression coefficients. They also conclude that a logged response and a linear model in X1-X3 adequately describe the data. Because we will shortly be augmenting the set of explanatory variables, we work with all four original variables.
Figure 3 is a forward plot of the test for outliers for all 108 observations, together with the simulation envelopes and the approximation found by our first method. This surprising plot seems to show evidence of two groups: the extreme value of the statistic, well outside the boundaries, is at the centre of the search, after which there is a gradual decline in the values. At the end of the search the statistic is nudging the lower envelope, a stronger version of the effect of masking noticed in the two previous figures.
Since the maximum value of the statistic is at m = 55, we examine those units that enter after this value, to see whether they might belong to a second cluster. Detailed analysis of the results of the forward search shows that, after m = 57, nearly all the patients entering have unit numbers greater than 54 and so come from the group of confirmatory observations.
This figure suggests the group of confirmatory observations may be different from the original 54 units. Accordingly, we introduce a dummy variable for the two sets and repeat the analysis. This variable is highly significant, with a t value of -7.83 at the end of the search. However, the resulting forward plot still has a slight peak in the centre, although this is much reduced from that in Figure 3. Some remaining structure is indicated.
To take the analysis further we consider the two groups separately. Figure 4 gives the forward plots of the test for outliers. The plot for the second
group of observations, in the right-hand panel, suggests that the group is homogeneous. However, that in the left-hand panel strongly indicates that the first group contains at least one identifiable subgroup that needs to be disentangled before further analysis is undertaken. A next stage in the analysis would be to extend the scatterplot matrix of the data in Figure 8.3 of Neter et al. [15] to include different plotting symbols for the tentative groups.

5 Discussion
The previous examples are comparatively small and the many plots from the forward search can easily be interpreted. However, as the number of units increases, plots for individual units, such as forward plots of residuals, can become messy and uninformative due to overplotting. Atkinson and Riani [6] analyse 500 observations on the behaviour of customers with loyalty cards from a supermarket chain in Northern Italy. Despite the larger number of observations the forward plot of the test for outliers is as easily interpreted as those in this paper and shows an unsuspected group of 30 very different customers.
There are two further general methodological matters that deserve comment. The first is that the envelopes presented in this paper were all found by simulation. An alternative, investigated by Atkinson and Riani [6], is to calculate the percentage points directly using analytical results on order statistics and the variance of truncated normal distributions. The other point is that, however the envelopes are calculated, the probability statements refer to pointwise exceedance of the bands. To find, for example, the probability of at least one transgression of a specified envelope somewhere during a particular region of the search, for example the second half, requires calculation of the simultaneous probability of transgression at any of the stages of the search within that region. Computationally feasible methods are described by Buja and Rolke [11].
Atkinson and Riani [6] may be viewed at www.lse.ac.uk/collections/statistics/research/

References
[1] Atkinson A.C. (1985). Plots, transformations, and regression. Oxford University Press, Oxford.
[2] Atkinson A.C. (1994). Fast very robust methods for the detection of multiple outliers. Journal of the American Statistical Association 89, 1329-1339.
[3] Atkinson A.C. (2002). The forward search. In W. Härdle and B. Rönz, editors, COMPSTAT 2002: Proceedings in Computational Statistics, Physica-Verlag, Heidelberg, 587-592.
[4] Atkinson A.C., Riani M. (2000). Robust diagnostic regression analysis. Springer-Verlag, New York.
[5] Atkinson A.C., Riani M. (2002). Forward search added variable t tests and the effect of masked outliers on model selection. Biometrika 89, 939-946.
[6] Atkinson A.C., Riani M. (2004). Distribution theory and simulations for tests of outliers in regression. Submitted.
[7] Atkinson A.C., Riani M., Cerioli A. (2004). Exploring multivariate data with the forward search. Springer-Verlag, New York.
[8] Barnett V., Lewis T. (1994). Outliers in statistical data (3rd edition). Wiley, New York.
[9] Beckman R.J., Cook R.D. (1983). Outlier detection (with discussion). Technometrics 25, 119-163.
[10] Breiman L., Friedman J.H. (1985). Estimating optimal transformations for multiple regression and correlation (with discussion). Journal of the American Statistical Association 80, 580-619.
[11] Buja A., Rolke W. (2003). Calibration for simultaneity: (re)sampling methods for simultaneous inference with applications to function estimation and functional data. Technical report, The Wharton School, University of Pennsylvania.
[12] Cook R.D., Weisberg S. (1982). Residuals and influence in regression. Chapman and Hall, London.
[13] Hadi A.S., Simonoff J.S. (1993). Procedures for the identification of multiple outliers in linear models. Journal of the American Statistical Association 88, 1264-1272.
[14] Hawkins D.M. (1983). Discussion of paper by Beckman and Cook. Technometrics 25, 155-156.
[15] Neter J., Kutner M.H., Nachtsheim C.J., Wasserman W. (1996). Applied linear statistical models, 4th edition. McGraw-Hill, New York.
[16] Rousseeuw P.J. (1984). Least median of squares regression. Journal of the American Statistical Association 79, 871-880.

Address: M. Riani, Dipartimento di Economia, Università di Parma, Italy;
A. Atkinson, Department of Statistics, London School of Economics, UK
E-mail: mriani@unipr.it, a.c.atkinson@lse.ac.uk

THE ST@TNET PROJECT FOR TEACHING STATISTICS
Gilbert Saporta and Marc Bourdeau
Key words: Teaching, statistics, information society.
COMPSTAT 2004 section: Teaching statistics.

Abstract: This paper describes the design and development of St@tNet, an Internet environment for the teaching of basic Applied Statistics. St@tNet has been developed by a consortium of French-speaking universities. After some general considerations on education for the Information Society, and more specifically for the teaching of Statistics, we will present our product in its present state of development.

1 Means and ends

The title of this session is about teaching Statistics for the Information Society. Well, the Information Society began with the invention of the printing press with moveable type, and that has profoundly modified the formal education process. In essence, it has permitted widespread knowledge dissemination. For a few centuries things have stayed more or less the same, until the invention of mass media. Starting with the radio, then television, it became apparent that the world had once more profoundly changed. Their consequences in the formal and informal education processes were no doubt far-reaching, but now that we have entered the computer age, we have passed into speed Warp Five, to speak StarTrek lingo; in the last few years the Internet development has brought about a genuine revolution in education thinking, actually a totally new zeitgeist.
Neil Postman [1931-2003], one of the keener observers of the evolution of education, and of society in general, has fully explored the consequences of this information revolution [4], [5]. He reports, and it is a common observation, that the situation of teachers and professors has become precarious: they are worried, even anxious, about their role and their immediate future in the Information Society.
Topping all this, governing bodies in many of the developed countries have nowadays become obstinate in drastically reducing budgets, with the elusive hope that the new technologies will give rise to an unprecedented increase in productivity: lesser means, greater expectations... Illusion, reality, who can tell? And what is the end of education finally?
At first sight then, it might appear that the new Information technologies (ITs) could lead to the end of the profession: at this journey's end, all the transmission of knowledge would originate from a few specialized quarters far away from students, pedagogical encounters would be virtual with the Internet being the sole communication channel. Universities and colleges
would supply themselves for knowledge transmission and certification from those virtual hyper-classrooms.
ITs could secure huge savings for education boards, but could entail the disappearance of most teachers and professors.
From an overview of some recent and very successful pedagogical experiments in Quebec universities using ITs, one can suspect that things will not be that simple [1]. The same situation, it is easy to confirm, is prevalent the world over. Actually, getting an education is a form of travelling. And quality travelling often implies personal guides, at least human encounters, not just guidebooks and TV documentaries, though these can be illuminating and irreplaceable. In our experience, all the pedagogies devised with the ITs in mind have always implied more personal contacts with students, less mass dispensing of knowledge!
With the Internet, we have perhaps entered an era of renaissance of the true pedagogical relation, not the opposite. As we will explain, this has far reaching implications for the reciprocal relations of teachers and students.

1.1 Teaching statistics in the information age

Concerning Statistics and Data analysis, there is no gloom and doom scenario in view: there is a huge increase of information that has to be processed. As John Wilder Tukey [1915-2000] has so correctly noted, "The best thing about being a Statistician is that you get to play in everybody's backyard."
Better tools of analysis are badly needed, and, since there is already a widespread availability of data sources and an increased appetite for synthetic information, an important increase in Statistics literacy is urgently needed for an ever increasing number of people. Think, among other things, of the amount of information stored and available in national Statistics Offices the world over. All newspapers and mass media are now replete with reports of polls, of official statistics on the economy and society in general. Think also of the huge amount of business information stored in Data Warehouses that come with an abundance of Data Mining software recently marketed. Making sense out of this "chaos" [8] is a huge undertaking. We are heading towards a knowledge-based society where statisticians will be ever more in demand.
We report here on the education material for the teaching of Statistics produced in our universities. See Saporta [6] for an overview of some of the web facilities for the teaching of Statistics². See also the remarkable paper by Velleman & Moore on the ins and outs of the use of ITs in the teaching of Statistics [9].

¹ All the relevant documents upon which this assertion rests, and that have been used for [1], are located on the following web pages: http://www.mgi.polymtl.ca/marc.bourdeau/InfAgeTeaching...
² Also available in the web pages just referred to.
2 The St@tNet project

The St@tNet project is developed at the Conservatoire National des Arts et Métiers (Cnam), a major public institution for continuous education and an integral part of the French Ministry of Education, Research and Technology. The Cnam was founded in 1794 to "enlighten ignorance that does not yet know, and poverty which cannot afford knowledge." More than 70,000 adult students attend its courses each year in numerous fields; two-thirds of them have already had two years of 'higher' education, and one third are women.
Courses are given mainly in the evenings and in Saturday classes for credits leading towards a degree, as well as through in-service training during working hours, and, finally, through distance-learning. The Cnam links a network of 150 towns and is organized around a 'main' complex in Paris, 22 regional centers, plus some centers in overseas territories. One can begin a program anywhere in the network and continue in any other center. Graduate studies leading to Masters and PhDs are available in many disciplines.
St@tNet follows a series of previous developments of teaching materials for introductory Statistics that date back to the early nineties. Previous courses were available on diskettes and CD-roms [7]. The actual web-course version was financed by the Agence Universitaire de la Francophonie (AUF) and the French Ministère de l'Éducation Nationale. It has been operational since 2002, and can also be obtained in a CD-rom version.
St@tNet is the only web resource proposed at the Cnam for distance learning for the much needed Introductory Statistics. It is freely accessible³. Indeed, having been financed by public funds, and for the advancement of public learning in conformity with its founding principles that go back to the Enlightenment Age, the decision of the free access of St@tNet was finally agreed upon after fierce debates, but registering at a cost of 250 euros is mandatory for certification purposes and the use of the usual facilities: tutorship (one tutor per 25 students), Internet access on a virtual teaching environment (VTE), an e-mail, etc. This fee includes the CD-rom, which avoids most of the Internet costs and waiting times, especially in distant locations.
St@tNet is now also implemented on the virtual campuses of the Agence Universitaire de la Francophonie, where it is one of the two most popular resources for self-education. Starting in the Fall of 2004, the Cnam will organize a certification system for the AUF courses. St@tNet is a complementary resource for the École Militaire; it is also recommended by the French association of mathematics teachers as an aid to school teachers who have to adapt themselves to new curricula that include elements of Probability and Statistics.
With its network of institutions, the Cnam is an ideal ground for the development of pedagogy and teaching material using ITs. Modern teaching of Applied Statistics requires the use of specialized software, and should be

³ http://www.agro-montpellier.fr/cnam-lr/statnet/
data based, centered on case studies for more advanced material and hands-on training. Applied Statistics is indeed much more than a set of mathematical formulae: its learning implies the development of "statistical thinking", requires the understanding of difficult concepts such as variation, randomness, laws of chance (a difficult oxymoron at first glance), probable errors, risks, etc. Animations and various graphical tools provide efficient means of learning.
Depending on the level, one can think of various designs for the Internet environments and interactions. Up to now, there are two stages planned in the St@tNet project; the first one is fully operational, the second in development, but with partial versions tested in ordinary classrooms.
For the first stage, at the very basic level of statistical knowledge, St@tNet has opted for a complete Html environment. The advantage of this choice is that interactions of the students with the environment are quite easy to realize: this course is by no means a paper-course translated into Html, as one can still see quite often, but a full-fledged Html environment with frequent short interactions inserted by design into the course.
For higher levels of knowledge, where short interactions are much less needed, St@tNet has opted for a downloadable Latex-Pdf text, with full hyper-referencing possibilities, and many of the hyper-references are internal.

2.1 First stage: the basics

The first stage of the project, the one for the really basic knowledge, is now fully operational. It consists of six modules: data description, probability, random variables, sampling and estimation, tests, basic linear regression. Each of the modules is introduced by a video file (Figure 1, upper part) and is composed of lessons, all of which have the same structure: introduction, development, synopsis, exercises. A glossary of terms is accessible within each lesson, as well as all the necessary Statistical Tables and Internet links.
Once in a module, and after viewing its presentation video, the user can pick a lesson of his choice: indeed, the learning progression is not designed with a linear structure in mind. Most of our students detest such a progression that does not correspond to their needs.
The lower part of Figure 1 shows part of a page of the Développement (development) section of Leçon 1 (Lesson 1) of the module Tests (tests), with the pop-up window that is produced when a wrong answer is given by the reader. Upon a wrong answer, the reader can either correct his answer or get the right one with a short explanation.
Similarly to what is represented in this last Figure, lessons are interspersed with questions to the reader to check if the elements of learning have been correctly assimilated, as well as with some Flash animations and some hyperlinks to Java applets. All lessons end with a page of summary (Figure 2), and a few more elaborate exercises, again with answers given directly on the
page, with pop-ups for feedback. A pop-up Glossary, the same for all lessons, is hyper-referenced, and, finally, a page of links is available, with some of them referring to external Java applets useful for the learning.


Figure 1: Upper: The entry for the module Statistiques descriptives (Descrip-
tive statistics), with its introductory video. Lower: Part of the development
section for Lesson 1 of the module Tests (Tests), with a pop-up window
obtained with a wrong answer.

A new audience has been reached by this approach, and the rate of reten-
tion and success is better than for traditional courses. This last point might
be the consequence of the type of students (a "sampling bias"!) interested
in such an environment.

Figure 2: The summary page from Lesson 1 of the module Statistiques descriptives (Descriptive Statistics).

2.2 Second stage: applied linear models

After the first stage of the project was carried out, and after a decision was made to embark on a large project concerning applied linear models, consisting of the standard curriculum completed by methodologies for categorical data, like the logistic and log-linear models, reflection was given as to what format would be appropriate for more advanced learning.
The advanced learner of a given discipline, especially at the Cnam, has very different needs than the learner of the elements. More often than not, a first course in Probability-Statistics is mandatory. A second course is taken by those who feel a greater relevance of the material taught to their actual work. Hence a truer motivation. In any case, to get the attention of a student, any student, one has to pay heed to his needs, to speak his language.
In Applied Statistics, the actual practice requires the continuous use of Statistics software on real data (real as opposed to simulated, with all the complexities then of reality), and an important part of the work consists of careful questioning from the analyst and writing up of the facts found during the process.
All this points to a pedagogy that rests principally on case studies, probably the natural points of entry to the curriculum for many students in the engineering and management sciences to whom this course is destined. Theory, the mathematical derivations (and they show a complexity far beyond that found in the elements), are seen as answers to specific questioning on their part. Thus bigger and more mathematical chunks of material in more advanced studies, instead of the tidbits of the elements.
Another important point in our view of things: there should be a constant preoccupation from the designers of Internet courses, all courses for that matter, to instill into the students the art of questioning. We refer here also
to Postman in his last essay ([4] p. 161 seq): "(...) question-asking is the most significant intellectual skill available to human beings," and it is extremely strange that, especially in the Sciences, whether hard or applied, it is not taught in schools!
Finally, and this also harks back to Postman in all his books on Education, we have written historical notes on all the principal aspects of the origin of the need of statistical models for reality. It is a fact that with History notes there is a sort of holographic phenomenon: even when one starts from hard sciences' bits of knowledge, exploring how things came to be, where ideas came from and how we came by them provides, if propelled by a sense of questioning, an insight on the whole of societies, on all of Human nature. This constitutes an essential part of any formation. Education after all is not only about information, but first and foremost about the formation or casting of minds, young ones in particular.
In summary, due to the mathematical sophistication of this material there is a need for textbook typography, as well as, as usual, a need for a complete system of inner referencing and outer or hyper-referencing facilities. This leaves nowadays almost no choice: such a course must be typeset in Latex-Pdf. The Pdf files are virus-proof, they can be readily printed on paper with textbook color quality, their use on computer screens is very comfortable, moreover providing some annotating facilities, and, finally, inner links and hyperlinks are manipulated with extreme ease.
This second stage of St@tNet is not, as yet, fully operational, but a demo version is available, and parts of the material, especially some case studies, were tested with great success in standard classrooms⁴. In the following pages we present some of its highlights.
In Figure 3 we can see part of an ordinary page of the course file. At the bottom of the page is an icon referring to a Flash animation, an image of which appears in Figure 4.
The reader can flip back and forth from any page giving internal links to an equation, a table, a figure. He can also, if he subscribes to an Internet server, readily access a certain number of hyper-links to whatever sites are deemed interesting by the authors. These pages will be added automatically at the end of the pdf file, which can be saved with the added information. The Adobe reader also provides various facilities to annotate the file pages.
Like in the first stage of St@tNet, many Flash animations are also included in the text. They constitute a remarkable tool to ease the learning. The development of a Flash animation is fairly easy, they are space efficient, and the Flash plug-in is very light and widespread. Furthermore, these animations are upward compatible, and can be readily updated.
Each one of our animations comes with a certain number of controllable buttons, one of which is an audio file.
⁴ http://www.mgi.polymtl.ca/marc.bourdeau/InfAgeTeaching.

Figure 3: Part of a typical page in stage two, with the icon referring to a Flash animation.

Figure 4: A page from one of the course Flash animations, with its controllable buttons, one of which (bottom) is for an audio file.
In the example (Figure 4), the nodes of the regression are mobile and new nodes can be added, the confidence bands resulting from the least squares results have a button to control their level, and whenever a change is made, directly with the mouse on the computer screen, the new regression line and confidence bands with the other numerical parameters promptly appear on the screen. The audio file provides instructions for the use of the animation, a few explanations, and always, this is very important, a questioning that the animation brings out.

Figure 5: A typical page of a case study, with an icon to import the data.

In Figure 5 we show a typical page of a case study. Remember: case studies are the backbone of our pedagogy. A case study is usually several pages long and is built along a certain questioning on a data set that is usually quite complicated. Its flow is very progressive, and generally requires a few dozen hours of work with the writing of a roughly 20 page report.
This task imperatively implies team work. And true collaboration is necessary: a case study is not composed of a certain number of unrelated problems, like the standard homework found in most curricula, but has a synthetic character where each part responds or resonates with other ones. There is a wide landscape built into every case study; several chapters of Statistics are brought to bear. This precludes the "usual" splitting up of the work... The work required is similar to the actual data analysis required of engineers and scientists, and ordinary work for that matter.
In a standard classroom, students often have a natural peer group and team formation is quite easy (though there are more optimal ways for this selection), but for distance learning things are not that easy. However, in most organizations nowadays, the new ITs (chats, forums, etc.) already allow team virtual meetings and working that take up a very large, if not the larger, part of the work process!
Students tend to use all the modern hyper-communicating computer facilities that are usually available on all recent computers, as well as within
426 Gilbert Saporta and Marc Bourdeau

the VTEs in use in most universities. In most North American universities


students are in constant Internet contact with their peer-groups, almost day
and night, sending each other files of their works, of their thoughts, comments
on the courses, etc. And Internet real time voice communication facilities are
rapidly spreading. Writing, however, thanks to the Internet, has regained
much luster: writing constitutes indeed an essential tool for the unveiling of
one's real thoughts. Team work is constant. On a less bright side, homeworks
and exams tend to be freely accessible to all.. .
Professors must adapt to this situation. The Internet has changed, and will continue to profoundly change, the learning world. Thus ITs can compensate for the isolation that students in distance learning used to experience, as they have already done in the traditional classroom. But our type of pedagogy implies a much greater
contact not only horizontally, from students to students as we just noted,
but also vertically. For professors of Statistics, it is not the transmission of
bits of knowledge that will constitute their main task, it is the statistical
cast of mind itself that will be more and more the focal point of teaching.
And statistical thinking, like all casts of mind, is best transferred through
apprenticeship.

3 Conclusion
The Information Age offers mind-boggling perspectives and cannot but have a profound impact on the pedagogy of every discipline, first and foremost those of a technical character, and Statistics is one of them. All the presentations in this session will no doubt show the diversity of options.

The end of the journey for teachers? At first sight, it might appear that
all these new facilities lead to the disappearance of teachers and professors.
But many very successful pedagogical experiments have shown that human
pedagogical guides are more necessary than ever, and that ITs provide an
indispensable structure for more interactions between them and the students.
It would not be surprising if the new pedagogical paradigm were that of apprentices and masters. In all the pedagogical experiences we have seen, not only in Statistics and not only our own, there is a greater need than ever for human personal transmission. The role of professors becomes more and more that of a resource person, a guide so to speak, and less and less that of a
knowledge dispenser. Pure knowledge transmission is not the principal role
of professors anymore: this has now been more or less automated thanks to
the new ITs. Transmission is required now at a much higher cognitive level.
And written words, for their precision, as well as oral contacts, play (ITs in the background again!) a crucial role. On the Internet, all courses tend to become tutorials! And this is the expensive form of teaching... That may
explain why the pedagogical interaction has become so much more demanding
than ever.

We do not imply, however, that teaching becomes a student-driven process for the development of curricula and material. But a clear result of our experience with teaching case-based Statistics courses to engineers with an active pedagogy approach amply confirms Parr's [3] and Moore's [2] experiences: we obtain a much more efficient knowledge transmission, as well as a more positive attitude towards the discipline, than what we observed through years of teaching with the traditional approach.
On the other hand, many governments nowadays tend easily to believe that education and other public services are not of primary importance and cost too much. For reasons of globalization and so forth, they preside over decreasing public spending. The Internet Age can readily provide very low quality and very low cost formative material (garbage in, garbage out), as well as higher quality education than ever. The latter is the kind needed in an increasingly complex world. But the wheel of Fortune spins faster than ever and, as has always been the case, the outcomes are not totally random: the better educated will no doubt reap the profits.
The question of what is in store for pedagogues in the future will in any case be with us for some time.
What about St@tNet's journey? The conception, development and implementation of St@tNet required considerable resources, human as well as financial. The end product could constitute a complete curriculum in French for Applied Statistics.
The first stage was conceived with a playful spirit in mind, to which the elementary concepts of Probability and Statistics lend themselves fairly easily. But putting it into service required a considerable amount of work, much more than the writing of a standard chalk-and-blackboard course, or of a set of telegraphic computer slides. The second stage is much more difficult to conceive if one does not care for a standard run-of-the-mill product, but strives after something more pedagogically efficient. The deeper one goes into the discipline, the more difficult the task.
The question is not whether there will be a need for St@tNet or its successors in the future, but what form they will take and what resources it will be necessary to put into action. A knowledge-based society will indeed bring no dearth of work for statisticians and teachers of the discipline (cf. the 6th European research programme FP6). However, at the same time that technology's pace shows no sign of slowing down and the demand is growing rapidly, the human and financial resources might become more difficult to muster... International cooperation and sharing of the new IT products catering to the needs of students of Statistics, as well as greater imagination and dedication on the part of teachers of Statistics, will no doubt be necessary.

References
[1] Bourdeau M. (2003). L'enseignement supérieur et les TICE (Technologies de l'information et des communications en enseignement). Big bang ou méga flop ? Conférence pour l'inauguration de la mission TICE, Université de Bretagne Sud, 4 mars 2003. http://www.mgi.polymtl.ca/marc.bourdeau/InfAgeTeaching.
[2] Moore David S. (1997). New pedagogy and new content: The case of statistics. International Statistical Review, 65(2), 123-137.
[3] Parr William C. & Smith Marlene A. (1998). Developing case-based business statistics courses. The American Statistician, 52(4), 330-337.
[4] Postman N. (1999). Building a bridge to the 18th century. How the past can improve our future. Alfred A. Knopf, New York.
[5] Postman N. (1995). The end of education. Redefining the value of school. Alfred A. Knopf, New York.
[6] Saporta G. (2002). St@tNet, an Internet based software for teaching introductory statistics. Proceedings of ICOTS 6, Sixth International Conference on Teaching Statistics, Cape Town, July 8-12, 2002.
[7] Saporta G., Morin A. (1996). Interactive software for learning statistics. COMPSTAT '96, Barcelona, 26-30 August 1996.
[8] Serban A.N., Luan J. (2002). Overview of knowledge management. New Directions for Institutional Research, 2002(113), 5-12.
[9] Velleman Paul F., Moore David S. (1996). Multimedia for teaching statistics: promises and pitfalls. The American Statistician, 50(3), 217-225.

Acknowledgement: St@tNet was developed with the generous support of the Agence Universitaire pour la Francophonie and of the French Ministère de l'Éducation Nationale.
Deep gratitude to all those working on the project:
http://www.mgi.polymtl.ca/marc.bourdeau/InfAgeTeaching/credits.pdf.
Address: G. Saporta, Conservatoire National des Arts et Métiers, 292, rue Saint-Martin, F-75003 Paris, France, http://cedric.cnam.fr/~saporta
M. Bourdeau, École Polytechnique de Montréal, http://www.mgi.polymtl.ca/marc.bourdeau
E-mail: Saporta@cnam.fr, Marc.Bourdeau@polymtl.ca

AN AUTOMATIC THRESHOLDING
APPROACH TO GENE EXPRESSION
ANALYSIS

Michael G. Schimek and Wolfgang Schmidt

Key words: Empirical Bayes, microarray, multiple testing, R, sparse sequence, statistical computing, threshold.
COMPSTAT 2004 section: Bayesian methods.

Abstract: The statistical problems of gene expression analysis based on the two popular array readout methods, cDNA and Affymetrix, are addressed. As an alternative to multiple frequentist statistical testing, the empirical Bayes methodology is introduced. An empirical Bayes thresholding approach is described and its relevance for microarray data analysis is shown. Finally two data sets, one of cDNA type and the other of Affymetrix type, are analyzed with the new automatic and computationally efficient thresholding technique.

1 Introduction
In recent years the new technology of microarrays has made it feasible to measure the expression of thousands of genes in order to identify changes between different biological states. Statisticians are requested to design methods which help to quantify the relevance of these experimentally obtained changes.
In such biological experiments (for an introduction see [12]) we are confronted with the problem of high dimensionality, because thousands of genes are involved, and at the same time with small sample sizes (due to limited availability of cases and for reasons of cost). This makes it a statistically and computationally demanding task. The complexity of the diseases, the poor understanding of the underlying biology and the imperfection of the measurements (many different sources of noise) are additional problems.
There are two dominating DNA array readout methods, cDNA ([3], p. 17ff, [12]) and Affymetrix GeneChips ([2], [12]). In the former the data are read from a fluorescent signal and in the latter the data are recorded from a radioactive signal. The statistical method described in this paper can be applied in both instances.
At first we portray popular techniques for gene expression analysis. Then the idea of empirical Bayes methods is introduced and an empirical Bayes thresholding (EBT) approach is described in some detail. Its practical relevance is demonstrated for colon data from one of our laboratories (cDNA) and for the so-called Golub data set (Affymetrix) from [8].

2 Fold change and classical inference methods

Let us assume that for each of n genes (i = 1, ..., n) we have measurements over J experimental conditions (j = 1, ..., J) on K slides (arrays) per experiment (k = 1, ..., K). These measurements may be intensity readings from a spotted cDNA array or probe-level intensity signals from an Affymetrix oligonucleotide system. The expression data are either log-ratios of intensities or log-intensities.
For each gene the fundamental question is whether the level of expression is substantially different between two (J = 2) or more (J > 2) situations. One approach commonly used in early publications (e.g. [13]), always limited to control versus treatment designs, has been a simple fold approach. This means that a gene is labeled significantly changed if its average expression level varies by more than a constant factor, typically two, between the two experimental conditions, the so-called "twofold rule". This ad hoc approach has one severe disadvantage: a factor of two does not mean the same in different regions of the spectrum of intensities, especially when extreme values are concerned.
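In R terms (R is the language used for the analyses in Section 4), the twofold rule amounts to flagging genes whose average log2-ratio exceeds 1 in absolute value. The following sketch is only an illustration; the matrix M of log-ratios and all dimensions are simulated stand-ins, not data from this paper:

  ## Twofold rule on log2-ratios: a minimal illustrative sketch.
  ## M: matrix of log2(Red/Green) values, rows = genes, columns = replicate slides.
  set.seed(1)
  M <- matrix(rnorm(1536 * 4, sd = 0.5), nrow = 1536)   # simulated stand-in data
  avg.log.ratio <- rowMeans(M)                          # average expression change per gene
  twofold <- abs(avg.log.ratio) > 1                     # |log2 fold change| > 1, i.e. a factor of 2
  table(twofold)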
The standard statistical approach taken is significance testing, which implies that fold change is replaced by significance. The null hypothesis for each gene is that the data we observe have some common distributional parameter among the conditions, usually the mean of the expression levels. For each gene we calculate a statistic that is a function of the data. For instance a t-test statistic is applied (despite an unrealistic distributional assumption). Yet there is not much gain compared to the fold approach because the difference between two logarithmic expression levels is the logarithm of their ratio.
Which errors are committed at each particular gene when testing for differential expression? Apart from the type I error (false positive) and the type II error (false negative) there is the complication of testing multiple hypotheses simultaneously. Each gene has individual type I and II errors and it is anything but clear how to measure the overall error rates. In the recent literature several compound error measures have been suggested, such as the false discovery rate ([4]) and the positive false discovery rate ([14]). However their calculation is not straightforward and the selection of a test statistic in this situation is far from trivial.
Suppose n genes (i = 1, ..., n) have been measured over two experimental conditions (j = 1, 2) on K_1 arrays of condition 1 and K_2 arrays of condition 2, with K_1 + K_2 = K. Let \bar{x}_{i1} and \bar{x}_{i2} be the mean gene expression for gene i under conditions 1 and 2, and let s_i be the pooled standard deviation for gene i.

As pointed out already, a test statistic for assessing differential gene expression is the standard t-test

t_i = \frac{\bar{x}_{i2} - \bar{x}_{i1}}{s_i} .

Alternatively a rank-sum statistic can be adopted. Let r_{ik} be the rank of the k-th expression level within gene i. Then the rank-sum statistic for gene i is

r_i = \sum_{k=1}^{K_1} r_{ik} .

An extreme r_i value in either direction would indicate a difference in gene expression. The t-statistic as introduced above tests for a difference in the mean, whereas the rank statistic tests for a difference in distribution.
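As an illustration, both statistics can be computed per gene with a few lines of R; the expression matrix, the condition labels and all dimensions below are simulated stand-ins:

  ## Per-gene t-type and rank-sum statistics for a two-condition design: a sketch.
  ## X: expression matrix (rows = genes); cond: condition label (1 or 2) per array.
  set.seed(1)
  X    <- matrix(rnorm(1000 * 8), nrow = 1000)
  cond <- rep(c(1, 2), each = 4)
  per.gene <- function(x, cond) {
    x1 <- x[cond == 1]; x2 <- x[cond == 2]
    s  <- sqrt(((length(x1) - 1) * var(x1) + (length(x2) - 1) * var(x2)) /
               (length(x1) + length(x2) - 2))           # pooled standard deviation s_i
    c(t = (mean(x2) - mean(x1)) / s,                    # t_i of the text
      r = sum(rank(x)[cond == 1]))                      # rank sum over the condition-1 arrays
  }
  stats <- t(apply(X, 1, per.gene, cond = cond))
  head(stats)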
Under the assumption of only two experimental conditions and no correlation between measurements it is possible to derive the null distribution: either permutation ([17]) or bootstrap ([6]) techniques can be applied. In both cases the computational demand is quite high.
If the null distribution is calculated individually for each gene, this has two disadvantages. The first is what is known as the granularity problem: the null distribution has a resolution on the order of the number of permutations. With n genes and m permutations the resolution is on the order of 1/m for individual null distributions, but 1/(nm) for a pooled null distribution. For instance, if we test 3000 genes with 100 permutations, then we can expect to reject 30 at a time. The second problem is that we are not in the position to construct better rejection regions. With individual null distributions, each gene is treated as a different experiment. For each "experiment" we have m observations from the null distribution and one from the original measurements. It is not possible to compare the null distribution to the observed statistic to derive more powerful, asymmetric rejection regions ([15], p. 277f). This means loss of power.
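The granularity point can be made concrete with a small simulation; the sketch below pools the permutation statistics of all genes (all object names and the data are illustrative):

  ## Pooled permutation null distribution: a sketch of the granularity argument.
  set.seed(2)
  n <- 1000; X <- matrix(rnorm(n * 8), nrow = n); cond <- rep(c(1, 2), each = 4)
  t.stat <- function(x, cond) {
    x1 <- x[cond == 1]; x2 <- x[cond == 2]
    s  <- sqrt(((length(x1) - 1) * var(x1) + (length(x2) - 1) * var(x2)) /
               (length(x1) + length(x2) - 2))
    (mean(x2) - mean(x1)) / s
  }
  obs  <- apply(X, 1, t.stat, cond = cond)
  perm <- replicate(100, apply(X, 1, t.stat, cond = sample(cond)))  # m = 100 permutations
  ## Individual nulls have resolution 1/100 per gene; pooling all n * 100 values
  ## gives resolution 1/(n * 100), at the price of treating genes as exchangeable.
  p.pooled <- sapply(obs, function(t0) mean(abs(perm) >= abs(t0)))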
In the SAM ("Significance Analysis of Microarrays") method ([16]) another approach has been taken: there the test statistics are pooled and considered to follow a mixture distribution. As a consequence, many observations from the mixture of the null and affected distributions, as well as from the pure null distribution, are available, leading to improved rejection regions. The pitfall of using different distributions for the estimation of the overall error rate (due to pooling of the null statistics) is sufficiently controlled in SAM according to its authors. In SAM expression is evaluated by a combination of test and thresholding steps for the purpose of non-symmetric rejection regions. This approach improves the decision process when the numbers of overexpressed and underexpressed genes are substantially different (usually the case in practice). The cutoff for test significance is tuned via a user-specified parameter connected to the false discovery rate (the number of false positives is limited this way). Hence SAM is not an automatic approach.

3 Empirical Bayes methods


Empirical Bayes methods have been around in statistics for thirty years,
beginning with [5]. One cannot say that they have enjoyed much attention
so far . Why are they attractive for gene expression analysis? Empirical Bayes
methods are well-suited for high-dimensional decision problems. In contrast
to techniques as discussed above, where inference is performed separately for
different genes, in empirical Bayes information among genes is shared. In
most microarray experiments involving thousands of genes but only a small
number of microarrays, the amount of information per gene is quite low. The
idea is to combine related inference problems which means that the evaluation
of the expression level of one gene is influenced by the overall expression levels.
Here advantage is taken of the quantification of the variability characterizing
the bulk of genes (still assuming independent measurements).
The general framework of empirical Bayes methods is quite flexible. Prob-
ability distributions are specified in several layers that account for multiple
sources of variation. Then based on a mixture model posterior probabilities
are computed. This allows comparison among multiple conditions.
Empirical Bayes methods have been applied for the first time to gene expression analysis in [11] and [7]. In these papers the concepts of fold change and of significance with respect to the frequentist false discovery rate, respectively, are generalized. In the next section we consider an empirical Bayes thresholding
approach that does not require a concept of fold change or significance.

3.1 A fully automatic thresholding approach


Here we describe an empirical Bayes approach for the estimation of possibly
sparse sequences observed with white noise (modest correlation is tolerable) .
A sparse sequence consists of a relatively small number of informative mea-
surements (in which the signal component is dominating) and a very large
number of noisy zero measurements. This is the typical situation found in
image processing. Gene expression profiling can be seen along the same lines.
Johnstone and Silverman (2004) have proposed a method that can handle such sparse sequences by means of thresholding without any user-specified parameters apart from distributional assumptions ([10]). It is called empirical Bayes thresholding (EBT). As will be seen later on, the choice of the threshold is the most critical aspect, both in terms of signal extraction and computational burden.
In empirical Bayes threshold models the object of interest is a sequence of parameters θ_i, on each of which we have a single observation X_i subject to some noise ε_i, such that

X_i = \theta_i + \epsilon_i .

For the estimation of the θ_i additional assumptions are required. In [10] the θ_i are assumed to be medians and the observation X = (X_1, ..., X_n) is assumed to satisfy

X_i = \mu_i + \epsilon_i ,

where the ε_i are N(0, σ²) random variables, not too highly correlated. Further let μ = (μ_1, μ_2, ..., μ_n) be a vector of medians (means are also feasible but not of interest in this paper).
Obviously the μ_i will not be exactly zero in most applications. The p-norm of μ, ‖μ‖_p = (Σ_i |μ_i|^p)^{1/p}, allows for a more subtle characterization of the sparsity of μ (assuming small p). In other words, the quantification of sparsity corresponds to bounds on the p-norm of μ for p > 0. Consider the sum of squares of a vector with ‖μ‖_p = 1 for some small p. If only one of the components of μ is nonzero, then the energy will be 1. If, on the other hand, all of the components are equal, then the energy will be n^{1-2/p}, which tends to zero as n → ∞ if p < 2, and tends rapidly to zero if p is near zero. Consider the case of p small. Then the only way for a signal in an ℓ_p ball with small p to have large energy (sum of squares) is to consist of a few large components, as opposed to many small components of roughly equal magnitude. Among all signals with a given energy, the sparse ones are those with small ℓ_p norm.
Some measure of sparsity is needed because sparsity of a signal is not solely a matter of the proportion of μ_i that are zero or near zero, but also of subtle ways in which the energy of the signal μ is distributed among the various components. For our purposes it is sufficient that the number of indices i for which μ_i is nonzero is bounded. In engineering such a parameter μ is called a "nearly black signal". For some η this is

\sum_{i=1}^{n} I(\mu_i \neq 0) \le \eta ,    (1)

where I denotes an indicator function. Assuming the signal is sparse in the sense of belonging to an ℓ_p norm ball of small radius η, we have

\Big(\sum_{i=1}^{n} |\mu_i|^p\Big)^{1/p} \le \eta .    (2)

For (1) and (2) it is possible to derive minimax squared error properties. It can be shown that EBT adapts automatically to the degree and character of sparsity of the signal with the minimax rate (i.e. the optimum rate for such signals; for details see [10]). It is worth mentioning that the minimax properties are the same as in the false discovery rate approach in [1].
Suppose the errors ε_i are independent. Within the Bayesian context sparsity is equivalent to suitable prior distributions for the θ_i we are interested in. The notion that many or most of the θ_i are near zero is captured by assuming that the elements θ_i have independent prior distributions each given by the mixture

f_{\text{prior}}(\theta) = (1 - w)\,\delta_0(\theta) + w\,\gamma(\theta) .    (3)

The nonzero part of the prior, γ, is assumed to be a fixed unimodal symmetric density. Traditionally γ is assumed to be a normal density; here ([10]) it is recommended to use a heavier-tailed prior. For the mixing prior in (3) it is favorable to use for γ the Laplace density with scale parameter a > 0,

\gamma_a(u) = \frac{a}{2}\,\exp(-a|u|) ,

or the mixture density

(\mu \mid G = g) \sim N(0,\, g^{-1} - 1) \quad \text{with} \quad G \sim \text{Beta}(\alpha, 1).

The mixture density for μ has tails that decay as μ^{-2α-1}. For α = 1/2 the tails have the same weight as those of the Cauchy distribution.
In both cases the posterior distribution of μ given an observed X, and the marginal distribution of X, are tractable. This makes it feasible to adopt marginal maximum likelihood for the selection of w as well as to estimate μ by the posterior median.
Further assumptions required for the nonzero part of the prior γ are (i) a fixed unimodal symmetric density, (ii) tails that are exponential or heavier, and (iii) a mild regularity condition.
The key feature of this empirical Bayes approach is the threshold. If the absolute value of a particular X_i exceeds some threshold t, then it is taken to correspond to a nonzero μ_i, estimated simply by X_i itself. Otherwise the coefficient μ_i is estimated as zero. The problem is that the threshold t (or rather the t_i) needs to be tuned to the sparsity of the signal. If a threshold appropriate for dense signals is applied to a sparse signal, or vice versa, the result is of no use at all. Hence a good threshold selection method needs (i) to be adaptive between sparse and dense signals, (ii) to be stable to small changes in the data, and (iii) to be tractable to compute. The approach in [10] comprises all these properties.
Let us now discuss the choice of the mixing weight w, or equivalently, of the threshold t(w). Assume that the X_i are independent. For any value of the weight w consider the posterior distribution of μ given X = x under the assumption that X ~ N(μ, σ²). Let μ̂(x; w) be the median of this distribution. For fixed w < 1, μ̂(x; w) is a monotonic function of x with the following threshold property: there exists t(w) > 0 such that μ̂(x; w) = 0 if and only if |x| ≤ t(w).
Let g = γ ⋆ φ denote the convolution of the density γ with the standard normal density φ. The marginal density of the observations X_i is then (1 - w)φ(x) + w g(x). The marginal maximum likelihood estimator ŵ of w is defined as the maximizer of the marginal log-likelihood

\ell(w) = \sum_{i=1}^{n} \log\{(1 - w)\,\phi(X_i) + w\,g(X_i)\},

subject to the constraint on w that the threshold satisfies t(w) ≤ \sqrt{2 \log n} (the threshold takes values from 0 to \sqrt{2 \log n}).
What is the posterior probability that μ is nonzero? Let us define

\beta(x) = \frac{g(x)}{\phi(x)} - 1 .    (4)

Then the posterior probability w_{post}(x) = P(μ ≠ 0 | X = x) satisfies

w_{\text{post}}(x) = \frac{w\,g(x)}{w\,g(x) + (1 - w)\,\phi(x)} = \frac{1 + \beta(x)}{w^{-1} + \beta(x)} .

As a result it can be found using function (4) alone.
To find the posterior median μ̂(x; w) of μ given X = x > 0, we need the cumulative distribution of the nonzero part of the posterior,

\tilde{F}_1(\mu \mid x) = \int_{\mu}^{\infty} h(u \mid x)\, du ,

where h is a density. If x > 0, we can find μ̂(x; w) via the following properties:

\hat{\mu}(x; w) = 0 \quad \text{if} \quad w_{\text{post}}(x)\,\tilde{F}_1(0 \mid x) < \tfrac{1}{2},
\tilde{F}_1(\hat{\mu}(x; w) \mid x) = (2\,w_{\text{post}}(x))^{-1} \quad \text{otherwise.}

For w_{post}(x) ≤ 1/2 the median is necessarily zero (no need to evaluate \tilde{F}_1(0 | x)). For x < 0 the antisymmetric property μ̂(-x, w) = -μ̂(x, w) can be used.
The Bayes factor threshold is related to the posterior median. It is a value τ(w) such that P(μ > 0 | X = τ(w)) = 0.5. This is to say that τ(w) is the largest value of the sequence for which the estimated μ will be zero, if the estimate is obtained from the posterior median.
How can we find the estimate ŵ of w, or the scale parameter a of the Laplace density? Maximization of the marginal log-likelihood ℓ gives the solution. Let us define the score function S(w) = ℓ'(w). Because of the smoothness and monotonicity of S(w) it is possible to find the estimates by a binary search, or by an even faster algorithm. The obtained values are then plugged back into the prior, and the parameters μ_i are evaluated via these estimates, either by using the posterior median itself or by using some other threshold rule with the same threshold t(ŵ).
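A rough illustration of this step under the Laplace prior with unit noise variance: the score can be driven to zero with a generic root finder such as uniroot(). The convolution g below, the value a = 0.5 and the simulated sequence are illustrative, the constraint t(w) ≤ \sqrt{2 \log n} is ignored, and the actual EbayesThresh code uses its own, more refined search:

  ## Estimating the mixing weight w by root finding on the score S(w) = l'(w): a sketch.
  g <- function(x, a = 0.5)                   # Laplace(a) prior convolved with N(0, 1)
    (a / 2) * exp(a^2 / 2) *
    (exp(-a * x) * pnorm(x - a) + exp(a * x) * pnorm(-x - a))
  score <- function(w, x, a = 0.5)            # derivative of the marginal log-likelihood
    sum((g(x, a) - dnorm(x)) / ((1 - w) * dnorm(x) + w * g(x, a)))
  set.seed(3)
  x  <- c(rnorm(950), rnorm(50, mean = 4))    # sparse sequence: mostly noise, a few signals
  lo <- 1e-6; hi <- 1 - 1e-6
  w.hat <- if (score(lo, x) <= 0) lo else if (score(hi, x) >= 0) hi else
           uniroot(score, c(lo, hi), x = x)$root   # S(w) is monotone, so one sign change
  w.hat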
The threshold is obtained from the posterior median μ̂, mainly by use of the following properties:
(i) shrinkage rule: 0 ≤ μ̂ ≤ x for x ≥ 0;
(ii) threshold rule: there exists t(w) > 0 such that μ̂(x) = 0 if and only if |x| ≤ t(w);
(iii) bounded shrinkage: there exists a constant b such that for all w and x, |μ̂(x; w) - x| ≤ t(w) + b.
This approach is quite unique in combining excellent theoretical properties and efficient computation. According to [10] the results

Figure 1: Colon data after preprocessing.

proven for white noise errors still hold for modestly correlated errors, at least in an approximate sense. This generalization is important for microarray applications because some of the measurements are usually replicates. The EBT approach was implemented in the R language ([9]) by Iain M. Johnstone and Bernard W. Silverman. The master function of the EBT algorithm is ebayesthresh(). This function, as well as the others required to analyze sparse sequences, can be downloaded freely for academic purposes from http://www.stats.ox.ac.uk/~silverman/ebayesthresh/. Relevant documentation is found there too. After having been sourced into R, the EBT algorithm can be used as any other function in R.
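A minimal usage sketch, assuming the downloaded functions have been sourced into the session (the file name and the simulated sparse sequence are illustrative; the arguments are the ones discussed in Section 4.1):

  ## Minimal EBT usage sketch; requires the sourced EbayesThresh functions.
  source("EbayesThresh.R")                    # illustrative file name for the downloaded code
  set.seed(4)
  mu <- c(rep(0, 950), runif(50, 3, 6))       # a sparse signal: 50 nonzero medians
  x  <- mu + rnorm(1000)                      # observations with N(0, 1) noise
  mu.hat <- ebayesthresh(x, prior = "laplace", a = NA,
                         bayesfac = TRUE, sdev = NA, threshrule = "median")
  sum(mu.hat != 0)                            # number of observations kept as informative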

4 Two examples
Our first example uses cDNA measurements and our second example is based on Affymetrix measurements. In both techniques we have many different sources of noise, such as variation in hybridization time, variation in reagent concentrations, leak of external light during chip reading, inhomogeneities in chip preparation, variations in laser intensity during chip reading, trace contamination with cross-hybridizing oligonucleotides, etc. EBT is an ideal method to handle a decision problem under sparsity due to substantial noise.

4.1 The colon data set

The data are of cDNA type from a colon carcinoma experiment which encompasses a set of 13 colon carcinoma patients. None of these patients was treated neoadjuvantly. The invasion front of the tumors was

Figure 2: EBT result for the colon data.
investigated, and hybridization was made against a pool of 4 probes of normal colon tissue. Standard protocol was used and quality control was ensured by the Institute of Pathology, Medical University of Graz. The experiments contain n = 1536 different genes, among them some replicates. The cDNA chips were then scanned with the microarray image analysis software Imagene from BioDiscovery, producing two text files. These files were then imported into R ([9]) using the object-oriented microarray analysis library com.braju.sma (can be obtained from the University of California at Berkeley, http://www.maths.lth.se/help/R/com.braju.sma/). The following preprocessing steps were applied: (i) background subtraction, (ii) transformation into M = log2(Red/Green) (M is further referred to as log-ratio), (iii) normalization within slides using the scaled print-tip method, (iv) the few values which were not detected on a subset of slides, hence NAs, were set to zero in order to allow further processing, and (v) the experiments were merged using the median. The data after preprocessing are displayed in Fig. 1.
The EBT algorithm was applied using the following parameters: prior = "laplace" and a = NA, so that the scale parameter a is estimated by marginal maximum likelihood. bayesfac = TRUE means that whenever a threshold is explicitly calculated, the Bayes factor threshold will be used. With sdev = NA, the standard deviation is estimated via the median absolute deviation from zero (mad(x, center = 0)). Finally, with threshrule = "median" the posterior median is chosen.
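The call for the colon analysis then looks as follows (a sketch; M.median is an illustrative name for the vector of 1536 median-merged log-ratios of Fig. 1, replaced here by a simulated stand-in so that the lines run on their own, and the EbayesThresh functions are assumed to be sourced as above):

  ## The colon analysis step as a sketch.
  set.seed(5)
  M.median  <- c(rnorm(1460, sd = 0.3), runif(40, 1, 3), runif(36, -3, -1))  # stand-in data
  theta.hat <- ebayesthresh(M.median, prior = "laplace", a = NA,
                            bayesfac = TRUE, sdev = NA, threshrule = "median")
  c(over = sum(theta.hat > 0), under = sum(theta.hat < 0))   # cf. n1 = 37, n2 = 39 in the text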
Fig. 2 shows the genes from the colon data that are informative after administering the EBT algorithm. Finally we obtained n1 = 37 overexpressed and n2 = 39 underexpressed genes. These could be verified by pathologists.

Figure 3: Subclass ALL of the Golub data after preprocessing.

4.2 The Golub data set


The Golub data set ([8]) is well known and has been re-analyzed by many authors. The data originate from a gene expression study with patients suffering from two types of acute leukemia. Here we consider only a subset of it (i.e. the learning set) with data from 27 acute lymphoblastic leukemia (ALL) patients and 11 acute myeloid leukemia (AML) patients. The intensities were measured using Affymetrix high-density oligonucleotide chips. The data comprise the expression of n = 6817 human genes and can be obtained from the Web site http://www-genome.wi.mit.edu/mpr/data_set_ALL_AML.html.
The following preprocessing steps were administered using R functions: (i) thresholding of the expressions (floor of 100 and ceiling of 16000), (ii) filtering by excluding genes with expressions max/min ≤ 5 or (max - min) ≤ 500, (iii) a base 10 logarithmic transformation, (iv) scaling the matrix using the R command scale from the package base, which is a generic function whose default method centers and/or scales the columns of a numeric matrix, and (v) merging the experiments by subclass ALL or AML respectively. Due to the preprocessing an accumulation at a minimum value was observed that had to be eliminated in order to comply with the requirements of the EBT algorithm. This bottom line was eliminated by removing the respective gene during the processing step; it is re-inserted at the preceding index location of the result matrix with a value of zero after the transformation. This filtering omits 35 genes for the merged ALL data and 117 genes for the merged AML data. For the subclass ALL the preprocessed data are displayed in Fig. 3.
Applying the EBT algorithm with the same parameters as specified for

Figure 4: EBT result for subclass ALL of the Golub data.
the above colon data cDNA experiment, 2 genes are detected for ALL, 1 gene is detected for AML and 12 genes are expressed in both subclasses (for a plot of the altogether 14 overexpressed genes of the ALL subclass see Fig. 4). This is a substantial reduction in the number of informative genes. Whether the identified genes that are overexpressed in one subclass while not in the other (i.e. having a zero estimate), and which obviously have discriminative power, are of high biological relevance needs to be answered by future leukemia research and is not a statistical matter.

References
[1] Abramovich F., Benjamini Y., Donoho D.L., Johnstone I.M. (2002). Adapting to unknown sparsity by controlling the false discovery rate. Preprint.
[2] Affymetrix (1999). Affymetrix microarray suite user guide. Affymetrix, Santa Clara, CA.
[3] Baldi P., Hatfield G.W. (2002). DNA microarrays and gene expression. From experiments to data analysis and modeling. Cambridge University Press, Cambridge.
[4] Benjamini Y., Hochberg Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. Royal Statist. Soc. B 57, 289-300.
[5] Efron B., Morris C. (1973). Combining possibly related estimation problems (with discussion). J. Royal Statist. Soc. B 35, 379-421.
[6] Efron B., Tibshirani R.J. (1993). An introduction to the Bootstrap. Chapman & Hall, London.
[7] Efron B., Tibshirani R.J., Storey J.D., Tusher V. (2001). Empirical Bayes analysis of a microarray experiment. J. Amer. Statist. Assoc. 96, 1151-1160.
[8] Golub T.R., Slonim D.K., Tamayo P., Huard C., Gaasenbeek M., Mesirov J.P., Coller H., Loh M.L., Downing J.R., Caligiuri M.A., Bloomfield C.D., Lander E.S. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286, 531-537.
[9] Ihaka R., Gentleman R. (1996). R: A language for data analysis and graphics. J. Computat. Graph. Statist. 5, 299-314.
[10] Johnstone I.M., Silverman B.W. (2004). Needles and straw in haystacks: Empirical Bayes estimates of possibly sparse sequences. To appear in Annals of Statistics.
[11] Newton M.A., Kendziorski C.M., Richmond C.S., Blattner F.R. (2001). On differential variability of expression ratios: Improving statistical inference about gene expression changes from microarray data. J. Computat. Biol. 8, 37-52.
[12] Nguyen D.V., Arpat A.B., Wang N., Carroll R.J. (2002). DNA microarray experiments: Biological and technological aspects. Biometrics 58, 701-717.
[13] Schena M., Shalon D., Davis R.W., Brown P.O. (1995). Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270, 467-470.
[14] Storey J.D. (2002). A direct approach to false discovery rates. J. Royal Statist. Soc. B 64, 479-498.
[15] Storey J.D., Tibshirani R. (2003). SAM thresholding and false discovery rates for detecting differential gene expression in DNA microarrays. In Parmigiani G., Garrett E.S., Irizarry R.A., Zeger S.L. (eds.) The analysis of gene expression data. Methods and software. Springer-Verlag, New York, 272-290.
[16] Tusher V., Tibshirani R., Chu G. (2001). Significance analysis of microarrays applied to transcriptional responses to ionizing radiation. Proceedings of the National Academy of Sciences 98, 5116-5121.
[17] Westfall P.H., Young S.S. (1993). Resampling-based multiple testing: Examples and methods for p-value adjustment. Wiley, New York.

Acknowledgement: Thanks to K. Wagner for the cDNA hybridisation of the colon data and to Dr. M. Asslaber for making the data set available to us; both from the Institute for Pathology, Medical University of Graz, Austria (Lore Saldow Project and Gen-AU). The second author acknowledges funding from Gen-AU ("A Comprehensive Disease Bank for Functional Genomics").
Address: Medical University of Graz, Institute for Medical Informatics,
Statistics and Documentation, Auenbruggerplatz 2, A-8036 Graz, Austria
E-mail: michael.schimek@meduni-graz.at

KERNEL METHODS FOR MANIFOLD ESTIMATION

Bernhard Schölkopf

Key words: Kernel methods, support vector machines, quantile estimation.
COMPSTAT 2004 section: Neural networks and machine learning.

Abstract: We describe methods for estimating manifolds in high-dimensional spaces. They work by mapping the data into a reproducing kernel Hilbert space and then determining regions in terms of hyperplanes.

1 Kernel algorithms for pattern recognition


Suppose we are given empirical data

(x_1, y_1), \ldots, (x_m, y_m) \in \mathcal{X} \times \mathcal{Y} .    (1)

Here, the domain 𝒳 is some nonempty set that the inputs x_i are taken from; the y_i ∈ 𝒴 are called targets. Here and below, i, j = 1, ..., m.
We have made no assumptions on the domain 𝒳 other than it being a set. In order to study the problem of learning, we need additional structure. In learning, we want to be able to generalize to unseen data points. In the case of pattern recognition, given some new input x ∈ 𝒳, we want to predict the corresponding y ∈ {±1}. Loosely speaking, we want to choose y such that (x, y) is similar to the training examples. To this end, we need similarity measures in 𝒳 and in {±1}. The latter is easier, as two target values can only be identical or different.¹ For the former, we require a similarity measure

k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}, \qquad (x, x') \mapsto k(x, x')    (2)

with the property that there exists a map Φ into a Hilbert space H such that for all x, x' ∈ 𝒳,

k(x, x') = \langle \Phi(x), \Phi(x') \rangle .    (3)

Such a function k is called a positive definite kernel [1], [10], [8], H is the reproducing kernel Hilbert space (RKHS) associated with it, and Φ is called its feature map. A popular example, in the case where 𝒳 is a normed space, is the Gaussian

k(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right) ,    (4)

where σ > 0.
¹ In the case where the outputs are taken from a general set 𝒴, the situation is more complex, cf. [11].

The advantage of using a positive definite kernel as a similarity measure is that it allows us to construct algorithms in Hilbert spaces. For instance, consider the following simple classification algorithm, where 𝒴 = {±1}. The idea is to compute the means of the two classes in the RKHS,

c_1 = \frac{1}{m_1} \sum_{\{i : y_i = +1\}} \Phi(x_i), \qquad c_2 = \frac{1}{m_2} \sum_{\{i : y_i = -1\}} \Phi(x_i),

where m_1 and m_2 are the number of examples with positive and negative target values, respectively. We then assign a new point Φ(x) to the class whose mean is closer to it. This can be shown [8] to lead to

y = \mathrm{sgn}\big(\langle \Phi(x), c_1 \rangle - \langle \Phi(x), c_2 \rangle + b\big)    (5)

with b = \tfrac{1}{2}\big(\|c_2\|^2 - \|c_1\|^2\big). Rewritten in terms of k, this reads

y = \mathrm{sgn}\Big(\frac{1}{m_1} \sum_{\{i : y_i = +1\}} k(x, x_i) - \frac{1}{m_2} \sum_{\{i : y_i = -1\}} k(x, x_i) + b\Big)    (6)

with b = \tfrac{1}{2}\Big(\frac{1}{m_2^2} \sum_{\{(i,j) : y_i = y_j = -1\}} k(x_i, x_j) - \frac{1}{m_1^2} \sum_{\{(i,j) : y_i = y_j = +1\}} k(x_i, x_j)\Big).

Let us consider one well-known special case of this type of classifier. Assume that the class means have the same distance to the origin (hence b = 0), and that k can be viewed as a density, i.e., it is positive and has integral 1 (assuming the integral exists), \int_{\mathcal{X}} k(x, x')\,dx = 1 for all x' ∈ 𝒳. Then (6) corresponds to the Bayes decision boundary separating the two classes, subject to the assumption that the two classes are equally likely and were generated from two probability distributions that are correctly estimated by the Parzen windows estimators of the two classes,

p_1(x) := \frac{1}{m_1} \sum_{\{i : y_i = +1\}} k(x, x_i), \qquad p_2(x) := \frac{1}{m_2} \sum_{\{i : y_i = -1\}} k(x, x_i).    (7)
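A minimal sketch of the classifier (6) with the Gaussian kernel (4), written in R (the language used elsewhere in this volume); the toy data, the value of σ and all object names are illustrative:

  ## Class-mean classifier (6) with the Gaussian kernel (4): an illustrative sketch.
  k <- function(x, y, sigma = 0.5) exp(-sum((x - y)^2) / (2 * sigma^2))   # kernel (4)
  set.seed(7)
  X <- rbind(matrix(rnorm(40, mean =  1), ncol = 2),    # class +1
             matrix(rnorm(40, mean = -1), ncol = 2))    # class -1
  y <- rep(c(1, -1), each = 20)
  K <- outer(1:nrow(X), 1:nrow(X), Vectorize(function(i, j) k(X[i, ], X[j, ])))
  pos <- y == 1; neg <- y == -1
  b <- 0.5 * (mean(K[neg, neg]) - mean(K[pos, pos]))    # offset b = (||c2||^2 - ||c1||^2)/2
  classify <- function(x.new) {                         # decision rule (6)
    kx <- apply(X, 1, k, y = x.new)
    sign(mean(kx[pos]) - mean(kx[neg]) + b)
  }
  classify(c(1, 1)); classify(c(-1, -1))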

The classifier (6) is quite close to the Support Vector Machine (SVM) that has recently attracted much attention [10], [8]. It is linear in the RKHS (see (5)), while in the input domain it is represented by a kernel expansion (6). It is example-based in the sense that the kernels are centered on the training examples, i.e., one of the two arguments of the kernels is always a training example. This is a general property of kernel methods, due to the Representer Theorem [5], [8]. The main point where SVMs deviate from (6) is in the selection of the examples that the kernels are centered on, and in the weight that is put on the individual kernels in the decision function. The SVM decision boundary takes the form

y = \mathrm{sgn}\Big(\sum_{i=1}^{m} \lambda_i\, k(x, x_i) + b\Big),    (8)

where the coefficients λ_i and b are computed by solving a convex quadratic programming problem such that the margin of separation of the classes in

the RKHS is maximized. It turns out that for many problems this leads to sparse solutions, i.e., often many of the λ_i take the value 0. The x_i with nonzero λ_i are usually called Support Vectors.
Using methods from statistical learning theory [10], one can bound the
generalization error of SVMs. In a nutshell, statistical learning theory shows
that it is imperative that one uses a class of functions whose capacity (e.g.,
measured by the VC dimension) is matched to the size of the training set. In
SVMs, the capacity measure used is the size of the margin, which is inversely
proportional to the RKHS norm of the SVM parameter vector.
The SV algorithm has been generalized to problems such as regression
estimation [10], mappings between general sets of objects [11], and single
class problems. As the latter algorithm is closely related to the one to be
proposed in the present paper, we will describe it in the next section.

2 Single-class SVMs
Let us assume we are given unlabelled data x_1, ..., x_m ∈ 𝒳 generated i.i.d. according to some underlying distribution P. We would like to estimate quantiles C of P using kernel expansions as C ≈ {x ∈ 𝒳 | f(x) ∈ I}. Here, I is an interval, and f(x) = \sum_{i=1}^{m} \lambda_i k(x, x_i).
In the case of I = [ρ, ∞[ (where ρ ∈ ℝ), an approach to compute such an estimator f is the single-class SVM [7]. It approximately computes the smallest set C ∈ 𝒞 containing a specified fraction of all training examples, where smallness is measured in terms of a regularizer corresponding to the norm in the RKHS associated with k, and 𝒞 is the family of sets corresponding to half-spaces in the RKHS. When choosing a suitable kernel, this notion of smallness will coincide with the intuitive idea that the quantile estimate should not only contain a specified fraction of the training points, but it should also be sufficiently smooth so that we can be confident that this statement will also be approximately true for previously unseen points sampled from P (for an analysis, see [7]).
Let us briefly describe the main ideas of the approach. The training points are mapped into the RKHS using the feature map Φ associated with k, and then it is attempted to separate them from the origin with a large margin by solving the following quadratic program: for ν ∈ (0, 1],²

\min_{w \in H,\ \boldsymbol{\xi} \in \mathbb{R}^m,\ \rho \in \mathbb{R}} \quad \frac{1}{2}\|w\|^2 + \frac{1}{\nu m} \sum_i \xi_i - \rho    (9)

subject to \quad \langle w, \Phi(x_i) \rangle \ge \rho - \xi_i, \qquad \xi_i \ge 0.    (10)

Since nonzero slack variables ξ_i are penalized in the objective function, we can expect that if w and ρ solve this problem, then the decision function,

f(x) = \mathrm{sgn}(\langle w, \Phi(x) \rangle - \rho),    (11)

² Here and below we follow the convention that bold face Greek characters denote vectors, e.g., \boldsymbol{\xi} = (\xi_1, \ldots, \xi_m)^T.

Figure 1: In the 2-D toy example depicted, the hyperplane ⟨w, Φ(x)⟩ = ρ separates all but one of the points from the origin. The outlier Φ(x) is associated with a slack variable ξ, which is penalized in the objective function (9). The distance from the outlier to the hyperplane is ξ/‖w‖; the distance between hyperplane and origin is ρ/‖w‖. The latter implies that a small ‖w‖ corresponds to a large margin of separation from the origin (from [8]).

will equal 1 for most examples x_i contained in the training set,³ while the regularization term ‖w‖ will still be small. For an illustration, see Figure 1. The trade-off between these two goals is controlled by a parameter ν.
One can show that the solution takes the form

w = \sum_{i=1}^{m} \alpha_i \Phi(x_i),    (12)

where the α_i are computed by solving the dual problem,

\min_{\boldsymbol{\alpha} \in \mathbb{R}^m} \quad \frac{1}{2} \sum_{ij} \alpha_i \alpha_j k(x_i, x_j)    (13)

subject to \quad 0 \le \alpha_i \le \frac{1}{\nu m}, \qquad \sum_i \alpha_i = 1.    (14)

Note that due to (14), the training examples contribute with nonnegative weights α_i ≥ 0 to the solution (12). One can show that asymptotically, a fraction ν of all training examples will have strictly positive weights, and the rest will be zero.
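As an illustration, the dual (13)-(14) can be handed to a generic quadratic programming routine; the sketch below uses the R package quadprog, adds a small ridge because solve.QP() requires a strictly positive definite matrix, and recovers ρ from an unbounded support vector via the KKT conditions. The toy data, ν and σ are illustrative:

  ## Single-class SVM dual (13)-(14) via a standard QP solver: a sketch.
  library(quadprog)
  rbf <- function(x, y, sigma = 0.5) exp(-sum((x - y)^2) / (2 * sigma^2))
  set.seed(8)
  X  <- matrix(rnorm(60), ncol = 2); m <- nrow(X); nu <- 0.2
  K  <- outer(1:m, 1:m, Vectorize(function(i, j) rbf(X[i, ], X[j, ])))
  Dmat  <- K + diag(1e-8, m)                            # objective (1/2) a'Ka, ridged for solve.QP
  Amat  <- cbind(rep(1, m), diag(m), -diag(m))          # sum(a) = 1, a >= 0, -a >= -1/(nu m)
  bvec  <- c(1, rep(0, m), rep(-1 / (nu * m), m))
  alpha <- solve.QP(Dmat, rep(0, m), Amat, bvec, meq = 1)$solution
  sv  <- which(alpha > 1e-6 & alpha < 1 / (nu * m) - 1e-6)  # unbounded SVs (check non-empty)
  rho <- drop(K[sv[1], ] %*% alpha)                     # KKT: <w, Phi(x_i)> = rho for such i
  f <- function(x) sign(sum(alpha * apply(X, 1, rbf, y = x)) - rho)   # decision function (11)
  f(c(0, 0)); f(c(3, 3))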

3 Implicit manifold estimation

A richer class of solutions, where some of the weights can be negative, is obtained if we change the geometric setup. In this case, we estimate a region which is a slab in the RKHS, i.e., the area enclosed between two parallel hyperplanes (see Figure 2).
³ We use the convention that sgn(z) equals 1 for z ≥ 0 and -1 otherwise.


Figure 2: Two parallel hyperplanes ⟨w, Φ(x)⟩ = ρ + δ^(*) enclosing all but two of the points. The outlier Φ(x^(*)) is associated with a slack variable ξ^(*), which is penalized in the objective function (15).

To this end, we consider the following modified program:⁴

\min_{w \in H,\ \boldsymbol{\xi}^{(*)} \in \mathbb{R}^m,\ \rho \in \mathbb{R}} \quad \frac{1}{2}\|w\|^2 + \frac{1}{\nu m} \sum_i (\xi_i + \xi_i^*) - \rho    (15)

subject to \quad \delta - \xi_i \le \langle w, \Phi(x_i) \rangle - \rho \le \delta^* + \xi_i^*    (16)

and \quad \xi_i^{(*)} \ge 0.    (17)

Here, δ^(*) are fixed parameters. Note that strictly speaking, one of them is redundant: one can show that if we subtract some offset from both, then we obtain the same overall solution, with ρ offset by the same amount. Hence, we can generally set one of them to zero, say, δ = 0. In the simulations shown below, this is the case; nonetheless, we prefer to keep the δ in the optimization problem.
Before we compute the dual problem, let us discuss the relationship of this convex quadratic optimization problem to other approaches.

• For δ = 0 and δ* = ∞ (i.e., no upper constraint), we recover the single-class SVM (9)-(10).

• If we drop ρ from the objective function and set δ = -ε, δ* = ε (for some fixed ε ≥ 0), we obtain the ε-insensitive support vector regression algorithm [10], for a data set where all output values y_1, ..., y_m are zero. Note that in this case, the solution is trivial, w = 0. This shows that the ρ in our objective function cannot be dropped and plays an important role.

• For δ = δ* = 0, the term \sum_i (\xi_i + \xi_i^*) measures the distance of the point Φ(x_i) from the hyperplane ⟨w, Φ(x_i)⟩ - ρ = 0 (up to a scaling of ‖w‖). As ν tends to zero, this term will dominate the objective function. Hence, in this case, the solution will be a hyperplane that approximates the data well in the sense that the points lie close to it in the RKHS norm.

⁴ Here and below, the superscript (*) simultaneously denotes the variables with and without asterisk, e.g., ξ^(*) is a shorthand for ξ and ξ*.

Let us now compute the dual optimization problem. Here are all constraints, along with the Lagrange multipliers that we will use for them:

\xi_i - \delta + \langle w, \Phi(x_i) \rangle - \rho \ge 0, \qquad \alpha_i \ge 0    (18)
\xi_i^* + \delta^* + \rho - \langle w, \Phi(x_i) \rangle \ge 0, \qquad \alpha_i^* \ge 0    (19)
\xi_i^{(*)} \ge 0, \qquad \beta_i^{(*)} \ge 0    (20)

This leads to the Lagrangian

L = \frac{1}{2}\|w\|^2 + \frac{1}{\nu m} \sum_i (\xi_i + \xi_i^*) - \rho
  - \sum_i \alpha_i \big[\xi_i - \delta + \langle w, \Phi(x_i) \rangle - \rho\big]
  - \sum_i \alpha_i^* \big[\xi_i^* + \delta^* + \rho - \langle w, \Phi(x_i) \rangle\big]
  - \sum_i \big(\beta_i \xi_i + \beta_i^* \xi_i^*\big).    (21)

The solution of our primal problem (15)-(17) is known to be a saddle point of L. To find it, we need to minimize with respect to the primal variables w, ξ^(*), ρ and maximize with respect to the dual variables α^(*), β^(*). Setting the derivatives with respect to the primal variables equal to zero, we obtain

\frac{\partial L}{\partial w} = 0 \iff w = \sum_i (\alpha_i - \alpha_i^*)\,\Phi(x_i)    (22)

\frac{\partial L}{\partial \xi_i^{(*)}} = 0 \iff \frac{1}{\nu m} - \alpha_i^{(*)} - \beta_i^{(*)} = 0    (23)

\frac{\partial L}{\partial \rho} = 0 \iff \sum_i (\alpha_i - \alpha_i^*) = 1.    (24)

Substituting these conditions into the Lagrangian leads to the dual problem,

\min_{\boldsymbol{\alpha}^{(*)} \in \mathbb{R}^m} \quad \frac{1}{2} \sum_{ij} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\, k(x_i, x_j) - \delta \sum_i \alpha_i + \delta^* \sum_i \alpha_i^*    (25)

subject to \quad 0 \le \alpha_i^{(*)} \le \frac{1}{\nu m}    (26)

and \quad \sum_i (\alpha_i - \alpha_i^*) = 1,    (27)

where the box constraints on α_i^(*), (26), have been derived from (23) by taking into account that α_i^(*), β_i^(*) ≥ 0.⁵
The dual problem can be solved using standard quadratic programming packages. Alternatively, custom methods such as variants of SMO (cf. [7]) can be used. The offset ρ can be computed from the value of the corresponding variable in the double dual, or using the Karush-Kuhn-Tucker (KKT) conditions, just as in other support vector methods [8]. Once this is done, we can evaluate for each test point x whether it satisfies δ ≤ ⟨w, Φ(x)⟩ - ρ ≤ δ*. In other words, we have an implicit description of the region in 𝒳 that corresponds to the region in between the two hyperplanes in the RKHS. For δ = δ*, this is a single hyperplane, corresponding to a manifold in 𝒳.⁶ See Figure 3 for some toy examples of the algorithm in action.
We now analyze how the parameter ν influences the solution. To this end, we introduce the following shorthands for the sets of SV and outlier indices:

SV = \{i \mid \langle w, \Phi(x_i) \rangle - \rho - \delta \le 0\}    (28)
SV^* = \{i \mid \langle w, \Phi(x_i) \rangle - \rho - \delta^* \ge 0\}    (29)
OL^{(*)} = \{i \mid \xi_i^{(*)} > 0\}    (30)

It is clear from the primal optimization problem that for all i, ξ_i > 0 implies ⟨w, Φ(x_i)⟩ - ρ - δ < 0 (and likewise, ξ_i^* > 0 implies ⟨w, Φ(x_i)⟩ - ρ - δ* > 0), hence OL^(*) ⊂ SV^(*). The difference between the SV and OL sets consists of those points that lie precisely on the boundaries of the constraints.⁷
Below, |A| denotes the cardinality of the set A.

Proposition 3.1. The solution of (15)-(17) satisfies

\frac{|SV|}{m} - \frac{|OL^*|}{m} \ge \nu,    (31)

\frac{|OL|}{m} - \frac{|SV^*|}{m} \le \nu.    (32)
Two notes before we proceed to the proof:

• The above statements are not symmetric with respect to exchanging the quantities with asterisks and their counterparts without asterisk.

⁵ As an aside, note that due to (27), the dual solution is invariant with respect to the transformation δ^(*) → δ^(*) + const.; such a transformation only adds a constant to the objective function, leaving the solution unaffected.
⁶ Subject to suitable conditions on k.
⁷ The present usage differs slightly from the standard definition of SVs (support vectors), which are usually those that satisfy α_i^(*) > 0. In our definition, SVs are those points where the constraints are active. However, the difference is marginal: (i) it follows from the KKT conditions that α_i^(*) > 0 implies that the corresponding constraint is active; (ii) while it can happen in theory that a constraint is active and nevertheless the corresponding α_i^(*) is zero, this almost never occurs in practice.

Figure 3: Toy examples of (25)-(27), showing the training points (circles), SVs lying exactly on the hyperplanes (bold circles), and outliers marked by crosses (depicted area [-1, 1]², kernel (4), parameter settings σ = 0.5, δ = 0; ν = 0.1 in the top and middle rows, ν = 0.5 in the bottom row). Lines correspond to hyperplanes constructed in the RKHS (see text); the dashed line is the hyperplane corresponding to the constraint with the starred variables. For δ* = 0, the two hyperplanes coincide (note that due to finite accuracy, the points do not lie exactly on the hyperplane and are thus marked as outliers); for δ* = 0.1, the dashed hyperplane is sufficiently far away from the data to reduce the algorithm to the single-class SVM (9)-(10). The top row shows a simple toy data set, which in the middle row is contaminated with an outlier. The bottom row shows how ν = 0.5 handles the outlier.
This is due to the sign of ρ in the primal objective function. If we used +ρ rather than -ρ, we would obtain almost the same dual, the only difference being that the constraint (27) would have a "-1" on the right hand side. In this case, the role of the quantities with and without asterisks would be reversed in Proposition 3.1.

• The "ν-property" of single-class SVMs is obtained as the special case where OL* = SV* = ∅.
Proof. Assume that (w, ξ^(*), ρ) is a solution of (15)-(17). Thus it is optimal with respect to all primal variables, in particular ξ^(*) and ρ, i.e., keeping w fixed. In that case, the problem takes the form

\min_{\boldsymbol{\xi}^{(*)} \in \mathbb{R}^m,\ \rho \in \mathbb{R}} \quad \frac{1}{\nu m} \sum_i (\xi_i + \xi_i^*) - \rho    (33)

subject to \quad \delta - \xi_i \le \langle w, \Phi(x_i) \rangle - \rho \le \delta^* + \xi_i^* \quad \text{and} \quad \xi_i^{(*)} \ge 0.    (34)

If we increase ρ by a small ε > 0 (cf. Figure 2),⁸ (33) decreases proportionally to ν plus the fraction of points with ξ_i^* > 0 (since these slack variables can be shrunk by the same ε without violating the constraints) minus the fraction of SVs (remember that all SVs either have ξ_i > 0 or lie exactly on the hyperplane ⟨w, Φ(x_i)⟩ - ρ - δ = 0; in both cases, an increase of ρ by ε will lead to the same increase in the ξ_i variables, in order to satisfy the constraints). If the overall decrease were positive, i.e., if ν + |OL*|/m - |SV|/m > 0, then we could get a strict decrease in (33) by changing ρ, violating the assumption that we are already at the optimum. Therefore, we have |SV|/m - |OL*|/m ≥ ν.
If, on the other hand, we decrease ρ by an ε > 0, the objective function will decrease proportionally to |OL|/m - |SV*|/m - ν. As above, this quantity cannot be strictly positive, since we are already optimal. Therefore we have |OL|/m - |SV*|/m ≤ ν. □
If in addition we make certain assumptions on the distribution generating the data and on the kernel,⁹ then asymptotically, the two inequalities in the proposition become equalities with probability 1. The main idea of the proof can be given in a nutshell: if the capacity of the function class that we are using is well behaved (which it is, since we are regularizing using the RKHS norm ‖w‖), then asymptotically, the set of points which lie exactly on the hyperplanes is negligible. Hence, loosely speaking, we have SV^(*) = OL^(*), and thus ν ≤ |SV|/m - |OL*|/m = |OL|/m - |SV*|/m ≤ ν. For details, see [8], [9].
To conclude this section, note that an approximate description of the data as the zero set of a function is not only useful as a compact representation of the data. It can also potentially be used in tasks such as denoising and image super-resolution. Given a noisy point x, we can map it into the RKHS and then project it onto the hyperplane(s) that we have learnt. We then compute an approximate pre-image under Φ to get a noise-free version of x. A similar statistical denoising technique has been used in conjunction with kernel PCA (to be described next) with rather encouraging results [8], [4].

4 Other kernel approaches for manifold estimation

There exist several other possibilities to use machine learning methods employing positive definite kernels for estimating manifolds. One of them is known as the RPM algorithm (see [8]); two other ones, to be described below, build on the kernel PCA algorithm.

Kernel PCA  The kernel method for computing dot products in an RKHS is not restricted to SV machines. It can be used to develop nonlinear generalizations of any algorithm that can be cast in terms of dot products, such
⁸ We choose ε small enough so that all constraints that are not active will also not be active after adding the ε; it is easy to see that such an ε exists.
⁹ Essentially, we need to require that the distribution have a density w.r.t. the Lebesgue measure, and that k is analytic and non-constant (cf. [8], [9]).

as principal component analysis. Given data x_1, ..., x_m ∈ X, kernel prin-
cipal component analysis (kPCA) [8] computes the principal components of
the points Φ(x_1), ..., Φ(x_m). Since H may be infinite-dimensional, the PCA
problem needs to be transformed into a problem that can be solved in terms
of the kernel k. To this end, we consider an estimated covariance matrix
in H,

    C = (1/m) Σ_{i=1}^{m} Φ(x_i) Φ(x_i)^T,                                    (35)

where Φ(x_i)^T denotes the linear form mapping v ∈ H to ⟨Φ(x_i), v⟩. To
diagonalize C, we first observe that all solutions to Cv = λv with λ ≠ 0
must lie in the span of Φ-images of the training data (as can be seen by
substituting (35) and dividing by λ). Thus, we may expand the solution v
as v = Σ_{i=1}^{m} α_i Φ(x_i), thereby reducing the problem to that of finding
the α_i. The latter can be shown to take the form mλα = Kα, where
α = (α_1, ..., α_m)^T and K_ij = k(x_i, x_j). Absorbing the m factor into the
eigenvalue λ, one can moreover show that the p-th feature extractor takes
the form

    ⟨v^p, Φ(x)⟩ = (1/√λ^p) Σ_{i=1}^{m} α_i^p k(x_i, x).                       (36)

This is derived by computing the dot product between a test point Φ(x) and
the p-th eigenvector in the RKHS; the 1/√λ^p factor ensures that ⟨v^p, v^p⟩ = 1.
When evaluated on the training example x_n, (36) takes the form

    ⟨v^p, Φ(x_n)⟩ = (1/√λ^p)(Kα^p)_n = (1/√λ^p)(λ^p α^p)_n = √λ^p α_n^p.      (37)

In (35), we have implicitly assumed that the data in the RKHS have zero
mean. If this is not the case, we need to subtract the mean (1/m) Σ_i Φ(x_i)
from all points. This leads to a slightly different eigenvalue problem, where
we diagonalize
    K' = (1 − ee^T) K (1 − ee^T)                                              (38)
(with e = m^{-1/2}(1, ..., 1)^T) rather than K.
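A minimal numerical sketch of this computation (illustrative only; a Gaussian kernel is assumed here for concreteness): the kernel matrix is centered as in (38) and diagonalized, and by (37) the projections of the training points onto the unit-norm eigenvectors v^p are √λ^p α_n^p.

```python
import numpy as np

def rbf_kernel_matrix(X, gamma=1.0):
    # Gaussian kernel k(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2), assumed for this sketch
    sq = np.sum(X**2, axis=1)
    return np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T))

def kpca_train_projections(K):
    """Centered kernel PCA on a kernel matrix K: returns the eigenvalues
    (descending) and the projections <v^p, Phi(x_n)> = sqrt(lambda^p) * alpha^p_n of (37)."""
    m = K.shape[0]
    J = np.eye(m) - np.ones((m, m)) / m      # 1 - e e^T with e = m^{-1/2}(1, ..., 1)^T
    Kc = J @ K @ J                           # centered kernel matrix K' of (38)
    lam, alpha = np.linalg.eigh(Kc)          # ascending eigenvalues, unit-norm eigenvectors
    lam, alpha = lam[::-1], alpha[:, ::-1]   # reorder so eigenvalues are descending
    lam = np.clip(lam, 0.0, None)            # remove small negative round-off
    return lam, np.sqrt(lam) * alpha         # column p holds the p-th feature values

# toy usage on synthetic data
X = np.random.RandomState(0).randn(50, 2)
lam, proj = kpca_train_projections(rbf_kernel_matrix(X, gamma=0.5))
```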
The kPCA algorithm can be used to obtain an implicit description of
a manifold containing the data as follows. The principal directions with
the smallest eigenvalues (sometimes called "minor components") characterize
directions in the RKHS such that when projected onto these directions, the
data set has the smallest possible variance which can be obtained in any
direction lying in the span of the mapped data.¹⁰ Generally, we are
interested in low variance directions which lie in the span of sets of input
points (e.g., the training set) mapped into the RKHS, as these lead to implicit
¹⁰Note that for some kernels, the RKHS will be infinite dimensional. In that case, there
are infinitely many zero variance directions which do not lie in the span of the data.
function expansions in terms of kernel functions. If we consider expansions
in terms of the training set, the functions take the form
    f_p(x) = Σ_{i=1}^{m} α_i^p k(x_i, x).                                     (39)

A tighter description of the desired manifold may be obtainable by intersect-
ing several such surfaces, e.g., using (39) for values of p corresponding to
several small eigenvalues λ_p.

LLE and Laplacian Eigenmaps  Kernel PCA can also be used for mani-
fold learning in a rather different way. In this case, the manifold is not learnt
as the zero set of a kernel expansion. Rather, we will obtain a low dimensional
coordinate embedding of data sampled from the manifold ("dimensionality
reduction").
It turns out that locally linear embedding (LLE) [6], currently a rather
popular algorithm for nonlinear dimensionality reduction, is a special case
of kPCA [3]: The LLE algorithm first constructs W to be the matrix whose
row i (summing to 1) contains the coefficients of the minimal squared
error affine reconstruction of x_i from its p nearest neighbors. Denote M :=
(1 − W)^T(1 − W), with maximal eigenvalue λ_max. One can show that M's
smallest eigenvalue is 0 and the corresponding uniform eigenvector is e. In
LLE, the coordinate values of the m-dimensional eigenvectors m − d, ..., m − 1
give an embedding of the m data points in ℝ^d. If we define K := (λ_max 1 − M),
then by construction, K is a positive definite matrix, its leading eigenvector
is e, and the coordinates of the eigenvectors 2, ..., d + 1 provide the LLE
embedding. Equivalently, we can use the eigenvectors 1, ..., d of the matrix
obtained by projecting out the subspace spanned by e, i.e., (1 − ee^T)K(1 −
ee^T). Note that this is identical to the centered kernel matrix (38) used in
kPCA. We thus know that the coordinates of the leading eigenvectors of
kPCA performed on K yield the LLE embedding. This, together with (37),
shows that the LLE embedding is identical to the kPCA projections up to
a whitening multiplication with √λ^p.
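The construction just outlined can be sketched directly (an illustration under the stated definitions, not the implementation of [3] or [6]; the neighbourhood size and the small regularizer added to the local Gram matrices are choices made only for this sketch):

```python
import numpy as np

def lle_weights(X, n_neighbors):
    """Row-stochastic W whose row i holds the minimal-squared-error affine
    reconstruction weights of x_i from its n_neighbors nearest neighbours."""
    m = X.shape[0]
    W = np.zeros((m, m))
    d2 = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
    for i in range(m):
        nbrs = np.argsort(d2[i])[1:n_neighbors + 1]      # skip the point itself
        Z = X[nbrs] - X[i]                               # neighbours in local coordinates
        C = Z @ Z.T
        C += 1e-10 * np.trace(C) * np.eye(len(nbrs))     # regularize a possibly singular Gram matrix
        w = np.linalg.solve(C, np.ones(len(nbrs)))
        W[i, nbrs] = w / w.sum()                         # each row sums to 1
    return W

def lle_via_kpca(X, n_neighbors=8, dim=2):
    """LLE embedding obtained through the kPCA construction described above."""
    m = X.shape[0]
    W = lle_weights(X, n_neighbors)
    M = (np.eye(m) - W).T @ (np.eye(m) - W)
    lam_max = np.linalg.eigvalsh(M)[-1]
    K = lam_max * np.eye(m) - M                          # "kernel" matrix K = lambda_max 1 - M
    J = np.eye(m) - np.ones((m, m)) / m                  # project out the uniform vector e
    lam, alpha = np.linalg.eigh(J @ K @ J)
    order = np.argsort(lam)[::-1]
    return alpha[:, order[:dim]]                         # leading eigenvectors give the embedding
```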
As shown in [3], several other approaches can be viewed as special cases
of kPCA, including certain spectral methods. Many of these methods are
based on the computation of a weighted adjacency matrix W on the data,
e.g., using the kernel (4) on neighboring points (where several definitions
of neighborhood are possible).¹¹ Define the graph Laplacian L by L_ii := d_i,
L_ij := −W_ij if x_i and x_j are neighbors, and 0 otherwise, where d_i = Σ_{j~i} W_ij
is the degree of the ith vertex. It turns out that, similar to LLE, the bottom
eigenvectors of the Laplacian can provide a low-dimensional representation
of the data [2], and again, a link to kPCA can be established [3].

¹¹This local similarity measure can also take into account invariances of the data.
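A corresponding sketch of this Laplacian-based embedding (again illustrative; a Gaussian weight on neighbouring points and a symmetrized neighbourhood graph are assumptions made here, since the precise neighbourhood definition is left open above):

```python
import numpy as np

def laplacian_embedding(X, n_neighbors=8, gamma=1.0, dim=2):
    """Graph Laplacian L with L_ii = d_i, L_ij = -W_ij for neighbours, 0 otherwise;
    the bottom non-constant eigenvectors give a low-dimensional representation [2]."""
    m = X.shape[0]
    d2 = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
    W = np.zeros((m, m))
    for i in range(m):
        nbrs = np.argsort(d2[i])[1:n_neighbors + 1]
        W[i, nbrs] = np.exp(-gamma * d2[i, nbrs])        # Gaussian weight on neighbouring points
    W = np.maximum(W, W.T)                               # symmetrize the adjacency matrix
    L = np.diag(W.sum(axis=1)) - W                       # L = D - W
    lam, U = np.linalg.eigh(L)
    return U[:, 1:dim + 1]                               # skip the constant eigenvector
```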
5 Conclusion
Kernel methods have a solid foundation in statistical learning theory and
functional analysis. They let us interpret (and design) learning algorithms
geometrically in an RKHS, and combine statistics and geometry in an ele-
gant way. The present article has described several methods for using this
approach for the estimation of manifolds.

References
[1] Aizerman M.A., Braverman E.M., Rozonoer L.I. (1964). Theoretical
foundations of the potential function method in pattern recognition learn-
ing. Automation and Remote Control 25, 821-837.
[2] Belkin M., Niyogi P. (2003). Laplacian eigenmaps for dimensionality
reduction and data representation. Neural Computation 15 (6), 1373-1396.
[3] Ham J., Lee D., Mika S., Scholkopf B. (2004). A kernel view of the di-
mensionality reduction of manifolds. In Proceedings of ICML (in press).
[4] Kim K.I., Franz M.O., Scholkopf B. (2004). Kernel Hebbian algorithm
for single-frame super-resolution. In Statistical Learning in Computer
Vision Workshop, Prague.
[5] Kimeldorf G.S., Wahba G. (1971). Some results on Tchebycheffian spline
functions. Journal of Mathematical Analysis and Applications 33, 82-95.
[6] Roweis S., Saul L. (2000). Nonlinear dimensionality reduction by locally
linear embedding. Science 290, 2323-2326.
[7] Scholkopf B., Platt J., Shawe-Taylor J., Smola A.J., Williamson R.C.
(2001). Estimating the support of a high-dimensional distribution. Neural
Computation 13, 1443-1471.
[8] Scholkopf B., Smola A.J. (2002). Learning with kernels. MIT Press,
Cambridge, MA.
[9] Steinwart I. (2004). Sparseness of support vector machines - some
asymptotically sharp bounds. In S. Thrun, L. Saul, and B. Scholkopf (eds),
Advances in Neural Information Processing Systems 16. MIT Press,
Cambridge, MA.
[10] Vapnik V.N. (1995). The nature of statistical learning theory. Springer
Verlag, New York.
[11] Weston J., Chapelle O., Elisseeff A., Scholkopf B., Vapnik V. (2003).
Kernel dependency estimation. In S. Becker, S. Thrun, and K. Obermayer
(eds), Advances in Neural Information Processing Systems 15. MIT Press,
Cambridge, MA, USA.
Address: B. Scholkopf, Max-Planck-Institut für biologische Kybernetik, Spe-
mannstr. 38, Tübingen, Germany
E-mail: bernhard.schoelkopf@tuebingen.mpg.de

OUTLIER DETECTION AND CLUSTERING


BY PARTIAL MIXTURE MODELING
David W. Scott
Key words: Minimum distance estimation, robust estimation, exploratory
data analysis.
COMPSTAT 2004 section: Statistical software.

Abstract: Clustering algorithms based upon nonparametric or semipara-
metric density estimation are of more theoretical interest than some of the
distance-based hierarchical or ad hoc algorithmic procedures. However, den-
sity estimation is subject to the curse of dimensionality, so care must
be exercised. Clustering algorithms are sometimes described as biased, since
solutions may be highly influenced by initial configurations. Clusters may be
associated with modes of a nonparametric density estimator or with compo-
nents of a (normal) mixture estimator. Mode-finding algorithms are related
to but different from Gaussian mixture models. In this paper, we describe
a hybrid algorithm which finds modes by fitting incomplete mixture models,
or partial mixture component models. Problems with bias are reduced since
the partial mixture model is fitted many times using carefully chosen random
starting guesses. Many of these partial fits offer unique diagnostic informa-
tion about the structure and features hidden in the data. We describe the
algorithms and present some case studies.

1 Introduction
In this pap er , we consider t he problem of finding outliers and/or clusters
through t he use of t he normal mixture mod el
K
f(x) = L Wk ¢(x If-L k , ~k) . (1)
k= l

Mixture models afford a very genera l famil y of densiti es. If the number
of components, K , is quite lar ge, then almost any density may be well-
approximate d by t his mod el. Aitkin and Wilson [1] first suggest ed using t he
mixture mod el as a way of han dling data with multiple outliers, especially
when some of t he out liers group into clumps. They used t he EM algorit hm
to fit the mixture model. Assuming that the "good" data are in one clust er
and make up at least fifty percent of the total data , then it is easy to see t ha t
we have introduced a number of "nuisance par ameters" into the problem (to
mod el the out liers).
Implementing this idea in practice is challenging. If there are just a few
"clusters" of outliers, then the number of nuisance parameters should not pose
too much difficulty. However, as the dimension increases, the total number
of parameters grows quite rapidly, especially if a completely general covari-
ance matrix, Σ_k, is used for each component. The most directly challenging
problem is finding an appropriate choice of the number of components, K,
and initial guesses for the many parameters. An obvious first choice is to use
a clustering algorithm such as k-means [15] as an approach to find an initial
partition, and then compute the relative size, means, and covariances of each
group to use as initial guesses for the EM algorithm.
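As an illustrative sketch of this initialization strategy (assuming SciPy's k-means routine, and adding a small ridge to each covariance so that no starting value is singular; not the code used for the results below):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def em_initial_guesses(X, K):
    """Relative sizes, means and covariances of a k-means partition of X,
    to be used as starting values for an EM fit of the normal mixture (1)."""
    _, labels = kmeans2(X, K, minit='++')
    weights, means, covs = [], [], []
    for k in range(K):
        Xk = X[labels == k]
        weights.append(len(Xk) / len(X))
        means.append(Xk.mean(axis=0))
        covs.append(np.cov(Xk, rowvar=False) + 1e-6 * np.eye(X.shape[1]))
    return np.array(weights), np.array(means), np.array(covs)
```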
It is abundantly clear that for many of our fits, we will in fact be using the
wrong value of K. Furthermore, even if we happen to be using the appropriate
value for K, there may be a number of different solutions, depending upon
the specific initialization of the parameters. Starting with a large number of
initial configurations is helpful, but as the dimension and sample size increase,
the number of possibilities quickly exceeds our capabilities.
However, the least discussed and least understood problem arises because
so little is generally known about the statistical distributions of the clusters
representing the outliers. It certainly seems more reasonable to know some-
thing about the distribution of the "good" data; however, one is on much less
firm ground trying to claim the same knowledge about the distributions of
the several non-informative clusters. Even in the situation where the "good"
data are in more than one cluster, sometimes little is known about the dis-
tribution in one or more of those "good" clusters.
In this paper, we discuss how an alternative to the EM algorithm can pro-
vide surprisingly useful estimates and diagnostics, even when K is incorrect.
Such technology is especially interesting when K is too small, since in this
situation the number of parameters to be estimated may be a small fraction
of the number in the full, correct model. Furthermore, this technology is of
special interest in the situation where little is known about the correct distri-
bution of many of the clusters. This latter capability is of growing importance
and interest in the analysis of massive datasets typically encountered in data
mining applications.

2 Mixture fits with too few components


We examine some empirical results to reinforce these ideas. One well-known
trimodal density in two dimensions is the lagged Old Faithful Geyser duration
data, {(x_{t-1}, x_t), t = 2, ..., 298}; see [2] and [27]. Successive eruptions were
observed and the duration of each eruption, {x_t, t = 1, ..., 299}, recorded
to the nearest second. A quick count shows that 23, 2, and 53 of the orig-
inal 299 values occurred exactly at x_t = 2, 3, and 4 minutes, respectively.
Examining the original time sequence suggests that those measurements are
clumped; perhaps accurate measurements were not taken after dark. We
modified the data as follows: the 105 values that were only recorded to the
nearest minute were blurred by adding uniform noise of 30 seconds in dura-
tion. Then all of the data were blurred by adding uniform noise, U(−.5, .5),
seconds, and then converted back into minutes.
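A sketch of one possible reading of this blurring step (the 30-second noise is interpreted here as U(−30, 30) seconds; the function and its arguments are illustrative, not the code actually used):

```python
import numpy as np

def blur_durations(seconds, minute_rounded, rng=np.random.default_rng(0)):
    """Blur rounded eruption durations (in seconds) and return minutes:
    minute-rounded values get uniform noise on a 30-second scale (read here
    as U(-30, 30)), then every value gets U(-0.5, 0.5) seconds."""
    x = np.asarray(seconds, dtype=float).copy()
    mask = np.asarray(minute_rounded, dtype=bool)
    x[mask] += rng.uniform(-30.0, 30.0, size=mask.sum())
    x += rng.uniform(-0.5, 0.5, size=x.size)
    return x / 60.0
```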
In Figure 1, maximum likelihood estimates (MLE) of a bivariate normal
and three two-component bivariate normal mixture fits are shown. Each
bivariate normal density is represented by 3 elliptical contours at the 1, 2, and
3-σ levels. Figure 1 provides some examples of different solutions, depending
upon the value of K selected and the starting values for the parameters
chosen. In two dimensions, your eye can tell you what is wrong with these
fits. In higher dimensions, diagnostics indicating a lack of fit leave it unclear
whether a component should be split into two, or whether the assumed shape
of the component is incorrect.

Figure 1: Maximum likelihood bivariate normal mixture fits to the lagged
Old Faithful geyser eruption data with K = 1 and K = 2. The weights in
each frame from L to R are (1.0), (.350, .650), (.645, .355), and (.728, .272).
Each bivariate normal component is represented by 3 contours at the 1, 2,
and 3-σ levels.

3 The L2E criterion


Minimum distance estimation for parametric modeling of f_θ(x) = f(x|θ) is
a well-known alternative to maximum likelihood; see [7]. In practice, several
authors have suggested modeling the data with a nonparametric estimator
(such as the histogram or kernel method), and then numerically finding the
values of the parameters in the parametric model that minimize the distance
between f_θ and the curve; see [6] and [9], who considered Hellinger and L2
distances, respectively. Using a nonparametric curve as a target introduces
some choices, such as the smoothing parameter, but also severely limits the
dimension of the data and the number of parameters that can be modeled.
(Precise numerical integration is quite expensive even in two dimensions.
Numerical optimization algorithms require very good accuracy in order to
numerically estimate the gradient vectors.)
Several authors have discovered an alternative criterion for parametric
estimation in the case of L2 or integrated squared error (ISE); see [25], [13],
[5], [20], [21], [22], for example. (This idea follows from the pioneering work
of Rudemo [18] and Bowman [8] on cross-validation of smoothing parameters
in nonparametric density estimates.) In particular, Scott [20], [21] considered
estimation of mixture models by this technique. Given a true density, g(x),
and a model, f_θ(x), the goal is to find a fully data-based estimate of the
L2 distance between g and f_θ, which is then minimized with respect to θ.
Expanding the L2 criterion

    d(f_θ, g) = ∫ [f_θ(x) − g(x)]² dx,                                        (2)

we obtain the three integrals

    d(f_θ, g) = ∫ f_θ(x)² dx − 2 ∫ f_θ(x) g(x) dx + ∫ g(x)² dx.               (3)

The third integral is unknown but is constant with respect to θ and there-
fore may be ignored. The first integral is often available as a closed-form
expression that may be evaluated for any posited value of θ. Additionally,
we must add an assumption on the model that this integral is always finite,
i.e., f_θ ∈ L2. The second integral is simply the average height of the density
estimate, given by −2E[f_θ(X)], where X ~ g(x), and which may be esti-
mated in an unbiased fashion by −2n^{-1} Σ_{i=1}^{n} f_θ(X_i). Combining, the L2E
criterion for parametric estimation is given by

    L2E(θ) = ∫ f_θ(x)² dx − (2/n) Σ_{i=1}^{n} f_θ(X_i).                       (4)

For the multivariate normal mixture model in Equation 1,

    ∫ f_θ(x)² dx = Σ_{k=1}^{K} Σ_{ℓ=1}^{K} w_k w_ℓ φ(0 | μ_k − μ_ℓ, Σ_k + Σ_ℓ).   (5)

Since this is a computationally feasible closed-form expression, estimation of
the normal mixture model by the L2E procedure may be performed by use
of any standard nonlinear optimization code; see [20], [21]. In particular, we
used the nlmin routine in the Splus library for the examples in this paper.
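The criterion is simple enough to evaluate directly. The following is a minimal Python sketch of (4) with the closed form (5) (illustrative only; the computations in this paper used the Splus nlmin routine, and the helper name below is not part of that code). Handing this function, with a suitable parameterization of the covariance matrices, to a general-purpose optimizer corresponds to the estimation procedure just described.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def l2e_criterion(X, weights, means, covs):
    """L2E criterion (4) for the normal mixture (1): the integral of f_theta^2,
    computed with the closed form (5), minus twice the average model height."""
    d = X.shape[1]
    quad = 0.0
    for wk, mk, Sk in zip(weights, means, covs):
        for wl, ml, Sl in zip(weights, means, covs):
            # phi(0 | mu_k - mu_l, Sigma_k + Sigma_l), cf. (5)
            quad += wk * wl * mvn.pdf(np.zeros(d), mean=mk - ml, cov=Sk + Sl)
    heights = sum(w * mvn.pdf(X, mean=m, cov=c)
                  for w, m, c in zip(weights, means, covs))
    return quad - 2.0 * np.mean(heights)
```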
Next, we return to the Old Faithful geyser example. Using the same
starting values as in Figure 1, we computed the corresponding L2E estimates,
which are displayed in Figure 2. Clearly, both algorithms are attracted to the
same (local) estimates, which combine various clusters into one (since K < 3).
However, there are interesting differences. First we compare the estimated
weights: in Figure 1, the MLE weight of the larger component in each frame
is 1, 0.65, 0.65, and 0.73, respectively, while in Figure 2 the corresponding
L2E weights are 1, 0.74, 0.72, and 0.71. Of more interest, the L2E covariance
matrices are either tighter or smaller. Since the (explicit) goal of L2E is to
find the most normal fit (locally), observe that a number of points in the
smaller clusters fall outside the 3-σ contours in frames 2 and 3 of Figure 2.
The MLE covariance estimate is not robust and is inflated by those (slight)
outliers. These differences are likely due to the inherent robustness properties
of any minimum distance criterion; see [12]. Increasing the covariance matrix
to "cover" a few outliers results in a large increase in the integrated squared
or L2 error, and hence those points are largely ignored.

Figure 2: Several L2E mixture fits to the lagged Old Faithful geyser eruption
data with K = 1 and K = 2; see text. The weights in each frame are (1.0),
(.258, .742), (.714, .286), and (.711, .289).

4 Partial mixture modeling


The two-component L2E estimates above were computed with the constraint
that w_1 + w_2 = 1. Is this constraint necessary? Can the weights w_1 and w_2 be
treated as unconstrained variables? Certainly, when using EM or maximum
likelihood, increasing the weights increases the likelihood without bound, so
that the constraint is necessary (and active). However, the L2E criterion does
not require that the model f_θ be a density. The second integral in Equation 3
measures the average height of the density model, but a careful review of the
argument leading to Equation 4 confirms the fact that only g(x) is required
to be a density, not f_θ(x); see [22].
With this understanding, when we fit an L2E mixture model with K = 2,
we are only assuming that the true mixture has at least 2 components. That
is, we explicitly use our model for the local components of "good" data (local
in the sense of our initial parameter guesses), but make no explicit assumption
about the (unknown) distribution of the remaining data, no matter how many
or few clusters they clump into. Our algorithm is entirely local. Different
starting values may lead to quite different estimates.
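A minimal sketch of such an unconstrained fit for a single partial component, purely for illustration: the weight is left free, the covariance matrix is parameterized through its Cholesky factor, and SciPy's Nelder-Mead routine stands in for the Splus nlmin used for this paper's examples.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal as mvn

def fit_pdc(X, w0=1.0, mu0=None, cov0=None):
    """Fit one normal component w * phi(x | mu, Sigma) with unconstrained weight
    by minimizing the K = 1 version of the L2E criterion (4)-(5)."""
    n, d = X.shape
    mu0 = X.mean(axis=0) if mu0 is None else np.asarray(mu0, float)
    cov0 = np.cov(X, rowvar=False) if cov0 is None else np.asarray(cov0, float)
    idx = np.tril_indices(d)
    theta0 = np.concatenate([[w0], mu0, np.linalg.cholesky(cov0)[idx]])

    def unpack(theta):
        w, mu = theta[0], theta[1:1 + d]
        L = np.zeros((d, d))
        L[idx] = theta[1 + d:]
        return w, mu, L @ L.T + 1e-8 * np.eye(d)   # Sigma = L L^T stays positive definite

    def criterion(theta):
        w, mu, cov = unpack(theta)
        quad = w * w * mvn.pdf(np.zeros(d), mean=np.zeros(d), cov=2.0 * cov)
        return quad - 2.0 * w * np.mean(mvn.pdf(X, mean=mu, cov=cov))

    res = minimize(criterion, theta0, method='Nelder-Mead',
                   options={'maxiter': 20000, 'fatol': 1e-10, 'xatol': 1e-8})
    return unpack(res.x)
```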
Thus, we re-coded our L2E algorithm treating all of the weights in Equa-
tion 5 as unconstrained variables. In Figure 3, we display some of the "un-
constrained" L2E mixture estimates, using the same starting values as in
Figure 2. These estimates are qualitatively quite similar to those in Figure 2,
with some interesting differences. Comparing the first frames in Figures 2
and 3, the covariance matrix has narrowed as the weight decreased to .783.
The sums of the (unconstrained) weights in the final three frames of Figure
3 are 0.947, 0.966, and 1.048. In the first two cases, the total probability
modeled is less than unity, suggesting a small fraction of the data are being
treated/labeled as outliers with respect to the fitted normal mixture model.
The fact that the third total probability exceeds unity is consistent with our
previous observation that the best fitting curve in the L2 or ISE sense often
integrates to more than 1 when there is a gap in the middle of the data.

Figure 3: Several L2E partial mixture fits to the lagged Old Faithful geyser
eruption data with K = 1 and K = 2, but without any constraints on
the weights; see text. The weights in each frame are (.783), (.253, .694),
(.683, .283), and (.751, .297).

Since there are potentially many more local solutions, we display four
more L2E solutions in Figure 4. Some of these estimates are quite unexpected
and deserve careful examination. The first frame is a variation of a K = 1
component which captures 2 clusters. However, the K = 2 estimates in the
last 3 frames each capture two individual clusters, while completely ignoring
the third. Comparing the contours in the last three frames of Figure 4, we
see that exactly the same estimates appear in different pairs. Looking at
the weights in Figures 3 and 4, we see that the smaller isolated components
are almost exactly reproduced while entirely ignoring the third cluster. This
feature of L2E is quite novel and we conclude that many of the local L2E
results hold valuable diagnostic information as well as quite useful estimates
of the local structure of the data.

Figure 4: Same as Figure 3 but different starting values; see text. The weights
in each frame are (.683), (.253, .316), (.253, .283), and (.316, .283).

Finally, in Figure 5, we conclude this investigation of the geyser data by
checking a number of K = 1 unconstrained L2E solutions. In this case, the
three individual components are found one at a time, depending upon the
initial parameter values. Notice that the weights are identical to those in
the previous figure. Furthermore, these weights are less than 50%, which is
the usual breakdown point of robust algorithms; see [17]. However, the L2E
algorithm is local and different ideas of breakdown apply.

Figure 5: Four more K = 1 partial mixture fits to the geyser data; see text.
The weights in each frame are (.694), (.253), (.316), and (.283).

5 Other examples
5.1 Star data
Another well-studied bivariate dataset was discussed by Rousseeuw and Le-
roy [17]. The data are measurements of the temperature and light intensity
of 47 stars in the direction of Cygnus. For our analysis, the data were blurred
by uniform U(−.005, .005) noise. Four giant stars exert enough influence to
distort the correlation of a least-squares or maximum likelihood estimate;
see the first frame in Figure 6. In the second frame, a K = 2 MLE normal
mixture is displayed. Notice the four giant stars are represented by one of
the two mixture components, which has a nearly singular covariance matrix.
The third frame shows a K = 1 partial component mixture fit by L2E, with
ŵ = 0.937. The shape of the two covariance matrices of the "good" data
is somewhat different in these three frames. In particular, the correlation
coefficients are −0.21, 0.61, and 0.73, respectively.
These data were recently re-analyzed by Wang and Raftery [26] with the
nearest-neighbor variance estimator (NNVE), an extension of the NNBR es-
timator [10]. They compared their covariance estimates to the minimum
volume ellipsoid (MVE) of Rousseeuw and Leroy [17] as well as the (non-
robust) MLE. In Figure 7, I have overlaid these 4 covariance matrices (at the
1-σ contour level) with that of the partial density component (PDC) estimate
obtained by L2E shown in the third frame of Figure 6. For convenience, I have
centered these ellipses on the origin. The NNVE and NNBR ellipses are virtu-
Figure 6: Two-σ contours of MLE (K = 1), MLE mixture (K = 2), and
partial L2E mixture (K = 1) fits to the blurred star data.

ally identical, while the MVE ellipse is slightly rotated and narrower. These
three are surrounded by the slightly elongated L2E PDC ellipse. Of course,
the MLE has the wrong (non-robust) orientation. The correlation coefficients
for NNVE and NNBR are 0.65 versus 0.73 for MVE and L2E. Observe that
L2E does not explicitly require a search for the good data. The other three
algorithms require extensive search and/or calibration of an auxiliary param-
eter. L2E is driven by the choice of the shape of the mixing distribution. One
might choose instead to use t_ν components, as suggested by McLachlan and
Peel [16], although the degrees of freedom must be specified. In either case,
L2E provides useful diagnostic information as a byproduct of the estimation,
rather than as a follow-on step of analysis.

"' .--- - - - - ---,


ci

'"
9

~
9 '----- - -- - --'
-0.4 -0.2 0.0 0.2 0.4

Figure 7: Ellipses representing the 2-σ contours of five estimates of the co-
variance matrix of the star data; see text.

5.2 Australian athlete data


For our final example, we consider four variables from the AIS data on Aus-
tralian Athletes [11]. These data are available in the R package sn with the
command data(ais, package = "sn"). Following Wang and Raftery [26], we
selected the variables body fat (BFAT), body mass index (BMI), red cell
count (RCC), and lean body mass (LBM). (Wang and Raftery also included
ferritin in their analysis.) We blurred the data then standardized each vari-
able.
We fit a K = 1 L2E starting with the maximum likelihood estimate. The
result was w_1 = 0.98. A pairwise scatterdiagram of the 202 points is shown
in Figure 8, together with contours of the fitted 4-dimensional ellipse. A
careful examination of this plot suggests some clusters. In fact, the first 100
measurements are of female athletes and the last 102 measurements are of
male athletes.

Figure 8: Ellipses representing the (1, 2, 3)-σ contours of an L2E partial mixture
estimate of the Australian athlete data; see text.

Starting with the MLE values for the female athletes, we re-fit a K = 1
L2E. Now w_1 = 0.41 (somewhat less than the 49.5% female population).
The contours of the fitted 4-dimensional ellipse are superimposed upon the
scatter matrix in Figure 9. The L2E is clearly modeling a large fraction of
the female athletes.
Finally, we started the L2E with the male values. However, L2E found
a smaller subset of the data lying in a subspace. (L2E is just as susceptible
as MLE to being attracted to singular mixture components, depending upon
initial guesses. That is why blurring was applied in all our examples to
remove trivial singularities due to rounding.) Further experimentation would
be interesting.

6 Discussion
We have shown how a minimum distance criterion and a mixture model
with only one or two partial components can provide useful estimates and
diagnostics. In particular, the value of w_1 + w_2 provides an indication of the

Figure 9: Ellipses representing the (1, 2, 3)-σ contours of a second L2E partial
mixture estimate of the Australian athlete data; see text.

fraction of the data being modeled by a K = 2 mixture. In our experience, the
proportion of solutions that are interesting when K = 2 and the parameters
are initialized by some random process is quite small. Further research on this
question is open. However, many of the K = 1 solutions following random
initialization are quite useful. The systematic use of these ideas for clustering
is explored further in [23].
Alternatively, Banfield and Raftery [4] allow a number of outliers to be
modeled as a spatial Poisson process. It would be interesting to apply that
model with K = 2 to these data, where the noise is not Poisson, and to
compare the parameter estimates.
The identification of outliers without an explicit probability model should
always be viewed as preliminary and exploratory. If a probability model is
known, then the tasks of parameter estimation and outlier identification can
be more rigorously defined. However, even probability models are usually
known only approximately at best, and hence outliers so identified are still
subject to certain biases.
The general topic of outlier detection is discussed in [3]. Robust estima-
tion is described by Huber [14]. Coupled with a good exploratory tool such as
XGobi [24], the L2E PDC has much potential for helping unlock information
in complex data.

References
[1] Aitkin M., Wilson G.T. (1980). Mixture models, outliers, and the EM
algorithm. Technometrics 22, 325-331.
[2] Azzalini A., Bowman A.W. (1990). A look at some data on the Old Faithful
geyser. Applied Statistics 39, 357-365.
[3] Barnett V., Lewis T. (1994). Outliers in statistical data. John Wiley &
Sons, New York.
[4] Banfield J.D., Raftery A.E. (1993). Model-based Gaussian and non-
Gaussian clustering. Biometrics 49, 803-821.
[5] Basu A., Harris I.R., Hjort H.L., Jones M.C. (1998). Robust and effi-
cient estimation by minimising a density power divergence. Biometrika
85, 549-560.
[6] Beran R. (1977). Robust location estimates. The Annals of Statistics 5,
431-444.
[7] Beran R. (1984). Minimum distance procedures. In Handbook of Statistics
Volume 4: Nonparametric Methods, pp. 741-754.
[8] Bowman A.W. (1984). An alternative method of cross-validation for the
smoothing of density estimates. Biometrika 71, 353-360.
[9] Brown L.D., Hwang J.T.G. (1993). How to approximate a histogram by a
normal density. The American Statistician 47, 251-255.
[10] Byers S., Raftery A.E. (1998). Nearest-neighbor clutter removal for es-
timating features in spatial point processes. Journal of the American Sta-
tistical Association 93, 577-584.
[11] Cook R.D., Weisberg S. (1994). An introduction to regression graphics.
Wiley, New York.
[12] Donoho D.L., Liu R.C. (1988). The 'automatic' robustness of minimum
distance functionals. The Annals of Statistics 16, 552-586.
[13] Hjort H.L. (1994). Minimum L2 and robust Kullback-Leibler estima-
tion. Proceedings of the 12th Prague Conference on Information Theory,
Statistical Decision Functions and Random Processes, P. Lachout and
J.A. Víšek (eds.), Prague Academy of Sciences of the Czech Republic,
pp. 102-105.
[14] Huber P.J. (1981). Robust statistics. John Wiley & Sons, New York.
[15] MacQueen J.B. (1967). Some methods for classification and analysis of
multivariate observations. Proc. 5th Symp. Math. Statist. Prob. 1, 281-297,
Berkeley, CA.
[16] McLachlan G.J., Peel D. (2001). Finite mixture models. John Wiley &
Sons, New York.
[17] Rousseeuw P.J., Leroy A.M. (1987). Robust regression and outlier detec-
tion. John Wiley & Sons, New York.
[18] Rudemo M. (1982). Empirical choice of histogram and kernel density
estimators. Scandinavian Journal of Statistics 9, 65-78.
[19] Scott D.W. (1992). Multivariate density estimation: theory, practice,
and visualization. John Wiley, New York.
[20] Scott D.W. (1998). On fitting and adapting of density estimates. Com-
puting Science and Statistics, S. Weisberg (ed.) 30, 124-133.
[21] Scott D.W. (1999). Remarks on fitting and interpreting mixture models.
Computing Science and Statistics, K. Berk and M. Pourahmadi (eds.)
31, 104-109.
[22] Scott D.W. (2001). Parametric statistical modeling by minimum inte-
grated square error. Technometrics 43, 274-285.
[23] Scott D.W., Szewczyk W.F. (2001). The stochastic mode tree and clus-
tering. Journal of Computational and Graphical Statistics, under revision.
[24] Swayne D.F., Cook D., Buja A. (1998). XGobi: Interactive dynamic
data visualization in the X Window system. Journal of Computational
and Graphical Statistics 7, 113-130.
[25] Terrell G.R. (1990). Linear density estimates. Proceedings of the Statis-
tical Computing Section, American Statistical Association, 297-302.
[26] Wang N., Raftery A.E. (2002). Nearest-neighbor variance estimation:
Robust covariance estimation via nearest-neighbor cleaning. Journal of
the American Statistical Association 97, 994-1019.
[27] Weisberg S. (1985). Applied linear regression. John Wiley, New York.

Acknowledgement: This work was supported in part by NSF grant number
DMS 02-04723 and NSF Digital Government contract number EIA-9983459.
Address: D.W. Scott, Rice University, Department of Statistics, MS-138,
PO Box 1892, Houston, TX 77251-1892, USA
E-mail: scottdw@rice.edu

INTERDATABASE AND DANDD


Ritei Shibata
Key words: Working with data, environment, database, Internet.
COMPSTAT 2004 section: E-statistics.

Abstract: InterDatabase is a comfortable environment for working simulta-
neously with different types of data which are scattered over various databases
or files on a network or on the Internet. This paper reports an implementa-
tion of InterDatabase based on DandD, a system for Data and Description.
Thanks to the high level of data abstraction established in the long-running
DandD project, the implemented InterDatabase is, in fact, a flexible en-
vironment which covers almost all fields of science and which can be used for
data acquisition, data cleaning, data organisation, data visualisation, data
analysis and data modelling.

1 Environmental changes of statistics


Using highly developed computers and networks, it has become faster and
easier to get various data from different sources or databases. Some of these
are publicly available on the Internet. A major challenge facing statisticians
is the creation of powerful environments for working with such a variety of
data. InterDatabase is one such environment ([9]). Here, by "Inter" we mean
both "Inter-databases", to utilise different databases, and "Databases on the
Internet", to utilise databases scattered over the Internet. Both are closely
related and it does not seem so meaningful to distinguish them once Internet
access has been established. Therefore, we can simply say that InterDatabase
is an environment for utilising different databases simultaneously. Before
describing InterDatabase, we review other related approaches.

2 Approaches
Let us quickly review the following different approaches to the environment
for working with various types of data: NetCDF, DDI and MetBroker.

2.1 NetCDF
NetCDF (Network Common Data Form, [4]) is a data abstraction for storing
and retrieving multidimensional data and is distributed as a software library
which provides a concrete implementation of the abstraction. The software
has been developed under the Unidata Program sponsored by the US National
Science Foundation to support research and education in the atmospheric
sciences. This approach is closely related to our InterDatabase, but not the
same. NetCDF is targeted only at multidimensional or array data. However,
InterDatabase is not restricted to data following such a neat format. Another
point is that NetCDF requires reformatting all the data so that it accords
with a common rule, called Common Data Language (CDL). This can be
a burden unless the data processing procedures have been established from
the beginning in each field of application.

2.2 DDI and NESSTAR

DDI (Data Documentation Initiative, [7]) is also close to InterDatabase in
concept. It aims to create a universally supported metadata standard for the
social science community and is implemented as an XML document. The
DDI-tree contains five main branches.
1. Document Description
2. Study Description
3. Data File Description
4. Variable Description
5. Other Study-Related Materials
DDI is mainly concerned with table data such as the results of surveys.
The data is therefore assumed to be a simple set of realizations of several
variables, so that it would not be an easy job to describe complicated relations
among variables, or to include array data in which each axis corresponds
to an explanatory variable and the value of the array corresponds to the
response variable. Also, description procedures are not yet well formalised.
For example, description of the sampling design is left free.
NESSTAR is a metadata-driven system which can be used in conjunction
with DDI metadata. It searches or navigates data files, based on what is
written in the DDI metadata, and can display a simple summary. However,
the system does not aim at manipulating several data sets to combine them
into one.

2.3 MetBroker
MetBroker ([3]) is middleware which provides consistent access to heteroge-
neous weather databases. It is a mediator that sits between agricultural
models and various sources of online data. This approach resolves data het-
erogeneity problems by writing a suitable program. It is efficient for meeting
the needs of specific tasks, but it would be laborious to rewrite the program to
meet the needs of users or to accommodate structural changes in databases.

3 Our approach
As was mentioned before, our goal is to provide a good environment to work
with different types of data which might be scattered over networks. Our ap-
proach is probably closer in concept to DDI or NESSTAR. In InterDatabase,
the DandD Client Server System ([8]) is driven by a DandD instance and
provides a similar environment. A major difference of InterDatabase from
DDI and NESSTAR is the unified general approach to providing such an
environment. A high level of data abstraction is necessary to retain such
generality, and an intimate linkage between the abstraction and the development
of support software is indispensable.
DandD (Data and Description) is a long-running project started around 1990.
Preliminary works can be found, for example, in [10]. The aim of this project
was to establish a formal rule for the description of data. A hope was to make it
possible to do an automatic analysis of data as well as to make it easier to
exchange data with enough description for the aim of analysis. In the first
part of the project, data abstraction was the main concern. The basic model
had been established: construct the necessary number of structures, relational
or array, by quoting data vectors, which are simple sequences of numbers. All
necessary attributes are classified into three levels. The bottom level is for
each data vector, the middle level is for each structure constructed, and the
top level is for the whole data. The rule had been implemented in a LISP-like
language of its own, and some experimental supporting software had been
developed.
A breakthrough occurred with the introduction of XML in 1997 as a medium
for implementing the DandD rule. The project grew to cover various
data which are not necessarily included as the body of the XML document. This
led to the introduction of the concept of the External Data Vector, and further
to the idea of InterDatabase. Therefore InterDatabase is a natural extension
of our original idea of DandD, and it is now a part of DandD together with its
support system, the DandD server client system ([8]). Let us focus our attention
on the features of DandD most closely related to InterDatabase.

4 DandD
DandD is a generic name for a system consisting of the following three ele-
ments.

1. DandD rule: A syntax and semantics for describing data. The syntax
is currently written as a DTD.

2. DandD instance: A document implemented along the DandD rule for


the data. Currently it takes the form of an XML document.

3. DandD client server system: A software system that provides a suitable


environment for handling DandD instances.

As has been mentioned before, the data itself is not necessarily a part of
a DandD instance, and it allows us to implement InterDatabase.
4.1 Data vector


In DandD, any data are decomposed into several data vectors. The data vec-
tor here is defined as a simple sequence of numbers, which becomes the body
of elements denoted by the tag <DataVector> in a DandD instance. To keep
consistency, only sequences of numbers are allowed as the vector. There-
fore, for example, categorical data is always converted to a sequence of num-
bers and the coding information is attached to the elements as an attribute
Code. If the body is empty, the three attributes Access, Protocol and
PostProcessing tell us all the necessary information to get the body from
outside DandD.
The attribute Access tells us where to access by its IPAddress at-
tribute, and any information needed for the access, for example, user I.D.
or password, by other attributes. The attribute Protocol tells us the
physical network protocol for the access and other information such as,
for example, a query sentence which is needed to extract the data from
a database. The attribute PostProcessing is for converting the data
obtained from a data server to a simple sequence of numbers. Then the
sequence can be regarded as if it were the body of the DataVector in
the DandD instance. The following is an example of a DataVector, where
the body is empty and should be obtained from a relational database system.

Example 1
<DataVector Id="ii" LongName="Year" Access="ai"
@Protocol="bi " PostProcessing="ci"/>

<Access Id="ai" IPAddress="131.113 .65 .i" Userld="dandd"/>


<Protocol I d="b i" Physical="TCP"/>
<JDBC DatabaseServerType="postgresql" DatabaseName="KobeQuake">
@select year from kobequake
</JDBC>
</Protocol>
<ScanFormat Id= "ci">
%s-%*s-%*s
</ScanFormat>
The attributes Access and Protocol of the DataVector with I.D. i1 re-
fer to the elements with I.D.s a1 and b1, respectively. The two attributes
IPAddress and UserId of the Access a1 tell us the IP address and the user
I.D., respectively. The attribute Physical of the Protocol with I.D. b1 tells
us that the physical access protocol to the data server is TCP. Currently the
only other available protocol is UDP, but more protocols will be introduced
according to need. This element has a sub-element JDBC which further tells
us the software protocol to communicate with the data server. The JDBC
is an interface for accessing a database through the Java language ([5], [6]), which ab-
sorbs differences between database servers. Other protocols allowed here
are FTP and HTTP. The attributes DatabaseServerType and DatabaseName
of JDBC tell us that the database server is PostgreSQL and that the name of
the database to be accessed is KobeQuake, respectively. In fact, the database
is a record of the disastrous earthquake that occurred in the Kobe area in
Japan on 17 January 1995, and the example above is a part of a DandD
example instance KobeQuake.dad which is available from the DandD project
home page
http://www.stat.math.keio.ac.jp/DandD.
The body of JDBC is a Structured Query Language (SQL) sentence which
gets a column year from the table kobequake in the relational database
KobeQuake.
The last element ScanFormat, referred to by the I.D. c1 in the attribute
PostProcessing of the DataVector, specifies the processing method after re-
ceiving a response from the database server. The response is not necessarily
a sequence of numbers and often has to be converted to fit the DandD re-
quirement that the body of DataVector is a sequence of numbers. In this
example, the response to the query is a sequence of dates of the form YY-
MM-DD and what is needed as the body of this DataVector is only the YY
part. The body of ScanFormat specifies the extraction method by a formula.
The syntax is the same as that of the function scanf in the language C. Other
elements which can be referred to, together with or in place of ScanFormat,
are PrintFormat, Arithmetic, Media and Movie. The PrintFormat is used
for adding something to each element of the sequence. The syntax is the
same as that of the function printf in the C language as well. Although this
PrintFormat element does not appear in this example, it becomes necessary,
for example, to add the prefix 20 to all YY to make a four-digit representation
of the year for consistency with other data vectors obtained from different
data sources. The Arithmetic is used for a more complicated manipulation,
an arithmetic operation on each element of the sequence. This is used, for exam-
ple, when conversion of the unit from centimetre to metre, or from sexagesimal
to decimal notation, is necessary. The other two functions are experimental and
support the case when the response from the data server is an image or a movie.
In the attribute PostProcessing of the DataVector, several such manipulations of
the elements of the response can be referred to. Each manipulation is assumed
to be applied in order.
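Purely as an illustration of how such a chain of manipulations acts on the response lines (this is not DandD's own client-server software, which is described in [8]; the function names are invented for the sketch), the %s-%*s-%*s ScanFormat of Example 1 followed by a PrintFormat-style prefixing step could be mimicked as follows:

```python
import re

def scan_year(line):
    # ScanFormat-like step: keep only the YY field of a "YY-MM-DD" response line
    m = re.match(r"(\d+)-\d+-\d+", line)
    return m.group(1) if m else line

def prefix_20(yy):
    # PrintFormat-like step: prefix "20" to obtain a four-digit year
    return "20" + yy

def post_process(lines, steps):
    """Apply the referenced manipulations to each element of the response, in order."""
    out = []
    for line in lines:
        value = line
        for step in steps:
            value = step(value)
        out.append(value)
    return out

# e.g. post_process(["02-12-05", "03-01-06"], [scan_year, prefix_20]) yields ["2002", "2003"]
```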
Besides the elements mentioned above, the Code element can also be re-
ferred to in PostProcessing. The primary role of the Code element
was to provide coding information for categorical data. The body is a se-
quence of quoted strings, which provides codes for natural numbers in the
body of DataVector, and it is usually referred to in the attribute Code of the
DataVector. If it is referred to in the attribute PostProcessing, it means
that each element of the current sequence is matched to the body of the Code
and converted to the matched index. This Code is used as a default Code
attribute of the DataVector as well. The conversion of the labels of the levels
of categorical data can be described if two different Codes are given to the
attributes PostProcessing and Code of a DataVector. Then, the code given
in PostProcessing indicates the code used for the database or for the data
file, and the code given in the attribute Code indicates the code which should
be used in the DandD instance. If both are missing, the obtained sequence
is regarded as a sequence of numbers.
The reason why we provided such functionalities at the level of data ac-
quisition from a data server comes from our design principle. The principle
of InterDatabase is to provide a good flexible environment for working with
various types of data, and so it is better to do any necessary conversions at
the stage of data acquisition outside DandD. We are then free from the dif-
ferences of the data sources. An alternative would be to modify the existing
databases or data files according to the needs of the user or to create a new
database. However, this is not only laborious but also inefficient for the case
of huge data sets which are rarely used as a whole. In InterDatabase, no
modification is necessary to the existing data sources. This principle is close
to that of DDI or MetBroker, but InterDatabase provides a more general and
flexible way of resolving such differences than others since it is free from any
particular software. It is sufficient to write explicitly any necessary informa-
tion in the form of XML, which is needed for processing, analysis, modelling
and its utilisation.

4.2 Data
The DataVectors defined in the element DataBody are organised into several
structures within the element Data. Two types of structures are available:
Relational or Array. The relational model is general enough to represent
any relations among data ([2]). The relational database under the frame-
work of the relational model is now a standard for database systems because
of its generality and ease of system maintenance. A relation is a collection
of variables, and the realization is a collection of data vectors, each of which
is a sequence of realized values of one variable. The realization looks like a
table and it is usually called a table in a Relational Database Management
System (RDBMS).
Caution is necessary when using the word table. A contingency table
or the result of a designed experiment is also called a table in statistics.
However, such a table is not a table in the sense of RDBMS. In the relational
model, each row in the table is regarded as a point in the value space of
the variables, so that the table is nothing more than a set of such points.
Therefore, the position of each row in the table has no specific meaning.
This is in contrast to, for example, a two-dimensional contingency table, in
which two hidden variables exist, say row index and column index variables,
besides the variable for the values in the table. Therefore, it should be
reorganised as a table in RDBMS, with two index variables and a variable for
the values of the table. Each index variable then repeatedly takes the same
value as many times as the number of rows or columns. To avoid such
redundancy, we allow an array structure besides the relational structure in
DandD, since such a table or multidimensional array frequently appears as
a neat data structure and it becomes cumbersome to represent it as a rela-
tion. Example 2 gives a practical example of a relational structure in DandD.

Example 2

<Data>
<Relational Id="Futures" LongName="TSLongNameE TSLongNameJ"
MainKey="dly dlm dld cmdty dvy dvm mkt"
Control= "dly dlm dld" Nominal="cmdty dvy dvm mkt">
<Value Id="dly" LongName="Dealing Year"
RefId="Dealing_Year02 Dealing_Year03" Systems="tl"/>
<Value Id="dlm" LongName="Dealing Month"
RefId="Dealing_Month02 Dealing_Month03" Systems="tl "/>
<Value Id="dld" LongName="Dealing Day"
RefId="Dealing_Day02 Dealing_Day03 " Systems="tl"/>
<Value Id= "cmdty" LongName= "Commodity Dealt"
RefId="Commodity02 Commodity03"/>
<Value Id="dvy" LongName="Delivery Year"
RefId="Delivery_Year02 Delivery_Year03" Systems="t2"/>
<Value Id= "dvm" LongName= "Delivery Month"
RefId="Delivery_Month02 Delivery_Month03" Systems="t2"/>
<Value Id="mkt" LongName="Dealing Market"
RefId="Market02 Market03"/>
<Value Id= "op" LongName="Opening Price of a Day"
RefId="S_price02 S_price03" Systems="il"/>
<Value Id= "hp" LongName="Highest Price in a Day"
RefId="H_price02 H_price03" Systems="il"/>
<Value Id="lp" LongName=Lowest Price i n a Day"
RefId="L_price02 L_price03" Systems="il"/>
<Value Id= "cp " LongName="Closing Price of a Day"
RefId= "E_price02 E_price03 " Systems=il"/>
<Value Id= "sp " LongName="Settlement Price of a Day"
RefId= "B_price02 B_pr ice03" Systmes="il"/>
<Value Id= "amt " LongName="Amount of Dealings i n a Day"
RefId="Amount02 Amount03"/>
<Value Id= "oint" LongName= "Amount of Open Interest"
RefId="OpenInterest_Amount02 OpenInterest _Amount03"/>
</Relat ional>
<Time Id="tl">
<Year RefId="dly"/>
472 Ri tei Shibata

<Month RefId="dlm"/>
<Day RefId="dld"/>
</T ime>
<Time Id="t2 ">
<Year RefId="dvy"/>
<Month RefId="dvm"/>
</T ime>
<Interval Id=" i l">
<Min RefId="lp "/>
<Max RefId="hp"/>
<Other RefId="op"/>
<Other RefId="cp"/>
<Other RefId="sp"/>
</ Interval>
</Data>

This is a part of an example DandD instance, Futures2002-2003.dad,
describing the record of daily prices of various commodity futures from De-
cember 2002 to January 2003. The record is obtained from the site
//ftp.tokyoweb.or.jp/tocomftp/pub/
through FTP as a CSV (comma separated values) file for each month. In
the example, relational data is defined by the tag <Relational> and the
sub-elements <Value> define the columns of the relational data. The reason
why two data vectors are referred to in the attribute RefId of any Value is
that the records in the site are separated into two files, 2002-12.csv for
December and 2003-01.csv for January. Moreover, the site changed the
record format after 1 January 2003 and the records before that day are
stored in a directory past and newer records are stored in a directory now.
Therefore, as in Example 1, we need to adjust the old format to the newer
one. The following example illustrates a few of the definitions of such data
vectors. Here we have omitted some attributes which are not essential for
understanding the key points.

Example 3

<DataVector Id="Delivery_Year02" Access="accl " Protocol="prtl "


PostProcessing="Delivery_Year02scan aml"/>
<DataVector Id="Delivery_Year03" Access="accl" Protocol= "prt2 "
PostProcessing="Delivery_Year03scan"/>

<Access Id= "accl" IPAddress="ftp.tokyoweb.or .jp"


UserId="anonymous"/>
<Protocol Id= "prtl " Encoding= "Shift_JIS" Phys i cal="TCP">
InterDatabase and DandD 473

<FTP Id="ftp1" SUffix="/tocomftp/pub/past/2002-12.csv"/>


</Protocol>
<Protocol Id="prt2" Encoding="Shift _JIS" Physical="TCP I>
<FTP Id=lftp2" SUffix="/tocomftp/pub/now/2003-01 .csv"/>
</Protocol>
<Arithmetic Id=l am1">
x+2000
</Arithmetic>
<ScanFormat Id=IDealing_Year02scan">
%4d%*2d%*2d,%*s,%*d,%*d,%*d,%*d,%*d,%*d,%*d,%*d,%*d
</ScanFormat>
<ScanFormat Id= IDealing_Year03scan">
%4d%*2d%*2d,%*s,%*d,%*f,%*f,%*f,%*f,%*f,%*d,%*d,%*d
</ScanFormat>

Two data vectors are defined in this example, sharing the same attribute
Access. The attribute PostProcessing of the first vector says that the
ScanFormat with I.D. Dealing_Year02scan and the Arithmetic with I.D. am1
should successively be applied to each of the lines returned by an execution
of the FTP protocol. We need such two-step processing because only the last
two digits of a year are recorded in the file 2002-12.csv, while the full four digits
are recorded in the file 2003-01.csv. To adjust the format to the newer one,
the last two digits are extracted from each line of the CSV file 2002-12.csv
and converted to the four-digit year representation. The attribute Encoding
of the Protocol indicates that the character code of the lines obtained by FTP
is the Shift JIS code. The reader may guess other differences between the formats
of those two files. Many other different formats for the files are possible.
Consider Example 2 again. The attribute MainKey of the Relational tells
us the main key of the relational data. This is the same idea as in RDBMS.
Each record is identified by the combination of the indicated Values. This
attribute, together with the ForeignKey attribute, enables us to make links
between several relational data. The other attributes Control and Nominal indi-
cate which Values are factors. Possible other factor attributes are Variable,
Block, Latent and Auxiliary. The concept of factor type is useful not only
for applying a model like ANOVA, but also for the visualisation of data. In
the example above, the attribute Control suggests that the specified vari-
ables constitute an x-axis of a plot and the attribute Nominal suggests that
separate visualisations should be organised according to the values of the
variables. This is an example showing how the semantics of variables can be
described in a formal way. Such a formal description plays an important role
in an automatic visualisation or a semi-automatic data analysis.
It is crucial to describe several relations among variables by the Systems
attribute of Value. Note that the relation here is not the same as that in the
relational model, which is a relation of the given records as a set of points in
a value space. In Example 2, two Time relations and an Interval relation are
defined. The Time indicates that the specified Values constitute a calendar
system. The Interval indicates that the four Values are closely related,
constituting an interval given by Min and Max with several aggregated values:
the opening price, the closing price and the settlement price, given by Other.
Note that the futures price moves from time to time within a day.

5 InterDatabase implemented by DandD


We have briefly explained some important aspects of a DandD instance, in
conjunction with InterDatabase. One of the advantages of our implementa-
tion is that any number of DandD instances can be created for the same set of
data. The data view is not unique and changes from stage to stage and from
person to person. For example, it would be natural to organise data reflect-
ing the way it has been collected, describing the attributes and background
information as precisely as possible for possible future needs. However, we
may find a better data organisation and a necessary and sufficient description
after browsing the data or even after modelling. Also, the data view heavily
depends on the aims of data utilisation. For example, the choice of response
variable depends on the aim. It would not be realistic to fix such a view, and
it is rather natural to allow different views for a set of data. In DandD, different
views can be easily created. It is usually enough to modify an existing DandD
instance. No modification of data sources is necessary. As a consequence,
many DandD instances will be produced in the course of data acquisition and
modelling. Thus a mechanism is necessary to organise different data views
or DandD instances.

5.1 Network of DandD instances


A mechanism Relatives is available in a DandD instance. This makes it
possible to construct a network of DandD instances, to trace changing data
views. The element Relatives has sub-elements Parent, Child and Sibling. Each
element indicates which DandD instances are the parents, children or siblings
through its URL. Then other related views can easily be searched on the
Internet and accumulated as experiences, without maintaining any library.
We hope that such an accumulation leads to more productive work with data.
Also, the other mechanisms Model and Summary should help the user obtain
a better understanding of the data.

5.2 Data updates


Data is updated day by day or from time to time. As a consequence it is
important to have a mechanism to pursue the changes in a database. Further
work is necessary to implement such a mechanism, although our InterDatabase
is robust enough to accommodate such changes.

References
[1] DandD Project (2004). DandD Home Page. http://www.stat.math.keio.ac.jp/DandD/.
[2] Date C. (2003). An introduction to database systems. 8th Edition, Addison-Wesley, Boston.
[3] Laurenson M., Otuka A., Ninomiya S. (2002). Developing agricultural models using MetBroker mediation software. Journal of Agricultural Meteorology 58 (1), 1-9.
[4] Rew R., Davis G. (1990). NetCDF: An interface for scientific data access. IEEE Computer Graphics and Applications 10 (4), 91-99.
[5] Sun Microsystems (2004). Java Technology. http://java.sun.com/.
[6] Sun Microsystems (2004). JDBC Technology. http://java.sun.com/.
[7] The Norwegian Social Science Data Services (1999). Providing global access to distributed data through metadata standardisation - the parallel stories of NESSTAR and the DDI. Working Paper 10, UN/ECE Work Session on Statistical Metadata, Geneva.
[8] Yokouchi D. (2004). DandD client server system. COMPSTAT 2004, Physica-Verlag, Prague.
[9] Yokouchi D., Shibata R. (2001). InterDatabase - DandD instance as an agent on the Internet (in Japanese). Proceedings of the Institute of Statistical Mathematics 49 (2), 317-331.
[10] Shibata R., Sibuya M. (1987). Formal description of data type for statistical analysis. Proceedings of the first IASC world conference, 203-212.

Address: R. Shibata, Keio University, Yokohama, Japan
E-mail: shibata@stat.math.keio.ac.jp

EXPLORATORY VISUAL ANALYSIS OF GRAPHS IN GGOBI
Deborah F. Swayne and Andreas Buja
Key words: GGobi, graphical software, system R, covariate displays.
COMPSTAT 2004 section: Data visualisation.

Abstract: Graphs have long been of interest in telecommunications and social
network analysis, and they are now receiving increasing attention from
statisticians working in other areas, particularly in biostatistics. Most of the
visualization software available for working with graphs has come from outside
statistics and has not included the kind of interaction that statisticians have
come to expect. At the same time, most of the exploratory visualization
software available to statisticians has made no provision for the special
structure of graphs.
Graphics software for the exploratory visual analysis of graph data should
include the following: graph layout methods; a variety of displays and methods
for exploring variables on both nodes and edges, including methods that allow
these covariate displays to be linked to the network view; and methods for
thinning or otherwise trimming a large graph. In addition, the power of the
visualization software is greater if it can be smoothly linked to an extensible
and interactive statistics environment.
In this paper, we will describe how these goals have been addressed in
GGobi through its data format, architecture, graphical user interface design,
and its relationship to the R software [7].

1 Introduction
A graph consists of nodes and edges; the edges connect pairs of nodes. In
social network analysis, the nodes frequently represent people or institutions;
the edges represent interactions such as conversations or trading relationships.
The graphs encountered in telecommunications are similar: the nodes
typically represent telephone numbers or IP (Internet Protocol) addresses;
the edges capture telephone calls or exchanges of packets.
For a data analyst studying graph data, the description of the graph
is often only part of the story, because the nodes and the edges may each
correspond to multivariate data. For example, if the graph captures a set of
telephone numbers and telephone calls, we may have demographic data or
usage data about the bill-payer for each telephone number, and we may also
know the time and duration of phone calls. We therefore observe variables
on nodes and on edges.
How do exploratory data analysts approach such data? First, we need to
visualize the graph, that is, to lay it out by using node positions that have
been calculated to help us interpret the graph structure. This is not a well-
defined objective, but often the distance between nodes in the layout should
reflect their distance from one another according to some distance metric on
the graph. Another guideline is that minimizing edge crossings usually makes
a graph more readable by cutting down on clutter. Still, there is no "best"
layout method, or even a best layout for a particular graph: for example,
one layout may clarify a graph's overall structure while deemphasizing local
structure, while in another layout, a local region of interest may be clearly
drawn but the overall structure looks like spaghetti. Graph layout in an
interactive context, then, should offer several layout algorithms and a lot of
interaction methods for tuning and exploration.
The layout algorithms should be fast enough to be used in real time. For
example, we might draw only straight-line edges, and we might not sacrifice
any time to choose the perfect position for node labels. The suite of layout
algorithms should include methods for laying out graphs in 3D (or higher-D) ,
which we can rotate to shift our viewpoint and focus on local structure.
Other important interaction methods include the following:
• We should be able to tune the layout by moving nodes interactively.
• We should be able to pan and zoom the display of the graph.
• We should have a variety of ways to thin or subset the graph by elim-
inating or collapsing nodes and edges. At times, we may not want to
eliminate nodes, but to find ways to highlight nodes and edges of in-
terest while "downlighting" the rest. In that way, we retain context as
we focus on a subset of interest.
So far, we have considered only the structure of the graph, ignoring the
multivariate data associated with the nodes and edges. Once the layout is
displayed, one wants to explore the data together with the graph, to investi-
gate the relationships between the variables and the shape of the graph. The
use of linked views, by now a standard feature of interactive data visualiza-
tion software, is well suited to this goal. The graph view can be linked to
displays of multivariate data on both nodes and edges.
These additional views can be used to highlight, label or paint nodes and
edges in the graph view according to variable values, so that we can explore
the distribution of data values in the graph (see Fig . 2). Equally, we can
highlight data in the covariate views. For example, we might want to thin
the graph according to covariate values. In the case of telephone calls, we
could erase the edges corresponding to the shortest calls, and then erase all
the nodes that no longer have edges.
Finally, this software will be more powerful and more extensible if it
can be programmed using some scripting language, and if it is connected to
a software system for data analysis that includes a library of standard graph
algorithms.
Graph drawing is an active research area in computer science with a long
history [2]. The layouts produced are highly tuned and often beautiful. Since
they are not produced within the context of data analysis, the graphics are
typically not interactive, and the programmers have not adopted the linked
views approach. Some tools (e.g. Pajek [1]) offer a library of graph algorithms
in addition to layout, and some can even be extended with plugins (e.g.
Tulip, www.tulip-software.org). Still, the designers clearly do not have
exploratory data analysis (EDA) in mind.
Within the field of statistics, graph visualization has not gotten very
much attention. A notable exception is the work of [12], which has never
been released to the public. Even the social network analysis community,
which combines an interest in graph drawing with an interest in multivariate
data analysis, has not to our knowledge produced tools which combine both
sets of visualization capabilities. We therefore feel that there exists a gap in
current software offerings for the exploration of graph data. GGobi [10] is
our attempt to fill this gap.
This paper is structured as follows. Section 2 introduces GGobi, the
software which will be discussed in the rest of the paper. Section 3 describes
GGobi's methods for graph layout. Section 4 describes some of GGobi's
methods for manipulating displays, especially graph views. Section 5 explains
how GGobi can be embedded in other software, and what this design offers
for graph data analysis. Section 6 describes the data format that is used to
specify relationships between nodes and edges, graph elements and variables.
We use a real telecommunications dataset for illustration throughout the
paper. The meaning of its variables has been masked to protect the privacy
of the customers.

2 GGobi
GGobi is general-purpose multivariate data visualization software, designed
to support EDA. GGobi displays include scatterplots, scatterplot matrices,
barcharts, time series plots, and parallel coordinate plots. All displays can
be linked for color and glyph brushing as well as for point and edge label-
ing . GGobi is known for its powerful projection facilities for high-dimensional
rotations. Among GGobi's many other manipulations are panning and zoom-
ing, subsampling, and interactive moving of points and groups of points in
data space.
GGobi can be easily extended, either by being embedded in other soft-
ware or by the addition of plugins; either way, it can be controlled using an
Application Programming Interface (API). An illustration of its extensibility
is that it can be embedded in R.
GGobi is a direct descendent of a data visualization system called
XGobi [9] that has been in use since the early 1990's. XGobi supported
the specification and display of graphs, but it did not include any graph
layout methods. Graph data was an afterthought with XGobi, while it was
a consideration in the GGobi design process from the beginning.
GGobi supports a plain ASCII format involving multiple input files (as in
XGobi) for the simplest data specifications, but an XML (Extensible Markup
Language) file format has to be used for anything richer, and graphs are an
example. The format is briefly described in Section 6.

Figure 1: These two displays show layouts of the snetwork.xml data generated
by the GraphViz layout methods. On the left is a 2-D "neato" layout; on the
right, a "dot" layout.

3 Graph layout
We have used GGobi's plugin mechanism to add graph layout. Because this
is specialized software, it is convenient that this functionality can be optional.
There are two plugins available for GGobi that can be used for laying out
graphs.

3.1 The graph layout plugin

The simplest plugin is called GraphLayout. It includes three layout methods,
two of which rely on the library included with GraphViz [6], a freely available
collection of tools for manipulating graph structures and generating graph
layouts. All three methods work by generating a new dataset on the fly and
making it available through the GGobi interface, so scatterplots of the new
position variables can be displayed, and edges added to them.
The three layout methods are:
Radial: The radial layout [12] places a designated node at the center,
and arranges the rest of the nodes in concentric circles around it. The
resulting layout is a tree arranged radially, with any extra edges added. If the
underlying graph is not very tree-like, the layout can result in a great many
edge crossings, and the layout doesn't do anything to minimize these
crossings. In addition to the two position variables, the method generates a few
other variables, such as the number of steps between node j and the center.
Dot: "Dot" produces hierarchical layouts of directed graphs in 2D; the
other layout methods ignore edge direction. It first finds an optimal rank
assignment for each node, then sets the vertex order within ranks, and finally
finds optimal coordinates for the nodes.
Neato: The "neato" layout algorithm produces "spring" model layouts
of undirected graphs. In spring models, the graph is modelled as a set of
objects connected by springs, assuming both attractive and repulsive forces,
and an iterative solver is used to find a low-energy configuration. Only the
positions at the final configuration are returned by the algorithm. Neato is
the most general-purpose method of the three. Further, neato can generate
layouts in spaces from 2D to 10D, and edge weights can be used to further
tune the layout.
The first layout method is illustrated in Fig. 2; the latter two are
illustrated in Fig. 1.
There is a manual for the plugin which describes its use in more detail.
The dot and neato layout methods are described in the GraphViz documentation,
which can be found on www.research.att.com/sw/tools/graphviz/refs.html.
The GraphViz software can be obtained from www.graphviz.org.
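To make the spring-model idea concrete, here is a small self-contained R
sketch of a naive force-directed layout; it is not the neato algorithm itself
(neato optimises a stress function with a more refined solver), and all names
and constants are illustrative choices.

spring_layout <- function(edges, n, iters = 500, step = 0.05) {
  # edges: two-column matrix of node indices, n: number of nodes
  pos <- matrix(runif(2 * n), n, 2)                  # random starting positions
  for (it in seq_len(iters)) {
    disp <- matrix(0, n, 2)
    for (i in seq_len(n)) {                          # repulsion between all pairs of nodes
      d   <- sweep(pos, 2, pos[i, ])                 # vectors from node i to every node
      len <- pmax(sqrt(rowSums(d^2)), 1e-6)
      disp[i, ] <- -colSums(d / len^2)               # push node i away from the others
    }
    for (k in seq_len(nrow(edges))) {                # attraction along the edges (the "springs")
      i <- edges[k, 1]; j <- edges[k, 2]
      d <- pos[j, ] - pos[i, ]
      disp[i, ] <- disp[i, ] + d
      disp[j, ] <- disp[j, ] - d
    }
    pos <- pos + step * disp / pmax(sqrt(rowSums(disp^2)), 1e-6)  # move each node a small step
  }
  pos
}

For instance, spring_layout(cbind(1:5, c(2, 3, 4, 5, 1)), n = 5) lays out a
5-cycle; in GGobi the resulting coordinates would simply appear as two new
position variables to plot.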

3.2 The ggvis plugin: multidimensional scaling

The "ggvis" plugin is a reimplementation of XGvis [4], a multidimensional
scaling (MDS) tool which is part of the XGobi software. MDS is a method for
visualizing data where objects are characterized by dissimilarity values for all
pairs of objects. It interprets these dissimilarities as distances and constructs
maps in R^k. It was originally developed as a data analysis method in the
social sciences, but it is also used to lay out graphs.
Like neato, ggvis computes layouts through iterative optimization, but
unlike neato, the display is redrawn at each iteration, so we can watch the
layout take shape. We can also intervene during the optimization process,
by moving points interactively when they are trapped in local minima, or by
adjusting parameters of the MDS objective function.
Ggvis puts a large number of parameters under interactive user control.
As a consequence, ggvis layouts are highly tunable. One of the most useful
ggvis parameters is the exponent of a power transformation of the target
distances; lowering it below one lets the short distances dominate, while
exponents greater than one expose the long distances. This lets us decide
whether we want to spread the leaves out, highlighting the structure in the
leaves, or to collapse them, revealing the connectivity in the interior of the
graph.

In addition to parameters, we can make use of color and glyph groupings
of the nodes. We may subselect one group at a time for layout, or we may
lay out the groups simultaneously but as unconnected graphs. Or we may
lay out a subgroup and use it as an anchor set for laying out the remaining
nodes.
There is also a diagnostic plot that permits us to judge how closely the
pairwise distances in the layout match the target distances.
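As a rough stand-in for the stress-based optimisation that ggvis performs
interactively, the following R sketch lays out a connected graph by classical
MDS (cmdscale) on power-transformed shortest-path distances; the exponent
lambda plays the role of the ggvis power-transformation parameter, and the
function name is an illustrative assumption.

mds_layout <- function(edges, n, lambda = 1) {
  # shortest-path (hop count) distances via Floyd-Warshall; assumes a connected graph
  D <- matrix(Inf, n, n); diag(D) <- 0
  D[edges] <- 1; D[edges[, 2:1]] <- 1
  for (k in seq_len(n))
    D <- pmin(D, outer(D[, k], D[k, ], `+`))
  cmdscale(D^lambda, k = 2)            # classical MDS of the transformed target distances
}

Lowering lambda below one compresses the long target distances, which, as
described above, lets the short distances dominate the layout.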

3.3 Multiple edge sets

Sometimes one wants to compare different edge sets for the same set of nodes.
In the case of telephone calls, for instance, the extended community
associated with a phone number changes from week to week, with changes both in
the set of phone numbers in the community, and in the total length of the
conversations between any pair of nodes.
One strategy to compare these different edge sets is to start by
determining a layout based on the union of all nodes and the union of all edges.
Since any of the edge sets can be associated with the set of nodes used to
determine the layout, it's easy to compare them: open multiple scatterplots
of the nodes in the graph view, and assign a different edge set to each one.
That technique could even be the basis for an animation of edges and edge
variables over time.

4 Graph exploration
Once the layout has been produced and the graph is displayed, a great deal
of exploration is possible without using any further plugins. Most of this
functionality depends on using linked views. As one would expect, nodes in
the graph view are linked to points in scatterplots of node variables, or to
bars in a barchart; this is a familiar style of linking. It is perhaps less obvious
that an edge in the graph view and a point in a scatterplot of edge variables
are also linked: these are just different ways of rendering the same record.
Here are some of the manipulations available in GGobi:
Move Points: In this mode, any point can be moved to manually tune
the layout. To move a group of points, one brushes them with a common
glyph and color; by moving any member of the group, one moves the whole
group. Under certain circumstances, point motion can be linked across plots
of layouts, namely, when the nodes are shared across graphs that differ only
in edge sets and share a single layout in separate windows.
Edit Edges: To edit the graph interactively, add nodes (by clicking
the mouse where you want the new node to appear) and edges (by pressing
down the mouse button at the source node and dragging the edge to the
destination). To view or modify the default properties (such as record label
or variable values), use the left button; to simply have the new record
added quickly, use the right or middle button. To delete nodes or edges, use
"shadow" brushing as described below.


Figure 2: An illustration of linked brushing with graphs. The nodes in the
graph are linked to the data in the scatterplot at the lower left ; the edges to
the data in the scatterplot at the lower right.

Identify: When the identification mode is active, bringing the cursor
near a point causes a label to be displayed, both in the current display and
in other displays. By default, this is the case label supplied in the data file
(or the row number), but it can also be a list of variable name - value pairs
or an id. If edge identification is selected, the nearest edge will be labelled
instead of the nearest point.
Brushing (interactively): Linked brushing is probably the most famil-
iar use of linked views. In the case of graphs, it is probably clear by now that
it can be used in at least two ways. First, a plot of node data is linked to
a graph view such that brushing points in one plot causes the same points
to change color or glyph in the other. Second, a plot of edge data is linked
to the graph view such that brushing points in the edge data plot affects
the edges in the graph view, and vice versa. This latter functionality is an
innovative feature of ggobi.
One brushing style allows a point or an edge to be "shadow" brushed, so
that it's drawn in a faint color and can later be removed from the displays
altogether.

Fig. 2 shows linking between a radial layout of the snetwork.xml data and
two scatterplots. Two rectangular arrays of data are involved, one for the
nodes and the other for the edges. The window at the lower right contains
a 1-D plot (an ASH, or Average Shifted Histogram [8]) of a transformation of
one of the edge variables, interactions. The highest values have been brushed
with large green rectangles (rendered in dark gray in the gray-scale printed
version of this paper), and the corresponding edges in the radial layout view
are wide and green. All the green edges are connected to a single node, which
tells us that a single individual participates in all of the longest interactions
in the data. The window at the lower left contains a jittered scatterplot of
hours vs citizenship, the two variables recorded for each person. The points
representing the people with the highest values of the citizenship variable
(visa holders) have been brushed with large orange circles (rendered as large
medium-gray circles in gray scale), and the corresponding points are brushed
in the graph view. A couple of subgraphs contain no visa holders at all, and
a couple of other subgraphs are dominated by visa holders, but we also see
a great deal of interaction between visa holders and other people in the data.
(Recall that the data is actually about telephone calls, but that its meaning
has been thoroughly obscured to protect customer privacy.)
The line characteristics (color, type and thickness) are implied when
the point characteristics (color, type and size) are specified in the Choose
color & glyph panel.
One of the options available in the brushing mode is shadow brushing [3];
that is, to select points or edges to be drawn in a "shadow" color, close to the
color of the background. This is especially appealing for graph visualization
because clutter is often severe, yet we often don't want to lose sight of the
graph structure when viewing a subset of the data. (Sometimes, of course,
we don't want to draw those points at all , even as shadows, and then we
exclude them using the Color & glyph groups tool.)
Coloring by variables: Since interactive brushing of continuous vari-
ables can be tedious, an automatic scheme is available as part of the Color
schemes tool. In the snetwork.xml data, one of the edge variables (interac-
tions) is continuous, so we can choose a sequential color scale and apply it
to the "Cont acts" edge set using the interactions. (Since the distribution of
that variable is highly skewed, we might also apply a transformation first.)
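As a small R analogue of this idea (GGobi handles it interactively through
the Color schemes tool), a skewed edge variable can be transformed and then
binned into a sequential palette; the vector interactions below is a simulated
stand-in for the real edge variable.

interactions <- rexp(500, rate = 0.1)          # stand-in for the skewed edge variable
pal  <- hcl.colors(9)                          # sequential colour scale with 9 levels
bins <- cut(log1p(interactions), breaks = 9)   # transform first, then bin
edge_col <- pal[as.integer(bins)]              # one colour per edge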
Panning and Zooming: It is essential to be able to zoom in on interest-
ing regions of the graph view, and that functionality is available in GGobi's
scale mode. (GGobi displays are not linked for scaling.)
All these methods are described in more detail in the GGobi manual,
available on www.ggobi.org.

4.1 The graph manipulation plugin

All of the interactive methods just listed are useful for multivariate data, not
just for graphs. In addition to those methods, we have added a plugin for
methods of exploration that are peculiar to graphs. It has two functions as
of this writing, both of them designed for focussing on contiguous subsets of
the graph.
The first function responds to a button click by shadow-brushing leaf
nodes and the edges connected to them, recursively, until no leaf nodes are
highlighted. It can be a useful way to quickly hide a lot of clutter in a messy
graph, and get a look at the center.
The second is a method for focussing on a node and its nearest neighbors.
It is used in conjunction with the Identification mode in GGobi. Move the
cursor near a point of interest, and then click a mouse button. All points will
be shadow-brushed with the exception of the nearest point and its neighbors
within one or two steps. In this way, one can walk around the graph, focussing
on one small neighborhood at a time.
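The recursive leaf-pruning behaviour of the first function can be summarised
in a few lines of R; this sketch only computes which nodes would remain
visible (GGobi instead shadow-brushes the removed ones interactively), and
the function name is illustrative.

prune_leaves <- function(edges, n) {
  # edges: two-column matrix of node indices, n: number of nodes
  keep <- rep(TRUE, n)
  repeat {
    live   <- edges[keep[edges[, 1]] & keep[edges[, 2]], , drop = FALSE]
    degree <- tabulate(c(live), nbins = n)     # degrees among the still-visible nodes
    leaves <- keep & degree <= 1
    if (!any(leaves)) break
    keep[leaves] <- FALSE                      # "shadow-brush" the current leaves
  }
  which(keep)                                  # indices of the remaining core nodes
}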

5 Graphs in GGobi's API

While GGobi is a stand-alone application, it has been designed and
constructed as a programming library and can be embedded within other
applications. It has a large, and still evolving, Application Programming Interface
(API) which developers can use to integrate the GGobi functionality with
other code. For data analysts, GGobi becomes much more powerful once it
is embedded in a statistics environment with an extension language.
Our most developed example is the Rggobi package, which allows GGobi
to be embedded in the R process. Users can then launch GGobi (using R
data frames or data files outside R), and then read and set data values and
case attributes (such as color and glyph), and even add event handlers which
cause R to respond to GGobi events. Edge sets can also be added, and the
attributes of edges (color, line type and line thickness) are handled exactly
like point attributes.
In this first simple example, we create a matrix to represent the nodes,
and open it in ggobi. We next create an empty dataset, dimensioned to hold
six records. Finally we create a 6 by 2 array to define the edges as 6 rows
of source - destination pairs, named in terms of the node labels, and add the
edge set to the running ggobi.
x <- matrix(c(0,0,2,1, 0,2,0,1, 0,0,0,1), 4, 3,
  dimnames = list(c("a", "b", "c", "d"), c("X", "Y", "Z")))
gg <- ggobi(x)

d2 <- gg$createEdgeData(6, name="edges")

e2 <- rbind(c("a","b"), c("b","c"), c("a","c"),
  c("a","d"), c("b","d"), c("c","d"))
gg$setEdges(e2, edgeset = d2)
In the second example, we deal with a more complex case, in which there
are variables corresponding to the edges as well as to the nodes. We start
again, using the matrix x just described. Next we add a second dataset,
3 by 2, composed of the data corresponding to the edges. Finally we add
three edges to the second dataset.

gg <- ggobi(x)

z <- matrix(c(1,2,1, 1,2,2), 3, 2,
  dimnames = list(letters[10:12], c("X", "Y")))
d2 <- gg$setData(z, name="z")

e1 <- rbind(c("a","b"), c("b","c"), c("a","d"))
gg$setEdges(e1, edgeset = gg[["z"]])

We plan to extend the API and the Rggobi package so that they can
work with other graph packages currently under development as part of the
Bioconductor project (www.bioconductor.org).

6 Data format: Specifying graphs in XML

GGobi relies on XML (the Extensible Markup Language) for everything
beyond the simplest of input data. The use of XML has allowed us to design
a system of mark-ups or tags that describe one or more datasets in great
detail within a single file, even specifying the relationships between records
in different datasets.
We based GGobi's XML format on a pre-existing XML format designed
for the Omegahat project (www.omegahat.org) and the S language (R and
S-Plus). Some of the information that can be specified in the GGobi XML file
includes variable types and axis ranges, the symbol and color corresponding
to a record, and multiple datasets and the rules for linking them.
GGobi's XML format is described elsewhere [11], so we will only explain
here how the specification of data records is used to describe graphs. A data
record specification may be as simple as this:
<record> 1.0 2.5 </record>
<record> 1.7 2.2 </record>
This is a pair of records for a dataset with two variables. If we want to identify
these records as nodes, we must also give them unique ids. (Ids can also be
used for linking and identification, but that usage is described elsewhere.)
<record id="Macbeth"> 1.0 2.5 </record>
<record id="Banquo"> 1.7 2.2 </record>
If we want a set of edges to be drawn on a scatterplot or a graph view
of these nodes, we need a second dataset. If there is to be an edge from
"Macbeth" to "Banquo," the second dataset must contain a record like this:
<record source="Macbeth" destination="Banquo"> </record>

If there are variables corresponding to that edge, they are specified within
the record, just as they are for nodes.

<record source="Macbeth" destination="Banquo">
27 42 4.6
</record>

As we implied in Section 3.3, it's possible to specify more than one edge
set corresponding to the same node set within the same XML file, and that
offers a way to compare related edge sets.
There are graph specification languages in XML under development, and
we expect it will be easy to translate between those formats and GGobi's,
though those other languages probably won't fully support multivariate data.
For the interested reader, the GGobi distribution includes several graph
datasets in XML. Some include position variables so that additional layout
isn't required: buckyball.xml and cube6.xml describe geometric objects, with
no additional variables. Another, snetwork.xml, is fully multivariate and does
not include variables that can be used for displaying the graph; that is the
dataset that served as an example throughout this paper.

7 Conclusions
As more statisticians become interested in graph data analysis, they approach
this area with the expectations and expertise acquired in working with general
multivariate data. They expect first of all to be able to work in environments
like R, with a set of algorithms, a variety of static display methods, and
a scripting language. This set of goals is being pursued in the Bioconductor
project and elsewhere.
Second, statisticians and other data analysts who have come to rely on
direct manipulation graphical methods will want to use them with this form
of data as well: to quickly update plots, changing variables and projection,
to pan and zoom displays, and to use linked views to explore the graph
and the distribution of multivariate data in the graph. GGobi's data format
supports describing the graph and the data together , and its architecture
allows the addition of plugins, so it's natural to extend GGobi, applying all
its functionality to graph data.
Finally, we want to integrate the direct manipulation graphics, algorithms
and scripting language so that we can use them all together. This expectation
is not yet as automatic as the first two: People often still imagine building
a single monolithic application that can do everything. As the example of
graph data shows, however, there are many specialized problems that are
often overlooked, so no monolithic piece of software can satisfy the needs of
all users. If instead it's possible to integrate complementary software tools,
and to extend them with plugins and packages, then even the most unusual
cases can be handled without too much trouble.
The GGobi software and documentation, including several plugins and
the Rggobi package, are available on the web site www.ggobi.org.

References
[1] Batagelj V., Mrvar A. (1998). Pajek - program for large network analysis.
Connections 21, 47-57.
[2] Battista G.D., Eades P., Tamassia R., Tollis I. (1994). Annotated bibliog-
raphy on graph drawing algorithms. Computational Geometry: Theory
and Applications 4, 235 - 282.
[3] Becker R.A., Cleveland W.S. (1987). Brushing scatterplots. Technomet-
rics 29, 127 -142.
[4] Buja A., Swayne D.F. (2002) . Visualization methodology for multidimen-
sional scaling. Journal of Classification 18, 7 - 43.
[5] Chen C.-H., Chen J.-A. (2000). Interactive diagnostic plots for multidi-
mensional scaling with applications in psychosis disorder data analysis.
Statistica Sinica 10, 665-691.
[6] Gansner E.R. , North S.C. (2000). An open graph visualization system
and its applications to software engineering. Software - Practice and
Experience 30 (11), 1203-1233.
[7] Ihaka R., Gentleman R. (1996) . R: A language for data analysis and
graphics. Journal of Computational and Graphical Statistics 5, 299-
314.
[8] Scott D.W . (1985). Average shifted histograms: effective non-parametric
density estimation in several dimensions. Annals of Statistics 13, 1024-
1040.
[9] Swayne D.F., Cook D., Buja A. (1998). XGobi: Interactive dynamic data
visualization in the X Window System. Journal of Computational and
Graphical St atistics 7 (1), 113-130.
[10] Swayne D.F ., Temple Lang D., Buja A., Cook D. (2003). GGobi: evolv-
ing from XGobi into an extensible framework for interactive data visu-
alization. Computational Statistics & Data Analysis 43 , 423 -444.
[11] Temple Lang D., Swayne D. F. (2001). The ggobi XML input format.
www.ggobi.org.
[12] Wills G. (1999). NicheWorks - interactive visualization of very large
graphs. Journal of Computational and Graphical Statistics 8 (2), 190-
212.
Acknowledgement: We thank the reviewer who pointed out to us that the
ggvis plugin would be a good environment for implementing the Interactive
Diagnostic plots for MDS as described in [5].
Address: D.F. Swayne, AT&T Labs - Research;
A. Buja, The Wharton School, University of Pennsylvania;
Duncan Temple Lang, University of California, Davis
E-mail : dfs@research.att.com

PLS REGRESSION AND PLS PATH MODELING FOR MULTIPLE TABLE ANALYSIS
Michel Tenenhaus
Key words: Multiple factor analysis, PLS regression, PLS path modeling, generalized canonical correlation analysis.
COMPSTAT 2004 section: Partial least squares.

Abstract: A situation where J blocks of variables are observed on the same
set of individuals is considered in this paper. A factor analysis logic is applied
to tables instead of individuals. The latent variables of each block should
explain their own block well and, at the same time, the latent variables of
the same rank should be as positively correlated as possible. In the first part
of the paper we describe the hierarchical PLS path model and recall that it
allows us to recover the usual multiple table analysis methods. In the second
part we suppose that the number of latent variables can be different from one
block to another and that these latent variables are orthogonal. PLS regression
and PLS path modeling are used for this situation. The approach is illustrated
by an example from sensory analysis.

1 Introduction
We consider in this paper a situation where J blocks of variables $X_1, \dots, X_J$
are observed on the same set of individuals. The problem under study is
completely symmetrical, as all blocks of variables play the same role. All the
variables are supposed to be standardized. We can follow a factor analysis
logic on tables instead of variables. In the first section of this presentation
we suppose that each block $X_j$ is multidimensional and is summarized by m
latent variables plus a residual $E_j$. Each data table is decomposed into two
parts: $X_j = t_{j1}p_{j1}' + \cdots + t_{jm}p_{jm}' + E_j$. The first part of the decomposition is
$t_{j1}p_{j1}' + \cdots + t_{jm}p_{jm}'$. The latent variables $(t_{j1}, \dots, t_{jm})$ should explain
the data table $X_j$ well and, at the same time, the latent variables of the same rank
h, $(t_{1h}, \dots, t_{Jh})$, should be as positively correlated as possible. The second
part of the decomposition is the residual $E_j$, which represents the part of $X_j$
not related to the other blocks, i.e. the specific part of $X_j$.
We show that the PLS approach allows us to recover the usual methods for
multiple table analysis. In the second section we suppose that the number of latent
variables can be different from one block to another and that these latent
variables are orthogonal. PLS regression and PLS path modeling are used
for this situation. This approach is illustrated by an example from sensory
analysis in the last section.

2 Multiple Table Analysis: a classical approach


In Multiple Table Analysis it is usual to introduce a super-block $X_{J+1}$
merging all the blocks $X_j$. This super-block is summarized by m latent variables
$t_{J+1,1}, \dots, t_{J+1,m}$, also called auxiliary variables. The causal model describing
this situation is given in Figure 1. This model corresponds to the hierarchical
model proposed by Wold [16].
The latent variables $t_{j1}, \dots, t_{jm}$ should explain their own block $X_j$ well.
At the same time the latent variables of the same rank $(t_{1h}, \dots, t_{Jh})$ and the
auxiliary variable $t_{J+1,h}$ should be as positively correlated as possible. In
the usual Multiple Table Analysis (MTA) methods, such as Horst's [6] and
Carroll's [1] Generalized Canonical Correlation Analysis, orthogonality
constraints are imposed on the auxiliary variables $t_{J+1,h}$, and the latent variables
$t_{jh}$ related to block j have no orthogonality constraints. We define for the
super-block $X_{J+1}$ the sequence of blocks $E_{J+1,h}$ obtained by deflation: each
block $E_{J+1,h}$ is defined as the residual of the regression of $X_{J+1}$ on the latent
variables $t_{J+1,1}, \dots, t_{J+1,h}$. Figure 2 corresponds to step h. For computing
the latent variables $t_{jh}$ and the auxiliary variables $t_{J+1,h}$ we use the general
PLS algorithm [16], defined as follows for step h of this specific application:
External estimation:

- Each block $X_j$ is summarized by the latent variable $t_{jh} = X_j w_{jh}$.

- The super-block is summarized by the latent variable $t_{J+1,h} = E_{J+1,h-1} w_{J+1,h}$.

Internal estimation:

- Each block $X_j$ is also summarized by the latent variable $z_{jh} = e_{jh} t_{J+1,h}$,
where $e_{jh}$ is the sign of the correlation between $t_{jh}$ and $t_{J+1,h}$. We will
however choose $e_{jh} = +1$ and show that the correlation is then positive.
- The super-block $E_{J+1,h-1}$ is summarized by the latent variable $z_{J+1,h} =
\sum_{j=1}^{J} e_{J+1,j,h}\, t_{jh}$, where $e_{J+1,j,h} = +1$ when the centroid scheme is used,
or the correlation between $t_{jh}$ and $t_{J+1,h}$ for the factorial scheme, or
furthermore the regression coefficient of $t_{jh}$ in the regression of $t_{J+1,h}$
on $t_{1h}, \dots, t_{Jh}$ for the path weighting scheme.

We can now describe the PLS algorithm for the J-block case. The
weights $w_{jh}$ can be computed according to two modes: Mode A or B.
In Mode A simple regression is used:

$w_{jh} \propto X_j' t_{J+1,h}$, $j = 1$ to $J$, and $w_{J+1,h} \propto E_{J+1,h-1}' z_{J+1,h}$   (1)

where $\propto$ means that the left term is equal to the right term up to a
normalization.
For Mode B multiple regression is used:

$w_{jh} \propto (X_j' X_j)^{-1} X_j' t_{J+1,h}$, $j = 1$ to $J$, and $w_{J+1,h} \propto (E_{J+1,h-1}' E_{J+1,h-1})^{-1} E_{J+1,h-1}' z_{J+1,h}$   (2)

The normalization depends upon the method used. For some methods $w_{jh}$
is of norm 1; for other methods the variance of $t_{jh}$ is equal to 1.

Figure 1: Path model for the J-block case.

Figure 2: Path model for the J-block case: Step h.

It is now easy to check that the correlation between $t_{jh}$ and $t_{J+1,h}$ is
always positive: $t_{J+1,h}' t_{jh} = t_{J+1,h}' X_j w_{jh} \propto t_{J+1,h}' X_j X_j' t_{J+1,h} > 0$ when
Mode A is used. The same result is obtained when Mode B is used. This
justifies the replacement in both (1) and (2) of the internal estimation $z_{jh}$
by the external estimation $t_{J+1,h}$.
The PLS algorithm can now be described. We begin with an arbitrary choice
of the weights $w_{jh}$. We get the external estimations of the latent variables,
then the internal ones. Using equations (1) or (2) we get new weights. This
procedure is iterated until convergence, which is always verified in practice but
has been mathematically proven only for the two-block case.
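To fix ideas, the following R sketch implements one step h of this algorithm
for Mode A with the centroid scheme and all signs e set to +1, as above. It is
an illustration written for this presentation, not code from a PLS package:
weights are normalised to norm 1, blocks is a list of centred and standardised
matrices X_1, ..., X_J, and the (possibly deflated) super-block E_{J+1,h-1} is
supplied by the caller (for h = 1 it is simply the concatenation of the blocks).

pls_step <- function(blocks, E = do.call(cbind, blocks), tol = 1e-8, maxit = 200) {
  normalize <- function(v) v / sqrt(sum(v^2))
  w  <- lapply(blocks, function(X) normalize(rep(1, ncol(X))))   # arbitrary starting weights
  wS <- normalize(rep(1, ncol(E)))
  for (it in seq_len(maxit)) {
    tj <- mapply(function(X, v) drop(X %*% v), blocks, w, SIMPLIFY = FALSE)  # external estimation, blocks
    tS <- drop(E %*% wS)                                # external estimation, super-block
    zS <- Reduce(`+`, tj)                               # internal estimation, centroid scheme (all e = +1)
    w_new  <- lapply(blocks, function(X) normalize(drop(crossprod(X, tS))))  # Mode A update, cf. (1)
    wS_new <- normalize(drop(crossprod(E, zS)))
    delta  <- max(abs(unlist(w_new) - unlist(w)), abs(wS_new - wS))
    w <- w_new; wS <- wS_new
    if (delta < tol) break
  }
  tj <- mapply(function(X, v) drop(X %*% v), blocks, w, SIMPLIFY = FALSE)    # final latent variables
  list(block_scores = tj, super_score = drop(E %*% wS), block_weights = w, super_weights = wS)
}

Deflating the super-block (regressing it on the obtained super-block score and
keeping the residual) and calling the function again would give the components
of the next rank.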
The various options of PLS Path Modeling (Mode A or B for the external
estimation; centroid, factorial or path weighting schemes for the internal estimation)
allow us to recover many methods for Multiple Table Analysis: Generalized
Canonical Analysis (Horst's [6] and Carroll's [1]), Multiple Factor Analysis [4],
Lohmoller's split principal component analysis [9], and Horst's maximum variance
algorithm [7]. The links between PLS and these methods have been demonstrated
in [9] or [11] and studied on practical examples in [5] and [10]. These various
methods are obtained by using the PLS algorithm according to the options
described in Table 1. Only the super-block is deflated; the original blocks are
not deflated.

Scheme of calculation        Mode of calculation for the outer estimation
for the inner estimation     A                                    B
---------------------------------------------------------------------------------------
Centroid                     Horst's generalized canonical        Horst's generalized canonical
                             correlation analysis                 correlation analysis (SUMCOR criterion)
Factorial                    Carroll's generalized canonical      Carroll's generalized canonical
                             correlation analysis                 correlation analysis
Path weighting scheme        Lohmoller's split principal component analysis;
                             Horst's maximum variance algorithm;
                             Escofier & Pages' Multiple Factor Analysis
---------------------------------------------------------------------------------------
No deflation on the original blocks, deflation on the super-block.

Table 1: Multiple Table Analysis and PLS algorithm.

Discussion on the orthogonality constraints

There is some advantage in imposing orthogonality constraints only on
the latent variables related to the super-block: no dimension limitation due
to block sizes. If orthogonality constraints were imposed on the block latent
variables, then the maximum number m of latent variables would be the size of
the smallest block. The super-block $X_{J+1}$ is summarized by m orthogonal latent
variables $t_{J+1,1}, \dots, t_{J+1,m}$. Each block $X_j$ is summarized by m latent
variables $t_{j1}, \dots, t_{jm}$. But these latent variables can be highly correlated
and consequently do not reflect the real dimension of the block. In each
block $X_j$ the latent variables $t_{j1}, \dots, t_{jm}$ represent the part of the block
correlated with the other blocks. A principal component analysis of these
latent variables will give the actual dimension of this part of $X_j$.
It can be preferred to impose orthogonality on the latent variables of each
block. But then we have to remove the dimension limitation due to the smallest
block. This situation is discussed in the next section.

3 Multiple Table Analysis: new perspectives


We will describe in this section a new approach more focused on the blocks
than on the super-block. This approach is called PLS-MTA: a PLS approach
to Multiple Table Analysis.
We now suppose a variable number of common components in each block:

$X_j = t_{j1}p_{j1}' + \cdots + t_{jm_j}p_{jm_j}' + E_j$   (3)

A two-step procedure is proposed to find these components.

Step 1

For each block $X_j$ we define the super-block $X_{J+1,-j}$ obtained by merging
all the other blocks $X_i$ for $i \neq j$. For each j we carry out a PLS regression
of $X_{J+1,-j}$ on $X_j$. So we obtain $m_j$ orthogonal and standardized PLS
components $\bar t_{j1}, \dots, \bar t_{jm_j}$ which represent the part of $X_j$ related to the other
blocks. The choice of the number $m_j$ of components is determined by
cross-validation.
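Step 1 can be sketched in R with the pls package (an assumed tool for this
note, not one prescribed by the paper); blocks, j and the choice m_j = 2 are
illustrative placeholders, and cross-validated prediction error is inspected here
in place of the Q2 rule used in Section 4.

library(pls)
X_j     <- scale(blocks[[j]])                    # current block, standardized
Y_other <- scale(do.call(cbind, blocks[-j]))     # merged remaining blocks X_{J+1,-j}
fit <- plsr(Y_other ~ X_j, ncomp = 4, validation = "LOO")
plot(RMSEP(fit))                                 # inspect cross-validated error to choose m_j
m_j <- 2                                         # illustrative choice
T_j <- scale(scores(fit)[, seq_len(m_j)])        # orthogonal, standardized PLS components of X_j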

Step 2

One of the procedures described in Table 1 is used on the blocks
$T_j = \{\bar t_{j1}, \dots, \bar t_{jm_j}\}$ for h = 1. We obtain the rank one components
$t_{11}, \dots, t_{J1}$ and $t_{J+1,1}$. Then, to obtain the next components, we only
consider the blocks with $m_j > 1$. For these blocks we construct the residual
$T_{j1}$ of the regression of $T_j$ on $t_{j1}$. An MTA is then applied on these blocks
and we obtain the rank two components $t_{12}, \dots, t_{J2}$ (for j with $m_j > 1$)
and $t_{J+1,2}$. The components $t_{j1}$ and $t_{j2}$ are uncorrelated by construction,
but the auxiliary variables $t_{J+1,1}$ and $t_{J+1,2}$ can be slightly correlated, as we
did not impose an orthogonality constraint on these components. This search
for components is iterated until the various $m_j$ common components are found.
These components can finally be expressed in terms of the original variables.
There is a great advantage in imposing orthogonality constraints on each
block's components: the new $m_j$ orthogonal and standardized components
$t_{j1}, \dots, t_{jm_j}$ are deduced from the $m_j$ orthogonal and standardized PLS
components $\bar t_{j1}, \dots, \bar t_{jm_j}$ by a rotation. That means that

$[t_{j1}, \dots, t_{jm_j}] = [\bar t_{j1}, \dots, \bar t_{jm_j}]\, A_j$   (4)

where $A_j$ is an orthogonal (rotation) matrix.



4 Application
We are going to use PLS-MTA on wine data which have been collected by
C. Asselin and R. Morlat and are fully described in [3]. A set of 21 red wines
with Bourgueil, Chinon and Saumur origins is described by 27 variables
distributed in four blocks: $X_1$ = Smell at rest = [smell intensity at rest,
aromatic quality at rest, fruity note at rest, floral note at rest, spicy note at rest],
$X_2$ = View = [visual intensity, shading (from orange to purple), impression
of surface], $X_3$ = Smell after shaking = [smell intensity, smell quality, fruity
note, floral note, spicy note, vegetable note, phenolic note, aromatic intensity
in mouth, aromatic persistence in mouth, aromatic quality in mouth], $X_4$
= Tasting = [intensity of attack, acidity, astringency, alcohol, balance (acidity,
astringency, alcohol), mellowness, bitterness, ending intensity in mouth,
harmony]. Another variable describing the global quality of the wine will be
used as an illustrative variable.
We now describe the application of the PLS-MTA methodology on these
data.
Step 1
PLS regressions of $[X_2, X_3, X_4]$ on $X_1$, $[X_1, X_3, X_4]$ on $X_2$, $[X_1, X_2, X_4]$
on $X_3$, and $[X_1, X_2, X_3]$ on $X_4$ all lead to two PLS components when we
decide to keep a component if it is significant ($Q^2$ larger than 0.05). The
X- and Y-explanatory powers of these components are given in Table 2.

X block                Proportion of variance of X       Proportion of variance of the
                       explained by the two              other blocks explained by the
                       X-PLS components                  two X-PLS components
Smell at rest          .750                              .296
View                   .995                              .344
Smell after shaking    .715                              .449
Tasting                .822                              .438

Table 2: Proportion of X and Y variances explained by the first two X-PLS
components.

Then the "smell at rest" block $T_1 = \{\bar t_{11}, \bar t_{12}\}$, the "view" block
$T_2 = \{\bar t_{21}, \bar t_{22}\}$, the "smell after shaking" block $T_3 = \{\bar t_{31}, \bar t_{32}\}$, and the
"tasting" block $T_4 = \{\bar t_{41}, \bar t_{42}\}$ are defined with the standardized PLS
X-components.
Step 2
The PLS components being orthogonal, it is equivalent to use Mode A
or B for the left part of the causal model given in Figure 3 (PLS-Graph
output [2]). Due to the small number of observations, Mode A has to be used
for the right part of the causal model of Figure 3. We use the centroid scheme
for the internal estimation. We give in Figure 3 the MTA model for the first
rank components and in Table 3 the correlations between the latent variables.

Figure 3: Path model for the first rank components (PLS-Graph output).

                      Smell at rest   View   Smell after shaking   Tasting   Global
Smell at rest         1.00
View                  .78             1.00
Smell after shaking   .88             .91    1.00
Tasting               .74             .92    .92                   1.00
Global                .90             .96    .98                   .95       1.00

Table 3: Correlations between the rank 1 latent variables.

In Figure 3 the figures above the arrows are the correlation loadings and
the figures in brackets below the arrows are the weights applied to the
standardized variables. Correlations and weights are equal on the left side of the
path model because the PLS components are uncorrelated.
The rank one components are written as:

$t_{11} = .9998\,\bar t_{11} + .0176\,\bar t_{12}$
$t_{21} = .9558\,\bar t_{21} + .2950\,\bar t_{22}$
$t_{31} = .9869\,\bar t_{31} + .1619\,\bar t_{32}$
$t_{41} = .9947\,\bar t_{41} + .1042\,\bar t_{42}$
$t_{51} = .2516\,\bar t_{11} + .0045\,\bar t_{12} + .2552\,\bar t_{21} + .0788\,\bar t_{22} + .2707\,\bar t_{31} + .0445\,\bar t_{32} + .2628\,\bar t_{41} + .0276\,\bar t_{42}$



We may note that the rank one components are highly correlated with the first
PLS components $\bar t_{11}$, $\bar t_{21}$, $\bar t_{31}$ and $\bar t_{41}$.
To obtain the rank two components it is now useful to use equation (4),
which here becomes

$[t_{j1}, t_{j2}] = [\bar t_{j1}, \bar t_{j2}] \begin{bmatrix} \cos\theta_j & \sin\theta_j \\ -\sin\theta_j & \cos\theta_j \end{bmatrix}$   (5)

as

$A_j = \begin{bmatrix} \cos\theta_j & \sin\theta_j \\ -\sin\theta_j & \cos\theta_j \end{bmatrix}$   (6)

is the orthogonal rotation matrix in the plane with an angle $\theta_j$. For each of
the new components $t_{11}, \dots, t_{41}$ it can be checked that the squares of the
coefficients of the PLS components $\bar t_{j1}, \bar t_{j2}$ sum up to one. It is then easy to
get the rank two components:

$t_{12} = -.0176\,\bar t_{11} + .9998\,\bar t_{12}$
$t_{22} = -.2950\,\bar t_{21} + .9558\,\bar t_{22}$
$t_{32} = -.1619\,\bar t_{31} + .9869\,\bar t_{32}$
$t_{42} = -.1042\,\bar t_{41} + .9747\,\bar t_{42}$

However, to get the external latent variable $t_{52}$ for the super-block we need
to apply the complete algorithm. We first regress each block $T_j = \{\bar t_{j1}, \bar t_{j2}\}$
on $t_{j1}$. Then the path model used for the rank one components is used on the
standardized residual tables $T_{j1} = \{\bar t_{j1,1}, \bar t_{j2,1}\}$. The results are given in
Figure 4.

                      Smell at rest   View   Smell after shaking   Tasting   Global
Smell at rest         1
View                  .407            1
Smell after shaking   .803            .398   1
Tasting               .822            .145   .780                  1
Global                .928            .394   .950                  .906      1

Table 4: Correlations between the rank two latent variables.

It is clearer to express the rank two components in terms of the original
standardized variables. We then get the previous expressions for $t_{12}, \dots, t_{42}$
and the following one for $t_{52}$:

$t_{52} = -.005\,\bar t_{11} + .288\,\bar t_{12} + .014\,\bar t_{21} - .045\,\bar t_{22} - .078\,\bar t_{31} + .463\,\bar t_{32} - .029\,\bar t_{41} + .295\,\bar t_{42}$

In Table 4 we give the correlations between the rank two components. The
sensory components of rank one and two are uncorrelated by construction.
The global components are also practically uncorrelated (r = -.000008).

Figure 4: Path model for the second rank components in terms of residuals.

Figure 5: Variable loadings with the global components (global component 1
loading vs. global component 2 loading).

5 Discussion
PLS-MTA amounts to carrying out a kind of principal component analysis on each
block and on the super-block such that the components of the same rank are as
positively correlated as possible. So, for each dimension h, the interpretations
of the various block components $t_{jh}$, $j = 1, \dots, J+1$, can be related. In Figure 5
the "Smell at rest", "View", "Smell after shaking" and "Tasting" loadings
with the global components are displayed. This makes sense, as the correlations
of the variables with the block components and the global components are
rather close. The global quality judgement on the wines has also been
displayed as an illustrative variable. In Figure 6 the wines are also displayed
using the global components. The best wines are located in the south-eastern
quadrant.

Figure 6: Wine visualization in the global component space.

References
[1] Carro ll J.D. (1968). A generalization of canonical correlati on analysis
to three or more sets of varia bles. P roc. 76th Conv. Am. Ps ych. Assoc.,
227-228.
[2] Chin W.W. (2003). PL S-graph user's guide. C.T. Bauer College of Busi-
ness, University of Houston, USA.
[3] Escofier B. , Pages J. (1988). A nalyses factorielles simp les et multiples.
Dunod , Pari s.
[4] Escofier B., Pages J. (1994). Mu ltiple factor analysis. (AF MULT pack-
age ), Computationa l Statisti cs and Data Analysis 18, 121- 140.
PLS regression and PLS path modeling for multiple table analysis 499

[5] Guinot C., Latreille J., Tenenhaus M. (2001). PLS path modelling and
multiple table analysis. Application to the cosmetic habits of women in
Ile-de-France. Chemometrics and Intelligent Laboratory Systems 58,
247 -259.
[6] Horst P. (1961) . Relations among m sets of variables. Psychometrika 26,
126-149.
[7] Horst P. (1965). Factor analysis of data matrices. Holt , Rinehart and
Winston, New York.
[8] Hotelling H. (1936). Relations between two sets of variates. Biometrika
28, 321- 377.
[9] Lohmoller J .-B. (1989). Latent variables path modeling with partial least
squares. Physica-Verlag, Heildelberg.
[10] Pages J ., Tenenhaus M. (2001). Multiple factor analysis combined with
PLS path modeling. Application to the analysis of relationships between
physico-chemical variables, sensory profiles and hedonic judgements.
Chemometrics and Intelligent Laboratory Systems 58 261 - 273.
[11] Tenenhaus M. (1999) . L'approche PLS. Revue de Statistique Appliquee,
47, (2) , 5 -40.
[12] Tenenhaus M., Esposito Vinzi V., Chatelin Y.-M., Lauro C. (2004). PLS
path modeling. Computational Statistics and Data Analysis (to appear).
[13] Tucker L.R. (1958) . An inter-battery method of factor analysis. Psy-
chometrika 23 (2), 111-136.
[14] Van den Wollenberg A.L. (1977) . Redundancy analysis: an alternative
for canonical correlation. Psychometrika 42, 207 - 219.
[15] Wold H. (1982) . Soft modeling: the basic design and some exten-
sions. In Systems under indirect observation, Part 2, K.G. Joreskog &
H. Wold (Eds) , North-Holland, Amsterdam, 1 -54.
[16] Wold H. (1985). Partial least squares . In Encyclopedia of Statistical Sci-
ences , Kotz, S. & Johnson, N.L. (Eds), John Wiley & Sons, New York
6, 581 - 591.
[17] Wold S., Martens H., Wold H. (1983). The multivariate calibration
problem in chemistry solved by the PLS method. In: A. Ruhe and
B. Kagstrom (Eds), Proc. Conf. Matrix Pencils. Lectures Notes in Math-
ematics, Springer-Verlag, Heidelberg.

Address: M. Tenenhaus, HEC School of Management, 78351 Jouy en Josas, France
E-mail: tenenhaus@hec.fr

1001 GRAPHICS

Martin Theus

Key words: Statistical graphics, defaults, rendering, interaction.


COMPSTAT 2004 section: Data visualisation.

Abstract: Statistical graphics, or in more modern terms, data visualization,


is not a new discipline. Whereas in the early days the construction of a graph
was technically not easy and usually even required some artistic capabilities,
generating statistical graphs is very easy in today's statistical software pack-
ages. This obviously leads to a less careful construction of these plots. In
an object oriented software package like R we can call the generic function
plot with almost any arbitrary object as argument, and some plot method
will render this object, whether it makes sense or not.
This paper investigates how well chosen plot defaults and rendering tech-
niques can guarantee much better results in a graphical data analysis. Fur-
thermore, standard plots and examples of plot ensembles are presented which
are suitable for analyzing variables of a specific structure.

1 Introduction
Everybody knows the phrase "A picture can be worth a 1000 words". Advo-
cates of statistical graphical methods and data visualization sometimes use
this phrase to support their position. Whereas everyone knows that there
are many examples which prove that they are right, there is a far greater
number of examples (although less quoted) which prove the opposite. All
positive examples are usually very well thought out. E.g. Minard's visualiza-
tion of Napoleon's march on Moscow is a very popular example for the power
of a good visualization. The power of Minard's graph lies in the well chosen
combination of spatial plotting of time series information, not to mention sev-
eral artistic and aesthetic considerations, which are not that obvious at a first
glance. This brings us back to the phrase "A picture is worth a 1000 words", which only holds true if the picture is really well chosen. Today, where the next statistical graphic is only one keystroke or mouse-click away, we tend to produce many graphs which would probably need more than a 1000 words to be interpreted.
In the next Section of this paper we will investigate the influence of the
right choice of plot defaults on the quality, i.e. the interpretability and usabil-
ity, of a graph. This should make us more alert to default plot settings, which
are often inappropriate for the solution sought initially. The final section of
the paper goes beyond single graphs, and shows strategies for analyzing mul-
tivariate data with ensembles of standard statistical graphs.

Figure 1: The pollen data plotted in R. The four panels - "Default Plot", "Smaller Symbols", "Zoomed View" and "Final Setting" - all plot the variable Weight against Nub.

2 On plot defaults
2.1 The scatterplot - less can be more
A scatterplot of two quantitative variables is probably the most elementary and fundamental plot in statistics. At first glance there do not seem to be many degrees of freedom to choose parameters to improve a scatterplot. Reviewing Cleveland [1] and [2], the only thing we can do with scatterplots is to change scales and plot symbols. Obviously Cleveland's work was written at times when pen plotters and amber CRTs were the latest technology. Furthermore, datasets with more than just a few hundred observations were very uncommon. Today's problems often look much different. A couple of thousand points are often regarded as rather small, but would have used up a whole ink cartridge of a pen plotter 25 years ago. This calls for new, advanced rendering strategies.

Figure 2: The pollen data plotted in Mondrian.

Figure 1 shows an example where the default plot symbol is unsuitable for finding the interesting structure in a dataset. The dataset is the so called "pollen" data from the 1985 ASA data competition, and consists of a 5-dim. normal distribution with the word "EUREKA" added to the center of the data. The upper left plot shows the data plotted with the default setting of the R plot function. The 'o' which is used as the default plot symbol is only suitable for small datasets with less than 100 points. The upper right plot shows the same data now plotted with '.' as plot symbol and reveals - by squeezing your eyes - the unusually high density in the center of the plot. The plots in the lower row show how we isolate the feature by zooming in. The corresponding R code is:

> names(pollen)
[1] "Ridge"   "Nub"     "Crack"   "Weight"  "Density" "Number"
> attach(pollen)
> par(mfrow=c(2,2))
> plot(Nub, Weight, main="Default Plot")
> plot(Nub, Weight, pch=".", main="Smaller Symbols")
> plot(Nub, Weight, pch=".", xlim=c(-1,1), ylim=c(-1,1), main="Zoomed View")
> plot(Nub, Weight, xlim=c(-1,1), ylim=c(-0.8,1.6), main="Final Setting")

Figure 2 shows the same data plotted in Mondrian [6]. The default scatterplot in Mondrian uses α-transparency to cope with overplotting. α-transparency allows us to use suitably sized points in a scatterplot, without losing the information about density in the scatterplot. The amount of transparency gets bigger with the number of points to plot. In Figure 2 the unusual feature is immediately visible without the need to optimize plot parameters. More information on how to plot scatterplots can be found in Cook et al. [3].
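In plain R a similar effect can be approximated with semi-transparent plot symbols. Assuming the pollen data are attached as in the code above, a minimal sketch is the following (the alpha value 0.1 is an arbitrary choice and would normally be decreased as the number of points grows; the graphics device must support transparency):

> plot(Nub, Weight, pch=16, col=rgb(0, 0, 0, alpha=0.1), main="Alpha Transparency")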

2.2 The histogram - yet another optimal representation?
The histogram is probably number two in the list of most often used statistical graphs. There exist dozens of rules (cf. D. Scott [5]) for which number of bins is "best" under which circumstances. "Best" usually means that the sum of the squared differences between the true density and the estimate via the histogram is minimized under some variance constraint.

Figure 3: Six histograms (origins at 10, 19, 28, 37, 46 and 55) with superposed density estimators for the variable "displacement" of the "mpg-auto" dataset from the UCI ML repository. The number of bins has been determined according to Sturges' rule.

In cases where the data comes from a single generating process following a continuous, only mildly skewed random variable, these rules will deliver sufficiently nice results (in these cases almost any origin and bin width will lead to almost optimal results). The more critical situation arises when the data is a mixture of several generating processes from both continuous and discrete random variables. In these situations, we have to cope with gaps, discrete patterns and accumulation points. Unfortunately real data usually comes from the latter kind of process.
Figure 3 shows an example of six histograms for the variable "displacement" of the "mpg-auto" dataset from the UCI Machine Learning Repository with origins at 10, 19, 28, 37, 46 and 55. The number of bins has been determined according to Sturges' rule. The bin width has been "beautified" to 50 within the R hist function. Obviously none of the six origins gives us a satisfying estimate of the underlying density, nor does the kernel density estimator. The explanation is not too hard to find. Most cars in the dataset have only a very small displacement of 80 to 160. Bigger cars - all 6 cylinder engines in the dataset - form another mode at 220 to 260. Two discrete spikes can be found at 300 and 340, with some larger outliers, all corresponding to 8 cylinder engines.
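A small R sketch along these lines is given below. It assumes the auto-mpg data have been read into a data frame auto with a column displacement (an assumption, since the paper does not show the loading code), and it imposes the beautified bin width of 50 directly rather than recomputing Sturges' rule; the last call corresponds to the binning used for Figure 4 below.

origins <- c(10, 19, 28, 37, 46, 55)
par(mfrow = c(3, 2))
for (o in origins) {
  hist(auto$displacement, breaks = seq(o, o + 500, by = 50),
       freq = FALSE, main = paste("origin =", o))
  lines(density(auto$displacement))    # superposed density estimator
}
par(mfrow = c(1, 1))
hist(auto$displacement, breaks = seq(60, 460, by = 20))   # the setting of Figure 4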

Figure 4: A histogram starting at 60 with bin width 20, yielding 20 bins for the variable "displacement".

Figure 4 shows a histogram starting at 60 with bin width 20, yielding 20 bins for the variable "displacement", showing all of the above features. Finding a parameter setting revealing these features is easy in an interactive environment, but harder in a command line interface, where each new setting must be retyped until a satisfying setting is found. Finding explanations for the above described structural features can be done most conveniently within an interactive environment which allows linked highlighting. This leads to the next section.

Plotting subgroups in histograms
It is common practice to color a subgroup in a histogram. Usually this should answer the question whether this subgroup is any different from the whole population or not.

Figure 5: Left: A histogram for the variable "mpg" with model years 74-78 highlighted. Right: A spinogram showing the same data.

Figure 5 shows an example of this situation. The left histogram has all model years from 74 to 78 highlighted. At first glance we would expect that the selected subgroup has approximately the same distribution as the whole population. To verify this, we use a spinogram.
A spinogram is a histogram where all bars have the same height. In order to keep the proportionality of the area of a bar and the number of cases in the bar, the width is adjusted; i.e. whereas in a histogram with equally spaced bins the height of a bar is proportional to the number of cases in this group, in a spinogram the width is proportional. Obviously the x-axis of a spinogram is then transformed to a no longer linear but still continuous scale. This puts more visual weight on areas with high density and less weight on areas with low density. The highlighting in a spinogram is still done from bottom to top. This allows the comparison of proportions of the highlighted cases across the whole range of the underlying variable. Whereas this comparison is easily possible, the comparison of proportions in highlighted histograms is almost impossible. This is due to the fact that our visual system is well able to compare positions along a common scale, but almost incapable of judging length or position in different scales (cf. Cleveland [1], p. 262ff). Coming back to the example in Figure 5, the spinogram reveals that the cars in the years 74-78 mostly have mpg-values close to the overall mean, i.e. the tails of the distribution of this group are less populated than in the rest of the sample.
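A static approximation to such a spinogram can be produced with R's spineplot function from the graphics package; the variable names below are hypothetical (a data frame auto with columns mpg and model_year is assumed), and, unlike an interactive tool, this involves no linked highlighting:

> sel <- factor(auto$model_year %in% 74:78, labels = c("other", "74-78"))
> spineplot(sel ~ auto$mpg, xlab = "mpg", ylab = "model years 74-78")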

Figure 6: A histogram of the variable "mpg" colored according to the number of cylinders.

Figure 7: The same data as in Figure 6, now plotted as a spinogram.


Spinograms also allow you to look at the conditional distribution of more than one highlighted group. Figure 6 shows a histogram of "mpg" color brushed according to the number of cylinders of the engine (cars with 3-5 cylinders are joined in one group). Again, the histogram suffers from the differently scaled proportions and is hard to read. Figure 7 shows the corresponding spinogram, which makes the comparison across bars much easier. This kind of display is especially useful in classification problems, which need to assign more than two groups. With multiple groups, the stacking order of the groups in the spinogram becomes an important issue. A more comprehensive illustration of how to visualize conditional distributions can be found in Hofmann and Theus [4].

2.3 Mosaic plots - but which one?


Mosaic plots have been adopted more and more in the statistics community over the last 10 years. They form a very powerful framework to visualize multidimensional categorical data. Mosaic plots are especially good at visualizing associations between 2, 3, 4 or even 5 variables at a time. They are weaker for looking at only a few variables, each having many categories.
Figure 8 shows a mosaic plot for the "mpg-auto" data for "Model year" and "Cylinder". Due to the strong variation in the variable "cylinders" over the different years, it is quite hard to read across the years while following a particular number of cylinders. The same problem arises when labeling the categories of the conditioned variable, i.e. "Cylinders". In Figure 8 an equidistant labelling was chosen, which does not fit any particular year, but should be a good estimate for all years. In this situation a fluctuation diagram, as shown in Figure 9, is much more appropriate to display the data. In a fluctuation diagram all cells get the same space assigned in a grid-like layout. The area which is filled by a tile within a cell is still proportional to the number of observations in this cell. Thus the only cell which is completely filled with a tile is the cell with the maximum cell count. The advantage of this kind of display is obvious. Using the grid-like layout it is now easy to follow a particular category of a variable throughout the whole plot. Comparing Figures 8 and 9 we can see the structure in the data more clearly in the fluctuation diagram. The number of 4 cylinder cars is steadily growing over the 13 years, whereas the 8 cylinder cars seem to disappear in the early 80s. The number of 6 cylinder cars is relatively stable over the years, whereas 3 and 5 cylinder cars are only found rarely.

Figure 8: A mosaic plot for "Model year" and "Cylinder".
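Base R has no fluctuation diagram, but the standard mosaic plot of Figure 8 can be sketched as follows (again assuming a data frame auto with columns model_year and cylinders; the figures in this section come from an interactive tool, so this is only an approximation):

> mosaicplot(table(auto$model_year, auto$cylinders), main = "Model year by Cylinders")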

Besides fluctuation diagrams, two other variations of the standard mosaic plot have proven to be useful: the same bin size display, in which all tiles are of equal size, which is useful to detect empty cells in high-dimensional datasets, and the multiple barchart view, which scales the size of the tiles along only one axis.

Figure 9: A fluctuation diagram for "Model year" and "Cylinder".

3 Plot ensembles
The last section gave some hints on how to choose the right plot parameters
and/or plot types, in order to get meaningful plots. This helps to optimize
a single plot or view.
In an exploratory data analysis process we often try to answer statistical questions with graphics. For example, looking at the "mpg-auto" data we might be interested in the influence of the originating country or continent and the number of cylinders on the gas consumption of a car. This relationship between two categorical and one continuous variable can be investigated by using an ensemble of 4 linked plots.
The plot ensemble in Figure 10 features barcharts for "Cylinders" and "Origin", a mosaic plot of the two variables and a boxplot of "mpg" conditioned on the number of cylinders (alternatively we could also use a boxplot of "mpg" conditioned on the originating country). In this ensemble we see the interaction structure of the two influencing variables in the mosaic plot, as well as their marginal distributions in the two barcharts. The boxplot shows the distribution of "mpg" for each cylinder group, and via highlighting we can investigate the interaction structure of "origin" and "cylinders" on "mpg". In Figure 10 the group of all Japanese cars has been highlighted.
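Linked highlighting needs an interactive system, but the static part of such an ensemble can be sketched in base R; the column names cylinders, origin and mpg in a data frame auto are assumptions, not taken from the paper:

par(mfrow = c(2, 2))
barplot(table(auto$cylinders), main = "Cylinders")        # marginal distribution
barplot(table(auto$origin), main = "Origin")              # marginal distribution
mosaicplot(table(auto$cylinders, auto$origin), main = "Cylinders by Origin")
boxplot(mpg ~ cylinders, data = auto, main = "mpg by Cylinders")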
Figure 10: An ensemble of four plots to investigate the influence of country and cylinder on "mpg".

The next example in Figure 11 shows how we can look at the temporal distribution of spam e-mails. In the barchart of the classification variable "spam" all spam e-mails have been selected. In the barchart for "Day of Week", as well as the corresponding spineplot, we see the absolute and relative distribution of spam e-mails over the course of a week. Whereas the absolute amount of spam e-mails grows towards the middle of the week, the relative amount is highest at the weekends. In the histogram of "Time of Day" we see an almost constant amount of spam mails over the 24 hours of a day, whereas, due to the small number of ordinary e-mails outside business hours, the proportion of spam is very high during the night.
The ensembles in Figures 10 and 11 are only two examples which show that a specific question in an exploratory data analysis can be answered with ensembles of (linked) plots. If statistical packages do not offer the whole suite of basic plots, users cannot plot data in the most suitable way. If for instance a package only offers point plots for quantitative data, these plots are used to try to visualize discrete data.

Figure 11: An ensemble of plots to investigate the temporal distribution of spam e-mails.

4 Conclusion
The rise of computers with graphical capabilities has led to new graphical data analysis possibilities, but has also caused an inflation in the use of statistical graphics. Only well-designed graphics can be "worth a 1000 words". Many statistical software packages do not take care over default settings. This deficit can often be explained by the fact that the underlying code and graphical model is quite old, and has not yet been adapted to modern data problems and rendering methods.
Using α-channel transparency can help a lot when trying to avoid overplotting problems in scatterplots and parallel coordinate plots. The histogram as a means of density estimation is an example of a plot where "no default" is the only good default. (In a recent talk an expert on Support Vector Machines (SVM) noted that he would suggest that all implementations of SVMs should always force the user to explicitly specify parameters, since there is no such thing as a default parameter setting which would generally yield acceptable results.) Spinograms are a good choice when trying to visualize a sub-population of a continuous variable. A histogram, which is often used instead, is not useful for this task. Mosaic plots are complemented by three variations to build a suite of plots which can visualize multivariate discrete data. Where one plot is good, another one fails.
Generally, for a comprehensive graphical data exploration, we need a wide range of plots, which can be applied exactly for the purpose they serve best. No craftsman would enter a construction site with a toolbox consisting of just a single type of tool.

Figure 12: Statistical graphics code information on data in an abstract form. A successful decoding by a human is only possible if the abstraction is suitable for the kind of data coded.

Figure 12 illustrates the process of coding information in a statistical graph. Given some data we code - and often condense - the information about this data via a computer based procedure into an abstract representation. The crucial part is the decoding process by the human observer. A successful decoding by a human is only possible if the abstraction is suitable for the kind of data coded.
Additionally we must keep in mind that the human visual system has many limitations, as basically described in Cleveland's [1] overview in the context of graph reading. His investigations have been limited to the state of statistical graphics in the early 80s. Today's rendering techniques offer new possibilities and challenges.

References
[1] Cleveland W.S. (1985). The elements of graphing data. Wadsworth, Monterey, CA.
[2] Cleveland W.S. (1993). Visualizing data. Hobart Press, Summit, NJ.
[3] Cook D., Theus M., Hofmann H. Scatterplots for massive datasets. Journal of Computational and Graphical Statistics, submitted.
[4] Hofmann H., Theus M. Visualizing conditional distributions. Journal of Computational and Graphical Statistics, submitted.
[5] Scott D. (1992). Multivariate density estimation - theory, practice, and visualization. Wiley, New York.
[6] Theus M. (2002). Interactive data visualization using Mondrian. Journal of Statistical Software 7 (11).

Address: M. Theus, Department of Computational Statistics and Data Analysis, Augsburg University, Universitätsstr. 14, 86135 Augsburg, Germany
E-mail: martin.theus@math.uni-augsburg.de

FITTING BRADLEY TERRY MODELS
USING A MULTIPLICATIVE ALGORITHM

Ben Torsney

Key words: Bradley Terry model, discrete data, factorial structure, general equivalence theorem, maximum likelihood estimation, multiplicative algorithm, optimal design theory, paired comparisons.
COMPSTAT 2004 section: Design of experiments.

Abstract: We consider the problem of estimating the parameters of a Bradley Terry Model by the method of maximum likelihood, given data from a paired comparisons experiment. The parameters of a basic model can be taken to be weights which are positive and sum to one. Hence they correspond to design weights, and optimality theorems and numerical techniques developed in the optimal design arena can be transported to this estimation problem. Furthermore, extensions of the basic model to allow for a factorial structure in the treatments lead to an optimisation problem with respect to several sets of weights or distributions. We can extend techniques to this case. In section 1 we introduce the notion of paired comparisons experiments and the Bradley Terry Model. In section 2 the parameter estimation problem is outlined, with optimality results and a general class of multiplicative algorithms outlined in sections 3 and 4 respectively. A specific algorithm is applied to the Bradley Terry log-likelihood in section 5, and treatments with a factorial structure are considered in section 6. Finally, in section 7 extensions to triple comparisons and to extended rankings are briefly outlined.

1 Paired comparisons
1.1 Introduction
We consider paired comparison experiments in which J treatments or products are compared in pairs. In a simple form a subject is presented with two treatments and asked to indicate which he/she prefers or considers better. In reality the subject will be an expert tester; for example, a food taster in examples arising in food technology. The link with optimal design theory (apart from the fact that a specialised design, paired comparisons, is under consideration) is that the parameters of one model, the Bradley Terry model, for the resultant data are like weights. Hence the theory characterising and the methods developed for finding optimal design weights can be applied to characterising and finding the maximum likelihood estimators of these Bradley Terry weights.

1.2 The data
In a simple experiment a set of such testers is available and each is presented with one pair from a set of J treatments, say T_1, T_2, ..., T_J. The number of comparisons, n_ij, of T_i with T_j we assume has been predetermined. Sufficient summary data comprises the set {O_ij : i = 1, 2, ..., J; j = 1, 2, ..., J; i < j or i > j}, where O_ij is the observed frequency with which T_i is preferred to T_j. Of course O_ij + O_ji = n_ij.

1.3 Models
1.3.1 A general model In the absence of other information the most general model here is to propose:

    O_ij ~ Binomial(n_ij, θ_ij),    (1)

where
    θ_ij = P(T_i is preferred to T_j).

Apart from the constraint O_ij + O_ji = n_ij, independence between frequencies is to be recommended. So apart from the constraint θ_ij + θ_ji = 1, these define unrelated binomial parameters. The maximum likelihood estimator of θ_ij is O_ij/n_ij (the proportion of times T_i is preferred to T_j in these n_ij comparisons), and formal inferences can be based on the asymptotic properties of these.

1.3.2 Bradley Terry Model This is a more restricted model in that it imposes interrelations between the θ_ij. It proposes that:

    θ_ij = p_i / (p_i + p_j),    (2)

where p_1, p_2, ..., p_J are positive parameters. See [1].
These can be viewed as indices or quality characteristics, one for each treatment. These are only unique up to a constant multiple, since θ_ij is invariant to proportional changes in p_i and p_j. A constraint needs to be imposed for uniqueness. One possibility is:

    Σ p_i = 1.

This implies 0 < p_i < 1. We return to this later.
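As a small numerical illustration (the numbers are ours, not from the paper): with J = 3 and weights p = (0.6, 0.3, 0.1), model (2) gives θ_12 = 0.6/0.9 ≈ 0.667, θ_13 = 0.6/0.7 ≈ 0.857 and θ_23 = 0.3/0.4 = 0.75; multiplying p by any positive constant leaves these probabilities unchanged, which is why a constraint such as Σ p_i = 1 is needed.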

1.3.3 Motivation for Bradley Terry Model However we can show that θ_ij is uniquely determined by a latent difference. Let p_i = exp(λ_i). Then:

    θ_ij = exp(δ_ij) / (1 + exp(δ_ij)),    δ_ij = λ_i - λ_j.    (3)

Thus θ_ij is uniquely determined by the difference in the transformed quality characteristics λ_i, λ_j, while it is invariant to shifts in their values.
Further, θ_ij = F(δ_ij), where F(·) is the logistic distribution function. If we assume that the difference in quality, between the two treatments, has a logistic distribution, then θ_ij is the probability of a difference of at most δ_ij; or the difference in quality is given by:

    δ_ij = F^(-1)(θ_ij) = F^(-1){p_i/(p_i + p_j)}.

See [6]. Other choices of F(·) can lead to alternative models with parameters similar to p_1, p_2, ..., p_J.

2 Parameter estimation
In terms of the original parameters the likelihood of the data is a product of binomial likelihoods, namely:

    L = Π_{r<s} C(n_rs, O_rs) θ_rs^(O_rs) θ_sr^(O_sr).    (4)

Let p = (p_1, p_2, ..., p_J) and, for convenience, let O_ii = 0, i = 1, 2, ..., J, and O_i. = Σ_j O_ij.
Then the likelihood of the data under the Bradley Terry model is given by making the substitutions θ_rs = p_r/(p_r + p_s), θ_sr = p_s/(p_r + p_s), O_rs + O_sr = n_rs, to yield:

    L(p) ∝ Π_i p_i^(O_i.) / Π_{r<s} (p_r + p_s)^(n_rs).    (5)

We wish to choose p (p > 0) to maximise L(p). Since θ_ij is invariant to proportional changes in the p_i's, so is L(p). In fact L(p) is a homogeneous function of degree zero in p; i.e. L(cp) = L(p), where c is a scalar constant. It is constant on rays running out from the origin. It will therefore be maximised all along one specific ray. We can identify this ray by finding a particular optimising p*. This we can do by imposing a constraint on p. Possible constraints are Σ p_i = 1 or Π p_i = 1, or g(p) = 1, where g(p) is a surface which cuts each ray exactly once. In the case J = 2 a suitable g(p) is defined by p_2 = h(p_1), where h(·) is a decreasing function which cuts the two main axes, as in the case of h(p_1) = 1 - p_1, or has these as asymptotes, as in the case of h(p_1) = 1/p_1. In general a suitable choice of g(p) is one which is positive and homogeneous of some degree h. Note that other alternatives are Σ p_i = c or Π p_i = c, where c is any positive constant; e.g. c = J.
The choice of Π p_i = 1, being equivalent to Σ ln(p_i) = 0, confers on α_i = ln(p_i) the notion of a main effect. We will opt for the choice of Σ p_i = 1, which conveys the notion of p_i as a weight. We wish to maximise the likelihood or log-likelihood subject to this constraint and to non-negativity too. This is an example of the following general problem:

Problem (P):
Maximise φ(p) subject to p_i ≥ 0, Σ p_i = 1.
We wish to maximise φ(p) with respect to a probability distribution. Here we will take φ(p) = ln{L(p)}.

There are many examples of this problem arising in various areas of statistics, especially in the area of optimal regression design. We can exploit optimality results and algorithms developed in this area. The feasible region is an open but bounded set. Thus there should always be a solution to this problem, allowing for the possibility of an unbounded maximum, multiple solutions and solutions at vertices (i.e. p_t = 1, p_i = 0, i ≠ t).

3 Optimality conditions
We can define optimality conditions in terms of the point to point directional derivative defined by Whittle [19]. The directional derivative F_φ(p, q) of a criterion φ(·) at p in the direction of q is the limit as ε ↓ 0 of

    [φ{(1 - ε)p + εq} - φ(p)]/ε,

i.e.
    F_φ(p, q) = dg/dε at ε = 0+,

where g(ε) = φ{(1 - ε)p + εq}.
This derivative exists even if φ(·) is not differentiable; but if φ(·) is differentiable then:

    F_φ(p, q) = (q - p)^T d,

where d = ∂φ/∂p.
Let F_j = F_φ(p, e_j), where e_j is the j-th unit vector in R^J. Then:

    F_j = d_j - p^T d = d_j - Σ p_i d_i,    where d_j = ∂φ/∂p_j.

We call F_j the j-th vertex directional derivative of φ(·) at p. Note that Σ p_j F_j = 0, so that, in general, some F_j are negative and some are positive.
If φ(·) is differentiable at p*, then a necessary condition for φ(p*) to be a local maximum of φ(·) in the feasible region of Problem (P) is:

    F_j* = F_φ{p*, e_j} = 0 for p_j* > 0,
    F_j* = F_φ{p*, e_j} ≤ 0 for p_j* = 0.

If φ(·) is concave on its feasible region, then these first order stationarity conditions are both necessary and sufficient. This is the general equivalence theorem in optimal design. See [19], [5].
It is clear that all p_j* must be positive in the case of the Bradley Terry likelihood, so that the second condition is redundant.

4 Algorithms
4.1 A multiplicative algorithm
Problem (P) has a distinct set of constraints, namely the variables p_1, p_2, ..., p_J must be nonnegative and sum to 1. An iteration which neatly submits to these and has some suitable properties is the multiplicative algorithm:

    p_j^(r+1) = p_j^(r) f(d_j^(r)) / Σ_i p_i^(r) f(d_i^(r)),    (6)

where d_j^(r) = ∂φ/∂p_j at p = p^(r), while f(d) is positive and strictly increasing in d and may depend on one or more free parameters.
This type of iteration was first proposed by [13], taking f(d) = d^δ, with δ > 0. This, of course, requires positive derivatives. Subsequent empirical studies include Silvey et al. [11], which is a study of the choice of δ when f(d) = d^δ, δ > 0; Torsney [15], which mainly considers f(d) = e^(δd) in a variety of applications, for which one criterion φ(·) could have negative derivatives; Torsney and Alahmadi [16], who consider other choices of f(·); Torsney and Mandal [18], who consider objective choices of f(·); and [8], who explore developments of the algorithm based on a clustering approach in the context of a continuous design space. Torsney and Mandal [17] and Mandal et al. [9] also apply these algorithms to the construction of constrained optimal designs.
Titterington [12] describes a proof of monotonicity of f(d) = d^δ in the case of D-optimality. Torsney [14] explores monotonicity of particular values of δ for particular φ(p). Torsney [14] also establishes a sufficient condition for monotonicity of f(d) = d^δ, δ = 1/(t + 1), when the criterion φ(p) is homogeneous of degree -t, t > 0, with positive derivatives, and proves this condition to be true in the case of linear design criteria such as the c-optimal and the A-optimal criteria, for which t = 1, so that δ = 1/2. In other cases the value δ = 1 can be shown to yield an EM algorithm, which is known to be monotonic and convergent; see [13]. Beyond this there are minimal results on convergence, although this will depend on the choice of f(·) and of parameters like δ. See [11] for some empirical results. In principle the choice of f(·) is arbitrary, but objective bases for choices are addressed in the formal properties now listed.

4.2 Properties of the algorithm
Under the conditions imposed on f(·), the above iterations possess the following properties, which are considered in more detail in [15], [16] and [7]:
1. p^(r) is always feasible.
2. F_φ{p^(r), p^(r+1)} ≥ 0, with equality when the d_j's corresponding to nonzero p_j's have a common value d (= Σ p_i d_i), in which case p^(r) = p^(r+1).

3. An iterate p^(r) is a fixed point of the iteration if the derivatives d_j^(r) corresponding to nonzero p_j^(r) are all equal; equivalently, if the corresponding vertex directional derivatives F_j^(r) are zero. Thus a solution to Problem (P) is a fixed point of the iteration. So also are solutions subject to setting a given subset of weights to zero; see [15].
4. We mentioned that f(·) may depend on one or more free parameters. Torsney and Alahmadi [16] explore methods for choosing a single positive parameter δ for various given choices of f(·). Torsney and Mandal [18] explore methods for choosing f(·), which can accommodate negative partial derivatives or for which (positive) partial derivatives can be replaced by vertex directional derivatives. A further paper is in preparation on choosing f(·) when the criterion has positive derivatives.

5 Fitting Bradley Terry Models
Our criterion is:

    φ(p) = ln{L(p)} = Σ_i O_i. ln(p_i) - Σ Σ_{r<s} n_rs ln(p_r + p_s).    (7)

Since L(p) is a homogeneous function of degree zero, Σ p_i d_i = 0. In fact d_j = F_j. So there are always positive and negative d_j unless all are zero. We require a function f(d) which is defined for positive and negative d, where we take d to represent a partial derivative. Noting that all p_j* must be positive, a suitable choice of f(·) should be governed by the fact that at the optimum d_j = 0, j = 1, 2, ..., J.
This suggests that a suitable function is one that is 'centred' on zero and changes reasonably quickly about d = 0. It should also be desirable to treat positive and negative derivatives symmetrically. Torsney and Mandal [18] start from a function h(x) defined for x ≤ 0, such that h(x) > 0, h'(x) > 0, h(0) = 1. They propose:

    f(x) = h(x) for x ≤ 0,    f(x) = 2 - h(-x) for x > 0,

i.e.
    f(x) = (1 + s) - s h(-s x),    s = sign(x).

Clearly f(x) is increasing, while for y > 0, (y, f(y)) and (-y, f(-y)) are reflections of each other in the point (0, 1) = (0, f(0)); i.e. f(-y) = 2 - f(y). Equivalently, f'(y) is symmetric about zero. Note that 0 < f(x) < 2, so that f(x) is bounded; also f(0) = 1.

Torsney and Mandal [18] consider various choices of h(x), including h(x) = 2H(δx), where δ is a positive parameter and H(·) is a cumulative distribution function such that H(0) = 1/2. Here we opt for H(·) = Φ(·), so that the iterations prove to be:

    p_j^(r+1) = p_j^(r) Φ(δ d_j^(r)) / Σ_i p_i^(r) Φ(δ d_i^(r))

(the factor 2 in f cancels in the ratio).

Example: We use this algorithm in two examples.

Example 1:
In this case J = 8 coffee types were compared through 26 pairwise comparisons on each pair, yielding a total of N = 728 observations; i.e. ΣΣ O_ij = 728. A suitable δ is δ = 1/N. In effect we are standardising through replacing observed by relative frequencies in the log-likelihood, and then taking δ = 1. Starting from p_j^(0) = 1/J, the numbers of iterations needed to achieve max|d_j| = max|F_j| ≤ 10^(-n), n = 0, 1, ..., 7 respectively are 17, 21, 25, 32, 38, 45, 51, 59. The optimal p* is: (0.190257, 0.122731, 0.155456, 0.106993, 0.091339, 0.149406, 0.080953, 0.102865). Iterations were monotonic.
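The coffee data themselves are not reproduced in the paper, so the following R sketch of iteration (6) with f(d) = 2Φ(δd), δ = 1/N, generates its own paired comparison data from known weights; all names and the simulated frequencies are ours, purely for illustration.

## simulate a paired comparison experiment with J = 8 and 26 comparisons per pair
set.seed(1)
J <- 8
p.true <- runif(J); p.true <- p.true / sum(p.true)
n <- matrix(26, J, J); diag(n) <- 0                 # n[i,j] = comparisons of i and j
O <- matrix(0, J, J)                                # O[i,j] = times i preferred to j
for (i in 1:(J - 1)) for (j in (i + 1):J) {
  O[i, j] <- rbinom(1, n[i, j], p.true[i] / (p.true[i] + p.true[j]))
  O[j, i] <- n[i, j] - O[i, j]
}
N <- sum(O); delta <- 1 / N

## multiplicative algorithm: p_j <- p_j f(d_j) / sum_i p_i f(d_i), f(d) = 2 Phi(delta d)
p <- rep(1 / J, J)                                  # start from p_j = 1/J
for (it in 1:10000) {
  d <- rowSums(O) / p -
       sapply(1:J, function(j) sum(n[j, -j] / (p[j] + p[-j])))  # d_j = dphi/dp_j
  if (max(abs(d)) <= 1e-6) break                    # here sum(p*d) = 0, so d_j = F_j
  f <- 2 * pnorm(delta * d)
  p <- p * f / sum(p * f)
}
round(p, 6)                                         # maximum likelihood weights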

Example 2:
In this example J = 9 quality of life dimensions were compared in pairs by each of 50 patients with early signs of rheumatoid arthritis (RA). The 9 dimensions were: ability to physically function, pain, stiffness, ability to work, fatigue, depression, interference with social activities, side effects, and financial burden. This data arose from the Consortium of Practicing Rheumatologists' long-term observational multi-center study of early severe RA. Patients entered in this additive cohort had less than 1 year of symptom onset. The responses were obtained at their first telephone interview. Formed in 1992, the Consortium prospectively followed them to delineate early outcome and factors, such as treatment, functional, radiographic, psychosocial, and economic outcomes. Data on disease severity, functional status, psychosocial health, cost, radiographic damage, laboratory serologies and acute phase reactants were recorded at Baseline and at 6 months, 1 year, and annually thereafter. As a chronic illness, RA impacts every dimension of quality of life. Even among RA patients, however, differences in life situations, clinical presentation, and disease course can be striking, leading to varying patient rankings of the importance of different disease and life factors. The 9 factors were selected to represent aspects of RA that patients could easily identify and compare.
There were a total of N = 1800 comparisons; i.e. ΣΣ O_ij = 1800. In 8 cases there were ties. These were split 50:50 between the relevant treatments. Again a suitable δ is δ = 1/N. Starting from p_j^(0) = 1/J, the numbers of iterations needed to achieve max|d_j| = max|F_j| ≤ 10^(-n), n = 0, 1, ..., 6 respectively are 28, 42, 56, 69, 84, 96, 110. The optimal p* is: (0.265361, 0.172154, 0.151644, 0.059151, 0.123506, 0.030753, 0.037740, 0.055038, 0.104653), the order of the components corresponding to the order of the dimensions as listed above. Iterations were monotonic.

There is a further issue here. These 1800 responses have been obtained from only 50 patients. Each patient has responded on each pairwise comparison. We have assumed independence between the resulting 36 observations.
Dittrich et al. [3] also contemplate this 'independent decisions' model, an independence which allows for inconsistent responses by a patient. However they extend it to a 'dependent decisions' model. For an individual patient's comparison of T_i and T_j let y_ij = 1 if he/she records that T_i is preferred to T_j and y_ij = 0 otherwise. In the case of three dimensions their model is:

where C is a normalising constant.
All parameters must be positive. A constraint is still needed on the p_j as above, but none are needed on the w_j. However we could transform to q_j = w_j/(w_1 + w_2 + w_3), so that (q_1 + q_2 + q_3) = 1, while α = (w_1 + w_2 + w_3) is a free positive parameter, which could be treated like the variable q arising in models allowing for ties discussed in section 7 below. The above class of algorithms could then be used to find the optimal values of both the p_i's and q_j's. This would need the individual responses on each pair of dimensions from each respondent.
If w_1 = w_2 = w_3 = 1, we recover the independence model.
Furthermore, extensions of Bradley Terry models are available when respondents record consistent rankings; see below. However there is scope for extending this work.

6 Treatments with a factorial structure
In Example 1 the 8 coffees comprised the 8 combinations arising from 2 brew strengths, 2 roast colours, and 2 brands. Simpler versions of the Bradley Terry Model have been proposed in terms of definitions of main effects and possibly low order interactions. We consider main effects only for the moment in the case of 3 factors.
Suppose that we have J = KLM treatments arising from the KLM factor level combinations of 3 factors, denoted by α, β, γ with K, L and M levels respectively. We have treatments T_klm, k = 1, ..., K; l = 1, ..., L; m = 1, ..., M, with associated Bradley Terry parameters p_klm, such that T_klm is preferred to T_qrs with probability {p_klm/(p_klm + p_qrs)}. This is allowing for main effects and interactions of all orders.
A main effects or additive model corresponds to:

    p_klm = α_k β_l γ_m,

i.e.
    ln(p_klm) = ln(α_k) + ln(β_l) + ln(γ_m),

where α_k, β_l, γ_m > 0.


The likelihood is again a homogeneous function of degree zero in each of the three sets of main effect parameters. Constraints need to be imposed on each of them. Various choices can be considered as above, with appropriate extensions of the above algorithm. If we opt for the constraints

    Σ α_k = Σ β_l = Σ γ_m = 1,

we wish to maximise the log-likelihood with respect to several distributions. At the optimum all partial derivatives should be zero. (Note that alternatives could be Σ α_k = K, Σ β_l = L, Σ γ_m = M.)
A suitable set of iterations is:

    γ_m^(r+1) = γ_m^(r) f_γ(d_m^(γ)) / Σ_i γ_i^(r) f_γ(d_i^(γ)),

and similarly for α_k^(r+1) and β_l^(r+1), where f_α(·), f_β(·), f_γ(·) are positive increasing functions and d_k^(α) = ∂φ/∂α_k at α = α^(r), etc.
This set of iterations enjoys the same properties as those for a single distribution, including F_φ(λ^(r), λ^(r+1)) ≥ 0, where λ = (α^T, β^T, γ^T)^T. See [18].
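The derivatives entering these iterations follow from the chain rule: since p_klm = α_k β_l γ_m,

    ∂φ/∂α_k = Σ_l Σ_m β_l γ_m ∂φ/∂p_klm,

with ∂φ/∂p_klm computed exactly as for the unstructured likelihood (7); the derivatives with respect to the β_l and γ_m are obtained analogously.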
In our example K = L = M = 2. Taking δ = 1/N, f_θ(d^(θ)) = Φ(δd^(θ)) and θ_j^(0) = 1/2, j = 1, 2, for θ = α, β, γ, representing brew strength, roast colour and coffee brand respectively, the numbers of iterations needed to achieve max|d_j^(θ)| = max|F_j^(θ)| ≤ 10^(-n), for n = 0, 1, ..., 7 respectively are 7, 12, 15, 19, 23, 27, 31, 36. Optimal values are: α* = (0.574904, 0.425096), β* = (0.551050, 0.448950), γ* = (0.504887, 0.495113); and the optimal p* is (0.159949, 0.156852, 0.130313, 0.127790, 0.118269, 0.115980, 0.096356, 0.094491).
Iterations were monotonic.

Notes:

1. Other variations of the above iterations are possible. One is to cycle through the three sets of main effect parameters, running the iterations for each one in turn, while keeping the others fixed.
2. Obviously the approach is extendable to any number of factors.

3. There are extensions of the Bradley Terry model which allow for interactions, and the above iterations can be extended to these too. For example a model including an interaction between brew strength and roast colour corresponds to:

    p_klm = α_k β_l γ_m (αβ)_kl,

i.e.
    ln(p_klm) = ln(α_k) + ln(β_l) + ln(γ_m) + ln((αβ)_kl),

where (αβ)_kl > 0.
The likelihood is now additionally homogeneous in two sets of respects; namely, it is invariant to proportional changes in the terms (αβ)_kl when the constant of proportionality either varies with α or with β. Several sets of consistent constraints are needed. One possibility is the set

    Σ_k (αβ)_kl = K, l = 1, 2, ..., L;    Σ_l (αβ)_kl = L, k = 1, 2, ..., K,

or sums could be replaced by products.
Further development of our class of iterations is needed. For each α and each β the (αβ)_kl in effect define a set of probability distributions, except that the probabilities are scaled to add to a constant differing from 1. One option would be to alternate between iterations (appropriately modified to satisfy these re-scaling constraints) for each set. An alternative derives from Linear Programming Theory. The non-negativity and equality constraints imply that the set {(αβ)_kl : k = 1, 2, ..., K; l = 1, 2, ..., L} belongs to a bounded convex polyhedron whose vertices are Basic Feasible Solutions. The convex weights defining the (αβ)_kl define one distribution. An extra set of equations for updating these can be added to the sets for main effects.

7 Extensions of the Bradley Terry Model
There are extensions of the basic Bradley Terry Model which can be fitted using the above methods. These include:
(a) Models allowing a 'no-preference' option. Two possibilities are:
(i)
    P(T_i is preferred to T_j) = p_i/(p_i + p_j + p_0),
    P(T_j is preferred to T_i) = p_j/(p_i + p_j + p_0),
    P(No preference) = p_0/(p_i + p_j + p_0).
One extra parameter p_0 has been introduced, which must be positive, and these probabilities and hence the likelihood are homogeneous of degree zero in p_0, p_1, ..., p_J. Finding maximum likelihood estimates of these defines another example of Problem (P).

(ii) Rao and Kupper [10] proposed:

    P(T_i is preferred to T_j) = p_i/(p_i + q p_j),
    P(T_j is preferred to T_i) = p_j/(p_j + q p_i),

where q > 1.
This model has a latent logistic distribution motivation, since P(T_i is preferred to T_j) = F(λ_i - λ_j - τ), τ ≥ 0, where F(·) is the logistic distribution function and p_i = exp(λ_i), q = exp(τ).

(iii) Davidson [2] proposed:

    P(T_i is preferred to T_j) = p_i/(p_i + p_j + q(p_i p_j)^(1/2)),
    P(No preference) = q(p_i p_j)^(1/2)/(p_i + p_j + q(p_i p_j)^(1/2)),

where q > 1.

Each of (ii) and (iii) leads to likelihoods which are homogeneous of degree zero in the p_i's. Also note that {A/(A + qB)} = {r_1 A/(r_1 A + r_2 B)}, where r_1 = q^(-1/2) and r_2 = 1/r_1. This is homogeneous of degree zero in r_1 and r_2. Hence we could impose the constraint r_1 + r_2 = 1. However r_1 ≥ r_2. A further transformation is s_1 = r_1 - r_2, s_2 = 2r_2. Now the constraints are s_1, s_2 > 0, s_1 + s_2 = 1. We can now maximise the likelihood with respect to two distributions using our family of algorithms. To determine q we need to re-scale to r_1 r_2 = 1.

Henery [4] replaces q in (ii) by q_ij = q_i q_j with q_i q_j ≥ 1. The latter condition implies that at most one q_i can be less than 1 (the minimum in fact). If none satisfy this condition then the above transformation could be applied to each q_i, leading to an optimisation of the likelihood with respect to (J + 1) distributions. If the minimum was known to be less than 1, and its subscript i were known too, then an appropriate variation of the approach takes r_1i = (1/q_i)^(1/2), where r_1i is the value of r_1 for this particular q_i.

Kuk [6] considers applications to the outcome of football matches and extends the model to include two sets of the parameters {p_i} and two sets of the parameters {q_j}, one each for 'home' and 'away' games. The likelihood is homogeneous of degree zero in the two sets of p_i's as a whole and in three sets of variables which are based on transformations of the q_j's similar to that defining r_1 and r_2 above. Thus we wish to maximise the likelihood with respect to four distributions.

(b) Triple Compari sons.


An exte nsion of pairwise comparisons is to invite subjects to place three
treatments in order of pr efer ence. Let :
()ijk = P(Ti is preferred to T j and T j is preferred to n)
Various poss ible exte nsions of the Br adl ey Terry Model include:

()ijk = PiPj /{(Pi + Pj + Pk)(Pj + Pk)} ;

()ijk = (Pi) 2pj /D,


where D = (Pi) 2pj + (Pj )2pi + (Pi) 2pk + (Pk)2pi + (Pj )2pk + (Pk?Pj
(c) Extended Rankings.
The latter model exte nds to rankings of more than three treatment s, while
both mod els define likelihoods which are homogenous of degree zero in PI , P2,
.. . , PJ , each of which must be positive. Maximum likelihoo d estimation of
t hese is equivalent to anot her example of Problem (P) . Equally if the treat-
ments have a factorial structure, t he likelihood can be expre ssed as a function
of sever al distributions and opt imised with resp ect to t hese using the algo-
rithms described .

8 Discussion
The primary focus of this paper is one of cross fertilisation, an arguably somewhat limited, even simple minded, one. It is to point out that a class of maximum likelihood estimation problems could be attacked using tools for solving optimal design problems, because in each case one or several sets of optimising weights or distributions are sought. Hence the equivalence theorems characterising optimality in the optimal design arena and related algorithms can be transported over to the parameter estimation arena. This is one new contribution of this work. One other is using a new version of the above mentioned algorithms, one which can accommodate negative derivatives.

References
[1] Bradley R.A., Terry M.E. (1952). The rank analysis of incomplete block designs I, The method of paired comparisons. Biometrika 39, 324-345.
[2] Davidson R.R. (1970). On extending the Bradley Terry model to accommodate ties in paired comparisons experiments. J. Am. Statist. Ass. 65, 317-328.
[3] Dittrich R., Hatzinger R., Katzenbeisser W. (2002). Modelling dependencies in paired comparisons data - a log-linear approach. Computational Statistics & Data Analysis 40, 39-57.
[4] Henery R.J. (1992). An extension of the Thurstone-Mosteller model for chess. Statistician 41, 559-567.

[5] Kiefer J. (1974). General equivalence theory for optimum designs (ap-
proximate theory). Annals of Statistics 2, 849-879.
[6] Kuk A.C.Y. (1995). Modelling paired comparison data with large num-
bers of draws and large variability of draw percentages among players .
Statistician 44, 523- 528.
[7] Mandal S., Torsney B. (2000). Algorithms for the construction of opti-
mising distributions . Communications in Statistics (Theory and Meth-
ods) 29, 1219-1231.
[8] Mandal S., Torsney B. (2004). Construction of optimal designs using a
clustering approach. (Under revision for J. Stat. Planning & Inf.)
[9] Mandal S., Torsney B., Carriere K.C. (2004). Constructing optimal designs with constraints. Journal of Statistical Planning and Inference (to appear).
[10] Rao P.V., Kupper L.L. (1967). Ties in paired comparison experiments:
a generalisation of the Bradley Terry model. J . Am. Statist. Ass. 62,
192-204.
[11] Silvey S.D., Titterington D.M., Torsney B. (1978). An algorithm for
optimal designs on a finite design space. Communications in Statistics
A 14, 1379-1389.
[12] Titterington D.M. (1976) . Algorithms for computing D-optimal designs
on a finite design space. Proc. 1976 Conf. On Information Sciences and
Systems, Dept. of Elect. Eng ., Johns Hopkins Univ. Baltimore, MD, 213
- 216.
[13] Torsney B. (1977). Contribution to discussion of 'Maximum Likelihood
Estimation via the EM Algorithm' by Dempster, Laird and Rubin. J.
Royal Stat. Soc. (B) 39, 26-27.
[14] Torsney B. (1983). A moment inequality and monotonicity of an algo-
rithm. Lecture Notes in Economics and Mathematical Systems, A.V. Fi-
acco, K.O . Kortanek (Eds.), Springer Verlag 215,249-260.
[15] Torsney B. (1988). Computing optimizing distributions with applications
in design, estimation and image processing. In: Optimal Design and
Analysis of Experiments, Y. Dodge , V.V. Fedorov, H.P . Wynn (Eds .),
North Holland., 361 - 370.
[16] Torsney B., Alahmadi A.M. (1992). Further developments of algorithms for constructing optimizing distributions. In: Model Oriented Data Analysis, V. Fedorov, W.G. Muller, I.N. Vuchkov (Eds), Proceedings of the 2nd IIASA Workshop, St. Kyrik, Bulgaria, 1990, Physica Verlag, 121-129.
[17] Torsney B., Mandal S. (2000). Construction of constrained optimal designs. In: Optimum Design 2000, A. Atkinson, B. Bogacka, A. Zhiglavsky (Eds), Proceedings of the conference held in honour of the 60th birthday of Valeri Fedorov, Cardiff, Kluwer, 141-152.

[18] Torsney B., Mandal S. (2004). Multiplicative algorithms for constructing optimizing distributions. In: mODa 7 - Advances in Model-Oriented Design and Analysis, 143-150.
[19] Whittle P. (1973). Some general points in the theory of optimal experimental design. J. Roy. Statist. Soc. B 35, 123-130.

Address: B. Torsney, Department of Statistics, University of Glasgow, Glasgow G12 8QW, U.K.
E-mail: B.Torsney@stats.gla.ac.uk

MODELLING MULTIPLE TIME SERIES:
ACHIEVING THE AIMS

Granville Tunnicliffe-Wilson and Alex Morton

Key words: Cross-spectral analysis, extended autoregression, prediction, transfer functions.
COMPSTAT 2004 section: Time series analysis.

Abstract: We review the traditional aims and methodology of multiple time series modelling, and present some recent developments in the models available to achieve these aims, in the context of both regularly and irregularly sampled data. These models are analogues of the vector autoregressive process, based on the generalised shift, or Laguerre, operator. They form a subclass of vector autoregressive moving-average processes; they retain many of the attractive features of the standard vector AR model, but have an added dimension of flexibility that leads to improvements in predictive ability.

1 Reviewing the objectives and methodology
The aims of time series analysis are revealed in the titles of some of the early books on the subject. The Extrapolation, Interpolation and Smoothing of Stationary Time Series with Engineering Applications, by Wiener [19], has a comprehensive title, but Prediction and Regulation by Whittle [18], and Time Series Analysis, Forecasting and Control by Box and Jenkins [5] make more explicit the application to control, which was undoubtedly one of Wiener's objectives. The objectives are also clearly stated in the title of Statistical Analysis and Control of Dynamic Systems by Akaike and Nakagawa [2], the original publication of which, in Japanese, took place in 1972.
The time series model may itself be the immediate objective of the modelling, as in predator-prey systems and a host of other scientific applications, where an understanding of the mechanisms of interaction between time series variables is required. However, prediction is an encompassing objective. The model is generally identified by its predictive capacity, whatever the aim of the application. Smoothing, or more generally signal extraction, depends on the structure identified by the predictive model. Control applications rely on the ability to predict an output series from an input series.
Methodology developed in the early years is still widely used. Spectral analysis has tended to give way to time domain methods, particularly in econometric forecasting. There is, however, one context, that of modelling causal (or one-sided) dependency, in which cross-spectral analysis is currently an under-used tool. It is generally very efficient, both statistically and in terms of the time and effort required to obtain useful results. Early books which presented this methodology, such as Jenkins and Watts [10], are now, fortunately, supplemented by some recent, well received texts. These cover

the use of spectral analysis for identifying the transfer function coefficients v_k, by which a dependent series y_t is related to lagged values of the explanatory series x_t:

    y_t = v_0 x_t + v_1 x_{t-1} + v_2 x_{t-2} + ... + n_t.    (1)

Although cross-spectral analysis is based on frequency domain regression, its results can be expressed as estimates, over an appropriate lag window, of the transfer function coefficients. We illustrate this with a simple example, partly to encourage the re-introduction of such methods, but also to demonstrate, in part, why the subject moved away from them.
Figure 1(a) shows temperatures measured every minute by sensors in the cab and trailer of a transport vehicle. It is clear that the cab temperature lags the trailer temperature. Figure 1(b) shows the transfer function coefficients in this relationship, as estimated by cross-spectral analysis. The estimates were produced almost automatically, with little user intervention. Limits on the plot show that significant values are spread over lags 0 to 4, with a peak at lag 2. This represents a one-sided, or causal, relationship, that may be used to predict the cab temperature from the trailer temperature as shown in Figure 2(a).
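The vehicle data are not available here, so the fragment below illustrates the idea of estimating the v_k in (1) on simulated data; it uses a crude lagged least-squares regression purely as a stand-in for the cross-spectral estimator that produced Figure 1(b), with names and values that are entirely hypothetical.

set.seed(2)
n <- 500
x <- arima.sim(list(ar = 0.8), n)                          # explanatory series x_t
v <- c(0.1, 0.3, 0.4, 0.2, 0.1)                            # "true" v_0, ..., v_4
y <- stats::filter(x, v, sides = 1) + rnorm(n, sd = 0.2)   # y_t = sum_k v_k x_{t-k} + n_t
X <- embed(as.numeric(x), 5)                               # columns x_t, x_{t-1}, ..., x_{t-4}
fit <- lm(as.numeric(y)[5:n] ~ X - 1)                      # regress y_t on lagged x_t
round(coef(fit), 2)                                        # estimates of v_0, ..., v_4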

Figure 1: (a) Graphs of temperatures inside a transport vehicle trailer (solid line) and in the cab (dotted line), (b) lagged prediction coefficients obtained by cross-spectral analysis, for predicting cab temperature from trailer temperature, and (c) for predicting trailer temperature from cab temperature.

However, the desired aim was to predict trailer temperatures from the sensor in the cab. Figure 1(c) shows the estimated transfer function coefficients when the roles of the series are reversed. The significant values are spread over lags 0 to -2. The relationship is no longer causal and these coefficients cannot be used for prediction. But reasonable linear predictions of the trailer temperature from the cab temperature can still be constructed, as shown in Figure 2(b).
In general, cross-spectral estimation of prediction coefficients is limited
to one-sided or causal relationships. It can, therefore, be used successfully to
estimate input-output relationships in open loop systems, but the estimates
are distorted when applied to input-output data gathered under closed loop,
feedback control, conditions. A solution to this problem was presented by
Akaike and Nakagawa in the industrial context of designing a cement kiln controller. It was based on multivariate autoregressive modelling of the records of plant variables. The identified model could also be used directly in plant control by expressing it in state space form. The predictions in Figure 2(b) were obtained in this way.

Figure 2: Predictions of transport vehicle temperatures: (a) the cab temperatures (solid line) with values predicted (dotted line) from the trailer temperatures, (b) the trailer temperatures (solid line) with values predicted (dotted line) from the cab temperatures.
From that point on, time domain methods, and, particularly in the mul-
tivariate context, empirical autoregressive modelling, have dominated the
methodology for time series analysis. However, the spectacular success in the
univariate context, of autoregressive moving-average (ARMA) models and
their extensions to integrated and seasonal processes, has not carried over to
the multivariate context. Despite the fact that multivariate ARMA models
were formulated many years ago [15], and much effort has been been put into
procedures for their identification, see for example Tiao and Tsay [16], there
are very few examples of real applications compared with those of the multi-
variate (pure) autoregressive model. More successful has been the state space
identification of multivariate time series models, see for example Aoki [3], in
which the states are selected to form a basis of the multivariate time series
prediction space. Although these state models have a multivariate ARMA
representation, this is not required for their application to prediction and
control.
In the econometric literature, the multivariate (or vector) autoregressive
model is still dominant. Structural forms have been used to incorporate eco-
nomic constraints, and Bayesian formulations to incorporate prior beliefs, as
in Doan, Litterman and Sims [7]. The use of the concept of co-integration
to characterise and test for persistence in the relationships between multi-
variate series, has depended very much on vector autoregressions to account
for any residual autocorrelation in the error correction model. The reason
for this dominance must be, in large part, the simplicity of the multivariate
autoregressive model, and its convenience for order selection, estimation and

theoretical analysis. It also has the potential, by choice of a sufficiently high


order, to approximate closely any linear process.
The question is, therefore, whether the multivariate autoregressive model
does provide, essentially, for all our requirements in the world of linear mul-
tiple time series modelling. In asking this we will leave aside the problem of
seasonality, and restrict the question to non-seasonal series, because season-
ality can often be removed or modelled separately. The answer, we believe,
is yes in many cases. But there are important reasons why, in practice, the
multivariate autoregressive model is not fully adequate. The fact that ARMA
models are used for univariate series suggests that pure autoregressive mod-
els may be less than adequate. The reason may simply be parsimony. The
autoregressive approximation may require rather more coefficients than an
ARMA model, to achieve the same predictive accuracy. If a criterion such
as AIC [1] is used to select the order automatically, then the penalty on the
number of parameters may compromise this predictive accuracy, particularly
at high lead times, when the series length is small.
The number of coefficients in a multivariate autoregressive model will
generally be much greater, for a given order (maximum lag) of model, than
for a univariate model. The loss of predictive ability that results from the
requirement to choose a relatively low order model may therefore be much
more important. The class of models we describe in the next two sections
provides one possible, and simple, way to mitigate this loss of predictive
ability, without foregoing most of the attractive features of the standard
multivariate autoregression.

2 A basis for prediction


In both the discrete and the continuous case, the same idea underlies the
models that we formulate in the next section. A chosen, finite, number p
of weighted functions of the present and past values of the process will be
used as linear predictors of future values. We will call these the ZAR states
in the discrete case, and the CZAR states in the continuous case. For continuous
time series the models are expressed in terms of a continuous record of the
process, but they are also very useful in applications to irregularly sampled
data, or, in the case of multiple time series, when different series are recorded
regularly but at different sampling rates. In these contexts, the state space
form of the model is integrated to determine the state transition from the
time of each observation to the next.
A discrete model, very closely related to the univariate form of the discrete
model which we describe, was presented by Wahlberg and Hannan [17].
A continuous model, exactly equivalent to the univariate form of the continuous
model which we describe, was presented by Belcher, Hampton and
Tunnicliffe Wilson [4].
In the case of a discrete process x_t, t = 1, 2, 3, ..., the ZAR states
x_{t,k}, at time t, are defined for orders k = 0, 1, ..., p-1, by

x_{t,k} = W^k x_t,    (2)

where the operator W is known as the generalised shift operator, and is
defined in terms of the backward shift operator B and a specified smoothing
coefficient, or discount factor, θ, by

W = (B - θ)/(1 - θB) = -θ + (1 - θ²)(B + θB² + θ²B³ + ...).    (3)

In practice W is applied by the recursive calculation

x_{t,k+1} = x_{t-1,k} - θ x_{t,k} + θ x_{t-1,k+1},    (4)

taking x_{t,0} = x_t. The choice of θ is in the range 0 ≤ θ < 1, and in the case
θ = 0 the state x_{t,k} reduces to the lagged value x_{t-k}.
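The recursion (4) is straightforward to implement. The following minimal Python/NumPy sketch computes the ZAR states of a univariate series; the function name zar_states and the choice to start the higher-order states at zero at the beginning of the record are our own, not specified in the paper.

```python
import numpy as np

def zar_states(x, theta, p):
    """ZAR states x_{t,k}, k = 0, ..., p-1, of a univariate series x,
    computed by the recursion (4) with x_{t,0} = x_t.

    The states at the start of the record are set to zero for k >= 1;
    this start-up convention is an assumption made for illustration."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    s = np.zeros((n, p))
    s[:, 0] = x                                   # x_{t,0} = x_t
    for t in range(1, n):
        for k in range(p - 1):
            # x_{t,k+1} = x_{t-1,k} - theta*x_{t,k} + theta*x_{t-1,k+1}
            s[t, k + 1] = s[t - 1, k] - theta * s[t, k] + theta * s[t - 1, k + 1]
    return s
```

With theta = 0 the returned columns are simply the lagged values x_{t-k}, as noted above.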
For the continuous time process x(t), the CZAR states x_k(t), at time t,
are defined, for orders k = 0, 1, ..., p-1, by

x_k(t) = Z^k x(t),    (5)

where the operator Z is defined formally in terms of the Laplace (or differential)
operator s, and a decay rate constant κ, in the range κ > 0, by

Z = (1 - s/κ)/(1 + s/κ) = (κ - s)/(κ + s).    (6)

There is, however, no requirement of differentiability placed upon a series to
which this operator is applied, because it is well defined as

Z x(t) = -x(t) + 2κ ∫₀^∞ exp(-κr) x(t - r) dr,    (7)

for any second order stationary process x(t).
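Equation (7) shows that Z involves only an exponentially weighted average of the past, not a derivative. As a purely illustrative sketch (the function name, the left Riemann sum and the truncation of the integral at the start of the record are our own choices), it can be approximated from a regularly sampled record as follows:

```python
import numpy as np

def z_operator(x, kappa, delta):
    """Approximate Z x(t) at each sample of a regularly spaced record x,
    using a truncated Riemann sum for the integral in (7).

    x     : samples of x at spacing delta
    kappa : decay rate constant, kappa > 0
    The truncation and the simple Riemann sum are illustrative only."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    z = np.empty(n)
    for i in range(n):
        r = delta * np.arange(i + 1)              # lags 0, delta, ..., i*delta
        z[i] = -x[i] + 2.0 * kappa * np.sum(np.exp(-kappa * r) * x[i::-1]) * delta
    return z
```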


The operators are equally well defined when x_t or x(t) is a vector process
of dimension m, though we note that a set of mp scalar functions of the
present and past is then defined.
Figure 3 shows the weight functions applied to present and past values for
the orders k = 1, ..., 5 for the discrete operator, taking θ = 0.5, and orders
k = 1, 3 and 5 for the continuous case, taking, without loss of generality,
κ = 1.
In each case, if we were to let p → ∞, we would obtain a basis for the
present and past values of the series (taking time t as the present). The idea is
that if we are to limit the number p of linear functions of the present and past
that we use for predicting the future, then the states defined above give us
greater flexibility in the discrete case than the simple choice of lagged values
x_t, x_{t-1}, ..., x_{t-p+1}.


Figure 3: (a) Discrete weights for the first 5 orders of the ZAR operator,
(b) continuous weights for orders 1, 3 and 5 of the CZAR operator.

The effective range of past values that are weighted into the predictors is
approximately p(1 + θ)/(1 - θ), rather than p. In the continuous case the
effective range is approximately 2p/κ.
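As a concrete illustration, with θ = 0.75 a single application of W has a low frequency delay of (1 + 0.75)/(1 - 0.75) = 7 sampling intervals, and p = 6 states then weight a past range of roughly 6 × 7 = 42 intervals; this is consistent with the rainfall example of Section 4, where an hourly ZAR(6, 0.75) model captures a response extending over more than a day.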
There is no guarantee that, for a given discrete process, the choice of θ > 0
will define better predictors. However, consider a continuous process x(τ)
that is sampled at times τ = tδ to give the discrete process x_t. Defining the
ZAR states of x_t by setting θ = 1 - κδ, these will converge, appropriately,
to the CZAR states of x(τ) as δ → 0. The consequence of using the simple
lagged states x_{t-k}, regardless of how small δ might become, would lead in the
limit to states that were equivalent to x(τ) and its derivatives to order p-1.
There is in general no guarantee that these would exist. That is why the
pure autoregressive model in continuous time, which uses these derivatives as
its states, is unable to approximate an arbitrary continuous time stationary
process, even though the order is increased indefinitely.
For this reason, the advantage of the CZAR model, proposed in the next
section, over the standard continuous time autoregressive (CAR) model is
undeniable, in terms of empirical approximation. The success of the univariate
application of the CZAR model led us to consider the discrete ZAR
form. The foregoing argument suggests that whenever a discrete process
might be considered to be a sampled continuous process, the discrete ZAR
model should be preferred to the standard AR model, for its approximation.
The weight functions that we use to define the ZAR and CZAR states are
closely related to the respective discrete and continuous Laguerre functions,
which have the possible advantage of providing orthogonal bases of the past
and present. Partington [14] describes a variety of similar weight functions
that could be used to define a basis of the past observations of a discrete
process. Bray [6] uses a basis that differs from the Laguerre functions, but
may be orthogonalised to provide a similar basis.
Our use of the operator Z was developed from the application of the
Cayley-Hamilton transformation to reparameterisation of continuous time
models by Belcher et al. [4]. This transformation has been widely used to map

from continuous time to discrete time systems. Most famously, Wiener [19]
solved the prediction problem for continuous time series by transforming it
to that of prediction for a discrete parameter process. The exposition by
Doob [8, p. 582] sets this out clearly. The operator W may be motivated as
the discrete analogue of Z, in which the Moebius transformation of the unit
disk to itself replaces the Cayley-Hamilton transformation.

3 Extended autoregressive models


We propose models for zero mean stationary processes based on the previously
introduced concepts. These are readily extended, by the addition of
a constant term or other fixed regressors, to processes with non-zero mean.
In the following, θ and κ are taken to be pre-specified coefficients, with α_i
and φ_i model parameters. The ZAR(p, θ) model for a discrete vector process
x_t, that is implied by the use at time t-1 of the linear predictors defined
by (2), is

x_t = α₁ x_{t-1} + α₂ x_{t-1,1} + ... + α_p x_{t-1,p-1} + e_t,    (8)


where e_t is white noise with variance σ_e². When this model is true, e_t is the
linear innovation in x_t. This is the most convenient form for many purposes,
such as model estimation and prediction, and we call it the predictive form,
but we also present an algebraically equivalent form of this model, which we
term the natural form, as follows:

x_t = φ₁ x_{t,1} + φ₂ x_{t,2} + ... + φ_p x_{t,p} + n_t,    (9)

where n_t follows the AR(1) model:

n_t = θ n_{t-1} + e(t),    (10)

e(t) being white noise with variance σ_e². We also write (9) as

φ(W) x_t = (1 - φ₁ W - φ₂ W² - ... - φ_p W^p) x_t = n_t.    (11)
We describe (9) as the natural form of the model because the process
defined, for any fixed t, by

y_k = W^{-k} x_t,    (12)

is also a stationary process, and (9) is just a standard autoregressive approximation
of y_k.
We also note that (9) is equivalent to an ARMA(p, p-1) model with
a pre-specified moving average operator (1 - θB)^{p-1}. The model presented
by Wahlberg and Hannan [17], and the model of Morton and Tunnicliffe Wilson
[12], are very similar, except that they have ARMA(p, p) representations.
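As an illustration of the structure of the predictive form (8), the sketch below estimates the coefficients α₁, ..., α_p of a univariate ZAR(p, θ) model by ordinary least squares regression of x_t on the ZAR states at time t-1, reusing the zar_states function sketched after equation (4). This is only a conditional least squares illustration under our own simplifying choices; the examples below use multivariate models and state space methods.

```python
import numpy as np

def fit_zar(x, theta, p):
    """Conditional least squares fit of the univariate ZAR(p, theta)
    predictive form (8): regress x_t on the ZAR states at time t-1.

    Reuses the zar_states sketch given after equation (4); alpha[k]
    estimates the coefficient of x_{t-1,k}."""
    x = np.asarray(x, dtype=float)
    s = zar_states(x, theta, p)          # s[t, k] = x_{t,k}
    X = s[:-1, :]                        # states at time t-1
    y = x[1:]                            # response x_t
    alpha, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ alpha
    return alpha, resid.var()

# Setting theta = 0 recovers an ordinary AR(p) fit, so the two model
# classes can be compared on the same footing, e.g. by an AIC-type penalty.
```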
The CZAR(p, κ) model for a continuous vector process x(t) is analogous.
The predictive form of the model is

dx(t) = [α₁ x(t) + α₂ x₁(t) + ... + α_p x_{p-1}(t)] dt + dB(t),    (13)



where B(t) is Brownian motion with diffusion variance σ_B². The natural,
algebraically equivalent, form of this model is

x(t) = φ₁ x₁(t) + φ₂ x₂(t) + ... + φ_p x_p(t) + n(t),    (14)

where n(t) now follows the continuous time AR(1) model, or CAR(1) model:

dn(t) = -κ n(t) dt + dH(t),    (15)

where H(t) is Brownian motion with variance σ_H². We also write (14) as

φ(Z) x(t) = (1 - φ₁ Z - φ₂ Z² - ... - φ_p Z^p) x(t) = n(t).    (16)

We describe (14) as the natural form of the model because the process
defined, for any fixed t, by

y_k = Z^{-k} x(t),    (17)

is also a stationary process, and (14) is just a standard autoregressive
approximation of y_k. We note that (14) is equivalent to a CARMA(p, p-1)
model with moving average operator (κ + s)^{p-1}.

4 Examples
Our first example illustrates the effect on predictions of using a discrete
trivariate ZAR model for the three series of monthly flour prices that were
modelled by Tiao and Tsay [16].


Figure 4: Predictions of monthly flour prices at Buffalo, using past values of
three series of flour prices, at Buffalo, Minneapolis and Kansas City: (a) predictions
(dotted line) from a standard AR(2) model, (b) predictions from
a ZAR(2, 0.5) model.
In Figure 4 we see forecasts of just one of the three series, but made
using two trivariate models. Using the AIC [1], a standard AR(2) model
and a ZAR(2, 0.5) model were selected. This example illustrates that
forecasts made using the ZAR model tend to show less damped behaviour.
Although these are in-sample forecasts, and too much must not be read into

one such example, the ZAR model forecasts tend to predict better the turning
points of the irregular cyclical behaviour of the series.
The three flour price series were very similar in nature, and it is natural to
represent them by a symmetric vector autoregression. Our second example
is very different; the data arise from what is clearly an input-output system.
The rainfall is measured at two locations in a river catchment, and the
river-flow from the catchment is also measured. Figure 5 shows the hourly
measurements over a period slightly in excess of four days. The river-flow
record changes much more slowly than the rainfall record, and visual inspection
shows that the response from input to output is spread over a period
of several hours, possibly with a range of time constants reflecting some relatively
rapid, and some relatively slow, runoff. The objective is to use the
rainfall record to predict the river flow. The transfer function of this response
is difficult to estimate using spectral analysis because it is dispersed over so
many lags. The use of the ZAR model is appropriate here because of this
dispersed response. Using the AIC, a standard AR(2) model was selected for
the three series, whereas within the ZAR class a ZAR(6, 0.75) model was
selected. The choice of 0.75 for the smoothing parameter is not critical, but
was chosen because the low frequency delay in the W operator is then about
1.75/0.25 = 7 hours.


Figure 5: Hourly records of rainfall and river-flow in a single catchment:
(a) the solid and broken lines show the rainfall at two gauges in the catchment,
(b) the river flow.
A model of relatively low order can then capture a response covering
a period of more than one day. In fact the AR(2) model gave very poor
in-sample predictions, whereas the ZAR(6, 0.75) model produced extremely good
in-sample predictions. A fair comparison is illustrated using models of the
same order, an AR(3) and a ZAR(3, 0.75) model.
Figure 6(a) shows 'predictions' of the river flow from hour 20 using the
fitted AR(3) model. The model parameters are estimated using the first
80 values of all three series. Given these parameters, the predictions of river-flow
shown from hour 20 are constructed using the rainfall series alone over
that period. A state space representation is used with the Kalman filter
to compute this. The peak river-flow is substantially under-predicted. Figure
6(b) shows the corresponding predictions using the ZAR(3, 0.75) model.


Figure 6: Predictions of the river flow (solid line) using different models and
information: (a) the dotted line shows predictions using a trivariate AR(3)
model, based on river flow information up to hour 20, and full knowledge of
the rainfall throughout the record, (b) similar predictions using a trivariate
ZAR(3, 0.75) model, (c) predictions (broken line) are obtained as in (b), except
that the known rainfall is used only up to hour 50, and thereafter all
the series are predicted: the dotted lines show 90% probability limits for the
forecasts.

These are very close to the actuality. Figure 6(c) is constructed using the
same ZAR(3, 0.75) model, but no observations of rainfall or river-flow are used
beyond hour 50. The prediction limits are shown on this figure and widen
rapidly beyond that hour, but they provide a realistic and useful bound on
the peak river flow many hours later. The last 20 observations were not used
in model estimation, so their predictions are genuinely out-of-sample. In this
example the ZAR model reveals its potential.
Our first example of the CZAR model relates to discrete time series
with different, and varying, sampling intervals. Figure 7(a) shows monthly
Claimant Count (CC) figures that have long been used as a measure of unemployment.
A more recent measure of unemployment is the Labour
Force Survey (LFS) estimate, which is shown in the same figure. The LFS
estimate was recorded annually, then quarterly. In the figure, the quarterly
measurements have been interpolated monthly. These series were analysed
by Harvey and Chung [9], in which one of the aims was to estimate the slope
of the LFS series by using a bivariate model to 'borrow' information from the
more frequently observed CC series. A continuous time model is natural for
such series, and we estimated the bivariate CZAR(2, 0.5) model. We report
the use of this model for slope estimation in Morton and Tunnicliffe Wilson
[12]. Here, we illustrate its application to prediction. Figure 7(b) shows
forecasts of the LFS unemployment and their error limits obtained from this
model. The bivariate model enables good monthly forecasts to be produced,
from a point where only 8 annual values have been recorded.
Our final example is a bivariate model of data which is truly sampled
irregularly. Kirchner and Weil [11] present a compendium of marine fossil
records which indicate the pattern of extinctions and originations of marine
animals over the past 545 million years (Myrs). The records are arranged
into 108 stratigraphic intervals which vary in length from 2.5 to 12.5 Myrs,


Figure 7: (a) The Claimant Count (solid line) unemployment series, and the
Labour Force Survey (small circles) unemployment series, (b) the Labour
Force Survey series (solid line) with forecasts and forecast error limits (broken
lines).

Figure 8: (a) The series of originations and extinctions of genera, (b) the
estimated lagged cross-correlation function between these series.
and for each of these the number of families and genera of marine animals to
appear and disappear is documented.
The objective is to investigate the relationship between the series, and
in particular, the recovery of species following mass extinctions. Figure 8(a)
shows the series of genera. We fitted a bivariate CZAR(5, 0.5) model to the
logarithms of these series. Figure 8(b) shows the cross-correlation function
derived from this model. The peak is at a lag of 16 Myr, which is similar
to that obtained by Kirchner and Weil using other methods.

References
[1] Akaike H. (1973). A new look at statistical model identification. IEEE
Transactions on Automatic Control AC-19, 716-723.
[2] Akaike H., Nakagawa T. (1988). Statistical Analysis and Control of Dynamic
Systems. Kluwer, Dordrecht.
[3] Aoki M. (1990). State space modelling of time series. Springer-Verlag,
Berlin.
[4] Belcher J., Hampton J.S., Tunnicliffe Wilson G. (1994). Parameterisation
of continuous time autoregressive models for irregularly sampled time series
data. J. Royal Statist. Soc. B 56, 141-155.
[5] Box G.E.P., Jenkins G.M. (1970). Time series analysis: forecasting and
control. Holden-Day, San Francisco.
[6] Bray J. (1971). Dynamic equations for economic forecasting with the
G.D.P.-unemployment relation and the growth of G.D.P. in the United
Kingdom as an example. J. Royal Statist. Soc. A 134, 167-227.
[7] Doan T., Litterman R., Sims C. (1984). Forecasting and conditional projections
using realistic prior distributions. Econometric Reviews 3, 1-100.
[8] Doob J.L. (1953). Stochastic processes. Wiley.
[9] Harvey A.C., Chung C. (2000). Estimating the underlying change in unemployment
in the UK. J. Royal Statist. Soc. A 163, 303-340.
[10] Jenkins G.M., Watts D.G. (1969). Spectral Analysis and its Applications.
Holden-Day, San Francisco.
[11] Kirchner J.W., Weil A. (2000). Delayed biological recovery from extinctions
throughout the fossil record. Nature 404, 177-180.
[12] Morton A.S., Tunnicliffe Wilson G. (2001). Extracting economic cycles
using modified autoregressions. The Manchester School 69, 574-585.
[13] Morton A.S., Tunnicliffe Wilson G. (2003). A class of modified high order
autoregressive models with improved resolution of low frequency cycles. J.
Time Series Analysis, (to appear).
[14] Partington J.R. (1997). Interpolation, identification, and sampling.
Clarendon Press, Oxford.
[15] Quenouille M.H. (1957). The analysis of multiple time series. Griffin,
London.
[16] Tiao G.C., Tsay R.S. (1989). Model specification in multivariate time
series (with discussion). J. Royal Statist. Soc. B 51, 157-213.
[17] Wahlberg B., Hannan E.J. (1993). Parametric signal modelling using
Laguerre filters. The Annals of Applied Probability 3, 467-496.
[18] Whittle P. (1963). Prediction and regulation. English Universities Press,
London.
[19] Wiener N. (1949). Extrapolation, interpolation, and smoothing of stationary
time series. Cambridge, New York.

Address: G. Tunnicliffe-Wilson, A. Morton, Dept. of Mathematics and
Statistics, Lancaster University, UK
E-mail: G.Tunnicliffe-Wilson@lancaster.ac.uk

TOTAL LEAST SQUARES AND ERRORS-IN-VARIABLES MODELING:
BRIDGING THE GAP BETWEEN STATISTICS, COMPUTATIONAL
MATHEMATICS AND ENGINEERING

Sabine Van Huffel
Key words: Total least squares, errors-in-variables, orthogonal regression,
singular value decomposition, numerical algorithms.
COMPSTAT 2004 section: Numerical methods for statistics.

Abstract: The main purpose of this paper is to present an overview of the
progress of a modeling technique which is known as Total Least Squares
(TLS) in computational mathematics and engineering, and as Errors-In-Variables
(EIV) modeling or orthogonal regression in the statistical community.
The basic concepts of TLS and EIV modeling are presented. In
particular, it is shown how the seemingly different linear algebraic approach
of TLS, as studied in computational mathematics and applied in diverse engineering
fields, is related to EIV regression, as studied in the field of statistics.
Computational methods, as well as the main algebraic, sensitivity and
statistical properties of the estimators, are discussed. Furthermore, generalizations
of the basic concept of TLS and EIV modeling, such as structured
TLS, Lp approximations, nonlinear and polynomial EIV, are introduced and
applications of the technique in engineering are overviewed.

1 Introduction and problem formulation


The Total Least Squares (TLS) method is one of several linear parameter
estimation techniques that have been devised to compensate for data errors. The
basic motivation for TLS is the following: Let a set of multidimensional data
points (vectors) be given. How can one obtain a linear model that explains
these data? The idea is to modify all data points in such a way that some
norm of the modification is minimized subject to the constraint that the modified
vectors satisfy a linear relation. Although the name "total least squares"
appeared in the literature only 25 years ago [15], this method of fitting is certainly
not new and has a long history in the statistical literature, where the
method is known as "orthogonal regression", "errors-in-variables regression"
or "measurement error modeling". The univariate line fitting problem was
already discussed since 1877 [2]. More recently, the TLS approach to fitting
has also stimulated interest outside statistics. One of the main reasons for its
popularity is the availability of efficient and numerically robust algorithms in
which the Singular Value Decomposition (SVD) plays a prominent role [15].

Another reason is the fact that TLS is an application oriented procedure.
It is suited for situations in which all data are corrupted by noise, which is
almost always the case in engineering applications. In this sense, TLS and
EIV modeling are a powerful extension of classical least squares and ordinary
regression, which corresponds only to a partial modification of the data.
A comprehensive description of the state of the art on TLS from its conception
up to the summer of 1990 and its use in parameter estimation has
been presented in [33]. While the latter book is entirely devoted to TLS,
a second [34] and third book [35] present the progress in TLS and in the
broader field of errors-in-variables modeling respectively from 1990 till 1996
and from 1996 till 2001.
The problem of linear parameter estimation arises in a broad class of scientific
disciplines such as signal processing, automatic control, system theory
and in general engineering, statistics, physics, economics, biology, medicine,
etc. It starts from a model described by a linear equation:

ξ₁β₁ + ξ₂β₂ + ... + ξ_pβ_p = η,    (1)

where ξ₁, ..., ξ_p and η denote the variables and β = [β₁, ..., β_p]ᵀ ∈ ℝᵖ plays
the role of a parameter vector that characterizes the specific system. A basic
problem of applied mathematics is to determine an estimate of the true but
unknown parameters from certain measurements of the variables. This gives
rise to an overdetermined set of n linear equations (n > p):

Xβ ≈ y,    (2)

where the ith row of the data matrix X ∈ ℝⁿˣᵖ and of the vector y ∈ ℝⁿ contain
respectively the measurements of the variables ξ₁, ..., ξ_p and η.
In the classical least squares approach, as commonly used in ordinary
regression, the measurements X of the variables ξᵢ are assumed to be free
of error and hence, all errors are confined to the observation vector y. However,
this assumption is frequently unrealistic: sampling errors, human errors,
modeling errors and instrument errors may imply inaccuracies of the data
matrix X as well. One way to take errors in X into account is to introduce
perturbations also in X. Therefore, the following TLS problem was introduced
in the field of computational mathematics [14], [15] (R(X) denotes the
range of X and ‖X‖_F its Frobenius norm [16]):

Definition 1.1 (Total Least Squares problem). Given an overdetermined
set of n linear equations Xβ ≈ y in p unknowns β. The total least
squares problem seeks to

min_{Δ̂, ε̂, β̂} ‖[Δ̂ ε̂]‖_F  subject to  (X - Δ̂)β̂ = y - ε̂.    (3)

β̂ is called a TLS solution and [Δ̂ ε̂] the corresponding TLS correction.

This paper is organized as follows. Section 2 describes the univariate EIV
regression problem from a statistical point of view. Section 3 then formulates
the TLS problem from a computational point of view and shows the relationship
with univariate EIV regression. Next, Section 4 presents the SVD based
basic TLS algorithm, while Section 5 describes major properties of the TLS
approach. Furthermore, extensions of the technique are discussed in Section
6, while Section 7 overviews the many applications of TLS in engineering
fields. Finally, Section 8 gives the conclusions.

2 Univariate EIV regression: a statistical approach


2.1 Model formulation
For the simplest EIV model, the goal is to estimate from bivariate data
a straight line fit between two variables, both of which are measured with error.

Definition 2.1 (Univariate Ordinary Regression). For a sample size
of n, (ξᵢ, yᵢ), i = 1, ..., n, the standard regression model with one explanatory
variable is given by

β₀ + ξᵢβ₁ + εᵢ = yᵢ,  i = 1, ..., n,    (4)

where the independent variable ξᵢ is either fixed or random and the error εᵢ
has zero mean and is uncorrelated with ξᵢ.
The unknown intercept β₀ and slope β₁ are usually estimated using
a Least-Squares (LS) approach for reasons of computational efficiency.

Definition 2.2 (Univariate EIV Regression). For a sample size of n,
(xᵢ, yᵢ), i = 1, ..., n, the univariate EIV regression model is defined as follows.
The unobservable true variables (ξᵢ, ηᵢ) satisfy

β₀ + ξᵢβ₁ = ηᵢ,  i = 1, ..., n;    (5)

however, one observes (xᵢ, yᵢ), i = 1, ..., n, which are the true variables plus
additive errors (δᵢ, εᵢ):

xᵢ = ξᵢ + δᵢ  and  yᵢ = ηᵢ + εᵢ,  i = 1, ..., n.    (6)

Assume that δᵢ, εᵢ, i = 1, ..., n, all have finite variances, zero mean
(without loss of generality), and are uncorrelated, i.e., E(δᵢ) = E(εᵢ) = 0,
var(δᵢ) = σ_δ², var(εᵢ) = σ_ε² for all i, cov(δᵢ, δⱼ) = cov(εᵢ, εⱼ) = 0 for all i ≠ j,
and cov(δᵢ, εⱼ) = 0 for all i, j. Depending on the assumption about ξᵢ, three
different models are defined. If the ξᵢ are unknown constants, then the model
is known as a functional relationship. If the ξᵢ are independent identically
distributed (i.i.d.) random variables and independent of the errors, the model
is called a structural relationship and we have: E(ξᵢ) = μ and var(ξᵢ) = σ².
A generalization of both models is the ultrastructural relationship, which

assumes that the ξᵢ are independent random variables but not identically
distributed, i.e. having possibly different means μᵢ and common variance σ².
EIV regression looks like standard regression if one rewrites Eqs. (5)-(6) as

yᵢ = β₀ + xᵢβ₁ + ζᵢ,  with ζᵢ = εᵢ - δᵢβ₁,  i = 1, ..., n.    (7)

However, this is not the usual regression model: xᵢ is random and is correlated
with the error term ζᵢ, cov(xᵢ, ζᵢ) = -β₁σ_δ². This covariance is only zero when
σ_δ² = 0, which is the regression model, or when β₁ = 0, which is the trivial
case. If one attempts to use ordinary regression estimates (least squares) on
EIV regression modeled data, one obtains inconsistent estimates.
The seemingly minor change between model (4) and model (5)-(6) has
important practical and theoretical consequences. One of the most important
differences between both models concerns model identifiability. It is
common to assume that all random variables in the EIV regression model
are jointly normal. In this case, the structural and functional models are not
identifiable [7]. Side conditions need to be imposed, the most common of
which are the following: (1) the ratio of the error variances, λ ≡ σ_ε²/σ_δ², is
known; (2) σ_δ² is known; (3) σ_ε² is known; (4) both of the error variances, σ_δ²
and σ_ε², are known. The first assumption is the most popular and is the one
with the most published theoretical results, dating back to Adcock [2], [3]. It
also leads to the commonly known Orthogonal Regression (OR) estimator.
Indeed, if λ is known, the data can be scaled so that λ = 1. In this case,
the maximum likelihood solution of the normal EIV regression problem is
OR, which minimizes the sum of squares of the orthogonal distances from
the data points to the regression line instead of the sum of squares of the
vertical distances, as in standard regression (see Figure 1).
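Explicitly, the orthogonal distance from a point (xᵢ, yᵢ) to the line y = β₀ + β₁x is |yᵢ - β₀ - xᵢβ₁| / (1 + β₁²)^{1/2}, so OR chooses β₀ and β₁ to minimize Σᵢ (yᵢ - β₀ - xᵢβ₁)² / (1 + β₁²); as discussed in Section 2.2 below, this is exactly the λ = 1 case of the weighted least squares criterion (9).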

Figure 1: Standard regression (LS) and orthogonal regression (TLS).



2.2 Parameter estimation


Assume that the data have been properly scaled so that λ = 1. For the
functional relationship, the likelihood function is

L(β₀, β₁, σ_δ², ξ₁, ..., ξ_n) ∝ (σ_δ²)⁻ⁿ exp{ -(1/(2σ_δ²)) Σᵢ [(xᵢ - ξᵢ)² + (yᵢ - β₀ - ξᵢβ₁)²] }.    (8)
Note that δᵢ = xᵢ - ξᵢ and εᵢ = yᵢ - β₀ - ξᵢβ₁, so that maximizing (8)
requires minimizing Σ(δᵢ² + εᵢ²), which means that the sum of squares of
the orthogonal distances from the data points to the line is minimized. Adcock
[2], [3] considered the appropriate estimator to be orthogonal regression,
which has been rediscovered many times during the first half of the 20th century.
Lindley [23], however, considered a weighted least squares approach
to the model (7) as follows. Estimate β₀, β₁ by taking both errors εᵢ and
δᵢ into account to minimize a sum of weighted squared residuals, where the
weights are proportional to the reciprocal of the variance of the errors ζᵢ, i.e.,
σ_ε² + σ_δ²β₁². Thus, one minimizes:

Σ_{i=1}^n (yᵢ - β₀ - xᵢβ₁)² / (λ + β₁²).    (9)

This minimization problem is solved when λ is known or both σ_ε² and σ_δ² are
known. If λ = 1, the denominator reduces to 1 + β₁² and amounts to orthogonal
regression. Weighted least squares has drawn much attention in the
literature; see [7] for references. Since Sprent [28], the name has standardized
to generalized least squares. The success of generalized LS might give the
impression that it is the LS method for the EIV regression model. Since generalized
LS estimation only works for the no-equation-error model with the
error covariance matrix known up to a scalar multiple, a unified approach
for modifying LS to suit all different assumptions on the error covariance
structure is called for. Modified LS is such an approach. The normality
assumption on the errors (and on the true variables for the structural and
ultrastructural relationships) is not needed, only the existence of second moments.
From Eq. (7) it is clear that the ζᵢ are i.i.d. random variables with zero
mean and variance σ_ε² + σ_δ²β₁², regardless of the type of relationship. Cheng [7]
developed modified LS estimators for β₀ and β₁ by minimizing an unbiased
and consistent estimator of the appropriate unknown error variance. The
estimators are a function of the residuals. Assuming λ known, an appropriate
modified LS estimator for the unknown error variance σ_δ² is obtained by
minimizing

(10)

Minimizing Q with respect to β₀ and β₁ yields:

β̂₀ = ȳ - β̂₁ x̄  (where v̄ denotes the mean of a vector v),    (11)

β̂₁ = {s_yy - λ s_xx + [(s_yy - λ s_xx)² + 4λ s_xy²]^{1/2}} / (2 s_xy),  provided s_xy ≠ 0,    (12)

with s_xx = (1/n) Σ(xᵢ - x̄)², s_yy = (1/n) Σ(yᵢ - ȳ)² and s_xy = (1/n) Σ(xᵢ - x̄)(yᵢ - ȳ)
the sample variances and covariance. In summary, the statistical approach
seeks estimators of the EIV regression model with optimal statistical properties
(such as maximum likelihood, unbiasedness, consistency, etc.), mostly
reflecting asymptotic behaviour as n → ∞. If p > 1 explanatory variables
ξ are considered, the problem formulation can be extended, but the estimator
β̂ of dimension p can no longer be found analytically, as derived above,
but via an eigenvalue-eigenvector approach [12, 13] or an SVD approach (see
further).
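A direct implementation of the moment formulas (11)-(12) needs only the sample means, variances and covariance. The sketch below is a minimal NumPy version; the function and argument names are our own, and λ = 1 gives the orthogonal regression (TLS) slope.

```python
import numpy as np

def eiv_line(x, y, lam=1.0):
    """EIV regression estimates from the moment formulas (11)-(12).

    lam is the known error variance ratio lambda = sigma_eps^2 / sigma_delta^2;
    lam = 1 corresponds to orthogonal regression."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    sxx = np.mean((x - x.mean()) ** 2)
    syy = np.mean((y - y.mean()) ** 2)
    sxy = np.mean((x - x.mean()) * (y - y.mean()))
    if sxy == 0:
        raise ValueError("formula (12) requires s_xy != 0")
    d = syy - lam * sxx
    beta1 = (d + np.sqrt(d ** 2 + 4.0 * lam * sxy ** 2)) / (2.0 * sxy)
    beta0 = y.mean() - beta1 * x.mean()
    return beta0, beta1
```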

3 TLS and EIV regression: a computational approach


3.1 Model formulation
In computational mathematics, measurement errors in linear models are tackled
from a geometrical point of view, as explained in Section 1. To highlight
the difference with the statistical approach, we consider the univariate model
and first assume that the intercept is zero, i.e. β₀ = 0. It is assumed that the
true variables satisfy a compatible linear relationship, given by Eqs. (5)-(6).
The TLS approach then aims to find minimal corrections (in a LS sense) δ̂ᵢ
and ε̂ᵢ to the measured data xᵢ, yᵢ such that the corrected data xᵢ - δ̂ᵢ, yᵢ - ε̂ᵢ
satisfy exactly the unobserved relationship, i.e.
Definition 3.1 (Univariate TLS problem). Given (xᵢ, yᵢ), i = 1, ..., n,
satisfying Eqs. (5)-(6). Find corrections δ̂ᵢ and ε̂ᵢ and a slope estimate β̂₁
by minimizing

min_{δ̂ᵢ, ε̂ᵢ, β̂₁} Σ_{i=1}^n (δ̂ᵢ² + ε̂ᵢ²)  subject to  (xᵢ - δ̂ᵢ)β̂₁ = yᵢ - ε̂ᵢ,  i = 1, ..., n.    (13)

Solving this seemingly different minimization problem leads to the same
slope estimator β̂₁, called the TLS solution, as given in (12). If the underlying
relationship is an intercept model, as given by Eqs. (5)-(6), the same TLS
approach can be used provided the centered data xᵢ - x̄ and yᵢ - ȳ are used.
Alternatively, a mixed LS-TLS approach [33] can be applied to the original
data:
Definition 3.2 (Univariate mixed LS-TLS problem). Given (xᵢ, yᵢ),
i = 1, ..., n, satisfying Eqs. (5)-(6). Find corrections δ̂ᵢ and ε̂ᵢ, an intercept
estimate β̂₀ and a slope estimate β̂₁ by minimizing

min_{δ̂ᵢ, ε̂ᵢ, β̂₀, β̂₁} Σ_{i=1}^n (δ̂ᵢ² + ε̂ᵢ²)  subject to  β̂₀ + (xᵢ - δ̂ᵢ)β̂₁ = yᵢ - ε̂ᵢ,  i = 1, ..., n.    (14)

This approach is called mixed LS-TLS because the underlying relationship
between the true variables is equivalent with

wᵢβ₀ + ξᵢβ₁ = ηᵢ,  i = 1, ..., n,    (15)

where ηᵢ, ξᵢ are unobservable, as expressed by Eq. (6), and wᵢ ≡ 1 for all i is
exactly known. Therefore, no corrections are needed for the observations wᵢ,
in contrast to the corresponding observations xᵢ, yᵢ of ξᵢ, ηᵢ. Hence, the best
estimates are found via a mixture of a LS and TLS approach, see [33]. Solving
this mixed LS-TLS minimization problem leads to the same estimators
β̂₀, β̂₁, called the mixed LS-TLS solution, as given in (11)-(12). Hence, for
the univariate case, TLS in its simplest version is just orthogonal regression.
For p > 1 explanatory variables, the TLS problem formulation is generalized
as given in Definition 1.1. Further extensions are discussed in Section 6.

3.2 Historical remarks


Although the name 'total least squares' appeared only recently in the literature
[14], [15], this method of fitting is certainly not new and has a long
history in the statistical literature where the method is known as orthogonal
regression or errors-in-variables regression. Indeed, the univariate line fitting
problem (p = 1) was already discussed since 1877 [2]. Some well-known contributors
are Adcock [2], [3], Pearson [26], Koopmans [17], Madansky [24]
and York [37] (see [4], [7] for a list of references). The method of orthogonal
regression has been rediscovered many times, often independently. About
thirty years ago, the technique was extended to multiple regression problems
(p > 1) and later to multivariate problems which deal with more than one
observation vector y, e.g., [29], [13].
More recently, the TLS approach to fitting also stimulated interest outside
statistics. In the field of numerical analysis, this problem was first studied
by Golub and Van Loan [14], [15]. Their analysis, as well as their algorithm,
is strongly based on the SVD. Geometrical insight into the properties
of the SVD brought Staar [30] independently to the same concept. Van Huffel
and Vandewalle [32] generalized the algorithm of Golub and Van Loan
to all cases in which their algorithm fails to produce a solution, described
the properties of these so-called nongeneric TLS problems and proved that
the proposed generalization still satisfies the TLS criteria if additional constraints
are imposed on the solution space. This seemingly different linear
algebraic approach is actually equivalent to the method of multivariate EIV
regression analysis, studied by Gleser [13]. Gleser's method is based on an
eigenvalue-eigenvector analysis, while the TLS method uses the SVD which

is numerically more robust in the sense of algorithmic implementation. Furthermore,
the TLS algorithm computes the minimum norm solution (called
minimum norm TLS) whenever the TLS problem lacks a unique minimizer.
These extensions are not considered by Gleser.
In engineering fields, e.g., experimental modal analysis, the TLS technique
(more commonly known as the H_v technique) was also introduced about
20 years ago [21]. In the field of system identification, Levin [22] first studied
the problem. His method, called the eigenvector method or Koopmans-Levin
method [10], computes the same estimate as the TLS algorithm whenever
the TLS problem has a unique solution. Compensated least squares was yet
another name arising in this area: this method compensates for the bias in
the estimator, due to measurement error, and is shown to be asymptotically
equivalent to TLS [31]. Furthermore, in the area of signal processing, the
minimum norm method was introduced and shown to be equivalent to minimum
norm TLS [9]. Finally, the TLS approach is tightly related to the
maximum likelihood Principal Component Analysis (PCA) method used in
chemometrics [36].

4 Basic TLS algorithm and computational issues


We now analyze the TLS problem by making substantial use of the SVD.

Definition 4.1 (Singular Value Decomposition). The singular value
decomposition (SVD) of the n × (p+1) matrix [X y] is defined by

[X y] = U Σ Vᵀ,    (16)

where U = [u₁, ..., u_n], uᵢ ∈ ℝⁿ, UᵀU = I_n and V = [v₁, ..., v_{p+1}], vᵢ ∈
ℝ^{p+1}, VᵀV = I_{p+1} contain respectively the left and right singular vectors,
and Σ = diag(σ₁, ..., σ_r), r = min{n, p+1}, σ₁ ≥ ... ≥ σ_r ≥ 0, are the
singular values in decreasing order of magnitude.

To solve Eq. (2) with TLS, bring the set into the form:

[X y] [βᵀ  -1]ᵀ ≈ 0.    (17)

If σ_{p+1} ≠ 0, [X y] is of rank p+1 and the space S generated by the rows
of [X y] coincides with ℝ^{p+1}. There is no nonzero vector in the orthogonal
complement of S, hence the set of equations (17) is incompatible. In order to
obtain a solution, the rank of [X y] must be reduced to p. Using the Eckart-Young-Mirsky
theorem [16], the best rank p TLS approximation [X̂ ŷ] of
[X y], which minimizes the deviations in variance, is obtained by setting the
smallest singular value σ_{p+1} of [X y] to zero. The following theorem gives
conditions for the uniqueness and existence of a TLS solution (v_{ij} denotes
the (i, j)th entry of the matrix V):

Theorem 4.1. Solution of the basic TLS problem Xβ ≈ y.
Let (16) be the SVD of [X y] and σ_min(X) the smallest singular value of X.
If σ_min(X) > σ_{p+1}, the rank 1 TLS correction

[ΔX Δy] = [X y] - [X̂ ŷ] = σ_{p+1} u_{p+1} v_{p+1}ᵀ

solves the TLS problem (3), with [X̂ ŷ] = U Σ̂ Vᵀ, Σ̂ = diag(σ₁, ..., σ_p, 0), and the TLS solution

β̂ = -(1/v_{p+1,p+1}) [v_{1,p+1}, ..., v_{p,p+1}]ᵀ    (18)

exists and is the unique solution to X̂β = ŷ.


Note the equivalence: σ_min(X) > σ_{p+1} ⇔ σ_p > σ_{p+1} and v_{p+1,p+1} ≠ 0.
The following algorithm computes (if possible) a TLS solution β̂ of Xβ ≈
y such that (X - ΔX)β̂ = y - Δy and ‖[ΔX Δy]‖_F is minimal.
Algorithm 4.1. Basic TLS solution of Xβ ≈ y. Given X ∈ ℝⁿˣᵖ,
y ∈ ℝⁿ.
Step 1: Compute the SVD (16), i.e. [X y] = U Σ Vᵀ.
Step 2: If v_{p+1,p+1} ≠ 0 then β̂ = -(1/v_{p+1,p+1}) [v_{1,p+1}, ..., v_{p,p+1}]ᵀ.
For the univariate case (p = 1), one easily proves, using the basic properties
of eigenvalue and singular value decompositions, that the SVD based TLS
solution, given by β̂₁ = -v₁₂ v₂₂⁻¹, equals the analytical solution in Eq. (12).
The conditions σ_min(X) > σ_{p+1}, or equivalently σ_p > σ_{p+1} and v_{p+1,p+1}
≠ 0, ensure that Algorithm 4.1 computes the unique TLS solution of Xβ ≈
y. These conditions are generically satisfied provided X is of full rank and
the set Xβ ≈ y is not too conflicting. Hence, most TLS problems which
arise in practice can be solved by means of Algorithm 4.1, in which the TLS
solution is obtained by a simple scaling of the right singular vector of [X y]
corresponding to its smallest singular value.
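Algorithm 4.1 translates almost directly into a few lines of NumPy. The sketch below (the function name and the error handling are our own) forms the augmented matrix [X y], takes its SVD and scales the right singular vector belonging to the smallest singular value:

```python
import numpy as np

def basic_tls(X, y):
    """Basic TLS solution of X beta ~ y (Algorithm 4.1).

    Forms [X y], computes its SVD and scales the right singular vector
    that corresponds to the smallest singular value."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    y = np.asarray(y, dtype=float).reshape(-1, 1)
    Z = np.hstack([X, y])                 # the augmented matrix [X y]
    _, _, Vt = np.linalg.svd(Z)           # rows of Vt are right singular vectors
    v = Vt[-1, :]                         # vector for sigma_{p+1}
    if v[-1] == 0:
        raise ValueError("v_{p+1,p+1} = 0: no basic TLS solution")
    return -v[:-1] / v[-1]
```

Applied to the centered data xᵢ - x̄, yᵢ - ȳ in the univariate case, this reproduces the slope (12) with λ = 1, in line with the equivalence noted in Section 3.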
Extensions of this basic TLS problem to multivariate TLS problems
XB ≈ Y having more than one right hand side vector, to problems in which
the TLS solution is no longer unique or fails to have a solution altogether, and
to mixed LS-TLS problems that assume some of the columns of X to be error-free,
are considered in detail in [33]. In addition, it is shown how to speed
up the TLS computations directly by computing the SVD only partially or
iteratively if a good starting vector is available. More recent advances, e.g.
recursive TLS algorithms, neural based TLS algorithms, rank-revealing TLS
algorithms, regularized TLS algorithms, TLS algorithms for large scale problems,
etc., are reviewed in [34], [35].

5 TLS properties
Under specific conditions, the TLS solution, as introduced in numerical anal-
ysis, computes optimal parameter estimates in models with only measurement

error, referred to as classical errors-in-variables (EIV) models. This is shown
for the univariate case in Sections 2 and 3. These models are characterized
by the fact that the true values of the observed variables satisfy one or more
unknown but exact linear relations of the form (1). In particular, in case of
one underlying linear relation, we define:
Definition 5.1 (Multiple EIV regression model). Assume that the n
measurements in X, y are related to p unknowns β by:

Ξβ = η,  X = Ξ + Δ  and  y = η + ε,    (19)

where Δ, ε represent the measurement errors and all rows of [Δ ε] are i.i.d.
with zero mean and covariance matrix C, known up to a scalar multiple σ².
If additionally C = σ²I is assumed, with I the identity matrix (i.e. Δ_{ij} and
ε_i are uncorrelated random variables with equal variance), and lim_{n→∞} (1/n) ΞᵀΞ
exists and is positive definite, then it can be proven [12, 14] that the TLS
solution β̂_TLS of Xβ ≈ y estimates the true parameter values β, given by
(ΞᵀΞ)⁻¹Ξᵀη, consistently, i.e. β̂_TLS converges to β as n → ∞. This TLS
property does not depend on any assumed distributional form of the errors.
It should be noted that the TLS correction [Δ̂ ε̂], being of rank 1 as shown
in Theorem 4.1, cannot be considered as an appropriate estimator for the true
measurement errors Δ and ε added to the data [33], [15]. Note also that the
LS estimates are inconsistent in this case. In these cases, TLS gives better
estimates than does LS, as confirmed by simulations [33]. This situation may
occur far more often in practice than is recognized. It is very common in
agricultural, medical and economic science, in humanities, business and many
other data analysis situations. Hence TLS should be a quite useful tool for
data analysts. In fact, the key role and importance of LS in regression analysis
is the same as that of TLS in EIV regression. Nevertheless, a lot of confusion
exists in the fields of numerical analysis and statistics about the principle of
TLS and its relation to EIV modeling. In particular, the name "Total Least
Squares" is still largely unknown in the statistical community, while inversely
the concept of EIV modeling did not penetrate sufficiently well into the field
of computational mathematics and engineering. Roughly speaking, TLS is a
special case of EIV estimation and, as such, TLS is reduced to a method in
statistics but, on the other hand, TLS appears in many other fields, where
mainly the data modification idea is used and explained from a geometric
point of view, independently from its statistical interpretation.
Let us now discuss some of the main properties of the TLS method by
comparing them with those of LS. First of all, a lot of insight can be gained
by comparing their analytical expressions, given by:

LS:   β̂_LS  = (XᵀX)⁻¹ Xᵀ y,    (20)
TLS:  β̂_TLS = (XᵀX - σ²_{p+1} I)⁻¹ Xᵀ y,    (21)

with X of full rank and σ_{p+1} the smallest singular value of [X y].

From a numerical analyst's point of view, these formulas tell us that the
TLS solution is more ill-conditioned than the LS solution since it has a higher
condition number. This implies that errors in the data are more likely to affect the
TLS solution than the LS solution. This is particularly true under worst case
perturbations. Hence, TLS can be considered as a kind of deregularizing
procedure. However, from a statistical point of view, these formulas tell us
that TLS is doing the right thing in the presence of i.i.d. equally sized errors
since it removes (asymptotically) the bias by subtracting the error covariance
matrix (estimated by σ²_{p+1} I) from the data covariance matrix XᵀX.
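The closed form (21) can be checked numerically against the SVD based solution of Section 4: subtracting σ²_{p+1} I from XᵀX before solving the normal equations should reproduce, up to rounding error, the output of the basic_tls sketch given there. A minimal version, assuming X of full rank and σ_min(X) > σ_{p+1} (the function name is ours):

```python
import numpy as np

def tls_closed_form(X, y):
    """TLS solution via the closed form (21), assuming X has full rank
    and sigma_min(X) > sigma_{p+1}; should agree with the SVD based
    basic_tls() sketch of Section 4 up to rounding error."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    y = np.asarray(y, dtype=float).reshape(-1, 1)
    sigma = np.linalg.svd(np.hstack([X, y]), compute_uv=False)[-1]
    p = X.shape[1]
    beta = np.linalg.solve(X.T @ X - sigma ** 2 * np.eye(p), X.T @ y)
    return beta.ravel()
```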
Secondly, while LS minimizes a sum of squared residuals, TLS minimizes
a sum of weighted squared residuals, expressed as follows:

LS:   min_β (y - Xβ)ᵀ(y - Xβ),    (22)
TLS:  min_β (y - Xβ)ᵀ(y - Xβ) / (1 + βᵀβ).    (23)

From a numerical analyst's point of view, we say that TLS minimizes the
Rayleigh quotient. From a statistical point of view, we say that we weight
the residuals by multiplying them with the inverse of the corresponding error
covariance matrix (up to a scaling factor) to derive consistent estimates.
Other properties of TLS, which were studied in the field of numerical
analysis, concern its sensitivity in the presence of errors on all data [33]. Differences
between the LS and TLS solution are shown to increase when the ratio
σ_p([X y])/σ_min(X) is growing. This is the case when the set of equations
Xβ ≈ y becomes less compatible, when the vector y is growing in length and
when X tends to be rank-deficient. Assuming i.i.d. equally sized errors, the
improved accuracy of the TLS solution compared to that of LS is maximal
when the orthogonal projection of y is parallel with the pth singular vector
of X, corresponding to σ_min(X). Additional algebraic connections and sensitivity
properties of the TLS and LS problems, as well as many more statistical
properties of the TLS estimators, based on knowledge of the distribution of
the errors in the data, have been described; see [33], [34] for an overview.

6 TLS extensions
The statistical model that corresponds to the basic TLS approach is the no-equation-error
EIV regression model with the restrictive condition that the
measurement errors on the data are i.i.d. with zero mean and common error
covariance matrix, equal to the identity matrix up to an unknown scalar.
Most published TLS algorithms just handle this case, while other more useful
EIV regression estimators did not receive enough attention in computational
mathematics. To relax these restrictions, several extensions of the TLS problem
have been investigated. In particular, the mixed LS-TLS problem formulation
allows one to extend consistency of the TLS estimator in EIV models
where some of the variables ξᵢ are measured without error. The data least

squares problem refers to the special case in which all variables except η are
measured with error, and was introduced in the field of signal processing by
DeGroat and Dowling [8] in the mid nineties. Whenever the errors are independent
but unequally sized, weighted TLS problems should be considered,
using appropriate diagonal scaling matrices in order to maintain consistency.
If, additionally, the errors are also correlated, then the generalized TLS problem
formulation allows one to extend consistency of the TLS estimator in EIV
models, provided the corresponding error covariance matrix is known up to
a factor of proportionality (see definition 7). More general problem formulations,
such as restricted TLS, which also allow the incorporation of equality
constraints, have been proposed, as well as equivalent problem formulations
using other Lp norms and resulting in the so-called Total Lp approximations
(see [33] for references). The latter problems proved to be useful in the presence
of outliers. Robustness of the TLS solution is also improved by adding
regularization, resulting in the regularized TLS methods [11], [27], [35]. In
addition, various types of bounded uncertainties have been proposed in order
to improve robustness of the estimators under various noise conditions, and
algorithms are outlined [34], [35].
Furthermore, constrained TLS problems have been formulated. Arun [5]
addressed the unitarily constrained TLS problem, i.e., XB ≈ Y, subject
to the constraint that the solution matrix B should be unitary. He proved
that this solution is the same as the solution to the orthogonal Procrustes
problem [16, p. 582]. Abatzoglou et al. [1] considered yet another constrained
TLS problem, which extends the classical TLS problem (3) to the case where
the errors [Δ ε] in the data [X y] are algebraically related. However, if there
is a linear dependence among the error entries in [Δ ε], then the TLS solution
no longer has optimal statistical properties (e.g. maximum likelihood in case
of normality). This happens, for instance, in dynamic system modeling, e.g.,
in system identification when we try to estimate the impulse response of
a system from its input and output by discrete deconvolution. In these so-called
structured TLS problems, the data matrix [X y] is structured, typically
block Toeplitz or Hankel. In order to preserve maximum likelihood properties
and consistency of the solution [1], [18], the TLS problem formulation, given
in Definition 1.1, must be extended with the additional constraint that any
(affine) structure of X or [X y] must be preserved in the correction Δ̂ or [Δ̂ ε̂],
where Δ̂ and ε̂ are chosen to minimize the error in the discrete L₁, L₂ and L∞
norms. For L₂ norm minimization, various computational algorithms have
been presented, as surveyed in [34], [35], and shown to reduce the computation
time by exploiting the matrix structure in the computations. In addition, it
is shown how to extend the problem and solve it if latency or equation errors
are included. Recently, robustness of the structured TLS solution has been
improved by adding regularization, see e.g. [25].
Yet another important extension is the elementwise-weighted TLS (EW-TLS)
estimator, which computes consistent estimates in linear EIV models,

where the measurement errors are elementwise differently sized or, more generally,
where the corresponding error covariance matrices may differ from
row to row. Some of the variables are allowed to be exactly known (observable)
[19], [35]. Mild conditions for weak consistency of the EW-TLS
estimator are given and an iterative procedure to compute it is proposed.
Finally, we mention the important extension to nonlinear EIV models,
nicely studied in the book of Carroll, Ruppert and Stefanski [6]. In these
models, the relationship between the variables ξᵢ and η is assumed to be
nonlinear. It is important to notice here that the close relationship between
nonlinear TLS and EIV ceases to exist. Indeed, consider the bilinear EIV
model XBG ≈ Y, in which X, G, and Y are affected by measurement errors.
Applying TLS to this model leads to the following bilinear TLS problem:

min_{Δ_X, Δ_G, Δ_Y, B} ‖[Δ_X Δ_G Δ_Y]‖²_F  s.t.  (X - Δ_X) B (G - Δ_G) = Y - Δ_Y.

However, solving this problem yields inconsistent estimates of B [12]. A consistent
estimate can be obtained [20] using the adjusted LS estimator (the
full rank case is considered here for reasons of simplicity):

with V_X = E(Δ_XᵀΔ_X), V_G = E(Δ_G Δ_Gᵀ), and Δ_X and Δ_G representing the errors
on X and G respectively. Corrections for small samples have been derived
and shown to give superior performance for small sized problems. Various
other types of nonlinear EIV models, including bilinear, polynomial, nonlinear
functional, semi-linear and Cox's proportional hazards models, have been
considered and consistent estimators are derived; see [35] for an overview.

7 Applications in engineering fields


Since the publication of the SVD based TLS algorithm [15], many new TLS
algorithms have been developed and, as a result, the number of applications
of TLS and EIV modeling has increased exponentially in the last decade,
because of its emergence in new fields such as computer vision, image reconstruction,
speech and audio processing, and its gain in popularity in fields
such as signal processing, modal and spectral analysis, system identification and
astronomy. In [34], [35], the use of TLS and errors-in-variables models in
the most important application fields, such as signal processing and system
identification, is surveyed and new algorithms that apply the TLS concept
to the model characteristics used in those fields are described. In these fields,
the structured TLS approach is important. In particular, a lot of common
problems in system identification and signal processing can be reduced to
special types of structured TLS problems, including block Hankel or Toeplitz
matrix structures, the essence of which is the LS approximation of a given
matrix by a rank-deficient one. For example, in system identification the

well-known Kalman filtering is extended to the errors-in-variables context, in
which noise on the inputs as well as on the outputs is taken into account,
thereby improving the filtering performance. In the field of signal processing,
in particular in-vivo magnetic resonance spectroscopy and audio coding, new
state-space based methods have been derived by making use of the TLS ap-
proach for spectral estimation, with extensions to decimation and multichan-
nel data quantification. In addition, it has been shown how to extend the
least mean squares (LMS) algorithm to the EIV context for use in adaptive
signal processing and various noise environments. Finally, TLS applications
also emerge in other fields, including information retrieval, image reconstruc-
tion, multivariate calibration, astronomy, and computer vision. It is shown
in [35] how the TLS approach and its generalizations, including structured,
regularized and generalized TLS, can be successfully applied.
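As a minimal illustration of the rank-deficient LS approximation that underlies these structured problems, the sketch below (Python/NumPy, with an invented toy signal; not code from the paper) builds a Hankel matrix from noisy data and replaces it by its best rank-r approximation via a truncated SVD. Note that the truncated SVD ignores the Hankel structure of the correction; a genuine structured TLS solver, e.g. [1] or [25], additionally keeps the low-rank approximant (block) Hankel or Toeplitz.

import numpy as np
from scipy.linalg import hankel

def truncated_svd_approx(A, r):
    """Best rank-r approximation of A in Frobenius norm (Eckart-Young).
    This is the unstructured core of structured TLS; a structured TLS
    solver would also force the approximant to stay Hankel."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

# toy example: a noisy signal arranged as a 10 x 11 Hankel data matrix
y = np.sin(0.3 * np.arange(20)) + 0.05 * np.random.randn(20)
A = hankel(y[:10], y[9:])
A_low = truncated_svd_approx(A, r=2)   # rank-2 LS approximation of A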
This list of applications of TLS and EIV modeling is certainly not exhaustive,
but it clearly illustrates the increased interest in TLS and EIV modeling
in engineering over the past 20 years.

8 Conclusions

The basic principle of TLS is that the noisy data [X y], while not satisfying
a linear relation, are modified with minimal effort, as measured by the Frobe-
nius norm, into a 'nearby' matrix [X̂ ŷ] which is rank-deficient, so that the set
X̂β = ŷ is compatible. This matrix [X̂ ŷ] is a rank-one modification of the
data matrix [X y]. The solution to the TLS problem can be determined from
the SVD of the matrix [X y]. A simple algorithm outlines the computations
of the solution of the basic TLS problem. By 'basic' is meant that only one
right-hand side vector y is considered and that the TLS problem is solvable
(generic) and has a unique solution. Extensions of this basic TLS problem
are discussed. Much of the literature concerns the classical TLS problem
Xβ ≈ y, in which all columns of X are subject to errors, but more general
TLS problems, as well as other problems related to classical TLS, have been
proposed and are briefly overviewed here.
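Because the basic (generic) TLS solution follows directly from this SVD characterization, a minimal sketch is given below (Python/NumPy; an illustration in the spirit of [15], [16], not the author's code). It assumes the generic case just described: the smallest singular value of [X y] is simple and the last component of the corresponding right singular vector is nonzero.

import numpy as np

def tls_basic(X, y):
    """Basic TLS solution of X beta ~ y from the SVD of [X y] (cf. [15]).
    Assumes the generic case: the right singular vector belonging to the
    smallest singular value has a nonzero last component."""
    n = X.shape[1]
    _, _, Vt = np.linalg.svd(np.column_stack([X, y]))
    v = Vt[-1, :]                     # right singular vector of sigma_min
    if np.isclose(v[n], 0.0):
        raise ValueError("nongeneric TLS problem: last component of v is zero")
    return -v[:n] / v[n]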
Engineering applications of the Total Least Squares (TLS) technique have
been overviewed. TLS has its roots in statistics, where it can be defined as
a special case of classical Errors-in-Variables (EIV) regression in which all
measurement errors on the data are i.i.d. with zero mean and equal vari-
ance. Due to the development of a powerful algorithm based on the SVD
in computational mathematics, the method became very popular in engineer-
ing applications. This is a nice example of interdisciplinary work. However,
the danger exists that researchers will focus their attention on the wrong
problems, which are either unreasonable from a statistical point of view (e.g.
biased, inconsistent, not efficient) or not practically useful from an engineer-
ing point of view (e.g. assumptions never satisfied). This paper invites every
reader to open the frontiers of his or her own discipline and look over the border
into neighbouring areas, so that any engineering problem dealing with
measurement error is studied in a correct way.

References
[1] Abatzoglou T.J., Mendel J.M. and Harada G.A. (1991). The constrained
total least squares technique and its applications to harmonic superreso-
lution. IEEE Trans. Acoust., Speech & Signal Processing 39, 1070-1087.
[2] Adcock R.J. (1877). A problem in least squares. The Analyst 4, 183-184.
[3] Adcock R.J. (1878). A problem in least squares. The Analyst 5, 53-54.
[4] Anderson T.W. (1984). The 1982 Wald memorial lectures: Estimating
linear statistical relationships. Ann. Statist. 12, 1-45.
[5] Arun K.S. (1992). A unitarily constrained total least-squares problem in
signal-processing. SIAM J. Matrix Anal. Appl. 13, 729-745.
[6] Carroll R.J., Ruppert D. and Stefanski L.A. (1995). Measurement error
in nonlinear models. Chapman & Hall/CRC, London.
[7] Cheng C.-L. and Van Ness J.W. (1999). Statistical regression with mea-
surement error. Arnold, London.
[8] Degroat R.D. and Dowling E.M. (1993). The data least squares problem
and channel equalization. IEEE Trans. Sign. Process. 41, 407-411.
[9] Dowling E.M. and Degroat R.D. (1991). The equivalence of the total
least-squares and minimum norm methods. IEEE Trans. Sign. Process.
39, 1891-1892.
[10] Fernando K.V. and Nicholson H. (1985). Identification of linear systems
with input and output noise: the Koopmans-Levin method. IEE Proc.
D 132, 30-36.
[11] Fierro R.D., Golub G.H., Hansen P.C. and O'Leary D.P. (1997). Regu-
larization by truncated total least squares. SIAM J. Sci. Comput. 18,
1223-1241.
[12] Fuller W.A. (1987). Measurement error models. John Wiley, New York.
[13] Gleser L.J. (1981). Estimation in a multivariate "errors in variables"
regression model: Large sample results. Ann. Statist. 9, 24-44.
[14] Golub G.H. (1973). Some modified matrix eigenvalue problems. SIAM
Review 15, 318-344.
[15] Golub G.H. and Van Loan C.F. (1980). An analysis of the total least
squares problem. SIAM J. Numer. Anal. 17, 883-893.
[16] Golub G.H. and Van Loan C.F. (1996). Matrix computations. 3rd ed.,
The Johns Hopkins Univ. Press, Baltimore.
[17] Koopmans T.C. (1937). Linear regression analysis of economic time se-
ries. De Erven F. Bohn N.V., Haarlem.
[18] Kukush A., Markovsky I. and Van Huffel S. (2004). Consistency of the
structured total least squares estimator in a multivariate model. Journal
of Statistical Planning and Inference, to appear.
[19] Kukush A. and Van Huffel S. (2004). Consistency of elementwise-
weighted total least squares estimator in a multivariate errors-in-
variables model AX=B. Metrika 59, issue 1, to appear.
[20] Kukush A., Markovsky I. and Van Huffel S. (2003). Consistent estima-
tion in the bilinear multivariate errors-in-variables model. Metrika 57,
253-285.
[21] Leuridan J., De Vis D., Van Der Auweraer H. and Lembregts F. (1986).
A comparison of some frequency response function measurement tech-
niques. Proc. 4th Int. Modal Analysis Conf., Los Angeles, CA, Feb. 3-6,
908-918.
[22] Levin M.J. (1964). Estimation of a system pulse transfer function in the
presence of noise. IEEE Trans. Automat. Contr. 9, 229-235.
[23] Lindley D.V. (1947). Regression lines and the linear functional relation-
ship. J.R. Statist. Soc. Suppl. 9, 218-244.
[24] Madansky A. (1959). The fitting of straight lines when both variables are
subject to error. J. Amer. Statist. Assoc. 54, 173-205.
[25] Mastronardi N., Lemmerling P. and Van Huffel S. (2004). Fast regular-
ized structured total least squares algorithm for solving the basic decon-
volution problem. Numer. Lin. Alg. with Appl., to appear.
[26] Pearson K. (1901). On lines and planes of closest fit to points in space.
Philos. Mag. 2, 559-572.
[27] Sima D., Van Huffel S. and Golub G.H. (2004). Regularized total least
squares based on quadratic eigenvalue problem solvers. BIT, to appear.
[28] Sprent P. (1966). A generalized least squares approach to linear func-
tional relationships. J.R. Statist. Soc. B 28, 278-297.
[29] Sprent P. (1969). Models in regression and related topics. Methuen &
Co. Ltd., London, UK.
[30] Staar J. (1982). Concepts for reliable modelling of linear systems with
application to on-line identification of multivariable state space descrip-
tions. PhD thesis, Dept. EE, K.U.Leuven, Leuven, Belgium.
[31] Stoica P. and Soderstrom T. (1982). Bias correction in least squares iden-
tification. Int. J. Control 35, 449-457.
[32] Van Huffel S. and Vandewalle J. (1988). Analysis and solution of the
nongeneric total least squares problem. SIAM J. Matrix Anal. Appl. 9,
360-372.
[33] Van Huffel S. and Vandewalle J. (1991). The total least squares problem:
computational aspects and analysis. SIAM, Philadelphia.
[34] Van Huffel S., editor (1997). Recent advances in total least squares
techniques and errors-in-variables modeling. SIAM Proceedings series,
SIAM, Philadelphia.
[35] Van Huffel S. and Lemmerling P., editors (2002). Total least squares and
errors-in-variables modeling: Analysis, algorithms and applications.
Kluwer Academic Publishers, Dordrecht.
[36] Wentzell P.D., Andrews D.T., Hamilton D.C., Faber K. and Kowal-
ski B.R. (1997). Maximum likelihood principal component analysis. J.
Chemometrics 11, 339-366.
[37] York D. (1966). Least squares fitting of a straight line. Can. J. of Physics
44, 1079-1086.

Acknowledgement: Dr. Sabine Van Huffel is a full professor at the Katholieke
Universiteit Leuven, Belgium. Research supported by the KU Leuven re-
search council (GOA-Mefisto 666), the Flemish Government (FWO projects
G.0078.01, G.0269.02, G.0270.02, research communities ICCoS, ANMMM),
and the Belgian Federal Government (IUAP V-22).
Address: S. Van Huffel, Katholieke Universiteit Leuven, Department of Elec-
trical Engineering, Division ESAT-SCD, Kasteelpark Arenberg 10, 3001 Leu-
ven, Belgium
E-mail: sabine.vanhuffel@esat.kuleuven.ac.be
Author Index

Bognar T 713
Bouchard G 721
Boudou A 737
Abb as I.. 1519
Boukhet ala K 737, 1577
Achcar J.A 581
Bourd eau M 417
Acosta L 1551
Braverman A 61
Adachi K 589
Brewer M.J 745
Aguilera A.M 997
Brys Goo 753
Ait-K aci S 737
Buckley F 1677
Ali A.A 37
Buj a A 477
Almeida R 597
Burdakov 0 761
Amari S 49
Ambroi se Ch 1759 Cao R 1569
Amendola A 605 Caragea D 823
An H 1397 Cardo t H 769, 777
Ando T 1309 Carne X 1519
Aoki S 1179 Carr D.B 73
Araki Yoo 613 Casanovas J 1519
Arcos A 1085 Caumont 0 737
Arhipov Soo 621 Ceranka B 785
Arh ipova 1. 629 Chauchat J .-H 1245
Aria M 1807 Chen C.-H 85
Arn aiz J.A 1519 Choulakian V 793
Arteche J 637 Chretien S.B 799
Arti aga R 1569 Christodoulou C 807
Artiles J 1733 Church K.W 381
Atkinson A 405 ClemenQon S 679
Atkinson RA 113 Cleroux R 1393
Cobo E. 1519
Balina S 629 Coifman RRoo 381
Banks D 251 Conversano C 815 , 1807
Bartkowiak A 647 Cook D 823 , 1397
Basti en P 655 Corset F 799
Bayraksan G 663 Costanzo G.D 831
Beran R 671 Crambes Ch 769
Bertail P 679 Cramer K. 101
Betinec M 689 Crane M 1783
Biffignandi S 697 Critchley F 113
Binder H 705 Croux C 839
Csicsman J 847 Fort G 1019


Cuevas A 127 Fraiman R. 127
Cwiklinska-Jurkowska M 855 Francisco-Fernandez M 1027
Fried R. 159
Capek V 863 Frolov A.A 1035 , 1725
Cizek P 871 Fueda K. 1229 , 1527
Dabo-Niang S 879 Fujino T 1043 , 1229
Dawson L.A 745 Fujiwara T oo 2003
Debruyne M 893 Fung W .K 0 149
Deistler M 137
Gamrot W 1053
Derquenne Coo 895
Gatell J .M 1519
Di Bucchianico A 903
Gather U 0 159
Di Iorio F 911
Celnarova E 1061
Di Zio M 919, 927
Gentleman R. 171
Dimova R. 1585
Ghosh S 181
Dodge Y. 935
Giordano F 1077
Doray L.G. 943
Giron F.J 1709
Dorta-Guerra R. 951
Gonzalez S 1085
Downie T.R. 959
Gonzalez-Davila E 951
Duffull S.B. 1963
Gonzalez Aguilera S 1701
Dufour J.-M 967
Duller C 975
Dumais J 1245
Graczyk M 785
Granger C.W.J 1413
Eccleston J.A 1963 Grassini L 1095
Eichhorn B.H 981 Gray A 1101
Elston D.A 745 Grendar M 1109
Engelen S 989 Grimvall A 761
Escabias M. 997 Groos J 189
Esser M 1255 Grossmann W 1
Grün B 1115
Fabian Z 1005 Guarnera U 919
Faivre R. 777 Guglielmi R. 381
Fenyes C 847 Gunning P 1123
Fernandez-Aguirre K. 1013
Fernandez-Villodres G 1717 Haesbroeck G 113
Ferraty F 879 Hafidi B 1131
Ferraty F 0 879 Hafidi B 0 1131
Filzmoser P 0 1585 Hanafi M 1141
Fonseca P 1519 Hanzon B 0 137
Harper W .V 1149 J arosova E 1255


Hayashi A 1157 J erak A 1263
Haziza A 943 J oossens K. 839
Healy D.M 381 Jurkowski P 855
Heinzl H 199 Juutil ain en 1. 1271
Heit zig J 1163
Held L 213 Kaarik E 1279
Hennig C 1171 Kafad ar K 287
Hernandez C. N 1733 Kahn B 61
Hirot su C 1179 Kalin a J 1287
Hlubinka D 1185 Kamps U 101
Ho YH.S 1193 Kannisto J 1295
Hoan g T .M 1201 Kao C.-H 85
Hofmann H 223, 1397 Kar agrigoriou A 807
Honavar V 823 Karakos D 381
Hond a K 1209 Kar akost as K.X 1901
Horgan J .M 1123 Katina S 1301
Hornik K. 235 Kawasaki Y 1309
Hothorn L.A 1353 Kiers H.A.L 303
House L.L 251 Killen L 1677
Hey M 261 Kim D 1397
Hrach K 1217 Kim J 1397
Huskova M 903, 1221 Kinns D 113
Hubert M .. 753, 893, 989, 1925, Klast erecky P 903
Klaschka J 1317
1933, 1941
Klinke S 1323
Huh M .Y 277
Kn eip A 315
Hiisek D 1035, 1725
Kobayashi I 2003
Hussian M 761
Kolacek J 1329
Hwu H.-G 85
Komarkova L 1337
Iizuka M 1229, 1527 Komarek P 1101
Imo t o S 613 Komornfk J 713
Ingrassia S 831, 1237 Komornfkova M 713
Kond ylis A 935
J acobs M.Q 381 Koms . hi1 S 613
J afari Kh aledi M 1511 Kopp-Schneider A 189
J alam R. 1245 Kotrc E 1767
J an g W .-J 85 Koubkova A 1345
J an sson M 37 Kropf S ',' 1353
Krecan L 1361 Marti-Recobe r M 1551


Krivy 1. 1917 Martin-Arroyu elos A 1013
Kukush A 1369 Mart ens H 261
Kurkova V 1377 Mar tinez A.R 327
Kurod a M 1385 Mar tinez W.L 327
Kuwabara R 1869 Martinez E.Z 581
Matei A 1471
La Ro cca M 1077 Mayes RW 745
Lafosse R 1141 McCann L 1481
Laguna P 597 Meint anis S 1221
Lambert-Lacroix S 1019 Michalak K. 1489
Lazraq A 1393 Min W 339
Lee E.-K. 1397 Mit tl bo eck M 199
Leisch F 1115, 1405 Miwa T 1497
Lemmens A 839 Mizera 1. 1301
Li C.K 149 Mizut a M 1503, 1791
Lin D 381 Mkh adri A 1131
Lin J.-L 1413 Mohammadzadeh M 1511
Lipinski P 1421, 1489 Monleon T 1519
Liu T 1101 Mont an a G 1885
Louzad a- Net o F 581 Montero J 1519
Lu G 113 Montoro-Cazorla D 1717
Luebke K 1429 Moore A 1101
Luengo I 1733 Mori Y 1209, 1527
Morlini 1. 1237
Maisongr ande P 777
Morton A 527
Mala 1. 1255
Morton D.P 663
Malvestuto F.M 1439
Munoz M.P 1551
Manteiga W.G 1447
Mucha H.-J 1535
Marchette D .J 381 Mull er W .G 1543
Marek L 1455 Murtagh F 1561
Mariel P 1013
Markovsky 1. 1369 Na kano J 2003
Marquez D 1551 Naya S 1569
Martin J 1709 Necir A 1577
Martinez J.P 597 Neifar M 967
Martinez M.D 1085 Neuwirth E. 351
Mar tinez Puert as H 1701 Neykov N 1585
Mar tinez Puert as S 1701 Neytchev P 1585
Niemczyk J 1593 Reale M 1621


Niglio M 605 Renzetti M 1685
Novikov A 1601 Riani M 405
Ribarits T 137
Ocan a J 1519 Riera A 1519
Ocana-Peinado F .M 1609 Rocci R 919
Ohta E. 1179 Rocha A.P 597
Oliveira P.M 1823 Rodriguez J 371
Ortega-Mor eno M 1615 Roelant E 1693
Ost ermann R 1971 Roj ano C 1709
Ostrouchov G 359 Rom an )T 1085
Oxley L 1621 Rom an Montoya Y. 1701
Ozeki T 49 Ronin g J 1271
Pappas V.A 1901 Rueda Garda M 1701
Park H 49 Rueda M.M 1085
Park Y. 381 Ruiz M 1709
Parsons V.L 1201 Rui z-Castro J .E 1717
Payne RW 1629 Ruskin H 1783
Peiia D 371
Rezankova H 1035, 1725
Peifer M 1637
P erez C 1709 Saavedra P 1733
P erez-Ocon R 1717 Sacco G 927
Pern a C 1077 Saito T 1741
Pham-Gia T 1645 Sakurai N 1751
Pires da Cost a A 1823 Sam atova N.F 359
Pisani S 697 Same A 1759
Plat P 1653 Santana A 1733
Polyakov P.A 1035 Saporta G 417
Poon N.L 149 Sarda P 769
Popelka J 1255 Savicky P 1767
Porzio G.C 1661 Savin A 1987
Praskova Z 1669 Scanu M 927
Priebe C.E 381 Scavalli E. 1775
Pueyo E 597 Schimek M.G 1, 429
Quinn N 1677 Schmidt W 429
Scholkopf B 441
Ragozini G. . . . . . . . . . . . . . . 1661 Schyns M 113
Ramsay J .O 393 Scott D.W 453
Sell A 1279 Torsney B. 513
Sharkasi A 1783 Tressou J 1877
Shibata R 465, 2011 Triacca U 911
Shimamura T 1791 Triantafyllopoulos K 1885
Shin H.W 1799 Triggs B 721
Siciliano R 1807 Tsang W.W 1893
Sickles R.C 315 Tsao A 381
Sima D.M 1815 Tsay R.S 339
Simoes L 1823 Tsomokos I. 1901
Sindoni G 1685 Tunnicliffe-Wilson G. 527, 1621
Sint P.P 1 Turkkan N 1645
Skibicki M 1831 Tutz G 705
Snasel V 1035 Tvrdik J 1917
Socolinsky D.A 381 Tzeng S 85
Sohn S.Y 1799
Solka J.L 381 Urbano A 1685
Sohn S.Y 1799
Solka J .L 381 Urbano A 1685
Song W 315
Vald errama M.J .. ... 997, 1609,
St ehlik M 1543
1615
Storti G 1837
Van Aelst S 1693, 1979
Struyf A 753
Van Huffel S 539, 1369
Sung J 1845
van Zwet V.R 903
Sung M.-H 73
Vanden Br anden K 1925
Swayne D.F 477
Vandervi eren E 1933
Safarik L 1061 Van Huffel S 1815
Sidlofov a T 1853 Vegas E 1519
Verboven S 1941
Taki M 1869 Vicard P 927
Tanaka Y 1845 Vieu P 879
Tarsitano A 1861 Viguier-Pla S 737
Tarumi T 1043, 1229 Vilar-Fernandez J .M 1027,
Tatsunami S 1869 1447
Tenenhau s M 489 Villazon C 1551
Theus M 501 Vistocco D 815
Tiao G .C 371 Vfsok J.A 1947
Ti en Y-J 85 Vitale C 605
Tille Y 1471 Volf P 1361
Timmer J 1637 Vont a F 807
Tininini L 1685 Vos H.J 1955
Wagner S 1263
Wang J 1893
Watanabe M 1751
Waterhouse T.H 1963
Wegman E.J 287, 327, 381
Weihs C 1429
Welsch R.E 1481
Westad F 261
Whittaker J 935
Wilhelm A.F.X 1971
Willems G 1693, 1979
Wimmer G 1987
Witkovsky V 1987, 1995
Wu H.-M 85
Wurtele E 1397

Yadohisa H 1209
Yamada K 1869
Yamaguchi K. 1751
Yamamoto Y. 1043, 1209, 2003
Yanagi K. 1229
Yang C.T 149
Yokouchi D 2011

Zadlo T 2019
Zarzo M 2027
Zuckschwerdt C 101
COMPSTAT 2004 Section Index

Algorithms
Doray L.G., Haziza A., Minimum distance inference
for Sundt's distribution 943
Grendar M., Determination of constrained modes
of a multinomial distribution 1109
Gunning P., Horgan J.M., An algorithm for
obtaining strata with equal coefficients of
variation 1123
Klaschka J., On ordering of splits, Gray code,
and some missing references 1317
Kuroda M., Data augmentation algorithm for
graphical models with missing data 1385
Miwa T., A normalising transformation of noncentral
F variables with large noncentrality parameters 1497
Tvrdik J., Krivy I., Comparison of algorithms for
nonlinear regression estimates 1917
Witkovsky V., Matlab algorithm TDIST:
The distribution of a linear combination of
Student's t random variables 1995

Applications
Bognar T., Komornik J., Komornikova M., New STAR
models of time series and application in finance 713
Braverman A., Kahn B., Visual data mining for
quantized spatial data 61
Cardot H., Crambes Ch., Sarda P., Conditional
quantiles with functional covariates: An application
to ozone pollution forecasting 769
Cardot H., Faivre R., Maisongrande P., Random effects
varying time regression models with application
to remote sensing data 777
Chretien S., Corset F., A lower bound on inspection
time for complex systems with Weibull transitions 799
Conversano C., Vistocco D., Model based
visualization of portfolio style analysis 815
Costanzo G.D., Ingrassia S., Analysis of the MIB30
basket in the period 2000-2002 by functional PC's 831
Di Bucchianico A. et al., Performance of control
charts for specific alternative hypotheses 903
Celnarova E., Safarik L., Comparison of three
statistical classifiers on a prostate cancer data 1061
Grassini L., Ordinal variables in economic analysis 1095
Hlubinka D., Growth curve approach to profiles
of atmospheric radiation 1185
Huskova M., Meintanis S., Bayesian like procedures
for detection of changes 1221
Jarosova E. et al., Modelling of time of unemployment
via log-location-scale model 1255
Juutilainen I., Roning J., Modelling the probability
of rejection in a qualification test 1271
Kafadar K., Wegman E.J., Graphical displays
of Internet traffic data 287
Kukush A., Markovsky I., Van Huffel S., Consistent
estimation of an ellipsoid with known center 1369
Lipinski P., Clustering of large number of stock
market trading rules 1421
Martinez A.R., Wegman E.J., Martinez W.L.,
Using weights with a text proximity matrix 327
Michalak K., Lipinski P., Prediction of high increases
in stock prices using neural networks 1489
Porzio G.C., Ragozini G., A parametric framework
for data depth control charts 1661
Quinn N., Killen L., Buckley F., Statistical
modelling of lactation curve data 1677
Sharkasi A., Ruskin H., Crane M., Interdependence
between emerging and major markets 1783
Tatsunami S. et al., An application of correspondence
analysis to the classification of causes of death
among Japanese hemophiliacs with HIV-1 1869
Tressou J., Double Monte-Carlo simulations in food
risk assessment 1877
Bayesian Methods
Achcar J.A., Martinez E.Z., Louzada-Neto F.,
Binary data in the presence of misclassifications 581
Di Zio M. et al., Multivariate techniques for
imputation based on Bayesian networks 927
Huskova M., Meintanis S., Bayesian like procedures
for detection of changes 1221
Jerak A., Wagner S., Semiparametric Bayesian
analysis of EPO patent opposition 1263
Mohammadzadeh M., Jafari Khaledi M., Bayesian
prediction for a noisy log-Gaussian spatial model 1511
Pham-Gia T., Turkkan N., Sample size determination
in the Bayesian analysis of the odds ratio 1645
Ruiz M. et al., A Bayesian model for binomial
imperfect sampling 1709
Schimek M.G., Schmidt W., An automatic thresholding
approach to gene expression analysis 429
Skibicki M., Optimum allocation for Bayesian
multivariate stratified sampling 1831
Vos H.J., Simultaneous optimization of selection
mastery decisions 1955

Biostatistics
Araki Y., Konishi S., Imoto S., Functional discriminant
analysis for microarray gene expression data via
radial basis function networks 613
Carr D.B., Sung M.-H., Graphs for representing
statistics indexed by nucleotide or amino acid
sequences 73
Celnarova E., Safarik L., Comparison of three
statistical classifiers on a prostate cancer data 1061
Gentleman R., Using GO for statistical analyses 171
Gray A. et al., High-dimensional probabilistic
classification for drug discovery 1101
Groos J., Kopp-Schneider A., Visualization of
parametric carcinogenesis models 189
Heinzl H., Mittlboeck M., Design aspects of a computer
simulation study for assessing uncertainty in human
lifetime toxicokinetic models 199
Held L., Simultaneous inference in risk assessment;
a Bayesian perspective 213
Hirotsu C., Ohta E., Aoki S., Testing the equality
of the odds ratio parameters 1179
Kaarik E., Sell A., Estimating ED50 using the
up-and-down method 1279
Lee E.-K. et al., GeneGobi: Visual data analysis
aid tools for microarray data 1397
Monleon T. et al., Flexible discrete events simulation
of clinical trials using LeanSim(r) 1519
Schimek M.G., Schmidt W., An automatic thresholding
approach to gene expression analysis 429
Tatsunami S. et al., An application of correspondence
analysis to the classification of causes of death
among Japanese hemophiliacs with HIV-1 1869

Classification
Betinec M., Two measures of credibility of
evolutionary trees 689
Binder H., Tutz G., Localized logistic classification
with variable selection 705
Bouchard G., Triggs B., The trade-off between
generative and discriminative classifiers 721
Cook D., Caragea D., Honavar V., Visualization
in classification problems 823
Croux C., Joossens K., Lemmens A., Bagging
a stacked classifier 839
Cwiklinska-Jurkowska M., Jurkowski P., Effectiveness
in ensemble of classifiers and their diversity 855
Dabo-Niang S., Ferraty F., Vieu P., Nonparametric
unsupervised classification of satellite wave
altimeter forms 879
Fung W.K. et al., Statistical analysis of handwritten
arabic numerals in a Chinese population 149
Hayashi A., Two classification methods for
educational data and its application 1157
Hennig C., Classification and outlier identification
for the GAIA mission 1171
Priebe C.E. et al., Iterative denoising for cross-corpus
discovery 381
Vanden Branden K.V., Hubert M., Robust
classification of high dimensional data 1925

Clustering
Di Zio M., Guarnera U., Rocci R., A mixture of
mixture models to detect unity measure errors 919
Gibert K. et al., Knowledge discovery with clustering:
Impact of metrics and reporting phase by
using KLASS 1069
Grün B., Leisch F., Bootstrapping finite mixture
models 1115
Jalam R., Chauchat J.-H., Dumais J., Automatic
recognition of key-words using n-grams 1245
Kiers H.A.L., Clustering all three modes of three-mode
data: Computational possibilities and problems 303
Krecan L., Volf P., Clustering of transaction data 1361
Leisch F., Exploring the structure of mixture model
components 1405
Lipinski P., Clustering of large number of stock
market trading rules 1421
Mucha H.-J., Automatic validation of hierarchical
clustering 1535
Murtagh F., Quantifying ultrametricity 1561
Peña D., Rodriguez J., Tiao G.C., A general partition
cluster algorithm 371
Rezankova H., Husek D., Frolov A.A., Some
approaches to overlapping clustering of binary
variables 1725
Same A., Ambroise Ch., Govaert G., A mixture
model approach for on-line clustering 1759
Scott D.W., Outlier detection and clustering
by partial mixture modeling 453
Turmon M., Symmetric normal mixtures 1909

Data Imputation
Derquenne C., A multivariate modelling method
for statistical matching 895
Di Zio M. et al. , Multivariate techniques for


imputation based on Bayesian networks 927
Gamrot W., Comparison of some ratio and regression
estimators under double sampling for nonresponse
by simulation 1053
Gonzalez S. et al. , Indirect methods of imputation
in sample surveys 1085
Rueda Garcia M. et al., Quantile estimation with
calibration estimators 1701

Data Visualization
Adachi K., Multiple correspondence spline analysis 589
Arhipov S., Fractal peculiarities of birth and death 621
Bartkowiak A., Distal points viewed in Kohonen's
self-organizing maps 647
Braverman A., Kahn B., Visual data mining for
quantized spatial data 61
Carr D.B., Sung M.-H., Graphs for representing
statistics indexed by nucleotide or amino acid
sequences 73
Chen C.-H. et al., Matrix visualization and
information mining 85
Cook D., Caragea D., Honavar V., Visualization
in classification problems 823
Fujino T., Yamamoto Y., Tarumi T., Possibilities
and problems of the XML-based graphics 1043
Hofmann H., Interactive biplots for visual modelling 223
Huh M.Y., Line mosaic plot: Algorithm and
implementation 277
Kafadar K., Wegman E.J., Graphical displays
of Internet traffic data 287
Katina S., Mizera I., Total variation penalty in
image warping 1301
Lee E.-K. et al., GeneGobi: Visual data analysis
aid tools for microarray data 1397
Swayne D.F., Buja A., Exploratory visual analysis
of graphs in GGobi 477
Theus M., 1001 graphics 501
Vandervieren E., Hubert M., An adjusted
boxplot for skewed distributions 1933
Wilhelm A.F.X., Ostermann R., Encyclopedia
of statistical graphics 1971

Design of Experiments
Ali A.A., Jansson M., Hybrid algorithms for
construction of D-efficient designs 37
Ceranka B., Graczyk M., Chemical balance weighing
designs for v + 1 objects with different variances 785
Dorta-Guerra R., Gonzalez-Davila E., Optimal 2^2
factorial designs for binary response data 951
Ghosh S., Computational challenges in determining
an optimal design for an experiment 181
Muller W.G., Stehlik M., An example of D-optimal
designs in the case of correlated errors 1543
Payne R.W., Confidence intervals and tests for
contrasts between combined effects in generally
balanced designs 1629
Torsney B., Fitting Bradley Terry models using
a multiplicative algorithm 513
Waterhouse T.H., Eccleston J.A., Duffull S.B., On
optimal design for discrimination and estimation 1963

Dimensional Reduction
Brewer M.J. et al. , Using principal components
an alysis for dimension reduction 745
Cizek P., Robust estimation of dimension reduction
space 871
Luebke K , Weihs C., Optimal separation
projection 1429
Mori Y. , Fueda K , Iizuka M., Orthogonal score
estimation with variable selection 1527
Ostrouchov G. , Samatova N.F. , Embedding methods
and robust statistics for dimension reduction 359
Priebe C.E. et al., Iterative denoising for cross-corpus
discovery 381
Saito T ., Properties of the slide vector model for
analysis of asymmetry 1741
E-statistics
Fujino T. , Yamamoto Y., Tarumi T ., Possibilities
and problems of the XML-based graphics 1043
Honda K. et al., Web-based analysis system
in data-oriented statistical system 1209
Shibata R, InterDatabase and DandD 465
Yokouchi D., Shibata R , DandD : Client server
system 2011
Functional Data Analysis
Araki Y, Konishi S., Imoto S., Functional discriminant
analysis for microarray gene expression data via
radial basis function networks 613
Beran R , Low risk fits to discrete incomplete
multi-way layouts 671
Boudou A., Caumont 0. , Viguier-Pla S., Principal
components analysis in the frequency domain 729
Cardot H., Crambes Ch ., Sarda P., Condi tional
quantiles with functional covariates: An applicat ion
to ozone pollution forecasting 769
Cardot H., Faivre R , Maisongrande P., Random effects
varying time regression models with application
to remote sensing data 777
Costanzo G.D., Ingrassia S., Analysis of the MIB30
basket in the period 2000-2002 by functional PC's ... 831
Cuevas A., Fraiman R, On the bootstrap
methodology for functional data 127
Dabo-Niang S., Ferraty F ., Vieu P., Nonparametric
unsupervised classification of satellite wave
altimeter forms 879
Escabias M., Aguilera A.M., Valderrama M.J .,
An application to logistic regression with
missing longitudinal data 997
Hlubinka D., Growth curve approach to profiles
of at mospheric radiation 1185
Kawasaki Y. , Ando T ., Functional data analysis
of the dynamics of yield curves 1309
Kneip A., Sickles RC ., Song W., Functional data
analysis and mixed effect models 315
Manteiga W.G., Vilar-Fernandez J.M., Bootstrap
test for the equality of nonparametric
regression curves under dependence 1447
Mizuta M., Clustering methods for functional data:
k-means, single linkage and moving clustering 1503
Naya S., Cao R., Artiaga R., Nonparametric
regression with functional data 1569
Ortega-Moreno M., Valderrama M.J., State-space
model for system with narrow-band excitations 1615
Ramsay J.O., From data to differential equations 393

Historical Keynote
Grossmann W., Schimek M.G., Sint P.P., The history
of COMPSTAT and key-steps of statistical
computing during the last 30 years 1

Model Selection
Beran R., Low risk fits to discrete incomplete
multi-way layouts 671
Christodoulou C., Karagrigoriou A., Vonta F.,
An inference curve-based ranking technique 807
Hafidi B., Mkhadri A., Schwarz information
criterion in the presence of incomplete-data 1131
Kannisto J., The expected effective retirement
age and the age of retirement 1295
Sima D.M., Van Huffel S., Appropriate cross
validation for regularized errors-in-variables
linear models 1815
Tarsitano A., Fitting the generalized lambda
distribution to income data 1861

Multivariate Analysis
Adachi K., Multiple correspondence spline analysis 589
Choulakian V., A comparison of two methods of
principal component analysis 793
Fabian Z., Core function and parametric inference 1005
Fernandez-Aguirre K., Mariel P., Martin-Arroyuelos A.,
Analysis of the organizational culture at
a public university 1013
Heitzig J., Protection of confidential data
when publishing correlation matrices 1163
Kropf S., Hothorn L.A., Multiple test procedures
with multiple weights 1353
Lazraq A., Cleroux R., Principal variable analysis 1393
Sakurai N., Watanabe M., Yamaguchi K., A statistical
method for market segmentation using a restricted
latent class model 1751
Wimmer G., Witkovsky V., Savin A., Confidence
region for parameters in replicated errors 1987
Zadlo T., On unbiasedness of some EBLU predictor 2019
Zarzo M., A graphical procedure to assess uncertainty
of scores in principal component analysis 2027

Neural Networks and Machine Learning
Sidlofova T., Existence and uniqueness of minimization
problems with Fourier based stabilizers 1853
Amari S., Park H., Ozeki T., Geometry of learning
in multilayer perceptrons 49
Araki Y., Konishi S., Imoto S., Functional discriminant
analysis for microarray gene expression data via
radial basis function networks 613
Frolov A.A. et al., Binary factorization of textual
data by Hopfield-like neural network 1035
Giordano F., La Rocca M., Perna C., Neural network
sieve bootstrap for nonlinear time series 1077
Ingrassia S., Morlini I., On the degrees of freedom
in richly parameterised models 1237
Kurkova V., Learning from data as an inverse
problem 1377
Michalak K., Lipinski P., Prediction of high increases
in stock prices using neural networks 1489
Savicky P., Kotrc E., Experimental study of leaf
confidences for random forests 1767
Scavalli E., Standard methods and innovations
for data editing 1775
Scholkopf B., Kernel methods for manifold estimation 441
Shimamura T., Mizuta M., Flexible regression
modeling via radial basis function networks 1791
Shin H.W., Sohn S.Y., EWMA combination of both
GARCH and neural networks for the prediction
of exchange rate 1799

Nonparametrical Statistics
Burdakov O., Grimvall A., Hussian M., A generalised
PAV algorithm for monotonic regression in several
variables 761
Capek V., Test of continuity of a regression function 863
Ho Y.H.S., Calibrated interpolated confidence
intervals for population quantiles 1193
Kolacek J., Use of Fourier transformation for kernel
smoothing 1329
Komarkova L., Rank estimators for the time
of a change in censored data 1337
Necir A., Boukhetala K., Estimating the risk-adjusted
premium for the largest claims reinsurance covers 1577

Numerical Methods for Statistics


Hanafi M., Lafosse R., Regression of a multi-set
based on an extension of the SVD 1141
Van Huffel S., Total least squares and errors-in-variables
modeling: Bridging the gap between statistics,
computational mathematics and engineering 539

Official Statistics
Biffignandi S., Pisani S., A statistical database
for the trade sector 697
Di Zio M., Guarnera U., Rocci R., A mixture of
mixture models to detect unity measure errors 919
Matei A., Tille Y., On the maximal sample
coordination 1471
Renzetti M. et al., The Italian judicial
statistical information system 1685

Optimization
Bayraksan G., Morton D.P., Testing solution quality
in stochastic programming 663
Novikov A., Optimality of two-stage hypothesis t ests 1601
Partial Least Squares
Bastien P., PLS-Cox model: Application to gene
expression 655
Dodge Y., Kondylis A., Whittaker J., Extending
PLS1 to PLAD regression 935
Engelen S., Hubert M., Fast cross-validation
in robust PCA 989
Fort G., Lambert-Lacroix S., Ridge-partial least
squares for GLM with binary response 1019
Hoy M., Westad F., Martens H., Improved jackknife
variance estimates of bilinear model parameters 261
Tenenhaus M., PLS regression and PLS path modeling
for multiple table analysis 489

Robustness
Brys G., Hubert M., Struyf A., A robustification
of the Jarque-Bera test of normality 753
Critchley F. et al., The case sensitivity function
approach to diagnostic and robust computation 113
Cizek P., Robust estimation of dimension reduction
space 871
Debruyne M., Hubert M., Robust regression
quantiles with censored data 887
Gather U., Fried R., Methods and algorithms for
robust filtering 159
House L.L., Banks D., Robust multidimensional
scaling 251
Kalina J., Durbin-Watson test for least
weighted squares 1287
Masicek L., Behaviour of the least weighted squares
estimator for data with correlated regressors 1463
McCann L., Welsch R.E., Diagnostic data traces
using penalty methods 1481
Neykov N. et al., Mixture of GLMs and the trimmed
likelihood methodology 1585
Ostrouchov G., Samatova N.F., Embedding methods
and robust statistics for dimension reduction 359
Plat P., The least weighted squares estimator 1653
Riani M., Atkinson A., Simple simulations for
robust tests of multiple outliers in regression 405
Roelant E., Van Aelst S., Willems G., The multivariate
least weighted squared distances estimator 1693
Sung J., Tanaka Y., Influence analysis in Cox
proportional hazards models 1845
Visek J.A., Robustifying instrumental variables 1947
Willems G., Van Aelst S., A fast bootstrap method
for the MCD estimator 1979

Simulations
Dufour J.-M., Neifar M., Exact simulation-based
inference for autoregressive processes 967
Gamrot W., Comparison of some ratio and regression
estimators under double sampling for nonresponse
by simulation 1053
Harper W.V., An aid to addressing tough decisions:
The automation of general expression transfer
from Excel to an Arena simulation 1149
Koubkova A., Critical values for changes in sequential
regression models 1345
Monleon T. et al., Flexible discrete events simulation
of clinical trials using LeanSim(r) 1519
Naya S., Cao R., Artiaga R., Nonparametric
regression with functional data 1569
Simoes L., Oliveira P.M., Pires da Costa A.,
Simulation and modelling of vehicle's delay 1823
Tressou J., Double Monte-Carlo simulations in food
risk assessment 1877

Smoothing
Downie T.R., Reduction of Gibbs phenomenon
in wavelet signal estimation 959
Francisco-Fernandez M., Vilar-Fernandez J.M.,
Nonparametric estimation of the volatility
function with correlated errors 1027
Manteiga W.G., Vilar-Fernandez J.M., Bootstrap
test for the equality of nonparametric
regression curves under dependence 1447
Spatial Statistics
Boukhetala K., Ait-Kaci S., Finite spatial sampling
design and "quantization" 737
Mohammadzadeh M., Jafari Khaledi M., Bayesian
prediction for a noisy log-Gaussian spatial model 1511
Ramsay J.O., From data to differential equations 393

Statistical Software
Ceranka B., Graczyk M., Chemical balance weighing
designs for v + 1 objects with different variances 785
Hornik K., R: The next generation 235
House L.L., Banks D., Robust multidimensional
scaling 251
Lee E.-K. et al., GeneGobi: Visual data analysis
aid tools for microarray data 1397
Marek L., Do we all count the same way? 1455
Scott D.W., Outlier detection and clustering
by partial mixture modeling 453
Tsang W.W., Wang J., Evaluating the CDF of
the Kolmogorov statistics for normality testing 1893
Tsomokos I., Karakostas K.X., Pappas V.A.,
Making statistical analysis easier 1901
Verboven S., Hubert M., MATLAB software
for robust statistical methods 1941
Yamamoto Y. et al., Parallel computing in
a statistical system Jasp 2003

Teaching Statistics
Arhipova I., Balina S., The problem of choosing
statistical hypotheses in applied statistics 629
Cramer K., Kamps U., Zuckschwerdt C., st-apps and
EMILeA-stat: Interactive visualizations in
descriptive statistics 101
Duller C., A kind of PISA-survey at university 975
Eichhorn B.H., Discussions in a basic statistics class 981
Hrach K., The interactive exercise textbook 1217
Iizuka M. et al., Development of the educational
materials for statistics using Web 1229
Klinke S., Q&A: Variable multiple choice exercises
with commented answers 1323
Neuwirth E., Learning statistics by doing or by
describing: The role of software 351
Saporta G., Bourdeau M., The St@tNet project
for teaching statistics 417

Time Series Analysis
Almeida R. et al., Modelling short term variability
interactions in ECG: QT versus RR 597
Amendola A., Niglio M., Vitale C., The threshold
ARMA model and its autocorrelation function 605
Arteche J., Reducing the bias of the log-periodogram
regression in perturbed long memory series 637
Bognar T., Komornik J., Komornikova M., New STAR
models of time series and application in finance 713
Deistler M., Ribarits T., Hanzon B., A novel approach
to parametrization and parameter estimation in
linear dynamic systems 137
Di Iorio F., Triacca U., Dimensionality problem in
testing for noncausality between time series 911
Dufour J.-M., Neifar M., Exact simulation-based
inference for autoregressive processes 967
Francisco-Fernandez M., Vilar-Fernandez J.M.,
Nonparametric estimation of the volatility
function with correlated errors 1027
Lin J.-L., Granger C.W.J., Testing nonlinear
cointegration 1413
Min W., Tsay R.S., On canonical analysis of vector
time series 339
Munoz M.P. et al., TAR-GARCH and stochastic
volatility model: Evaluation based on simulations
and financial time series 1551
Niemczyk J., Computing the derivatives of the
autocovariances of a VARMA process 1593
Ocana-Peinado F.M., Valderrama M.J., Modelling
residuals in dynamic regression: An alternative
using principal components analysis 1609
Oxley L., Reale M., Tunnicliffe-Wilson G., Finding
directed acyclic graphs for vector autoregressions 1621
Peifer M., Timmer J., Studentised blockwise
bootstrap for testing hypotheses on time series 1637
Praskova Z., Some remarks to testing
of heteroskedasticity in AR models 1669
Saavedra P. et al., Homogeneity analysis for sets
of time series 1733
Shin H.W., Sohn S.Y., EWMA combination of both
GARCH and neural networks for the prediction
of exchange rate 1799
Storti G., Multivariate bilinear GARCH models 1837
Triantafyllopoulos K., Montana G., Forecasting
London metal exchange with a dynamic model 1885
Tunnicliffe-Wilson G., Morton A., Modelling multiple
time series: Achieving the aims 527

Tree Based Methods
Betinec M., Two measures of credibility of
evolutionary trees 689
Hoang T.M., Parsons V.L., Bagging survival
trees for prognosis based on gene profiles 1201
Malvestuto F.M., Tree and local computation with
the multiproportional estimation problem 1439
Savicky P., Kotrc E., Experimental study of leaf
confidences for random forests 1767
Siciliano R., Aria M., Conversano C., Tree harvest:
Methods, software and some applications 1807

You might also like