Proceedings
in Computational Statistics
Edited by
Jaromir Antoch
Physica-Verlag
A Springer Company
Prof. Dr. Jaromir Antoch
Charles University
Faculty of Mathematics and Physics
Department of Statistics and Probability
Sokolovská 83
186 75 Prague 8 - Karlín
Czech Republic
antoch@karlin.mff.cuni.cz
Cataloging-in-Publication Data
Library of Congress Control Number: 2004108446
This work is subject to copyright. All rights are reserved, whether the whole or part of the
material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, reci-
tation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks.
Duplication of this publication or parts thereof is permitted only under the provisions of the Ger-
man Copyright Law of September 9, 1965, in its current version, and permission for use must
always be obtained from Physica-Verlag. Violations are liable for prosecution under the German
Copyright Law.
Physica is a part of Springer Science+Business Media
springeronline.com
© Physica-Verlag Heidelberg 2004
for IASC (International Association for Statistical Computing), ERS (European Regional Section
of the IASC) and ISI (International Statistical Institute).
Printed in Germany
The use of general descriptive names, registered names, trademarks, etc. in this publication does
not imply, even in the absence of a specific statement, that such names are exempt from the rel-
evant protective laws and regulations and therefore free for general use.
Softcover-Design: Erich Kirchner, Heidelberg
SPIN 11015154 88/3130-5 4 3 2 1 0 - Printed on acid-free paper
Foreword
Statistical computing provides the link between statistical theory and
applied statistics. As at previous COMPSTATs, the scientific programme
covered all aspects of this link, from the development and implementation
of new statistical ideas through to user experiences and software evaluation.
Following extensive discussions, a number of changes were introduced:
giving more focus to the individual sessions, involving more people in the
planning of sessions, and making links with other societies involved in
statistical computing, such as the Interface and the International Federation
of Classification Societies (IFCS). The proceedings should appeal to anyone
working in statistics and using computers, whether in universities, industrial
companies, government agencies, research institutes or as software developers.
These proceedings would not exist without the help of many people. Among
them I would like to especially thank the SPC members D. Banks (USA),
H. Ekblom (S), P. Filzmoser (A), W. Härdle (D), J. Hinde (IRE), F. Murtagh
(UK), J. Nakano (JAP), A. Prat (E), A. Rizzi (I), G. Sawitzki (D) and
E. Wegman (USA); the session organizers D. Cook (USA), D. Banks (IFCS,
USA), C. Croux (B), L. Edler (D), V. Esposito Vinzi (I), F. Ferraty (F),
V. Kůrková (CZ), M. Müller (D), J. Nakano (ARS IASC, JAP), H. Nyquist
(S), D. Peña (E), M. Schimek (A), G. Tunnicliffe-Wilson (GB) and E. Wegman
(Interface, USA); as well as all who contributed and/or refereed the
papers.
Last but not least, I must sincerely thank my colleagues from the Department
of Statistics of Charles University, the Institute of Computer Science of the
Czech Academy of Sciences, the Czech Technical University and the Technical
University of Liberec, and Mme Anna Kotesovcova from Conforg Ltd. Without their
substantial help neither this book nor COMPSTAT 2004 would exist.
My final thanks go to Mme Bilkova and Mme Pickova, who retyped most
of the contributions and prepared the final volume, and Mme G. Keidel from
Springer-Verlag, Heidelberg, who checked the final printing extremely
carefully.
Invited papers
Grossmann W., Schimek M.G., Sint P.P., The history
of COMPSTAT and key-steps of statistical
computing during the last 30 years 1
Ali A.A., Jansson M., Hybrid algorithms for construction
of D-efficient designs 37
Amari S., Park H., Ozeki T., Geometry of learning
in multilayer perceptrons 49
Braverman A., Kahn B., Visual data mining for quantized
spatial data 61
Carr D.B., Sung M.-H., Graphs for representing statistics
indexed by nucleotide or amino acid sequences 73
Chen C.H. et al., Matrix visualization and
information mining 85
Cramer K., Kamps U., Zuckschwerdt C., st-apps and
EMILeA-stat: Interactive visualizations in
descriptive statistics 101
Critchley F. et al., The case sensitivity function approach
to diagnostic and robust computation:
A relaxation strategy 113
Cuevas A., Fraiman R., On the bootstrap methodology
for functional data 127
Deistler M., Ribarits T., Hanzon B., A novel approach
to parametrization and parameter estimation
in linear dynamic systems 137
Fung W.K. et al., Statistical analysis of handwritten
Arabic numerals in a Chinese population 149
Gather U., Fried R., Methods and algorithms for
robust filtering 159
Gentleman R., Using GO for statistical analyses 171
Ghosh S., Computational challenges in determining
an optimal design for an experiment 181
Groos J., Kopp-Schneider A., Visualization of parametric
carcinogenesis models 189
Schölkopf B., Kernel methods for manifold estimation 441
Marek L., Do we all count the same way? 1455
1 Introduction
First of all we try to trace the situation and the ideas that culminated in
the first COMPSTAT symposium in the year 1974, held at the University of
Vienna, Austria. Special emphasis is given to the memories of our founding
member P. P. Sint, who had been the driving force behind early COMPSTAT
and had served it for twenty years.
At the time COMPSTAT was established, computing technology was in its
infancy. Yet it was well understood that computing would play a vital role in
the future progress of statistics. The impact of the first digital computer in
the Department of Statistics at the University of Vienna on the local statistics
community is described. After the first computational statistics event
in 1974 it was anything but clear that the COMPSTAT symposia would go
on for decades as an international undertaking, to be incorporated as early
as 1978 into the International Association for Statistical Computing (IASC,
http://www.iasc-isi.org/), a Section of the International Statistical
Institute (ISI).
After the description of the background against which the COMPSTAT
idea emerged, the subject area of computational statistics is critically discussed
from a historical perspective. Key steps of development are pointed
out. Special consideration is given to the impact of statistical theory, computing
(algorithms), computer science, and applications. Further we provide
an overview of the symposia and trace the topics across 30 years, the period
of historic interest. Finally we draw conclusions, also with respect to recent
developments.
and its director until his emigration to the USA in 1933, later professor at
Columbia University, and O. Morgenstern (together with J. von Neumann),
the father of game theory and a former director of the Austrian Institute of
Trade Cycle Research, were the driving forces behind the foundation of the
Ford Institute [39]. At that time formal-mathematical as well as empirical
methods were practically absent from the syllabus of economics and sociology
in most academic institutions in Austria.
S. Sagoroff was a key person during the foundation of the Ford Institute
and also its first director. He already had an interesting personal history:
After receiving his doctoral degree from the University of Leipzig (Germany)
and studying in the USA under the supervision of J. A. Schumpeter in 1933/34
on a Rockefeller grant, he became professor of statistics, president of the
statistical office, and director of the Rockefeller Institute for Economic Research
in Bulgaria before World War II. Later he was Bulgarian Royal Ambassador
to Germany in Berlin until 1942 (when Bulgaria joined the Allies). In that
function he was involved in the delay of the delivery of Bulgarian Jews. While
in Berlin, and with a broad interest in science, he had befriended some
of Germany's intellectual elite, including a number of Nobel laureates who
cherished his dinner parties. After liberation from his internment in Bavaria
he worked for the US Ambassador R. D. Murphy and spent some
time at Stanford University, before becoming professor of statistics at the
University of Vienna.
Sagoroff was certainly an able organizer for the start-up of the IHS but might
not have been the best choice for running the institution in a way ensuring
high scientific standards. Still, the Ford Institute was a tremendous place
to learn and to get acquainted with current thoughts in the social and economic
sciences, offering contacts to researchers of high reputation. In the following
decade the IHS played an important role in the reversal of the former situation
at Viennese academic institutions, advocating mathematical and statistical
approaches.
Sagoroff's USA experience had also been crucial to the fact that he was
successful in receiving a Rockefeller grant for the University of Vienna to buy
a digital computer. The foundation paid for half of the price (83,500 US$)
and the computer company gave an educational grant covering the other half.
The university had to pay just for transportation and installation. That
Sagoroff was interested in computers and on the lookout for one was most
likely fueled by the fact that at the very time H. Zemanek was constructing
the first transistorized computer in Europe at the Technische Hochschule
(now University of Technology) in Vienna. At that same time Sagoroff's
assistant at the Statistics Department, A. Adam, also tried to build a simple
electronic statistical calculator and even obtained a patent on this device. But
he definitely did not have the technical expertise of Zemanek, and his machine
was never used in practice. Nevertheless his historical findings on the early
history of computing remain a landmark in the historiography of the area
([18], widely distributed during the 1973 ISI Session in Vienna).
The arrival of the first "electronic brain" in Vienna in 1960 was not only
of interest for the scientific community but was also a major event for the
Austrian media. The electronic tube-based machine needed a special powerful
electricity generator to convert the 50 Hz alternating current in Austria to
the 60 Hz used in the USA. It was installed in the cellar of the new university
annex building. The windows of the computer room had to be equipped with
specially coated glass to ensure constant temperature.
This Datatron 205 was a one-address machine with one command (or
address) and two calculating registers. The machine had a drum storage
with 4000 cells. Each cell held 10 binary-decimal digits (each digit was
represented by 4 bits, and the uppermost values beyond 0-9 were not used).
The 11th digit was used for signs and as a modifier in some commands. The
4000 cells were divided into 40 cylinders on the drum, each containing 100 words,
with an average access time (half a turn of the drum) in the millisecond
domain. It possessed a feature later reinvented by IBM and marketed in a more
elaborate form under the name virtual memory: two cylinders could accept
repeated identical runs of 20 words (commands), which reduced access time
to one fifth. The critical parts of the program code were shifted to this 'fast
storage' with one block command, and the program execution shifted (often
simultaneously) to the first command in this storage, which meant it was
transferred into command register A.
The implementation was in digital code: Each command was a two-digit
number acting on one address. For instance the command "64" imported
a number into register A:

0000641234   Import the content of cell 1234 (on the drum) into calculating register A

while 74 performed an addition:

0000741235   Add the content of cell 1235 to the content of register A

60 stood for multiplication, 61 for division. Other arithmetic operations,
floating point operations, shift operations, logical operations, conditional
jumps, and printing of registers were performed similarly. An additional
register could be used independently or to enlarge the number of digits in
register A. 02 stored results back to the drum. 08 stopped the run.
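To make the command format concrete, this toy interpreter models only the op codes quoted above (64, 74, 60, 61, 02, 08); the word layout and the register/drum model are simplified assumptions for illustration, not an emulation of the actual Datatron 205:

```python
# Toy interpreter for the two-digit command format described above.
# Assumption: a 10-digit word "0000641234" carries the op code in
# digits 5-6 and the drum address in digits 7-10; timing, the sign
# digit and the second register are deliberately left out.

def run(program, drum):
    a = 0  # calculating register A
    for word in program:
        op, addr = int(word[4:6]), int(word[6:10])
        if op == 64:    # import cell content into register A
            a = drum[addr]
        elif op == 74:  # add cell content to register A
            a += drum[addr]
        elif op == 60:  # multiply register A by cell content
            a *= drum[addr]
        elif op == 61:  # divide register A by cell content
            a //= drum[addr]
        elif op == 2:   # "02": store register A back to the drum
            drum[addr] = a
        elif op == 8:   # "08": stop the run
            break
    return a

# The two example commands from the text, followed by a store and a stop:
drum = {1234: 5, 1235: 7, 2000: 0}
result = run(["0000641234", "0000741235", "0000022000", "0000080000"], drum)
```

With the sample drum contents, the run loads 5, adds 7, and writes the sum 12 into cell 2000 before halting.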
In principle there existed an assembler with mnemonic alphabetic codes;
however, there was no tape punching device to enter alphabetic characters.
Because one had to know the digit codes for operating the machine (entering
and changing commands bit by bit, only guided by a display of the registers
on the console), the direct way was definitely faster. As one could actually
see each bit stored in the registers during programming and debugging, one
could also spot a malfunctioning hardware unit if one of the bits did not
show up properly. In this case one had to open the machine and take out
the concerned unit (a flip-flop with four tubes). Usually it was easy to spot
the culprit by visual inspection or, alternatively, by exchanging the tubes one by
one. Only the (preliminary) finished program was printed or punched out on
a paper tape. As space was scarce and each letter had to be encoded by two
decimal digits, comments accompanying the results were kept to a minimum.
The arrival of this computer was essential to the fact that the Statistics
Department became the hub of computing inside the University of Vienna.
Sint's first experiences with real computing in the early nineteen-sixties
are connected to a programming course for digital computers held by the
mathematician J. Roppert, an assistant in the Department of Statistics. As one
of the few who took an exam in computer programming, and as a scholar
of the IHS, Sint was offered an assistantship at this department. His statistical
qualifications were elementary probability theory (not based on measure
theory) and some statistics for sociologists. (The type of statistics used in
quantum physics was not of much help in a statistics department.) At the
IHS he also obtained a first training in game theory from O. Morgenstern.
Later, while spending a year in Oxford, he learned more statistics and got
interested in cluster analysis. This contact with English statistics helped him
to do "real" statistics in the following years.
Before Sint could use the new generation of computers (an IBM/360-44
was installed at the University of Vienna in 1968) he had to learn his first
programming language, Fortran. For W. Winkler, a professor emeritus of
statistics, he wrote his first Fortran program, for the calculation of a Lexis-
type population distribution on an off-site computer. When he had finished,
Winkler remarked that it would have been much faster to do the job on
a mechanical calculator. At that time, correcting card decks and working on
a remote machine was extremely time consuming.
About that time IBM had started developing and distributing statistical
software. Most developments were open-source Fortran code. Naturally
Fortran was a large step forward, going along with third-generation digital
computers. Program codes for algorithms were published by the US
Association for Computing Machinery (ACM). About that time the first
commercial packages also arrived. In statistics one could choose between OSIRIS,
BMD, P-STAT, and SPSS. The Department of Statistics at the University of
Vienna decided for SPSS in December 1973. SPSS, like BMD and P-STAT,
was implemented in Fortran, offering high portability. All the implementations
of statistical methods at the department were programmed in Fortran,
not a user-friendly environment from today's perspective. This included the
first administrative program for the enrollment of students and the production
of corresponding statistics.
tional statistical community. Not having had access to sufficient travel funds,
Sint and his colleague J. Gordesch, a trained mathematician, encouraged by
A. Liebing, the publisher of the journal Metrika, envisioned a conference on
an up-to-date statistical topic in Vienna. Sint was interested in cluster analysis
and Gordesch rather in computational probability and model building.
These and other topics were ventilated until one settled on a conference on
computers and statistics. As for the name in English, they took the Journal of
the Royal Statistical Society as a model: it comprised series A for Theoretical
Statistics and series B for Applied Statistics. Thus they assumed Symposium
on Computational Statistics would be a proper name. Sint came up with
the acronym COMPSTAT, arguing that one needs a short name which would
still be near enough to an understandable expression to be easily remembered
(this is what is called a logo now).
For the first call for papers the word COMPSTAT was embedded in an
arrow-like graph derived from the symbols used in analog computing: several
input lines ending in a triangle (the statistical engines or algorithms). The
condensed final result we are still using is displayed in the left figure. Sint and
his colleagues were thinking about statistical methods (they were the hub of
our ideas about the conference) as means of compressing a large number of
inputs into a few meaningful results, and about COMPSTAT as an input to
improve the algorithms (being quite aware of the recursivity of these processes).
[Figure: two COMPSTAT logo variants. Left: the condensed final version; right: the original design idea.]
The original design idea was rather something like the right figure. A sketched
drawing similar to this one (without the small arrows and with a smaller
number of input lines) had been dropped by the graphics designer of the
publisher.
As we know now, this was the first freely accessible international conference
with an open call for papers in this area. The first COMPSTAT
meeting was announced in the American Statistician (attracting some participants
from the USA), which helped later to defend the right to the name in
that country. The only preceding international conference of that kind was
organized and financed by IBM. Preceding were also the at first rather local
North American Interface symposia, starting in Southern California in 1967,
sponsored by the local chapters of both the American Statistical Association
and the ACM, and obtaining an international flavor as late as 1979 (twelfth
Interface symposium, held at the University of Waterloo, Ontario, Canada).
For the Interface Foundation of North America, Inc., and its history see
http://www.galaxy.gmu.edu/stats/IFNA.html.
Any organizer of a new kind of conference is uncertain about its success
and the number of participants he/she might attract. According to
the preface of the proceedings [1], Sint and Gordesch were not sure whether
"mathematicians specialized in probability theory or statistics, or experts in
electronic data processing would look at computational statistics as a serious
subject". As the deadline of the call for papers came nearer, the organizers
became increasingly anxious and started to muster locals for participation.
Fortunately, in the first few days after the deadline had expired, a reasonable
number of additional abstracts appeared, altogether enough to give them
peace of mind.
In 1972 Sint had attended a conference where the proceedings papers
had to be retyped by clerical staff, which turned out to be a disaster. With this
experience in mind it was decided to ask for camera-ready copies. For the
COMPSTAT proceedings this worked out smoothly and the copies could be
distributed during the symposium, a practice that has survived till now.
The formal invitation to the conference was signed by G. Bruckmann and
L. Schmetterer, both professors of statistics at the department, because the
young colleagues hoped that the appearance of internationally known personalities
would be more acceptable to participants and to the potential buyers
of the proceedings (Sint and Gordesch just signed the preface; F. Ferschl was
added as an editor by the publisher).
Gordesch had already left Vienna at the time of the conference, and Sint
had moved to the Austrian Academy of Sciences. Thus, although the latter
was still around (his new boss was Schmetterer, the successor of Sagoroff
as professor of statistics), a lot of the preparatory work had to be done
by the young colleagues W. Grossmann, G. Pflug, and W. Schimanovich.
M.G. Schimek, a first-year student of statistics and informatics in 1974, learning
Fortran and SPSS at that time, was a keen observer of all these activities
going on in the Department of Statistics and Informatics at the University
of Vienna.
The interest of Gordesch in COMPSTAT had remained awake, and so the
next conference was naturally held in Berlin. From that time onwards it has
never been a problem to find places to go. Someone has always been willing
to organize the symposium.
To have a permanent platform, a Compstat Society was created in 1976.
Membership was by invitation only. Mainly organizers and chairpersons of
the first conferences were approached. Sint recalls that only selected members
were asked (no formal board decision) when COMPSTAT was transferred
to the International Association for Statistical Computing (IASC) in
1978. It was an initiative of N. Victor (1991-1993 IASC President). Readers
interested in the history of the IASC are referred to the Statistical Software
Newsletter, edited for almost three decades by A. Hörmann, and since 1990
integrated as a special section into the official journal of the IASC, Computational
Statistics and Data Analysis. Furthermore we want to mention P. Dirschedl
and R. Ostermann (1994 [32]) as a valuable reference for developments in
computational statistics (including IASC activities in Germany, the history of
the term "area of statistics" in Victor's statement; on the other hand it emphasizes
also the instrumental aspect of statistical methods with respect to
their application.
Starting from this definition it is quite clear that we have to consider the
progress of computational statistics in connection with developments in sta-
tistical theory, developments in computation and algorithms, developments
in computer science, and last but not least developments in the application of
statistics. In many ways there has always been an exchange of ideas, impor-
tant for the understanding of computational statistics, stemming from these
four areas. In the following we sketch some of these ideas and discuss their
interplay.
(Mehta and Patel, 1992 [62]) or the empirical Bayes approach of H. Robbins
(1956 [74]) that nowadays sees interesting applications in microarray analysis
(Efron, 2003 [37]).
Besides these new developments in statistical theory, the advance of computers
has also influenced other areas of statistical theory in the sense of
providing tools for the experimental checking of statistical models under various
scenarios. Such types of computer experiments are of interest even in
cases where the methods are well underpinned from a theoretical point of
view. A well-known early example is the Princeton study on robust statistics
(Andrews et al., 1972 [20]). Today in theoretical investigations it is rather
common to support the results by simulation and graphical displays. In
this context one should know that, according to H. H. Goldstine (1972 [48]),
such computer experiments were already envisioned by J. von Neumann and
S. Ulam in 1945, at the very beginning of digital computing. This led to the
development of simulation languages, rather independently of conventional
statistics, but with an important impact on computer science (see also [65]).
Note that Simula was the first object-oriented language ever (Dahl and Nygaard,
1966 [29]). A good overview of simulation from a statistical perspective
can be found in B. Ripley's book of 1987 [73].
Problems analyzed by statisticians often have a rather complex data structure,
and the adaptation of this structure to the requirements of an algorithmic
procedure is often a genuine statistical task; (ii) Exploratory nature of
statistical analysis: Usually in a statistical analysis we have not only a pure
algorithmic cycle (defined by: get data, do algorithm, put results, stop) but
rather a cycle of different computations, which are to some extent defined
according to the interpretation of the previous results; (iii) Competence of
users: Users of statistical methods are not necessarily experts in the area of
statistics or in the area of numerical mathematics, but experts in a domain,
and want to interpret their methods according to their domain knowledge.
With these specific points in mind it is not surprising that graphical computation
plays a more prominent role in statistics than in other areas of
modelling. J. Tukey is one of the statistical pioneers, in particular with respect
to dynamic graphics (Friedman and Stuetzle, 2002 [42]). Statistics has
contributed to the development of graphical computation in a way complementary
to computer science. L. Wilkinson et al. (2000 [87]) stress the following three
key ideas in the progression of statistical graphics, which may be seen as the main
driving factors behind most genuine statistical innovations: (i) Graphics are
not only a tool for displaying results but rather a tool for perceiving statistical
relationships directly; (ii) Dynamic interactive graphics are an important
tool for data analysis; and (iii) Graphics are a means of model formalization
reflecting quantitative and qualitative traits of its variables.
this contribution to computer science: In 1998 Chambers received the ACM
Software System Award for his seminal work, which "has forever altered the
way people analyze, visualize and manipulate data" [17].
In 1992, based on the S language, R. Ihaka and R. Gentleman started
the R project at the University of Auckland (New Zealand; cf. Gentleman
and Ihaka, 1996 [59] for the early history of R). Due to its free availability the
R community grew rather fast, and in 1996 the Comprehensive R Archive
Network (CRAN) was established at the University of Technology in Vienna
(cf. Hornik and Leisch, 2002 [55] for recent developments). A further important
step in the development of statistical environments, closely related to R,
was the formation of the Omegahat project (http://www.omegahat.org/)
for statistical computing in 1998. It serves as an umbrella for a number of
other recent open-source projects. Its goal, as described in detail by D. Temple
Lang [79], is to meet the challenges for statistical computing resulting
from new developments in computer science like distributed computing or
Web-based services. Examples are extensions of existing systems such as
StatDataML (Meyer et al., 2002 [66]), offering an XML interface for data exchange,
or the embedding of R into a spreadsheet environment (Neuwirth and Baier,
2000 [71]).
Besides S and R there were a number of other important projects in the
area of statistical software development. For instance we want to mention
W. Härdle's XploRe [53], an interactive statistical computing environment
realizing new concepts of nonparametric curve and density estimation as well
as statistical graphics in the mid nineteen-eighties. In connection with XploRe,
recent efforts to extend its scope to statistical teaching and to Web applications
are worth mentioning. Another project of interest, due to L. Tierney in
the late nineteen-eighties, was XLISP-STAT ([81], [82]), a statistical environment
based on the public XLISP language and freely available from the statlib
archive.
A further line of development are the efforts to use parallel architectures in
statistical computing. Such computer architectures are typically used for the
implementation of demanding numerical algorithms. In recent years computer
science has widened the scope of parallel computing towards distributed
computing. We expect this research area to grow quite rapidly in the future,
with an impact on statistical computing.
Another statistically relevant area of computer science is data management.
While data structures in statistical computing are usually closely related
to formal specifications of data types (e.g. lists, vectors, or matrices),
the interpretation of an analysis process often makes use of conceptual and
relational structures. Traditionally this topic is treated in the theory of databases.
A major breakthrough in this area was the introduction of the relational
data model by E. F. Codd (1970 [26]). It offers the opportunity
to describe complex real-world problems from a conceptual point of view in
a unified manner. The description of data by data models is nowadays cap-
The last difficulty is the size of the data. P. J. Huber (1994 [57]) classified
data sets from tiny (about 100 bytes) up to huge (about 10^10 bytes).
One can definitely argue that size is always an issue relative to computing
power and storage capacity, and problems practically intractable 30 years
ago are nowadays routine applications. Nevertheless, today's statisticians
and computer scientists have to solve problems for huge datasets. Specific
problems concerning the data structure, the data base management, and the
computational complexity are discussed in Huber (1999 [58]).
A second important topic for computational statistics with respect to ap-
plications is the statistical analysis process itself. The ubiquitous availability
of the computer and of statistical software packages has changed the context
in many ways. On the one hand statistical software packages support
statisticians in the phase of exploratory data analysis and allow them the
evaluation of numerous tentative models for the data without careful plan-
ning in advance. On the other hand they enable non-statisticians to perform
rather complex analyses for their data, in former times solely carried out by
professional statisticians. This evolution has weakened in some sense the role
of statisticians as custodians of the data and has caused many discussions
inside the statistical profession. Here we only want to mention Y. Dodge
and J . Whittaker (1992 [34]) who raised the point that this development
might bring about a de-skilling of certain parts of the profession. However
they also argued that the democratization of facilities does not automatically
mean a threat to the profession in the long run. We claim that statistical
analysis is definitely more than the application of certain algorithms because
an analysis strategy is required too. For instance in the current scientific
development of the bio-sciences we see an explosion of highly complex data
problems that can only be managed in part with the resources at hand.
In the nineteen eighties the question of automated analysis strategies was
intensively discussed in connection with the issue of statistical expert systems.
This undertaking ended without substantial success, making it clear that it
is rather implausible to assume that statisticians can easily be replaced by
machines in the near future. To put it in a nutshell, not even standard
data-analytic problems can be handled easily via routine applications and simple
rule systems. Another area of interest in this context is certainly the role of
computers in statistical education, in particular for non-professionals, taking
advantage of the various opportunities offered in the field of computational
statistics.
and Table 2 for the period 1990-2002. The notation in these tables is the
following: "p" denotes that a topic was present in the proceedings, "f" denotes
that a topic was frequently present in the proceedings (i.e. more than
3 times), "K" represents a keynote paper, "I" represents one or two invited
papers, and finally "T" signifies a tutorial. We suggest reading the respective
table in parallel with the verbal description of the chronologically ordered
COMPSTAT symposia.
The very first COMPSTAT symposium was held at the University of Vi-
enna in 1974, initiated by P. P. Sint and J. Gordesch. Both were in fact also
the editors of the proceedings [1]. There were about 50 presentations organized
according to five subject areas, reflecting to some extent the interests
of the organizers: Computational Probability, Automatic Classification, Nu-
merical and Algorithmic Aspects of Statistical Computing, Simulation and
Stochastic Processes, and last but not least Software Packages. In 1974 there
were neither formal keynotes nor invited lectures. However, during the opening
session a special lecture was delivered by the well-known mathematical
statistician L. Schmetterer on stochastic approximation (not in the proceedings).
Naturally the topics within the subject areas were rather scattered, but
some of them remained popular across the whole period of 30 years such
as Robustness (note that P. J. Huber was present at the first symposium),
Time Series Analysis, and Modelling (the latter in its beginning primarily
meaning factor analysis and dimension reduction techniques). It is remark-
able that a number of statistical packages popular at the time were already
covered: R. Buhler's P-STAT and Sir J. A. Nelder's GENSTAT. The presentation
of a SAS system, not to be confused with the later much more
successful namesake [25], should also be mentioned. Further, as in succeeding
conferences, APL (for details see e.g. [19]) appeared as a popular statistical
environment.
With all this in mind, Gordesch and Sint speculated in the preface of [1]
about a spectacular growth of the field, writing "which as we hope will
now result in techniques of model building being very different today from
what it was in pre-computer days".
The second COMPSTAT symposium took place in Berlin in 1976, organized
by J. Gordesch and P. Naeve (also the editors of the volume [2]). Altogether
58 papers were presented. The subject areas were more or less the same as
at the first meeting but the names had changed somewhat: Computational
Probability, Automatic Classification and Multidimensional Scaling, Numer-
ical and Algorithmic Aspects of Statistical Models (with subtopics Linear
Models, Multivariate Analysis and Sampling), Simulation and Stochastic Pro-
cesses, and finally Software. A new section "Applications" was introduced
(mainly in economics and biology). This selection reflects the understanding
of computational topics in the mid nineteen seventies: Multivariate Analy-
sis comprised mainly ANOVA as well as Factor Analysis and Computational
The history of COMPSTAT and statistical computing 19
COMPSTAT Symposium            90 92 94 96 98 00 02
Algorithms                    p f fI fI f
Applications                  fI p p fI K f
Bayes/MCMC/EM                 f pK f fI fI f
Categorical Data              I
Classification/Discrimination f fI fI fK f f f
Cluster Analysis              p p
Computational Probability     I I p p
Data Bases/Metadata           pI fI pT f
Data Imput./Survey Design     I p fKI p
Data Visualization/Graphics   p p pI p pI
Dimension Reduction           I p p p
Experimental Design           f f f p p
Expert Systems/AI             fI p
Exploratory Data Analysis     p p
Foundations/History           pI K p
Graphical Models              p fI p p p
Handling of Huge Data         K p p
Image Analysis                pI pI p
Internet-based Methods        I p fI
MANOVA                        p I p
Modelling/GLM/GAM             p fI p pK fI p
Neural Networks               I p
Numerics/Optimization         pI p fI p
Parallel Computing            pI p
Reliability and Survival      p f p pI p
Regression (linear/nonlinear) p fI fI p p p
Resampling                    f f p p p f
Robustness                    fI fI f fI p
Simulations                   p p p p
Smoothing/Curve Estimat.      f f pI fI p pI
Spatial Statistics            p pI p p fI
Statistical Software          p fI f p fIT
Stat. Learning/Data Mining    f p p p pI fK
Stochastic Systems            pI I pI
Teaching Statistics           p pK pI pI fI
Time Series Analysis          fI f fI fI pI fI fI
Tree-based Methods            p pI p p
Wavelets                      p pI pI K p
Many of them reflect the trends of the time, especially the penetration of per-
sonal computers and improved graphical displays into the world of statistics.
The wish of statisticians to apply these new technologies, not yet covered
by commercial software packages, can be clearly seen. Another novelty was
the production of a complementary volume with short communications and
posters.
The sixth symposium took place in Prague in 1984, extending the scope
of COMPSTAT to the Eastern European countries. As a matter of fact, IASC
had planned a meeting in Bratislava (a Slovakian town only 65 kilometers
from Vienna) but the (communist) Czechoslovakian Academy of Sciences
decided on the central location of Prague. Luckily there were several dedicated
statisticians, among them T. Havranek, Z. Sidak and M. Novak, the organizers
of the meeting. Many colleagues who at that time did not have the chance
to participate in Western meetings could attend. Out of a record number of
about 300 submissions, 65 papers were selected. T. Havranek, Z. Sidak and
M. Novak also edited the proceedings [6] and a companion volume of short
communications and posters, following the example of 1982. Commemorating
the tenth anniversary of the COMPSTAT symposia, P.P. Sint was invited
to deliver a lecture entitled "Roots in Computational Statistics". The main
topics covered in invited talks were Computational Statistics in Random Pro-
cesses, Computational Aspects of Robustness, Discriminant Analysis, Statis-
tical Expert Systems, Optimization Techniques, Linear Models, and Formal
Computation in Statistics. Besides these topics, the traditional COMPSTAT
themes like Cluster Analysis, Multivariate Analysis, Statistical Modelling
and Software were also present. It is worth mentioning that a number of
more computer science-oriented papers on data management and data preprocessing
also found their way into the proceedings, reflecting some of the
local interests.
COMPSTAT 1986 (the seventh symposium) was held in Rome and attracted
a record number of about 900 participants. From around 300 submissions,
about 60 contributed papers as well as 13 invited papers
were published in the proceedings [7], edited by F. De Antoni, N. Lauro
pers were published in the proceedings [7], edited by F. De Antoni, N. Lauro
and A. Rizzi. A keynote lecture was given by E. B. Andersen about informa-
tion, science and statistics, discussing the challenges for statistics resulting
from the development of statistical software, graphics, interactive computing,
and new methods and styles of data analysis. Apart from the invited program,
the proceedings volume is well balanced between statistically
oriented themes, computer science oriented topics and novel applications.
The main statistical themes comprised the traditional COMPSTAT topics
like Probabilistic Models in Exploratory Data Analysis, Computational Ap-
proaches of Inference, Numerical Aspects of Statistical Computation, Cluster
Analysis and Robustness, but also a rather specialized topic entitled Three
Mode Data Matrices. The more computer science oriented topics reflect the
trend towards Expert Systems and Artificial Intelligence, typical for the mid
huge data sets. The themes of the invited papers were Multivariate Analysis,
Classification and Discrimination, Dynamic Graphics, Numerical Analysis,
Nonparametric Regression, MCMC, Selection Procedures, Neural Networks,
Change Point Problems, Wavelet Analysis, and Time Series Forecasting. Besides
these invited lectures two tutorials were organized: W. Schachermayer
introduced statistical problems in finance and insurance and B. Sundgren
gave an overview on metadata. Furthermore a discussion about the nature of
computational statistics was organized. Altogether about 280 participants
attended this meeting. The organizers returned to the traditional format of
publishing the proceedings and an additional volume of short communications
and posters. The proceedings [11] were edited by R. Dutter and W. Grossmann
and contained the invited and 60 contributed papers, selected from
approximately 200 submissions. With respect to statistical software the increasing
dominance of S for the development of computational statistics was
evident. Other more commercially oriented products were presented during
the conference and documented in a separate booklet.
After the symposium in Vienna there was a COMPSTAT Satellite Meeting
on Smoothing held at Semmering, attracting almost 50 participants. Because
of the COMPSTAT anniversary a historic train brought COMPSTAT
participants and accompanying persons on the oldest mountain railroad in the
world (now a World Cultural Heritage site) from Vienna to the spa of Semmering
in the Austrian Alps.
The meeting was organized by M. G. Schimek and comprised 7 invited
lectures (presenters were B. Cleveland, M. Delecroix, R. Eubank, Th. Gasser,
R. Kohn, A. van der Linde, and W. Stuetzle) and two software presentations
(S-Plus and, for the first time, XploRe). W. Härdle and the organizer edited a
proceedings volume [12] consisting of 10 papers (not published elsewhere) out
of 26 given at the meeting. It also includes an expository discussed paper by
J. S. Marron ("A Personal View of Smoothing and Statistics") and two other
discussed contributions by W. S. Cleveland and C. Loader ("Smoothing by
Local Regression: Principles and Methods") and by B. Seifert and Th. Gasser
("Variance Properties of Local Polynomials and Ensuing Modifications"). It
is worth mentioning that local regression smoothing is now a principal tool for
normalization of microarray data in genetic research. Since the symposium in
Copenhagen 1988, nonparametric smoothing techniques and relevant software
had played a steadily increasing role in COMPSTAT.
The twelfth COMPSTAT symposium was organized under the auspices
of A. Prat in Barcelona in 1996, attracting an estimated number of 300 participants.
An opening keynote was delivered by G. Box entitled "Statistics,
Teaching, Learning and the Computer" and a closing keynote "Information
Markets" was presented by A. G. Jordan. Eleven invited papers covered
topics like Time Series, Functional Imaging Analysis, Applications of Statistics
in Economics, Classification and Computers, Image Processing, Optimal
Design, Wavelet Analysis, Profile Methods, Web-based Computing, and Mul-
26 Wilfried Grossmann, Michael G. Schimek and Peter Paul Sint
STAT. A proceedings volume [15] and a supplement comprising the short
communications and posters were published (editors P. van der Heijden and
J. G. Bethlehem).
The last (fifteenth) COMPSTAT symposium we can report on took place
at Humboldt-Universität zu Berlin in 2002. It was organized by W. Härdle
and attracted approximately 220 submissions. This time the primary focus
was on business applications, especially in connection with the Internet
(such as E-Commerce and Web-Mining), and on the handling of massive and
complex data sets (e.g. in genetic research). The idea was to expand the
traditional scope of COMPSTAT and to make it attractive for new audiences.
However, the number of about 260 participants made it clear that this
endeavour was not sufficient to substantially enlarge the audience for such
a meeting. Still, it is only fair to mention that many young researchers
showed up for the first time, also joining IASC because of a special promotion
scheme.
There was a keynote delivered by T. Hastie entitled "Supervised Learning
from Microarray Data". The other 8 invited talks concerned the topics
Bayes Methods, Graphical Methods, Internet Traffic, Smoothing, Teaching,
and Time Series. Further there were 90 contributed papers connected to the
above topics as well as to Algorithms, Classification, Computational Inference,
Computing Environments, Data Mining, Meta Data, and Multivariate
Methods. Two additional areas of interest emerged because of the submissions
received: the statistical language R and functional data analysis.
Innovations were that the printed proceedings volume (edited by W. Härdle
and B. Rönz [16]) also appeared as a Springer-Verlag e-book and that the
companion volume of short communications and posters was published on
a CD. Moreover several prizes were granted (among them a new one for
software innovation).
7 Conclusions
The evolution of computational statistics has always been strongly influenced
by developments in statistical theory, in algorithms, in computer science, and
by the problems statisticians are confronted with. In statistical theory many
current topics are connected to concepts and methods of computational statistics,
requiring definitely more than the proper implementation of well-defined
algorithms. With respect to computation we can observe a shift from pure
numerical analysis to more graphically oriented techniques and algorithms
developed in computer science. This brings about a new quality of cooperation
between statistics and computer science with a high potential for future
development. The traditional knowledge transfer from computer science to
computational statistics was primarily in the areas of statistical packages,
statistical languages, statistical graphics and statistical data management
systems. Yet these conventional areas are still open to new developments, in
particular with regard to statistical Web services and the seamless integration
References
[1] Bruckmann, G., Ferschl, F. and Schmetterer, L. (1974, eds.). COMP-
STAT 1974. Proceedings in Computational Statistics. Physica-Verlag,
Wien.
[2] Gordesch, J. and Naeve, P. (1976, eds.). COMPSTAT 1976. Proceedings
in Computational Statistics. 2nd Symposium Berlin/FRG. Physica-Verlag,
Wien.
[3] Corsten, L. C. A. and Hermans, J. (1978, eds.). COMPSTAT 1978.
Proceedings in Computational Statistics. 3rd Symposium Leiden/The
Netherlands. Physica-Verlag, Wien.
[36] Efron, B. (2002). Statistics in the 20th Century and the 21st. In Dutter,
R. (ed.) Festschrift 50 Jahre Österreichische Statistische Gesellschaft
1951-2001. Austrian Statistical Society, Vienna, 7-20.
[37] Efron, B. (2003). Robbins, empirical Bayes and microarrays. Ann.
Statist., 31, 366-378.
[38] Fisherkeller, M. A., Friedman, J. H. and Tukey, J. W. (1974). PRIM-9:
An Interactive Multidimensional Data Display System. Stanford Linear
Accelerator Publication No. 1408. Palo Alto/CA.
[39] Fleck, C. (2000). Wie Neues nicht entsteht. Die Gründung des Instituts
für Höhere Studien in Wien durch Ex-Österreicher und die Ford
Foundation. Österreichische Zeitschrift für Geschichtswissenschaften, 1,
129-177.
[40] Francis, I. (1981). Statistical Software: A Comparative Review. North
Holland, New York.
[41] Frawley, W., Piatetsky-Shapiro, G. and Matheus, C. (1992). Knowledge
Discovery in Databases: An Overview. AI Magazine, Fall 1992, 213-228.
[42] Friedman, J. H. and Stuetzle, W. (2002). John W. Tukey's work on
interactive graphics. Ann. Statist., 30, 1629-1639.
[43] Friedman, J. H. and Tukey, J. W. (1974). A projection pursuit algorithm
for exploratory data analysis. IEEE Trans. Comp., C 23, 881-890.
[44] Gelman, A., Carlin, J. B., Stern, H. S. and Rubin, D. B. (1996). Bayesian
Data Analysis. Chapman & Hall, London.
[45] Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions,
and the Bayesian restoration of images. IEEE Trans. Pattern
Anal. Machine Intellig., 6, 721-741.
[46] Gentle, J. E. (2002). Elements of Computational Statistics. Springer-Verlag,
New York.
[47] Gershenfeld, N. (1999). The Nature of Mathematical Modeling. Cambridge
University Press, Cambridge/UK.
[48] Goldstine, H. H. (1972). The Computer from Pascal to von Neumann.
Princeton University Press, Princeton/NJ.
[49] Hand, D. (1996). Classification and Computers: Shifting the Focus. In
Prat, A. (ed.) COMPSTAT 1996. Proceedings in Computational Statistics,
77-88.
[50] Hastie, T. and Tibshirani, R. (1990). Generalized Additive Models. Chapman
& Hall, London.
[51] Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of
Statistical Learning. Springer-Verlag, New York.
[52] Hastings, W. K. (1970). Monte Carlo sampling methods using Markov
chains and their applications. Biometrika, 57, 97-109.
[53] Härdle, W., Klinke, S. and Turlach, B. A. (1995). XploRe: An Interactive
Statistical Computing Environment. Springer-Verlag, New York.
[54] Heide, L. (2003). Diffusing the emerging punched card technology in Europe
1889-1914. Information Systems and Technology in Organizations
and Society. ISTOS-Workshop Universitat Pompeu Fabra, Barcelona.
http://cbs.dk/staff/lars.heide/ISTOS/paper-l0.pdf.
[55] Hornik, K. and Leisch, F. (2002). Vienna and R: Love, Marriage and the
Future. In Dutter, R. (ed.) Festschrift 50 Jahre Österreichische Statistische
Gesellschaft 1951-2001, Austrian Statistical Society, 61-70.
[56] Huber, P. J. (1964). Robust estimation of a location parameter. Ann.
Math. Statist., 35, 73-101.
[57] Huber, P. J. (1994). Huge Datasets. In Dutter, R. and Grossmann, W.
(eds.) COMPSTAT 1994. Proceedings in Computational Statistics, 1-13.
[58] Huber, P. J. (1999). Massive Dataset Workshop: Four Years After. J.
Computat. Graph. Statist., 8, 635-652.
[59] Ihaka, R. and Gentleman, R. (1996). R: A language for data analysis
and graphics. J. Computat. Graph. Statist., 5, 299-314.
[60] Lauritzen, S. L. and Wermuth, N. (1989). Graphical models for association
between variables, some of which are qualitative and some quantitative.
J. Royal Statist. Soc., B 50, 157-224.
[61] Lauro, C. (1996). Computational Statistics or Statistical Computing, is
that the question? Computat. Statist. Data Anal., 23, 191-193.
[62] Mehta, C. R. and Patel, N. R. (1992). Exact Logistic Regression: Theory,
Applications, Software. In Dodge, Y. and Whittaker, J. (eds.) Computational
Statistics. Volume 2, 63-78.
[63] Mehta, C. R. and Patel, N. R. (1997). Exact Inference for Categorical
Data. Electronic Publication: Harvard University and Cytel Software
Corporation, http://www.cytel.com/Library/articles.asp.
[64] Mehta, C. R., Patel, N. R. and Senchaudhuri, P. (2000). Efficient Monte
Carlo Methods for Conditional Logistic Regression. J. Amer. Statist. Assoc.,
95, 99-108.
[65] Metropolis, N. and Ulam, S. (1949). The Monte Carlo Method. J. Amer.
Statist. Assoc., 44, 335-342.
[66] Meyer, D., Leisch, F., Hothorn, T. and Hornik, K. (2002). StatDataML:
An XML Format for Statistical Data. In Härdle, W. and Rönz, B. (eds.)
COMPSTAT 2002. Proceedings in Computational Statistics, 545-550.
[67] Monahan, J. F. (2001). Numerical Methods of Statistics. Cambridge Uni-
versity Press, Cambridge/UK.
[68] Nelder, J. A. (1974). Genstat - A Statistical System. In Bruckmann,
G., Ferschl, F. and Schmetterer, L. (eds.) COMPSTAT. Proceedings in
Computational Statistics, 499-506.
[69] Nelder, J. A. (1978). The Future of Statistical Software. In Corsten, L.
C. A. and Hermans, J. (eds.) COMPSTAT 1978. Proceedings in Computational
Statistics, 11-19.
[70] Nelder, J. A. and Wedderburn, R. W. M. (1972). Generalized linear
models. J. Royal Statist. Soc., A 135, 370-384.
[71] Neuwirth, E. and Baier, T. (2002). Embedding R in standard software,
and the other way round. In Hornik, K. and Leisch, F. (eds.) DSC
2001 Proceedings. 2nd International Workshop on Distributed Statistical
Computing, http://www.ci.tuwien.ac.at/Conferences/DSC-2001.
[72] Owen, D. B. (1976). On the history of statistics and probability. Proceedings
of a symposium on the American mathematical heritage. Dekker,
New York.
[73] Ripley, B. D. (1987). Stochastic Simulation. Wiley, New York.
[74] Robbins, H. (1956). An empirical Bayes Approach to Statistics. Proc.
Third Berkeley Symp. Statist. Probab., 1, 157-163.
[75] Schäffler, O. (1895). Neuerungen an statistischen Zählmaschinen.
Österreichisches Patentprivileg No. 46/3182, Patentarchiv, Wien.
[76] Shoshani, A. (1997). OLAP and Statistical Databases: Similarities and
Differences. Proceedings 16th ACM SIGACT-SIGMOD-SIGART Symposium
on Principles of Database Systems 1997, 185-196.
[77] Stone, C. (1977). Consistent nonparametric regression (with discussion).
Ann. Statist., 5, 595-645.
[78] Sundgren, B. (1975). Theory of Data Bases. Petrocelli/Charter, New
York.
[79] Temple Lang, D. (2000). The Omegahat Environment: New Possibilities
for Statistical Computing. J. Computat. Graph. Statist., 9, 423-451.
[80] Thisted, R. A. (1988). Elements of Statistical Computing. Chapman &
Hall, New York.
[81] Tierney, L. (1989). XLISP-STAT: A Statistical Environment Based
on the XLISP Language. Technical Report No. 528, School of
Statistics, University of Minnesota, http://www.stat.umn.edu/
luke/xls/tutorial/techreport/techreport.html.
[82] Tierney, L. (1990). LISP-STAT: An Object-Oriented Environment for
Statistical Computing and Dynamic Graphics. Wiley, New York.
[83] Tukey, J. W. (1962). The future of data analysis. Ann. Math. Statist.,
33, 1-67 and 812.
[84] Tukey, J. W. (1970). Exploratory Data Analysis. Volume I and II (limited
preliminary edition). Addison-Wesley, Reading/MA.
[85] Tukey, J. W. and Cooley, J. W. (1965). An algorithm for the machine
calculation of complex Fourier series. Math. Comput., 19, 237-301.
[86] Wegman, E. J. and Marchette, D. J. (2003). On Some Techniques for
Streaming Data: A Case Study of Internet Packet Headers. J. Computat.
Graph. Statist., 12, 893-914.
Acknowledgement: First of all the authors wish to thank Prof. Jaromir
Antoch (Charles University of Prague) for giving them the opportunity to
present a historical keynote. Further, the authors appreciate valuable hints
and comments from the following colleagues: Dr. Lutz Edler (German Cancer
Research Center Heidelberg), Dr. Karl A. Fröschl (Electronic Commerce
Competence Center Vienna), Dr. Walter Grafendorfer (Austrian Computer
Society), Prof. Kurt Hornik (Wirtschaftsuniversität Wien), and Prof. Edward
J. Wegman (George Mason University). However, all errors and omissions
are the responsibility of the authors.
Address: W. Grossmann, University of Vienna, Institute for Statistics and
Decision Support Systems, Universitätsstraße 5, A-1010 Wien, Austria
M.G. Schimek, Medical University of Graz, Institute for Medical Informatics,
Statistics and Documentation, Auenbruggerplatz 2, A-8036 Graz, Austria
P.P. Sint, Austrian Academy of Sciences, Institute for European Integration
Research, Prinz Eugen Straße 8-10/2, A-1040 Wien, Austria
E-mail: wilfried.grossmann@univie.ac.at,
michael.schimek@meduni-graz.at, sint@oeaw.ac.at
COMPSTAT'2004 Symposium © Physica-Verlag/Springer 2004
Abstract: We construct exact D-efficient designs for linear regression models
using a hybrid algorithm that consists of genetic and local search components.
The genetic component is a genetic algorithm (GA) with a 100% mutation
rate and ranking selection. The local search methods we use are based on
the G-bit improvement and a combination of the Powell multidimensional
and Brent line optimization techniques. Computational results show that
the hybrid algorithm generates designs that are comparable in efficiency to
those found using the modified Fedorov algorithm (MFA), but without being
limited to a given set of candidate points.
1 Introduction
An experimental design is said to be optimal if it meets predefined criteria
that determine the precision with which the model parameters or response
is estimated. The D-optimality criterion of Kiefer and Wolfowitz [12] puts
emphasis on the precision with which the model parameters are estimated by
maximizing the determinant of the model's information matrix. This criterion
has the intuitively appealing interpretation of minimizing the volume of
the joint confidence ellipsoid of the least squares regression parameter estimates.
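To make the criterion concrete, the following sketch computes the determinant of the per-point information matrix M(ξ) = (1/n)X'X for a small design. The quadratic one-factor model and the helper names (`det`, `information_matrix`) are our own illustration, not code from the paper.

```python
def det(m):
    # determinant by Laplace expansion; fine for the small matrices used here
    n = len(m)
    if n == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j]
               * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(n))

def information_matrix(design, model):
    # per-point information matrix M(xi) = (1/n) X'X
    X = [model(x) for x in design]
    n, p = len(X), len(X[0])
    return [[sum(r[i] * r[j] for r in X) / n for j in range(p)]
            for i in range(p)]

# hypothetical example: quadratic model in one factor, f(x) = (1, x, x^2)
model = lambda x: [1.0, x, x * x]
design = [-1.0, 0.0, 1.0]          # 3-point design on [-1, 1]
M = information_matrix(design, model)
d_criterion = det(M)               # |M(xi)| = 4/27, the value to maximize
```

For this model the 3-point design {-1, 0, 1} in fact attains the maximal determinant on [-1, 1], so it serves as a reference point.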
Exact D-optimal designs are calculated using optimization algorithms
such as those given by Cook and Nachtsheim [6] and Johnson and Nachtsheim
[11], among others. These algorithms iteratively maximize the determinant
of the information matrix by sequentially, or simultaneously, adding
and deleting points of the design. Many of the most used algorithms require
an explicit set of candidate points to work with, thus putting heavy demands
on prior domain-specific knowledge of the optimization problem. Although
not as common, evolutionary algorithms have also been used to calculate
D-optimal designs. Govaerts and Sanchez [8] were the first to use genetic
algorithms (GAs) to find exact D-optimal designs. However, their algorithm
incorporated the use of a candidate set of design points, much like the more
traditional algorithms. Poland et al. [17] used a GA to improve on the standard
Monte Carlo algorithms by applying DETMAX and k-exchange as the
mutation operator. Compared to the exchange algorithms, their algorithm
38 Abdul Aziz Ali and Magnus Jansson
was slower but yielded better results. Broudiscou et al. [5] successfully applied
a purely genetic algorithm to the exact D-optimal design problem in
a chemometrics setting. GAs have since been used by Montepiedra et
al. [15], who omitted the mutation operator in favor of faster convergence,
and Heredia-Langner et al. [10], who used real-value encoding in place of the
more traditional binary encoding. The latter also give an excellent
introduction to the use of GAs in calculating optimal designs.
This paper presents the use of hybrid algorithms in calculating D-efficient
or near D-optimal designs. The hybrid algorithms considered here consist of
a genetic component with 100% mutation rate and local search methods. The
mutation operator is extensively used in order to escape from local optima.
The hybrid algorithm is therefore implemented in two stages: the genetic
component finds a neighborhood point of a local optimum and the local
search finds the local optimum. The genetic component is then updated with
the coordinates of the local optimum and the process is repeated until some
termination condition is met.
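The two-stage cycle just described can be sketched in miniature. This is a hedged toy version on bit strings, with a one-bit hill climb standing in for the paper's local search; the function names and parameters (`hybrid_search`, `pop_size`, and so on) are ours, not the authors'.

```python
import random

def mutate(s, rng):
    # flip one randomly chosen bit of a copy of s (mutation rate 100%)
    i = rng.randrange(len(s))
    return s[:i] + ('1' if s[i] == '0' else '0') + s[i + 1:]

def one_bit_hill_climb(s, fitness):
    # toy stand-in for the paper's local-search stage
    improved = True
    while improved:
        improved = False
        for i in range(len(s)):
            cand = s[:i] + ('1' if s[i] == '0' else '0') + s[i + 1:]
            if fitness(cand) > fitness(s):
                s, improved = cand, True
    return s

def hybrid_search(fitness, n_bits=12, pop_size=6, generations=10, seed=1):
    rng = random.Random(seed)
    pop = [''.join(rng.choice('01') for _ in range(n_bits))
           for _ in range(pop_size)]
    for _ in range(generations):
        # GA stage: mutated copies compete with their parents (elitist)
        copies = [mutate(s, rng) for s in pop]
        pop = sorted(pop + copies, key=fitness, reverse=True)[:pop_size]
        # local-search stage: polish the fittest string, feed it back
        pop[0] = one_bit_hill_climb(pop[0], fitness)
    return pop[0]

best = hybrid_search(lambda s: s.count('1'))
```

With the simple bit-counting objective used here the search settles on the all-ones string; for design construction the fitness would instead be the determinant of the information matrix of the decoded design.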
$$ \mathrm{D\text{-}eff} = 100 \cdot \left[ \frac{|M(\xi)|}{|M(\xi^{*})|} \right]^{1/p}, $$
where ξ denotes the design being evaluated, ξ* the D-optimal design, M(·) the per-point information matrix, and p the number of model parameters.
This comparison is valid even when the designs being compared are of different
sizes because the comparison is based on the information per point for
each design. For the interested reader, an excellent review of optimum design
theory is given by Ash and Hedayat [3] and in the books by Atkinson & Donev [2]
and Silvey [18].
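As an illustration of the per-point comparison, the sketch below scores a 4-point design against the D-optimal 3-point design for a quadratic one-factor model; the model, the two designs and all helper names are our own assumptions, chosen only to show the normalization at work.

```python
def det(m):
    # determinant by Laplace expansion (small matrices only)
    n = len(m)
    if n == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j]
               * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(n))

def info_per_point(design, model):
    # per-point information matrix (1/n) X'X
    X = [model(x) for x in design]
    n, p = len(X), len(X[0])
    return [[sum(r[i] * r[j] for r in X) / n for j in range(p)]
            for i in range(p)]

model = lambda x: [1.0, x, x * x]     # p = 3 parameters
optimal = [-1.0, 0.0, 1.0]            # D-optimal reference design
candidate = [-1.0, 0.0, 0.0, 1.0]     # 4-point design to be scored

p = 3
d_eff = 100.0 * (det(info_per_point(candidate, model))
                 / det(info_per_point(optimal, model))) ** (1.0 / p)
```

Because both determinants are taken per point, the 4-point candidate can be scored directly against the 3-point optimum despite the different sizes.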
solutions lies in the interval [-a, a], then the 8-bit binary string 00000000
will represent -a and 11111111 will represent +a. A randomly generated
set of strings forms the initial population from which the GA starts its search.
Initial candidate solutions (strings) are usually uniformly sampled from the
search space in order to introduce variability in the set of candidate solutions.
This initialization process is a random search whereby a number of possible
solutions are randomly generated and the best solutions (the fittest strings)
are remembered.
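The encoding described above can be written out as a small decoder; the function name and the equally spaced linear mapping are our reading of the text, not code from the paper.

```python
def decode(bits, a):
    # map an n-bit string to a real number in [-a, a]:
    # "00...0" -> -a, "11...1" -> +a, linearly spaced in between
    n = len(bits)
    value = int(bits, 2)
    return -a + 2.0 * a * value / (2 ** n - 1)
```

For example, decode("00000000", 5.0) yields -5.0 and decode("11111111", 5.0) yields 5.0, matching the endpoints in the text.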
strings are kept. This type of selection leads to what is known as an elitist
algorithm. It ensures that the fittest strings are preserved from one iteration
to the next and removes the possibility that all strings found in iteration i + 1
are poorer than the fittest string found in iteration i. Other methods of
selection, such as selection with probability proportional to fitness, may result
in the loss of the fittest strings, as there is a positive probability that any one
string could be lost.
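The elitist ranking selection can be sketched as follows; merging parents and offspring into one ranked pool is our interpretation of the description, and the function name is hypothetical.

```python
def ranking_selection(parents, offspring, fitness, pop_size):
    # rank parents and offspring together and keep the pop_size fittest;
    # the best string always survives, which makes the scheme elitist
    pool = parents + offspring
    return sorted(pool, key=fitness, reverse=True)[:pop_size]
```

Since the fittest parent is always in the pool, no generation can end up strictly worse than the one before it, unlike proportional selection.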
2.3 Recombination
Recombination when applied to strings with binary coding is usually per-
formed by single or multi-point crossover. Single point cross-over is used in
this application because of its simplicity and ease of execution. This is done
by sampling without replacement of a pair of strings with probability propor-
tional to their fitness. A point is randomly chosen and each string is divided
into two segments. The strings then swap their segments and a new pair of
strings is created. In this way, strings with high fitness are paired with each
other and exchange sub-strings. Those that inherit segments which result in
high fitness (also called building blocks) are kept for the next iteration.
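Single-point crossover as described can be sketched like this; the cut-point range and the function signature are our own choices.

```python
import random

def single_point_crossover(s1, s2, rng=random):
    # choose a cut point, split both parent strings there and swap tails
    point = rng.randrange(1, len(s1))
    return s1[:point] + s2[point:], s2[:point] + s1[point:]
```

Each child inherits one segment from each parent, so bit values at every position are a rearrangement of the parents' bits at that position.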
2.4 Mutation
Mutation relocates the candidate solutions to some other points in the search
space. Although it is common to use mutation with low probability so as not
to destroy highly fit strings and prolong the computation times, we always
apply mutation with probability Pm = 1. The reason for mutating in this
way is that copies of the strings are made prior to mutating them so that
strings are not lost because of mutation. Also a ranking selection which
results in the elitist algorithm is used. This algorithm implements mutation
by switching one randomly selected bit per string. The inversion operator is
a generalization of the mutation operator. Whereas the mutation operator
switches one bit per string, the inversion operator flips a whole string segment.
The start and end positions for the inversion are randomly decided. Inversion
is used when there is no improvement in fitness in at least one iteration.
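The mutation and inversion operators can be sketched as below; in our sketch the copy of the parent is kept by the caller, and both function names are hypothetical.

```python
import random

def mutate(s, rng=random):
    # switch one randomly selected bit (applied with probability 1;
    # the caller retains a copy of the parent, so nothing is lost)
    i = rng.randrange(len(s))
    return s[:i] + ('1' if s[i] == '0' else '0') + s[i + 1:]

def invert(s, rng=random):
    # flip a whole randomly chosen segment of the string
    i, j = sorted(rng.randrange(len(s)) for _ in range(2))
    j += 1
    flipped = ''.join('1' if b == '0' else '0' for b in s[i:j])
    return s[:i] + flipped + s[j:]
```

Mutation changes exactly one position; inversion changes a contiguous block whose start and end are chosen at random, generalizing the one-bit switch.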
The GA search process is thus iterative: evaluation, selection and recombination
using the basic operators selection, cross-over and mutation,
until some termination condition is met. The basic algorithm is given by the
pseudo code below.
If s(i) is the set of strings processed by the GA at iteration i and f is the
objective function then,
i = 0; initialize s(i);
evaluate f(s(i));
do while (termination condition is not met);
    select s(i + 1) from s(i);
    recombine and mutate s(i + 1);
    evaluate f(s(i + 1));
    i = i + 1;
end do;
2. Sweep the st ring bit by bit, evaluat ing the fitn ess of every st ring t ha t
results from one-bi t switches. If a bit chan ge results in a violati on of
any of t he constraints then discard the st ring .
3. W hen a st ring is found that has a better fitness t han t he first (starting)
st ring t hen replace t he st arting string with t he fitter string.
4. Repeat the pr ocess until no fur ther improvement is made afte r sweeping
through the fittest st ring.
An objective function is evaluated for every switch, which makes the
method somewhat slow. The method is therefore most useful when the ge-
netic algorithm converges to a point on the search grid that is very close
to the optimum and there is a steep gradient between the two points. This
method is only used on the fittest string found after the termination condition
has been met by the GA.
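Steps 2-4 above amount to a one-bit hill climb. A minimal sketch in Python; the one-max fitness and the linear constraint used in the example are illustrative assumptions, not the paper's design problem:

```python
def sweep_improve(start, fitness, feasible):
    """One-bit sweep local search: flip each bit in turn, discard strings
    that violate a constraint, replace the starting string whenever a
    fitter string is found, and repeat until a full sweep through the
    fittest string yields no further improvement."""
    best = start[:]
    improved = True
    while improved:
        improved = False
        for i in range(len(best)):
            cand = best[:]
            cand[i] = 1 - cand[i]
            if not feasible(cand):       # constraint violated: discard
                continue
            if fitness(cand) > fitness(best):
                best = cand              # replace the starting string
                improved = True
    return best

# illustrative use: maximize the bit count subject to at most 12 ones
result = sweep_improve([0] * 16, sum, lambda s: sum(s) <= 12)
```

Each sweep evaluates the objective once per bit, which is the source of the slowness noted in the text.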
Because of the difficulty of computing the directional derivatives of poorly
characterized functions, we use methods that do not require differentiability.
Local search is traditionally done using greedy algorithms such as those of
Lawler [13] and Syslo et al. [19]. We implement local search by a combination
of Powell's method and Brent line optimization as given in Press et al. [16].
Powell's method is given below. Readers interested in the technical de-
tails are referred to Numerical Recipes in C, available on-line at
www.library.cornell.edu/nr. The algorithm establishes the direction
along which the optimization takes place, and then the Brent line optimiza-
tion is used iteratively. Because minimization and maximization are trivially
related, we consider the optimization problem as the minimization of a func-
tion f without loss of generality.
The algorithm begins by initializing the direction set to the basis vectors
of the n-dimensional space, i.e.,
Hybrid algorithms for construction of D-efficient designs
u_i = e_i,  i = 1, . . . , n.
1. Save the starting position as P_0.
4. Move P_n to a minimum along the direction u_{n+1} and call this point P_0.
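A self-contained direction-set sketch in the spirit of the steps above, with a golden-section search standing in for Brent line minimization (a simplification). The bracketing interval [-10, 10], the direction-replacement rule (dropping the oldest direction rather than Powell's largest-decrease rule), and the quadratic test function are all assumptions for illustration:

```python
def line_min(f, p, u, a=-10.0, b=10.0, tol=1e-6):
    """Golden-section search for the t minimizing f(p + t*u)
    (a simple stand-in for Brent line optimization)."""
    g = (5 ** 0.5 - 1) / 2
    phi = lambda t: f([pi + t * ui for pi, ui in zip(p, u)])
    c, d = b - g * (b - a), a + g * (b - a)
    while b - a > tol:
        if phi(c) < phi(d):
            b, d = d, c
            c = b - g * (b - a)
        else:
            a, c = c, d
            d = a + g * (b - a)
    t = (a + b) / 2
    return [pi + t * ui for pi, ui in zip(p, u)]

def direction_set_min(f, p0, iters=5):
    """Powell-style direction-set minimization: line-minimize along each
    direction in turn, then add the net displacement as a new direction."""
    n = len(p0)
    dirs = [[1.0 if j == i else 0.0 for j in range(n)] for i in range(n)]
    p = p0[:]
    for _ in range(iters):
        p_start = p[:]
        for u in dirs:
            p = line_min(f, p, u)
        u_new = [pi - si for pi, si in zip(p, p_start)]
        if any(abs(c) > 1e-12 for c in u_new):
            dirs = dirs[1:] + [u_new]    # simplification of Powell's rule
            p = line_min(f, p, u_new)    # step 4: minimize along u_{n+1}
    return p

f = lambda x: (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2
sol = direction_set_min(f, [0.0, 0.0])
```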
4 Examples
4.1 Response surface design in two factors
Box and Draper [4] analytically determined D-optimum designs for a second-
order response surface model in two factors using 6 to 9 design points. Exact
D-efficient designs for their model are found using the hybrid algorithm, and
the genetic component of the hybrid algorithm is used alone for comparison as
well as for validation and for testing the performance of the algorithms.
The second-order response surface model in two factors is given by:
model with 6 to 9 points. This indicates that the local search component
of the hybrid algorithm was used to a large extent to find the designs that
minimize |(X^T X)^{-1}|.
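The criterion |(X^T X)^{-1}| can be computed directly for any candidate design. A sketch with NumPy; the 3^2 factorial candidate design on [-1, 1]^2 is an illustrative assumption, not one of the paper's designs:

```python
import numpy as np

def model_row(x1, x2):
    # regressor row for a second-order response surface model in two factors
    return [1.0, x1, x2, x1 * x1, x2 * x2, x1 * x2]

def d_criterion(points):
    """Return |(X^T X)^{-1}| for a design; smaller values are better."""
    X = np.array([model_row(*p) for p in points])
    M = X.T @ X                               # information matrix X^T X
    return np.linalg.det(np.linalg.inv(M))

# 3^2 factorial on [-1, 1]^2 as a 9-point candidate design
design = [(a, b) for a in (-1, 0, 1) for b in (-1, 0, 1)]
score = d_criterion(design)
```

Shrinking the same nine points toward the center makes X^T X nearly singular and the criterion much larger, which is why well-spread support points are favored.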
Exact D-efficient designs are rarely found using analytical function op-
timization, as shown above. When the design region is poorly characterized
and/or constrained, it is usual practice to generate efficient designs using
computerized algorithms. The next two examples are mixture designs with
both linear and non-linear, as well as single- and multi-component, constraints
imposed on their design regions.
of size 1/2^16 = 1.52587E-5. This search grid is finer than that used for the
previous example because the design region for this example is not as regular
and symmetric. The termination condition was when 200 iterations had been
completed, regardless of when the last improvement was made. The genetic
and hybrid algorithms were run 10,000 times and the average efficiencies and
times are shown in Table 2.
The results show that the combination of the GA and local search finds
efficient designs in a relatively short time using few iterations, as seen from
the optimized objective function value. This holds even in the presence of
non-linear constraints on the design region.
A 24-point design was generated using the hybrid algorithm. For comparison
purposes, 200 iterations of the MFA with a candidate set of 144 points which
satisfy all the constraints was used. The candidate set was again generated
using the GA. The hybrid algorithm and the GA were later re-initialized using
the same set of points assembled into 6 designs. Each coordinate point was
coded using 16 bits and the termination condition was when 200 iterations
had been completed. The GA and hybrid algorithm were run 10,000 times.
Details of the average efficiencies and times are shown in Table 4.
Table 4 shows that the hybrid algorithm finds, on average, designs with
higher relative efficiency than those found using the MFA for this problem.
Whereas the MFA can only be as good as the quality of its candidate points,
the hybrid algorithm generates new design points through local search, selec-
tion, and recombination. As a result, the hybrid algorithm arrives at efficient
designs without the benefit of using a specific set of candidate points.
5 Conclusions
A hybrid algorit hm used t o find D- efficient designs for linear regression mod-
els is present ed in t his pap er. The genetic component of the hybrid algorit hm
allows for a high mutation pr obability without necessaril y prolonging the time
t o convergence. This is possible because mutated copies of the st rings are
re-inj ect ed into t he population of st rings during every iteration and only the
Hybrid algorithms for construction of D-efficient designs 47
fittest strings are selected for the succeeding iterations. This greatly increases
the chances of escaping local optima when applied to poorly characterized
functions with many local extrema. Genetic algorithms are very efficient and
are designed to search large spaces. However, they require a large initial pop-
ulation of strings to work with and the resulting variation inevitably leads
to long computing times if the search domain is to be thoroughly explored.
Searching the neighborhood of each point and updating the population of
strings at every iteration of the GA with fitter strings that result from local
search leads to much faster convergence than using the GA alone. The hybrid
algorithm presented here therefore uses a small population of strings to search
for efficient designs. It also requires relatively few iterations and, as a
consequence, less computing time to find efficient designs.
The computing times for the examples used in this paper are real times (not
CPU times) when using a 2.0 GHz Pentium PC. It should be noted that
although the hybrid algorithm provides designs that are as efficient as those
obtained using the MFA, it is usually slower, depending on the number of
candidate points supplied to the MFA, but it has a distinct advantage when
the candidate set of points is not of high quality or even not available. This
relieves the experimenter from having to start with some previous knowledge
of the search domain.
The algorithm presented in this paper is coded in Pascal using Borland
Delphi version 4 and is available as a .exe file upon contacting the authors.
The application that runs the algorithm allows for customizing of all the
GA and local search parameters and generates the design points, the design
matrix, the information matrix and its eigenvalues, the variance function
plots as well as the records and graphical history of the optimization process,
among other things.
References
[1] Altekar M., Scarlatti A.N. (1997). Resin vehicle characterization using
statistically designed experiments. Chemometrics and Intelligent Labo-
ratory Systems 36, 207-211.
[2] Atkinson A.C., Donev A.N. (1992). Optimum experimental designs. Ox-
ford: Oxford University Press.
[3] Ash H., Hedayat A. (1978). An introduction to design optimality with an
overview of the literature. Comm. Statist. Theory Methods 7, 1259-1325.
[4] Box G.E.P., Draper N.R. (1971). Factorial designs, the |X'X| criterion
and some related matters. Technometrics 13, 731-742.
[5] Broudiscou A., Leardi R., Phan-Tan-Luu R. (1996). Genetic algorithm as
a tool for selection of D-optimal design. Chemometrics and Intelligent
Laboratory Systems 35, 105-116.
GEOMETRY OF LEARNING IN
MULTILAYER PERCEPTRONS
Shun-ichi Amari, Hyeyoung Park and Tomoko Ozeki
Key words: Learning, information geometry, singular statistical model, neu-
ral networks.
COMPSTAT 2004 section: Neural networks and machine learning.
1 Introduction
The multilayer perceptron is a simple feedforward model of neural networks,
which transforms input signals to output signals nonlinearly. It is a universal
approximator in the sense that any nonlinear transformation is approximated
sufficiently well by an adequate perceptron, if the number of hidden units is
large.
In order to realize a good approximator, examples of input-output pairs
are used. On-line learning receives a series of training examples one by one,
and modifies the parameters of a perceptron each time one example is
given. Usually old examples are then discarded. Batch learning keeps all the
examples and modifies the parameters in a batch mode.
A multilayer perceptron is an old model of learning machines, and the
error-correcting learning algorithm was established for simple perceptrons in
the sixties. Amari proposed a gradient descent learning method for multi-
layer perceptrons [2], which was rediscovered later independently and became
popular under the name of backpropagation [23].
We study the set of multilayer perceptrons of a fixed architecture, which
includes a number of modifiable parameters called connection weights and
biases. The set forms a multi-dimensional manifold, where all these parame-
ters play the role of admissible coordinate systems. Learning takes place in the
manifold, drawing a trajectory.
It is important to study the geometrical structure of the manifold, which
we call a neuromanifold. We will show by statistical considerations that the
neuromanifold is Riemannian, whose metric is specified by the Fisher infor-
mation matrix [3]. Moreover, it has a pair of affine connections [4], but we
do not state them in the present paper. The neuromanifold has singulari-
ties where the Fisher information (or the Riemannian metric) degenerates [5].
This is an interesting statistical model, because the conventional Cramér-Rao
paradigm excludes such a model, assuming the existence and non-degeneracy
of the Fisher information matrix as regularity conditions.
It is known that the convergence speed of a multilayer perceptron is usu-
ally very slow. This is caused by the Riemannian character, in particular by
its degeneracy, because the conventional backprop learning method does not
take the Riemannian nature into account. Under the conventional algorithm,
the state of a network is often attracted by singularities and takes a long time
to get rid of them. The natural gradient learning algorithm was pro-
posed to overcome this flaw; it takes the Riemannian gradient instead of
the conventional gradient [3]. We show in the present paper the reasons why
it works so well. We also explain an adaptive method of implementing the
natural gradient [8]. In the case of the squared error criterion under Gaussian
noise, the natural gradient algorithm coincides with the adaptive version of
the Gauss-Newton method, but they differ in more general models (see [17]).
We finally study the dynamics of learning and the nature of singularities
and explain the reason why learning trajectories are attracted to, and stay
long in, a neighborhood of singularities. The statistical analysis of the behavior
of estimators in a neighborhood of singularities is another important problem
to be studied. We show that the conventional criteria of model selection, such as
AIC and MDL, fail in this case.
G(θ) = E[∇ log p(y|x; θ) ∇ log p(y|x; θ)^T], (3)

where E denotes expectation, ∇ = (∂/∂θ_i) is the gradient and T denotes the
transpose of a vector. Let us define the square of the distance between two
nearby perceptrons whose parameters are θ and θ + dθ. Information geometry
gives the squared distance by the quadratic form

ds^2 = dθ^T G(θ) dθ. (9)
This is the Riemannian metric, where the Fisher information matrix is used
as the Riemannian metric tensor [19]. This is the only invariant metric to be
introduced in the manifold of probability distributions.
Given a (large) number N of independently generated input-output pairs
(x_1, y_1), ..., (x_N, y_N), the maximum likelihood estimator (or any other first-
order efficient estimator) satisfies the Cramér-Rao bound. Hence, the dis-
tance is large when two perceptrons are well separated in the sense that
their estimation can be done precisely. However, different from the ordinary
statistical model, the neuromanifold includes points at which the Fisher in-
formation degenerates and its inverse diverges. This is related to the uniden-
tifiability of network parameters.
w_i = w_j, (10)
v_i + v_j = v'_i + v'_j, (11)
the critical set on which unidentifiability takes place. The Fisher information
degenerates on the critical set, because the unidentifiability implies that the
estimation error does not converge to 0 even when N goes to infinity. Hence
the statistical model is non-regular, and the Riemannian metric is singular.
See also [7], [8], [9], [15], [24], [27].
Let us introduce the equivalence relation ~, by which two perceptrons
with different parameters are equivalent when their input-output behaviors
are the same. Then the set
(12)
includes algebraic singularities and dimensions are reduced on the critical set.
The conventional theory of statistical estimation does not hold in a neigh-
borhood of singularities.
(14)

The conventional on-line learning algorithm uses the gradient of the in-
stantaneous error at time t, while the natural gradient learning algorithm uses

θ_{t+1} = θ_t − c_t G^{-1}(θ_t) ∇l(x_t, y_t; θ_t),

which is called the natural gradient method. The natural gradient method is
proved to give a Fisher efficient estimator, even though examples are used
only once when they are observed, and then discarded.
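A toy illustration of this efficiency claim (our example, not the authors'): for estimating the mean of a Gaussian with known variance, the Fisher information is G = 1/σ², and a natural-gradient step with a 1/t learning rate reproduces the running sample mean, which is the Fisher efficient estimator:

```python
import random

random.seed(1)
sigma2 = 4.0                 # known noise variance; Fisher information G = 1/sigma2
theta_true = 3.0

def grad_loss(theta, y):
    # gradient of the instantaneous squared error (y - theta)^2 / (2 sigma2)
    return -(y - theta) / sigma2

theta = 0.0
for t in range(200):
    y = random.gauss(theta_true, sigma2 ** 0.5)   # each example used once, then discarded
    c_t = 1.0 / (t + 1)                            # 1/t learning-rate schedule
    # natural-gradient step: pre-multiply by G^{-1} = sigma2
    theta -= c_t * sigma2 * grad_loss(theta, y)
```

Unrolling the update shows theta after t examples equals the sample mean of those examples, so no accuracy is lost relative to batch maximum likelihood.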
The performance of the natural gradient method is largely different from
the conventional method when the Riemannian structure is very different
from the Euclidean one. It will be seen that this is indeed the case with mul-
tilayer perceptrons, because they include singularities where the Riemannian
metric degenerates.
It is known that the learning trajectory is often trapped in so-called
plateaus, at which the parameters change very slowly, and it takes a long time to
get rid of them. The statistical-physics approach made it clear that the parameters
are first attracted to the critical set of the neuromanifold, so that the set
becomes a plateau of learning [25], [21], [18]. Rattray, Saad and Amari [20]
analyzed the dynamics of the natural gradient learning method, and showed
that it has an ideal characteristic for avoiding plateaus. See also [14].
where l_t = l(x_t, y_t; θ_t) and c' is another learning constant which may depend
on t. One should choose c and c' carefully. By using this estimate Ĝ^{-1}(θ_t),
we can obtain the update rule of the adaptive natural gradient method of the
form

θ_{t+1} = θ_t − c Ĝ^{-1}(θ_t) ∇l_t. (20)

Park, Amari and Fukumizu [17] generalized the idea to be applicable to more
general cost functions.
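The recursion for Ĝ^{-1} is not legible in the scan; the rank-one form below, Ĝ^{-1} ← (1+ε)Ĝ^{-1} − ε(Ĝ^{-1}∇l)(Ĝ^{-1}∇l)^T, is our reconstruction of the adaptive estimate described in [8] and should be treated as an assumption. The deterministic two-direction gradient stream is a toy chosen so that the recursion visibly converges to the inverse of the average of ∇l ∇l^T:

```python
import numpy as np

def adaptive_ginv_step(G_inv, g, eps):
    """One rank-one update of the running inverse-Fisher estimate:
    G_inv <- (1 + eps) G_inv - eps (G_inv g)(G_inv g)^T."""
    v = G_inv @ g
    return (1 + eps) * G_inv - eps * np.outer(v, v)

# toy gradient stream with average outer product diag(0.5, 2.0),
# so the recursion should settle near diag(2.0, 0.5)
G_inv = np.eye(2)
for t in range(2000):
    g = np.array([1.0, 0.0]) if t % 2 == 0 else np.array([0.0, 2.0])
    G_inv = adaptive_ginv_step(G_inv, g, eps=0.01)
```

The update costs O(k^2) per step instead of the O(k^3) of explicit inversion, which is what makes the adaptive method practical on-line.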
which is a part of the critical set. This corresponds to the set of all the
perceptrons which have only one hidden unit, where the weight vector is w
and the output weight is v. Let the true parameters be θ_0 = {w_1, w_2, v_1, v_2},
where w_1 ≠ w_2, so that it needs two hidden units.
Let θ = (w, v) be the best perceptron with one hidden unit that approx-
imates the input-output function f(x, θ_0) of the true perceptron. Then, all
the perceptrons of two hidden units on the line:
(22)
correspond to the best approximation by the one-hidden-unit perceptron. Let
us transform the two weights as
(23)
and v are zero, because they are the best approximator. The derivative in
the direction of u is again 0, because the perceptron having u is equivalent
to that having −u, which is derived by exchanging the two hidden units. Hence
the line forms critical points of the cost function. This implies that it is very
difficult to get rid of it once the parameters are attracted to Q(w, v).
Fukumizu and Amari [12] calculated the Hessian of L. When it is positive
definite, the line is really attracting. When it includes negative eigenval-
ues, the state eventually escapes in these directions. They showed that,
in some cases, a part of the line is really attracting in some region, while it
is really a saddle having directions of escape (although the derivative is 0).
In such a case, the perceptron is first truly attracted to the line, and stays
close to the line, fluctuating around it because of random noise, until it finds
the place from which it can escape from the line. This is clearly a plateau.
This explains the plateau phenomenon. In order to show why the natural
gradient works well, we need to evaluate the natural gradient in the neighbor-
hood of the critical points. We can then prove that the natural gradient has
a large magnitude in the neighborhood of the critical set, so that the plateau
phenomena will disappear. Computer simulations confirm this observation.
against
(25)
(26)
(27)
E[λ] = k/N, (28)
However, when the true distribution θ_0 lies on the critical set, the situation
changes. The Fisher information matrix degenerates, and G^{-1} diverges, so
that the expansion is no longer valid. The expectation of the log likelihood
estimator is asymptotically written as
9 Bayesian estimator
The Bayesian estimator is used in many cases where an adequate prior dis-
tribution is assumed for the purpose of penalizing complex models based on
data. It is empirically known that the Bayesian posterior distribution or its
maximizer behaves well in the case of large-scale neural networks. In such a
case, one uses a non-zero smooth prior on the neuromanifold.
However, a smooth prior is not regular in the equivalence class M of the
neuromanifold, because a point in the equivalence class includes infinitely
many equivalent parameters when it is a critical point. This implies
that the Bayesian smooth prior favors singular points (perceptrons
with a smaller number of hidden units) with an infinitely large factor. Hence
the Bayesian method works well in such a case to avoid overfitting. One may
use a very large perceptron with a smooth Bayesian prior, and an adequate
smaller model is selected.
The Bayesian estimator of singular models was studied by Watanabe
[28], [29] by using the method of algebraic geometry, in particular Hiron-
aka's theory of resolution of singularities and Sato's formula in the theory of
algebraic analysis.
10 Model selection
In order to obtain an adequate model, one should select a good class of
models based on data, that is, one should determine the number of hidden
units. This is the problem of model selection. AIC, BIC and MDL have been
widely used as criteria of model selection.
AIC [1] is the criterion to minimize the generalization error. The model
that minimizes
AIC = training error + k/N (32)
is selected by this criterion. This is derived from the asymptotic statistical
analysis, where the maximum likelihood estimator θ̂ is subject to the Gaussian
distribution asymptotically.
MDL [22] is the criterion to minimize the length of encoding the observed
data by using a family of parametric models. It is given asymptotically by
the minimizer of
MDL = training error + (log N / 2N) k. (33)
The Bayesian criterion BIC [26] gives the same criterion as MDL.
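The two criteria are easy to compute; in this sketch we take AIC = training error + k/N and MDL = training error + (log N / 2N) k, where k is the number of parameters. The k/N form of the AIC penalty is our reading of the partly garbled formula, and the training errors in the toy comparison are invented numbers:

```python
import math

def aic(training_error, k, N):
    # per-observation AIC: training error plus k/N
    return training_error + k / N

def mdl(training_error, k, N):
    # per-observation MDL: training error plus (log N / 2N) * k
    return training_error + (math.log(N) / (2 * N)) * k

# toy comparison: larger models fit better but pay a complexity penalty
N = 1000
errors = {2: 0.30, 5: 0.25, 20: 0.24}     # hypothetical training errors per k
best_aic = min(errors, key=lambda k: aic(errors[k], k, N))
best_mdl = min(errors, key=lambda k: mdl(errors[k], k, N))
```

Both criteria trade fit against complexity; MDL's log N factor penalizes large models more heavily as N grows.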
However, in the case of multilayer perceptrons, the neuromanifold of per-
ceptrons with a smaller number of hidden units is included in that with
a larger number, but the former is the critical set of the larger neuroman-
ifold. Therefore, the maximum likelihood estimator (or any other efficient
estimator) is no longer subject to the Gaussian distribution, even asymptot-
ically. Model selection is required when the estimator is close to the critical
set, and hence the validity of AIC and MDL fails to hold. One should evaluate
the log likelihood-ratio statistics more carefully in such a case [6].
Many computer simulations of applications of AIC and MDL have been
reported. Sometimes AIC works better, while MDL does better in other
cases. Such confusing reports seem to arise from the difference between
regular and singular models, and also from the differing nature of the singularities.
11 Conclusions
Multilayer perceptrons are popular nonlinear models for nonlinear regression
analysis of observed data. A class of perceptrons is specified by the number
of hidden units, and a smaller class is included in a larger class. A class
of multilayer perceptrons forms a manifold named the neuromanifold, where
modifiable parameters play the role of the coordinate system.
References
[1] Akaike H. (1974). A new look at the statistical model identification. IEEE
Trans. Automatic Control AC-19, 716-723.
[2] Amari S. (1965) . Theory of adaptive pattern classifiers. IEEE Trans.
Elect. Comput. EC-16, 299-307.
[3] Amari S. (1998) . Natural gradient works efficiently in learning. Neural
Computation 10, 251 - 276.
[4] Amari S., Nagaoka H. (2000). Information geometry. AMS and Oxford
University Press, New York.
[5] Amari S., Ozeki T. (2001). Differential and algebraic geometry of multi-
layer perceptrons. IEICE Trans., E84-A, 31-38.
[6] Amari S. (2003). New consideration on criteria of model selection. Neural
Networks and Soft Computing (Proceedings of the Sixth International
Conference on Neural Networks and Soft Computing), L. Rutkowski and
J. Kacprzyk (eds.), 25-30.
[7] Amari S., Ozeki T., Park H. (2003). Learning and inference in hierarchi-
cal models with singularities. Systems and Computers in Japan 34 (7),
701 -708.
[8] Amari S., Park H., Fukumizu K. (2000). Adaptive method of realizing
natural gradient learning for multilayer perceptrons. Neural Computa-
tion 12, 1399-1409.
[9] Amari S., Park H., Ozeki T. (2002) . Geometrical singularities in the
neuromanifold of multilayer perceptrons. Advances in Neural Informa-
tion Processing Systems,T.G. Dietterich, S. Becker, and Z. Ghahra-
mani (eds .) 14, 343 - 350.
[10] Chen A.M., Liu H., Hecht-Nielsen R. (1993). On the geometry of feed-
forward neural network error surfaces. Neural Computation 5, 910 -927.
[11] Fukumizu K. (2003). Likelihood ratio of unidentifiable models and mul-
tilayer neural networks. The Annals of Statistics 31 (3), 833-851.
[12] Fukumizu K., Amari S. (2000). Local minima and plateaus in hierarchical
structures of multilayer perceptrons. Neural Networks 13, 317-327.
[13] Hartigan J.A. (1985). A failure of likelihood asymptotics for normal mix-
tures. Proc. Berkeley Conf. in Honor of J. Neyman and J. Kiefer 2,
807-810.
[14] Inoue M., Park H., Okada M. (2003). On-line learning theory of soft com-
mittee machines with correlated hidden units - Steepest gradient descent
and natural gradient descent -. J. Phys. Soc. Jpn 72 (4), 805-810.
[15] Kurkova V., Kainen P.C . (1994). Functionally equivalent feedforward
neural networks. Neural Computation 6, 543-558.
[16] Lin X., Shao Y. (2003). Asymptotics for likelihood ratio tests under loss
of identifiability. The Annals of Statistics 31 (3), 807-832.
[17] Park H., Amari S., Fukumizu K. (2000). Adaptive natural gradient learn-
ing algorithms for various stochastic models. Neural Networks 13, 755-
764.
[18] Park H., Inoue M., Okada M. (2003). Learning dynamics of multilayer
perceptrons with unidentifiable parameters. J. Phys. A: Math. Gen. 36
(47), 11753-11764.
[19] Rao C.R. (1945). Information and accuracy attainable in the estimation
of statistical parameters. Bulletin of the Calcutta Mathematical Society
37,81 -91.
[20] Rattray M., Saad D., Amari S. (1998). Natural gradient descent for on-
line learning. Physical Review Letters 81 , 5461- 5464.
[21] Riegler P., Biehl M. (1995). On-line backpropagation in two-layered neu-
ral networks. J. Phys. A: Math. Gen. 28, L507-L513.
[22] Rissanen J. (1978). Modelling by shortest data description. Automatica
14, 465-471.
[23] Rumelhart D.E., Hinton G.E., Williams R.J. (1986). Learning internal
representations by error propagation. In D.E. Rumelhart, J.L. McClel-
land, and the PDP Research Group (eds.), Parallel distributed process-
ing (Vol. 1, 318-362), Cambridge, MA: MIT Press.
[24] Ruger S.M., Ossen A. (1995). The metric of weight space. Neural Pro-
cessing Letters 5, 63-72.
[25] Saad D., Solla S.A. (1995). On-line learning in soft committee machines.
Phys. Rev. E 52, 4225-4243.
[26] Schwarz G. (1978). Estimating the dimension of a model. The Annals of
Statistics 6, 461-464.
[27] Sussmann H.J. (1992). Uniqueness of the weights for minimal feedfor-
ward nets with a given input-output map. Neural Networks 5, 589-593.
Abstract: In previous papers we have shown how a well-known data compres-
sion algorithm called Entropy-Constrained Vector Quantization (ECVQ; [3])
can be modified to reduce the size and complexity of very large satellite data
sets. In this paper, we discuss how to visualize and understand the content
of such reduced data sets. We developed a Java tool to facilitate this, using
simple multivariate visualization and interactively performing further data
reduction on user-selected spatial subsets. This enables analysts to compare
reduced representations of the data for different regions and varying spatial
resolutions. The ultimate aim is to explain physically observed differences,
trends, patterns and anomalies in the data.
1 Introduction
This work came about because of challenges posed by NASA's Earth Observ-
ing System (EOS). EOS is a long-term data collection program for studying
climate change, its consequences for life on Earth, and effects of human activi-
ties on it. The centerpieces of EOS are three satellites, Terra, Aqua and Aura.
Terra and Aqua are already in orbit, and Aura is due for launch in 2004. Each
carries a suite of instruments that collect massive amounts of observational
data; so massive that it is difficult to take full advantage of them. Different
instruments have different sampling strategies, resolutions, file naming con-
ventions, and collect data about different physical processes. The information
is provided to users in files corresponding to individual spacecraft orbits or
parts of orbits, each of which can be very large, and must be stitched together
properly to provide a global or even a regional picture. To make these data
more accessible, NASA produces global summary data sets called Level 3
data products.
Traditionally, Level 3 products are simple maps of mean quantities and
standard deviations at coarse spatial resolution, by month. In [2], we pro-
posed methods for constructing nonparametric, multivariate distribution es-
timates to replace traditional maps. For instance, the Multi-angle Imaging
SpectroRadiometer (MISR) aboard Terra collects data about clouds. A key
goal is to better understand the spatial distribution of clouds since they have
great influence on Earth's energy budget. The information MISR collects
includes three variables seen at high resolution: scene albedo, height, and
cloud presence indicator. Albedo is a measure of scene reflectivity measured
roughly on a scale of zero to one. Scene height is measured in meters above
Amy Braverman and Brian Kahn
the Earth's surface ellipsoid. The cloud indicator is a binary variable taking
value one if the scene is cloudy, and zero otherwise. To summarize this infor-
mation, traditional Level 3 products are created by partitioning one month's
data into spatial subsets corresponding to one degree latitude-longitude grid
cells. Six maps are then produced: mean and standard deviation of albedo,
mean and standard deviation of height, and mean and standard deviation of
cloud indicator.
The Level 3 product we proposed regards each triplet of albedo, height
and cloud indicator as a three-element vector, and uses ECVQ to cluster the
data in each grid cell. We report a set of cluster representatives, the number
of original data points belonging to each cluster, and within-cluster mean
squared error, also called distortion. We call this a summary, or a compressed
or quantized version of the grid cell's data. Figure 1 illustrates. For one grid
cell it shows a three-dimensional scatterplot of the original data in light
gray. Positions of cluster representatives are shown by the embedded balls,
and ball shading shows cluster population according to the color bar on the
right. Two key features of the summary are that i) cluster representatives must
be centroids of cluster members, and ii) data vectors must be assigned to clusters
with the nearest (Euclidean distance) representatives. This ensures that the mean
squared error between grid cell data points and their representatives is at
least locally minimized, and that representatives and mean squared errors
resulting from aggregation to coarser resolutions will be properly preserved.
Details of algorithms like the one used to produce these summaries can
be found in [1].
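The two key features, centroid representatives and nearest-representative assignment, can be sketched with a plain Lloyd-style refinement. This is a simplification of ECVQ: the entropy-constraint term is omitted, and the toy grid-cell data are invented:

```python
def summarize(points, reps, iters=10):
    """Alternate nearest-representative assignment (Euclidean distance) with
    centroid updates; return representatives, cluster counts, and
    within-cluster mean squared error (distortion)."""
    for _ in range(iters):
        buckets = [[] for _ in reps]
        for p in points:
            # ii) assign each vector to the nearest representative
            j = min(range(len(reps)),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, reps[j])))
            buckets[j].append(p)
        # i) each representative becomes the centroid of its cluster members
        reps = [tuple(sum(col) / len(b) for col in zip(*b)) if b else r
                for b, r in zip(buckets, reps)]
    counts = [len(b) for b in buckets]
    mse = [sum(sum((a, c) == (a, c) and (a - c) ** 2 for a, c in zip(p, r))
               for p in b) / len(b) if b else 0.0
           for b, r in zip(buckets, reps)]
    return reps, counts, mse

# toy grid cell: (albedo, height-scaled, cloud indicator) triplets
cell = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (2.0, 2.0, 1.0), (2.1, 2.0, 1.0)]
reps, counts, mse = summarize(cell, [(0.0, 0.0, 0.0), (2.0, 2.0, 1.0)])
```

The (representatives, counts, distortion) triple is exactly the per-cell summary described above.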
Starting with a monthly summary of MISR cloud data at one degree reso-
lution, our challenge is to discover and understand how relationships among
grid cell distributions change spatially, and over different resolutions. In
other words, instead of examining spatial patterns of average behavior and
variability only, we want to examine spatial patterns of other distributional
characteristics such as the number of modes, presence of outliers, and nonlin-
ear regressions. This requires interactively comparing summaries of different
grid cells, and of aggregated spatial areas. Thus, we want to quickly visual-
ize summaries, and construct summaries of summaries in hierarchical fashion.
The main subject of this paper is the Java tool L3View, written to facilitate
this.
2 L3View
The basic data structure underlying L3View is a 180 x 360 array of objects
called L3Cells. An L3Cell contains a variable-length vector of Cluster ob-
jects, with the number of objects depending on grid cell data complexity.
A Cluster records a three-dimensional cluster representative, a cluster count,
and a within-cluster mean squared error. L3View presents a map of the
world, and when the user clicks on it with the mouse, L3View translates
the mouse position into geographic coordinates. L3View opens a separate
window, which contains three graphics for visualizing the clusters represent-
ing that grid cell's data.

Visual data mining for quantized spatial data

[Figure 1: three-dimensional scatterplot of one grid cell's data with embedded
cluster representatives; axes: albedo, height, cloud fraction. Graphic not
recoverable from the scan.]

Figure 2: L3View main control panel showing MISR cloud fraction for
March 2000.
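The structure just described translates directly into code; a Python sketch of the Java classes (the field names are our guesses from the prose, not the tool's actual API):

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Cluster:
    representative: Tuple[float, float, float]  # (albedo, height, cloud indicator)
    count: int                                  # original data points in this cluster
    mse: float                                  # within-cluster mean squared error

@dataclass
class L3Cell:
    # variable-length vector of Cluster objects; length tracks data complexity
    clusters: List[Cluster] = field(default_factory=list)

# the 180 x 360 global array of one-degree grid cells
grid = [[L3Cell() for _ in range(360)] for _ in range(180)]

# a hypothetical summarized cell near the equator / prime meridian
grid[90][180].clusters.append(Cluster((0.4, 2500.0, 0.65), 120, 0.02))
```

A mouse click maps to a (row, column) index into this array, and the selected cell's cluster list drives the three per-cell graphics.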
If the mouse is used to isolate a rectangular geographic region with a rub-
berband box, L3View calculates the corresponding geographic and index lim-
its. These are subsequently used in two cases. First, if the Zoom button is
pushed, a new window containing a magnified image of the isolated area is
spawned. Second, if the Aggregate button is pushed, all clusters from all
grid cells inside the box are summarized, and the result is displayed in a new
GraphView window. The lambda text box accepts user-specified values for
a parameter of the summarization algorithm that specifies how much data
reduction is applied. This is discussed in Section 3.
Finally, the Set Maximum and Set Split sliders are used to study spatial
patterns in the cumulative distribution function of the display variable. Set
Maximum truncates the upper end of the color scale so that all grid cells with
display values at or above the maximum display white. Set Split is similar:
all values above the split value are displayed white, while all values below the
split value display in black.
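The two slider behaviors above can be sketched as simple value-to-color rules. This is a minimal illustration assuming a linear gray scale below the maximum; the class and method names are hypothetical, not L3View's own.

```java
import java.awt.Color;

public class SplitColoring {
    // Set Split: values above the split value display white, values below black.
    static Color splitColor(double value, double split) {
        return value > split ? Color.WHITE : Color.BLACK;
    }

    // Set Maximum: truncate the color scale so values at or above the
    // maximum display white; below it, scale linearly into gray levels.
    static Color maxColor(double value, double max) {
        if (value >= max) return Color.WHITE;
        int g = (int) Math.max(0, Math.min(255, 255 * value / max));
        return new Color(g, g, g);
    }
}
```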
Figure 4: Schematic representation of a gridded map. The large rectangle
represents a 180 x 360 array shown broken into 3 x 6 = 18, 60 x 60 arrays. Each
of these is further subdivided into a 6 x 6 array. Each cell in the 6 x 6 array
is a 10 x 10 arrangement of one degree grid cells. The lighter box illustrates
how four one degree grid cells can make up a grid cell at coarser, two degree
resolution.
$$P\bigl(V = V_{(u+i)(v+j)}\bigr) = \frac{N_{(u+i)(v+j)}}{\sum_{i=0}^{1}\sum_{j=0}^{1} N_{(u+i)(v+j)}},$$
and $N_{ij}$ is the total number of data points represented by the summary of
the corresponding grid cell. In other words, $X_{2uv}$ is a mixture of $X_{1uv}$,
$X_{1(u+1)v}$, $X_{1u(v+1)}$, and $X_{1(u+1)(v+1)}$ with weights equal to the proportions
of the total count represented by $X_{2uv}$ contributed by each one degree cell.
The idea is illustrated on the left side of Figure 5, which shows the mixture
distribution positioned directly above the four component distributions. Any
nesting of fine-scale grid cells in a coarser grid can be represented in a similar
way, and ensures mass, expectation, and mean squared error are all properly
preserved between resolutions.
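The mixture weights above amount to normalizing the four one-degree cell counts by their total. A minimal sketch, with an illustrative class name:

```java
public class MixtureWeights {
    // Given the data counts of the four one-degree cells nested in one
    // two-degree cell, return their mixture weights (proportions of the
    // total count).
    static double[] weights(long[] counts) {
        long total = 0;
        for (long c : counts) total += c;
        double[] w = new double[counts.length];
        for (int i = 0; i < counts.length; i++) {
            w[i] = (double) counts[i] / total;
        }
        return w;
    }
}
```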
If data reduction were not a concern, we could proceed directly to visu-
alizing mixture distributions like the middle layer in Figure 5. However, the
greater the number of grid cells being aggregated, the greater the number
of support points in the mixture, and the number of corresponding clusters.
So, we compress the mixture distribution using a mass-weighted version of
ECVQ described in [1], but implemented here in Java with the user specify-
ing λ directly via the "Set Lambda" button and text box in the main control
panel. K, the maximum number of clusters, is nominally set to 10, and the
default value of λ is zero, thus essentially implementing K-means. If λ is
changed to a positive value, the algorithm becomes ECVQ.
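The role of λ can be illustrated with the assignment step of entropy-constrained vector quantization in the spirit of [3]: each point is assigned to the cluster minimizing squared distance plus λ times the cluster's code length. This is a hedged sketch only; the mass-weighted algorithm of [1] differs in detail, and the class and method names are illustrative.

```java
public class EcvqSketch {
    // Assign each point to the cluster minimizing
    //   squaredDistance + lambda * codeLength,
    // where codeLength = -log2(clusterProb). With lambda = 0 this reduces
    // to ordinary nearest-centroid (K-means) assignment.
    static int[] assign(double[][] pts, double[][] centroids,
                        double[] clusterProb, double lambda) {
        int[] label = new int[pts.length];
        for (int i = 0; i < pts.length; i++) {
            double best = Double.POSITIVE_INFINITY;
            for (int j = 0; j < centroids.length; j++) {
                double d = 0;
                for (int t = 0; t < pts[i].length; t++) {
                    double diff = pts[i][t] - centroids[j][t];
                    d += diff * diff;
                }
                // Entropy penalty: rarely used clusters cost more bits.
                double codeLen = -Math.log(clusterProb[j]) / Math.log(2.0);
                double cost = d + lambda * codeLen;
                if (cost < best) { best = cost; label[i] = j; }
            }
        }
        return label;
    }
}
```

Raising λ trades fidelity for fewer effectively used clusters, which is exactly the data-reduction dial exposed by the lambda text box.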
68 Amy Braverman and Brian Kahn
The region contains 200 grid cells, with a total of 2,099 clusters. These
represent 6,205,769 original MISR albedo-height-cloud indicator vectors. We
begin by aggregating the whole region using the default value of λ = 0 and
the number of clusters, K, set to 15. The resulting GraphView summary is
shown in the lower panel of Figure 6. The figure is small, making the graph
labels difficult to see, but we can see from the bar chart of cluster counts
that one cluster dominates in size. Using L3View interactively, we find that
this is cluster 9, and it contains about 30 percent of the distribution's mass.
Cluster 9 corresponds to one of two clear clusters, 6 being the other. Cluster
6 accounts for another eight percent of the distribution's mass. Clusters 9
and 6 have representatives with low albedo, low height, and clear cloud indicators.
This is a dark, vegetated region of jungle. Areas to its north show significant
numbers of low altitude, bright, clear scenes. This is the Sahara desert.
The remaining clusters have cloudy representatives, and form three sub-
groups. Clusters 8 and 4 constitute a subgroup with low albedo and below
average height. Clusters 1, 2, and 7 form a second subgroup. Their heights
are nearly one standard deviation above the mean, but their albedos range
from nearly one standard deviation below to one standard deviation above
average. The final subgroup is characterized by very large heights, two stan-
dard deviations above the mean at least. They too show a range of albedos
similar to that of the second subgroup. These high clouds are likely the tops
of thunderstorms prevalent in central Africa at this time of year, and the
surrounding cloud formations. The first two subgroups are more mysterious.
Clusters 1, 2, and 7 could be low and mid-level cumulus and stratus clouds.
8 and 4 are possibly dust, clear land surface misclassified as cloud, or simply
dark, low clouds, as implied by the classification.
Noting the relatively sharp difference in cloud fractions between the north-
ern and southern areas, we summarize them separately, as shown in Figure 7.
Signatures of southern region representatives look much like signatures for
the region as a whole. The north's representatives also look roughly like
those of the whole region except clusters similar to 0 and 4 are missing. The
absence of clusters similar to cluster 0 in the lower panel is encouraging,
since this cluster represents deep convective clouds. Corroborating sources
indicate these are in fact absent in this region at this time.
These distributional differences are summarized in Table 1. Not sur-
prisingly, the joint distribution shows that the south is cloudier than the
north. The fact that the south is dominated by low clouds while the north is
dominated by mid-level clouds is less obvious but clear from the conditional
distribution.
Table 1: Joint and conditional distributions of cloud type for the northern
and southern areas.

Type             North (joint)  South (joint)  Total  North (cond.)  South (cond.)
Clear
Low cloud            0.060          0.171      0.231      0.273          0.442
Mid-level cloud      0.094          0.114      0.207      0.426          0.295
High cloud           0.066          0.101      0.168      0.301          0.263
Total cloudy         0.220          0.386      0.606      1.000          1.000
Total                0.550          0.450      1.000
The presence of low, dark clouds in both the north and the south at
this time of year is something of a surprise. To see if these clouds can be
Figure 7: Top: GraphView window for the aggregated area in the northern
half of the region (above the dashed line). Bottom: GraphView window for
the aggregated area in the southern half of the region (below the dashed line).
attributed to specific areas, we subdivided the north and south regions into
east and west. We found no distributional differences related to the east-west
division for either the north or south. We then investigated areas along the
prominent clear/cloudy boundary in Figure 6, and contrasted them to areas
away from the boundary. We did this separately for east and west, but none
of these visualizations revealed definitive distributional differences. We are
therefore reasonably confident that Table 1 tells a complete story.
5 Discussion
The example of the previous section is a small scale, simple example of one
way we think L3View may be useful for exploring spatial summaries of satel-
lite data. Guided by the background map in the main L3View window, we
References
[1] Braverman Amy, Fetzer Eric, Eldering Annmarie, Nittel Silvia, Leung
Kelvin (2003). Semi-streaming quantization for remote sensing data. Jour-
nal of Computational and Graphical Statistics 12 (4), 759-780.
[2] Braverman Amy (2002). Compressing massive geophysical datasets using
vector quantization. Journal of Computational and Graphical Statistics
11 (1), 44-62.
[3] Chou P.A., Lookabaugh T., Gray R.M. (1989). Entropy-constrained vec-
tor quantization. IEEE Transactions on Acoustics, Speech, and Signal
Processing 37, 31-42.
Abstract: This paper develops coordinates and layouts for graphs that rep-
resent statistics indexed by repetitive letter sequences. The need for such
graphics arises in a variety of applications. The examples in this paper con-
cern sequences of nucleotides, such as AGTGGC, and sequences of amino
acids.
1 Introduction
In contrast to maps that represent statistics indexed by geospatial coordi-
nates, the development of graphics methodology for statistics indexed by
repetitive letter sequences has been modest. One interesting exception is the
sequence logo display [13] that can show a sequence of categorical frequen-
cies. Statistical graphics methods for categorical data [7], [8] are relevant for
relatively simple multivariate combinations but so far have seen little use in
nucleotide and amino acid indexing examples.
Journal articles typically show short tables with one column giving the
sequence of letters and one more column providing statistics. The rows are of-
ten sorted by one of the statistical columns. Both the one-dimensional linear
ordering and the restriction to a modest number of rows reduce the opportu-
nity to see patterns that may lead to new understanding. One-dimensional
linear orderings produced by clustering, the first principal component, min-
imal spanning tree traversal, space filling curves or other methods do not
exploit the human ability to see multivariate patterns based on 2-D and 3-D
connectedness and proximity. Connectedness and proximity are among the
most powerful of human perceptual grouping principles [18]. Thus this pa-
per seeks to develop 2-D and 3-D coordinates for representing letter-indexed
statistics.
The graphical design objectives include providing an overview along with
interactive focusing and re-expression methods. For long sequences, com-
binatorics grow exponentially with sequence length and quickly lead to an
overwhelming number of statistics. Overviews require substantial statistical
summarization. The modest research here concerns developing representa-
tions for short sequences.
One approach not investigated here is the use of pixel oriented visual-
ization [9]. It is possible to encode univariate statistics on all nucleotide se-
74 Daniel B. Carr and Myong-Hee Sung
quences of length ten (4^10 = 1,048,576 = 1024 x 1024) in a pixel plot on a
1280 x 1024 monitor. Large high resolution prints will handle somewhat longer
sequences. Interactive pan and zoom methods can support layouts for longer
sequences but showing all the values at once is problematic. The color of
individual pixels is hard to identify with increased monitor resolution. The
use of multiple monitors cannot keep up with the exponentially growing com-
binations.
Layout details are an issue. A convenient layout for a pixel plot may
use lexicographic order for the first half (last half) of the sequence along the
x-axis (y-axis). A following section on fractal coordinates provides another
approach to layouts. In both cases indexing regularity helps to keep the ana-
lyst oriented when interpreting the plots. However, indexing that is convenient
for human memory may be poor at bringing out meaningful patterns. Maps
often work well for showing geospatially-indexed statistics because geospa-
tial attributes often have locally similar values. This applies to covariates as
well as to the primary variables of interest. Proximity that reflects scientific
relationships can be crucial to seeing meaningful patterns.
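The lexicographic pixel layout just described can be sketched as follows: the first five letters index the x-axis and the last five the y-axis, each read as a base-4 number. The class name and the letter ordering A=0, C=1, G=2, T=3 are illustrative assumptions.

```java
public class PixelLayout {
    // Read a nucleotide string as a base-4 number in lexicographic order.
    static int base4(String half) {
        int v = 0;
        for (char c : half.toCharArray()) {
            v = v * 4 + "ACGT".indexOf(c);
        }
        return v;
    }

    // Returns {x, y} in a 1024 x 1024 pixel grid for a length-10 sequence:
    // first half along the x-axis, last half along the y-axis.
    static int[] locate(String seq10) {
        return new int[] { base4(seq10.substring(0, 5)),
                           base4(seq10.substring(5)) };
    }
}
```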
The layouts in this paper have limitations because they are primarily
based on indexing regularity. However, the layouts provide some opportuni-
ties to rearrange letter order or axis placement either for perceptual simplifi-
cation (such as reducing line crossings) or for incorporating physical/chemical
properties (such as hydrophobicity) of the sequence constituents. Interest-
ingly these two objectives can lead to the same display. Axis ordering prob-
lems are in general NP-complete [1]. While many people prefer 2-D lay-
outs, 3-D layouts not only allow better preservation of interpoint distances
of higher dimensional points, they also provide more opportunities for ar-
ranging axes. Thus the layout options are not as restrictive as might be assumed at
first glance.
This paper develops three approaches to constructing coordinates while
mentioning some alternatives along the way. Three different data sets mo-
tivate the development of the coordinates. Section 2 describes self-similar
coordinates at different scales. Section 3 concerns self-similar coordinates
at the same scale with focus on 3-D extension of parallel coordinates. The
application shows cell statistics from a 4-D table. Section 4 illustrates the
use of simple additive vector coordinates for showing all quadruples of amino
acids (ignoring order). This approach can be useful despite some substantial
overplotting problems. The section also hints at other 2-D layouts that avoid
overplotting. Comments appear along the way about software and interactive
tools used for rendering.
coordinate product sets with a coordinate for each position in the sequence.
With A=1, C=2, G=3, and T=4 the sequence ATCG is located at (1, 4, 2, 3).
The similar treatment of each position in the sequence and the same ordering
of nucleotides for each axis motivates the description as a self-similar coor-
dinate system. With coordinates in hand, multivariate glyphs can encode
multivariate statistics associated with the sequence. The most immediate
problem with this approach is that straightforward graphical representation
of points is only available through three dimensions.
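The self-similar coordinate system above reduces to a one-line mapping: each sequence position becomes one axis, with the same A=1, C=2, G=3, T=4 scale on every axis. A minimal sketch (the class name is illustrative):

```java
public class SelfSimilarCoords {
    // Map a nucleotide sequence to self-similar coordinates,
    // one axis per position, with A=1, C=2, G=3, T=4.
    static int[] coords(String seq) {
        int[] c = new int[seq.length()];
        for (int i = 0; i < seq.length(); i++) {
            c[i] = "ACGT".indexOf(seq.charAt(i)) + 1;
        }
        return c;
    }
}
```

For example, coords("ATCG") reproduces the (1, 4, 2, 3) location given in the text.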
The motivation for Figure 1 was an early effort to find transcription reg-
ulation docking sites for the Stanford yeast genes [5]. The study clustered
genes based on their expression levels. This produced groups of seemingly
co-regulated genes. For the genes in a group, the 300 (nucleotide)-letter re-
gions upstream of the protein-coding regions of the genes were scanned with
a sliding window of length six. This produced the basic statistics on the
occurrence frequencies of the different hexamers encountered. The sphere
glyphs in the plot encode counts using size and color. A glance reveals that
most of the higher count hexamers appear along the AAAAAA to TTTTTT
edge. Relatively little was known about transcription regulation when the
Stanford yeast data was first made available and the statistics for Figure 1
were produced. Today many transcription regulation sites of various lengths
have been identified and regions as far as 800 nucleotides upstream are
relevant for some genes. The plots could be improved by obtaining better
data and by highlighting the hexamers known to be associated with tran-
scription regulation.
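The sliding-window scan above amounts to tallying every length-six substring of an upstream region. A minimal sketch, with an illustrative class name:

```java
import java.util.HashMap;
import java.util.Map;

public class HexamerCounts {
    // Slide a window of the given length over the region and tally
    // occurrence frequencies of each window (hexamers when window = 6).
    static Map<String, Integer> count(String region, int window) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + window <= region.length(); i++) {
            counts.merge(region.substring(i, i + window), 1, Integer::sum);
        }
        return counts;
    }
}
```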
Different kinds of software can produce figures similar to Figure 1. With
a little work most standard statistical software can produce projected static
views. Software that provides rotation, filtering and brushing, such as XGobi
and CrystalVision, provides better visualization environments. Efforts to pro-
duce multilayer 3-D visualization methodology similar to GIS software led
to the development of software called GLISTEN (geometric letter-indexed
statistical table encoding). GLISTEN supports point and path layers that
are used in the graphs below.
Efforts to extend the fractal layout to amino acids were not very suc-
cessful. One generalization used the 20 face centers of the icosahedron as
attractors and adapted the weights so the clouds of points associated with
each of the 20 attractors would be separated. Only three letter sequences
were shown to restrict the view to 20^3 = 8000 points. The high density of
points for the smallest scale icosahedra and the occlusion, partly due to more
points, made this layout less desirable.
tors point to twenty points evenly spaced around a circle. Again point size
and color encode the counts, and dynamic filtering has removed low count
tetrahedra. High count patterns jump out. One is a circle involving three
cysteines and one each of the other amino acids.
While Figure 4 reveals a lot of structure, there are at least three problems
worth noting. First, some 2000 points are overplotted. This is partly
related to the symmetric construction with equal angles between vectors. Second,
zooming reveals many points that are too close together to see in an overview.
Third, for over 4000 points involving four distinct amino acids the connection
between plotting location and the indexing is almost impossible to untangle
without mouseovers. Figure 4 is mostly useful for points in an outer annulus
of the circle.
There are several possibilities for alternative views. It is possible to show
a statistic encoded by color in a casement display of 20^4 = 160,000 points [12].
In this example the casement display is a 20 x 20 layout of 20 x 20 matrices.
However it remains desirable to study plots with a factor of 18 fewer points.
The 8855 points can be placed in a 4-D simplex. (See also pentagonal num-
bers [4].) Space prohibits showing a layout composed of two-dimensional
slices of the simplex. There is also a layout in the plane for all tetrahedra
with 2 or more of the same amino acid. While this layout involves dupli-
cates, the regularity makes the layout easier to study. Such a layout can
provide a starting point for drilling down to conditioned views of the 1-1-1-1
combinations.
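The count of 8855 points arises because quadruples of amino acids that ignore order are multisets of size 4 drawn from 20 letters, counted by the binomial coefficient C(20 + 4 - 1, 4) = C(23, 4). A short sketch confirming the arithmetic (the class name is illustrative):

```java
public class MultisetCount {
    // Exact binomial coefficient via the multiplicative formula;
    // each intermediate product is divisible by i, so long arithmetic
    // stays exact for these small arguments.
    static long choose(int n, int k) {
        long r = 1;
        for (int i = 1; i <= k; i++) {
            r = r * (n - k + i) / i;
        }
        return r;
    }
}
```

Here choose(23, 4) = 8855, and 160,000 / 8855 gives the "factor of 18" reduction mentioned above.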
Graphs for representing statistics indexed by letter sequences 81
Figure 4: Vector addition coordinates. Sphere size (and color) encode statis-
tics for protein tetrahedra. Small rectangles show plotting locations of low
count spheres.
5 Closing remarks
Just as map projections have been devised to serve different purposes, co-
ordinate systems for encoding statistics can be developed to serve different
purposes. A worthy goal is to develop coordinate systems with a regularity
that minimizes memory burdens and helps analysts keep oriented with re-
spect to the coordinates. A tension arises when one desires to show complex
relationships faithfully in some abstract sense while keeping the relation-
ships cognitively accessible. In many cases there are no easy answers and
the graphics are a compromise. Still, analysts can make discoveries from
imperfect graphs. It is worthwhile to consider graphics that lean toward
cognitive accessibility and work toward incorporating as much scientific
structure as possible. Accessible graphics enable analysts to look
and, if they look, they have a chance to see.
References
[1] Ankerst M., Berchtold S., Keim D.A. (1998). Similarity clustering of di-
mensions for an enhanced visualization of multidimensional data. Pro-
ceedings IEEE Symposium on Information Visualization, IEEE Com-
puter Society, Washington, 51-60.
[2] Brusic V., Rudy G., Harrison L.C. (1998). MHCPEP, a database of
MHC-binding peptides: update 1997. Nucleic Acids Research 26 (1),
368-371.
[3] Carr D.B., Nicholson W.L. (1988). EXPLOR4: a program for exploring
four-dimensional data. Dynamic Graphics for Statistics, W.S. Cleveland
and M.E. McGill (eds.), Wadsworth, Belmont, California, 309-329.
[4] Conway J.H., Guy R.K. (1996). The book of numbers. Copernicus Books,
Inc. New York.
[5] DeRisi J.L., Iyer V.R., Brown P.O. (1997). Exploring the metabolic and
genetic control of gene expression on a genomic scale. Science 278, 680-
686.
[6] Feller W. (1968). An introduction to probability theory and its applica-
tions. Third Edition. John Wiley and Sons. New York.
[7] Friendly M. (1999). Extending mosaic displays: marginal, conditional,
and partial views of categorical data. Journal of Computational and
Graphical Statistics 8 (3), 373-395.
[8] Hofmann H. (2000). Exploring categorical data: interactive mosaic plots.
Metrika 51, 11-26.
[9] Keim D.A. (1996). Pixel-oriented visualization techniques for exploring
very large databases. Journal of Computational and Graphical Statistics,
58-77.
[10] Lee J.P., Carr D., Grinstein G., Kinney J., Saffer J. (2002). The next
frontier for bio- and cheminformatics visualization. T-M Rhyne, Ed.
IEEE Computer Graphics and Applications, 6-11.
[11] Mandelbrot B.B. (1983). The fractal geometry of nature. W.H. Freeman
and Company.
[12] Munson P.J., Singh R.K. (1997). Statistical significance of hierarchical
multi-body potentials based on Delaunay tessellation and their applica-
tion in sequence-structure alignments. Protein Science 6, 198-201.
[13] Schneider T.D., Stephens R.M. (1990). Sequence logos: a new way to
display consensus sequences. Nucleic Acids Research 18, 6097-6100.
[14] Segal M., Cummings R., Hubbard A. (2001). Relating amino acid se-
quences to phenotype: analysis of peptide binding data. Biometrics 57,
632-643.
[15] Singh R.K., Tropsha A., Vaisman I.I. (1996). Delaunay tessellation of
proteins: four body nearest neighbor propensities of amino acid residues.
J. Computational Biology 3 (2), 213-221.
Abstract: Many statistical techniques, particularly multivariate method-
ologies, focus on extracting information from data and proximity matrices.
Rather than rely solely on numerical characteristics, matrix visualization al-
lows one to graphically reveal structure in a matrix. This article reviews the
history of matrix visualization, then gives a more detailed description of its
general framework, along with some extensions. Possible research directions
in matrix visualization and information mining are sketched. Color versions
of figures presented in this article, together with software packages, can be
obtained from http://gap.stat.sinica.edu.tw/.
1 Introduction
The seminal work of Tukey [34] states a basic principle of Exploratory Data
Analysis (EDA):
In his concept of EDA, Tukey would allow the data to speak for themselves
prior to adoption of any standard assumptions or formal modeling. Much of
this preliminary work can be achieved with graphics-oriented tools - the box
and whisker plot, the scatterplot, etc.
Many visualization techniques have now been developed to assist us in
looking at data. Much of this literature has been devoted to dimension re-
duction: multidimensional scaling [11], projection pursuit [20], self-organizing
maps [24], and sliced inverse regression [26]. These techniques are very use-
ful for exploring data structure when the number of variables is of moderate
size and when structure is not too complex. Yet, with striking advances
in computing, communication, and high-throughput biomedical instruments,
the number of variables can easily reach tens of thousands, and the need
for practical data analysis remains. Dimension reduction tools generally lose
effectiveness when it comes to visual exploration for information structure
embedded in very high dimensional data sets. On the other hand, matrix
86 Chun-Houh Chen et al.
visualization, integrated with computing, memory, and display, has great po-
tential for visually exploring the structure that underlies massive and complex
data sets.
A brief review of matrix visualization is provided in Section 2. Section 3
introduces the general framework of matrix visualization and some extensions
of it appear in Section 4. An outline of possible research directions is sketched
in Section 5 and there are concluding remarks in Section 6.
ity. Psychiatrists [22] have addressed three fundamental issues: the group-
ing structure among the symptoms, the clustering structure of patients, and
the general behavior of every patient-cluster in each symptom-group. These
three issues are closely related to the three major pieces of information con-
tained in any multivariate data set: the linkage amongst n subject points
in the p-dimensional space; the linkage between p variable vectors in the
n-dimensional space; and the interaction linkage between the sets of subjects
and variables. Factor analysis and clustering related methods are commonly
applied to answer the first two issues, but there is no general technique for
studying the interaction effects for subjects and variables. With appropriate
presentation (permutation and color/shape coding) and integration for the
raw data and proximity matrices, MV can be used to effectively display all
three pieces of information with many types of data formats and sampling
schemes.
the main body of the data set can exhaust the color spectrum and only the
relative structure of outliers and the main body can be observed [3], [28].
A logarithm or similar transformation can be applied to variables or prox-
imities to diminish the outlier effect. Transformation of a variable (symptom),
also termed the column conditioned transformation, is commonly practiced.
Row (patient) and matrix conditioned transformations are used from time to
time.
The Iris data [14] is used to compare the performance of seriations with
several commonly used sorting algorithms. The target proximity matrix is the
Euclidean distance matrix of the 150 iris flowers on four variables. Using the
convergence properties of a series of Pearson correlation matrices, Chen [6]
proposed an elliptical seriation which identifies permutations with very good
near-Robinson structure (Figure 4g & h).
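Near-Robinson structure means that, after permutation, distances within each row never decrease when moving away from the diagonal. The exact property can be checked with a small helper; this is a hypothetical sketch for illustration, not part of the GAP software of [6].

```java
public class RobinsonCheck {
    // A Robinson dissimilarity matrix has entries that never decrease
    // when moving away from the diagonal within any row.
    static boolean isRobinson(double[][] d) {
        int n = d.length;
        for (int i = 0; i < n; i++) {
            for (int j = i + 2; j < n; j++) {
                // moving right of the diagonal, entries must not decrease
                if (d[i][j] < d[i][j - 1]) return false;
            }
            for (int j = i - 2; j >= 0; j--) {
                // moving left of the diagonal, entries must not decrease
                if (d[i][j] < d[i][j + 1]) return false;
            }
        }
        return true;
    }
}
```

Seriation algorithms like those compared in Figure 4 aim for permutations under which this check comes close to holding.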
Figure 4: Permuted Euclidean distance maps for Iris data with eight seriation
algorithms: (a) farthest insertion spanning; (b) nearest insertion spanning;
(c) single linkage tree; (d) complete linkage tree; (e) average linkage tree; (f)
GAP rank-one tree; (g) GAP rank-two ellipse; (h) GAP double ellipse.
Matrix visualization and information mining 91
Figure 5: Proximity and raw data maps with dendrograms after permutation.
displays the mean sufficient statistical graph for Figure 7. This presentation
clearly illustrates the within-groups strength and the between-group rela-
tionship for symptoms and patients. More importantly, the sufficient graph
for the raw data map effectively summarizes the interaction patterns of four
patient-clusters on three symptom-groups. These three mosaic-displays of
MV in Figure 8 can now easily reveal all three components of linkages for
a given multivariate data set.
4.1 Sediment MV
Regular MV preserves the identity of each subject and variable; each dot in
Figure 9a is the score of a specific symptom for a particular patient. It is
possible to ignore symptom identity and sort the symptom profile for each
patient according to severity. This results in the sediment MV for patients,
seen in Figure 9b, which expresses severity structure. One could also omit
patients' identities and create the sediment MV for symptoms, as in Figure 9c.
This is a side-by-side bar-chart and box-plot which displays the distribution
structure for all symptoms simultaneously.
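The sediment construction above is simply a row-wise sort: ordering each patient's symptom scores by decreasing severity discards symptom identity and leaves only the severity structure. A minimal sketch, with illustrative names:

```java
import java.util.Arrays;

public class SedimentMV {
    // Each row is one patient's symptom scores; returns a copy with
    // every row sorted in decreasing severity.
    static double[][] sediment(double[][] scores) {
        double[][] out = new double[scores.length][];
        for (int i = 0; i < scores.length; i++) {
            out[i] = scores[i].clone();
            Arrays.sort(out[i]);                       // ascending
            for (int a = 0, b = out[i].length - 1; a < b; a++, b--) {
                double t = out[i][a]; out[i][a] = out[i][b]; out[i][b] = t;
            }
        }
        return out;
    }
}
```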
4.2 Sectional MV
The goal of a sectional MV is to display only those numerical values that
satisfy certain conditions in the original MV display. Each sub-figure in
Figure 10 exhibits correlation coefficients with p-values smaller than certain
significance levels for a Student t-test. Figures with smaller p-values preserve
more significant correlation coefficients along the main diagonal to reveal
major (tight) symptom-groups, since the matrix maps have already been
permuted.
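The sectional rule above can be sketched by blanking out correlations whose t statistic, t = r * sqrt((n-2)/(1-r^2)), is not significant. For brevity this hypothetical sketch thresholds |t| against a caller-supplied critical value rather than computing the p-value itself; all names are illustrative.

```java
public class SectionalMV {
    // Returns a copy of the correlation matrix with non-significant
    // entries blanked out as NaN (to render as empty cells in the map).
    // n is the sample size behind each correlation coefficient.
    static double[][] section(double[][] r, int n, double tCrit) {
        int p = r.length;
        double[][] out = new double[p][p];
        for (int i = 0; i < p; i++) {
            for (int j = 0; j < p; j++) {
                double t = r[i][j] * Math.sqrt((n - 2) / (1 - r[i][j] * r[i][j]));
                out[i][j] = (i == j || Math.abs(t) >= tCrit)
                        ? r[i][j] : Double.NaN;
            }
        }
        return out;
    }
}
```

Tightening tCrit (i.e. lowering the significance level) blanks more entries, which is why the sub-figures with smaller p-values retain mainly the tight symptom-groups along the diagonal.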
5 Future directions
Matrix visualization is not a new research field but there are still many topics
to be explored. All the available MV methods focus on seriation algorithms
or coloring (shading) schemes for a data or proximity matrix with entries
along a continuous scale. This is insufficient for exploring more complicated
information structures in the statistical modelling of longitudinal, categorical,
dependent or other complex data. We discuss several possible MV related
issues in this section.
Color legends: PANSS score scale; p-value thresholds < 1, < 0.10, < 0.05, < 0.01, < 0.005.
6 Conclusion
MV tools are not created to replace existing mathematical or statistical pro-
cedures. Instead, they can be applied in advance to obtain a general picture
of the information structure and build up confidence for choosing and using
more rigorous and appropriate mathematical and statistical operations. Of
course it is possible that a good MV display alone can answer all the ques-
tions a user has in mind and reveal more comprehensive understanding about
a data set than formal mathematical operations and statistical modellings.
References
[1] Bertin J. (1967). Semiologie graphique. Paris: Editions Gauthier-Villars.
English translation by William J. Berg as Semiology of Graphics: Dia-
grams, Networks, Maps. The University of Wisconsin Press, Madison, WI,
1983.
[2] Carmichael J., Sneath P. (1969). Taxometric maps. Systematic Zoology
18, 402-415.
[3] Chang S.C., Chen C.H., Chi Y.Y., Ouyoung C.W. (2002). Relativity and
resolution for high dimensional information visualization with generalized
association plots (GAP). Proceedings in Computational Statistics 2002
(Compstat 2002), Berlin, Germany, 55-66.
[4] Chen C.H. (1996). The properties and applications of the convergence of
correlation matrices. In: 1996 Proceedings of the Section on Statistical
Computing, 49-54, American Statistical Association.
[5] Chen C.H. (1999). Extensions of generalized association plots (GAP).
In: 1999 Proceedings of the Section on Statistical Graphics, 111-116,
American Statistical Association.
[6] Chen C.H. (2002). Generalized association plots: information visualiza-
tion via iteratively generated correlation matrices. Statistica Sinica 12,
7-29.
[7] Chi Y.Y. (1999). Information visualization for comparing two sets of vari-
ables. Master Thesis. Division of Biomedical Statistics, Graduate Institute
of Epidemiology, College of Public Health, National Taiwan University.
[8] Chepoi V., Fichet B. (1997). Recognition of Robinsonian dissimilarities.
Journal of Classification 14, 311-325.
[9] Church K.W., Helfman J.I. (1993). Dotplot: a program for exploring self-
similarity in millions of lines of text and code. Journal of Computational
and Graphical Statistics 2, 153-174.
[10] Cox T.F., Cox M.A.A. (2000). A general weighted two-way dissimilarity
coefficient. Journal of Classification 17, 101-121.
[11] Cox T.F., Cox M.A.A. (2001). Multidimensional scaling. 2nd ed. Chap-
man & Hall/CRC.
[12] Eisen M.B., Spellman P.T., Brown P.O., Botstein D. (1998). Cluster
analysis and display of genome-wide expression patterns. Proc. Nat'l.
Acad. Sci. U.S.A. 95, 14863-14868.
[13] Encarnacao J., Fruhauf M. (1994). Global information visualization: the
visualization challenge for the 21st century. In: Scientific Visualization
Advances and Challenges, L. Rosenblum et al. (eds), Academic Press.
[14] Fisher R.A. (1936). The use of multiple measurements in taxonomic prob-
lems. Annals of Eugenics 7, 179-188.
[15] Friendly M. (2002). Corrgrams: exploratory displays for correlation ma-
trices. The American Statistician 56, 316-324.
[16] Friendly M., Kwan E. (2003). Effect ordering for data displays. Compu-
tational Statistics & Data Analysis 43, 509-539.
[17] Gale N., Halperin C.W., Costanzo C.M. (1984). Unclassed matrix shad-
ing and optimal ordering in hierarchical cluster analysis. Journal of Clas-
sification 1, 75-92.
[18] Gower J.C. (1971). A general coefficient of similarity and some of its
properties. Biometrics 27, 857-874.
[19] Hartigan J.A. (1972). Direct clustering of a data matrix. Journal of the
American Statistical Association 67, 123-129.
[20] Huber P.J. (1985). Projection pursuit. The Annals of Statistics 13, 435-
475.
[21] Hubert L. (1976). Seriation using asymmetric proximity measures.
British Journal of Mathematical and Statistical Psychology 29, 32-52.
[22] Hwu H.G., Chen C.H., Hwang T.J., Liu C.M., Cheng J.J., Lin S.K., Liu
S.K., Chen C.H., Chi Y.Y., Ouyoung C.W., Lin H.N., Chen W.J. (2002).
Symptom patterns and subgrouping of schizophrenic patients: significance
of negative symptoms assessed on admission. Schizophrenia Research 56,
105-119.
[23] Kay S.R., Fiszbein A., Opler L.A. (1987). The positive and negative
syndrome scale (PANSS) for schizophrenia. Schizophrenia Bulletin 13,
261-276.
[24] Kohonen T. (1995). Self-organizing maps. Berlin, Heidelberg: Springer.
[25] Lenstra J.K. (1974). Clustering a data array and the traveling salesman
problem. Operations Research 22, 413-414.
[26] Li K.C. (1991). Sliced inverse regression for dimension reduction (with
discussion). Journal of the American Statistical Association 86, 316-342.
[27] Ling R.F. (1973). A computer generated aid for cluster analysis. Com-
munications of the ACM 16, 355-361.
[28] Marchette D.J., Solka J.L. (2003). Using data images for outlier detec-
tion. Computational Statistics and Data Analysis 43, 541-552.
[29] Marcotorchino F. (1991). Seriation problems: an overview. Applied
Stochastic Models and Data Analysis 7, 139-151.
100 Chun-Houh Chen et al.
1 Introduction
Within the "New Media in Education Funding Programme" the German
Federal Ministry of Education and Research (bmb+f) supports the project
e-stat (project period April 2001 - June 2004) to develop and to provide
a multimedia, web-based, and interactive learning and teaching environment
in applied statistics called EMILeA-stat, which is a registered brand name.
It is accessible via internet (emilea-stat.uni-oldenburg.de).
The project was set up by 13 partners at that time working at seven
German universities: Bonn, Berlin (Humboldt-University), Dortmund, Karls-
ruhe, Münster, Oldenburg (leading university), and Potsdam. In test and
evaluation phases of EMILeA-stat other universities are involved, too. The
project is also supported by further partners in an advisory role, and it cooperates
with economic partners such as SPSS Software, Springer-Verlag, MD*Tech
Method & Data Technologies (XploRe software), and AON Re. Including
the group of associated partners who are providing additional content, about
70 people are co-working in developing and realizing EMILeA-stat at the
present time. For more detail about the project we refer to its web page
www.emilea.de.
economics, and engineering. Models, tools, and methods which have been
developed in statistics are applied in modelling and data analysis, e.g., in
business and industry, in order to obtain decision criteria and to gain more
insight into structural correlations. Owing to these various applications and
the necessity of using statistical methodology in so many fields, there have
to be consequences for the processes of learning and teaching: pupils, for
example, should get to know elementary and application-oriented statistics.
Therefore, statistics and data analysis, theoretically and practically, have
to become part of teachers' studies at university and of in-service training
courses. Moreover, students of the many different disciplines with a statistics
component should be familiar with basic and advanced statistics. These goals
gave the main impetus to develop EMILeA-stat
• as one system suitable for teaching statistics at schools, universities,
and in further vocational training,
• as one system which is accessible anywhere, anytime, and for anyone.
The basic concept offers on the one hand the opportunity to tailor individ-
ual courses covering specific learning needs. On the other hand, EMILeA-stat
serves as an intelligent statistical encyclopaedia.
Basic statistical contents are presented on three levels of abstraction in
order to take into account that different types of users have, owing to their
individual mathematical and theoretical backgrounds, different needs. If
sensible, the contents are written on level
B (basic level): like undergraduate courses in applied statistics for stu-
dents, e.g., of economics, psychology, and social sciences, and
C (advanced level): containing deeper material and special topics within
the broad field of statistics and applied probability.
Furthermore, user-oriented views and scenarios, which are close to real-
world applications, are integrated.
The following fields and subjects of quantitative methodology are or
will be contained in EMILeA-stat: descriptive and inductive statistics, ex-
ploratory data analysis, interactive statistics, graphical representations and
methods, basic mathematics needed in statistics, probability theory, statisti-
cal methods in finance and insurance mathematics, modelling and prediction
of data in financial markets, statistical methods in marketing, virtual pro-
ductions and virtual companies, experimental design, statistical quality man-
agement, and business games.
st-apps and EMILeA-stat 103
3 Interactive visualizations
The theoretical statistical content in EMILeA-stat is supplemented by in-
teractive visualizations which are programmed as Java applets. By offering
a variety of interactive options (for a detailed description see below) they
support learners in their learning process by offering the possibility to
explore the explained method, to experiment with data, and to gain their own
experience with the discussed topic. Since many places where teaching takes
place, e.g., universities or schools, still do not have access
to the internet, these visualizations are not only part of the system but also
realized as an off-line graphical package called st-apps. A German version
of this tool, including an additional textbook with explanations, instruc-
tions, proposals for the use in teaching, etc., is available via the publishing
company Springer. An English edition is planned.
In the following we give an overview of the visualizations included
in st-apps and present this tool by giving some examples. Finally the dif-
ferences between the on-line version and the graphical package are briefly
sketched.
(Screenshots of applets: bar charts, line charts, location parameters, scale parameters.)
106 Katharina Cramer, Udo Kamps, and Christian Zuckschwerdt
(Screenshots of applets: box plots, Lorenz curves.)
(Screenshots of applets: regressions; annotated plotting area.)
Moreover, existing data (points on the axis) can be moved to the right or
left with the left mouse button. The axes are automatically rescaled.
These options are included in many visualizations, such as those concern-
ing location and scale parameters, box plots, scatter plots, regressions, his-
tograms, and the approximate empirical distribution function.
Furthermore, there are interactive aspects which appear only in
specific visualizations. Three examples are given in the following:
Histogram The histogram applet offers the most interactivity. Each bar
can, for example, be split into two bars by clicking with the mouse into the
respective bar.
By shifting the endpoints of the bars, the widths of the classes, and possibly
their number, change.
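The recount that such a split triggers is easy to sketch. The applets themselves are Java applets; the following Python sketch is purely illustrative (the function name, data, and break points are hypothetical):

```python
import numpy as np

def split_class(breaks, data, split_at):
    """Insert a new class boundary where the user clicked and recount.

    Illustrative sketch only, not the applet's actual Java code.
    """
    breaks = np.sort(np.append(breaks, split_at))
    counts, _ = np.histogram(data, bins=breaks)
    # Density-scaled heights keep bars of unequal width comparable.
    heights = counts / (counts.sum() * np.diff(breaks))
    return breaks, heights

rng = np.random.default_rng(0)
data = rng.normal(size=200)
# Splitting the middle class [-1, 1] at 0 turns 3 classes into 4.
breaks, heights = split_class(np.array([-4.0, -1.0, 1.0, 4.0]), data, 0.0)
print(breaks, heights)
```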
Fitting a straight line Concerning linear regression there is, e.g., one
visualization available where a straight line has to be fitted manually to the
data. The correct linear regression function obtained by least squares can
also be added for checking the manually fitted line.
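The comparison the applet supports, a manually dragged line against the least-squares line, can be sketched as follows (an illustrative Python sketch with simulated data; the manual coefficients stand in for what a learner would set on screen):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 30)
y = 2.0 + 0.5 * x + rng.normal(scale=0.8, size=x.size)

def rss(intercept, slope):
    """Residual sum of squares of the line y = intercept + slope * x."""
    return float(np.sum((y - (intercept + slope * x)) ** 2))

# A manually "fitted" line, as a learner might drag it on screen.
manual = (1.5, 0.6)

# The least-squares line the applet can overlay for checking.
slope, intercept = np.polyfit(x, y, 1)

print(f"manual RSS:        {rss(*manual):.2f}")
print(f"least-squares RSS: {rss(intercept, slope):.2f}")
```

By construction, no manually chosen line can beat the least-squares line in residual sum of squares, which is exactly what the visual check conveys.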
(Screenshot: applet layout with a Buttons component.)
In some cases a fourth component with further information such as pa-
rameters or coefficients is realized.
In the drop-down menu a selection of data sets suitable for visualization is
offered. If a data set is loaded, the accompanying table can, e.g., be modified
in the following ways:
The toolbar symbols (icons not reproduced here) trigger the following actions:
• Add a column
• Delete the marked column(s)
• Add a row
• Delete the marked row(s)
• Shift the marked row(s) up
• Shift the marked row(s) down
three levels of abstraction. The elementary level A offers the least interactivity,
whereas on level C (advanced) the full range of functionality as described
is accessible. Concerning the data sets loaded, this level-dependent design
means that the system offers a user working on level A only one data set
(given by the teacher), while on level B she/he can choose from a wide
range of data sets. On level C analyzing one's own data is possible. In other words,
the described "user interface" of the off-line tool is available to its full extent
only on level C. On the other hand, st-apps offers, because of these facilities,
a variety of helpful and powerful tools for analyzing and presenting data
which are also usable without access to the internet.
Abstract: The present paper focuses on the case sensitivity function ap-
proach to diagnostic and robustness problems that are combinatorial by definition
and hard to solve exactly. Attention is also given to the visual displays.
• (B1): visual displays affording insight into the nature and variety of
multiple case effects (Section 5),
• (C2): enhanced (potentially, encompassing) sets of algorithms for a
class of robustness problems (Section 7).
2 Preliminaries
To gain focus, attention is restricted to one-sample contexts, with {z_i :
i ∈ N}, N := {1, ..., n} denoting a random sample of n > 1 cases from
an unknown distribution F in dim(z) dimensions. The associated empirical
distribution is P := Σ_{i∈N} n⁻¹ P_i, where P_i denotes the distribution degen-
erate at z_i. Throughout, analysis is conducted conditional on the observed
{z_i}.
Assuming, as we do, that no further information is available about the ob-
served cases, it is desirable that any analysis of these data should be invariant
under permutation of the arbitrary labels attached to them. Given n, this in-
variance is achieved, without loss of information, by replacing {z_i : i ∈ N}
by P. In particular, every statistic of interest here is of the form T[P], for
some functional T[·]. This may, for example, be (the observed significance
level of) a test statistic, a parameter estimate, a prediction of future values
of an observable, or a nonparametric density or regression function estimate.
In particular, T[·] may be scalar, vector or function valued.
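This label invariance is immediate to check numerically; a minimal sketch (with simulated data and the median as an example functional T[·]):

```python
import numpy as np

rng = np.random.default_rng(2)
z = rng.normal(size=15)

def T(sample):
    # The median functional: one example of a statistic T[P] that
    # depends on the data only through the empirical distribution.
    return float(np.median(sample))

# Relabelling the cases (an arbitrary permutation) changes nothing.
perm = rng.permutation(len(z))
print(T(z), T(z[perm]))
```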
Let Z := (z_i^T). In multivariate contexts where all the random variables
in z ~ F are on the same footing, we put dim(z) = k, z = x, z_i = x_i and
Z = X. In the usual linear model y = Xβ + ε, we put dim(z) = 1 + k,
z^T = (y, x^T) and z_i^T = (y_i, x_i^T), so that Z = (y|X), a constant term being
assumed and accommodated by supposing that the distribution of the first
element of x is degenerate at the value 1.
(2)

where, for any ∅ ⊂ A ⊆ N, F̂_A := Σ_{i∈A} |A|⁻¹ F̂_i and F̂_{-A} := F̂_{A^c}.
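These subsample distributions are straightforward to form in practice; the following sketch (Python, with simulated data, the mean as a hypothetical example functional, and a hand-picked subset A) computes F̂_A and F̂_{-A} and the effect of omitting A on a statistic:

```python
import numpy as np

rng = np.random.default_rng(3)
z = np.append(rng.normal(size=18), [6.0, 7.0])  # two outlying cases
N = np.arange(len(z))

def T(sample):
    # Example functional: the mean (an illustration only).
    return float(np.mean(sample))

A = np.array([18, 19])               # the suspected subset
F_A = z[A]                           # F_A: weight |A|^(-1) on each case in A
F_minus_A = z[np.setdiff1d(N, A)]    # F_{-A} := F_{A^c}

# Effect of omitting A on the statistic T.
print(T(z), T(F_minus_A))
```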
4 A relaxation strategy
Throughout this section, h and m denote given n-complementary integers.
Again, M denotes a general member of N_m, and H its complement in N.
116 Frank Critchley et al.
{F(p) : p ∈ ℙ_n}. For brevity, the {z_i} are assumed distinct (this avoids an
elaboration required in the general case). Accordingly, (indeed, equivalently),
(5)
Finally, let T[·] denote any statistic of interest. Following [4], pertur-
bation is defined here as movement p → p* between probability n-vectors,
with primary effect (corresponding to the identity functional T) the induced
change F(p) → F(p*) in distribution, and general effect T[F(p)] → T[F(p*)].
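A minimal numerical sketch of this notion of perturbation, assuming a weighted least-squares realization of T[F(p)] for the linear model (the data and the `beta` helper are hypothetical, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10
x = np.linspace(0, 1, n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=n)
X = np.column_stack([np.ones(n), x])

def beta(p):
    """T[F(p)]: the weighted least-squares fit under case weights p."""
    W = np.diag(p)
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

p0 = np.full(n, 1.0 / n)   # the empirical distribution: equal weights
p_star = p0.copy()
p_star[0] = 0.0            # a perturbation: case 1 is missed out ...
p_star /= p_star.sum()     # ... its weight spread over the other cases

# General effect of the perturbation p0 -> p_star on the estimate.
print(beta(p0), beta(p_star))
```

With uniform weights p0 the weighted fit reduces to ordinary least squares, so beta(p0) is the usual estimate T[F̂].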
{p_0} = {p ∈ ℙ_n : p_i ≤ n⁻¹ (i ∈ N)} ⊂ ℙ_n^{:1} ⊂ ℙ_n^{:2} ⊂ ... ⊂ ℙ_n^{:(n-1)} = ℙ_n   (6)
4.4 Examples
Figure 1 illustrates the n = 3 case. ℙ_3 = ℙ_3^{:2} is the outer equilateral triangle,
whose vertices V^{:2} are the unit vectors. ℙ_3^{:1} is the inverted, inner equilateral
triangle, whose vertices V^{:1} are the midpoints of the sides of ℙ_3. Both
triangles are centred on p_0. All perturbations (from p_0) which miss out
a single case are the same size, and smaller than all which miss out two.
Again, each perturbation (from p_0) that holds onto a given case is in the
opposite direction to that which misses it out, and orthogonal to that which
trades weight between the other two.
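These geometric claims are easy to verify numerically; a small sketch for the n = 3 case (variable names are, of course, hypothetical):

```python
import numpy as np

p0 = np.full(3, 1 / 3)      # the empirical distribution (barycentre)
e = np.eye(3)               # vertices of P_3: hold one case, miss out two
mid = np.array([[0.0, 0.5, 0.5],    # miss out case 1
                [0.5, 0.0, 0.5],    # miss out case 2
                [0.5, 0.5, 0.0]])   # miss out case 3

# All single-case omissions are the same size, and smaller than
# all two-case omissions.
miss_one = np.linalg.norm(mid - p0, axis=1)
miss_two = np.linalg.norm(e - p0, axis=1)
print(miss_one, miss_two)

# Holding onto case 3 points opposite to missing it out ...
hold_3, miss_3 = e[2] - p0, mid[2] - p0
# ... and orthogonally to trading weight between cases 1 and 2.
trade_12 = e[0] - e[1]
print(hold_3 @ miss_3, hold_3 @ trade_12)
```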
(Figure 1: the simplex ℙ_3 with vertices (1,0,0), (0,1,0), (0,0,1), and the inner triangle with vertices (1/2,1/2,0), (1/2,0,1/2), (0,1/2,1/2).)
5.1 Tripartitions
Suppose then that M := {M_r : r = 1, 2, 3} is a given partition of N into
three disjoint subsets, with m_r := |M_r| > 0 and Σ_r m_r = n, and let
It follows that 𝕋 is the convex hull of {p_{M_r} : r = 1, 2, 3}. That is, 𝕋 is the
triangle which has these three points as vertices which, when convenient, we
abbreviate to {M_r}. Otherwise said, p ∈ ℙ_n belongs to 𝕋 if and only if, for
some π ≡ (π_r) ∈ ℙ_3, p = Σ_r π_r p_{M_r}. In this case, π = π(p) is unique, π_r(p)
being the total probability assigned (equally) by p to the m_r cases in M_r.
Accordingly, we may identify 𝕋 with ℙ_3 via the bijection p ↦ π(p). For
example, p_0 ↦ (κ_r), where κ_r := m_r/n is the proportion of cases in M_r.
However, whereas ℙ_3 is a fixed equilateral, the shape and size of 𝕋 vary with
the {m_r}. Nevertheless, important inclusions, collinearities and orthogonali-
ties in ℙ_3 survive in 𝕋 for every M.
Two obvious cyclic permutations applying, the identity:
shows that p_{-M_1} lies on the M_2M_3 side of 𝕋, being closer to whichever
vertex labels the larger number of cases. In particular, writing p_r(λ) :=
(1 - λ)p_{-M_r} + λp_{M_r}, the line segment 𝕃_r := {p_r(λ) : λ ∈ [0, 1]} lies in 𝕋,
all three such meeting at p_0 by (5). Again, using Section 4.2, each 𝕃_r is
orthogonal to the side of 𝕋 containing p_{-M_r}, 𝕊_{-r} say, along which proba-
bility weight is traded between the other two subsets. Thus, the probability
attached to M_r increases linearly along 𝕃_r from zero at the p_{-M_r} end to
unity at the other. Indeed, for each λ ∈ [0, 1], this probability is constant at
the value λ for all points in 𝕋 on the line through p_r(λ) parallel to 𝕊_{-r}. In
particular, it vanishes on 𝕊_{-r}.
(Figure 3, panels (a)-(d): scatter plots of data configurations with subset sizes |M_1| = 1, |M_2| = 1, |M_3| = 20.)
Figure 3: Four multiple case effects in the linear model: (a) masking, (b) can-
cellation, (c) swing and (d) raise & lower.
both vertically and horizontally, to enhance their visual clarity (a minor cost
being some loss of visual perception that the angle at M_3 exceeds 87°). Note
that p_0 (corresponding to F̂) is close to M_3, being just one-eleventh of the
way along the line 𝕃_3 joining M_3 to the midpoint of the opposite side. The
inbuilt M_1-M_2 symmetry is evident throughout. Overall, the four graphs
have visibly different shapes, discussed next:
(a) Masking. The 'spike' at M_3 reflects the dominant effect of removing
both M_1 and M_2, while the parallelism of the contours to 𝕊_{-3} corresponds
to the fact that there is, of course, no effect here in trading weight between
these sets.
(b) Cancellation. The contours of t_Cook(·) here are straight lines fanning
out from M_3. In particular, 𝕃_3 is the zero height contour, since varying π_3
while keeping π_1 = π_2 has no effect on the fitted line. Trading weight between
M_1 and M_2 now has a quadratic, globally dominant, effect.
(c) Swing. The overall shape of the surface here is very similar, but not
identical, to that in the masking case. The 'spike' at M_3 remains dominant,
but the surface contours are no longer parallel to 𝕊_{-3}.
(d) Raise & Lower. This is perhaps the most interesting graph. As is
intuitive from the data, the dominant global effect occurs along 𝕊_{-3}. Looking
at the surface, we see two 'troughs'. These run along 𝕃_1 and 𝕃_2, showing
that varying the weight on one of these subsets alone has little effect. The
contours of t_Cook(·) are parallel to 𝕊_{-3} when there is little weight on M_3, but
become more curved as π_3 increases. Locally to p_0, trading weight between
M_1 and M_2 produces the largest effects.
[4] report encouraging results for this general strategy, using regression
as a test problem and several forms of challenge data set. Specifically, they
maximise t_Cook(·) in Stage I, using the mean shift outlier test in Stage II.
Finally, a remark on local maxima. On those occasions when the final
check for a common pattern fails, the possibility that this is because omission
of M is a particular form of non-trivial local maximum can easily be explored
as follows. The value of t(·) there can be compared to that where M is held
onto. If this is greater, replacing M by its complement, and then continuing
as before, is indicated. On the relatively few occasions where it was needed in
their regression study, [4] report that this simple strategy was successful. The
original M containing no mutually masked cases, moving to its complement
produced a large increase in t_Cook and led again to correct identification of
the structure in the data.
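A toy version of such subset-deletion statistics can be sketched as follows, using the squared change in the OLS coefficients as a simple Cook-type proxy (an illustration only, not the exact t_Cook of [4]; the data and subset choices are hypothetical):

```python
import numpy as np

def fit(X, y, keep):
    """OLS fit using only the cases indexed by `keep`."""
    return np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]

def deletion_stat(X, y, M):
    """Squared change in the fitted coefficients when the cases in M
    are omitted: a crude Cook-type proxy for illustration."""
    N = np.arange(len(y))
    beta_all = fit(X, y, N)
    beta_del = fit(X, y, np.setdiff1d(N, M))
    return float(np.sum((beta_all - beta_del) ** 2))

rng = np.random.default_rng(5)
n = 20
x = np.linspace(0.0, 1.0, n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=n)
y[-2:] += 3.0                          # a mutually masked outlying pair
X = np.column_stack([np.ones(n), x])

M_out = np.array([n - 2, n - 1])       # the masked pair
M_clean = np.array([0, 1])             # an innocuous pair

# Omitting the masked pair moves the fit far more than omitting a
# clean pair, so maximizing such a statistic over subsets flags it.
print(deletion_stat(X, y, M_out), deletion_stat(X, y, M_clean))
```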
References
[1] Agulló J. (1998). Computing the minimum covariance determinant esti-
mator. Technical report, Universidad de Alicante.
[2] Atkinson A.C. (1986). Masking unmasked. Biometrika 73, 533-541.
[3] Barrett B.E. and Gray J.B. (1997). Leverage, residual, and interaction
diagnostics for subsets of cases in least squares regression. Computa-
tional Statistics and Data Analysis 26, 39-52.
[4] Critchley F., Atkinson R.A., Lu G. and Biazi E. (2001). Influence anal-
ysis based on the case sensitivity function. J. Royal Statistical Society,
B 63, 307-323.
[5] Critchley F., Lu G., Atkinson R.A. and Wang D.Q. (2003). Projected
Taylor expansions for use in Statistics. Under consideration.
[6] Critchley F., Schyns M. and Haesbroeck G. (2003). Smooth optimization
for the MCD estimator. International Conference on Robust Statistics,
Antwerp, 29-30.
[7] Critchley F., Schyns M., Haesbroeck G., Lu G., Atkinson R.A. and Wang
D.Q. (2004). A convex geometry approach to algorithms for the MCD
method of robust statistics. Under consideration.
[8] Hawkins D.M. (1994). A feasible solution algorithm for the minimum
covariance determinant estimator in multivariate data. Computational
Statistics and Data Analysis 17, 197-210.
[9] Hawkins D.M. and Olive D.J. (1999). Improved feasible solution al-
gorithms for high breakdown estimation. Computational Statistics and
Data Analysis 30, 1-11.
[10] Kinns D.J. (2001). Multiple case influence analysis with particular ref-
erence to the linear model. PhD thesis, University of Birmingham.
[11] Lawrance A.J. (1995). Deletion influence and masking in regression. J.
Royal Statistical Society, B 57, 181-189.
[12] Rousseeuw P.J. and Van Driessen K. (1999). A fast algorithm for the
minimum covariance determinant estimator. Technometrics 41, 212-223.
Acknowledgement: The UK authors are grateful for EPSRC support under
research grant GR/K08246 and to D.Q. Wang for helpful discussions.
Address: F. Critchley, M. Schyns, G. Haesbroeck, D. Kinns, R.A. Atkinson,
G. Lu, The Open University, Milton Keynes; University of Namur; University
of Liège; (formerly) University of Birmingham; University of Birmingham and
University of Bristol
E-mail: F.Critchley@open.ac.uk
COMPSTAT'2004 Symposium © Physica-Verlag/Springer 2004
Abstract: The current theory of statistics with functional data provides
only a few results [21] of asymptotic validity for the bootstrap methodology.
Roughly speaking, these validity results guarantee that the bootstrap versions
of the sampling distribution of a statistic tend (as the sample size increases) to
the same limit as the true sampling distributions. From a computational and
practical point of view, such results have a special interest when dealing with
functional data, as the distributional properties of the statistics are usually
difficult to handle in this setup. Of course, the point is that while the true
sampling distributions are usually very difficult to handle, the corresponding
bootstrap versions can be approximated with arbitrary precision.
In this work, a uniform inequality is obtained for the Bounded Lipschitz
distance between the empirical distribution of a function-valued random vari-
able and the corresponding underlying distribution that generates the sample.
As a consequence, a result of bootstrap validity (consistency) is obtained for
functional statistics defined from differentiable operators.
Our proof is based on the use of a differential methodology for operators,
similar to that used by Parr [19], and relies also on a result of empirical
processes theory proved by Yukich [29].
1 Introduction
We deal here with statistical setups where the available sample informa-
tion consists of (or can be considered as) a set of functions. Depending on
the approach and on the assumed structure of the data (which often come in
a discretized version) this statistical field is called "longitudinal data analy-
sis" or "functional data analysis" (FDA). We will follow here a purely func-
tional approach, which entails considering the available data as true functions
and, as a consequence, defining and motivating the methods in a functional
framework.
The books by Ramsay and Silverman [22], [23] have greatly contributed to
popularizing the FDA techniques among users, offering a number of appeal-
ing case studies and practical methodologies. Simultaneously, this increasing
popularity motivates the need for a solid theoretical foundation for the FDA
methods, as many basic issues (concerning, e.g., the asymptotic behavior)
are often rather involved in the FDA setup.
128 Antonio Cuevas and Ricardo Fraiman
In general terms, the FDA theory is still incomplete as many topics remain
unexplored from the mathematical point of view. Some theoretical develop-
ments with functional data have been made in fields as principal component
analysis ([5], [11], [17], [20], [25]), linear regression ([6], [7], [8], [9], [13]), data
depth [14], clustering [1] and anova models ([12], [18], [10]).
An important issue in this field has to do with the asymptotic validity
(usually called consistency) of bootstrap procedures for functional data. This
looks like an interesting research line since the exact calculation of sampling
distributions in FDA problems presents an obvious difficulty, so that the boot-
strap methodology often turns out to be the only practical alternative. Of
course, the point is that while the sampling distribution of a function-valued
statistic can be formally defined in the same way as the analogous concept for
a real-valued statistic, the effective calculation and handling of such "func-
tional" sampling distributions is usually very difficult since they are in fact
probability measures defined on function spaces. Thus the case for using
bootstrap versions is quite strong as they are discrete measures which can be
in turn approximated by resampling with arbitrary precision. An example of
the use of resampling methods in a functional data framework can be found
in [10] .
The classical works by Bickel and Friedman [3], Singh [26] and Parr [19],
among others, have established the validity of the bootstrap methodology, in
the case of real variables, for a number of useful statistics, including the sam-
ple mean and those generated by differentiable statistical functionals. The
functional counterpart of this theory is much less developed. However, Giné
and Zinn [15] have proved, in a very general setup, a bootstrap version of
Donsker's theorem for empirical processes. A partial extension of this result
is given in [24]. Politis and Romano [21] have proved the consistency of the
bootstrap for the sample mean in the case of uniformly bounded functional
variables taking values in a separable Hilbert space imposing very general as-
sumptions on the dependence structure which include the independent case
to be considered here . The main purpose of this paper is to partially extend
this consistency result to (function-valued) statistics defined from differen-
tiable operators. So we are concerned here with a functional version of some
classical validity theorems, as those in [19] or [2], where the methodology
based on functional differentiation plays a relevant role.
More precisely, we want to get a bootstrap validity result for statistics
of type T(P_n) where T is a differentiable operator (taking values in a func-
tional space) and P_n is the empirical distribution associated with a sample
X_1, ..., X_n of n functions drawn from a common distribution P. In practical
terms, this result will establish that the distribution of √n(T(P_n) - T(P))
can be approximated by its corresponding bootstrap version √n(T(P_n*) -
T(P_n)), where P_n* is the empirical distribution based on an artificial (boot-
strap) sample drawn from the original sample. Our approach is much in
the spirit of Theorem 4 in [19], although the fact that we are dealing with
functional data entails some additional technical complications.
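As a purely numerical illustration of the quantities involved (not of the paper's theory), the bootstrap distribution of √n(T(P_n*) - T(P_n)) can be approximated by resampling discretized curves; here T is taken to be the mean function and all data are simulated:

```python
import numpy as np

rng = np.random.default_rng(6)
n, grid = 50, np.linspace(0, 1, 100)

# A sample of n curves, discretized on a common grid (hypothetical data).
curves = np.sin(2 * np.pi * grid) + rng.normal(scale=0.3, size=(n, grid.size))

def T(sample):
    """An example functional statistic: the mean function T(P_n)."""
    return sample.mean(axis=0)

B = 500
boot = np.empty((B, grid.size))
for b in range(B):
    idx = rng.integers(0, n, size=n)   # resample curves with replacement
    # One replicate of sqrt(n) * (T(P_n^*) - T(P_n)).
    boot[b] = np.sqrt(n) * (T(curves[idx]) - T(curves))

# Bootstrap distribution of the sup-norm of the normalized statistic,
# e.g. for building a uniform confidence band around the mean curve.
sup_norms = np.abs(boot).max(axis=1)
print(np.quantile(sup_norms, 0.95))
```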
On the bootstrap methodology for functional data 129
d(P_n, P) = sup_{f∈F} |∫ f dP_n - ∫ f dP|,   (1)
Yukich's theorem establishes that if the envelope function F := sup{|f(x)| :
f ∈ F} fulfils F ≤ 1 and there are constants 0 < ε_0 ≤ 1, 0 < δ < 1, and
C ≥ 1 such that

N(ε, F) ≤ exp(C/ε^{2-δ}), for all ε, 0 < ε ≤ ε_0,   (5)

then

(6)

for all M greater than, or equal to, some constant M(δ, C, ε_0) whose explicit
expression is given in the statement of Theorem 1 in [29].
In fact, the proof will be simpler and more intuitive if the distances
‖f - f_i‖_{L²(Q)} in (5) are replaced by the supremum distances ‖f - f_i‖_∞. As a con-
sequence, we will prove a stronger version of condition (5), by taking the
supremum in (5) over all possible probability measures (instead of just
considering those of finite support). The reason is that we will in fact provide
a bound for ‖f - f_i‖_∞ and, as the Q's are probability measures and the f's
are bounded, we will also get bounds for the L²(Q) norms.
Given 0 < ε < 1, divide the interval [-1, 1] (where the functions f ∈ F
take values) into q = [2/ε] + 1 subintervals with extreme points in the set
the class 𝒢_1 of all functions g ∈ 𝒢_0 such that sup_{x∈B(0,2ε)} |f(x) - g(x)| ≤ 3ε
is not empty. In a similar way, by the Lipschitz property of f, we can choose
a non-empty class 𝒢_2 ⊂ 𝒢_1 such that sup_{x∈B(0,3ε)} |f(x) - g(x)| ≤ 3ε, for all
g ∈ 𝒢_2. By recurrence, define the (non-empty) class 𝒢_{q_1-1} of functions such
that sup_{x∈B(0,r)} |f(x) - g(x)| ≤ 3ε for all g ∈ 𝒢_{q_1-1}.
for all η ∈ (0, 1), and C = (2r)^{1+η} log 3. Finally, using Yukich's [29] Theo-
rem 1 (observe that 2 - δ in (5) has been denoted 1 + η in (7)), we conclude (6).
(a) Assume that T satisfies the following differentiability condition for some
given P ∈ P(H),
Then,
(10)
Z being the weak limit of √n(T(P_n) - T(P)).
(b) Assume that the operator T takes values in a separable Hilbert space
C and is differentiable in the sense of (8). If the function w(x) =
T'_P(δ_x - P) is bounded (δ_x being the degenerate distribution at x), then
condition (9) is fulfilled and therefore (10) holds.
Proof: (a) The result is a simple consequence of Theorem 2.1. Indeed, using
the differentiability assumption (8),
Then, we may apply Theorem 3.1 in [21] to conclude that (9), and there-
fore (10), holds in this case.
(i) The hypothesis of uniform boundedness is not very restrictive in prac-
tice. It is in some sense similar to the assumption of compact support
in nonparametric estimation. If one is willing to renounce the usual
Gaussian models (which is also the case in nonparametrics), the hypothe-
sis of boundedness looks quite natural, as every observable phenomenon
provides in fact observations taking values in a bounded domain (whose
limits are imposed by the measurement instruments). From a technical
point of view, boundedness is required for Theorem 2.1 (in order to be
able to apply the entropy argument involved in the proof) and also for
the result by Politis and Romano ([21], Theorem 3.1) used in the proof
of part (b). Note also that the boundedness condition must be fulfilled
in the metric of the space where the random elements X_i take values.
For example, if this space is L²[a, b], the assumption that X_i ∈ H, where
H is bounded in L²[a, b], does not entail that the realizations of X_i have
to be bounded in the supremum sense.
(ii) The above theorem can be applied, for example, to show the validity
of the bootstrap for statistics of type g(X) which may arise in different
T(P)(t) = ∫ X²(t, ω) dP(ω) - μ_P²(t),

where X(t) = X(t, ω) is a process with distribution P and mean func-
tion μ_P(t). It can be easily seen that the differential T'_P is the linear
operator given by
[10] Cuevas A., Febrero M., Fraiman R. (2004). An anova test for functional
data. Computational Statistics and data Analysis, to appear.
[11] Dauxois J., Pousse A., Romain Y. (1982). Asymptotic theory for the
principal component analysis of a vector random function: some applica-
tions to statistical inference. Journal of Multivariate Analysis 12, 136-
154.
[12] Fan J., Lin S.K. (1998). Test of significance when the data are curves.
Journal of the American Statistical Association 93, 1007-1021.
[13] Ferraty F., Vieu P. (2002). The functional nonparametric model and
application to spectrometric data. Computational Statistics 17, 545 - 564.
[14] Fraiman R., Muniz G. (2001). Trimmed means for functional data. Test
10, 419-440.
[15] Giné E., Zinn J. (1990). Bootstrapping general empirical measures. The
Annals of Probability 18, 851-869.
[16] Kneip A., Gasser T . (1992). Statistical tools to analyze data representing
a sample of curves. The Annals of Statistics 20, 1266-1305.
[17] Locantore N., Marron J.S ., Simpson D.G., Tripoli N., Zhang J .T ., Cohen
K.L . (1999). Robust principal component analysis for functional data (with
discussion). Test 8, 1- 74.
[18] Muñoz-Maldonado Y., Staniswalis J.G., Irwin L.N., Byers D. (2002). A
similarity analysis of curves. Canadian Journal of Statistics 30, 373-381.
[19] Parr W . C. (1985). The bootstrap: some large sample theory and con-
nections with robustness. Statistics and Probability Letters 3, 97 -100.
[20] Pezzulli S., Silverman, B.W. (1993). Some properties of smoothed prin-
cipal components analysis for functional data. Computational Statistics
8, 1-16.
[21] Politis D.N., Romano J .P. (1994). Limit theorems for weakly dependent
Hilbert space valued random variables with application to the stationary
bootstrap. Statistica Sinica 4,461-476.
[22] Ramsay J .O., Silverman B.W. (1997). Functional data analysis.
Springer-Verlag, New York.
[23] Ramsay J.O., Silverman B.W. (2002). Applied functional data analysis.
Springer-Verlag, New York.
[24] Sheehy A., Wellner J.A. (1992). Uniform Donsker classes of functions.
The Annals of Probability 20, 1983- 2030.
[25] Silverman B.W. (1996). Smoothed functional principal components anal-
ysis by choice of norm. The Annals of Statistics 24, 1- 24.
[26] Singh K. (1981). On the asymptotic accuracy of Efron's bootstrap. The
Annals of Statistics 9, 1187-1195.
[27] van der Vaart A. (2000). Asymptotic Statistics. Cambridge University
Press, Cambridge.
[28] van der Vaart A., Wellner J. (1996). Weak convergence and empirical
processes. Springer-Verlag, New York.
On the bootstrap methodology for functional data 135
Acknowledgement: The first author has been partially supported by grant BFM2001-0169 from the Spanish Ministry of Science and Technology.
Address: A. Cuevas, Departamento de Matemáticas, Facultad de Ciencias, Universidad Autónoma de Madrid, 28049 Madrid (Spain).
R. Fraiman, Departamento de Matemática, Universidad de San Andrés, Vito Dumas 284, Victoria, Provincia de Buenos Aires (Argentina).
E-mail: antonio.cuevas@uam.es, rfraiman@udesa.edu.ar
COMPSTAT'2004 Symposium © Physica-Verlag/Springer 2004
1 Introduction
Despite the fact that identification (in the sense of model selection and parameter estimation) of linear dynamic systems is a quite mature subject now, there still exist severe problems in applying identification procedures, in particular in the multivariable case.
As is well known, one of the major problems is the 'curse of dimensionality'; in the (linear) multivariable case the dimension of the parameter space is a quadratic function of the number of outputs, unless additional restrictions, e.g. of factor analysis or reduced rank regression type or of 'structural' type, are imposed.
In this contribution our main focus will be on another important issue. For simplicity of notation, we only consider linear systems with unobserved white noise inputs. Then the most common models are AR, ARMA and state space (StS) models. In applications AR models still dominate, mainly for two reasons:
On the other hand, ARMA and StS systems are more flexible and thus in many cases fewer parameters may be required.
As is well known, every causal (stable) rational transfer function (describing the input-output behaviour of a linear system) can be described by an ARMA or a StS system; in this sense ARMA and state space systems are equivalent. However, when embedded in 'naive' parameter spaces, typically the classes of observational equivalence are larger in the state space case. For instance, in the univariate case, the equivalence classes of ARMA(n, n) systems are singletons in R^{2n} (unless common factors occur), whereas they are n^2-dimensional manifolds for (minimal) state space systems in the embedding R^{2n+n^2}. Identifiability is obtained by selecting representatives from equivalence classes, and the advantage of large equivalence classes lies in the possibility to select (in some sense) better representatives. This is the reason why we here restrict ourselves to StS systems.
Both typical ARMA and StS model classes suffer from the fact that the parametrization problem is non-trivial and that in general no explicit formula for the maximum likelihood estimator exists. For instance, in general, the boundary of the identifiable parameter spaces contains lower dimensional systems, which are not identifiable, and algorithmic problems occur if the true system is close to the boundary. Some of these problems cannot be fully understood in the framework of the usual asymptotic analysis, or are even better reflected by numerical rather than by statistical analysis. In a certain sense, asymptotic properties are parametrization independent; to be more precise:
(i) Under general assumptions, consistency can be shown for transfer functions in a coordinate-free way (see e.g. [2]); if we have identifiable parameter spaces and the function attaching parameters to transfer functions is continuous, then the corresponding parameter estimates are consistent, independent of the choice of the particular parametrization.
(ii) Under certain conditions the asymptotic variances of the maximum likelihood estimators change in a well defined way.
x_{t+1} = A x_t + B \varepsilon_t,   (2)
y_t = C x_t + \varepsilon_t,   (3)
where (\varepsilon_t) is the unobserved white noise. The mapping
\pi : S(n) \to M(n)   (7)
attaches to every (A, B, C) \in S(n) the corresponding transfer function.
For describing M(n) by state space systems the following approach (see e.g. [1] and [5]) may be used:
(i) Full state space parametrizations, i.e. M(n) is described by S_m(n). The drawback of this approach is that S_m(n) is non-identifiable. The classes of observational equivalence are given by
E(A, B, C) = { (TAT^{-1}, TB, CT^{-1}) : T \in GL(n) }
and are real analytic manifolds of dimension n^2. Thus there are n^2 unnecessary parameters.
(iii) The approach described here, namely data driven local coordinates (DDLC; see [3], [4]), is as follows: We commence from an initial (minimal) (A, B, C) \in S_m(n) and the tangent space to the equivalence class E(A, B, C) at (A, B, C). (A, B, C) may be obtained by an initial estimate, using e.g. a subspace or an instrumental variable estimation method. Then we take the orthocomplement (in S(n)) to the tangent space as (preliminary) parameter space: Let Q_\perp denote a (n^2 + 2ns) x 2ns matrix whose columns form a basis for this orthocomplement. Then we have the parametrization:
\tau_D \mapsto ( vec A(\tau_D); vec B(\tau_D); vec C(\tau_D) ) = ( vec A; vec B; vec C ) + Q_\perp \cdot \tau_D
The intuitive motivation behind the DDLC approach is that, due to orthogonality to the tangent space, the numerical properties of optimization based estimators, such as the maximum likelihood estimator, are at least locally favourable. Comparisons with other parametrizations corroborate this notion (see e.g. [4] and [8]). In particular these comparisons show that echelon forms (whose parameters correspond to the usual ARMA parameters) are clearly outperformed. DDLC is now the default option in the system identification toolbox in MATLAB 6.x. The success of DDLC was the motivation for a careful investigation of the topological and geometrical properties of DDLC relevant for estimation, described in the next section.
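As an illustration of the construction, the following numpy sketch (function name and tolerance are ours, not from the paper) computes an orthonormal basis Q_perp of the orthocomplement of the tangent space to the equivalence class { (TAT^{-1}, TB, CT^{-1}) : T invertible } at a given minimal (A, B, C). Differentiating the group action at T = I in direction X gives the tangent directions (XA - AX, XB, -CX); stacking them column-wise and taking an SVD yields the orthocomplement.

```python
import numpy as np

def ddlc_basis(A, B, C, tol=1e-10):
    """Sketch of the DDLC construction: orthonormal basis Q_perp of the
    orthocomplement, in R^(n^2 + 2ns), of the tangent space to the
    equivalence class of a minimal (A, B, C)."""
    n, s = B.shape
    cols = []
    for idx in range(n * n):
        X = np.zeros((n, n))
        X.flat[idx] = 1.0
        # directional derivative of (TAT^-1, TB, CT^-1) at T = I:
        dA, dB, dC = X @ A - A @ X, X @ B, -C @ X
        cols.append(np.concatenate([dA.ravel('F'), dB.ravel('F'), dC.ravel('F')]))
    M = np.column_stack(cols)             # tangent map, (n^2 + 2ns) x n^2
    U, sv, _ = np.linalg.svd(M)
    rank = int((sv > tol * sv[0]).sum())  # equals n^2 for minimal systems
    return U[:, rank:]                    # Q_perp, (n^2 + 2ns) x 2ns

```

The columns of the returned matrix are then used exactly as the Q_perp of the parametrization above.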
(iii) For n > 0, \pi(\bar{T}_D) contains transfer functions of lower McMillan degree.
(iv) There exists an open and dense subset V_D^f of V_D such that for every k \in V_D^f the corresponding equivalence class in T_D consists of a finite number of points.
(v) V_D^o is dense in V_D, where V_D^o denotes the interior of V_D in M(n). Additionally, V_D is open (and trivially dense) in \pi(\bar{T}_D), but not necessarily open in M(n).
(vi) \pi(\bar{T}_D) \subseteq V_D, where equality can hold, but the inclusion may also be strict.
(i) Openness means that the parameters are free and in particular not restricted to a thin subset of R^{2ns}. This is an important requirement for gradient-type optimization procedures to work properly. Note that openness also holds if the stability assumption (4) and the miniphase assumption (5) are imposed. Clearly then denseness will not hold.
142 Manfred Deistler, Thomas Ribarits and Bernard Hanzon
(ii) states that there exist neighbourhoods T_D^e and V_D^e where the parametrization is well-posed in the sense of being injective (and thus identifiable) and the parameters are attached to transfer functions in a continuous way. In particular, 'coordinate free' consistency of transfer function estimates in V_D^e (see [2]) then implies consistency of the corresponding parameter estimates. However, we have no statements concerning the size of T_D^e and V_D^e, respectively.
(iv) In general, T_D is not identifiable; as a 'second best' result, the equivalence classes for a generic subset V_D^f are at least finite and thus consist of isolated points in T_D.
(v) deals with the structure of the set V_D; for a discussion of the relevance of (v) see [9].
(vi) The fact that V_D may contain more transfer functions than those described by the closure of the parameter space T_D can affect the actual estimation procedure. In that case the norm of the parameter vector may diverge to infinity whereas the corresponding sequence of transfer functions converges to a well defined transfer function estimate in M(n). Problems of nonconvergence of algorithms due to this phenomenon have actually puzzled researchers in the past when using echelon canonical forms.
x_{t+1} = (A - BC) x_t + B y_t = \bar{A} x_t + \bar{B} y_t,   (11)
\varepsilon_t = -C x_t + y_t = \bar{C} x_t + y_t,
and the corresponding parameters (\bar{A}, \bar{B}, \bar{C}) = (A - BC, B, -C) are in a one-to-one relation with (A, B, C). The (Gaussian) conditional likelihood function is of the form
A novel approach in linear dynamic systems 143
L_T(\bar{A}, \bar{B}, \bar{C}, \Sigma) = \log\det\Sigma + \frac{1}{T} \sum_{t=1}^{T} \mathrm{tr}\{ \varepsilon_t(\bar{A}, \bar{B}, \bar{C}) \varepsilon_t(\bar{A}, \bar{B}, \bar{C})' \Sigma^{-1} \}   (12)
Given the (observable) pair (\bar{A}, \bar{C}), y_t and \varepsilon_t, we obtain the original system by \Delta_y(\bar{A}, \bar{C}) = (A, B, C) = (\bar{A} - \bar{B}\bar{C}, \bar{B}, -\bar{C}). The pairs (\bar{A}_1, \bar{C}_1) and (\bar{A}_2, \bar{C}_2) are called observationally equivalent if they correspond to the same transfer function, i.e. if \pi(\Delta_y(\bar{A}_1, \bar{C}_1)) = \pi(\Delta_y(\bar{A}_2, \bar{C}_2)). If (\bar{A}, \bar{C}) is observable, then, under certain additional assumptions, all observationally equivalent pairs are given by E_cc(\bar{A}, \bar{C}) = { (T\bar{A}T^{-1}, \bar{C}T^{-1}) : T \in GL(n) }.
E_cc(\bar{A}, \bar{C}) is a real analytic manifold of dimension n^2 and the DDLC construction is performed again by taking the orthocomplement in R^{n^2+ns} to the tangent space of E_cc(\bar{A}, \bar{C}) at an initial point (\bar{A}, \bar{C}). Let us denote the new parameter space, where the non-minimal systems have been removed, again by T_D \subseteq R^{ns} and let us put V_D = \pi(\Delta_y(\varphi_D(T_D))). Here, \varphi_D is given by
\tau_D \mapsto ( vec \bar{A}(\tau_D); vec \bar{C}(\tau_D) ) = ( vec \bar{A}; vec \bar{C} ) + Q_\perp \cdot \tau_D   (15)
where Q_\perp \in R^{(n^2+ns) x ns} is now a matrix with orthonormal columns spanning the new orthocomplement to E_cc(\bar{A}, \bar{C}) at the point (\bar{A}, \bar{C}); (15) is called the slsDDLC parametrization. For the following theorem see [7]:
Theorem 4.1. Let y_t and an initial (A, B, C) be given and let (\bar{A}, \bar{B}, \bar{C}) denote the corresponding inverse system in (11). The parametrization by slsDDLC as given in (15) has the following properties:
(iii) \pi(\bar{T}_D \cap T_X) may (but need not necessarily) contain transfer functions of lower McMillan degree.
(v) \pi(\bar{T}_D \cap T_X) \subseteq V_D, where equality can hold, but the inclusion may also be strict.
Here, T_X denotes the (generic) subset of T_D such that X has full column rank.
Note that here, as opposed to ordinary DDLC, V_D is not open in M(n). An alternative procedure is to concentrate out \bar{C}. In this case, ML estimation of \Sigma can also be incorporated; see [8].
5 A numerical comparison
Eight different minimal , st abl e and st rictly minimum ph ase st at e space mod-
els (A , B , C) with two outputs are specified. The mod els are denoted by M I ,
. .. , M s and are of order 2, 4, .. . , 16. The poles and zeros are quite close to
each other and close t o the unit circle, but t hey do not cancel.
Simul ation data for models M I , . . . , M s comprising T = 500 output
observations are created , where t he whit e noise sequ ence (St) is chosen t o be
Gaussian distributed with ~ = h .
In t he next step, 50 random initi al state space mod els are creat ed by ran-
doml y perturbing the ma trices (A ,B, C) corres ponding to th e t rue syst ems.
It is ensure d t hat t he perturbed mod els remain minimal, stable and minimum
ph ase.
All computations are carried out using the system identification toolbox of the software package MATLAB, version 6.5.0.180913a (R13). The identification procedure itself is performed by using the built-in function pem. The option 'SearchDirection' is set to 'Gn' (a plain Gauss-Newton type algorithm is used for minimizing the criterion function). For a more detailed discussion of the simulation results presented below we refer to [6]. We confine ourselves to the following statements: slsDDLC leads to
• fewer failures: with slsDDLC only 1 out of 400 estimation runs failed, whereas usage of DDLC leads to 8 failed runs;
• better estimates, i.e. better or at least equally good values of the likelihood function at convergence; see Table 1 (F).
A        M1   M2   M3   M4   M5   M6   M7   M8
Can       0   78   18    8   50   28   74   68
DDLC      0    0    4    0    0    0    4    8
slsDDLC   0    0    0    0    0    2    0    0

B        M1   M2   M3   M4   M5   M6   M7   M8
Can       8   24   18   20   27   35   35   28
DDLC      6   10   13    9   21   18   12   12
slsDDLC   8    9   10    8   16   13    8    8

C        M1   M2   M3   M4   M5   M6   M7   M8
Can       0   39   22   23   23   32   33   28
DDLC      0    0   12    0    0    0    6    8
slsDDLC   0    0    0    0    0    6    0    0

E        M1       M2       M3       M4       M5       M6       M7       M8
Can      0.   1.3e+17  3.9e+19  1.4e+19  1.3e+21  3.3e+18  8.1e+21  1.6e+23
DDLC     0.       0.    3.1e+5      0.       0.       0.    1.1e+6   3.4e+5
slsDDLC  0.       0.       0.       0.       0.    5.8e+2      0.       0.

G        M1    M2    M3    M4    M5    M6    M7    M8
Can      0.   1.42  2.68  1.77  2.96  2.28  3.12  3.06
DDLC     0.   0.    1.34  0.    0.    0.    1.33  2.18
slsDDLC  0.   0.    0.    0.    0.    1.21  0.    0.
References
[1] Deistler M. (2000). System identification - general aspects and structure. In G. Goodwin (ed.), System Identification and Adaptive Control, Springer, London, 3-26. (Festschrift for B.D.O. Anderson).
[2] Hannan E.J., Deistler M. (1988). The statistical theory of linear systems. John Wiley & Sons, New York.
[3] McKelvey T., Helmersson A. (1997). System identification using an over-parametrized model class - improving the optimization algorithm. In Proc. 36th IEEE Conference on Decision and Control, San Diego, California, USA 3, 2984-2989.
[4] McKelvey T., Helmersson A., Ribarits T. (2004). Data driven local coordinates for multivariable linear systems and their application to system identification. Forthcoming in Automatica.
[5] Ribarits T. The role of parametrizations in identification of linear dynamic systems. PhD thesis, TU Wien.
[6] Ribarits T., Deistler M. (2003). A new parametrization method for the estimation of state-space models.
[7] Ribarits T., Deistler M., Hanzon B. (2004). An analysis of separable least squares data driven local coordinates for maximum likelihood estimation of linear systems. Submitted to Automatica.
[8] Ribarits T., Deistler M., Hanzon B. (2004). On new parametrization methods for the estimation of state-space models. Forthcoming in Intern. Journal of Adaptive Control and Signal Processing.
[9] Ribarits T., Deistler M., McKelvey T. (2004). An analysis of the parametrization by data driven local coordinates for multivariable linear systems. Automatica 40 (5), 789-803.
STATISTICAL ANALYSIS OF
HANDWRITTEN ARABIC NUMERALS
IN A CHINESE POPULATION
Wing K. Fung, C.T. Yang, C.K. Li and N.L. Poon
Key words: Writing habits, Arabic numerals, statistical study, classification system, test for independence, probability of occurrence.
COMPSTAT 2004 section: E-statistics.
1 Introduction
Writing habit, being a product of long-term adaptation to the needs and abilities of the writer, is believed to be unique. Various classification systems for handwriting have been suggested; see [3] for a review. A system for the classification of handwritten numerals has been developed by Ansell and Strach [1], and Strach [6]. Recently, computer algorithms for extracting features from scanned images of handwriting were used by Srihari et al. [5] for the analysis of individuality of handwriting.
In this paper, we analyse the characteristic features and codes of the Arabic numeral writings, i.e. 0, 1, ..., 9, of 187 subjects. We give a detailed description of data collection and the methods of statistical analysis for the study. Hierarchical cluster analysis (Section 4.1) is conducted on the characteristic codes of the single numerals and the paired numerals. We define a cluster as the set of subjects having rescaled distance at the minimal level. From that, we can obtain the number of clusters in the dendrogram, which is useful for measuring the variability of the numeral(s) in question.
We are also interested in investigating whether the characteristic features within each numeral are statistically independent. If the features are independent, it would provide a simple method to estimate the relative frequency
150 Wing K. Fung et al.
defines the distance between two clusters as the average of the distances between all pairs of cases in which one member of the pair is from each of the clusters. This method uses information about all pairs of distances, not just the nearest or the farthest, and so it is usually preferred in cluster analysis [7]. Subjects having a similar (or the same) way of numeral writing were grouped/clustered together. A tree diagram or dendrogram was selected to present the results of cluster analysis; it is depicted horizontally with each row representing a case, and cases with high similarity are adjacent. The number of clusters and the cluster sizes were also measured.
Another question of interest is whether the quantified features are statistically independent of one another. Each feature is a nominal variable which can take different possible codes (normally 2-5). Pearson's chi-squared independence test is conducted for each pair of features. The evaluation of the probability of occurrence of certain characteristic features will be much simplified if the independence assumption is found to be valid.
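The pairwise test can be sketched as follows (a plain-numpy illustration on hypothetical coded data; the function name is ours, and a statistics library would normally supply both the statistic and its p-value):

```python
import numpy as np

def chi2_stat(codes_a, codes_b):
    """Pearson chi-squared statistic and degrees of freedom for testing
    independence of two nominal features, given per-subject codes."""
    _, ai = np.unique(codes_a, return_inverse=True)
    _, bi = np.unique(codes_b, return_inverse=True)
    obs = np.zeros((ai.max() + 1, bi.max() + 1))
    np.add.at(obs, (ai, bi), 1)                    # contingency table
    exp = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / obs.sum()
    stat = ((obs - exp) ** 2 / exp).sum()
    df = (obs.shape[0] - 1) * (obs.shape[1] - 1)
    return stat, df

```

The statistic is then compared with the chi-squared critical value for the relevant degrees of freedom (3.841 at the 5% level for df = 1).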
3 Summary statistics
The majority of the participants were of young to middle age (20-49). Only 4% and 2% were of age < 20 and age > 50 respectively. Nearly all (99%) of them were right-handed.
The assignment of characteristic features and codes is an important process in the project. Figure 1 shows an example of such an assignment for numeral 4. In this particular numeral, there are 8 characteristic features and each feature has 2-3 possible assignments of characteristic codes. Some of the features, such as features 1, 2, 3 and 8, are relatively easy to distinguish, while others may need some comparisons of the lengths of the measurements. In the lower half of Figure 1, we have identified the code for each feature of that particular numeral. Moreover, one other code for each of the features is also provided for easy understanding. Numeral 4 has 8 characteristic features with 19 characteristic codes.
Table 1 gives an overall summary of the number of characteristic features and codes for the studied numerals; the detailed characteristic features and codes are omitted for brevity. We can see that numeral "1" is the simplest numeral, as expected, and it has 4 characteristic features (numeral "6" too), while numerals 5, 8 and 9 have the most features and codes.
Table 1: Numbers of characteristic features and codes for numerals 0-9.
Figure 1: Assignment of characteristic features and codes for numeral 4.
4 Statistical analysis
4.1 Hierarchical cluster analysis
We employ hierarchical cluster analysis for grouping subjects in the writing of numeral "4". Figure 2 gives the dendrogram for the clustering of the subjects; for clarity, only the last 20 subjects were selected for classification. As noted from Figure 2, we have identified four tightly linked clusters, namely cluster a: subjects (14, 20, 1, 8); cluster b: (4, 7); cluster c: (10, 11); and cluster d: (3, 18, 9). The subjects within each cluster are very similar to each other and so they are grouped together. For example, we can see from the figure that subjects 14, 20, 1 and 8 are grouped together, and after checking the original data we found that they in fact wrote in (exactly) the same pattern during the writing of numeral "4". The same situation also happens to subjects 4 and 7, 10 and 11, and 3, 18 and 9. The dendrogram of the cluster analysis can give us information on the similarity/dissimilarity in the writing of "4" amongst the subjects.
Figure 2: Dendrogram (average linkage, rescaled distances) of the hierarchical cluster analysis of the last 20 subjects for numeral 4.
Subjects 12, 16, 17, 6, 2, 5, 13 and 15 each form the remaining clusters C6, C7, ..., C13, respectively. Using the same procedure, 16 clusters are identified for the paired numerals 4 and 7 in the dendrogram of Figure 3, which is discussed in more detail below.
Hierarchical cluster analysis is again conducted for the two numerals "4" and "7" and the results are shown in Figure 3. As we can see, there are only two tightly linked clusters, fewer than those formed in Figure 2 for the single numeral "4". The clusters are, cluster i: subjects (3, 18, 8, 12) and cluster ii: subjects (1, 20). In fact, after checking the original data, we found that there were some differences in the way of writing of numerals "4" and "7" for subjects 3, 18, 8 and 12 of cluster i. The two subjects 1 and 20 of the other cluster also did not write exactly the same numerals "4" and "7".
Figure 3: Clustering of the last 20 subjects for numerals 4 and 7.
Table 2 summarizes the findings obtained from hierarchical cluster analysis of Arabic numerals. The number of clusters formed and the maximum and second maximum sizes of clusters are listed for reference. According to the cluster analysis, numeral "1" is the simplest handwriting character amongst all. In total, there are only 36 clusters formed via the classification procedure, with merely 8 clusters containing 5 or more subjects and the largest cluster involving 63 homogeneous subjects. On the contrary, numeral "5", armed with 9 features and 30 codes, is the most informative character that helps distinguish subjects' distinctiveness in handwriting.
The combined numerals increase the number of characteristic features and codes for handwriting discrimination and enhance the heterogeneity among subjects. Table 3 summarizes the findings of cluster analysis of two Arabic numerals, 4 and others, demonstrating the dissimilarity reinforcement between subjects in handwriting. Compared with the findings in Table 2, the combined Arabic numerals overall increase the number of clusters formed; that is, the subjects are more heterogeneous from one another than in the writing of a single numeral. It is to be noted that for the combined numerals 4 and 5, 175 clusters (out of 187 subjects) are identified, which indicates that the handwritings of the 187 subjects for numerals 4 and 5 together are nearly all different (in one or more characteristic features).
Table 3: Summary findings for cluster analysis of two Arabic numerals.
P_123[Slant (S) = forward (f), Initial Hook (IH) = right (r), Ending Position (EP) = hook (h)]
= P_1(S = f) P_2(IH = r) P_3(EP = h)
= 0.69 x 0.14 x 0.25
= 0.02415,
where the marginal probabilities 0.69, 0.14 and 0.25 are obtained by the direct counting method. Of course there are many assumptions behind this estimate, which might be rather crude. One should be aware that pairwise independence does not imply that the features are all mutually independent. The estimate may also be adjusted upward if we want to make it conservative (forensic document examiners like to take the conservative approach in practice).
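The product-rule estimate amounts to multiplying marginal relative frequencies. A small sketch (function name and coded sample are hypothetical, with the sample calibrated to the marginals quoted above):

```python
import numpy as np

def product_rule(sample, targets):
    """Estimate P(feature_1 = t_1, ..., feature_m = t_m) under feature
    independence as the product of marginal relative frequencies.
    sample: (n_subjects, n_features) array of characteristic codes."""
    return float(np.prod([(sample[:, j] == t).mean()
                          for j, t in enumerate(targets)]))

# hypothetical sample of 100 subjects with marginals 0.69, 0.14, 0.25
col_s = np.array(['f'] * 69 + ['b'] * 31)    # slant
col_ih = np.array(['r'] * 14 + ['n'] * 86)   # initial hook
col_ep = np.array(['h'] * 25 + ['t'] * 75)   # ending position
sample = np.column_stack([col_s, col_ih, col_ep])
p = product_rule(sample, ['f', 'r', 'h'])

```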
Table 4: Results of the pairwise independence tests for the features of numeral 0 (ind = independent, D = dependent at the 5% level).

      2    3    4    5    6    7    8
1    ind   D   ind   D    D    D   ind
2         ind  ind  ind  ind  ind  ind
3               D    D    D   ind  ind
4                    D   ind  ind   D
5                         D    D   ind
6                              D   ind
7                                  ind
Next we investigate the numeral 0, which has nine features: (1) slant, (2) initial and ending strokes, (3) starting position, (4) ending position at right, middle or left, (5) stroke crossing position, (6) ending position being tapering or blunt, (7) shape, (8) ending position at upper half, middle or lower, and (9) the writing direction, which is however omitted in our analysis because of the rule of five. The independence test results for feature pairs are summarized in Table 4. It is interesting to note that only feature 2, initial and ending strokes (being open or closed), is statistically independent of all the other features considered. Feature 8 is (pairwise) independent of all other features except 4. Thus, it seems difficult to use a (simple) product rule, as presented for numeral 1 where the assumption of feature independence is taken, to estimate the relative frequency or probability of occurrence of particular characteristic features for numeral 0.
We suggest below an alternative way to (conservatively) estimate the probability of occurrence of the following feature codes for numeral 0:
P_12345678(f, c, l, m, l, t, e, u)
= P_2(c) P_1345678(f, l, m, l, t, e, u)
<= P_2(c) P_135678(f, l, l, t, e, u)
= P_2(c) P_13567(f, l, l, t, e) P_8(u),
where P_2(c) and P_8(u) can be evaluated easily, and P_13567(f, l, l, t, e) can also be estimated by direct counting from the sample. (The first equality uses the independence of feature 2 from the others, the inequality drops feature 4 and thus enlarges the event, and the last step uses the pairwise independence of feature 8 from features 1, 3, 5, 6 and 7.) It is noted that similar assumptions as for numeral 1 may have to be made as well. Furthermore, we need to be aware that, because of multiple comparisons, the overall level of significance is not equal to 5%, though the individual level is set to 5% for each paired comparison. Moreover, it may also not be reasonable to regard the features as all statistically independent, or all dependent.
Another question of interest is the estimation of the probability of occurrence for characteristic feature codes of two or more numerals. We shall not attempt to answer this question due to the possibly very complex dependence structure in the data.
5 Concluding remarks
We have investigated characteristic features of numerals 0, ..., 9. The Arabic numerals are chosen because they are commonly found in daily life. Hierarchical cluster analysis is used to classify subjects with similar handwriting features into groups. As expected, a subject is more difficult to cluster/group with others when more numerals are considered. In fact, the individuality of handwriting features may be identified in our sample when we consider 2 or 3 numerals together, such as 5, 8 and 9, which have more characteristic features. This phenomenon may also be of interest to document examiners.
The chi-squared tests are constructed to see if the features are statistically pairwise independent. The features (except the serif feature) in numeral 1 are independent, while some features in numeral 0 are dependent on one another.
This dependence structure would also be found in other numerals. However, it is still possible to find some independence structure in the features such that the probability of occurrence of some characteristic features can be estimated, the possible limitations of which have to be kept in mind. Furthermore, it is suggested that the probability should be estimated for a single numeral, and not for two or more combined numerals.
References
[1] Ansell M., Strach S.J. (1975). The classification of handwriting numerals. Proceedings of 7th Meeting of the IAFS, Zurich.
[2] Everitt B.S., Landau S., Leese M. (2001). Cluster analysis. 4th edition. Oxford University Press, New York.
[3] Huber R.A., Headrick A.M. (1999). Handwriting identification: facts and fundamentals. CRC Press, 152-164.
[4] Kaufman L., Rousseeuw P.J. (1990). Finding groups in data: an introduction to cluster analysis. Wiley, New York.
[5] Srihari S.N., Cha S.H., Arora H., Lee S. (2002). Individuality of handwriting. J. Forensic Sci. 47, 1-17.
Acknowledgement: The authors would like to thank a referee for helpful comments that improved the presentation of the paper, and D.G. Clarke, Government Chemist, and S.C. Leung, Assistant Government Chemist of Hong Kong, for their support and permission to use the data.
Address: W.K. Fung, C.T. Yang, Department of Statistics and Actuarial Science, The University of Hong Kong, Pokfulam Road, Hong Kong; C.K. Li and N.L. Poon, Questioned Documents Section, Hong Kong Government Laboratory.
E-mail: wingfung@hku.hk
1 Introduction
In speech recognition, video transmission and intensive care monitoring the basic task is to extract a signal from the observed noisy time series. The signal is assumed to vary smoothly most of the time, with a few abrupt shifts. Besides the attenuation of normal observational noise and the removal of outlying spikes for recovering smooth sequences, the preservation of the locations and heights of shifts and local extremes is important. All this needs to be done automatically and in real time with short delays. This increases the risk of confusing outlier sequences with shifts or local extremes. For distinguishing extremes and outliers we rely on the smoothness of the underlying signal, i.e. observations which are far away from an estimated signal value are treated as outliers and not as being due to a signal peak. We can identify shifts by their duration, setting a lower limit for the length of a relevant shift.
Moving averages and other linear filters are popular for signal extraction as they recover trends and are very efficient in Gaussian samples, but they are highly vulnerable to outliers and they blur level shifts. Tukey [16] suggests running medians for removing outliers and preserving level shifts, but standard medians have deficiencies in trend periods [8]. Linear median hybrid filters [10], [11] have been suggested as they are computationally more efficient than running medians and preserve shifts similarly well or even better. These filters track polynomial trends, but they can only remove single isolated outliers. Modified trimmed mean filters are another compromise between running means and running medians. They choose an adaptive amount of trimming, but like running medians they also deteriorate in trend periods.
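A toy numpy experiment illustrates these claims (the data and helper function are our own construction): the running median removes an isolated spike and keeps a level shift crisp, while the running mean is pulled upward by the spike and smears the shift across several time points.

```python
import numpy as np

def running_filter(x, k, stat):
    """Apply stat to each full window x[t-k..t+k] (a delay of k)."""
    return np.array([stat(x[t - k:t + k + 1])
                     for t in range(k, len(x) - k)])

x = np.concatenate([np.zeros(10), 10 * np.ones(10)])  # level shift at t = 10
x[5] = 50.0                                           # one outlying spike
med = running_filter(x, 2, np.median)
avg = running_filter(x, 2, np.mean)

```

The median output takes only the values 0 and 10, whereas the mean output is lifted to 10 near the spike and passes through intermediate values 2, 4, 6, 8 at the shift.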
160 Ursula Gather and Roland Fried
The underlying signal \mu_t is the level of the time series, i.e. the observations are x_t = \mu_t + u_t + v_t, where \mu_t is assumed to vary smoothly with a few sudden changes, u_t is additive noise from a symmetric distribution with mean zero and variance \sigma^2, and v_t is impulsive (spiky) noise from an outlier generating mechanism. For online signal extraction we move a time window of width n = 2k + 1 through the series and use x_{t-k}, ..., x_{t+k} to approximate \mu_t. This causes a time delay of k observations. Firstly we fix k to a given value for all filters.
Here \mu_t is regarded as the level of the series at time point t, which is assumed to be locally constant. For tracking trends, Davies et al. [4] suggest fitting a local linear trend \mu_{t+i} = \mu_t + i\beta_t, i = -k, ..., k, to x_{t-k}, ..., x_{t+k} by robust regression and recommend Siegel's [15] repeated median (RM). When applied to the data (i, x_{t+i}), i = -k, ..., k, the RM reads
Methods and algorithms for robust filtering 161
\tilde{\beta}_t^{RM} = med_i { med_{j \ne i} (x_{t+i} - x_{t+j}) / (i - j) },
\tilde{\mu}_t^{RM} = med_i { x_{t+i} - \tilde{\beta}_t^{RM} i }.
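For a single window the repeated median can be sketched as follows (our own illustration with a naive O(n^2) loop; the fast update algorithms discussed in Section 3.1 are what one would use in practice):

```python
import numpy as np

def repeated_median(window):
    """Siegel's repeated median slope and level for one window
    x_{t-k}, ..., x_{t+k} at design points i = -k, ..., k (odd length)."""
    m = len(window)
    k = m // 2
    i = np.arange(-k, k + 1)
    # for each point, the median of its pairwise slopes to all others
    slopes = [np.median([(window[a] - window[b]) / (i[a] - i[b])
                         for b in range(m) if b != a])
              for a in range(m)]
    beta = np.median(slopes)             # RM slope
    mu = np.median(window - beta * i)    # RM level at the window centre
    return mu, beta

```

On an exact line the fit is exact, and a single outlier leaves both estimates unchanged.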
Lee and Kassam [12] suggest modified trimmed mean (MTM) filtering as a compromise between running means and running medians. MTM filters regulate the amount of trimming depending on the data. Firstly the local median \tilde{\mu}_t and the local median absolute deviation about the median (MAD) \tilde{\sigma}_t are calculated; then all observations farther away from the median than a multiple q_t = d\tilde{\sigma}_t of the MAD are trimmed. Finally, the average of the remaining observations is taken as filter output:
remaining observations is t aken as filter out put:
1
L Xt+i . l[iLt-qt,iLt+qt] (Xt+i),
k
nt i=-k
# { Xt+i E [ilt - qt, ilt + qt], i = -k , . .. , k},
d · Cn ' m ed{lxt- k - iltl,··· , IXt+k - iltl}·
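A direct transcription of the MTM formulas for a single window (the constant d and the finite-sample MAD correction c_n are tuning parameters; the defaults below are illustrative):

```python
from statistics import median

def mtm_filter(window, d=2.0, c_n=1.0):
    """Modified trimmed mean for one window, per the MTM formulas.

    d and c_n are tuning constants; the values here are illustrative.
    """
    mu = median(window)
    q = d * c_n * median(abs(x - mu) for x in window)
    kept = [x for x in window if mu - q <= x <= mu + q]
    # If q == 0 (more than half the window identical), the kept set still
    # contains the observations equal to the median, so the output is mu.
    return sum(kept) / len(kept) if kept else mu
```

For the window (1, 2, 3, 4, 100) with d = 2 the spike 100 is trimmed and the output is the mean 2.5 of the remaining observations.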
The trimmed repeated median (TRM) applies least squares to the observations
retained after trimming within
J_t = {j = -k, ..., k : |x_{t+j} - μ̂_t^RM - jβ̂_t^RM| ≤ q_t},
so that TRM(x_t) is the level at i = 0 of the least squares line fitted to
{(j, x_{t+j}) : j ∈ J_t}. The modified repeated median (MRM) instead applies
the repeated median to the retained observations:
μ̂_t^MRM = med{x_{t+j} - jβ̂_t^MRM, j ∈ J_t},
β̂_t^MRM = med_{i∈J_t} { med_{j∈J_t, j≠i} (x_{t+i} - x_{t+j}) / (i - j) },
with (μ̂_t^RM, β̂_t^RM) being the repeated median level and slope estimates for the
current time window {x_{t-k}, ..., x_{t+k}}.
Predictive FMH filters correspond to a linear trend model and apply predic-
tive FIR subfilters for one-sided extrapolation of a trend.
3.1 Computation
The time needed for the filtering is crucial in real-time applications. Fast
algorithms for the update of the filter output are needed for online signal
extraction. Denoting the length of the time window by n, the median of
the proceeding window can be updated in logarithmic time (O(log n)) using
linear space if the data in the window are stored in sorted order using a red-
black tree [2, Section 15.1]. This improves on the linear time needed for
calculating the median from scratch.
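The update idea can be sketched as follows; Python's built-in lists give O(n) insertion, so this only illustrates the bookkeeping, while a red-black tree as in [2] attains the O(log n) bound:

```python
import bisect

class SlidingMedian:
    """Median over a sliding window, updated incrementally.

    A balanced search tree gives O(log n) updates; Python list insertion
    is O(n), so this sketches the idea rather than the complexity bound.
    """
    def __init__(self, window):
        self.buf = list(window)          # arrival order
        self.sorted = sorted(window)     # kept sorted between updates

    def median(self):
        s, n = self.sorted, len(self.sorted)
        return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

    def step(self, x_new):
        """Slide the window one step: drop the oldest value, insert the newest."""
        x_old = self.buf.pop(0)
        self.sorted.pop(bisect.bisect_left(self.sorted, x_old))
        bisect.insort(self.sorted, x_new)
        self.buf.append(x_new)
        return self.median()
```

Each step removes one value and inserts one value into the sorted structure instead of re-sorting the whole window.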
An algorithm for the update of the repeated median in linear time using
quadratic space based on a hammock graph is proposed by Bernholt and
Fried [1], and another update algorithm needing only linear space running in
O(n log n) average time is presented by Fried, Bernholt and Gather [6].
Updating the residuals and calculating the MAD can be done in linear
time. Hence, the MTM and the TRM can both be calculated in linear time.
For the MRM, however, O(n²) time is needed at least for the second repeated
median. Detailed descriptions of the update algorithms can be found in Fried,
Bernholt and Gather [6], [7].
The Table given below summarizes the time and the space needed for
the updates of the filtering procedures. Note that the space for the repeated
median and both repeated median hybrid filters can be reduced to O(n), but
at the expense of larger computation times.
just like for the combined FMH. The RM, the TRM and the MRM can even
remove k - 1 spikes completely within a single time window irrespective of
a linear trend if σ² = 0.
The previous results hold when there is no observational noise. Lipschitz
continuity restricts the influence of minor changes in the data due to small
noise or rounding. The standard median, the FMH, the RM and the RMH
filters are Lipschitz-continuous. The median is Lipschitz-continuous with
constant 1 like all order statistics, while the repeated median and the re-
peated median hybrid filters are Lipschitz-continuous with constant 2k + 1.
An FMH filter is Lipschitz-continuous with constant max |h_j|, the maximal
absolute weight given by a subfilter. MTM, MRM and TRM filters, however,
are not Lipschitz-continuous, which can cause instabilities when there are
small changes in the data. The discontinuity is caused by the trimming of
observations. Application of continuous M-estimators is preferable for this
reason, but computationally more expensive. Nevertheless, we investigate
simpler trimming based methods in order to obtain information about the
possible gain by further iterations.
The finite-sample breakdown point (FSBP) is the fraction of observations
which have to be put into worst case positions in order to make the estimate
take arbitrarily wrong values. For the median the breakdown point becomes
(k + 1)/n when applied to n = 2k + 1 data points, meaning that at least half
of the window needs to be outlying in order to cause an arbitrarily large spike
in the extracted signal. Since for the explosion of the local MAD also at least
k + 1 observations need to be modified, the MTM has the same breakdown
point, while for the FMH filters two outliers are sufficient to make it break
down. From the following Table we see that the RMH filters are considerably
more robust than the FMH filters, and that the RM, TRM and MRM are
almost as robust as the median in the sense of breakdown.
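The breakdown statement for the median can be checked numerically: replacing up to k observations of a window of width n = 2k + 1 by arbitrarily large values cannot drive the median outside the range of the clean data, while k + 1 replacements can (a small sketch):

```python
from statistics import median

def median_resists(n_outliers, k=5):
    """Breakdown behaviour of the window median (n = 2k + 1): the output
    stays within the clean data range as long as at most k observations
    are replaced by arbitrarily large values."""
    clean = list(range(2 * k + 1))        # any bounded window
    window = clean[:]
    for i in range(n_outliers):
        window[i] = 1e12                  # worst-case contamination
    return min(clean) <= median(window) <= max(clean)
```

With k = 5 the median resists 5 outliers but breaks down at k + 1 = 6, matching the breakdown point (k + 1)/n.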
Simulations show the effect of the second step in the derivation of the
TRM and the MRM on their MSE as compared to that of the RM filter.
Application of least squares to the trimmed observations (TRM) increases
the efficiency for Gaussian noise, but almost preserves the robustness of the
repeated median, while application of the repeated median (MRM) further
reduces the bias caused by outliers [7].
continuous, i.e. stable w.r.t. the occurrence of both trends and small changes
in the data. The hybrid filters tend to preserve shifts and extremes, while the
repeated median smoothes them considerably when applied with a large
window width [6], [7]. This means that on the one hand we should choose
a short window width, but on the other hand a large window width is better
for removing outlier patches and for the attenuation of the observational
noise. This is a robust variant of the common problem of bandwidth selection
in nonparametric smoothing.
Fried [5] investigates rules for online shift detection based on the most
recent residuals in the time window. Similarly, we can formulate rules for
the automatic choice of the window width using the regression residuals.
Often least squares criteria are used to assess the local model fit and to find
the bandwidth, but this is not suitable when outliers are present. Instead,
a robust criterion is needed. Remembering that the median is the value
which balances the signs of the residuals and that the repeated median is
a regression analogue, it is natural to use the sign of the residuals. In this way
we give the same weight to all observations irrespectively of their magnitude.
However, note that there are always as many positive as negative residuals
in the window for the repeated median fit. Therefore, we have to apply this
idea to a suitable subset.
The Figure below visualizes the smoothing of a maximum by fitting a line
with a too large window width. The residuals in the center will typically be
positive, while most of the residuals at the start and the end of the window
will be negative. These signs are simply reversed for a minimum. Therefore
it is natural to use the total number of positive residuals at the start and the
end of the window for assessing the model fit. We divide the window into
three sections as follows, namely the first ⌊(k + 1)/2⌋ observations, the central
n - 2⌊(k + 1)/2⌋ observations and the last ⌊(k + 1)/2⌋ observations. If the
total number T of positive residuals in the first and the last section is much
larger than the average ⌊(k + 1)/2⌋, we should shorten the window width since
the signal slope might be decreasing substantially within the window. If T is
much smaller than ⌊(k + 1)/2⌋, the window width should also be shortened
since the signal slope might be increasing.
However, this reduction should not result in a window width which is
too small to resist outlying patterns. Results of previous studies [6], [7] show
that the repeated median resists up to between 25% and 30% outliers without
being substantially affected. Therefore, the minimal window width should be
about four times the maximal length of outlier patches to be removed. For
patches of length three, e.g., we use the constraint n ≥ 11. Since longer time
windows allow better attenuation of observational noise and also robustness
against many outliers we increase the window width after each step whenever
possible.
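The sign-based rule above can be sketched as follows (the thresholds deciding "much larger" and "much smaller" are illustrative choices, not the authors' calibration):

```python
def shorten_window(residuals):
    """Sign-based check of the local linear fit.

    residuals: regression residuals in the current window of odd length
    n = 2k + 1. Returns True if the window width should be shortened.
    The factors deciding 'much larger/smaller' are illustrative.
    """
    n = len(residuals)
    k = n // 2
    m = (k + 1) // 2                      # floor((k + 1) / 2)
    outer = residuals[:m] + residuals[n - m:]
    t = sum(1 for r in outer if r > 0)    # positive signs at both window ends
    # T is expected to be about m when the linear fit is adequate.
    return t > 2 * m or t < m / 2
```

For a smoothed maximum (positive residuals in the center, negative at both ends) the count T is far below its expected value m, so the rule asks for a shorter window.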
The proposed repeated median algorithm with robust adaptive selection
of the window width is as follows: Let k_l < k_u be lower and upper bounds
[Figure: a line fitted to a local maximum with a too large window width; x plotted against time, t = 10 to 40.]
The same or similar approaches can be used for the other robust filters.
We just need to modify the window sections for the hybrid filters, possibly
obtaining asymmetric filters.
5 Application
We now apply the filtering procedures to two data sets. The first example is a
time series simulated from an underlying sawtooth signal, which is overlaid by
Gaussian white noise with zero mean and unit variance, and there are three
isolated outliers, three pairs and two triples of outliers of size -5. The Figure below
shows the outputs of the CRMH and the adaptive RM filter with k_l = 5,
k_u = 15, d_l = 0.7 and d_u = 1.3. The CRMH with n = 21 preserves the local
extremes very well, but it is rather variable. The adaptive RM is almost as
good at the extremes while being much smoother. Most of the time a width
close to the maximal n = 31 is chosen, but close to the three local extremes
and at about t = 280 the width decreases even to the minimal n = 11. The
PRMH (not shown here) is similar to the CRMH, but it is more affected by
the outliers, while the ordinary RM and the median cut the extremes.
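A data set of this type can be generated as follows (the outlier positions and exact configuration below are illustrative, not the precise setup of the example):

```python
import random

def simulate_sawtooth(length=400, period=100, amplitude=10.0, seed=1):
    """Sawtooth signal plus standard Gaussian noise plus outlier patches.

    The patch positions and counts are illustrative; the paper's example
    uses three isolated outliers, three pairs and two triples of size -5.
    """
    random.seed(seed)
    signal = [amplitude * (t % period) / period for t in range(length)]
    x = [s + random.gauss(0.0, 1.0) for s in signal]
    for start, run in [(50, 1), (120, 2), (280, 3)]:   # isolated, pair, triple
        for t in range(start, start + run):
            x[t] += -5.0                               # outliers of size -5
    return signal, x
```

The returned pair (underlying signal, noisy observations) can then be fed window by window into the filters above.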
[Figure: outputs of the CRMH and the adaptive RM filter for the simulated series, plotted against time.]
6 Conclusion
Improved numerical algorithms render the real time application of robust
procedures for time series filtering possible. Methods for robust regression
[Figure: filter outputs for the second data set, plotted against time.]
like the repeated median allow the construction of filters which have similar
benefits to classical linear or location-based approaches when these perform
well, but overcome deficiencies w.r.t. the removal of spiky noise (outliers) or
the tracking of trends. We find the repeated median procedure with robust
adaptive choice of the window width particularly promising. First applications
show that this algorithm can even be modified for online filtering without any
time delay by estimating the intercept at the right-hand side of the time
window, but more experience is needed to optimize the automatic choice of
the window width then.
References
[1] Bernholt T., Fried R. (2003). Computing the update of the repeated median
regression line in linear time. Information Processing Letters 88, 111-117.
[2] Cormen T.H., Leiserson C.E., Rivest R.L. (1990). Introduction to Algorithms.
MIT Press, Cambridge, MA.
Abstract: In this paper we use meta-data packages from the Bioconductor
Project to carry out statistical analyses of gene expression data. We would
like to note that the potential scope of these applications is much broader
and many of the methods described here could be applied to other types
of high-throughput data. To provide context we make use of data from an
investigation into acute lymphoblastic leukemia.
1 Introduction
While there are a number of different definitions of an ontology, we will use
the notion of a restricted vocabulary as the basis for the discussions here.
Ontologies and related concepts are becoming increasingly important tools
for organizing and navigating information. Initiatives in biology (our main
focus) as well as the semantic web are providing a variety of resources and
interesting problems related to ontologies.
For genes and gene products the Gene Ontology Consortium, or GO
(www.geneontology.org), is an initiative that is designed to address this
problem. GO provides a restricted vocabulary as well as clear indications
of the relationships between terms. GO is clearly a valuable tool for data
analysis; however, its structure (as a DAG) and the complex nature of the
relationships that it represents make appropriate use of this tool challenging.
are mappings between GO terms and LocusLink IDs, which are modified to
account for the multiplicity of mappings between the manufacturer IDs and
LocusLink IDs.
2 An example
To demonstrate some of the tools that are included in the GOstats package
we consider expression data from 79 samples from patients with acute lym-
phoblastic leukemia (ALL) that were investigated using Affymetrix GeneChip
arrays [2] . The data were normalized using quantile normalization and ex-
pression estimates were computed using RMA [4]. Of particular interest is
the comparison of 37 samples from patients with the BCR/ABL fusion gene
resulting from a chromosomal translocation (9;22) with the 42 samples from
the NEG group.
To reduce the set of genes for consideration we applied two different sets
of filters (gene filtering is considered in more detail in [5], and the interested
reader is referred there). A non-specific filter was used to remove genes
that showed little or no change in expression level across experiments. The
resulting data set had 2391 probes remaining. To select genes whose ex-
pression values were associated with the phenotypes of interest (BCR/ABL
and NEG) we used the mt.maxT function from the multtest package, which
computes a permutation based t-test for comparing two groups.
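mt.maxT additionally adjusts for multiplicity across genes via the max-T null distribution; the per-gene permutation t-test it builds on can be sketched as:

```python
import random
from statistics import mean, stdev

def t_stat(a, b):
    """Welch-style two-sample t statistic."""
    na, nb = len(a), len(b)
    return (mean(a) - mean(b)) / ((stdev(a) ** 2 / na + stdev(b) ** 2 / nb) ** 0.5)

def permutation_t_test(a, b, n_perm=2000, seed=0):
    """Permutation p-value for comparing two groups (one gene at a time).

    This sketch shows only the single-gene test; mt.maxT also builds the
    max-T distribution across all genes to adjust for multiple testing.
    """
    random.seed(seed)
    observed = abs(t_stat(a, b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        random.shuffle(pooled)
        if abs(t_stat(pooled[:len(a)], pooled[len(a):])) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)   # add-one correction avoids p = 0
```

For two clearly separated groups only permutations reproducing the original split attain the observed statistic, so the p-value is small.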
After adjustment for multiple testing there were only 19 probes (which
correspond to 16 genes) with an adjusted p-value below 0.05. Using those
genes we obtain the set of most-specific GO terms in the MF ontology that
they are annotated at and compute the induced GO graph which is rendered
in Figure 1. No labels have been added to the nodes in this plot since there is
not sufficient room to provide informative ones. Notice that the most specific
terms are at the top of the graph and that arrows go from more specific nodes
to less specific ones. The node in the bottom center is the MF node. Clearly
Using GO for statistical analyses 173
some sort of interactivity (e.g. tooltips) would be beneficial. We will return
to this plot in the next section and use it to provide a more detailed view of
the data.
3 Statistical analyses
GO analysis
We now consider the finite pairwise distances. First, a simple t-test can
be carried out to see if there is any difference between the distances in one
graph versus the other. We took each pairwise distance in the NEG graph and
subtracted from it the same pairwise distance computed on the BCR/ABL
graph. The t-test is for whether the mean is zero; the test statistic was
0.179 with an extremely small p-value. So we see that distances in the NEG
graph seem to be longer than those in the BCR/ABL graph. Further evidence of
this difference comes from the observation that the proportion of values that were
larger in the NEG graph was 0.589.
We will focus our attention on those differences that are large in absolute
value. We chose a value of 2.5 as our cut-off and found that there were
66 differences that were larger than 2.5. These corresponded to 26 distinct
genes.
While all may be interesting, and a particular investigator may want to
expend considerable effort in studying transcription factors that are of particular
interest, we will center our analysis on the set of genes that appear most
frequently in this list.
There are three genes that have high counts, namely MYC, MPO and
GADD45A. This fact suggests that perhaps the expression patterns of these
three different transcription factors are substantially different in the two phe-
notypes we are studying.
For each of the three transcription factors we can compute the average
distance, separately within each graph, to all the other selected genes. We
find that the results are quite consistent and that in all cases the path length
is much shorter in the BCR/ABL group than it is in the NEG group. For
MYC the means were 5 for NEG and 2 for BCR/ABL, for MPO they
were 4 for NEG and 2 for BCR/ABL, and for GADD45A the means were 5 for
NEG and 2 for BCR/ABL. It is rather interesting to observe that amongst
the pairwise distances that have changed the most are those between these
three specific genes.
Specific paths between transcription factors can also be examined. Recall
that we compute our distance between two transcription factors based on the
shortest path length between them in each of the two graphs. In our examples
we focus on MYC and the distances between it and MPO and GADD45A.
We print out the different shortest paths for genes connecting MYC to
both MPO and GADD45A for each of the two phenotypes, respectively (first
the paths for BCR/ABL, then for the NEG samples). The MYC to MPO
results are:
BCR/ABL
MYC->EIF4Gl->HMG20B->MPO
NEG
MYC->CDC25B->TRAP1->FLJ10326->LANCL1->EMP3->S100A4
->LGALS1->MPO
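Shortest paths such as these can be found by breadth-first search on the unweighted gene graph; a minimal sketch with a made-up toy graph:

```python
from collections import deque

def shortest_path(adj, src, dst):
    """Unweighted shortest path by breadth-first search.

    adj maps each gene to the list of genes it shares an edge with.
    Returns the path as a list of genes, or None if dst is unreachable.
    """
    prev = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:     # walk predecessors back to src
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nxt in adj.get(node, []):
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    return None

# Toy graph for illustration only; it is not the real gene-gene graph.
toy = {"MYC": ["A", "B"], "A": ["MYC", "MPO"], "B": ["MYC"], "MPO": ["A"]}
```

In the actual analysis this role is played by graph algorithms from the Bioconductor infrastructure packages rather than hand-written BFS.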
Figure 3: Pairwise scatterplots of gene expression for those genes on the
shortest path between MYC and MPO from patients with the BCR/ABL
translocation.
BCR/ABL
MYC->UBE2A->BAZ1A->CD53->GADD45A
NEG
MYC->CDC25B->TRAP1->SSBP1->SMC1Ll->TK1->HCK->
SH3PB1->PVRL2->GADD45A
We do not have space to present the other pairwise scatterplots here, but
readers who are making use of the compendium version of this paper can
easily explore those different plots on their own.
We notice that the path lengths for the NEG samples are longer (involve
more genes) than those for the BCR/ABL samples. We might also want to
ask whether the distances are also larger (that is, whether the correlations are
smaller). To do this we need to obtain the edge weights from the respective
graphs and compare them. We found that there appeared to be no difference
(all averaged around a distance of about 0.65), but the number of edges is
quite small and one might expect to see systematic differences if a larger
study were undertaken.
We can check our results, at least to some extent, by examining pairwise
scatterplots of the gene expressions. In Figure 3 the genes on the path from
MYC to MPO are plotted. We see quite strong correlations along the diagonal
and note that HMG20B and MPO have a negative correlation.
Finally, we finish our examination of these data by considering some of
the specific paths between the different transcription factors. We see, in
Figure 4, the actual shortest path between the genes MYC and MPO. The
two end points have been colored red; genes along the path are colored blue.
Figure 4: Shortest path between MYC and MPO in the NEG samples.
4 Discussion
GO and the mappings from genes to specific terms in each of the three ontolo-
gies provide a number of important and unique data analytic opportunities.
In this paper we have considered three separate applications of these re-
sources to the problem of analysing gene expression data, and in all cases the
GO related data have provided new and important insights into the data.
Using GO mappings to select certain terms for further study and reference
has the possibility of providing meaning to sets of genes that have been
selected according to different criteria. An equally important application is
to use GOA mappings to reduce the set of genes under consideration. As the
capacity of microarrays increases it is important that we begin developing
tools and strategies that directly address specific questions of interest. P-
value correction methods are at best a band-aid and do not represent an
approach that has long-term viability [5].
In our final example we adapted the method proposed by [6] to a dif-
ferent problem, one where we consider only transcription factors and where
we are interested in understanding their interrelationships. The results are
promising and in our example reflect a fundamental difference between those
patients with the BCR/ABL translocation and those patients with no observed ge-
netic abnormalities. Ideally these, and other, observations will lead to a better
understanding of transcriptional regulation and from that to a better under-
standing of modalities of efficacy for drug treatments.
180 Robert Gentleman
Perhaps more important than the statistical presentation is the fact that
we have also provided software implementations for all tools described and
discussed in this paper. They are available from the Bioconductor Project in
the form of the GOstats package. GOstats makes substantial use of software
infrastructure from the Bioconductor Project in carrying out this analysis,
in particular the graph, Rgraphviz and RBGL packages, together with the different
meta-data packages.
Finally, this document itself represents an approach to reproducible re-
search in the sense discussed by [3] and it can be reproduced on any user's
machine equipped with R and the appropriate set of R packages. We encour-
age the interested reader to avail themselves of the opportunity to explore
the data and the methods in more detail on their own computer.
References
[1] Camon E., Magrane M., Barrell D., Lee V., Dimmer E., Binns D.,
Maslen J., Harte N., Lopez R., Apweiler R. (2004). The gene ontol-
ogy annotation (GOA) database: sharing knowledge in UniProt with gene
ontology. Nucleic Acids Research 32, D262 - D266.
[2] Chiaretti S., Li X., Gentleman R., Vitale A., Vignetti M., Mandelli F.,
Ritz J., Foa R. (2004). Gene expression profile of adult T-cell acute lym-
phocytic leukemia identifies distinct subsets of patients with different re-
sponse to therapy and survival. Blood 103, 2771 - 2778.
[3] Gentleman R., Temple Lang D. (2003). Statistical analyses and repro-
ducible research.
[4] Irizarry R.A., Hobbs B., Collin F., Beazer-Barclay Y.D., Antonellis K.J.,
Scherf U., Speed T.P. (2003). Exploration, normalization, and summaries
of high density oligonucleotide array probe level data. Biostatistics 4, 249 -
264.
[5] von Heydebreck A., Huber W., Gentleman R. (2004). Differential ex-
pression with the Bioconductor Project. In: Encyclopedia of Genetics, Ge-
nomics, Proteomics and Bioinformatics. John Wiley and Sons.
[6] Zhou X., Kao M.-C.J., Wong W.H. (2002). Transitive functional an-
notation by shortest-path analysis of gene expression data. PNAS 99,
12783-12788.
Acknowledgement: I would like to thank Vincent Carey for many helpful
discussions about these, and very many other, topics. I would like to thank
Drs. J. Ritz and S. Chiaretti of the DFCI for making their data available and
for helping me to understand how it relates to ALL. I would like to thank J.
Zhang and J. Gentry for a great deal of assistance in preparing the data and
writing software in support of this research.
Address: R. Gentleman, Department of Biostatistics, Harvard University
E-mail: rgentlem@jimmy.harvard.edu
COMPSTAT'2004 Symposium © Physica-Verlag/Springer 2004
COMPUTATIONAL CHALLENGES IN
DETERMINING AN OPTIMAL DESIGN
FOR AN EXPERIMENT
Subir Ghosh
Key words: Balanced arrays, computational challenges, factorial designs, in-
teractions, orthogonal arrays, robust designs, search designs, search linear
models, search probabilities, unavailability of data.
COMPSTAT 2004 section: Design of experiments.
1 Introduction
In the early development of designing a statistically efficient experiment, con-
siderable attention was given to the computational simplicity of the analysis
and to some desirable properties of the inferences drawn on the comparisons
(parameters) of interest [2]. The concepts of orthogonality and balance in
experimental designs were developed. With the progress in methodologi-
cal research and the development in computing technology, the concepts of
optimum designs and various optimality criteria were proposed [10]. The ex-
periment could be performed at a single stage or at many stages over time.
The data could be continuous, discrete, univariate, multivariate, time series,
spatial, and other kinds, or some combinations of them. Inference procedures
could be parametric, nonparametric, semiparametric, frequentist, Bayesian,
and others. The most amazing aspect of design research is the enormous
contributions of all kinds of researchers, from extreme theorists to extreme
practitioners [8]. We do not attempt the futile effort of listing all the
contributors and their research. In this paper we examine some aspects of
determining optimal designs and discuss some challenging problems.
2 Optimum designs
An optimum design is normally obtained by satisfying one or more optimality
properties (minimizing variance, maximizing power, and many others) for the
comparisons (parameters) of interest under an assumed model. The choice
between a best design with respect to (w.r.t.) one criterion and a best design
w.r.t. another criterion is always an issue at the time of the selection of an
182 Subir Ghosh
optimum design. With the change in the computing environment, this is-
sue has become much more complex. For example, the orthogonal fractional
factorial plans may be best w.r.t. many optimality criteria but they require
more runs in most situations than nonorthogonal plans and furthermore may
not perform well compared to nonorthogonal plans when the assumed model
is really inadequate. If we decide to give up orthogonality and opt for opti-
mal balanced fractional factorial plans as our nonorthogonal plans, then we
may cut down the cost of running the experiment as well as improve the per-
formance when the assumed model is inadequate. Finding optimal balanced
fractional factorial plans as nonorthogonal plans is always computationally
challenging but it is possible to find such plans in the modern computing
environment. Many such plans are already available in the design literature.
The list of references is available in Ghosh and Rao [7], [8].
3 Robust designs
The unavailability of data that we often encounter in conducting an exper-
iment should be a concern at the design stage. Ghosh [3] introduced the
concept of robustness of design against the unavailability of any t (a positive
integer) observations in the sense that the unbiased estimation of all the pa-
rameters of interest is still possible when any t observations are unavailable.
For n observations, there are (n choose t) possible sets of t observations. Ghosh and
Namini [5] gave several criteria and methods for determining the influential
set of t observations for robust designs. There are numerous such practi-
cal issues, including the presence of outliers, time trend in observations, and
others in real life experiments. Such practical issues give rise to challenging
computational problems in the selection of designs.
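Checking robustness against the unavailability of any t observations amounts to verifying that every (n - t)-row submatrix of the design matrix keeps full column rank; a brute-force sketch for toy designs:

```python
from itertools import combinations

def _rank(rows):
    """Matrix rank by Gaussian elimination (rows: lists of numbers)."""
    m = [list(map(float, row)) for row in rows]
    rank, ncols = 0, len(m[0])
    for col in range(ncols):
        piv = next((r for r in range(rank, len(m)) if abs(m[r][col]) > 1e-9), None)
        if piv is None:
            continue
        m[rank], m[piv] = m[piv], m[rank]
        for r in range(len(m)):
            if r != rank and abs(m[r][col]) > 1e-9:
                f = m[r][col] / m[rank][col]
                m[r] = [a - f * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

def robust_against_t(X, t):
    """Ghosh's robustness criterion: unbiased estimation of all parameters
    remains possible when any t observations are unavailable, i.e. every
    (n - t)-row submatrix of the design matrix keeps full column rank."""
    n, p = len(X), len(X[0])
    return all(
        _rank([X[i] for i in range(n) if i not in drop]) == p
        for drop in combinations(range(n), t)
    )
```

For example, a 2x2 factorial with an intercept column tolerates the loss of any single run but not of any two runs.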
E(y) = A_1 ξ_1 + A_2 ξ_2,   (1)
where y(n x 1) is the vector of observations, and A_1(n x v_1) and A_2(n x v_2)
are matrices known from the underlying design. The elements of the vector
ξ_1(v_1 x 1) are unknown parameters. About the elements of ξ_2(v_2 x 1) we
know that at most k elements are nonzero, but we do not know which elements
are nonzero. The k is small compared to v_2. The goal is to search for and
identify the nonzero elements of ξ_2 and then estimate them along with the
elements of ξ_1. Such a model is called a search linear model. When ξ_2 = 0,
the search linear model becomes the ordinary linear model. For the search
linear model, we have ξ_2 ≠ 0.
Let A_22 be any (n x 2k) submatrix obtained by choosing 2k columns
of A_2. A design is called a search design [13] if, for every submatrix A_22,
Rank[A_1 : A_22] = v_1 + 2k.   (2)
The rank condition (2) allows us to fit and discriminate between any two
models in the class of possible models described earlier. Any two models
in the class have v_1 common parameters, which are the elements of ξ_1, and
at most 2k uncommon parameters, which are the elements of ξ_2. Note that
n ≥ v_1 + 2k. A search design allows us to search for and identify the nonzero
elements of ξ_2 and then estimate them along with the elements of ξ_1.
may or may not be present. A search procedure identifies the model which
best fits the data generated from the search design. To identify this model,
the sum of squares of errors (SSE) of each model is used [13]. If the SSE for the
first model (M1) is smaller than the SSE for the second model (M2), then
M1 provides a better fit and is selected over M2. For a fixed value of k,
all (v_2 choose k) models are fitted to the data and the search procedure selects the
model with the smallest SSE as the best model for describing the data.
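The search procedure can be sketched directly: fit every candidate set of k columns of A_2 together with A_1 and keep the choice with the smallest SSE (pure-Python normal equations, suitable only for toy sizes):

```python
from itertools import combinations

def _lstsq_sse(y, cols):
    """SSE of the least squares fit of y on the given columns.

    Normal equations solved by Gauss-Jordan elimination; assumes the
    chosen columns have full rank.
    """
    p = len(cols)
    G = [[sum(a * b for a, b in zip(cols[i], cols[j])) for j in range(p)]
         for i in range(p)]
    c = [sum(a * b for a, b in zip(cols[i], y)) for i in range(p)]
    for i in range(p):                       # solve G beta = c
        piv = max(range(i, p), key=lambda r: abs(G[r][i]))
        G[i], G[piv] = G[piv], G[i]
        c[i], c[piv] = c[piv], c[i]
        for r in range(p):
            if r != i:
                f = G[r][i] / G[i][i]
                G[r] = [a - f * b for a, b in zip(G[r], G[i])]
                c[r] -= f * c[i]
    beta = [c[i] / G[i][i] for i in range(p)]
    fitted = [sum(beta[j] * cols[j][i] for j in range(p)) for i in range(len(y))]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

def search_best_model(y, A1cols, A2cols, k):
    """Return the indices of the k columns of A2 whose model has
    the smallest SSE, as in the search procedure of [13]."""
    return min(combinations(range(len(A2cols)), k),
               key=lambda S: _lstsq_sse(y, A1cols + [A2cols[j] for j in S]))
```

With an intercept column in A_1 and two candidate effect columns in A_2, data generated by the second candidate are matched exactly (SSE 0), so the search selects it.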
d1          d2
- - - -     - - - -
- + + +     + + + +
- - + +     - + + +
- + - +     + - - -
- + + -     - - + -
+ - - +     + + - +
+ - + -     - - + +
+ + - -     + + - -
Table 1: d1 and d2 with 8 runs and 4 factors.
elements of the SPM. The higher this minimum value, the better the
design. Ghosh and Teschmacher [9] defined the SPM, proposed two other
criteria, and presented methods of comparing search designs for all values
of p using all three criteria. One of the two proposed criteria in Ghosh and
Teschmacher [9] is based on the element-by-element comparison of two SPMs
and the other is based on comparing two minimum search probability
vectors (MSPVs) whose elements are the minimum values of the columns
of two SPMs. The comparisons are then made by using a majority rule, in
the sense that fifty percent or more of the elements of one SPM are greater
than the corresponding elements of another SPM. Similar comparisons are also
made for two MSPVs. The methods proposed in Ghosh and Teschmacher [9]
have opened up a new direction of computationally challenging problems for
finding optimum designs.
Orthogonal designs have many well-known optimality properties under
the ordinary linear model. However, balanced designs can perform better
than orthogonal designs under the search linear model. Consider two search
designs, D1 and D2, each with 12 runs and 4 factors, each factor at two levels (-)
and (+). Design D1 is a balanced array of full strength and design D2 is
an orthogonal array of strength 2 obtained from the 12-run Plackett-Burman
design [11] by choosing the first four columns. Table 2 presents D1 and D2.
Design D1 performs better than design D2 under the ordinary linear model
with ξ_2 = 0. However, D2 performs better than D1 under the search linear
model when the vector ξ_2 consists of two- and three-factor interactions, only
one of which is nonzero, so that k = 1. This is a really striking example
illustrating the fact that an orthogonal design is not necessarily the best in
all situations.
7 Conclusions
In this paper we have described some challenging computational problems in
finding a best design for an experiment. The modern computing environment has
Computational challenges in determining an optimal design 187
D1          D2
+ + + +     + - + -
- - - -     + + - +
- - - +     - + + -
- - + -     + - + +
- + - -     + + - +
+ - - -     + + + -
- - + +     - + + +
- + - +     - - + +
+ - - +     - - - +
- + + -     + - - -
+ - + -     - + - -
+ + - -     - - - -
Table 2: D1 and D2 with 12 runs and 4 factors.
helped us in attempting to resolve these problems. Many other challenging
problems and some of their solutions are indeed available in the work of
other researchers. Many new computationally challenging problems are also
constantly emerging with the modern development of science and technology.
References
[1] Draper N.R., Lin D.K.J. (1990). Small composite designs. Technometrics 32, 187-194.
[2] Fisher R.A. (1935). The design of experiments. First Edition. Oliver and Boyd, London.
[3] Ghosh S. (1979). On robustness of designs against incomplete data. Sankhya B 40, 204-208.
[4] Ghosh S. (1980). On main effect plus one plans for 2^m factorials. Ann. Statist. 8, 922-930.
[5] Ghosh S., Namini H. (1990). Influential observations under robust designs. In: Coding Theory and Design Theory, Part II: Design Theory, D.K. Ray-Chaudhuri (ed.), Springer-Verlag, New York, 86-97.
[6] Ghosh S., Al-Sabah W.S. (1996). Efficient composite designs with small number of runs. J. Statist. Plann. Inference 53, 117-132.
[7] Ghosh S., Rao C.R. (1996). Design and analysis of experiments. North-Holland, Elsevier Science B.V., Amsterdam.
[8] Ghosh S., Rao C.R. (2001). An overview of developments in statistical designs and analysis of experiments. In: Recent Advances in Experimental Designs and Related Topics, S. Altan and J. Singh (eds.), Nova Science Publishers, Inc., New York, 1-24.
[9] Ghosh S., Teschmacher T. (2002). Comparisons of search designs using search probabilities. J. Statist. Plann. Inference 104, 439-458.
188 Subir Ghosh
[10] Kiefer J. (1959). Optimum experimental designs. J. Roy. Statist. Soc. B 21, 272-319.
[11] Plackett R.L., Burman J.P. (1946). The design of optimum multifactorial experiments. Biometrika 33, 305-325.
[12] Shirakura T., Takahashi T., Srivastava J.N. (1996). Searching probabilities for nonzero effects in search designs for the noisy case. Ann. Statist. 24, 2560-2568.
[13] Srivastava J.N. (1975). Designs for searching non-negligible effects. In: A Survey of Statistical Design and Linear Models, J.N. Srivastava (ed.), North-Holland, Elsevier Science B.V., Amsterdam, 505-519.
VISUALIZATION OF PARAMETRIC
CARCINOGENESIS MODELS
Jutta Groos and Annette Kopp-Schneider
Key words: Hepatocarcinogenesis, color-shift model, maximum likelihood estimate.
COMPSTAT 2004 section: Biostatistics.
Abstract: This paper concentrates on effective tools to compare different carcinogenesis models with respect to their ability to predict numbers and radii of foci in hepatocarcinogenesis experiments. In particular, the CSM-GUI (Color-Shift graphical user interface) proves to be a powerful instrument to test a new model before starting the very time-intensive procedure of finding the maximum likelihood parameters.
1 Introduction
Hepatocarcinogenesis experiments identify focal lesions consisting of intermediate cells at different preneoplastic stages. Several hypotheses have been established to describe the formation and progression of preneoplastic liver foci. A common model of hepatocarcinogenesis is the multi-stage model, which is based on the assumption that cells have to undergo multiple successive changes on their way from the normal to the malignant stage. In this model single cells change their phenotype through mutation into the next stage and proliferate according to a linear stochastic birth-death process [4], [5].
In contrast, the Color-Shift Model (CSM) was introduced by Kopp-Schneider and colleagues [4] to describe that whole colonies of altered cells simultaneously alter their phenotype. In this model, preneoplastic foci are assumed to grow exponentially with deterministic rate and to change their phenotype ('color') after an exponentially distributed waiting time [1], [3]. To take into account that the assumption of deterministic growth rates for foci in the CSM seems to oversimplify the real process, a CSM with stochastic growth rates is introduced.
In order to compare different models with respect to their ability to predict numbers and radii of foci in a rat hepatocarcinogenesis experiment, maximum likelihood estimates for the model parameters are used and the predicted and empirical distributions are visualized.
color when reaching a deterministic radius r_switch. As in the CSM, the formation of spherical foci with initial radius r_0 is described by a homogeneous Poisson process with rate μ. Let B_1 and B_2 be independent positive random variables with densities f_{B_1} and f_{B_2}. The random variables B_1 and B_2 describe the exponential growth of foci of color 1 and color 2.
Given that a focus is present at time t, the timepoint of its formation, τ_0, is a realisation of a random variable T uniformly distributed on [0, t], where T, B_1 and B_2 are independent.
Consider, exemplarily, a focus generated at time T = τ_0 which grows in color C = 1 with rate B_1 = b_1 until it reaches the radius r_switch, where it changes its color and grows in color C = 2 with rate B_2 = b_2.
Color 1:  R(t) = r_0 exp(b_1 (t - τ_0)).

Color 2:  R(t) ≥ r_switch  ⟺  t ≥ ln(r_switch/r_0)/b_1 + τ_0.

Define

  τ_1 := ln(r_switch/r_0)/b_1

as the time spent in color 1 until the change to color 2. The radius of a focus of color C = 2 at timepoint t > τ_0 + τ_1, R(t), is described by:

  R(t) = r_switch exp(b_2 (t - τ_1 - τ_0)).
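The two growth phases can be simulated directly from the description above. The following Python sketch (the paper's own computations are in MATLAB) uses hypothetical exponential growth-rate distributions purely for illustration:

```python
import math, random

def simulate_focus(t, r0, r_switch, draw_b1, draw_b2, rng):
    """Radius R(t) and color C(t) of one focus known to be present at time t."""
    tau0 = rng.uniform(0.0, t)           # formation time T ~ Unif[0, t]
    b1 = draw_b1(rng)                    # growth rate B1 = b1 in color 1
    tau1 = math.log(r_switch / r0) / b1  # time spent in color 1
    if t - tau0 <= tau1:                 # radius has not yet reached r_switch
        return r0 * math.exp(b1 * (t - tau0)), 1
    b2 = draw_b2(rng)                    # growth rate B2 = b2 after the shift
    return r_switch * math.exp(b2 * (t - tau0 - tau1)), 2

rng = random.Random(1)
draws = [
    simulate_focus(10.0, 0.01, 0.1,
                   lambda g: g.expovariate(2.0),   # hypothetical f_B1
                   lambda g: g.expovariate(4.0),   # hypothetical f_B2
                   rng)
    for _ in range(10_000)
]
share2 = sum(1 for _, c in draws if c == 2) / len(draws)
print(f"fraction of foci already shifted to color 2: {share2:.3f}")
```

Every simulated color-1 focus has radius between r_0 and r_switch, as required by the construction.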
Thus an expression for the joint distribution of radius R(t) and color C(t) = 1 at time t can be derived:

  P(R(t) ≤ r, C(t) = 1) = P(R(t) ≤ r, R(t) ≤ r_switch)

    = 0   for r ≤ r_0,

    = (ln(r/r_0)/t) ∫_{ln(r/r_0)/t}^∞ (f_{B_1}(b_1)/b_1) db_1 + F_{B_1}(ln(r/r_0)/t)   for r ∈ (r_0, r_switch],

    = (ln(r_switch/r_0)/t) ∫_{ln(r_switch/r_0)/t}^∞ (f_{B_1}(b_1)/b_1) db_1 + F_{B_1}(ln(r_switch/r_0)/t)   for r > r_switch,

where F_{B_1} and f_{B_1} are the distribution function and density of the random variable B_1.
Therefore the joint density of radius R(t) and color C(t) = 1 at time t is:

  f_{R(t),C(t)}(x, 1) = [ (1/(xt)) ∫_{ln(x/r_0)/t}^∞ (f_{B_1}(b_1)/b_1) db_1 - (1/(xt)) f_{B_1}(ln(x/r_0)/t) + (1/(xt)) f_{B_1}(ln(x/r_0)/t) ] · 1_{(r_0, r_switch]}(x)

  = (1/(xt)) ∫_{ln(x/r_0)/t}^∞ (f_{B_1}(b_1)/b_1) db_1 · 1_{(r_0, r_switch]}(x).
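The displayed probability can be checked against a direct Monte Carlo simulation of the model. The Python sketch below (with an illustrative exponential density for B_1, not the one used in the paper) evaluates E[min(1, ln(r/r_0)/(t B_1))], which equals the middle case of the distribution above, by simple midpoint quadrature:

```python
import math, random

r0, r_switch, t, lam = 0.01, 0.1, 5.0, 1.0
r = 0.05                                  # a radius in (r0, r_switch]
A = math.log(r / r0) / t                  # the limit ln(r/r0)/t from the text

# F_B1(A) + A * int_A^inf f_B1(b)/b db  =  E[min(1, A/B1)], here with the
# illustrative choice B1 ~ Exp(lam); evaluated by midpoint quadrature.
n, hi = 100_000, 40.0
h = hi / n
p_formula = sum(
    min(1.0, A / ((k + 0.5) * h)) * lam * math.exp(-lam * (k + 0.5) * h) * h
    for k in range(n)
)

# Monte Carlo of the model itself: formation time Unif[0, t], growth rate B1;
# since r <= r_switch, the event R(t) <= r implies the focus is still color 1.
rng = random.Random(7)
N = 200_000
hits = sum(
    1
    for _ in range(N)
    if r0 * math.exp(rng.expovariate(lam) * (t - rng.uniform(0.0, t))) <= r
)
p_mc = hits / N
print(p_formula, p_mc)
```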
The joint distribution of radius R(t) and color C(t) = 2 at time t is, for r > r_switch:

  P(R(t) ≤ r, C(t) = 2) = P(R(t) ≤ r, R(t) > r_switch)

  = ∫_{ln(r_switch/r_0)/t}^∞ ∫_{ln(r/r_switch)/(t-τ_1)}^∞ (ln(r/r_switch)/(b_2 t)) f_{B_2}(b_2) f_{B_1}(b_1) db_2 db_1
    + ∫_{ln(r_switch/r_0)/t}^∞ ∫_{-∞}^{ln(r/r_switch)/(t-τ_1)} ((t - τ_1)/t) f_{B_2}(b_2) f_{B_1}(b_1) db_2 db_1

  = ∫_{ln(r_switch/r_0)/t}^∞ ∫_{ln(r/r_switch)/(t-τ_1)}^∞ (ln(r/r_switch)/(b_2 t)) f_{B_2}(b_2) f_{B_1}(b_1) db_2 db_1
    + ∫_{ln(r_switch/r_0)/t}^∞ F_{B_2}(ln(r/r_switch)/(t - τ_1)) ((t - τ_1)/t) f_{B_1}(b_1) db_1,

where τ_1 = ln(r_switch/r_0)/b_1 depends on the integration variable b_1.
Analogously, the joint density of radius R(t) and color C(t) = 2 at time t is obtained by differentiation; the boundary terms arising from the two summands cancel, just as in the color-1 case, leaving for x > r_switch:

  f_{R(t),C(t)}(x, 2) = (1/(xt)) ∫_{ln(r_switch/r_0)/t}^∞ ∫_{ln(x/r_switch)/(t-τ_1)}^∞ (f_{B_2}(b_2)/b_2) f_{B_1}(b_1) db_2 db_1.
The density of the radius of a focal transection¹ of color j observed in a two-dimensional liver section is obtained from these densities by Wicksell's transformation:

  f_{R_2(t),C(t)}(y, j) = [ ∫_y^∞ (y/√(x² - y²)) f_{R(t),C(t)}(x, j) dx ] / [ Σ_{j=1}^2 ∫_0^∞ x f_{R(t),C(t)}(x, j) dx ],

where j = 1 (equation (1)) refers to color 1 and j = 2 (equation (2)) to color 2.
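Wicksell's transformation can be sanity-checked in the degenerate case of a single sphere radius, where the transection-radius distribution is known in closed form; a stdlib-Python sketch:

```python
import math, random

# Degenerate check of Wicksell's relation: spheres of one fixed radius x0 cut
# by a plane whose height above the centre is Unif[0, x0]; the transection
# radius Y then satisfies P(Y <= y) = 1 - sqrt(x0^2 - y^2) / x0.
x0 = 1.0
rng = random.Random(3)
ys = [math.sqrt(x0 ** 2 - rng.uniform(0.0, x0) ** 2) for _ in range(100_000)]

y = 0.6
emp = sum(1 for v in ys if v <= y) / len(ys)
theo = 1.0 - math.sqrt(x0 ** 2 - y ** 2) / x0
print(f"empirical {emp:.4f} vs. theoretical {theo:.4f}")
```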
Assume that the foci of each animal grow and change their color independently of the other foci. Let n_{2,k} denote the number of focal transections of color k observed in a liver section of area A and let r_{2,k,j} denote the radius of the j-th focal transection of color k. This liver section contributes the loglikelihood

  Σ_{k=1}^2 [ (n_{2,k} ln(A λ_{2,k}) - A λ_{2,k}) + Σ_{j=1}^{n_{2,k}} ln(f_{R_2(t),C(t)}(r_{2,k,j}, k)) ] + C,   (3)

where λ_{2,k} denotes the expected number of color-k transections per unit section area under the model and C is a data-dependent constant. Assuming that the liver sections of one experiment are independent of each other, the loglikelihood of the complete data set is the sum of the contributions of every section.
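A direct transcription of the per-section contribution (3), with hypothetical stand-in values for the intensity and the transection density (the real quantities come from the model via equations (1) and (2)):

```python
import math

def section_loglik(A, counts, radii, lam2, dens):
    """Contribution (3) of one liver section; the constant C is omitted."""
    ll = 0.0
    for k in (1, 2):
        ll += counts[k] * math.log(A * lam2[k]) - A * lam2[k]  # Poisson part
        ll += sum(math.log(dens(r, k)) for r in radii[k])      # radius part
    return ll

# Hypothetical stand-ins chosen only to make the sketch runnable:
lam2 = {1: 2.0, 2: 0.5}                     # expected transections per area
dens = lambda r, k: 2.0 * r                 # toy density on (0, 1]
A = 4.0
counts = {1: 7, 2: 3}
radii = {1: [0.2, 0.3, 0.25, 0.4, 0.1, 0.35, 0.5], 2: [0.6, 0.7, 0.8]}

total = section_loglik(A, counts, radii, lam2, dens)
print(f"section loglikelihood (up to C): {total:.4f}")
```

Because the sections are assumed independent, the full-data loglikelihood is just the sum of such terms over all sections.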
4 Example
Data from an NNM experiment published by Weber and Bannasch in 1994 [8] are chosen to illustrate the methodology. In this study rats were treated
¹To differentiate between the number of foci and the number of focal transections an additional index was introduced. Here the index 2 stands for two dimensions.
with 6 mg NNM² per kg body weight continuously during six different time periods, 7, 11, 15, 20, 27 and 37 weeks, with each group consisting of five animals. After this time period one liver section of each rat was stained by the marker H&E³ and different types of focal transections were observed. Here only two different types of foci are considered. The morphometric evaluation of the stained liver sections generated a data set consisting of the area of every liver section and the type and area of every focal transection detected in this section.
A Color-Shift-Model with color-dependent and Beta-distributed growth rates is applied to this data set. The random variables B_1 and B_2, which describe the exponential growth in color 1 and color 2, are Beta-distributed with parameters p_1, q_1, a_1 and p_2, q_2, a_2. Form parameters, a_i, are introduced in addition to the parameters of the standard Beta distribution, p_i and q_i (p_i, q_i, a_i > 0, i = 1, 2), to modify the support of the distribution function. Hence the growth rate in color i, B_i, is a positive random variable with the following density:

  f_{B_i}(b) = (1/(a_i B(p_i, q_i))) (b/a_i)^{p_i-1} (1 - b/a_i)^{q_i-1} 1_{(0, a_i)}(b),   where B(p, q) = ∫_0^1 z^{p-1} (1 - z)^{q-1} dz.

Inserting this expression into the joint density of radius R(t) and color C(t) at time t, double integrals are obtained in equations (1) and (2) which cannot be solved analytically. The loglikelihood function (3) depends on eight parameters.
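The scaled-Beta growth rates are easy to work with numerically. A Python sketch (the paper itself works in MATLAB) of the density above, with hypothetical parameter values and a quick Monte Carlo check of the mean a_i p_i/(p_i + q_i):

```python
import math, random

def beta_growth_density(b, p, q, a):
    """Density of B_i = a * Beta(p, q): a Beta law stretched to support (0, a)."""
    if not 0.0 < b < a:
        return 0.0
    z = b / a
    beta_pq = math.gamma(p) * math.gamma(q) / math.gamma(p + q)  # B(p, q)
    return z ** (p - 1) * (1.0 - z) ** (q - 1) / (beta_pq * a)

p, q, a = 2.0, 3.0, 0.4          # hypothetical parameter values
rng = random.Random(5)
sample = [a * rng.betavariate(p, q) for _ in range(50_000)]
mean = sum(sample) / len(sample)
print(f"empirical mean {mean:.4f} vs. exact a*p/(p+q) = {a * p / (p + q):.4f}")
```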
4.1 Implementation
The MATLAB environment is used to compute the loglikelihood function, find the maximum likelihood parameters and visualize the results. Numerical double integration with singularities has to be performed for every single detected focal transection. As about 1000 focal transections are detected, the computation of the likelihood is a very time-intensive procedure. Using the MEX interface, functions for numerical double integration from the Fortran NAG library are included to improve the performance⁴. To find the maximum
²The chemical carcinogen N-Nitrosomorpholine (NNM) was administered in the drinking water.
³H&E stands for Hemalum & Eosin, a biological marker to identify acidophilic and basophilic cell structures.
⁴Subroutine D01DAF of Numerical Algorithms Group (NAG), Fortran Library, version Mark 18 [6].
Figure 1: A time-point can be chosen via the pop-up menu and the parameters can be varied via their corresponding sliders. Depending on the parameters, the three axes show the theoretical distributions of size and number of focal transections of type 1 and 2 (dotted lines) compared with the empirical data taken from the NNM-experiment (solid lines). One slider is provided for the Poisson parameter μ, six sliders for the parameters p_1, q_1, a_1 and p_2, q_2, a_2 corresponding to the Beta-distributed growth rates in type 1 and 2, and one slider for r_switch.
likelihood parameters it is necessary to define a set of eight starting parameters for the fmincon⁵ function and to define proper intervals for the range of the eight model parameters. For this purpose a graphical user interface (CSM-GUI) is implemented in MATLAB to test the theoretical distributions of size and number of focal transections in 2D under variation of the parameters (Figure 1). After minimizing the negative loglikelihood with the fmincon function, theoretical results can be compared with the empirical data.
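The singular integrals behind the transection densities can be tamed by substitution before quadrature. This Python sketch (standing in for the NAG routine D01DAF used in the paper) removes the 1/√(x² - y²) singularity with x = y cosh(u) and checks the analytically solvable case f ≡ 1:

```python
import math

def wicksell_integral(f, y, x_max, n=2000):
    """integral_y^{x_max} f(x)/sqrt(x^2 - y^2) dx via x = y*cosh(u).

    After the substitution, dx/sqrt(x^2 - y^2) = du, so the integrand is
    smooth on [0, acosh(x_max/y)] and the plain midpoint rule converges fast.
    """
    u_max = math.acosh(x_max / y)
    h = u_max / n
    return sum(f(y * math.cosh((k + 0.5) * h)) * h for k in range(n))

# Analytic check with f == 1: the integral equals acosh(x_max / y).
val = wicksell_integral(lambda x: 1.0, 0.5, 2.0)
print(val, math.acosh(4.0))
```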
4.2 Results
Figures 2 and 3 illustrate typical visualizations of the results of the modelling of the NNM-experiment. The empirical size distribution is compared with the theoretical size distributions obtained from two different Color-Shift-
⁵fmincon is a MATLAB function for nonlinear minimization under constraints, used to minimize the negative loglikelihood function [7].
[Figure 2 plot: distribution functions versus Radius [mm].]
Figure 2: The result of the CSM (dashed line) and the CSM with Beta-distributed growth rates (dotted line) applied to foci of type 1 after 37 weeks. The solid line represents the empirical data.
[Figure 3 plot: CDF Type 2 Foci after 37 Weeks NNM.]
Figure 3: The result of the CSM (dashed line) and the CSM with Beta-distributed growth rates (dotted line) applied to foci of type 2 after 37 weeks. The solid line represents the empirical data.
Models using maximum likelihood estimates for the parameters. The CSM without modifications is represented by the dashed line, the CSM with Beta-distributed growth rates is illustrated by the dotted line and the solid line stands for the empirical data from the NNM-experiment. Considering only type-1 foci, the modified CSM seems to predict the size distribution better than the CSM. But the visualizations for the focal transections of type 2 show that the modified CSM expects too large foci of the second type, so that there is an advantage for the CSM without modification in this case.
1 Introduction
The need for risk assessment of exposures to toxic agents in the human environment has increased steadily over the last decades. The general paradigm for risk assessment is the identification and characterization of hazard, assessment of exposure and characterization of risk. In practice, risk assessment particularly addresses the outcome of integrating the data available from epidemiology, long-term mortality and morbidity studies and mechanistic research with information on the type and extent of exposure, supported by properly applied statistical analysis. A sound, scientifically based risk assessment is an essential tool for risk managers and legislators responsible for the security and safety of humans.
The use of toxicokinetic models makes it possible to construct exposure indices that may be more closely related to the individual dose than traditional exposure measures. However, the process introduces a wide array of sources of uncertainty, which inevitably makes risk assessment more difficult. In addition, representing population heterogeneity in the assessment of risks and the identification of sensitive sub-populations is of great concern.
The analysis of uncertainty is becoming an integral part of many scientific evaluations. For example, in the risk assessment process, an uncertainty
200 Harald Heinzl and Martina Mittlboeck
analysis has been recognized as an important component of risk characterization by regulatory agencies [29]. Uncertainty is prevalent in the process of risk assessment of chemical compounds at various levels. Uncertainty of the exposure assessment influences dose estimates. Such effects are exaggerated further by uncertainty in dose-response modelling, mainly caused by limited knowledge about the functional dose-response relationship. Finally, uncertainty is propagated to the risk estimation procedure, which provides the basis for risk management decisions.
It is vital to distinguish uncertainty from variability: the latter is a phenomenon in the physical world to be measured, analysed and, where appropriate, explained. By contrast, uncertainty is an aspect of knowledge (Sir David Cox as quoted in Vose [28]). Total uncertainty is the combination of variability and uncertainty. To avoid confusion it was suggested to rename total uncertainty as indeterminability [28], a terminology adopted in our work.
Our example focuses on the risk assessment process of whether 2,3,7,8-tetrachlorodibenzo-p-dioxin (TCDD, "Seveso-dioxin") is a potential human carcinogen. In 1997 TCDD was evaluated as a human carcinogen [19], [22]. The decision substantially relied on empirical studies of highly exposed occupational cohorts. The so-called Boehringer cohort was amongst them, and its data were thoroughly analysed [1], [2], [13], [14], [24]. These statistical analyses were a rather delicate task as, amongst other things, individual lifetime TCDD exposures starting in the 1950s had to be reconstructed from TCDD measurements in the 1980s and 1990s, when such measurements became feasible and affordable. Inevitably, a lot of uncertainty remained due to the lack of longitudinal physiological data, the possibility of measurement errors and workplace misclassification errors, disagreement about the appropriate statistical analysis strategy, limited knowledge about the functional relationship between dose and carcinogenic properties, and the advent of new toxicokinetic insights, just to name a few circumstances.
Now, it is quite common that results of large-scaled statistical or epidemiological analyses will be questioned and disputed. However, the goal of an uncertainty analysis is to tell us how much we can be wrong and still be okay [7]. Therefore we designed a computer simulation study to be able to examine in detail the influences of various sources of uncertainty and their potential implications on the risk estimates from the Boehringer cohort data.
The paper is organized as follows. In Section 2 our adopted view of uncertainty analysis is defined in brief. Section 3 is devoted to dioxin; that is, general characteristics of the compound, features of the Boehringer cohort data set and various approaches to model lifelong human toxicokinetics are described. Section 4 contains technical and non-technical design aspects of the intended computer simulation study. In Section 5 a brief discussion is given.
Design aspects of a computer simulation study 201
3 Dioxin at a glance
3.1 Polychlorinated dibenzodioxins and -furans (PCDD/Fs)
PCDD/Fs are highly lipophilic synthetic chemicals which arise primarily from the production and combustion of chlorinated chemicals and as a byproduct of chlorine bleaching and waste incineration. Environmental contamination by PCDD/Fs has been documented worldwide and is
ubiquitous. In industrialised countries the PCDD/F burden of the population
is assumed to result mainly from intake of contaminated food . Improvements
in the analytical techniques used to measure PCDD/F concentrations have
allowed for the concentration of these compounds to be assessed in reasonable
amounts of human tissue, most notably in adipose tissue, blood serum and
plasma. Repeated determinations in humans allow the investigation of the kinetics of these toxins.
TCDD is believed to be the most potent of the PCDD/Fs. Numerous effects in humans have been observed from exposure to TCDD; amongst them are lung cancer and soft-tissue sarcoma. Observed adverse health effects other than cancer include chloracne, altered sex hormone levels, altered developmental outcomes, altered thyroid function, altered immune function, cardiovascular diseases and neurological disorders, to name just a few; e.g. see also the survey of Grassman et al. [15]. The establishment of a causal relationship between exposure to dioxins and diseases in humans is of outstanding significance in public health and disease prevention. To establish such a causal link is extremely difficult since chronic diseases may occur a long time after the actual exposure has ceased, and this extended lag time (latency period) between exposure and disease onset may obscure a causal link. This implies the need for proper modelling of the individual intoxication process in order to construct appropriate dose metrics (like the area under the concentration-time curve) for quantitative representation of the disease-exposure relationship.
Obviously it is essential to relate the occurrence of diseases to dioxin levels experienced during the exposure before disease onset. Previous levels have to be estimated from present ones. Retrospective determination of dioxin levels in humans and their subsequent use in risk assessment are strongly connected to the toxicokinetics of the dioxins. Chronic environmental exposure, route of exposure, storage in adipose tissue, and mechanism of elimination are important determinants of the level of TCDD in serum years after possibly high occupational exposures. Currently available physiologically based pharmacokinetic (PBPK) models try to meet these requirements at least partly.
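A deliberately minimal illustration of such back-extrapolation is a one-compartment model with first-order elimination (far simpler than the PBPK models discussed here; all numbers are hypothetical):

```python
import math

def back_extrapolate(c_measured, years_elapsed, thalf):
    """Level at the end of exposure, given a later measurement and half-life."""
    k = math.log(2.0) / thalf            # first-order elimination rate
    return c_measured * math.exp(k * years_elapsed)

def auc_first_order(c0, thalf, duration):
    """Area under c0*exp(-k*t) on [0, duration]: a simple dose metric."""
    k = math.log(2.0) / thalf
    return c0 * (1.0 - math.exp(-k * duration)) / k

c0 = back_extrapolate(40.0, 30.0, 7.5)   # hypothetical level measured 30 y later
auc = auc_first_order(c0, 7.5, 30.0)
print(f"back-extrapolated level {c0:.1f}, AUC over 30 years {auc:.0f}")
```

The exponential back-extrapolation makes plain why uncertainty in the assumed half-life propagates so strongly into reconstructed doses.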
Occupationally exposed cohorts are an important source of information due to more pronounced effects (occupational exposures are higher in general) and improved ability to control for confounders (easier and more reliable information retrieval among workers registered in files of companies or insurance agencies). For workers in the chemical industry, where occupational exposure to dioxins has occurred in past production periods, the establishment of causal relationships is also connected to insurance and compensation issues, which requires an individually-based assessment of exposure, disease onset and their relationship. In 1997 the International Agency for Research on Cancer (IARC) reevaluated TCDD as carcinogenic to humans (IARC group 1 classification) on the basis of limited evidence of carcinogenicity to humans and sufficient evidence of carcinogenicity in experimental animals [19], [22]. The most important studies, which gave evidence with respect to human carcinogenicity, were four cohort studies with adequate follow-up times of herbicide producers, one each in the United States and the Netherlands, and two in Germany. The largest and most heavily exposed German cohort is the so-called Boehringer cohort [13], [14], [1], [2]. Main features of the Boehringer cohort are described in the next Subsection.
Overall, the strongest evidence for TCDD carcinogenicity is for all cancers
combined, not for a specific site. Due to the lack of a clearly predominating
site it was considered by the IARC that there is limited evidence in humans
for the carcinogenicity of TCDD [19], [22] . This could be due to still limited
power of those epidemiological studies requiring cautious appreciation, or due
to an unspecific non-standard carcinogenic action of dioxin. The evidence in
humans for the carcinogenicity of all other PCDDs is even more diffuse and
was rated inadequate by the IARC in 1997.
Of course, more biologically complex mechanistic models could be suggested. Phenomena such as TCDD absorption, distribution, binding to liver receptors, enzyme induction, and synthesis of binding proteins could be considered. However, such phenomena occur on a much faster time scale (hours to days) than TCDD elimination (years in humans), which finally justifies the assumption of a quasi-equilibrium between TCDD in the lipid fraction of blood, liver and adipose tissue. Note that this assumption (or variations of it) is made either explicitly or implicitly in all of the lifelong TCDD models for humans mentioned above.
The main part of the project consists of Monte Carlo computer simulations in order to assess uncertainty in the toxicokinetic modelling process up to its implications on risk assessment. The main issues to be studied are, amongst other things:
• Uncertainty in the choice of an appropriate exposure index, lag time and dose-response relationship: this form of uncertainty concerns the subsequent processing of the toxicokinetic results in dose-response models. Even if the former would yield absolutely correct values, uncertainty in the latter would still distort the results of the risk assessment process.
• Selection effects: they could easily have occurred in the Boehringer cohort data as participation in the dioxin measurement program was on a voluntary basis. A specific form of selection bias is the so-called "healthy worker survivor effect" (see e.g. [25]).
To meet these requirements a computer program library with a flexible modular structure has to be designed and implemented (see next Subsection).
Thereby note that an uncertainty analysis can only shed light onto overlooked issues, underrated issues or issues which had not been known at the time of the original analysis itself. It is probable that some time after the completion of the uncertainty analyses new scientific theories may evolve, e.g. a new toxicokinetic TCDD model for humans. The design of the computer program
library should allow a flexible and smooth integration of currently unknown but supposable future developments.
There are numerous adequate software products available in which the computer program library could be implemented, so that the actual decision is mainly a matter of personal preference. In the current case the computer program library is implemented in the form of SAS macros (SAS Institute Inc., Cary, NC, USA).
analysis. The cons of this approach lie in the greater effort to familiarise with the subject and a possibly difficult relationship to the team members of the original analysis. These considerations should be made an integral part of the project's statistical analysis schedule from the beginning.
Here a rather traditional Monte Carlo simulation study is utilised for uncertainty assessment. It mainly consists of the exploration and evaluation of different interesting scenarios. Alternatively, an uncertainty assessment could be performed within a fully Bayesian framework (see e.g. [6], [7]). A detailed comparison of the pros and cons of both approaches is beyond the scope of this paper.
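A bare-bones version of such a scenario-style Monte Carlo uncertainty analysis, with an entirely hypothetical risk function and input distributions, can be sketched as:

```python
import math, random

def risk(half_life, slope, measured_level=40.0, years=30.0):
    """Toy risk: back-extrapolated dose pushed through a toy dose-response."""
    dose = measured_level * 2.0 ** (years / half_life)
    return 1.0 - math.exp(-slope * dose)

rng = random.Random(11)
risks = sorted(
    risk(rng.lognormvariate(2.0, 0.2),    # uncertain half-life (years)
         rng.lognormvariate(-9.0, 0.5))   # uncertain dose-response slope
    for _ in range(5000)
)
lo, med, hi = risks[125], risks[2500], risks[4874]  # ~2.5%, 50%, 97.5%
print(f"median risk {med:.4g}, 95% uncertainty interval ({lo:.4g}, {hi:.4g})")
```

Each draw is one scenario; the spread of the resulting risk estimates, rather than any single point value, is the output of the uncertainty analysis.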
5 Discussion
Risk assessment is a vital activity in modern society because it provides the scientific basis for efforts to identify and control hazards to health and life. However, risk assessment is generally subject to great uncertainty. The scientific knowledge available in this field is far from sufficient. Uncertainty in risk assessment is at present a major but largely unsolved problem to be faced with solid research.
The goal of uncertainty analysis is to provide an evaluation of the limits of our knowledge, or in other words, an uncertainty analysis should tell us how much we can be wrong and still be okay [7].
Uncertainty assessment of large-scaled statistical analyses is obviously a reasonable and essential task in the empirical research process. In our view it is useful to consider the idea of indeterminability, which can be subdivided into statistical variability, structural and technical uncertainty [18], [12], [17].
Analytical approaches to assess structural and technical uncertainty are easily limited by the complexity of the underlying problems. However, elaborate computer simulation studies have evolved as an appropriate tool for the investigation of these types of indeterminability [28].
Obviously, the analysis of uncertainty comprises uncertainty itself. During an uncertainty analysis various decisions about parameter settings (e.g. constant or random, distribution type and distribution parameters, etc.) have to be made. Actually, these settings would require an uncertainty analysis of their own. That is, there would be meta-uncertainty, the uncertainty of the uncertainty analysis. And then there would be meta-meta-uncertainty, the uncertainty of the meta-uncertainty analysis, such that we would build one layer of uncertainty on another and finally miss the goal. The loophole in this catch is the insight that uncertainty analyses are not done on their own, but are part of the scientific research process. Accordingly, the results of an uncertainty analysis should be communicated to the scientists who posed the research question, collected the data and performed the statistical analysis on the one hand, as well as to other experts in the field on the other hand. Together these researchers will be able to assess the validity of the uncertainty analysis and to discuss the consequences of the results [17].
References
[1] Becher H., Flesch-Janys D., Gurn P., Steindorf K. (1998a). Berichte 5/98, Krebsrisikoabschätzung für Dioxine: Risikoabschätzungen für das Krebsrisiko von polychlorierten Dibenzodioxinen und -furanen (PCDD/Fs) auf der Datenbasis epidemiologischer Krebsmortalitätsstudien. Forschungsbericht im Auftrag des Umweltbundesamtes, Erich Schmidt Verlag, Berlin.
[2] Becher H., Steindorf K., Flesch-Janys D. (1998b) . Quantitative cancer
risk assessment for dioxins using an occupational cohort. Environ Health
Perspect 106 (Suppl 2), 663-670.
[3] Beck H., Eckart K., Mathar W., Wittkowski R. (1989). Levels of PCDDs and PCDFs in adipose tissue of occupationally exposed workers. Chemosphere 18, 507-516.
[4] Benner A., Edler L., Mayer K., Zober A. (1993). Untersuchungspro-
gramm "Dioxin" der Berufsgenossenschaft der chemischen Industrie .
Ergebnisbericht - Teil II. Arbeitsmedizin, Sozialmedizin, Umweltmedi-
zin 29 , 11-16.
[5] BG Chemie . (1990). Untersuchungsprogramm 'Dioxin', Ergebnis-
bericht - Teil 1. Berufsgenossenschaft der Chemischen Industrie. BG
Chemie (Ed .), Heidelberg, ISBN: 3-88338-302-9.
[6] Bois F .Y. (1999). Analysis of PBPK models for risk characterization.
Annals of the New York Academy of Sciences 895, 317- 337.
[7] Bois F.Y., Diack C. (2004). Uncertainty analysis. In: Quantitative Meth-
ods for Cancer and Human Health Risk Assessment, Edler L., Kitsos
C.P. (Eds .), Wiley, Chichester, to appear.
[8] Carrier G., Brunet R.C., Brodeur J. (1995a). Modeling of the toxicokinetics of polychlorinated dibenzo-p-dioxins and dibenzofurans in mammalians, including humans. I. Nonlinear distribution of PCDD/PCDF body burden between liver and adipose tissues. Toxicology and Applied Pharmacology 131, 253-266.
[9] Carrier G., Brunet R.C., Brodeur J. (1995b). Modeling of the toxicokinetics of polychlorinated dibenzo-p-dioxins and dibenzofurans in mammalians, including humans. II. Kinetics of absorption and disposition of PCDDs/PCDFs. Toxicology and Applied Pharmacology 131, 267-276.
[10] Caudill S.P., Pirkle J.L., Michalek J.E. (1992). Effects of measurement
error on estimating biological half-life. Journal of exposure analysis and
environmental epidemiology 2, 463-476.
[11] Craig T.O., Grzonka R.B. (1991). A time-dependent 2,3,7,8-tetrachlorodibenzo-p-dioxin body-burden model. Arch. Environ. Contam. Toxicol. 21, 438-446.
[12] Edler L. (1999). Uncertainty in biomonitoring and kinetic modeling. An-
nals of the New York Academy of Sciences 895, 80-100.
[13] Flesch-Janys D., Berger J., Gurn P., Manz A., Nagel S., Waltsgott
H., Dwyer J .H. (1995) . Exposure to polychlorinated dioxins and fu-
rans (PCDD/F) and mortality in a cohort of workers from a herbicide-
producing plant in Hamburg, Federal Republic of Germany. American
Journal of Epidemiology 142, 1165-1175. Published erratum in Amer-
ican Journal of Epidemiology (1996) 144, 716.
[14] Flesch-Janys D., Steindorf K., Gurn P., Becher H. (1998). Estimation of
the cumulated exposure to polychlorinated dibenzo-p-dioxins/furans and
standardized mortality ratio analysis of cancer mortality by dose in an
occupationally exposed cohort. Environ Health Perspect 106 (Suppl 2),
655 -662.
[15] Grassmann J .A., Masten S.A., Walker N.J ., Lucier G.W. (1998). Ani-
mal models of human response to dioxins . Environ Health Perspect 106
(Suppl 2), 761 -775.
[16] Heinzl H., Edler L. (2002). Assessing uncertainty in a toxicokinetic model for human lifetime exposure to TCDD. Organohalogen Compounds 59, 355-358.
[17] Heinzl H., Edler L. (2003). Evaluating and assessing uncertainty of large-scaled statistical analyses exemplified at the Boehringer TCDD cohort. Proceedings of the Second Workshop on Research Methodology, Ader H.J., Mellenbergh G.J. (Eds.), VU University, Amsterdam, ISBN 90-5669-071-X, 87-94.
[18] Hodges J .S. (1987). Uncertainty, policy analysis and statistics. Statisti-
cal Science 2, 259 - 291.
[19] IARC. (1997). IARC Monographs on the Evaluation of Carcinogenic
Risks to Humans . Vol. 69 : Polychlorinated Dibenzo-para-dioxins and
Polychlorinated Dibenzofurans. International Agency for Research on
Cancer, Lyon.
[20] Kreuzer P.E ., Csanady Gy.A., Baur C., Kessler W ., Papke 0 ., Greim
H., Filser J.G. (1997). 2,3,7,8-Tetrachlorodibenzo-p-dioxin (TCDD) and
congeners in infants. A toxicokinetic model of human lifetime body bur-
den by TCDD with special emphasis on its uptake by nutrition. Arch .
Toxicol. 71,383 -400.
[21] Manz A., Berger J. , Dwyer J.H ., Flesch-Janys D., Nagel S., Waltsgott H.
(1991). Cancer mortality among workers in chemical plant contaminated
with dioxin. Lancet 338, 959 -964.
[22] McGregor D.B., Partensky C., Wilbourn J ., Rice J.M . (1998). An
IARC Evaluation of Polychlorinated Dibenzo-p-dioxins and Polychlori-
nated Dibenzofurans as Risk Factors in Human Carcinogenesis . Environ
Health Perspect 106 (Suppl 2), 755 -760.
[23] Michalek J.E., Pirkle J.L., Caudill S.P., Tripathi R.C., Patterson D.G .
Jr., Needham L.L. (1996). Pharmacokinetics of TCDD in veterans of
operation ranch hand: lO-year follow-up . Journal of toxicology and en-
vironmental health 47,209-220.
Design aspects of a computer simulation study 211
[24] Portier C.J., Edler L., Jung D., Needham L., Masten S., Parham F.,
Lucier G. (1999). Half-lives and body burdens for dioxin and dioxin-like
compounds in humans estimated from an occupational cohort in Ger-
many. Organohalogen Compounds 42, 129-137.
[25] Steenland K., Deddens J., Salvan A., Stayner L. (1996) . Negative bias in
exposure-response trends in occupational studies: modeling the healthy
worker survivor effect. American Journal of Epidemiology 143, 202-
210.
[26] Thomaseth K., Salvan A. (1998). Estimation of occupational exposure
to 2,3, 'l,8-tetrachlorodibenzo-p-dioxin using a minimal physiologic toxi-
cokinetic model. Environ Health Perspect 106 (SuppI2), 743-753. Pub-
lished erratum in Environ Health Perspect (1998) 106 (Suppl 4), CP2.
[27] Van der Molen G.W. , Kooijman S.A.L .M., Slob W . (1996). A generic
toxicokinetic model for persistent lipophilic compounds in humans: an
application to TCDD. Fundamental and applied toxicology 31 , 83-94.
[28] Vose D. (2000). Risk analysis: a quantitative guide. 2nd ed., Wiley,
Chichester.
[29] WHO. (1995). Application of risk analysis to food standard issues. Re-
port of the Joint FAO/WHO Expert Consultation. World Health Orga-
nization, Geneva.
SIMULTANEOUS INFERENCE
IN RISK ASSESSMENT;
A BAYESIAN PERSPECTIVE
Leonhard Held
1 Introduction
Statistical risk assessment deals with the probabilistic quantification of potential damaging effects of an environmental hazard. Of particular importance is the formulation and estimation of dose-response relationships based on data from controlled toxicological studies. This paper takes a Bayesian view of the statistical problem of estimating the dose-response relationship and derived quantities. Such an approach has at least two useful features: First, the posterior distribution of any function of the original parameters can be derived exactly using Monte Carlo simulation; secondly, pointwise and simultaneous credible bands and bounds can be computed exactly up to Monte Carlo error.
From a frequentist perspective, the calculation of simultaneous confidence bands has been developed in Pan, Piegorsch and West [8], and has been applied to risk assessment estimation in Al-Saidy et al. [1] and Piegorsch et al. [9]. Al-Saidy et al. [1] consider quantal response data with a binomial likelihood, while Piegorsch et al. [9] apply the methods to continuous measurements based on a quadratic regression model. In this paper we re-analyze the data from Piegorsch et al. [9], but use a Bayesian approach based on Monte Carlo sampling. In particular, we develop methods to calculate simultaneous credible bounds for the benchmark dose at various benchmark risks.
The paper is organized as follows. In Section 2 we review an algorithm to calculate (two-sided) simultaneous credible bands based on Monte Carlo samples from a posterior distribution and outline a straightforward modification to obtain one-sided simultaneous credible bounds. In Section 3 we apply these methods to a problem from low-dose risk assessment and compare our results with those obtained by Piegorsch et al. [9] using frequentist methods. We close with some discussion in Section 4.
$\bigl[\theta_i^{[n+1-j^*]},\; \theta_i^{[j^*]}\bigr], \qquad i = 1, \ldots, p \qquad (1)$
contains at least k of the n values θ^(1), ..., θ^(n). Besag et al. point out that j* is equal to the kth order statistic of the set

$S = \bigl\{ \max\bigl\{\, n + 1 - \min_i r_i^{(j)},\; \max_i r_i^{(j)} \bigr\},\; j = 1, \ldots, n \bigr\}. \qquad (2)$

By construction, the credible region (1) will then contain (at least) 100k/n% of the empirical distribution.
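The construction in (1) and (2) is easy to implement. The following sketch (in Python with NumPy; the function and variable names are ours, not from the paper) computes the band from an n × p matrix of Monte Carlo samples:

```python
import numpy as np

def simultaneous_credible_band(theta, level=0.95):
    """Simultaneous credible band from Monte Carlo output,
    following Besag et al. (1995) as reviewed in the text.

    theta : (n, p) array, n posterior samples of a p-vector.
    Returns (lower, upper), each of length p.
    """
    n, p = theta.shape
    k = int(np.ceil(level * n))                 # target: cover >= k of n samples
    # r_i^(j): rank (1..n) of sample j within component i
    ranks = theta.argsort(axis=0).argsort(axis=0) + 1
    # the set (2): one value per sample j
    S = np.maximum(n + 1 - ranks.min(axis=1), ranks.max(axis=1))
    j_star = np.sort(S)[k - 1]                  # k-th order statistic of S
    theta_sorted = np.sort(theta, axis=0)
    lower = theta_sorted[(n + 1 - j_star) - 1]  # theta_i^[n+1-j*]
    upper = theta_sorted[j_star - 1]            # theta_i^[j*]
    return lower, upper
```

By construction the returned band contains at least k of the n sample vectors entirely.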
Figure 1 illustrates the construction of simultaneous credible bands for simulated data with n = 25 and p = 10. Each line corresponds to one sample θ^(j), while each column represents a parameter θ_i. The yellow band is a simultaneous credible band of empirical coverage 84 and 72%. The set (2) is in this example

S = {16, 17, 17, 18, 19, 19, 20, 20, 20, 20, 22, 22, 22, 22, 23, 23, 23, 23, 24, 24, 24, 25, 25, 25, 25}.   (3)

It is straightforward but tedious to re-calculate (3) based on Figure 1 and formula (2).
[Figure 1: the 25 simulated samples with the 84% (top panel) and 72% (bottom panel) simultaneous credible bands; x-axis: Parameter (1-10).]

Note that the simultaneous credible band is a product of symmetric univariate credible intervals of the same level (2j*/n − 1) · 100%. Besag et al. [2] also note that the method is slightly conservative in the sense that, for n fixed, the credible region (1) will typically contain slightly more than 100k/n% of the empirical distribution because of ties in the set (2); this is evident from our small example, where the set (3) has many ties. This problem increases with p, because the number of ties will then typically increase. However, the method is still consistent as n → ∞. Empirical evidence shows that these credible bands tend to get rather unstable for credibility levels close to unity. In other words, the Monte Carlo error will be quite large in these circumstances, but this problem can easily be attacked by taking a larger sample. However, the method requires the storage of all samples from all components of θ, which can be prohibitive if p and n are large.
Furthermore, the sorting and ranking of the samples from each component can be computationally intensive if n is extremely large. However, in our experience, for n = 10,000 samples the method gives stable estimates at the usual credibility levels (95 and 99%) in just a few seconds.
Also note that ranking and sorting has to be done only once, even if simultaneous credible bands are required at more than one level. Only the set (2), the ordered samples θ_i^[j] and the ranks r_i^(j) need to be available to calculate simultaneous credible bands at additional levels. The computational effort to calculate these additional simultaneous credible bands is negligible compared to the initial ranking and sorting.
$\bigl(-\infty,\; \theta_i^{[j^*]}\bigr], \qquad i = 1, \ldots, p \qquad (4)$
contains at least k of the n values θ^(1), ..., θ^(n). This procedure thus defines a one-sided upper credible bound of credibility level 100k/n%. The only question remaining is whether there is also an analogous formula to (2). Indeed, j* now simply equals the kth order statistic of the set

$\bigl\{ \max_i r_i^{(j)},\; j = 1, \ldots, n \bigr\}. \qquad (5)$
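A direct implementation of the one-sided bound (4) with j* from (5) might look as follows (again a sketch in Python/NumPy; the names are ours):

```python
import numpy as np

def simultaneous_upper_bound(theta, level=0.95):
    """One-sided simultaneous upper credible bound, eqs. (4)-(5):
    j* is the k-th order statistic of { max_i r_i^(j) }."""
    n, p = theta.shape
    k = int(np.ceil(level * n))
    ranks = theta.argsort(axis=0).argsort(axis=0) + 1   # r_i^(j), values 1..n
    S = ranks.max(axis=1)            # max over components, one value per sample
    j_star = np.sort(S)[k - 1]       # k-th order statistic
    return np.sort(theta, axis=0)[j_star - 1]           # theta_i^[j*]
```

Sample j lies entirely below the bound exactly when max_i r_i^(j) ≤ j*, so at least k samples are covered.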
estimate of the variance σ². Furthermore, p(β | κ, y) is normal with mean equal to the least squares estimate β̂ = (X'X)⁻¹X'y and covariance matrix κ⁻¹(X'X)⁻¹. We can thus easily generate independent samples from this posterior distribution by first sampling κ^(i) from p(κ | y) and then sampling β^(i) from p(β | κ^(i), y).
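The two-stage sampling scheme just described can be sketched as follows. Since the model details preceding this passage are not reproduced in this excerpt, the sketch assumes the standard noninformative prior for the normal linear model, under which κ | y is Gamma with shape (n − p)/2 and rate RSS/2; the function name is ours:

```python
import numpy as np

def sample_posterior(X, y, size=10000, seed=0):
    """Sketch: independent draws from p(kappa | y) and p(beta | kappa, y)
    for y ~ N(X beta, kappa^{-1} I) under a flat noninformative prior
    (an assumption; the paper's prior specification is not shown here)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y                   # least squares estimate
    rss = float(((y - X @ beta_hat) ** 2).sum())
    # kappa | y ~ Gamma(shape = (n-p)/2, rate = RSS/2); NumPy uses scale = 1/rate
    kappa = rng.gamma(shape=(n - p) / 2, scale=2.0 / rss, size=size)
    # beta | kappa, y ~ N(beta_hat, kappa^{-1} (X'X)^{-1})
    L = np.linalg.cholesky(XtX_inv)
    z = rng.standard_normal((size, p))
    beta = beta_hat + (z @ L.T) / np.sqrt(kappa)[:, None]
    return beta, kappa
```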
A Bayesian approach using Monte Carlo sampling has the advantage that samples from any function of the parameters can be obtained without any need for approximations, such as, for example, the delta method. In the current context, R_A(x) as defined in (8) is a simple function of the parameters β₁, β₂ and σ². Hence we are able to compute the posterior distribution of R_A(x) for a range of values of x, say x₁ < x₂ < ... < x_M, and then compute simultaneous credible bounds for the parameters R_A(x₁), R_A(x₂), ..., R_A(x_M). For illustration, Figure 2 displays the first n = 100 samples from the posterior distribution of R_A(x) for δ = 3.
[Figure 2: the first n = 100 posterior samples of R_A(x), plotted against Dose (mg/kg), 0-150.]
Figure 3 now displays the posterior median of R_A(x), as well as the 95% simultaneous upper credible bound for R_A(x), calculated using (4) and (5). These have been obtained using n = 10,000 samples and 181 equally spaced values of x ∈ {0, 1, ..., 180}. For comparison, we also display the frequentist estimate of R_A(x) as well as the corresponding 95% simultaneous upper confidence bound described in Piegorsch et al. [9].
Note that the Bayesian point estimates are slightly above the frequentist ones. A more pronounced difference can be seen for the simultaneous upper bound, which is again larger in the Bayesian approach.
Piegorsch et al. [9] go on to construct lower simultaneous credible bounds
[Figure 3 shows four curves over dose 0-150 mg/kg: the Bayesian estimate (posterior median), the frequentist estimate, the Bayesian simultaneous credible bound, and the frequentist simultaneous confidence bound.]
Figure 3: Estimated R_A function and simultaneous upper 95% credible bound, δ = 3.
4 Discussion
The Bayesian approach to simultaneous inference in risk assessment has much to offer. It does not rely on approximations, is completely general and easy to implement. For example, it will be straightforward to calculate a Bayesian simultaneous credible bound in the application considered in Al-Saidy et al. [1], where the response variable is binomial.
We now close with two final comments. In the current application it
[Figure 5 shows five curves against BMR: the Bayesian estimate (posterior median), the frequentist estimate, the complete case simultaneous credible bound, the median imputed simultaneous credible bound, and the inverted frequentist simultaneous confidence bound.]
Figure 5: Estimated benchmark dose function and simultaneous lower 95% credible bound, δ = 3.
to visualize these credible regions in higher dimensions. In the current application there does not seem to be an obvious reference point for R_A(x), say, so the method by Besag et al. [2] is the obvious choice for simultaneous Bayesian inference in risk assessment.
References
[1] Al-Saidy O.M., Piegorsch W.W., West R.W., Nitcheva D.K. (2004). Confidence bands for low-dose risk estimation with quantal response data. Biometrics, to appear. Available at http://dostat.stat.sc.edu/bands.
[2] Besag J.E., Green P.J., Higdon D.M., Mengersen K.L. (1995). Bayesian computation and stochastic systems (with discussion). Statist. Sci. 10, 3-66.
[3] Box G.E.P., Tiao G.C. (1973). Bayesian inference in statistical analysis. Reading, MA: Addison-Wesley. Reprinted by Wiley in 1992 in the Wiley Classics Library Edition.
[4] Chapman G.A., Denton D.L., Lazorchak J.M. (1995). Short-term methods for estimating the chronic toxicity of effluents and receiving waters to West coast marine and estuarine organisms. Technical Report EPA/600/R-95-136. U.S. Environmental Protection Agency, Cincinnati, Ohio.
[5] Gelfand A.E., Smith A.F.M., Lee T.M. (1992). Bayesian analysis of constrained parameter and truncated data problems using Gibbs sampling. Journal of the American Statistical Association 87, 523-532.
[6] Held L. (2004). Simultaneous posterior probability statements from Monte Carlo output. Journal of Computational and Graphical Statistics 13, 20-35.
[7] Holmes C.C., Heard N.A. (2003). Generalized monotonic regression using random change points. Statistics in Medicine 22, 623-638.
[8] Pan W., Piegorsch W.W., West R.W. (2003). Exact one-sided simultaneous confidence bands via Uusipaikka's method. Annals of the Institute of Statistical Mathematics 55, 243-250.
[9] Piegorsch W.W., West R.W., Pan W., Kodell R.L. (2004). Low-dose risk estimation via simultaneous inferences. Applied Statistics, to appear. Available at http://dostat.stat.sc.edu/bands.
INTERACTIVE BIPLOTS FOR VISUAL MODELLING

Heike Hofmann

Abstract: The link between statistical models and visualisation techniques is not very well explored, even though strong connections do exist. This paper describes how biplots - interactive biplots in particular - can be used for visual modelling. By slightly adjusting the way biplots are constructed they provide the means to display linear models. The goodness of fit of a particular model becomes instantly visible. This makes them a useful addition to the standard set of visualization tools for linear models.
Biplots show predicted values and residuals. This helps, firstly, to assess a model far beyond the mere statistics and to detect structural defects in it. Secondly, biplots provide a link between the modelling statistics and the original data. Additional interactive methods such as hotselection also allow the analysis of outlier effects and behaviour.
1 Introduction
Biplots are a very promising tool for visualising high-dimensional data, which include both continuous and categorical variables. The strategy of biplots is to choose a linear subspace (usually a 2-dimensional space - in order to be able to plot the result using standard techniques), which is in some respect optimal, and project the high-dimensional data onto this space. One criterion for optimality is, for instance, to minimise the discrepancy between the high- and the two-dimensional representations of the data. Biplots show only one projection out of infinitely many. They therefore cannot be exact representations of the data but only approximations.
What gave the biplots their prefix "Bi-" (βι is the Greek syllable for "two") is the simultaneous representation of both data points and original axes within the projection space.
The biplot axis of a continuous variable is represented by a straight line (in case of linear models, to which we will restrict ourselves) with unit points marked by small perpendicular lines. One unit of a variable X_i corresponds to one times the standard deviation of X_i. If the data matrix X is centered and standardized, these units are therefore directly comparable for all i, and the length of a unit vector gives a measure for how well a variable is represented in the chosen projection plane.
Instead of continuous axes, so-called category level points (CLPs) are used to display a categorical variable X. Using a binary dummy variable for each
224 Heike Hofmann
Figure 1: Biplot and corresponding mosaicplot of the Titanic Data [3]. Each dot on the left side corresponds to a cell on the right hand side. Highlighted are survivors.
Biplot representation
The graphical representation of a biplot is dot based. This means for categor-
ical variables, that each combination is shown as one single dot. Of course,
this does not allow conclusions about this combination's size any more. One
solution to this problem is the use of density estimates. This also covers
the problem of over-plotting, which, especially in large data sets, is always
present in dot based representations.
The graphical representation of a biplot has two components:
• Data points are projected onto the plane spanned by the first two principal components and visualised as dots. The center of the plot is given by the projection of the p-dimensional mean $(\bar{X}_1, \ldots, \bar{X}_p)$.
Interactive biplots for visual modelling 225
• The unit vectors e_i corresponding to the (dummy) variables are also projected onto this plane.
The graphical representation differs for continuous and categorical variables: For continuous variables, an arrow is drawn from the plot center to the projection of the variable, which marks the direction of the original variables. These directions are called the biplot axes. The arrowheads mark the unit points on the biplot axes.
For a categorical variable its projection on the biplot is marked by a square, the CLP.
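The construction just described (standardize, then project data points and variable unit vectors onto the plane of the first two principal components) can be sketched in a few lines; the function name is ours, for illustration:

```python
import numpy as np

def biplot_coords(X):
    """Sketch of the biplot construction described above."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # center and standardize
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    B = Vt[:2].T        # (p, 2): first two principal directions
    points = Z @ B      # data points projected into the plane
    axes = B            # row i = projection of unit vector e_i (a biplot axis)
    return points, axes
```

The norm of row i of `axes` is at most 1 and measures how well variable i is represented in the projection plane, matching the remark above about unit-vector lengths.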
"Reading" a biplot
In a biplot the most important source of information is the distance between objects. The distance gives a measure of how similar or how closely related objects are.
The distance of a CLP to the plot's center (in the middle of the plot) or the length of a unit on a biplot axis reflects how good the projection of the underlying variable is, i.e. with increasing distance the goodness of fit - and with it the "importance" - of this variable increases.
The meaning of objects lying close to each other varies according to their type:
• point - point: close points reflect high-dimensional "neighbours".
• axis - axis: axes with a small angle between them indicate a high positive correlation between the variables; angles near 180° indicate a high negative correlation.
• CLP - CLP: neighbouring CLPs are a hint that the corresponding variables are associated, i.e. that these categories frequently occur together in the data.
• points - axis/CLPs: the data values for a point are found by orthogonal projection onto an axis. The axes closest to a point therefore represent the strongest influence for a data point. Accordingly, points are assigned to those categories with the closest lying CLPs. In doing so, one has to remember that a biplot of more than just two variables cannot be anything but an approximation.
2 Interactive methods
Based on the construction and interpretation of a biplot, interactive methods have to be provided in the display to facilitate interpretability and ease of use.
Figure 3 shows the prediction regions corresponding to the variable 'Class'. All categories corresponding to a single variable divide the biplot area into a set of mutually exclusive prediction regions. The prediction region of a CLP is defined as the space closest to the CLP, i.e. no other CLP is closer. From the prediction regions in Figure 3 it becomes obvious that the representation from the MCA does not fit well: almost all dots are predicted to be second class passengers - there are no combinations predicted as third class passengers.
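The prediction regions are simply the Voronoi cells of the CLPs; the classification rule can be sketched as follows (names are ours, for illustration):

```python
import numpy as np

def predict_by_clp(points, clp_coords, clp_labels):
    """Sketch: assign each projected point the category of its nearest
    CLP; the prediction regions are the Voronoi cells of the CLPs.
    points: (n, 2); clp_coords: (c, 2); returns a list of n labels."""
    # squared Euclidean distance of every point to every CLP
    d2 = ((points[:, None, :] - clp_coords[None, :, :]) ** 2).sum(axis=2)
    return [clp_labels[j] for j in d2.argmin(axis=1)]
```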
large data sets gives a tool to drill down the data set into smaller parts, which are - hopefully - more homogeneous and therefore easier to analyze.
Another advantage of logical zooming is its possibility of excluding outliers. By focussing on the "main" part, i.e. not regarding outliers, their influence on the model becomes apparent. This is particularly useful for models with a poor behaviour with respect to outliers. If in fact the effect outliers have on a model is of foremost interest, we will want to use hotselection [8] instead of logical zooming. The boundary between these two tools is fluid - but essentially, the concept of hotselection is less permanent than logical zooming: changes are more readily made and taken back again. In the setting of modelling, hotselection is used to compute a new model based on highlighted values only.
Figure 5 shows a biplot of a correspondence analysis taking all of the descriptive variables into account. Several clearly distinguished groups appear in the plane spanned by the first and second principal component axis. Highlighting shows poisonous mushrooms. These clusters are marked by numbers in the graphic. Using a mosaicplot of all the descriptive variables, we want to find descriptions (as short as possible) for these groups. The following table gives a short summary of our results:
[Figure 5: biplot of the first and second principal components of the mushroom data; poisonous mushrooms are highlighted and the clusters are marked by numbers.]
explanatory variables exist. Cluster 10 e.g. consists of mushrooms with stalk color o. All of the descriptions are only valid for the zoomed data (i.e. only in combination with all of the descriptions for cluster 8 above). Cluster 13, consisting of 2512 mushrooms, is the only one which needs further inspection - using further logical zooming. After two more steps all poisonous mushrooms can be separated from the edible ones.
A commonly used constraint (null-sum-coding) on the estimates for these parameters is that they sum to zero, i.e. $\sum_i \hat\beta_i = 0$, or one of the categories is used as basis and the parameters of the resulting model show the influence a category has with respect to the basis. The constraint (effect-coding) on the parameters then is that the basis parameter is fixed at zero.
Figure 7: Axis of predicted values together with the five biplot axes for variables A, B, C, D and E.
[Figure 8: dot plots of residuals against the axis of predicted values, with the six category labels Duluth, Uni-Farm, Crookston, Waseca, Grand-Rapids, Morris.]
Figure 7 shows the vector of the predicted values Ŷ together with biplot axes for five variables A, B, C, D and E. We can re-establish the relation of projected data points and their original values by orthogonal projections of the points onto the biplot axes. In the case of an analysis of variance this means that we get very informative "labels" for the predicted values. Figure 8 shows an analysis of variance of the Barley Data [4].
We see not only parallel dot plots of the barley yields, but also a natural ordering of the six categories, even (roughly) their distance or closeness. The last point has a caveat: the lengths of the units are not directly comparable, i.e. an axis with large units is not by default a more important factor, since the "importance" of an axis also depends on the variability of β̂_i. The standard test of judging whether the ith parameter is significantly different from 0, i.e. β_i = 0 vs β_i ≠ 0, uses the estimate's variability. The test statistic β̂_i / SE_{β̂_i}, where $SE^2_{\hat\beta_i} = \hat\sigma^2\, e_i'(X'X)^{-1} e_i$, is approximately t-distributed with n − p − 1 degrees of freedom.
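The test statistic can be computed directly from the design matrix. A sketch (names are ours; we use the residual degrees of freedom n − q, where q counts the columns of X including the intercept):

```python
import numpy as np

def t_statistics(X, y):
    """Sketch: beta_hat_i / SE(beta_hat_i) with
    SE^2 = sigma2_hat * [(X'X)^{-1}]_{ii}, as in the text."""
    n, q = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / (n - q)          # residual variance estimate
    se = np.sqrt(sigma2_hat * np.diag(XtX_inv))
    return beta_hat / se
```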
A second choice of units on the biplot axes therefore is the term β̂_i / SE_{β̂_i}. This re-scales the biplot axes in a way that their lengths are proportional to the values of the t-statistic. More important variables in the regression model now have larger parameters, whereas biplot axes with insignificant parameters remain short. Graphically we can support this by highlighting an interval on the axis of predicted values which corresponds to the 5% level of a t-test. See Figure 9: in this example the SE_{β̂_i} are of the same order of magnitude, and the distances do not change compared to Figure 8.
Figure 9: Comparison of effects: on the top the graphical test via the in-
terval of non-significant values is shown, on the bottom is a table of the
corresponding pairwise tests.
When setting the origin of this interval, the exact coding which we used for a categorical variable is important: if we use effect-coding, the origin of the 5% interval will be placed on the predicted value of the basis. When using a null-sum-coding, the origin of the interval is set to the expected value of Y.
Figure 9 shows the (re-scaled) biplot axes of the example above. The
category Morris is set as basis value. Around this value the interval of non-
significant values is shown as a gray-shaded rectangle. The categories Uni-
Farm and Crookston fall into this rectangle, indicating that these categories
have parameters, which are not significantly different from the parameter for
Morris.
Since the differences between the parameters are not affected by the choice of the coding, we may use these differences for more than one comparison (and with that, multiple tests) in each plot. From a statistical point of view this multiple test situation suggests the use of Bonferroni confidence intervals for each parameter rather than the use of the above significance intervals. The difference between the above intervals and Bonferroni's intervals is essentially a factor, calculated from the level of significance and the number of comparisons made.
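This factor is the ratio of the corresponding t quantiles, with α split over the m comparisons. A sketch (assuming SciPy is available; the function name is ours):

```python
from scipy.stats import t

def bonferroni_widening(alpha, m, df):
    """Sketch: ratio of the Bonferroni critical value (alpha split over
    m comparisons) to the single-test critical value - the 'factor'
    between the two kinds of intervals mentioned above."""
    return t.ppf(1 - alpha / (2 * m), df) / t.ppf(1 - alpha / 2, df)
```

For m = 1 the factor is 1, and it grows with the number of comparisons.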
The price we have to pay for the re-scaling of the biplot axes with the
parameter's variability is that we lose the quantitative connection between
data points and biplot axes.
In order to avoid re-scaling we may try another approach to visualise the tests between the effects: the software JMP suggests the use of circles of different size around the parameter values. The size of each circle is given by the standard deviation of the parameter times t_{α/2}. Whether two parameters are significantly different is decided by the angle: if the angle at the intersection of their circles is less than 90° the two values are not significantly different, otherwise they are (see Figure 10). For a more detailed explanation of the underlying statistics see JMP's "Statistics and Graphics Guide", pp. 94-95.
The disadvantage of this approach is that angles have to be compared. This makes the decision between significant and non-significant differences between the parameters rather difficult visually.
[Figure 10: three panels - β̂₁, β̂₂ significantly different; β̂₁, β̂₂ borderline significantly different; β̂₁, β̂₂ not significantly different.]
The units on the projection axes are given as |Y − Ŷ| and |Ŷ|, where $|Y - \hat Y|^2 = \sum_i (y_i - \hat y_i)^2 = RSS$ and $|\hat Y|^2 = TSS - RSS$. RSS is the residual sum of squares and TSS is the total sum of squares.
The coordinate of Y in direction of Y − Ŷ shows the square root of the residual sum of squares, √RSS; the coordinate in direction of Ŷ gives the square root of the difference between the total sum of squares, TSS, and the residual sum of squares,
i.e. the smaller the angle α between Y and Ŷ is, the better is the fit of the regression model. Of course, the angle depends on the aspect ratio of the display. By fixing the aspect ratio to 1, different plots (and thereby different models) can be compared: a plot with large width and little height indicates a good fit (the residuals are small with respect to the predicted values), while a quadratic plot or, even worse, a tall and thin plot indicates a very bad fit, see Figure 12.
[Figure 12 contains regression output of the form:]

Variable    Coefficient    s.e. of Coeff    t-ratio    prob
Constant    0.011854       0.0106           1.12       0.2660
X1          1.50080        0.0149           101        ≤ 0.0001
X2          -0.496514      0.0129           -38.6      ≤ 0.0001

Figure 12: Example of regressions with good fit (above) and bad fit (below). The goodness of fit is emphasized by the shape of the display. The angle between Ŷ and Y also corresponds to R².
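The correspondence between the angle and R² can be checked numerically: with centered vectors, cos α = |Ŷ − ȳ| / |Y − ȳ|, and by orthogonality of the residuals, cos²α = (TSS − RSS)/TSS = R². A quick sketch (simulated data, names ours):

```python
import numpy as np

# Numeric check: for a least squares fit with intercept,
# the squared cosine of the angle between the centered Y and
# the centered Y-hat equals R^2.
rng = np.random.default_rng(0)
x = rng.normal(size=80)
y = 1.0 + 2.0 * x + rng.normal(size=80)
X = np.column_stack([np.ones(80), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
yhat = X @ beta
yc, yhc = y - y.mean(), yhat - yhat.mean()
cos_a = yc @ yhc / (np.linalg.norm(yc) * np.linalg.norm(yhc))
r2 = 1 - ((y - yhat) ** 2).sum() / (yc ** 2).sum()
```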
4 Conclusions
Biplots can be used to visualize univariate linear models. They allow, at the same time, an assessment of the model's goodness of fit. Additional interactive methods such as interactive querying provide the analytic goodness of fit statistics, too. This allows a tight link of visual display and the corresponding model. Another interactive method, hotselection, gives a way of examining the influence of single points or groups of points on the model, which can be used as a very efficient way of outlier spotting.
In the paper only one-dimensional models are shown - this is just for illustration purposes. The approach itself is, of course, not limited to one dimension.
If using scatterplots for a biplot representation, biplots are restricted to a 2d display - with graphics that allow display of higher dimensionality, such as a tour ([1], [2]) for example, more precise displays are possible. In a tour the described approach would mean to fix the z-axis artificially to Y − Ŷ (equivalent to fixing Ŷ to be fully included while touring the data) and to tour through the X space. This also allows us to deal with higher-dimensional Y.
References
[1] Asimov D. (1985). The grand tour: a tool for viewing multidimensional data. SIAM J. Sci. Stat. Comput. 6, 128-143.
[2] Buja A., Swayne D., Cook D. (1996). Interactive high-dimensional data visualization. Journal of Computational and Graphical Statistics 5, 78-99.
[3] Dawson R.J.M. (1995). The "unusual episode" data revisited. Journal of Statistics Education 3.
[4] Fisher R. (1935). The design of experiments. Edinburgh UK: Oliver and Boyd.
[5] Gabriel K. (1971). The biplot graphic display of matrices with application to principal component analysis. Biometrika 58, 453-467.
[6] Gower J.C., Hand D.J. (1996). Biplots. London: Chapman and Hall Ltd.
[7] Hofmann H. (1998). Interactive biplots. In New Techniques & Technologies for Statistics (NTTS) 98, Sorrento, Italy: Eurostat, 127-136.
[8] Velleman P. (1995). Data Desk 5.0, Data Description. Ithaca, New York.
1 Introduction
S is a very high level language and an environment for data analysis and graphics which has been developed at Bell Laboratories for about 30 years. In 1998, the Association for Computing Machinery (ACM) presented its Software System Award to John M. Chambers, the principal designer of S, for "the S system, which has forever altered the way people analyze, visualize, and manipulate data ...". The evolution of the S language is characterized by four books by John Chambers and coauthors, which are also the primary references for S. The "Brown Book" [1] is of historical interest only. The "Blue Book" [2] describes the "New S" language. The "White Book" [5] documents a concerted effort to add functionality to facilitate statistical modeling in S, introducing data structures such as factors, time series, and data frames, a formula notation for compactly expressing linear and generalized linear models, and a simple system for object-oriented programming in S allowing users to define their own classes and methods. Together with the Blue Book, it describes S version 3 ("S3"). [4], the "Green Book", introduces version 4 of S ("S4"), a major revision of S designed by John Chambers to improve its usefulness at every stage of the programming process, introducing in particular a new "formal" OOP system supporting multiple dispatch and multiple inheritance, and a unified input/output model via "connections". Today, a commercial implementation of the S language called "S-PLUS" is available from Insightful Corporation (http://www.insightful.com).
What is now the R project started in 1992 in Auckland, New Zealand, as an experiment by Ross Ihaka and Robert Gentleman "in trying to use the methods of LISP implementors to build a small testbed which could be used to trial some ideas on how a statistical environment might be built" [8]. The decision to use an S-like syntax for this statistical environment, being motivated by both familiarity with S and the observation that the parse trees generated
236 Kurt Hornik
by S and LISP are essentially identical, resulted in a system "not unlike S".
In fact, basing the R evaluation model on Scheme (a member of the LISP
family) has given R lexical scoping as the most prominent difference between
R and other implementations of the S language [7]. Since mid-1997 there has
been a core group (the "R Core Team") who can modify the R source code
CVS archive. The group currently consists of Doug Bates, John Chambers,
Peter Dalgaard, Robert Gentleman, Kurt Hornik, Stefano Iacus, Ross Ihaka,
Friedrich Leisch, Thomas Lumley, Martin Maechler, Duncan Murdoch, Paul
Murrell, Martyn Plummer, Brian Ripley, Duncan Temple Lang, and Luke
Tierney. R version 1.0, released on 2000-02-29, provided an implementation
of S version 3. The key innovations of S4 were introduced in the 1.x series of
releases (connections in 1.3, a first implementation of the S4 OOP system in
version 1.4).
An R distribution provides a run-time environment with graphics, a debugger,
access to certain system functions, and the ability to run programs
stored in script files, and contains functionality for a large number
of statistical procedures. This "base system" is highly extensible through
so-called packages (see Section 4) which can contain R code and corresponding
documentation, data sets, code to be compiled and dynamically loaded,
and so on. In fact, the R distribution itself provides its functionality via
"base" packages such as base, stats, grid, and methods. The data analytic
techniques described in such popular books as [23], [16], or [21] have
corresponding R packages (MASS, nlme, and survival). In addition, there are
packages for bootstrapping, various state-of-the-art machine learning techniques,
and spatial statistics including interactions with GIS. Other packages
facilitate interaction with most commonly used relational databases,
importing data from other statistical software, and dealing with XML. Currently,
more than 300 packages are available via the Comprehensive R Archive
Network (CRAN, http://CRAN.R-project.org), a collection of sites which
carry identical material, consisting of the R distribution(s), contributed
extensions, documentation for R, and binaries.
It is important to realize that the "R Project" is really a multi-tiered
large scale software development effort, with the R Core Team delivering the
basic distribution which mostly provides the computational infrastructure
on which others can build special-purpose data analysis solutions. In this
paper, we discuss four of the key additions to this infrastructure relative to
the S reference standard.
2 Name spaces
Name spaces allow package authors to control how global variables in their
code are resolved. To see why this is important, suppose that package foo
defines a function mydnorm (computing, say, the standard normal density),
and has been attached to the search path so that evaluating the expression
mydnorm(0) uses this function when looking up a value for the symbol
'mydnorm'.
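A definition in this spirit, which relies on the global variable pi being resolved in the base package (the exact listing is a sketch, not the original figure), would be:

```r
## Standard normal density, written so that it refers to the global
## variable 'pi' (normally found in the base package).
mydnorm <- function(z) exp(-z^2 / 2) / sqrt(2 * pi)
```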
C mydnorm' . Now suppose t hat t he user ente rs
pi <- 1
at t he prompt, so t hat t he symbol C pi' is bound t o the valu e 1 in the R work-
space ( "global environme nt" , . GlobalEnv). With the "usual" dynamic look-
up mechani sm for bindings (of sy mbols to valu es) in place, going t hrough the
collect ions of bindings represented by the search path and start ing with the
glob al environment, evaluat ing mydnorm(O) would not give the result that the
faa package author had intended-nam ely, using t he valu e bound to pi in
t he base package. Mor e generally, t op level assignment s as well as attaching
packages to the sea rch path can insert shadowing definitions ah ead of t he
ones intended . Na me spaces ensure that this do es not happen .
In the above example, all global variables were intended to refer to the
definitions provided by the base package, which is always attached (at the
end of the search path). Suppose that foo wanted to make use of functionality
provided by another package bar which is not necessarily always attached.
Traditionally, the package author would then arrange for bar to be attached at
some point. This is not only subject to shadowing as described above, but
also has the effect of forcing a possibly undesired change to the search path
onto the user. Using name spaces, one can import the required functionality
(more precisely, exported variables) from other packages. Such imports then
cause the other packages to be loaded if necessary, without attaching them.
Finally, name spaces also allow the package author to control which definitions
provided by a package are visible to a package user and which ones are
private and only available for internal use. By default, a definition is private;
it is made public by an explicit export of the name of the defined variable.
Similar to proving mathematical theorems, good programming practice for
a high-level language such as R typically suggests providing functionality
based on small building blocks which perform simple tasks and are readily
comprehended. If all these blocks correspond to functions with a few lines
of code, and all these functions are visible to users, users will find it rather
challenging to determine the key functionality provided by a package. (Thus
far, coding practices suggested using names starting with a '.' for "internal"
variables, based on the fact that listing variable names in elements of
the search path by default excludes names with a leading dot. This reduces
clutter, but does not prevent shadowing.)
A package is given a name space by placing a NAMESPACE file containing
name space directives into the top level source directory of the package. This
mechanism makes it possible to obtain the information on the package code
interface as part of the package meta-data, without the need of processing the
package code. The main directives control export and import of variables, and
superficially resemble R function calls, with the arguments being syntactic
names or string constants (i.e., quoting is only necessary for non-standard
names). For example, the directive
importFrom(survival, Surv)
would only import the Surv() function from the survival package. There is
also a useDynLib directive for specifying that external code compiled into
a DLL is to be loaded when the package is loaded.
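Putting these directives together, a small NAMESPACE file for a hypothetical package foo (all names here are illustrative) might read:

```r
## NAMESPACE file for a hypothetical package 'foo'
export(mydnorm)              # make mydnorm() part of the public interface
importFrom(survival, Surv)   # import only Surv() from package 'survival'
useDynLib(foo)               # load foo's compiled code when the package loads
```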
As syntactic sugar, variables exported by a package with a name space
can also be referenced using fully qualified references which are obtained by
concatenating the package and variable name, separated by a double colon
(e.g., foo::mydnorm in the above example). This is less efficient than a formal
import and also loses the advantage of separating the dependency meta-data
from the package code, so this approach is usually not recommended.
Name spaces are sealed. This means that once a package with a name
space is loaded, one can no longer change the bindings (add or remove variables,
or change the values). If it is necessary to record state information on
the package level, one can use dynamic variables (functions that get
and set state information maintained in their environment). Sealing ensures
that the bindings cannot be changed at run time, which has been instrumental
to the development of a byte code compiler for R.
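A dynamic variable of this kind can be implemented as a closure; in the following sketch (names are illustrative) the state lives in the function's enclosing environment rather than in the sealed name space:

```r
## A "dynamic variable": state is kept in the closure's environment,
## not in the (sealed) package name space.
makeCounter <- function() {
  count <- 0
  function(increment = 1) {
    count <<- count + increment   # update state in the enclosing environment
    count
  }
}

counter <- makeCounter()
counter()   # 1
counter()   # 2
```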
R supports both the S3 and S4 paradigms for object oriented programming.
In the former, there are no "formal" data structures representing the
class information, and method dispatch is based on a naming convention
(methods are functions whose name is obtained by concatenating the
names of the generic and the class of the argument on which dispatch is
based, separated by a period). With the advent of name spaces, this creates
a problem: if a package is imported (hence loaded) but not attached to
the search path, the S3 methods it provides may not be found for dispatch.
The name space mechanism therefore also provides facilities for registering
S3 methods for dispatch. The directive
S3method("print", "foo")
registers the function print.foo defined in the package as the S3 method for
generic print and class "foo". (This mechanism in fact pertains to the cases
where the generic is defined in a package with a name space. In this case,
S3 methods only need to be registered, but not exported.) The "formal"
S4 OOP paradigm provides classes and generics with more structure than
their S3 counterparts, and hence conceptually allows better integration with
name spaces. S4 classes are private by default; they can be made public
using the exportClasses directive. As of this writing, all generics for
which formal methods are defined need to be declared in an exportMethods
directive, and where the generics are formed by taking over existing functions,
those functions need to be imported (explicitly unless they are defined in the
base name space). These mechanisms may be different in R 2.0; the current
development efforts will most likely bring the mechanisms in R more in line
with those in "related" functional languages in the LISP family which provide
both name spaces and a "formal" OOP system (such as Common Lisp or
Dylan).
By giving package developers the tools to control the package code interface
and the resolution of global variables in their code, name spaces substantially
enhance the potential of R for dealing with complex data analysis tasks
based on combinations of "many" extension packages, in particular providing
a way of resolving conflicts among definitions in these.
3 Grid graphics
Traditional S graphics ("base graphics" in R, although now provided by package
graphics) divides pages of graphics output into outer margins and possibly
several figure regions which in turn each consist of figure margins and
plot regions. This places severe limitations on the possibilities for accessing
the whole graphics page, e.g. when annotating a high-level plot. (The
standard example is that one cannot have arbitrarily rotated text in axis
labels, as text() supports arbitrary rotation but can only draw inside the
plot region, whereas mtext() can only write horizontally or vertically.) Each
region has one or more coordinate systems associated with it, as controlled
via "graphical parameters" (par()).
Grid graphics is an alternative graphics engine provided by package grid
in the R distribution. One of its goals is to remove some of the inconvenient
constraints imposed by the base graphics system. In addition, it aims at the
development of functions to produce high-level graphical components which
would not be very easy to produce using traditional S graphics (such as Trellis
graphics [3], [6], where the more natural building block is a "panel" which
consists of a plot plus one or more "strips" around it), and the rapid development
of new graphics ideas. It serves these aims by providing functionality
for the production of low-level to medium-level graphical components, such
as lines, rectangles, data symbols, and axes, and sophisticated support for
arranging graphical components. Grid does not provide high-level graphical
components such as scatterplots or barplots, and hence is primarily targeted
at graphics developers rather than "users", with the usual remark that in
S there is at most a gradual transition between these groups, if such a
distinction can be drawn at all.
In grid, there can be any number of graphics regions. A graphics region
is referred to as a viewport and is created using the viewport() function.
A viewport can be positioned anywhere on a graphics device (page, window,
. . . ), it can be rotated, and it can be clipped to. For example,
viewport(x = 0.5, y = 0.5, width = 0.5, height = 0.25, angle = 45)
creates a viewport covering half the width and a quarter of the height of
its parent, centered within it and rotated by 45 degrees. Grid's low-level
drawing primitives have names which are obtained by prefixing the names of
the corresponding base graphics functions with 'grid.'. There are also two
higher-level components: x- and y-axes. These functions are mostly similar
to their base counterparts, but differ in the way graphical parameters, such
as line colour and thickness, are specified.
In grid, there is a much smaller set of graphical parameters, consisting
of col (the "foreground" color for drawing lines and borders), fill (the
"background" color for filling shapes), lty and lwd (line type and width),
fontfamily, fontface (such as bold or italic), fontsize (the size of text in
points), lineheight (the height of a line as a multiple of the size of text),
and cex (a multiplier applied to fontsize: the size of text is fontsize * cex
and hence the size of a line is fontsize * cex * lineheight). Settings of
graphical parameters are represented by "gpar" objects, and may be specified
for both viewports and graphical objects. A setting for a viewport
will apply to all graphical output within that viewport and all viewports
subsequently pushed onto the viewport stack, unless the graphical object
or viewport specifies a different setting. A description of graphical parameter
settings is created using the gpar() function, which can be associated
with a viewport or graphical object via their gp slots (as accessed by the gp
argument to the functions creating viewports and graphical objects). The
following piece of code illustrates these mechanisms.
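A minimal example in this spirit (a sketch, not the original figure, using only functions from the grid package: viewport(), pushViewport(), gpar(), grid.rect(), grid.text(), and popViewport()):

```r
library(grid)

grid.newpage()
## A rotated viewport with red as the default drawing colour.
vp <- viewport(x = 0.5, y = 0.5, width = 0.5, height = 0.25,
               angle = 45, gp = gpar(col = "red"))
pushViewport(vp)
grid.rect()                                   # border inherits col from the viewport
grid.text("hello", gp = gpar(col = "blue"))   # object setting overrides the viewport
popViewport()
```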
4 Packages
The R package system provides a standardized interface to extending R's
functionality. In source form, packages can contain
• "core" meta-information, currently serialized as a DESCRIPTION file in
Debian Control File format (tag-value pairs)
• additional meta-data, such as a NAMESPACE file defining the package
code interface
• code and documentation for R
• foreign code to be compiled/dynloaded (C, C++, Fortran, ...) or
interpreted (Shell, Perl, Tcl, ...)
• additional material such as data sets, demos, vignettes, package-specific
tests, . . .
Only the core meta-information must be present. Mandatory meta-data
include the name and version of the package, and information on the license and
the package maintainer. In a file system "representation", a source package
consists of a subdirectory containing the DESCRIPTION and possibly other
"top-level" files, and several pre-defined subdirectories, some of which may be
missing, such as R for R code and src for foreign source code to be compiled
and dynloaded.
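A minimal DESCRIPTION file in this tag-value format (all field values are illustrative) could look like:

```
Package: foo
Version: 0.1-0
Title: Tools for Illustration
Author: A. Developer
Maintainer: A. Developer <a.developer@example.org>
Description: A small example package.
License: GPL-2
```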
To be available for extending R, packages must be installed into libraries,
which are simply locations where R knows to find (installed) packages.
Installing from source performs a variety of tasks as needed or desired, such
as preformatting R documentation in plain text and HTML formats, creating
DLLs from foreign code, generating a binary image of the R code, and setting
up several data structures with package index information. This process
is plug'n'play if the packages are "self-contained" (so that only the standard
tools for processing them are required). Developers can provide configuration
scripts for automatically dealing with situations where packages depend on
the availability of functionality "outside of R", such as libraries for dealing
with XML or accessing a database management system.
Creating packages is straightforward: developers simply need to gather
the material to be packaged into the appropriate locations relative to the
package source directory. If R code is the starting point, R provides a
convenience function package.skeleton() which creates the basic file structures
as well as documentation skeletons for the R objects.
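For example, starting from a couple of objects in the current workspace (names here are illustrative):

```r
## Create the source layout for a package 'foo' from two workspace objects.
f <- function(x) x + 1
d <- data.frame(x = 1:3)
package.skeleton(name = "foo", list = c("f", "d"))
## Writes a 'foo' source directory with DESCRIPTION, the R code and data,
## and Rd documentation skeletons under foo/man/.
```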
Packages are distributed as single files archiving their contents. For source
packages, gzipped tar files are used. These are created via the build utility
(currently, a Perl script) which essentially performs necessary cleanups, adds
front-matter information, and creates the archive with a canonical file name
obtained from the package name and version (as recorded in the DESCRIPTION
file). One can also build and install binary packages, which are already
set up for use on a particular platform (so that only minimal processing is
needed when installing). E.g., CRAN provides binary packages for the 32-bit
Windows platforms, because the tools needed for processing the source
packages (Make, Perl, compilers, ...) might not be available to all users on
such systems.
Packages can be distributed over the web through repositories, which are
suitably indexed collections of packages. The package management tools
provided by R allow for directly installing packages from repositories and au-
tomatically updating installed packages when newer versions are made avail-
able in the repositories. This versioning facility, together with the generality
of the package mechanism, makes packages an ideal vehicle for distributing
many kinds of R-related material which needs to be kept up-to-date, such
as e.g. data sets or manuals (preferably implemented as package vignettes,
see Section 5). The Bioconductor project (http://www.bioconductor.org),
an open source and open development software initiative for the collective
creation of extensible software infrastructure for computational biology and
bioinformatics which uses R as its primary implementation language, is work-
ing on providing the next generation of client and server side tools for repos-
itory management, featuring in particular a multi-level package dependency
mechanism similar to the ones found in popular GNU/Linux distributions
such as Debian (http://www.debian.org). These tools are already available
via the R extension package reposTools from Bioconductor, and will
eventually be integrated into the R distribution.
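The package management tools mentioned above are ordinary R functions; for example, with CRAN as the repository, one can install a package and then bring everything installed up to date (the package name is illustrative):

```r
## Install a contributed package from a repository, then update all
## installed packages for which the repository has newer versions.
install.packages("foo", repos = "http://CRAN.R-project.org")
update.packages(repos = "http://CRAN.R-project.org")
```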
Packages can be submitted to unit testing using the check utility (currently,
a Perl script). When run on a package source directory, this first verifies
that the package can be installed as the basic test of whether it "works",
and then goes on to perform a variety of other tests, such as checking
• availability and correctness of meta-information (as recorded in the
DESCRIPTION file mentioned above);
• R code, including syntactic correctness, common coding problems (e.g.,
when loading DLLs or defining replacement functions), consistency of
S3 generics and methods, etc.;
• R documentation, including correctness (syntax, presence of all required
documentation slots), consistency (of code and documentation), and
completeness (all user level objects must be documented);
• whether the package is able to run the code in the examples of its
documentation (which is required). In addition, there are mechanisms
for regression and certification testing of code: package maintainers can
provide files with R code that will be run and if necessary compared to
already certified output.
Repository maintainers can use the package testing facilities for controlling
the quality of the packages in the repository, and hence the repository
itself. For example, the CRAN repository tracks the R release process by
only providing packages which pass the tests against the version of R being
released. In addition, the effects of changes in (the development and
patched versions of) R and updates to contributed packages are monitored
on a daily basis. It is this continuous improvement process which markedly
distinguishes the R project from most other software initiatives which use
repositories for distributing extensions.
Most of the testing tools used by the check utility are in fact implemented
in R (in particular to ensure portability and availability to all users of R) and
distributed in the tools package contained in the R distribution. It is very
important to realize that whereas check is a rather inflexible utility for creating
standardized reports on the package "quality" status, the underlying
functions from package tools provide a flexible and extensible toolbox for
computing on packages. For example, codoc() is a function for checking
code/documentation consistency. More precisely, it analyzes the (usually,
function synopsis) information in the \usage sections of R documentation
files, and compares the documented synopses to what the code actually
contains. (Currently, code and documentation for functions in a package are
not generated from common sources, and hence may be inconsistent.) What
codoc() returns is an object containing a variety of information, including
the information on mismatches found. Printing this object gives a status
report on mismatches intended for human readers; if no mismatches were
found, nothing is printed. This mechanism is used by check to assess and
report the basic codoc status. But the object returned contains additional
data as well, such as information on \usage entries not corresponding to valid
R syntax after eliminating special markup for indicating synopses for S3 or
S4 methods, or on functions for which documentation was registered (via the
\alias meta-data markup) without providing a synopsis (which "might" be
a problem, and in the case of non-method functions in packages with a name
space typically is one). Even though this information is not printed, it is
available in the result of the codoc computations, and hence can be used for
further processing.
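A typical interactive use of this toolbox (assuming an installed package, here called foo for illustration) might be:

```r
library(tools)
## Check code/documentation consistency for an installed package.
res <- codoc(package = "foo")
res      # prints a report on any code/documentation mismatches found
```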
5 Sweave
Sweave [9] is a tool that allows one to embed the R code for complete data analyses
in LaTeX documents. (In fact, we shall see that the underlying principles are
much more general.) In the process of generating the displayed version of the
document, first the code in the Sweave source file is processed (by R) and its
textual or graphical output inserted as appropriate to create a LaTeX source
file. Then, a DVI or PDF file is created (by latex or pdflatex).
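From within R, this processing step can be run directly (the file name is illustrative):

```r
## Process the literate document: runs the code chunks and writes
## example.tex plus any generated figures ("weaving").
Sweave("example.Snw")
## The companion function extracts just the code chunks into an
## R script ("tangling").
Stangle("example.Snw")
```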
A small Sweave source file is shown in Figure 1. The file contains two R
code chunks embedded in a simple LaTeX document. At the beginning of a
line, '<<...>>=' and '@' mark the start of a code chunk and of a documentation
chunk, respectively. Sweave translates this to a regular LaTeX document, which is
then compiled to give Figure 2. The results of the Kruskal-Wallis test as well
as the box plot have been nicely integrated into the final version.
\documentclass[a4paper]{article}
\title{Sweave Example}
\author{Friedrich Leisch}
\begin{document}
\maketitle
In this example we embed parts of the examples from the
\texttt{kruskal.test} help page into a \LaTeX{} document:
<<>>=
data(airquality)
kruskal.test(Ozone ~ Month, data = airquality)
@
which shows that the location parameter of the Ozone distribution varies
significantly from month to month. Finally we include a boxplot of the data:
\begin{center}
<<fig=TRUE,echo=FALSE>>=
boxplot(Ozone ~ Month, data = airquality)
@
\end{center}
\end{document}
Figure 1: A small Sweave source file: example.Snw.
The Sweave source file shown in Figure 1 uses the syntax of noweb [18],
a simple literate programming tool which allows program source
code and the corresponding documentation to be combined into a single file. This syntax is
particularly useful if Emacs is used for authoring Sweave documents: then,
using ESS [19, Emacs Speaks Statistics], an Emacs extension package, one
can connect the document to a running R process while writing it. Code
chunks can be sent to R and evaluated using simple keyboard shortcuts or
popup menus. Syntax highlighting, automatic indentation and keyboard
shortcuts depend on the location of the pointer: in code and documentation
chunks one gets the same behavior as when editing "simple" R code or
LaTeX files, respectively. Using Emacs or the noweb syntax is not necessary
for Sweave. There is also a LaTeX-based syntax, where 'Scode' environments
are used for marking code chunks. Using this syntax, the box plot code chunk
in our example file would be typeset as
\begin{Scode}{fig=TRUE,echo=FALSE}
boxplot(Ozone ~ Month, data = airquality)
\end{Scode}
Sweave offers fine control over how the code chunks are processed. By default,
both the S code itself and its console output are inserted into the document,
inside suitable LaTeX environments.
[Figure 2: the compiled version of example.Snw, showing the document text,
the echoed input lines R> data(airquality) and
R> kruskal.test(Ozone ~ Month, data = airquality) with the test output
(the location parameter of the Ozone distribution varies significantly from
month to month), and the box plot of the data.]
As apparent from the above description, what Sweave really does is
perform certain computations on integrated text documents which contain both
code and documentation chunks. S4weave, a re-implementation of Sweave
using S4 classes and methods currently under way, reinforces this view [11].
Providing more structure also makes it possible to compute a directed graph
of chunk dependencies, and hence process chunks conditionally. There is
also an XML DTD for Sweave source files for document exchange with other
dynamic document systems.
To assess the importance of facilities such as Sweave, one should keep in
mind how reports as part of a statistical data analysis project are traditionally
written. First, the data are analyzed, and afterwards the results of the
analysis (numbers, graphs, ...) are used as the basis for a written report. In
larger projects the two steps may be repeated alternately, but the basic
procedure remains the same. The basic paradigm is to write the report around
the results of the analysis. Using Sweave, one can create dynamic reports,
which can be updated automatically if the data or analysis change. In particular,
the code is always available to reproduce the displayed results, which makes
Sweave an ideal vehicle for disseminating reproducible research, see e.g. [13].
Sweave also greatly aids in the creation and deployment of documentation
for "aggregated" functionality of S code, such as manuals for packages (where
the traditional function-based S documentation methods cannot easily deliver
a comprehensive view), or books on statistical analysis using S. Using Sweave,
there is the additional benefit that one can always extract the code from the
document (the term vignettes has been introduced for documents with this
property) and use it for subsequent manipulation and processing. Vignettes
have enough structure to allow for an integrated and interactive presentation
of the code they contain. For example, vExplorer() from the Bioconductor
tkWidgets package allows one to view vignettes and interact with their code
chunks, see e.g. [12] for more details.
6 Summary
In this paper, we have discussed four of the key innovations in the "next
generation" of R. There are of course many more, including a new system
for exception handling, a byte code compiler, external pointer objects, a
mechanism for serialization and unserialization of R objects to and from
connections, mathematical annotation of plots [15], as well as many refinements
to the S language (such as a thorough distinction of the character string "NA"
from a missing value for a character string). The NEWS file in the top-level
directory of the R distribution has more information.
References
[1] Becker R.A., Chambers J.M. (1984). S. An interactive environment for
data analysis and graphics. Monterey: Wadsworth and Brooks/Cole.
[2] Becker R.A., Chambers J.M., Wilks A.R. (1988). The new S language.
Chapman & Hall, London.
[3] Becker R.A., Cleveland W.S., Shyu M.-J. (1996). The visual design
and control of trellis displays. Journal of Computational and Graphical
Statistics 5, 123-155.
[4] Chambers J.M. (1998). Programming with data. Springer, New York.
http://cm.bell-labs.com/cm/ms/departments/sia/Sbook/.
[5] Chambers J.M., Hastie T.J. (1992). Statistical models in S. Chapman
& Hall, London.
[6] Cleveland W.S. (1993). Visualizing data. Hobart Press.
[7] Gentleman R., Ihaka R. (2000). Lexical scope and statistical computing.
Journal of Computational and Graphical Statistics 9, 491-508.
http://www.amstat.org/publications/jcgs/.
[8] Ihaka R. (1998). R: Past and future history. In S. Weisberg (ed.),
Proceedings of the 30th Symposium on the Interface, the Interface Foundation
of North America, 392-396.
[9] Leisch F. (2002). Sweave: Dynamic generation of statistical reports
using literate data analysis. In Wolfgang Härdle and
Bernd Rönz (eds), Compstat 2002 - Proceedings in Computational
Statistics, Physica-Verlag, Heidelberg, Germany, 575-580.
http://www.ci.tuwien.ac.at/~leisch/Sweave.
[10] Leisch F. (2002). Sweave, part I: Mixing R and LaTeX. R News 2 (3),
28-31. http://CRAN.R-project.org/doc/Rnews/.
[11] Leisch F. (2003). Sweave and beyond: Computations on text
documents. In Kurt Hornik, Friedrich Leisch, and Achim
Zeileis (eds), Proceedings of the 3rd International Workshop
on Distributed Statistical Computing, Vienna, Austria.
http://www.ci.tuwien.ac.at/Conferences/DSC-2003/Proceedings/.
[12] Leisch F. (2003). Sweave, part II: Package vignettes. R News 2 (2),
21-24. http://CRAN.R-project.org/doc/Rnews/.
[13] Leisch F., Rossini A.J. (2003). Reproducible statistical research. Chance
16 (2), 46-50.
[14] Murrell P. (2003). Integrating grid graphics output with base graphics
output. R News 3 (2). http://CRAN.R-project.org/doc/Rnews/.
[15] Murrell P., Ihaka R. (2000). An approach to providing mathematical
annotation in plots. Journal of Computational and Graphical Statistics
9, 582-599. http://www.amstat.org/publications/jcgs/.
[16] Pinheiro J.C., Bates D.M. (2000). Mixed-effects models in S and S-PLUS.
Springer. http://nlme.stat.wisc.edu/MEMSS/.
[17] R Development Core Team (2004). Writing R extensions.
R Foundation for Statistical Computing, Vienna, Austria.
http://www.R-project.org.
[18] Ramsey N. (1998). Noweb man page. University of Virginia, USA.
http://www.cs.virginia.edu/~nr/noweb. Version 2.9a.
[19] Rossini A.J., Heiberger R.M., Sparapani R., Mächler M., Hornik K.
(2004). Emacs speaks statistics: A multi-platform, multi-package
development environment for statistical analysis. Journal of Computational
and Graphical Statistics 13 (1), 1-15.
[20] Sarkar D. (2002). Lattice. R News 2 (2), 19-23.
http://CRAN.R-project.org/doc/Rnews/.
[21] Therneau T.M., Grambsch P. (2000). Modeling survival data: extending
the Cox model. Springer.
[22] Tierney L. (2003). Name space management for R. R News 3 (1), 2-6.
http://CRAN.R-project.org/doc/Rnews/.
[23] Venables W.N., Ripley B.D. (2002). Modern applied statistics with S.
Fourth edition. Springer. http://www.stats.ox.ac.uk/pub/MASS4/.
Acknowledgement: Section 2 is based on material in [22] and the Writing
R Extensions manual [17], Section 3 on a primer on "Grid Graphics" by Paul
Murrell. Section 5 draws from [10].
Address: K. Hornik, Institut für Statistik, Wirtschaftsuniversität Wien, Austria
E-mail: Kurt.Hornik@wu-wien.ac.at
COMPSTAT '2004 Symposium © Physica-Verlag/Springer 2004
Abstract: Modern technology enables the collection of vast quantities of
data. Smart automatic data selection algorithms are needed to discover
important data structures that are obscured by other structure or random noise.
We suggest an efficient and flexible algorithm that chooses the "best" subsample
from a given dataset. We avoid the combinatorial search over all
possible subsamples and efficiently find the datapoints that describe the primary
structure of the data. Although the algorithm can be used in many
analysis scenarios, this paper explores the application of the method to problems
in multidimensional scaling.
1 Introduction
Although modern technology enables the collection of huge amounts of data,
it also exacerbates the problem of data quality control. Spurious or erroneous
information caused by either the random nature of the data or human error
will inevitably exist within large datasets. But the task of sifting through
millions of observations and removing those that are not representative of the
true population borders on the impossible. Smart, automated, data cleaning
algorithms or robust analysis tools that work in tandem with the collection
technologies are needed.
From a statistical perspective, robust analysis methods, including L, M, S,
and R estimators, serve as appropriate means to account for contaminated
data. However, such methods arguably apply only to parametric approaches
and do not extend to unsupervised learning problems or multidimensional
scaling. Furthermore, analyzing the data directly, without first reducing the
number of observations, may exceed computer software or memory limitations.
To address this problem, we present an efficient data reduction algorithm that actively seeks the primary underlying structure of the data while removing spurious observations. Rather than use graphical methods to hunt for erroneous data as described by Karr, Sanil, and Banks [6], we systematically search among strategically chosen subsets of the collected sample. Ultimately, we find the subsample that provides the best statistical signal, as measured in terms of fit, compared to other subsets of comparable size. The algorithm we propose does not require the evaluation of every subset within a sample. Instead, it performs a series of greedy searches that allow
252 Leanna L. House and David Banks
the method to scale to large datasets. And the algorithm is flexible since it
can be applied to any situation in which there is some measure of goodness-
of-fit. In this paper, we describe how the method applies in the context of
linear regression and multidimensional scaling, where the measures of fit are
R² and stress, respectively.
We understand that specifying an acceptable degree of lack-of-fit or required statistical signal for a chosen subsample is unclear. Since one is trying to cherry-pick the best possible subset of the data, we consider two options. The first entails prespecifying the final subset size: the subset of that size with the highest statistical signal is chosen, regardless of the magnitude of the signal, or lack thereof. The second approach requires inspection of a plot of signal versus subset size; a knee in the plotted curve points to the subset size at which one is forced to include bad data.
In the context of previous statistical work, our approach is most akin to the S-estimators introduced by Rousseeuw and Yohai [10], which built upon Tukey's proposal of the shorth as an estimate of central tendency [2], [10].
Our key innovations are that, instead of focusing upon parameter estimates, we look at complex model fitting, and that we focus directly upon subsample selection. See [3], [4] for more details on the asymptotics of S-estimators and the difficulties that arise from imperfect identification of bad data.
In the context of previous computer science work, our procedure is related to one proposed by Li [7]. That paper also addresses the problem of finding good subsets of the data, but it uses a chi-squared criterion to measure lack-of-fit and applies only to discrete data applications. Besides offering significant generalization, we believe that the two-step selection technique described here enables substantially better scalability in realistically hard computational inference.
Section 2 describes the algorithm in detail within the context of regression. Section 3 illustrates the flexibility of the algorithm and applies it to a simulated multidimensional scaling scenario. Section 4 concludes the paper with a discussion and a description of additional applications.
2 Proposed algorithm
Because of the wide familiarity with regression, we describe the steps of the
algorithm while referring to the following scenario:
Typical regression analyses fit all the data, and then attempt to identify outliers or high-leverage points. Some robust methods, such as S-estimation, attempt to find the best fit to some prespecified fraction of the data, but those methods do not generalize to, say, nonparametric multivariate regression. In contrast, we search among the data to find a large subset that produces good fit. This entails random selection of starting-point subsamples and the comparison of fits from subsamples of the data.
In a linear regression setting, the coefficient of determination, R², provides a natural choice for assessing and comparing the statistical signal of subsamples. The statistic relies on sums of squared deviations to assess lack-of-fit and does not penalize subsets for including more or fewer observations. Simply, a subsample with a high R² is better than another with a low R².
In general, it is desirable that the measure of fit not depend upon the size of the subsample. This is true for the coefficient of determination and also for stress in multidimensional scaling. The algorithm, however, can be modified to accommodate other situations, usually by a normalization that allows one to measure the "average" goodness-of-fit. That technique allows one to broaden the field of fit criteria to include average absolute deviation or average complexity, as measured by Mallows' Cp statistic [8] or Akaike's Information Criterion [1].
The remainder of this section describes how we randomly select a set of subsamples from which we ultimately choose the best. We do not enumerate or test all possible subsamples of size kn. Rather, we propose starting with a series of small, randomly chosen datasets and growing each until they are of size kn. Done properly, we can ensure that with some prespecified probability at least one of the original subsamples will eventually grow to contain nearly all good data.
If after sweeping through the data one time we have ni < kn, our algorithm moves to the second, significantly slower step. Here, we search over all data not already in the subsample to find the observation which, when added, reduces the goodness-of-fit by the smallest amount. We then add that observation, which either improves the fit measure for Si or decreases the statistical measure by the smallest possible amount (regardless of η). Notice that step 2, unlike step 1, guarantees the addition of one observation on each pass through all of the data (excluding observations already in Si). Step 2 is repeated until ni = kn.
The following pseudo-code describes this two step algorithm. We use
GOF( . ) to denote a generic goodness-of-fit measure.
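Since the pseudo-code itself is not reproduced in this excerpt, the following Python sketch gives one possible reading of the two-step procedure (the function and its arguments are our own illustration, not the authors' code); gof plays the role of GOF(.), assumed larger-is-better, e.g. R²:

```python
import random

def grow_subsample(data, gof, k, eta, start_size=3, rng=random):
    """Sketch of growing one starting subsample in two steps.
    `gof(subset)` returns a goodness-of-fit value (larger is better);
    `k` is the target fraction of the n observations to keep;
    `eta` is the tolerated drop in fit during the fast sweep (step 1)."""
    n_target = int(k * len(data))
    S = rng.sample(data, start_size)            # random starting subsample
    # Step 1: fast sweep -- keep any point whose addition costs at most eta.
    for x in data:
        if x in S or len(S) >= n_target:
            continue
        if gof(S + [x]) >= gof(S) - eta:
            S.append(x)
    # Step 2: slow search -- add the single least damaging point each pass.
    while len(S) < n_target:
        best = max((x for x in data if x not in S),
                   key=lambda x: gof(S + [x]))
        S.append(best)
    return S
```

With eta = 0, step 1 only accepts points that do not worsen the fit at all, so step 2 is what forces the subsample up to its target size.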
The algorithm requires two vital inputs: the goodness-of-fit measure and the choice of η, the tolerated increase in lack-of-fit during step 1. As mentioned previously, we recommend that the goodness-of-fit measure not depend upon the sample size; the lack-of-fit values should be comparable as ni increases. However, the choice of η offers one way to force comparability, by making it depend upon ni as well.
If one can achieve independence between the lack-of-fit measure and sample size, then the selection of η depends upon one's willingness to accept bad observations. In the regression setting, when η = 0, step 1 only appends
data points that strictly improve the R². On the other hand, the value of η can be determined empirically by inspection of a histogram of 100 lack-of-fit values obtained by adding 100 random data points to an initial subsample of size p + 2.
After repeating steps 1 and 2 for d subsamples, the final task is to select one Si as the best or most representative of the underlying structure. If the purpose for implementing the proposed algorithm is strictly to reduce the dataset to kn, then one could select the subsample with the lowest lack-of-fit, regardless of its size. On the other hand, if the inclusion of bad observations is worrisome or the magnitude of the goodness-of-fit measure for the best subsample is unsatisfactory, then we recommend plotting the goodness-of-fit against the order of entry of the observations. Given an initial subsample with only good data, the graph should depict a long plateau with a sudden knee in the curve when bad observations begin to enter the subsample. One may choose the best size for the subsample according to the size at which the knee occurs.
Note that the proposed algorithm entails a stochastic choice of starting sets, followed by a deterministic extension algorithm. Even though we can guarantee, with a specified probability, a clean starting set, we cannot make the same guarantee at the conclusion of the algorithm. Since the extension procedure depends slightly upon the order in which the cases are considered, the final result does not quite enjoy the same probabilistic properties as the initial starting sets. Nevertheless, simulation results indicate that the proposed procedure does lead, with probability near the nominal level specified in the initial calculation that determined the number of starting-point subsamples, to the selection of a subsample of good data.
Using Kruskal-Shepard non-metric scaling, we assess the statistical signal of a given dataset by using the stress function
stress = [ Σ_{i<j} (d_ij − d̂_ij)² / Σ_{i<j} d̂_ij² ]^{1/2}

where d_ij denotes the observed dissimilarity between cases i and j and d̂_ij the corresponding distance in the fitted configuration.
Figure 1: Plot of stress measure versus sample size (in the order of entry) when 30 distances are distorted: (left) 150% distortion; (right) 500% distortion. Notice the plateau in each graph while good observations are being included in the subsample, but at sample size = 77 (left) and sample size = 78 (right) we start to append bad data.
4 Discussion
In order to take advantage of the full potential of a large dataset, we propose a straightforward method to remove bad data. In essence, we robustify the data using a two-step algorithm to select the subsample that is in best agreement with the assumed structure in the data.
We demonstrate the benefits of the algorithm within the context of multidimensional scaling. In MDS scenarios, even small proportions of bad data can entirely distort the apparent geometric relationships among the cases. Our algorithm successfully isolates the primary structure of six distorted datasets. The stress measures of the final chosen subsamples are dramatically lower than those of the corresponding original datasets.
One distinguishing feature of the algorithm is that it does not require the complete enumeration of all possible subsamples. This saves an enormous amount of computer time, and ensures that the algorithm is essentially of order O(n) (if one avoids or minimizes the slow-search phase). However, the
Robust multidimensional scaling 259
References
[1] Akaike H. (1973). Information theory and an extension of the maximum likelihood principle. Second International Symposium on Information Theory, 267-281.
[2] Andrews D.F., Bickel P.J., Hampel F.R., Huber P.J., Rogers W.H., Tukey J.W. (1972). Robust estimates of location: survey and advances. Princeton University Press, Princeton, NJ.
[3] Davies P.L. (1987). Asymptotic behavior of S-estimates of multivariate location parameters and dispersion matrices. Annals of Statistics 15, 1269-1292.
[4] Davies P.L. (1990). The asymptotics of S-estimators in the linear regression model. Annals of Statistics 18, 1651-1675.
[5] Hawkins D.M. (1993). A feasible solution algorithm for the minimum volume ellipsoid estimator in multivariate data. Computational Statistics 9, 95-107.
[6] Karr A.F., Sanil A.P., Banks D.L. (2002). Data quality: a statistical perspective. National Institute of Statistical Sciences, Research Triangle Park, NC.
[7] Li X.-B. (2002). Data reduction via adaptive sampling. Communications in Information and Systems 2, 53-68.
[8] Mallows C.L. (1973). Some comments on Cp. Technometrics 15, 661-675.
[9] Rousseeuw P.J., Leroy A.M. (1987). Robust regression and outlier detection. Wiley, New York.
[10] Rousseeuw P.J., Yohai V. (1984). Robust regression by means of S-estimators. In Robust and Nonlinear Time Series Analysis, J. Franke, W. Härdle, R.D. Martin (eds.), Lecture Notes in Statistics 26, Springer-Verlag, New York, 256-272.
Address: L.L. House, D. Banks, Institute of Statistics and Decision Sciences, Duke University, Durham, North Carolina, 27708 U.S.A.
E-mail: house@stat.duke.edu, banks@stat.duke.edu
Abstract: This paper focuses on some of the remaining issues concerning jackknifing of centred bilinear models. A method improvement is proposed, describing how all the bilinear model parameters can be rotated in order to estimate the uncertainties of all model parameters. The mean values of centred models are also included in the rotation scheme.
The uncertainty information of the bilinear model parameters can be used to perform variable selection, variable weighting and detection of outliers.
1 Introduction
Crossvalidation [1] and especia lly jackknife [2] can be used in order to est imate
t he un cert ainty of the par amet ers in a bilin ear model [3] . This t echn ique is
currentl y used in commercial software (e.g. The Uns crambler) to est imate
the un certainty in the reduced-rank regression coefficients bA in the mul tiple
linear approximati on model at rank A ,
(2)
2 Theory
2.1 Notation
Matrices are written as uppercase bold letters (X), while vectors are written
as lowercase bold letters (x). Unless transposed (written as x'), vectors are
always columns. Uppercase letters (A) denote constants, while lowercase letters are counters or indexes (a = 1 ... A).
giving M = 2 segments. The validation would then show whether the patients changed over time. One could also use the leave-one-out method, giving M = 40 segments. The validation would then be a mix of the above, testing the dose-levels, replicates and patients at once. These four examples of segmentation will in general give quite different estimates of the variances in the model parameters. Thus, it is very important to be aware of at what level one is validating the results [8].
Even though the jackknife formulae for different segmentations are given in the statistical literature, the authors feel the need for documenting these also in the chemometric literature. The most general expression is that of the delete-d jackknife, where one explores all N-choose-d combinations of the data in which d samples are removed. The variance of a parameter θ can then be estimated as

s²(θ̂) = (N − d)/(d · (N-choose-d)) · Σ_m (θ̂_{-m} − θ̄)²   (3)

The centred bilinear models of X and Y at rank A can be written as

X = 1x̄' + Σ_a t_a p_a' + E_A   (4)
Y = 1ȳ' + Σ_a t_a q_a' + F_A   (5)
where x̄' and ȳ' contain the mean value of each variable, t_a is a vector of scores (a linear combination of the X-variables), p_a and q_a are loadings for X and Y respectively, and E_A, F_A contain unmodelled residuals. The only difference between the PCR and PLSR algorithms lies in the way t_a is defined.
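The delete-d estimate in equation (3) can be sketched for a scalar estimator as follows (an illustrative function of our own; enumerating all N-choose-d deletions is only feasible for small N and d):

```python
from itertools import combinations
from math import comb

def delete_d_jackknife_var(data, estimator, d):
    """Delete-d jackknife variance of a scalar estimator, in the spirit of
    equation (3): average squared deviation of the leave-d-out estimates
    from their mean, scaled by (N - d) / (d * C(N, d))."""
    n = len(data)
    ests = [estimator([x for i, x in enumerate(data) if i not in drop])
            for drop in map(set, combinations(range(n), d))]
    mean = sum(ests) / len(ests)
    scale = (n - d) / (d * comb(n, d))
    return scale * sum((e - mean) ** 2 for e in ests)

# d = 1 reduces to the ordinary leave-one-out jackknife variance
var = delete_d_jackknife_var([1.0, 2.0, 3.0, 4.0],
                             lambda xs: sum(xs) / len(xs), 1)
```

For the sample mean with d = 1 this reproduces the familiar s²/N, which is a quick sanity check on the scaling factor.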
A property of bilinear models is that the scores and loadings have rotational freedom. We can rotate the scores in any direction, as long as the corresponding loadings are rotated the same amount in the opposite direction. The model will still contain the same information, and the regression coefficients will be the same.
Score and loading vectors for the different submodels m may appear to be quite different due to trivial translations, rotations and mirrorings. If e.g. the sign of each element in both t_{-m,a} and p_{-m,a} changes, the information
Improved jackknife variance estimates of bilinear model parameters 265
explained by their product in that factor will still be the same, but it will be meaningless to compare each value in those vectors to other score or loading vectors with a different alignment. One way to solve this problem is to rotate all the M sub-models towards the model calculated from the complete dataset before we compare them.
Equation (6) represents the model calculated from the complete dataset with all N samples. Rewriting that model using matrix notation, we get

X = [1 T_A][x̄ P_A]' + E_A
Y = [1 T_A][ȳ Q_A]' + F_A
where

T_A = (X − 1x̄') W_A (P_A' W_A)^{-1}   (7)

and W_A is the internal loading weight matrix. For each consecutive factor, the corresponding column in W_A is defined as the first eigenvector of the residual X-X covariance (in PCR) or X-Y covariance (in PLSR). The linear regression coefficients in eqs. (1), (2) are then defined as

b_A = W_A (P_A' W_A)^{-1} q_A   (8)
Similarly, we can write each of the M sub-models in matrix notation, where the index -m denotes that segment m has been left out.
Comparing equation (7) and equation (10), we can define C_{-m} as a rotation matrix, where we e.g. try to rotate [1 T_{-m,A}] towards [1 T_A]. Similarly, we then interpret C_{-m}^{-T} as a rotation of [x̄_{-m} P_{-m,A}] towards [x̄ P_A]. Thus, if we wanted to estimate the matrix C_{-m}, we could use either the relation between the scores or the relation between one of the loadings as targets.
If the data were without noise, perfectly behaved and contained sufficient redundant information, the only difference between the submodel and the total model would be reflections and possibly reorderings (permutations) of the factors. It would then be possible to map the submodel onto the total model with a matrix C_{-m} containing only one ±1 per column/row, and the rest of the elements 0. But when the data contain noise and insufficient redundant information, rotation at angles that are not multiples of 90° and
266 Martin Høy, Frank Westad and Harald Martens
possibly rescaling of the axes will be necessary to map the submodel perfectly onto the total model.
In order to consume as few degrees of freedom in Y as possible in the estimation of C, we have chosen to use the score matrices as targets. Since cross-validation/jackknife segment m has been removed in T_{-m,A}, it has fewer rows than T_A. In order to estimate C_{-m}, the samples in segment m must also be removed from T_A before comparing them. This shortened version of [1 T_A] is denoted as [1 T_A]_{\m}. Since the samples in segment m are now removed from both matrices, fewer degrees of freedom in Y are consumed than if e.g. the loading matrices were to be used as targets. Note that even if the samples in segment m are not used directly when estimating C_{-m}, they are not completely left out, since they have been influencing Y and W in the total model.
In order to estimate the matrix C_{-m}, the criterion to be minimised is the difference between [1 T_A]_{\m} from the total model and the rotated [1 T_{-m,A}] from the reduced model. The difference is here denoted G_{-m,A}:

G_{-m,A} = [1 T_A]_{\m} − [1 T_{-m,A}] C_{-m}   (11)
There are many possible ways to estimate C_{-m} from equation (11). To reduce the degrees of freedom consumed in the rotation, we have chosen to use an orthogonal rotation, which means that the columns in C_{-m} are orthogonal with length one. The procedure for estimating C_{-m} starts with performing an SVD,

[1 T_{-m,A}]' [1 T_A]_{\m} = U S V'   (12)

after which the rotation matrix is taken as

C_{-m} = U V'   (13)
There are many possible ways to estimate C_{-m} from equation (11) (or even without using the score matrices); the above is just one solution. Other possible procedures are discussed in section 3.2.
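One concrete realisation of this orthogonal (Procrustes-style) estimate, using NumPy, is sketched below; it is our illustration of the SVD route, not the authors' code:

```python
import numpy as np

def procrustes_rotation(T_sub, T_full):
    """Orthogonal rotation C that best maps the submodel scores T_sub onto
    the total-model scores T_full in the least-squares sense: take the SVD
    of the cross-product matrix, then C = U V'."""
    U, _, Vt = np.linalg.svd(T_sub.T @ T_full)
    return U @ Vt

# a pure sign flip of factor 1 is undone exactly by the estimated rotation
T = np.array([[1.0, 0.2], [0.3, -1.0], [0.5, 0.4]])
C = procrustes_rotation(T @ np.diag([-1.0, 1.0]), T)
```

Because C is constrained to be orthogonal, it can only reflect, permute and rotate the factors, which is exactly the kind of trivial difference between submodels that the method wants to remove.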
By inserting these values into T_{-m,A} at the right positions, we can now calculate the full rotated score matrix of submodel -m. We denote the rotated matrix with a tilde, and the augmented score matrix from submodel -m is denoted with a subscript -m,m.
(15)
(16)
2.4.2 Rotating the loadings. The matrices of X- and Y-loadings for submodel -m have the same dimensions as the loading matrices of the full model. They can therefore be rotated without augmentation.
(18)
As with the score values, these variances can be used to draw approximate confidence regions in the loading plot and to determine whether or not two variables are overlapping and thus contain the same information.
2.4.3 Rotation of the loading weights. Rotation of the loading weights W_A (7) is a little more complicated than rotation of scores and loadings. The rotated version of the loading weights is proposed as:

W̃_{-m,A} = [X_{-m} W_{-m,A} W'_{-m,A} P_{-m,A}] C_{-m}^{-T} [0 (P_A' W_A)^{-1}]   (19)

where the column of zeros is needed because the matrix C_{-m} was estimated from equation (12), where an extra column is appended.
Similar to the other model parameters, the variance of each element in the loading weight matrices can now be estimated as:

s²(w_ka) = ((M − 1)/M) Σ_{m=1}^{M} (w̃_{-m,ka} − w_ka)²   (20)
Having variance estimates of the individual loading weights opens up a new possibility in variable selection. It will then be possible to do a significance test of each variable k in each factor a. Values w_ka that are not significantly different from zero can be forced to zero, after which the vector w_a is re-orthogonalised. This procedure will then yield variable selection where it is possible to remove variables in only some of the factors, while leaving them in for other factors.
As further factors are calculated and the information left in the dataset decreases, more and more variables will become insignificant, with their corresponding loading weight set to zero. Finally, the loading vector w_a will be reduced to the zero vector, and no further factors need to be calculated. Thus, the procedure would yield automatic selection of the number of factors to calculate, with integrated variable selection. The automatic deletion of insignificant variables is expected to yield more stable models that are also easier to interpret due to the reduced number of variables in each factor.
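A minimal sketch of this significance screen for one factor follows; the function, the z-threshold, and the plain rescaling (a stand-in for the full re-orthogonalisation against earlier factors) are our own simplifications:

```python
def prune_loading_weights(w, se, z=1.96):
    """Force loading weights that are not significantly different from zero
    (|w_k| <= z * se_k) to zero, then rescale the surviving vector to length
    one. Returns (new_weights, stop), where stop=True signals that all
    weights were insignificant and no further factors are needed."""
    kept = [wk if abs(wk) > z * sk else 0.0 for wk, sk in zip(w, se)]
    norm = sum(v * v for v in kept) ** 0.5
    if norm == 0.0:                     # every weight insignificant: stop
        return kept, True
    return [v / norm for v in kept], False

w_new, stop = prune_loading_weights([0.8, 0.05, -0.6], [0.1, 0.1, 0.1])
```

The stop flag mirrors the stopping rule described above: once a whole loading-weight vector is zeroed out, the factor extraction terminates.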
was carried out. The parameter of interest in the simulation was the variance of the regression coefficients in a full-rank OLS solution to MLR regression, i.e. a bilinear PCR or PLSR model with the maximum possible number of factors.
A matrix X with 300 samples and 3 variables was drawn with random, evenly distributed values between 0 and 1. The regressand y was calculated from true regression coefficients β = [0 1 2]' and random noise e which was drawn from the distribution N(0, 1²).
The dataset was then split up in several different ways with M ranging
from 2 to 300, corresponding to the extremes of splitting in two and leave-
one-out. For each value of M, the regression coefficients b were estimated and
the variance of the second element in b was estimated from equation (4) . The
whole procedure was then repeated 500 times with different noise e added
each time.
Since the true variance of the added noise (e) was known, it was possible to compare the jackknife-estimated values of s²(b) with the theoretically expected values. The theoretical variance of the regression coefficients from MLR (given that X is noise-free) is

Var(b) = σ²(X'X)^{-1}   (21)

If s²(b) follows a scaled χ² distribution,

ν s²(b)/σ_b² ~ χ²(ν)   (22)

then the variance of the variance estimate is

Var(s²(b)) = 2σ_b⁴/ν   (23)

where ν is the degrees of freedom in the estimate of s²(b). Since the variance of the regression coefficients was estimated many times in the Monte Carlo simulations, it was also possible to estimate the variance of the variance estimate, s²(s²(b)). If we then "guess" that the degrees of freedom in s²(b) is ν = M − 1, we can plot the variance of our variance estimate as a function of 2/(M − 1). If M − 1 is the correct number of degrees of freedom, this should give a straight line with intercept zero and slope σ⁴.
[Figure 2 appears here: jackknife variance of b plotted against the number of segments M, with ±2 standard errors of the variance of b and the theoretical value.]
As figure 2 shows, this is indeed the case. The above was also repeated with ν = M and ν = N (not shown here), but these (and other) alternatives gave a line with incorrect slope. Thus, we can conclude that equation (4) gives consistent estimates of the variance of b with M − 1 degrees of freedom.
A procedure that solves this problem is to calculate the correlation between a factor in the total model and the factors in the submodel. If the highest absolute value is the diagonal element in the correlation matrix, then set that element in C_{-m}^{exact} to -1 or 1 depending on the sign of the correlation. All other elements for that factor are set to zero, both for the total model and the submodel elements. Thereafter, the highest absolute correlation of each total model factor with respect to the submodel is found and set to -1 or 1 in C_{-m}^{exact}. This avoids two factors in the submodel being assigned to the same factor in the total model, and yields a matrix C_{-m}^{exact} that is guaranteed to have norm one and to account only for reflections and reorderings.
One could also envision other approaches that would consume more degrees of freedom. Starting from equation (11), the matrix C_{-m} could also be estimated by an OLS regression. This corresponds to projecting the total model onto the reduced model. A similar but more numerically stable approach would be to use a rank-reduced regression like PLSR instead of OLS regression in the estimation step.
(25)
The diagonal of this covariance matrix contains the values calculated from equation (5). This covariance will be applicable to the regression coefficients B from the regression Y = XB + F. In order to be applicable to the regression with the large input matrix, Y = ZC + F, the covariance matrix must be multiplied with V:
(26)
Figure 3: Original score plot with perturbations from leave-one-out jackknife. The centre of each "star" is the value from the complete model, and the circles denote the value when that sample is kept out of the model calibration.
normally distributed noise with variance 0.1. Before subjecting these X- and y-data to a PLSR with full leave-one-out crossvalidation, normally distributed noise with variance 0.1 was also added to X.
Figure 3 shows a score plot with all the values from each cross-validation segment, sometimes referred to as a stability plot. In the centre of each "star" are the score values from the model calculated with all the samples. The lines going out from each "star" show the score value of that sample in each of the cross-validated models. The value with a circle on it denotes the value of that sample in the segment where the sample itself was left out, and thus had no influence on the model. Samples that are outliers will tend to get a very different score value when they are not included in the model, and thus the score value denoted with a circle will be further away from the centre than the other score values.
Note that several samples flip over, change sign or otherwise show large deviations that are not related to the uncertainty of the sample. This is due to the rotational freedom of bilinear models as described in the beginning of
Figure 4: Rotated score plot. The centre of each "star" is the value from the complete model, and the circles denote the value when that sample is kept out of the model calibration.
section 2.4. As a consequence, the variations between the values in figure 3 are unsuitable for calculating uncertainties.
Figure 4 shows the score plot after each of the submodels has been rotated as described in equation (15). The picture is now much clearer, and the variance left can be assumed to be due to the uncertainty of the score values. Note that for each sample there can be a quite large difference between the mean of all the obtained values and the value from the total model. This is the rationale for choosing the total model as the reference value and not the mean (cf. the discussion after equation (5)).
4 Conclusion
An improvement of the jackknife rotation method by Martens & Martens [3] has been proposed for estimating the uncertainty in the bilinear model parameters with the use of the jackknife. The method works by rotating each of the submodels towards the main model before the values are used to estimate variances. The rotation matrix can be estimated in several ways, and some of the alternatives are discussed.
References
[1] Stone M. (1974). Cross-validatory choice and assessment of statistical prediction. J. Roy. Stat. Soc. B Met. 36 (1), 111-147.
[2] Efron B. (1982). The Jackknife, the Bootstrap, and other resampling plans. CBMS-NSF Regional Conference Series in Applied Mathematics. Society for Industrial and Applied Mathematics, Philadelphia, Pennsylvania.
[3] Martens H., Martens M. (2000). Modified Jack-knife estimation of parameter uncertainty in bilinear modelling by partial least squares regression (PLSR). Food Qual. Prefer. 11 (1), 5-16.
[4] Tukey J.W. (1958). Bias and confidence in not quite large samples. Ann. Math. Stat. 29, 614.
[5] Shao J., Wu C.F.J. (1989). A general theory for jackknife variance estimation. Ann. Stat. 17 (3), 1176-1197.
[6] Efron B., Tibshirani R.J. (1998). An introduction to the Bootstrap. Chapman & Hall, New York.
[7] Martens H., Høy M., Westad F., Folkenberg D., Martens M. (2001). Analysis of designed experiments by stabilised PLS Regression and jack-knifing. Chemometr. Intell. Lab. 58 (2), 151-170.
[8] Martens H., Martens M. (2001). Multivariate Analysis of Quality. An Introduction. J. Wiley & Sons Ltd, Chichester, UK.
1 Introduction
Mosaic displays, introduced by Hartigan and Kleiner [6], have been generalized to multi-way tables and extensively developed for visual inference of independence using mosaic plots by Friendly [4], [5]. Meyer et al. [11] considered visual inference of contingency tables using association plots, mainly for the case of 2-way tables. Other sources for work on mosaic plots are Hofmann [7], [8] and Unwin [12]. Most of the statistical packages available today have implemented mosaic displays (SAS, S-Plus, R, Minitab, and others).
The conventional mosaic plot graphically represents contingency tables using tiles whose size is proportional to the cell count. Figure 1 gives the mosaic plot of the Titanic data [3] as implemented in R [10]. This data will be explained in more detail in the next section. The plot is informative when we are well trained in reading it. Our experiments with graduate students showed that the features in the mosaic plot are confusing and misleading if more than 2 variables are involved in the plot. The reason behind this could be the limitation of human perception. Firstly, this could be explained by Stevens' law of dimensionality. Stevens' law states that perceived scale in absolute measurements is the actual scale raised to a power, where the power is as follows: for linear features, .9-1.1; for area features, .6-.9; for volume, .5-.8.
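The power-law relationship above is easy to make concrete; the exponent 0.7 below is an illustrative value from the quoted .6-.9 range for areas:

```python
def perceived_ratio(actual_ratio, exponent=0.7):
    """Stevens' power law: the perceived magnitude ratio is the actual
    ratio raised to the feature's exponent (about .6-.9 for areas;
    0.7 is used here purely for illustration)."""
    return actual_ratio ** exponent

# an area drawn 10x larger reads as only about 5x larger: 10 ** 0.7 ≈ 5.01
r = perceived_ratio(10.0)
```

This is exactly the misperception the next paragraph describes for map areas.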
Stevens' law suggests that physical relationships that are not represented in linear features can be grossly misperceived. For example, a lake represented on a map with an area graphically 10 times larger than another will be perceived as only 5 times larger, as noted in Catarci et al. [1]. Since the
278 Moon Yul Huh
[Figure 1 appears here: mosaic plot of the Titanic data.]
$$I = \sum_{i=1}^{[p/2]-1} (v_{2i} - 1) \prod_{j=i+1}^{[p/2]} n_{2j} + v_{2[p/2]}$$

$$J = \sum_{i=0}^{[(p-1)/2]-1} (v_{2i+1} - 1) \prod_{j=i+1}^{[(p-1)/2]} n_{2j+1} + v_{2[(p-1)/2]+1}$$
where $[x]$ denotes the largest integer not exceeding $x$.

We now give the algorithm to construct the values of the variables, $v$, when an instance belongs to a cell $F(I, J)$. From the row index $I$, the variables with even indices, $v_2, v_4, \ldots, v_{2[p/2]}$, will be constructed, and from the column index $J$, the variables with odd indices, $v_1, v_3, \ldots, v_{2[(p-1)/2]+1}$, will be constructed. The algorithm follows.

The values of the odd-indexed variables $v_1, v_3, \ldots, v_{2[(p-1)/2]+1}$ obtained from $J$ are:
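As an illustration of this index construction, the mapping can be sketched in Python (a sketch with names of my choosing; it composes the even- and odd-indexed category values into mixed-radix row and column indices, consistent with the formulas for I and J above, and also implements the inverse mapping described in the text):

```python
def mixed_radix(pairs):
    """1-based mixed-radix index from (value, size) digit pairs,
    first digit most significant."""
    idx = 1
    for value, size in pairs:
        idx = (idx - 1) * size + value
    return idx

def cell_index(v, n):
    """Cell (I, J) of the mosaic array F for 1-based category values
    v = (v1, ..., vp) with category counts n = (n1, ..., np).
    Even-indexed variables give the row index I, odd-indexed the column J."""
    I = mixed_radix([(v[k], n[k]) for k in range(1, len(v), 2)])  # v2, v4, ...
    J = mixed_radix([(v[k], n[k]) for k in range(0, len(v), 2)])  # v1, v3, ...
    return I, J

def variables_from_cell(I, J, n):
    """Inverse mapping: reconstruct v from the cell (I, J)."""
    def decode(idx, sizes):
        vals, idx = [], idx - 1
        for size in reversed(sizes):
            vals.append(idx % size + 1)
            idx //= size
        return vals[::-1]
    p = len(n)
    v = [0] * p
    for pos, val in zip(range(1, p, 2), decode(I, [n[k] for k in range(1, p, 2)])):
        v[pos] = val
    for pos, val in zip(range(0, p, 2), decode(J, [n[k] for k in range(0, p, 2)])):
        v[pos] = val
    return tuple(v)

# Titanic-sized example: n = (4, 2, 2, 2) for Class, Sex, Age, Survived
n = (4, 2, 2, 2)
print(cell_index((3, 2, 1, 2), n))  # -> (4, 5)
```

The two functions are inverses of each other, so every instance maps to exactly one cell of F and back.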
The above procedure works for unsupervised learning. With supervised learning, we have a target variable. We assume here that the last variable, variable $p$, is the target. In this case, we build $F$ with $p - 1$ variables. The frequencies in the cell $(I, J)$, or $F(I, J)$, are divided into $n_p$ different categories. In this case, it is convenient to express $F$ in three-dimensional form as $F(I, J, K)$, $K = 1, \ldots, n_p$.
 57   5   14   11   75   13  192    0
118   0  154    0  387   35  670    0
140   1   80   13   76   14   20    0
  4   0   13    0   89   17    3    0
Table 2 gives the mosaic array F for the Titanic data when survive is the target variable. Implementation of this mosaic plot can be accomplished by assigning different colors to the different categories. For the Titanic data, we may assign survived as the target variable. The conventional mosaic plot and the line mosaic plot of the Titanic data for this case are given in Figure 4 and Figure 5, respectively.
 57   5   14   11   75   13  192    0
140   1   80   13   76   14   20    0
Table 2: Mosaic array F of the Titanic data with survive as the target variable.
Line mosaic plot: algorithm and implementation 283
Figure 4: Mosaic plot of the Titanic data when survived is the target variable.
Figure 5: Line mosaic plot of the Titanic data when survived is the target variable.
passengers survived. The proportion of survivors in the 3rd and crew classes can even be estimated visually by counting the bars in the plot. For the {crew, adult, male} combination, the proportion can be estimated as 2/7. For the {3rd, adult, male} combination, the proportion is less than 1/4. For the female case, we can observe directly from the plot that the survival proportion is much higher, except for the {3rd, adult} combination. Although there are few child passengers, the plot clearly shows that most of the children survived, except in the 3rd class.
Figure 6 gives the process of obtaining a line mosaic as implemented in hDAVIS [9]. hDAVIS is freely available at the following website:
http://stat.skku.ac.kr/~myhuh/davis.html
Figure 6: Line mosaic plot as implemented in DAVIS.
References
[1] Catarci T., D'Amore F., Janecek P., Spaccapietra S. (2001). Interacting with GIS: from paper cartography to virtual environments. Unesco Encyclopedia on Man-Machine Interfaces, Advanced Geographic Information Systems, Unesco Press.
[2] Cleveland W.S., McGill R. (1985). Graphical perception and graphical methods for analyzing scientific data. Science 229, 828-833.
[3] Dawson R.J.M. (1995). The "unusual episode" data revisited. Journal of Statistics Education 3 (3), 1-7.
[4] Friendly M. (1994). Mosaic displays for multi-way contingency tables. Journal of the American Statistical Association 89, 190-200.
[5] Friendly M. (1999). Extending mosaic displays: marginal, partial, and conditional views of categorical data. Journal of Computational and Graphical Statistics 8, 373-395.
Acknowledgement: This work was supported by the Samsung Research Fund (2003) of Sungkyunkwan University.
Address: M.Y. Huh, Department of Statistics, Sungkyunkwan University, Chongro-Ku, Seoul, Korea
E-mail: myhuh@skku.ac.kr
COMPSTAT'2004 Symposium © Physica-Verlag/Springer 2004
1 Introduction
Cyber attacks on computer networks or personal computers have become major threats to nearly all operations in society. Methods to thwart such attacks are urgently needed. The problem of detecting unusual behavior in data streams occurs in many fields, such as disease surveillance, nuclear product manufacturing, and phone and credit card use. Historically, the manufacturing and financial industries have relied on conventional statistical process monitoring tools, such as control charts and process flow diagrams. Such tools are reliable and appropriate because the data streams can be stratified into reasonably independent series. For example, monitoring a customer's credit card use relies on an analysis of the data from the customer's past charging amounts and frequencies. This data stream is a much smaller data set than the entire database, with events occurring irregularly but not frequently; moreover, one customer's data stream can be considered independent of other customers' data streams. In contrast, Internet traffic data are virtually continuous (limited only by the resolution of the time clock that captures them), and the data for one system involve hundreds of thousands of other computer or network systems.

Tools for monitoring such data are essential. Conventional statistical analysis often assumes that data follow a mathematically tractable probability distribution function and will yield valid estimates of the parameters of this distribution. Such approaches cannot be used on millions of data points.
288 Karen Kafadar and Edward J. Wegman

Graphical tools for streaming data offer hope of identifying potential cyber-attacks, particularly when the tools are tailored for the application. Features of Internet traffic data are described in Section 2.
Even with novel graphical displays for massive data streams, however, a characterization of "typical" behavior is still needed, so that relevant graphical tools can be made more sensitive to capturing exotic or abnormal patterns. Two approaches to the detection problem through visualization are discussed in this article. Section 3 describes a "drill-down" approach to viewing large data sets, illustrated on a data set of 135,605 records collected over a one-hour period at George Mason University. Section 4 describes a second approach, "evolutionary graphical displays", which present the data only within a narrow time window (e.g., 10 minutes); early data disappear as newer, more recent data come into view. Two examples are the "waterfall diagram" and the "skyline display". Section 5 offers a summary and proposals for further work.
encryption (https) operates from port 443; the real time stream control protocol (rtsp) uses port 554 for QuickTime streaming movies. The second range consists of registered ports, numbered 1024 to 49151; for example, Sun has registered port 2049 for its network file system (nfs). The remaining 16384 (2^14) ports, numbered 49152 to 65535, are dynamic or private ports. Unprotected ports (source ports or destination ports) are prime candidates for intrusion; too much traffic on a given port within a short time frame may indicate a potential attack. In this data set, all ports numbered 10000 or above were coded simply as "port 10000".
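The three port ranges and the recoding used in this data set can be sketched as follows (Python; the paper's own computations were done in R, and the function names here are mine):

```python
def port_category(port: int) -> str:
    """Classify a TCP/UDP port number into the three ranges described
    in the text."""
    if not 0 <= port <= 65535:
        raise ValueError("port number out of range")
    if port <= 1023:
        return "well-known"       # e.g. 443 (https), 554 (rtsp)
    if port <= 49151:
        return "registered"       # e.g. 2049 (nfs, registered by Sun)
    return "dynamic/private"      # the remaining 2**14 ports

def code_port(port: int) -> int:
    """Recoding used in this data set: ports 10000 and above become 10000."""
    return min(port, 10000)

print(port_category(554), code_port(56612))  # -> well-known 10000
```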
Size of session
Internet traffic data are sent in "packets". The "size" of an Internet session can be measured in several ways: duration (e.g., number of seconds), number of packets, and number of bytes. Typically, these numbers will be correlated, but not in any specific deterministic way. A machine may send many packets with few bytes, or rather fewer full-sized packets; either situation may signal a potential attack on a system.
Sample data
Internet traffic data are being collected at George Mason University; a sample of ten records from a data set collected over the course of one hour is shown in Table 1. Column 1, labeled time, denotes the clock time (in seconds from an origin) at which the Internet session began; duration or len represents the duration or length of the session in seconds; SIP and DIP are the source and destination IP addresses, respectively; DPort and SPort are the destination and source port numbers, respectively; and Npacket and Nbyte indicate the number of packets and number of bytes transferred in the session. In the plots below, the variable time is shifted by 39603 seconds and scaled by 1/60, so that the first session starts at 0.01067 minutes past the start of the hour, and the last session starts at 59.971 minutes past the start of the hour.

Table 2 summarizes the distribution of the values in each column with the five-number summary [4] supplemented with the 10th and 90th percentiles for each column (minimum, lower 10%, lower fourth, median, upper fourth, upper 10%, maximum). The "size" variables are all very highly skewed towards the upper end; the distance between the 90th percentile and the maximum is 2-3 orders of magnitude greater than the distance from the 90th percentile to the minimum. One session involved over 35 million bytes and almost 66,000 packets, although sessions of 1,832 bytes and 12 packets were more typical. The next section provides some displays of these data, with the objective of trying to characterize "typical" behavior, so that "atypical" behavior can be noted more readily.
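The extended summary used in Table 2 is straightforward to compute; a sketch in Python on hypothetical byte counts (Table 2 itself is not reproduced here, and Tukey's fourths are approximated by the 25th/75th percentiles):

```python
import numpy as np

def seven_number_summary(x):
    """Five-number summary extended with the 10th and 90th percentiles:
    (min, 10%, lower fourth, median, upper fourth, 90%, max)."""
    return tuple(np.percentile(x, [0, 10, 25, 50, 75, 90, 100]))

# Hypothetical byte counts illustrating the heavy right skew described
# in the text: a few sessions dwarf the typical 1,832-byte session.
nbyte = np.array([0, 0, 147, 840, 1832, 1832, 3500, 8000, 26000, 35_000_000])
summary = seven_number_summary(nbyte)
print(summary[3])  # median -> 1832.0
```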
Table 1: Sample of Internet traffic data from George Mason University.
Table 2: Summary statistics from the Internet traffic data set (135,605 sessions).
"zooming in", or "drilling down", into this region, as one does on a geographical map, specific features can be better observed. An alternative to this "drill-down" approach (steps of power magnification) is a logarithmic transformation, which allows one to view the points by scanning across the screen rather than by magnifying regions of the space. We describe this approach below.
Graphical displays of Internet traffic data 291
Figure 1: Kernel density estimates of log(1 + sqrt(Nbyte)), four separate ranges.
Density plots
Figure 1 shows kernel density estimates [3] of log.byte = f(Nbyte), where f(x) = log(1 + sqrt(x)). We use the transformation f(x) = log(1 + sqrt(x)) for all three size variables to spread out their values (values of x near the low end of the scale are not spread out as far as they would be with the simple log(x) transformation; f'(x) < 1/x, much more so for small x). Likewise, log.len = f(duration) and log.pkt = f(Npacket). All calculations and graphs are made using the open-source software R, available from http://www.cran.r-project.org. A small peak at 0 reflects 2611 zeroes; the next largest byte size is 147. The data are clearly skewed, and local peaks of high density appear where log.byte ≈ 3.4, 3.8, 4.1, 4.5, and 5.1 (Nbyte ≈ 840, 1400, 3500, 8000, 26000).
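The transformation and its inverse are easy to verify numerically; a short Python sketch (the paper used R, and the function names are mine):

```python
import math

def f(x: float) -> float:
    """Size transformation used for all three size variables:
    f(x) = log(1 + sqrt(x)); defined at 0, unlike log(x)."""
    return math.log(1 + math.sqrt(x))

def f_inverse(y: float) -> float:
    """Invert the transform to read raw sizes off the plot axes."""
    return (math.exp(y) - 1) ** 2

# The density peak at log.byte ~ 3.4 corresponds to Nbyte ~ 840:
print(round(f_inverse(3.4)))  # -> 839
```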
Distributions of session size variables
Boxplots can be useful to display the relationship between two variables, as in Figure 2 for the two variables log.len = f(duration) (y-axis) and log.byte = f(Nbyte). The first box contains the 2611 values for which Nbyte is zero; the second box contains the next 1216 values, where Nbyte ranges from 1 to 365 (0 < log.byte ≤ 3); subsequent bins are 0.1 wide, except the last five bins. This display shows a relatively stable trend up until the last few bins, but is otherwise not very useful for outlier detection, since outliers are prevalent in each bin. The boxplot display does confirm general
[Figure 2: boxplots of log.len versus binned log.byte; x-axis: log(1 + sqrt(Nbyte)).]
trends: sessions with more bytes tend to last longer, and most sessions are short.

The preponderance of relatively short sessions can be seen in Figure 3(a), which displays the session durations as horizontal lines that extend from the start time to the end time. Because these sessions are reported in the order in which they began, the session start times range from time 0 (bottom line) to 59.971 (nearly the end of the hour). Figure 3(b) shows the same information, but each line is shifted back to 0. With continuously monitored data, the session duration lines would continue past the censoring point (illustrated as a red dotted line in Figure 3b). Relatively few sessions are "censored" (i.e., not ended within the hour), reflecting the fact that most sessions are short: 93% of the sessions lasted less than 30 seconds. Figure 4 shows a barplot of the number of active sessions during each 30-second subset of this one-hour period (a time frame of 30 seconds is selected to minimize the correlation between counts in adjacent bars). The mean number of active sessions in any one 30-second interval during this hour is 923, with standard deviation 140, suggesting a rough upper "3-sigma limit" of 1343 sessions. [Because these numbers are counts, a square root transformation may be appropriate; see Tukey [4]. The mean and standard deviation of the square roots of the counts are 30.29 and 2.23, respectively, resulting in an approximate upper "3-sigma limit" of (30.29 + 3 · 2.23)² = 1367, very close to the limit on the raw counts, since the Poisson distribution with a high mean is approximately Gaussian.] The maximum number of sessions in any one of these 120 30-second intervals is 1299, below the "3-sigma limit". This plot could be monitored continuously in time, dropping older bars off the left side of the plot and adding new bars on the right; the upper 3-sigma limit could depend upon the hour, day, or week of the year.
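The 3-sigma arithmetic above, on both the raw and square-root scales, works out as follows (Python sketch using the values quoted in the text):

```python
# "3-sigma" upper limits for the active-session counts.
mean_raw, sd_raw = 923, 140
raw_limit = mean_raw + 3 * sd_raw
print(raw_limit)  # -> 1343

# On the square-root scale (Tukey's transformation for counts),
# back-transformed to the count scale:
mean_sqrt, sd_sqrt = 30.29, 2.23
sqrt_limit = (mean_sqrt + 3 * sd_sqrt) ** 2
print(round(sqrt_limit))  # about 1367, close to the raw-count limit

# The observed maximum of 1299 active sessions stays below both limits.
assert 1299 < min(raw_limit, sqrt_limit)
```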
[Figure 3: session durations as horizontal lines, (a) ordered by start time, (b) shifted to a common origin. Figure 4: barplot of the number of active sessions in each 30-second interval.]
Plots such as this one could be monitored so that potential attacks on networks can be identified in real time. Since an infinite number of patterns can occur, a collection of likely patterns must be catalogued, so that the statistical significance of their detection can be quantified. Algorithms that identify too many false positive patterns would result in an unnecessary number of shutdowns and service denials.
Relationships between pairs of size variables
Figure 6 shows a series of plots of the transformed Nbyte variable, log.byte, versus the transformed Npacket variable, log.pkt = log(1 + sqrt(Npacket)), in four separate ranges. Panel (a), observations for which log.pkt is between 1 and 2 (Npacket between 3 and 40), shows a generally increasing trend, simplified in Panel (b) with boxplot displays (the labels on the x-axis are the same as those in panel (a), multiplied by 10). Panel (c) shows one line of 293 points around log.pkt = 2.77 (Npacket ≈ 226) and log.byte = 5.82 (Nbyte ≈ 112,792) [all come from destination port 80 (web), source IP 23070, and destination IP 443 (https)]; and another set of 39 points around log.pkt = 2.84 (Npacket ≈ 259) and log.byte = 5.64 (Nbyte ≈ 78,341) [all have SIP 4837, DIP 56612, DPort 80]. Panels (c) and (d) show many points at high values of log.pkt along two lines with approximately unit slope but
Figure 6: Plots of log.byte versus log.pkt, in 4 subranges of log.pkt.
a set of points in the upper right corner of the plot, 2.5 ≤ log.len ≤ 3 and 6 ≤ log.byte ≤ 7, which is discussed below in connection with Figure 9(c). Figure 8 shows three uncommonly straight lines of points: 377 points in Figure 8(a), in the region where 1.17 ≤ log.len ≤ 1.27 and log.byte ≈ 5.0; 292 points in Figure 8(b), where 1.69 ≤ log.len ≤ 1.79 and log.byte ≈ 5.8; and 60 points in Figure 8(d), where 2.7 ≤ log.len ≤ 2.9 and log.byte increases from 6.4 to 7. The points in these sets of lines have in common (1) SIP = 1681, SPort = 10000, DPort = 25 (smtp mail) (recall that SPort 10000 actually refers to all source ports numbered 10000 or higher); (2) SIP = 23070, DIP = 336, DPort = 80 (web); and (3) DPort = 554 (rtsp), SPort = 1276 to 2070. For a given session, initial ports are assigned at random, but subsequent ones are assigned by an incrementing pattern characteristic of the operating system. Hence, a string of SPort numbers may signal a potential attacker seeking information about an operating system to invade.
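The paper stops short of giving a detection algorithm for such strings of source ports; one hypothetical way to flag them is to look for long runs of small increments (a Python sketch, with the step threshold as an illustrative choice, not from the paper):

```python
def longest_increment_run(ports, max_step=3):
    """Length of the longest run of source-port numbers increasing by a
    small step, the OS assignment pattern described in the text. A long
    run across many sessions may signal a scanner."""
    if not ports:
        return 0
    best = run = 1
    for prev, cur in zip(ports, ports[1:]):
        run = run + 1 if 0 < cur - prev <= max_step else 1
        best = max(best, run)
    return best

# SPorts climbing toward 2070, as in the rtsp sessions of Figure 8:
print(longest_increment_run([1276, 1277, 1279, 1280, 1282, 5050, 1283]))  # -> 5
```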
Stratification by groups of destination ports
This hour of Internet activity involved 380 unique destination ports (Table 2). DPort 80 (web) is the most common, comprising 116,134 of the 135,605 records. The next most common destination port is DPort 443 (secure web, https), used 11,627 times, followed by DPort 25 (mail, SMTP), accessed 6,186 times. Ports 554 (rtsp), 113, 10000 (or higher), and 8888 occur 200, 128, 97, and 94 times, respectively. Twelve destination port numbers during
[Figures 7-9: scatterplots of log.byte versus log.len. Panel annotations recovered from the extraction: "SIP 1681, DPort 25, SPort 10000, 43-50 packets"; "292 points: SIP 23070, DIP 336, DPort 80"; "Dest Ports 113, 554, 8888, 10000 (519)"; "Other Dest Ports (1139)".]
this hour occurred between 5 and 29 times in the file; 5 ports occurred only 4 times, 8 occurred only 3 times, 47 destination ports occurred only twice, and 293 destination ports occurred only once. Displaying all 135,605 points on one plot is not very informative, so instead we subdivide the session records into groups according to their destination ports. Because over 85% of these data are web sessions (DPort = 80), a plot of log.byte versus log.len for only the web sessions looks like Figure 7 (all data). Figure 9 shows scatterplots of two variables, conditioned on values of a third (non-web DPorts): DPort 25 (smtp mail) in panel (a); 443 (https) in panel (b); 113, 554, 8888, 10000 in panel (c); and the remaining 310 destination ports in panel (d). Panel (c) shows that the line of points in the upper right corner of Figure 7 arises from sessions with DPort 554 (rtsp), and that the sessions from DPort 8888 occur in a small cluster near log.len = 2 and log.byte = 5. Forty of the 52 points in the upper right corner of Figure 9(d), where log.byte ≈ 4 + 0.5 log.len, correspond to DPort numbers 119 and 1755, but are otherwise unrelated (some "patterns" can be spurious).
Monitoring frequency of source IP addresses
These same plots can be constructed when the data are subsetted by source IP address (SIP), as opposed to destination port number (DPort). The number of source IP addresses that may be active during a given hour
[Figure: time of session on the x-axis.]
4 Evolutionary displays
Wegman and Marchette [6] advocate a new approach to visualizing massive data sets, called "evolutionary displays". Massive data sets are too large to display using graphs and plots that are designed for moderate data sets of
Figure 11: Skyline plots. (a): DPort access; (b): Source IP access.
fixed size. The concept behind evolutionary displays is to exhibit data within the most current time frame, dropping off old data and making room for the most recent data. For example, in Figure 10, new data come in on the right as old data on the left are pushed off the screen. Wegman and Marchette [6, p. 906, Figure 4] use this concept to define a waterfall display, useful for monitoring the frequency of source ports.
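The core mechanism of an evolutionary display, a time window that drops old observations as new ones arrive, can be sketched in a few lines (Python; a minimal illustration of the idea, not the waterfall display of Wegman and Marchette itself):

```python
from collections import deque

class EvolvingWindow:
    """Keep only the observations inside the most recent time window;
    older data drop off as new data arrive."""

    def __init__(self, width: float):
        self.width = width          # window width, e.g. 10 minutes
        self.buffer = deque()       # time-ordered (time, value) pairs

    def add(self, t: float, value):
        self.buffer.append((t, value))
        while self.buffer and self.buffer[0][0] < t - self.width:
            self.buffer.popleft()   # old data pushed off the display

    def current_times(self):
        return [t for t, _ in self.buffer]

w = EvolvingWindow(width=10)
for t in [0.5, 3.0, 9.0, 12.0, 14.5]:
    w.add(t, "session")
print(w.current_times())  # -> [9.0, 12.0, 14.5]
```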
Skyline plots
Most destination port numbers occur only once or twice during the hour; of the 380 distinct DPorts, 293 occurred only once, 47 occurred twice, 8 occurred 3 times, and 5 occurred 4 times. The remaining 27 ports occurred more than 4 times; the top five are DPort 80 (web, 116,134 times), 25 (mail-smtp, 6,186 times), 443 (secure web, 11,627), 554 (rtsp, 200 times), and 113 (128 times). Setting aside the "well-known" ports 0-1023, we plot the occurrence of destination ports numbered 1024 and above, which should arise more or less at random, and flag as unusual any DPort that is referenced over 10 times. Figure 11 shows two such plots: one for DPort (color changes indicate DPort access counts greater than 10, indicative of potentially high traffic on this destination port), and one for SIP in the first 10,000 session records (color changes indicate SIP occurrences of more than 50). Four unusually frequent source IP addresses are immediately evident: 4837, 13626, 33428, and 65246, which occur 371, 422, 479, and 926 times, respectively, in the first 10,000 sessions. The construction of this plot resembles the tracing of a skyline, so we call it a "skyline plot". Limits on skyline plots may depend upon time of day, day of week, month, or season.
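The flagging rule behind the skyline colouring is just a threshold on occurrence counts; a Python sketch (the stream of SIP codes below is hypothetical, though 4837 is one of the four frequent SIPs noted in the text):

```python
from collections import Counter

def flag_frequent(values, threshold):
    """Count occurrences and return those exceeding the threshold --
    the colouring rule of the skyline plots (DPort: more than 10
    accesses; SIP: more than 50 occurrences)."""
    return {v: c for v, c in Counter(values).items() if c > threshold}

# Hypothetical stream of source IP codes:
sips = [4837] * 371 + [101, 102, 103] * 10
print(flag_frequent(sips, threshold=50))  # -> {4837: 371}
```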
References
[1] Marchette D.J. (2001). Computer intrusion detection and network monitoring. Springer.
[2] Khumbah N.-A., Wegman E.J. (2003). Data compression by geometric quantization. In: Akritas M., Politis D.N. (eds.) Recent Advances and Trends in Nonparametric Statistics. North Holland Elsevier, Amsterdam.
[3] Silverman B.W. (1986). Density estimation. Chapman and Hall, London.
[4] Tukey J.W. (1977). Exploratory data analysis. Addison-Wesley, Reading, Massachusetts.
[5] Vardeman S.B., Jobe J.M. (1999). Statistical quality assurance methods for engineers. Wiley, New York.
[6] Wegman E.J., Marchette D.J. (2003). On some techniques for streaming data: A case study of Internet packet headers. J. Comput. Graph. Stat. 12 (4), 893-914.
[7] Wegman E.J., Marchette D.J. (2004). Statistical analysis of network data for cybersecurity. Chance, 9-19.
Acknowledgement: Funding from Grant No. F49620-01-1-0274 from the Air Force Office of Scientific Research, awarded to George Mason University, is gratefully acknowledged. Part of this research was conducted during the first author's appointment as faculty visitor at the National Institute of Standards and Technology.
Address: K. Kafadar, E.J. Wegman, University of Colorado-Denver and George Mason University
E-mail: kk@math.cudenver.edu; ewegman@galaxy.gmu.edu
Abstract: For the analysis of three-mode data sets (i.e., data sets pertaining to three different sets of entities), various component analysis techniques are available. These yield components that are summaries of the entities of each mode. Because such components are often interpreted in a more or less binary way, in terms of the entities most strongly related to them, it seems logical to actually constrain these components to have binary values only. In the present paper, such constrained models are proposed and algorithms for fitting these models are provided. In one of the variants, the components are constrained such that they correspond to nonoverlapping clusters of entities. Finally, a procedure is proposed for steering component values towards binary values, without actually requiring them to be binary, using penalties.

Clearly, when these models are fitted to data, we end up with component matrices A, B, and C, and, in the case of Tucker3 analysis, we also get a three-mode core array G as an outcome of the analysis.
The result of a three-mode analysis is a summary of the observation units, the variables, and the conditions by means of a number of components, and possibly a core array describing the relations between them. The component-wise interpretation, however, is not very easy, because it requires one to think in dimensions along which the observation units, variables, or conditions vary. Here the component weights indicate to what extent, for instance, the individuals can be described by the property defined by the component. Likewise, variables are related to the components for the variables to different extents. Now the interpretation of the components usually proceeds conversely: from the strengths of the relations of the variables to the components, one can interpret the meaning of the components. This interpretation is rather cumbersome if one discriminates precisely between different strengths of relations. Therefore, in practice, one tends to interpret components on the basis of the variables most strongly related to them, and one tends to ignore the less related variables. In effect, one thus binarizes the relations into 'sufficiently strong' and 'not sufficiently strong'. Thus one could say that the components are interpreted as if they refer to clusters of variables consisting of those variables that have the strongest relations with them. Similar cluster-based interpretations can be given to components describing individuals and conditions, if a priori information on the individuals and conditions is available. To enhance the interpretability of the component matrices, they are often subjected to simple structure rotations such as varimax [7], see also [8], but the clusters will always remain somewhat fuzzy (i.e., relations are never entirely binarized).
Now if, in practice, components tend to be interpreted as clusters, would it not seem more rational to model data in terms of cluster membership, and discard the information on strengths of relations? The idea of clustering all three modes simultaneously has been pursued by various authors. Clustering approaches involving the CANDECOMP/PARAFAC model have been proposed by Chaturvedi and Carroll [3] and Leenen et al. [11], where the latter authors use Boolean products rather than ordinary products. An extension of the latter Boolean model to the Tucker3 situation has been proposed by Ceulemans, van Mechelen and Leenen [2]. Surprisingly, except for a recent paper by Rocci and Vichi [13], straight-
Clustering all three modes of three-mode data 305
core array now will contain the within-cluster average scores in X; hence the core effectively summarizes the data in such a way that it gives the average score of the individuals in each cluster, averaged across the variables associated with the variable cluster at hand, and averaged across the conditions associated with the condition cluster at hand. When the clusters can be interpreted well, the core has a very easy interpretation too, simply in terms of 'cluster scores'.
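This 'cluster scores' reading of the core can be sketched numerically (Python with NumPy; an illustration of the averaging, not the author's algorithm, and it assumes binary nonoverlapping memberships with no empty clusters):

```python
import numpy as np

def cluster_average_core(X, A, B, C):
    """Core array of 'cluster scores': entry (p, q, r) is the average of
    X over individuals in cluster p, variables in cluster q, and
    conditions in cluster r. A, B, C are binary membership matrices
    with exactly one 1 per row (nonoverlapping clusters)."""
    totals = np.einsum('ijk,ip,jq,kr->pqr', X, A, B, C)
    sizes = np.einsum('p,q,r->pqr', A.sum(axis=0), B.sum(axis=0), C.sum(axis=0))
    return totals / sizes

# With every entity in its own cluster, the core reproduces X itself:
X = np.arange(8.0).reshape(2, 2, 2)
I2 = np.eye(2)
print(np.allclose(cluster_average_core(X, I2, I2, I2), X))  # -> True
```

Merging all entities of a mode into one cluster averages over that mode, down to the grand mean when all three modes have a single cluster.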
where $X_{-j}$ is written for $X_a - \sum_{h \neq j} a_h f'_h$, $a_j$ denotes the $j$th column of A, and $f_j$ denotes the $j$th row of F. A solution for minimizing (9) is given by Chaturvedi and Carroll [3]. A computationally slightly different procedure (with the same solution) can be derived as follows. Function (9) can be written as the sum of independent functions, elaborated as
308 Henk A.L. Kiers
see [12], see also [15]. Note that, if the inverses do not exist (as may come about when any of the component matrices has incomplete rank), then the inverse is replaced by a generalized inverse.
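The fallback from the ordinary inverse to a generalized inverse can be sketched as follows (Python with NumPy; the function name is mine, and this is an illustration of the substitution rather than the author's code):

```python
import numpy as np

def safe_inverse(M: np.ndarray) -> np.ndarray:
    """Ordinary inverse when M is square with full rank; otherwise the
    Moore-Penrose generalized inverse, as prescribed for component
    matrices of incomplete rank."""
    if M.shape[0] == M.shape[1] and np.linalg.matrix_rank(M) == M.shape[0]:
        return np.linalg.inv(M)
    return np.linalg.pinv(M)

full_rank = np.diag([2.0, 4.0])
rank_deficient = np.array([[1.0, 2.0], [2.0, 4.0]])  # rank 1
print(np.allclose(safe_inverse(full_rank) @ full_rank, np.eye(2)))  # -> True
```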
The above described steps for updating A, B, C, and G are followed by the computation of the loss function value. If this has decreased, then a new cycle of updates is started; if it has remained the same, then the ensuing solution is considered a candidate for the minimum of the loss function. Depending on how the procedure is started, this may be a local minimum of the function rather than the global minimum. It is therefore recommended to run the algorithm from several starts. One approach is to start from (very) many random starts, hoping thus to cover a wide range of (at least) locally optimal solutions, for which the chance that it contains the global minimum is high. Alternatively, or in addition, one may use a few starts that can be expected to have a high chance of leading to the global minimum. A suggestion for such 'rational' starts is given in the next subsection.
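The multi-start strategy itself is generic and can be sketched in a few lines (Python; `fit_once`, its interface, and the toy loss are illustrative choices, not the paper's algorithm):

```python
import random

def multistart_minimize(fit_once, n_random=20, rational_starts=()):
    """Run an iterative fitting routine from several random starts plus
    a few 'rational' starts, keeping the solution with the lowest loss.
    `fit_once` maps a start value to a (loss, solution) pair."""
    starts = list(rational_starts) + [random.random() for _ in range(n_random)]
    best_loss, best_sol = float('inf'), None
    for s in starts:
        loss, sol = fit_once(s)
        if loss < best_loss:
            best_loss, best_sol = loss, sol
    return best_loss, best_sol

# Toy example: one 'rational' start hits the global minimum exactly.
loss, sol = multistart_minimize(lambda s: ((s - 0.8) ** 2, s),
                                n_random=10, rational_starts=(0.8,))
print(loss)  # -> 0.0
```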
over A, where F is again written for $G_a(C' \otimes B')$. This function can be written as the sum of the independent functions
(15)
Schepers and van Mechelen [14] have proposed an algorithm for fitting this model, which also has not been published yet. It is planned to compare these algorithms in the near future.

An alternating least squares algorithm for minimizing (17) has been devised and programmed. The algorithm tends to require many iterations, but does indeed give solutions with the required properties. For instance, for data constructed on the basis of component matrices that were binary up to a few elements, the method indeed singled out these elements as different from the others. However, much more experience is needed to assess its usefulness in actual practice.
6 Conclusion
The present paper has offered methods for Tucker3 analysis with the component matrices constrained to be binary, and, in a special case, also such that the components have no overlap. The algorithms proposed work in the sense that they decrease the loss function value, but they appear, as usual with binary optimization problems, to be prone to hitting local optima. Some starting procedures have been proposed that worked well in some contrived examples, but the algorithms, as well as their starting procedures, need further testing, as well as comparison to competitors that have been proposed recently for the nonoverlapping case.

In addition to the methods where components are constrained to be fully binary, a procedure has been proposed for weakly imposing binariness, using penalty terms. Again, this procedure needs further testing. If it turns out to work well in practice, and if it is not very prone to hitting local optima, it could also be used for fitting the fully constrained model by gradually increasing the penalty parameters that regulate the strength of the constraints. Whether this or other procedures work best in dealing with the local optimum problem of Tucker3 with binary constraints is subject to further research.
References
[1] Carroll J.D., Chang J.-J. (1970). Analysis of individual differences in
multidimensional scaling via an N-way generalization of "Eckart-Young"
decomposition. Psychometrika 35, 283-319.
[2] Ceulemans E., van Mechelen I., Leenen I. (2003). Tucker3 hierarchical
classes analysis. Psychometrika 68, 413-433.
[3] Chaturvedi A., Carroll J.D. (1994). An alternating combinatorial opti-
mization approach to fitting INDCLUS and generalized INDCLUS mod-
els. Journal of Classification 11, 155-170.
[4] DeSarbo W.S. (1982). GENNCLUS: New models for general nonhierar-
chical clustering analysis. Psychometrika 47, 449-475.
[5] Gaul W., Schader M. (1996). A new algorithm for two-mode cluster-
ing. In: Bock H.-H., Polasek W. (eds.) Data analysis and information
systems. Springer, Heidelberg.
Clustering all three modes of three-mode data 313
1 Introduction
Panel studies in econometrics as well as longitudinal studies in biomedical
applications provide data from a sample of individual units where each unit
is observed repeatedly over time (age, etc.). Statistical analysis then usually
aims to model the variation of some response variable Y. In addition to its
dependence on some vector of explanatory variables X, the variability of Y
between different individual units is of primary interest.
For simplicity, we will assume a balanced design with T equally spaced
repeated measurements per individual. The resulting observations of n indi-
viduals can then be represented in the form (Yit, Xit), where t = 1, ..., T and
i = 1, ..., n. The simplest form of analysis is based on mixed effect models
of the form

Yit = β0 + Σ_{j=1}^p βj Xitj + Ui + εit,
where εit are i.i.d. error terms, while Ui represents individual random effects.
An important example in econometrics is the class of stochastic frontier
models. There Yit represents the production output of an individual firm i in
time period t, while Xit is a corresponding vector of production inputs. The Ui are then
316 Alois Kneip, Robin C. Sickles and Wonho Song
In the following we will assume that the Ui(t) can be considered as smooth
random functions. In many biometrical applications, where for example t
indicates age of an individual unit, smoothness can be considered as a stan-
dard assumption. In econometrics, where t usually indicates time, for a given
unit i the corresponding data {Yit, Xit}, t = 1, . .. ,T, represent an individual
time series. In this situation model (2) assumes that the residual time series
{Yit - β0 - Σ_{j=1}^p βj Xitj}, i = 1, ..., n, can be decomposed into a smooth
stochastic trend ui and i.i.d. white noise.
Traditional analysis relies on parametric models. Very often polynomial
approximations to the functions Ui are used. More generally, for some pre-
specified basis functions b1, ..., bL the Ui are modelled by Ui(t) = Σ_{r=1}^L θir br(t),
where θi1, ..., θiL are individual random coefficients. Analysis is then based
on the well-known methodology of mixed effect models. If additionally nor-
mality is assumed and if X and t are uncorrelated, likelihood estimation based
on the EM algorithm is often applied. In stochastic frontier analysis such an
approach has been used by Battese and Coelli [1] or Cornwell, Schmidt, and
Sickles [2] in order to model time-dependent individual inefficiencies.
In this paper we consider a nonparametric approach based on ideas
from functional data analysis as proposed by Kneip, Sickles and Song [6].
The functions Ui can be decomposed into Ui = w + Vi, where w(t) is a general
mean function and Vi(t) = Ui(t) - w(t). Model (2) can then be rewritten in
the form

Yit = Σ_{j=1}^p βj Xitj + w(t) + Vi(t) + εit  (3)
Note that the constant β0 is incorporated into w(t), and that the mean of
Vi(t) is zero.
For a given L, functional principal component analysis is then used to
estimate a best possible basis g1, ..., gL for approximating Vi by Vi(t) ≈
Σ_{r=1}^L θir gr(t). The approach possesses a number of advantages:
• The basis g1, ..., gL to be estimated corresponds to the best possi-
ble basis for approximating the Vi by an L-dimensional linear function
space. Any approximation Vi(t) ≈ Σ_{r=1}^L θir br(t) based on prespeci-
fied basis functions b1, ..., bL (e.g. polynomials or splines) possesses
a higher systematic error.
Functional data analysis and mixed effect models 317
suppose that E(Vi) = 0. Furthermore, let ||f|| = (∫ f(t)² dt)^{1/2} denote the usual
L²-norm for f ∈ L²[0,1], and set ⟨f*, f⟩ = ∫ f*(t) f(t) dt. The covariance
operator then is a generalization of the concept of a covariance matrix in
multivariate analysis of random vectors. The so-called covariance kernel is
defined as

σ(s, t) = E(Vi(s) Vi(t))
and the corresponding covariance operator Γ is defined by the relation
(Γf)(t) = ∫ σ(s, t) f(s) ds. Γ is a symmetric, positive operator and possesses a sequence of
finite eigenvalues l1 ≥ l2 ≥ ... as well as corresponding orthonormal eigen-
functions γ1, γ2, ... such that ||γr|| = 1 and ⟨γr, γs⟩ = 0 for r ≠ s. A
precise mathematical discussion of properties of Γ can, for example, be found
in Gihman and Skorohod [4].
The well-known Karhunen-Loève decomposition states that the func-
tions Vi can be decomposed in terms of the eigenfunctions:

Vi = Σ_r θir γr,  (4)

where θir = ⟨Vi, γr⟩. This decomposition possesses the following properties
(see for example [4]):
a) E(θir) = 0, r = 1, 2, ..., and Var(θi1) = l1 ≥ Var(θi2) = l2 ≥ Var(θi3) = l3 ≥ ...
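Property a) can be checked numerically on the empirical analogue: eigendecompose the sample covariance of discretized curves and compare the variances of the empirical scores with the ordered eigenvalues. A minimal sketch (numpy assumed; the simulated curves and all names are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
n, T = 200, 50
t = np.linspace(0.0, 1.0, T)

# Simulate centered curves v_i = theta_i1 * g1 + theta_i2 * g2 on a grid
g1 = np.sqrt(2.0) * np.sin(np.pi * t)
g2 = np.sqrt(2.0) * np.sin(2.0 * np.pi * t)
scores = rng.normal(size=(n, 2)) * np.array([2.0, 1.0])   # sd 2 > sd 1
V = scores[:, :1] * g1 + scores[:, 1:] * g2               # n x T

# Discretized covariance kernel sigma(s, t) and its eigendecomposition
S = V.T @ V / n
lam, gamma = np.linalg.eigh(S)
lam, gamma = lam[::-1], gamma[:, ::-1]                    # descending order

# Empirical scores theta_ir = <v_i, gamma_r>: mean ~ 0, variances ~ eigenvalues
theta = V @ gamma
```

With two true components, two eigenvalues dominate and the remaining ones are numerically negligible, matching the ordering in property a).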
(1/n) Σ_{i=1}^n Σ_{t=1}^T (Vi(t) - Σ_{r=1}^L θir gr(t))² ≤ (1/n) Σ_{i=1}^n min_{αi1,...,αiL} Σ_{t=1}^T (Vi(t) - Σ_{r=1}^L αir br(t))²  (8)

for any possible choice of br(t), t = 1, ..., T, r = 1, ..., L.
Note that Conditions (α)-(γ) do not impose any restriction, and they
introduce a suitable normalization which ensures identifiability of the com-
ponents up to sign changes (instead of θir, gr one may also use -θir, -gr).
If (6) holds for some suitable L, then there exist some gr such that (7) as
well as (α)-(γ) and (8) are satisfied.
Obviously the components gr depend on the realized Vi and on the sample
size n. Due to different normalization usually gr(t) ≠ γr,n(t). This does not
constitute a serious drawback for an empirical analysis based on (3) and (7).
In fact, in model (7) only the L-dimensional linear space spanned by g1, ..., gL
is identifiable. There are infinitely many possible choices of basis functions,
and by using conditions (α)-(γ) we select a particularly well-interpretable
basis. Asymptotically, as n, T → ∞, gr(t) as well as γr,n(t) will both con-
verge to γr(t) in probability. Under (6) the linear subspaces of R^T spanned
by the vectors {(gr(1), ..., gr(T))'}r=1,...,L, {(γr,n(1), ..., γr,n(T))'}r=1,...,L
and {(γr(1), ..., γr(T))'}r=1,...,L will coincide with high probability for large
samples.
How can the functional components gr in (7) be determined? There are essen-
tially two straightforward procedures which could immediately be applied if
the realized functions Vi were known. These algebraic methods will serve as
a basis of the practical, data-based methods to be presented in Section 3.
Method 1: Some simple algebra shows that, if the Vi were known, the com-
ponents gr could be determined from the eigenvectors of the empirical co-
variance matrix Σ̂n of V1 = (V1(1), ..., V1(T))', ..., Vn = (Vn(1), ..., Vn(T))':
(9)
Also note that Σ_{j=L+1}^T λj = (1/n) Σ_{t=1}^T Σ_{i=1}^n (Vi(t) - Σ_{r=1}^L θir gr(t))². If (7)
holds, then obviously Σ_{j=L+1}^T λj = 0.
By using some further algebra, see for example [5], one can then deduce
that all nonzero eigenvalues λr and hr of the empirical covariance Σ̂n and of
the matrix Mn are related by hr = Σ_i θir² = (n/T) λr. Moreover, the eigenvectors
p1 = (p11, ..., pn1)', p2 = (p12, ..., pn2)', ... of Mn corresponding to nonzero
eigenvalues h1 ≥ h2 ≥ ... are closely related to the parameters θir since
(14)
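The eigenvalue relation above is the usual duality between V V' and V' V sharing their nonzero eigenvalues. Assuming the scalings Σ̂n = (1/n) Σ_i vi vi' and (Mn)ij = (1/T) Σ_t vi(t) vj(t) (an assumption, since the displayed formulas are not fully legible here), a quick numerical check reads:

```python
import numpy as np

rng = np.random.default_rng(2)
n, T, L = 8, 30, 3
# Rank-L data matrix V whose rows are v_i = (v_i(1), ..., v_i(T))
V = rng.normal(size=(n, L)) @ rng.normal(size=(L, T))

Sigma_n = V.T @ V / n   # T x T empirical covariance (assumed scaling 1/n)
M_n = V @ V.T / T       # n x n cross-product matrix (assumed scaling 1/T)

lam = np.sort(np.linalg.eigvalsh(Sigma_n))[::-1]
h = np.sort(np.linalg.eigvalsh(M_n))[::-1]
# The L nonzero eigenvalues are related by h_r = (n / T) * lam_r
```

Working with the smaller of the two matrices is exactly what makes Method 2 attractive when n < T (or vice versa).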
3 Algorithms
When combining (3) and (7) one obtains

Yit = Σ_{j=1}^p βj Xitj + w(t) + Σ_{r=1}^L θir gr(t) + εit  (15)
The optimal basis functions gr satisfying (7)-(11), as well as w, βj and θir,
are unknown.
Based on the mathematical framework of Section 2, different algorithms
can be applied in order to estimate the components w and gr of model (15).
In this section we will rely on a prespecified dimension L. The important
question of determining an appropriate L will be considered in Section 4.
Let us first introduce some additional notation. Let Ȳt = (1/n) Σ_i Yit,
Ȳ = (Ȳ1, ..., ȲT)', Yi = (Yi1, ..., YiT)' and εi = (εi1, ..., εiT)'. Furthermore,
let Xij = (Xi1j, ..., XiTj)', X̄tj = (1/n) Σ_i Xitj, and X̄j = (X̄1j, ..., X̄Tj)'. We
will use Xi and X̄ to denote the T×p matrices with elements Xitj and X̄tj.
The algorithm can now be described as follows:
Step 1: Determine estimates β̂1, ..., β̂p and v̂i(t) by minimizing

Σ_{i=1}^n [ Σ_{t=1}^T (Yit - Σ_{j=1}^p βj Xitj - vi(t))² + κ ∫ vi''(s)² ds ]  (16)

where κ > 0 is a preselected smoothing parameter and vi'' denotes the second
derivative of vi.
Spline theory implies that any solution v̂i, i = 1, ..., n, of (16) possesses an
expansion v̂i(t) = Σ_j ζji zj(t) in terms of a natural spline basis z1, ..., zT.
If Z and A denote the T×T matrices with elements zj(t) and ∫ zj''(s) z_{j'}''(s) ds,
the above minimization problem can be reformulated in matrix notation:
determine β̂ = (β̂1, ..., β̂p)' and ζ̂i = (ζ̂1i, ..., ζ̂Ti)' by minimizing
(17)
as well as
Therefore,
(19)
estimates v̂i = (v̂i(1), ..., v̂i(T))'.
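Equation (16) is not fully legible in this copy. As a hedged discrete analogue of Step 1 (with the regression part omitted), a single curve can be smoothed by minimizing Σ_t (y_t - v_t)² + κ ||D2 v||², where D2 is the second-difference matrix; the closed-form solution is v̂ = (I + κ D2'D2)^{-1} y. All names are illustrative:

```python
import numpy as np

def smooth_curve(y, kappa):
    """Discrete roughness-penalty smoother, a sketch of Step 1 without the
    regression term. Minimizes sum((y - v)**2) + kappa * ||D2 v||**2, where
    D2 takes second differences; solution v = (I + kappa * D2'D2)^-1 y."""
    T = len(y)
    D2 = np.diff(np.eye(T), n=2, axis=0)          # (T-2) x T second differences
    return np.linalg.solve(np.eye(T) + kappa * D2.T @ D2, y)

rng = np.random.default_rng(3)
t = np.linspace(0.0, 1.0, 100)
y = np.sin(2.0 * np.pi * t) + rng.normal(scale=0.3, size=t.size)  # noisy curve
v_hat = smooth_curve(y, kappa=5.0)
```

Larger κ shrinks the high-frequency components of y more strongly, which is the discrete counterpart of the second-derivative penalty in (16).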
Remarks:
Σ̂n = (1/n) Σ_{i=1}^n v̂i v̂i',

and calculate its eigenvalues λ̂1 ≥ λ̂2 ≥ ... ≥ λ̂T and the corresponding eigen-
vectors γ̂1, γ̂2, ..., γ̂T.
The algorithm automatically also yields estimates β̂j and θ̂ir. However, vari-
ability of these estimates may be reduced by re-estimating these coefficients
by relying on (20):
Step 5: Re-estimate the coefficients β̂j and θ̂ir by fitting the estimated model
Yit = Σ_{j=1}^p βj Xitj + ŵ(t) + Σ_{r=1}^L θir ĝr(t) + εit to the data.
Kneip, Sickles and Song [6] also study the asymptotic behavior of the
resulting estimators as n, T → ∞.
Let κT = Tκ. If the underlying function vi, as discussed in Section 2, is
twice continuously differentiable, then the (squared) bias in estimating vi is of order κT,
while the variance is of order 1/(κT^{1/4} T). Choosing κT to be of order T^{-4/5} then
leads to the optimal individual rate of convergence (1/T) Σ_t (v̂i(t) - vi(t))² =
Op(T^{-4/5}).
Under some technical assumptions (mainly concerning smoothness as well
as the correlation between Xit and vi(t)), a theorem by Kneip, Sickles and
Song [6] then implies that for all r = 1, ..., L

(1/T) Σ_{t=1}^T (ĝr(t) - gr(t))² = Op(κT + 1/T² + 1/(κT^{1/4} nT))  (21)
vi(t) = Yit - Σ_{j=1}^p βj Xitj - w(t) - εit
Hence, if the parameters βj were known, the matrix

(Mn)i,j = (1/T) Σ_{t=1}^T (Yit - Ȳt - Σ_{l=1}^p βl (Xitl - X̄tl)) (Yjt - Ȳt - Σ_{l=1}^p βl (Xjtl - X̄tl)),  i, j = 1, ..., n  (22)

provides an estimate of M which, by Method 2 discussed in Section 2, can be
used to calculate estimates of gr.
The basic idea of the following algorithm is now easily described: un-
der (15) the "true" matrix M possesses only L nonzero eigenvalues, and
therefore Σ_{j=L+1}^n λj = 0. Based on (22), different matrices M̂n(β) can be
determined in dependence of all possible values of βj. Estimates β̂j and M̂n
can be obtained by minimizing the sum of the smallest n - L eigenvalues of
M̂n(β) with respect to β.
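For a single regressor this minimization can be sketched by a grid search on the objective "sum of the n - L smallest eigenvalues of Mn(β)". The data-generating process and all names below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(4)
n, T, L = 15, 40, 1
beta_true = 2.0
t = np.linspace(0.0, 1.0, T)
g = np.sin(2.0 * np.pi * t)                 # single common component
theta = rng.normal(size=(n, 1))
X = rng.normal(size=(n, T))
Y = beta_true * X + theta * g               # noiseless toy data, p = 1, w = 0

def objective(beta):
    # Residuals for a candidate beta, centered over individuals
    R = Y - beta * X
    R = R - R.mean(axis=0)
    M = R @ R.T / T
    eig = np.sort(np.linalg.eigvalsh(M))
    return eig[: n - L].sum()               # sum of the n - L smallest eigenvalues

grid = np.linspace(1.0, 3.0, 201)
beta_hat = grid[np.argmin([objective(b) for b in grid])]
```

At the true β the centered residual matrix has rank L, so the trailing eigenvalues vanish; any other β inflates them, which is exactly the idea described above.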
The precise algorithm can now be described as follows:
(M̂n(β̂))i,j = (1/T) Σ_{t=1}^T (Yit - Ȳt - Σ_{l=1}^p β̂l (Xitl - X̄tl)) (Yjt - Ȳt - Σ_{l=1}^p β̂l (Xjtl - X̄tl)),  i, j = 1, ..., n  (23)
In spite of averaging over individuals, (23) may lead to fairly noisy esti-
mates of gr. Some additional smoothing will usually improve the performance
of the estimator. Using a spline approach, an estimate of gr may thus alter-
natively be determined by minimizing
Recall that the procedure of Section 3.1 requires smoothing of the
individual data of each of the n units in order to estimate vi, i = 1, ..., n.
An important advantage of the above algorithm thus is that it only requires
some global smoothing over weighted averages of observations in Steps 3*
and 4*. The choice of the smoothing parameter κ will thus be less critical,
and a possible smoothing bias will not affect the estimates of the parameters
βj. One may expect a superior behavior of this method if the number T of
repeated measurements is fairly small.
On the other hand, a drawback is the fact that already for estimating βj
in Step 1* a sensible selection of the dimension L in (15) has to be made.
Indeed, usually (15) will have to be satisfied to a very good approximation in
order to avoid biased estimates of the parameters. In practice, one may apply
the algorithm for different values of L and choose an appropriate dimension
by using some goodness-of-fit criterion.
Theoretical properties of the above algorithm have not yet been studied
and remain a topic of future research.
4 Choice of dimension
Any analysis based on (15) requires a sensible choice of the dimension L.
If L is too small, there may exist a large systematic error in approximating
the Vi. On the other hand, if L is too large, then estimates will possess an
unnecessarily large variance.
Note that for a given sample the eigenvalues of the estimated covariance
matrix Σ̂n will usually satisfy λ̂r > 0 for r > L. This will even be true if (15)
holds exactly and if therefore the eigenvalues of the true matrix Σn are such that
λr = 0 for r > L. In other words, the noise term εit will "create" additional
(small) components in the PCA decomposition. It is obvious that any com-
ponent generated or strongly influenced by noise should not be included in
model (15).
From this point of view one may tend to choose L in such a way that
each component gr, r = 1, ..., L, possesses an influence on the model fit
which is significantly larger than that of any noise component. This idea has
been adopted by Kneip, Sickles and Song [6] in order to estimate a dimen-
sion L. Under the hypothesis that (15) holds for some L, i.e. Σ_{r=L+1}^n λr = 0,
they derive asymptotic approximations of the mean m(L) and variance s(L)² of
Σ_{r=L+1}^n λ̂r, and it is shown that the standardized statistic
C(L) = (Σ_{r=L+1}^n λ̂r - m(L)) / s(L) can be used to test this hypothesis.
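The exact C(L) statistic requires the asymptotic approximations m(L) and s(L)², which are not reproduced in this excerpt. As a simpler stand-in that follows the same logic (trailing eigenvalues should be negligible), one can pick the smallest L whose leading eigenvalues explain a fixed share of the total variance; this is a hedged heuristic, not the test of [6]:

```python
import numpy as np

def choose_dimension(V_hat, threshold=0.95):
    """Smallest L whose leading eigenvalues of the empirical covariance
    explain at least `threshold` of the total variance. A heuristic
    stand-in only; the C(L) statistic additionally needs m(L) and s(L)^2."""
    n = V_hat.shape[0]
    S = V_hat.T @ V_hat / n
    lam = np.clip(np.sort(np.linalg.eigvalsh(S))[::-1], 0.0, None)
    ratio = np.cumsum(lam) / lam.sum()
    return int(np.searchsorted(ratio, threshold) + 1)
```

With noisy rank-2 curves this returns 2, because the noise-generated eigenvalues contribute only a negligible share of the total variance.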
References
[1] Battese G.E., Coelli T.J. (1992). Frontier production functions, technical
efficiency and panel data: With application to paddy farmers in India.
Journal of Productivity Analysis 3, 153-169.
[2] Cornwell C., Schmidt P., Sickles R.C. (1990). Production frontiers with
cross-sectional and time-series variation in efficiency levels. Journal of
Econometrics 46, 185-200.
[3] Dauxois J., Pousse A., Romain Y. (1982). Asymptotic theory for the
principal component analysis of a vector random function: some ap-
plications to statistical inference. Journal of Multivariate Analysis 12,
136-154.
[4] Gihman I.I., Skorohod A.V. (1970). The theory of stochastic processes.
New York: Springer.
[5] Good I.J. (1969). Some applications of the singular value decomposition
of a matrix. Technometrics 11, 823-831.
[6] Kneip A., Sickles R.C., Song W. (2004). On estimating the mixed effects
model. Manuscript.
[7] Ramsay J.O., Silverman B.W. (1997). Functional data analysis. New
York: Springer.
[8] Speckman P. (1988). Kernel smoothing in partial linear models. Journal
of the Royal Statistical Society, Series B 50, 413-436.
1 Introduction
The bigram proximity matrix (BPM) was first developed by Martinez and
Wegman [8], [9], [10] as a way of encoding text so it can be used in applications
such as document clustering, classification or information retrieval. Previous
studies with the BPM indicated that documents can be successfully classified
using k nearest neighbors and other methods when they are encoded in this
way. The objective of the current work is to define bigram weights analogous
to the term weights found in natural language processing and to investigate
the utility of using them in document classification.
In Section 2, we present some background information on the BPM and
include an illustrative example. We then provide definitions of the bigram
weights in Section 3. Section 4 contains information about the experiments
that were conducted, as well as the results. Finally, we offer a summary and
some comments about future work in Section 5.
[Example BPM (table): rows indexed by the lexicon words crowd, his, in, father, man, sought, the, wise, young, with entries counting the corresponding word pairs; the original matrix layout is not recoverable here.]
lexicon created by listing alphabetically the unique occurrences of the words
in the text. Additionally, it should be noted that all end-of-sentence punc-
tuation is replaced with a period, and the period is treated as a word. By
convention, the period is designated as the first word in the ordered lexicon.
It is asserted that the BPM representation of the semantic content preserves
enough unique features to be semantically separable from BPMs of other
thematically unrelated collections.
The rows in the BPM represent the first word in the pair, and the second
word is given by the column. For example, the BPM for the sentence or text
stream,
3 Definition of weights
We can see from the definition of the BPM that the elements of the matrix
represent the number of times that a bigram or word pair occurs in the
document. Some of the measures of semantic similarity for classification
cited in Martinez [8] employed the raw frequencies, others used binary values
(if the frequency is non-zero, then it is replaced with a 1), and some required
conversion to probabilities or relative frequencies. In this paper, we will only
be concerned with the first case, where raw bigram frequencies are compared.
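The construction described above (end-of-sentence punctuation replaced by a period, the period treated as a word and placed first in the ordered lexicon, entries counting adjacent word pairs) can be sketched as follows; the example sentence is an illustrative guess consistent with the word list shown earlier, and all names are hypothetical:

```python
import re
from collections import Counter

def bpm(text):
    """Sparse bigram proximity matrix as (ordered lexicon, pair counts).

    Sketch of the construction in the text: end-of-sentence punctuation is
    replaced by a period, the period is treated as a word and placed first
    in the ordered lexicon, and entry (w1, w2) counts occurrences of the
    adjacent word pair w1 -> w2 (row = first word, column = second word).
    """
    text = re.sub(r"[!?]", ".", text.lower())
    words = re.findall(r"[a-z']+|\.", text)
    lexicon = ["."] + sorted(set(words) - {"."})
    counts = Counter(zip(words, words[1:]))
    return lexicon, counts

# Illustrative sentence (a guess consistent with the word list shown earlier)
lex, counts = bpm("The wise young man sought his father in the crowd.")
```

Storing only the nonzero pairs keeps the representation sparse, which matters because a BPM over a lexicon of thousands of words is almost entirely zeros.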
Using weights with a text proximity matrix 329
NCC = cos θ_AC = A'C / (||A|| ||C||)  (1)

(2)

where sij represents the similarity between documents i and j, and dij is the
distance between document i and document j.
(3)

where lijk is the local weight for bigram ij that occurs in document k, gij is the
global weight for bigram ij in the corpus, and dk is a document normalization
factor. We represent the frequency, or the number of times bigram ij appears
in document k, as fijk. We use the following to indicate the conversion of
a frequency f to a binary value:
with the subscript b, where some arbitrary order or labeling has been imposed
on the bigrams (elements of the BPM). The logarithmic weight is defined as
(6)
If no local weights are used, then we denote that as just the bigram frequency
(7)
Note that the letters l, t, and n are used in the information retrieval literature
to denote the type of local weight [1].
We use only one global weight in this study, called the inverse document
frequency (IDF); others can be found in Berry and Browne [1]. The IDF for
bigrams is defined as

g_b = log(K / Σ_{k=1}^K I(f_bk)),  (8)
where K is the total number of documents in the corpus. When choosing
a global weight, one needs to consider the state of the corpus. If the corpus
changes, the BPM changes first and then the global weight must be revised.
Thus, if the corpus is unstable or constantly changing, then using a global
weight might not be a good idea.
We now come to the document normalization factor. The cosine normal-
ization seems to be used often with term-document matrices [1], so this is
what we use here. For our bigrams, this is given by
(9)
This simply normalizes the BPMs, or one could think of this as ensuring that
the magnitude of the BPM 'vector' is 1. We note that with the normalized
correlation coefficient, the document normalization does not really qualify
as a weight because this normalization would take place anyway with the
distance measure. What it means is that the denominator in Equation 1 is
one, so we do not need to calculate it for the similarity measure.
We can designate the weighting scheme by using a three-letter code as
follows:
txx  bigram frequency - no weights
nfc  augmented normalized frequency - IDF - cosine normalization
tfc  bigram frequency - IDF - cosine normalization
lfc  logarithmic - IDF - cosine normalization
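A weighting pipeline of this kind can be sketched on flattened, sparse BPM vectors. The logarithmic local weight below uses the common form 1 + log f (Equation 6 is not legible in this copy), so treat the exact formulas as assumptions; all names are hypothetical:

```python
import math

def idf(doc_freqs, K):
    """Global IDF weight g_b = log(K / (number of documents containing b))."""
    return {b: math.log(K / df) for b, df in doc_freqs.items() if df > 0}

def weight_and_normalize(freqs, g):
    """Apply a logarithmic local weight (assumed form 1 + log f) and the
    global IDF weight, then cosine-normalize the resulting sparse vector."""
    w = {b: (1.0 + math.log(f)) * g.get(b, 0.0) for b, f in freqs.items() if f > 0}
    norm = math.sqrt(sum(x * x for x in w.values()))
    return {b: x / norm for b, x in w.items()} if norm > 0 else w

def ncc(u, v):
    """Normalized correlation coefficient (Eq. 1) for sparse vectors."""
    dot = sum(x * v.get(b, 0.0) for b, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv)
```

Because the vectors are cosine-normalized, the denominator in the NCC is one, which is exactly the simplification noted above.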
f_{.k} = Σ_i f_{ik}.  (11)

The pointwise mutual information is defined as

(12)

where N = Σ_i Σ_j f_ij is the total count. A discounting factor

C_bk = (f_bk / (f_bk + 1)) × (min{f_b., f_.k} / (min{f_b., f_.k} + 1))

can also be applied. We did not use this factor in our research; only Equation 12 was implemented.
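Equation 12 is not legible in this copy; the sketch below uses the standard pointwise-mutual-information form log(N f_bk / (f_b. f_.k)) together with the discounting factor C_bk, both on a toy count table. All names are hypothetical:

```python
import math

def _margins(f, b, k):
    f_b = sum(v for (bb, _), v in f.items() if bb == b)   # row margin f_b.
    f_k = sum(v for (_, kk), v in f.items() if kk == k)   # column margin f_.k
    return f_b, f_k

def pmi(f, b, k):
    """Pointwise mutual information for bigram b in document k from a count
    table f[(b, k)]; assumed standard form log(N * f_bk / (f_b. * f_.k))."""
    N = sum(f.values())
    f_b, f_k = _margins(f, b, k)
    return math.log(N * f[(b, k)] / (f_b * f_k))

def discount(f, b, k):
    """Discounting factor C_bk = f_bk/(f_bk+1) * min(f_b., f_.k)/(min(..)+1)."""
    f_bk = f.get((b, k), 0)
    f_b, f_k = _margins(f, b, k)
    m = min(f_b, f_k)
    return (f_bk / (f_bk + 1.0)) * (m / (m + 1.0))
```

The discount stays below one and approaches one as the counts grow, so it mainly tempers PMI values computed from rare bigrams.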
332 Angel R. Martinez, Edward J. Wegman and Wendy L. Martinez
4 Experiments
The goal of our experiments is to assess the usefulness of weighting the BPMs.
In particular, to answer the question: Can documents be classified more
successfully using weighted bigrams? In the next subsections, we describe
some of the background and details of the experiments, followed by results.
All experiments and analyses, including reading the documents and creating
the BPMs, were done on a PC using MATLAB™, Version 6.5.
the lexicon is 11,103. When noise words are removed, the lexicon contains
10,997 words.
4.3 Results
To summarize, we varied the weights and other parameters and performed
the following experiments with the weighted BPM.
• Dimensionality of the space for using k-nn was either full dimensionality
or 4-D and 6-D from ISOMAP.
• The Euclidean distance was used for the 4-D and 6-D k-nn classification.
Figure 2: This is a data set randomly generated according to the manifold
given in Figure 1. The Euclidean distance between two points is given by
the straight line shown here. If we are seeking the neighborhood structure
along the manifold, then it would be better to use the geodesic distance (the
distance along the manifold or the roll) between the points.
5 Summary
In this paper, we defined bigram weights for the BPMs that are similar to
term weights used in natural language processing and information retrieval.
After the BPMs are weighted, we applied the k-nn classification method to
k = 1 k = 3 k =5 k = 7 k = 10
lfc 0.74 0.74 0.75 0.77 0.76
lfc-den 0.71 0.71 0.73 0.73 0.72
MI 0.82 0.81 0.83 0.83 0.84
MI-den 0.81 0.83 0.85 0.87 0.86
nfc 0.84 0.84 0.85 0.86 0.85
nfc-den 0.85 0.85 0.87 0.87 0.87
tfc 0.88 0.87 0.87 0.86 0.87
tfc-den 0.86 0.86 0.87 0.86 0.86
txx 0.66 0.65 0.65 0.64 0.65
txx-den 0.73 0.72 0.74 0.73 0.75
also perform some experiments using a stemmed and denoised lexicon [8], [1].
We could also examine the effect of the dimensionality reduction procedure.
As stated previously, ISOMAP seeks a nonlinear manifold; we might try some-
thing like classical multidimensional scaling [2] (using the NCC similarity
directly rather than the geodesic distance). Finally, we could use some other
methods to analyze the reduced BPMs, such as model-based clustering [4],
linear or quadratic classifiers [3], non-metric multidimensional scaling, self-
organizing maps [6], etc.
References
[1] Berry M.W., Browne M. (1999). Understanding search engines: mathe-
matical modeling and text retrieval. SIAM.
[2] Cox T.F., Cox M.A.A. (2001). Multidimensional scaling, 2nd edition.
Chapman and Hall - CRC.
[3] Duda R.O., Hart P.E., Stork D.G. (2000). Pattern classification, 2nd edi-
tion. Wiley-Interscience.
[4] Fraley C., Raftery A.E. (1998). How many clusters? Which clustering
method? Answers via model-based cluster analysis. The Computer Jour-
nal 41, 578-588.
[5] Gale W., Church K., Yarowsky D. (1992). A method for disambiguating word
senses in a corpus. Computers and the Humanities 26, 415-439.
[6] Kohonen T. (2001). Self-organizing maps, 3rd edition. Springer-
Verlag.
[7] Manning C.D., Schütze H. (2000). Foundations of statistical natural lan-
guage processing. The MIT Press.
[8] Martinez A.R. (2002). A framework for the representation of semantics.
Ph.D. Dissertation, George Mason University.
[9] Martinez A.R., Wegman E.J. (2002). A text stream transformation for
semantic-based clustering. Proceedings of the Interface.
[10] Martinez A.R., Wegman E.J. (2002). Encoding of text to preserve mean-
ing. Proceedings of the Army Conference on Applied Statistics.
[11] Pantel P., Lin D. (2002). Discovering word senses from text. Proceedings
of ACM SIGKDD Conference on Knowledge Discovery and Data Mining,
613-619.
[12] Tenenbaum J.B., de Silva V., Langford J.C. (2000). A global geometric
framework for nonlinear dimensionality reduction. Science 290, 2319-
2323.
1 Introduction
Since proposed in [13], canonical correlation analysis has been widely applied
in many statistical areas, especially in multivariate analysis. Time series
analysis is no exception. [6] proposed a canonical analysis of vector time
series that can reveal the underlying structure of the data to aid model in-
terpretation. In particular, they showed that linear combinations of several
unit-root non-stationary time series can become stationary. This is the idea
of co-integration that was popular among econometricians in the 1990s after
the publication of [10]. [22] applied canonical correlation analysis to develop
the smallest canonical correlation method for identifying univariate ARMA
models for a stationary and/or non-stationary time series. [17] introduced the
concept of scalar component models to build a parsimonious VARMA model
for a given vector time series. Again, canonical correlation analysis was used
extensively to search for scalar component models. Many other authors also
used canonical analysis in time series analysis. See, for instance, [15].
To build a model for a k-dimensional linear process, it suffices to identify
the k Kronecker indexes or k linearly independent scalar component models,
because we can use such information to identify those parameters that require
estimation and those that can be set to zero within a dynamic linear vector
model. Simply put, the Kronecker indexes and scalar component models can
overcome the difficulties of the curse of dimensionality, parameter explosion, ex-
changeable models, and redundant parameters in modelling a linear vector
time series. For simplicity, we shall consider the problem of specifying Kro-
necker indexes in this paper. The issue discussed, however, is equally applica-
ble to specification of scalar component models. The method of determining
Kronecker indexes of a linear vector process with Gaussian innovations has
been studied by [1], [7], [18], [20], among others. These studies show that
canonical correlation analysis is useful in specifying the Kronecker indexes
340 Wanli Min and Ruey S. Tsay
1.1 Preliminaries
Based on the Wold decomposition, a k-dimensional stationary time series
Zt = (Z1t, ..., Zkt)' can be written as Zt = μ + Σ_{i=0}^∞ ψi a_{t-i}, where μ =
(μ1, ..., μk)' is a constant vector, ψi are k×k coefficient matrices with
ψ0 = Ik being the identity matrix, and {at = (a1t, ..., akt)'} is a sequence
of k-dimensional uncorrelated random vectors with mean zero and positive-
definite covariance matrix Σ. That is, E(at) = 0, E(at a'_{t-i}) = 0 if i ≠ 0,
and E(at a't) = Σ. The at process is referred to as the innovation series of
Zt. If Σ_{i=0}^∞ ||ψi|| < ∞, then Zt is (asymptotically) weakly stationary, where
||A|| is a matrix norm, e.g. ||A|| = √(trace(AA')). Often one further assumes
that at is Gaussian. In this paper, we assume that

where F_{t-1} = σ{a_{t-1}, a_{t-2}, ...} denotes information available at time t-1.
Writing ψ(B) = Σ_{i=0}^∞ ψi B^i, where B is the backshift operator such that
B Zt = Z_{t-1}, then Zt = μ + ψ(B) at. If ψ(B) is rational, then Zt has
a VARMA representation

where Φ(B) = I - Σ_{i=1}^p Φi B^i and Θ(B) = I - Σ_{j=1}^q Θj B^j are two matrix
polynomials of order p and q, respectively, and have no common left factors.
For further conditions of identifiability, see [8] for more details. The station-
arity condition of Zt is equivalent to all zeros of the polynomial |Φ(B)|
being outside the unit circle.
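The stationarity condition can be checked numerically for the autoregressive part: all zeros of |Φ(B)| lie outside the unit circle exactly when all eigenvalues of the VAR companion matrix lie strictly inside it. A minimal sketch (numpy assumed, names hypothetical):

```python
import numpy as np

def var_is_stationary(phis):
    """Check stationarity of a VAR(p) with coefficient matrices Phi_1..Phi_p:
    all eigenvalues of the companion matrix must lie strictly inside the
    unit circle, equivalently all zeros of |Phi(B)| lie outside it."""
    phis = [np.asarray(p, dtype=float) for p in phis]
    k, p = phis[0].shape[0], len(phis)
    top = np.hstack(phis)                 # k x kp block [Phi_1 ... Phi_p]
    bottom = np.eye(k * (p - 1), k * p)   # shifted identity blocks [I 0]
    companion = np.vstack([top, bottom])
    return bool(np.max(np.abs(np.linalg.eigvals(companion))) < 1.0)
```

The moving-average part plays no role here; invertibility would be checked the same way on |Θ(B)|.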
The number of parameters of the VARMA model in Eq. (2) could reach
(p+q)k² + k + k(k+1)/2 if no constraint is applied, making parameter estimation difficult.
On canonical analysis of vector time series 341
where α0 > 0, αi ≥ 0, βj ≥ 0, and {εt} is a sequence of independent and
identically distributed random variables with mean zero and variance 1. It is
well-known that at is asymptotically second-order stationary if Σi αi +
Σj βj < 1. Generalization of the GARCH models to the multivariate case
introduces additional complexity to the modelling procedure because the
covariance matrix of at has k(k+1)/2 elements. Writing the conditional
covariance matrix of at given the past information as Σt = E(at a't | F_{t-1}),
where F_{t-1} is defined in Eq. (1), we have at = Σt^{1/2} εt, where Σt^{1/2} is
the symmetric square root of the matrix Σt and {εt} is a sequence of in-
dependent and identically distributed random vectors with mean zero and
identity covariance matrix. Often εt is assumed to follow a multivariate nor-
mal or Student-t distribution. To ensure the positive definiteness of Σt,
several models have been proposed in the literature. For example, con-
sider the simple case of order (1,1). [11] consider the BEKK model Σt
= CC' + A a_{t-1} a'_{t-1} A' + B Σ_{t-1} B', where C is a lower triangular matrix
and A and B are k×k matrices. [4] discusses the diagonal model Σt =
CC' + AA' ⊙ (a_{t-1} a'_{t-1}) + BB' ⊙ Σ_{t-1}, where ⊙ stands for the matrix Hadamard
product (element-wise product).
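The univariate GARCH(1,1) building block of the pure diagonal model can be simulated directly; under α1 + β1 < 1 the series is covariance-stationary with unconditional variance α0/(1 - α1 - β1). A hedged sketch with illustrative parameters:

```python
import numpy as np

def simulate_garch11(alpha0, alpha1, beta1, n, seed=0):
    """Simulate a_t = sigma_t * eps_t with Gaussian eps_t and
    sigma_t^2 = alpha0 + alpha1 * a_{t-1}^2 + beta1 * sigma_{t-1}^2."""
    rng = np.random.default_rng(seed)
    a = np.empty(n)
    sig2 = alpha0 / (1.0 - alpha1 - beta1)   # start at the unconditional variance
    for t in range(n):
        a[t] = np.sqrt(sig2) * rng.standard_normal()
        sig2 = alpha0 + alpha1 * a[t] ** 2 + beta1 * sig2
    return a

# Illustrative parameters: alpha1 + beta1 < 1, so the unconditional
# variance is alpha0 / (1 - alpha1 - beta1) = 1
a = simulate_garch11(alpha0=0.2, alpha1=0.1, beta1=0.7, n=50000)
```

The simulated innovations are serially uncorrelated but not independent, which is precisely the situation studied in the remainder of the paper.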
When GARCH effects exist, the time series Zt is no longer Gaussian. Its
innovations become a sequence of uncorrelated, but serially dependent ran-
dom vectors. It is well-known that such innovations tend to have heavy tails;
see [9] and [21], among others. The performance of canonical correlation
analysis under such innovations is yet to be investigated. This is the main
objective of this paper. Sections 2 and 3 review and introduce the problem con-
sidered in the paper. Section 4 establishes the statistics to specify Kronecker
indexes for VARMA+GARCH processes. Section 5 presents some simulation
results, and Section 6 applies the analysis to a real financial time series.
Making use of the result mentioned above, [18] proposed a proper test statistic

T = -(n - s) log(1 - ρ̂²) / d ∼ χ²_{s-f+1}  (7)

where d = 1 + 2 Σ_{v=1}^h ρxx(v) ρyy(v). In Eq. (7), it is understood that d = 1 if
h = 0; ρxx(v) and ρyy(v) are the lag-v sample autocorrelations of Xt and Yt,
respectively, and n is the sample size. Bartlett's formula in Eq. (6) is
for independent Gaussian innovations {at}. This is not the case when the
innovations follow a GARCH(r1, r2) model. We shall study in the next section
properties of sample auto-covariances in the presence of GARCH innovations.
All proofs can be found in [14].
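The statistic in Eq. (7) is built from a squared sample canonical correlation. A minimal sketch of the canonical-correlation computation (via QR orthonormalization and an SVD) and of the statistic with d = 1, the independent-Gaussian case, follows; the offset s is left as a placeholder and all names are hypothetical:

```python
import numpy as np

def largest_canonical_corr(X, Y):
    """Largest sample canonical correlation between the columns of X and Y,
    computed from the SVD of the product of orthonormalized data matrices."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    qx, _ = np.linalg.qr(Xc)
    qy, _ = np.linalg.qr(Yc)
    s = np.linalg.svd(qx.T @ qy, compute_uv=False)
    return min(1.0, float(s.max()))

rng = np.random.default_rng(5)
n = 500
X = rng.normal(size=(n, 2))
Y = 0.8 * X[:, :1] + 0.6 * rng.normal(size=(n, 1))   # population correlation 0.8

rho = largest_canonical_corr(X, Y)

# Test statistic of Eq. (7) with d = 1 (independent Gaussian innovations);
# s_offset is the model-dependent quantity s, left as a placeholder here
s_offset = 0
T_stat = -(n - s_offset) * np.log(1.0 - rho ** 2)
```

The point of the propositions below is that with GARCH innovations the variance of ρ̂ changes, so d = 1 is no longer appropriate.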
Proposition 3.2. Assume at = (a1t, ..., amt)' follows a pure diagonal mul-
tivariate GARCH model, i.e. ait follows a univariate GARCH(r1, r2) model
and is stationary with finite fourth moment for each i = 1, ..., m. Consider
the process Xt = Σ_{i=0}^∞ Ψ'i a_{t-i}, where Ψi are m-dimensional vectors. Assume
further that Σ_{i=0}^∞ ||Ψi|| < ∞ and Σ_{i=0}^∞ i ||Ψi||² < ∞. Let F0 = σ{a0, a_{-1}, ...}.
Then the next inequality holds: Σ_{t=1}^∞ ||E(Xt² - γxx(0) | F0)|| < ∞, where γxx(0) =
E(Xt²).
Proposition 3.3. Let Xt = (X1t, ..., Xkt)' = Σ_{i=0}^∞ Ψi a_{t-i}, where Ψi are
matrices of dimension k×m and at is m-dimensional and follows a pure
diagonal stationary GARCH(r1, r2) model with finite 4th moment. Further,
Σ_{i=0}^∞ ||Ψi|| < ∞ and Σ_{i=0}^∞ i ||Ψi||² < ∞. Letting Γh = E(Xt X'_{t+h}), where h is an
integer,
Proposition 3.4. Let X_t = (X_{1t}, ..., X_{kt})' = Σ_{i=0}^{∞} Ψ_i a_{t−i} and Y_t = (Y_{1t}, ..., Y_{lt})' = Σ_{i=0}^{∞} Φ_i a_{t−i}, where the Ψ_i and Φ_i are matrices of dimension k × m and l × m, respectively. Suppose both X_t and Y_t satisfy the conditions in Proposition 3.3. Denote Γ_xy(h) = E(X_t Y'_{t+h}). Then n^{−1/2} Σ_{t=1}^{n} Vec(X_t Y'_{t+h} − Γ_xy(h)) →

For a stationary VARMA process Φ(B)(Z_t − μ) = Θ(B)a_t, its MA(∞) representation Z_t = μ + Σ_{i=0}^{∞} Ψ_i a_{t−i} satisfies the conditions Σ_{i=0}^{∞} ‖Ψ_i‖ < ∞ and Σ_{i=0}^{∞} i‖Ψ_i‖² < ∞, since ‖Ψ_i‖ ~ rⁱ with r ∈ (0, 1).
On canonical analysis of vector time series 345
Let Y_t = Σ_{i=0}^{∞} φ_i a_{t−i} and X_t = Σ_{i=0}^{∞} ψ_i a_{t−i}, with a_t being a GARCH(r₁, r₂) process of Eq. (3). By Lemma 1, E(a_i a_j a_k a_l) = 0 for all i ≥ j ≥ k ≥ l unless i = j and k = l both hold. Let U = γ̂_xx(0), V = γ̂_yy(0), and W = γ̂_xy(q) = (1/(n−q)) Σ_{t=1}^{n−q} X_t Y_{t+q}. Given q > h, where h corresponds to a Kronecker index, we have γ_xy(q) = γ_yy(q) = 0, and on applying the delta method the following result holds:

Var(ρ̂_xy(q)) ≈ (1/n) Σ_d [ ρ_xx(d)ρ_yy(d) + Cum(X₀, X_d, Y_q, Y_{q+d})/(γ_xx(0)γ_yy(0)) ],

where Cum(X₀, X_d, Y_q, Y_{q+d}) = Σ_{i≥0} Σ_{k=0}^{h−d} ψ_i ψ_{i+d} φ_k φ_{k+d} Cov(a₀², a²_{q−k+i}).
Therefore, the fourth-order cumulants of {X_t} depend on the autocovariance function of {a_t²}. Compared to γ_xx(d)γ_yy(d), Cum(X₀, X_d, Y_q, Y_{q+d}) has a non-negligible impact on Var(ρ̂_xy(q)) if Cov(a₀², a²_{q−k+i})/E²(a₀²) is large. For instance, if a_t is a GARCH(1,1) process, then

Cov(a₀², a₁²)/σ_a⁴ = 2α₁(1 − α₁β₁ − β₁²) / (1 − (α₁ + β₁)² − 2α₁²).

This ratio is 86 given α₁ = 0.5 and β₁ = 0.2. Considering the fourth-order cumulant correction term in Var(ρ̂), one can modify the T statistic proposed by Tsay as
346 Wanli Min and Ruey S. Tsay
T* (9)
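The GARCH(1,1) ratio quoted above is easy to check numerically (a sketch; the function name is ours):

```python
def garch11_sq_cov_ratio(alpha1, beta1):
    """Closed-form ratio Cov(a_0^2, a_1^2) / sigma_a^4 for a stationary
    GARCH(1,1) process with finite fourth moment, as given in the text."""
    num = 2.0 * alpha1 * (1.0 - alpha1 * beta1 - beta1 ** 2)
    den = 1.0 - (alpha1 + beta1) ** 2 - 2.0 * alpha1 ** 2
    return num / den

ratio = garch11_sq_cov_ratio(0.5, 0.2)  # the example values from the text
```

With α₁ = 0.5 and β₁ = 0.2 the denominator is only 0.01, which is why the ratio is as large as 86 and the cumulant correction cannot be ignored.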
5 Simulation study

We conduct some simulations to study the finite-sample performance of the modified test statistics. We focus on a bivariate ARMA+GARCH(1,1) model chosen to have GARCH parameters similar to those commonly seen in empirical asset returns. The model is
Z_t − [ 0.8  0 ; 0  0.3 ] Z_{t−1} = a_t − [ −0.8  1.3 ; −0.3  0.8 ] a_{t−1}, (10)
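The model in Eq. (10) can be simulated directly; the diagonal GARCH(1,1) parameters (omega, alpha, beta) below are illustrative stand-ins, since the excerpt does not list the values used in the paper:

```python
import numpy as np

def simulate_varma_garch(n, phi, theta, omega=0.05, alpha=0.1, beta=0.85, seed=0):
    """Simulate Z_t - phi Z_{t-1} = a_t - theta a_{t-1} with pure diagonal
    GARCH(1,1) innovations. GARCH parameters are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    m = phi.shape[0]
    a = np.zeros((n + 1, m))
    z = np.zeros((n + 1, m))
    h = np.full(m, omega / (1.0 - alpha - beta))     # unconditional variance
    for t in range(1, n + 1):
        h = omega + alpha * a[t - 1] ** 2 + beta * h  # diagonal GARCH(1,1)
        a[t] = np.sqrt(h) * rng.standard_normal(m)
        z[t] = phi @ z[t - 1] + a[t] - theta @ a[t - 1]
    return z[1:]

phi = np.array([[0.8, 0.0], [0.0, 0.3]])
theta = np.array([[-0.8, 1.3], [-0.3, 0.8]])
z = simulate_varma_garch(2000, phi, theta)  # sample size used in Table 1
```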
Table 1: Empirical quantiles of various test statistics for testing zero canonical correlations, based on 2,000 replications with sample size 2,000.

Let B be the corresponding test statistic −(n − s) log(1 − ρ̂²)/d̂, where d̂ is obtained from bootstraps.
Table 1 compares empirical percentiles and the size of the various test statistics discussed above for the model in Eq. (10) when the sample size is 2000, which is common among financial data. The corresponding quantiles of the asymptotic χ² are also given in the table. Other sample sizes were also considered. From the table, we make the following observations. First, the T* and bootstrap B statistics perform reasonably well when the sample size is sufficiently large. The bootstrap method outperforms the other test statistics. However, it requires intensive computation. For instance, it took several hours to compute the bootstrap tests in Table 1, whereas it only took seconds to compute the other tests. Second, the T statistics underestimate the variance of the cross-correlation, so that the empirical quantiles exceed their theoretical counterparts. Third, as expected, the S statistics perform poorly for both sample sizes considered. Fourth, the performance of the proposed test statistic T* indicates that the [2] method to estimate the variance of the cross-covariance is reasonable in the presence of GARCH effects, provided that robust estimators are used.
6 An illustrative example

In this section we apply the proposed test statistics to a three-dimensional financial time series. The data consist of daily log returns, in percentages, of stocks for Amoco, IBM, and Merck from February 2, 1984 to December 31, 1991, with 2000 observations. The series are shown in Figure 1. It is well known that daily stock return series tend to have weak dynamic dependence but strong conditional heteroscedasticity, making them suitable for the proposed test. Our goal here is to provide an illustration of specifying a vector ARMA
Figure 1: Time series of daily returns of Amoco, IBM, and Merck stocks (2/2/1985-12/31/1991).
model with GARCH innovations rather than a thorough analysis of the term
structure of stock returns.
Denote the return series by Z_t = (Z_{1t}, Z_{2t}, Z_{3t})' for Amoco, IBM, and Merck stock, respectively. Following the order specification procedure of Section 2.2, we apply the proposed test of Eq. (9), denoted by T*, to the data and summarize the test results in Table 2. We also include the test statistic T of Eq. (7) for comparison purposes. The past vector P_t is determined by the AIC as P_t = (Z'_{t−1}, Z'_{t−2})'. The p-value is based on a χ²_{ks−f+1} test, where k = 3, s = 2, and f = dim(F_t).
From Table 2, the proposed test statistic T* identified {1, 1, 1} as the Kronecker indexes for the data, i.e. K_i = 1 for all i. On the contrary, if one assumes that there are no GARCH effects and uses the test statistic T, then one would identify {1, 1, 2} as the Kronecker indexes. More specifically, the T statistic specifies K₁ = K₂ = 1, but finds the smallest canonical correlation between F_t = (Z_{1,t}, Z_{2,t}, Z_{3,t}, Z_{3,t+1})' and P_t to be significant at the usual 5% level. To determine K₃, one needs to consider the canonical correlation analysis between F_t = (Z_{1,t}, Z_{2,t}, Z_{3,t}, Z_{3,t+1}, Z_{3,t+2})' and the past vector P_t. The corresponding test statistic is T = 4.05, which is insignificant with p-value 0.134 under the asymptotic χ²₂ distribution. Therefore, without considering GARCH effects, the identified Kronecker indexes are (K₁ = 1, K₂ = 1, K₃ = 2), resulting in an ARMA(2,2) model for the data.
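The quoted p-value is easy to verify: for 2 degrees of freedom the χ² survival function has the closed form P(X > x) = exp(−x/2), so no statistical library is needed:

```python
import math

# chi-square survival function with 2 degrees of freedom: P(X > x) = exp(-x/2)
t_stat = 4.05
p_value = math.exp(-t_stat / 2.0)
```

This gives about 0.132, in line with the approximately 0.134 reported in the text (the reported statistic is presumably quoted to fewer digits than the one actually used).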
Consequently, by correctly considering the GARCH effect, the proposed test statistic T* was able to specify a more parsimonious ARMA(1,1) model for the data. In summary, we entertain a vector ARMA(1,1) model with diagonal GARCH(1,1) innovations for the data. The estimated VARMA-GARCH
References
[1] Akaike H. (1976). Canonical correlation analysis of time series and the use of an information criterion. In Systems Identification: Advances and Case Studies, eds. R. K. Mehra and D. G. Lainiotis. New York: Academic Press, 27-96.
[2] Berlinet A., Francq C. (1997). On Bartlett's formula for non-linear processes. Journal of Time Series Analysis 18, 535-552.
[3] Bollerslev T. (1986). Generalized autoregressive conditional heteroscedasticity. Journal of Econometrics 31, 307-327.
[4] Bollerslev T., Engle R.F., Nelson D.B. (1994). ARCH models. In Handbook of Econometrics IV. Elsevier Science B.V., 2959-3038.
[5] Box G.E.P., Jenkins G.M. (1976). Time series analysis: forecasting and control. San Francisco, CA: Holden-Day.
[6] Box G.E.P., Tiao G.C. (1977). A canonical analysis of multiple time series. Biometrika 64, 355-365.
[7] Cooper D.M., Wood E.F. (1982). Identifying multivariate time series models. Journal of Time Series Analysis 3, 153-164.
[8] Dunsmuir W., Hannan E.J. (1976). Vector linear time series models. Advances in Applied Probability 8, 339-364.
[9] Engle R.F. (1982). Autoregressive conditional heteroscedasticity with estimates of the variance of U.K. inflation. Econometrica 50, 987-1008.
[10] Engle R.F., Granger C.W.J. (1987). Co-integration and error-correction: representation, estimation and testing. Econometrica 55, 251-276.
[11] Engle R.F., Kroner K.F. (1995). Multivariate simultaneous generalized ARCH. Econometric Theory 11, 122-150.
[12] Hannan E.J., Deistler M. (1988). The statistical theory of linear systems. John Wiley, New York.
[13] Hotelling H. (1936). Relations between two sets of variates. Biometrika 28, 321-377.
[14] Min W.L., Tsay R.S. (2004). On canonical analysis of multivariate time series. Working paper, GSB, University of Chicago.
[15] Quenouille M.H. (1957). The analysis of multiple time series. London: Griffin.
[16] Romano J.P., Thombs L.A. (1996). Inference for autocorrelations under weak assumptions. Journal of the American Statistical Association 91, 590-600.
[17] Tiao G.C., Tsay R.S. (1989). Model specification in multivariate time series (with discussion). Journal of the Royal Statistical Society, Ser. B 51, 157-213.
[18] Tsay R.S. (1989a). Identifying multivariate time series models. Journal of Time Series Analysis 10, 357-371.
[19] Tsay R.S. (1989b). Parsimonious parametrization of vector autoregressive moving average models. Journal of Business and Economic Statistics 7, 327-341.
[20] Tsay R.S. (1991). Two canonical forms for vector ARMA processes. Statistica Sinica 1, 247-269.
[21] Tsay R.S. (2002). Analysis of financial time series. John Wiley: New York.
[22] Tsay R.S., Tiao G.C. (1985). Use of canonical analysis in time series model identification. Biometrika 72, 299-315.
[23] Wu W.B. (2003). Empirical processes of long-memory sequences. Bernoulli 9, 809-831.
LEARNING STATISTICS
BY DOING OR BY DESCRIBING:
THE ROLE OF SOFTWARE
Erich Neuwirth
Key words: Statistical computing, statistics education, teaching statistics.
COMPSTAT 2004 section: Teaching statistics.
• Basic statistical knowledge: understanding simple statistical summaries and graphs, numeracy.
• Basic statistical skills: selecting appropriate simple statistical methods for one's own analyses, ability to immediately identify misuses of statistics.
• Advanced statistical knowledge: understanding complex methods, especially multivariate analytical and graphical methods.
We need to distinguish the level of presentation for statistics education:
• Basic mathematical knowledge and skills, simple algebraic formulas admissible as tools for explaining.
• College level mathematical background.
Finally, the level of computer expertise of the educatees also plays an important role in designing courses and activities for statistics education.
[Figure: interactive population pyramid with sliders for the year, the first year of retirement age (60), and the first year of workforce age (20).]
The most important details in this model are the "sliders"; they allow the user to change the graph dynamically. The horizontal slider turns the graph into a movie. The graph always displays the population pyramid for a given year; when the slider is moved, the year changes and the change of the age structure becomes dynamically visible.

The other sliders allow the user to change different model parameters, like retirement age, and will immediately display changes in the system resulting from changes in the parameters. The model also allows the use of data from different countries (currently we have Austria, Germany, USA, and Japan) to analyze how different population structures can get.
One of the most important concepts of statistics is the data matrix, also called a data frame. In a spreadsheet, the data are always visible, and it becomes a very physical experience that doing statistics is operating on data. This fact is much more obscured when a statistical programming language like S, R, SPSS, or SAS is used as the basic tool in statistics courses.

The main difference between the spreadsheet approach and the statistical programming language approach might be characterized as direct manipulation vs. descriptive. The programming language approach is much more formula based; the data are not as omnipresent as in the spreadsheet approach. For introductory statistics courses, this constant reminder that "statistics is about data" can be quite helpful. Many students, after their first course of non-computer-based statistics, have the impression that statistics is about certain types of formulas, and not so much about data. Programming languages still somewhat support this mindset, whereas the spreadsheet approach really emphasizes the data analysis point of view. More topics about modelling with spreadsheets can be found in [6].
The direct manipulation approach is not solely restricted to spreadsheets. Programs like Fathom (available from Key Curriculum Press) also emphasize the "manipulate the data with the mouse" approach as opposed to the "write a program to manipulate the data" approach.

Spreadsheets are not the answer to all statistical problems. Excel has some flaws concerning statistics. The most inconvenient ones are some inaccuracies with distribution functions, the not too high quality of its random number generators, inconsistent handling of missing data, and the unavailability of some of the most important types of statistical graphs (like histograms with unequal bin widths).

Therefore, it makes sense to use a more advanced statistical toolbox than just a spreadsheet program. This does not, however, imply that the spreadsheet paradigm has to be thrown overboard. The RExcel program (part of the R COM server project accessible at http://sunsite.univie.ac.at/rcom/ and described in [5]) allows one to use practically all the functionality of R from within Excel. This way, the student can still operate on the data with the direct manipulation method, but use statistical methods not available from the spreadsheet program alone.
This also demonstrates an important message about software in general: software should adapt to the user's needs. If possible, one should not be forced to switch programs; it is better if a standard package can be enhanced by extending its functionality.

RExcel is not the only statistical extension of Excel. PopTools (available from http://sunsite.univie.ac.at/Spreadsite/poptools) also is an example of how additional statistics functions can be integrated into the spreadsheet paradigm.

Statistical graphics is another extremely important concept to be discussed in the context of statistics education. [1] and [9] make a very con-
collect the data in the classroom. At the end of the class period, the handheld is connected to a notebook computer, the data are transferred, and then immediately a first step of the analysis can be performed in front of the students. The message of doing it this way is that collecting data can be set up quite conveniently, and therefore with good planning statistics can be used very quickly. For larger classes, a browser-based questionnaire is used. As part of this project, students also start asking questions about the privacy of their data, and so are exposed to the problems of collecting data through their own experience as part of the course.

All the projects and tools so far mostly have been concerned with analyzing data. An important area in statistics education we have not considered yet is probability. This is the topic of the next section.
References
[1] Friendly M. (2000). Visualizing categorical data. SAS Institute.
[2] Hastings K. (2000). Probability with Mathematica. Lewis Publishers.
[3] Neuwirth E. (2002). Recursively defined combinatorial functions: extending Galton's board. Discrete Math. 239, 33-51.
[4] Embedding R in standard software, and the other way round. In Hornik K., Leisch F. (eds.), DSC 2001 Proceedings, http://www.ci.tuwien.ac.at/Conferences/DSC-2001
[5] Neuwirth E., Baier T. (2001). Embedding R in standard software, and the other way round. In Hornik K., Leisch F. (eds.), DSC 2001 Proceedings, http://www.ci.tuwien.ac.at/Conferences/DSC-2001
[6] Neuwirth E., Arganbright D. (2003). The active modeler: mathematical modeling with Excel. Brooks-Cole.
[7] Neuwirth E. Probabilities, the US electoral college, and generating functions considered harmful. To appear in International Journal of Computers for Mathematical Learning.
[8] Rose C., Smith D. (2002). Mathematical statistics with Mathematica. Springer Verlag.
[9] Tufte E. (2001). The visual display of quantitative information. Graphics Press.
Acknowledgement: Thanks to Jaromir Antoch and Marlene Müller for their patience.
Address: E. Neuwirth, University of Vienna, Austria
E-mail: erich.neuwirth@univie.ac.at
COMPSTAT'2004 Symposium © Physica-Verlag/Springer 2004
EMBEDDING METHODS
AND ROBUST STATISTICS
FOR DIMENSION REDUCTION
George Ostrouchov and Nagiza F. Samatova
Key words: Dimension reduction, convex hull, FastMap, principal components, multidimensional scaling, robust statistics, Euclidean distance.
COMPSTAT 2004 section: Dimension reduction.
1 Introduction

Dimension reduction starts with n objects as points in a p-dimensional vector space and maps the objects onto n points in a k-dimensional vector space, where k < p. A more general situation arises when the point coordinates are not known and only pairwise distances (or a distance function to compute them) are available. This mapping of objects, based on their distances only, into a k-dimensional vector space is called finite metric space embedding [8]. Several embedding methods and their properties are discussed in [8], including FastMap, MetricMap, and SparseMap. The discussion centers mostly on whether the embeddings are contractive, a property of importance in similarity searching that guarantees no missed items. In this paper, we concentrate on FastMap and its properties that connect the technique to ideas in robust statistics.

FastMap was first introduced in [6] as a fast alternative to Multidimensional Scaling (MDS) [14] and a generalization of Principal Component Analysis (PCA) [9]. Given dimension k and Euclidean distances between n objects, FastMap maps the objects onto n points in k-dimensional Euclidean
space. An implicit assumption by FastMap that the objects are points in a p-dimensional Euclidean space (p ≥ k) is noted in [8]. Because of this assumption, FastMap is usually viewed as a dimension reduction method. When FastMap begins with Euclidean distances between the n objects, it has time complexity O(n). If the Euclidean distances must be explicitly computed from a p-dimensional vector representation, FastMap's time complexity is O(np).
We show how FastMap operates within the implicit or explicit p-dimensional Euclidean space containing the points of a data set. FastMap selects a sequence of k ≤ p orthogonal axes defined by distant pairs of points (called pivots) and computes the projections of the points onto the orthogonal axes. We show that FastMap picks all of its pivots from convex hull vertices of the original data set. This provides a connection to results in robust statistics, where the convex hull is used as a tool in multivariate outlier detection and in robust estimation methods. The connection sheds new light on some properties of FastMap, in particular its sensitivity to outliers, and provides an opportunity for a new class of dimension reduction algorithms that retain the speed of FastMap and exploit ideas in robust statistics.

We begin in Section 2 by defining the convex hull and some of its properties. In Section 3 we describe the FastMap algorithm. The main result, showing that FastMap pivots are pairs of vertices of the convex hull, is in Section 4. Section 5 discusses the implications of this result, and finally Section 6 presents an algorithm, RobustMap, that results from these implications. Some further comments and conjectures about connections to QR and QLP factorizations [13] are also made.
C(S) ∩ h(u, v),

where h(u, v) is a supporting hyperplane of S for some u, v ∈ Rᵖ. Further, for a p-dimensional polytope, facets are (p−1)-dimensional, ridges are (p−2)-dimensional, edges are 1-dimensional, and vertices are 0-dimensional.
3 FastMap overview

Given the Euclidean distance between any two points (objects) of S, k iterations of FastMap produce a k-dimensional (k ≤ p) representation of S. Each iteration selects from S a pair of points, called pivots, that define an axis, and computes coordinates of the S points along this axis. The pairwise distances for S can then be updated to reflect a projection of S onto the subspace (a hyperplane passing through the origin) orthogonal to this axis. The next iteration implicitly operates on the projected S in the subspace. However, these projections are accumulated and jointly performed only for the distances that are needed. In this manner, after k iterations, the S points end up with k coordinates giving their k-dimensional representation.
To provide details of the FastMap algorithm, we first introduce some notation. Let (aᵢ, bᵢ) be the pair of pivot elements from S at iteration i. Let dᵢ(x, y) be the Euclidean distance between points x and y of S after their ith projection onto a pivot-defined hyperplane, so that d₀(x, y) is the initial Euclidean distance. Also, let xᵢ be the ith coordinate of x in the resulting k-dimensional representation of x ∈ S.

Pivot elements are chosen by the choose-distant-objects heuristic shown in Fig. 1. Initially, i = 0. After selecting a pivot pair (aᵢ, bᵢ), the ith coordinate of each point x ∈ S is computed as

xᵢ = [dᵢ(aᵢ, x)² + dᵢ(aᵢ, bᵢ)² − dᵢ(bᵢ, x)²] / [2 dᵢ(aᵢ, bᵢ)]. (3)
Choose-distant-objects(S, dᵢ(·, ·))
1. Choose an arbitrary object s ∈ S
This projection is based on the law of cosines and current distances from the two pivot points. The distances are updated whenever needed in Choose-distant-objects or in (3). An update for a single iteration is presented in [6], and we extend this in [1] to a combined update

dᵢ(x, y)² = d₀(x, y)² − Σ_{j=1}^{i} (xⱼ − yⱼ)². (4)

This is based on the Pythagorean theorem and the sequence of i projections onto hyperplanes perpendicular to pivot axes.

There are k iterations, each requiring O(n) distance computations of O(p). The resulting total time complexity is O(npk). Note that if all the original distances are already available, the total time complexity is O(nk²) due to the sum in (4). If k is a small constant compared to n and p, as is usually the case, k is dropped from the above complexity statements, giving those we provided in the Introduction.
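Putting the pieces of this section together, a compact sketch of FastMap (our code, not the authors' implementation, assuming explicit coordinates so that distances can be recomputed on demand):

```python
import numpy as np

def fastmap(X, k):
    """Illustrative FastMap sketch: k iterations of pivot selection and
    law-of-cosines projection, with distances in the current subspace
    obtained via the combined update of Eq. (4)."""
    n = len(X)
    coords = np.zeros((n, k))

    def dist2(x, y, i):
        # Eq. (4): original squared distance minus the coordinates
        # already extracted on the first i pivot axes
        d2 = np.sum((X[x] - X[y]) ** 2) - np.sum((coords[x, :i] - coords[y, :i]) ** 2)
        return max(float(d2), 0.0)

    for i in range(k):
        # choose-distant-objects heuristic: arbitrary start, then two
        # farthest-point steps (both pivots are convex hull vertices)
        s = 0
        a = max(range(n), key=lambda t: dist2(s, t, i))
        b = max(range(n), key=lambda t: dist2(a, t, i))
        dab2 = dist2(a, b, i)
        if dab2 == 0.0:
            break  # remaining points coincide in the current subspace
        for x in range(n):
            # law-of-cosines projection onto the pivot axis, Eq. (3)
            coords[x, i] = (dist2(a, x, i) + dab2 - dist2(b, x, i)) / (2.0 * dab2 ** 0.5)
    return coords

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
Y = fastmap(X, 2)  # 50 points embedded in 2 dimensions
```

Each pivot gets coordinate 0 or dᵢ(aᵢ, bᵢ) on its own axis, and the O(npk) cost is visible in the two nested loops over n with O(p) distance evaluations.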
0 < (s − a)ᵀ(s − a)
  = (s − b + b − a)ᵀ(s − b + b − a)
  = (s − b)ᵀ(s − b) + 2(s − b)ᵀ(b − a) + (b − a)ᵀ(b − a)
  < 2(s − b)ᵀ(b − a) + 2(b − a)ᵀ(b − a)   by (5)
  = 2(s − b + b − a)ᵀ(b − a)
  = 2(s − a)ᵀ(b − a), (6)

which defines a supporting half space H(a, b) for all points in S. Since a is the only point in the supporting hyperplane h(a, b) of S, it must be a single point face of C(S). This, by Definition 2.3, is a vertex of C(S).

Next, the Choose-distant-objects heuristic finds the point in S most distant from a. By the same argument this is again a vertex of C(S). We state this as a lemma.
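The supporting-hyperplane argument of (6) can be checked numerically: the point farthest from an arbitrary s also maximizes the linear functional x ↦ (b − s)ᵀx over S, which certifies that it is a convex hull vertex (illustrative code, synthetic data):

```python
import numpy as np

rng = np.random.default_rng(2)
S = rng.normal(size=(200, 3))          # points in general position
s = S[0]                               # arbitrary starting object
far = int(np.argmax(np.linalg.norm(S - s, axis=1)))  # farthest point b
u = S[far] - s                         # outward normal of a supporting hyperplane
# b maximizes u'x over S, so the hyperplane through b with normal u supports S
assert int(np.argmax(S @ u)) == far
```

The inequality chain behind this is exactly the one above: (b − s)ᵀ(b − x) ≥ ‖b − s‖(‖b − s‖ − ‖x − s‖) > 0 for any x strictly closer to s than b is.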
After choosing a pair of vertices, FastMap projects the set S into a sub-
space orthogonal to the vector defined by the pivot pair (a, b) and repeats the
Choose-Distant-Objects heuristic in the subspace of dimension p - 1. Pivot
pairs and projections are computed until suitably many orthogonal vectors
are extracted to be used as the principal axes of the lower dimensional rep-
resentation of S . So far, we have shown that a pivot pair is a pair of convex
hull vertices within its current working subspace. Are they all also vertices
of C(S) in the original space? The answer is yes, subject to a uniqueness
caveat requiring that no pair of points (except the current pivot points) get
projected onto the same point. Assuming that the points S are in sufficiently
general position [15] takes care of this. Because we have a finite set of points,
we can perturb them by an arbitrarily small amount to achieve such a general
position. We show that a vertex in a subspace projection is a vertex in the
original p dimensional space.
Let P_H be a symmetric projection matrix into a subspace H ⊂ Rᵖ, and let S_H = {P_H u : u ∈ S} be the set of image points of S in this subspace. We also need to assume that S are in sufficiently general position so that all vertices of C(S_H) are projections of distinct points of S.

Lemma 4.2. If P_H s is a vertex in the convex hull of S_H and S are in general position, then s is a vertex in the convex hull of S.
for all P_H x ∈ S_H distinct from P_H s. Because S are in general position, this inequality is strict. Then equality holds for x = s, so it is the unique point on this supporting hyperplane of S, and thus it is a vertex of the convex hull of S. □

Letting S_V ⊆ S be the vertices of C(S), Lemmas 4.1 and 4.2 lead to the main result:

Theorem 4.1. FastMap pivot pairs are a subset of the vertices of the convex hull of the data. That is, {aᵢ, bᵢ} ⊆ S_V, i = 1, ..., k.
5 Implications

Convex hull computations in statistics are mostly associated with robust multivariate estimation. Loosely, an estimator of some parameter is said to be robust if it performs well even when the assumed model (implicit or explicit) is not satisfied by the data. For example, when estimating a location parameter, an implicit assumption is that the data are generated by one process that has a location. If more than one process generated the data, a robust estimator would still estimate the location of the dominant process rather than some meaningless location between the processes. The median, for example, is a robust estimator of location while the mean is not. A classic reference on robust estimation is [11].

The concept of trimming extremes is often used in reducing dependence on outliers in data [10]. Tukey is attributed with coining the term peeling as
the multivariate extension of trimming [10], where one peels off the vertices
of the convex hull before using the remaining points for estimating a location
parameter. This is based on a generalization of the simple practice of remov-
ing the maximum and minimum before computing the mean, which dates at
least to the early 19th century [10] . Here, with the aim of robustness, the
very points on which FastMap depends are discarded! Clearly, FastMap is
very sensitive to outliers in the data.
In situations where the data generation system is known to work smoothly,
such as machine generated data, outliers may not be of concern. For example,
we have recently found that in analyzing climate simulation and astrophysics
simulation data, methods that are sensitive to extremes often produce the
most compelling results. Here, the extremes are not outliers and may be of
most interest. On the other hand, massive data sets are often the result of
a long run with several checkpoint restarts where anomalies may occur. For
example, in [4], instrument generated Atmospheric Radiation Measurement
data [2] contains many instrument restarts that appear as zeros in data with
high positive values. Although it is easy to discover these, an automated ap-
plication of FastMap would be driven by the zero coordinate outliers. Clearly,
there are situations where an extremes-sensitive method like FastMap is ap-
propriate or even preferable as well as situations where it will fail.
Outlier sensitivity of FastMap is mentioned in [8], and PCA is presented as more robust. Although PCA is less sensitive to outliers than FastMap, it too is not considered a robust technique. A measure of estimator sensitivity to changes in extreme values of data is the notion of breakdown point [3]. Loosely speaking, the breakdown point is the smallest proportion of data that needs to be contaminated to make arbitrarily large changes to the estimator. By this definition, the breakdown point of FastMap is 1/n, which is asymptotically zero. Principal Components Analysis, the most popular dimension reduction method, also has a breakdown point of 1/n. In both cases, taking one point arbitrarily far in some direction will rotate the first axis in that direction.
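This one-point sensitivity is easy to demonstrate (a sketch using SVD-based PCA on synthetic data):

```python
import numpy as np

def first_axis(X):
    # leading principal axis via SVD of the centered data
    Xc = X - X.mean(axis=0)
    return np.linalg.svd(Xc, full_matrices=False)[2][0]

rng = np.random.default_rng(0)
# clean data: elongated along the x-axis
clean = np.column_stack([rng.normal(0, 3, 100), rng.normal(0, 0.5, 100)])
axis_clean = first_axis(clean)                 # essentially (±1, 0)

# contaminate with a single far point in the y direction
contaminated = np.vstack([clean, [[0.0, 1000.0]]])
axis_dirty = first_axis(contaminated)          # rotated to essentially (0, ±1)
```

One contaminated point out of 101 (a proportion of about 1/n) is enough to swing the first axis from the x direction to the y direction.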
Some robust PCA methods begin by computing a robust covariance matrix estimate and then proceed with standard PCA as usual. The classical example of a high breakdown estimator is the median, with a .5 breakdown point. That is, half of the data must be moved to make an arbitrarily large change in the median. A multivariate extension of the median is proposed in [12]. This extension uses the notion of half-space support to define the depth of a data point so that, ignoring ties, the point with maximal depth is the multivariate median.

The main lesson from robust statistics is that the most distant points are often not the best choice for defining a projection axis. The key to new fast and robust methods is a replacement of the Choose-distant-objects heuristic by something that considers more than just the maximum distance from a point. One should back off a little from the maximum, while considering the entire distance distribution. This distribution is already available within the O(np)
6 A RobustMap algorithm

The FastMap algorithm computes all distances from one object but uses only the maximum, resulting in an outlier-sensitive method. From a statistical viewpoint, the distribution of the distances contains information on potential outlier candidates. In essence, we are trimming the extremes of this distance distribution. A complication is that two objects with a similar distance to the reference object can be very far apart in the full p-dimensional space. Selecting a small number of extreme objects and clustering them in the full p-dimensional space can provide much more information on a robust choice of a distant object. Keeping the selection of a few objects fast and their number small lets us remain within the O(np) time complexity of FastMap.

We provide a simple variant of this idea. Take a constant number, say r ≪ n, of largest distances, cluster the corresponding objects, and choose a central point of the largest cluster as a pivot. This affords protection against a small number, about r/2, of outliers. Fig. 2 gives the choose-distant-objects heuristic for RobustMap. The parameter r can be some small number that depends on the level of contamination we expect in the data. A second pa-
Figure 3: Proportion of clean variability captured by each component axis (axes 1-6) for Reference, RobustMap, PCA, and FastMap, when presented with contaminated data. Reference is PCA on clean data only.
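The robust pivot choice described in Section 6 can be sketched as follows (our illustrative code: the threshold-based connectivity clustering and medoid selection are stand-ins for whatever clustering Fig. 2 specifies):

```python
import numpy as np

def robust_distant_object(X, ref, r=10):
    """Among the r objects farthest from X[ref], cluster in the full
    p-dimensional space and return a central member (medoid) of the
    largest cluster; this tolerates roughly r/2 gross outliers."""
    d = np.linalg.norm(X - X[ref], axis=1)
    cand = np.argsort(d)[-r:]                    # r most distant objects
    P = X[cand]
    pw = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)
    thresh = np.median(pw)                       # illustrative link threshold
    labels = -np.ones(r, dtype=int)              # connected components of the
    c = 0                                        # r candidates at threshold
    for i in range(r):
        if labels[i] < 0:
            labels[i] = c
            stack = [i]
            while stack:
                j = stack.pop()
                for t in range(r):
                    if labels[t] < 0 and pw[j, t] <= thresh:
                        labels[t] = c
                        stack.append(t)
            c += 1
    members = np.where(labels == np.bincount(labels).argmax())[0]
    medoid = members[np.argmin(pw[np.ix_(members, members)].sum(axis=1))]
    return int(cand[medoid])

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.5, size=(40, 2)),              # bulk of the data
    rng.normal(0.0, 0.3, size=(8, 2)) + [10.0, 0.0], # distant but real cluster
    [[100.0, 100.0], [120.0, -90.0]],                # two gross outliers
])
pivot = robust_distant_object(X, ref=0, r=10)
```

On this synthetic example the two gross outliers land among the r farthest candidates but form singleton clusters, so the returned pivot is a member of the genuine distant cluster, while plain FastMap would have latched onto an outlier.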
We also see some preliminary evidence that these methods are related to pivoting strategies in QR factorization and the recent QLP factorization [13] that provides a fast approximation for the Singular Value Decomposition. Our prototype implementation of RobustMap and FastMap differs from the original [6] by using Householder reflections applied to the rows, somewhat like the QLP factorization. We conjecture that FastMap, RobustMap, and their connection to the convex hull provide a geometric explanation for the success of QLP factorization and may be sources of new pivoting strategies for QR factorization. This is another direction where these methods may provide new insights.
References
[1] Abu-Khzam F.N., Samatova N., Ostrouchov G., Langston M.A., Geist A. (2002). Distributed dimension reduction algorithms for widely dispersed data. In Parallel and Distributed Computing and Systems, ACTA Press, 174-178.
[2] D.O.E. (1990). Atmospheric radiation measurement program plan. Technical Report DOE/ER-0441, U.S. Department of Energy, Office of
[4] Downing D.J., Fedorov V.V., Lawkins W.F., Morris M.D., Ostrouchov G. (2000). Large data series: Modeling the usual to identify the unusual. Computational Statistics & Data Analysis 32, 245-258.
[5] Erickson J. (1999). New lower bounds for convex hull problems in odd dimensions. SIAM J. Comput. 28 (4), 1198-1214.
[6] Faloutsos C., Lin K. (1995). FastMap: A fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets. In ACM SIGMOD Conference, San Jose, CA, May 1995, 163-174.
[7] Gallier J.H. (2000). Geometric methods and applications for computer science and engineering. Springer.
[13] Stewart G.W. (1999). The QLP approximation to the singular value decomposition. SIAM J. Sci. Comput. 20 (4), 1336-1348.
1 Introduction
Finding groups in data is a key activity in many scientific fields. Gordon [8] is
a good general reference. Classical partition and hierarchical algorithms have been very useful in many problems, but they have four main limitations. First, the criteria used are not affine equivariant, and therefore the results obtained depend on the changes of scale and/or rotation applied to the data. Second, the usual heterogeneity measures based on the Euclidean metric do not work well for highly correlated observations forming elliptical clusters or when the clusters overlap. Third, we have to specify the number of clusters or decide on a criterion for choosing it. Fourth, there is no general procedure to deal with outliers. Some advances have been made to solve these problems; see [4], [5] and [16].
An alternative approach to clustering is to fit mixture models. This idea has been explored from both the classical and the Bayesian point of view. Banfield and Raftery [3] and Dasgupta and Raftery [6] have proposed a model-based approach to clustering which finds an initial solution by hierarchical clustering, then assumes a mixture-of-normals model and uses the EM algorithm to estimate the parameters. A clear advantage of fitting normal mixtures is that the implied distance is the Mahalanobis distance, which is affine equivariant. From the Bayesian point of view, the parameters of the mixture are estimated by Markov chain Monte Carlo methods, and several procedures have been proposed to allow for an unknown number of components in the mixture; see [12] and [14]. A promising approach to cluster analysis that can avoid the curse of dimensionality is projection pursuit, where low-dimensional projections of the multivariate data are used to provide the most interesting views of the full-dimensional data. Peña and Prieto [11] have proposed an
372 Daniel Peña, Julio Rodriguez and George C. Tiao
algorithm where the data are projected on the directions of maximum heterogeneity, defined as those directions in which the kurtosis coefficient of the projected data is maximized or minimized. They then use the spacings to search for clusters in the univariate variables obtained by these projections.
Finally, Peña and Tiao [9] propose the SAR (split and recombine) procedure for detecting heterogeneity in a sample with respect to a given model. This procedure is general and affine equivariant, does not require the number of clusters to be specified a priori, and is well suited for finding the components in a mixture of models. The idea of the procedure is first to split the sample into more homogeneous groups and second to recombine the observations one by one in order to form homogeneous clusters. The SAR procedure has two important properties that are not shared by many of the most often used cluster algorithms: (i) it does not require an initial starting point; (ii) each homogeneous group is obtained independently from the others, so that the groups do not compete with one another to incorporate an observation. The first property implies that the algorithm we propose can be used as a first solution for any other cluster algorithm; the second, that the procedure may work well even if the groups are not well separated. This paper analyzes the application of the SAR procedure to cluster analysis and is organized as follows. Section 2 presents the main ideas of the procedure. Section 3 compares it in a Monte Carlo study to Mclust (Model Based Cluster, [7]), k-means, pam (Partition around medoids, [15]) and Kpp (Kurtosis projection pursuit, [11]).
where Q_f = [n/(n+1)] (x_f - x̄)' V^{-1} (x_f - x̄), and x̄ is the sample mean and V the sample covariance matrix, given by V = (X - 1x̄')'(X - 1x̄')/(n - p). Following Peña and Tiao [9] we will use, as a measure of heterogeneity of a data point x_i with respect to a group X_{(i)} which does not contain this observation, the standardized predictive value given by
A general partition cluster algorithm 373
H(x_i, X_{(i)}) = -2 ln{ p(x_i | X_{(i)}) / p(x̂_{i(i)} | X_{(i)}) } = (n - 1) ln{ 1 + Q_{i(i)} / (n - 1 - p) },   (1)
where Q_{i(i)} = [(n-1)/n] (x_i - x̄_{(i)})' V_{(i)}^{-1} (x_i - x̄_{(i)}), and V_{(i)} and x̄_{(i)} are the covariance matrix and the mean computed using the sample X_{(i)} without the i-th case. Note that H(x_i, X_{(i)}) is a monotonic function of the Mahalanobis distance Q_{i(i)}, which is usually used to check the heterogeneity of a point x_i with respect to the sample X_{(i)}.
The splitting of the sample is made as follows. For each observation x_i, we define the discriminator of this point as the observation which, when deleted from the sample, makes the point x_i as heterogeneous as possible with respect to the rest of the data. The discriminator of x_i is the point x_j if

H(x_i, X_{(ij)}) = max_{k ≠ i} H(x_i, X_{(ik)}),

where X_{(ik)} is the sample without the i-th and k-th cases.
Each sample point must have a unique discriminator, but several sample
points may share the same discriminator. It can be proved (see [10]) that
the discriminators are members of the convex hull of the sample. That is,
a discriminator must be an extreme point. An intuitive procedure to split the sample into groups is to put together observations which share the same discriminator, as they are affected in the same way by modifications of the sample that delete some extreme values. Clearly, if two observations are identical they will have the same discriminator, and if they are close they will also have the same discriminator. The number of points in the sample which share the same discriminator is called the order of the discriminator.
We consider as special points discriminators of order larger than K, where K = f(p, n), and we will put them in a special group of extreme observations. Discriminators of order smaller than K, however, are considered as usual points and are assigned to the group defined by all the observations that share a common discriminator. We need to define the minimum size of a set of data to be considered as a group. We will say that we have a group if we can compute the mean and covariance matrix of the group; therefore, the minimum group size must be n_0 = p + h, where h > 0 and p is the number of variables. Usually h = f(p, n), and in the examples we have taken h = log(n - p). In the procedure which follows we have considered as special points those discriminators of order larger than K, where K = p + h - 1. This value seems to work well in the simulations we have made. Based on these considerations, the sample is split as follows: 1) observations which have the same discriminator are put in the same group; the discriminator is only included in the group if it has order smaller than K; 2) discriminators of order larger than K are allocated to a specific group of isolated points; 3) if
two groups formed by the previous rules have any observation in common, the two groups are joined into one group. These three rules split the sample into more homogeneous groups. Each group is now considered as a new sample and the three rules are applied again, until splitting the sample further would only lead to isolated points because the groups obtained are all of size smaller than the minimum group size n_0. A group of data is called a basic group if splitting it would lead to subgroups of size smaller than the minimum size, p + h.
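The three splitting rules can be sketched as follows. The dictionary `disc` (observation index to discriminator index) is assumed to be precomputed; finding the discriminators themselves, by maximizing H over leave-two-out samples, is omitted here.

```python
from collections import defaultdict

def split_by_discriminators(disc, K):
    """One splitting pass. disc[i] = discriminator of observation i;
    K is the special-point threshold (K = p + h - 1 in the text).
    Returns (groups, isolated) following the three rules."""
    order = defaultdict(int)            # order = how many points share a discriminator
    for j in disc.values():
        order[j] += 1
    groups = defaultdict(set)
    for i, j in disc.items():
        groups[j].add(i)                # rule 1: same discriminator, same group
        if order[j] < K:
            groups[j].add(j)            # the discriminator joins only if its order < K
    isolated = {j for j, o in order.items() if o > K}   # rule 2: isolated points
    merged = []                         # rule 3: join groups sharing an observation
    for g in groups.values():
        g, rest = set(g), []
        for h in merged:
            if g & h:
                g |= h
            else:
                rest.append(h)
        merged = rest + [g]
    return merged, isolated
```

In a toy run, a discriminator shared by more than K points lands in the isolated set while the points that share it still form a group of their own.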
When the sample cannot be split further, the recombining process is applied, starting from any of the basic groups obtained. The recombining process is the one suggested by Peña and Tiao [9]. Each group is enlarged by incorporating observations one by one. For a given group, we begin by testing the observation outside the group which is closest to the group in terms of the measure H(y_f, X_g), where y_f is the observation outside the group formed by the data X_g. If H(y_f, X_g) is smaller than some cut-off value, namely the 99th percentile of the distribution of the statistic H(y_f, X_g), this observation is incorporated into the group, and the process of testing the closest observation to the group is repeated for the enlarged group. The enlarging process will continue until either the threshold is crossed or the entire sample is included. A similar idea of recombining points has been used for robust estimation (see, for instance, [1]). We may have one of three possible cases.
First, the enlarging of all the basic groups leads to the same group, which includes all the observations apart from some outliers. Then we have a homogeneous sample with some isolated outliers, and the procedure ends. Second, the enlarging of the basic groups leads to a partition of the sample into disjoint groups; we conclude that there are some groups in the data, and again the procedure ends. Third, we obtain more than one possible solution because the partition obtained is different when starting from different basic groups. The final solutions found are called possible data configurations (PDC). The selection among them is made by a model selection criterion.
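The recombining loop can be sketched as a short skeleton; the heterogeneity measure H and the cut-off are supplied by the caller, and this is an illustration of the enlarging step rather than the authors' implementation.

```python
def recombine(group, outside, H, cutoff):
    """Enlarge a basic group one observation at a time: repeatedly test
    the closest outside observation and admit it while H stays below the
    cut-off (the text uses the 99th percentile of H's distribution)."""
    group, outside = list(group), list(outside)
    while outside:
        y = min(outside, key=lambda obs: H(obs, group))   # closest outside point
        if H(y, group) >= cutoff:
            break                       # threshold crossed: stop enlarging
        group.append(y)
        outside.remove(y)
    return group
```

With a toy one-dimensional "H" (distance to the group mean), a nearby point is absorbed and a distant outlier is left out.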
The properties of the algorithm have been studied in a Monte Carlo experiment similar to the one used by Peña and Prieto [11] to illustrate the behavior of their cluster procedure. Sets of 10 × p × k random observations in dimension p = 2, 4, 8 have been generated from mixtures of k = 2, 4 components of multivariate distributions. In all data sets the number of observations from each distribution has been determined randomly, but ensuring that each cluster contains a minimum of p + 1 observations. The mean for each distribution is chosen at random from the multivariate normal distribution N_p(0, f·I). The factor f (see Table 1) is selected to be as small as possible while ensuring that the probability of overlap between groups is roughly equal to 0.01. We generated data sets in six different scenarios.
Table 1: Percentages of mislabeled observations for the SAR, the Kpp, the k-means, the Mclust and the pam procedures. Normal observations with: (a1) covariance matrices well-conditioned, (a2) covariance matrices ill-conditioned. The best method in each case is indicated in boldface.
Table 2: Percentages of mislabeled observations for the SAR, the Kpp, the k-means, the Mclust and the pam procedures. Uniform observations with: (b1) covariance matrices well-conditioned, (b2) covariance matrices ill-conditioned.
Table 2 shows the outcome for scenarios b1) and b2), where we analyze the same structure as in scenarios a1) and a2) but now using mixtures of uniform distributions. Table 2 shows the percentages of mislabeled observations for both scenarios b1) and b2). The behavior of the SAR procedure is again the best on average, and the best in ten of the twelve cases. The second best behavior corresponds to Kpp, which is better than Mclust in eleven out of the twelve cases.
Table 3: Percentages of mislabeled observations for the SAR, the Kpp, the k-means, the Mclust and the pam procedures. Normal observations with 10% outliers: (c1) non-concentrated contamination, (c2) concentrated contamination.
A final simulation study has been conducted (see Table 3) to determine the behavior of the methods in the presence of outliers. Scenarios c1) and c2) contain 10% of data contaminated by, first, a non-concentrated contamination (scenario c1) and, second, a concentrated contamination (scenario c2). The criterion to identify mislabeled observations is based only on the 90% of observations that are not contaminated. Table 3 shows the percentage of mislabeled observations for scenarios c1) and c2). The maximum number of clusters k has been increased to ten in the algorithms k-means, Mclust and pam, so that the concentrated contamination can be treated as isolated clusters. In scenario c1) the best methods, on average, are, with very small difference, the pam algorithm and the SAR procedure. However, for concentrated contamination, scenario c2), the SAR procedure is again clearly the best, followed by Kpp. As a summary of this Monte Carlo study we may conclude that the SAR procedure has the smallest classification error rate in
22 out of the 36 situations considered, and the best average number of mislabeled observations in 5 of the 6 scenarios considered. The only scenario in which the SAR is not the best is scenario c1), but the difference with respect to the best method, pam, is very small: a misclassification percentage of 6.4% versus 6.32% for pam. Kpp is the second best in five out of the six scenarios. Ordering the methods by average classification error over all the scenarios from best to worst, the order would be: SAR, Kpp, Mclust, pam and k-means.
References
[1] Atkinson A.C. (1994). Fast very robust methods for detection of multiple outliers. Journal of the American Statistical Association 89, 1329-1339.
[2] Box G.E.P., Tiao G.C. (1973). Bayesian inference in statistical analysis. Addison-Wesley.
[3] Banfield J.D., Raftery A. (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics 49, 803-821.
[4] Cuesta-Albertos J.A., Gordaliza A.C., Matran C. (1997). Trimmed k-means: an attempt to robustify quantizers. The Annals of Statistics 25, 553-576.
[5] Cuevas A., Febrero M., Fraiman R. (2000). Estimating the number of clusters. Canadian Journal of Statistics 28, 367-382.
[6] Dasgupta A., Raftery A.E. (1998). Detecting features in spatial point processes with clutter via model-based clustering. Journal of the American Statistical Association 93, 294-302.
[7] Fraley C., Raftery A.E. (1999). MCLUST: Software for model-based cluster analysis. Journal of Classification 16, 297-306.
[8] Gordon A. (1999). Classification. 2nd edn. London: Chapman and Hall-CRC.
[9] Peña D., Tiao G.C. (2003). The SAR procedure: A diagnostic analysis of heterogeneous data. (Manuscript submitted for publication).
[10] Peña D., Rodriguez J., Tiao G.C. (2004). Cluster analysis by the SAR procedure. (Manuscript submitted for publication).
[11] Peña D., Prieto J. (2001). Cluster identification using projections. Journal of the American Statistical Association 96, 1433-1445.
[12] Richardson S., Green P.J. (1997). On Bayesian analysis of mixtures with an unknown number of components. Journal of the Royal Statistical Society B 59, 731-758.
[13] Rousseeuw P.J., Leroy A.M. (1987). Robust regression and outlier detection. New York: John Wiley.
[14] Stephens M. (2000). Bayesian analysis of mixture models with an unknown number of components: an alternative to reversible jump methods. The Annals of Statistics 28, 40-74.
[15] Struyf A., Hubert M., Rousseeuw P.J. (1997). Integrating robust clustering techniques in S-PLUS. Computational Statistics and Data Analysis 26, 17-37.
[16] Tibshirani R., Walther G., Hastie T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society B 63, 411-423.
Address: D. Peña, Departamento de Estadística, Universidad Carlos III de Madrid, Spain
J. Rodriguez, Laboratorio de Estadística, Universidad Politécnica de Madrid, Spain
G.C. Tiao, Graduate School of Business, University of Chicago, USA
E-mail: dpena@est-econ.uc3m.es
COMPSTAT'2004 Symposium © Physica-Verlag/Springer 2004
ITERATIVE DENOISING
FOR CROSS-CORPUS DISCOVERY
Key words: Text document processing, statistical pattern recognition, dimensionality reduction.
1 Introduction
The "integrated sensing and processing decision trees" introduced in [9] proceed according to the following philosophy. Assume that there is a heterogeneous collection of entities X = x_1, ..., x_n which can, in principle, be measured (sensed) in a large number of ways. Because the sensor cannot make all measurements simultaneously, either due to physical sensor constraints or because of the high intrinsic dimension of the complete feature collection, only a subset of the possible measurements is to be made at any one time.
Thus, for the entire entity collection X a first set of measurements is made. Based on the features obtained, X is partitioned into {X_1, ..., X_{J_1}}, each X_{j_1} being (presumably) more homogeneous than the original entity collection X. Then, for each partition cell X_{j_1} a new set of measurements is considered. This process continues, generating branches consisting of "iteratively denoised" entity collections {X_{j_1,1}, ..., X_{j_1,J_2}}, {X_{j_1,j_2,1}, ..., X_{j_1,j_2,J_3}}, and so forth, until a collection (say, X_{j_1,j_2,j_3}) is deemed sufficiently coherent for inference to proceed. Such collections are the leaves of the tree.
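The philosophy above can be sketched as a short recursion; `measure`, `partition`, and `coherent` are placeholder names standing in for the sensing, clustering, and stopping-rule choices, which the paper leaves open.

```python
def denoising_tree(X, measure, partition, coherent):
    """Grow an iterative denoising tree: extract features for the current
    entity collection, stop if it is coherent (a leaf), otherwise
    partition it and recurse on each non-empty cell."""
    features = measure(X)               # collection-dependent measurements
    if coherent(X, features):
        return {"leaf": X}
    cells = [c for c in partition(X, features) if c]
    return {"children": [denoising_tree(c, measure, partition, coherent)
                         for c in cells]}
```

A toy run on numbers, splitting by sign until each cell is sign-homogeneous, produces a two-leaf tree.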
382 Carey E. Priebe et al.
Here f_{x,w} = c_{x,w}/N, where c_{x,w} is the number of times word w appears in document x and N is the total number of words in the corpus C. This information is discounted to reduce the impact of infrequent words via
The mutual information feature vector, then, for document x in corpus C, is given by

ξ_x = L_C(x) = [m_{x,w_1}, ..., m_{x,w_{d_L(C)}}].

Given two documents x, y ∈ C, the distance (we use the term loosely; it is in fact a pseudo-dissimilarity) employed, ρ, is given by
Thus

ρ ∘ L_C(C)

is a |C| × |C| interpoint distance matrix. All subsequent processing will be based on these interpoint distances, as discussed in [7]. However, the features, and hence the interpoint distances themselves, are corpus dependent and so, as the iterative denoising tree is built, based on the evolving partitioning, these distances change.
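A toy version of corpus-dependent word features follows. Since the discounting step and the exact mutual-information formula were lost in extraction, the plain pointwise-MI form below is an assumption, not the authors' definition; its purpose is only to show how the features depend on the corpus.

```python
import math
from collections import Counter

def mi_features(docs):
    """docs: list of documents, each a list of word tokens.
    Returns (vocab, feats) where feats[x][k] is a pointwise-MI score
    for word vocab[k] in document x, computed over the whole corpus."""
    N = sum(len(doc) for doc in docs)                 # total words in corpus
    word_tot = Counter(w for doc in docs for w in doc)
    vocab = sorted(word_tot)                          # the d_L(C) distinct words
    feats = []
    for doc in docs:
        counts = Counter(doc)
        f_doc = len(doc) / N                          # mass of this document
        feats.append([math.log((counts[w] / N) / (f_doc * (word_tot[w] / N)))
                      if counts[w] else 0.0 for w in vocab])
    return vocab, feats
```

Because the totals N and word_tot are corpus-wide, recomputing the features on a subcorpus changes every entry, which is exactly the corpus dependence the text describes.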
Multidimensional scaling [2] is used to embed the interpoint distance matrix ρ ∘ L_C(C) into a Euclidean space R^{d_mds(C)}. Notice first that, if the feature vectors were Euclidean (that is, if we were using an actual distance in the d_L(C)-dimensional space), then the features could be represented with no distortion in R^{d_L(C)-1}. Alas, they are not, and cannot be. So

mds ∘ ρ ∘ L_C(C)

For this Science News corpus C, feature extraction via L_C(C) yields a feature dimension d_L(C) = 10906. That is, there are 10906 distinct meaningful words in the corpus, and the Lin & Pantel feature extraction produces a 1047 × 10906 feature matrix.
Multidimensional scaling (Figure 1, left panel) on the 1047 × 1047 interpoint distance matrix ρ ∘ L_C(C) yields d_mds(C) = 898. (Numerical issues in the multidimensional scaling algorithm make 898 the largest dimension into which the interpoint distance matrix can be embedded. So, while Figure 1
Figure 1: Multidimensional scaling (left panel) for the original 1047 10906-dimensional SN feature vectors. The largest numerically stable multidimensional scaling embedding is d_mds(C) = 898. (This left curve suggests that perhaps 200, and certainly 400, dimensions is sufficient to adequately fit the documents into Euclidean space.) Principal components (right panel) for the 898-dimensional Euclidean embedding of the original 1047 10906-dimensional SN feature vectors. (The "elbow" of this scree plot occurs, perhaps, in the range of 10-50 principal components.)
suggests that perhaps 200, and certainly 400, dimensions is sufficient to adequately fit the documents into Euclidean space, we avoid the first model selection quandary by choosing the largest numerically stable multidimensional scaling embedding.)
A subsequent principal component analysis of the 898-dimensional Euclidean features mds ∘ ρ ∘ L_C(C) yields the scree plot presented in Figure 1, right panel. This scree plot suggests that a latent semantic index dimension of perhaps 10-50 is appropriate for the SN corpus.
Figure 2 displays the projection of the data set onto the first two principal components of

pca ∘ mds ∘ ρ ∘ L_C(C)   (1)

for the Science News corpus. Notice that this plot suggests that the combination feature extraction/dimensionality reduction we have employed (eq. 1) has captured well some of the information concerning the eight classes, despite the fact that we are viewing just two dimensions (as opposed to, say, the 10-50 dimensions suggested by the scree plot in Figure 1). To wit: there are two groups extending from and distinguishable from the main body of documents. These two groups are dominated by medicine (the upper left arm) and astronomy (the upper right arm). Additionally, some physics documents are present in the astronomy arm and some life sciences and behavioral sciences documents are present in the medicine arm. That physics should have some similarity with astronomy, and that life sciences and behavioral sciences should have some similarity with medicine, agrees with intuition.
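The mds step in the pipeline can be sketched with classical (Torgerson) scaling, a standard construction that need not match the authors' implementation:

```python
import numpy as np

def classical_mds(D, k):
    """Embed an n x n interpoint distance matrix D into R^k by classical
    (Torgerson) multidimensional scaling: double-center the squared
    distances and keep the top-k eigenpairs."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J             # double-centered squared distances
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:k]           # largest eigenvalues first
    return V[:, idx] * np.sqrt(np.clip(w[idx], 0.0, None))
```

For a genuinely Euclidean distance matrix the embedding is exact; for the pseudo-dissimilarity ρ used in the text, negative eigenvalues appear and are clipped, which is one source of the numerical ceiling at d_mds(C) = 898.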
[Figure 2 scatter plot appears here. Legend symbols: Anthro, Astro, Behavior, Earth, Life, Math, Med, Physics.]
Figure 2: The first two principal components of pca ∘ mds ∘ ρ ∘ L_C(C) for the Science News corpus. The eight symbols represent the eight classes; the three clusters generated via hierarchical clustering correspond roughly to the main body and the two arms. Notice that there are two groups extending from and distinguishable from the main body of documents. These two groups are dominated by medicine (the upper left arm) and astronomy (the upper right arm). The documents selected as our anecdotal "meaningful association" are indicated throughout by the solid dots and document number.
V = [54, 121, 72, 137, 205, 60, 280, 118].
The iterative denoising tree for cross-corpus discovery is illustrated on the
SN corpus in Figure 3. This figure provides a coarse depiction of one path,
from root to leaf, of the tree; a row-by-row description thereof follows.
Recall that these 1047 documents yield a feature dimension d_L(C) = 10906 and an mds dimension d_mds(C) = 898. We display the first two principal components; thus the root (row 1) in Figure 3 is presented in detail in Figure 2.
[Figure 3 appears here: a coarse depiction of one root-to-leaf path of the iterative denoising tree for the SN corpus.]
Row 2: In the same space as for Row 1, we have simply split out three clusters obtained via hierarchical clustering, for display convenience.
(We choose in this manuscript to avoid model selection details; e.g., the choice of three vs. two clusters at the root. In general, we recommend that this issue be avoided by generating a binary tree unless user intervention is possible. In this example, the root begs for three clusters: a core and two arms.)
V_2 = [2, 113, 0, 10, 4, 0, 1, 36].
Thus, C_2 contains nearly all (113 of 121) of the astronomy documents, nearly one third (36 of 118) of the physics documents, and only a smattering from the other classes. So while the original feature extraction was done in the context of a corpus containing medicine, behavioral sciences, and mathematics documents, these topics are not a part of the context for the feature extraction for C_2, and this feature extraction can therefore focus on features germane to physics and astronomy.
(See Figure 4 for more detail.) These 166 documents yield a feature dimension d_L(C_2) = 3037 and an mds dimension d_mds(C_2) = 162. Since L involves corpus-dependent feature extraction, this display is different than the "cluster 2" display in Row 2. This difference is due to denoising. The indicated partition represents the clusters generated via hierarchical clustering. Notice that one of the clusters (C_22, lower right, containing 91 documents) contains approximately half of C_2's astronomy documents (52 of 113) and nearly all of C_2's physics documents (35 of 36). In continuing pursuit of our anecdotal meaningful cross-corpus discovery, we follow C_22.

V_22 = [0, 52, 0, 1, 2, 0, 1, 35].
The left display in Row 4 (see Figure 5 for more detail) depicts

S_22 = {10500, 10651} ⊂ C_22.

(These documents were chosen arbitrarily, for the purposes of illustration: they consist of a physics document about neutrinos and an astronomy document about black holes.) In the display, the two black squares represent S_22.
The right display in Row 4 (see Figure 6 for more detail) depicts the altered geometry after consideration of S_22. That is, here we have added
[Figures 4-6 appear here: principal component displays for C_2 and C_22, and the altered geometry after the tunnelling features for S_22 are added.]
Figure 7: Node N_221 in the iterative denoising tree for the SN corpus.
Row 5: The document collection C_221 is, again, almost entirely astronomy and physics, with

|C_221| = 17

and

V_221 = [0, 8, 0, 1, 0, 0, 0, 8].

These 17 documents yield a feature dimension d_L(C_221) = 367 and an mds dimension d_mds = 16. After recalculating the features for C_221, we display
(See Figure 7 for more detail.) (A value of e' = 100 is used here; the impact of the tunnelling feature is lessened.)
Row 6: Here we consider one of the two clusters, C_2212, from N_221 via
and

V_2212 = [0, 6, 0, 1, 0, 0, 0, 5].
Pairs of documents from different classes which fall to the same leaf of the
iterative denoising tree are candidate associations. Thus this example yields
16 candidate associations, at least one of which (astronomy #10422 = "X-Ray
Universe: Quasar's jet goes the distance" by R. Cowen, Science News Online,
Feb. 16, 2002 & physics #10516 = "Glimpses inside a tiny, flashing bubble"
by I. Peterson, Science News Online, Oct. 5, 1996) is plausibly a meaningful
association.
3 Conclusion
We have presented an anecdote (not an experiment!) suggesting that an iterative denoising methodology can be a useful tool in discovering meaningful cross-corpus associations. Corpus-dependent feature extraction is an essential part of the methodology, providing features which are iteratively fine-tuned to ever more homogeneous subsets of documents as one progresses down the tree. The specific approaches to feature extraction, dimensionality reduction, and partitioning may be profitably altered within the framework of the general methodology. The adaptive geometry provided by employing distance-to-subset "tunnelling" features allows the user to alter the details of tree growth. Experimental design to allow for statistical evaluation of the performance of the methodology provides some interesting hurdles, and will be reported elsewhere.
Finally, we note that the methodology described is not specific to text document processing, and may have application in many disparate discovery scenarios. The fundamental idea, as in [9], is to address the problem of there being more measurements that can be made than should be made at any one time.
References
[1] Berry M.W., editor (2004). Survey of text mining: clustering, classification, and retrieval. Springer-Verlag.
[2] Borg I., Groenen P. (1997). Modern multidimensional scaling: theory and applications. Springer-Verlag.
[3] Cowen L.J., Priebe C.E. (1997). Randomized nonlinear projections uncover high-dimensional structure. Advances in Applied Mathematics 9, 319-331.
[4] Deerwester S., Dumais S.T., Furnas G.W., Landauer T.K., Harshman R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science 41 (6), 391-407.
Acknowledgement: Sponsored by the Defense Advanced Research Projects Agency under "Novel Mathematical and Computational Approaches to Exploitation of Massive, Non-physical Data", ARPA Order No. P246, Program Code 3E20. Issued by DARPA/CMO under Contract No. MDA972-03-C-0014 to AlgoTek, Inc. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either explicitly or implied, of DARPA or the U.S. Government. Approved for Public Release, Distribution Unlimited.
Address: C.E. Priebe, E.J. Wegman, D.A. Socolinsky, K.W. Church, R. Guglielmi, R.R. Coifman, D. Lin, M.Q. Jacobs, A. Tsao, AlgoTek, Inc., 3811 N. Fairfax Dr., Suite 700
D.J. Marchette, J.L. Solka, NSWCDD B10, Dahlgren, VA
Y. Park, D. Karakos, Johns Hopkins U., Balt., MD
D.M. Healy, DARPA, Arlington, VA 22203
E-mail: cep@jhu.edu
Abstract: Differential equations are the natural way to model systems with functional inputs and functional outputs. They allow us to study the system's dynamics in the sense of explicitly modelling how the output changes in response to sudden changes in input. For example, engineers developing control systems for industrial processes routinely use DIFE's as modelling tools.
A new method is described for going directly from noisy discrete data, not necessarily sampled at equally spaced times, to a system of differential equations of arbitrary orders, linear or nonlinear, that describes the data. The method involves a generalization of nonparametric curve estimation in which the penalty functional rather than the smoothing functions is estimated. Examples are drawn from chemical engineering and medicine.
Dx = f(t, x),

defines a dependency of the first derivative Dx on the function x as well as, possibly, other direct dependencies on the argument t.
The talk for which this paper is a summary aims to make three general points:
• DIFE's are powerful tools for modeling data. Indeed, they are already routinely used in the chemical, physical and biological sciences as well as in engineering. They are important primarily because they model the dynamics of an observed process; that is, rates of change are modeled along the observed function. This is especially important in input/output systems, where how the system responds to an abrupt change in input can be as important as the long-term change that results.
• We have new methods for fitting differential equations or dynamic models to raw noisy data that appear to be substantially more effective than
394 Jim O. Ramsay
Figure 1: The upper panel shows the level of material in a tray of a distillation
column in an oil refinery, and the lower panel shows the flow of material being
distilled into the tray. The points are measured values, and the solid lines
are smooths of the data using regression splines.
describes the endogenous or internal dynamics of the system, the forcing
function u is an exogenous functional independent variable that perturbs
these internal dynamics. The functions α and β are the coefficient functions
that define the DIFE. The system is linear in these coefficient functions, and
also in the input and output functions.
One way to understand the separate roles of α and β is to study a simpler
constant coefficient model with an input that steps from 0 to 1 at time 1 and
for which x(0) = 1. The solution to the equation in this case is

x(t) = e^{−βt} + (α/β){1 − e^{−β(t−1)}},  t ≥ 1.

We see that β controls the rate of change and that the ultimate level or
gain is α/β. We can compare α to the volume control on a radio playing
a song carried by radio signal u; the bigger α, the louder the sound. The
bass/treble control, on the other hand, corresponds to β; the larger β, the
higher the frequency of what we hear.
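The step response just described can be checked numerically. The sketch below assumes the first-order form Dx = −βx + αu implied by the discussion of the gain α/β; the α and β values are invented for illustration.

```python
# Hypothetical first-order model Dx = -beta*x + alpha*u (assumed form),
# with input u stepping from 0 to 1 at t = 1 and x(0) = 1.
alpha, beta = 4.0, 0.5

def step_response(alpha, beta, t_end=40.0, dt=1e-3):
    """Integrate Dx = -beta*x + alpha*u by forward Euler."""
    x, t = 1.0, 0.0                    # initial condition x(0) = 1
    for _ in range(int(t_end / dt)):
        u = 1.0 if t >= 1.0 else 0.0   # step input
        x += dt * (-beta * x + alpha * u)
        t += dt
    return x

x_final = step_response(alpha, beta)   # settles close to the gain alpha/beta
```

With these invented values the trajectory settles at α/β = 8, confirming that α sets the ultimate level while β sets the speed of approach.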
x(t_j) = Σ_{k=1}^{K} c_k φ_k(t_j),  (3)
where
• Φ is the n by K matrix of basis function values φ_k(t_j).
From data to differential equations 397
Substituting (3) into (2), we may now minimize the un-penalized profiled
error sum of squares with respect to the parameters α and β. Our experience
is that the smoothing parameter λ can usually be selected by minimizing the
generalized cross-validation (GCV) criterion.
This process may be extended to equations of an arbitrary order, nonlinear
equations, and systems of equations.
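The role of GCV in choosing a smoothing parameter can be illustrated with a generic roughness-penalized smoother. This is not the paper's profiled DIFE criterion: the Gaussian basis, difference penalty and test data below are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented example: choose lambda by generalized cross-validation for a
# generic penalized basis smoother.
n, K = 100, 20
t = np.linspace(0.0, 1.0, n)
y = np.sin(2 * np.pi * t) + 0.1 * rng.standard_normal(n)

# A simple Gaussian-bump basis (an assumption; any spline basis would do).
centers = np.linspace(0.0, 1.0, K)
Phi = np.exp(-0.5 * ((t[:, None] - centers[None, :]) / 0.08) ** 2)

# Second-difference roughness penalty on the basis coefficients.
D = np.diff(np.eye(K), n=2, axis=0)
P = D.T @ D

def gcv(lam):
    """GCV(lambda) = n * RSS / (n - tr(S))**2 for smoother matrix S."""
    S = Phi @ np.linalg.solve(Phi.T @ Phi + lam * P, Phi.T)
    resid = y - S @ y
    return n * (resid @ resid) / (n - np.trace(S)) ** 2

lams = 10.0 ** np.arange(-4.0, 3.0)          # candidate grid for lambda
best = min(lams, key=gcv)                    # minimize GCV over the grid
fit = Phi @ np.linalg.solve(Phi.T @ Phi + best * P, Phi.T @ y)
```

A grid search over λ as above is the usual practical approach; the selected fit tracks the underlying smooth curve rather than the noise.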
were 0.18, 2.8 and 49.3, respectively. These represent improvements in pre-
cision of estimation by factors of 1.8, 3.3 and 6.4, respectively. We estimated
β₂ to be 353.6, whereas the right value was 355.3.
The most dramatic improvement in derivative estimation occurred at the
boundaries. Estimating the linear differential operator virtually eliminated
the usual instability of derivative estimates in these regions because these
estimates are linked by the DIFE to the behavior of the function values, which
are only mildly more unstable at the boundaries than within the interior. But
even in the interior, for example, the precisions of the estimates of Dx and
D²x were at least doubled.
Figure 2: The fit to the data defined by the differential equation is shown as
a solid line, and the data as points.
ranges from 3 to 400 per 100,000. Lupus can appear at any age, and the
earlier it appears, the more severe it tends to be. Lupus is on the increase,
and in some places is now more common than rheumatoid arthritis. Genetic,
environmental, and hormonal factors are all involved. Exposures to chemicals
and ultra-violet light are suspected triggers for flares.
Symptoms range from mild to severe, and can cause permanent damage
or be fatal. A rash on the face and chest, pain and swelling in the joints
and fatigue are common and early signs of a flare. The kidneys are often
affected, with swelling and loss of function, and end-stage renal failure is
a real risk. The heart, arteries, lungs, eyes and central nervous system may
also be involved; and the psychological effects of lupus are receiving more
and more attention. A typical flare goes from just noticeable to acute in the
order of ten days or less.
The variation in the nature and severity of symptoms combined with the
unpredictability of flares makes treating this disease a huge challenge. Mild
symptoms are treated with anti-inflammatory drugs (aspirin, etc.), and more
severe symptoms require the use of corticosteroids, usually prednisone. The
response time to an increase in prednisone dose is usually of the order of
a few days. However, corticosteroids are toxic if taken over long periods at
high doses, with common side effects being weight gain, sleeplessness and
osteoporosis. Sudden decreases in dose can trigger a new flare; consequently,
high dose levels must be tapered down gradually.
Patients are assessed at regular intervals. Although lupus symptoms are
multidimensional, long term treatment requires some overall measure of dis-
ease severity. A number of symptom severity scales have been proposed, and
the SLEDAI scale is now widely used. SLEDAI is a check list of 24 symptoms,
each given a numerical weight ranging from 1 for fever to 8 for seizures.
A flare has been defined by an international committee as a SLEDAI score
increase of 3 or more to a level of 8 or higher. During flares SLEDAI scores
of 25 to 30 are common.
A joint McGill/University of Toronto team headed by Dr. Paul Fortin has
complete histories for about 300 patients spanning, in many cases, around 20
years. This is one of the largest and highest quality sets of patient records in
the world.
Figure 3 shows the data for a single patient over a three-year period.
Notice the strong flare that coincides with the reduction in prednisone dose
just after the seventh year.
[Figure 3 plot; smoothing parameter λ = 3.1623. Legend: Data (circles), Smooth fit, DIFE fit, θα(t).]
Figure 3: The data for a single patient over a three-year period. Heavy lines
join times and values of SLEDAI measurements. A flare is indicated by a
solid heavy line joining the first SLEDAI measurement within the flare to the
previous measurement extended to a time 0.02 years back. The light solid
line joins times and values at which prednisone doses were fixed.
• The duration of an active state will be b, and may vary from flare to
flare.
We can imagine that the disease also affects the body's capacity to respond
to the disease itself, as well as its capacity to recover. That is, β is also
affected by the disease, and therefore must be replaced by the function β(t).
When the patient is healthy between flares, β(t) is high, leading to rapid
response to the onset of the disease. When the patient is experiencing a flare,
β(t) is near zero, implying a slow recovery.
We tried this differential equation for β(t).
When u(t) switches on, β(t) decays to zero, and Ds(t) tends to equal αu(t);
that is, s(t) increases linearly while u(t) = 1. When u(t) switches off, β(t)
returns to the level of its gain, θ, and s(t) tends to decay exponentially
with rate equal to β(t)'s gain.
This gives us the general shape of a lupus flare. The increase in symptoms
is essentially linear because β(t) decays rapidly to 0 inside a flare; when
β(t) ≈ 0, the gain becomes αb. But after a flare, when u(t) returns to
zero, β returns to its healthy level, and there is an exponential decrease in
symptoms.
Actually, preliminary results indicated large values for the rate parameter γ,
implying that β(t) moved extremely rapidly between virtually zero and its
maximum value, defined by θ. We decided to simplify the differential equation
for β(t) to
Dβ(t) = θ[1 − u(t)].
This implies a linear increase within a flare episode, and an exponential
decrease afterwards with rate constant θ.
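The flare shape described above (a roughly linear rise while u = 1, followed by an exponential decay at rate θ) can be reproduced with a numerical caricature. All parameter values below are invented, and β(t) is simply switched between 0 within the flare and θ outside it, a piecewise reading of the simplified model.

```python
import numpy as np

# Invented caricature of the flare dynamics: Ds = -beta(t)*s + alpha*u(t),
# with beta(t) ~ 0 during a flare (u = 1) and beta(t) = theta between flares.
alpha, theta = 10.0, 2.0
dt = 1e-3
times = np.arange(0.0, 6.0, dt)
u = ((times >= 1.0) & (times < 2.0)).astype(float)   # a single flare episode

s = np.zeros_like(times)
for i in range(1, times.size):
    beta = theta * (1.0 - u[i - 1])      # 0 in the flare, theta outside it
    s[i] = s[i - 1] + dt * (-beta * s[i - 1] + alpha * u[i - 1])

peak = s.max()        # roughly alpha times the flare duration
final = s[-1]         # decayed back towards zero at rate theta
```

The simulated symptom curve rises linearly to about α times the flare duration and then relaxes exponentially, matching the qualitative description in the text.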
Figure 4: Results for the analysis of the data in Figure 3 using a smoothing
parameter λ = 10^{−0.5}. The circles are SLEDAI measurements. The heavy
solid line is the fit to the data s(t) that minimizes criterion (4) and the dashed
line is the solution to the differential equation (5). The light solid line plots
the value of θα(t).
for the recovery from a flare before the next flare begins. What we lose,
however, is the capacity to fit lower values of SLEDAI; the range of variation
within a flare is too limited to permit this.
On the whole, however, these fits are quite satisfactory and capture well
the main dynamic features of this segment of a lupus record.
contributions of my graduate students, Mr. Jiguo Cao, Ms. Carlotta Fok and
Ms. Wen Zhang. The example from chemical engineering was supplied by
Dr. James McLellan of Queen's University, and this research also benefited
from the research collaboration with Andrew Poyton, a graduate student at
Queen's.
Address: J.O. Ramsay, McGill University, 1205 Dr. Penfield Ave., Montreal,
Quebec, Canada H3A 1B1
E-mail: ramsay@psych.mcgill.ca
1 Introduction
Tests of outliers in regression need estimates of both the parameters of the
linear model and of the error variance σ². If the outliers are included in
the set used for estimation, inconsistent estimates of the parameters will be
obtained and the existence and the effect of the outliers will be masked. We
therefore consider procedures in which the observations are divided into two
groups: those believed to be 'good' and the outliers. The good observations
are used to provide estimates of the parameters to be used in the test for
outliers.
Let there provisionally be m good observations out of n. We are inter-
ested in the null distribution of the outlier test. We therefore need to perform
our calculations as though there were no outliers. If we were interested in the
simplest case when, instead of regression, the focus is the location parameter
of a random sample from a symmetrical distribution, we would base our es-
timates on the m central observations, trimming the remaining n − m. The
properties of our estimators would then be those coming from this trimmed
sample of n observations, rather than from m observations taken at ran-
dom from the parent population. We use this insight to provide excellent
approximations to the distribution of the outlier test in regression.
The literature on the detection of outliers in regression is vast. The test we
study here is the likelihood ratio test, that is the test based on the prediction
residuals used, for example, by Hadi and Simonoff [13], for the detection
of multiple outliers. Two useful surveys of methods for multiple outliers
in regression are Beckman and Cook [9] and Barnett and Lewis [8]. An
important point is that, if several outliers are present, single deletion methods
(for example, Cook and Weisberg [12], Atkinson [1]) may fail. Hawkins [14]
argues for exclusion of all possibly outlying observations, which are then
406 Marco Riani and Anthony Atkinson
where H = X(XᵀX)⁻¹Xᵀ is the 'hat' matrix, with diagonal elements hᵢ and
off-diagonal elements hᵢⱼ. The mean square estimator of σ² can be written

s² = eᵀe/(n − p) = Σᵢ₌₁ⁿ eᵢ²/(n − p).  (3)
qᵢ = eᵢ/√(1 − hᵢ).  (4)
Like the errors εᵢ, the qᵢ are distributed N(0, σ²), although they are not
independent.
The likelihood ratio test for agreement of a new observation y_new ob-
served at x_new with the sample of n observations providing β̂ and s² is the
prediction residual

(y_new − x_newᵀβ̂) / [s√{1 + x_newᵀ(XᵀX)⁻¹x_new}],  (5)

which, when the observation y_new comes from the same population as the
other observations, has a t distribution on n − p degrees of freedom.
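This t distribution can be checked with the standard prediction-residual statistic. The sketch below fits ordinary least squares to invented data and forms the statistic for one new observation drawn from the same population; all sizes and coefficient values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented regression data; model and sizes are for illustration only.
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
beta_true = np.array([1.0, 2.0, -1.0])
y = X @ beta_true + rng.standard_normal(n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y                 # least squares estimate
resid = y - X @ beta_hat
s2 = resid @ resid / (n - p)                 # mean square estimator of sigma^2

# One new observation from the same population:
x_new = np.array([1.0, 0.3, -0.7])
y_new = x_new @ beta_true + rng.standard_normal()

# Prediction residual: t-distributed on n - p degrees of freedom under the null.
t_stat = (y_new - x_new @ beta_hat) / np.sqrt(
    s2 * (1.0 + x_new @ XtX_inv @ x_new))
```

Repeating the last three lines over many simulated new observations would reproduce the t distribution on n − p degrees of freedom empirically.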
qᵢ(m*) = eᵢ(m*)/√{1 − hᵢ(m*)}.  (6)
The notation hᵢ(m*) serves as a reminder that the leverage of each obser-
vation depends on S_*^{(m)}. The search moves forward with the subset S_*^{(m+1)}
consisting of the observations with the m + 1 smallest absolute values of the
eᵢ, that is the numerator of qᵢ(m*).
In order to simulate the distribution of the outlier test of §2.4 we need
a simple way of simulating variables with the same distribution as the qᵢ(m*).
When m = n these residuals are those in (4) and the distribution is N(0, σ²).
But with m < n the estimates of the parameters are based on only those
observations giving the central m residuals: β̂(m*) and s²(m*) are calculated
from truncated samples.
(7)
where e²_[k](p*) is the kth ordered squared residual and h is the integer part
of (n + p + 1)/2 and corresponds to 'half' the observations when allowance
is made for fitting. Typically the search either examines all subsets of size p,
if this is not too large, or several thousand subsets are examined at random.
These starting methods destroy masking; any remaining outliers are then
removed in the initial steps of the search. Consequently, the search is in-
sensitive to the exact starting procedure. What is important for our present
purpose is that the search again uses parameter estimates based on a central
part of the sample.
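The random p-subset start can be sketched as follows. The data, the number of subsets and the contamination pattern below are invented; the criterion kept is the hth ordered squared residual, with h the integer part of (n + p + 1)/2 as defined above.

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented data: a straight line with six gross outliers added at one end.
n, p = 60, 2
X = np.column_stack([np.ones(n), np.linspace(0.0, 1.0, n)])
y = X @ np.array([0.5, 2.0]) + 0.05 * rng.standard_normal(n)
y[:6] += 5.0

# Random exact p-subsets; keep the fit minimising the h-th ordered
# squared residual, with h the integer part of (n + p + 1)/2.
h = (n + p + 1) // 2
best_crit, best_beta = np.inf, None
for _ in range(3000):
    idx = rng.choice(n, size=p, replace=False)
    try:
        beta = np.linalg.solve(X[idx], y[idx])   # exact fit to the subset
    except np.linalg.LinAlgError:
        continue
    crit = np.sort((y - X @ beta) ** 2)[h - 1]
    if crit < best_crit:
        best_crit, best_beta = crit, beta
# best_beta recovers the clean-data slope despite the clustered outliers
```

Because the criterion ignores the largest squared residuals, a start chosen this way is unaffected by the outliers, which then appear in the later steps of the forward search.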
the observation with the minimum prediction residual among those not in
S_*^{(m)}. If observation i_min is an outlier, so will be all other observations not
in S_*^{(m)}.
To test whether observation i_min is an outlier we use the predictive resid-
ual (5). The test for agreement of the observed and predicted values is
It is the distribution of this statistic that is the subject of this paper. In (5),
when all observations were used in fitting and a new observation was being
tested, the distribution was t_{n−p}. Now the estimates β̂(m*) and s(m*) are
based on the central part of the distribution. Even under the null hypothesis
that the sample contains no outliers, the distribution is no longer t.
simulations for each value of m. The second uses a series of orderings of
simulated data, but avoids the forward search.
Both of these methods are for the statistics calculated for simple samples.
In §3.4 we introduce a correction for the dependence of the distribution of
the statistics on p.
(10)
The simulation of the truncated normal distribution using the inversion
method in Steps 1 and 2 is straightforward in S-Plus or R.
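For readers outside S-Plus or R, a generic inversion sampler for the truncated standard normal might look like this (the Steps referred to above are not reproduced here; the bisection inverse for Φ is an implementation choice, not the authors' code).

```python
import math
import random

rng = random.Random(0)

# Generic inversion sampler for N(0,1) truncated to [a, b].
def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def phi_inv(prob, lo=-10.0, hi=10.0):
    """Inverse of phi by bisection (interval shrinks to ~2e-23)."""
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if phi(mid) < prob:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def truncated_normal(a, b):
    u = rng.random()                          # a uniform deviate
    prob = phi(a) + u * (phi(b) - phi(a))     # map into [Phi(a), Phi(b)]
    return phi_inv(prob)                      # invert the CDF

draws = [truncated_normal(-1.0, 1.0) for _ in range(1000)]
```

Every draw lies inside the truncation interval by construction, which is exactly the property needed when simulating residuals from the central part of a sample.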
4 Examples
4.1 Hawkins's data
This set of simulated data was analysed by Atkinson and Riani [4], §3.1.
There are 128 observations and nine explanatory variables. The data were
intended by Hawkins to be misleading for standard regression methods. Fig-
ure 1 shows a forward plot of the minimum deletion residual among obser-
vations not in the subset, that is the outlier test statistic (13), together with
two sets of simulated percentage points of the distribution, both based on
1,000 simulations. We first consider these simulation envelopes.
The envelopes plotted with continuous lines in the figure are the 1, 2.5,
5, 50, 95, 97.5 and 99% points of the empirical distribution of the outlier
test during forward searches simulated without outliers. The dotted lines are
from our second approximate simulation method in which random samples
of observations are ordered once. Agreement between the two envelopes is
excellent during the second half of the search; agreement between the two
sets of upper envelopes is also good during the first half of the search for
m > 20. The envelopes are of a kind we shall see in all simulations. Initially
they are very broad, corresponding to distributions with high trimming and
Simple simulations for robust tests of multiple outliers in regression 411
[Figure 1 plot: minimum deletion residual against subset size m.]
few degrees of freedom for the estimation of error. In the central part of the
search the band is virtually horizontal and gradually narrows. Towards the
end of the search there is rapid increase as we test the few largest residuals.
The continuous line showing the plot of the outlier test in the figure reveals
all the features that Hawkins put in the data. There are 86 observations with
very small variance. The plot shows a huge jump in the value of the statistic
when the first observation of the next group enters. This process is repeated
two more times, clearly identifying the four separate groups of data that are
present, the decline after each peak being due to the effect of masking. The
forward plot of this test statistic is the same as that in the lower panel of
Figure 3.6 of Atkinson and Riani [4]; the new confidence bands calibrate
inferences about the significance of the peaks.
The envelopes rise rapidly at the end of the search and we can see that the
outlier test finishes up being non-significant. Thus Hawkins has succeeded
in constructing a data set with many outliers all of which are masked. The
curve of the statistic starts to rise just before m = 86. If we take only the
first 86 observations and provide simulation envelopes for them, the envelopes
rise at the end as the envelopes do here from m around 125. The last few
observations do not then lie outside the simulation bands for this reduced set
of data.
Figure 2: Ozone Data: forward plot of minimum deletion residuals (the
outlier test). There are some mild outliers towards the end of the search
and some evidence of masking. The dotted lines are envelopes simulated by
Method 1.
Figure 3: Surgical Unit Data: forward plot of minimum deletion residuals
(the outlier test). The appreciable maximum of the statistic in the centre of
the search suggests there may be two equal sized groups of observations that
differ in some systematic way. The dotted lines are envelopes simulated by
Method 1.
m = 76 this plot shows four appreciable residuals, three negative and one
positive: these lie apart from the general cloud of residuals throughout the
whole search. The plot also shows some evidence of masking, the residu-
als decreasing somewhat in magnitude at the end of the search. The effect
of masking is also evident in Figure 2, where the test statistic lies within
the simulation envelopes for the last two steps of the search. Although the
masking here is not as misleading about the structure of the data as that in
Figure 1, there are again outliers whose presence would be overlooked by an
analysis based on all the data, or on single deletion diagnostics.
Atkinson and Riani [5] analysed the combined set of all 108 observations
using the forward search to assess the influence of individual observations
on the estimated regression coefficients. They also conclude that a logged
response and a linear model in x₁–x₃ adequately describe the data. Because
we will shortly be augmenting the set of explanatory variables, we work with
all four original variables.
Figure 3 is a forward plot of the test for outliers for all 108 observations,
together with simulation envelopes and the approximation found by our first
method. This surprising plot seems to show evidence of two groups: the
extreme value of the statistic, well outside the boundaries, is at the centre of
the search, after which there is a gradual decline in the values. At the end of
the search the statistic is nudging the lower envelope, a stronger version of
the effect of masking noticed in the two previous figures.
Since the maximum value of the statistic is at m = 55, we examine those
units that enter after this value, to see whether they might belong to a second
cluster. Detailed analysis of the results of the forward search shows that, after
m = 57, nearly all the patients entering have unit numbers greater than 54
and so come from the group of confirmatory observations.
This figure suggests the group of confirmatory observations may be differ-
ent from the original 54 units. Accordingly, we introduce a dummy variable
for the two sets and repeat the analysis. This variable is highly significant,
with a t value of -7.83 at the end of the search. However, the resulting for-
ward plot still has a slight peak in the centre, although this is much reduced
from that in Figure 3. Some remaining structure is indicated.
To take the analysis further we consider the two groups separately. Fig-
ure 4 gives the forward plots of the test for outliers. The plot for the second
group of observations, in the right-hand panel, suggests that the group is ho-
mogeneous. However, that in the left-hand panel strongly indicates that the
first group contains at least one identifiable subgroup that needs to be dis-
entangled before further analysis is undertaken. A next stage in the analysis
would be to extend the scatterplot matrix of the data in Figure 8.3 of Neter
et al. [15] to include different plotting symbols for the tentative groups.
5 Discussion
The previous examples are comparatively small and the many plots from the
forward search can easily be interpreted. However, as the number of units
increases, plots for individual units, such as forward plots of residuals, can
become messy and uninformative due to overplotting. Atkinson and Riani
[6] analyse 500 observations on the behaviour of customers with loyalty cards
from a supermarket chain in Northern Italy. Despite the larger number of
observations the forward plot of the test for outliers is as easily interpreted
as those in this paper and shows an unsuspected group of 30 very different
customers.
There are two further general methodological matters that deserve com-
ment. The first is that the envelopes presented in this paper were all found
by simulation. An alternative, investigated by Atkinson and Riani [6], is
to calculate the percentage points directly using analytical results on order
statistics and the variance of truncated normal distributions. The other point
is that, however the envelopes are calculated, the probability statements refer
to pointwise exceedance of the bands. To find, for example, the probability
of at least one transgression of a specified envelope somewhere during a par-
ticular region of the search, for example the second half, requires calculation
of the simultaneous probability of transgression at any of the stages of the
search within that region. Computationally feasible methods are described
by Buja and Rolke [11].
Atkinson and Riani [6] may be viewed at www.lse.ac.uk/collections/
statistics/research/
References
[1] Atkinson A.C. (1985). Plots, transformations, and regression. Oxford
University Press, Oxford.
[2] Atkinson A.C. (1994). Fast very robust methods for the detection of mul-
tiple outliers. Journal of the American Statistical Association 89, 1329-
1339.
[3] Atkinson A.C. (2002). The forward search. In W. Härdle and B. Rönz,
editors, COMPSTAT 2002: Proceedings in Computational Statistics,
Physica-Verlag, Heidelberg, 587-592.
[4] Atkinson A.C., Riani M. (2000). Robust diagnostic regression analysis.
Springer-Verlag, New York.
[5] Atkinson A.C., Riani M. (2002). Forward search added variable t tests and
the effect of masked outliers on model selection. Biometrika 89, 939-946.
[6] Atkinson A.C., Riani M. (2004). Distribution theory and simulations for
tests of outliers in regression. Submitted.
[7] Atkinson A.C., Riani M., Cerioli A. (2004). Exploring multivariate data
with the forward search. Springer-Verlag, New York.
[8] Barnett V., Lewis T. (1994). Outliers in statistical data (3rd edition).
Wiley, New York.
[9] Beckman R.J., Cook R.D. (1983). Outlier detection (with discussion).
Technometrics 25, 119-163.
[10] Breiman L., Friedman J.H. (1985). Estimating optimal transformations
for multiple regression and correlation (with discussion). Journal of
the American Statistical Association 80, 580-619.
[11] Buja A., Rolke W. (2003). Calibration for simultaneity: (re)sampling
methods for simultaneous inference with applications to function estima-
tion and functional data. Technical report, The Wharton School, Univer-
sity of Pennsylvania.
[12] Cook R.D., Weisberg S. (1982). Residuals and influence in regression.
Chapman and Hall, London.
[13] Hadi A.S., Simonoff J.S. (1993). Procedures for the identification of
multiple outliers in linear models. Journal of the American Statistical
Association 88, 1264-1272.
[14] Hawkins D.M. (1983). Discussion of paper by Beckman and Cook. Tech-
nometrics 25, 155-156.
[15] Neter J., Kutner M.H., Nachtsheim C.J., Wasserman W. (1996). Applied
linear statistical models, 4th edition. McGraw-Hill, New York.
[16] Rousseeuw P.J. (1984). Least median of squares regression. Journal of
the American Statistical Association 79, 871-880.
would supply themselves for knowledge transmission and certification from
those virtual hyper-classrooms.
ITs could secure huge savings for education boards, but could entail the
disappearance of most teachers and professors.
From an overview of some recent and very successful pedagogical experi-
ments in Quebec universities using ITs, one can suspect that things will not
be that simple [1]. The same situation, it is easy to confirm, is prevalent
the world over. Actually, getting an education is a form of travelling. And
quality travelling often implies personal guides, at least human encounters, not
just guidebooks and TV documentaries, though they can be illuminating and
irreplaceable. In our experience, all the pedagogies devised with the ITs in
mind have always implied more personal contacts with students, less mass
dispensing of knowledge!
With the Internet, we have perhaps entered an era of renaissance of the
true pedagogical relation, not the opposite. As we will explain, this has far-
reaching implications for the reciprocal relations of teachers and students.
data based, centered on case studies for more advanced material and hands-on
training. Applied Statistics is indeed much more than a set of mathematical
formulae: its learning implies the development of "statistical thinking", re-
quires the understanding of difficult concepts such as variation, randomness,
laws of chance - a difficult oxymoron at first glance - , probable errors,
risks, etc. Animations and various graphical tools provide efficient means of
learning.
Depending on the level, one can think of various designs for the Internet
environments and interactions. Up to now, there are two stages planned in
the St@tNet project, the first one is fully operational, the second in develop-
ment, but with partial versions tested in ordinary classrooms.
For the first stage, at the very basic level of statistical knowledge, St@tNet
has opted for a complete Html environment. The advantage of this choice
is that interactions of the students with the environment are quite easy to
realize: this course is by no means a paper-course translated into Html, as one
can still see quite often, but a full-fledged Html environment with frequent
short interactions inserted by design into the course.
For higher levels of knowledge, where short interactions are much less
needed, St@tNet has opted for a downloadable Latex-Pdf text, with full
hyper-referencing possibilities, and many of the hyper-references are internal.
page, with pop-ups for feedback. A pop-up Glossary, the same for all lessons,
is hyper-referenced, and, finally, a page of links is available, with some of
them referring to external Java applets useful for the learning.
Figure 1: Upper: The entry for the module Statistiques descriptives (Descrip-
tive statistics), with its introductory video. Lower: Part of the development
section for Lesson 1 of the module Tests (Tests), with a pop-up window
obtained with a wrong answer.
A new audience has been reached by this approach, and the rates of reten-
tion and success are better than for traditional courses. This last point might
be the consequence of the type of students (a "sampling bias"!) interested
in such an environment.
422 Gilbert Saporta and Marc Bourdeau
to Postman in his last essay ([4] p.161 seq): "(...) question-asking is the most
significant intellectual skill available to human beings," and it is extremely
strange that, especially in the Sciences, whether hard or applied, it is not
taught in schools!
Finally, and this also harks back to Postman in all his books on Education,
we have written historical notes on all the principal aspects of the origin of
the need of statistical models for reality. It is a fact that with History notes
there is a sort of holographic phenomenon: even when one starts from hard
sciences' bits of knowledge, exploring how things came to be, where ideas
came from and how we came by them provides, if propelled by a sense of
questioning, an insight on the whole of societies, on all of Human nature.
This constitutes an essential part of any formation. Education after all is
not only about information, but first and foremost about the formation or
casting of minds, young ones in particular.
In summary, due to the mathematical sophistication of this material there
is a need for textbook typography, as well as, as usual, a need for a complete
system of inner referencing and outer or hyper-referencing facilities. This
leaves nowadays almost no choice: such a course must be typeset in Latex-
Pdf. The Pdf-files are virus-proof, they can be readily printed on
paper with textbook color quality, their use on computer screens is very
comfortable, moreover providing some annotating facilities, and, finally, inner
links and hyperlinks are manipulated with extreme ease.
This second stage of St@tNet is not, as yet, fully operational, but a demo-
version is available, and parts of the material, especially some case studies,
were tested with great success in standard classrooms⁴. In the following pages
we present some of its highlights.
In Figure 3, we can see part of ordinary page of the course file. At t he
bottom of t he page an icon referring to a Fl ash anima ti on, an image of which
appears on Figure 4.
The reader can flip back and forth from any page via internal links to an equation, a table, a figure. He can also, if he subscribes to an Internet server, readily access a certain number of hyperlinks to whatever sites the authors deemed interesting. These pages will be added automatically at the end of the PDF file, which can be saved with the added information. Adobe Reader also provides various facilities to annotate the file pages.
As in the first stage of St@tNet, many Flash animations are also included in the text. They constitute a remarkable tool to ease learning. The development of a Flash animation is fairly easy, the animations are space-efficient, and the Flash plug-in is very light and widespread. Furthermore, these animations are upward compatible and can be readily updated.
Each one of our animations comes with a certain number of controllable buttons, one of which triggers an audio file. In the example (Figure 4), the nodes of
4 http://www.mgi.polymtl.ca/marc.bourdeau/InfAgeTeaching.
424 Gilbert Saporta and Marc Bourdeau
Figure 3: Part of a typical page in stage two, with the icon referring to a Flash animation.
Figure 4: A page from one of the course Flash animations, with its controllable buttons, one of which (bottom) is for an audio file.
The St@tNet project for teaching statistics 425
the regression are mobile and new nodes can be added, the confidence bands resulting from the least squares fit have a button to control their level, and whenever a change is made, directly with the mouse on the computer screen, the new regression line and confidence bands, with the other numerical parameters, promptly appear on the screen. The audio file provides instructions for the use of the animation, a few explanations, and always, this is very important, the questions that the animation brings out.
Figure 5: A typical page of a case study, with an icon to import the data.
3 Conclusion
The Information Age offers mind-boggling perspectives and cannot but have a profound impact on the pedagogy of every discipline, but first and foremost on those of a technical character, and Statistics is one of them. All the presentations in this session will no doubt show the diversity of options.
The end of the journey for teachers? At first sight, it might appear that all these new facilities lead to the disappearance of teachers and professors. But many very successful pedagogical experiments have shown that human pedagogical guides are more necessary than ever, and that ITs provide an indispensable structure for more interaction between them and the students. It would not be surprising if the new pedagogical paradigm turned out to be that of apprentices and masters. In all the pedagogical experiments we have seen, not only in Statistics, not only our own, there is a greater need than ever for human personal transmission. The role of professors becomes more and more that of a resource person, a guide so to speak, and less and less that of a knowledge dispenser. Pure knowledge transmission is not the principal role of professors anymore: it has now been more or less automated thanks to the new ITs. Transmission is required now at a much higher cognitive level. And written words, for their precision, as well as oral contacts, play (ITs in the background again!) a crucial role. On the Internet, all courses tend to become tutorials! And this is the expensive form of teaching. That may explain why pedagogical interaction has become so much more demanding than ever.
Acknowledgement: St@tNet was developed with the generous support of the Agence Universitaire pour la Francophonie and of the French Ministère de l'Éducation Nationale.
Deep gratitude to all those working on the project:
http://www.mgi.polymtl.ca/marc.bourdeau/InfAgeTeaching/credits.pdf.
Address: G. Saporta, Conservatoire National des Arts et Métiers, 292, rue Saint-Martin, F-75003 Paris, France, http://cedric.cnam.fr/~saporta
M. Bourdeau, École Polytechnique de Montréal, http://www.mgi.polymtl.ca/marc.bourdeau
E-mail: Saporta@cnam.fr, Marc.Bourdeau@polymtl.ca
COMPSTAT'2004 Symposium © Physica-Verlag/Springer 2004
AN AUTOMATIC THRESHOLDING
APPROACH TO GENE EXPRESSION
ANALYSIS
Key words: Empirical Bayes, microarray, multiple testing, R, sparse sequence, statistical computing, threshold.
COMPSTAT 2004 section: Bayesian methods.
1 Introduction
In recent years the new technology of microarrays has made it feasible to measure expression of thousands of genes to identify changes between different biological states. Statisticians are requested to design methods which help to quantify the relevance of these experimentally obtained changes.
In such biological experiments (for an introduction see [12]) we are confronted with the problem of high dimensionality, because of the thousands of genes involved, and at the same time with small sample sizes (due to limited availability of cases and for reasons of cost). This makes it a statistically and computationally demanding task. The complexity of the diseases, the poor understanding of the underlying biology, and the imperfection of the measurements (many different sources of noise) are additional problems.
There are two dominating DNA array readout methods, cDNA ([3], p. 17ff, [12]) and Affymetrix GeneChips ([2], [12]). In the first the data are read from a fluorescent signal and in the second the data are recorded from a radioactive signal. The statistical method described in this paper can be applied in both instances.
At first we portray popular techniques for gene expression analysis. Then the idea of empirical Bayes methods is introduced and an empirical Bayes thresholding (EBT) approach is described in some detail. Its practical relevance is demonstrated for colon data from one of our laboratories (cDNA) and for the so-called Golub data set (Affymetrix) from [8].
430 Michael G. Schimek and Wolfgang Schmidt
As pointed out already, a test statistic for assessing differential gene expression is the standard t-test
$$ t_i = \frac{\bar{x}_i^2 - \bar{x}_i^1}{s_i}. $$
Alternatively a rank-sum statistic can be adopted. Let $T_{ik}$ be the rank of the $k$th expression level within gene $i$. Then the rank-sum statistic for gene $i$ is
$$ r_i = \sum_{k=1}^{K_i} T_{ik}. $$
An extreme $r_i$ value in either direction would indicate a difference in gene expression. The t-statistic as introduced above tests for a difference in the mean, whereas the rank statistic tests for a difference in distribution.
Under the assumption of only two experimental conditions and no correlation between measurements it is possible to derive the null distribution: either permutation ([17]) or bootstrap ([6]) techniques can be applied. In both cases the computational demand is quite high.
If the null distribution is calculated individually for each gene, this has two disadvantages. The first is what is known as the granularity problem: the null distribution has a resolution on the order of the number of permutations. With $n$ genes and $m$ permutations the resolution is on the order of $1/m$ for individual null distributions, but $1/(nm)$ for a pooled null distribution. For instance, if we test 3000 genes with 100 permutations, then we can expect to reject 30 at a time. The second problem is that we are not in a position to construct better rejection regions. With individual null distributions, each gene is treated as a different experiment. For each "experiment" we have $m$ observations from the null distribution and one from the original measurements. It is not possible to compare the null distribution to the observed statistic to derive more powerful, asymmetric rejection regions ([15], p. 277f). This means loss of power.
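The granularity argument can be made concrete with a small simulation; the group sizes, the mean-difference statistic, and the toy data below are illustrative assumptions:

```python
import random

random.seed(0)

def perm_null(values, n1, n_perm):
    """Permutation null for one gene: difference of group means
    under random relabelling of the measurements."""
    null = []
    for _ in range(n_perm):
        shuffled = random.sample(values, len(values))
        g1, g2 = shuffled[:n1], shuffled[n1:]
        null.append(sum(g2) / len(g2) - sum(g1) / len(g1))
    return null

# Toy setting: n genes with 3 + 3 measurements, m permutations each.
n_genes, m_perm = 30, 100
genes = [[random.gauss(0, 1) for _ in range(6)] for _ in range(n_genes)]

individual = [perm_null(g, 3, m_perm) for g in genes]
pooled = [v for null in individual for v in null]

# Smallest attainable p-value: 1/m per gene, but 1/(n*m) when pooled.
print(1 / m_perm)              # 0.01
print(1 / (n_genes * m_perm))  # ~0.00033
print(len(pooled))             # 3000 pooled null values
```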
In the SAM ("Significance Analysis of Microarrays") method ([16]) another approach has been taken: there the test statistics are pooled and considered to follow a mixture distribution. As a consequence, many observations from the mixture of the null and affected distributions, as well as from the pure null distribution, are available, leading to improved rejection regions. The pitfall of using different distributions for the estimation of the overall error rate (due to pooling of the null statistics) is sufficiently controlled in SAM according to its authors. In SAM expression is evaluated by a combination of test and thresholding steps for the purpose of non-symmetric rejection regions. This approach improves the decision process when the numbers of overexpressed and underexpressed genes are substantially different (usually the case in practice). The cutoff for test significance is tuned via a user-specified parameter connected to the false discovery rate (the number of false positives is limited this way). Hence SAM is not an automatic approach.
$$ X_i = \mu_i + \epsilon_i, $$
where the $\epsilon_i$ are $N(0, \sigma^2)$ random variables, not too highly correlated. Further let $\mu = (\mu_1, \mu_2, \ldots, \mu_n)$ be a vector of medians (means are also feasible but not of interest in this paper).
Obviously the $\mu_i$ will not be exactly zero in most applications. The $p$-norm of $\mu$, $\|\mu\|_p = (\sum_i |\mu_i|^p)^{1/p}$, allows for a more subtle characterization of the sparsity of $\mu$ (assuming small $p$). In other words, the quantification of sparsity corresponds to bounds on the $p$-norm of $\mu$ for $p > 0$. Consider the sum of squares of a vector with $\|\mu\|_p = 1$ for some small $p$. If only one of the components of $\mu$ is nonzero, then the energy will be 1. If, on the other hand, all of the components are equal, then the energy will be $n^{1-2/p}$, which tends to zero as $n \to \infty$ if $p < 2$, tending rapidly to zero if $p$ is near zero. Consider the case of $p$ small. Then the only way for a signal in an $\ell_p$ ball with small $p$ to have large energy (sum of squares) is to consist of a few large components, as opposed to many small components of roughly equal magnitude. Among all signals with a given energy, the sparse ones are those with small $\ell_p$ norm.
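A quick numerical check of the energy claim, with an illustrative choice of $n$ and $p$:

```python
n, p = 100, 1.0

# All components equal, scaled so that the l_p norm is 1:
c = n ** (-1.0 / p)
mu = [c] * n
energy = sum(v * v for v in mu)
print(abs(energy - n ** (1 - 2.0 / p)) < 1e-12)  # True: energy = n^(1-2/p)

# A maximally sparse vector with the same l_p norm has energy 1:
sparse = [1.0] + [0.0] * (n - 1)
print(sum(v * v for v in sparse))  # 1.0
```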
Some measure of sparsity is needed because sparsity of a signal is not solely a matter of the proportion of $\mu_i$ that are zero or near zero, but also of subtle ways in which the energy of the signal $\mu$ is distributed among the various components. For our purposes it is sufficient that the number of indices $i$ for which $\mu_i$ is nonzero is bounded. In engineering such a parameter $\mu$ is called a "nearly black signal". For some $\eta$ this is
$$ \#\{i : \mu_i \neq 0\} \le \eta n \qquad (1) $$
$$ \Big( n^{-1} \sum_i |\mu_i|^p \Big)^{1/p} \le \eta. \qquad (2) $$
For (1) and (2) it is possible to derive minimax squared error properties. It can be shown that EBT adapts automatically to the degree and character of sparsity of the signal with the minimax rate (i.e. the optimum rate for such signals; for details see [10]). It is worth mentioning that the minimax properties are the same as in the false discovery rate approach in [1].
Suppose the errors $\epsilon_i$ are independent. Within the Bayesian context sparsity is equivalent to suitable prior distributions for the $\theta_i$ we are interested in. The notion that many or most of the $\theta_i$ are near zero is captured by assuming that the elements $\theta_i$ have independent prior distributions, each given by the mixture
$$ f_{\mathrm{prior}}(\theta) = (1 - w)\,\delta_0(\theta) + w\,\gamma(\theta). \qquad (3) $$
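A sketch of sampling from the mixture prior (3) shows how the weight $w$ controls sparsity; the Laplace choice for $\gamma$ anticipates the next paragraph, and the parameter values are illustrative:

```python
import random

random.seed(1)

def sample_prior(w, a):
    """Draw one theta from (1 - w)*delta_0 + w*gamma, with gamma a
    Laplace density with scale parameter a (as recommended in [10])."""
    if random.random() >= w:
        return 0.0  # the point mass at zero
    # Laplace(0, 1/a): a random sign times an Exponential(rate a) draw.
    u = random.expovariate(a)
    return u if random.random() < 0.5 else -u

w, a = 0.1, 0.5
draws = [sample_prior(w, a) for _ in range(10000)]
zero_fraction = sum(1 for t in draws if t == 0.0) / len(draws)
print(abs(zero_fraction - (1 - w)) < 0.02)  # True: about 90% exact zeros
```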
The nonzero part of the prior, $\gamma$, is assumed to be a fixed unimodal symmetric density. Traditionally $\gamma$ is assumed to be a normal density; here ([10]) it is recommended to use a heavier-tailed prior. For the mixing prior in (3) it is favorable to use for $\gamma$ the Laplace density with scale parameter $a > 0$,
$$ \gamma_a(u) = \tfrac{1}{2} a \exp(-a|u|), $$
where $h$ is a density. If $x > 0$, we can find $\hat{\mu}(x; w)$ via the following properties:
$$ \hat{\mu}(x; w) = 0 \quad \text{if } w_{\mathrm{post}}(x)\,\bar{F}_1(0 \mid x) < \tfrac{1}{2}, $$
$$ \bar{F}_1(\hat{\mu}(x; w) \mid x) = (2 w_{\mathrm{post}}(x))^{-1} \quad \text{otherwise}. $$
For $w_{\mathrm{post}}(x) \le \tfrac{1}{2}$ the median is necessarily zero (no need to evaluate $\bar{F}_1(0 \mid x)$). For $x < 0$ the antisymmetry property $\hat{\mu}(-x; w) = -\hat{\mu}(x; w)$ can be used.
The Bayes factor threshold is related to the posterior median. It is a value $\tau(w)$ such that $P(\mu > 0 \mid X = \tau(w)) = 0.5$. That is to say, $\tau(w)$ is the largest value of the sequence for which the estimated $\mu$ will be zero, if the estimate is obtained from the posterior median.
How can we find the estimate $\hat{w}$ of $w$, or the scale parameter $a$ of the Laplace density? Maximization of the marginal likelihood $\ell$ gives the solution. Let us define the score function $S(w) = \ell'(w)$. Because of the smoothness and monotonicity of $S(w)$ it is possible to find the estimates by a binary search, or an even faster algorithm. The obtained values are then plugged back into the prior, and the parameters $\mu_i$ are evaluated via these estimates, either by using the posterior median itself, or by using some other threshold rule with the same threshold $t(\hat{w})$.
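The binary search on the score function can be sketched as follows; since the closed form of $S(w) = \ell'(w)$ is not given in this section, a stand-in monotonically decreasing function is used here purely for illustration:

```python
def solve_score(score, lo, hi, tol=1e-10):
    """Binary search for the root of a smooth, monotonically
    decreasing score function S(w) on [lo, hi]."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            lo = mid   # root lies above mid
        else:
            hi = mid   # root lies at or below mid
    return 0.5 * (lo + hi)

# Toy decreasing score with root at w = 0.25 (a stand-in for l'(w)):
w_hat = solve_score(lambda w: 0.25 - w, 0.0, 1.0)
print(abs(w_hat - 0.25) < 1e-8)  # True
```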
The threshold is obtained from the posterior median $\hat{\mu}$, mainly by use of the following properties:
(i) shrinkage rule: $0 \le \hat{\mu} \le x$ for $x \ge 0$;
(ii) threshold rule: there exists $t(w) > 0$ such that $\hat{\mu}(x) = 0$ if and only if $|x| \le t(w)$;
(iii) bounded shrinkage: there exists a constant $b$ such that for all $w$ and $x$, $|\hat{\mu}(x; w) - x| \le t(w) + b$.
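The posterior median itself is not reproduced here, but the familiar soft-threshold rule is one simple estimator satisfying all three properties (with $b = 0$ in (iii)), which may help fix ideas:

```python
def soft_threshold(x, t):
    """Soft-threshold rule: zero inside [-t, t], shrink by t outside.
    Satisfies the shrinkage, threshold, and bounded-shrinkage rules."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0

t = 1.0
xs = [-3.0, -0.5, 0.0, 0.7, 2.5]
print([soft_threshold(x, t) for x in xs])  # [-2.0, 0.0, 0.0, 0.0, 1.5]

# Property checks for this illustrative rule:
assert all(0 <= soft_threshold(x, t) <= x for x in [0.0, 0.5, 1.0, 4.0])
assert all(soft_threshold(x, t) == 0.0 for x in [-1.0, -0.3, 0.9, 1.0])
assert all(abs(soft_threshold(x, t) - x) <= t for x in xs)
```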
This approach is quite unique in combining excellent theoretical properties with efficient computation. According to [10] the results
Figure 1: Colon data after preprocessing.
proven for white noise errors still hold for modestly correlated errors, at least in an approximate sense. This generalization is important for microarray applications because some of the measurements are usually replicates. The EBT approach was implemented in the R language ([9]) by Iain M. Johnstone and Bernard W. Silverman. The master function of the EBT algorithm is ebayesthresh(). This function, as well as the others required to analyze sparse sequences, can be downloaded freely for academic purposes from http://www.stats.ox.ac.uk/~silverman/ebayesthresh/. Relevant documentation is found there too. After having been sourced into R, the EBT algorithm can be used like any other function in R.
4 Two examples
Our first example uses cDNA measurements and our second example is based on Affymetrix measurements. In both techniques we have many different sources of noise, such as variation in hybridization time, variation in reagent concentrations, leakage of external light during chip reading, inhomogeneities in chip preparation, variations in laser intensity during chip reading, trace contamination with cross-hybridizing oligonucleotides, etc. EBT is an ideal method to handle a decision problem under sparsity due to substantial noise.
Figure 2: EBT result for colon data.
gated and hybridization was made against a pool of 4 probes of normal colon tissue. Standard protocol was used and quality control ensured by the Institute of Pathology, Medical University of Graz. The experiments contain $n = 1536$ different genes, among them some replicates. The cDNA chips were then scanned with the microarray image analysis software Imagene from BioDiscovery, producing two text files. These files were then imported into R ([9]) using the object-oriented microarray analysis library com.braju.sma (can be obtained from the University of California at Berkeley, http://www.maths.lth.se/help/R/com.braju.sma/). The following preprocessing steps were applied: (i) background subtraction; (ii) transformation into $M = \log_2(\mathrm{Red}/\mathrm{Green})$ ($M$ is further referred to as log-ratio); (iii) normalizing within slides using the scaled print-tip method; (iv) the few values which were not detected on a subset of slides, hence NAs, were set to zero in order to allow further processing; and (v) the experiments were merged using the median. The data after preprocessing are displayed in Fig. 1.
The EBT algorithm was applied using the following parameters: prior = "laplace" and a = NA, so that the scale parameter $a$ is estimated by marginal maximum likelihood. bayesfac = T means that whenever a threshold is explicitly calculated, the Bayes factor threshold will be used. With sdev = NA, the standard deviation is estimated via the median absolute deviation from zero (mad(x, center = 0)). Finally, with threshrule = "median" the posterior median is chosen.
Fig. 2 shows the genes from the colon data that are informative after applying the EBT algorithm. Finally we obtained $n_1 = 37$ overexpressed and $n_2 = 39$ underexpressed genes. These could be verified by pathologists.
Figure 3: Subclass ALL of Golub data after preprocessing.
Figure 4: EBT result for subclass ALL of Golub data.
the above colon data cDNA experiment, 2 genes are detected for ALL, 1 gene is detected for AML, and 12 genes are expressed in both subclasses (for a plot of the altogether 14 overexpressed genes of the ALL subclass see Fig. 4). This is a substantial reduction in the number of informative genes. Whether the identified genes that are overexpressed in one subclass while not in the other (i.e. having a zero estimate), obviously of discriminative power, are of high biological relevance needs to be answered by future leukemia research and is not a statistical matter.
References
[1] Abramovich F., Benjamini Y., Donoho D.L., Johnstone I.M. (2002). Adapting to unknown sparsity by controlling the false discovery rate. Preprint.
[2] Affymetrix (1999). Affymetrix microarray suite user guide. Affymetrix,
Santa Clar a, CA.
[3] Baldi P., Hatfield G.W. (2002). DNA microarrays and gene expression.
From experiments to data analysis and modeling. Ca mbridge University
Press, Cambridge.
[4] Benjamini Y., Hochberg Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. Royal Statist. Soc. B 57, 289-300.
[5] Efron B., Morris C. (1973). Combining possibly related estimation problems (with discussion). J. Royal Statist. Soc. B 35, 379-421.
[6] Efron B., Tibshirani R.J. (1993). An introduction to the bootstrap. Chapman & Hall, London.
[7] Efron B., Tibshirani R.J., Storey J.D., Tusher V. (2001). Empirical Bayes analysis of a microarray experiment. J. Amer. Statist. Assoc. 96, 1151-1160.
[8] Golub T.R., Slonim D.K., Tamayo P., Huard C., Gaasenbeek M., Mesirov J.P., Coller H., Loh M.L., Downing J.R., Caligiuri M.A., Bloomfield C.D., Lander E.S. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286, 531-537.
[9] Ihaka R., Gentleman R. (1996). R: A language for data analysis and graphics. J. Computat. Graph. Statist. 5, 299-314.
[10] Johnstone I.M., Silverman B.W. (2004). Needles and straw in haystacks: Empirical Bayes estimates of possibly sparse sequences. To appear in Ann. Statist.
[11] Newton M.A., Kendziorski C.M., Richmond C.S., Blattner F.R. (2001). On differential variability of expression ratios: Improving statistical inference about gene expression changes from microarray data. J. Computat. Biol. 8, 37-52.
[12] Nguyen D.V., Arpat A.B., Wang N., Carroll R.J. (2002). DNA microarray experiments: Biological and technological aspects. Biometrics 58, 701-717.
[13] Schena M., Shalon D., Davis R.W., Brown P.O. (1995). Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270, 467-470.
[14] Storey J.D. (2002). A direct approach to false discovery rates. J. Royal Statist. Soc. B 64, 479-498.
[15] Storey J.D., Tibshirani R. (2003). SAM thresholding and false discovery rates for detecting differential gene expression in DNA microarrays. In Parmigiani G., Garrett E.S., Irizarry R.A., Zeger S.L. (eds.) The analysis of gene expression data. Methods and software. Springer-Verlag, New York, 272-290.
[16] Tusher V., Tibshirani R., Chu G. (2001). Significance analysis of microarrays applied to transcriptional responses to ionizing radiation. Proceedings of the National Academy of Sciences 98, 5116-5121.
[17] Westfall P.H., Young S.S. (1993). Resampling-based multiple testing: Examples and methods for p-value adjustment. Wiley, New York.
$$ (x_1, y_1), \ldots, (x_m, y_m) \in \mathcal{X} \times \mathcal{Y}. \qquad (1) $$
Here, the domain $\mathcal{X}$ is some nonempty set that the inputs $x_i$ are taken from; the $y_i \in \mathcal{Y}$ are called targets. Here and below, $i, j = 1, \ldots, m$.
We have made no assumptions on the domain $\mathcal{X}$ other than it being a set. In order to study the problem of learning, we need additional structure. In learning, we want to be able to generalize to unseen data points. In the case of pattern recognition, given some new input $x \in \mathcal{X}$, we want to predict the corresponding $y \in \{\pm 1\}$. Loosely speaking, we want to choose $y$ such that $(x, y)$ is similar to the training examples. To this end, we need similarity measures in $\mathcal{X}$ and in $\{\pm 1\}$. The latter is easier, as two target values can only be identical or different.¹ For the former, we require a similarity measure with the property that there exists a map $\Phi$ into a Hilbert space $H$ such that for all $x, x' \in \mathcal{X}$,
$$ k(x, x') = \langle \Phi(x), \Phi(x') \rangle. \qquad (3) $$
Such a function $k$ is called a positive definite kernel [1], [10], [8], $H$ is the reproducing kernel Hilbert space (RKHS) associated with it, and $\Phi$ is called its feature map. A popular example, in the case where $\mathcal{X}$ is a normed space, is the Gaussian
$$ k(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right), \qquad (4) $$
where $\sigma > 0$.
¹ In the case where the outputs are taken from a general set $\mathcal{Y}$, the situation is more complex, cf. [11].
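A small numerical sanity check of positive definiteness for the Gaussian kernel (4): since $k(x, x') = \langle \Phi(x), \Phi(x') \rangle$, every quadratic form $\sum_{ij} c_i c_j k(x_i, x_j) = \|\sum_i c_i \Phi(x_i)\|^2$ must be nonnegative. The toy data and the spot-check with random coefficient vectors are illustrative:

```python
import math
import random

random.seed(2)

def gauss_kernel(x, y, sigma=1.0):
    """Gaussian kernel on R^d: k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2 * sigma ** 2))

points = [[random.gauss(0, 1) for _ in range(3)] for _ in range(8)]
K = [[gauss_kernel(p, q) for q in points] for p in points]

# Spot-check c^T K c >= 0 for a few random coefficient vectors c.
ok = True
for _ in range(100):
    c = [random.gauss(0, 1) for _ in points]
    quad = sum(c[i] * c[j] * K[i][j] for i in range(8) for j in range(8))
    ok = ok and quad >= -1e-9  # tolerance for floating-point rounding
print(ok)  # True
```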
442 B ernh ard Scholkopf
$$ y = \mathrm{sgn}\left( \frac{1}{m_1} \sum_{\{i : y_i = +1\}} k(x, x_i) \;-\; \frac{1}{m_2} \sum_{\{i : y_i = -1\}} k(x, x_i) + b \right) \qquad (6) $$
$$ p_1(x) := \frac{1}{m_1} \sum_{\{i : y_i = +1\}} k(x, x_i), \qquad p_2(x) := \frac{1}{m_2} \sum_{\{i : y_i = -1\}} k(x, x_i). \qquad (7) $$
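A minimal sketch of the classifier (6), built from the class means (7), on toy one-dimensional data:

```python
import math

def k(x, y, sigma=1.0):
    """Gaussian kernel (4) on the real line."""
    return math.exp(-((x - y) ** 2) / (2 * sigma ** 2))

def classify(x, pos, neg, b=0.0):
    """Decision rule (6): sign of the difference between the mean kernel
    similarity to the positive and to the negative training examples."""
    p1 = sum(k(x, xi) for xi in pos) / len(pos)  # p_1(x) in (7)
    p2 = sum(k(x, xi) for xi in neg) / len(neg)  # p_2(x) in (7)
    return 1 if p1 - p2 + b > 0 else -1

pos = [2.0, 2.5, 3.0]     # toy training examples with label +1
neg = [-2.0, -2.5, -3.0]  # toy training examples with label -1
print(classify(2.2, pos, neg))   # 1
print(classify(-2.2, pos, neg))  # -1
```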
The classifier (6) is quite close to the Support Vector Machine (SVM) that has recently attracted much attention [10], [8]. It is linear in the RKHS (see (5)), while in the input domain it is represented by a kernel expansion (6). It is example-based in the sense that the kernels are centered on the training examples, i.e., one of the two arguments of the kernels is always a training example. This is a general property of kernel methods, due to the Representer Theorem [5], [8]. The main point where SVMs deviate from (6) is in the selection of the examples that the kernels are centered on, and in the weight that is put on the individual kernels in the decision function. The SVM decision boundary takes the form
$$ f(x) = \mathrm{sgn}\left( \sum_i \lambda_i y_i k(x, x_i) + b \right), \qquad (8) $$
where the coefficients $\lambda_i$ are chosen such that the margin of separation of the training examples in the RKHS is maximized. It turns out that for many problems this leads to sparse solutions, i.e., often many of the $\lambda_i$ take the value 0. The $x_i$ with nonzero $\lambda_i$ are usually called Support Vectors.
Using methods from statistical learning theory [10], one can bound the
generalization error of SVMs. In a nutshell, statistical learning theory shows
that it is imperative that one uses a class of functions whose capacity (e.g.,
measured by the VC dimension) is matched to the size of the training set. In
SVMs, the capacity measure used is the size of the margin, which is inversely
proportional to the RKHS norm of the SVM parameter vector.
The SV algorithm has been generalized to problems such as regression
estimation [10], mappings between general sets of objects [11], and single
class problems. As the latter algorithm is closely related to the one to be
proposed in the present paper, we will describe it in the next section.
2 Single-class SVMs
Let us assume we are given unlabelled data $x_1, \ldots, x_m \in \mathcal{X}$ generated i.i.d. according to some underlying distribution $P$. We would like to estimate quantiles $C$ of $P$ using kernel expansions as $C \approx \{x \in \mathcal{X} \mid f(x) \in I\}$. Here, $I$ is an interval, and $f = \sum_{i=1}^m \lambda_i k(x, x_i)$.
In the case of $I = [\rho, \infty[$ (where $\rho \in \mathbb{R}$), an approach to compute such an estimator $f$ is the single-class SVM [7]. It approximately computes the smallest set $C \in \mathcal{C}$ containing a specified fraction of all training examples,
where smallness is measured in terms of a regularizer corresponding to the
norm in the RKHS associated with k, and C is the family of sets correspond-
ing to half-spaces in the RKHS. When choosing a suitable kernel, this notion
of smallness will coincide with the intuitive idea that the quantile estimate
should not only contain a specified fraction of the training points, but it
should also be sufficiently smooth so that we can be confident that this state-
ment will also be approximately true for previously unseen points sampled
from P (for an analysis, see [7]).
Let us briefly describe the main ideas of the approach. The training points are mapped into the RKHS using the feature map $\Phi$ associated with $k$, and then it is attempted to separate them from the origin with a large margin by solving the following quadratic program: for $\nu \in (0, 1]$,²
$$ \underset{w \in H,\; \boldsymbol{\xi} \in \mathbb{R}^m,\; \rho \in \mathbb{R}}{\text{minimize}} \quad \frac{1}{2}\|w\|^2 + \frac{1}{\nu m} \sum_i \xi_i - \rho \qquad (9) $$
$$ \text{subject to} \quad \langle w, \Phi(x_i) \rangle \ge \rho - \xi_i, \quad \xi_i \ge 0. \qquad (10) $$
Since nonzero slack variables $\xi_i$ are penalized in the objective function, we can expect that if $w$ and $\rho$ solve this problem, then the decision function,
$$ f(x) = \mathrm{sgn}(\langle w, \Phi(x) \rangle - \rho), \qquad (11) $$
² Here and below we follow the convention that boldface Greek characters denote vectors, e.g., $\boldsymbol{\xi} = (\xi_1, \ldots, \xi_m)^T$.
Figure 1: In the 2-D toy example depicted, the hyperplane $\langle w, \Phi(x) \rangle = \rho$ separates all but one of the points from the origin. The outlier $\Phi(x)$ is associated with a slack variable $\xi$, which is penalized in the objective function (9). The distance from the outlier to the hyperplane is $\xi / \|w\|$; the distance between hyperplane and origin is $\rho / \|w\|$. The latter implies that a small $\|w\|$ corresponds to a large margin of separation from the origin (from [8]).
will equal 1 for most examples $x_i$ contained in the training set,³ while the regularization term $\|w\|$ will still be small. For an illustration, see Figure 1. The trade-off between these two goals is controlled by a parameter $\nu$.
One can show that the solution takes the form
$$ w = \sum_i \alpha_i \Phi(x_i), \qquad (12) $$
subject to (14). Note that due to (14), the training examples contribute with nonnegative weights $\alpha_i \ge 0$ to the solution (12). One can show that asymptotically, a fraction $\nu$ of all training examples will have strictly positive weights, and the rest will be zero.
Figure 2: Two parallel hyperplanes $\langle w, \Phi(x) \rangle = \rho + \delta^{(*)}$ enclosing all but two of the points. The outlier $\Phi(x^{(*)})$ is associated with a slack variable $\xi^{(*)}$, which is penalized in the objective function (15).
$$ \underset{w \in H,\; \boldsymbol{\xi}^{(*)} \in \mathbb{R}^m,\; \rho \in \mathbb{R}}{\text{minimize}} \quad \frac{1}{2}\|w\|^2 + \frac{1}{\nu m} \sum_i (\xi_i + \xi_i^*) - \rho \qquad (15) $$
Here, $\delta^{(*)}$ are fixed parameters. Note that, strictly speaking, one of them is redundant: one can show that if we subtract some offset from both, then we obtain the same overall solution, with $\rho$ offset by the same amount. Hence, we can generally set one of them to zero, say, $\delta = 0$. In the simulations shown below, this is the case; nonetheless, we prefer to keep the $\delta$ in the optimization problem.
Before we compute the dual problem, let us discuss the relationship of this convex quadratic optimization problem to other approaches.
function. Hence, in this case, the solution will be a hyperplane that approximates the data well in the sense that the points lie close to it in the RKHS norm.
Let us now compute the dual optimization problem. Here are all the constraints, along with the Lagrange multipliers that we will use for them:
$$ \frac{1}{2}\|w\|^2 + \frac{1}{\nu m} \sum_i (\xi_i + \xi_i^*) - \rho \qquad (21) $$
$$ \frac{\partial L}{\partial \rho} = 0 \iff \sum_i (\alpha_i - \alpha_i^*) = 1. \qquad (24) $$
$$ \underset{\boldsymbol{\alpha} \in \mathbb{R}^m}{\text{minimize}} \quad \frac{1}{2} \sum_{ij} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\, k(x_i, x_j) - \delta \sum_i \alpha_i + \delta^* \sum_i \alpha_i^* \qquad (25) $$
where the box constraints on $\alpha_i^{(*)}$, (26), have been derived from (23) by taking into account that $\alpha_i^{(*)} \ge 0$.⁵
It is clear from the primal optimization problem that for all $i$, $\xi_i > 0$ implies $\langle w, \Phi(x_i) \rangle - \rho - \delta < 0$ (and likewise, $\xi_i^* > 0$ implies $\langle w, \Phi(x_i) \rangle - \rho - \delta^* > 0$), hence $OL^{(*)} \subset SV^{(*)}$. The difference of the SV and OL sets are those points that lie precisely on the boundaries of the constraints.⁶
Below, $|A|$ denotes the cardinality of the set $A$.
$$ \frac{|SV|}{m} - \frac{|OL^*|}{m} \ge \nu, \qquad (31) $$
$$ \frac{|OL|}{m} - \frac{|SV^*|}{m} < \nu. \qquad (32) $$
Two notes before we proceed to the proof:
$$ \underset{\boldsymbol{\xi}^{(*)} \in \mathbb{R}^m,\; \rho \in \mathbb{R}}{\text{minimize}} \qquad (33) $$
Kernel PCA The kernel method for computing dot products in an RKHS is not restricted to SV machines. It can be used to develop nonlinear generalizations of any algorithm that can be cast in terms of dot products, such
⁸ We choose $\epsilon$ small enough so that all constraints that are not active will also not be active after adding the $\epsilon$; it is easy to see that such an $\epsilon$ exists.
⁹ Essentially, we need to require that the distribution have a density w.r.t. the Lebesgue measure, and that $k$ is analytic and non-constant (cf. [8], [9]).
$$ \langle v^p, \Phi(x) \rangle = \frac{1}{\sqrt{\lambda^p}} \sum_{i=1}^m \alpha_i^p\, k(x_i, x) \qquad (35) $$
This is derived by computing the dot product between a test point $\Phi(x)$ and the $p$-th eigenvector in the RKHS; the $1/\sqrt{\lambda^p}$ factor ensures that $\langle v^p, v^p \rangle = 1$. When evaluated on the training example $x_n$, (36) takes the form
$$ \langle v^p, \Phi(x_n) \rangle = \frac{1}{\sqrt{\lambda^p}} (K \alpha^p)_n = \frac{1}{\sqrt{\lambda^p}} (\lambda^p \alpha^p)_n = \sqrt{\lambda^p}\, \alpha_n^p. \qquad (37) $$
(37)
In (35) , we have implicitly assumed that the data in t he RKHS have zero
mean . If this is not the case, we need to subtract the mean (11m) 2:i <I>(Xi)
from all points . This lead s to a slightly different eigenvalue pro blem , where
we diagonalize
K' = (1 - ee T)K(l - ee T) (38)
(with e = m- l / 2 (1, . . . , l)T) rather than K .
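The centering (38) can be verified numerically: written out entrywise it subtracts row means and column means and adds back the grand mean, and for the linear kernel it must reproduce the Gram matrix of the explicitly mean-centered data. The toy data below are illustrative:

```python
def linear_K(X):
    """Linear-kernel Gram matrix K = X X^T for a list of points."""
    return [[sum(a * b for a, b in zip(x, y)) for y in X] for x in X]

def center_K(K):
    """K' = (1 - ee^T) K (1 - ee^T) with e = m^{-1/2}(1, ..., 1)^T,
    computed entrywise as K_ij - rowmean_i - colmean_j + grandmean."""
    m = len(K)
    row = [sum(K[i]) / m for i in range(m)]
    col = [sum(K[i][j] for i in range(m)) / m for j in range(m)]
    grand = sum(row) / m
    return [[K[i][j] - row[i] - col[j] + grand for j in range(m)]
            for i in range(m)]

X = [[1.0, 2.0], [3.0, 0.0], [5.0, 4.0]]
mean = [sum(col) / len(X) for col in zip(*X)]
Xc = [[a - b for a, b in zip(x, mean)] for x in X]

K1 = center_K(linear_K(X))  # centering done on the kernel matrix
K2 = linear_K(Xc)           # centering done on the data themselves
ok = all(abs(K1[i][j] - K2[i][j]) < 1e-9
         for i in range(3) for j in range(3))
print(ok)  # True
```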
The kPCA algorithm can be used to obtain an implicit description of a manifold containing the data as follows. The principal directions with the smallest eigenvalues (sometimes called "minor components") characterize directions in the RKHS such that, when projected onto these directions, the data set has the smallest possible variance obtainable in any direction in the span of the mapped data.¹⁰ Generally, we are interested in low variance directions which lie in the span of sets of input points (e.g., the training set) mapped into the RKHS, as these lead to implicit
¹⁰ Note that for some kernels, the RKHS will be infinite dimensional. In that case, there are infinitely many zero variance directions which do not lie in the span of the data.
Kernel methods for manifold estimation 451
(39)
LLE and Laplacian Eigenmaps. Kernel PCA can also be used for manifold
learning in a rather different way. In this case, the manifold is not learnt
as the zero set of a kernel expansion. Rather, we will obtain a low-dimensional
coordinate embedding of data sampled from the manifold ("dimensionality
reduction").
It turns out that locally linear embedding (LLE) [6], currently a rather
popular algorithm for nonlinear dimensionality reduction, is a special case
of kPCA [3]: The LLE algorithm first constructs W to be the matrix whose
row i (summing to 1) contains the coefficients of the minimal squared
error affine reconstruction of x_i from its p nearest neighbors. Denote M :=
(1 − W)(1 − W^T), with maximal eigenvalue λ_max. One can show that M's
smallest eigenvalue is 0 and the corresponding uniform eigenvector is e. In
LLE, the coordinate values of the m-dimensional eigenvectors m−d, …, m−1
give an embedding of the m data points in ℝ^d. If we define K := (λ_max·1 − M),
then by construction, K is a positive definite matrix, its leading eigenvector
is e, and the coordinates of the eigenvectors 2, …, d+1 provide the LLE
embedding. Equivalently, we can use the eigenvectors 1, …, d of the matrix
obtained by projecting out the subspace spanned by e, i.e., (1 − ee^T)K(1 −
ee^T). Note that this is identical to the centered kernel matrix (38) used in
kPCA. We thus know that the coordinates of the leading eigenvectors of
kPCA performed on K yield the LLE embedding. This, together with (37),
shows that the LLE embedding is identical to the kPCA projections up to
a whitening multiplication with √λ^p.
As shown in [3], several other approaches can be viewed as special cases
of kPCA, including certain spectral methods. Many of these methods are
based on the computation of a weighted adjacency matrix W on the data,
e.g., using the kernel (4) on neighboring points (where several definitions
of neighborhood are possible).¹¹ Define the graph Laplacian L by L_ii := d_i,
L_ij := −W_ij if x_i and x_j are neighbors, and 0 otherwise, where d_i = Σ_{j~i} W_ij
is the degree of the ith vertex. It turns out that, similar to LLE, the bottom
eigenvectors of the Laplacian can provide a low-dimensional representation
of the data [2], and again, a link to kPCA can be established [3].
¹¹This local similarity measure can also take into account invariances of the data.
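A minimal sketch of this construction (NumPy; the toy 0/1 adjacency matrix is illustrative):

```python
import numpy as np

def graph_laplacian(W):
    """Unnormalized graph Laplacian of a weighted adjacency matrix W:
    L_ii := d_i and L_ij := -W_ij for neighbors, where d_i = sum_j W_ij
    is the degree of the ith vertex."""
    return np.diag(W.sum(axis=1)) - W

# Toy symmetric adjacency matrix on four points (1 = neighbors).
W = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])
L = graph_laplacian(W)
```

The rows of L sum to zero by construction, so the constant vector is always an eigenvector with eigenvalue 0; the next-smallest ("bottom") eigenvectors are the ones used for the embedding.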
5 Conclusion
Kernel methods have a solid foundation in statistical learning theory and
functional analysis. They let us interpret (and design) learning algorithms
geometrically in an RKHS, and combine statistics and geometry in an ele-
gant way. The present article has described several methods for using this
approach for the estimation of manifolds.
References
[1] Aizerman M.A., Braverman E.M., Rozonoer L.I. (1964). Theoretical
foundations of the potential function method in pattern recognition learning.
Automation and Remote Control 25, 821-837.
[2] Belkin M., Niyogi P. (2003). Laplacian eigenmaps for dimensionality
reduction and data representation. Neural Computation 15 (6), 1373-1396.
[3] Ham J., Lee D., Mika S., Scholkopf B. (2004). A kernel view of the
dimensionality reduction of manifolds. In Proceedings of ICML (in press).
[4] Kim K.I., Franz M.O., Scholkopf B. (2004). Kernel Hebbian algorithm
for single-frame super-resolution. In Statistical Learning in Computer
Vision Workshop, Prague.
[5] Kimeldorf G.S., Wahba G. (1971). Some results on Tchebycheffian spline
functions. Journal of Mathematical Analysis and Applications 33, 82-95.
[6] Roweis S., Saul L. (2000). Nonlinear dimensionality reduction by locally
linear embedding. Science 290, 2323-2326.
[7] Scholkopf B., Platt J., Shawe-Taylor J., Smola A.J., Williamson R.C.
(2001). Estimating the support of a high-dimensional distribution. Neural
Computation 13, 1443-1471.
[8] Scholkopf B., Smola A.J. (2002). Learning with kernels. MIT Press,
Cambridge, MA.
[9] Steinwart I. (2004). Sparseness of support vector machines - some
asymptotically sharp bounds. In S. Thrun, L. Saul, and B. Scholkopf (eds),
Advances in Neural Information Processing Systems 16. MIT Press,
Cambridge, MA.
[10] Vapnik V.N. (1995). The nature of statistical learning theory. Springer
Verlag, New York.
[11] Weston J., Chapelle O., Elisseeff A., Scholkopf B., Vapnik V. (2003).
Kernel dependency estimation. In S. Becker, S. Thrun, and K. Obermayer
(eds), Advances in Neural Information Processing Systems 15, Cambridge,
MA, USA. MIT Press.
Address: B. Schölkopf, Max-Planck-Institut für biologische Kybernetik,
Spemannstr. 38, Tübingen, Germany
E-mail: bernhard.schoelkopf@tuebingen.mpg.de
COMPSTAT'2004 Symposium © Physica-Verlag/Springer 2004
Abstract: Clustering algorithms based upon nonparametric or semiparametric
density estimation are of more theoretical interest than some of the
distance-based hierarchical or ad hoc algorithmic procedures. However,
density estimation is subject to the curse of dimensionality, so that care must
be exercised. Clustering algorithms are sometimes described as biased since
solutions may be highly influenced by initial configurations. Clusters may be
associated with modes of a nonparametric density estimator or with components
of a (normal) mixture estimator. Mode-finding algorithms are related
to but different from Gaussian mixture models. In this paper, we describe
a hybrid algorithm which finds modes by fitting incomplete mixture models,
or partial mixture component models. Problems with bias are reduced since
the partial mixture model is fitted many times using carefully chosen random
starting guesses. Many of these partial fits offer unique diagnostic information
about the structure and features hidden in the data. We describe the
algorithms and present some case studies.
1 Introduction

In this paper, we consider the problem of finding outliers and/or clusters
through the use of the normal mixture model

f(x) = Σ_{k=1}^{K} w_k φ(x | μ_k, Σ_k). (1)
Mixture models afford a very general family of densities. If the number
of components, K, is quite large, then almost any density may be well-approximated
by this model. Aitkin and Wilson [1] first suggested using the
mixture model as a way of handling data with multiple outliers, especially
when some of the outliers group into clumps. They used the EM algorithm
to fit the mixture model. Assuming that the "good" data are in one cluster
and make up at least fifty percent of the total data, it is easy to see that
we have introduced a number of "nuisance parameters" into the problem (to
model the outliers).
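For concreteness, the mixture density (1) in one dimension can be sketched with the standard library alone; the component parameters below are illustrative, not taken from the paper:

```python
import math

def mixture_density(x, weights, means, sds):
    """The normal mixture model (1) in one dimension:
    f(x) = sum_k w_k * phi(x | mu_k, sigma_k^2)."""
    return sum(
        w * math.exp(-0.5 * ((x - mu) / s) ** 2) / (s * math.sqrt(2 * math.pi))
        for w, mu, s in zip(weights, means, sds))

# Illustrative parameters: 90% "good" data plus a small outlier clump.
f = lambda x: mixture_density(x, [0.9, 0.1], [0.0, 5.0], [1.0, 0.5])
```

Since the weights sum to one, f integrates to one; fitting the clump of outliers costs the extra "nuisance parameters" (w, μ, σ) of the second component.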
Implementing this idea in practice is challenging. If there are just a few
"clusters" of outliers, then the number of nuisance parameters should not pose
too much difficulty. However, as the dimension increases, the total number
454 David W. Scot t
estimation of mixture models by this technique. Given a true density, g(x),
and a model, f_θ(x), the goal is to find a fully data-based estimate of the
L2 distance between g and f_θ, which is then minimized with respect to θ.
Expanding the L2 criterion,

∫ [f_θ(x) − g(x)]² dx = ∫ f_θ(x)² dx − 2 ∫ f_θ(x) g(x) dx + ∫ g(x)² dx.

The third integral is unknown but is constant with respect to θ and therefore
may be ignored. The first integral is often available as a closed-form
expression that may be evaluated for any posited value of θ. Additionally,
we must add an assumption on the model that this integral is always finite,
i.e., f_θ ∈ L₂. The second integral is simply the average height of the density
estimate, given by −2E[f_θ(X)], where X ~ g(x), and which may be estimated
in an unbiased fashion by −2n^{−1} Σ_{i=1}^{n} f_θ(x_i). Combining, the L2E
criterion for parametric estimation is given by

θ̂_{L2E} = argmin_θ [ ∫ f_θ(x)² dx − (2/n) Σ_{i=1}^{n} f_θ(x_i) ]. (4)
∫_{ℝ^d} f_θ(x)² dx = Σ_{k=1}^{K} Σ_{ℓ=1}^{K} w_k w_ℓ φ(0 | μ_k − μ_ℓ, Σ_k + Σ_ℓ). (5)
Since this is a computationally feasible closed-form expression, estimation of
the normal mixture model by the L2E procedure may be performed by use
of any standard nonlinear optimization code; see [20], [21]. In particular, we
used the nlmin routine in the Splus library for the examples in this paper.
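Putting (4) and (5) together for a univariate normal mixture gives a criterion that is only a few lines of code. The sketch below is ours and is not the S-Plus nlmin-based implementation used in the paper:

```python
import math

def phi(x, mu, var):
    """Normal density phi(x | mu, var)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def l2e(data, w, mu, sd):
    """L2E criterion (4) for a univariate K-component normal mixture, with
    the integral of the squared density computed in closed form via (5):
    sum_k sum_l w_k w_l phi(0 | mu_k - mu_l, var_k + var_l)."""
    K = len(w)
    int_f2 = sum(w[k] * w[l] * phi(0.0, mu[k] - mu[l], sd[k] ** 2 + sd[l] ** 2)
                 for k in range(K) for l in range(K))
    # unbiased estimate of E[f(X)]: the average height of the model density
    mean_f = sum(sum(w[k] * phi(x, mu[k], sd[k] ** 2) for k in range(K))
                 for x in data) / len(data)
    return int_f2 - 2.0 * mean_f
```

Passing this function to any standard nonlinear optimizer then implements the L2E procedure described above; well-placed parameters give a smaller criterion value than badly misplaced ones.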
Next, we return to the Old Faithful geyser example. Using the same
starting values as in Figure 1, we computed the corresponding L2E estimates,
which are displayed in Figure 2. Clearly, both algorithms are attracted to the
same (local) estimates, which combine various clusters into one (since K < 3).
However, there are interesting differences. First we compare the estimated
weights: in Figure 1, the MLE weight of the larger component in each frame
is 1, 0.65, 0.65, and 0.73, respectively, while in Figure 2 the corresponding
L2E weights are 1, 0.74, 0.72, and 0.71. Of more interest, the L2E covariance
matrices are either tighter or smaller. Since the (explicit) goal of L2E is to
find the most normal fit (locally), observe that a number of points in the
smaller clusters fall outside the 3-σ contours in frames 2 and 3 of Figure 2.
The MLE covariance estimate is not robust and is inflated by those (slight)
Outlier detection and clustering by partial mixture modeling 457
outliers. These differences are likely due to the inherent robustness properties
of any minimum distance criterion; see [12]. Increasing the covariance matrix
to "cover" a few outliers results in a large increase in the integrated squared
or L2 error, and hence those points are largely ignored.
Figure 2: Several L2E mixture fits to the lagged Old Faithful geyser eruption
data with K = 1 and K = 2; see text. The weights in each frame are (1.0),
(.258, .742), (.714, .286), and (.711, .289).
modeled is less than unity, suggesting a small fraction of the data are being
treated/labeled as outliers with respect to the fitted normal mixture model.
The fact that the third total probability exceeds unity is consistent with our
previous observation that the best fitting curve in the L2 or ISE sense often
integrates to more than 1 when there is a gap in the middle of the data.
Figure 3: Several L2E partial mixture fits to the lagged Old Faithful geyser
eruption data with K = 1 and K = 2, but without any constraints on
the weights; see text. The weights in each frame are (.783), (.253, .694),
(.683, .283), and (.751, .297).
Since there are potentially many more local solutions, we display four
more L2E solutions in Figure 4. Some of these estimates are quite unexpected
and deserve careful examination. The first frame is a variation of a K = 1
component which captures 2 clusters. However, the K = 2 estimates in the
last 3 frames each capture two individual clusters, while completely ignoring
the third. Comparing the contours in the last three frames of Figure 4, we
see that exactly the same estimates appear in different pairs. Looking at
the weights in Figures 3 and 4, we see that the smaller isolated components
are almost exactly reproduced while entirely ignoring the third cluster. This
feature of L2E is quite novel and we conclude that many of the local L2E
results hold valuable diagnostic information as well as quite useful estimates
of the local structure of the data.
Figure 4: Same as Figure 3 but different starting values; see text. The weights
in each frame are (.683), (.253, .316), (.253, .283), and (.316, .283).
Figure 5: Four more K = 1 partial mixture fits to the geyser data; see text.
The weights in each frame are (.694), (.253), (.316), and (.283).
5 Other examples

5.1 Star data

Another well-studied bivariate dataset was discussed by Rousseeuw and
Leroy [17]. The data are measurements of the temperature and light intensity
of 47 stars in the direction of Cygnus. For our analysis, the data were blurred
by uniform U(−.005, .005) noise. Four giant stars exert enough influence to
distort the correlation of a least-squares or maximum likelihood estimate;
see the first frame in Figure 6. In the second frame, a K = 2 MLE normal
mixture is displayed. Notice the four giant stars are represented by one of
the two mixture components, which has a nearly singular covariance matrix.
The third frame shows a K = 1 partial component mixture fit by L2E, with
ŵ₁ = 0.937. The shape of the two covariance matrices of the "good" data
is somewhat different in these three frames. In particular, the correlation
coefficients are −0.21, 0.61, and 0.73, respectively.
These data were recently re-analyzed by Wang and Raftery [26] with the
nearest-neighbor variance estimator (NNVE), an extension of the NNBR
estimator [10]. They compared their covariance estimates to the minimum
volume ellipsoid (MVE) of Rousseeuw and Leroy [17] as well as the (non-robust)
MLE. In Figure 7, I have overlaid these 4 covariance matrices (at the
1-σ contour level) with that of the partial density component (PDC) estimate
obtained by L2E shown in the third frame of Figure 6. For convenience, I have
centered these ellipses on the origin.

Figure 6: Two-σ contours of MLE (K = 1), MLE mixture (K = 2), and
partial L2E mixture (K = 1) fits to the blurred star data.

The NNVE and NNBR ellipses are virtually identical, while the MVE ellipse
is slightly rotated and narrower. These three are surrounded by the slightly
elongated L2E PDC ellipse. Of course, the MLE has the wrong (non-robust)
orientation. The correlation coefficients for NNVE and NNBR are 0.65 versus
0.73 for MVE and L2E. Observe that L2E does not explicitly require a search
for the good data. The other three algorithms require extensive search and/or
calibration of an auxiliary parameter. L2E is driven by the choice of the shape
of the mixing distribution. One might choose instead to use t_ν components,
as suggested by McLachlan and Peel [16], although the degrees of freedom
must be specified. In either case, L2E provides useful diagnostic information
as a byproduct of the estimation, rather than as a follow-on step of analysis.
Figure 7: Ellipses representing the 2-σ contours of five estimates of the
covariance matrix of the star data; see text.
command data(ais, package="sn"). Following Wang and Raftery [26], we
selected the variables body fat (BFAT), body mass index (BMI), red cell
count (RCC), and lean body mass (LBM). (Wang and Raftery also included
ferritin in their analysis.) We blurred the data then standardized each
variable.

We fit a K = 1 L2E starting with the maximum likelihood estimate. The
result was ŵ₁ = 0.98. A pairwise scatterdiagram of the 202 points is shown
in Figure 8, together with contours of the fitted 4-dimensional ellipse. A
careful examination of this plot suggests some clusters. In fact, the first 100
measurements are of female athletes and the last 102 measurements are of
male athletes.
Starting with the MLE values for the female athletes, we re-fit a K = 1
L2E. Now ŵ₁ = 0.41 (somewhat less than the 49.5% female population).
The contours of the fitted 4-dimensional ellipse are superimposed upon the
scatter matrix in Figure 9. The L2E is clearly modeling a large fraction of
the female athletes.

Finally, we started the L2E with the male values. However, L2E found
a smaller subset of the data lying in a subspace. (L2E is just as susceptible
as MLE to being attracted to singular mixture components, depending upon
initial guesses. That is why blurring was applied in all our examples to
remove trivial singularities due to rounding.) Further experimentation would
be interesting.
6 Discussion

We have shown how a minimum distance criterion and a mixture model
with only one or two partial components can provide useful estimates and
diagnostics. In particular, the value of ŵ₁ + ŵ₂ provides an indication of the
References
[1] Aitkin M., Wilson G.T. (1980). Mixture models, outliers, and the EM
algorithm. Technometrics 22, 325-331.
[2] Azzalini A., Bowman A.W. (1990). A look at some data on the Old
Faithful geyser. Applied Statistics 39, 357-365.
[3] Barnett V., Lewis T. (1994). Outliers in statistical data. John Wiley &
Sons, New York.
[4] Banfield J.D., Raftery A.E. (1993). Model-based Gaussian and non-Gaussian
clustering. Biometrics 49, 803-821.
[5] Basu A., Harris I.R., Hjort N.L., Jones M.C. (1998). Robust and efficient
estimation by minimising a density power divergence. Biometrika 85,
549-560.
[6] Beran R. (1977). Robust location estimates. The Annals of Statistics 5,
431-444.
[7] Beran R. (1984). Minimum distance procedures. In Handbook of Statistics
Volume 4: Nonparametric Methods, pp. 741-754.
[8] Bowman A.W. (1984). An alternative method of cross-validation for the
smoothing of density estimates. Biometrika 71, 353-360.
[9] Brown L.D., Hwang J.T.G. (1993). How to approximate a histogram by a
normal density. The American Statistician 47, 251-255.
[10] Byers S., Raftery A.E. (1998). Nearest-neighbor clutter removal for
estimating features in spatial point processes. Journal of the American
Statistical Association 93, 577-584.
[11] Cook R.D., Weisberg S. (1994). An introduction to regression graphics.
Wiley, New York.
[12] Donoho D.L., Liu R.C. (1988). The 'automatic' robustness of minimum
distance functionals. The Annals of Statistics 16, 552-586.
[13] Hjort N.L. (1994). Minimum L2 and robust Kullback-Leibler estimation.
Proceedings of the 12th Prague Conference on Information Theory,
Statistical Decision Functions and Random Processes, P. Lachout and
J.A. Víšek (eds.), Prague Academy of Sciences of the Czech Republic,
pp. 102-105.
[14] Huber P.J. (1981). Robust statistics. John Wiley & Sons, New York.
[15] MacQueen J.B. (1967). Some methods for classification and analysis of
multivariate observations. Proc. Fifth Berkeley Symp. Math. Statist.
Prob. 1, 281-297, Berkeley, CA.
[16] McLachlan G.J., Peel D. (2001). Finite mixture models. John Wiley &
Sons, New York.
[17] Rousseeuw P.J., Leroy A.M. (1987). Robust regression and outlier
detection. John Wiley & Sons, New York.
[18] Rudemo M. (1982). Empirical choice of histogram and kernel density
estimators. Scandinavian Journal of Statistics 9, 65-78.
[19] Scott D.W. (1992). Multivariate density estimation: theory, practice,
and visualization. John Wiley, New York.
[20] Scott D.W. (1998). On fitting and adapting of density estimates.
Computing Science and Statistics, S. Weisberg (ed.) 30, 124-133.
[21] Scott D.W. (1999). Remarks on fitting and interpreting mixture models.
Computing Science and Statistics, K. Berk and M. Pourahmadi (eds.)
31, 104-109.
2 Approaches

Let us quickly review the following different approaches to the environment
for working with various types of data: NetCDF, DDI and MetBroker.
2.1 NetCDF

NetCDF (Network Common Data Form, [4]) is a data abstraction for storing
and retrieving multidimensional data and is distributed as a software library
which provides a concrete implementation of the abstraction. The software
has been developed under the Unidata Program sponsored by the US National
Science Foundation to support research and education in the atmospheric
sciences. This approach is closely related to our InterDatabase, but not the
same. NetCDF is targeted only at multidimensional or array data. However,
InterDatabase is not restricted to data following such a neat format. Another
466 Ritei Shibata
point is that NetCDF requires reformatting all the data so that it accords
with a common rule, called the Common Data Language (CDL). This can be
a burden unless the data processing procedures have been established from
the beginning in each field of application.
2.3 MetBroker

MetBroker ([3]) is middleware which provides consistent access to heterogeneous
weather databases. It is a mediator that sits between agricultural
models and various sources of online data. This approach resolves data
heterogeneity problems by writing a suitable program. It is efficient for meeting
the needs of specific tasks, but it would be laborious to rewrite the program to
meet the needs of users or to accommodate structural changes in databases.
3 Our approach

As was mentioned before, our goal is to provide a good environment to work
with different types of data which might be scattered over networks. Our
approach is probably closer in concept to DDI or NESSTAR. In InterDatabase,
InterDatabase and DandD 467
the DandD Client Server System ([8]) is driven by a DandD instance and
provides a similar environment. A major difference of InterDatabase from
DDI and NESSTAR is the unified general approach to providing such an
environment. A high level of data abstraction is necessary to retain such
generality, and an intimate linkage between the abstraction and the
development of support software is indispensable.

DandD (Data and Description) is a long-running project started around 1990.
Preliminary work can be found, for example, in [10]. The aim of this project
was to establish a formal rule for the description of data. A hope was to make it
possible to do an automatic analysis of data as well as to make it easier to
exchange data with enough description for the aim of analysis. In the first
part of the project, data abstraction was a main concern. The basic model
had been established: construct the necessary number of structures, relational
or array, by quoting data vectors which are simple sequences of numbers. All
necessary attributes are classified into three levels. The bottom level is for
each data vector, the middle level is for each structure constructed, and the
top level is for the whole data. The rule had been implemented in its own
LISP-like language, and some experimental supporting software had been
developed.

A breakthrough occurred with the introduction of XML as a medium for the
implementation of the DandD rule in 1997. The project grew to cover various
data which are not necessarily included as the body of the XML document. This
led to the introduction of the concept of the External Data Vector and further
to the idea of InterDatabase. Therefore InterDatabase is a natural extension
of our original idea of DandD, and it is now a part of DandD together with its
support system, the DandD server client system ([8]). Let us focus our attention
on the features of DandD closely related to InterDatabase.
4 DandD

DandD is a generic name for a system consisting of the following three
elements.

1. DandD rule: A syntax and semantics for describing data. The syntax
is currently written as a DTD.

As has been mentioned before, the data itself is not necessarily a part of
a DandD instance, and this allows us to implement InterDatabase.
Example 1

<DataVector Id="i1" LongName="Year" Access="a1"
Protocol="b1" PostProcessing="c1"/>
is an interface to access a database through the Java language ([5][6]), which
absorbs differences between database servers. Other available protocols allowed
here are FTP and HTTP. The attributes DatabaseServerType and DatabaseName
of JDBC tell us that the database server is PostgreSQL and that the name of
the database to be accessed is KobeQuake, respectively. In fact, the database
is a record of the disastrous earthquake that occurred in the Kobe area in
Japan on 17 January 1995, and the example above is a part of a DandD
example instance KobeQuake.dad which is available from the DandD project
home page
http://www.stat.math.keio.ac.jp/DandD.
The body of JDBC is a Structured Query Language (SQL) sentence which
gets the column year from the table kobequake in the relational database
KobeQuake.
The last element, ScanFormat, which is referred to by the I.D. c1 in the
attribute PostProcessing of the DataVector, specifies the processing method
after receiving a response from the database server. The response is not
necessarily a sequence of numbers and often has to be converted to fit the
DandD requirement that the body of DataVector is a sequence of numbers.
In this example, the response to the query is a sequence of dates of the form
YY-MM-DD and what is needed as the body of this DataVector is only the
YY part. The body of ScanFormat specifies the extraction method by a
formula. The syntax is the same as that of the function scanf in the language
C. Other elements which can be referred to, together with or in place of
ScanFormat, are PrintFormat, Arithmetic, Media and Movie. The PrintFormat
is used for adding something to each element of the sequence. The syntax is
the same as that of the function printf in the C language. Although this
PrintFormat element does not appear in this example, it becomes necessary,
for example, to add the prefix 20 to all YY to make a four-digit representation
of the year for consistency with other data vectors obtained from different
data sources. The Arithmetic is used for a more complicated manipulation,
an arithmetic operation on each element of the sequence. This is used, for
example, when a conversion of the unit from centimetre to metre, or from
sexagesimal to decimal, is necessary. The other two functions are experimental
and support the case when the response from the data server is an image or
a movie. In the attribute PostProcessing of the DataVector, several such
manipulations of the elements of the response can be referred to. Each
manipulation is applied in order.
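The two-step pipeline described above can be mimicked as follows (a Python sketch; the helper names are hypothetical, and the real ScanFormat and PrintFormat bodies use C scanf/printf syntax rather than regular expressions):

```python
import re

def scan_format(line):
    """Mimic a ScanFormat step: keep only the YY part of a YY-MM-DD
    date (the real ScanFormat body would use C scanf syntax)."""
    return re.match(r"(\d{2})-\d{2}-\d{2}", line).group(1)

def print_format(yy):
    """Mimic a PrintFormat step: add the prefix 20 to make a four-digit
    year (the real PrintFormat body would use C printf syntax)."""
    return "20" + yy

# As in PostProcessing, the manipulations are applied in order.
years = [int(print_format(scan_format(d))) for d in ["02-12-30", "03-01-06"]]
```

Applied in order, the two steps turn the raw response "02-12-30" into the number 2002, consistent with data vectors from other sources.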
Besides the elements mentioned above, the Code element can also be referred
to in PostProcessing. The primary role of the Code element was to provide
coding information for categorical data. The body is a sequence of quoted
strings, which provides codes for the natural numbers in the body of
DataVector, and it is usually referred to in the attribute Code of the
DataVector. If it is referred to in the attribute PostProcessing, it means
that each element of the current sequence is matched to the body of the Code
and converted to the matched index. This Code is used as a default Code
attribute of the DataVector as well. The conversion of the labels of the levels
of categorical data can be described if two different Codes are given to the
attributes PostProcessing and Code of a DataVector. Then, the code given
in PostProcessing indicates the code used for the database or for the data
file, and the code given in the attribute Code indicates the code which should
be used in the DandD instance. If both are missing, the obtained sequence
is regarded as a sequence of numbers.
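A sketch of this Code-based conversion (Python; the helper name, the toy code list, and the use of 1-based indices are our assumptions, not from the DandD specification):

```python
def apply_code(values, code):
    """Mimic a Code element used in PostProcessing: each element of the
    current sequence is matched against the quoted strings in the Code
    body and converted to the matched index (1-based here, since the
    codes stand for natural numbers; this indexing is our assumption)."""
    return [code.index(v) + 1 for v in values]

# Hypothetical code list for a categorical market variable.
markets = apply_code(["Tokyo", "Osaka", "Tokyo"], ["Tokyo", "Osaka"])
```

Giving a second, different code list in the attribute Code would then relabel these indices for use inside the DandD instance.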
The reason why we provided such functionalities at the level of data ac-
quisition from a data server comes from our design principle. The principle
of InterDatabase is to provide a good flexible environment for working with
various types of data, and so it is better to do any necessary conversions at
the stage of data acquisition outside DandD. We are then free from the dif-
ferences of the data sources. An alternative would be to modify the existing
databases or data files according to the needs of the user or to create a new
database. However , this is not only laborious but also inefficient for the case
of huge data sets which are rarely used as a whole. In InterDatabase, no
modification is necessary to the existing data sources. This principle is close
to that of DDI or MetBroker, but InterDatabase provides a more general and
flexible way of resolving such differences than others since it is free from any
particular software. It is sufficient to write explicitly any necessary informa-
tion in the form of XML, which is needed for processing, analysis, modelling
and its utilisation.
4.2 Data
The DataVectors defined in the element DataBody are organised into several
structures within the element Data. Two types of structures are available;
Relational or Array. The relational model is general enough to represent
any relations among data ([2]). The relational database under the frame
work of the relational model is now a standard for database systems because
of its generality and ease of system maintenance. A relation is a collection
of variables and the realization is a collection of data vectors each of which
is a sequence of realized values of each variable. The realization looks like a
table and it is usually called a table in the Relational Database Management
System (RDBMS).
Caution is necessary when using the word table. A contingency table
or the result of a designed experiment is also called a table in statistics.
However, such a table is not a table in the sense of RDBMS. In the relational
model, each row in the table is regarded as a point in the value space of
the variables, so that the table is nothing more than a set of such points.
Therefore, the position of each row in the table has no specific meaning.
This is in contrast to, for example, a two dimensional contingency table, in
which two hidden variables exist, say row index and column index variables,
besides the variable for the values in the table. Therefore, it should be
reorganised as a table in RDBMS, with two index variables and a variable for
the values of the table. Each index variable then repeatedly takes the same
value as many times as the number of rows or columns. To avoid such
redundancy, we allow an array structure besides the relational structure in
DandD, since such a table or multidimensional array frequently appears as
a neat data structure and it becomes cumbersome to represent it as a relation.
Example 2 gives a practical example of a relational structure in DandD.
Example 2
<Data>
<Relational Id="Futures" LongName="TSLongNameE TSLongNameJ"
MainKey="dly dlm dld cmdty dvy dvm mkt"
Control="dly dlm dld" Nominal="cmdty dvy dvm mkt">
<Value Id="dly" LongName="Dealing Year"
RefId="Dealing_Year02 Dealing_Year03" Systems="t1"/>
<Value Id="dlm" LongName="Dealing Month"
RefId="Dealing_Month02 Dealing_Month03" Systems="t1"/>
<Value Id="dld" LongName="Dealing Day"
RefId="Dealing_Day02 Dealing_Day03" Systems="t1"/>
<Value Id="cmdty" LongName="Commodity Dealt"
RefId="Commodity02 Commodity03"/>
<Value Id="dvy" LongName="Delivery Year"
RefId="Delivery_Year02 Delivery_Year03" Systems="t2"/>
<Value Id="dvm" LongName="Delivery Month"
RefId="Delivery_Month02 Delivery_Month03" Systems="t2"/>
<Value Id="mkt" LongName="Dealing Market"
RefId="Market02 Market03"/>
<Value Id="op" LongName="Opening Price of a Day"
RefId="S_price02 S_price03" Systems="i1"/>
<Value Id="hp" LongName="Highest Price in a Day"
RefId="H_price02 H_price03" Systems="i1"/>
<Value Id="lp" LongName="Lowest Price in a Day"
RefId="L_price02 L_price03" Systems="i1"/>
<Value Id="cp" LongName="Closing Price of a Day"
RefId="E_price02 E_price03" Systems="i1"/>
<Value Id="sp" LongName="Settlement Price of a Day"
RefId="B_price02 B_price03" Systems="i1"/>
<Value Id="amt" LongName="Amount of Dealings in a Day"
RefId="Amount02 Amount03"/>
<Value Id="oint" LongName="Amount of Open Interest"
RefId="OpenInterest_Amount02 OpenInterest_Amount03"/>
</Relational>
<Time Id="t1">
<Year RefId="dly"/>
<Month RefId="dlm"/>
<Day RefId="dld"/>
</Time>
<Time Id="t2">
<Year RefId="dvy"/>
<Month RefId="dvm"/>
</Time>
<Interval Id="i1">
<Min RefId="lp"/>
<Max RefId="hp"/>
<Other RefId="op"/>
<Other RefId="cp"/>
<Other RefId="sp"/>
</Interval>
</Data>
through FTP as a CSV (comma separated values) file for each month. In the example, relational data is defined by the tag <Relational> and the subelements <Value> define the columns of the relational data. The reason why two data vectors are referred to in the attribute RefId of any Value is that the records in the site are separated into two files, 2002-12.csv for December and 2003-01.csv for January. Moreover, the site changed the record format after 1 January 2003, and the records before that day are stored in a directory past while newer records are stored in a directory now. Therefore, as in Example 1, we need to adjust the old format to the newer one. The following example illustrates a few of the definitions of such data vectors. Here we have omitted some attributes which are not essential for understanding the key points.
Example 3
Two data vectors are defined in this example, sharing the same attribute Access. The attribute PostProcessing of the first vector says that the ScanFormat with Id Dealing_Year02scan and the Arithmetic with Id am1 should successively be applied to each of the lines returned by an execution of the FTP protocol. We need such two-step processing because only the last two digits of a year are recorded in the file 2002-12.csv, whereas the full four digits are recorded in the file 2003-01.csv. To adjust the format to the newer one, the last two digits are extracted from each line of the CSV file 2002-12.csv and converted to the four-digit year representation. The attribute Encoding of the Protocol indicates that the character code of the lines obtained by FTP is the Shift JIS code. The reader may guess other differences between the formats of those two files. Many other formats for the files are possible.
Consider Example 2 again. The attribute MainKey of the Relational tells us the main key of the relational data. This is the same idea as in an RDBMS. Each record is identified by the combination of the indicated Values. This attribute, together with the ForeignKey attribute, enables us to make links between several relational data. The other attributes Control and Nominal indicate which Values are factors. Other possible factor attributes are Variable, Block, Latent and Auxiliary. The concept of factor type is useful not only for applying a model like ANOVA, but also for the visualisation of data. In the example above, the attribute Control suggests that the specified variables constitute an x-axis of a plot and the attribute Nominal suggests that separate visualisations should be organised according to the values of the variables. This is an example showing how the semantics of variables can be described in a formal way. Such a formal description plays an important role in automatic visualisation or semi-automatic data analysis.
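As a sketch of what the MainKey attribute asserts, the following Python snippet checks that the key combination identifies each record uniquely, just as a composite primary key does in an RDBMS. It is illustrative only: the field names follow Example 2, but the records and the function are invented.

```python
# Sketch of the MainKey promise in Example 2: the combination
# (dly, dlm, dld, cmdty, dvy, dvm, mkt) identifies each record uniquely.
def violates_main_key(records, key_fields):
    """Return True if two records share the same main-key combination."""
    seen = set()
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        if key in seen:
            return True
        seen.add(key)
    return False

key = ("dly", "dlm", "dld", "cmdty", "dvy", "dvm", "mkt")
rows = [  # invented records
    {"dly": 2002, "dlm": 12, "dld": 30, "cmdty": "Gold",
     "dvy": 3, "dvm": 2, "mkt": "Tokyo", "cp": 104.0},
    {"dly": 2002, "dlm": 12, "dld": 30, "cmdty": "Gold",
     "dvy": 3, "dvm": 4, "mkt": "Tokyo", "cp": 101.5},
]
print(violates_main_key(rows, key))  # False — the delivery months differ
```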
It is crucial to describe several relations among variables by the Systems attribute of Value. Note that the relation here is not the same as that in the relational model, which is a relation of the given records as a set of points in a value space. In Example 2, two Time relations and an Interval relation are
Ritei Shibata
defined. The Time indicates that the specified Values constitute a calendar system. The Interval indicates that the four Values are closely related, constituting an interval given by Min and Max with several aggregated values: the opening price, the closing price and the settlement price given by Other. Note that the futures price moves from time to time within a day.
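The semantics of the Interval relation can be sketched as a consistency check: the Min and Max values bound the day's trading range, and every Other value must lie inside it. This Python snippet is illustrative only (an invented class and invented data, not part of DandD); the field names follow Example 2.

```python
# Sketch of the Interval relation in Example 2: lp/hp give Min/Max of the
# day's trading range, and op, cp, sp are aggregated values constrained to
# lie inside that interval. The data values are invented.
from dataclasses import dataclass

@dataclass
class DailyPrices:
    lp: float  # Min: lowest price in a day
    hp: float  # Max: highest price in a day
    op: float  # Other: opening price
    cp: float  # Other: closing price
    sp: float  # Other: settlement price

    def is_consistent(self) -> bool:
        """Every aggregated value must fall within [Min, Max]."""
        return all(self.lp <= v <= self.hp for v in (self.op, self.cp, self.sp))

day = DailyPrices(lp=98.0, hp=105.5, op=100.0, cp=104.0, sp=103.5)
print(day.is_consistent())  # True
```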
Abstract: Graphs have long been of interest in telecommunications and social network analysis, and they are now receiving increasing attention from statisticians working in other areas, particularly in biostatistics. Most of the visualization software available for working with graphs has come from outside statistics and has not included the kind of interaction that statisticians have come to expect. At the same time, most of the exploratory visualization software available to statisticians has made no provision for the special structure of graphs.
Graphics software for the exploratory visual analysis of graph data should include the following: graph layout methods; a variety of displays and methods for exploring variables on both nodes and edges, including methods that allow these covariate displays to be linked to the network view; and methods for thinning or otherwise trimming a large graph. In addition, the power of the visualization software is greater if it can be smoothly linked to an extensible and interactive statistics environment.
In this paper, we will describe how these goals have been addressed in GGobi through its data format, architecture, graphical user interface design, and its relationship to the R software [7].
1 Introduction
A graph consists of nodes and edges; the edges connect pair s of nod es. In
social network an alysis, t he nodes frequently repr esent people or institutions;
the edges represent int eractions such as conversat ions or trading relation-
ships. The gra phs encounte red in telecommunications ar e similar : the nodes
typi cally repr esent te lephone numbers or IP (Internet Protocol) addresses;
the edges capt ure t eleph one calls or exchanges of packets .
For a dat a analyst st udying graph data , t he descrip tion of the gra ph
is oft en only par t of the story, becaus e the nod es and th e edges may each
correspond t o mult ivari ate dat a. For exa mple, if t he gra ph capt ures a set of
t elephone numbers and t elephone calls, we may have demographic dat a or
usage dat a abo ut t he bill-p ayer for each te lephone number , and we may also
know the time and duration of phone calls. We th erefore observ e variables
on nod es and on edges.
How do explorato ry dat a analysts approach such data? First , we need t o
visualize the graph, that is, t o lay it out by using node positi ons that have
been calculated to help us interpret the graph structure. This is not a well-
defined objective, but often the distance between nodes in the layout should
reflect their distance from one another according to some distance metric on
the graph. Another guideline is that minimizing edge crossings usually makes
a graph more readable by cutting down on clutter. Still, there is no "best"
layout method, or even a best layout for a particular graph: for example,
one layout may clarify a graph's overall structure while deemphasizing local
structure, while in another layout, a local region of interest may be clearly
drawn but the overall structure looks like spaghetti. Graph layout in an
interactive context, then, should offer several layout algorithms and a lot of
interaction methods for tuning and exploration.
The layout algorithms should be fast enough to be used in real time. For example, we might draw only straight-line edges, and we might not sacrifice any time to choose the perfect position for node labels. The suite of layout algorithms should include methods for laying out graphs in 3D (or higher-D), which we can rotate to shift our viewpoint and focus on local structure.
Other important interaction methods include the following:
• We should be able to tune the layout by moving nodes interactively.
• We should be able to pan and zoom the display of the graph.
• We should have a variety of ways to thin or subset the graph by elim-
inating or collapsing nodes and edges. At times, we may not want to
eliminate nodes, but to find ways to highlight nodes and edges of in-
terest while "downlighting" the rest. In that way, we retain context as
we focus on a subset of interest.
So far, we have considered only the structure of the graph, ignoring the
multivariate data associated with the nodes and edges. Once the layout is
displayed, one wants to explore the data together with the graph, to investi-
gate the relationships between the variables and the shape of the graph. The
use of linked views, by now a standard feature of interactive data visualiza-
tion software, is well suited to this goal. The graph view can be linked to
displays of multivariate data on both nodes and edges.
These additional views can be used to highlight, label or paint nodes and
edges in the graph view according to variable values, so that we can explore
the distribution of data values in the graph (see Fig. 2). Equally, we can
highlight data in the covariate views. For example, we might want to thin
the graph according to covariate values. In the case of telephone calls, we
could erase the edges corresponding to the shortest calls, and then erase all
the nodes that no longer have edges.
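The thinning operation just described — erase the edges for the shortest calls, then erase the nodes left without edges — can be sketched in a few lines. This Python snippet is illustrative only (it is not GGobi code; the function name and data are invented) and treats the node and edge tables as plain lists.

```python
# Sketch of the thinning operation described above: drop the edges with the
# smallest covariate value (call duration), then drop nodes left without edges.
def thin_graph(nodes, edges, durations, min_duration):
    """Keep edges whose duration >= min_duration, then prune isolated nodes."""
    kept_edges = [e for e, d in zip(edges, durations) if d >= min_duration]
    touched = {v for e in kept_edges for v in e}
    kept_nodes = [n for n in nodes if n in touched]
    return kept_nodes, kept_edges

nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("b", "d")]
durations = [120, 5, 10]            # call lengths in seconds; invented values
print(thin_graph(nodes, edges, durations, 60))
# (['a', 'b'], [('a', 'b')]) — the two short calls and the isolated nodes vanish
```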
Finally, this software will be more powerful and more extensible if it
can be programmed using some scripting language, and if it is connected to
a software system for data analysis that includes a library of standard graph
algorithms.
Graph drawing is an active research area in computer science with a long
history [2]. The layouts produced are highly tuned and often beautiful. Since
Exploratory visual analysis of graphs in GGobi
they are not produced within the context of data analysis, the graphics are
typically not interactive, and the programmers have not adopted the linked
views approach. Some tools (e.g. Pajek [1]) offer a library of graph algorithms
in addition to layout, and some can even be extended with plugins (e.g.
Tulip, www.tulip-software.org). Still, the designers clearly do not have
exploratory data analysis (EDA) in mind.
Within the field of statistics, graph visualization has not gotten very
much attention. A notable exception is the work of [12], which has never
been released to the public. Even the social network analysis community,
which combines an interest in graph drawing with an interest in multivariate
data analysis, has not to our knowledge produced tools which combine both
sets of visualization capabilities. We therefore feel that there exists a gap in
current software offerings for the exploration of graph data. GGobi [10] is
our attempt to fill this gap.
This paper is structured as follows. Section 2 introduces GGobi, the
software which will be discussed in the rest of the paper. Section 3 describes
GGobi's methods for graph layout. Section 4 describes some of GGobi's
methods for manipulating displays, especially graph views. Section 5 explains
how GGobi can be embedded in other software, and what this design offers
for graph data analysis. Section 6 describes the data format that is used to
specify relationships between nodes and edges, graph elements and variables.
We use a real telecommunications dataset for illustration throughout the
paper. The meaning of its variables has been masked to protect the privacy
of the customers.
2 GGobi
GGobi is general-purpose multivariate data visualization software, designed
to support EDA. GGobi displays include scatterplots, scatterplot matrices,
barcharts, time series plots, and parallel coordinate plots. All displays can
be linked for color and glyph brushing as well as for point and edge labeling. GGobi is known for its powerful projection facilities for high-dimensional
rotations. Among GGobi's many other manipulations are panning and zoom-
ing, subsampling, and interactive moving of points and groups of points in
data space.
GGobi can be easily extended, either by being embedded in other soft-
ware or by the addition of plugins; either way, it can be controlled using an
Application Programming Interface (API). An illustration of its extensibility
is that it can be embedded in R.
GGobi is a direct descendant of a data visualization system called
XGobi [9] that has been in use since the early 1990's. XGobi supported
the specification and display of graphs, but it did not include any graph
layout methods. Graph data was an afterthought with XGobi, while it was
a consideration in the GGobi design process from the beginning.
GGobi supports a plain ASCII format involving multiple input files (as in
XGobi) for the simplest data specifications, but an XML (Extensible Markup Language) file format has to be used for anything richer, and graphs are an example. The format is briefly described in Section 6.
3 Graph layout
We have used GGobi's plugin mechanism to add graph layout. Because this is specialized software, it is convenient that this functionality can be optional. There are two plugins available for GGobi that can be used for laying out graphs.
underlying graph is not very tree-like, the layout can result in a great many edge crossings, and the layout doesn't do anything to minimize these crossings. In addition to the two position variables, the method generates a few other variables, such as the number of steps between node j and the center.
Dot: "Dot" produces hierarchical layouts of directed graphs in 2D; the other layout methods ignore edge direction. It first finds an optimal rank assignment for each node, then sets the vertex order within ranks, and finally finds optimal coordinates for the nodes.
Neato: The "neato" layout algorithm produces "spring" model layouts of undirected graphs. In spring models, the graph is modelled as a set of objects connected by springs, assuming both attractive and repulsive forces, and an iterative solver is used to find a low-energy configuration. Only the positions at the final configuration are returned by the algorithm. Neato is the most general-purpose method of the three. Further, neato can generate layouts in spaces from 2D to 10D, and edge weights can be used to further tune the layout.
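As a toy illustration of the spring-model idea — attractive forces along edges, repulsive forces between all node pairs, and an iterative solver that returns only the final positions — the following Python sketch performs naive force-directed steps. It is emphatically not GraphViz's neato algorithm; all constants and the graph are invented for illustration.

```python
# Toy force-directed layout: edges pull endpoints together, all pairs repel,
# and we iterate toward a low-energy configuration (final positions only).
import random

def spring_layout(nodes, edges, iters=200, k=0.1, step=0.05, seed=0):
    rng = random.Random(seed)
    pos = {n: [rng.uniform(-1, 1), rng.uniform(-1, 1)] for n in nodes}
    for _ in range(iters):
        force = {n: [0.0, 0.0] for n in nodes}
        for a, b in edges:                      # attraction along edges
            dx = pos[b][0] - pos[a][0]
            dy = pos[b][1] - pos[a][1]
            for i, d in enumerate((dx, dy)):
                force[a][i] += k * d
                force[b][i] -= k * d
        for a in nodes:                          # pairwise repulsion
            for b in nodes:
                if a == b:
                    continue
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                dist2 = dx * dx + dy * dy + 1e-9
                force[a][0] += 0.01 * dx / dist2
                force[a][1] += 0.01 * dy / dist2
        for n in nodes:                          # take one solver step
            pos[n][0] += step * force[n][0]
            pos[n][1] += step * force[n][1]
    return pos

layout = spring_layout(["a", "b", "c", "d"], [("a", "b"), ("b", "c"), ("c", "d")])
```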
The first layout method is illustrated in Fig. 2; the latter two are illustrated in Fig. 1.
There is a manual for the plugin which describes its use in more detail. The dot and neato layout methods are described in the GraphViz documentation, which can be found on www.research.att.com/sw/tools/graphviz/refs.html. The GraphViz software can be obtained from www.graphviz.org.
In addition to parameters, we can make use of color and glyph groupings of the nodes. We may subselect one group at a time for layout, or we may lay out the groups simultaneously but as unconnected graphs. Or we may lay out a subgroup and use it as an anchor set for laying out the remaining nodes.
There is also a diagnostic plot that permits us to judge how closely the pairwise distances in the layout match the target distances.
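The computation behind such a diagnostic plot can be sketched as follows, assuming shortest-path length as the target distance metric (the plugin's actual metric may differ). This Python illustration is hypothetical, not GGobi code.

```python
# Sketch of the data behind the diagnostic plot: for every node pair, compare
# the Euclidean distance in the layout with the target graph distance
# (here, shortest-path length found by breadth-first search).
from collections import deque
from itertools import combinations
from math import dist

def shortest_paths(adj, start):
    """BFS from start; returns hop counts to every reachable node."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen[v] = seen[u] + 1
                queue.append(v)
    return seen

def diagnostic_pairs(adj, pos):
    """(graph distance, layout distance) for each node pair — ready to plot."""
    pairs = []
    for a, b in combinations(sorted(adj), 2):
        pairs.append((shortest_paths(adj, a)[b], dist(pos[a], pos[b])))
    return pairs

adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}       # invented 3-node path
pos = {"a": (0, 0), "b": (1, 0), "c": (2, 0)}         # a perfect layout of it
print(diagnostic_pairs(adj, pos))  # [(1, 1.0), (2, 2.0), (1, 1.0)]
```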
4 Graph exploration
Once the layout has been produced and the graph is displayed, a great deal of exploration is possible without using any further plugins. Most of this functionality depends on using linked views. As one would expect, nodes in the graph view are linked to points in scatterplots of node variables, or to bars in a bar chart; this is a familiar style of linking. It is perhaps less obvious that an edge in the graph view and a point in a scatterplot of edge variables are also linked: these are just different ways of rendering the same record.
Here are some of the manipulations available in GGobi:
Move Points: In this mode, any point can be moved to manually tune the layout. To move a group of points, one brushes them with a common glyph and color; by moving any member of the group, one moves the whole group. Under certain circumstances, point motion can be linked across plots of layouts, namely, when the nodes are shared across graphs that differ only in edge sets and share a single layout in separate windows.
Edit Edges: To edit the graph interactively, add nodes (by clicking the mouse where you want the new node to appear) and edges (by pressing down the mouse button at the source node and dragging the edge to the destination). To view or modify the default properties (such as record label or variable values), use the left button; to simply have the new record added quickly, use the right or middle button. To delete nodes or edges, use "shadow" brushing as described below.
Figure 2: An illustration of linked brushing with graphs. The nodes in the graph are linked to the data in the scatterplot at the lower left; the edges to the data in the scatterplot at the lower right.
Fig. 2 shows linking between a radial layout of the snetwork.xml data and
two scatterplots. Two rectangular arrays of data are involved, one for the
nodes and the other for the edges . The window at the lower right contains
a 1-D plot (an ASH, or Average Shifted Histogram [8]) of a transformation of
one of the edge variables, interactions. The highest values have been brushed
with large green rectangles (rendered in dark gray in the gray-scale printed
version of this paper), and the corresponding edges in the radial layout view
are wide and green. All the green edges are connected to a single node, which
tells us that a single individual participates in all of the longest interactions
in the data. The window at the lower left contains a jittered scatterplot of
hours vs citizenship, the two variables recorded for each person. The points
representing the people with the highest values of the citizenship variable
(visa holders) have been brushed with large orange circles (rendered as large
medium-gray circles in gray scale), and the corresponding points are brushed
in the graph view. A couple of subgraphs contain no visa holders at all, and
a couple of other subgraphs are dominated by visa holders, but we also see
a great deal of interaction between visa holders and other people in the data.
(Recall that the data is actually about telephone calls, but that its meaning
has been thoroughly obscured to protect customer privacy.)
The line characteristics (color, type and thickness) are implied when the point characteristics (color, type and size) are specified in the Choose color & glyph panel.
One of the options available in the brushing mode is shadow brushing [3];
that is, to select points or edges to be drawn in a "shadow" color, close to the
color of the background. This is especially appealing for graph visualization
because clutter is often severe, yet we often don't want to lose sight of the
graph structure when viewing a subset of the data. (Sometimes, of course,
we don't want to draw those points at all, even as shadows, and then we exclude them using the Color & glyph groups tool.)
Coloring by variables: Since interactive brushing of continuous vari-
ables can be tedious, an automatic scheme is available as part of the Color
schemes tool. In the snetwork.xml data, one of the edge variables (interac-
tions) is continuous, so we can choose a sequential color scale and apply it
to the "Contacts" edge set using the interactions. (Since the distribution of
that variable is highly skewed, we might also apply a transformation first.)
Panning and Zooming: It is essential to be able to zoom in on interesting regions of the graph view, and that functionality is available in GGobi's scale mode. (GGobi displays are not linked for scaling.)
All these methods are described in more detail in the GGobi manual, available on www.ggobi.org.
methods of exploration that are peculiar to graphs. It has two functions as of this writing, both of them designed for focussing on contiguous subsets of the graph.
The first function responds to a button click by shadow-brushing leaf nodes and the edges connected to them, recursively, until no leaf nodes are highlighted. It can be a useful way to quickly hide a lot of clutter in a messy graph, and get a look at the center.
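The recursive leaf-hiding function can be illustrated with a small sketch. Here, hypothetically in Python and with "shadowing" simplified to removal, degree-one nodes are pruned round by round until none remain.

```python
# Sketch of the first function: repeatedly shadow (here: remove) leaf nodes
# and their edges until no leaves remain, exposing the "center" of the graph.
def prune_leaves(edges):
    """Iteratively drop degree-1 nodes; returns the surviving edge list."""
    edges = list(edges)
    while True:
        degree = {}
        for a, b in edges:
            degree[a] = degree.get(a, 0) + 1
            degree[b] = degree.get(b, 0) + 1
        kept = [(a, b) for a, b in edges if degree[a] > 1 and degree[b] > 1]
        if len(kept) == len(edges):
            return kept
        edges = kept

# A triangle with a pendant chain: the chain vanishes, the cycle survives.
g = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("d", "e")]
print(prune_leaves(g))  # [('a', 'b'), ('b', 'c'), ('c', 'a')]
```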
The second is a method for focussing on a node and its nearest neighbors. It is used in conjunction with the Identification mode in GGobi. Move the cursor near a point of interest, and then click a mouse button. All points will be shadow-brushed with the exception of the nearest point and its neighbors within one or two steps. In this way, one can walk around the graph, focussing on one small neighborhood at a time.
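The neighborhood-focus function amounts to a breadth-first search of bounded depth: everything outside the result is shadow-brushed. A small illustrative Python sketch (not GGobi code; the graph is invented) follows.

```python
# Sketch of the second function: keep ("highlight") only the node of interest
# and its neighbors within a given number of steps, shadowing the rest.
from collections import deque

def neighborhood(adj, center, steps=2):
    """Nodes within `steps` hops of `center` (including the center itself)."""
    seen = {center}
    frontier = deque([(center, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == steps:
            continue
        for nbr in adj[node]:
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, depth + 1))
    return seen

adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(sorted(neighborhood(adj, "a", steps=2)))  # ['a', 'b', 'c']
```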
again, using the matrix x just described. Next we add a second dataset, 3 by 2, composed of the data corresponding to the edges. Finally we add three edges to the second dataset.
gg <- ggobi(x)
el <- rbind(c("a", "b"), c("b", "c"), c("a", "d"))
gg$setEdges(el, edgeset=gg[["z"]])
If there are variables corresponding to that edge, they are specified within
the record, just as they are for nodes.
As we implied in Section 3.3, it's possible to specify more than one edge
set corresponding to the same node set within the same XML file, and that
offers a way to compare related edge sets.
There are graph specification languages in XML under development, and
we expect it will be easy to translate between those formats and GGobi's,
though those other languages probably won't fully support multivariate data.
For the interested reader, the GGobi distribution includes several graph
datasets in XML. Some include position variables so that additional layout
isn't required: buckyball.xml and cube6.xml describe geometric objects, with
no additional variables. Another, snetwork.xml, is fully multivariate and does
not include variables that can be used for displaying the graph; that is the
dataset that served as an example throughout this paper.
7 Conclusions
As more statisticians become interested in graph data analysis, they approach
this area with the expectations and expertise acquired in working with general
multivariate data. They expect first of all to be able to work in environments
like R, with a set of algorithms, a variety of static display methods, and
a scripting language. This set of goals is being pursued in the Bioconductor
project and elsewhere.
Second, statisticians and other data analysts who have come to rely on
direct manipulation graphical methods will want to use them with this form
of data as well: to quickly update plots, changing variables and projection,
to pan and zoom displays, and to use linked views to explore the graph
and the distribution of multivariate data in the graph. GGobi's data format
supports describing the graph and the data together, and its architecture
allows the addition of plugins, so it's natural to extend GGobi, applying all
its functionality to graph data.
Finally, we want to integrate the direct manipulation graphics, algorithms
and scripting language so that we can use them all together. This expectation
is not yet as automatic as the first two: People often still imagine building
a single monolithic application that can do everything. As the example of
graph data shows, however, there are many specialized problems that are
often overlooked, so no monolithic piece of software can satisfy the needs of
all users. If instead it's possible to integrate complementary software tools,
and to extend them with plugins and packages, then even the most unusual
cases can be handled without too much trouble.
References
[1] Batagelj V., Mrvar A. (1998). Pajek - program for large network analysis. Connections 21, 47-57.
[2] Battista G.D., Eades P., Tamassia R., Tollis I. (1994). Annotated bibliography on graph drawing algorithms. Computational Geometry: Theory and Applications 4, 235-282.
[3] Becker R.A., Cleveland W.S. (1987). Brushing scatterplots. Technometrics 29, 127-142.
[4] Buja A., Swayne D.F. (2002). Visualization methodology for multidimensional scaling. Journal of Classification 18, 7-43.
[5] Chen C.-H., Chen J.-A. (2000). Interactive diagnostic plots for multidimensional scaling with applications in psychosis disorder data analysis. Statistica Sinica 10, 665-691.
[6] Gansner E.R., North S.C. (2000). An open graph visualization system and its applications to software engineering. Software - Practice and Experience 30 (11), 1203-1233.
[7] Ihaka R., Gentleman R. (1996). R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics 5, 299-314.
[8] Scott D.W. (1985). Average shifted histograms: effective non-parametric density estimation in several dimensions. Annals of Statistics 13, 1024-1040.
[9] Swayne D.F., Cook D., Buja A. (1998). XGobi: interactive dynamic data visualization in the X Window System. Journal of Computational and Graphical Statistics 7 (1), 113-130.
[10] Swayne D.F., Temple Lang D., Buja A., Cook D. (2003). GGobi: evolving from XGobi into an extensible framework for interactive data visualization. Computational Statistics & Data Analysis 43, 423-444.
[11] Temple Lang D., Swayne D.F. (2001). The ggobi XML input format. www.ggobi.org.
[12] Wills G. (1999). NicheWorks - interactive visualization of very large graphs. Journal of Computational and Graphical Statistics 8 (2), 190-212.
Acknowledgement: We thank the reviewer who pointed out to us that the ggvis plugin would be a good environment for implementing the interactive diagnostic plots for MDS as described in [5].
Address: D.F. Swayne, AT&T Labs - Research; A. Buja, The Wharton School, University of Pennsylvania; Duncan Temple Lang, University of California, Davis
E-mail: dfs@research.att.com
COMPSTAT'2004 Symposium © Physica-Verlag/Springer 2004
Abstract: A situation where J blocks of variables are observed on the same set of individuals is considered in this paper. A factor analysis logic is applied to tables instead of individuals. The latent variables of each block should well explain their own block, and at the same time the latent variables of the same rank should be as positively correlated as possible. In the first part of the paper we describe the hierarchical PLS path model and remind that it allows us to recover the usual multiple table analysis methods. In the second part we suppose that the number of latent variables can be different from one block to another and that these latent variables are orthogonal. PLS regression and PLS path modeling are used for this situation. This approach is illustrated by an example from sensory analysis.
1 Introduction
We consider in this paper a situation where J blocks of variables X_1, ..., X_J are observed on the same set of individuals. The problem under study is completely symmetrical as all blocks of variables play the same role. All the variables are supposed to be standardized. We can follow a factor analysis logic on tables instead of variables. In the first section of this presentation we suppose that each block X_j is multidimensional and is summarized by m latent variables plus a residual E_j. Each data table is decomposed into two parts: X_j = t_{j1} p'_{j1} + ... + t_{jm} p'_{jm} + E_j. The first part of the decomposition is t_{j1} p'_{j1} + ... + t_{jm} p'_{jm}. The latent variables (t_{j1}, ..., t_{jm}) should well explain the data table X_j, and at the same time the latent variables of the same rank h, (t_{1h}, ..., t_{Jh}), should be as positively correlated as possible. The second part of the decomposition is the residual E_j, which represents the part of X_j not related to the other blocks, i.e. the specific part of X_j.
We show that the PLS approach allows us to recover the usual methods for multiple table analysis. In section two we suppose that the number of latent variables can be different from one block to another and that these latent variables are orthogonal. PLS regression and PLS path modeling are used for this situation. This approach is illustrated by an example from sensory analysis in the last section.
490 Michel Tenenhaus
- Each block X_j is also summarized by the latent variable Z_jh = e_jh t_{J+1,h}, where e_jh is the sign of the correlation between t_jh and t_{J+1,h}. We will however choose e_jh = +1 and show that the correlation is then positive.
- The super-block E_{J+1,h-1} is summarized by the latent variable Z_{J+1,h} = Σ_{j=1}^{J} e_{J+1,j,h} t_jh, where e_{J+1,j,h} = +1 when the centroid scheme is used, or the correlation between t_jh and t_{J+1,h} for the factorial scheme, or the regression coefficient of t_jh in the regression of t_{J+1,h} on t_{1h}, ..., t_{Jh} for the path weighting scheme.
We can now describe the PLS algorithm for the J-block case. The weights w_jh can be computed according to two modes: Mode A or Mode B. In Mode A simple regression is used:

w_jh ∝ X'_j Z_jh   and   w_{J+1,h} ∝ E'_{J+1,h-1} Z_{J+1,h}   (1)

where ∝ means that the left term is equal to the right term up to a normalization.
PLS regression and PLS path modeling for multiple table analysis
In Mode B multiple regression is used:

w_jh ∝ (X'_j X_j)^{-1} X'_j Z_jh   and   w_{J+1,h} ∝ (E'_{J+1,h-1} E_{J+1,h-1})^{-1} E'_{J+1,h-1} Z_{J+1,h}   (2)
The normalization depends upon the method used. For some methods w_jh is of norm 1. For other methods the variance of t_jh is equal to 1. This procedure is iterated until convergence, which is always verified in practice but only mathematically proven for the two-block case.
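For readers who want to see the iteration concretely, here is a toy Python sketch of the h = 1 case under Mode A and the centroid scheme with all e = +1. It is not the author's implementation: it ignores deflation and standardization, uses invented data, and works with plain lists of column vectors.

```python
# Toy sketch of the iterative PLS algorithm above (h = 1, Mode A, centroid
# scheme, all e = +1): block weights come from simple regression of the block
# columns on the super-block score, each block score is X_j w_j, and the
# super-block score is rebuilt from the sum of the block scores.
from math import sqrt

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    norm = sqrt(dot(v, v))
    return [x / norm for x in v]

def matvec_cols(cols, w):
    """Linear combination of column vectors: t = X w."""
    n = len(cols[0])
    return [sum(wj * col[i] for wj, col in zip(w, cols)) for i in range(n)]

def pls_block_scores(blocks, iters=100):
    """One latent score vector per block, plus the super-block score."""
    n = len(blocks[0][0])
    t_super = unit([1.0] * n)                  # arbitrary starting score
    for _ in range(iters):
        scores = []
        for cols in blocks:                    # Mode A: w_j ∝ X_j' z_j
            w = [dot(col, t_super) for col in cols]
            scores.append(unit(matvec_cols(cols, w)))
        # centroid scheme: super score from the sum of the block scores
        t_super = unit([sum(t[i] for t in scores) for i in range(n)])
    return scores, t_super

blocks = [                                     # invented two-block data, n = 4
    [[1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 4.0, 3.0]],   # X_1: two columns
    [[1.5, 2.5, 3.5, 4.5]],                          # X_2: one column
]
scores, t_super = pls_block_scores(blocks)
```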
The various options of PLS Path Modeling (Mode A or B for external estimation; centroid, factorial or path weighting schemes for internal estimation) allow us to recover many methods for Multiple Table Analysis: Generalized Canonical Analysis (Horst's [6] and Carroll's [1]), Multiple Factor Analysis [4], Lohmöller's split principal component analysis [9], and Horst's maximum variance algorithm [7]. The links between PLS and these methods have been demonstrated in [9] or [11] and studied on practical examples in [5] and [10]. These various methods are obtained by using the PLS algorithm according to the options described in Table 1. Only the super-block is deflated; the original blocks are not deflated.
There is some advantage in imposing orthogonality constraints only on the latent variables related to the super-block: no dimension limitation due to block sizes. If orthogonality constraints were imposed on the block latent variables, then the maximum number m of latent variables would be the size of the smallest block. The super-block X_{J+1} is summarized by m orthogonal latent variables t_{J+1,1}, ..., t_{J+1,m}. Each block X_j is summarized by m latent variables t_{j1}, ..., t_{jm}. But these latent variables can be highly correlated and consequently do not reflect the real dimension of the block. In each block X_j the latent variables t_{j1}, ..., t_{jm} represent the part of the block correlated with the other blocks. A principal component analysis of these latent variables will give the actual dimension of this part of X_j.
It can be preferred to impose orthogonality on the latent variables of each block. But we then have to remove the dimension limitation due to the smallest block. This situation is discussed in the next section.
4 Application
We are going to use PLS-MTA on wine data which have been collected by C. Asselin and R. Morlat and are fully described in [3]. A set of 21 red wines with Bourgueil, Chinon and Saumur origins is described by 27 variables distributed in four blocks: X1 = Smell at rest = [smell intensity at rest, aromatic quality at rest, fruity note at rest, floral note at rest, spicy note at rest], X2 = View = [visual intensity, shading (from orange to purple), impression of surface], X3 = Smell after shaking = [smell intensity, smell quality, fruity note, floral note, spicy note, vegetable note, phenolic note, aromatic intensity in mouth, aromatic persistence in mouth, aromatic quality in mouth], X4 = Tasting = [intensity of attack, acidity, astringency, alcohol, balance (acidity, astringency, alcohol), mellowness, bitterness, ending intensity in mouth, harmony]. Another variable describing the global quality of the wine will be used as an illustrative variable.
We now describe the application of the PLS-MTA methodology to these data.
Step 1
PLS regressions of [X2, X3, X4] on X1, [X1, X3, X4] on X2, [X1, X2, X4] on X3, and [X1, X2, X3] on X4 all lead to two PLS components when we decide to keep a component if it is significant (Q2 larger than 0.05). The X- and Y-explanatory powers of these components are given in Table 2.
Then the "smell at rest" block T1 = {t11, t12}, the "view" block T2 = {t21, t22}, the "smell after shaking" block T3 = {t31, t32}, and the "tasting" block T4 = {t41, t42} are defined with the standardized PLS X-components.
Step 2
The PLS components being orthogonal, it is equivalent to use Mode A or B for the left part of the causal model given in Figure 3 (PLS-Graph output [2]). Due to the small number of observations, Mode A has to be used for the right part of the causal model of Figure 3. We use the centroid scheme for the internal estimation. We give in Figure 3 the MTA model for the first rank components and in Table 3 the correlations between the latent variables.
Figure 3: Path model for the first rank components (PLS-Graph output).
In Figure 3 the figures above the arrows are the correlation loadings and
the figures in brackets below the arrows are the weights applied to the stan-
dardized variables. Correlations and weights are equal on the left side of the
path model because the PLS components are uncorrelated.
Rank one components are written as:
t51 = .2516 × t11 + .0045 × t12 + .2552 × t21 + .0788 × t22 + .2707 × t31 + ...
We may note that the rank one components are highly correlated with the first PLS components t11, t21, t31 and t41.
To obtain the rank two components it is now useful to use equation (4), which here becomes:

[t̄_j1, t̄_j2] = [t_j1, t_j2] A_j   (5)

as

A_j = [ cos θ_j   sin θ_j
       -sin θ_j   cos θ_j ]   (6)

is the orthogonal rotation matrix in the plane with angle θ_j. For each of the new components t̄_11, ..., t̄_41 it can be checked that the squares of the coefficients of the PLS components t_j1, t_j2 sum up to one. It is then easy to get the rank two components:
t̄_12 = -.0176 × t11 + .9998 × t12
t̄_22 = -.2950 × t21 + .9558 × t22
t̄_32 = -.1619 × t31 + .9869 × t32
t̄_42 = -.1042 × t41 + .9747 × t42
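As a numerical check (illustrative only, not from the paper), the property that the squared coefficients sum to one, and the orthogonality of the rotation matrix in equation (6), can be verified from the published coefficients of the first block:

```python
import math

# published coefficients of the rank two component of block 1: -.0176 and .9998
a, b = -0.0176, 0.9998
print(a * a + b * b)                       # close to 1, up to rounding

theta = math.atan2(-a, b)                  # recover the angle of equation (6)
A = [[math.cos(theta), math.sin(theta)],
     [-math.sin(theta), math.cos(theta)]]  # the rotation matrix A_j
row_dot = A[0][0] * A[1][0] + A[0][1] * A[1][1]   # rows of A_j are orthogonal
```

The second row of A_j reproduces (a, b) up to the rounding of the published coefficients.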
However, to get the external latent variable t52 for the super-block we need to apply the complete algorithm. We first regress each block Tj = {tj1, tj2} on tj1. Then the path model used for rank one components is used on the standardized residual tables Tj1 = {tj11, tj21}. The results are given in Figure 4.
In Table 4 we give the correlations between the rank two components. The sensory components of rank one and two are uncorrelated by construction. The global components are also practically uncorrelated (r = -.000008).
Figure 4: Path model for the second rank components in terms of residuals.
[Figure: correlation loadings of the individual variables with the global components; horizontal axis: global component 1 loading.]
5 Discussion
PLS-MTA amounts to carrying out a kind of principal component analysis on each block and on the super-block, such that the components of the same rank are as positively correlated as possible. So, for each dimension h, the interpretations
498 Michel Tenenhaus
[Figure: plot of the 21 wines on global components 1 and 2, labelled by appellation (Saumur, Chinon, Bourgueil); horizontal axis: global component 1, vertical axis: global component 2.]
References
[1] Carroll J.D. (1968). A generalization of canonical correlation analysis to three or more sets of variables. Proc. 76th Conv. Am. Psych. Assoc., 227-228.
[2] Chin W.W. (2003). PLS-Graph user's guide. C.T. Bauer College of Business, University of Houston, USA.
[3] Escofier B., Pages J. (1988). Analyses factorielles simples et multiples. Dunod, Paris.
[4] Escofier B., Pages J. (1994). Multiple factor analysis (AFMULT package). Computational Statistics and Data Analysis 18, 121-140.
[5] Guinot C., Latreille J., Tenenhaus M. (2001). PLS path modelling and multiple table analysis. Application to the cosmetic habits of women in Ile-de-France. Chemometrics and Intelligent Laboratory Systems 58, 247-259.
[6] Horst P. (1961). Relations among m sets of variables. Psychometrika 26, 126-149.
[7] Horst P. (1965). Factor analysis of data matrices. Holt, Rinehart and Winston, New York.
[8] Hotelling H. (1936). Relations between two sets of variates. Biometrika 28, 321-377.
[9] Lohmoller J.-B. (1989). Latent variables path modeling with partial least squares. Physica-Verlag, Heidelberg.
[10] Pages J., Tenenhaus M. (2001). Multiple factor analysis combined with PLS path modeling. Application to the analysis of relationships between physico-chemical variables, sensory profiles and hedonic judgements. Chemometrics and Intelligent Laboratory Systems 58, 261-273.
[11] Tenenhaus M. (1999). L'approche PLS. Revue de Statistique Appliquee 47 (2), 5-40.
[12] Tenenhaus M., Esposito Vinzi V., Chatelin Y.-M., Lauro C. (2004). PLS path modeling. Computational Statistics and Data Analysis (to appear).
[13] Tucker L.R. (1958). An inter-battery method of factor analysis. Psychometrika 23 (2), 111-136.
[14] Van den Wollenberg A.L. (1977). Redundancy analysis: an alternative for canonical correlation. Psychometrika 42, 207-219.
[15] Wold H. (1982). Soft modeling: the basic design and some extensions. In Systems under indirect observation, Part 2, K.G. Joreskog & H. Wold (Eds), North-Holland, Amsterdam, 1-54.
[16] Wold H. (1985). Partial least squares. In Encyclopedia of Statistical Sciences, Kotz S. & Johnson N.L. (Eds), John Wiley & Sons, New York 6, 581-591.
[17] Wold S., Martens H., Wold H. (1983). The multivariate calibration problem in chemistry solved by the PLS method. In: A. Ruhe and B. Kagstrom (Eds), Proc. Conf. Matrix Pencils. Lecture Notes in Mathematics, Springer-Verlag, Heidelberg.
1001 GRAPHICS
Martin Theus
1 Introduction
Everybody knows the phrase "A picture can be worth a 1000 words". Advocates of statistical graphical methods and data visualization sometimes use this phrase to support their position. Whereas everyone knows that there are many examples which prove that they are right, there is a far greater number of examples (although less quoted) which prove the opposite. All positive examples are usually very well thought out. E.g. Minard's visualization of Napoleon's march on Moscow is a very popular example of the power of a good visualization. The power of Minard's graph lies in the well chosen combination of spatial plotting and time series information, not to mention several artistic and aesthetic considerations, which are not that obvious at first glance. This brings us back to the phrase "A picture is worth a 1000 words", which only holds true if the picture is really well chosen. Today, when the next statistical graphic is only one keystroke or mouse-click away, we tend to produce many graphs which would probably need more than a 1000 words to be interpreted.
In the next section of this paper we will investigate the influence of the right choice of plot defaults on the quality, i.e. the interpretability and usability, of a graph. This should make us more alert to default plot settings, which are often inappropriate for the task at hand. The final section of the paper goes beyond single graphs, and shows strategies for analyzing multivariate data with ensembles of standard statistical graphs.
502 Martin Theus
[Figure 1: four scatterplots of Weight against Nub for the pollen data, drawn with the default settings, smaller plot symbols, and zoomed-in axis limits.]
2 On plot defaults
2.1 The scatterplot - less can be more
A scatterplot of two quantitative variables is probably the most elementary and fundamental plot in statistics. At first glance there do not seem to be many degrees of freedom to choose parameters to improve a scatterplot. Reviewing Cleveland [1] and [2], the only thing we can do with scatterplots is to change scales and plot symbols. Obviously Cleveland's work was written at a time when pen plotters and amber CRTs were the latest technology. Furthermore, datasets with more than just a few hundred observations were very uncommon. Today's problems often look much different. A couple of thousand points are often regarded as rather small, but would have used up a whole ink cartridge of a pen plotter 25 years ago. This calls for new, advanced rendering strategies.
1001 graphics 503
> names(pollen)
[1] "Ridge"   "Nub"     "Crack"   "Weight"  "Density" "Number"
> attach(pollen)
> par(mfrow=c(2,2))
> plot(Nub, Weight, main="Default Plot")
> plot(Nub, Weight, pch=".", main="Smaller Symbols")
> plot(Nub, Weight, pch=".", xlim=c(-1,1), ylim=c(-1,1), ...)
> plot(Nub, Weight, xlim=c(-1,1), ylim=c(-0.8,1.6), ...)
Figure 2 shows the same data plotted in Mondrian [6]. The default scatterplot in Mondrian uses α-transparency to cope with overplotting. α-transparency allows us to use suitably sized points in a scatterplot without losing the information about density in the scatterplot. The amount of transparency grows with the number of points to plot. In Figure 2 the unusual feature is immediately visible without the need to optimize plot parameters. More information on how to plot scatterplots can be found in Cook et al. [3].
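The same idea can be imitated in other environments by letting α shrink as the point count grows; the rule below is only a plausible heuristic for illustration, not Mondrian's actual formula.

```python
def alpha_for(n_points, target_ink=500.0):
    """Heuristic alpha for scatterplot points: keep the total amount
    of 'ink' roughly constant, clamped to the range [0.01, 1.0]."""
    return max(0.01, min(1.0, target_ink / n_points))

# more points -> more transparency
for n in (100, 3000, 100000):
    print(n, round(alpha_for(n), 3))
```

With matplotlib this would be used as, e.g., `plt.scatter(x, y, s=4, alpha=alpha_for(len(x)))`.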
Figure 3: Six histograms with superposed density estimators for the variable "displacement" of the "mpg-auto" dataset from the UCI ML repository. The number of bins has been determined according to Sturges' rule.
In cases where the data come from a single generating process following a continuous, only mildly skewed random variable, these rules will deliver sufficiently nice results². The more critical situation arises when the data are a mixture of several generating processes from both continuous and discrete random variables. In these situations, we have to cope with gaps, discrete patterns and accumulation points. Unfortunately real data usually come from the latter kind of process.
Figure 3 shows an example of six histograms for the variable "displacement" of the "mpg-auto" dataset from the UCI Machine Learning Repository with origins at 10, 19, 28, 37, 46 and 55. The number of bins has been determined according to Sturges' rule. The bin width has been "beautified" to 50 within the R hist function. Obviously none of the six origins gives us a satisfying estimate of the underlying density, nor does the kernel density estimator. The explanation is not too hard to find. Most cars in the dataset have only a very small displacement of 80 to 160. Bigger cars - all 6 cylinder engines in the dataset - form another mode at 220 to 260. Two discrete spikes can be found at 300 and 340, with some larger outliers, all corresponding to 8 cylinder engines.
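Sturges' rule derives the bin count from the sample size alone, k = ⌈log2 n⌉ + 1, which is exactly why it cannot react to the mixture structure just described. A minimal sketch (the dataset size of roughly 400 cars is an approximation):

```python
import math

def sturges_bins(n):
    """Number of histogram bins suggested by Sturges' rule."""
    return int(math.ceil(math.log2(n))) + 1

# the mpg-auto data contain roughly 400 cars:
print(sturges_bins(400))   # -> 10, regardless of the data's actual shape
```

Whatever the origin, 10 bins over a range of several hundred displacement units cannot separate the narrow spikes at 300 and 340 from the neighbouring modes.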
Figure 4: A histogram starting at 60 with bin width 20, yielding 20 bins for the variable "displacement".
must be retyped, until a satisfying setting is found. Finding explanations for the above described structural features can be done most conveniently within an interactive environment which allows linked highlighting. This leads to the next section.
Figure 5: Left: A histogram for the variable "mpg" with model years 74-78 highlighted. Right: A spinogram showing the same data.
Figure 5 shows an example of this situation. The left histogram has all model years from 74 to 78 highlighted. At first glance we would expect that the selected subgroup has approximately the same distribution as the whole population. To verify this, we use a spinogram.
A spinogram is a histogram where all bars have the same height. In order to keep the proportionality between the area of a bar and the number of cases in the bar, the width is adjusted; i.e. whereas in a histogram with equally spaced bins the height of a bar is proportional to the number of cases in this group, in a spinogram the width is proportional. Obviously the x-axis of a spinogram is then transformed to a no longer linear, but still continuous, scale. This puts more visual weight on areas with high density and less weight on areas with low density. The highlighting in a spinogram is still done from bottom to top. This allows the comparison of proportions of the highlighted cases across the whole range of the underlying variable. Whereas this comparison is easily possible, the comparison of proportions in highlighted histograms is almost impossible. This is due to the fact that our visual system is well able to compare positions along a common scale, but almost incapable of judging
length or position on different scales (cf. Cleveland [1], 262pp). Coming back to the example in Figure 5, the spinogram reveals that the cars in the years 74-78 mostly have mpg-values close to the overall mean, i.e. the tails of the distribution of this group are less populated than in the rest of the sample.
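The construction just described — equal bar heights, widths proportional to bin counts, highlighting drawn from the bottom — can be sketched in a few lines (illustrative code, not from the paper):

```python
def spinogram(counts, highlighted):
    """For each histogram bin return (width, highlight_fraction):
    the width is the bin's share of all cases (all bars share one height),
    the highlight fraction fills the bar from the bottom."""
    total = sum(counts)
    return [(c / total, (h / c) if c else 0.0)
            for c, h in zip(counts, highlighted)]

# 4 bins of a histogram; second list = highlighted cases per bin
bars = spinogram([10, 40, 40, 10], [2, 20, 20, 2])
```

In this toy example the two middle bins are four times as wide as the outer ones, and their 50% highlight fraction is directly comparable across bins, which is exactly what the equal-height construction buys.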
and the multiple barchart view which scales the size of the tiles along only
one axis .
3 Plot ensembles
The last section gave some hints on how to choose the right plot parameters
and/or plot types, in order to get meaningful plots. This helps to optimize
a single plot or view.
In an exploratory data analysis process we often try to answer statistical questions with graphics. E.g. looking at the "mpg-auto" data we might be interested in the influence of the originating country or continent and the number of cylinders on the gas consumption of a car. This relationship between two categorical variables and one continuous variable can be investigated by using an ensemble of 4 linked plots.
The plot ensemble in Figure 10 features a barchart for cylinders and origin, a mosaic plot of the two variables and a boxplot of "mpg" conditioned on number of cylinders (alternatively we could also use a boxplot of "mpg" conditioned on the originating country). In this ensemble we see the interaction structure of the two influencing variables in the mosaic plot, as well as their marginal distributions in the two barcharts. The boxplot shows the distribution of "mpg" for each cylinder group, and via highlighting we can investigate the interaction structure of "origin" and "cylinders" on "mpg". In Figure 10 the group of all Japanese cars has been highlighted.
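The quantities such an ensemble shows visually can also be read off numerically; the sketch below uses made-up records (not the actual mpg-auto values) to build the cross-table behind the mosaic plot and the per-cylinder mpg summary behind the boxplot.

```python
from collections import Counter, defaultdict

# hypothetical records: (origin, cylinders, mpg)
cars = [("japan", 4, 31.0), ("usa", 8, 14.0), ("usa", 4, 26.0),
        ("europe", 4, 29.0), ("japan", 4, 33.0), ("usa", 6, 19.0)]

crosstab = Counter((o, c) for o, c, _ in cars)      # counts behind the mosaic plot
by_cyl = defaultdict(list)
for _, c, mpg in cars:
    by_cyl[c].append(mpg)
mean_mpg = {c: sum(v) / len(v) for c, v in by_cyl.items()}   # boxplot summary
```

The point of the linked graphical ensemble is that these three summaries are explored in one interactive view rather than as separate tables.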
The next example in Figure 11 shows how we can look at the temporal distribution of spam e-mails. In the barchart of the classification variable "spam" all spam e-mails have been selected. In the barchart for "Day of Week", as well as the corresponding spineplot, we see the absolute and relative distribution of spam e-mails over the course of a week. Whereas the absolute amount of spam e-mails grows towards the middle of the week, the
4 Conclusion
The rise of computers with graphical capabilities has led to new graphical data analysis possibilities, but has also caused an inflation in the use of statistical graphics. Only well designed graphics can be "worth a 1000 words". Many statistical software packages do not take care over default settings. This deficit can often be explained by the fact that the underlying code and
graphical model is quite old, and has not yet been adapted to modern data problems and rendering methods.
Using α-channel transparency can help a lot when trying to avoid overplotting problems in scatterplots and parallel coordinate plots. The histogram as a means of density estimation is an example of a plot where "no default" is the only good default. Spinograms are a good choice when trying to visualize a sub-population of a continuous variable. A histogram, which is often used instead, is not useful for this task. Mosaic plots are complemented by three variations to build a suite of plots which can visualize multivariate discrete data. Where the one plot is good, the other one fails.
Generally, for a comprehensive graphical data exploration, we need a wide range of plots, each of which can be applied exactly for the purpose it serves best. No craftsman would enter a construction site with a toolbox consisting of just a single type of tool.
²In a recent talk an expert on Support Vector Machines (SVM) noted that he would suggest that all implementations of SVMs should always force the user to explicitly specify parameters, since there is no such thing as a default parameter setting which would generally yield acceptable results.
References
[1] Cleveland W.S. (1985). The elements of graphing data. Wadsworth, Monterey, CA.
[2] Cleveland W.S. (1993). Visualizing data. Hobart, Summit, NJ.
[3] Cook D., Theus M., Hofmann H. Scatterplots for massive datasets. Journal of Computational and Graphical Statistics, submitted.
[4] Hofmann H., Theus M. Visualizing conditional distributions. Journal of Computational and Graphical Statistics, submitted.
[5] Scott D. (1992). Multivariate density estimation - theory, practice, and visualization. Wiley, New York.
[6] Theus M. (2002). Interactive data visualization using Mondrian. Journal of Statistical Software 7 (11).
Key words: Bradley Terry model, discrete data, factorial structure, general equivalence theorem, maximum likelihood estimation, multiplicative algorithm, optimal design theory, paired comparisons.
COMPSTAT 2004 section: Design of experiments.
1 Paired comparisons
1.1 Introduction
We consider paired comparison experiments in which J treatments or products are compared in pairs. In a simple form a subject is presented with two treatments and asked to indicate which he/she prefers or considers better. In reality the subject will be an expert tester; for example, a food taster in examples arising in food technology. The link with optimal design theory (apart from the fact that a specialised design, paired comparisons, is under consideration) is that the parameters of one model for the resultant data, the Bradley Terry model, are like weights. Hence the theory characterising, and the methods developed for finding, optimal design weights can be applied to characterising and finding the maximum likelihood estimators of these Bradley Terry weights.
514 Ben Torsney
1.3 Models
1.3.1 A general model
In the absence of other information the most general model here is to propose:
(1)
where
θ_ij = P(T_i is preferred to T_j)
Apart from the constraint O_ij + O_ji = n_ij, independence between frequencies is to be recommended. So apart from the constraint θ_ij + θ_ji = 1, these define unrelated binomial parameters. The maximum likelihood estimator of θ_ij is O_ij/n_ij (the proportion of times T_i is preferred to T_j in these n_ij comparisons), and formal inferences can be based on the asymptotic properties of these.
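A small sketch of this estimator (the counts are made up; the standard error is the usual binomial asymptotic one):

```python
import math

def theta_hat(o_ij, n_ij):
    """MLE of theta_ij under the unrestricted binomial model,
    together with its asymptotic standard error."""
    t = o_ij / n_ij
    se = math.sqrt(t * (1.0 - t) / n_ij)
    return t, se

# e.g. T_i preferred in 18 of 26 comparisons with T_j (made-up numbers)
t, se = theta_hat(18, 26)
```

By construction the estimates for the two orderings of a pair are complementary: theta_hat(8, 26) gives 1 - t.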
1.3.3 Motivation for Bradley Terry Model
However we can show that θ_ij is uniquely determined by a latent difference. Let p_i = exp(λ_i). Then:

θ_ij = exp(δ_ij) / (1 + exp(δ_ij)), where δ_ij = λ_i - λ_j.   (3)
Fitting Bradley Terry Models using a multiplicative algorithm 515
Thus θ_ij is uniquely determined by the difference in the transformed quality characteristics λ_i, λ_j, while it is invariant to shifts in their values.
Further θ_ij = F(δ_ij), where F(δ) is the logistic distribution function. If we assume that the difference in quality between the two treatments has a logistic distribution, then θ_ij is the probability of a difference of at most δ_ij; or the difference in quality is given by:
δ_ij = F^{-1}(θ_ij) = F^{-1}{p_i/(p_i + p_j)}
See [6]. Other choices of F(.) can lead to alternative models with parameters similar to p_1, p_2, ..., p_J.
2 Parameter estimation
In terms of the original parameters the likelihood of the data is a product of binomial likelihoods, namely:

L = ∏_{r<s} (n_rs choose O_rs) θ_rs^{O_rs} θ_sr^{O_sr}   (4)

Let p = (p_1, p_2, ..., p_J) and, for convenience, let O_ii = 0, i = 1, 2, ..., J, and O_i. = Σ_j O_ij.
Then the likelihood of the data under the Bradley Terry model is given by making the substitutions θ_rs = p_r/(p_r + p_s), θ_sr = p_s/(p_r + p_s), O_rs + O_sr = n_rs, to yield:

L(p) ∝ ∏_{r<s} [p_r/(p_r + p_s)]^{O_rs} [p_s/(p_r + p_s)]^{O_sr}   (5)
e
We wish to choose P (p > 0) to maximise L(p) . Since ij is invari an t to
proport ional changes in the p;'s, so is L(p ). In fact L(p ) is a hom ogeneous
funct ion of degree zero in P; i.e. L(cp) = L(p) , where c is a scalar constant . It
is constant on rays running out from the origin. It will therefore be max imised
all along one specific ray. We can identify this ray by findin g a particular
optimising p*. This we can do by impos ing a constraint on p. Pos sible
const raints are LPi = 1 or TIP i = 1, or g(p) = 1 where g(p) is a sur face
which cuts each ray exac tly once. In t he case J = 2 a suit abl e g(p) is defined
by P2 = h(PI) , where h(.) is a decreasing function which cuts the two main
axes, as in the case of h(PI ) = 1 - PI , or has these as asy mptotes, as in
t he case of h(PI) = l /PI. In general a suitabl e choice of g(p) is one which
is positive and homogeneous of som e degree h . Not e t hat ot her alte rnatives
ar e LPi = 0 or TIP i = 0 , where 0 is any positi ve constant ; e.g. 0 = J .
The choice of TI Pi = 1, being equivalent to L In(Pi) = 0, confers on
Cti = In(Pi) t he notion of a main effect. We will opt for t he choice of LPi =
1, which conveys the notion of Pi as a weight . We wish to maximise the
likelihood or log-likelihood subject to t his const ra int and to non-negativity
too. This is an example of t he following genera l problem:
Problem (P):
Maximise φ(p) subject to p_i ≥ 0, Σ p_i = 1.
We wish to maximise φ(p) with respect to a probability distribution. Here we will take φ(p) = ln{L(p)}.
There are many examples of this problem arising in various areas of statistics, especially in the area of optimal regression design. We can exploit optimality results and algorithms developed in this area. The feasible region is an open but bounded set. Thus there should always be a solution to this problem, allowing for the possibility of an unbounded maximum, multiple solutions and solutions at vertices (i.e. p_t = 1, p_i = 0, i ≠ t).
3 Optimality conditions
We can define optimality conditions in terms of the point to point directional derivative defined by Whittle [19]. The directional derivative F_φ(p, q) of a criterion φ(.) at p in the direction of q is the limit as ε ↓ 0 of:

{φ[(1 - ε)p + εq] - φ(p)} / ε
4 Algorithms
4.1 A multiplicative algorithm
Problem (P) has a distinct set of constraints, namely the variables p_1, p_2, ..., p_J must be nonnegative and sum to 1. An iteration which neatly submits to these and has some suitable properties is the multiplicative algorithm:

p_j^(r+1) = p_j^(r) f(d_j^(r)) / Σ_i p_i^(r) f(d_i^(r))   (6)

where d_j^(r) = ∂φ/∂p_j evaluated at p = p^(r), while f(d) is positive and strictly increasing in d and may depend on one or more free parameters.
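As an illustration of iteration (6) applied to the Bradley Terry log-likelihood — a sketch, not the author's code — the derivatives are d_j = O_j./p_j - Σ_{s≠j} n_js/(p_j + p_s), and f(d) = exp(δd) is used so that negative derivatives are allowed:

```python
import math

def bt_fit(O, delta=None, tol=1e-8, max_iter=50000):
    """Fit Bradley Terry weights p (summing to 1) with the multiplicative
    iteration (6), using f(d) = exp(delta*d).
    O[i][j] = number of times T_i was preferred to T_j."""
    J = len(O)
    N = sum(map(sum, O))
    if delta is None:
        delta = 1.0 / N            # the standardisation used in the examples below
    p = [1.0 / J] * J

    def grad(p):
        # d_j = dlogL/dp_j = O_{j.}/p_j - sum_{s != j} n_js/(p_j + p_s)
        return [sum(O[j]) / p[j]
                - sum((O[j][s] + O[s][j]) / (p[j] + p[s])
                      for s in range(J) if s != j)
                for j in range(J)]

    for _ in range(max_iter):
        d = grad(p)
        if max(abs(dj) for dj in d) < tol * N:   # all d_j = 0 at an interior optimum
            break
        f = [math.exp(delta * dj) for dj in d]
        denom = sum(pi * fi for pi, fi in zip(p, f))
        p = [pi * fi / denom for pi, fi in zip(p, f)]
    return p

# toy 3-treatment example with made-up preference counts
O = [[0, 7, 9],
     [3, 0, 6],
     [1, 4, 0]]
p = bt_fit(O)
```

For the coffee data one would pass the 8×8 table of observed preference counts; the made-up table here simply has T_1 beating both rivals clearly, so its weight dominates.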
This type of iteration was first proposed by [13], taking f(d) = d^δ, with δ > 0. This, of course, requires positive derivatives. Subsequent empirical studies include Silvey et al [11], which is a study of the choice of δ when f(d) = d^δ, δ > 0; Torsney [15], which mainly considers f(d) = e^{δd} in a variety of applications, for which one criterion φ(.) could have negative derivatives; Torsney and Alahmadi [16], who consider other choices of f(.); Torsney and Mandal [18], who consider objective choices of f(.); and [8], who explore developments of the algorithm based on a clustering approach in the context of a continuous design space. Torsney and Mandal [17] and Mandal et al [9] also apply these algorithms to the construction of constrained optimal designs.
Titterington [12] describes a proof of monotonicity of f(d) = d^δ in the case of D-optimality. Torsney [14] explores monotonicity of particular values of δ for particular φ(p). Torsney [14] also establishes a sufficient condition for monotonicity of f(d) = d^δ, δ = 1/(t + 1), when the criterion φ(p) is homogeneous of degree -t, t > 0, with positive derivatives, and proves this condition to be true in the case of linear design criteria such as the c-optimal and the A-optimal criteria, for which t = 1, so that δ = 1/2. In other cases the value δ = 1 can be shown to yield an EM algorithm, which is known to be monotonic and convergent; see [13]. Beyond this there are minimal results on convergence, although this will depend on the choice of f(.) and of parameters like δ. See [11] for some empirical results. In principle the choice of f(.) is arbitrary, but objective bases for choices are addressed in the formal properties now listed.
Torsney and Mandal [18] consider various choices of h(x), including h(x) = 2H(δx), where δ is a positive parameter and H(.) is a cumulative distribution function such that H(0) = 1/2. Here we opt for H(.) = Φ(.), so that the iterations prove to be:
Example 1:
In this case J = 8 coffee types were compared through 26 pairwise comparisons on each pair, yielding a total of N = 728 observations; i.e. ΣΣ O_ij = 728. A suitable δ is δ = 1/N. In effect we are standardising through replacing observed by relative frequencies in the log-likelihood, and then taking δ = 1. Starting from p_j^(0) = 1/J, the numbers of iterations needed to achieve max|d_j| = max|F_j| ≤ 10^{-n}, n = 0, 1, ..., 7 respectively are 17, 21, 25, 32, 38, 45, 51, 59. The optimal p* is: (0.190257, 0.122731, 0.155456, 0.106993, 0.091339, 0.149406, 0.080953, 0.102865). Iterations were monotonic.
Example 2:
In this example J = 9 quality of life dimensions were compared in pairs by each of 50 patients with early signs of rheumatoid arthritis (RA). The 9 dimensions were: ability to physically function, pain, stiffness, ability to work, fatigue, depression, interference with social activities, side effects, and financial burden. This data arose from the Consortium of Practicing Rheumatologists' long-term observational multi-center study of early severe RA. Patients entered in this additive cohort had less than 1 year of symptom onset. The responses were obtained at their first telephone interview. Formed in 1992, the Consortium prospectively followed them to delineate early outcomes and factors, such as treatment, functional, radiographic, psychosocial, and economic outcomes. Data on disease severity, functional status, psychosocial health, cost, radiographic damage, laboratory serologies and acute phase reactants were recorded at baseline and at 6 months, 1 year, and annually thereafter. As a chronic illness, RA impacts every dimension of quality of life. Even among RA patients, however, differences in life situations, clinical presentation, and disease course can be striking, leading to varying patient rankings of the importance of different disease and life factors. The 9 factors were selected to represent aspects of RA that patients could easily identify and compare.
There were a total of N = 1800 comparisons; i.e. ΣΣ O_ij = 1800. In 8 cases there were ties. These were split 50:50 between the relevant treatments. Again a suitable δ is δ = 1/N. Starting from p_j^(0) = 1/J, the numbers of iterations needed to achieve max|d_j| = max|F_j| ≤ 10^{-n}, n = 0, 1, ..., 6 respectively are 28, 42, 56, 69, 84, 96, 110. The optimal p* is: (0.265361, 0.172154, 0.151644, 0.059151, 0.123506, 0.030753, 0.037740, 0.055038, 0.104653), the order of the components corresponding to the order of the dimensions as listed above. Iterations were monotonic.
There is a further issue here. These 1800 responses have been obtained from only 50 patients. Each patient has responded on each pairwise comparison. We have assumed independence between the resulting 36 observations. Dittrich et al [3] also contemplate this 'independent decisions' model, an independence which allows for inconsistent responses by a patient. However they extend it to a 'dependent decisions' model. For an individual patient's comparison of T_i and T_j let Y_ij = 1 if he/she records that T_i is preferred to T_j and Y_ij = 0 otherwise. In the case of three dimensions their model is:
p_klm = α_k β_l γ_m
i.e.
Fitting Bradley Terry Models using a multiplicative algorithm 521
Notes:
3. There are extensions of the Bradley Terry model which allow for
interactions and the above iterations can be extended to these too.
For example a model including an interaction between brew strength
and roast colour corresponds to:
$$p_{klm} = \alpha_k \beta_l \gamma_m (\alpha\beta)_{kl}$$
i.e.
$$\ln(p_{klm}) = \ln(\alpha_k) + \ln(\beta_l) + \ln(\gamma_m) + \ln((\alpha\beta)_{kl})$$
where $(\alpha\beta)_{kl} > 0$.
The likelihood is now additionally homogeneous in two sets of respects;
namely, it is invariant to proportional changes in the terms $(\alpha\beta)_{kl}$ when
the constant of proportionality either varies with $\alpha$ or with $\beta$. Several
sets of consistent constraints are needed. One possibility is the set
Each of (ii) and (iii) lead to likelihoods which are homogeneous of degree
zero in the $p_i$'s. Also note that $\{A/(A + qB)\} = \{r_1 A/(r_1 A + r_2 B)\}$, where
$r_1 = q^{-1/2}$ and $r_2 = 1/r_1$. This is homogeneous of degree zero in $r_1$ and
$r_2$. Hence we could impose the constraint $r_1 + r_2 = 1$. However $r_1 \ge r_2$.
A further transformation is $s_1 = r_1 - r_2$, $s_2 = 2r_2$. Now the constraints are
$s_1, s_2 > 0$, $s_1 + s_2 = 1$. We can now maximise the likelihood with respect to
two distributions using our family of algorithms. To determine q we need to
re-scale to $r_1 r_2 = 1$.
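The transformation just described can be sketched numerically. Assuming the mapping $s_1 = r_1 - r_2$, $s_2 = 2r_2$ with $r_1, r_2$ proportional to $q^{-1/2}, q^{1/2}$, the tie parameter q is recovered from the optimised simplex variables via the scale-invariant ratio $r_2/r_1$; the value of q below is invented for the round-trip check.

```python
def q_from_s(s1, s2):
    """Recover the tie parameter q from the simplex variables
    s1 = r1 - r2 and s2 = 2*r2; the ratio r2/r1 is scale invariant."""
    r2 = s2 / 2.0
    r1 = s1 + r2
    return r2 / r1      # r1, r2 proportional to q**-0.5, q**0.5, so ratio = q

# round trip: start from a known q, impose r1 + r2 = 1, recover q
q_true = 0.4                        # invented value; q <= 1 so that r1 >= r2
r1, r2 = q_true ** -0.5, q_true ** 0.5
scale = 1.0 / (r1 + r2)             # impose the constraint r1 + r2 = 1
r1, r2 = r1 * scale, r2 * scale
s1, s2 = r1 - r2, 2.0 * r2          # satisfies s1, s2 > 0, s1 + s2 = 1
```

Because the ratio $r_2/r_1$ is unchanged by the common scaling, q can be read off directly, and rescaling to $r_1 r_2 = 1$ recovers $r_1 = q^{-1/2}$, $r_2 = q^{1/2}$ exactly.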
Kuk [6] considers applications to the outcome of football matches and extends
the model to include two sets of the parameters $\{p_i\}$ and two sets of the
parameters $\{q_j\}$, one each for 'home' and 'away' games. The likelihood is
homogeneous of degree zero in the two sets of $p_i$'s as a whole and in three
sets of variables which are based on transformations of the $q_j$'s similar to
that defining $r_1$ and $r_2$ above. Thus we wish to maximise the likelihood with
respect to four distributions.
8 Discussion
The primary focus of this paper is one of cross fertilisation, an arguably
somewhat limited even simple minded one. It is to point out that a class
of maximum likelihood estimation problems could be attacked using tools
for solving optimal design problems because in each case one or several sets
of optimising weights or distributions are sought. Hence the equivalence
theorems characterising optimality in the optimal design arena and related
algorithms can be transported over to the parameter estimation arena. This
is one new contribution of this work. One other is using a new version of the
above mentioned algorithms, one which can accommodate negative derivatives.
References
[1] Bradley R.A., Terry M.E. (1952). The rank analysis of incomplete block
designs I, The method of paired comparisons. Biometrika 39, 324-345.
[2] Davidson R.R. (1970). On extending the Bradley Terry model to accommodate
ties in paired comparisons experiments. J. Am. Statist. Ass. 65,
317-328.
[3] Dittrich R., Hatzinger R., Katzenbeisser W. (2002). Modelling dependencies
in paired comparisons data - a log-linear approach. Computational
Statistics & Data Analysis 40, 39-57.
[4] Henery R.J. (1992). An extension of the Thurstone-Mosteller model for
chess. Statistician 41, 559-567.
[5] Kiefer J. (1974). General equivalence theory for optimum designs (ap-
proximate theory). Annals of Statistics 2, 849-879.
[6] Kuk A.C.Y. (1995). Modelling paired comparison data with large num-
bers of draws and large variability of draw percentages among players .
Statistician 44, 523- 528.
[7] Mandal S., Torsney B. (2000). Algorithms for the construction of opti-
mising distributions . Communications in Statistics (Theory and Meth-
ods) 29, 1219-1231.
[8] Mandal S., Torsney B. (2004). Construction of optimal designs using a
clustering approach. (Under revision for J. Stat. Planning & Inf.)
[9] Mandal S., Torsney B., Carriere K.C. (2004). Constructing optimal designs
with constraints. Journal of Statistical Planning and Inference (to
appear).
[10] Rao P.V., Kupper L.L. (1967). Ties in paired comparison experiments:
a generalisation of the Bradley Terry model. J . Am. Statist. Ass. 62,
192-204.
[11] Silvey S.D., Titterington D.M., Torsney B. (1978). An algorithm for
optimal designs on a finite design space. Communications in Statistics
A 14, 1379-1389.
[12] Titterington D.M. (1976) . Algorithms for computing D-optimal designs
on a finite design space. Proc. 1976 Conf. On Information Sciences and
Systems, Dept. of Elect. Eng ., Johns Hopkins Univ. Baltimore, MD, 213
- 216.
[13] Torsney B. (1977). Contribution to discussion of 'Maximum Likelihood
Estimation via the EM Algorithm' by Dempster, Laird and Rubin. J.
Royal Stat. Soc. (B) 39, 26-27.
[14] Torsney B. (1983). A moment inequality and monotonicity of an algo-
rithm. Lecture Notes in Economics and Mathematical Systems, A.V. Fi-
acco, K.O. Kortanek (Eds.), Springer Verlag 215, 249-260.
[15] Torsney B. (1988). Computing optimizing distributions with applications
in design, estimation and image processing. In: Optimal Design and
Analysis of Experiments, Y. Dodge , V.V. Fedorov, H.P . Wynn (Eds .),
North Holland., 361 - 370.
[16] Torsney B., Alahmadi A.M. (1992). Further developments of algorithms
for constructing optimizing distributions. In: Model Oriented Data Analysis,
V. Fedorov, W.G. Muller, I.N. Vuchkov (Eds), Proceedings of
2nd IIASA-Workshop, St. Kyrik, Bulgaria, 1990, Physica Verlag, 121-
129.
[17] Torsney B., Mandal S. (2000). Construction of constrained optimal
designs. In: Optimum Design 2000, A. Atkinson, B. Bogacka,
A. Zhiglavsky (Eds), Proceedings of Design 2000, held in honour of the
60th Birthday of Valeri Fedorov, Cardiff, Kluwer, 141-152.
[18] Torsney B., Mandal S. (2004). Multiplicative algorithms for constructing
optimizing distributions. In: mODa 7 - Advances in Model Oriented Design
and Analysis, 143-150.
[19] Whittle P. (1973). Some general points in the theory of optimal experimental
design. J. Roy. Statist. Soc. B 35, 123-130.
the use of spectral analysis for identifying the transfer function coefficients Vk,
by which a dependent series Yt is related to lagged values of the explanatory
series Xt:
$$y_t = v_0 x_t + v_1 x_{t-1} + v_2 x_{t-2} + \cdots + n_t. \qquad (1)$$
Although cross-spectral analysis is based on frequency domain regression, its
results can be expressed as estimates, over an appropriate lag window, of the
transfer function coefficients. We illustrate this with a simple example, partly
to encourage the re-introduction of such methods, but also to demonstrate,
in part, why the subject moved away from them.
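The cross-spectral estimation of the transfer function coefficients $v_k$ can be illustrated with a small simulation. This is a sketch, not the authors' code: it estimates the frequency response $V(\omega) = S_{xy}(\omega)/S_{xx}(\omega)$ from Welch cross- and auto-spectra and inverts it to the lag domain; the simulated open-loop system and its coefficients are invented for illustration.

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(0)
n = 20000
x = rng.standard_normal(n)              # explanatory series x_t (white noise)
y = np.zeros(n)
y[1:] += 0.5 * x[:-1]                   # true coefficient v_1 = 0.5
y[2:] += 0.3 * x[:-2]                   # true coefficient v_2 = 0.3
y += 0.1 * rng.standard_normal(n)       # additive noise series n_t

nper = 256
_, Sxx = signal.welch(x, nperseg=nper)  # auto-spectrum of x
_, Sxy = signal.csd(x, y, nperseg=nper) # cross-spectrum of x and y
V = Sxy / Sxx                           # frequency-response estimate V(w)
v = np.fft.irfft(V, n=nper)             # lag-domain coefficients v_k
```

Index k of the inverted response corresponds to lag k (negative lags wrap around to the end of the array), which is how a two-sided estimate like that in Figure 1(c) would also appear.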
Figure l(a) shows temperatures measured every minute by sensors in the
cab and trailer of a transport vehicle . It is clear that the cab temperature lags
the trailer temperature. Figure l(b) shows the transfer function coefficients
in this relationship, as estimated by cross spectral analysis. The estimates
were produced almost automatically, with little user intervention. Limits on
the plot show that significant values are spread over lags 0 to 4, with a peak
at lag 2. This represents a one-sided, or causal, relationship, that may be
used to predict the cab temperature from the trailer temperature as shown
in Figure 2(a) .
However, the desired aim was to predict trailer temperatures from the
sensor in the cab. Figure 1(c) shows the estimated transfer function
coefficients when the roles of the series are reversed. The significant values are
spread over lags 0 to -2. The relationship is no longer causal and these co-
efficients cannot be used for prediction. But reasonable linear predictions of
the trailer temperature from the cab temperature can still be constructed, as
shown in Figure 2(b).
In general, cross-spectral estimation of prediction coefficients is limited
to one-sided or causal relationships. It can, therefore, be used successfully to
estimate input-output relationships in open loop systems, but the estimates
are distorted when applied to input-output data gathered under closed loop,
feedback control, conditions. A solution to this problem was presented by
Modelling multiple time series: achieving the aims 529
(2)
where the operator W is known as the generalised shift operator, and is
defined in terms of the backward shift operator B and a specified smoothing
coefficient, or discount factor, $\theta$, by
$$W = \frac{B - \theta}{1 - \theta B} = -\theta + (1 - \theta^2)(B + \theta B^2 + \theta^2 B^3 + \cdots). \qquad (3)$$
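A sketch of how the operator W can be applied in practice (assuming zero pre-sample values, which are not specified here): rewriting $z = Wx$ as $(1 - \theta B)z = (B - \theta)x$ gives the recursion $z_t = \theta z_{t-1} + x_{t-1} - \theta x_t$, which can be checked numerically against the series expansion in (3).

```python
import numpy as np

def apply_W(x, theta):
    """Apply the generalised shift operator W = (B - theta)/(1 - theta*B)
    via the recursion z_t = theta*z_{t-1} + x_{t-1} - theta*x_t,
    treating pre-sample values of x and z as zero."""
    z = np.zeros(len(x))
    z[0] = -theta * x[0]
    for t in range(1, len(x)):
        z[t] = theta * z[t - 1] + x[t - 1] - theta * x[t]
    return z

theta = 0.6
rng = np.random.default_rng(1)
x = rng.standard_normal(200)
z = apply_W(x, theta)

# check one value against the expansion -theta + (1-theta^2)(B + theta*B^2 + ...)
t = 150
expansion = -theta * x[t] + (1 - theta**2) * sum(
    theta ** (k - 1) * x[t - k] for k in range(1, t + 1))
```

With zero pre-sample values the recursion reproduces the truncated expansion exactly, so the two computations of $z_t$ agree to rounding error.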
(5)
where the operator Z is defined formally in terms of the Laplace (or
differential) operator s, and a decay rate constant $\kappa$, in the range $\kappa > 0$, by
$$Z = \frac{1 - s/\kappa}{1 + s/\kappa} = \frac{\kappa - s}{\kappa + s}. \qquad (6)$$
Figure 3: (a) Discrete weights for the first 5 orders of the ZAR operator,
(b) continuous weights for orders 1, 3 and 5 of the CZAR operator.
into the predictors is approximately $p(1 + \theta)/(1 - \theta)$, rather than p. In the
continuous case the effective range is approximately $2p/\kappa$.
There is no guarantee that, for a given discrete process, the choice of $\theta > 0$
will define better predictors. However, consider a continuous process $x(\tau)$
that is sampled at times $\tau = \delta t$, to give the discrete process $x_t$. Defining the
ZAR states of $x_t$ by setting $\theta = 1 - \kappa\delta$, these will converge, appropriately,
to the CZAR states of $x(\tau)$, as $\delta \to 0$. The consequence of using the simple
lagged states $x_{t-k}$, regardless of how small $\delta$ might become, would lead in the
limit to states that were equivalent to $x(\tau)$ and its derivatives to order $p-1$.
There is in general no guarantee that these would exist. That is why the
pure autoregressive model in continuous time, that uses these derivatives as
its states, is unable to approximate an arbitrary continuous time stationary
process, though the order is increased indefinitely.
For this reason, the advantage of the CZAR model, proposed in the next
section, over the standard continuous time autoregressive (CAR) model is
undeniable, in terms of empirical approximation. The success of the univariate
application of the CZAR model led us to consider the discrete ZAR
form. The foregoing argument suggests that whenever a discrete process
might be considered to be a sampled continuous process, the discrete ZAR
model should be preferred to the standard AR model, for its approximation.
The weight functions that we use to define the ZAR and CZAR states are
closely related to the respective discrete and continuous Laguerre functions,
which have the possible advantage of providing orthogonal bases of the past
and present. Partington [14] describes a variety of similar weight functions
that could be used to define a basis of the past observations of a discrete
process. Bray [6] uses a basis that differs from the Laguerre functions, but
may be orthogonalised to provide a similar basis.
Our use of the operator Z was developed from the application of the
Cayley-Hamilton transformation to reparameterisation of continuous time
models by Belcher et al. [4]. This transformation has been widely used to map
from continuous time to discrete time systems. Most famously, Wiener [19]
solved the prediction problem for continuous time series by transforming it
to that of prediction for a discrete parameter process. The exposition by
Doob [8, p. 582] sets this out clearly. The operator W may be motivated as
the discrete analogue of Z, in which the Moebius transformation of the unit
disk to itself replaces the Cayley-Hamilton transformation.
where B(t) is Brownian motion with diffusion variance $\sigma^2$. The natural,
algebraically equivalent, form of this model is
where n(t) now follows the continuous time AR(1) model, or CAR(1) model:
We describe (14) as the natural form of the model because the process
defined, for any fixed t, by
$$y_k = Z^{-k} x(t), \qquad (17)$$
is also a stationary process, and (14) is just a standard autoregressive
approximation of $y_k$. We note that (14) is equivalent to a CARMA(p, p-1)
model with moving average operator $(\kappa + s)^{p-1}$.
4 Examples
Our first example illustrates the effect on predictions of using a discrete
trivariate ZAR model for the three series of monthly flour prices that were
modelled by Tiao and Tsay [16] .
one such example, the ZAR model forecasts tend to predict better the turning
points of the irregular cyclical behaviour of the series .
The three flour price series were very similar in nature, and it is natural to
represent them by a symmetric vector autoregression. Our second example
is very different; the data arise from what is clearly an input-output sys-
tem. The rainfall is measured at two locations in a river catchment, and the
river-flow from the catchment is also measured. Figure 5 shows the hourly
measurements over a period slightly in excess of four days. The river-flow
record is much more slowly changing than the rainfall record and visual in-
spection shows that the response from input to output is spread over a period
of several hours, possibly with a range of time constants reflecting some rel-
atively rapid, and some relatively slow runoff. The objective is to use the
rainfall record to predict the river flow. The transfer function of this response
is difficult to estimate using spectral analysis because it is so dispersed over
many lags. The use of the ZAR model is appropriate here because of this
dispersed response. Using the AIC a standard AR(2) model was selected for
the three series, whereas a ZAR(6, 0.75) was selected. The choice of 0.75 for
the smoothing parameter is not critical, but was chosen because the low
frequency delay in the W operator is about 1.75/0.25 = 7 hours.
Figure 6: Predictions of the river flow (solid line) using different models and
information: (a) the dotted line shows predictions using a trivariate AR(3)
model, based on river flow information up to hour 20, and full knowledge of
the rainfall throughout the record, (b) similar predictions using a trivariate
ZAR(3,0.75) model, (c) predictions (broken line) are obtained as in (b),
except that the known rainfall is used only up to hour 50, and thereafter all
the series are predicted: the dotted lines show 90% probability limits for the
forecasts.
These are very close to the actuality. Figure 6(c) is constructed using the
same ZAR(3,0.75) model, but no observations of rainfall or river-flow are used
beyond hour 50. The prediction limits are shown on this figure and rapidly
widen beyond that hour, but they provide a realistic and useful bound on
the peak river flow many hours later. The last 20 observations were not used
in model estimation, so their predictions are genuinely out-of-sample. In this
example the ZAR model reveals its potential.
Our first example of the CZAR model relates to discrete time series
with different, and varying, sampling intervals. Figure 7(a) shows monthly
Claimant Count (CC) figures that have long been used as a measure of
unemployment. A more recent measure of unemployment has been the Labour
Force Survey (LFS) estimate, which is shown in the same figure. The LFS
estimate was recorded annually, then quarterly. In the figure, the quarterly
measurements have been interpolated monthly. These series were analysed
by Harvey and Chung [9], in which one of the aims was to estimate the slope
of the LFS series by using a bivariate model to 'borrow' information from the
more frequently observed CC series. A continuous time model is natural for
such series, and we estimated the bivariate CZAR(2,0.5) model. We report
the use of this model for slope estimation in Morton and Tunnicliffe Wilson
[12]. Here, we illustrate its application to prediction. Figure 7(b) shows
forecasts of the LFS unemployment and their error limits obtained from this
model. The bivariate model enables good monthly forecasts to be produced,
from a point where only 8 annual values have been recorded.
Our final example is a bivariate model of data which is truly sampled
irregularly. Kirchner and Weil [11] present a compendium of marine fossil
records which indicate the pattern of extinctions and originations of marine
animals over the past 545 million years (Myrs). The records are arranged
into 108 stratigraphic intervals which vary in length from 2.5 to 12.5 Myrs,
Figure 7: (a) The Claimant Count (solid line) unemployment series, and the
Labour Force Survey (small circles) unemployment series, (b) the Labour
Force Survey series (solid line) with forecasts and forecast error limits (broken
lines).
(a) (b)
Ir~~
:B,"~ .. ----;;;;,~~
'.=-=;~~,£; ""~
""~,,,~,,,, ...,----;;.~;--;;.:.;;----:.J
;:;;-----:*'
Myr before present Lag in Myr
Figure 8: (a) The series of originations and extinctions of genera, (b) the
estimated lagged cross-correlation function between these series.
and for each of these the number of families and genera of marine animals to
appear and disappear is documented.
The objective is to investigate the relationship between the series, and
in particular, the recovery of species following mass extinctions. Figure 8(a)
shows the series of genera. We fitted a bivariate CZAR(5,0.5) model to the
logarithms of these series. Figure 8(b) shows the cross-correlation function
derived from this model. The peak is at the lag of 16 Myr, which is similar
to that obtained by Kirchner and Weil using other methods.
References
[1] Akaike H. (1973). A new look at statistical model identification. IEEE
Transactions on Automatic Control AC-19, 716-723.
[2] Akaike H., Nakagawa T. (1988). Statistical Analysis and Control of
Dynamic Systems, Kluwer, Dordrecht.
538 Granville Tunnicliffe-Wilson and Alex Morton
$$\eta = \beta_1 \xi_1 + \cdots + \beta_p \xi_p \qquad (1)$$
where $\xi_1, \ldots, \xi_p$ and $\eta$ denote the variables and $\beta = [\beta_1, \ldots, \beta_p]^T \in \mathbb{R}^p$ plays
the role of a parameter vector that characterizes the specific system. A basic
problem of applied mathematics is to determine an estimate of the true but
unknown parameters from certain measurements of the variables. This gives
rise to an overdetermined set of n linear equations (n > p):
$$X\beta \approx y \qquad (2)$$
where the ith row of data matrix $X \in \mathbb{R}^{n \times p}$ and vector $y \in \mathbb{R}^n$ contain
respectively the measurements of the variables $\xi_1, \ldots, \xi_p$ and $\eta$.
In the classical least squares approach, as commonly used in ordinary
regression, the measurements X of the variables ~i are assumed to be free
of error and hence, all errors are confined to the observation vector y. How-
ever, this assumption is frequently unrealistic: sampling errors, human errors,
modeling errors and instrument errors may imply inaccuracies of the data
matrix X as well. One way to take errors in X into account is to introduce
perturbations also in X. Therefore, the following TLS problem was intro-
duced in the field of computational mathematics [14], [15] (R(X) denotes the
range of X and IIXIIF its Frobenius norm [16]):
$$\min_{\hat{\Delta}, \hat{\epsilon}} \|[\hat{\Delta}\ \hat{\epsilon}]\|_F \quad \text{subject to} \quad y + \hat{\epsilon} \in R(X + \hat{\Delta}). \qquad (3)$$
Any $\hat{\beta}$ satisfying $(X + \hat{\Delta})\hat{\beta} = y + \hat{\epsilon}$ is called a TLS solution and $[\hat{\Delta}\ \hat{\epsilon}]$ the corresponding TLS correction.
Total least squares and errors-in-variables modeling 541
This paper is organized as follows. Section 2 describes the univariate EIV
regression problem from a statistical point of view. Section 3 then formulates
the TLS problem from a computational point of view and shows the
relationship with univariate EIV regression. Next, Section 4 presents the SVD based
basic TLS algorithm, while Section 5 describes major properties of the TLS
approach. Furthermore, extensions of the technique are discussed in Section
6 while Section 7 overviews the many applications of TLS in engineering
fields. Finally, Section 8 gives the conclusions.
where the independent variable $\xi_i$ is either fixed or random and the error $\epsilon_i$
has zero mean and is uncorrelated with $\xi_i$.
The unknown intercept $\beta_0$ and slope $\beta_1$ are usually estimated using
a Least-Squares (LS) approach for reasons of computational efficiency.
$$\eta_i = \beta_0 + \beta_1 \xi_i, \quad i = 1, \ldots, n \qquad (5)$$
however, one observes $(x_i, y_i)$, $i = 1, \ldots, n$, which are the true variables plus
additive errors $(\delta_i, \epsilon_i)$:
$$x_i = \xi_i + \delta_i, \quad y_i = \eta_i + \epsilon_i \qquad (6)$$
Assume that $\delta_i, \epsilon_i$, $i = 1, \ldots, n$, all have finite variances, zero mean
(without loss of generality), and are uncorrelated, i.e., $E(\delta_i) = E(\epsilon_i) = 0$,
$\mathrm{cov}(\delta_i, \epsilon_j) = 0$ for all $i, j$, $\mathrm{var}(\delta_i) = \sigma_\delta^2$, $\mathrm{var}(\epsilon_i) = \sigma_\epsilon^2$ for all $i$,
$\mathrm{cov}(\delta_i, \delta_j) = \mathrm{cov}(\epsilon_i, \epsilon_j) = 0$ for all $i \neq j$. Depending on the assumption
about $\xi_i$, three different models are defined. If the $\xi_i$ are unknown constants,
then the model is known as a functional relationship. If the $\xi_i$ are independent
identically distributed (i.i.d.) random variables and independent of the errors,
the model is called a structural relationship and we have: $E(\xi_i) = \mu$ and
$\mathrm{var}(\xi_i) = \sigma^2$. A generalization of both models is the ultrastructural relationship which
A generaliza t ion of both models is the ultrastructural relati onship which
542 Sabine Van Huffel
assumes that the $\xi_i$ are independent random variables but not identically
distributed, i.e. having possibly different means $\mu_i$ and common variance $\sigma^2$.
EIV regression looks like standard regression if one rewrites Eqs. (5)-(6) as
$$y_i = \beta_0 + \beta_1 x_i + \zeta_i, \qquad \zeta_i = \epsilon_i - \beta_1 \delta_i \qquad (7)$$
However, this is not the usual regression model: $x_i$ is random and is correlated
with the error term $\zeta_i$: $\mathrm{cov}(x_i, \zeta_i) = -\beta_1 \sigma_\delta^2$. This covariance is only zero when
$\sigma_\delta^2 = 0$, which is the regression model, or when $\beta_1 = 0$, which is the trivial
case. If one attempts to use ordinary regression estimates (least squares) on
EIV regression modeled data, one obtains inconsistent estimates.
The seemingly minor change between model (4) and model (5)-(6) has
important practical and theoretical consequences. One of the most important
differences between both models concerns model identifiability. It is
common to assume that all random variables in the EIV regression model
are jointly normal. In this case, the structural and functional model are not
identifiable [7]. Side conditions need to be imposed, the most common of
which are the following: (1) the ratio of the error variances, $\lambda \equiv \sigma_\epsilon^2/\sigma_\delta^2$, is
known; (2) $\sigma_\delta^2$ is known; (3) $\sigma_\epsilon^2$ is known; (4) both of the error variances, $\sigma_\delta^2$
and $\sigma_\epsilon^2$, are known. The first assumption is the most popular and is the one
with the most published theoretical results, dating back to Adcock [2], [3]. It
also leads to the commonly known Orthogonal Regression (OR) estimator.
Indeed, if $\lambda$ is known, the data can be scaled so that $\lambda = 1$. In this case,
the maximum likelihood solution of the normal EIV regression problem is
OR, which minimizes the sum of squares of the orthogonal distances from
the data points to the regression line instead of the sum of squares of the
vertical distances, as in standard regression (see Figure 1).
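The OR estimator with known error-variance ratio has a well-known closed form in terms of the sample moments. The following sketch (with invented data) uses it, and also checks the symmetry property that interchanging x and y inverts the estimated slope when the error variances are equal.

```python
import numpy as np

def orthogonal_regression(x, y, lam=1.0):
    """ML fit of the normal EIV regression line when lam = var_eps/var_delta
    is known; lam == 1 gives orthogonal regression."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    sxx = np.var(x)                  # (1/n) * sum (x_i - xbar)^2
    syy = np.var(y)
    sxy = np.mean((x - x.mean()) * (y - y.mean()))
    b1 = (syy - lam * sxx
          + np.sqrt((syy - lam * sxx) ** 2 + 4 * lam * sxy ** 2)) / (2 * sxy)
    b0 = y.mean() - b1 * x.mean()
    return b0, b1

rng = np.random.default_rng(2)
xi = rng.uniform(0, 10, 500)                 # true (unobserved) x values
x = xi + 0.5 * rng.standard_normal(500)      # equal error variances: lam = 1
y = 1.0 + 2.0 * xi + 0.5 * rng.standard_normal(500)
b0, b1 = orthogonal_regression(x, y)
```

Unlike ordinary LS, which attenuates the slope toward zero when x carries error, the OR estimate is consistent here, and fitting x on y returns exactly the reciprocal slope.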
$$\min_{\beta_0, \beta_1} \sum_{i=1}^{n} \frac{(y_i - \beta_0 - \beta_1 x_i)^2}{\lambda + \beta_1^2} \qquad (9)$$
This minimization problem is solved when $\lambda$ is known or both $\sigma_\epsilon^2$ and $\sigma_\delta^2$ are
known. If $\lambda = 1$, the denominator reduces to $1 + \beta_1^2$ and amounts to
orthogonal regression. Weighted least squares has drawn much attention in the
literature; see [7] for references. Since Sprent [28], the name has standardized
to generalized least squares. The success of generalized LS might give the
impression that it is the LS method for the EIV regression model. Since
generalized LS estimation only works for the no-equation-error model with the
error covariance matrix known up to a scalar multiple, a unified approach
for modifying LS to suit all different assumptions on the error covariance
structure is called for. Modified LS is such an approach. The normality
assumption on the errors (and on the true variables for the structural and
ultrastructural relationships) is not needed, only the existence of second
moments. From Eq. (7) it is clear that the $\zeta_i$ are i.i.d. random variables with zero
mean and variance $\sigma_\epsilon^2 + \sigma_\delta^2 \beta_1^2$ regardless of the type of relationship. Cheng [7]
developed modified LS estimators for $\beta_0$ and $\beta_1$ by minimizing an unbiased
and consistent estimator of the appropriate unknown error variance. The
estimators are a function of the residuals. Assuming $\lambda$ known, an appropriate
modified LS estimator for the unknown error variance $\sigma_\delta^2$ is obtained by
minimizing
$$\frac{s_{yy} - 2\beta_1 s_{xy} + \beta_1^2 s_{xx}}{\lambda + \beta_1^2} \qquad (10)$$
with $s_{xx} = \frac{1}{n}\sum(x_i - \bar{x})^2$, $s_{yy} = \frac{1}{n}\sum(y_i - \bar{y})^2$ and
$s_{xy} = \frac{1}{n}\sum(x_i - \bar{x})(y_i - \bar{y})$
the sample variances and covariance. In summary, the statistical approach
seeks estimators of the EIV regression model with optimal statistical properties
(such as maximum likelihood, unbiasedness, consistency, etc.), mostly
reflecting asymptotic behaviour as $n \to \infty$. If $p > 1$ explanatory variables
$\xi$ are considered, the problem formulation can be extended but the estimator
$\hat{\beta}$ of dimension p can no longer be found analytically, as derived above,
but via an eigenvalue-eigenvector approach [12, 13] or an SVD approach (see
further).
$$\min_{\delta_i, \epsilon_i, \beta_0, \beta_1} \sum_{i=1}^{n} (\delta_i^2 + \epsilon_i^2) \quad \text{subject to} \quad \beta_0 + (x_i - \delta_i)\beta_1 = y_i - \epsilon_i, \quad i = 1, \ldots, n \qquad (14)$$
This approach is called mixed LS-TLS because the underlying relationship
between the true variables is equivalent with
is numerically more robust in the sense of algorithmic implementation.
Furthermore, the TLS algorithm computes the minimum norm solution (called
minimum norm TLS) whenever the TLS problem lacks a unique minimizer.
These extensions are not considered by Gleser.
In engineering fields, e.g., experimental modal analysis, the TLS technique
(more commonly known as the $H_v$ technique), was also introduced about
20 years ago [21]. In the field of system identification, Levin [22] first studied
the problem. His method, called the eigenvector method or Koopmans-Levin
method [10], computes the same estimate as the TLS algorithm whenever
the TLS problem has a unique solution. Compensated least squares was yet
another name arising in this area: this method compensates for the bias in
the estimator, due to measurement error, and is shown to be asymptotically
equivalent to TLS [31]. Furthermore, in the area of signal processing, the
minimum norm method was introduced and shown to be equivalent to
minimum norm TLS [9]. Finally, the TLS approach is tightly related to the
maximum likelihood Principal Component Analysis (PCA) method used in
chemometrics [36].
$$[X\ y] = U \Sigma V^T \qquad (16)$$
where $U = [u_1, \ldots, u_n]$, $u_i \in \mathbb{R}^n$, $U^T U = I_n$ and $V = [v_1, \ldots, v_{p+1}]$, $v_i \in
\mathbb{R}^{p+1}$, $V^T V = I_{p+1}$ contain respectively the left and right singular vectors,
and $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_r)$, $r = \min\{n, p+1\}$, $\sigma_1 \ge \cdots \ge \sigma_r \ge 0$, are the
singular values in decreasing order of magnitude.
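The SVD-based basic TLS solution can be sketched as follows; this is a minimal illustration with simulated data, not the full algorithm of Section 4. The estimate is read off the right singular vector belonging to the smallest singular value of the augmented matrix [X y].

```python
import numpy as np

def tls(X, y):
    """Basic TLS via the SVD of [X y]: beta_j = -v_{j,p+1} / v_{p+1,p+1},
    with v_{p+1} the right singular vector of the smallest singular value."""
    p = X.shape[1]
    _, _, Vt = np.linalg.svd(np.column_stack([X, y]), full_matrices=False)
    v = Vt[-1]                     # v_{p+1} (sign ambiguity cancels in the ratio)
    return -v[:p] / v[p]

rng = np.random.default_rng(3)
n, p = 300, 2
Xtrue = rng.standard_normal((n, p))              # error-free variables
beta_true = np.array([1.5, -0.7])
X = Xtrue + 0.1 * rng.standard_normal((n, p))    # i.i.d. equally sized errors
y = Xtrue @ beta_true + 0.1 * rng.standard_normal(n)
beta_tls = tls(X, y)
```

Because the errors on X and y here are i.i.d. and equally sized, this is exactly the setting in which the TLS estimate is a consistent estimate of the true parameters.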
$$[\Delta X\ \Delta y] = -\sigma_{p+1} u_{p+1} v_{p+1}^T \qquad (17)$$
5 TLS properties
Under specific conditions, the TLS solution, as introduced in numerical anal-
ysis, computes optimal parameter estimates in models with only measurement
548 Sabine Van Huffel
From a numerical analyst's point of view, these formulas tell us that the
TLS solution is more ill-conditioned than the LS solution since it has a higher
condition number. This implies that errors in the data more likely affect the
TLS solution than the LS solution. This is particularly true under worst case
perturbations. Hence, TLS can be considered as a kind of deregularizing
procedure. However, from a statistical point of view, these formulas tell us
that TLS is doing the right thing in the presence of i.i.d. equally sized errors
since it removes (asymptotically) the bias by subtracting the error covariance
matrix (estimated by $\sigma_{p+1}^2 I$) from the data covariance matrix $X^T X$.
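The bias-removal interpretation can be checked numerically: the TLS solution also satisfies the de-biased normal equations $(X^T X - \sigma_{p+1}^2 I)\hat{\beta} = X^T y$, with $\sigma_{p+1}$ the smallest singular value of [X y]. A small sketch with invented data:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 3
X = rng.standard_normal((n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.2 * rng.standard_normal(n)

Z = np.column_stack([X, y])
_, s, Vt = np.linalg.svd(Z, full_matrices=False)
beta_svd = -Vt[-1, :p] / Vt[-1, p]     # SVD form of the TLS solution

sig2 = s[-1] ** 2                      # sigma_{p+1}^2 of [X y]
# subtract the estimated error covariance before solving the normal equations
beta_ne = np.linalg.solve(X.T @ X - sig2 * np.eye(p), X.T @ y)
```

The identity is algebraic, not asymptotic: it follows from $[X\ y]^T[X\ y] v_{p+1} = \sigma_{p+1}^2 v_{p+1}$ with $v_{p+1}$ proportional to $[\hat{\beta}^T\ {-1}]^T$, so the two computations agree to rounding error.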
Secondly, while LS minimizes a sum of squared residuals, TLS minimizes
a sum of weighted squared residuals, expressed as follows:
$$\text{LS:} \quad \min_{\beta}\ (y - X\beta)^T (y - X\beta) \qquad (22)$$
$$\text{TLS:} \quad \min_{\beta}\ \frac{(y - X\beta)^T (y - X\beta)}{1 + \beta^T \beta} \qquad (23)$$
From a numerical analyst's point of view, we say that TLS minimizes the
Rayleigh quotient. From a statistical point of view, we say that we weight
the residuals by multiplying them with the inverse of the corresponding error
covariance matrix (up to a scaling factor) to derive consistent estimates.
Other properties of TLS, which were studied in the field of numerical
analysis, are its sensitivity in the presence of errors on all data [33]. Differences
between the LS and TLS solution are shown to increase when the ratio
$\sigma_p([X\ y])/\sigma_{\min}(X)$ is growing. This is the case when the set of equations
$X\beta \approx y$ becomes less compatible, when the vector y is growing in length and
when X tends to be rank-deficient. Assuming i.i.d. equally sized errors, the
improved accuracy of the TLS solution compared to that of LS is maximal
when the orthogonal projection of y is parallel to the pth singular vector
of X, corresponding to $\sigma_{\min}(X)$. Additional algebraic connections and
sensitivity properties of the TLS and LS problem, as well as many more statistical
properties of the TLS estimators, based on knowledge of the distribution of
the errors in the data, have been described; see [33], [34] for an overview.
6 TLS extensions
The statistical model that corresponds to the basic TLS approach is the no-
equation-error EIV regression model with the restrictive condition that the
measurement errors on the data are i.i.d. with zero mean and common error
covariance matrix, equal to the identity matrix up to an unknown scalar.
Most published TLS algorithms just handle this case while other more useful
EIV regression estimators did not receive enough attention in computational
mathematics. To relax these restrictions, several extensions of the TLS prob-
lem have been investigated. In particular, the mixed LS-TLS problem for-
mulation allows to extend consistency of the TLS estimator in EIV models,
where some of the variables ~i are measured without error. The data least
squares problem refers to the special case in which all variables except 1] are
measured with error and was introduced in the field of signal processing by
DeGroat and Dowling [8] in the mid nineties. Whenever the errors are in-
dependent but unequally sized, weighted TLS problems should be considered
using appropriate diagonal scaling matrices in order to maintain consistency.
If, additionally, the errors are also correlated, then the generalized TLS prob-
lem formulation allows to extend consistency of the TLS estimator in EIV
models, provided the corresponding error covariance matrix is known up to
a factor of proportionality (see definition 7). More general problem formula-
tions, such as restricted TLS, which also allow the incorporation of equality
constraints, have been proposed, as well as equivalent problem formulations
using other Lp norms and resulting in the so-called Total Lp approximations
(see [33] for references). The latter problems proved to be useful in the pres-
ence of outliers. Robustness of the TLS solution is also improved by adding
regularization, resulting in the regularized TLS methods [11], [27], [35] . In
addition, various types of bounded uncertainties have been proposed in order
to improve robustness of the estimators under various noise conditions and
algorithms are outlined [34], [35].
Furthermore, constrained TLS problems have been formulated. Arun [5]
addressed the unitarily constrained TLS problem, i.e., $XB \approx Y$, subject
to the constraint that the solution matrix B should be unitary. He proved
that this solution is the same as the solution to the orthogonal Procrustes
problem [16, p.582]. Abatzoglou et al [1] considered yet another constrained
TLS problem, which extends the classical TLS problem (3) to the case where
the errors $[\Delta\ E]$ in the data [X y] are algebraically related. However, if there
is a linear dependence among the error entries in $[\Delta\ E]$, then the TLS solution
no longer has optimal statistical properties (e.g. maximum likelihood in case
of normality) . This happens, for instance, in dynamic system modeling, e.g.,
in system identification when we try to estimate the impulse response of
a system from its input and output by discrete deconvolution. In these so-
called structured TLS problems, the data matrix [X y] is structured, typically
block Toeplitz or Hankel. In order to preserve maximum likelihood properties
and consistency of the solution [1], [18], the TLS problem formulation, given
in definition 1, must be extended with the additional constraint that any
(affine) structure of X or [X y] must be preserved in ~ or [~ ?], where
~ and ? are chosen to minimize the error in the discrete L 1 , L2 and L oo
norm. For L2 norm minimization, various computational algorithms have
been presented, as surveyed in [34], [35], and shown to reduce the computation
time by exploiting the matrix structure in the computations. In addition, it
is shown how to extend the problem and solve it, if latency or equation errors
are included. Recently, robustness of the structured TLS solution has been
improved by adding regularization, see e.g. [25] .
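As an aside, the unitarily constrained problem mentioned above has the classical SVD-based closed-form solution of the orthogonal Procrustes problem [16]. A minimal NumPy sketch of that solution (the function name is ours, and only the real orthogonal case is shown, not the complex unitary one):

```python
import numpy as np

def orthogonal_procrustes(X, Y):
    """Solve min_B ||X B - Y||_F subject to B orthogonal.

    Classical SVD-based solution: if X^T Y = U S V^T,
    then B = U V^T, the orthogonal polar factor of X^T Y.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Sanity check: if Y is an exact rotation of X, that rotation is recovered.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))  # random orthogonal matrix
B = orthogonal_procrustes(X, X @ Q)
```

The returned matrix U V^T is the orthogonal factor of the polar decomposition of X^T Y, which is unique whenever X^T Y is nonsingular.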
Yet another important extension is the elementwise-weighted TLS (EW-
TLS) estimator, which computes consistent estimates in linear EIV models,
Total least squares and errors-in-variables modeling 551
where the measurement errors are elementwise differently sized or, more
generally, where the corresponding error covariance matrices may differ from
row to row. Some of the variables are allowed to be exactly known
(observable) [19], [35]. Mild conditions for weak consistency of the EW-TLS
estimator are given and an iterative procedure to compute it is proposed.
Finally, we mention the important extension to nonlinear EIV models,
nicely studied in the book of Carroll, Ruppert and Stefanski [6]. In these
models, the relationship between the variables ξ and η is assumed to be
nonlinear. It is important to notice here that the close relationship between
nonlinear TLS and EIV ceases to exist. Indeed, consider the bilinear EIV
model XBG ≈ Y, in which X, G, and Y are affected by measurement errors.
Applying TLS to this model leads to the following bilinear TLS problem:
The well-known Kalman filtering has been extended to the errors-in-variables
context, in which noise on the inputs as well as on the outputs is taken into
account, thereby improving the filtering performance. In the field of signal
processing, in particular in-vivo magnetic resonance spectroscopy and audio
coding, new state-space based methods have been derived by making use of the
TLS approach for spectral estimation, with extensions to decimation and
multichannel data quantification. In addition, it has been shown how to extend
the least mean squares (LMS) algorithm to the EIV context for use in adaptive
signal processing and various noise environments. Finally, TLS applications
also emerge in other fields, including information retrieval, image
reconstruction, multivariate calibration, astronomy, and computer vision. It is
shown in [35] how the TLS approach and its generalizations, including
structured, regularized and generalized TLS, can be successfully applied.
This list of applications of TLS and EIV modeling is certainly not
exhaustive and clearly illustrates the increased interest in TLS and EIV
modeling in engineering over the past 20 years.
8 Conclusions
The basic principle of TLS is that the noisy data [X y], while not satisfying
a linear relation, are modified with minimal effort, as measured by the
Frobenius norm, into a 'nearby' matrix [X̂ ŷ] which is rank-deficient so that
the set X̂β = ŷ is compatible. This matrix [X̂ ŷ] is a rank-one modification
of the data matrix [X y]. The solution to the TLS problem can be determined
from the SVD of the matrix [X y]. A simple algorithm outlines the computations
of the solution of the basic TLS problem. By 'basic' is meant that only one
right-hand side vector y is considered and that the TLS problem is solvable
(generic) and has a unique solution. Extensions of this basic TLS problem
are discussed. Much of the literature concerns the classical TLS problem
Xβ ≈ y, in which all columns of X are subject to errors, but more general
TLS problems, as well as other problems related to classical TLS, have been
proposed and are briefly overviewed here.
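The SVD-based computation of the basic TLS solution described above can be illustrated as follows (a minimal NumPy sketch covering only the generic, unique-solution case; the function name and the tolerance are our own):

```python
import numpy as np

def tls(X, y):
    """Basic TLS solution of X b ≈ y via the SVD of the augmented matrix [X y].

    The right singular vector of [X y] associated with the smallest singular
    value spans the approximate null space; scaling its last entry to -1
    yields the TLS estimate (generic, single right-hand side case only).
    """
    n = X.shape[1]
    _, _, Vt = np.linalg.svd(np.column_stack([X, y]))
    v = Vt[-1]                 # right singular vector for the smallest singular value
    if abs(v[n]) < 1e-12:      # last entry zero: nongeneric problem, no finite solution
        raise ValueError("nongeneric TLS problem")
    return -v[:n] / v[n]

# On exactly compatible data, TLS reproduces the exact solution.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]])
b_true = np.array([2.0, -1.0])
b = tls(X, X @ b_true)
```

With noisy data the same routine returns the estimate that perturbs [X y] minimally in Frobenius norm, as stated in the text.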
Engineering applications of the Total Least Squares (TLS) technique have
been overviewed. TLS has its roots in statistics, where it can be defined as
a special case of classical Errors-in-Variables (EIV) regression in which all
measurement errors on the data are i.i.d. with zero mean and equal variance.
Due to the development of a powerful algorithm based on the SVD in
computational mathematics, the method became very popular in engineering
applications. This is a nice example of interdisciplinary work. However,
the danger exists that researchers will focus their attention on the wrong
problems, which are either unreasonable from a statistical point of view (e.g.
biased, inconsistent, not efficient) or not practically useful from an
engineering point of view (e.g. assumptions never satisfied). This paper
invites any
reader to open the frontiers of his or her own discipline and to look over the
border into neighbouring areas, so that any engineering problem dealing with
measurement error is studied in a correct way.
References
[1] Abatzoglou T.J., Mendel J.M. and Harada G.A. (1991). The constrained
total least squares technique and its applications to harmonic superresolution.
IEEE Trans. Acoust., Speech & Signal Processing 39, 1070-1087.
[2] Adcock R.J. (1877). A problem in least squares. The Analyst 4, 183-184.
[3] Adcock R.J. (1878). A problem in least squares. The Analyst 5, 53-54.
[4] Anderson T.W. (1984). The 1982 Wald memorial lectures: Estimating
linear statistical relationships. Ann. Statist. 12, 1-45.
[5] Arun K.S. (1992). A unitarily constrained total least-squares problem in
signal-processing. SIAM J. Matrix Anal. Appl. 13, 729-745.
[6] Carroll R.J., Ruppert D. and Stefanski L.A. (1995). Measurement error
in nonlinear models. Chapman & Hall/CRC, London.
[7] Cheng C.-L. and Van Ness J.W. (1999). Statistical regression with
measurement error. Arnold, London.
[8] DeGroat R.D. and Dowling E.M. (1993). The data least squares problem
and channel equalization. IEEE Trans. Sign. Process. 41, 407-411.
[9] Dowling E.M. and DeGroat R.D. (1991). The equivalence of the total
least-squares and minimum norm methods. IEEE Trans. Sign. Process.
39, 1891-1892.
[10] Fernando K.V. and Nicholson H. (1985). Identification of linear systems
with input and output noise: the Koopmans-Levin method. IEE Proc.
D 132, 30-36.
[11] Fierro R.D., Golub G.H., Hansen P.C. and O'Leary D.P. (1997).
Regularization by truncated total least squares. SIAM J. Sci. Comp. 18,
1223-1241.
[12] Fuller W.A. (1987). Measurement error models. John Wiley, New York.
[13] Gleser L.J. (1981). Estimation in a multivariate "errors in variables"
regression model: Large sample results. Ann. Statist. 9, 24-44.
[14] Golub G.H. (1973). Some modified matrix eigenvalue problems. SIAM
Review 15, 318-344.
[15] Golub G.H. and Van Loan C.F. (1980). An analysis of the total least
squares problem. SIAM J. Numer. Anal. 17, 883-893.
[16] Golub G.H. and Van Loan C.F. (1996). Matrix computations. 3rd ed.,
The Johns Hopkins Univ. Press, Baltimore.
[17] Koopmans T.C. (1937). Linear regression analysis of economic time
series. De Erven F. Bohn, Haarlem.
[18] Kukush A., Markovsky I. and Van Huffel S. (2004). Consistency of the
structured total least squares estimator in a multivariate model. Journal
of Statistical Planning and Inference, to appear.
554 Sabine Van Huffel
[36] Wentzell P.D., Andrews D.T., Hamilton D.C., Faber K. and
Kowalski B.R. (1997). Maximum likelihood principal component analysis.
J. Chemometrics 11, 339-366.
[37] York D. (1966). Least squares fitting of a straight line. Can. J. of
Physics 44, 1079-1086.
Acknowledgement: Dr. Sabine Van Huffel is a full professor at the Katholieke
Universiteit Leuven, Belgium. Research supported by the KU Leuven research
council (GOA-Mefisto 666), the Flemish Government (FWO projects
G.0078.01, G.0269.02, G.0270.02, research communities ICCoS, ANMMM),
and the Belgian Federal Government (IUAP V-22).
Address: S. Van Huffel, Katholieke Universiteit Leuven, Department of
Electrical Engineering, Division ESAT-SCD, Kasteelpark Arenberg 10,
3001 Leuven, Belgium
E-mail: sabine.vanhuffel@esat.kuleuven.ac.be
Author Index
Abbas I. 1519
Achcar J.A. 581
Acosta L. 1551
Adachi K. 589
Aguilera A.M. 997
Ait-Kaci S. 737
Ali A.A. 37
Almeida R. 597
Amari S. 49
Ambroise Ch. 1759
Amendola A. 605
An H. 1397
Ando T. 1309
Aoki S. 1179
Araki Y. 613
Arcos A. 1085
Arhipov S. 621
Arhipova I. 629
Aria M. 1807
Arnaiz J.A. 1519
Arteche J. 637
Artiaga R. 1569
Artiles J. 1733
Atkinson A. 405
Atkinson R.A. 113
Balina S. 629
Banks D. 251
Bartkowiak A. 647
Bastien P. 655
Bayraksan G. 663
Beran R. 671
Bertail P. 679
Betinec M. 689
Biffignandi S. 697
Binder H. 705
Bognar T. 713
Bouchard G. 721
Boudou A. 737
Boukhetala K. 737, 1577
Bourdeau M. 417
Braverman A. 61
Brewer M.J. 745
Brys G. 753
Buckley F. 1677
Buja A. 477
Burdakov O. 761
Cao R. 1569
Caragea D. 823
Cardot H. 769, 777
Carne X. 1519
Carr D.B. 73
Casanovas J. 1519
Caumont O. 737
Ceranka B. 785
Chauchat J.-H. 1245
Chen C.-H. 85
Choulakian V. 793
Chretien S.B. 799
Christodoulou C. 807
Church K.W. 381
Clemençon S. 679
Cleroux R. 1393
Cobo E. 1519
Coifman R.R. 381
Conversano C. 815, 1807
Cook D. 823, 1397
Corset F. 799
Costanzo G.D. 831
Crambes Ch. 769
Cramer K. 101
Crane N.I. 1783
Critchley F. 113
Croux C. 839
Wagner S. 1263
Wang J. 1893
Watanabe M. 1751
Waterhouse T.H. 1963
Wegman E.J. 287, 327, 381
Weihs C. 1429
Welsch R.E. 1481
Westad F. 261
Whittaker J. 935
Wilhelm A.F.X. 1971
Willems G. 1693, 1979
Wimmer G. 1987
Witkovsky V. 1987, 1995
Wu H.-M. 85
Wurtele E. 1397
Zadlo T. 2019
Zarzo M. 2027
Zuckschwerdt C. 101
COMPSTAT 2004 Section Index
Algorithms
Doray L.G., Haziza A., Minimum distance inference for Sundt's distribution 943
Grendar M., Determination of constrained modes of a multinomial distribution 1109
Gunning P., Horgan J.M., An algorithm for obtaining strata with equal coefficients of variation 1123
Klaschka J., On ordering of splits, Gray code, and some missing references 1317
Kuroda M., Data augmentation algorithm for graphical models with missing data 1385
Miwa T., A normalising transformation of noncentral F variables with large noncentrality parameters 1497
Tvrdík J., Křivý I., Comparison of algorithms for nonlinear regression estimates 1917
Witkovsky V., Matlab algorithm TDIST: The distribution of a linear combination of Student's t random variables 1995
Applications
Bognar T., Komorník J., Komorníková M., New STAR models of time series and application in finance 713
Braverman A., Kahn B., Visual data mining for quantized spatial data 61
Cardot H., Crambes Ch., Sarda P., Conditional quantiles with functional covariates: An application to ozone pollution forecasting 769
Cardot H., Faivre R., Maisongrande P., Random effects varying time regression models with application to remote sensing data 777
Chretien S., Corset F., A lower bound on inspection time for complex systems with Weibull transitions 799
Conversano C., Vistocco D., Model based visualization of portfolio style analysis 815
Bayesian Methods
Achcar J.A., Martinez E.Z., Louzada-Neto F., Binary data in the presence of misclassifications 581
Di Zio M. et al., Multivariate techniques for imputation based on Bayesian networks 927
Huskova M., Meintanis S., Bayesian like procedures for detection of changes 1221
Biostatistics
Araki Y., Konishi S., Imoto S., Functional discriminant analysis for microarray gene expression data via radial basis function networks 613
Classification
Betinec M., Two measures of credibility of evolutionary trees 689
Binder H., Tutz G., Localized logistic classification with variable selection 705
Bouchard G., Triggs B., The trade-off between generative and discriminative classifiers 721
Cook D., Caragea D., Honavar V., Visualization in classification problems 823
Croux C., Joossens K., Lemmens A., Bagging a stacked classifier 839
Cwiklinska-Jurkowska M., Jurkowski P., Effectiveness in ensemble of classifiers and their diversity 855
Dabo-Niang S., Ferraty F., Vieu P., Nonparametric unsupervised classification of satellite wave altimeter forms 879
Fung W.K. et al., Statistical analysis of handwritten Arabic numerals in a Chinese population 149
Hayashi A., Two classification methods for educational data and its application 1157
Hennig C., Classification and outlier identification for the GAIA mission 1171
Clustering
Di Zio M., Guarnera U., Rocci R., A mixture of mixture models to detect unity measure errors 919
Gibert K. et al., Knowledge discovery with clustering: Impact of metrics and reporting phase by using KLASS 1069
Grün B., Leisch F., Bootstrapping finite mixture models 1115
Jalam R., Chauchat J.-H., Dumais J., Automatic recognition of key-words using n-grams 1245
Kiers H.A.L., Clustering all three modes of three-mode data: Computational possibilities and problems 303
Krecan L., Volf P., Clustering of transaction data 1361
Leisch F., Exploring the structure of mixture model components 1405
Lipinski P., Clustering of large number of stock market trading rules 1421
Mucha H.-J., Automatic validation of hierarchical clustering 1535
Murtagh F., Quantifying ultrametricity 1561
Peña D., Rodriguez J., Tiao G.C., A general partition cluster algorithm 371
Rezankova H., Husek D., Frolov A.A., Some approaches to overlapping clustering of binary variables 1725
Same A., Ambroise Ch., Govaert G., A mixture model approach for on-line clustering 1759
Scott D.W., Outlier detection and clustering by partial mixture modeling 453
Turmon M., Symmetric normal mixtures 1909
Data Imputation
Derquenne C., A multivariate modelling method for statistical matching 895
Data Visualization
Adachi K., Multiple correspondence spline analysis 589
Arhipov S., Fractal peculiarities of birth and death 621
Bartkowiak A., Distal points viewed in Kohonen's self-organizing maps 647
Braverman A., Kahn B., Visual data mining for quantized spatial data 61
Carr D.B., Sung M.-H., Graphs for representing statistics indexed by nucleotide or amino acid sequences 73
Chen C.-H. et al., Matrix visualization and information mining 85
Cook D., Caragea D., Honavar V., Visualization in classification problems 823
Fujino T., Yamamoto Y., Tarumi T., Possibilities and problems of the XML-based graphics 1043
Hofmann H., Interactive biplots for visual modelling 223
Huh M.Y., Line mosaic plot: Algorithm and implementation 277
Kafadar K., Wegman E.J., Graphical displays of Internet traffic data 287
Katina S., Mizera I., Total variation penalty in image warping 1301
Lee E.-K. et al., GeneGobi: Visual data analysis aid tools for microarray data 1397
Swayne D.F., Buja A., Exploratory visual analysis of graphs in GGobi 477
Theus M., 1001 graphics 501
Design of Experiments
Ali A.A., Jansson M., Hybrid algorithms for construction of D-efficient designs 37
Ceranka B., Graczyk M., Chemical balance weighing designs for v + 1 objects with different variances 785
Dorta-Guerra R., Gonzalez-Davila E., Optimal 2² factorial designs for binary response data 951
Ghosh S., Computational challenges in determining an optimal design for an experiment 181
Muller W.G., Stehlik M., An example of D-optimal designs in the case of correlated errors 1543
Payne R.W., Confidence intervals and tests for contrasts between combined effects in generally balanced designs 1629
Torsney B., Fitting Bradley Terry models using a multiplicative algorithm 513
Waterhouse T.H., Eccleston J.A., Duffull S.B., On optimal design for discrimination and estimation 1963
Dimensional Reduction
Brewer M.J. et al., Using principal components analysis for dimension reduction 745
Cizek P., Robust estimation of dimension reduction space 871
Luebke K., Weihs C., Optimal separation projection 1429
Mori Y., Fueda K., Iizuka M., Orthogonal score estimation with variable selection 1527
Ostrouchov G., Samatova N.F., Embedding methods and robust statistics for dimension reduction 359
Priebe C.E. et al., Iterative denoising for cross-corpus discovery 381
Saito T., Properties of the slide vector model for analysis of asymmetry 1741
E-statistics
Fujino T., Yamamoto Y., Tarumi T., Possibilities and problems of the XML-based graphics 1043
Honda K. et al., Web-based analysis system in data-oriented statistical system 1209
Shibata R., InterDatabase and DandD 465
Yokouchi D., Shibata R., DandD: Client server system 2011
Functional Data Analysis
Araki Y., Konishi S., Imoto S., Functional discriminant analysis for microarray gene expression data via radial basis function networks 613
Beran R., Low risk fits to discrete incomplete multi-way layouts 671
Boudou A., Caumont O., Viguier-Pla S., Principal components analysis in the frequency domain 729
Cardot H., Crambes Ch., Sarda P., Conditional quantiles with functional covariates: An application to ozone pollution forecasting 769
Cardot H., Faivre R., Maisongrande P., Random effects varying time regression models with application to remote sensing data 777
Costanzo G.D., Ingrassia S., Analysis of the MIB30 basket in the period 2000-2002 by functional PC's 831
Cuevas A., Fraiman R., On the bootstrap methodology for functional data 127
Dabo-Niang S., Ferraty F., Vieu P., Nonparametric unsupervised classification of satellite wave altimeter forms 879
Escabias M., Aguilera A.M., Valderrama M.J., An application to logistic regression with missing longitudinal data 997
Hlubinka D., Growth curve approach to profiles of atmospheric radiation 1185
Kawasaki Y., Ando T., Functional data analysis of the dynamics of yield curves 1309
Kneip A., Sickles R.C., Song W., Functional data analysis and mixed effect models 315
Historical Keynote
Grossmann W., Schimek M.G., Sint P.P., The history of COMPSTAT and key-steps of statistical computing during the last 30 years 1
Model Selection
Beran R., Low risk fits to discrete incomplete multi-way layouts 671
Christodoulou C., Karagrigoriou A., Vonta F., An inference curve-based ranking technique 807
Hafidi B., Mkhadri A., Schwarz information criterion in the presence of incomplete-data 1131
Multivariate Analysis
Adachi K., Multiple correspondence spline analysis 589
Choulakian V., A comparison of two methods of principal component analysis 793
Fabian Z., Core function and parametric inference 1005
Fernandez-Aguirre K., Mariel P., Martin-Arroyuelos A., Analysis of the organizational culture at a public university 1013
Nonparametrical Statistics
Burdakov O., Grimvall A., Hussian M., A generalised PAV algorithm for monotonic regression in several variables 761
Capek V., Test of continuity of a regression function 863
Ho Y.H.S., Calibrated interpolated confidence intervals for population quantiles 1193
Kolacek J., Use of Fourier transformation for kernel smoothing 1329
Komarkova L., Rank estimators for the time of a change in censored data 1337
Necir A., Boukhetala K., Estimating the risk-adjusted premium for the largest claims reinsurance covers 1577
Official Statistics
Biffignandi S., Pisani S., A statistical database for the trade sector 697
Di Zio M., Guarnera U., Rocci R., A mixture of mixture models to detect unity measure errors 919
Matei A., Tille Y., On the maximal sample coordination 1471
Renzetti M. et al., The Italian judicial statistical information system 1685
Optimization
Bayraksan G., Morton D.P., Testing solution quality in stochastic programming 663
Novikov A., Optimality of two-stage hypothesis tests 1601
Robustness
Brys G., Hubert M., Struyf A., A robustification of the Jarque-Bera test of normality 753
Critchley F. et al., The case sensitivity function approach to diagnostic and robust computation 113
Cizek P., Robust estimation of dimension reduction space 871
Debruyne M., Hubert M., Robust regression quantiles with censored data 887
Gather U., Fried R., Methods and algorithms for robust filtering 159
House L.L., Banks D., Robust multidimensional scaling 251
Kalina J., Durbin-Watson test for least weighted squares 1287
Masicek L., Behaviour of the least weighted squares estimator for data with correlated regressors 1463
McCann L., Welsch R.E., Diagnostic data traces using penalty methods 1481
Neykov N. et al., Mixture of GLMs and the trimmed likelihood methodology 1585
Ostrouchov G., Samatova N.F., Embedding methods and robust statistics for dimension reduction 359
Plat P., The least weighted squares estimator 1653
Simulations
Dufour J.-M., Neifar M., Exact simulation-based inference for autoregressive processes 967
Gamrot W., Comparison of some ratio and regression estimators under double sampling for nonresponse by simulation 1053
Harper W.V., An aid to addressing tough decisions: The automation of general expression transfer from Excel to an Arena simulation 1149
Koubkova A., Critical values for changes in sequential regression models 1345
Monleon T. et al., Flexible discrete events simulation of clinical trials using LeanSim(r) 1519
Naya S., Cao R., Artiaga R., Nonparametric regression with functional data 1569
Simoes L., Oliveira P.M., Pires da Costa A., Simulation and modelling of vehicle's delay 1823
Tressou J., Double Monte-Carlo simulations in food risk assessment 1877
Smoothing
Downie T.R., Reduction of Gibbs phenomenon in wavelet signal estimation 959
Francisco-Fernandez M., Vilar-Fernandez J.M., Nonparametric estimation of the volatility function with correlated errors 1027
Manteiga W.G., Vilar-Fernandez J.M., Bootstrap test for the equality of nonparametric regression curves under dependence 1447
Spatial Statistics
Boukhetala K., Ait-Kaci S., Finite spatial sampling design and "quantization" 737
Mohammadzadeh M., Jafari Khaledi M., Bayesian prediction for a noisy log-Gaussian spatial model 1511
Ramsay J.O., From data to differential equations 393
Statistical Software
Ceranka B., Graczyk M., Chemical balance weighing designs for v + 1 objects with different variances 785
Hornik K., R: The next generation 235
House L.L., Banks D., Robust multidimensional scaling 251
Lee E.-K. et al., GeneGobi: Visual data analysis aid tools for microarray data 1397
Marek L., Do we all count the same way? 1455
Scott D.W., Outlier detection and clustering by partial mixture modeling 453
Tsang W.W., Wang J., Evaluating the CDF of the Kolmogorov statistics for normality testing 1893
Tsomokos I., Karakostas K.X., Pappas V.A., Making statistical analysis easier 1901
Verboven S., Hubert M., MATLAB software for robust statistical methods 1941
Yamamoto Y. et al., Parallel computing in a statistical system Jasp 2003
Teaching Statistics
Arhipova I., Balina S., The problem of choosing statistical hypotheses in applied statistics 629
Cramer K., Kamps U., Zuckschwerdt C., st-apps and EMILeA-stat: Interactive visualizations in descriptive statistics 101
Duller C., A kind of PISA-survey at university 975
Eichhorn B.H., Discussions in a basic statistics class 981
Hrach K., The interactive exercise textbook 1217
Iizuka M. et al., Development of the educational materials for statistics using Web 1229