London NW1

United States Edition published by
ACADEMIC PRESS INC.
(Harcourt Brace Jovanovich, Inc.)

Copyright © 1984
All Rights Reserved
No part of this book may be reproduced in any form by photostat, microfilm, or any other means, without written permission from the publishers.

British Library Cataloguing in Publication Data
Greenacre, Michael
Theory and applications of correspondence analysis.
1. Multivariate analysis
I. Title
519.5'35 QA278
ISBN 0-12-299050-1
LCCCN 83-72867

Filmset by Advanced Filmsetters (Glasgow) Ltd
Printed in Great Britain by Hartnolls Ltd, Bodmin, Cornwall

Preface

When I arrived in Paris, in July 1973, I had little idea what lay in store for me. After a traditional statistical education and with a Masters degree in multivariate analysis tucked under my arm, I embarked on a course which was to shake up most of my previous ideas on statistics as well as on life itself.

It is impossible for anyone to spend two years as a student in Paris and, likewise, impossible for any statistician to spend two years in contact with the revolutionary Jean-Paul Benzécri without being radically affected as a consequence. It has taken me some time since those years of doctoral study in France to fully comprehend all that I learnt. This is an ongoing process, of course, and this book represents a milestone of 10 years' personal experience of the statistical method the French call analyse des correspondances, which has been obviously translated as "correspondence analysis".

My reasons for writing this book were twofold. First, in 1980 I was invited to give a paper on correspondence analysis at an international conference on multidimensional graphical methods, called "Looking at Multivariate Data", in Sheffield, England. There was considerable interest in my talk and I realized then, more than ever, the tremendous communication gap between Benzécri's group in France and the Anglo-American statistical school. I felt that it was almost my duty to undertake the writing of a book which would explain this important facet of French research to English-speaking statisticians, using not only a language but a mathematical style familiar to them.

Secondly, having gained experience of the tremendous versatility of correspondence analysis in describing graphically almost any rectangular table of data, I have tried to write a book which will be readable and helpful to researchers in the natural and human sciences in general. Of course it is impossible to avoid mathematical details in such a text and it is assumed that the reader has some mathematical background. However, the book can be read at different levels. On the one hand the theory of correspondence analysis is laid out systematically once and for all as a primary reference,
while on the other hand there are sufficient practical examples and applications to justify this text fully as a practical manual as well. I hope that this book will bring many researchers' ideas on the subject to a point that it serves as a springboard for a much wider and more routine application of correspondence analysis in the future.

This book is suitable as a course for statistics students, at either undergraduate or graduate level depending on the detail prescribed. Numerous examples, with solutions, terminate each of Chapters 2 to 8 and will familiarize students with the theoretical material. Notice that a basic knowledge of the algebra of matrices and vectors is assumed, but not of their geometry, which is described in detail. A course which concentrates less on mathematics would be more suitable for students of most other disciplines in which numerical data are collected and analyzed. Such a course would consist of a careful reading of the first 3 chapters, generally ignoring the theoretical examples, skimming through Chapters 4 to 8, and finally concentrating on an appropriate selection of applications in Chapter 9. This approach can be followed initially by readers who are not statisticians but who wish to gain a rapid overview of the method's basic concepts and interpretation. Subsequent reading of the theoretical examples and Chapters 4 to 8, as the need arises, will gradually fill in the details and demonstrate the wide applicability of the technique and its unique position in the field of multivariate analysis.

I would need many more pages to thank in full appreciation all the people that have led, directly or indirectly, to the publication of this book. I have dedicated this book to Jean-Paul Benzécri in acknowledgement of the role he has played in my education. To my friends and colleagues in Paris, Pierre Teillard, Michael Meimaris, Ludovic Lebart, Maurice Roux, Michel Jambu, Sylvie Stepan, Madame Laraise, Bernard Michau, René Dorr, Laurent Degos (to name but a few!), I extend my warmest thanks. Here in South Africa I owe a particular debt of gratitude to Michael Browne, who has always kept me on the straight and narrow path of rationality and common sense in spite of my many attempts to stray off! (I must add a similar word of appreciation to John Gower, who performed more or less the same task while I was at Rothamsted in England.) To my family, parents, friends and colleagues I apologize for their having to share with me that particular traumatic state that accompanies a venture of this nature and thank them for their patience and understanding during this period.

Final thanks go to the University of South Africa (UNISA) and the Council for Scientific and Industrial Research (CSIR), for their generous financial support on numerous occasions during crucial years of study and research, to Cas Crouse, Piet van der Westhuizen and the families Théron, Brink and Claassens for their encouragement, to Edna Schultz for her excellent and competent typing of my manuscript, to Lizzie Pieters who was always prepared for a last-minute typing crisis, to Lien Badenhorst for assistance with the graphical material, to colleagues who helped with the proofreading, to Tom Bishop for his dependable transporting of a heavy manuscript and original drawings to London on my behalf, and to Emily Wilkinson and Jeremy Lambert of Academic Press for their co-operation and expert guidance of the manuscript towards the final product in your hands now.

December 1983
Michael Greenacre
Contents

Preface  v

CHAPTER 1  Introduction  1

CHAPTER 2  Geometric Concepts in Multidimensional Space  14
  2.4 Assigning masses (weights) to vectors  33
  2.6 Examples  41

CHAPTER 3  Simple Illustrations of Correspondence Analysis  54
  3.1 A typical problem  54
  3.2 The dual problem  60
  3.3 Decomposition of inertia  66

CHAPTER 4  Theory of Correspondence Analysis and Equivalent Approaches  83
  4.1 Algebra of correspondence analysis  83
  4.2 Reciprocal averaging  96

CHAPTER 5  Multiple Correspondence Analysis  126
  5.1 Bivariate indicator matrices  127
To Jean-Paul Benzécri

Introduction

self and thus providing a very modest and tentative explanation of the phenomenon under study. It is unfortunate that so much emphasis is placed on a model as a representation of reality, which is usually unjustified, with
little or no attention paid to its ability to describe data meaningfully. In fact, the whole question of data description has not been given the attention it deserves, as pointed out by Finch (1981). Often the data set at hand is the one-and-only set of observations available, the "sampling units" constitute the population and the study is never to be repeated. In such a case the description of the data is of supreme importance.

Thus, in general, once a set of data has been collected we would advise that the first analytical step be some sort of descriptive summary. For example, the most familiar summary of a set of quantitative measurements is the arithmetic average, or mean, often reported along with the standard deviation. As far as the applied researcher is concerned (and some statisticians too) the values of these two quantities might totally replace the original set of measurements as he develops a mental picture of the data located around the mean with a spread summarized by the standard deviation. However, the statistician should be aware of the ideal circumstances (i.e. the model) in which such summary values are adequate and thus might consider different measures of location and dispersion (e.g. median and percentiles) to be more appropriate. Thus even at this low level of analysis, with minimal manipulation of the data, it is difficult to proceed objectively. Some structure is always presumed to underlie the observations by the very nature of our analysis. This leads us to the following aphorism (with apologies to George Orwell): no statistical methods are objective, but some are more objective than others.

While on the subject of a single set of quantitative measurements, let us lavish praise on John Tukey's ingenious stem-and-leaf plot (Tukey, 1977), which converts such data into a neat graphical description with no (or very little) loss of information. An example of a stem-and-leaf plot of a set of ages is given in Fig. 1.1, where the "stem" is the tens digit of the age and the "leaf" the units digit. This picture contains both a broad summary in the form of the outline of a conventional histogram, as well as the finer details of the data (for example, the high number of ages in the late 20s).

It is no coincidence that the stem-and-leaf plot is a graphical method, as it is our contention that graphical displays provide the best summaries of data: a picture is worth a thousand numbers. A graphical description is more easily assimilated and interpreted than a numerical one and can assist all three functions mentioned at the start of this introduction: summarizing a large mass of numerical data, simplifying the aspect of the data by appealing to our natural ability to absorb visual images, and (hopefully) providing a global view of the information, thereby stimulating possible explanations. In spite of these advantages, it is only in recent years that the value and tremendous potential of statistical graphics have been realized. This is mostly due to the rising importance of exploratory data analysis in a world which grows more complex and varied each day, where there is a continual proliferation of potentially interesting phenomena and an abundant supply of information available (or obtainable) to study them.

1.2 CORRESPONDENCE ANALYSIS AND MULTIDIMENSIONAL GRAPHICS

A complete data set is seldom so simple that a few histograms can summarize it adequately. The type of data which we shall be discussing in this book is of a multivariate nature, that is a number of qualities have been observed or measured on each unit of study, where this unit might be a person, a time period, geographical region or other object.

For example, in an opinion survey of a group of people we might have, for each person, his or her age, residential area, educational qualifications as well as opinions on a number of controversial issues like the present government, women's rights, child education, etc. The resultant data are often summarized only marginally, by which we mean that the individual qualities are summarized separately: for example, the histogram of ages, the frequencies in each residential area and in each educational category and the frequencies of response categories of each question. Such a summary may be totally inadequate in revealing the interesting patterns of response amongst the group. For example, although there might be very few people in the sample who have had a university education and very few people who feel strongly about women's rights, it might just be that these are largely the same people. This would only be noticed if the categories of education were "crossed" with the opinion on women's rights in a two-way frequency table,

Ages                  Stem-and-leaf histogram
43 30 22 41 29        1 | 889
27 24 33 51 63        2 | 0124478889999
35 29 36 18 45        3 | 000123556889
38 38 49 20 39        4 | 13456679
19 30 53 28 28        5 | 011358
30 55 65 46 66        6 | 356
29 18 35 46           7 | 1
51 31 71 24
47 21 50 29
32 28 58 44

FIG. 1.1. The ages of a group of 46 people (on the left) and the stem-and-leaf histogram of these data (on the right). For example, the first row of the histogram shows that there are two people of age 18 and one of age 19 in the group.
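The display in Fig. 1.1 is straightforward to compute. The following sketch (Python; the function name and layout are ours, not from the book) rebuilds the stem-and-leaf rows from the 46 ages:

```python
from collections import defaultdict

# The 46 ages shown on the left of Fig. 1.1.
ages = [43, 30, 22, 41, 29, 27, 24, 33, 51, 63,
        35, 29, 36, 18, 45, 38, 38, 49, 20, 39,
        19, 30, 53, 28, 28, 30, 55, 65, 46, 66,
        29, 18, 35, 46, 51, 31, 71, 24, 47, 21,
        50, 29, 32, 28, 58, 44]

def stem_and_leaf(values):
    """Map each stem (tens digit) to its sorted string of leaves (units digits)."""
    rows = defaultdict(list)
    for v in sorted(values):
        rows[v // 10].append(v % 10)
    return {stem: "".join(str(leaf) for leaf in rows[stem]) for stem in sorted(rows)}

for stem, leaves in stem_and_leaf(ages).items():
    print(f"{stem} | {leaves}")
```

Running this prints the seven rows of the figure, e.g. `2 | 0124478889999`, so both the histogram outline and the individual values are preserved.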
TABLE 1.1
Readership (in readers per 1000 per year) of 21 daily newspapers over 5 years.

                      Year
Newspaper  1976  1977  1978  1979  1980  Totals
A            64    58    67    59    60     308
B            18    18    23    20    17      96
C            12    10     9    12     9      52
D            36    25    34    31    27     153
E            29    21    25    20    20     115
F           133   115   116   107    89     560
G            34    28    30    26    29     147
H           178   143   180   150   148     799
I             8     8     5     6     6      33
J           101   113   143   112   107     576
K            66    56    60    58    53     293
L            87    69    79    68    69     372
M            23    19    17    19    17      95
N            34    24    29    26    23     136
O            70    56    60    55    50     291
P            29    20    25    19    18     111
Q            46    40    38    38    33     195
R           123   122   149   122   112     628
S            79    68    70    61    57     335
T           130   109   148   110   100     597
U            22    17    19    15    16      89
Totals     1322  1139  1326  1134  1060    5981

FIG. 1.2. The "profiles" of four newspapers (C, J, P and U) over the 5-year period.

do not describe the changing readership patterns, for example how readership of specific newspapers changes with time. An initial attempt at representing some of these patterns is illustrated in Fig. 1.2. The data matrix is represented as a long, flat table and the frequencies in each row, expressed as percentages of their respective row totals, may be displayed as histograms rising up from the table (for example, newspaper C had a total readership of 52 in the samples over 5 years; the readership of 12 in 1976 is 23% of this total so that the first "block" in row C has a height representing 23%). Each of these sets of relative frequencies is called a profile, in this case a profile of a particular newspaper's readership across the 5-year period. Only four profiles are displayed in Fig. 1.2 for purposes of illustration. It is immediately apparent that the profiles of newspaper C and of newspaper J are quite different, with J rising to a peak in 1978 and C dropping to its lowest point in the same year. The profile of newspaper P shows a different pattern, with readership
generally falling off from the initial peak in 1976, while the profile of newspaper U is fairly similar to that of P.
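The profiles just discussed are simple to compute from the rows of Table 1.1. A short sketch (Python; illustrative only, variable names are ours):

```python
# Rows C, J, P and U of Table 1.1 (readership counts, 1976-1980).
counts = {
    "C": [12, 10, 9, 12, 9],
    "J": [101, 113, 143, 112, 107],
    "P": [29, 20, 25, 19, 18],
    "U": [22, 17, 19, 15, 16],
}

def profile(row):
    """A row of frequencies expressed as proportions of its row total."""
    total = sum(row)
    return [x / total for x in row]

# Newspaper C's 1976 readership of 12 is 23% of its total of 52.
print(round(profile(counts["C"])[0] * 100))  # 23
```

Each profile sums to 1, so the 21 newspapers can be compared on an equal footing regardless of their overall readership levels.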
Even though such a diagram is a simple transcription of the numerical data, it is clearly easier to interpret because of its visual impact. Yet problems will still arise if we try to picture and assimilate all 21 row profiles. A diagram like Fig. 1.2 will become as crowded as the skyline of Manhattan and it will no longer be so easy to evaluate the similarities and differences amongst the profiles. It would help if profiles which have "similar" shapes (e.g. those of newspapers P and U) could be grouped together somehow in order to achieve a global view of the data.

[FIG. 1.3: planar display of the newspaper points and the year points 1976, 1977, 1978; annotations legible in the original include λ2 = 0.00094 and the axis percentages 63.2% and 18.3%.]

are two sets of points, one representing the newspapers (rows) and the other representing the years (columns). Each point representing a newspaper can be considered a display of the complete profile of that newspaper. In other
dimensional graphical technique from that of analyses which are based on the same algebra and numerical procedure, but which operate in completely different frameworks. Thus, although it is possible to trace the algebra of the technique back almost 50 years, it is only in the last 20 years that correspondence analysis has existed in the form that we describe in this book.

The "leading case" of a data matrix suitable for correspondence analysis is a two-way contingency table which expresses the observed association between two qualitative variables. In 1935 H. O. Hartley published a short mathematical article under his original German name (Hirschfeld, 1935) which gives an algebraic formulation of the "correlation" between the rows and columns of a contingency table. We can attribute the mathematical origins of correspondence analysis to this paper, although Richardson and Kuder (1933) and Horst (1935) independently suggested similar ideas non-mathematically in psychometric literature, the latter author already coining the term "method of reciprocal averages". Later R. A. Fisher derived the same theory in the form of a discriminant analysis on a contingency table and applied it to data on the eye and hair colours of a group of schoolchildren (Fisher, 1940), now a classic example of a contingency table (reproduced in Table 9.1 of Chapter 9). Meanwhile Louis Guttman independently derived a method of constructing scales for categorical data (Guttman, 1941), again the same theory in a different context. Guttman treated the general case of more than two qualitative variables, and his ideas find their counterpart in what we call multiple correspondence analysis (Chapter 5). Since these two famous statisticians, Fisher and Guttman, presented essentially the same theory in biometric and psychometric contexts respectively, it is often the case that biometricians cite Fisher as the inventor of the technique while psychometricians cite Guttman. To this day the two schools have developed almost independently, with a strong school of ecologists worldwide using the method of "reciprocal averaging" (the term suggested by the psychologist Horst!) and the psychologists the method of "dual (or optimal) scaling". These techniques share the same mathematical theory and computational procedure as correspondence analysis, but lead to numerical, rather than graphical, results.

In the 1940s and 1950s further mathematical development took place, particularly in the psychometric literature by Guttman and his followers. In Japan, a group of data analysts led by Chikio Hayashi carried on a parallel development of Guttman's scaling ideas, which they called the "quantification of qualitative data" (Hayashi, 1950, 1952, 1954, 1968). It is interesting that as early as 1946 computing machines were being used to perform the calculations (Mosier, 1946). An often quoted technical report by Bock (1960) emphasized the basic principle of optimal scaling: "to assign numerical values to alternatives, or categories, so as to discriminate optimally among the objects in some sense". Bock's report included many simple examples as well as details of a computer program, and did much to popularize the method. The only paper of note during this time in the biometric literature was that of Williams (1952), which is also often cited. It is true to say that the method remained little known outside the psychometric world until the 1970s, which prompted Hill (1974) to label it a "neglected multivariate method". Further historical details and references may be found in the book on dual scaling by Nishisato (1980, Section 1.2).

Correspondence analysis, the geometric form of the above methods, originated in France in a completely different context (linguistics), and was the brainchild of Jean-Paul Benzécri. In the early 1960s a group of French analysts were studying large tables of data obtained from various literature sources. Benzécri (1977a, Section 3.2.4) recalls that the first table he studied was the one which appears in every modern Chinese language manual, where the rows are the consonants and the columns are the final vocals. Here the "data" consist of indications whether the various row-column combinations are permitted in the language or not. In such a context the French term correspondance was used to denote the "system of associations" between the elements of two sets, in this case the rows and the columns. In fact, the term came to represent a specific mathematical entity, namely the original table divided by its grand total. Since this table consists of non-negative elements only, a correspondence could be considered as a distribution of one unit of mass across the cells of a rectangular matrix. Thus the concept of a bivariate probability density is included as a special case. In other linguistic examples, the rows and columns might be (in theory) all the words of a language, the data being the number of times the words in the rows precede the words in the columns in a specific text. Here the discreteness of the table blurred into a quasi-continuum of language and there was little distinction in the minds of the French analysts between a discrete correspondence and a continuous correspondence, the latter being a surface covering a unit volume over a rectangular, possibly infinite, domain. A familiar special case of a continuous correspondence would thus be the bivariate normal probability density.

The French term analyse des correspondances literally means "analysis of correspondences", where the word correspondence has the specific technical meaning just discussed. In the usual translation, "correspondence analysis", English-speaking readers would tend quite naturally to think of correspondence in the less concrete sense of agreement. Thus the original meaning has been changed in translation.

Since the early 1960s, then, a small group of dedicated data analysts, led by Benzécri, gained extensive practical experience of correspondence analysis and other descriptive multivariate techniques like cluster analysis. They applied their ideas far beyond the initial linguistic context and gained a large
degree of fame within France and the continent, and a certain amount of notoriety in the English-speaking world. In order to put the position of this group into perspective, a few of their characteristics deserve mentioning.

First, their whole philosophy of data analysis is founded on inductive reasoning, proceeding from the particular to the general. The data set at hand and how one describes it are of importance, not the general framework or model that one might think the data fit. This standpoint is summarized strongly in Benzécri's second principle of data analysis: "The model must fit the data, not vice versa". Hand in hand with this principle is an initial rejection of probabilistic and mathematical modelling as presumptuous and irrelevant. While few statisticians would subscribe to such an extreme viewpoint, we would acknowledge that there are occasions when blind assumptions of models lead to serious defects in statistical analysis. However, Benzécri's first principle of data analysis, that "statistics is not probability" and that "authors (who hardly ever write in our language)", that is French, "have erected a pompous discipline, rich in hypotheses which are never satisfied in practice" (Benzécri et al., 1973, p. 3), was destined to raise the anger of statisticians in general rather than win them over, even slightly, to his idealism.

Secondly, from the outset the descriptive techniques developed by this group were geometric ones. Data were transcribed to sets of points in multidimensional space, and points were grouped visually in the form of branches of a tree structure or actual clusters in a graphical display. This stress on graphics followed a tradition of geometry in French mathematics (l'esprit géométrique) and paralleled the growing interest in multidimensional scaling at more or less the same time across the Atlantic.

Thirdly, also owing to a longstanding and famous French tradition, their work has been couched in an extremely rigorous algebraic notation. To the initiated this notation is tremendously powerful in expressing exactly the function and characteristics of both operands and operators. It is unfortunate that this complex language in which the group chose to express itself almost completely closed the communication lines to the Anglo-American statistical schools, who have always used a much more pragmatic notational style. Recently, however, some of the students of the French school are changing their style to communicate and gain acceptance with the larger body of statisticians. The book by Lebart et al. (1977), for example, has been read and understood by French-reading statisticians (see the book review by Nash, 1979), whereas it is a pity that the crucial works of Benzécri et al. (1973) and the journal Les Cahiers de l'Analyse des Données will never be widely read because of their unfamiliar mathematical style. However, for a brief taste of Benzécri's philosophy, interested readers can consult Benzécri (1969b), the only article available in English and fairly devoid of mathematics.

Finally, we return to our statement at the start of this section that correspondence analysis is, numerically at least, similar to a number of other techniques. This does not mean that it is the same as these techniques. For example, dual scaling (see for example, Nishisato, 1980) is concerned with deriving numerical scores for categories with certain properties, a method pioneered by Guttman. No geometry of such scores is mentioned or intended in this framework and the results are not reported in the form of graphical displays. Neither is the framework specifically multidimensional, but rather a sequence of one-dimensional frameworks. This distinction is an important one and should also be mentioned in the case of reciprocal averaging. An article by Hill (1974) popularized the name correspondence analysis but concentrated on single dimensions only. This paper forms the basis of a subsequent "definition" of correspondence analysis by Hill (1982). By contrast, correspondence analysis, as we know it, derives sets of multidimensional "scores" with a well-defined and intentional geometric interpretation.

For further commentary on correspondence analysis and Benzécri's ideas, see Mallows and Tukey (1982) and Gifi (1981).

1.4 OUTLINE OF THIS BOOK AND NOTATION

Correspondence analysis is basically a fairly simple technique from the mathematical and computational point of view. However, because it is primarily a geometric technique rather than a statistical one, it is necessary to introduce a number of geometric concepts which are crucial to a full understanding of the method (Chapter 2). Rather than follow immediately with an algebraic description, we have chosen to devote a whole chapter to some simple examples of correspondence analysis so that most of the computations and results can be described in detail (Chapter 3). A formal mathematical treatment is then given in Chapter 4 (Section 4.1) which can be skimmed through on a first reading and referred back to when necessary. The remainder of Chapter 4 deals with other analyses which are algebraically equivalent to correspondence analysis (cf. Section 1.3 above).

Chapters 5, 6 and 7 treat a wide variety of contexts in which correspondence analysis can be used. This is in accordance with the idea that correspondence analysis is a single technique capable of handling many different types of problems, in contrast to the usual approach in statistics where different methods are developed to solve different problems. Thus in Chapter 5 we discuss how correspondence analysis can be applied to data consisting of a number of qualitative variables and how a general data set can be recoded to be in a form suitable for such an analysis. In Chapter 6 the correspondence analysis of ratings and preference data is discussed and compared to existing
methods. In Chapter 7 four traditional areas of multivariate statistics are discussed and it is shown how correspondence analysis can be of use in each of these.

In Chapter 8 we have collected together a number of diverse topics of special interest. The most important of these is a discussion of the stability of the graphical displays obtained by correspondence analysis, as well as their probabilistic properties in appropriate situations.

Chapter 9 is probably the most important part of the book, where a number of applications to data sets from a wide spectrum of disciplines are discussed. These are not intended to be complete case studies, but rather illustrations of specific features of correspondence analysis.

We have assumed that the reader has some basic knowledge of matrices and vectors, at a level similar to that assumed by most modern textbooks on applied multivariate analysis. Most of these books do have a chapter devoted to matrix algebra and we have not included such a chapter here. If required, the reader should refer to the relevant chapters of Press (1972), Morrison (1976), Mardia et al. (1979) or, in particular, the text by Green and Carroll (1976), which we highly recommend. We have added an appendix specifically on the singular value decomposition because this crucial concept is given an inadequate or non-existent treatment in most textbooks (Appendix A). An appendix on computation and available computer programs is also given to assist those who wish to execute a correspondence analysis (Appendix B).

At the end of each of Chapters 2 to 8 there is a section entitled "Examples". These are usually the formal statements and proofs of theoretical results mentioned during the respective chapter, or numerical examples for purposes of illustration. They are assembled at the end of each chapter so that the text is as uninterrupted as possible.

Finally, we should mention some important notational devices that we have introduced. Following convention in typeset texts, we use italics to denote scalars (e.g. c, i, K), and boldface to denote vectors (in lower case, e.g. c, r) and matrices (in upper case, e.g. C, N). If a particular scalar, say j, is chosen as an index, then its respective capital is usually used as the total number in the indexed set: for example c_1, c_2, ..., c_j, ..., c_J, or c_j, j = 1 ... J. In order to save space when using the summation notation, we indicate the summation parameters as subscripts and superscripts, for example Σ_{i=1}^{I} r_i, often omitting parameters that are obvious in the context, for example Σ^I r_i or Σ_i r_i. Since summation most often starts at the first element and ends with the last, this is assumed when not specifically indicated, so that Σ_k A_k stands for Σ_{k=1}^{K} A_k, while Σ_{k=2}^{K} A_k has to be denoted in full.

Some letters are informally reserved for specific entities and these will be explained as they are encountered. For example, we use I and J as the numbers of rows and columns respectively of a general matrix, and i and j as the respective indexing variables. The matrices F and G usually contain the co-ordinates of the row and column points respectively in a graphical display, for example the co-ordinates of the two sets of points in Fig. 1.3. Such conventions are very useful in the development of the algebra and in subsequent recall of formulae.

The notation ≡ indicates the definition of a symbol: for example, F ≡ UD_α signifies that the matrix F is defined as UD_α, as opposed to F = UD_α, which means that F, previously defined, is equal to UD_α.
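As a concrete illustration of these quantities (our own sketch, not the book's code): in a standard formulation of correspondence analysis the correspondence matrix P is the table divided by its grand total, and co-ordinate matrices F and G of the kind just mentioned can be obtained from a singular value decomposition of the standardized residuals. The table N below is hypothetical and the variable names only loosely mirror the book's notation:

```python
import numpy as np

# A small hypothetical two-way table (counts).
N = np.array([[16.0, 12.0, 11.0],
              [12.0, 24.0, 13.0],
              [11.0, 13.0, 20.0]])

P = N / N.sum()          # the "correspondence": one unit of mass over the cells
r = P.sum(axis=1)        # row masses
c = P.sum(axis=0)        # column masses

# SVD of the matrix of standardized residuals.
S = np.diag(r**-0.5) @ (P - np.outer(r, c)) @ np.diag(c**-0.5)
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

# Principal co-ordinates of the row points (F) and column points (G).
F = np.diag(r**-0.5) @ U @ np.diag(sv)
G = np.diag(c**-0.5) @ Vt.T @ np.diag(sv)
```

In this formulation each cloud of points is centred at the origin with respect to its masses (r @ F and c @ G are zero vectors), and the squared singular values sum to the total inertia of the table.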
2. Geometric Concepts in Multidimensional Space
A recommended reference is the book by Green and Carroll (1976), which gives an excellent conceptual introduction to vector and matrix algebra and associated geometry in the context of applied multivariate analysis. Their book overlaps and complements the material in this chapter, and is highly recommended as parallel reading matter, particularly if the reader has a limited mathematical background. In the present chapter we introduce all the geometric concepts that will be required for a thorough understanding of correspondence analysis, as well as related multidimensional analyses.

In Section 2.1 we introduce the basic geometric unit in multidimensional space, the point vector, as well as its co-ordinates with respect to a set of basis vectors. Subspaces are defined and the distinction is made between a dimension of the space and its dimensionality.

In Section 2.2 we begin to add structure to multidimensional space by defining distances, angles and scalar products between point vectors. These definitions are illustrated in the well-known context of Euclidean space.

In Sections 2.3 and 2.4 we generalize these concepts slightly to accommodate our later description. We need firstly to associate weights with the dimensions of the space and secondly to assign weights to the individual point vectors themselves. We call these latter weights (point) masses, in order to distinguish them from the former dimension weights.

In Section 2.5 we arrive at our objective of identifying low-dimensional subspaces which best "fit", or lie closest to, a given set of point vectors. The crucial concept of the singular value decomposition (SVD) and its geometry are discussed in this section as well as in Appendix A, where it is shown to underlie many different multidimensional techniques.

Section 2.6 concludes with some theoretical and practical examples of the material in this chapter.

The integer J is called the order of the vector and we often refer to such a vector as a J-vector. For example, suppose that we measure someone's height (cm), weight (kg), shoulder width (cm) and waist (cm). These measurements can be collected into a single vector, of order 4 (i.e. a 4-vector):

x = [175 70 47 84]^T   (height, weight, shoulder width, waist)

In order to write vectors as row vectors, the notation x^T is introduced, where T denotes the transpose of a vector, for example:

x^T = [175 70 47 84]   or   x = [175 70 47 84]^T

Although the column vector notation is conventional it is wasteful of space in printed text, and we often denote particular column vectors in their transposed form as row vectors, as shown above.

Dimension, co-ordinates and basis

Initially, for purposes of illustration, let us suppose that we are interested only in a person's height and weight measurements. The data vector (of order 2) is then x = [175 70]^T.
FIG. 2.3. The diagonal dimension in Fig. 2.2 on which all three points lie, at positions 17.5, 17.0 and 15.0 respectively with respect to the basis vector b.

x_A = [175 70]^T = 17.5 [10 4]^T

Thus these 3 points actually lie in a 1-dimensional subspace defined by the basis vector b, and have co-ordinates with respect to b of 17.5, 17.0 and 15.0 respectively (Fig. 2.3). This single dimension is a combination of the two original dimensions of the space.

[FIG. 2.5: the three year-points plotted with respect to the EXPORTS and IMPORTS axes.]

Multidimensional space

When we want to represent points with respect to 3 co-ordinate axes, it is still possible to give the impression of a third dimension on paper by introducing perspective. For example, suppose that we have local consumption figures for the product discussed above so that our data are given by the matrix of Table 2.1. We can depict the three years as points in a "room" where we are looking down towards the corner of the room (Fig. 2.7). The origin of the display is at this corner, with two of the axes running horizontally along the edges of the floor and one vertically up from the corner. We have indicated the "projections" of the points down onto the "floor" of the display (as if the points have been dropped down vertically), that is with respect to the exports and imports axes, which is the 2-dimensional display we had previously. The projections onto the two "walls" are also shown, and it is clear that these no longer lie exactly on a straight line. However, since the points do lie on a straight line on the "floor" it is clear that they are contained exactly in a plane which stands vertically on this line.

[Table 2.1 (partial): consumption figures 2500, 2700 and 3200 for the three years.]

FIG. 2.7. 3-dimensional display of the columns of Table 2.1. The three points lie in the vertical plane indicated.

If we take this plane out of the room, as it were, turn it around (so that year 1 is on the left, as before), and lay it flat, then we have the display of Fig. 2.8. We have again centred the display at the centroid of the three points, ȳ = [1900 1200 2800]^T, and the axes are those defined by the original consumption axis and by b. In order to move from the original origin of the display in Fig. 2.7 to a point, say y_1, we have to move first to the centroid ȳ, which is the origin of the new display in Fig. 2.8, then -4 units along the b axis, then -300 units on the consumption axis. In short: y_1 = ȳ - 4b - 300e_3, where e_3 is the unit vector along the consumption axis. Recalling the definition of b = 100e_1 - 200e_2, where e_1 and e_2 are unit vectors along the export and import axes respectively, we see that the above equation is correct:

y_1 = ȳ - 4b - 300e_3 = [1900 1200 2800]^T - 400[1 0 0]^T + 800[0 1 0]^T - 300[0 0 1]^T = [1500 2000 2500]^T

FIG. 2.8. Positions of the three points in the plane indicated in Fig. 2.7 (viewed from the opposite side).

In Fig. 2.8 it is clear that the three points lie approximately in a straight line: for example, the dashed line defined by the vector c seems to follow the direction of spread of the three points. If we are willing to gloss over the deviations of the three points from this line, we could reduce the points' original 3-dimensional positions to a 1-dimensional display along the axis defined by c. The vector c is now a combination of the consumption axis and the axis defined by b, in other words a combination of all three original dimensions. In our "room" of Fig. 2.7, this is a vector cutting across the room and following the three points as closely as possible. The question arises as to how we could find the vector c which in fact comes the closest to all three points. In order to answer this question we have to define what we mean by "closeness" or, alternatively, distance in our space of points, and this is the subject of the next section.

The above example in three dimensions is a highly simplified example of reducing the dimensionality of a set, or cloud, of points in multidimensional space. In actual applications we shall usually be dealing with points in spaces of much higher dimensionality, but the question will still be the same: can we find a subspace of lower dimensionality which comes "close" to the cloud of points? To take a different example, in craniometry a large number of measurements might be made on an adult human skull in order to define it accurately. If the vector a contains 30 such measurements then it can be thought of as a point in the 30-dimensional space of possible vectors of description of human skulls. The vector of description c of a child's skull is also a point in this space and perhaps differs from the adult skull simply by a scalar quantity: c = ka. Or c might be obtained by combining the characteristics of a small number of different skull-types with appropriate scalings, for example: c = k_1 a_1 + k_2 a_2. If we have a large number of skulls whose measurements can be obtained by linearly combining two basic skull-types, with variation only in the coefficients k_1 and k_2, then we would say that the skull vectors lie in a two-dimensional subspace of 30-dimensional space. Again it will not usually be possible to generate the skull vectors exactly and we would try to identify the basis vectors a_1 and a_2 which approximately generate the skull vectors. Thus, although it is not possible to display the skull vectors in their original space, they can all lie (approximately) in a space of much lower dimensionality which can be visualized quite easily. Notice that in this example we have not centred the data explicitly, although it is more customary to investigate the vectors with reference to their centroid as origin; we shall return to this matter later.

To conclude this section we would like to review some concepts in a general framework. Suppose that we have a cloud of I points (or point vectors) in J-dimensional space (the term "point vector" is synonymous with "point" and merely underlines our considering a point to be a vector). We assume that the standard basis vectors e_1, e_2, ..., e_J define the J axes or dimensions of this space. Let us consider a typical point vector, defined as x ≡ [x_1 x_2 ... x_J]^T. The numbers x_1, x_2, ..., x_J are called the co-ordinates of x with respect to e_1, e_2, ..., e_J, because x = x_1 e_1 + x_2 e_2 + ... + x_J e_J. This latter expression can also be read as: x is a linear combination of the basis vectors e_1, e_2, ..., e_J, with coefficients x_1, x_2, ..., x_J. Geometrically, x is a point (or movement from the origin to a point) obtained by moving x_1 units in the direction of e_1, then x_2 units in the direction of e_2, etc. The vectors x_1 e_1, x_2 e_2, ..., x_J e_J are called the components of x; x is thus the sum of its components.

The centroid of a set of points v_1 ... v_I is a particular linear combination α_1 v_1 + ... + α_I v_I, where the coefficients add up to 1: Σ_i α_i = 1 (such a linear combination is often called a barycentre). Previously we have used the term centroid in the sense of average vector, that is when all the coefficients are equal to 1/I. However, we shall think of a centroid more often as a weighted average vector, where the coefficients α_1 ... α_I are proportional to weights (or masses) assigned to the respective vectors in the linear combination. Thus the ordinary mean vector is the centroid when equal weight is assigned to all the vectors.

Although the standard basis is the set of vectors defining our original frame of reference, as it were, we often re-express our vectors of interest as linear combinations of other basis vectors, in other words with respect to other axes. An important property of a basis is that it consists of a set of linearly independent vectors: no basis vector is a linear combination of the other basis vectors. Geometrically, each basis vector defines a new dimension in space because its movement cannot be obtained by combining movements of the other basis vectors. In Fig. 2.7 the 3 vectors y_1 - ȳ, y_2 - ȳ and y_3 - ȳ (the vectors from the centroid ȳ to the three respective points y_1, y_2 and y_3) were defined in 3-dimensional space, yet we saw that they can be expressed as linear combinations of 2 basis vectors, that is they lie in a subspace of dimensionality 2. Alternatively, we can say that none of the 3 vectors have components along a dimension which is linearly independent of the above subspace. It is important to note the distinction between the terms "dimensionality" and "dimension". The dimensionality of a set (or space) of point vectors is a fixed integer value, while dimensions are themselves vectors (which we also call axes) in the space of the vectors.

Notice that we use the term "subspace" in a slightly looser sense than the usual mathematical definition. We shall call a K-dimensional subspace of J-dimensional space the set of vectors u + α_1 v_1 + ... + α_K v_K, where u is any fixed J-vector, v_1 ... v_K are K linearly independent J-vectors and α_1 ... α_K are real numbers. Our definition includes the fixed vector u which can be thought of either as the first step of transferring the vectors from the origin into the subspace or as effectively redefining the origin of the space at the point u.

2.2 DISTANCE, ANGLE AND SCALAR PRODUCT

Although not mentioned explicitly in the above discussion, the usual physical concepts of distance, length and direction have already been implied, for example in Fig. 2.7. These concepts become crucial when we start to study data, where it would be extremely improbable that the measurements on more than two people lie exactly along a single dimension or that consecutive years of export/import data be exactly linearly related. The question to be asked in a practical situation is not whether a set of points in multidimensional space lies exactly in some subspace of lower dimensionality, but whether the set lies approximately in such a subspace. A first step in formalizing this notion is to define a measure of distance (or metric) between points in the multidimensional space of the data.

The concept of an angle in multidimensional space is a rather more abstract idea, although we are familiar with direction in the physical world around us. Notice, however, that distance and angle are both scalar quantities (real numbers) defined in terms of two points: distance is a value quantifying the closeness of a point a to a point b, and angle is a value quantifying how rapidly two vectors are diverging from a common origin. This is represented in Fig. 2.9 for any two points a and b and it appears intuitively that if we know the distances of a and b from the origin O (which we often call the lengths of the vectors a and b respectively) as well as the angle between a and b (i.e. how rapidly they are moving apart), then we can work out from this information the distance from a to b. This is indeed true and it turns out that both concepts of distance and angle can be embodied in a single fundamental concept in multidimensional space, called the scalar product (or inner product). In order to introduce this concept we shall first review the definitions of distance and angle in the simple case of 2-dimensional physical space: the space with familiar "horizontal" (x) and "vertical" (y) axes.

[FIG. 2.9: two points a and b in the plane, showing the distance from a to b and the angle between the vectors a and b at the origin O.]
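The co-ordinate bookkeeping of the three-year example above can be checked numerically. A minimal NumPy sketch, using only the values quoted in the text (the centroid ȳ, the basis vector b and the move y_1 = ȳ - 4b - 300e_3):

```python
import numpy as np

# Standard basis vectors along the export, import and consumption axes.
e1, e2, e3 = np.eye(3)

# Basis vector of the "floor" line: b = 100*e1 - 200*e2.
b = 100 * e1 - 200 * e2

# Centroid of the three year-points, ybar = [1900 1200 2800]^T.
y_bar = np.array([1900.0, 1200.0, 2800.0])

# Year 1 is reached by moving -4 units along b and -300 units along
# the consumption axis from the centroid: y1 = ybar - 4b - 300*e3.
y1 = y_bar - 4 * b - 300 * e3
print(y1)  # [1500. 2000. 2500.]
```

The result agrees with the expansion given in the text: [1900 1200 2800]^T - 400[1 0 0]^T + 800[0 1 0]^T - 300[0 0 1]^T = [1500 2000 2500]^T.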
All of the above formulae can be expressed in terms of the scalar product of a and b, denoted by (a, b): (a, b) ≡ a_1 b_1 + a_2 b_2.

In the notation of vector algebra the expression a_1 b_1 + a_2 b_2 is just a^T b, the transpose of a multiplied by b (in the sense of matrix multiplication). The above formulae are thus:

||a|| = (a, a)^(1/2) = (a^T a)^(1/2)      ||b|| = (b, b)^(1/2) = (b^T b)^(1/2)      (2.2.1)

(i.e. the squared length of a vector is the scalar product of the vector with itself)

d(a, b) = (a - b, a - b)^(1/2) = ((a - b)^T (a - b))^(1/2)      (2.2.2)

cos θ = (a, b)/((a, a)(b, b))^(1/2) = a^T b/(a^T a b^T b)^(1/2)      (2.2.3)

As mentioned above, d(a, b) can be evaluated in terms of ||a||, ||b|| and cos θ; in fact d(a, b) squared is

d^2(a, b) = (a - b)^T (a - b) = a^T a + b^T b - 2a^T b
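Formulae (2.2.1)-(2.2.3) translate directly into code. A small sketch (the two points are arbitrary illustrative values, not from the text):

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([5.0, 0.0])

norm_a = (a @ a) ** 0.5                           # (2.2.1): ||a|| = (a^T a)^(1/2)
d_ab = ((a - b) @ (a - b)) ** 0.5                 # (2.2.2): distance from a to b
cos_theta = (a @ b) / ((a @ a) * (b @ b)) ** 0.5  # (2.2.3): cosine of the angle

# The expansion d^2(a, b) = a^T a + b^T b - 2 a^T b:
assert np.isclose(d_ab ** 2, a @ a + b @ b - 2 * (a @ b))
print(norm_a, cos_theta)  # 5.0 0.6
```

Everything is built from the single scalar product a @ b, which is the point of the section.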
In J dimensions the scalar product is defined analogously:

(a, b) ≡ Σ_j a_j b_j = a^T b

and the definitions of length, distance and angle are exactly as before (see equations (2.2.1)-(2.2.3)). Again we are assuming that the co-ordinates of a and b are with respect to the standard basis e_1, ..., e_J, but the definitions would still apply if a and b were expressed relative to any orthonormal basis. We shall prove this result in the next section in the even more general context of weighted Euclidean space.

Notice finally that although we encourage the concept of a vector as a point in space it should be remembered that the scalar product of two vectors is dependent on the origin of the space. Distances between point vectors are, by contrast, independent of the origin of the space.

2.3 WEIGHTED EUCLIDEAN SPACE

In this section we first discuss a simple 2-dimensional example and then describe its multidimensional extension, which we shall need later.

Example in 2-dimensional weighted Euclidean space

Let us refer back to Fig. 2.1 and the example of the height and weight measurements on different people and suppose that we are interested in defining distances in the space of these measurements. In Fig. 2.1 we plotted points as if the space was Euclidean and the implied interpoint distances are then just the usual "straight line" distances, which can be computed using formula (2.2.2).

However, there is a very good reason why the Euclidean distance is unsuitable for this type of data. For example, because of the units of measurement (cm for height, kg for weight), the height measurement always has a much higher value than the weight. The difference in height between two people is therefore usually higher than the difference in weight. Thus the height measurement will contribute relatively more to the Euclidean distance, which depends on the sum of squares of these differences. On the other hand, if we expressed height in m then the weight measurement would dominate the Euclidean distance.

Clearly it is not desirable that the distances depend directly on the chosen units of measurement. A common remedy is to standardize the measurements so that the distance measurement remains the same for any units chosen originally. For example, suppose that the standard deviations of height and weight in the sample of people are 30 (cm) and 10 (kg) respectively. The height measurements are divided by 30 and the weights by 10 and these are then plotted in 2-dimensional Euclidean space. Two vectors x and y of measurements become standardized as x_s = [x_1/30 x_2/10]^T and y_s = [y_1/30 y_2/10]^T and the scalar product of x_s and y_s is (x_s, y_s) = x_1 y_1/30^2 + x_2 y_2/10^2.

An equivalent way of describing the above strategy is in terms of differential weighting of the co-ordinate axes. The original vectors of measurements are retained but the definition of scalar product contains a weighting factor on each term. For the example above, the weighting factors are 1/30^2 and 1/10^2 respectively, so that the scalar product between two vectors x and y of measurements is:

(x, y) ≡ x_1 y_1/30^2 + x_2 y_2/10^2 = x^T D_s^-1 y      (2.3.1)

where

D_s^-1 = [ 1/30^2      0    ]
         [   0      1/10^2  ]

is the diagonal matrix of the inverses of the variances.

Geometrically the vectors x and y are plotted in their original units but scalar products (and therefore distances and lengths) in this space are calculated using (2.3.1), where the height measurement is down-weighted relative to the weight measurement. We call this space weighted Euclidean space, with the weights in this example equal to the inverses of the variances. The space can be imagined to be stretched along the axes, in the sense that relative to our physical way of thinking a shell of points equidistant to a fixed point is not spherical, but elliptical. In our example above, the ellipse of equidistant points has major axis parallel to the height axis (Fig. 2.11). Another way of thinking of this is that the units of measurement are changed on each axis, with unit distances on axes being inversely proportional to the respective weights.

FIG. 2.11. The computation of distance in weighted Euclidean space. The ellipse defines the set of points which are all equidistant to g for a given distance. If d(a, g) is 3 times d(c, g) then a difference in a unit of weight is, as far as computing interpoint distances is concerned, the same as a difference in 3 units of height. In this way differences in scales and in variabilities of measurements can be compensated for.

Once again we shall find it advantageous to work only with orthonormal bases. In the weighted Euclidean space with scalar product (2.3.1) the previous basis vectors e_1 and e_2 are still orthogonal but not of unit length. An orthonormal basis is clearly 30e_1 and 10e_2, because

(30e_1)^T D_s^-1 (30e_1) = 1   and   (10e_2)^T D_s^-1 (10e_2) = 1

(cf. the remark at the end of the previous paragraph). The co-ordinates of x with respect to this basis are x_1/30 and x_2/10 respectively:

x = (x_1/30)30e_1 + (x_2/10)10e_2

In other words the co-ordinates of x with respect to the orthonormal basis in the weighted Euclidean space are exactly the co-ordinates of the standardized vector x_s in ordinary (unweighted) Euclidean space.

J-dimensional weighted Euclidean space

In general, weighted Euclidean space is defined by the scalar product:

x^T D_q y = Σ_j q_j x_j y_j      (2.3.2)

where q_1 ... q_J are positive real numbers defining the relative weights assigned to the J respective dimensions. The squared distance between two points x and y in this space is thus the weighted sum of squared differences in co-ordinates:

d^2(x, y) ≡ (x - y)^T D_q (x - y) = Σ_j q_j (x_j - y_j)^2      (2.3.3)

This type of distance function is often referred to as a diagonal metric.

From our 2-dimensional example, it seems that as long as we maintain the prerequisite of an orthonormal basis, no matter how the scalar product is defined, then the usual (unweighted) definition of Euclidean scalar product (and thus distance, length and direction) may be applied to the co-ordinates of the vectors with respect to that basis, in order to evaluate the respective quantities in the weighted space.

This result is easy to prove, even in the more general case where the weighting matrix is any positive definite matrix Q, that is where the scalar product between vectors x and y in J-dimensional space is defined by:

x^T Q y = Σ_j Σ_j' q_jj' x_j y_j'

Let us thus express x and y relative to any basis b_1 ... b_J:

x = Σ_j u_j b_j      y = Σ_j' v_j' b_j'

then their scalar product is

x^T Q y = (Σ_j u_j b_j)^T Q (Σ_j' v_j' b_j') = Σ_j Σ_j' u_j v_j' b_j^T Q b_j'

thanks to the distributive nature of matrix multiplication. Now if the basis b_1 ... b_J is orthonormal then by definition b_j^T Q b_j' = 0 if j ≠ j' and b_j^T Q b_j = 1, j = 1 ... J. (In matrix notation we write this as: B^T Q B = I, where I is the identity matrix and B is the matrix of column vectors b_1 ... b_J. We say that the basis B is orthonormal in the metric Q.) Thus it follows that:

x^T Q y = Σ_j u_j v_j      (2.3.4)

i.e. the weighted Euclidean scalar product is simply the Euclidean scalar product of the co-ordinates with respect to any orthonormal basis (orthonormal in the same weighted space).

Distances between vectors of frequencies

One of the most common examples of a weighted Euclidean distance is the chi-square (χ²) statistic for testing whether a probability density conforms to some expected density. For example, suppose that the nationwide results of a general election give the 5 parties contesting the election the following numbers of votes (in thousands): 1548, 2693, 621, 950 and 283 respectively. Expressed as proportions of the total number of votes (6095 thousand), these are 0.254, 0.442, 0.102, 0.156 and 0.046 respectively. Assuming that every voter could choose from all five parties, we would expect the votes in the different parts of the country to be roughly in the same proportions, unless other patterns of voting are present. Suppose that in a certain rural area of the country a total of 5000 voters vote as follows for the 5 respective parties: 1195, 2290, 545, 771 and 199, that is in proportions 0.239, 0.458, 0.109, 0.154 and 0.040 respectively. If voting here had taken place in exactly the same proportions as in the nation as a whole, the expected number (or frequency) of votes for each party would have been 1270, 2210, 510, 780 and 230 respectively (e.g. 1270 = 0.254 × 5000). Thus there have been 75 votes less than this expected frequency for party 1, 80 more for party 2, 35 more for party 3, 9 less for party 4 and 31 less for party 5.
In order to test whether these deviations are significant, the χ² statistic is calculated as the sum over parties of squared deviations divided by expected frequencies:

χ² = 75²/1270 + 80²/2210 + 35²/510 + 9²/780 + 31²/230
   = 4.43 + 2.90 + 2.40 + 0.10 + 4.18
   = 14.01      (2.3.5)

In order to perform the test, the result 14.01 is compared to the critical points of the chi-square distribution with 4 degrees of freedom (i.e. χ²(4)), and because 14.01 is greater than 13.28 (P = 0.01) and less than 14.86 (P = 0.005), the set of observed frequencies is said to be significantly different from the set of expected frequencies at a significance level of P < 0.01 (i.e. less than 1%).

Another way of thinking about formula (2.3.5) is to consider the two sets of frequencies as two 5-vectors o ≡ [1195 2290 545 771 199]^T and e ≡ [1270 2210 510 780 230]^T in 5-dimensional space. The χ² statistic is then:

χ² = (o - e)^T D_e^-1 (o - e)

in other words the squared distance between o and e in the weighted Euclidean space with weights equal to the inverse expected frequencies.

If we define p and p̄ to be the vectors of relative frequencies and of expected relative frequencies respectively:

p ≡ (1/n)o = [0.239 0.458 0.109 0.154 0.040]^T
p̄ ≡ (1/n)e = [0.254 0.442 0.102 0.156 0.046]^T

where n is the total observed frequency, then the χ² statistic above is

χ² = n(p - p̄)^T D_p̄^-1 (p - p̄) = n Σ_j (p_j - p̄_j)²/p̄_j      (2.3.6)

This type of formulation will be seen often in this book. Correspondence analysis is concerned with vectors of relative frequencies, like p, as points in multidimensional space. Such vectors are known as profiles, for example in our voting illustration p is the profile of the rural area across the 5 political parties. The vector p̄ is in this case the average (or expected) profile of the whole country across the 5 parties. The squared distance from p to p̄ is (p - p̄)^T D_p̄^-1 (p - p̄), a weighted Euclidean distance where the weights are the inverses of the expected relative frequencies. Because it is proportional to the χ² statistic this distance function is called the chi-square distance (χ² distance). The proportionality factor is the total observed frequency n, which introduces the sample size into the measure of difference between observed and expected profiles. The squared χ² distance itself is 14.01/5000 = 0.0028, which is independent of the total observed frequency. Because the critical point at P = 0.05 of χ²(4) is 9.488, it is clear that if the total observed frequency had been less than 3387, with the same relative frequencies, then there would not have been enough evidence of difference in the profiles. If we now have another observed profile p′ in another voting area, we can again see how different this is from p̄ by calculating (p′ - p̄)^T D_p̄^-1 (p′ - p̄); let us suppose this evaluates as 0.0112. This is 4 times the squared distance of p from p̄, in other words p′ is twice as far from p̄ as is p in the weighted Euclidean space of the profiles. But suppose that the total observed frequency in this new area is only 1000. The χ² statistic is 1000 × 0.0112 = 11.20, which is not as significant as the χ² statistic of 14.01 computed for p. Thus the relative values of the total observed frequencies are important in the weighting of the observed profiles themselves so that we can order these profiles with respect to some measure of "evidence of difference" of the profiles. In the next section we shall continue discussing this topic after a general introduction to the weighting of points in multidimensional space.

2.4 ASSIGNING MASSES (WEIGHTS) TO VECTORS

In the previous section we discussed the differential weighting of the original dimensions when calculating scalar products (and thus distances and lengths) in the space of a set of point vectors. In this section we introduce what is essentially a dual concept in correspondence analysis: the differential weighting of the points themselves.

It is not uncommon in many statistical methods to weight certain observations for justifiable reasons. For example, in an opinion survey it might be difficult to obtain answers from female respondents for reasons which are quite independent of the survey. In the final data set the female opinion is grossly under-represented and any summary statistic of general opinion is male dominated. In this case it could be decided to assign higher weight (or mass) to the female responses in order to equalize the contributions of the two sexes in the calculation of means, regressions, etc.

In order to enforce a distinction between the weighting of points and dimensions, we shall prefer the term mass when referring to a quantity which weights a point, while the term weight will usually refer to the weighting of the standard dimensions (axes) of a space. However, the verb "to weight" will be retained in both contexts.
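Returning to the voting example, the chi-square computation and its chi-square-distance form can be reproduced numerically. A sketch using the observed and expected frequencies quoted in the text:

```python
import numpy as np

# Observed and expected votes for the 5 parties in the rural area.
o = np.array([1195.0, 2290.0, 545.0, 771.0, 199.0])
e = np.array([1270.0, 2210.0, 510.0, 780.0, 230.0])
n = o.sum()  # 5000 voters

chi2 = np.sum((o - e) ** 2 / e)  # the chi-square statistic (2.3.5)

# Chi-square-distance form (2.3.6): chi2 = n (p - pbar)^T D^-1 (p - pbar).
p, pbar = o / n, e / n
d2 = (p - pbar) @ np.diag(1 / pbar) @ (p - pbar)
assert np.isclose(chi2, n * d2)

print(round(chi2, 2), round(d2, 4))  # 14.01 0.0028
```

The two forms agree exactly, illustrating that the χ² statistic is n times a squared weighted Euclidean distance between profiles.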
In our study of the geometry of a set of vectors, the assigning of different masses to the vectors amounts to attaching different degrees of importance to the positions of the points in space. Previously, in Section 2.2, we mentioned that our objective is to identify a low-dimensional subspace which comes "closest" to all the data points. When the points have different masses then the subspace should lie even closer to the points of higher mass, while points of lower mass can lie relatively further from it.

The centroid of a set of points x_1, x_2, ..., x_I with different masses w_1, w_2, ..., w_I is the weighted average point:

x̄ = Σ_i w_i x_i / Σ_i w_i      (2.4.1)

Hence x̄ also tends in the direction of the points with higher mass.

χ² analysis of a set of vectors of frequencies

Formula (2.3.6) describes a particular χ² statistic as a squared distance between the vector p of observed relative frequencies and the vector p̄ of expected relative frequencies, multiplied by the total observed frequency n. In the context of the same example suppose now that we have the full breakdown of election results for all the constituent areas of the country. That is, for each area i we have n_i, the number of people who voted, and the profile p_i, the 5-vector of relative frequencies indicating how the n_i people voted. For each area we can calculate the statistic χ_i² (cf. (2.3.6)):

χ_i² = n_i (p_i - p̄)^T D_p̄^-1 (p_i - p̄)      (2.4.2)

and add these up for all the areas:

χ² = Σ_i χ_i²      (2.4.3)

In this case χ² assesses all the evidence in the data for differences in the voting patterns between the various areas and the overall voting pattern.

If we define the individual elements of the profile as p_i ≡ [p_i1 p_i2 ... p_i5]^T then the number of people in area i who voted for party j is n_i p_ij. The total number of people in all the areas who voted for party j is then Σ_i n_i p_ij. Hence the relative frequency of votes for party j is Σ_i n_i p_ij / Σ_i n_i, which is what we have denoted previously by p̄_j, the jth element of p̄. In vector notation this is:

p̄ = Σ_i n_i p_i / Σ_i n_i      (2.4.4)

Comparing this to (2.4.1) we see that the vector p̄ of overall relative frequencies is the centroid of the individual vectors of relative frequencies, where each of the latter vectors is weighted by its associated total frequency (number of votes); hence the term average profile for p̄.

The statistic χ² in (2.4.3) above can be similarly described as a weighted sum of the squared distances between p_i and p̄. If we introduce the following notational definitions:

w_i ≡ n_i / Σ_i n_i      (the mass of the ith profile)
d_i² ≡ (p_i - p̄)^T D_p̄^-1 (p_i - p̄)      (the squared χ² distance from p_i to p̄)
in(I) ≡ χ²/n,  where n ≡ Σ_i n_i

(which we shall call the (total) inertia of the set of I profile vectors), then the centroid p̄ and the inertia in(I) can both be expressed as weighted averages:

p̄ = Σ_i w_i p_i
in(I) = Σ_i w_i d_i²

Thus the average profile p̄ is a point vector which indicates the centroid of the individual profiles, while the total inertia is a measure of how much the individual profiles are spread around the centroid. Both p̄ and in(I) are independent of the absolute frequencies that constitute the original data and would be identical if the data were multiplied by any constant value. The absolute frequencies n_i are only taken into account in relation to each other and their relative values define the masses w_i which are associated with the profiles.

Notice that the term "inertia" is used here by analogy with the definition in applied mathematics of "moment of inertia": the integral of mass times squared distance to the centroid. In the statistical literature the total inertia is known as "Pearson's mean-square contingency coefficient" computed on the table of frequencies (cf. Section 4.1.6).

2.5 IDENTIFYING OPTIMAL SUBSPACES

In the previous sections of this chapter we have defined a set of points in a multidimensional space, where distances and scalar products are calculated between points by weighting the dimensions and where each point is itself weighted by an associated mass. Our object now is to identify the subspaces of lower dimensionality which best contain the set of points, that is the subspaces which come closest to the set of points.
'1
Closeness, or fit, of a subspace to a set of points

The title of a paper by Karl Pearson, "On lines and planes of closest fit to a system of points" (Pearson, 1901), shows that this objective has a long history. Pearson's geometry was ordinary Euclidean and there was no concept of differentially weighting the points inherent in his measure of fit, but the generalization of his ideas is a straightforward extension of what is now commonly called "principal components analysis".

The first problem to address is how to define the closeness of a set of points to any given subspace. We have defined distances between any two given points, so it is intuitive that the distance between a point and a given subspace is the shortest of the distances between the point and all the points contained in the subspace. Thus we could define the closeness of a set of points to the subspace as an average, or weighted average, of the corresponding set of such shortest distances. For reasons of algebraic simplicity as well as a host of other geometric conveniences, we base our measure of closeness on the squared distances rather than the distances themselves. This is a common practice in statistics, for example in regression and in analysis of variance where the model is fitted so as to minimize the sum of the squares of the errors, not of the absolute errors themselves.

Figure 2.12 depicts a cloud of points in J-dimensional weighted Euclidean space with a subspace of lower dimensionality K* drawn schematically as a plane cutting through the space. For a typical point y_i, ŷ_i represents the point in the subspace which is closest to y_i, this minimum distance being equal to d_i, say. If y_i is weighted by mass w_i (i = 1 ... I) then our definition of the closeness of the whole set of points to the subspace S is:

ψ(S; y_1 ... y_I) ≡ Σ_i w_i d_i²    (2.5.1)

where

d_i² ≡ ||y_i − ŷ_i||²_{D_q} ≡ (y_i − ŷ_i)ᵀ D_q (y_i − ŷ_i)

and D_q is the diagonal matrix of positive dimension weights. The squared distance d_i² depends on the subspace S and our objective is thus to find the subspace S* which minimizes the function ψ in (2.5.1).

In accordance with our definition of a subspace at the end of Section 2.1 we can think of a single point s as a zero-dimensional subspace. The function (2.5.1) then becomes:

ψ(S; y_1 ... y_I) = Σ_i w_i (y_i − s)ᵀ D_q (y_i − s)

since ŷ_i is equal to s for all i. The centroid ȳ is the point which minimizes this function, a result easily shown by setting the function's derivatives with respect to the elements of s equal to zero (cf. Example 4.6.3). Thus the centroid is in this sense the closest point to all the given points y_1 ... y_I. Furthermore we can show that in our search for the optimal K*-dimensional subspace S* we need only consider subspaces S which contain ȳ, hence we have drawn ȳ in the candidate subspace of Fig. 2.12. This result is proved in Example 2.6.3. Hence any subspace S which is optimal in the sense of minimizing (2.5.1) must include the centroid, with the result that we can restrict the approximations ŷ_i of the points y_i to be of the following form: ŷ_i = ȳ + Σ_k^{K*} f_ik v_k, where v_1 ... v_{K*} are basis vectors of the subspace. The function (2.5.1) to be minimized can thus be written as:

ψ(S; y_1 ... y_I) = Σ_i w_i (y_i − ȳ − Σ_k^{K*} f_ik v_k)ᵀ D_q (y_i − ȳ − Σ_k^{K*} f_ik v_k)    (2.5.2)

The variables of this objective function are the K* axes v_1 ... v_{K*}, implying a total of J K* scalar variables. There is an additional problem of identifying the optimal solution amongst the infinity of bases for the optimal subspace, even if attention is restricted to orthonormal bases. Fortunately we do not have to resort to the use of optimization techniques to solve this problem, as our particular choice of fit in terms of squared distances leads to considerable simplification of the algebra and the algorithm to compute the solution.
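As a concrete illustration of (2.5.1) in the simplest case, the sketch below evaluates ψ for a zero-dimensional subspace {s} and checks numerically that the centroid minimizes it; the points, masses and dimension weights are all invented:

```python
import numpy as np

rng = np.random.default_rng(1)
Y = rng.standard_normal((8, 3))      # 8 points y_i in 3 dimensions (invented)
w = rng.random(8)
w /= w.sum()                         # masses w_i, summing to 1
q = np.array([2.0, 0.5, 1.0])        # positive dimension weights, diagonal of D_q

def psi_point(s):
    """psi(S; y_1 ... y_I) of (2.5.1) for the zero-dimensional subspace S = {s}."""
    d = Y - s
    return float(w @ ((d ** 2) * q).sum(axis=1))

ybar = w @ Y                         # the centroid of the points
```

Whatever candidate point s is tried, psi_point(ybar) is never exceeded, which is the zero-dimensional case of the centroid result proved in Example 2.6.3.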
The singular value decomposition (which we henceforth denote by the abbreviation SVD) is one of the most useful tools in matrix algebra and includes the concept of the well-known eigenvalue/eigenvector decomposition (which we call the eigendecomposition) as a special case. A few relevant results are stated here, and we leave the reader to refer to the more detailed discussion in Appendix A.

The SVD is the expression of any real I × J matrix A of rank K in the following form:

A = U D_α Vᵀ    (2.5.3)
(I×J) (I×K)(K×K)(K×J)

i.e.

A = Σ_k^K α_k u_k v_kᵀ    (2.5.4)

where UᵀU = I = VᵀV and α_1 ≥ α_2 ≥ ... ≥ α_K > 0. The K orthonormal I-vectors u_1 ... u_K of U, called the left singular vectors, are an orthonormal basis for the columns of A and are the eigenvectors of AAᵀ, with associated eigenvalues α_1² ... α_K². Similarly the K orthonormal J-vectors v_1, v_2, ..., v_K of V, called the right singular vectors, are an orthonormal basis for the (transposed) rows of A and are the eigenvectors of AᵀA, with the same associated eigenvalues α_1² ... α_K². The elements α_1 ... α_K of the diagonal matrix D_α are called the singular values (of A). The existence and the uniqueness properties of the SVD are discussed in Appendix A.

The matrices F ≡ UD_α and G ≡ VD_α contain the co-ordinates of the rows and columns of A with respect to the respective basis vectors in V and U. For example, if (2.5.3) is written as A = FVᵀ, then the ith row a_iᵀ of A can be written as:

a_i = Σ_k f_ik v_k    (2.5.5)

so that the ith row of F contains the co-ordinates of a_i. Similarly the jth row of G contains the co-ordinates of the jth column of A with respect to the basis vectors in U.

The beauty of the SVD for our present purpose is the fact that if the last terms of (2.5.4) corresponding to the smallest singular values are dropped then a least-squares approximation of the matrix A results. That is, if we define the matrix A_[K*] as the first K* terms of (2.5.4):

A_[K*] ≡ Σ_k^{K*} α_k u_k v_kᵀ    (2.5.6)

then A_[K*] minimizes:

||A − X||² ≡ Σ_i Σ_j (a_ij − x_ij)²    (2.5.7)

amongst all I × J matrices X of rank at most K* (cf. (A.1.4)). A_[K*] is called the rank K* (least-squares) approximation of A and can itself be written in SVD form as:

A_[K*] = U_(K*) D_α(K*) Vᵀ_(K*)

where U_(K*), V_(K*) and D_α(K*) are the relevant submatrices of U, V and D_α. From (2.5.7) the rows and columns of A_[K*] are equivalently the points in respective subspaces of dimensionality K* which best fit the rows and columns of A in the sense of minimum sum of squared Euclidean distances. This effectively solves the problem of minimizing a simpler form of (2.5.2), in the absence of masses and dimension weights, that is ordinary principal components analysis. We now introduce straightforward generalizations of the decomposition and of the matrix approximation to cope with these, thus defining "generalized principal components analysis" (cf. Appendix A and Table A.1(2)).

In Appendix A.1 it is shown that any matrix A can be decomposed as A = N D_μ Mᵀ, where NᵀΩN = MᵀΦM = I, Ω and Φ being prescribed positive definite symmetric matrices. We call this the generalized singular value decomposition "in the metrics" Ω and Φ. Let us now set Ω = D_w, the masses, and Φ = D_q, the dimension weights. Then the matrix approximation

A_[K*] = N_(K*) D_μ(K*) Mᵀ_(K*) = Σ_k^{K*} μ_k n_k m_kᵀ

minimizes:

||A − X||²_{D_q,D_w} ≡ Σ_i Σ_j w_i q_j (a_ij − x_ij)² = Σ_i w_i (a_i − x_i)ᵀ D_q (a_i − x_i)    (2.5.8)

amongst all matrices X of rank at most K*, where a_iᵀ and x_iᵀ are the rows of A and X respectively. Comparing this to (2.5.2) we see that this provides the required solution where A is defined as the matrix of centred rows of Y, that is Y − 1ȳᵀ. From the form of the optimum, A_[K*], the vectors m_1 ... m_{K*} of M_(K*) define an orthonormal basis for the optimal subspace and the co-ordinates of the vectors y_i − ȳ with respect to this basis are in the rows of F_(K*) ≡ N_(K*) D_μ(K*).

The (generalized) SVD provides the required solution for any prescribed dimensionality K*. For K* = 1 the first pair of singular vectors and associated singular value (the largest) provide the optimal solution, for K* = 2 the first and second pairs of singular vectors and associated singular values provide the optimal solution, and so on. This "additivity" of the dimensions leads to the basis vectors m_1 ... m_K, in this case, being called principal axes of the rows of Y.

The squared singular values give an idea of how well the matrix is represented along the principal axes. The total variation of a matrix A is quantified by its squared norm, for example in the present case (cf. (2.5.8)):

||A||²_{D_q,D_w} = Σ_i w_i a_iᵀ D_q a_i = Σ_{k=1}^K μ_k²    (2.5.9)
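Equations (2.5.3)–(2.5.8) translate directly into numerical code. The sketch below, using an arbitrary random matrix and invented weights, computes the ordinary rank-K* approximation and then a generalized SVD in the metrics D_w and D_q; pre- and post-multiplying by the square roots of the weights is one standard way of reducing the generalized problem to an ordinary SVD:

```python
import numpy as np

rng = np.random.default_rng(0)
I_, J, K_star = 6, 4, 2
A = rng.standard_normal((I_, J))

# Ordinary SVD (2.5.3) and the rank-K* approximation (2.5.6)
U, alpha, Vt = np.linalg.svd(A, full_matrices=False)
A_Kstar = (U[:, :K_star] * alpha[:K_star]) @ Vt[:K_star]

# Generalized SVD in the metrics Omega = D_w (masses) and Phi = D_q
# (dimension weights): A = N D_mu M^T with N^T D_w N = M^T D_q M = I
w = rng.random(I_) + 0.5             # illustrative masses
q = rng.random(J) + 0.5              # illustrative dimension weights
Dw_h, Dq_h = np.sqrt(w), np.sqrt(q)
Us, mu, Vst = np.linalg.svd((A * Dq_h) * Dw_h[:, None], full_matrices=False)
N_mat = Us / Dw_h[:, None]           # N = D_w^{-1/2} U
M_mat = Vst.T / Dq_h[:, None]        # M = D_q^{-1/2} V
```

The residual of the rank-K* approximation equals the square root of the sum of the discarded squared singular values, which is the least-squares optimality property quoted in (2.5.7).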
(a) Proof

S = [ x_1ᵀx_1  x_1ᵀx_2  ...  x_1ᵀx_I
      x_2ᵀx_1  x_2ᵀx_2  ...  x_2ᵀx_I
         ⋮
      x_Iᵀx_1  x_Iᵀx_2  ...  x_Iᵀx_I ]  = XᵀX

that is, S is the matrix of scalar products x_iᵀx_i' of the points, the x_i being the columns of X.
Let s ≡ [x_1ᵀx_1 ... x_Iᵀx_I]ᵀ, the diagonal of S. Then the matrix Δ ≡ [d_ii'²] can be written as (2.6.1).

(b) Whatever the original set of points x_1 ... x_I might be, we know from (2.6.1) and from comment (2) below that Δ = s1ᵀ + 1sᵀ − 2S. Since Φ1sᵀ = 1sᵀ − 1(wᵀ1)sᵀ = 0 (wᵀ1 = 1) and similarly s1ᵀΦᵀ = 0, we have ΦΔΦᵀ = −2ΦSΦᵀ. Thus we need to show that ΦSΦᵀ = S, which we do by showing that Sw = 0. The ith element of Sw is

Σ_i' s_ii' w_i' = Σ_i' (x_i − x̄)ᵀ(x_i' − x̄) w_i',  where x̄ = Σ_i w_i x_i
              = (x_i − x̄)ᵀ {Σ_i' w_i' (x_i' − x̄)}
              = (x_i − x̄)ᵀ (x̄ − x̄) = 0

(3) The scalar product matrix S can only be recovered with respect to an origin "interior" to the set of points, i.e. a barycentre, for example the centroid of the points (cf. Schoneman, 1970; Appendix). The set of barycentres of a cloud of points forms a convex set, bounded by the convex hull of the points.

(4) The transformation ΦΔΦᵀ of Δ is called a double-centering. The weighted averages of the rows and of the columns are thereby removed.

Consider the matrix

A ≡ [ 3 −1 1
      4 −1 2
      2 −1 2 ]

If we write y = Ax then A can be considered an operation on a 3-vector x to obtain another 3-vector y. A thus maps x to y and because the matrix-vector multiplication Ax is a linear operation, A is called a linear mapping. Suppose that the co-ordinate system is changed from the standard basis e_1, e_2, e_3 to a basis f_1, f_2, f_3 defined as follows:

f_1 = e_1 + 2e_2 + e_3
f_2 = e_1 + e_2 − e_3
f_3 = e_2 + e_3

Solution

Re-expressing vectors with respect to new bases can sometimes be confusing, so the following simple fact should be remembered: if a vector x has co-ordinates x_1^(B) ... x_K^(B) with respect to a basis b_1 ... b_K then, by definition:

x = Σ_k x_k^(B) b_k = Bx^(B)    (2.6.3)

where x^(B) ≡ [x_1^(B) ... x_K^(B)]ᵀ and B ≡ [b_1 ... b_K].

Thus if we let F ≡ [f_1 f_2 f_3] in our problem above, then the co-ordinates x^(F) and y^(F) of x and y respectively with respect to F must satisfy the equations:

Fy^(F) = AFx^(F)

so that in terms of the new co-ordinates:

y^(F) = F⁻¹AFx^(F)

A^(F) = F⁻¹AF = [ 2 0 0
                  0 1 0
                  0 0 1 ]

Thus in the new co-ordinate system the mapping A takes on a particularly simple form: the first co-ordinate is doubled and the other two co-ordinates remain unchanged.
44 Theory and Applications of Correspondence Analysis    2. Geometric Concepts in Multidimensional Space 45
Comment

The above change of basis F was specifically chosen so that F⁻¹AF = D_λ, a diagonal matrix. This implies that AF = FD_λ, which looks like the eigenequation of the matrix A. However, the vectors of F are not orthogonal to each other, hence they are not the eigenvectors of A. The eigenvectors are thus an orthogonal basis (in fact, an orthonormal basis) with respect to which the matrix A takes the simple form of a diagonal matrix.
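The diagonalization in this example can be checked mechanically. The matrices below are reassembled from the fragmented printing of the example (in particular f_3 = e_2 + e_3 is inferred from the printed fragments), so treat them as a reconstruction rather than a quotation:

```python
import numpy as np

# The mapping A and the new basis f1, f2, f3 of the example (reconstructed):
# A has eigenvalue 2 on f1 and eigenvalue 1 on f2 and f3.
A = np.array([[3., -1., 1.],
              [4., -1., 2.],
              [2., -1., 2.]])
f1 = np.array([1., 2., 1.])
f2 = np.array([1., 1., -1.])
f3 = np.array([0., 1., 1.])
F = np.column_stack([f1, f2, f3])

A_F = np.linalg.inv(F) @ A @ F   # the mapping expressed in the new co-ordinates
```

Running this confirms that A_F is diag(2, 1, 1): the first co-ordinate is doubled and the other two are unchanged.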
Exercise

Show that the matrix

A ≡ (1/6) [ 7 −1  1
           −2  8 −2
            3 −3  9 ]

also takes the diagonal form

[ 2 0 0
  0 1 0
  0 0 1 ]

with respect to a suitable basis.

FIG. 2.14.

The second term of this expansion is simply equal to tᵀD_q t, the squared distance, while the remaining cross term is 2Σ_i w_i (y_i − ŷ_i)ᵀ D_q (ŷ_i − ȳ). Thus:

= ȳ − (Σ_i w_i ŷ_i + t) = ȳ − (ȳ' + t) = 0

Consider the following data matrix X, whose 10 columns are points in 2 dimensions:

X = [ −6 −5 −3 −2 −1  1 1 3 5 7
      −4 −3 −4 −2  0 −2 1 1 7 6 ]
From our discussion in Section 2.5, the complete solution to the problem is contained in the SVD of X:

X = U D_α Vᵀ  where UᵀU = VᵀV = I

Since X has less rows than columns, we would first find the eigendecomposition of XXᵀ (= U D_α² Uᵀ). The matrix XXᵀ is:

XXᵀ = [ 160 134
        134 136 ]

and its eigenvalues are the roots of the eigenequation:

|XXᵀ − λI| = 0

where |·| indicates the matrix determinant. That is:

| 160−λ   134   |
| 134    136−λ  | = (160−λ)(136−λ) − 134² = 0

The larger root is λ_1 = 282.54, and substituting u_1 ≡ [a b]ᵀ into (XXᵀ − λ_1 I)u_1 = 0 gives the two equations (160 − 282.54)a + 134b = 0 and 134a + (136 − 282.54)b = 0. Either of these equations gives the relationship between a and b as a = 1.0935b, and if we normalize u_1 so that a² + b² = 1, we have that the unit vector defining the subspace is:

u_1 = [0.738 0.675]ᵀ

The 10 points and the unit vector u_1 defining the closest subspace are given in Fig. 2.15(a). We have also plotted the one-dimensional subspace separately in Fig. 2.15(b), with its origin at the centroid and the projections of the 10 points.

FIG. 2.15. (a) Positions of the point vectors in the full space, showing the optimal line; (b) optimal 1-dimensional display of the points.

There are two equivalent ways of calculating the positions of the projections in the subspace. First we can calculate the first right singular vector v_1 of X by using the result:

v_1 = Xᵀu_1/α_1

which is the first column of V = XᵀUD_α⁻¹ (remember that we have worked with XXᵀ here, not XᵀX). Then the co-ordinates with respect to u_1 of the columns of X are simply the elements of g_1 = α_1 v_1, which is again just the first column of G = VD_α. Secondly, because g_1 = α_1 v_1 = α_1 Xᵀu_1/α_1 = Xᵀu_1, we can calculate g_1 simply as Xᵀu_1, which is the set of scalar products between the columns of X and the basis vector u_1:

g_1 = Xᵀu_1 = [x_1 ... x_10]ᵀ u_1 = [ x_1ᵀu_1
                                        ⋮
                                      x_10ᵀu_1 ]    (2.6.4)

These co-ordinates are thus calculated as: −7.13, −5.71, −4.91, −2.83, −0.74, −0.61, 1.41, 2.89, 8.41 and 9.21 respectively. For example:

x_1ᵀu_1 = (−6 × 0.738) + (−4 × 0.675) = −7.128

Comments

(1) In practice we use an eigenvalue/eigenvector procedure on a computer to perform the calculations. We have performed the calculations by hand in the above example to illustrate the numerical procedures involved. Alternatively a procedure to compute the SVD directly can be used, if available (see Appendix B).

(2) In this example we have in fact performed a principal components analysis of the columns of the data matrix X (see Appendix A). The vector u_1 defines the first principal axis of the columns of X and is the axis of maximum variance in the sense that the variance of the projections of the points onto u_1 is maximized (notice that in principal components analysis inertia and variance are identical). The sum of squares of the projections, i.e. g_1ᵀg_1, is equal to the first eigenvalue λ_1 = 282.54 of XXᵀ, because:

g_1ᵀg_1 = u_1ᵀXXᵀu_1 = u_1ᵀUD_α²Uᵀu_1 = λ_1
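The hand calculations of Example 2.6.4 can be reproduced with a few lines of numpy:

```python
import numpy as np

# The 10 points of Example 2.6.4 as the columns of X (2 x 10)
X = np.array([[-6., -5., -3., -2., -1.,  1., 1., 3., 5., 7.],
              [-4., -3., -4., -2.,  0., -2., 1., 1., 7., 6.]])

XXt = X @ X.T                       # [[160, 134], [134, 136]]
lam, U = np.linalg.eigh(XXt)        # eigenvalues in ascending order
lam1, u1 = lam[-1], U[:, -1]
if u1[0] < 0:                       # fix the arbitrary sign for comparison
    u1 = -u1

g1 = X.T @ u1                       # co-ordinates of the 10 points on the axis
pct = 100 * lam1 / np.trace(XXt)    # percentage of variance on the first axis
```

The assertions below check the book's figures: λ_1 = 282.54, u_1 = [0.738 0.675]ᵀ, the first projected co-ordinate −7.13, the identity g_1ᵀg_1 = λ_1 and the 95.4% of variance displayed.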
(since UᵀU = I) and hence the variance of the projections is g_1ᵀg_1/(10−1) = λ_1/9 = 31.39. The sums of squared deviations of the original rows of X are given by the diagonal of XXᵀ as 160 and 136 respectively, therefore the total variance of these two rows is 160/9 + 136/9 = 32.89. Thus 100(31.39/32.89) = 95.4% of the total variance of the two rows of X is displayed by the projections onto the first principal axis. Since the projections are of the form Xᵀu_1, where u_1ᵀu_1 = 1, this shows that the elements of u_1 are the ones which maximize the variance of (normalized) linear combinations of the elements of the columns of X. This is the usual definition of principal components analysis by Hotelling (1933) and is discussed later in the context of correspondence analysis as well as in general in Appendix A.

(3) Result (2.6.4) is a special case of the following general result in any scalar product space: if u is a unit vector in the metric of the space, then the length of the (orthogonal) projection of a vector x onto the subspace defined by u is simply the scalar product of x and u. For example, in weighted Euclidean space where scalar products are defined by the diagonal matrix D_q, with positive diagonal elements, then if uᵀD_qu = 1, the length of the projection of x onto the subspace defined by u is xᵀD_qu. (In (2.6.4) D_q = I.) The projection is thus the vector (xᵀD_qu)u and the vector from the projection to x is x − (xᵀD_qu)u. It is easily seen that this latter vector is orthogonal to u (always in the metric D_q, of course):

(x − (xᵀD_qu)u)ᵀ D_q u = xᵀD_qu − (xᵀD_qu)uᵀD_qu = 0

2.6.5 Another example of subspace fitting

Consider the following data matrix of percentages:

Y     = [ 36.4 18.2 27.3 18.2
(5×4)     22.2 16.7 38.9 22.2
          49.0 19.6 23.5  7.8
          20.5 27.3 37.5 14.8
          40.0 24.0 28.0  8.0 ]

The rows of Y define 5 points in 4-dimensional weighted Euclidean space, where the weights are defined by the diagonal matrix:

D_q = [ 17.3   0    0    0
         0   23.5   0    0
         0    0   17.0   0
         0    0    0   42.2 ]

and let the points have associated masses proportional to 5.7, 9.3, 26.4, 45.6 and 13.0 respectively (these values sum to 100). Calculate the first two principal axes of these 5 points, the projections of the 5 points onto the principal plane and the percentage of inertia of the points which is represented by this plane.

Solution

Let the rows of Y be y_1ᵀ ... y_5ᵀ where y_i is a 4-vector. Let w ≡ [5.7 9.3 26.4 45.6 13.0]ᵀ be the vector of masses of the points and D_w ≡ diag(w), the diagonal matrix of these masses. Since 1ᵀw = Σ_i w_i = 100.0, the centroid of the 5 points is ȳ = Σ_i w_i y_i/100 = [31.6 23.3 32.2 13.0]ᵀ.

The matrix of deviations from the centroid is thus:

Y − 1ȳᵀ = [   4.8 −5.1 −4.9  5.2
             −9.4 −6.6  6.7  9.2
             17.4 −3.7 −8.7 −5.2
            −11.1  4.0  5.3  1.8
              8.4  0.7 −4.2 −5.0 ]

Defining S ≡ D_w^{1/2}(Y − 1ȳᵀ)D_q^{1/2}, we are interested in the rank 2 approximation of S, which can be computed as:

S_[2] = U_(2) D_μ(2) V_(2)ᵀ,  with  U_(2) = [  0.058  0.462
                                              −0.288  0.737
                                               0.718  0.048
                                              −0.572 −0.398
                                               0.267 −0.288 ]

Then we know that N_(2) D_μ(2) M_(2)ᵀ is the generalized rank 2 approximation of Y − 1ȳᵀ, with N_(2) = D_w^{-1/2}U_(2) and M_(2) = D_q^{-1/2}V_(2), and that the two principal axes are defined by the orthonormal basis vectors in the columns of M_(2):

M_(2) = D_q^{-1/2} V_(2) = [  0.194  0.041
                             −0.037 −0.141
                             −0.099 −0.010
                             −0.060  0.109 ]

The projections of the points onto the subspace defined by these two axes are the rows of the matrix F ≡ N_(2) D_μ(2) (remember that we are dealing with the rows of the matrix Y):

F ≡ N_(2) D_μ(2) = D_w^{-1/2} U_(2) D_μ(2) = [  15.5  45.1
                                               −60.4  56.4
                                                89.4   2.2
                                               −54.2 −13.8
                                                47.3 −18.6 ]

Since these co-ordinates are with respect to an orthonormal basis we can plot them on the usual rectangular co-ordinate system (Fig. 2.16).

FIG. 2.16. Display of the five row points with respect to the first two principal axes (λ_1 = 408 832, 87.8%; λ_2 = 54 429, 11.7%).

In calculations of inertia it is customary to use the relative values of the masses (as in the centroid calculation), however this does not affect our determining the value of the inertia in the plane relative to the total inertia of the five points. The total inertia of the five points is their weighted sum of squared distances to the centroid: Σ_i w_i (y_i − ȳ)ᵀ D_q (y_i − ȳ), while the inertia in the plane is the weighted sum of squared projected distances: Σ_i w_i f_iᵀ f_i. Of course we do not actually have to evaluate these sums because the squares of the singular values of S give the moments of inertia of the points along the respective principal axes, that is the principal inertias (cf. (2.5.9)–(2.5.11)). The third and fourth singular values of S are 47.8 and 1.8 respectively, and this gives a total inertia of (639.4)² + (233.3)² + (47.8)² + (1.8)² = 465 549. The inertia in the plane is (639.4)² + (233.3)², which is 99.5% of the total inertia. Thus practically all of the variation of the points is contained in the subspace of the first two principal axes: Fig. 2.16 is an almost exact representation of the five points. The inertias μ_k² along the axes are usually denoted by λ_k (k = 1 ... K).

Comments

(1) The total inertia is also the sum of squares of all the elements of S. (This general result is proved by expressing the sum of squares of all the elements of S as tr(SSᵀ).)

(2) If we place ourselves in the usual Euclidean reference system the basis vectors of M_(2) are neither orthogonal nor normalized. The sense of the orthonormality of M_(2) is with respect to D_q in the metric of the row vectors: M_(2)ᵀ D_q M_(2) = I. Here we stress again the important fact that the rows of F, which are co-ordinates with respect to M_(2), can be considered to be ordinary Euclidean vectors for purposes of distance and scalar product calculations (cf. Section 2.3). So whereas we cannot easily imagine the space in which the row vectors reside, the plotting of F in a Euclidean space brings the points back into our familiar "frame of reference".

(3) The data matrix Y in the above problem has a property which has not been fully discussed, namely that the sum of each row is a constant (100%). This implies that the rank of Y (and thus of S too) is actually 3, not 4, and the fact that we obtained a fourth basic value of 1.8 for S is merely rounding error: the fourth singular value is theoretically zero. Yet another property of the problem is that the inverses of the dimension weights 1/q_1 ... 1/q_4 are proportional to the elements ȳ_1 ... ȳ_4 of the row centroid.

The next example shows that in this particular situation we can actually omit the centering of the rows and find the generalized SVD of Y itself, in which case the centroid of the rows is "contained" in the SVD. This is a situation which we shall meet in correspondence analysis.

... because

1ᵀD_w(Y − 1ȳᵀ) = wᵀY − (wᵀ1)wᵀY/wᵀ1 = 0ᵀ

and hence 1 is orthogonal to N with respect to D_w. The norms of 1 and ȳ with respect to D_w and D_q are respectively:

1ᵀD_w1 = 1ᵀw
ȳᵀD_qȳ = αȳᵀD_qD_q⁻¹1 = αȳᵀ1 = αc
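The figures of Example 2.6.5 can be verified numerically. The data below are the row profiles (in percent) of the smoking data of Chapter 3, as the cross-references there confirm; variable names are ad hoc:

```python
import numpy as np

# Row profiles (in %) and weights of Example 2.6.5
Y = np.array([[36.4, 18.2, 27.3, 18.2],
              [22.2, 16.7, 38.9, 22.2],
              [49.0, 19.6, 23.5,  7.8],
              [20.5, 27.3, 37.5, 14.8],
              [40.0, 24.0, 28.0,  8.0]])
w = np.array([5.7, 9.3, 26.4, 45.6, 13.0])   # masses (summing to 100)
q = np.array([17.3, 23.5, 17.0, 42.2])       # dimension weights, diagonal of D_q

ybar = w @ Y / w.sum()                       # centroid, ~[31.6 23.3 32.2 13.0]
S = np.sqrt(w)[:, None] * (Y - ybar) * np.sqrt(q)
mu = np.linalg.svd(S, compute_uv=False)      # singular values ~639.4, 233.3, 47.8, ~0

total_inertia = (mu ** 2).sum()              # ~465,549
pct_plane = 100 * (mu[0] ** 2 + mu[1] ** 2) / total_inertia
```

The fourth singular value is essentially zero (the rows of Y sum to a constant), and about 99.5% of the total inertia lies in the plane of the first two axes, as stated in the example.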
3. Simple Illustrations of Correspondence Analysis

TABLE 3.1
Matrix of artificial data: the frequencies of different types of smokers in a sample of personnel from a fictitious organization.

                         Smoking category
Staff group            (1) None  (2) Light  (3) Medium  (4) Heavy   Row totals
(1) Senior managers        4         2          3           2          11
(2) Junior managers        4         3          7           4          18
(3) Senior employees      25        10         12           4          51
(4) Junior employees      18        24         33          13          88
(5) Secretaries           10         6          7           2          25
Column totals             61        45         62          25         193
M_(2) = [  0.455 −0.096
          −0.085  0.329
          −0.231  0.024
          −0.139 −0.256 ]    (3.1.4)

Compared to F_(2) and M_(2) of Example 2.6.5, the sign of the second column is reversed, otherwise the values can be seen to be in direct proportion. We can again plot the rows of F_(2) as points in the usual rectangular co-ordinate system (Fig. 3.1) and the relative positions of the points are identical to those of Fig. 2.16. Because of the change in sign of the co-ordinates on the second dimension, each figure is a mirror image of the other.

FIG. 3.1. Graphical display of the row profiles of Table 3.2 with respect to the best-fitting plane. The inertias, denoted by λ_1 and λ_2, and their percentages are shown on their respective axes (λ_1 = 0.0748, 87.8%; λ_2 = 0.0100, 11.7%).

The display of points in Fig. 3.1 may be called the correspondence analysis of the row profiles of the original data matrix N. Notice that in
(the average row profile). Each of these profile vectors comprises a set of values which sum to 1, and with this as a prerequisite in the definition of the triplet any changes in scale of the row profiles, masses or metric are "divided out" again. In fact, just one quantity links the correspondence analysis described above with the absolute values of the data in N, that is the total n of N, the number of units partitioned in the contingency table.

The overall quality of representation of the points in Fig. 3.1, calculated in terms of the relative values of sums of squared singular values, is exactly the same as in Example 2.6.5, namely 99.5%. (The total inertia is equal to the sum of the squared singular values: (0.2734)² + (0.1001)² + (0.0203)² = 0.08518. Conventionally the percentages of inertia are written on the axes, for example (0.2734)² = 0.0748 and (0.1001)² = 0.0100 are 87.8% and 11.7% of the total inertia respectively, totalling 99.5% of the inertia in the plane.) We can thus interpret Fig. 3.1 as the almost exact positions of the points.

Although this example has been constructed for illustrative purposes rather than as a serious application of correspondence analysis, let us nevertheless comment briefly on the interpretation of the display. Notice that the senior employees and secretaries are relatively similar to each other in terms of their smoking habits. Junior managers and junior employees are relatively far from those groups, with senior managers lying almost midway between junior managers and senior employees. In this way the rows of data are mapped to points in a plane and our examination of the relative positions of the points suggests similarities and differences amongst the staff's smoking habits.

3.2 THE DUAL PROBLEM

In Section 3.1 we have investigated the geometry of the row profiles of the contingency table N. In a similar and symmetric fashion we can investigate the geometry of the column profiles of N. As we shall illustrate in our simple example and show more formally in Chapter 4, the geometry of the column profiles is directly related to the geometry of the row profiles in a number of ways, hence the name "correspondence analysis".

Let us thus consider N as a set of columns rather than a set of rows. A convenient way of thinking about this is that we apply all our discussion of Section 3.1 to the transpose of N. The masses of the column profiles are then equal to the elements of c, the centroid of the row profiles (cf. Table 3.2), and the centroid r of the column profiles is, symmetrically, the vector of masses of the row profiles. In the space of the column profiles the metric is defined in a symmetric fashion by D_r⁻¹ so that each dimension is again weighted inversely by the element of the average (or expected) profile. The triplet defining the dual problem is thus C, c and D_r⁻¹. The co-ordinates of the column profiles with respect to their optimal 2-dimensional subspace are thus provided by the rows of the 4 × 2 matrix G_(2):

G_(2) = M̃_(2) D_μ(2)    (3.2.1)

where M̃_(2) and D_μ(2) are the appropriate submatrices of the generalized SVD of C − 1rᵀ (or of C, omitting the first trivial dimension):

C − 1rᵀ = M̃ D_μ Ñᵀ  where M̃ᵀD_c M̃ = ÑᵀD_r⁻¹Ñ = I    (3.2.2)

We compute the matrix G_(2) of co-ordinates and the matrix Ñ_(2) defining the two principal axes in 5-dimensional space as:

G_(2) = [  0.393 −0.031
          −0.100  0.141
          −0.196  0.007
          −0.294 −0.198 ]    (3.2.3)

Ñ_(2) = [  0.014 −0.110
          −0.088 −0.226
           0.368 −0.028
          −0.388  0.263
           0.095  0.102 ]    (3.2.4)

The singular values are 0.2734, 0.1001 and 0.0203, exactly the same as those of the analysis of the row profiles, hence our use of the same notation μ for the singular values in both cases. Notice that 4 points occupy a 3-dimensional space.

As we shall show more formally in Chapter 4 the matrices G and Ñ are
related in a very simple way to matrices M and F respectively of Section 3.1 (see (3.1.3), (3.1.4)):

G = D_c⁻¹ M D_μ   or   M = D_c G D_μ⁻¹    (3.2.5)
F = D_r⁻¹ Ñ D_μ   or   Ñ = D_r F D_μ⁻¹    (3.2.6)

Corresponding columns may be multiplied by −1 if need be, in which case the agreement of signs is implicit.

TABLE 3.3
Column profiles of Table 3.1, that is the profiles of the smoking categories across the staff groups, and the average column profile.

                       (1) None  (2) Light  (3) Medium  (4) Heavy   Average
(1) Senior managers      0.066     0.044      0.048       0.080      0.057
(2) Junior managers      0.066     0.067      0.113       0.160      0.093
(3) Senior employees     0.410     0.222      0.194       0.160      0.264
(4) Junior employees     0.295     0.533      0.532       0.520      0.456
(5) Secretaries          0.164     0.133      0.113       0.080      0.130

FIG. 3.2. Graphical display of the column profiles of Table 3.3 with respect to the best-fitting plane (λ_1 = 0.0748, 87.8%; λ_2 = 0.0100, 11.7%).

The 2-dimensional graphical display of the points representing the 4 smoking types (the rows of G_(2)) is given in Fig. 3.2. Notice that the principal inertias and thus their percentages are identical to those of the corresponding Fig. 3.1.

Formulae (3.2.5) and (3.2.6) tell us that the co-ordinates of the profile points with respect to their principal axes in the one problem are related (by
3. Simple Illustrations ofCorrespondence Analysis 65
64 Theory and Applications ofCorrespondence Analysis
simple pre- and postmultiplication of diagonal matrices) to the actual principal axes of the profile points in the other problem, and vice versa. This symmetry of the two problems, along with the fact that the singular values, and thus their squares, the principal inertias, are the same in both problems, is the heart of the duality.

In practice we are usually not interested in the matrices M and Ñ which define the principal axes in the dual problems. In the displays of Figs 3.1 and 3.2 the co-ordinate system has become that of the principal axes and because we are interested in the relative positions of the points, say the row profiles in Fig. 3.1, the relationship between the new and the old co-ordinate systems in the row profile space is of secondary importance. However, based on (3.2.5) in this case, our interest in the position of the column profiles in Fig. 3.2 is seen to be related to that very change of co-ordinate system. This is an important point which needs careful consideration in order to understand fully the duality of the geometric concepts in correspondence analysis, where practically all of the entities are serving dual purposes.

Another way of writing (3.2.5) and (3.2.6), in terms of the co-ordinate matrices F and G only, is as follows:

G = CFD_μ⁻¹    (3.2.7)
F = RGD_μ⁻¹    (3.2.8)

These formulae, which we shall prove in Chapter 4, are known as the transition formulae, because they describe how to pass between the co-ordinate matrices of the two dual problems. As an illustration of the geometric implications of these formulae let us suppose that we know the co-ordinates of the column profiles with respect to their first two principal axes, that is we know matrix G_(2) of (3.2.3). The first row of R, the matrix of row profiles, is:

r_1ᵀ = [0.364 0.182 0.273 0.182]

(this is the profile of senior managers across the smoking categories). In terms of transition formula (3.2.8), the co-ordinates of this row profile with respect to the first two principal axes of the row profiles is given by the first row of F_(2):

f_1ᵀ = r_1ᵀ G_(2) D_μ(2)⁻¹ = (0.364g_1ᵀ + 0.182g_2ᵀ + 0.273g_3ᵀ + 0.182g_4ᵀ) D_μ(2)⁻¹    (3.2.9)

where g_1ᵀ ... g_4ᵀ are the rows of G_(2). The expression in parentheses is a barycentre of the 4 column profile points, since the sum of the elements of the profile vector r_1 is 1. The postmultiplication by D_μ(2)⁻¹ means that the co-ordinates of the resultant barycentre are divided by the singular values μ_1 and μ_2 respectively. Thus the co-ordinates of the first row profile are:

f_11 = [(0.364 × 0.393) + (0.182 × (−0.100)) + (0.273 × (−0.196)) + (0.182 × (−0.294))]/0.2734 = 0.065
f_12 = [(0.364 × (−0.031)) + (0.182 × 0.141) + (0.273 × 0.007) + (0.182 × (−0.198))]/0.1001 = −0.197

Allowing for rounding error these values agree with the first row of F_(2) in (3.1.3).

Geometrically it is clear that a particular row profile tends to a position (in its space) which corresponds to the smoking categories which are prominent in that row profile. For example, the "non-smoking" point, defined by the first column profile, lies on the positive side (0.393) of the first principal axis and any row profile which is relatively high on non-smokers will lie on the positive side of its first principal axis. The "expansion" of the co-ordinates by dividing by the respective singular values is necessary because there is a symmetric transition formula from the set of row profile points to the individual column profile points and the two sets of points cannot both be barycentres of each other. In our example transition formula (3.2.7) would mean that, given the display of the row profiles (the staff groups), a particular smoking category tends along principal axes in the direction of the staff groups which are relatively prominent in that category.

Because of the geometric correspondence of the two clouds of points, both in position as well as in inertia, the displays of Figs 3.1 and 3.2 may be merged into one joint display (Fig. 3.3). There are advantages and disadvantages of this simultaneous display. Clearly an advantage is the very concise graphical display expressing a number of different features of the data in a single picture. The display of each cloud of points indicates the nature of similarities and dispersion within the cloud, while the joint display indicates the correspondence between the clouds. Notice, however, that we should avoid the danger of interpreting distances between points of different clouds, since no such distances have been explicitly defined. Distances between points within the same cloud are defined in terms of the relevant chi-square distance, while the between-cloud correspondence is governed by the barycentric nature of the transition formulae, as described above.

Principle of distributional equivalence

A further advantage of the use of the chi-square distance and the resultant duality between the two clouds of points is called the "principle of distributional equivalence". This principle is very important to the French statisticians
3. Simple Illustrations ofCo"espondence Analysis 67
66 Theory and Applications ofCo"espondence Analysis
representation of the points, which are actually in 3-dimensional space,
A2=0.0100 because only 0.5 % of the total inertia of the points is not represented in this
(11.7%)
2-dimensional subspace. In practice, however, when we deal with much larger
data matrices, we seldom obtain such excellent 2-dimensional displays. If a
lillht smo~IO\I large percentage of the total inertia lies along other principal axes then it
•
(U) means that sorne points are not being wel1 represented with respect to the first
(se)
(JE) • two principal axes. Since the actual 2-dimensional display shows the projec
secretorias
junior
•
employees
tions of the true points onto the plane and does not show which points lie
medium smo~inll (ME) close to the plane and which are further off, we need to consider additional
senior eniployees e( SE) information if we are to interpret the display correctIy. Remember that we are
Al =0.0748
(87.8%)
•
no smo~lnll (NO) trying to understand the geometry of a set of high-dimensional points
through an approximate low-dimensional display and we must know where
the display is accurate and where no1. This is analogous to many other areas
of statistics, for example in constructing a model for data, where we must
(SM)
heavy' smo~inll • study both the model as well as the quality of fit of that model to the data,
,enior manollers
(HV) where the model fits the data well and where no1.
(JM)
•
junior manallers
Contributions to inertia
I scole I
01 To il1ustrate the principIes involved, let us suppose that we choose a 1
dimensional correspondence analysis of the data of Table 3.1, in other words
FIG.3.3. Correspondence analysis of the data in Table 3.1. with the points
displayed in the principal plane. This is the joint display of Figs 3.1 and 3.2. the final display is given by Fig. 3.4. This represents an approximate view of
the data and we know that it is still a very good overal1 view because 87.8 %
of the inertia is represented along this dimensiono We can informal1y interpret
who developed the technique in the context of linguistics (Benzécri, 1963). this dimension as separating the "smokers" on the left from the "non
Briefly, this principIe states that if two profiles, row profiles say, are identical smokers" on the righ1. More formal1y, however, we can quantify the part
("distributionally equivalent"), then these two rows of the original con played by each point in establishing this particular dimension as the first
tingency table may be added together to give a single roW without affecting principal axis. The inertia along this axis, 0.07475 = (0.2734)2, is equal to the
the geometry of the column profiles. A symmetric result is true for identical weighted sum of squared distances to the origin of the displayed row profiles
column profiles. Geometrical1y this means that we can merge two points of a or, equivalently, the corresponding weighted sum for the displayed column
cloud which lie at identical positions into a new point which has the mass of profiles, the weights being the masses of the respective points. Each term in
both points, and this does not afTect the geometry of the points in the other these sums can thus be expressed as a percentage of this first principal inertia,
doud. This unique result, which is peculiar to the geometry of correspondence and we cal1 these contributions by the points to the principal inertia or to the
analysis, is proved in Section 4.1.17. principal axis (Table 3.4). For example, the point "medium smoking" has a
mass of 0.321 and a distance from the centroid (origin of Fig. 3.4) of - 0.196
(cf. (3.2.3)). Its absolute contribution to the first principal inertia is thus
3.3 DECOMPOSITION OF INERTIA 0.321 x (-0.196? = 0.01237, which is 16.5 %of 0.07475. In this example we
see that the points representing the senior and junior employees contribute
In Sections 3.1 and 3.2 we have described the two dual problems which make over 80 % of this principal inertia of the row profiles, while amongst the
up a correspondence analysis, how the displays of the row and column column profiles the point "no smoking" contributes 65 %just by itself. If we
profiles are obtained and the reasons for merging these two displays into one. think of the points in their fixed positions in the two corresponding spaces as
The display in Fig. 3.3 represents the graphical result of the correspondence exerting forces of attraction for the principal axis by virtue of their positions
analysis of Table 3.1. In this particular example this is an almost exact
and their masses, then it is these points with high contributions which have played the major role in causing the final orientation of the principal axis. Here we notice that it is the points with highest mass which have contributed the most to the first axis. This is not surprising, since the principal axis tends more towards the higher mass points. Yet there are often cases which we shall consider in later applications where a point has fairly low mass, but nevertheless a high contribution to inertia because of its large distance from the centroid.

TABLE 3.4. Decomposition of the inertia along the first principal axis: contributions of the row and column points to the inertia of the axis, and relative contributions (squared cosines) of the axis to the inertia of each point.

FIG. 3.4. 1-dimensional correspondence analysis of the data in Table 3.1 (i.e. points in Fig. 3.3 with respect to the horizontal axis).

Based on the positions of the points on the first principal axis (Fig. 3.4) and knowledge of which points have contributed the most to this axis, we can assign some descriptive name to the axis to guide us in our interpretation. In this case the axis clearly lines up the groups in terms of their level of smoking. In fact, to be precise this is rather in terms of their degree of non-smoking, because of the particularly high contribution of the point "no smoking". The actual level of smoking, whether light, medium or heavy, does not play as large a role as the distinction between smokers and non-smokers.

FIG. 3.5. Co-ordinate f_ik of the ith row profile with respect to the kth principal axis. If the row profile point is at a distance d_i from the centroid c then the angle θ it subtends with the axis is given by cos θ = f_ik/d_i. The quantity cos²θ is called the (relative) contribution of axis k to the ith point.

Having interpreted this dimension of the points we would like to know how close each point lies to this 1-dimensional subspace. For example, we can look at the angle θ between the true profile point vector and the principal axis (Fig. 3.5). It is convenient to examine the squared cosine of this angle because
for each point the squared cosines of these angles with the full set of orthogonal principal axes add up to 1. Another way of describing this is that the inertia r_i d_i² of the profile point vector (that is the ith row profile with mass r_i and distance d_i from the centroid) is decomposed along the principal axes. The part of this inertia along the first axis, say, is r_i f_i1², where f_i1 is the co-ordinate of the point on this axis. Expressed as a proportion of the point's total inertia this is r_i f_i1²/(r_i d_i²) = (f_i1/d_i)² = cos²θ. The amount cos²θ is thus called the contribution of the axis to the inertia of the point. If cos²θ is high, then the axis explains the point's inertia very well; equivalently θ is low and the profile vector is said to lie in the direction of the axis, or "correlate" with the axis. The values of cos²θ, which are also called the relative contributions because they are independent of the mass of the point, are also given in Table 3.4 as well as the angle between each point and the principal axis.

For example, the total inertia of the point "medium smoking" is 0.321 × 0.0392 = 0.0126, where 0.0392 is the squared chi-square distance between the true profile of "medium smoking" and the average profile, given in Table 3.3, computed with the inverses of the average profile elements as weights: 0.0392 = (0.048 - 0.057)²/0.057 + (0.113 - 0.093)²/0.093 + ... etc. From Table 3.4 the part of the inertia displayed on the first axis is 0.01237, thus the value of cos²θ is 0.01237/0.0126 = 0.981. This very high value indicates that the point "medium smoking" is practically on the first principal axis and there is hardly any error in its display in Fig. 3.4. The other relative contributions indicate the quality of representation of each individual point. Generally a high contribution of the point to the inertia of the axis implies a high relative contribution of the axis to the inertia of the point, but not conversely. The point "secretaries" on the first axis is extremely well represented, but its contribution to the axis is minimal. The point "senior managers", on the other hand, is poorly represented here and its position is almost orthogonal to the first principal axis.

In Section 4.1.11 we shall show how the total inertia of a data matrix may be decomposed in several different ways, in much the same spirit as the decomposition of the sum of squares in analysis of variance. The different decompositions lead to the various definitions of contributions to inertia which we have introduced above.

Supplementary profiles

A very important concept in correspondence analysis (as well as in all the displays based on the SVD, described in Appendix A) is that of supplementary rows and columns which are represented on an existing display. In the context of our simple example, suppose that the percentages of non-smokers, light, medium and heavy smokers are reported by the nationwide survey to be 42%, 29%, 20% and 9% respectively in the country as a whole. This set of values defines a point in the space of the row profiles of our example and it is a simple matter to represent this point in the existing display by projecting the point perpendicularly onto the plane. To evaluate its co-ordinates, f_s say, we use the relevant transition formula (3.2.9):

f_s^T = (0.42 g_1^T + 0.29 g_2^T + 0.20 g_3^T + 0.09 g_4^T) D_μ(2)^{-1} = [0.258  0.118]

The supplementary point is displayed in Fig. 3.6 and is seen to lie relatively far from the centroid of the points, approximately midway between the points representing "no smoking" and "light smoking". This shows that our sample consists of generally high proportions of smokers compared to the nationwide average. Of course we can actually see this fact quite clearly by inspecting the data but the display of the points is much more informative and lends itself conveniently to making comparisons and identifying patterns in the profiles.

FIG. 3.6. Display of a supplementary row profile ("nationwide average") and two supplementary column profiles ("drinking" and "non-drinking") in the analysis of Fig. 3.3 (supplementary data given in Table 3.5).

In a symmetric fashion we could display supplementary column profiles, this time using the transition formula (3.2.7) from rows to columns. For example, we might have an additional classification of the sample in terms of whether a person consumes alcoholic beverages or not. Table 3.5 shows our
original data, with two extra columns showing how the sample is divided according to this question. Each column defines a column profile in the same space as the profiles of the smoking categories across the staff groups and can be projected perpendicularly onto the plane of the first two principal axes. Their positions are also shown in Fig. 3.6. Because of the orientation of these two points, more in line with the second principal axis than the first, we can see that in our sample there is no strong association between non-drinking and non-smoking. However, the alignment of the points does suggest a possible association between drinking and level of smoking amongst the smokers, with relatively more drinkers in the high smoking group.

TABLE 3.5. The data of Table 3.1 with supplementary data: the nationwide percentages of the smoking categories (supplementary row) and the numbers of drinkers and non-drinkers in each staff group (supplementary columns).

Notice that we are not making any statements about the statistical significance of the associations and patterns observed in these graphical displays. The displays are simply representations of the data where we can view the data in a form which is more convenient for interpretation (cf. Section 8.1, where the stability of these displays is discussed).

We can again compute relative contributions (squared angle cosines) of each of the supplementary points in order to quantify how well the points are displayed. The point representing the nationwide average has relative contributions 0.631 and 0.131 respectively by axes 1 and 2. Adding these together we obtain 0.762 as the relative contribution of the plane to the point, which we call the quality of the (planar) display of the point, in other words the squared cosine of the angle the point subtends with the plane (this simple geometric result is illustrated in Fig. 3.7). The two points representing the drinking categories subtend the same angles with the axes and the plane, in fact they are joined by a straight line through the origin. Their relative contributions show that this pair of points is much more associated with the second axis than with the first.

Because supplementary rows and columns do not play any part in defining the chi-square distance function nor in determining the principal axes, the contribution by these points to the axes is not really defined. It is convenient to think of supplementary profiles in a correspondence analysis as points with zero mass. Thus they really have no inertia at all in the analysis and do not attract the axes in any way, yet they have positions which can be examined relative to the principal axes of the points with positive mass. Often the supplementary points do not have any natural mass in the context of a particular example, for example the supplementary row of Table 3.5. On the other hand, if a point does have a natural mass, then the value of its inertia can be examined relative to the inertia of a principal axis. For example, the two supplementary columns of Table 3.5 represent a partition of the same sample of 193 people and their masses (0.119 and 0.881 respectively) are
comparable to those of the smoking category profiles. Their inertias in the direction of the second axis are 0.01562 and 0.00211 respectively, giving a combined inertia of 0.01773 which is well over the inertia (0.01001) of all the smoking points in this direction. This would indicate to us that these points would have a large attraction on the present second axis if they were allowed to enter the analysis. If we remember that the positions of these points are more in a third dimension than in the plane, then including them in the analysis would re-orientate the second axis to line up more with this "drinking"/"non-drinking" dimension.

FIG. 3.7. Illustration that cos²θ_1 + cos²θ_2 + cos²θ_3 = 1 and that the angle θ between the profile point vector and the plane of the first two axes, say, is given by: cos²θ = cos²θ_1 + cos²θ_2.

Table 3.6 represents the complete numerical output of a correspondence analysis, in the output format of the correspondence analysis computer program by Tabet (1973). Notice that in order to eliminate decimal points and thus facilitate printing and examination, all quantities of a relative nature are multiplied by 1000 and expressed as integers. The co-ordinates, too, are multiplied by 1000 and rounded to integers (see the Table legend for details).

TABLE 3.6. Complete numerical output of the correspondence analysis of Table 3.1. (a) Principal inertias (eigenvalues), with percentages of the total inertia; (b) printout for the row points (for example, the CTR value 214 of "SM" on axis 2 means that its absolute contribution to that axis is 0.214 × 0.01001718; the quality (QLT) of representation of this point in the two-dimensional display is 0.892, the squared correlation (cosine) with the plane, which is the sum of the individual squared correlations (cf. Fig. 3.7)); (c) similar printout for the column points.

(a)
axis 1   0.0748        (87.8%)
axis 2   0.01001718    (11.8%, cum. 99.5%)
axis 3   0.00041358    ( 0.5%, cum. 100.0%)
Total    0.08518985

(b)
Name    QLT  MASS  INR    K=1  COR  CTR    K=2  COR  CTR
(1) SM   892    57   31     66   92    3   -193  800  214
(2) JM   991    93  139   -258  526   84   -242  465  551
(3) SE  1000   264  450    381  999  512    -10    1    3
(4) JE  1000   456  308   -232  942  331     58   58  152
(5) SC   998   130   71    201  865   70     79  133   81

(c)
Name    QLT  MASS  INR    K=1  COR  CTR    K=2  COR  CTR
(1) NO  1000   316  577    393  994  654    -29    6   29
(2) LI   984   233   83    -98  327   31    141  657  463
(3) ME   983   321  148   -195  982  166      7    1    2
(4) HV   994   130  192   -293  684  150   -197  310  506

From our description above, the contributions are interpreted in two ways. First, for each principal axis (dimension) we look down the column headed CTR in order to interpret the dimension. Secondly, for each point we scan across the values in the COR columns to identify the axes which represent the
point well. The values in the QLT column (sum of the COR columns) summarize the quality of representation of the points in the subspace of chosen dimensionality. This table of contributions supports the interpretation of the graphical output of the analysis (cf. the applications in Chapter 9).

3.4 FURTHER ILLUSTRATION OF THE GEOMETRY

In the simple illustration of correspondence analysis described in the previous sections, we saw that the row and column profiles, vectors of order 4 and 5 respectively, actually lie exactly in a 3-dimensional subspace. In fact the exact dimensionality of the set of row and column profiles of a data matrix in correspondence analysis is always 1 less than the minimum of the number of rows and the number of columns. We shall illustrate this fact using a matrix with 18 rows and 3 columns, which is thus of dimensionality 2. Because the example is exactly 2-dimensional, this allows us to illustrate clearly the effect of the dimension weighting as well as other aspects of the geometry.

The example concerns a large company which has three types of client, which we refer to as clients A, B and C respectively. A random sample of clients in each of these categories is drawn and amongst the information of interest there are 5 demographic variables: sex, marital status, age group, income group and region of residence. The data matrix of interest is given in Table 3.7 and shows the breakdown of the 3 types of customer across the categories, 18 in all, of these variables. This matrix is actually composed of 5 contingency tables, each partitioning the sample of over 8000 people according to a different categorization.

TABLE 3.7
Frequencies of 3 types of client (columns) within 18 categories (rows) of 5 demographic variables.

                                  A      B      C
Sex              Male          2444    853   1547
                 Female         712    551    923
Marital status   Unmarried      523    290    519
                 Married       2630   1106   1946
Age              A1 (16-24 yr)  189     70    136
                 A2 (25-34 yr)  796    133    444
                 A3 (35-49 yr) 1100    314    706
                 A4 (50+ yr)   1070    882   1187
Income           I1 (lowest)    273    336    427
                 I2            1005    422    739
                 I3            1049    305    609
                 I4 (highest)   767    250    612
Regions          R1             436    142    315
                 R2             843    226    494
                 R3             243     84    453
                 R4             346    145    248
                 R5             775    584    708
                 R6             519    226    263

The row profiles are 3-vectors with elements summing to 1, for example amongst the 4844 males in the sample the proportions of the 3 types of client are 0.50, 0.18 and 0.32 respectively. If we depict this point and the other profile points in 3-dimensional space in the usual way, we see that they all lie in an equilateral triangle whose vertices are at the points [1 0 0]^T, [0 1 0]^T and [0 0 1]^T (Fig. 3.8(a)). This triangle represents all 3-vectors whose elements are non-negative and add up to 1, with the vertices representing the 3 most "polarized" profiles. Thus we can take the triangle out of the space and represent all the row profiles exactly in a 2-dimensional triangle (Fig. 3.8(b)). This is in fact the triangular (or barycentric) co-ordinate system which is often used to represent data consisting of sets of three values with a constant sum, for example the percentage composition of three constituents of soil samples (see Jöreskog et al., 1976, pp. 93 and 160 for examples in geology).

The relative proportions of the three types of client in the sample are roughly 0.45, 0.20 and 0.35 respectively and hence this is the centroid of all the row profiles. Notice that this is also the centroid of each subset of points corresponding to the categories of one of the variables. (There are slight deviations amongst each subset because some people refuse to respond to some questions, especially the question on income.)

This triangular co-ordinate system is an ordinary Euclidean display of the row profiles and the display of the points in this system is not quite the correspondence analysis display, where we know that the dimensions are weighted inversely by the centroid values, thus defining chi-square distances between the points. The easiest way to think of the dimension weighting is to imagine the triangle to be elastic, then stretched differentially along its 3 sides to form a new triangle, no longer equilateral but with sides inversely proportional to the square roots of the centroid values: 1/(0.45)^{1/2} = 1.49, 1/(0.20)^{1/2} = 2.24 and 1/(0.35)^{1/2} = 1.69 (Fig. 3.9). The side of the triangle most stretched is thus the one corresponding to the least frequent type of client, client C, and the side least stretched corresponds to the most frequent type, client A. The points are situated in this new triangular co-ordinate system in the same way as before, using vectors parallel to the 3 sides.
FIG. 3.8. (a) Position of the profile point "males" in the 2-dimensional triangle. (b) The triangular (or barycentric) co-ordinate system of the row profiles of Table 3.7, showing the positions of "males" and "females" as well as the row centroid.

FIG. 3.9. The "stretched" barycentric co-ordinate system which is the weighted Euclidean space of the correspondence analysis. Distances between the points are chi-square distances.
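The stretching of Fig. 3.9 can be checked numerically: the chi-square distance between two profiles is just the ordinary Euclidean distance after each co-ordinate has been divided by the square root of the corresponding centroid element. A minimal sketch in Python, using the "males" and "females" rows of Table 3.7 and the rounded centroid values quoted in the text:

```python
import numpy as np

# Centroid profile (rounded values quoted in the text for Table 3.7)
c = np.array([0.45, 0.20, 0.35])

# "Males" and "females" row profiles computed from Table 3.7
males = np.array([2444., 853., 1547.])
females = np.array([712., 551., 923.])
a = males / males.sum()
b = females / females.sum()

# Chi-square distance: weighted Euclidean distance with weights 1/c_j
d_chi2 = np.sqrt((((a - b) ** 2) / c).sum())

# Equivalently: ordinary Euclidean distance after stretching each axis
# by 1/sqrt(c_j), as in the elastic triangle of Fig. 3.9
stretch = 1.0 / np.sqrt(c)        # approximately [1.49, 2.24, 1.69]
d_euclid = np.linalg.norm(a * stretch - b * stretch)

assert np.isclose(d_chi2, d_euclid)
```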
Distances in this new display are the actual chi-square distances of the correspondence analysis. Hence correspondence analysis of the row profiles can be considered as finding the principal axes of the points (with associated masses) in this new triangular co-ordinate system.

Figure 3.10 is the joint graphical display of the row and column profiles in the correspondence analysis of Table 3.7. The positions of the column points in Fig. 3.10 are also in so-called "principal co-ordinates", that is with respect to the principal axes of their own cloud of points. Certain peculiarities of the analysis, for example the Guttmann effect (or "horseshoe" effect), are described in Section 8.3.

FIG. 3.10. Correspondence analysis of Table 3.7: joint display of the 18 demographic categories and the 3 types of client in the principal plane (λ_2 = 0.0083, 25.1% of the inertia).

In conclusion let us briefly interpret the display of Fig. 3.10. Because this is an exact display we do not really need the tables of contributions unless the principal axes are individually of interest. Our interpretation must be in terms of the positions of the categories of each variable relative to the three points representing the 3 types of client. It is interesting to note that the age groups lie more or less on a straight line approximately orthogonal to the vector representing clients C. Thus there is very little difference in proportions of clients C across the age groups, but rather an interchange between clients B (proportionally high in the oldest age group) and clients A (proportionally high in younger age groups, especially the 25-34 years group). Other interesting facts displayed in the analysis are that the lowest income group has high proportions of clients B and C, while the highest income group has high proportions of clients A and C (when we say "high" we mean high relative to the centroid profile which is [0.45 0.20 0.35]^T). Notice that we must interpret different row profiles strictly in terms of their profiles of client types. Thus the proximity of the lowest income point and the point representing females means that they have similar usage of the company's services and not that females have generally a low income. In order to investigate this latter possibility we would need to know the contingency table crossing the sex and income variables.

3.5 EXAMPLES

3.5.1 Rescaling of the co-ordinates and metric

Suppose we have the following triplet defining I points, with masses, in weighted Euclidean space: the points are the rows of the matrix Y, their masses are the diagonal elements of D_w and the metric of the space is defined by D_q. Let Y have the generalized SVD:

Y = [n_0 N] [μ_0 0^T; 0 D_μ] [m_0 M]^T    (3.5.1)

where

[n_0 N]^T D_w [n_0 N] = [m_0 M]^T D_q [m_0 M] = I    (3.5.2)

The columns of M define principal axes and the rows of F ≡ N D_μ define the co-ordinates of the rows with respect to the principal axes. Now suppose that the co-ordinates, the masses and the metric are rescaled: Y* = βY, D_w* = γD_w and D_q* = αD_q. We can write (3.5.1) as:

βY = (1/γ^{1/2})[n_0 N] (βγ^{1/2}α^{1/2}[μ_0 0^T; 0 D_μ]) (1/α^{1/2})[m_0 M]^T

i.e.

Y* = [n_0* N*] [μ_0* 0^T; 0 D_μ*] [m_0* M*]^T    (3.5.3)

where the matrices on the right-hand side have been rescaled by the respective scaling factors preceding them above. The rescaling implies that

[n_0* N*]^T D_w* [n_0* N*] = [m_0* M*]^T D_q* [m_0* M*] = I    (3.5.4)

so that (3.5.3) and (3.5.4) define the complete principal axes geometry of the rows of Y* (the inverse relationship of dimension weights to centroid elements is still true). Thus the columns of

M* = (1/α^{1/2})M    (3.5.5)

define the principal axes, and the rows of

F* ≡ N* D_μ* = (βα^{1/2})N D_μ = (βα^{1/2})F    (3.5.6)

are the co-ordinates of the rows with respect to these principal axes.

3.5.2 Correspondence analysis as two dual principal co-ordinates analyses

Principal co-ordinates analysis accepts a square symmetric matrix of squared distances between a set of objects and produces displays of the objects in subspaces of chosen dimensionality (cf. Appendix A). Show that the principal co-ordinates analysis of the matrix of chi-square distances between a set of profiles, where each profile is weighted by its usual mass, yields the same solution as the correspondence analysis of the profiles.

Proof

Let us suppose that the profiles are row profiles contained in the I × J matrix R. The matrix Δ of squared chi-square distances is given by: Δ = s1^T + 1s^T - 2S, where S is the matrix of chi-square scalar products with respect to any origin (for example: S ≡ R D_c^{-1} R^T) and s is the vector of diagonal elements of S (cf. Example 2.6.1(a)).

As in Example 2.6.1, the first stage of a principal co-ordinates analysis, namely the operation -½CΔC^T (where C ≡ I - 1r^T) recovers the chi-square scalar products with respect to the centroid profile r^T R = c^T:

-½CΔC^T = (R - 1c^T) D_c^{-1} (R - 1c^T)^T

The second stage of a principal co-ordinates analysis, where the points are weighted by the elements of r, is to compute the eigendecomposition of D_r^{1/2}(-½CΔC^T)D_r^{1/2}, say V D_λ V^T, and then display the objects in principal co-ordinates by the rows of D_r^{-1/2} V D_λ^{1/2}. This is the same as the correspondence analysis solution (in principal co-ordinates), given by (3.1.1) and (3.1.2), because (from (3.1.2)):

D_r^{1/2}(R - 1c^T) D_c^{-1} (R - 1c^T)^T D_r^{1/2} = D_r^{1/2} N D_μ M^T D_c^{-1} M D_μ N^T D_r^{1/2}
                                                 = D_r^{1/2} N D_μ² N^T D_r^{1/2}

since M^T D_c^{-1} M = I. Hence:

D_r^{1/2} N D_μ² N^T D_r^{1/2} = V D_λ V^T

i.e.

(N D_μ)(N D_μ)^T = (D_r^{-1/2} V D_λ^{1/2})(D_r^{-1/2} V D_λ^{1/2})^T

and the result follows by the uniqueness of the eigendecomposition.

4  Theory of Correspondence Analysis and Equivalent Approaches

In this chapter we first present a formal treatment of the algebra of correspondence analysis (Section 4.1). We then discuss various alternative approaches that have originated in different contexts: reciprocal averaging (Section 4.2), dual scaling (Section 4.3), canonical correlation analysis of contingency tables (Section 4.4) and simultaneous linear regressions (Section 4.5). As discussed in Chapter 1, all these techniques share the same numerical procedure to arrive at their respective solutions, but they each have completely different rationales and interpretations. The uniqueness of the rationale and interpretation of correspondence analysis lies in the multidimensional geometric framework in which the problem is cast. A number of examples once again concludes the chapter, serving to complement and illustrate the chapter's material (Section 4.6).
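As a preview of that shared numerical procedure, reciprocal averaging (Section 4.2) can be sketched in a few lines of Python on an invented 5 × 4 table: row scores are repeatedly computed as weighted averages of the column scores and vice versa, with re-standardization at each step, and after convergence one averaging pass shrinks the scores by the first non-trivial singular value (the first canonical correlation) of the table.

```python
import numpy as np

# Invented contingency table, purely to exercise the algorithm
N = np.array([[ 4.,  2.,  3.,  2.],
              [ 4.,  3.,  7.,  4.],
              [25., 10., 12.,  4.],
              [18., 24., 33., 13.],
              [10.,  6.,  7.,  2.]])

r = N.sum(axis=1)          # row totals
c = N.sum(axis=0)          # column totals

# Reciprocal averaging: alternate between scoring rows as weighted
# averages of column scores and vice versa, re-standardizing each time
x = np.arange(N.shape[1], dtype=float)     # arbitrary starting column scores
for _ in range(1000):
    y = (N @ x) / r                        # row scores = averages of column scores
    x = (N.T @ y) / c                      # column scores = averages of row scores
    x -= (c @ x) / c.sum()                 # remove the trivial (constant) solution
    x /= np.sqrt((c * x**2).sum() / c.sum())   # re-standardize

# After convergence, one more averaging pass shrinks the standardized
# scores by the first non-trivial singular value of the table
y = (N @ x) / r
shrink = np.sqrt((r * y**2).sum() / r.sum())
print(f"first canonical correlation = {shrink:.4f}")
```

This is exactly the power iteration underlying the SVD computation of correspondence analysis, which is why all the approaches listed above arrive at the same numerical solution.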
4.1.1. Suppose N is a matrix of non-negative numbers, such that its row and column sums are non-zero. The correspondence matrix P is defined as the matrix of elements of N divided by the grand total of N. The vectors of row and column sums of P are denoted by r and c respectively and the diagonal matrices of these sums by D_r and D_c respectively.

Data matrix:
N (I × J) ≡ [n_ij],  n_ij ≥ 0
Correspondence matrix:
P ≡ (1/n..)N, where n.. = 1^T N 1    (4.1.1)
Row and column sums:
r ≡ P1 and c ≡ P^T 1,    (4.1.2)
where r_i > 0 (i = 1 ... I), c_j > 0 (j = 1 ... J)
D_r ≡ diag(r) and D_c ≡ diag(c)    (4.1.3)

Comment: The sum of the elements of P is 1. When N is a contingency table, P can be considered to be a probability density on the cells of the I × J matrix, and r and c the marginal densities. This is only an analogy when N is another type of matrix. Notice that 1 ≡ [1 ... 1]^T denotes an I-vector or a J-vector of ones, its order being deduced from the particular context.

4.1.2. The row and column profiles of P (equivalently of N) are defined as the vectors of rows and columns of P divided by their respective sums.

Row profiles:  R ≡ D_r^{-1}P    Column profiles:  C ≡ D_c^{-1}P^T    (4.1.4)

Comment: It is slightly easier notationally to work with P than with N, since correspondence analysis is only concerned with the relative values of the data and is thus invariant with respect to n...

4.1.3. The row and column profiles define two clouds of points in respective J- and I-dimensional weighted Euclidean spaces.

Row cloud:
  Points: the I row profiles r̃_1 ... r̃_I in J-dimensional space
  Masses: the I elements of r
  Metric: weighted Euclidean, with dimension weights defined by the inverses of the elements of c (chi-square metric), that is D_c^{-1}

Column cloud:
  Points: the J column profiles c̃_1 ... c̃_J in I-dimensional space
  Masses: the J elements of c
  Metric: weighted Euclidean, with dimension weights defined by the inverses of the elements of r (chi-square metric), that is D_r^{-1}

Comment: The terms "distance" and "metric" are synonymous here. The chi-square metric is an example of a diagonal metric defined by the distance function (2.3.3). The diagonal matrix involved in the distance and scalar product is itself often referred to as the metric, for example the metric D_c^{-1}.

4.1.4. The centroids of the row and column clouds in their respective spaces are c and r respectively.

Row centroid:  c = R^T r        Column centroid:  r = C^T c    (4.1.5)
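The definitions (4.1.1)–(4.1.5) translate directly into a few matrix computations. The following sketch (our own illustration in Python/numpy, using the data of Table 4.1; the variable names are ours, not the book's) builds the correspondence matrix, masses and profiles, and verifies that the centroids of the two clouds are c and r:

```python
import numpy as np

# Data of Table 4.1 (equivalently Table 3.1)
N = np.array([[4, 2, 3, 2],
              [4, 3, 7, 4],
              [25, 10, 12, 4],
              [18, 24, 33, 13],
              [10, 6, 7, 2]], dtype=float)

n = N.sum()                   # grand total n..
P = N / n                     # correspondence matrix (4.1.1)
r = P.sum(axis=1)             # row masses    r = P 1     (4.1.2)
c = P.sum(axis=0)             # column masses c = P^T 1   (4.1.2)
R = P / r[:, None]            # row profiles    R = D_r^{-1} P    (4.1.4)
C = (P / c[None, :]).T        # column profiles C = D_c^{-1} P^T  (4.1.4)

# Centroids (4.1.5): the mass-weighted averages of the profiles
assert np.allclose(R.T @ r, c)    # centroid of the row cloud is c
assert np.allclose(C.T @ c, r)    # centroid of the column cloud is r
```

The broadcasting divisions are just the diagonal rescalings D_r^{-1} and D_c^{-1} written without forming the diagonal matrices explicitly.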
4.1.5. The overall spatial variation of each cloud of points is quantified by its total inertia, that is the weighted sum of squared distances from the points to their respective centroids, the masses and the metric being defined in 4.1.3.

Total inertia of row points:
in(I) = Σ_i r_i (r̃_i − c)^T D_c^{-1} (r̃_i − c)
Total inertia of column points:
in(J) = Σ_j c_j (c̃_j − r)^T D_r^{-1} (c̃_j − r)    (4.1.6)

i.e.

in(I) = trace[D_r (R − 1c^T) D_c^{-1} (R − 1c^T)^T]
in(J) = trace[D_c (C − 1r^T) D_r^{-1} (C − 1r^T)^T]    (4.1.7)

4.1.6. The total inertia is the same in both clouds and is equal to the mean square contingency coefficient calculated on N, that is the chi-square statistic for "independence" divided by the grand total n.. (calculated as if N were a contingency table).

in(I) = in(J) = Σ_i Σ_j (p_ij − r_i c_j)^2 / (r_i c_j) = χ²/n..

i.e.

= trace[D_r^{-1} (P − rc^T) D_c^{-1} (P − rc^T)^T]    (4.1.8)

where

χ² = Σ_i Σ_j (n_ij − e_ij)^2 / e_ij

and e_ij ≡ n_i. n_.j / n.., the "expected" value in the (i,j)th cell of the matrix based on the row and column marginals n_i. and n_.j.

Proof: From (4.1.6) we have:

in(I) = Σ_i r_i Σ_j (p_ij/r_i − c_j)^2 / c_j = Σ_i Σ_j (p_ij − r_i c_j)^2 / (r_i c_j)

and

in(J) = Σ_j c_j Σ_i (p_ij/c_j − r_i)^2 / r_i = Σ_i Σ_j (p_ij − r_i c_j)^2 / (r_i c_j)

Hence in(I) = in(J). In the χ² formula n_ij = n.. p_ij and thus the "expected" value in a cell is:

e_ij ≡ (Σ_j n_ij)(Σ_i n_ij)/n.. = (n.. r_i)(n.. c_j)/n.. = n.. r_i c_j

so that χ²/n.. = Σ_i Σ_j (p_ij − r_i c_j)^2/(r_i c_j), as required.

Principal axes

4.1.7. Let the generalized SVD of P − rc^T be:

P − rc^T = A D_μ B^T where A^T D_r^{-1} A = B^T D_c^{-1} B = I,  μ_1 ≥ ... ≥ μ_K > 0    (4.1.9)

Then the columns of A and B define the principal axes of the column and row clouds respectively.

Proof: Let us consider the cloud of row points defined by the row profiles in R ≡ D_r^{-1}P, with associated masses in the diagonal of D_r and in weighted Euclidean space defined by the diagonal metric D_c^{-1}. From Section 2.5 we know that the principal axes as well as the co-ordinates of the row profiles with respect to these axes are obtainable from the generalized SVD of R − 1c^T (the centred row profiles), where the left and right singular vectors are orthonormalized with respect to D_r and D_c^{-1} respectively. That is, if:

D_r^{-1}P − 1c^T = L D_φ M^T where L^T D_r L = M^T D_c^{-1} M = I    (4.1.10)
4. Theory and Equivalent Approaches 89
88 Theory and Applications of Correspondence Analysis
then the columns of M define the principal axes and the rows of L D_φ define the co-ordinates. If we multiply (4.1.10) on the left by D_r we obtain:

P − rc^T = (D_r L) D_φ M^T where (D_r L)^T D_r^{-1} (D_r L) = M^T D_c^{-1} M = I    (4.1.11)

which is in the form of (4.1.9) and shows that the columns of M (the principal axes) are identical to those of B. It follows in a similar and symmetric fashion that the principal axes of the column cloud, which are defined in I-dimensional space by the right singular vectors of C − 1r^T in the following decomposition:

D_c^{-1}P^T − 1r^T = Y D_ψ Z^T where Y^T D_c Y = Z^T D_r^{-1} Z = I    (4.1.12)

are identical to the columns of A.

Comment: Notice that the sets of singular values μ_1 ... μ_K in (4.1.9), φ_1 ... φ_K in (4.1.10) and ψ_1 ... ψ_K in (4.1.12) are identical. We tacitly assume that each singular value is different, in which case the singular vectors are uniquely defined up to reflections only (see Appendix A). Strictly speaking, then, we should say that the principal axes M of the row cloud are identical to the columns of B up to reflections. If there are equal singular values amongst the first K* then the corresponding columns of M and B are only identical up to reflections and rotations; however, the subspace defined by M is the same as that of B, which is what we are really interested in. The real problem is when the K*th and (K* + 1)th singular values are the same, in which case the K*-dimensional subspace is not uniquely defined in its K*th dimension. Even though this will never occur exactly in practice, there are nevertheless practical issues of stability which crop up when singular values are almost equal, which we shall discuss later in Chapter 8 and in the course of various applications.

4.1.8. The respective co-ordinates of the row and column profiles with respect to their own principal axes (i.e. the principal co-ordinates) are related to the principal axes of the other cloud of profiles by simple rescalings.

Co-ordinates of the row profiles with respect to the principal axes B (in the chi-square metric D_c^{-1}), and of the column profiles with respect to the principal axes A (in the chi-square metric D_r^{-1}):

F ≡ (D_r^{-1}P − 1c^T) D_c^{-1} B        G ≡ (D_c^{-1}P^T − 1r^T) D_r^{-1} A    (4.1.13)

Then:

F = D_r^{-1} A D_μ        G = D_c^{-1} B D_μ    (4.1.14)

Proof: Let us consider the co-ordinates of the row profiles, for example. Notice that, because the principal axes B are orthonormal (B^T D_c^{-1} B = I), these co-ordinates are just the scalar products of the centred profiles R − 1c^T with B (cf. Section 2.3), hence our definition (4.1.13). We can show (4.1.14) in two equivalent ways. The direct proof is to rewrite (4.1.13) as follows, for example:

F = D_r^{-1} (P − rc^T) D_c^{-1} B    (4.1.15)

(using 1 = D_r^{-1} r). Multiplying the generalized SVD (4.1.9) of P − rc^T on the right by D_c^{-1} B we obtain:

(P − rc^T) D_c^{-1} B = A D_μ

hence the expression (4.1.15) becomes F = D_r^{-1} A D_μ, the desired result. An alternative proof is possible, since we know from (4.1.10) and (4.1.11) that F = L D_φ, where D_r L is the matrix A and D_φ = D_μ. This immediately gives the desired result. The symmetric result G = D_c^{-1} B D_μ is similarly proved by either of the above arguments.

Comment: The expressions (4.1.14) define the co-ordinates of the row and column profiles with respect to all the principal axes (the co-ordinates of individual points are contained in the rows of F and G). The co-ordinates of the points with respect to an optimal K*-dimensional subspace are contained in the rows of the first K* columns of F and G. For example, if we write F(2) and G(2) as the first two columns of F and G respectively, then the rows of F(2) and G(2) define the projections of the row and column profiles onto respectively optimal planes.
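In practice the generalized SVD (4.1.9) is obtained from an ordinary SVD of the rescaled matrix D_r^{-1/2}(P − rc^T)D_c^{-1/2} (cf. Appendix A). The sketch below (our own, in numpy, continuing the data of Table 4.1) recovers A, B and D_μ this way and checks the definition (4.1.13) against the rescaling result (4.1.14):

```python
import numpy as np

N = np.array([[4, 2, 3, 2], [4, 3, 7, 4], [25, 10, 12, 4],
              [18, 24, 33, 13], [10, 6, 7, 2]], dtype=float)
P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)

# Ordinary SVD of D_r^{-1/2} (P - r c^T) D_c^{-1/2}
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, mu, Vt = np.linalg.svd(S, full_matrices=False)

A = np.sqrt(r)[:, None] * U       # satisfies A^T D_r^{-1} A = I
B = np.sqrt(c)[:, None] * Vt.T    # satisfies B^T D_c^{-1} B = I
assert np.allclose(A @ np.diag(mu) @ B.T, P - np.outer(r, c))   # (4.1.9)

# Principal co-ordinates via (4.1.14)
F = (A * mu) / r[:, None]         # F = D_r^{-1} A D_mu
G = (B * mu) / c[:, None]         # G = D_c^{-1} B D_mu

# ... agree with the scalar-product definition (4.1.13)
F_def = (P / r[:, None] - c) @ (B / c[:, None])
assert np.allclose(F, F_def)
```

Note that the last singular value is numerically zero: centering P removes one dimension, as stated in 4.1.12.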
Comment: Thus the jth row of G is equal to a barycentre c̃_j^T F of the rows of F followed by an expansion in scale of 1/μ_k on the kth dimension, for k = 1 ... K. The coefficients of the barycentre are the elements of the column profile c̃_j, the jth row of C. Symmetrically the ith row of F is equal to a barycentre r̃_i^T G of the rows of G followed by similar scale expansions, where the coefficients of the barycentre are the elements of the row profile r̃_i, the ith row of R.

4.1.10. With respect to the principal axes, the respective clouds of row and column profiles have centroids at the origin. The weighted sum of squares of the points' co-ordinates (i.e. weighted variance or (moment of) inertia) along the kth principal axis in each cloud is equal to μ_k^2, which we denote by λ_k and call the kth principal inertia. The weighted sum of cross-products of the co-ordinates (or weighted covariance) is zero.

Centroid of rows of F:  r^T F = 0^T        Centroid of rows of G:  c^T G = 0^T    (4.1.17)

Principal inertias of row cloud:  F^T D_r F = D_μ^2 ≡ D_λ
Principal inertias of column cloud:  G^T D_c G = D_μ^2 ≡ D_λ    (4.1.18)

Proof: The centerings (4.1.17) are obvious because the rows of F and G are merely the respective sets of centred profiles with respect to new reference systems of axes. A proof follows immediately from (4.1.13), for example:

r^T (D_r^{-1}P − 1c^T) = 1^T P − c^T = c^T − c^T = 0^T

The results (4.1.18) pertaining to the weighted sum-of-squares and cross-products of the principal co-ordinates follow directly from the standardization of the principal axes in (4.1.9), and from (4.1.14).

4.1.11. As a consequence of (4.1.6), (4.1.7) and (4.1.18) the total inertia of each cloud of points is decomposed along the principal axes and amongst the points themselves in a similar and symmetric fashion. This gives a decomposition of inertia for each cloud of points which is analogous to a decomposition of variance.

Decomposition of inertia:

                                   axes
                  1             2            ...   K             total
         1    r_1 f_11^2    r_1 f_12^2     ...  r_1 f_1K^2    r_1 Σ_k f_1k^2
rows     2    r_2 f_21^2    r_2 f_22^2     ...  r_2 f_2K^2    r_2 Σ_k f_2k^2
         ...
         I    r_I f_I1^2    r_I f_I2^2     ...  r_I f_IK^2    r_I Σ_k f_Ik^2

         1    c_1 g_11^2    c_1 g_12^2     ...  c_1 g_1K^2    c_1 Σ_k g_1k^2
columns  2    c_2 g_21^2    c_2 g_22^2     ...  c_2 g_2K^2    c_2 Σ_k g_2k^2
         ...
         J    c_J g_J1^2    c_J g_J2^2     ...  c_J g_JK^2    c_J Σ_k g_Jk^2

total         λ_1 ≡ μ_1^2   λ_2 ≡ μ_2^2    ...  λ_K ≡ μ_K^2   in(I) = in(J)

These tables form the numerical support for the graphical display. We call the columns of these tables contributions of the rows and columns respectively to the inertia of an axis. We can express each of these contributions as proportions of the respective inertia λ_k (≡ μ_k^2) in order to interpret the axis itself. These contributions are often called "absolute contributions" because they are affected by the mass of each point. Each row of these tables contains the contributions of the axes to the inertia of the respective profile point. Again we can express each of these as proportions of the point's inertia in order to interpret how well the point is represented on the axes. These are often called "relative contributions" because the masses are divided out (cf. Section 3.3).

4.1.12. In correspondence analysis the centering of the row and column profiles is a symmetric operation which removes exactly one dimension from the original spaces of these profiles. This is embodied in the result that the SVD of the uncentered matrix P "contains" the SVD of the centered matrix P − rc^T. Let the generalized SVD of P be:

P = Ã D_μ̃ B̃^T where Ã^T D_r^{-1} Ã = B̃^T D_c^{-1} B̃ = I    (4.1.19)

while that of P − rc^T is given by (4.1.9). Then:

Ã = [r A]    (4.1.20)
B̃ = [c B]    (4.1.21)
D_μ̃ = [ 1   0^T
        0   D_μ ]    (4.1.22)

... roots of the principal inertias: (D_c^{-1}P^T) F = G D_μ. Applying the matrix of row profiles R ≡ D_r^{-1}P to this result leads to:

(D_r^{-1}P)(D_c^{-1}P^T) F = (D_r^{-1}P) G D_μ = F D_μ^2

because R maps G to a similarly rescaled F in a symmetric fashion:

(D_r^{-1}P) G = F D_μ
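The decomposition of inertia in 4.1.11 is easy to tabulate: the (i,k)th cell for the rows is r_i f_ik^2. A short sketch (our own, numpy, same data as above) verifies that the columns of the table sum to the principal inertias and that the grand total is the total inertia of (4.1.8):

```python
import numpy as np

N = np.array([[4, 2, 3, 2], [4, 3, 7, 4], [25, 10, 12, 4],
              [18, 24, 33, 13], [10, 6, 7, 2]], dtype=float)
P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, mu, Vt = np.linalg.svd(S, full_matrices=False)
F = (np.sqrt(r)[:, None] * U) * mu / r[:, None]   # principal row co-ordinates

cells = r[:, None] * F ** 2      # r_i f_ik^2 : one cell of the table in 4.1.11
lam = cells.sum(axis=0)          # column totals: principal inertias lambda_k
assert np.allclose(lam, mu ** 2)                   # (4.1.18)
assert np.isclose(cells.sum(), (S ** 2).sum())     # grand total = in(I), cf. (4.1.8)

# "Relative contributions": each cell as a proportion of its point's inertia
rel = cells / cells.sum(axis=1, keepdims=True)
```

Dividing each column by λ_k instead gives the "absolute contributions" used to interpret an axis.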
... the display in standard co-ordinates.) The column points are actually the projections of the J "unit profiles" (the rows of the J × J identity matrix) onto the principal subspace. This result is proved in Section 4.4 and illustrated in Section 3.4.

Standard row co-ordinates:  Φ ≡ F D_μ^{-1}        Standard column co-ordinates:  Γ ≡ G D_μ^{-1}    (4.1.29)

Hence the columns of Φ are standardized as Φ^T D_r Φ = I, and the columns of Γ as Γ^T D_c Γ = I.    (4.1.30)

4.1.16. By an asymmetric display we mean that the standardizations imposed on the two sets of points are different. Most commonly, one of the sets is represented in principal co-ordinates while the other set is represented in standard co-ordinates. The transition formulae between these points are then asymmetric, as is the interpretation of the display. (When they are low the display in principal co-ordinates is much smaller than ...)

4.1.17. The following property, called the "principle of distributional equivalence" (Benzécri et al., 1973), is peculiar to correspondence analysis and, in particular, to the display in principal co-ordinates. If two row points, say, occupy identical positions in multidimensional space, then they may be merged into one point, whose mass is the sum of the two masses, without affecting the masses and interpoint distances of the column points. Similarly, a row of data may be subdivided into two (or more) rows of data, each of which is proportional to the original row, leaving the geometry of the column points invariant.

... and thus also equal to (n_1j + n_2j)/(n_1. + n_2.) = ñ_2j/ñ_2.
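The principle of distributional equivalence in 4.1.17 is easy to illustrate numerically: merging two rows with proportional values (i.e. identical profiles) leaves the chi-square distances between the column profiles unchanged. A small sketch (our own construction; the data here are hypothetical):

```python
import numpy as np

def chi2_column_distances(N):
    """Squared chi-square distances between the column profiles of N
    (weighted Euclidean metric D_r^{-1}, cf. 4.1.3)."""
    P = N / N.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    C = (P / c[None, :]).T            # column profiles, one per row of C
    J = len(c)
    d = np.zeros((J, J))
    for j in range(J):
        for k in range(J):
            d[j, k] = np.sum((C[j] - C[k]) ** 2 / r)
    return d

N = np.array([[4.0, 2, 3, 2],
              [8.0, 4, 6, 4],     # row 2 is proportional to row 1: same profile
              [25.0, 10, 12, 4]])
merged = np.vstack([N[0] + N[1], N[2]])   # merge the two equivalent rows

assert np.allclose(chi2_column_distances(N), chi2_column_distances(merged))
```

The same check run in reverse illustrates the subdivision statement: splitting a row into proportional parts does not move the column points.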
Other ways of identifying the solutions are to fix any two values of one set (say the y_j's), usually two "end point" values, or to impose the centering (4.2.6) and then rescale in order to fix one value. Notice that in the present case of 1-dimensional ordination the choice of identification conditions imposed on the final solution is immaterial to their interpretation, and is relevant only in the geometry of multidimensional ordinations.

Computation by reciprocal averaging

Iterative application of the reciprocal averaging formulae, incorporating identification of each successive set of trial solutions, will actually converge at the optimal solution. We illustrate this procedure using the matrix of Table 4.1, the abundances (which can be frequencies or areal coverage, for example) of 5 species of trees at 4 different sites on a mountain slope. (These are the same artificial data as in Table 3.1, presumed to occur in the present ecological context.)

TABLE 4.1
Same data as Table 3.1, but presumed to occur in an ecological context.

                       Sites (average altitude)
             Site 1    Site 2    Site 3    Site 4
Trees        (200 m)   (500 m)   (700 m)   (1000 m)
Species 1       4         2         3         2        11
Species 2       4         3         7         4        18
Species 3      25        10        12         4        51
Species 4      18        24        33        13        88
Species 5      10         6         7         2        25
               61        45        62        25       193

As an initial set of values for the 4 sites, we can use their altitude values: 200, 500, 700 and 1000. Centered and standardized according to (4.2.6) and (4.2.7) these are −1.241, −0.127, 0.616 and 1.730 respectively. As an alternative we can fix the values of y_1 and y_4 to be 200 and 1000 respectively, their respective altitude values. Table 4.2 shows the initial computational steps for both choices of identification conditions, as well as the final solution. A complete reciprocal averaging is performed before the values need to be identified. Under the first choice (a) of identification conditions, the unidentified y^(1) already satisfies c^T y^(1) = 0 and its weighted sum of squares (inertia) is:

y^(1)T D_c y^(1) = (61/193) × (−0.09962)^2 + ... + (25/193) × (0.08215)^2
                 = 0.004912

The identified values of y^(1) are obtained by dividing the unidentified values by the square root of this inertia, for example −0.09962/(0.004912)^{1/2} = −1.421. Under the second choice (b) of constraints, the initial values are already identified and after a complete reciprocal averaging these have to be recentered and rescaled so that y_1^(1) = 200 and y_4^(1) = 1000. The difference between the unidentified values of y_4^(1) and y_1^(1) is 556.3 − 507.4 = 48.9, which is equivalent to 1000 − 200 = 800 on the identified scale, hence the identified value of y_2^(1), for example, is calculated as:

{(539.7 − 507.4) × (800/48.9)} + 200 = 728.8

(Note that the left-hand side of this expression is subject to rounding error; the value on the right is the result of the more accurate calculation.)

TABLE 4.2
Some initial and final steps of the reciprocal averaging computations on Table 4.1, performed under two sets of identification conditions on the column scale values: (a) centered at mean zero, standardized to have unit variance; (b) two values fixed, y_1 = 200 and y_4 = 1000.

                                             (a)                          (b)
Identification condition        c^T y = 0,  y^T D_c y = 1      y_1 = 200, y_4 = 1000
Initial values y_1^(0) ... y_4^(0):
          −1.2411  −0.1270  0.6157  1.7298                  |  200  500  700  1000
First stage of averaging to obtain x_1^(0) ... x_5^(0), as in (4.2.1):
          0.008047  0.3269  −0.3527  0.1980  −0.2161        |  536.4  622.2  439.2  587.5  476.0
Second stage of averaging to obtain y_1^(1) ... y_4^(1), as in (4.2.2):
          −0.09962  0.02052  0.04999  0.08215               |  507.4  539.7  547.7  556.3
Identified y_1^(1) ... y_4^(1):
          −1.4214  0.2928  0.7133  1.1723                   |  200  728.8  858.4  1000
First stage of averaging to obtain x_1^(1) ... x_5^(1):
          −0.05597  0.2708  −0.3796  0.2298  −0.2048        |  621.2  722.0  521.3  709.3  575.3
etc.

During the 8th iteration (reciprocal averaging), the identified ordination has converged sufficiently to terminate computations. By convergence we mean that the identified set of values (y_1^(k) ... y_4^(k) in our example) are close enough to the previous identified set (y_1^(k−1) ... y_4^(k−1)) to be called identical. The difference between the unidentified and identified solutions at this convergence point will enable us to compute the optimal value of the expansion factor, equivalently the shrinkage factor. Under the choice (a) of identification conditions, and using (4.2.3) in this case, we need only inspect scale differences to deduce that the shrinkage factor is 0.07476 (e.g. = 0.1075/1.438). Under conditions (b), however, the scale change should be evaluated using unidentified and identified deviations from the centroid (which is 657.9 in this case); or, equivalently, using the unidentified and identified differences between two individual values, for example, between those specifically used in the identification (= (683.5 − 623.7)/(1000 − 200)). Notice that, as expected under conditions (a), the optimal ordination of the columns (sites) is the set of standard column co-ordinates in the correspondence analysis of Table 4.1, while their averages x_1^(7) ... x_5^(7) (ordination of the species) are the principal row co-ordinates.

This algorithm is a special case of the alternating least squares algorithm which derives the largest singular value and associated pair of singular vectors of a rectangular matrix. In Appendix B we discuss the convergence properties of this algorithm.

In practice, reciprocal averaging is used chiefly to obtain a single pair of ordinations x and y, as described above. This process can be repeated to obtain another pair of ordinations "orthogonal to" the first pair, equivalent to the second principal dimension of correspondence analysis. The computation of this second pair is often useful in order to check the stability of the first. By stability we mean that small changes in the data matrix do not lead to the algorithm converging at a dramatically different solution. This topic is treated in more detail in Section 8.1.
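The iteration of Table 4.2 is a few lines of code. The sketch below (our own, numpy) applies identification conditions (a) after each complete averaging; at convergence the site values are the standard column co-ordinates of the first dimension, and the inertia of the species scores is the first principal inertia, 0.0748:

```python
import numpy as np

N = np.array([[4, 2, 3, 2], [4, 3, 7, 4], [25, 10, 12, 4],
              [18, 24, 33, 13], [10, 6, 7, 2]], dtype=float)
r, c = N.sum(axis=1) / N.sum(), N.sum(axis=0) / N.sum()

y = np.array([200.0, 500.0, 700.0, 1000.0])   # initial site values: altitudes
for _ in range(50):
    x = (N @ y) / N.sum(axis=1)       # first stage: species scores, cf. (4.2.1)
    y = (N.T @ x) / N.sum(axis=0)     # second stage: site values,   cf. (4.2.2)
    y = y - c @ y                     # identify: centre, c^T y = 0      (4.2.6)
    y = y / np.sqrt(c @ y ** 2)       # ... and standardize, y^T D_c y = 1 (4.2.7)

# Inertia of the optimal species scores = first principal inertia
assert np.isclose(r @ x ** 2, 0.0748, atol=5e-4)
```

Fifty iterations are far more than the eight needed here; the shrinkage per complete averaging tends to the first principal inertia, so convergence is rapid when the second inertia is much smaller than the first.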
4.3 DUAL (OR OPTIMAL) SCALING

First approach: maximizing score variance

Let us use the same data (Table 3.1) once again in order to introduce the concepts of dual scaling, but return to the context originally described in Section 3.1, where the rows are staff groups and the columns are categories of smoking. In this example we can think of assigning a scale value to each of the four categories of smoking, say 0, 1, 2 and 3, and then evaluate a position for each of the staff groups on this scale as an average of the scale values of the members in the group. For example, senior managers have 4 non-smokers, 2 light smokers, 3 medium smokers and 2 heavy smokers, giving an average value of {(4 × 0) + (2 × 1) + (3 × 2) + (2 × 3)}/11 = 1.27. Values for the other four staff groups can be similarly evaluated and the resultant "scaling" is shown in Fig. 4.1.

FIG. 4.1. Initial and optimal scale values for the columns of Table 3.1 (or Table 4.1) and derived row scores. The optimal scale values have been centered and scaled to be directly comparable with the initial values. The dispersion (as measured by the inertia) of the optimal row scores is larger, although this may not be obvious by inspection. In fact the inertia of the initial scores is 0.0660 while the inertia of the optimal scores is known to be 0.0748, the first principal inertia.

Thus for a particular set of scale values y_1 ... y_4 for the columns, the way we obtain scale values for the rows, which we shall call row scores, is precisely as described in the previous section, using the averaging formula (4.2.1). If we now think of the scale values for the smoking categories as variables y_1 ... y_4 then we can pose the problem of determining values of y_1 ... y_4 which optimize some suitable criterion defined on the row scores.

Our object in deriving scores for the staff groups is clearly to investigate the differences between them and a natural criterion is thus the variance of the row scores, which we would want to maximize. Clearly this variance is unaffected by adding any constant to the scale values and can be increased at will by increasing the spread of the scale values. In order to identify an optimal solution for the scale values, we conventionally fix their mean and variance over the whole sample to be 0 and 1 respectively:

(61y_1 + 45y_2 + 62y_3 + 25y_4)/193 = 0
(61y_1^2 + 45y_2^2 + 62y_3^2 + 25y_4^2)/193 = 1

In our general notation of Section 4.1, this is recognized to be:

c^T y = 0    (4.3.1)
y^T D_c y = 1    (4.3.2)

The values of the row scores are the elements of the I-vector D_r^{-1}Py (where I = 5 in our example) and (4.3.1) is equivalent to the mean of the row scores across the whole sample being zero: r^T(D_r^{-1}Py) = 0, as shown in Example 4.6.2. (Notice that the mean and variance are defined over the whole sample, i.e. 193 people in our example, of which 11 are assigned the first row score, 18 the second, and so on.) The variance of the row scores is then the average sum of squares:

(D_r^{-1}Py)^T D_r (D_r^{-1}Py) = y^T P^T D_r^{-1} P y
To maximize this function subject to the constraint (4.3.2) we introduce a Lagrange multiplier λ; differentiating the resulting Lagrangian with respect to y and setting the derivative equal to zero we obtain the equation:

D_c^{-1} P^T D_r^{-1} P y = λ y    (4.3.5)

which is precisely the eigenequation of (4.1.23). Since the eigenvalue (Lagrange multiplier) λ is the score variance itself: λ = y^T P^T D_r^{-1} P y (proved by multiplying (4.3.5) on the left by y^T D_c and using (4.3.2)), it is clear that this solution is the eigenvector associated with the largest (non-trivial) eigenvalue. The associated row scores are exactly the principal co-ordinates of the row profiles with respect to the first principal axis. In the notation of Section 4.1: y = [γ_11 ... γ_J1]^T and x = [f_11 ... f_I1]^T. In Fig. 4.1 we compare the results of this optimal scaling with those obtained from the preset scale values described at the start of this section. The optimal scale values indicate relatively small differences between the 3 categories of smoking, and a larger difference between these categories and the "non-smoking" category.

The approach to dual scaling described by Nishisato (1980) is practically identical, except that he sets the problem in a context which is reminiscent of the sum of squares decomposition in analysis of variance and discriminant analysis. Given the scale values y_1 ... y_J of the columns, let us consider replacing each integer in the I × J contingency table by just as many repetitions of the corresponding scale value, and so on for all the rows (Table 4.3). The first row is thus characterized by 11 values and we summarize these values by their mean, the row score x_1. The objective which is imposed in order to calculate the scale values is called the criterion of internal consistency and has its origins in the writings of Guttman (1941, 1950), Maung (1941) and Bock (1960). The idea is that the set of values ...
Summed over all the rows we obtain the sum of squared deviations within rows, SS_w:

SS_w = 61y_1^2 + 45y_2^2 + 62y_3^2 + 25y_4^2 − 11x_1^2 − 18x_2^2 − 51x_3^2 − 88x_4^2 − 25x_5^2    (4.3.6)

The sum of squared deviations between rows, that is between the scores assigned to the 193 people, is SS_b:

SS_b = 11(x_1 − x̄)^2 + ... + 25(x_5 − x̄)^2
     = 11x_1^2 + ... + 25x_5^2 − 193x̄^2    (4.3.7)

In an analysis of variance fashion, SS_b and SS_w sum to SS_t, the total sum of squared deviations between all 193 values in Table 4.3 and their mean x̄:

SS_t = 61(y_1 − x̄)^2 + ... + 25(y_4 − x̄)^2
     = 61y_1^2 + ... + 25y_4^2 − 193x̄^2
     = SS_b + SS_w    (4.3.8)

According to the criterion of internal consistency we want to maximize SS_b while minimizing SS_w. Because these two quantities add up to SS_t, the overall variation amongst the scores, it is clear that both these objectives are satisfied simultaneously for a given SS_t. Putting this another way we must maximize SS_b and simultaneously minimize SS_w relative to SS_t. If we define the ratio η^2 = SS_b/SS_t then (4.3.8) can be written as:

η^2 + SS_w/SS_t = 1    (4.3.9)

η^2 is called the squared correlation ratio (Guttman, 1941) and it clearly lies between 0 and 1. The objective is thus to find the scale values which maximize η^2.

The value of η^2 is clearly unaffected by the value of x̄, because x̄ is the mean of all the quantities on which both SS_b and SS_t are based. Therefore we choose x̄ = 0, which is equivalent to the centering of the scale values 61y_1 + ... + 25y_4 = 0 (i.e. c^T y = 0), as we have already noted above. From here on it is easier if we use matrix notation for the general problem. Since x = D_r^{-1}Py, (4.3.7) is in general (with x̄ = 0) proportional to y^T P^T D_r^{-1} P y, so that:

η^2 = (y^T P^T D_r^{-1} P y)/(y^T D_c y)    (4.3.10)

Then:

∂η^2/∂y = {(y^T D_c y) 2P^T D_r^{-1} P y − (y^T P^T D_r^{-1} P y) 2D_c y}/(y^T D_c y)^2

using the result of Example 4.6.3 once again and the usual quotient rule of derivatives. Setting this equal to zero we obtain:

(y^T D_c y) P^T D_r^{-1} P y = (y^T P^T D_r^{-1} P y) D_c y

Using (4.3.10) we can rewrite the right-hand side of this equation in terms of η^2 and divide both sides by y^T D_c y to obtain:

P^T D_r^{-1} P y = η^2 D_c y

i.e.

D_c^{-1} P^T D_r^{-1} P y = η^2 y    (4.3.11)

which is an eigenequation identical to (4.3.5). In fact the only difference between the previous objective (4.3.3) and the present one (4.3.10) is that the former fixes in advance the total sum of squares SS_t and incorporates this constraint in the objective, whereas here the objective is expressed relative to SS_t. Here, of course, we ultimately have to impose a constraint on SS_t anyway to identify the eigenvector solution of (4.3.11).

This technique is called dual scaling because the symmetric problem of assigning standardized scale values to the rows of the table to maximize the resulting scores of the columns is dual to the above problem. The optimal row scale values are the elements of the first (non-trivial) eigenvector satisfying the following eigenequation:

D_r^{-1} P D_c^{-1} P^T x = η^2 x

If the scale values are standardized as x^T D_r x = 1, the column scores y = D_c^{-1} P^T x have maximal variance of η^2 = λ_1, the first principal inertia, and are the first principal co-ordinates of the column profiles. Notice how the scale values and scores for each problem play dual roles: the scale values in one problem, multiplied by the corresponding correlation ratio (square root of principal inertia) η = (λ_1)^{1/2}, are the scores of the symmetric problem.
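Eigenequation (4.3.11) can be checked directly. The sketch below (our own, numpy, with the data of Table 3.1) recovers the trivial solution η^2 = 1 (constant scale values) followed by the first principal inertia, and verifies that the score variance equals the eigenvalue:

```python
import numpy as np

N = np.array([[4, 2, 3, 2], [4, 3, 7, 4], [25, 10, 12, 4],
              [18, 24, 33, 13], [10, 6, 7, 2]], dtype=float)
P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)

# Eigenequation (4.3.11): D_c^{-1} P^T D_r^{-1} P y = eta^2 y
M = (P.T / c[:, None]) @ (P / r[:, None])
vals, vecs = np.linalg.eig(M)
order = np.argsort(-vals.real)
vals, vecs = vals.real[order], vecs.real[:, order]

assert np.isclose(vals[0], 1.0)                  # trivial solution: y constant
assert np.isclose(vals[1], 0.0748, atol=5e-4)    # first principal inertia

y = vecs[:, 1]
y = y / np.sqrt(c @ y ** 2)       # identification (4.3.2): y^T D_c y = 1
x = (P @ y) / r                   # row scores x = D_r^{-1} P y
assert np.isclose(r @ x ** 2, vals[1])   # score variance = eta^2
```

The non-trivial eigenvectors of M are automatically centred (c^T y = 0), since c is the left eigenvector of the trivial solution.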
The reason why η can be considered a correlation will become clear in the next section where we show yet another context in which correspondence analysis may be defined, canonical correlation analysis.

[Diagram: a data matrix of J variables partitioned into a first set of J_1 variables and a second set of J_2 variables.]
We can describe the geometry in two different, though equivalent, ways. The question which is of particular interest to us is the following: if we define A^(K*) ≡ [a_1 ... a_K*] and B^(K*) ≡ [b_1 ... b_K*] as the respective submatrices of the first K* canonical weight vectors (i.e. the first K* columns of A and B of (4.4.3) respectively), what interpretation can we give to a plot of the rows of A^(K*) and of B^(K*) together in a K*-dimensional Euclidean space (e.g. a 2-dimensional plot, when K* = 2)?

[Diagram residue from Figs 4.6 and 4.7: the I cases of [Z_1 Z_2], with J categories (indicator, or dummy, variables) partitioned into J_1 and J_2 categories; the J_2-dimensional subspace of the second set of uncentred columns; unit points and their metrics.]

We call the spaces of Fig. 4.6 Mahalanobis spaces, because the metrics ... vectors respectively in the Mahalanobis spaces of Fig. 4.6. By a unit point e_i, for example, we mean a vector of zeros except for a 1 in the ith position. The co-ordinate of e_i with respect to a typical canonical axis S_11 a_k of the first space is simply the ith element of a_k: e_i^T S_11^{-1}(S_11 a_k) = a_ik. Thus for K* = 2, say, the plots of the J_1 rows of A^(2) and the J_2 rows of B^(2) give displays where the points can be interpreted as the positions of the fixed unit points with respect to the plane of the first two canonical axes in each space.
We can amalgamate these two displays into one joint display if we take
care in the interpretation of the between-set positions. From (4.4.2) and
(4.4.3) we see that the two sets of co-ordinates A(K.) and B(K.) are related as
follows (where K* = 2, say):
SI/S12B(K·) = A(K·)Dp(K·) (4.4.8) FIG.4.7. As Fig. 4.6. when Z, and Z2 consist of dummy variables. Here there is
no distinction between the display of the rowS and columns. For example. a
Sil S21 A (K·) = B(K·)Dp(K·) (4.4.9) proportion r¡ of the rowS of Z, (i.e. Ir, rows) coincide with the unit point e¡
representing column i.
where D p(K.) denotes the K* x K* diagonal matrix of the first K* canohical
correlations PI'" PK.' If we define R == SI/S12' for example, then for
K* = 2 a particular point [ail ai2 ] T of the first cloud is seen to be a linear these particular centerings r Tak = O and eTb k = O of the non-trivial vectors of
combination Σ_j r_ij [b_j1 b_j2]^T of the J_2 points of the other cloud, followed by an expansion in scale of 1/ρ_1 and 1/ρ_2 on respective dimensions. Notice that the columns of the J_1 × J_2 matrix R are the vectors of regression coefficients in the multiple regressions of the respective columns of Z_2 on the set of columns of Z_1. The interpretation of (4.4.8) (and similarly (4.4.9)) is not easy, though it is apparent that a specific point [a_i1 a_i2]^T will tend away from the origin of the display in the direction of those points [b_j1 b_j2]^T corresponding to variables which exhibit large positive regression coefficients on variable i.

In the special case of the (uncentered) indicator matrices, the situation is considerably simplified. The "uncentered" covariance matrices are simply D_r and D_c respectively. The I row points are coincident with the unit points in each space and are distributed over these points in proportions given by the respective elements of r and c (Fig. 4.7). The trivial solution takes the form of trivial canonical axes r = D_r 1 and c = D_c 1 in the respective spaces. With respect to these axes the co-ordinates of the rows of Z_1 and Z_2 are Z_1 D_r^{-1} r = Z_1 1 = 1 and Z_2 D_c^{-1} c = Z_2 1 = 1, and the uncentered correlation of these two vectors of identical elements is 1. The non-trivial solutions thus appear from the second canonical axes onwards and these are all orthogonal to their respective trivial axes:

r^T D_r^{-1} (D_r a_k) = 0   i.e.   r^T a_k = 0      k = 1 ... J_1 − 1
c^T D_c^{-1} (D_c b_k) = 0   i.e.   c^T b_k = 0      k = 1 ... J_1 − 1

(where we number the non-trivial solutions from 1 onwards, with the trivial solution numbered as 0, and assume, as always, that J_1 ≤ J_2). Notice that these conditions on the canonical weights imply that the variances of the canonical scores are independent of centering the columns of Z_1 and Z_2: a_k^T D_r a_k = a_k^T (D_r − r r^T) a_k = a_k^T S_11 a_k and similarly b_k^T D_c b_k = b_k^T S_22 b_k (cf. (4.4.5)). The non-trivial canonical correlations are likewise independent of the centering, since the between-set covariance matrix S_12 is:

S_12 = (1/n..) Z_1^T Z_2 − r c^T = P − r c^T        (4.4.10)

(where P ≡ (1/n..) N = (1/n..) Z_1^T Z_2 is the correspondence matrix, cf. (4.1.1)), so that:

ρ_k = a_k^T S_12 b_k = a_k^T (P − r c^T) b_k = a_k^T P b_k        (4.4.11)

Equations (4.4.8) and (4.4.9) can similarly be expressed in terms of uncentered covariance matrices:

D_r^{-1} P B_(K*) = A_(K*) D_ρ(K*)        (4.4.12)
D_c^{-1} P^T A_(K*) = B_(K*) D_ρ(K*)        (4.4.13)

which are just the transition formulae of (4.1.16) in K*-dimensional space. Here the matrices of regression coefficients of each set of dummy variables on the other set are R = D_r^{-1} P and C = D_c^{-1} P^T, the matrices of row and column profiles respectively (cf. (4.1.4)), and the canonical correlations ρ_1, ρ_2, ... are the square roots (λ_1)^{1/2}, (λ_2)^{1/2}, ... (denoted previously by μ_1, μ_2, ..., cf. (4.1.18)) of the principal inertias.
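The independence-of-centering argument above can be checked numerically on any small two-way table. The sketch below (plain NumPy; the 3 × 3 table is an arbitrary illustration, not data from the book) verifies that for any vector a orthogonal to the trivial axis (r^T a = 0) the quadratic forms under D_r and S_11 = D_r − r r^T agree, and that a^T S_12 b = a^T P b.

```python
import numpy as np

# Arbitrary small contingency table (illustrative only).
N = np.array([[10.0, 5.0, 3.0],
              [4.0, 8.0, 6.0],
              [2.0, 3.0, 9.0]])
P = N / N.sum()                 # correspondence matrix
r = P.sum(axis=1)               # row masses
c = P.sum(axis=0)               # column masses
Dr = np.diag(r)

S11 = Dr - np.outer(r, r)       # centered covariance of the row dummies
S12 = P - np.outer(r, c)        # between-set covariance

# Build a vector a with r^T a = 0 (orthogonal to the trivial axis);
# this works because the masses r sum to 1.
x = np.array([1.0, -2.0, 0.5])
a = x - r.dot(x) * np.ones(3)
b = np.array([0.3, -1.0, 0.7])  # arbitrary column weights

assert np.isclose(r.dot(a), 0.0)
# Variance of the canonical scores is unaffected by centering:
assert np.isclose(a @ Dr @ a, a @ S11 @ a)
# The between-set form is unaffected too: a^T S12 b = a^T P b.
assert np.isclose(a @ S12 @ b, a @ P @ b)
```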
4. Theory and Equivalent Approaches 117
116 Theory and Applications of Correspondence Analysis
Summary

cases (this would be a canonical correlation "biplot", cf. Appendix A). When the data are in the special form of an indicator matrix, the cases themselves occur only at the unit points and there is, in fact, no geometric difference between the cases and the variables. The concept of assigning a mass to the categories of the discrete variables (i.e. columns of the indicator matrix) is justified here by the "piling up" of the cases at each unit point representing the variable. The chi-square metrics in the two corresponding spaces of the cases are equivalent to Mahalanobis metrics, and the principal inertias of the correspondence analysis are squared canonical correlations. Finally, from (4.4.12) and (4.4.13) and the standardization (4.4.4), it is clear that the display of the rows of A_(K*) and B_(K*), the matrices of canonical weights, is the same as the K*-dimensional display of the row and column profiles in standard co-ordinates obtained from the correspondence analysis of N. The principal co-ordinates are thus the canonical weights scaled by the canonical correlations: F_(K*) = A_(K*) D_ρ(K*) and G_(K*) = B_(K*) D_ρ(K*).

In Example 4.6.6 we illustrate the results of this section by recovering the correspondence analysis of Table 3.1 from the canonical correlation analysis of the associated indicator matrix (cf. Table 5.1).

FIG. 4.8. Plot of the initial scale values (horizontal axis) against the derived row scores (vertical axis). The size of each point is roughly proportional to the respective element of the data matrix (Table 3.1 or Table 4.1). The centroid of each vertical set of points is indicated by a ×, and it is clear that these are not exactly linear. Because the row scores are the centroids of each horizontal set of points in this plot, the regression of scale values on scores is exactly linear with a slope of 1.
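The scaling identity of the summary above (principal co-ordinates equal the canonical weights, i.e. the standard co-ordinates, scaled by the canonical correlations) can be confirmed in a few lines of NumPy. The 3 × 3 table below is an arbitrary illustration, not data from the book.

```python
import numpy as np

N = np.array([[10.0, 5.0, 3.0],
              [4.0, 8.0, 6.0],
              [2.0, 3.0, 9.0]])
P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)

# SVD of the standardized residual matrix gives the CA solution.
S = np.diag(r**-0.5) @ (P - np.outer(r, c)) @ np.diag(c**-0.5)
U, sv, Vt = np.linalg.svd(S)
K = 2                                   # non-trivial dimensions kept

Phi = np.diag(r**-0.5) @ U[:, :K]       # standard row co-ordinates (canonical weights)
F = Phi @ np.diag(sv[:K])               # principal co-ordinates F = Phi D_rho

# Standardizations: Phi^T D_r Phi = I, while F^T D_r F = D_lambda,
# the diagonal of principal inertias (squared canonical correlations).
assert np.allclose(Phi.T @ np.diag(r) @ Phi, np.eye(K), atol=1e-10)
assert np.allclose(F.T @ np.diag(r) @ F, np.diag(sv[:K]**2), atol=1e-10)
```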
If the scale values b and scores a satisfy the transition formulae, then D_c^{-1} P^T a = μ b and D_r^{-1} P b = μ a, so that D_c^{-1} P^T a is linear against b (with slope μ) and D_r^{-1} P b is linear against a (with slope 1/μ). Of course any pair of solutions has this property, but our objective is to find the solution to the simultaneous linear regression problem for which the two lines are as collinear as possible. This means that we want to minimize the angle 1/μ − μ = (1 − μ²)/μ, which is minimized by maximizing μ. Thus the solution provided by the first principal axis of correspondence analysis is the one we require. From our discussion in Section 4.4 it is clear why this objective is satisfied when the canonical correlation between the two sets of scores is maximized, because the canonical correlation is precisely the correlation of all the points that make up Fig. 4.9. Notice that in Fig. 4.9 the transition between the scale values and scores is not a symmetric one, but that this does not affect the objective of finding the most collinear simultaneous linear regressions (cf. Fig. 4.9 caption).

FIG. 4.9. As Fig. 4.8, for the optimal scale values and scores. The regression of scores on scale values is now an exact linear one, as shown by the dotted line through the centroids of the vertical sets of points. Because the transition from scores a back to scale values b is of the form b = (1/μ²) D_c^{-1} P^T a, it follows that the slope of this regression line is μ², the first principal inertia, which is 0.0748, so that the angle of slope is tan^{-1} 0.0748 = 4.3°. The regression of scale values on scores is still exactly linear with slope 1, since a = D_r^{-1} P b. The objective has still been to find the "most collinear" simultaneous regressions (i.e. minimize 1 − μ² in this case).

4.6 EXAMPLES

4.6.1 Biplot interpretation of joint display of the row and column points

In a biplot of a matrix Y ≡ [y_ij], each row and each column are represented by vectors in a low-dimensional Euclidean space such that the "between-sets" (row-column) scalar products approximate the elements y_ij, where the approximation is traditionally in the sense of least-squares or weighted least-squares (see Appendix A). Show that in an "asymmetric" correspondence analysis display of the rows and columns (for example where the rows are displayed in principal co-ordinates and the columns in standard co-ordinates) the between-set scalar products approximate the quantities (p_ij − r_i c_j)/(r_i c_j). In what specific way are these quantities approximated?

Solution
When the rows are represented by the rows of F and the columns by the rows of Γ, the reconstitution formula (4.1.27) may be written as:

p_ij = r_i c_j (1 + Σ_k f_ik γ_jk)

That is:

(p_ij − r_i c_j)/(r_i c_j) = Σ_k f_ik γ_jk        (4.6.1)

The right-hand side of (4.6.1) is the scalar product of the ith row of F and the jth row of Γ in the full space. Therefore in the principal K*-dimensional subspace, the scalar products are approximations:

(p_ij − r_i c_j)/(r_i c_j) ≈ Σ_k^{K*} f_ik γ_jk        (4.6.2)

The sense of the approximation is weighted least-squares. More specifically, if we denote the quantities (p_ij − r_i c_j)/(r_i c_j) by y_ij, then the function which is being minimized is:

Σ_i Σ_j r_i c_j (y_ij − Σ_k u_ik v_jk)²        (4.6.3)

where Σ_k u_ik v_jk is the scalar product between row and column points in K*-dimensional space, the points' co-ordinates being the variables of the minimization. The weights of the least-squares approximation are thus r_i c_j, and it is easily shown that this is identical to the approximation implied by the decomposition (4.1.9). From the low rank approximation theorem (cf. (A.1.12), (2.5.8)), the rank K* approximation of P − r c^T in the metrics D_r^{-1} and D_c^{-1} minimizes trace{D_r^{-1}(P − r c^T − S) D_c^{-1} (P − r c^T − S)^T} over all matrices S of rank not greater than K*. This can be rewritten as trace{D_r^{1/2}(Y − S̃) D_c^{1/2} D_c^{1/2}(Y − S̃)^T D_r^{1/2}}, where S̃ ≡ [s̃_ij] = D_r^{-1} S D_c^{-1} is of the same rank as S. The set of scalar products Σ_k u_ik v_jk is just a reparametrization of the elements s̃_ij, and the optimal s̃_ij are the scalar products Σ_k f_ik γ_jk.

Comment
The optimal matrix of scalar products is F_(K*) Γ_(K*)^T, where F_(K*) and Γ_(K*) are the first K* columns of F and Γ respectively. While this matrix may be unique, its
decomposition as the product of two matrices is not. For example, F_(K*) Γ_(K*)^T = Φ_(K*) G_(K*)^T, so that exactly the same biplot interpretation and sense of the approximation is valid in the dual asymmetric display. This is again the question of how we choose to identify the solutions, which we have discussed at great length throughout this chapter in different guises (see also Gabriel, 1978). In the display of the rows of F_(K*) and Γ_(K*) the rows lie exactly at the barycentres of the columns, which may be desirable in certain situations. The standardization is prescribed by the data analyst, not by the analysis. By default the display is "symmetric" and in principal co-ordinates.

4.6.2 Invariance of centroid with respect to reciprocal averaging

Suppose y is a set of scale values for the columns of a correspondence matrix P and that the centroid of y is c^T y = β. Show that the scores x = D_r^{-1} P y are still centered at β.

Solution
The centroid of the scores is r^T x = r^T D_r^{-1} P y = 1^T P y = c^T y = β.

Comment
The variance of the scores as defined above is clearly less than that of the scale values and is, in general, less than or equal to the variance of the scale values multiplied by the largest principal inertia λ_1 of P. Only when the scale values are optimal (in the sense of Section 4.3) does the row score variance reach this upper limit.

4.6.3 Vector derivative of a quadratic form

Consider the quadratic form y^T A y, where A is a J × J symmetric matrix. Show that the vector derivative of y^T A y with respect to y is 2Ay.

Solution
By definition ∂(y^T A y)/∂y is the vector of scalar derivatives ∂(y^T A y)/∂y_j, j = 1 ... J. The terms involving y_j in y^T A y are a_jj y_j² (with derivative 2 a_jj y_j) and then J − 1 terms of the form (a_jj' + a_j'j) y_j y_j', j' = 1 ... J, j' ≠ j. Since A is symmetric these latter J − 1 terms are of the form 2 a_jj' y_j y_j', with derivatives 2 a_jj' y_j'. Hence ∂(y^T A y)/∂y_j is 2 Σ_j' a_jj' y_j', which is precisely the jth element of 2Ay.

4.6.4 Correlation between dummy variables and canonical variates

Let Z ≡ [Z_1 Z_2] be a bivariate indicator matrix and let a and b be vectors of canonical weights associated with the highest (non-trivial) canonical correlation ρ between the columns of Z_1 and Z_2. Show that the correlation coefficient between the ith column of Z_1 and the vector Z_1 a of canonical scores is φ_i r_i^{1/2}/(1 − r_i)^{1/2}, where φ_i is the standard co-ordinate of the ith row of the contingency table N ≡ Z_1^T Z_2 on the first principal axis of the correspondence analysis. Show furthermore that the correlation coefficient between the ith column of Z_1 and the vector Z_2 b of canonical scores is f_i r_i^{1/2}/(1 − r_i)^{1/2}, where f_i is the ith first principal co-ordinate. (Thus, since f_i = ρ φ_i, the within-set and between-set correlations differ only by a scaling factor, the canonical correlation.)

Solution
In general, the correlation between an I-vector z of 0s and 1s and a continuous variable x, called the "point-biserial" correlation, simplifies to:

{z^T x − (z^T 1) x̄} / {(z^T 1)(I − z^T 1) s_x²}^{1/2}

where x̄ and s_x² are the mean and variance of x. Since correlations are independent of centering and scaling we can use canonical scores Z_1 a and Z_2 b which are centered and have variance 1, that is a = φ and b = γ (the first standard co-ordinate vectors), so that the correlation simplifies further as (z^T x)/{(z^T 1)(I − z^T 1)}^{1/2}. If z is the ith column z_i of Z_1, then z_i^T 1 = I r_i, where r_i is the usual mass of the ith row of N ≡ Z_1^T Z_2. For within-set correlations x is Z_1 φ and z_i^T Z_1 φ = I r_i φ_i, so that the correlation is:

(I r_i φ_i)/{(I r_i)(I − I r_i)}^{1/2} = {r_i/(1 − r_i)}^{1/2} φ_i

For between-set correlations x is Z_2 γ and z_i^T Z_2 γ = I {(z_i^T Z_2)/I} γ = I ρ r_i φ_i, by the transition formula from columns to rows, since (z_i^T Z_2)/(I r_i) is the ith row profile of N: ρ φ_i = {(z_i^T Z_2)/(I r_i)} γ. Hence the between-set correlation is just ρ (= λ_1^{1/2}) times the within-set correlation.

Comment
To illustrate these results, consider the contingency table of Table 8.5. Correspondence analysis yields a first principal axis with inertia 0.1043 and a principal co-ordinate for the first column, say, of g_1 = 0.443, with mass 124/390 = 0.318. The above formulae hold in a symmetric fashion for the between- and within-set correlations relative to the columns of Z_2, so that these correlations are respectively: 0.443(0.318)^{1/2}/(0.682)^{1/2} = 0.303 and (dividing the above by the canonical correlation, the square root of the inertia) 0.937. These figures agree with the correlations computed by Holland et al. (1981, Table 2) up to a change in sign. In their case they compute the canonical solutions using the large indicator matrix and then actually compute correlations in the usual way between the vectors of dummy variables and scores. Our object in this example is to show how much simpler it can be to obtain their results, working directly on the contingency table, and to show that the between-set ("interset") correlations are merely a scaled-down version of the within-set ("intraset") correlations, a fact which the above authors appear to ignore.

4.6.5 Generalized inverses in canonical correlation analysis of dummy variables

Let Z ≡ [Z_1 Z_2] be an I × J indicator matrix and consider the canonical correlation analysis of the two sets of J_1 and J_2 columns (dummy variables). If the respective covariance matrices S_11 and S_22 of the two sets of variables were non-singular then the vectors a and b of optimal canonical weights are eigenvectors of the matrices S_11^{-1} S_12 S_22^{-1} S_21 and S_22^{-1} S_21 S_11^{-1} S_12 respectively, associated with the highest eigenvalue ρ², with the usual identification/standardization a^T S_11 a = b^T S_22 b = 1. (This is equivalent to (4.4.2–4.4.4).) Of course in the present case the covariance matrices (4.4.5) of the sets of dummy variables are singular and their inverses do not exist. However, certain generalized inverses of these matrices can be defined and substituted for the ordinary inverses to allow the usual theory to remain applicable.

(i) Show that the generalized inverses S_11^− ≡ D_r^{-1} − 1 1^T and S_22^− ≡ D_c^{-1} − 1 1^T lead to canonical weight vectors a_1 and b_1 which are identical to the first standard row and column co-ordinates in the correspondence analysis of N ≡ Z_1^T Z_2.
(ii) Consider the generalized inverses:

S_11* = [D_r*^{-1}  0; 0  0]        S_22* = [D_c*^{-1}  0; 0  0]

where D_r* and D_c* are the diagonal matrices of any (J_1 − 1) and (J_2 − 1) elements respectively of r and c. For ease of exposition, we have assumed that the last elements r_{J_1} and c_{J_2} are respectively excluded. Show that these generalized inverses lead to solutions of the form a* = [a_1* ... a*_{J_1−1} 0]^T and b* = [b_1* ... b*_{J_2−1} 0]^T and that these solutions yield those obtained in (i).

Solution
(i) S_11^− S_12 S_22^− S_21 = (D_r^{-1} − 1 1^T)(P − r c^T)(D_c^{-1} − 1 1^T)(P − r c^T)^T = D_r^{-1} P D_c^{-1} P^T − 1 r^T, which is the matrix D_r^{-1} P D_c^{-1} P^T of (4.1.23) with its trivial solution (eigenvector 1, eigenvalue 1) deflated to zero. Its leading eigenvector is therefore the first non-trivial standard row co-ordinate vector, with eigenvalue λ_1.

analysis on Z* ≡ [Z_1* Z_2*] where a column of Z_1 and of Z_2 has been omitted. Use the results of (4.6.4) to recover eventually the principal co-ordinates of the rows and columns of Z_1^T Z_2, given in the first columns of (3.1.3) and (3.2.3) respectively.

Solution
We use program BMDP6M (Dixon et al., 1981) to perform the canonical correlation analysis of Z* (193 × 7), where the last columns of Z_1 and Z_2 have been omitted. The resultant canonical weights correspond to the following "masses" r_i and c_j:

r_1 0.0570    r_2 0.0933    r_3 0.2642    r_4 0.4560    c_1 0.3161    c_2 0.2332    c_3 0.3212

and the largest canonical correlation is 0.2734. From (4.6.4) the standard co-ordinates

Suppose the correspondence matrix P is block diagonal, P = [P^(1) 0; 0 P^(2)], where the total of P is 1 (by definition) and the totals of P^(1) (I_1 × J_1) and P^(2) (I_2 × J_2) are denoted by t_1 and t_2 respectively, so that t_1 + t_2 = 1. Let F^(1), G^(1), D_λ^(1) and F^(2), G^(2), D_λ^(2) denote the complete results of the respective correspondence analyses of P^(1) and P^(2). Show that the largest non-trivial principal inertia in the correspondence analysis of P is 1 and that the other principal inertias are those of D_λ^(1) and D_λ^(2) arranged in descending order. Show that the principal co-ordinates F and G in the analysis of P are of the following form (where we have ignored the re-ordering of the columns in terms of the principal inertias):

F = [α_1 1   β_1 F^(1)   0; α_2 1   0   β_2 F^(2)]   (row blocks of I_1 and I_2 rows)
G = [α_1 1   β_1 G^(1)   0; α_2 1   0   β_2 G^(2)]   (row blocks of J_1 and J_2 rows)

associated with principal inertias:

D_λ = diag(1, D_λ^(1), D_λ^(2))

where the scalars α_1, α_2, β_1 and β_2 depend only on the values of t_1 and t_2.

Solution
If r^(1), c^(1) and r^(2), c^(2) are the masses in the respective analyses of P^(1) and P^(2), then r and c are:

r = [t_1 r^(1); t_2 r^(2)]        c = [t_1 c^(1); t_2 c^(2)]

(since t_1 + t_2 = 1). The matrix of row profiles R = D_r^{-1} P is just the similarly blocked matrix of row profiles R^(1) and R^(2) of the submatrices:

R = [R^(1)  0; 0  R^(2)]

with a similar result for the matrix C of column profiles. Since

R [α_1 1; α_2 1] = [α_1 R^(1) 1; α_2 R^(2) 1] = [α_1 1; α_2 1]      and      C [α_1 1; α_2 1] = [α_1 1; α_2 1]

(noting the different orders of the vector 1 on the left- and right-hand sides of these expressions), the largest principal inertia is 1 and the values of α_1 and α_2 satisfy the centering and standardization conditions [α_1 1^T  α_2 1^T] r = α_1 t_1 + α_2 t_2 = 0 and [α_1 1^T  α_2 1^T] D_r [α_1 1^T  α_2 1^T]^T = α_1² t_1 + α_2² t_2 = 1, which imply that α_1 = −(t_2/t_1)^{1/2} and α_2 = (t_1/t_2)^{1/2}.

In a similar fashion we can show that β_1 = t_1^{-1/2} and β_2 = t_2^{-1/2} and that the columns of F and G satisfy the conditions F^T D_r F = D_λ = G^T D_c G and r^T F = 0^T = c^T G.
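The block-diagonal result is easy to confirm numerically. The following sketch (two arbitrary subtables, not from the book) glues two correspondence matrices into a block-diagonal P with totals t_1 and t_2 and checks that its non-trivial principal inertias are exactly the value 1 together with the inertias of the two sub-analyses.

```python
import numpy as np

def inertias(P):
    """Non-trivial principal inertias of a correspondence matrix P."""
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = np.diag(r**-0.5) @ (P - np.outer(r, c)) @ np.diag(c**-0.5)
    sv = np.linalg.svd(S, compute_uv=False)
    return np.sort(sv**2)[::-1][:min(P.shape) - 1]

N1 = np.array([[8.0, 2.0], [3.0, 7.0]])
N2 = np.array([[5.0, 1.0, 4.0], [2.0, 6.0, 2.0], [1.0, 2.0, 9.0]])
t1, t2 = 0.4, 0.6                      # totals of the two blocks
P1, P2 = N1 / N1.sum(), N2 / N2.sum()  # sub-correspondence matrices

# Block-diagonal correspondence matrix with total 1.
P = np.zeros((5, 5))
P[:2, :2] = t1 * P1
P[2:, 2:] = t2 * P2

lam = inertias(P)
expected = np.sort(np.concatenate(([1.0], inertias(P1), inertias(P2))))[::-1]
assert np.allclose(lam, expected, atol=1e-10)
```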
5. Multiple Correspondence Analysis 127
TABLE 5.1. The 193 × 9 indicator matrix Z ≡ [Z_1 Z_2], whose column sums are identical to the row and column sums of Table 3.1.

FIG. 5.1. 2-dimensional correspondence analysis of the 193 × 9 indicator matrix of Table 5.1, showing all the columns and some selected rows (first two principal inertias λ_1 = 0.6367 (18.2%) and λ_2 = 0.5500 (15.7%)).
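Fig. 5.1 analyses the indicator matrix rather than the contingency table itself. Constructing such a matrix is mechanical; the sketch below (the function name and the tiny table are ours, purely illustrative) expands each cell of a table into its cases and checks the basic bookkeeping.

```python
import numpy as np

def indicator_from_table(N):
    """Expand an I1 x I2 contingency table of counts into the
    n x (I1 + I2) bivariate indicator matrix Z = [Z1 Z2]."""
    N = np.asarray(N, dtype=int)
    rows, cols = np.nonzero(N)
    counts = N[rows, cols]
    ri = np.repeat(rows, counts)        # row category of each case
    ci = np.repeat(cols, counts)        # column category of each case
    Z1 = np.eye(N.shape[0], dtype=int)[ri]
    Z2 = np.eye(N.shape[1], dtype=int)[ci]
    return np.hstack([Z1, Z2])

N = np.array([[2, 1], [0, 3]])
Z = indicator_from_table(N)
assert Z.shape == (6, 4)                 # 6 cases, 2 + 2 categories
assert (Z.sum(axis=1) == 2).all()        # each case answers both questions
# Column sums of Z are the row and column margins of N:
assert (Z.sum(axis=0) == [3, 3, 2, 4]).all()
# And Z1^T Z2 recovers N:
assert (Z[:, :2].T @ Z[:, 2:] == N).all()
```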
inertias are much higher and there are less dramatic differences in successive percentages of inertia; for example, the first and second principal inertias are 0.6367 and 0.5500 respectively (18.2% and 15.7% of the total inertia respectively), compared to the values 0.07476 and 0.01002 (87.8% and 11.7% respectively) in the analysis of the contingency table. In fact we obtain 7 non-trivial principal axes of the indicator matrix, whereas the contingency table yields only 3.
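These figures can be reproduced directly, assuming Table 3.1 is the widely published staff-group by smoking-category table (the cell values below are reproduced on that assumption and should be treated as illustrative). The sketch expands the table into its 193 × 9 indicator matrix, runs both correspondence analyses, and confirms that each indicator-matrix inertia is (1 ± λ^{1/2})/2 for a contingency-table inertia λ, the relationship derived below as (5.1.11)-(5.1.12).

```python
import numpy as np

def ca_inertias(P):
    """Non-trivial principal inertias of a correspondence matrix P."""
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = np.diag(r**-0.5) @ (P - np.outer(r, c)) @ np.diag(c**-0.5)
    return np.sort(np.linalg.svd(S, compute_uv=False)**2)[::-1]

# Assumed values of Table 3.1 (staff groups x smoking categories).
N = np.array([[4, 2, 3, 2],       # SM
              [4, 3, 7, 4],       # JM
              [25, 10, 12, 4],    # SE
              [18, 24, 33, 13],   # JE
              [10, 6, 7, 2]])     # SC

# Expand N into the 193 x 9 bivariate indicator matrix Z = [Z1 Z2].
rows, cols = np.nonzero(N)
ri = np.repeat(rows, N[rows, cols])
ci = np.repeat(cols, N[rows, cols])
Z = np.hstack([np.eye(5)[ri], np.eye(4)[ci]])

lam = ca_inertias(N / N.sum())[:3]     # contingency table: 3 non-trivial axes
lamZ = ca_inertias(Z / Z.sum())[:7]    # indicator matrix: 7 non-trivial axes

# lambda^Z = (1 ± sqrt(lambda))/2, plus one inertia of exactly 1/2 (J1 - J2 = 1).
expected = np.sort(np.concatenate([(1 + np.sqrt(lam)) / 2,
                                   [0.5],
                                   (1 - np.sqrt(lam)) / 2]))[::-1]
assert np.allclose(lamZ, expected, atol=1e-10)
assert np.allclose(np.round(lamZ[:2], 4), [0.6367, 0.5500])
```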
In Fig. 5.1 we have also indicated the positions of rows labelled "3,1", "3,2", "3,3" and "3,4" (i.e. senior employees in the 4 smoking categories, in the context of the example) and "4,1", "4,2", "4,3" and "4,4" (i.e. junior employees in the 4 smoking categories). Since the indicator matrix has 25 identical rows corresponding to senior employees who do not smoke, for example, these 25 points pile up at exactly the same position in the display, hence our labelling of these row points by their code. After the following theoretical treatment of the analysis we shall discuss the transition between the row points and the column points in this display as well as in other displays where either the row or the column points are in standard co-ordinates (Figs 5.3 and 5.4).

Column geometry

To anticipate the generality of the remaining sections of this chapter, we denote the numbers of rows and columns of the contingency table N by J_1 and J_2 respectively. The associated indicator matrix is denoted by Z, with I rows and (J_1 + J_2) columns, and is partitioned as Z ≡ [Z_1 Z_2] so that:

N = Z_1^T Z_2        (5.1.1)

We shall temporarily use the superfix Z to distinguish the correspondence analysis of Z from that of N, otherwise we use the same basic notation as Chapter 4.

The row masses r_i^Z are all equal to 1/I, while the column masses c_j^Z are equal to the row and column masses of N divided by 2 (cf. the column sums of Table 5.1, which are identical to the row and column sums of Table 3.1):

c^Z = (1/2) [r; c]        (5.1.2)

Thus the correspondence matrix and diagonal matrices of row and column masses which define the correspondence analysis of Z are respectively:

P^Z = (1/2I) Z        (5.1.3)
D_r^Z = (1/I) I        (5.1.4)
D_c^Z = (1/2) [D_r  0; 0  D_c]        (5.1.5)

It is convenient to show the relationship between the two analyses in terms of the standard co-ordinate matrices Φ and Γ (in the analysis of N) and Γ^Z (in the analysis of Z), since we can avoid the question of rescaling during the discussion and introduce it later as an option in the display. From (4.1.23) the standard co-ordinates of the (J_1 + J_2) columns of Z are obtained from the non-trivial eigenvectors of (D_c^Z)^{-1} (P^Z)^T (D_r^Z)^{-1} P^Z Γ^Z = Γ^Z D_λ^Z, that is:

(1/2I) [D_r^{-1}  0; 0  D_c^{-1}] [Z_1^T Z_1  Z_1^T Z_2; Z_2^T Z_1  Z_2^T Z_2] [Γ_1^Z; Γ_2^Z] = [Γ_1^Z; Γ_2^Z] D_λ^Z        (5.1.6)

where we have partitioned Γ^Z into Γ_1^Z and Γ_2^Z, with J_1 and J_2 rows respectively. Since the correspondence matrix in the analysis of N is P = (1/I) N = (1/I) Z_1^T Z_2 and since Z_1^T Z_1 = I D_r and Z_2^T Z_2 = I D_c, (5.1.6) simplifies as the following pair of equations:

Γ_1^Z + D_r^{-1} P Γ_2^Z = 2 Γ_1^Z D_λ^Z        (5.1.7)
D_c^{-1} P^T Γ_1^Z + Γ_2^Z = 2 Γ_2^Z D_λ^Z        (5.1.8)

Multiplying (5.1.7) on the left by D_c^{-1} P^T and using the expression for D_c^{-1} P^T Γ_1^Z in (5.1.8), we obtain:

D_c^{-1} P^T D_r^{-1} P Γ_2^Z = Γ_2^Z (2 D_λ^Z − I)(2 D_λ^Z − I)        (5.1.9)

Similarly, after premultiplying (5.1.8) by D_r^{-1} P and using the expression for D_r^{-1} P Γ_2^Z in (5.1.7), we obtain:

D_r^{-1} P D_c^{-1} P^T Γ_1^Z = Γ_1^Z (2 D_λ^Z − I)(2 D_λ^Z − I)        (5.1.10)

Eigenequations (5.1.9) and (5.1.10) involve the same matrices as (4.1.23), hence the solutions in the analysis of N are solutions of these equations. However, the principal co-ordinates will be subject to different rescalings along the principal axes. It is clear when comparing (5.1.9) and (5.1.10) with (4.1.23) that the relationship between the eigenvalues of the two analyses is:

λ = (2 λ^Z − 1)²        (5.1.11)

or, inversely:

λ^Z = (1/2)(1 ± λ^{1/2})        (5.1.12)

In Fig. 5.2 we fully account for all (J_1 + J_2) dimensions, both trivial and non-trivial, in the analysis of Z (remembering that we are at present studying the column points only). For ease of exposition and without loss of generality, we assume that J_2 ≤ J_1, so that there are J_2 dimensions in the analysis of N, the first of which is trivial. These yield twice as many dimensions in the analysis of Z, through (5.1.12). The trivial dimension (λ = 1) yields the expected trivial dimension associated with λ^Z = 1, as well as a "null" dimension λ^Z = 0 (hence the 7 dimensions of Table 5.1, rather than the expected 8). The non-trivial dimensions yield a set of J_2 − 1 dimensions associated with principal inertias λ^Z = (1/2)(1 + λ^{1/2}) and J_2 − 1 dimensions with inertias λ^Z = (1/2)(1 − λ^{1/2}), with standard co-ordinates as shown in Fig. 5.2. This leaves J_1 + J_2 − 2J_2 = J_1 − J_2 dimensions unaccounted for. These are the counterparts of the null dimensions of the analysis of N, namely the (J_1 − J_2) dimensions in the J_1-dimensional space of column profiles along which there is no inertia (λ = 0). Their existence is irrelevant to the analysis of N but here they emerge as (J_1 − J_2) dimensions associated with the inertia λ^Z = 1/2 (cf. (5.1.12)). The associated "co-ordinates" Φ° satisfy the condition of uncorrelation with Φ: Φ^T D_r Φ° = 0, but have undetermined orientation if (J_1 − J_2) ≥ 2. Anyway it is clear that the last J_1 dimensions in this analysis are artefacts, with dimensions 2, 3, ..., J_2 being the only ones of interest. These correspond exactly to those in the analysis of N.

To summarize, so far, there is no difference between the display in standard co-ordinates of the rows and columns of N and that of the columns of Z, where we disregard all dimensions of Z associated with inertias of 1/2 and less. However, there is a substantial difference in the principal inertias, which will affect the display in principal co-ordinates. First the percentages of inertia in the analysis of Z will be very much lower, and secondly the differences between their values are less dramatic: the column profiles of Z are dispersed more "spherically" than the row and column profiles of N. Because the interesting principal inertias in the analysis of Z are above 1/2, it seems that the percentages of inertia should be calculated on the quantities λ_k^Z − 1/2, k = 1 ... J_2 − 1, which in our example of Fig. 5.1 would be 0.1367, 0.0500 and 0.0102 respectively, that is percentages of 69.4%, 25.4% and 5.2%. These percentages reflect the relative values of the square roots of the principal inertias in the analysis of N. In Section 5.2 we shall discuss the computation of percentages of inertia in a more general situation.

FIG. 5.2. (a) The complete matrix of standard co-ordinates (including the trivial and null dimensions) in the correspondence analysis of the bivariate indicator matrix Z ≡ [Z_1 Z_2], with associated principal inertias (eigenvalues). Thus Γ^Z is the (J_1 + J_2) × (J_1 + J_2 − 2) matrix excluding the first and last columns. The first column corresponds to the usual trivial dimension which centres the cloud of J_1 + J_2 points, while the last column corresponds to a null dimension created by the additional linear dependency amongst the column profiles. (b) The complete matrix of standard co-ordinates Φ and Γ in the correspondence analysis of the contingency table N ≡ Z_1^T Z_2, with co-ordinates Φ° in J_1 − J_2 null dimensions of the column profiles such that Φ^T D_r Φ° = 0.

The I row profiles are vectors originally in (J_1 + J_2)-dimensional space, but they occur at only J_1 J_2 distinct positions. The frequencies in N indicate how many "pile up" at each of these positions and it is equivalent to consider the geometry of the J_1 J_2 distinct points with masses equal to p_ij. (By the principle of distributional equivalence this agglomeration of the rows does not affect the geometry of the columns, so that the columns can be considered initially as points in (J_1 J_2)-dimensional space.) Different subsets of the rows collectively define a column of the indicator matrix, so there is only a subtle difference between the rows and columns of the indicator matrix. For example, with the labelling of the rows "i_1, i_2", i = 1 ... I, where i_1 and i_2 indicate the responses of row i (i.e. the categories of the two discrete variables to which i belongs), all the rows with i_1 fixed collectively represent the category i_1 (cf. Fig. 5.1, where all the points labelled "3,1" ... "3,4" represent the group SE, the third category of the first discrete variable). In this case there is not only the usual transition formula from the column points to the individual row points but also a close relationship between the centroid of such a subset of rows and the column point representing the particular category. In Figs 5.3 and 5.4 we show the results compatible in scale with Fig. 5.1, when firstly the column points and secondly the row points are displayed in standard co-ordinates. Only in Fig. 5.4, where the row points are in standard co-ordinates and the column points are in principal co-ordinates, do these centroids of the rows coincide with the corresponding column point. Otherwise some rescaling along the principal axes is necessary to make them coincide, where the rescaling depends on the principal inertias. Table 5.2 summarizes the transitions between the rows (both individually and in groups) and the columns in each of the three displays. These results are proved in Example 5.5.1.

Notice that it is geometrically impossible to obtain a display where the individual row points lie midway between their response categories and, simultaneously, the group centroids coincide with the corresponding column point. This is reminiscent of the transition formulae between the two clouds
of points in the analysis of the contingency table N. Once again the display in principal co-ordinates can be viewed as a compromise between these two competing objectives and has the advantage here that the rescaling in the transition to individual rows as well as to row centroids is, at least, the same.

FIG. 5.3. Same analysis as Fig. 5.1, with the column points in standard co-ordinates. Each row point lies at a position midway between the corresponding column points; for example, the 10 points labelled "3,2" (the 10 senior employees who smoke lightly) lie exactly midway between points SE and LI.

FIG. 5.4. Same analysis as Fig. 5.1, with the row points in standard co-ordinates. The centroid of the 51 row points which have labels "3,1", "3,2", "3,3" and "3,4", for example, and which occur only at the four positions indicated, coincides with the column point SE, the third category of the first discrete variable.

Relationship with dual scaling

To conclude this section let us show how the dual scaling of the indicator matrix Z ≡ [Z_1 Z_2] is related to our analyses above. The objective of dual scaling is to assign scale values to the columns (categories) so as to maximize the variance of the row scores. Let us denote the (J_1 + J_2) scale values by the vector v, partitioned into v_1 (J_1 × 1) and v_2 (J_2 × 1), and the I row scores by the vector s, where the ith is of the form (1/2)(v_{i_1} + v_{i_2}), the average of the scale values of the responses of i. It is clear that this objective is equivalent to our
8co objective in the correspondence analysis of Z. In fact the optimal scale and
U O)
"O .~
"O e scores can be obtained exactly from the first principal axis of Fig. 5.3 if the
e ·0 E
tíO) ~ ::J
c
following identification conditions for vare chosen:
.2:l <t~~ o
co
co -Q)Cc
LO ~ ·u
o.
<J)
~
O)
u u
O)
..e..e
r^T v_1 = c^T v_2 = 0        v_1^T D_r v_1 = v_2^T D_c v_2 = 1        (5.1.13)

These imply the centering and standardization of v in terms of the column masses of Z (cf. (5.1.2)):

(c^Z)^T v = 0        v^T D_c^Z v = 1        (5.1.14)

but these latter conditions do not in general imply those of v_1 and v_2 individually in (5.1.13). However, under the constraints of (5.1.14) on v, the optimal solution does in fact turn out to satisfy (5.1.13). In effect we can view the last J_2 "uninteresting" dimensions of Fig. 5.2(a) as forcing the separate centerings and standardizations of v_1 and v_2. Here v_1 and v_2 are the first columns of Φ and Γ respectively, associated with the inertia (i.e. score variance in dual scaling) of ½(1 + λ_1^{1/2}). There is also a vector of standard co-ordinates, v_1 and −v_2, the first columns of Φ and −Γ, associated with the inertia ½(1 − λ_1^{1/2}). The conditions that these be respectively uncorrelated with the trivial vector of co-ordinates 1 are:

1^T D_c^Z [v_1 ; v_2] = 0        and        1^T D_c^Z [v_1 ; −v_2] = 0

i.e.

r^T v_1 + c^T v_2 = 0        and        r^T v_1 − c^T v_2 = 0

Together these conditions are equivalent to the individual centering constraints of (5.1.13). Similarly, the conditions that these vectors be of unit inertia and uncorrelated with each other are:

[v_1^T  v_2^T] D_c^Z [v_1 ; v_2] = 1        and        [v_1^T  v_2^T] D_c^Z [v_1 ; −v_2] = 0

i.e.

v_1^T D_r v_1 + v_2^T D_c v_2 = 2        and        v_1^T D_r v_1 − v_2^T D_c v_2 = 0

Together these are equivalent to the individual standardizing constraints of (5.1.13).
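The block-wise conditions above are easy to check numerically. The sketch below uses toy data and a minimal correspondence-analysis helper of our own (not code from the book): it builds a bivariate indicator matrix Z = [Z1 Z2], extracts the first standard column co-ordinate v, and verifies that its sub-vectors v_1 and v_2 satisfy the separate centering and standardization of (5.1.13).

```python
import numpy as np

rng = np.random.default_rng(3)

def ca_standard_cols(N):
    """Column masses and standard column co-ordinates of the CA of N."""
    P = N / N.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))  # standardized residuals
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    return c, Vt.T / np.sqrt(c)[:, None]                # Gamma = Dc^(-1/2) V

# Toy bivariate indicator matrix (Q = 2): two random discrete variables
I, J1, J2 = 250, 3, 3
x, y = rng.integers(0, J1, I), rng.integers(0, J2, I)
Z = np.hstack([np.eye(J1)[x], np.eye(J2)[y]])

cZ, G = ca_standard_cols(Z)
v = G[:, 0]                        # optimal solution under (5.1.14)
v1, v2 = v[:J1], v[J1:]
r, c = 2 * cZ[:J1], 2 * cZ[J1:]    # row and column masses of N = Z1'Z2

# The individual conditions (5.1.13) hold at the optimum:
print(np.isclose(r @ v1, 0), np.isclose(c @ v2, 0))
print(np.isclose((r * v1**2).sum(), 1), np.isclose((c * v2**2).sum(), 1))
```

The overall constraints (5.1.14) are imposed by the SVD; the block-wise properties emerge only at the optimum, as the text states.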
o~ t:::"t¡ >·~c . .§ .~ ·0 15 o. 5.2 MULTIVARIATE INDICATOR MATRICES
·5<J) ~ o °CllCO<J):'=O)
u :~ ~ 3: e ~ :S
;;:: ...... O'xCJ)
C;;·S 3: ;;: e
:2 ·0 >
ti:2.e2~ An example of a trivariate indicator matrix is the original data underIying
O) '" -Cl
ti:.!!?Eo.~..o
<J)
O
Table 3.5. Rere we have an additional discrete variable with two categories
5. Multiple Correspondence Analysis 139
138 Theory and Applications ofCorrespondence Analysis
of response ("do drink" and "do not drink"), so that the indicator matrix is of the form Z ≡ [Z1 Z2 Z3] with I = 193 rows and J = J1 + J2 + J3 = 5 + 4 + 2 = 11 columns. Table 3.5 is therefore the particular condensation Z1^T[Z2 Z3] = [Z1^T Z2  Z1^T Z3]. There is now a clear difference between the information contained in Z and the information contained in this frequency table, since the latter completely ignores the direct association between variables 2 and 3 ("smoking" and "drinking"), as embodied in the matrix Z2^T Z3.

Underlying Table 3.7 is a 6-variate indicator matrix Z ≡ [Z1 ... Z6], where the columns of Z6 refer to the three types of company client. The table is thus the condensation [Z1 ... Z5]^T Z6, which summarizes the association of each of the first 5 variables with variable 6, but ignores all associations amongst the first 5.

Clearly when the study involves more than 2 discrete variables, then the analysis of the indicator matrix and the analysis of any such condensation of it into a two-way frequency table might differ quite drastically. In this section we shall clarify the geometry of a multivariate indicator matrix (so-called "multiple correspondence analysis") and of certain two-way contingency tables derived from it.

Row and column geometry

Let us consider a general Q-variate indicator matrix Z ≡ [Z1 ... ZQ], with I rows and J = J1 + ... + JQ columns, where the qth matrix Zq (corresponding to the qth discrete variable) has Jq columns (corresponding to Jq categories). There are QI ones scattered throughout Z, I in each submatrix Zq, otherwise the elements of Z are zeros. Each row of Zq adds up to 1, and each row of Z adds up to Q, while the column sums 1^T Z show the marginal distribution of responses over all the categories (see Fig. 5.5). We use the index j to refer to a column of Z and jq to refer to a column of Zq. The vector c^Z of column masses of Z is given by c^Z = (1/QI)Z^T 1, while the subset of masses for the columns of Zq is denoted by the vector c_q^Z, where c_q^Z = (1/QI)Z_q^T 1. For sake of clarity we shall use the superfix Z to designate the correspondence analysis of Z in this section.

[Fig. 5.5: the block structure of the indicator matrix Z = [Z1 Z2 ... ZQ]; only the column headings J1, J2 ... JQ are legible in this scan.]

We shall now summarize some aspects of the row and column geometries in the correspondence analysis of Z. These results are all proved in Example 5.5.2 and apply to the special case when Q = 2, which was discussed in Section 5.1. We assume throughout that there are responses in all the categories of response, in other words there is no column of Z which consists purely of zeros.

(a) The sum of the masses of the columns of Zq is 1/Q, for all q = 1 ... Q. Thus each discrete variable receives the same mass, which is distributed over the categories according to the frequencies of response.
(b) The centroid of the column profiles of Zq is at the origin of the display, that is at the centroid of all the column profiles. Thus each subcloud of categories is balanced at the origin.
(c) The total inertia of the column profiles (and of the row profiles) is:

in(J) = J/Q − 1        (5.2.1)

(d) The inertia of the column profiles of Zq is:

in(Jq) = (Jq − 1)/Q        (5.2.2)

Hence the inertia contributed by a discrete variable increases linearly with the number of response categories.
(e) The inertia of a particular category j is:

in(j) = 1/Q − c_j^Z        (5.2.3)
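Properties (a), (c) and (e) can be verified numerically. The sketch below is an illustration of ours (simulated data, not the book's examples): it builds a Q-variate indicator matrix and checks the masses and inertias directly from the matrix of standardized residuals.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: I = 200 respondents, Q = 3 variables with 4, 3 and 2 categories
I, cats = 200, [4, 3, 2]
Q, J = len(cats), sum(cats)
Z = np.hstack([np.eye(Jq)[rng.integers(0, Jq, I)] for Jq in cats])

P = Z / Z.sum()                          # correspondence matrix
r, c = P.sum(axis=1), P.sum(axis=0)      # row masses and column masses c^Z
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

# (a) each variable's categories carry total mass 1/Q
block_mass = [s.sum() for s in np.split(c, np.cumsum(cats)[:-1])]
print(np.allclose(block_mass, 1/Q))
# (c) total inertia is J/Q - 1, and (e) the inertia of category j is 1/Q - c_j
print(np.isclose((S**2).sum(), J/Q - 1))
print(np.allclose((S**2).sum(axis=0), 1/Q - c))
```

These identities hold exactly whenever no category is empty, as assumed in the text.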
The "Burt matrix"

It is instructive to compare the analysis of Z with that of the symmetric J × J matrix Z^T Z, which is called the Burt matrix in recognition of an article by Burt (1950) (cf. Benzécri, 1976; Lebart et al., 1977; Example 5.5.3). This matrix has the following block structure:

Z^T Z = [ Z1^T Z1   Z1^T Z2   ...   Z1^T ZQ
          Z2^T Z1   Z2^T Z2   ...   Z2^T ZQ
          ......................................
          ZQ^T Z1   ZQ^T Z2   ...   ZQ^T ZQ ]        (5.2.4)

Each "off-diagonal" submatrix Zq^T Zq' (q ≠ q') is a two-way contingency table which condenses the association between variables q and q' across the I individuals. Each "diagonal" submatrix Zq^T Zq is the diagonal matrix of column sums of Zq, which we have previously denoted in the vector QI c_q^Z. Because the Burt matrix is positive semidefinite symmetric it is clear that its correspondence analysis produces two identical sets of co-ordinates for the rows and the columns. In Example 5.5.3 we prove that the standard co-ordinates (of the rows or columns) in the analysis of Z^T Z are identical to the standard co-ordinates of the columns in the analysis of Z. Again the only difference lies in the values of the principal inertias, which will affect the scales of the principal co-ordinates. In this respect we also show that the principal inertias λ^B in the analysis of the Burt matrix are the squares of those of the indicator matrix:

λ^B = (λ^Z)²        (5.2.5)

In the bivariate case (Q = 2) the Burt matrix is simply:

Z^T Z ≡ [ Z1^T Z1   Z1^T Z2 ; Z2^T Z1   Z2^T Z2 ] ≡ [ I D_r   N ; N^T   I D_c ]        (5.2.6)

Now the standard co-ordinates of the rows and columns of N = Z1^T Z2 provide those of the columns of Z = [Z1 Z2] (cf. Fig. 5.2), which are identical to those of the columns (or rows) of Z^T Z. The principal inertias λ^B of Z^T Z are thus related to those of N by (cf. (5.1.12)):

λ^B = ¼(1 ± λ^{1/2})²        (5.2.7)

In Section 8.4 we describe an example by Healy and Goldstein (1976) where the data are reported in the form of a Burt matrix.

The fact that the analysis of the multivariate indicator matrix Z is equivalent to that of the Burt matrix illustrates that these analyses should be described as "joint bivariate" rather than multivariate (de Leeuw, 1973, Section 3.9). The Burt matrix is the analogue of the covariance matrix of Q continuous variables, where each Jq × Jq' submatrix is analogous to a covariance. Classical multivariate analysis of data on Q continuous variables rarely proceeds beyond considering second-order moments, thanks to the usual distributional assumptions of multinormality. Analogously, the correspondence analysis of Z (or, equivalently, of Z^T Z) does not take into account associations amongst more than two discrete variables but rather looks at all the two-way associations jointly. In the jargon of multiway contingency table analysis we consider only second-order interactions. Thus the correspondence analysis treatment of a multivariate indicator matrix Z seems to be at an interface between the classical joint bivariate treatment of continuous multivariate data and the complex interaction modelling of multiway contingency tables.

Because the row and column co-ordinates of the Burt matrix (5.2.4) are identical we use one notation Γ^B (respectively G^B) to denote the standard co-ordinates (respectively principal co-ordinates). Since the sum of the rows of each matrix Zq^T Zq', for all q and q', is the vector QI c_q^Z, it follows that the rows of Z^T Z sum to Q²I c^Z, so that the matrix R^B of row profiles of Z^T Z is of the following form:

R^B = (1/Q) [ I      R_12   R_13   ...   R_1Q
              R_21   I      R_23   ...   R_2Q
              ......................................
              R_Q1   R_Q2   ...          I    ]

where R_qq' is the matrix of row profiles of the two-way contingency table Zq^T Zq'. There is only one transition, from the column co-ordinates to themselves, for example in standard co-ordinates:

Γ^B = R^B Γ^B (D_λ^B)^{-1/2}

Γ^B can be partitioned into Q sets of rows Γ_q^B (q = 1 ... Q), in which case:

Γ_q^B = (1/Q)(Γ_q^B + Σ_{q'≠q} R_qq' Γ_q'^B)(D_λ^B)^{-1/2}        q = 1 ... Q

Collecting terms in Γ_q^B and remembering that Γ^Z = Γ^B and D_λ^Z = (D_λ^B)^{1/2}, we have the following expression for the co-ordinates of the categories of variable q in terms of those of the other variables in the correspondence analysis of Z:

Γ_q^Z(Q D_λ^Z − I) = Σ_{q'≠q} R_qq' Γ_q'^Z        (5.2.8)

As a special case of (5.2.8), when Q = 2, we have the pair of equations (5.1.7) and (5.1.8). This illustrates once again how special the bivariate case really is, because there is only one term in the sum on the right-hand side of (5.2.8).
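Relations (5.2.5) and (5.2.7) can be illustrated numerically. The following is a sketch under toy data (our own helper function, not the book's code); the claim about the extra inertia of ½ when J1 ≠ J2 is the standard Q = 2 result from Section 5.1.

```python
import numpy as np

rng = np.random.default_rng(1)

def ca_inertias(N):
    """Principal inertias (squared singular values) of the CA of N."""
    P = N / N.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    return np.linalg.svd(S, compute_uv=False)**2

I, cats = 150, [3, 2, 4]
Z = np.hstack([np.eye(Jq)[rng.integers(0, Jq, I)] for Jq in cats])
B = Z.T @ Z                                   # the Burt matrix

k = sum(cats) - len(cats)                     # J - Q nontrivial dimensions
lam_Z, lam_B = ca_inertias(Z)[:k], ca_inertias(B)[:k]
print(np.allclose(lam_B, lam_Z**2))           # eq. (5.2.5)

# Bivariate case (Q = 2): Burt inertias are (1/4)(1 +/- lam^(1/2))^2, plus the
# square of the one "artificial" inertia 1/2 arising because J1 != J2 here.
Z2 = Z[:, :5]                                 # variables 1 and 2 (3+2 columns)
lam = ca_inertias(Z2[:, :3].T @ Z2[:, 3:])[0] # the one inertia of the 3x2 table
lam_B2 = ca_inertias(Z2.T @ Z2)
expected = np.array([(1 + lam**0.5)**2 / 4, 0.25, (1 - lam**0.5)**2 / 4])
print(np.allclose(lam_B2[:3], expected))      # eq. (5.2.7)
```

The standard co-ordinates of Z and of B coincide; only the inertias, and hence the scales of the principal co-ordinates, differ.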
In very special situations the analysis of a Q-variate indicator matrix Z (for Q > 2) will be equivalent to that of a two-way frequency table. Suppose that the Q variables can be divided into two subsets of Q1 and Q2 variables respectively, such that the variables within each subset are pairwise independent of one another. We shall show how the analysis of Z is related to that of the table which crosses the categories of the Q1 variables with those of the Q2 variables. Without loss of generality, let us suppose that the first Q1 and last Q2 (= Q − Q1) variables of Z are the two subsets in question, such that any pair of variables of the first set are independent and, similarly, any pair of the second set are independent, that is:

Zq^T Zq' = I c_q c_q'^T        for q, q' = 1 ... Q1, q ≠ q'
                               and for q, q' = Q1+1 ... Q, q ≠ q'        (5.2.9)

(where c_q and c_q' are the row and column masses of the contingency table Zq^T Zq'). The two-way table in question is the (J1 + ... + J_Q1) × (J_{Q1+1} + ... + J_Q) table:

[ Z1^T Z_{Q1+1}      Z1^T Z_{Q1+2}      ...   Z1^T ZQ
  ..........................................................
  Z_{Q1}^T Z_{Q1+1}  Z_{Q1}^T Z_{Q1+2}  ...   Z_{Q1}^T ZQ ]        (5.2.10)

The equations defining the column co-ordinates of Z are of the form (5.2.8), with the terms on the right-hand side subdividing into two sets. If q' is in the same set as q then from (5.2.9) R_qq' = 1 c_q'^T, hence the term R_qq' Γ_q'^Z is zero because the centroid of the columns of Zq' is at the origin (the masses c_q'^Z of these columns are proportional to the masses c_q' of the columns of Zq^T Zq'). Thus the right-hand side of (5.2.8) involves variables of the other set only, resulting in two transition formulae between the co-ordinates of each set:

for q = 1 ... Q1:        Γ_q^Z(Q D_λ^Z − I) = Σ_{q'=Q1+1}^{Q} R_qq' Γ_q'^Z
for q = Q1+1 ... Q:      Γ_q^Z(Q D_λ^Z − I) = Σ_{q'=1}^{Q1} R_qq' Γ_q'^Z        (5.2.11)

In order to describe the correspondence analysis of (5.2.10), we shall use the notation Γ for the matrix of standard co-ordinates, with its rows partitioned exactly as those of Γ^Z (i.e. the first Q1 sets of rows of Γ are row co-ordinates and the remaining Q2 sets of rows are column co-ordinates). Thus the usual transition formulae between row and column co-ordinates are:

(columns to rows) for q = 1 ... Q1:      Γ_q D_μ^{1/2} = (1/Q2) Σ_{q'=Q1+1}^{Q} R_qq' Γ_q'
(rows to columns) for q = Q1+1 ... Q:    Γ_q D_μ^{1/2} = (1/Q1) Σ_{q'=1}^{Q1} R_qq' Γ_q'        (5.2.12)

In order to get (5.2.12) into a form comparable to (5.2.11) we have to factorize 1/Q2, for example, as (Q1^{1/2}/Q2^{1/2})/(Q1^{1/2}Q2^{1/2}). The Q1^{1/2}Q2^{1/2} factor is associated with D_μ^{1/2} so that we have the following relationship between the principal inertias of the two problems:

Q1^{1/2} Q2^{1/2} D_μ^{1/2} = Q D_λ^Z − I        (5.2.13)

The relationship between the standardized co-ordinates is a little more complicated to determine, owing to the different partitioning of the masses in the two problems. For example, the masses associated with the first Q1 sets of columns of Z sum to Q1/Q, whereas the masses associated with the rows of (5.2.10) sum to 1. From (5.2.12) and (5.2.13) we know, for example, that for q = 1 ... Q1: Q2^{1/2} Γ_q = β Γ_q^Z, for a scaling constant β. From the standardizations of Γ_q and Γ_q^Z, β = Q1^{1/2}Q2^{1/2}/Q^{1/2} so that:

Γ_q = (Q1/Q)^{1/2} Γ_q^Z        (5.2.14)

and similarly, for q = Q1+1 ... Q (the column co-ordinates):

Γ_q = (Q2/Q)^{1/2} Γ_q^Z        (5.2.15)

The relationship between the canonical correlation analysis of the bivariate indicator matrix [Z1 Z2] and the correspondence analysis of the contingency table Z1^T Z2 has been discussed in Section 4.4. Recall that we described the canonical correlation analysis as the search for maximally intercorrelated linear combinations u and v of the two sets of indicator variables, that is vectors u and v subtending a minimum angle, where u and v are identified by the usual standardization conditions of zero mean and unit variance. It is not possible to generalize this definition to the multivariate case, because each submatrix Zq defines a subspace and the concept of an angle between more than two subspaces cannot be generalized.

However, there are alternative definitions of canonical correlation analysis which are readily generalized. For example, an equivalent objective is to find u and v, with the same identification conditions, so that the sum of their squared correlations with a third vector w (whose components also sum to zero) is a maximum (Carroll, 1968). This is in turn equivalent to finding u and v so that their sum u + v has maximum distance from the origin, subject to overall centering of u + v to have zero mean and a single standardization condition that the mean squared distance of u and v to the origin is a constant (Lebart et al., 1977). At the optimum u and v turn out to be individually identified to have zero mean and the same variance.

In order to generalize to Q sets of variables, let u_q (I × 1) denote the linear
combination of the qth set of variables (q = 1 ... Q), where the coefficients are denoted by a_q (Jq × 1):

u_q = Zq a_q

We can then search for solutions a_q, q = 1 ... Q, which yield maximum sum of squared correlations between the u_q's and a (Q+1)th vector w (I × 1), subject to the usual identification conditions on each a_q via those on u_q. Or, equivalently, we can search for solutions which result in u_1 + ... + u_Q having maximum distance from the origin, subject to overall centering and the single identification condition that the mean squared lengths of the u_q's is a constant. Either of these objectives is equivalent to the correspondence analysis of the Q-variate indicator matrix Z (or of Z^T Z). Again at the optimum the u_q's are individually centered and standardized to have equal variance, i.e. equal (squared) distance from the origin, or squared length.

Relationship to dual scaling

The second equivalent definition of canonical correlation analysis which we have discussed above is exactly the objective of the dual scaling of Z: namely to assign scale values a_1 ... a_J to all the categories of the variables (columns of Z) so as to maximize the variance of the row scores. The elements of the vector u_1 + ... + u_Q are Q times the row scores (the scores are conventionally averages of scale values) and they sum to zero, so that maximizing the distance of u_1 + ... + u_Q to the origin is equivalent to maximizing the row score variance. As discussed above, the optimum scale values are such that the u_q's are individually centered and standardized in the same way. In other words the overall identification conditions of location and scale on the complete set of scale values imply individual conditions on the subsets a_1 ... a_J at the optimum. We shall discuss this property further in Section 8.4, where different ways of constraining the solutions are described.

Artificial dimensions and the calculation of percentages of inertia

In Section 5.1 we showed that in the special case Q = 2 the principal inertias λ_k^Z of the indicator matrix Z ≡ [Z1 Z2] are related to those λ_k of the contingency table as follows (cf. (5.1.11)):

λ_k = 4(λ_k^Z − ½)²        (5.2.16)

We also showed that the "interesting" inertias λ_k^Z are those above the value of ½, the rest being artifacts of the analysis. In the context of dual scaling these minor dimensions (λ_k^Z < ½) serve to centre and standardize the two subsets of scale values for each major dimension (λ_k^Z > ½). When Q = 3, it turns out that if there are K⁺ principal axes with principal inertias above the value ⅓, then there are K⁻ = 2K⁺ principal axes with principal inertias less than ⅓ which imply the individual centering and standardization of the 3 subsets of scale values. In general, only the principal inertias above the value 1/Q are "interesting" and it is clear that a rather pessimistic impression of the quality of a display is obtained by the usual percentages of inertia. Also if the Jq categories are derived from segmenting the range of a continuous variable then we have the rather undesirable result that the usual percentages tend to zero, even on the major dimensions, as the subdivisions are made finer and finer. In other words, the process of subdividing introduces new dimensions as J increases (while Q is fixed), some of which are of no direct interest to the study of the interrelationships between the variables.

If we consider the analysis of the indicator matrix Z⁰ with J1 × J2 × ... × JQ rows, one row for each of the possible responses to the Q questions, then the J − Q principal inertias are all 1/Q. This is an additional justification for taking 1/Q as a "baseline" value for the principal inertias of an indicator matrix, which is essentially a row-reweighted version of Z⁰.

Benzécri (1979) thus proposes that the percentages of inertia be computed on the values λ_k^Z − (1/Q) and only for those inertias that are above 1/Q. In the case of the Burt matrix the percentages should be based on the quantities:

θ(λ_k^Z) ≡ {Q/(Q − 1)}² {λ_k^Z − (1/Q)}²        (5.2.17)

which vary from 0 to 1 as λ_k^Z varies from (1/Q) to 1. Notice that (5.2.16) is admitted as a special case when Q = 2. Again, only λ_k^Z's greater than (1/Q) should be taken into account. The values θ(λ_k^Z) also occur as principal inertias in the analysis of the Burt matrix for which the diagonal has been set to zero. We shall discuss this property further in Section 8.6 which deals with the analysis of symmetric correspondence matrices.

Special case: binary variables

When each variable has only two categories, that is Jq = 2 for all q and J = 2Q, the correspondence analysis of Z is closely related to the principal components analysis of a matrix Y with Q columns, where each of these columns is one of a pair of columns of Z and is standardized to have unit variance. Thus in the matrix Y each discrete variable is represented by just one of its categories.

It is slightly easier to show the relationship between the correspondence analysis of the 2Q × 2Q Burt matrix B = Z^T Z and the principal components analysis of the Q × Q correlation matrix (1/I)Y^T Y. If V denotes the matrix of eigenvectors of the correlation matrix: {(1/I)Y^T Y}V = V D_λ, then V is related
to the matrix Γ⁺ composed of the Q corresponding rows of the standard co-ordinate matrix Γ^B (or Γ^Z) as follows:

Γ⁺ = D_ψ V

where D_ψ is a diagonal matrix with typical diagonal element:

ψ_q = ((I − b_qq)/b_qq)^{1/2}

where q refers to the selected category of the qth discrete variable and b_qq is the corresponding diagonal element of B. Notice that the ψ_q's rescale the rows of V to obtain the rows of Γ⁺ which represent the selected categories. Thus the relative positions of the categories on every principal axis are changed by this rescaling and the difference between the configuration of rows of V and that of Γ⁺ is by no means trivial. This difference affects the positions of the rows of Z and the rows of Y as well, since these depend on the scalings of the Q variables. The value of ψ_q depends on the variance of the qth variable and, since the relevant column of Z contains only 0s and 1s, this variance is easily seen to be z̄_q(1 − z̄_q), where z̄_q is the mean of this column (i.e. the variance of a Bernoulli variable). Since z̄_q = b_qq/I this variance may be written equivalently as b_qq(I − b_qq)/I², and so the nearer b_qq is to I, say, the closer ψ_q is to zero. If b_qq = ½I, that is there is no "polarization" on variable q, then there is no difference between the qth rows of Γ⁺ and V.

These results come up again in Section 6.2 as a special case of the more general discussion of "doubled" data in Chapter 6, where we amplify the concept of polarization of bipolar observations.

5.3 ANALYSIS OF QUESTIONNAIRES AND NON-RESPONSES

The most common example of a multivariate indicator matrix arises as the result of a sample survey where I individuals respond to Q questions of a questionnaire. There are many ways to conduct a survey, for example, a question may be posed along with a fixed number of alternatives from which the respondent must make exactly one selection. In some cases it will be difficult to specify all possible responses beforehand, so that the question is "open-ended" and a categorization of responses has to be made after all the questionnaires have been completed and studied. This latter strategy is naturally more problematic and involves a large amount of work before the statistical investigation even commences. Anyway, whatever the method of data collection the result is a multivariate indicator matrix, with each question q having a fixed number Jq of responses. A special case of such a questionnaire is a multiple choice examination, where a number of questions are posed and alternative answers are provided, one of which is the correct one. The fact that a particular answer is of special importance distinguishes this situation from a typical survey where no special emphasis is placed on particular responses.

Data emanating from questionnaires involve all the peculiarities that might be expected when dealing with people. People often deliberately omit answering a question (e.g. on their income), sometimes they do not know what to respond or their responses do not coincide exactly with any of the alternatives provided. A questionnaire which has been carefully prepared includes such alternatives as "Not prepared to answer", "Don't know" or "None of these" to allow for these eventualities, but having achieved this, how are these data to be analysed? We shall discuss this problem from a number of viewpoints and illustrate a number of ways that correspondence analysis can cope with such data.

Non-responses when the variables (questions) are binary

Let us start by considering a fairly simple questionnaire where each of the Q questions has only two possible responses (e.g. "Yes" or "No", "True" or "False", ...). The response "Yes" to a question is coded 1 and 0 in the two relevant columns of the indicator matrix, while a "No" is coded 0 and 1. A non-response might then be coded as ½ and ½ to indicate indifference to a "Yes" and to a "No". Alternatively, some other coding scheme α_q and 1 − α_q might be used with a different justification for the choice of α_q. Another alternative is to create a third response for each question, or to create just one extra variable which records the total number of non-responses to the set of questions by each individual.

We shall briefly describe some ways of handling non-responses in the context of the data in Table 5.3. These are the (fictitious) results of a survey of 100 randomly selected people, where each person has responded to a questionnaire consisting of 5 questions, also listed in Table 5.3. Only two alternative responses to each question are provided and 25 people choose not to respond to at least one question, with a total of 36 non-responses: 8, 16, 7 and 5 respectively to questions 2, 3, 4 and 5.

Since a quarter of the sample has not responded in some way, we can consider creating a third alternative for questions 2 to 5, so that the multivariate indicator matrix Z is a 100 × 14 matrix, with 2 columns (1a and 1b) for question 1 and 3 columns (2a, 2b, 2*; 3a, 3b, 3*; etc. ...) for the other questions (Table 5.5(a)). The Burt table Z^T Z is given in Table 5.4 (including a row and column of zeros marked 1*) and shows the marginal frequencies down the diagonal and the between-question contingency tables off the diagonal. Figure 5.6 shows the two-dimensional correspondence analysis of Z. The direction of spread along the first axis separates the younger, lower
TABLE 5.3
Questionnaire with Q = 5 questions and Jq = 2 possible responses to each question q. The responses of 100 people in a fictitious survey are shown, where * denotes a non-response.

Question 1  Sex: (a) male
                 (b) female
Question 2  Age: (a) under 30
                 (b) 30 or older
Question 3  Annual income (before tax):
                 (a) below £8000 per annum
                 (b) above £8000 per annum

[Questions 4 and 5, the 100 five-letter response patterns (e.g. "aabaa", "baaab", with * marking a non-response) and Table 5.4, the Burt table of these data, are not legible in this scan.]

income respondents who are pessimistic about the economy and generally anti-government, from the older, higher income respondents who are optimistic and pro-government. There does not seem to be much association between this feature and the sex of the respondent. The 4 non-response points are quite separate from the others, and determine the second principal axis,
with the point "female" being the only one which looks like it tends in the direction of the non-responses. It seems then that the female respondents refused to answer the questions more often than males; in fact they refused to answer 19 questions, whereas we would expect 15 non-responses based on the proportion of females in the sample. At this stage we are not interested in the statistical significance of this observation, but we can evaluate the chi-square statistic between the frequencies 17 and 19 respectively of male and female non-responses and the expected frequencies based on the proportions 58 and 42 respectively of males and females in the sample. The statistic is 1.717, which is not significant.

[Table 5.5: the alternative codings (a)-(e) of the non-responses; the rotated table is not legible in this scan. Fig. 5.6 display: λ1 = 0.3676 (20.4%), λ2 = 0.3336 (18.5%); legible point labels include 1a (male), 1b (female), 2a (<30), 2b (≥30), 3a, 3b, 4a, 5a (for), 5b (against).]

FIG. 5.6. Correspondence analysis of the 100 × 14 indicator matrix derived from the data of Table 5.3 (i.e. with coding of Table 5.5(a)). The inertias and percentages of inertia are those of the usual analysis, whereas if they are based on the quantities (5.2.17) the percentages of inertia are 56.2% and 35.7% respectively.

The spread of the non-response points relative to the spread of the other points is indicative of another interesting feature in the sample. Points representing non-response to questions 4 and 5 lie on the side of the older, higher income group. Looking at the data we see that 12 people did not respond to one of these questions, 7 of whom were "older", 2 were "younger" and 3 were of unknown age. This indicates that in the sample older people were indeed more reticent about answering these questions. Again we are not interested in testing whether this feature is statistically significant. Our objective here is exploratory, not confirmatory.

Instead of creating a special column for each question's non-response, we can create just one column which records the total number of non-responses, irrespective of the questions (Table 5.5(b)). Geometrically we have merged the 4 columns 2*, 3*, 4* and 5* into one point ** in Fig. 5.7, which represents the centroid of these points. The displayed positions of the other points hardly change and the only difference is that we no longer have the spread of the individual non-response points, as observed in Fig. 5.6. In other words, the principal plane is very stable with respect to merging these 4 points.

FIG. 5.7. Correspondence analysis of the 100 × 11 indicator matrix where all the non-responses are recoded in one column (i.e. with coding of Table 5.5(b)).

If the non-responses are of negligible importance in the survey we can choose the coding scheme illustrated in Table 5.5(c). Here there are only 2 columns per question and a non-response is coded as a ½ in each column. The non-response to a question is thus counted as half a response a and half a response b. (But, in general, we can code this as α and β, where α + β = 1; see below and Table 5.5(e).) Geometrically, the 16 non-responses to question 3, say, are concentrated into the point 3* of Fig. 5.6. The present coding re-allocates the mass of 3* in equal amounts to the points 3a and 3b. Because the non-response points almost completely determine the 2nd principal axis in the analysis of Fig. 5.6, we expect a new axis to emerge as a result of the "disintegration" of these points. Figure 5.8 shows the new analysis and we see that the first principal axis is very similar to those of previous analyses, while the second is now determined almost exclusively by

[Figs. 5.7 and 5.8 displays: principal inertias partially legible in the scan include λ1 = 0.3578 (26.4%), λ2 = 0.325 (24.0%) and λ2 = 0.2076 (22.7%); a merged point labelled "non-response" appears in Fig. 5.7.]

FIG. 5.8. Correspondence analysis of the 100 × 10 indicator matrix where the non-responses are recoded as ½ and ½ for responses a and b (i.e. with coding of Table 5.5(c)). The individual non-response points (2*, 3*, 4* and 5*) are displayed as supplementary points but are poorly correlated with this principal plane.
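The chi-square statistic quoted for the male/female non-responses (observed counts 17 and 19; sample proportions 58% male, 42% female) can be reproduced in a few lines:

```python
# Chi-square statistic for the split of the 36 non-responses by sex,
# using the observed counts and expected proportions given in the text.
obs = [17, 19]                          # male and female non-responses
exp = [0.58 * 36, 0.42 * 36]            # expected: 20.88 and 15.12
chi2 = sum((o - e)**2 / e for o, e in zip(obs, exp))
print(round(chi2, 3))                   # -> 1.717
```

As the text notes, this value is not significant (the comparison has a single degree of freedom), consistent with the exploratory reading of the display.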
154 Theory and Applications ofCorrespondence Analysis 5. Multiple Correspondence Analysis 155
FIG. 5.9. Correspondence analysis of the 100 x 14 indicator matrix where the
non-responses are recoded as 1/4, 1/4 and 1/2 for responses a, b and *
respectively (i.e. with coding of Table 5.5(d)). Principal inertias
λ1 = 0.3388 (25.5%) and λ2 = 0.2203 (16.6%).

the sex of the respondents. This is just a stronger indication that sex is not
associated with the feature which we have interpreted along the first axis.

One property which is common to all these schemes is that each response or
non-response generates a 1 in the recoded matrix, so that the total of each
recoded matrix is a constant QI, in this case 5 x 100 = 500. In Table 5.5(a)
the 1 for a non-response is placed in a special column, in Table 5.5(b) all
the 1s for non-responses are collected into a single column, and in Table
5.5(c) each 1 for a non-response is split equally between the possible
responses. Various combinations of these coding schemes are possible, for
example in Table 5.5(d); here the column names are the same as those in Table
5.5(a), but only 1/2 of the mass of a non-response is allocated to the
non-response column, the other half being divided between the responses a and
b in equal parts. The analysis of this matrix, shown in Fig. 5.9, is thus
"halfway" between Figs 5.6 and 5.8. In general we have a coding scheme which
allocates δ to the non-response column and (1/2)(1 - δ) to each of the
response columns, where δ can be any value from 0 to 1. It would be
interesting to observe how the display changes as δ decreases in small steps
from 1 to 0, as the focus is gradually taken off the non-responses until their
mass is absorbed completely into that of the responses. Remember that the
positions of the non-response points are fixed throughout this process and
that it is the orientation of the principal plane that is changing as the
"attractive force" (mass) of the non-response points is decreased to zero. In
the limit, as δ tends to 0, these positions can still be represented as
supplementary points, as in Fig. 5.8. In this particular example, because the
first principal axis is very stable, the principal plane is actually pivoting
around this axis as the masses of the non-response points are varied.

The most general way of recoding the non-responses prior to correspondence
analysis is illustrated in Table 5.5(e). Here a non-response to question q is
given a relative mass of δ_q, while the remainder of 1 - δ_q is divided into a
part α_q for response a and β_q for response b (so that α_q + β_q + δ_q = 1).
The values of α_q and β_q can be the same, as previously, or different if we
wish to view the non-responses as missing values. For example, individual
number 98 is an "older male", and of the 20 "older males" in the data who have
responded to question 3 we see that 11 have a "lower income" and 9 a "higher
income". We could thus allocate the remainder of 1 - δ_3 in proportions 11 to
9 to responses a and b respectively: α_3 = 11(1 - δ_3)/20,
β_3 = 9(1 - δ_3)/20. These values take into account the relatively higher
proportion of "higher incomes" amongst "older males", compared to the marginal
frequencies of the whole sample, which are 66 and 18 respectively for
responses 3a and 3b. Clearly our previous allocation of equal values of
(1/2)(1 - δ_3) to each response ignores even these marginals.

In many cases these different recodings of the data will lead to negligible
differences in the correspondence analyses, especially when the frequency of
non-response is fairly low. There is no "correct" way to recode the data; each
study has its own peculiarities and objectives which will largely determine
the approach of the data analyst. When faced with large frequencies of
non-responses, we suggest that different recoding schemes be attempted and the
subsequent analyses compared. As in the above example, each one of these
focuses on different aspects of the non-responses and their association with
the features in the main body of data.

Example 5.5.4 treats the problem of non-responses more theoretically. Although
we have illustrated the approach in the simple situation where each question
has just two possible responses, the principles remain identical in a more
general situation. Each line of the recoded matrix has a total of Q, a 1 for
each question q (q = 1 ... Q). In the case of a non-response this unit of mass
can be assigned totally or partially to a non-response category. If partially
assigned, the remainder must be divided between the actual response
categories, equally or in proportion to some justifiable distribution.
Geometrically, we create new (column) points for each question's non-response
and then divide the mass of the non-responses between these points and those
of the other (column) points.

Case study: United Nations' resolutions

Hamrouni and Benzécri (1976) discuss a case study of the pattern of voting by
member countries of the United Nations on 62 resolutions of the general
assembly during 1967. Their discussion centres mostly around the way to deal
with abstentions and absences by various countries. Initially they criticize
previous work by Deutsch and Martin (1971), who coded the data in binary form
as described above, corresponding to votes of "yes" and "no", with both
abstentions and absences coded as 1/2 and 1/2. Because there are large
frequencies of abstention during most votes and because certain countries are
characterized by high absenteeism, this is clearly an undesirable coding
scheme. An abstention in this context is not a mere non-response, but a
definite attitude and should be treated as a third distinct category. Whether
or not we should treat an absence as an additional category or whether we
should spread it across the three categories is not as clear, and Hamrouni and
Benzécri describe a number of different analyses to compare possible coding
schemes as well as to check the stability of their results. In one of these an
absence is coded as three zeros in the three categories. In this case the
coded values do not sum to 1, as we have described previously, and the
particular country receives a smaller mass. The position of such a country
would also be affected by this coding, in contrast to the coding schemes which
try to interpolate the absent vote by some set of expected values.

To illustrate the methodology of Hamrouni and Benzécri we have selected 4 out
of the 10 correspondence analyses which they performed on these data (their
analyses 1, 2, 5 and 7). These analyses are characterized by the number of
columns J_q of the matrix Z which are used for each resolution q and how the
abstentions and absences have been coded prior to analysis:

Analysis A (J_q = 3): q+ (yes), q- (no), q0 (abstention); absence coded as
0, 0, 0.

Analysis B (J_q = 3): q+ (yes), q- (no), q0 (abstention); absence coded as
α_q+, α_q-, α_q0, the respective proportions of countries who were present
that voted in each of the three possible ways.

Analysis C (J_q = 4): q+ (yes), q- (no), q0 (abstention), q* (absence);
absence has its own distinct column.

Analysis D (J_q = 2): q+ (yes), q- (no or abstention); absence coded as 0, 0.

TABLE 5.6
Comparison of four correspondence analyses described by Hamrouni and Benzécri
(1976, p. 178, their analyses 1, 2, 5 and 7 respectively) of the voting at the
United Nations in 1967. Correlations between the positions of the 122
countries on principal axes which are higher than (0.5)^(1/2) = 0.707 are
reported. For example, the correlation between these positions on axis 1 of
analysis A and on axis 1 of analysis C is 0.986, so these axes "agree".

    A vs B        A vs C        A vs D        B vs C
    1,1: 0.987    1,1: 0.986    1,1: 0.995    1,1: 0.998
    2,2: 0.994    2,3: 0.972    3,2: 0.938    2,3: 0.973
    3,3: 0.966    3,4: 0.857    4,3: 0.739    3,4: 0.876
    4,4: 0.969    4,5: 0.909                  4,5: 0.935
    5,5: 0.850
    6,6: 0.877
    7,7: 0.939
    8,8: 0.838

In order to compare the analyses we can compute the correlations between the
positions of the 122 countries (rows of Z) on the principal axes of each
analysis. Thus there appear to be relatively minor differences between
analyses A and B, because their respective principal axes (as far as the 8th)
are highly correlated (see Table 5.6). In analysis C the first four axes of
analyses A and B are recovered, but the second axis of analysis C is unique.
In fact it turns out to be an "absenteeism axis" and strings out the countries
which are frequently absent when a vote is taken. This feature is naturally
absent in the other analyses where absence is not treated quite as distinctly.
In analysis D the second axis of analyses A and B appears to have not
reappeared. This is an axis which separates Portugal and South Africa from the
other countries because they alone voted "no" to 9 resolutions. These isolated
votes have been lost in the coding scheme of analysis D where a "no" and an
"abstention" are equivalent.

Hamrouni and Benzécri (1976) make a substantial interpretation of the results
of these analyses and manage to describe as many as 7 distinct principal
axes. Of these at least 4 can be described as stable in the sense that they
reappear in different analyses.

5.4 RECODING OF HETEROGENEOUS DATA

A data analyst is often faced with a study which involves different types of
variables. For example, a questionnaire might consist of several questions to
which there are only categorical answers (e.g. yes/no, or
disagree/undecided/agree) as well as questions which elicit a numerical
response, like age and income. So far we have considered using correspondence
analysis exclusively on discrete data such as contingency tables of counts and
indicator matrices which record qualitative information on a set of
individuals. In this section we shall show how different types of data can be
recoded into a standard form so that they may be explored collectively using
correspondence analysis.

Let us consider a general situation where data is collected on I individuals,
where each individual i is observed or measured according to Q variables (or
"responds" to Q questions). Each individual i is characterized by a vector of
alphanumeric data which is clearly divisible into Q distinct parts. For
subsets of variables of the same, or similar, type we might have a good idea
how to analyse the data (e.g. principal components analysis of the continuous
data, correspondence analysis of the discrete data), but this would not give
us a single analysis which describes the data globally. In order to achieve a
global description, we first need to convert all the data into a common form.
Because all measurements may be regarded as discrete, to varying degrees, it
is clear that all data can be considered in a discrete framework.

When a discrete variable has only a "small" number of categories, we have
already seen that these categories can be identified with columns of an
indicator matrix containing 0s and 1s to indicate each individual's category.
A variable like "age (in years)" is also discrete but has many categories, one
for each year in the data set. Here it would usually be impractical to create
a different column in the indicator matrix for each of these years, unless the
number I of individuals was so high that there were no low frequencies in any
one year. Finally, a variable like temperature, recorded to one decimal place,
say, is discrete on categories of temperature in tenths of a degree, although
data analysts would consider this to be a continuous measurement. (Most
analysts would even gloss over the discreteness of the age measurement,
treating it also as continuous.) Another way of describing these variables is
that they are basically continuous phenomena, but our observation of them (or
our measuring instrument) causes a discretization of their value.

We can continue this process of discretization and define a smaller set of
categories of age, say, so that each individual is recorded as belonging to
one of a set of age groups. This can be viewed either as the segmenting of the
range of a continuous variable, or as the agglomeration of the categories of a
discrete variable. Information is clearly lost in the process. For example, if
temperature is recoded into three discrete categories: less than 10.0°C, 10.0
to 20.0°C and above 20.0°C, then observations of 5.2°C and 9.8°C, say, are not
distinguishable after recoding. A priori this seems a disastrous loss of
information, while on the other hand it might turn out that this hardly
affects our final results at all, while gaining the advantage of having
reduced the variable to a discrete form.

There are a number of ways to recode an observation on a continuous variable
to be discrete, involving only a few categories. The one just discussed is the
coarsest way and causes the greatest loss of information for a given number of
categories. This loss of information can be attenuated in a number of ad hoc
ways. For example, Guittonneau and Roux (1977) propose that near the
boundaries of the categories the strict 0/1 coding be relaxed, as depicted in
Fig. 5.10. This permits the indication by the coding that a value is near the
boundary point and not "completely" in the category. This is called fuzzy
coding (codage flou in French) as opposed to our previous logical coding.
Quite apart from the problem of deciding how many categories we should use in
recoding the variable, we now have additional decisions to make concerning the
width of the fuzzy areas around the boundaries. This width should be related
to the particular characteristics of the variable, for example measurement
error, but as yet there has been no in-depth study of the effect of different
choices of fuzzy coding. Whether the information saved can produce noticeably
improved results is still an open question.

A different recoding scheme has been proposed by Escofier (1979), who sets up
just 2 columns in the (recoded) indicator matrix for each continuous variable.
Suppose that x_iq, i = 1 ... I, are the (mean) centred and (variance)
standardized values of the observations on the qth variable across the
individuals, that is the mean and variance of the x_iq are 0 and 1
respectively. The two columns of the recoded matrix are labelled q+ and q- and
the ith individual (row) is coded as:

    z_iq+ = (1 + x_iq)/2        z_iq- = (1 - x_iq)/2        (5.4.1)

Because z_iq+ + z_iq- = 1, the centroid of the subset of columns (q+ and q-)

FIG. 5.10. A typical example of fuzzy coding of a quantitative measurement
into three categories. A value just below the upper cut-off point of the first
category is recoded as 0.9, 0.1, 0 rather than the strict logical coding of
1, 0, 0.
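A minimal sketch of such a fuzzy recoding, in the spirit of Fig. 5.10, is given below. The linear ramp and its half-width w are assumptions of this sketch; Guittonneau and Roux's exact scheme is not reported in the text.

```python
# Hypothetical fuzzy coding of a continuous value into len(cuts)+1 ordered
# categories: within +/- w of a boundary the unit of mass is shared linearly
# between the two adjacent categories; elsewhere strict logical 0/1 coding.
def fuzzy_code(x, cuts, w):
    k = len(cuts) + 1
    row = [0.0] * k
    for i, c in enumerate(cuts):
        if abs(x - c) < w:
            t = (x - (c - w)) / (2.0 * w)   # 0 at c - w, 1 at c + w
            row[i], row[i + 1] = 1.0 - t, t
            return row
    row[sum(x >= c for c in cuts)] = 1.0    # strict logical coding
    return row

# temperature in three categories: <10.0, 10.0-20.0, >20.0 (cf. the text)
print(fuzzy_code(5.2, [10.0, 20.0], 1.0))   # well inside category 1
print(fuzzy_code(9.8, [10.0, 20.0], 1.0))   # near the 10.0 boundary
```

Each recoded row still sums to 1, so the variable keeps the unit of mass per individual that logical coding would give it.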
pertaining to variable q is still at the origin of the correspondence analysis
display. The masses of columns q+ and q- are identical, because the x_iq are
centred, and thus the two column points must be equidistant from their
centroid (origin). This coding essentially creates a pair of positive and
negative poles for each continuous variable and the coding reflects to what
extent the variable lies on the positive and negative side of the observed
mean. Because of the mean centering, this type of recoding is particularly
suitable for "interval" variables (i.e. having an arbitrary origin, e.g.
temperature in °C), or for "ratio" variables which have an origin so far from
the range in which we observe them that we can regard our observations as
being of an interval nature.

Centering does not have to be performed with respect to the mean, in fact any
point of central tendency may be used, e.g. the median. Similarly a different
measure of spread may be used to standardize the observations before applying
the transformation (5.4.1). The column points q+ and q- will now have
different masses and not lie exactly equidistant from the centroid. Notice
that it is of minor consequence that some of the recoded values z_iq+ will be
greater than one, with their associated z_iq- negative. This departs slightly
from our concept of the correspondence matrix as a distribution of a unit of
positive mass amongst a matrix of cells. An alternative coding which ensures
that z_iq+ and z_iq- vary between 0 and 1 is to equate the minimum and maximum
values min(q) and max(q) of the variable to the values 0 and 1 of z_iq+, and
define z_iq- = 1 - z_iq+, rescaling the observations between these extremes.
This is equivalent to centering with respect to the midpoint of the extreme
values, (1/2)(min(q) + max(q)), standardizing by half the range,
(1/2)(max(q) - min(q)), and then using (5.4.1). The only advantage of using
the mean centering and variance standardization is that we know in advance
that the inertia of the columns q+ and q- is exactly 1/Q. This is the same as
the inertia of a binary discrete variable (J_q = 2) which has been coded
logically in an indicator matrix, whereas we know from (5.2.3) (proved in
Example 5.5.2(e)) that any other coding which forces the recoded values to lie
between 0 and 1 necessarily leads to the inertia of the variable being less
than 1/Q.

Case study: taxonomy of a plant genus

Guittonneau and Roux (1977) consider data on 75 species of the plant genus
Erodium. There are a total of 38 characteristics (variables) which can be
observed on each plant:

15 qualitative variables, 2 categories each (e.g. ridge of mericarp: feathery
or not?)

13 qualitative variables, 3 categories each (e.g. leaves: wide, average or
narrow?)

4 qualitative variables, 4 categories each (e.g. under the foveoles: two
grooves, one groove, a fold or nothing?)

1 qualitative variable, 6 categories (6 shapes of leaf)

5 quantitative variables (e.g. length of the petals)

Within a particular species the 33 qualitative variables are a constant, while
the quantitative measurements are the averages obtained from a sample of
plants of that species. In the case of the variables with 3 categories, the
second category is always intermediate to the first and the third, and the
authors decide to code each of these in 2 columns rather than 3, with the
intermediate category coded as (1/2, 1/2). Thus the first 28 variables
generate 30 + 26 = 56 columns of the matrix Z, while the remaining 5
qualitative variables generate 16 + 6 = 22 columns. After inspection of the
histograms of the 5 quantitative variables, these are each discretized into 4
categories, thus generating 20 columns of Z. Observations near the boundaries
of these intervals are indicated by fuzzy coding, as in Fig. 5.10, although
the authors do not report their exact scheme. The recoded indicator matrix is
thus of the following form:

    Z = [Z1 Z2 Z3 Z4 Z5]
    75 x 98 = 75 x (30 + 26 + 16 + 6 + 20)

with Z1, Z3 and Z4 in logical coding and Z2 and Z5 in fuzzy coding.

Various correspondence analyses are now performed on Z, including individual
analyses of [Z1 Z2], [Z3 Z4] and Z5, the submatrices of Z which are
homogeneous. We know that the inertia of a variable increases linearly with
the number of categories J_q, hence an analysis of Z could yield features
arising from the heterogeneity of inertias amongst the variables. Since the
total inertias of the above three homogeneous submatrices in their individual
analyses are "standardized" (and thus comparable) measures of the variation in
the respective submatrices, we can reweight each submatrix in a global
analysis so that its part of the inertia is proportional to its inertia in the
individual analysis. The way to reweight the submatrices (i.e. groups of
variables) can be obtained from Example 5.5.5 where we treat the reweighting
of a single qualitative variable in an indicator matrix.

Again, in order to compare the results of different analyses, correlations
between the positions of the species on principal axes are calculated. The
first four axes are recovered in all the analyses with high intercorrelations
and it turns out, for example, that reweighting does not change the results
dramatically. This does not mean that we need not reweight. Reweighting in
this analysis is preferable because we eliminate possible features which are
artifacts of the coding. The fact that the results are stable means that these
features are not strong enough to obscure our view of the "true" features in
the data. The topic of reweighting is discussed in more detail in Section 8.2.

5.5 EXAMPLES

5.5.1 Transition between rows and columns of a bivariate indicator matrix

Prove the results of Table 5.2 which summarize the transition formulae between
the rows and columns in three different displays in the correspondence
analysis of the I x (J1 + J2) indicator matrix Z ≡ [Z1 Z2].

Solution

(a) Both row and column points in principal co-ordinates. The transition
formula from columns to rows is (cf. (4.1.16)):

    F^Z = R^Z G^Z (D_λ^Z)^(-1/2) = R^Z [G_1^Z; G_2^Z] (D_λ^Z)^(-1/2)

The ith row profile (ith row of R^Z) is zero apart from two values of 1/2
indicating the categories of the two discrete variables to which i belongs.
Hence f_i^Z is just the average of the vectors g_j^Z and g_j'^Z, where j
indicates the jth row (category) of G_1^Z and j' the j'th row (category) of
G_2^Z, followed by the rescaling (D_λ^Z)^(-1/2), i.e. inverse square roots of
the principal inertias, along principal axes:

    f_i^Z = (D_λ^Z)^(-1/2) (1/2)(g_j^Z + g_j'^Z)        (5.5.1)

Now because G_1^Z and G_2^Z are identically rescaled versions of F and G (from
the analysis of N ≡ Z1'Z2), we have the transition from G_2^Z to G_1^Z:

    G_1^Z = R G_2^Z (D_λ^N)^(-1/2)        (5.5.2)

where R and D_λ^N pertain to the analysis of N.

The centroid f̄_(j) of all the points f_i^Z which represent rows with category
j of the first discrete variable, say, is the average of (5.5.1) as i ranges
over the rows with responses (j, j') for j' = 1 ... J2. Because j is regarded
as fixed the average of the terms g_j^Z is still g_j^Z. The average of the
terms g_j'^Z is the jth row of R G_2^Z, which by (5.5.2) is
(D_λ^N)^(1/2) g_j^Z, so that

    f̄_(j) = (D_λ^Z)^(-1/2) (1/2)(g_j^Z + (D_λ^N)^(1/2) g_j^Z)
          = (1/2)(D_λ^Z)^(-1/2) (I + (D_λ^N)^(1/2)) g_j^Z

Since (1/2)(I + (D_λ^N)^(1/2)) = D_λ^Z, from (5.1.12), we have:

    f̄_(j) = (D_λ^Z)^(1/2) g_j^Z        (5.5.3)

which yields the result in Table 5.2 that g_j^Z = (D_λ^Z)^(-1/2) f̄_(j).

(b) Row points in principal co-ordinates, column points in standard
co-ordinates. The transition formula from columns to rows is (cf. (4.1.31)):

    F^Z = R^Z Γ^Z = R^Z [Γ_1^Z; Γ_2^Z]

The result that the individual row is at the midpoint of its two responses is
clear from our argument above, that is (where f_(j,j') corresponds to a row
with response (j, j')):

    f_(j,j') = (1/2)(γ_j^Z + γ_j'^Z)

The average f̄_(j) here is the same as in (a) above, hence

    γ_j^Z = (D_λ^Z)^(-1/2) g_j^Z = (D_λ^Z)^(-1) f̄_(j)

which is the result we require, the standard co-ordinates being the principal
co-ordinates rescaled by (D_λ^Z)^(-1/2).

(c) Row points in standard co-ordinates, column points in principal
co-ordinates. The transition formula from columns to rows is (cf. (4.1.32)):

    Φ^Z = R^Z G^Z (D_λ^Z)^(-1) = R^Z [G_1^Z; G_2^Z] (D_λ^Z)^(-1)

The transition to an individual row point is thus easily seen to be:

    φ_(j,j') = (D_λ^Z)^(-1) (1/2)(g_j^Z + g_j'^Z)

The average φ̄_(j) is thus (D_λ^Z)^(-1/2) f̄_(j) (cf. (5.5.1)) and hence (from
(5.5.3))

    φ̄_(j) = g_j^Z

which is the required result.

5.5.2 Correspondence analysis of a multivariate indicator matrix

Prove the results in Section 5.2 concerning the correspondence analysis of the
multivariate indicator matrix Z ≡ [Z1 ... ZQ] (see p. 139).

Solution

(a) 1'Z_q 1 = I, i.e. there are I ones in each matrix Z_q, q = 1 ... Q. Thus
the masses of the columns of Z_q add up to I/QI = 1/Q.

(b) If c_q ≡ (1/QI)Z_q'1 are the masses of the columns of Z_q, then the
centroid of the columns of Z_q is

    (1/QI) Z_q D_cq^(-1) c_q / (1'c_q) = (Q/QI) Z_q 1 = (1/I)1

since Z_q 1 = 1, each row of Z_q containing a single 1. This is identical to
the centroid of all the columns of Z, which is the vector (1/I)1 of row
masses.

(c-e) We first prove (5.2.3). The inertia of the jth column profile is its
mass (c_j) times its squared distance to the centroid. The chi-square distance
between columns is simply proportional to the Euclidean distance because the
row masses are constant. Elements of the profile are either zero or 1/(QIc_j),
corresponding to elements 0 or 1 respectively of the jth column of Z.
Corresponding terms in the squared distance computation are
{0 - (1/I)}^2/(1/I) = 1/I and
{(1/(QIc_j)) - (1/I)}^2/(1/I) = (1/I){1/(Qc_j) - 1}^2 respectively. There are
QIc_j of the latter terms and I - QIc_j of the former so that the squared
distance is:

    Qc_j{1/(Qc_j) - 1}^2 + 1 - Qc_j = (1 - Qc_j)/Qc_j        (5.5.4)
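As a quick numerical check of (5.5.4), one can generate a random logical indicator matrix and compare each column's chi-square distance to the centroid with the closed form; this is a sketch for illustration, not code from the text.

```python
import numpy as np

# Check (5.5.4): for a logical indicator matrix Z with I rows, Q questions and
# column masses c_j, the squared chi-square distance from column j's profile
# to the centroid (1/I, ..., 1/I) is (1 - Q c_j)/(Q c_j).
rng = np.random.default_rng(0)
I, Q, J_q = 100, 3, 4
Z = np.zeros((I, Q * J_q))
for q in range(Q):                       # one random category per question
    Z[np.arange(I), q * J_q + rng.integers(0, J_q, size=I)] = 1.0

c = Z.sum(axis=0) / (Q * I)              # column masses
for j in range(Q * J_q):
    profile = Z[:, j] / Z[:, j].sum()    # jth column profile
    d2 = np.sum((profile - 1.0 / I) ** 2 / (1.0 / I))
    assert abs(d2 - (1 - Q * c[j]) / (Q * c[j])) < 1e-10
```

Weighting these squared distances by the column masses c_j and summing reproduces the inertias (5.2.2) and (5.2.1).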
The inertia of the jth column is thus:

    in(j) = c_j(1 - Qc_j)/Qc_j = (1/Q) - c_j

(5.2.2) follows by summing in(j) over j = j_q = 1 ... J_q:

    in(J_q) = J_q(1/Q) - (1/Q) = (J_q - 1)/Q

since the masses of these columns add up to (1/Q). (5.2.1) follows by summing
in(J_q) over q = 1 ... Q:

    in(J) = J/Q - 1

(f) Each cloud of J_q column profiles has the same centroid and occupies a ...

    F^Z = R^Z G^Z (D_λ^Z)^(-1/2)

where the ith row profile in R^Z is a vector of zeros and Q values of 1/Q
indicating the responses j1 ... jQ, say. Thus f_i^Z is the average
(1/Q)(g_j1^Z + ... + g_jQ^Z) of the column points concerned, followed by
rescaling along principal axes by the inverse square roots of the respective
principal inertias (cf. (5.5.1), where Q = 2). The centroid of all row
profiles f_i^Z which have response j1 = j to the first variable, say, is of
the (row vector) form:

    ...

This yields the required result, using the principal co-ordinate form of
(5.2.8).

5.5.3 Correspondence analysis of a Burt matrix

Prove that the standard co-ordinates of the columns in the correspondence
analysis of the Burt matrix B ≡ Z'Z are identical to the standard co-ordinates
Γ^Z of the columns of Z. Furthermore, show that the principal inertias in the
two respective analyses are simply related as follows: λ^B = (λ^Z)^2.

Solution

Using the eigenequation (4.1.23) for the standard column co-ordinates of Z, we
have:

    (Q^2 I D_c)^(-1) Z'Z Γ^Z = Γ^Z D_λ^Z

Since the column masses of B are identical to those of Z, the above
eigenequation can be written as:

    (Q^2 I D_c)^(-1) B Γ^Z = Γ^Z D_λ^Z

which is precisely the transition formula in the analysis of the Burt matrix.
Hence Γ^Z = Γ^B and D_λ^Z = (D_λ^B)^(1/2).

In an opinion survey I individuals respond to a questionnaire consisting of Q
questions, each of which has only two possible responses. A number of
non-responses are recorded in the survey and all the results are coded in a
multivariate indicator matrix Z ≡ [Z1 ... ZQ], where each Zq (q = 1 ... Q)
consists of 3 columns labelled qa, qb and q*, corresponding to responses a and
b and the non-response respectively to question q. Let h_qa, h_qb and h_q*
denote the relative frequencies of these 3 respective possibilities amongst
the I individuals. Now suppose that a non-response to question q is coded in
general as α_q, β_q and δ_q in the 3 respective columns, where
α_q + β_q + δ_q = 1. In terms of these values and the relative frequencies of
response, evaluate the total inertia in(J) of the cloud of column points and
the inertia in(J_q) of the subcloud of points associated with question q. Also
give the results for the special cases, firstly when δ_q = 1 and secondly when
δ_q = 0, α_q = β_q = 1/2.

Solution

The above coding of a non-response preserves the grand total of Z as QI. In
our previous notation we might denote the column masses of Z by c_j
(j = 1 ... J), say, so that the column totals of Z are QIc_j. Our present
notation involves the relative frequencies of response (and non-response), for
example h_qa is the number of times response a is given to question q, divided
by the "sample size" (number of rows) I. The relationships between the column
masses c_qa, c_qb and c_q* of the columns of Zq and the relative frequencies
h_qa, h_qb and h_q* are easily deduced:

    c_qa = (h_qa + α_q h_q*)/Q    c_qb = (h_qb + β_q h_q*)/Q    c_q* = δ_q h_q*/Q

The centroid of the column points is again the vector of row masses of the
columns of Z, namely [1/I ... 1/I]'. The inertia of column qa, say, is
computed as c_qa times its squared distance to the centroid, involving Ih_qa
terms of the form (1/I){1/(Qc_qa) - 1}^2, Ih_qb terms of the form 1/I and
Ih_q* terms of the form
{(α_q/(QIc_qa)) - (1/I)}^2/(1/I) = (1/I){(α_q/(Qc_qa)) - 1}^2 (cf. the simpler
calculation in Example 5.5.2). This gives:

    in(qa) = (1/Q){1 - (h_qa + α_q h_q*) - h_q* α_q(1 - α_q)/(h_qa + α_q h_q*)}    (5.5.5)

and similarly:

    in(qb) = (1/Q){1 - (h_qb + β_q h_q*) - h_q* β_q(1 - β_q)/(h_qb + β_q h_q*)}    (5.5.6)

The squared distance between column q* and the centroid involves
I(h_qa + h_qb) terms of the form (1/I) and Ih_q* of the form
{(δ_q/(QIc_q*)) - (1/I)}^2/(1/I) = (1/I){1/h_q* - 1}^2. The squared distance
is thus (1/h_q*) - 1 and the inertia of column q* is c_q* times this:

    in(q*) = (δ_q/Q)(1 - h_q*)        (5.5.7)

The inertia of the J_q = 3 columns qa, qb and q* is the sum of the expressions
(5.5.5-7):

    in(J_q) = (1/Q){1 + δ_q - h_q* α_q(1 - α_q)/(h_qa + α_q h_q*)
                            - h_q* β_q(1 - β_q)/(h_qb + β_q h_q*)}        (5.5.8)

The special case of δ_q = 1 (i.e. α_q = β_q = 0) simplifies as
in(J_q) = 2/Q, which is not surprising because this is a special case of
(5.2.2) where the number of categories J_q = 3. The total inertia is then
in(J) = 2.
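Formula (5.5.8) can be checked numerically. The sketch below (an illustration, not the book's code) rebuilds question q's three recoded columns from the relative frequencies and compares the directly computed inertia with the closed form:

```python
import numpy as np

# Check of (5.5.8): inertia of the subcloud (qa, qb, q*) when a non-response
# is coded as (alpha, beta, delta), with alpha + beta + delta = 1.
def inertia_direct(h_a, h_b, h_s, alpha, beta, delta, I=100, Q=5):
    counts = np.array([h_a, h_b, h_s]) * I          # rows answering a, b, *
    rows = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [alpha, beta, delta]])         # recoded values per row type
    col_tot = counts @ rows                         # column totals
    c = col_tot / (Q * I)                           # column masses
    total = 0.0
    for j in range(3):
        profile = rows[:, j] / col_tot[j]           # column profile per row type
        d2 = np.sum(counts * (profile - 1.0 / I) ** 2 / (1.0 / I))
        total += c[j] * d2                          # mass times squared distance
    return total

def inertia_558(h_a, h_b, h_s, alpha, beta, delta, Q=5):
    # closed form (5.5.8)
    return (1.0 / Q) * (1 + delta
                        - h_s * alpha * (1 - alpha) / (h_a + alpha * h_s)
                        - h_s * beta * (1 - beta) / (h_b + beta * h_s))

# question 3 of the example: h_3a = 0.66, h_3b = 0.18, h_3* = 0.16
args = (0.66, 0.18, 0.16)
```

With δ = 1/2 and α = β = 1/4 this reproduces the value 0.264 quoted below for question 3, and δ = 1 gives the logical-coding inertia 2/Q.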
When δ_q = 0, α_q = β_q = 1/2, (5.5.8) simplifies as:

    in(J_q) = (1/Q){1 - h_q*/4h'_q - h_q*/4(1 - h'_q)}

where h'_q ≡ h_qa + (1/2)h_q*, hence 1 - h'_q = h_qb + (1/2)h_q*. This
simplifies further as:

    in(J_q) = (1/Q){1 - h_q*/(4h'_q(1 - h'_q))}        (5.5.9)

The total inertia is then:

    in(J) = 1 - (1/4Q) Σ_q h_q*/(h'_q(1 - h'_q))        (5.5.10)

Comments

(1) The inertia in(J_q) of question q is at its highest when δ_q = 1,
α_q = β_q = 0, that is when the non-response is coded fully as an extra
response. As mass is transferred from column q* to columns qa and qb the
inertia must decrease, as all three columns come closer to the centroid
[1/I ... 1/I]'. We are, of course, more interested in the relative values of
in(J_q). When δ_q = 1 the values of in(J_q) are the same, 2/Q, for all q. At
the other extreme when δ_q = 0 and α_q = β_q = 1/2, say, we can deduce from
(5.5.9) that the value of in(J_q) is highest when h_q* is the lowest and h'_q
is 1/2, i.e. h_qa = h_qb, in which case in(J_q) = (1/Q)(1 - h_q*). In other
words, for a fixed frequency of non-response h_q* across the questions, the
inertia of a question decreases as polarization of response increases. The
degree of polarization, measured by 1/{h'_q(1 - h'_q)}, is a quantity which is
discussed further in Section 6.1.

(2) We can illustrate these results in the analyses displayed in Figs 5.6, 5.8
and 5.9. From the marginals down the diagonal of Table 5.4 we can obtain the
relative frequencies h_qa, h_qb and h_q* (q = 1 ... 5):

              q=1      q=2      q=3      q=4      q=5
    h_qa:     0.58     0.54     0.66     0.41     0.25
    h_qb:     0.42     0.38     0.18     0.52     0.70
    h_q*:     0.00     0.08     0.16     0.07     0.05

The values of h'_q, 1 - h'_q and the polarization factor 1/{4h'_q(1 - h'_q)}
used in (5.5.9) are:

                           q=1      q=2      q=3      q=4      q=5
    h'_q:                  0.58     0.58     0.74     0.445    0.275
    1 - h'_q:              0.42     0.42     0.26     0.555    0.725
    1/{4h'_q(1 - h'_q)}:   1.03     1.03     1.30     1.01     1.25

For each of the three analyses the total inertia in(J), the inertias of the
questions in(J_q) and their values as percentages of in(J) are evaluated as:

                                     q=1      q=2      q=3      q=4      q=5      total
    δ_q = 1, α_q = β_q = 0:          0.200    0.400    0.400    0.400    0.400    1.800
    (Fig. 5.6)                       (11.1%)  (22.2%)  (22.2%)  (22.2%)  (22.2%)

    δ_q = 1/2, α_q = β_q = 1/4:      0.200    0.287    0.264    0.289    0.289    1.329
    (Fig. 5.9; using (5.5.8))        (15.0%)  (21.6%)  (19.9%)  (21.7%)  (21.7%)

    δ_q = 0, α_q = β_q = 1/2:        0.200    0.184    0.158    0.186    0.190    0.918
    (Fig. 5.8; using (5.5.9))        (21.8%)  (20.0%)  (17.2%)  (20.3%)  (20.7%)

(3) Notice that because question 1 has no non-response, its inertia remains a
constant 0.2 in each of the three analyses above. As a percentage, however,
its inertia varies from 11.1% to 21.8%, depending on how the non-responses in
the other questions are recoded. In the first analysis, where δ_q = 1
(q = 1 ... Q), the contrast in inertias is the greatest. We might want to
eliminate this peculiarity of the coding scheme by increasing the mass of the
first question in this analysis. This is done simply by multiplying columns 1a
and 1b of the recoded indicator matrix Z by 2, i.e. code "male" as 2 0 and
"female" as 0 2, which will double the mass of question 1 so that its inertia
is identical to that of the other questions. To prove this, notice that the
profiles of columns 1a and 1b as well as their centroid are unaffected by this
increase in mass. The masses of the other column profiles are decreased so
that each subset of columns corresponding to a question q has a total mass of
2/(Q + 1) for q = 1, and 1/(Q + 1) for q = 2 ... Q, where Q = 5. Since the
masses of columns 1a and 1b have been doubled without affecting the squared
distances to their centroids, it is clear that the inertia of question 1, as
well as the inertias of the other questions, will now be 2/(Q + 1). In the
next example (5.5.5) we prove a more general result concerning the reweighting
of questions (see also Section 8.2 on the reweighting of questions).

5.5.5 Reweighting of discrete variables

Suppose that Z ≡ [Z1 ... ZQ] is a multivariate indicator matrix in its most
general form, i.e. the submatrix Zq corresponding to variable (or question) q
consists of rows of non-negative numbers which sum to 1 (logical or fuzzy
coding). Let in(q) denote the inertia of the qth subset of J_q columns of Z
(note that the notation in(q) is equivalent to in(J_q) in the previous
example). Suppose we wish to reweight the variables so that their inertias are
proportional to pre-assigned values v1 ... vQ. Show that this is achieved by
the following rescaling of the submatrices:

    Z*_q = {v_q/in(q)} Z_q        (5.5.11)

Solution

The J_q columns of each submatrix Zq have the same centroid [1/I ... 1/I]',
and this is the centroid of all the columns of Z. This property is unchanged
when the columns of Zq are rescaled by a factor ζ_q (q = 1 ... Q). In general,
let Z*_q ≡ ζ_q Zq. The sum of all the elements of Z*_q is thus ζ_q I, and let
ζ ≡ Σ_q ζ_q, so that the mass of the qth variable is ζ_q/ζ, compared to the
constant mass of 1/Q for all the variables of Z. The distances from the column
points of Z* to the centroid are identical to those of Z, as the profiles and
the metric are exactly the same. Therefore the inertia of the qth variable
(i.e. qth subset of column points) of Z* is (ζ_q/ζ)/(1/Q) times the inertia
in(q), which can be written as:

    in*(q) = in(q) ζ_q Q/ζ

It is clear that the values in*(q) are in the same proportion to each other as
in(q)ζ_q, hence we should use scaling factors ζ_q = v_q/in(q) in order that
the inertias in*(q) be proportional to given values v_q.

Comments

(1) When the elements of Z are 0s and 1s only (logical coding), as in Section
5.2, we have seen that the inertia of the qth variable is (J_q - 1)/Q, where
J_q is the number of categories of this variable (or number of different
responses to question q). In order
168 Theory and Applications ofCorrespondence Analysis
to equalize the inertias of the variables we can multiply the qth subset of variables by
any convenient value in proportion to 1/(Jq -1). For example, if Q = 3 and J 1 = 2,
J 2 = 3, J 3 = 4, then we know that the inertias will be in(l) = 1/3, in(2) = 2/3 and
in(3) = 1, with a total inertia of 2. To maintain integer values in the reweighted matrix
Z· it is convenient to multiply the 1st set of 2 columns of Z by 6, the second set of 3
columns by 3 and the third set of 4 columns by 2. This is equivalent to coding
observations on the three variables by values 6, 3 and 2 respectively instead of ones.
The new inertias are all equal to 6/11, with a total inertia of 18/11.
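Comment (1)'s integer reweighting can be checked numerically. A minimal sketch (NumPy; the indicator matrix is invented for illustration, with categories assigned cyclically so that none is empty):

```python
import numpy as np

I, Js = 12, [2, 3, 4]                 # Q = 3 variables with 2, 3 and 4 categories
Q = len(Js)

# logical (0/1) indicator submatrices; cyclic assignment keeps every category occupied
blocks = [np.eye(J)[np.arange(I) % J] for J in Js]

def variable_inertias(blocks):
    # inertia of each variable = sum of chi-square column inertias of its submatrix
    Z = np.hstack(blocks)
    P = Z / Z.sum()                               # correspondence matrix
    r, c = P.sum(1), P.sum(0)                     # row and column masses
    col_inertias = ((P - np.outer(r, c))**2 / np.outer(r, c)).sum(0)
    edges = np.cumsum([b.shape[1] for b in blocks])[:-1]
    return np.array([part.sum() for part in np.split(col_inertias, edges)])

print(variable_inertias(blocks))                  # (J_q - 1)/Q = [1/3, 2/3, 1]

# multiply the subsets by 6, 3 and 2: all inertias become 6/11 (total 18/11)
reweighted = [z * w for z, w in zip(blocks, (6, 3, 2))]
print(variable_inertias(reweighted))
```

The scaling factors 6, 3 and 2 are proportional to 1/(J_q − 1), as in the comment above; any common multiple would equalize the inertias in the same way.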
psychological difference between the two ways of asking the subject to respond.

In order to allow the respondent more freedom of choice in describing his feelings, the scale might be made 5-point, for example, by introducing intermediate possibilities of "disagree slightly" and "agree slightly". A problem with the analysis of such data is the choice of the original scale values: why do we not choose −2, −1, 0, 1, 2 for the scale values, why not 0, 4, 5, 6, 10?

Two alternatives to the above system of collecting bipolar data are possible, one very restrictive and the other very unrestrictive. The first is to ask each respondent to rank the set of statements rather than assess each one (clearly this is only possible when each scale of response is the same, for example agree/disagree). In the above example the respondents would have to say which of the statements they agreed with the most, which second most, etc., until the statement they disagreed with the most. In the case of a market research survey people may not be asked to assess individual products but rather to look at the set as a group and rank them in order of preference. This system is very restrictive because the subject is not allowed to express dislike for all the products, for example. On the other hand, it might very well be the intention of the market researcher to force an order of preference from a respondent in a situation where the information in relative terms would otherwise be very limited. Another possibility, which is desirable when a large number of objects have to be ranked, is to ask the subjects to choose small subsets of most preferred and most disliked objects.

The second alternative is to allow the respondents great freedom in expressing their feelings. Ratings are made on a quasi-continuous scale, say from 0 (totally against) to 10 (totally in favour), allowing first decimals: for example, 8.3. In effect this is a 101-point rating scale. Sometimes such data are collected by asking subjects to mark their response on a given straight line. We have found such ways of collecting data to work well when the respondents are themselves numerically minded and trained as respondents, whereas we doubt whether such a scheme would appeal to the general public. In this case the presentation of a discrete scale with descriptive naming of a small number of scale points has clear advantages.

In Chapter 5 we discussed how "optimal" scale values for the categories of response can be derived, by treating each category as a separate column of a multivariate indicator matrix. In that case the categories are treated like those of nominal variables and the inherent ordering of the categories is ignored (unless an analysis under order constraints is performed, cf. Section 8.4). In the present chapter we shall go to almost the opposite extreme by accepting the scale decided upon by the researcher. At least, we shall accept the two extreme poles of the scale and the way the researcher has decided to gradate the categories between the two poles.

In Section 6.1 we introduce the concept of "doubling" bipolar data so that the matrix to be analysed is a special case of a multivariate indicator matrix. The row and column geometry of correspondence analysis in this case is shown to depend on what we term the polarization of the observations and their average on each variable. Similarities and differences between the correspondence analysis of the doubled data matrix and other applicable techniques, such as the vector and unfolding models and principal components analysis, are discussed in Section 6.2. The chapter again concludes with some theoretical examples (Section 6.3).

6.1 DOUBLING AND ITS ASSOCIATED GEOMETRY

An example

In order to introduce the correspondence analysis of bipolar data, let us consider the 7 × 4 data matrix of Table 6.1(a). These are artificial data which we presume to be average ratings of 4 candidates in an election by 7 groups of subjects. For example, suppose that the rows represent different socio-economic groups and that the candidates A, B, C and D are classifiable politically as left, independent, centre and right respectively. The original rating scale was 0 (totally unfavourable), 1, 2, 3, 4, 5 (undecided), 6, 7, 8, 9, 10 (totally favourable), an 11-point scale. The first group of electors thus gave the leftwing candidate A an average rating of 2.2.

Figure 6.1 shows the 2-dimensional correspondence analysis of these data, which is an almost exact representation of the row and column profiles. This display is a good indication of the political tendencies of the 7 groups of electors, with groups 6 and 7 tending out with leftwing candidate A as opposed to group 1 tending out with rightwing candidate D. However, there are two apparently minor changes in the way the data were collected which could dramatically affect this analysis.

In the first instance much of the success of the analysis is due to the fact that the 7 groups represent a fairly complete spectrum of opinion. The situation changes quite dramatically if groups 1, 2 and 3 are omitted so that the remaining electors represent a predominantly "leftwing" preference. The correspondence analysis of the last four rows of Table 6.1(a) strings out these four groups across the display and group 4 is seen to be clearly on the "rightwing" side (Fig. 6.2). This illustrates how correspondence analysis displays the positions of the individual groups relative to the spectrum of groups included in the analysis: there is still a left-to-right dimension amongst these four groups and group 4 is more to the right than the others.

Secondly, the analysis is not invariant to the direction of the rating scale.
[Table 6.1. (a) Average ratings of candidates A, B, C and D by the seven groups of electors; (b) the reflected ratings, each rating subtracted from 10]

FIG. 6.1. 2-dimensional correspondence analysis of Table 6.1(a).

FIG. 6.2. 2-dimensional correspondence analysis of the last four rows of Table 6.1(a).

If each of the ratings in Table 6.1(a) is "reflected" by subtracting it from 10 (Table 6.1(b)), so that the value 0 corresponds to totally favourable and 10 to totally unfavourable, then the correspondence analysis would change character completely: groups of electors would be associated with the candidates they do not like and would be displayed in opposite directions to the candidates they favour. This is counter-intuitive because we are used to interpreting a display in a positive sense, as we interpreted Fig. 6.1. Of course the reason is simply that Table 6.1(a) can be viewed as measuring positive association between the groups of electors and the candidates, whereas it is more difficult conceptually to fit the measures of negative association, in Table 6.1(b), into our idea of a correspondence matrix as a two-way distribution of mass. However, as we shall show in Section 6.2, we can think of the values in Table 6.1(b) as dissimilarities, or distances, in which case the technique of multidimensional unfolding can be used to display the data.

In order to take into account the absolute nature of the ratings and the fact that they are bipolar, correspondence analysis may be applied to the doubled data matrix comprising both the original and reflected forms of the data. In the above example, Tables 6.1(a) and (b) would together form the 7 × 8 doubled data matrix, and there would thus be two columns for each candidate: a "+" column indicating the measure of positive association between the electors and the candidate, and a "−" column indicating the complementary measure of negative association (e.g. A+ and A−). The correspondence analyses of the doubled matrix and of the last 4 rows of the doubled matrix are shown in Figs 6.3 and 6.4 respectively. The positions of the electors and the "+" candidate points in Fig. 6.3 are not much different to the points of Fig. 6.1, although there is a separation of the elector cloud and the "+" candidate cloud in the doubled analysis (cf. unfolding analysis of Fig. 6.8). Figure 6.4 is quite a lot different to Fig. 6.2, though, and elector ...
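In code, the doubling is simply a horizontal concatenation of the ratings with their reflections. A minimal NumPy sketch; the ratings below are invented stand-ins for Table 6.1(a), apart from the value 2.2 quoted above for group 1 and candidate A:

```python
import numpy as np

# invented average ratings of candidates A, B, C, D (columns) by 7 groups (rows)
Y = np.array([[2.2, 5.5, 6.0, 8.5],
              [3.0, 5.0, 6.5, 7.0],
              [4.5, 5.5, 5.0, 6.0],
              [6.0, 5.0, 4.5, 5.5],
              [7.0, 4.5, 5.0, 3.5],
              [8.0, 5.5, 4.0, 3.0],
              [8.5, 6.0, 3.5, 2.0]])

Yref = 10 - Y                        # reflected ratings, cf. Table 6.1(b)
Yd = np.hstack([Y, Yref])            # 7 x 8 doubled matrix [A+ B+ C+ D+ A- B- C- D-]

# every row of the doubled matrix sums to the same total, 4 candidates x 10,
# so each group of electors automatically receives the same mass in the analysis
print(np.allclose(Yd.sum(1), 40.0))  # True
```

The constant row sums are what make the doubled matrix a special case of a multivariate indicator matrix, as discussed at the start of the chapter.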
co-ordinates of these two points with respect to any axis of the column space (Fig. 6.5(b)), then:

    c_q+ g_q+ + c_q− g_q− = 0     (6.1.1)

(Notice that most of these results are just special cases of the more general results obtained in Chapter 5 for multivariate indicator matrices.)

Furthermore it is easy to show that the respective distances d_q+ and d_q− between the two points q+ and q− and the origin in the full space are equal to the coefficient of variation (standard deviation divided by mean) of the respective columns of the doubled matrix. For example, the squared distance between the q+ column and the centroid in the metric D_r^{−1} = I·I (I times the identity matrix) is:

    Σ_i I(y_iq/(Iȳ_q) − 1/I)² = I Σ_i {(y_iq/(Iȳ_q))² − 2y_iq/(I²ȳ_q) + 1/I²}
                              = (Σ_i y_iq²)/(Iȳ_q²) − 2 + 1
                              = s_q²/ȳ_q²     (6.1.2)

where s_q² is the variance (Σ_i y_iq² − Iȳ_q²)/I of the ratings. Thus d_q+ = s_q/ȳ_q and, similarly, d_q− = s_q/(t_q − ȳ_q). The sum of these distances is thus:

    d_q+ + d_q− = (s_q/t_q)/{π̄_q(1 − π̄_q)}     (6.1.3)

where π̄_q ≡ ȳ_q/t_q and 1 − π̄_q = (t_q − ȳ_q)/t_q, so that their product π̄_q(1 − π̄_q) is inversely related to what we call the polarization of the average of the qth question or variable, with low π̄_q(1 − π̄_q) indicating high polarization and high π̄_q(1 − π̄_q) (when π̄_q is near to ½) indicating low polarization. We formally define the polarization of the average (or polarity) as the quantity 1/{π̄_q(1 − π̄_q)}, which is thus greater than or equal to 4.

Let us suppose that we have an almost exact representation of the doubled data matrix in a 2-dimensional display, so that we can ignore for the moment the type of approximation implied when this display is less accurate, and let Fig. 6.6 depict a few typical pairs of doubled column points. In Fig. 6.6(a) we see three pairs for which the origin marks off approximately the same ratio of longer to shorter distances max{d_q+, d_q−}/min{d_q+, d_q−}. This means that the quantities π̄_q(1 − π̄_q) are the same in each case so that the lengths between opposite poles are proportional to the standard deviations s_q/t_q of the "fractional" ratings, i.e. the ratings divided by their respective t_q. Thus the polarization is the same for each variable, but the variance of opinion is highest for question 2 and lowest for question 1. If the lengths between opposite points are the same but the origin divides these lengths into different ratios then the relationship between standard deviation and polarization is such that the least polarized set of responses must have the highest standard deviation. Thus in Fig. 6.6(b) the variance of question 4 is higher than that of question 5. If both total lengths and ratios of the subdivisions are different, as is often the case, then there is an interaction between standard deviation and polarization in (6.1.3) to give the total length, which increases both by increasing standard deviation and by increasing polarization.

FIG. 6.6. Examples of pairs of column points: (a) different lengths between the points of each pair, but the origin cuts off sublengths in the same proportions; (b) same lengths between the points of each pair, but the origin cuts off sublengths in different proportions.

The variation of each pair of doubled columns is measured by the sum of their inertias:

    in(q) = c_q+ d²_q+ + c_q− d²_q−
          = {ȳ_q(s_q/ȳ_q)² + (t_q − ȳ_q)(s_q/(t_q − ȳ_q))²}/t.
          = (t_q/t.) d_q+ d_q−
          = (t_q/t.) (s_q/t_q)²/{π̄_q(1 − π̄_q)}     (6.1.4)

Thus this inertia depends multiplicatively on the mass attached to the variable as indicated by the relative length of the scale, (t_q/t.), the variance of the "fractional" ratings, (s_q/t_q)², and the polarization 1/{π̄_q(1 − π̄_q)}.

The angle between two lines joining opposite points indicates the correlation between the two respective sets of ratings. The cosine of the angle between point vectors q+ and q′+, for example, is their scalar product divided by the product of their lengths (remembering that the scalar product is in the metric D_r^{−1} = I·I):

    cos θ = I Σ_i (y_iq/(Iȳ_q) − 1/I)(y_iq′/(Iȳ_q′) − 1/I)/(d_q+ d_q′+)
          = {(Σ_i y_iq y_iq′)/(Iȳ_q ȳ_q′) − 1 − 1 + 1}/{(s_q s_q′)/(ȳ_q ȳ_q′)}
          = (1/I){Σ_i y_iq y_iq′ − Iȳ_q ȳ_q′}/(s_q s_q′)     (6.1.5)
          = correlation between columns q+ and q′+

Let us now turn our attention to the dual geometry of the row points. The row profiles of the data matrix of Fig. 6.5(a) define I points in 2Q-dimensional space. Because there are Q linear dependencies amongst the columns, the dimensionality of the rows is actually equal to Q, just one dimension more than the dimensionality of the row profiles of the undoubled matrix.
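The identities above, and the row-geometry results derived next, can all be checked numerically. A minimal NumPy sketch with a small invented matrix of ratings on a 0-10 scale (so every t_q = 10 and t. = 10Q):

```python
import numpy as np

# invented I x Q ratings on a 0-10 scale (every t_q = 10)
Y = np.array([[2., 6., 9.],
              [4., 8., 1.],
              [5., 3., 7.],
              [7., 5., 2.],
              [8., 2., 6.],
              [3., 9., 4.]])
I, Q = Y.shape
t, t_dot = 10.0, Q * 10.0
ybar, s = Y.mean(0), Y.std(0)              # column means and sd's (divisor I)

# (6.1.2): distances of the q+ and q- points from the centroid are the
# coefficients of variation of the corresponding doubled columns
d_pos = np.sqrt((I * (Y/Y.sum(0) - 1/I)**2).sum(0))
d_neg = np.sqrt((I * ((t - Y)/(t - Y).sum(0) - 1/I)**2).sum(0))
print(np.allclose(d_pos, s/ybar), np.allclose(d_neg, s/(t - ybar)))

# (6.1.4): in(q) = c_q+ d_q+^2 + c_q- d_q-^2 = (t_q/t.) d_q+ d_q-
in_q = (ybar/t_dot)*d_pos**2 + ((t - ybar)/t_dot)*d_neg**2
print(np.allclose(in_q, (t/t_dot)*d_pos*d_neg))

# (6.1.5): cosine of the angle between the q+ and q'+ vectors equals the
# correlation between the two columns of ratings
V = np.sqrt(I) * (Y/Y.sum(0) - 1/I)        # column points, centroid at the origin
cos01 = V[:, 0] @ V[:, 1] / (np.linalg.norm(V[:, 0]) * np.linalg.norm(V[:, 1]))
print(np.allclose(cos01, np.corrcoef(Y[:, 0], Y[:, 1])[0, 1]))

# (6.1.7): pibar(1 - pibar) exceeds var(pi) by the average of pi(1 - pi)
pi, pibar = Y/t, ybar/t
print(np.allclose(pi.var(0), pibar*(1 - pibar) - (pi*(1 - pi)).mean(0)))

# (6.1.6): chi-square distances between row profiles of the doubled matrix
# [Y, t-Y] equal the weighted Euclidean form in the fractional ratings
D = np.hstack([Y, t - Y])
P = D / D.sum()
prof, c = P/P.sum(1, keepdims=True), P.sum(0)
d2_ca = ((prof[:, None, :] - prof[None, :, :])**2 / c).sum(-1)
d2_w = (((t/t_dot)/(pibar*(1 - pibar))) * (pi[:, None, :] - pi[None, :, :])**2).sum(-1)
print(np.allclose(d2_ca, d2_w))
```

Each check prints True; the equalities are exact, so the comparisons hold to floating-point tolerance for any ratings matrix without constant or all-zero columns.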
Equal masses of 1/I are assigned to each row profile and the chi-square metric between the row profiles implies that the squared distance between subjects i and i′ is:

    d²_ii′ = (1/t.) Σ_q {(y_iq − y_i′q)²/ȳ_q + ((t_q − y_iq) − (t_q − y_i′q))²/(t_q − ȳ_q)}
           = (1/t.) Σ_q (y_iq − y_i′q)² t_q/{ȳ_q(t_q − ȳ_q)}
           = Σ_q (t_q/t.)(π_iq − π_i′q)²/{π̄_q(1 − π̄_q)}     (6.1.6)

where π_iq is defined as the fractional rating y_iq/t_q. Here the fractional rating is analogous to a probability: subject i is at pole q+ with a "probability" of π_iq and at the opposite pole q− with a "probability" of 1 − π_iq. Thus the squared distance is a sum of Q terms where each term depends multiplicatively on the relative mass (t_q/t.) of the question, the squared difference in the fractional ratings and the polarization. If the ratings were all on a 2-point scale so that π_iq could only be either 0 or 1, then clearly the variance of the π_iq (i = 1 ... I) would be equal to π̄_q(1 − π̄_q), the familiar variance of the Bernoulli variate, so that the distance between rows would be equivalent to the ordinary Euclidean distance between rows of standardized data. When the rating scale is generalized to more than 2 points, then π̄_q(1 − π̄_q) overestimates the variance of the π_iq by an amount (1/I)Σ_i π_iq(1 − π_iq):

    (1/I){Σ_i π²_iq − Iπ̄²_q} = π̄_q(1 − π̄_q) − (1/I)Σ_i π_iq(1 − π_iq)     (6.1.7)

In the case of a 2-point scale where either π_iq or 1 − π_iq is zero, each π_iq(1 − π_iq) is zero and hence their average is also zero. In general π_iq(1 − π_iq) is a maximum of ¼ for an individual observation which is least polarized, i.e. π_iq = ½. Analogous to our definition of polarization of the average ratings we define the polarization of the individual ratings as the inverse of this average: I/Σ_i π_iq(1 − π_iq), which is also greater than or equal to 4. We can then rewrite (6.1.7) for the qth bipolar variable as:

    (s_q/t_q)² = (1/polarization of average rating) − (1/polarization of individuals' ratings)

The polarization of the individual observations, which must be greater than the polarization of their average, summarizes how near the poles the individual observations lie. This can be very high if all the responses are extreme ones (e.g. strong disagreement or agreement) or very low if all the responses are "intermediate" (e.g. undecided). From (6.1.4) the inertia of the qth variable is related to the ratio of the polarization of the average to that of the individuals (which we call the relative polarization) as follows:

    in(q) = (t_q/t.)(1 − relative polarization)

where

    relative polarization = polarization of average rating / polarization of individual ratings
                          = (1/I)Σ_i π_iq(1 − π_iq)/{π̄_q(1 − π̄_q)}

These results tie up with those of Example 5.5.4 (cf. (5.5.9) and comments 1 and 2). For a fixed polarization of individual ratings, the inertia of a variable actually decreases when the average is more polarized. For a fixed polarization of the average, the inertia increases when the individual ratings are more polarized, with a maximum inertia of t_q/t. (the mass of the variable) when there are individual ratings only at the poles. Many other results of Chapter 5 carry over to the present situation, for example the reweighting of the variables in Example 5.5.5.

6.2 COMPARISON WITH OTHER SCALING TECHNIQUES

Vector and unfolding models

In the literature there are two "classic" geometric frameworks for displaying matrices of ratings and preferences, called respectively the "vector model" (or "scalar product model") and the "unfolding model" (or "distance model"). A good introduction to these models in the context of preference data is given by Carroll (1972). (A revised version of this paper is given by Carroll (1980), while an excellent taxonomy and literature survey of multidimensional scaling is provided by Carroll and Arabie (1980).)

Both models aim to represent the rows and columns of the data matrix as points in a joint space of low dimensionality, but their interpretations of interpoint positions are different. For example, Fig. 6.7 depicts possible situations in 2-dimensional space where we have shown the positions of 5 variables (often called "stimuli" in this context) and 2 respondents (subjects). The vector model attempts to represent the stimuli as points and each subject by a vector through a fixed origin such that the perpendicular projections of the stimuli onto this vector produce a set of values which approximate the (centered) data vector of that subject (Fig. 6.7(a)). The unfolding model attempts to represent both stimuli and subjects as points so that the set of distances from the variables to a particular subject approximates the data vector of that subject. The set of distances from the 5 stimuli to the point 1 in Fig. 6.7(b) can be thought of as a folding of this fan of lines about the pivot point at 1 onto a common line, hence the term unfolding to describe the reverse process of taking the data vectors and opening out these fans to obtain the display. Thus for any line with origin at 1 the set of distances is
obtained by what might be described as "circular projection" of the stimulus points onto this line. In the vector model the set of distances is obtained by perpendicular projections, which can be thought of as circular projections if the subject points are taken out to infinity. Thus, as Carroll (1972, 1980) points out, the unfolding model is more flexible in the sense that it admits the vector model as a special case.

FIG. 6.7. The two "classic" geometric models for preference and ratings data: (a) the vector model, where the perpendicular projections of the so-called "stimulus" points (usually the columns of the data matrix) onto the subject vector approximate the subject's (centered) data; (b) the unfolding model, where the distances from the subject to the set of stimuli approximate the subject's data (the "folded" distances are indicated below).

It is interesting to reveal now that the artificial data of Table 6.1(b) are the actual distances between the two sets of points in Fig. 6.8. Thus the unfolding model would be the most suitable technique in the sense that it correctly assumes the data to be row-to-column distances. Figure 6.8 should be recovered exactly by a "metric" unfolding analysis of the data of Table 6.1(b) in 2-dimensional space.

FIG. 6.8. The "true" underlying positions of electors and candidates which generate the distances equal to the reflected ratings in Table 6.1(b).

In many applications, especially when the data involve rankings, there is a strong case for ignoring the actual values of the data and performing analyses which depend only on the ordering of the data. In the context of multidimensional scaling these are known as "non-metric" techniques. Hence a non-metric unfolding will try to represent the subjects and objects so that distances from the variables to a subject are ordered as similarly as possible to the subject's data vector. The non-metric approach is in many ways a more complicated context for analysing the data of interest here and we shall not pursue its discussion. For an introduction to non-metric scaling the reader is referred to Greenacre and Underhill (1982, Sections 4 and 5).

Although we do not consider multidimensional unfolding in any detail in this book, it serves as an instructive comparison with correspondence analysis. In correspondence analysis there is a "barycentric" relationship between the row and column points in the display, as defined by the relevant transition formulae. In unfolding there is a perhaps more direct interpretation of the display in terms of interpoint distances. It is easy to fall into the trap of interpreting a correspondence analysis as if it were an unfolding and this danger should be avoided. No distance between a row point and a column point is included or implied in correspondence analysis, whereas this is the basic concept in unfolding. If the asymmetric correspondence analysis display is chosen (Section 4.1.16), then the biplot interpretation applies, that is between-set (row-to-column) scalar products may be interpreted. If the subjects are represented in principal co-ordinates, say, and the (doubled) stimuli in standard co-ordinates, as in (4.6.1) and (4.6.2), then the reconstitution formula (4.6.1) can be written as:

    (y_iq − ȳ_q)/ȳ_q = Σ_k f_ik γ_qk     (6.2.1)

where q refers to the original (q+) variable of the doubled pair.
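Formula (6.2.1) can be verified by computing the correspondence analysis of a doubled matrix explicitly via the SVD. A minimal NumPy sketch with invented ratings; F holds principal row co-ordinates and G standard column co-ordinates:

```python
import numpy as np

Y = np.array([[2., 6., 9.],            # invented I x Q ratings on a 0-10 scale
              [4., 8., 1.],
              [5., 3., 7.],
              [7., 5., 2.],
              [8., 2., 6.],
              [3., 9., 4.]])
t = 10.0
D = np.hstack([Y, t - Y])              # doubled matrix

P = D / D.sum()                        # correspondence matrix
r, c = P.sum(1), P.sum(0)              # row and column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

F = U * sv / np.sqrt(r)[:, None]       # principal row co-ordinates
G = Vt.T / np.sqrt(c)[:, None]         # standard column co-ordinates

# reconstitution (6.2.1): (y_iq - ybar_q)/ybar_q = sum_k f_ik g_qk,
# reading off the "+" halves of the doubled pairs
ybar = Y.mean(0)
print(np.allclose((Y - ybar)/ybar, (F @ G.T)[:, :Y.shape[1]]))   # True
```

Using all singular values makes the reconstitution exact; a rank-2 map keeps only the first two columns of F and G and reproduces the deviations approximately.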
Hence, if the subjects are displayed as vectors, in the style of the vector model (cf. Fig. 6.7(a)), then the projections of the stimulus points onto a subject vector are proportional to the quantities (y_iq − ȳ_q)/ȳ_q, the usual deviation of a rating relative to the mean rating. Notice that the means here are across subjects (i.e. stimulus means, in this case, column means), whereas the vector model often displays the deviations with respect to the means across stimuli (i.e. subject means, in this case, row means).

Relationship to principal components analysis

From (6.1.6) it can be seen that the geometry of the rows in the correspondence analysis of a doubled data matrix is the same as the row geometry of a principal components analysis of the original (undoubled) matrix with a particular rescaling of the columns (see Example 6.3.1). In principal components analysis it is common to rescale the columns to have unit variance. In our present notation and in terms of the fractional ratings π_iq, with variances (s_q/t_q)², this would lead to squared distances between rows of the form:

    d²_ii′ = Σ_q (π_iq − π_i′q)²/(s_q/t_q)²     (6.2.2)

In (6.1.6) the rescaling involves the quantity π̄_q(1 − π̄_q) in the denominator, which from (6.1.7) exceeds the variance (s_q/t_q)² by the average (1/I)Σ_i π_iq(1 − π_iq), the inverse of what we have called the polarization of individual ratings. Ignoring for the moment the additional rescaling by the mass (t_q/t.), we see that correspondence analysis puts less emphasis on variables which have low polarization across the ratings, compared to the standardization (6.2.2), since the quantity π̄_q(1 − π̄_q) is then in large excess of the variance. Putting this another way, the chi-square metric defining the positions of the row profiles accentuates the questions where polarization of individual ratings is the greatest, over and above the accentuation already induced by the standardization of the variables.

This is an important concept in the analysis of bipolar data and illustrates that the standardization of variables to have equal variances is not fully justifiable in this context where the data are definitely not of an "interval" nature. In correspondence analysis the more the group is polarized on a certain question, both individually and on average, the higher the importance given to that variable in the calculation of the between-subject distances. Polarization seems to be a more useful concept here than variance and these coincide only when ratings are made exclusively at the poles of the scale. Since we can think of rescaling in the dual sense of assigning masses to the variables there is no change of position of the points (in the full space) representing the variables (stimuli) and the correlation coefficients are still the angle cosines (cf. (6.1.5)), as in the covariance biplot (Appendix A: Table A.1(4)). Reweighting does affect the principal axes of these points, so that their approximately displayed positions in lower dimensional spaces will be different.

6.3 EXAMPLES

6.3.1 Equivalence of correspondence analysis of a doubled matrix to a principal components analysis

Show that the correspondence analysis of the matrix of bipolar data y_iq (i = 1 ... I, q = 1 ... Q), where the qth column is doubled with respect to t_q (q = 1 ... Q), is equivalent to the principal components analysis of the matrix Y*:

    y*_iq = (λ_q)^{1/2} y_iq

where

    λ_q = (t_q/t.)/{ȳ_q(t_q − ȳ_q)}     (6.3.1)

Solution
The principal components analysis of Y* (cf. Appendix A, Table A.1(1)) situates the rows of Y* in an ordinary Euclidean space with masses 1/I and squared interpoint distances between rows i and i′ of Σ_q(y*_iq − y*_i′q)². These are exactly the same masses and relative positions as the rows of the doubled matrix (cf. (6.1.6)).

6.3.2 Correspondence analysis of undoubled and doubled preferences

Suppose that σ_i ≡ [σ_i1 σ_i2 ... σ_iQ]^T is a vector of preferences by subject i on Q stimuli, i = 1 ... I, where σ_iq indicates the ranking of the qth stimulus, with a ranking of Q being the most preferred. For example, if Q = 4 and subject i ranks stimulus 3 as the most preferred, followed by stimuli 2, 4 and 1, then σ_i = [1 3 4 2]^T. Derive the centered positions of the ith subject in (ordinary) Q-dimensional Euclidean space in:

(a) the correspondence analysis of the matrix of preferences σ_iq, i = 1 ... I, q = 1 ... Q;
(b) the correspondence analysis of the doubled matrix of preferences σ_iq+ = σ_iq, σ_iq− = (Q + 1) − σ_iq.

When are the two analyses equivalent in their displays of the subjects?

Solution
(a) The sum Σ_q σ_iq = ½Q(Q + 1) for all i, so the ith row profile has elements r_iq = σ_iq/{½Q(Q + 1)}. If we denote the mean ranking (1/I)Σ_i σ_iq by σ̄_q, the mass of the qth column is c_q = Iσ̄_q/{½IQ(Q + 1)} = σ̄_q/{½Q(Q + 1)}. The qth co-ordinate of subject i's profile in ordinary Euclidean space is (r_iq − c_q)/(c_q)^{1/2}, where the division by (c_q)^{1/2} takes care of the weighting of the dimensions implied by the chi-square metric:

    (r_iq − c_q)/(c_q)^{1/2} = (σ_iq − σ̄_q)/{½Q(Q + 1)σ̄_q}^{1/2}     (6.3.2)
(b) From (6.3.1) we know that the row profiles of the doubled matrix can be equivalently situated in Euclidean space by the vectors with elements:

    σ*_iq = (λ_q)^{1/2} σ_iq

where

    λ_q = (1/Q)/{σ̄_q(Q + 1 − σ̄_q)}

The mean (1/I)Σ_i σ*_iq = (λ_q)^{1/2} σ̄_q, so that the centered position of the ith subject is just the deviation:

    (λ_q)^{1/2}(σ_iq − σ̄_q) = (σ_iq − σ̄_q)/{Qσ̄_q(Q + 1 − σ̄_q)}^{1/2}     (6.3.3)

(6.3.2) and (6.3.3) are equal if and only if σ̄_q = ½(Q + 1) for all q, that is all stimuli receive the same average rankings, in which case the qth co-ordinate of subject i is {σ_iq − ½(Q + 1)}/{½(Q + 1)Q^{1/2}}.

7
Use of Correspondence Analysis in Discriminant Analysis, Classification, ... Analysis
analysis, including analysis of variance and covariance, although it is usually preferable to treat them separately because of their individual peculiarities.
(4) Cluster analysis (often called classification automatique in the French literature): Similarities of observations between subjects are studied with the aim of forming groups of similar subjects, that is creating a partition, or sequence of partitions, of the subjects.

Notice the difference between discriminant analysis, where there is a particular partition of interest known in advance, and cluster analysis, where partitions are generated by the analysis. Just as discrimination can be considered a discrete form of regression, so cluster analysis can be considered a discrete form of multidimensional scaling, in the following sense: scaling techniques (like correspondence analysis) generate sets of scale values from the data matrix in the form of points in the continuum of multidimensional space, whereas clustering techniques generate sets of discrete values which allocate the subjects to groups. In Fig. 7.1 we have attempted a schematic view of all the above mentioned multivariate methods.

In the present chapter we shall demonstrate how a scaling technique like correspondence analysis can be used to solve problems in these various contexts. Most of the material presented here is not peculiar to correspondence analysis but refers to a wider class of scaling techniques, although correspondence analysis lends itself with particular ease to the treatment of a wide variety of problems.

7.1 DISCRIMINANT ANALYSIS
Up to now we have considered correspondence analysis as a technique for displaying the profiles of the I rows and J columns of a suitable data matrix with respect to optimal subspaces. We now turn our attention to a partition of the rows (and/or a partition of the columns), that is their grouping into an exhaustive and disjoint set of classes.

For example, if the rows i = 1...I refer to the districts in a country and the columns j = 1...J to different types of jobs (e.g. primary school teacher, shop assistant, dentist, etc.), then a partition of the rows would be the grouping of the districts into regions while the jobs could be grouped into broader classes of employment (e.g. public sector, small business, health care, etc.). Let us suppose that the datum n_ij is the number of people in district i with job j, so that the rows of the matrix N corresponding to districts of the same region are simply added together to give the frequencies n'_hj of people in region h with job j. Algebraically, this is handled more simply by defining an I × H logical indicator matrix Z_D whose ith row is a vector of zeros apart from a 1 in the appropriate column to indicate the region to which district i belongs. The H × J matrix N' of regional frequencies is then simply N' = Z_D^T N. In a similar fashion the columns of N can be condensed to give the frequencies n''_il of people in district i with a job of class l by defining an L × J logical indicator matrix Z_O whose columns classify the activities, so that the I × L matrix N'' = N Z_O^T. Finally, we can obtain the H × L matrix of frequencies n'''_hl of people in region h having class l of employment by condensing row- and columnwise: N''' = Z_D^T N Z_O^T (i.e. Z_D^T N'' or N' Z_O^T). All of these matrices are depicted schematically in Fig. 7.2 and a number of different correspondence analyses are possible, depending on the objective of the study.

FIG. 7.1. Schematic view of four areas of multivariate analysis. X is an I × J observation matrix of I subjects on J variables (quantitative and/or qualitative). Vectors y and z represent a set of quantitative and qualitative values respectively for the subjects. Thus: (a) Regression establishes a relationship between X and y (both of which are given) and prediction (forecasting) uses this relationship to derive y* from an additional X*. (b) Discriminant analysis/classification are similar to regression/prediction, but the dependent variable is qualitative. (c) Scaling (e.g. correspondence analysis) derives sets of quantitative values (e.g. principal co-ordinates) for the subjects (and/or the variables). (d) Clustering derives sets of qualitative values in the form of partitions of the subjects (and/or the variables).

It can be easily shown that each row profile of N' is the centroid of the constituent group of row profiles of N (Example 7.5.1). In addition, because the column sums of N' are the same as those of N the metric in each space of
row profiles is the same, namely D_c^{-1}. Similarly, the column profiles of N'' are centroids of the constituent groups of column profiles of N and the column metrics in the correspondence analyses of N and N'' are the same, namely D_r^{-1}. Hence there is just one metric framework for the data and possible partitions thereof, namely that of the original matrix N.

FIG. 7.2. The basic data matrix N and various condensations of its rows and columns. The row and column groupings are given by the logical indicator matrices Z_D and Z_O respectively.

At the most detailed level N itself is analysed and principal axes of its profiles computed. The rows of N' and the columns of N'' may be displayed as supplementary points with respect to any subspace either by computing the appropriate centroids or by using the relevant transition formulae (cf. Example 7.5.1).

To place more emphasis on differences between groups of rows (regions) N' may be analysed. As far as the row profiles are concerned this is an analysis of the row group centroids in the same space, as if the masses of the cloud of I points are removed and concentrated in the respective centroids. The computation of principal axes of the centroids is analogous to canonical variate analysis, which can be thought of as a weighted principal components analysis of group centroids in Mahalanobis space (cf. Appendix A). In the analysis of N' the rows of N and the columns of N''' can be displayed as supplementary points using the relevant transition formulae (cf. Example 7.5.1).

Similar descriptions hold for the separate analyses of N'' and of N''', for example in the analysis of N''' we are effectively investigating the dual principal axes of two sets of centroids and the rows of N'' and the columns of N' can be displayed a posteriori as supplementary points.

The correspondence analysis of N', N'' or N''' is effectively a discriminant analysis between the groups defining the rows and/or columns. We shall take the analysis of N' as an illustration and suppose that H ≤ J ≤ I. The (centered) row profiles of N lie in a (J − 1)-dimensional space, while the set of centroids (row profiles of N') lie in an (H − 1)-dimensional subspace. Thus (J − H) dimensions of the original row space are eliminated by focusing the study on the group centroids. A particular group of rows of N (group of districts in our example above) reflects the variability within the corresponding row of N' (the region). This variability is not necessarily of a probabilistic nature, as exemplified by our example in which the regions, districts and data are assumed complete and exhaustive. Notice that the analysis of N, too, may be considered a discriminant analysis where each group is a set of one row alone with no variability within the group (apart from possible measurement error in collecting the data).

It is up to the investigator to decide whether he is interested in the analysis of N (the inter-district analysis) or the analysis of N' (the inter-regional analysis). Nevertheless it is often interesting to compare the principal axes emanating from these two analyses, especially when the principal axes of N' are recovered amongst those of N. Clearly there is more inertia in the analysis of N not only in the full space but also in any subspace. In other words, the process of condensing the points into groups reduces the moments of inertia of the cloud in all directions. In fact, with respect to any subspace we have the classic result that the sum of the inertias of the individual profile points is equal to the sum of the inertias of the group centroids ("between-group" or "interclass" inertia) plus the sum of the individual inertias within each group ("within-group" or "intraclass" inertia). This result is stated more formally and proved in Example 7.5.3. Thus the sum of the inertias of the centroids is always less than the sum of the individual inertias by an amount equal to the within-group inertia. Because the correspondence analysis of N' identifies the principal axes which reflect maximum inertia of the centroids, the objective of minimizing within-group inertia in the subspace of these axes is equivalently satisfied. It is this property which characterizes the discrimination: loosely speaking, the group centroids are pushed apart while, simultaneously, the within-group variability is tightened.

In terms of the principal inertias λ_k of N and λ'_k of N' we have the further result that λ_1 ≥ λ'_1, λ_2 ≥ λ'_2, ... (Example 7.5.4). The process of condensing the data, that is of amalgamating groups of points, necessarily leads to smaller principal inertias.
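As a numerical sketch of the condensation and the between/within decomposition just described, the identity N' = Z^T N, the centroid property of the row profiles of N', and the decomposition of inertia can be checked directly. The counts, the region assignment and all variable names below are invented purely for illustration; this is not the book's own software.

```python
import numpy as np

# Hypothetical counts: 6 districts (rows) x 4 job types (columns).
N = np.array([[20,  5, 10,  3],
              [15,  8,  9,  4],
              [ 4, 12,  6, 10],
              [ 6, 14,  5, 12],
              [ 9,  3, 11,  2],
              [ 5, 10,  4,  9]], dtype=float)

# Indicator Z (6 districts x 3 regions): districts 0,1 -> region 0, etc.
Z = np.zeros((6, 3))
Z[[0, 1], 0] = Z[[2, 3], 1] = Z[[4, 5], 2] = 1.0

P = N / N.sum()            # correspondence matrix
r = P.sum(axis=1)          # row (district) masses
c = P.sum(axis=0)          # column masses, also the average profile
profiles = P / r[:, None]  # row profiles of N

Np = Z.T @ N               # condensed matrix N' = Z^T N
Pp = Np / Np.sum()
rp = Pp.sum(axis=1)        # group (region) masses
group_profiles = Pp / rp[:, None]

# Each row profile of N' is the mass-weighted centroid of its group's profiles.
centroids = (Z.T @ (r[:, None] * profiles)) / rp[:, None]
assert np.allclose(group_profiles, centroids)

def inertia(prof, masses, centre):
    """Sum of mass * squared chi-square distance to a common centre."""
    d2 = ((prof - centre) ** 2 / c).sum(axis=1)
    return (masses * d2).sum()

total   = inertia(profiles, r, c)          # inertia of the cloud of N's profiles
between = inertia(group_profiles, rp, c)   # inertia of the centroids (analysis of N')
within  = sum(inertia(profiles[Z[:, h] == 1], r[Z[:, h] == 1],
                      group_profiles[h]) for h in range(3))
assert np.isclose(total, between + within)
```

Here `between + within` reproduces the total inertia exactly, which is the between-group/within-group decomposition proved in Example 7.5.3; the centroids always carry less inertia than the individual profiles.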
FIG. 7.4. Example of classification subspace (dashed line) of the centroids (open triangle and circle) of two groups of points (solid triangles and circles) when one group is non-convex (the circles roughly define a banana-shaped cloud of points). A poor classification of the points will result in this subspace because the two groups overlap considerably when projected onto the subspace, even though they are quite separate in the full space.

orthogonal to the variation of the centroids has been eliminated as "unimportant" for classification. In doing this we do make the assumption that each group of points occupies an approximately convex region of the full space. If one of the groups were highly non-convex then distances computed in the centroid subspace could be highly misleading for classification (Fig. 7.4). Notice that the usual multinormal assumptions are equivalent to assuming ellipsoidal clouds of points, which are convex.

It might prove worthwhile to perform separate analyses on each group of points to increase understanding of their spatial distributions. For example, if a group of points is found to occupy two separate regions then it would be advisable to enlarge the classification space by adding the dimension which coincides with this division. This may be easily achieved by replacing the original centroid of the group by the two centroids of its subgroups prior to determining the classification space.

Conventionally the neighbourhood of a point is a multidimensional sphere, or spheroid, and its radius is to be specified by the analyst. Empirical guidelines for this choice are that the sphere should not be so small that it includes very few neighbours, and not so large that it encompasses points which the analyst considers too dissimilar. Between these two extremes there is a value (or a range of values) that can be used. A choice with more optimal properties can probably be made by employing some type of cross-validatory scheme, which we shall discuss below.

The actual mechanics of the classification decision are usually quite simple once the neighbourhood of an unclassified point is identified: the point is classified into the group which has highest mass in this neighbourhood. If individual points have the same mass then this is equivalent to the highest frequency in the neighbourhood. This strategy takes into account the relative proportions of individuals in each group in a natural way and there is no need to adjust for prior probabilities of the groups, unless these are quite different from the proportions in the design set. For example, if a particular group is doubly over-represented in the design set then the masses of all its members can be halved initially so that each member eventually counts as "half" a member for purposes of classification.

Cross-validation

The classification procedure which we have outlined above is not optimal in any theoretical sense because we are deliberately avoiding the mathematical assumptions by which a measure of its performance can be judged. It would make more sense here to judge the procedure in a cross-validatory fashion by dividing the design set randomly into a design subset and a test subset. Classification of the test subset can be performed and the results validated against the actual classification which is known. If this is repeated a number of times the range of performance of a particular procedure can be obtained. This whole process can be repeated on a variant of the procedure, for example where extra dimensions are added to or subtracted from the classification space, or where the neighbourhoods are made wider or narrower. This can direct the analyst towards an improved strategy, but it is a great deal of extra effort.

The above procedure is presently being used for forecasting the weather in a meteorological experiment and the study is outlined in Section 9.11.

7.3 REGRESSION

A typical regression analysis attempts to "explain" the values of a quantitative variable y (which we call the predictand) in terms of a number of predictor variables x_1 ... x_J, possibly of different types. The classical approach is to set up a regression model for the I sets of observations: y_i = f(x_i1 ... x_iJ; β) + e_i, i = 1...I, where β is a vector of parameters and the e_i are residuals, or
deviations, from the model. The analysis does not determine the form of the model but estimates the parameters β for the prescribed model with a view to minimizing the residuals in some global sense. Most commonly the prescribed model is a linear one: β_0 + β_1 x_i1 + ... + β_J x_iJ, and the parameters are estimated by least-squares, that is by minimizing Σ_i e_i^2 = Σ_i (y_i − β_0 − Σ_j β_j x_ij)^2. The situation can be described geometrically as the fitting of a hyperplanar response surface and is depicted in Fig. 7.5. It seems highly unlikely that the values y_i should all lie near such a hyperplane, as implied by the linear model. In the case of a predictand like rainfall, for example, there might be many zero values, in which case the linear model is unrealistic. The careful data modeller would investigate the data more closely and introduce relevant functions of the predictors into the regression model so that the response surface follows the values y_i more closely.

FIG. 7.5. Illustration of the linear regression model (a plane in this case) where the number of predictor variables is J = 2.

If a model can be established which fits reasonably, the data is effectively discarded and the estimated model is used both as a description of the relationship and to predict new values y* from new vectors of observations x* on the predictors. Confidence intervals on these predicted y* are part of such an analysis and are usually the same width irrespective of the values of x*, as if there were a confidence band on either side of and parallel to the response surface.

We have mixed feelings about the use of such models, especially in the analysis of large data sets. On the one hand, when the data analyst has some prior justifications for describing his observations in terms of a model and has fairly firm ideas about the form the model should take, then modelling seems perfectly suitable and justifiable. On the other hand, it may be that the analyst has no fixed ideas about the relationship and resorts to a multiple regression computer program to take the burden off his shoulders. This would be fine if the analyses were viewed as exploratory, but more often than not the resultant regression model is adopted and then regarded as the "correct" model. This attitude is more prevalent in large studies where it is expensive to perform further analyses to cross-validate the data and diagnose the regression relationship more accurately (if it indeed exists!).

As an alternative we again try a less formal approach by using neighbourhoods of points, the only problem being to decide in which space to calculate the neighbourhoods. As in discrimination/classification we would want a space which somehow shows up the relationship between the predictand and the predictors, so that a new vector x* is not matched to vectors x on features that are unassociated with the variation in y. Correspondence analysis provides a framework for investigating this association and then deriving such a regression space. First, the range of y is segmented into a number of classes, the analyst being guided by the histogram of the variable and his experience of its values. Using the example of daily rainfall again, one class of rainfall might be zero rainfall, the next 0-½ mm, ½-1 mm, and so on, where the intervals are meteorologically relevant. The more observations there are in the training set of data, the more subdivisions can be imposed to make subsequent analysis more sensitive to finer variations in the predictand. As far as the predictors are concerned, they too need to be recoded into categories or pairs of doubled variables, as described in Section 5.4, more especially if the variables are of diverse types. This whole recoding process aims at producing a matrix which summarizes the association between y and x by crossing the categories of the predictand with those of the predictors (Fig. 7.6).

FIG. 7.6. Data matrix for setting up a regression space by correspondence analysis. The (i,j)th cell of this matrix contains the frequency of association (or other recoded measure of association) of category i of the predictand with category j of the predictors.

The correspondence analysis of this matrix will provide a space which
discriminates between the classes of the predictand. The individual (recoded) data vectors of predictors can be projected onto this regression space, as described in Section 7.1, as well as new vectors of predictors, whose neighbourhoods can be determined. The actual prediction need not be a classification into a predictand group, but a summary statistic evaluated on all the y_i whose predictors fall in the neighbourhood, for example the mean of the y_i, or their median, or some more sophisticated statistic taking into account the fact that y is a regionalized variable (in the regression space). Finally, the whole procedure, with its host of ad hoc choices, can be fine-tuned using some cross-validation scheme.

The efficacy of such a strategy for performing regression has not yet been fully investigated, although the method has a lot of appealing properties. The main problem with researching the method is the development of flexible computer programs to perform the large numbers of calculations involved, especially in cross-validation studies. One program, at least, is described by Lebeaux (1974, 1977), who first worked in this area in collaboration with Benzécri. Cazes (1978), in the third of a series of articles on regression methods, also discusses this strategy, which the French call régression par boule (bubble regression).

7.4 CLUSTER ANALYSIS

Cluster analysis has a vast theoretical and applied literature and it would be beyond the scope of this book to enter into a comprehensive review of the topic. Instead, we shall briefly describe the differences between hierarchical and non-hierarchical clustering and then demonstrate how a particular form of hierarchical clustering is related geometrically to correspondence analysis.

The aim of a cluster analysis is to derive a partition, or a sequence of partitions, of a set of objects based on their similarities (equivalently, their distances) to one another, so that objects clustered into the same group (or class) are similar, or close, to one another, while those of different groups are dissimilar, or far apart. Before clustering even begins, several decisions need to be made by the data analyst, the most crucial being how the inter-object similarity or distance is to be measured. If there are I objects, an I × I symmetric matrix of similarities, or distances, is computed. To avoid repetition we shall describe clustering in terms of inter-object distances, which are monotonically inversely related to similarities (i.e. the less similar two objects are, the further they are apart).

Hierarchical clustering

In hierarchical clustering the I objects are regarded initially as I clusters of one object each and the analysis proceeds sequentially to agglomerate clusters into larger clusters until all the objects form a single cluster. The method is attractive because it is non-iterative and can be represented graphically in the form of a binary tree (Fig. 7.7). The only other decision needed in order to carry out this type of clustering is how distances between clusters of more than one object are to be measured. The three most common choices for defining inter-cluster distance are the minimum of all inter-object distances between the two clusters ("single linkage" clustering), the maximum inter-object distance ("diameter" clustering) and an average inter-object distance ("average linkage" clustering). At each step of the clustering the two closest clusters are agglomerated, corresponding to one of the branches, or nodes, of the tree. Hence there are I − 1 steps of the analysis, that is I − 1 nodes of the tree, to complete the clustering of all the objects, and each node may be indexed by the distance between the two clusters which the node brings together. The nodes of the tree are displayed on a vertical scale according to these distances (Fig. 7.7). Any horizontal cross-section of the tree reveals a partition of the objects, characterized by the distance d of the "slice". For example, if the "diameter" method is used then the clusters are such that all inter-object distances within each cluster are less than d, whereas in "single linkage" clustering all the clusters are separated from each other by distances greater than d. An "average linkage" clustering compromises between the former criterion of within-cluster similarity and the latter criterion of between-cluster separability.

FIG. 7.7. Binary tree which summarizes a hierarchical clustering. Since there are 9 objects, there are 8 nodes of the tree, which are formed in the order indicated (i.e. in order of increasing inter-cluster distance). The horizontal "slice" at distance d partitions the objects into 3 clusters: {c, a, e}, {h, b} and {i, f, g, d}.

Hierarchical clustering is useful if the analyst has no prior ideas about how
many clusters he expects or might like to have. Having obtained the results in the form of a binary tree, some obvious choice of where to make the cross-section might become apparent. For example, in Fig. 7.7, the large jump between the distance values of nodes 6 and 7 suggests that the cross-section can be satisfactorily made between these nodes, so that 3 clusters are obtained.

Non-hierarchical clustering

In non-hierarchical clustering attention is directed at some prespecified number of clusters, to be obtained by attempting to optimize a criterion of "clusteredness", that is an overall measure of within-cluster compactness and between-cluster separation. The clustering proceeds from a reasonable initial partition of the objects in an iterative fashion towards partitions which are more and more "clustered". One such algorithm transfers one object between any two clusters at each iteration so as to produce the maximum increase in the chosen measure of clusteredness. This can involve a large amount of computation at each iteration, but there are simpler special cases, for example the present situation of I objects, with masses, in a weighted Euclidean space where the clusteredness is defined as between-cluster inertia. Maximizing the between-cluster inertia is equivalent to minimizing the within-cluster inertia, as shown in Example 7.5.3. In this case the gain in between-cluster inertia resulting from the transfer of one object can be evaluated quite simply from the distances of the object to the centroids, with an adjustment which depends on the masses of the clusters and of the object (see Example 7.5.5). This is a variant of the non-hierarchical clustering algorithm called "k-means clustering" (MacQueen, 1967) where objects are assigned to the clusters with the closest centroids. The centroids of the new clusters are recomputed after each assignment and the process continues until no more objects can change clusters. To initiate this algorithm a set of random points are often computed as seeds for the first clustering of the objects.

The results of such a clustering are not unique and depend on the initial clusters as well as the order of the objects in the data file. To eliminate the latter dependency the algorithm can be modified to allow all the objects to be assigned before recomputation of the new cluster centroids (Forgy, 1965). The former dependency can be investigated by repeating the clustering algorithm with different choices of initial clusters. The final solution which provides the optimal clusteredness can then be retained.

Hierarchical clustering and correspondence analysis

For the remainder of this section we shall discuss a particular form of hierarchical clustering which can be related to a correspondence analysis. Let us suppose that we have analysed the correspondence matrix P (I × J) to obtain matrices of principal co-ordinates F and G, and principal inertias in D_λ. Furthermore, let us suppose that we are interested in a clustering of the rows of P. In a hierarchical cluster analysis of these rows there will be I − 1 nodes, and at each node l, two clusters I_l′ and I_l″ are amalgamated. If in(I_l′), in(I_l″) and in(I_l) denote the (within-cluster) inertias of the clusters I_l′, I_l″ and the combined cluster I_l respectively, each with respect to their own centroids, then we know from Example 7.5.3 that in(I_l) is greater than the sum in(I_l′) + in(I_l″) by an amount equal to the inertia of the centroids of I_l′ and I_l″ with respect to their joint centroid, the centroid of cluster I_l. A natural choice for clustering seems to be those two clusters whose agglomeration induces the least increase in within-cluster inertia, equivalently the least decrease in between-cluster inertia. With each node l we can associate the minimum of the increases v_l:

v_l ≡ in(I_l) − {in(I_l′) + in(I_l″)}    (7.4.1)

which can be rewritten as:

v_l = {r^(l′)r^(l″)/(r^(l′) + r^(l″))} ‖ȳ^(l′) − ȳ^(l″)‖²_{D_c^{-1}}    (7.4.2)

where r^(l′) and ȳ^(l′), for example, are the mass and centroid of the row profiles in the l′th cluster, and ‖...‖_{D_c^{-1}} denotes the chi-square metric in the row geometry. This result is proved in a more general form in Example 7.5.6. Thus the hierarchical clustering algorithm which minimizes v_l at each node is similar to a "single linkage" clustering on the (squared) chi-square distances between the row profiles, with the difference that these distances are weighted by the masses of the pair of clusters. Clearly, the quantities v_l can be calculated at each node of the clustering tree, irrespective of the type of clustering performed, but it is only in the particular case where v_l is minimized at each step that the value of v_l coincides with the vertical position of the node in the binary tree.

The set of quantities v_l, l = 1...L, provides yet another decomposition of the total inertia of the cloud of row profiles in(I), which in the present notation is equivalent to the inertia of the terminal cluster in(I_L). Notice that our present definition of in(I_l′), say, is the inertia of a cluster I_l′ of points with respect to their own centroid (the "within-cluster" inertia) so that when I_l′ is a single object in(I_l′) = 0. If we sum v_l over all L = I − 1 nodes, all terms cancel out, by definition (7.4.1), except the inertia of the final cluster of all the objects in(I). Hence:

in(I) = Σ_l v_l    (7.4.3)
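As a numerical sketch of this decomposition, a naive minimum-inertia agglomeration can be run on a small weighted cloud and the identity (7.4.3) checked. The profiles, masses and metric weights below are invented for illustration, and the quadratic-time greedy merge is only a toy implementation of the Ward-type criterion (7.4.2), not Benzécri's or Lebeaux's program.

```python
import numpy as np

# Invented data: 6 profile points in 3 dimensions, with masses summing to 1,
# in a weighted Euclidean (chi-square-like) metric with weights w.
rng = np.random.default_rng(0)
Y = rng.random((6, 3))
m = rng.random(6); m /= m.sum()
w = np.array([2.0, 1.0, 0.5])

def d2(a, b):
    """Squared distance in the weighted metric."""
    return (w * (a - b) ** 2).sum()

# Agglomerate by minimizing v_l = m' m''/(m' + m'') * ||ybar' - ybar''||^2,
# the increase in within-cluster inertia at each node, cf. (7.4.1)-(7.4.2).
clusters = [([i], m[i], Y[i].copy()) for i in range(6)]  # (members, mass, centroid)
nodal = []                                               # the nodal inertias v_l
while len(clusters) > 1:
    best = min(((a, b) for a in range(len(clusters)) for b in range(a)),
               key=lambda ab: (clusters[ab[0]][1] * clusters[ab[1]][1]
                               / (clusters[ab[0]][1] + clusters[ab[1]][1])
                               * d2(clusters[ab[0]][2], clusters[ab[1]][2])))
    (mem_a, ma, ya), (mem_b, mb, yb) = clusters[best[0]], clusters[best[1]]
    nodal.append(ma * mb / (ma + mb) * d2(ya, yb))
    merged = (mem_a + mem_b, ma + mb, (ma * ya + mb * yb) / (ma + mb))
    clusters = [cl for k, cl in enumerate(clusters) if k not in best] + [merged]

# (7.4.3): the I - 1 nodal inertias sum to the total inertia of the cloud.
centre = (m[:, None] * Y).sum(axis=0)
total = sum(m[i] * d2(Y[i], centre) for i in range(6))
assert np.isclose(sum(nodal), total)
```

Because each v_l is exactly the increase in within-cluster inertia at that node (7.4.1), the sum of the nodal inertias telescopes to the total inertia of the cloud whatever merge order is chosen; minimizing v_l at each step merely orders the nodes on the vertical scale of the tree.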
There is an interesting analogy between clustering the profiles in this way points and their contributions to the principal axes as well as the axes'
and displaying them multidimensionally. In the latter situation we choose a contributions to the nodes may be computed. The nodal inertia V¡ is a
set of orthogonal axes to represent the profiles and th~ total inertia is weighted squared distance between the centroids of the pair of c1usters l' and
decomposed along these axes. If the axes are principal axes the'n lhe 1" agglomerated at the node (cf. (7.4.2)). This squared distance may be
decomposition is optimal in the usual sense that the axes reflect maximum expressed as the sum of squared differences in principal co-ordinates of the
inertia in an ordered fashion: Al ~ A2 ~ .... On the other hand, when a centroid profiles:
hierarchical clustering isperformed on the profiles, the sequence of 110des 111'(1')-1'(1")111>," = I- k(fll')-fll"l)2
defines a sequence of partitions o( the óbjects and the f6TarTr;"~'rtiacan be
decomposed amongst the nodes according to the quantities VI' When these (wherefll'l, for example, is the principal co-ordinate ofthe ['th centroid on the
quantities are minimized at each step then the decomposition is again kth principal axis). Consequently, the total inertia can be decomposed by
optimal, in the sense that the between-cluster inertia decreases the least at nodes and by axes in a two-way table of quantities:
each step. In this case there is also an ordering within the decomposition: (1') (1")
V 1 ~ V2 ~ ••. , that is in terms of the values of the nodes on the vertical scale
=~
V¡k -(1)
(fll'l
k
_fll"l)2
k (7.4.4)
r
of the binary tree. The analogy between axes and nodes is not complete,
however, since in no sense are the nodes "orthogonal" to one another. In fact, where VI = I-kV 1k , by definition, and Ak = I-¡V¡b since (7.4.1) applies in a
the various partitions of the objects, derived from horizontal cross-sections of similar fashion to the profiles projected onto the kth principal axis.
the binary tree, are highly dependent on one another because of the hierarchical style of clustering.

By analogy with the principal inertias we shall call the quantities $v_1, v_2, \ldots$ nodal inertias, or the contributions of the nodes to the total inertia. Percentages of inertia can be computed as before: $v_l/\mathrm{in}(I)$, and a particular partition is suggested where there is a large jump in these percentages, just as a particular subspace is suggested when there is a large drop in the principal inertias.

There are some interesting inequality relationships between the principal inertias and the nodal inertias. The highest principal inertia $\lambda_1$, for example, always exceeds the highest nodal inertia $v_L$, the inertia of the node that forms the cluster of all the objects. If this were not so, it is easily shown that the axis which joins the centroids of the last two clusters would reflect a higher moment of inertia of the cloud of objects than $\lambda_1$, which is impossible since $\lambda_1$ is the highest. More generally, the sum of the first $K^*$ principal inertias is higher than the sum of the $K^*$ largest nodal inertias. The superiority of the principal inertias over the nodal inertias is most dramatic amongst the higher inertias. In fact, as shown by Benzécri and Cazes (1978), the largest nodal inertia can be extremely small compared to $\lambda_1$, in which case the cluster analysis is much less effective in analysing the data than correspondence analysis. A typical example of this would be when the profiles occupy a "continuous" region of multidimensional space and hardly cluster at all.

As in the usual decomposition of inertia by points and by axes (cf. Section 4.1.11), the quantities $v_{lk}/\lambda_k$ can be computed to investigate the contribution of the nodes to the $k$th principal axis. If such a contribution is near to 1 then this means that the dispersion of the cloud of points along axis $k$ is associated almost exclusively with points clustered at node $l$. The quantity $v_{lk}/v_l$ is similarly called the contribution of the $k$th principal axis to the node $l$ and may be interpreted as a squared angle cosine (Fig. 7.8). The centroids of the pair of clusters $l'$ and $l''$ constituting the node $l$ define a direction in multidimensional space, subtending an angle of $\phi_{lk}$ with principal axis $k$, such that $\cos^2\phi_{lk} = v_{lk}/v_l$ (proved in a more general situation in Example 7.5.6). Hence if $v_{lk}/v_l$ is close to 1 then the separation of the centroids of the clusters $l'$ and $l''$ is almost exactly along the $k$th principal axis. As before these squared cosines may be added together to give the squared cosine of the inter-centroid vector with respect to any subspace of the principal axes.

FIG. 7.8. The angle $\phi_{lk}$ subtended by the inter-centroid direction of node $l$ and principal axis $k$.
Display of the nodes with respect to principal axes

The nodes may thus be represented in a correspondence analysis display by the centroids of the profiles which they bring together into a cluster. Thus the terminal node $L$ is displayed at the origin itself, while the two clusters preceding $L$ are displayed by two centroids on either side of the origin, "balanced" at the origin by their respective masses. In general, the centroids of the two clusters (nodes) $l'$ and $l''$ preceding node $l$ are displayed on either side of the centroid of $l$ and are balanced at this centroid by their respective masses, which in previous notation can be written in the full space as:

$$r_{(l')}(\mathbf{f}_{(l')}-\mathbf{f}_{(l)}) + r_{(l'')}(\mathbf{f}_{(l'')}-\mathbf{f}_{(l)}) = \mathbf{0}$$

and with respect to any principal axis $k$ as:

$$r_{(l')}(f_{k(l')}-f_{k(l)}) + r_{(l'')}(f_{k(l'')}-f_{k(l)}) = 0$$

Computationally, the display co-ordinates are obtained either by direct calculation of the appropriate centroids of the clusters of points, or by "condensing" the clusters of points into their centroids, as described in Section 7.1 and Example 7.5.1, and then using the appropriate transition formula to represent them as supplementary points.

Notice that all the results of this section apply to the more general situation of a set of points, with pre-assigned masses, in a weighted Euclidean space structured by any positive-definite symmetric matrix. We state and prove the results in Section 7.5 for this general case and these apply as a special case to correspondence analysis.

Benzécri et al. (1980) present FORTRAN programs to compute various tables which enhance the interpretation of a cluster analysis in the framework of correspondence analysis. A complete treatment of this subject is given by Jambu (1978, 1983).

7.5.1 Correspondence analysis of a "condensed" matrix

Let $\mathbf{P}$ ($I \times J$) be a correspondence matrix and $\mathbf{P}'$ ($H \times J$) the correspondence matrix

If $\mathbf{D}_r^{-1}\mathbf{P}$ is the matrix of row profiles of $\mathbf{P}$, then the rows of $\mathbf{Z}^{\mathrm{T}}\mathbf{D}_r\mathbf{D}_r^{-1}\mathbf{P} = \mathbf{Z}^{\mathrm{T}}\mathbf{P} = \mathbf{P}'$ contain the weighted sums of each group of row profiles. To obtain the centroids, each of these weighted sums must be divided by the total mass of the respective group. These masses are contained in the vector $\mathbf{Z}^{\mathrm{T}}\mathbf{D}_r\mathbf{1} = \mathbf{Z}^{\mathrm{T}}\mathbf{r}$, which is just the vector of masses $\mathbf{r}'$ of the row profiles of $\mathbf{P}'$, i.e. the row sums of $\mathbf{P}'$: $\mathbf{r}' = \mathbf{P}'\mathbf{1} = \mathbf{Z}^{\mathrm{T}}\mathbf{P}\mathbf{1} = \mathbf{Z}^{\mathrm{T}}\mathbf{r}$. Hence the centroids are the rows of $\mathbf{D}_{r'}^{-1}\mathbf{P}'$, the row profiles of $\mathbf{P}'$.

Suppose that $\mathbf{F}$, $\mathbf{G}$ and $\mathbf{D}_\lambda$ are the principal co-ordinates and principal inertias in the correspondence analysis of $\mathbf{P}$. The rows of $\mathbf{P}'$ can be displayed as supplementary points by evaluating the centroids of the groups of rows of $\mathbf{F}$: $\mathbf{D}_{r'}^{-1}\mathbf{Z}^{\mathrm{T}}\mathbf{D}_r\mathbf{F}$, where, as before, $\mathbf{D}_r$ assigns the masses to individual rows of $\mathbf{F}$, $\mathbf{Z}^{\mathrm{T}}$ sums these rows in their various groups and $\mathbf{D}_{r'}^{-1}$ divides the weighted sums by the group masses. Alternatively, the usual transition formula from columns to rows can be applied to the profiles $\mathbf{D}_{r'}^{-1}\mathbf{P}'$: $\mathbf{D}_{r'}^{-1}\mathbf{P}'\mathbf{G}\mathbf{D}_\lambda^{-1/2}$.

Suppose now that $\mathbf{F}'$, $\mathbf{G}'$ and $\mathbf{D}_{\lambda'}$ are the principal co-ordinates and principal inertias in the correspondence analysis of $\mathbf{P}'$. The rows of $\mathbf{P}$ can be displayed as supplementary points by applying the column-to-row transition formula to the profiles $\mathbf{D}_r^{-1}\mathbf{P}$: $\mathbf{D}_r^{-1}\mathbf{P}\mathbf{G}'\mathbf{D}_{\lambda'}^{-1/2}$. The row profiles $\mathbf{D}_{r'}^{-1}\mathbf{P}'$, displayed by $\mathbf{F}'$, are still at the centroids of the groups of supplementary profiles: $\mathbf{D}_{r'}^{-1}\mathbf{Z}^{\mathrm{T}}\mathbf{D}_r\mathbf{D}_r^{-1}\mathbf{P}\mathbf{G}'\mathbf{D}_{\lambda'}^{-1/2} = \mathbf{D}_{r'}^{-1}\mathbf{P}'\mathbf{G}'\mathbf{D}_{\lambda'}^{-1/2} = \mathbf{F}'$.

7.5.2 Huyghens' theorem

Let $\mathbf{y}_1 \ldots \mathbf{y}_I$ be a cloud of points, with masses $w_1 \ldots w_I$, in a multidimensional weighted Euclidean space where the metric is defined by the positive-definite symmetric matrix $\mathbf{Q}$. Let $\bar{\mathbf{y}}$ denote the centroid of the points, $\sum_i w_i\mathbf{y}_i/\sum_i w_i$. Show that the total inertia of the cloud with respect to any point $\mathbf{y}$ is equal to its total inertia with respect to the centroid plus the squared distance between $\mathbf{y}$ and $\bar{\mathbf{y}}$ weighted by the total mass of the cloud:

$$\sum_i w_i\|\mathbf{y}_i-\mathbf{y}\|_Q^2 = \sum_i w_i\|\mathbf{y}_i-\bar{\mathbf{y}}\|_Q^2 + \Bigl(\sum_i w_i\Bigr)\|\mathbf{y}-\bar{\mathbf{y}}\|_Q^2 \qquad (7.5.1)$$

where:

$$\|\mathbf{a}-\mathbf{b}\|_Q^2 \equiv (\mathbf{a}-\mathbf{b})^{\mathrm{T}}\mathbf{Q}(\mathbf{a}-\mathbf{b})$$

Solution

The total inertia of the cloud with respect to $\mathbf{y}$, i.e. the left-hand side of (7.5.1), can be written as:

$$\sum_i w_i\|\mathbf{y}_i-\bar{\mathbf{y}}\|_Q^2 + \Bigl(\sum_i w_i\Bigr)\|\bar{\mathbf{y}}-\mathbf{y}\|_Q^2 + 2\sum_i w_i(\mathbf{y}_i-\bar{\mathbf{y}})^{\mathrm{T}}\mathbf{Q}(\bar{\mathbf{y}}-\mathbf{y})$$

By definition of the centroid $\bar{\mathbf{y}}$ the cross-product term is zero:

$$\Bigl\{\sum_i w_i(\mathbf{y}_i-\bar{\mathbf{y}})^{\mathrm{T}}\Bigr\}\mathbf{Q}(\bar{\mathbf{y}}-\mathbf{y}) = \Bigl\{\sum_i w_i\mathbf{y}_i-\sum_i w_i\bar{\mathbf{y}}\Bigr\}^{\mathrm{T}}\mathbf{Q}(\bar{\mathbf{y}}-\mathbf{y}) = 0$$
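Identity (7.5.1) is easy to verify numerically. The sketch below (Python, taking $\mathbf{Q} = \mathbf{I}$ and a small invented cloud) evaluates both sides for an arbitrary point $\mathbf{y}$:

```python
import numpy as np

# Invented cloud y_i with masses w_i, metric Q = I, and an arbitrary point y
Y = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
w = np.array([1.0, 1.0, 2.0])
ybar = (w @ Y) / w.sum()                       # centroid (0.5, 1.0)
y = np.array([3.0, 3.0])

lhs = np.sum(w * np.sum((Y - y) ** 2, axis=1))        # inertia about y
rhs = (np.sum(w * np.sum((Y - ybar) ** 2, axis=1))    # inertia about centroid
       + w.sum() * np.sum((y - ybar) ** 2))           # + total mass x d^2
assert np.isclose(lhs, rhs)                    # Huyghens' theorem (7.5.1)
```

For a general positive-definite $\mathbf{Q}$ the squared norms would simply be replaced by $(\mathbf{a}-\mathbf{b})^{\mathrm{T}}\mathbf{Q}(\mathbf{a}-\mathbf{b})$; the identity holds unchanged.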
the total inertia of all $I = \sum_h I_h$ points with respect to their overall centroid $\bar{\mathbf{y}}$ is equal to the sum of the inertias of the $H$ group centroids (i.e. the between-group inertia) plus the sum of the inertias of each group with respect to its respective centroid:

$$\sum_h\sum_i w_i^{(h)}\|\mathbf{y}_i^{(h)}-\bar{\mathbf{y}}\|_Q^2 = \sum_h w^{(h)}\|\bar{\mathbf{y}}^{(h)}-\bar{\mathbf{y}}\|_Q^2 + \sum_h\sum_i w_i^{(h)}\|\mathbf{y}_i^{(h)}-\bar{\mathbf{y}}^{(h)}\|_Q^2 \qquad (7.5.2)$$

Solution

This result is a direct application of Huyghens' theorem to each group of points, where the point $\mathbf{y}$ of (7.5.1) is the overall centroid $\bar{\mathbf{y}}$. Thus the inertia of the $h$th group of points may be expressed with respect to $\bar{\mathbf{y}}$ as:

$$\sum_i w_i^{(h)}\|\mathbf{y}_i^{(h)}-\bar{\mathbf{y}}\|_Q^2 = \sum_i w_i^{(h)}\|\mathbf{y}_i^{(h)}-\bar{\mathbf{y}}^{(h)}\|_Q^2 + w^{(h)}\|\bar{\mathbf{y}}^{(h)}-\bar{\mathbf{y}}\|_Q^2$$

Summing this equation over the groups, $h = 1 \ldots H$, gives the result (7.5.2).

Comment

This decomposition of the total inertia holds in the full space of the points as well as in any subspace onto which the points are projected.

7.5.4 Principal inertias of a cloud of centroids

Given the same situation as in Example 7.5.3, show that the principal inertias $\lambda_1, \lambda_2, \ldots$ of the original points are greater than (or equal to) the respective principal inertias $\lambda_1', \lambda_2', \ldots$ of the centroids: $\lambda_1 \geq \lambda_1'$, $\lambda_2 \geq \lambda_2'$, ....

Solution (cf. Deniau and Oppenheim, 1979)

Let us suppose that the dimensionality of the $I$ points is $K$ and that of the $H$ centroids is $K'$, where $K' \leq K$. The result is thus trivially true for $k > K'$, since $\lambda_k' = 0$ in this case. For $k \leq K'$, consider the following two subspaces of the $K$-dimensional space in which all the points lie: first, the subspace of all vectors with respect to which the (moment of) inertia of all the points is $\leq \lambda_k$; secondly, the subspace of all vectors with respect to which the inertia of the centroids is $\geq \lambda_k'$. The first subspace excludes the first $k-1$ principal axes of the $I$ points and is thus of dimensionality $K-(k-1) = K-k+1$, while the second subspace includes the first $k$ principal axes of the $H$ centroids and is thus of dimensionality $k$. Since the sum of these dimensionalities is $K-k+1+k = K+1$, the intersection of these two subspaces must have a dimensionality of at least 1.

We can thus assume the existence of a vector $\mathbf{u}$ which is common to both subspaces. With respect to this vector, the result of Example 7.5.3 applies and the total inertia of all the points projected onto $\mathbf{u}$ must be greater than the inertia of their centroids: $\mathrm{in}_u(I) \geq \mathrm{in}_u(H)$. By definition of the two subspaces to which $\mathbf{u}$ belongs: $\lambda_k \geq \mathrm{in}_u(I)$ and $\mathrm{in}_u(H) \geq \lambda_k'$, which implies $\lambda_k \geq \lambda_k'$.

7.5.5 Change in between-groups inertia induced by transfer of one point

Given the same situation as in Example 7.5.3, consider the transfer of one point $\mathbf{y}_o$ (with mass $w_o$) from group $h'$ to group $h''$. Let $\mathrm{in}(h)$ denote the inertia of the $h$th group centroid (with respect to the overall mean) before the transfer, which in our previous notation is: $\mathrm{in}(h) = w^{(h)}\|\bar{\mathbf{y}}^{(h)}-\bar{\mathbf{y}}\|^2$. Show that the increase in between-groups inertia $\sum_h \mathrm{in}(h)$ resulting from the transfer is equal to:

$$\frac{w_o w^{(h')}}{w^{(h')}-w_o}\|\mathbf{y}_o-\bar{\mathbf{y}}^{(h')}\|^2 - \frac{w_o w^{(h'')}}{w^{(h'')}+w_o}\|\mathbf{y}_o-\bar{\mathbf{y}}^{(h'')}\|^2 \qquad (7.5.3)$$

Solution

When $\mathbf{y}_o$ is transferred from group $h'$ to group $h''$ the centroids of these two groups are translated, respectively away from and towards the position of $\mathbf{y}_o$. This transfer only affects the inertia of these two groups: the within-groups inertia of the other groups is clearly unaffected and, because the transfer does not alter the position of the overall centroid $\bar{\mathbf{y}}$, the inertias of the other centroids remain constant. Because the between- and within-groups inertias sum to a constant (Example 7.5.3) we can evaluate the increase in the between-groups inertia equivalently by the decrease in the within-groups inertia.

Before transfer, the inertia within group $h'$, comprising $I_{h'}$ points, which we denote by $\mathrm{in}(I_{h'})$, is:

$$\mathrm{in}(I_{h'}) = \sum_i w_i^{(h')}\|\mathbf{y}_i^{(h')}-\bar{\mathbf{y}}^{(h')}\|^2 = w_o\|\mathbf{y}_o-\bar{\mathbf{y}}^{(h')}\|^2 + \sum_i{}' w_i^{(h')}\|\mathbf{y}_i^{(h')}-\bar{\mathbf{y}}^{(h')}\|^2$$

where the summation in the second term extends over all points in group $h'$ except the point $\mathbf{y}_o$. By Huyghens' theorem this term is the inertia of the group of points without $\mathbf{y}_o$, which we denote by $\mathrm{in}(I_{h'}-1)$, plus the new group's mass $w^{(h')}-w_o$ multiplied by the squared distance between the old centroid $\bar{\mathbf{y}}^{(h')}$ and the new centroid, denoted by $\bar{\mathbf{y}}_o^{(h')}$, that is:

$$\mathrm{in}(I_{h'}) = w_o\|\mathbf{y}_o-\bar{\mathbf{y}}^{(h')}\|^2 + \mathrm{in}(I_{h'}-1) + (w^{(h')}-w_o)\|\bar{\mathbf{y}}^{(h')}-\bar{\mathbf{y}}_o^{(h')}\|^2$$

The decrease in the inertia within group $h'$ is thus:

$$\mathrm{in}(I_{h'}) - \mathrm{in}(I_{h'}-1) = w_o\|\mathbf{y}_o-\bar{\mathbf{y}}^{(h')}\|^2 + (w^{(h')}-w_o)\|\bar{\mathbf{y}}_o^{(h')}-\bar{\mathbf{y}}^{(h')}\|^2$$

This can be simplified as:

$$\mathrm{in}(I_{h'}) - \mathrm{in}(I_{h'}-1) = \frac{w_o w^{(h')}}{w^{(h')}-w_o}\|\mathbf{y}_o-\bar{\mathbf{y}}^{(h')}\|^2$$

using the expression for the position of the new centroid:

$$\bar{\mathbf{y}}_o^{(h')} = \frac{w^{(h')}\bar{\mathbf{y}}^{(h')}-w_o\mathbf{y}_o}{w^{(h')}-w_o}$$

which implies that the difference between old and new centroids is:

$$\bar{\mathbf{y}}_o^{(h')}-\bar{\mathbf{y}}^{(h')} = -\frac{w_o}{w^{(h')}-w_o}(\mathbf{y}_o-\bar{\mathbf{y}}^{(h')})$$

This argument can now be repeated for the group $h''$, where the point $\mathbf{y}_o$ is added to the group. This leads to the following decrease in within-group inertia:

$$\mathrm{in}(I_{h''}) - \mathrm{in}(I_{h''}+1) = -w_o\|\mathbf{y}_o-\bar{\mathbf{y}}^{(h'')}\|^2 + (w^{(h'')}+w_o)\|\bar{\mathbf{y}}_o^{(h'')}-\bar{\mathbf{y}}^{(h'')}\|^2 = -\frac{w_o w^{(h'')}}{w^{(h'')}+w_o}\|\mathbf{y}_o-\bar{\mathbf{y}}^{(h'')}\|^2$$

which is negative, as expected, because the within-group inertia must increase when a point is added to the group.
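Formula (7.5.3) can be checked on a small example. In the sketch below (Python; a one-dimensional cloud with unit masses and invented values), the point $y_o = 2$ is transferred from group $h' = \{0, 2\}$ to group $h'' = \{10\}$ and the resulting change in between-groups inertia is compared with (7.5.3):

```python
import numpy as np

def between(groups):
    """Between-groups inertia of 1-D point groups with unit masses."""
    pts = np.concatenate(groups)
    ybar = pts.mean()                     # overall centroid (unchanged by transfer)
    return sum(len(g) * (np.mean(g) - ybar) ** 2 for g in groups)

before = between([np.array([0.0, 2.0]), np.array([10.0])])
after = between([np.array([0.0]), np.array([2.0, 10.0])])

# Formula (7.5.3) with y_o = 2, w_o = 1, w' = 2, w'' = 1,
# and old centroids ybar' = 1 and ybar'' = 10:
yo, wo, wp, wpp = 2.0, 1.0, 2.0, 1.0
pred = (wo * wp / (wp - wo)) * (yo - 1.0) ** 2 \
     - (wo * wpp / (wpp + wo)) * (yo - 10.0) ** 2
assert np.isclose(after - before, pred)   # the between-inertia change matches
```

Here the between-groups inertia actually decreases (the predicted change is negative), because the transferred point is much closer to its old centroid than to its new one.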
Theory and Applications of Correspondence Analysis
we finally interpret are "significant" in some statistical sense. This question seems to be relevant only when the data arise from some random sampling scheme, in other words when we assume that the data are a representative "image" of an underlying population. In fact this ideal situation, on which most conventional statistical inference is based, occurs relatively infrequently, and the data are more often than not collected in a deliberate non-random fashion. For this reason we prefer to consider the wider issue of stability of the results, which includes the conventional notions of statistical significance as a special case when the data are representative samples in the usual sense.

Internal and external stability

To introduce our study of stability, suppose that a 2-dimensional display of the rows and columns of a data matrix has been obtained by correspondence analysis. We shall call this display stable at two different levels: first at the level of the data matrix itself, and secondly at the level of the wider population (should the data be sampled from a population). In other words, if the rows, say, are indeed a random sample from a multivariate population, then the planar display is a partial view of "reality" at two levels. First, it is a partial view of the multidimensional scatter of the data points, being optimally orientated to reflect as much of the points' inertia as possible. Secondly, the data points are themselves a partial view of a theoretical geometric distribution of points in the population.

At the first level we say that the plane is internally stable (hence also the scatter of points projected onto the plane) if the plane's orientation is not determined by isolated features of the given sample. An example of internal instability is thus a single "outlying" data point which has caused the principal plane to swing around excessively in its direction, so that removing this point changes the plane's orientation quite dramatically.

At the second level we say that the plane is externally stable if its orientation is minimally altered by considering further samples from the same population. An example of external instability is thus a sample which is not large enough to characterize the population patterns with low variability, so that other samples of the same size lead to different principal planes. Notice that when the data are not collected by sampling, then we are only interested in the internal stability of a display.

In order to illustrate this distinction further, a parallel might be drawn with regression analysis. External stability of an estimate of a regression coefficient would mean that the estimate has low standard deviation and would not vary unduly if the study were repeated. Internal stability of the estimate would mean that no isolated elements of the sample itself have contributed excessively to the value of the point estimate, so that a quite different value might have been obtained in the absence of these elements. Thus a robust regression procedure should be internally stable. Our present situation, however, is more problematic in that the planar display as a whole is analogous to the estimate, that is it is an estimate of some theoretical plane, not a single scalar.

If the columns of a data matrix represent a set of preselected variables, it might be of additional interest to investigate the stability of the display with respect to omitting each of the variables in turn. This stability is characterized as internal because the variables are not a sample of a potential set. Notice that we do not want to remove attention from rows or columns of a data matrix which cause internal instability, but rather recognize the strong role they play in the display. If we could see the data vectors in their true high-dimensional positions, there would be no problem of internal stability.

Attempts have been made to structure multidimensional scaling as a statistical technique so that confidence regions on the displayed points can be derived (for example Ramsay, 1977, 1978). While being of interest within the particular data context (which is usually quite specialized) these rely on questionable assumptions and introduce a whole new spectrum of complication into the analysis, owing to the large numbers of parameters that have to be estimated. As an alternative, we shall suggest a non-parametric approach using the ideas of jackknifing and bootstrapping. These have wide applicability, but admittedly lack the mathematical rigour of the traditional statistical approach. For this reason we prefer the more physical term stability as opposed to the statistical term confidence. An investigation of the stability of a configuration of points often suggests very strongly that there is a "statistically significant" pattern in the data, which can then be confirmed by formal analysis (if possible!). We are thus trying to push our "pattern-recognizing" exploratory analysis as far as possible in the direction of confirmation of the patterns without having to assume a specific mathematical framework for the data.

Jackknifing and bootstrapping

Jackknifing (reviewed by Miller, 1974) and bootstrapping (Efron, 1979) provide convenient frameworks for investigating the internal and external stabilities respectively of planar displays. Usually they are used to investigate the variability of statistics calculated on sampled data, what we would term the external stability of these statistics. In our context we are firstly concerned with the internal stability of the display with respect to a set of given entities, be they rows or columns of the data matrix. We shall use the term jackknifing to mean the deletion of each one of these entities in turn, followed by an assessment of how seriously this affects the display of points or orientation of the plane. Bootstrapping, on the other hand, suggests generating a large
number of additional "data matrices" by resampling, with replacement, from the sampled entities in the data, with the assumption that the original sample is the best available representation of the underlying population. The changes induced by such resampling will give us a fair indication of the external stability of the display. Clearly, if we could resample from the parent population itself we would obtain a more correct picture of the external stability. However, we cannot do this and Efron's idea is to "pull oneself up by one's own bootstraps" and resample from the sample itself.

Thus if the rows of an $I \times J$ data matrix represent a sample of people, we could obtain as many bootstrapped data matrices as we liked by resampling from the set of rows. Because a bootstrapped sample is the same size as the original sample, it is inevitable that it contains some rows (people) more than once, others not at all. If we think of each row being assigned a mass $w_i$ ($i = 1 \ldots I$), then the original rows have masses all equal to $1/I$, whereas in each bootstrapped sample the masses are multiples of $1/I$, and zero for a row which is not included in the sample. Thus bootstrapping merely redistributes the unit of mass amongst the rows in discrete units of $1/I$. If we generalize this idea to the redistribution of the mass in continuous amounts, then the jackknife is subsumed as a special case of the bootstrap. Jackknifing (the rows) involves $I$ repetitions of the analysis, omitting each row in turn from the analysis. These are just $I$ particular bootstraps, where masses $1/(I-1)$ are assigned to the $(I-1)$ included rows and 0 to the excluded row. This emphasizes the close relationship of our investigations of internal and external stability, the former being a special case of the latter. The jackknifing of the display, however, is so much simpler that it can be described analytically to a certain extent and thus merits special consideration.

Very often a study involves a substantial number of sampling units which are important only in reflecting information about groups. Considerations of internal and external stability remain essentially the same, but the emphasis is now on the stability of the display of the group summaries, usually the group centroids or the geometric scatter of the groups. In the correspondence analysis of a contingency table the rows and columns are both preselected sets of attributes and the sample is condensed into the cells of the table. Here we are usually most interested in the stability, usually external, of the points representing the rows and the columns. Of course in both of the above situations, if the number of sampling units is fairly low, or if the data do not arise from random sampling, then a specific investigation of the internal stability (of the points and the axes) would be relevant.

Effect of individual points on the principal axes (jackknifing)

Let us illustrate the jackknifing approach by considering the internal stability of a 2-dimensional display of a set of $I$ points with masses $w_i$, $i = 1 \ldots I$, where $f_{ik}$ denotes the principal co-ordinate of the $i$th point on the $k$th principal axis and $\lambda_k \equiv \mu_k^2$ denotes the $k$th principal inertia, $k = 1 \ldots K$. Rather than repeat the analysis $I$ times in order to recompute the principal plane each time a point is omitted, the stability can be judged by inspection of the table of contributions (i.e. decomposition of inertia, Section 4.1.11). If a row point $i = s$ is removed, the display will have a new centroid and the best-fitting plane will have a new orientation since it is no longer attracted to the position held by $s$. Depending on the sizes of the contributions $w_s f_{s1}^2$ and $w_s f_{s2}^2$, one of the following four situations can result:

(1) The plane and the axes remain stable.
(2) The plane remains stable and axes 1 and 2 change orientation, often interchanging.
(3) The plane remains partially stable; for example, axis 1 remains stable but axis 2 rotates out of the plane, often interchanging with axis 3.
(4) The plane is unstable and assumes a completely new orientation.

These are clearly extreme situations and one of many intermediate outcomes is possible.

So far we have avoided a definition of stability, but clearly a quantifiable measure of the change of the principal axis is the angle $\phi$ through which it rotates. Escofier and Le Roux (1976) remark that if $\phi$ is less than 45° then the old and the new axes are closer to each other than to any other axes. Hence we shall label a rotation of more than 45° instability, with stability increasing monotonically from 45° to 0°. The same rule can apply to the plane in terms of the planar rotation, which is defined as the maximum angle between a vector in one of the planes and its orthogonal projection onto the other plane. In both cases maximum stability is reached at 0° ($\cos^2\phi = 1$) and maximum instability at 90° ($\cos^2\phi = 0$), with a borderline at 45° ($\cos^2\phi = \tfrac{1}{2}$).

Apart from the trivial situation where $s$ lies exactly at the centroid, the simplest case is where $s$ lies exactly along a principal axis, say the first (Fig. 8.1(a)): that is $f_{sk} = 0$ for $k \neq 1$. Removal of $s$ causes the centroid of the points to translate away from $s$ to a position at $c = -\{w_s/(1-w_s)\}f_{s1}$ on the first axis. The mass of each of the remaining points is scaled up by $1/(1-w_s)$ and the moments of inertia along the old principal axes are respectively:

$$\sum_{i\neq s}\{w_i/(1-w_s)\}(f_{i1}-c)^2 = \bigl[\lambda_1 - \{w_s/(1-w_s)\}f_{s1}^2\bigr]/(1-w_s) \qquad (8.1.1)$$

$$\sum_{i\neq s}\{w_i/(1-w_s)\}f_{ik}^2 = \lambda_k/(1-w_s), \qquad k = 2, 3, \ldots \qquad (8.1.2)$$

Clearly if the amount $\{w_s/(1-w_s)\}f_{s1}^2$ is large enough, a new ordering of the principal inertias may result, with some axes "shifting up" in rank. As long as
the value of (8.1.1) does not equal one of the values of (8.1.2), this is the only possible change since no new orientations of axes are possible. A similar argument applies to any principal axis on which point $s$ lies exactly. Also if $s$ lies exactly in the principal plane (Fig. 8.1(b)), say, then its removal causes a reduction in the first two principal inertias of the form (8.1.1) while the remaining principal inertias are simply rescaled as in (8.1.2). A number of new situations can result; for example, if the moments of inertia along the (old) axes defining the principal plane are still the largest and if neither $f_{s1}$ nor $f_{s2}$ is zero, then a rotation of the first two principal axes takes place in the plane.

FIG. 8.1. Example of a point lying (a) on a principal axis; (b) in the principal plane; (c) off the principal plane.

Of course these simple cases never occur in practice, but serve to illustrate the jackknifing idea. In general, we have a point $s$ which lies in multidimensional space (Fig. 8.1(c)), the removal of which causes a translation of the centroid away from $s$ by a vector:

$$\mathbf{t} = -\{w_s/(1-w_s)\}\mathbf{f}_s \qquad (8.1.3)$$

as well as a re-orientation of all the principal axes, including the principal plane. If $f_{s1} = 0$, say, then as before the only change that can take place is in the space orthogonal to the first principal axis, and the question would be how much the second principal axis rotates into the subspace of the remaining axes. The moments of inertia along all the previous axes are of the form (8.1.1) and thus if:

$$\lambda_2 - \{w_s/(1-w_s)\}f_{s2}^2 < \lambda_3 - \{w_s/(1-w_s)\}f_{s3}^2 \qquad (8.1.4)$$

then rotation of the second axis out of the plane is already greater than 45° in the direction of the previous third principal axis. In other words the new second principal axis will be more aligned with the previous third axis than with the second. If we express the negation of (8.1.4) in a slightly more relaxed form by dropping the term $\{w_s/(1-w_s)\}f_{s3}^2$, then we have the following condition which is sufficient for the second principal axis to be "stable" (i.e. $\phi < 45°$):

$$\lambda_2 - \{w_s/(1-w_s)\}f_{s2}^2 > \lambda_3 \qquad (8.1.5)$$

This condition provides a quick and easy method of checking on the stability of the axes, but notice that our argument assumes that the "higher" principal axes are negligibly rotated by removing $s$ (hence our assumption above that $f_{s1} = 0$). Notice too that we ignore effects due to the possible change in metric induced by removing $s$. In correspondence analysis the effect on the metric of removing a row or column might be fairly substantial (see (8.1.14) below).

If (8.1.5) is satisfied, Escofier and Le Roux (1976) give a set of upper bounds for the rotation angle $\phi$ of the $k$th principal axis when $s$ is removed. These upper bounds are only approximate when the subspace of the first $(k-1)$ principal axes is itself rotated when $s$ is removed. We define a parameter:

$$h = \{w_s/(1-w_s)\}(f_{sk}^2 + f_{s,k+1}^2 + \ldots)/(\lambda_k - \lambda_{k+1}) \qquad (8.1.6)$$

namely that part of the inertia of point $s$ which lies in the subspace of the principal axes $k, k+1, \ldots$, relative to the difference between $\lambda_k$ and $\lambda_{k+1}$, and adjusted for the new centroid by dividing by $(1-w_s)$. Another quantity of importance is the "relative contribution" (cf. Sections 3.3 and 4.1.11):

$$\cos^2\theta_{sk} = w_s f_{sk}^2/\mathrm{in}(s) = f_{sk}^2\Big/\sum\nolimits_{k'} f_{sk'}^2 \qquad (8.1.7)$$

that is the fraction of the inertia of the point $s$ which lies along the $k$th principal axis, which is also the squared cosine of the angle $\theta_{sk}$ subtended by the point vector $s$ and the $k$th principal axis. The simplest upper bound for $\phi$ is then:

$$\sin 2\phi \leq h \qquad (8.1.8)$$

while more refined upper bounds depend on $h$ and the angle $\theta_{sk}$:

$$\text{if } h \geq 1: \quad \tan 2\phi \leq h\sin 2\theta_{sk}/(1 - h\cos^2\theta_{sk}) \qquad (8.1.9)$$

$$\text{if } h < 1: \quad \tan 2\phi \leq h\sin 2\theta_{sk}/(1 - h\cos 2\theta_{sk}) \qquad (8.1.10)$$

If the difference $\lambda_1 - \lambda_2$ is quite small relative to $\lambda_2 - \lambda_3$, say, then we might find instability of axis 1 but a high stability of the principal plane. (Again for subscripts 1, 2 and 3 we can substitute $k$, $k+1$ and $k+2$.) The above
conditions can be generalized to investigating the stability of this plane by adding the contributions in the formulae. Thus if:

$$\lambda_2 - \{w_s/(1-w_s)\}(f_{s1}^2 + f_{s2}^2) > \lambda_3 \qquad (8.1.11)$$

then the moment of inertia of the cloud of points without $s$ along any line in the plane is higher than along any of the axes 3, 4, ..., that is the plane will not rotate through $\phi$ greater than 45°. Upper bound formulae (8.1.8)-(8.1.10) are the same, but we define:

$$h = \{w_s/(1-w_s)\}(f_{s2}^2 + f_{s3}^2 + \ldots)/(\lambda_2 - \lambda_3) \qquad (8.1.12)$$

and use the angle $\theta_{s,12}$ that $s$ makes with the principal plane:

$$\cos^2\theta_{s,12} = \cos^2\theta_{s1} + \cos^2\theta_{s2} \qquad (8.1.13)$$

Further detailed discussion of these formulae can be found in Escofier and Le Roux (1976), as well as a discussion of the effect introduced by modification of the metric in the case of correspondence analysis. This effect is also expressed as an upper bound on the angle of rotation $\phi$ of the $k$th principal axis in the form:

$$\sin 2\phi \leq \lambda_k(\alpha - \beta)/\{\varepsilon(1+\beta)\} \qquad (8.1.14)$$

where, in the usual notation of correspondence analysis and assuming that the $s$th row of the correspondence matrix $\mathbf{P}$ is removed:

$$\alpha = \max\{p_{sj}/(c_j - p_{sj});\ j = 1 \ldots J\}$$

$$\beta = \min\{p_{sj}/(c_j - p_{sj});\ j = 1 \ldots J\}$$

$$\varepsilon = \min\{\lambda_{k-1}-\lambda_k,\ \lambda_k-\lambda_{k+1}\}$$

If $\lambda_{k+1}$ is close to $\lambda_k$, so that it is a question of the effect of the metric on the plane of the $k$th and $(k+1)$th principal axes, the same authors give the corresponding result for the angle of rotation of the plane. This is in the same form as (8.1.14), with the only difference that:

$$\varepsilon = \min\{\lambda_{k-1}-\lambda_k,\ \tfrac{1}{2}(\lambda_{k+1}-\lambda_{k+2})\}$$

Examples of the use of these results are given in Section 9.6.

Bootstrapping of the sample to assess external stability

If the data are based on some sampling scheme, we can consider the external stability of the low-dimensional display. Although we have described bootstrapping of the given sample notionally as a generalization of the above jackknifing procedure, it will no longer be possible to derive similar results to bound the angle $\phi$ over the set of replicated data matrices obtained by resampling. Instead we shall resort to the computer, following Efron and Gong (1981) who "show by example how a simple idea combined with massive computation can solve a problem that is hopelessly beyond traditional theoretical solutions".

Our proposal is that a random sample of the set of possible bootstrapped matrices be drawn, and then their geometry be related to that of the original data matrix. Various scalar statistics can be computed, for example the angle between corresponding principal axes in the original analysis and each replicated one, eventually leading to confidence intervals in the usual style of the bootstrap. An alternative strategy, which is less heavy computationally and which yields highly satisfactory results, is to fix the original plane, say, as the viewing plane for the replications. The replicated row and/or column points are then projected onto this plane in order to explore the stability of the points themselves as well as, indirectly, the stability of the original plane. If the plane representing the original points is unstable, the replicated points will have large variability in the full space as well as on the viewing plane, and vice versa.

Most of our discussion above applies to a wide class of multidimensional techniques. As an initial illustration of our proposed strategy in the context of a correspondence analysis, let us return to our simple 5 x 4 data matrix of Table 3.1, a contingency table involving a sample of 193 cases. The analysis of this matrix is given by Fig. 3.3 and Table 3.6. By drawing a random sample, with replacement, from these 193 cases we obtain a replicated contingency table. (This is equivalent to drawing a random sample of 193 cases from the multinomial distribution defined by the 20 cells of the original correspondence matrix.) The replicated row and column profiles are then projected onto the respective principal planes, say, of the original profiles, using the relevant transition formulae of (4.1.16). For example, if $\mathbf{D}_{r^*}^{-1}\mathbf{P}^*$ is a replicate set of row profiles (as row vectors), then $\mathbf{F}^* = \mathbf{D}_{r^*}^{-1}\mathbf{P}^*\mathbf{G}\mathbf{D}_\mu^{-1}$ is the matrix of projected co-ordinates. Figure 8.2 shows the replicate profiles and the convex hulls of each set, as they appear on the original principal plane. All the convex hulls in Fig. 8.2(a) intersect, while in Fig. 8.2(b) the convex hull of the set of first column profiles does not intersect those of the third and fourth columns. This would suggest that evidence of association between the rows and columns is not strong; in fact the $\chi^2$ statistic for independence in Table 3.1 may be calculated as $\chi^2(12) = 16.4$, which is not significant ($P > 0.10$). Figure 8.2(b) does suggest, however, that if there is a significant difference to be found, it is along the first principal axis (the horizontal axis) which separates the first column from the remaining columns.

Figure 8.3 is the analysis repeated on the same data matrix which has been multiplied by 2, as if twice the number of cases had been sampled and the same relative frequencies observed. The correspondence matrix is unchanged
FIG. 8.2. Replicated row and column profiles (displayed separately by (a) and (b) respectively), projected onto the principal planes of the original row and column profiles of Table 3.1. The convex hull of each set of replicates is indicated.

FIG. 8.3. As Fig. 8.2, except the data matrix of Table 3.1 has been multiplied by 2, as if there were twice the number of observations.
and there is no difference in the display of the original row and column profiles, but the replicated profiles are more compactly gathered about the original profiles. The convex hulls of rows 3 and 4 have separated in Fig. 8.3(a) while the convex hull of column 1 has separated from the remaining ones in Fig. 8.3(b). The association is now significant and this is observable in the separation of these profiles.

Further examples of the use of bootstrapping in correspondence analysis are given in Sections 9.2, 9.3 and 9.7.

… involved, it would make sense to perform some prior clustering of the cases and then investigate the stability with respect to omitting each cluster. Because the most outlying cluster of points is likely to cause the most instability, this would be another way of identifying outliers. In gradient …

… when a projective geometry exists in the vector space. For example, in multidimensional scaling of distance or similarity matrices the final result is a configuration of points, not a plane cutting through a higher-dimensional space of the points. Replicated distance matrices, say, can still be generated in terms of the underlying sampling scheme and these can be completely re-analysed to obtain new configurations of points, which can then be related to the original one by Procrustes analysis (Schonemann and Carroll, 1970). This is a translation, rotation and rescaling of the replicated configurations to "fit" the original one. Here the Procrustes fit statistic can be used to quantify the …

… configurations are displayed as such. In order to investigate the internal stability of a display obtained by multidimensional scaling, each replicated configuration of (I − 1) points can be fitted, as described above, to the corresponding (I − 1) points of the original configuration.
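The translation, rotation and rescaling fit just described can be sketched as follows. This is the standard orthogonal Procrustes solution with an optimal scale factor, a sketch rather than the exact formulation of Schonemann and Carroll (1970); the configurations are invented for illustration:

```python
import numpy as np

def procrustes_fit(X, Y):
    """Translate, rotate (or reflect) and rescale Y to best fit X;
    return the fitted configuration and the residual fit statistic."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    U, s, Vt = np.linalg.svd(Yc.T @ Xc)      # SVD of the cross-product matrix
    R = U @ Vt                               # optimal rotation of Yc onto Xc
    scale = s.sum() / (Yc ** 2).sum()        # optimal rescaling factor
    Y_fit = scale * Yc @ R + X.mean(axis=0)  # fitted replicate configuration
    rss = ((X - Y_fit) ** 2).sum()           # Procrustes fit statistic
    return Y_fit, rss

# A replicate that is an exact rotation/scaling/shift of the original
# configuration should be fitted with zero residual.
rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))                  # "original" configuration
theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
Y = 2.5 * X @ rot + np.array([1.0, -3.0])    # "replicated" configuration
Y_fit, rss = procrustes_fit(X, Y)
```

Applied to genuinely resampled configurations, the residual `rss` plays the role of the fit statistic that quantifies instability of the replicated displays.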
…mation to the multinomial distribution of the IJ elements of the contingency table as their total n increases. Thus, in the simplest case when independence does in fact hold, that is the theoretical canonical correlations are zero, the asymptotic distribution of the components can be shown to be the same as the distribution of the eigenvalues of a central Wishart matrix variate W_{J−1}(I−1), that is of order J−1 and with I−1 degrees of freedom, assuming again that J ≤ I (this result was first given by Lebart (1975, 1976) and Corsten (1976); for the definition of the central Wishart distribution, see Anderson, 1958). The critical points for certain eigenvalues are tabled, for example, by Clemm et al. (1973), who give P = 0.10, 0.05 and 0.01 points of the largest and smallest eigenvalues of Wishart matrices up to an order of 20. In other words these tables can be used for I × J contingency tables where min{I, J} is not greater than 21. Table 51 of Pearson and Hartley (1972) gives P = 0.05 and 0.01 points only for the same statistics for matrices up to an order of 10, and also interpolation formulae for P = 0.10, 0.025 and 0.005. Lebart (1975) gives approximate tables at significance level P = 0.05, based on Monte Carlo studies, for the first five eigenvalues of a contingency table of order I × J not exceeding 100 × 50, as well as for the associated percentages of inertia (we discuss the testing of the percentages of inertia at the end of this section). O'Neill (1981) also gives the means and variances/covariances of all the eigenvalues of Wishart matrices up to an order of 4.

The asymptotic theory when dependence exists is given by O'Neill (1978a, 1978b, 1980, 1981), and the following is a summary of his results. When K* of the true canonical correlations are non-zero and distinct, so that the remaining K − K* are zeros, we know that the bivariate probabilities (elements p̃_ij of the true correspondence matrix) can be expressed as the decomposition:

p̃_ij = r̃_i c̃_j (1 + Σ_{k=1}^{K*} (λ̃_k)^{1/2} φ̃_ik γ̃_jk)    (8.1.15)

that is the reconstitution formula (4.1.27) expressed in terms of the theoretical standard co-ordinates. Here the notation ˜ indicates theoretical values. The cases k > K* and k ≤ K* are treated separately. The components nλ̂_k, k = K*+1, …, K, are distributed asymptotically as the roots of a central Wishart W_{K−K*}(I − J + K − K*) only if the following condition is satisfied:

… standard co-ordinate vectors φ̃_(k) and γ̃_(k) (the kth columns of Φ̃ and Γ̃ respectively). Formulae for the 2nd-order moments are given by O'Neill (1980, 1981) and are cited in Example 8.8.1, as well as an illustration of their application to a small contingency table. Notice that, unlike the case of normally distributed data, the sample canonical correlations are not asymptotically uncorrelated.

In the case of Table 3.1 we can test the significance of the first principal inertia λ̂₁ = 0.0748 by comparing the χ² component nλ̂₁ = 193(0.0748) = 14.4 with the upper percentage points of the largest eigenvalue of a (J−1) = 3 variate Wishart matrix with (I−1) = 4 degrees of freedom. The critical value at P = 0.05 is 15.24, so that the significance level is not less than 0.05. By contrast the (incorrect) conjecture of Kendall and Stuart (1961) that nρ̂₁² be asymptotically distributed as χ²(5) would yield the optimistic significance level of P < 0.01.

Another test, proposed by Bock (1960) and used extensively by Nishisato (1980), is derived from Bartlett's χ² approximation to the likelihood ratio test for testing canonical correlations when at least one set is multinormal (Bartlett, 1951), although the above authors gloss over this condition. Under the independence model, the test statistic for the kth canonical correlation is −{n − 1 − ½(I + J − 1)} log_e(1 − ρ̂_k²), which is compared to the χ²(ν) distribution with ν = I + J − (2k + 1). In our example above with k = 1, ρ̂₁² = 0.0748, the statistic is evaluated as 14.6 and, using the critical points of the χ²(6) distribution, would be judged significant at P < 0.025, another over-optimistic result.

In practice we would personally interpret these significance levels, including those provided by the correct asymptotic theory, with due caution, especially when on the borderline of conventional "significance", for example P ≈ 0.05. We prefer to note the groupings and separations of the rows and columns of the data matrix, which can then lead to formal hypothesis testing, if necessary. For example, Fig. 8.2(b) suggests that columns 2, 3 and 4 be grouped; the χ² statistic on the resultant 5 × 2 matrix is then computed as χ²(4) = 13.9, which is significant at P < 0.01. In the context of the example, this grouping is a sensible one in that it brings together all the "smokers" into a single category (see also the discussion in Section 9.3).
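The Bartlett-type statistic of this worked example is easy to reproduce; a sketch, using the closed-form χ² survival function for even degrees of freedom in place of statistical tables:

```python
import math

def bartlett_statistic(n, I, J, rho2, k=1):
    """Bartlett's approximate chi-squared statistic for the kth canonical
    correlation, with degrees of freedom nu = I + J - (2k + 1)."""
    stat = -(n - 1 - 0.5 * (I + J - 1)) * math.log(1.0 - rho2)
    return stat, I + J - (2 * k + 1)

def chi2_sf(x, df):
    """Survival function of the chi-squared distribution (even df only)."""
    assert df % 2 == 0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= (x / 2) / i
        total += term
    return math.exp(-x / 2) * total

# Worked example from the text: Table 3.1 with n = 193, I = 5, J = 4 and
# first squared canonical correlation 0.0748.
stat, dof = bartlett_statistic(n=193, I=5, J=4, rho2=0.0748, k=1)
p_value = chi2_sf(stat, dof)
```

With these values the statistic evaluates to about 14.6 on 6 degrees of freedom, reproducing the over-optimistic significance level below P = 0.025 quoted in the text.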
does not reject independence of the rows and columns of a contingency table, the major percentages of inertia might still be significantly high. Conversely, when the hypothesis of independence is rejected, it may be that the percentages of inertia are not significant, implying that correspondence analysis is a poor "model" of the row-column dependence.

Lebart et al. (1977) give curves which serve as approximate critical points (P = 0.05) for the five largest percentages of inertia under the null hypothesis of independence.

8.2 REWEIGHTING AND FOCUSING

The masses assigned to the row and column profiles are a distinctive feature of correspondence analysis. When the data are in the form of a contingency table the masses, proportional to the row and column sums of the table, are the "natural" ones to use in this context. In general data analysis, however, the masses can be varied in order to explore different features of the multidimensional point clouds.

We have already come across the concept of reweighting the points, that is allocating new masses to them, in various situations. In Chapter 5 we discussed the inertias of discrete variables and showed (in Example 5.5.4) how these could be modified in a very simple fashion by pre-transforming the indicator matrix so as to assign different masses to each subset of columns. In Chapter 7 we discussed the analysis of a cloud of points and a set of subclouds (i.e. a partition of the points) where we investigated differences between the subclouds by removing the masses from the individual points and concentrating these into their respective subcloud's centroid. In this section we wish to develop these ideas a little further and also introduce a general strategy in correspondence analysis which we call focusing.

Reweighting to obtain prescribed masses

In many situations data are collected in a deliberate and somewhat arbitrary manner, not according to some elegant sampling scheme. We can attempt to correct the imbalances and poor representativeness of a data set by reweighting the set of points which is of primary interest.

The application of Section 9.8 is a good example of such data, which consist of frequencies of antelope tribes in a set of African wildlife regions (Table 9.13). Here it is decided to reweight the regions so that their masses are proportional to their respective total antelope frequencies per unit area. This is done quite simply by dividing all the frequencies in the matrix by the surface area of the respective region, prior to the correspondence analysis.

This attempt at "correcting" the geometry of the row profiles is dually justifiable in the geometry of the column profiles, as discussed in Section 9.8.2.

In the application of Section 9.9, involving gene frequencies in a set of human populations, there is again some arbitrariness in the choice of populations. This can be partially eliminated by a prior down-weighting of samples from the same or similar populations, as described in Section 9.9.4.

A number of reweighting schemes often suggest themselves, in which case it is advisable to try them all out and observe which features in the resultant graphical displays are stable across the different analyses. In our experience the first few principal axes are often quite stable with respect to reweighting. If a certain principal axis is observed to be quite different when a new set of masses is assigned, then this would strongly indicate that the interpretation should proceed with great caution (cf. Sections 9.8.4 and 9.9.4).

Reweighting of measurement data

The application of correspondence analysis is readily extended to data which are measurements of positive physical quantities like rainfall, height, chlorine content, etc. (These are often called "ratio" variables because they have a "zero-point" (origin) which is relevant to the study, so that relative values are important, as opposed to "interval" variables, like time and temperature, where differences are important.) Here we meet two situations, first where a group of variables is at least measured in the same units (homogeneous measurements), secondly where the variables are in different units (heterogeneous measurements).

In the case of homogeneous measurements, correspondence analysis can often be applied to the raw data, as in the application of Section 9.10 where the data are measurements on fossil skulls. Notice that the analysis displays differences in the shapes (i.e. profiles) of the skulls, while the total of a skull's measurements, interpreted as a type of "size" quantification of the skull, is absorbed into the mass of the skull profile.

When the measurements are heterogeneous we are faced with the question of assigning comparable units to the variables, which in correspondence analysis is a problem of reweighting the variables (cf. our previous discussion which was more concerned with reweighting the observational units or "subjects", e.g. the wildlife regions, human populations). Ideally this is a question to be dealt with by the specialist involved in the study, who has experience in the accuracy of his measurements and their particular levels of "significance" (Benzécri, 1977c). Thus a pollution expert might put on a comparable basis s_j parts per million of chemical j and s_j′ kg of industrial waste j′ from a certain factory, in which case we would reweight the raw data for variables j and j′ by dividing their respective values by s_j and s_j′. Because of the inevitable degree of arbitrariness in such a procedure we would again suggest that different schemes be attempted to reveal stable features in the resulting displays which do not depend on the particular masses assigned to the variables. If a range for each s_j can be adopted then random weights within these ranges can be assigned a number of times and a picture of the uncertainty induced by the weighting scheme can be built up, in much the same style as the bootstrapping of Section 8.1.

Reweighting to obtain prescribed inertias

In the absence of a prescribed system of masses for the possibly unrepresentative or heterogeneous rows and columns of a data matrix, as described above, we might derive the masses so that the inertias of the displayed points have certain relative values. The only analogy in conventional statistical analysis is the standardization of heterogeneous variables to have unit variance. In the present situation we might want to equalize the inertias of each point or equalize the inertias of groups of points. Thus, if a study involves L groups of variables, where each group is internally homogeneous, but where variables of different groups are heterogeneous, then we could reweight each group so that the inertia of each group is equal, or proportional to some prescribed value. In this way heterogeneous groups of variables can be standardized against each other while conserving the common measurement unit within each subset.

Unfortunately the derivation of such masses cannot be achieved in a closed form solution, except in special cases like the multivariate indicator matrix (Example 5.5.4), where the different groups of points have a common centroid. In general the centroid of the display changes with every new set of masses and an iterative procedure is required to solve the problem. Benzécri (1977c) gives a detailed description of the problem and its solution, while a companion paper by Hamrouni and Grelet (1977) describes a computer program to perform the calculations.

Focusing on points

Focusing is a term which we use to describe the general process of reweighting a cloud of points to satisfy certain objectives. In the above discussion, for example, our view of a set of heterogeneous measurements is incorrectly focused if certain variables are more influential merely because they have higher values, so we try to focus more equitably on all the variables by reweighting them. When points are grouped we have shown in Chapter 7 how to shift the focus from between-point differences to between-group differences by transferring the mass of the points to their respective group centroids.

In a large survey there are usually many ways of partitioning the data both row- and column-wise. For example, the respondents may be categorized according to a number of demographic variables (age, sex, income group, etc.) which we call the secondary (or background) data. The primary (or foreground) data, often expressions of opinion, are used to represent each individual as a point in multidimensional space. Each secondary variable defines a partition of the respondents (usually the rows of the data matrix) and thus a set of group centroids. Combinations of secondary variables provide finer partitions of the rows and centroids of smaller subgroups, for example the average response of young males in the highest income group. The choice of how finely we wish to partition the data will depend on how much data are available and the depth of interpretation we want to pursue.

All of these centroids exist as supplementary points in the space of the row profiles and can be projected onto any principal subspace of the row profiles. In this way a contrast of opinion along an axis may be found to associate quite strongly with income, say, by observing that the supplementary points representing the income group centroids line up in their natural order along this axis. Notice that if the rows and columns are in standard and principal co-ordinates respectively then the centroids occupy the same positions as the columns of the indicator matrix which expresses the partition, where these columns are supplementary points in the column space of the primary data (cf. Table 5.2 and Example 5.5.1(c)).

We can focus on specific between-group differences at any chosen level of partitioning by analysing the respective set of group centroids. Computationally this is done by analysing the primary data matrix condensed, by simple addition of the respective rows, into the groups of interest, as described in Section 7.1. This leads to the mass of the group being the sum of the masses of its members. This is usually desirable unless certain groups are incorrectly represented in the sample of respondents, in which case the masses may be modified before analysis according to external information.

All of the above applies in a symmetric fashion to the partitioning of the columns of the primary data matrix. In a survey of the consumption of beverages, for example, respondents give a detailed account of the various alcoholic and non-alcoholic drinks they consume in a typical week. These may be partitioned (i.e. condensed) in various ways depending on the level of detail required in the exploration of the data.

This strategy provides the data analyst with a very flexible technique for focusing attention on different multidimensional features of the point cloud. In any single analysis the principal subspaces are determined solely by the points with non-zero mass, while all other points can be projected onto the subspaces to enhance the interpretation of the display.
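The condensing step described above is just a row aggregation carried out before the analysis; a sketch, with an invented primary data matrix and one secondary variable:

```python
import numpy as np

# "Condensing" the primary data matrix (Section 7.1): rows belonging to the
# same group are simply added together, so that each group's mass becomes the
# sum of its members' masses.  Data and group labels are invented.
X = np.array([[2., 1., 0.],
              [1., 3., 1.],
              [0., 2., 4.],
              [1., 0., 2.]])                 # respondents x primary variables
groups = np.array([0, 0, 1, 1])              # secondary variable, e.g. sex

condensed = np.array([X[groups == g].sum(axis=0) for g in np.unique(groups)])

# Masses before and after condensing: the group mass is the sum of the
# members' masses.
row_mass = X.sum(axis=1) / X.sum()
group_mass = condensed.sum(axis=1) / condensed.sum()
```

A correspondence analysis of `condensed` then focuses on between-group differences, and the original rows of `X` can still be projected onto its axes as supplementary points.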
If the original row profiles of a matrix are points in a K-dimensional space and if we focus on the centroids of H groups of points, then we are actually restricting our investigation to an (H − 1)-dimensional subspace. Each original point can be expressed as the sum of two components, one in this centroid subspace and the other in the (K − H + 1)-dimensional subspace orthogonal to it. The latter subspace is often called the orthogonal complement, complementing in this case the centroid subspace.

There may be occasions when we deliberately focus on a subspace of points in order to investigate the dispersion of points in the orthogonal complement of this subspace. For example, in an opinion survey we might be interested in the dispersion of opinion which is in some sense unassociated with the sex of the respondent. If we gradually transfer mass from the points to the two group centroids representing males and females, then the first principal axis will eventually be forced to lie almost exactly through the centroids. The idea is to leave some mass with the original points so that the remaining principal axes can still be determined. Dispersion of the respondents along these axes will be orthogonal to the male-female difference vector. The same result can be achieved by orthogonalizing the centered profiles with respect to this vector when calculating principal axes, but the advantage of focusing is that the computations are a standard application of correspondence analysis to the reweighted data. A further advantage is that we can easily focus on subspaces of points in a similar fashion. For example, if there are four income groups then we can focus (almost exactly) on the subspace of the four income group centroids, which might be 1-, 2- or 3-dimensional, and then study the dispersion in the orthogonal complement of the subspace which is unassociated with income differences.

In Section 9.5 we discuss a set of examination marks and focus the first principal axis on the vector of total marks of the students so that information which is uncorrelated with the total mark can be displayed and interpreted. A number of practical issues are treated in this application, for example how the focusing affects the calculation of inertias and percentages of inertia.

8.3 HORSESHOE EFFECT

Even though the principal axes are orthogonal to each other in a linear sense they can still be approximately related to each other in a non-linear sense. The prime example of this phenomenon is the so-called horseshoe effect (e.g. Kendall, 1971), alternatively known as the arch (e.g. Gauch, 1982) or Guttman effect (Benzécri, 1973), which is often observed in the displays resulting from correspondence analysis as well as other multidimensional scaling techniques.

FIG. 8.4. 2-dimensional correspondence analysis of Table 8.1, showing the horseshoe pattern of the rows and the columns.

Figure 8.4 shows a typical example when the ecological presence/absence data of Table 8.1 are analysed by correspondence analysis. In each display the positions of the sites describe a parabolic-shaped curve, hence the term arch or horseshoe. In Fig. 8.4 the set of species also describes a horseshoe pattern. The third principal axis reveals yet a further non-linear relationship and the points plotted with respect to axes 1 and 3 describe a cubic-shaped curve (Fig. 8.5).

Before attempting to explain this phenomenon let us first observe that if the rows and columns of the data matrix are re-ordered as they appear on the first principal axis of Fig. 8.4, then a diagonal band of positive mass is revealed (Table 8.2). This conforms to a model of a single underlying gradient which simultaneously orders the sites and the species in this way. For example, if we suppose that this gradient is rainfall, sites 3 and 2 at one end of the horseshoe might be in low rainfall areas, characterized by species o, c, a, … which prefer drier conditions, while sites 9 and 7 at the other end
TABLE 8.1. Presence/absence of 14 species at 10 sites.

TABLE 8.2. The data of Table 8.1 with rows and columns re-ordered as they appear on the first principal axis (species in the order o, c, a, g, i, f, l, m, b, n, d, h, k, e), revealing a diagonal band of 1s.
different species (l, e, k, …). In the present case where the data are either 0s or 1s, a matrix in the form of Table 8.2 is called a Petrie matrix, after the archaeologist Flinders Petrie (cf. Kendall, 1971). When the rows and the columns of a data matrix can be permuted to obtain a Petrie matrix, it is easy to show that the first principal axis of correspondence analysis will provide that correct ordering. This can be proved for a more general Petrie matrix of non-negative numbers where from row to row the mass in the row profiles shifts monotonically, say from the left to the right. Equivalently, this is a matrix where the mass in the column profiles shifts monotonically from top to bottom as we move across the columns.

FIG. 8.5. The points of Fig. 8.4 plotted with respect to principal axes 1 and 3, showing the cubic-shaped curve (axis 1: λ₁ = 0.7887, 48.3% of inertia).

The existence of the horseshoe is more easily visualized in the case of correspondence analysis, where the row and column profiles are represented in barycentric co-ordinates. For example, if there are just 3 columns, the row profiles all lie within a triangle whose apices represent profiles which are concentrated entirely in one of the columns (cf. Section 3.4). Thus, as we move down the rows of the matrix depicted in Fig. 8.6(a), the row profiles trace a path which starts near the apex representing the first column, curves around as it passes that of the second column and finally heads towards the third column (Fig. 8.6(b)). The row profiles of a similarly patterned matrix with 4 …
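The claim that the first principal axis recovers the ordering of a Petrie matrix is easy to check numerically; a sketch, with a small invented 0/1 band matrix whose rows are deliberately scrambled:

```python
import numpy as np

# A small Petrie matrix: a band of 1s shifting monotonically along the rows.
band = np.array([[1, 1, 1, 0, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [0, 0, 1, 1, 1, 0],
                 [0, 0, 0, 1, 1, 1],
                 [0, 0, 0, 0, 1, 1]], dtype=float)

perm = np.array([3, 0, 4, 1, 2])             # scramble the rows
N = band[perm]

# First principal axis of correspondence analysis (reciprocal averaging).
P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)
row_axis1 = U[:, 0] / np.sqrt(r)             # row standard co-ordinates, axis 1

order = np.argsort(row_axis1)                # rows sorted along the first axis
```

Sorting the scrambled rows by `row_axis1` restores the original band ordering (up to an overall reversal, since the sign of a principal axis is arbitrary).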
FIG. 8.6. (a) A matrix with a diagonal band pattern; (b) the path of its row profiles in barycentric co-ordinates.

… correctly ordered by a 1-dimensional correspondence analysis (reciprocal averaging). However, there is a gradual bunching up of site positions at the extremes where the horseshoe is steepest and this does not concord with their assumed regular spacing. Valiant attempts have been made to straighten out the horseshoe (Swan, 1970; Williamson, 1978; Hill and Gauch, 1980). The approach of Williamson (1978) concentrates specifically on the fact that the usual interpoint distance functions do not include knowledge of the chaining between distant pairs of sites that is so obvious from the Petrie matrix of Table 8.2. He thus proposes that the distances between unconnected (or almost unconnected) pairs of points, that is pairs of points at approximately the same maximum distance apart, be recomputed as the sum of the distances over a few intermediate linking points. This method, appropriately called "step-across", also includes the use of the so-called "city-block" distance function to evaluate distances between points which are linked. This concords with the requirement that the distances be additive along the single gradient. Effectively the distant pairs of points which previously had zero similarities are assigned negative similarities, these being increasingly more negative as the similarities between the linking sites decrease. This filters through as negative eigenvalues in the principal co-ordinates analysis which is used to
map the similarities, but these are usually too small in absolute value to be problematic and the horseshoe is largely eliminated in the principal subspace. A similar strategy in the context of multidimensional scaling is to place more emphasis on local structure (i.e. larger similarities, equivalently smaller distances) and treat all zero similarities as missing values in the process of fitting the data to a Euclidean representation (cf. Greenacre and Underhill, 1982).

The method called "detrended correspondence analysis" (Hill and Gauch, 1980; described also by Gauch, 1982) is attractive in that no specific functional form of the non-linearities is assumed. The iterative algorithm itself which performs the reciprocal averaging is modified to eliminate not only linear relationships with higher order axes, but also non-linear relationships of a fairly general nature. This process of "detrending" involves the set of species only (the applications being chiefly in community ecology), while the sites are the usual barycentres of the detrended species positions. In the process, however, control over the geometry is lost and it is possible that, just as the non-linearities can mask less important gradients, so the detrending might introduce further artifacts into the results. Our personal view is that correspondence analysis and other scaling techniques which rely on data on pairs of points are basically unsuitable for the quantitative identification of more than one such gradient in data of high diversity. Specific gradient models with relevance to the particular application (e.g. ecology) seem more appropriate here, yet these should be considerably more flexible than the usual artificial examples of Gaussian abundance curves.

Otherwise the horseshoe pattern often crops up in the results of a correspondence analysis without causing too much concern (see, for example, the graphical displays in Sections 9.1, 9.3, 9.4 and 9.7). Since we are in an exploratory framework we can see no reason not to interpret the positions of points along an approximate curve rather than on a straight line. Also, as in Fig. 9.7, the curved pattern of the points can enrich the interpretation when there are some points on the concave and/or convex sides of the curve.

It is of theoretical interest that for certain ideal examples of data in the form of a continuous two-way table, the functional form of the co-ordinates along principal axes of a correspondence analysis can actually be derived. Example 8.8.2 deals with an ideal Petrie matrix and derives the exact quadratic relationship between the first and second principal axes. We have not seen similar proofs in the context of any other scaling technique.

8.4 IMPOSING ADDITIONAL CONSTRAINTS ON THE DISPLAY

In its geometric form correspondence analysis does not usually require any constraints on the solution, since the points are customarily displayed in principal co-ordinates as the orthogonal projections of the fixed row and column profiles onto optimally fitting subspaces. (The principal axes themselves require identification conditions but we are usually more interested in the display itself.) In the other forms of the analysis, for example reciprocal averaging, the scale values (co-ordinates) are unidentified and identification conditions are imposed to obtain a unique solution. As illustrated in Table 4.2, two constraints on either set of scale values are sufficient to identify the solution. These are usually in the form of a mean centering and a variance standardization or the fixing of two scale values.

In certain circumstances the data analyst might require that the scale values satisfy another set of conditions which actually change the domain in which the optimal solution is to be sought (in optimization theory this domain is called the feasible region). In this section we shall briefly discuss two different types of constraints, both of which usually lead to different solutions to those obtained by a "standard" correspondence analysis.

Endpoint constraints in multiple correspondence analysis

Healy and Goldstein (1976) raise the question of identification constraints in the context of optimal scaling of the data of Table 8.3. In our terminology this is a Burt matrix of order J = 9, where 3 discrete variables (Q = 3) have 3 categories each (J_q = 3, q = 1 … Q). The object of these authors is to derive

TABLE 8.3. Matrix of frequencies of responses by 12232 mothers of 11-year-old children to three questions: (A) Does the child destroy its own or others' belongings? (B) Does the child fight with other children? (C) Is the child disobedient at home? There are 3 categories of response: (1) Never; (2) Sometimes; (3) Frequently (see Healy and Goldstein, 1976, Table 1).

          A1    A2   A3    B1    B2   B3    C1    C2   C3
    A1  11440    0    0  5923  5134  383  5957  5254  229
    A2      0  667    0   143   440   84   135   468   64
    A3      0    0  125    22    62   41    18    70   37
    B1   5923  143   22  6088     0    0  3896  2111   81
    B2   5134  440   62     0  5636    0  2084  3387  165
    B3    383   84   41     0     0  508   130   294   84
    C1   5957  135   18  3896  2084  130  6110     0    0
    C2   5254  468   70  2111  3387  294     0  5792    0
    C3    229   64   37    81   165   84     0     0  330
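Table 8.3 can be checked directly for the structure of a Burt matrix: it is symmetric, each diagonal block is itself diagonal (a variable cross-tabulated with itself), and every 3 × 3 block is a two-way table with the same total n = 12232. A sketch:

```python
import numpy as np

# Table 8.3 entered as given above.
B = np.array([
    [11440,   0,   0, 5923, 5134, 383, 5957, 5254, 229],
    [    0, 667,   0,  143,  440,  84,  135,  468,  64],
    [    0,   0, 125,   22,   62,  41,   18,   70,  37],
    [ 5923, 143,  22, 6088,    0,   0, 3896, 2111,  81],
    [ 5134, 440,  62,    0, 5636,   0, 2084, 3387, 165],
    [  383,  84,  41,    0,    0, 508,  130,  294,  84],
    [ 5957, 135,  18, 3896, 2084, 130, 6110,    0,   0],
    [ 5254, 468,  70, 2111, 3387, 294,    0, 5792,   0],
    [  229,  64,  37,   81,  165,  84,    0,    0, 330]], dtype=float)

blocks = [slice(0, 3), slice(3, 6), slice(6, 9)]     # variables A, B, C
symmetric = np.allclose(B, B.T)
diag_blocks_diagonal = all(np.allclose(B[p, p], np.diag(np.diag(B[p, p])))
                           for p in blocks)
block_totals = [B[p, q].sum() for p in blocks for q in blocks]
```

All nine block totals equal 12232, the number of mothers, as every block classifies the same respondents.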
scale values (which Healy and Goldstein (1976) call "attribute scores") for the 9 categories so that disagreement within subjects across their respective attribute scores and their (weighted) average score is minimized. This is the internal consistency definition of dual scaling defined in Section 4.3, and the option of weighting is a standard re-assignment of masses to the variables of the Burt matrix, equivalently to the variables (i.e. groups of columns) of the underlying indicator matrix. ... to 1, and their objective is to minimize SS_W (cf. (4.3.6) and (4.3.7)). It follows that their solution should be equivalent to the optimal 1-dimensional solution by correspondence analysis, that is the co-ordinates of the points on the first principal axis in Fig. 8.7. Table 8.4 shows how to pass from these co-ordinates to their attribute scores, each subset of which has been translated so that the lowest category "never" of each variable has score 0, followed by a uniform rescaling of all the scores so that the sum of the three highest category scores gives a total of 100.

[Table 8.4: Relationship between the first principal co-ordinates of the correspondence analysis of Table 8.3 and the "attribute scores", under quadratic constraints, of Healy and Goldstein (1976, Table 2). Surviving fragments show co-ordinates 0.418, -0.294, -1.726, -0.325, -2.132, ..., re-centred so that each variable's "never" category scores 0 (giving 0, 0.712, 2.144, 0.749, 2.556, ...) and rescaled to a maximum total score of 100 (final scores 0, 9.2, 27.6, 9.7, 33.0, ...).]

Unfortunately, these adjusted scores in the last column of Table 8.4 are no longer optimal, although they are perhaps more readily interpretable. This is because they have been obtained from the optimal co-ordinates by more than a mere overall centering and rescaling. It is clear from Fig. 8.7 that the re-centering of individual variables to have the same bottom endpoint must make the solution sub-optimal: the categories of A are displaced towards the left, away from the almost perfect association between variables B and C.

Next, Healy and Goldstein consider optimal solutions under linear constraints which fix the "lowest" and "highest" points of the scale. This is now a quadratic programming problem (e.g. see Walsh, 1975), which leads to a system of linear equations, not an eigenvalue/eigenvector problem as before. The subsequent optimal solution, transformed in the same way as in Table 8.4, is quite different from the one under the quadratic constraint. In this case we know that the optimal scaling of the 9 categories (under quadratic constraint) necessarily satisfies a number of additional properties, namely that each of the 3 subsets of categories is individually centered and standardized (cf. relationship to dual scaling in Sections 5.1 and 5.2), even though these have not been imposed a priori. The solution under the endpoint constraints does not (necessarily) satisfy similar conditions, even under a simple translation and rescaling of the scores together, so in this sense it is not surprising that different scores will result. This illustrates the fact that "a particular scoring system can be almost as much determined by the constraints imposed as by the data on which it is based" (Healy and Goldstein, 1976), which is a peculiarity of the scaling of multivariate indicator matrices.

[Fig. 8.7: display of the categories of Table 8.3 in the plane of the first two principal axes; surviving labels include the category points A2, A3, B2 and C3 and the inertias 0.2412 (33.8 %) and 0.1640 (23.1 %).]
Order constraints

The subject of optimization under order constraints is a vast topic in the ... might well be minimal, while the gain in the interpretative value of an ordered solution might be large. Nishisato and Arri (1975) and Nishisato (1980, Section 8.1) discuss this subject, as well as the special algorithms to obtain solutions of the dual scaling problem under order constraints. Heiser (1981) and Gifi (1981) also treat this problem at length. Historically, the subject was first dealt with by Bradley et al. (1962), and its development has been very much related to that of non-metric scaling.
8.5 TREATMENT OF MISSING DATA

Missing data are relatively easy to handle, at least in principle, in correspondence analysis as well as in the general analysis described in Appendix A. Computationally, however, the execution time of an algorithm to perform an analysis in the presence of missing data is greatly increased.

Weighting of individual elements of the correspondence matrix

In correspondence analysis we have seen that a display of the rows and columns of a matrix in a low-dimensional Euclidean space is obtained by weighted least-squares fitting of the row and column points. Equivalently, we can think of the analysis as a weighted least-squares approximation of the elements n_ij of the matrix. In a K*-dimensional correspondence analysis the approximation n~_ij of n_ij is the reconstitution formula (cf. (4.1.28)):

    n~_ij = (n_i. n_.j / n..)(1 + Σ_{k=1}^{K*} f_ik g_jk / λ_k^(1/2))        (8.5.1)

This is the correspondence "model", and its "parameters" (the f_ik's, g_jk's and λ_k's) are identified in the usual way:

    Σ_i r_i f_ik = Σ_j c_j g_jk = 0,    Σ_i r_i f_ik² = Σ_j c_j g_jk² = λ_k
    Σ_i r_i f_ik f_ik' = Σ_j c_j g_jk g_jk' = 0,    where k, k' = 1 ... K*, k ≠ k'

"Estimation" of the parameters is performed by minimizing the weighted least-squares function Σ_i Σ_j w_ij (n_ij − n~_ij)², where w_ij is equal to 1/(n_i. n_.j).

If some of the elements are missing, or perhaps fixed by the structure of the matrix (e.g. structural zeros), then the usual algorithm is no longer applicable. Instead, we could pose the objective of minimizing the same function as above, but letting the double summation extend only over the cells (i, j) of the matrix with valid data, in other words w_ij is equal to s_ij/(n_i. n_.j), where s_ij = 1 if the datum n_ij is present, otherwise zero. The general question of weighted least-squares fitting is reviewed by Gabriel and Zamir (1979). Notice that the present situation is complicated by the fact that the non-zero weights depend on the row and column margins of the matrix, which are unknown unless the missing values are known or are estimated.

Imputing the missing values by iteration of the correspondence analysis

We shall discuss an alternative way of tackling the missing data problem, originating in the work of Mutombo (1973) and Nora-Chouteau (1974). Here the available data in the matrix "impute" the missing values, using the same graphical "model" of the data as the analysis itself. The term "impute" means to ascribe and is often used in the context of missing data estimation. This strategy is in the spirit of the "E-M algorithm", reviewed by Dempster et al. (1977).

In order to introduce the algorithm, let us suppose that a single value n_ab is missing. If we knew that the data table satisfied the following "model" exactly (i.e. the independence, or homogeneity, model, when N is a contingency table): n_ij = n_i. n_.j / n.. (i.e. p_ij = r_i c_j), then the missing value n_ab is the only unknown of the implicit equation: n_ab = (n°_a. + n_ab)(n°_.b + n_ab)/(n°_.. + n_ab), where the superfix ° indicates that summations are performed with a zero as the (a, b)th element of the matrix. Thus n_ab can be directly evaluated as:

    n_ab = n°_a. n°_.b / {n°_.. − (n°_a. + n°_.b)}        (8.5.2)

When the data deviate from this model, this value still solves the implicit equation and thus does not contribute to any measure of fit of the data to the model. However, this value need not provide the closest fit of the available data to the model. Conversely, the value which leads to the closest fit does not necessarily satisfy the implicit equation, as demonstrated in Example 8.8.3. In any case, the data are more likely to deviate quite dramatically from this simple multiplicative model, in which case the dimensionality K* of (8.5.1) as well as the unknown "parameters" still need to be derived.

If K* were known a priori, then the E-M algorithm could provide an estimate of n_ab as follows:

(a) Start with an initial value for n_ab.
(b) Perform the K*-dimensional correspondence analysis of the complete matrix (the M-step, or "maximization" step, of the E-M algorithm).
(c) Use (8.5.1) to obtain a new value for n_ab (the E-step, or "expectation" step, of the E-M algorithm).
(d) Iterate steps (b) and (c) until the new value entering the correspondence analysis is practically the same as the estimate resulting from the analysis.
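As a concrete illustration, steps (a) to (d) can be sketched in a few lines of code. This is only a sketch under stated assumptions: the function names are ours, and the K*-dimensional analysis is carried out via the singular value decomposition of the standardized residuals, which is equivalent to the reconstitution formula (8.5.1).

```python
import numpy as np

def reconstitute(N, K):
    # Rank-K reconstitution (8.5.1): n~_ij = (n_i. n_.j/n..)(1 + sum_k f_ik g_jk/sqrt(lam_k)),
    # computed via the SVD of the standardized residuals (p_ij - r_i c_j)/sqrt(r_i c_j).
    n = N.sum()
    P = N / n
    E = np.outer(P.sum(axis=1), P.sum(axis=0))        # independence model r_i c_j
    U, sv, Vt = np.linalg.svd((P - E) / np.sqrt(E), full_matrices=False)
    SK = (U[:, :K] * sv[:K]) @ Vt[:K, :]              # best rank-K approximation
    return n * (E + np.sqrt(E) * SK)

def impute_em(N, a, b, K, n_iter=500):
    # Steps (a)-(d): alternate the K-dimensional analysis of the completed
    # matrix (M-step) with re-estimation of cell (a, b) from (8.5.1) (E-step).
    M = N.astype(float).copy()
    M[a, b] = 0.0                                     # "superfix 0" margins
    M[a, b] = M[a].sum() * M[:, b].sum() / (M.sum() - M[a].sum() - M[:, b].sum())  # (8.5.2)
    for _ in range(n_iter):
        M[a, b] = reconstitute(M, K)[a, b]
    return M[a, b]
```

For K = 0 the loop leaves the initial value (8.5.2) unchanged, since the reconstitution then reduces to the independence model, of which (8.5.2) is the fixed point.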
Again this will mean that the (a, b)th cell of the matrix does not contribute to the measure of fit, which we know to be Σ_{k=K*+1}^{K} λ_k evaluated in the eventual correspondence analysis during the last iteration. Since K* is usually unknown, the above algorithm can be repeated for increasing values of K*, being initialized using the value (8.5.2) which we know solves the algorithm for the "zero-dimensional" correspondence analysis (K* = 0). Steps (b) to (d) are then executed for K* = 1, resulting in a value which can be used to initialize the algorithm for K* = 2, and so on until the fit is deemed satisfactory. When there are many missing values, then the algorithm is applied in a similar fashion, with a set of values being inserted, estimated, re-inserted, etc., and the only difference is that the algorithm for K* = 0 is now iterative as well.

Although the algorithm converges for a given K* in "well-behaved" situations (e.g. a whole row or column is not missing) we have no assurance about the uniqueness of the resultant estimate, nor do we have any knowledge of the optimality of the display. This is not too serious since we view the whole procedure as a stratagem to allow correspondence analysis to be performed on all the rows and columns of the data matrix, rather than as an optimal way of imputing the missing values. This eliminates the highly unsatisfactory alternative of omitting the rows and columns which contain the missing data. Notice, furthermore, that the principal axes are not "additive" as in the case of complete data: the correspondence analysis for K* = 2, for example, does not necessarily contain the principal axis of the analysis for K* = 1, although the axes can be approximately additive if the imputed values are similar in these respective analyses.

Imputing the missing values during reciprocal averaging

An alternative computational procedure is to incorporate the imputation of the data into the iterative procedure for computing the correspondence analysis. If reciprocal averaging is chosen as the method of computation, for example, the above algorithm can still be used to solve the problem for K* = 0. Then, for K* = 1, reciprocal averaging is applied (cf. Section 4.2), but at the end of each iteration the missing values are updated using (8.5.1), or the equivalent reconstitution formula if the scale values (co-ordinates) are standardized differently. Updating leads to new row and column margins, so the scale values need to be recentered before the next iteration. In this way convergence to the 1-dimensional solution as well as convergence of the imputed values occur simultaneously. For higher values of K*, (K*−1)-dimensional reciprocal averagings become nested within the procedure. For example, when K* = 2, a set of values (initially, the solution for K* = 1) is inserted into the matrix and a standard 1-dimensional reciprocal averaging is performed. A single reciprocal averaging iteration in the second dimension is then performed, with the usual centering and orthogonality constraint, and this yields another set of imputed values, using (8.5.1) with K* = 2. The 1-dimensional problem must then be repeated using the updated matrix before the next iteration in the second dimension is performed, and so on until the whole procedure stabilizes. Notice that the repetition of a 1-dimensional analysis for every iteration in the second dimension is usually quite rapid, especially near convergence, because we naturally use the previous solution (in the first dimension) to initialize the iterations which lead to the next solution.

Convergence of the imputed values

It is an unfortunate fact that the convergence of the imputed values is usually very slow and it is advisable to incorporate some acceleration technique to speed up convergence (cf. discussion of paper by Dempster et al., 1977). In other situations where implicit equations need to be solved, the acceleration technique of Ramsay (1975) has been found to be extremely effective (for example, in multidimensional unfolding, see Greenacre, 1978), and such a technique needs to be investigated in the present context.

Imputing data which are not missing

The imputation of data values has many other applications. For example, in a correspondence analysis of a complete data matrix, each datum can be deleted in turn and then imputed. The difference between this imputed value and the value "predicted" by the original analysis is an external measure of the datum's influence on the analysis and can help in identifying extreme data, or outliers. A global measure of the differences between those imputed values and the data themselves provides an external measure of the fit of the graphical display to the data, in the style of the jackknife (Miller, 1974). Cross-validation of the graphical "model" may also be performed by deleting random portions of the data matrix, imputing these values and comparing them with the original data (cf. Wold, 1978). In Section 8.6 we shall treat a specific missing value problem in the case of certain symmetric data matrices, where we wish to ignore the diagonal elements.
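The leave-one-out influence measure just described can be sketched as follows. This is a sketch under our own naming, with the imputation again performed by iterating the rank-K* reconstitution of (8.5.1); none of the function names come from the text.

```python
import numpy as np

def rank_k_fit(N, K):
    # rank-K reconstitution (8.5.1) of the count matrix N
    n = N.sum()
    E = np.outer(N.sum(axis=1), N.sum(axis=0)) / n
    U, sv, Vt = np.linalg.svd((N - E) / np.sqrt(E), full_matrices=False)
    return E + np.sqrt(E) * ((U[:, :K] * sv[:K]) @ Vt[:K, :])

def impute_cell(N, i, j, K, n_iter=200):
    # treat cell (i, j) as missing and impute it as in Section 8.5
    M = N.astype(float).copy()
    M[i, j] = 0.0
    M[i, j] = M[i].sum() * M[:, j].sum() / (M.sum() - M[i].sum() - M[:, j].sum())
    for _ in range(n_iter):
        M[i, j] = rank_k_fit(M, K)[i, j]
    return M[i, j]

def influence(N, K):
    # difference between each leave-one-out imputed value and the value
    # "predicted" by the analysis of the complete matrix
    pred = rank_k_fit(N.astype(float), K)
    return np.array([[impute_cell(N, i, j, K) - pred[i, j]
                      for j in range(N.shape[1])] for i in range(N.shape[0])])
```

For a matrix that exactly satisfies the independence model every influence is zero at K = 0; large entries flag cells whose deletion changes the display appreciably.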
or similarities between a set of objects, the frequencies of co-occurrence of a set of indicators (e.g. species, artifacts) in ecological and archaeological studies, and the total traffic or migration between areas. In many studies involving such data, the diagonal of the matrix contains some structural values, the maximum similarity or association between objects, or is related to the off-diagonal elements or some other aspect of the study. We shall initially consider the given diagonal values to be included in the analysis and then discuss how we might ignore them. Notice that matrices of distances and dissimilarities are not considered here, unless they can be transformed by some inverse monotonic function to values which can be regarded as a distribution of non-negative mass over the cells of the matrix and thus suitable for correspondence analysis.

Direct and inverse singular vectors

The eigendecomposition of a symmetric matrix A is of the form A = U D_λ Uᵀ, where all the elements of U and D_λ are real. If there are no negative eigenvalues (i.e. A is positive semi-definite, or non-negative definite) then the SVD of A is identical to the eigendecomposition, whereas if there are some negative eigenvalues then the SVD takes a slightly different form, since the singular values are non-negative by definition. If λ_k is a negative eigenvalue corresponding to the eigenvector u_k then μ_k = −λ_k is a singular value of A associated with left and right singular vectors u_k and −u_k respectively. Such a pair of singular vectors is called inverse, while a pair of identical singular vectors associated with a positive eigenvalue is called direct (Benzécri et al., 1973, Volume II B no. 9; El Borgi, 1978). The singular values and associated vectors will be ordered in descending order of the absolute eigenvalues. Since we measure the quality of a least-squares approximation of A in terms of the singular values it could well turn out that an inverse pair of vectors associated with a large negative eigenvalue becomes important.

Consider now the family of matrices derived from the symmetric correspondence matrix P, with margins r (D_r denoting the diagonal matrix of r):

    P^(α,β) ≡ α P + β D_r + (1 − α − β) r rᵀ        (8.6.1)

where α and β are any real numbers. By definition we are only interested in values of α and β which generate matrices P^(α,β) of non-negative elements, and it is not difficult to show that the set of such values forms a 2-dimensional convex polygon. The following is an outline of the proof. If we define the sets of values:

    S⁺ ≡ {(α, β) | p_ii'^(α,β) ≥ 0}
    S⁰ ≡ {(α, β) | p_ii'^(α,β) ≥ 0, i ≠ i'}
    S_d ≡ {(α, β) | p_ii^(α,β) ≥ 0}

then S⁺, the set of values of interest, is the intersection of S⁰ and S_d. S⁰ can be shown to be a cone with vertex at the point (0, 1), while S_d is a convex polygon with at most I sides. Their intersection is thus a convex polygon with at most I + 2 sides (Fig. 8.8). El Borgi (1978) gives an algorithm and computer program in FORTRAN to compute the convex polygon, and shows that certain vertices of the polygon are of special interest.

Rewriting the reconstitution formula (4.1.27) for the symmetric P in terms of the standard co-ordinates, we have:

    p_ii' = r_i r_i' (1 + Σ_k φ_ik φ_i'k (λ_k^(1/2) ε_k))        (8.6.2)

where ε_k, called the parity of the axis (or dimension), has a value of either +1 or −1, depending on whether the principal axis is direct or inverse respectively.
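A small numerical sketch of the direct/inverse distinction (the example matrix is our own; any symmetric matrix with a negative eigenvalue will do):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 1.0]])          # symmetric, eigenvalues 3 and -1

lam, U = np.linalg.eigh(A)          # ascending order: lam = [-1, 3]
_, sv, _ = np.linalg.svd(A)

# singular values are the absolute eigenvalues
assert np.allclose(np.sort(sv), np.sort(np.abs(lam)))

# the eigenvector of the negative eigenvalue gives an "inverse" pair (u, -u):
u = U[:, 0]
assert np.allclose(A @ u, abs(lam[0]) * (-u))

# the eigenvector of a positive eigenvalue gives a "direct" pair (v, v):
v = U[:, 1]
assert np.allclose(A @ v, lam[1] * v)
```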
It is not difficult to show that the correspondence matrix P^(α,β) has the same margins as P and the same principal axes, the only difference being in the principal inertias. In fact, the reconstitution formula for P^(α,β) is:

    p_ii'^(α,β) = r_i r_i' (1 + Σ_k φ_ik φ_i'k (α λ_k^(1/2) ε_k + β))        (8.6.3)

where the square roots of the principal inertias of P^(α,β), affected by their respective parities, are of the form α λ_k^(1/2) ε_k + β. Notice, though, that the ordering of the terms in (8.6.2) and (8.6.3) is not necessarily the same.

The introduction of the additional "parameters" α and β into the correspondence "model" can result in a dramatic improvement in the reconstitution of the data. In fact, in a space of specified dimensionality K* the weighted sum of squared residuals (or residual inertia) is:

    Σ_i Σ_i' (p_ii' − p^_ii')² / (r_i r_i') = Σ_{k=K*+1}^{K} (λ_k^(1/2) ε_k + ρ)²        (8.6.4)

where ρ ≡ β/α (the fit to the matrix P^(α,β) is α² times the above). This is a quadratic in ρ and the value which minimizes the fit is:

    ρ* = − Σ_{k=K*+1}^{K} λ_k^(1/2) ε_k / (K − K*)        (8.6.5)

the minimum obtained being:

    Σ_{k=K*+1}^{K} λ_k − (Σ_{k=K*+1}^{K} λ_k^(1/2) ε_k)² / (K − K*)        (8.6.6)

In the usual analysis of P the fit is just the first term of (8.6.6) and the minimum is the sum of the smallest K − K* principal inertias. In the present situation, however, the minimum might very well involve a different set of principal axes, depending on which subset of K − K* principal inertias minimizes (8.6.6).

For example (Kazmierczak, 1978, p. 215), if P has 5 non-zero principal inertias (all with positive parity):

    λ₁ = 0.99    λ₂ = 0.98    λ₃ = 0.03    λ₄ = 0.02    λ₅ = 0.01

then for K* = 2, minimum residual inertia is provided by retaining λ₁ and λ₂, while for K* = 3, the minimum is provided by retaining λ₃, λ₄ and λ₅, the three smallest eigenvalues! (The latter minimum is 0.1269 × 10⁻⁴, whereas if λ₁, λ₂ and λ₃ are retained, the residual inertia is 8.579 × 10⁻⁴.) The inertia on the first two principal axes is readily absorbed into the "parameter" ρ, whose optimal value (8.6.5) is high when λ₁ and λ₂ are omitted.

Notice that the fit is exact when K* = K − 1, that is when any one of the principal axes is omitted, since (8.6.6) is zero. Thus in the presence of ρ, the dimensionality of an exact fit is at most I − 2, compared to I − 1 in the usual case. This is reminiscent of the situation in multidimensional scaling where the inclusion of an extra parameter called the "additive constant" also reduces the dimensionality of a cloud of points by one.

Results (8.6.3–6) are proved by Kazmierczak (1978), who also illustrates the use of the convex polygon in analysing a symmetric matrix of traffic flows and compares his results with those obtained by the fitting of traffic models to the data. Even though the symmetric correspondence analysis might provide an impressive fit to the data, as in this application, its "parameters" are not really interpretable like those of a sensibly constructed parametric model. We prefer to see the introduction of the parameters α and β merely as one of the possible ways of coping with the diagonal of a symmetric matrix.

Treating the diagonal as missing data

When the diagonal values are completely irrelevant to the study, then there is no need to fit them and we can treat them as missing data. An iterative algorithm to impute values in the diagonal may be used (cf. Section 8.5), resulting in a solution where the imputed diagonal fits the "model" exactly and does not contribute to the residual inertia. Since each missing value is effectively an extra parameter, this solution should be better fitting than the one obtained from the convex polygon. The latter procedure, resulting in the minimum (8.6.6), can still be applied to this problem as a comparison, and has a possible advantage that the margins of the original matrix are preserved. The weighted least-squares fit to the off-diagonal elements can also be performed by assigning zero weights to the diagonal. This will provide the best fit, but the geometry of the results is no longer clear.

Burt matrices

A Burt matrix B ≡ ZᵀZ, where Z ≡ [Z₁ ... Z_Q] is a Q-variate indicator matrix, is a special type of symmetric matrix. If Z is a logical indicator matrix, then B has the Q diagonal matrices Z₁ᵀZ₁ ... Z_QᵀZ_Q down its diagonal (cf. (5.2.4)), so that the diagonal elements of B are the column sums of Z. The row (and column) margins of B are also proportional to these sums, so that the J × J correspondence matrix P ≡ B/b.. has the unique property that its diagonal is proportional to its margins; in fact p_jj = c_j/Q, where c_j is the jth marginal sum of P.

In Section 5.2 the calculation of percentages of inertia in the analyses of Z and B was discussed and the proposal was made that these be based on the quantities (cf. (5.2.17)):

    a(λ_k*) ≡ {Q/(Q−1)}² {λ_k* − (1/Q)}²        (8.6.7)

for principal inertias λ_k* of Z which are greater than 1/Q. We now show that these are actually principal inertias in the analysis of B with its diagonal set to zero, called the modified Burt matrix.
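Kazmierczak's numerical example is easy to verify. The sketch below (our own code) evaluates the minimum residual inertia (8.6.6) over every choice of omitted axes:

```python
import numpy as np
from itertools import combinations

lam = np.array([0.99, 0.98, 0.03, 0.02, 0.01])   # principal inertias, all parities +1
K = len(lam)

def residual_inertia(omit, Kstar):
    # (8.6.6) with eps_k = +1: sum of the omitted lam_k,
    # minus (sum of their square roots)^2 / (K - K*)
    o = list(omit)
    return lam[o].sum() - np.sqrt(lam[o]).sum() ** 2 / (K - Kstar)

# K* = 3: omit two of the five axes
fits = {omit: residual_inertia(omit, 3) for omit in combinations(range(K), 2)}
best = min(fits, key=fits.get)
# omitting axes 1 and 2 (indices 0 and 1), i.e. retaining the three smallest
# inertias, gives the smallest residual, as claimed in the text
```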
The associated correspondence matrix is of the form P^(α,β) in (8.6.1), where α = Q/(Q−1) and β = −1/(Q−1), as shown in Example 8.8.4. Hence ρ = −1/Q and the square roots of the inertias are, from (8.6.3):

    α {λ_k^(1/2) ε_k + ρ} = {Q/(Q−1)} {λ_k^(1/2) − (1/Q)}        (8.6.8)

Clearly the associated principal axes will be direct if λ_k^(1/2) (i.e. λ_k*) is greater than 1/Q and inverse otherwise; in other words the parities ε_k^(α,β) in the analysis of P^(α,β) are positive for λ_k* > 1/Q, negative for λ_k* < 1/Q. The percentages of inertia are thus based upon the principal inertias (the squares of (8.6.8)) associated with the direct principal axes in the analysis of the modified Burt matrix, that is the quantities of (8.6.7). In this analysis the inverse principal axes are the artifacts which imply the individual centerings and standardizations of the subsets of columns of B (or of Z) in the optimal solution.

As an illustration consider the correspondence analysis of the 9 × 9 Burt matrix B of Table 8.3, where Q = 3, as well as that of the modified Burt matrix, with zero diagonal. The resultant principal inertias and their parities are as follows:

         Burt matrix                 Modified Burt matrix
    principal inertias  parities    principal inertias  parities
    (1)  0.260021   +1              (1)  0.250000   -1
    (2)  0.172187   +1              (2)  0.250000   -1
    (3)  0.097292   +1              (3)  0.070164   +1
    (4)  0.078493   +1              (4)  0.025119   -1
    (5)  0.065212   +1              (5)  0.014990   +1
    (6)  0.051835   +1              (6)  0.013677   -1
    (7)  0                          (7)  0.006360   -1
    (8)  0                          (8)  0.001032   -1

It can be easily verified that only the first two principal inertias λ_k of the Burt matrix correspond to principal inertias λ_k* = λ_k^(1/2) greater than 1/3 and that these correspond to the principal inertias 3 and 5 of the modified Burt matrix, which have positive parity (e.g. 0.070164 = (9/4){(0.260021)^(1/2) − 1/3}²). Notice how the Q − 1 zero inertias of the Burt matrix turn up as inertias with value 1/(Q − 1)² in the analysis of the modified matrix. The final outcome of all this is that we would interpret only the first two principal axes of the Burt matrix and assign them percentages of inertia of 82.4 % and 17.6 % respectively. The fact that these add up to 100 % does not mean that we have explained the data exactly, but it does mean that all the "interesting" variation is explained, in the sense described above (cf. also the relevant discussion of "interesting" dimensions in Sections 5.1 and 5.2).

8.7 ANALYSIS OF LARGE DATA SETS

Throughout this book we have assumed that the algorithm which performs correspondence analysis does not depend on the size or type of data matrix at hand. The algorithms which evaluate either the relevant SVD (cf. (4.1.9)) or the relevant eigendecomposition and transition formula (cf. (4.1.23) and (4.1.16) respectively) will clearly become unwieldy when the order of the data matrix becomes quite large. Since one of the major uses of correspondence analysis is the exploration of large data sets, typically sample surveys in sociology, psychology, marketing, education and epidemiology, for example, it is necessary to consider special algorithms to ease the computational burden.

In Section 4.2 we have already illustrated the reciprocal averaging algorithm, which can be used to compute the principal co-ordinates one axis at a time. If the data are very high-dimensional then this algorithm allows us to evaluate the first few sets of co-ordinates only. A further advantage is that the data may be conveniently retained in secondary storage (disk or tape) and re-read each time the reciprocal averaging is performed (see Appendix B). Further savings, both in data storage and computational time, are possible in special cases. For example, when the data are in the form of a (logical) multivariate indicator matrix with a large number of columns (i.e. large J), for example from a large questionnaire, then it is only necessary to store the addresses of the non-zero elements (ones) of the data, in other words the code number of each response. The reciprocal averaging then proceeds as before but involves the addressed elements of the matrix only, with all the zeros being correctly ignored. The same strategy is applicable to any so-called "sparse" data matrix, such as the abundance matrices occurring frequently in ecological studies. Here the number of sites and the number of species may both be very large, but with a relatively small number of species present at any one site.

Stochastic approximation

It is instructive to consider more closely the particular form of the reciprocal averaging equations in the case of a multivariate indicator matrix Z ≡ [Z₁ ... Z_Q] (see Lebart et al., 1977, V.2). If z_iᵀ denotes the ith row of Z and D the diagonal matrix of column sums of Z, then the reciprocal averaging (double transition) of a J-vector y₀ can be shown to be:

    y₁ = (1/Q) Σ_{i=1}^{I} D⁻¹ z_i (z_iᵀ y₀)        (8.7.1)
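The double transition (8.7.1) is easy to check numerically. The sketch below uses a small indicator matrix of our own, accumulating (8.7.1) one row at a time and comparing it with the equivalent single matrix product D⁻¹ZᵀZy₀/Q:

```python
import numpy as np

# I = 4 cases, Q = 2 questions with 2 categories each (J = 4 columns)
Z = np.array([[1, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 1, 0, 1],
              [0, 1, 0, 1]], dtype=float)
I_, J, Q = 4, 4, 2
d = Z.sum(axis=0)                      # column sums, the diagonal of D

y0 = np.array([1.0, -1.0, 0.5, -0.5])

# (8.7.1): y1 = (1/Q) sum_i D^{-1} z_i (z_i' y0), accumulated row by row
y1 = np.zeros(J)
for i in range(I_):
    y1 += (Z[i] / d) * (Z[i] @ y0)
y1 /= Q

y1_matrix = (Z.T @ (Z @ y0)) / d / Q   # the same operation as one matrix product
```

The row-by-row accumulation is exactly what permits the data to be read case by case from secondary storage, as described above.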
Since the metric between rows is defined by the inverse diagonal matrix of column masses, D_c⁻¹ = QI D⁻¹, the scalar quantity s(i, y₀) ≡ (1/Q) z_iᵀ y₀ can be considered the co-ordinate of the point (row profile) (1/Q)z_i on the axis (1/QI)Dy₀. Only Q terms are involved in the scalar product calculation of s(i, y₀). The linear operation (8.7.1) can be written as:

    y₁ = (1/I) Σ_{i=1}^{I} s(i, y₀) π_i        (8.7.2)

where the vector π_i ≡ I D⁻¹ z_i = D_c⁻¹ (1/Q) z_i is associated with the projection onto the row profile vector (1/Q)z_i. Thus y₁ is a weighted average of the π_i, where the computations are achieved at the tth iteration by I updates of the vector y_t of the form:

    y_t ← y_t + (1/I) s(i, y_{t−1}) π_i        (8.7.3)

where i = 1 ... I. Each update is achieved after accessing the Q responses of the ith case (row), and the ordering of the rows is clearly immaterial to the final result. Also the value of y_{t−1} remains constant throughout such an iteration.

The form of (8.7.3) suggests a more general updating scheme:

    y_{t,i+1} ← y_{t,i} + h(i, t) s(i, y_{t,i}) π_i        (8.7.4)

Each update by a row of the data changes the present y, which is in turn used in the following update. The updating is no longer invariant with respect to the ordering of the rows. (This is reminiscent of the k-means clustering algorithm where centroids are re-computed after each individual point re-assignment, cf. Section 7.4.) The "weights" h(i, t) are usually a non-increasing function of i and a strictly decreasing function of t, for example: ...

... the first K* principal axes, thus allowing large savings in computation time at the expense of an acceptably small loss in the solution's accuracy. This method is further illustrated by Lebart (1982a, 1982b).

8.8 EXAMPLES

8.8.1 Asymptotic distribution of the canonical correlations (of a contingency table) in the case of dependence

Let ρ̂_k, k = 1 ... K*, be the observed canonical correlations (square roots of principal inertias) of a discrete bivariate sample density p̂_ij (i = 1 ... I, j = 1 ... J), based on a sample of size n, and let ρ_k be the theoretical canonical correlations, assumed distinct, of the underlying bivariate density p_ij (i = 1 ... I, j = 1 ... J). Then the variables n^(1/2)(ρ̂_k − ρ_k) are asymptotically normal with zero means and variances and covariances given by:

    [expressions for the variances σ_k² and covariances σ_kk', involving the ρ_k and moments of the row and column standard co-ordinates; see O'Neill (1981)]

We shall not provide a proof of these rather lengthy results. An example of the definition of one of the moments in the above formulae is: E(φ_(l) φ_(k)²) ≡ Σ_i Σ_j φ_il φ_ik² p_ij = Σ_i Σ_j φ_il φ_ik² r_i c_j (1 + Σ_k' ρ_k' φ_ik' y_jk'), using (8.1.15). For further details see O'Neill (1980, 1981).

Application

An example in O'Neill (1981) illustrates the above results. A theoretical bivariate density is defined by the contingency table:

    [contingency table with entries summing to 144; surviving fragments: 9 14 25 / 1 18 / 9 38 / 20 10]

which has marginal densities r = [1/3 1/3 1/3]ᵀ and ... Thus σ₁² = 9/16 = 0.5625. The other two moments are computed as 1.0712 and ...
Comment

O'Neill (1981) also gives the first- and second-order moments of the central Wishart matrix variate W_m(ν), for m = 2, 3 and 4 and values of ν up to 9. For m = 2, ν = 2, the means of the chi-square components (under the assumption of independence) are given as 3.571 and 0.429 respectively, with variances 6.674 and 0.391 respectively and covariance 0.467. If the above contingency table were actually observed then both components 144(1/4) = 36 and 144(6/144) = 6 are together highly significant, rejecting the null hypothesis of independence. The null hypothesis that ρ₁ ≠ 0 and ρ₂ = 0 is also rejected quite strongly, because the second component nρ̂₂² is asymptotically W₁(1), that is χ²(1), so that the observed value of 6 is significant at P < 0.001.

8.8.2 Correspondence analysis of a continuous bivariate density

Consider the uniform bivariate density p(x, y) = 1 on the band x ≤ y ≤ x + 1, 0 ≤ x ≤ 1, sketched in Fig. 8.9. Perform a correspondence analysis on this two-way distribution and show that the principal co-ordinates of the "rows" (which are continuous functions of x) are polynomials in x. Derive the specific relationship between the first and second principal co-ordinate functions of the "rows" (x) as well as the "columns" (y).

Solution

The integral ∫∫ p(x, y) dx dy = 1, so that p(x, y) dx dy (0 ≤ x ≤ 1, 0 ≤ y ≤ 2) is the continuous counterpart of the correspondence matrix p_ij (i = 1 ... I, j = 1 ... J). The marginal densities are thus:

    r(x) = ∫_x^(x+1) dy = 1            0 ≤ x ≤ 1

    c(y) = ∫_0^y dx = y                0 ≤ y ≤ 1
    c(y) = ∫_(y−1)^1 dx = 2 − y        1 ≤ y ≤ 2

[Fig. 8.9: (a) the continuous mass distribution p(x, y), viewed in 3-dimensional perspective; (b) p(x, y) seen "from the top", the continuous counterpart of a correspondence matrix; (c) the row mass function r(x); (d) the column mass function c(y).]

The bivariate density p(x, y) and its marginals are sketched in Fig. 8.9. The continuous counterparts of the row and column masses are r(x)dx and c(y)dy respectively. We first show how the function x^p (for any non-negative integer p) is affected by the process of reciprocal averaging. Applying the averaging process (i.e. transition formula) in its continuous form from rows to columns we arrive at a function β_p(y):

    for 0 ≤ y ≤ 1:    β_p(y) = ∫ x^p p(x, y) dx dy / {c(y) dy} = (1/y) ∫_0^y x^p dx = y^p/(p + 1)

    for 1 ≤ y ≤ 2 (similarly)    = {1/(2 − y)} ∫_(y−1)^1 x^p dx = {(y − 1)^p + (y − 1)^(p−1) + ... + 1}/(p + 1)

where we have used the result that 1 − z^(p+1) = (1 − z)(z^p + z^(p−1) + ... + 1) for z = y − 1. Applying the averaging on β_p(y) from columns to rows we arrive at a function α_p(x):

    α_p(x) = ∫ β_p(y) p(x, y) dx dy / {r(x) dx}

that is:

    (p + 1) α_p(x) = ∫_x^1 y^p dy + ∫_1^(1+x) {(y − 1)^p + (y − 1)^(p−1) + ... + 1} dy
                   = ∫_x^1 y^p dy + ∫_0^x (y^p + y^(p−1) + ... + 1) dy

where we have made a change of variable in the second integral, replacing y − 1 by y. Thus:

    (p + 1) α_p(x) = 1/(p + 1) − x^(p+1)/(p + 1) + x^(p+1)/(p + 1) + x^p/p + ... + x
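These transition integrals are easy to confirm numerically. The sketch below (our own code) compares the two-branch closed form of β_p(y) with a direct trapezoidal integration over the band density:

```python
import numpy as np

def beta_closed(y, p):
    # the two-branch closed form derived above
    if y <= 1:
        return y**p / (p + 1)
    z = y - 1
    return sum(z**k for k in range(p + 1)) / (p + 1)

def beta_numeric(y, p, m=20001):
    # (1/c(y)) * integral of x^p over the band x <= y <= x + 1:
    # for fixed y, x runs over [max(0, y-1), min(1, y)], an interval of length c(y)
    lo, hi = max(0.0, y - 1.0), min(1.0, y)
    x = np.linspace(lo, hi, m)
    f = x**p
    integral = ((f[:-1] + f[1:]) / 2 * np.diff(x)).sum()   # trapezoidal rule
    return integral / (hi - lo)
```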
The leading coefficient of α_p(x) is thus 1/{p(p + 1)}, which decreases as p increases.

It is a fairly simple matter to evaluate the specific form of the first two principal co-ordinate functions f₁(x) and f₂(x). We know that f₁(x) is a linear function of x, ax + b, and that f₂(x) is a quadratic, cx² + dx + e. Using the centering, standardizing and orthogonality conditions:

    ∫ f_p(x) r(x) dx = 0,    ∫ f_p(x) f_p'(x) r(x) dx = 0,    p = 1, p' = 2

we obtain, applying the transition from rows to columns:

    λ₁^(1/2) g₁(y) = (1/y) ∫_0^y f₁(x) dx = (3/2)^(1/2) (y − 1)              for 0 ≤ y ≤ 1
                   = {1/(2 − y)} ∫_(y−1)^1 f₁(x) dx = (3/2)^(1/2) (y − 1)    for 1 ≤ y ≤ 2

    g₂(y) = 5^(1/2) (2y − 1)(y − 1)    for 0 ≤ y ≤ 1
          = 5^(1/2) (2y − 3)(y − 1)    for 1 ≤ y ≤ 2

and the relationship between g₂ ≡ g₂(y) and g₁ ≡ g₁(y): since g₁ is proportional to y − 1, eliminating y shows that g₂ is a quadratic function of g₁, the sign of its linear term differing on the two branches 0 ≤ y ≤ 1 and 1 ≤ y ≤ 2, so that the column points lie on two parabolic arcs meeting at the origin (y = 1).

[Fig. 8.10: the "row" (x) and "column" (y) curves in the plane of the first two principal axes g₁ and g₂; the points x = 0 and x = 1 mark the ends of the row curve.]

Comment

p(x, y) can also be considered as the continuous version of a logical indicator matrix, where the number of categories for each "question" q is J_q = 2. In fact, for y < 1, y and y + 1 form a doubled pair, so that points on the "column" curve of Fig. 8.10 corresponding to y and y + 1 connect through the origin and are balanced at the origin in the usual way.
TABLE 8.5
Frequencies of violent and non-violent convictions amongst 390 criminals

                        Violent convictions
Non-violent
convictions     0     1     2     3    ≥4   Total
0               0    13     5     2     0     20
1               3    16     6     2     3     30
2               6    19     5     2     2     34
3              10    24     5     7     2     48
4              15    13     5     3     4     40
5              16    15    11     3     2     47
6              13     8     6     5     2     34
7               9    10     3     4     1     27
8               8     5     4     2     1     20
9-10           13    10     4     2     0     29
11-12          15     7     6     0     0     28
≥13            16     9     3     1     4     33
Total         124   149    63    33    21    390

FIG. 8.12. Iterations to impute the missing value, using the reconstitution formula of a correspondence analysis, in the style of the E-M algorithm. (Axes: imputed value against iterations.)

Solution
(a) Because only one element is "missing", the imputed value n11 such that p11 = r1 c1 is, from (8.5.2):
    n11 = (20 × 124)/(390 − 20 − 124) = 10.08
(b) Using the rounded value of n11 = 10 the total inertia in the data matrix can be evaluated to be 0.140181 to 6 decimals. There is a slight contribution by n11 to this inertia due to the rounding, but this is only evident in the 6th decimal. In Fig. 8.11 we plot the least-squares fit to the data for a range of values of n11, where the fit is evaluated on all actual data. That is, the fit is the inertia of the data matrix minus the contribution of the (1,1)th element. The integer 15 is closest to the minimum of this fit and is optimal in this sense, but does not obey the independence model since (35 × 139)/405 = 12.01.
(c) In order to impute a value which fits the 1-dimensional correspondence model p_ij = r_i c_j(1 + f_i1 g_j1/λ1^{1/2}) we have to iterate as described in Section 8.5. The progress of these iterations, initialized by the value 10 imputed in (a), is shown in Fig. 8.12. The rate of convergence can be seen to be very slow and eventually a value (rounded to an integer) of 5 is reached.
Comments
The value of 0 in the (1,1)th cell of Table 8.5 is clearly a structural zero since the sample contains no non-criminals. In their analysis of these data Holland et al. (1981) overlook the presence of this zero and use it as actual data, resulting in a maximal canonical correlation of 0.323, that is a first principal inertia of the order of 0.1043. If the imputed value of 5 is used, according to result (c) above, then the maximal canonical correlation is 0.286 (first principal inertia of 0.0818). Although we always
report positive values for the above correlations, an examination of the canonical weights (or co-ordinates on the first principal axis) shows that the correlation between violent and non-violent convictions appears to be negative, when the ordering of the rows and columns is taken into account. (For example, the first row and first column appear on opposite sides of the first axis.) The above authors do acknowledge, however, that the evidence of negative correlation is "exceedingly minimal". Our analysis above shows that this negative correlation, however minimal, is to a certain extent due to the use of the structural zero as an actual datum, which naturally reinforces the negative association of the ordered rows and columns.

FIG. 8.11. Least-squares fit to all data except the (1,1)th element, for a series of imputed values for this element.
Solution
We need to show that for some α and β (cf. (8.6.1)):

    P' = αP + βD_c + (1 − α − β)cc^T

has zero diagonal. That is:

    α(c_j/Q) + β c_j + (1 − α − β)c_j² = 0

where c_j/Q is the jth diagonal element of P and c_j is the jth marginal sum (mass) of P. Hence:

    α/Q + β + (1 − α − β)c_j = 0

which can only be true for general c_j if α + β = 1, in which case α/Q + (1 − α) = 0, thus α = Q/(Q − 1) and β = −1/(Q − 1).
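These values can be checked numerically: for a Burt matrix built from a randomly generated indicator matrix (an invented example; for such a matrix the diagonal elements of P are c_j/Q), the modified matrix has an exactly zero diagonal:

```python
import numpy as np

rng = np.random.default_rng(1)
n, Q = 50, 3                         # 50 cases, Q = 3 questions, 2 categories each
# Stacked indicator matrix Z: one pair of 0/1 columns per question
blocks = []
for _ in range(Q):
    z = rng.integers(0, 2, n)
    blocks.append(np.column_stack([z, 1 - z]))
Z = np.hstack(blocks).astype(float)

B = Z.T @ Z                          # Burt matrix; grand total n * Q^2
P = B / B.sum()                      # correspondence matrix; diagonal p_jj = c_j/Q
c = P.sum(axis=0)                    # column masses

alpha = Q / (Q - 1.0)
beta = -1.0 / (Q - 1.0)
P_star = alpha * P + beta * np.diag(c) + (1 - alpha - beta) * np.outer(c, c)
assert np.allclose(np.diag(P_star), 0.0)   # the modified matrix has zero diagonal
```

Note that with α + β = 1 the cc^T term vanishes identically, so the check reduces to α/Q + β = 0, exactly as in the derivation above.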
From (8.6.3) the principal axes of the two analyses are the same up to possible inversions. The principal inertias of B' are of the form:

    {α λ^{1/2} + β}² = {Q/(Q − 1)}² {λ^{1/2} − (1/Q)}²

(i.e. the quantities defined in (8.6.7)), and their polarities are the same as the signs of the differences λ^{1/2} − (1/Q).

9
Applications of Correspondence Analysis

As an introduction to the potential of correspondence analysis as an exploratory data-analytic technique, we now present 11 different applications in a wide variety of contexts. In order to include so many applications, we have kept the presentation of each one quite brief.

The fields of application covered here are genetics (both human and population), social psychology, clinical research, education, criminology, food science, linguistics, ecology, palaeontology and meteorology. Further references to published articles in an even wider variety of fields are given in Section 9.12.
9.1.1 Data
Table 9.1 shows the 4 × 5 contingency table of 5387 schoolchildren from Caithness, Scotland, classified according to the two discrete variables, eye colour and hair colour. The profiles of the hair colours (column profiles) are
TABLE 9.1
Eye and hair colour data (Fisher, 1940; Maung, 1941). The rows and columns have been arranged in terms of the ordering (positive to negative) on the first principal axis of Fig. 9.1 (cf. column "K = 1" of Table 9.2). The column profiles
                     Hair colour
Eye colour    Fair    Red   Medium   Dark   Black
Blue           326     38      241    110       3
Light          688    116      584    188       4
Medium         343     84      909    412      26
Dark            98     48      403    681      85

[TABLE 9.2: co-ordinates and contributions of the rows and columns of Table 9.1 on the principal axes K = 1 and K = 2.]

[FIG. 9.1. Correspondence analysis of Table 9.1: λ1 = 0.1992 (86.6%), λ2 = 0.03011 (13.1%). Row points: blue eyes, light eyes, medium eyes, dark eyes; column points: fair hair, red hair, medium hair, dark hair, black hair. Scale: 0.1.]

9.1.2 Method
Fisher's aim in studying these data was to quantify this association by assigning scale values (or "scores") to each of the 9 categories such that the correlation between the two discrete variables was maximized. We have already shown in Section 4.4 that this problem differs from the correspondence analysis of Table 9.1 only in the standardization of the scale values. In correspondence analysis the scale values are (usually) the principal co-ordinates, while in canonical correlation analysis the standardization is not identified by the objective but is usually prescribed to be "unit", that is equivalent to what we have called the standard co-ordinates.

Being a more modest technique, correspondence analysis can be applied to data with less stringent properties, for example sparse matrices and matrices with some low cell frequencies (less than 5, say). On the other hand, modelling of contingency tables is usually applied to fairly small matrices with substantial cell frequencies, sometimes at the expense of agglomerating some rows or columns.
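As a check on the figures quoted in Fig. 9.1, the principal inertias of Table 9.1 can be recomputed from the frequencies by a singular value decomposition of the standardized residuals (a sketch, not the computation used in the book):

```python
import numpy as np

# Table 9.1: eye colour (rows: blue, light, medium, dark) by
# hair colour (columns: fair, red, medium, dark, black), n = 5387
N = np.array([[326, 38, 241, 110, 3],
              [688, 116, 584, 188, 4],
              [343, 84, 909, 412, 26],
              [98, 48, 403, 681, 85]], dtype=float)

P = N / N.sum()                                   # correspondence matrix
r, c = P.sum(axis=1), P.sum(axis=0)               # row and column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
sig = np.linalg.svd(S, compute_uv=False)
lam = sig ** 2                                    # principal inertias

assert abs(lam[0] - 0.1992) < 5e-4                # lambda_1 = 0.1992
assert abs(lam[0] / lam.sum() - 0.866) < 5e-3     # 86.6% of the total inertia
```

The square root of λ1, about 0.446, is the maximal canonical correlation that Fisher's scaling problem seeks.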
The parabolic configuration of the points in the plane is the horseshoe effect which characterizes data with a strong "diagonal band" in the profiles (see Section 8.3). Notice that the rows and columns of Table 9.1 have been intentionally arranged in the same order as the first principal co-ordinates to reveal this band of association.

9.2.1 Data
1554 Israeli adults are cross-classified in a two-way contingency table according to their 7 "principal worries" and an eighth category "more than one", and according to 5 groups which depend on their place of residence and that of their fathers (Table 9.3).
9.1.4 Discussion
Goodman (1981) discusses these data in detail and models the frequencies using his association models. He shows that his "RC-model" for the expected frequencies, π_ij = E(p_ij) = α_i β_j e^{φ ψ_i γ_j}, is closely linked with the canonical correlation approach when the rows and columns derive from the discretization of a bivariate normal density (or a transformed bivariate normal density). Since the parameters are estimated by maximum likelihood, this framework is very valuable for formal statistical inference. Hypotheses like φ_i = φ_0 + iφ_1 (i.e. the theoretical "scale values" φ_i are equally spaced) can be tested quite easily, which can be very useful in a variety of circumstances (cf. Section 9.3). However, because of the large sample size, Goodman finds that the above RC-model does not "fit the data ... so well", and proposes a more general model. No doubt even the more general model would then be rejected if the sample size were increased, one of the dilemmas of hypothesis testing.

By contrast, correspondence analysis has no aspirations of modelling the data statistically. The data serve as the population and every "degree of freedom", as it were, is worth looking at. A displayed point is not a parameter estimate in the usual sense, although the stability of the display can be explored by analogy with conventional estimation and inference (cf. Sections 8.1, 9.2, 9.3 and 9.7).

TABLE 9.3
Principal worries of Israeli adults (Guttman, 1971). The five groups are: living in Asia/Africa (ASAF); living in Europe/America (EUAM); living in Israel with father from Asia/Africa (IFAA); living in Israel with father from Europe/America (IFEA); living in Israel with father from Israel (IFI).

                             ASAF   EUAM   IFAA   IFEA   IFI
Enlisted relative    ENR       61    104      8     22     5
Sabotage             SAB       70    117      9     24     7
Military situation   MIL       97    218     12     28    14
Political situation  POL       32    118      6     28     7
Other                OTH       81    128     14     52    12
More than one worry  MTO       20     42      2      6     0
Personal economics   PER      104     48     14     16     9
9.2.2 Method
Correspondence analysis of the data is performed and then the stability of the
display is explored by bootstrapping the sample 100 times and projecting the
replicate profiles onto the principal subspace (a plane in this case).
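A sketch of this procedure, applied to the seven worry rows of Table 9.3 (multinomial resampling of the whole table; each replicate's column profiles are projected as supplementary points, using the standard row co-ordinates as in the transition formula):

```python
import numpy as np

rng = np.random.default_rng(0)

# The seven worry rows of Table 9.3 (columns ASAF, EUAM, IFAA, IFEA, IFI)
N = np.array([[61, 104, 8, 22, 5],
              [70, 117, 9, 24, 7],
              [97, 218, 12, 28, 14],
              [32, 118, 6, 28, 7],
              [81, 128, 14, 52, 12],
              [20, 42, 2, 6, 0],
              [104, 48, 14, 16, 9]], dtype=float)
n = int(N.sum())

P = N / n
r, c = P.sum(axis=1), P.sum(axis=0)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sig, Vt = np.linalg.svd(S, full_matrices=False)
Phi = U[:, :2] / np.sqrt(r)[:, None]   # standard row co-ordinates (2-D)
G = (P.T @ Phi) / c[:, None]           # principal co-ordinates of the columns

# Bootstrap: resample the n individuals, then project each replicate's
# column profiles onto the principal plane as supplementary points
reps = []
for _ in range(100):
    Nb = rng.multinomial(n, P.ravel()).reshape(P.shape).astype(float)
    prof = Nb / Nb.sum(axis=0, keepdims=True)
    reps.append(prof.T @ Phi)
reps = np.array(reps)                  # shape (100, 5, 2)

# The replicate clouds scatter around the original column points
assert np.abs(reps.mean(axis=0) - G).max() < 0.15
```

The spread of each cloud of replicate points gives an informal picture of the stability of the corresponding column point in the display.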
9.2.3 Results and interpretation
The principal inertias and their percentages are 0.05967 (77.0%), 0.01533 (19.8%), 0.00240 (3.1%) and 0.00010 (0.1%) respectively. Therefore the 2-dimensional correspondence analysis, containing 96.8% of the total inertia, is an almost exact display of the profiles (Fig. 9.2).

The first axis is determined almost exclusively by the opposition amongst the worries of "personal economics" to "political situation" and "military situation", with a corresponding opposition of Israelis living in Asia/Africa to those living in Europe/America. The contributions of these points to the first axis are very high (Table 9.4). The interpretation of this contrast is obvious: people in the former group, living in the developing countries, have financial problems, while those in the latter group are concerned about the wider issues mentioned above, as well as the "economic situation", rather than their personal finances.

[TABLE 9.4: contributions of the row and column points to the principal axes, with the quality (QLT) of display of each point.]

[FIG. 9.2. Correspondence analysis of Table 9.3 (scale: 0.1). Point labels: military situation, sabotage, enlisted relative, more than one, other, political situation, personal economics, economic situation; ASIA/AFRICA, EUROPE/AMERICA, ISRAEL: father ASIA/AFRICA, ISRAEL: father EUROPE/AMERICA, ISRAEL: father ISRAEL.]
9.3.1 Data
The data are from a study by Calimlim et al. (1982) and are reproduced in
Table 9.5. 121 hospital patients have been randomly assigned to four groups
and each group receives a different drug (A, B, C or D). Each patient rates the
drug's effect on a 5-point scale worded poor, fair, good, very good and
excellent.
FIG. 9.3. Bootstrapped display of the columns (groups of Israelis); the numbering of the columns from 1 to 5 is the same as in Table 9.3. (Scale: 0.1.)

The second axis separates out the three groups in Israel from the two groups outside Israel, mainly because of the response "other". This suggests either that the Israeli inhabitants are reticent about answering the questionnaire, or that they really have worries which cannot be easily classified into the given categories.

9.3.2 Methods and results
The 2-dimensional correspondence analysis of the data is given in Fig. 9.4. The co-ordinates on the first principal axis are optimal "scores" and "scale values" respectively for the drugs and the categories of the rating scale (see Section 4.3). These scores and the scale values are related by the "symmetric" transition formulae (4.1.16) (cf. also (4.2.4) and (4.2.5)), so that the scores are not, strictly speaking, weighted averages of the scale values. The choice of
Notice from column QLT (quality) of Table 9.4 that the row "enlisted relative" (ENR) and the column "Israel: father Israel" (IFI) are not well represented in this display. They do in fact lie in the third dimension but this

TABLE 9.5
Responses of 121 hospital patients in a survey of the effectiveness of four drugs; each drug has been rated on a verbal 5-point scale: poor/fair/good/very good/excellent.

[FIG. 9.4. Correspondence analysis of Table 9.5: λ1 = 0.3047 (78.3%), λ2 = 0.07731 (19.9%). Points: drug A, drug B, drug C, drug D; poor, fair, good, very good, excellent. Scale: 0.2.]
over the whole table in order to obtain replicates of the row profiles. In other words we assume the rows to be four independent samples from four multinomial populations. The replicated row profiles are then projected onto the principal plane as before (Figs 9.5 and 9.6). These strongly suggest that the drugs separate into two groups: A with B and C with D, the first group being more favourably rated than the second. Correspondingly, there does not seem to be enough evidence to separate the responses "very good" and "excellent", while there is some confusion at the lower end of the scale, with "poor" and "good" being similarly scaled and "fair" occupying an anomalous position.
9.3.3 Discussion
FIG. 9.5. Bootstrapped display of the rows (drugs) of Fig. 9.4; replicated points for drugs A, B, C and D are indicated by 1, 2, 3 and 4 respectively. (Scale: 0.2.)

Cox and Chuang (1982) propose various analyses based on logit functions of the multinomial probabilities (i.e. the elements of the row profiles). Because of the problems involved with low-frequency cells, they combine the categories "very good" and "excellent" so that the contingency table is of the order 4 × 4. The χ²-statistic computed on this condensed table is 43.9, with 9 degrees of freedom, compared to the value of 47.1 on the original table, with
12 degrees of freedom. In either case, the statistic is highly significant (P < 0.001).

The χ²-statistic is asymptotically equivalent to the likelihood ratio statistic G² (see, for example, Fienberg, 1980), which also tests a contingency table for homogeneity (i.e. row-column independence). Chuang (1982) shows that G² can be partitioned into J − 1 components, in this case, which are themselves G²-statistics for testing for homogeneity respectively on J − 1 subtables of order I × 2 of the following form:

    [ n_1j   Σ_{j'>j} n_1j' ]
    [ n_2j   Σ_{j'>j} n_2j' ]      j = 1 ... J − 1
    [  ...        ...        ]
    [ n_Ij   Σ_{j'>j} n_Ij' ]

Since the J categories are ordered from least to most favoured, the ratio n_ij/Σ_{j'>j} n_ij' is the relative chance of rating j compared to all more favourable ones, called the continuation ratio (Fienberg, 1980). Each G², with 3 degrees of freedom, corresponding to such a subtable can be further partitioned into 3 components with 1 degree of freedom each. Because of the similarities of drugs A and B and drugs C and D, this finer partitioning can be carried out in terms of these comparisons. In Table 9.6 we give a summary of this partitioning, derived from Cox and Chuang (1982), in terms of the χ²-statistics (note that the decomposition is no longer exact). Further analyses by Cox and Chuang (1982), which involve fitting various models to functions of the logits, lead to the same conclusions suggested by our analysis of Section 9.3.2.

The anomalous position of the rating "fair" in Fig. 9.4 deserves comment. It is readily seen that this is almost entirely due to the relatively high frequency of this rating for drug D. Although this involves only a few people, the bootstrapped profiles of "fair" nevertheless separate out from the other profiles (Fig. 9.6). This equivalently explains the anomaly that the second column of Table 9.6 shows a highly significant result, while the first column does not. If the people were perceiving the drugs on a "bad-to-good" dimension and, correspondingly, if the drugs were perceivable on such a scale, then we would expect the 5 ratings to be in the correct order along the first principal axis. An explanation that we would venture for the actual outcome is that drug D, which we know to be drug C in a much lower dosage, had minimal analgesic effect. Since there is no category of response labelled "I do not know", the respondent might opt for "fair" as an easy way out, which does not mean that he considers the effect to be between "poor" and "good". This is only a possible explanation of what might have happened, the lesson being that verbal rating scales of this sort might not be interpreted in the expected way by the respondents.
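The continuation-ratio partition of G² is exact, unlike its χ² counterpart; this can be verified numerically (the 4 × 5 table below is invented for illustration):

```python
import numpy as np

def g2(N):
    """Likelihood-ratio statistic for independence in a two-way table."""
    N = np.asarray(N, dtype=float)
    E = np.outer(N.sum(axis=1), N.sum(axis=0)) / N.sum()
    mask = N > 0                        # use the convention 0 * log(0) = 0
    return 2.0 * (N[mask] * np.log(N[mask] / E[mask])).sum()

# Invented ordered-category table (rows = treatments, columns = ratings)
N = np.array([[10, 8, 6, 4, 2],
              [5, 9, 7, 6, 3],
              [2, 4, 8, 9, 7],
              [1, 3, 5, 8, 13]], dtype=float)

# Continuation-ratio subtables: column j versus all more favourable columns
parts = []
for j in range(N.shape[1] - 1):
    sub = np.column_stack([N[:, j], N[:, j + 1:].sum(axis=1)])
    parts.append(g2(sub))

assert abs(sum(parts) - g2(N)) < 1e-9   # the partition is exact
```

The exactness follows from the telescoping of the n log n terms over the cumulated columns, which is why the χ² version, a quadratic approximation, only decomposes approximately.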
TABLE 9.6
Summary of the partitioning of the test statistics (derived from Cox and Chuang, 1982); cells not recovered are left blank.

                              Poor vs fair      Fair vs good      Good vs very good
                              to excellent      to excellent      to excellent

Testing equality of           χ² = 3.06         χ² = 19.51        χ² = 21.88
continuation ratios           P ≈ 0.40          P < 0.001         P < 0.001
(3 d.f. each)                 Not significant

Partitioning the test
statistic (1 d.f. each):
  A vs B                      Not worth         χ² = 1.00         χ² = 4.81
                              computing since   Not significant   P < 0.05
                              overall χ² is
                              not significant
  C vs D                      χ² = 2.98         χ² = 0.00
                              P < 0.10          Not significant
  A and B vs C and D          χ² = 15.06        χ² = 18.00
                              P < 0.001         P < 0.001

9.4 MULTIDIMENSIONAL TIME SERIES

9.4.1 Science doctorates conferred in the USA, 1960-1975
The data, given by Gabriel and Zamir (1979), are reproduced in Table 9.7 and described in the table caption. There are only two principal inertias of interest and the principal plane (Fig. 9.7) explains 95% of the inertia. In 1960 there appears to have been relatively more agriculture, earth science and chemistry degrees, as opposed to engineering/mathematics, while the trend from 1965 to 1975 appears to be away from the physical sciences towards the social sciences, sociology, psychology and anthropology.

Notice that the total inertia of the data matrix (0.01318) is quite low: the profiles are changing fairly "slowly". Nevertheless, the display does show that they are changing methodically and regularly from year to year, so the trend is definite. An informal forecast of the profile in 1976 may be obtained by using the co-ordinates g·1 = −0.135 and g·2 = 0.075 of the extrapolated point in the formula
TABLE 9.7
Data on science doctorates in the USA (source: Statistical Abstract of the United States, 1976, table 958). The last column consists of estimated frequencies based on the extrapolated point in the display of Fig. 9.7 and a total of 18352 doctorates (the total is 18361 owing to rounding errors, the estimated frequencies being accurate to only 3 significant figures).

                        1960   1965   1970   1971   1972   1973   1974   1975   1976 (est.)
Engineering 794 2073 3432 3495 3475 3338 3144 2959 2773
Mathematics 291 685 1222 1236 1281 1222 1196 1149 1099
Physics 530 1046 1655 1740 1635 1590 1334 1293 1254
Chemistry 1078 1444 2234 2204 2011 1849 1792 1762 1804
Earth Sciences 253 375 511 550 580 577 570 556 584
Biological Sciences 1245 1963 3360 3633 3580 3636 3473 3498 3541
Agricultural Sciences 414 576 803 900 855 853 830 904 908
Psychology 772 954 1888 2116 2262 2444 2587 2749 2822
Sociology 162 239 504 583 638 599 645 680 687
Economics 341 538 826 791 863 907 833 867 879
Anthropology 69 82 217 240 260 324 381 385 394
Other Social Sciences 314 502 1079 1392 1500 1609 1531 1550 1616
Totals 6263 10477 17731 18880 18940 18948 18316 18352 (18361)
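The inertia figures quoted for Table 9.7 can be checked directly from the frequencies (a sketch; the estimated 1976 column is excluded):

```python
import numpy as np

# Table 9.7: science doctorates, rows = 12 fields,
# columns = years 1960, 1965, 1970, 1971, 1972, 1973, 1974, 1975
N = np.array([
    [794, 2073, 3432, 3495, 3475, 3338, 3144, 2959],   # Engineering
    [291, 685, 1222, 1236, 1281, 1222, 1196, 1149],    # Mathematics
    [530, 1046, 1655, 1740, 1635, 1590, 1334, 1293],   # Physics
    [1078, 1444, 2234, 2204, 2011, 1849, 1792, 1762],  # Chemistry
    [253, 375, 511, 550, 580, 577, 570, 556],          # Earth Sciences
    [1245, 1963, 3360, 3633, 3580, 3636, 3473, 3498],  # Biological Sciences
    [414, 576, 803, 900, 855, 853, 830, 904],          # Agricultural Sciences
    [772, 954, 1888, 2116, 2262, 2444, 2587, 2749],    # Psychology
    [162, 239, 504, 583, 638, 599, 645, 680],          # Sociology
    [341, 538, 826, 791, 863, 907, 833, 867],          # Economics
    [69, 82, 217, 240, 260, 324, 381, 385],            # Anthropology
    [314, 502, 1079, 1392, 1500, 1609, 1531, 1550]],   # Other Social Sciences
    dtype=float)

P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
lam = np.linalg.svd(S, compute_uv=False) ** 2

assert abs(lam.sum() - 0.01318) < 1e-4     # total inertia quoted in the text
assert lam[:2].sum() / lam.sum() > 0.9     # the plane explains about 95%
```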
"Tl "Tl
G) G)
w W
co -..J
o
:::J
30 O . o .0
~
rou -g. -CD' 'O
O
~3'
°1'
3 O
::o
.Do>
e -
lO
9 oo
o
og:
o> S
:I
g:
'"
'<
.
N
ro N , 'O
::J ,
00.
CD'
~.:1
o
CD
o:¡; o. lO
O
o lO '<
lO
o
:o
-< _. a> o o
e 3
o
o :I ;¡..
0.3 01 7<'
3 :::J O
..... ro -9: o' O
QJ ro
::J .g o.
,8' -i::J lO
o O
O
'".... ~ """"
~
~ o ¡¡j'
o>(/)
ú)
ro
o'
::J ::o
~
7<'
I
a>
N ~
Q)
g::o
~.
ro ::J
w~ 3
. ,.'"
:::J
o
lO
'<
'"
,,'
.... ~~
a>
a:..
,... ~
o'
ro
nO.
QJ CD
o,o
i a>
O
~ O
<::>
:..,.¡g,
.
9
:I ,,
:¡1 '""
O
;:,;
too
~ 8 <Q.,
O ~(/)
:::Y _.
&"(11 ;;: ¡:;¡ ..... u 3 r-..l O
QJ (/)
~
q,u
ro' O) r
<D(I) !2.
o
a> (0)
3-:::
9
o'
lO
,01
1
'-..1
O
01
N .
(J
<:>
--<
o.
0>0'
::o
CD
oCD<:::J
"e
oa
o
¡:¡
CD
'"
~
~ o,g
.~
¡Ji
a>
;!.
~
W-<
O'
.•
:::J
.",
I l:T
o' ... ... o
~
a>
Ñ
""
too
"";:,;<:>
.
Q)O I
e '< o' ~
6,-< CD
o. OS; .~. I
O oo
.-..1 e '-J
nO '"
N>~
:I:
CD
8 ro g: ,, '" '< :::J
o Q .~ ;!. I:l.
""...,
;:,;
o S; (}I.... ; ;
tn. ~
~(/) ;j' 1 3 ::r !2.
=ro(/) 2-, • • w U
'" '-..1 o' lO
""
'"O o lO
'"
lO
::J • J> CD -..JO .·0 o ;¡..
(/)u 0'0 01
'O' (i'
- o
~::J
Wo.
a
'O
'"
~
"'8"
"':I
!2.
(}1::J
~o.
. ro
::J
O
:I',
'<
lO
o' ,
I
.
:::J
o
lO
n (DO
:I:::J
CDo
~.g: ~
;:,;
¡:,
too
COro .~ ¡;;.
- - -- -.~
lO
ro
.O::J
o o> le»
---~---
-iro
o>
0"0>
-::J o
?'
"
::J
o>
-<
\<11
.- --
,...
5' (/)
!tCD:I: ooa>
ro o> -O
y>
.w-< -..1\ 0
.
N~.
--:-y> .~ 3 N
<D
s.
.-+
0
.
"'<D
O
J>
·cr
a:
CD
~
N
:.,.
:I:
o
3
:::Y
ro
o.
-rf!. UI
O
0l:: IV
e
o >!!
!.. o~ ~
o>
O-
\O
::;
"
    p_i·/c· = r_i(1 + f_i1 g·1/(λ1)^{1/2} + f_i2 g·2/(λ2)^{1/2})

to give an estimated profile p_i·/c·, i = 1 ... 12. Since the total number of doctorates is fairly steady from 1974 to 1975, the estimated frequencies for 1976 are n·(p_i·/c·), where n· = 18352, the same number as in 1975 (see last column of Table 9.7).

The horseshoe effect is strikingly evident in the pattern of column points representing the years. The row points, representing the various disciplines, do not follow the same pattern. The points representing agricultural sciences, earth sciences and economics, for example, lie clearly within the horseshoe, which means that the profiles of these disciplines are higher than average in the early and later years. This display should be compared to those of Figs 9.1 and 9.4, where both sets of points exhibit approximately the same horseshoe effect owing to more ordered row-column associations. Notice how the 2-dimensional curved representation of what is basically a regular trend allows for a richer description of the data.

9.4.2 Recorded number of offences of 18 types of crime, 1950-1963
The data matrix is given by Chatfield and Collins (1980) and consists of the frequencies of 18 types of crime (the rows) in each of the 14 years 1950-1963 (the columns) in the USA. The total inertia of the matrix is 0.008688 and the percentages of inertia are 72.4%, 15.6%, 4.3%, 2.8%, ..., so that it is again clear that a planar display is quite adequate (Fig. 9.8). Even though the total inertia is small, the changes from year to year are still methodical, but not as regular as in Fig. 9.7. There is a trend from the crimes of violence, indecency and homosexuality in the early 1950s to the crimes of theft as well as motor vehicle related crimes in the 1960s. Remember that the display does not represent the change in the total number of offences, but rather the changing emphasis in the relative frequencies of offences.

The isolated position of the point representing homosexuality-related crimes shows that its development over the years has not been the same as other crimes. This can be checked back to the data and it is indeed evident that, in spite of a steadily increasing frequency of crimes in general, the frequencies of this particular offence rise to a peak in the mid-1950s and then show a general, though not steady, decrease. The "bump" in the sequence of year points in Fig. 9.8 demonstrates this pattern which is unique amongst the various crimes, probably due to changing public and legal opinion about this controversial offence.

... strategy of focusing a principal axis on an obvious feature of the data set.

9.5.1 Data
Table 9.8 gives the marks for a class of 38 students on 8 questions in an examination, as well as the total marks (out of 100) and the average marks for each question. In accordance with the discussion of Chapter 6 these marks are doubled with respect to their respective maximum marks, so that the data matrix analysed is of the order 38 × 16.

9.5.2 Method and results
Scaling of the students provided by the first principal axis
Since the first principal axis of correspondence analysis provides a scaling of the students which has maximum discriminability (as quantified by inertia), it would be interesting to see how this scaling compares to the scaling provided by the total marks, which is usually taken as the best summary of the students' achievements.

Suppose that f_i is the first principal co-ordinate of the ith student (i = 1 ... I), and that g_q+, g_q− are the first principal co-ordinates of the pair of points corresponding to question q (q = 1 ... Q). The transition formula from columns to rows is:

    f_i = (1/µ) Σ_q {y_iq g_q+ + (t_q − y_iq) g_q−}/t·        (9.6.1)

where µ is the square root of the first principal inertia, t_q is the maximum mark for question q and t· is the maximum total mark, 100 in this case (we use the same notation as in Section 6.1). The minimum value of the f_i, when all marks y_iq (q = 1 ... Q) are zero, is:

    f_min = (1/µ) Σ_q t_q g_q−/t·

In order to scale the f_i to lie between 0 and t·, as a more convenient comparison with the total mark which also lies between 0 and t·, it is clear that we first need to subtract f_min from all the f_i, which gives:

    f_i − f_min = (1/µ) Σ_q y_iq(g_q+ − g_q−)/t·        (9.6.2)

and then multiply f_i − f_min by a constant l so that the maximum value of
TABLE 9.8
Marks obtained by a class of students in an actual examination, showing the maximum mark possible for each question, the average mark for each question across the class and the total mark of each student. The students are identified by their positions in decreasing order of total mark.

Student   Q1   Q2   Q3   Q4   Q5   Q6   Q7   Q8   Total
1 3 7 9 10 7 5 8 43 92
2 2 6 10 9 8 5 8 43 91
3 3 7 8 12 8 5 8 39 90
4 3 8 9 8 7 5 5 45 90
5 1 7 10 9 5 5 5 45 87
6 3 6 10 6 8 5 8 41 87
7 3 6 10 12 7 5 O 43 86
8 1 8 10 9 8 5 8 36 85
9 3 8 9 10 6 5 6 36 83
10 3 7 10 6 6 5 5 40 82
11 3 8 9 6 8 5 8 32 79
12 3 8 10 9 8 5 6 29 78
13 1 7 9 9 6 5 8 31 76
14 3 6 10 12 5 5 7 28 76
15 2 7 8 10 8 5 7 26 73
16 1 6 10 8 6 5 8 29 73
17 3 8 10 8 8 5 8 22 72
18 3 8 7 6 7 5 8 28 72
19 3 6 8 9 7 5 7 25 70
20 1 7 9 9 6 4 O 33 69
21 3 8 4 9 8 5 8 24 69
22 3 6 O 8 8 5 3 35 68
23 3 8 10 13 6 5 8 15 68
24 O 6 8 3 5 3 8 34 67
25 3 6 2 6 8 5 6 30 66
26 3 5 10 3 7 5 7 24 64
27 1 6 10 6 6 5 8 21 63
28 3 7 8 O 7 5 7 23 60
29 2 4 10 9 7 5 O 22 59
30 3 6 4 8 7 5 6 19 58
31 3 7 9 O 5 5 6 20 55
32 1 8 1 6 2 2 8 27 55
33 3 4 2 5 8 5 O 26 53
34 2 4 O 5 2 5 6 28 52
35 1 5 4 3 O 5 7 21 46
36 1 5 2 6 O 5 O 26 45
37 O 6 1 O 5 5 O 28 45
38 O 4 7 6 O 5 O 10 32
Average 2.2 6.5 7.3 7.2 6.1 4.9 5.7 29.7 69.6
Question q:      1    2    3    4    5    6    7    8
Maximum t_q:     3    8   10   13    8    5    8   45

FIG. 9.9. Optimal 2-dimensional display, by correspondence analysis, of the data matrix of Table 9.8, doubled column-wise with respect to the maximum mark for each question; point q+ indicates the original qth column and q− its doubled counterpart, q = 1 ... 8. The vector of total marks and its doubled counterpart are represented as supplementary column points T+ and T− respectively.

... who receive less than full marks and in this way fail to achieve what most of the others succeed in achieving, are penalized more than their loss of marks implies. In a similar fashion, if the students had mostly obtained 0 for a question, then the student who does manage to gain marks receives more than his actual marks gained, since he succeeds where most others have failed. This is one of the features which distinguishes the correspondence analysis scaling from other scalings of such data.

The doubled profile of the total marks can be represented as two supplementary points in the same space as the doubled questions (Fig. 9.9). The correlation between the total mark profile vector and the first principal axis (i.e. the new linear combination of the marks defined above) is found to be 0.982, implying an angle of just under 11 degrees.
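The doubling and this correlation can be reproduced from the marks of Table 9.8 (a sketch, not the book's own computation):

```python
import numpy as np

t = np.array([3, 8, 10, 13, 8, 5, 8, 45], dtype=float)  # maximum marks t_q
Y = np.array([  # marks of the 38 students on the 8 questions (Table 9.8)
    [3, 7, 9, 10, 7, 5, 8, 43], [2, 6, 10, 9, 8, 5, 8, 43], [3, 7, 8, 12, 8, 5, 8, 39],
    [3, 8, 9, 8, 7, 5, 5, 45], [1, 7, 10, 9, 5, 5, 5, 45], [3, 6, 10, 6, 8, 5, 8, 41],
    [3, 6, 10, 12, 7, 5, 0, 43], [1, 8, 10, 9, 8, 5, 8, 36], [3, 8, 9, 10, 6, 5, 6, 36],
    [3, 7, 10, 6, 6, 5, 5, 40], [3, 8, 9, 6, 8, 5, 8, 32], [3, 8, 10, 9, 8, 5, 6, 29],
    [1, 7, 9, 9, 6, 5, 8, 31], [3, 6, 10, 12, 5, 5, 7, 28], [2, 7, 8, 10, 8, 5, 7, 26],
    [1, 6, 10, 8, 6, 5, 8, 29], [3, 8, 10, 8, 8, 5, 8, 22], [3, 8, 7, 6, 7, 5, 8, 28],
    [3, 6, 8, 9, 7, 5, 7, 25], [1, 7, 9, 9, 6, 4, 0, 33], [3, 8, 4, 9, 8, 5, 8, 24],
    [3, 6, 0, 8, 8, 5, 3, 35], [3, 8, 10, 13, 6, 5, 8, 15], [0, 6, 8, 3, 5, 3, 8, 34],
    [3, 6, 2, 6, 8, 5, 6, 30], [3, 5, 10, 3, 7, 5, 7, 24], [1, 6, 10, 6, 6, 5, 8, 21],
    [3, 7, 8, 0, 7, 5, 7, 23], [2, 4, 10, 9, 7, 5, 0, 22], [3, 6, 4, 8, 7, 5, 6, 19],
    [3, 7, 9, 0, 5, 5, 6, 20], [1, 8, 1, 6, 2, 2, 8, 27], [3, 4, 2, 5, 8, 5, 0, 26],
    [2, 4, 0, 5, 2, 5, 6, 28], [1, 5, 4, 3, 0, 5, 7, 21], [1, 5, 2, 6, 0, 5, 0, 26],
    [0, 6, 1, 0, 5, 5, 0, 28], [0, 4, 7, 6, 0, 5, 0, 10]], dtype=float)

total = Y.sum(axis=1)                   # total marks (out of 100)
X = np.hstack([Y, t - Y])               # doubled matrix, 38 x 16

P = X / X.sum()
r, c = P.sum(axis=1), P.sum(axis=0)     # row masses are uniform (row sums = 100)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sig, Vt = np.linalg.svd(S, full_matrices=False)
f1 = sig[0] * U[:, 0] / np.sqrt(r)      # first principal co-ordinates of students

corr = np.corrcoef(f1, total)[0, 1]
assert abs(corr) > 0.9                  # the book reports 0.982
```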
l(f_i − f_min), when all the individual marks are maximum, y_iq = t_q (q = 1 ... Q), is t·:

    l(1/µ) Σ_q t_q(g_q+ − g_q−)/t· = t·

i.e.:

    l = µ t·²/Σ_q t_q(g_q+ − g_q−)        (9.6.3)

In which respects could the scaling obtained by correspondence analysis be different from the scaling by the total mark? In view of our discussion in Section 6.1 it is clear that the correspondence analysis will be sensitive to the polarization of the marks on each question, whereas this feature is not taken into account at all by the total mark. Figure 9.9 shows the correspondence analysis of the doubled data matrix. The linear combination of the individual marks y_iq (q = 1 ... Q) to obtain the position on the redefined scale, with minimum at 0 and maximum at t· = 100, is, from (9.6.2) and (9.6.3),

Focusing the total mark vector on the first principal axis
In most cases we shall want to accept the total mark as the best overall summary of the students' ability. Then the question of interest is how much variation and what sort of patterns are in the data which are uncorrelated with the total mark. In Fig. 9.9 the dimension defined by the total mark is lined up quite well with the first principal axis of the correspondence analysis, but we really want to force the first principal axis to coincide with this dimension. A convenient way of doing this is to introduce the doubled total mark into the analysis as a pair of included variables T+ and T−, and then to increase their mass (if necessary) until the first axis is totally aligned with this dimension. Notice that this strategy of focusing affects neither the centroid of the display nor the relative values of the inertias of the component questions (cf. (6.1.4)).

If T+ and T− are assigned masses which sum to 1, then their inertia in
the full space is computed as 0.1009, roughly a third of the total inertia (0.3020) of the 8 doubled questions. When the individual marks and the total marks are analysed simultaneously, the set of 8 questions receives half the mass it had before and the total mark also receives half the mass on which the above inertia calculation is based. Because the centroid has not changed, the inertias of the individual questions are halved, hence stay in the same proportion, so that the sum of the inertias of the 8 questions is 0.1510, and the inertia of T+ and T- is similarly 0.0505. This gives a total inertia of 0.1510 + 0.0505 = 0.2015. Because the profiles T+ and T- are averages of the respective sets of profiles 1+...8+ and 1-...8-, it is clear that the introduction of the total marks into the analysis can only decrease the total inertia of the cloud of points.

It can be similarly argued that the first principal inertia (0.1055) in the combined analysis must be less than the first principal inertia (0.1116) of the analysis of the questions alone. Of course the percentage of inertia represented by the first axis can increase, as it does in this example: the introduction of T+ and T- increases the percentage of inertia from 37.0% to 52.4% because of the high correlation between the total mark and the first axis. Geometrically this is obvious because mass is being concentrated very close to the first axis.

The focusing, which has increased the correlation with the total mark dimension from 0.982 to 0.998, generally decreases the correlations with the individual questions, except for questions 4 and 8 which have highest maximum marks.

In order to investigate the subsequent axes for possible patterns orthogonal to the total mark, we consider the display of the points with respect to axes 2 and 3 of the focused analysis (Fig. 9.10). The first 5 questions play minor roles in this display and this plane thus shows the variation amongst the students in tackling the last 3 questions. For example, students ranked 15, 17, 23, 26 and 27 performed relatively badly on question 8. The large distance between 7+ and 7- is chiefly due to students 7 and 20 who obtained 0 for this

[Fig. 9.10: axes λ₂ = 0.0301 (15.0%) and λ₃ = 0.0211 (10.8%).]

FIG. 9.10. Display with respect to the second and third principal axes of the correspondence analysis of the doubled matrix of marks, where the first principal axis has been focused on the total mark. (In other words, this is the optimal 2-dimensional display orthogonal to the dimension defined by the total marks; the contribution of the total mark vector to this display is (almost) zero.) Only the students which have prominent correlations with this plane are indicated. Notice that the question points are allocated a total mass of 1/2 in this particular focusing, so that these principal inertias should be multiplied by 2 for comparison with those of the unfocused analysis, assuming a zero contribution by the points T+ and T-.
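The construction of the doubled matrix and of the focusing pair T+ and T- can be sketched in a few lines. The marks, the question maxima and the weight w below are hypothetical, and no correspondence analysis routine is shown: the sketch only builds the matrix such an analysis would receive.

```python
import numpy as np

# Sketch of "doubling" a marks matrix and appending a focused total-mark
# pair T+/T- (hypothetical marks and maxima; w is an illustrative weight).
marks = np.array([[8.0, 15.0, 4.0],   # students x questions
                  [3.0,  7.0, 9.0],
                  [6.0, 12.0, 2.0]])
t_q = np.array([10.0, 20.0, 10.0])    # maximum mark per question

doubled = np.hstack([marks, t_q - marks])   # columns q+ followed by q-

# Focusing: append the doubled total mark, weighted by w >= 1, so that the
# first principal axis lines up with the total-mark dimension.
w = 2.0
t_plus = marks.sum(axis=1, keepdims=True)
t_minus = t_q.sum() - t_plus
focused = np.hstack([doubled, w * t_plus, w * t_minus])

# Every row of the doubled part sums to the same total t. = sum(t_q),
# which is what makes correspondence analysis of such data sensible.
print(doubled.sum(axis=1))   # -> [40. 40. 40.]
```

Increasing w concentrates more mass on the total-mark pair without moving the centroid, which is the mechanism the focusing strategy relies on.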
TABLE 9.9
Estimated protein consumption, in g per head per day, in 25 countries and from 9 protein sources (from Weber, 1973, Agrarpolitik im Spannungsfeld der internationalen Ernaehrungspolitik, Kiel: Institut fuer Agrarpolitik und Marktlehre (mimeo)).

                                MEAT  PIPL  EGGS  MILK  FISH  CERS  STAR  NUTS  FRVG
Albania             ALBA        10.1   1.4   0.5   8.9   0.2  42.3   0.6   5.5   1.7
Austria             AUST         8.9  14.0   4.3  19.9   2.1  28.0   3.6   1.3   4.3
Belgium/Luxembourg  BELX        13.5   9.3   4.1  17.5   4.5  26.6   5.7   2.1   4.0
Bulgaria            BULG         7.8   6.0   1.6   8.3   1.2  56.7   1.1   3.7   4.2
Czechoslovakia      CZEC         9.7  11.4   2.8  12.5   2.0  34.3   5.0   1.1   4.0
Denmark             DENM        10.6  10.8   3.7  25.0   9.9  21.9   4.8   0.7   2.4
East Germany        EGER         8.4  11.6   3.7  11.1   5.4  24.6   6.5   0.8   3.6
Finland             FINL         9.5   4.9   2.7  33.7   5.8  26.3   5.1   1.0   1.4
France              FRAN        18.0   9.9   3.3  19.5   5.7  28.1   4.8   2.4   6.5
Greece              GREE        10.2   3.0   2.8  17.6   5.9  41.7   2.2   7.8   6.5
Hungary             HUNG         5.3  12.4   2.9   9.7   0.3  40.1   4.0   5.4   4.2
Ireland             IREL        13.9  10.0   4.7  25.8   2.2  24.0   6.2   1.6   2.9
Italy               ITAL         9.0   5.1   2.9  13.7   3.4  36.8   2.1   4.3   6.7
Netherlands         NETH         9.5  13.6   3.6  23.4   2.5  22.4   4.2   1.8   3.7
Norway              NORW         9.4   4.7   2.7  23.3   9.7  23.0   4.6   1.6   2.7
Poland              POLA         6.9  10.2   2.7  19.3   3.0  36.1   5.9   2.0   6.6
Portugal            PORT         6.2   3.7   1.1   4.9  14.2  27.0   5.9   4.7   7.9
Rumania             RUMA         6.2   6.3   1.5  11.1   1.0  49.6   3.1   5.3   2.8
Spain               SPAI         7.1   3.4   3.1   8.6   7.0  29.2   5.7   5.9   7.2
Sweden              SWED         9.9   7.8   3.5  24.7   7.5  19.5   3.7   1.4   2.0
Switzerland         SWIT        13.1  10.1   3.1  23.8   2.3  25.6   2.8   2.4   4.9
United Kingdom      UK          17.4   5.7   4.7  20.6   4.3  24.3   4.7   3.4   3.3
Russia              USSR         9.3   4.6   2.1  16.6   3.0  43.6   6.4   3.4   2.9
West Germany        WGER        11.4  12.5   4.1  18.8   3.4  18.6   5.2   1.5   3.8
Yugoslavia          YUGO         4.4   5.0   1.2   9.5   0.6  55.9   3.0   5.7   3.2
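The χ²-distance that correspondence analysis uses between row profiles can be illustrated on two rows of Table 9.9. This sketch uses only the Portugal and Spain rows, with the average of the two profiles standing in for the column masses of the full table (the analysis proper uses all 25 rows).

```python
import numpy as np

# Portugal and Spain rows of Table 9.9 (g protein per head per day).
port = np.array([6.2, 3.7, 1.1, 4.9, 14.2, 27.0, 5.9, 4.7, 7.9])
spai = np.array([7.1, 3.4, 3.1, 8.6,  7.0, 29.2, 5.7, 5.9, 7.2])

p = port / port.sum()                     # row profiles (dietary preferences)
s = spai / spai.sum()
c = (port + spai) / (port + spai).sum()   # stand-in for the column masses

# The chi-squared distance divides each squared profile difference by the
# column mass, boosting rare sources relative to dominant ones (cereals).
chi2 = np.sqrt(((p - s) ** 2 / c).sum())
eucl = np.sqrt(((p - s) ** 2).sum())
print(chi2 > eucl)   # -> True: every term is inflated by 1/c_j > 1
```

This is the sense in which the χ²-metric "tries to correct for the large differences between highly consumed proteins" described in the Method subsection below.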
question. Students 24 and 32, who stumbled on question 6 (where everyone else had done so well), account for the large contribution of this question on the third axis. Notice the position of student 29 who did well on 6, badly on 7 and not too well on 8, compared with the average across the class. This plane thus shows the students who did not follow the average pattern for the last 3 questions. This information could be useful, for example, if there seems to have been insufficient time to complete the examination, in which case the lecturer might consider the marks obtained by the students for questions 1 to 5 and questions 6 to 8 separately in his assessment of the class.

9.5.3 Discussion

The chief reason, if not the only reason, for an examination is to arrive at an ordering of the students. If the examination has been carefully constructed with marks allocated in terms of the importance of different sections of the syllabus, then the total mark certainly provides that ordering. However, given a specific set of results, the total mark is almost certainly not the most discriminating linear combination of the marks in any specific statistical sense. Correspondence analysis of the doubled matrix of marks is a technique for identifying a linear combination of the marks which maximizes a measure of discrimination between students. Notice that we are not saying that this is a more equitable way of combining the marks together in order to obtain a total mark, but rather that this is a different way, based on the global set of marks of the actual class, and that it might be of interest to study these results to understand more fully the way the examination has tested the students. The strategy of focusing also helps in the understanding of more subtle features in the marks which are uncorrelated with the usual ordering of the students in terms of their total marks.

9.6 PROTEIN CONSUMPTION IN EUROPE AND RUSSIA

In this application we compare the results of correspondence analysis with those of principal components analysis, using the same data set. The question of internal stability of the displays is discussed, and it is shown why a very highly contributing row or column is best treated as a supplementary point.

9.6.1 Data

These data are estimated protein consumptions from 9 different sources, by inhabitants of 25 countries (Table 9.9). Here the data are neither contingency nor frequency in nature, as in the previous applications, but they are analogous to frequencies in that a total mass of protein is distributed over the cells of the matrix in units of 0.1 g (per head per day).

9.6.2 Method

Two analyses are performed on these data:

Analysis 1. A principal components analysis on the data centred with respect to the column means (Appendix A, Table A.1(1)). When a data set involves measurements on different scales, they are usually pre-standardized so that each vector of measurement has unit variance. However, the scale of measurement is the same throughout the present table and rescaling seems unnecessary, although it will become apparent that the largest protein sources do play an overwhelming role in the analysis.

Analysis 2. A correspondence analysis. A row point thus represents the profile of protein consumption in the particular country. The total consumption is not used in the point's position but rather as a mass to weight the point. It is thus not the absolute amounts but the dietary preferences which are displayed, while the χ²-distance (between countries, say) tries to correct for the large differences between highly consumed proteins.

9.6.3 Results and interpretation

Tables 9.10 and 9.11 list the complete numerical results from both analyses. Notice that all co-ordinates are principal co-ordinates, including the "co-ordinates" of the proteins in the principal components analysis (standard computer packages usually give standard co-ordinates, sometimes called "standardized scores"). In both analyses a few points contribute substantially to the major principal axes. We first investigate the internal stabilities of the displays before proceeding with their interpretation.

Principal components analysis

The display of the countries with respect to the first principal plane is shown in Fig. 9.11 (the arrows are explained later). Bulgaria and Yugoslavia contribute the most to the first principal axis, 0.183 x 149.0 = 27.3 and 0.178 x 149.0 = 26.5 respectively. Because there is such a large difference between the first and second principal inertias (variances), namely 149.0 - 29.5 = 119.5, it is clear that the first axis would not rotate very much if either of these points were removed. To put an upper bound on the angle of rotation of the first axis if Bulgaria, say, were removed, the quantity h of (8.1.6) is first computed as: h = (1/0.96)(0.134 x 209.7)/(149.0 - 29.5) = 0.245. (Notice that
TABLE 9.10
Decomposition of inertia (variance) in the principal components analysis of Table 9.9, in a similar format to that of a correspondence analysis (cf. Table 9.11), for the first two principal axes. Quantities which are multiplied by 1000 or expressed as permills (thousandths) are indicated by x1000 or ‰ respectively. Notice that the co-ordinates have not been multiplied by 1000 in this case. The total variance is 209.7 and the first two principal variances are 149.0 (71.1%) and 29.5 (14.1%) respectively.

(b) Name  QLT  MASS  INR    K=1  COR  CTR    K=2  COR  CTR
 1  MEAT  363   111   51    1.8  315   23    0.7   48   18
 2  PIPL  195   111   62    1.6  191   17    0.2    4    2
 3  EGGS  573   111    6    0.8  562    5    0.1   11    0
 4  MILK  976   111  231    5.2  556  181    4.5  420  690
 5  FISH  442   111   53    1.6  216   16   -1.6  226   85
 6  CERS  997   111  551  -10.5  955  741    2.2   42  165
 7  STAR  326   111   12    0.8  260    4   -0.4   66    6
 8  NUTS  549   111   18   -1.4  511   13   -0.4   38    5
 9  FRVG  290   111   15   -0.2   20    0   -0.9  270   29
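The two analyses of this section can be sketched in a few lines of code. The random matrix below merely stands in for Table 9.9, and the SVD-based computations follow the usual textbook definitions (column-centred PCA; correspondence analysis of the standardized residuals), not any particular package.

```python
import numpy as np

rng = np.random.default_rng(0)
N = rng.uniform(1.0, 20.0, size=(25, 9))      # stand-in for Table 9.9

# Analysis 1: PCA on the column-centred matrix (no standardization).
Y = N - N.mean(axis=0)
sv = np.linalg.svd(Y, compute_uv=False)
variances = sv ** 2 / N.shape[0]              # principal variances

# Analysis 2: CA -- rows become profiles weighted by their masses, and
# the principal inertias come from the matrix of standardized residuals.
P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
inertias = np.linalg.svd(S, compute_uv=False) ** 2

# The squared singular values decompose the total inertia exactly.
print(np.isclose(inertias.sum(), (S ** 2).sum()))   # -> True
```

The contrast between the two decompositions (raw variances versus mass-weighted inertias) is exactly the "size" versus "shape" distinction drawn later in the discussion.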
TABLE 9.11
Decomposition of inertia in the correspondence analysis of Table 9.9, for the first two principal axes. The information for the third principal axis is also given for future reference, but this has not been included in the quality (QLT) of the points' planar display. The total inertia is 0.1690 and the first three principal inertias are 0.0865 (51.2%), 0.0390 (23.1%) and 0.0200 (11.8%) respectively.

(a) Name  QLT  MASS  INR  K=1  COR  CTR  K=2  COR  CTR  K=3  COR  CTR
(b) Name  QLT  MASS  INR  K=1  COR  CTR  K=2  COR  CTR  K=3  COR  CTR
[Fig. 9.11: scatterplot of the 25 countries and 9 protein sources in the first principal plane; scale bar 0.1.]

FIG. 9.11. Optimal 2-dimensional display, by principal components analysis, of the data matrix of Table 9.9 which has been centred with respect to column means. The lines emanating from each point indicate changes in position when the analysis is repeated using the data for columns MILK and CEREALS only.
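The rotation-bound arithmetic used in this stability discussion can be re-done directly. Only the numbers quoted in the text are used (total variance 209.7, λ₁ = 149.0, λ₂ = 29.5, Bulgaria's relative inertia 0.134 and the factor 1/0.96); the general formulas (8.1.6) to (8.1.10) are in Chapter 8 and are not reproduced here.

```python
import math

# Re-computing the quantity h of (8.1.6) for the removal of Bulgaria,
# using only the numbers quoted in the text (arithmetic check only).
total_variance = 209.7
lam1, lam2 = 149.0, 29.5

h = (1 / 0.96) * (0.134 * total_variance) / (lam1 - lam2)
print(round(h, 3))         # -> 0.245, as quoted

# The angle behind the tighter bound: cos^2(theta_41) = 0.967 in
# Table 9.10(a) corresponds to an angle of about 10.5 degrees.
theta41 = math.degrees(math.acos(math.sqrt(0.967)))
print(round(theta41, 1))   # -> 10.5
```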
the inertia of the 4th point is 0.134 x total inertia, this being equal to w₄(λ₁ + λ₂ + ...) in (8.1.6).) Hence, by (8.1.8), φ < 7.1°. To evaluate the tighter bound we first compute θ₄₁ to be 10.5°, since cos² θ₄₁ = 0.967 in Table 9.10(a). Hence, by (8.1.10), φ < 3.3°. Clearly the first axis is internally stable.

On the second axis the point Portugal contributes 0.381 of the inertia, namely 0.381 x 29.5 = 11.2. The inertia of Portugal which is along axes 2, 3, ... etc. is its total inertia 0.063 x 209.7 = 13.2 minus that part along the first axis 0.002 x 149.0 = 0.3. Since λ₃ = 15.0, h is evaluated as 0.927 and the rough upper bound for φ (if Portugal were removed) is 34°, with the tighter bound of 31°. Although the second axis would undergo a substantial rotation, it would not be enough to label the axis unstable. Notice that the points discussed above lie very close to the principal axes mentioned, so that there is no need to consider the possibility of "diagonal" spatial rotations of the plane.

From Table 9.10(b) it can be seen that variables milk and cereals play an overwhelming role in determining the principal plane: their joint contributions to the plane are proportions 0.922 and 0.855 of the respective principal inertias. Reasoning as before, we would expect that the principal plane would hardly change if the analysis were repeated using just these two sources of protein as variables. The arrows in Fig. 9.11 show the approximate movements of the points when this is done, and the change in the configuration is minimal. This illustrates how the most consumed proteins dominate the principal components analysis through their high variance. Some type of standardization is needed, or alternatively, the columns of high magnitude can be downweighted, possibly in steps, so that patterns of a more multidimensional nature come into view.

Correspondence analysis

The correspondence analysis display of Fig. 9.12 only represents the differ-

FIG. 9.12. Optimal 2-dimensional display, by correspondence analysis, of Table 9.9. Notice that the co-ordinates on the second axis are opposite in sign to those in Table 9.11. We have reversed the second axis in the display to facilitate comparison with Fig. 9.11.
[Fig. 9.13: λ₁ = 0.0900 (58.5%), λ₂ = 0.0248 (16.1%); scale bar 0.1.]

FIG. 9.13. Optimal 2-dimensional display, by correspondence analysis, of Table 9.9, with Portugal excluded but displayed as a supplementary point.

[TABLE 9.12: decomposition of inertia for the analysis with Portugal as a supplementary point; entries not recoverable.]
ences in "shape" of the data vectors, while the principal components analysis 'O
::J
~NM~OM~~~~~~~~mU')N~mMM~OOM
M~~~~~M~~~~~~~M~~MM~~~M~
display represents both "size" and "shape". Whereas the point Portugal Ü
C
contributed highly to the second axis of Fig. 9.11 because of Portugal's
ro
generally low protein consumption (a feature of "size"), in Fig. 9.12 it has a (;
very high contribution to the second axis (51.8 %, see Table 9.11(a)) because
I~
'+ l~mooNmMO~U')~O~~~OU')M~N~moo~o
ro ~~oo~~~~~m~mOU')m~~MOU')ON~U')
of its unusually high consumption of fish (a feature of "shape"). The stability +-'
Q)
oooo~m~OOMU')M~OO~~OOOONmMm~~U')mm
of the second axis with respect to the removal of Portugal can be investigated C
'+
as before. Condition (8.1.5) is not satisfied in this case, so it seems like1y that o
C Q)
the second axis could be unstable. o F. ~f-X~U~C:::zw~ I~~~
.¡:;
.0 0f- c:::C:::O
The correspondence analysis is thus repeated with Portugal as a supple 'Vi
o Z oo~~~wZw~~WZ~~f-C:::~~~w- ~w~
mentary point (Fig. 9.13 and Table 9.12). This 2-dimensional display is now Q. ~~W~NW~~C:::C:::~~~woo~a..~~~~~~
E ~~ooooUOw~~~I __ ZZa..c:::~~~~~ r
stable and provides an excellent "protein map" of the countries, with c1earIy o ~NM~U')~~oomO~NM~U')~oomO~NM~U')
ü
Q) ,....-..--,....-,....-,....-..--..--,....-..--NNNNNN
defined and well separated regions corresponding to southern Europe, o
eastern Europe and northern/central Europe.
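Projecting an excluded row as a supplementary point uses the usual transition formula: the supplementary profile is multiplied by the standard column co-ordinates of the existing solution, so it is placed on the axes without having influenced them. A sketch, with a random stand-in matrix rather than the protein data:

```python
import numpy as np

rng = np.random.default_rng(1)
N = rng.uniform(1.0, 20.0, size=(24, 9))   # reduced matrix (one row excluded)

P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

K = 2
gamma = Vt[:K].T / np.sqrt(c)[:, None]     # standard column co-ordinates

# Active rows: principal co-ordinates via profile x standard co-ordinates.
F = (P / r[:, None]) @ gamma

# Supplementary row (e.g. Portugal): same transition formula, but the
# point contributes no inertia to the axes it is projected onto.
supp = rng.uniform(1.0, 20.0, size=9)
f_supp = (supp / supp.sum()) @ gamma
print(f_supp.shape)   # -> (2,)
```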
9.6.4 Discussion

Figure 9.13 shows more variation amongst the non-Mediterranean countries than Fig. 9.12, thanks to the removal of Portugal as a contributing point. In Fig. 9.12 the second axis is more than 50% determined by Portugal and, in particular, by a single element of Portugal's profile, the fish consumption. It is preferable to remove the influence of this obvious and isolated feature from the display so that more subtle multidimensional patterns can be investigated.

Principal components analysis and correspondence analysis treat the data differently and it is thus unfair to judge either as being better. However, these data are definitely ratio measurements, as opposed to interval measurements, and we do feel that correspondence analysis is better suited to ratio data. Gabriel (1981) illustrates the biplot using these data, and his analysis is a variation of principal components analysis which we call the "covariance biplot" (see Appendix A, Table A.1(4)). Here the columns rather than the rows are scaled by the singular values. As in Fig. 9.11, it is difficult to separate the differences in "size" (absolute protein consumption) and "shape" (relative consumption) of the data vectors, whereas correspondence analysis concentrates on shape patterns only. It is convenient in correspondence analysis to represent the "size" (i.e. mass in this case) in the form of the size, for example, of the displayed point.

9.7 SERIATION OF THE WORKS OF PLATO

This application shows how well the bootstrapping of a correspondence analysis display can agree with more conventional statistical analysis of a contingency table.
[Figs 9.14 and 9.15: λ₁ = 0.0917 (69.0%); work labels include REP., LAWS, PHIL., CRIT., POL., SOPH. and TIM.; scale bars 0.1.]

FIG. 9.14. Optimal 2-dimensional display, by correspondence analysis, of the Plato data (Cox and Brandwood, 1959). The rows are labelled by the decimal equivalent of the 5-syllable sentence endings considered as binary numbers. For example, the sentence ending of 1 long syllable, 3 short and 1 long, is coded as 10001 in binary, hence 17 in decimal.

FIG. 9.15. Bootstrapped display of the columns (works of Plato) of Fig. 9.14; replicated points are labelled as follows: 1, Republic; 2, Laws; 3, Critias; 4, Philebus; 5, Politicus; 6, Sophist; 7, Timaeus.
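The bootstrap behind a display like Fig. 9.15 can be sketched as follows: each column (work) is resampled multinomially with its own sentence total and observed profile, and each replicated profile would then be projected onto the original axes as a supplementary point (the projection step is omitted here). The table below is a random stand-in, not the Plato data.

```python
import numpy as np

rng = np.random.default_rng(2)
table = rng.integers(5, 50, size=(32, 7)).astype(float)   # stand-in counts

def bootstrap_columns(tab, rng):
    """One bootstrap replicate: resample each column's counts from a
    multinomial with that column's total and observed profile."""
    boot = np.empty_like(tab)
    for j in range(tab.shape[1]):
        n_j = int(tab[:, j].sum())                # sentences in work j
        boot[:, j] = rng.multinomial(n_j, tab[:, j] / tab[:, j].sum())
    return boot

rep = bootstrap_columns(table, rng)
# Each replicated column keeps the original column total by construction.
print((rep.sum(axis=0) == table.sum(axis=0)).all())   # -> True
```

Repeating this a few hundred times and projecting every replicate gives the clouds of replicated points whose spread is read as a variability estimate.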
... errors for each of the intermediate works. The term "seriation", a synonym for ordination (cf. Section 4.2), is often used in the context of archaeological and historical data, especially when the ordering is presumed to be temporal (see Kendall, 1971).

9.7.2 Method and results

Figure 9.14 shows the 2-dimensional correspondence analysis of these data. This display was bootstrapped in the way described previously (Section 8.1) and Fig. 9.15 shows the replicates of the works (columns) only. Notice that we are resampling from what is essentially population data (cf. the justifica- ...

... other, where the optimal scores have been rescaled to have the same variance as Cox and Brandwood's scores, the score for each work being weighted by the number of sentences in the work. For each score, 2 standard errors are indicated in the case of Cox and Brandwood's analysis, while 2 standard deviations are indicated in the case of the bootstrapped correspondence analysis. The concordance, both of the scores and their variabilities, is astounding and the only clear difference is that correspondence analysis gives a lower score for Politicus.

9.7.3 Discussion
TABLE 9.13
Frequencies of antelope tribes in African wildlife areas (Greenacre and Vrba, 1984); the rows and columns have been rearranged according to the first principal co-ordinates (e.g. the ordering in Fig. 9.17). For census sources see Vrba (1980).

Total  232973  515631  36462  18536  47023  13867  9911  241773  99290  1215466
9.8.1 Data

The original data of this study are observed numerical frequencies of antelope ...

[Figs 9.16/9.17: λ₁ = 0.645 (40.1%); Axis 2: λ₂ = 0.3852; Axis 3: λ₃ = 0.3996.]

[Fig. 9.18: wildlife areas (including MANYARA, ETOSHA, KALAHARI, LAKE TURKANA, WANKIE, KAFUE, BICUAR, QUIÇAMA, CUELEI, LUANDO) with supplementary category points for bushcover (LOW/MEDIUM/HIGH) and longitude (WEST/CENTRAL); scale 0.5.]

FIG. 9.18. Optimal 2-dimensional display, by correspondence analysis, of the area-corrected data of Table 9.13. Dummy variables representing categories of bushcover and longitude are also displayed as supplementary column points. Five supplementary row profiles, derived from frequencies of antelope fossil skulls in 5 fossil sites, are also displayed as supplementary points (indicated by F). Notice that Fig. 9.17 is the projection of this display onto the first (horizontal) axis.

... but slightly lower in the tribal analysis, as expected by the results of Example 7.5.4, the tribal frequencies being a condensation of the generic frequencies. Axis 2 of the generic analysis reflects a feature of generic variation within tribes which is obviously absent in the tribal analysis.

9.9 HLA GENE FREQUENCY DATA IN POPULATION GENETICS

This section is a summary of the analyses described by Greenacre and Degos (1977). This is a particularly nice illustration of a case where several principal axes can be interpreted.

9.9.1 Data

There are two sets of data, emanating respectively from the Fifth and Sixth International Histocompatibility Workshops (Bodmer et al., 1973; Bodmer, 1976) (histocompatibility = compatibility of tissue). These were two extensive surveys of human populations aimed principally at studying the so-called HLA chromosomic region (HLA = human leucocyte antigen) on human chromosome number 6. At the time of the Fifth Workshop, in 1972, the HLA complex was known to include 2 serologically defined genes, now called HLA-A (at locus A) and HLA-B (at locus B). Twelve alleles for HLA-A and 15 alleles for HLA-B had been identified, thus already establishing the extraordinary polymorphism of the complex. In this workshop an attempt to test all human populations was made. By the time of the Sixth Workshop, which aimed mainly at an intensive study of Caucasoid populations, knowledge had progressed to the extent of identifying a total of 15 alleles for HLA-A, 20 for HLA-B and 5 for an additional gene in the system, HLA-C. The populations studied and alleles tested in the Fifth Workshop are given in Table 9.14 (see Greenacre and Degos, 1977, for Sixth Workshop populations and alleles).
Because each person has a pair (maternal and paternal) of sixth chromosomes and because the HLA system is codominant, a typical HLA typing (or phenotype) might be:

    HLA-A3, A28
        -B8
        -CW2

(The "w" signifies a "workshop" allele which is still in the process of definition.) The interpretation of the A locus is clear since the person is definitely heterozygous with respect to this gene, i.e. the chromosomes are different at this locus, respectively A3 and A28. The fact that only one allele ...

In any case, given a random sample of individuals from a certain population and their HLA phenotypes it is not difficult to compute an estimate of the frequency of the different alleles (including the "blank") at each locus. The sum of these gene frequencies at a particular locus is 1. In the case of the Fifth Workshop populations we obtained the gene frequencies directly from the Joint Report (Bodmer et al., 1973), while for the Sixth Workshop populations we computed the frequencies from the raw data using random, unrelated individuals only.

[TABLE 9.14: populations studied and alleles tested in the Fifth Workshop; entries not recoverable.]

9.9.2 Method

The data suggest two correspondence analyses:

Analysis 1. Analysis of all the populations in both workshops with respect to the set of alleles defined in the Fifth Workshop (hence the alleles common to both workshops). Because of the overwhelming majority of Caucasoid populations in the Sixth Workshop we determined the principal axes of inertia using the data from the Fifth Workshop only, with the Sixth Workshop populations as supplementary elements. The allele AW33 and the "blank" alleles at both the loci under consideration were also made supplementary.
[Fig. 9.20: axis 3 (9.3% of inertia) horizontal, axis 4 (8.8% of inertia) vertical; same scale on both axes; the region encased by dotted lines is reduced here and enlarged in Fig. 9.21.]

FIG. 9.20. Display with respect to principal axes 3 and 4 of the HLA gene frequency data. The region encased by dotted lines is enlarged in Fig. 9.21.
Thus the main aspects of this data set are displayed as the separation of the Oceanic populations (first axis), particularly from the European and Negroid groups, and then the separation of the American Indian populations from the others (second axis).

FIG. 9.21. Enlarged display of the positive third principal axis, showing the Sixth Workshop Caucasoid populations as supplementary row points. The abbreviations are: N, Norway; E, England; U, USA; D, Denmark; G, Germany; W, Sweden; S, Switzerland; F, France; P, Netherlands; V, Austria; X, Finland; Y, Israel; C, Czechoslovakia; R, Russia; I, Italy; H, Hungary.

Third and fourth principal axes (Figs 9.20 and 9.21)

The percentages of inertia explained by these axes are respectively 9.3% and 8.8%. The proximity of these values suggests two phenomena of almost equal importance and the orientation of the principal axes is likely to be unstable. Therefore an interpretation is more valid in the plane rather than along the separate axes.

The plane clearly represents the spread of the populations which are situated on the positive half of the first principal axis of Fig. 9.19. Stretching diagonally across the plane (from bottom left-hand corner to top right), there is a clear spread of the Caucasoid samples. The main contrast is between Northern Europe and the two samples from Sardinia associated with alleles BW21 and B18. The positions of the Sixth Workshop European populations confirm the interpretation of this factor (see Fig. 9.21), and with very few exceptions the Mediterranean populations can be separated from the Northern European and English populations. Other Caucasoid populations are situated between these two extremes, the only notable tendency being the position of the Middle Eastern Caucasoids among the Southern Europeans. Notice the great spread of the United States samples in Fig. 9.21: this is indicative of the heterogeneity of the American people with respect to this genetic system.

Stretching out perpendicularly from the line of spread of the Caucasoids are all the Negroid samples, opposed chiefly to the English and Scottish samples. The alleles AW19.2 (sum of AW30 and AW31) and BW17 alone contribute a large amount to the inertia represented by this plane.

Remaining principal axes (not illustrated)

The fifth and sixth principal axes, explaining 6.9% and 5.4% of the total inertia respectively, still appear interesting, but they represent oppositions
among relatively few samples. Thus the fifth axis is principally an opposition between the samples of the coastal populations of New Guinea and Australia (aboriginal) owing chiefly to the allele A11, of very high frequency in the New Guinean sample. The sixth axis contrasts the samples from Easter Island and Lapland and, correspondingly, the alleles A11 and BW10 are opposed to BW35 and BW15. However, the orientation of these minor axes is found to be quite sensitive to changes in the mass accorded to the population sample points, and the relative importance of these phenomena must be judged with this reservation in mind (see discussion below).

In the second analysis (of the Sixth Workshop European Caucasoid populations) the principal inertias are much lower, indicating the relative genetic homogeneity of the European Caucasoids. The first principal axis also represents the contrast between northern and southern populations seen on the fourth axis of analysis 1 (Figs 9.20 and 9.21). The first principal inertia (0.029) is of the same order as the fourth principal inertia (0.025) of the first analysis, but slightly higher because of the additional alleles included in analysis 2. For further results and interpretation, see Greenacre and Degos (1977). Jenkins (1983) uses the same methodology in an analysis of gene frequency data of African populations.

The ordering of the principal axes in terms of the values of the principal inertias informally quantifies the heterogeneity of the groups of populations concerned. Thus the main ethnic differences appear along the major axes, followed by differences within ethnic groups. The minor axes represent contrasts between single populations which can suffer from internal instability … easier to obtain European Caucasoids than coastal New Guineans), and even less justified to assign masses proportional to the population size (China … clustering tree of Fig. 9.22. Slicing the tree at the χ²-distance of 1.08 results in 21 clusters, some of which (e.g. Lapps, North Vietnamese) are single populations. Each population's set of gene frequencies is then divided by the number of populations in the respective cluster, so that each cluster of populations receives an equal mass. The correspondence analysis of this (row-)reweighted matrix did not differ noticeably from Analysis 1 above with respect to the first four principal axes. However, different local features in the data appeared from the fifth axis, indicating internal instability of these axes (cf. Section 8.1). This justifies our halting of the interpretation at the fourth axis.

[Fig. 9.22: clustering tree of the population samples; only the sample labels (HOT, WPA, PUN, ICE, ENG, BUH, LAP, PIM, FIL, CHI, NGU, AYM, WAR, EPA) are recoverable from the diagram.]
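The equal-mass reweighting step described above can be sketched in a few lines; the gene frequencies and the cluster assignment below are invented for illustration only.

```python
import numpy as np

# Each population's row of gene frequencies is divided by the size of the
# cluster it belongs to, so that every cluster carries equal total mass.
freq = np.array([            # rows = populations, columns = alleles (invented)
    [0.30, 0.50, 0.20],
    [0.28, 0.52, 0.20],
    [0.10, 0.30, 0.60],
])
cluster = np.array([0, 0, 1])          # populations 0 and 1 form one cluster

sizes = np.bincount(cluster)           # number of populations per cluster
reweighted = freq / sizes[cluster][:, None]

# The two clustered populations are down-weighted by 1/2; the singleton
# keeps its full weight, so each cluster contributes the same total mass.
print(reweighted)
```

In a correspondence analysis of the reweighted matrix, the row masses of a large, homogeneous cluster no longer dominate the principal axes.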
The question of external stability and of statistical significance in this
context is an interesting one. In order to bootstrap the display of the
populations and the alleles, independent resampling with replacement must
be carried out within each sample of people from a given population and a replicate matrix of estimated gene frequencies computed. The sampling units in this study are the individuals themselves, not the populations. If a number of replicate matrices were derived, an idea of the variability of the population and allele points would be obtained by the projection of the replicated profiles onto the principal axes, as described in Section 8.1. If another type of scaling technique were used, perhaps with a different definition of inter-population genetic distance, then bootstrapping can again be performed, but each bootstrapped distance matrix might have to be re-analysed to obtain a new display which is then fitted to the original display (see discussion in Section 8.1). There is scope for some interesting work in this area, especially since there is a vast literature on genetic distances and their statistical properties (for example, Balakrishnan and Sanghvi, 1968; Jacquard, 1973; Edwards, 1971; Smith, 1977).
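The resampling scheme just described can be sketched as follows; the populations, allele codes and counts are invented, and only the resampling step is shown (the projection of replicate profiles onto the principal axes would follow as in Section 8.1).

```python
import numpy as np

rng = np.random.default_rng(0)

def replicate_frequencies(alleles_per_population, n_alleles):
    """One bootstrap replicate of the gene-frequency matrix.

    alleles_per_population: list of 1-D integer arrays of observed allele
    codes, one array per population (the individuals are the sampling
    units, resampled with replacement within their own population)."""
    rows = []
    for obs in alleles_per_population:
        boot = rng.choice(obs, size=len(obs), replace=True)
        counts = np.bincount(boot, minlength=n_alleles)
        rows.append(counts / counts.sum())   # estimated gene frequencies
    return np.vstack(rows)

# Invented example: two populations, three alleles coded 0, 1, 2.
pops = [np.array([0, 0, 1, 2, 2, 2]), np.array([1, 1, 1, 0, 2, 2])]
F_star = replicate_frequencies(pops, 3)      # one replicate matrix
print(F_star)
```

Repeating this many times gives a cloud of replicate profiles around each population point, from which the variability of the display can be judged.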
This application illustrates the analysis and interpretation of data which, like the data of Section 9.6, are measurements on a ratio scale. In the data set of interest there are a substantial number of missing values, which need to be imputed if all the rows and columns are to be displayed. The final results of the correspondence analysis illustrate how the displays may be used for informal classification of additional data. This section is a summary of the unpublished report by Greenacre (1974).

[FIG. 9.23. Measurements taken on the skull fossil of a cave bear in the studies of Marinelli (1931), Mottl (1933) and Cordy (1972); diagram according to Cordy (1972).]

9.10.1 Data

Cordy (1972) describes a collection of fossil skulls of cave bears gathered between the years 1829 and 1836 by the palaeontologist Schmerling in the province of Liège, Belgium. As a reference data set, Cordy uses the data of Marinelli (1931) on the cave bear skulls from the Drachenhöhle caves at Mixnitz, Austria, as well as the data of Mottl (1933) on the cave bear skulls from the Igrichöhle caves at Pest, Hungary. We shall use the same data sets
in our analysis, although we stress the obvious point that these do not represent an exhaustive sampling of cave bear skulls. The basic data sets thus comprise:

(1) 47 skulls from the Drachenhöhle,
(2) 77 skulls from the Igrichöhle,
(3) 12 skulls from the Schmerling collection.

Fortunately, the same measurements were made on all these skulls and these are indicated in Fig. 9.23. We ourselves introduced a new variable LI = LF - (LD + LM) and then omitted LF, which is now decomposed into three segments: LI, LD and LM. Similarly, LB was omitted because it is the sum of the facial and cerebral lengths: LB = LF + LC. Because the depth G of the … showed that it caused instability in the displays.

Amongst the skulls themselves, 4 of the Schmerling skulls were so damaged that more than half their measurements were missing, so these had to be omitted. This left a data matrix of order 132 x 17, given by Greenacre (1974).

[Fig. 9.24 display: principal inertias λ1 = 0.00052 (26.7%) and λ2 = 0.000481 (24.7%); scale 0.01; measurement labels WF, WC, WZ, HC, HF, WM, LS, LP, WI, LO, WT appear among the points; legend: small (female) skulls from Drachenhöhle, large (male) skulls from Drachenhöhle, small (female) skulls from Igrichöhle.]
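Correspondence analysis of such a measurement matrix compares row profiles under the χ²-metric (the choice of metric is discussed further below). A minimal sketch, with a 3 x 4 fragment of invented numbers standing in for skull measurements:

```python
import numpy as np

# Invented measurement matrix: rows = skulls, columns = measurements.
N = np.array([
    [430.0, 120.0, 80.0, 60.0],
    [415.0, 118.0, 79.0, 58.0],
    [390.0, 110.0, 70.0, 55.0],
])

P = N / N.sum()                      # correspondence matrix
r = P.sum(axis=1)                    # row masses
c = P.sum(axis=0)                    # column masses
profiles = P / r[:, None]            # row profiles (each row sums to 1)

def chi2_dist(i, j):
    """Chi-square distance between row profiles i and j:
    squared differences weighted by the inverse column masses."""
    d = profiles[i] - profiles[j]
    return np.sqrt(np.sum(d * d / c))

print(chi2_dist(0, 2))
```

Working with profiles rather than raw rows is what makes the display one of shape rather than of size, a point taken up in the discussion of this application.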
[FIG. 9.24. … skull fossil data, where the missing elements have been interpolated using the reconstitution formula (8.5.1), also in 2 dimensions. The dispersion of the skull …]

… tripled. We chose to use a "second-order approximation" in the imputation, that is the reconstitution formula of a 2-dimensional correspondence analysis. … Schmerling skulls may be predicted similarly by their positions in the display, and they can also be seen to be more similar to the Drachenhöhle skulls than the Igrichöhle skulls. More detailed interpretation of Fig. 9.24, as well as of other principal axes, is given by Greenacre (1974).

9.10.4 Discussion

Remember that correspondence analysis displays differences in the shape of the skulls, not in their size. In this application the shape differences on the first dimension are correlated with size differences (see Benzécri, 1977a, Section 3.7.2; 1978). By contrast, a principal components analysis of the data usually produces a first dimension of "size" and then dimensions orthogonal to (uncorrelated with) "size" (cf. Section 9.6).

The justification for using correspondence analysis on these data is again that the measurements do have a positive mass interpretation: the addition of two measurements is a physical reunion of quantities, as in Section 9.6. Exclusive use of the χ²-distance is not easily defined, but then any choice of metric has a certain ad hoc quality. The χ²-distance does have the advantage of the principle of distributional equivalence, which ensures a certain stability of the skull positions if variables are grouped or subdivided (Section 4.1.17).

As stated in Section 8.5, the imputation of the missing values is a strategy to complete the data matrix so that all rows and columns may be displayed, rather than an estimation of the missing data. We would thus restrain the interpretation of points which have a relatively high frequency of missing data. For example, two of the Schmerling skulls have measurement HI missing, which is an important variable on the second axis. The imputed values are fairly low, which contributes to their lying towards the positive side of the second axis. The positions of these particular skulls on the second axis should thus be interpreted with caution.

The informal classification described in this study is a nice fringe benefit of the graphical display. In the next section data are analysed with the specific purpose of performing classifications.

9.11 GRAPHICAL WEATHER FORECASTING

This section illustrates the material of Sections 5.4, 7.1 and 7.2, using meteorological data from a weather-modification experiment in an important river catchment area in South Africa. The experiment is still in the preliminary stages where weather and rainfall patterns are being investigated before the actual statistical and physical cloud-seeding experiment.

Classification plays a major role in such an experiment. It is crucial to be able to identify as accurately as possible in advance the type of weather situation which is imminent, in order to declare the day (or other fixed time period) an experimental unit or not. Here we shall describe how a large heterogeneous data set is recoded and analysed by correspondence analysis, and how the resultant graphical displays are used to arrive at daily weather forecasts on an operational basis.

9.11.1 Data

The basic data set consists of 485 days of data, gathered during three rainfall seasons, on relevant meteorological variables. These variables are of various types, but mostly quantitative variables such as temperatures at various levels of the atmosphere and wind vectors (wind speeds and directions). Each of these days has already been assigned to one of 5 weather types, briefly described as follows: (1) fair weather days; (2) general rain days; and then 3 categories of days on which convective activity is present: (3) convective days that do not meet the seeding criteria of the experiment; (4) convective days that do meet the seeding criteria; (5) convective days where hail is present in the clouds. This decision is made by the project leaders at the end of the day, once the actual weather situation has been observed from the ground, from the air and on radar.

9.11.2 Method

In order to use data on all the variables in a global analysis, an overall recoding of the data set into a multivariate indicator matrix was performed. It was convenient to recode each variable into 9 discrete categories (Jq = 9). The categories were not chosen on a purely ad hoc basis, but in close collaboration with the meteorologists involved in the project. Thus the range of a continuous variable like surface temperature is divided into 9 intervals, taking into account the rounding error in the measurements and meteorologically relevant boundaries. This process of discretization is especially suitable for recoding the wind vectors, which are otherwise quite unmanageable quantities in conventional multivariate statistical analysis. An example of one of the categories of the wind vector at the 300 millibar level is category 8: speed greater than 20 m/sec, direction between 220° and 255°.

In the context of the experiment it was found convenient to separate the study into two parts, first the distinction between fair weather (type 1), general rain (type 2) and convective (types 3 to 5) days, and secondly the distinction amongst the three types of convective days. Here we shall discuss the results of the first part of the study only, although the methodology in both parts is essentially the same.
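The recoding described in Section 9.11.2 can be sketched as follows. Only the one wind category quoted in the text (category 8 at the 300 millibar level: speed greater than 20 m/sec, direction between 220° and 255°) comes from the source; the temperature cut-points are invented placeholders.

```python
import numpy as np

def wind_category_8(speed_ms, direction_deg):
    """True if the wind vector falls in the quoted category 8."""
    return speed_ms > 20 and 220 <= direction_deg <= 255

def discretize(values, boundaries):
    """Recode a continuous variable into 9 categories (8 cut-points)
    and return the corresponding rows of a 0/1 indicator matrix."""
    cat = np.digitize(values, boundaries)            # category index 0..8
    indicator = np.zeros((len(values), 9), dtype=int)
    indicator[np.arange(len(values)), cat] = 1
    return indicator

temp_boundaries = [5, 10, 14, 18, 22, 25, 28, 31]    # invented cut-points
Z = discretize(np.array([12.0, 29.5]), temp_boundaries)
print(wind_category_8(23.0, 240))                    # True
```

Stacking such indicator blocks side by side, one per variable, gives the multivariate indicator matrix analysed in this study.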
For the purposes of this study we presumed that only the first two seasons' data (258 days) were available to design a classification strategy, and that the third season lay ahead, as it were, in order to evaluate the strategy's effectiveness. In the notation of Section 7.1 (see Fig. 7.2), N is the totally recoded matrix of design data and Z0 is the logical indicator matrix with 3 columns indicating the category fair, general or convective. The matrix N' = Z0'N thus has general element n'hj = the number of days of type h for which variable category j was observed. The test data may be denoted by N* and the problem is thus to predict the classification of each row of N* and compare it to the actual classification.

As described in Section 7.1 the analysis of N' discriminates between the centroids of the 3 subclouds of points. The advantage of having 3 groups of points is that the subspace of the centroids may be represented exactly in a plane. The rows of the original data matrix N are projected onto this subspace and are identified by their weather type (Fig. 9.25). The rows of the "new" data N* are also projected onto this plane and are classified according to the frequencies of neighbouring points of known weather type.

It was not too difficult to arrive at a reasonable radius of a neighbourhood, indicated in Fig. 9.25, which represents a set of points considered to be similar in the context of these particular data. Again this decision was made in consultation with project members familiar with all the days in the study. Here it proved both instructive and informative to examine the neighbourhoods for a set of increasing radii, noting how the relative frequencies of the weather types change as "rings of similarity" are added to the neighbourhoods. When two weather types have the same frequencies in a neighbourhood we have assigned the weather type with the closest centroid, although there is still considerable room for improvement in the actual classification decision (see Section 9.11.3).

The results of the test classification are summarized in Table 9.15.

TABLE 9.15
Table of correct and incorrect forecasts, using the classification … new days.

                          Observed
Predicted      Fair   General   Convective   Total
Fair             23         3           16      42
General           1         3            4       8
Convective        8         4           65      77
Total            32        10           85     127

[FIG. 9.25 (display: λ1 = 0.1661 (65.2%), λ2 = 0.08871 (34.8%); scale 0.1; 1 = fair, 2 = general rain, 3 = convective). Exact 2-dimensional display of the three centroids (circled points 1, 2 and 3) of the three groups of days. The individual days, labelled by their weather types, are projected onto this plane, that is, displayed as supplementary points. As examples of the many variables used in this analysis, the 9 categories of the variable "precipitable water", which has high association with the weather types, are indicated, as well as 2 categories of wind, "moderate to strong north to north-westerly winds", which associate strongly with general rain days. An example of a new day, indicated by ?, and its neighbourhood are shown; the forecast would clearly be a type 3 day.]

9.11.3 Discussion

There are two major advantages of the above forecasting strategy over more conventional statistical techniques. First, important data on the speed and
direction of the wind are taken into account by categorizing the wind vectors. The recoding of all the observations to be discrete results in a homogeneous data set which can be analysed globally. Secondly, the spatial framework and neighbourhood concept is particularly useful when applying the classification procedure operationally. When the predictor information of a new day is known, the day can be immediately situated as a point in the display of the historical data points. Not only can the types of neighbouring days be identified, but also their actual dates. Relevant information on the weather situations on these dates can be recalled in order to remind the human forecaster of these similar situations in the past. This should assist him in making a more accurate forecast of the situation at hand.

More work still has to be done on the actual classification decision. For example, the distances of the neighbouring points to the new point can be taken into account, so that closer points "count more" than points further away. The use of cross-validation, as described in Section 7.2, can assist in finding out if additional sophistication of the procedure leads to improved classifications.

Finally, notice that similar geometric frameworks have other useful applications in experimental design. For example, the most fundamental requirement of treatment and control groups of experimental units is that they be balanced with respect to their covariates. In a weather modification experiment there are a number of important variables which can be associated with the response variable of interest, say rainfall. Many experiments have been heavily criticized by statisticians who have discovered later that the randomized division of the experimental units into treatment and control groups has a strong association with some important covariate, which could explain the observed between-group differences. When there are many such covariates available, the chances of nullifying the results of an experiment by such an argument are very good. The same problem crops up in the design of clinical trials, where the treatment and control groups are rarely similar across the many covariates like age, severity of illness, etc. Since the vector of covariates (or recoded covariates) defines a point in multidimensional space, the desired strategy assigns treatment and control labels to each point in such a way that the two clouds of points are as confounded ("unclustered") as possible.

In the case of some clinical trials where the cases are all available before the randomization occurs this could be performed by identifying the low-dimensional subspace close to all the cases, subdividing this subspace and then randomly allocating treatments and controls within each multidimensional "block". Working in this subspace does not necessarily avoid the danger that the response in the experiment is associated with the residual space of the covariates in which the treatments and controls have not been equally spread around. It again seems that we should be balancing treatments and controls in the subspace which exhibits the association between the covariates and the response. This subspace can be identified by conducting the exploratory phase of the experiment, which is standard practice now in weather modification experiments. For a one-off experiment like the typical clinical trial with a modest sample size, avoiding the pitfalls of the randomization can be tricky. We would suggest a sequential allocation of treatments and controls if the response is known after a relatively short time, so that the subspace can be estimated during the experiment in an attempt to balance the groups. Otherwise the only other satisfactory design seems to involve matching treatment and control units as closely as possible, for example in matched pairs. Notice that even when an exploratory phase is possible it is good practice to "update" the response-covariate subspace during the course of the experiment, as the association between covariates and response might be changing.

9.12 REFERENCES TO PUBLISHED APPLICATIONS

This section gives a comprehensive list of references to published applications of correspondence analysis, classified by field of application. Because most of these are in French (and mostly published in the journal Les Cahiers de l'Analyse des Données), we specifically indicate when the articles are in English. We also give a very brief summary, in telegraphic style, of the data and special features of the application.

Art and archaeology

Prehistoric art in Southern Europe - Roux et al. (1976); various contingency tables, published by Leroi-Gourhan (1965), for example nij = number of times theme j (e.g. a horse) is found in region i; nodes of a clustering tree are displayed (cf. Section 7.4).

Typology of stone-age tools - F. Benzécri and Djindjian (1977); 359 x 18 data matrix (tools x variables), recoded as a 359 x 70 indicator matrix.

Analysis of musical scores, illustrated by a choral work of Bach - Morando (1980); various matrices are suggested, for example nii' = number of times note i is followed by note i'.

Epidemiological, biomedical and pharmaceutical

Variations of lymphoidal leukemias and their evolution - Bastin (1976);
detailed medical files on 102 hospital cases; all data recoded into discrete form, resulting in a 102 x 91 indicator matrix (cf. Section 5.4).

Death or survival from myocardial infarct - Nakache et al. (1977); 101 x 15 matrix (patients x variables), recoded as a 101 x 83 indicator matrix; various types of discriminant analysis, some based on correspondence analysis, are …

Retrospective study of infant mortality in Mali - Abbaoui-Maiti (1979); 811 mothers of deceased children respond to a detailed biographical and medical questionnaire; various subtables are analysed.

Perception of the countryside - Brun-Chaize (1978); pairs of photographs are shown to each of 324 people and the more preferred member is selected in each case; the matrix njj' = number of people who prefer photographs j and j' is analysed, as well as supplementary background information on the respondents.

Mating parade of the albatros - Blosseville (1981); data are of the form nij = number of times attitude j is observed during "individual-sequence" i, that is each row represents a particular bird during a demarcated time period of activity; see also Spence (1978).

Geology

Geomorphological study of granulometric data - F. Benzécri et al. (1976); 102 x 20 table nij = mass of sand in sample i which falls in granulometric class j (e.g. between 0.4 and 0.5 mm in diameter).

Soil mechanics, marine sediments - Benzécri et al. (1981); two sets of samples of marine deposits collected off the Irish coast; categorizing of continuous variables.

Chemical analyses of geological samples from Canadian volcanic belt - David et al. (1974), in English, with outline of the methodology; data analysed including all 22 elements, then using trace elements only; analyses compared.

Chemical analyses of geological samples from Ethiopian volcanic belt - Teil and Cheminée (1975), in English, with outline of the methodology in Teil (1975), in English but using Benzécri's notation; study very similar to that of David et al. (1974) described above; groups of samples are investigated.

Pattern recognition

Electron microscopy - Van Heel and Franck (1980), in English; images are divided up into a 32 x 32 matrix of cells and the data are in the form nij = intensity of cell j in image i; this application is discussed by Benzécri (1981a).

Pattern recognition of alphabetic characters - Chaumereuil and Villard (1981); a rectangle in which a letter is written is subdivided into a 10 x 7 …

… English; data sets which are simulated to represent ecological gradients are used.

Taxonomy of the horse genus Equus, based on skull and jaw measurements - Eisenmann and Turlot (1978); 349 x 25 matrix of measurements (individuals x variables).

… Dahdouh et al. (1978); a large and very detailed data set on 105 species of the insect, as well as 429 species of vegetation at the 35 sites surveyed, 97 environmental variables of a permanent nature, 6 variables of a temporary nature; …

Discussion of pattern recognition and the recoding of images - Benzécri (1981b).

Education

Marks in the admission examination of the Ecole Polytechnique, 1970 and 1971 - Nakhlé (1976); 911 x 20 table yij = mark on subject j by candidate i; illustrates doubling (cf. Section 6.1).

Tertiary education in Greece and the profession of the fathers of students - Meimaris (1978); 75 x 9 matrix nij = number of students from faculty i at one of the 4 Greek universities, with fathers in profession j.

Results of multiple-choice mathematics test given to 1300 pupils - Murtagh (1981); 155 multiple-choice items partitioned into 55 groups; comparisons are made between test and re-test data on the same pupils and between boys and girls; Guttman effect illustrated and discussed.

Primary school children and their families - Kubow-Ivarson (1982); 774 children surveyed, 89 questions with a total of 632 response categories, leading to a 632 x 632 Burt matrix; illustrates the analysis of various "slices" of the Burt matrix, i.e. subtables of order J x J1, where J = 632 and J1 < 632.

Social surveys

Living conditions in Lebanon, 1960-1970 - in Benzécri et al. (1973: Vol. 2C, no. 4, 5); 60 x 140 matrix of ratings yiq = rating (0 to 4) in region i on question q; correspondence analysis of doubled ratings; comparison made between surveys in 1960 and in 1970.

Rural development in Colombia - in Benzécri et al. (1973: Vol. 2C, no. 6); 45 x 306 matrix of ratings yiq = rating (0 to 4) in region i on question q; correspondence analysis of doubled ratings; separate analysis performed to reveal patterns in missing data.

Attitude of students in Zaïre to acts perpetrated against people and their possessions - in Benzécri et al. (1973: Vol. 2C, no. 9); 535 x 54 matrix yiq =
severity rating (0 to 10) given by student i on act q (e.g. accidental poisoning); correspondence analysis of doubled ratings.

Public image of the judiciary in France - in Benzécri et al. (1973: Vol. 2C, no. 10); for example, 1072 x (69 + 366) recoded indicator matrix (respondents x categories), where there are 69 categories of personal information and 366 categories of opinion; illustrates analysis of a large survey and the recoding of open-ended questions.

Comparison of activities of high-school boys and girls - Goudard et al. (1977); 676 x 33 table (pupils x questions), recoded as a 676 x 100 indicator matrix; part of this study is an illustration of the theory of Yagolnitzer (1977), who compares correspondence analyses of two matrices with the same row and column entities.

Survey of daily activity and attitude to work of 423 Parisians on the eve of their retirement - Blosseville and Cribier (1979); 19 questions, recoded as 60 dummy variables.

Review article on data analysis ("analyse des données") in economics - Benzécri (1980a).

Agricultural development in Syria from 1960 to 1977 - Arbache (1982); 63 x 7 x 18 matrix nijt = production of product i in region j in year t; all possible two-way tables are analysed and compared.

Foreign subsidiaries of USA-based multinational companies - Cholakian (1980a); 3-way 42 x 28 x 7 table nijt = number of foreign subsidiaries in industrial sector i in the country (or region) j during the period t; various analyses of 2-way contingency tables derived from this table are reported, some of which are marginal tables (e.g. the 42 x 28 table of nij., summed over t), others obtained by combining two of the indices into one (e.g. the 42 x 196 table of ni(jt), where the columns represent countries at particular time periods); the use of supplementary points and cluster analysis is extensively illustrated; the same data is used in a subsequent article (Cholakian, 1980b) to illustrate the adjustment of a contingency table to have prescribed margins (Madre, 1980; Benzécri, Bourgarit and Madre, 1980).

Exports from India, 1963 to 1975 - Gopalan (1980); multidimensional time series nijt = value of export of product i to country j during year t; similar approach to Cholakian (1980a) above.

Supply and demand of employment, and patterns of unemployment - Cabannes (1980); multidimensional time series, for example nit = number of jobs offered in industrial sector i during month t; displays seasonal patterns within each year and trends from year to year.

Operations research, multiple-criteria decision making

Evaluation of environmental impact and cost in situating an autoroute - Gutsatz (1976); I = 7 alternatives, Q = 9 heterogeneous criteria, recoded as J = 23 dummy variables.

Installation of services in a building - Vasserot (1977); 105 x 105 matrix (not symmetric) yii' = rating of how much contact section i needs with section i'; various analyses, including cluster analysis, are reported.

Usage of a large computer in terms of different types of jobs submitted and their various execution times - Carreiro (1978); various analyses are performed, including principal components analyses.

Comparison of correspondence analysis and factor analysis in the context of multiple-criteria decision making - Stewart (1980), in English; a critical study of these two techniques when applied to two small data matrices; correspondence analysis was generally the most successful.

Consumer survey of cigarette smoking - in Benzécri et al. (1973: Vol. 2C, no. 3); for example, the 30 x 12 table of frequencies nij = number of people who say cigarette i falls in category j (e.g. rather pleasant); Guttman (horseshoe) effect illustrated, especially the situation when some points lie inside the horseshoe.

Miscellaneous

Ratings of films by newspaper film critics - Tagliante et al. (1976); 242 x 8 table yij = rating of film i by critic j; each column is doubled and a third column is also added for each critic to record omissions (i.e. non-responses). The coding scheme is of type (a) in Table 5.5.

Interurban traffic flow in the southern suburbs of Paris - Kazmierczak (1978); 15 x 15 matrix nii' = number of people who travel from home in zone i to work in zone i' (not a symmetric matrix); various analyses are performed, including the computation of the convex polygon of the symmetrized data matrix (cf. Section 8.6).

Quality control and durability of shoes, studied both in natural conditions and in laboratory tests - Hariri (1980); a questionnaire comprising 21 questions, yielding 63 categories of response, is completed for each pair of shoes in the study at various points of time; various condensations of the data are considered (cf. discussion of focusing in Section 8.2).
Market research

Choice of a product name in market research - Vasserot (1976); 17 x 19 table of frequencies nij = number of people who associate name i with adjective j; background data on the people interviewed are also analysed.

Effectiveness and effect of washing powders, their chemical composition and ecotoxicity - Benzécri and Grelet-Puterflam (1981); recoding of heterogeneous data.
References

Abbaoui-Maiti, S. (1979). Cause de mortalité infantile au Mali: enquête rétrospective auprès des mères d'enfants décédés. Cahiers de l'Analyse des Données 4, 49-59.
Abi-Boutros, B. and Bellier, L. (1977). Contribution à la taxinomie des micromammifères. Application au genre Crocidura. Cahiers de l'Analyse des Données 2, 435-450.
Alvey, N., Galwey, N. and Lane, P. (1982). "An Introduction to GENSTAT". Academic Press, London.
Anderson, T. W. (1958). "An Introduction to Multivariate Statistical Analysis". Wiley, New York.
Arbache, C. (1982). Évolution des productions de l'agriculture syrienne par régions, de 1960 à 1977. Cahiers de l'Analyse des Données 7, 67-91.
Assouly, P., Maccario, J. and Auget, J. L. (1980). Analyse des essais d'un antibiotique. Cahiers de l'Analyse des Données 5, 361-367.
Austin, M. P. and Noy-Meir, I. (1971). The problem of non-linearity in ordination: experiments with two-gradient models. J. Ecol. 59, 763-773.
Balakrishnan, V. and Sanghvi, L. D. (1968). Distance between populations on the basis of attribute data. Biometrics 24, 859-865.
Bartlett, M. S. (1951). The goodness of fit of a single hypothetical discriminant function in the case of several groups. Ann. Eugen. 16, 119-214.
Bastin, C. (1976). Les leucémies lymphoïdes chroniques: la diversité des cas et leur évolution. Cahiers de l'Analyse des Données 1, 419-440.
Bastin, C. and Flandrin, G. (1980). Typologie des lymphocytes et pathologie lymphocytaire (étude préliminaire). Cahiers de l'Analyse des Données 5, 347-359.
Benzécri, F. (1980). Introduction à l'analyse des correspondances d'après un exemple de données médicales. Cahiers de l'Analyse des Données 5, 283-310.
Benzécri, F., Bressolier, C. and Thomas, Y. (1976). Deux analyses de données granulométriques en géomorphologie. Cahiers de l'Analyse des Données 1, 145-160.
Benzécri, F. and Djindjian, F. (1977). Typologie de l'outillage préhistorique en pierre taillée. Application à la définition du type burin de Noailles. Cahiers de l'Analyse des Données 2, 215-238.
Benzécri, J.-P. (1963). "Cours de Linguistique Mathématique". Université de Rennes, Rennes, France.
Benzécri, J.-P. (1969a). Approximation stochastique dans une algèbre normée non commutative. Bull. Soc. Math. France 97, 225-241.
Benzécri, J.-P. (1969b). Statistical analysis as a tool to make patterns emerge from data. In "Methodologies of Pattern Recognition" (Watanabe, S., ed.), pp. 35-74. Academic Press, New York.
Benzécri, J.-P. et al. (1973). "L'Analyse des Données". Tome (Vol.) 1: La Taxinomie. Tome 2: L'Analyse des Correspondances. Dunod, Paris.
Benzécri, J.-P. (1977a). Histoire et préhistoire de l'analyse des données. 5: L'analyse des correspondances. Cahiers de l'Analyse des Données 2, 9-40.
Benzécri, J.-P. (1977b). Sur l'analyse des tableaux binaires associés à une correspondance multiple. Cahiers de l'Analyse des Données 2, 55-71.
Benzécri, J.-P. (1977c). Choix des unités et des poids dans un tableau en vue d'une analyse de correspondance. Cahiers de l'Analyse des Données 2, 333-352.
Benzécri, J.-P. (1978). Note de lecture: l'allométrie. Cahiers de l'Analyse des Données 1, 371-376.
Benzécri, J.-P. (1979). Sur le calcul des taux d'inertie dans l'analyse d'un questionnaire. Addendum et erratum à [BIN.MULT.]. Cahiers de l'Analyse des Données 4, 377-378.
Benzécri, J.-P. (1980a). Analyse des données en économie. Cahiers de l'Analyse des Données 5, 9-16.
Benzécri, J.-P. (1980b). Les sondages d'opinion: point de vue d'un statisticien. Cahiers de l'Analyse des Données 5, 475-480.
Benzécri, J.-P. (1981a). Mémoires reçus: classification d'objets dans des micrographies électroniques brouillées, au moyen de l'analyse des correspondances. Cahiers de l'Analyse des Données 6, 101-107.
Benzécri, J.-P. (1981b). La reconnaissance des formes: leçon d'introduction en forme de dialogue. Cahiers de l'Analyse des Données 6, 157-174.
Benzécri, J.-P. and Benzécri, F. (1980). "L'Analyse des Correspondances: Exposé Élémentaire." Dunod, Paris.
Benzécri, J.-P., Biarez, J. and Favre, J.-L. (1981). L'analyse des données en mécanique des sols. Application à des sédiments marins. Cahiers de l'Analyse des Données 6, 39-57.
Benzécri, J.-P., Bourgarit, C. and Madre, J. L. (1980). Problème: ajustement d'un tableau à ses marges d'après la formule de reconstitution. Cahiers de l'Analyse des Données 5, 163-172.
Benzécri, J.-P. and Cazes, P. (1978). Problème sur la classification. Cahiers de l'Analyse des Données 3, 95-101.
Benzécri, J.-P. and Grelet-Puterflam, Y. (1981). Sur les poudres de lessive utilisées pour le lavage en machine: efficacité, usure de linge, composition chimique et écotoxicité. Cahiers de l'Analyse des Données 6, 415-437.
Benzécri, J.-P., Lebeaux, M. O. and Jambu, M. (1980). Aides à l'interprétation en classification automatique. Cahiers de l'Analyse des Données 5, 101-123.
Blosseville, J.-M. (1981). Analyse des dialogues: la parade de l'albatros. Cahiers de l'Analyse des Données 6, 345-376.
Blosseville, J. M. and Cribier, F. (1979). Les types de métiers: une analyse multidimensionnelle des caractéristiques socioprofessionnelles de Parisiens à la veille de la retraite. Cahiers de l'Analyse des Données 4, 29-47.
Bodmer, J. G. et al. (1973). "Joint report of the Fifth Histocompatibility Workshop". Histocompatibility Testing 1972, 619-719. Munksgaard, Copenhagen.
Bodmer, J. G. (1976). The ABC of HLA. In "Histocompatibility Testing 1975", pp. 21-99. Munksgaard, Copenhagen.
Bradley, R. A., Katti, S. K. and Coons, I. J. (1962). Optimal scaling for ordered categories. Psychometrika 27, 355-374.
Theory and Applications of Correspondence Analysis
Bradu, D. and Gabriel, K. R. (1978). The biplot as a diagnostic tool for models of two-way tables. Technometrics 20, 47-68.
Bradu, D. and Grine, F. E. (1979). Multivariate analysis of Diademodontine crania from South Africa and Zambia. S. Afr. J. Sci. 75, 441-448.
Bretaudière, J.-P., Dumont, G., Rej, R. and Bailly, M. (1981). Suitability of control materials. General principles and methods of investigation. Clin. Chem. 27, 798-805.
Bretaudière, J.-P., Rej, R., Drake, P., Vassault, A. and Bailly, M. (1981). Suitability of control materials for determination of α-amylase activity. Clin. Chem. 27, 806-815.
Brun-Chaize, M. C. (1978). Le paysage forestier: analyse des critères de préférence du public à partir de photographies. Cahiers de l'Analyse des Données 3, 65-78.
Bryant, E. H. and Atchley, W. R. (eds) (1975). "Multivariate Statistical Methods: Within-group Covariation". Halsted Press, Stroudsburg, Pennsylvania.
Burt, C. (1950). The factorial analysis of qualitative data. Br. J. Psychol. (Statistical Section) 3, 166-185.
Burt, C. (1953). Scale analysis and factor analysis. Comments on Dr. Guttman's paper. Br. J. Statist. Psychol. 6, 5-23.
Cabbanes, J. P. (1980). Analyse de quelques séries relatives au chômage. Cahiers de l'Analyse des Données 5, 443-474.
Calimlim, J. F., Wardell, W. M., Cox, C., Lasagna, L. and Sriwatanakul, K. (1982). Analgesic efficiency of orally administered Zomipirac sodium. Abstract presented at the 83rd annual meeting of the American Society of Clinical Pharmacology and Therapeutics, Lake Buena Vista, Florida, March 17-20, 1982. Clin. Pharmacol. Therap. 31, 208.
Carreiro, S. (1978). L'utilisation des ressources d'un ordinateur: diversité des travaux et leur variation dans le temps. Cahiers de l'Analyse des Données 3, 343-354.
Carroll, J. D. (1968). Generalization of canonical correlation analysis to three or more sets of variables. "Proceedings of the 76th Annual Convention of the American Psychological Association", 3, 227-228.
Carroll, J. D. (1972). Individual differences and multidimensional scaling. In "Multidimensional Scaling: Theory and Applications in the Behavioral Sciences" (Shepard, R. N., Romney, A. K. and Nerlove, S., eds), Vol. 1, pp. 105-155. Seminar Press, New York.
Carroll, J. D. (1980). Models and methods for multidimensional analysis of preferential choice (or other dominance) data. In "Similarity and Choice" (Lantermann, E. D. and Feger, H., eds), pp. 234-289. Hans Huber Publishers, Bern.
Carroll, J. D. and Arabie, P. (1980). Multidimensional scaling. Ann. Rev. Psychol. 31, 607-649.
Cazes, P. (1978). Méthodes de régression. III: L'analyse des données. Cahiers de l'Analyse des Données 3, 385-391.
Cazes, P. (1980). L'analyse de certains tableaux rectangulaires décomposés en blocs: généralisation des propriétés rencontrées dans l'étude des correspondances multiples. I: Définitions et applications à l'analyse canonique des variables qualitatives. II: Questionnaires: variantes de codages et nouveaux calculs de contributions. Cahiers de l'Analyse des Données 5, 145-161, 387-406.
Cazes, P. (1981). L'analyse de certains tableaux rectangulaires décomposés en blocs: généralisation des propriétés rencontrées dans l'étude des correspondances multiples. III: Codage simultané de variables qualitatives et quantitatives. IV: Cas modèles. Cahiers de l'Analyse des Données 6, 9-18, 135-143.
Chambers, J. M. (1977). "Computational Methods for Data Analysis". Wiley, New York.
Chatfield, C. and Collins, A. J. (1980). "An Introduction to Multivariate Analysis". Chapman and Hall, London.
Chaumereuil, P. and Villard, J. P. (1981). Un exemple de discrimination de figures par l'analyse des correspondances. Cahiers de l'Analyse des Données 6, 108-114.
Cholakian, V. (1980a). Les filiales étrangères des entreprises multinationales originaires des États-Unis: analyse de leur répartition par industrie, pays et date de création. Cahiers de l'Analyse des Données 5, 17-43.
Cholakian, V. (1980b). Un exemple d'application de diverses méthodes d'ajustement d'un tableau à des marges imposées. Cahiers de l'Analyse des Données 5, 173-176.
Chuang, C. (1982). On the decomposition of G² for two-way contingency tables. Unpublished manuscript, Department of Statistics and Division of Biostatistics, University of Rochester, Rochester, New York.
Clemm, D. S., Krishnaiah, P. R. and Waikar, V. B. (1973). Tables of the extreme roots of a Wishart matrix. J. Statist. Comput. Simul. 2, 65-92.
Cordy, J.-M. (1972). Étude de la variabilité des crânes d'ours des cavernes de la collection Schmerling. Annls. Paléontologie, pp. 151-207.
Corsten, L. C. A. (1976). Matrix approximation, a key to application of multivariate methods. "Proceedings of the 9th International Conference of the Biometric Society", 1, pp. 61-77. Raleigh, North Carolina, USA.
Cox, C. and Chuang, C. (1982). Comparison of analytical approaches for ordinal data from pharmaceutical studies. Unpublished manuscript, Division of Biostatistics, University of Rochester, Rochester, New York.
Cox, D. R. and Brandwood, L. (1959). On a discriminatory problem connected with the works of Plato. J. R. Statist. Soc. B 21, 195-200.
Dahdouh, B., Durantan, J. F. and Lecoq, M. (1978). Analyse des données sur l'écologie des acridiens d'Afrique de l'ouest. Cahiers de l'Analyse des Données 3, 459-482.
David, M., Campiglio, C. and Darling, R. (1974). Progress in R- and Q-mode analysis: correspondence analysis and its application to the study of geological processes. Can. J. Earth Sci. 11, 131-146.
De Leeuw, J. (1973). Canonical analysis of categorical data. Thesis, Psychological Institute, University of Leiden, The Netherlands.
De Leeuw, J. (1982). Nonlinear principal component analysis. In "COMPSTAT 1982" (Caussinus, H., Ettinger, P. and Tomassone, R., eds), pp. 77-89. Physica-Verlag, Vienna.
De Leeuw, J. and Van Rijkevorsel, J. (1980). HOMALS and PRINCALS: some generalizations of principal components analysis. In "Data Analysis and Informatics" (Diday, E. et al., eds), pp. 231-242. North Holland, Amsterdam.
Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. R. Statist. Soc. B 39, 1-38.
Deniau, C. and Oppenheim, G. (1979). Effet de l'affinement d'une partition sur les valeurs propres issues d'un tableau de correspondance. Cahiers de l'Analyse des Données 4, 289-297.
Deutsch, S. B. and Martin, J. J. (1971). An ordering algorithm for analysis of data arrays. Operations Res. 19, 1350-1362.
Devijver, P. A. and Kittler, J. (1982). "Pattern Recognition: A Statistical Approach". Prentice-Hall, London.
Dixon, W. J. et al. (1981). "BMDP Statistical Software 1981". University of California Press, Berkeley, California.
Eckart, C. and Young, G. (1936). The approximation of one matrix by another of lower rank. Psychometrika 1, 211-218.
Edwards, A. W. F. (1971). Distances between populations on the basis of gene frequencies. Biometrics 27, 873-881.
Efron, B. (1979). Bootstrap methods: another look at the jackknife. Ann. Statist. 7, 1-26.
Efron, B. and Gong, G. (1981). Statistical theory and the computer. In "Computer Science and Statistics: Proceedings of the 13th Symposium on the Interface" (Eddy, W. F., ed.), pp. 3-7. Springer-Verlag, New York.
Eisenmann, V. and Turlot, J.-C. (1978). Sur la taxinomie du genre Equus. Cahiers de l'Analyse des Données 3, 179-201.
El Borgi, Y. (1978). Programme de tracé de polygone convexe associé à une loi symétrique. Cahiers de l'Analyse des Données 3, 219-234.
Escofier-Cordier, B. (1965). L'analyse des correspondances. Doctoral thesis, Université de Rennes. Later published in Cahiers du Bureau Universitaire de Recherche Opérationnelle, no. 13 (1969), 25-39.
Escofier, B. (1979). Traitement simultané de variables qualitatives et quantitatives en analyse factorielle. Cahiers de l'Analyse des Données 4, 137-146.
Escofier, B. and Le Roux, B. (1976). Influence d'un élément sur les facteurs en analyse des correspondances. Cahiers de l'Analyse des Données 1, 297-318.
Falkenhagen, E. R. and Nash, S. W. (1978). Multivariate classification in provenance research. A comparison of two statistical techniques. Silvae Genet. 27, 14-23.
Fasham, M. J. R. (1977). A comparison of nonmetric multidimensional scaling, principal components and reciprocal averaging for the ordination of simulated coenoclines and coenoplanes. Ecology 58, 551-561.
Fenelon, J.-P. (1981). "Qu'est-ce que l'Analyse des Données?" Lefonen, Paris.
Fienberg, S. E. (1980). "The Analysis of Cross-Classified Categorical Data", 2nd edition. MIT Press, Cambridge, Massachusetts.
Finch, P. D. (1981). On the role of description in statistical enquiry. Br. J. Philosophy Sci. 32, 127-144.
Fisher, R. A. (1940). The precision of discriminant functions. Ann. Eugen. 10, 422-429.
Forgy, E. W. (1965). Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics 21, 768-769.
Fouillot, J.-P. and Teakaia, F. (1979). Élaboration d'un langage commun entre médecins et sportifs par l'analyse des données. Cahiers de l'Analyse des Données 4, 231-252.
Francis, I. (1981). "Statistical Software: a Comparative Review". North Holland, New York.
Francis, I. and Lauro, N. (1982). An analysis of developers' and users' ratings of [...]. In "COMPSTAT 1982" (Caussinus, H., Ettinger, P. and Tomassone, R., eds), pp. 212-217. Physica-Verlag, Vienna.
Friedman, J. H. et al. (1975). An algorithm for finding nearest neighbors. IEEE Trans. Computing 24, 1000-1006.
Gabriel, K. R. (1971). The biplot graphic display of matrices with application to principal component analysis. Biometrika 58, 453-467.
Gabriel, K. R. (1972). Analysis of meteorological data by means of canonical decomposition and biplots. J. Appl. Meteorology 11, 1071-1077.
Gabriel, K. R. (1978). Least-squares approximation of matrices by additive and multiplicative models. J. R. Statist. Soc. B 40, 186-196.
Gabriel, K. R. (1981). Biplot display of multivariate matrices for inspection of data and diagnosis. In "Interpreting Multivariate Data" (Barnett, V., ed.), pp. 147-174. Wiley, Chichester, UK.
Gabriel, K. R. and Zamir, S. (1979). Lower rank approximation of matrices by least squares with any choice of weights. Technometrics 21, 489-498.
Gaspard, D. and Mullon, C. (1980). Étude de la différenciation spécifique sur trois populations de térébratules biplissées du Cénomanien. Cahiers de l'Analyse des Données 5, 193-211.
Gauch, H. G. (1977). "ORDIFLEX: A Flexible Computer Program for Four Ordination Techniques: Weighted Averages, Polar Ordination, Principal Components Analysis and Reciprocal Averaging, Release B." Cornell University, Ithaca, N.Y.
Gauch, H. G. (1979). "COMPCLUS: A FORTRAN Program for Rapid Initial Clustering of Large Data Sets." Cornell University, Ithaca, N.Y.
Gauch, H. G. (1980). Rapid initial clustering of large data sets. Vegetatio 42, 103-111.
Gauch, H. G. (1982). "Multivariate Analysis in Community Ecology." Cambridge University Press, Cambridge.
Gauch, H. G. and Stone, E. L. (1979). Vegetation and soil pattern in a mesophytic forest in Ithaca, New York. Am. Midl. Nat. 102, 332-345.
Gauch, H. G. and Wentworth, T. R. (1976). Canonical correlation analysis as an ordination technique. Vegetatio 33, 17-22.
Gauch, H. G., Whittaker, R. H. and Singer, S. B. (1981). A comparative study of nonmetric ordinations. J. Ecol. 69, 135-152.
Gauch, H. G., Whittaker, R. H. and Wentworth, T. R. (1977). A comparative study of reciprocal averaging and other ordination techniques. J. Ecol. 65, 157-174.
Gifi, A. (1981). "Nonlinear Multivariate Analysis." Department of Data Theory, University of Leiden, The Netherlands.
Gittins, R. (1979). Ecological applications of canonical analysis. In "Multivariate Methods in Ecological Work" (Orloci, L., Rao, C. R. and Stiteler, W. M., eds), pp. 309-535. International Co-operative Publishing House, Fairland, Maryland, USA.
Gnanadesikan, R. (1977). "Methods for Statistical Data Analysis of Multivariate Observations". Wiley, New York.
Golub, G. H. and Reinsch, C. (1971). The singular value decomposition. In "Handbook for Automatic Computation" (Wilkinson, J. H. and Reinsch, C., eds). Springer-Verlag, Berlin.
Good, I. J. (1969). Some applications of the singular decomposition of a matrix. Technometrics 11, 823-831.
Goodman, L. A. (1981). Association models and canonical correlation in the analysis [...]
Gordon, A. D. (1981). "Classification. Methods for the Exploratory Analysis of Multivariate Data." Chapman and Hall, London.
Goudard, J., Grelet, Y. and Benzécri, J.-P. (1977). Les lycéens du second cycle: comparaison entre filles et garçons. Cahiers de l'Analyse des Données 2, 273-291.
Gouvea, V. (1977). Analyse des importations brésiliennes de machines et outils mécaniques. Cahiers de l'Analyse des Données 2, 293-302.
Gower, J. C. (1966). Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika 53, 325-338.
Gower, J. C. and Digby, P. G. N. (1981). Expressing complex relationships in two dimensions. In "Interpreting Multivariate Data" (Barnett, V., ed.). Wiley, Chichester, UK.
Greenacre, M. J. and Vrba, E. S. (1984). Graphical display and interpretation of antelope census data in African wildlife areas, using correspondence analysis. Ecology (in press).
Guitonneau, G. G. and Roux, M. (1977). Sur la taxinomie du genre Erodium. Cahiers de l'Analyse des Données 2, 97-113.
Gutsatz, M. (1976). L'analyse des correspondances: système de décision multidimensionnelle. Cahiers de l'Analyse des Données 1, 47-59.
Guttman, L. (1941). The quantification of a class of attributes: a theory and method of scale construction. In "The Prediction of Personal Adjustment" (Horst, P., ed.), pp. 319-348. Social Science Research Council, New York.
Guttman, L. (1950). The principal components of scale analysis. In "Measurement and Prediction" (Stouffer, S. A., Guttman, L., Suchman, E. A., Lazarsfeld, P. F., Star, S. A. and Clausen, J. A., eds). Princeton University Press, Princeton.
Guttman, L. (1953). A note on Sir Cyril Burt's "Factorial analysis of qualitative data". Br. J. Statist. Psychol. 6, 1-4.
Guttman, L. (1959). Metricizing rank-ordered and ordered data for a linear factor analysis. Sankhyā 21, 257-268.
Guttman, L. (1971). Measurement as structural theory. Psychometrika 36, 329-347.
Haberman, S. J. (1981). Tests for independence in two-way contingency tables based on canonical correlation and on linear-by-linear interaction. Ann. Statist. 9, 1178-1186.
Hill, M. O. (1979). "DECORANA: A FORTRAN Program for Detrended Correspondence Analysis and Reciprocal Averaging". Cornell University, Ithaca, N.Y.
Hill, M. O. (1982). Correspondence analysis. In "Encyclopedia of Statistical Sciences" (Kotz and Johnson, eds), 2, pp. 204-210. Wiley, New York.
Hill, M. O. and Gauch, H. G. (1980). Detrended correspondence analysis, an improved ordination technique. Vegetatio 42, 47-58.
Hill, M. O. and Smith, A. J. E. (1976). Principal component analysis of taxonomic data with multi-state discrete characters. Taxon 25, 249-255.
Hills, M. (1969). On looking at large correlation matrices. Biometrika 56, 249-253.
Hirschfeld, H. O. (1935). A connection between correlation and contingency. Cambridge Philosophical Soc. Proc. (Math. Proc.) 31, 520-524.
Holland, T. R., Levi, M. and Watson, C. G. (1980). Canonical correlation in the analysis of a contingency table. Psychol. Bull. 87, 334-336.
Holland, T. R., Levi, M. and Beckett, G. E. (1981). Associations between violent and non-violent criminality: a canonical contingency-table analysis. Multivariate Behav. Res. 16, 237-241.
Horst, P. (1935). Measuring complex attitudes. J. Social Psychol. 6, 369-374.
Horst, P. (1936). Obtaining a composite measure from a number of different measures of the same attribute. Psychometrika 1, 53-60.
Horst, P. (1963). "Matrix Algebra for Social Scientists". Holt, Rinehart and Winston, New York.
Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. J. Educ. Psychol. 24, 417-441, 498-520.
Hotelling, H. (1936). Relationships between two sets of variates. Biometrika 28, 321-377.
Householder, A. S. and Young, G. (1938). Matrix approximation and latent roots. Am. Math. Monthly 45, 165-171.
Ibrahim, C. (1981). La balance des paiements de 21 pays de l'O.C.D.E.: son évolution de 1967 à 1978. Cahiers de l'Analyse des Données 6, 261-296.
Jacquard, A. (1973). Distances généalogiques et distances génétiques. Cahiers d'Anthropologie et d'Écologie Humaine 1, 11-124.
Jambu, M. (1976). Programme de calcul des contributions mutuelles entre classes d'une hiérarchie et facteurs d'une correspondance. Cahiers de l'Analyse des Données 1, 77-92.
Jambu, M. (1978). "Classification Automatique pour l'Analyse des Données. 1: Méthodes et Algorithmes." Dunod, Paris.
Jambu, M. (1983). "Cluster Analysis in Data Analysis." North Holland, Amsterdam.
Jenkins, T. (1983). Human evolution in southern Africa. In "6th International Congress of Human Genetics" (Motulsky, A. G. et al., eds), Vol. 1. Alan R. Liss Inc., New York.
Johnson, P. O. (1950). The quantification of qualitative data in discriminant analysis. J. Am. Statist. Ass. 45, 65-76.
Johnson, R. M. (1963). On a theorem stated by Eckart and Young. Psychometrika 28, 259-263.
Joreskog, K. G., Klovan, J. E. and Reyment, R. A. (1976). "Methods in Geomathematics 1: Geological Factor Analysis". Elsevier, Amsterdam.
Kaiser, H. F. and Cerny, B. A. (1980). On the canonical analysis of contingency tables. Educ. Psychol. Measurement 40, 95-99.
Karchoud, A. (1981). Étude de la taxinomie des équidés d'après les mesures squelettales. Cahiers de l'Analyse des Données 6, 453-463.
Kazmierczak, J. B. (1978). Migrations interurbaines dans la banlieue sud de Paris. Cahiers de l'Analyse des Données 3, 203-218.
Kendall, D. G. (1971). Seriation from abundance matrices. In "Mathematics in the Archaeological and Historical Sciences" (Hodson, F. R., Kendall, D. G. and Tautu, P., eds), pp. 215-252. Edinburgh University Press, Edinburgh.
Kendall, M. G. (1961). "A Course in the Geometry of n Dimensions." Statistical Monograph No. 8, Griffin, London.
Kendall, M. G. (1975). "Multivariate Analysis." Hafner Press, New York.
Kendall, M. G. and Stuart, A. (1961). "The Advanced Theory of Statistics", Vol. 2. Griffin, London.
Kendall, M. G. and Stuart, A. (1973). "The Advanced Theory of Statistics", Vol. 2, 3rd edn. Griffin, London.
Kettenring, J. R. (1971). Canonical analysis of several sets of variables. Biometrika 58, 433-451.
Kshirsagar, A. M. (1972). "Multivariate Analysis." Marcel Dekker, New York.
Kubow-Ivarson, W. (1982). L'enfant au cycle primaire et sa famille: un exemple d'analyse par bandes du tableau de Burt issu d'une enquête. Cahiers de l'Analyse des Données 7, 45-65.
Lancaster, H. O. (1958). The structure of bivariate distributions. Ann. Math. Statist. 29, 719-736.
Lancaster, H. O. (1963). Canonical correlations and partitions of χ². Q. J. Math. 14, 220-224.
Lancaster, H. O. (1966). Kolmogorov's remark on the Hotelling canonical correlations. Biometrika 53, 585-588.
Lancaster, H. O. (1969). "The Chi-Squared Distribution." Wiley, New York.
Lawley, D. N. (1956). Tests of significance for the latent roots of covariance and correlation matrices. Biometrika 43, 128-136.
Lawley, D. N. (1959). Tests of significance in canonical analysis. Biometrika 46, 59-66.
Lawley, D. N. and Maxwell, A. E. (1971). "Factor Analysis as a Statistical Method", 2nd edn. Butterworth, London.
Lebart, L. (1974). On Benzécri's method for finding eigenvectors by stochastic approximation (the case of binary data). In "Proceedings in Computational Statistics (COMPSTAT)", pp. 202-211. Physica-Verlag, Vienna.
Lebart, L. (1975). Validité des résultats en analyse des données. Report CREDOC-DGRST, 142 rue du Chevaleret, 75013 Paris.
Lebart, L. (1976). The significance of eigenvalues issued from correspondence analysis. In "Proceedings in Computational Statistics (COMPSTAT)", pp. 38-45. Physica-Verlag, Vienna.
Lebart, L. (1981). Une procédure d'analyse lexicale écrite en langage FORTRAN. Cahiers de l'Analyse des Données 6, 229-241.
Lebart, L. (1982a). Exploratory analysis of large sparse matrices with application to textual data. In "COMPSTAT 1982" (Caussinus, H., Ettinger, P. and Tomassone, R., eds), pp. 67-76. Physica-Verlag, Vienna.
Lebart, L. (1982b). L'analyse statistique des réponses libres dans les enquêtes socio-économiques. Consommation: Revue de Socio-Économie 1, 39-62.
Lebart, L., Morineau, A. and Fenelon, J.-P. (1979). "Traitement des Données Statistiques." Dunod, Paris.
Lebart, L., Morineau, A. and Tabard, N. (1977). "Techniques de la Description Statistique: méthodes et logiciels pour l'analyse des grands tableaux." Dunod, Paris.
Lebeaux, M.-O. (1974). Programmes de régression et de classification utilisant la notion de voisinage. Doctoral thesis, Université Pierre et Marie Curie, Paris.
Lebeaux, M.-O. (1977). Notice sur l'utilisation du programme POUBEL. Cahiers de l'Analyse des Données 2, 467-481.
Lebeaux, M. O., Stepan, S. and Benzécri, J.-P. (1976). Analyse de liens au sein d'un groupe d'enfants. Cahiers de l'Analyse des Données 1, 197-216.
Lebras, H. (1974). Vingt analyses multivariées d'une structure connue. Math. Sci. Hum. 47, 37-55.
Leroi-Gourhan, A. (1965). "Préhistoire de l'Art Occidental." Mazenod, Paris.
Lingoes, J. C. (1963). Multivariate analysis of contingencies: an IBM 7090 program for analyzing metric/nonmetric or linear/nonlinear data. Computational Report 2, 1-24. Computing Center, University of Michigan.
Lingoes, J. C. (1964). Simultaneous linear regression: an IBM 7090 program for analyzing metric/nonmetric or linear/nonlinear data. Behav. Sci. 9, 87-88.
Lingoes, J. C. (1968). The multivariate analysis of qualitative data. Multivariate Behav. Res. 3, 61-94.
Lingoes, J. C. (1977). With contributions by Borg, I., De Leeuw, J., Guttman, L., Heiser, W., Lissitz, R. W., Roskam, E. E. and Schonemann, P. H. "Geometric Representations of Relational Data: Readings in Multidimensional Scaling." Mathesis Press, Ann Arbor.
[...] secondaire dans les protéines. Cahiers de l'Analyse des Données 5, 75-85.
Murtagh, F. (1981). Recherche d'un scalogramme sur les réponses de 1300 élèves à [...]
Rosenzveig, C. (1978). Une chaîne d'analyse des correspondances sur micro-ordinateur. Cahiers de l'Analyse des Données 3, 418-434.
Rosenzveig, C. and Thomas, J. P. H. (1979). Attitudes des sous-officiers des trois armées: dépouillement d'une enquête de sociologie militaire. Cahiers de l'Analyse des Données 4, 7-27.
Roux, M. (1979). Estimation des paléoclimats d'après l'écologie des foraminifères. Cahiers de l'Analyse des Données 4, 61-79.
Roux, M., Robert, J. and Benzécri, J.-P. (1976). Analyse de données sur l'art préhistorique. Cahiers de l'Analyse des Données 1, 61-70.
Schmetterer, L. (1969). Multidimensional stochastic approximation. In "Multivariate Analysis" (Krishnaiah, P. R., ed.), Vol. 2, pp. 443-460. Academic Press, New York.
Schonemann, P. H. (1970). On metric multidimensional unfolding. Psychometrika 35, 349-366.
Schonemann, P. H. and Carroll, R. M. (1970). Fitting one matrix to another under choice of a central dilation and a rigid motion. Psychometrika 35, 245-255.
Slater, P. (1960). The analysis of personal preferences. Br. J. Statist. Psychol. 3, 119-135.
Smith, C. A. B. (1977). A note on genetic distance. Ann. Hum. Genet. (Lond.) 40, 463-479.
Spence, I. (1978). Multidimensional scaling. In "Quantitative Ethology" (Colgan, P. W., ed.). Wiley, New York.
Srikantan, K. S. (1970). Canonical association between nominal measurements. J. Am. Statist. Ass. 65, 284-292.
Stewart, G. W. (1973). "Introduction to Matrix Computations." Academic Press, New York.
Stewart, T. J. (1981). A descriptive approach to multiple-criteria decision making. J. Operational Res. Soc. 32, 45-53.
Swan, J. M. A. (1970). An examination of some ordination problems by use of simulated vegetational data. Ecology 51, 89-102.
Tabet, N. (1973). Programme d'analyse des correspondances. Part of doctoral thesis, 3e cycle, Université de Paris VI.
Tagliante, P., Chaumereuil, P. F. and Villard, J. P. (1976). Les critiques de cinéma d'après la cote des films publiée par l'hebdomadaire Pariscope. Cahiers de l'Analyse des Données 1, 381-400.
Tatsuoka, M. M. (1971). "Multivariate Analysis." Wiley, New York.
Teil, H. (1975). Correspondence factor analysis: an outline of its method. Math. Geol. 7, 3-12.
Teil, H. and Cheminée, J. L. (1975). Application of correspondence factor analysis to the study of major and trace elements in the Erta Ale chain (Afar, Ethiopia). Math. Geol. 7, 13-30.
Teillard, P. (1976). L'évolution de la production industrielle française de 1963 à 1975. Cahiers de l'Analyse des Données 1, 401-417.
Tsébélis, G. (1979). Géographie électorale de la Grèce: analyse des attitudes de vote aux scrutins nationaux de 1958 à 1977. Cahiers de l'Analyse des Données 4, 423-436.
Tsianco, M. C., Odoroff, C. L., Plumb, S. and Gabriel, K. R. (1981). BGRAPH: a program for biplot multivariate graphics, Version 1. User's guide. Technical report 81/20, Department of Statistics and Division of Biostatistics, University of Rochester, Rochester, New York 14642.
Tucker, L. R. (1960). Intra-individual and inter-individual multidimensionality. In "Psychological Scaling: Theory and Applications" (Gulliksen, H. and Messick, S., eds), pp. 155-167. Wiley, New York.
Tukey, J. W. (1977). "Exploratory Data Analysis". Addison-Wesley, Reading, Massachusetts.
Tukey, P. A. and Tukey, J. W. (1981). Graphical display of data sets in 3 or more dimensions. 1: Preparation; prechosen sequence of views. 2: Data-driven view selection; agglomeration and sharpening. 3: Summarization; smoothing; supplemented views. In "Interpreting Multivariate Data" (Barnett, V., ed.), pp. 189-278. Wiley, Chichester, UK.
Van Heel, M. and Frank, J. (1980). Classification of particles in noisy electron micrographs, using correspondence analysis. In "Pattern Recognition in Practice" (Gelsema, E. S. and Kanal, L. N., eds). North-Holland, Amsterdam.
Vasserot, G. (1976). L'analyse des correspondances appliquée au marketing. Le choix du nom d'un produit. Cahiers de l'Analyse des Données 1, 319-333.
Vasserot, G. (1977). L'implantation des services d'une société. Cahiers de l'Analyse des Données 2, 303-311.
Volle, M. (1970). La construction des nomenclatures d'activités économiques de l'industrie. Annales de l'I.N.S.E.E. 4, 101-131.
Vrba, E. S. (1980). The significance of bovid remains as indicators of environment and predation patterns. In "Fossils in the Making: Vertebrate Taphonomy and Paleoecology" (Behrensmeyer, A. K. and Hill, A. P., eds). University of Chicago Press, Chicago.
Walsh, G. R. (1975). "Methods of Optimization". Wiley, New York.
Wilkinson, J. H. and Reinsch, C. (1971). "Handbook for Automatic Computation." Vol. II: Linear Algebra (Bauer, F. L., ed.). Springer-Verlag.
Williams, E. J. (1952). Use of scores for the analysis of association in contingency tables. Biometrika 39, 274-298.
Williamson, M. H. (1978). The ordination of incidence data. J. Ecol. 66, 911-920.
Wold, S. (1978). Cross-validatory estimation of the number of components in factor and principal components models. Technometrics 20, 397-405.
Yagolnitzer, E. (1977). Comparaison de deux correspondances entre les mêmes ensembles. Cahiers de l'Analyse des Données 2, 251-264.
Young, G. (1937). Matrix approximations and subspace fitting. Psychometrika 2, 21-25.
Young, G. and Householder, A. S. (1938). Discussion of a set of points in terms of their mutual distances. Psychometrika 3, 19-22.
Appendix A

The singular value decomposition (SVD) is one of the most useful tools in matrix algebra, yet it is still not treated in many textbooks for statisticians. Its origins can be traced back to the work of French and Italian mathematicians in the 1870s (see, for example, Marshall and Olkin, 1979, Chapter 19). One of its largest fields of application, namely low rank matrix approximation, was first reported by Eckart and Young (1936) in the first volume of Psychometrika. The SVD provides a unified treatment of quite different analyses. For this reason we have found it to be an ideal approach for teaching. In this appendix we show how the SVD underlies principal components analysis, the biplot, canonical correlation analysis, canonical variate analysis and correspondence analysis. These techniques are all variations on a theme, and that theme is the algebra and geometry of the SVD.

Further relevant literature on the SVD in statistics is provided by Good (1969), Chambers (1977), Gabriel (1978), Rao (1980), Mandel (1982) and Greenacre and Underhill (1982).

Any real I × J matrix A of rank K can be expressed as:

    A = U D_α V^T        (A.1.1)
  (I×J) (I×K)(K×K)(K×J)

where:

D_α is a diagonal matrix of positive numbers α_1 ... α_K, the singular values of A;
K is the rank of A (K ≤ min{I, J});
U^T U = V^T V = I, i.e. the columns of U and V are orthonormal (in an ordinary Euclidean sense).
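As a numerical aside (NumPy is, of course, an anachronism relative to the FORTRAN programs discussed in Appendix B; this sketch is ours, not part of the original text), the decomposition (A.1.1) and its orthonormality conditions can be checked directly:

```python
import numpy as np

# A small I x J matrix; for generic random data its rank is K = min(I, J) = 3.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))

# numpy returns the "thin" SVD: U is I x K, Vt is K x J, and the
# singular values satisfy alpha_1 >= ... >= alpha_K > 0.
U, alpha, Vt = np.linalg.svd(A, full_matrices=False)

# Columns of U and V are orthonormal in the ordinary Euclidean sense.
K = alpha.size
assert np.allclose(U.T @ U, np.eye(K))
assert np.allclose(Vt @ Vt.T, np.eye(K))

# A = U D_alpha V^T reconstructs the matrix exactly.
assert np.allclose(A, U @ np.diag(alpha) @ Vt)
```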
Theory and Applications of Correspondence Analysis
A vector u (of order I), a vector v (of order J) and a scalar α are said to be a pair of singular vectors and a singular value of a rectangular matrix A if Av = αu and A^T u = αv hold simultaneously. Since the pair of vectors u and −v will also satisfy this pair of formulae for a singular value of −α, it is assumed that the singular values are positive. There is also an indeterminacy in the scale of u and v, but the singular vectors do have equal Euclidean norms (u^T u = v^T v), so that it is sufficient to standardize one of the vectors, usually to be of unit norm. The only indeterminacy that remains is a simultaneous reflection (multiplication by −1) of u and v in their respective dual spaces, but this is usually of no consequence. It can easily be proved that singular vectors associated with distinct singular values are necessarily orthogonal to each other in their respective dual spaces.

Uniqueness of the SVD

From now on it is assumed that the singular values are arranged in descending order: α_1 ≥ α_2 ≥ ... ≥ α_K > 0, with the singular vectors ordered accordingly. If strict inequalities order the singular values, that is, there is no multiplicity of singular values, then the SVD is uniquely determined up to simultaneous reflections of the pairs of singular vectors. In terms of these vectors the SVD (A.1.1) may be written as a sum of rank-one terms:

    A = Σ_{k=1}^K α_k u_k v_k^T        (A.1.2)

The complete form of the decomposition, A = Ū D̄_α V̄^T (A.1.3), in which Ū (I × I) and V̄ (J × J) are square orthogonal matrices and D̄_α is D_α padded with zeros, is sometimes called the SVD itself (cf. Kshirsagar, 1972, pp. 247-249).

Low rank matrix approximation

In view of form (A.1.2) of the SVD, it seems that if the singular values α_{K*+1} ... α_K are small compared to α_1 ... α_{K*}, then dropping the last K − K* terms of the right-hand side of (A.1.2) gives a good approximation to A which has lower rank than A. The approximation is in fact a least squares one, and it is this result which makes the SVD so useful. The theorem of low rank approximation (first stated and proved by Eckart and Young, 1936) is as follows:

Let A[K*] ≡ Σ_{k=1}^{K*} α_k u_k v_k^T be the I × J matrix of rank K* formed from the largest K* singular values and corresponding singular vectors of A. Then A[K*] is the rank K* least squares approximation of A in that it minimizes:

    Σ_{i=1}^I Σ_{j=1}^J (a_ij − x_ij)^2 = trace{(A − X)(A − X)^T}        (A.1.4)

for all matrices X of rank K* or less.
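The Eckart-Young result is easy to verify numerically; the following NumPy sketch (our illustration, with made-up random data and an arbitrary rank-preserving perturbation) is not part of the original text:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))

U, alpha, Vt = np.linalg.svd(A, full_matrices=False)

# A[K*] = sum of the first K* rank-one terms alpha_k u_k v_k^T.
K_star = 2
A_lowrank = U[:, :K_star] @ np.diag(alpha[:K_star]) @ Vt[:K_star, :]

# The residual sum of squares equals the sum of the discarded alpha_k^2.
rss = np.sum((A - A_lowrank) ** 2)
assert np.isclose(rss, np.sum(alpha[K_star:] ** 2))

# Perturbing the optimum, while staying within rank K*, increases the error.
X = A_lowrank + 0.1 * np.outer(U[:, 0], Vt[0, :])   # still rank <= K*
assert np.sum((A - X) ** 2) > rss
```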
It is not difficult to prove this result and a possible proof runs along the same lines as existing proofs for the special case when A is square (see, for example, Kshirsagar, 1972, pp. 429-430; Stewart, 1973). Here we need to use the complete SVD of A, given in (A.1.3). The objective function (A.1.4) can then be written as:

    trace{ŪŪ^T(A − X)V̄V̄^T(A − X)^T} = trace{(D̄_α − G)(D̄_α − G)^T}
                                     = Σ_{k=1}^K (α_k − g_kk)^2 + Σ_k Σ_{l≠k} g_kl^2

where G (I × J) ≡ Ū^T X V̄. Since G is of the same rank as X, it is clear that the optimal G must have a "diagonal" of α_1 ... α_{K*} and otherwise be zero, which implies that X = A[K*] is optimal.

We require a notation for the submatrices of U, D_α and V:

    U ≡ [U(K*)  U(K−K*)]    D_α ≡ [ D_α(K*)      0       ]    V ≡ [V(K*)  V(K−K*)]
                                  [    0      D_α(K−K*)  ]

so that U(K*) (I × K*), D_α(K*) (K* × K*) and V(K*) (J × K*) compose the SVD of A[K*]:

    A[K*] = U(K*) D_α(K*) V(K*)^T        (A.1.5)

The matrix of residuals is clearly:

    A − A[K*] = U(K−K*) D_α(K−K*) V(K−K*)^T        (A.1.6)

Because the sum of squared elements of a matrix Y is equal to trace(YY^T), we have from (A.1.1), (A.1.5) and (A.1.6) that the sums of squared elements of A, A[K*] and A − A[K*] are respectively Σ_{k=1}^K α_k^2, Σ_{k=1}^{K*} α_k^2 and Σ_{k=K*+1}^K α_k^2. A traditional measure of the quality of approximation of A by A[K*] is the percentage sum of squares:

    τ_{K*} ≡ 100 Σ_{k=1}^{K*} α_k^2 / Σ_{k=1}^K α_k^2        (A.1.7)

Generalized SVD

We now need to introduce a slight generalization in the definition of the SVD in order to accommodate our later description. If Ω (I × I) and Φ (J × J) are given positive-definite symmetric matrices, then any real I × J matrix A of rank K can be expressed as:

    A = N D_α M^T = Σ_{k=1}^K α_k n_k m_k^T        (A.1.8)
  (I×J) (I×K)(K×K)(K×J)

where the columns of N and M are orthonormalized with respect to Ω and Φ respectively:

    N^T Ω N = M^T Φ M = I        (A.1.9)

The columns of N and M may be called generalized left and right singular vectors respectively. They are still orthonormal bases for the columns and rows of A, but the metrics imposed in the I- and J-dimensional spaces are no longer simple Euclidean, but generalized (or weighted) Euclidean metrics defined by Ω and Φ respectively (cf. Section 2.3). Similarly, the diagonal elements of the diagonal matrix D_α may be called generalized singular values, ordered from largest to smallest.

The generalized SVD is easily proved, assuming the ordinary SVD of Ω^{1/2} A Φ^{1/2}, where we use the symmetric matrix square roots (i.e. if Ω has eigendecomposition Ω = W D_μ W^T, then Ω^{1/2} = W D_μ^{1/2} W^T):

    Ω^{1/2} A Φ^{1/2} = U D_α V^T, where U^T U = V^T V = I        (A.1.10)

Letting:

    N ≡ Ω^{−1/2} U and M ≡ Φ^{−1/2} V        (A.1.11)

we have (A.1.8) and (A.1.9).

The corresponding generalization which is induced in the theorem of low rank approximation is as follows. If the last K − K* terms of (A.1.8) are dropped, then A[K*] ≡ Σ_{k=1}^{K*} α_k n_k m_k^T = N(K*) D_α(K*) M(K*)^T is the generalized rank K* least-squares approximation of A in that it minimizes:

    trace{Ω(A − X)Φ(A − X)^T}        (A.1.12)

amongst all matrices X of rank K* (or less).

When Ω, say, is a diagonal matrix D_w of positive numbers w_1 ... w_I, the fit (A.1.12) can be written as:

    trace{D_w(A − X)Φ(A − X)^T} = Σ_{i=1}^I w_i (a_i − x_i)^T Φ (a_i − x_i)        (A.1.13)

where a_i and x_i are the rows of A and X respectively, written as column vectors. This function essentially defines a generalized principal components analysis in the spirit of Pearson (1901) and Young (1937), as described in Section 2.5. The rows of A are considered to be a cloud of I points in J-dimensional generalized (or weighted) Euclidean space, where the metric is defined by Φ. (In Section 2.5 Φ is also a diagonal matrix, often called a "diagonal metric".) The values w_1 ... w_I are masses (or weights) which are assigned to each of the row points themselves. The rows of X are unknown points in a K*-dimensional subspace and the minimum of (A.1.13) attained by X ≡ A[K*] identifies the subspace which is closest to the cloud of points in terms of weighted sum of squared distances. In this case the vectors m_1 ... m_{K*} define orthonormal principal axes of the subspace, while the rows of the matrix N(K*) D_α(K*) = [α_1 n_1 ... α_{K*} n_{K*}] define the co-ordinates (with respect to these axes) of the projections of the cloud of points onto the subspace. Remember that the orthonormality of the axes and of the projections is defined in terms of the metric Φ.

Notice that the matrix approximation
A[K*], equivalently the optimal subspace defined by M(K*), is unique as long as α_{K*} is strictly greater than α_{K*+1} (cf. the above discussion of the uniqueness of the SVD). In practice this means that the dimensionality K* of the approximation should be chosen where there is a clear difference between α_{K*} and α_{K*+1} (cf. Section 8.1).

Usually we would plot the rows of F ≡ N(K*) D_α(K*) in K*-dimensional Euclidean space (K* is most often equal to 2 or 3) in order to explore the multidimensional configuration of the rows of the data matrix A. However, it is possible to plot the rows of G ≡ M(K*) in the same space, as described by Gabriel (1971, 1981). This particular joint display of points representing both the rows and the columns of a matrix is called a biplot and is essentially the same as the vector model for preferences suggested by Tucker (1960), but applicable in a wider context of data. In a biplot display the scalar product of the ith row point vector (i.e. ith row f_i^T of F) and the jth column point vector (i.e. jth row g_j^T of G) approximates the datum a_ij:

    a_ij ≈ f_i^T g_j = (length of f_i) × (length of g_j)
                        × (angle cosine between f_i and g_j)        (A.1.14)

The data values a_ij are usually centered in some way, for example with respect to the column means, and thus a positive deviation a_ij > 0 is indicated by vectors f_i and g_j subtending an acute angle, while a negative deviation a_ij < 0 is indicated by vectors subtending an obtuse angle.

A.2 A GENERAL ANALYSIS AND ITS SPECIAL CASES

The above description suggests the following general analysis of a rectangular data matrix Y:

PHASE 1
Pre-process Y in some way, usually some type of centering or recoding of the data. This results in a matrix A.

PHASE 2
Compute the generalized SVD of A for given Ω and Φ (cf. (A.1.10) and (A.1.11)):

    A = N D_α M^T, where N^T Ω N = M^T Φ M = I

and select a rank K* approximation:

    A[K*] = N(K*) D_α(K*) M(K*)^T

PHASE 3
Obtain a graphical display of the rows and/or the columns of the data by plotting the rows of

    F ≡ N(K*) D_α(K*)^a and/or G ≡ M(K*) D_α(K*)^b

for given a and b, with respect to K* Cartesian axes.

This analysis can be programmed as a computer subroutine/procedure/macro and a variety of analyses are made possible by supplying the following "parameters" to the program:

(1) The type of centering/recoding of the data matrix in Phase 1.
(2) The positive-definite symmetric matrices Ω and Φ in Phase 2.
(3) The scalars a and b which indicate how the singular values are apportioned to rescale the left and right singular vectors respectively prior to plotting.

Table A.1 illustrates a number of well-known special cases of the above analysis. We also give a very brief description of the geometric interpretation of the plots that result from these analyses, leaving the reader to refer to the more detailed descriptions which exist in the textbooks and journal articles listed in the column "References". The data-analyst is also free to experiment with the "parameters" in the three phases of the analysis in the context of a specific data set.

A few explanatory remarks are required to assist the reader's assimilation of Table A.1:

(1) It is not uncommon that only one set of points (row points or column points) is of interest. Usually it is the set of points which is not rescaled (e.g. b = 0) that is not of interest.
(2) If both sets of points are plotted and a + b = 1, the biplot interpretation is valid, that is, between-set scalar products f_i^T g_j in the display approximate the elements a_ij of the matrix A.
(3) If both sets of points are plotted, where b, say, is 0, then the display of the column points will sometimes be very much "smaller" than the display of the row points which are rescaled by a = 1, say. The column points can then be uniformly rescaled by a convenient amount so that the relative positions of the column points are more easily observed, in which case the scale of each set of points is different. When a = b, the points are usually plotted with respect to the same scale.
(4) Table A.1 is a framework for the fundamental computations for each method. Specific methods require additional computations, for example the computation of correlations between component scores (columns of
TABLE A.1
Special cases of the general analysis defined in Section A.2.

(1) Principal components analysis (or principal components biplot if variables are displayed). The data matrix Y is typically cases (rows) by variables (columns).
    A = Y − (1/I)11^T Y;  Ω = (1/I)I;  Φ = I.
    Origin of display is the centroid of the rows. Displayed row points are the orthogonal projections of the J-dimensional row points onto the "closest" (i.e. best fitting) K*-dimensional subspace. Variables are plotted (if desired) as vectors and the biplot interpretation is valid. Columns of G are eigenvectors of the covariance matrix.
    References: Pearson (1901) and Hotelling (1933) (both reproduced in Bryant and Atchley, 1975, along with other articles on principal components analysis); Morrison (1976); Gabriel (1971); Kendall (1975); Chambers (1977).

(2) Generalized principal components analysis (... biplot). D_w is a diagonal matrix of positive weights, assumed to have sum 1.
    A = (i) Y − 11^T D_w Y, or (ii) {Y − 11^T D_w Y}Φ^{1/2};  Ω = D_w;  Φ = a given metric Φ in option (i), I in option (ii).
    As in (1) except the space of the rows is generalized Euclidean according to the metric Φ and the row points are weighted respectively by w_1 ... w_I in the diagonal of D_w. The quality of approximation of each A is the same. Options (i) and (ii) give the same configuration F of row points but different configurations G of column points (see (3) below).
    References: Many well-known techniques can be considered as special cases, cf. analyses (3), (7), (8) and (9). The potential of different D_w and Φ has yet to be explored.

(3) Principal components analysis (... biplot) of standardized data. D_s is the diagonal matrix of standard deviations of the variables.
    A = (i) Y − (1/I)11^T Y, or (ii) {Y − (1/I)11^T Y}D_s^{−1};  Ω = (1/I)I.
    Well-known special case of (2) where the variance of the row points along each of the original axes is made equal. The columns of G in (ii) are the usual eigenvectors of the correlation matrix. Option (i) results in a G̃ which is related to that G by: G̃ = D_s G, so that the standard deviations are approximately represented by the lengths of the column point vectors, but not the covariance structure (see (4) below).
    References: Most textbooks on multivariate analysis treat this problem specifically under the guise of principal components analysis of the correlation matrix (i.e. option (ii)).

(4) Covariance biplot.
    A = {Y − (1/I)11^T Y}/(I − 1)^{1/2}.
    Origin of display is the centroid of the rows. Displayed row points are the orthogonal projections of the J-dimensional row points in Mahalanobis space onto the "closest" (i.e. best fitting) K*-dimensional subspace. Variables are plotted as vectors and their lengths approximate the respective standard deviations of the variables. Angle cosines between these vectors approximate the correlations between the respective variables.
    References: Gabriel (1971, 1972, 1981).

(5) Correlation biplot. D_s as in (3) above.
    A = {Y − (1/I)11^T Y}D_s^{−1}/(I − 1)^{1/2}.
    As for the covariance biplot, except the column points are at length 1 from the origin in J-dimensional space. The quality of display of each variable may thus be gauged by observing how near these points lie to the unit K*-dimensional hypersphere (e.g. unit circle in two dimensions).
    References: Hills (1969).

(7) Canonical correlation analysis. Let Y = [Y_1 Y_2], where the variables naturally divide into subsets of size I and J. Suppose Y is centered with respect to variable (column) means and that S_11 and S_22 represent the within-sets covariance matrices, and S_12 the between-sets covariance matrix.
    The columns of F and G are the canonical loadings, i.e. the coefficients on the original variables which give the canonical variables. If the canonical scores are plotted, i.e. the rows of Y_1 F and Y_2 G, then these may be considered to be two clouds of points representing the cases in their respective Mahalanobis spaces (metrics S_11^{−1} and S_22^{−1} respectively), projected onto the K*-dimensional subspaces which exhibit the greatest positional correlation of the two clouds.
    References: Anderson (1958); Tatsuoka (1971); Morrison (1976); Mardia et al. (1979); Falkenhagen and Nash (1978); Chambers (1977); Gittins (1979).

(8) Canonical variate analysis. The rows of Y fall into H groups. Let Ȳ be the H × J matrix of group means on the J variables, D_w the diagonal matrix of proportions of rows in each group, and S the usual pooled within-groups covariance matrix.
    Ω = D_w;  Φ = S^{−1}.
    The origin of the display is the centroid of the rows of Ȳ, i.e. of the rows of Y too. This is a special case of generalized principal components analysis, as defined in (2), where we are identifying the principal axes of the group centroids, weighted by the number in each group, in the Mahalanobis space defined in terms of the within-groups pooled covariance matrix S.
    References: Same references as for canonical correlation analysis, as well as most textbooks on discriminant analysis.

(9) Correspondence analysis (synonyms: dual scaling, reciprocal averaging, canonical analysis of contingency tables). If Y is a contingency table, say, let P be Y divided by its sum. Let D_r and D_c be the diagonal matrices of row and column sums respectively of P.
    Ω = D_r;  Φ = D_c.
    This analysis may be described as a special case of (7) above when the data is discrete. The rows and columns of the contingency table Y are represented by points whose positions indicate the associations between the rows and columns of Y. The row and column displays are dual generalized principal components analyses (or principal co-ordinates analyses, cf. Example 3.5.2).
    References: Benzécri et al. (1973); Hill (1974); Greenacre and Degos (1977); Greenacre (1978); Nishisato (1980); Greenacre (1981); Gifi (1981); Gauch (1982); this book.
F) and variables (columns of Y or A) in principal components analysis. These are very simple to add to the basic method if one uses a programming language like SAS or GENSTAT (see Appendix B).
(5) In variations (4) and (5) the data are divided initially by (I − 1)^{1/2} so that the true scale of variance (respectively, correlation) is recovered in the display. For example, in the covariance biplot:

    {Y − (1/I)11^T Y}/(I − 1)^{1/2} = N D_α M^T

so that M D_α^2 M^T equals the covariance matrix:

    {Y − (1/I)11^T Y}^T {Y − (1/I)11^T Y}/(I − 1)

Hence the scalar products of the rows of G ≡ M(K*) D_α(K*) approximate the respective covariances, which implies the geometric interpretation in terms of standard deviations and correlations (Gabriel, 1971).
(6) Notice the distinction between options (i) and (ii) in analyses (2) and (3). There is no difference in the final display of the row points, but the displays of the column points are different because different matrices are being biplotted.
(7) Notice that the inverse of the matrix defining the metric acts as a "weighting matrix" in the dual problem. In canonical correlation analysis and correspondence analysis the matrix "parameters" Ω and Φ are defined as the weighting matrices, whose inverses imply the usual metrics in the dual spaces. For example, the configuration of the row points F in correspondence analysis which is obtained from Table A.1(9) is the same as that of the analysis of A = D_r^{−1}P − 11^T D_c, with Ω = D_r, Φ = D_c^{−1} and a = 1. This is a generalized principal components analysis (Table A.1(2)), where the rows of A are the centered row profiles of P, weighted by the masses in the diagonal of D_r (cf. Section 2.5). Symmetrically, the configuration of the column points G is the same as in the generalized principal components analysis of A = D_c^{−1}P^T − 11^T D_r, with Ω = D_c, Φ = D_r^{−1} and b = 1. As shown in Chapter 4, variations of correspondence analysis in the literature, for example reciprocal averaging and dual scaling, differ algebraically only in the definition of the parameters a and b, which are often both set to zero. If a = 1 and b = 0 in correspondence analysis, then each row point lies at a particular barycentre (weighted average) of the column points, where the column points are weighted proportionally to the respective elements in that row of the contingency table.

Principal co-ordinates analysis

This analysis is set in a wider framework in that it relies on an input matrix of interpoint distances rather than the original rectangular data matrix. Here we generalize the usual definition of the analysis to accommodate masses assigned to the points. A matrix S of scalar products is first calculated (cf. Example 2.6.1(b)) and then its generalized eigendecomposition calculated: S = N D_μ N^T, say, where N^T D_w N = I, D_w being the diagonal matrix of point masses. The optimal (least-squares) K*-dimensional display of the scalar products is provided by the rows of F ≡ N(K*) D_μ(K*)^{1/2}. The generalized eigendecomposition is usually computed via the ordinary eigendecomposition, like the generalized SVD (cf. Section 2.5 and Appendix B): that is, find the eigenvalues/eigenvectors of D_w^{1/2} S D_w^{1/2} = U D_μ U^T, then N = D_w^{−1/2} U.

Example 3.5.2 shows how correspondence analysis may be defined as two dual principal co-ordinates analyses.

Weighting of individual data

A problem which is quite external to the above framework is that of approximating the matrix A by weighted least-squares, where each term in a_ij is weighted by a prescribed w_ij. This is discussed by Gabriel and Zamir (1979) and is useful:

(1) in the treatment of missing data, for which w_ij can be set to 0;
(2) when we have available a measure of confidence for each datum, which we equate proportionally to w_ij;
(3) in the treatment of outliers, which may be individually down-weighted.

Notice that our framework does allow a weighting scheme of the form w_ij = s_i t_j, so that individual rows and/or columns of A can be weighted in the least-squares approximation. This is useful, for example, in reducing the rôle played by an outlying vector of data. However, in order to accommodate a general weighting scheme we can no longer rely on the theory based on the SVD, with its neat algorithm and nice optimality properties. The usual row and column vector geometries are also lost when general weighting is introduced.
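The three-phase scheme of Section A.2 can be sketched as a short routine with the "parameters" Ω, Φ, a and b. This is a modern NumPy illustration of the subroutine idea (the function names and the random data are our own; the book's actual software was FORTRAN, cf. Appendix B):

```python
import numpy as np

def sym_sqrt_pair(S):
    """Symmetric square root of S and its inverse, via S = W D_mu W^T."""
    mu, W = np.linalg.eigh(S)
    return (W @ np.diag(np.sqrt(mu)) @ W.T,
            W @ np.diag(1.0 / np.sqrt(mu)) @ W.T)

def general_analysis(A, Omega, Phi, a=1.0, b=0.0, K_star=2):
    """Phases 2-3: generalized SVD of A for metrics Omega and Phi, then
    coordinates F = N(K*) D_alpha(K*)^a and G = M(K*) D_alpha(K*)^b."""
    Oh, Oih = sym_sqrt_pair(Omega)
    Ph, Pih = sym_sqrt_pair(Phi)
    U, alpha, Vt = np.linalg.svd(Oh @ A @ Ph, full_matrices=False)  # (A.1.10)
    N, M = Oih @ U, Pih @ Vt.T                                      # (A.1.11)
    F = N[:, :K_star] * alpha[:K_star] ** a
    G = M[:, :K_star] * alpha[:K_star] ** b
    return F, G, alpha

# Phase 1 for ordinary principal components analysis (Table A.1, row (1)):
# column-centre Y, with Omega = (1/I) I and Phi = I.
rng = np.random.default_rng(0)
Y = rng.standard_normal((10, 4))
A = Y - Y.mean(axis=0)
F, G, alpha = general_analysis(A, np.eye(10) / 10, np.eye(4), a=1, b=0)

# Since a + b = 1, the between-set scalar products F G^T give the rank-2
# least-squares approximation of A (the biplot interpretation, remark (2)).
```

Varying the Phase 1 transform and the metrics then yields the other rows of Table A.1 within the same routine.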
The computations are only slightly more cumbersome than the algorithm defined by Table B.1. Either one of the eigenequations of (4.1.23) can be used, whichever involves the matrix of smallest order, and then the other set of co-ordinates can be obtained using the relevant transition formula of (4.1.16). We are not particularly concerned about the loss of accuracy implied by computing the solution of what is essentially an SVD problem by means of the eigendecomposition (cf. Golub and Reinsch, 1971; Chambers, 1977). This would be of concern only in rare instances.
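The two computational routes can be illustrated on a small made-up contingency table (a NumPy sketch of ours, not the book's program): the eigenequation of smallest order yields the squared singular values, and a transition formula then recovers the other set of vectors.

```python
import numpy as np

# A small two-way contingency table Y (invented numbers for illustration).
Y = np.array([[10.0, 5.0, 3.0],
              [ 4.0, 8.0, 6.0],
              [ 2.0, 6.0, 9.0],
              [ 7.0, 3.0, 2.0]])

P = Y / Y.sum()                        # correspondence matrix
r, c = P.sum(axis=1), P.sum(axis=0)    # row and column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

# Route 1: the SVD directly (the numerically preferred route).
alpha_svd = np.linalg.svd(S, compute_uv=False)

# Route 2: the eigenequation involving the matrix of smallest order,
# here the 3 x 3 matrix S^T S; its eigenvalues are the squared
# singular values, so tiny singular values lose accuracy here.
lam, V = np.linalg.eigh(S.T @ S)
lam, V = lam[::-1], V[:, ::-1]         # descending order
assert np.allclose(alpha_svd, np.sqrt(np.maximum(lam, 0.0)))

# The other set of vectors follows from a transition formula,
# u_k = S v_k / alpha_k, rather than a second eigendecomposition.
U = S @ V[:, :2] / np.sqrt(lam[:2])
assert np.allclose(U.T @ U, np.eye(2))
```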
Appendix B
Computations by reciprocal averaging
computer installations. (At present it runs on the DEC 10, with Tektronix 4010 series graphics terminals, and relies on NCAR graphics software or the graphics package DISPLAA.) We used BGRAPH to plot Figs 8.2, 8.3, 9.3,

Specialized programs with French documentation

Many users will prefer to receive a portable FORTRAN program to do the computations as well as handle the drudgery of input, possible recoding of the data, output and line printer plotting. Such software is available from a number of sources in France. For example, a complete library of programs is available by subscription, and includes a number of recoding programs (doubling, creating indicator matrices, etc.) as well as programs to perform various cluster analyses and compute inertia contributions. Enquiries can be made to:

    4 Place Jussieu,
    75005 Paris, France

All the correspondence analyses in this book were performed using the program written in France by Tabet (1973).

Another library of programs, including correspondence analysis and multiple correspondence analysis, geared to the analysis of survey data, is published in the book of Lebart et al. (1977). A new version of these programs exists, called SPAD-1983, and may be obtained through:

    CESIA (Centre de Statistique et d'Informatique Appliquées),
    82 rue de Sèvres,
    75007 Paris, France

A set of program subroutines in FORTRAN and in APL is also published in the book by Lebart et al. (1979). Enquiries can also be made to the above address.

The book by Nishisato (1980) also gives listings of programs to perform various types of dual scaling, including the imposition of order constraints. Enquiries can be made to:

    S. Nishisato,
    Ontario Institute for Studies in Education,
    252 Bloor Street West,
    Toronto, Canada

At present we ourselves are also occupied with the development and full documentation of a suite of portable FORTRAN programs to perform the analyses described in this book, including some graphics routines according to an international standard. Interested parties can be kept in contact about the availability of this software by writing to:

    Michael Greenacre,
Subject Index

Angle, 25, 26
Barycentre, 24, 42, 90
Biplot, see also Vector model, 119-20
Bipolar variables, 169-71
Burt matrix, 140-1, 243-4
  modified, 243-4, 254
Canonical correlation, 115, 120
Centroid, 17, 18-20, 24, 34, 85, 90
Chi-square statistic, 31-3
Classification, 185-6, 190-3, 308-12, 312-7
Computer programs, 356-7
Continuation ratio, 266
Contributions
  of points to a principal axis, 67, 91
Convex hull, 42, 215-8
Correlation ratio, 106
Cosine rule, 27
Dimension, 16, 24
Dimension weighting, 77-8
Discriminant analysis, 185-6, 187-90
Doubling, 171-9, 271
Duality, 60-6
Dummy variables, see Indicator variables
Eigenvalue, 38
Eigenvector, 38
Euclidean space
  multidimensional, 28
  two-dimensional, 25-8
Experimental design, 316
Fuzzy coding, 159, 174
Huyghen's theorem, 203, 204, 206
Incidence matrix, see Indicator matrix
Internal consistency, 104-6, 234
Interval variable, 160, 223
Jackknifing, 209-14
Joint display, 65
Length, 25, 26
Lever principle, 175
Linear combination, 23
Linear independence (of vectors), 24
Logical coding, 159
Mass, 35-6, 85, 296, 306-8
Mean-square contingency coefficient, 35
Missing data, 236-9
Multivariate indicator matrix, 138, 174
Nodal inertias, 200-2
Node, 197
Normalization, 27
Ordination, 96
Orthogonal, 27
Principal components analysis, 36, 39, 47-8, 145-6, 182-3, 281-7, 312, 348
Ratio variables, 160, 223
Score, 102
Stem-and-leaf histogram, 2-3
Supplementary points, 70-4, 188-9, 296-9, 301-3, 310
  as points with zero mass, 73
SVD, see Singular value decomposition
Trivial axis, 51-3
Unfolding model, 179-81