
Bayesian Multidimensional Scaling and Choice of Dimension

Man-Suk Oh and Adrian E. Raftery

Man-Suk Oh is Associate Professor of Statistics, Department of Statistics, Ewha Women's University, Seoul 120-750, Korea (E-mail: msoh@mm.ewha.ac.kr). Adrian E. Raftery is Professor of Statistics and Sociology, Department of Statistics, University of Washington, Seattle, WA 98195-4322 (E-mail: raftery@stat.washington.edu). Oh's research was accomplished with a research fund provided by the Korean Research Foundation, Support for Faculty Research Abroad. Raftery's research was supported by ONR grant N00014-96-1-1092. The authors thank Chris Fraley for helpful comments and discussions. They thank Kate Stovel for providing the Lloyds Bank data and for helpful discussions about the data.

© 2001 American Statistical Association. Journal of the American Statistical Association, September 2001, Vol. 96, No. 455, Theory and Methods.

Multidimensional scaling is widely used to handle data that consist of similarity or dissimilarity measures between pairs of objects. We deal with two major problems in metric multidimensional scaling, configuration of objects and determination of the dimension of the object configuration, within a Bayesian framework. A Markov chain Monte Carlo algorithm is proposed for object configuration, along with a simple Bayesian criterion, called MDSIC, for choosing the dimension. Simulation results are presented, as are real data. Our method provides better results than does classical multidimensional scaling and ALSCAL for object configuration, and MDSIC seems to work well for dimension choice in the examples considered.

KEY WORDS: Clustering; Dimensionality; Dissimilarity; Markov chain Monte Carlo; Metric scaling; Model selection.

1. INTRODUCTION

Multidimensional scaling (MDS) is concerned with data that are given as similarity or dissimilarity measures between pairs of objects. Its goal is to represent the objects by points in a (usually) Euclidean space. MDS has its roots in psychology, specifically psychophysics, as it is based on the analogy between the psychological concept of similarity and the geometrical concept of distance. Subsequently, it has been widely used in other social and behavioral sciences. Recently, interest in MDS has increased further because of its usefulness in some rapidly developing subjects, such as genomics (Tibshirani et al. 1999) and information retrieval for the Web and other document databases (Schutze and Silverstein 1997).

One of the main applications of MDS is visualization, where the user wants to represent a complex set of dissimilarities in a form that is easier to see. One reason for this is to see if visually apparent clusters are present in the data. Another application is exploration, where the user wants to understand the main dimensions underlying the dissimilarities. For example, the objects in MDS might be political candidates, and the data might consist of subjective similarity judgements. MDS might help to suggest which political positions or characteristics are important in forming similarity judgements (e.g., position on Social Security, age, tendency to tell jokes). A third application is hypothesis testing. Monographs on MDS include Davison (1983), Young (1987), Borg and Groenen (1997), and Cox and Cox (2001).

An important issue in MDS is configuration of objects, i.e., estimation of values for attributes of objects. A commonly used MDS method for pairwise dissimilarity data was developed by Torgerson (1952, 1958). Object configurations are easy to compute with this method, now called classical MDS (CMDS). It gives complete recovery (up to location shift) of object configurations when the given dissimilarities are exactly equal to the Euclidean distances and when the dimension is correctly specified.

Another commonly used MDS method is ALSCAL (Takane et al. 1977), which minimizes the sum of the squared differences of squared dissimilarities and squared distances. The attraction of ALSCAL is that it can analyze data in various forms. In many practical situations, however, there are measurement errors in the observed dissimilarities and no clear notion of dimension.

Maximum likelihood MDS methods have been developed for handling measurement errors; see, for example, Ramsay (1982), Takane (1982), Takane and Carroll (1981), MacKay (1989), MacKay and Zinnes (1986), Groenen (1993), and Groenen, Mathar, and Heiser (1995). However, justification of maximum likelihood relies on asymptotic theory, and computation requires nonlinear optimization. The number of parameters to be optimized over typically grows as fast as the number of objects, so that the asymptotic theory may not apply in high dimensions, as pointed out by Cox (1982). Moreover, the likelihood surface will tend to have many more local minima when there are more dimensions, and finding a good initial estimate will be correspondingly more difficult.

Another important issue in MDS is dimensionality, i.e., the number of significant attributes. Despite its importance in many applications, there is no definitive method for determining the dimension for dissimilarity data. The most commonly used method is to search for an elbow, that is, a point where a measure of fit or a measure of contribution to the dissimilarity levels off, in a plot of the measure versus dimension (Spence and Graef 1974; Davison 1983; Borg and Groenen 1997). However, it is often difficult to find an elbow, especially when there are significant errors, and visual inspection of a plot may be misleading, because its appearance often depends on the relative scale of the axes.

In this article, we deal with these two important issues in MDS within a Bayesian framework. We use a Euclidean distance model and assume a Gaussian measurement error in the observed dissimilarity. Under the model, we propose a simple Markov chain Monte Carlo (MCMC) algorithm with which to obtain a Bayesian solution for the object configuration.
We found that the proposed method, which we call Bayesian MDS (BMDS), provided a much better fit to the data than did CMDS and a moderately better fit than did ALSCAL in all the examples we tested.

Moreover, the improvement in performance of the proposed BMDS scheme relative to CMDS or ALSCAL was more pronounced when there were significant measurement errors in the data or when the Euclidean model assumption was violated or the dimension was misspecified.

On the basis of the BMDS estimate of object configuration over a range of dimensions, we propose a simple Bayesian criterion with which to choose an appropriate dimension. This criterion, called MDSIC, is based on the Bayes factor, or ratio of integrated likelihoods, for the BMDS estimated configuration under one dimension versus a different dimension. In simulated data, we found that the criterion works well for Euclidean models with measurement error that is not too large. In real examples, we found that the criterion gave satisfactory results. We also give an example of cluster analysis on real dissimilarity data in which the BMDS estimates of object configuration are used in conjunction with model-based clustering (Banfield and Raftery 1993; Fraley and Raftery 1998).

In our approach, observed dissimilarities are modeled as equal to Euclidean distances plus measurement error. In this sense, what we do here can be viewed as a Bayesian analysis of metric MDS; being Bayesian confers the benefits of yielding good estimated configurations, providing a formal way to choose the dimension, and providing measures of uncertainty in the estimates. A great deal of MDS research, however, has focused on nonmetric MDS, in which the relationship between dissimilarity and underlying distance is modeled as nonlinear. One could use the basic ideas here to do Bayesian nonmetric MDS, and we suggest some ways of doing this in Section 6.

The article is organized as follows. Classical MDS and ALSCAL are briefly reviewed in Section 2. Bayesian MDS is described in Section 3: Section 3.1 defines the model and the prior, Section 3.2 presents an MCMC algorithm, and Section 3.3 describes the estimation of object configuration from the MCMC output. Based on the BMDS output, a simple Bayesian dimension selection criterion, MDSIC, is described in Section 4. Some simulated and real examples are given in Section 5. We conclude with discussion in Section 6.

2. CLASSICAL MULTIDIMENSIONAL SCALING AND ALSCAL

2.1 Classical Multidimensional Scaling

Let $\delta_{ij}$ denote the dissimilarity measure between objects $i$ and $j$; the dissimilarities are functionally related to $p$ unobserved attributes of the objects. Let $x_i = (x_{i1}, \ldots, x_{ip})$ denote an unobserved vector representing the values of the attributes possessed by object $i$.

Torgerson (1952, 1958) developed a technique for multidimensional scaling, now called CMDS. Assume that the dissimilarity measure, $\delta_{ij}$, is the distance between objects $i$ and $j$ in a $p$-dimensional Euclidean space, i.e.,

$$\delta_{ij} = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2}, \qquad (1)$$

where $x_{ik}$ is the $k$th element of $x_i$. The elements $x_{ik}$ are unknown, and the goal of MDS is to recover them from the dissimilarity data.

Construct a double-centered matrix $A$ with elements $a_{ij}$ defined by $a_{ij} = -\frac{1}{2}(\delta_{ij}^2 - \delta_{i\cdot}^2 - \delta_{\cdot j}^2 + \delta_{\cdot\cdot}^2)$, where $\delta_{i\cdot}^2 = \frac{1}{n}\sum_{j=1}^{n} \delta_{ij}^2$, $\delta_{\cdot j}^2 = \frac{1}{n}\sum_{i=1}^{n} \delta_{ij}^2$, and $\delta_{\cdot\cdot}^2 = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} \delta_{ij}^2$. It was shown by Torgerson (1952, 1958) that

$$a_{ij} = \sum_{k=1}^{p} x_{ik}\, x_{jk} \quad \text{for all } i, j, \quad \text{i.e.,} \quad A = XX', \qquad (2)$$

where $X$ is the $n \times p$ matrix of object coordinates. The coordinates of $X$ can be recovered from the spectral decomposition of the matrix $A$ in (2). If the observed dissimilarities, $d_{ij}$, satisfy the Euclidean distance assumption and there is no measurement error, then the Euclidean distances computed from the matrix $X$ satisfying (2) will be exactly equal to the given dissimilarities. However, when the model assumption is violated or when there are significant measurement errors in the data, CMDS estimates of the object configuration may not be very useful.

Because Euclidean distance is invariant under translation, rotation, and reflection about the origin, the matrix $X$ can be centered at the origin and rotated to its principal axes orientation (Borg and Groenen 1997, p. 62).

There is no definitive method for choosing the effective dimension of $x_i$, the number of object attributes that contribute significantly to the dissimilarities. A common way to assess dimension is to look at the eigenvalues of the scalar product matrix $A$. The $k$th eigenvalue is a measure of the contribution of the $k$th coordinate of $X$ to the squared distances. Hence, only the first $p$ coordinates of $X$, corresponding to the first $p$ significantly large eigenvalues, suffice to represent the objects. To determine significantly large eigenvalues, one may draw a plot of the ordered eigenvalues versus dimension and look for a dimension at which the sequence of eigenvalues levels off. If each $\delta_{ij}$ is equal to a $p$-dimensional Euclidean distance between objects $i$ and $j$ as given in (1), then the plot should level off precisely at dimension $(p+1)$.

A measure of fit, called STRESS, is commonly used to determine the dimensionality. Several definitions of STRESS have been proposed; the one we use here, and perhaps the most commonly used one, is

$$\mathrm{STRESS} = \sqrt{\frac{\sum_{i>j} (d_{ij} - \hat{\delta}_{ij})^2}{\sum_{i>j} d_{ij}^2}},$$

where $\hat{\delta}_{ij}$ is the Euclidean distance obtained from the estimated object configuration (Kruskal 1964). A plot of STRESS versus dimension will level off at the true dimension $p$ if $d_{ij} = \hat{\delta}_{ij}$ and $\hat{\delta}_{ij}$ is given by (1). Note that the squared STRESS is proportional to the sum of squared residuals, $\mathrm{SSR} = \sum_{i>j} (d_{ij} - \hat{\delta}_{ij})^2$.

Both methods rely on detecting an elbow in a sequence of values, that is, a point where the plot levels off. However, in real data that do not conform exactly to the model, or in which there is a significant amount of measurement or sampling error, an elbow may be difficult to discern.
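For concreteness, here is a minimal sketch of the CMDS computation just described, in Python with NumPy; the function name and interface are ours, for illustration only:

```python
import numpy as np

def cmds(delta, p):
    """Classical MDS (Torgerson 1952, 1958): recover an n x p
    configuration from an n x n matrix of dissimilarities."""
    n = delta.shape[0]
    d2 = delta ** 2
    # Double centering: a_ij = -(1/2)(d2_ij - d2_i. - d2_.j + d2_..)
    J = np.eye(n) - np.ones((n, n)) / n
    A = -0.5 * J @ d2 @ J
    # Spectral decomposition; keep the p largest eigenvalues
    eigval, eigvec = np.linalg.eigh(A)
    idx = np.argsort(eigval)[::-1][:p]
    lam = np.clip(eigval[idx], 0.0, None)   # guard against small negatives
    X = eigvec[:, idx] * np.sqrt(lam)       # so that A is approximately X X'
    return X
```

A configuration recovered this way reproduces the dissimilarities exactly when they are error-free Euclidean distances of dimension p, per (2).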
2.2 ALSCAL

ALSCAL (alternating least squares scaling) was developed by Takane et al. (1977). It is very general in that it can analyze data given in various forms. In metric MDS, ALSCAL minimizes S-STRESS, where

$$\text{S-STRESS} = \sum_{i>j} \left( \delta_{ij}^2 - d_{ij}^2 \right)^2.$$

Note that S-STRESS differs from STRESS in that it uses squared distances and dissimilarities, which is done for computational convenience. However, squaring dissimilarities and distances causes S-STRESS to emphasize larger dissimilarities over smaller ones, which may be viewed as a disadvantage of ALSCAL (Borg and Groenen 1997, p. 204). The minimization can be done by using the Newton–Raphson method, possibly modified. ALSCAL is one of the most commonly used MDS techniques, and it is available in the statistical computer packages SAS and SPSS. For details, see Cox and Cox (2001).

3. BAYESIAN MULTIDIMENSIONAL SCALING

3.1 Model and Prior

Dissimilarity data can be obtained in various forms. However, because Euclidean distance is easy to handle and is relatively insensitive to the choice of dimension compared to other distance measures, it tends to be used in cases in which the dimension is unknown, unless there are strong theoretical reasons for preferring a non-Euclidean distance (Davison 1983). Thus, for Bayesian MDS, we model the true dissimilarity measure $\delta_{ij}$ as the distance between objects $i$ and $j$ in a Euclidean space, as given in (1).

In practical situations, often there are measurement errors in observations. In addition, dissimilarity measures are typically given as positive values. We therefore assume that the observed dissimilarity measure, $d_{ij}$, is equal to the true measure, $\delta_{ij}$, plus a Gaussian error, with the restriction that the observed dissimilarity measure is always positive. In other words, given $\delta_{ij}$, the observed dissimilarity measure $d_{ij}$ is assumed to follow the truncated normal distribution

$$d_{ij} \sim N(\delta_{ij}, \sigma^2)\, I(d_{ij} > 0), \qquad i \neq j,\ i, j = 1, \ldots, n, \qquad (3)$$

where $\delta_{ij} = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2}$, and the $x_{ik}$ are unobserved. From this, the likelihood function of the unknown parameters $X = \{x_i\}$ and $\sigma^2$ is

$$l(X, \sigma^2) \propto (\sigma^2)^{-m/2} \exp\left\{ -\frac{\mathrm{SSR}}{2\sigma^2} - \sum_{i>j} \log \Phi\!\left(\frac{\delta_{ij}}{\sigma}\right) \right\}, \qquad (4)$$

where $m = n(n-1)/2$ is the number of dissimilarities, $\mathrm{SSR} = \sum_{i>j} (d_{ij} - \delta_{ij})^2$ is the sum of squared residuals, and $\Phi(\cdot)$ is the standard normal cdf.

For Bayesian analysis of the model, we need to specify priors for $X$ and $\sigma^2$. For the prior distribution of $x_i$, we use a multivariate normal distribution with mean 0 and a diagonal covariance matrix $\Lambda$, i.e., $x_i \sim N(0, \Lambda)$, independently for $i = 1, \ldots, n$. For the prior of the error variance $\sigma^2$, we use a conjugate prior $\sigma^2 \sim IG(a, b)$, the inverse Gamma distribution with mode $b/(a+1)$. For a hyperprior for the elements of $\Lambda = \mathrm{Diag}(\lambda_1, \ldots, \lambda_p)$, given dimension $p$, we also assume a conjugate prior, $\lambda_j \sim IG(\alpha, \beta_j)$, independently for $j = 1, \ldots, p$. We assume prior independence among $X$, $\Lambda$, and $\sigma^2$, i.e., $\pi(X, \sigma^2, \Lambda) = \pi(X)\,\pi(\sigma^2)\,\pi(\Lambda)$, where $\pi(X)$, $\pi(\sigma^2)$, and $\pi(\Lambda)$ are the priors given earlier.

When there is little prior information, one may use either the results from a preliminary run or the results from other MDS techniques, such as CMDS or ALSCAL, for parameter specification in the priors. For instance, one may choose a small $a$ for a vague prior of $\sigma^2$ and choose $b$ so that the prior mean matches $\mathrm{SSR}/m$, where SSR is obtained from CMDS or ALSCAL. Similarly, for the hyperprior of $\lambda_j$, one may choose a small $\alpha$ and choose $\beta_j$ so that the prior mean of $\lambda_j$ matches the $j$th diagonal element of the sample covariance matrix $S_x = \frac{1}{n}\sum_{i=1}^{n} x_i' x_i$ of $X$ obtained from CMDS or ALSCAL. As noted above, the MDS solution can be transformed to have zero sample mean, i.e., $\sum_{i=1}^{n} x_i = 0$, and a diagonal sample covariance matrix.

Such a prior is mildly data dependent, and it might be argued that this violates the definition of a prior distribution. However, we view this prior as an approximation to the elicited data-independent prior of an analyst who knows a little, but not much, about the problem at hand. Because this prior is diffuse relative to the likelihood, the estimation results are unlikely to be sensitive to its precise specification.
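Before turning to the sampler, it may help to see the likelihood (4) spelled out in code. The following is our illustrative Python sketch (not the authors' Fortran program); scipy's norm.logcdf supplies the log Φ term:

```python
import numpy as np
from scipy.stats import norm

def log_likelihood(X, sigma2, d):
    """Log of (4), up to an additive constant: truncated-normal
    measurement model for the observed dissimilarities d
    (n x n symmetric matrix with zero diagonal)."""
    n = X.shape[0]
    iu = np.triu_indices(n, k=1)                     # pairs with i < j
    diff = X[:, None, :] - X[None, :, :]
    delta = np.sqrt((diff ** 2).sum(axis=2))[iu]     # Euclidean distances
    m = len(delta)
    ssr = ((d[iu] - delta) ** 2).sum()               # sum of squared residuals
    # last term: -sum log Phi(delta/sigma), from the truncation at 0
    return (-0.5 * m * np.log(sigma2)
            - ssr / (2.0 * sigma2)
            - norm.logcdf(delta / np.sqrt(sigma2)).sum())
```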
3.2 MCMC

From the likelihood and the prior, the posterior density function of the unknown parameters $(X, \sigma^2, \Lambda)$ is

$$\pi(X, \sigma^2, \Lambda \mid D) \propto (\sigma^2)^{-(m/2+a+1)} \prod_{j=1}^{p} \lambda_j^{-(n/2+\alpha+1)} \exp\left\{ -\frac{\mathrm{SSR}}{2\sigma^2} - \sum_{i>j} \log \Phi\!\left(\frac{\delta_{ij}}{\sigma}\right) - \frac{1}{2} \sum_{i=1}^{n} x_i' \Lambda^{-1} x_i - \frac{b}{\sigma^2} - \sum_{j=1}^{p} \frac{\beta_j}{\lambda_j} \right\}, \qquad (5)$$

where $D$ is the matrix of observed dissimilarities. Because of the complicated form of the posterior density function (5), numerical integration is required to obtain a Bayes estimate of the parameters. In particular, the posterior is a complicated function of the $x_i$'s, which in most cases are of high dimension. We therefore use an MCMC algorithm (Gilks, Richardson, and Spiegelhalter 1996) to simulate from the posterior distribution (5). Our algorithm proceeds by iteratively generating new values for each object configuration $x_i$, the error variance $\sigma^2$, and the hyperparameter $\Lambda$, given the current values of the other unknowns.

We first suggest initialization strategies for the unknown parameters that are needed for the MCMC algorithm. For initialization of $x_i$, one may use the output, $x_i^{(0)}$, of $x_i$ from CMDS or ALSCAL, because it is easy to obtain. The resulting $X$ can be centered at the origin and then transformed by using the spectral decomposition so as to have a diagonal sample covariance matrix, thus conforming to the prior. From the adjusted $x_i^{(0)}$'s, one can compute the sum of squared residuals $\mathrm{SSR}^{(0)}$ and $\sigma^{2(0)} = \mathrm{SSR}^{(0)}/m$, which can be used as an initial value of $\sigma^2$ in the algorithm. In addition, the diagonal elements of the adjusted sample covariance matrix of $X$ can be used as initial values for the $\lambda_j$'s.

We now describe the details of sample generation in the MCMC algorithm. At each iteration, we simulate a new value of $\lambda_j$ from its conditional posterior distribution given the other unknowns. From (5), the full conditional posterior distribution of $\lambda_j$ is the inverse Gamma distribution $IG(\alpha + n/2,\ \beta_j + s_j/2)$, where $s_j/n$ is the sample variance of the $j$th coordinates of the $x_i$'s.

We use a random walk Metropolis–Hastings step (Hastings 1970) to generate $x_i$ and $\sigma^2$ in each iteration of the MCMC algorithm. Specifically, a normal proposal density is used in the random walk Metropolis–Hastings algorithm for generation of $x_i$. To choose the variance of the normal proposal density, we note that the full conditional posterior density of $x_i$ is $\pi(x_i \mid \cdots) \propto \exp[-\frac{1}{2}(Q_1 + Q_2) - \sum_{j \neq i} \log \Phi(\delta_{ij}/\sigma)]$, where $Q_1 = \frac{1}{\sigma^2}\sum_{j \neq i} (\delta_{ij} - d_{ij})^2$ and $Q_2 = x_i' \Lambda^{-1} x_i$. Because $(\delta_{ij} - d_{ij})^2/\sigma^2$ is a quadratic function of $x_i$ with leading coefficient equal to $1/\sigma^2$, and $Q_1$ has $n-1$ terms of this kind whereas $Q_2$ has only one quadratic term with coefficient $\Lambda^{-1}$, $Q_1$ would dominate the full conditional posterior density function of $x_i$ unless there is strong prior information. Thus, we may consider $Q_1$ only, approximate the full conditional variance of $x_i$ by $\sigma^2/(n-1)$, and choose the variance of the normal proposal density to be a constant multiple of $\sigma^2/(n-1)$.

From a preliminary numerical study, we found that the full conditional density function of $\sigma^2$ is well approximated by the density function of $IG(m/2 + a,\ \mathrm{SSR}/2 + b)$. When the number of dissimilarities, $m = n(n-1)/2$, is large, which is often the case because $m$ is a quadratic function of $n$, the inverse Gamma density function is well approximated by a normal density. Thus, we propose a random walk Metropolis–Hastings algorithm with a normal proposal density whose variance is proportional to the variance of the $IG(m/2 + a,\ \mathrm{SSR}/2 + b)$ distribution.

3.3 Posterior Inference

Iterative generation of $\{x_i\}$, $\sigma^2$, and $\{\lambda_j\}$ for a sufficiently long time provides a sample from the posterior distribution of the unknown parameters, and Bayesian estimates of the parameters can be obtained from the samples. However, because the model assumes a Euclidean distance for the dissimilarity measure $\delta_{ij}$, the posterior samples of $\{x_i\}$ would be invariant under translation, rotation, and reflection about the origin, as in other MDS methods, unless there is strong prior information to the contrary. We can retrieve only the relative locations of the objects from the data, not their absolute locations. Hence, the convergence of the $\delta_{ij}$'s rather than of $X$ needs to be checked to verify the convergence of the MCMC. The near lack of identifiability in $X$ also makes the use of sample averages as Bayes estimates inadvisable, because the MCMC samples of $X$ may be unstable even when the distances $\delta_{ij}$ are stable. Thus, we take an approximate posterior mode of $X$ as a Bayes estimate of $X$, i.e., the BMDS solution of the object configuration. The posterior mode provides relative positions of the $x_i$'s corresponding to the maximum posterior density. A meaningful absolute position of $X$ may be obtained from an appropriate transformation, if desired.

To obtain the posterior mode of $X$, one may compute the posterior density for each MCMC sample. However, this can be time consuming, because the posterior is complicated. Instead, we observed that the likelihood dominates the prior, and that in the likelihood (4) the term involving SSR is dominant, so that the posterior mode of $X$ can be approximated by the value of $X$ that minimizes the sum of squared residuals SSR.

Because the center and direction of $X$ can be arbitrary, we postprocess the MCMC sample of $X$ at each iteration by using the transformation $x_i = D_x'(x_i - \bar{x})$, where $\bar{x}$ is the sample mean of the $x_i$'s and $D_x$ is the matrix whose columns are the eigenvectors of the sample covariance matrix $S_x = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})'(x_i - \bar{x})$ of the $x_i$'s. This transformation does not solve the nonidentifiability problem, but the new $x_i$'s have sample mean 0 and a diagonal covariance matrix, corresponding to the prior specification.

Although samples of the $x_i$'s are unstable because of the lack of identifiability, samples of the $\delta_{ij}$'s are stable after convergence, and hence they can be used to measure uncertainties in quantities that are functions of the $\delta_{ij}$'s. For example, one may be interested in whether object $i$ is more closely related to object $j$ than to object $k$, which can be answered by looking at the posterior distribution of $\delta_{ij} - \delta_{ik}$, as approximated by the MCMC sample.
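Putting Sections 3.2 and 3.3 together, one sweep of the sampler can be sketched as follows. This is our simplified Python rendering under the assumptions above (the original implementation is a Fortran program available through StatLib); the proposal scaling uses the 2.38² multiplicity constant discussed in Section 5:

```python
import numpy as np
from scipy.stats import norm

def mcmc_sweep(X, sigma2, lam, d, a, b, alpha, beta, rng):
    """One iteration of the BMDS sampler: Gibbs for lambda_j,
    random walk Metropolis-Hastings for each x_i and for sigma^2."""
    n, p = X.shape
    m = n * (n - 1) / 2
    c2 = 2.38 ** 2                      # proposal scale (Gelman et al. 1996)

    # Gibbs step: lambda_j | ... ~ IG(alpha + n/2, beta_j + s_j/2)
    s = (X ** 2).sum(axis=0)
    lam = 1.0 / rng.gamma(alpha + n / 2.0, 1.0 / (beta + s / 2.0))

    def log_post_xi(i, xi):
        # full conditional of x_i, up to a constant: Q1 + truncation + prior
        delta = np.sqrt(((np.delete(X, i, 0) - xi) ** 2).sum(axis=1))
        dio = np.delete(d[i], i)
        return (-((dio - delta) ** 2).sum() / (2 * sigma2)
                - norm.logcdf(delta / np.sqrt(sigma2)).sum()
                - 0.5 * (xi ** 2 / lam).sum())

    # RW-MH for each object; proposal variance c2 * sigma2/(n-1)
    for i in range(n):
        prop = X[i] + rng.normal(0, np.sqrt(c2 * sigma2 / (n - 1)), p)
        if np.log(rng.uniform()) < log_post_xi(i, prop) - log_post_xi(i, X[i]):
            X[i] = prop

    # RW-MH for sigma2; proposal sd from the IG(m/2+a, SSR/2+b) approximation
    iu = np.triu_indices(n, 1)
    delta = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))[iu]
    ssr = ((d[iu] - delta) ** 2).sum()

    def log_post_s2(s2):
        if s2 <= 0:
            return -np.inf
        return (-(m / 2 + a + 1) * np.log(s2) - (ssr / 2 + b) / s2
                - norm.logcdf(delta / np.sqrt(s2)).sum())

    shape = m / 2 + a
    sd = (ssr / 2 + b) / ((shape - 1) * np.sqrt(shape - 2))   # IG std. dev.
    prop = sigma2 + rng.normal(0, np.sqrt(c2) * sd)
    if np.log(rng.uniform()) < log_post_s2(prop) - log_post_s2(sigma2):
        sigma2 = prop
    return X, sigma2, lam
```

After each sweep, the draw of X would be centered and rotated to its principal axes as in Section 3.3; the distances δij computed from the draws are what one monitors for convergence.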
4. CHOICE OF DIMENSION

As described in the previous section, BMDS gives object configurations in a Euclidean space of given dimension. In most cases, the dimension of the objects (the number of significant attributes) is unknown. In this section, we propose a simple Bayesian dimension selection criterion based on the BMDS object configurations.

Consider the dimension $p$ as an unknown variable, and assume equal prior probability for all values of $p$ up to some maximum value, $p_{\max}$. Then the posterior is given by

$$\begin{aligned}
\pi(X, \sigma^2, \Lambda, p \mid D) &\propto l(X, \sigma^2, p \mid D)\, \pi(X \mid \Lambda, p)\, \pi(\sigma^2)\, \pi(\Lambda \mid p) \\
&= (2\pi)^{-m/2} \sigma^{-m} \exp\left\{ -\frac{\mathrm{SSR}}{2\sigma^2} - \sum_{i>j} \log \Phi(\delta_{ij}/\sigma) \right\} \\
&\quad \times (2\pi)^{-np/2} \prod_{j=1}^{p} \lambda_j^{-n/2} \exp\left\{ -\sum_{j=1}^{p} \frac{s_j}{2\lambda_j} \right\} \\
&\quad \times \Gamma(a)^{-1} b^a (\sigma^2)^{-(a+1)} \exp[-b/\sigma^2] \\
&\quad \times \Gamma(\alpha)^{-p} \prod_{j=1}^{p} \beta_j^{\alpha} \lambda_j^{-(\alpha+1)} \exp[-\beta_j/\lambda_j] \\
&= A(p) \cdot h(\sigma^2, X) \cdot g(\Lambda, X, p),
\end{aligned}$$

where

$$s_j = \sum_{i=1}^{n} x_{ij}^2, \qquad (6)$$

$$A(p) = (2\pi)^{-(m+np)/2}\, \Gamma(a)^{-1} b^a\, \Gamma(\alpha)^{-p} \prod_{j=1}^{p} \beta_j^{\alpha}, \qquad (7)$$

$$h(\sigma^2, X) = (\sigma^2)^{-(m/2+a+1)} \exp\left\{ -\frac{\mathrm{SSR}/2 + b}{\sigma^2} - \sum_{i>j} \log \Phi(\delta_{ij}/\sigma) \right\}, \qquad (8)$$

$$g(\Lambda, X, p) = \prod_{j=1}^{p} \lambda_j^{-(n/2+\alpha+1)} \exp\left\{ -\frac{s_j/2 + \beta_j}{\lambda_j} \right\}. \qquad (9)$$

Note that, because of the postprocessing described in Section 3.3, the $x_i$'s have sample mean 0 and a diagonal sample covariance matrix.

We adopt a Bayesian approach to choosing the dimension. We view the overall task to be that of choosing the best configuration, and hence we view the choice between dimension $p$ and dimension $p'$ as being between the estimated configuration with dimension $p$ and the estimated configuration with dimension $p'$. Thus, we consider the marginal posterior, $\pi(X, p \mid D)$, of $(X, p)$ with $X$ equal to the BMDS solution, and choose the value of $p$ that gives the largest value of $\pi(X, p \mid D)$. Choosing between dimension $p$ and dimension $p'$ is based on the posterior odds for the estimated configuration of dimension $p$ versus that of dimension $p'$.

Now, note that $\pi(X, p \mid D) = c \int l(X, \sigma^2, p \mid D)\, \pi(\sigma^2)\, d\sigma^2 \cdot \int \pi(X, \Lambda, p)\, d\Lambda \equiv c \cdot \tilde{l}(X, p \mid D)\, \tilde{\pi}(X, p)$, where $c$ is a constant independent of $X$ and $p$, $\tilde{l}(X, p \mid D)$ is the marginal likelihood of $(X, p)$, and $\tilde{\pi}(X, p)$ is the marginal prior of $(X, p)$. The marginalized likelihood term increases as $p$ increases. However, the marginal prior term decreases as $p$ increases, because we are using a diffuse (but proper) prior, and so this term penalizes more complex models. The approach has the simplicity of a maximum likelihood method as well as the advantage of a Bayesian method in penalizing more complex models.

Integrating the function $g(\Lambda, X, p)$ given in (9) with respect to $\Lambda$ gives $\int g(\Lambda, X, p)\, d\Lambda = \Gamma^p(n/2 + \alpha) \prod_{j=1}^{p} (s_j/2 + \beta_j)^{-(n/2+\alpha)}$. The integral of the function $h(\sigma^2, X)$ given in (8) with respect to $\sigma^2$ is approximately equal to

$$\int h(\sigma^2, X)\, d\sigma^2 \approx (2\pi)^{1/2} (m/2)^{-1/2}\, (\mathrm{SSR}/m)^{-m/2+1} \exp[-m/2]. \qquad (10)$$

This formula is justified in the Appendix. From these,

$$\pi(X, p \mid D) = c \cdot A(p) \cdot \int h(\sigma^2, X)\, d\sigma^2 \cdot \int g(\Lambda, X, p)\, d\Lambda \approx c \cdot A^*(p) \cdot (\mathrm{SSR}/m)^{-m/2+1} \prod_{j=1}^{p} (s_j/2 + \beta_j)^{-(n/2+\alpha)}, \qquad (11)$$

where

$$A^*(p) = A(p) \cdot (2\pi)^{1/2}\, \Gamma^p(n/2+\alpha)\, (m/2)^{-1/2} \exp[-m/2].$$

To clarify the dependence of $X$ on $p$, let $X^{(p)}$ denote the BMDS solution of $X$ when the dimension is $p$. There is a difficulty in directly comparing $(X^{(p)}, p)$ and $(X^{(p+1)}, p+1)$. The marginal posterior $\pi(X, p \mid D)$ depends on the scale of $X$, because it includes the term $\prod_{j=1}^{p} (s_j/2 + \beta_j)^{-(n/2+\alpha)}$. Note that $s_j/n$ is the sample variance of the $j$th coordinate of $X$. However, without improvement in the fit, the scale of $X$ may change with the dimension $p$. Given the same Euclidean distances, the coordinates of $X$ would get closer to the origin as $p$ increases, unless all the extra coordinates are equal to 0. For instance, the Euclidean distance between $-1$ and $1$ in one-dimensional space is equal to the Euclidean distance between $(1/\sqrt{2}, 1/\sqrt{2})$ and $(-1/\sqrt{2}, -1/\sqrt{2})$ in two-dimensional space, and hence the variance in each coordinate is smaller in two-dimensional space. This would give a smaller $s_j$, and hence a larger $\pi(X, p \mid D)$, in the higher dimension, although there is no change in the distances or the fit.

To circumvent this scale dependency, a dimension selection criterion should compare $X$'s in the same dimension. For this, let $X^{*(p+1)} = (X^{(p)} : 0)$ in $(p+1)$-dimensional space, which has the first $p$ coordinates equal to those of $X^{(p)}$ and the $(p+1)$st coordinate of every object equal to 0. Then $X^{*(p+1)}$ provides the same Euclidean distances and the same fit as $X^{(p)}$ and may be considered an embedding of $X^{(p)}$ in $(p+1)$-dimensional space. Ideally, if $p$ is the correct dimension, then the optimal solution $X^{(p+1)}$ in $(p+1)$-dimensional space would be $X^{*(p+1)}$. Thus, we compare $X^{*(p+1)}$ and $X^{(p+1)}$ and choose $p$ as the dimension if $X^{*(p+1)}$ has a larger marginal posterior density than $X^{(p+1)}$.

From (11), the ratio of the marginal posteriors of $X^{(p+1)}$ and $X^{*(p+1)}$ is

$$R_p \equiv \frac{\pi(X^{(p+1)}, p+1 \mid D)}{\pi(X^{*(p+1)}, p+1 \mid D)} = \left( \frac{\mathrm{SSR}_{p+1}}{\mathrm{SSR}^*_{p+1}} \right)^{-m/2+1} \prod_{j=1}^{p+1} \left( \frac{s_j/2 + \beta_j}{s_j^*/2 + \beta_j} \right)^{-(n/2+\alpha)} = \left( \frac{\mathrm{SSR}_{p+1}}{\mathrm{SSR}_p} \right)^{-m/2+1} \prod_{j=1}^{p} \left( \frac{s_j^{(p+1)}/2 + \beta_j}{s_j^{(p)}/2 + \beta_j} \right)^{-(n/2+\alpha)} \times \left( \frac{s_{p+1}^{(p+1)}/2 + \beta_{p+1}}{\beta_{p+1}} \right)^{-(n/2+\alpha)},$$

where $s_j^{(p)}$ is $s_j$ given in (6), computed from $X^{(p)}$. Clearly, the ratio $R_p$ depends on the choice of the hyperparameters $\alpha$ and $\beta_j$ of $\Lambda$.

When there is no strong prior information, a reasonable choice for $(\alpha, \beta_j)$ in $(p+1)$-dimensional space might be $\alpha = \frac{1}{2}$ and $\beta_j = \frac{1}{2} s_j^{(p+1)}/n$, so that the prior information roughly corresponds to the information from one observation. This is close to the unit information prior, which was observed by Kass and Wasserman (1995) to correspond to the Bayesian information criterion (BIC) approximation to the Bayes factor (Schwarz 1978) and by Raftery (1995) to correspond to a similar approximation to the integrated likelihood. Raftery (1999) argued that this is a reasonable proper prior for approximating the situation where the amount of prior information is small. This yields the ratio

$$R_p = \left( \frac{\mathrm{SSR}_{p+1}}{\mathrm{SSR}_p} \right)^{-m/2+1} \prod_{j=1}^{p} \left( \frac{r_j^{(p+1)}(n+1)}{n + r_j^{(p+1)}} \right)^{-(n+1)/2} \times (n+1)^{-(n+1)/2},$$

where $r_j^{(p+1)} = s_j^{(p+1)}/s_j^{(p)}$.
Taking minus twice the logarithm of the ratio gives

$$\mathrm{LR}_p \equiv -2 \log R_p = (m-2)\log\!\left(\frac{\mathrm{SSR}_{p+1}}{\mathrm{SSR}_p}\right) \qquad (12)$$
$$\qquad\quad + (n+1)\sum_{j=1}^{p}\log\!\left(\frac{r_j^{(p+1)}(n+1)}{n+r_j^{(p+1)}}\right) + (n+1)\log(n+1). \qquad (13)$$

Note that the term (12) in $\mathrm{LR}_p$ is roughly the log-likelihood ratio and would be negative, because a higher dimension results in a smaller SSR. The term (13) plays the role of a penalty on the increase of dimension by 1 and would be positive if $\prod_{j=1}^{p} (n/r_j^{(p+1)} + 1) < (n+1)^{p+1}$. When there is no significant change in $X$ between the $p$- and $(p+1)$-dimensional spaces, then $r_j^{(p+1)} \approx 1$, and the penalty term is approximately $(n+1)\log(n+1)$.

A positive $\mathrm{LR}_p$ would prefer dimension $p$ to $(p+1)$, and a negative value would prefer dimension $(p+1)$ to $p$; hence one can select the dimension at which the value of $\mathrm{LR}_p$ turns positive. Alternatively, if we define MDSIC as

$$\mathrm{MDSIC}_1 = (m-2)\log \mathrm{SSR}_1, \qquad \mathrm{MDSIC}_p = \mathrm{MDSIC}_1 + \sum_{j=1}^{p-1} \mathrm{LR}_j, \qquad (14)$$

then the optimal dimension is the one that achieves the minimum of $\mathrm{MDSIC}_p$.

5. EXAMPLES

BMDS requires that prior parameters be specified. For all the examples given in this section, we chose 5 as the degrees of freedom $a$ for the prior of $\sigma^2$, and we chose $b$ to match the prior mean of $\sigma^2$ with $\mathrm{SSR}/m$ obtained from CMDS or ALSCAL. Note that a smaller $a$ would not make much difference, because $m = n(n-1)/2$ is large. For the hyperprior of $\lambda_j$, we chose $\alpha = 1/2$ and $\beta_j = \frac{1}{2} s_j^{(0)}/n$, where $s_j^{(0)}/n$ is the sample variance of the $j$th coordinate of $X$ obtained from CMDS or ALSCAL; this roughly corresponds to information from one observation, as described in Section 4.

For the multiplicity constant of the variance of the normal proposal density in the Metropolis–Hastings algorithms for generating $x_i$ and $\sigma^2$, we chose $2.38^2$ for both $x_i$ and $\sigma^2$, as suggested by Gelman, Roberts, and Gilks (1996). We found reasonably fast mixing in MCMC with this choice of multiplicity constant.

In all the examples, we ran 13,000 iterations of MCMC and observed very quick convergence in $\sigma^2$ and the $\delta_{ij}$'s. Here we are interested in an approximate posterior mode of $X$ rather than its full posterior distribution, so the convergence requirements are less stringent than if one sought the full posterior distribution, and all the iterations may be used for the purpose of obtaining the $x_i$'s that give minimum STRESS.

5.1 A Simulation

As an illustrative example, we generated 50 random samples of $x_i$ from a 10-dimensional multivariate normal distribution with mean 0 and variance $I$, the identity matrix. We used the Euclidean distances between pairs $(x_i, x_j)$ as dissimilarities $\delta_{ij}$. Given these $\delta_{ij}$'s, we generated the observed distances $d_{ij}$ from a normal distribution with mean $\delta_{ij}$ and standard deviation .3, truncated at 0. Thus, the data consist of a 50 × 50 symmetric matrix of dissimilarities computed from Euclidean distances with Gaussian errors.

Using the results from ALSCAL for initialization, BMDS as described in Section 3 was applied for various values of the dimension $p$. With the minimum STRESS and $x_i$ obtained from BMDS, we applied the MDSIC criterion described in Section 4 to select the dimension of $x_i$. The results are summarized in Table 1.

Table 1. Analysis of Simulation Data in Example 1, $x_i \sim N_{10}(0, I)$

dim   CMDS STRESS   ALSCAL STRESS   BMDS STRESS      LRT   Penalty   MDSIC
  1         .6622           .5079         .4813  -1105.8     173.0   10647
  2         .4943           .3198         .3063   -828.7     170.3    9714
  3         .3720           .2250         .2182   -695.5     169.5    9056
  4         .2751           .1648         .1642   -699.3     165.3    8530
  5         .2037           .1234         .1234   -554.6     170.8    7996
  6         .1580           .0984         .0984   -593.2     172.6    7612
  7         .1092           .0772         .0772   -414.2     177.9    7191
  8         .0809           .0652         .0652   -267.2     179.0    6955
  9         .0672           .0584         .0584   -252.8     181.0    6867
 10         .0614           .0527         .0527    -72.4     187.2    6795*
 11         .0658           .0511         .0511    -53.2     186.5    6910
 12         .0715           .0500         .0500    -14.4     186.7    7043
 13         .0784           .0497         .0497    -11.3     187.2    7216
 14         .0855           .0495         .0495     -9.4     185.9    7391

*Minimum MDSIC.

The table presents values of STRESS from CMDS, ALSCAL, and BMDS. It also presents the likelihood ratio term (12), the penalty term (13), and the MDSIC given in (14). It can be observed that BMDS shows a significant improvement over CMDS, providing a smaller STRESS, and a moderate improvement over ALSCAL when the dimension is low. When the dimension is close to the true dimension 10, BMDS shows a moderate improvement over CMDS and the same results as ALSCAL. It is interesting to observe that the better performance of BMDS is more pronounced when the dimension is low, because, for visualization purposes, dimension p = 2 is often chosen.

The table shows that the log-likelihood ratios computed from the BMDS solution for X decrease monotonically as p increases up to 10, that there is no significant change after dimension 10, and that the penalties stay about the same for various p. Moreover, MDSIC attains its minimum at the correct dimension, namely, 10.

We also applied BMDS with initial values obtained from the CMDS results. It provided about the same results as before, except that it gave slightly larger STRESS (with the difference less than .01) when the dimension was larger than or equal to 5. MDSIC with these BMDS results chose the same dimension, 10, as in the previous case.
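The bookkeeping in (12)–(14) is easy to mechanize. The following is our illustrative sketch (the function name and data layout are assumptions), taking the per-dimension SSR and coordinate sums of squares from the BMDS fits:

```python
import numpy as np

def mdsic(ssr, s, n):
    """MDSIC per (12)-(14).
    ssr[p]: SSR of the BMDS solution X^(p), for p = 1, ..., pmax
    s[p]:   length-p array with s_j^(p) = sum_i x_ij^2
    n:      number of objects, giving m = n(n-1)/2 dissimilarities."""
    m = n * (n - 1) / 2
    pmax = max(ssr)
    ic = {1: (m - 2) * np.log(ssr[1])}
    for p in range(1, pmax):
        r = s[p + 1][:p] / s[p]                                # r_j^(p+1)
        lr = ((m - 2) * np.log(ssr[p + 1] / ssr[p])            # term (12)
              + (n + 1) * np.log(r * (n + 1) / (n + r)).sum()  # term (13)
              + (n + 1) * np.log(n + 1))
        ic[p + 1] = ic[p] + lr
    return ic
```

The selected dimension is the p minimizing ic[p], matching the MDSIC columns of Tables 1–3.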
5.2 Airline Distances Between Cities

Hartigan (1975, p. 192) provided airline distances between 30 principal cities of the world; these are shown in Figure 1. Cities are located on the surface of the earth, a three-dimensional sphere, and airplanes travel on that surface. Thus, airline distances are not exactly Euclidean distances, and we may expect the dimension of $x_i$ to be between 2 and 3.

Figure 1. Airline Distances Between Cities (100 miles; .62 mi = 1 km) (Hartigan 1975).

BMDS was applied to the data. In this example, initial values from CMDS and ALSCAL yielded almost the same BMDS results. The BMDS results from CMDS initial values are shown in Table 2.

Table 2. Analysis of City Data

dim   CMDS STRESS   ALSCAL STRESS   BMDS STRESS     LRT   Penalty   MDSIC
  1         .6782           .4007         .3617  -704.2      95.7    5336
  2         .4682           .1795         .1604  -548.5      91.4    4727
  3         .3811           .0903         .0851     4.7     108.0    4270*
  4         .4006           .0902         .0856    -2.4      88.9    4383
  5         .4139           .0902         .0854     4.1     143.9    4469

*Minimum MDSIC.

BMDS yielded much smaller SSR than did CMDS and moderately smaller SSR than did ALSCAL in all cases. The estimated SSR from BMDS dropped very quickly until dimension 3 and then increased slightly at dimension 4. MDSIC selected dimension 3. We observed that the last coordinates of $x_i$ in dimension 4 were almost equal to 0, indicating strong evidence for dimension 3.

Figure 2 is a plot of the observed airline distances versus the estimated Euclidean distances. A perfect fit would yield a 45-degree line, as shown in Figure 2. The estimated Euclidean distances from BMDS are represented as red dots, those from CMDS as blue dots in panel (a), and those from ALSCAL as blue dots in panel (b). One can see that BMDS provided points very close to the 45-degree line, except for some points corresponding to large distances.
Figure 2. Observed and Estimated Distances for the Airline Distance Data (in units of 100 mi). Red dots represent the estimated distances from BMDS, blue dots in (a) those from CMDS, and blue dots in (b) those from ALSCAL.

The fit gets worse as the distance gets larger, because when cities are farther apart, there is a greater discrepancy between airline distance and three-dimensional Euclidean distance.

Figure 3 shows plots of the locations of the cities, obtained from BMDS with dimension p = 3 and rotated manually to fit the true locations of the cities approximately. One can observe that the cities are located approximately on the surface of a sphere with the radius of the earth and that the locations of the cities are well recovered.

5.3 Careers of Lloyds Bank Employees, 1905–1950

Sociologists are interested in characterizing and describing careers, to answer questions such as: What are the typical career patterns in a given period in a particular society? Have they been changing over time? Have people become more mobile occupationally?

One approach to doing this views a career as a sequence of occupations held, for example, in successive years, and then seeks to measure the similarity or dissimilarity between careers and, hence, to find groups of similar careers. Abbott and Hrycak (1990) proposed measuring the dissimilarity between the careers of two individuals by counting the minimum number of insertions, deletions, and replacements that would be necessary to transform one career into another. Costs are associated with each kind of change, and the dissimilarity between the two careers is then measured as the total cost of transforming one career into another. This approach, known as optimal alignment, is borrowed from molecular microbiology, where it is applied to the comparison of DNA and protein sequences (Sankoff and Kruskal 1983; Boguski 1992).
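To fix ideas, the optimal-alignment dissimilarity can be computed by the standard edit-distance dynamic program. This sketch is ours; the unit costs are placeholders, not the costs used by Abbott and Hrycak (1990) or by Stovel, Savage, and Bearman (1996):

```python
def career_distance(u, v, ins_cost=1.0, del_cost=1.0, sub_cost=1.0):
    """Optimal-alignment dissimilarity between two career sequences
    u and v (lists of yearly state labels): minimum total cost of
    insertions, deletions, and replacements turning u into v."""
    m, n = len(u), len(v)
    D = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i * del_cost
    for j in range(1, n + 1):
        D[0][j] = j * ins_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if u[i - 1] == v[j - 1] else sub_cost
            D[i][j] = min(D[i - 1][j] + del_cost,   # delete u[i-1]
                          D[i][j - 1] + ins_cost,   # insert v[j-1]
                          D[i - 1][j - 1] + sub)    # replace or match
    return D[m][n]
```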
Figure 3. Estimated Locations of the Cities From BMDS. (a) View from the North Pole. (b) Hemisphere I. (c) Hemisphere II.

Here we reanalyze some data considered by Stovel, Savage, and Bearman (1996), consisting of the careers of 80 randomly selected employees of Lloyds Bank in England whose careers started between 1905 and 1909. This is part of a much larger study aimed at discovering how career patterns in large organizations evolved over the course of the twentieth century. The more immediate goal here is to discover what the typical career sequences are, for data reduction and exploratory purposes, and also as a basis for further analysis. For each employee, information about his work position is available for each year he was at Lloyds. The information consists of the nature of the position (four categories, from clerk to senior manager) and the kind of place he was in (six categories, from small rural place to large city).

From the sequence data, an 80 × 80 matrix of dissimilarity measures was obtained by using the method of Abbott and Hrycak (1990); for details, see Stovel et al. (1996). Clearly, the dissimilarities are not Euclidean distances, and they may not satisfy certain geometric properties that hold for Euclidean distances, such as the triangle inequality. Our approach is to model the dissimilarities as before, with the idea that the non-Euclidean nature of the dissimilarities can be modeled at least approximately as part of the error. As we show, this supposition turns out to be reasonable in practice.

We applied BMDS to the dissimilarity data with initial values from CMDS. (Initial values from ALSCAL gave almost the same results.) Table 3 presents the results of the analysis, along with STRESS from CMDS and ALSCAL. Again, BMDS performed much better than did CMDS and moderately better than did ALSCAL, especially when the dimension was too small. The improvement in performance of BMDS is more pronounced in this example than in the two previous examples. This suggests that BMDS is more robust than CMDS or ALSCAL to variations in the alleged dimension and to violations of the Euclidean model assumption.

Table 3. Analysis of the Lloyds Bank Data

dim   CMDS STRESS   ALSCAL STRESS   BMDS STRESS      LRT   Penalty   MDSIC
  1         .5357           .4648         .3545  -4228.1     325.8   26924
  2         .3390           .2228         .1815  -3380.3     315.7   23022
  3         .2190           .1297         .1063  -2924.9     310.9   19957
  4         .1280           .0853         .0669  -1540.1     317.5   17343
  5         .0891           .0623         .0524   -941.2     330.5   16120
  6         .0725           .0530         .0452   -600.9     326.7   15510
  7         .0619           .0472         .0411   -392.7     330.3   15236
  8         .0558           .0414         .0386   -221.8     330.5   15173*
  9         .0547           .0384         .0372     27.8     367.6   15282
 10         .0556           .0374         .0374      7.7     396.1   15677
 11         .0600           .0370         .0375    -17.3     310.3   16081
 12         .0637           .0363         .0374    -21.2     444.5   16374

*Minimum MDSIC.

Dimension 8 is chosen as optimal because MDSIC attains its minimum at 8. Thus, the estimated configuration X when p = 8 can be used as a final estimate of X. Figure 4 shows the fitted and observed dissimilarities for the three MDS techniques. The BMDS fitted dissimilarities fit the observed ones very well, considerably better than the CMDS ones and moderately better than the ALSCAL ones; the sum of squared residuals for BMDS is 48% of that for CMDS and 87% of that for ALSCAL.
Figure 4. Fitted and Observed Dissimilarities for the Lloyds Bank Data. Red dots represent the estimated distances from BMDS, blue dots in (a) those from CMDS, and blue dots in (b) those from ALSCAL.

Figure 5 gives pairwise scatterplots of the first four dimensions of the BMDS estimates of X. The fourth dimension clearly separates two outliers. On closer inspection of the data, it turned out that these were individuals who had very short careers at Lloyds. They spent only a few years there, whereas all the other employees were at Lloyds for at least 10 years.

Figure 5. Pairwise Scatterplots of the Estimated Object Configuration From BMDS for the Lloyds Bank Data.

The sociologists' interest in these data is primarily to characterize the typical career patterns at Lloyds in this period. To try to answer this question, we applied model-based clustering (Banfield and Raftery 1993; Fraley and Raftery 1998) to the BMDS estimate of X, after removing the two clear outliers. This models the data as a mixture of multivariate normals, allowing for possible geometrically motivated constraints on the covariance matrices of the different groups. The number of groups and the clustering model are chosen by using approximate Bayes factors, approximated via BIC.
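As a rough modern analogue of this step (our sketch, using scikit-learn's GaussianMixture rather than the Banfield–Raftery family of constrained covariance models used in the paper), one can fit mixtures for a range of group counts and select by BIC:

```python
from sklearn.mixture import GaussianMixture

def cluster_configuration(X, max_groups=6, seed=0):
    """Fit mixtures of multivariate normals to the BMDS configuration X
    and choose the number of groups by BIC."""
    fits = [GaussianMixture(n_components=g, covariance_type="full",
                            random_state=seed).fit(X)
            for g in range(1, max_groups + 1)]
    best = min(fits, key=lambda gm: gm.bic(X))   # sklearn BIC: smaller is better
    return best.n_components, best.predict(X)
```

Here covariance_type plays the role of the clustering model; a fuller treatment would compare several covariance structures by BIC as well.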
Model-based clustering clearly identified three groups. These are shown in Figure 6, which displays the first two components of the BMDS solution. The three groups selected make clear substantive sense: Group 1 consists of 16 employees who had shorter careers (22 years or fewer) and spent all or almost all of their career at the lowest clerk rank. Group 2 consists of 30 employees with long careers (40 years or more), almost all of whom ended their careers at the lowest clerk level. Group 3 consists of 32 employees, most of whom were promoted and ended their careers as managers.

Figure 6. Lloyds Bank Data: First Two BMDS Dimensions, With the Three-Group Model-Based Clustering Classification Shown. The ellipses show the one-standard-deviation contours of the densities of each of the component multivariate normal distributions, and the dotted lines show their principal axes.

Another interest in these data would be in whether the career pattern of individual i is more similar to that of individual j than to that of individual k, or whether the difference in the career patterns between individuals i and j is significantly different from that between individuals i and k. This can be answered by considering the posterior distribution of $\delta_{ij} - \delta_{ik}$. Because BMDS generates posterior samples of the $\delta$'s after convergence, one can easily perform the test by using the MCMC samples of the $\delta$'s. To illustrate this, we picked some individuals i, j, k and compared $\delta_{ij}$ with $\delta_{ik}$. Empirical 2.5%, 5%, 50%, 95%, and 97.5% posterior percentiles of $\delta_{ij} - \delta_{ik}$ and its standard error, obtained from the MCMC samples, are presented in Table 4 for some selected values of (i, j, k). One can see that the career pattern of individual 21 (72) is significantly closer to that of individual 38 (79) than to that of individual 75 (24). On the other hand, individual 29 is not significantly closer to either 79 or 80 in terms of career patterns. Note that individuals 21, 38, and 75 are in the first group, individuals 72 and 79 are in the second group, and individuals 24, 29, and 80 are in the third group.

Table 4. Posterior Percentiles and Standard Error of $\delta_{ij} - \delta_{ik}$

(i, j, k)        2.5%      5%     50%     95%   97.5%     SE
(21, 38, 75)   -1.609  -1.568  -1.331  -1.076  -1.033   .151
(72, 24, 79)     .037    .065    .222    .390    .419   .098
(29, 79, 80)    -.265   -.228   -.028    .168    .199   .121
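Given stored MCMC draws of the configuration, the summaries in Table 4 are immediate to compute; a sketch (ours):

```python
import numpy as np

def delta_diff_summary(X_draws, i, j, k):
    """Posterior percentiles and standard error of delta_ij - delta_ik,
    from MCMC draws X_draws of shape (n_draws, n_objects, p)."""
    d_ij = np.linalg.norm(X_draws[:, i] - X_draws[:, j], axis=1)
    d_ik = np.linalg.norm(X_draws[:, i] - X_draws[:, k], axis=1)
    diff = d_ij - d_ik
    pct = np.percentile(diff, [2.5, 5, 50, 95, 97.5])
    return pct, diff.std(ddof=1)
```

Because δij is invariant to the rotation and translation indeterminacy of X, these differences are stable even though the raw coordinates are not.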
6. DISCUSSION

In this article, we proposed a Bayesian approach to object configuration in multidimensional scaling and a simple Bayesian dimension choice criterion, MDSIC, based on the results from BMDS. Key advantages of the proposed Bayesian multidimensional scaling are as follows. First, it provided a significantly better fit than classical MDS and a moderately better fit than ALSCAL overall. This implies that BMDS explores the posterior distribution quite well compared to the other MDS methods. The improvement in performance of BMDS is more pronounced when the dissimilarities are different from Euclidean distances and the effective dimension is ambiguous. This sort of robustness is useful in practice, because in applications dissimilarities often are not Euclidean distances, and the concept of dimension may not even arise in their formulation. Another consideration is that one may often want to use two or three dimensions for visual display, although the true dimension may be much higher. Second, the proposed dimension selection criterion, MDSIC, is easy to compute and gives a direct indication of the optimal dimensionality. In this article, we based MDSIC on the Bayesian object configuration. However, it may, in fact, be used independently of MDS techniques, and any good MDS solution can be used with MDSIC for choosing the dimension. In our three examples, MDSIC based on ALSCAL solutions gave the same dimension choices as those based on BMDS solutions. Finally, BMDS used MCMC to generate posterior samples of the unknown parameters. Thus, unlike other MDS methods, it can provide estimation errors of the distances, as illustrated in the analysis of the careers of Lloyds Bank employees.

Compared with CMDS, a key feature of BMDS is that when the dimension increases, the coordinates for the lower dimensions change, whereas in CMDS the coordinates for a lower dimension are always a subset of those for a higher one. The coordinates obtained from the lower dimensions are not necessarily an optimal choice when the dimension increases, and retaining them in higher dimensions may adversely affect the performance of CMDS. Compared with ALSCAL, BMDS gives an object configuration that minimizes STRESS, whereas ALSCAL gives one that minimizes S-STRESS, the sum of the squared differences between the squared dissimilarities and the squared distances. Squaring dissimilarities and distances causes S-STRESS to emphasize larger dissimilarities more than smaller ones (Borg and Groenen 1997).

A common reason for doing MDS is to cluster the objects. In our third example, we showed how model-based clustering can be used to do this, providing a formal basis for choosing the number of groups. The results were substantively reasonable and useful. Combining BMDS and model-based clustering thus provides a fully model-based approach to clustering dissimilarity data, including ways to choose the dimension of the data and the number of groups.

A more comprehensive approach to this problem would be to build a single model and carry out Bayesian inference for it. This could be done by using a prior distribution of X that is based on a mixture of multivariate normal distributions, rather than on a single one as here. Then MCMC could be used to estimate both object configuration and group membership simultaneously. This approach could provide a way to choose the dimension and the number of groups simultaneously, rather than sequentially, as we did in our example. This seems desirable because there may be a trade-off between dimension and number of groups. A maximum likelihood approach to the problem of clustering with multidimensional scaling of two-way dominance or profile data was proposed by DeSarbo, Howard, and Jedidi (1991), but this is somewhat different from the present context, where the data come in the form of dissimilarities.

We modeled dissimilarities as being equal to Euclidean distances plus error. This corresponds to metric scaling, and so our approach would perhaps best be called Bayesian metric multidimensional scaling. There is a great emphasis in the MDS literature on nonmetric scaling, however. In nonmetric scaling, dissimilarities are modeled as equal to a nonlinear function of distance plus error. This could be incorporated in the present framework by replacing (3) by

$$d_{ij} \sim N\bigl(g(\delta_{ij}), \sigma^2\bigr)\, I(d_{ij} > 0), \qquad i \neq j,\ i, j = 1, \ldots, n, \qquad (15)$$

where $g(\cdot)$ is a nonlinear but monotonic function. One could postulate a parametric model, or a family of parametric models, for $g$; one such family of models was proposed by Ramsay (1982). Then standard Bayesian inference via MCMC would again be possible, leading to Bayesian nonmetric multidimensional scaling.
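As an illustration of how little changes, suppose $g$ is a power transformation $g(\delta) = \gamma\,\delta^{\kappa}$ with $\gamma, \kappa > 0$; this particular parametric family is our illustrative choice, not the family proposed by Ramsay (1982). The likelihood code of Section 3.1 then changes in essentially one line:

```python
import numpy as np
from scipy.stats import norm

def log_likelihood_nonmetric(X, sigma2, d, gamma, kappa):
    """Variant of (4) under (15), up to a constant, with the illustrative
    monotone transformation g(delta) = gamma * delta**kappa."""
    n = X.shape[0]
    iu = np.triu_indices(n, k=1)
    delta = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))[iu]
    g = gamma * delta ** kappa                    # g(delta_ij), monotone
    m = len(g)
    ssr = ((d[iu] - g) ** 2).sum()
    return (-0.5 * m * np.log(sigma2) - ssr / (2 * sigma2)
            - norm.logcdf(g / np.sqrt(sigma2)).sum())
```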

We used the truncated normal distribution for the observed distance because it is approximately of conjugate form for the normal prior distribution of X if the restriction is ignored, and this makes it easy to find a reasonable proposal density in the Metropolis–Hastings algorithm for generating X. The truncated normal distribution seems to work well in practice, as illustrated in the examples. Other distributions, such as the log-normal and noncentral chi-squared, may be used instead of the truncated normal distribution (Suppes and Zinnes 1963; Ramsay 1969, 1977). The use of other distributions would require only a slight modification of the Metropolis–Hastings step for generating X in the proposed algorithm.

We assumed equal variances for objects but different variances for coordinates. When there are replicated measures on each pair of objects, one may further assume different variances for different objects to incorporate the object variation. The proposed BMDS can easily be modified to handle the case of both object and coordinate variations.

Apart from the present work, we do not know of any Bayesian analyses of multidimensional scaling for dissimilarity data. DeSarbo, Kim, and Fong (1999) proposed a Bayesian approach to multidimensional scaling when the data are in the form of binary choice data, but this is rather different from the present context, where the data take the form of dissimilarities.

A Fortran program implementing the proposed method is freely available through StatLib.

APPENDIX: JUSTIFICATION OF (10)

Integration of $h(\sigma^2, X)$, where

$$h(\sigma^2, X) = (\sigma^2)^{-(m/2+a+1)} \exp\left\{ -\frac{\mathrm{SSR}/2 + b}{\sigma^2} - \sum_{i>j} \log \Phi\!\left(\frac{\delta_{ij}}{\sigma}\right) \right\},$$

is not straightforward. However, in most cases $m = n(n-1)/2$ is very large, so the likelihood of $\sigma^2$ dominates the prior, and hence $h(\sigma^2, X)$ is approximately proportional to the likelihood

$$l(\sigma^2, X) \approx (\sigma^2)^{-m/2} \exp\left\{ -\frac{\mathrm{SSR}}{2\sigma^2} - \sum_{i>j} \log \Phi\!\left(\frac{\delta_{ij}}{\sigma}\right) \right\}. \qquad (A.1)$$

In addition, because of the large $m$, the likelihood $l(\sigma^2, X)$ is well approximated by a normal density function. Thus, applying a Laplace approximation to the integral of $l(\sigma^2, X)$ gives

$$\int h(\sigma^2, X)\, d\sigma^2 \approx \int l(\sigma^2, X)\, d\sigma^2 \approx (2\pi)^{1/2} H^{-1/2}\, l(\hat{\sigma}^2, X), \qquad (A.2)$$

where $H$ is minus the Hessian of the log-likelihood and $\hat{\sigma}^2$ is the MLE of $\sigma^2$.

We now argue that the probability $\Phi(\delta_{ij}/\sigma)$ is unlikely to have much effect on the model comparison and can safely be ignored. Suppose we are comparing dimension $p$ with dimension $(p+1)$. We distinguish between two situations. Suppose first that the true dimension is $(p+1)$. Then, asymptotically, the term $-\mathrm{SSR}/(2\sigma^2)$ will dominate the exponent on the right side of (A.1), dimension $(p+1)$ will be preferred, and the term $\sum_{i>j} \log \Phi(\delta_{ij}/\sigma)$ will be immaterial. Second, suppose instead that the true dimension is $p$. Then the fitted $\delta_{ij}$ will be the same, asymptotically, for dimension $p$ as for dimension $(p+1)$, and so the term $\sum_{i>j} \log \Phi(\delta_{ij}/\sigma)$ will be the same for both dimensions. Thus, it will cancel in the comparison and can again be ignored.

Thus, we ignore the term $\sum_{i>j} \log \Phi(\delta_{ij}/\sigma)$ and use the approximation

$$l(\sigma^2, X) \approx \tilde{l}(\sigma^2, X) \equiv (\sigma^2)^{-m/2} \exp\left\{ -\frac{\mathrm{SSR}}{2\sigma^2} \right\}.$$

Replacing $l$ by $\tilde{l}$ and $H$ by minus the Hessian $\tilde{H}$ of $\tilde{l}$ in (A.2), and letting $\hat{\sigma}^2 = \mathrm{SSR}/m$, which maximizes $\tilde{l}$, gives formula (10).

[Received July 2000. Revised February 2001.]

REFERENCES

Abbott, A., and Hrycak, A. (1990), "Measuring Sequence Resemblance," American Journal of Sociology, 96, 144–185.
Banfield, J. D., and Raftery, A. E. (1993), "Model-Based Gaussian and Non-Gaussian Clustering," Biometrics, 49, 803–821.
Boguski, M. S. (1992), "Computational Sequence Analysis Revisited," Journal of Lipid Research, 33, 957–974.
Borg, I., and Groenen, P. (1997), Modern Multidimensional Scaling, New York: Springer-Verlag.
Cox, D. R. (1982), "Comment," Journal of the Royal Statistical Society, Ser. A, 145, 308–309.
Cox, T. F., and Cox, M. A. A. (2001), Multidimensional Scaling, London: Chapman & Hall.
Davison, M. L. (1983), Multidimensional Scaling, New York: Wiley.
DeSarbo, W. S., Howard, D. J., and Jedidi, K. (1991), "MULTICLUS: A New Method for Simultaneously Performing Multidimensional Scaling and Cluster Analysis," Psychometrika, 56, 121–136.
DeSarbo, W. S., Kim, Y., and Fong, D. (1999), "A Bayesian Multidimensional Scaling Procedure for the Spatial Analysis of Revealed Choice Data," Journal of Econometrics, 89, 79–108.
Fraley, C., and Raftery, A. E. (1998), "How Many Clusters? Which Clustering Method? Answers via Model-Based Cluster Analysis," Computer Journal, 41, 578–588.
Gelman, A., Roberts, G. O., and Gilks, W. R. (1996), "Efficient Metropolis Jumping Rules," Bayesian Statistics, 5, 599–608.
Gilks, W. R., Richardson, S., and Spiegelhalter, D. J. (1996), Markov Chain Monte Carlo in Practice, London: Chapman & Hall.
Groenen, P. J. F. (1993), The Majorization Approach to Multidimensional Scaling: Some Problems and Extensions, Leiden, The Netherlands: DSWO Press.
Groenen, P. J. F., Mathar, R., and Heiser, W. J. (1995), "The Majorization Approach to Multidimensional Scaling for Minkowski Distances," Journal of Classification, 12, 3–19.
Hartigan, J. A. (1975), Clustering Algorithms, New York: Wiley.
Hastings, W. K. (1970), "Monte Carlo Sampling Methods Using Markov Chains and Their Applications," Biometrika, 57, 97–109.
Kass, R. E., and Wasserman, L. A. (1995), "A Reference Bayesian Test for Nested Hypotheses and Its Relationship to the Schwarz Criterion," Journal of the American Statistical Association, 90, 928–934.
Kruskal, J. B. (1964), "Multidimensional Scaling by Optimizing Goodness-of-Fit to a Nonmetric Hypothesis," Psychometrika, 29, 1–28.
MacKay, D. (1989), "Probabilistic Multidimensional Scaling: An Anisotropic Model for Distance Judgements," Journal of Mathematical Psychology, 33, 187–205.
MacKay, D., and Zinnes, J. L. (1986), "A Probabilistic Model for the Multidimensional Scaling of Proximity and Preference Data," Marketing Science, 5, 325–334.
Raftery, A. E. (1995), "Bayesian Model Selection in Social Research (with Discussion)," Sociological Methodology, 25, 111–193.
Raftery, A. E. (1999), "Bayes Factors and BIC: Comment on 'A Critique of the Bayesian Information Criterion for Model Selection,'" Sociological Methods and Research, 27, 411–427.
Ramsay, J. O. (1969), "Some Statistical Considerations in Multidimensional Scaling," Psychometrika, 34, 167–182.
Ramsay, J. O. (1977), "Maximum Likelihood Estimation in Multidimensional Scaling," Psychometrika, 42, 241–266.
Ramsay, J. O. (1982), "Some Statistical Approaches to Multidimensional Scaling," Journal of the Royal Statistical Society, Ser. A, 145, 285–312.
Sankoff, D., and Kruskal, J. B. (1983), Time Warps, String Edits, and Macromolecules, Reading, MA: Addison-Wesley.
Schutze, H., and Silverstein, C. (1997), "Projections for Efficient Document Clustering," ACM SIGIR 97, 74–81.
Schwarz, G. (1978), "Estimating the Dimension of a Model," Annals of Statistics, 6, 461–466.
Spence, I., and Graef, J. C. (1974), "The Determination of the Underlying Dimensionality of an Empirically Obtained Matrix of Proximities," Multivariate Behavioral Research, 9, 331–341.
Stovel, K., Savage, M., and Bearman, P. (1996), "Ascription Into Achievement: Models of Career Systems at Lloyds Bank, 1890–1970," American Journal of Sociology, 102, 358–399.
Suppes, P., and Zinnes, J. L. (1963), "Basic Measurement Theory," in Handbook of Mathematical Psychology, Vol. I, eds. R. D. Luce, R. R. Bush, and E. Galanter, New York: Wiley, pp. 2–76.
Takane, Y. (1982), "The Method of Triadic Combinations: A New Treatment and Its Applications," Behaviormetrika, 11, 37–48.
Takane, Y., and Carroll, J. D. (1981), "Nonmetric Maximum Likelihood Multidimensional Scaling From Directional Rankings of Similarities," Psychometrika, 46, 389–405.
Takane, Y., Young, F. W., and de Leeuw, J. (1977), "Nonmetric Individual Difference Multidimensional Scaling: An Alternating Least Squares Method With Optimal Scaling Features," Psychometrika, 42, 7–67.
Tibshirani, R., Lazzeroni, L., Hastie, T., Olshen, A., and Cox, D. (1999), "The Global Pairwise Approach to Radiation Hybrid Mapping," technical report, Department of Statistics, Stanford University.
Torgerson, W. S. (1952), "Multidimensional Scaling: I. Theory and Method," Psychometrika, 17, 401–419.
Torgerson, W. S. (1958), Theory and Methods of Scaling, New York: Wiley.
Young, F. W. (1987), Multidimensional Scaling: History, Theory, and Applications, ed. R. M. Hamer, Hillsdale, NJ: Lawrence Erlbaum.
