Bayesian Multidimensional Scaling and Choice of Dimension
Man-Suk Oh and Adrian E. Raftery
Multidimensional scaling is widely used to handle data that consist of similarity or dissimilarity measures between pairs of objects. We deal with two major problems in metric multidimensional scaling—configuration of objects and determination of the dimension of object configuration—within a Bayesian framework. A Markov chain Monte Carlo algorithm is proposed for object configuration, along with a simple Bayesian criterion, called MDSIC, for choosing the dimension. Simulation results are presented, as are real data. Our method provides better results than does classical multidimensional scaling and ALSCAL for object configuration, and MDSIC seems to work well for dimension choice in the examples considered.

KEY WORDS: Clustering; Dimensionality; Dissimilarity; Markov chain Monte Carlo; Metric scaling; Model selection.
the examples we tested. Moreover, the improvement in performance of the proposed BMDS scheme relative to CMDS or ALSCAL was more pronounced when there were significant measurement errors in the data or when the Euclidean model assumption was violated or the dimension was misspecified.

On the basis of the BMDS estimate of object configuration over a range of dimensions, we propose a simple Bayesian criterion with which to choose an appropriate dimension. This criterion, called MDSIC, is based on the Bayes factor, or ratio of integrated likelihoods, for the BMDS estimated configuration under one dimension versus a different dimension. In simulated data, we found that the criterion works well for Euclidean models with measurement error that is not too large. In real examples, we found that the criterion gave satisfactory results. We also give an example of cluster analysis on real dissimilarity data in which the BMDS estimates of object configuration are used in conjunction with model-based clustering (Banfield and Raftery 1993; Fraley and Raftery 1998).

In our approach, observed dissimilarities are modeled as equal to Euclidean distances plus measurement error. In this sense, what we do here can be viewed as a Bayesian analysis of metric MDS; being Bayesian seems to confer the benefits of yielding good estimated configurations, providing a formal way to choose the dimension, and providing measures of uncertainty in the estimates. A great deal of MDS research, however, has focused on nonmetric MDS, in which the relationship between dissimilarity and underlying distance is modeled as nonlinear. One could use the basic ideas here to do Bayesian nonmetric MDS, and we suggest some ways of doing this in Section 6.

The article is organized as follows. Classical MDS and ALSCAL are briefly reviewed in Section 2. Bayesian MDS is described in Section 3: Section 3.1 defines the model and the prior, Section 3.2 presents an MCMC algorithm, and Section 3.3 describes the estimation of object configuration from the MCMC output. Based on the BMDS output, a simple Bayesian dimension selection criterion, MDSIC, is described in Section 4. Some simulated and real examples are given in Section 5. We conclude with discussion in Section 6.

2. CLASSICAL MULTIDIMENSIONAL SCALING AND ALSCAL

2.1 Classical Multidimensional Scaling

Construct a double-centered matrix A with elements a_ij defined by

a_ij = −(1/2)(d²_ij − d²_i· − d²_·j + d²_··),

where d²_i· = (1/n) Σ_{j=1}^n d²_ij, d²_·j = (1/n) Σ_{i=1}^n d²_ij, and d²_·· = (1/n²) Σ_{i=1}^n Σ_{j=1}^n d²_ij. It was shown by Torgerson (1952, 1958) that

a_ij = Σ_{k=1}^p x_ik · x_jk for all i, j, i.e., A = XX′,   (2)

where X is the n × p matrix of object coordinates. The coordinates of X can be recovered from the spectral decomposition of the matrix A in (2). If the observed dissimilarities, d_ij, satisfy the Euclidean distance assumption and there is no measurement error, then the Euclidean distances computed from the matrix X satisfying (2) will be exactly equal to the given dissimilarities. However, when the model assumption is violated or when there are significant measurement errors in the data, CMDS estimates of object configuration may not be very useful.

Because Euclidean distance is invariant under translation, rotation, and reflection about the origin, the matrix X can be centered at the origin and rotated to its principal axes orientation (Borg and Groenen 1997, p. 62).

There is no definitive method for choosing the effective dimension of x_i, the number of object attributes that contribute significantly to the dissimilarities. A common way to assess dimension is to look at the eigenvalues of the scalar product matrix A. The kth eigenvalue is a measure of the contribution of the kth coordinate of X to squared distances. Hence, only the first p coordinates of X, corresponding to the first p significantly large eigenvalues, suffice to represent the objects. To determine significantly large eigenvalues, one may draw a plot of the ordered eigenvalues versus dimension and look for a dimension at which the sequence of eigenvalues levels off. If each δ_ij is equal to a p-dimensional Euclidean distance between objects i and j as given in (1), then the plot should level off precisely at dimension (p + 1).

A measure of fit, called stress, is commonly used to determine the dimensionality. Several definitions of stress have been proposed; the one we use here, and perhaps the most commonly used one, is

STRESS = [ Σ_{i>j} (d_ij − δ̂_ij)² / Σ_{i>j} d²_ij ]^{1/2}.
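The double-centering construction and the STRESS measure can be written in a few lines. The following NumPy sketch is ours (the function names and all code are illustrative, not from the article):

```python
import numpy as np

def classical_mds(d, p):
    """Classical (Torgerson) MDS: recover an n x p configuration X
    from an n x n dissimilarity matrix d via double centering and
    spectral decomposition of A = XX'."""
    d2 = d ** 2
    # a_ij = -0.5 (d2_ij - d2_i. - d2_.j + d2_..)
    a = -0.5 * (d2 - d2.mean(axis=1, keepdims=True)
                   - d2.mean(axis=0, keepdims=True) + d2.mean())
    vals, vecs = np.linalg.eigh(a)          # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:p]        # keep the p largest
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

def stress(d, delta_hat):
    """STRESS = sqrt( sum_{i>j} (d_ij - delta_hat_ij)^2 / sum_{i>j} d_ij^2 )."""
    iu = np.triu_indices_from(d, k=1)
    return np.sqrt(np.sum((d[iu] - delta_hat[iu]) ** 2) / np.sum(d[iu] ** 2))
```

When the d_ij are exact Euclidean distances with no measurement error, the recovered configuration reproduces them and STRESS is essentially zero, in line with the remark above.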
data given in various forms. In metric MDS, ALSCAL minimizes S-STRESS, where

S-STRESS = Σ_{i>j} (δ²_ij − d²_ij)².

Note that S-STRESS differs from STRESS in that it uses squared distances and dissimilarities, which is done for computational convenience. However, squaring dissimilarities and distances causes S-STRESS to emphasize larger dissimilarities over smaller ones, which may be viewed as a disadvantage of ALSCAL (Borg and Groenen 1997, p. 204). The minimization process can be done by using the Newton–Raphson method, possibly modified. ALSCAL is one of the most commonly used MDS techniques, and it is available in the statistical computer packages SAS and SPSS. For details, see Cox and Cox (2001).

3. BAYESIAN MULTIDIMENSIONAL SCALING

3.1 Model and Prior

Dissimilarity data can be obtained in various forms. However, because Euclidean distance is easy to handle and is relatively insensitive to the choice of dimension compared to other distance measures, it tends to be used in cases in which the dimension is unknown, unless there are strong theoretical reasons for preferring a non-Euclidean distance (Davison 1983). Thus, for Bayesian MDS, we model the true dissimilarity measure δ_ij as the distance between objects i and j in a Euclidean space, as given in (1).

In practical situations, there often are measurement errors in observations. In addition, dissimilarity measures are typically given as positive values. We therefore assume that the observed dissimilarity measure, d_ij, is equal to the true measure, δ_ij, plus a Gaussian error, with the restriction that the observed measure be positive.

For the elements of Λ = Diag(λ_1, …, λ_p), given dimension p, we also assume a conjugate prior, λ_j ~ IG(α, β_j), independently for j = 1, …, p. We assume prior independence among X, Λ, and σ², i.e., π(X, σ², Λ) = π(X) π(σ²) π(Λ), where π(X), π(σ²), and π(Λ) are the priors given earlier.

When there is little prior information, one may use either the results from a preliminary run or the results from other MDS techniques, such as CMDS or ALSCAL, for parameter specification in the priors. For instance, one may choose a small a for a vague prior of σ² and choose b so that the prior mean matches SSR/m, where SSR is obtained from CMDS or ALSCAL. Similarly, for the hyperprior of λ_j, one may choose a small α and choose β_j so that the prior mean of λ_j matches the jth diagonal element of the sample covariance matrix S_x = (1/n) Σ_{i=1}^n x′_i x_i of X obtained from CMDS or ALSCAL. As noted above, the MDS solution can be transformed to have zero sample mean, i.e., Σ_{i=1}^n x_i = 0, and a diagonal sample covariance matrix.

Such a prior is mildly data dependent, and it might be argued that this violates the definition of a prior distribution. However, we view this prior as an approximation to the elicited data-independent prior of an analyst who knows a little, but not much, about the problem at hand. Because this prior is diffuse relative to the likelihood, the estimation results are unlikely to be sensitive to its precise specification.

3.2 MCMC

From the likelihood and the prior, the posterior density function of the unknown parameters (X, σ², Λ) is

π(X, σ², Λ | D) ∝ (σ²)^{−(m/2+a+1)} exp{ −(SSR/2 + b)/σ² − Σ_{i>j} log Φ(δ_ij/σ) }
  × Π_{j=1}^p λ_j^{−(n/2+α+1)} exp{ −( Σ_{i=1}^n x²_ij / 2 + β_j ) / λ_j },   (5)

where SSR = Σ_{i>j} (d_ij − δ_ij)² is the sum of squared residuals.
1034 Journal of the American Statistical Association, September 2001
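To make the posterior concrete, the following sketch (ours; all names are illustrative) evaluates the log of the posterior density of (X, σ², Λ) from Section 3.2, up to an additive constant. The −Σ log Φ(δ_ij/σ) term is the truncation correction for the positive observed dissimilarities:

```python
import numpy as np
from math import erf, sqrt, log

def log_posterior(X, sigma2, lam, d, a, b, alpha, beta):
    """Log posterior of (X, sigma^2, Lambda), up to a constant:
      -(m/2 + a + 1) log sigma^2 - (SSR/2 + b)/sigma^2
      - sum_{i>j} log Phi(delta_ij / sigma)
      + sum_j [ -(n/2 + alpha + 1) log lambda_j - (s_j/2 + beta_j)/lambda_j ],
    where delta_ij are the distances implied by X and
    SSR = sum_{i>j} (d_ij - delta_ij)^2."""
    n, p = X.shape
    m = n * (n - 1) // 2
    delta = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    iu = np.triu_indices(n, k=1)
    ssr = np.sum((d[iu] - delta[iu]) ** 2)
    phi = lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0)))   # standard normal CDF
    log_phi = sum(log(phi(delta[i, j] / np.sqrt(sigma2)))
                  for i, j in zip(*iu))
    s = (X ** 2).sum(axis=0)                           # s_j = sum_i x_ij^2
    lp = -(m / 2 + a + 1) * np.log(sigma2) - (ssr / 2 + b) / sigma2 - log_phi
    lp += np.sum(-(n / 2 + alpha + 1) * np.log(lam) - (s / 2 + beta) / lam)
    return lp
```

As expected from the −b/σ² term, making the prior scale b larger strictly lowers the log posterior at fixed σ².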
SSR/m from CMDS or ALSCAL can be used as an initial value of σ² in the algorithm. In addition, diagonal elements of the adjusted sample covariance matrix of X can be used as initial values for the λ_j's.

We now describe the details of sample generation in the MCMC algorithm. At each iteration, we simulate a new value of λ_j from its conditional posterior distribution given the other unknowns. From (5), the full conditional posterior distribution of λ_j is the inverse Gamma distribution IG(α + n/2, β_j + s_j/2), where s_j/n is the sample variance of the jth coordinates of the x_i's.

We use a random walk Metropolis–Hastings step (Hastings 1970) to generate x_i and σ² in each iteration of the MCMC algorithm. Specifically, a normal proposal density is used in the random walk Metropolis–Hastings algorithm for generation of x_i. To choose the variance of the normal proposal density, we note that the full conditional posterior density of x_i is

π(x_i | ···) ∝ exp[ −(1/2)(Q_1 + Q_2) − Σ_{j≠i} log Φ(δ_ij/σ) ],

where Q_1 = (1/σ²) Σ_{j≠i} (δ_ij − d_ij)² and Q_2 = x′_i Λ^{−1} x_i. Because (δ_ij − d_ij)²/σ² is a quadratic function of x_i with leading coefficient equal to 1/σ², and Q_1 has n − 1 terms of this kind whereas Q_2 has only one quadratic term with coefficient Λ^{−1}, Q_1 would dominate the full conditional posterior density function of x_i unless there is strong prior information. Thus, we may consider Q_1 only, approximate the full conditional variance of x_i by σ²/(n − 1), and choose the variance of the normal proposal density to be a constant multiple of σ²/(n − 1).

From a preliminary numerical study, we found that the full conditional density function of σ² is well approximated by the density function of IG(m/2 + a, SSR/2 + b). When the number of dissimilarities, m = n(n − 1)/2, is large, which is often the case because m is a quadratic function of n, the inverse Gamma density function is well approximated by a normal density. Thus, we propose a random walk Metropolis–Hastings algorithm with a normal proposal density with variance proportional to the variance of the IG(m/2 + a, SSR/2 + b) distribution.

3.3 Posterior Inference

Iterative generation of {x_i}, σ², and {λ_j} for a sufficiently long time provides a sample from the posterior distribution of the unknown parameters, and Bayesian estimates of the parameters can be obtained from the samples. However, because the model assumes a Euclidean distance for the dissimilarities, X is identified only up to translation, rotation, and reflection, and the data carry no information to the contrary. We can retrieve only the relative locations of the objects from the data, not their absolute locations. Hence, the convergence of the δ_ij rather than of X needs to be checked to verify the convergence of MCMC. The near lack of identifiability in X also suggests that the use of sample averages as Bayes estimates is inadvisable, because the MCMC samples of X may be unstable even when the distances δ_ij are stable. Thus, we take an approximate posterior mode of X as a Bayes estimate of X, i.e., the BMDS solution of the object configuration. The posterior mode provides relative positions of the x_i's corresponding to the maximum posterior density. A meaningful absolute position of X may be obtained from an appropriate transformation, if desired.

To obtain the posterior mode of X, one may compute the posterior density for each MCMC sample. However, this can be time consuming, because the posterior is complicated. We observed that the likelihood dominates the prior, and that in the likelihood (4) the term involving SSR is dominant, so that the posterior mode of X can be approximated by the value of X that minimizes the sum of squared residuals SSR.

Because the center and direction of X can be arbitrary, we postprocess the MCMC sample of X at each iteration by using the transformation x_i = D′_x (x_i − x̄), where x̄ is the sample mean of the x_i's and D_x is the matrix whose columns are the eigenvectors of the sample covariance matrix S_x = (1/n) Σ_{i=1}^n (x_i − x̄)′(x_i − x̄) of the x_i's. This transformation does not solve the nonidentifiability problem, but the new x_i's have sample mean 0 and a diagonal covariance matrix, to correspond to the prior specification.

Although samples of the x_i's are unstable because of lack of identifiability, samples of the δ_ij's are stable after convergence, and hence they can be used to measure uncertainties in quantities that are functions of the δ_ij's. For example, one may be interested in whether object i is more closely related to object j than to object k, which can be answered by looking at the posterior distribution of δ_ij − δ_ik, as approximated by the MCMC sample.

4. CHOICE OF DIMENSION

As described in the previous section, BMDS gives object configurations in a Euclidean space of given dimension. In most cases, the dimension of the objects (the number of significant attributes) is unknown. In this section, we propose a simple Bayesian dimension selection criterion based on the BMDS object configurations.

Consider the dimension p as an unknown variable, and assume equal prior probability for all values of p up to some maximum value, p_max. Then the posterior is given by

π(X, σ², Λ, p | D) ∝ l(X, σ², p | D) π(X | Λ, p) π(σ²) π(Λ | p)
= (2π)^{−m/2} σ^{−m} exp{ −SSR/(2σ²) − Σ_{i>j} log Φ(δ_ij/σ) }
  × Π_{j=1}^p (2πλ_j)^{−n/2} exp{ −s_j/(2λ_j) }
  × Γ(a)^{−1} b^a (σ²)^{−(a+1)} exp[ −b/σ² ]
  × Γ(α)^{−p} Π_{j=1}^p β_j^α λ_j^{−(α+1)} exp[ −β_j/λ_j ]
= A(p) · h(σ², X) · g(Λ, X, p),

where

s_j = Σ_{i=1}^n x²_ij,   (6)

A(p) = (2π)^{−(m+np)/2} Γ(a)^{−1} b^a Γ(α)^{−p} Π_{j=1}^p β_j^α,   (7)
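The sampling scheme of Section 3.2 can be sketched as a Metropolis-within-Gibbs loop. The code below is our simplified illustration, not the authors' implementation: the truncation term Σ log Φ(δ_ij/σ) is dropped from the target for brevity, initialization is random rather than CMDS/ALSCAL-based, and the proposal scalings are rough. The λ_j's are drawn exactly from their inverse Gamma full conditionals; x_i and σ² get random-walk normal proposals:

```python
import numpy as np

def bmds_sample(d, p, n_iter=500, a=5.0, alpha=0.5, seed=0):
    """Sketch of a Metropolis-within-Gibbs sampler for BMDS.
    Simplifications (ours): no truncation term in the target,
    random initialization, illustrative hyperparameters and tuning."""
    rng = np.random.default_rng(seed)
    n = d.shape[0]
    m = n * (n - 1) // 2
    iu = np.triu_indices(n, k=1)

    def ssr(X):
        delta = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
        return ((d[iu] - delta[iu]) ** 2).sum()

    X = rng.normal(size=(n, p))
    sigma2 = ssr(X) / m
    b = (a - 1.0) * sigma2               # match prior mean of sigma^2 to SSR/m
    beta = 0.5 * (X ** 2).sum(0) / n     # rough unit-information-style choice
    lam = (X ** 2).mean(0)
    c = 2.38 ** 2                        # proposal scaling (Gelman et al. 1996)
    best_X, best_ssr = X.copy(), ssr(X)

    def log_post(X, sigma2, lam):
        s = (X ** 2).sum(0)
        return (-(m / 2 + a + 1) * np.log(sigma2) - (ssr(X) / 2 + b) / sigma2
                - ((n / 2) * np.log(lam) + s / (2 * lam)).sum())

    for _ in range(n_iter):
        # Gibbs: lambda_j | rest ~ IG(alpha + n/2, beta_j + s_j/2)
        s = (X ** 2).sum(0)
        lam = 1.0 / rng.gamma(alpha + n / 2, 1.0 / (beta + s / 2))
        # MH on each object's coordinates, proposal var ~ c * sigma^2/(n-1)
        for i in range(n):
            prop = X.copy()
            prop[i] += rng.normal(scale=np.sqrt(c * sigma2 / (n - 1)), size=p)
            if np.log(rng.uniform()) < log_post(prop, sigma2, lam) - log_post(X, sigma2, lam):
                X = prop
        # MH on sigma^2; step roughly matches the sd of IG(m/2+a, SSR/2+b)
        prop2 = sigma2 + rng.normal(scale=np.sqrt(c) * sigma2 / np.sqrt(m / 2))
        if prop2 > 0 and np.log(rng.uniform()) < log_post(X, prop2, lam) - log_post(X, sigma2, lam):
            sigma2 = prop2
        if ssr(X) < best_ssr:
            best_ssr, best_X = ssr(X), X.copy()
    return best_X, best_ssr
```

With d an n × n dissimilarity matrix, `bmds_sample(d, p)` returns the visited configuration with the smallest SSR, in the spirit of the approximate-posterior-mode estimate of Section 3.3.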
h(σ², X) = (σ²)^{−(m/2+a+1)} exp{ −(SSR/2 + b)/σ² − Σ_{i>j} log Φ(δ_ij/σ) },   (8)

g(Λ, X, p) = Π_{j=1}^p λ_j^{−(n/2+α+1)} exp{ −(s_j/2 + β_j)/λ_j }.   (9)

Note that, because of the postprocessing described in Section 3.3, the x_i's have sample mean 0 and a diagonal sample covariance matrix.

We adopt a Bayesian approach to choosing the dimension. We view the overall task as that of choosing the best configuration, and hence we view the choice between dimension p and dimension p′ as a choice between the estimated configuration with dimension p and the estimated configuration with dimension p′. Thus, we consider the marginal posterior, π(X, p | D), of (X, p), with X equal to the BMDS solution, and choose the value of p that gives the largest value of π(X, p | D). Choosing between dimension p and dimension p′ is based on the posterior odds for the estimated configuration of dimension p versus that of dimension p′.

Now, note that π(X, p | D) = c ∫ l(X, σ², p | D) π(σ²) dσ² · ∫ π(X, Λ, p) dΛ ≡ c · l(X, p | D) π(X, p), where c is a constant independent of X and p, l(X, p | D) is the marginal likelihood of (X, p), and π(X, p) is the marginal prior of (X, p). The marginalized likelihood term increases as p increases. However, the marginal prior term decreases as p increases, because we are using a diffuse (but proper) prior, and so this term penalizes more complex models. The approach has the simplicity of a maximum likelihood method as well as the advantage of a Bayesian method in penalizing more complex models.

Integrating the function g(Λ, X, p) given in (9) with respect to Λ gives ∫ g(Λ, X, p) dΛ = Γ^p(n/2 + α) Π_{j=1}^p (s_j/2 + β_j)^{−(n/2+α)}. The integral of the function h(σ², X) given in (8) with respect to σ² is approximately equal to

∫ h(σ², X) dσ² ≈ (2π)^{1/2} (m/2)^{−1/2} (SSR/m)^{−m/2+1} exp[ −m/2 ].   (10)

This formula is justified in the Appendix. From these,

π(X, p | D) = c · A(p) · ∫ h(σ², X) dσ² · ∫ g(Λ, X, p) dΛ
= c · A*(p) · (SSR/m)^{−m/2+1} Π_{j=1}^p (s_j/2 + β_j)^{−(n/2+α)},   (11)

where

A*(p) = A(p) · (2π)^{1/2} Γ^p(n/2 + α) (m/2)^{−1/2} exp[ −m/2 ].

Note that s_j/n is the sample variance of the jth coordinate of X. However, without improvement in the fit, the scale of X may change with the dimension p. Given the same Euclidean distances, the coordinates of X would get closer to the origin as p increases, unless all the extra coordinates are equal to 0. For instance, the Euclidean distance between −1 and 1 in one-dimensional space is equal to the Euclidean distance between (1/√2, 1/√2) and (−1/√2, −1/√2) in two-dimensional space, and hence the variance in each coordinate is smaller in two-dimensional space. This would give a smaller s_j and hence a larger π(X, p | D) in a higher dimension, although there is no change in the distances or the fit.

To circumvent this scale dependency, a dimension selection criterion should compare X's in the same dimension. For this, let X*^(p+1) = (X^(p) : 0) in (p + 1)-dimensional space, which has the first p coordinates equal to X^(p) and the last coordinate equal to 0. Then X*^(p+1) provides the same Euclidean distances and the same fit as X^(p) and may be considered an implantation of X^(p) in (p + 1)-dimensional space. Ideally, if p is the correct dimension, then the optimal solution X^(p+1) in (p + 1)-dimensional space would be X*^(p+1). Thus, we compare X*^(p+1) and X^(p+1) and choose p to be the dimension if X*^(p+1) has a larger marginal posterior density than X^(p+1).

From (11), the ratio of the marginal posteriors of X^(p+1) and X*^(p+1) is

R_p ≡ π(X^(p+1), p + 1 | D) / π(X*^(p+1), p + 1 | D)
= (SSR_{p+1} / SSR*_{p+1})^{−m/2+1} Π_{j=1}^{p+1} [ (s_j/2 + β_j) / (s*_j/2 + β_j) ]^{−(n/2+α)}
= (SSR_{p+1} / SSR_p)^{−m/2+1} Π_{j=1}^{p} [ (s_j^{(p+1)}/2 + β_j) / (s_j^{(p)}/2 + β_j) ]^{−(n/2+α)}
  × [ (s_{p+1}^{(p+1)}/2 + β_{p+1}) / β_{p+1} ]^{−(n/2+α)},

where s_j^{(p)} is s_j given in (6), computed from X^(p). Clearly, the ratio R_p depends on the choice of the hyperparameters α and β_j of Λ.

When there is no strong prior information, a reasonable choice for (α, β_j) in (p + 1)-dimensional space might be α = 1/2 and β_j = (1/2) s_j^{(p+1)}/n, so that the prior information roughly corresponds to the information from one observation. This is close to the unit information prior, which was observed by Kass and Wasserman (1995) to correspond to the Bayesian information criterion (BIC) approximation to the Bayes factor (Schwarz 1978) and by Raftery (1995) to correspond to a similar approximation to the integrated likelihood. Raftery (1999) argued that this is a reasonable proper prior for approximating the situation where the amount of prior information is small. This yields the ratio

R_p = (SSR_{p+1} / SSR_p)^{−m/2+1} Π_{j=1}^{p} [ r_j^{(p+1)}(n + 1) / (n + r_j^{(p+1)}) ]^{−(n+1)/2} · (n + 1)^{−(n+1)/2},

where r_j^{(p+1)} = s_j^{(p+1)} / s_j^{(p)}.
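The scale effect invoked above (the same configuration re-expressed in a higher dimension keeps its distances but has smaller per-coordinate sums of squares s_j) can be checked directly; this small numerical snippet is ours:

```python
import numpy as np

# Two points at -1 and 1 on the line, and the same pair embedded in 2-D
X1 = np.array([[-1.0], [1.0]])
X2 = np.array([[-1.0, -1.0], [1.0, 1.0]]) / np.sqrt(2.0)

d1 = np.linalg.norm(X1[0] - X1[1])
d2 = np.linalg.norm(X2[0] - X2[1])
s1 = (X1 ** 2).sum(axis=0)   # s_j = sum_i x_ij^2, per coordinate
s2 = (X2 ** 2).sum(axis=0)

print(d1, d2)   # identical distances: 2.0 and 2.0
print(s1, s2)   # s_j drops from 2.0 to 1.0 per coordinate
```

Because each (s_j/2 + β_j)^{−(n/2+α)} factor in (11) grows as s_j shrinks, the higher-dimensional copy gets a larger marginal posterior despite an unchanged fit, which is exactly why the criterion compares configurations in the same dimension.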
Taking minus twice the logarithm of the ratio gives

LR_p ≡ −2 log R_p
= (m − 2) log(SSR_{p+1}/SSR_p)   (12)
  + (n + 1) Σ_{j=1}^p log[ r_j^{(p+1)}(n + 1) / (n + r_j^{(p+1)}) ] + (n + 1) log(n + 1).   (13)

Note that the term (12) in LR_p is roughly the log-likelihood ratio and would be negative, because a higher dimension results in a smaller SSR. The term (13) plays the role of a penalty on the increase of dimension by 1 and would be positive if Π_{j=1}^p (n/r_j^{(p+1)} + 1) < (n + 1)^{p+1}. When there is no significant change in X between the p- and (p + 1)-dimensional spaces, then r_j^{(p+1)} ≈ 1, and the penalty term is approximately (n + 1) log(n + 1).

A positive LR_p would prefer dimension p to (p + 1) and a negative value would prefer dimension (p + 1) to p, and hence one can select the dimension at which the value of LR_p turns positive. Alternatively, if we define MDSIC as

MDSIC_1 = (m − 2) log SSR_1,
MDSIC_p = MDSIC_1 + Σ_{j=1}^{p−1} LR_j,   (14)

then the optimal dimension is the one that achieves the minimum of MDSIC_p.

5. EXAMPLES

BMDS requires that prior parameters be specified. For all the examples given in this section, we chose 5 as the degrees of freedom a for the prior of σ², and we chose b to match the prior mean of σ² with SSR/m obtained from CMDS or ALSCAL. Note that a smaller a would not make much difference, because m = n(n − 1)/2 is large. For the hyperprior of λ_j, we chose α = 1/2 and β_j = (1/2) s_j^{(0)}/n, where s_j^{(0)}/n is the sample variance of the jth coordinate of X obtained from CMDS or ALSCAL; this roughly corresponds to information from one observation, as described in Section 4.

For the multiplicity constant of the variance of the normal proposal density in the Metropolis–Hastings algorithms for generating x_i and σ², we chose 2.38² for both x_i and σ², as suggested by Gelman, Roberts, and Gilks (1996). We found reasonably fast mixing in MCMC with this choice of multiplicity constant.

In all the examples, we ran 13,000 iterations of MCMC and observed very quick convergence in σ² and the δ_ij's. Here we are interested in an approximate posterior mode of X rather than its full posterior distribution, so the convergence requirements are less stringent than if one sought the full posterior distribution, and all the iterations may be used for the purpose of obtaining the x_i's that give minimum STRESS.

5.1 A Simulation

As an illustrative example, we generated 50 random samples of x_i from a 10-dimensional multivariate normal distribution with mean 0 and variance I, the identity matrix. We used the Euclidean distances between pairs (x_i, x_j) as dissimilarities δ_ij. Given these δ_ij's, we generated the observed distances d_ij from a normal distribution with mean δ_ij and standard deviation .3, truncated at 0. Thus, the data consist of a 50 × 50 symmetric matrix of dissimilarities computed from Euclidean distances with Gaussian errors.

Using the results from ALSCAL for initialization, BMDS as described in Section 3 was applied for various values of the dimension p. With minimum STRESS and x_i obtained from BMDS, we applied the MDSIC described in Section 4 to select the dimension of x_i. The results are summarized in Table 1.

The table presents values of STRESS from CMDS, ALSCAL, and BMDS. It also presents the likelihood ratio term of (12), the penalty term of (13), and the MDSIC given in (14). It can be observed that BMDS shows significant improvement over CMDS, providing a smaller STRESS, and a moderate improvement over ALSCAL when the dimension is low. When the dimension is close to the true dimension 10, BMDS shows a moderate improvement over CMDS and the same results as ALSCAL. It is interesting to observe that the better performance of BMDS is more pronounced when the dimension is low because, for visualization purposes, dimension p = 2 is often chosen.

The table shows that the log-likelihood ratios computed from the BMDS solution for X decrease monotonically as p increases up to 10, that there is no significant change after dimension 10, and that the penalties stay about the same for various p. Moreover, the MDSIC attains a minimum at the correct dimension, namely, 10.

We also applied BMDS with initial values obtained from the CMDS results. It provided about the same results as before, except that it gives slightly larger STRESS (with the difference less than .01) when the dimension is larger than or equal to 5. MDSIC with the BMDS results chose the same dimension, 10, as in the previous case.

Table 1. Analysis of Simulation Data in Example 1, x_i ~ N_10(0, I)

       CMDS     ALSCAL   BMDS
dim    STRESS   STRESS   STRESS    LRT       Penalty   MDSIC
 1     .6622    .5079    .4813    −1105.8    173.0     10647
 2     .4943    .3198    .3063     −828.7    170.3      9714
 3     .3720    .2250    .2182     −695.5    169.5      9056
 4     .2751    .1648    .1642     −699.3    165.3      8530
 5     .2037    .1234    .1234     −554.6    170.8      7996
 6     .1580    .0984    .0984     −593.2    172.6      7612
 7     .1092    .0772    .0772     −414.2    177.9      7191
 8     .0809    .0652    .0652     −267.2    179.0      6955
 9     .0672    .0584    .0584     −252.8    181.0      6867
10     .0614    .0527    .0527      −72.4    187.2      6795*
11     .0658    .0511    .0511      −53.2    186.5      6910
12     .0715    .0500    .0500      −14.4    186.7      7043
13     .0784    .0497    .0497      −11.3    187.2      7216
14     .0855    .0495    .0495       −9.4    185.9      7391

*Minimum MDSIC.
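Given the SSR_p and the coordinate sums of squares s_j^{(p)} from the BMDS solutions, MDSIC in (12)–(14) is simple bookkeeping. The following function is our sketch of that computation (names are illustrative):

```python
import numpy as np

def mdsic(ssr, s, n):
    """MDSIC_p from equations (12)-(14).
    ssr[p-1] is SSR_p and s[p-1][j] is s_j^(p) = sum_i x_ij^2 for the
    BMDS solution in dimension p; n is the number of objects.
    Returns MDSIC_1..MDSIC_pmax; argmin + 1 is the chosen dimension."""
    m = n * (n - 1) / 2
    crit = [(m - 2) * np.log(ssr[0])]            # MDSIC_1
    for p in range(1, len(ssr)):
        r = s[p][:p] / s[p - 1]                  # r_j = s_j^(p+1) / s_j^(p)
        lr = ((m - 2) * np.log(ssr[p] / ssr[p - 1])                 # term (12)
              + (n + 1) * np.log(r * (n + 1) / (n + r)).sum()       # term (13)
              + (n + 1) * np.log(n + 1))
        crit.append(crit[-1] + lr)               # MDSIC_{p+1} = MDSIC_p + LR_p
    return np.array(crit)
```

As a sanity check, when going up one dimension changes neither the fit (equal SSR) nor the first p coordinates (r_j = 1), LR_p reduces to the pure penalty (n + 1) log(n + 1) > 0, so MDSIC increases and the lower dimension is preferred.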
5.2 Airline Distances Between Cities

Hartigan (1975, p. 192) provided airline distances between 30 principal cities of the world; these are shown in Figure 1. Cities are located on the surface of the earth, a three-dimensional sphere, and airplanes travel on the surface of the earth. Thus, airline distances are not exactly Euclidean distances, and we may expect the dimension of x_i to be between 2 and 3.

The BMDS was applied to the data. In this example, initial values from CMDS and ALSCAL yielded almost the same BMDS results. The BMDS results from CMDS initial values are shown in Table 2.

Table 2. Analysis of the Airline Distance Data

       CMDS     ALSCAL   BMDS
dim    STRESS   STRESS   STRESS    LRT      Penalty   MDSIC
 1     .6782    .4007    .3617    −704.2     95.7     5336
 2     .4682    .1795    .1604    −548.5     91.4     4727
 3     .3811    .0903    .0851       4.7    108.0     4270*
 4     .4006    .0902    .0856      −2.4     88.9     4383
 5     .4139    .0902    .0854       4.1    143.9     4469

*Minimum MDSIC.

BMDS yielded much smaller SSR than did CMDS and moderately smaller SSR than did ALSCAL in all cases. The estimated SSR from BMDS dropped very quickly until dimension 3 and then increased slightly at dimension 4. MDSIC selected dimension 3. We observed that the last coordinates of x_i in dimension 4 are almost equal to 0, indicating strong evidence for dimension 3.

Figure 2 is a plot of the observed airline distances versus the estimated Euclidean distances. A perfect fit would yield a 45-degree line, as shown in Figure 2. The estimated Euclidean distances from BMDS are represented as red dots, those from CMDS as blue dots, and those from ALSCAL as green dots. One can see that BMDS provided points very close to the 45-degree line, except for some points corresponding to large distances. The fit gets worse as the distance gets larger,
Figure 1. Airline Distances Between Cities (in units of 100 miles; .62 mi = 1 km) (Hartigan 1975).
Figure 2. Observed and Estimated Distances for the Airline Distance Data (in Units of 100 mi). Red dots represent the estimated distances
from BMDS, blue dots in (a) from CMDS, and blue dots in (b) from ALSCAL.
because when cities are farther apart, there is a greater discrepancy between airline distance and three-dimensional Euclidean distance.

Figure 3 shows plots for the locations of cities, obtained from BMDS with dimension p = 3 and rotated manually to fit the true locations of the cities approximately. One can observe that the cities are located approximately on the surface of a sphere with the radius of the earth and that the locations of the cities are well recovered.

5.3 Careers of Lloyds Bank Employees, 1905–1950

Sociologists are interested in characterizing and describing careers, to answer questions such as, What are the typical career patterns in a given period in a particular society? Have they been changing over time? Have people become more mobile occupationally?

One approach to doing this views a career as a sequence of occupations held, for example, in successive years, and then seeks to measure the similarity or dissimilarity between careers and, hence, to find groups of similar careers. Abbott and Hrycak (1990) proposed measuring the dissimilarity between the careers of two individuals by counting the minimum number of insertions, deletions, and replacements that would be necessary to transform one career into another. Costs are associated with each kind of change, and the dissimilarity between the two careers is then measured as the total cost of transforming one career into another. This approach, known as optimal alignment, is borrowed from molecular microbiology, where it is applied to the comparison of DNA and protein sequences (Sankoff and Kruskal 1983; Boguski 1992).

Here we reanalyze some data considered by Stovel, Savage, and Bearman (1996), consisting of the careers of 80 randomly selected employees of Lloyds Bank in England, whose careers started between 1905 and 1909. This is part of a much larger study aimed at discovering how career patterns in large organizations evolved over the course of the twentieth century. The more immediate goal here is to discover what the typical career sequences are, for data reduction and exploratory purposes, and also as a basis for further analysis. For each employee, information about his work position is available for
Figure 3. Estimated Locations of the Cities From BMDS. (a) View from the North Pole. (b) Hemisphere I. (c) Hemisphere II.
each year he was at Lloyds. The information consists of the nature of the position (four categories, from clerk to senior manager) and the kind of place they were in (six categories, from small rural place to large city).

From the sequence data, an 80 × 80 matrix of dissimilarity measures was obtained by using the method of Abbott and Hrycak (1990); for details, see Stovel et al. Clearly, the dissimilarities are not Euclidean distances, and they may not satisfy certain geometric properties that hold for Euclidean distances, such as the triangle inequality. Our approach is to model the dissimilarities as before, with the idea that the non-Euclidean nature of the dissimilarities can be modeled at least approximately as part of the error. As we show, this supposition turns out to be reasonable in practice.

We applied BMDS to the dissimilarity data with initial values from CMDS. (Initial values from ALSCAL gave almost the same results.) Table 3 presents the results of the analysis along with STRESS from CMDS and ALSCAL. Again, BMDS performed much better than did CMDS and moderately better than did ALSCAL, especially when the dimension was too small. The improvement in performance of BMDS is more pronounced in this example than in the two previous examples. This suggests that BMDS is more robust than CMDS or ALSCAL to variations in the alleged dimension and to violations of the Euclidean model assumption.

Table 3. Analysis of the Lloyd Bank Data

       CMDS     ALSCAL   BMDS
dim    STRESS   STRESS   STRESS    LRT       Penalty   MDSIC
 1     .5357    .4648    .3545    −4228.1    325.8     26924
 2     .3390    .2228    .1815    −3380.3    315.7     23022
 3     .2190    .1297    .1063    −2924.9    310.9     19957
 4     .1280    .0853    .0669    −1540.1    317.5     17343
 5     .0891    .0623    .0524     −941.2    330.5     16120
 6     .0725    .0530    .0452     −600.9    326.7     15510
 7     .0619    .0472    .0411     −392.7    330.3     15236
 8     .0558    .0414    .0386     −221.8    330.5     15173*
 9     .0547    .0384    .0372       27.8    367.6     15282
10     .0556    .0374    .0374        7.7    396.1     15677
11     .0600    .0370    .0375      −17.3    310.3     16081
12     .0637    .0363    .0374      −21.2    444.5     16374

*Minimum MDSIC.

Dimension 8 is chosen as optimal because MDSIC attains its minimum at 8. Thus, the estimated configuration X when p = 8 can be used as a final estimate of X. Figure 4 shows the fitted and observed dissimilarities for the three MDS techniques. The BMDS fitted dissimilarities fit the observed ones very well, considerably better than the CMDS ones and
Figure 4. Fitted and Observed Dissimilarities for the Lloyd Bank Data. Red dots represent the estimated distances from BMDS, blue dots in
(a) from CMDS, and blue dots in (b) from ALSCAL.
moderately better than the ALSCAL ones; the sum of squared residuals for BMDS is 48% of that for CMDS and 87% of that for ALSCAL.

Figure 5 gives pairwise scatterplots of the first four dimensions of the BMDS estimates of X. The fourth dimension clearly separates two outliers. On closer inspection of the data, it turned out that these were individuals who had very short careers at Lloyds. They spent only a few years there, whereas all the other employees were at Lloyds for at least 10 years.

The sociologists' interest in these data is primarily to characterize the typical career patterns at Lloyds in this period. To try to answer this question, we applied model-based clustering (Banfield and Raftery 1993; Fraley and Raftery 1998) to the BMDS estimate of X, after removing the two clear outliers. This models the data as a mixture of multivariate normals, allowing for possible geometrically motivated constraints on the covariance matrices of the different groups. The number of groups and the clustering model are chosen by using approximate Bayes factors, approximated via BIC.

Model-based clustering clearly identified three groups. These are shown in Figure 6, which displays the first two components of the BMDS solution. The three groups selected make clear substantive sense: Group 1 consists of 16 employees who had shorter careers (22 years or fewer) and spent all or almost all of their career at the lowest clerk rank. Group 2 consists of 30 employees with long careers (40 years or more), almost all of whom ended their careers at the lowest clerk level. Group 3 consists of 32 employees, most of whom were promoted and ended their careers as managers.

Another interest in these data would be in whether the career pattern of individual i is more similar to that of individual j than to that of individual k, or whether the difference in the career patterns between individuals i and j is significantly different from that between individuals i and k. This can be answered by considering the posterior distribution of δ_ij − δ_ik. Because BMDS generates posterior samples of the δ's after convergence, one can easily perform the test by using the MCMC samples of the δ's. To illustrate this, we picked some individuals i, j, k and compared δ_ij with δ_ik. Empirical 2.5%, 5%, 50%,
Oh and Raftery: Bayesian Multidimensional Scaling and Choice of Dimension 1041
Figure 5. Pairwise Scatterplots of the Estimated Object Con guration From BMDS for the Lloyd Bank Data.
95%, and 97.5% posterior percentiles of „ij ƒ„ik and its stan- 6. DISCUSSION
dard error, obtained from the MCMC samples, are presented
In this article, we proposed a Bayesian approach to
in Table 4 for some selected values of 4i1j1k5. One can see object con guration in multidimensional scaling and a simple
that the career pattern of individual 21 (72) is signi cantly Bayesian dimension choice criterion, MDSIC, based on the
closer to that of individual 38 (79) than to that of individual results from BMDS. Key advantages of the proposed Bayesian
75 (24). On the other hand, individual 29 is not signi cantly multidimensional scaling are as follows. First, it provided a
closer to either 79 or 80 in terms of career patterns. Note that signi cantly better t than classical MDS and a moderately
individuals 21, 38, and 75 are in the rst group, individuals better t than ALSCAL overall. This implies that BMDS
72 and 79 in the second group, and individuals 24, 29, and 80 explores the posterior distribution quite well compared to the
in the third group. other MDS methods. The improvement in performance of
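The posterior comparison of δij and δik described above can be sketched as follows. This is a minimal illustration that assumes the MCMC draws of the object configuration are stored in a NumPy array `X_samples` of shape (S, n, p), and that posterior δ draws are recomputed as distances from those configurations; the array name and that detail are assumptions of the sketch, not the paper's implementation:

```python
import numpy as np

def delta_diff_summary(X_samples, i, j, k, probs=(2.5, 5, 50, 95, 97.5)):
    """Empirical posterior percentiles and standard error of
    delta_ij - delta_ik, computed from MCMC samples of the object
    configuration (X_samples has shape S x n x p)."""
    d_ij = np.linalg.norm(X_samples[:, i, :] - X_samples[:, j, :], axis=1)
    d_ik = np.linalg.norm(X_samples[:, i, :] - X_samples[:, k, :], axis=1)
    diff = d_ij - d_ik
    return np.percentile(diff, probs), diff.std(ddof=1)
```

Individual i is significantly closer to j than to k (at the 95% level) when the interval from the 2.5% to the 97.5% percentile lies entirely below zero, as in the comparisons reported in Table 4.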
[Figure 6: the first two components of the BMDS solution, with each employee plotted as his group number (1, 2, or 3).]

ties more than smaller ones (Borg and Groenen 1997).

A common reason for doing MDS is to cluster the objects. In our third example, we showed how model-based clustering can be used to do this, providing a formal basis for choosing the number of groups. The results were substantively reasonable and useful. Combining BMDS and model-based clustering thus provides a fully model-based approach to clustering
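The clustering step (a mixture of multivariate normals with the number of groups chosen by BIC) can be sketched with scikit-learn's GaussianMixture. This is a simplified stand-in for the model-based clustering software of Banfield and Raftery: the constrained covariance models are not reproduced here, and unconstrained full covariances together with scikit-learn's BIC (for which lower is better) are assumptions of this sketch:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def cluster_by_bic(X, max_groups=6, seed=0):
    """Fit Gaussian mixtures with 1..max_groups components to the
    configuration X and choose the number of groups by BIC
    (scikit-learn's BIC: lower is better)."""
    best = None
    for g in range(1, max_groups + 1):
        gm = GaussianMixture(n_components=g, covariance_type="full",
                             n_init=5, random_state=seed).fit(X)
        bic = gm.bic(X)
        if best is None or bic < best[0]:
            best = (bic, g, gm)
    _, g, gm = best
    return g, gm.predict(X)
```

Looping over scikit-learn's `covariance_type` options ("full", "tied", "diag", "spherical") as well would loosely parallel the choice among geometrically constrained covariance models described above.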
Takane, Y., and Carroll, J. D. (1981), "Nonmetric Maximum Likelihood Multidimensional Scaling From Directional Rankings of Similarities," Psychometrika, 46, 389–405.
——— (1982), "The Method of Triadic Combinations: A New Treatment and Its Applications," Behaviormetrika, 11, 37–48.
Tibshirani, R., Lazzeroni, L., Hastie, T., Olshen, A., and Cox, D. (1999), "The Global Pairwise Approach to Radiation Hybrid Mapping," Technical Report, Department of Statistics, Stanford University.
Torgerson, W. S. (1952), "Multidimensional Scaling: I. Theory and Method," Psychometrika, 17, 401–419.
——— (1958), Theory and Methods of Scaling, New York: Wiley.
Young, F. W. (1987), Multidimensional Scaling—History, Theory, and Applications (R. M. Hauer, ed.), Hillsdale, NJ: Lawrence Erlbaum.