Mixed Data Using Information Complexity

HALIMA BENSMAIL
Department of Statistics, University of Tennessee, 334 Stokely Management Building,
Knoxville, TN 37996-0532, USA.
bensmail@utk.edu

HAMPARSUM BOZDOGAN
Department of Statistics, University of Tennessee, 336 Stokely Management Building,
Knoxville, TN 37996-0532, USA.
bozdogan@utk.edu
SUMMARY
A nonparametric approach is developed herein to perform cluster analysis for mixed data. For
robustness, a variety of covariance matrix estimators are proposed to overcome the singularity
and limitations of the maximum likelihood estimator. Whereas traditional criteria such as BIC and AIC
overfit or underfit when selecting the number of clusters, we propose a complexity-based criterion,
ICOMP, from information theory as an alternative way to calibrate the penalty term.
Applications to simulated and real data are investigated and the results are discussed.
Key words: Gaussian Kernel distribution; Fisher information; cluster analysis; mixture models.
———————————————
*Department of Statistics, University of Tennessee, 334 Stokely Management Center,
Knoxville, TN 37996-0532, USA (email: bensmail@utk.edu). Supported by an SRGP Award
(CBA) 2001/2002 to both authors.
1. Introduction
Cluster analysis has been developed mainly through the invention of empirical, and
lately Bayesian, ad hoc methods, in isolation from more formal statistical
procedures. In the last 25 years it has been found that basing cluster analysis on a
probability model can be useful both for understanding when existing methods are
likely to be successful and for suggesting new methods. One such probability model is
that the population of interest consists of K different subpopulations $G_1, \ldots, G_K$, and that
the density of a p-dimensional observation x from the kth subpopulation is $f_k(x, \theta_k)$ for
some unknown vector of parameters $\theta_k$ $(k = 1, \ldots, K)$. Given observations
$x = (x_1, \ldots, x_n)$, we let $\gamma = (\gamma_1, \ldots, \gamma_n)^t$ denote the unknown identifying labels, where
$\gamma_i = k$ if $x_i$ comes from the kth subpopulation. In the so-called classification maximum
likelihood procedure, $\theta = (\theta_1, \ldots, \theta_K)$ and $\gamma = (\gamma_1, \ldots, \gamma_n)^t$ are chosen to maximize the
classification likelihood
$$\mathrm{pr}(\theta_1, \ldots, \theta_K; \gamma_1, \ldots, \gamma_n \mid x) = \prod_{i=1}^{n} f_{\gamma_i}(x_i \mid \theta_{\gamma_i}). \qquad (1)$$
The problem of analyzing data from a mixture distribution has been investigated by
many authors in the past. Good sources of references are the papers by Hosmer
(1973), Hosmer (1976), and Kazakos (1977), along with more recent publications. In most cases, the data
to be classified are viewed as coming from a mixture of probability distributions, each
representing a different cluster, so the likelihood is expressed as
$$\mathrm{pr}(\theta_1, \ldots, \theta_K; \pi_1, \ldots, \pi_K \mid x) = \prod_{i=1}^{n} \sum_{k=1}^{K} \pi_k f_k(x_i \mid \theta_k) \qquad (2)$$
where $\pi_k$ is the probability that an observation belongs to the kth component
$(\pi_k > 0; \ \sum_{k=1}^{K} \pi_k = 1)$.
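Numerically, the mixture likelihood (2) is a sum over components inside a product over observations, and is best evaluated on the log scale to avoid underflow. The following is a minimal sketch, with Gaussian components standing in for the $f_k$ (the kernel components of Section 3 could be substituted); the function name is our own illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_log_likelihood(x, pis, mus, covs):
    """Log of the mixture likelihood (2): sum_i log sum_k pi_k f_k(x_i)."""
    dens = np.zeros(x.shape[0])
    # Accumulate pi_k * f_k(x_i) over the K components.
    for pi_k, mu_k, cov_k in zip(pis, mus, covs):
        dens += pi_k * multivariate_normal.pdf(x, mean=mu_k, cov=cov_k)
    return float(np.sum(np.log(dens)))
```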
In this paper, we consider a different method of clustering mixed data, which is a
distribution-free approach. To do this, we first propose a technique for transforming
the mixed data from a non-Euclidean space to a Euclidean space where the usual
metric can be used. We then perform a cluster analysis on the transformed data.
Section 2 introduces the transformation technique. Section 3 discusses the clustering
methodology based on a mixture of kernel distributions and proposes different
approaches to estimating the covariance matrix involved in the calculation. Section 4
considers computational and statistical aspects of fitting the mixture of kernel
distributions. Section 5 proposes a model selection criterion for choosing the number of
components of the mixture and the number of dimensions of the transformed data.
Results for simulated and real data examples are presented in Section 6. The
paper concludes with a discussion.
2. Mapping the mixed data
We assume that the reader is familiar with homogeneity analysis, also known as
multiple correspondence analysis or dual scaling (Nishisato (1980), Meulman (1986),
van der Burg et al. (1988), Gifi (1990), and Heiser and Meulman (1995)). Let
$(k_1, \ldots, k_j, \ldots, k_m)$ be the m-vector containing the number of categories of each variable,
and let p denote the dimensionality chosen for the analysis. Let each variable
$v_j$ $(j = 1, \ldots, m)$ be coded into an $(n \times k_j)$ indicator matrix $G_j$, where n is the number of
observations. An indicator matrix indicates which categories are scored by which
objects. Rows of an indicator matrix usually refer to objects and columns to categories.
Its elements consist of zeros (not scored) and ones (scored).
Homogeneity analysis determines quantifications of the categories of each of the
variables such that homogeneity is maximized. If $y_j$, a $k_j$-vector, is the quantification of
the categories of variable $v_j$, then $G_j y_j$ represents a single quantification or
transformation of the n objects for variable $v_j$. Without additional conditions on the $y_j$,
objects in the same categories get the same quantification. In homogeneity analysis,
simultaneous quantifications for each variable are collected in the $k_j \times p$ matrices $Y_j$,
called multiple nominal quantifications (quantifications of the categorical data other than
ordinal). Thus the matrices $G_j Y_j$ induce p multiple quantifications of the objects for variable
j. For example, a variable $v_j$ expressed using three categories (a, b, c) is
transformed as follows:
if $v_j = (a, b, a, c, c, a)^t$, then
$$q(v_j) = (y_1, y_2, y_1, y_3, y_3, y_1)^t = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix} \begin{pmatrix} y_{j1} \\ y_{j2} \\ y_{j3} \end{pmatrix} = G_j y_j.$$
$q(v_j) = G_j y_j$ represents a single transformation of the n objects induced by variable j.
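The indicator coding can be implemented in a few lines; a sketch (the function name is our own). Applied to the example above, `indicator_matrix(list("abacca"))` reproduces the 6 × 3 matrix $G_j$.

```python
import numpy as np

def indicator_matrix(v, categories=None):
    """Build the (n x k_j) indicator matrix G_j for one categorical variable.

    Rows refer to objects and columns to categories; an entry is 1 iff the
    object scored that category, as in Section 2.
    """
    if categories is None:
        categories = sorted(set(v))
    lookup = {c: i for i, c in enumerate(categories)}
    G = np.zeros((len(v), len(categories)))
    for row, value in enumerate(v):
        G[row, lookup[value]] = 1.0
    return G
```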
Perfect homogeneity holds if all multiple quantifications $G_j Y_j$ of the objects are the
same for all variables, which means that $X = G_1 Y_1 = \cdots = G_m Y_m$. Homogeneity analysis
thus amounts to minimizing
$$\sigma(X; Y_1, \ldots, Y_m) = \frac{1}{m} \sum_{j=1}^{m} \mathrm{tr}\,(X - G_j Y_j)^t (X - G_j Y_j) \qquad (3)$$
over object scores X and multiple nominal quantifications $Y_j$ under appropriate
normalization conditions. It should be emphasized that the choice of normalization of X
is crucial. In the distance approach, X is orthogonal but not orthonormal (Meulman
1986); the shape of the configuration, that is, the different amounts of scatter in different
directions, is determined by the eigenvalues ($X^t X / n = \Lambda^2$), where $\Lambda$ is the diagonal
matrix of eigenvalues of $J P_0 J^t$, $J = (I - n^{-1} 1 1^t)$ is a centering operator, and
$$P_0 = \frac{1}{m} \sum_{j=1}^{m} G_j (G_j^t G_j)^{-1} G_j^t$$
is the average of all the orthogonal projectors onto the subspaces spanned by the
columns of the indicator matrices $G_j$
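A minimal numpy sketch of the two operators appearing above (function names are our own): the centering operator J and the average $P_0$ of the per-variable orthogonal projectors.

```python
import numpy as np

def centering_operator(n):
    """J = I - n^{-1} 1 1^t: removes the mean from any n-vector."""
    return np.eye(n) - np.ones((n, n)) / n

def average_projector(G_list):
    """P0 = (1/m) sum_j G_j (G_j^t G_j)^{-1} G_j^t, the average of the
    orthogonal projectors onto the column spaces of the indicator matrices."""
    n = G_list[0].shape[0]
    P0 = np.zeros((n, n))
    for G in G_list:
        P0 += G @ np.linalg.inv(G.T @ G) @ G.T
    return P0 / len(G_list)
```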
. When category
quantifications are required to be points on a line (for ordinal data, for example), the loss of
homogeneity is still the average sum of squares across variables, but with
variable-wise components defined as
$$\sigma(X; Y_1, \ldots, Y_m; a_1, \ldots, a_m) = \frac{1}{m} \sum_{j=1}^{m} \mathrm{tr}\,(X - G_j y_j a_j^t)^t (X - G_j y_j a_j^t) \qquad (4)$$
which is minimized over object scores X and multiple nominal quantifications
$y_j$ $(j = 1, \ldots, m)$. The p-vector $a_j$ is the vector of weights, and $y_j$ gives the category
quantifications (Gifi 1990, Chap. 4).
Dimensionality
The dimensionality p is important here, because choosing a different dimensionality
will lead to different transformations: we can no longer assume the solutions to be
nested when we require category points to be on a line (for ordinal data, for example).
Multiple nominal variables have different quantifications in p dimensions; the
categories of non-multiple-nominal variables fit on a straight line. If there are no
dependencies in the data, the maximum number of dimensions, when the first $m_1$
variables are multiple nominal (categorical and not ordinal), is
$$\sum_{j=1}^{m_1} (k_j - 1) + (m - m_1). \qquad (5)$$
When transforming the mixed data into a Euclidean space, we expect the transformed
data to contain the same information and structure as the original data. The
transformed data are homogeneous, continuous, and distribution-free. Most probabilistic
clustering algorithms require a normality assumption, which is not guaranteed
here. In the next section, we build a probabilistic clustering algorithm for the
transformed data using kernel distributions.
3. Multivariate Kernel Mixture Model
In cluster analysis, we consider the problem of determining the structure of the data
with respect to clusters when no information other than the observed values is
available; from the extensive literature, we mention Hartigan (1975), Gordon (1999),
and Kaufman and Rousseeuw (1990). Here, the underlying data clearly come
from a nonparametric distribution g. With this in mind, we consider the possibility of
distribution-free density fitting, in particular using multivariate kernel functions f. Kernel
methods for categorical data are discussed by Aitchison and Aitken (1976), with brief
mention of continuous and mixed (i.e., categorical and continuous) data, and the
technique has been popular for some time in the literature, particularly for continuous
data; see Fryer (1977).
We assume that the data are generated by a mixture of underlying probability
distributions; each component of the mixture represents a different cluster, so that the
observations $x_i$ $(i = 1, \ldots, n)$ to be classified arise from a random vector X with
likelihood density
$$\mathrm{pr}(x_i \mid \theta) = \sum_{k=1}^{K} \pi_k f_k(x_i \mid \theta_k) \qquad (6)$$
where $\pi = (\pi_1, \ldots, \pi_K)$ is the vector of mixing proportions $(\pi_k > 0, \ \sum_{k=1}^{K} \pi_k = 1)$ and $f_k$ is the
multivariate kernel density function defined as
$$f_k(x) = \frac{1}{n_k h_k} \sum_{i=1}^{n_k} \mathrm{Ker}\!\left(\frac{x - x_i}{h_k}\right).$$
This is the well-known Parzen-Rosenblatt kernel density estimator of g(x), where $h_k$ is a
sequence of positive real numbers tending to zero as n tends to infinity, and Ker is a
kernel function on $\mathbb{R}$. To get a smoother estimate, one can use a kernel Ker that is
bounded, symmetric, and positive, satisfying $x \, \mathrm{Ker}(x) \to 0$ as $x \to \infty$ and $\int x^2 \, \mathrm{Ker}(x) \, dx < \infty$.
Some special kernel functions are given in Table 1.
Kernel          Ker(x)
Uniform         $\frac{1}{2} \, 1_{\{|x|<1\}}$
Triangle        $(1 - |x|) \, 1_{\{|x|<1\}}$
Epanechnikov    $\frac{3}{4}(1 - x^2) \, 1_{\{|x|<1\}}$
Quartic         $\frac{15}{16}(1 - x^2)^2 \, 1_{\{|x|<1\}}$
Triweight       $\frac{35}{32}(1 - x^2)^3 \, 1_{\{|x|<1\}}$
Gaussian        $\frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right)$
Cosine          $\frac{\pi}{4} \cos\left(\frac{\pi}{2} x\right) \, 1_{\{|x|<1\}}$

Table 1: Some special kernel functions
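These kernels translate directly into code; a sketch (the dictionary layout is our own). We use the normalizing constant 35/32 for the triweight so that every kernel integrates to one.

```python
import numpy as np

# The kernels of Table 1, as vectorized functions on R; each vanishes
# outside |x| < 1 except the Gaussian.
KERNELS = {
    "uniform":      lambda x: 0.5 * (np.abs(x) < 1),
    "triangle":     lambda x: (1 - np.abs(x)) * (np.abs(x) < 1),
    "epanechnikov": lambda x: 0.75 * (1 - x**2) * (np.abs(x) < 1),
    "quartic":      lambda x: 15 / 16 * (1 - x**2) ** 2 * (np.abs(x) < 1),
    "triweight":    lambda x: 35 / 32 * (1 - x**2) ** 3 * (np.abs(x) < 1),
    "gaussian":     lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi),
    "cosine":       lambda x: np.pi / 4 * np.cos(np.pi * x / 2) * (np.abs(x) < 1),
}
```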
Well-known theoretical results show that the choice of a reasonable Ker does not
seriously affect the quality of the estimator $f_k$; on the contrary, the choice of $h_k$ turns out to
be crucial for the accuracy of the estimator. Some indications about this choice are
given in Bosq and Lecoutre (1987). In practice we will use the multivariate Gaussian
kernel
$$f(x \mid h_k, \Sigma_k) = \frac{1}{n_k} \sum_{i=1}^{n_k} \frac{1}{(2\pi)^{p/2} h_k^p |\Sigma_k|^{1/2}} \exp\left\{ -\frac{1}{2 h_k^2} (x - x_i)^t \Sigma_k^{-1} (x - x_i) \right\}$$
where $\Sigma_k$ is the covariance matrix and $h_k$ is the window width for the kth cluster.
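The multivariate Gaussian kernel density above translates directly into code; a minimal sketch (names are our own), evaluating the density of one cluster at a point x.

```python
import numpy as np

def kernel_density(x, data_k, h_k, sigma_k):
    """Multivariate Gaussian kernel density for cluster k (Section 3):
    an average of Gaussian bumps of bandwidth h_k and shape sigma_k,
    centred at the n_k cluster points in data_k."""
    n_k, p = data_k.shape
    inv = np.linalg.inv(sigma_k)
    norm = (2 * np.pi) ** (p / 2) * h_k ** p * np.sqrt(np.linalg.det(sigma_k))
    diffs = x - data_k                           # (n_k, p) deviations
    quad = np.einsum("ij,jk,ik->i", diffs, inv, diffs)
    return float(np.mean(np.exp(-quad / (2 * h_k ** 2))) / norm)
```

With a single centre, h_k = 1, and the identity covariance, the density at the centre reduces to the bivariate standard normal peak 1/(2π).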
3.1 Window width estimation
A suitable value must be chosen for the window width h. Wegman (1972) suggests
formulae depending on n. Least-squares cross-validation is popular but not
useful here (Habbema et al. (1974), Wahba and Wold (1975)), and a full cross-validation
method is not worthwhile because of the heavy computation involved. Fryer (1976)
uses a mean-squared error criterion under the assumption that the data come from a
normal density or a mixture of normals, and Deheuvels proposed $h^* = \hat{\sigma} n^{-1/5}$, where $\hat{\sigma}$
denotes the empirical standard deviation. Silverman (1986) used $h^* = a \cdot n^{-1/(p+4)}$, where
$a = [4/(2p+1)]^{1/(p+4)}$. More recently, Specht (1990) proposed a probabilistic neural network
(PNN) fixed width. Later on, Bozdogan (2001) suggested formulae (in the univariate
case) involving the sample standard deviation and the interquartile range of the data.
Here we propose a generalization of Silverman's approach, introducing the information
complexity of the data (Bozdogan 1994), as follows:
$$h_k^* = a \cdot s_k \cdot n_k^{-1/(p+4)}, \qquad s_k = C_1(\Sigma_k) = \frac{r}{2} \log\left[\mathrm{tr}(\Sigma_k)/r\right] - \frac{1}{2} \log\left[\det(\Sigma_k)\right],$$
where r is the rank of $\Sigma_k$ and $a = \left[\frac{4}{2p+1}\right]^{1/(p+4)}$. If $\Sigma_k$ is diagonal, then
$C_1(O_k) = \frac{1}{p} \mathrm{tr}[O_k^t O_k] - \frac{1}{p^2} [\mathrm{tr}(O_k)]^2$, where $O_k = \mathrm{diag}(\Sigma_k)$. Here $\hat{\Sigma}_k$ denotes the estimator of the
covariance matrix $\Sigma_k$.
Property 1. The sequence of positive real numbers
$$h_k^* = \left[\frac{4}{2p+1}\right]^{1/(p+4)} \times \left\{ \frac{r}{2} \log\left[\frac{1}{r}\mathrm{tr}(\Sigma_k)\right] - \frac{1}{2} \log\left[\det(\Sigma_k)\right] \right\} \times n_k^{-1/(p+4)}$$
tends to zero as $n_k$ tends to infinity.
Proof:
From property (3.3) of Bozdogan (19??), we have $C_1(\Sigma_k) \to 0$ as
$\Sigma_k \to \sigma^2 I$ (a scalar multiple of the identity matrix), and $C_1(\Sigma_k) \geq 0$ necessarily.
Using this, and since $n_k^{-1/(p+4)} \to 0$ as $n_k \to +\infty$, we obtain
$$\lim_{n_k \to +\infty} h_k^* = 0.$$
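A direct implementation of $h_k^*$ can be sketched as follows (the function name is our own). Note that $h_k^* = 0$ when $\Sigma_k$ is spherical, since $C_1$ then vanishes, consistent with Property 1.

```python
import numpy as np

def window_width(sigma_k, n_k, p):
    """Data-adaptive window width h_k* of Section 3.1:
    h_k* = [4/(2p+1)]^{1/(p+4)} * C1(sigma_k) * n_k^{-1/(p+4)}."""
    r = np.linalg.matrix_rank(sigma_k)
    # C1(Sigma_k) = (r/2) log[tr(Sigma_k)/r] - (1/2) log det(Sigma_k)
    c1 = 0.5 * r * np.log(np.trace(sigma_k) / r) \
        - 0.5 * np.log(np.linalg.det(sigma_k))
    a = (4.0 / (2 * p + 1)) ** (1.0 / (p + 4))
    return a * c1 * n_k ** (-1.0 / (p + 4))
```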
Sometimes the covariance matrix estimator is singular when the size of
the data is small in comparison to the number of variables/dimensions. In the following
paragraphs, we describe some robust estimators of $\Sigma_k$ which can improve the
calculation of $s_k$ and hence of $h_k^*$.
3.2 Generalized Smooth Covariance Estimator
Different robust methods for estimating the covariance matrix $\Sigma_k$ are presented
here and are described in Table 2. They are the maximum entropy covariance
estimator $\hat{\Sigma}_{ME}$, the maximum likelihood/empirical Bayes estimator $\hat{\Sigma}_{MLE/EB}$, the maximum
entropy/empirical Bayes estimator $\hat{\Sigma}_{ME/EB}$, the stipulated ridge covariance estimator $\hat{\Sigma}_{SRE}$,
the stipulated diagonal covariance estimator $\hat{\Sigma}_{SDE}$, and the convex sum covariance estimator
$\hat{\Sigma}_{CSE}$.
Model 1 (maximum entropy covariance estimator):
  $\hat{\Sigma}_{ME} = C + D$, where C is the covariance secondary midpoint and D is a diagonal matrix, $D = (d_{ij}) \geq 0$.

Model 2 (maximum likelihood/empirical Bayes estimator):
  $\hat{\Sigma}_{MLE/EB} = \hat{\Sigma}_{MLE} + \frac{p-1}{n \, \mathrm{tr}[\hat{\Sigma}_{MLE}]} I_p$, where $I_p$ is the $(p \times p)$ identity matrix and p is the number of variables.

Model 3 (maximum entropy/empirical Bayes estimator):
  $\hat{\Sigma}_{ME/EB} = \hat{\Sigma}_{ME} + \frac{p-1}{n \, \mathrm{tr}[\hat{\Sigma}_{ME}]} I_p$.

Model 4 (stipulated ridge covariance estimator):
  $\hat{\Sigma}_{SRE} = \hat{\Sigma}_{MLE} + \frac{p(p-1)}{2n \, \mathrm{tr}[\hat{\Sigma}_{MLE}]} I_p$.

Model 5 (stipulated diagonal covariance estimator):
  $\hat{\Sigma}_{SDE} = (1 - \lambda) \hat{\Sigma}_{MLE} + \lambda \, \mathrm{Diag}(\hat{\Sigma}_{MLE})$, with $\lambda = p(p-1)\left[2n(\mathrm{tr}[R^{-1}] - p)\right]^{-1}$
  and $R = \mathrm{Diag}(\hat{\Sigma}_{MLE})^{-1/2} \, \hat{\Sigma}_{MLE} \, \mathrm{Diag}(\hat{\Sigma}_{MLE})^{-1/2}$.

Model 6 (convex sum covariance estimator):
  $\hat{\Sigma}_{CSE} = \frac{n}{n+m} \hat{\Sigma}_{MLE} + \left(1 - \frac{n}{n+m}\right) D_{MLE}$, with $D_{MLE} = \frac{1}{p} \mathrm{tr}[\hat{\Sigma}_{MLE}] \, I_p$, $p \geq 2$,
  and $0 < m < \frac{2[p(1+\beta) - 2]}{p - \beta}$, where $\beta = \frac{(\mathrm{tr}[\hat{\Sigma}_{MLE}])^2}{\mathrm{tr}[\hat{\Sigma}_{MLE}^2]}$.

Table 2: Different robust covariance matrix estimators
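Two of these estimators can be sketched in a few lines (Models 2 and 6; function names are our own). The empirical-Bayes ridge term makes a singular sample covariance invertible, while the convex sum shrinks toward a spherical target without changing the trace.

```python
import numpy as np

def mle_eb(sigma_mle, n):
    """Model 2: Sigma_MLE + (p-1)/(n * tr[Sigma_MLE]) * I_p,
    a ridge-type repair for near-singular sample covariances."""
    p = sigma_mle.shape[0]
    return sigma_mle + (p - 1) / (n * np.trace(sigma_mle)) * np.eye(p)

def convex_sum(sigma_mle, n, m):
    """Model 6: a convex combination of Sigma_MLE and the spherical
    target D_MLE = (tr[Sigma_MLE]/p) I_p, with weight n/(n+m)."""
    p = sigma_mle.shape[0]
    w = n / (n + m)
    d_mle = np.trace(sigma_mle) / p * np.eye(p)
    return w * sigma_mle + (1 - w) * d_mle
```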
7
4. Model and Estimation
4.1 Maximum likelihood estimation of the models using EM algorithm
We assume that the data are generated by a mixture of underlying probability
distributions; each component of the mixture represents a different cluster, so that the
observations $x_i$ $(i = 1, \ldots, n)$ to be classified arise from a random vector X with
likelihood density
$$\mathrm{pr}(x_i \mid \theta) = \sum_{k=1}^{K} \pi_k f_k(x_i \mid \theta_k)$$
where $f_k(\cdot \mid \theta_k = (h_k, \Sigma_k))$ is the multivariate kernel density function, $h_k$ is the window
width, and $\Sigma_k$ is the covariance matrix for the kth group:
$$f(x \mid h_k, \Sigma_k) = \frac{1}{n_k} \sum_{i=1}^{n_k} \frac{1}{(2\pi)^{p/2} h_k^p |\Sigma_k|^{1/2}} \exp\left\{ -\frac{1}{2 h_k^2} (x - x_i)^t \Sigma_k^{-1} (x - x_i) \right\}.$$
$\pi = (\pi_1, \ldots, \pi_K)$ is the vector of mixing proportions $(\pi_k > 0, \ \sum_{k=1}^{K} \pi_k = 1)$. The mean and
variance for each component are defined as
$$E(x) = \bar{x}_k \quad \text{and} \quad \mathrm{var}(x) = W_k + h_k I.$$
An explicit solution for the likelihood estimators is not possible, but the problem is numerically
tractable using the EM algorithm (Dempster et al. 1977). Here, recursions are used
which exemplify the EM pattern of iterations of the following type:
1. Initialization of the parameters $\pi_k^{(0)}$, $h_k^{(0)}$, and $\Sigma_k^{(0)}$. Here we use the K-means
clustering algorithm. K-means proceeds by repeated application of a two-step
process in which the mean vector for all observations in each cluster is computed,
and observations are reassigned to the cluster whose center is closest to the
observation.
2. E-step: calculation of the current conditional probability of each label given the data,
namely
$$\mathrm{pr}(\gamma_i = k) = \frac{\pi_k f_k(x_i)}{\sum_{h=1}^{K} \pi_h f_h(x_i)}.$$
3. M-step: evaluate new estimates of the parameters $\pi_k$, $\Sigma_k$ by standard maximum
likelihood, and estimate h by
$$h_k^* = \left[\frac{4}{2p+1}\right]^{1/(p+4)} \cdot \left\{ \frac{r}{2} \log\left[\mathrm{tr}(\Sigma_k)/r\right] - \frac{1}{2} \log\left[\det(\Sigma_k)\right] \right\} \cdot n_k^{-1/(p+4)}$$
where r is the rank of $\Sigma_k$. If $\Sigma_k$ is diagonal, we can use
$s_k = \frac{1}{p} \mathrm{tr}[O_k^t O_k] - \frac{1}{p^2} [\mathrm{tr}(O_k)]^2$, where $O_k = \mathrm{diag}(\Sigma_k)$.
4. Iterate steps 2 and 3 until convergence is achieved.
5. Model Selection
Here we use the model selection criterion ICOMP, proposed by Bozdogan (1994),
which overcomes the limitations of the traditional model selection criteria (AIC, SC, and
others). ICOMP shares its lack-of-fit part with the traditional criteria, but extends the
penalty part from a constant depending mostly on the size of the data and the
number of parameters in the model to a more data-adaptive score based on
Fisher information, which takes into consideration the covariance matrix of the
estimated parameters. Below we explain how the penalty term is derived
and what ICOMP looks like after some preliminary calculation.
$$\mathrm{ICOMP}_{ifim} = -2 \log L(\hat{\theta}) + 2 C_1(\hat{F}^{-1})$$
where
$$C_1(\hat{F}^{-1}) = \frac{d}{2} \log\left[\frac{\mathrm{tr}(\hat{F}^{-1})}{d}\right] - \frac{1}{2} \log |\hat{F}^{-1}|,$$
$$\hat{F}^{-1} = \mathrm{Cov}(\hat{\theta}) = -\left\{ E\left[\frac{\partial^2 \log L(\theta)}{\partial\theta \, \partial\theta^t}\right] \right\}^{-1},$$
and $d = \mathrm{rank}(\hat{F}^{-1}) = \dim(\hat{F}^{-1})$. For K clusters,
$$\hat{F}^{-1} = \mathrm{Cov}(\hat{\theta}) = \begin{pmatrix} \hat{F}^{-1}(\pi) & 0 & \cdots & 0 \\ 0 & \hat{F}_1^{-1} & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & \hat{F}_K^{-1} \end{pmatrix}.$$
Example:
For K = 2 mixture components, for example, we have
$$\hat{F}^{-1}(\pi) = \begin{pmatrix} \frac{1}{\hat{\pi}_1} & 0 \\ 0 & \frac{1}{\hat{\pi}_2} \end{pmatrix}, \qquad \hat{F}_k^{-1} = \begin{pmatrix} \hat{\Sigma}_k & 0 \\ 0 & \frac{2}{n_k} D_p^+ (\hat{\Sigma}_k \otimes \hat{\Sigma}_k) D_p^{+t} \end{pmatrix}, \quad k = 1, 2.$$
After calculation, ICOMP becomes
$$\mathrm{ICOMP} = -2 \log L(\hat{\theta}) + \left(kp + kp\frac{p+1}{2}\right) \log\left\{ \frac{\sum_{k=1}^{K} \left[ \frac{\mathrm{tr}(\hat{\Sigma}_k)}{\hat{\pi}_k} + \frac{1}{2n}\mathrm{tr}(\hat{\Sigma}_k^2) + \frac{1}{2n}[\mathrm{tr}(\hat{\Sigma}_k)]^2 + \frac{1}{n}\sum_j \hat{\sigma}_{k,jj}^2 \right]}{kp + kp(p+1)/2} \right\} - (p+2) \sum_{k=1}^{K} \log|\hat{\Sigma}_k| - p \sum_{k=1}^{K} \log(\hat{\pi}_k n).$$
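The generic ICOMP score of this section can be sketched directly from its definition (function names are our own); $C_1$ vanishes for a spherical $\hat{F}^{-1}$ and grows as its eigenvalues become more unbalanced.

```python
import numpy as np

def c1(fisher_inv):
    """C1 penalty: (d/2) log[tr(F^-1)/d] - (1/2) log|F^-1|, d = rank(F^-1).
    Zero iff F^-1 is a scalar multiple of the identity."""
    d = np.linalg.matrix_rank(fisher_inv)
    return 0.5 * d * np.log(np.trace(fisher_inv) / d) \
        - 0.5 * np.log(np.linalg.det(fisher_inv))

def icomp(log_lik, fisher_inv):
    """ICOMP_ifim = -2 log L(theta_hat) + 2 C1(F^-1)."""
    return -2.0 * log_lik + 2.0 * c1(fisher_inv)
```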
6. Examples
6.1 Example 1: Power exponential
Here we simulated 300 bivariate observations from a well-known nonnormal distribution,
the power exponential distribution, which has density function
$$f(x; \mu, \Sigma, \beta) = \frac{p \, \Gamma(p/2)}{\pi^{p/2} \, \Gamma\!\left(1 + \frac{p}{2\beta}\right) 2^{1 + \frac{p}{2\beta}}} \, |\Sigma|^{-1/2} \exp\left\{ -\frac{1}{2}\left[(x - \mu)^t \Sigma^{-1} (x - \mu)\right]^{\beta} \right\},$$
where p is the number of variables and $\beta$ relates to the kurtosis. We specify a
covariance matrix for each cluster, each defining the distribution of that cluster. Figures 1-4
display the power exponential density for different values of $\beta$.
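The power exponential density translates directly into code; a sketch (the function name is our own). As a sanity check, $\beta = 1$ recovers the Gaussian, so with $\Sigma = I$ and p = 2 the density at the mean is $1/(2\pi)$.

```python
import numpy as np
from scipy.special import gamma

def power_exp_pdf(x, mu, sigma, beta):
    """Multivariate power exponential density of Example 1; beta
    controls the kurtosis (beta = 1 gives the Gaussian)."""
    p = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.inv(sigma) @ diff
    const = p * gamma(p / 2) / (
        np.pi ** (p / 2) * gamma(1 + p / (2 * beta)) * 2 ** (1 + p / (2 * beta))
    )
    return const * np.linalg.det(sigma) ** -0.5 * np.exp(-0.5 * quad ** beta)
```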
[Figures 1-4: density plots of the power exponential distribution with beta = 0.5, 0.75, 1, and 2, respectively.]
Here we have used $\mu_1 = (0, 0)$, $\mu_2 = (4, 2)$, and $\mu_3 = (2, 2)$, and
$$\Sigma_1 = \begin{pmatrix} 2 & -0.65 \\ -0.65 & 1 \end{pmatrix}, \qquad \Sigma_2 = \begin{pmatrix} 2 & 0.84 \\ 0.84 & 1 \end{pmatrix}, \qquad \Sigma_3 = \begin{pmatrix} 1 & -0.5 \\ -0.5 & 2 \end{pmatrix}.$$
The data are supposed to be drawn from the following density, a mixture of three
components with the same mixing proportion for each component, as Figure 5 shows:
Figure 5: Mixture of three power exponential densities
The density is summarized as the equally weighted mixture
$$p(x, \theta) = 0.33 \, f(x; \mu_1, \Sigma_1, \beta = 0.75) + 0.33 \, f(x; \mu_2, \Sigma_2, \beta = 1.25) + 0.33 \, f(x; \mu_3, \Sigma_3, \beta = 2),$$
with each component a bivariate power exponential density as defined above.
When the data are plotted in a two-dimensional space, the groups can be detected, as
Figure 6 shows. In general, however, when plotting multivariate data derived from a
mixture of K components having a power exponential distribution, it is challenging
to classify them using traditional methods, especially methods which do not offer a
criterion for choosing the number of components. Table 3 shows the performance of ICOMP with
the different robust covariance matrix estimators. The number of clusters chosen is 3,
using the convex covariance estimator $\hat{\Sigma}_{CSE}$. The estimated window width
for each component is $\hat{h} = (0.17208, 0.00953, 0.90023)$, and the optimal classification
resulted in only 5% of the points being misclassified, as indicated in Table 4.
Figure 6: Two-dimensional projection of the simulated data
K/Cov   $\hat{\Sigma}_{MLE/EB}$   $\hat{\Sigma}_{SRE}$   $\hat{\Sigma}_{SDE}$   $\hat{\Sigma}_{CSE}$
K = 1   642.48   693.62   682.63   693.63
K = 2   583.62   582.75   582.72   591.22
K = 3   382.16   396.32   392.62   373.12
K = 4   489.42   481.51   332.22*  482.72
K = 5   526.21   528.12   592.31   528.11
K = 6   694.87   606.65   607.95   696.85
K = 7   643.87   693.28   693.83   692.08

Table 3: AIC values with different robust covariance matrices

K/Cov   $\hat{\Sigma}_{MLE/EB}$   $\hat{\Sigma}_{SRE}$   $\hat{\Sigma}_{SDE}$   $\hat{\Sigma}_{CSE}$
K = 1   683.47   623.44   682.24   629.99
K = 2   626.21   582.57   589.27   602.24
K = 3   387.26   382.37   382.62   379.25*
K = 4   423.62   482.37   482.31   402.98
K = 5   672.61   692.13   682.71   590.95
K = 6   592.73   628.36   519.71   501.93
K = 7   682.61   692.71   692.72   699.10

Table 4: SC values with different robust covariance matrices
K/Cov   $\hat{\Sigma}_{MLE/EB}$   $\hat{\Sigma}_{SRE}$   $\hat{\Sigma}_{SDE}$   $\hat{\Sigma}_{CSE}$
K = 1   624.42   682.62   627.06   632.11
K = 2   599.00   552.82   527.73   533.29
K = 3   243.94   289.82   238.37   235.32
K = 4   542.10   452.18   473.62   433.32
K = 5   599.03   529.25   478.22   423.73
K = 6   616.92   611.22   634.28   650.73
K = 7   629.02   613.21   639.38   652.83

Table 5: ICOMP values with different robust covariance matrices

K/Cov   $\hat{\Sigma}_{MLE/EB}$   $\hat{\Sigma}_{SRE}$   $\hat{\Sigma}_{SDE}$   $\hat{\Sigma}_{CSE}$
K = 1    1    1    1    0
K = 2    5    3    4    2
K = 3   85   90   92   97
K = 4    7    5    3    1
K = 5    2    0    0    0
K = 6    0    1    0    0
K = 7    0    0    0    0

Table 6: Percentage scoring of ICOMP
                         Predicted
             Group 1   Group 2   Group 3   Total
Simulated
  G_0.75        95        2         3       100
  G_1.25         7       93         0       100
  G_1.5          1        1        98       100
% correct      0.95     0.93      0.98      0.95

Table 7: Cross-classification table giving the clustering results for the simulated
data.
6.2 Example 2 (categorical data): Cetacea Data
Here we analyze Slijper's data concerning n = 36 different types of cetacea (Slijper
(1973)). The whales, porpoises, and dolphins have been described by 15
characteristics related to their morphology, osteology, or behavior (A, morphological
variables: neck, form of the head, size of the head, beak, dorsal fin, flippers, set of
teeth, longitudinal furrows on throat, blow hole, color; B, osteological variables:
cervical vertebrae, lachrymal and jugal bones, head bones; C, behavioral variables:
habitat and feeding). The characteristics are considered as discriminating variables
for allocating the cetacea into K = 9 groups. The numbers of categories of the
15 variables are as follows:

Variable       neck  head form  head size  beak  dorsal fin  flippers  teeth  throat
# categories    2       6          2        4        4          4        5      3

Variable       blowhole  color  cervical vertebrae  lachrymal & jugal bones  head bones  habitat  feeding
# categories      4        5           2                     3                   5          6        4
Figure 7: Classification of whales, porpoises and dolphins according to Slijper.
Using the nonlinear transformation of this data, the new object scores have dimension
$$\sum_{j=1}^{m_1} (k_j - 1) + (m - m_1) = \sum_{j=1}^{m} (k_j - 1),$$
where $m_1$ is the number of categorical variables treated as multiple nominal (nominal,
not ordinal, as is the case here); the dimension of the observed data is p = 16.
First, we reduced the dimensionality of the transformed data. To achieve this, we
carried out an all-possible-subsets selection among the $2^{16} = 65536$ subsets to determine the
optimal dimension of the cetacea data using ICOMP. ICOMP chooses three
dimensions. Using our algorithm, we calculated the ICOMP criterion for the different robust
estimators and different numbers of components $(K = 1, \ldots, 12)$ (see Table 5). The main
finding is that the approach provides us with an answer which is in agreement with
Slijper (1973) and Van der Burg (1985) (see Figure 7). Nine groups were chosen by
ICOMP under the models $\hat{\Sigma}_{MLE/EB}$, $\hat{\Sigma}_{SRE}$, and $\hat{\Sigma}_{SDE}$, while the model $\hat{\Sigma}_{CSE}$ chooses 10 groups.
The data belonging to the 10th group are: number 10 (pilot whales), 11 (Risso's whale),
24 (killer whale), and 23 (Irrawaddy dolphin). This suggests that perhaps these 4 observations
are influential or do not fit the description/behavior of the other observations. The
window width with $\hat{\Sigma}_{SDE}$ is given by
$$\hat{h} = (0.04373, 0.06234, 0.07421, 0.03219, 0.12331, 0.43212, 0.23510, 0.16241, 0.24321).$$
Comp/Cov  $\hat{\Sigma}_{MLE/EB}$   $\hat{\Sigma}_{SRE}$   $\hat{\Sigma}_{SDE}$   $\hat{\Sigma}_{CSE}$
K = 1    18767.12   18194.01   17428.28   17438.09
K = 2    19883.87   17594.54   18462.87   16485.21
K = 3    17729.56   15484.45   15420.11   15494.01
K = 4    16463.28   16438.94   19583.32   15483.38
K = 5    16429.96   18829.20   15928.11   17976.86
K = 6    16827.39   19999.05   19043.29   18088.56
K = 7    16428.98   16829.27   18638.86   17698.87
K = 8    14761.00   14465.27   14643.87   14754.98
K = 9    14673.42   14087.12   13578.76   14765.65
K = 10   15876.32   18691.56   15765.87   13984.03
K = 11   17986.24   16859.45   18765.72   18654.98
K = 12   19463.02   19654.08   18875.62   18653.98

Table 8: ICOMP with different robust covariance matrices
                         Predicted
        G1  G2  G3  G4  G5  G6  G7  G8  G9  Tot
Real
  G1     2   0   0   0   0   1   0   0   0    3
  G2     0   2   0   0   0   0   0   0   0    2
  G3     1   0   2   0   0   0   0   0   0    3
  G4     0   0   0   2   0   0   0   0   0    2
  G5     0   0   1   0   3   0   1   0   0    5
  G6     0   2   0   0   0  11   0   0   0   13
  G7     0   0   0   0   0   0   2   0   0    2
  G8     0   0   0   0   0   0   0   2   0    2
  G9     0   0   1   0   0   0   0   0   2    3
%corr  0.67   1  0.67  1  0.6  0.85  1   1  0.67  0.80

Table 9: Cross-classification table giving the clustering results for the Cetacea
6.3 Example 3: Cervical Data (a mixture of continuous and categorical)
From each patient, one ectocervical and one endocervical sample was obtained with a
Cytobrush, bent at an angle of 90 degrees for the ectocervical sampling. The samples
were suspended in Leiden fixative (Boon and Drijver 1986). The method used to prepare
plastic sections of the samples is described in Boon et al. (1990). The histological
diagnosis of each patient entering this study was known from a subsequent biopsy.
There are 50 cases with mild dysplasia (histological group 1), 50 cases with moderate
dysplasia (histological group 2), 50 cases with severe dysplasia (histological group 3),
50 cases with carcinoma in situ (histological group 4), and 42 cases with invasive
squamous cell carcinoma (histological group 5), so the total number of cases is n = 242.
The plastic sections were stained according to a modified Papanicolaou method (Boon
et al. 1990). There are 242 observations and 11 variables: 7 qualitative variables with 4
categories (1, 2, 3, 4) and 4 quantitative ones (0, 1, 2, ...) (see Table 7). The following
summarizes some characteristics of the data for the 5 different groups, with respect
to the qualitative and quantitative variables.
Variable   Name                          Scores
x1         Nuclear Shape                 (1, 2, 3, 4)
x2         Nuclear Irregularity          (1, 2, 3, 4)
x3         Chromatin Pattern             (1, 2, 3, 4)
x4         Chromatin Distribution        (1, 2, 3, 4)
x5         Number of Nucleoli            > 0 (counts)
x6         Nucleolar Irregularity        (1, 2, 3, 4)
x7         Nucleus/Nucleolus Ratio       (1, 2, 3, 4)
x8         Nucleus/Cytoplasm Ratio       (1, 2, 3, 4)
x9         Number of Cells per Fragment  > 2
x10        Total Number of Cells         > 0 (counts)
x11        Number of Mitoses             > 0 (counts)

Table 10: Characteristics of the 11 variables
First, we reduce the dimensionality of the cervical data. To achieve this, we carried out
an all-possible-subsets selection to determine the optimal dimension of the cervical data
using ICOMP. We summarize the best subset results among the $2^{12} = 4096$ subsets (see Table
8).
Best Subsets                                   Size of Dim   ICOMP
{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}           12         20,132.0
{1, 2, 3, 4, 5, -, 7, 8, 9, 10, 11, 12}           11         22,750.81
{1, 2, -, -, 5, 6, 7, 8, 9, 10, 11, 12}           10         36,498.11
{1, 2, -, -, 5, -, 7, 8, 9, 10, 11, 12}            9         36,714.27
{1, 2, -, -, 5, -, 7, 8, -, 10, 11, 12}            8         38,892.34
{1, 2, -, -, -, -, 7, 8, -, 10, 11, 12}            7         36,703.16
{1, -, 3, -, -, -, 7, -, 9, 10, 11, 12}            6         34,734.86
{1, -, -, -, -, -, -, 8, -, 10, 11, 12}            5         42,353.52
{1, -, -, -, -, -, -, -, 10, 11, 12}               4         42,798.42
{1, -, -, 4, -, -, -, -, -, 10, -, -}              3*        16,627.32*
{-, -, -, 4, -, -, -, -, 9, -, -, -}               2*         7,354.785*

Table 11: The optimal dimension of the cervical data.
Fitting Multivariate Gaussian Mixtures to the Cervical Data
Here, we carried out unsupervised clustering on the cervical data, pretending that
we do not know anything about the underlying group structure of the data, using
Gaussian mixture-model cluster analysis. We used four criteria for selecting the
number of clusters (see Table 9), where AIC is the Akaike criterion, CAIC is the
consistent Akaike criterion, and SC is the Schwarz criterion (Schwarz 1978).
k          m    C1(IFIM)   ICOMP      AIC        SC
1          90     1.47     8240.03    8507.09    8731.09
2         103    16.35     8205.49    8481.79    8738.15
3         116    48.97     8166.34    8416.40    8705.11
4         129    92.26     8183.26    8385.72    8706.80
5*        142   434.04     6855.50*   6413.42*   6766.85*
Kmod = 6  155   576.00     7128.54    6441.54    6827.33

Table 12: ICOMP, AIC, CAIC and SC values for the Mixture Cluster Analysis
Examining the results in Table 9, we see that ICOMP achieves its minimum at K = 5
mixture clusters, indicating that there are five clusters. AIC and SC give the same
result. Table 10 summarizes the well-classified observations through the diagonal elements
and the misclassified observations through the off-diagonal elements. The
misclassification error rate achieved is 7.43%.
k       1    2    3    4    5   Total
1      48    2    0    0    0    50
2       0   46    4    0    0    50
3       0    0   45    5    0    50
4       0    0    0   43    7    50
5       0    0    0    0   42    42
Total  49   48   49   48   49   n = 242

Table 13: Confusion Matrix from the Mixture Gaussian Solution.
Fitting a Multivariate Mixture of Kernels to the Cervical Data
Now we relax the distributional assumption and carry out unsupervised clustering on
the cervical data using the mixture of kernels. Tables 11 and 12 show the AIC and
ICOMP scores for different numbers of components and different robust covariance matrix
estimators. ICOMP suggests that the optimal number of components (K = 5) is reached
with the best robust matrix estimator ($\hat{\Sigma}_{CSE}$). Table 13 shows the confusion matrix,
which delivers a misclassification rate of about 1.65%, and Table 14 provides
the estimates of the data-adaptive window width of the kernel distribution
under the chosen model.
K/Cov   $\hat{\Sigma}_{MLE/EB}$   $\hat{\Sigma}_{SRE}$   $\hat{\Sigma}_{SDE}$   $\hat{\Sigma}_{CSE}$
K = 1   81788.19   80149.40   97097.02   77578.00
K = 2   77700.00   71511.93   68594.65   78484.98
K = 3   67798.92   65472.43   65486.11   63477.01
K = 4   77448.14   56477.55   39508.32*  45497.33
K = 5   39808.35   39976.00   39999.33   44080.04
K = 6   81838.35   81906.32   71066.41   82000.01

Table 14: AIC with different robust covariance matrices

K/Cov   $\hat{\Sigma}_{MLE/EB}$   $\hat{\Sigma}_{SRE}$   $\hat{\Sigma}_{SDE}$   $\hat{\Sigma}_{CSE}$
K = 1   56798.19   78149.40   77497.02   72478.00
K = 2   79880.90   61510.93   88494.65   66484.98
K = 3   47798.92   55476.43   55486.11   53477.01
K = 4   76448.44   66487.55   39508.32   62497.33
K = 5   36498.87   40876.00   35999.33   32908.04*
K = 6   86888.35   80976.32   79066.41   72000.01

Table 15: ICOMP with different robust covariance matrices
k       1    2    3    4    5   Total
1      50    0    0    0    0    50
2       1   47    1    1    0    50
3       0    0   50    0    0    50
4       0    0    1   49    0    50
5       0    0    0    0   42    42
Total  51   47   52   50   42   n = 242

Table 16: Confusion Matrix from the RKR Solution.

K           h
1           0.12708909
2           0.10413573
3           0.07339081
4           0.04518093
Kmod = 5    0.01628560

Table 17: Estimators of the Kernel window width
References
AITCHISON, J. and AITKEN, C. G. G. (1976): Multivariate binary discrimination
by the kernel method. Biometrika, 63, 413-420.
BENSMAIL, H. and CELEUX, G. (1996): Regularized discriminant analysis
through eigenvalue decomposition. Journal of the American Statistical
Association, 91 (436), 1743-1748.
DEVROYE, L. (1983): The equivalence of weak, strong and complete
convergence in L1 for kernel density estimates. The Annals of Statistics, 11,
896-904.
GIFI, A. (1990): Nonlinear Multivariate Analysis. Wiley Series in
Probability and Mathematical Statistics, England.
GORDON, A. D. (1999): Classification: Methods for the Exploratory
Analysis of Multivariate Data, 2nd ed. Chapman and Hall, New York.
FRYER, M. J. (1976): Some errors associated with the nonparametric
estimation of density functions. Journal of the Institute of Mathematics and its
Applications, 18, 371-380.
FUKUNAGA, K. (1972): Introduction to Statistical Pattern Recognition.
Academic Press, New York.
HARTIGAN, J. A. (1975): Clustering Algorithms. Wiley, New York.
HAND, D. H. (1981): Discrimination and Classification. Wiley Series in
Probability and Mathematical Statistics, England.
HEISER, W. J. and MEULMAN, J. J. (1995): Nonlinear methods for the
analysis of homogeneity and heterogeneity. In: W. J. Krzanowski (Ed.), Recent
Advances in Descriptive Multivariate Analysis. Clarendon Press, Oxford.
KAUFMAN, L. and ROUSSEEUW, P. J. (1990): Finding Groups in Data.
Wiley, New York.
MEULMAN, J. J. (1986): A Distance Approach to Nonlinear Multivariate
Analysis. DSWO Press, Leiden.
MEULMAN, J. J. et al. (1990): Prediction of Various Grades of Cervical
Preneoplasia and Neoplasia on Plastic Embedded Cytobrush Samples. Technical
Report RR9006, Department of Data Theory, University of Leiden.
NISHISATO, S. (1980): Analysis of Categorical Data: Dual Scaling and Its
Applications. University of Toronto Press, Toronto.
PARZEN, E. (1962): On the estimation of a probability density function and the
mode. Annals of Mathematical Statistics, 33, 1065-1076.
ROSENBLATT, M. (1956): Remarks on some nonparametric estimates of a
density function. Annals of Mathematical Statistics, 27, 832-837.
SHUNMUGAN, K. (1977): On a modified form of Parzen estimator for
nonparametric pattern recognition. Pattern Recognition, 9, 167-170.
SCHWARZ, G. (1978): Estimating the dimension of a model. The Annals of
Statistics, 6, 461-464.
TATSUOKA, M. M. (1988): Multivariate Analysis: Techniques for
Educational and Psychological Research. Macmillan Publishing Company, New
York; Collier Macmillan Publishers, London.
TUKEY, P. A. and TUKEY, J. W. (1981): Graphical display of data sets in 3
or more dimensions. In: V. Barnett (Ed.), Interpreting Multivariate Data. Wiley,
New York.
VAN der BURG, E. (1985): Homals classification of whales, porpoises and
dolphins. In: J. F. Marcotorchino, J. M. Proth, and J. Janssen (Eds.), Data
Analysis in Real Life Environment: Ins and Outs of Solving Problems. Elsevier
Science Publishers B.V. (North-Holland).
VAN der BURG, E., DE LEEUW, J. and VERDEGAAL, R. (1988):
Homogeneity analysis with k sets of variables: an alternating least squares
method with optimal scaling features. Psychometrika, 53, 177-197.
VESCIA, G. (1985): Descriptive classification of Cetacea: whales, porpoises,
and dolphins. In: J. F. Marcotorchino, J. M. Proth, and J. Janssen (Eds.), Data
Analysis in Real Life Environment: Ins and Outs of Solving Problems. Elsevier
Science Publishers B.V., Amsterdam, The Netherlands.
WERTZ, W. (1978): Statistical Density Estimation: A Survey. Vandenhoeck
and Ruprecht, Göttingen. Monographs in Applied Statistics and
Econometrics, No. 13.