
On Clustering Binary Data

Tao Li∗ Shenghuo Zhu†

∗ School of Computer Science, Florida International University, taoli@cs.fiu.edu.
† NEC Labs America, Inc., zsh@sv.nec-labs.com. Major work was completed when the author was at the University of Rochester.

Abstract

Clustering is the problem of identifying the distribution of patterns and intrinsic correlations in large data sets by partitioning the data points into similarity classes. This paper studies the problem of clustering binary data. This is the case for market basket datasets, where the transactions contain items, and for document datasets, where the documents contain "bags of words". The contribution of the paper is two-fold. First, a new clustering model is presented. The model treats the data and features equally, based on their symmetric association relations, and explicitly describes the data assignments as well as the feature assignments. An iterative alternating least-squares procedure is used for optimization. Second, a unified view of binary data clustering is presented by examining the connections among various clustering criteria.

1 Introduction

The problem of clustering data arises in many disciplines and has a wide range of applications. Intuitively, clustering is the problem of partitioning a finite set of points in a multi-dimensional space into classes (called clusters) so that (i) the points belonging to the same class are similar and (ii) the points belonging to different classes are dissimilar.

In this paper, we focus our attention on binary datasets. Binary data occupy a special place in the domain of data analysis. Typical applications of binary data clustering include market basket data clustering and document clustering. For market basket data, each transaction can be represented as a binary vector in which each element indicates whether or not the corresponding item/product was purchased. For document clustering, each document can be represented as a binary vector in which each element indicates whether a given word/term is present or not.

The first contribution of the paper is the introduction of a new clustering model along with a clustering algorithm. A distinctive characteristic of binary data is that the features (attributes) they include have the same nature as the data they intend to account for: both are binary. This characteristic implies a symmetric association relation between data and features: if the set of data points is associated to the set of features, then the set of attributes is associated to the set of data points, and vice versa. The association relation suggests a new clustering model in which the data and the features are treated equally. Our new clustering model, BMD (Binary Matrix Decomposition), explicitly describes the data assignments (assigning data points to clusters) as well as the feature assignments (assigning features to clusters). The clustering problem is then presented as a binary matrix decomposition, which is solved via an iterative alternating least-squares optimization procedure. The procedure simultaneously performs two tasks: data reduction (assigning data points to clusters) and feature identification (identifying the features associated with each cluster). By explicitly assigning features, BMD produces interpretable descriptions of the resulting clusters. In addition, by iterative feature identification, BMD performs an implicit adaptive feature selection at each iteration and flexibly measures the distances between data points. Therefore it works well for high-dimensional data.

The second contribution of this paper is the presentation of a unified view of binary data clustering, obtained by examining the connections among various clustering criteria. In particular, we show the equivalence among the matrix decomposition, dissimilarity coefficient, minimum description length and entropy-based approaches.

2 BMD Clustering

In this section, we describe the new clustering algorithm. Section 2.1 introduces the cluster model. Section 2.2 and Section 2.3 present the optimization procedure and the refining methods, respectively. Section 2.4 gives an example to illustrate the algorithm.

2.1 The Clustering Model Suppose the dataset X has n instances, each having r features. Then X can be viewed as a subset of R^r as well as a member of R^{n×r}. The cluster model is determined by two matrices: the data matrix D_{n×K} = (d_ik) and the feature matrix F_{r×K} = (f_jk), where K is the number of clusters.

    d_ik = 1 if data point i belongs to cluster k, and 0 otherwise;
    f_jk = 1 if attribute j belongs to cluster k, and 0 otherwise.
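To make the model concrete, the following NumPy sketch (ours, not from the paper) builds a small pair of binary matrices D and F and forms the approximation DF^T that the objective below compares against the data; the matrices are arbitrary toy values and the variable names are our own.

    import numpy as np

    # Toy model: n = 4 data points, r = 5 features, K = 2 clusters.
    # Each row of D has exactly one 1 (the cluster of that data point);
    # each column of F marks the features assigned to that cluster.
    D = np.array([[1, 0],
                  [1, 0],
                  [0, 1],
                  [0, 1]])                  # n x K data matrix
    F = np.array([[1, 0],
                  [1, 0],
                  [0, 1],
                  [0, 1],
                  [0, 0]])                  # r x K feature matrix (last feature unassigned)

    X = np.array([[1, 1, 0, 0, 1],
                  [1, 1, 0, 0, 0],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 1, 1]])         # a binary dataset to approximate

    X_hat = D @ F.T                         # n x r approximation of X
    error = 0.5 * np.sum((X - X_hat) ** 2)  # (1/2) ||X - D F^T||_F^2, cf. Equation (2.1) below
    print(X_hat)
    print(error)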



The data (respectively, feature) matrix specifies the cluster memberships of the corresponding data points (respectively, features).

For clustering, it is customary to assume that each data point is assigned to one and only one cluster, i.e., ∑_{k=1}^K d_ik = 1 holds for i = 1, ..., n. Given a representation (D, F), D denotes the cluster assignments of the data points and F indicates the feature representations of the clusters. The ij-th entry of DF^T is the dot product of the i-th row of D and the j-th row of F, and indicates whether the j-th feature is present in the i-th instance. Hence, DF^T can be interpreted as an approximation of the original data X. Our goal is then to find a pair (D, F) that minimizes the squared error between X and its approximation DF^T:

(2.1)    argmin_{D,F} O = (1/2) ||X − DF^T||_F^2,

where ||X||_F is the Frobenius norm of the matrix X, i.e., sqrt(∑_{i,j} x_ij^2). With this formulation, we transform the data clustering problem into the computation of the D and F that minimize the criterion O.

2.2 Optimization Procedure The objective criterion can be expressed as

         O(D, F) = (1/2) ||X − DF^T||_F^2
                 = (1/2) ∑_{i=1}^n ∑_{j=1}^r ( x_ij − ∑_{k=1}^K d_ik f_kj )^2
(2.2)            = (1/2) ∑_{i=1}^n ∑_{k=1}^K d_ik ∑_{j=1}^r (x_ij − f_kj)^2
                 = (1/2) ∑_{i=1}^n ∑_{k=1}^K d_ik ∑_{j=1}^r (x_ij − y_kj)^2 + (1/2) ∑_{k=1}^K n_k ∑_{j=1}^r (y_kj − f_kj)^2,

where y_kj = (1/n_k) ∑_{i=1}^n d_ik x_ij and n_k = ∑_{i=1}^n d_ik (note that we use f_kj to denote an entry of F^T). The objective function can be minimized via an alternating least-squares procedure that optimizes one of D and F while fixing the other.

Given an estimate of F, new least-squares estimates of the entries of D are obtained by assigning each data point to the closest cluster:

(2.3)    d̂_ik = 1 if ∑_{j=1}^r (x_ij − f_kj)^2 < ∑_{j=1}^r (x_ij − f_lj)^2 for all l = 1, ..., K, l ≠ k, and 0 otherwise.

When D is fixed, O(D, F) can be minimized with respect to F by minimizing the second part of Equation (2.2):

         O'(F) = (1/2) ∑_{k=1}^K n_k ∑_{j=1}^r (y_kj − f_kj)^2.

Note that y_kj can be thought of as the probability that the j-th feature is present in the k-th cluster. Since each f_kj is binary, i.e., either 0 or 1 (if the entries of F were arbitrary, the optimization here could instead be performed via singular value decomposition), O'(F) is minimized by:

(2.4)    f̂_kj = 1 if y_kj > 1/2, and 0 otherwise.

In practice, if a feature has a similar association to all clusters, then it is viewed as an outlier at the current stage.

The optimization procedure for minimizing Equation (2.2) alternates between updating D using Equation (2.3) and assigning features using Equation (2.4). After each iteration, we compute the value of the objective criterion O(D, F). If the value has decreased, we repeat the process; otherwise, the process has arrived at a local minimum. Since the BMD procedure monotonically decreases the objective criterion, it converges to a local optimum. The clustering procedure is shown in Algorithm 1.

Algorithm 1 BMD: clustering procedure
Input: data points X_{n×r}, number of classes K
Output: D: cluster assignment; F: feature assignment
begin
  1. Initialization:
     1.1 Initialize D
     1.2 Compute F based on Equation (2.4)
     1.3 Compute O_0 = O(D, F)
  2. Iteration:
     begin
       2.1 Update D given F (via Equation (2.3))
       2.2 Compute F given D (via Equation (2.4))
       2.3 Compute O_1 = O(D, F)
       2.4 if O_1 < O_0
           2.4.1 O_0 = O_1
           2.4.2 Repeat from 2.1
       2.5 else
           2.5.1 break (converged)
     end
  3. Return D, F
end
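As a minimal illustration of Algorithm 1 (a sketch of ours, not the authors' implementation; the function names and the max_iter safeguard are our own), the updates of Equations (2.3) and (2.4) and the stopping test on the objective can be written in NumPy as follows.

    import numpy as np

    def objective(X, D, F):
        # O(D, F) = (1/2) ||X - D F^T||_F^2
        return 0.5 * np.sum((X - D @ F.T) ** 2)

    def update_D(X, F):
        # Equation (2.3): assign each data point to the cluster whose
        # feature vector (a column of F) is closest in squared distance.
        dists = ((X[:, None, :] - F.T[None, :, :]) ** 2).sum(axis=2)   # n x K
        D = np.zeros_like(dists, dtype=int)
        D[np.arange(X.shape[0]), dists.argmin(axis=1)] = 1
        return D

    def update_F(X, D):
        # Equation (2.4): y_kj is the fraction of points in cluster k with
        # feature j present; set f_kj = 1 exactly when y_kj > 1/2.
        n_k = D.sum(axis=0)                                  # cluster sizes
        Y = (D.T @ X) / np.maximum(n_k, 1)[:, None]          # K x r matrix of y_kj
        return (Y > 0.5).astype(int).T                       # r x K feature matrix

    def bmd(X, D_init, max_iter=100):
        D = D_init
        F = update_F(X, D)
        best = objective(X, D, F)                            # O_0 in step 1.3
        for _ in range(max_iter):
            D = update_D(X, F)                               # step 2.1
            F = update_F(X, D)                               # step 2.2
            value = objective(X, D, F)                       # step 2.3
            if value < best:                                 # step 2.4
                best = value
            else:
                break                                        # step 2.5: converged
        return D, F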
2.3 Refining Methods Clustering results are sensitive to the initial seed points. The initialization step sets the initial values of D and F. Since D is a binary matrix with at most one occurrence of 1 in each row, it is very sensitive to the initial assignments. To overcome this sensitivity, we refine the procedure. The idea is to use mutual information to measure the similarity between a pair
of clustering results. In addition, clustering a large data set may be time-consuming. To speed up the algorithm, a small set of data points, for example 1% of the entire data set, may be selected as a bootstrap data set. The clustering algorithm is first executed on the bootstrap data set. Then the algorithm is run on the entire data set using the data assignments obtained from the bootstrap data set (instead of using random seed points).

2.4 An Example To illustrate how BMD works, we show an artificial example, a dataset consisting of six sentences from two clusters, user interaction and distributed systems, as shown in Figure 1.

    1(1) An system for user response
    2(1) A survey of user interaction on computer response
    3(1) Response for interaction
    4(2) A multi-user distributed system
    5(2) A survey of distributed computer system
    6(2) distributed systems

Figure 1: The six example sentences. The numbers within the parentheses are the clusters: 1 = user interaction, 2 = distributed system.

After preprocessing, we get the dataset shown in Table 1.

                     feature
    data point   a  b  c  d  e  f  g
        1        1  1  0  0  1  0  0
        2        1  1  1  1  0  0  1
        3        1  0  1  0  0  0  0
        4        0  1  0  0  1  1  0
        5        0  0  0  1  1  1  1
        6        0  0  0  1  0  1  0

Table 1: A bag-of-words representation of the sentences. a, b, c, d, e, f, g correspond to the presence of response, user, interaction, computer, system, distributed and survey, respectively.

In this example, D is a 6 × 2 matrix and F is a 7 × 2 matrix. Initially, data points 2 and 5 are chosen as seed points, where data point 2 is in class 1 and data point 5 is in class 2. Initialization is then performed on the seed points to obtain the initial feature assignments. After Step 1.2, features a, b and c are positive in class 1, e and f are positive in class 2, and d and g are outliers. In other words, F(a,1) = F(b,1) = F(c,1) = 1, F(e,2) = F(f,2) = 1, and all the other entries of F are 0 (we use a, b, c, d, e, f, g to denote the rows of F). Then Step 2.1 assigns data points 1, 2 and 3 to class 1 and data points 4, 5 and 6 to class 2, and Step 2.2 asserts that a, b and c are positive in class 1, d, e and f are positive in class 2, and g is an outlier. In the next iteration, the objective criterion does not change, and at this point the algorithm stops. The resulting clusters are: for data points, class 1 contains 1, 2 and 3 and class 2 contains 4, 5 and 6; for features, a, b and c are positive in class 1, d, e and f are positive in class 2, while g is an outlier.
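For concreteness, the example can be run through the sketch given after Algorithm 1 (again ours, not the paper's code). The matrix below is Table 1; the initial D places data point 2 in class 1 and data point 5 in class 2 and, mirroring the description above, leaves the other rows unassigned until the first update. With this seeding the iteration should end with data points {1, 2, 3} in class 1, {4, 5, 6} in class 2, features a, b, c assigned to class 1, d, e, f to class 2, and g assigned to neither class.

    import numpy as np

    # Rows are data points 1-6; columns are features a-g
    # (response, user, interaction, computer, system, distributed, survey).
    X = np.array([[1, 1, 0, 0, 1, 0, 0],
                  [1, 1, 1, 1, 0, 0, 1],
                  [1, 0, 1, 0, 0, 0, 0],
                  [0, 1, 0, 0, 1, 1, 0],
                  [0, 0, 0, 1, 1, 1, 1],
                  [0, 0, 0, 1, 0, 1, 0]])

    # Seeds: data point 2 (row index 1) in class 1, data point 5 (row index 4) in class 2.
    D0 = np.zeros((6, 2), dtype=int)
    D0[1, 0] = 1
    D0[4, 1] = 1

    D, F = bmd(X, D0)   # bmd, update_D, update_F as sketched in Section 2.2
    print(D)            # data assignments: points 1-3 -> class 1, points 4-6 -> class 2
    print(F)            # feature assignments: a,b,c -> class 1; d,e,f -> class 2; g -> neither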
We have conducted experiments on real datasets to evaluate the performance of our BMD algorithm and compare it with other standard clustering algorithms. The experimental results suggest that BMD is a viable and competitive binary clustering algorithm. Due to space limits, we omit the experimental details.

3 Binary Data Clustering

In this section, a unified view of binary data clustering is presented by examining the relations among various binary clustering approaches. Section 3.1 sets down the notation; Section 3.2, Section 3.3 and Section 3.4 discuss binary dissimilarity coefficients, minimum description length, and the entropy-based approach, respectively. The unified view of binary clustering is summarized in Figure 2. Note that the relations of the maximum likelihood principle with the entropy-based criterion and with minimum description length (MDL) are known in the machine learning literature [8].

[Figure 2 diagram: its nodes are Binary Matrix Decomposition, Encoding D and F, Distance Definition, Minimum Description Length (MDL), Dissimilarity Coefficients, Likelihood and Encoding Code Length, Maximum Likelihood, Generalized Entropy, Bernoulli Mixture, and Entropy Criterion.]

Figure 2: A Unified View of Binary Clustering. The thick lines are relations first shown in this paper, the dotted lines are well-known facts, and the thin line is first discussed in [7].

3.1 Notation We first set down some notation. Suppose that a set of n r-dimensional binary data vectors, X, represented as an n × r matrix (x_ij), is partitioned into K classes C = (C_1, ..., C_K), and that we want the points within each class to be similar to each other. We view C as a partition of the indices {1, ..., n}. So, for all i, 1 ≤ i ≤ n, and k, 1 ≤ k ≤ K, we write i ∈ C_k to mean that the i-th vector belongs to the k-th class. Let N = nr. For each k, 1 ≤ k ≤ K, let n_k = ||C_k||, N_k = n_k r, and for each j, 1 ≤ j ≤ r, let N_{j,k,1} = ∑_{i∈C_k} x_ij and N_{j,k,0} = n_k − N_{j,k,1}. Also, for each j, 1 ≤ j ≤ r, let
N_{j,1} = ∑_{i=1}^n x_ij and N_{j,0} = n − N_{j,1}. We use x_i as a point variable.

3.2 Binary Dissimilarity Coefficients A popular partition-based (within-cluster) criterion for clustering is to minimize the sum of the distances/dissimilarities inside each cluster. The within-cluster criterion can be described as minimizing

(3.5)    S(C) = ∑_{k=1}^K (1/n_k) ∑_{i,i'∈C_k} δ(x_i, x_{i'}),

or

(3.6)    S(C) = ∑_{k=1}^K ∑_{i,i'∈C_k} δ(x_i, x_{i'}),

where δ(x_i, x_{i'}) is the distance measure between x_i and x_{i'} (Equation (3.5) computes the weighted sum using the cluster sizes). For binary clustering, dissimilarity coefficients are popular measures of the distances.

3.2.1 Various Coefficients Given two binary data points w and w', there are four fundamental quantities that can be used to define the similarity between the two [1]: a = ||{j | w_j = w'_j = 1}||, b = ||{j | w_j = 1 ∧ w'_j = 0}||, c = ||{j | w_j = 0 ∧ w'_j = 1}||, and d = ||{j | w_j = w'_j = 0}||, where 1 ≤ j ≤ r. It has been shown in [1] that a presence/absence based dissimilarity measure (i.e., one satisfying a set of axioms such as non-negativity, range in [0,1], and rationality with numerator and denominator linear and symmetric [1]) can generally be written as D(a, b, c, d) = (b+c)/(αa + b + c + βd), where α > 0 and β ≥ 0. Dissimilarity measures can be transformed into similarity functions by simple transformations such as adding 1 and inverting, or dividing by 2 and subtracting from 1 [6]. If the joint absence of attributes is ignored, i.e., β is set to 0, then the binary dissimilarity measure can generally be written as D(a, b, c, d) = (b+c)/(αa + b + c), where α > 0.

In clustering applications, the ranking induced by a dissimilarity coefficient is often of more interest than its actual value. It has been shown [1] that, if the paired absences are ignored in the calculation of the dissimilarity values, then there is only one dissimilarity coefficient modulo global order equivalence: (b+c)/(a+b+c). Our following discussion is therefore based on this single dissimilarity coefficient.
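A small sketch (ours) of the quantities a, b, c, d and of the general coefficient (b+c)/(αa+b+c+βd); with α = 1 and β = 0 it reduces to the single coefficient (b+c)/(a+b+c) discussed above.

    import numpy as np

    def binary_dissimilarity(w, w_prime, alpha=1.0, beta=0.0):
        w, w_prime = np.asarray(w), np.asarray(w_prime)
        a = np.sum((w == 1) & (w_prime == 1))   # joint presences
        b = np.sum((w == 1) & (w_prime == 0))   # present only in w
        c = np.sum((w == 0) & (w_prime == 1))   # present only in w'
        d = np.sum((w == 0) & (w_prime == 0))   # joint absences
        return (b + c) / (alpha * a + b + c + beta * d)

    # Data points 1 and 3 of Table 1: a = 1, b = 2, c = 1, so
    # (b+c)/(a+b+c) = 3/4 = 0.75.
    print(binary_dissimilarity([1, 1, 0, 0, 1, 0, 0], [1, 0, 1, 0, 0, 0, 0]))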
irrespective of the number of 1’s and 0’s. Hence, minimizing
3.2.2 BMD and Dissimilarity Coefficients Given repre- L(X, D, F) reduces to minimizing L(X|DF T ).
sentation (D, F), basically, D denotes the assignments of Use X̂ to denote the generated data matrix by D and F.
data points associated into clusters and F indicates the fea- For all i, 1 ≤ i ≤ n, j, 1 ≤ j ≤ p, b ∈ {0, 1}, and c ∈ {0, 1},
we consider p(xi j = b | x̂i j (D, F) = c), the probability of the
original data Wi j = b conditioned upon the generated data
3 Equation (3.5) computes the weighted sum using the cluster sizes. (x̂)i j , via DF T , is c. Note that
4 Basically, the presence/absence based dissimilarity measure satisfies

a set of axioms such as non-negative, range in [0,1], rationality whose Nbc


numerator and denominator are linear and symmetric, etc. [1]. p(xi j = b | X̂i j (D, F) = c) = .
N.c



Here N_bc is the number of elements of X which have value b where the corresponding value of X̂ is c, and N_·c is the number of elements of X̂ which have value c. Then the code length L(X, D, F) is

         L(X, D, F) = − ∑_{b,c} N_bc log p(x_ij = b | x̂_ij(D, F) = c)
                    = − nr ∑_{b,c} (N_bc / nr) log (N_bc / N_·c)
                    = nr H(X | X̂(D, F)).

So minimizing the coding length is equivalent to minimizing the conditional entropy. Denote p_bc = p(x_ij = b | x̂_ij(D, F) = c). We wish to find the probability vector p = (p_00, p_01, p_10, p_11) that minimizes

(3.8)    H(X | X̂(D, F)) = − ∑_{b,c∈{0,1}} p_bc log p_bc.

Since −p_bc log p_bc ≥ 0, with equality holding at p_bc = 0 or 1, the only probability vectors which minimize H(X | X̂(D, F)) are those with p_bc = 1 for some (b, c) and p_{b'c'} = 0 for (b', c') ≠ (b, c). Since X̂ is an approximation of X, it is natural to require that p_00 and p_11 be close to 1 and p_01 and p_10 be close to 0. This is equivalent to minimizing the mismatches between X and X̂, i.e., minimizing O(D, F) = (1/2) ||X − DF^T||_F^2.
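The relation between code length and mismatches can be checked numerically. The sketch below (ours, not the authors') estimates the counts N_bc from a data matrix X and a reconstruction X̂ and computes the conditional entropy that nr H(X | X̂) turns into a code length; a perfect reconstruction needs no extra bits, while an uninformative one does.

    import numpy as np

    def conditional_entropy(X, X_hat):
        # H(X | X_hat) = - sum_{b,c} (N_bc / N) log2 (N_bc / N_.c)
        H = 0.0
        for c in (0, 1):
            mask = (X_hat == c)
            N_c = mask.sum()
            for b in (0, 1):
                N_bc = np.sum((X == b) & mask)
                if N_bc > 0:
                    H -= (N_bc / X.size) * np.log2(N_bc / N_c)
        return H          # code length L(X | D F^T) is X.size * H bits

    X = np.array([[1, 0, 1],
                  [0, 1, 0]])
    print(conditional_entropy(X, X))                 # 0.0: perfect reconstruction
    print(conditional_entropy(X, np.ones_like(X)))   # 1.0: X not predictable from X_hat
    print(conditional_entropy(X, 1 - X))             # also 0.0: deterministic but flipped,
    # which is why the text additionally requires p_00 and p_11 to be the probabilities near 1.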
3.4 Entropy-Based Approach

3.4.1 Classical Entropy Criterion The classical clustering criteria [3, 4] search for a partition C that maximizes the following quantity O(C):

         O(C) = ∑_{k=1}^K ∑_{j=1}^r ∑_{t=0}^1 (N_{j,k,t} / N) log ( N N_{j,k,t} / (N_k N_{j,t}) )
(3.9)         = ∑_{k=1}^K ∑_{j=1}^r ∑_{t=0}^1 (N_{j,k,t} / N) ( log (N_{j,k,t} / n_k) − log (N_{j,t} / n) )
              = (1/r) ( Ĥ(X) − (1/n) ∑_{k=1}^K n_k Ĥ(C_k) ).

Observe that (1/n) ∑_{k=1}^K n_k Ĥ(C_k) is the entropy measure of the partition, i.e., the weighted sum of each cluster's entropy. This leads to the following criterion: given a dataset, Ĥ(X) is fixed, so maximizing O(C) is equivalent to minimizing the expected entropy of the partition:

(3.10)   (1/n) ∑_{k=1}^K n_k Ĥ(C_k).

3.4.2 Entropy and Dissimilarity Coefficients Now examine the within-cluster criterion in Equation (3.5). We have

         S(C) = ∑_{k=1}^K (1/n_k) ∑_{i,i'∈C_k} δ(x_i, x_{i'})
              = ∑_{k=1}^K (1/n_k) ∑_{i,i'∈C_k} (1/r) ∑_{j=1}^r |x_ij − x_{i'j}|
              = (1/r) ∑_{k=1}^K ∑_{j=1}^r n_k ρ_k^{(j)} (1 − ρ_k^{(j)}),

where, for each k, 1 ≤ k ≤ K, and each j, 1 ≤ j ≤ r, ρ_k^{(j)} is the probability that the j-th attribute is 1 in C_k. Using the generalized entropy of degree s defined in [5], H^s(Q) = (2^{1−s} − 1)^{−1} (∑_{i=1}^n q_i^s − 1), with s = 2, i.e., H^2(Q) = −2 (∑_{i=1}^n q_i^2 − 1), we have

         (1/n) ∑_{k=1}^K n_k Ĥ(C_k)
              = − (1/(2n)) ∑_{k=1}^K ∑_{j=1}^r n_k ( (ρ_k^{(j)})^2 + (1 − ρ_k^{(j)})^2 − 1 )
              = (1/n) ∑_{k=1}^K ∑_{j=1}^r n_k ρ_k^{(j)} (1 − ρ_k^{(j)}) = (r/n) S(C).
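As a quick numerical check (ours) of this identity, the two sides can be compared on the Table 1 partition, reading the formulas above as follows: δ is the Hamming distance divided by r, the inner sum in S(C) runs over unordered pairs within a cluster, and Ĥ(C_k) is the degree-2 entropy term summed over features as in the derivation.

    import numpy as np
    from itertools import combinations

    X = np.array([[1, 1, 0, 0, 1, 0, 0],   # Table 1
                  [1, 1, 1, 1, 0, 0, 1],
                  [1, 0, 1, 0, 0, 0, 0],
                  [0, 1, 0, 0, 1, 1, 0],
                  [0, 0, 0, 1, 1, 1, 1],
                  [0, 0, 0, 1, 0, 1, 0]])
    clusters = [[0, 1, 2], [3, 4, 5]]       # the partition found in Section 2.4
    n, r = X.shape

    # Within-cluster criterion (3.5): delta = Hamming distance / r, unordered pairs.
    S = sum(sum(np.abs(X[i] - X[ip]).sum() / r
                for i, ip in combinations(C, 2)) / len(C)
            for C in clusters)

    # Expected generalized entropy (s = 2) of the partition.
    H = 0.0
    for C in clusters:
        rho = X[C].mean(axis=0)                            # rho_k^(j) for each feature j
        H_k = -0.5 * np.sum(rho**2 + (1 - rho)**2 - 1)
        H += len(C) * H_k / n

    print(H, r * S / n)    # both sides should agree (1.0 for this partition)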

References

[1] F. B. Baulieu. Two variant axiom systems for presence/absence based dissimilarity coefficients. Journal of Classification, 14(1):159–170, 1997.
[2] R. A. Baxter and J. J. Oliver. MDL and MML: similarities and differences. TR 207, Monash University, 1994.
[3] H.-H. Bock. Probabilistic aspects in cluster analysis. In Conceptual and Numerical Analysis of Data, pages 12–44, 1989.
[4] G. Celeux and G. Govaert. Clustering criteria for discrete data and latent class models. Journal of Classification, 8(2):157–176, 1991.
[5] J. Havrda and F. Charvat. Quantification method of classification processes: Concept of structural a-entropy. Kybernetika, 3:30–35, 1967.
[6] N. Jardine and R. Sibson. Mathematical Taxonomy. John Wiley & Sons, 1971.
[7] T. Li, S. Ma, and M. Ogihara. Entropy-based criterion in categorical clustering. In ICML, pages 536–543, 2004.
[8] T. M. Mitchell. Machine Learning. The McGraw-Hill Companies, Inc., 1997.
[9] J. J. Oliver and R. A. Baxter. MML and Bayesianism: similarities and differences. TR 206, Monash University, 1994.
[10] J. Rissanen. Modeling by shortest data description. Automatica, 14:465–471, 1978.
[11] J. Rissanen. Stochastic Complexity in Statistical Inquiry. World Scientific Press, Singapore, 1989.

