Abstract

Clustering is the problem of identifying the distribution of patterns and intrinsic correlations in large data sets by partitioning the data points into similarity classes. This paper studies the problem of clustering binary data. This is the case for market basket datasets, where the transactions contain items, and for document datasets, where the documents contain "bags of words". The contribution of the paper is two-fold. First, a new clustering model is presented. The model treats the data and the features equally, based on their symmetric association relations, and explicitly describes the data assignments as well as the feature assignments. An iterative alternating least-squares procedure is used for optimization. Second, a unified view of binary data clustering is presented by examining the connections among various clustering criteria.

* School of Computer Science, Florida International University, taoli@cs.fiu.edu.
† NEC Labs America, Inc., zsh@sv.nec-labs.com. Major work was completed when the author was at the University of Rochester.

1 Introduction

The problem of clustering data arises in many disciplines and has a wide range of applications. Intuitively, clustering is the problem of partitioning a finite set of points in a multi-dimensional space into classes (called clusters) so that (i) the points belonging to the same class are similar and (ii) the points belonging to different classes are dissimilar.

In this paper, we focus our attention on binary datasets. Binary data occupy a special place in the domain of data analysis. Typical applications of binary data clustering include market basket data clustering and document clustering. For market basket data, each transaction can be represented as a binary vector in which each element indicates whether the corresponding item/product was purchased. For document clustering, each document can be represented as a binary vector in which each element indicates whether a given word/term is present.

The first contribution of the paper is the introduction of a new clustering model along with a clustering algorithm. A distinctive characteristic of binary data is that the features (attributes) they include have the same nature as the data they intend to account for: both are binary. This characteristic implies a symmetric association relation between data and features: if the set of data points is associated with the set of features, then the set of attributes is associated with the set of data points, and vice versa. The association relation suggests a new clustering model in which the data and the features are treated equally. Our new clustering model, BMD (Binary Matrix Decomposition), explicitly describes the data assignments (assigning data points to clusters) as well as the feature assignments (assigning features to clusters). The clustering problem is then presented as a binary matrix decomposition, which is solved via an iterative alternating least-squares optimization procedure. The procedure simultaneously performs two tasks: data reduction (assigning data points to clusters) and feature identification (identifying the features associated with each cluster). By explicit feature assignment, BMD produces interpretable descriptions of the resulting clusters. In addition, by iterative feature identification, BMD performs an implicit adaptive feature selection at each iteration and flexibly measures the distances between data points. Therefore it works well for high-dimensional data.

The second contribution of this paper is the presentation of a unified view of binary data clustering, obtained by examining the connections among various clustering criteria. In particular, we show the equivalence among the matrix decomposition, dissimilarity coefficient, minimum description length and entropy-based approaches.

2 BMD Clustering

In this section, we describe the new clustering algorithm. Section 2.1 introduces the cluster model. Section 2.2 and Section 2.3 present the optimization procedure and the refining methods, respectively. Section 2.4 gives an example to illustrate the algorithm.

2.1 The Clustering Model  Suppose the dataset X has n instances, each having r features. Then X can be viewed as a subset of R^r as well as a member of R^{n×r}. The cluster model is determined by two matrices: the data matrix D_{n×K} = (d_{ik}) and the feature matrix F_{r×K} = (f_{jk}), where K is the number of clusters, and

    d_{ik} = 1 if data point i belongs to cluster k, and 0 otherwise;
    f_{jk} = 1 if attribute j belongs to cluster k, and 0 otherwise.
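To make the two matrices concrete, here is a minimal NumPy sketch (the toy matrices are ours, not from the paper): when the data clusters and the feature clusters line up, the product D F^T reproduces the block structure of the binary dataset.

```python
import numpy as np

# Toy dataset: 4 points (rows) x 3 binary features (columns), K = 2 clusters.
X = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 0, 1],
              [0, 0, 1]])

# Data matrix D (n x K): d_ik = 1 iff point i belongs to cluster k (one 1 per row).
D = np.array([[1, 0],
              [1, 0],
              [0, 1],
              [0, 1]])

# Feature matrix F (r x K): f_jk = 1 iff feature j belongs to cluster k.
F = np.array([[1, 0],
              [1, 0],
              [0, 1]])

# Entry (i, j) of D @ F.T indicates whether feature j is expected in instance i,
# so D @ F.T approximates (here, exactly reconstructs) X.
approx = D @ F.T
```

On this toy input the reconstruction is exact; on real data the gap between X and D F^T is what the clustering procedure minimizes.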
The data (respectively, feature) matrix specifies the cluster memberships of the corresponding data points (respectively, features). For clustering, it is customary to assume that each data point is assigned to one and only one cluster, i.e., that ∑_{k=1}^K d_{ik} = 1 holds for i = 1, ..., n. Given a representation (D, F), D denotes the cluster assignments of the data points and F indicates the feature representations of the clusters. The ij-th entry of DF^T is the dot product of the i-th row of D and the j-th row of F, and indicates whether the j-th feature is present in the i-th instance. Hence, DF^T can be interpreted as an approximation of the original data X. Our goal is then to find a (D, F) that minimizes the squared error between X and its approximation DF^T:

(2.1)  argmin_{D,F} O = (1/2) ||X − DF^T||_F^2,

where ||X||_F is the Frobenius norm of the matrix X, i.e., sqrt(∑_{i,j} x_{ij}^2). With this formulation, we transform the data clustering problem into the computation of the D and F that minimize the criterion O.

2.2 Optimization Procedure  The objective criterion can be expressed as

       O_{D,F} = (1/2) ||X − DF^T||_F^2
              = (1/2) ∑_{i=1}^n ∑_{j=1}^r ( x_{ij} − ∑_{k=1}^K d_{ik} f_{kj} )^2
(2.2)         = (1/2) ∑_{i=1}^n ∑_{k=1}^K d_{ik} ∑_{j=1}^r (x_{ij} − f_{kj})^2
              = (1/2) ∑_{i=1}^n ∑_{k=1}^K d_{ik} ∑_{j=1}^r (x_{ij} − y_{kj})^2 + (1/2) ∑_{k=1}^K n_k ∑_{j=1}^r (y_{kj} − f_{kj})^2,

where y_{kj} = (1/n_k) ∑_{i=1}^n d_{ik} x_{ij} and n_k = ∑_{i=1}^n d_{ik} (note that we use f_{kj} to denote an entry of F^T). The objective function can be minimized via an alternating least-squares procedure, optimizing one of D and F while fixing the other.

Given an estimate of F, new least-squares estimates of the entries of D can be determined by assigning each data point to the closest cluster as follows:

(2.3)  d̂_{ik} = 1 if ∑_{j=1}^r (x_{ij} − f_{kj})^2 < ∑_{j=1}^r (x_{ij} − f_{lj})^2 for all l = 1, ..., K, l ≠ k, and 0 otherwise.

When D is fixed, O_{D,F} can be minimized with respect to F by minimizing the second part of Equation (2.2):

       O′(F) = (1/2) ∑_{k=1}^K n_k ∑_{j=1}^r (y_{kj} − f_{kj})^2.

Note that y_{kj} can be thought of as the probability that the j-th feature is present in the k-th cluster. Since each f_{kj} is binary^1, i.e., either 0 or 1, O′(F) is minimized by:

(2.4)  f̂_{kj} = 1 if y_{kj} > 1/2, and 0 otherwise.

In practice, if a feature has similar association to all clusters, then it is viewed as an outlier at the current stage.

The optimization procedure for minimizing Equation (2.2) alternates between updating D based on Equation (2.3) and assigning features using Equation (2.4). After each iteration, we compute the value of the objective criterion O(D, F). If the value has decreased, we repeat the process; otherwise, the process has arrived at a local minimum. Since the BMD procedure monotonically decreases the objective criterion, it converges to a local optimum. The clustering procedure is shown in Algorithm 1.

Algorithm 1 BMD: clustering procedure
Input: data points X_{n×r}, number of classes K
Output: D: cluster assignment; F: feature assignment
begin
  1. Initialization:
     1.1 Initialize D
     1.2 Compute F based on Equation (2.4)
     1.3 Compute O0 = O(D, F)
  2. Iteration:
     begin
     2.1 Update D given F (via Equation (2.3))
     2.2 Compute F given D (via Equation (2.4))
     2.3 Compute the value of O1 = O(D, F)
     2.4 if O1 < O0
         2.4.1 O0 = O1
         2.4.2 Repeat from 2.1
     2.5 else
         2.5.1 break (converged)
     end
  3. Return D, F
end

2.3 Refining Methods  Clustering results are sensitive to initial seed points. The initialization step sets the initial values for D and F. Since D is a binary matrix with at most one occurrence of 1 in each row, it is very sensitive to the initial assignments. To overcome this sensitivity of initialization, we refine the procedure. The idea is to use mutual information to measure the similarity between a pair of clustering results.

^1 If the entries of F were arbitrary, the optimization here would simply set f_{kj} = y_{kj}.
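Algorithm 1 can be sketched in NumPy as follows. This is our sketch, not the authors' code: tie-breaking in the nearest-cluster assignment is naive (first minimum wins), and the masking of features that are positively associated with every cluster is our reading of the remark that such features are treated as outliers.

```python
import numpy as np

def bmd(X, D_init, max_iter=100):
    """Sketch of the BMD alternating least-squares loop (Algorithm 1)."""
    K = D_init.shape[1]

    def update_F(D):
        # y_kj: fraction of cluster k's points having feature j (input to Eq. 2.4).
        nk = np.maximum(D.sum(axis=0), 1)          # guard against empty clusters
        Y = (D.T @ X) / nk[:, None]
        pos = Y > 0.5                               # Eq. (2.4)
        # A feature similarly associated with all clusters is set aside as an outlier.
        pos &= ~pos.all(axis=0)
        return pos.astype(int).T                    # F is r x K

    def objective(D, F):
        return 0.5 * np.sum((X - D @ F.T) ** 2)

    D = D_init.copy()
    F = update_F(D)                                 # Step 1.2
    O0 = objective(D, F)                            # Step 1.3
    for _ in range(max_iter):
        # Eq. (2.3): reassign each point to the nearest cluster representative.
        dists = ((X[:, None, :] - F.T[None, :, :]) ** 2).sum(axis=2)  # n x K
        D = np.eye(K, dtype=int)[dists.argmin(axis=1)]
        F = update_F(D)
        O1 = objective(D, F)
        if O1 < O0:                                 # Step 2.4
            O0 = O1
        else:
            break                                   # converged
    return D, F

# The six-sentence dataset of Section 2.4 (Table 1), seeded with points 2 and 5.
X = np.array([
    [1, 1, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 0, 0, 1],
    [1, 0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 1, 0],
])
D0 = np.zeros((6, 2), dtype=int)
D0[1, 0] = 1    # point 2 -> class 1
D0[4, 1] = 1    # point 5 -> class 2
D, F = bmd(X, D0)
```

Under these assumptions the loop recovers the partition described in Section 2.4: points 1–3 in class 1, points 4–6 in class 2, features a, b, c positive in class 1 and d, e, f in class 2, with g assigned to neither.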
In addition, clustering a large data set may be time-consuming. To speed up the algorithm, a small set of data points, for example 1% of the entire data set, may be selected as a bootstrap data set. The clustering algorithm is first executed on the bootstrap data set. Then the algorithm is run on the entire data set using the data assignments obtained from the bootstrap data set (instead of using random seed points).

2.4 An Example  To illustrate how BMD works, we show an artificial example: a dataset consisting of six sentences from two clusters, user interaction and distributed systems, as shown in Figure 1.

1(1) A system for user response
2(1) A survey of user interaction on computer response
3(1) Response for interaction
4(2) A multi-user distributed system
5(2) A survey of distributed computer system
6(2) Distributed systems

Figure 1: The six example sentences. The numbers within the parentheses are the clusters: 1 = user interaction, 2 = distributed system.

data point | a  b  c  d  e  f  g
     1     | 1  1  0  0  1  0  0
     2     | 1  1  1  1  0  0  1
     3     | 1  0  1  0  0  0  0
     4     | 0  1  0  0  1  1  0
     5     | 0  0  0  1  1  1  1
     6     | 0  0  0  1  0  1  0

Table 1: A bag-of-words representation of the sentences. a, b, c, d, e, f, g correspond to the presence of response, user, interaction, computer, system, distributed and survey, respectively.

After preprocessing, we get the dataset shown in Table 1. In this example, D is a 6 × 2 matrix and F is a 7 × 2 matrix. Initially, data points 2 and 5 are chosen as seed points, where data point 2 is in class 1 and data point 5 is in class 2. Initialization is then performed on the seed points to get the initial feature assignments. After Step 1.2, features a, b and c are positive in class 1, e and f are positive in class 2, and d and g are outliers. In other words, F(a, 1) = F(b, 1) = F(c, 1) = 1, F(e, 2) = F(f, 2) = 1, and all the other entries^2 of F are 0. Then Step 2.1 assigns data points 1, 2 and 3 to class 1 and data points 4, 5 and 6 to class 2, and Step 2.2 asserts that a, b and c are positive in class 1, d, e and f are positive in class 2, and g is an outlier. In the next iteration, the objective criterion does not change, and at this point the algorithm stops. The resulting clusters are: for the data points, class 1 contains 1, 2 and 3, and class 2 contains 4, 5 and 6; for the features, a, b and c are positive in class 1, d, e and f are positive in class 2, while g is an outlier.

We have conducted experiments on real datasets to evaluate the performance of our BMD algorithm and to compare it with other standard clustering algorithms. The experimental results suggest that BMD is a viable and competitive binary clustering algorithm. Due to the space limit, we omit the experiment details.

3 Binary Data Clustering

In this section, a unified view of binary data clustering is presented by examining the relations among various binary clustering approaches. Section 3.1 sets down the notation; Section 3.2, Section 3.3 and Section 3.4 discuss the binary dissimilarity coefficients, minimum description length, and the entropy-based approach, respectively. The unified view of binary clustering is summarized in Figure 2. Note that the relations of the maximum likelihood principle with the entropy-based criterion and with minimum description length (MDL) are known in the machine learning literature [8].

Figure 2: A Unified View on Binary Clustering. [Diagram with nodes Binary Matrix Decomposition, Dissimilarity Coefficients, Minimum Description Length (MDL), Maximum Likelihood, Bernoulli Mixture, Generalized Entropy and Entropy Criterion, with edges labeled "Encoding D and F", "Distance Definition", and "Likelihood and Encoding Code Length".] The thick lines are relations first shown in this paper, the dotted lines are well-known facts, and the thin line is first discussed in [7].

3.1 Notation  We first set down some notation. Suppose that a set of n r-dimensional binary data vectors, X, represented as an n × r matrix (x_{ij}), is partitioned into K classes C = (C_1, ..., C_K), and we want the points within each class to be similar to each other. We view C as a partition of the indices {1, ..., n}. So, for all i, 1 ≤ i ≤ n, and k, 1 ≤ k ≤ K, we write i ∈ C_k to mean that the i-th vector belongs to the k-th class. Let N = nr. For each k, 1 ≤ k ≤ K, let n_k = ||C_k||, N_k = n_k r, and for each j, 1 ≤ j ≤ r, let N_{j,k,1} = ∑_{i∈C_k} x_{ij} and N_{j,k,0} = n_k − N_{j,k,1}.

^2 We use a, b, c, d, e, f, g to denote the rows of F.
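As a quick concrete check of these counts, they can be computed on the Table 1 data with the partition C_1 = {1, 2, 3}, C_2 = {4, 5, 6} found in Section 2.4 (the variable names are ours):

```python
import numpy as np

# Table 1: rows are sentences 1-6, columns are features a-g.
X = np.array([
    [1, 1, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 0, 0, 1],
    [1, 0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 1, 0],
])
C = [[0, 1, 2], [3, 4, 5]]    # 0-based index sets for C_1 and C_2

n, r = X.shape
N = n * r                                          # N = nr
n_k = [len(c) for c in C]                          # cluster sizes n_k
N_jk1 = np.array([X[c].sum(axis=0) for c in C])    # N_{j,k,1}: 1-counts per feature
N_jk0 = np.array(n_k)[:, None] - N_jk1             # N_{j,k,0} = n_k - N_{j,k,1}
```

For instance, in C_1 feature a occurs in all three sentences (N_{a,1,1} = 3), while feature f occurs in none (N_{f,1,1} = 0).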
Also, for each j, 1 ≤ j ≤ r, let N_{j,1} = ∑_{i=1}^n x_{ij} and N_{j,0} = n − N_{j,1}. We use x_i as a point variable.

3.2 Binary Dissimilarity Coefficients  A popular partition-based (within-cluster) criterion for clustering is to minimize the sum of the distances/dissimilarities inside each cluster. The within-cluster criterion can be described as minimizing

(3.5)  S(C) = ∑_{k=1}^K (1/n_k) ∑_{i,i′∈C_k} δ(x_i, x_{i′}),

or^3

(3.6)  S(C) = ∑_{k=1}^K ∑_{i,i′∈C_k} δ(x_i, x_{i′}),

where δ(x_i, x_{i′}) is the distance measure between x_i and x_{i′}. For binary clustering, the dissimilarity coefficients are popular measures of the distances.

3.2.1 Various Coefficients  Given two binary data points w and w′, there are four fundamental quantities that can be used to define the similarity between the two [1]: a = ||{j | w_j = w′_j = 1}||, b = ||{j | w_j = 1 ∧ w′_j = 0}||, c = ||{j | w_j = 0 ∧ w′_j = 1}||, and d = ||{j | w_j = w′_j = 0}||, where 1 ≤ j ≤ r. It has been shown in [1] that the presence/absence based dissimilarity measures can generally^4 be written as D(a, b, c, d) = (b + c)/(αa + b + c + βd), where α > 0 and β ≥ 0. Dissimilarity measures can be transformed into similarity functions by simple transformations such as adding 1 and inverting, or dividing by 2 and subtracting from 1, etc. [6]. If the joint absence of attributes is ignored, i.e., β is set to 0, then the binary dissimilarity measures can generally be written as D(a, b, c, d) = (b + c)/(αa + b + c), where α > 0.

In clustering applications, the ranking induced by a dissimilarity coefficient is often of more interest than its actual value. It has been shown [1] that, if the paired absences are ignored in the calculation of dissimilarity values, then there is only one dissimilarity coefficient modulo global order equivalence: (b + c)/(a + b + c). Our following discussion is thus based on this single dissimilarity coefficient.

3.2.2 BMD and Dissimilarity Coefficients  Given a representation (D, F), D denotes the assignments of data points to clusters and F indicates the feature representations of the clusters. Observe that

       O(D, F) = (1/2) ||X − DF^T||_F^2
              = (1/2) ∑_{i,j} (x_{ij} − (DF^T)_{ij})^2
(3.7)         = (1/2) ∑_{k=1}^K ∑_{i∈C_k} ∑_j |x_{ij} − e_{kj}|^2
              = (1/2) ∑_{k=1}^K ∑_{i∈C_k} d(x_i, e_k),

where e_k = (f_{k1}, ..., f_{kr}), k = 1, ..., K, is the cluster "representative" of cluster C_k. Thus minimizing Equation (3.7) is the same as minimizing Equation (3.6) with the distance defined as d(x_i, e_k) = ∑_j |x_{ij} − e_{kj}|^2 = ∑_j |x_{ij} − e_{kj}| (the last equation holds since the x_{ij} and e_{kj} are all binary). In fact, given two binary vectors X and Y, ∑_i |X_i − Y_i| counts their mismatches (the numerator of their dissimilarity coefficient).

3.3 Minimum Description Length  Minimum description length (MDL) aims at searching for the model that provides the most compact encoding for data transmission [10] and is conceptually similar to minimum message length (MML) [9, 2] and stochastic complexity minimization [11]. In fact, the MDL approach is a Bayesian method: the code lengths and the code structure in the coding model are equivalent to the negative log probabilities and the probability structure assumptions in the Bayesian approach.

As described in Section 2, in BMD clustering the original matrix X can be approximated by the matrix product DF^T. Instead of encoding the elements of X alone, we then encode the model, D and F, and the data given the model, (X|DF^T). The overall code length is thus expressed as

       L(X, D, F) = L(D) + L(F) + L(X|DF^T).

In the Bayesian framework, L(D) and L(F) are the negative log priors of D and F, and L(X|DF^T) is the negative log likelihood of X given D and F. If we assume that the prior probabilities of all the elements of D and F are uniform (i.e., 1/2), then L(D) and L(F) are fixed given the dataset X. In other words, we need one bit to represent each element of D and F, irrespective of the numbers of 1's and 0's. Hence, minimizing L(X, D, F) reduces to minimizing L(X|DF^T).

Use X̂ to denote the data matrix generated by D and F. For all i, 1 ≤ i ≤ n, j, 1 ≤ j ≤ r, b ∈ {0, 1}, and c ∈ {0, 1}, we consider p(x_{ij} = b | x̂_{ij}(D, F) = c), the probability that the original data entry x_{ij} is b conditioned upon the generated entry x̂_{ij}, via DF^T, being c. Note that

       p(x_{ij} = b | x̂_{ij}(D, F) = c) = N_{bc} / N_{·c}.

Here N_{bc} is the number of elements of X which have value b where the corresponding value of X̂ is c, and N_{·c} is the number of elements of X̂ which have value c. Then the code length L(X, D, F) is

       L(X, D, F) = − ∑_{b,c} N_{bc} log p(x_{ij} = b | x̂_{ij}(D, F) = c)
                  = −nr ∑_{b,c} (N_{bc}/(nr)) log (N_{bc}/N_{·c})
                  = nr · H(X|X̂(D, F)).

So minimizing the coding length is equivalent to minimizing the conditional entropy. Denote p_{bc} = p(x_{ij} = b | x̂_{ij}(D, F) = c). We wish to find the probability vectors p = (p_{00}, p_{01}, p_{10}, p_{11}) that minimize

(3.8)  H(X|X̂(D, F)) = − ∑_{b,c∈{0,1}} p_{bc} log p_{bc}.

Since −p_{bc} log p_{bc} ≥ 0, with equality holding at p_{bc} = 0 or 1, the only probability vectors that minimize H(X|X̂(D, F)) are those with p_{bc} = 1 for some b, c and p_{b′c′} = 0 for (b′, c′) ≠ (b, c). Since X̂ is an approximation of X, it is natural to require that p_{00} and p_{11} be close to 1 and that p_{01} and p_{10} be close to 0. This is equivalent to minimizing the mismatches between X and X̂, i.e., minimizing O(D, F) = (1/2) ||X − DF^T||_F^2.

3.4 Entropy-Based Approach

3.4.1 Classical Entropy Criterion  The classical clustering criteria [3, 4] search for a partition C that maximizes the following quantity O(C):

       O(C) = ∑_{k=1}^K ∑_{j=1}^r ∑_{t=0}^1 (N_{j,k,t}/N) log( (N · N_{j,k,t}) / (N_k · N_{j,t}) )
(3.9)       = ∑_{k=1}^K ∑_{j=1}^r ∑_{t=0}^1 (N_{j,k,t}/N) ( log(N_{j,k,t}/n_k) − log(N_{j,t}/n) )
            = (1/r) ( Ĥ(X) − (1/n) ∑_{k=1}^K n_k Ĥ(C_k) ).

Observe that (1/n) ∑_{k=1}^K n_k Ĥ(C_k) is the entropy measure of the partition, i.e., the weighted sum of the clusters' entropies. This leads to the following criterion: given a dataset, Ĥ(X) is fixed, so maximizing O(C) is equivalent to minimizing the expected entropy of the partition:

(3.10)  (1/n) ∑_{k=1}^K n_k Ĥ(C_k).

3.4.2 Entropy and Dissimilarity Coefficients  Now examine the within-cluster criterion in Equation (3.5). We have:

       S(C) = ∑_{k=1}^K (1/n_k) ∑_{i,i′∈C_k} δ(x_i, x_{i′})
            = ∑_{k=1}^K (1/n_k) ∑_{i,i′∈C_k} ∑_{j=1}^r |x_{i,j} − x_{i′,j}|
            = ∑_{k=1}^K ∑_{j=1}^r n_k ρ_k^{(j)} (1 − ρ_k^{(j)}).

Here, for each k, 1 ≤ k ≤ K, and each j, 1 ≤ j ≤ r, ρ_k^{(j)} is the probability that the j-th attribute is 1 in C_k.

Using the generalized entropy^5 H^2(Q) = −2(∑_{i=1}^n q_i^2 − 1) defined in [5], we have

       (1/n) ∑_{k=1}^K n_k Ĥ(C_k)
            = −(1/(2n)) ∑_{k=1}^K ∑_{j=1}^r n_k ( (ρ_k^{(j)})^2 + (1 − ρ_k^{(j)})^2 − 1 )
            = (1/n) ∑_{k=1}^K ∑_{j=1}^r n_k ρ_k^{(j)} (1 − ρ_k^{(j)}) = (1/n) S(C).

Hence, under the generalized entropy, minimizing the expected entropy of the partition coincides, up to the factor 1/n, with minimizing the within-cluster criterion (3.5) based on the single dissimilarity coefficient.

^3 Equation (3.5) computes the weighted sum using the cluster sizes.
^4 Basically, the presence/absence based dissimilarity measures satisfy a set of axioms: non-negativity, range in [0, 1], rationality (numerator and denominator linear and symmetric), etc. [1].
^5 Note that H^s(Q) = (2^{(1−s)} − 1)^{−1} (∑_{i=1}^n q_i^s − 1).

References

[1] F. B. Baulieu. Two variant axiom systems for presence/absence based dissimilarity coefficients. Journal of Classification, 14(1):159–170, 1997.
[2] R. A. Baxter and J. J. Oliver. MDL and MML: similarities and differences. TR 207, Monash University, 1994.
[3] H.-H. Bock. Probabilistic aspects in cluster analysis. In Conceptual and Numerical Analysis of Data, pages 12–44, 1989.
[4] G. Celeux and G. Govaert. Clustering criteria for discrete data and latent class models. Journal of Classification, 8(2):157–176, 1991.
[5] J. Havrda and F. Charvat. Quantification method of classification processes: Concept of structural a-entropy. Kybernetika, 3:30–35, 1967.
[6] N. Jardine and R. Sibson. Mathematical Taxonomy. John Wiley & Sons, 1971.
[7] T. Li, S. Ma, and M. Ogihara. Entropy-based criterion in categorical clustering. In ICML, pages 536–543, 2004.
[8] T. M. Mitchell. Machine Learning. The McGraw-Hill Companies, Inc., 1997.
[9] J. J. Oliver and R. A. Baxter. MML and Bayesianism: similarities and differences. TR 206, Monash University, 1994.
[10] J. Rissanen. Modeling by shortest data description. Automatica, 14:465–471, 1978.
[11] J. Rissanen. Stochastic Complexity in Statistical Inquiry. World Scientific Press, Singapore, 1989.