Clustering With Constraints: Feasibility Issues and the k-Means Algorithm

Ian Davidson∗
Abstract Recent work has looked at extending the k-Means algorithm to incorporate background information in the form of instance level must-link and cannot-link constraints. We introduce two ways of specifying additional background information in the form of δ and constraints that operate on all instances but which can be interpreted as conjunctions or disjunctions of instance level constraints and hence are easy to implement. We present complexity results for the feasibility of clustering under each type of constraint individually and several types together. A key finding is that determining whether there is a feasible solution satisfying all constraints is, in general, NP-complete. Thus, an iterative algorithm such as k-Means should not try to find a feasible partitioning at each iteration. This motivates our derivation of a new version of the k-Means algorithm that minimizes the constrained vector quantization error but at each iteration does not attempt to satisfy all constraints. Using standard UCI datasets, we find that using constraints improves accuracy as others have reported, but we also show that our algorithm reduces the number of iterations until convergence. Finally, we illustrate these benefits and our new constraint types on a complex real world object identification problem using the infra-red detector on an Aibo robot. Keywords: k-Means clustering, constraints. 1 Introduction and Motivation The k-Means clustering algorithm is a ubiquitous technique in data mining due to its simplicity and ease of use (see for example, [6, 10, 16]). It is well known that k-Means converges to a local minimum of the vector quantization error and hence must be restarted many times, a computationally very expensive task when dealing with the large data sets typically found in data mining problems. Recent work has focused on the use of background
∗ Department of Computer Science, University at Albany - State University of New York, Albany, NY 12222. Email: davidson@cs.albany.edu. † Department of Computer Science, University at Albany - State University of New York, Albany, NY 12222. Email: ravi@cs.albany.edu.

S. S. Ravi†
information in the form of instance level must-link and cannot-link constraints. A must-link constraint enforces that two instances must be placed in the same cluster while a cannot-link constraint enforces that two instances must not be placed in the same cluster. We can divide previous work on clustering under constraints into two types: 1) Where the constraints help the algorithm learn a distortion/distance/objective function [4, 15] and 2) Where the constraints are used as “hints” to guide the algorithm to a useful solution [18, 19]. Philosophically, the first type of work makes the assumption that points surrounding a pair of mustlink/cannot-link points should be close to/far from each other [15], while the second type just requires that the two points be in the same/different clusters. Our work falls into the second category. Recent examples of the second type of work include ensuring that constraints are satisfied at each iteration [19] and initializing algorithms so that constraints are satisfied [3]. The results of this type of work are quite encouraging; in particular, Wagstaff et al. [18, 19] illustrate that for simple classification tasks, k-Means with constraints obtains clusters with a significantly better purity (when measured on an extrinsic class label) than when not using constraints. Furthermore, Basu et al. [5] investigate determining the most informative set of constraints when the algorithm has access to an Oracle. In this paper we carry out a formal analysis of clustering under constraints. Our work makes several pragmatic contributions. • We introduce two new constraint types which act upon groups of instances. Roughly speaking, the constraint enforces that each instance x in a cluster must have another instance y in the same cluster such that the distance between x and y is at most . We shall see that this can be used to enforce prior information with respect to how the data was collected. The δ-constraint enforces that every instance in a cluster must be at a distance of at least δ from every instance in every other cluster. We can use this type of constraint to specify background information on the minimum distance between the clusters/objects we hope to discover. • We show that these two new constraints can be eas-

ily represented as a disjunction and conjunction of denoted by d(si , sj ). The distance function is assumed must-link constraints, thus making their implemen- to be symmetric so that d(si , sj ) = d(sj , si ). We tation easy. consider the problem of clustering the set S under the following types of constraints. • We present complexity results for the feasibility (a) Must-Link Constraints: Each must-link conproblem under each of the constraints individually straint involves a pair of points si and sj (i = j). In and in combination. The polynomial algorithms any feasible clustering, points si and sj must be in the developed in this context can be used for initializing same cluster. the k-Means algorithm. We show that in many situations, the feasibility problem (i.e., determining (b) Cannot-Link Constraints: Each cannot-link whether there there is a solution that satisfies all constraint also involves a pair of distinct points si and the constraints) is NP-complete. Therefore, it is sj . In any feasible clustering, points si and sj must not advisable to attempt to find a solution that not be in the same cluster. satisfies all the constraints at each iteration of a (c) δ-Constraint (or Minimum Separation Conclustering algorithm. straint): This constraint specifies a value δ > 0. In any solution satisfying this constraint, the distance between • To overcome this difficulty, we illustrate that the any pair of points which are in two different clusters k-Means clustering algorithm can be viewed as a must be at least δ. More formally, for any pair of clustwo step algorithm with one step being to take the ters S and S (i = j), and any pair of points s and i j p derivative of its error function and solving for the s such that s ∈ S and s ∈ S , d(s , s ) ≥ δ. Inforq p i q j p q new centroid positions. With that in mind, we pro- mally, this constraint requires that each pair of clusters pose a new differentiable objective function that in- must be well separated. As will be seen in Section 3.4, corporates constraint violations and rederive a new a δ-constraint can be represented by a conjunction of constrained k-Means algorithm. This algorithm, instance level must-link constraints. like the original k-Means algorithm, is designed to (d) -Constraint: This constraint specifies a value monotonically decrease its error function. > 0 and the feasibility requirement is the following: We empirically illustrate our algorithm’s perfor- for any cluster Si containing two or more points and mance on several standard UCI data sets and show that for any point sp ∈ Si , there must be another point adding must-link and cannot-link constraints not only sq ∈ Si such that d(sp , sq ) ≤ . Informally, this helps accuracy but can help improve convergence. Fi- constraint requires that in any cluster Sj containing nally, we illustrate the use of δ and constraints in clus- two more more points, each point in Sj must have tering distances obtained by an Aibo robot’s infra-red another point within a distance of at most . As detector for the purpose of object detection for path will be seen in Section 3.5, an -constraint can be planning. For example, we can specify the width of the represented by a disjunction of instance level mustAibo robot as δ; that is, we are only interested in clus- link constraints, with the proviso that when none of ters/objects that are more than δ apart, since the robot the must-link constraints is satisfied, the point is in a singleton cluster by itself. The -constraint is similar can only physically move between such objects. The outline of the paper is as follows. We begin in principle to the +minpts criterion used in the DBby considering the feasibility of clustering under all SCAN algorithm [11]. However, in our work, the aim four types constraints individually. The next section is to minimize the constrained vector quantization error builds upon the feasibility analysis for combinations of subject to the -constraint, while in DB-SCAN, their constraints. A summary of the results for feasibility can criterion is central to defining a cluster and an outlier. be found in Table 1. We then discuss the derivation of Although the constraints discussed above provide a our constrained k-Means algorithm. Finally, we present useful way to specify background information to a clusour empirical results. tering algorithm, it is natural to ask whether there is a feasible clustering that satisfies all the given constraints. 2 Definitions of Constraints and the Feasibility The complexity of satisfying the constraints will deterProblem mine how to incorporate them into existing clustering Throughout this paper, we use the terms “instances” algorithms. and “points” interchangeably. Let S = {s1 , s2 , . . . , sn } The feasibility problem has been studied for other denote the given set of points which must be partitioned types of constraints or measures of quality [14]. For exinto K clusters, denoted by S1 , . . ., SK . For any pair ample, the clustering problem where the quality is meaof points si and sj in S, the distance between them is sured by the maximum cluster diameter can be trans-

formed in an obvious way into a constrained clustering problem. Feasibility problems for such constraints have received a lot of attention in the literature (see for example [12, 13, 14]). Typical specifications of clustering problems include an integer parameter K that gives the number of required clusters. We will consider a slightly more general version, where the problem specification includes a lower bound K and an upper bound Ku on the number of clusters rather than the exact number K. Without upper and lower bounds, some feasibility problems may admit trivial solutions. For instance, if we consider the feasibility problem for a collection of mustlink constraints (or for a δ-constraint) and there is no lower bound on the number of clusters, a trivial feasible solution is obtained by having all the points in a single cluster. Likewise, when there is no upper bound on the number of clusters for the feasibility problem under cannot-link constraints (or an -constraint), a trivial feasible solution is to make each point into a separate cluster. Obviously, K and Ku must satisfy the condition 1 ≤ K ≤ Ku ≤ n, where n denotes the total number of points. For the remainder of this paper, a feasible clustering is one that satisfies all the given constraints and the upper and lower bounds on the number of clusters. For problems involving must-link or cannot-link constraints, it is assumed that the collection of constraints C = {C1 , C2 , . . . , Cm } containing the m constraints is given, where each constraint Cj = {sj1 , sj2 } specifies a pair of points. For problems involving and δ constraints, it is assumed that the values of and δ are given. For convenience, the feasibility problems under constraints (a) through (d) defined above will be referred to as ML-feasibility, CL-feasibility, δ-feasibility and feasibility respectively.

may produce solutions with singleton clusters when the constraints under consideration permit such clusters.

3.2 Feasibility Under Must-Link Constraints Klein et al. [15] showed that the ML-feasibility problem can be solved in polynomial time. They considered a more general version of the problem, where the goal is to obtain a new distance function that satisfies the triangle inequality when there is a feasible solution. In our definition of the ML-feasibility problem, no distances are involved. Therefore, a straightforward algorithm whose running time is linear in the number of points and constraints can be developed as discussed below. As is well known, must-link constraints are transitive; that is, must-link constraints {si , sj } and {sj , sk } imply the must-link constraint {si , sk }. Thus, the two constraints can be combined into a single must-link constraint, namely {si , sj , sk }. Thus, a given collection C of must-link constraints can be transformed into an equivalent collection M = {M1 , M2 , . . . , Mr } of constraints, by computing the transitive closure of C. The sets in M are pairwise disjoint and have the following interpretation: for each set Mi (1 ≤ i ≤ r), the points in Mi must all be in the same cluster in any feasible solution. For feasibility purposes, points which are not involved in any must-link constraint can be partitioned into clusters in an arbitrary manner. These facts allow us to obtain a straightforward algorithm for the ML-feasibility problem. The steps of the algorithm are shown in Figure 1. Whenever a feasible solution exists, the algorithm outputs a collection of K clusters. The only situation in which the algorithm reports infeasibility is when the lower bound on the number of clusters is too high. The transitive closure computation (Step 1 in Figure 1) in the algorithm can be carried out as follows. Construct an undirected graph G, with one node for each point appearing in the constraint sets C, and an 3 Complexity of Feasibility Problems edge between two nodes if the corresponding points ap3.1 Overview In this section, we investigate the pear together in a must-link constraint. Then, the concomplexity of the feasibility problems by considering nected components of G give the sets in the transitive each type of constraint separately. In Section 4, we closure. It can be seen that the graph G has n nodes examine the complexity of feasibility problems for comand O(m) edges. Therefore, its connected components binations of constraints. Our polynomial algorithms for can be found in O(n + m) time [8]. The remaining steps feasibility problems do not require distances to satisfy of the algorithm can be carried out in O(n) time. The the triangle inequality. Thus, they can also be used following theorem summarizes the above discussion. with nonmetric distances. On the other hand, the NPcompleteness results show that the corresponding feasi- Theorem 3.1. Given a set of n points and m must-link bility problems remain computationally intractable even constraints, the ML-feasibility problem can be solved in when the set to be clustered consists of points in 2 . O(n + m) time. The feasibility algorithms presented in Sections 3 and 4 focus on determining whether there is a partition 3.3 Feasibility Under Cannot-Link Constraints that satisfies all the given constraints; they do not at- Klein et al. [15] mention that the CL-feasibility probtempt to optimize any objective. Thus, these algorithms lem is NP-complete but omit the proof. Since we

Note: Whenever a feasible solution exists, the following algorithm outputs a collection of K clusters satisfying all the must-link constraints. 1. Compute the transitive closure of the constraints in C. Let this computation result in r sets of points, denoted by M1 , M2 , . . ., Mr . 2. Let S = S − i=1 Mi . (S denotes the subset of points that are not involved in any must-link constraint.) 3. if r ≥ K then
r i=K r

1. for each point si do (a) Determine the set Xi ⊆ S − {si } of points such that for each point xj ∈ Xi , d(si , xj ) < δ. (b) For each point xj ∈ Xi , create the must-link constraint {si , xj }. 2. Let C denote the set of all the must-link constraints created in Step 1. Use the algorithm for the MLfeasibility problem (Figure 1) with point set S, constraint set C and the values K and Ku . Figure 2: Algorithm for the δ-Feasibility Problem

(a) Let A = (

Mi ) ∪ S .
−1 ,

(b) Output M1 , . . ., MK else if |S | < K − r then

A.

Output “There is no solution.” else (a) Let t = K − r. Partition S into t clusters A1 , . . ., At arbitrarily. (b) Output M1 , . . ., Mr , A1 , . . ., At . Figure 1: Algorithm for the ML-Feasibility Problem

wish to draw some additional conclusions from the proof, we have included it in the appendix. The proof uses a straightforward reduction from the Graph KColorability problem (K-Color). It is known that the K-Color problem is NPcomplete even for graphs in which the number of edges is linear in the number of nodes [12]. This fact in conjunction with the proof in the appendix implies that the CLfeasibility problem is computationally intractable even when the number of constraints is linear in the number of points. Further, the K-Color problem is known to be NP-complete for every fixed value of K ≥ 3. From this fact, it follows that the CL-feasibility problem is also NP-complete when the lower bound on the number of clusters is 1 and the upper bound is fixed at any value ≥ 3. While NP-completeness result indicates that the CL-feasibility problem is at least as hard as the KColor problem, it can also be shown that the feasibility problem is no harder than the K-Color problem as follows. For each point si , create a graph node vi , and for each cannot-link constraint {si , sj }, create the undirected edge {vi , vj }. It is easy to verify that the

resulting graph is K-colorable iff there is a solution to the feasibility problem with at least 1 and at most K clusters. This reduction to the coloring problem points out that in practice, one can use known heuristics for graph coloring in choosing the number of clusters. Although the coloring problem is known to be hard to approximate in the worst-case, heuristics that work well in practice are known (see for example [1, 7]). The reduction also shows that when the upper bound on the number of clusters is two, the CL-feasibility problem can be solved in polynomial time. This is because the problem of determining whether an undirected graph can be colored using at most two colors can be solved efficiently [8]. 3.4 Feasibility Under δ-Constraint In this section, we show that the δ-feasibility problem can be solved in polynomial time. The basic idea is simple: in any feasible solution, every pair of points si and sj for which d(si , sj ) < δ, must be in the same cluster. Thus, given the value of δ, we can create a collection of appropriate must-link constraints and use the algorithm for the ML-feasibility problem. This shows that a δ-constraint can be replaced by a conjunction of mustlink constraints. The steps of the resulting algorithm are shown in Figure 2. The running time of the algorithm for δ-feasibility is dominated by the time needed to complete Step 1, that is, the time to compute the set of must-link constraints. Clearly, this step can be carried out in O(n2 ) time, and the number of must-link constraints generated is also O(n2 ). Thus, the overall running time of the algorithm is O(n2 ). The following theorem summarizes the above discussion. Theorem 3.2. For any δ > 0, the feasibility problem

under the δ-constraint can be solved in O(n2 ) time, singleton cluster. Otherwise, we need at least one where n is the number of points to be clustered. additional cluster for the points in S2 ; that is, the minimum number of clusters in that case is t + 1. Thus, 3.5 Feasibility Under -Constraint Let the set S the minimum number of clusters, denoted by N ∗ , is of points and the value > 0 be given. For any point given by N ∗ = t + min{1, r}. If Ku < N ∗ , there is sp ∈ S, the set Γp of -neighbors is given by no feasible solution. We may therefore assume that Ku ≥ N ∗ . Γp = {sq : sq ∈ S − {sp } and d(sp , sq ) ≤ }. As mentioned above, there is a solution with t + r clusters. If t + r ≥ Ku , then we can get Ku clusters by Note that a point is not an -neighbor of itself. Two distinct points sp and sq are are -neighbors of each simply merging (if necessary) an appropriate number of other if d(sp , sq ) ≤ . The -constraint requires that clusters from the collection X1 , X2 , . . ., Xr into a single in any cluster containing two or more points, each cluster. Since each CC of G has at least two points, this point in the cluster must have an -neighbor within merging step will not violate the -constraint. The only remaining possibility is that the value t+r the same cluster. This observation points out that an is smaller than the lower bound K . In this case, we can -constraint corresponds to a disjunction of must-link constraints. For example, if {si1 , . . . , sir } denote the - increase the number of clusters to K by splitting some neighbors of a point si , then satisfying the -constraint of the clusters X1 , X2 , . . ., Xr to form more clusters. for point si means that either one or more of the must- One simple way to increase the number of clusters by link constraints {si , si1 }, . . ., {si , sir } are satisfied or the one is to create a new singleton cluster by taking one point si is in a singleton cluster by itself. In particular, point away from some cluster Xi with two or more any point in S which does not have an -neighbor must points. To facilitate this, we construct a spanning tree for each CC of G. The advantage of having trees is that form a singleton cluster. So, to determine feasibility under an -constraint for we can remove a leaf node from a tree and make the the set of points S, we first find the subset S1 containing point corresponding to that node into a new singleton each point which does not have an -neighbor. Let cluster. Since each tree has at least two leaves and |S1 | = t, and let C1 , C2 , . . ., Ct denote the singleton removing a leaf will not disconnect a tree, this method clusters formed from S1 . To cluster the points in will increase the number of clusters by exactly one at S2 = S−S1 (i.e., the set of points each of which has an - each step. Thus, by repeating the step an appropriate neighbor), it is convenient to use an auxiliary undirected number of times, the number of clusters can be made equal to K . graph defined below. The above discussion leads to the feasibility algorithm shown in Figure 3. It can be seen that the running Definition 3.1. Let a set of points S and a value > 0 be given. Let Q ⊆ S be a set of points such that for each time of the algorithm is dominated by the time needed point in Q, there is an -neighbor in Q. The auxiliary for Steps 1 and 2. Step 1 can be implemented to run in 2 graph G(V, E) corresponding to Q is constructed as O(n ) time by finding the -neighbor set for each point. Since the number of -neighbors for each point in S2 is follows. at most n − 1, the construction of the auxiliary graph (a) The node set V has one node for each point in Q. and finding its CCs (Step 2) can also be done in O(n2 ) time. So, the overall running time of the algorithm is (b) For any two nodes vp and vq in V , the edge {vp , vq } O(n2 ). The following theorem summarizes the above is in E if the points in Q corresponding to vp and discussion. vq are -neighbors. Let G(V, E) denote the auxiliary graph corresponding to S2 . Note that each connected component (CC) of G has at least two nodes. Thus, the CCs of G provide a way of clustering the points in S2 . Let Xi , 1 ≤ i ≤ r, denote the cluster formed from the ith CC of G. The t singleton clusters together with these r clusters would form a feasible solution, provided the upper and lower bounds on the number of clusters are satisfied. So, we focus on satisfying these bounds. If S2 = ∅, then the minimum number of clusters is t = |S1 |, since each point in S1 must be in a separate Theorem 3.3. For any > 0, the feasibility problem under the -constraint can be solved in O(n2 ) time, where n is the number of points to be clustered. 4 Feasibility Under Combinations of Constraints 4.1 Overview In this section, we consider the feasibility problem under combinations of constraints. Since the CL-feasibility problem is NP-hard, the feasibility problem for any combination of constraints involving cannot-link constraints is, in general, computationally

1. 2.

3. 4. 5.

6.

intractable. So, we need to consider only the combinations of must-link, δ and constraints. We show that the feasibility problem remains efficiently solvable when both a must-link constraint and a δ constraint are considered together as well as when δ and constraints Find the set S1 ⊆ S such that no point in S1 has are considered together. When must-link constraints an -neighbor. Let t = |S1 | and S2 = S − S1 . are considered together with an constraint, we show that the feasibility problem is NP-complete. This result Construct the auxiliary graph G(V, E) for S2 (see Definition 3.1). Let G have r connected compo- points out that when must-link, δ and constraints are all considered together, the resulting feasibility problem nents (CCs) denoted by G1 , G2 , . . ., Gr . is also NP-complete in general. Let N ∗ = t + min {1, r}. (Note: To satisfy the -constraint, at least N ∗ clusters must be used.) 4.2 Combination of Must-Link and δ Constraints We begin by considering the combination of if N ∗ > Ku then Output “No feasible solution” must-link constraints and a δ constraint. As mentioned and stop. in Section 3.4, the effect of the δ-constraint is to conLet C1 , C2 , . . ., Ct denote the singleton clusters tribute a collection of must-link constraints. Thus, we corresponding to points in S1 . Let X1 , X2 , . . ., Xr can merge these must-link constraints with the given denote the clusters corresponding to the CCs of G. must-link constraints, and then ignore the δ-constraint. The resulting feasibility problem involves only mustif t + r ≥ Ku link constraints. Hence, we can use the algorithm from Section 3.2 to solve the feasibility problem in polynothen /* We may have too many clusters. */ mial time. For a set of n points, the δ constraint may (a) Merge clusters XKu −t , XKu −t+1 , . . ., Xr contribute at most O(n2 ) must-link constraints. Furinto a single new cluster XKu −t . ther, since each given must-link constraint involves two (b) Output the Ku clusters C1 , C2 , . . ., Ct , points, we may assume that the number of given mustlink constraints is also O(n2 ). Thus, the total number X1 , X2 , . . ., XKu −t . of must-link constraints due to the combination is also else /* We have too few clusters. */ O(n2 ). Thus, the following result is a direct consequence (a) Let N = t + r. Construct spanning trees of Theorem 3.1. T1 , T2 , . . ., Tr corresponding to the CCs of G. (b) while (N < K ) do (i) Find a tree Ti with at least two nodes. If no such tree exists, output “No feasible solution” and stop. (ii) Let v be a leaf in tree Ti . Delete v from Ti . (iii) Delete the point corresponding to v from cluster Xi and form a new singleton cluster XN +1 containing that point. (iv) N = N + 1. (c) Output the K clusters C1 , C2 , . . ., Ct , X1 , X2 , . . ., XK −t . Theorem 4.1. Given a set of n points, a value δ > 0 and a collection C of must-link constraints, the feasibility problem for the combination of must-link and δ constraints can be solved in O(n2 ) time. 4.3 Combination of Must-Link and Constraints Here, we show that the feasibility problem for the combination of must-link and constraints is NP-complete. To prove this result, we use a reduction from the following problem which is known to be NPcomplete [9]. Planar Exact Cover by 3-Sets (PX3C) Instance: A set X = {x1 , x2 , . . . , xn }, where n = 3q for some positive integer q and a collection T = {T1 , T2 , . . . , Tm } of subsets of X such that |Ti | = 3, 1 ≤ i ≤ m. Each element xi ∈ X appears in at most three sets in T . Further, the bipartite graph G(V1 , V2 , E), where V1 and V2 are in one-to-one correspondence with the elements of X and the 3-element sets in T respectively, and an edge {u, v} ∈ E iff the element corresponding to u appears in the set corresponding to v, is planar.

Figure 3: Algorithm for the -Feasibility Problem

Question: Does T contain a subcollection T = {Ti1 , Ti2 , . . . , Tiq } with q sets such that the union of the sets in T is equal to X? For reasons of space, we have included only a sketch of this NP-completeness proof. Theorem 4.2. The feasibility problem under the combination of must-link and constraints is NP-complete. Proof sketch: It is easy to verify the membership of the problem in NP. To prove NP-hardness, we use a reduction from PX3C. Consider the planar (bipartite) graph G associated with the PX3C problem instance. From the specification of the problem, it follows that each node of G has a degree of at most three. Every planar graph with N nodes and maximum node degree three can be embedded on an orthogonal N × 2N grid such that the nodes of the graph are grid points, each edge of the graph is a path along the grid, and no two edges share a grid point except for the grid points corresponding to the graph vertices. Moreover, such an embedding can be constructed in polynomial time [17]. The points and constraints for the feasibility problem are created from this embedding. Using suitable scaling, assume that the each grid edge is of length 1. Note that each edge e of G joins a set node to an element node. Consider the path along the grid for each edge of G. Introduce a new point in the middle of each grid edge in the path. Thus, for each edge e of G, this provides the set Se of points in the grid path corresponding to e, including the new middle point for each grid edge. The set of points in the feasibility instance is the union of the sets Se , e ∈ E. Let Se be obtained from Se by deleting the point corresponding to the element node of the edge e. For each edge e, we introduce a must-link constraint involving all the points in Se . We also introduce a mustlink constraint involving all the points corresponding to the elements nodes of G. We choose the value of to be 1/2. The lower bound on the number of clusters is set to m − n/3 + 1 and the upper bound is set to m. It can be shown that there is a solution to the PX3C problem instance if and only if there is a solution to the feasibility problem.

on the following simple observation: Any pair of points which are separated by a distance less than δ are also -neighbors of each other. This observation allows us to reduce the feasibility problem for the combination to one involving only the constraint. This is done as follows. Construct the auxiliary undirected graph G(V, E), where V is in one-to-one correspondence with the set of points and an edge {x, y} ∈ E iff the distance between the points corresponding to nodes x and y is less than δ. Suppose C1 , C2 , . . ., Cr denote the connected components (CCs) of G. Consider any CC, say Ci , with two or more nodes. By the above observation, the -constraint is satisfied for all the points corresponding to the nodes in Ci . Thus, we can “collapse” each such set of points into a single “super point”. Let S = {P1 , . . . , Pr } denote the new set of points, where Pi is a super point if the corresponding CC Ci has two or more nodes, and Pi is a single point otherwise. Given the distance function d for the original set of points S, we define a new distance function d for S as follows: d (Pi , Pj ) = min{d(sp , sq ) : sp ∈ Pi , sq ∈ Pj }. With this new distance function, we can ignore the δ constraint. We need only check whether there is a feasible clustering of the set S under the -constraint using the new distance function d . Thus, we obtain a polynomial time algorithm for the combination of and δ constraints when δ ≤ . It is not hard to see that the resulting algorithm runs in O(n2 ) time. For the case when δ > , any pair of -neighbors must be in the same cluster. Using this idea, it is possible to construct a collection of must-link constraints corresponding to the given δ and constraints, and solve the feasibility problem in O(n2 ) time. The following theorem summarizes the above discussion. Theorem 4.3. Given a set of n points and values δ > 0 and > 0, the feasibility problem for the combination of δ and constraints can be solved in O(n2 ) time.

The complexity results for various feasibility problems are summarized in Table 1. For each type of con4.4 Combination of δ and Constraints In this straint, the table indicates whether the feasibility probsection, we show that the feasibility problem for the lem is in P (i.e., efficiently solvable) or NP-complete. combination of δ and constraints can be solved in Results for which no references are cited are from this polynomial time. It is convenient to consider this paper. problem under two cases, namely δ ≤ and δ > . For reasons of space, we will discuss the algorithm for 5 A Derivation of a Constrained k-Means Algorithm the first case and mention the main idea for the second case. In this section we derive the k-Means algorithm and For the case when δ ≤ , our algorithm is based then derive a new constrained version of the algorithm

the algorithm’s popularity. We acknowledge that a more complicated analysis of the algorithm exists, namely an interpretation of the algorithm as analogous to the Newton-Raphson algorithm [2]. Our derivation of a new constrained version of k-Means is not inconsistent with these other explanations. The key step in deriving a constrained version of k-Means is to create a new differentiable error function which we call the constrained vector quantization error. Consider a a collection of r must-link constraints and s Table 1: Results for Feasibility Problems cannot-link constraints. We can represent the instances affected by these constraints by two functions g(i) and g (i). These functions return the cluster index from first principles. Let Cj be the centroid of the j th (i.e., a value in the range 1 through k) of the closest cluster and Qj be the set of instances that are closest cluster to the 1st and 2nd instances governed by the ith to the j th cluster. constraint. For clarity of notation, we assume that the It is well known that the error function of the k- instance associated with the function g (i) violates the Means problem is the vector quantization error (also constraint if at all. The new error function is shown in referred to as the distortion) given by the following Equation (5.5). equations. Constraint Must-Link Cannot-Link δ-constraint -constraint Must-Link and δ Must-Link and δ and Complexity P [15] NP-Complete [15] P P P NP-complete P
k

(5.5) VQE j

CVQE j

=

(5.1)

VQE

=
j=1

1 2 1 2

Tj,1 +
si ∈Qj s+r

(Tj,2 × Tj,3 )
l=1,g(l)=j

(5.2)

VQE j

=

1 2

(Cj − si )2
si ∈Qj

where Tj,1 Tj,2 Tj,3 = = = (Cj − si )2 (Cj − Cg (l) )2 ¬ ∆(g (l), g(l)) (Cj − Ch(g (l)) )2 ∆(g(l), g (l))
ml 1−ml

The k-Means algorithm is an iterative algorithm which in every step attempts to further minimize the distortion. Given a set of cluster centroids, the algorithm assigns instances to their nearest centroid which of course minimizes the distortion. This is step 1 of the algorithm. Step 2 is to recalculate the cluster centroids so as to minimize the distortion. This can be achieved by taking the first order derivative of the error (Equation (5.2)) with respect to the j th centroid and setting it to zero. A solution to the resulting equation gives us the k-Means centroid update rule as shown in Equation (5.3).

.

Here h(i) returns the index of the cluster (other than i) whose centroid is closest to cluster i’s centroid and ∆ is the Kronecker Delta Function defined by ∆(x, y) = 1 if x = y and 0 otherwise. We use ¬ ∆ to denote the negation of the Delta function. The first part of the new error function is the regular distortion. The remaining terms are the errors associated with the must-link (ml =1) and cannot-link (ml =0) constraints. In future work we intend to allow d(VQE j ) (5.3) = (Cj − si ) = 0 the parameter ml to take any value in [0, 1] so that d(Cj ) si ∈Qj we can allow for a continuum between must-link and cannot-link constraints. We see that if a must-link (5.4) Cj = si /|Qj | constraint is violated then the cost is equal to the si ∈Qj distance between the cluster centroids containing the Recall that Qj is the set of points closest to the centroid two instances that should have been in the same cluster. of the j th cluster. Similarly, if a cannot-link constraint is violated the Iterating through these two steps therefore mono- cost is the distance between the cluster centroid both tonically decreases the distortion. However, the algo- instances are in and the nearest cluster centroid to one rithm is prone to getting stuck in local minima. Never- of the instances. Note that both violation costs are in theless, good pragmatic results can be obtained, hence units of distance, as is the regular distortion.

Experiments with Minimization of the Constrained Vector Quantization Error In this section we compare the usefulness of our algorithm on three standard UCI data sets. We report the average over 100 random restarts (initial assignments of instances). As others have reported [19] for k=2 the addition of constraints improves the purity of the clusters with respect to an extrinsic binary class not given to the clustering algorithm. We observed similar behavior in our experiments. We are particularly interested in using our algorithm for more traditional unsupervised learning where k > 2. To this end we compare regular kMeans against constrained k-Means with respect to the number of iterations until convergence and the number d(CV QE j ) = (Cj − si ) + of constraints violated. For the PIMA data set (Figd(Cj ) si ∈Qj ure 4) which contains two extrinsic classes, we created s a random combination of 100 must-link and 100 cannot(Cj − Cg (l) ) + link constraints between the same and different classes l=1,g(l)=j,∆(g(l),g (l))=0 respectively. Our results show that our algorithm cons+r verged on average in 25% fewer iterations while satisfy(Cj − Ch(g (l)) ) ing the vast majority of constraints. l=s+1,g(l)=j,∆(g(l),g (l))=1 Similar results were obtained for the BreastCancer data sets (Figure 6) where 25 must-link and 25 cannot= 0 link constraints were used and Iris (Figure 8) where 13 Solving for Cj , we get must-link and 12 cannot-link constraints were used. 7 Experiments with the Sony Aibo Robot In this section we describe an example with the Sony Aibo robot to illustrate the use of our newly proposed s δ and constraints. Yj = si + Cg (l) + The Sony Aibo robot is effectively a walking comsi ∈Qj l=1,g(l)=j,∆(g(l),g (l))=0 puter with sensing devices such as a camera, microphone s+r and an infra-red distance detector. The on-board proCh(g (l)) cessor has limited processing capability. The infra-red l=s+1,g(l)=j,∆(g(l),g (l))=1 distance detector is connected to the end of the head and can be moved around to build a distance map of a and scene. Consider the scene (taken from a position further s back than the Aibo was for clarity) shown in Figure 5. Zj = |Qj | + (1 − ∆(g(l), g (l))) + We wish to cluster the distance information from the g(l)=j,l=1 infra-red sensor to form objects/clusters (spatially ads+r jacent points at a similar distance) which represent solid ∆(g(l), g (l))) objects that must be navigated around. g(l)=j,l=s+1 Unconstrained clustering of this data with k=9 The intuitive interpretation of the centroid update yields the set of significant (large) objects shown in rule is that if a must-link constraint is violated, the Figure 7. cluster centroid is moved towards the other cluster conHowever, this result does not make use of important taining the other point. Similarly, the interpretation of background information. Firstly, groups of clusters where (5.6) Cj = Yj /Zj

The first step of the constrained k-Means algorithm must minimize the new constrained vector quantization error. This is achieved by assigning instances so as to minimize the new error term. For instances that are not part of constraints, this involves as before, performing a nearest cluster centroid calculation. For pairs of instances in a constraint, for each possible combination of cluster assignments, the CVQE is calculated and the instances are assigned to the clusters that minimally increases the CVQE . The second step is to update the cluster centroids so as to minimize the constrained vector quantization error. To achieve this we take the first order derivative of the error, set to zero, and solve. By setting the appropriate values of ml we can derive the update rules (Equation (5.6)) for the must-link and cannot-link constraint violations. The resulting equations are shown below.

the update rule for a cannot-link constraint violation is that cluster centroid containing both constrained instances should be moved to the nearest cluster centroid so that one of the instances eventually gets assigned to it, thereby satisfying the constraint. 6

Figure 4: Performance of regular and constrained kMeans on the UCI PIMA dataset

Figure 6: Performance of regular and constrained kMeans on the UCI Breast Cancer dataset

Figure 5: An image of the scene to navigate

Figure 7: Unconstrained clustering of the distance map using k=9. The approximate locations of significant (large) clusters are shown by the ellipses.

Figure 8: Performance of regular and constrained kMeans on the UCI Iris dataset

Figure 9: Constrained clustering of the distance map using k=9. The approximate locations of significant (large) clusters are shown by the ellipses.

(such as the people) that are separated by a distance less than one foot can effectively be treated as one big cluster since the Aibo robot cannot easily walk through a gap smaller than one foot. Secondly, often one contiguous region is split into multiple objects due to errors in the infra-red sensor. These errors are due to poor reception of the incoming signal or the inability to reflect the outgoing signal which occurs in the wooden floor region of the scene. These errors are common in the inexpensive sensor on the Aibo robot. Since we know the accuracy of the Aibo head movement we can determine the distance between the furthest (about three feet) adjacent readings which determines a lower bound for (namely, 3 × tan 1◦ ) so as to ensure a feasible solution. However, if we believe that it is likely that one but unlikely that two incorrect adjacent mis-readings occur, then we can set to be twice this lower bound to overcome noisy observations. Using these values for our constraints we can cluster the data under our background information and obtain the clustering results shown in Figure 9. We see References that the clustering result is now more useful for our intended purpose. The Aibo can now move towards the area/cluster that represents an open area. [1] A. Hertz and D. de Werra, “Using Tabu Search Tech-

8 Conclusion and Future Work Clustering with background prior knowledge offers much promise with contributions using must-link and cannot-link instance level constraints having already been published. We introduced two additional constraints: and δ. We studied the computational complexity of finding a feasible solution for these four constraint types individually and together. Our results show that in many situations, finding a feasible solution under a combination of constraints is NP-complete. Thus, an iterative algorithm should not try to find a feasible solution in each iteration. We derived from first principles a constrained version of the k-Means algorithm that attempts to minimize the proposed constrained vector quantization error. We find that the use of constraints with our algorithm results in faster convergence and the satisfaction of a vast majority of constraints. When constraints are not satisfied it is because it is less costly to violate the constraint than to satisfy it by assigning two quite different (i.e. far apart in Euclidean space) instances to the same cluster in the case of must-link constraints. Future work will explore modifying hierarchical clustering algorithms to efficiently incorporate constraints. Finally, we showed the benefit of our two newly proposed constraints in a simple infra-red distance clustering problem. The δ constraint allows us to specify the minimum distance between clusters and hence can encode prior knowledge regarding the spatial domain. The constraint allows us to specify background information with regard to sensor error and data collection.

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9] [10]

[11]

[12]

[13]

[14]

[15]

niques for Graph Coloring”, Computing, Vol. 39, 1987, pp. 345–351. L. Bottou and Y. Bengio, “Convergence Properties of the K-Means Algorithms”, Advances in Neural Information Processing Systems, Vol. 7, Edited by G. Tesauro and D. Touretzky and T. Leen, MIT Press, Cambridge, MA, 1995, pp. 585–592. S. Basu, A. Banerjee and R. J. Mooney, “Semisupervised Learning by Seeding”, Proc. 19th Intl. Conf. on Machine Learning (ICML-2002), Sydney, Australia, July 2002, pp. 19–26. S. Basu, M. Bilenko and R. J. Mooney, “A Probabilistic Framework for Semi-Supervised Clustering”, Proc. 10th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining (KDD-2004), Seattle, WA, August 2004. S. Basu, M. Bilenko and R. J. Mooney, “Active Semi-Supervision for Pairwise Constrained Clustering”, Proc. 4th SIAM Intl. Conf. on Data Mining (SDM-2004). P. S. Bradley and U. M. Fayyad, “Refining initial points for K-Means clustering”, Proc. 15th Intl. Conf. on Machine Learning (ICML-1998), 1998, pp. 91–99. G. Campers, O. Henkes and J. P. Leclerq, “Graph Coloring Heuristics: A Survey, Some New Propositions and Computational Experiences on Random and Leighton’s Graphs”, in Proc. Operational Research ’87, Buenos Aires, 1987, pp. 917–932. T. Cormen, C. Leiserson, R. Rivest and C. Stein, Introduction to Algorithms, Second Edition, MIT Press and McGraw-Hill, Cambridge, MA, 2001. M. E. Dyer and A. M. Frieze, “Planar 3DM is NPComplete”, J. Algorithms, Vol. 7, 1986, pp. 174–184. I. Davidson and A. Satyanarayana, “Speeding up KMeans Clustering Using Bootstrap Averaging”, Proc. IEEE ICDM 2003 Workshop on Clustering Large Data Sets, Melbourne, FL, Nov. 2003, pp. 16–25. M. Ester, H. Kriegel, J. Sander and X. Xu, “A DensityBased Algorithm for Discovering Clusters in Large Spatial Databases with Noise”, Proc. 2nd Intl. Conf. on Knowledge Discovery and Data Mining (KDD-96), Portland, OR, 1996, pp. 226–231. M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NPcompleteness, W. H. Freeman and Co., San Francisco, CA, 1979. T. F. Gonzalez, “Clustering to Minimize the Maximum Intercluster Distance”, Theoretical Computer Science, Vol. 38, No. 2-3, June 1985, pp. 293–306. P. Hansen and B. Jaumard, “Cluster Analysis and Mathematical Programming”, Mathematical Programming, Vol. 79, Aug. 1997, pp. 191–215. D. Klein, S. D. Kamvar and C. D. Manning, “From Instance-Level Constraints to Space-Level Constraints: Making the Most of Prior Knowledge in Data Clustering”, Proc. 19th Intl. Conf. on Machine Learning (ICML 2002), Sydney, Australia, July 2002, pp. 307– 314.

[16] D. Pelleg and A. Moore, “Accelerating Exact k-means Algorithms with Geometric Reasoning”, Proc. ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, San Diego, CA, Aug. 1999, pp. 277–281. [17] R. Tamassia and I. Tollis, “Planar Grid Embedding in Linear Time”, IEEE Trans. Circuits and Systems, Vol. CAS-36, No. 9, Sept. 1989, pp. 1230–1234. [18] K. Wagstaff and C. Cardie, “Clustering with InstanceLevel Constraints”, Proc. 17th Intl. Conf. on Machine Learning (ICML 2000), Stanford, CA, June–July 2000, pp. 1103–1110. [19] K. Wagstaff, C. Cardie, S. Rogers and S. Schroedl, “Constrained K-means Clustering with Background Knowledge”, Proc. 18th Intl. Conf. on Machine Learning (ICML 2001), Williamstown, MA, June–July 2001, pp. 577–584.

9 Appendix Here, we show that the feasibility problem for cannotlink constraints (CL-feasibility) is NP-complete using a reduction from the Graph K-Colorability problem (K-Color) [12]. Graph K-Colorability (K-Color) Instance: Undirected graph G(V, E), integer K ≤ |V |. Question: Can the nodes of G be colored using at most K colors so that for every pair of adjacent nodes u and v, the colors assigned to u and v are different? Theorem 9.1. The CL-feasibility problem is NPcomplete. Proof: It is easy to see that the CL-feasibility problem is in NP. To prove NP-hardness, we use a reduction from the K-Color problem. Let the given instance I of K-Color problem consist of undirected graph G(V, E) and integer K. Let n = |V | and m = |E|. We construct an instance I of the CL-feasibility problem as follows. For each node vi ∈ V , we create a point si , 1 ≤ i ≤ n. (The coordinates of the points are not specified as they play no role in the proof.) The set S of points is given by S = {s1 , s2 , . . . , sn }. For each edge {vi , vj } ∈ E, we create the cannot-link constraint {si , sj }. Thus, we create a total of m constraints. We set the lower and upper bound on the number clusters to 1 and K respectively. This completes the construction of the instance I . It is obvious that the construction can be carried out in polynomial time. It is straightforward to verify that the CL-feasibility instance I has a solution if and only if the K-Color instance I has a solution.