
International Journal of Approximate Reasoning 115 (2019) 32–49


A three-way cluster ensemble approach for large-scale data


Hong Yu a,*, Yun Chen a, Pawan Lingras b, Guoyin Wang a

a Chongqing Key Laboratory of Computational Intelligence, Chongqing University of Posts and Telecommunications, Chongqing, 400065, China
b Department of Mathematics and Computing Science, Saint Mary's University, Halifax, NS B3H 3C3, Canada

Article history: Received 10 October 2018; Received in revised form 27 July 2019; Accepted 3 September 2019; Available online 10 September 2019

Keywords: Cluster ensemble; Three-way decisions; Large-scale data; Cluster units; Spark

Abstract: Cluster ensemble has emerged as a powerful technique for combining multiple clustering results. To address the problem of clustering on large-scale data, this paper presents an efficient three-way cluster ensemble approach based on Spark, which has the ability to deal with both hard clustering and soft clustering. First, this paper proposes the framework of three-way cluster ensemble based on Spark, inspired by the theory of three-way decisions, and develops a distributed three-way k-means clustering algorithm. Then, we introduce the concept of cluster unit, which reflects the minimal granularity distribution structure agreed upon by all the ensemble members. We also introduce quantitative measures for calculating the relationships between units and between clusters. Finally, we propose a consensus clustering algorithm based on cluster units, and we devise various three-way decision strategies to assign small cluster units and no-unit objects. The experimental results on 19 real-world data sets validate the effectiveness of the proposed approach on indices such as ARI, ACC, NMI and F1-Measure. The results also show that the proposed approach can effectively deal with large-scale data, and that the proposed consensus clustering algorithm has a lower time cost without sacrificing clustering quality.

© 2019 Elsevier Inc. All rights reserved.

1. Introduction

Cluster ensemble has emerged as an important elaboration of the clustering problem [9,14,29,32]. Generally speaking, every cluster ensemble method is made up of two steps: 1) generating multiple different clusterings of the data set, also called the generation step, and 2) combining these clusterings to obtain a single new clustering, also called the consensus step. More specifically, a major challenge in cluster ensemble is to find a way of improving the combined clustering obtained from the results of single clustering algorithms. In addition, we need to develop clustering technologies for large-scale data sets in the era of big data. To address this issue, the parallel computing framework Spark has been developed. Based on Spark, some scholars have implemented large-scale data processing methods in the fields of data stream mining [42], graph computing [6,10,31], and machine learning [15,23,41].
One objective of this paper is to present a general framework of cluster ensemble which is appropriate for both hard
clustering and soft clustering, in order to address the problem of clustering on large-scale data. This paper will first present
a framework of three-way cluster ensemble based on Spark, where the theory of three-way decisions is introduced into the
context of cluster ensemble. The theory of three-way decisions is constructed based on the notions of acceptance, rejection


and noncommitment, and it is an extension of the commonly used binary-decision model with an added third option [7,36].
A distributed three-way k-means clustering algorithm is also devised in our work to generate ensemble members.
Another objective of this paper is to present a novel efficient algorithm to obtain the final consensus clustering result.
In most of the existing algorithms, the computing of consensus clustering is based on all objects from different clusterings
generated by the first step of cluster ensemble, no matter which consensus functions are used. In this paper, we call them
objects-based approaches. In the generation step, it is advisable to use those clustering algorithms that can yield more
information about the data. In spite of this, there still exist some common underlying structures agreed upon by all ensemble members, especially for large-scale data sets. Therefore, we define these underlying structures as cluster units in this paper,
then a unit-based consensus clustering algorithm is proposed based on the three-way decisions. The unit reflects a minimal
granularity distribution structure agreed by all the ensemble members. Generally speaking, the number of units is much
less than the number of objects in applications. The comparison of experimental results shows that the proposed consen-
sus clustering method needs less computational time and has better performance on some indices such as the accuracy
(ACC) [26], normalized mutual information (NMI) [2], adjusted rand index (ARI) [12] and F1-Measure [22].
The remainder of this paper is organized as follows. Section 2 describes the related works. Section 3 introduces the
framework of three-way cluster ensemble based on Spark. Section 4 describes the cluster unit. Section 5 presents a consen-
sus clustering algorithm based on cluster units and three-way decisions. Section 6 reports experimental results based on a
number of public standard data sets. Conclusions are provided in Section 7.

2. Related work

In this section, we present a short review of existing approaches on cluster ensemble and point out several issues not
addressed satisfactorily to motivate the present study.
The aim of clustering is to group a set of objects in such a way that objects belonging to the same cluster are more similar than those belonging to different clusters. Obviously, there are three types of relationship between an object and a cluster: belonging definitely, uncertain, and not belonging definitely. In this paper, we take into account these three types of relationships. Sometimes the terms soft clustering or overlapping clustering are used for this topic, based on the view that an object can belong to more than one cluster. However, this view is only half right, because it does not intuitively reflect which objects are uncertain with respect to a cluster. Thus, approaches such as fuzzy clustering [19], interval clustering [5], rough clustering [18], three-way clustering [37,40], and orthopartitions and soft clustering [3] have been studied to deal with this kind of uncertain relationship.
One of the most widely used fuzzy clustering algorithms is the Fuzzy C-means (FCM) algorithm. In FCM, the similarities between objects and each cluster are described by membership degrees based on fuzzy set theory, and all objects are assigned to k fuzzy clusters. However, fuzzy set theory cannot provide an exact representation of clusters. To address this issue, Lingras and Peters [19,18] applied rough set theory to clustering and proposed a new cluster representation using the concepts of lower and upper approximations. Considering that clusters represented as interval sets with lower and upper approximations in rough k-means clustering are not adequate to describe clusters, Chen and Miao [5] proposed an interval set clustering based on decision theory. Yu [37,38,40] recently proposed a framework of three-way cluster analysis inspired by the theory of three-way decisions. The theory of three-way decisions [35] extends binary decisions in order to overcome some of their drawbacks and has been applied in areas such as decision making [34], email spam filtering [43], three-way investment decisions [16] and many others [39]. However, the above approaches have rarely been applied to cluster ensemble.
Therefore, we employ the idea of three-way clustering to build a general framework of cluster ensemble that is appropriate for both hard clustering and soft clustering. Similar to rough clustering, which uses a pair of sets to represent a cluster, three-way clustering also describes a cluster by a pair of sets. Generally speaking, rough clustering is usually restricted to the rough k-means algorithm and its extensions. In rough clustering, an object belongs to one and only one lower approximation. In contrast, an object is allowed to belong to more than one core region in three-way clustering. For example, reference [38] gives examples of this scenario in social networks. Usually, uncertain objects in the fringe regions or the boundary regions need further treatment in three-way clustering or rough clustering when further information can be obtained. Lately, Wang and Yao [30] proposed a three-way clustering method based on mathematical morphology, and Afridi et al. [1] proposed a three-way clustering approach that handles missing data using game-theoretic rough sets. Campagner and Ciucci [3] recently proposed orthopartitions and soft clustering, in which a cluster is also represented by two sets, using the notion of orthopartition as a generalized partition with uncertainty. In their work, the normalized soft average mutual information (N-SAMI), normalized soft logical mutual information (N-SLMI) and soft purity (SP) are defined for clustering evaluation; these entropy-based measures constitute novel external evaluation indices for soft clustering.
As we have discussed, a major challenge in cluster ensemble is to find a way of improving the combined clustering
obtained from the results of single clustering algorithms. Many consensus functions have been proposed to combine the
outcomes of multiple clustering systems into a single consolidated partition. For example, Strehl and Ghosh [28] proposed
a hypergraph-based approach, which models clusters as hyperedges and instances as vertices in a hypergraph and uses a
hypergraph partitioning algorithm to produce a final partition; and they also introduced three hypergraph-based methods
for cluster ensemble, which are the cluster-based similarity partitioning algorithm, the hyper graph partitioning algorithm

and the meta-clustering algorithm. The approach has the limitation that it cannot model soft clustering. Fern and Brodley [8] constructed a bipartite graph for cluster members to solve the problem of graph partitioning and proposed a new hypergraph-based method called hybrid bipartite graph formulation. Chen et al. [4] proposed an automatic fuzzy clustering algorithm which adopts the consistency method of subgraph partitioning. Ren et al. [24,25] put forward the idea of incorporating weighted objects into the consensus process and proposed an approach called weighted-object ensemble clustering (WOEC).
In any case, there still exist some common underlying structures agreed upon by all ensemble members, especially for large-scale data sets, and it is advisable to use clustering algorithms that can yield more information about the data during the step of generating cluster members. However, most of the existing consensus clustering algorithms are objects-based approaches. Therefore, we define these underlying structures as cluster units in this paper. A unit reflects a minimal granularity distribution structure agreed upon by all the ensemble members. In many applications, the number of units is much less than the number of objects. Since we perform the consensus clustering on units rather than on the individual objects of the data set, the computational complexity of consensus clustering based on units is much lower than that of consensus clustering based on objects.
On the other hand, let us review the process of cluster ensemble again. Because of its intrinsically distributed nature, the cluster ensemble framework can be naturally integrated with a parallel computing environment such as Spark. For example, Yang et al. [33] proposed a new semi-supervised multi-ant-colony consensus clustering algorithm using user-supplied constraint information, in which a hypergraph-based method is used to combine the cluster members and the algorithm is parallelized so that it can be run on the Hadoop platform for large-scale data processing. Liu et al. [20] proposed a parallel back propagation neural network (BPNN) algorithm based on data separation in three distributed computing environments, Hadoop, HaLoop and Spark, in order to effectively process big data. Concerning the unbalanced distribution of commercial data and the lack of user characteristics, Lin et al. [17] proposed an integrated random forest algorithm based on Spark, which alleviates the deviation from business goals that can arise when big data technology is applied directly to business data.
Therefore, we construct a general framework of three-way cluster ensemble based on Spark (abbreviated as TWCE) in this paper, and the following issues are considered:

• The framework is appropriate for both hard clustering and soft clustering. The three-way clustering approach is used to deal with the uncertain relationship between objects and clusters.
• An efficient consensus clustering algorithm is presented, which is based on cluster units. The proposed concept of cluster unit reflects the minimal granularity distribution structure agreed upon by all the ensemble members, and quantitative measures for calculating the relationships between units and between clusters are presented.
• The proposed approach can effectively deal with large-scale data. The model is based on Spark and a distributed three-way k-means clustering algorithm is developed.

3. Three-way cluster ensemble based on Spark

The framework of TWCE is depicted in Fig. 1; the details are given in Subsection 3.2. Before that, we introduce some basic concepts about the three-way representation of clustering.

3.1. Three-way clustering

To define our framework, let the universe be U = {x1, · · · , xn, · · · , xN}, and let the clustering scheme (result) C = {C^1, · · · , C^k, · · · , C^K} be a family of clusters of the universe.
In most existing clustering results, a cluster is usually represented by a single set, which divides the universe U into two regions. From the perspective of decision making, the representation by a single set means that the objects in the set belong to this cluster definitely and the objects not in the set do not belong to this cluster definitely. This is a typical result of two-way decisions, and we call it a representation of clustering based on two-way decisions.
However, the two-way representation of a cluster cannot show which objects might belong to the cluster, and it cannot intuitively show the degree of influence of an object during the process of forming the cluster. Therefore, it is more reasonable to use three regions than two regions to represent a cluster, and thus a cluster representation based on three-way decisions has been proposed [37].
In contrast to the general crisp representation of a cluster, we represent a three-way cluster C as a pair of sets:

C = (Co(C), Fr(C)).    (1)

Fig. 1. TWCE: the model of three-way cluster ensemble based on Spark.

Here, Co(C) ⊆ U and Fr(C) ⊆ U. Let Tr(C) = U − Co(C) − Fr(C). Then, Co(C), Fr(C) and Tr(C) naturally form three regions, namely the Core Region, the Fringe Region and the Trivial Region, respectively. If x ∈ Co(C), the object x belongs to the cluster C definitely; if x ∈ Fr(C), the object x might belong to C; if x ∈ Tr(C), the object x does not belong to C definitely. These subsets have the following properties.

U = Co(C) ∪ Fr(C) ∪ Tr(C),
Co(C) ∩ Fr(C) = ∅,
Fr(C) ∩ Tr(C) = ∅,
Tr(C) ∩ Co(C) = ∅.    (2)

If Fr(C) = ∅, the representation of C in Equation (1) reduces to C = Co(C); it is a single set and Tr(C) = U − Co(C). This is a representation based on two-way decisions. In other words, the single-set representation is a special case of the three-way cluster representation.
Furthermore, according to Formula (2), it suffices to represent a cluster by its core region and fringe region.
In another way, for 1 ≤ k ≤ K, we can define a cluster scheme by the following properties:

(i) ∀k, Co(C^k) ≠ ∅;
(ii) ⋃_{k=1}^{K} (Co(C^k) ∪ Fr(C^k)) = U.    (3)

Property (i) implies that a cluster cannot be empty. This makes sure that a cluster is physically meaningful. Property (ii) states that any object of U must definitely belong to or might belong to a cluster, which ensures that every object is properly clustered.
With respect to the family of clusters, C, we have the following family of clusters formulated by three-way representation
as:

C = {(Co(C^1), Fr(C^1)), · · · , (Co(C^k), Fr(C^k)), · · · , (Co(C^K), Fr(C^K))}.    (4)

Obviously, we have the following family of two-way clusters formulated as:

C = {Co(C^1), · · · , Co(C^k), · · · , Co(C^K)}.    (5)

Under the representation, we can formulate the soft clustering and hard clustering as follows.

For a clustering, if there exists k ≠ t such that

(1) Co(C^k) ∩ Co(C^t) ≠ ∅, or
(2) Fr(C^k) ∩ Fr(C^t) ≠ ∅, or
(3) Co(C^k) ∩ Fr(C^t) ≠ ∅, or
(4) Fr(C^k) ∩ Co(C^t) ≠ ∅,    (6)

we call it a soft clustering; otherwise, it is a hard clustering.
As long as one condition of Eq. (6) is satisfied, there must exist at least one object belonging to more than one cluster.
Obviously, the three-way representation brings the following advantages: the single-set representation is a special case of the three-way cluster representation; it intuitively shows which objects form the core of a cluster and which ones form its fringe; it diversifies the types of overlapping; and it reduces the search space when focusing on the overlapping/fringe objects.
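To make the distinction in Eq. (6) concrete, the following is a minimal Python sketch that checks whether a three-way clustering is soft or hard; the (core, fringe) pair representation and the function name is_soft_clustering are our own illustrative choices, not notation from the paper.

```python
from itertools import combinations

def is_soft_clustering(clusters):
    """clusters: list of (core, fringe) pairs, each a set of object ids.

    Returns True if any of the four overlap conditions of Eq. (6) holds for
    some pair of distinct clusters, i.e. some object belongs to more than
    one cluster; otherwise the clustering is hard."""
    for (co_k, fr_k), (co_t, fr_t) in combinations(clusters, 2):
        if (co_k & co_t) or (fr_k & fr_t) or (co_k & fr_t) or (fr_k & co_t):
            return True
    return False

# Toy usage: two clusters sharing a fringe object -> soft clustering.
C = [({1, 2}, {7}), ({3, 4}, {7})]
print(is_soft_clustering(C))  # True
```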
Here, we introduce an evaluation-based three-way cluster model [37], which produces three regions by using an evalua-
tion function and a pair of thresholds. The model partially addresses the issue of trisecting a universal set into three regions.
Suppose there is a pair of thresholds (α, β) with α ≥ β. Although evaluations based on a total order are restrictive, they
have a computational advantage. One can obtain the three regions by simply comparing the evaluation value with a pair of
thresholds. Based on the evaluation function v (x, C k ), we get the following three-way decision rules:

Co(C^k) = {x ∈ U | v(x, C^k) > α},
Fr(C^k) = {x ∈ U | β ≤ v(x, C^k) ≤ α},
Tr(C^k) = {x ∈ U | v(x, C^k) < β}.    (7)

The evaluation function v (x, C k ) can be specified accordingly when an algorithm is devised. In fact, in order to devise the
evaluation function, we can refer to the similarity measures or distance measures, probability, possibility functions, fuzzy
membership functions, Bayesian confirmation measures, subsethood measures and so on.
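As an illustration of the three-way decision rules in Eq. (7), the sketch below trisects a small universe using a pair of thresholds; the particular evaluation function (a similarity derived from Euclidean distance) and all names are assumptions made for the example, since the paper leaves v(x, C^k) to be specified per algorithm.

```python
import math

def evaluate(x, center):
    """Example evaluation function: a similarity in (0, 1] derived from
    Euclidean distance; any of the measures listed above could be used."""
    return 1.0 / (1.0 + math.dist(x, center))

def trisect(universe, center, alpha, beta):
    """Split the universe into core, fringe and trivial regions (Eq. (7))."""
    core, fringe, trivial = set(), set(), set()
    for i, x in enumerate(universe):
        v = evaluate(x, center)
        if v > alpha:
            core.add(i)
        elif v >= beta:
            fringe.add(i)
        else:
            trivial.add(i)
    return core, fringe, trivial

points = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (5.0, 5.0)]
print(trisect(points, (0.0, 0.0), alpha=0.8, beta=0.4))
```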

3.2. Framework of the TWCE

The model of three-way cluster ensemble is shown in Fig. 1, and the model is mainly for processing large-scale data sets.
First, let the N objects of U be stored in HDFS (the Hadoop Distributed File System). Then, one or more clustering functions are selected to cluster the data set. These clustering functions are also called clusterers. The set of functions is denoted as F = {f_1, f_2, · · · , f_m, · · · , f_M}, where f_m represents a clustering algorithm, or the same clustering algorithm with a different parameter initialization. The clustering results obtained by the clustering functions are called cluster members, and they are denoted as R = {R_1, R_2, · · · , R_m, · · · , R_M}. Each cluster member can be expressed as R_m = {C_m^1, C_m^2, · · · , C_m^k, · · · , C_m^{K_m}}, where K_m is the number of clusters of the cluster member R_m.
The above step is similar to the generation step in classical cluster ensemble methods. In fact, the clustering function can be any Spark-based clustering algorithm such as [11,13,21,27]. In our work, we choose to build the clustering function on the k-means algorithm since it is easy to understand and effective. We note that the outputs of the clustering functions are three-way clusterings, and we also call them three-way cluster members. The clustering function is described in Subsection 3.3; it is named the distributed three-way k-means clustering algorithm based on Spark, abbreviated as DisKMeans. We can obtain diverse clusterings by setting different numbers of iterations in the DisKMeans algorithm.
Next, the model moves to the consensus step; namely, the labeled clusterings are combined into a single labeled clustering. As mentioned above, the consensus clustering algorithm is based on the concept of cluster units; the algorithm to obtain the cluster units is introduced in Section 4. Experiments suggest that some cluster units contain a large number of objects and some contain very few objects. If we merge cluster units with a small number of objects (in the worst case only one object) into one cluster in the consensus step, the result may not reveal the underlying structure of the real data set. Therefore, we classify cluster units into two types: if the number of objects in a cluster unit is greater than λ · N, it is a big unit; otherwise, it is a small unit. The parameter λ reflects the ratio of the number of objects in a cluster unit to the size of the universe.
Then, we propose a consensus function based on big units, small units and no-unit objects. The function utilizes three-way decision strategies and is described in Section 5. The final clustering result is denoted as C = {(Co(C^1), Fr(C^1)), · · · , (Co(C^k), Fr(C^k)), · · · , (Co(C^{K*}), Fr(C^{K*}))}, where K* is the number of clusters of the final clustering result.

3.3. DisKMeans algorithm based on Spark

In this subsection, we construct a distributed three-way k-means clustering algorithm based on Spark, in order to deal
with large-scale data and the uncertain relationships between objects and clusters. The reason we choose to build the
clustering function on the k-means algorithm is that we can easily obtain diverse clusterings by setting different numbers of iterations when running the DisKMeans algorithm.

Fig. 2. The RDDs graph for the DisKMeans algorithm.

Fig. 2 shows the RDDs (Resilient Distributed Dataset) graph for the distributed three-way k-means clustering algorithm based on Spark. The DisKMeans algorithm aims to generate multiple three-way cluster members in parallel by setting different numbers of iterations, and it is described in Algorithm 1.
The input data set is partitioned into K partitions according to a hash partition method and stored in RDD1. For each partition, we randomly select an object as the initial cluster center, and the results are stored in RDD2. Then, the distance between each data object and a cluster center, Dis(x, C_center), is calculated according to the Euclidean distance formula. The distances are stored in RDD3 in the form of key-value pairs (x, e). Then, the reduce operation is performed on RDD3, and the objects are clustered and stored in RDD4 in the form of key-value pairs (id, x), where id records the cluster label of the object x. The algorithm then calculates the average of each cluster in RDD5 and updates the cluster center, and it iterates until the maximum number of iterations is reached. According to the three-way decision rules, the objects whose distances from a final cluster center are greater than α (namely, Dis(x, C_center) > α) are assigned to the core region of the corresponding cluster; the objects whose distances from the cluster center are between β and α are assigned to the fringe region of the corresponding cluster; otherwise, the objects are assigned to the trivial region of the cluster.

Algorithm 1: DisKMeans: the distributed three-way k-means clustering algorithm based on Spark.
Input: U, K, IterationTime, α, β;
Output: R_m.
1  // Run the DisKMeans algorithm;
2  τ = 1;
3  Partition U into K partitions according to the hash partition method and store them in RDD1;
4  Randomly select one point in each of the K partitions as the initial cluster center and store it in RDD2;
5  while τ < IterationTime do
6      Calculate the distance Dis(x, C_center) and store it in RDD3 in the form of (x, e);
7      Perform the reduce operation on RDD3; the data objects are clustered and stored in RDD4 in the form of (id, x);
8      Calculate the average of each cluster in RDD5 and update the cluster center point;
9      τ = τ + 1;
10 // Obtain the three-way clustering result based on the three-way decision rules;
11 for all x and C do
12     if Dis(x, C_center) > α then x ∈ Co(C);
13     if β ≤ Dis(x, C_center) ≤ α then x ∈ Fr(C);
14 return R_m.
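For readers who want to relate Algorithm 1 to actual Spark code, the following is a minimal PySpark sketch of the DisKMeans idea, assuming the data are numeric vectors already available as an RDD; the function name dis_kmeans and the driver-side handling of centers are our simplifications, so the sketch does not reproduce the RDD1-RDD5 layout of Fig. 2 exactly, and the core/fringe thresholding follows Algorithm 1 as written.

```python
import math
import random
from pyspark import SparkContext

def dis_kmeans(points_rdd, k, iteration_time, alpha, beta, seed=42):
    """Sketch of the distributed three-way k-means member generator."""
    centers = points_rdd.takeSample(False, k, seed)  # initial cluster centers
    for _ in range(iteration_time):
        # Assign every point to its nearest center: (cluster_id, (point, 1)).
        assigned = points_rdd.map(
            lambda x: (min(range(k), key=lambda c: math.dist(x, centers[c])),
                       (x, 1)))
        # Sum coordinates and counts per cluster, then recompute the centers.
        sums = assigned.reduceByKey(
            lambda a, b: ([p + q for p, q in zip(a[0], b[0])], a[1] + b[1]))
        for cid, (vec, cnt) in sums.collect():
            centers[cid] = [v / cnt for v in vec]
    # Three-way assignment on the distance to the nearest final center,
    # applying the thresholds exactly as written in Algorithm 1.
    def assign(x):
        cid = min(range(k), key=lambda c: math.dist(x, centers[c]))
        d = math.dist(x, centers[cid])
        region = 'core' if d > alpha else ('fringe' if d >= beta else 'trivial')
        return (cid, region, x)
    return points_rdd.map(assign)

if __name__ == "__main__":
    sc = SparkContext(appName="DisKMeansSketch")
    random.seed(0)
    data = sc.parallelize([[random.random(), random.random()] for _ in range(1000)])
    member = dis_kmeans(data, k=3, iteration_time=5, alpha=0.8, beta=0.2)
    print(member.take(5))
    sc.stop()
```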

4. Cluster unit

4.1. Definition of cluster unit

The basic idea of cluster unit is based on the fact that one clustering must reveal some underlying structure from a
certain aspect. This leads us to an in-depth study of the information that has been obtained by the cluster members.

First, let us analyze the distribution of objects in cluster members. For any object xn , there are three possible distributions
among M cluster members.

Case 1. xn belongs to the core region of corresponding cluster in all M cluster members;

Case 2. xn belongs to the fringe region of corresponding cluster in all M cluster members;

Case 3. Among the M cluster members, xn belongs to the core region of a cluster in some cluster members and to the fringe region of a cluster in the other cluster members.

For an object in Case 1, the relationship between the object and the clusters in all clusterings is certain. For an object in Case 2 or Case 3, however, the relationship is uncertain; i.e., we cannot confirm definitively which cluster the object should belong to.
The goal of consensus clustering is to find a single clustering that agrees as much as possible with the input clusterings. There is a good possibility that there exist some common structures agreed upon by all clusterings. In other words, different clusterings assign some objects to the same cluster. These objects reflect the fundamental structure, and such structures are highly reliable. We can regard this kind of structure as a unit when computing the consensus clustering. This is different from conventional consensus methods that use objects as the basic unit. We call a structure that appears in all the clusterings a cluster unit in this paper. Obviously, the approach based on units is more efficient than approaches based on objects, especially for large-scale data sets.
A cluster unit is a subset of objects. A cluster unit appears in all the clusterings (cluster members) and reflects the fundamental structure agreed upon by all clusterings. A cluster unit cannot be divided into smaller cluster units; thus the objects in a cluster unit must belong to Case 1.

Definition 1. Cluster Unit. Let A ⊆ U. A is called a cluster unit if and only if it satisfies:

(C1) for every cluster member (clustering) R_m, there exists a k ∈ [1, K_m] such that A = ⋂_{m∈[1,M]} Co(C_m^k);
(C2) if A is a cluster unit, then any B ⊂ A is not a cluster unit.

Condition (C1) denotes that a cluster unit is the intersection of clusters in all clusterings/ensemble members; Condition
(C2) denotes that the cluster unit is the maximal intersection of these clusters. When a set of objects satisfies these two conditions, we call it a cluster unit.
Cluster units reflect the structures agreed upon by every ensemble member. Because the number of clusters of one clustering might or might not equal that of another clustering, the number of cluster units might be equal to or larger than the maximal number of clusters among all clusterings. The number of cluster units equals the number of clusters only when all clusterings are the same. This is a special situation, where we can regard all clusterings as one clustering result; however, it has no practical significance. Therefore, in practice the number of cluster units is larger than the maximal number of clusters of all clusterings.
When the number of clusters of every clustering is less than the real number of clusters, it is possible that the number of cluster units is less than the true number of clusters. In this situation, cluster units reflect the structure of the data set at a coarser granularity level. Obviously, this situation does not help us find the real structure of the data set. In order to prevent such a situation, the number of clusters of all or some ensemble members should be equal to or larger than the true number of clusters.
We need to find a way to obtain cluster units from ensemble members. The simplest approach is a brute force method that computes the intersections of the core regions of clusters in all the clusterings. Suppose there are M clusterings and the maximal number of clusters in all clusterings is K; the time complexity of this method is O(K^M), which is rather high. In the next subsection, we propose the notion of cluster mark sequence to reduce the computational complexity of obtaining cluster units.

4.2. Cluster mark sequence for an object

We can denote a cluster C_m^k of clustering R_m by a symbol such as r, s or t. The symbol is called the cluster mark for the cluster. These symbols only mark clusters and have no practical significance. Let R_m(x_n) denote the cluster mark of x_n in R_m. For any object, we can obtain its distribution over all ensemble members from its cluster marks.
For an object xn ∈ U , we express it by the following cluster mark sequence, T n , namely,

T_n = R_1(x_n) · · · R_m(x_n) · · · R_M(x_n).    (8)

The cluster mark sequence of xn just lists which cluster the object belongs to in every ensemble member in order.

When x_n belongs to Case 1, its cluster mark in every ensemble member is certain; that is to say, it belongs to only one cluster in every ensemble member. Thus, the object has a single cluster mark sequence. When x_n belongs to Case 2 or Case 3, it belongs to the fringe regions of different clusters in some or all ensemble members. Thus, R_m(x_n) might contain more than one cluster mark, and the object might have more than one cluster mark sequence.
For all objects in U , we use T to denote the set of cluster mark sequence, namely,

T = { T 1 , · · · , T n , · · · , T N }. (9)
Certainly, we need some space to store the cluster mark sequences of the objects. Suppose K is the maximal number of clusters in all clusterings. In the best case, when all objects are of Case 1, the space complexity is O(NM). In the worst case, when all objects are of Case 3, the space complexity is O(NMK). The worst case is an extreme case in which all objects belong to fringe regions and none certainly belongs to a core region; this situation cannot arise if we want the clustering to be meaningful according to Property (i). In real life, most of the objects belong to Case 1, with only a small number of objects belonging to Case 2 or Case 3. Thus, in most cases, the space complexity will be closer to O(NM).

4.3. Algorithm to obtain cluster units

Let Unit denote the set of cluster units for M ensemble members, namely,

Unit = {U nit 1 , · · · , U nit z , · · · , U nit Z }, (10)


where Unit_z denotes the z-th cluster unit, |Unit_z| ≥ 1, and Z denotes the cardinality of Unit.
Obviously, the objects in cluster units belong to Case 1. According to the definition of cluster unit, there is no intersection between any two cluster units, namely, Unit_z ∩ Unit_l = ∅ for z ≠ l. If every clustering is hard (crisp), the union of cluster units is the original data set, namely, ⋃_z Unit_z = U. If some clusterings are soft, the union of cluster units is a subset of the original data set, namely, ⋃_z Unit_z ⊆ U. In that case some objects do not belong to any cluster unit; we call them no-unit objects, and denote the set of such objects by Unit̄. Obviously, (⋃_z Unit_z) ∪ Unit̄ = U.
The algorithm for obtaining cluster units is based on the representation by cluster mark sequences. Algorithm 2 outlines the top-level overview of the approach.
The algorithm starts by setting the cluster mark sequence of each object to null. From Line 3 to Line 8, at each iteration of the loop, the cluster mark sequence of an object is extended. Then, a binary search tree mapMark is built for storage and search, where a node includes two data fields: the key field is a cluster mark sequence, and the value field represents the set of objects with this cluster mark sequence. The algorithm then scans the set T and updates the tree in every iteration of the second for loop, which begins at Line 10. The basic idea of the second loop is to merge objects with the same cluster mark sequence. After the loop, the value field of each node in mapMark records a cluster unit.

Algorithm 2: OCU: to obtain cluster units.
Input: ensemble members R = {R_1, · · · , R_m, · · · , R_M};
Output: the set of cluster units Unit and the set of no-unit objects Unit̄.
1  // Set the cluster mark sequence of each object to null
2  T_1 = T_2 = · · · = T_N = ∅; T = {T_1, T_2, · · · , T_N};
3  for each R_m do
4      for each Co(C_m^k) ∈ R_m do
5          // Find the cluster mark sequence of each object
6          if x_n ∈ Co(C_m^k) then
7              T_n = T_n ∪ k; // k is the cluster mark of C_m^k,
8              // i.e. k is added to the m-th position of T_n
9  Build a binary search tree mapMark, and set mapMark = ∅;
10 for n = 1 to N do
11     if x_n is in the core region of a cluster in every clustering then
12         if the mark sequence of x_n is not in the tree then
13             insert a new node into the tree whose key is the cluster mark sequence and whose value is {n};
14         else
15             add the label n to the value of the corresponding node whose key is the cluster mark sequence;
16     else
17         Unit̄ = Unit̄ ∪ {x_n};
18 Output the value of each node in mapMark, each of which represents a cluster unit; output Unit̄.

The time complexity of the initialization is O(N), and the time complexity of the loop from Line 3 to Line 8 is O(NM). The time complexity of the second loop, which is based on the binary search tree, is O(N log Z), where Z is the number of cluster units. Therefore, the total time complexity of obtaining cluster units is O(NM) + O(N log Z).

As previously mentioned, the number of cluster units is much smaller than the number of objects in practice, namely, Z ≪ N, so the time complexity of Algorithm 2 is far less than O(NM) + O(N log N). In the worst case, M = N, and the time complexity will be O(N²). In reality, M ≪ N; in such cases, O(NM) is almost equal to O(N). Thus, the time complexity is closer to O(N log Z) in most practical applications.

4.4. An example to obtain cluster units

Example 1. Suppose there is a universe U = {x1, x2, x3, x4, x5, x6, x7, x8, x9} and three different clusterings/cluster members, namely R1, R2 and R3. The details are shown as follows.

R 1 = {((x1 , x2 ), ∅), ((x3 , x4 ), (x7 )), ((x5 , x6 ), (x7 )), ((x8 , x9 ), ∅)} ;
R 2 = {((x1 , x2 ), ∅), ((x3 , x4 ), (x5 , x6 , x7 )), ((x8 , x9 ), (x5 , x6 , x7 ))} ;
R 3 = {((x1 , x2 ), (x7 )), ((x3 , x4 , x5 , x6 , x8 , x9 ), (x7 ))} .

R 1 divides the data set U into four clusters and we suppose its cluster marks are 1, 2, 3 and 4 respectively; R 2 divides
U into three clusters and we suppose its cluster marks are r, s and t respectively; R 3 divides U into two clusters and we
suppose its cluster marks are u and v respectively.
Then, we have the cluster mark sequences of objects as:

T 1 = R 1 (x1 ) R 2 (x1 ) R 3 (x1 ) = 1ru ,


T 2 = R 1 (x2 ) R 2 (x2 ) R 3 (x2 ) = 1ru ,
T 3 = R 1 (x3 ) R 2 (x3 ) R 3 (x3 ) = 2sv ,
T 4 = R 1 (x4 ) R 2 (x4 ) R 3 (x4 ) = 2sv ,
T 5 = R 1 (x5 ) R 2 (x5 ) R 3 (x5 ) = 3(st ) v ,
T 6 = R 1 (x6 ) R 2 (x6 ) R 3 (x6 ) = 3(st ) v ,
T 7 = R 1 (x7 ) R 2 (x7 ) R 3 (x7 ) = (23)(st )(uv ),
T 8 = R 1 (x8 ) R 2 (x8 ) R 3 (x8 ) = 4t v ,
T 9 = R 1 (x9 ) R 2 (x9 ) R 3 (x9 ) = 4t v .

Observing the result we find that objects x1 , x2 , x3 , x4 , x8 and x9 belong to Case 1, namely, these objects belong to the
core region of clusters in every clustering. Objects x5 and x6 belong to Case 3, namely, these objects belong to the fringe
region of clusters in some clusterings. For example, T 5 exactly depicts that x5 belongs to the core region of cluster 3 in
R 1 , the fringe region of clusters marked s and t in R 2 , the core region of cluster marked v in R 3 . The object x7 belongs to
Case 2, namely, it belongs to the fringe region of clusters in every clustering.
As shown in Example 1, according to the clusterings R 1 , R 2 and R 3 , we can obtain the corresponding cluster units.
The data objects x1 and x2 , x3 and x4 , x8 and x9 all belong to the core region of the corresponding cluster in each
clustering, and each has the same cluster mark sequence, so we have 3 cluster units as Unit = {U nit 1 , U nit 2 , U nit 3 } =
{{x1 , x2 } , {x3 , x4 } , {x8 , x9 }}.
Actually, we use cluster mark sequences to denote cluster units, namely, Unit = {Unit_1, Unit_2, Unit_3} = {1ru, 2sv, 4tv}. We also use cluster mark sequences to represent no-unit objects, i.e., Unit̄ = {x5, x6, x7} = {3(st)v, 3(st)v, (23)(st)(uv)}. That is to say, we just need these three units and three no-unit objects for the consensus clustering. In this example, the consensus clustering runs on 6 items instead of on the 9 original objects.
The set Unit might not include all objects; that is, Unit̄ ≠ ∅ in general. We will develop three-way decision strategies to process these objects in the following section.
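The following Python sketch reproduces the bookkeeping of Algorithm 2 on the data of Example 1, using an ordinary dictionary in place of the binary search tree mapMark; the (core, fringe) encoding of cluster members and all identifiers are our own.

```python
# Each cluster member is a list of clusters, each cluster a (core, fringe) pair.
R1 = [({1, 2}, set()), ({3, 4}, {7}), ({5, 6}, {7}), ({8, 9}, set())]
R2 = [({1, 2}, set()), ({3, 4}, {5, 6, 7}), ({8, 9}, {5, 6, 7})]
R3 = [({1, 2}, {7}), ({3, 4, 5, 6, 8, 9}, {7})]
members = [R1, R2, R3]
objects = range(1, 10)

def core_mark(x, member):
    """Return the index of the cluster whose core region contains x, or None."""
    for k, (core, _) in enumerate(member):
        if x in core:
            return k
    return None

units, no_unit = {}, set()
for x in objects:
    marks = tuple(core_mark(x, m) for m in members)  # cluster mark sequence
    if None in marks:                  # x is not in a core region everywhere,
        no_unit.add(x)                 # so it is a no-unit object (Case 2 or 3)
    else:
        units.setdefault(marks, set()).add(x)  # merge equal mark sequences

print(sorted(units.values(), key=min))  # [{1, 2}, {3, 4}, {8, 9}]
print(no_unit)                          # {5, 6, 7}
```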

5. The consensus clustering algorithm

This section will introduce the proposed consensus clustering algorithm based on cluster units and three-way decisions,
abbreviated as CCAU algorithm.

5.1. Definition of relationship

As we have discussed in Section 3.1, we need to define an appropriate evaluation function in order to obtain the three
regions of a cluster. The relationships between units and between clusters are defined as evaluation functions used in the
proposed CCAU algorithm. The evaluation values for acceptance and rejection are defined by a pair of thresholds α , β .

For a cluster, not only its core region but also its fringe region may contain objects from cluster units or no-unit objects. Therefore, we need to clearly define the relationships between cluster units, between a no-unit object and a cluster unit, and between no-unit objects.

Definition 2. Association Degree Between Cluster Units. The relationship between two units, Unit_i and Unit_j, is defined by the following equation:

BUU(i, j) = CountCommon(T_i, T_j) / M,    (11)

where T_i and T_j are the cluster mark sequences of Unit_i and Unit_j respectively, CountCommon(T_i, T_j) is the number of common marks at corresponding positions of T_i and T_j, and M is the number of cluster members.
When the value of BUU(i, j) is larger, the association degree between the two units is stronger, and it is more likely that they will be assigned to the same cluster. BUU_{Z×Z} = [BUU(i, j)] denotes the matrix of association degrees between cluster units.

Definition 3. Association Degree Between a No-unit Object and a Cluster Unit. For a no-unit object x_i and a cluster unit Unit_j, the association degree between x_i and Unit_j is defined by the following equation:

BUU(i, j) = CountCommon(T_i, T_j) / M,    (12)

where T_i denotes the cluster mark sequence of x_i, and T_j denotes the cluster mark sequence of Unit_j.
When the value of BUU(i, j) is larger, it is more likely that they will be assigned to the same cluster. BUU_{|Unit̄|×Z} = [BUU(i, j)] denotes the matrix of association degrees between no-unit objects and cluster units.

Definition 4. Association Degree Between No-unit Objects. For no-unit objects x_i and x_j, the association degree between them is defined by the following equation:

BUU(i, j) = |J| / M,    (13)

where J = {m | x_i ∈ Fr(C_m^k) and x_j ∈ Fr(C_m^k), 1 ≤ m ≤ M}, k = 1, 2, · · · , K_m. That is, BUU(i, j) reflects how often x_i and x_j occur together in the fringe region of the same cluster.
When the value of BUU(i, j) is larger, it is more likely that they will be assigned to the same fringe region. BUU_{|Unit̄|×|Unit̄|} = [BUU(i, j)] denotes the matrix of association degrees between no-unit objects.
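To illustrate Definitions 2-4, the sketch below computes CountCommon and the corresponding association degrees from cluster mark sequences; a sequence is stored as a list of sets of marks (so a fringe position such as (st) in Example 1 simply holds two marks), and counting a position as "common" when the two sets share at least one mark is our reading of CountCommon.

```python
# Mark sequences from Example 1, one set of marks per ensemble member.
T = {
    "Unit1": [{"1"}, {"r"}, {"u"}],                 # cluster unit 1ru
    "Unit2": [{"2"}, {"s"}, {"v"}],                 # cluster unit 2sv
    "Unit3": [{"4"}, {"t"}, {"v"}],                 # cluster unit 4tv
    "x7":    [{"2", "3"}, {"s", "t"}, {"u", "v"}],  # no-unit object (23)(st)(uv)
}
M = 3  # number of cluster members

def count_common(t_i, t_j):
    """Number of positions where the two sequences share at least one mark."""
    return sum(1 for a, b in zip(t_i, t_j) if a & b)

def buu(t_i, t_j):
    """Association degree of Definitions 2 and 3: shared positions over M."""
    return count_common(t_i, t_j) / M

names = ["Unit1", "Unit2", "Unit3"]
matrix = [[buu(T[i], T[j]) for j in names] for i in names]
print(matrix)                    # e.g. BUU(Unit2, Unit3) = 1/3 (shared mark v)
print(buu(T["x7"], T["Unit2"]))  # association of no-unit object x7 with Unit2
```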
First, in order to obtain the consensus clustering, we obtain a clustering of the cluster units; then we make decisions on the no-unit objects. Thus, we need to define the association degree between clusters and the association degree between no-unit objects and clusters.
Because a cluster contains three regions, the association degree between clusters is composed of the association degree between core regions, the association degree between fringe regions, and the association degree between core and fringe regions. Since the core regions should contribute differently from the other regions, a factor γ is used to weight the fringe-related terms.

Definition 5. Association Degree Between Clusters. For two clusters C^z and C^l, the association degree between them is defined as:

S(C^z, C^l) = 1/(|C^z| · |C^l|) · [ Σ_{Unit_u ∈ Co(C^z), Unit_v ∈ Co(C^l)} BUU(u, v)
              + γ ( Σ_{Unit_u ∈ Co(C^z), Unit_v ∈ Fr(C^l)} BUU(u, v) + Σ_{Unit_u ∈ Fr(C^z), Unit_v ∈ Co(C^l)} BUU(u, v) + Σ_{Unit_u ∈ Fr(C^z), Unit_v ∈ Fr(C^l)} BUU(u, v) ) ].    (14)

Then we define the association degree between no-unit objects and clusters. This association degree contains two parts: one is the association degree between the no-unit object and the core region, and the other is the association degree between the no-unit object and the fringe region.
Suppose the current no-unit object is x_n and the cluster is C^l; C^l may contain both cluster units and no-unit objects. Their association degree is defined as follows.

Fig. 3. An example for obtaining an initial clustering based on big units.

Definition 6. Association Degree Between a No-unit Object and a Cluster. For a no-unit object x_n and a cluster C^l, the association degree between them is defined as:

S(x_n, C^l) = 1/|C^l| · [ Σ_{Unit_u ∈ Co(C^l)} BUU(n, u) + Σ_{x_m ∈ Co(C^l)} BUU(n, m)
              + γ ( Σ_{Unit_u ∈ Fr(C^l)} BUU(n, u) + Σ_{x_m ∈ Fr(C^l)} BUU(n, m) ) ].    (15)

To obtain a clustering of the cluster units, we use Equation (14) to calculate the association degree between clusters. Then we decide which clusters the no-unit objects belong to by calculating the association degree between no-unit objects and clusters according to Equation (15).
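As a concrete reading of Definition 5, the sketch below computes S(C^z, C^l) from a precomputed unit-unit association matrix; clusters are given as (core, fringe) pairs of unit indices, |C| is taken here as the number of units in the cluster (core plus fringe), and the function name is ours -- these are assumptions for illustration rather than the paper's implementation.

```python
def cluster_association(cz, cl, buu, gamma):
    """Association degree between two clusters (Eq. (14)).

    cz, cl : (core, fringe) pairs, each a set of cluster-unit indices.
    buu    : matrix (list of lists) with buu[u][v] = association of units u, v.
    gamma  : weight of the fringe-related terms."""
    core_z, fringe_z = cz
    core_l, fringe_l = cl
    size_z = len(core_z) + len(fringe_z)
    size_l = len(core_l) + len(fringe_l)

    def block(us, vs):
        # Sum of pairwise unit associations between two groups of units.
        return sum(buu[u][v] for u in us for v in vs)

    total = (block(core_z, core_l)
             + gamma * (block(core_z, fringe_l)
                        + block(fringe_z, core_l)
                        + block(fringe_z, fringe_l)))
    return total / (size_z * size_l)

# Toy usage with a 3x3 unit association matrix.
buu = [[1.0, 0.0, 0.0],
       [0.0, 1.0, 1 / 3],
       [0.0, 1 / 3, 1.0]]
C1 = ({0}, set())      # cluster containing unit 0 in its core
C2 = ({1, 2}, set())   # cluster containing units 1 and 2 in its core
print(cluster_association(C1, C2, buu, gamma=0.8))  # 0.0
```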

5.2. Description of the CCAU algorithm

As we have discussed, the consensus clustering algorithm is based on the cluster units and the no-unit objects. The units reflect the fundamental structures of the data set, and they are not divided during the clustering process.
We obtain the cluster units Unit and the no-unit objects Unit̄ according to Algorithm 2, build the corresponding matrices according to Definitions 2 to 4, and then classify the cluster units into big units and small units as described in Subsection 3.2. Furthermore, we store every unit as the core region of a cluster, and set the fringe region of this cluster to empty.
Algorithm 3 outlines the top level overview of the CCAU algorithm. The algorithm includes three main steps: 1) to
obtain an initial clustering based on big units; 2) to deal with small units based on three-way decision; and 3) to deal with
no-unit objects based on three-way decisions. Finally, the algorithm transfers units to the corresponding objects and outputs the final consensus clustering.
The algorithm first obtains an initial clustering based on big units (Line 1 to Line 6). Let us consider the data set in Example 1. Fig. 3 depicts the sketch map of connected subgraphs. The left part of the figure shows the matrix of big units, where each element of the matrix records the number of times the corresponding two cluster units are divided into the same cluster. Initially, every unit is a separate subgraph. In the example, we suppose there are no small units; that is, Unit_1 = {1ru}, Unit_2 = {2sv} and Unit_3 = {4tv} first construct three separate subgraphs. Here, we use the cluster mark sequence to represent the cluster. Then, we find that we can construct a bigger connected subgraph by combining the components of Unit_2 and Unit_3, because they have the common cluster mark v, which means the objects represented by these marks are more likely to be in the same cluster. Because the number of connected subgraphs is now equal to K = 2, we obtain an initial clustering C = {C^1 = ({1ru}, ∅), C^2 = ({2sv, 4tv}, ∅)}.
Line 7 to Line 18 describe how to deal with small units based on three-way decisions. First, the algorithm computes the relationships S between Unit_z and the big units according to Equation (14). If all relationship values S are no more than β, Unit_z is regarded as a set of noise points. Otherwise, the algorithm lets A = {m | S ≥ α} denote the set of cluster units whose relationship values are no less than α: 1) if the cluster units in A belong to the same cluster, the algorithm assigns Unit_z to the core region of that cluster; 2) if the cluster units in A belong to different clusters, the algorithm assigns Unit_z to the fringe regions of those clusters. Then, the algorithm lets B = {m | β < S < α} denote the set of cluster units whose relationship values are between β and α, and it assigns Unit_z to the fringe regions of the corresponding clusters.
Line 19 to Line 35 describe how to deal with no-unit objects based on three-way decisions. First, the algorithm computes the relationship between x_n and each cluster according to Equation (15); then, it lets A = {C^l | S(x_n, C^l) ≥ α} denote the set of clusters whose relationship with x_n is no less than α.
Now there are three cases:

Case I. If there only exists one cluster whose relationship with xn is no less than α , namely | A | = 1, the algorithm assigns
xn to the core region of the cluster in A;

Case II. If there exist multiple clusters whose relationships with x_n are no less than α, namely |A| > 1, then x_n has strong relationships with several clusters. Let B = {C^l ∈ A | S(x_n, C^l) = max_{C^t ∈ A} S(x_n, C^t)} be the set of clusters whose relationship with x_n

Algorithm 3: CCAU: the consensus clustering algorithm based on cluster units and three-way decisions.
Input: Unit, Unit̄ and K;
Output: The final clustering C.
1 Obtain a matrix of big units, where the element records the times of the corresponding two cluster units divided into the same cluster;
2 Find the connected subgraph of the matrix and set the number of subgraphs be S K ;
3 if S K ≥ K then
4 go to Line 7;//K is the real number of clusters
5 else
6 subtract 1 from the non-zero elements of the matrix, update the matrix, and go to Line 2;
7 for each small unit U nit z do
8 compute the relationships S between U nit z and other big units;
9 if all relationship values S are no more than β then
10 set U nit z be a noise set, and go to Line 19;

11 A = {m| S ≥ α } ;
12 if cluster units in A belong to the same cluster then
13 assign U nit z to the core region of the cluster;
14 else
15 assign U nit z to the fringe region of those clusters;
16 B = {m|β < S < α } ;
17 for U nit z ∈ B do
18 assign U nit z to the fringe region of the corresponding clusters;
19 for each no-unit object xn do
20 compute the relationships between xn and a cluster;
21 A = {C l | S (xn , C l ) ≥ α } ;
22 if | A | = 1 then
23 assign xn to the core region of the cluster in A;
24 if | A | > 1 then
25 B = {C^l ∈ A | S(x_n, C^l) = max_{C^t ∈ A} S(x_n, C^t)};
26 if | B | = 1 then
27 xn belongs to the core region of the cluster in the B;
28 else
29 xn belongs to the fringe region of these clusters in the B;

30 if | A | = 0 then
31 H = {C l |β ≤ S (xn , C l ) < α };
32 if H ≠ ∅ then
33 xn belongs to the fringe region of these clusters in the H ;

34 Transfer units to corresponding objects and output the final clustering.

is maximal. If there only exists one cluster in B, namely | B | = 1, xn belongs to the core region of the cluster in B; otherwise,
xn belongs to the fringe region of these clusters in B;

Case III. If the relationship between xn and every cluster is less than α , then | A | = 0. Let H = {C l |β ≤ S (xn , C l ) < α }. Now
there are two subcases:

1) if H ≠ ∅, xn belongs to the fringe regions of the clusters in H;


2) if H = ∅, the relationship between xn and every cluster is less than β, and we regard xn as a noise point.

Let us continue the example. Because there are no small units, the algorithm goes to Line 19, which deals with the no-unit objects. The relationships between x_n and the clusters are computed according to Equation (15) and are shown in the following matrix.

S(x_n, C^l)   C^1    C^2
x5            0      0.7
x6            0      0.7
x7            0.3    0.85

Let α = 0.8, β = 0.2 and γ = 0.8. For example, the relationship between x5 and C^2 is less than α but larger than β, so x5 ∈ Fr(C^2). Thus, we update the clustering result as C = {C^1 = ({1ru}, {x7}), C^2 = ({2sv, 4tv} ∪ {x7}, {x5, x6})}.

Table 1
Information about the datasets.

No. Data sets #Objects #Attributes #Classes


1 Beef 60 470 5
2 Leukemia 72 3572 2
3 BreastTissue 106 10 6
4 HayesRoth 132 6 3
5 Iris 150 4 3
6 Glass 214 9 6
7 Satimage 420 37 6
8 Spambase 500 57 2
9 BalanceScale 625 4 3
10 AustralianCredit 690 16 2
11 Sony 980 65 2
12 Vowel 990 10 11
13 Yeast 1136 8 3
14 Semeion 1593 256 10
15 Wall 4302 25 4
16 Skin Segmentation 245,057 3 2
17 CovType 581,012 54 7
18 Poker Hand 1,000,000 11 10
19 KDDCup 1,048,576 42 9

Finally, the algorithm transfers units to the corresponding objects and outputs the final clustering. Looking back at Example 1, we have 1ru = T_1 and 1ru = T_2, which means 1ru represents objects x1 and x2. Thus, the final clustering of the example is C = {({x1, x2}, {x7}), ({x3, x4, x7, x8, x9}, {x5, x6})}.
In the example, the cluster members are soft clusterings and their numbers of clusters differ; after consensus clustering, the final result is a soft clustering. Of course, the result might also be a hard clustering for a given data set.
The proposed algorithm utilizes cluster units, which indicate the common structures accepted by all ensemble members, and it reduces the computational time of consensus clustering since the number of cluster units is much less than the number of objects in most cases, especially for large-scale data sets.

6. Experimental results

We evaluate the proposed approach through the following three experiments, analyzing the influence of the parameters and comparing with other methods. The experiments are implemented on Spark 1.6, and the data sets are all CSV (Comma-Separated Values) files.

6.1. Experimental settings and data sets

In order to give an appropriate value range for the parameters of the proposed approach, we design the empirical analysis in Subsection 6.2. To show the superiority of the proposed consensus clustering algorithm based on cluster units and three-way decisions (the CCAU algorithm), we carry out comparison experiments in Subsection 6.3. For a fair comparison, we choose the same data sets used for the WOEC method [25], which are the first 16 data sets in Table 1. Ref. [25] experimentally verified a variety of hypergraph algorithms, and the compared method in Ref. [4] adopts the consistency method of subgraph partitioning. These experiments do not need to be implemented in parallel on Spark; we just need to run the proposed CCAU algorithm. The main purpose of these experiments is to show the effectiveness of the proposed consensus clustering algorithm based on cluster units and three-way decisions. The last 4 data sets in Table 1, however, are genuinely large-scale: each contains hundreds of thousands or even millions of objects. Thus, the experiments in Subsection 6.4 are designed to show the ability of the proposed TWCE approach to handle large-scale data.
Table 1 gives the summary information about the data sets. "#Objects", "#Attributes" and "#Classes" denote the number of objects, the number of attributes and the number of clusters (classes), respectively. Beef and Sony (SonyAIBORobotSurfaceII) are data
sets from the UCR time series repository.1 Leukemia is a gene expression data set.2 The other data sets are from the UCI
repository.3 Satimage contains 6435 images; we randomly select 420 images which are equally distributed among the six
classes in the experiments. Spambase has 4601 objects; we randomly sample 500 objects (250 for each class). The last four
data sets are large-scale data, namely, Covtype, KDDCup, Poker Hand, and Skin Segmentation.

1 http://www.cs.ucr.edu/~eamonn/time_series_data/.
2 http://stat.ethz.ch/~dettling/bagboost.html.
3 http://archive.ics.uci.edu/ml/index.html.

Before reporting the results, we need to indicate the performance indices that we will adopt.
The adjusted rand index, ARI, is an external evaluation index with a range of [−1, 1]. A larger value means that the clustering result is more consistent with the real situation. Let a_i denote the number of objects in actual cluster i, b_j the number of objects in cluster j of the computed clustering result, and n_ij the number of objects shared by actual cluster i and computed cluster j; a + b can be considered as the number of agreements between the actual clustering and the computed clustering. ARI is computed by the following formula [12]:

ARI = \frac{RI - E[RI]}{\max(RI) - E[RI]}
    = \frac{\sum_{ij}\binom{n_{ij}}{2} - \left[\sum_{i}\binom{a_i}{2}\sum_{j}\binom{b_j}{2}\right]/\binom{n}{2}}
           {\frac{1}{2}\left[\sum_{i}\binom{a_i}{2} + \sum_{j}\binom{b_j}{2}\right] - \left[\sum_{i}\binom{a_i}{2}\sum_{j}\binom{b_j}{2}\right]/\binom{n}{2}},    (16)

where RI = (a + b) / \binom{n}{2}.
2
The accuracy, ACC, is the ratio of the number of correctly classified objects to the total number of objects. Let N be the total number of objects and α_k be the number of correctly classified objects in cluster C^k. The larger the ACC value, the better the quality of the clustering result. It is calculated as follows [26]:

ACC = (1/N) Σ_{k=1}^{K} α_k.    (17)

The normalized mutual information, NMI, is used to measure the similarity between two clustering results. P(i) denotes the ratio of the number of samples in the real cluster i to the total number of samples N; P(j) denotes the ratio of the number of samples in the partitioned cluster j to the total number of samples N; and P(i, j) denotes the ratio of the number of agreements to N. The range of NMI is [0, 1]; the larger the value, the more consistent the clustering result is with the true distribution. It is calculated as follows [2]:

NMI = \frac{\sum_{i=1}^{k}\sum_{j=1}^{k} P(i,j)\log\frac{P(i,j)}{P(i)P(j)}}{\sqrt{\sum_{i=1}^{k} P(i)\log P(i) \cdot \sum_{j=1}^{k} P(j)\log P(j)}}.    (18)

F1-Measure is a standard criterion for evaluating a clustering algorithm or a model. It combines the precision P and the recall R into a single measure, which effectively balances precision against recall. The larger the F1-Measure value, the better the clustering result. It is computed as follows [22]:

F1-Measure = 2PR / (P + R).    (19)
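As a practical note, ARI and NMI can be computed with scikit-learn, and clustering ACC is commonly obtained by finding the best one-to-one matching between predicted and true labels with the Hungarian algorithm; the snippet below sketches this under the assumption that scikit-learn and SciPy are available (the paper itself does not state which implementation it uses).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def clustering_acc(y_true, y_pred):
    """ACC via the best one-to-one matching of cluster labels (Hungarian)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    labels_true, labels_pred = np.unique(y_true), np.unique(y_pred)
    # Contingency matrix: rows = predicted clusters, cols = true classes.
    w = np.zeros((labels_pred.size, labels_true.size), dtype=np.int64)
    for i, lp in enumerate(labels_pred):
        for j, lt in enumerate(labels_true):
            w[i, j] = np.sum((y_pred == lp) & (y_true == lt))
    row, col = linear_sum_assignment(-w)  # maximize the matched objects
    return w[row, col].sum() / y_true.size

y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [1, 1, 1, 0, 0, 2, 2, 2, 2]
print("ARI:", adjusted_rand_score(y_true, y_pred))
print("NMI:", normalized_mutual_info_score(y_true, y_pred))
print("ACC:", clustering_acc(y_true, y_pred))  # 8/9 for this toy example
```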

6.2. Empirical analysis of parameters

The main purpose of the experiment is to observe the variations of ARI with different parameters, α , β and the number
of cluster members m, by applying the CCAU algorithm on 8 data sets in Table 1, namely, HayesRoth, Glass, BalanceScale,
Sony, Vowel, Yeast, Semeion and Wall.
We set the domain of (α, β) from (0.5, 0.5) to (0.9, 0.1) with a step length of 0.05; that is, there are 9 pairs of (α, β). The number of cluster members m is increased from 3 to 9. For the different parameters α, β and m, the ARI values on each data set are recorded in Fig. 4, where the x-axis is the pair of thresholds (α, β), the y-axis is the number of cluster members m, and the z-axis is the ARI of the final clustering result. Thus, there are 63 results on each data set.
Figs. 4a to 4h reveal several consistent patterns across the data sets. The results are poor when α and β are (0.5, 0.5),
but they improve as α and β move away from this setting; the ARIs generally show an increasing trend, especially on the
data sets HayesRoth, Glass, Sony and Semeion. In particular, when the thresholds α and β take (0.8, 0.2), (0.85, 0.15) or (0.9, 0.1)
and the number of cluster members is set to 3, 4 or 5, the experimental results are significantly better than those obtained
with other parameters. Thus, we adopt α = 0.8, β = 0.2 and m = 3 in the following experiments.
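A hypothetical sketch of this parameter sweep is given below; run_ccau is only a stand-in (here it returns a random partition), since the actual CCAU consensus algorithm is described in the earlier sections and is not reproduced in this snippet.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)


def run_ccau(data, alpha, beta, m, k):
    # stand-in for the CCAU consensus algorithm: returns a random k-way partition
    return rng.integers(0, k, size=len(data))


data = rng.normal(size=(150, 4))            # stand-in for one data set
true_labels = rng.integers(0, 3, size=150)  # stand-in for its class labels

alpha_beta = [(round(0.5 + 0.05 * i, 2), round(0.5 - 0.05 * i, 2)) for i in range(9)]
results = {}
for alpha, beta in alpha_beta:      # 9 threshold pairs from (0.5, 0.5) to (0.9, 0.1)
    for m in range(3, 10):          # m = 3, ..., 9 cluster members
        labels = run_ccau(data, alpha, beta, m, k=3)
        results[(alpha, beta, m)] = adjusted_rand_score(true_labels, labels)

# 9 x 7 = 63 ARI values per data set, matching the grids shown in Fig. 4
best = max(results, key=results.get)
print(best, results[best])
```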

6.3. Results of comparison experiments

In this subsection, the proposed CCAU method is compared with the algorithms in the references [4,25] on the first
15 data sets in Table 1.
Table 2 records the ARI results of the algorithms; the best results are shown in bold. For a fair comparison, the
ARI results of the WOEC come from the reference [25], and we choose the best experimental results of the corresponding
three hypergraph integration methods (WOMC, WOSP, and WOHB) recorded in the original reference.
Fig. 4. The ARIs on 8 data sets with different (α , β) and m.
Table 2
Comparison results on the index ARI.

No. Data sets WOEC [25] Method in Ref. [4] Proposed method
1 Beef 0.2564 −0.0061 0.8537
2 Leukemia 0.2221 0.2008 0.8810
3 BreastTissue 0.3424 −0.0853 0.1385
4 HayesRoth 0.1636 −0.1378 0.1754
5 Iris 0.6496 0.6854 0.9665
6 Glass 0.1854 −0.0193 0.5258
7 Satimage 0.6807 −0.0212 0.9518
8 Spambase 0.5736 −0.8888 0.5010
9 BalanceScale 0.1510 0.0531 0.9069
10 AustralianCredit 0.4676 −0.7668 0.6200
11 Sony 0.3937 0.2260 0.4052
12 Vowel 0.1767 −0.0166 0.1472
13 Yeast 0.1137 0.0124 0.2735
14 Semeion 0.4183 −0.0361 0.7074
15 Wall 0.8449 −0.1752 0.6244

Average 0.3759 −0.0650 0.5785

Table 3
Comparison results on the indices of NMI, ACC, F1-Measure and running time.

No. Data sets    NMI (Ref. [4], CCAU)    ACC (Ref. [4], CCAU)    F1-Measure (Ref. [4], CCAU)    Running time (s) (Ref. [4], CCAU)
1 Beef 0.2780 0.3718 0.4167 0.4667 0.3826 0.4615 1 6
2 Leukemia 0.0643 0.1474 0.6389 0.6389 0.4981 0.4778 2 97
3 BreastTissue 0.3652 0.2951 0.3396 0.6415 0.2873 0.1435 4 3
4 HayesRoth 0.0257 0.0352 0.4318 0.3485 0.4407 0.2113 5 1
5 Iris 0.7582 0.7707 0.8933 0.9133 0.8918 0.5252 6 1
6 Glass 0.1132 0.3264 0.8364 0.7757 0.9109 0.3534 12 2
7 Satimage 0.4765 0.6489 0.6143 0.6643 0.5908 0.5818 43 3
8 Spambase 0.0014 0.1159 0.5100 0.9980 0.3793 0.3338 64 1
9 BalanceScale 0.3180 0.2647 0.2192 0.6432 0.3487 0.4783 78 1
10 AustralianCredit 0.0369 0.0344 0.4942 0.9928 0.3895 0.3837 116 1
11 Sony 0.2384 0.1977 0.7837 0.3041 0.7856 0.3368 187 1
12 Vowel 0.0208 0.2212 0.0990 0.2657 0.0407 0.1304 165 9
13 Yeast 0.1961 0.3296 0.3679 0.7042 0.2631 0.3096 405 9
14 Semeion 0.1435 0.3963 0.1488 0.4997 0.0897 0.4151 404 12
15 Wall 0.1154 0.0772 0.4008 0.5057 0.4026 0.3857 4690 4

Average 0.2101 0.2821 0.4796 0.6241 0.4467 0.3685 412.1 10.0

To complete the comparison, we also programmed the method in Ref. [4], and the ARIs of the compared method are also recorded.
The last row records the average ARI of the corresponding algorithm on all data sets.
From the above experimental results, we can see that the proposed CCAU algorithm achieves higher ARI values on most of
the data sets, such as AustralianCredit, BalanceScale, Beef, Glass, HayesRoth, Iris, Leukemia, Satimage, Semeion, Sony and
Yeast. In contrast, the method in Ref. [4] performs worst on the ARI index. This shows that the proposed method produces
higher-quality clusterings.
In addition, we also report the indices ACC, NMI, F1-Measure and the CPU running times. The results are recorded in Table 3,
and the average values are shown in the last row. The proposed method performs best on the indices ACC and NMI, while the
compared method performs better on the index F1-Measure. However, the time cost of the proposed CCAU algorithm is much lower
than that of the compared algorithm in Ref. [4], especially when the size of the data set is large: on average, the proposed
CCAU method reduces the computational time by a factor of more than 40 (412.1 s vs. 10.0 s).

6.4. Results on large-scale data sets

In order to further show the ability of the proposed TWCE method to deal with large-scale data sets, we test the
TWCE on the last four large-scale data sets in Table 1.
Table 4 shows the indices NMI, ACC, F1-Measure and ARI on these large-scale data sets. In terms of the average NMI
and F1-Measure, the proposed method performs better on these large-scale data sets than on the data sets in Table 3.
The average ARI on the large-scale data sets does not exceed the average in Table 2, but it is much higher than the
ARIs of the compared algorithms. These results indicate that the proposed TWCE method is effective for large-scale data sets.
Table 4
Results of the proposed TWCE on the large-scale data sets.

Data sets NMI ACC F1-Measure ARI


Skin Segmentation 0.8370 0.1681 0.6790 0.8598
CovType 0.3920 0.2616 0.9634 0.4335
Poker Hand 0.5441 0.4721 0.6707 0.3348
KDDCup 0.2552 0.5020 0.1259 0.1461

Average 0.5070 0.3509 0.6097 0.4435

7. Conclusion

Cluster ensemble combines multiple clustering results into a single clustering, making the clustering method
more stable and robust. This paper addressed the problem of clustering large-scale data and proposed an efficient three-way
cluster ensemble model based on Spark, which has the ability to deal with both hard clustering and soft clustering. In
the three-way cluster ensemble model, a cluster is represented by a pair of sets, which divides the universe into three
regions, namely, the core region, the fringe region and the trivial region. The three-way representation intuitively reveals the three
possible relationships between an object and a cluster: an object certainly belongs to the cluster if it is in the core region of the
cluster; an object in the trivial region certainly does not belong to the cluster; and an object in the fringe region might
or might not belong to the cluster. We designed a distributed three-way k-means clustering algorithm based on Spark to
generate the initial three-way cluster members on large-scale data.
In addition, to enhance the performance of the consensus clustering algorithm, this paper proposed the concept of
cluster unit by making the best use of the information from all cluster members. The cluster units reflect the minimal granularity
distribution structure agreed by all ensemble members. The association degrees between units and between clusters were
also defined, and the notion of cluster mark sequence was proposed in order to reduce the computational complexity of obtaining
cluster units. This paper further proposed a cluster ensemble algorithm based on cluster units and three-way decisions, in
which various three-way decision strategies were devised to assign small cluster units and no-unit objects.
This paper conducted experiments to illustrate the salient features of the proposed algorithm and evaluate its perfor-
mance. The empirical analysis suggests that the thresholds α and β should take (0.8, 0.2), (0.85, 0.15) or (0.9, 0.1) and
that the number of cluster members should be 3, 4 or 5. The comparison results on a range of data sets demonstrate
the validity and superiority of the proposed consensus function on the indices ARI, ACC, NMI, F1-Measure and CPU
running time. In particular, the running time is reduced by a factor of more than 40 owing to the use of the proposed
cluster units. The experimental results on large-scale data sets with millions of objects show that the proposed three-way cluster
ensemble model is a competitive model for large-scale data. However, the proposed method cannot handle non-numerical
data; we hope to address this limitation in future work.

Declaration of competing interest

We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no
significant financial support for this work that could have influenced its outcome.

Acknowledgement

This work was supported in part by the National Natural Science Foundation of China under Grant Nos. 61876027,
61751312 and 61533020.

References

[1] M.K. Afridi, N. Azam, J.T. Yao, E. Alanazi, A three-way clustering approach for handling missing data using GTRS, Int. J. Approx. Reason. 98 (2018)
11–24.
[2] G. Bouma, Normalized (pointwise) mutual information in collocation extraction, in: Proceedings of GSCL, 2009, pp. 31–40.
[3] A. Campagner, D. Ciucci, Orthopartitions and soft clustering: soft mutual information measures for clustering validation, Knowl.-Based Syst. 180 (2019)
51–61.
[4] H. Chen, X. Shen, Y. Lv, J. Long, A novel automatic fuzzy clustering algorithm based on soft partition and membership information, Neurocomputing
236 (2017) 104–112.
[5] M. Chen, D. Miao, Interval set clustering, Expert Syst. Appl. 38 (4) (2011) 2923–2932.
[6] W. Choi, S. Hong, W.K. Jeong, Vispark: GPU-accelerated distributed visual computing using spark, SIAM J. Sci. Comput. 38 (5) (2016) S700–S719.
[7] X. Deng, Y. Yao, Decision-theoretic three-way approximations of fuzzy sets, Inf. Sci. 279 (2014) 702–715.
[8] X.Z. Fern, C.E. Brodley, Solving cluster ensemble problems by bipartite graph partitioning, in: Proceedings of the Twenty-First International Conference
on Machine Learning, ACM, 2004, p. 36.
[9] A. Gionis, H. Mannila, P. Tsaparas, Clustering aggregation, ACM Trans. Knowl. Discov. Data 1 (1) (2007) 4.
[10] J.E. Gonzalez, R.S. Xin, A. Dave, D. Crankshaw, M.J. Franklin, I. Stoica, GraphX: graph processing in a distributed dataflow framework, in: 11th USENIX
Symposium on Operating Systems Design and Implementation, vol. 14, OSDI, 2014, pp. 599–613.
[11] S. Gopalani, R. Arora, Comparing apache spark and map reduce with performance analysis using k-means, Int. J. Comput. Appl. 113 (1) (2015).
[12] L. Hubert, P. Arabie, Comparing partitions, J. Classif. 2 (1) (1985) 193–218.
[13] C. Jin, R. Liu, Z. Chen, W. Hendrix, A. Agrawal, A. Choudhary, A scalable hierarchical clustering algorithm using spark, in: 2015 IEEE First International
Conference on Big Data Computing Service and Applications, BigDataService, IEEE, 2015, pp. 418–426.
[14] L. Jing, K. Tian, J.Z. Huang, Stratified feature sampling method for ensemble clustering of high dimensional data, Pattern Recognit. 48 (11) (2015)
3688–3702.
[15] P. Li, Y. Luo, N. Zhang, Y. Cao, Heterospark: a heterogeneous CPU/GPU Spark platform for machine learning algorithms, in: 2015 IEEE International
Conference on Networking, Architecture and Storage, NAS, IEEE, 2015, pp. 347–348.
[16] D. Liang, D. Liu, Systematic studies on three-way decisions with interval-valued decision-theoretic rough sets, Inf. Sci. 276 (2014) 186–203.
[17] W. Lin, Z. Wu, L. Lin, A. Wen, J. Li, An ensemble random forest algorithm for insurance big data analysis, IEEE Access 5 (2017) 16568–16575.
[18] P. Lingras, G. Peters, Applying rough set concepts to clustering, in: Proceedings of the 14th International Conference on Rough Sets, Fuzzy Sets, Data
Mining and Granular Computing, RSFDGrC 2013, Springer, 2012, pp. 23–37.
[19] P. Lingras, R. Yan, Interval clustering using fuzzy and rough set theory, in: IEEE Annual Meeting of the Fuzzy Information Processing Society,
NAFIPS'04, vol. 2, IEEE, 2004, pp. 780–784.
[20] Y. Liu, L. Xu, M. Li, The parallelization of back propagation neural network in mapreduce and spark, Int. J. Parallel Program. 45 (4) (2017) 760–779.
[21] W. Lu, P. Cao, Clustering large scale data set based on distributed local affinity propagation on spark, Int. J. Database Theory Appl. 9 (10) (2016)
241–250.
[22] J. Makhoul, F. Kubala, R. Schwartz, R. Weischedel, et al., Performance measures for information extraction, in: Proceedings of DARPA Broadcast News
Workshop, Herndon, VA, 1999, pp. 249–252.
[23] X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. Tsai, M. Amde, S. Owen, et al., MLlib: machine learning in Apache Spark,
J. Mach. Learn. Res. 17 (1) (2016) 1235–1241.
[24] Y. Ren, C. Domeniconi, G. Zhang, G. Yu, Weighted-object ensemble clustering, in: 2013 IEEE 13th International Conference on Data Mining, ICDM, IEEE,
2013, pp. 627–636.
[25] Y. Ren, C. Domeniconi, G. Zhang, G. Yu, Weighted-object ensemble clustering: methods and analysis, Knowl. Inf. Syst. 51 (2) (2017) 661–689.
[26] E. Rendón, I. Abundez, A. Arizmendi, E.M. Quiroz, Internal versus external cluster validation indexes, Int. J. Comput. Commun. 5 (1) (2011) 27–34.
[27] T. Sarazin, H. Azzag, M. Lebbah, SOM clustering using spark-mapreduce, in: 2014 IEEE International Parallel & Distributed Processing Symposium
Workshops, IPDPSW, IEEE, 2014, pp. 1727–1734.
[28] A. Strehl, J. Ghosh, Cluster ensembles—a knowledge reuse framework for combining multiple partitions, J. Mach. Learn. Res. 3 (Dec) (2002) 583–617.
[29] S. Vega-Pons, J. Ruiz-Shulcloper, A survey of clustering ensemble algorithms, Int. J. Pattern Recognit. Artif. Intell. 25 (03) (2011) 337–372.
[30] P.X. Wang, Y.Y. Yao, CE3: a three-way clustering method based on mathematical morphology, Knowl.-Based Syst. 155 (2018) 54–65.
[31] R.S. Xin, J.E. Gonzalez, M.J. Franklin, I. Stoica, GraphX: a resilient distributed graph system on spark, in: First International Workshop on Graph Data
Management Experiences and Systems, ACM, 2013, p. 2.
[32] R. Xu, D. Wunsch, Survey of clustering algorithms, IEEE Trans. Neural Netw. 16 (3) (2005) 645–678.
[33] Y. Yang, F. Teng, T. Li, H. Wang, H. Wang, Q. Zhang, Parallel semi-supervised multi-ant colonies clustering ensemble based on mapreduce methodology,
IEEE Trans. Cloud Comput. 6 (3) (2015) 857–867.
[34] J. Yao, N. Azam, Web-based medical decision support systems for three-way medical decision making with game-theoretic rough sets, IEEE Trans.
Fuzzy Syst. 23 (1) (2015) 3–15.
[35] Y. Yao, Three-way decision: an interpretation of rules in rough set theory, in: International Conference on Rough Sets and Knowledge Technology,
Springer, 2009, pp. 642–649.
[36] Y. Yao, Three-way decisions and cognitive computing, Cogn. Comput. 8 (4) (2016) 543–554.
[37] H. Yu, A framework of three-way cluster analysis, in: Proceedings of the International Joint Conference on Rough Sets, IJCRS 2017, Springer, 2017,
pp. 300–312.
[38] H. Yu, P. Jiao, Y. Yao, G. Wang, Detecting and refining overlapping regions in complex networks with three-way decisions, Inf. Sci. 373 (2016) 21–41.
[39] H. Yu, G. Wang, T. Li, J. Liang, D. Miao, Y. Yao, Three-Way Decisions: Methods and Practices for Complex Problem Solving, Science Publication, Beijing,
2015.
[40] H. Yu, C. Zhang, G. Wang, A tree-based incremental overlapping clustering method using the three-way decision theory, Knowl.-Based Syst. 91 (2016)
189–203.
[41] M. Zaharia, M. Chowdhury, M.J. Franklin, S. Shenker, I. Stoica, Spark: cluster computing with working sets, HotCloud 10 (10–10) (2010) 95.
[42] M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, I. Stoica, Discretized streams: fault-tolerant streaming computation at scale, in: Proceedings of the
Twenty-Fourth ACM Symposium on Operating Systems Principles, ACM, 2013, pp. 423–438.
[43] B. Zhou, Y. Yao, J. Luo, Cost-sensitive three-way email spam filtering, J. Intell. Inf. Syst. 42 (1) (2014) 19–45.