
Information Sciences 627 (2023) 52–70


How to improve the accuracy of clustering algorithms


Qi Li , Shuliang Wang *, Xianjun Zeng , Boxiang Zhao , Yingxu Dang
School of Computer Science & Technology, Beijing Institute of Technology, Beijing 100081, China

Keywords: Clustering optimization; Improving accuracy; Gravitation

Abstract: Clustering is an important data analysis technique. However, due to the diversity of datasets, each clustering algorithm is unable to produce satisfactory results on some particular datasets. In this paper, we propose a clustering optimization method called HIAC (Highly Improving the Accuracy of Clustering algorithms). By introducing gravitation, HIAC forces objects in the dataset to move towards similar objects, making the ameliorated dataset more friendly to clustering algorithms (i.e., clustering algorithms can produce more accurate results on the ameliorated dataset). HIAC is independent of the clustering principle, so it can optimize different clustering algorithms. In contrast to other similar methods, HIAC is the first to adopt the selective-addition mechanism, i.e., only adding gravitation between valid-neighbors, to avoid dissimilar objects approaching each other. In order to identify valid-neighbors from among all neighbors, HIAC introduces a decision graph, from which the naked eye can observe a clear division threshold. Additionally, the decision graph can assist HIAC in reducing the negative effects that improper parameter values have on optimization. We conducted numerous experiments to test HIAC. The experimental results show that HIAC can effectively ameliorate high-dimensional datasets, Gaussian datasets, shape datasets, datasets with outliers, and overlapping datasets. HIAC greatly improves the accuracy of clustering algorithms, and its improvement rates are far higher than those of similar methods. The average improvement rate is as high as 253.6% (excluding the maximum and minimum). Moreover, its runtime is significantly shorter than that of most similar methods. More importantly, with different parameter values, the advantages of HIAC over similar methods are always maintained. The code of HIAC is available at https://github.com/qiqi12/HIAC.

1. Introduction

Clustering, which categorizes objects based on the similarity between them, is a crucial data analysis technology. Clustering has
been widely used in many fields, such as pattern recognition, data compression, image segmentation, time series analysis, information
retrieval, spatial data analysis, biomedical research and so on [28].
Due to the diversity of the datasets to be analyzed, clustering is not a simple task [13]. The existing clustering algorithms fail to
produce accurate clustering results on some particular datasets. Most researchers [29,16,35,21,12,11,15,14,38,5,37,27,30] try to
improve the principles of the existing clustering algorithms to obtain more robust performance but introduce new flaws. For example,
Extreme clustering [29] addresses the shortcoming of DPC [1] on density-unbalanced datasets, but it introduces an additional input
parameter. In brief, purely improving the principles of clustering algorithms is not the optimal optimization strategy. A few researchers

* Corresponding author.
E-mail addresses: liqi_bitss@163.com (Q. Li), slwang2011@bit.edu.cn (S. Wang), XianJTeng@163.com (X. Zeng), zhaobx9676@gmail.com
(B. Zhao), dyxcome@163.com (Y. Dang).

https://doi.org/10.1016/j.ins.2023.01.094
Received 10 June 2022; Received in revised form 11 January 2023; Accepted 15 January 2023
Available online 20 January 2023
0020-0255/© 2023 Elsevier Inc. All rights reserved.

try to optimize the clustering performance by ameliorating the datasets to be analyzed [32,31,2,22,36,3,33,13,25]. These methods (hereinafter referred to as ameliorate-dataset methods) add gravitation between neighbors and force neighbors closer to each other until the dataset becomes friendly to clustering algorithms. Importantly, some ameliorate-dataset methods are independent
of the clustering process, so different clustering algorithms can produce more accurate clustering results on ameliorated datasets.
Obviously, ameliorating dataset is a more effective clustering optimization strategy.
However, the existing ameliorate-dataset methods have some prominent defects. Specifically, 1.) Once datasets are slightly
complex, they will wrongly force adjacent clusters or dissimilar objects (i.e., outliers and normal objects) closer together, as shown in
Fig. 1. 2.) They dynamically find new neighbors during object motion, resulting in the loss of the original similarity relationship
between objects. Again, as shown in Fig. 1, the red objects that were once neighbors are no longer neighbors in the ameliorated
distribution. To address the above defects, we propose a novel ameliorate-dataset method called HIAC (Highly Improving the Accuracy
of Clustering algorithms). We summarize the main contributions of this work as follows:

1. We come up with the selective-addition mechanism to avoid outliers and normal objects or adjacent clusters approaching
each other. Specifically, we identify the neighbors with high-similarity (Hereinafter referred to as valid-neighbors) from all
neighbors and treat the remaining neighbors as invalid-neighbors. Gravitation is only selected to be added between valid-
neighbors. In contrast to the existing ameliorate-dataset methods, which adopt the full-addition mechanism (i.e., adding gravi­
tation between all neighbors), HIAC effectively cuts off the gravitational transfer between dissimilar neighbors.
2. We introduce the decision graph to provide a reasonable basis for identifying valid-neighbors. Specifically, we assign
similarity-related weights between neighbors. By counting all weights, we generate a weight probability distribution as the decision
graph. We can quickly determine which weights match the global pattern based on the regular probability differences in the de­
cision graph, and then identify the neighbor pairs whose weights match the global pattern as valid-neighbors. More importantly,
the decision graph can assist HIAC in reducing the negative effects that improper parameter values have on optimization.
3. We design the valid-neighbors locking strategy to preserve the original similarity relationship between objects. Specif­
ically, during the movement of the object, we no longer re-find timely valid-neighbors but instead force each object to continuously
move closer to its initial valid-neighbors. As a result, not only the running time of HIAC can be reduced, but also the original
similarity relationship between objects is maintained (i.e., neighbors in the original dataset are still neighbors in the ameliorated
dataset).
4. We conducted a series of experiments to verify the robustness and advantages of HIAC. The experimental results show that
HIAC is robust to Gaussian datasets, shape datasets, and high-dimensional datasets; HIAC successfully shrinks the clusters sur­
rounded by outliers, and makes the boundary-objects between adjacent clusters move toward their own cluster core; On 8 real-
world datasets, HIAC successfully improves the accuracies of clustering algorithms with an average improvement rate of
253.6% (excluding the maximum and minimum), and its optimization performance is far superior to the existing methods; With different
parameter values, HIAC still outperforms the existing methods; HIAC runs significantly faster than most existing methods.
5. We published our code. We make the code and datasets publicly available at https://github.com/qiqi12/HIAC.

The remainder of the paper is organized as follows. The next section is about related works. Section 3 introduces the theory of HIAC.
Section 4 verifies the robustness and effectiveness of HIAC. The conclusion is reported in Section 5.

2. Related works

After Gravitational clustering [34] introduced gravitation into clustering, some researchers have treated data as particles rather than constant objects. They try to force particles to move by gravitation in order to improve clustering results. So far, gravitation-based clustering
algorithms have been proposed successively, and they consist of the gravitation process (In this paper, we refer to the gravitation
process as the ameliorate-dataset method) and the clustering process. Most ameliorate-dataset methods are nested within the clustering process. Only a few are independent of the clustering process. In this paper, we focus on the
ameliorate-dataset methods independent of the clustering process. Compared with the ameliorate-dataset methods nested in the

Fig. 1. Defects of existing ameliorate-dataset methods: The existing ameliorate-dataset methods may wrongly force outliers and normal objects
or adjacent clusters approaching each other, or lose the original similarity relationship between objects.


clustering process, the ameliorate-dataset methods independent of the clustering process are usually universal and can optimize
different clustering algorithms.
The ameliorate-dataset methods independent of the clustering process: Newtonian [3] assumes that all datasets follow the
multivariate Gaussian distribution. It forces objects to move closer to cluster centers until the ratio, which is the sum of the distances
between objects in the ameliorated dataset divided by the sum of the distances between objects in the original dataset, reaches a
threshold. Eventually, the ameliorated clusters become well-distributed for easy identification. These well-distributed clusters also
make it easy to determine the number of clusters. Herd [33] does not pay attention to the magnitude of gravitation but only the di­
rection of gravitation. It adds the unit vector of the resultant force between neighbors, forcing objects to move towards the center of
neighbors. In each iteration of Herd, the velocity grows in the direction of gravitation. To prevent objects from moving outside the
cluster, it sets an upper bound on the velocity. After multiple iterations, Herd makes the datasets to be analyzed friendly to
clustering algorithms. HIBOG [13] reconstructs the gravitational model so that gravitation is related to the distribution of objects.
HIBOG weakens the strength of gravitation between distant neighbors, and forces objects closer to the cluster core. Since the
reconstructed gravitational model can be calculated by the matrix, HIBOG runs in less time. Different from the methods mentioned
above, SBCA [25] no longer takes the object as the smallest unit of analysis, but the grid as the unit of analysis. It divides the dataset
into multiple grids, and forces grids to move along the gradient of density until the average distance between objects is below a
predetermined threshold.
The ameliorate-dataset methods nested in the clustering process: LGC [32] takes both the magnitude and direction of local
resultant force into consideration, and designs two local measures, centrality (CE) and coordination (CO), to determine whether an
object is at the boundary or the core of the cluster. Compared with traditional density-based clustering algorithms, LGC does not
require a threshold to find density core regions when identifying clusters. DCLRF [31] also uses CE to measure local resultant force. It
extracts all core-objects in the dataset, and then produces clustering results based on the natural neighborhood structure information of
core-objects. With the help of gravitation, DCLRF can accurately identify the number of clusters in the datasets with complex distri­
bution. OGC [2] does not apply gravitation to all objects, but only to clustering centers. Non-clustering centers are the source of
gravitation. Under gravitation of non-clustering centers, all clustering centers iteratively move. OGC changes the clustering center
iteration rule of K-means. SGISA [22] is applied to the image segmentation, which forces each pixel to move in the feature space. After
movement, two pixels are grouped into a cluster if the distance between them is below a predetermined threshold. Unlike the methods
above, LGBD [36] and NHOD [40] identify outliers surrounding clusters by gravitation. LGBD gradually expands the neighbors of each
object, so the gravitational variation ratios of outliers, boundary-objects, and core-objects show obvious differences. It identifies the
objects with the most variation as outliers. NHOD [40] adds Coulomb forces between objects. Coulomb forces can reveal the differ­
ences between different dimensions of an object through vector projections in each dimension, which allows NHOD to effectively
identify outliers in high-dimensional datasets. GHC [39] introduces gravitation into the hierarchical clustering process. It constructs a
gravitational relationship graph and divides the graph into several subgraphs according to the gravitational influence coefficient. The
final clustering results are then obtained by GHC iteratively merging these subgraphs. To deal with large-scale datasets, PGCGP [7] first
divides the dataset into a number of grids and then shrinks the object distribution in each grid individually.

3. The proposed method: HIAC

3.1. Overview of HIAC

There are some datasets (called good datasets) on which most clustering algorithms perform well, but there are other datasets (called bad datasets) on which few clustering algorithms are effective [13]. The reason is that, in good datasets, the inter-
cluster distance is significantly larger than the intra-cluster distance, making it easier for clustering algorithms to identify clusters;
while in bad datasets, the inter-cluster distance is similar to the intra-cluster distance, and even the boundaries of clusters are
completely overlapping. Obviously, by increasing the difference between the inter-cluster distance and the intra-cluster distance, the
bad datasets can be ameliorated into good datasets, and clustering algorithms can produce more accurate results on ameliorated
datasets.
In this paper, we propose a novel ameliorate-dataset method called HIAC to ameliorate the dataset. HIAC still adds gravitation
between neighbors, forcing similar objects closer together. However, different from the existing methods, HIAC divides neighbors into
valid-neighbors and invalid-neighbors, and locks the original neighbor relationship during object motion. HIAC consists of the
following two steps:

• Step 1 (Identifying valid-neighbors). HIAC constructs a topology graph for the dataset, and it assigns weights to the edges of the
topology graph according to similarity. Based on the turning area of the weight probability distribution curve, HIAC clips invalid-
edges from the topology graph. Finally, the neighbors connected by the remaining edges are identified as valid-neighbors.
• Step 2 (Objects motion). HIAC only adds gravitation between valid-neighbors, and it forces objects to move together according to
Newton’s laws of motion. During objects motion, HIAC no longer looks for new valid-neighbors, ensuring that objects continue to
move toward their initial valid-neighbors. Finally, valid-neighbors are closer to each other (i.e., the intra-cluster distance de­
creases), and non-neighbors are far away from each other (i.e., the inter-cluster distance increases). The datasets ameliorated by
HIAC become more friendly to clustering algorithms.


3.2. Introduction of notations

In this paper, the descriptions of commonly used notations are recorded in Table 1.

3.3. Identifying valid-neighbors

Let us show an example, as shown in Fig. 2. The dataset consists of 2 normal clusters (marked in yellow and red) and 2 outliers (marked in grey). For existing ameliorate-dataset methods, gravitations are added between all neighbors (here, we take k (= 2)-nearest neighbors as an example), as shown in Fig. 2(A). As a result, there are unreasonable gravitations between clusters and between outliers and normal objects (see the dotted arrows for details), which will inevitably lead to unreasonable proximity. Obviously, gravitation should not be added between all neighbors. In this paper, if the gravitation between neighbors is necessary, we call such
neighbors valid-neighbors; otherwise, we call them invalid-neighbors. From Fig. 2(A), we can observe that the distance between
invalid-neighbors is clearly much greater than the distance between valid-neighbors. Next, we will rely on this phenomenon to identify
all valid-neighbors for each object.

3.3.1. Constructing the topology graph


First, we construct a topology graph for the dataset. In the topology graph, each object acts as a vertex, and the k-nearest neighbors of all vertices are connected by edges. k is one of the input parameters of HIAC. Each edge is assigned a weight; the edge weight between $o_i$ and $o_j$ is defined as

$$w_{ij} = -\|o_i - o_j\|_2 + \max_{o_p, o_q \in O} \|o_q - o_p\|_2 \tag{1}$$

The closer the objects (vertices) are, the larger the edge weight between them; the farther the objects (vertices) are, the smaller the
edge weight between them. For the dataset in Fig. 2, its topology graph (k = 2) is shown in Fig. 2(B). Next, instead of identifying valid-
neighbors directly, we transform identifying valid-neighbors into identifying invalid-edges (i.e., the edges between invalid-neighbors).
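To make the topology-graph construction concrete, the following Python sketch (ours, not the authors' released code) builds the k-nearest-neighbor graph with scikit-learn and assigns each edge the weight of formula (1); the helper names are our own:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_edge_weights(X, k):
    # For object i, idx[i, j] is its j-th nearest neighbor and w[i, j] the
    # weight of that edge: w_ij = -||o_i - o_j||_2 + max_{p,q} ||o_q - o_p||_2.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, idx = nn.kneighbors(X)
    dist, idx = dist[:, 1:], idx[:, 1:]        # drop the self-neighbor in column 0
    diff = X[:, None, :] - X[None, :, :]       # all pairwise differences (O(N^2) memory)
    max_pairwise = np.sqrt((diff ** 2).sum(-1)).max()
    return idx, -dist + max_pairwise

# toy usage: two well-separated Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
idx, w = knn_edge_weights(X, k=5)
print(w.shape)                                 # (100, 5): one weight per object per neighbor

Edges inside a cluster receive large weights, while the rare long edges between clusters receive small weights, which is exactly what the decision graph of the next subsection exploits.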

3.3.2. Clipping invalid-edges


Obviously, an invalid-edge has two characteristics. One is that invalid-edges are far fewer than valid-edges (the reason is that the objects in the dataset are clustered, so the majority of neighbor pairs are valid-neighbors); the other is that its weight is smaller than that of valid-edges (the reason is that the distance between invalid-neighbors is greater than the distance between valid-neighbors). In short, the appearance of an invalid-edge in the topology graph is a small-probability event. Therefore, we merely need to generate the edge weight probability distribution curve of the topology graph to identify this small-probability event, and then all invalid-edges can be determined. In order to reduce the computational complexity, we replace the edge weight with the object weight; the object weight of $o_i$ is defined as

$$W_i = \frac{\sum_{j \in \phi} w_{ij}}{k}, \tag{2}$$

in which $\phi$ is the set of k-nearest neighbors of $o_i$. Since the object weight is the mean value of the edge weights, the object weight and the edge weight follow a similar probability distribution, but the number of object weights is only $\frac{1}{k}$ of the number of edge weights. We divide the value range of the object weights into several equal-length intervals of length $dc = \frac{(\max(W) - \min(W)) \times 10}{N}$.
These intervals are the smallest unit of analysis, and the probability of the g-th interval is calculated by
$$P(\min(W) + g \cdot dc) = \frac{\sum_{j=1}^{N} \varphi_g(W_j)}{N}, \tag{3}$$

Table 1
The descriptions of commonly used notations in HIAC.

Notation              Description
O                     The dataset to be analyzed.
N                     The number of objects in O.
$o_i$                 The i-th object in O.
$w_{ij}$              The edge weight between $o_i$ and $o_j$.
$W_i$                 The object weight of $o_i$.
$\vec{o}_i$           The vector form of $o_i$.
$\vec{o}_{ij}$        The j-th valid-neighbor of $\vec{o}_i$.
$\vec{o}_i^{\,*}$     The nearest valid-neighbor of $\vec{o}_i$.
$\|o_i - o_j\|_2$     The Euclidean distance between $o_i$ and $o_j$.


Fig. 2. An example of HIAC: A) shows the gravitation that traditional ameliorate-dataset methods add between objects; B) shows the topology
graph constructed by HIAC; C) shows the decision graph generated by HIAC based on the topology graph; D) shows valid-neighbors determined by
HIAC according to the decision graph, and it only adds gravitation between valid-neighbors; E) shows the change in the distribution of objects under
gravitation, where the intra-cluster distance gradually decreases, the inter-cluster distance gradually increases, and the outliers never move.

in which $\varphi_g(W_j) = 1$ if $\min(W) + (g - 1) \cdot dc \leq W_j < \min(W) + g \cdot dc$; otherwise, $\varphi_g(W_j) = 0$. The object weight probability distribution
curve of the topology graph in Fig. 2(B) is shown in Fig. 2(C). We name the object weight probability distribution curve as the decision
graph. From the decision graph, we can observe a clear turning area, as indicated by the red arrow. On the left side of the turning area,
the weight is small and the probability is low. On the right side of the turning area, the weight is large and the probability increases
sharply. Obviously, the left side of the turning area is a small probability event. Therefore, the turning area can be regarded as a
threshold, and the edge whose weight is less than the threshold can be regarded as an invalid-edge. The decision graph provides a
visual and reasonable basis, which makes it easy to identify invalid-edges. After clipping all invalid-edges, the neighbors connected by the remaining edges are valid-neighbors. Algorithm 1 describes in detail how to identify valid-neighbors. HIAC only adds gravitation between valid-neighbors, as shown in Fig. 2(D). Compared with Fig. 2(A), there is no gravitational attraction between clusters
and between outliers and normal objects in Fig. 2(D).

Algorithm 1:Identifying valid-neighbors

Inputs: O, k
Output: the valid-neighbor matrix
1. Connect all k-nearest neighbors with edges.
2. Assign weight to each edge according to the formula 1.
3. for oi in O do
4. Calculate the object weight of oi according to the formula 2.
5. end for
6. Divide the object weight value into N/10 intervals.
7. for g in range(N/10) do
8. Calculate the probability of the g-th interval according to the formula 3.
9. end for
10. Plot the probability distribution curve of object weights and treat it as the decision graph.
11. Identify the turning area of the decision graph as division threshold.
12. Divide k-nearest neighbors into valid-neighbors and invalid-neighbors according to the threshold.
13. Store valid-neighbors of all objects to the valid-neighbor matrix.
14. Return: the valid-neighbor matrix.
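As a companion to Algorithm 1, here is a minimal, self-contained sketch of the decision graph: it averages edge weights into object weights (formula (2)), bins them into N/10 equal-length intervals (formula (3)), plots the resulting curve, and clips every edge whose weight falls left of a threshold read off the turning area. The concrete threshold value below is a placeholder; the authors identify the turning area visually.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),        # cluster 1
               rng.normal(3, 0.3, (50, 2)),        # cluster 2
               [[1.5, 8.0], [8.0, 1.5]]])          # two outliers
N, k = len(X), 5

# edge weights, formula (1)
dist, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
dist, idx = dist[:, 1:], idx[:, 1:]
w = -dist + np.max(np.linalg.norm(X[:, None] - X[None, :], axis=-1))

# object weights, formula (2), binned into N/10 intervals, formula (3)
W = w.mean(axis=1)
counts, edges = np.histogram(W, bins=max(N // 10, 1))
plt.plot(edges[:-1], counts / N)                   # the decision graph
plt.xlabel("object weight"); plt.ylabel("probability"); plt.show()

# clip invalid-edges whose weight lies left of the turning area
threshold = edges[1]                               # placeholder: use the observed turning point
valid = w >= threshold                             # boolean valid-neighbor mask
print("valid-neighbor pairs kept:", int(valid.sum()), "of", w.size)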

3.4. Objects motion

In this subsection, we treat each object as a vector ($o_i$ is rewritten as $\vec{o}_i$), so that the defined gravitation is directional.

3.4.1. Gravitation calculation


Let $\vec{o}_i$ have $s$ valid-neighbors ($s \leq k$); the gravitation between $\vec{o}_i$ and its $j$-th valid-neighbor ($j \leq s$) is defined as

$$\vec{F}_{ij} = G \, \frac{\|\vec{o}_i^{\,*} - \vec{o}_i\|_2 \, (\vec{o}_{ij} - \vec{o}_i)}{\|\vec{o}_{ij} - \vec{o}_i\|_2^{\,2}} \tag{4}$$

in which $G$ is the gravitational constant, $G = \frac{1}{N}\sum_{j=1}^{N} \|\vec{o}_j^{\,*} - \vec{o}_j\|_2$, and it is used to control the magnitude of gravitation. Since $\vec{o}_i$ is simultaneously attracted by the gravitations of its $s$ valid-neighbors, the total gravitation applied to it is

$$\vec{F}_i = \sum_{j=1}^{s} \vec{F}_{ij} = G \sum_{j=1}^{s} \frac{\|\vec{o}_i^{\,*} - \vec{o}_i\|_2 \, (\vec{o}_{ij} - \vec{o}_i)}{\|\vec{o}_{ij} - \vec{o}_i\|_2^{\,2}} \tag{5}$$
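A small numerical sketch of formulas (4) and (5), with our own helper names; valid_idx[i] stands for the list of valid-neighbors of $o_i$ produced by Algorithm 1 and is assumed here to be sorted by distance, so its first entry is the nearest valid-neighbor $\vec{o}_i^{\,*}$.

import numpy as np

def gravitational_constant(X, valid_idx):
    # G = (1/N) * sum_j ||o_j^* - o_j||_2, using each object's nearest valid-neighbor
    d = [np.linalg.norm(X[nbrs[0]] - X[j]) if len(nbrs) else 0.0
         for j, nbrs in enumerate(valid_idx)]
    return float(np.mean(d))

def total_gravitation(X, valid_idx, G):
    # Formula (5): sum the per-neighbor forces of formula (4) for every object.
    F = np.zeros_like(X, dtype=float)
    for i, nbrs in enumerate(valid_idx):
        if len(nbrs) == 0:                          # e.g. an outlier: no gravitation applied
            continue
        scale = np.linalg.norm(X[nbrs[0]] - X[i])   # ||o_i^* - o_i||_2
        diff = X[nbrs] - X[i]                       # o_ij - o_i, one row per valid-neighbor
        sq = (diff ** 2).sum(axis=1)                # ||o_ij - o_i||_2^2
        F[i] = G * (scale * diff / sq[:, None]).sum(axis=0)
    return F

# toy usage: four points, each keeping its single nearest point as valid-neighbor
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
valid_idx = [[1], [0], [3], [2]]
print(total_gravitation(X, valid_idx, gravitational_constant(X, valid_idx)))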

3.4.2. Motion model


According to physical theory, the object $\vec{o}_i$ moves in the direction of the total gravitation. During motion, due to the change of position, the direction and magnitude of the total gravitation applied to $\vec{o}_i$ continuously change. Therefore, the motion of $\vec{o}_i$ is a curvilinear motion with variable acceleration. In order to reduce the difficulty of analysis, we draw on the idea of micro-element analysis and divide the motion of $\vec{o}_i$ into multiple time-segments of equal length. During each time-segment, we assume that the motion of $\vec{o}_i$ follows the uniformly accelerated rectilinear motion law (i.e., the magnitude and direction of the total gravitation applied to $\vec{o}_i$ are constant). In addition, we lock the initial valid-neighbors. That is, even if the k-nearest neighbors of $\vec{o}_i$ change during different time-segments, we no longer re-find new valid-neighbors, thus preserving the original similarity relationship between objects.
Let the initial velocity $v_0 = 0$ and the mass $m_i = 1$ during each time-segment, so the uniformly accelerated rectilinear motion model can be simplified as $\vec{S} = \frac{1}{2}\vec{F}t^2$. After $d$ time-segments ($t$ is the duration of each time-segment), $\vec{o}_i$ is turned into

→’ d →

oi = →
oi + Sli
l=1

d →

=→
oi + T Fli
l=1 (6)
⃦→ → ( )

s ⃦o* − ol ‖ → →l
∑d ∑ i i 2 oij − oi
=→
oi + T Gl ⃦
l=1 ⃦→ →l 2
j=1 ⃦oij − oi ‖2


, in which T = 12t 2 , oli is the →
oi in the l-th time-segment, and Gl is the gravitational constant in the l-th time-segment. T and d are the
other 2 parameters of HIAC. When all objects in the dataset O turn according to the formula (6), the dataset O will be ameliorated.
Algorithm 2 describes in details how gravitation ameliorates the dataset. As shown in Fig. 2(E), after two time-segments, the intra-
cluster distance decreases and the inter-cluster distance becomes larger, but outliers never move.

Algorithm 2: Objects motion

Inputs: O, T, d, the valid-neighbor matrix
Output: the ameliorated dataset O′
1. Data normalization (optional).
2. Get the valid-neighbor information of each object from the valid-neighbor matrix.
3. for l in range(d) do
4. Calculate the gravitational constant in the l-th time-segment.
5. for oi in O do
6. Calculate the total gravitation Fi applied to oi according to the formula 5.
7. end for
8. for oi in O do
9. oi = oi + Fi · T
10. end for
11. end for
12. O′ = ∅
13. for i in range(N) do
14. O′.append(oi)
15. end for
16. Return: O′
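The following self-contained sketch mirrors Algorithm 2 on a toy dataset. For brevity it locks each object's three nearest neighbors as its valid-neighbors instead of running the decision-graph clipping of Algorithm 1, and the parameter values T and d are illustrative only.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def ameliorate(X, valid_idx, T=0.5, d=10):
    # Move every object towards its locked valid-neighbors for d time-segments (formula (6)).
    X = X.astype(float).copy()
    for _ in range(d):
        nearest = np.array([nbrs[0] for nbrs in valid_idx])
        G = np.mean(np.linalg.norm(X[nearest] - X, axis=1))     # G^l of this time-segment
        F = np.zeros_like(X)
        for i, nbrs in enumerate(valid_idx):
            scale = np.linalg.norm(X[nbrs[0]] - X[i])           # ||o_i^* - o_i^l||_2
            diff = X[nbrs] - X[i]
            sq = (diff ** 2).sum(axis=1) + 1e-12                # guard against zero distance
            F[i] = G * (scale * diff / sq[:, None]).sum(axis=0)
        X += F * T                                              # step 9 of Algorithm 2
    return X

# toy usage: two Gaussian clusters, valid-neighbors = 3 nearest neighbors of each object
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (60, 2)), rng.normal(4, 0.5, (60, 2))])
_, idx = NearestNeighbors(n_neighbors=4).fit(X).kneighbors(X)
valid_idx = [list(row[1:]) for row in idx]                      # drop self, keep 3 neighbors
X_new = ameliorate(X, valid_idx, T=0.5, d=10)
print(X[:60].std(), "->", X_new[:60].std())                     # the first cluster becomes more compact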

3.5. Advantage analysis

Based on the above principles, HIAC has the following advantages (We will experimentally verify these advantages in Section 4.3):

1. HIAC is immune to outliers: Since outliers are far away from normal objects and the number of outliers is much less than that of
normal objects, the outlier edge (i.e., the edge between an outlier and another object) is a small-probability event in the topology


graph. Therefore, outlier edges are treated as invalid-edges and clipped. HIAC does not apply gravitation to outliers, which can
prevent outliers from approaching normal objects.
2. HIAC can correctly shrink adjacent clusters: In overlapping datasets, the distance between boundary-objects in different clusters
is still larger than the distance between boundary-objects in the same cluster. Furthermore, the number of boundary-objects is
much less than the number of core-objects (i.e., the objects inside clusters). Therefore, the edge between adjacent clusters is a small
probability event in the topology graph, HIAC hardly adds gravitation between adjacent clusters. More importantly, since HIAC
locks the initial valid-neighbors, the gravitation applied to each boundary-object is always towards its own cluster core, which can
drive adjacent clusters away from each other.
3. HIAC is insensitive to the parameter k: Once k is set large, objects will have distant neighbors. If objects move towards distant
neighbors, then the ameliorated dataset will be less conducive to clustering. Fortunately, the decision graph of HIAC is insensitive
to the parameter k. With different k values, the object weight probability distribution curves are similar. The reason is that the
object weight is the mean of the edge weights, and the larger the k value, the more stable the mean. Therefore, even if k is set large,
with the help of the decision graph, HIAC can clip extra invalid-edges to ensure that objects do not move towards distant neighbors.
In conclusion, the decision graph can counteract the adverse effects of inappropriate k values on HIAC.

4. Experiments

4.1. Experimental settings

In this subsection, we describe the datasets, evaluation metrics, clustering algorithms to be optimized, baseline methods and
parameter settings.

4.1.1. Datasets
We select several common synthetic datasets (from clustering basic benchmark [9]) and real-world datasets (from UCI-dataset
repository [8]) to test the proposed method. Synthetic datasets are described in Table 2. Real-world datasets are described as follows:

• Breast cancer dataset records patient cases classified as benign and malignant tumors based on 30 attributes including mass
thickness, average cell size and average radius.
• Banknote authentication dataset records some attributes of banknotes to determine whether they are counterfeit.
• Each object in the Digit dataset corresponds to a handwritten digit represented by 8 × 8 pixels, and each digit is an integer from 0 to 9.
• Iris dataset records 150 iris objects, which are divided into 3 categories (setosa, versicolour and virginica) according to the length
and width of the calyx and the length and width of the petals.
• Seeds dataset records the information of three categories of seeds (Kama, Rosa and Canadian), which are expressed by 7 attributes.
• Teaching assistant evaluation dataset records the scores of teaching assistants over 5 semesters, and these teaching assistants are
divided into 3 categories (low, medium and high).

Table 2
The details of synthetic datasets.
Dataset Number Dimension Clusters Description

Dim32 1024 32 16 high-dimensional dataset


Dim64 1024 64 16 high-dimensional dataset
Dim128 1024 128 16 high-dimensional dataset
Dim256 1024 256 16 high-dimensional dataset
Dim512 1024 512 16 high-dimensional dataset
Dim1024 1024 1024 16 high-dimensional dataset
A1 3000 2 20 Gaussian dataset
A2 5250 2 35 Gaussian dataset
A3 7500 2 50 Gaussian dataset
Flame 240 2 2 shape dataset
Aggregation 788 2 7 shape dataset
Heartshapes [6] 213 2 3 shape dataset
T7.10 k 8764 2 9 shape dataset
Ls3 1735 2 6 shape dataset
Compound-part 142 2 1 dataset with outliers
R15-outlier 627 2 15 dataset with outliers
Asymmetric-outlier [24] 1046 2 5 dataset with outliers
Adj-1 82 2 2 overlapping dataset
Adj-2 579 2 3 overlapping dataset
Adj-3 187 2 2 overlapping dataset
S1 5000 2 15 overlapping dataset
S2 5000 2 15 overlapping dataset
S3 5000 2 15 overlapping dataset
S4 5000 2 15 overlapping dataset


• Wireless indoor localization dataset records the signal strengths of different Wi-Fi signals collected in 4 rooms, and each object comes from one room.
• Wine dataset records chemical analysis of different varieties of wines from Italy. The dataset can be used to distinguish wine
categories.

4.1.2. Evaluation metrics


We compare the accuracy of clustering algorithms between original datasets and ameliorated datasets. If the accuracy on
ameliorated datasets is greater than the accuracy on original datasets, then clustering algorithms will be effectively optimized by
ameliorate-dataset methods. Two approaches are used to measure the clustering accuracy: 1) We use NMI index to compute the
difference between the ground-truth label and the clustering result,
$$NMI(U, V) = \frac{\sum_{i=1}^{|U|} \sum_{j=1}^{|V|} P(i, j) \log \frac{P(i, j)}{P(i) P'(j)}}{\sqrt{\sum_{i=1}^{|U|} P(i) \log P(i) \times \sum_{j=1}^{|V|} P'(j) \log P'(j)}},$$

in which U is the ground-truth label, V is the clustering result, $P(i) = |U_i|/N$, $P'(j) = |V_j|/N$, and $P(i, j) = |U_i \cap V_j| / N$. The NMI's range is

[0, 1]. The closer it is to 1, the more accurate the clustering result is. 2) We visualize the dataset and mark objects in different colors
based on the clustering result. The more objects in the same cluster are marked in the same color, and the fewer objects in different
clusters are marked in the same color, the more accurate the clustering result is.
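In practice the NMI above can be computed directly with scikit-learn; the 'geometric' averaging option corresponds to the square-root normalization in the formula. A minimal example with made-up label vectors:

from sklearn.metrics import normalized_mutual_info_score

labels_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]      # ground-truth label U
labels_pred = [0, 0, 1, 1, 1, 1, 2, 2, 0]      # clustering result V

nmi = normalized_mutual_info_score(labels_true, labels_pred,
                                   average_method='geometric')
print(round(nmi, 3))                            # 1.0 would mean a perfect match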
Alternatively, we visualize both original datasets and ameliorated datasets. If the inter-cluster distance increases and the intra-
cluster distance decreases, then ameliorated datasets will be more friendly to clustering algorithms than original datasets.

4.1.3. Clustering algorithms to be optimized


We mainly select three clustering algorithms to evaluate the optimization effect of ameliorate-dataset methods, namely K-means,

Fig. 3. Process of HIAC ameliorating Dim32 (row 1), Dim64 (row 2), Dim128 (row 3), Dim256 (row 4), Dim512 (row 5), Dim1024 (row 6)
datasets: These datasets are high-dimensional datasets with 32, 64, 128, 256, 512, and 1024 dimensions, respectively. After being ameliorated by
HIAC, the compactness of clusters become significantly higher, and the inter-cluster distance increases significantly. Therefore, HIAC is robust to the
object dimension.


Agglomerative, and DPC. We select them because they are common and representative clustering algorithms. More importantly, their input parameters are independent of the object distribution, so we can objectively evaluate whether the improvement in clustering accuracy comes solely from the ameliorated distribution.

4.1.4. Baseline methods and parameterization


Among ameliorate-dataset methods, HIBOG, Newtonian, Herd, and SBCA are of the same kind as HIAC: they are all independent of the clustering process, so they can optimize different clustering algorithms by ameliorating datasets. We will compare HIAC with HIBOG, Newtonian, Herd and SBCA. We execute each method within successive parameter intervals and then pick its best result; a minimal grid-search sketch of this procedure follows the list. These parameter intervals are listed below.

• HIAC: the interval of T is from 0.1 to 2, the interval of k is from 5 to 25, and the interval of d is from 1 to 40.
• SBCA: the interval of k is from 5 to 200, and hyperparameters are set to the values recommended by the paper [25].
• Herd: the interval of threshold is from 0.1 to 1, and the interval of max is from 1 to 100. These intervals cover the range of parameter
values in the paper [33].
• Newtonian: the interval of δt is from 0.01 to 0.2, because the paper [3] sets it to a very small value.
• HIBOG: the interval of T is from 0.1 to 0.5, the interval of k is from 5 to 25, and the interval of d is from 1 to 10, as recommended by the paper [13].
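The parameter selection above amounts to a small grid search: run the ameliorate-dataset method for each parameter combination in its interval, cluster the result, and keep the best NMI. The sketch below is ours; ameliorate() is an identity placeholder standing in for any of the methods (for example the HIAC implementation from the repository), and the grid steps are illustrative.

import itertools
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import normalized_mutual_info_score

X, y = load_iris(return_X_y=True)

def ameliorate(X, k, T, d):
    return X          # placeholder: substitute the real HIAC (or HIBOG, Herd, ...) call

grid = itertools.product(range(5, 26, 5),            # k: 5 .. 25
                         np.arange(0.1, 2.1, 0.5),   # T: 0.1 .. 2
                         range(1, 41, 10))           # d: 1 .. 40
best = max(
    normalized_mutual_info_score(
        y, KMeans(n_clusters=3, n_init=10, random_state=0)
             .fit_predict(ameliorate(X, k, T, d)))
    for k, T, d in grid
)
print("best NMI over the grid:", round(best, 3))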

4.2. Robustness experiments

Clustering is an unsupervised task. Before clustering, clusters in the dataset to be analyzed are unknown. Therefore, it is necessary
to verify the robustness of HIAC to diverse datasets. In this subsection, we will test HIAC on high-dimensional datasets, Gaussian
datasets, and shape datasets.

4.2.1. High-dimensional datasets


The higher the dimension, the more difficult it is for clustering algorithms to measure the similarity between objects. Many bad
datasets are high-dimensional. Here, HIAC is tested on a set of common high-dimensional datasets, namely Dim32,
Dim64, Dim128, Dim256, Dim512, and Dim1024, as shown in Fig. 3. These datasets have 32, 64, 128, 256, 512, and 1024 feature
dimensions, respectively. The first column of Fig. 3 shows their original distribution. Columns 2 to 4 of Fig. 3 show their ameliorated
distribution in the 3 time-segments of HIAC, respectively (Note: PCA is used to reduce dimensionality before visualization). In the
original distribution (column 1), the size of the clusters in the first 2 datasets is large, indicating that they have large intra-cluster
distance. Moreover, there are some clusters that are close to each other (see the red box for details), indicating that their inter-
cluster distance is small. After 3 time-segments of HIAC, we can observe that the size of large-clusters shrinks significantly, and
clusters that were close to each other become farther away. HIAC successfully ameliorates these high-dimensional datasets, making
them more friendly to clustering algorithms. Therefore, HIAC is robust to the object dimension.
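For reference, the PCA projection used to visualize the high-dimensional datasets can be reproduced along the following lines. The data here are a synthetic stand-in for the Dim-style datasets, and X_ameliorated is a placeholder for HIAC's output.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# stand-in for a Dim-style dataset: 16 Gaussian clusters in 32 dimensions
rng = np.random.default_rng(0)
centers = rng.uniform(0, 10, size=(16, 32))
X_high = np.vstack([c + rng.normal(0, 0.4, (64, 32)) for c in centers])
X_ameliorated = X_high            # placeholder: replace with the dataset ameliorated by HIAC

pca = PCA(n_components=2).fit(X_high)      # fit once so both views share the same axes
fig, axes = plt.subplots(1, 2, figsize=(8, 4))
for ax, data, title in zip(axes, (X_high, X_ameliorated), ("original", "ameliorated")):
    proj = pca.transform(data)
    ax.scatter(proj[:, 0], proj[:, 1], s=4)
    ax.set_title(title)
plt.show()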

Fig. 4. The original distribution (A, B, C) and the distribution ameliorated by HIAC (D, E, F) of A1 (A, D), A2 (B, E), A3 (C, F) datasets: These
datasets are Gaussian datasets. After being ameliorated by HIAC, all clusters become more compact, and the cluster boundaries are clearer.
Therefore, HIAC is robust to the Gaussian distribution.


4.2.2. Gaussian datasets


The Gaussian distribution is the most common in the real world. Here, HIAC is tested on a set of Gaussian datasets,
namely A1, A2 and A3. From A1 to A3, the number of clusters gradually increases. Fig. 4 shows their original distribution (row 1) as
well as the distribution ameliorated by HIAC (row 2). Experimental results show that all clusters in ameliorated datasets become more
compact, so HIAC is robust to the Gaussian distribution.

4.2.3. Shape datasets


Shape datasets, in which objects form clusters of various shapes, do not obey a specific distribution function. Shape clusters are
difficult to identify, so many researchers test their algorithms on shape datasets to demonstrate their effectiveness
[23,10,38,6,18,41,26,4,20,19,17]. In other words, shape datasets are naturally bad datasets, so it is necessary to verify whether HIAC
is robust to shape datasets. Here, we select 5 shape datasets, Flame, Aggregation, Heartshapes, Ls3, and T7.10 k. Fig. 5 shows their
original distribution (row 1) and the distribution ameliorated by HIAC (row 2), in which different colors represent the ground-truth
labels of objects. In ameliorated datasets, all clusters become compact. Specifically, the clusters in Flame shrink into tight strips, and the
clusters in Aggregation and Heartshapes shrink into small spheres. The clusters in Ls3 and T7.10 k become slim, which moves them farther from each other. Except for very few objects that move into other clusters, the motion of objects is valid. Obviously, all ameliorated datasets become more friendly to clustering.
Next, we compare the performance of K-means and Agglomerative between original Flame, Aggregation datasets and ameliorated
Flame, Aggregation datasets. Fig. 6 shows the clustering results of K-means and Agglomerative on original datasets, in which different
colors represent the categories identified by clustering algorithms. Obviously, K-means and Agglomerative are invalid. On Aggregation,
they identify two small-clusters as one category, and decompose a large-cluster into two categories (see the red box for details). On
Flame, they mislabel many objects in the lower cluster as the same category as the objects in the upper cluster (see the red box for
details). For intuitive comparison with Fig. 6, we map the clustering results on ameliorated datasets back to the original distribution.
Fig. 7 shows the clustering results of K-means and Agglomerative on ameliorated datasets, in which different colors still represent the
categories identified by clustering algorithms. The experimental results show that, except for individual objects, the objects in each
cluster are accurately identified as one category, so the clustering accuracy is greatly improved. HIAC successfully optimizes the
performance of K-means and Agglomerative on shape datasets.
Compared with K-means and Agglomerative, DBSCAN and BIRCH are good at handling shape datasets, but they are sensitive to input
parameter values. Different parameter values may lead to widely different clustering results. Since clustering is an unsupervised
analysis task, setting appropriate parameter values for DBSCAN and BIRCH is difficult. Here, we will verify whether HIAC can reduce
the parameter-sensitivity of DBSCAN and BIRCH. Within the same parameter interval (containing 100 parameter values), we execute
DBSCAN and BIRCH on original datasets and ameliorated datasets, respectively. Next, we count their probabilities with different levels
of accuracy before and after optimization, as shown in Fig. 8. The experimental results show that DBSCAN and BIRCH have a higher
probability of obtaining higher accuracy after HIAC optimization. That is, after HIAC optimization, more parameter values allow them
to produce more accurate results. For example, before HIAC optimization, most of the parameter values make the accuracy of DBSCAN
on Ls3 only 0.1, and make the accuracy of BIRCH on Heartshapes 0.7. After HIAC optimization, most of the parameter values make the
accuracy of DBSCAN on Ls3 1, and even all of the parameter values make the accuracy of BIRCH on Heartshapes 1. Apparently, HIAC
successfully reduces the sensitivity of DBSCAN and BIRCH to parameter values, making it easier for them to produce accurate clustering results in an unsupervised scenario.
Therefore, HIAC is robust to shape datasets.
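The parameter-sensitivity experiment can be reproduced roughly as follows: sweep the same 100 parameter values on the original and on the ameliorated data, and compare how often each run reaches a given accuracy. This is our sketch on synthetic two-moons data; X_ameliorated is again a placeholder for HIAC's output, and BIRCH can be swept in the same way over its threshold.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.metrics import normalized_mutual_info_score

X, y = make_moons(n_samples=500, noise=0.08, random_state=0)
X_ameliorated = X                        # placeholder: replace with HIAC's output

def nmi_over_eps(data, eps_values):
    # NMI of DBSCAN for each eps value in the sweep
    return np.array([
        normalized_mutual_info_score(
            y, DBSCAN(eps=eps, min_samples=5).fit_predict(data))
        for eps in eps_values
    ])

eps_values = np.linspace(0.01, 0.5, 100)         # the same 100 parameter values for both runs
before = nmi_over_eps(X, eps_values)
after = nmi_over_eps(X_ameliorated, eps_values)
print("share of eps values reaching NMI >= 0.9:",
      (before >= 0.9).mean(), "->", (after >= 0.9).mean())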

Fig. 5. The original distributions (A, B, C, D, E) and the distributions ameliorated by HIAC (F, G, H, I, J) of Aggregation (A, F), Flame (B, G),
Heartshapes (C, H), Ls3 (D, I), and T7.10 k (E, J): These datasets are shape datasets. After being ameliorated by HIAC, all clusters become more
compact. Therefore, HIAC is robust to shape datasets.


Fig. 6. Original clustering results of K-means (A, B) and Agglomerative (C, D) on Flame (A, C) and Aggregation (B, D) datasets: Colors represent
the categories identified by clustering algorithms. K-means and Agglomerative are completely ineffective for these datasets, and they mistakenly
identify multiple clusters as one category but one cluster as multiple categories (see the box for details).

Fig. 7. HIAC-optimized clustering results of K-means (A, B) and Agglomerative (C, D) on Flame (A, C) and Aggregation (B, D) datasets: After HIAC
optimization, the clustering results are mapped back to the original distribution. Only very few objects are misidentified, and the clustering ac­
curacies are greatly improved.

4.3. Advantage verification experiments

In this subsection, we will test HIAC on some extreme scenarios to demonstrate its advantages.

4.3.1. HIAC is immune to outliers


We verify the first advantage of HIAC experimentally (see Section 3.5 for details). HIAC is tested on Compound-part, R15-outlier and
Asymmetric-outlier datasets. These datasets contain not only a lot of outliers, but also a diverse distribution of normal objects. In
Compound-part, outliers surround a wrench-like cluster. In R15-outlier, outliers surround 15 Gaussian clusters. In Asymmetric-outlier,
outliers surround 5 density-unbalanced clusters. Fig. 9 shows their original distribution (row 1) and the distribution ameliorated by
HIAC (row 2), in which outliers are marked in orange. HIAC is not affected by outliers: only normal clusters become more compact, while outliers retain their original distribution. As a result, the boundary between normal clusters and outliers becomes obvious, which is
beneficial for clustering algorithms to identify normal clusters. Obviously, HIAC is immune to outliers.

4.3.2. HIAC can correctly shrink adjacent clusters


We verify the second advantage of HIAC experimentally (see Section 3.5 for details). Once clusters are close to each other, some
boundary-objects between adjacent clusters will become neighbors. Traditional ameliorate-dataset methods adopt the full-addition


Fig. 8. Before and after optimization, the probabilities of DBSCAN (A, B, C, D, E) and BIRCH (F, G, H, I, J) with different levels of accuracy
on Aggregation (A, F), Flame (B, G), Heartshapes (C, H), Ls3 (D, I), and T7.10 k (E, J): DBSCAN and BIRCH have a higher probability of obtaining
higher accuracy after HIAC optimization. That is, after HIAC optimization, more parameter values allow them to produce more accurate results.
Therefore, HIAC successfully reduces the sensitivity of DBSCAN and BIRCH to parameter values.

Fig. 9. The original distribution (A, B, C) and the distribution ameliorated by HIAC (D, E, F) of Compound-part (A, D), R15-outlier (B, E), and
Asymmetric-outlier (C, F) datasets: These datasets contain a lot of outliers marked in orange. In ameliorated datasets, only normal clusters become
more compact, and outliers remain original distribution. Therefore, HIAC is immune to outliers.

mechanism (i.e., adding gravitation between all neighbors), forcing adjacent clusters closer to each other. Obviously, this is detrimental to clustering. HIAC is the first to adopt the selective-addition mechanism (i.e., only adding gravitation between valid-
neighbors). In order to verify the advantages of the selective-addition mechanism, we compare HIAC and HIAC-full on Adj-1, Adj-2,
and Adj-3 datasets. HIAC-full is the version of HIAC that adopts the full-addition mechanism. Fig. 10 shows the original distribution of
these datasets (row 1), the distribution ameliorated by HIAC-full (row 2), and the distribution ameliorated by HIAC (row 3), in which
different colors represent the ground-truth labels of objects. In the datasets ameliorated by HIAC-full, adjacent clusters intersect each
other, and some boundary-objects even appear inside other clusters. Furthermore, HIAC-full erroneously makes the objects in Adj-1
(see Fig. 10(D) for details) more scattered. In the datasets ameliorated by HIAC, objects shrink to the cluster core, adjacent clusters
become far away from each other, so the cluster boundaries become very clear. Clearly, HIAC adopting the selective-addition
mechanism has significant advantages in ameliorating adjacent clusters.
Next, we test HIAC on a more extreme set of datasets. The S-sets consist of 4 datasets, namely S1, S2, S3 and S4, which all contain 15
Gaussian clusters. From S1 to S4, the inter-cluster distance is getting smaller and smaller, and even the cluster boundaries in S4 can no
longer be distinguished with the naked eye. Fig. 11 shows the original distribution of these datasets (row 1), the clustering results of
Agglomerative on them (row 2), and the clustering results of Agglomerative on the datasets ameliorated by HIAC (row 3, and the
clustering results have been mapped back to the original distribution), in which colors represent the categories identified by
Agglomerative. Before HIAC optimization, Agglomerative is ineffective on S4 because some overlapping clusters are marked in one color
(see the red box for details). After HIAC optimization, these overlapping clusters are accurately distinguished by Agglomerative.


Fig. 10. The original distribution (A, B, C), the distribution ameliorated by HIAC-full (D, E, F), and the distribution ameliorated by HIAC
(G, H, I) of Adj-1 (A, D, G), Adj-2 (B, E, H), and Adj-3 (C, F, I): The clusters in these datasets are close to each other. Colors represent the ground-
truth labels of objects. In the datasets ameliorated by HIAC-full, adjacent clusters intersect each other, which is very unfavorable for clustering. In
the datasets ameliorated by HIAC, adjacent clusters become far away from each other.

4.3.3. HIAC is insensitive to parameter k


We verify the third advantage of HIAC experimentally (see Section 3.5 for details). HIAC with different k values is tested on Flame
and Aggregation datasets, as shown in Fig. 12. Colors represent the ground-truth labels of objects. Column 1 of Fig. 12 shows the original
distribution of datasets, and columns 2 to 5 show the distribution ameliorated by HIAC with k = 25, k = 35, k = 45, k = 55,
respectively. Although the span of k values is large, HIAC always shrinks the clusters in Flame into long strips and the clusters in
Aggregation into small spheres. All clusters become more compact, and different clusters are far away from each other. Therefore, HIAC
is not sensitive to the parameter k. It is easy for HIAC to obtain excellent optimization results in real scenarios.

4.4. Comparison experiments

In this subsection, we will compare HIAC with Newtonian, Herd, SBCA, HIBOG on Breast cancer (Hereinafter referred to as Cancer),
Banknote authentication (Hereinafter referred to as Banknote), Digit, Iris, Seeds, Teaching assistant evaluation (Hereinafter referred to as
TAE), Wireless indoor localization (Hereinafter referred to as Wireless), and Wine datasets. The size of these datasets is recorded in
Table 3.

4.4.1. Accuracy comparison


We compute the clustering accuracies (i.e., NMI) of K-means, Agglomerative and DPC on original datasets and ameliorated datasets.
Table 4 records the clustering accuracies of clustering algorithms on original datasets. Table 5 consists of 5 parts, and records the
clustering accuracies of clustering algorithms on the datasets ameliorated by Newtonian, Herd, SBCA, HIBOG, HIAC, respectively. The
accuracy improvement rate is recorded in parentheses ((optimized accuracy - original accuracy)/ original accuracy).
On some datasets, Newtonian, Herd and SBCA not only fail to optimize K-means, Agglomerative and DPC, but even reduce their
clustering accuracies. Newtonian in particular performs poorly on most datasets. Due to its high computational complexity, Newtonian fails to ameliorate Cancer and Digit even after running for more than 24 h. SBCA is superior to Newtonian and Herd, and it performs well on
most datasets. However, SBCA reduces the accuracy of K-means on Banknote by 41.4%, which is unacceptable. HIBOG improves the


Fig. 11. The original distribution (A, B, C, D) of S1 (A, E, I), S2 (B, F, J), S3 (C, G, K), S4 (D, H, L), and the original (E, F, G, H) and the HIAC-
optimized (I, J, K, L) clustering results of Agglomerative on these datasets: From S1 to S4, adjacent clusters get closer and closer. Colors represent
the categories identified by Agglomerative. Agglomerative mistakenly identifies some overlapping adjacent clusters as one category (see the red box
for details). After HIAC optimization, Agglomerative successfully distinguishes all adjacent clusters.

Fig. 12. The original distribution (column 1) and the distribution ameliorated by HIAC with different k values (columns 2 to 5) of Flame
(row 1) and Aggregation (row 2) datasets: With different parameter k values, HIAC always shrinks all clusters into distant small clusters.

Table 3
The size of real-world datasets.
Cancer Banknote Digit Iris Seeds TAE Wireless Wine

Number 569 1372 1797 150 210 151 2000 178


Dimension 30 4 64 4 7 5 7 13
Size (num × dim) 17070 5488 115008 600 1470 755 14000 2314

accuracies of clustering algorithms on all datasets. However, the optimization of HIBOG is not outstanding, and many of its
improvement rates are below 10%. HIBOG achieves the highest improvement rate in only one case (see the bold values in Table 5). As for the
proposed HIAC, it successfully ameliorates all datasets, so that the accuracies of clustering algorithms are greatly improved. More
importantly, most of HIAC’s improvement rates are above 10%, even as high as 26,000%. Compared with Newtonian, Herd, SBCA and
HIBOG, HIAC’s improvement rates are almost always the highest (see the bold values in Table 5), and its average improvement rate is as


Table 4
The accuracies of clustering algorithms before optimization.
K-means Agglomerative DPC

Breast cancer 0.422 0.261 0.166


Banknote authentication 0.029 0.003 0.327
Digit 0.738 0.856 0.716
Iris 0.748 0.758 0.707
Seeds 0.691 0.724 0.706
Teaching assistant evaluation 0.013 0.010 − a
Wireless indoor location 0.885 0.906 0.864
Wine 0.423 0.410 0.384
a Clustering algorithm is completely invalid for dataset.

Table 5
The accuracies of clustering algorithms after optimization.
K-means Agglomerative DPC

Breast cancer + Newtonian − a − a − a


Banknote authentication + Newtonian 0.104 ( + 258.6%) 0.106 ( + 3433.3%) 0.327
Digit + Newtonian − a − a − a
Iris + Newtonian 0.845 ( + 13.0%) 0.793 ( + 4.6%) 0.862 ( + 21.9%)
Seeds + Newtonian 0.697 ( + 0.9%) 0.691( − 4.6%) 0.684( − 3.1%)
Teaching assistant evaluation + Newtonian 0.014 ( + 7.7%) 0.010 − b
Wireless indoor location + Newtonian 0.885 0.862( − 4.9%) 0.864
Wine + Newtonian 0.423 0.410 0.384

Breast cancer + Herd 0.611 ( + 44.8%) 0.677 ( + 159.4%) 0.408 ( + 145.8%)


Banknote authentication + Herd 0.147 ( + 406.9%) 0.147 ( + 4800.0%) 0.309( − 0.06%)
Digit + Herd 0.740 ( + 0.3%) 0.858 ( + 0.2%) 0.781 ( + 9.1%)
Iris + Herd 0.752 ( + 0.5%) 0.750( − 1.1%) 0.778 ( + 10.0%)
Seeds + Herd 0.722 ( + 4.5%) 0.699( − 3.5%) 0.705( − 0.2%)
Teaching assistant evaluation + Herd 0.081 ( + 523.1%) 0.081 ( + 710%) 0.073
Wireless indoor location + Herd 0.829( − 6.3%) 0.883( − 2.5%) 0.800( − 7.4%)
Wine + Herd 0.847 ( + 100.2%) 0.907 ( + 121.2%) 0.697 ( + 81.5%)

Breast cancer + SBCA 0.611 ( + 44.8%) 0.497 ( + 90.4%) 0.454 ( + 173.5%)


Banknote authentication + SBCA 0.017( − 41.4%) 0.173 ( + 5666.7%) 0.877 ( + 168.2%)
Digit + SBCA 0.740 ( + 0.3%) 0.849( − 0.8%) 0.639( − 10.8%)
Iris + SBCA 0.748 0.786 ( + 3.7%) 0.883 ( + 24.9%)
Seeds + SBCA 0.730 ( + 5.6%) 0.750 ( + 3.6%) 0.739 ( + 4.7%)
Teaching assistant evaluation + SBCA 0.060 ( + 361.5%) 0.060 ( + 500%) 0.042
Wireless indoor location + SBCA 0.854( − 3.5%) 0.878( − 3.1%) 0.867 ( + 0.3%)
Wine + SBCA 0.874 ( + 106.6%) 0.907 ( + 121.2%) 0.646 ( + 68.2%)

Breast cancer + HIBOG 0.705 ( + 67.1%) 0.708 ( + 171.3%) 0.502 ( + 202.4%)


Banknote authentication + HIBOG 0.178 ( + 513.8%) 0.177 ( + 5800%) 0.329 ( + 0.6%)
Digit + HIBOG 0.882 ( + 19.5%) 0.877 ( + 2.6%) 0.915 ( + 27.8%)
Iris + HIBOG 0.813 ( + 8.7%) 0.803 ( + 5.9%) 0.793 ( + 12.2%)
Seeds + HIBOG 0.772 ( + 11.7%) 0.798 ( + 10.2%) 0.726 ( + 2.8%)
Teaching assistant evaluation + HIBOG 0.073 ( + 461.5%) 0.060 ( + 500%) 0.042
Wireless indoor location + HIBOG 0.924 ( + 4.4%) 0.941 ( + 3.9%) 0.923 ( + 6.8%)
Wine + HIBOG 0.889 ( + 110.7%) 0.874 ( + 113.2%) 0.863 ( + 124.7%)

Breast cancer + HIAC 0.732(+73.5%) 0.752(+188.1%) 0.717(+331.9%)


Banknote authentication + HIAC 0.773(+2555.5%) 0.783(+26000%) 0.918(+180.7%)
Digit + HIAC 0.891(+20.7%) 0.890(+4%) 0.923(+28.9%)
Iris + HIAC 0.929(+24.2%) 0.929(+22.6%) 0.929(+31.4%)
Seeds + HIAC 0.800(+15.8%) 0.826(+14.1%) 0.781(+10.6%)
Teaching assistant evaluation + HIAC 0.091(+600%) 0.091(+810%) 0.107
Wireless indoor location + HIAC 0.928(+4.9%) 0.940(+3.8%) 0.937(+8.4%)
Wine + HIAC 0.925(+118.7%) 0.953(+132.4%) 0.925(+148.2%)
a Running time is too long to get results.
b Clustering algorithm is completely invalid for dataset.


high as 253.6% (excluding the maximum and minimum). Especially on Banknote, the original accuracies of K-means and Agglomerative are
only 0.029 and 0.003. After Newtonian, Herd, SBCA or HIBOG optimization, the accuracies of K-means and Agglomerative do not exceed
0.2. But after HIAC optimization, the accuracies of K-means and Agglomerative reach 0.773 and 0.783. Clearly, HIAC has obvious
advantages over existing ameliorate-dataset methods in improving the accuracy of clustering algorithms.

4.4.2. Parameter robustness comparison


HIAC, HIBOG, SBCA and Herd each have one parameter related to the number of neighbors, namely the number of neighbors k in HIAC and HIBOG, the number of grids k in SBCA, and the neighborhood radius threshold in Herd. The larger these parameters, the greater the number of neighbors between which gravitation is added. Figs. 13–15 show the performance of ameliorate-dataset methods with
different parameter values (i.e., different number of neighbors), and these parameter values are extracted equidistantly from the
recommended intervals (see Section 4.1.4 for details). The experimental results show that the accuracy curves of HIAC are almost at the
top, indicating that it is not accidental that HIAC outperforms HIBOG, SBCA and Herd. Therefore, HIAC is more reliable than existing
ameliorate-dataset methods.

4.4.3. Running time comparison


The following is a discussion of HIAC’s time complexity. When identifying valid-neighbors, HIAC needs to compute the weights
between each object and k neighbors with a time complexity of O(kN). When moving objects, HIAC needs to compute the gravitational
values between objects and valid-neighbors d times with a time complexity of O(dsN), where s is the average number of valid-neighbors
and s < k. In conclusion, the total time complexity of HIAC is O(kN + dsN). As for baseline methods, the time complexity of Newtonian
is $O(dN^2D + 2N^2)$, in which D is the dimension of the dataset; the time complexity of Herd is $O(dN^2 \log N)$; the time complexity of SBCA is $O(dN^2)$; the time complexity of HIBOG is $O(dpN)$, in which p is the average number of neighbors in the neighborhood. Obviously, the
time complexity of HIAC is close to that of HIBOG, but significantly smaller than that of Newtonian, Herd, and SBCA.
Next, we count the running time of HIAC, Newtonian, Herd, SBCA, and HIBOG on the same computer, as shown in Fig. 16 where the
datasets are sorted by size (See Table 3 for details). Newtonian and Herd run over 500 s on many datasets, so they lose practical value.
Although SBCA outperforms Newtonian and Herd, it still runs 378 s on Digit. HIAC and HIBOG have a clear advantage in running time,
running less than 10 s on all datasets. Due to its short running time, HIAC is more practical than most ameliorate-dataset methods.

5. Conclusion

In this paper, we propose a novel ameliorate-dataset method called HIAC to optimize clustering algorithms. Different from traditional ameliorate-dataset methods, HIAC divides the neighbors of objects into valid-neighbors and invalid-neighbors, and only adds gravitation between valid-neighbors to prevent dissimilar objects from approaching each other. When determining valid-neighbors, HIAC introduces the decision graph, from which a clear division threshold can be identified by the naked eye. More importantly, the decision graph can assist HIAC in reducing the negative effects that improper values of the parameter k have on optimization. We conduct extensive experiments on common synthetic and real-world datasets to test HIAC. HIAC successfully ameliorates high-dimensional datasets, Gaussian datasets and shape datasets by not only reducing their intra-cluster distances but also increasing their inter-cluster distances; the ameliorated datasets become friendlier to clustering algorithms. We also verify the advantages of HIAC. Specifically, HIAC shrinks only the normal objects in datasets with outliers, and avoids drawing adjacent clusters closer together in overlapping datasets. Furthermore, HIAC is not sensitive to the parameter k, and it obtains stable effects over a wide range of k values. Finally, we compare HIAC with Newtonian, Herd, SBCA and HIBOG, and HIAC has outstanding advantages over them.

Fig. 13. On Cancer(A), Banknote(B), Digit(C), Iris(D), Seeds(E), TAE(F), Wireless(G) and Wine(H) datasets, the clustering accuracies of K-means
optimized by ameliorate-dataset methods with different parameter values: The accuracy curves of HIAC are almost at the top, so HIAC is
significantly superior to HIBOG, SBCA and Herd.


Fig. 14. On Cancer(A), Banknote(B), Digit(C), Iris(D), Seeds(E), TAE(F), Wireless(G) and Wine(H) datasets, the clustering accuracies of
Agglomerative optimized by ameliorate-dataset methods with different parameter values: The accuracy curves of HIAC are almost at the top, so
HIAC is significantly superior to HIBOG, SBCA and Herd.

Fig. 15. On Cancer(A), Banknote(B), Digit(C), Iris(D), Seeds(E), TAE(F), Wireless(G) and Wine(H) datasets, the clustering accuracies of DPC
optimized by ameliorate-dataset methods with different parameter values: The accuracy curves of HIAC are at the top, so HIAC is significantly
superior to HIBOG, SBCA and Herd.

Fig. 16. Running time of HIAC, Newtonian, Herd, SBCA and HIBOG: The datasets on the abscissa are sorted by size (see Table 3 for details). The running time of HIAC and HIBOG is significantly shorter than that of Newtonian, Herd and SBCA.


HIAC greatly improves the accuracies of clustering algorithms: not only are its improvement rates much higher than those of the other methods, but its running times are also much shorter than those of most methods. With different parameter values, the advantages of HIAC over the other methods are always maintained. In conclusion, HIAC is an excellent method for optimizing clustering algorithms.

CRediT authorship contribution statement

Qi Li: Conceptualization, Methodology, Writing – review & editing, Formal analysis, Investigation, Software. Shuliang Wang:
Funding acquisition, Writing – review & editing. Xianjun Zeng: Software, Writing – review & editing. Boxiang Zhao: Writing – review
& editing. Yingxu Dang: Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to
influence the work reported in this paper.

Acknowledgments

The work is funded by the National Key R&D Program of China (2020YFC0832600) and the National Natural Science Foundation of China (62076027).

References

[1] R. Alex, L. Alessandro, Clustering by fast search and find of density peaks, Science 344 (2014) 1492.
[2] M. Alswaitti, M.K. Ishak, N.A.M. Isa, Optimized gravitational-based data clustering algorithm, Eng. Appl. Artif. Intell. 73 (2018) 126–148.
[3] K. Blekas, I.E. Lagaris, Newtonian clustering: An approach based on molecular dynamics and global optimization, Pattern Recogn. 40 (2007) 1734–1744.
[4] L. Cai, H. Wang, F. Jiang, Y. Zhang, Y. Peng, A new clustering mining algorithm for multi-source imbalanced location data, Inf. Sci. 584 (2022) 50–64.
[5] Y. Cai, M. Zeng, Z. Cai, X. Liu, Z. Zhang, Graph regularized residual subspace clustering network for hyperspectral image clustering, Inf. Sci. 578 (2021) 85–101.
[6] J. Chen, S.Y. Philip, A domain adaptive density clustering algorithm for data with varying density distribution, IEEE Trans. Knowl. Data Eng. 33 (2021)
2310–2321.
[7] L. Chen, F. Chen, Z. Liu, M. Lv, T. He, S. Zhang, Parallel gravitational clustering based on grid partitioning for large-scale data, Appl. Intell. (2022) 1–21.
[8] D. Dua, C. Graff, UCI machine learning repository, 2019. URL: http://archive.ics.uci.edu/ml.
[9] P. Fränti, S. Sieranoja, K-means properties on six clustering benchmark datasets, Appl. Intell. 48 (2018) 4743–4759.
[10] C. Gong, Z.g. Su, P.h. Wang, Q. Wang, An evidential clustering algorithm by finding belief-peaks and disjoint neighborhoods, Pattern Recogn. 113 (2021),
107751.
[11] L. Guo, Q. Dai, Graph clustering via variational graph embedding, Pattern Recogn. 122 (2022), 108334.
[12] J. Han, J. Xu, F. Nie, X. Li, Multi-view k-means clustering with adaptive sparse memberships and weight allocation, IEEE Trans. Knowl. Data Eng. 34 (2022)
816–827.
[13] Q. Li, S. Wang, C. Zhao, B. Zhao, X. Yue, J. Geng, Hibog: Improving the clustering accuracy by ameliorating dataset with gravitation, Inf. Sci. 550 (2021) 41–56.
[14] J. Liu, F. Cao, J. Liang, Centroids-guided deep multi-view k-means clustering, Inf. Sci. 609 (2022) 876–896.
[15] Z. Long, Y. Gao, H. Meng, Y. Yao, T. Li, Clustering based on local density peaks and graph cut, Inf. Sci. 600 (2022) 263–286.
[16] A. Lotfi, P. Moradi, H. Beigy, Density peaks clustering based on density backbone and fuzzy neighborhood, Pattern Recogn. 107 (2020), 107449.
[17] Y. Lu, Y.M. Cheung, Y.Y. Tang, Self-adaptive multiprototype-based competitive learning approach: A k-means-type algorithm for imbalanced data clustering,
IEEE Trans. Cybern. 51 (2021) 1598–1612.
[18] Y. Ma, H. Lin, Y. Wang, H. Huang, X. He, A multi-stage hierarchical clustering algorithm based on centroid of tree and cut edge constraint, Inf. Sci. 557 (2021)
194–219.
[19] A.C.A. Neto, J. Sander, R.J.G.B. Campello, M.A. Nascimento, Efficient computation and visualization of multiple density-based clustering hierarchies, IEEE
Trans. Knowl. Data Eng. 33 (2021) 3075–3089.
[20] F. Nie, W. Chang, Z. Hu, X. Li, Robust subspace clustering with low-rank structure constraint, IEEE Trans. Knowl. Data Eng. 34 (2022) 1404–1415.
[21] X. Peng, Y. Li, I.W. Tsang, H. Zhu, J. Lv, J.T. Zhou, Xai beyond classification: Interpretable neural clustering, J. Mach. Learn. Res. 23 (2022) 1–6.
[22] E. Rashedi, H. Nezamabadi-Pour, A stochastic gravitational approach to feature based color image segmentation, Eng. Appl. Artif. Intell. 26 (2013) 1322–1332.
[23] A.U. Rehman, S.B. Belhaouari, Divide well to merge better: A novel clustering algorithm, Pattern Recogn. 122 (2022), 108305.
[24] M. Rezaei, P. Fränti, Can the number of clusters be determined by external indices? IEEE Access 8 (2020) 89239–89257.
[25] Y. Shi, Y. Song, A. Zhang, A shrinking-based clustering approach for multidimensional data, IEEE Trans. Knowl. Data Eng. 17 (2005) 1389–1403.
[26] D. Sun, K.C. Toh, Y. Yuan, Convex clustering: Model, theoretical guarantee and efficient algorithm, J. Mach. Learn. Res. 22 (2021) 1–32.
[27] G. Sun, Y. Cong, J. Dong, Y. Liu, Z. Ding, H. Yu, What and how: generalized lifelong spectral clustering via dual memory, IEEE Trans. Pattern Anal. Mach. Intell. (2021).
[28] G. Wang, Q. Song, Automatic clustering via outward statistical testing on density metrics, IEEE Trans. Knowl. Data Eng. 28 (2016) 1971–1985.
[29] S. Wang, Q. Li, C. Zhao, X. Zhu, H. Yuan, T. Dai, Extreme clustering–a clustering method via density extreme points, Inf. Sci. 542 (2021) 24–39.
[30] S. Wang, D. Wang, C. Li, Y. Li, G. Ding, Clustering by fast search and find of density peaks with data field, Chin. J. Electron. 25 (2016) 397–402.
[31] X.X. Wang, Y.F. Zhang, J. Xie, Q.Z. Dai, Z.Y. Xiong, J.P. Dan, A density-core-based clustering algorithm with local resultant force, Soft. Comput. 24 (2020)
6571–6590.
[32] Z. Wang, Z. Yu, C.P. Chen, J. You, T. Gu, H.S. Wong, J. Zhang, Clustering by local gravitation, IEEE Trans. Cybern. 48 (2017) 1383–1396.
[33] K.C. Wong, C. Peng, Y. Li, T.M. Chan, Herd clustering: A synergistic data clustering approach using collective intelligence, Appl. Soft Comput. 23 (2014) 61–75.
[34] W.E. Wright, Gravitational clustering, Pattern Recogn. 9 (1977) 151–166.
[35] S. Xia, D. Peng, D. Meng, C. Zhang, G. Wang, E. Giem, W. Wei, Z. Chen, Ball k-means: Fast adaptive clustering with no bounds, IEEE Trans. Pattern Anal. Mach. Intell. 44 (2022) 87–99.
[36] J. Xie, Z. Xiong, Q. Dai, X. Wang, Y. Zhang, A local-gravitation-based method for the detection of outliers and boundary points, Knowl.-Based Syst. 192 (2020),
105331.
[37] M. Yang, Y. Li, P. Hu, J. Bai, J. Lv, X. Peng, Robust multi-view clustering with incomplete information, IEEE Trans. Pattern Anal. Mach. Intell. 45 (2022)
1055–1069.
[38] Y. Yang, S. Deng, J. Lu, Y. Li, Z. Gong, Z. Hao, et al., Graphlshc: towards large scale spectral hypergraph clustering, Inf. Sci. 544 (2021) 117–134.


[39] P. Zhang, K. She, A novel hierarchical clustering approach based on universal gravitation, Math. Problems Eng. (2020).
[40] P. Zhu, C. Zhang, X. Li, J. Zhang, X. Qin, A high-dimensional outlier detection approach based on local coulomb force, IEEE Trans. Knowl. Data Eng. (2022).
[41] S. Zhu, L. Xu, E.D. Goodman, Hierarchical topology-based cluster representation for scalable evolutionary multiobjective clustering, IEEE Trans. Cybern. (2021).
