Information Sciences
journal homepage: www.elsevier.com/locate/ins
A R T I C L E I N F O

Keywords: Clustering optimization; Improving accuracy; Gravitation

A B S T R A C T

Clustering is an important data analysis technique. However, due to the diversity of datasets, each
clustering algorithm is unable to produce satisfactory results on some particular datasets. In this
paper, we propose a clustering optimization method called HIAC (Highly Improving the Accuracy
of Clustering algorithms). By introducing gravitation, HIAC forces objects in the dataset to move
towards similar objects, making the ameliorated dataset more friendly to clustering algorithms (i.
e., clustering algorithms can produce more accurate results on the ameliorated dataset). HIAC is
independent of clustering principle, so it can optimize different clustering algorithms. In contrast
to other similar methods, HIAC is the first to adopt the selective-addition mechanism, i.e., only
adding gravitation between valid-neighbors, to avoid dissimilar objects approaching each other.
In order to identify valid-neighbors from neighbors, HIAC introduces a decision graph, from
which the naked eye can observe a clear division threshold. Additionally, the decision graph can
assist HIAC in reducing the negative effects that improper parameter values have on optimization.
We conducted numerous experiments to test HIAC. Experiment results show that HIAC can
effectively ameliorate high-dimensional datasets, Gaussian datasets, shape datasets, the datasets
with outliers, and overlapping datasets. HIAC greatly improves the accuracy of clustering algorithms,
and its improvement rates are far higher than those of similar methods. The average
improvement rate is as high as 253.6% (excluding the maximum and minimum). Moreover, its runtime is
significantly shorter than that of most similar methods. More importantly, with different
parameter values, the advantages of HIAC over similar methods are always maintained. The code
of HIAC is available at https://github.com/qiqi12/HIAC.
1. Introduction
Clustering, which categorizes objects based on the similarity between them, is a crucial data analysis technology. Clustering has
been widely used in many fields, such as pattern recognition, data compression, image segmentation, time series analysis, information
retrieval, spatial data analysis, biomedical research and so on [28].
Due to the diversity of the datasets to be analyzed, clustering is not a simple task [13]. The existing clustering algorithms fail to
produce accurate clustering results on some particular datasets. Most researchers [29,16,35,21,12,11,15,14,38,5,37,27,30] try to
improve the principles of the existing clustering algorithms to obtain more robust performance but introduce new flaws. For example,
Extreme clustering [29] addresses the shortcoming of DPC [1] on density-unbalanced datasets, but it introduces an additional input
parameter. In brief, purely improving the principles of clustering algorithms is not the optimal optimization strategy. A few researchers
* Corresponding author.
E-mail addresses: liqi_bitss@163.com (Q. Li), slwang2011@bit.edu.cn (S. Wang), XianJTeng@163.com (X. Zeng), zhaobx9676@gmail.com
(B. Zhao), dyxcome@163.com (Y. Dang).
https://doi.org/10.1016/j.ins.2023.01.094
Received 10 June 2022; Received in revised form 11 January 2023; Accepted 15 January 2023
Available online 20 January 2023
0020-0255/© 2023 Elsevier Inc. All rights reserved.
Q. Li et al. Information Sciences 627 (2023) 52–70
try to optimize the clustering performance by ameliorating the datasets to be analyzed [32,31,2,22,36,3,33,13,25]. These
methods (hereinafter referred to as ameliorate-dataset methods) add gravitation between neighbors and force neighbors closer to
each other until the dataset becomes friendly to clustering algorithms. Importantly, some ameliorate-dataset methods are independent
of the clustering process, so different clustering algorithms can produce more accurate clustering results on ameliorated datasets.
Obviously, ameliorating dataset is a more effective clustering optimization strategy.
However, the existing ameliorate-dataset methods have some prominent defects. Specifically: 1) once datasets are slightly
complex, they wrongly force adjacent clusters or dissimilar objects (e.g., outliers and normal objects) closer together, as shown in
Fig. 1; 2) they dynamically find new neighbors during object motion, resulting in the loss of the original similarity relationship
between objects. Again, as shown in Fig. 1, the red objects that were once neighbors are no longer neighbors in the ameliorated
distribution. To address the above defects, we propose a novel ameliorate-dataset method called HIAC (Highly Improving the Accuracy
of Clustering algorithms). We summarize the main contributions of this work as follows:
1. We propose the selective-addition mechanism to prevent outliers and normal objects, or adjacent clusters, from approaching
each other. Specifically, we identify the neighbors with high similarity (hereinafter referred to as valid-neighbors) from all
neighbors and treat the remaining neighbors as invalid-neighbors. Gravitation is added only between valid-neighbors.
In contrast to the existing ameliorate-dataset methods, which adopt the full-addition mechanism (i.e., adding gravitation
between all neighbors), HIAC effectively cuts off the gravitational transfer between dissimilar neighbors.
2. We introduce the decision graph to provide a reasonable basis for identifying valid-neighbors. Specifically, we assign
similarity-related weights between neighbors. By counting all weights, we generate a weight probability distribution as the decision
graph. We can quickly determine which weights match the global pattern based on the regular probability differences in the de
cision graph, and then identify the neighbor pairs whose weights match the global pattern as valid-neighbors. More importantly,
the decision graph can assist HIAC in reducing the negative effects that improper parameter values have on optimization.
3. We design the valid-neighbors locking strategy to preserve the original similarity relationship between objects. Specifically,
during object movement, we no longer re-find valid-neighbors in real time but instead force each object to continuously
move closer to its initial valid-neighbors. As a result, not only is the running time of HIAC reduced, but also the original
similarity relationship between objects is maintained (i.e., neighbors in the original dataset are still neighbors in the ameliorated
dataset).
4. We conducted a series of experiments to verify the robustness and advantages of HIAC. The experimental results show that
HIAC is robust to Gaussian datasets, shape datasets, and high-dimensional datasets; HIAC successfully shrinks the clusters
surrounded by outliers, and makes the boundary-objects between adjacent clusters move toward their own cluster core; on 8
real-world datasets, HIAC successfully improves the accuracies of clustering algorithms with an average improvement rate of
253.6% (excluding the maximum and minimum), and its optimization performance is far superior to the existing methods; with
different parameter values, HIAC still outperforms the existing methods; HIAC runs significantly faster than most existing methods.
5. We published our code. We make the code and datasets publicly available at https://github.com/qiqi12/HIAC.
The remainder of the paper is organized as follows. The next section is about related works. Section 3 introduces the theory of HIAC.
Section 4 verifies the robustness and effectiveness of HIAC. The conclusion is reported in Section 5.
2. Related works
After Gravitational clustering [34] introduced gravitation into clustering, some researchers began to treat data as particles rather
than constant objects, trying to force particles to move under gravitation to improve clustering results. So far, gravitation-based
clustering algorithms have been proposed successively; they consist of the gravitation process (in this paper, we refer to the
gravitation process as the ameliorate-dataset method) and the clustering process. Most ameliorate-dataset methods are nested within
the clustering process; only a few are independent of it. In this paper, we focus on the ameliorate-dataset methods independent of
the clustering process. Compared with the ameliorate-dataset methods nested in the
Fig. 1. Defects of existing ameliorate-dataset methods: The existing ameliorate-dataset methods may wrongly force outliers and normal objects
or adjacent clusters approaching each other, or lose the original similarity relationship between objects.
clustering process, the ameliorate-dataset methods independent of the clustering process are usually universal and can optimize
different clustering algorithms.
The ameliorate-dataset methods independent of the clustering process: Newtonian [3] assumes that all datasets follow the
multivariate Gaussian distribution. It forces objects to move closer to cluster centers until the ratio, which is the sum of the distances
between objects in the ameliorated dataset divided by the sum of the distances between objects in the original dataset, reaches a
threshold. Eventually, the ameliorated clusters become well-distributed for easy identification. These well-distributed clusters also
make it easy to determine the number of clusters. Herd [33] does not pay attention to the magnitude of gravitation but only to its
direction. It adds the unit vectors of the resultant force between neighbors, forcing objects to move towards the center of their
neighbors. In each iteration of Herd, the velocity grows in the direction of gravitation. To prevent objects from moving outside the
cluster, it sets an upper bound on the velocity. After multiple iterations, Herd makes the datasets to be analyzed friendly to
clustering algorithms. HIBOG [13] reconstructs the gravitational model so that gravitation is related to the distribution of objects.
HIBOG weakens the strength of gravitation between distant neighbors, and forces objects closer to the cluster core. Since the
reconstructed gravitational model can be calculated with matrix operations, HIBOG runs in less time. Different from the methods
mentioned above, SBCA [25] no longer takes the object as the smallest unit of analysis, but rather the grid. It divides the dataset
into multiple grids, and forces grids to move along the gradient of density until the average distance between objects is below a
predetermined threshold.
The ameliorate-dataset methods nested in the clustering process: LGC [32] takes both the magnitude and direction of local
resultant force into consideration, and designs two local measures, centrality (CE) and coordination (CO), to determine whether an
object is at the boundary or the core of the cluster. Compared with traditional density-based clustering algorithms, LGC does not
require a threshold to find density core regions when identifying clusters. DCLRF [31] also uses CE to measure local resultant force. It
extracts all core-objects in the dataset, and then produces clustering results based on the natural neighborhood structure information of
core-objects. With the help of gravitation, DCLRF can accurately identify the number of clusters in the datasets with complex
distribution. OGC [2] does not apply gravitation to all objects, but only to clustering centers. Non-clustering centers are the source of
gravitation. Under gravitation of non-clustering centers, all clustering centers iteratively move. OGC changes the clustering center
iteration rule of K-means. SGISA [22] is applied to the image segmentation, which forces each pixel to move in the feature space. After
movement, two pixels are grouped into a cluster if the distance between them is below a predetermined threshold. Unlike the methods
above, LGBD [36] and NHOD [40] identify outliers surrounding clusters by gravitation. LGBD gradually expands the neighbors of each
object, so the gravitational variation ratios of outliers, boundary-objects, and core-objects show obvious differences. It identifies the
objects with the most variation as outliers. NHOD [40] adds Coulomb forces between objects. Coulomb forces can reveal the
differences between different dimensions of an object through vector projections in each dimension, which allows NHOD to effectively
identify outliers in high-dimensional datasets. GHC [39] introduces gravitation into the hierarchical clustering process. It constructs a
gravitational relationship graph and divides the graph into several subgraphs according to the gravitational influence coefficient. The
final clustering results are then obtained by GHC iteratively merging these subgraphs. To deal with large-scale datasets, PGCGP [7] first
divides the dataset into a number of grids and then shrinks the object distribution in each grid individually.
There are several datasets (called good datasets) on which most clustering algorithms can perform well, but there are several
datasets (called bad datasets) on which few clustering algorithms are effective [13]. The reason is that, in good datasets, the
inter-cluster distance is significantly larger than the intra-cluster distance, making it easier for clustering algorithms to identify clusters;
while in bad datasets, the inter-cluster distance is similar to the intra-cluster distance, and even the boundaries of clusters are
completely overlapping. Obviously, by increasing the difference between the inter-cluster distance and the intra-cluster distance, the
bad datasets can be ameliorated into good datasets, and clustering algorithms can produce more accurate results on ameliorated
datasets.
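The distinction between good and bad datasets can be illustrated numerically. The sketch below is our illustration, not a measure defined in the paper; `intra_inter_ratio` is a name we introduce. It compares a well-separated toy dataset against an overlapping one by the ratio of mean inter-cluster to mean intra-cluster distance:

```python
import math

def mean_dist(pairs):
    """Mean Euclidean distance over a list of point pairs."""
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

def intra_inter_ratio(clusters):
    """Mean inter-cluster distance divided by mean intra-cluster distance.

    A larger ratio corresponds to a 'good' dataset in the sense above;
    the ratio itself is illustrative, not a quantity used by HIAC.
    """
    intra, inter = [], []
    for ci, cluster in enumerate(clusters):
        for i, a in enumerate(cluster):
            intra += [(a, b) for b in cluster[i + 1:]]
            for other in clusters[ci + 1:]:
                inter += [(a, b) for b in other]
    return mean_dist(inter) / mean_dist(intra)

good = [[(0, 0), (0, 1)], [(10, 0), (10, 1)]]   # well-separated clusters
bad = [[(0, 0), (0, 2)], [(1, 0), (1, 2)]]      # overlapping boundaries
print(intra_inter_ratio(good) > intra_inter_ratio(bad))  # → True
```

Ameliorating a dataset amounts to pushing this ratio upward without destroying the neighbor structure.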
In this paper, we propose a novel ameliorate-dataset method called HIAC to ameliorate the dataset. HIAC still adds gravitation
between neighbors, forcing similar objects closer together. However, different from the existing methods, HIAC divides neighbors into
valid-neighbors and invalid-neighbors, and locks the original neighbor relationship during object motion. HIAC consists of the
following two steps:
• Step 1 (Identifying valid-neighbors). HIAC constructs a topology graph for the dataset, and it assigns weights to the edges of the
topology graph according to similarity. Based on the turning area of the weight probability distribution curve, HIAC clips
invalid-edges from the topology graph. Finally, the neighbors connected by the remaining edges are identified as valid-neighbors.
• Step 2 (Objects motion). HIAC only adds gravitation between valid-neighbors, and it forces objects to move together according to
Newton's laws of motion. During object motion, HIAC no longer looks for new valid-neighbors, ensuring that objects continue to
move toward their initial valid-neighbors. Finally, valid-neighbors are closer to each other (i.e., the intra-cluster distance
decreases), and non-neighbors are farther away from each other (i.e., the inter-cluster distance increases). The datasets ameliorated by
HIAC become more friendly to clustering algorithms.
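The two steps above can be sketched as a minimal pipeline. This is only a sketch under simplifying assumptions: the inverse-distance edge weight, the fixed `threshold` (standing in for the decision-graph turning area), and movement toward the valid-neighbor centroid (standing in for the paper's gravitational model) are all our illustrative stand-ins, and `hiac`/`knn` are names we introduce:

```python
import math

def knn(data, k):
    """Indices of the k nearest neighbors of each object (brute force)."""
    nbrs = []
    for i, a in enumerate(data):
        order = sorted((j for j in range(len(data)) if j != i),
                       key=lambda j: math.dist(a, data[j]))
        nbrs.append(order[:k])
    return nbrs

def hiac(data, k, T, d, threshold):
    """Skeleton of HIAC: identify valid-neighbors once, then move objects."""
    nbrs = knn(data, k)
    # Step 1: keep only valid-neighbors (high-weight edges), locked once.
    valid = [[j for j in nbrs[i]
              if 1.0 / (1.0 + math.dist(data[i], data[j])) >= threshold]
             for i in range(len(data))]
    # Step 2: over d time-segments, move each object toward the mean of
    # its initial (locked) valid-neighbors.
    pts = [list(p) for p in data]
    for _ in range(d):
        new = []
        for i, p in enumerate(pts):
            if not valid[i]:
                new.append(p)  # objects with no valid-neighbors stay put
                continue
            centroid = [sum(pts[j][dim] for j in valid[i]) / len(valid[i])
                        for dim in range(len(p))]
            new.append([x + T * (c - x) for x, c in zip(p, centroid)])
        pts = new
    return pts
```

Running this on two tight pairs plus a distant outlier shrinks each pair while the outlier, having no valid-neighbors, never moves.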
In this paper, the descriptions of commonly used notations are recorded in Table 1.
Consider the example shown in Fig. 2. A dataset consists of 2 normal clusters (marked in yellow and red) and 2 outliers
(marked in grey). For existing ameliorate-dataset methods, gravitation is added between all neighbors (here, we take
k(=2)-nearest neighbors as an example), as shown in Fig. 2(A). As a result, there are unreasonable gravitations between clusters and
between outliers and normal objects (see the dotted arrows for details), which inevitably leads to unreasonable proximity.
Obviously, not all neighbors are supposed to receive gravitation. In this paper, if the gravitation between neighbors is necessary, we call such
neighbors valid-neighbors; otherwise, we call them invalid-neighbors. From Fig. 2(A), we can observe that the distance between
invalid-neighbors is clearly much greater than the distance between valid-neighbors. Next, we rely on this phenomenon to identify
all valid-neighbors for each object.
The closer the objects (vertices) are, the larger the edge weight between them; the farther the objects (vertices) are, the smaller the
edge weight between them. For the dataset in Fig. 2, its topology graph (k = 2) is shown in Fig. 2(B). Next, instead of identifying valid-
neighbors directly, we transform identifying valid-neighbors into identifying invalid-edges (i.e., the edges between invalid-neighbors).
$$W_i = \frac{1}{k} \sum_{o_j \in \phi} w_{ij}, \tag{2}$$

in which $\phi$ is the set of k-nearest neighbors of $o_i$ and $w_{ij}$ is the weight of the edge between $o_i$ and $o_j$. Since the object weight is the mean value of its edge weights, the object weight and the edge weight follow similar probability distributions, but the number of object weights is only $1/k$ of the number of edge weights. We divide the value range of the object weights into several equal-length intervals, where the interval length is $dc = \frac{(\max(W) - \min(W)) \cdot 10}{N}$.
These intervals are the smallest unit of analysis, and the probability of the g-th interval is calculated by

$$P\big(\min(W) + g \cdot dc\big) = \frac{\sum_{j=1}^{N} \varphi_g(W_j)}{N}, \tag{3}$$
Table 1
The descriptions of commonly used notations in HIAC.
Fig. 2. An example of HIAC: A) shows the gravitation that traditional ameliorate-dataset methods add between objects; B) shows the topology
graph constructed by HIAC; C) shows the decision graph generated by HIAC based on the topology graph; D) shows valid-neighbors determined by
HIAC according to the decision graph, and it only adds gravitation between valid-neighbors; E) shows the change in the distribution of objects under
gravitation, where the intra-cluster distance gradually decreases, the inter-cluster distance gradually increases, and the outliers never move.
in which $\varphi_g(W_j) = 1$ if $\min(W) + (g-1) \cdot dc \leqslant W_j < \min(W) + g \cdot dc$; otherwise, $\varphi_g(W_j) = 0$. The object weight probability distribution
curve of the topology graph in Fig. 2(B) is shown in Fig. 2(C). We name the object weight probability distribution curve as the decision
graph. From the decision graph, we can observe a clear turning area, as indicated by the red arrow. On the left side of the turning area,
the weight is small and the probability is low. On the right side of the turning area, the weight is large and the probability increases
sharply. Obviously, the left side of the turning area is a small-probability event. Therefore, the turning area can be regarded as a
threshold, and an edge whose weight is less than the threshold can be regarded as an invalid-edge. The decision graph provides a
visual and reasonable basis, which makes it easy to identify invalid-edges. After clipping all invalid-edges, the neighbors connected
by the remaining edges are valid-neighbors. Algorithm 1 describes in detail how to identify valid-neighbors. HIAC only adds
gravitation between valid-neighbors, as shown in Fig. 2(D). Compared with Fig. 2(A), there is no gravitational attraction between
clusters or between outliers and normal objects in Fig. 2(D).
Algorithm 1. Identifying valid-neighbors.
Inputs: O, k
Output: the valid-neighbor matrix
1. Connect all k-nearest neighbors with edges.
2. Assign weight to each edge according to the formula 1.
3. for oi in O do
4. Calculate the object weight of oi according to the formula 2.
5. end for
6. Divide the value range of the object weights into N/10 equal-length intervals.
7. for g in range(N/10) do
8. Calculate the probability of the g-th interval according to the formula 3.
9. end for
10. Plot the probability distribution curve of object weights and treat it as the decision graph.
11. Identify the turning area of the decision graph as division threshold.
12. Divide k-nearest neighbors into valid-neighbors and invalid-neighbors according to the threshold.
13. Store valid-neighbors of all objects to the valid-neighbor matrix.
14. Return: the valid-neighbor matrix.
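Algorithm 1 could be sketched as below. Two caveats: the paper's edge-weight formula (1) is not reproduced in this excerpt, so the inverse-distance weight `1/(1 + dist)` is our stand-in; and the division `threshold` is passed in by the caller after being read off the decision graph's turning area, since the paper identifies it by eye:

```python
import math

def identify_valid_neighbors(O, k, threshold):
    """Sketch of Algorithm 1: build the topology graph, derive the
    decision graph, and split neighbors at the given threshold."""
    N = len(O)
    # Steps 1-2: connect k-nearest neighbors and assign edge weights.
    nbrs, edge_w = [], {}
    for i, a in enumerate(O):
        order = sorted((j for j in range(N) if j != i),
                       key=lambda j: math.dist(a, O[j]))[:k]
        nbrs.append(order)
        for j in order:
            edge_w[(i, j)] = 1.0 / (1.0 + math.dist(a, O[j]))
    # Steps 3-5: object weight = mean of the object's edge weights.
    W = [sum(edge_w[(i, j)] for j in nbrs[i]) / k for i in range(N)]
    # Steps 6-10: the decision graph is the probability of each of the
    # N/10 equal-length intervals of the object-weight range.
    n_int = max(N // 10, 1)
    dc = (max(W) - min(W)) / n_int
    decision_graph = [0.0] * n_int
    for w in W:
        g = min(int((w - min(W)) / dc), n_int - 1)
        decision_graph[g] += 1.0 / N
    # Steps 12-13: edges below the threshold are invalid; the neighbors
    # connected by the remaining edges are valid-neighbors.
    valid = [[j for j in nbrs[i] if edge_w[(i, j)] >= threshold]
             for i in range(N)]
    return valid, decision_graph
```

On a grid of normal objects plus one distant outlier, the outlier's edges carry near-zero weight, so its valid-neighbor list comes back empty while every normal object keeps its close neighbors.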
$$\vec{F}_{ij} = G \, \frac{\left\| \vec{o}_i^{\,*} - \vec{o}_i \right\|_2 \left( \vec{o}_{ij} - \vec{o}_i \right)}{\left\| \vec{o}_{ij} - \vec{o}_i \right\|_2^2}, \tag{4}$$

in which G is the gravitational constant, $G = \frac{1}{N} \sum_{j=1}^{N} \left\| \vec{o}_j^{\,*} - \vec{o}_j \right\|_2$, and it is used to control the magnitude of gravitation.

$$\vec{o}_i' = \vec{o}_i + \sum_{l=1}^{d} \vec{S}_i^{\,l} = \vec{o}_i + T \sum_{l=1}^{d} \vec{F}_i^{\,l} = \vec{o}_i + T \sum_{l=1}^{d} G^l \sum_{j=1}^{s} \frac{\left\| \vec{o}_i^{\,*} - \vec{o}_i^{\,l} \right\|_2 \left( \vec{o}_{ij} - \vec{o}_i^{\,l} \right)}{\left\| \vec{o}_{ij} - \vec{o}_i^{\,l} \right\|_2^2}, \tag{6}$$

in which $T = \frac{1}{2} t^2$, $\vec{o}_i^{\,l}$ is the $\vec{o}_i$ in the l-th time-segment, and $G^l$ is the gravitational constant in the l-th time-segment. T and d are the
other two parameters of HIAC. When all objects in the dataset O move according to formula (6), the dataset O is ameliorated.
Algorithm 2 describes in detail how gravitation ameliorates the dataset. As shown in Fig. 2(E), after two time-segments, the
intra-cluster distance decreases and the inter-cluster distance increases, but the outliers never move.
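The motion step can be sketched as follows, assuming the valid-neighbor matrix from Algorithm 1 is already given. Taking $o_i^*$ as the nearest valid-neighbor of $o_i$ and recomputing the gravitational constant each time-segment are our assumptions about details not restated in this excerpt:

```python
import math

def ameliorate(O, valid, T, d):
    """Sketch of the object-motion step built on formulas (4) and (6).

    `valid[i]` holds the locked valid-neighbor indices of object i, so
    objects always move toward their initial valid-neighbors.
    """
    pts = [list(p) for p in O]
    for _ in range(d):                            # d time-segments
        # ||o*_i - o_i||: distance to the nearest valid-neighbor.
        near = [min((math.dist(pts[i], pts[j]) for j in valid[i]),
                    default=0.0) for i in range(len(pts))]
        movers = [i for i in range(len(pts)) if valid[i]]
        if not movers:
            break
        # G^l: mean nearest-valid-neighbor distance over moving objects.
        G = sum(near[i] for i in movers) / len(movers)
        new = [p[:] for p in pts]
        for i in movers:
            for j in valid[i]:
                dist2 = math.dist(pts[i], pts[j]) ** 2
                if dist2 == 0:
                    continue
                for dim in range(len(pts[i])):
                    # Formula (4): F = G * ||o*_i - o_i|| * (o_ij - o_i) / ||o_ij - o_i||^2
                    f = G * near[i] * (pts[j][dim] - pts[i][dim]) / dist2
                    new[i][dim] += T * f          # displacement T * F, formula (6)
        pts = new
    return pts
```

Objects with an empty valid-neighbor list (outliers) are skipped entirely, matching the behavior shown in Fig. 2(E) where outliers never move.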
Based on the above principles, HIAC has the following advantages (We will experimentally verify these advantages in Section 4.3):
1. HIAC is immune to outliers: Since outliers are far away from normal objects and the number of outliers is much less than that of
normal objects, the outlier edge (i.e., the edge between the outlier and other object) is a small probability event in the topology
graph. Therefore, outlier edges are treated as invalid-edges and clipped. HIAC does not apply gravitation to outliers, which can
prevent outliers from approaching normal objects.
2. HIAC can correctly shrink adjacent clusters: In overlapping datasets, the distance between boundary-objects in different clusters
is still larger than the distance between boundary-objects in the same cluster. Furthermore, the number of boundary-objects is
much less than the number of core-objects (i.e., the objects inside clusters). Therefore, the edge between adjacent clusters is a
small-probability event in the topology graph, so HIAC hardly adds gravitation between adjacent clusters. More importantly, since HIAC
locks the initial valid-neighbors, the gravitation applied to each boundary-object is always towards its own cluster core, which can
drive adjacent clusters away from each other.
3. HIAC is insensitive to the parameter k: Once k is set large, objects will have distant neighbors. If objects move towards distant
neighbors, then the ameliorated dataset will be less conducive to clustering. Fortunately, the decision graph of HIAC is insensitive
to the parameter k. With different k values, the object weight probability distribution curves are similar. The reason is that the
object weight is the mean of the edge weights, and the larger the k value, the more stable the mean. Therefore, even if k is set large,
with the help of the decision graph, HIAC can clip extra invalid-edges to ensure that objects do not move towards distant neighbors.
In conclusion, the decision graph can counteract the adverse effects of inappropriate k values on HIAC.
4. Experiments
In this subsection, we describe the datasets, evaluation metrics, clustering algorithms to be optimized, baseline methods and
parameter settings.
4.1.1. Datasets
We select several common synthetic datasets (from clustering basic benchmark [9]) and real-world datasets (from UCI-dataset
repository [8]) to test the proposed method. Synthetic datasets are described in Table 2. Real-world datasets are described as follows:
• Breast cancer dataset records patient cases classified as benign and malignant tumors based on 30 attributes including mass
thickness, average cell size and average radius.
• Banknote authentication dataset records some attributes of banknotes to determine whether they are counterfeit.
• Each object in the Digit dataset corresponds to a handwritten digit represented by 8 * 8 pixels, and these digits belong to the
integers from 0 to 9.
• Iris dataset records 150 iris objects, which are divided into 3 categories (setosa, versicolour and virginica) according to the length
and width of the calyx and the length and width of the petals.
• Seeds dataset records the information of three categories of seeds (Kama, Rosa and Canadian), which are expressed by 7 attributes.
• Teaching assistant evaluation dataset records the scores of teaching assistants over 5 semesters, and these teaching assistants are
divided into 3 categories (low, medium and high).
Table 2
The details of synthetic datasets.
• Wireless indoor localization dataset records the signal strengths of different Wi-Fi collected in 4 rooms, and each object comes from
one room.
• Wine dataset records chemical analysis of different varieties of wines from Italy. The dataset can be used to distinguish wine
categories.
in which U is the ground-truth label, V is the clustering result, $P(i) = |U_i|/N$, $P'(j) = |V_j|/N$, and $P(i, j) = |U_i \cap V_j|/N$. The NMI's range is
[0, 1]. The closer it is to 1, the more accurate the clustering result is. 2) We visualize the dataset and mark objects in different colors
based on the clustering result. The more objects in the same cluster are marked in the same color, and the fewer objects in different
clusters are marked in the same color, the more accurate the clustering result is.
Alternatively, we visualize both original datasets and ameliorated datasets. If the inter-cluster distance increases and the intra-
cluster distance decreases, then ameliorated datasets will be more friendly to clustering algorithms than original datasets.
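The probabilities defined above can be turned into a small scoring function. The $\sqrt{H(U)H(V)}$ normalization below is one common NMI variant and is an assumption on our part, since this excerpt does not reproduce the paper's exact formula:

```python
import math
from collections import Counter

def nmi(U, V):
    """Normalized Mutual Information between two labelings.

    P(i), P'(j), and P(i, j) follow the definitions in the text; the
    sqrt(H(U) * H(V)) normalization is an assumed variant.
    """
    N = len(U)
    pu = {u: c / N for u, c in Counter(U).items()}
    pv = {v: c / N for v, c in Counter(V).items()}
    puv = {uv: c / N for uv, c in Counter(zip(U, V)).items()}
    mi = sum(p * math.log(p / (pu[u] * pv[v])) for (u, v), p in puv.items())
    hu = -sum(p * math.log(p) for p in pu.values())
    hv = -sum(p * math.log(p) for p in pv.values())
    if hu == 0 or hv == 0:
        return 1.0 if hu == hv else 0.0
    return mi / math.sqrt(hu * hv)

# A perfect clustering scores 1; permuting the label names does not matter.
truth = [0, 0, 1, 1, 2, 2]
result = [1, 1, 0, 0, 2, 2]
print(round(nmi(truth, result), 4))  # → 1.0
```

Because NMI compares the partitions rather than the label names, it is suitable for scoring unsupervised results against ground truth.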
Fig. 3. Process of HIAC ameliorating Dim32 (row 1), Dim64 (row 2), Dim128 (row 3), Dim256 (row 4), Dim512 (row 5), Dim1024 (row 6)
datasets: These datasets are high-dimensional datasets with 32, 64, 128, 256, 512, and 1024 dimensions, respectively. After being ameliorated by
HIAC, the compactness of clusters becomes significantly higher, and the inter-cluster distance increases significantly. Therefore, HIAC is robust to the
object dimension.
Agglomerative, DPC. The reason we select them is that they are common and very representative clustering algorithms. More
importantly, their input parameters are independent of the object distribution, so we can objectively evaluate whether the improvement in
clustering accuracy is contributed only by the ameliorated distribution.
• HIAC: the interval of T is from 0.1 to 2, the interval of k is from 5 to 25, and the interval of d is from 1 to 40.
• SBCA: the interval of k is from 5 to 200, and hyperparameters are set to the values recommended by the paper [25].
• Herd: the interval of threshold is from 0.1 to 1, and the interval of max is from 1 to 100. These intervals cover the range of parameter
values in the paper [33].
• Newtonian: the interval of δt is from 0.01 to 0.2, because the paper [3] sets it to a very small value.
• HIBOG: the interval of T is from 0.1 to 0.5, the interval of k is from 5 to 25, and the interval of d is from 1 to 10, as recommended
by the paper [13].
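The parameter scan implied by these intervals can be sketched as a grid search. Here `hiac`, `cluster`, and `nmi` are hypothetical callables standing in for the amelioration method, a clustering algorithm, and the NMI metric; the step sizes are our choice:

```python
import itertools

def grid_search(data, labels, hiac, cluster, nmi):
    """Scan the HIAC parameter intervals used in the experiments and
    keep the best-scoring (score, (T, k, d)) combination."""
    best = (-1.0, None)
    for T, k, d in itertools.product(
            [0.1 * i for i in range(1, 21)],   # T in [0.1, 2]
            range(5, 26),                      # k in [5, 25]
            range(1, 41)):                     # d in [1, 40]
        score = nmi(labels, cluster(hiac(data, k, T, d)))
        if score > best[0]:
            best = (score, (T, k, d))
    return best
```

Each baseline method would be scanned over its own intervals in the same way, so every method is reported at its best achievable accuracy.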
Clustering is an unsupervised task. Before clustering, clusters in the dataset to be analyzed are unknown. Therefore, it is necessary
to verify the robustness of HIAC to diverse datasets. In this subsection, we will test HIAC on high-dimensional datasets, Gaussian
datasets, and shape datasets.
Fig. 4. The original distribution (A, B, C) and the distribution ameliorated by HIAC (D, E, F) of A1 (A, D), A2 (B, E), A3 (C, F) datasets: These
datasets are Gaussian datasets. After being ameliorated by HIAC, all clusters become more compact, and the cluster boundaries are clearer.
Therefore, HIAC is robust to the Gaussian distribution.
Fig. 5. The original distributions (A, B, C, D, E) and the distributions ameliorated by HIAC (F, G, H, I, J) of Aggregation (A, F), Flame (B, G),
Heartshapes (C, H), Ls3 (D, I), and T7.10 k (E, J): These datasets are shape datasets. After being ameliorated by HIAC, all clusters become more
compact. Therefore, HIAC is robust to shape datasets.
Fig. 6. Original clustering results of K-means (A, B) and Agglomerative (C, D) on Flame (A, C) and Aggregation (B, D) datasets: Colors represent
the categories identified by clustering algorithms. K-means and Agglomerative are completely ineffective on these datasets, mistakenly
identifying multiple clusters as one category or one cluster as multiple categories (see the boxes for details).
Fig. 7. HIAC-optimized clustering results of K-means (A, B) and Agglomerative (C, D) on Flame (A, C) and Aggregation (B, D) datasets: After HIAC
optimization, the clustering results are mapped back to the original distribution. Only very few objects are misidentified, and the
clustering accuracies are greatly improved.
In this subsection, we will test HIAC on some extreme scenarios to demonstrate its advantages.
Fig. 8. Before and after optimization, the probabilities of DBSCAN (A, B, C, D, E) and BIRCH (F, G, H, I, J) with different levels of accuracy
on Aggregation (A, F), Flame (B, G), Heartshapes (C, H), Ls3 (D, I), and T7.10 k (E, J): DBSCAN and BIRCH have a higher probability of obtaining
higher accuracy after HIAC optimization. That is, after HIAC optimization, more parameter values allow them to produce more accurate results.
Therefore, HIAC successfully reduces the sensitivity of DBSCAN and BIRCH to parameter values.
Fig. 9. The original distribution (A, B, C) and the distribution ameliorated by HIAC (D, E, F) of Compound-part (A, D), R15-outlier (B, E), and
Asymmetric-outlier (C, F) datasets: These datasets contain many outliers marked in orange. In the ameliorated datasets, only normal clusters become
more compact, while outliers retain their original distribution. Therefore, HIAC is immune to outliers.
mechanism (i.e., adding gravitation between all neighbors), resulting in adjacent clusters moving closer to each other. Obviously, this is
detrimental to clustering. HIAC is the first to adopt the selective-addition mechanism (i.e., only adding gravitation between
valid-neighbors). In order to verify the advantages of the selective-addition mechanism, we compare HIAC and HIAC-full on Adj-1, Adj-2,
and Adj-3 datasets. HIAC-full is the version of HIAC that adopts the full-addition mechanism. Fig. 10 shows the original distribution of
these datasets (row 1), the distribution ameliorated by HIAC-full (row 2), and the distribution ameliorated by HIAC (row 3), in which
different colors represent the ground-truth labels of objects. In the datasets ameliorated by HIAC-full, adjacent clusters intersect each
other, and some boundary-objects even appear inside other clusters. Furthermore, HIAC-full erroneously makes the objects in Adj-1
(see Fig. 10(D) for details) more scattered. In the datasets ameliorated by HIAC, objects shrink to the cluster core and adjacent clusters
move far away from each other, so the cluster boundaries become very clear. Clearly, HIAC's selective-addition
mechanism has significant advantages in ameliorating adjacent clusters.
Next, we test HIAC on a more extreme set of datasets. The S-sets consist of 4 datasets, namely S1, S2, S3, and S4, all of which contain 15
Gaussian clusters. From S1 to S4, the inter-cluster distance gets smaller and smaller, and the cluster boundaries in S4 can no
longer be distinguished with the naked eye. Fig. 11 shows the original distribution of these datasets (row 1), the clustering results of
Agglomerative on them (row 2), and the clustering results of Agglomerative on the datasets ameliorated by HIAC (row 3, and the
clustering results have been mapped back to the original distribution), in which colors represent the categories identified by
Agglomerative. Before HIAC optimization, Agglomerative is ineffective on S4 because some overlapping clusters are marked in one color
(see the red box for details). After HIAC optimization, these overlapping clusters are accurately distinguished by Agglomerative.
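This before/after pipeline can be reproduced in miniature. The sketch below uses three overlapping Gaussian blobs as a stand-in for the S-sets and a naive full-neighbor shrinking step in place of HIAC (so it illustrates the pipeline, not HIAC's selective mechanism); SciPy's Ward linkage plays the role of Agglomerative.

```python
import numpy as np
from itertools import permutations
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import cdist

rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [2.2, 0.0], [1.1, 1.9]])
X = np.vstack([c + rng.normal(0, 0.5, (60, 2)) for c in centers])
true = np.repeat([1, 2, 3], 60)                    # ground-truth labels

def shrink(P, k=6, step=0.3, iters=3):
    """Naive amelioration stand-in: move each object towards the mean
    of its k nearest neighbors (no valid-neighbor selection)."""
    P = P.copy()
    for _ in range(iters):
        D = cdist(P, P)
        np.fill_diagonal(D, np.inf)
        idx = np.argsort(D, axis=1)[:, :k]
        P += step * (P[idx].mean(axis=1) - P)
    return P

def best_acc(pred, truth):
    """Clustering accuracy under the best label permutation."""
    return max((pred == np.array([p[t - 1] for t in truth])).mean()
               for p in permutations((1, 2, 3)))

# Agglomerative (Ward) clustering before and after amelioration.
before = fcluster(linkage(X, 'ward'), t=3, criterion='maxclust')
after = fcluster(linkage(shrink(X), 'ward'), t=3, criterion='maxclust')
```

Comparing `best_acc(before, true)` with `best_acc(after, true)` typically shows that the shrunken dataset, with its enlarged inter-cluster gaps, is easier for the hierarchical algorithm to partition.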
Q. Li et al. Information Sciences 627 (2023) 52–70
Fig. 10. The original distribution (A, B, C), the distribution ameliorated by HIAC-full (D, E, F), and the distribution ameliorated by HIAC
(G, H, I) of Adj-1 (A, D, G), Adj-2 (B, E, H), and Adj-3 (C, F, I): The clusters in these datasets are close to each other. Colors represent the ground-
truth labels of objects. In the datasets ameliorated by HIAC-full, adjacent clusters intersect each other, which is very unfavorable for clustering. In
the datasets ameliorated by HIAC, adjacent clusters become far away from each other.
In this subsection, we compare HIAC with Newtonian, Herd, SBCA, and HIBOG on the Breast cancer (hereinafter Cancer), Banknote authentication (hereinafter Banknote), Digit, Iris, Seeds, Teaching assistant evaluation (hereinafter TAE), Wireless indoor localization (hereinafter Wireless), and Wine datasets. The size of these datasets is recorded in Table 3.
Table 3.
Fig. 11. The original distribution (A, B, C, D) of S1 (A, E, I), S2 (B, F, J), S3 (C, G, K), S4 (D, H, L), and the original (E, F, G, H) and the HIAC-
optimized (I, J, K, L) clustering results of Agglomerative on these datasets: From S1 to S4, adjacent clusters get closer and closer. Colors represent
the categories identified by Agglomerative. Agglomerative mistakenly identifies some overlapping adjacent clusters as one category (see the red box
for details). After HIAC optimization, Agglomerative successfully distinguishes all adjacent clusters.
Fig. 12. The original distribution (column 1) and the distribution ameliorated by HIAC with different k values (columns 2 to 5) of Flame
(row 1) and Aggregation (row 2) datasets: With different parameter k values, HIAC always shrinks all clusters into distant small clusters.
Table 3
The size of real-world datasets.
Cancer Banknote Digit Iris Seeds TAE Wireless Wine
Among the compared methods, HIBOG improves the accuracies of clustering algorithms on all datasets. However, its optimization is not outstanding: many of its improvement rates are below 10%, and it achieves the highest improvement rate only once (see the bold in Table 5 for details). As for the proposed HIAC, it successfully ameliorates all datasets, so the accuracies of clustering algorithms are greatly improved. More importantly, most of HIAC's improvement rates are above 10%, some even as high as 26,000%. Compared with Newtonian, Herd, SBCA, and HIBOG, HIAC's improvement rates are almost always the highest (see the bold in Table 5 for details), and its average improvement rate is as
Table 4
The accuracies of clustering algorithms before optimization.
K-means Agglomerative DPC
Table 5
The accuracies of clustering algorithms after optimization.
K-means Agglomerative DPC
high as 253.6% (excluding the maximum and minimum). Especially on Banknote, the original accuracies of K-means and Agglomerative are only 0.029 and 0.003. After Newtonian, Herd, SBCA, or HIBOG optimization, the accuracies of K-means and Agglomerative do not exceed 0.2; after HIAC optimization, they reach 0.773 and 0.783. Clearly, HIAC has substantial advantages over existing ameliorate-dataset methods in improving the accuracy of clustering algorithms.
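The improvement-rate figures quoted above follow from simple arithmetic. A short sketch is given below; the trimmed average (dropping one maximum and one minimum) is our reading of the paper's "except maximum and minimum" convention, not a procedure the paper spells out.

```python
def improvement_rate(acc_before, acc_after):
    """Relative accuracy gain, in percent."""
    return 100.0 * (acc_after - acc_before) / acc_before

def trimmed_average(rates):
    """Average after discarding a single maximum and minimum."""
    r = sorted(rates)
    return sum(r[1:-1]) / (len(r) - 2)

# K-means on Banknote: 0.029 before HIAC, 0.773 after.
print(round(improvement_rate(0.029, 0.773), 1))  # → 2565.5
```

Trimming keeps one extreme case, such as a 26,000% gain on a near-zero baseline, from dominating the reported average.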
5. Conclusion
In this paper, we propose a novel ameliorate-dataset method called HIAC to optimize clustering algorithms. Different from
traditional ameliorate-dataset methods, HIAC divides the neighbors of objects into valid-neighbors and invalid-neighbors, and only
adds gravitation between valid-neighbors to avoid dissimilar objects approaching each other. When determining valid-neighbors,
HIAC introduces the decision graph from which a clear division threshold can be identified by the naked eye. More importantly,
the decision graph can assist HIAC in reducing the negative effects that improper parameter k values have on optimization. We conduct
extensive experiments on common synthetic and real-world datasets to test HIAC. HIAC successfully ameliorates high-dimensional datasets, Gaussian datasets, and shape datasets, not only reducing their intra-cluster distances but also increasing their inter-cluster distances; the ameliorated datasets become more friendly to clustering algorithms. We also verify the advantages of HIAC. Specifically, HIAC shrinks only the normal objects in datasets with outliers, and avoids pushing adjacent clusters closer together in overlapping datasets. Furthermore, HIAC is not sensitive to the parameter k, obtaining stable effects over a wide range of k values. Finally,
we compare HIAC with Newtonian, Herd, SBCA, and HIBOG. HIAC has outstanding advantages over them: it greatly improves the accuracies of clustering algorithms, its improvement rates are much higher than those of the other methods, and its running times are much shorter than those of most methods. With different parameter values, the advantages of HIAC over the other methods are always maintained. In conclusion, HIAC is an excellent method to optimize clustering algorithms.
Fig. 13. On the Cancer (A), Banknote (B), Digit (C), Iris (D), Seeds (E), TAE (F), Wireless (G), and Wine (H) datasets, the clustering accuracies of K-means optimized by ameliorate-dataset methods with different parameter values: The accuracy curves of HIAC are almost always at the top, so HIAC is significantly superior to HIBOG, SBCA, and Herd.
Fig. 14. On the Cancer (A), Banknote (B), Digit (C), Iris (D), Seeds (E), TAE (F), Wireless (G), and Wine (H) datasets, the clustering accuracies of Agglomerative optimized by ameliorate-dataset methods with different parameter values: The accuracy curves of HIAC are almost always at the top, so HIAC is significantly superior to HIBOG, SBCA, and Herd.
Fig. 15. On the Cancer (A), Banknote (B), Digit (C), Iris (D), Seeds (E), TAE (F), Wireless (G), and Wine (H) datasets, the clustering accuracies of DPC optimized by ameliorate-dataset methods with different parameter values: The accuracy curves of HIAC are at the top, so HIAC is significantly superior to HIBOG, SBCA, and Herd.
Fig. 16. Running time of HIAC, Newtonian, Herd, SBCA, and HIBOG: The datasets are sorted by size (see Table 3 for details) on the abscissa. The running time of HIAC and HIBOG is significantly shorter than that of Newtonian, Herd, and SBCA.
CRediT authorship contribution statement
Qi Li: Conceptualization, Methodology, Writing – review & editing, Formal analysis, Investigation, Software. Shuliang Wang: Funding acquisition, Writing – review & editing. Xianjun Zeng: Software, Writing – review & editing. Boxiang Zhao: Writing – review & editing. Yingxu Dang: Writing – review & editing.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work is funded by the National Key R&D Program of China (2020YFC0832600) and the National Natural Science Foundation of China (62076027).
References
[1] A. Rodriguez, A. Laio, Clustering by fast search and find of density peaks, Science 344 (2014) 1492–1496.
[2] M. Alswaitti, M.K. Ishak, N.A.M. Isa, Optimized gravitational-based data clustering algorithm, Eng. Appl. Artif. Intell. 73 (2018) 126–148.
[3] K. Blekas, I.E. Lagaris, Newtonian clustering: An approach based on molecular dynamics and global optimization, Pattern Recogn. 40 (2007) 1734–1744.
[4] L. Cai, H. Wang, F. Jiang, Y. Zhang, Y. Peng, A new clustering mining algorithm for multi-source imbalanced location data, Inf. Sci. 584 (2022) 50–64.
[5] Y. Cai, M. Zeng, Z. Cai, X. Liu, Z. Zhang, Graph regularized residual subspace clustering network for hyperspectral image clustering, Inf. Sci. 578 (2021) 85–101.
[6] J. Chen, S.Y. Philip, A domain adaptive density clustering algorithm for data with varying density distribution, IEEE Trans. Knowl. Data Eng. 33 (2021)
2310–2321.
[7] L. Chen, F. Chen, Z. Liu, M. Lv, T. He, S. Zhang, Parallel gravitational clustering based on grid partitioning for large-scale data, Appl. Intell. (2022) 1–21.
[8] D. Dua, C. Graff, UCI machine learning repository, 2019. URL: http://archive.ics.uci.edu/ml.
[9] P. Fränti, S. Sieranoja, K-means properties on six clustering benchmark datasets, Appl. Intell. 48 (2018) 4743–4759.
[10] C. Gong, Z.-g. Su, P.-h. Wang, Q. Wang, An evidential clustering algorithm by finding belief-peaks and disjoint neighborhoods, Pattern Recogn. 113 (2021) 107751.
[11] L. Guo, Q. Dai, Graph clustering via variational graph embedding, Pattern Recogn. 122 (2022), 108334.
[12] J. Han, J. Xu, F. Nie, X. Li, Multi-view k-means clustering with adaptive sparse memberships and weight allocation, IEEE Trans. Knowl. Data Eng. 34 (2022)
816–827.
[13] Q. Li, S. Wang, C. Zhao, B. Zhao, X. Yue, J. Geng, Hibog: Improving the clustering accuracy by ameliorating dataset with gravitation, Inf. Sci. 550 (2021) 41–56.
[14] J. Liu, F. Cao, J. Liang, Centroids-guided deep multi-view k-means clustering, Inf. Sci. 609 (2022) 876–896.
[15] Z. Long, Y. Gao, H. Meng, Y. Yao, T. Li, Clustering based on local density peaks and graph cut, Inf. Sci. 600 (2022) 263–286.
[16] A. Lotfi, P. Moradi, H. Beigy, Density peaks clustering based on density backbone and fuzzy neighborhood, Pattern Recogn. 107 (2020), 107449.
[17] Y. Lu, Y.M. Cheung, Y.Y. Tang, Self-adaptive multiprototype-based competitive learning approach: A k-means-type algorithm for imbalanced data clustering,
IEEE Trans. Cybern. 51 (2021) 1598–1612.
[18] Y. Ma, H. Lin, Y. Wang, H. Huang, X. He, A multi-stage hierarchical clustering algorithm based on centroid of tree and cut edge constraint, Inf. Sci. 557 (2021)
194–219.
[19] A.C.A. Neto, J. Sander, R.J.G.B. Campello, M.A. Nascimento, Efficient computation and visualization of multiple density-based clustering hierarchies, IEEE
Trans. Knowl. Data Eng. 33 (2021) 3075–3089.
[20] F. Nie, W. Chang, Z. Hu, X. Li, Robust subspace clustering with low-rank structure constraint, IEEE Trans. Knowl. Data Eng. 34 (2022) 1404–1415.
[21] X. Peng, Y. Li, I.W. Tsang, H. Zhu, J. Lv, J.T. Zhou, Xai beyond classification: Interpretable neural clustering, J. Mach. Learn. Res. 23 (2022) 1–6.
[22] E. Rashedi, H. Nezamabadi-Pour, A stochastic gravitational approach to feature based color image segmentation, Eng. Appl. Artif. Intell. 26 (2013) 1322–1332.
[23] A.U. Rehman, S.B. Belhaouari, Divide well to merge better: A novel clustering algorithm, Pattern Recogn. 122 (2022), 108305.
[24] M. Rezaei, P. Fränti, Can the number of clusters be determined by external indices? IEEE Access 8 (2020) 89239–89257.
[25] Y. Shi, Y. Song, A. Zhang, A shrinking-based clustering approach for multidimensional data, IEEE Trans. Knowl. Data Eng. 17 (2005) 1389–1403.
[26] D. Sun, K.C. Toh, Y. Yuan, Convex clustering: Model, theoretical guarantee and efficient algorithm, J. Mach. Learn. Res. 22 (2021) 1–32.
[27] G. Sun, Y. Cong, J. Dong, Y. Liu, Z. Ding, H. Yu, What and how: generalized lifelong spectral clustering via dual memory, IEEE Trans. Pattern Anal. Mach. Intell. (2021).
[28] G. Wang, Q. Song, Automatic clustering via outward statistical testing on density metrics, IEEE Trans. Knowl. Data Eng. 28 (2016) 1971–1985.
[29] S. Wang, Q. Li, C. Zhao, X. Zhu, H. Yuan, T. Dai, Extreme clustering–a clustering method via density extreme points, Inf. Sci. 542 (2021) 24–39.
[30] S. Wang, D. Wang, C. Li, Y. Li, G. Ding, Clustering by fast search and find of density peaks with data field, Chin. J. Electron. 25 (2016) 397–402.
[31] X.X. Wang, Y.F. Zhang, J. Xie, Q.Z. Dai, Z.Y. Xiong, J.P. Dan, A density-core-based clustering algorithm with local resultant force, Soft. Comput. 24 (2020)
6571–6590.
[32] Z. Wang, Z. Yu, C.P. Chen, J. You, T. Gu, H.S. Wong, J. Zhang, Clustering by local gravitation, IEEE Trans. Cybern. 48 (2017) 1383–1396.
[33] K.C. Wong, C. Peng, Y. Li, T.M. Chan, Herd clustering: A synergistic data clustering approach using collective intelligence, Appl. Soft Comput. 23 (2014) 61–75.
[34] W.E. Wright, Gravitational clustering, Pattern Recogn. 9 (1977) 151–166.
[35] S. Xia, D. Peng, D. Meng, C. Zhang, G. Wang, E. Giem, W. Wei, Z. Chen, Ball k-means: Fast adaptive clustering with no bounds, IEEE Trans. Pattern Anal. Mach. Intell. 44 (2022) 87–99.
[36] J. Xie, Z. Xiong, Q. Dai, X. Wang, Y. Zhang, A local-gravitation-based method for the detection of outliers and boundary points, Knowl.-Based Syst. 192 (2020),
105331.
[37] M. Yang, Y. Li, P. Hu, J. Bai, J. Lv, X. Peng, Robust multi-view clustering with incomplete information, IEEE Trans. Pattern Anal. Mach. Intell. 45 (2022)
1055–1069.
[38] Y. Yang, S. Deng, J. Lu, Y. Li, Z. Gong, Z. Hao, et al., Graphlshc: towards large scale spectral hypergraph clustering, Inf. Sci. 544 (2021) 117–134.
[39] P. Zhang, K. She, A novel hierarchical clustering approach based on universal gravitation, Math. Problems Eng. (2020).
[40] P. Zhu, C. Zhang, X. Li, J. Zhang, X. Qin, A high-dimensional outlier detection approach based on local coulomb force, IEEE Trans. Knowl. Data Eng. (2022).
[41] S. Zhu, L. Xu, E.D. Goodman, Hierarchical topology-based cluster representation for scalable evolutionary multiobjective clustering, IEEE Trans. Cybern. (2021).