
Original Article
International Journal of Fuzzy Logic and Intelligent Systems
Vol. 22, No. 3, September 2022, pp. 267-275
http://doi.org/10.5391/IJFIS.2022.22.3.267
ISSN(Print) 1598-2645, ISSN(Online) 2093-744X

An Evolving Fuzzy Model to Determine an Optimal Number of Data Stream Clusters
Hussein A. A. Al-Khamees, Nabeel Al-A’araji, and Eman S. Al-Shamery
Department of Software, Babylon University, Babylon, Iraq

Abstract
Data streams are a modern type of data that differ from traditional data in various characteristics: their indefinite size, high access, and concept drift due to their origin in non-stationary environments. Data stream clustering aims to split these data samples into significant clusters, depending on their similarity. The main drawback of data stream clustering algorithms is the large number of clusters they produce; therefore, determining an optimal number of clusters is an important challenge for these algorithms. In practice, evolving models can change their general structure by implementing different mechanisms. This paper presents a fuzzy model that mainly consists of an evolving Cauchy clustering algorithm, which is updated through a specific membership function and determines the optimal number of clusters by implementing two evolving mechanisms: adding and splitting clusters. The proposed model was tested on six different streaming datasets, namely, power supply, sensor, HuGaDB, UCI-HAR, Luxembourg, and keystrokes. The results demonstrated that the proposed model outperforms previous models in producing an optimal number of clusters for each dataset.

Keywords: Data stream clustering, Clusters number, Evolving mechanisms

1. Introduction

Artificial intelligence (AI) is a broad branch of computer science, and machine learning is
the backbone of various techniques, among which clustering is one of the most important [1].
It aims to split a dataset into significant clusters of data samples that have a high degree
of similarity between them, and a low degree of similarity with the data samples in other
clusters [2].
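The clustering objective just described (high similarity within a cluster, low similarity across clusters) can be illustrated with a small self-contained sketch; the toy points and helper functions below are illustrative and not taken from the paper:

```python
from itertools import combinations

def euclidean(p, q):
    """Euclidean distance between two points given as coordinate tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def mean_pairwise(points):
    """Average distance over all pairs of points in one cluster."""
    pairs = list(combinations(points, 2))
    return sum(euclidean(p, q) for p, q in pairs) / len(pairs)

# Two toy clusters: samples within a cluster lie close together,
# samples from different clusters lie far apart.
cluster_a = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2)]
cluster_b = [(5.0, 5.1), (5.2, 5.0), (5.1, 5.2)]

intra = (mean_pairwise(cluster_a) + mean_pairwise(cluster_b)) / 2
inter = sum(euclidean(p, q) for p in cluster_a for q in cluster_b) / (
    len(cluster_a) * len(cluster_b))
# For a good clustering, the intra-cluster distance is much smaller
# than the inter-cluster distance (intra < inter here).
```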
Received: May 15, 2022; Revised: Aug. 31, 2022; Accepted: Sep. 6, 2022
Correspondence to: Hussein A. A. Al-Khamees (hussein.alkhamees7@gmail.com)
© The Korean Institute of Intelligent Systems
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Data stream analysis has become one of the most active and effective research fields in computer science due to the diverse challenges that it poses compared with traditional data analysis, including the indefinite size, high access, single scan, limited memory and processing time, and concept drift caused by the non-stationary condition of the environments where data originate [3]. Data stream mining links two essential computer fields: data mining tasks and data streams [4].

Many techniques for analyzing traditional data can also be implemented on data streams, with clustering being one of them. Data stream clustering methods can be categorized into five types, one of which is the density-based class. This category includes the evolving Cauchy (e-Cauchy) clustering algorithm, on which this study is based [5].

Evolving systems are naturally able to change the general structure of the model designed to describe the data stream by updating it after the arrival of each data sample. This is achieved


through several mechanisms, such as adding, merging, and splitting clusters to reduce the large number of clusters generated [6].

Fuzzy systems are extensively used in different fields that are based on the concept of fuzzy logic. In addition, evolving fuzzy algorithms are considered an important type of evolving systems because of their ability to interact with the data provided [7].

Determining an optimal number of generated clusters remains an open challenge, and traditional solutions cannot be applied to all cases. Therefore, evolving mechanisms are proposed as a solution to set an appropriate number of data stream clusters [8].

Most clustering algorithms (including e-Cauchy) generate a large number of clusters. Thus, the research question addressed in this study is how to overcome the large number of clusters generated while applying clustering algorithms to data streams. To overcome this problem, this study proposes an evolving model based on the e-Cauchy algorithm, which adopts new evolving mechanisms: adding clusters, and splitting clusters into high-quality and low-quality clusters by re-assigning all data samples from low- to high-quality clusters. Moreover, the paper presents a new fuzzy method for distributing testing data.

To evaluate the proposed model, several streaming datasets were used, namely, power supply, sensor, HuGaDB, UCI-HAR, Luxembourg, and keystrokes. The results demonstrated the model's efficiency in producing an optimal number of clusters for each dataset and showed that it has a higher silhouette coefficient than previous models, thus outperforming them.

2. Related Work

This section presents the most relevant studies related to the determination of an optimal number of generated clusters.

The authors in [9] discussed the limitations of the K-means algorithm, the first of which is the determination of the number of clusters. They presented a cluster number assisted K-means (CNAK) model that estimates this number based on bipartite matching and by adjusting the algorithm parameters. This model can generate different scores for the number of clusters (NoC), ranging from 1 to NoCmax, by applying the Kuhn-Munkres algorithm to obtain bipartite matching. The authors selected a random sample, used it as a reference, and mapped and compared the other cluster centroids.

The authors of [10] used one of the most effective validity indices, the entropy measure, as an indicator to determine the optimal number of clusters. Initialization is one of the most important steps in K-means algorithms; in this model it is achieved by selecting the first data sample through entropy; hence, the model is known as an entropy-based initialization model. In addition, the model aims to maximize a Shannon entropy-based objective function. The authors proved that their model is better than the K-means algorithm in terms of the number of clusters.

A density-peak-based clustering algorithm for automatically determining the number of clusters (DPADN) was proposed in [8]. This model consists of three steps. First, a density measure is designed by applying a Gaussian kernel function to compute the density of all dataset samples. Second, a pre-cluster approach is constructed to find the center of each cluster. Finally, a method is proposed that automatically sets the centers of the clusters. The evolving mechanisms of this model include searching for the nearest two clusters and then merging them into one.

Data stream mining can be defined as the uncovering of interesting and useful patterns from a large amount of data in a way that makes these patterns understandable. This can be achieved by many techniques [11].

In recent years, clustering has become one of the most important and widely used data stream mining techniques. Data stream learning is divided into two main types: supervised and unsupervised learning. Clustering is an unsupervised learning method [12].

Different methods are used to group (cluster) all the samples in a given dataset into clusters, each containing many samples that have a high degree of similarity with each other but are not similar to the samples in other clusters [11].

There are five types of data stream clustering methods: partitioning, hierarchical, grid-based, density-based, and model-based methods [13]. These same five types are also applied to traditional data. The model proposed herein is based on the e-Cauchy algorithm, which is a density-based method.

Cluster validation comprises three major tasks: clustering evaluation, clustering stability, and clustering tendency. Each task has several aspects. Clustering stability aims to form a good background for the sensitivity of the clustering result to different algorithm parameters, such as the number of clusters [14].

Cluster validation is also called determining the number of clusters in a dataset. The methods applied to set the number of clusters can be classified into three groups: methods based on merging and splitting, traditional methods, and methods based on evolutionary computation. The first of these groups


includes evolving methods that can be implemented on datasets to determine cluster numbers [14].

Traditional models usually fail when dealing with streaming data because of the challenges posed by this modern data type. Evolving systems represent an important and attractive solution for handling data streams [13]. One of the most important characteristics of evolving systems is their ability to change the general structure of the model designed to describe the data stream. Accordingly, the major task and key feature of any evolving system is associated with several mechanisms, such as adding, splitting, merging, and removing entities. When the evolving system is based on the clustering technique, the above mechanisms are implemented in terms of clusters, i.e., adding clusters, splitting clusters, etc. [5].

This ensures greater flexibility when the system conditions change in non-stationary environments where the data evolve over time or where concept drift appears. In traditional models, the data distribution is assumed to be stationary, and the structure of the models and their parameters do not change over time. In evolving systems, the data environment is non-stationary; therefore, both the structure and parameters of the models change over time, in what is classified as a dynamic state [8]. In these cases, the model is more suited to non-stationary environments that can generate a continuous stream of data whose distribution does not remain constant.

More specifically, some designers train their models offline (offline training), which is sometimes known as batch training. Such a model may initially perform well, but its performance will gradually deteriorate, especially as the upcoming data evolve. Moreover, evolving systems have the ability to forget data (especially after updating) in a proactive manner to maintain memory efficiency so that the data stream poses no problem [15].

Certainly, the analytics of these data must be performed in real time; therefore, online analytics may be implemented through evolving identification methods that allow the simultaneous adjustment of the structure and parameters [16]. Furthermore, when the system is time-variant, it is necessary to describe the various behaviors through the evolution of the model structure and to identify the parameters online [17].

Naturally, because there is a change in the general behavior of the data stream, the learner should evolve to keep up with the data change. The training step is the general idea behind an evolving system. This step comprises the following [18]:

1) Splitting the input space via a clustering algorithm to learn the antecedent parameters.
2) Tuning the parameters and structure to apply the evolution mechanisms.
3) Implementing a learning technique to learn the consequent parameters.

3. Proposed Method

Several clustering algorithms for data streams have been developed in recent years. In this study, we employed the e-Cauchy algorithm. The main idea was to compute the density of each arriving data sample from the dataset to construct the initial clusters, which are typically numerous. Subsequently, the cluster splitting mechanism was applied. Figure 1 shows a block diagram of the proposed model, comprising five main parts that are explained in detail in the following subsections.

Figure 1. Block diagram of the proposed method.

3.1 Data Pre-processing

Normalization is an essential data pre-processing step for most types of problems. Normalization can be accomplished through many methods, including min-max, decimal scaling, z-score, median and MAD, double sigmoid, and tanh estimators [19]. In the proposed model, the min-max technique was implemented. Suppose there is a set of matching scores {Ms}, where s = 1,


2, ..., n; then the normalized scores M′s are computed as

    M′s = (Ms − min) / (max − min).    (1)

The dataset is then partitioned into training data (70% of the dataset) and test data (30%).
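The min-max normalization of Eq. (1) and the 70/30 partition can be sketched as follows; the function names and sample values are illustrative, not from the paper:

```python
def min_max_normalize(scores):
    """Scale a list of matching scores to [0, 1] via min-max, as in Eq. (1)."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def train_test_split(samples, train_ratio=0.7):
    """Partition a dataset into training (70%) and test (30%) portions."""
    cut = int(len(samples) * train_ratio)
    return samples[:cut], samples[cut:]

scores = [10.0, 12.5, 15.0, 20.0]
normalized = min_max_normalize(scores)   # [0.0, 0.25, 0.5, 1.0]
train, test = train_test_split(normalized)
```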

3.2 Clustering Algorithm

The core step of the e-Cauchy clustering algorithm is the computation of the density of each training data sample. It is worth noting that this algorithm uses two predefined thresholds: θ (set to 0.1) and the density threshold (Denthr), which is unique to each dataset. Specifically, the latter was considered the most effective threshold.

The algorithm proceeds as follows. When the first data sample arrives, the first cluster is constructed and its parameters are set. When an additional data sample arrives, its density value is computed and compared with Denthr; if the density value is less than Denthr, a new cluster is constructed; otherwise, the current data sample is appended to an existing cluster and the parameters of the corresponding cluster are modified. These steps are repeated for all samples of the selected dataset to build the initial clusters, thus performing the first, cluster-adding mechanism.

The pseudocode for the e-Cauchy algorithm is shown in Figure 2, where the data stream D consists of {x1, x2, ..., xi}, SCj refers to the samples in cluster j, CCj denotes the center of cluster j, and NoC is the number of clusters.

Figure 2. Pseudocode for the e-Cauchy algorithm.

Figure 3. Pseudocode for the cluster splitting algorithm.
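The cluster-adding loop described above can be sketched as follows. This is a minimal illustration only: the Cauchy-type kernel and the incremental center update below are simple stand-ins, and `den_thr` plays the role of Denthr; the exact density and parameter-update formulas of e-Cauchy are those of [5]:

```python
def cauchy_density(x, center):
    """Cauchy-type kernel: close to 1 when x is near the cluster center.
    (Illustrative stand-in; the paper's exact density formula follows [5].)"""
    d2 = sum((a - b) ** 2 for a, b in zip(x, center))
    return 1.0 / (1.0 + d2)

def e_cauchy_stream(stream, den_thr):
    """One pass over the stream, sketching the adding mechanism.
    Each cluster is kept as [center, members]; centers update incrementally."""
    clusters = []
    for x in stream:
        if not clusters:
            clusters.append([list(x), [x]])  # first sample -> first cluster
            continue
        # best-fitting (densest) existing cluster for the arriving sample
        best = max(clusters, key=lambda c: cauchy_density(x, c[0]))
        if cauchy_density(x, best[0]) < den_thr:
            clusters.append([list(x), [x]])  # density below Denthr -> new cluster
        else:
            best[1].append(x)                # otherwise absorb the sample
            n = len(best[1])                 # and update the cluster center
            best[0] = [m + (xi - m) / n for m, xi in zip(best[0], x)]
    return clusters

# e.g., two well-separated groups yield NoC = 2 with den_thr = 0.5
noc = len(e_cauchy_stream([(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)], 0.5))
```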

3.3 Splitting Mechanism

As mentioned above, the cluster splitting mechanism is applied to divide all the generated clusters (the initial clusters from Algorithm 1 in Figure 2) into high- and low-quality clusters (HQCs and LQCs). By applying this mechanism, the model can evolve by re-assigning the samples in all LQCs to HQCs to produce the final HQCs. Both the cluster-adding and cluster-splitting mechanisms occur while training the model (with the training data), whereas the test data samples are assigned only to the HQCs that result from the cluster splitting mechanism. Figure 3 illustrates the pseudocode for the splitting algorithm.

3.4 Assignment of Test Data

In this model, the assignment of the test data samples requires three steps: computing the distance between the current test data sample and the first cluster, building a sub-cluster with a radius equal to the computed distance, and determining all test data samples that fall within the built sub-cluster. Subsequently, the distances of all data samples in the sub-cluster and their average are calculated.

As a result of these steps, a set of distances is available for each test data sample. In fact, the number of distances should be equal to the number of HQCs formed by applying Algorithm 2 in Figure 3.

In some cases, the number of distances is large; thus, additional calculation time may be required for each measurement (as in the case of the sensor data stream, which has 54 clusters; see Table 1). Therefore, it is clear that using a fuzzy membership function is very useful. In the proposed methodology, all computed distances were included in a membership function, as described in line 5 of the pseudocode for the test data assignment algorithm shown in Figure 4.

Finally, the minimum distance is selected (as described in line 6 of Algorithm 3 in Figure 4) to assign the current sample and all samples within the specified radius (as shown in line 7). This step is performed to ensure that the current data sample is assigned to the nearest cluster.

Figure 4. Algorithm for assigning test samples.

3.5 Evaluation

The silhouette coefficient (SC) is an internal index that evaluates the quality of the clustering results. More specifically, SC indicates whether the data samples were correctly clustered and separated (cluster coherence and goodness). The SC for data sample x(i) is computed as [20]:

    SCx(i) = (bx(i) − ax(i)) / max(ax(i), bx(i)),    (2)

where ax(i) is the average distance from x(i) to every data sample in the same cluster, and bx(i) is the lowest average distance between x(i) and every data sample in other clusters. It is worth pointing out that the SC for each stream dataset is measured before and after the evolving process for both the training and test data.

Based on the confusion matrix, which provides good details about the classifier, we use accuracy and purity [21]. Accuracy is computed as

    Accuracy = (TP + TN) / (TP + TN + FP + FN),    (3)

where TP is the true positive, TN is the true negative, FP is the false positive, and FN is the false negative. Purity can be calculated as

    Purity = (1/n) Σ_{i=1..n} Precision_i,    (4)

where the precision is given by

    Precision = TP / (TP + FP).    (5)

The proposed model was implemented in Python 3.6.9 on a Windows 10 Professional operating system with a 2.5 GHz Core i7 CPU and 16 GB of RAM.

4. Dataset Description

To evaluate the proposed model, different streaming datasets were used, generated from diverse domains such as questionnaires, user behavior recognition, and human activity recognition. Table 1 provides a brief overview of these datasets, all of which have a numerical data type [22,23].

Table 1. Characteristics of numerical streaming datasets

#  Dataset        Classes  Features  Samples
1  Power supply   24       2         29,928
2  Sensor         54       5         2,219,803
3  HuGaDB01 01    4        39        2,435
4  UCI-HAR        6        561       10,299
5  Luxembourg     2        30        1,901
6  Keystrokes     4        10        1,600

5. Results and Discussion

As explained earlier, the e-Cauchy clustering algorithm contains two thresholds, the more important of which is Denthr. However, the value of this threshold varies from one dataset to another. Therefore, a specific threshold value was set for each streaming dataset.

The first dataset used to test the proposed model was the power supply stream dataset, for which Denthr was set to 0.0161. Figure 5 shows the results for this dataset. The model initially generated 1,400 clusters; after implementing the evolving step, the number was reduced to 678, finally reaching an optimal number of clusters of 24. The SC for the training data


before evolving was −0.75, changing to 0.30 after evolution, whereas the SC for the test data before and after evolving was 0.31 and 0.33, respectively. Considering the SC value of the testing step for the power supply dataset (0.33), the proposed model outperforms the models presented in [24] and [25], which achieved SCs of 0.18 and 0.32, respectively.

Figure 5. (a) NoC and (b) SC for the power supply dataset.

The second dataset used for testing the model was the sensor data stream. For this dataset, Denthr was set to 0.0011. The initial number of clusters was 1,887, which was reduced to 1,052 and then to 412; finally, the optimal number of clusters was 54. In terms of SC, for the training step it was −0.52 before evolving and 0.22 afterwards, whereas for the test data it was 0.08 and 0.49 before and after evolving, respectively. Figure 6 shows the results for the sensor data stream. Based on the final SC value of the testing step for this dataset (0.49), the proposed model achieved a higher SC than the model proposed in [24], which achieved an SC value of 0.30.

Figure 6. (a) NoC and (b) SC for the sensor data stream.

The HuGaDB01 01 sensor dataset was then used to test the proposed model, with Denthr set to 0.0038. The number of clusters was initially 852, then 221, then 9, and finally 4. In terms of SC, its value for the training step was −0.68 before evolution and 0.39 after evolving. The SC for the testing step was −0.002 before evolving, increasing to 0.52 after evolving. Figure 7 illustrates these results.

Figure 7. (a) NoC and (b) SC for the HuGaDB dataset.

The fourth evaluation dataset was the UCI-HAR stream dataset, and its Denthr was set to 0.00028. The number of clusters was initially 3,605, which decreased to 933, then 412, and finally 6, the optimal NoC. The SC of the training data was −0.32 before evolution, increasing to 0.41 after evolving, while the SC for the test data was 0.11 before evolution and 0.53 thereafter. Figure 8 shows the results for the UCI-HAR dataset. According to the SC value for the UCI-HAR dataset, the proposed model outperforms the model presented in [26], which achieved an SC of 0.441 by implementing the K-means algorithm.

Figure 8. (a) NoC and (b) SC for the UCI-HAR dataset.

Next, the model was tested on the Luxembourg stream dataset, with a Denthr of 0.0026. The number of clusters was 1,901 at the beginning, then 666, and finally 2, which was the optimal NoC. The SC of the training data was −0.44 before evolution, increasing to 0.57 after evolving, whereas the SC for the test data was −0.08 before evolution and 0.57 thereafter. Figure 9 shows the results for this dataset.

The last dataset was the keystrokes stream dataset, for which Denthr was set to 0.055. Initially, the number of clusters was 592; the NoC then decreased to 39, then 8, and finally 4. The SCs of the training data before and after evolution were −0.36 and 0.29, respectively. The SC for the test data was 0.01 before evolution, increasing to 0.57 thereafter. These results are shown


in Figure 10.

Figure 9. (a) NoC and (b) SC for the Luxembourg dataset.

Figure 10. (a) NoC and (b) SC for the keystrokes data stream.

The accuracy and purity for each stream dataset are listed in Table 2. Based on these results, the proposed model outperforms many previous models. For the sensor data stream, this model achieved a purity of 81.25%, whereas the model in [27] had a purity of 76.5%. In the case of the UCI-HAR dataset, the accuracy of the proposed model was 77.39%, whereas that of the model in [28] was 78.79%. Similarly, the proposed model achieved an accuracy of 84.57% for the keystrokes dataset, whereas the model in [29] attained an accuracy of 77.0%. In general, the proposed model is highly accurate in assigning data samples to an appropriate cluster, although it is not suitable for processing high-dimensional stream datasets.

Table 2. Quality indices for the tested stream datasets

#  Dataset        Accuracy (%)  Purity (%)
1  Power supply   72.30         71.88
2  Sensor         81.48         81.25
3  HuGaDB01 01    89.75         89.12
4  UCI-HAR        77.39         77.00
5  Luxembourg     97.80         97.80
6  Keystrokes     84.57         84.93

6. Conclusion

Data streams are a modern type of data that can be generated in real-world applications. Their main characteristics are their massive size, high speed, and concept drift. Many techniques can be used on data streams, including clustering, which aims to group similar data samples into different clusters. Fuzzy systems are widely used in computer science, particularly in the field of AI. Determining an optimal number of clusters remains an open problem for researchers, as there is no static method for this purpose. This paper presents a fuzzy model to overcome this problem. The proposed fuzzy model depends on the e-Cauchy clustering algorithm, which is a density-based data stream clustering method, and implements a specific fuzzy membership function. The model applies two evolving mechanisms: adding and splitting clusters. The evaluation step involves finding an optimal number of clusters, as well as computing the SC, accuracy, and purity. Six streaming datasets were used to evaluate this model, namely, power supply, sensor, HuGaDB, UCI-HAR, Luxembourg, and keystrokes. The results obtained from the proposed model were analyzed and compared with those of previous models, showing that this model is more efficient and performs better than other existing models. Our future work will focus on applying the proposed model to online cloud computing and synchronizing it with login services to determine the authenticity and validity of users.

Conflict of Interest

No potential conflict of interest relevant to this article was reported.

References

[1] R. Cioffi, M. Travaglioni, G. Piscitelli, A. Petrillo, and F. De Felice, "Artificial intelligence and machine learning applications in smart production: progress, trends, and directions," Sustainability, vol. 12, no. 2, article no. 492, 2020. https://doi.org/10.3390/su12020492

[2] A. Abdullatif, F. Masulli, and S. Rovetta, "Clustering of nonstationary data streams: a survey of fuzzy partitional methods," Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 8, no. 4, article no. e1258, 2018. https://doi.org/10.1002/widm.1258

[3] H. A. Al-Khamees and E. S. Al-Shamery, "Survey: Clustering techniques of data stream," in Proceedings of


2021 1st Babylon International Conference on Information Technology and Science (BICITS), Babil, Iraq, 2021, pp. 113-119. https://doi.org/10.1109/BICITS51482.2021.9509923

[4] M. Al-Tarawneh, "Data stream classification algorithms for workload orchestration in vehicular edge computing: a comparative evaluation," International Journal of Fuzzy Logic and Intelligent Systems, vol. 21, no. 2, pp. 101-122, 2021. https://doi.org/10.5391/IJFIS.2021.21.2.101

[5] I. Skrjanc, S. Ozawa, T. Ban, and D. Dovzan, "Large-scale cyber attacks monitoring using evolving Cauchy possibilistic clustering," Applied Soft Computing, vol. 62, pp. 592-601, 2018. https://doi.org/10.1016/j.asoc.2017.11.008

[6] H. A. A. Al-Khamees, N. Al-A'araji, and E. S. Al-Shamery, "Data stream clustering using fuzzy-based evolving Cauchy algorithm," International Journal of Intelligent Engineering and Systems, vol. 14, no. 5, pp. 348-358, 2021. https://doi.org/10.22266/ijies2021.1031.31

[7] E. Lughofer, Evolving Fuzzy Systems: Methodologies, Advanced Concepts and Applications. Berlin: Springer, 2011. https://doi.org/10.1007/978-3-642-18087-3

[8] W. Tong, S. Liu, and X. Z. Gao, "A density-peak-based clustering algorithm of automatically determining the number of clusters," Neurocomputing, vol. 458, pp. 655-666, 2021. https://doi.org/10.1016/j.neucom.2020.03.125

[9] J. Saha and J. Mukherjee, "CNAK: cluster number assisted K-means," Pattern Recognition, vol. 110, article no. 107625, 2021. https://doi.org/10.1016/j.patcog.2020.107625

[10] K. Chowdhury, D. Chaudhuri, and A. K. Pal, "An entropy-based initialization method of K-means clustering on the optimal number of clusters," Neural Computing and Applications, vol. 33, no. 12, pp. 6965-6982, 2021. https://doi.org/10.1007/s00521-020-05471-9

[11] H. A. A. Al-Khamees, W. R. H. Al-Jwaid, and E. S. Al-Shamery, "The impact of using convolutional neural networks in COVID-19 tasks: a survey," International Journal of Computing and Digital Systems, vol. 11, no. 1, pp. 189-197, 2022. https://doi.org/10.12785/ijcds/110194

[12] Wiharto, A. K. Wicaksana, and D. E. Cahyani, "Modification of a density-based spatial clustering algorithm for applications with noise for data reduction in intrusion detection systems," International Journal of Fuzzy Logic and Intelligent Systems, vol. 21, no. 2, pp. 189-203, 2021. https://doi.org/10.5391/IJFIS.2021.21.2.189

[13] M. Mousavi, A. A. Bakar, and M. Vakilian, "Data stream clustering algorithms: a review," Int J Adv Soft Comput Appl, vol. 7, no. 3, pp. 1-15, 2015.

[14] E. Hancer and D. Karaboga, "A comprehensive survey of traditional, merge-split and evolutionary approaches proposed for determination of cluster number," Swarm and Evolutionary Computation, vol. 32, pp. 49-67, 2017. https://doi.org/10.1016/j.swevo.2016.06.004

[15] P. Angelov and A. Kordon, "Evolving inferential sensors in the chemical process industry," in Evolving Intelligent Systems: Methodology and Applications. Hoboken, NJ: John Wiley & Sons, 2010, pp. 313-336. https://doi.org/10.1002/9780470569962.ch14

[16] I. Skrjanc, G. Andonovski, A. Ledezma, O. Sipele, J. A. Iglesias, and A. Sanchis, "Evolving cloud-based system for the recognition of drivers' actions," Expert Systems with Applications, vol. 99, pp. 231-238, 2018. https://doi.org/10.1016/j.eswa.2017.11.008

[17] V. Souza, D. M. dos Reis, A. G. Maletzke, and G. E. Batista, "Challenges in benchmarking stream learning algorithms with real-world data," Data Mining and Knowledge Discovery, vol. 34, no. 6, pp. 1805-1858, 2020. https://doi.org/10.1007/s10618-020-00698-5

[18] R. D. Baruah and P. Angelov, "Evolving fuzzy systems for data streams: a survey," Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 1, no. 6, pp. 461-476, 2011. https://doi.org/10.1002/widm.42

[19] A. Jain, K. Nandakumar, and A. Ross, "Score normalization in multimodal biometric systems," Pattern Recognition, vol. 38, no. 12, pp. 2270-2285, 2005. https://doi.org/10.1016/j.patcog.2005.01.012

[20] G. Shamim and M. Rihan, "Multi-domain feature extraction for improved clustering of smart meter data," Technology and Economics of Smart Grids and Sustainable Energy, vol. 5, article no. 10, 2020. https://doi.org/10.1007/s40866-020-00080-w


[21] H. A. A. Al-Khamees, N. Al-A'araji, and E. S. Al-Shamery, "Classifying the human activities of sensor data using deep neural network," in Intelligent Systems and Pattern Recognition. Cham, Switzerland: Springer, 2022, pp. 107-118. https://doi.org/10.1007/978-3-031-08277-1_9

[22] R. Chereshnev and A. Kertesz-Farkas, "HuGaDB: human gait database for activity recognition from wearable inertial sensor networks," in Analysis of Images, Social Networks and Texts. Cham, Switzerland: Springer, 2017, pp. 131-141. https://doi.org/10.1007/978-3-319-73013-4_12

[23] D. Anguita, A. Ghio, L. Oneto, X. Parra, and J. J. Reyes-Ortiz, "Human activity recognition on smartphones using a multiclass hardware-friendly support vector machine," in Ambient Assisted Living and Home Care. Heidelberg, Germany: Springer, 2012, pp. 216-223. https://doi.org/10.1007/978-3-642-35395-6_30

[24] N. A. Supardi, S. J. Abdulkadir, and N. Aziz, "An evolutionary stream clustering technique for outlier detection," in Proceedings of 2020 International Conference on Computational Intelligence (ICCI), Bandar Seri Iskandar, Malaysia, 2020, pp. 299-304. https://doi.org/10.1109/ICCI51257.2020.9247832

[25] X. F. Tang, R. Huang, Q. Chen, Z. Y. Peng, H. Wang, and B. H. Wang, "An outlier detection method of low-voltage users based on weekly electricity consumption," IOP Conference Series: Materials Science and Engineering, vol. 631, no. 4, article no. 042004, 2019. https://doi.org/10.1088/1757-899x/631/4/042004

[26] P. P. Ariza-Colpas, E. Vicario, A. I. Oviedo-Carrascal, S. Butt Aziz, M. A. Pineres-Melo, A. Quintero-Linero, and F. Patara, "Human activity recognition data analysis: history, evolutions, and new trends," Sensors, vol. 22, no. 9, article no. 3401, 2022. https://doi.org/10.3390/s22093401

[27] J. Shao, Y. Tan, L. Gao, Q. Yang, C. Plant, and I. Assent, "Synchronization-based clustering on evolving data stream," Information Sciences, vol. 501, pp. 573-587, 2019. https://doi.org/10.1016/j.ins.2018.09.035

[28] A. Abedin, F. Motlagh, Q. Shi, H. Rezatofighi, and D. Ranasinghe, "Towards deep clustering of human activities from wearables," in Proceedings of the 2020 International Symposium on Wearable Computers, Virtual Event, 2020, pp. 1-6. https://doi.org/10.1145/3410531.3414312

[29] C. Fahy and S. Yang, "Finding and tracking multi-density clusters in online dynamic data streams," IEEE Transactions on Big Data, vol. 8, no. 1, pp. 178-192, 2019. https://doi.org/10.1109/TBDATA.2019.2922969

Hussein A. A. Al-Khamees received his B.S. degree in Computer Science from University of Babylon, Iraq, in 1999. He received his M.S. degree in Information Technology from the Turkish Aeronautical Association - Institute of Science and Technology, Turkey, in 2017. His research interests include data mining, data stream analysis, deep learning, and intelligent systems.
E-mail: hussein.alkhamees7@gmail.com

Nabeel Al-A'araji received his B.S. degree in Mathematics from Al-Mustansiryah University, Iraq, in 1976. He received his M.Sc. degree in Mathematics from the University of Baghdad, Iraq, in 1978, and his Ph.D. degree in Mathematics from the University of Wales, Aberystwyth, UK, in 1988. He is currently a professor at the Department of Software, University of Babylon. His research interests include artificial intelligence, GIS, machine learning, neural networks, deep learning, and data mining.
E-mail: nhkaghed@itnet.uobabylon.edu.iq

Eman S. Al-Shamery received her B.Sc., M.Sc., and Ph.D. degrees in Computer Science from the University of Babylon, Iraq, in 1998, 2001, and 2013, respectively. She is currently a professor at the Department of Software, University of Babylon. Her research interests include artificial intelligence, bioinformatics, machine learning, neural networks, deep learning, and data mining.
E-mail: emanalshamery@itnet.uobabylon.edu.iq
