AF-DBSCAN Presentation

3rd edition of the International Conference Europe Middle East & North Africa on Information System & Technology and Learning Research
El Kef Computer Science Institute (ISI KEF), Tunisia
A New Fuzzy Clustering Method based on FN-DBSCAN to Determine the Optimal Input Parameters
Authors: Mme. Sihem JEBARI, Dr. Abir SMITI, Dr. Aymen LOUATI
November 21-23, 2019 – Marrakech, Morocco

PLAN
• Context and Issues
  • Data clustering
  • DBSCAN
  • FN-DBSCAN
  • Current challenges
• State of the Art
• AF-DBSCAN
  • Overview of the method
  • Data normalization
  • Estimation of input parameter values
  • Clustering task
• Experimental Results
• Conclusion and Future Works

1. Context & Issues
a. Data Clustering
- The modern world is a data-driven world.
Data must be analyzed and processed in order to extract information (to inform, answer questions, aid understanding, and support decision making).
Data analysis has become important.
- The volume of data increases every day.
Manual processing of this amount of data becomes extremely difficult and expensive. The processing must be done automatically.
- Data clustering plays an important role in data analysis.

Definition
- Clustering is a task that aims at « finding of natural groups from a data set, when little or nothing is known about the category structure. » M.R. Anderberg, 1973
- « Cluster Analysis divides data into groups that are meaningful, useful or both » P.N. Tan et al, 2005
Figure: questions raised by clustering (Do the samples form a group? Similarity? Hidden structure? Clusters? Decision?)

b. DBSCAN
- Clustering methods: probabilistic, distance-based, and density-based models.
- Density-based methods consider clusters to be dense sets of data points separated by less dense regions.
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise): a well-known density-based clustering method.
It can discover hidden patterns in the data even in the presence of noise.

- Data acquisition involves some level of imprecision.
- Some automated data acquisition devices can introduce forms of uncertainty into the stored data.
Hard clustering methods, which draw crisp boundaries to separate clusters, show their limits here and can produce inaccurate clustering results.
Figure: how can clustering be performed accurately, avoiding mistakes, in the presence of imprecision and uncertainty?

Data Clustering: a powerful tool to identify the natural grouping in data.
Fuzzy Set Theory: an efficient theory to handle imprecision and uncertainty.
Fuzzy Clustering (Data Clustering combined with Fuzzy Set Theory): aims to find natural clusters, discarding details which render them inaccurate.

- Soft clustering.
- Objects are grouped by assigning each one a membership degree to each of the clusters through a membership function.
- Each data element has a membership degree to each of the clusters, between 0 (non-membership) and 1 (full membership).
- Fuzzy clustering methods: a group of algorithms for cluster analysis in which the data elements are assigned to the clusters in a non-crisp way, i.e. an element can belong to one or more clusters.

c. FN-DBSCAN
Fuzzy Neighborhood DBSCAN (FN-DBSCAN) = Hard Clustering (DBSCAN) + Fuzzy Set Theory (fuzzy neighborhood function)
Fig1. Crisp neighborhood function vs. fuzzy neighborhood function.
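The contrast shown in Fig1 can be illustrated with a minimal Python sketch. The slides do not state which fuzzy neighborhood function is used, so the linear and exponential forms below, the function names, and the decay parameter k are assumptions chosen for illustration only.

```python
import math


def crisp_membership(dist, eps):
    """Crisp (DBSCAN-style) neighborhood: a point is either inside or outside."""
    return 1.0 if dist <= eps else 0.0


def linear_membership(dist, eps):
    """Assumed linear fuzzy neighborhood: 1 at distance 0, decreasing to 0 at eps."""
    return max(0.0, 1.0 - dist / eps)


def exponential_membership(dist, eps, k=2.0):
    """Assumed exponential fuzzy neighborhood; k (hypothetical) sets the decay rate."""
    return math.exp(-((k * dist / eps) ** 2))


# Two points p1 and p2 inside the same eps-ball, at different distances.
for name, d in (("p1", 0.2), ("p2", 0.8)):
    print(name, crisp_membership(d, 1.0), round(linear_membership(d, 1.0), 3))
```

With the crisp function both points count equally as neighbors, whereas the fuzzy functions give the closer point a higher membership degree.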

Input parameters
- Eps1: the minimal threshold of neighborhood membership degree.
- Eps2: the minimal set cardinality.
The two parameters together define the desired density characteristics of the generated clusters.
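A possible reading of how the two parameters interact is sketched below in Python. It assumes a linear membership function, takes the set cardinality to be the sum of the membership degrees that reach Eps1, and introduces a hypothetical max_dist scaling constant; none of these choices is confirmed by the slides.

```python
import math


def linear_membership(dist, max_dist):
    """Assumed linear membership: 1 at distance 0, 0 at max_dist and beyond."""
    return max(0.0, 1.0 - dist / max_dist)


def fuzzy_cardinality(points, center, eps1, max_dist=1.0):
    """Sum of the membership degrees that are at least eps1 (assumed definition)."""
    total = 0.0
    for p in points:
        degree = linear_membership(math.dist(p, center), max_dist)
        if degree >= eps1:
            total += degree
    return total


def is_fuzzy_core(points, center, eps1, eps2, max_dist=1.0):
    """A point is treated as a core point when its fuzzy cardinality reaches eps2."""
    return fuzzy_cardinality(points, center, eps1, max_dist) >= eps2


points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (0.9, 0.9)]
print(is_fuzzy_core(points, (0.0, 0.0), eps1=0.5, eps2=2.5))
```

In this sketch, raising Eps1 shrinks the neighborhood and raising Eps2 demands a denser neighborhood, which is why both have to match the scale and density of the data.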
d. Current Challenges
- FN-DBSCAN requires the specification of two user-given parameters: Eps1 and Eps2.
A minimum of domain knowledge is needed from the user.
- FN-DBSCAN is very sensitive to the setting of its input parameters.
The parameters should be set properly according to the scale of the data set and the density of the clusters.

- Estimating the input parameters is a hard task for inexperienced users, especially for real-world and high-dimensional data sets.
Automatic techniques should be developed to estimate the values of these input parameters before the clustering task.

2. State of the Art
General solutions to the parameters' complexity problem
- Trial-and-error exploratory phase:
• The clustering algorithm is run several times with different input parameter values. The clustering results are compared using validity indices.
- (-) Costly when dealing with large data volumes.
- Optimization and meta-heuristic search algorithms (e.g., colony algorithms, bee optimization, tabu search, genetic algorithms, etc.) have been hybridized with clustering algorithms to improve the clustering result.

Estimating Eps
- AEC (Automatic Epsilon Calculation), Paper [GM06]
  Description: three coefficients are calculated (distance between points, number of points located in a cluster, and cluster density); best result = minimal distance between clusters' result.
  Limits: efficient only on sample data sets with a small noise ratio; other parameters are required; time complexity much higher than DBSCAN.
- A. Smiti and Z. Elouedi, Paper [SE12]
  Description: combines Gaussian-Means (GM) and the DBSCAN algorithm.
  Limits: GM provides circular cluster shapes; not robust against noise.
- E. Nejad et al., Paper [EHY10]
  Description: Eps is estimated based on the noise ratio of the data and minPts.
  Limits: high dependence on the value of minPts.
- M.N. Gaonkar and K. Sawant, Paper [GS13]
  Description: the k-neighbors plot is drawn for a given k entered by the user, and the Eps which corresponds to the knee is then determined.
  Limits: semi-supervised method.

Estimating minPts
- M. Ester et al., Paper [EKSX96]
  Description: minPts = 4
- M.N. Gaonkar and K. Sawant, Paper [GS13]
  Description: $minPts = \frac{1}{n}\sum_{i=1}^{n} p_i$, where $p_i$ is the number of points in the Eps-neighborhood of the point $x_i$.
- M. Ester et al., Paper [EKSX96]
  Description: minPts = ln(n), where n is the size of the data set.
Some methods are more relevant than others with respect to clustering quality, run-time complexity, input-parameter complexity, etc.
It is preferable to determine the parameter values from the characteristics of the data set rather than to use fixed values.
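The three minPts heuristics listed above can be compared with a short Python sketch. The function names are hypothetical, and the data-driven variant assumes that minPts is the average number of points in the Eps-neighborhood (each point counting itself), which is one plausible reading of the slide.

```python
import math


def min_pts_fixed():
    """Fixed value, independent of the data set."""
    return 4


def min_pts_log(n):
    """minPts = ln(n), where n is the size of the data set."""
    return math.log(n)


def min_pts_mean_neighbors(points, eps):
    """Assumed data-driven variant: mean size of the Eps-neighborhoods.

    Each point is counted in its own neighborhood (an assumption).
    """
    counts = [sum(1 for q in points if math.dist(p, q) <= eps) for p in points]
    return sum(counts) / len(points)


points = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0), (5.0, 5.0)]
print(min_pts_fixed(), min_pts_log(len(points)), min_pts_mean_neighbors(points, 1.0))
```

Only the last variant reacts to the density of the data, which is the property the slide argues for.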

3. AF-DBSCAN
a. Overview
Fig2. Overview of AF-DBSCAN.

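b. Data normalization
- Min-Max method:
- Let D be a k-dimensional data set.
- The Min-Max technique maps a value $v$ of an attribute $A$ to $v'$ in the range $[new\_min_A, new\_max_A]$ as follows:
$v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A$
- Where $min_A$ and $max_A$ are the minimum and maximum values of attribute $A$.
- In our case, the data range is [0,1], so the formula is simplified to:
$v' = \frac{v - min_A}{max_A - min_A}$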
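A minimal Python sketch of Min-Max normalization is given here for illustration. The function names are hypothetical, and mapping a constant column to the lower bound is an assumption made to avoid a division by zero; the slides do not discuss that case.

```python
def min_max_normalize(column, new_min=0.0, new_max=1.0):
    """Map the values of one attribute to [new_min, new_max]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Constant attribute: assumed to map to the lower bound.
        return [new_min for _ in column]
    scale = (new_max - new_min) / (hi - lo)
    return [(v - lo) * scale + new_min for v in column]


def normalize_dataset(rows):
    """Normalize every attribute (column) of a row-oriented data set to [0, 1]."""
    columns = [list(c) for c in zip(*rows)]
    normalized = [min_max_normalize(c) for c in columns]
    return [list(r) for r in zip(*normalized)]


print(min_max_normalize([2, 4, 6]))
```

Normalizing each attribute separately keeps one attribute with a large range from dominating the distance computations used later.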
c. Estimating input parameters
- Calculation of Eps1:
Figure: four-step procedure (Step 1: k-dist plot; Step 2: determine the Eps value; Step 3; Step 4).

- Calculation of Eps:
Step 1. Determine the k-neighbors values (k-dist) of all points in the data set;
Step 2. Sort the k-dist values;
Step 3. Calculate the slope of each change (point) in the k-dist plot;
Step 4. Calculate the mean and standard deviation of the non-zero slopes;
Step 5. Find the first slope which is above the threshold;
Step 6. Find the k-dist value corresponding to the slope found in step 5 and define this value as Eps.
Fig3. k-dist plot (k-dist against k-neighbors).
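The six steps can be sketched in Python as follows. The slides do not give the threshold, so taking it as the mean plus one standard deviation of the non-zero slopes is an assumption, as are the function names, the default k, the choice of the k-dist value just before the jump, and the fallback to the largest k-dist value.

```python
import math
import statistics


def k_dist_values(points, k):
    """Steps 1-2: distance of each point to its k-th nearest neighbor, sorted."""
    values = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        values.append(dists[k - 1])
    return sorted(values)


def estimate_eps(points, k=4):
    """Steps 3-6: locate the first sharp rise in the sorted k-dist curve."""
    kd = k_dist_values(points, k)
    # Step 3: slope between consecutive k-dist values (unit spacing on the x axis).
    slopes = [kd[i + 1] - kd[i] for i in range(len(kd) - 1)]
    nonzero = [s for s in slopes if s > 0]
    if len(nonzero) < 2:
        return kd[-1]  # assumed fallback: too few slopes to compute a deviation
    # Step 4: assumed threshold = mean + standard deviation of non-zero slopes.
    threshold = statistics.mean(nonzero) + statistics.stdev(nonzero)
    # Steps 5-6: first slope above the threshold gives the Eps value.
    for i, s in enumerate(slopes):
        if s > threshold:
            return kd[i]
    return kd[-1]


points = [(0.0, 0.0), (1.0, 0.0), (2.1, 0.0), (3.3, 0.0),
          (4.6, 0.0), (6.0, 0.0), (30.0, 0.0)]
print(estimate_eps(points, k=1))
```

In this toy data set the outlier at x = 30 creates the only sharp rise in the k-dist curve, so the estimate stays at the largest k-dist value observed inside the dense group.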
- Calculation of Eps2:
Figure: three-step procedure (Step 1, Step 2, Step 3).

d. Clustering task
Input: Data set D = {x1, …, xn}, Eps1, Eps2.
Output: A partition C = {C1, …, Ck} of k clusters.
Begin
Step 1. Mark all points in the data set as unclassified;
Step 2. Find an unclassified core point p within Eps1 and Eps2. Mark p as classified. Start a new cluster as the current cluster and assign p to it;
Step 3. Create an empty set of seeds S. Put the set N(p,Eps1) into S;
Step 4. Get a point q from the seeds, mark q as classified, assign q to the current cluster, and remove q from the seeds;
Step 5. Check whether q is a core point within Eps1 and Eps2; if so, add all the unclassified points in the Eps1-neighborhood set of q to the set of seeds;
Step 6. Repeat steps 4 and 5 until the set of seeds is empty;
Step 7. Start a new cluster and repeat steps 2 to 6 until no more core points can be found;
Step 8. Mark all points which do not belong to any cluster as noise, and output all the clusters found;
End.
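The eight steps can be sketched in Python as below. This is an illustrative reading rather than the authors' implementation: the linear membership function, the max_dist scaling constant, the definition of a core point as one whose summed membership degrees reach Eps2, and all function names are assumptions.

```python
import math

NOISE = -1


def membership(p, q, max_dist):
    """Assumed linear fuzzy neighborhood membership of q with respect to p."""
    return max(0.0, 1.0 - math.dist(p, q) / max_dist)


def neighbors(points, i, eps1, max_dist):
    """N(p, Eps1): indices whose membership degree w.r.t. point i is >= eps1."""
    return [j for j, q in enumerate(points)
            if membership(points[i], q, max_dist) >= eps1]


def is_core(points, i, eps1, eps2, max_dist):
    """Core test: fuzzy cardinality of the neighborhood reaches eps2."""
    card = sum(membership(points[i], points[j], max_dist)
               for j in neighbors(points, i, eps1, max_dist))
    return card >= eps2


def fuzzy_dbscan(points, eps1, eps2, max_dist=1.0):
    labels = [None] * len(points)                       # Step 1
    cluster = 0
    for p in range(len(points)):                        # Step 2
        if labels[p] is not None or not is_core(points, p, eps1, eps2, max_dist):
            continue
        labels[p] = cluster
        seeds = [j for j in neighbors(points, p, eps1, max_dist)
                 if labels[j] is None]                  # Step 3
        while seeds:                                    # Step 6
            q = seeds.pop()                             # Step 4
            if labels[q] is not None:
                continue
            labels[q] = cluster
            if is_core(points, q, eps1, eps2, max_dist):  # Step 5
                seeds.extend(j for j in neighbors(points, q, eps1, max_dist)
                             if labels[j] is None)
        cluster += 1                                    # Step 7
    return [NOISE if lab is None else lab for lab in labels]  # Step 8


data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
        (0.9, 0.9), (1.0, 0.9), (0.9, 1.0), (1.0, 1.0), (0.5, 0.0)]
print(fuzzy_dbscan(data, eps1=0.8, eps2=3.0))
```

On this toy data the two tight groups become clusters 0 and 1, and the isolated point is labeled as noise (-1).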

4. Experimental Results
a. Datasets
• Datasets (from the UCI repository):
o Thyroid Disease Dataset (TD)
o Breast Cancer Wisconsin (BCW)
• Evaluation criteria:
o PCC (Percent of Correct Classification)
o eT (Error Rate)
o Run time
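The two accuracy criteria can be illustrated with a short Python sketch. The slides do not define them, so the sketch assumes that PCC is the percentage of points whose assigned label matches the reference class, that the error rate is its complement, and that cluster identifiers have already been mapped to class labels; the function names are hypothetical.

```python
def pcc(true_labels, predicted_labels):
    """Percent of Correct Classification (assumed definition)."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return 100.0 * correct / len(true_labels)


def error_rate(true_labels, predicted_labels):
    """Error rate, assumed here to be the complement of PCC."""
    return 100.0 - pcc(true_labels, predicted_labels)


print(pcc([1, 1, 2, 2], [1, 1, 2, 1]), error_rate([1, 1, 2, 2], [1, 1, 2, 1]))
```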
b. Clustering accuracy
Table1. Classification accuracy of the three clustering algorithms.

c. Run time
Table2. Retrieval time results.
Fig4. Relationship between run speed and data set's size.

5. Conclusion
a. Conclusion
- AF-DBSCAN: a new approach that enhances FN-DBSCAN and increases clustering performance in noisy and imprecise environments.
- The algorithm automatically determines the input parameter values Eps1 and Eps2 by exploiting the k-neighbors plot method and performing mathematical calculations.
- Simulation experiments, carried out on a real medical data set, highlighted the effectiveness of AF-DBSCAN and showed that the proposed method outperformed the classical method, since it provides better clustering accuracy. However, it takes more time because of the parameter computation involved in AF-DBSCAN.
b. Future works
- Carrying out further research to find a technique that achieves additional speed gains.
- Implementing techniques to make AF-DBSCAN incremental, since new data can be continuously generated in medical datasets.

« Most of the fundamental ideas of science are essentially simple, and may, as a rule, be expressed in a language comprehensible to everyone. »
If you can't explain it simply, you don't understand it well enough.
L. Infeld, A. Einstein
The Evolution of Physics (1938)
Questions?

Fuzzy Set Theory
a. Definition
- An attempt to develop a set of concepts and techniques for resolving complicated problems in which, due to imprecision, the boundaries of object classes are not clearly defined.
- Fuzzy sets are « Classes of objects in which the transition from membership to non-membership is gradual rather than abrupt. » L. Zadeh, 1973
- « The fuzziness of a property is not viewed as a defect in the linguistic expression of knowledge […], but rather as a way of expressing graduality. » D. Dubois et al, 1993
Let U be a universe of discourse (a set of real numbers, the set of objects in one room, etc.):
A is a finite fuzzy subset of U.
A is characterized by a membership function $\mu_A : U \to [0, 1]$, which associates with each element $u$ of U a membership degree $\mu_A(u)$.
A can be expressed as:
$A = \{(u_i, \mu_A(u_i)) \mid u_i \in U\}$
where $\mu_A(u_i)$ is the membership degree of $u_i$ in A.

b. Example
U = [0,100]
u: Age
A: fuzzy subset of U labeled old, defined by a membership function $\mu_A$.
$\mu_A(u)$ (the membership degree) is interpreted as the degree of compatibility of $u$ with the concept old represented by A.
Case considered: u = 60.
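The membership function used on the slide did not survive extraction. As an illustration only, the Python sketch below uses the membership function for old from Zadeh's classic age example, which is zero up to age 50 and rises smoothly afterwards; whether the slide uses this exact function is an assumption, and the name mu_old is hypothetical.

```python
def mu_old(age):
    """Membership degree of an age in the fuzzy set "old" (Zadeh's classic form).

    Assumed here for illustration; the slide's own formula is not available.
    """
    if age <= 50:
        return 0.0
    return 1.0 / (1.0 + ((age - 50) / 5.0) ** -2)


for age in (40, 60, 80, 100):
    print(age, round(mu_old(age), 3))
```

Under this assumed function, an age of 60 has a membership degree of 0.8: it is largely, but not fully, compatible with the concept old.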
