
3rd edition of the International Conference Europe Middle East & North Africa on Information System & Technology and Learning Research

 El Kef Computer Science Institute (ISI KEF), Tunisia

A New Fuzzy Clustering Method based on FN-DBSCAN to Determine the Optimal Input Parameters
Authors: Mme. Sihem JEBARI Dr. Abir SMITI Dr. Aymen LOUATI

November 21-23, 2019 – Marrakech, Morocco


PLAN
• Context and Issues
• Data clustering
• DBSCAN
• FN-DBSCAN
• Current challenges
• State of the Art
• AF-DBSCAN
• Overview of the method
• Data normalization
• Estimation of input parameter values
• Clustering task
• Experimental Results
• Conclusion and Future Works
1. Context & Issues
2. State of the Art
3. AF-DBSCAN a. Data Clustering b. DBSCAN c. FN-DBSCAN d. Current Challenges

- The modern world is a data-driven world.
Data must be analyzed and processed to extract information (to inform, answer questions, aid understanding, and support decisions).
Data analysis has become important.

- The data volume increases each day.
Manual processing of this amount of data becomes extremely difficult and expensive. The processing must be done automatically.

- Data clustering plays an important role in data analysis.



Definition
- Clustering is the « finding of natural groups from a data set, when little or nothing is known about the category structure. » M.R. Anderberg, 1973
- « Cluster Analysis divides data into groups that are meaningful, useful or both » P.N. Tan et al, 2005
Open questions: Similarity? Hidden structure? Clusters? Do samples form a group? Decision?


- Clustering methods: probabilistic, distance-based, and density-based models.
- Density-based methods treat clusters as dense regions of data points separated by less dense regions.
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise): a well-known density-based clustering method.
It discovers hidden patterns in the data even in the presence of noise.

- Some level of imprecision is associated with data acquisition.
- Some automated data acquisition devices can introduce uncertainty into the stored data.
Hard clustering methods, which draw crisp boundaries to separate clusters, show their limits and can produce inaccurate clustering results.

Open questions: Imprecision? Uncertainty? How to perform clustering, avoid mistakes, and keep accuracy?


Data Clustering: a powerful tool to identify the natural grouping in data.
Fuzzy Set Theory: an efficient theory to handle imprecision and uncertainty.
Fuzzy Clustering (Data Clustering + Fuzzy Set Theory): aims to find natural clusters while discarding the details that render them inaccurate.

- Soft clustering.
- Objects are grouped by assigning each one a membership degree to each of the clusters through a membership function.
- Each data element has a membership degree to each of the clusters that lies between 0 (non-membership) and 1 (full membership).
- Fuzzy clustering methods: a group of cluster-analysis algorithms in which the data elements are assigned to the clusters in a non-crisp way, i.e. an element can belong to one or more clusters.
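A minimal sketch of what "membership degree to each cluster" can look like in practice. The slides do not give a membership function, so this example borrows the fuzzy c-means (FCM) membership formula as an assumption; the function name `fuzzy_memberships`, the fuzzifier `m`, and the toy centers are illustrative only.

```python
import math

def fuzzy_memberships(point, centers, m=2.0):
    """Return one membership degree per center (FCM-style, assumed here).

    Each degree lies in [0, 1]; under the FCM convention they sum to 1.
    """
    dists = [math.dist(point, c) for c in centers]
    # A point lying exactly on a center belongs fully to that center.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in dists)
        for d_i in dists
    ]

# Two hypothetical cluster centers and one point closer to the first.
centers = [(0.0, 0.0), (4.0, 0.0)]
degrees = fuzzy_memberships((1.0, 0.0), centers)
print(degrees)
```

The point belongs to both clusters at once, with a much higher degree for the nearer center, which is the behavior a hard clustering method cannot express.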

Fuzzy Neighborhood DBSCAN (FN-DBSCAN) = Hard Clustering (DBSCAN) + Fuzzy Set Theory (fuzzy neighborhood function)

Fig1. Crisp neighborhood function vs. fuzzy neighborhood function (points p1 and p2 within radius ԑ).
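The contrast shown in Fig1 can be sketched as follows. The slides do not state which fuzzy neighborhood function is used, so a linearly decreasing function is assumed here; `crisp_membership`, `fuzzy_membership`, and the sample coordinates are hypothetical names and values.

```python
import math

def crisp_membership(p, q, eps):
    """Crisp neighborhood: 1 if q lies within distance eps of p, else 0."""
    return 1.0 if math.dist(p, q) <= eps else 0.0

def fuzzy_membership(p, q, eps):
    """Assumed linear fuzzy neighborhood: 1 at p itself, 0 at distance eps or beyond."""
    return max(0.0, 1.0 - math.dist(p, q) / eps)

center = (0.0, 0.0)
p1 = (0.2, 0.0)   # close to the center
p2 = (0.8, 0.0)   # near the border of the neighborhood
for name, point in (("p1", p1), ("p2", p2)):
    print(name,
          crisp_membership(center, point, 1.0),
          round(fuzzy_membership(center, point, 1.0), 2))
```

With the crisp function both points count equally as neighbors, whereas the fuzzy function gives p1 a degree of about 0.8 and p2 about 0.2, so nearer points weigh more in the density estimate.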



Input parameters

- Eps1: the minimal threshold of neighborhood membership degree.
- Eps2: the minimal set cardinality.

The two parameters together define the desired density characteristics of the generated clusters.
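A minimal sketch of how the two parameters can act together to decide whether a point is a core point. It assumes a linear fuzzy neighborhood function with a hypothetical `radius` argument and assumes that the set cardinality is the sum of the membership degrees that reach Eps1 (the point itself included); none of these choices is confirmed by the slides.

```python
import math

def membership(p, q, radius):
    """Assumed linear fuzzy neighborhood degree of q with respect to p."""
    return max(0.0, 1.0 - math.dist(p, q) / radius)

def fuzzy_cardinality(p, points, eps1, radius):
    """Sum of the membership degrees that are at least eps1."""
    degrees = (membership(p, q, radius) for q in points)
    return sum(d for d in degrees if d >= eps1)

def is_fuzzy_core(p, points, eps1, eps2, radius):
    """A point is a core point when its fuzzy cardinality reaches eps2."""
    return fuzzy_cardinality(p, points, eps1, radius) >= eps2

points = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (5.0, 5.0)]
print(is_fuzzy_core(points[0], points, eps1=0.5, eps2=2.0, radius=1.0))
print(is_fuzzy_core(points[3], points, eps1=0.5, eps2=2.0, radius=1.0))
```

Raising Eps1 shrinks the neighborhood that is counted, while raising Eps2 demands a denser neighborhood, which is why both depend on the data scale and on the cluster density.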

- FN-DBSCAN requires two user-given parameters: Eps1 and Eps2.
A minimum of domain knowledge is required from the user.
- FN-DBSCAN is very sensitive to the setting of its input parameters.
The parameters should be set according to the scale of the data set and the density of the clusters.

Open questions: which Eps1 and Eps2 suit the clusters' density and the data set's scale?


- Estimating the input parameters is a hard task for inexperienced users, especially for real-world and high-dimensional data sets.
Automatic techniques should be developed to estimate the values of these input parameters.
Estimating Parameters → Clustering Task
1. Context & Issues
2. State of the Art
3. AF-DBSCAN General Solution to resolve the parameters’ complexity problem Estimating Eps Estimating minPts

- Trial/error exploratory phase:
• The clustering algorithm is run several times with different input parameter values. The clustering results are compared using validity indices.
- (-) Costly when dealing with large data volumes.
- Optimization and meta-heuristic search algorithms (e.g., colony algorithms, bee optimization, tabu search, genetic algorithms, etc.) have been hybridized with clustering algorithms to improve the clustering result.

Method / Description / Limits

Method: AEC (Automatic Epsilon Calculation), Paper[GM06]
Description:
- Three coefficients are calculated: distance between points, number of points located in a cluster, and cluster density.
- Best result = minimal distance between clusters' result.
Limits:
- Efficient only on sample data sets with a small noise ratio.
- Other parameters are required.
- Time complexity much higher than DBSCAN.

Method: A. Smiti and Z. Elouedi, Paper[SE12]
Description:
- Combines the Gaussian-Means (GM) and DBSCAN algorithms.
Limits:
- GM provides circular cluster shapes.
- Not robust against noise.

Method: E. Nejad et al., Paper[EHY10]
Description:
- Eps is estimated based on the noise ratio of the data and minPts.
Limits:
- High dependence on the minPts value.

Method: M.N. Gaonkar and K. Sawant, Paper[GS13]
Description:
- The k-neighbors plot is drawn for a given k entered by the user, and then the Eps which corresponds to the knee is determined.
Limits:
- Semi-supervised method.


Method / Description

Method: M. Ester et al., Paper[EKSX96]
Description:
- minPts = 4

Method: M.N. Gaonkar and K. Sawant, Paper[GS13]
Description:
- minPts is computed from the number of points in the Eps-neighborhood of each point xi.

Method: M. Ester et al., Paper[EKSX96]
Description:
- minPts = ln(n)
- Where n is the size of the data set.

Some methods are more relevant than others with respect to clustering quality, run-time complexity, input-parameter complexity, etc.
It is preferable to determine the parameter values according to the data set characteristics rather than to take fixed values.
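The two closed-form minPts rules listed above can be sketched as below. The slides do not say how ln(n) is turned into an integer, so rounding to the nearest integer (with a floor of 1) is an assumption; the function names are illustrative.

```python
import math

def min_pts_fixed():
    """Fixed choice listed in the table: minPts = 4."""
    return 4

def min_pts_log(n):
    """Data-dependent choice: minPts = ln(n), rounded (rounding is assumed)."""
    if n < 1:
        raise ValueError("n must be a positive data set size")
    return max(1, round(math.log(n)))

for size in (100, 1000, 100000):
    print(size, min_pts_fixed(), min_pts_log(size))
```

The fixed value ignores the data set, while the logarithmic rule grows slowly with its size, which illustrates the point about adapting parameters to the data.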
2. State of the Art
3. AF-DBSCAN
4. Experimental results a. Overview b. Data normalization c. Estimating input parameters d. Clustering Results

Fig2. Overview of AF-DBSCAN.


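- Min-Max method:
- Let D be a k-dimensional data set.
- The Min-Max technique maps a value $v$ of an attribute $A$ to $v'$ in the range $[new\_min_A, new\_max_A]$ as follows:

$$v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A$$

- Where $min_A$ and $max_A$ are the minimum and maximum values of attribute $A$.
- In our case, the target range is [0,1], so the formula is simplified to:

$$v' = \frac{v - min_A}{max_A - min_A}$$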
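A minimal sketch of Min-Max normalization applied column by column to a small data set. The function name `min_max_normalize`, the toy data, and the handling of constant columns (mapped to the lower bound) are assumptions made for illustration.

```python
def min_max_normalize(rows, new_min=0.0, new_max=1.0):
    """Rescale every column (attribute) of `rows` to [new_min, new_max]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        new_row = []
        for value, low, high in zip(row, lows, highs):
            if high == low:
                # Constant attribute: nothing to rescale (assumed convention).
                new_row.append(new_min)
            else:
                ratio = (value - low) / (high - low)
                new_row.append(ratio * (new_max - new_min) + new_min)
        scaled.append(new_row)
    return scaled

data = [[1.0, 10.0], [2.0, 30.0], [3.0, 20.0]]
normalized = min_max_normalize(data)
print(normalized)
```

After normalization every attribute spans the same range, so no attribute dominates the distance computations used later to estimate the parameters.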


- Calculation of Eps1 (four steps):
Step 1 (k-dist plot) → Step 2 (determine the Eps' value) → Step 3 → Step 4


- Calculation of Eps:
1. Determine the k-nearest-neighbor distance (k-dist) of every point in the data set;
2. Sort the k-dist values;
3. Calculate the slope of each change (point) in the k-dist plot;
4. Calculate the mean and standard deviation of the non-zero slopes;
5. Find the first slope that is above the threshold;
6. Find the k-dist value corresponding to the slope found in step 5 and define this value as Eps.

Fig3. k-dist plot (axes: K-neighbors, K-dist).
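The six steps above can be sketched as follows. Several details are not given on the slide and are assumed here: the threshold is taken as mean + standard deviation of the non-zero slopes, a slope is the difference between consecutive sorted k-dist values, the k-dist value just before the jump is returned, and the largest k-dist is used as a fallback. The names `k_dist_values` and `estimate_eps` are hypothetical.

```python
import math
import statistics

def k_dist_values(points, k):
    """Steps 1-2: distance of each point to its k-th nearest neighbor, sorted."""
    values = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        values.append(dists[k - 1])
    return sorted(values)

def estimate_eps(points, k):
    kdist = k_dist_values(points, k)
    # Step 3: slope between consecutive points of the k-dist plot.
    slopes = [b - a for a, b in zip(kdist, kdist[1:])]
    # Step 4: statistics of the non-zero slopes.
    nonzero = [s for s in slopes if s != 0]
    if len(nonzero) < 2:
        return kdist[-1]  # fallback, assumed
    threshold = statistics.mean(nonzero) + statistics.stdev(nonzero)  # assumed form
    # Step 5: first slope above the threshold.
    for i, slope in enumerate(slopes):
        if slope > threshold:
            # Step 6: k-dist value at the start of that jump (assumed choice).
            return kdist[i]
    return kdist[-1]  # no knee found, assumed fallback

# Toy data on a line: gaps grow slowly, then one isolated point far away.
points = [(x, 0.0) for x in (0, 1, 3, 6, 10, 15, 40)]
eps = estimate_eps(points, k=1)
print(eps)
```

On this toy data the sorted 1-dist values are 1, 1, 2, 3, 4, 5, 25, and the only slope above the threshold is the final jump, so the estimate is 5.0, the last value before the isolated point.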

- Calculation of Eps2 (three steps):
Step 1 → Step 2 → Step 3


Input: Data set D = {x1, …, xn}, Eps1, Eps2.
Output: A partition C = {C1, …, Ck} of k clusters.
Begin
Step 1. Mark all points in the data set as unclassified;
Step 2. Find an unclassified core point p with respect to Eps1 and Eps2. Mark p as classified. Start a new cluster as the current cluster and assign p to it;
Step 3. Create an empty set of seeds S. Put the set N(p,Eps1) into S;
Step 4. Get a point q from the seeds, mark q as classified, assign q to the current cluster, and remove q from the seeds;
Step 5. Check whether q is a core point with respect to Eps1 and Eps2; if so, add all the unclassified points in the Eps1-neighborhood set of q to the set of seeds;
Step 6. Repeat steps 4 and 5 until the set of seeds is empty;
Step 7. Start a new cluster and repeat steps 2 to 6 until no more core points can be found;
Step 8. Mark all points that do not belong to any cluster as noise, and output all the clusters found;
End.
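The eight steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the fuzzy neighborhood function is assumed linear with a hypothetical `radius` argument, a core point is assumed to be one whose summed neighborhood degrees reach Eps2, and all function names are invented for the example.

```python
import math

def membership(p, q, radius):
    """Assumed linear fuzzy neighborhood degree."""
    return max(0.0, 1.0 - math.dist(p, q) / radius)

def fuzzy_neighbors(i, points, eps1, radius):
    """N(p, Eps1): (index, degree) pairs whose degree is at least eps1."""
    pairs = []
    for j, q in enumerate(points):
        degree = membership(points[i], q, radius)
        if degree >= eps1:
            pairs.append((j, degree))
    return pairs

def is_core(neighbors, eps2):
    """Core test: fuzzy cardinality of the neighborhood reaches eps2."""
    return sum(degree for _, degree in neighbors) >= eps2

def fn_dbscan(points, eps1, eps2, radius):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    labels = [None] * len(points)                 # Step 1: all unclassified
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = fuzzy_neighbors(i, points, eps1, radius)
        if not is_core(neighbors, eps2):
            continue                              # may join a cluster later
        labels[i] = cluster                       # Step 2: new current cluster
        seeds = [j for j, _ in neighbors if labels[j] is None]   # Step 3
        while seeds:                              # Step 6: until seeds is empty
            q = seeds.pop()                       # Step 4
            if labels[q] is not None:
                continue
            labels[q] = cluster
            q_neighbors = fuzzy_neighbors(q, points, eps1, radius)
            if is_core(q_neighbors, eps2):        # Step 5: expand from core q
                seeds.extend(j for j, _ in q_neighbors if labels[j] is None)
        cluster += 1                              # Step 7: next cluster
    return [-1 if label is None else label for label in labels]  # Step 8

points = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0),
          (5.0, 5.0), (5.1, 5.0), (5.2, 5.0),
          (10.0, 0.0)]
labels = fn_dbscan(points, eps1=0.5, eps2=2.0, radius=1.0)
print(labels)
```

On this toy input the two dense groups become clusters 0 and 1, and the isolated last point is reported as noise (-1).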

3. AF-DBSCAN
4. Experimental results
5. Conclusion a. Datasets b. Clustering accuracy c. Run time

• Datasets (from UCI repository):


o Thyroid Disease Dataset (TD)
o Breast Cancer Wisconsin (BCW)
• Evaluation criteria:
o PCC (Percent of Correct Classification)

o eT (Error Rate)

o Run Time
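The formulas for the two accuracy criteria are not legible on the slide. A minimal sketch, assuming the usual definitions (PCC as the percentage of correctly assigned samples and error rate as its complement) and assuming cluster ids have already been mapped to class labels:

```python
def pcc(true_labels, predicted_labels):
    """Percent of Correct Classification (assumed definition)."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)

def error_rate(true_labels, predicted_labels):
    """Error rate eT, assumed to be the complement of PCC."""
    return 100.0 - pcc(true_labels, predicted_labels)

truth = [1, 1, 2, 2]
predicted = [1, 1, 2, 1]
print(pcc(truth, predicted), error_rate(truth, predicted))
```

With three of four samples matching, this sketch reports a PCC of 75.0 and an error rate of 25.0.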

Table1. Classification accuracy of the three clustering algorithms.


Table2. Retrieval time results.

Fig4. Relationship between run speed and data set size.

- AF-DBSCAN: a new approach that enhances FN-DBSCAN and increases clustering performance in noisy and imprecise environments.
- The algorithm automatically determines the input parameter values Eps1 and Eps2 by exploiting the k-neighbors plot method and performing mathematical calculations.
- Simulation experiments, carried out on real medical data sets, highlighted AF-DBSCAN's effectiveness and showed that the proposed method outperformed the classical method, since it provides better clustering accuracy. However, it takes more time because of the parameter computation involved in AF-DBSCAN.

- Carrying out further research to find a technique that achieves additional speed gains.

- Implementing techniques to make AF-DBSCAN incremental, since new data can be continuously generated in medical data sets.

« Most of the fundamental ideas of science are essentially simple, and may, as a rule, be expressed in a language comprehensible to everyone. »
If you can't explain it simply, you don't understand it well enough.
L. Infeld, A. Einstein
The Evolution of Physics (1938)

Questions ?
2. DBSCAN
3. Fuzzy Set Theory a. Definition b. Example
4. Fuzzy Clustering

- An attempt to develop a set of concepts and techniques for resolving complicated problems in which, due to imprecision, the boundaries of object classes are not clearly defined.
- Fuzzy sets are « Classes of objects in which the transition from membership to non-membership is gradual rather than abrupt.» L. Zadeh, 1973
- « The fuzziness of a property is not viewed as a defect in the linguistic expression of knowledge […], but rather as a way of expressing graduality.» D. Dubois et al, 1993

Let U be a universe of discourse (a set of real numbers, a set of objects in one room, etc.):
A is a finite fuzzy subset of U.
A is characterized by a membership function $\mu_A : U \to [0, 1]$, which associates with each element $u$ of U a membership degree $\mu_A(u)$.
A can be expressed as:

$$A = \{(u, \mu_A(u)) \mid u \in U\}$$

where $\mu_A(u)$ is the membership degree of $u$ in A.



U= [0,100]
u: Age
A: fuzzy subset of U labeled "old", defined by a membership function $\mu_A$.

$\mu_A(u)$ (the membership degree) is interpreted as the degree of compatibility of $u$ with the concept old represented by A.

If u=60
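The membership function used on the slide is not recoverable from the text. As an illustration only, the sketch below assumes the classical definition of "old" often quoted from Zadeh's work on linguistic variables: 0 up to age 50, then $\left(1 + \left(\frac{u - 50}{5}\right)^{-2}\right)^{-1}$. The function name `mu_old` is hypothetical.

```python
def mu_old(age):
    """Membership degree of `age` in the fuzzy set "old" (assumed definition)."""
    if not 0 <= age <= 100:
        raise ValueError("age must lie in the universe U = [0, 100]")
    if age <= 50:
        return 0.0
    return 1.0 / (1.0 + ((age - 50) / 5.0) ** -2)

for age in (40, 55, 60, 80):
    print(age, round(mu_old(age), 3))
```

Under this assumed function, an age of 60 is compatible with the concept old to a degree of 0.8: the transition from "not old" to "old" is gradual rather than abrupt.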