
Clustering Approaches

• The following are the two clustering approaches that we are going to discuss: 1) DBSCAN and 2) Graph-Based Clustering.
• DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It was proposed by Martin Ester et al. in 1996.
• DBSCAN is a density-based clustering algorithm that works on the assumption that clusters are dense regions in space separated by regions of lower density.
The DBSCAN algorithm requires two parameters:
• Eps: the maximum radius of the neighborhood.
• MinPts: the minimum number of points in an Eps-neighbourhood of a point.
• In this algorithm, there are three types of data points (see the sketch after this list):
• Core point: a point that has at least MinPts points within its Eps-neighborhood.
• Border point: a point that has fewer than MinPts neighbors itself but lies within the Eps-neighborhood of a core point.
• Noise point: a point that is neither a core nor a border point; it is considered an outlier in the dataset.
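The following is a minimal DBSCAN sketch using scikit-learn (the library choice, data, and parameter values are illustrative, not from the notes); eps and min_samples correspond to Eps and MinPts above, and a label of -1 marks noise points.

```python
# Minimal DBSCAN sketch with scikit-learn; eps/min_samples play the
# roles of Eps/MinPts described above (values are illustrative).
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1],   # dense region 1
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.1],   # dense region 2
              [9.0, 0.0]])                           # isolated point
db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)                  # e.g. [0 0 0 1 1 1 -1]; -1 = noise
print(db.core_sample_indices_)     # indices of the core points
```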
Why DBSCAN?
• Partitioning methods (K-means, PAM clustering) and hierarchical clustering work for finding spherical-shaped or convex clusters. In other words, they are suitable only for compact and well-separated clusters.
• Moreover, they are severely affected by the presence of noise and outliers in the data.
• Advantages: 1) No need to specify the number of clusters ahead of time. 2) Capable of detecting noise points during clustering. 3) Clusters of any size and shape can be found using the DBSCAN algorithm.
• Disadvantages: 1) In the case of varying-density clusters, the DBSCAN algorithm fails. 2) It doesn't work well for high-dimensional data.
Graph-Based Clustering
• Graph clustering refers to the clustering of data in the form of graphs.
The following are the types of graph clustering:
• Between-graph: clustering a set of graphs.
• Within-graph: clustering the nodes/edges of a single graph (a code sketch follows this list).
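For within-graph clustering, here is a small illustrative sketch using NetworkX's community-detection utilities (the dataset and algorithm choice are mine, not from the notes):

```python
# Within-graph clustering sketch: group the nodes of one graph into
# communities using greedy modularity maximization (NetworkX).
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.karate_club_graph()                     # a classic small social graph
communities = greedy_modularity_communities(G)
for i, nodes in enumerate(communities):
    print(f"community {i}: {sorted(nodes)}")
```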
Outlier Detection and Analysis
• An outlier is a data object that deviates significantly from the normal objects, as if it were generated by a different mechanism.
• An outlier is a unique entity that stands out
from the rest of the group.
Why outlier analysis?
• Most data mining methods discard outliers as noise or exceptions; however, in some applications, such as fraud detection, rare events may be more interesting than those that occur more often, and thus outlier analysis becomes essential.
Outlier Detection
• To find outliers, we first set a threshold value: any data point whose distance from its nearest cluster exceeds this threshold is considered an outlier for our purposes.
• The distance between the test data and each cluster mean must then be determined.
• If the distance between the test data and the nearest cluster is greater than the threshold value, the test data is classified as an outlier.
Algorithm for outlier detection
• 1. Calculate each cluster's mean.
• 2. Initialize the threshold value.
• 3. Calculate the test data's distance from each cluster mean.
• 4. Find the cluster that is closest to the test data.
• 5. An outlier exists if (Distance > Threshold), as sketched in the code below.
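A minimal sketch of these five steps, assuming the clusters were found with K-means (the data and the threshold value 2.0 are illustrative choices, not from the notes):

```python
# Distance-to-nearest-centroid outlier test, following the steps above.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.2]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)  # step 1

threshold = 2.0                                    # step 2: set threshold
test_point = np.array([4.5, 4.5])
dists = np.linalg.norm(kmeans.cluster_centers_ - test_point, axis=1)  # step 3
nearest = dists.min()                              # step 4: nearest cluster
print("outlier" if nearest > threshold else "normal")  # step 5
```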
Types of Outliers
• The following are the different types of outliers: 1) Global Outliers 2) Contextual Outliers 3) Collective Outliers.
Global Outliers
• A data point is considered a global outlier if its value is far outside the entirety of the data set in which it is found.
• A global outlier is a measured sample point that has a very high or a very low value relative to all the values in the dataset.
• For example, if 9 out of 10 points have values between 20 and 30, but the 10th point has a value of 85, the 10th point may be a global outlier.
Contextual Outliers
• If an individual data point is anomalous in a specific context or condition (but not otherwise), it is termed a contextual outlier. The attributes of data objects are divided into two groups:
• Contextual attributes: define the context, e.g., time and location.
• Behavioral attributes: characteristics of the object used in outlier evaluation, e.g., temperature.
Collective Outliers
• A subset of data objects collectively deviates significantly from the whole data set, even if the individual data objects may not be outliers.
• Applications: e.g., intrusion detection, where several computers keep sending denial-of-service packets to each other; no single packet is unusual on its own, but the collective pattern is.
Implementation of K-means in Weka
• The following are the steps for implementing K-means using Weka:
• 1) Go to the Preprocess tab in WEKA Explorer and select Open File. Select the dataset "vote.arff."
• 2) Switch to the Cluster tab, click Choose, select SimpleKMeans, and click Start to run the clusterer.
• 3) Use the "Visualize" tab to visualize the clustering result: go to the tab, click on any box, and move the Jitter slider to the max.
• The X-axis and Y-axis represent the attributes.
• The blue color represents the class label democrat and the red color represents the class label republican.
• Jitter is used to view the clusters. (A scriptable equivalent is sketched below.)
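Weka's SimpleKMeans is driven through the GUI; for readers who prefer a scriptable route, here is a rough scikit-learn analogue. The file name "vote.csv" and the column name "Class" are hypothetical stand-ins for an export of vote.arff.

```python
# Hypothetical scriptable analogue of Weka's SimpleKMeans on the vote data.
import pandas as pd
from sklearn.cluster import KMeans

data = pd.read_csv("vote.csv")                    # hypothetical CSV export of vote.arff
X = pd.get_dummies(data.drop(columns=["Class"]))  # one-hot encode the y/n votes
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# Cross-tabulate clusters against democrat/republican, similar in spirit
# to Weka's "Classes to clusters evaluation" mode.
print(pd.crosstab(kmeans.labels_, data["Class"]))
```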
Overfitting
• When a statistical model fits its training data too closely, noise and all (much like pants tailored so exactly to one fitting that they fit nothing else!), it is said to be overfitted.
• An overfitted model has begun to learn from the noise and inaccuracies in the data collection rather than from the underlying pattern.
• The model then fails to correctly categorize new data because it has memorized too many details and too much noise.
• Non-parametric and non-linear approaches are common causes of overfitting, since these types of machine learning algorithms have more flexibility in constructing models based on the dataset and can thus create unrealistic models.
• If we have linear data, we can use a linear algorithm to avoid overfitting, or we can constrain decision tree parameters such as the maximum depth.
• In a nutshell, overfitting is characterized by high variance and low bias.
• Bias: the assumptions a model makes to make the target function easier to learn.
• Bias is a type of error that occurs due to wrong assumptions about the data, such as assuming the data is linear when in reality it follows a complex function.
• The resulting difference between the model's predicted values and the actual values is known as the bias error, or error due to bias. Bias is a systematic error that arises from wrong assumptions in the machine learning process.
Variance:
• Variance is the error you see when a model achieves a very low error on the data it was trained on but a very high error when the data is changed and the same model is trained again; it reflects the model's sensitivity to fluctuations in the training set.
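To see bias and variance concretely, here is a small illustrative sketch (the dataset and model choices are mine, not from the notes): a degree-1 polynomial underfits noisy sine data (high bias), while a degree-15 polynomial chases the noise (high variance), scoring a low training error but a high test error.

```python
# Underfitting (high bias) vs. overfitting (high variance) on noisy data.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 30).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 30)
X_test = np.linspace(0, 1, 100).reshape(-1, 1)
y_test = np.sin(2 * np.pi * X_test).ravel()

for degree in (1, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    train_err = mean_squared_error(y, model.predict(X))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree {degree}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")
```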
K-Fold Cross-Validation
• This approach divides the data set into k subsets (also known as folds), then trains on k-1 of the subsets while leaving the remaining one out for evaluation of the trained model. We iterate k times, reserving a different subset for testing each time.
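A minimal k-fold sketch with scikit-learn (k = 5 here; the dataset and model are illustrative choices):

```python
# 5-fold cross-validation: each of the 5 folds is held out once.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(scores)          # one accuracy score per held-out fold
print(scores.mean())   # the averaged out-of-sample estimate
```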
Advantages of Cross-Validation
• Out-of-sample performance can be estimated more accurately.
• Every observation is used for both training and testing, resulting in a more "effective" use of the data.
Random Forest
• Random Forest is a popular machine learning algorithm that belongs to the supervised learning technique. It is based on the concept of ensemble learning, which is a process of combining multiple classifiers to solve a complex problem and to improve the performance of the model. (A minimal sketch follows below.)
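A minimal Random Forest sketch with scikit-learn (the dataset and hyperparameters are illustrative choices, not from the notes):

```python
# Ensemble of decision trees: each tree trains on a bootstrap sample and
# the forest combines their votes.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))   # accuracy of the combined vote
```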
Advantages of Random Forest
• Random Forest is capable of performing both
Classification and Regression tasks.
• It is capable of handling large datasets with
high dimensionality.
• It enhances the accuracy of the model and
prevents the overfitting issue.
Disadvantages of Random Forest
• Complexity is the main disadvantage of
Random forest algorithms.
• Construction of Random Forests is harder and more time-consuming than that of decision trees.
• More computational resources are required to
implement the Random Forest algorithm.
