
Cluster- Unsupervised

• Cluster analysis, or clustering, is the task of grouping a
set of objects in such a way that objects in the same
group (called a cluster) are more similar (in some
sense) to each other than to those in other groups
(clusters).

• Uses:
– Business Analytics
– Image Processing
– Web Search
Cluster
• Clustering is a process of partitioning a set of data
(or objects) into a set of meaningful sub-classes,
called clusters.

• Helps users understand the natural grouping or
structure in a data set.
• Used either as a stand-alone tool to get insight into
data distribution or as a preprocessing step for other
algorithms.
Outlier
[Figure: initial centroids C1 = (10, 10) and C2 = (20, 20)]
Euclidean distance formula

$d = \sqrt{(X_2 - X_1)^2 + (Y_2 - Y_1)^2}$

Let's say you have a data point (X2, Y2) and a centroid (X1, Y1).
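As a minimal sketch, the Euclidean distance (the square root of the sum of squared coordinate differences) can be computed as follows. The function name `euclidean_distance` and the sample point (20, 0) are illustrative assumptions; the two centroids are the initial ones from the slides.

```python
import math

def euclidean_distance(point, centroid):
    """Distance between a data point (X2, Y2) and a centroid (X1, Y1)."""
    x2, y2 = point
    x1, y1 = centroid
    return math.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2)

# Assumed sample point, compared against the two initial centroids
point = (20, 0)
d_c1 = euclidean_distance(point, (10, 10))
d_c2 = euclidean_distance(point, (20, 20))

# A point is assigned to the centroid with the smaller distance
nearest = "C1" if d_c1 < d_c2 else "C2"
print(nearest)
```

In this example the point is closer to C1, so it would be assigned to Cluster 1.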
New Updated Centroid in Cluster 1 & Cluster 2

New Updated Centroid in Cluster 1:
x = (10 + 10 + 0 + 20 + 0) / 5 = 8
y = (10 + 0 + 10 + 0 + 20) / 5 = 8
so the new centroid is (8, 8).
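The centroid update is just the mean of each coordinate over the points in the cluster. The sketch below assumes the x and y values in the calculation pair up in the order written, which gives the five points listed in `cluster_1`; the helper name `centroid` is hypothetical.

```python
# Assumed pairing of the x and y values from the Cluster 1 calculation
cluster_1 = [(10, 10), (10, 0), (0, 10), (20, 0), (0, 20)]

def centroid(points):
    """Mean of the x coordinates and mean of the y coordinates."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    return (mean_x, mean_y)

new_c1 = centroid(cluster_1)
print(new_c1)  # (8.0, 8.0)
```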

New Updated Centroid in Cluster 2


Iteration 2:

[Figure: centroids at iteration 2: C1 = (8, 8) and C2 = (20, 20)]
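The assign-then-update cycle can be repeated until the centroids stop moving. The following is a rough sketch of that loop, not a definitive K-Means implementation. The Cluster 1 points assume the pairing of coordinates from the centroid calculation, and the three points near (20, 20) are hypothetical, chosen only so that C2 stays at (20, 20).

```python
import math

def nearest_index(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))

def mean_point(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def k_means(points, centroids, max_iter=100):
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest_index(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its cluster
        new_centroids = [mean_point(c) if c else old
                         for c, old in zip(clusters, centroids)]
        if new_centroids == centroids:
            break  # no centroid moved, so the assignment is stable
        centroids = new_centroids
    return centroids, clusters

points = [(10, 10), (10, 0), (0, 10), (20, 0), (0, 20),  # assumed Cluster 1 points
          (20, 20), (15, 25), (25, 15)]                  # hypothetical Cluster 2 points
final_centroids, final_clusters = k_means(points, [(10, 10), (20, 20)])
print(final_centroids)
```

With these inputs, C1 moves from (10, 10) to (8, 8) after the first iteration, and the second iteration changes nothing, so the loop stops.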
Clustering Algorithms
• Partitioning Methods
– K-Means
– K-Medoids
• Density-Based Methods
• Hierarchical Methods
– Agglomerative Approach
– Divisive Approach
Random Forest Classification
• Boosting
• Bagging
Evaluating Classification Model Performance
Confusion Matrix
Precision
Precision is defined as the ratio of the True Positive count
to the total number of positive predictions made by the model.
Precision = TP/(TP+FP)
Recall
Recall is defined as the ratio of True Positives count to
the total Actual Positive count.
Recall = TP/(TP+FN)

Recall is also called “True Positive Rate” or “sensitivity”.
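A small sketch of how precision and recall follow from the confusion-matrix counts. The label lists are made-up illustrative data and `confusion_counts` is a hypothetical helper name.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FP, FN, TN for a binary classification problem."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1   # predicted positive, actually positive
        elif p == positive:
            fp += 1   # predicted positive, actually negative
        elif a == positive:
            fn += 1   # predicted negative, actually positive
        else:
            tn += 1   # predicted negative, actually negative
    return tp, fp, fn, tn

# Made-up labels for illustration only
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(actual, predicted)
precision = tp / (tp + fp)  # share of positive predictions that were correct
recall = tp / (tp + fn)     # share of actual positives that were found
print(precision, recall)
```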


Specificity
Specificity is the proportion of all the real negative cases
that were identified as negative.

Specificity = TN/ (TN + FP)

Example use case: out of all the non-Covid patients who visited the doctor, how many
were diagnosed as non-Covid?
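To make the Covid example concrete, here is a short sketch with hypothetical counts: suppose 100 non-Covid patients visited the doctor and 90 of them were diagnosed as non-Covid.

```python
def specificity(tn, fp):
    """Share of actual negatives that were correctly identified as negative."""
    return tn / (tn + fp)

# Hypothetical counts, for illustration only
tn = 90  # non-Covid patients diagnosed as non-Covid
fp = 10  # non-Covid patients wrongly diagnosed as Covid

spec = specificity(tn, fp)
print(spec)  # 0.9
```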
Er. GOURAV
Fuzzy C-Means
This algorithm works by assigning each data point a membership in each cluster center,
based on the distance between the cluster center and the data point.
The nearer a data point is to a cluster center, the higher its
membership in that cluster center. The memberships of
each data point must sum to one. After each iteration, the memberships and cluster
centers are updated according to the formulas:
$$\mu_{ij} = \frac{1}{\sum_{k=1}^{c} \left( d_{ij} / d_{ik} \right)^{2/(m-1)}}, \qquad v_j = \frac{\sum_{i=1}^{n} \mu_{ij}^{m} \, x_i}{\sum_{i=1}^{n} \mu_{ij}^{m}}$$
where,
• 'n' is the number of data points.
• 'vj' represents the jth cluster center.
• 'm' is the fuzziness index, m ∈ [1, ∞).
• 'c' represents the number of cluster centers.
• 'µij' represents the membership of the ith data point in the jth cluster center.
• 'dij' represents the Euclidean distance between the ith data point and the jth cluster center.
• The main objective of the fuzzy c-means algorithm is to minimize:
$$J = \sum_{i=1}^{n} \sum_{j=1}^{c} \mu_{ij}^{m} \, d_{ij}^{2}$$
Where:
• c is the total number of clusters.
• m is the fuzziness parameter.
• dij is the distance between data point xi and cluster centroid vj.
• μij is the membership of data point xi in cluster j.
• The parameter m controls the degree of fuzziness.
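The alternating updates can be sketched in plain Python as below. This uses the commonly stated fuzzy c-means updates (given in the docstrings) and assumes m > 1. The sample points, the starting centers, and the stopping rule based on how far the centers move are illustrative assumptions, not taken from the slides.

```python
import math

def update_memberships(points, centers, m=2.0):
    """mu_ij = 1 / sum_k (d_ij / d_ik) ** (2 / (m - 1)); each row sums to one."""
    power = 2.0 / (m - 1.0)
    c = len(centers)
    memberships = []
    for x in points:
        d = [math.dist(x, v) for v in centers]
        if 0.0 in d:
            # The point sits exactly on a center: full membership there
            hit = d.index(0.0)
            row = [1.0 if j == hit else 0.0 for j in range(c)]
        else:
            row = [1.0 / sum((d[j] / d[k]) ** power for k in range(c))
                   for j in range(c)]
        memberships.append(row)
    return memberships

def update_centers(points, memberships, m=2.0):
    """v_j = sum_i mu_ij^m * x_i / sum_i mu_ij^m (a weighted mean of the points)."""
    dim = len(points[0])
    centers = []
    for j in range(len(memberships[0])):
        weights = [row[j] ** m for row in memberships]
        total = sum(weights)
        centers.append(tuple(
            sum(w * x[k] for w, x in zip(weights, points)) / total
            for k in range(dim)))
    return centers

def fuzzy_c_means(points, centers, m=2.0, tol=1e-6, max_iter=100):
    for _ in range(max_iter):
        memberships = update_memberships(points, centers, m)
        new_centers = update_centers(points, memberships, m)
        # Assumed stopping rule: largest center movement falls below tol
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    # Recompute memberships so they match the returned centers
    return centers, update_memberships(points, centers, m)

# Made-up data: two loose groups, with assumed starting centers
points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
centers, memberships = fuzzy_c_means(points, [(1, 1), (9, 9)])
for p, row in zip(points, memberships):
    print(p, [round(u, 3) for u in row])
```

Each printed row shows one point's membership in both clusters. Points deep inside a group get a membership near 1 for that group, while a point midway between the groups would get values closer to 0.5 each.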
Advantages
1) Gives the best results for overlapping data sets and performs comparatively better than the k-means algorithm.
2) Unlike k-means, where each data point must belong exclusively to one cluster center, here each data
point is assigned a membership in every cluster center, so a data point may belong to more than one
cluster center.

Disadvantages
1) The number of clusters must be specified a priori.
2) A lower value of β (the termination threshold) gives a better result, but at the expense of more iterations.
3) Euclidean distance measures can weight underlying factors unequally.
Classifications (Predicting Classes)
The k-nearest neighbors (KNN) algorithm is a non-parametric, supervised
learning classifier.

Example: Predicting Movie Genre


Movie   IMDb Rating   Duration (min)   Genre
A       8.0           160              Action
B       6.2           170              Action
C       7.2           168              Comedy
D       8.2           155              Comedy

Now predict the genre of movie “E”, which has an IMDb rating of 7.4 and a duration of 144 minutes.
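One way to work this example is to compute the Euclidean distance from movie E to each labelled movie and take a majority vote among the k closest. The sketch below assumes k = 3 and uses the raw, unscaled features; both choices are assumptions, since the slides do not state them.

```python
import math
from collections import Counter

# (name, IMDb rating, duration in minutes, genre) from the table
movies = [
    ("A", 8.0, 160, "Action"),
    ("B", 6.2, 170, "Action"),
    ("C", 7.2, 168, "Comedy"),
    ("D", 8.2, 155, "Comedy"),
]

def knn_predict(query, data, k=3):
    """Majority vote among the k rows closest to the query point."""
    # Sort the labelled movies by distance to (rating, duration) of the query
    neighbors = sorted(data, key=lambda row: math.dist(query, row[1:3]))[:k]
    votes = Counter(row[3] for row in neighbors)
    predicted = votes.most_common(1)[0][0]
    return predicted, [row[0] for row in neighbors]

genre, nearest = knn_predict((7.4, 144), movies, k=3)
print(genre, nearest)  # Comedy ['D', 'A', 'C']
```

The three nearest movies are D, A and C (distances of roughly 11.0, 16.0 and 24.0), so two of the three votes go to Comedy. Because duration spans a much larger numeric range than rating, it dominates the distance here; scaling the features first is a common refinement.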
