
Cluster Analysis

Cluster analysis, or simply clustering, is the process of
partitioning a set of data objects (or observations) into
subsets. Each subset is a cluster, such that objects in a
cluster are similar to one another, yet dissimilar to
objects in other clusters. The set of clusters resulting
from a cluster analysis can be referred to as a
clustering.
Requirements for Cluster Analysis
Scalability: Many clustering algorithms work well on
small data sets containing fewer than several hundred
data objects; however, a large database may contain
millions or even billions of objects, particularly in Web
search scenarios. Clustering on only a sample of a given
large data set may lead to biased results. Therefore,
highly scalable clustering algorithms are needed.
Ability to deal with different types of
attributes: Many algorithms are designed to cluster
numeric (interval-based) data. However, applications
may require clustering other data types, such as binary,
nominal (categorical), and ordinal data, or mixtures of
these data types. Recently, more and more applications
need clustering techniques for complex data types such
as graphs, sequences, images, and documents.
Discovery of clusters with arbitrary shape: Many
clustering algorithms determine clusters based on Euclidean
or Manhattan distance measures. Algorithms based on such
distance measures tend to find spherical clusters with
similar size and density. However, a cluster could be of any
shape.
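As a minimal sketch of the two distance measures named above, the following computes both for a pair of points. The function names and the sample points are illustrative assumptions, not part of the original material.

```python
import math

def euclidean(p, q):
    """Straight-line (L2) distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """City-block (L1) distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical sample points chosen only for illustration.
p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7.0
```

Both measures depend only on how far a point is from a reference point, which is one way to see why algorithms built on them tend to favor compact, roughly spherical clusters.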
Requirements for domain knowledge to determine
input parameters: Many clustering algorithms require
users to provide domain knowledge in the form of input
parameters such as the desired number of clusters.
Ability to deal with noisy data: Most real-world
data sets contain outliers and/or missing, unknown, or
erroneous data.
Incremental clustering and insensitivity to
input order: In many applications, incremental
updates may arrive at any time.
Capability of clustering high-dimensionality data: A
data set can contain numerous dimensions or attributes.
Constraint-based clustering: Real-world
applications may need to perform clustering under
various kinds of constraints.

Interpretability and usability: Users want
clustering results to be interpretable, comprehensible,
and usable.
Classification: Accuracy
Accuracy is one metric for evaluating classification models.
Informally, accuracy is the fraction of predictions our model
got right. Formally, accuracy has the following definition:
Accuracy = Number of correct predictions / Total number of predictions

For binary classification, accuracy can also be calculated in
terms of positives and negatives as follows:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
where TP = True Positives, TN = True Negatives, FP = False
Positives, and FN = False Negatives.
Let's try calculating accuracy for the following
model that classified 100 tumors as malignant
(the positive class) or benign (the negative
class):

True Positive (TP):
•Reality: Malignant
•ML model predicted: Malignant
•Number of TP results: 1

False Positive (FP):
•Reality: Benign
•ML model predicted: Malignant
•Number of FP results: 1

False Negative (FN):
•Reality: Malignant
•ML model predicted: Benign
•Number of FN results: 8

True Negative (TN):
•Reality: Benign
•ML model predicted: Benign
•Number of TN results: 90
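
The accuracy formula can be applied to these counts directly. The sketch below uses the four counts from the tumor example (TP = 1, FP = 1, FN = 8, TN = 90); the function name `accuracy` is an illustrative choice.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that were correct."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no predictions to score")
    return (tp + tn) / total

# Counts taken from the tumor example: 100 tumors in total.
acc = accuracy(tp=1, tn=90, fp=1, fn=8)
print(acc)  # 0.91
```

The model is correct on 91 of 100 tumors, yet it catches only 1 of the 9 malignant ones, so accuracy alone can look good on a class-imbalanced data set.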
Confusion Matrix

Evaluation of the performance of a classification model is based
on the counts of test records correctly and incorrectly predicted
by the model. The confusion matrix gives a more insightful
picture: it shows not only the overall performance of a predictive
model, but also which classes are being predicted correctly and
incorrectly, and what types of errors are being made.
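
One way to tally the four cells of a binary confusion matrix from paired labels is sketched below. The helper name `confusion_counts` and the five toy labels are hypothetical, chosen only to show the counting logic.

```python
from collections import Counter

def confusion_counts(actual, predicted, positive):
    """Count TP, FP, FN, and TN for one designated positive class."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p == positive:
            # Predicted positive: correct (TP) or a false alarm (FP).
            counts["TP" if a == positive else "FP"] += 1
        else:
            # Predicted negative: a miss (FN) or correct (TN).
            counts["FN" if a == positive else "TN"] += 1
    return counts

# Hypothetical toy labels, not data from the tumor example.
actual    = ["malignant", "malignant", "benign",    "benign", "benign"]
predicted = ["malignant", "benign",    "malignant", "benign", "benign"]
counts = confusion_counts(actual, predicted, positive="malignant")
print(counts["TP"], counts["FP"], counts["FN"], counts["TN"])  # 1 1 1 2
```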
Precision is the ratio of True Positives to all the positives
predicted by the model: Precision = TP / (TP + FP).
Low precision: the more False Positives the model predicts,
the lower the precision.
Recall (Sensitivity) is the ratio of True Positives to all the
actual positives in the data set: Recall = TP / (TP + FN).
Low recall: the more False Negatives the model predicts,
the lower the recall.
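
As a small sketch, precision and recall can be computed from the tumor example's counts (TP = 1, FP = 1, FN = 8). Returning 0.0 when a denominator is zero is an assumed convention here, not something the text specifies.

```python
def precision(tp, fp):
    """TP / (TP + FP); 0.0 by assumption if nothing was predicted positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """TP / (TP + FN); 0.0 by assumption if there are no actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Counts from the tumor example.
p = precision(tp=1, fp=1)
r = recall(tp=1, fn=8)
print(p)            # 0.5
print(round(r, 3))  # 0.111
```

Half of the model's malignant predictions are correct, but it recovers only about 11% of the tumors that are actually malignant.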
F-Measure (the F1 score) provides a single score that balances
the concerns of both precision and recall. A good F1
score means that you have few false positives and few false
negatives, so you are correctly identifying real threats and
you are not disturbed by false alarms. An F1 score is
considered perfect when it is 1, while the model is a total
failure when it is 0.
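
The F1 score is the harmonic mean of precision and recall, F1 = 2 * P * R / (P + R). The sketch below applies it to the values that follow from the tumor example (precision 1/2, recall 1/9); treating an undefined score as 0.0 is an assumed convention.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 by assumption if both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall derived from the tumor counts (TP=1, FP=1, FN=8).
f1 = f1_score(precision=1 / 2, recall=1 / 9)
print(round(f1, 3))  # 0.182
```

Because the harmonic mean is pulled toward the smaller of its two inputs, the low recall keeps F1 low even though accuracy on the same counts is 0.91.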
