Cluster analysis, or simply clustering, is the process of partitioning a set of data objects (or observations) into subsets. Each subset is a cluster, such that objects within a cluster are similar to one another yet dissimilar to objects in other clusters. The set of clusters resulting from a cluster analysis can be referred to as a clustering.

Requirements for Cluster Analysis

Scalability: Many clustering algorithms work well on small data sets containing fewer than several hundred data objects; however, a large database may contain millions or even billions of objects, particularly in Web search scenarios. Clustering only a sample of a given large data set may lead to biased results, so highly scalable clustering algorithms are needed.

Ability to deal with different types of attributes: Many algorithms are designed to cluster numeric (interval-based) data. However, applications may require clustering other data types, such as binary, nominal (categorical), and ordinal data, or mixtures of these types. Increasingly, applications also need clustering techniques for complex data types such as graphs, sequences, images, and documents.

Discovery of clusters with arbitrary shape: Many clustering algorithms determine clusters based on Euclidean or Manhattan distance measures. Algorithms based on such distance measures tend to find spherical clusters of similar size and density. However, a cluster could be of any shape.

Requirements for domain knowledge to determine input parameters: Many clustering algorithms require users to provide domain knowledge in the form of input parameters, such as the desired number of clusters.

Ability to deal with noisy data: Most real-world data sets contain outliers and/or missing, unknown, or erroneous data.

Incremental clustering and insensitivity to input order: In many applications, incremental updates may arrive at any time.

Capability of clustering high-dimensionality data: A data set can contain numerous dimensions or attributes.

Constraint-based clustering: Real-world applications may need to perform clustering under various kinds of constraints.
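To make the distance measures mentioned above concrete, here is a minimal sketch (the function names `euclidean` and `manhattan` are illustrative, not from a particular library) of the two measures most clustering algorithms rely on:

```python
import math

def euclidean(p, q):
    # Square root of the sum of squared coordinate differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # Sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(p, q))

p, q = (1.0, 2.0), (4.0, 6.0)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7.0
```

Because both measures grow uniformly in every direction from a cluster center, algorithms built on them favor compact, roughly spherical clusters, which is why arbitrarily shaped clusters are a challenge.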
Interpretability and usability: Users want clustering results to be interpretable, comprehensible, and usable.

Classification: Accuracy

Accuracy is one metric for evaluating classification models. Informally, accuracy is the fraction of predictions our model got right. Formally, accuracy has the following definition:

Accuracy = Number of correct predictions / Total number of predictions
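The definition above translates directly into code. A minimal sketch (the helper name `accuracy` and the sample labels are illustrative):

```python
def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true labels
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
print(accuracy(y_true, y_pred))  # 0.8 (4 of 5 predictions correct)
```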
For binary classification, accuracy can also be calculated in
terms of positives and negatives as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

where TP = True Positives, TN = True Negatives, FP = False Positives, and FN = False Negatives.

Let's try calculating accuracy for the following model, which classified 100 tumors as malignant (the positive class) or benign (the negative class):

• True Positive (TP): Reality: Malignant; ML model predicted: Malignant. Number of TP results: 1
• False Positive (FP): Reality: Benign; ML model predicted: Malignant. Number of FP results: 1
• False Negative (FN): Reality: Malignant; ML model predicted: Benign. Number of FN results: 8
• True Negative (TN): Reality: Benign; ML model predicted: Benign. Number of TN results: 90

Confusion Matrix
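Plugging the four counts from the tumor example into the formula gives the model's accuracy; a quick sketch:

```python
# Counts from the 100-tumor example: 1 TP, 1 FP, 8 FN, 90 TN
TP, FP, FN, TN = 1, 1, 8, 90

# Accuracy = (TP + TN) / (TP + TN + FP + FN)
accuracy = (TP + TN) / (TP + TN + FP + FN)
print(accuracy)  # 0.91
```

An accuracy of 0.91 sounds strong, but note that the model finds only 1 of the 9 malignant tumors; this is exactly why the precision and recall metrics below matter for imbalanced data.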
Evaluation of the performance of a classification model is based
on the counts of test records correctly and incorrectly predicted by the model. The confusion matrix provides a more insightful picture, showing not only the overall performance of a predictive model but also which classes are being predicted correctly and incorrectly, and what types of errors are being made.

Precision is the ratio of True Positives to all the positives predicted by the model. Low precision: the more False Positives the model predicts, the lower the precision.

Recall (Sensitivity) is the ratio of True Positives to all the positives in your data set. Low recall: the more False Negatives the model predicts, the lower the recall.

F-Measure provides a single score that balances the concerns of precision and recall in one number. A good F1 score means that you have low false positives and low false negatives, so you are correctly identifying real threats without being disturbed by false alarms. An F1 score is considered perfect when it is 1, while the model is a total failure when it is 0.
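The three metrics can be sketched directly from confusion-matrix counts; here they are applied to the tumor counts (TP = 1, FP = 1, FN = 8) used earlier, with F1 computed as the harmonic mean of precision and recall:

```python
# Confusion-matrix counts from the tumor example
TP, FP, FN, TN = 1, 1, 8, 90

precision = TP / (TP + FP)  # correct positives among predicted positives
recall = TP / (TP + FN)     # correct positives among actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(round(precision, 4))  # 0.5
print(round(recall, 4))     # 0.1111
print(round(f1, 4))         # 0.1818
```

Despite the 0.91 accuracy, the low recall and F1 expose how poorly this model catches malignant tumors, illustrating why a single metric can mislead.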