approaches that we are going to discuss: 1) DBSCAN 2) Graph-Based Clustering

DBSCAN
• DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It was proposed by Martin Ester et al. in 1996.
• DBSCAN is a density-based clustering algorithm that works on the assumption that clusters are dense regions in space separated by regions of lower density.
• The DBSCAN algorithm requires two parameters:
• Eps: the maximum radius of the neighbourhood.
• MinPts: the minimum number of points required in the Eps-neighbourhood of a point.
• The algorithm distinguishes three types of data points:
• Core point: has at least MinPts points in its Eps-neighbourhood.
• Border point: has a core point in its neighbourhood but is not itself a core point.
• Noise point: neither a core nor a border point; considered an outlier in the dataset.

Why DBSCAN?
• Partitioning methods (K-means, PAM clustering) and hierarchical clustering work for finding spherical or convex clusters. In other words, they are suitable only for compact and well-separated clusters, and they are severely affected by the presence of noise and outliers in the data.
• Advantages:
1) No need to specify the number of clusters ahead of time.
2) Capable of detecting noise points during clustering.
3) Clusters of any size and shape can be found.
• Disadvantages:
1) The algorithm fails when clusters have widely varying densities.
2) It does not work well for high-dimensional data.
(A minimal DBSCAN code sketch appears after the outlier-detection algorithm below.)

Graph-Based Clustering
• Graph clustering refers to the clustering of data in the form of graphs. The following are the types of graph clustering:
• Between-graph: clustering a set of graphs.
• Within-graph: clustering the nodes/edges of a single graph. (A within-graph sketch also appears below.)

Outlier Detection and Analysis
• An outlier is a data object that deviates significantly from the normal objects, as if it were generated by a different mechanism; it is a unique entity that stands out from the rest of the group.
• Why outlier analysis? Most data mining methods discard outliers as noise or exceptions; however, in some applications, such as fraud detection, rare events may be more interesting than frequent ones, which makes outlier analysis essential.

Outlier Detection
• To find outliers, we first set a threshold value such that every data point whose distance from its nearest cluster exceeds the threshold is considered an outlier.
• The distance between the test data and each cluster mean must then be determined.
• If the distance between the test data and the nearest cluster is greater than the threshold value, the test data is classified as an outlier.

Algorithm for outlier detection (a sketch follows this list):
1. Calculate each cluster's mean.
2. Initialize the threshold value.
3. Calculate the test data's distance from each cluster mean.
4. Find the cluster closest to the test data.
5. An outlier exists if (distance > threshold).
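To make the five steps concrete, here is a minimal Python sketch (not part of the original notes). The synthetic data, the use of scikit-learn's KMeans to obtain the cluster means, and the threshold of 10.0 are all illustrative assumptions.

```python
# Minimal sketch of the threshold-based outlier-detection procedure above.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[25.0, 24.0], [26.0, 27.0], [22.0, 25.0],
              [61.0, 60.0], [63.0, 58.0], [60.0, 62.0]])
test_point = np.array([85.0, 90.0])

# Step 1: calculate each cluster's mean (here obtained via K-means).
means = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X).cluster_centers_

# Step 2: initialize the threshold value (chosen by hand for this sketch).
threshold = 10.0

# Steps 3-4: distance from the test point to each cluster mean,
# then pick the nearest cluster.
distances = np.linalg.norm(means - test_point, axis=1)
nearest = distances.min()

# Step 5: the point is an outlier if the nearest-cluster distance
# exceeds the threshold.
print("outlier" if nearest > threshold else "normal")  # -> outlier
```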
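And here is the DBSCAN sketch promised earlier, again a minimal illustration under assumptions: scikit-learn's DBSCAN, a synthetic two-moons dataset, and hand-picked eps/min_samples values.

```python
# Minimal DBSCAN sketch: eps is the neighbourhood radius (Eps),
# min_samples is MinPts. Dataset and parameter values are illustrative.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)

# DBSCAN does not need the number of clusters ahead of time; it discovers
# them, and labels noise points with -1.
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print("clusters found:", n_clusters)
print("noise points:", list(db.labels_).count(-1))
```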
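For the within-graph case mentioned under Graph-Based Clustering, one possible sketch uses NetworkX. The notes do not name a specific algorithm, so the choice of greedy modularity community detection and the karate-club example graph are assumptions.

```python
# Within-graph clustering sketch: partition the nodes of a single graph
# into communities. Greedy modularity is one possible algorithm choice.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.karate_club_graph()                      # a single, well-known graph
communities = greedy_modularity_communities(G)  # list of frozensets of nodes

for i, c in enumerate(communities):
    print(f"community {i}: {sorted(c)}")
```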
Types of Outliers
The following are the different types of outliers:
• Global outliers
• Contextual outliers
• Collective outliers

Global Outliers
• A data point is considered a global outlier if its value is far outside the entirety of the data set in which it is found, i.e., a measured sample point with a very high or a very low value relative to all the values in the dataset.
• For example, if 9 out of 10 points have values between 20 and 30, but the 10th point has a value of 85, the 10th point may be a global outlier.

Contextual Outliers
• If an individual data point is anomalous in a specific context or condition (but not otherwise), it is termed a contextual outlier.
• Attributes of data objects are divided into two groups:
• Contextual attributes: define the context, e.g., time and location.
• Behavioral attributes: characteristics of the object used in outlier evaluation, e.g., temperature.

Collective Outliers
• A subset of data objects collectively deviates significantly from the whole data set, even if the individual data objects are not outliers.
• Applications: e.g., intrusion detection, such as when several computers keep sending denial-of-service packets to each other.

Implementation of K-means in Weka
The following are the steps for implementing K-means using Weka:
1) Go to the Preprocess tab in WEKA Explorer and select Open File. Select the dataset "vote.arff."
2) Use the "Visualize" tab to view the clustering result: go to the tab, click on any box, and move the Jitter slider to the maximum.
• The X-axis and Y-axis represent the attributes.
• The blue color represents the class label democrat and the red color represents the class label republican.
• Jitter is used to view the clusters.

Overfitting
• A statistical model is said to be overfitted when it fits its training data too closely (much like fitting ourselves into oversized pants!): while training, it begins to learn from the noise and inaccuracies in the data collection.
• The model then fails to correctly categorize new data because it has captured too many details and too much noise.
• Non-parametric and non-linear approaches are common causes of overfitting, since these types of machine learning algorithms have more flexibility in constructing models from the dataset and can thus create unrealistic models.
• If we have linear data, we can use a linear algorithm to avoid overfitting, or we can constrain decision tree parameters such as the maximal depth.
• In a nutshell, overfitting is characterized by high variance and low bias.

Bias
• Bias refers to a model's simplifying assumptions that make the target function easier to learn.
• Bias is a systematic error that occurs due to wrong assumptions in the machine learning process, such as assuming the data is linear when in reality it follows a complex function.
• The resulting differences between the actual (expected) values and the model's predicted values are known as the bias error, or error due to bias.

Variance
• Variance is when a model trained on the training data achieves a very low error, but when the data is changed and the same model is retrained, the error becomes very high; the model is overly sensitive to the particular training set.

K-Fold Cross-Validation
• This approach divides the data set into k subsets (also known as folds), trains the model on k-1 of the subsets, and leaves the remaining subset out for evaluating the trained model.
• We iterate k times, with a different subset reserved for testing each time. (A short sketch follows the advantages below.)

Advantages of Cross-Validation
• Out-of-sample accuracy can be estimated more reliably.
• Every observation is used for both training and testing, resulting in a more "effective" use of the data.
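Here is the minimal k-fold cross-validation sketch promised above, using scikit-learn; the iris dataset, the logistic regression model, and k = 5 are illustrative assumptions.

```python
# Minimal k-fold cross-validation sketch: the data set is split into
# k folds; each fold is held out once for testing while the model is
# trained on the remaining k-1 folds.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=kf)  # one accuracy score per fold

print("fold accuracies:", scores.round(3))
print("mean accuracy:", scores.mean().round(3))
```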
Random Forest
• Random Forest is a popular machine learning algorithm that belongs to the supervised learning technique.
• It is based on the concept of ensemble learning, which is the process of combining multiple classifiers to solve a complex problem and to improve the performance of the model. (A minimal classification sketch follows.)

Advantages of Random Forest
• Random Forest is capable of performing both classification and regression tasks.
• It is capable of handling large datasets with high dimensionality.
• It enhances the accuracy of the model and helps prevent overfitting.

Disadvantages of Random Forest
• Complexity is the main disadvantage of random forest algorithms.
• Constructing a random forest is much harder and more time-consuming than building a decision tree.
• More computational resources are required to implement the Random Forest algorithm.
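The sketch referenced above, using scikit-learn's RandomForestClassifier; the breast-cancer dataset, the train/test split, and n_estimators = 100 are illustrative assumptions.

```python
# Minimal Random Forest classification sketch: an ensemble of decision
# trees whose combined vote gives the final prediction.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# n_estimators is the number of trees in the forest (an illustrative value).
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

print("test accuracy:", round(rf.score(X_test, y_test), 3))
```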