Data mining involves predicting the information that data holds during the
process of data analysis. During this process, some deviations from the data
trends are observed; these are called outliers. Let us first look at data
mining. It is basically an exercise in sorting through a huge data set to
identify patterns and make connections that help solve problems and predict
future trends. So, what are outliers in data mining? Outliers are also data
objects, but they behave distinctly from the rest of the data objects. The first
definition of outliers was given by Grubbs in 1969. It is also useful to know
about outlier analysis in data mining and the types of outliers.
Classifying Outliers
Broadly, outliers are classified into the following three types:
1. Global outliers:
This is the most elementary form of outlier. These are the few points in a
dataset that deviate strongly from the rest of the data points and are
therefore located far away from the data distribution or cluster.
2. Contextual outliers:
These outliers deviate greatly only within a specific context or condition;
under other conditions, the same data may show normal behavior. This makes it
necessary to specify the context in the problem statement. The attributes of
the data objects are of two types: contextual attributes, which define the
context, and behavioral attributes, which define the objects'
characteristics.
3. Collective outliers:
These outliers deviate from the rest of the dataset as a group, forming a
cluster of their own away from it. They arise when a set of data points
behaves anomalously collectively, even if each point on its own might not
look unusual.
Outlier Detection Techniques
The techniques and approaches for detecting the outliers described above are
discussed below:
1. Sorting
This is one of the simplest ways of detecting outliers in data mining: the data
is sorted by magnitude during data manipulation. Values that fall at the
extreme high or low end of the sorted range can be considered candidate
outliers.
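
As a rough illustration of the sorting approach, the sketch below sorts a small
made-up list of readings and returns the values at both ends for inspection.
The function name `extremes`, the parameter `k`, and the sample data are
assumptions introduced for this example; sorting only surfaces candidates, and
deciding whether they are really outliers is left to the analyst.

```python
def extremes(values, k=2):
    """Return the k smallest and k largest values after sorting."""
    ordered = sorted(values)
    return ordered[:k], ordered[-k:]

# Made-up sample: most readings sit near 10, one is far above.
readings = [10.2, 9.8, 10.1, 55.0, 10.0, 9.9, 10.3]

lowest, highest = extremes(readings)
print("lowest:", lowest)
print("highest:", highest)
```

Here the value 55.0 ends up at the top of the sorted list, well separated from
its neighbours, which is what makes it stand out.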
2. Graphing
This method requires plotting all the data in a graph, such as a histogram,
scatter plot, or box plot, which lets the user see the points that diverge
from the rest of the dataset.
3. Z-score
This method assumes a Gaussian distribution and measures how far each data
point deviates from the mean of the sample, expressed in standard deviations.
• To calculate the Z-score for an observation, take the raw value, subtract
the mean, and then divide by the standard deviation.
• When the data does not follow a Gaussian distribution, transformations
such as scaling are sometimes applied first. Python libraries such as
scikit-learn and SciPy have built-in functions that make these
transformations easy to implement.
• A positive Z-score indicates that the object lies above the mean, whereas
a negative Z-score indicates that it lies below the mean, in both cases
by that number of standard deviations.
• A threshold is applied to the Z-score; a common choice is an absolute
value of 3. Under a Gaussian distribution it is unusual for a Z-score to
be far from zero, so such unusual deviations help us identify outliers.
• For a parametric distribution in a low-dimensional feature space, the
Z-score is a simple and effective method for removing outliers from a
dataset.
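
The Z-score steps above can be sketched with Python's standard library alone.
This is a minimal illustration, not a reference implementation: the function
names, the made-up sample, the use of the population standard deviation
(`pstdev`), and the threshold of 3 are all assumptions chosen for the example.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Z-score of each value: (value - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        # All values are identical, so nothing deviates from the mean.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def z_score_outliers(values, threshold=3.0):
    """Values whose absolute Z-score exceeds the threshold."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) > threshold]

# Made-up sample: twenty values between 48 and 52, plus one value of 120.
sample = [48, 49, 50, 51, 52] * 4 + [120]

print(z_score_outliers(sample))
```

Note that in very small samples no point can reach a large Z-score, because
the extreme value itself inflates the standard deviation, so a threshold of 3
is only meaningful with enough observations.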
4. DBSCAN
5. Isolation Forests
This is a widely used method built on binary trees. It relies on the outlier
points being few in number and far enough from the rest of the data to be
distinguished clearly. The algorithm selects a feature at random and then a
random split value between the minimum and maximum of that feature, and
repeats this to partition the observations. A forest of such trees is built
over the observations in the set. The 'path length' of an observation is the
number of 'splittings' needed to isolate it, and outliers tend to be isolated
after fewer splittings than normal points.
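
To make the idea of path length concrete, here is a deliberately simplified
sketch in plain Python. It is not the full Isolation Forest algorithm: it does
no subsampling and computes no normalized anomaly score, and it simply traces
fresh random splittings for one point many times and averages the depth. The
function names, `max_depth`, `n_trees`, the seed, and the sample data are all
assumptions made for this illustration.

```python
import random

def path_length(point, data, rng, depth=0, max_depth=10):
    """Count the random splittings needed to isolate `point` within `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    feature = rng.randrange(len(point))      # pick a feature at random
    lo = min(row[feature] for row in data)
    hi = max(row[feature] for row in data)
    if lo == hi:
        # Simplification: stop if the chosen feature cannot be split.
        return depth
    split = rng.uniform(lo, hi)              # random value between min and max
    # Keep only the side of the split that still contains `point`.
    if point[feature] < split:
        side = [row for row in data if row[feature] < split]
    else:
        side = [row for row in data if row[feature] >= split]
    return path_length(point, side, rng, depth + 1, max_depth)

def average_path_length(point, data, n_trees=200, seed=0):
    """Average path length of `point` over many random trees."""
    rng = random.Random(seed)
    total = sum(path_length(point, data, rng) for _ in range(n_trees))
    return total / n_trees

# Made-up data: a 5x5 grid of points in [0, 1] x [0, 1] plus one distant point.
grid = [(x / 4, y / 4) for x in range(5) for y in range(5)]
far_point = (10.0, 10.0)
data = grid + [far_point]

print("distant point:", average_path_length(far_point, data))
print("grid centre:  ", average_path_length((0.5, 0.5), data))
```

The distant point is usually cut off by the very first split, while the point
in the middle of the grid needs several splits before it is alone, so its
average path length is noticeably longer.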
Using the Interquartile Rule to Find Outliers: The interquartile range can be
used to detect outliers.
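
The source does not spell out the rule, so the sketch below assumes the
commonly used convention: a value is flagged if it lies more than 1.5 times
the interquartile range below the first quartile or above the third quartile.
The function name, the multiplier `k`, and the sample data are assumptions for
illustration, and the quartiles come from Python's `statistics.quantiles` with
its default method (available in Python 3.8 and later).

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)   # the three quartile cut points
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Made-up sample: values between 12 and 19, plus one value of 60.
sample = [12, 13, 14, 15, 15, 16, 17, 18, 19, 60]

print(iqr_outliers(sample))
```

Because quartiles are barely affected by a single extreme value, this rule
does not depend on the Gaussian assumption that the Z-score method makes.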
Conclusion
In this article we have discussed what outliers are in data mining and what
outlier analysis is. Outliers are usually discarded because they can lead to
wrong predictions during data analysis. Yet there are certain scenarios where
the outliers themselves are what matters, for example, the detection of fraud.
Either way, detecting outliers is quite significant in data mining. We have
also discussed several methods for detecting outliers of different types.
Data mining is an integral part of our digital lives, and outliers are a major
part of it. To learn more, you can check out our Skillslash courses, such as
the Data Science Course in Bangalore and the Full Stack Developer Course in
Bangalore, among others. We provide the best of coaching and a wonderful
learning experience with a 100% placement guarantee.