Isolation Forest Python

Data Source

For this, we will be using a subset of a larger dataset that was used as part of a Machine
Learning competition run by Xeek and FORCE 2020 (Bormann et al., 2020).

The full dataset can be accessed at the following link: https://doi.org/10.5281/zenodo.4351155.

All of the examples within this article can be used with any dataset.

Importing Libraries and Data

We will need to import seaborn, pandas, and IsolationForest from Scikit-learn.

import pandas as pd

import seaborn as sns


from sklearn.ensemble import IsolationForest

Once these have been imported, we next need to load our data.

df = pd.read_csv('Data/Xeek_Well_15-9-15.csv')

df.describe()

The summary only shows the numeric data present within the file. If we want to take a look at all of the features within the dataframe, we can call upon df.info(), which will inform us that we have 12 columns of data with varying levels of completeness.
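For reference, that call is simply:

df.info()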

As with many machine learning algorithms, we need to deal with missing values. As seen above, several columns are incomplete, such as NPHI (neutron porosity) with 13,346 values and GR (gamma ray) with 17,717 values.

The simplest way to deal with these missing values is to drop them. Even though this is a quick method, it should not be done blindly, and you should attempt to understand the reason for the missing values. Bear in mind that removing these rows leaves us with a reduced dataset when it comes to building machine learning models.
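If we want to see exactly how many values are missing from each column before deciding what to do, a quick check along these lines (an optional step, not required for the rest of the workflow) can help:

# Count the missing values within each column
df.isna().sum()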

To remove missing rows, we can call upon the following:

df = df.dropna()

If we call upon df again, we will see that we are now down to 13,290 values for every column.
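As an optional sanity check, we can also confirm the remaining row count directly:

# Number of rows remaining after dropping missing values
print(len(df))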

Building the Isolation Forest Model with Scikit-Learn

From our dataframe, we need to select the variables we will train our Isolation Forest model
with.

In this example, I am going to use just two variables (NPHI and RHOB). In reality, we would use
more and we will see an example of that later on. Using two variables allows us to visualise
what the algorithm has done.

First, we will create a list of our column names:

anomaly_inputs = ['NPHI', 'RHOB']

Next, we will create an instance of our Isolation Forest model. This is done by first creating a variable called model_IF and then assigning IsolationForest() to it.

We can then pass in a number of parameters for our model. The ones I have used in the code below are:

contamination: The proportion of the overall data that we expect to be considered as outliers. We can pass in a value between 0 and 0.5, or set it to 'auto'.

random_state: This controls the randomness used when building the trees (the selection of features and split values). In other words, if we rerun the model with the same data, the same parameters, and a fixed value for this parameter, we should get repeatable outputs.

model_IF = IsolationForest(contamination=float(0.1),random_state=42)
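If we would rather not commit to a fixed percentage, contamination can also be set to the string 'auto', in which case scikit-learn determines the threshold as in the original Isolation Forest paper. A minimal alternative (not used in the rest of this article) would look like this:

model_IF_auto = IsolationForest(contamination='auto', random_state=42)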

Once our model has been initialised, we can train it on the data. To do this, we call upon the .fit() method and pass in our dataframe (df), selecting just the columns we defined earlier.

model_IF.fit(df[anomaly_inputs])

After fitting the model, we can now create some predictions. We will do this by adding two new
columns to our dataframe:

anomaly_scores: Generated by calling upon model_IF.decision_function(), this provides the anomaly score for each sample within the dataset. The lower the score, the more abnormal that sample is. Negative values indicate that the sample is an outlier.

anomaly: Generated by calling upon model_IF.predict(), this is used to identify whether a point is an outlier (-1) or an inlier (1).

df['anomaly_scores'] = model_IF.decision_function(df[anomaly_inputs])

df['anomaly'] = model_IF.predict(df[anomaly_inputs])

Once the anomalies have been identified, we can view our dataframe and see the result. Values of 1 in the anomaly column indicate that a data point is an inlier (good data).

df.loc[:, ['NPHI', 'RHOB', 'anomaly_scores', 'anomaly']]

In the returned values above, we can see the original input features, the generated anomaly
scores and whether that point is an anomaly or not.
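If we only want to inspect the rows that have been flagged, we can filter the dataframe on the anomaly column, for example:

# Show only the points flagged as outliers
df[df['anomaly'] == -1][['NPHI', 'RHOB', 'anomaly_scores', 'anomaly']]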

Visualising Anomaly Data with Seaborn

Looking at the numeric values and trying to determine if the point has been identified as an
outlier or not can be tedious.

Instead, we can use seaborn to generate a basic figure. We can take the data we used to train our model and visually split it into outliers and inliers.

This simple function is designed to generate that plot and provide some additional metrics as text. The function takes:

data: The dataframe containing the values

outlier_method_name: The name of the method we are using. This is just for display purposes

x_var, y_var: The variables that we want to plot on the x and y axes respectively

xaxis_limits, yaxis_limits: The x-axis and y-axis ranges

def outlier_plot(data, outlier_method_name, x_var, y_var,
                 xaxis_limits=[0, 1], yaxis_limits=[0, 1]):

    print(f'Outlier Method: {outlier_method_name}')

    # Create a dynamic title based on the method
    method = f'{outlier_method_name}_anomaly'

    # Print out key statistics
    print(f"Number of anomalous values {len(data[data['anomaly']==-1])}")
    print(f"Number of non anomalous values {len(data[data['anomaly']==1])}")
    print(f'Total Number of Values: {len(data)}')

    # Create the chart using seaborn
    g = sns.FacetGrid(data, col='anomaly', height=4, hue='anomaly', hue_order=[1, -1])
    g.map(sns.scatterplot, x_var, y_var)
    g.fig.suptitle(f'Outlier Method: {outlier_method_name}', y=1.10, fontweight='bold')
    g.set(xlim=xaxis_limits, ylim=yaxis_limits)
    axes = g.axes.flatten()
    axes[0].set_title(f"Outliers\n{len(data[data['anomaly']==-1])} points")
    axes[1].set_title(f"Inliers\n{len(data[data['anomaly']==1])} points")
    return g

Once our function has been defined, we can then pass in the required parameters.

outlier_plot(df, 'Isolation Forest', 'NPHI', 'RHOB', [0, 0.8], [3, 1.5]);

Right away we can tell how many values have been identified as outliers and where they are
located. As we are only using two variables, we can see that we have essentially formed a
separation between the points at the edge of the data and those in the centre.
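One optional way to put numbers on that separation is to compare the anomaly score distributions of the two groups, for example:

# Compare anomaly score statistics for inliers (1) and outliers (-1)
df.groupby('anomaly')['anomaly_scores'].describe()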

Increasing Isolation Forest Contamination Value

The previous example used a value of 0.1 (10%) for the contamination parameter. What if we increase that to 0.3 (30%)?

model_IF = IsolationForest(contamination=float(0.3), random_state=42)

model_IF.fit(df[anomaly_inputs])
df['anomaly_scores'] = model_IF.decision_function(df[anomaly_inputs])
df['anomaly'] = model_IF.predict(df[anomaly_inputs])
outlier_plot(df, 'Isolation Forest', 'NPHI', 'RHOB', [0, 0.8], [3, 1.5]);

We can see that significantly more points have been selected and identified as outliers.

How do we know which contamination value to set?

The contamination value controls what percentage of the data will be flagged as outliers, but choosing that value can be tricky.

There are no hard and fast rules for picking this value, and it should be based on the domain
knowledge surrounding the data and its intended application(s).

For this particular dataset, we should consider other features such as borehole caliper and
delta-rho (DRHO) to help identify potentially poor data.
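One rough but practical approach is to rerun the model for a handful of contamination values and review how many points get flagged at each level before settling on one. A sketch of that idea (the candidate values here are arbitrary):

for contamination in [0.01, 0.05, 0.1, 0.2, 0.3]:
    model = IsolationForest(contamination=contamination, random_state=42)
    labels = model.fit_predict(df[anomaly_inputs])
    print(f'contamination={contamination}: {(labels == -1).sum()} points flagged')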

Using More than 2 Features for Isolation Forest

Now that we have seen the basics of using Isolation Forest with just two variables, let's see
what happens when we use a few more.

anomaly_inputs = ['NPHI', 'RHOB', 'GR', 'CALI', 'PEF', 'DTC']

model_IF = IsolationForest(contamination=0.1, random_state=42)


model_IF.fit(df[anomaly_inputs])
df['anomaly_scores'] = model_IF.decision_function(df[anomaly_inputs])
df['anomaly'] = model_IF.predict(df[anomaly_inputs])

outlier_plot(df, 'Isolation Forest', 'NPHI', 'RHOB', [0, 0.8], [3, 1.5]);


We now see that the points identified as outliers are much more spread out on the scatter plot,
and there is no hard edge around a core group of points.
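With more features, it is harder to judge the flagged points by eye, so it can also be useful to sort by the anomaly score and inspect the most anomalous samples directly, for example:

# The lowest (most negative) scores correspond to the strongest outliers
df.sort_values('anomaly_scores').head(10)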

Visualising Outliers with Seaborn’s Pairplot

Instead of just looking at two of the variables, we can look at all of the variables we have used.
This is done by using the seaborn pairplot.

First, we need to set the palette, which will allow us to control the colours being used in the
plot.

Then, we can call upon sns.pairplot and pass in the required parameters.

palette = ['#ff7f0e', '#1f77b4']

sns.pairplot(df, vars=anomaly_inputs, hue='anomaly', palette=palette)

In the resulting pairplot, orange points indicate outliers (-1) and blue points indicate inliers (1).

This provides us with a much better overview of the data, and we can now see some of the outliers clearly highlighted within the other features, especially PEF and GR.

Summary

Isolation Forest is an easy-to-use and easy-to-understand unsupervised machine learning method that can isolate anomalous data points from good data. The algorithm can be scaled up to handle large and high-dimensional datasets if required.
