Sklearn – An Introduction Guide to Machine Learning

· 17 min read


Last Updated on April 3, 2023



Table of Contents
1. What is Sklearn?
2. What is Sklearn used for?
3. How to download Sklearn for Python?
4. How to pick the best scikit-learn model?
5. Sklearn preprocessing – Prepare the data for analysis
1. Sklearn feature encoding
2. Sklearn data scaling
3. Sklearn missing values
4. Sklearn train test split
6. Sklearn Regression – Predict the future
1. Sklearn Linear Regression
2. Other Sklearn regression models
7. Sklearn Classification – Did I just see a cat?
1. Sklearn Decision Tree Classifier
2. Other Sklearn classification models
8. Sklearn Clustering – Create groups of similar data
1. Sklearn DBSCAN
2. Other Sklearn clustering models
9. Sklearn Dimensionality Reduction – Reducing random variables
1. Sklearn PCA
2. Other Sklearn Dimensionality Reduction models
10. What are the 3 Common Machine Learning Analysis/Testing Mistakes?
11. Full code


What is Sklearn?
Sklearn (scikit-learn) is a Python library that provides a wide range of unsupervised and supervised machine learning algorithms.

It is also one of the most used machine learning libraries and is built on top of SciPy.

Link: https://scikit-learn.org/stable/

What is Sklearn used for?


The Sklearn library is mainly used for modeling data, and it provides efficient tools that are easy to use for any kind of predictive data analysis.

The main use cases of this library can be grouped into the following six categories:

Preprocessing
Regression
Classification
Clustering
Model Selection
Dimensionality Reduction

As this article is mainly aimed at beginners, we will stick to the core concepts of
each category and explore some of its most popular features and algorithms.

Advanced readers can use this article as a refresher on some of the main use cases and the intuition behind popular sklearn features that most ML practitioners couldn’t live without.

Each category will be explained in a beginner-friendly and illustrative way, followed by the most used models, the intuition behind them, and hands-on experience. But first, we need to set up our sklearn library.


How to download Sklearn for Python?

Sklearn can be obtained in Python by using the pip install function as shown below:



Python

$ pip install -U scikit-learn



Sklearn developers strongly advise using a virtual environment (venv) or a conda environment when working with the library, as it helps to avoid potential conflicts with other packages.
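For example, a minimal setup with Python’s built-in venv module might look like this (the environment name sklearn-env is arbitrary):

Python

# Create and activate a virtual environment, then install scikit-learn into it
$ python -m venv sklearn-env
$ source sklearn-env/bin/activate  # on Windows: sklearn-env\Scripts\activate
$ pip install -U scikit-learn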

How to pick the best Sklearn model?


When it comes to picking the best Sklearn model, there are many factors that come into play, ranging from experience and data to the problem scope and the math behind each algorithm.

Sometimes all of the candidate algorithms can have similar results and, depending on the problem setting, you will need to pick the one that is the fastest or the one that generalizes best to big data.

It may happen that none of your chosen models performs well enough and that you will simply need to combine multiple models (e.g. ensemble), build your own custom-made model, or go for a deep learning approach.

As picking the right model is one of the foundations of your problem solving, it is wise to read up on as many models and their uses as you can.

As model selection would be an article, or even a book, of its own, I’ll only provide some rough guidelines in the form of questions that you’ll need to ask yourself when deciding which model to deploy.

How much data do you have?


Some models are better on smaller datasets while others require more data and tend to generalize better on larger datasets (e.g. SGD Regressor vs Lasso Regression).
What are the main characteristics of your data?



Is your data linear, quadratic, or all over the place? What do your distributions look like? Is your data made up of numbers or strings? Is the data labeled?
What kind of problem are you solving?

Are you trying to predict which cat will push the most jars off the table, whether that is a dog or a cat, or which dog breeds make up a group of dogs?

All of these questions have different approaches and solutions. Thus, later in the article, we will explore the three main problem classifications:

regression
classification
clustering

How do your models perform when compared against each other?

You will see that scikit-learn comes equipped with functions that allow us to
inspect each model on several characteristics and compare it to the other ones.

Take note that scikit-learn has created a good algorithm cheat-sheet that aids you in your model selection; I’d advise keeping it near you at those troubling times.

Sklearn preprocessing – Prepare the data for analysis
When you think of data, you probably have in mind a ginormous Excel spreadsheet full of rows and columns with numbers in them. In reality, data can come in a plethora of formats like images, videos, and audio.


The main job of data preprocessing is to turn this data into a readable format for our algorithm. A machine can’t just “listen in” on an audiotape to learn voice recognition; rather, it needs the audio to be converted into numbers.

The main building blocks of our dataset are called features, which can be categorical or numerical. Simply put, categorical data is used to group data with similar characteristics, while numerical data provides information with numbers.

As the features come from two different categories, they need to be treated (preprocessed) in different ways. The best way to learn is to start coding along with me.

Sklearn feature encoding


Feature encoding is a method where we transform categorical variables into numerical ones. The most popular ways of doing so are known as One Hot Encoding and Label Encoding.

For example, a person can have features such as [“male”, “female”], [“from US”, “from UK”], [“uses Binance”, “uses Coinbase”]. These features can be encoded as numbers, e.g. [“male”, “from US”, “uses Coinbase”] would be [0, 0, 1].

This can be done by using the scikit-learn OrdinalEncoder() function as follows:

Python

from sklearn import preprocessing

X = [['male', 'from US', 'uses Coinbase'], ['female', 'from UK', 'uses Binance']]
encode = preprocessing.OrdinalEncoder()
encode.fit(X)

encode.transform([['male', 'from UK', 'uses Coinbase']])

Output: array([[1., 0., 1.]])


As you can see, it transformed the features into integers. But they are not continuous and can’t be used directly with all scikit-learn estimators. In order to fix this, a popular and widely used method is one hot encoding.

One hot encoding, also known as dummy encoding, can be obtained through the scikit-learn OneHotEncoder() function. It works by transforming each category with N possible values into N binary features, where one category is represented as 1 and the rest as 0.

The following example will hopefully make it clear:

Python

one_hot = preprocessing.OneHotEncoder()
one_hot.fit(X)

one_hot.transform([['male', 'from UK', 'uses Coinbase'],
                   ['female', 'from US', 'uses Binance']]).toarray()

Output: array([[0., 1., 1., 0., 0., 1.],
               [1., 0., 0., 1., 1., 0.]])

To see what your encoded features are exactly you can always use the
.categories_ attribute as shown below:

Python

one_hot.categories_

Output: [array(['female', 'male'], dtype=object),
         array(['from UK', 'from US'], dtype=object),
         array(['uses Binance', 'uses Coinbase'], dtype=object)]

Sklearn data scaling


Feature scaling is a preprocessing method used to normalize data, as it can improve the performance of some machine learning models. The two most common scaling techniques are known as standardization and normalization.


Standardization makes the values of each feature in the data have zero mean and unit variance. This method is commonly used with algorithms such as SVMs and Logistic Regression.
"I have not read it, but I'm sure it is great." - My girlfriend

Standardization is done by subtracting the mean from each feature and dividing it by the standard deviation. It’s some basic statistics and math, but don’t worry if you don’t get it; there are many tutorials that cover it.
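As a quick sketch of the idea with made-up numbers, standardizing a feature x computes z = (x - mean) / std:

Python

import numpy as np

# Standardize a toy feature by hand: z = (x - mean) / std
x = np.array([10.0, 20.0, 30.0])
z = (x - x.mean()) / x.std()
print(z)         # [-1.22474487  0.          1.22474487]
print(z.mean())  # 0.0 (up to floating-point precision)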

In scikit-learn we use the StandardScaler() function to standardize the data. Let us create a random NumPy array and standardize the data by giving it zero mean and unit variance.

Python

import numpy as np

scaler = preprocessing.StandardScaler()
X = np.random.rand(3,4)
X

Python

X_scaled = scaler.fit_transform(X)
X_scaled

Python

print(f'The scaled mean is: {X_scaled.mean(axis=0)}\nThe scaled variance is: {X_scaled.std(axis=0)}')

Wait a second! Didn’t you say that all mean values need to be 0?

Well, in practice these values are so close to 0 that they can be viewed as zero. Due to the limitations of floating-point representation, the scaler can only get the mean really close to zero.


Let’s move on to the next scaling method, called normalization. Normalization is a term with many definitions that change from one field to another, so we are going to define it as follows:
"I have not read it, but I'm sure it is great." - My girlfriend

Normalization is a scaling technique in which values are shifted and rescaled so that they end up being between 0 and 1. It is also known as Min-Max scaling, and in scikit-learn it is applied with the MinMaxScaler() function. (Note that scikit-learn’s similarly named Normalizer() does something different: it rescales each sample, i.e. each row, to unit norm.)

Python

# Min-Max scale the data (fit_transform learns the min/max and applies the scaling)
min_max = preprocessing.MinMaxScaler()

X_norm = min_max.fit_transform(X)
X_norm

So, which one is better? Well, it depends on your data and the problem you’re trying to solve. Standardization is often a good choice when the data follows a normal distribution, and vice versa. If in doubt, try both and see which one improves the model.

Sklearn missing values


In scikit-learn we can use the sklearn.impute module to fill in missing values. The most used tools are SimpleImputer(), KNNImputer(), and IterativeImputer().
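Of the three, only KNNImputer() is not demonstrated in the examples below, so here is a minimal sketch of it (the numbers are made up):

Python

import numpy as np
from sklearn.impute import KNNImputer

# Fill each missing value with the mean of that feature
# taken from the 2 nearest-neighbor rows
imputer = KNNImputer(n_neighbors=2)
imputer.fit_transform([[1, 2], [3, 4], [np.nan, 6], [8, 8]])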

When you encounter a real-life dataset, it will almost certainly have missing values in it, which can be there for various reasons ranging from rage quits to bugs and mistakes.

There are several ways to treat them. One way is to delete the whole row (candidate) from the dataset, but this can be costly for small to average datasets, as you can delete plenty of data.

Some better ways would be to replace the missing values with the mean or median of the dataset. You could also try, if possible, to categorize your subjects into subcategories and take the mean/median of each subcategory as the new value.
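A minimal pandas sketch of that subcategory idea (the column names and values are made up for illustration):

Python

import numpy as np
import pandas as pd

# Hypothetical data: impute missing weights with the mean of each animal group
df = pd.DataFrame({'group': ['cat', 'cat', 'dog', 'dog'],
                   'weight': [4.0, np.nan, 30.0, np.nan]})
df['weight'] = df.groupby('group')['weight'].transform(lambda s: s.fillna(s.mean()))
print(df)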

Let’s use the SimpleImputer() to replace the missing value with the mean:

Python

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit_transform([[10,np.nan],[2,4],[10,9]])

The strategy hyperparameter can be changed to median, most_frequent, and constant. But Igor, can we impute missing strings? Yes, you can!

Python
Get 10-day Free Algo Trading Course
import pandas as pd

df = pd.DataFrame([['i', 'g'],
                   ['o', 'r'],
                   ['i', np.nan],
                   [np.nan, 'r']], dtype='category')

imputer = SimpleImputer(strategy='most_frequent')
imputer.fit_transform(df)

If you want to keep track of the missing values and the positions they were in, you
can use the MissingIndicator() function:

Python

from sklearn.impute import MissingIndicator

# Imagine the 3s were imputed by the SimpleImputer()
Y = np.array([[3,1],
              [5,3],
              [9,4],
              [3,7]])

missing = MissingIndicator(missing_values=3)
missing.fit_transform(Y)

The IterativeImputer() is fancy, as it basically goes across the features and uses the missing feature as the label and the other features as the inputs of a regression model. It then predicts the value of the label for the number of iterations we specify.

https://algotrading101.com/learn/sklearn-guide/ 9/26
11/12/23, 10:42 PM Sklearn - An Introduction Guide to Machine Learning - AlgoTrading101 Blog

If you’re not sure how regression algorithms work, don’t worry, as we will soon go over them. As the IterativeImputer() is an experimental feature, we will need to enable it before use:
"I have not read it, but I'm sure it is great." - My girlfriend

Python
Your first name
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

imputer = IterativeImputer(max_iter=15, random_state=42)
imputer.fit_transform(([1,5],[4,6],[2, np.nan], [np.nan, 8]))



Sklearn train test split
In Sklearn the data can be split into test and training groups by using the train_test_split() function, which is a part of the model_selection module.

But why do we need to split the data into two groups? Well, the training data is the data on which we fit our model, and the model learns from it. In order to evaluate how the model performs on unseen data, we use test data.

An important thing, in most cases, is to allocate more data to the training set. When speaking of the ratio of this allocation, there aren’t any hard rules; it all depends on the size of your dataset.

The most used allocation ratio is 80% for training and 20% for testing. Have in mind that most people actually use a training/development set split but name the dev set the test set, which is more of a conceptual mistake.

Now let us create a random dataset and split it into training and testing sets:


Python

from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

# Create a random dataset
X, y = make_blobs(n_samples=1500)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

print(f'X training set {X_train.shape}\nX testing set {X_test.shape}\ny training set {y_train.shape}\ny testing set {y_test.shape}')


If your dataset is big enough, you’ll often be fine with splitting the data this way. But some datasets come with a severe imbalance in them.

For example, if you’re building a model to detect outliers that default on their credit cards, you will most often have a very small percentage of them in your data.

This means that the train_test_split() function will most likely allocate too little
of the outliers to your training set and the ML algorithm won’t learn to detect
them efficiently. Let’s simulate a dataset like that:

Python

from sklearn.datasets import make_classification
from collections import Counter

# Create an imbalanced dataset
X, y = make_classification(n_samples=1000, weights=[0.95], flip_y=0,
                           random_state=42)
print(f'Number of y before splitting is {Counter(y)}')

# Split the data the usual way
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20,
                                                    random_state=42)
print(f'Number of y in the training set after splitting is {Counter(y_train)}')
print(f'Number of y in the testing set after splitting is {Counter(y_test)}')


As you can see, the training set has 43 examples of the minority class while the testing set has only 7! In order to combat this, we can split the data into training and testing sets by stratification, which is done according to y.
"I have not read it, but I'm sure it is great." - My girlfriend

This means that the y examples will be adequately stratified in both the training and testing sets (20% of y goes to the test set). In scikit-learn this is done by adding the stratify argument as shown below:

Python

# Split the data by stratification
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20,
                                                    stratify=y, random_state=42)
print(f'Number of y in the training set after splitting is {Counter(y_train)}')
print(f'Number of y in the testing set after splitting is {Counter(y_test)}')

For a more in-depth guide and understanding of the train test split and cross-
validation, please visit the following article that is found on our blog:
https://algotrading101.com/learn/train-test-split/

For more information about scikit-learn preprocessing functions go here.

Sklearn Regression – Predict the future


The regression method is used for prediction and forecasting, and in Sklearn it can be accessed through the linear_model module.

In regression tasks, we want to predict the outcome y given X. For example, imagine that we want to predict the price of a house (y) given features (X) like its age and number of rooms. The simplest regression model is linear regression.


Sklearn Linear Regression

The Sklearn Linear Regression model can be used by accessing the LinearRegression() function. The linear regression model assumes that the dependent variable (y) is a linear combination of the parameters (Xi). For two features, that means y = b0 + b1·x1 + b2·x2, where b0 is the intercept and b1, b2 are the coefficients.

Allow me to illustrate how linear regression works. Imagine that you were tasked to fit a red line so it resembles the trend of the data while minimizing the distance between each point, as shown below:

By eye-balling it, the result should look something like this:

Let’s import the sklearn boston house-price dataset so we can predict the median house value (MEDV) from the house’s age (AGE) and the number of rooms (RM).

Have in mind that this is known as multiple linear regression, as we are using two features.

Python

from sklearn import linear_model, datasets
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd

# Load the Boston dataset
# (note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2)
boston = datasets.load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
# Add the target variable (label)
df['MEDV'] = boston.target
df.head()

Now we will set our features (X) and the label (y). Notice how we use the numpy
np.c_ function that concatenates the data for us.


Python
Join over 10,000 future traders and get our 10-day "Know-What-To-Google" Algo Trading
import numpy as np

# Set the features and label
X = pd.DataFrame(np.c_[df['LSTAT'], df['RM']], columns = ['LSTAT','RM'])
y = df['MEDV']

Now we will split the data into training and test sets, which we learned how to do earlier:

Python

# Set the features and label
X = pd.DataFrame(np.c_[df['AGE'], df['RM']], columns = ['AGE','RM'])
y = df['MEDV']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20,
                                                    random_state=42)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

Let’s plot each of our features and see how they look. Try to imagine where the
regression line would go.

Python

# Plot the features
plt.scatter(X['RM'], y)

Python

plt.scatter(X['AGE'], y)

You can already see that the data is a bit messy. The RM feature appears more linear and is likely to correlate more strongly with the label, while the AGE feature shows the opposite. We also have outliers.

For this article, we won’t bother to clean up the data, as we’re not interested in creating a perfect model.


The next thing that we want to do is to fit our model and evaluate some of its core metrics:
Python

regressor = linear_model.LinearRegression()
model = regressor.fit(X_train, y_train)

print('Coefficient of determination:', model.score(X, y))
print('Intercept:', model.intercept_)
print('slope:', model.coef_)


Coefficient of determination: 0.529269171356878
Intercept: -28.203538066489102
slope: [-0.06640957 8.7957305 ]

The coefficient of determination (R2) tells us how much of the variance, in our case the variance of the median house value, our model explains. As we see, it explains 53% of the variance, which is okay.

For the brevity of the article, we won’t go into the math now, but feel free to look up the in-depth explanation behind the formula. You don’t need to know it in order to use the regression, though that’s not to say you shouldn’t.

The .intercept_ shows the bias b0, while the .coef_ is an array that contains our
b1 and b2. In our case, the intercept is –28.20 and it represents the value of the
predicted response when X1 = X2 = 0.

When we look at the slope, we can see that an increase of X1 (AGE) by 1 lowers the median house price by 0.06, while an increase of X2 (RM) by 1 raises the dependent variable by 8.79.
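To make this concrete, a prediction for a hypothetical house (AGE = 50 and RM = 6 are made-up inputs) can be computed by hand from the fitted numbers above:

Python

# Manual prediction: y = b0 + b1*AGE + b2*RM, using the intercept and slope above
age, rm = 50, 6                     # hypothetical house
b0 = -28.203538066489102            # intercept
b1, b2 = -0.06640957, 8.7957305     # coefficients for AGE and RM
print(b0 + b1 * age + b2 * rm)      # ~21.25, i.e. a predicted MEDV of about 21.25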

Let’s see how good our regression line predictions were:


Python

# Age regression line
plt.plot(X['AGE'], y, 'o')
# Fit a simple one-feature line with np.polyfit, just for plotting purposes
model.coef_[0], model.intercept_ = np.polyfit(X['AGE'], y, 1)
plt.plot(X['AGE'], model.coef_[0]*X['AGE']+model.intercept_, color='red')


Python

# Room number regression line
plt.plot(X['RM'], y, 'o')
model.coef_[0], model.intercept_ = np.polyfit(X['RM'], y, 1)
plt.plot(X['RM'], model.coef_[0]*X['RM']+model.intercept_, color='red')

Now, let us predict some data and use a sklearn metric that will tell us how the
model is performing:

Python

y_test_predict = regressor.predict(X_test)
print('predicted response:', y_test_predict, sep='\n')

Python

from sklearn.metrics import mean_squared_error

rmse = (np.sqrt(mean_squared_error(y_test, y_test_predict)))
print(rmse)

6.315423538049165

Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). Residuals are a measure of how far from the regression line the data points are; RMSE tells us how concentrated the data is around the regression line.

In our case, the RMSE is higher than we’d like. I’ll task you to try out other features (LSTAT and RM) and lower the RMSE. What happens when you use those two or more? Which features make the most sense to use?

Feel free to play around and check the Full code section to see some guidelines.
Other Sklearn regression models
There are various regression models that may be more useful and fit the data better than simple linear regression; these are the Lasso, Elastic-Net, Ridge, Polynomial, and Bayesian regression models.

For more information about them go here.

Sklearn Classification – Did I just see a cat?
A classification problem in ML involves teaching a machine how to group data together to match the specified criteria. The most popular models in Sklearn come from the tree module.

Every day you perform classification. For example, when you go to a grocery store
you can easily group different foods by their food group (fruit, meat, grain, etc.).

When it comes to more complex decisions in the fields of medicine, trading, and
politics, we’d like some good ML algorithms to aid our decision-making process.

Sklearn Decision Tree Classifier


In Sklearn, the Decision Tree classifier can be accessed by using the DecisionTreeClassifier() function, which is a part of the tree module.

The main goal of a Decision Tree algorithm is to predict the value of the target
variable (label) by learning simple decision rules deduced from the data features.
For example, look at my simple decision tree below:

Here are some main characteristics of a Decision Tree Classifier:

It is made out of Nodes and Branches
Branches connect Nodes
The top Node is called the Root Node (“Go outside”)


A Node from which new Nodes arise is called a Parent Node (i.e. the “Is it raining?” Node)
A Node without a Child Node is called a Leaf Node (i.e. the “Classic programmer” Node)



The good thing about a Decision Tree Classifier is that it is easy to visualize and interpret. It also requires little to no data preparation. The bad thing about it is that minor changes in the data can change it considerably.

For a more in-depth understanding of its pros and cons go here.
Now, let’s create a decision tree on the popular iris dataset. The dataset is made out of 3 plant species, and we’ll want our tree to help us decide to which species a plant belongs according to its petal/sepal width and length.

Python

from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
import graphviz

# Obtain the data and fit the model
X, y = load_iris(return_X_y=True)
dtc = DecisionTreeClassifier()
dtc = dtc.fit(X, y)

# Graph the Tree
iris = load_iris()
dot_data = tree.export_graphviz(dtc, out_file=None,
                                feature_names=iris.feature_names,
                                class_names=iris.target_names,
                                filled=True, rounded=True,
                                special_characters=True)
graph = graphviz.Source(dot_data)
graph

Take note that “Gini” measures impurity. A node is “pure” when it has 0 Gini
which happens when all training instances it applies to belong to the same class.
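For intuition, Gini impurity for a node is 1 - Σ pₖ², where pₖ is the fraction of training instances of class k in the node; here is a quick sketch with made-up class counts:

Python

# Gini impurity: 1 - sum(p_k^2); 0 means the node is pure
def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(gini([50, 0, 0]))  # 0.0 -> pure node
print(gini([25, 25]))    # 0.5 -> maximally mixed two-class node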


Have in mind that all algorithms have their hyperparameters, which can be tuned to result in a better model. For example, you can set the Decision Tree to only go to a certain depth, to have a certain allowed number of leaves, etc.

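As an illustrative sketch, constraining the tree from the previous example might look like this (the specific values are arbitrary, not tuned recommendations):

Python

# A shallower, constrained tree; max_depth and max_leaf_nodes are illustrative values
dtc_small = DecisionTreeClassifier(max_depth=3, max_leaf_nodes=8, random_state=42)
dtc_small.fit(X, y)
print(f'Tree depth: {dtc_small.get_depth()}, leaves: {dtc_small.get_n_leaves()}')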
To see what the default hyperparameters of your untouched Decision Tree Classifier are and what each of them does, please visit the scikit-learn documentation.

Other Sklearn classification models

Depending on the problem and your data, you might want to try out other classification algorithms that Sklearn has to offer, for example SVC, Random Forest, AdaBoost, GaussianNB, or KNeighborsClassifier.

If you want to see how they compare to each other go here.

Sklearn Clustering – Create groups of similar data
Clustering is an unsupervised machine learning problem where the algorithm needs to find relevant patterns in unlabeled data. In Sklearn these methods can be accessed via the sklearn.cluster module.

Below you can see an example of the clustering method:

Sklearn DBSCAN
In Sklearn, the DBSCAN clustering model can be utilized by using the DBSCAN() function, which is a part of the cluster module.

DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. As the model doesn’t require clusters to be convex (unlike, say, K-Means), it is mostly used when the clusters can be of any shape or size.


The DBSCAN algorithm finds clusters by looking for areas of high density that are separated by areas of low density. The algorithm has two main parameters: min_samples and eps.
"I have not read it, but I'm sure it is great." - My girlfriend

A high min_samples and a low eps indicate that a higher density is needed in order to create a cluster. The min_samples parameter controls how sensitive the algorithm is towards noise (higher values mean that it is less sensitive).

On the other hand, the eps parameter controls the local neighborhood of the points. If it is too high, all data will end up in one big cluster; if it is too low, most data points will be left unclustered as noise.

Enough theorizing, let’s jump to the coding part! We will generate some data and
fit the DBSCAN clustering algorithm on it. We will also play a bit with its
parameters.

Let’s import the libraries we need, create the data, scale it and fit the model:

Python

from sklearn.datasets import make_circles
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Make the data and scale it
X, y = make_circles(n_samples=800, factor=0.3, noise=0.1, random_state=42)
X = StandardScaler().fit_transform(X)

# Fit the algorithm
y_predicted = DBSCAN(eps=0.35, min_samples=10).fit_predict(X)

Now, let’s see how our model performed:


Python

# Visualize the data
plt.scatter(X[:,0], X[:,1], c=y_predicted)

Here we can easily spot two clusters; they even resemble an eye (I’m tempted to change the colors to make it look like the Eye of Sauron). All models have their performance metrics, so let’s check out the main ones.

Python

# Evaluation Metrics
print('Number of clusters: {}'.format(len(set(y_predicted[np.where(y_predicted != -1)]))))
print('Homogeneity: {}'.format(metrics.homogeneity_score(y, y_predicted)))
print('Completeness: {}'.format(metrics.completeness_score(y, y_predicted)))

Number of clusters: 2
Homogeneity: 1.0000000000000007
Completeness: 0.9691231370370732

What would happen if we changed the eps value to 0.4?

Python

y_predicted = DBSCAN(eps=0.4, min_samples=10).fit_predict(X)
plt.scatter(X[:,0], X[:,1], c='orangered')
plt.title('I see you!')

Couldn’t resist.

For a more hands-on experience in solving problems with clustering, check out
our article on finding trading pairs for the pairs trading strategy with machine
learning.


Other Sklearn clustering models
Depending on the clustering problem, you might want to use other clustering algorithms; the most popular ones are K-Means, Hierarchical, Affinity Propagation, and Gaussian mixture clustering.

If you want to learn the in-depth theory behind clustering and get introduced to various models and the math behind them, go here.

Sklearn Dimensionality Reduction – Reducing random variables
Dimensionality reduction is a method where we want to shrink the size of the data while preserving the most important information in it. In Sklearn these methods can be accessed via the decomposition module.

As humans, we usually think in 4 dimensions (if you count time as one) up to a maximum of 6-7 if you are a quantum physicist. Data can easily go beyond that, and we need to reduce it to lower dimensions so it can be observed.

Sklearn PCA
PCA (Principal Component Analysis) is a linear technique for dimensionality reduction. It basically performs a linear mapping of the data to a lower dimension while maximizing the preserved variance of the data.

PCA can be used for an easier visualization of data and as a preprocessing step to
speed up the performance of other machine learning algorithms. Let’s go back to
our iris dataset and make a 2d visualization from its 4d structure.

Firstly, we will load the required libraries, obtain the dataset, scale the data and
check how many dimensions we have:


Python

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline

# Load the data and scale it
X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)
print(f'The number of dimensions in X is {X.shape[1]}')

The number of dimensions in X is 4

Now we will set our PCA and fit it to the data:

Python

# Load PCA and specify the number of dimensions aka components
pca = PCA(n_components=2)
pc = pca.fit_transform(X)
print(f'The number of reduced dimensions is {pc.shape[1]}')

The number of reduced dimensions is 2
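To check how much of the original variance those two components preserve, you can inspect the fitted PCA’s explained_variance_ratio_ attribute:

Python

# Fraction of the original variance captured by each principal component
print(pca.explained_variance_ratio_)
print(f'Total preserved variance: {pca.explained_variance_ratio_.sum():.2%}')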

Let’s store the data into a pandas data frame and recode the numerical target
features to categorical:

Python

# Put the data into a pandas data frame
df = pd.DataFrame(data = pc, columns = ['pc_1', 'pc_2'])
df['target'] = y
df.head()


Python

# Recode the numerical data to categorical
def recoding(data):
    if data == 0:
        return 'iris-setosa'
    elif data == 1:
        return 'iris-versicolor'
    else:
        return 'iris-virginica'

df['target'] = df['target'].apply(recoding)
df.head()

And now for the finale, we plot the data:

Python

# Plot the data
fig = plt.figure(figsize = (12,10))
ax = fig.add_subplot(1,1,1)
ax.set_xlabel('Principal Component 1', fontsize = 17)
ax.set_ylabel('Principal Component 2', fontsize = 17)
ax.set_title('2 component PCA', fontsize = 20)
targets = ['iris-setosa', 'iris-versicolor', 'iris-virginica']
colors = ['r', 'g', 'b']
for target, color in zip(targets, colors):
    indicesToKeep = df['target'] == target
    ax.scatter(df.loc[indicesToKeep, 'pc_1'],
               df.loc[indicesToKeep, 'pc_2'],
               c = color,
               s = 50)
ax.legend(targets)
ax.grid()

As you can see, we basically compressed the 4d data into a 2d observable one. In
this case, we can say that the algorithm discovered the petals and sepals because
we had the width and length of both.


Other Sklearn Dimensionality Reduction models
There are other Dimensionality Reduction models in Sklearn that you might prefer for certain problems, such as ICA, IPCA, NMF, LDA, Factor Analysis, and more.

For a more in-depth look go here.


What are the 3 Common Machine Learning Analysis/Testing Mistakes?
When you run your analysis, there are 3 common mistakes to take note of:

Overfitting
Look-ahead Bias
P-hacking

Do check out this lecture PDF to learn more: 3 Big Mistakes of Backtesting – 1)
Overfitting 2) Look-Ahead Bias 3) P-Hacking

Full Code
GitHub Link

Igor Radovanovic
