Sklearn – An Introduction Guide to Machine Learning

· 17 min read


Last Updated on April 3, 2023



Table of Contents
1. What is Sklearn?
2. What is Sklearn used for?
3. How to download Sklearn for Python?
4. How to pick the best scikit-learn model?
5. Sklearn preprocessing – Prepare the data for analysis
1. Sklearn feature encoding
2. Sklearn data scaling
3. Sklearn missing values
4. Sklearn train test split
6. Sklearn Regression – Predict the future
1. Sklearn Linear Regression
2. Other Sklearn regression models
7. Sklearn Classification – Did I just see a cat?
1. Sklearn Decision Tree Classifier
2. Other Sklearn classification models
8. Sklearn Clustering – Create groups of similar data
1. Sklearn DBSCAN
2. Other Sklearn clustering models
9. Sklearn Dimensionality Reduction – Reducing random variables
1. Sklearn PCA
2. Other Sklearn Dimensionality Reduction models
10. What are the 3 Common Machine Learning Analysis/Testing Mistakes?
11. Full code


What is Sklearn?
Sklearn (scikit-learn) is a Python library that provides a wide range of unsupervised and supervised machine learning algorithms.

It is also one of the most used machine learning libraries and is built on top of SciPy.

Link: https://scikit-learn.org/stable/

What is Sklearn used for?


The Sklearn library is mainly used for modeling data, and it provides efficient tools that are easy to use for any kind of predictive data analysis.

The main use cases of this library can be grouped into the following six categories:

Preprocessing
Regression
Classification
Clustering
Model Selection
Dimensionality Reduction

As this article is mainly aimed at beginners, we will stick to the core concepts of
each category and explore some of its most popular features and algorithms.

Advanced readers can use this article as a refresher on some of the main use cases and the intuition behind popular sklearn features that most ML practitioners couldn’t live without.

Each category will be explained in a beginner-friendly and illustrative way, followed by the most used models, the intuition behind them, and hands-on experience. But first, we need to set up our sklearn library.


How to download Sklearn for Python?

Sklearn can be obtained in Python by using the pip install function as shown below:



Python

$ pip install -U scikit-learn



Sklearn developers strongly advise using a virtual environment (venv) or a conda environment when working with the library, as it helps to avoid potential conflicts with other packages.
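For example, a minimal setup with Python’s built-in venv module might look like this (the environment name sklearn-env is arbitrary):

Python

# Create and activate a virtual environment, then install scikit-learn into it
$ python -m venv sklearn-env
$ source sklearn-env/bin/activate  # on Windows: sklearn-env\Scripts\activate
$ pip install -U scikit-learn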

How to pick the best Sklearn model?


When it comes to picking the best Sklearn model, there are many factors that come into play, ranging from experience and data to the problem scope and the math behind each algorithm.

Sometimes all of the candidate algorithms can have similar results and, depending on the problem setting, you will need to pick the one that is the fastest or the one that generalizes best to big data.

It may happen that none of your chosen models performs well enough and that you will simply need to combine multiple models (e.g. ensemble), build your own custom-made model, or go for a deep learning approach.

As picking the right model is one of the foundations of your problem solving, it is wise to read up on as many models and their uses as you can.

As model selection would be an article, or even a book, of its own, I’ll only provide some rough guidelines in the form of questions that you’ll need to ask yourself when deciding which model to deploy.

How much data do you have?


Some models are better on smaller datasets while others require more data and tend to generalize better on larger datasets (e.g. SGD Regressor vs Lasso Regression).
What are the main characteristics of your data?



Is your data linear, quadratic, or all over the place? What do your distributions look like? Is your data made up of numbers or strings? Is the data labeled?
What kind of problem are you solving?

Are you trying to predict which cat will push the most jars off the table, whether that is a dog or a cat, or which dog breeds make up a group of dogs?

All of these questions have different approaches and solutions. Thus, later in the article, we will explore the three main problem classifications:

regression
classification
clustering

How do your models perform when compared against each other?

You will see that scikit-learn comes equipped with functions that allow us to
inspect each model on several characteristics and compare it to the other ones.

Take note that scikit-learn has created a good algorithm cheat-sheet that aids you in your model selection; I’d advise keeping it near you at those troubling times.

Sklearn preprocessing – Prepare the data for analysis
When you think of data, you probably have in mind a ginormous Excel spreadsheet full of rows and columns with numbers in them. In reality, data can come in a plethora of formats like images, videos, and audio.


The main job of data preprocessing is to turn this data into a readable format for our algorithm. A machine can’t just “listen in” on an audiotape to learn voice recognition; rather, it needs the audio to be converted into numbers.

The main building blocks of our dataset are called features, which can be categorical or numerical. Simply put, categorical data is used to group data with similar characteristics, while numerical data provides information with numbers.

As the features come from two different categories, they need to be treated (preprocessed) in different ways. The best way to learn is to start coding along with me.

Sklearn feature encoding


Feature encoding is a method where we transform categorical variables into numerical ones. The most popular ways of doing so are known as One Hot Encoding and Label Encoding.

For example, a person can have features such as [“male”, “female”], [“from US”, “from UK”], [“uses Binance”, “uses Coinbase”]. These features can be encoded as numbers, e.g. [“male”, “from US”, “uses Coinbase”] would be [0, 0, 1].

This can be done by using the scikit-learn OrdinalEncoder() function as follows:

Python

from sklearn import preprocessing

X = [['male', 'from US', 'uses Coinbase'], ['female', 'from UK', 'uses Binance']]
encode = preprocessing.OrdinalEncoder()
encode.fit(X)

encode.transform([['male', 'from UK', 'uses Coinbase']])

Output: array([[1., 0., 1.]])


As you can see, it transformed the features into integers. But they are not continuous and can’t be used directly with all scikit-learn estimators. In order to fix this, a popular and widely used method is one hot encoding.

One hot encoding, also known as dummy encoding, can be obtained through the scikit-learn OneHotEncoder() function. It works by transforming each category with N possible values into N binary features, where one category is represented as 1 and the rest as 0.

The following example will hopefully make it clear:

Python

one_hot = preprocessing.OneHotEncoder()
one_hot.fit(X)

one_hot.transform([['male', 'from UK', 'uses Coinbase'],
                   ['female', 'from US', 'uses Binance']]).toarray()

Output: array([[0., 1., 1., 0., 0., 1.],
               [1., 0., 0., 1., 1., 0.]])

To see what your encoded features are exactly you can always use the
.categories_ attribute as shown below:

Python

one_hot.categories_

Output: [array(['female', 'male'], dtype=object),
         array(['from UK', 'from US'], dtype=object),
         array(['uses Binance', 'uses Coinbase'], dtype=object)]

Sklearn data scaling


Feature scaling is a preprocessing method used to normalize data, as it can improve the performance of some machine learning models. The two most common scaling techniques are known as standardization and normalization.


Standardization makes the values of each feature in the data have zero mean and unit variance. This method is commonly used with algorithms such as SVMs and Logistic Regression.
"I have not read it, but I'm sure it is great." - My girlfriend

Standardization is done by subtracting the mean from each feature and dividing it by the standard deviation. It’s some basic statistics and math, but don’t worry if you don’t get it; there are many tutorials that cover it.
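As a quick sketch of the idea with made-up numbers, standardizing a feature x computes z = (x - mean) / std:

Python

import numpy as np

# Standardize a toy feature by hand: z = (x - mean) / std
x = np.array([10.0, 20.0, 30.0])
z = (x - x.mean()) / x.std()
print(z)         # [-1.22474487  0.          1.22474487]
print(z.mean())  # 0.0 (up to floating-point precision)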

In scikit-learn we use the StandardScaler() function to standardize the data. Let us create a random NumPy array and standardize the data by giving it zero mean and unit variance.

Python

import numpy as np

scaler = preprocessing.StandardScaler()
X = np.random.rand(3,4)
X

Python

X_scaled = scaler.fit_transform(X)
X_scaled

Python

print(f'The scaled mean is: {X_scaled.mean(axis=0)}\nThe scaled variance is: {X_scaled.std(axis=0)}')

Wait a second! Didn’t you say that all mean values need to be 0?

Well, in practice these values are so close to 0 that they can be viewed as zero. Due to the limitations of floating-point representation, the scaler can only get the mean really close to zero.


Let’s move on to the next scaling method, called normalization. Normalization is a term with many definitions that change from one field to another, so we are going to define it as follows:
"I have not read it, but I'm sure it is great." - My girlfriend

Normalization is a scaling technique in which values are shifted and rescaled so that they end up being between 0 and 1. It is also known as Min-Max scaling, and in scikit-learn it is applied with the MinMaxScaler() function. (Note that scikit-learn’s similarly named Normalizer() does something different: it rescales each sample, i.e. each row, to unit norm.)

Python

# Min-Max scale the data (fit_transform learns the min/max and applies the scaling)
min_max = preprocessing.MinMaxScaler()

X_norm = min_max.fit_transform(X)
X_norm

So, which one is better? Well, it depends on your data and the problem you’re trying to solve. Standardization is often a good choice when the data follows a normal distribution, and vice versa. If in doubt, try both and see which one improves the model.

Sklearn missing values


In scikit-learn we can use the sklearn.impute module to fill in missing values. The most used tools are SimpleImputer(), KNNImputer(), and IterativeImputer().
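Of the three, only KNNImputer() is not demonstrated in the examples below, so here is a minimal sketch of it (the numbers are made up):

Python

import numpy as np
from sklearn.impute import KNNImputer

# Fill each missing value with the mean of that feature
# taken from the 2 nearest-neighbor rows
imputer = KNNImputer(n_neighbors=2)
imputer.fit_transform([[1, 2], [3, 4], [np.nan, 6], [8, 8]])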

When you encounter a real-life dataset, it will almost certainly have missing values in it, which can be there for various reasons ranging from rage quits to bugs and mistakes.

There are several ways to treat them. One way is to delete the whole row (candidate) from the dataset, but this can be costly for small to average datasets, as you can delete plenty of data.

Some better ways would be to replace the missing values with the mean or median of the dataset. You could also try, if possible, to categorize your subjects into subcategories and take the mean/median of each subcategory as the new value.
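A minimal pandas sketch of that subcategory idea (the column names and values are made up for illustration):

Python

import numpy as np
import pandas as pd

# Hypothetical data: impute missing weights with the mean of each animal group
df = pd.DataFrame({'group': ['cat', 'cat', 'dog', 'dog'],
                   'weight': [4.0, np.nan, 30.0, np.nan]})
df['weight'] = df.groupby('group')['weight'].transform(lambda s: s.fillna(s.mean()))
print(df)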

Let’s use the SimpleImputer() to replace the missing value with the mean:

Python

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit_transform([[10,np.nan],[2,4],[10,9]])

The strategy hyperparameter can be changed to median, most_frequent, and constant. But Igor, can we impute missing strings? Yes, you can!

Python
Get 10-day Free Algo Trading Course
import pandas as pd

df = pd.DataFrame([['i', 'g'],
                   ['o', 'r'],
                   ['i', np.nan],
                   [np.nan, 'r']], dtype='category')

imputer = SimpleImputer(strategy='most_frequent')
imputer.fit_transform(df)

If you want to keep track of the missing values and the positions they were in, you
can use the MissingIndicator() function:

Python

from sklearn.impute import MissingIndicator

# Imagine the 3s were imputed by the SimpleImputer()
Y = np.array([[3,1],
              [5,3],
              [9,4],
              [3,7]])

missing = MissingIndicator(missing_values=3)
missing.fit_transform(Y)

The IterativeImputer() is fancy, as it basically goes across the features and uses the missing feature as the label and the other features as the inputs of a regression model. It then predicts the value of the label for the number of iterations we specify.

https://algotrading101.com/learn/sklearn-guide/ 9/26
11/12/23, 10:42 PM Sklearn - An Introduction Guide to Machine Learning - AlgoTrading101 Blog

If you’re not sure how regression algorithms work, don’t worry, as we will soon go over them. As the IterativeImputer() is an experimental feature, we will need to enable it before use:
"I have not read it, but I'm sure it is great." - My girlfriend

Python
Your first name
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

imputer = IterativeImputer(max_iter=15, random_state=42)
imputer.fit_transform(([1,5],[4,6],[2, np.nan], [np.nan, 8]))



Sklearn train test split
In Sklearn the data can be split into test and training groups by using the train_test_split() function, which is a part of the model_selection module.

But why do we need to split the data into two groups? Well, the training data is the data on which we fit our model, and the model learns from it. In order to evaluate how the model performs on unseen data, we use test data.

An important thing, in most cases, is to allocate more data to the training set. When speaking of the ratio of this allocation, there aren’t any hard rules; it all depends on the size of your dataset.

The most used allocation ratio is 80% for training and 20% for testing. Have in mind that most people actually use a training/development set split but name the dev set the test set, which is more of a conceptual mistake.

Now let us create a random dataset and split it into training and testing sets:


Python

from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

# Create a random dataset
X, y = make_blobs(n_samples=1500)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

print(f'X training set {X_train.shape}\nX testing set {X_test.shape}\ny training set {y_train.shape}\ny testing set {y_test.shape}')


If your dataset is big enough, you’ll often be fine with splitting the data this way. But some datasets come with a severe imbalance in them.

For example, if you’re building a model to detect outliers that default on their credit cards, you will most often have a very small percentage of them in your data.

This means that the train_test_split() function will most likely allocate too little
of the outliers to your training set and the ML algorithm won’t learn to detect
them efficiently. Let’s simulate a dataset like that:

Python

from sklearn.datasets import make_classification
from collections import Counter

# Create an imbalanced dataset
X, y = make_classification(n_samples=1000, weights=[0.95], flip_y=0,
                           random_state=42)
print(f'Number of y before splitting is {Counter(y)}')

# Split the data the usual way
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20,
                                                    random_state=42)
print(f'Number of y in the training set after splitting is {Counter(y_train)}')
print(f'Number of y in the testing set after splitting is {Counter(y_test)}')


As you can see, the training set has 43 examples of the minority class while the testing set has only 7! In order to combat this, we can split the data into training and testing sets by stratification, which is done according to y.
"I have not read it, but I'm sure it is great." - My girlfriend

This means that the y examples will be adequately stratified in both the training and testing sets (20% of y goes to the test set). In scikit-learn this is done by adding the stratify argument as shown below:

Python

# Split the data by stratification
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20,
                                                    stratify=y, random_state=42)
print(f'Number of y in the training set after splitting is {Counter(y_train)}')
print(f'Number of y in the testing set after splitting is {Counter(y_test)}')

For a more in-depth guide and understanding of the train test split and cross-
validation, please visit the following article that is found on our blog:
https://algotrading101.com/learn/train-test-split/

For more information about scikit-learn preprocessing functions go here.

Sklearn Regression – Predict the future


The regression method is used for prediction and forecasting, and in Sklearn it can be accessed through the linear_model module.

In regression tasks, we want to predict the outcome y given X. For example, imagine that we want to predict the price of a house (y) given features (X) like its age and number of rooms. The simplest regression model is linear regression.


Sklearn Linear Regression

The Sklearn Linear Regression model can be used by accessing the LinearRegression() function. The linear regression model assumes that the dependent variable (y) is a linear combination of the parameters (Xi). For two features, that means y = b0 + b1·x1 + b2·x2, where b0 is the intercept and b1, b2 are the coefficients.

Allow me to illustrate how linear regression works. Imagine that you were tasked to fit a red line so it resembles the trend of the data while minimizing the distance between each point, as shown below:

By eye-balling it, the result should look something like this:

Let’s import the sklearn boston house-price dataset so we can predict the median house value (MEDV) from the house’s age (AGE) and the number of rooms (RM).

Have in mind that this is known as multiple linear regression, as we are using two features.

Python

from sklearn import linear_model, datasets
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd

# Load the Boston dataset
# (note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2)
boston = datasets.load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
# Add the target variable (label)
df['MEDV'] = boston.target
df.head()

Now we will set our features (X) and the label (y). Notice how we use the numpy
np.c_ function that concatenates the data for us.


Python
Join over 10,000 future traders and get our 10-day "Know-What-To-Google" Algo Trading
import numpy as np

# Set the features and label
X = pd.DataFrame(np.c_[df['LSTAT'], df['RM']], columns = ['LSTAT','RM'])
y = df['MEDV']

Now we will split the data into training and test sets, which we learned how to do earlier:

Python

# Set the features and label
X = pd.DataFrame(np.c_[df['AGE'], df['RM']], columns = ['AGE','RM'])
y = df['MEDV']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20,
                                                    random_state=42)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

Let’s plot each of our features and see how they look. Try to imagine where the
regression line would go.

Python

# Plot the features
plt.scatter(X['RM'], y)

Python

plt.scatter(X['AGE'], y)

You can already see that the data is a bit messy. The RM feature appears more linear and is likely to correlate more strongly with the label, while the AGE feature shows the opposite. We also have outliers.

For this article, we won’t bother to clean up the data, as we’re not interested in creating a perfect model.


The next thing that we want to do is to fit our model and evaluate some of its core metrics:
Python

regressor = linear_model.LinearRegression()
model = regressor.fit(X_train, y_train)

print('Coefficient of determination:', model.score(X, y))
print('Intercept:', model.intercept_)
print('slope:', model.coef_)


Coefficient of determination: 0.529269171356878
Intercept: -28.203538066489102
slope: [-0.06640957 8.7957305 ]

The coefficient of determination (R2) tells us how much of the variance, in our case the variance of the median house value, our model explains. As we see, it explains 53% of the variance, which is okay.

For the brevity of the article, we won’t go into the math now, but feel free to look up the in-depth explanation behind the formula. You don’t need to know it in order to use the regression, though that’s not to say you shouldn’t.

The .intercept_ shows the bias b0, while the .coef_ is an array that contains our
b1 and b2. In our case, the intercept is –28.20 and it represents the value of the
predicted response when X1 = X2 = 0.

When we look at the slope, we can see that an increase of X1 (AGE) by 1 lowers the median house price by 0.06, while an increase of X2 (RM) by 1 raises the dependent variable by 8.79.
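To make this concrete, a prediction for a hypothetical house (AGE = 50 and RM = 6 are made-up inputs) can be computed by hand from the fitted numbers above:

Python

# Manual prediction: y = b0 + b1*AGE + b2*RM, using the intercept and slope above
age, rm = 50, 6                     # hypothetical house
b0 = -28.203538066489102            # intercept
b1, b2 = -0.06640957, 8.7957305     # coefficients for AGE and RM
print(b0 + b1 * age + b2 * rm)      # ~21.25, i.e. a predicted MEDV of about 21.25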

Let’s see how good our regression line predictions were:


Python

# Age regression line
plt.plot(X['AGE'], y, 'o')
# Fit a simple one-feature line with np.polyfit, just for plotting purposes
model.coef_[0], model.intercept_ = np.polyfit(X['AGE'], y, 1)
plt.plot(X['AGE'], model.coef_[0]*X['AGE']+model.intercept_, color='red')


Python

# Room number regression line
plt.plot(X['RM'], y, 'o')
model.coef_[0], model.intercept_ = np.polyfit(X['RM'], y, 1)
plt.plot(X['RM'], model.coef_[0]*X['RM']+model.intercept_, color='red')

Now, let us predict some data and use a sklearn metric that will tell us how the
model is performing:

Python

y_test_predict = regressor.predict(X_test)
print('predicted response:', y_test_predict, sep='\n')

Python

from sklearn.metrics import mean_squared_error

rmse = (np.sqrt(mean_squared_error(y_test, y_test_predict)))
print(rmse)

6.315423538049165

Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). Residuals are a measure of how far from the regression line the data points are; RMSE tells us how concentrated the data is around the regression line.

In our case, the RMSE is higher than we’d like. I’ll task you to try out other features (LSTAT and RM) and lower the RMSE. What happens when you use those two or more? Which features make the most sense to use?

Feel free to play around and check the Full code section to see some guidelines.
Other Sklearn regression models
There are various regression models that may be more useful and fit the data better than simple linear regression; these are the Lasso, Elastic-Net, Ridge, Polynomial, and Bayesian regression models.

For more information about them go here.

Sklearn Classification – Did I just see a cat?
A classification problem in ML involves teaching a machine how to group data together to match the specified criteria. The most popular models in Sklearn come from the tree module.

Every day you perform classification. For example, when you go to a grocery store
you can easily group different foods by their food group (fruit, meat, grain, etc.).

When it comes to more complex decisions in the fields of medicine, trading, and
politics, we’d like some good ML algorithms to aid our decision-making process.

Sklearn Decision Tree Classifier


In Sklearn, the Decision Tree classifier can be accessed by using the DecisionTreeClassifier() function, which is a part of the tree module.

The main goal of a Decision Tree algorithm is to predict the value of the target
variable (label) by learning simple decision rules deduced from the data features.
For example, look at my simple decision tree below:

Here are some main characteristics of a Decision Tree Classifier:

It is made out of Nodes and Branches
Branches connect Nodes
The top Node is called the Root Node (“Go outside”)


A Node from which new Nodes arise is called a Parent Node (i.e. the “Is it raining?” Node)
A Node without a Child Node is called a Leaf Node (i.e. the “Classic programmer” Node)



The good thing about a Decision Tree Classifier is that it is easy to visualize and interpret. It also requires little to no data preparation. The bad thing about it is that minor changes in the data can change it considerably.

For a more in-depth understanding of its pros and cons go here.
Now, let’s create a decision tree on the popular iris dataset. The dataset is made out of 3 plant species, and we’ll want our tree to help us decide to which species a plant belongs according to its petal/sepal width and length.

Python

from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
import graphviz

# Obtain the data and fit the model
X, y = load_iris(return_X_y=True)
dtc = DecisionTreeClassifier()
dtc = dtc.fit(X, y)

# Graph the Tree
iris = load_iris()
dot_data = tree.export_graphviz(dtc, out_file=None,
                                feature_names=iris.feature_names,
                                class_names=iris.target_names,
                                filled=True, rounded=True,
                                special_characters=True)
graph = graphviz.Source(dot_data)
graph

Take note that “Gini” measures impurity. A node is “pure” when it has 0 Gini
which happens when all training instances it applies to belong to the same class.
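For intuition, Gini impurity for a node is 1 - Σ pₖ², where pₖ is the fraction of training instances of class k in the node; here is a quick sketch with made-up class counts:

Python

# Gini impurity: 1 - sum(p_k^2); 0 means the node is pure
def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(gini([50, 0, 0]))  # 0.0 -> pure node
print(gini([25, 25]))    # 0.5 -> maximally mixed two-class node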


Have in mind that all algorithms have their hyperparameters, which can be tuned to result in a better model. For example, you can set the Decision Tree to only go to a certain depth, to have a certain allowed number of leaves, etc.

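As an illustrative sketch, constraining the tree from the previous example might look like this (the specific values are arbitrary, not tuned recommendations):

Python

# A shallower, constrained tree; max_depth and max_leaf_nodes are illustrative values
dtc_small = DecisionTreeClassifier(max_depth=3, max_leaf_nodes=8, random_state=42)
dtc_small.fit(X, y)
print(f'Tree depth: {dtc_small.get_depth()}, leaves: {dtc_small.get_n_leaves()}')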
To see what the default hyperparameters of your untouched Decision Tree Classifier are and what each of them does, please visit the scikit-learn documentation.

Other Sklearn classification models

Depending on the problem and your data, you might want to try out other classification algorithms that Sklearn has to offer, for example SVC, Random Forest, AdaBoost, GaussianNB, or KNeighborsClassifier.

If you want to see how they compare to each other go here.

Sklearn Clustering – Create groups of similar data
Clustering is an unsupervised machine learning problem where the algorithm needs to find relevant patterns in unlabeled data. In Sklearn these methods can be accessed via the sklearn.cluster module.

Below you can see an example of the clustering method:

Sklearn DBSCAN
In Sklearn, the DBSCAN clustering model can be utilized by using the DBSCAN() function, which is a part of the cluster module.

DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. As the model doesn’t require clusters to be convex (unlike, say, K-Means), it is mostly used when the clusters can be of any shape or size.


The DBSCAN algorithm finds clusters by looking for areas of high density that are separated by areas of low density. The algorithm has two main parameters: min_samples and eps.
"I have not read it, but I'm sure it is great." - My girlfriend

A high min_samples and a low eps indicate that a higher density is needed in order to create a cluster. The min_samples parameter controls how sensitive the algorithm is towards noise (higher values mean that it is less sensitive).

On the other hand, the eps parameter controls the local neighborhood of the points. If it is too high, all data will end up in one big cluster; if it is too low, most data points will be left unclustered as noise.

Enough theorizing, let’s jump to the coding part! We will generate some data and
fit the DBSCAN clustering algorithm on it. We will also play a bit with its
parameters.

Let’s import the libraries we need, create the data, scale it and fit the model:

Python

from sklearn.datasets import make_circles
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Make the data and scale it
X, y = make_circles(n_samples=800, factor=0.3, noise=0.1, random_state=42)
X = StandardScaler().fit_transform(X)

# Fit the algorithm
y_predicted = DBSCAN(eps=0.35, min_samples=10).fit_predict(X)

Now, let’s see how our model performed:


Python

# Visualize the data
plt.scatter(X[:,0], X[:,1], c=y_predicted)

Here we can easily spot two clusters; they even resemble an eye (I’m tempted to change the colors to make it look like the Eye of Sauron). All models have their performance metrics, so let’s check out the main ones.

Python

# Evaluation Metrics
print('Number of clusters: {}'.format(len(set(y_predicted[np.where(y_predicted != -1)]))))
print('Homogeneity: {}'.format(metrics.homogeneity_score(y, y_predicted)))
print('Completeness: {}'.format(metrics.completeness_score(y, y_predicted)))

Number of clusters: 2
Homogeneity: 1.0000000000000007
Completeness: 0.9691231370370732

What would happen if we changed the eps value to 0.4?

Python

y_predicted = DBSCAN(eps=0.4, min_samples=10).fit_predict(X)
plt.scatter(X[:,0], X[:,1], c='orangered')
plt.title('I see you!')

Couldn’t resist.

For a more hands-on experience in solving problems with clustering, check out
our article on finding trading pairs for the pairs trading strategy with machine
learning.


Other Sklearn clustering models
Depending on the clustering problem, you might want to use other clustering algorithms; the most popular ones are K-Means, Hierarchical, Affinity Propagation, and Gaussian mixture clustering.

If you want to learn the in-depth theory behind clustering and get introduced to various models and the math behind them, go here.

Sklearn Dimensionality Reduction – Reducing random variables
Dimensionality reduction is a method where we want to shrink the size of the data while preserving the most important information in it. In Sklearn these methods can be accessed via the decomposition module.

As humans, we usually think in 4 dimensions (if you count time as one) up to a maximum of 6-7 if you are a quantum physicist. Data can easily go beyond that, and we need to reduce it to lower dimensions so it can be observed.

Sklearn PCA
PCA (Principal Component Analysis) is a linear technique for dimensionality reduction. It basically performs a linear mapping of the data to a lower dimension while maximizing the preserved variance of the data.

PCA can be used for an easier visualization of data and as a preprocessing step to
speed up the performance of other machine learning algorithms. Let’s go back to
our iris dataset and make a 2d visualization from its 4d structure.

Firstly, we will load the required libraries, obtain the dataset, scale the data and
check how many dimensions we have:


Python

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline

# Load the data and scale it
X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)
print(f'The number of dimensions in X is {X.shape[1]}')

The number of dimensions in X is 4

Now we will set our PCA and fit it to the data:

Python

# Load PCA and specify the number of dimensions aka components
pca = PCA(n_components=2)
pc = pca.fit_transform(X)
print(f'The number of reduced dimensions is {pc.shape[1]}')

The number of reduced dimensions is 2
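To check how much of the original variance those two components preserve, you can inspect the fitted PCA’s explained_variance_ratio_ attribute:

Python

# Fraction of the original variance captured by each principal component
print(pca.explained_variance_ratio_)
print(f'Total preserved variance: {pca.explained_variance_ratio_.sum():.2%}')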

Let’s store the data into a pandas data frame and recode the numerical target
features to categorical:

Python

# Put the data into a pandas data frame
df = pd.DataFrame(data = pc, columns = ['pc_1', 'pc_2'])
df['target'] = y
df.head()


Python

# Recode the numerical data to categorical
def recoding(data):
    if data == 0:
        return 'iris-setosa'
    elif data == 1:
        return 'iris-versicolor'
    else:
        return 'iris-virginica'

df['target'] = df['target'].apply(recoding)
df.head()

And now for the finale, we plot the data:

Python

# Plot the data
fig = plt.figure(figsize = (12,10))
ax = fig.add_subplot(1,1,1)
ax.set_xlabel('Principal Component 1', fontsize = 17)
ax.set_ylabel('Principal Component 2', fontsize = 17)
ax.set_title('2 component PCA', fontsize = 20)
targets = ['iris-setosa', 'iris-versicolor', 'iris-virginica']
colors = ['r', 'g', 'b']
for target, color in zip(targets, colors):
    indicesToKeep = df['target'] == target
    ax.scatter(df.loc[indicesToKeep, 'pc_1'],
               df.loc[indicesToKeep, 'pc_2'],
               c = color,
               s = 50)
ax.legend(targets)
ax.grid()

As you can see, we basically compressed the 4d data into a 2d observable one. In
this case, we can say that the algorithm discovered the petals and sepals because
we had the width and length of both.


Other Sklearn Dimensionality Reduction models
There are other Dimensionality Reduction models in Sklearn that you might prefer for certain problems, such as ICA, IPCA, NMF, LDA, Factor Analysis, and more.

For a more in-depth look go here.


What are the 3 Common Machine Learning Analysis/Testing Mistakes?
When you run your analysis, there are 3 common mistakes to take note of:

Overfitting
Look-ahead Bias
P-hacking

Do check out this lecture PDF to learn more: 3 Big Mistakes of Backtesting – 1)
Overfitting 2) Look-Ahead Bias 3) P-Hacking

Full Code
GitHub Link

Igor Radovanovic
