
Group work Project – Submission 03

Master of Science in Financial Engineering


Course: Machine Learning in Finance
Date: 15.09.2020

Gustavo Edward de Mûelenaere Corrêa

Nigel Moyo

The aim of the third submission is to implement an algorithmic trading strategy that uses machine
learning to predict the next-period asset trend, generating trading signals based on the results and
data from the previous submissions.

1. Brief Introduction to Machine Learning in Finance

Machine learning is a buzz term that has attracted attention in both industry and academia. It is a
subset of data science that offers algorithms able to learn from experience and add value without
human intervention. It is an application of artificial intelligence focused on developing smart
algorithms that perform well when trained on large volumes of data. The financial industry
generates and handles big data on a daily basis, making it a suitable playground for machine
learning. Thanks to these developments, the world has witnessed quite innovative financial
products from fintech and financial services companies, resulting in improved operations and
optimized financial portfolios. Machine learning has been applied to algorithmic trading (using
mathematical algorithms to make better trading decisions in financial markets), portfolio
optimization, and fraud detection and prevention.

There are two main approaches to Machine Learning: Supervised and Unsupervised Learning.

Supervised Learning

Supervised learning is the concept of algorithms learning from experience and examples. The
algorithm is fed two datasets, namely a training set and a test set. It learns from labelled examples
in the training set and then applies what it has learnt to the test set as accurately as possible. The
training set consists of n ordered pairs (x1, y1), (x2, y2), …, (xn, yn), where xi is some measurement
of a data point and yi is its label. The main objective is to make accurate estimates of the labels for
the test set by drawing inferences from the training set: given a sequence of inputs with their
desired outputs, the algorithm learns to produce accurate outputs for new inputs (forecasting).
Supervised machine learning is usually grouped into regression and classification models.

• Classification: A classification model deals with output variables that belong to a category,
such as “green” or “blue” or “fat” and “slim”.

• Regression: A regression model deals with output variables that take a real value, such as
“dollars” or “height”.

There are various supervised learning algorithms, such as support vector machines (SVM),
k-nearest neighbors (KNN) and logistic regression for classification problems, linear regression
for regression problems, decision trees and neural networks.

• K Nearest Neighbors (KNN)


This algorithm is used to solve both classification and regression problems. It assumes that similar
objects exist in close proximity. The main objective of the KNN algorithm is to use a database in
which data points are grouped into several classes to predict the classification of a new sample
point. In our study we start with a dataset in which each data point is allocated to a known class,
and we then seek to predict the classification of a new data point based on the known
classifications in the training dataset. Each characteristic in the training dataset is treated as a
different dimension in some space. We first specify a positive integer k and select the k entries
closest to the new sample; secondly, we find the most common classification among them; and
lastly, we assign that classification to the new data point.
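The steps above can be sketched with scikit-learn's KNN classifier on illustrative data (the points and classes below are made up for the example):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Training data: two measurements per point, each point assigned to a known class
X_train = np.array([[1.0, 1.1], [1.2, 0.9], [5.0, 5.2], [5.1, 4.8]])
y_train = np.array([0, 0, 1, 1])

# Specify k, then let the model find the k closest training entries
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# A new sample near the second cluster inherits its neighbors' majority class
pred = knn.predict([[4.9, 5.0]])  # → class 1
```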

• Support Vector Machines (SVM)


Support vector machines are supervised learning algorithms that analyze data for classification
and regression analysis. The algorithm constructs a hyperplane in a high-dimensional space, and
this hyperplane provides the basis for accurate classification.
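A minimal sketch of a linear SVM separating two toy classes (the data is illustrative, not from our study):

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable groups of points
X = np.array([[0, 0], [0, 1], [3, 3], [3, 4]])
y = np.array([-1, -1, 1, 1])

# A linear kernel fits a separating hyperplane between the classes
clf = SVC(kernel="linear")
clf.fit(X, y)

# New points are classified by which side of the hyperplane they fall on
pred = clf.predict([[3, 3.5]])  # → class 1
```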

• Decision Trees
Decision trees are a foundation of classification in machine learning. A decision tree is a map of
the possible outcomes of a series of related choices, providing the ability to weigh options against
one another based on costs, probabilities and benefits. Decision trees are classified into two types,
namely categorical variable decision trees and continuous variable decision trees.
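A small sketch with scikit-learn's decision tree classifier (the features, thresholds and labels are invented for illustration):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Illustrative continuous features (e.g. age, income) with binary labels
X = [[25, 40000], [35, 60000], [45, 80000], [20, 20000]]
y = [0, 1, 1, 0]

# The tree learns a series of threshold choices that separate the classes
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

print(export_text(tree))      # the learned map of choices, as text
pred = tree.predict([[40, 70000]])
```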

• Logistic Regression
This is a statistical method for predicting binary classes and it is a special type of linear regression
where the target variable is dichotomous in nature. It uses the log of odds as the dependent variable
to predict the occurrences of a binary event making use of a logit function. The dependent variable
follows a Bernoulli distribution and estimation is done through maximum likelihood.
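A minimal sketch of logistic regression on a dichotomous target (toy one-dimensional data; scikit-learn fits the model by maximum likelihood as described above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# One feature, binary (Bernoulli) outcomes
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()  # estimated via maximum likelihood
model.fit(X, y)

# The logit link maps the linear score to a probability of the event
proba = model.predict_proba([[3.8]])[0, 1]  # P(y = 1 | x = 3.8)
pred = model.predict([[3.8]])
```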
Unsupervised Learning

Unsupervised learning is a machine learning concept where the algorithm builds a representation
that can be used for decision making, based on the input provided, without use of feedback or
rewards from the environment. The algorithm finds patterns and structures in the unstructured
noisy input data. This algorithm allows for modelling of probability densities over inputs. Common
unsupervised learning algorithms are cluster analysis, which is used for exploratory data analysis
to find hidden patterns, and principal component analysis (PCA), which is used for dimensionality
reduction.

• Cluster Analysis
Cluster analysis is an algorithm used to classify input data into groups or clusters. There is no
prior information about the clusters, so the algorithm identifies the patterns within the given data
and groups the objects according to how closely related they are. The procedure involves
formulating a problem, selecting a clustering rule and choosing the number of clusters. The
measure of distance should be carefully selected; in most cases, the Euclidean distance is used.
The distance in cluster analysis indicates how separated the clusters are from each other.
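As a sketch, k-means clustering (which uses Euclidean distance) can recover two well-separated groups without any label information; the data below is illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabelled points forming two well-separated groups
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.1, 7.9], [7.8, 8.2]])

# We choose the number of clusters; the algorithm finds the grouping itself
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_  # first three points share one label, last three the other
```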

• Principal Component Analysis (PCA)


This is a technique used for the reduction and classification of input data. It reduces the
dimensions of a dataset by finding a new, smaller set of variables that retains most of the variation
present in the sample. The objective is to represent the data using as few variables as possible
while preserving a high degree of accuracy.
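A short sketch: three correlated features (synthetic data) are compressed into a single principal component that still captures almost all of the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x = rng.normal(size=200)
# Three features driven by one underlying factor, plus small noise
X = np.column_stack([x,
                     2 * x + 0.1 * rng.normal(size=200),
                     -x + 0.1 * rng.normal(size=200)])

# Reduce three dimensions to one new variable
pca = PCA(n_components=1)
Z = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # close to 1.0: most variation retained
```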

2. Modeling

We implemented an algorithmic trading strategy focused on classification. Starting with the
standard mean-reversion strategy based on Bollinger Bands (BB), we implemented a supervised
learning model trained on two labels describing the market direction (up or down) the day after
the asset price hit either the upper or lower boundary.
Below is a zoomed view of the asset price chart, displaying instances where the asset price
crossed the BB boundaries:

The descriptors chosen (independent variables) were the following:

• Normalized upper and lower Bollinger Bands (based on the 90-day SMA)
• Normalized 30-day SMA of the asset price
• Normalized (Close-Open)/Close and (High-Low)/Close asset price data
• Normalized asset Volume data
• Normalized VIX 90 and 30-day SMAs
• Normalized Nasdaq 90 and 30-day SMAs
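A hedged sketch of how the Bollinger Band and SMA descriptors might be computed with pandas (the function name and the normalization by the current close are assumptions for illustration; the 2-standard-deviation band width matches the backtest setup described later):

```python
import pandas as pd

def build_features(close):
    """Sketch of normalized BB and SMA descriptors from a daily close series."""
    sma90 = close.rolling(90).mean()
    std90 = close.rolling(90).std()
    upper = sma90 + 2 * std90   # upper Bollinger Band (90-day SMA +/- 2 std)
    lower = sma90 - 2 * std90   # lower Bollinger Band
    # Divide by the current close so the descriptors are scale-free
    feats = pd.DataFrame({
        "bb_upper": upper / close - 1,
        "bb_lower": lower / close - 1,
        "sma30": close.rolling(30).mean() / close - 1,
    })
    return feats.dropna()       # drop the warm-up rows with incomplete windows

# Example on a synthetic price series of 120 trading days
close = pd.Series(range(1, 121), dtype=float)
feats = build_features(close)
```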

We decided to use the VIX to capture some of the market volatility, and the Nasdaq index to gain
exposure to the tech sector, given that the assets in question are from that sector (Intel, AMD and
Nvidia). All data was sourced from Yahoo Finance.

The target vector was built from looking forward one day and determining the asset price daily
direction (Close – Open), labeling 1 when the price closed higher, or -1 when the price closed
lower.
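The labeling step can be sketched in pandas as follows (the OHLC values are made up; the shift moves each day's direction back one row so today's features are paired with tomorrow's label):

```python
import numpy as np
import pandas as pd

# Hypothetical daily open/close prices
df = pd.DataFrame({"Open":  [10.0, 11.0, 12.0, 11.0],
                   "Close": [11.0, 10.5, 12.5, 11.2]})

# +1 when the price closed higher than it opened, -1 otherwise
direction = np.where(df["Close"] > df["Open"], 1, -1)

# Look forward one day: each row's target is the *next* day's direction
df["target"] = pd.Series(direction).shift(-1)
```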

Using data from January 2006 to August 2020, we built a set containing all the days when the price
hit either BB boundary, together with the market direction at the end of the following day.
3. Classification

We then split that dataset using scikit-learn's train_test_split function with the stratify option
enabled, which guarantees a balanced label distribution between the training and test sets.
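The split looks like this on toy data (the feature values are placeholders; stratify=y preserves the label proportions in both splits):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)                     # placeholder features
y = np.array([1, 1, 1, 1, 1, -1, -1, -1, -1, -1])    # balanced labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)
# Both the training and test sets keep the 50/50 label distribution
```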

With the training and test sets ready, we deployed four classification models (K-Nearest Neighbors
(KNN), Support Vector Machine, Decision Tree and Logistic Regression), evaluating their
performance based on accuracy data, confusion matrices, and cross-validation scores (ROC-AUC).
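The evaluation loop can be sketched as below. The data here is random noise standing in for our real feature set, and the exact hyperparameters are assumptions; the point is the shared accuracy / confusion-matrix / cross-validated ROC-AUC scaffolding:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))          # stand-in for the normalized descriptors
y = rng.integers(0, 2, size=200)       # stand-in for the direction labels

models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(),
    "DecisionT": DecisionTreeClassifier(max_depth=3, random_state=0),
    "LogisticR": LogisticRegression(),
}

results = {}
for name, model in models.items():
    model.fit(X, y)
    pred = model.predict(X)
    results[name] = {
        "train_acc": accuracy_score(y, pred),
        "cm": confusion_matrix(y, pred),
        "cv_auc": cross_val_score(model, X, y, cv=5,
                                  scoring="roc_auc").mean(),
    }
```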

Below are the various performance indices for each classification model:

                       KNN     SVM     DecisionT   LogisticR

Train data accuracy:   0.65    0.52    1.00        0.55
Test data accuracy:    0.49    0.43    0.45        0.49
Mean ROC/AUC:          0.556   0.536   0.543       0.529

And the confusion matrices for each model are shown below:

KNN        SVM        Decision Tree   Logistic Regression

[26 49]    [56 19]    [30 45]         [10 65]
[33 54]    [74 13]    [44 43]         [18 69]

From the results above we can conclude that no classification model was really successful in
capturing any meaningful behavior that would explain/predict with a high accuracy level the asset
price direction the day after it would hit any of the Bollinger Band boundaries.

For the actual algorithm implementation, we decided to use the KNN model, mainly because of
its marginally higher accuracy and ROC/AUC values. Below is the KNN ROC curve, which
clearly shows the lack of predictive power of the current model.

4. Algorithm implementation/backtesting

We implemented the strategy described above on the Quantopian platform, which gave us access
to the necessary asset price data as well as the portfolio analysis tool PyFolio. Every day, 30
minutes after the market open, we checked whether the price had hit or overshot either boundary.
If so, we built and fit the KNN model on similar previous daily events, predicted the market
direction towards the end of the day, and opened a position accordingly. Any open position was
closed 30 minutes before the market close.
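The daily decision step can be sketched in plain Python (this is not Quantopian API code; the function name, the k value and the minimum-history guard are assumptions):

```python
from sklearn.neighbors import KNeighborsClassifier

def daily_signal(history_X, history_y, todays_features, k=5):
    """Fit KNN on all past boundary-hit events and predict today's direction."""
    if len(history_y) < 2 * k:        # not enough past events to train on
        return 0                      # no trade today
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(history_X, history_y)
    return int(knn.predict([todays_features])[0])  # +1 long, -1 short

# Toy history: low feature values preceded down days, high values up days
history_X = [[0.0], [0.1], [0.2], [0.3], [0.4], [0.5],
             [5.0], [5.1], [5.2], [5.3], [5.4], [5.5]]
history_y = [-1] * 6 + [1] * 6
signal = daily_signal(history_X, history_y, [5.1])  # → +1 (open a long)
```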

After running a backtest using a rolling 3-month BB spread of 2 standard deviations from January
2011 to September 2020, we obtained the following performance metrics for the strategy:

As expected, the strategy did not capture any meaningful alpha, although it performed much
better than the conventional mean-reversion strategy of opening a trade based on which BB
boundary was crossed. The backtest results of this conventional strategy for the same period are
shown below:
A more detailed cumulative returns plot is displayed below:

Here are other return metrics: annual returns, their distribution and a visual reference of the
monthly returns’ performance.

The evolution of the strategy’s Sharpe ratio is displayed in the chart below:

Despite not showing a stellar performance, the fact that the ML implementation did improve the
strategy’s performance when compared to the conventional one gives us motivation to investigate
further if we can increase its predictability factor by adding more predictors and possibly
implementing dimensionality reduction to the resulting set.

5. References

1. Love, B.C., 2002. Comparing supervised and unsupervised category learning. Psychonomic bulletin
& review, 9(4), pp.829-835
2. Dixon, M.F., Halperin, I. and Bilokon, P., 2020. Machine Learning in Finance. Springer Verlag Berlin
Heidelberg.
3. Kaufman, L. and Rousseeuw, P.J., 2009. Finding groups in data: an introduction to cluster analysis
(Vol. 344). John Wiley & Sons.
