Group Assignment Submission 03

Project report for Machine Learning in Finance

Uploaded by

Francis Mtambo

Group work Project – Submission 03

Master of Science in Financial Engineering

Course: Machine Learning in Finance
Date: 15.09.2020

Gustavo Edward de Mûelenaere Corrêa

Nigel Moyo

The aim of the third submission is to implement an algorithmic trading strategy that uses machine
learning to predict the next-period asset trend, generating trading signals based on results and data
from previous submissions.

1. Brief Introduction to Machine Learning in Finance

Machine learning is a buzz term that has attracted attention in both industry and academia. It is a
subset of data science whose algorithms learn from experience and add value without human
intervention. As an application of artificial intelligence, it focuses on developing algorithms
that perform well on large volumes of data. The financial industry generates and processes big data
on a daily basis, which makes it a suitable playground for machine learning. As a result, fintech
and financial services companies have launched innovative financial products that improve
operations and optimize financial portfolios. Machine learning has been applied to algorithmic
trading (using mathematical algorithms to make better decisions when trading in financial
markets), portfolio management optimization, and fraud detection and prevention.

There are two main approaches to Machine Learning: Supervised and Unsupervised Learning.

Supervised Learning

Supervised learning is the concept of algorithms learning from labelled examples. The algorithm is
given two datasets, a training set and a test set. It learns from the labelled examples in the
training set and then applies what it has learnt to the test set as accurately as possible. The
training set consists of n ordered pairs (x1, y1), (x2, y2), …, (xn, yn), where xi is some
measurement of a data point and yi is its label. The main objective is to estimate the labels of the
test set accurately by drawing inferences from the training set: given a sequence of desired
outputs, the algorithm learns to produce the correct output for new inputs (forecasting).

Supervised machine learning is usually grouped into regression and classification models.

• Classification: A classification model deals with output variables that belong to a category,
such as “green” or “blue”, or “fat” or “slim”.

• Regression: A regression model deals with output variables that take a real value, such as an
amount in dollars or a height.

There are various supervised learning algorithms, such as support vector machines (SVM), k-nearest
neighbors (KNN) and logistic regression for classification problems, linear regression for
regression problems, and decision trees and neural networks.

• K Nearest Neighbors (KNN)

This algorithm can be used to solve both classification and regression problems. It assumes that
similar objects lie close to one another. The main objective of the KNN algorithm is to use a
database in which data points are grouped into several classes to predict the class of a new sample
point. In our study we start with a dataset in which each data point is allocated to a known class,
and we seek to predict the class of a new data point from the known classes of the training
dataset. We treat each characteristic in the training dataset as a different dimension in some
space. First, we specify a positive integer k and select the k entries that are closest to the new
sample; second, we find the most common class among these entries; and last, we assign that class
to the new sample.
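
The three steps above can be sketched in plain Python. This is a minimal illustration, not the code used in the project: the function name knn_predict, the Euclidean distance and the toy points are assumptions made for the example.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Step 1: Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 2: count the classes among the k closest entries
    votes = Counter(label for _, label in nearest)
    # Step 3: assign the most common class to the new sample
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two features per point, labels 1 (up) and -1 (down)
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
labels = [-1, -1, -1, 1, 1, 1]
prediction = knn_predict(points, labels, (0.95, 1.0), k=3)
```

With an even k and two classes a tie is possible; this sketch then returns whichever tied class was counted first, so an odd k is the simpler choice.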

• Support Vector Machines (SVM)

Support vector machines are supervised learning algorithms that analyze data for classification and
regression analysis. The algorithm constructs a hyperplane in a high-dimensional space, and this
hyperplane provides the basis for accurate classification.

• Decision Trees
Decision trees are a foundational method for classification. A decision tree is a map of the
possible outcomes of a series of related choices. It provides the ability to weigh options against
one another based on costs, probabilities and benefits. Decision trees are classified into two
types: categorical variable decision trees and continuous variable decision trees.

• Logistic Regression
This is a statistical method for predicting binary classes. It is a special case of a generalized
linear model in which the target variable is dichotomous. It uses the log of the odds as the
dependent variable and predicts the occurrence of a binary event by means of the logit function.
The dependent variable follows a Bernoulli distribution and estimation is done by maximum likelihood.
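
The link between the log of the odds and the predicted probability can be sketched as follows. The coefficients below are hypothetical values picked by hand rather than maximum likelihood estimates, and the function names are ours.

```python
import math

def sigmoid(z):
    """Logistic function: maps a log-odds value to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log of the odds p / (1 - p); the inverse of the logistic function."""
    return math.log(p / (1.0 - p))

def predict_proba(beta0, beta, x):
    """P(y = 1 | x) for a logistic regression with the given coefficients."""
    log_odds = beta0 + sum(b * xi for b, xi in zip(beta, x))
    return sigmoid(log_odds)

# Hypothetical coefficients (an assumption for illustration, not fitted values)
beta0, beta = -0.5, (2.0, -1.0)
# log-odds = -0.5 + 2.0 * 0.5 - 1.0 * 0.25 = 0.25
p_up = predict_proba(beta0, beta, (0.5, 0.25))
```

Applying logit to p_up recovers the linear predictor 0.25, which is the sense in which the model is linear in the log of the odds.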

Unsupervised Learning

Unsupervised learning is a machine learning concept in which the algorithm builds a representation
that can be used for decision making from the input alone, without labels, feedback or rewards from
the environment. The algorithm finds patterns and structure in unstructured, noisy input data, and
it allows the probability density of the inputs to be modelled. Common unsupervised learning
algorithms are cluster analysis, which is used in exploratory data analysis to find hidden
patterns, and principal component analysis (PCA), which is used for dimensionality reduction.

• Cluster Analysis
Cluster analysis is used to sort input data into groups, or clusters. There is no prior information
about the clusters, so the algorithm identifies patterns within the given data and groups the
objects according to how closely related they are. The procedure involves formulating the problem,
selecting a clustering rule and choosing the number of clusters. The distance measure should be
selected carefully; in most cases the Euclidean distance is used. The distance between clusters
indicates how well separated they are from each other.
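
As one possible clustering rule, the sketch below uses k-means with the Euclidean distance. The choice of k-means, the starting centroids and the toy points are assumptions made for illustration; the text above does not prescribe a specific rule.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then re-average."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Index of the centroid closest to p in Euclidean distance
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # New centroid = coordinate-wise mean of its cluster (unchanged if empty)
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical toy data with two visibly separated groups
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
separation = math.dist(centroids[0], centroids[1])
```

The value of separation is the distance between the two final centroids, one simple way to express how far apart the clusters are.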

• Principal Component Analysis (PCA)

This technique is used for dimensionality reduction of input data. It reduces the dimensions of a
dataset by finding a new set of variables that is smaller than the original set but retains most of
the variation present in the sample. The objective is to represent the data with as few variables
as possible while retaining a high degree of accuracy.

2. Modeling

We implemented an algorithmic trading strategy focused on classification. Starting with the
standard mean-reversion strategy based on Bollinger Bands (BB), we implemented a supervised
learning model trained on two labels that describe the market direction (up or down) the day after
the asset price hit either the upper or the lower boundary.

Below is a zoomed view of the asset price chart, displaying instances where the asset price
crossed the BB boundaries:

The descriptors chosen (independent variables) were the following:

• Normalized upper and lower Bollinger Bands (based on the 90-day SMA)
• Normalized 30-day SMA of the asset price
• Normalized (Close-Open)/Close and (High-Low)/Close asset price data
• Normalized asset Volume data
• Normalized VIX 90 and 30-day SMAs
• Normalized Nasdaq 90 and 30-day SMAs
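
For reference, the Bollinger Bands behind the first descriptor can be computed as in the sketch below. The 90-day window and the width of 2 standard deviations come from this report; the function names, the use of the population standard deviation and the toy prices are assumptions. The normalization step is left out because the report does not state how it was done.

```python
import statistics

def bollinger_bands(prices, window=90, num_std=2.0):
    """Return (sma, upper, lower) lists aligned with `prices`.

    Entries are None until a full window of prices is available.
    """
    sma, upper, lower = [], [], []
    for i in range(len(prices)):
        if i + 1 < window:
            sma.append(None)
            upper.append(None)
            lower.append(None)
            continue
        chunk = prices[i + 1 - window : i + 1]
        mean = statistics.fmean(chunk)
        # Population standard deviation (assumption; the sample version is also common)
        std = statistics.pstdev(chunk)
        sma.append(mean)
        upper.append(mean + num_std * std)
        lower.append(mean - num_std * std)
    return sma, upper, lower

def hits_band(price, upper, lower):
    """True when the price touches or overshoots either band."""
    return price >= upper or price <= lower

# Hypothetical toy prices with a short window so the example stays readable
prices = [1.0, 2.0, 3.0, 4.0, 5.0]
sma, upper, lower = bollinger_bands(prices, window=5)
```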

We chose the VIX to capture some of the market volatility and the Nasdaq index to gain exposure to
the tech sector, since the assets in question (Intel, AMD and Nvidia) belong to that sector. All
data was sourced from Yahoo Finance.

The target vector was built by looking forward one day and determining the asset's daily price
direction (Close - Open), labeling it 1 when the price closed higher and -1 when the price closed
lower.
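
The labeling rule can be sketched as below. The function name is hypothetical, and treating a day that closes exactly at its open as -1 is our assumption, since the report does not say how ties were handled.

```python
def next_day_labels(opens, closes):
    """Label day t with the direction of day t + 1: 1 if Close > Open, else -1.

    The last day has no following day, so its label is None.
    """
    if not opens:
        return []
    labels = []
    for t in range(len(opens) - 1):
        # Look forward one day and compare that day's close with its open
        went_up = closes[t + 1] > opens[t + 1]
        labels.append(1 if went_up else -1)  # ties fall into -1 (assumption)
    labels.append(None)
    return labels

# Hypothetical three-day price series
opens = [10.0, 11.0, 12.0]
closes = [11.0, 10.5, 13.0]
labels = next_day_labels(opens, closes)
```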

Using data from January 2006 to August 2020, we built a set containing all the days when the price
hit either BB boundary, together with the market direction at the end of the following day.

3. Classification

We then split that dataset using the train_test_split function from the Python scikit-learn
library with the stratify option enabled, which keeps the label proportions the same in the
training and test sets.
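
What a stratified split does can be sketched with the standard library alone. This stands in for the scikit-learn call and is not the code used in the project; the function name, the 25% test fraction and the toy labels are assumptions.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx) with each label's share kept in both sets."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_label.values():
        # Shuffle within each class, then cut the same fraction from every class
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical label vector: 60 up days and 40 down days
labels = [1] * 60 + [-1] * 40
train_idx, test_idx = stratified_split(labels, test_fraction=0.25)
test_up_share = sum(labels[i] == 1 for i in test_idx) / len(test_idx)
```

In this example test_up_share equals the 60% share of up days in the full set, which is the property the stratify option is meant to provide.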

With the training and test sets ready, we deployed four classification models (K-Nearest Neighbors
(KNN), Support Vector Machine, Decision Tree and Logistic Regression) and evaluated their
performance based on accuracy, confusion matrices and cross-validation scores (ROC-AUC).

Below are the various performance indices for each classification model:

                       KNN     SVM     Decision Tree   Logistic Regression

Train data accuracy:   0.65    0.52    1.00            0.55
Test data accuracy:    0.49    0.43    0.45            0.49
Mean ROC/AUC:          0.556   0.536   0.543           0.529

And the confusion matrices for each model are shown below:

KNN        SVM        Decision Tree   Logistic Regression

26  49     56  19     30  45          10  65
33  54     74  13     44  43          18  69
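
As a consistency check, the test accuracies can be recomputed from the confusion matrices. The sketch reads the flattened figures as four 2x2 matrices side by side and assumes that the main diagonal holds the correct predictions; both readings are our assumptions.

```python
# Confusion matrices as reported above, one 2x2 matrix per model
confusion = {
    "KNN": [[26, 49], [33, 54]],
    "SVM": [[56, 19], [74, 13]],
    "Decision Tree": [[30, 45], [44, 43]],
    "Logistic Regression": [[10, 65], [18, 69]],
}

def accuracy(matrix):
    """Share of samples on the main diagonal (correct predictions)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

accuracies = {name: round(accuracy(m), 2) for name, m in confusion.items()}
test_set_sizes = {name: sum(sum(row) for row in m) for name, m in confusion.items()}
```

Rounded to two decimals, the results are 0.49, 0.43, 0.45 and 0.49, matching the test data accuracy row. Every matrix sums to 162 samples, with row totals of 75 and 87, which is what a shared test set would produce.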

From the results above we can conclude that no classification model succeeded in capturing any
meaningful behavior that would predict, with a high level of accuracy, the asset price direction
the day after it hit either Bollinger Band boundary.

For the actual algorithm implementation, we decided to use the KNN model, if only because of its
marginally higher accuracy and ROC/AUC values. Below is the KNN ROC curve, which clearly shows the
lack of predictive power of the current model.

4. Algorithm implementation/backtesting

We implemented the strategy described above on the Quantopian platform, which gave us access to the
necessary asset price data as well as the portfolio analysis tool PyFolio. Every day, 30 minutes
after the market open, we checked whether the price had hit or overshot either boundary. If so, we
built and fitted the KNN model on similar previous daily events and predicted the market direction
towards the end of the day, opening a position accordingly. Any open position was closed 30 minutes
before the market close.
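
The daily decision can be sketched as below, next to the conventional Bollinger mean-reversion rule that serves as the benchmark later in this section. This is not the Quantopian code: the function names are hypothetical, predict_direction stands for the freshly fitted KNN model, and mapping a predicted rise to a long position and a predicted fall to a short one is our reading of "accordingly". The benchmark rule of shorting at the upper band and buying at the lower band is the usual mean-reversion convention, also an assumption here.

```python
def decide_position(price, lower_band, upper_band, predict_direction):
    """Return 1 (long), -1 (short) or 0 (no trade) for today's check.

    `predict_direction` is a hypothetical callable returning 1 or -1.
    """
    if lower_band < price < upper_band:
        return 0  # price is inside the bands, so there is no signal
    # Price hit or overshot a band: trade in the predicted direction
    return predict_direction()

def mean_reversion_position(price, lower_band, upper_band):
    """Conventional rule: short at the upper band, long at the lower band."""
    if price >= upper_band:
        return -1
    if price <= lower_band:
        return 1
    return 0

# Hypothetical band levels and a stand-in model that always predicts "up"
ml_signal = decide_position(111.0, 90.0, 110.0, lambda: 1)
benchmark_signal = mean_reversion_position(111.0, 90.0, 110.0)
```

In this toy case the two rules disagree: the stand-in model goes long above the upper band, while the benchmark goes short.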

After running a backtest from January 2011 to September 2020 using rolling 3-month Bollinger Bands
with a width of 2 standard deviations, we obtained the following performance metrics for this
strategy:

As expected, the strategy did not capture any meaningful alpha, although it did perform much
better than the conventional mean-reversion strategy of opening a trade based on which BB boundary
was crossed. The backtest results of this conventional strategy for the same period are shown
below:

A more detailed cumulative returns plot is displayed below:

Here are other return metrics: annual returns, their distribution and a visual reference of the
monthly returns’ performance.

The evolution of the strategy’s Sharpe ratio is displayed in the chart below:

Although the performance is not stellar, the fact that the ML implementation improved on the
conventional strategy motivates us to investigate further whether we can increase its predictive
power by adding more predictors and possibly applying dimensionality reduction to the resulting
set.
