
Stock prediction with machine learning

A Degree Thesis Submitted to the Faculty of the
Escola Tècnica d’Enginyeria de Telecomunicació de Barcelona
Universitat Politècnica de Catalunya
by

Víctor Rubio Jornet


Abstract

During the last few months, there has been increased attention to the stock market due to the Covid pandemic. The new-found leisure time has driven many people to buy and sell stocks without any knowledge of the matter at hand. The number of registrations on investing and trading apps has increased drastically since last year. It is natural to think that the field of stock market prediction has grown accordingly. However, only two main approaches have been taken: one focusing on day trading and using technical analysis of the markets to predict the immediate value, and the other focusing on stocks as long-term investments and using fundamental analysis to predict the future value of the stock in the long run.

Following the outbreak of the Coronavirus, we have seen an increasing gap between the economy and the stock market that stems from the instability of the current times. This may worsen the predictions created by fundamental analysis until normality is restored, because the macroeconomic and microeconomic factors that are usually key in long-term predictions are not affecting the stock market in the same way. In contrast, technical analysis can predict short-term stock values, although it is difficult to stretch the length of time over which this analysis can predict. How long into the future can technical analysis predict? Will it be an accurate prediction? Which algorithm will help us obtain the best prediction with technical analysis? These were the main questions asked at the beginning of the project.

The usual prediction horizon with technical analysis goes from hours to a month or two. However, the goal of this project is to compare different algorithms to obtain a predictor that can determine whether a stock will go up or down in value in 3 to 5 months using technical analysis.

The project started with an introduction to Deep Learning and Machine Learning. Afterwards, the process of obtaining an adequate amount of data to create a proper dataset began. With enough data and the defined variables, the dataset was used to experiment with different algorithms and different configurations to obtain several predictors. Once the predictors were designed, a comparison was made among their results, and more data was added to the dataset to try to improve the scores. Then a second round of prediction started and the comparison of the scores among the different algorithms was made again. After adding more stock values to the dataset, a mistake was found in some of the rows: the decimal comma was misplaced because of a formatting difference in the API from which the data was obtained. A third and final round of prediction was done with the problem solved. Among the five algorithms that were tested, Random Forest offered the best results, with an accuracy of 71% for the last dataset.

Resum

Durant els últims mesos hi ha hagut un increment de l’interès cap al mercat de valors a causa
de la pandèmia de la Covid. El nou temps lliure trobat ha portat a molta gent a comprar i vendre
accions sense tenir el coneixement suficient. El nombre d’usuaris a les aplicacions d’inversió o de
negociació borsària ha augmentat dràsticament des de l’any passat. És normal pensar que el camp
de la predicció del mercat de valors hagi augmentat de la mateixa manera. Tot i això, només s’han
fet dos tipus d’enfocaments. El primer es centra en la negociació borsària a diari, fent servir l’anàlisi
tècnica del mercat per predir el valor immediat de les accions, i l’altre centrant-se en les accions
com a inversió a llarg termini i fent servir l’anàlisi fonamental per predir el valor a futur de les
accions a la llarga.

Després del brot del Coronavirus s’ha vist que la diferència entre el mercat de valors i l’economia
ha anat augmentant, i prové de la inestabilitat dels temps actuals. Aquest fet pot empitjorar les
prediccions fetes amb l’anàlisi fonamental fins que s’aconsegueixi la normalitat degut a que els fac-
tors macroeconòmic i microeconòmic que normalment són claus en les prediccions a llarg termini
no afecten el mercat de valors de la mateixa manera. En canvi l’anàlisi tècnica pot predir els val-
ors de les accions a curt termini, tot i que és complicat allargar la quantitat de temps en la qual
podem predir. Fins a quin punt en el futur podem predir amb l’anàlisi tècnica? Serà una predicció
acurada? Quin algorisme ens ajudarà a obtenir la millor predicció amb anàlisi tècnica? Aquestes
han estat les qüestions que es van formular a l’inici del projecte.

El temps normal de predicció amb anàlisi tècnica pot anar d’hores a un mes o dos. La finali-
tat d’aquest projecte és comparar diferents algorismes per obtenir un predictor que permet saber si
el valor d’una acció pujarà o baixarà en un rang de temps d’uns 3 a 5 mesos fent servir anàlisi tècnica.

El projecte va començar amb una introducció a l’aprenentatge automàtic i l’aprenentatge profund. Posteriorment es va començar a obtenir suficients dades com per crear un conjunt de dades
funcional. Amb les dades suficients i les variables definides es va començar a usar el conjunt de
dades per experimentar amb diferents algorismes i configuracions per obtenir predictors. Un cop
dissenyats els predictors, es va fer una comparació entre ells, i es va afegir més informació al conjunt
de dades per intentar millorar les avaluacions. Llavors va començar la segona ronda de prediccions
i es va tornar a fer una comparació dels resultats per obtenir el valor del millor predictor. Després
d’afegir més valors d’accions al conjunt de dades, es va trobar un error en algunes files en el que
la coma dels decimals estava moguda degut a la diferència de formats a la API de on s’obtenen les
dades. Posteriorment es va procedir a fer una tercera i última ronda de prediccions amb el problema
solucionat. D’entre els cinc algorismes provats, el de Random Forest va oferir els millors resultats,
amb un percentatge d’encerts del 71% per l’últim conjunt de dades.

Resumen

Durante los últimos meses ha habido un incremento del interés hacia el mercado de valores
debido a la pandemia del Covid. El tiempo libre encontrado ha llevado a mucha gente a comprar
y vender acciones sin tener el conocimiento suficiente del tema. El número de usuarios de las aplicaciones de inversión o negociación bursátil ha aumentado drásticamente desde el año pasado.
Es normal pensar que el campo de la predicción del mercado de valores haya crecido en consecuencia.
Pese a esto, únicamente se han hecho dos tipos de enfoque. El primero se centra en la negociación
bursátil diaria, usando el análisis técnico del mercado para predecir el valor inmediato de las ac-
ciones, el otro se centra en las acciones como inversión a largo plazo, usando el análisis fundamental
para predecir los valores de las acciones a futuro.

Después del brote de Coronavirus se ha visto una brecha entre el mercado de valores y la
economía, que ha ido aumentando y proviene de la inestabilidad del momento que vivimos ac-
tualmente. Este hecho puede empeorar las predicciones realizadas mediante el análisis fundamental
hasta que volvamos a una normalidad, ya que los factores macroeconómico y microeconómico que
normalmente son clave en las predicciones a largo plazo no afectan al mercado de valores de la
misma manera. En contraste el análisis técnico puede predecir el valor de las acciones a corto
plazo, pese a que alargar la cantidad de tiempo que se puede predecir es complejo. ¿Hasta qué
punto en el futuro podemos predecir con el análisis técnico? ¿Será una predicción acertada? ¿Qué
algoritmo nos permitirá obtener la mejor predicción con análisis técnico? Estas han sido las pre-
guntas formuladas al inicio del proyecto.

El tiempo normal de predicción con análisis técnico puede variar desde horas a un mes o dos.
La finalidad del proyecto es comparar diferentes algoritmos para obtener un predictor que permita
saber si el valor de una acción subirá o bajará en un rango de tiempo de unos 3 a 5 meses usando
análisis técnico.

El proyecto empezó con una introducción al aprendizaje automático y aprendizaje profundo.


Posteriormente se obtuvieron suficientes datos para generar un conjunto de datos funcional. Con
los datos y las variables definidas se usó el conjunto de datos para experimentar con diferentes
algoritmos y configuraciones para obtener predictores. Una vez finalizados los predictores se realizó
la comparación y se decidió añadir más datos al conjunto de datos para intentar mejorar las evalua-
ciones. Ası́ empezó la segunda ronda de predicciones y se volvieron a comparar los resultados para
obtener el valor del mejor predictor. Se volvió a aumentar el conjunto de datos y se encontraron
errores en algunas filas, en las que la coma de los decimales estaba movida debido a la diferencia
de formato entre la API de dónde se obtienen los datos. Al solucionar el problema se procedió a
realizar una tercera y última evaluación de los diferentes algoritmos. Entre los cinco algoritmos
usados en el proyecto, el de Random forest ofreció los mejores resultados, con un porcentaje de
aciertos del 71% para el último conjunto de datos.

Acknowledgements

I would like to thank first my thesis supervisor at Ernst & Young, Ana Jimenez Castellanos, who has guided me and given me recommendations throughout the whole project while giving me the space to grow as an engineer, and who allowed me to work on this incredible project that has driven me to improve my skills in Machine Learning applied to economics.

I would also like to thank Prof. Climent Nadeu from the department of Signal Theory and Communications at UPC, Barcelona. He gave me key guidance to obtain knowledge prior to the beginning of this project that has been helpful during the whole execution of the thesis.

I can't forget my friends and college classmates, who have taught me many life lessons and have made me who I am today. In particular I would like to thank Luis Ramón Rodríguez Javier, who has helped me stay motivated during the project and has taught me the power of perseverance.

Finally, I have to express my eternal gratitude to my parents for teaching me so many valuable
lessons, for listening to the progress of this project even when they didn’t understand a word I said,
and for supporting me in every step I take.

Contents
Abstract 1

Resum 2

Resumen 3

Acknowledgements 4

List of figures 7

List of tables 8

1 Introduction 9

2 Objectives 10
2.1 Achieve a 3 to 5 months prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Figure out the best algorithm to make the prediction . . . . . . . . . . . . . . . . . . 10

3 State of the art of the technology used or applied in this thesis 11


3.1 Indicators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.1.1 Exponential Moving Average (EMA) . . . . . . . . . . . . . . . . . . . . . . . 11
3.1.2 Moving Average Convergence Divergence (MACD) . . . . . . . . . . . . . . . 13
3.2 Classification algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2.1 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2.2 Multi Layer Perceptron (MLP) . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2.3 K-Nearest Neighbors (KNN) . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2.4 Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2.5 Support Vector Machine (SVM) . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.3 Cross Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.4 Shuffle Split . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.5 Principal Component Analysis (PCA) . . . . . . . . . . . . . . . . . . . . . . . . . . 20

4 Methodology 21
4.1 Machine Learning and Finance Research . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.2 Creation of the dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.3 Testing the algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.3.1 Testing with Cross Validation (CV) . . . . . . . . . . . . . . . . . . . . . . . 24
4.3.2 Testing with Cross Validation and Shuffle Split . . . . . . . . . . . . . . . . . 24
4.3.3 Testing with Principal Component Analysis . . . . . . . . . . . . . . . . . . . 24

5 Results 24
5.1 First dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.2 Second dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.3 Third dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.4 Analysis of the results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

6 Economic and Environmental Impact 31
6.1 Economical analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
6.2 Environmental Impact . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

7 Conclusions and next steps 32


7.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
7.2 Next Steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

References 34

Annex 1: Jupyter templates 35

Annex 2: Gantt Diagram 39

List of Figures
1 Candle graph of the Apple stock (AAPL) with some technical indicators. Image
from: Plus 500 trading platform[11] . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2 Candle graphic of the Tesla stock (TSLA) showcasing a bullish market with the three
EMAs: blue(6), red(70), white(200). Image from: Plus 500 trading platform[11] . . . 12
3 Candle graphic of the Gilead stock (GILD) showcasing a bear market with the three
EMAs: blue(6), red(70), white(200). Image from: Plus 500 trading platform[11] . . . 12
4 Candle graphic of the Ford stock (F) showcasing the selling point (red dotted line)
and the buying point (white dotted line) using the MACD indicator. Image from:
Plus 500 trading platform[11] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
5 Perceptron schema. Image from: DeepAI webpage[5] . . . . . . . . . . . . . . . . . . 16
6 K-NN example with k=3 and k=7. Image from: Data Camp webpage[4] . . . . . . . 17
7 3-category SVM examples using Linear kernel, RBF kernel and Polynomial kernel
using the Iris dataset. Image from: Scikit-learn webpage [14] . . . . . . . . . . . . . 19
8 Cross Validation example with k = 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
9 Shuffle Split example with 4 iterations. Image from: Scikit-learn webpage[15] . . . . 20
10 Architecture implemented in the project . . . . . . . . . . . . . . . . . . . . . . . . . 22
11 Dataset Creation Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
12 SVM’s time comparison between models and datasets . . . . . . . . . . . . . . . . . 28
13 Random Forest’s time comparison between models and datasets . . . . . . . . . . . . 28
14 Random Forest’s time comparison between models and datasets . . . . . . . . . . . . 29
15 One configuration of each algorithm with the lowest time and best accuracy. The
MLP and KNN values are difficult to see in the graph as they are too similar to the
ones obtained with Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . 30
16 Accuracy results for the KNN algorithm when modifying the number of neighbors . 30
17 Accuracy results for the Random Forest algorithm when modifying the number of
trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
18 Cost calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
19 Gantt diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

List of Tables
1 Number of rows and stocks for each dataset . . . . . . . . . . . . . . . . . . . . . . . 23
2 Results for the first dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3 Results for the second dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4 Results for the third dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
5 Best results for each algorithm and the iteration of the dataset . . . . . . . . . . . . 29
6 MLP with CV results when increasing the number of layers . . . . . . . . . . . . . . 31

1 Introduction
In this fast-paced world we live in, it is impossible to be an expert in everything; that is why we use technology to help us with many day-to-day activities. Furthermore, we want technology to improve our quality of life in many ways. Technology has given regular people access to the stock market, bringing new opportunities to many. However, taking advantage of those opportunities without the proper knowledge is seemingly impossible, and Artificial Intelligence has entered the game to level the playing field. Through the use of technology and AI in the stock market, everyone will be able to buy and sell stocks with a certain probability of success.
Two main approaches have been taken in the field of stock prediction. The first approach is mostly focused on stocks as long-term investments through fundamental analysis1. With this kind of study, the stock value will ideally trend towards the predicted intrinsic value. This approach does not work appropriately for short- or mid-term predictions because it does not indicate a stock's movement. On the other hand, an increasing trend in the trading community has been to predict using technical analysis2. Focusing our attention on the stock market's direction may help us predict in more volatile scenarios, although the window of time we will be able to predict will be shorter.
This project uses technical analysis and has two main goals. The first one is to find out the best algorithm to predict whether a stock will go up or down in value. The second one is to try to stretch the prediction window to a range of 3 to 5 months.
The main milestones of this project are obtaining the data, storing and processing it to create a dataset, using different algorithms to predict whether a stock will go up or down in value, and finally checking the results to see which algorithm best predicted the trends of the many stocks.
This thesis is structured by stating the context of the project, the objectives, and the methodology. Finally, it presents the results, the environmental and economic impact, and the project's conclusions.
During the thesis's execution, some deviations appeared, mainly because of the difference in the formatting of the data between the API where it was collected and Excel. A significant constraint found during the first part of the project, data collection, was the limitation on the number of queries per day and per second in the API.
The whole process followed during the thesis is represented in the Gantt diagram in Annex 2, figure 19.
1 Fundamental analysis is a method of assessing the intrinsic value of a security by analyzing various macroeconomic and microeconomic factors. The ultimate goal of fundamental analysis is to quantify the intrinsic value of a security. [8]
2 Technical analysis is a method used to predict the probable future price movement of a security – such as a stock or currency pair – based on market data. [22]

Figure 1: Candle graph of the Apple stock (AAPL) with some technical indicators. Image from:
Plus 500 trading platform[11]

2 Objectives
In this undergraduate thesis, there are two main objectives. One of them is to create a predictor that can determine whether a stock will go up or down in value in a medium time range (3 to 5 months) using technical analysis. The second one is to figure out which algorithm will help us better predict the stock's movement.

2.1 Achieve a 3 to 5 months prediction


As we stated earlier, technical analysis is mainly used in trading to predict a stock’s movement in a
short time range, from hours to a month or two. This makes technical analysis an excellent method
to buy and sell repeatedly during a short period. To do so, two things may be required:

• The use of a bot.

• Constant monitoring of the stock market.

However, not many people have the knowledge or the time to do either of those things.
This project aims to stretch this constraint to a more extended time period, reducing the number of movements3 the investor has to make and removing the need for constant monitoring of the stock market.

2.2 Figure out the best algorithm to make the prediction


Many implemented algorithms can be used to make the prediction for this thesis; hence choosing one using intuition alone would be a mistake. Each algorithm has its strengths and weaknesses, and it is in the scope of this project to find the most suitable one to make this prediction.
To complete the first objective, we have to think of the prediction as a classification problem4, because we will have to decide whether the stock value will go up (1) or down (0). There are many classification techniques, but we will be focusing on the following:
3 By a movement we refer to the action of buying or selling a stock
4 Classification is the process of predicting a label from given data points

• Logistic Regression

• Multi Layer Perceptron (MLP)

• K-Nearest Neighbors (KNN)

• Random Forest

• Support Vector Machine (SVM)

To figure out which of the previous algorithms is best suited, we will compare their accuracy, the deviation of the accuracy, and the time it takes to complete the training with each algorithm.

3 State of the art of the technology used or applied in this thesis
In this section, there will be an in-depth description of the indicators used to build the prediction
model as well as the different algorithms used during the whole process.

3.1 Indicators
The indicators that were calculated and added to the dataset were the following:

• Exponential Moving Average (EMA)

• Moving Average Convergence Divergence (MACD)

3.1.1 Exponential Moving Average (EMA)

The exponential moving average tracks the stock value and is a type of weighted moving average
that gives more importance to the most recent data. The algorithm used to calculate the different
EMA in the project is the following:
def calculate_exponential_moving_average(self, list):
    # Exponentially decaying weights over the last self.mean_values days, normalized to sum to 1
    weights = np.exp(np.linspace(-1., 0., self.mean_values))
    weights /= weights.sum()
    # Weighted moving average of the series; the first mean_values entries are replaced
    # with the first fully weighted value to avoid the warm-up artifact
    ema = np.convolve(list, weights)[:len(list)]
    ema[:self.mean_values] = ema[self.mean_values]
    return ema

Where self.mean_values is the number of days used in the moving average and list is the array the EMA is calculated from.
In the dataset, three EMAs were used, with mean values of 6, 70 and 200. This method is explained in the book Ganar en la bolsa es posible by Josef Ajram[1]. The idea behind it is to have a clear trigger to buy or sell:

• If the stock value is above the 6 days EMA, the 6 days EMA is above the 70 days EMA, and this one is above the 200 days EMA, then the market has a higher probability of continuing upwards.

Figure 2: Candle graphic of the Tesla stock (TSLA) showcasing a bullish market with the three
EMAs: blue(6), red(70), white(200). Image from: Plus 500 trading platform[11]

• If the stock value is below the 6 days EMA, the 6 days EMA is below the 70 days EMA, and this one is below the 200 days EMA, then we are in a bear market situation and it might be a good time to sell, but never to buy.

Figure 3: Candle graphic of the Gilead stock (GILD) showcasing a bear market with the three
EMAs: blue(6), red(70), white(200). Image from: Plus 500 trading platform[11]

The 200 days EMA represents the mean if we look back for a year (a long time), the 70 days EMA represents the mean if we look back a medium amount of time, and the 6 days EMA represents the mean if we look back a short amount of time. With these three indicators, we can see the different trends in the long, medium, and short term. Furthermore, with the logic explained previously, we can use these indicators as a trigger to know whether we have to buy, sell, or maintain the position.
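As an illustration, a minimal sketch of this triple-EMA trigger rule in Python (the function name and the returned labels are illustrative and assume the three EMA series have already been computed with the method above; this is not code taken from the project):

def ema_trigger(price, ema6, ema70, ema200):
    # Bullish alignment: price above the short EMA and the EMAs ordered short > medium > long
    if price > ema6 > ema70 > ema200:
        return "buy"
    # Bearish alignment: price below the short EMA and the EMAs ordered short < medium < long
    if price < ema6 < ema70 < ema200:
        return "sell"
    # Otherwise the three trends disagree and the position is maintained
    return "hold"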

3.1.2 Moving Average Convergence Divergence (MACD)

The MACD indicator is composed of three functions:

• MACD
• Signal
• Histogram

The MACD function is the difference between a 12-period EMA of the closing values and a 26-period EMA of the closing values. These two indicators are called the fast EMA and the slow EMA respectively.

MACD = 12-period EMA − 26-period EMA

The Signal function is a nine-day EMA of the MACD, and it is used as a trigger signal. When the Signal function crosses above the MACD, it indicates a shift towards a bearish trend: it is a selling call. When the Signal function crosses below the MACD, it suggests a shift towards a bullish trend: it is a buying call.
To help visualize the previously mentioned outcomes, the Histogram function is usually represented as well. It is the difference between the MACD and Signal functions. When the histogram reaches zero, it indicates a change in the market's trend, so we have to check its value before it reaches 0. If the histogram value is positive before reaching zero, it indicates a change towards a bear market trend; if it is negative before reaching zero, it indicates a change towards a bullish market.

Figure 4: Candle graphic of the Ford stock (F) showcasing the selling point (red dotted line) and
the buying point (white dotted line) using the MACD indicator. Image from: Plus 500 trading
platform[11]

To calculate the previous functions the following algorithm was used:
def calculate_macd(self, data, name):
    # Extract the chosen price field (open, close, high or low) from every candle
    info = []
    for d in data:
        info.append(d[name])
    slow_m_pred = mean_predictor(26)
    fast_m_pred = mean_predictor(12)
    nine_m_pred = mean_predictor(9)
    slow = slow_m_pred.calculate_exponential_moving_average(info)
    fast = fast_m_pred.calculate_exponential_moving_average(info)

    macd = fast - slow        # MACD line: fast EMA minus slow EMA
    signal = nine_m_pred.calculate_exponential_moving_average(macd)
    hist = macd - signal      # Histogram: MACD minus Signal
    time = []
    for da in data:
        time.append(da["datetime"])
    return time, macd, hist, signal

The data parameter is an array with candle like objects5 . The name variable represents the
value we want to calculate the MACD with (open, close, high, low). In this project the MACD
function was calculated with the closing values of each stock.
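As a usage illustration, a hedged sketch of how the Histogram's sign change could be turned into buy and sell calls (the indicator instance, the candles list and the signal encoding are assumptions made for this example, not taken from the project code):

# candles is assumed to be a list of candle-like dictionaries (see footnote 5)
time, macd, hist, signal = indicator.calculate_macd(candles, "close")

calls = []
for i in range(1, len(hist)):
    if hist[i - 1] < 0 <= hist[i]:
        calls.append((time[i], "buy"))    # histogram was negative and reaches zero: bullish shift
    elif hist[i - 1] > 0 >= hist[i]:
        calls.append((time[i], "sell"))   # histogram was positive and reaches zero: bearish shift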

3.2 Classification algorithms


As mentioned in section 2.2 (Figure out the best algorithm to make the prediction), the following algorithms have been used during the thesis:

• Logistic Regression
• Multi Layer Perceptron (MLP)
• K-Nearest Neighbors (KNN)
• Random Forest
• Support Vector Machine (SVM)

3.2.1 Logistic Regression

When we think of a classification between two outcomes the first algorithm that comes to our mind
is Binary Logistic Regression. Naturally this is the first algorithm tried in this project because it
can be interpreted as knowing the probability of a stock going up.
Generally, to train a Binary Logistic Regression predictor we have to follow these steps[10]:

1. Create a weight matrix (W) and multiply it by the input variables (X), X being a matrix with m rows and n features:
5 A candle-like object is a dictionary with the open, high, low, close and volume values

a = w0 + w1 ∗ x1 + w2 ∗ x2 + ... + wn ∗ xn

2. Use the sigmoid6 function to do the transformation from all real numbers to a space of 0 to
1:

ypred = 1 / (1 + e^(−a))

3. Calculate the cost function for that iteration:

costw = −(1/m) Σi=1..m [ yi log(ypredi) + (1 − yi) log(1 − ypredi) ]

4. Obtain the gradient of the cost function:

dwj = Σi=1..m (ypredi − yi) xij

5. Update the weight matrix W:

wj = wj − α · dwj

Where α is the learning rate.
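A minimal NumPy sketch of these five steps (purely didactic, with an assumed fixed learning rate and iteration count; the project itself relies on the scikit-learn implementation described below):

import numpy as np

def train_logistic_regression(X, y, alpha=0.1, iterations=1000):
    m, n = X.shape
    w = np.zeros(n)                          # weight matrix W (here a vector, bias term omitted)
    for _ in range(iterations):
        a = X @ w                            # step 1: weighted sum of the inputs
        y_pred = 1 / (1 + np.exp(-a))        # step 2: sigmoid transformation
        cost = -(1 / m) * np.sum(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))  # step 3 (tracked only)
        dw = X.T @ (y_pred - y)              # step 4: gradient of the cost
        w = w - alpha * dw                   # step 5: weight update
    return w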

All of this is implemented in the scikit-learn module sklearn.linear_model.LogisticRegression[17], used with the following parameters:

• Solver: SAGA

• Penalty: L2
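In scikit-learn these choices translate into the constructor call below (a sketch of the instantiation only; the full evaluation code is shown in Annex 1):

from sklearn.linear_model import LogisticRegression

# SAGA solver with an L2 penalty, as listed above
classifier = LogisticRegression(solver='saga', penalty='l2')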

3.2.2 Multi Layer Perceptron (MLP)

To explain how a multi-layer perceptron works, first we need to understand what a single perceptron does. The perceptron does a weighted sum of all of the inputs it is given to create an output, and an activation function is applied at the end of the perceptron.
6 Sigmoid function: y = 1 / (1 + e^(−x))

Figure 5: Perceptron schema. Image from: DeepAI webpage[5]

An MLP is made of different perceptrons combined to create a fully-connected neural network. At the last layer a step function is added to create a binary classifier. There are different activation functions; the most common are:

• Linear

• Sigmoid

• Tanh

• ReLU

• Leaky ReLU

• Softmax

The MLP has been implemented using the scikit-learn module sklearn.neural_network.MLPClassifier[18] with the following parameters:

• Hidden Layers: (7, 6, 5, 4, 3, 2, 1)

• Activation function: ReLU

• Learning Rate: invscaling
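A sketch of the corresponding scikit-learn instantiation (only the parameters listed above are set; everything else is left at its default value):

from sklearn.neural_network import MLPClassifier

# Seven hidden layers shrinking from 7 to 1 neurons, ReLU activations, inverse-scaling learning rate
classifier = MLPClassifier(hidden_layer_sizes=(7, 6, 5, 4, 3, 2, 1),
                           activation='relu',
                           learning_rate='invscaling')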

3.2.3 K-Nearest Neighbors (KNN)

The K-NN algorithm is a classification algorithm that uses the distance between the data point we want to predict and the data points we already have in order to predict the category of the new point.
The prediction can be done with the Euclidean distance or any other distance we decide. Another important variable to choose is the number of data points (k) that will be used to predict the class of the new data entry. The k cannot be too high in case we have a small amount of training data points, but it cannot be too low either because, while a low k may have a lower bias, it may introduce a higher variance.

Figure 6: K-NN example with k=3 and k=7. Image from: Data Camp webpage[4]

In this project the K-NN has been implemented with the scikit-learn module:
sklearn.neighbors.KNeighborsClassifier [16] and the following parameters:

• Number of neighbors: 200

• Weights: distance

• Number of jobs: 6

• Leaf size: 30
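The corresponding scikit-learn instantiation would look like the following sketch (remaining parameters left at their defaults):

from sklearn.neighbors import KNeighborsClassifier

# 200 neighbors weighted by inverse distance, 6 parallel jobs, leaf size of 30
classifier = KNeighborsClassifier(n_neighbors=200, weights='distance', n_jobs=6, leaf_size=30)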

3.2.4 Random Forest

The Random Forest algorithm consists of a large ensemble of decision trees. The algorithm works on the basis that a large number of relatively uncorrelated models (trees) operating as a committee will outperform any of the individual constituent models. Thus, a low correlation between the decision trees is crucial.
A decision tree is able to categorize a data entry into one of a set of given classes, in our case 1 or 0. An ensemble of decision trees obtains the result of each decision tree and determines the category by comparing the number of predicted 1s with the number of predicted 0s.
To ensure the low correlation between models in a random forest the algorithm uses two methods:

• Bagging: each tree uses the same amount of data from the dataset (N), although it is randomly sampled from the dataset with replacement.
• Feature randomness: each tree in a random forest uses a random subset of features selected from the original ones.

In this project the Random Forest algorithm has been implemented using the scikit-learn module
sklearn.ensemble.RandomForestClassifier[19] and the following parameters:

• Number of estimators: 300

• Criterion: entropy
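As a sketch, the instantiation with these parameters would be:

from sklearn.ensemble import RandomForestClassifier

# 300 trees grown with the entropy (information gain) splitting criterion
classifier = RandomForestClassifier(n_estimators=300, criterion='entropy')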

3.2.5 Support Vector Machine (SVM)

The Support Vector Machine algorithm tries to find a hyperplane in an N-dimensional space that separates the data points into different categories. In this project, the hyperplane divides the N-dimensional space into the two classes, 0 and 1. Depending on the problem we face, a Linear SVM, an RBF (Gaussian) SVM, or a Polynomial SVM may be used. These variants differ in the kernel they use:

Figure 7: 3-category SVM examples using a Linear kernel, an RBF kernel and a Polynomial kernel on the Iris dataset. Image from: Scikit-learn webpage [14]

The kernel we use in the algorithm determines the shape of the function that classifies the data points.
In this project the Support Vector Machine algorithm has been implemented using the scikit-learn module sklearn.svm.SVC[20], with the RBF (Gaussian) kernel and the default parameters.
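A sketch of the instantiation (the RBF kernel is selected explicitly; all other hyperparameters keep their scikit-learn defaults):

from sklearn.svm import SVC

# Gaussian (RBF) kernel with default hyperparameters
classifier = SVC(kernel='rbf')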

3.3 Cross Validation


Cross Validation is a technique used to obtain a more accurate evaluation of the predictor being used. When we use Cross Validation, the dataset is divided into different subsets and, in each iteration, one subset is used as a validation set while the remaining ones are used for training. This way Cross Validation can be used to find the best parameters to train our model. Once each model is completed, its accuracy is saved, so at the end of all the iterations we have K accuracy scores7.
In this project we use Cross Validation with the complete dataset, because once all the models are completed we use the different scores to obtain the mean and variance of the model's accuracy. This decision was made because the amount of data in the first and second datasets was thought to be too low to divide into three subsets (testing, validation and training).
7 The number of splits in the Cross Validation

Figure 8: Cross Validation example with k = 4

3.4 Shuffle Split


Shuffle Split is a random permutation cross-validator. In this case, instead of splitting the dataset into k disjoint sets, it is split into k randomly selected sets. With Shuffle Split it is possible for two sets to be identical, although it is highly improbable.

Figure 9: Shuffle Split example with 4 iterations. Image from: Scikit-learn webpage[15]

3.5 Principal Component Analysis (PCA)


Principal Component Analysis is used to reduce the dimensionality of large datasets by projecting a large set of variables onto a smaller one that still contains most of the information of the original set. It is important to use PCA with some sort of scaling, because it is quite sensitive to the variances of the variables: if there are large differences between the ranges of the initial values, the variables with larger ranges will dominate over those with small ranges.

4 Methodology
The process followed in this project has been: researching the important topics (Machine Learning algorithms and finance), implementing the data retrieval process to create the dataset, and writing the code to test each algorithm. For the last two parts of the project there were three trials to test the algorithms with different datasets. Finally, all results were stored in an Excel spreadsheet to facilitate the comparison amongst all algorithms.

4.1 Machine Learning and Finance Research


To learn about Machine Learning and classification algorithms, the course Machine Learning A-Z: Hands-On Python & R In Data Science[7] was followed; it focuses on how to use the different algorithms and takes a mainly practical approach to Machine Learning. Another course used to research and obtain knowledge was the Deep Learning Specialization Course[10], which takes a mainly theoretical approach to Neural Networks. The master's thesis Predicting Stock Prices Using Technical Analysis and Machine Learning[9] has an introduction about using the crossing of Moving Averages as a signal to know whether to buy or sell a stock, which inspired me to research this topic. After finding this thesis, the book Ganar en la bolsa es posible[1] was consulted to obtain more information about signals that indicate a buy or sell call with Moving Averages.

4.2 Creation of the dataset


To begin the process of retrieving data, different APIs were explored. Among them were the Yahoo Finance API[23] and the Alpha Vantage API[2], the latter being the one chosen for the project. The choice was based on the structure of the data obtained from each API and the queries needed to receive said information.
The next important decision was to choose between two options: creating the dataset by obtaining the data directly from the API, or storing the information first in a database to process it later and build the dataset. The decision was guided by an important constraint; the Alpha Vantage API has a limitation on the number of queries that can be made per second and per day. Hence, a database was needed to store as much data as possible and then create the dataset directly from the database. MongoDB was used on AWS8 to keep the information, since the lack of a fixed structure allows more freedom in creating the objects of the database.
After all the previous choices were made, a Python module[13] was created to upload and download the information from the database using pymongo[12]. A script was written to collect all of the daily data from several randomly selected stocks and upload it to the database. At each of the three trials, more stocks were added to the list, up to the last execution of the code. In the end, the stocks were selected from a file containing all NASDAQ9 stocks, until the database ran out of space.
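As an illustration, a minimal sketch of how the daily candles could be uploaded with pymongo (the connection string and the database and collection names are placeholders, and the candles list is assumed to have been fetched from the API beforehand):

from pymongo import MongoClient

client = MongoClient("mongodb://<aws-host>:27017/")   # placeholder connection string
collection = client["stocks"]["daily_candles"]        # assumed database and collection names

# candles: list of candle-like dictionaries for one stock, already downloaded from the API
documents = [{"symbol": "AAPL", **candle} for candle in candles]
collection.insert_many(documents)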
The whole process is better described in figure 10, which shows the network architecture involved in the project.
8 Amazon Web Service
9 The list of the stocks was obtained using FTP to ftp.nasdaqtrader.com

Figure 10: Architecture implemented in the project

Another script was written to download from the database all the information for each stock, from its IPO10 to the present date. Each stock value was compared to the values from 3 to 5 months later: if the maximum future value was bigger than the current value, the row was labeled as a buying opportunity (1), and if the minimum future value was smaller than the current value, the row was labeled as a selling opportunity (0). Finally, one more module was programmed to calculate the different indicators used in the predictions, which were added to each row. All of the rows were written to .txt, .xlsx and .csv files, to be used as a dataset.
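A simplified sketch of this labeling rule (the 21-trading-day month, the helper name and the order in which the two conditions are checked are assumptions for the example, not taken verbatim from the project code):

def label_row(closes, i, month=21):
    # Closing values between 3 and 5 months after day i
    future = closes[i + 3 * month : i + 5 * month]
    if not future:
        return None          # not enough future data to label this row
    if max(future) > closes[i]:
        return 1             # buying opportunity
    if min(future) < closes[i]:
        return 0             # selling opportunity
    return None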
For all the versions of the dataset, it was composed11 of the following columns:

• Stock name

• Initial Date

• Initial Value

• 6 day EMA

• 70 day EMA

• 200 day EMA

• Histogram function

• MACD function

• Signal function

• Result

10 Initial Public Offering
11 Columns with the final value, EMAs and MACD were added to the dataset in case they were needed in some stages of the project, although they were never used

Three versions of the dataset were created, each with an increasing number of rows. Thanks to the automation of the data collection, the diversity of the data increased with each version:

Version of the dataset   Number of rows   Number of different stocks added
1                         92120           170
2                        101509           190
3                        720668          1867

Table 1: Number of rows and stocks for each dataset

All the code used to collect the data and create the dataset is in a GitHub repository that can be accessed using the link in the references section [3].
As we can see, the number of different stocks in the last dataset increases drastically, thus raising the diversity.
The schema of the creation of the dataset as described in this section is as follows:

Figure 11: Dataset Creation Process

4.3 Testing the algorithms
Once the first version of the dataset was obtained, the creation of the algorithms began. Using Jupyter Notebook, three different templates were created. Each template has different parameters and is used with every algorithm. The process of calculating the scores for each algorithm was done three times, once for each dataset.

4.3.1 Testing with Cross Validation (CV)

The template used for the testing with Cross Validation, as shown in Annex 1, enables us to observe the different results with Standard Scaling for each algorithm using Cross Validation. It is important to note that a timer is set at the beginning and end of the cross validation to measure the time it takes to run the whole algorithm. For all algorithms the dataset is split into 10 subsets to do the cross validation.

4.3.2 Testing with Cross Validation and Shuffle Split

In this template the Shuffle Split function is added before the Cross Validation to generate random splits of the dataset; in this case the dataset is not evenly separated. The same timer is set before doing the Cross Validation so we know how long it takes to obtain the results. The Shuffle Split is done with 10 splits and a test size of 0.2. The code used in the Jupyter notebook can be seen in Annex 1.

4.3.3 Testing with Principal Component Analysis

The last template uses Principal Component Analysis combined with Standard Scaling before using Cross Validation in combination with Shuffle Split to obtain the accuracy scores. A timer was added to calculate the amount of time needed to obtain said scores. The template can be seen in Annex 1.

5 Results
The goal of this section is to show the results for each iteration done with the different datasets, as well as presenting the results in a "user friendly" manner.
It is important to define two terms that will be repeated during the whole section. The mean accuracy is the mean of all the accuracy scores obtained from the Cross Validation for each algorithm; the accuracy is the proportion of correctly predicted cases over the total number of cases. The deviation of the accuracy is the standard deviation of the accuracy scores obtained during the Cross Validation.

5.1 First dataset
The first dataset was composed of 92120 rows and included 170 well-known different stocks.

Algorithm name                                    Time spent training   Mean accuracy   Deviation of the accuracy
Logistic Regression with CV                       0:00:03               65%             0%
Logistic Regression with CV & ShuffleSplit        0:00:03               65%             0%
Logistic Regression with CV, ShuffleSplit & PCA   0:00:05               52%             2%
Random forest with CV                             0:01:39               66%             1%
Random forest with CV & ShuffleSplit              0:09:18               68%             1%
Random forest with CV, ShuffleSplit & PCA         0:12:50               61%             0%
MLP with CV                                       0:00:34               65%             0%
MLP with CV & ShuffleSplit                        0:00:35               65%             0%
MLP with CV, ShuffleSplit & PCA                   0:01:04               64%             0%
K-NN with CV                                      0:00:26               65%             0%
K-NN with CV & ShuffleSplit                       0:00:23               65%             0%
K-NN with CV, ShuffleSplit & PCA                  0:00:22               65%             0%
Gaussian SVM with CV                              1:59:43               65%             0%
Gaussian SVM with CV & ShuffleSplit               3:55:05               65%             0%
Gaussian SVM with CV, ShuffleSplit & PCA          0:40:06               65%             0%

Table 2: Results for the first dataset

5.2 Second dataset
The second dataset was composed of 101509 rows and included 190 well-known different stocks.

Algorithm name                                    Time spent training   Mean accuracy   Deviation of the accuracy
Logistic Regression with CV                       0:00:05               64%             1%
Logistic Regression with CV & ShuffleSplit        0:00:04               64%             1%
Logistic Regression with CV, ShuffleSplit & PCA   0:00:07               53%             1%
Random forest with CV                             0:06:07               66%             1%
Random forest with CV & ShuffleSplit              0:10:09               68%             0%
Random forest with CV, ShuffleSplit & PCA         0:14:01               61%             1%
MLP with CV                                       0:01:41               64%             0%
MLP with CV & ShuffleSplit                        0:01:20               64%             0%
MLP with CV, ShuffleSplit & PCA                   0:01:19               65%             0%
K-NN with CV                                      0:00:16               64%             0%
K-NN with CV & ShuffleSplit                       0:00:25               64%             1%
K-NN with CV, ShuffleSplit & PCA                  0:00:29               64%             0%
Gaussian SVM with CV                              2:20:49               65%             0%
Gaussian SVM with CV & ShuffleSplit               4:10:38               65%             0%
Gaussian SVM with CV, ShuffleSplit & PCA          0:48:56               64%             1%

Table 3: Results for the second dataset

5.3 Third dataset
The third dataset was composed of 720668 rows and included 1867 different stocks. The results for the SVM algorithm could not be obtained due to the large number of samples and the processing power needed.

Algorithm name                                    Time spent training   Mean accuracy   Deviation of the accuracy
Logistic Regression with CV                       0:00:57               63%             0%
Logistic Regression with CV & ShuffleSplit        0:00:50               63%             0%
Logistic Regression with CV, ShuffleSplit & PCA   0:00:38               50%             0%
Random forest with CV                             1:11:40               70%             2%
Random forest with CV & ShuffleSplit              2:52:58               71%             0%
Random forest with CV, ShuffleSplit & PCA         1:49:26               66%             0%
MLP with CV                                       0:03:30               63%             0%
MLP with CV & ShuffleSplit                        0:09:13               64%             3%
MLP with CV, ShuffleSplit & PCA                   0:08:17               63%             0%
K-NN with CV                                      0:05:55               63%             0%
K-NN with CV & ShuffleSplit                       0:05:16               64%             1%
K-NN with CV, ShuffleSplit & PCA                  0:07:31               63%             0%
Gaussian SVM with CV                              N/A                   N/A             N/A
Gaussian SVM with CV & ShuffleSplit               N/A                   N/A             N/A
Gaussian SVM with CV, ShuffleSplit & PCA          N/A                   N/A             N/A

Table 4: Results for the third dataset

5.4 Analysis of the results
As we can see in the tables above, for most cases the usage of Principal Component Analysis has diminished the training time, especially when used before the Support Vector Machine algorithm, as seen in figure 12.

Figure 12: SVM’s time comparison between models and datasets

PCA works best when used on a dataset with a high number of features and samples; when we increase the number of samples, the time spent doing Cross Validation decreases, as seen in figure 13.

Figure 13: Random Forest’s time comparison between models and datasets

When we take the accuracy into account in these graphs, we can see that, in the case of the Random Forest algorithm, the slope when using Principal Component Analysis is higher than when only using Shuffle Split. This means that the more we increase our dataset, the more we will need PCA to avoid higher training times. However, for the amount of data contained in the first three versions of the dataset, the algorithm without PCA obtains a much more accurate prediction.

Figure 14: Random Forest’s time comparison between models and datasets

Table 5 shows the best configurations for each algorithm. As we can see, most of them come from the first version of the dataset.

Algorithm name                               Version of the dataset   Time spent training   Mean accuracy   Deviation of the accuracy
Random forest with CV & ShuffleSplit         3                        2:52:58               71%             0%
MLP with CV                                  1                        0:00:34               65%             0%
KNN with CV, ShuffleSplit & PCA              1                        0:00:22               65%             0%
SVM (Gaussian) with CV, ShuffleSplit & PCA   1                        0:40:06               65%             0%
Logistic regression with CV                  1                        0:00:03               65%             0%

Table 5: Best results for each algorithm and the iteration of the dataset

Observing figure 15, we can see the evolution of the previous best configurations throughout the different datasets. For all algorithms except Random Forest, the mean accuracy decreases when the amount of data is increased. This may be due to overfitting on the first version of the dataset: the low amount of data and the low diversification of stocks may have caused overfitting in the models, thus resulting in a better accuracy (65%) with the first dataset. As the diversification of the dataset (number of different stocks) grew, the mean accuracy decreased.

Figure 15: One configuration of each algorithm with the lowest time and best accuracy. The MLP and KNN values are difficult to see in the graph as they are too similar to the ones obtained with Logistic Regression

Another possible explanation for the evolution in figure 15 is that the drop in accuracy may be caused by keeping the algorithms' parameters constant: the amount and diversity of the data were increased while the parameters of all the algorithms remained the same, so the ability of each algorithm to predict decreased.
To try to determine which case applies to this project, a small experiment was done in which some of the algorithms' parameters were modified to try to better fit the last dataset.

Figure 16: Accuracy results for the KNN algorithm when modifying the number of neighbors

As we can see in the previous figure (16), increasing the number of neighbors does not increase the accuracy.
The same happens when we increase the number of layers in the MLP and change the number of perceptrons in the layers, as we can see in the table below:

Layers                  Number of layers   Mean accuracy   Deviation of the accuracy
7,6,5,4,3,2,1           7                  63%             0%
7,7,6,5,4,3,2,1         8                  63%             2%
7,8,9,5,10,3,5,1        8                  63%             3%
7,7,6,5,4,5,2,1         8                  63%             0%
7,6,7,8,5,4,5,2,1       9                  63%             0%
7,6,6,7,8,5,4,5,2,1     10                 63%             0%

Table 6: MLP with CV results when increasing the number of layers

When increasing the number of trees in the Random Forest algorithm, the maximum accuracy still remains the one obtained with the initial parameters:

Figure 17: Accuracy results for the Random Forest algorithm when modifying the number of trees

6 Economic and Environmental Impact

6.1 Economical analysis


The economic analysis of this project isn't complex: all of the code has been programmed during the thesis or is Open Source (scikit-learn, pymongo, LaTeX).
All of the thesis has been done on a computer, and the data has been stored in a database using an AWS server with a cost of 0.65 euros/h. The total cost of the server has been 468 euros; to simplify the calculations we assume the database server was up and running during the whole six months.

Adding up the usual materials such as paper, pens, chair and desk, the average salary of a junior engineer from UPC, and the utilities, the result is a total cost of 9,936.00 euros.

Figure 18: Cost calculation

6.2 Environmental Impact


The ability to bring the stock market to everyone has incredible potential, and may impact the environment in unexpected ways. All companies have an effect on the environment, some more than others, and increasing the number of people investing in those companies might empower them to be even more ruthless in some cases.
Thankfully, more and more people are concerned with global warming and climate change, and even though a company might be more profitable while polluting, the number of people who invest only in environmentally friendly companies or funds has increased.
An additional aspect to mention in this section is the environmental impact of having a computer/database server, especially if it is running all of the time, even when it is not being used. To simplify the calculations in the economic analysis (section 6.1) we assumed that the server was up and running during the six months, but this wasn't really the case; the server was only running when it was being used.

7 Conclusions and next steps


The goal of this section is to review the results and compare them with similar projects to create some context for the thesis, as well as to define the next steps that can be taken to further develop a predictor that may help the average citizen.

7.1 Conclusions
After analyzing the results in section 5, we can observe that Random Forest has outperformed the other algorithms, especially when increasing the diversity of data in the dataset. The more data, and the more diverse the data, the better the Random Forest algorithm performed12, while the remaining algorithms' accuracy diminished when the dataset grew.
12 The accuracy increased and the deviation decreased

During the analysis of the results (section 5.4) we observed that the testing accuracy did not improve when using more training data for most algorithms. Two explanations were discussed: either the decrease in accuracy was caused by the fact that the algorithms' parameters were not increased when more training data was used, or it was caused by overfitting on the first dataset due to the poor diversity of stocks.
Regarding the first explanation, we have seen that the accuracy did not improve when increasing the number of parameters. This means that the models did not decrease their performance because of the choice of the number of parameters.
In figure 15 we can see that the only algorithm whose accuracy has not decreased is Random Forest. This is due to the nature of the algorithm: as explained in section 3.2.4, the algorithm is an ensemble, which means that it is composed of many decision trees (300 in this project). An ensemble works under the assumption that many uncorrelated errors average out to zero. Since each tree learns from different subsets of our data, the trees are fairly uncorrelated from one another, making the Random Forest algorithm more robust to overfitting than the other algorithms. All of this would explain why every algorithm except Random Forest lost accuracy.
To create some context for the results, the article "Predicting the daily return direction of the stock market using hybrid machine learning algorithms"[6] discusses the results of Machine Learning projects that aim to predict the movement of a stock for the next day. In the article the authors mention that, for direction forecasting13, they obtain a lower accuracy (around 60%). Furthermore, since the aim of this project is to help regular people enter the world of stock markets, its users would otherwise have only a 50/50 chance of being right if their knowledge is null.
In conclusion, Random Forest is the algorithm chosen as the most suited to predict whether to buy or sell a stock over a medium time range14: even though its training time is far larger than the others', it achieves the highest accuracy (71%) and is more robust to overfitting.

7.2 Next Steps


To further develop this thesis, there are mainly two good follow-up projects that could be implemented:
• Create a bot to test the Machine Learning algorithms in real time.

• Use other algorithms to predict the stock's future value over the same range of time.

13 Direction forecast refers to the prediction of the trend, up or down


14 Medium means 3 to 5 months as explained in the Abstract

References
[1] Josef Ajram. Ganar en la bolsa es posible. Plataforma Editorial, 2011.
[2] Alpha Vantage API. https://www.alphavantage.co/.

[3] Code in GitHub. https://gitfront.io/r/user-6644703/


1b98564a45c8f096a86892ec04283ce4ac2b0660/FinancialDataCollection/.
[4] DataCamp. https://www.datacamp.com/community/tutorials/
k-nearest-neighbor-classification-scikit-learn.
[5] DeepAI. https://deepai.org/machine-learning-glossary-and-terms/perceptron.

[6] Xiao Zhong & David Enke. Predicting the daily return direction of the stock market using
hybrid machine learning algorithms. 2019.
[7] Kirill Eremenko. Machine Learning A-Z: Hands-On Python and R In Data Science. https://www.udemy.com/course/machinelearning/learn/lecture/19678456#overview.

[8] Fundamental Analysis. https://corporatefinanceinstitute.com/resources/knowledge/


trading-investing/fundamental-analysis/.
[9] Jan Ivar Larsen. Predicting Stock Prices Using Technical Analysis and Machine Learning.
https://core.ac.uk/download/pdf/52104888.pdf, 2010.

[10] Andrew NG. Deep Learning Specialization Course. https://www.coursera.


org/specializations/deep-learning?utm_source=gg&utm_medium=sem&utm_
content=07-StanfordML-ROW&campaignid=2070742271&adgroupid=80109820241&
device=c&keyword=machine%20learning%20mooc&matchtype=b&network=g&
devicemodel=&adpostion=&creativeid=369041663186&hide_mobile_promo&gclid=
Cj0KCQjwk8b7BRCaARIsAARRTL5I3M5ATdzhXM2-7o5zXJB2SMWK3RgRB7f1v9ulpKjh8k8kDUf6W_
QaAmYFEALw_wcB#courses.
[11] Plus500. https://app.plus500.com/.
[12] PyMongo. https://pymongo.readthedocs.io/en/stable/.
[13] Python. https://www.python.org/.

[14] Scikit-Learn. https://scikit-learn.org/stable/modules/svm.html.


[15] Shuffle Split in Scikit-Learn. https://scikit-learn.org/stable/modules/cross_
validation.html.
[16] Sklearn K-Nearest Neighbors module. https://scikit-learn.org/stable/modules/
generated/sklearn.neighbors.KNeighborsClassifier.html.
[17] Sklearn Logistic Regression module. https://scikit-learn.org/stable/modules/
generated/sklearn.linear_model.LogisticRegression.html.

[18] Sklearn Multi Layer Perceptron module. https://scikit-learn.org/stable/modules/
generated/sklearn.neural_network.MLPClassifier.html.
[19] Sklearn Random Forest module. https://scikit-learn.org/stable/modules/generated/
sklearn.ensemble.RandomForestClassifier.html.

[20] Sklearn Suport Vector Machine module. https://scikit-learn.org/stable/modules/


generated/sklearn.svm.SVC.html.
[21] Richard O. Duda & Peter E. Hart & David G. Stork. Pattern Classification. John Wiley &
Sons Inc, 1973.

[22] Technical Analysis. https://corporatefinanceinstitute.com/resources/knowledge/


trading-investing/technical-analysis/.
[23] Yahoo Finance API. https://finance.yahoo.com/.

Annex 1: Jupyter templates

[1]: import numpy as np


import matplotlib.pyplot as plt
import pandas as pd

from datetime import datetime


from dateutil.relativedelta import relativedelta

from sklearn.preprocessing import StandardScaler


#Import classifier from sklearn
from sklearn.model_selection import cross_val_score

[ ]: #Import dataset
dataset = pd.read_csv('path/to/dataset')
X = dataset.iloc[:, 2:-9].values
y = dataset.iloc[:, -1].values

[ ]: #Scaling Data
sc = StandardScaler()
X = sc.fit_transform(X)

[4]: #Create Model


classifier =

[ ]: #KFold split:
start = datetime.now()
scores = cross_val_score(classifier, X, y, cv=10)
finish = datetime.now()
t_diff = relativedelta(finish, start)
print('{h}h {m}m {s}s'.format(h=t_diff.hours, m=t_diff.minutes, s=t_diff.seconds))

[ ]: #Print Accuracy with validation


print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

[ ]: import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from datetime import datetime


from dateutil.relativedelta import relativedelta

from sklearn.preprocessing import StandardScaler


#Importing classifier from sklearn
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import ShuffleSplit

[ ]: #Import dataset
dataset = pd.read_csv('path/to/dataset')
X = dataset.iloc[:, 2:-9].values
y = dataset.iloc[:, -1].values

[ ]: #Scaling Data
sc = StandardScaler()
X = sc.fit_transform(X)

[ ]: #Creating Model

[ ]: #Calculating the cv parameter


cv = ShuffleSplit(n_splits=10, test_size=0.2)

[ ]: #KFold split:
start = datetime.now()
scores = cross_val_score(classifier, X, y, cv=cv)
finish = datetime.now()
t_diff = relativedelta(finish, start)
print('{h}h {m}m {s}s'.format(h=t_diff.hours, m=t_diff.minutes, s=t_diff.seconds))

[ ]: #Print Accuracy
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

[ ]: import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from datetime import datetime


from dateutil.relativedelta import relativedelta

from sklearn.preprocessing import StandardScaler


#Import classifier model from sklearn
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import ShuffleSplit
from sklearn.decomposition import PCA

[ ]: #Import dataset
dataset = pd.read_csv('path/to/dataset')
X = dataset.iloc[:, 2:-9].values
y = dataset.iloc[:, -1].values

[ ]: #Scaling data
sc = StandardScaler()
X = sc.fit_transform(X)

[ ]: #Principal Component Analysis


n_samples = X[:, 0].size
n_features = X[0].size
pca = PCA(n_components = min(n_samples, n_features))
X = pca.fit_transform(X)

[ ]: #Create Model

[ ]: cv = ShuffleSplit(n_splits=10, test_size=0.2)

[ ]: #Cross Validation:
start = datetime.now()
scores = cross_val_score(classifier, X, y, cv=cv)
finish = datetime.now()
t_diff = relativedelta(finish, start)
print('{h}h {m}m {s}s'.format(h=t_diff.hours, m=t_diff.minutes, s=t_diff.seconds))

[ ]: #Print Accuracy
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Annex 2: Gantt Diagram

To help with the planning of the whole project, a Gantt diagram was made during the first two weeks to divide the thesis into smaller work packages that would contain some tasks.

Figure 19: Gantt diagram

During the last work packages some difficulties arose as the last dataset was formed. A change in the configuration of the computer modified the way the decimal numbers were interpreted, from a dot to a comma. The database information remained the same, therefore some errors would have been introduced into the models when the third training began. After analyzing the dataset, the error was spotted in some of the rows. The number of rows affected by this problem was small compared to the size of the dataset, so the decision was taken to remove these rows.

