Elisabetta Spina

X17167809
Final Project
August 2020

Table of contents

1) Introduction

2) Literature review

3) Methodology

3.1) Reasons behind the choice of the stocks to be analysed

3.2) General analysis on Johnson & Johnson, Lilly (Eli) & Co., Merck & Co., Mylan N.V., Perrigo
and Pfizer Inc

3.3) Johnson & Johnson, Lilly (Eli) & Co., Merck & Co and Pfizer Inc

4) ARIMA model

4.1) Prediction models on Johnson & Johnson and Eli Lilly – train vs test datasets

4.2) Johnson & Johnson future forecast model

4.2.1) Test for stationarity

4.2.2) ACF and PACF

4.2.3) Model summary and fit

4.2.4) Forecast

5) Conclusion

6) Bibliography and webliography

1) Introduction
My analysis will be centred around financial time series data and, more specifically, around stock
prices from January 2007 to August 2020.

The software I have decided to use for this purpose is R.

Given the vast number of stocks listed on the various stock exchanges, I have decided to focus on a
handful of stocks that are part of the Standard & Poor's 500 list of companies and that belong to the
pharmaceutical sector, which, given the current Coronavirus situation, I believe will perform well in
the near future.

In the first part of my analysis, in order to pick the two stocks on which I will concentrate in
the second part, I will perform some exploratory data analysis and basic descriptive statistics,
analyse graphs, and calculate and compare various types of returns. The companies I have selected
for this purpose are Johnson & Johnson, Lilly (Eli) & Co., Merck & Co., Mylan N.V.,
Perrigo, and Pfizer Inc.

The second part of the analysis will revolve around forecasting stock prices using the ARIMA model,
which stands for Autoregressive Integrated Moving Average, a method widely used to forecast stock
prices.

The main objective of the analysis is to pick a stock to buy with the intent of selling it at a later stage
in order to make a profit: to do so, it is necessary to predict its prices over the next 200 days, since
I will only be able to make a profit if the price goes up.

There are multiple reasons behind the choice to undertake this type of project. To start, I have always
had an interest in this field and I have made a few investments in stocks over the years, but my choices
were never really backed up by a piece of data analytics work: therefore, I hoped that
this work would give me a solid basic knowledge of how to perform financial time series
analysis in order to make better investment decisions in the future. I am fully aware that I will
only be able to cover the basics and that my analysis will be extremely basic compared to what
traders really do, but, if expanded, this knowledge could really help me in making better investment
decisions.

Additionally, I currently work for Fidelity Investments, one of the largest multinational financial
services corporations, where my current role is in fund operations. Being aware that the
vast majority of the investment decisions taken by a company of such a big scale are either
automated or largely backed up by different software programs, I believe that gaining an initial
understanding of how stock prices behave over time could allow me to progress in my career in the
same company, which is exactly what I would like to do.

The structure of this report will be the following: after this introductory section, a brief literature
review on time series analysis and a theoretical explanation of the methods used will follow; then,
there will be a section that will explain thoroughly the methodology I have used, followed by the
results that I have found and finally a short conclusion.

2) Literature review
It is widely known that forecasting stock prices is always a challenging task due to the nature of the
data: stock prices, in fact, represent non-linear time series data, since the data points most times do
not cluster around a straight line, and non-stationary time series data, meaning that statistical
properties such as mean and variance do not remain constant over time (Kim and Na, 2016).

Over the years, a large body of literature has explored the area of stock market
behaviour and many authors have agreed on the fact that stock values and prices can be affected by
numerous variables that can’t all be taken into consideration, such as market sentiment, news
releases, industry performance, etc., making it therefore very hard to make highly accurate
predictions (Min, 2000).

Additionally, there have been findings that stock prices are also influenced by psychological and non-
measurable factors, adding complexity to predictions.

Nevertheless, especially in the last decade, AI and Machine Learning have played a fundamental
role in stock market analysis thanks to the intense development of algorithmic trading and
high-frequency trading: Artificial Neural Networks (ANNs) have been defined as the current dominant
machine learning technique in the area (Bhowmik and Wang, 2020).

Stock data analysis is certainly included in the larger area of time series analysis and we can
therefore extrapolate its principles.

As seen in Janacek (2009) the components of a time series are:

• Trend: it is the overall long-term direction of the series (long-term increase or decrease)
• Seasonality: it occurs when there is repeated behaviour in the data which occurs at regular
intervals, which are normally due to the calendar (quarter, month, day)
• Cycle: it occurs when a series follows an up and down pattern that is not seasonal and
therefore not of fixed period, making it therefore more difficult to detect
• Variation: in all data there is random variation, which can be moderate or large
• Irregularities: sometimes there are unexpected dips or jumps in a series which might be due
to a one-off event

The method I have decided to employ is the ARIMA model: ARIMA stands for Autoregressive Integrated
Moving Average and it is one of the most widely used methodologies to forecast time series data; the
basic assumption of the model is that the current value is correlated with the n previous
values (Grillenzoni, 1993).

As explained by Malska and Wachta (2015), its elements are:

• Autoregression (AR) refers to a model that predicts future values by looking at past values
• Integrated (I): the raw values are differenced in order for the data series to become
stationary, i.e. so that the mean, variance, and autocorrelation do not change over time
• Moving average (MA) refers to the fact that an observation depends on the residual errors from
a moving average model applied to lagged observations

It is based on three parameters, known as p, d and q:

• p → autoregressive parameter (AR): number of autoregressive terms, i.e. how many prior values
of the series are used to explain the current value
• d → degree of differencing (I): how many non-seasonal differences are needed to achieve
stationarity
• q → moving average (MA): number of lagged forecast errors in the prediction equation (the
error of the model as a combination of previous error terms e_t)

ARIMA models work on the assumption that the data is stationary, meaning that the trend and
seasonality have been removed. In order to assess stationarity by looking for correlation, we use two
functions:

• Auto-correlation function (ACF): autocorrelation is the way the observations in a time series
are related to each other and it is calculated by simply correlating the current observation
with the observation k periods before it
• Partial auto-correlation function (PACF): it measures the degree of association between the
current observation and the observation k periods before it, after adjusting for the effect of
the observations in between
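To make the ACF definition concrete, here is a minimal sketch of the sample autocorrelation at lags 0 to max_lag. The report's own code is in R and is not reproduced here; this Python version, the function name `acf` and the two toy series are illustrative assumptions only.

```python
def acf(series, max_lag):
    """Sample autocorrelation of `series` for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)  # total sum of squared deviations
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        for k in range(max_lag + 1)
    ]

# Toy data (not from the report): a steadily rising series and an alternating one.
trending = [float(t) for t in range(1, 21)]
alternating = [1.0 if t % 2 == 0 else -1.0 for t in range(20)]

acf_trend = acf(trending, 3)    # stays high at every lag, typical of non-stationary data
acf_alt = acf(alternating, 3)   # flips sign from one lag to the next
```

A trending series such as a stock price keeps a high autocorrelation over many lags, which is why a slowly decaying ACF is usually read as a sign of non-stationarity.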

3) Methodology

3.1) Reasons behind the choice of the stocks to be analysed


The first step of my work was to pick the stocks I was going to perform my time series analysis on.
Given the enormous number of stocks listed on the various stock exchanges, I had to make a few
decisions to narrow down my choices:

• I decided to only investigate securities listed on the New York Stock Exchange and the
NASDAQ, which are the biggest exchanges in the world and whose historic data were going to
be much easier to retrieve
• I narrowed down my search by only looking into securities that were part of the S&P 500
list, which includes about 500 of the largest US-listed companies by market capitalization. I
made this decision because of what I intended to learn and gain from this
whole project: since I intended to enhance my data analytics skills in the financial field
with a view to a possible progression in my career, and since I wanted to start
investing myself, picking relatively safer stocks to start with was definitely going to be easier
for me.
• Lastly, out of the 500 companies I chose to analyse those in the pharmaceutical field. I
decided to do so because, even if it might look like a very simplistic approach, I believe that
because of the current Coronavirus pandemic, companies operating in this sector are going
to perform well in the near future. Therefore, the companies I initially selected were AbbVie
Inc., Johnson & Johnson, Lilly (Eli) & Co., Merck & Co., Mylan N.V., Perrigo, Pfizer Inc. and
Zoetis. After loading the tickers in R, I realized that AbbVie Inc. and Zoetis only have price
data from 2013, when they were listed, while the others have data from January 2007, hence I
decided to exclude them so that my analysis would be more thorough and hopefully more precise
(given the larger amount of data available).
• Since my intent was to narrow down my analysis to one stock, I used several methods to
pick the final one

Luckily, I was able to download the data directly from Yahoo Finance through R: the columns
extracted were the date, the open price, the highest price, the lowest price, the close price, the
volume of trade and the adjusted price.

3.2) General analysis on Johnson & Johnson, Lilly (Eli) & Co., Merck & Co., Mylan
N.V., Perrigo and Pfizer Inc
The first piece of analysis I decided to do was to plot prices of the six stocks I had selected over the
period, just to gain an initial very visual understanding of how they behaved.

It is very important to mention that for the purpose of plotting data, it is best to use the adjusted
price, which, as the term suggests, is adjusted for stock splits and dividends. To better understand
this concept, let’s consider for example a stock that is priced at 10 USD: if the company decides to
do a stock split transforming every share into 10 shares, the price will fall to 1 USD because of the
change in the total number of shares. If we were to look at just the closing prices we would see a
massive fall from 10 to 1 USD and we could assume that the stock did terribly on that day,
which would be incorrect and would lead to an incorrect analysis.
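The 10-for-1 split example can be sketched in a few lines. This is only an illustration of the idea of back-adjusting for splits, written in Python with a hypothetical helper `adjust_for_splits`; it ignores dividends and is not the exact procedure Yahoo Finance uses to build its adjusted price column.

```python
def adjust_for_splits(closes, split_ratios):
    """Back-adjust closing prices for stock splits.

    split_ratios[i] is the number of new shares issued per old share
    on day i (1 means no split on that day).
    """
    adjusted = [0.0] * len(closes)
    factor = 1.0
    # Walk backwards so that every split rescales all earlier prices.
    for i in range(len(closes) - 1, -1, -1):
        adjusted[i] = closes[i] / factor
        factor *= split_ratios[i]
    return adjusted

# Stock trades at 10 USD, then a 10-for-1 split takes effect on the third day.
closes = [10.0, 10.0, 1.0, 1.0]
ratios = [1, 1, 10, 1]
adjusted = adjust_for_splits(closes, ratios)  # [1.0, 1.0, 1.0, 1.0]
```

On the adjusted series the split day shows no price change at all, whereas the raw closing prices show an apparent 90% fall.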

We can see straight away that all the companies’ adjusted prices had an upward trend, except Mylan
and Perrigo, whose prices had a peak but then went down after it, and it looks like they are on a
downward trend at the moment. Merck and Pfizer had a significant peak but seem to be on an
upward trend; Johnson & Johnson and Eli Lilly seem to be the only two stocks that saw a significant
and steady price increase over the course of 13 years.

Thanks to the plots below I have then decided to exclude Perrigo and Mylan and continue the
analysis on the remaining four.

Figure 1: adjusted prices from 2007 to 2020

3.3) Johnson & Johnson, Lilly (Eli) & Co., Merck & Co and Pfizer Inc

Having only four stocks left, I decided to look at the stock returns.

Since I intended to create the ARIMA model in order to make a prediction for the next
200 days, I thought it would make sense, in order to narrow down my choice, to plot
weekly and monthly returns in addition to the daily ones.

Cumulative returns are expressed in percentage and it is therefore easier to visualize them on a
line chart: I have in fact plotted the arithmetic monthly cumulative returns.
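As a small illustration of how arithmetic returns and their cumulative version relate, the sketch below uses made-up monthly prices. It is written in Python rather than the report's R, the function names are hypothetical, and it assumes the cumulative return is obtained by compounding the periodic returns, which the report does not state explicitly.

```python
def arithmetic_returns(prices):
    """Simple (arithmetic) return between consecutive prices."""
    return [prices[i] / prices[i - 1] - 1.0 for i in range(1, len(prices))]

def cumulative_returns(returns):
    """Compound periodic returns into a running cumulative return."""
    out, growth = [], 1.0
    for r in returns:
        growth *= 1.0 + r
        out.append(growth - 1.0)
    return out

# Made-up month-end adjusted prices, for illustration only.
monthly_prices = [100.0, 110.0, 99.0, 120.0]
monthly = arithmetic_returns(monthly_prices)   # approx. 0.10, -0.10, 0.21
cumulative = cumulative_returns(monthly)       # last value is approx. 0.20
```

Under this definition the last cumulative return equals the final price divided by the first price, minus one, so a cumulative return of 500% corresponds to a final price six times the starting price.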

As shown below, we can see straight away that Eli Lilly reached a cumulative return of about 500%,
which is massive and therefore made me pick this stock straight away; Johnson & Johnson’s returns
now seem to be higher than Merck’s, but I wanted to perform some descriptive statistics to
confirm that.

Figure 2: Johnson & Johnson, Eli Lilly, Merck and Pfizer monthly cumulative returns

We can see below the summary of the daily, weekly and monthly returns of these two stocks
(Johnson & Johnson and Merck).

It looks like there is not much of a difference between the mean returns; it is worth noticing
though, that the minimum returns (which are negative returns) are less negative for Johnson
and Johnson, and the maximum returns are higher for Merck.

Figure 3: Johnson & Johnson summary on daily, weekly, monthly returns

Figure 4: Merck summary on daily, weekly, monthly returns

The standard deviations are quite similar for the daily returns, while they start to show a slightly
higher difference between the weekly and the monthly returns (which I was more interested in),
where Merck’s values are higher making it riskier and more volatile.

The two stocks present very similar values, but given the graph, the fact that Merck’s
values are slightly riskier in the long run and the fact that I intended to keep the stock for at
least 200 days, I picked Johnson & Johnson.

4) ARIMA model

4.1) Prediction models on Johnson and Johnson and Eli Lilly – train vs test datasets
The way I decided to choose between the two stocks was to fit an
ARIMA model on the adjusted prices of both of them, with the intent of trying to predict
10% of the data and seeing how well the model made those predictions by comparing them to the actual
data.

In order to do so I split both datasets into training (90% of the data, which were the first 3086
data points) and testing (10% of the data, or data points 3087 to 3429) and I went
through every step of the ARIMA process with the final intent of calculating the mean percentage
error between the forecasted prices and the actual prices and of picking the stock that had a lower
mean percentage error.
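The split sizes quoted above can be checked with a short sketch. It is written in Python with a hypothetical helper `train_test_split` and a placeholder series of the same length as the data; the report's own split was done in R. Note that the split is chronological, not random, because the order of observations matters in a time series.

```python
def train_test_split(series, train_fraction=0.9):
    """Chronological split: the first part trains, the remainder tests."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]

# Placeholder for the 3429 daily adjusted prices (values are just positions 1..3429).
prices = list(range(1, 3430))
train, test = train_test_split(prices)
# len(train) is 3086 and len(test) is 343, so the test set starts at data point 3087.
```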

Therefore, what I will show in this paragraph is only the two results, while I will go through the
whole process only in the next paragraph, reapplying the model to forecast future prices: the steps
are visible in the R code I have attached to this project.

The type of ARIMA I decided to use is auto.arima, an R function (from the forecast package) that
automatically picks the three parameters I will mention in the next paragraph, instead of us having
to pick them.

The two mean percentage errors, which I simply calculated by subtracting the forecasted value
from the actual value and then dividing by the actual value, were the following:

• Johnson & Johnson: 0.3%
• Eli Lilly: 50.85%

In order to calculate them I have created a new data frame with the actual values of the test set and
the predicted values and then compared them.
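The error calculation described above can be sketched as follows, in Python with hypothetical function names and made-up numbers (the report's actual values come from its R code). The sketch also computes the mean absolute percentage error (MAPE) for comparison, because in a plain mean percentage error positive and negative errors cancel each other out.

```python
def mean_percentage_error(actual, forecast):
    """Mean of (actual - forecast) / actual, in percent. Signs can cancel."""
    errors = [(a - f) / a for a, f in zip(actual, forecast)]
    return 100.0 * sum(errors) / len(errors)

def mean_absolute_percentage_error(actual, forecast):
    """Mean of |actual - forecast| / |actual|, in percent. Signs cannot cancel."""
    errors = [abs((a - f) / a) for a, f in zip(actual, forecast)]
    return 100.0 * sum(errors) / len(errors)

# Made-up test-set values: one forecast is 10% too low, the other 10% too high.
actual = [100.0, 200.0]
forecast = [90.0, 220.0]
mpe = mean_percentage_error(actual, forecast)             # about 0
mape = mean_absolute_percentage_error(actual, forecast)   # about 10
```

In this toy case the mean percentage error is close to zero even though each forecast is off by 10%, so a low value is best read together with an absolute measure such as the MAPE.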

It was very clear that the model predicted the Eli Lilly prices very poorly, with an error of 50.85%,
and the Johnson & Johnson prices almost perfectly, which is what made me pick the latter.

Please see R code for more details on the above.

4.2) Johnson & Johnson future forecast model

4.2.1) Test for stationarity


Before building an ARIMA model it is necessary to check that my time series is stationary, or in
simpler words that the mean, variance, and autocorrelation of the series do not change over time.

In order to do so I performed a couple of statistical tests on the Johnson & Johnson adjusted prices,
which are the Box-Ljung test and the Augmented Dickey Fuller (ADF) test.

The Box-Ljung test’s null hypothesis states that the residuals are independently distributed, while
the alternative hypothesis states that the residuals are not independently distributed and therefore
show serial correlation.

As shown below, the p value is much larger than 0.05, therefore we fail to reject the
null hypothesis: there is no evidence of serial correlation.

The Dickey Fuller test’s null hypothesis states that the time series is non-stationary, while the
alternative hypothesis states that it is stationary; a small p value therefore leads us to reject the
null hypothesis in favour of stationarity.

As shown in the results below, the p value is not smaller than 0.05, which is the significance level
that is normally used; therefore we fail to reject the null hypothesis and treat the data as
non-stationary.

Since we want stationarity in our data, as it is an essential prerequisite, the results of the test
represent an issue and the solution is to difference the data, i.e. to take the change between
consecutive values (with the diff function in R), which will stabilize the mean.

Differencing replaces each value with its change from the previous value, which removes the trend
from the series; I chose to difference the data once (which will be my d value of 1).
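First-order differencing is simple enough to show on a toy series. The sketch below is in Python, with a hypothetical `difference` helper that mirrors what a lag-1 difference does; the prices are made up and the report's own step used R's diff function.

```python
def difference(series, lag=1):
    """Change between each value and the value `lag` steps before it."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]

# Made-up prices with a steady upward trend.
prices = [100.0, 102.0, 104.0, 106.0, 108.0]
diffed = difference(prices)  # [2.0, 2.0, 2.0, 2.0]
```

The original series has a rising mean, while the differenced series fluctuates around a constant level (here exactly 2), which is what stabilizing the mean refers to. The differenced series is one observation shorter than the original.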

In fact, after differencing, the results of the ADF test were the ones below, which confirm
stationarity.

4.2.2) ACF and PACF

ACF stands for "autocorrelation function" and PACF stands for "partial autocorrelation function";
as explained in a previous section, they are used to determine the ARIMA model’s order.

ACF and PACF are critical to identify and understand how different data points are correlated at
different time lags, where a lag is the number of periods separating one observation from a previous
observation.

The ACF and PACF are shown first for the non-differenced data and then for the
differenced data.

Figure 5: ACF and PACF of raw data

Ideally, what we would like to see is all the vertical lines within the two horizontal dotted lines in
both plots; in this case most of the values in the ACF are outside those lines, showing therefore that
the data is non-stationary, while in the PACF plot we have an immediate drop in the correlation from
lag zero to lag 1.
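The position of the dotted lines can be approximated with a short calculation. The sketch below assumes the plots were produced with R's default settings, where the bounds sit at roughly plus or minus 1.96 divided by the square root of the number of observations; it is written in Python, the helper names are hypothetical, the series length of 3428 assumes the full 3429 prices were differenced once, and the ACF values are invented for illustration.

```python
import math

def acf_bounds(n, z=1.96):
    """Approximate 95% significance bound for sample autocorrelations."""
    return z / math.sqrt(n)

def significant_lags(acf_values, bound):
    """Lags (excluding lag 0) whose autocorrelation falls outside the bounds."""
    return [k for k, r in enumerate(acf_values) if k > 0 and abs(r) > bound]

bound = acf_bounds(3428)  # roughly 0.033

# Invented autocorrelations for lags 0..3, for illustration only.
example_acf = [1.0, -0.08, 0.01, 0.02]
outside = significant_lags(example_acf, bound)  # [1]
```

With several thousand observations the bounds are very narrow, so even small autocorrelations can cross the dotted lines.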

Below we can see the two plots produced after first-order differencing was applied, which definitely
look better.
Figure 6: ACF and PACF of differenced data

As explained in a previous paragraph, the two plots can help us find the order of the ARIMA model,
i.e. the p, d and q values, in case we want to fit a customized ARIMA.

The ACF will help us find the q value, while the PACF will allow us to find the p value.

Looking at the ACF plot, we can see that the first vertical line that goes outside the dotted lines
is at lag 1, and our q value will therefore be 1; applying the same principle to the PACF plot, our
p value will be 2; and since we saw that the series became stationary after differencing once, we
can use a d value of 1.

The model’s order will therefore be (2,1,1) and we will use this to customize one ARIMA.
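Written out, an ARIMA model of order (2,1,1) takes the form below. This is the standard textbook form and not an output of the report's R code: it assumes no constant or drift term, uses the sign convention in which the moving average coefficient enters with a plus sign, and the symbols (B for the backshift operator, w_t for the differenced series, the phi and theta coefficients and the error term epsilon_t) are introduced here for illustration.

```latex
% ARIMA(2,1,1) in backshift notation, where B y_t = y_{t-1}
(1 - \phi_1 B - \phi_2 B^2)(1 - B)\, y_t = (1 + \theta_1 B)\, \varepsilon_t

% Equivalent form in terms of the differenced series w_t = y_t - y_{t-1}
w_t = \phi_1 w_{t-1} + \phi_2 w_{t-2} + \varepsilon_t + \theta_1 \varepsilon_{t-1}
```

The two autoregressive terms correspond to p = 2, the single difference to d = 1 and the single lagged error term to q = 1.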

4.2.3) Model summary and fit


I decided to test three different ARIMA model orders to see whether they would all give me the same
results:

a) Auto.arima, which is a function that is supposed to pick the best model order by looking at
the dataset
b) A customized order chosen by looking at the ACF and PACF plots above, which will
be (2,1,1)
c) A customized order chosen by looking at the ACF and PACF plots of the second fitted model,
with the intent of trying to make it more efficient, which will be (4,2,4)

What we will look at is the ACF and PACF plots of the residuals, since a model that fits well
should have its vertical lines within the dotted ones; the AIC and potentially the BIC, for
which the lower the value the better; and the different types of errors.
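The idea behind "the lower the better" can be illustrated with the textbook definitions of the two criteria. The sketch below is in Python with made-up log-likelihoods; the parameter counts (coefficients plus one for the error variance) are an assumption about how the models are counted, and none of the numbers come from the report's figures.

```python
import math

def aic(log_likelihood, k):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    """Bayesian information criterion: k ln(n) - 2 ln(L)."""
    return k * math.log(n) - 2 * log_likelihood

# Made-up log-likelihoods for a small and a large model fitted to n observations.
n = 3428
aic_small = aic(-5000.0, k=4)   # e.g. 3 coefficients + variance -> 10008.0
aic_large = aic(-4999.0, k=9)   # e.g. 8 coefficients + variance -> 10016.0
bic_small = bic(-5000.0, k=4, n=n)
bic_large = bic(-4999.0, k=9, n=n)
```

In this invented case the larger model has a slightly better likelihood, yet both criteria prefer the smaller model because the extra parameters are penalized; the BIC penalizes them more heavily than the AIC for any sample of this size.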

Below in figures 7, 8 and 9 we can see the three different models.

It is visible that the MAPE (mean absolute percentage error) is very low for all three models, that
the residuals fit quite well and that the lowest AIC value is in the second model.

Figure 7: AUTO.ARIMA model

Figure 8: 2,1,1 ARIMA model

Figure 9: 4,2,4 ARIMA model

4.2.4) Forecast
At this stage, I have created three fits and forecasted the future values from each of them.

From the below plots we can see that two out of three models predicted that the price of the stock
is going to go up, while the second model shows that it is not going to change much over the next
200 days.

Figure 10: Johnson and Johnson models results

Lastly, I created csv files with all my forecasted values for the next 200 days, so that I
can refer to them in case I want to sell the stock before the 200 days are over.

5) Conclusion
Thanks to my analysis I have been able to pick one stock and fit an ARIMA model on it.

In order to end up with only one stock I performed some basic analysis on prices and
returns, together with some descriptive statistics.

The stock I have chosen is Johnson & Johnson and I have concluded that the stock price is going to
go up in the next 200 days; one of the three models predicted that the price is going to remain stable
while the other two predicted that the price is going to go up.

I believe that the most reliable model is the auto.arima one, since it is the model R proposes as
the best by looking at the past data; as a matter of fact auto.arima predicted a price increase, so
I am fairly confident that the prediction might end up being correct. This means that my decision
will be to buy Johnson & Johnson shares and then sell them at a later stage, ideally after 200 days,
in order to make a profit.

When I tested the accuracy of the auto.arima model by splitting the Johnson & Johnson price data
into training and testing sets, the model ended up predicting the prices very well.

I am fully aware that this analysis might look like a very simplistic way of making investment
decisions, but I am fully convinced that it might represent a good starting point for the objectives I
had set: in fact it gave me a good basic knowledge on what to pay attention to when analysing stock
data.

In addition to that it has been very interesting because I can apply what I have learned in this project
to my personal life and I am glad this is not going to remain just a theoretical piece of work.

6) Bibliography and webliography
Grillenzoni, C., 1993. ARIMA Processes with ARIMA Parameters. Journal of Business & Economic
Statistics, 11(2), p.235.

Janacek, G., 2009. Time Series Analysis Forecasting and Control. Journal of Time Series Analysis.

Kim, D. and Na, H., 2016. The Forecast Dispersion Anomaly Revisited: Time-Series Forecast
Dispersion and the Cross-Section of Stock Returns. SSRN Electronic Journal.

Malska, W. and Wachta, H., 2015. ARIMA Model Using the Time Series Analysis. Scientific
Journals of Rzeszów University of Technology, Series: Electrotechnics, pp.23-30.

Min, D., 2000. Pattern and Forecast of Movement in Stock Price. SSRN Electronic Journal.

Medium. 2020. Performing A Time Series Analysis On The AAPL Stock Index. [online] Available at:
<https://towardsdatascience.com/performing-a-time-series-analysis-on-the-aapl-stock-index-3655da9612ff>
[Accessed 16 August 2020].

DataCamp. 2020. Sign In. [online] Available at:
<https://learn.datacamp.com/courses/forecasting-using-r?_escaped_fragment_=>
[Accessed 16 August 2020].

Stack Overflow. 2020. Stack Overflow - Where Developers Learn, Share, & Build Careers. [online]
Available at: <https://stackoverflow.com/> [Accessed 16 August 2020].

2020. [online] Available at:
<https://blogs.oracle.com/datascience/introduction-to-forecasting-with-arima-in-r>
[Accessed 16 August 2020].

A-little-book-of-r-for-time-series.readthedocs.io. 2020. Welcome To A Little Book Of R For Time
Series! - Time Series 0.2 Documentation. [online] Available at:
<https://a-little-book-of-r-for-time-series.readthedocs.io/en/latest/> [Accessed 16 August 2020].

DataCamp. 2020. Sign In. [online] Available at:
<https://learn.datacamp.com/courses/manipulating-time-series-data-in-r-with-xts-zoo>
[Accessed 16 August 2020].

Udemy. 2020. Practical Data Science: Analyzing Stock Market Data With R. [online] Available at:
<https://www.udemy.com/course/practical-data-science-analyzing-stock-market-data-with-r/>
[Accessed 16 August 2020].
