
MINISTRY OF EDUCATION AND TRAINING

UNIVERSITY OF ECONOMICS HO CHI MINH CITY

ENTRY FOR THE

UEH500 2023 OUTSTANDING COURSE PROJECT AWARD

PROJECT TITLE:


APPLICATION OF RSTUDIO IN
ANALYSIS OF THE US STOCK MARKET

Ho Chi Minh City, April 27, 2023

UEH UNIVERSITY
COLLEGE OF BUSINESS
SCHOOL OF MANAGEMENT

Course section code :


Course name : Financial Econometrics
Lecturer : Dr. Nguyễn Thị Hồng Nhâm
Student : Phạm Thịnh Phát
Student ID : 31211021082
Phone : 0902989343
Email : phatpham.31211021082@st.ueh.edu.vn

Ho Chi Minh City, April 27, 2023

Table of contents

CHAPTER 1 PANEL DATA ANALYSIS: DETERMINANTS OF SHARE PRICES AMONG TOP COMPANIES IN THE USA
CHAPTER 2 ARIMA MODEL FOR S&P 500 IN 10 YEARS
REFERENCES

CHAPTER 1 PANEL DATA ANALYSIS: DETERMINANTS OF SHARE
PRICES AMONG TOP COMPANIES IN THE USA

What drives the price of a stock has always been a widely discussed topic in finance. The intrinsic value of a company can indeed guide its share price over the long term (5-10 years), especially for large enterprises. In contrast, over shorter time frames, stock prices usually reflect how appealing a stock is to traders. This thesis attempts to uncover the relationship between stock price, liquidity, and monthly rate of return over a 5-year span from May 2018 to May 2023. It also examines whether there is any truth to the saying "Sell in May and Go Away".
Before applying panel data regression, the data must be prepared and cleaned before being imported into R. The data are collected from Yahoo Finance and downloaded separately into Excel. Among the variables, the rate of return and the liquidity need to be calculated.

Rate of return = (Close price − Open price) / Open price
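As a quick numeric check of this formula (hypothetical prices; sketched in Python for illustration, although the analysis itself is done in R):

```python
# Hypothetical daily quote, not taken from the actual dataset
open_price, close_price = 100.0, 102.5

# Rate of return = (Close - Open) / Open
rate_of_return = (close_price - open_price) / open_price
print(rate_of_return)  # 0.025, i.e. a 2.5% gain over the period
```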

As for the liquidity of a stock, the calculation is adopted from Danyliv, Bland, and Nicholass (2014) as below:

Liquidity = log10( (Volume × Price) / (High − Low) )
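A worked example with hypothetical values (again a Python sketch, purely for illustration) shows the scale the log10 keeps manageable:

```python
import math

# Hypothetical one-day quote, not taken from the dataset
volume, price = 5_000_000, 150.0   # shares traded, price in USD
high, low = 152.0, 148.0           # intraday high and low

# Danyliv et al. (2014): Liquidity = log10(Volume * Price / (High - Low))
liquidity = math.log10(volume * price / (high - low))
print(round(liquidity, 2))  # 8.27
```

Under this measure, roughly 10^8.27 US dollars (on the order of 10^8) of capital would be needed to move the price by $1.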

The reason for using log10 is that trading volume is very high; the logarithm keeps the number within a manageable range. The measure can be interpreted as follows: the amount of capital needed to create a $1 price fluctuation can be estimated as 10^Liquidity US dollars. The data are then loaded into R and the dataset is transformed into panel data. The pdim function checks whether the panel is balanced and ready to be analyzed. Furthermore, to examine the behavior of stock prices in May, a dummy variable is constructed. All these steps are carried out by the following code:

#Import data
library(readxl)
data <- read_excel("~/Desktop/data.xlsx",
                   col_types = c("date", "text", "numeric",
                                 "numeric", "numeric", "numeric",
                                 "numeric", "numeric", "numeric"))
View(data)

#Transform into panel data and check that the panel is balanced
library(plm)
panel_data <- pdata.frame(data, index = c("Company", "Date"))
pdim(panel_data)

#Dummy variable: 1 for observations falling on the May dates, 0 otherwise
data$Date <- as.Date(data$Date)
data$May <- as.integer(data$Date %in% as.Date(c("2019-05-01", "2020-05-01",
                                                "2021-05-01", "2022-05-01")))
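The dummy construction above is simply a membership (or month) check; the same idea in a minimal Python sketch with hypothetical dates:

```python
from datetime import date

# Hypothetical monthly observation dates, not the actual sample
dates = [date(2019, 5, 1), date(2019, 6, 1), date(2020, 5, 1)]

# May dummy: 1 if the observation falls in May, 0 otherwise
may = [int(d.month == 5) for d in dates]
print(may)  # [1, 0, 1]
```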

4
Panel data combine a cross-sectional dimension, in this case "Company", and a time-series dimension, "Date". Therefore, the variables need to be stationary for the results to be consistent. In RStudio, this can be checked with an Augmented Dickey-Fuller (ADF) test from the tseries package, using the following code:

#Stationarity test
library(tseries)
adf.test(data$Close)
adf.test(data$`Rate of return`)
adf.test(data$Liquidity)

With the null hypothesis H0 that the time series is non-stationary and the alternative hypothesis H1 that the time series is stationary, the calculated p-values are summarized in the table below:

              Close price    Rate of return    Liquidity
p-value       0.03471        << 0.01           0.06912
With these p-values, the Close price and Rate of return are stationary, whereas the Liquidity is non-stationary. For that reason, this variable is differenced and tested again for stationarity.
adf.test(diff(data$Liquidity))

After determining that first-order differencing solves the problem of non-stationarity, the differenced variable is then added to the dataset.
data$Diff_Liquidity = c(NA, diff(data$Liquidity))

There are three methods for panel data regression: Pooled OLS, Fixed Effects, and Random Effects. It is necessary to first estimate all three models.
Pooled_OLS = plm(Close ~ Ror + Diff_Liquidity + May, data = data, model = 'pooling')
Fixed_Effects = plm(Close ~ Ror + Diff_Liquidity + May, data = data, model = 'within')
Random_Effects = plm(Close ~ Ror + Diff_Liquidity + May, data = data, model = 'random')
After estimating these three models, an F-test is used to choose between Pooled OLS and Fixed Effects. In R, this can be done with pooltest, where the null hypothesis is that Pooled OLS is suitable and the alternative hypothesis is that the Fixed Effects model is suitable.
pooltest(Pooled_OLS, Fixed_Effects)

As the p-value is much smaller than 0.05, the null hypothesis is rejected; therefore, a fixed effects model is more suitable.
Another way to check for a panel effect in R is the Breusch-Pagan Lagrange multiplier test. Its null hypothesis is that the variance across units is zero, i.e., there is no panel effect.
plmtest(Pooled_OLS, type = c('bp'))

As with the F-test, the null hypothesis is rejected; therefore, a Pooled OLS model is not suitable for this dataset.
After this pooltest, a Hausman test can be used to choose between Fixed and Random Effects, with the null hypothesis that the Random Effects model is suitable and the alternative hypothesis that Fixed Effects is suitable.
phtest(Fixed_Effects,Random_Effects)

The p-value is 0.8124, higher than the alpha of 0.05, so the null hypothesis cannot be rejected. Therefore, a random effects model is consistent in this case.
After the Random Effects model is selected, diagnostic tests need to be run.
Firstly, multicollinearity is tested with two approaches: Pearson correlation and VIF.
library(car)
cor.test(data$Ror, data$Diff_Liquidity, alternative = 'two.sided')
vif(Random_Effects)

In the Pearson correlation test, the null hypothesis that the true correlation between these two variables equals 0 is rejected, with a low p-value of approximately 0.0005. This can be interpreted as evidence of a correlation, albeit a weak one, with an estimate of 0.14. Moving on to the VIF, while the exact value at which multicollinearity is deemed to exist is debated among researchers, the VIF here is low relative to the commonly used cutoff of 10. For that reason, there is no evidence of multicollinearity in this model.
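For reference, the VIF of a regressor equals 1/(1 − R²), where R² comes from regressing that variable on the other regressors. A small Python sketch with hypothetical R² values (not the model's) shows how it maps onto the cutoff of 10:

```python
# VIF = 1 / (1 - R^2) from the auxiliary regression of one
# regressor on the remaining regressors (hypothetical R^2 values)
for r_squared in (0.10, 0.50, 0.90):
    vif = 1 / (1 - r_squared)
    print(r_squared, round(vif, 2))
```

An auxiliary R² of 0.90 is exactly where VIF reaches the commonly cited threshold of 10.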
To test for heteroskedasticity, a Breusch-Pagan test can be executed.
bptest(formula(Random_Effects), data = data, studentize = FALSE)

As can be seen, the null hypothesis is one of constant variance, i.e., homoskedasticity. With a p-value of 0.2374, the test suggests not rejecting the null hypothesis. It can be concluded that the model does not suffer from heteroskedasticity.
Another problem that needs to be addressed is autocorrelation. As can be seen in the ACF plot below, the model suffers from a severe autocorrelation problem. While this graphical conclusion may be subjective, a Breusch-Godfrey test provides a similar result: with a relatively small p-value, the null hypothesis of no serial correlation is rejected.

To address this problem, the Random Effects model can be re-estimated using a Generalized Least Squares (GLS) method, which accounts for autocorrelation. This can be executed with the code below. Although model = "pooling" is specified, it gives the same result as model = "random", as it creates an unrestricted FGLS model.
GLS = pggls(Close ~ Ror + Diff_Liquidity + May, data = data, model = 'pooling')
After the GLS model is appropriately fitted, the relationships and the equation are estimated and summarized below.

As depicted, the rate of return is a very strong driver of the share price. R automatically computes the p-value for the test of the null hypothesis that the coefficient on Ror equals 0, and this p-value is relatively small. Therefore, Ror has a significant impact on the share price.
This model is then used to test further hypotheses.
H01: Liquidity has a positive relationship with share price
H1: Liquidity does not have a positive relationship with share price

With a p-value of 0.6639, the null hypothesis is not rejected, consistent with a positive relationship between liquidity and share price.
H02: May has a negative effect on share price
H2: May does not have a negative effect on share price

With a p-value of 0.826, the null hypothesis is not rejected, consistent with a negative effect on the share price in May.

CHAPTER 2 ARIMA MODEL FOR S&P 500 IN 10 YEARS

The dataset is collected from https://fred.stlouisfed.org/series/SP500 and loaded into Excel, where it is transformed into an appropriate format and imported into R. It is then plotted for further analysis.

It can easily be seen that the series is non-stationary. This is confirmed by an ADF test: the p-value indicates that the null hypothesis of a non-stationary series cannot be rejected.

For that reason, the first difference of the price series is taken. The differenced series is also plotted and ADF-tested.
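First differencing simply replaces each observation with its change from the previous one; a minimal Python sketch with hypothetical index levels (the paper's R code uses diff() for the same operation):

```python
# Hypothetical daily index levels, not the actual S&P 500 data
prices = [3800.0, 3850.0, 3790.0, 3900.0]

# First difference: d[t] = y[t] - y[t-1]
diffs = [curr - prev for prev, curr in zip(prices, prices[1:])]
print(diffs)  # [50.0, -60.0, 110.0]
```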

The plot above, combined with the ADF test (the null hypothesis is rejected with a p-value of 0.01), shows that d = 1 is appropriate for the ARIMA model.
In R, besides the manual Box-Jenkins procedure for determining the order of the ARIMA model, the auto.arima() function automatically selects the most suitable model for a time series object. The algorithm, developed by Hyndman and Khandakar (2008), combines unit root tests with minimization of the AIC, using maximum likelihood estimation (MLE) to fit the candidate models. The process is described below.
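The AIC-minimization step at the heart of this search can be sketched as follows (Python for illustration; the parameter counts and log-likelihoods are hypothetical, not actual fits):

```python
# AIC = 2k - 2*ln(L); the candidate order with the lowest AIC wins.
# Hypothetical (n_params k, log-likelihood ln L) per ARIMA order:
candidates = {
    (0, 1, 1): (3, -120.5),
    (1, 1, 0): (3, -123.1),
    (1, 1, 1): (4, -120.4),
}
aic = {order: 2 * k - 2 * ll for order, (k, ll) in candidates.items()}
best = min(aic, key=aic.get)
print(best, aic[best])  # (0, 1, 1) 247.0
```

Note how the extra parameter penalizes (1,1,1) even though its log-likelihood is slightly better.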

The function indicates a model order of ARIMA (0,1,1) with drift.

However, as demonstrated above, this procedure consists only of stationarity tests and AIC comparison to select the model with the lowest AIC. For the ARIMA(0,1,1) model to be applicable for forecasting, a residual check is implemented.

The ACF plot of the residuals shows no significant autocorrelation. Most of the residuals also fit the bell shape of the normality assumption, and the residual plot resembles white noise. To further test for white noise and residual autocorrelation, the residuals can be treated as a time series object and passed to auto.arima(), which searches for the best-fitting model. As the fitted ARIMA model for the residuals is (0,0,0), it is concluded that the residuals are white noise.
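Each bar in an ACF plot is the sample autocorrelation at that lag; the lag-1 version can be sketched in Python as follows (the residual values are hypothetical, not the model's). A white-noise diagnosis requires values near 0 across all lags:

```python
def acf_lag1(x):
    """Sample autocorrelation of a series at lag 1."""
    m = sum(x) / len(x)
    num = sum((x[t] - m) * (x[t - 1] - m) for t in range(1, len(x)))
    den = sum((v - m) ** 2 for v in x)
    return num / den

# Hypothetical residual series for demonstration
residuals = [0.5, -0.3, 0.2, -0.4, 0.1, -0.1, 0.3, -0.2]
print(round(acf_lag1(residuals), 2))
```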

The model is then used to forecast the upcoming values of the S&P 500. As the forecast horizon extends, the prediction interval widens, indicating the decaying predictive power of the model.

The summary() function also automatically calculates the accuracy measures of the forecast. With an MAE of 95 and an RMSE of 135, the forecast errors are small compared to the mean of the index over the past 10 years, 2841. Also, the MAPE of 3.14% indicates an acceptably accurate result.
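These accuracy measures are straightforward to compute by hand; a Python sketch with hypothetical actual/forecast pairs (not the paper's numbers) makes the definitions concrete:

```python
import math

# Hypothetical actual vs. forecast values
actual   = [2800.0, 2850.0, 2900.0]
forecast = [2750.0, 2900.0, 2880.0]
errors = [a - f for a, f in zip(actual, forecast)]

mae  = sum(abs(e) for e in errors) / len(errors)            # mean absolute error
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))  # root mean squared error
mape = 100 * sum(abs(e) / a for e, a in zip(errors, actual)) / len(errors)  # percent

print(round(mae, 1), round(rmse, 1), round(mape, 2))  # 40.0 42.4 1.41
```

RMSE is never smaller than MAE and penalizes large errors more heavily, which is why both are reported.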

REFERENCES
Achim Zeileis, Torsten Hothorn (2002). Diagnostic Checking in Regression Relationships. R
News 2(3), 7-10. URL https://CRAN.R-project.org/doc/Rnews/

Brooks, C. (2019). Introductory econometrics for finance (R Guide). Cambridge university press.

Croissant Y, Millo G (2008). “Panel Data Econometrics in R: The plm Package.” Journal of
Statistical Software, *27*(2), 1-43. doi:10.18637/jss.v027.i02

Danyliv, O., Bland, B., & Nicholass, D. (2014). A Practical Approach to Liquidity
Calculation. The Journal of Trading, 9(3), 57-65.

Hyndman RJ, Athanasopoulos G (n.d.). Forecasting: Principles and Practice (2nd ed.), Section 8.7: ARIMA modelling in R. URL https://otexts.com/fpp2/arima-r.html

Hyndman RJ, Khandakar Y (2008). "Automatic time series forecasting: the forecast package for R." Journal of Statistical Software, 27(3), 1-22. doi:10.18637/jss.v027.i03

John Fox and Sanford Weisberg (2019). An {R} Companion to Applied Regression, Third
Edition. Thousand Oaks CA: Sage.
URL:https://socialsciences.mcmaster.ca/jfox/Books/Companion/

R Core Team (2022). R: A language and environment for statistical computing. R Foundation for
Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

Trapletti A, Hornik K (2023). _tseries: Time Series Analysis and Computational Finance_. R
package version 0.10-54, <https://CRAN.R-project.org/package=tseries>.

Wickham H, Bryan J (2023). readxl: Read Excel Files. R package version 1.4.2. URL https://CRAN.R-project.org/package=readxl

