
Report to be submitted in partial fulfillment of the

requirements for the degree

of

Dual Degree (B. Tech + M. Tech)

by

Abhishek Pramod Agnihotri


18CH30027

Under the supervision of

Prof. Rudra Prakash Pradhan

During the academic year 2022-23

VINOD GUPTA SCHOOL OF MANAGEMENT


INDIAN INSTITUTE OF TECHNOLOGY KHARAGPUR

DECLARATION

I certify that,

a) The work contained in this report has been done by me under the guidance of my
supervisor.

b) The work has not been submitted to any other institute for a degree or diploma.

c) I have conformed to the norms and guidelines given in the Ethical Code of Conduct of the
Institute.

d) Whenever I have used materials (data, theoretical analysis, figures, and text) from
other sources, I have given due credit to them by citing them in the text of the thesis
and providing their details in the references.

Date: November 12, 2022 (Abhishek Pramod Agnihotri)

Place: Kharagpur (18CH30027)

Contents

Last Semester Work

1. Introduction
2. Predictive Regression Framework
3. In-Sample Analysis
Table 1: Summary statistics, industry portfolio excess returns, 1959:12-2022:11
Table 2: OLS post-LASSO predictive regression estimation results, 1960:01-2022:12

Last Semester Work:
Twitter’s Influence on Bitcoin Price Fluctuation
People's perceptions of cryptocurrencies have shifted, and cryptocurrencies
continue to demonstrate their viability as an alternative currency. Investment
and speculation are two of their most appealing aspects for people looking to
increase their income. The cryptocurrency market shares some characteristics
with the stock market, the foreign exchange (forex) market, and other asset
markets such as crude oil, gold, and other precious metals. Many factors,
including the volume of buyers and sellers as well as political and economic
news and events, can influence the price of a coin. It is therefore valuable
for shareholders and hedge funds to have instruments that predict the rise and
fall of cryptocurrency prices and advise them on which currency to invest in.
It is useful to examine social media and cryptocurrency trends together and to
test whether there is a high correlation between people's posts and changes in
coin prices.
Therefore, the research investigated these questions:
● Is there a relation between Twitter sentiment and Bitcoin price fluctuation?
● Can a machine learning technique based on sentiment polarity be used to
predict Bitcoin price movement?

Results:

This study showed that the correlation between bitcoin price and sentiment is
low. It should also be noted that, even though the correlation is low, it is
not completely random: it improves when a lag is introduced. Hence, Twitter
does provide a slight indication of bitcoin prices.
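
The effect of introducing a lag can be illustrated with a minimal sketch. The function name `lagged_correlation`, the synthetic series, and the two-day delay are assumptions made purely for illustration; they are not the study's data or code.

```python
import numpy as np

def lagged_correlation(sentiment, price_change, lag):
    """Pearson correlation between sentiment on day t and price change on day t + lag."""
    s = np.asarray(sentiment, dtype=float)
    p = np.asarray(price_change, dtype=float)
    if lag > 0:
        # Drop the last `lag` sentiment values and the first `lag` price changes
        # so that s[k] is paired with p[k + lag].
        s, p = s[:-lag], p[lag:]
    return float(np.corrcoef(s, p)[0, 1])

# Synthetic illustration only: price changes follow sentiment with a
# two-day delay plus noise (hypothetical data-generating process).
rng = np.random.default_rng(0)
sentiment = rng.normal(size=500)
price_change = 0.5 * np.roll(sentiment, 2) + rng.normal(size=500)

for lag in range(4):
    print(lag, round(lagged_correlation(sentiment, price_change, lag), 3))
```

In this toy setting the same-day correlation is close to zero while the correlation at the matching lag is clearly positive, which is the kind of pattern a lag analysis is meant to reveal.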

A sequential LSTM model was built to assess the predictive capability of
Twitter sentiment. This deep learning model predicts crypto sentiment and tweet
sentiment using a lag of 7 days.

The model was evaluated quantitatively using the following metrics: accuracy,
precision, and recall. Based on the results, it can be observed that:

● The model has an accuracy of 73.4%.

● There were 3160 occurrences of the target as 0 and 1323 occurrences of the
target as 1.

● Precision indicates that, of all the targets the model predicted to be
true, 56% were actually true.

● Recall indicates that, of all the targets that were actually true, the model
predicted only 47% correctly.

● The F1 score of 51% indicates that the model does a moderate job of
predicting whether the tweet sentiment and crypto sentiment will both be
positive or both be negative.
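
As a quick consistency check on the metrics listed above, the F1 score is the harmonic mean of precision and recall. The sketch below uses the reported precision and recall; the helper names and the confusion-matrix counts in the second example are hypothetical and are not taken from the study.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall, f1_from(precision, recall)

# Precision and recall as reported in the text.
reported_f1 = f1_from(0.56, 0.47)
print(round(reported_f1, 2))  # 0.51

# Hypothetical counts, only to show how the four metrics relate.
toy_metrics = classification_metrics(tp=2, fp=1, fn=1, tn=6)
print(toy_metrics)
```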

The underlying hypothesis of this work is that opinions expressed in social
media can function as useful predictors of such fluctuations, especially
insofar as they incorporate features such as sentiment and opinion. This study
shows that the correlation between bitcoin price and sentiment is low. Even so,
the correlation is not completely random: it improves when a lag is introduced.
Hence, Twitter does provide a slight indication of bitcoin prices.

The model achieved an accuracy of 73.4% in predicting the bitcoin price signal
from the crypto sentiment.

Machine Learning Approach to Study
Return Dependencies Across Industries

1. Introduction

A substantial body of literature examines the predictability of aggregate stock
market returns. In contrast, despite the fact that analyst reports and asset
allocations are frequently industry-based, relatively few studies investigate
stock return predictability along industry lines. Industry return
predictability studies usually rely on popular predictor variables from the
aggregate market return predictability literature, such as the aggregate
dividend yield, nominal yields, and yield spreads. In this paper, we look at
industry return predictability using a different information set: lagged
industry returns from across the economy.

The lack of previous research on this topic could be attributed to the
statistical difficulties associated with estimating regression models with a
large number of predictors. The theoretical model in Hong et al. (2007), which
introduces information frictions into an economy with multiple linked
industries, motivates our use of lagged industry returns to forecast individual
industry returns. Because of inter-industry links, cash flow shocks arising in
one sector can affect expected cash flows in related industries. Investors in a
frictionless rational expectations equilibrium identify all of the
inter-industry consequences of a cash flow shock in a specific industry. As a
result, equity prices across all relevant industries adjust immediately to
fully incorporate the inter-industry consequences of the cash flow shock, and
lagged industry returns have no predictive power. Investors with limited
information-processing capabilities, on the other hand, specialise in specific
market segments. When a cash flow shock occurs in a specific industry in this
environment, information-processing limitations prevent investors specialised
in related industries from quickly working out the full consequences of the
shock. Information therefore spreads gradually across industries, and the
resulting slow adjustment in asset prices gives rise to industry return
predictability based on lagged industry returns.

We investigate the predictive ability of lagged industry returns using a
general predictive regression model that allows each industry's return to
respond to the lagged returns of all industries, thereby accommodating a
diverse set of direct and indirect industry links. Traditional ordinary least
squares (OLS) estimation has potential drawbacks given the abundance of
predictor variables in the predictive regression models. First, using all
lagged returns exposes OLS estimation to overfitting because of the large
number of predictors. Second, if only a few lagged returns are to be included,
it is difficult to decide a priori which of them are most relevant. As a
result, we use machine learning to avoid overfitting the data in our
high-dimensional setting and to choose the most relevant predictor variables.

We use the least absolute shrinkage and selection operator (LASSO), a powerful
and widely used machine learning tool. LASSO, like ridge regression, shrinks
the estimated coefficients by adding a convex regularization term to the
objective function for fitting a model. In contrast to the squared (L2) penalty
in ridge regression, the LASSO employs an absolute-value (L1) penalty, which
allows some coefficients to be shrunk exactly to zero. As a result, it performs
variable selection, which typically results in a sparse model. Sparsity has two
important benefits. First, setting insignificant coefficients to zero helps
prevent overfitting the data. Second, it makes the estimated model easier to
interpret by selecting the most relevant predictor variables.
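
The difference between the two penalties is easiest to see in a stylised special case. The sketch below assumes orthonormal predictors and the objectives (1/2)||y - Xb||^2 + lam*||b||_1 for the LASSO and (1/2)||y - Xb||^2 + (lam/2)*||b||^2 for ridge, under which each estimate is a simple function of the OLS coefficient. The coefficient values and the penalty level are made up for illustration.

```python
def lasso_shrink(b_ols, lam):
    """Soft-thresholding: move the coefficient toward zero by lam, stopping at zero."""
    magnitude = abs(b_ols) - lam
    if magnitude <= 0:
        return 0.0
    return magnitude if b_ols > 0 else -magnitude

def ridge_shrink(b_ols, lam):
    """Proportional shrinkage: the coefficient is scaled down but never reaches zero."""
    return b_ols / (1.0 + lam)

# Hypothetical OLS coefficients and penalty level.
ols = [0.30, -0.12, 0.04, -0.01]
lam = 0.05
lasso = [lasso_shrink(b, lam) for b in ols]
ridge = [ridge_shrink(b, lam) for b in ols]

for b, l, r in zip(ols, lasso, ridge):
    print(b, round(l, 4), round(r, 4))
```

The two smallest coefficients are set exactly to zero by the LASSO rule, whereas ridge leaves all four non-zero. The surviving LASSO coefficients are each pulled toward zero by the full penalty, which is the overshrinkage that motivates re-estimating them by OLS.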

Even though the LASSO's penalty term reduces overfitting through sparsity, it
also tends to overshrink the coefficients of the chosen variables. This can
result in substantial downward biases (in magnitude) in the estimated
coefficients. To reduce these biases, recent studies propose OLS post-LASSO
estimation. The idea is to first use LASSO to reduce the model's dimension;
then, to reduce biases in the LASSO coefficient estimates, the coefficients of
the chosen predictors are re-estimated using OLS. We estimate predictive
regression models for each industry using OLS post-LASSO, with the set of
candidate predictors including the lagged returns for all 30 industries
considered. OLS post-LASSO estimation helps us determine the most important set
of lagged industry returns for predicting the return of a given industry while
also producing more accurate estimates of the coefficients on the relevant
lagged industry returns.
We examine the ability of lagged industry returns to predict individual industry
returns using both in-sample and out-of-sample tests.

To perform the in-sample analysis, we use monthly return data from Kenneth
French's Data Library to estimate predictive regression models via OLS
post-LASSO for 30 industry portfolios spanning 1960 to 2022. For 29 of the
individual industries, the LASSO chooses at least one lagged industry return as
a predictor, while several lagged industry returns are selected for 22 of the
individual industries. Moreover, the OLS post-LASSO estimation results show
that the LASSO-selected lagged industry returns are frequently statistically
significant predictors of industry returns.

Lagged returns from the financial sector, as well as from commodity- and
material-producing industries, are chosen as return predictors for a variety of
individual industries. From the perspective of Hong et al.'s (2007) theoretical
model, the prominent role of these sectors is economically plausible. When the
financial sector experiences a positive return shock, financial firms have
larger capital buffers, which makes them more able to provide credit on
favourable terms to industries throughout the economy; borrowers benefit
directly from the favourable terms, while their customers benefit indirectly.
In the presence of information frictions, we therefore expect lagged financial
sector returns to positively affect future returns in many industries.
Moreover, commodity price shocks raise product prices and returns for sectors
in earlier stages of production, while they squeeze profit margins and lower
returns for sectors in later stages of production. With information frictions,
we expect lagged returns for commodity- and material-producing industries to
have a negative impact on future returns for industries located later in the
production chain. We cannot, however, easily attribute all of the industry
return predictability that we observe to gradual information diffusion across
economically related industries. Indeed, machine learning is well known as a
powerful tool for discovering new patterns and connections in data.

2. Predictive Regression Framework

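The following general predictive regression model specification is the basic
framework:

$$ r_{i,t} = a_i + \sum_{j=1}^{N} b_{i,j}\, r_{j,t-1} + \varepsilon_{i,t}, \qquad i = 1, \ldots, N, $$

where $r_{i,t}$ is the ith industry portfolio return in excess of the risk-free
rate at time t; $a_i$ is an intercept; $b_{i,j}$ is the coefficient associated
with the jth lagged industry portfolio return; $\varepsilon_{i,t}$ is an error
term; and N = 30.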
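
To make the regression set-up concrete, the sketch below pairs each month's industry returns with the previous month's returns and fits one industry's equation by plain OLS. It assumes the regression has an intercept and one slope per lagged industry return. The function names, the three-industry synthetic panel, and the 0.3 spillover coefficient are hypothetical stand-ins, not the report's data.

```python
import numpy as np

def lagged_design(returns):
    """Pair each month's returns with the previous month's returns.

    returns: array of shape (T, N), one row per month, one column per industry.
    Gives (X, Y) with X[k] = returns in month k and Y[k] = returns in month k + 1.
    """
    returns = np.asarray(returns, dtype=float)
    return returns[:-1], returns[1:]

def fit_ols(X, y):
    """Intercept and slopes from a least-squares fit of y on X."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[0], coef[1:]

# Synthetic panel: industry 0 responds to last month's return on industry 2.
rng = np.random.default_rng(1)
T, N = 400, 3
returns = np.zeros((T, N))
for t in range(1, T):
    returns[t] = rng.normal(size=N)
    returns[t, 0] += 0.3 * returns[t - 1, 2]

X, Y = lagged_design(returns)
intercept, slopes = fit_ols(X, Y[:, 0])
print(np.round(slopes, 2))
```

With N = 30 the same construction yields 30 candidate slopes per industry, which is where the overfitting concern for unrestricted OLS comes from.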

To tackle the difficulties posed by the predictive regression models' high
dimension, we use the LASSO from machine learning.

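The objective function of the LASSO model is:

$$ \min_{a_i,\, b_{i,1}, \ldots, b_{i,N}} \; \frac{1}{2T} \sum_{t=1}^{T} \Big( r_{i,t} - a_i - \sum_{j=1}^{N} b_{i,j}\, r_{j,t-1} \Big)^2 + \lambda \sum_{j=1}^{N} |b_{i,j}|, $$

where T is the number of monthly observations, with $\lambda \geq 0$ as the
regularisation parameter. When $\lambda = 0$, the model reduces to an OLS
regression, as no penalisation takes place. LASSO allows model coefficients to
be shrunk to zero and produces sparse solutions in a data-driven manner. We
determine $\lambda$ using a 10-fold cross-validation technique. This splits the
sample into 10 disjoint random sub-samples, using 9 of them for training and
the remaining one for evaluation. The process is repeated 10 times, so that
each sub-sample serves once as the evaluation set, and we choose the $\lambda$
with the minimum mean squared prediction error.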
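
The cross-validation procedure can be sketched as follows. This is a minimal illustration rather than the estimation code behind the report: it assumes the penalised least-squares objective is scaled by 1/(2n), solves it with a simple coordinate-descent loop, and uses synthetic data and a hypothetical grid of penalty values.

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * max(abs(z) - lam, 0.0)

def lasso_fit(X, y, lam, n_iter=100):
    """Coordinate descent for (1/(2n))*||y - a - Xb||^2 + lam*||b||_1."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean          # centring absorbs the intercept
    n, p = Xc.shape
    b = np.zeros(p)
    col_sq = (Xc ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # Residual with predictor j's own contribution added back in.
            partial = yc - Xc @ b + Xc[:, j] * b[j]
            b[j] = soft_threshold(Xc[:, j] @ partial / n, lam) / col_sq[j]
    return y_mean - x_mean @ b, b

def cv_choose_lambda(X, y, lambdas, k=10, seed=0):
    """Pick the penalty with the lowest mean squared error over k random folds."""
    n = len(y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), k)
    mean_mse = []
    for lam in lambdas:
        fold_mse = []
        for held_out in folds:
            train = np.setdiff1d(np.arange(n), held_out)
            a, b = lasso_fit(X[train], y[train], lam)
            errors = y[held_out] - (a + X[held_out] @ b)
            fold_mse.append(np.mean(errors ** 2))
        mean_mse.append(float(np.mean(fold_mse)))
    return lambdas[int(np.argmin(mean_mse))], mean_mse

# Synthetic data: only the first of ten predictors matters.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = 0.8 * X[:, 0] + rng.normal(size=300)

lambdas = [0.001, 0.01, 0.05, 0.1, 1.0]   # hypothetical grid
best_lambda, cv_mse = cv_choose_lambda(X, y, lambdas)
print(best_lambda, np.round(cv_mse, 3))
```

The largest penalty in the grid zeroes every coefficient, so its cross-validated error is visibly worse than that of the moderate penalties.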

Notably, LASSO estimates exhibit a downward bias in magnitude, implying that
the penalty term tends to "overshrink" the coefficients of the important
predictors chosen by LASSO. Belloni and Chernozhukov (2013), among others,
recommend using OLS post-LASSO to re-estimate the model coefficients for the
LASSO-selected predictors. Furthermore, Belloni et al. (2017) contend that
penalised regression methods introduce a shrinkage bias, which can be corrected
by applying OLS to the predictors chosen in a first-stage variable selection
step.
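
The two-step idea can be sketched in a few lines: select predictors with the LASSO, then refit the selected predictors by OLS. The code below is an illustrative sketch under stated assumptions (a 1/(2n)-scaled objective, a basic coordinate-descent solver, a fixed penalty of 0.1, and synthetic data), not the report's implementation.

```python
import numpy as np

def lasso_coefficients(X, y, lam, n_iter=200):
    """Slopes minimising (1/(2n))*||y - a - Xb||^2 + lam*||b||_1 (coordinate descent)."""
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    n, p = Xc.shape
    b = np.zeros(p)
    col_sq = (Xc ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            partial = yc - Xc @ b + Xc[:, j] * b[j]
            rho = Xc[:, j] @ partial / n
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return b

def ols_post_lasso(X, y, lam):
    """Step 1: LASSO selects predictors. Step 2: OLS re-estimates their coefficients."""
    b_lasso = lasso_coefficients(X, y, lam)
    selected = np.flatnonzero(b_lasso)
    design = np.column_stack([np.ones(len(y)), X[:, selected]])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    b_post = np.zeros(X.shape[1])
    b_post[selected] = coef[1:]          # unselected predictors stay at zero
    return selected, b_lasso, b_post

# Synthetic data: only predictor 1 has a true effect (0.4).
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 8))
y = 0.4 * X[:, 1] + rng.normal(size=500)

selected, b_lasso, b_post = ols_post_lasso(X, y, lam=0.1)
print(selected)
print(np.round(b_lasso, 3))
print(np.round(b_post, 3))
```

In this example the LASSO coefficient on the relevant predictor is pulled toward zero by roughly the size of the penalty, and the OLS refit undoes that shrinkage while keeping the sparse set of predictors.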

3. In-Sample Analysis

We estimate predictive regressions using OLS post-LASSO on monthly excess
return data for 30 value-weighted industry portfolios from Kenneth French's
Data Library, where the industries are defined using the Standard Industrial
Classification (SIC) system.

Table 1 summarises the industry portfolio excess returns from 1959:12 to
2022:11. The industries are referred to using their Data Library abbreviations.
In addition to the fact that the industry portfolios are value weighted,
starting the sample in 1959:12 reduces illiquidity and thin-trading concerns.
In Table 1, Smoke (Tobacco Products) has the highest annualised average excess
return of 11.80%, with a Sharpe ratio of 0.55, and Food (Food Products) has the
highest Sharpe ratio of 0.57, while Steel (Steel Works, Etc.) has the lowest
annualised average excess return and Sharpe ratio of 4.93% and 0.18,
respectively.
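
For reference, annualised figures of this kind are commonly obtained from monthly excess returns as sketched below. The scaling convention (mean times 12, standard deviation times the square root of 12) is an assumption about how the table was built, and the four monthly values are made up for illustration.

```python
import math
import statistics

def annualised_summary(monthly_excess_returns_pct):
    """Annualised mean, volatility and Sharpe ratio from monthly excess returns (in %)."""
    mean_m = statistics.fmean(monthly_excess_returns_pct)
    vol_m = statistics.stdev(monthly_excess_returns_pct)   # sample standard deviation
    ann_return = 12 * mean_m
    ann_vol = math.sqrt(12) * vol_m
    return ann_return, ann_vol, ann_return / ann_vol

# Hypothetical monthly excess returns in percent.
ann_return, ann_vol, sharpe = annualised_summary([1.0, -1.0, 3.0, 1.0])
print(round(ann_return, 2), round(ann_vol, 2), round(sharpe, 2))
```

Because the returns are already measured in excess of the risk-free rate, the Sharpe ratio is simply the annualised mean divided by the annualised volatility.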

Table 1: Summary statistics, industry portfolio excess returns,
1959:12-2022:11

Industry Portfolio  Ann. Return (%)  Ann. Volatility (%)  Min     Max    Ann. Sharpe Ratio
Food                8.61             14.95                -18.13  19.89  0.57
Beer                9.14             17.44                -20.19  25.51  0.52
Smoke               11.80            21.10                -25.32  32.38  0.55
Games               8.88             25.18                -33.42  34.97  0.35
Books               6.51             20.72                -26.56  33.13  0.31
Hshld               7.18             16.37                -22.25  18.22  0.43
Clths               8.82             22.50                -31.45  31.79  0.39
Hlth                8.55             16.93                -21.05  29.01  0.50
Chems               6.73             19.40                -28.60  21.68  0.34
Txtls               7.41             25.34                -36.09  58.92  0.29
Cnstr               7.22             21.28                -29.30  25.02  0.33
Steel               4.93             26.49                -32.99  30.30  0.18
FabPr               8.02             21.50                -31.74  22.91  0.37
ElcEq               8.99             22.03                -32.80  22.87  0.40
Autos               7.84             25.66                -36.50  49.56  0.30
Carry               9.32             22.58                -35.30  31.40  0.41
Mines               7.35             25.84                -34.54  34.98  0.28
Coal                10.32            36.35                -40.85  45.55  0.28
Oil                 8.87             20.80                -34.81  32.92  0.42
Util                6.32             13.96                -13.14  18.26  0.45
Telcm               6.06             16.19                -16.30  21.20  0.37
Servs               9.236            22.35                -28.66  23.38  0.41
BusEq               8.74             23.26                -31.96  24.65  0.37
Paper               5.96             17.70                -27.76  21.04  0.33
Trans               7.43             20.21                -28.52  18.50  0.36
Whlsl               8.20             19.48                -29.36  17.47  0.42
Rtail               8.87             18.68                -29.72  26.51  0.47
Meals               9.41             21.11                -32.17  28.23  0.44
Fin                 8.01             18.95                -22.58  20.58  0.42
Other               4.96             20.03                -27.98  20.48  0.24

The table reports summary statistics for excess returns for 30 value-weighted industry
portfolios from Kenneth French's Data Library. Excess returns are computed relative to the
one-month Treasury bill return. The industry abbreviations are as follows: Food = Food
Products; Beer = Beer and Liquor; Smoke = Tobacco Products; Games = Recreation; Books =
Printing and Publishing; Hshld = Consumer Goods; Clths = Apparel; Hlth = Healthcare,
Medical Equipment, and Pharmaceutical Products; Chems = Chemicals; Txtls = Textiles;
Cnstr = Construction and Construction Materials; Steel = Steel Works, Etc.; FabPr =
Fabricated Products and Machinery; ElcEq = Electrical Equipment; Autos = Automobiles and
Trucks; Carry = Aircraft, Ships, and Railroad Equipment; Mines = Precious Metals,
Non-Metallic, and Industrial Metal Mining; Coal = Coal; Oil = Petroleum and Natural
Gas; Util = Utilities; Telcm = Communication; Servs = Personal and Business Services;
BusEq = Business Equipment; Paper = Business Supplies and Shipping Containers;
Trans = Transportation; Whlsl = Wholesale; Rtail = Retail; Meals = Restaurants, Hotels, and
Motels; Fin = Banking, Insurance, Real Estate, and Trading; Other = Everything Else.

Table 2 shows the estimated OLS post-LASSO coefficients for each industry.
After accounting for the lagged predictors, the available estimation sample
ranges from 1960:01 to 2022:12. Our target is the true regression coefficients
for the LASSO-selected sub-model. We use a bold (italicised bold) entry to
indicate that a coefficient estimate is significant at the 10% (5%) level using
the conventional OLS post-LASSO t-statistic.
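
A sketch of how such t-statistics can be computed is given below. It assumes that "conventional" refers to the usual homoskedastic OLS formula applied to the selected regressors as if they had been fixed in advance, and it compares against the large-sample two-sided critical values 1.645 (10%) and 1.96 (5%). The function name and the synthetic data are hypothetical.

```python
import numpy as np

def ols_t_stats(X, y):
    """Slope estimates and conventional t-statistics from an OLS fit with intercept."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    dof = n - design.shape[1]
    sigma2 = resid @ resid / dof                      # residual variance estimate
    cov = sigma2 * np.linalg.inv(design.T @ design)   # homoskedastic covariance matrix
    se = np.sqrt(np.diag(cov))
    return coef[1:], (coef / se)[1:]

# Synthetic example: two "selected" lagged returns, only the first is relevant.
rng = np.random.default_rng(4)
X_sel = rng.normal(size=(600, 2))
y = 0.3 * X_sel[:, 0] + rng.normal(size=600)

slopes, t_stats = ols_t_stats(X_sel, y)
for b, t in zip(slopes, t_stats):
    level = "5%" if abs(t) > 1.96 else "10%" if abs(t) > 1.645 else "not significant"
    print(round(float(b), 3), round(float(t), 2), level)
```

These statistics do not adjust for the fact that the regressors were chosen by the LASSO in a first step, so they are best read as indicative.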

The OLS post-LASSO estimates in Table 2 highlight the importance of lagged
industry returns in predicting individual industry returns. As individual
industry return predictors, the LASSO selects 167 lagged industry returns (out
of a possible 900). The LASSO selects at least one lagged industry return as a
return predictor for 29 of the 30 individual industries, and multiple lagged
industry returns are chosen for 22 of the 30 individual industries. The
conventional OLS post-LASSO t-statistics show that 82 (53) of the 167
LASSO-selected lagged industry returns are significant predictors at the 10%
(5%) level. In general, autocorrelation plays a limited role in Table 2, as the
LASSO selects an industry's own lagged return for only seven industries.

Table 2: OLS post-LASSO predictive regression estimation
results, 1960:01-2022:12

regressor food beer smoke games books hshld clths hlth chems txtl
food 0.12
beer
smoke
games 0.03
books 0.18 0.04 0.06 0.1
hshld
clths 0.04 0.05 0.05 0.1 0.07 0.08 0.09
hlth
chems 0.15
txtl 0.06
cnstr
steel -0.08
fabpr
elceq -0.27
autos 0.11

carry 0.17 0.05
mines -0.02 -0.06
coal -0.06 -0.06 -0.03 -0.07 -0.04 -0.05 -0.05 -0.05 -0.07
oil -0.1 -0.17 -0.15 -0.15
utils 0.09 0.27 0.13 0.16 0.11
telcm -0.11 -0.14
servs -0.15 0.05 0.12
buseq 0.06 0.13
paper -0.19
trans 0.1
whlsl
rtail 0.02 0.03 0.05 0.06 0.07
meals
fin 0.11 0.1 0.08 0.08 0.18
other
r^2 2.24 2.52 6.54 5.05 6.3 2.97 7.93 2.68 0.78 7.91

Regressor cnstr steel fabpr elceq autos carry mines coal oil util
food 0.1
beer -0.27 -0.08 -0.1
smoke -0.09 0.02
games
books 0.13
hshld -0.13 -0.08
clths 0.04 0.04
hlth -0.13 -0.08
chems
txtl
cnstr -0.18
steel
fabpr 0.12
elceq
autos -0.002
carry 0.17 0.08
mines -0.04
coal -0.06 -0.05 0.08

oil -0.14 -0.13 -0.2 -0.08
utils 0.15 0.17 0.09
telcm 0.07
servs
buseq 0.12 0.03
paper 0.19
trans 0.06 0.06 0.16
whlsl -0.14
rtail 0.001 0.18 0.1
meals
fin 0.15 0.15 0.09 0.1 0.14 0.13
other 0.1
R^2 5.13 1.29 1.56 0.8 6.13 2.27 2.84 2.52 7.88

Regressor telcm servs buseq paper trans whlsl rtail meals fin other
food -0.1
beer -0.06 -0.05
smoke -0.03 -0.09 -0.14 -0.05 -0.06
games
books 0.09 0.1 0.12 0.14 0.06
hshld -0.07
clths 0.06 0.1 0.08
hlth -0.06
chems
txtl
cnstr -0.07
steel -0.08 -0.12
fabpr
elceq
autos -0.04
carry -0.04 0.06 0.04
mines -0.01
coal -0.02 -0.04 -0.04 -0.05
oil -0.09 -0.12 -0.11 -0.15 -0.15
utils 0.16 0.12 0.18 0.25 0.12
telcm -0.12

servs 0.02 0.05 0.07
buseq 0.05 0.03 0.06
paper
trans
whlsl
rtail 0.1 0.03 0.13
meals -0.1 0.05
fin 0.16 0.16 0.1 0.12 0.11 0.05 0.13 0.1
other
R^2 5.18 2.88 2.75 3.24 1.29 7.46 1.61 7.91 1.7 2.69

From the standpoint of gradual information diffusion across economically
related industries, many of the coefficient estimates in Table 2 appear
economically plausible (Hong et al. 2007). For example, the LASSO selects
lagged Fin (Banking, Insurance, Real Estate, and Trading) returns for 19 of the
30 individual industries, and eleven (seven) of the coefficient estimates are
significant at the 10% (5%) level using conventional OLS post-LASSO
t-statistics. Furthermore, all of the coefficient estimates for lagged Fin
returns are positive. This is economically sensible, because many industries
rely heavily on financial intermediaries for financing. A positive return shock
in the financial sector increases financial institutions' cash reserves, making
them more willing to extend credit to firms throughout the economy; in
contrast, negative return shocks in the financial sector reduce intermediaries'
capacity to lend, pushing up borrowing costs and lowering returns in many
sectors. Financial sector shocks have direct effects on firms that borrow from
financial intermediaries and indirect effects on the borrowing firms'
customers. Table 2 shows that the coefficient estimates for lagged Fin returns
are typically sizeable, reaching as high as 0.18 (Txtls, Textiles).

Another notable pattern in Table 2 involves industries at different stages of
the production process. Lagged returns for commodity- and material-producing
industries located earlier in the supply chain, such as Coal (Coal) and Oil
(Petroleum and Natural Gas), are frequently negatively related to returns for
industries located later in the supply chain, such as Smoke, Books (Printing
and Publishing), Txtls, Paper (Business Supplies and Shipping Containers),
Whlsl (Wholesale), and Meals (Restaurants, Hotels, and Motels). The LASSO
selects lagged Coal and Oil returns for 16 and 13, respectively, of the
individual industries in Table 2. Based on the conventional OLS post-LASSO
t-statistics, eleven and thirteen (seven and nine) of the coefficient estimates
for lagged Coal and Oil returns, respectively, are significant at the 10% (5%)
level. Table 2 shows that the estimated coefficients for lagged Coal and Oil
returns are all negative, with the exception of the autoregressive coefficient
for Coal. These negative relationships are presumably the result of supply
shocks that raise product prices and returns for sectors in earlier stages of
production but squeeze profit margins and lower returns for sectors in later
stages of production. The magnitude of the significant coefficient estimates is
once again substantial.

While many of the predictive relationships in Table 2 readily accord with
gradual information diffusion across related industries, other relationships
are more challenging to explain; for example, it is not obvious what economic
channel links lagged Beer (Beer and Liquor) returns to future Coal returns. It
is well known that machine learning is an effective means of discovering new
relationships in data, and we uncover numerous unusual relationships in
Table 2.
