Trade the Tweet: Social Media Text Mining and Sparse Matrix Factorization
for Stock Market Prediction

Andrew Sun, Michael Lachanski, Frank J. Fabozzi

PII: S1057-5219(16)30160-0
DOI: 10.1016/j.irfa.2016.10.009
Reference: FINANA 1050

To appear in: International Review of Financial Analysis

Received date: 6 August 2016
Accepted date: 17 October 2016

Please cite this article as: Sun, A., Lachanski, M. & Fabozzi, F.J., Trade the Tweet:
Social Media Text Mining and Sparse Matrix Factorization for Stock Market Prediction,
International Review of Financial Analysis (2016), doi: 10.1016/j.irfa.2016.10.009

This is a PDF file of an unedited manuscript that has been accepted for publication.
As a service to our customers we are providing this early version of the manuscript.
The manuscript will undergo copyediting, typesetting, and review of the resulting proof
before it is published in its final form. Please note that during the production process
errors may be discovered which could affect the content, and all legal disclaimers that
apply to the journal pertain.
ACCEPTED MANUSCRIPT

Trade the Tweet: Social Media Text Mining and Sparse

Matrix Factorization for Stock Market Prediction

Andrew Sun
AJSun Consultants
email: ajsun12@gmail.com

Michael Lachanski
SINSI
email: mlachans@princeton.edu

Frank J. Fabozzi
EDHEC Business School
email: fabozzi321@aol.com

The authors thank StockTwits and Pierce Crosby for providing data. Felix Wong and Mung Chiang provided us the MATLAB code on which our own R adaptation is based. Responsibility for errors resides with the authors.

Abstract

We investigate the potential use of textual information from user-generated microblogs to predict the stock market. Utilizing the latent space model proposed by Wong et al. (2014), we correlate the movements of both stock prices and social media content. This study differs from models in prior studies in two significant ways: (1) it leverages market information contained in high-volume social media data rather than news articles and (2) it does not evaluate sentiment. We test this model on data spanning from 2011 to 2015 on a majority of stocks listed in the S&P 500 index and find that our model outperforms a baseline regression. We conclude by providing a trading strategy that produces an attractive annual return and Sharpe ratio.

1 Introduction

The amount of public information available has dramatically increased since the efficient market hypothesis was first proposed by Fama (1970). In addition to an increase in traditional sources of information – news articles, analyst reports, and earnings statements, for example – there has been a staggering increase in the amount of user-generated content on social media. For example, Twitter reports on its homepage that it has 320 million monthly active users producing about 500 million tweets per day. Kalampokis et al. (2013) review studies exploring this expansion of social media and highlight the predictive power of social media for various applications. Social media data have become a popular source for stock market prediction, and many have explored their relationship with financial markets.

We present a text-analysis-based model for predicting stock prices and then explore the implications of such a model based on a trading strategy that leverages our proposed method. We evaluate our model with in-sample and ex-sample evaluations.

2 Literature Review

2.1 Text Mining from Traditional News Sources

The use of text mining (i.e., the statistical analysis of natural language data) on information sources is now a focal point for researchers. Loughran & McDonald (2011) analyze 10-K filings to develop alternate word lists that better reflect the tone in financial text. They try to identify specific words that contain information relevant to financial markets and link these word lists to returns, trading volume, and other market metrics. Nassirtoussi et al. (2014) summarize studies that focus on leveraging text for predicting asset price movements and review the performance of various text mining methods on various text sources and asset classes.

One of the earliest papers linking quantitative measures of language to stock price prediction is Tetlock (2007). He examines the interactions between media content and stock market activity and finds that investor pessimism can forecast patterns of market activity. Furthermore, Tetlock et al. (2008) utilize the "bag-of-words" scheme to collect all the words

from the Wall Street Journal (WSJ) and the Dow Jones News Service (DJNS) and then classify them as positive or negative using the Harvard-IV-4 psychosocial dictionary. They find that negative words used in the financial press typically forecast low firm earnings and that market prices incorporate textual data from newswire sources with only a slight delay.

Schumaker et al. (2012) also try to evaluate correlations between text and stock price movements using the quantitative textual financial prediction system that the authors developed, the Arizona Financial Text (AZFinText) system. The authors follow a two-step process: sentiment analysis and price prediction. Starting with text data from Yahoo! Finance, they determine whether the text is objective or subjective. Focusing only on the subjective text and making a significant effort to classify sentiment accurately, Schumaker et al. (2012) test their classification against the MPQA Opinion Corpus, a database that contains news articles from a wide variety of news sources that are manually annotated for opinions and other private states such as sentiments, beliefs, and emotions. They report a classification accuracy of 74%. Employing their AZFinText system, they find that price direction was easier to predict for subjective news articles, with 59% accuracy. Moreover, they find that the price decreases for articles classified as positive sentiment 53.5% of the time and increases for articles classified as negative sentiment 52.4% of the time. These results suggest an interesting contrarian strategy for equity traders: sell on good news and buy on bad news.

Mamaysky & Glasserman (2015) show that text data can also be an indicator of market volatility. They aggregate over 360,000 articles on 50 large financial companies between 1996 and 2014 and examine sequences of n words, known as n-grams, classifying each as having positive or negative sentiment. They find that an increase in unusual language of negative sentiment is subsequently followed by increased market volatility (measured by the VIX index) that lasts for several months at a time.

Typically, methods linking statistical language processing to stock market prediction focus on classifying sentiment in text sources as their first step. In contrast, Wong et al. (2014) utilize a methodology that ignores the evaluation of sentiment. Instead, they look at WSJ articles from 2008 to 2011 to first create a dictionary of the top 1,354 words and collect stock prices that correspond to each article from the WSJ for each trading day. Utilizing a

latent factor representation that links term frequencies to the log returns of each stock, they develop a model that predicts the day's closing price when given the articles for the day. The approach developed by Wong et al. (2014) differs from most studies for two main reasons: (1) as noted earlier, the methodology does not try to evaluate sentiment, avoiding any error in classifying positive or negative opinions, and (2) the methodology can predict prices for stocks not mentioned in any WSJ articles. Because of the simplicity and robustness of the methodology, we follow a similar one as described in Section 4.

2.2 Text Mining from Social Media
Researchers have long explored the predictive power of social media data; Kalampokis et al. (2013) summarize studies suggesting how social media can be used for various types of predictions. For example, Google search queries have been used to track influenza-like illnesses, Amazon reviews to predict product sales, and Twitter posts to predict rainfall. One of the earliest empirical studies investigating the effect of social media on the stock market, Antweiler & Frank (2004), focused on applying text from Yahoo! Finance message boards to predict stock market volatility. These studies suggest that predictive indicators can be derived from social media content.

Recently, researchers have explored Twitter as their source for social media content. Although each post or tweet is limited to 140 characters, in aggregate it is believed that the information may provide an accurate representation of public sentiment. Examining tweet analysis and IPO performance, Liew & Wang (2016) find that there is a positive and significant correlation between IPOs' average tweet sentiment and IPOs' first-day returns, not only on the first trading day but also two or three days prior. Further, examining the relationship between tweets and earnings announcements, Liew et al. (2016) report that not only is the consensus earnings estimate from crowdsourced information more accurate (by more than 60%) but also tweet sentiment before the earnings announcement can predict post-announcement risk-adjusted excess returns. Also, Azar & Lo (2016) show that tweets during Federal Open Market Committee (FOMC) meeting dates contain information that can be used to predict stock market returns and to build benchmark-outperforming portfolios.

Another important paper to highlight is that of Liew & Budavri (2016). They use

StockTwits data to show that social media has significant power in explaining the time-series variation in returns. They subsequently propose a sixth "Social Media Factor" for the Fama-French five-factor model that is both distinct from the previous five factors and significant in predicting returns. In our study, we contribute to the literature on text mining for finance by providing the first application of the algorithm from Wong et al. (2014) to social media data from StockTwits at the daily and intraday frequency.

3 Data

In this section, we describe the data and explain how we pre-processed them.

3.1 Text Data from StockTwits
The text data that we use are from StockTwits.com. Founded in 2008, StockTwits® is a financial communications platform targeting participants in the investment community focusing on individual stocks and the stock market. $TICKER tags facilitate the organization and aggregation of information "streams" about equities and markets from across the web. As of 2016, there were over 300,000 users on StockTwits producing streams that are viewed by approximately 40 million people worldwide. Their content can be integrated with many other financial sites, including Yahoo! Finance, CNNMoney, Reuters, TheStreet.com, Bing.com and The Globe and Mail. StockTwits invests considerable effort in filtering out finance-unrelated messages and spam. In our opinion, StockTwits provides both high-quality and large-scale text data for our text mining purposes.

We obtained approximately 45 million messages from StockTwits streams from January 1, 2011, to August 31, 2015. Each data point provides about 40 different features including content, follower count, following count, posted time and tag information. We, however, are only interested in the tweet's text and post time. To take advantage of StockTwits, the streams need to be pre-processed via text mining. We utilize the R library tm to carry out the pre-processing steps. First, we consolidate the streams on a given day into a single large body of text. For intraday experiments, we further separate the data into AM, midday and PM periods.

Figure 1: Plot of per-day number of mentions and price of "oil" and "aapl" against time from 1/1/2011 to 8/31/2015. Panel a: per-day mentions of the word "oil"; Panel b: per-day price of Brent Crude Oil; Panel c: per-day mentions of the word "aapl"; Panel d: per-day price of AAPL.

Once the text data have been consolidated, we clean the text of nonword

terms such as website URLs and emoticons. Next, we remove stop-words (words such as "the" and "it" that may add noise) and punctuation, and keep all text in lowercase for easy comparison. An exploratory analysis reveals trends in the number of times keywords are mentioned. In Figure 1 we examine our StockTwits dataset at a high level by plotting the word counts of the words "oil" and "aapl" and their corresponding prices against time. From the plots, it is not clear whether a relationship between word count and price exists. For "oil", there appears to be a negative correlation between word count and price in 2015. On the other hand, there may or may not be a correlation between the word count and price of "aapl." Loughran & McDonald (2011) suggest that raw word counts may not be the best measure for a word's information content and that special weighting should make text-based analysis more informative.
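A rough Python equivalent of these pre-processing steps may be useful for illustration (the paper uses the R library tm; the stop-word list and regular expressions below are illustrative assumptions, not the authors' actual configuration):

```python
import re

# Toy stop-word list for illustration; the list used with R's tm library is much larger.
STOP_WORDS = {"the", "it", "a", "an", "and", "is", "to", "of"}

def preprocess(messages):
    """Consolidate a day's StockTwits messages into cleaned lowercase tokens."""
    text = " ".join(messages).lower()          # consolidate streams and lowercase
    text = re.sub(r"https?://\S+", " ", text)  # strip website URLs
    text = re.sub(r"[^a-z$ ]", " ", text)      # strip punctuation/emoticons, keep $TICKER tags
    return [w for w in text.split() if w not in STOP_WORDS]

day_stream = ["$AAPL looks strong, buy the dip! http://example.com",
              "Analysts are bullish on oil :)"]
print(preprocess(day_stream))
# -> ['$aapl', 'looks', 'strong', 'buy', 'dip', 'analysts', 'are', 'bullish', 'on', 'oil']
```

In practice the cleaned tokens for each day (or each intraday period) would then be counted against the dictionary to fill one column of the term-document matrix described in Section 4.1.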

Additionally, we note that the use of StockTwits has dramatically increased over the period of our investigation. Panel a of Figure 2 shows that the word counts per day increase at nearly an exponential rate, which can be attributed to the growth of StockTwits' user base. We can especially observe the effect of StockTwits' growth in panel b of Figure 2, where both the average word count per year and the variance in the total word count throughout each year increased. The skew in the word count is most likely a by-product of the high volume of social media text information. Despite StockTwits' "growing pains," we believe that the high-volume nature of the dataset will provide valuable indicators of the market, and we explore methods of normalization that correct the skew, which we outline further in Section 4.

3.2 Stock Price Data
In this paper, we seek to predict the prices of the component stocks of the S&P 500 Index. We examine only stocks traded during our examination period. Furthermore, we remove stocks with low volume such as Berkshire Hathaway B shares (BRK-B). We obtain a final stock list of 420 component stocks.

We obtain the historical close prices of the 420 component stocks for all trading days between January 1, 2011, and August 31, 2015, using the R library quantmod and supplement any missing data through The Center for Research in Security Prices (CRSP). We calculate the log-returns of these prices for all 1,173 trading days and retain them in a matrix where the rows are the component stocks and the columns are the trading days, for a 420 × 1,173 matrix.

For intraday tests, we obtain the price at open, noon and close on all trading days between January 1, 2011, and August 31, 2015, from NYSE Trade and Quote (TAQ). Similarly, we calculate the log-returns of these prices for all 3,519 trading periods. We keep these data in a matrix of similar form, but with dimensions 420 × 3,519.

We note two main differences between the data examined in this paper and previous studies. First, many previous studies use news articles as the source of their text information. More specifically, they look at news sources commonly read in the industry such as the WSJ or DJNS. Correspondingly, news articles can be considered high-quality but low-quantity sources of information (when compared to social media). In contrast, StockTwits streams are low-quality but high-quantity sources of text information. We rely on StockTwits' filtering methods and eliminate nonwords (e.g., emoticons and website URLs) and stop-words to improve the signal we obtain from our dataset. We limit our non-price data to that which can be obtained from social media, in particular, StockTwits.

Figure 2: Plot of word counts on StockTwits from 1/1/2011 to 8/31/2015. Panel a: word counts per day; Panel b: word counts per year.

Second, most studies, when examining social media, utilize Twitter as their primary text source. Although Twitter may appear to be a good data source for our methods, since we do not try to classify user sentiment, large amounts of Twitter content simply add noise to stock market prediction algorithms. For example, the phrases "I love NYC!" and "Analysts are bullish" both contain positive sentiment, but only one is likely to be relevant to the stock market. Therefore, we utilize StockTwits, a dataset very similar to Twitter, to harness the power of social media and target financial content.

4 Methodology

In this section, we describe the methodology of this paper.[1]

4.1 Text Mining

The first important step for our model is the creation of a dictionary of terms through text mining. Our dictionary was created by examining the top words for each year and combining them with the tickers of the 420 stocks. Some tickers were removed from our dictionary due to too few mentions. Sample words from our dictionary include typical indicators such as "buy," "short," and "hold" as well as "aapl" and "spy." After creating a dictionary, we create a term-document matrix, a matrix where the rows correspond to the terms in the dictionary and the columns correspond to the documents. For our data, each "document" is a successive trading day and each entry is the word count for that term. More formally, we let Y denote our term-document matrix and let $y_{i,t}$ indicate the word count for term i on day t.

Various term weighting schemes have been suggested: simple term frequency weighting (tf), term frequency-inverse document frequency (tf-idf) and modified versions such as the one proposed by Loughran & McDonald (2011). We find that a term weighting scheme introduced in Salton & Buckley (1988) works best with our methodology. Since text information on each day may vary in total word count, we normalize the text on each day through the cosine

[1] All code is available for perusal on www.github.com/ajsun/trade-the-tweet. We try to follow the standards of Gentzkow & Shapiro (2014) wherever possible.


normalization (vector length normalization) method, in which each day's text is treated as a vector and divided by its Euclidean norm. If each day's vector of term frequencies is represented as $y_t$, then $y_t / \|y_t\|_2$ is the cosine normalization. From the output of this normalization shown in Figure 3, it can be seen that the normalization has significantly decreased the trend. Note that the larger variance seen in the earlier years of the plot is reasonable, as it is expected that the amount of content produced will stabilize as StockTwits increases in popularity.

Figure 3: Plot of Normalized Term Frequencies

Finally, we standardize our frequencies by term. If we represent the ith term's frequency on day t as $y_{i,t}$, then we standardize $y_{i,t}$ using the standard z-score

$$\frac{y_{i,t} - \bar{\mu}_i}{\bar{\sigma}_i}$$

where $\bar{\mu}_i$ and $\bar{\sigma}_i$ are, respectively, the mean and standard deviation of the term frequencies for term i calculated over all prior days. We remove standardized values that are negative and prune values above three standard deviations. Negative values are removed because, intuitively, terms mentioned fewer times than average should have no effect on the stock price. Values above three standard deviations are pruned to reduce the effect of outliers. This gives us a standardized, normalized and skew-adjusted term-document matrix.
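A minimal Python sketch of the two adjustment steps may help make them concrete (the paper's pipeline is in R; here "pruning" above three standard deviations is implemented as capping, which is one reading of the text, and the function names are our own):

```python
import math

def cosine_normalize(y):
    """Divide a day's term-frequency vector by its Euclidean norm."""
    norm = math.sqrt(sum(v * v for v in y))
    return [v / norm for v in y] if norm > 0 else y

def standardize(y_it, prior):
    """z-score one term's frequency against its mean/std over all prior days,
    zeroing negative values and capping at three standard deviations."""
    mu = sum(prior) / len(prior)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in prior) / len(prior))
    if sigma == 0:
        return 0.0
    z = (y_it - mu) / sigma
    return min(max(z, 0.0), 3.0)  # drop negatives, prune outliers

day = [3.0, 4.0]                      # raw counts for two terms on one day
print(cosine_normalize(day))          # -> [0.6, 0.8]
print(standardize(9.0, [1.0, 2.0, 3.0]))  # well above the prior mean -> capped at 3.0
```

Applying `cosine_normalize` column-by-column and `standardize` entry-by-entry yields the skew-adjusted term-document matrix described above.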


4.2 A Sparse Matrix Factorization Model

Our sparse matrix factorization (SMF) model closely follows the methodology of Wong et al. (2014). The SMF model has two favorable qualities for price prediction: (1) the matrices U and W are low rank when the selected $d \ll s$, and (2) the sparse matrix selects only the most relevant parameters for predicting the stock market, minimizing overfitting. Below we outline the model in further detail.

4.2.1 Basic Matrix Factorization Framework

The matrix factorization model that we use maps both text and stocks to a joint latent factor space of dimensionality d. Each stock i is associated with a latent factor vector $u_i \in \mathbb{R}^d$ and the text data on each trading period t is associated with a vector $v_t \in \mathbb{R}^d$. The resulting dot product $u_i^T v_t$ captures the interaction between stock i and trading day t, which can be used to approximate the log return $\hat{r}_{it}$. In other words:

$$\hat{r}_{it} = u_i^T v_t$$

4.2.2 Introduction of Text


We introduce a text vector $y_t$, which contains the word frequencies on trading day t from the adjusted term-document matrix created in Section 4.1. Then, the day's latent text vector $v_t$ is inferred from the term frequencies $y_t$. Given m terms, we create a new matrix $W \in \mathbb{R}^{d \times m}$ that linearly maps $y_t$ to $v_t$. The log return for a given stock i can then be expressed as:

$$\hat{r}_{it} = u_i^T W y_t$$

The goal of the problem then is to learn the feature vectors $u_i$ and the mapping matrix W using the historical data from s days. In matrix form we denote the returns as $R = [r_{it}] \in \mathbb{R}^{n \times s}$, $U \in \mathbb{R}^{n \times d}$, $W \in \mathbb{R}^{d \times m}$, $Y = [y_1 \ldots y_s] \in \mathbb{R}^{m \times s}$. R is the return matrix with stocks as the rows and days as the columns, U is the latent factor matrix relating n stocks to d factors, W is the latent factor matrix relating d factors to m terms, and Y is the term-frequency matrix with terms as the rows and days as the columns. We then can formulate the following objective function:

$$\underset{U \ge 0,\, W}{\text{minimize}} \;\; \frac{1}{2}\,\|R - UWY\|_F^2$$

and solve for U and W.
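As a toy numeric illustration of the prediction equation $\hat{r}_{it} = u_i^T W y_t$ (a Python sketch with made-up values for $u_i$, W and $y_t$; the actual model learns these from data):

```python
def predict_return(u_i, W, y_t):
    """Predict the log return r_hat_it = u_i^T W y_t for one stock.
    u_i: length-d vector; W: d x m matrix (list of rows); y_t: length-m vector."""
    d, m = len(W), len(W[0])
    # v_t = W y_t is the day's latent text vector
    v_t = [sum(W[k][j] * y_t[j] for j in range(m)) for k in range(d)]
    return sum(u_i[k] * v_t[k] for k in range(d))

# Toy example: d = 2 latent factors, m = 3 terms
u_i = [1.0, 0.5]
W = [[0.2, 0.0, 0.0],   # sparse rows: most words carry no weight for a factor
     [0.0, 0.0, 0.4]]
y_t = [1.0, 2.0, 3.0]   # adjusted term frequencies for the day
print(predict_return(u_i, W, y_t))  # approximately 0.8
```

Note how the middle term contributes nothing: with a sparse W, only the words with nonzero weights move the predicted return, which is exactly the motivation for the sparseness constraints in Section 4.2.3.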

4.2.3 Sparseness Constraints

Overfitting is the main problem when solving for U and W. For example, 700 terms (a small dictionary) and only 10 latent factors would yield 7,000 parameters to be estimated. Thus, it is important to include regularization terms in our formulation of the problem to minimize the risk of overfitting. By introducing a sparseness constraint on the matrix W, we ensure that the number of nonzero parameters is small. Intuitively, limiting the number of nonzero parameters makes sense, as not all words may be relevant when predicting the log return of a certain stock. Thus, sparseness constraints address the problem of noisy terms highlighted in Section 3.1.

As such, we introduce sparse group lasso regularization. The sparse group lasso penalty is

$$\lambda \sum_{j=1}^{m} \|W_j\|_2 + \mu \|W\|_1$$

As can be seen, there are two terms in the sparse group lasso. In order to select only a few words for each latent factor, we minimize the first term, where $W_j$ denotes the jth column (word) of the matrix W. This first regularization term ensures that only a small number of the columns of W will be nonzero. We minimize the second term to ensure that each word corresponds with only a few latent factors; it reduces the number of nonzero entries in each column. The optimization problem then becomes:

$$\begin{aligned} \underset{U,\, W}{\text{minimize}} \quad & \frac{1}{2}\,\|R - UWY\|_F^2 + \lambda \sum_{j=1}^{m} \|W_j\|_2 + \mu \|W\|_1 \\ \text{s.t.} \quad & U \ge 0 \end{aligned}$$

Wong et al. (2014) show that this optimization problem can be solved for a local minimum with the alternating direction method of multipliers (ADMM) by rewriting the problem with auxiliary variables A and B:

$$\begin{aligned} \underset{A,\, B,\, U,\, W}{\text{minimize}} \quad & \frac{1}{2}\,\|R - ABY\|_F^2 + \lambda \sum_{j=1}^{m} \|W_j\|_2 + \mu \|W\|_1 + I_+(U) \\ \text{s.t.} \quad & A = U, \; B = W \end{aligned}$$

where $I_+(U) = 0$ if $U \ge 0$ and $I_+(U) = \infty$ otherwise. Then we form the augmented Lagrangian with the Lagrange multipliers C and D:

$$\begin{aligned} L_\rho(A, B, U, W, C, D) = {} & \frac{1}{2}\,\|R - ABY\|_F^2 + \lambda \sum_{j=1}^{m} \|W_j\|_2 + \mu \|W\|_1 + I_+(U) \\ & + \operatorname{tr}(C^T(A - U)) + \operatorname{tr}(D^T(B - W)) \\ & + \frac{\rho}{2}\,\|A - U\|_F^2 + \frac{\rho}{2}\,\|B - W\|_F^2 \end{aligned}$$

We solve this Lagrangian using the ADMM method described in the Appendix. Once we have found matrices U and W, we can generate a prediction for tomorrow's log return given today's text data.

4.3 Training and Testing

In this paper, we predict stock price directions at a daily and intraday frequency. At

the daily level, given StockTwits streams as inputs, we predict the log return at the close of

each trading day. For our intraday tests, we predict the log return at midday and close for

each successive trading day.

4.3.1 Daily Prediction

The dataset is split into training, validation and test sets. The training set comprises 502 trading days spanning January 1, 2011 to December 31, 2012. The validation set contains 252 trading days from January 1, 2013 to December 31, 2013, and the test set contains 252 trading days from January 1, 2014 to December 31, 2014 plus 167 trading days from January 1, 2015 to August 31, 2015, for a total of 419 trading days.

The main variables for our model are:

• n stocks, m terms and s days
• $r_{it}$: the log return of stock i on day t
• $y_{jt}$: the adjusted frequency of word j on day t
• $p_{it}$: the close price of stock i on day t
• $u_i$: latent factor vector for stock i

Our methodology is as follows: on day t, calculate U and W from the historical data $[r_{it'}]$ and $[y_{jt'}]$ where $t' < t$ ($[\cdot]$ denotes the matrix of those entries). Then use the adjusted term frequencies $[y_{jt}]$ on day t to predict $[\hat{r}_{it}]$ for all i. Once the returns $[\hat{r}_{it}]$ have been predicted, we compare the signs of the predictions with the actual returns to evaluate the prediction accuracy for the price direction. Finally, we can recover the price from the return since $p_{it} = p_{i,t-1} e^{\hat{r}_{it}}$.
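The sign-comparison and price-recovery steps above can be sketched in a few lines of Python (the numbers are invented and the helper names are ours, not from the paper's code):

```python
import math

def directional_accuracy(predicted, actual):
    """Fraction of (stock, day) pairs where the predicted log return has the
    same sign as the realized one."""
    hits = sum(1 for p, a in zip(predicted, actual) if p * a > 0)
    return hits / len(actual)

def recover_price(prev_price, log_return):
    """p_it = p_{i,t-1} * exp(r_hat_it)."""
    return prev_price * math.exp(log_return)

preds   = [0.01, -0.02, 0.005, -0.01]
actuals = [0.03, -0.01, -0.002, -0.02]
print(directional_accuracy(preds, actuals))   # 3 of 4 signs match -> 0.75
print(recover_price(100.0, 0.01))             # close price implied by a 1% log return
```

In the actual walk-forward procedure, U and W would be re-estimated from all data prior to day t before each prediction; the snippet only shows the evaluation step.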


4.3.2 Intraday Prediction

Like the daily prediction, the dataset is split into training, validation and test sets. The training set comprises 1,506 trading periods (open to midday, midday to close, close to open) between January 1, 2011 and December 31, 2012. The validation set contains 756 trading periods between January 1, 2013 and December 31, 2013, and the test set contains 1,257 trading periods between January 1, 2014 and August 31, 2015.

Our methodology for intraday predictions is identical to the methodology for daily predictions; however, for the intraday case, t indexes trading periods rather than trading days.

4.3.3 Hyper-parameters

As part of the model, there are several hyper-parameters that require tuning.

Hyper-parameter   Description
s                 Number of historical days used to predict price on day t
d                 Number of latent factors
λ                 Penalty parameter for the first sparse group lasso term
µ                 Penalty parameter for the second sparse group lasso term
ρ                 Lagrangian penalty parameter

First, we set d = 10 to ensure that W has low rank, with only 10 latent factors. Then we perform a grid search: for each candidate set of hyper-parameters, we compare prediction accuracies on the validation set. Furthermore, λ and µ were selected only in ranges such that the matrix W remained sparse.
5 Results

5.1 In-Sample Results

Given a matrix of returns R for n stocks over the last s days, we learn U and W and predict the return for stock i on day t by using the fact that $\hat{r}_{it} = u_i^T W y_t$, where $y_t$ is the vector of term frequencies on day t. In Table 1, we see that the model is able to predict price direction on the training data set with an accuracy of 70.12%. We also evaluate our in-sample testing using precision and recall, which are defined as

$$Precision = \frac{tp}{tp + fp}, \qquad Recall = \frac{tp}{tp + fn}$$

where tp, fp, tn and fn are the numbers of true positive, false positive, true negative and false negative predictions, respectively. In the context of our model, precision can be seen as a measure of the exactness of our predictions, and recall as a measure of the completeness of our predictions. We obtain a precision of 68.93% and a recall of 75.05%.
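These definitions can be checked with a small Python sketch (toy predictions; "positive" is taken to mean a predicted up move, i.e., a positive log return):

```python
def precision_recall(predicted, actual):
    """Precision = tp/(tp+fp), Recall = tp/(tp+fn), where a 'positive'
    is an 'up' prediction (predicted log return > 0)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p > 0 and a > 0)
    fp = sum(1 for p, a in zip(predicted, actual) if p > 0 and a <= 0)
    fn = sum(1 for p, a in zip(predicted, actual) if p <= 0 and a > 0)
    return tp / (tp + fp), tp / (tp + fn)

preds   = [0.02, 0.01, -0.01, 0.03]
actuals = [0.01, -0.02, 0.02, 0.04]
prec, rec = precision_recall(preds, actuals)
print(prec, rec)  # tp=2, fp=1, fn=1 -> precision 2/3, recall 2/3
```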

In panel a of Figure 4 we see that all stocks have prediction accuracies greater than 60%. Similarly, the dotted line in panel b of Figure 4 represents 50% accuracy, and we can see that our model achieves directional accuracy greater than 50% on most days in our in-sample examination.

            In-sample    Ex-sample Daily         Ex-sample Intraday
Year        2011-2012    2013    2014    2015    2013    2014    2015
Accuracy    70.12        51.18   51.42   51.58   50.23   48.61   49.03
Precision   68.93        54.14   53.72   51.25   53.12   50.80   48.28
Recall      75.05        53.99   46.34   53.41   49.66   49.29   44.60

Table 1: Test Accuracy Comparison (%)

Figure 4: In-Sample Accuracy. Panel a: in-sample accuracy by stock; Panel b: in-sample accuracy by day.

5.2 Ex-Sample Results

Table 1 outlines the prediction accuracy of our model on the test set (2014-2015). We

also report the results of our validation set (2013) for comparison. We compare our SMF

model to other baseline models outlined below.

                   Daily                   Intraday
Model              2013    2014    2015    2013    2014    2015
SMF                51.12   51.42   51.58   50.23   48.61   49.03
Previous Return    49.55   49.65   49.15   48.67   48.07   48.53
Previous Price     48.66   48.16   49.31   46.52   47.38   47.11
AR on Return       50.61   50.80   50.31   49.66   51.04   48.87
Random             49.39   49.74   49.65   48.89   49.07   49.17

* figures in bold indicate largest values for their respective column

Table 2: Model Accuracy (%)

• Previous return/price: returns and prices are predicted to be the same as the previous day's
• Autoregressive (AR) models: autoregressive models used to predict both price and return for each stock
• Random: return predictions based on a market timer making random guesses

Table 2 summarizes the results of our model compared to the baseline models. We see that the SMF model predicts direction more accurately than the baseline models in all years at the daily frequency. Furthermore, the efficient market hypothesis suggests that our model should not do any better than the random test. Thus, even a small percentage-point increase may suggest some significance in our predictive power. We explore our algorithm further in a proposed trading strategy in Section 5.4.

5.3 Intraday Results

In addition to our daily testing, we also extend our tests to encompass predictions at an intraday level. Table 1 reports the results of our intraday testing for our validation (2013) and test (2014-2015) sets. As in our daily tests, we also compare our intraday SMF model to the baseline models. Table 2 summarizes the results of this evaluation. We note that the intraday predictions from the SMF model do not beat the baselines every year, and in fact our model is outperformed by a random guess in 2014 and 2015.


5.4 Trading Strategy

One method of measuring the performance of a portfolio is the Sharpe ratio (SR). For a
given portfolio h, the ex-ante SR of that portfolio is defined as:

\[ SR(h) = \frac{E[r_h] - r_f}{\sigma_h} \]

where E[r_h] is the expected return of the portfolio, r_f is the risk-free return, and σ_h is the
standard deviation of the returns of the portfolio. For our calculations (and for simplicity) we
assume the risk-free return is approximately zero, a reasonable assumption for short-term
rates during the study period. In his working paper, Lachanski (2015) shows that we can
calculate the SR of a market timing strategy using a closed-form expression:

\[ SR = \frac{g}{\sqrt{\kappa - g^2}} \]

where \( \kappa = \frac{E[r_t^2]}{4E[|r_t|]^2} \), \( g = p - \frac{1}{2} \), and p is the probability of the model making a correct
prediction. The value of κ can be estimated using the portfolio's daily excess returns:

\[ \hat{\kappa} = \frac{T \sum_{t=1}^{T} r_t^2}{4 \left( \sum_{t=1}^{T} |r_t| \right)^2} \]
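The closed-form market-timing SR and the plug-in estimate of κ can be sketched in a few lines. The function names and the sample daily returns below are our own illustration, not values from the paper:

```python
import math

def estimate_kappa(returns):
    """Plug-in estimate: kappa_hat = T * sum(r^2) / (4 * (sum |r|)^2)."""
    T = len(returns)
    return T * sum(r * r for r in returns) / (4 * sum(abs(r) for r in returns) ** 2)

def market_timing_sr(p, kappa):
    """Lower-bound SR = g / sqrt(kappa - g^2), where g = p - 1/2 and p is
    the probability of a correct directional call."""
    g = p - 0.5
    return g / math.sqrt(kappa - g * g)

daily = [0.01, -0.02, 0.015, -0.005]       # hypothetical daily excess returns
k = estimate_kappa(daily)                   # = 4 * 0.00075 / (4 * 0.05**2) = 0.3
print(round(k, 6))                          # 0.3
print(round(market_timing_sr(0.6, k), 4))   # 0.1 / sqrt(0.29) ~ 0.1857
```

Note how quickly the bound rises with p: holding κ fixed, the numerator grows linearly in p while the denominator shrinks, which is why modest accuracy gains translate into large SR gains.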
Using this definition of the market timing SR, we calculate the SR of our strategy and compare
it to three S&P 500 Exchange Traded Funds (ETFs): the SPDR S&P 500 ETF, the iShares Core S&P
500 ETF, and the Vanguard 500 Index Fund. We see from Table 3 that our SR is significantly
lower than that of the market ETFs. However, it is important to note that the SR calculated from
the closed-form expression is a lower bound, and trading strategies may take advantage of
correlations between stocks to generate a higher SR. Thus, we propose the following trading
strategy:

1. Given the text information for a day, predict the up-down direction of stocks

2. Invest all capital in the stocks with an “up” prediction equally

3. Sell all assets at the close price of the day

4. Repeat for all trading days
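A minimal backtest of the four steps above might look like the following sketch. The tickers, predictions, and return numbers are invented for illustration and are not the paper's data:

```python
def backtest(predictions, returns):
    """Each day: go long, in equal weight, every stock predicted "up";
    sell everything at the close.  Returns the list of daily portfolio
    returns and the cumulative return (growth of $1)."""
    daily = []
    for preds, rets in zip(predictions, returns):
        longs = [s for s, direction in preds.items() if direction == "up"]
        if longs:
            daily.append(sum(rets[s] for s in longs) / len(longs))
        else:
            daily.append(0.0)  # stay in cash if nothing is predicted "up"
    cumulative = 1.0
    for r in daily:
        cumulative *= 1.0 + r
    return daily, cumulative

predictions = [{"AAPL": "up", "XOM": "down"}, {"AAPL": "up", "XOM": "up"}]
returns = [{"AAPL": 0.02, "XOM": -0.01}, {"AAPL": 0.01, "XOM": 0.03}]
daily, cum = backtest(predictions, returns)
print([round(d, 6) for d in daily])  # [0.02, 0.02]
print(round(cum, 4))                 # 1.0404
```

A magnitude-weighted variant, as in the "SMF Weighted" row of Table 4, would replace the equal weights with weights proportional to the predicted return.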

We compare our strategy against other trading strategies including an equally-weighted

portfolio (EW) and the global minimum variance portfolio (GMV). The EW portfolio, as the


Portfolio                    Sharpe Ratio
SMF Market Timing            0.38
SPDR S&P 500 ETF             0.97
iShares Core S&P 500 ETF     0.98
Vanguard 500 Index Fund      0.98

Table 3: Sharpe Ratio of the Market Timing Model and Comparable ETFs

Portfolio           Cumulative Return  Annualized Return  SR    Worst Day Return  Best Day Return
SMF                 1.55               1.18               1.35  -0.038            0.035
SMF Weighted        1.38               1.13               1.01  -0.038            0.036
SPDR S&P 500 ETF    1.38               1.13               0.98  -0.043            0.038
GMV²                1.22               1.08               1.61  -0.012            0.010
EW                  1.19               1.07               0.71  -0.042            0.021
* figures in bold indicate largest values for their respective column

Table 4: Portfolio Performance Comparison

name suggests, invests in all assets equally, and the GMV portfolio invests in the portfolio
that minimizes the variance over all possible portfolios. Note that our strategy invests equally
across all selected assets. We also show a strategy whose weights are set according to the
magnitude of the return predicted by our model. We simulate trading for all trading days
between January 1, 2011 and August 31, 2015.

Table 4 lists the cumulative and annualized returns, SR, and worst-day and best-day
returns. We see that our basic strategy obtains the highest return compared to the
SPDR S&P 500 ETF, while the weighted strategy matches the return of the ETF. We also
see that our strategy obtains an SR of 1.35, which is much higher than the lower bound
and higher than the SR of all comparable ETFs. We note that the GMV portfolio
obtains the highest SR of all portfolios, despite having a lower return. This phenomenon is
cited as one of the reasons why the SR may not always be the best measure of fund
performance. Nevertheless, our strategy tops the index ETF in all categories.

Figure 5 shows the plot of our trading strategy over the trading period (January 2013 -
August 2015) against the other portfolios. The black lines divide between 2013, 2014 and 2015
² The GMV portfolio weights are calculated using the closed-form expression \( w = \frac{\Sigma^{-1}\mathbf{1}}{\mathbf{1}^{T}\Sigma^{-1}\mathbf{1}} \),
where Σ is the covariance matrix.
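The GMV weights in the footnote can be computed directly. A small numpy sketch follows; the covariance numbers are invented for illustration:

```python
import numpy as np

def gmv_weights(cov):
    """Global minimum variance weights: w = inv(Sigma) @ 1 / (1' inv(Sigma) 1)."""
    ones = np.ones(cov.shape[0])
    w = np.linalg.solve(cov, ones)  # Sigma^{-1} 1, without forming the inverse
    return w / w.sum()

# Two uncorrelated assets: the lower-variance asset gets the larger weight.
cov = np.array([[0.04, 0.0],
                [0.0, 0.01]])
print(gmv_weights(cov))  # approximately [0.2, 0.8]
```

Using `solve` rather than an explicit matrix inverse is the usual numerically stable way to evaluate this expression.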

Figure 5: Graph of Portfolio Performance


Figure 6: Model Accuracy Plotted Against Magnitude of Return

respectively. We see that our strategy performs worst in 2013 but improves in 2014
and in 2015. This improvement is most likely due to the improvement in our prediction
accuracy from 2013 to 2015. We hypothesize that our trading strategy is able to outperform
benchmark portfolios despite small increases in overall accuracy because our strategy is
better at predicting returns as they increase in magnitude. Figure 6 shows the relationship
between the magnitude of return and prediction accuracy and, more specifically, the positive
correlation between the two. This positive correlation suggests that our trading strategy is
better able to take advantage of large positive returns and avoid large negative returns,
helping us perform better than the index. These results suggest that we are able to leverage
market indicators inherent in the StockTwits streams to predict the stock market better
than other models.
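The accuracy-by-magnitude relationship behind Figure 6 can be sketched by bucketing days by the absolute realized return and computing the hit rate within each bucket. The bucket edges and sample numbers below are our own choices, not the paper's:

```python
def accuracy_by_magnitude(predicted, realized, edges):
    """Hit rate of sign predictions within buckets of |realized return|.
    `edges` are the right endpoints of the magnitude buckets; days whose
    |return| exceeds the last edge are ignored, and empty buckets give None."""
    hits = [0] * len(edges)
    counts = [0] * len(edges)
    for p, r in zip(predicted, realized):
        for i, edge in enumerate(edges):
            if abs(r) <= edge:
                counts[i] += 1
                if (p >= 0) == (r >= 0):
                    hits[i] += 1
                break
    return [h / c if c else None for h, c in zip(hits, counts)]

preds = [0.01, -0.01, 0.02, -0.03, 0.04, 0.05]
reals = [-0.002, 0.003, 0.018, -0.025, 0.041, 0.052]
# Buckets: |r| <= 1%, |r| <= 3%, |r| <= 10%
print(accuracy_by_magnitude(preds, reals, [0.01, 0.03, 0.10]))
# [0.0, 1.0, 1.0] -- accuracy rises with the magnitude of the move
```

A positive slope across buckets, as in this toy example, is exactly the pattern the paragraph above describes for Figure 6.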

6 Conclusions

From the results described in Section 5, we find that we are able to use SMF methods
to extract market indicators from StockTwits streams to predict stock price direction.
These findings support the claim of Wong et al. (2014) that SMF methods can be used in
conjunction with text mining to predict the stock market.

This paper has two main conclusions. First, market-timing predictions using the SMF
model and StockTwits streams perform better than most basic baseline models. The
out-of-sample results show that in every test year, our algorithm is able to beat benchmark
methods that do not utilize information other than price history in their predictions. Moreover,
we found a prediction accuracy of 51.37% at the daily prediction frequency. At first glance,
most investors would not consider a 51% prediction accuracy to be significant, and few would
be willing to pay a premium for a coin that flips heads 51% of the time rather than 50%.
In the context of our model, however, a prediction accuracy of just 60% on our data
would yield a lower bound for the Sharpe ratio of 2.87, larger than all but two funds listed
on Morningstar. In fact, our prediction accuracies may suggest that StockTwits contains
information useful to asset managers and investors. We show how to use our predictions in
a trading strategy with an SR of 1.35 and an annualized return of 1.18.


Second, we conclude that increasing the frequency of predictions does not seem to improve
prediction accuracy. At first, intuition led us to believe that prediction accuracy
should increase with frequency due to the rumor-like nature of StockTwits.
Unlike high-quality news sources such as the Wall Street Journal, StockTwits information
may take only a few hours to be incorporated into the market rather than a whole day.
However, our results show the opposite: accuracy does not seem to increase with frequency
and is sometimes beaten by a random guess. By examining the interaction between users on
StockTwits, we hypothesize that this may be due to content-sharing on the site. Rather
than produce original content and opinions, many streams share and link to information
from other sources. This secondary nature of StockTwits causes a delay, so that the
information comes out after it has been incorporated into market prices. This diminishes the
predictive effect of the text information at the daily level and eliminates it at the intraday
level. On the other hand, news articles such as those from the Wall Street Journal are
primary sources and reflect new and original content. The difference between these two news
sources may explain why Wong et al. (2014) were able to obtain higher prediction accuracies
at the daily level using Wall Street Journal articles.

The efficient market hypothesis implies that it is impossible to predict the market and
consistently outperform a benchmark on a risk-adjusted return basis after taking into account
transaction costs. In this paper, we make two simplifying assumptions. First, we do not take
into account transaction costs. Thus, to accurately compare the performance of our method
with the market index or other mutual fund managers, we must factor transaction costs
into our model. However, we note that mutual funds and other asset managers do charge
fees that may be equivalent to adding transaction costs. Second, we assume the risk-free
rate is zero. This is likely to be inconsequential, as the 3-month US Treasury rate in 2015
was 0.02%, a rate very close to 0. The market conditions during our period of investigation
(zero interest rates and a rising equity market) are conducive to positive performance, and it
would be interesting to see whether the trading strategy would continue to perform well under
varying conditions.

References

Antweiler, W. & Frank, M. Z. (2004). Is all that talk just noise? The information content
of internet stock message boards. Journal of Finance, 59(3), 1259-1294.

Azar, P. & Lo, A. W. (2016). The wisdom of Twitter crowds: Predicting stock market
reactions to FOMC meetings via Twitter feeds. Journal of Portfolio Management.

Boyd, S. (2010). Distributed optimization and statistical learning via the alternating direction
method of multipliers. Foundations and Trends in Machine Learning, 3(1), 1-122.

Fama, E. F. (1970). Efficient capital markets: A review of theory and empirical work.
Journal of Finance, 25(2), 383-417.

Gentzkow, M. & Shapiro, J. M. (2014). Code and Data for the Social Sciences: A Practitioner's
Guide. University of Chicago mimeo. Last updated January 2014.

Kalampokis, E., Tambouris, E., & Tarabanis, K. (2013). Understanding the predictive power
of social media. Internet Research, 23(5), 544-559.

Lachanski, M. (2015). Not another market timing scheme! Working Paper.

Liew, J. K.-S. & Budavri, T. (2016). The 'sixth' factor – social media factor derived directly
from tweet sentiments. SSRN Electronic Journal.

Liew, J. K.-S., Guo, S., & Zhang, T. (2016). Tweet sentiments and crowd-sourced earnings
estimates as valuable sources of information around earnings releases. Journal of
Alternative Investments.

Liew, J. K.-S. & Wang, G. Z. (2016). Twitter sentiment and IPO performance: A cross-sectional
examination. Journal of Portfolio Management, 42(4).

Loughran, T. & McDonald, B. (2011). When is a liability not a liability? Textual analysis,
dictionaries, and 10-Ks. Journal of Finance, 66(1), 35-65.

Mamaysky, H. & Glasserman, P. (2015). Does unusual news forecast market stress? Working
Papers in Financial Research.

Nassirtoussi, A. K., Aghabozorgi, S., Wah, T. Y., & Ngo, D. C. L. (2014). Text mining
for market prediction: A systematic review. Expert Systems with Applications, 41(16),
7653-7670.

Salton, G. & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval.
Information Processing & Management, 24(5), 513-523.

Schumaker, R. P., Zhang, Y., Huang, C.-N., & Chen, H. (2012). Evaluating sentiment in
financial news articles. Decision Support Systems, 53(3), 458-464.

Tetlock, P. C. (2007). Giving content to investor sentiment: The role of media in the
stock market. Journal of Finance, 62(3), 1139-1168.

Tetlock, P. C., Saar-Tsechansky, M., & Macskassy, S. (2008). More than words: Quantifying
language to measure firms' fundamentals. Journal of Finance, 63(3), 1437-1467.

Wong, F. M. F., Liu, Z., & Chiang, M. (2014). Stock market prediction from WSJ: Text
mining via sparse matrix factorization. IEEE International Conference on Data Mining.

Zhang, Y. (2010). An alternating direction algorithm for nonnegative matrix factorization.
Rice Technical Report.

Appendix

Zhang (2010) and Boyd (2010) propose an algorithm that extends the classical alternating
direction method for convex optimization, the alternating direction method of multipliers
(ADMM).³ The ADMM algorithm can be applied to the matrix factorization formulation
by introducing auxiliary variables U and V. The objective function is as follows:

\[
\begin{aligned}
\underset{X,\,Y,\,U,\,V}{\text{minimize}} \quad & \tfrac{1}{2}\|XY - A\|_F^2 \\
\text{s.t.} \quad & X - U = 0 \\
& Y - V = 0 \\
& U \geq 0,\; V \geq 0
\end{aligned}
\]

where \( U \in \mathbb{R}^{m \times d} \) and \( V \in \mathbb{R}^{d \times n} \). The augmented Lagrangian function is then


\[
L(X, Y, U, V, C, D) = \tfrac{1}{2}\|XY - A\|_F^2 + C \odot (X - U) + D \odot (Y - V)
+ \tfrac{\rho}{2}\|X - U\|_F^2 + \tfrac{\rho}{2}\|Y - V\|_F^2
\]

where \( C \in \mathbb{R}^{m \times d} \) and \( D \in \mathbb{R}^{d \times n} \) are Lagrange multipliers and ρ is a penalty parameter for
the constraints X − U = 0 and Y − V = 0. Here ⊙ stands for element-wise multiplication,
with the entries of the resulting matrix summed so that each multiplier term is a scalar.


The steps of ADMM are derived by minimizing the augmented Lagrangian function with
respect to X, Y, U and V one at a time while fixing the others at their most recent values.
The steps can be written as

\[
\begin{aligned}
X_+ &= (AY^T + \rho U - C)(YY^T + \rho I)^{-1} \\
Y_+ &= (X_+^T X_+ + \rho I)^{-1}(X_+^T A + \rho V - D) \\
U_+ &= \left(X_+ + \tfrac{C}{\rho}\right)_+ \\
V_+ &= \left(Y_+ + \tfrac{D}{\rho}\right)_+ \\
C_+ &= C + \rho(X_+ - U_+) \\
D_+ &= D + \rho(Y_+ - V_+)
\end{aligned}
\]
³ According to Boyd (2010), in practice there are several benefits to using ADMM. Assuming that the
functions are convex and the Lagrangian has a saddle point, the following hold: (1) the residuals converge
to 0, (2) the objective converges to an optimal value, and (3) the dual variable λ converges to an optimal
value. This suggests that ADMM is a viable optimization algorithm. In fact, Zhang (2010) finds that ADMM
outperforms alternative methods when tested on random matrices.

where the subscript "+" denotes the updated value at each iteration and (·)₊ denotes the
element-wise maximum with zero. Wong et al. (2014) provide a generalization of this
method, which is used to solve the objective function in this paper.
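A compact numpy sketch of these ADMM updates for the nonnegative factorization follows. The penalty ρ, the iteration count, and the test matrix are our own choices for illustration, not values from the paper:

```python
import numpy as np

def admm_nmf(A, d, rho=1.0, iters=2000, seed=0):
    """ADMM for  min 0.5*||XY - A||_F^2  s.t.  X = U >= 0,  Y = V >= 0,
    following the update steps given above."""
    m, n = A.shape
    rng = np.random.default_rng(seed)
    X = rng.random((m, d)); Y = rng.random((d, n))
    U, V = X.copy(), Y.copy()
    C = np.zeros((m, d)); D = np.zeros((d, n))
    I = np.eye(d)
    for _ in range(iters):
        X = (A @ Y.T + rho * U - C) @ np.linalg.inv(Y @ Y.T + rho * I)
        Y = np.linalg.inv(X.T @ X + rho * I) @ (X.T @ A + rho * V - D)
        U = np.maximum(0.0, X + C / rho)   # (.)_+ : element-wise max with 0
        V = np.maximum(0.0, Y + D / rho)
        C = C + rho * (X - U)              # dual update for X - U = 0
        D = D + rho * (Y - V)              # dual update for Y - V = 0
    return U, V

# Try to recover an exactly rank-2 nonnegative matrix.
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 3.0],
              [1.0, 3.0, 3.0]])   # row 3 = row 1 + row 2, so rank 2
U, V = admm_nmf(A, d=2)
print(U.min() >= 0 and V.min() >= 0)  # True: both factors are nonnegative
print(round(float(np.linalg.norm(U @ V - A) / np.linalg.norm(A)), 4))
```

The relative reconstruction error printed at the end should be small on this easy low-rank example, consistent with the convergence behavior reported by Zhang (2010), though ADMM on this nonconvex problem is only guaranteed to reach a local solution.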
