Available online at www.sciencedirect.com

Borsa Istanbul Review 19-4 (2019) 283–287
http://www.elsevier.com/journals/borsa-istanbul-review/2214-8450

Review

Big data in finance: Evidence and challenges


Avanidhar Subrahmanyam
Anderson School, UCLA, Los Angeles, CA, 90095-1481, USA
Received 25 June 2019; accepted 26 July 2019
Available online 8 August 2019

Abstract

I review the literature on the use of big data sets in finance. While big data and machine learning are exciting fields, there is a danger that we
get carried away by the novelty of the topics but pay less attention to the reliability of the academic evidence. Following my review, I argue that
big data-based evidence should be held to the same academic standards as the rest of the finance academic literature if we are to provide useful
advice to finance practitioners.

Copyright © 2019, Borsa Istanbul Anonim Şirketi. Production and hosting by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

JEL classification: G12; G14; G40

E-mail address: asubrahm@anderson.ucla.edu.
Peer review under responsibility of Borsa Istanbul Anonim Şirketi.
https://doi.org/10.1016/j.bir.2019.07.007

1. Introduction

Of late, the availability of big data has spurred interest amongst finance academics. These datasets promise to bring new data to bear on the field of investment management, and consequently they have attracted much attention. While big datasets do hold a lot of promise, there are caveats associated with their usage that need to be mentioned if we are to make material progress in understanding equity markets and homing in on reliable signals that can improve the efficacy of money managers.

For a start, I would like to categorize big data into three parts. First, there is human-generated data from social media and various online forums, which can be analyzed using standard tools for text analysis. Second, there is process-generated data, produced by purchases and sales, such as credit card receipts, supermarket scanner records, and so on. Finally, there is machine-generated data, such as GPS tracking of delivery trucks and satellite-based images of parking lots.

New datasets like the ones delineated above naturally excite academics, but it is important to note that academics can get carried away with the novelty of the datasets without ensuring that these datasets actually add value. As an example, take the case of GPS movements of delivery trucks. While these can in principle be used to better predict company revenues, analysts already disseminate sales forecasts (pretty much for free). Before paying for such expensive datasets, it would be good to ascertain that they actually add value over available forecasts. Similarly, consider measuring the number of cars in Walmart parking lots or the queues at McDonald's. Again, while these can in principle be used to measure sales at these companies, one should first ascertain that simple time series forecasting models do not already predict sales in a reliable fashion. That is, tried and trusted techniques should not take a back seat to novelty.

Big data is of interest to authors as well as editors; authors want to further their careers while editors want high citation counts for their journals' articles. The availability bias (Tversky & Kahneman, 1973) exacerbates the appeal of novel data. These aspects have led to a flurry of work using novel data sets at the major finance journals. However, while these exciting new studies analyze different aspects of big data, I maintain that we have not made much headway in ensuring the value-added nature of big datasets. Many of these studies use limited samples and/or inadequate controls, which leaves us wondering whether the studies are picking up unique financial phenomena. In the next section, to make this point, I review
some of the major pieces of research on the usage of big data in finance.

2. The research findings

One of the very first examples of big data applied to finance is Tetlock (2007). This paper considers whether the textual content of a popular newspaper column, “Abreast of the Market” in the Wall Street Journal, predicts market returns. Tetlock finds that the impact of the column on the Dow Jones Industrial Average is first negative, followed by a partial reversal. This suggests that the market overreacts to extremely negative news. The analysis spans a fifteen-year period starting from 1984. Yet, there is not much attempt to control for massive negative news that could affect markets in a lumpy way. Is it news events or the actual sentiment that predicts returns? Text analysis has also been applied to predict companies' fundamentals such as earnings (Tetlock, Saar-Tsechansky, and Macskassy, 2008).

Both the preceding papers use the nonproprietary Harvard General Inquirer (HGI) data to classify words into those conveying positive and negative connotations. Antweiler and Frank (2004) use naïve Bayes algorithms to perform computational linguistic analyses of internet bulletin board posts. They find that greater disagreement in messages and increased message postings are both associated with increased volatility, though the directional effect of message board posts on returns is small. Their analysis uses a single year of data. Loughran and McDonald (2011) argue that the HGI lists and naïve Bayes approaches do not work effectively for financial applications and come up with a new list of positive and negative words specifically applicable to finance. They find that these words bear significant relations to trading volume, volatility, and earnings surprises, as well as incidences of corporate fraud.

In other work on textual data, Hoberg and Phillips (2010) use a text-based analysis of product descriptions in 10-K statements to argue that firms with similar product descriptions are more likely to conduct mergers. Jiang, Lee, Martin, and Zhou (2019) construct an index of managerial sentiment using textual analyses of corporate disclosures and show that this index strongly predicts aggregate market returns. Li (2010) uses a machine learning approach instead of a dictionary. Specifically, he trains a model on a manually classified set of managerial disclosures, and then uses the trained model to show that managerial disclosures predict accounting fundamentals. He also shows that the negative impact of accruals on future stock returns (Sloan, 1996) only occurs when managers do not guide the investing public via their disclosures about impending bad news.

Coval and Shumway (2001) point a directional microphone into a pit for trading Treasury Bond futures and show that increases in the ambient sound level point to an increased level of volatility. This is a creative attempt to ascertain how non-financial variables impact financial markets. It suggests that other forms of chatter, such as online activity (number of posts, number of visits to companies' websites), could predict returns. Along these lines, Mayew and Venkatachalam (2012) use conference call audio files to measure positive and negative dimensions of a manager's affective or emotional state using commercially available software. They find that emotional states predict the direction of future accounting fundamentals.

Another recent example of big data analysis is by Huang (2018), who considers product quality ratings by Amazon customers over a 12-year period starting from 2004. This type of data is not easily available to investors and could potentially affect returns with a lag. Huang finds that, indeed, such customer ratings predict returns to the tune of a 1% per month return differential between firms falling in the highest and lowest ratings deciles. The paper is unique in bringing subjective quality perceptions into objective return predictions, and, as such, should attract much attention. However, the paper does not control for analysts' forecasts of sales and earnings, and associated revisions, thus leaving open the possibility that forecasts by other agents dominate quality perceptions. Tirunillai and Tellis (2012) consider the impact of the volume (number) of opinions and the actual ratings on stock returns at the daily level. Using a vector autoregression framework, they find that the volume of posts (but not the ratings) predicts a company's daily abnormal returns a few days ahead.

Chen, De, Hu, and Hwang (2014) consider opinions posted on a popular investment blog, “Seeking Alpha.” They perform text analysis on the posted opinions using the list constructed by Loughran and McDonald (2011) and show that the proportion of negative words in the posts on a given stock is associated with lower returns, both contemporaneously and in the future. The total sample period is about nine years. It remains to be seen if these results can be extended to other investor blogs.

Cohen and Frazzini (2008) address the issue that events that affect a firm's customer firms impact the original firm with a lag. Does keeping track of customer firms allow abnormal returns on the original firms? Using 25 years of data, they find that, indeed, there is an extra 1.5% per month return premium for going long (short) on stocks with customer returns in the top (bottom) 20%. While this paper does not directly use big data, it does imply that keeping track of news stories of “downstream” firms affects the profitability of trading “upstream” firms. This forms a promising line of thinking. Nonetheless, the paper does not use other proxies for agents' expectations such as analysts' forecasts, leaving open the possibility that customer prospects could be incorporated into analysts' expectations.

Chordia, Green, and Kottimukkalur (2018) address the issue that there is a huge developing market for obtaining news feeds split seconds before others. Does getting such macro news seconds ahead of its release predict returns? Using intradaily data, they find that the returns to trading on news around macroeconomic announcements (such as inflation, GDP, unemployment, etc.) are fairly small across various combinations of entry/exit times, attaining a maximum of 8 bp for entering within 0.1 s and exiting by 5 s. Such modest levels of profit imply that the returns to attaining information seconds before others are fairly modest, indicating a good level of market efficiency at the intraday horizon. However, as
technology improves and information becomes available to trade upon microseconds sooner, it remains to be seen whether the profits continue to remain at low levels or increase.

Hvidkjaer (2008) carefully calculates order imbalances on the New York Stock Exchange by processing billions of data points, separately for small and large investors (categorized by trade size). Using 23 years of data, he documents that trade imbalances of small investors are negatively related to future stock returns in the cross-section over the subsequent one to twelve months. While the exact rationale for this finding remains to be explored, the result is consistent with the notion that small investors overreact to information, and the reversal of their sentiment may cause stock return predictability. Using intradaily transactions data, Gao, Han, Li, and Zhou (2018) show that for the S&P 500 ETF, the first half hour of returns predicts the last half hour of returns. They attribute this finding to late informed trading at the open and close (the open due to the need to act quickly on information and the close to act on it before the information perishes overnight).

Chordia and Subrahmanyam (2004) show that individual stock order flow predicts returns at the daily horizon as well. Heston, Korajczyk, and Sadka (2010) use comprehensive intradaily data to show that returns during a particular half-hour period during a day predict returns at the same half-hour interval over about 40 trading days. This is an intriguing finding which lacks a convincing explanation at this point.

Using data from millions of retail investors' trades obtained from brokerage firms, Kelley and Tetlock (2013) show that aggregate net buying of retail investors predicts returns positively at the monthly level with no reversals, suggesting that aggregate retail buying contains valuable information. In a follow-up paper, Kelley and Tetlock (2016) show that aggregate short sales of retail investors negatively predict returns. The data used by Kelley and Tetlock form an appealing way to measure the aggregation of diverse retail opinions and to see if the aggregate opinion is predictive of stock returns. Using data on a stock ranking game played on a website, Da, Huang, and Jin (2018) show that aggregate rankings of stocks are negatively associated with future returns; i.e., the higher the ranking, the lower the subsequent return. That is, rankings appear to overprice stocks. This result does not dovetail with the work of Kelley and Tetlock, who seem to suggest that retail investors, in aggregate, possess fundamental information about securities.

Neely, Rapach, Tu, and Zhou (2014) show that simple forms of technical analysis, such as moving average rules, strongly forecast stock returns. These rules entail a buy signal when a short-term moving average crosses a long-term one from below, and a sell signal when it crosses from above. The profitability of simple rules is intriguing and deserves further analysis to discern the psychological biases that permit the abnormal returns from these rules.

Da, Engelberg, and Gao (2011) argue that interest in a company can be conveyed via searches for the company on a search engine. So, does internet search volume predict returns? Using Google Trends to measure such search volume, they find that a one standard deviation cross-sectional increase in search volume increases returns in the subsequent week by 18 basis points. After one week, the impact of search volume tails off rapidly. They also find that the IPOs with the highest search volume earn a 7% higher first-day return than those with the lowest. The paper is based on a relatively short sample of five years, leaving open the issue of whether the results are robust to longer time periods, including more recent years. Da, Engelberg, and Gao (2014) show that investors' searches for terms such as “recession” and “unemployment” predict short-term reversals and increases in market volatility. This study is also based on a short sample of seven years. Once again, this study does not seem to incorporate analysts' expectations, nor does it control for retail trades as in Kelley and Tetlock (2013, 2016). It also seems to contradict the wisdom in Da, Huang, and Jin (2018) that naïve investors overprice stocks.

3. Future research directions

There are a large number of exciting datasets being made available for commercial usage. For example, firms like RSMetrics provide detailed analyses of data from satellites, drones and planes to estimate retail traffic, which can be used to synthesize buy/sell equity signals. Firms like Eagle Alpha provide data on emailed customer receipts, again to estimate company revenues. Ravenpack analyzes media coverage, which is translated into a sentiment score for each stock. Firms like iSentient provide Twitter sentiment data on companies. Firms like Estimize gather crowd-sourced earnings and revenue forecasts via their websites. Naturally these data sources are attractive. There is sound reason to believe that the data they disseminate are not easily available from other sources and, as such, could serve as a reliable source of equity return predictability. Such data have not yet found their way to academic desks, but analyses as to whether they actually add value over a long list of return predictors would be extremely valuable.

Machine learning is also a promising area for academic work. Non-OLS estimators (such as Lasso) form a promising forecasting technique. Lasso achieves a parsimonious prediction by constraining the sum of the absolute values of the regression coefficients to be less than a pre-specified value, thus potentially setting some coefficients to zero and thereby excluding variables that potentially worsen the quality of forecasts. Similarly, random forest algorithms are trained on random subsamples of a dataset to produce superior forecasts. They form a promising way to generate equity buy/sell signals. Artificial neural networks allow pattern recognition within stock market data in complex ways. As an example, there are multiple signals available about equities on any given day (e.g., social media sentiment, analysts' forecasts of earnings and revenues, credit card transactions for the companies' products) and how these interact is not easy to ascertain. The neural network can analyze complex interactions between these variables and produce a summary forecast for price movements. The advantage of such networks is that they learn on a daily basis where their forecasting was in error, thus providing real time
learning and thus potentially superior forecasts. I Know First is an example of a company that uses artificial neural networks to predict equity returns.

Machine learning holds particular promise in forecasting stock returns because, while there are a large number of documented anomalies in the finance literature (Harvey, Liu, & Zhu, 2016), the nature of the forecasting relation between the variables and stock returns can change over time. As an example, consider that past returns forecast future returns (Jegadeesh & Titman, 1993) and so does Twitter sentiment. Over time, the increased usage of past returns in funds' trading strategies may weaken the role of past returns, but as more people start using social media, the role of Twitter sentiment in predicting returns may actually increase. Machine learning techniques can easily pick up such relations whereas standard OLS techniques may not be able to do so. How such techniques improve forecasting is of tremendous interest to academics, and large-sample evidence on this matter is yet to come by. The availability of packages in languages such as R and Python to program the preceding techniques has greatly simplified applications. Doubtless, as more data and computing power become available, academics will spend more time on this issue.

Yet another application of machine learning is to improve upon volatility forecasts. It is easy to see how techniques such as random forest algorithms can be trained and used to forecast future volatility. For example, simple ARCH-type forecasts (Engle, 1982) can be improved by considering other signals such as social media coverage and the extent of high frequency trading in individual firms. Studies such as Fan, Li, and Yu (2012) have already made progress in using high frequency data for estimating volatility.

In analyzing the preceding “alternative approaches” to equity return and volatility forecasting, it is important to reiterate that the obvious caveat applies. Specifically, we should ensure that the new data and techniques truly add value relative to previously known predictors and forecasting techniques such as OLS. We should endeavor to ensure that simple forecasts do not dominate those obtained from more complex techniques.

4. The challenges

The research using big data suffers from the problem of novelty: when confronted with an analysis using novel data, academics are likely to recommend publication using a relatively low bar in terms of the reliability of the results. This is because novelty garners citations, which helps the journal. However, reliable results that control for previously documented characteristics that predict returns are necessary for ascertaining whether the research has truly unearthed a new predictor. The research I have reviewed, while creative and insightful, does not always control for existing proxies for investor expectations (such as analysts' opinions), casting doubt on the robustness of the results. Further, the research is sometimes contradictory; thus Da, Huang, and Jin (2018) suggest that one should trade in a contrarian manner to retail sentiment, while Kelley and Tetlock (2013) suggest that retail investors' net buying relates positively to future returns. It is fair to say that much work needs to be done in replicating the results and ensuring that they are robust to a variety of time periods and controls.

What are the practical implications of the research? They lie in exercising caution before using big data in financial applications. For example, practitioners are frequently approached by vendors marketing new data sets for the purposes of implementing equity strategies. A frequent problem here is the same as that afflicting academia: the practitioner is likely to be swayed by the novelty of the data set alone, which is dangerous. As such, I recommend the following checklist for both academics and practitioners. First, at least five years of historical analyses should be required showing the data actually “works” in terms of forecasting returns. Second, preferably, the data should be divided into two subperiods with validity checked in each period. There should also be a sound economic hypothesis behind why the data set works. Questions should be asked on what economic force or behavioral bias precludes the information from being incorporated into current prices in a timely way. Finally, papers such as Grossman and Stiglitz (1980) suggest that extra rents can only be earned on genuine information not available to all. So a pertinent question is that of how many people have access to the data. If too many do have access, then the signals generated will be easily negated by crowding.

Conflict of interest

The author has nothing to disclose and no conflict of interest with regard to the contents of this article.

References

Antweiler, W., & Frank, M. Z. (2004). Is all that talk just noise? The information content of internet stock message boards. The Journal of Finance, 59(3), 1259–1294.
Chen, H., De, P., Hu, Y. J., & Hwang, B. H. (2014). Wisdom of crowds: The value of stock opinions transmitted through social media. Review of Financial Studies, 27(5), 1367–1403.
Chordia, T., Green, T. C., & Kottimukkalur, B. (2018). Rent seeking by low-latency traders: Evidence from trading on macroeconomic announcements. Review of Financial Studies, 31(12), 4650–4687.
Chordia, T., & Subrahmanyam, A. (2004). Order imbalance and individual stock returns: Theory and evidence. Journal of Financial Economics, 72(3), 485–518.
Cohen, L., & Frazzini, A. (2008). Economic links and predictable returns. The Journal of Finance, 63(4), 1977–2011.
Coval, J. D., & Shumway, T. (2001). Is sound just noise? The Journal of Finance, 56(5), 1887–1910.
Da, Z., Engelberg, J., & Gao, P. (2011). In search of attention. The Journal of Finance, 66(5), 1461–1499.
Da, Z., Engelberg, J., & Gao, P. (2014). The sum of all FEARS: Investor sentiment and asset prices. Review of Financial Studies, 28(1), 1–32.
Da, Z., Huang, X., & Jin, L. (2018). Extrapolative beliefs in the cross-section: What can we learn from the crowds? Working paper. University of Notre Dame.
Engle, R. F. (1982). Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation. Econometrica: Journal of the Econometric Society, 987–1007.
Fan, J., Li, Y., & Yu, K. (2012). Vast volatility matrix estimation using high-frequency data for portfolio selection. Journal of the American Statistical Association, 107(497), 412–428.
Gao, L., Han, Y., Li, S. Z., & Zhou, G. (2018). Market intraday momentum. Journal of Financial Economics, 129(2), 394–414.
Grossman, S. J., & Stiglitz, J. E. (1980). On the impossibility of informationally efficient markets. The American Economic Review, 70(3), 393–408.
Harvey, C. R., Liu, Y., & Zhu, H. (2016). … and the cross-section of expected returns. Review of Financial Studies, 29(1), 5–68.
Heston, S. L., Korajczyk, R. A., & Sadka, R. (2010). Intraday patterns in the cross-section of stock returns. The Journal of Finance, 65(4), 1369–1407.
Hoberg, G., & Phillips, G. (2010). Product market synergies and competition in mergers and acquisitions: A text-based analysis. Review of Financial Studies, 23(10), 3773–3811.
Huang, J. (2018). The customer knows best: The investment value of consumer opinions. Journal of Financial Economics, 128(1), 164–182.
Hvidkjaer, S. (2008). Small trades and the cross-section of stock returns. Review of Financial Studies, 21(3), 1123–1151.
Jegadeesh, N., & Titman, S. (1993). Returns to buying winners and selling losers: Implications for stock market efficiency. The Journal of Finance, 48(1), 65–91.
Jiang, F., Lee, J., Martin, X., & Zhou, G. (2019). Manager sentiment and stock returns. Journal of Financial Economics, 132(1), 126–149.
Kelley, E. K., & Tetlock, P. C. (2013). How wise are crowds? Insights from retail orders and stock returns. The Journal of Finance, 68(3), 1229–1265.
Kelley, E. K., & Tetlock, P. C. (2016). Retail short selling and stock prices. Review of Financial Studies, 30(3), 801–834.
Li, F. (2010). The information content of forward-looking statements in corporate filings: A naïve Bayesian machine learning approach. Journal of Accounting Research, 48(5), 1049–1102.
Loughran, T., & McDonald, B. (2011). When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks. The Journal of Finance, 66(1), 35–65.
Mayew, W. J., & Venkatachalam, M. (2012). The power of voice: Managerial affective states and future firm performance. The Journal of Finance, 67(1), 1–43.
Neely, C. J., Rapach, D. E., Tu, J., & Zhou, G. (2014). Forecasting the equity risk premium: The role of technical indicators. Management Science, 60(7), 1772–1791.
Sloan, R. G. (1996). Do stock prices fully reflect information in accruals and cash flows about future earnings? The Accounting Review, 289–315.
Tetlock, P. C. (2007). Giving content to investor sentiment: The role of media in the stock market. The Journal of Finance, 62(3), 1139–1168.
Tetlock, P. C., Saar-Tsechansky, M., & Macskassy, S. (2008). More than words: Quantifying language to measure firms' fundamentals. The Journal of Finance, 63(3), 1437–1467.
Tirunillai, S., & Tellis, G. J. (2012). Does chatter really matter? Dynamics of user-generated content and stock performance. Marketing Science, 31(2), 198–215.
Tversky, A., & Kahneman, D. (1973). Availability: A heuristic for judging frequency and probability. Cognitive Psychology, 5(2), 207–232.
