OptiRisk Systems
The Atrium, Suites 536 & 537
1 Harefield Road
Uxbridge, Middlesex, UB8 1EX
United Kingdom
www.optirisk-systems.com
+44 (0) 1895 256484
Comparative Analysis of NLP Approaches for Earnings Calls
Christopher Kantos† Dan Joldzic†
Gautam Mitra†† Kieu Thi Hoang†
† Alexandria Technology, New York, United States
† OptiRisk Systems, Uxbridge, United Kingdom
†† OptiRisk Systems and UCL Department of Computer Science, London, United Kingdom
Abstract
The field of natural language processing (NLP) has evolved significantly in recent years. In this chapter
we consider two leading and well-established methodologies, namely those of Loughran and
McDonald, and FinBERT. We then contrast our approach with these two methods, which are
considered benchmarks, and compare our performance against them. We use S&P 500 market
data for our investigations and describe the results obtained by following our strategies. Our main
focus is the earnings calls of the S&P 500 stocks. We substantiate our findings and present the
performance of our trading and fund management strategy, which shows better results.
Table of Contents
1. Introduction
1.1 Overview
1.2 Literature Review
1.3 Guided Tour
2. Data
2.1 Earnings Call Transcripts
2.2 Sentiment Data
2.3 NLP Models
3. Methodology
3.1 Natural Language Processing for Earnings Calls
4. Investigation
5. Summary Results
6. Discussion and Conclusion
References
1 Introduction
1.1 Overview
In this chapter, we review a number of natural language processing methods, mainly in the context
of the 'Earnings Calls' of the companies/stocks under consideration. The methods selected
show an evolution of techniques that have matured in recent years.
There are many Natural Language Processing (NLP) approaches for extracting quantified
sentiment scores from text. These sentiment scores are used by financial professionals
in portfolio construction and risk control models, thereby adding value to the
investment process. We start by considering and critically analyzing the recent history of
three well-known approaches, namely Loughran McDonald (LM), FinBERT, and
Alexandria Technology.
1.2 FinBERT
FinBERT is a specialized version of the machine learning model BERT (Bidirectional Encoder
Representations from Transformers), developed by Google in 2018. The BERT model is
pretrained on both language modelling and next-sentence prediction, which provides a framework
for capturing contextual relationships among words (Araci, 2019). FinBERT was developed in
2019 to extend the BERT model for better understanding of financial contexts. FinBERT uses
the Reuters TRC2-financial corpus as its pre-training dataset. This is a subset of the broader
Reuters TRC2 dataset, which consists of 1.8 million articles published by Reuters
between 2008 and 2010. The dataset is filtered for financial keywords and phrases to improve
relevance and to suit the computing power available. The resulting dataset comprises 46,143
documents consisting of 29 million words and 400,000 sentences. For sentiment training, FinBERT
uses the Financial PhraseBank, developed in 2014 by Malo et al., consisting of 4,845 sentences from the
LexisNexis database labelled by 16 financial professionals. Lastly, the FiQA Sentiment dataset,
which consists of 1,174 financial news headlines, completes the FinBERT sentiment training
(Araci, 2019). FinBERT is an open-source project that is freely available for download on
GitHub.
1.3 Alexandria Technology
Alexandria Technology uses a unique approach to NLP that is an ensemble of pure machine learning
techniques without the need for dictionaries, rules, or price trend training. Developed in 2007, the
model’s initial application was to classify DNA sequences for genomic functionality. The
underlying NLP technology was then tailored for applications in the institutional investment
industry.
Alexandria’s language model uses dynamic sequencing to identify phrase length. A phrase can
be one, two, three or more words depending on the conditional probability of the string. The strings
become stand-alone features which are then re-analyzed to determine which features occur together
within a body of text. If two or more features have a high probability of occurring together, the
features are joined to form higher order concepts. For earnings calls, the Alexandria language
model was created from millions of sentences using the FactSet XML Transcripts corpus.
Once the language model is formed, training can begin for features such as sentiment and
themes. For sentiment, Alexandria's training data consists of sentiment labels assigned to sentences by
GICS sector analysts. Each sentence is reviewed and given a label for topic and for sentiment on that
particular topic. The earnings call sentiment training sample comprises over 200,000 unique sentence labels
from earnings calls.
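The dynamic-sequencing idea described above can be illustrated with a toy bigram model: adjacent words are merged into a single feature when the conditional probability of the second word given the first is high. This is a purely illustrative sketch; Alexandria's actual model is proprietary, and the corpus, threshold, and function names here are our own inventions.

```python
from collections import Counter

def merge_phrases(sentences, min_prob=0.5):
    """Merge adjacent word pairs into one feature when the conditional
    probability P(next | current) exceeds a threshold. Toy sketch only."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    merged = []
    for words in sentences:
        out, i = [], 0
        while i < len(words):
            pair = tuple(words[i:i + 2])
            # merge the pair when the second word usually follows the first
            if len(pair) == 2 and bigrams[pair] / unigrams[words[i]] >= min_prob:
                out.append("_".join(pair))
                i += 2
            else:
                out.append(words[i])
                i += 1
        merged.append(out)
    return merged

corpus = [["operating", "loss", "narrowed"],
          ["operating", "loss", "widened"],
          ["operating", "margin", "improved"]]
print(merge_phrases(corpus)[0])  # → ['operating_loss', 'narrowed']
```

Here "loss" follows "operating" in two of its three occurrences, so the pair becomes the stand-alone feature `operating_loss`, in the spirit of the phrase features described above.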
The rest of this chapter is organized as follows. In Section 2 we consider and describe
the Data Sources. We discuss Earnings Call Transcripts, Sentiment Data and our approach to
Natural Language Processing (NLP) models. In Section 3 we describe the models and methods
used by us. We consider NLP for Earnings Calls and describe our method of Sentiment
Classification. The investigations and the results are set out in Section 4 and Section 5
respectively. In Section 6 we present discussions and conclusions in a summary form.
2 Data Sources and Models
2.1 Earnings Call Transcripts
In our analysis, we use the FactSet Earnings Call Transcript database, which covers earnings
calls of companies around the world. An earnings call is a conference call between the
management of a public company, analysts, investors, and the media to discuss the
company's financial results for a given reporting period, such as a quarter or a fiscal
year (Chen, 2021). An earnings call is usually preceded by an earnings report, which contains
summary information on financial performance for the period. FactSet is a data vendor
that provides transcripts of these earnings calls. Its earnings call coverage begins in 2003,
of which we examine the period 2010-2021. Our subset of this global dataset focuses on
the S&P 500 equities.
Earnings calls have always been a source of integral information for investors. During these
calls, companies discuss financial results, operations, advancements, and hardships that
will steer the course and future of the company. This information is a key resource for
analysts to forecast the ongoing and future financial performance of the company. The calls
are split up between Management Discussion (MD) where executives present financial
results and forecasts, followed by Questions and Answers (QA) where participants such as
investors and analysts can ask questions regarding the MD results.
Using NLP and ML, we can analyze these calls in near real time. By parsing and scoring
them for sentiment, we have an overall view of thousands of calls as well as a topic-by-
topic (or sentence-by-sentence) detailed view of what was said, who said it, and what the
sentiment was without ever having to dial in or read through the transcripts. Armed with
the quantitative information derived from these calls, we can then explore various ways to
incorporate and enhance our investment processes.
The Loughran McDonald method uses an alternative to the negative word list of the widely used
Harvard Dictionary, along with five other word lists, that better reflects tone in
financial text. In their study of a large sample of 10-Ks filed between 1994 and
2008, Loughran and McDonald (2010) found that almost three-fourths of the words identified as
negative by the widely used Harvard Dictionary are typically not considered negative
in financial contexts.
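The mechanics of a dictionary-based classifier of this kind can be sketched in a few lines. The word lists below are tiny invented stand-ins, not the actual Loughran-McDonald dictionaries, which contain thousands of finance-specific terms.

```python
# Toy sketch of a dictionary-based classifier in the spirit of
# Loughran-McDonald: count hits against finance-tuned word lists.
# These lists are illustrative stand-ins, NOT the real LM dictionaries.
LM_POSITIVE = {"growth", "improved", "strong", "record"}
LM_NEGATIVE = {"loss", "impairment", "litigation", "restated"}

def lm_label(sentence: str) -> int:
    """Return 1 (positive), -1 (negative) or 0 (neutral)."""
    words = [w.strip(".,;:!?").lower() for w in sentence.split()]
    pos = sum(w in LM_POSITIVE for w in words)
    neg = sum(w in LM_NEGATIVE for w in words)
    return (pos > neg) - (neg > pos)

print(lm_label("Strong growth despite a one-time impairment."))  # → 1
```

Note that a purely count-based method like this ignores word order and context entirely, which is the contrast drawn later against the machine-learning approaches.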
FinBERT is a pre-trained NLP model for analyzing the sentiment of financial text, based on BERT.
BERT, short for Bidirectional Encoder Representations from Transformers, is a Machine
Learning (ML) model for natural language processing. BERT consists of a set of
transformer encoders stacked on top of each other, and can be used for several language
tasks, such as sentiment analysis, text prediction, text generation, and summarization.
FinBERT is built by further training the BERT language model on a large financial corpus,
thereby fine-tuning it for financial sentiment classification. FinBERT improved
state-of-the-art performance by 15 percentage points on a financial sentiment
classification task on the Financial PhraseBank dataset (Araci, 2019).
NLP for financial text has a wide array of applications for institutional investors. Previous research
shows that using NLP for financial news has applications in risk management, asset allocation, alpha
generation, among many others. It is only recently that we have begun to look at using the same
techniques that we use for financial news for other sources such as earnings calls.
As early as 2012, studies looked at large stock price movements and whether they correlated with
analyst earnings revisions (Govindaraj et al., 2012). More recent studies have shown that using
sentiment on earnings transcripts can lead to outperformance that is not explained by traditional
risk and return factors such as Momentum and EPS Revisions (Jha and Blain, 2015). A further study
by Societe Generale in 2019 shows that sentiment on earnings calls outperformed traditional
factors such as Value, Profitability, Quality, Momentum and Growth as well; see (Short, 2016).
In a study by S&P in 2018, they show that market sentiment surrounding earnings calls not only has
low correlation with other signals but amplifies the effectiveness of other earnings transcript-based
signals (Tortoriello, 2018). Finally, Alexandria’s internal research shows that alpha signals from
earnings calls have a longer decay rate and drift compared to shorter-term signals such as news and
social media.
It is clear that much research has demonstrated the value added by analyzing the sentiment of
earnings calls. In the next sections, we extend this research to explore the effect that different
NLP approaches have, using a homogeneous set of earnings call transcripts as input.
For our sentiment classification, we use transcript data provided by FactSet. FactSet’s coverage
includes over 1.7 million global company corporate events, including earnings calls, conference
presentations, guidance calls, sales & revenue calls, and investor meeting notes (Factset, 2019).
We first use the three distinct models to apply sentiment labels to the S&P500 over a time
period of 2010-2021. Among individual section classifications (i.e., topics, sentences), we found the
highest correlation to be between LM and FinBERT at 0.38. Alexandria had the lowest
correlations with both LM and FinBERT at 0.14 and 0.17 respectively. We then aggregated the
individual sections into net sentiment values for each security in the sample. Unsurprisingly,
when aggregated the correlations rose, with the highest correlation being between LM and
Alexandria at 0.45, the lowest being FinBERT and Alexandria at 0.30 and FinBERT and LM
having correlation of 0.41.
Section Classification Correlations

                     Alexandria Technology    Loughran McDonald
Loughran McDonald    0.14
FinBERT              0.17                     0.38

Aggregated Security Sentiment Correlations

                     Alexandria Technology    Loughran McDonald
Loughran McDonald    0.45
FinBERT              0.30                     0.41
We provide several examples of how each approach classifies sections of an earnings call for
sentiment. The differences are indicative of how each approach does or does not take context
and word order into account.
"Turning to the Google Cloud segment, revenues were $4.6 billion for the second quarter, up 54%.
GCP's revenue growth was again above Cloud overall, reflecting significant growth in both
infrastructure and platform services. Once again, strong growth in Google Workspace revenues was
driven by robust growth in both seats and average revenue per seat. Google Cloud had an operating
loss of $591 million. As to our Other Bets, in the first (sic) [second] quarter, revenues were
$192 million. The operating loss was $1.4 billion."

Loughran McDonald: -1    Alexandria: 1    FinBERT: -1

"Turning to the balance sheet, total spot assets were $1.1 trillion and standardized RWAs
increased to $454 billion, reflecting high levels of client activity and the closing of E TRADE.
Our standardized CET1 ratio was flat to the prior quarter at 17.4%."

Loughran McDonald: -1    Alexandria: 1    FinBERT: -1
4 Investigation
For comparison, we run a monthly long/short simulation where we are long the top quintile and short
the bottom quintile based on our sentiment classifications.
The earnings calls are separated into two sections, the Management Discussion (MD) and the
Question and Answer (QA). Each component of the MD and QA sections is classified as positive,
neutral, or negative (1, 0, -1) and the classifications are totaled for each transcript. We then use
the average of (MD + QA) as our sentiment score. Our final net sentiment score is the log of the
ratio of positive to negative sentiment counts:
NetSentiment = log10( (CountPositive + 1) / (CountNegative + 1) )
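The net sentiment formula above is a one-liner; the add-one smoothing keeps the score defined when either count is zero, and the example counts below are invented for illustration.

```python
import math

def net_sentiment(count_positive: int, count_negative: int) -> float:
    """Log-ratio net sentiment with add-one smoothing, per the formula above."""
    return math.log10((count_positive + 1) / (count_negative + 1))

# A call with equal counts scores 0; more positives push the score above 0.
print(round(net_sentiment(30, 10), 3))  # → 0.45
```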
To avoid liquidity concerns, we use the S&P 500 as our universe and split the data into
quintiles on a monthly basis using the net sentiment scores, with Q1 being the highest sentiment in
the sample, and Q5 being the lowest. We use an aggregation window of six months to ensure each
security in our universe has sample data over two unique earnings calls (Earnings calls are required
quarterly for United States public companies). Once we have our quintiles, we rebalance on the
last trading day of the month going long quintile 1 and short quintile 5. Quintiles 2-4 (generally
the most neutral sentiment) will be ignored.
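The monthly portfolio formation described above can be sketched as follows: rank securities by net sentiment, go long the top fifth (Q1) and short the bottom fifth (Q5), ignoring the middle quintiles. The tickers and scores below are made up for illustration.

```python
def quintile_portfolio(scores: dict) -> tuple:
    """Rank securities by net sentiment and return (longs, shorts):
    long the top quintile (Q1), short the bottom quintile (Q5).
    Quintiles 2-4 are ignored, as in the simulation described above."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    n = len(ranked) // 5
    return ranked[:n], ranked[-n:]

# Hypothetical tickers and six-month net sentiment scores (invented).
scores = {"AAA": 0.45, "BBB": 0.30, "III": 0.22, "CCC": 0.10, "DDD": 0.00,
          "EEE": -0.05, "JJJ": -0.12, "FFF": -0.20, "GGG": -0.35, "HHH": -0.50}
longs, shorts = quintile_portfolio(scores)
print(longs, shorts)  # → ['AAA', 'BBB'] ['GGG', 'HHH']
```

In the actual simulation the universe is the S&P 500, so each quintile holds roughly 100 names, rebalanced on the last trading day of each month.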
We run our simulation monthly, from January 2010 to September 2021.

Loughran McDonald Monthly S&P500 Universe 2010-2021

In the sample period, Alexandria outperforms Loughran McDonald and FinBERT in all years
except 2010 and 2011. Alexandria performs positively in every year apart from 2016, and in that
year shows the smallest loss of the three approaches. Over the course of the sample period,
Alexandria had the best performance, with a cumulative P&L of 221.87%, followed by Loughran
McDonald at 19.81% and FinBERT at 16.77%.
FinBERT Monthly S&P500 Universe 2010-2021 (through September 2021)

Year      Return     Volatility   Return/Risk   S&P 500
2010      -1.94%      9.05%       -0.21         12.78%
2011       4.43%      4.80%        0.92          0.00%
2012       0.01%      5.99%        0.00         13.41%
2013       3.52%      3.80%        0.93         29.60%
2014       1.51%      5.32%        0.28         11.39%
2015      20.37%      7.42%        2.75         -0.73%
2016     -13.82%      7.84%       -1.76          9.54%
2017       9.88%      4.14%        2.39         19.42%
2018       4.17%      6.60%        0.63         -6.24%
2019      -4.38%      9.86%       -0.44         30.30%
2020       3.78%     16.41%        0.23          8.34%
2021     -10.44%     11.83%       -0.88         14.68%
5 Summary Results
Annual Returns
Year       LM        Alexandria   FinBERT
2010       6.75%      3.60%       -1.94%
2011       5.74%      5.47%        4.43%
2012      -5.29%     11.26%        0.01%
2013       1.93%      8.69%        3.52%
2014       1.02%     16.15%        1.51%
2015       9.30%     28.81%       20.37%
2016      -6.53%     -4.04%      -13.82%
2017       3.48%     22.88%        9.88%
2018      -0.32%     14.24%        4.17%
2019       2.35%      4.03%       -4.38%
2020       4.57%     10.69%        3.78%
2021      -4.64%      6.66%      -10.44%
Cumulative Returns
6 Discussion and Conclusion
As our analysis shows, not all NLP methods are the same when applied to earnings calls. We see
a stark difference between the sentiment labelling between the three methods as shown from the
low correlations among them, and subsequently the performance when applied to a trading strategy
over the past decade.
The results show that using sentiment from earnings calls can generate alpha not explained by traditional
risk and return factors. Furthermore, when using the three different NLP methods to generate
sentiment from earnings calls, there is a low correlation of sentiment labelling between Alexandria
and both Loughran McDonald and FinBERT at the individual classification level. Correlations are
higher at the aggregate security sentiment level but remain low. These low correlations and large
performance differences arise from the distinct language models, sentiment training, and NLP
technology of each approach. The research also suggests that the Alexandria Technology method
significantly outperforms the other two NLP approaches over the period 2010-2021. Earnings call
Management Discussion (MD) and Question and Answer (QA) sentiment have been the best
performing factors against traditional style factors over the past 10 years; see (Short, 2016).
We find that Alexandria's NLP method performs best over the sample period by a significant margin.
Alexandria's unique approach of using domain experts to train its model appears more robust than
the rigid technique of a dictionary-based approach, and even than non-earnings-specific deep
learning methods.
References
Araci, D. (2019) Financial Sentiment Analysis with Pre-trained Language Models. University of
Amsterdam.
Jha, V., Blain, J. and Montague, W. (2015) Finding Value in Earnings Transcripts Data with
AlphaSense.
Loughran, T. and McDonald, B. (2010) When is a Liability not a Liability? Textual Analysis,
Dictionaries, and 10-Ks.