
Comparative Analysis of NLP

Approaches for Earnings Calls


White Paper Series

http://www.optirisk-systems.com

Specialists in Optimisation Modelling and Risk Analysis


Prepared by
Christopher Kantos, Dan Joldzic, Gautam Mitra, Kieu Thi Hoang

Copyright © 2020 OptiRisk Systems

DO NOT DUPLICATE WITHOUT PERMISSION


All brand names and product names are trademarks or registered
trademarks of their respective holders.
The material presented in this whitepaper is subject to change without
prior notice and is intended for general information only. The views of
the authors expressed in this document do not represent the views
and/or opinions of OptiRisk Systems.

OptiRisk Systems
The Atrium, Suites 536 & 537
1 Harefield Road
Uxbridge, Middlesex, UB8 1EX
United Kingdom
www.optirisk-systems.com
+44 (0) 1895 256484
Comparative Analysis of NLP Approaches for Earnings Calls
Christopher Kantos† Dan Joldzic†
Gautam Mitra†† Kieu Thi Hoang†

† Alexandria Technology, New York, United States
† OptiRisk Systems, Uxbridge, United Kingdom
†† OptiRisk Systems and UCL Department of Computer Science, London, United Kingdom

Abstract

The field of natural language processing (NLP) has evolved significantly in recent years. In this chapter
we consider two leading and well-established methodologies, namely those of Loughran
McDonald and FinBERT. We then contrast our approach with these two methods, which we treat
as benchmarks, and compare performance against them. We use S&P 500 market
data for our investigations and describe the results obtained by following our strategies. Our main
focus is the earnings calls of the S&P 500 stocks. We substantiate our findings and present the
performance of our trading and fund management strategy, which shows better results.

Keywords— Natural Language Processing, Sentiment Analysis, Earnings Calls

Table of Contents
1. Introduction
1.1 Overview
1.2 Literature Review
1.3 Guided Tour

2. Data
2.1 Earnings Call Transcripts
2.2 Sentiment data
2.3 NLP models

3. Methodology
3.1 Natural Language Processing for Earnings Calls

3.2 Sentiment Classification

4. Investigation

5. Results

6. Discussion and Conclusions

References

1 Introduction
An Overview
In this chapter, we review several natural language processing methods, mainly in the context
of the earnings calls of the companies/stocks under consideration. The selected methods
illustrate an evolution of approaches that have matured in recent years.

There are many Natural Language Processing (NLP) approaches for extracting quantified
sentiment scores from text. Financial professionals use these sentiment scores
in portfolio construction and risk control models, thereby adding value to the
investment process. We start by considering and critically analyzing the recent history of
three well-known approaches, namely Loughran McDonald (LM), FinBERT, and
Alexandria Technology.

1.1 Loughran McDonald


Loughran McDonald (LM), developed by Tim Loughran and Bill McDonald of Notre
Dame in 2011, is arguably the most popular financial lexicon method available. Lexicon methods,
often called the "bag of words" approach to NLP, use a dictionary of words or phrases that are
labelled with sentiment. The motivation for the LM dictionary was that, in tests of the popular and
widely used Harvard Dictionary, words were regularly mislabelled as negative when
applied in a financial context, see (Loughran and McDonald, 2010). The LM dictionary was
built from a large sample of 10-Qs and 10-Ks from the years 1994 to 2008. The sentiment
was trained on approximately 5,000 words from these documents and over 80,000 words from
the Harvard Dictionary. The dictionary is updated periodically to capture new words,
phrases, and context entering the business lexicon, specifically in the financial domain.
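The mechanics of a lexicon scorer can be illustrated with a short sketch. The word lists here are toy stand-ins for illustration only, not the actual LM dictionary:

```python
# Toy illustration of a lexicon ("bag of words") sentiment scorer.
# NEGATIVE and POSITIVE are hypothetical mini word lists, not the LM dictionary.
import re

NEGATIVE = {"loss", "impairment", "litigation", "decline"}
POSITIVE = {"growth", "improvement", "record", "strong"}

def lexicon_score(text: str) -> int:
    """Return 1, 0 or -1 by counting labelled words, ignoring word order."""
    tokens = re.findall(r"[a-z]+", text.lower())
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    if pos > neg:
        return 1
    if neg > pos:
        return -1
    return 0

print(lexicon_score("Record growth despite a one-off impairment"))  # 1
```

Note that, because word order is discarded, such a scorer cannot distinguish "growth despite a loss" from "a loss despite growth", which is exactly the contextual limitation the later approaches address.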

1.2 FinBERT
FinBERT is a specialized version of BERT (Bidirectional Encoder
Representations from Transformers), a machine learning model developed by Google in 2018. The BERT model is
pretrained on both language modelling and next-sentence prediction, which provides a framework
for capturing contextual relationships among words (Araci, 2019). FinBERT was developed in
2019 to extend the BERT model for better understanding of financial contexts. FinBERT uses
the Reuters TRC2-financial corpus as its pre-training dataset. This is a subset of the broader
Reuters TRC2 dataset, which consists of 1.8 million articles published by Reuters
between 2008 and 2010. The dataset is filtered for financial keywords and phrases to improve
relevance and suitability given the computing power available. The resulting dataset comprises 46,143
documents consisting of 29 million words and 400,000 sentences. For sentiment training, FinBERT
uses the Financial PhraseBank, developed in 2014 by Malo et al., consisting of 4,845 sentences from the
LexisNexis database labelled by 16 financial professionals. Lastly, the FiQA Sentiment dataset,
which consists of 1,174 financial news headlines, completes the FinBERT sentiment training
(Araci, 2019). FinBERT is an open-source project freely available for download on
GitHub.
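As an illustration, sentence-level labels from the open-source FinBERT checkpoint can be mapped onto a -1/0/1 scheme. The model name `ProsusAI/finbert` and the helper functions below are assumptions for this sketch, not necessarily the exact setup used in this study:

```python
# Sketch: mapping FinBERT class labels onto a trinary (-1/0/1) scheme.
# The helpers are hypothetical; the pipeline call is shown in comments because
# it requires the `transformers` package and a model download.
def to_trinary(label: str) -> int:
    """Map FinBERT's class labels to a -1/0/1 sentiment score."""
    return {"positive": 1, "neutral": 0, "negative": -1}[label.lower()]

def score_sentences(sentences, classify):
    """classify: any callable returning a label per sentence,
    e.g. a Hugging Face text-classification pipeline."""
    return [to_trinary(classify(s)) for s in sentences]

# With the real model (requires `pip install transformers`):
# from transformers import pipeline
# clf = pipeline("text-classification", model="ProsusAI/finbert")
# labels = score_sentences(texts, lambda s: clf(s)[0]["label"])
```

Keeping the classifier behind a callable makes the same aggregation code reusable across FinBERT and any other sentence-level model.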

1.3 Alexandria Technology
Alexandria Technology uses a unique approach to NLP that is an ensemble of pure machine learning
techniques without the need for dictionaries, rules, or price trend training. Developed in 2007, the
model’s initial application was to classify DNA sequences for genomic functionality. The
underlying NLP technology was then tailored for applications in the institutional investment
industry.

Alexandria's language model uses dynamic sequencing to identify phrase length. A phrase can
be one, two, three or more words, depending on the conditional probability of the string. The strings
become stand-alone features, which are then re-analyzed to determine which features occur together
within a body of text. If two or more features have a high probability of occurring together, they
are joined to form higher-order concepts. For earnings calls, the Alexandria language
model was created from millions of sentences using the FactSet XML Transcripts corpus.

Once the language model is formed, training can begin for features such as sentiment and
themes. For sentiment, Alexandria's training input is sentiment labels on sentences from
GICS sector analysts. A sentence is reviewed and given a label for topic and for sentiment on that
particular topic. The earnings call sentiment training sample comprises over 200,000 unique sentence labels
from earnings calls.
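The phrase-building idea, joining adjacent tokens when the conditional probability of the pair is high, can be given a simplified flavour in a few lines. This is an illustrative toy sketch under our own assumptions; Alexandria's actual model is proprietary and far richer:

```python
# Toy sketch of conditional-probability phrase building: join a word pair
# (w1, w2) into a candidate phrase when P(w2 | w1), estimated from counts,
# exceeds a threshold. Thresholds and corpus are illustrative assumptions.
from collections import Counter

def find_phrases(sentences, threshold=0.5):
    """Return word pairs whose estimated P(w2 | w1) >= threshold."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        words = s.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return {f"{w1} {w2}"
            for (w1, w2), n in bigrams.items()
            if n / unigrams[w1] >= threshold and n > 1}

calls = ["operating margin improved", "operating margin declined",
         "operating expenses rose", "gross margin improved"]
print(find_phrases(calls))
```

In this tiny corpus "operating margin" and "margin improved" qualify as phrases, while one-off pairs such as "operating expenses" do not; the same idea, applied recursively, yields the higher-order concepts described above.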

1.4 White Paper Structure: A Guided Tour

The rest of this chapter is organized in the following sections. In Section 2 we consider and describe
the Data Sources. We discuss Earnings Call Transcripts, Sentiment Data and our approach to
Natural Language Processing (NLP) models. In Section 3 we describe the models and methods
used by us. We consider NLP for Earnings Calls and describe our method of Sentiment
Classification. The investigations and the results are set out in Section 4 and Section 5
respectively. In Section 6 we present discussions and conclusions in a summary form.

2 Data Sources and Models
2.1 Earnings Call Transcripts
In our analysis, we use the FactSet Earnings Call Transcript database, which covers earnings
calls of companies around the world. An earnings call is a conference call between the
management of a public company, analysts, investors, and the media to discuss the
company's financial results for a given reporting period, such as a quarter or a fiscal
year (Chen, 2021). An earnings call is usually preceded by an earnings report, which contains
summary information on financial performance for the period. FactSet is a data vendor
that provides transcripts of these earnings calls. Their coverage begins in 2003 for earnings
calls, of which we look at the period 2010-2021. Our subset of this global dataset focuses on
the S&P 500 equities.

Earnings calls have always been an integral source of information for investors. During these
calls, companies discuss financial results, operations, advancements, and hardships that
will steer the course and future of the company. This information is a key resource for
analysts to forecast the ongoing and future financial performance of the company. The calls
are split up between Management Discussion (MD) where executives present financial
results and forecasts, followed by Questions and Answers (QA) where participants such as
investors and analysts can ask questions regarding the MD results.

Using NLP and ML, we can analyze these calls in near real time. By parsing and scoring
them for sentiment, we have an overall view of thousands of calls as well as a topic-by-
topic (or sentence-by-sentence) detailed view of what was said, who said it, and what the
sentiment was without ever having to dial in or read through the transcripts. Armed with
the quantitative information derived from these calls, we can then explore various ways to
incorporate and enhance our investment processes.

2.2 Sentiment Data


The sentiment data used in our testing is the output of the three NLP approaches. We use these
three NLP methods to build classification models that create sentiment scores from the earnings
call transcripts dataset. All three approaches use a trinary system of -1 (negative sentiment), 0
(neutral sentiment) and 1 (positive sentiment) to score each section of an earnings call.

The Loughran McDonald method uses an alternative negative word list to the widely used
Harvard Dictionary word list, along with five other word lists, to better reflect tone in
financial text. This is supported by their study of a large sample of 10-Ks from 1994 to
2008: almost three-fourths of the words identified as negative by the widely used Harvard
Dictionary are words typically not considered negative in financial contexts
(Loughran and McDonald, 2010).

FinBERT is a pre-trained NLP model for analyzing the sentiment of financial text, based on BERT.
BERT, short for Bidirectional Encoder Representations from Transformers, is a Machine
Learning (ML) model for natural language processing. BERT consists of a set of
transformer encoders stacked on top of each other, and can be used for several language
tasks, such as sentiment analysis, text prediction, text generation, and summarization.
FinBERT is built by further training the BERT language model on a large financial corpus,
thereby fine-tuning it for financial sentiment classification. FinBERT improved state-of-the-art
performance by 15 percentage points on a financial sentiment classification task on the
Financial PhraseBank dataset (Araci, 2019).

Unlike the Loughran McDonald and FinBERT methods, Alexandria Technology's method
does not use any dictionaries, rules, or price-trend training, but pure
machine learning techniques to produce sentiment scores from text. Alexandria's language model
was created from millions of sentences using the FactSet XML Transcripts corpus. The training
input in this approach is sentiment labels on topics and sentences from GICS sector analysts.
A section of text is analyzed and labeled with a theme and a sentiment. The earnings call sentiment
training sample comprises over 200,000 unique labels from earnings calls.

2.3 NLP Models


For our comparative analysis of NLP approaches for classifying and scoring sentiment for earnings
calls, we use open-source libraries for the Loughran McDonald and FinBERT approaches, and the
proprietary machine learning approach of Alexandria Technology. More information about the
three sources is found below and in the bibliography section of this paper.

3 Methodology of the three approaches


3.1 Natural Language Processing for Earnings Calls

NLP for financial text has a wide array of applications for institutional investors. Previous research
shows that using NLP for financial news has applications in risk management, asset allocation, alpha
generation, among many others. It is only recently that we have begun to look at using the same
techniques that we use for financial news for other sources such as earnings calls.

As early as 2012, studies looked at large stock price movements and whether they correlated to
analyst earnings revisions (Govindaraj et al., 2012). More recent studies have shown that using
sentiment on earnings transcripts can lead to outperformance that is not explained by traditional
risk and return factors such as Momentum and EPS Revisions (Jha and Blain, 2015). A further study
by Societe Generale shows that sentiment on earnings calls actually outperformed traditional
factors such as Value, Profitability, Quality, Momentum and Growth as well, see (Short, 2016).
A 2018 study by S&P shows that market sentiment surrounding earnings calls not only has
low correlation with other signals but also amplifies the effectiveness of other earnings-transcript-based
signals (Tortoriello, 2018). Finally, Alexandria's internal research shows that alpha signals from
earnings calls have a longer decay rate and drift compared to shorter-term signals such as news and
social media.

It is clear that much research has been done to show the value added by analyzing the sentiment of
earnings calls. In the next section, we extend this research to explore what effect different
NLP approaches have, using a homogeneous set of earnings call transcript data as input.

3.2 Sentiment Classification

For our sentiment classification, we use transcript data provided by FactSet. FactSet’s coverage
includes over 1.7 million global company corporate events, including earnings calls, conference
presentations, guidance calls, sales & revenue calls, and investor meeting notes (Factset, 2019).

We first use the three distinct models to apply sentiment labels to the S&P 500 over the
period 2010-2021. Among individual section classifications (i.e. topics, sentences), we found the
highest correlation to be between LM and FinBERT at 0.38. Alexandria had the lowest
correlations, with both LM and FinBERT, at 0.14 and 0.17 respectively. We then aggregated the
individual sections into net sentiment values for each security in the sample. Unsurprisingly,
the correlations rose when aggregated: the highest correlation was between LM and
Alexandria at 0.45, the lowest between FinBERT and Alexandria at 0.30, with FinBERT and LM
at 0.41.

Correlations

Section Classification Correlation

                    Alexandria Technology   Loughran McDonald
Loughran McDonald   0.14
FinBERT             0.17                    0.38

Net Sentiment Correlation

                    Alexandria Technology   Loughran McDonald
Loughran McDonald   0.45
FinBERT             0.30                    0.41

Sentiment Classification: Illustrated with Examples

We provide several examples of how each approach classifies sections of an earnings call for
sentiment. The differences are indicative of how each approach does or does not take context
and word order into account.

Example 1 (Loughran McDonald: -1; Alexandria: 1; FinBERT: -1):
"Turning to the Google Cloud segment, revenues were $4.6 billion for the second quarter, up 54%.
GCP's revenue growth was again above Cloud overall, reflecting significant growth in both
infrastructure and platform services. Once again, strong growth in Google Workspace revenues was
driven by robust growth in both seats and average revenue per seat. Google Cloud had an operating
loss of $591 million. As to our Other Bets, in the first (sic) [second] quarter, revenues were
$192 million. The operating loss was $1.4 billion."

Example 2 (Loughran McDonald: -1; Alexandria: 1; FinBERT: -1):
"Turning to the balance sheet, total spot assets were $1.1 trillion and standardized RWAs
increased to $454 billion, reflecting high levels of client activity and the closing of E TRADE.
Our standardized CET1 ratio was flat to the prior quarter at 17.4%."

Example 3 (Loughran McDonald: 1; Alexandria: -1; FinBERT: 1):
"Non-GAAP operating expenses increased 9% year-over-year and 2% sequentially, slightly above
the high end of our guidance range, primarily due to higher variable compensation related to
better-than-expected order momentum."

4 Investigation
For comparison, we run a monthly long/short simulation where we are long the top quintile and short
the bottom quintile based on our sentiment classifications.

The earnings calls are separated into two sections, the Management Discussion (MD) and the
Question and Answer (QA). Each component of MD and QA is then totaled for each transcript
to get a net sentiment of positive, neutral or negative (1, 0, -1). We then use the average of
(MD + QA) as our sentiment score. Our final net sentiment score is the log of positive over
negative sentiment counts.
NetSentiment = log10((CountPositive + 1) / (CountNegative + 1))
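The scoring step can be sketched in a few lines, using hypothetical section counts for a single transcript:

```python
# Net sentiment for one transcript from its positive/negative section counts.
# The +1 terms keep the ratio defined when either count is zero.
from math import log10

def net_sentiment(count_positive: int, count_negative: int) -> float:
    """NetSentiment = log10((CountPositive + 1) / (CountNegative + 1))."""
    return log10((count_positive + 1) / (count_negative + 1))

# Hypothetical transcript with 30 positive and 9 negative sections:
print(round(net_sentiment(30, 9), 2))  # 0.49
```

The log-ratio form gives a symmetric score: equal positive and negative counts map to 0, and a transcript with only negative sections gets a mirror-image negative value.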
To avoid liquidity concerns, we use the S&P 500 as our universe and split the data into
quintiles on a monthly basis using the net sentiment scores, with Q1 being the highest sentiment in
the sample and Q5 the lowest. We use an aggregation window of six months to ensure each
security in our universe has sample data over two unique earnings calls (earnings calls are required
quarterly for United States public companies). Once we have our quintiles, we rebalance on the
last trading day of the month, going long quintile 1 and short quintile 5. Quintiles 2-4 (generally
the most neutral sentiment) are ignored.
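The portfolio-formation step above can be sketched as follows. The tickers and scores are hypothetical; the actual simulation ranks the monthly S&P 500 constituents:

```python
# Sketch of the quintile long/short construction: rank securities by net
# sentiment, split into five equal buckets, go long Q1 and short Q5.
def quintile_portfolios(scores):
    """scores: dict of security -> net sentiment. Q1 = highest sentiment."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    n = len(ranked)
    quintiles = [ranked[i * n // 5:(i + 1) * n // 5] for i in range(5)]
    return {"long": quintiles[0], "short": quintiles[4]}

# Hypothetical net sentiment scores for ten securities:
scores = {"AAA": 0.49, "BBB": 0.10, "CCC": -0.30, "DDD": 0.25,
          "EEE": -0.05, "FFF": 0.40, "GGG": 0.02, "HHH": -0.15,
          "III": 0.31, "JJJ": -0.22}
book = quintile_portfolios(scores)
print(book["long"], book["short"])
```

Quintiles 2-4 simply fall out of the returned book, matching the rule that the middle, most neutral buckets are ignored.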

We run our simulation monthly, from January 2010 to September 2021, for each of the three NLP
approaches on the S&P 500 universe.

In the sample period, Alexandria outperforms Loughran McDonald and FinBERT in all years
except 2010 and 2011. Alexandria performs positively in every year apart from 2016, and in that
year showed the smallest loss of the three approaches. Over the course of the sample period,
Alexandria had the best performance, with a cumulative P&L of 221.87%, followed by Loughran
McDonald at 19.81% and FinBERT at 16.77%.
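As a sanity check, geometrically compounding the annual long/short returns reported for Alexandria approximately recovers the cumulative figure; small differences remain because the actual simulation compounds monthly:

```python
# Compound annual long/short returns (as decimals, from the Alexandria
# Q1-Q5 annual returns table) into a cumulative P&L.
alexandria = [0.0360, 0.0547, 0.1126, 0.0869, 0.1615, 0.2881,
              -0.0404, 0.2288, 0.1424, 0.0403, 0.1069, 0.0666]

def cumulative(returns):
    """Geometric compounding of period returns into a cumulative P&L."""
    total = 1.0
    for r in returns:
        total *= 1.0 + r
    return total - 1.0

print(f"{cumulative(alexandria):.1%}")  # in the region of the reported 221.87%
```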

Loughran McDonald: Monthly S&P500 Universe 2010-2021(09)

Quintile     Avg Stocks   Annual Returns   Standard Deviation   Sharpe Ratio   vs. S&P 500
1            108          14.36%           15.05%               0.95           2.18%
2            107          13.46%           16.05%               0.84           1.27%
3            108          12.95%           15.34%               0.84           0.76%
4            107          12.14%           16.85%               0.72           -0.05%
5            107          12.09%           16.92%               0.71           -0.10%
Long/Short   215          1.56%            5.82%                0.27

Loughran McDonald Q1-Q5 Annual Returns (monthly S&P500 universe, 2010-2021(09))

Year   Annual   Std Dev   Sharpe   S&P 500
2010   6.75%    7.52%     0.90     12.78%
2011   5.74%    6.65%     0.86     0.00%
2012   -5.29%   4.89%     -1.08    13.41%
2013   1.93%    3.12%     0.62     29.60%
2014   1.02%    4.49%     0.23     11.39%
2015   9.30%    6.62%     1.41     -0.73%
2016   -6.53%   4.99%     -1.31    9.54%
2017   3.48%    4.56%     0.76     19.42%
2018   -0.32%   4.98%     -0.06    -6.24%
2019   2.35%    5.24%     0.45     30.30%
2020   4.57%    8.97%     0.51     8.34%
2021   -4.64%   6.71%     -0.69    14.68%

Alexandria Technology: Monthly S&P500 Universe 2010-2021(09)

Quintile     Avg Stocks   Annual Returns   Standard Deviation   Sharpe Ratio   vs. S&P 500
1            108          19.07%           15.06%               1.27           6.88%
2            107          13.41%           15.42%               0.87           1.22%
3            108          13.33%           16.07%               0.83           1.14%
4            107          12.32%           16.73%               0.74           0.13%
5            107          7.03%            17.45%               0.40           -5.16%
Long/Short   213          10.54%           7.83%                1.35

Alexandria Q1-Q5 Annual Returns (monthly S&P500 universe, 2010-2021(09))

Year   Annual   Std Dev   Sharpe   S&P 500
2010   3.60%    5.68%     0.63     12.78%
2011   5.47%    7.95%     0.69     0.00%
2012   11.26%   6.69%     1.68     13.41%
2013   8.69%    3.08%     2.82     29.60%
2014   16.15%   8.38%     1.93     11.39%
2015   28.81%   10.61%    2.72     -0.73%
2016   -4.04%   9.22%     -0.44    9.54%
2017   22.88%   8.22%     2.78     19.42%
2018   14.24%   8.60%     1.66     -6.24%
2019   4.03%    5.75%     0.70     30.30%
2020   10.69%   8.87%     1.20     8.34%
2021   6.66%    7.64%     0.87     14.68%

FinBERT: Monthly S&P500 Universe 2010-2021(09)

Quintile     Avg Stocks   Annual Returns   Standard Deviation   Sharpe Ratio   vs. S&P 500
1            108          14.59%           14.09%               1.04           1.66%
2            107          13.66%           15.66%               0.87           0.73%
3            108          12.62%           16.14%               0.78           -0.31%
4            107          12.05%           16.69%               0.72           -0.88%
5            107          11.97%           18.11%               0.66           -0.96%
Long/Short   215          1.34%            8.41%                0.16

FinBERT Q1-Q5 Annual Returns (monthly S&P500 universe, 2010-2021(09))

Year   Annual    Std Dev   Sharpe   S&P 500
2010   -1.94%    9.05%     -0.21    12.78%
2011   4.43%     4.80%     0.92     0.00%
2012   0.01%     5.99%     0.00     13.41%
2013   3.52%     3.80%     0.93     29.60%
2014   1.51%     5.32%     0.28     11.39%
2015   20.37%    7.42%     2.75     -0.73%
2016   -13.82%   7.84%     -1.76    9.54%
2017   9.88%     4.14%     2.39     19.42%
2018   4.17%     6.60%     0.63     -6.24%
2019   -4.38%    9.86%     -0.44    30.30%
2020   3.78%     16.41%    0.23     8.34%
2021   -10.44%   11.83%    -0.88    14.68%

5 Summary Results
Annual Returns

Year   LM       Alexandria   FinBERT
2010   6.75%    3.60%        -1.94%
2011   5.74%    5.47%        4.43%
2012   -5.29%   11.26%       0.01%
2013   1.93%    8.69%        3.52%
2014   1.02%    16.15%       1.51%
2015   9.30%    28.81%       20.37%
2016   -6.53%   -4.04%       -13.82%
2017   3.48%    22.88%       9.88%
2018   -0.32%   14.24%       4.17%
2019   2.35%    4.03%        -4.38%
2020   4.57%    10.69%       3.78%
2021   -4.64%   6.66%        -10.44%

[Figure: Cumulative Returns, 2010-2021]

6 Discussion and Conclusion
As our analysis shows, not all NLP methods are the same when applied to earnings calls. We see
a stark difference in sentiment labelling between the three methods, as shown by the
low correlations among them, and subsequently in performance when applied to a trading strategy
over the past decade.

The results show that using sentiment from earnings calls can generate alpha not explained by traditional
risk and return factors. Furthermore, when the three different NLP methods are used to generate
sentiment from earnings calls, there is low correlation of sentiment labeling between Alexandria
and both Loughran McDonald and FinBERT at the individual classification level. Correlations are
higher at the aggregate security sentiment level but remain low. These low
correlations and large performance differences arise from the distinct language models, sentiment
training, and NLP technology of each approach. The research also suggests that Alexandria
Technology's method significantly outperforms both other NLP approaches over the period 2010-
2021. Earnings call Management Discussion (MD) and Question and Answer (QA) sentiment
have been the best performing factors against traditional style factors over the past 10 years, see
(Short, 2016).

We find that Alexandria's NLP method performs the best during the sample period, by a significant margin.
Alexandria's unique approach of using domain experts to train its model appears to be more robust
than the rigid technique of a dictionary-based approach, and even than deep learning methods
that are not specific to earnings calls.

References

Araci, D. (2019) Financial Sentiment Analysis with Pre-trained Language Models. Amsterdam:
University of Amsterdam.

Chen, J. (2021) 'Earnings Call', Investopedia. Available at:
https://www.investopedia.com/terms/e/earnings-call.asp

FactSet (2019) FactSet Research Systems 2019.

Govindaraj, S. et al. (2012) Large Price Changes and Subsequent Returns.

Jha, V., Blain, J. and Montague, W. (2015) Finding Value in Earnings Transcripts Data with
AlphaSense.

Loughran, T. and McDonald, B. (2010) When is a Liability not a Liability? Textual Analysis,
Dictionaries, and 10-Ks.

Short, J. (2016) Societe Generale Cross Asset Research/Equity Quant.

Tortoriello, R. (2018) Their Sentiments Exactly: Sentiment Signal Diversity Creates Alpha
Opportunity. S&P Global.

