
The R Package sentometrics to Compute, Aggregate and Predict with Textual Sentiment

David Ardia (University of Neuchâtel)
Keven Bluteau (University of Neuchâtel; Vrije Universiteit Brussel)
Samuel Borms (University of Neuchâtel; Vrije Universiteit Brussel)
Kris Boudt (Vrije Universiteit Brussel; Vrije Universiteit Amsterdam)

Abstract
We provide a hands–on introduction to optimized textual sentiment indexation using
the R package sentometrics. Driven by the need to unlock the potential of textual data,
sentiment analysis is increasingly used to capture its information value. The sentometrics
package implements an intuitive framework to efficiently compute sentiment scores of
numerous texts, to aggregate the scores into multiple time series, and to use these time
series to predict other variables. The workflow of the package is illustrated with a built–in
corpus of news articles from two major U.S. journals to forecast the VIX index.

Keywords: Penalized Regression, Prediction, R, sentometrics, Textual Sentiment, Time Series.

1. Introduction
Individuals, firms and governments continuously consume written information from various
sources to improve their decisions. The consensus view in the literature is that exploiting
this qualitative information on top of quantitative data provides valuable insights on many
future dynamics related to firms, the economy, political agendas, and marketing campaigns, to
name a few. Numbers reflect ex–post information about the past, while texts may reflect the
feelings of both author and readers about the past and the future. News media is considered
a good and stable reflection of the concerns of readers, as suggested by Tetlock (2007). The
sentiment transmitted through texts, called textual sentiment, is therefore often regarded as
a prime driver of many variables. Assigning sentiment scores to represent the polarity of
a text is, following data retrieval, the first step in the analysis of relating texts, qualitative
by nature, to quantitative target variables. The body of research that transforms texts into
quantitative variables using sentiment analysis has grown steadily. Recent examples in
economics and politics are the use of sentiment in texts to improve inflation predictions
(Beckers et al. 2017) and the application of sentiment analysis to track political preferences
(Ceron et al. 2014). Gentzkow et al. (2018) survey
various applications with textual data, including sentiment analysis.
Textual sentiment is, however, not equally useful across all applications. Deciphering when,
to what degree, and which layers of sentiment add value is needed to consistently study the
full information potential of qualitative communications.
The R package sentometrics proposes a well–defined modelling workflow, specifically applied
to studying the evolution of textual sentiment and its impact on other quantities. It addresses
the present lack of analytical capability to extract time series intelligence about the sentiment
transmitted through a large panel of texts.
The major release of the R text mining infrastructure tm (Feinerer et al. 2008) over a decade
ago can be considered the starting point of the development and popularization of textual
analysis tools in R. A number of successful follow–up attempts at improving the speed and
interface of the comprehensive natural language processing capabilities provided by tm have
been delivered by the packages openNLP (Hornik 2016), cleanNLP (Arnold 2017), quanteda
(Benoit et al. 2018), tidytext (Silge and Robinson 2016) and qdap (Rinker 2017). These efforts
have given rise to a multitude of packages geared towards specific text mining problems,
including topic modelling and sentiment analysis. Many of these specialized packages rely
(internally) on one of the larger above–mentioned infrastructures.
The sentometrics package similarly positions itself as both integrative and supplementary to
the powerful text mining and data science toolboxes in the R universe. It is integrative, as
it combines the strengths of quanteda and stringi (Gagolewski 2018) for corpus construction
and manipulation. It uses data.table (Dowle and Srinivasan 2018) for fast aggregation of
textual sentiment into time series, and glmnet (Friedman et al. 2010) and caret (Kuhn 2018)
for (sparse) model estimation. It is supplementary, given that it easily extends any text
mining workflow by fulfilling (one of) the objectives (i) to compute textual sentiment, (ii)
to aggregate fine–grained textual sentiment into the construction of various sentiment time
series, and (iii) to predict other variables with these sentiment measures. The combination of
these three objectives leads to a flexible and computationally efficient framework to exploit
the information value of sentiment in texts.
The remainder of the paper is structured as follows. Section 2 introduces the methodology
behind the R package sentometrics. Section 3 describes the main control functions and illus-
trates the package’s typical workflow. Section 4 applies the entire framework to forecast the
VIX index. Section 5 concludes.

2. Use cases and workflow

The typical use cases of the R package sentometrics are the fast computation and aggregation
of textual sentiment, the subsequent time series visualization and manipulation, and the
estimation of a sentiment–based prediction model. The use case of building a prediction model
out of textual data encompasses the previous ones. We propose a workflow that consists of
five main steps, Steps 1–5, from corpus construction to model estimation. All use cases can
be addressed by following (a subset of) this workflow. The R package sentometrics takes care
of all steps in full, apart from the collection and cleaning of the texts. Table 1 gives an
overview of the key functionalities of sentometrics and the associated functions, together
with the output objects when these are specifically classed into a new S3 object.


Functionality                 Functions                                        Output

1. Corpus management
   (a) Creation               sento_corpus()                                   sentocorpus
   (b) Manipulation           quanteda corpus functions (e.g., docvars(),
                              corpus_sample(), or corpus_subset())
   (c) Conversion             to_sentocorpus()
   (d) Features generation    add_features()
2. Sentiment computation
   (a) Lexicon management     sento_lexicons()                                 sentolexicons
   (b) Computation            compute_sentiment()                              sentiment
   (c) Manipulation           sentiment_bind(), to_sentiment()
   (d) Summarization          peakdocs()
3. Sentiment aggregation
   (a) Specification          ctr_agg()
   (b) Aggregation            sento_measures(), aggregate.sentiment()          sentomeasures
   (c) Manipulation           measures_delete(), measures_fill(),
                              measures_global(), measures_merge(),
                              measures_select(), measures_subset(),
                              diff(), scale(), get_measures()
   (d) Visualization          plot.sentomeasures()
   (e) Summarization          summary(), peakdates(), nobs(),
                              nmeasures(), get_dimensions(), get_dates()
4. Modelling
   (a) Specification          ctr_model()
   (b) Estimation             sento_model()                                    sentomodel, sentomodeliter
   (c) Prediction             predict.sentomodel()
   (d) Diagnostics            summary(), get_loss_data(), attributions()       attributions
   (e) Visualization          plot.sentomodeliter(), plot.attributions()

Table 1: Taxonomy of the R package sentometrics. This table displays the functionalities of
the sentometrics package along with the associated functions and S3 output objects.

All steps are explained below. We minimize the mathematical details to clarify the exposition,
and stay close to the actual implementation. Section 3 demonstrates how to use the functions.

2.1. Pre–process a selection of texts and generate relevant features (Step 1)

We assume the user has a corpus of texts at their disposal, of any size and from any source
available. Texts could be scraped from the web or retrieved from news databases. A minimal
requirement is that every text has a timestamp and an identifier. These texts should be
cleaned such that graphical and web–related elements (e.g., HTML tags) are removed. This
results in a set of documents dn,t for n = 1, . . . , Nt and time points t = 1, . . . , T , where Nt is
the total number of documents at time t.
Secondly, features have to be defined and mapped to the documents. Features can come
in many forms: news sources, entities (individuals, companies or countries discussed in the
texts), or text topics. The mapping implicitly permits subdividing the corpus into many
smaller groups with a common interest. Many data providers enrich their textual data with
information that can be used as features. If this is not the case, topic modelling or entity
recognition techniques are valid alternatives. Human classification or manual keyword(s) oc-
currence searches are simpler options. The extraction and inclusion of features is an important
part of the analysis and should be related to the variable that is meant to be predicted.
The texts and features have to be structured in a rectangular fashion. Every row represents
a document that is mapped to the features through numerical values $w_{n,t}^{k} \in [0, 1]$, where
the features are indexed by $k = 1, \ldots, K$. The values are indicative of the relevance of a feature
to a document. If they are binary, they indicate which documents belong to which features.
The rectangular data structure is passed to the sento_corpus() function to create a sento-
corpus object. The reason for this separate corpus structure is twofold: It controls whether
all corpus requirements for further analysis are met (such as dealing with timestamps), and
it allows performing operations on the corpus in a more structured way. If no features are
of interest to the analysis, a dummy feature valued $w_{n,t}^{k} = 1$ throughout is automatically
created. The add_features() function is used to add or generate new features, as will be
shown in the illustration. When the corpus is constructed, it is up to the user to decide which
texts have to be kept for the actual sentiment analysis.

2.2. Compute document–level textual sentiment (Step 2)


Every document requires at least one sentiment score for further analysis. The sentometrics
package can be used to assign sentiment using the lexicon–based approach, possibly aug-
mented with information from valence shifters. Alternatively, one can source one's own
document–level sentiment scores and align these scores with the sentometrics package by
making use of the to_sentiment() function.
The lexicon–based approach to sentiment calculation is flexible, transparent and computa-
tionally convenient. It looks for words (or unigrams) that are included in a pre–defined word
list of polarized (positive and negative) words. The R package sentometrics benefits from
built–in word lists in English, French, and Dutch, with the latter two mostly as a checked
web–based translation. The sentometrics package allows for three different ways of doing the
lexicon–based sentiment calculation. These procedures, though simple at their cores, have
proven efficient and powerful in many applications. In increasing complexity, the supported
approaches are:

(i) A unigrams approach. The most straightforward method, where computed sentiment
is simply a (weighted) sum of all detected word scores as they appear in the lexicon.
(ii) A valence–shifting bigrams approach. The impact of the word appearing before the
detected word is evaluated as well. A common example is ‘not good’, which under the
default approach would get a score of 1 (‘good’), but now ends up, for example, having
a score of –1 due to the presence of the negator ‘not’.

(iii) A valence–shifting clusters approach. Valence shifters can also appear in positions
other than right before a certain word. We implement this layer of complexity by
searching for valence shifters (and other sentiment–bearing words) in a cluster of at
maximum four words before and two words after a detected polarized word.

In the first two approaches, a document $d_{n,t}$ receives a sentiment score as:

$$s_{n,t}^{\{l\}} \equiv \sum_{i=1}^{Q_d} \omega_i\, v_i\, s_{i,n,t}^{\{l\}}, \qquad (1)$$

for every lexicon $l = 1, \ldots, L$. The total number of unigrams in the document is equal to $Q_d$.
The score $s_{i,n,t}^{\{l\}}$ is the sentiment value attached to unigram $i$ from document $d_{n,t}$, based on
lexicon $l$, and equals zero when the word is not in the lexicon. The impact of a valence shifter
is represented by $v_i$, being the shifting value if unigram $i-1$ is a valence shifter. No valence
shifter, or the simple unigrams approach, boils down to $v_i = 1$. When the valence shifter is a
negator, typically $v_i = -1$. The weights $\omega_i$ define the within–document aggregation.¹
The third approach differs in the way it calculates the impact of valence shifters. Given a
detected polarized word, say unigram $j$, valence shifters can be identified in a surrounding
cluster of adjacent unigrams $J \equiv \{J^L, J^U\}$ around this word (irrespective of whether they
appear in the same sentence or not), where $J^L \equiv \{j-4, j-3, j-2, j-1\}$ and $J^U \equiv \{j+1, j+2\}$.
The resulting sentiment value for unigram $j$ and associated cluster $J$ is

$$s_{J,n,t}^{\{l\}} \equiv n_N \left(1 + \max\{0.80 \times (n_A - 2 n_A n - n_D), -1\}\right) \times \omega_j\, s_{j,n,t}^{\{l\}} + \sum_{m \in J^U} \omega_m\, s_{m,n,t}^{\{l\}}.$$

The number of amplifying valence shifters is $n_A$, those that deamplify are counted by $n_D$, and
$n = 1$ and $n_N = -1$ if there is an odd number of negators, else $n = 0$ and $n_N = 1$.² The unigrams in $J$
are first searched for in the lexicon, and only when there is no match, they are searched for in
the valence shifters word list. Clusters are non–overlapping from one polarized word to the
other; if another polarized word is detected at position $j+4$, then its cluster consists of the
unigrams $\{j+3, j+5, j+6\}$. The total sentiment score for a document $d_{n,t}$ as expressed by (1)
becomes in this case $s_{n,t}^{\{l\}} \equiv \sum_{J=1}^{C_d} s_{J,n,t}^{\{l\}}$, with $C_d$ the number of clusters. This last approach
borrows from how the R package sentimentr (Rinker 2018b) does its sentiment calculation.
Linguistic intricacies (e.g., sentence boundaries) are better handled in that package, but at
the expense of it being significantly slower.
The scores obtained above are subsequently multiplied by the feature weights to spread out
the sentiment into lexicon– and feature–specific sentiment scores, as $s_{n,t}^{\{l,k\}} \equiv s_{n,t}^{\{l\}} \times w_{n,t}^{k}$, with
$k$ the index denoting the feature. If the document does not correspond to the feature, the
value of $s_{n,t}^{\{l,k\}}$ is zero.

¹The values $\omega_i$ and $v_i$ are specific to a document $d_{n,t}$, but we omit the indices $n$ and $t$ for brevity. For now,
the package only considers constant weights per document, but in a future version, weights could be made
unigram–specific.
²Amplifying valence shifters, such as ‘very’, strengthen a polarized word. Deamplifying valence shifters,
such as ‘hardly’, downtone a polarized word. The strengthening value of 0.80 is fixed and acts, if applicable,
as a modifier of 80% on the polarized word. Negation inverses the polarity. An even number of negators
is assumed to cancel out the negation. Amplifiers are considered (but not double–counted) as deamplifiers
if there is an odd number of negators. Under this approach, for example, the occurrence of ‘not very good’
receives a more realistic score of $-0.20$ ($n = 1$, $n_N = -1$, $n_A = 1$, $n_D = 0$, $s = +1$), instead of $-1.80$, which
in many cases is too negative.
In sentometrics, the sento_lexicons() function is used to define the lexicons and the valence
shifters. The output is a sentolexicons object. Any provided lexicon is applied to the
corpus. The sentiment calculation is performed with compute_sentiment(). Depending on
the input type, this function outputs a data.table with all values for $s_{n,t}^{\{l,k\}}$. When the output
can be used as the basis for aggregation into time series in the next step (that is, when it
has a "date" column), it is classed as a sentiment object. The to_sentiment() function
transforms any properly structured table with sentiment scores to a sentiment object.
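
To make the different calculation modes concrete, the following minimal sketch (our own illustration, assuming the built–in word lists formally introduced in Section 3.2) scores one short sentence twice: once with a plain lexicon and once with the English valence shifters restricted to their bigram ("x" and "y") form. Under the bigrams approach, the negator ‘not’ should flip the contribution of ‘good’, in line with (1).

R> data("list_lexicons", package = "sentometrics")
R> data("list_valence_shifters", package = "sentometrics")
R> lexUni <- sento_lexicons(list_lexicons["GI_en"])
R> lexBi <- sento_lexicons(list_lexicons["GI_en"],
+ valenceIn = list_valence_shifters[["en"]][, c("x", "y")])
R> txt <- "The outlook is not good according to the management."
R> compute_sentiment(txt, lexicons = lexUni, how = "counts")
R> compute_sentiment(txt, lexicons = lexBi, how = "counts")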

2.3. Aggregate the sentiment into textual sentiment time series (Step 3)
In this step, the purpose is to aggregate the individual sentiment scores and obtain various
representative time series. Two main aggregations are performed. The first, across–document,
collapses all sentiment scores across documents within the same frequency (e.g., day or month,
as defined by t) into one score. The weighted sum that does so is:

$$s_t^{\{l,k\}} \equiv \sum_{n=1}^{N_t} \theta_n\, s_{n,t}^{\{l,k\}}. \qquad (2)$$

The weights θn define the importance of each document n at time t (for instance, based on
the length of the text). The second, across–time, smooths the newly aggregated sentiment
scores over time, as:
$$s_u^{\{l,k,b\}} \equiv \sum_{t=t_\tau}^{u} b_t\, s_t^{\{l,k\}}, \qquad (3)$$

where $t_\tau \equiv u - \tau + 1$. The time weighting schemes $b = 1, \ldots, B$ go with different values for $b_t$
to smooth the time series in various ways (e.g., according to a downward sloping exponential
curve), with a time lag of $\tau$. The first $\tau - 1$ observations are dropped from the original time
indices, such that $u = \tau, \ldots, T$ becomes the time index for the ultimate time series. This
leaves $N \equiv T - \tau + 1$ time series observations.
The number of obtained time series, P , is equal to L (number of lexicons) × K (number of
features) × B (number of time weighting schemes). Every time series covers one aspect of
each dimension, best described as “the textual sentiment as computed by lexicon l for feature
k aggregated across time using scheme b”. The time series are designated by $s_u^{p} \equiv s_u^{\{l,k,b\}}$
across all values of $u$, for $p = 1, \ldots, P$, and the triplet $p \equiv \{l, k, b\}$. The ensemble of time
series captures both different information (different features) and the same information in
different ways (same features, different lexicons, and aggregation schemes).
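
As a purely numerical illustration of (2) and (3) (the numbers below are made up), consider three document scores on one date, aggregated across documents with proportional weights and then smoothed over a lag of two dates with equal time weights:

R> s <- c(0.2, -0.1, 0.4)           # document scores s_{n,t} for one date
R> theta <- c(120, 80, 200) / 400   # document weights, e.g., by word count
R> sT <- sum(theta * s)             # across-document aggregation, as in (2)
R> b <- c(0.5, 0.5)                 # equal time weights over a lag of 2
R> sU <- sum(b * c(0.05, sT))       # across-time smoothing, as in (3),
R> sU                               # 0.05 being the previous date's score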
The entire aggregation setup is specified by means of the ctr_agg() function, including
the within–document aggregation needed for the sentiment analysis. The sento_measures()
function performs both the sentiment calculation (via compute_sentiment()) and time series
aggregation (via aggregate()), outputting a sentomeasures object.

2.4. Specify regression model and do (out–of–sample) predictions (Step 4)


The sentiment measures are now regular time series variables that can be applied in regressions.
In case of a linear regression, the reference equation is:

$$y_{u+h} = \delta + \gamma^{\top} x_u + \beta_1 s_u^{1} + \ldots + \beta_p s_u^{p} + \ldots + \beta_P s_u^{P} + \epsilon_{u+h}. \qquad (4)$$

The target variable $y_{u+h}$ is often a variable to forecast, that is, $h > 0$. Let $s_u \equiv (s_u^{1}, \ldots, s_u^{P})^{\top}$
encapsulate all textual sentiment variables as constructed before, and $\beta \equiv (\beta_1, \ldots, \beta_P)^{\top}$.
Other variables are denoted by the vector $x_u$ at time $u$, and $\gamma$ is the associated parameter
vector. Logistic regression (binomial and multinomial) is available as a generalization of the
same underlying linear structure.
The typical large dimensionality of P relative to the number of observations, and the potential
multicollinearity, both pose a problem to ordinary least squares (OLS) regression. Instead,
estimation and variable selection through a penalized regression relying on the elastic net reg-
ularization of Zou and Hastie (2005) is more appropriate. Regularization, in short, shrinks the
coefficients of the least informative variables towards zero. It consists of optimizing the least
squares or likelihood function including a penalty component. The elastic net optimization
problem for the specified linear regression is expressed as:
$$\min_{\tilde{\delta},\, \tilde{\gamma},\, \tilde{\beta}} \left\{ \frac{1}{N} \sum_{u=\tau}^{T} \left( y_{u+h} - \tilde{\delta} - \tilde{\gamma}^{\top} \tilde{x}_u - \tilde{\beta}^{\top} \tilde{s}_u \right)^{2} + \lambda \left( \alpha \lVert \tilde{\beta} \rVert_1 + (1 - \alpha)\, \lVert \tilde{\beta} \rVert_2^2 \right) \right\}. \qquad (5)$$

The tilde denotes standardized variables, and $\lVert \cdot \rVert_p$ is the $\ell_p$–norm. The standardization is
required for the regularization, but the coefficients are rescaled back once estimated. The
rescaled estimates of the model coefficients for the textual sentiment indices are in $\widehat{\beta}$, usually
a sparse vector, depending on the severity of the shrinkage. The parameter $0 \leq \alpha \leq 1$ defines
the trade–off between the Ridge (Hoerl and Kennard 1970), $\ell_2$, and the LASSO (Tibshirani
1996), $\ell_1$, regularization, respectively for $\alpha = 0$ and $\alpha = 1$. The $\lambda \geq 0$ parameter defines
the level of regularization. When $\lambda = 0$, the problem reduces to OLS estimation. The two
parameters are calibrated in a data–driven way, such that they are optimal to the regression
equation at hand. The sentometrics package allows calibration through cross–validation, or
based on an information criterion with the degrees of freedom properly adjusted to the elastic
net context according to Tibshirani and Taylor (2012).
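
The following sketch (our own, on simulated data) shows the type of elastic net fit in (5) by calling glmnet directly; sento_model() performs this internally, including the standardization and the calibration of $\alpha$ and $\lambda$.

R> library("glmnet")
R> X <- matrix(rnorm(100 * 20), nrow = 100)  # stand-in for sentiment measures
R> y <- rnorm(100)                           # stand-in for the target variable
R> fitEnet <- glmnet(X, y, alpha = 0.5)      # alpha mixes LASSO and Ridge
R> coef(fitEnet, s = 0.1)                    # coefficients at lambda = 0.1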
An analysis of typical interest is the sequential estimation of a regression model and out–of–
sample prediction. For a given sample size M < N , a regression model is estimated with the
first M observations and used to predict the one–step ahead observation M + 1 of the target
variable.3 This procedure is repeated, leading to a series of estimates. These are compared
with the realized values to assess the average out–of–sample prediction performance.
The type of model, the calibration approach, and other modelling decisions are defined via
the ctr_model() function. The (iterative) model estimation and calibration is done with
the sento_model() function that relies on the R packages glmnet and caret. The user can
define here additional (sentiment) values for prediction through the x argument. The output
is a sentomodel object (one model) or a sentomodeliter object (a collection of iteratively
estimated sentomodel objects and associated out–of–sample predictions).
A forecaster, however, is not limited to using the models provided through the sentometrics
package; (s)he is free to take this step to whichever modelling toolbox is available, continuing
with the sentiment variables computed in the previous steps.
³We define one–step ahead as the first and single prediction a forecaster would like to obtain after an
estimation round.


2.5. Evaluate prediction performance and sentiment attributions (Step 5)


A sentomodeliter object carries an overview of performance measures relevant to the type
of model estimated. Additionally, it can easily be plotted to inspect the prediction precision
visually. A more rigorous way to compare different models, sentiment–based or not, is to construct
a model confidence set (Hansen et al. 2011). This set isolates the models that are statistically
the best regarding predictive ability, within a given confidence level. The function get_loss_data()
prepares several types of loss data from a collection of sentomodeliter objects, which can then be
used by the R package MCS (Catania and Bernardi 2017) to create a model confidence set.
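
A hedged sketch of that last step, assuming modelList is a hypothetical named list of sentomodeliter objects estimated over the same out–of–sample period, and with the loss type picked here being one of several supported options:

R> lossData <- get_loss_data(models = modelList, loss = "errorSq")
R> mcs <- MCS::MCSprocedure(lossData)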
The aggregation into textual sentiment time series is entirely linear. Based on the estimated
coefficients $\widehat{\beta}$, every underlying dimension's sentiment attribution to a given prediction can
thus be computed easily. For example, the attribution of a certain feature k in the forecast
of the target variable at a particular date is the weighted sum of the model coefficients and
the values of the sentiment measures constructed from k. Attribution can be computed for
all features, lexicons, time–weighting schemes, time lags, and individual documents. Through
attribution, a prediction is broken down into its respective components. The attribution to
documents is useful to pick the texts with the most impact on a prediction at a certain date.
The function attributions() computes all types of possible attributions.

3. The R package sentometrics

In what follows, several examples show how to put the steps into practice using the sento-
metrics package. The subsequent sections illustrate the main workflow, using built–in data,
focusing on individual aspects of it. Section 3.1 studies corpus management and features
generation. Section 3.2 investigates the sentiment computation. Section 3.3 looks at the
aggregation into time series (including the control function ctr_agg()). Section 3.4 briefly
explains further manipulation of a sentomeasures object. Section 3.5 covers the modelling
setup (including the control function ctr_model()) and attribution.

3.1. Corpus management and features generation


The very first step is to load the R package sentometrics:

R> library("sentometrics")

We demonstrate the workflow using the usnews built–in dataset, a collection of news articles
from The Wall Street Journal and The Washington Post between 1995 and 2014.4 It has
a data.frame structure and thus satisfies the requirement that the input texts have to be
structured rectangularly, with every row representing a document. The data is loaded below:

R> data("usnews", package = "sentometrics")


R> class(usnews)

[1] "data.frame"

⁴The data originates from https://www.figure-eight.com/data-for-everyone/ (under “Economic News Article
Tone and Relevance”).

For conversion to a sentocorpus object, the "id", "date" and "texts" columns have to come
in that order. All other columns are reserved for features, of type numeric. For this particular
corpus, there are four original features. The first two indicate the news source, the latter two
the relevance of every document to the U.S. economy. The feature values $w_{n,t}^{k}$ are binary and
complementary (when "wsj" is 1, "wapo" is 0; similarly for "economy" and "noneconomy"),
which allows subdividing the corpus to create separate time series.

R> head(usnews[, -3])

id date wsj wapo economy noneconomy
1 830981846 1995-01-02 0 1 1 0
2 842617067 1995-01-05 1 0 0 1
3 830982165 1995-01-05 0 1 0 1
4 830982389 1995-01-08 0 1 0 1
5 842615996 1995-01-09 1 0 0 1
6 830982368 1995-01-09 0 1 1 0

To access the texts, one can simply do usnews[["texts"]] (i.e., the third column omitted
above). An example of one text is:

R> usnews[["texts"]][2029]

[1] "Dow Jones Newswires NEW YORK -- Mortgage rates rose in the past week
after Fridays employment report reinforced the perception that the economy
is on solid ground, said Freddie Mac in its weekly survey. The average for
30-year fixed mortgage rates for the week ended yesterday, rose to 5.85
from 5.79 a week earlier and 5.41 a year ago. The average for 15-year
fixed-rate mortgages this week was 5.38, up from 5.33 a week ago and the
year-ago 4.69. The rate for five-year Treasury-indexed hybrid adjustable-rate
mortgages, was 5.22, up from the previous weeks average of 5.17. There is no
historical information for last year since Freddie Mac began tracking this
mortgage rate at the start of 2005."

The built–in corpus is cleaned of non–alphanumeric characters. To put the texts and features
into a corpus structure, call the sento_corpus() function. If you have no features available,
the corpus can still be created without any feature columns in the input data.frame, but
a dummy feature called "dummyFeature" with a score of 1 for all texts is added to the
sentocorpus output object.

R> uscorpus <- sento_corpus(usnews)


R> class(uscorpus)

[1] "sentocorpus" "corpus" "list"


The sento_corpus() function creates a sentocorpus object on top of the quanteda package's
corpus object. Hence, many functions from quanteda to manipulate corpora can be applied
to a sentocorpus object as well. For instance, quanteda::corpus_subset(uscorpus, date
< "2014-01-01") would limit the corpus to all articles before 2014. The presence of the
date document variable (the "date" column) and all other metadata as numeric features
valued between 0 and 1 are the two distinguishing aspects between a sentocorpus object
and any other corpus–like object in R. Having the date column is a requirement for the later
aggregation into time series. The function to_sentocorpus() transforms a quanteda corpus
object into a sentocorpus object. The quanteda corpus constructor itself supports conversion
from a tm corpus into its format.
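
As an example of such a quanteda operation on a sentocorpus object:

R> # keep only the articles published before 2014
R> uscorpusEarly <- quanteda::corpus_subset(uscorpus, date < "2014-01-01")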
To round off Step 1, we add two metadata features using the add_features() function. The
features uncertainty and election give a score of 1 to documents in which, respectively, the
word "uncertainty" or "distrust" and the specified regular expression regex appear.
Regular expressions provide flexibility to define more complex features, though they can be slow
to evaluate for a large corpus if too complex. Overall, this gives K = 6 features. The add_features()
function is most useful when the corpus starts off with no additional metadata, i.e., the sole
feature present is the automatically created "dummyFeature" feature.5

R> regex <- paste0("\\bRepublic[s]?\\b|\\bDemocrat[s]?\\b|\\belection\\b")


R> uscorpus <- add_features(uscorpus,
+ keywords = list(uncertainty = c("uncertainty", "distrust"),
+ election = regex),
+ do.binary = TRUE, do.regex = c(FALSE, TRUE))
R> tail(quanteda::docvars(uscorpus))

date wsj wapo economy noneconomy uncertainty election
842616931 2014-12-22 1 0 1 0 0 0
842613758 2014-12-23 1 0 0 1 0 0
842615135 2014-12-23 1 0 0 1 0 0
842617266 2014-12-24 1 0 1 0 0 0
842614354 2014-12-26 1 0 0 1 0 0
842616130 2014-12-31 1 0 0 1 0 0

3.2. Lexicon preparation and sentiment computation


As Step 2, we calculate sentiment using lexicons. We supply the built–in Loughran & McDonald
(Loughran and McDonald 2011) and Henry (Henry 2008) word lists, and the more generic
Harvard General Inquirer word list. We also add six other lexicons from the R package lexicon
(Rinker 2018a): the NRC lexicon (Mohammad and Turney 2010), the Hu & Liu lexicon (Hu
and Liu 2004), the SentiWord lexicon (Baccianella et al. 2010), the Jockers lexicon (Jockers
2017), the SenticNet lexicon (Cambria et al. 2016), and the SO–CAL lexicon (Taboada et al.
2011).⁶ This gives L = 9.

⁵To delete a feature, use quanteda::docvars(corpusVariable, field = "featureName") <- NULL. The
docvars method is extended for a sentocorpus object; for example, if all current features are deleted, the
dummy feature "dummyFeature" is automatically added.
⁶You can add any two–column table format as a lexicon, as long as the first column is of type character,
and the second column is numeric.


We pack these lexicons together in a named list, and provide it to the sento_lexicons()
function, together with an English valence word list. The valenceIn argument dictates
the complexity of the sentiment analysis. If valenceIn = NULL (the default), sentiment is
computed based on the simplest unigrams approach. If valenceIn is a table with an "x"
and a "y" column, the valence–shifting bigrams approach is considered for the sentiment
calculation. The values of the "y" column are those used as $v_i$. If the second column is
named "t", it is assumed that this column indicates the type of valence shifter for every
word, and thus it forces employing the valence–shifting clusters approach for the sentiment
calculation. Three types of valence shifters are supported for the latter method: negators
(value of 1, define $n$ and $n_N$), amplifiers (value of 2, counted in $n_A$) and deamplifiers (value
of 3, counted in $n_D$).

R> data("list_lexicons", package = "sentometrics")


R> data("list_valence_shifters", package = "sentometrics")
R> lexiconsIn <- c(list_lexicons[c("LM_en", "HENRY_en", "GI_en")],
+ list(NRC = lexicon::hash_sentiment_nrc,
+ HULIU = lexicon::hash_sentiment_huliu,
+ SENTIWORD = lexicon::hash_sentiment_sentiword,
+ JOCKERS = lexicon::hash_sentiment_jockers,
+ SENTICNET = lexicon::hash_sentiment_senticnet,
+ SOCAL = lexicon::hash_sentiment_socal_google))
R> lex <- sento_lexicons(lexiconsIn = lexiconsIn,
+ valenceIn = list_valence_shifters[["en"]])

The lex output is a sentolexicons object, a specialized list. The individual word lists
themselves are data.tables, as displayed below. Duplicates are removed, all words are put
to lowercase and only unigrams are kept.

R> lex[["HENRY_en"]]

x y
1: above 1
2: accomplish 1
3: accomplished 1
4: accomplishes 1
5: accomplishing 1
---
185: worse -1
186: worsen -1
187: worsening -1
188: worsens -1
189: worst -1
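
Any word list in the two–column format of footnote 6 can be supplied as an additional lexicon. A minimal sketch with made–up words and scores:

R> myWords <- data.frame(x = c("upbeat", "gloomy"), y = c(1, -1),
+ stringsAsFactors = FALSE)
R> lexCustom <- sento_lexicons(list(MYLEX = myWords))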

The simplest way forward is to compute sentiment scores for every text in the corpus. This is
handled by the compute_sentiment() function, which works with either a character vector,
a sentocorpus object, or a quanteda corpus object. For a character vector input, the
function returns a word count column and the computed sentiment scores for all lexicons, as
no features are involved. The compute_sentiment() function has, besides the input corpus and the
lexicons, other arguments. The main one is the how argument, to specify the within–document
aggregation. More details on the contents of these arguments are provided in Section 3.3, where
the ctr_agg() function is discussed. See below for a brief usage and output example.

R> sentScores <- compute_sentiment(usnews[["texts"]],
+ lexicons = lex, how = "proportional")
R> head(sentScores[, c("word_count", "GI_en", "SENTIWORD", "SOCAL")])

word_count GI_en SENTIWORD SOCAL
1: 213 -0.013146 -0.003436 0.20679
2: 110 0.054545 0.004773 0.15805
3: 202 0.004950 -0.006559 0.11298
4: 153 0.006536 0.021561 0.15393
5: 245 0.008163 -0.017786 0.08997
6: 212 0.014151 -0.009316 0.03506

In Appendix A, we provide an illustrative comparison of the computation time for various
lexicon–based sentiment calculators in R, benchmarked against the sentometrics package.

3.3. Creation of sentiment measures


To create sentiment time series and to be able to use the entire scope of the package, one
needs the sento_measures() function and a well–specified aggregation setup defined via the
ctr_agg() function. We focus the explanation on the control function’s central arguments,
and integrate the other arguments in their discussion:

• howWithin: This argument defines how sentiment is aggregated within the same document,
setting the weights $\omega_i$ in (1). The "counts" option sets $\omega_i = 1$. For binary
lexicons and the simple unigrams matching case, this amounts to simply taking the
difference between the number of positive and negative words. One can normalize the
sentiment score by the total number of words in the document ("proportional"), or
by the total number of polarized words in the document ("proportionalPol").
• howDocs: This argument defines how sentiment is aggregated across all documents at
the same date (or frequency), that is, it sets the weights $\theta_n$ in (2). The time frequency
at which the time series have to be aggregated is chosen via the by argument, and
can be set to daily ("day"), weekly ("week"), monthly ("month") or yearly ("year").
The option "equal_weight" gives the same weight to every document, while the option
"proportional" gives higher weights to documents with more words, relative to the
document population at a given date. The do.ignoreZeros argument forces ignoring
documents with zero sentiment in the computation of the across–document weights. By
default these documents are overlooked. This avoids the incorporation of documents
not relevant to a particular feature (as in those cases $s_{n,t}^{\{l,k\}}$ is exactly zero, because
$w_{n,t}^{k} = 0$), which could lead to a bias of sentiment towards zero.⁷

⁷It also ignores the documents which are relevant to a feature, but exhibit zero sentiment. This can occur
if none of the words have a polarity, or the weighted number of positive and negative words offset each other.


• howTime: This argument defines how sentiment is aggregated across dates, to smoothen
the time series and to acknowledge that sentiment at a given point is at least partly
based on sentiment and information from the past. The lag argument has the role
of τ dictating how far to go back. In the implementation, lag = 1 means no time
aggregation and thus bt = 1. The "equal_weight" option is similar to a simple weighted
moving average, "linear" and "exponential" are two options which give weights to
the observations according to a linear or an exponential curve, "almon" does so based on
Almon polynomials, and "beta" based on the Beta weighting curve from Ghysels et al.
(2007). The last three curves have respective arguments to define their shape(s), being
alphasExp, ordersAlm and do.inverseAlm, and aBeta and bBeta.⁸ These weighting
schemes are always normalized to unity (an illustrative weighting sketch follows after
this list). If desired, user–constructed weights can be supplied via weights as a named
data.frame. All the weighting schemes define the different $b_t$ values in (3). The fill
argument is of sizeable importance here. It is used to add in dates for which not a single
document was available. These added, originally missing, dates are given a value of 0
("zero") or the most recent value ("latest"). The option "none" corresponds to not
filling up the date sequence at all. Adding in dates (or not) impacts the time aggregation
by respectively combining the latest consecutive dates, or the latest available dates.

• nCore: The nCore argument can help to speed up the sentiment calculation when dealing
with a large corpus. It expects a positive integer passed on to the setThreadOptions()
function from the RcppParallel (Allaire et al. 2018) package, and parallelizes the senti-
ment computation across texts. By default, nCore = 1, which indicates no paralleliza-
tion. Parallelization is expected to improve the speed of the sentiment computation
only for sufficiently large corpora, or when using many lexicons. A second speedup
strategy is to tokenize the texts separately and input these using the tokens argument.
This way, the tokenization can be tailor–made (e.g., stemmed⁹) and reused for different
sentiment computation function calls, for example to compare the impact of different
normalization or aggregation choices for the same tokenized corpus. Our tokenization,
done with the R package stringi, transforms all text to lowercase and strips punctuation
marks and numeric characters.
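
To build intuition for the time weights $b_t$, the sketch below constructs a simple exponentially decaying scheme over a lag of 30 periods and normalizes it to unity; this is our own illustration, not the package's exact exponential curve.

R> lag <- 30
R> alpha <- 0.2
R> bRaw <- (1 - alpha)^((lag - 1):0)  # most recent period gets the largest weight
R> b <- bRaw / sum(bRaw)              # normalize the weights to sum to one
R> round(tail(b, 5), 3)               # weights of the five most recent periods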

⁸More specifically, for $i = 1, \ldots, \tau$, we have as Almon weighting:
$$c \left(1 - \left(1 - \tfrac{i}{\tau}\right)^{b}\right) \left(1 - \tfrac{i}{\tau}\right)^{B - b},$$
where $c$ acts as a normalization constant, $b$ is a specific element from the ordersAlm vector, and $B$ is the maximum
value in the ordersAlm vector. If do.inverseAlm = TRUE, the inverse Almon polynomials are computed, modifying
$\tfrac{i}{\tau}$ to $1 - \tfrac{i}{\tau}$. For Beta weighting, we have:
$$f\!\left(\tfrac{i}{\tau}; a, b\right) \Big/ \sum_{i=1}^{\tau} f\!\left(\tfrac{i}{\tau}; a, b\right),$$
with the Beta density $f(x; a, b) \equiv \frac{x^{a-1} (1 - x)^{b-1}\, \Gamma(a + b)}{\Gamma(a)\, \Gamma(b)}$, $\Gamma(\cdot)$ the Gamma function, $a$ as aBeta, and $b$ as bBeta.
The functions weights_almon() and weights_beta() (as well as weights_exponential() for exponential
weighting) allow generating these time weights separately.
⁹Make sure in this case to also stem the lexicons before you provide those to the sento_lexicons() function.

In the example code below, we aggregate sentiment at a weekly frequency, choose a counts–based
within–document aggregation, and weight the documents for across–document aggregation
proportionally to the number of words in the document. The resulting time series
are smoothed according to an equally–weighted and an exponential time aggregation scheme
($B = 2$), using a lag of 30 weeks. We ignore documents with zero sentiment for across–document
aggregation, and fill missing dates with zero before the across–time aggregation, as per default.

R> ctrAgg <- ctr_agg(howWithin = "counts", howDocs = "proportional",
+ howTime = c("exponential", "equal_weight"), do.ignoreZeros = TRUE,
+ by = "week", fill = "zero", lag = 30, alphasExp = 0.2)

The sento_measures() function performs both the sentiment calculation in Step 2 and the
aggregation in Step 3, and results in a sentomeasures output object. The generic summary()
displays a brief overview of the composition of the sentiment time series. A sentomeasures
object is a classed list with as most important elements "measures" (the textual sentiment
time series), "sentiment" (the original sentiment scores per document) and "stats" (a se-
lection of summary statistics). Alternatively, the same output can be obtained by applying
the aggregate() function on the output of the compute_sentiment() function, if the latter
is computed from a sentocorpus object.

R> sentMeas <- sento_measures(uscorpus, lexicons = lex, ctr = ctrAgg)

There are 108 initial sentiment measures (9 lexicons × 6 features × 2 time weighting schemes).
An example of one of the created sentiment measures, and the naming of the sentiment
measures (each dimension’s component is separated by "--"), is shown below.

R> get_measures(sentMeas)[, c(1, 23)]

date NRC--noneconomy--equal_weight
1: 1995-07-24 3.931
2: 1995-07-31 4.124
3: 1995-08-07 3.961
4: 1995-08-14 4.059
5: 1995-08-21 3.846
---
1011: 2014-12-01 4.649
1012: 2014-12-08 4.649
1013: 2014-12-15 4.594
1014: 2014-12-22 4.653
1015: 2014-12-29 4.831

A sentomeasures object is easily plotted across each of its dimensions. For example, Figure 1
shows a time series of average sentiment for both time weighting schemes involved.10 A display
of averages across lexicons and features is achieved by altering the group argument from the
plot() method to "lexicons" and "features", respectively.

R> plot(sentMeas, group = "time")


Figure 1: Textual sentiment time series averaged across time weighting schemes.

3.4. Manipulation of a sentomeasures object


There are a number of functions to facilitate the manipulation of a sentomeasures object.
These functions are all prefixed by measures_. Besides those functions, the methods imple-
mented are summary(), diff(), and scale(). The scale() function can take matrix inputs,
and thus is useful to add, subtract, multiply or divide the sentiment measures up to the level of
single observations. The functions get_dates() and get_dimensions() are convenient func-
tions that return the dates and the aggregation dimensions of a sentomeasures object. The
functions nobs() and nmeasures() give the number of time series observations and the num-
ber of sentiment measures. The get_measures() function extracts the sentiment measures,
either in long or wide format.
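
A brief sketch combining a few of these methods (the centering and differencing choices are arbitrary here, and the center/scale arguments are assumed to behave as for the base scale() generic):

R> sentMeasCentered <- scale(sentMeas, center = TRUE, scale = FALSE)
R> sentMeasDiffed <- diff(sentMeasCentered)
R> nmeasures(sentMeasDiffed)
R> nobs(sentMeasDiffed)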
The measures_delete() and measures_select() functions respectively delete or extract
certain combinations of sentiment measures. Here, the measures_delete() function call
returns a new object without all sentiment measures created from the "LM_en" lexicon and
without the single time series consisting of the specified combination "SENTICNET" (lexicon),
"economy" (feature) and "equal_weight" (time weighting).11

R> measures_delete(sentMeas, list(c("LM_en"),
+ c("SENTICNET", "economy", "equal_weight")))
A sentomeasures object (95 textual sentiment time series, 1015 observations).

¹⁰The averaged sentiment measures can be accessed from the plot object. Given p <- plot(...), this is via
p[["data"]]. The data is in long format.
¹¹The new number of sentiment measures does not necessarily equal $L \times K \times B$ anymore once a sentomeasures
object is modified.

Subsetting across rows is possible with the measures_subset() function. A typical example
is to subset only a time series range by specifying the dates part of that range, as below,
where 100 dates are kept. One can also condition on specific sentiment measures being above,
below or between certain values.

R> measures_subset(sentMeas, date %in% get_dates(sentMeas)[50:149])

A sentomeasures object (108 textual sentiment time series, 100 observations).

To ex–post fill the time series date–wise, the measures_fill() function is to be used. At a
minimum, it fills in missing dates within the existing date range at the prevailing frequency.
Dates before the earliest and after the most recent date can be added too. The argument
fill = "zero" sets all added dates to zero, whereas fill = "latest" takes the most recent
known value. This function is applied internally depending on the fill parameter from the
ctr_agg() function. The example below pads the time series with trailing dates, taking the
first value that occurs.

R> sentMeasFill <- measures_fill(sentMeas, fill = "latest",
+ dateBefore = "1995-07-01")
R> get_measures(sentMeasFill)[, 1:3]

date LM_en--wsj--equal_weight LM_en--wapo--equal_weight
1: 1995-06-26 -2.045 -3.052
2: 1995-07-03 -2.045 -3.052
3: 1995-07-10 -2.045 -3.052
4: 1995-07-17 -2.045 -3.052
5: 1995-07-24 -2.045 -3.052
6: 1995-07-31 -2.085 -3.046

The sentiment visualized using the plot() function may give a distorted image due to the
averaging when there are many different lexicons, features, and time weighting schemes. To
obtain a more nuanced picture of the differences in one particular dimension, one can ignore
the other two dimensions. For example, corpusPlain has only the dummy feature, and there
is no time aggregation involved (lag = 1). This leaves the lexicons as the sole distinguishing
dimension between the sentiment time series.

R> corpusPlain <- sento_corpus(usnews[, 1:3])


R> ctrAggLex <- ctr_agg(howWithin = "proportionalPol", howTime = "own",
+ howDocs = "equal_weight", by = "month", fill = "none", lag = 1)
R> sentMeasLex <- sento_measures(corpusPlain,
+ lexicons = lex[-length(lex)], ctr = ctrAggLex)
R> mean(as.numeric(sentMeasLex$stats["meanCorr", ]))

[1] 0.3593


Figure 2: Textual sentiment time series averaged across lexicons.

Figure 2 plots the nine time series, one for each lexicon. There are differences in levels.
For example, the Loughran & McDonald lexicon’s time series lies almost exclusively in the
negative sentiment area. The overall average correlation is close to 36%.
After the initial aggregation, one can do additional merging of the sentiment time series with
the measures_merge() function. This functionality equips users with a way to either further
diminish or further expand the dimensionality of the sentiment measures. For instance, all
sentiment measures composed of three particular time aggregation schemes can be shrunk
together, by averaging across each fixed combination of the other two dimensions, resulting
in a set of new measures.
In the example, the three lexicons taken from the built–in word lists, both news sources, both
added features, and the two time weighting schemes are collapsed into their respective new
counterparts. The do.keep =
FALSE option indicates that the original measures are not kept after merging, such that the
number of sentiment time series effectively goes down to 7 × 4 × 1 = 28.

R> sentMeasMerged <- measures_merge(sentMeas,
+ time = list(W = c("equal_weight", "exponential_0.2")),
+ lexicons = list(LEX = c("LM_en", "HENRY_en", "GI_en")),
+ features = list(JOUR = c("wsj", "wapo"),
+ NEW = c("uncertainty", "election")),
+ do.keep = FALSE)
R> get_dimensions(sentMeasMerged)


$features
[1] "economy" "noneconomy" "JOUR" "NEW"

$lexicons
[1] "NRC" "HULIU" "SENTIWORD" "JOCKERS" "SENTICNET" "SOCAL" "LEX"

$time
[1] "W"

With the measures_global() function, the sentiment measures can also be merged into
single dimension–specific time series we refer to as global sentiment. Each of the different
components of the dimensions has to receive a weight that stipulates the importance and
sign in the global sentiment index. The weights need not sum to one, that is, no condition
is imposed. For example, the global lexicon–oriented sentiment measure is computed as
$s_u^{G,L} \equiv \frac{1}{P} \sum_{p=1}^{P} s_u^{p} \times w_{l,p}$. The weight given to lexicon $l$ that appears in $p$ is denoted by $w_{l,p}$. An
additional, “fully global”, measure is composed as $s_u^{G} \equiv (s_u^{G,L} + s_u^{G,K} + s_u^{G,B})/3$.
In the code excerpt, we define ad–hoc weights that emphasize the initial features and the
Henry lexicon. Both time weighting schemes are weighted equally.12 The output of the
measures_global() function is a four–column data.table. The global sentiment measures
provide a low–dimensional summary of the sentiment in the corpus, depending on some preset
parameters related to the importance of the dimensions.

R> glob <- measures_global(sentMeas,
+ lexicons = c(0.10, 0.40, 0.05, 0.05, 0.08, 0.08, 0.08, 0.08, 0.08),
+ features = c(0.20, 0.20, 0.20, 0.20, 0.10, 0.10),
+ time = c(1/2, 1/2))

The peakdates() function pinpoints the dates with most extreme sentiment (negatively, pos-
itively, or both). The example below extracts the date with the lowest sentiment time series
value across all measures. The peakdocs() function can be applied equivalently to a senti-
ment object to get the document identifiers with most extreme document–level sentiment.

R> peakdates(sentMeas, n = 1, type = "neg")

[1] "2008-01-28"

3.5. Sparse regression using the sentiment measures


Step 4 consists of the regression modelling. We proceed by explaining the available modelling
setup in the sentometrics package. Other model frameworks with sentiment as an input can
be explored based on the measures extracted through the get_measures() function.
The ctr_model() function defines the modelling setup, and is explained below. The main
arguments are itemized, all others are reviewed within the discussion.
¹²This can also be achieved by setting time = 1, the default option.


• model: The model argument can take "gaussian" (for linear regression), and "binomial"
or "multinomial" (both for logistic regression). The argument do.intercept = TRUE
fits an intercept.

• type: The type specifies the calibration procedure to find the most appropriate α and
λ in (5). The options are "cv" (cross–validation) or one of three information criteria
("BIC", "AIC" or "Cp"). The information criterium approach is available in case of a lin-
ear regression only. The argument alphas can be altered to change the possible values
for alpha, and similarly so for the lambdas argument. If lambdas = NULL, the possible
values for λ are generated internally by the glmnet() function from the R package glm-
net. If lambdas = 0, the regression procedure is OLS. The arguments trainWindow,
testWindow and oos are needed when model calibration is performed through cross–
validation, that is, when type = "cv". The cross–validation implemented is based
on the “rolling forecasting origin” principle, considering we are working with time se-
ries.13 The argument do.progress = TRUE prints calibration progress statements. The
do.shrinkage.x argument is a logical vector to indicate on which external explana-
tory variables to impose regularization. These variables, xu , are added through the x
argument of the sento_model() function.

• h: The integer argument h shifts the response variable up to yu+h and aligns the ex-
planatory variables in accordance with (4).14 If h = 0 (by default), no adjustments are
made. The logical do.difference argument, if TRUE, can be used to difference the tar-
get variable y supplied to the sento_model() function, if it is a continuous variable (i.e.,
model = "gaussian"). The lag taken is the absolute value of the h argument (given
|h| > 0). For example, if h = 2, and assuming the y variable is aligned time–wise with
all explanatory variables, denoted by X here for sake of the illustration, the regression
performed is of yt+2 − yt on Xt . If h = −2, the regression fitted is yt+2 − yt on Xt+2 .

• do.iter: To enact an iterative model estimation and a one–step ahead out–of–sample
analysis, set do.iter = TRUE. To perform a one–off in–sample estimation, set do.iter
= FALSE. The arguments nSample, start and nCore are used for iterative modelling,
thus, when do.iter = TRUE. The first argument is the size of the sample used to re–estimate
the model each time, M. The second argument can be used to only run a later subset of
the iterations (start = 1 by default runs all iterations). The total number of iterations
is equal to length(y) - nSample - abs(h) - oos, with y the response variable as a vector.
The oos argument partly specifies the cross–validation, as explained above, but also
provides flexibility in defining the out–of–sample exercise. For instance, given t, the
one–step ahead out–of–sample prediction is computed at t + oos + 1. As per default, oos
= 0. If nCore > 1, the %dopar% construct from the R package foreach (Weston 2017)
is utilized to speed up the out–of–sample analysis. A sketch of such an iterative setup
follows after this list.

¹³As an example, take 120 observations in total, trainWindow = 80, testWindow = 10 and oos = 5. In
the first round of cross–validation, a model is estimated for a certain α and λ combination with the first 80
observations, then 5 observations are skipped, and predictions are generated for observations 86 to 95. The
next round does the same but with all observations moved one step forward. This is done until the end of the
total sample is reached, and repeated for all possible parameter combinations, relying on the train() function
from the R package caret. The optimal (α, λ) couple is the one that induces the lowest average prediction error
(measured by the root mean squared error for linear models, and overall accuracy for logistic models).
¹⁴If the input response variable is not aligned time–wise with the sentiment measures and the other explanatory
variables, h cannot be interpreted as the exact prediction horizon. In other words, h only shifts the input
variables as they are provided.
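
A sketch of an iterative out–of–sample specification (the numerical choices below are arbitrary and only meant to show how the arguments discussed above fit together):

R> ctrIter <- ctr_model(model = "gaussian", type = "cv", h = 1,
+ do.iter = TRUE, nSample = 60, trainWindow = 40, testWindow = 10,
+ oos = 0, alphas = c(0, 0.5, 1))
R> # out <- sento_model(sentMeas, y, ctr = ctrIter)  # y being a target series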

To enhance the intuition about attribution, we estimate a contemporaneous in–sample model
and compute the attribution decompositions.

R> set.seed(505)
R> y <- rnorm(nobs(sentMeas))
R> ctrInSample <- ctr_model(model = "gaussian",
+ h = 0, type = "BIC", alphas = 0, do.iter = FALSE)
R> fit <- sento_model(sentMeas, y, ctr = ctrInSample)

The attributions() function takes the sentomodel object and the related sentiment mea-
sures object as inputs, and generates by default attributions for all in–sample dates at the
level of individual documents, lags, lexicons, features, and time. The function can be applied
to a sentomodeliter object as well, for any specific dates using the refDates argument. If
do.normalize = TRUE, the values are normalized between –1 and 1 through division by the
$\ell_2$–norm of the attributions at a given date. The output is an attributions object.

R> attrFit <- attributions(fit, sentMeas)


R> head(attrFit[["features"]])

date wsj wapo economy noneconomy uncertainty election
1: 1995-07-24 0.0001681 0.002519 -0.014997 0.01235 -0.001750 -0.003043
2: 1995-07-31 0.0012196 0.003139 -0.013095 0.01367 -0.001718 -0.003070
3: 1995-08-07 0.0010004 0.002824 -0.011394 0.01216 -0.001693 -0.003091
4: 1995-08-14 0.0009597 0.002915 -0.010892 0.01155 -0.001673 -0.002837
5: 1995-08-21 0.0024681 0.002266 -0.007827 0.01110 -0.001718 -0.002850
6: 1995-08-28 0.0020701 0.002555 -0.009190 0.01238 -0.001706 -0.002862

Attribution decomposes a prediction into the different sentiment components along a given
dimension, for example, lexicons. The sum of the individual sentiment attributions per date,
the constant, and other non–sentiment measures should thus be equal to the prediction.
Indeed, the piece of code below shows that the difference between the prediction and the
summed attributions plus the constant is equal to zero throughout.

R> X <- as.matrix(get_measures(sentMeas)[, -1])


R> yFit <- predict(fit, newx = X)
R> attrSum <- rowSums(attrFit[["lexicons"]][, -1]) + fit[["reg"]][["a0"]]
R> all.equal(as.numeric(yFit), attrSum)

[1] TRUE


4. Application to predicting the VIX index


A substantial body of finance research has documented the impact of sentiment expressed
through various corpora on stock returns and trading volume, including Heston and Sinha
(2017), Jegadeesh and Wu (2013), Tetlock et al. (2008), Tetlock (2007), and Antweiler and
Frank (2004). Caporin and Poli (2017) create lexicon–based news measures to improve daily
realized volatility forecasts. Manela and Moreira (2017) explicitly construct a news–based
measure closely related to the VIX index and a good proxy for uncertainty. A more widely used
proxy for uncertainty is the Economic Policy Uncertainty (EPU) index (Baker et al. 2016).
This indicator is a normalized text–based index of the number of news articles discussing
economic policy uncertainty, from ten large U.S. newspapers. A relationship between political
uncertainty and market volatility is found by Pástor and Veronesi (2013).
The VIX measures the annualized option–implied volatility on the S&P 500 stock market
index over the next 30 days. It is natural to expect that media sentiment and political
uncertainty partly explain the expected volatility measured by the VIX index. In this section,
we test this using the EPU index and sentiment variables constructed from the usnews dataset.
We analyze whether our textual sentiment approach is more helpful than the EPU index in an out–
of–sample exercise of predicting the end–of–month VIX index six months ahead. The prediction
specifications we are interested in are summarized as follows:

M_s :   VIX_{u+h} = δ + γ VIX_u + β^⊤ s_u + ε_{u+h},
M_epu : VIX_{u+h} = δ + γ VIX_u + β EPU_{u−1} + ε_{u+h},
M_ar :  VIX_{u+h} = δ + γ VIX_u + ε_{u+h}.

The target variable VIX_u is the most recent available end–of–month daily VIX index value.
We run the predictive analysis for h = 6 months. The sentiment time series are in s_u and
define the sentiment–based model (M_s). As primary benchmark, we exchange the sentiment
variables for EPU_{u−1}, the level of the U.S. economic policy uncertainty index in month
u − 1, which we know is fully available by month u (M_epu). We also consider a simple first–order
autoregressive specification (M_ar).
We use the built–in U.S. news corpus of 4145 documents, in the uscorpus object.
Likewise, we proceed with the nine lexicons and the valence shifters list from the lex object
used in previous examples. To infer textual features from scratch, we use a structural topic
modelling approach as implemented by the R package stm (Roberts et al. 2018). This is
a prime example of integrating a distinct text mining workflow with our textual sentiment
analysis workflow. The stm package takes a quanteda document–term matrix as input.
The cleaning of the document–term matrix is fairly standard, we keep the default
parameters of the stm() function, and we group the corpus into eight topics that serve as features.

R> dfm <- quanteda::dfm(uscorpus, tolower = TRUE, remove_punct = TRUE,
+ remove_numbers = TRUE, remove = quanteda::stopwords("en")) %>%
+ quanteda::dfm_remove(min_nchar = 3) %>%
+ quanteda::dfm_trim(min_termfreq = 0.95, termfreq_type = "quantile") %>%
+ quanteda::dfm_trim(max_docfreq = 0.15, docfreq_type = "prop")
R> dfm <- quanteda::dfm_subset(dfm, quanteda::ntoken(dfm) > 0)
R> topicModel <- stm::stm(dfm, K = 8, verbose = FALSE)


We then define the keywords as the five most statistically representative terms for each topic.
They are assembled in keywords as a list.

R> topTerms <- t(stm::labelTopics(topicModel, n = 5)[["prob"]])


R> keywords <- lapply(1:ncol(topTerms), function(i) topTerms[, i])
R> names(keywords) <- paste0("TOPIC_", 1:length(keywords))

We employ these keywords to generate the features, based on the occurrence of the keywords in
each document, after having deleted all current features. The feature values are scaled between 0 and 1
(do.binary = FALSE). Alternatively, one could use the predicted topic per text as a feature
and set do.binary = TRUE, to avoid documents sharing mutual features, instead of relying
on the generated keywords. We observe a relatively even distribution of the corpus across the
generated features.

R> uscorpus <- add_features(uscorpus, keywords = keywords,
+ do.binary = FALSE, do.regex = FALSE)
R> quanteda::docvars(uscorpus, c("uncertainty", "election",
+ "economy", "noneconomy", "wsj", "wapo")) <- NULL
R> colSums(quanteda::docvars(uscorpus)[, -1] != 0)

TOPIC_1 TOPIC_2 TOPIC_3 TOPIC_4 TOPIC_5 TOPIC_6 TOPIC_7 TOPIC_8
1389 1079 1261 648 994 1005 856 1041

The frequency of our target variable is monthly, yet we aggregate the sentiment time series
on a daily level and then across time using a lag of about nine months (lag = 270). While
the lag is substantial, we also include six different Beta time weighting schemes to capture
time dynamics other than a slowly moving trend.

R> ctrAggPred <- ctr_agg(howWithin = "proportionalPol",
+ howDocs = "equal_weight", howTime = "beta",
+ by = "day", fill = "latest", lag = 270, aBeta = 1:3, bBeta = 1:2)
R> sentMeasPred <- sento_measures(uscorpus, lexicons = lex, ctr = ctrAggPred)
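To build intuition for these Beta weighting schemes, the sketch below (our own illustration, not package code) computes normalized Beta–density time weights over the 270–day lag for the six (a, b) combinations implied by aBeta = 1:3 and bBeta = 1:2; the exact weights used internally by sentometrics may be standardized slightly differently.

R> lag <- 270
R> grid <- expand.grid(a = 1:3, b = 1:2)
R> betaWeights <- apply(grid, 1, function(p) {
+ w <- dbeta((1:lag) / lag, p["a"], p["b"])  # Beta density on the rescaled lag window
+ w / sum(w)  # normalize so the weights of each scheme sum to one
+ })
R> colnames(betaWeights) <- paste0("a", grid[["a"]], "b", grid[["b"]])
R> matplot(betaWeights, type = "l", lty = 1, xlab = "Lag (days)", ylab = "Weight")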

In Figure 3, we see sufficiently differing average time series patterns for all topics. The drop in
sentiment during the recent financial crisis is apparent, but the recovery varies along the features.
The package includes the EPU index as the dataset epu. We take the EPU index values
one month before all other variables, using the lubridate package (Grolemund and Wickham
2011) for this operation. We ensure that the length of the dependent variable is equal to the
number of observations in the sentiment measures by selecting on the proper monthly
dates. The pre–processed VIX data is represented by the vix variable.15

R> data("epu", package = "sentometrics")


R> sentMeasIn <- measures_subset(sentMeasPred, date %in% vix$date)
R> datesIn <- get_dates(sentMeasIn)
15
The VIX data is retrieved from https://fred.stlouisfed.org/series/VIXCLS.



Figure 3: Textual sentiment time series across latent topic features.

R> datesEPU <- lubridate::floor_date(datesIn, "month") %m-% months(1)


R> xEPU <- epu[epu$date %in% datesEPU, "index"]
R> y <- vix[vix$date %in% datesIn, "value"]
R> x <- data.frame(lag = y, epu = xEPU)

We apply the iterative rolling forward analysis (do.iter = TRUE) for our 6–month prediction
horizon. The target variable is aligned with the sentiment measures in sentMeasIn, such
that h = 6 in the modelling control means forecasting the monthly averaged VIX value in six
months. The calibration of the sparse linear regression is based on a Bayesian–like information
criterion (type = "BIC") proposed by Tibshirani and Taylor (2012). We configure a sample
size of M = 60 months for a total sample of N = 232 observations. Our out–of–sample setup
is nonoverlapping; oos = h - 1 means that for an in–sample estimation at time u, the last
available explanatory variable dates from time u − h, and the out–of–sample prediction is
performed at time u as well, not at time u − h + 1. We consider a range of alpha values that
allows any of the Ridge, LASSO and pure elastic net regularization objective functions.16

R> h <- 6
R> oos <- h - 1
R> M <- 60
16
When a sentiment measure is a duplicate of another, or when at least 50% of the series observations
are equal to zero, it is automatically discarded from the analysis. Discarded measures are put under the
"discarded" element of a sentomodel object.


R> ctrIter <- ctr_model(model = "gaussian",
+ type = "BIC", h = h, alphas = c(0, 0.1, 0.3, 0.5, 0.7, 0.9, 1),
+ do.iter = TRUE, oos = oos, nSample = M, nCore = 1)
R> out <- sento_model(sentMeasIn, x = x[, "lag", drop = FALSE], y = y,
+ ctr = ctrIter)
R> summary(out)

Model specification
- - - - - - - - - - - - - - - - - - - -

Model type: gaussian
Calibration: via BIC information criterion
Sample size: 60
Total number of iterations/predictions: 161
Optimal average elastic net alpha parameter: 0.9
Optimal average elastic net lambda parameter: 3.53

Out-of-sample performance
- - - - - - - - - - - - - - - - - - - -

Mean directional accuracy: 53.75 %
Root mean squared prediction error: 10.07
Mean absolute deviation: 7.73

The output of the sento_model() call is a sentomodeliter object. Below we replicate the
analysis for the benchmark regressions, without the sentiment variables. The vector preds
assembles the out–of–sample predictions for model M_epu, and the vector predsAR those for model M_ar.

R> preds <- predsAR <- rep(NA, nrow(out[["performance"]]$raw))


R> yTarget <- y[-(1:h)]
R> xx <- x[-tail(1:nrow(x), h), ]
R> for (i in 1:(length(preds))) {
+ j <- i + M
+ data <- data.frame(y = yTarget[i:(j - 1)], xx[i:(j - 1), ])
+ reg <- lm(y ~ ., data = data)
+ preds[i] <- predict(reg, xx[j + oos, ])
+ regAR <- lm(y ~ ., data = data[, c("y", "lag")])
+ predsAR[i] <- predict(regAR, xx[j + oos, "lag", drop = FALSE])
+ }

A more detailed view of the different performance measures, in this case directional accuracy,
root mean squared and absolute errors, is obtained via out[["performance"]]. A list of the
individual sentomodel objects is found under out[["models"]]. A simple plot to visualize
the out–of–sample fit of any sentomodeliter object can be produced using plot(). We
display in Figure 4 the realized values and the different predictions.


R> plot(out) +
+ geom_line(data = melt(data.table(date = names(out$models), "M-epu" =
+ preds, "M-ar" = predsAR, check.names = FALSE), id.vars = "date"))


Figure 4: Realized six–month ahead VIX_{u+6} index values (purple, realized) and out–of–
sample predictions from models M_s (blue, prediction), M_ar (red), and M_epu (green).

Table 2 reports two common out–of–sample prediction performance measures, decomposed
into a pre–crisis period (spanning up to June 2007, from the point of view of the prediction
date), a crisis period (spanning up to December 2009), and a post–crisis period. It appears
that sentiment adds predictive power during the crisis period. The flexibility of the elastic
net prevents the predictive power from being too seriously compromised when sentiment is added
to the regression, even when it has no added value.

              Full                   Pre-crisis              Crisis                  Post-crisis
         M_s   M_epu   M_ar     M_s   M_epu   M_ar     M_s    M_epu   M_ar     M_s   M_epu   M_ar
RMSE   10.07   10.42  10.72    7.00    6.64   6.52   16.55    19.23  19.45    8.94    7.43   8.42
MAD     7.73    7.46   7.93    5.97    5.78   5.60   12.90    14.23  14.55    7.35    6.09   7.54

Table 2: Performance measures. The root mean squared error (RMSE) is computed as
√(∑_{i=1}^{N_oos} e_i^2 / N_oos), and the mean absolute deviation (MAD) as ∑_{i=1}^{N_oos} |e_i| / N_oos, with N_oos the
number of out–of–sample predictions and e_i the prediction error. The sentiment–based model
is M_s, the EPU–based model is M_epu, and the autoregressive model is M_ar.
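As a simple check on the formulas in the caption of Table 2, both measures can be recomputed from any vector of realized values and corresponding predictions; the helper below and the vector realized are illustrative placeholders, not part of the package.

R> perf <- function(realized, predicted) {
+ e <- realized - predicted  # out-of-sample prediction errors
+ c(RMSE = sqrt(mean(e^2)), MAD = mean(abs(e)))
+ }
R> # e.g., perf(realized, preds) for M_epu, or perf(realized, predsAR) for M_ar,
R> # where 'realized' holds the realized out-of-sample VIX values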


The final step is the post–modelling attribution analysis. For a sentomodeliter object,
the attributions() function generates attributions for all out–of–sample dates. To study the
evolution of the prediction attribution, the attributions can be plotted using plot(). This can be
done according to any of the dimensions, except for individual documents. The attributions
are displayed stacked on top of each other, per date. The y–axis represents the attribution
to the prediction of the target variable. Figure 5 shows two types of attributions in separate
panels. The third topic was most impactful during the crisis, and the eighth topic received
the most positive post–crisis weight. Likewise, the lexicons attribution conveys an increasing
influence of the SO–CAL lexicon on the predictions. Finally, it can be concluded that the
predictive role of sentiment is least present before the crisis.
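A minimal sketch of how the panels in Figure 5 could be generated is given below; the group argument of the plot() method is our assumption based on the package documentation and may differ across versions.

R> attrOut <- attributions(out, sentMeasIn)  # attributions for all out-of-sample dates
R> plot(attrOut, group = "features")  # top panel of Figure 5
R> plot(attrOut, group = "lexicons")  # bottom panel of Figure 5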


Figure 5: Attribution to features (top) and lexicons (below).

This illustration shows that the sentometrics package provides useful insights when predicting
variables such as the VIX index starting from a corpus of texts. Results could be improved by
expanding the corpus, or by optimizing the feature generation. For a larger application of the
entire workflow, we refer to Ardia et al. (2018). They find that the incorporation of textual
sentiment indices results in better predictions of the U.S. industrial production growth rate
compared to using a panel of typical macroeconomic indices only.

5. Conclusion and future development

The R package sentometrics provides a framework to calculate sentiment for texts, to aggregate
textual sentiment scores into many time series at a desired frequency, and to use these in
a flexible prediction modelling setup. It can be deployed to relate a corpus of texts, or qualitative
data, to a quantitative target variable, and to retrieve which type of sentiment is most
informative through visualization and attribution analysis. The main priorities for further
development are integrating better prediction tools, enhancing the complexity of the sentiment
engine, allowing user–defined weighting schemes, and adding intra–day aggregation.
If you use R or sentometrics, please cite the software in publications. In case of the latter,
use citation("sentometrics").

Computational details
The results in this paper were obtained using R 3.5.1 (R Core Team 2018), sentometrics
version 0.5.6, and the underlying or used packages caret version 6.0.81 (Kuhn 2018), data.table
version 1.11.8 (Dowle and Srinivasan 2018), foreach version 1.4.4 (Weston 2017), ggplot2
version 3.1.0 (Wickham 2016), glmnet version 2.0.16 (Friedman et al. 2010), gridExtra
version 2.3.0 (Auguie 2017), ISOweek version 0.6.2 (Block 2011), lexicon version 1.1.3 (Rinker
2018a), lubridate version 1.7.4 (Grolemund and Wickham 2011), quanteda version 1.13.14
(Benoit et al. 2018), Rcpp version 1.0.0 (Eddelbuettel and Francois 2011), RcppArmadillo
version 0.9.200.5.0 (Eddelbuettel and Sanderson 2014), RcppRoll version 0.3.0 (Ushey 2018),
RcppParallel version 4.4.2 (Allaire et al. 2018), stm version 1.3.3 (Roberts et al. 2018), and
stringi version 1.2.4 (Gagolewski 2018). Computations were performed on a Windows 10
Pro machine, x86_64-w64-mingw32/x64 (64-bit), with an Intel(R) Core(TM) i7-3632QM CPU
@ 2.20 GHz. The code used in the main paper is available in the R script run_vignette.R
located in the examples folder on the dedicated sentometrics GitHub repository at
https://github.com/sborms/sentometrics.
The timings comparison of the various sentiment computation functions can be replicated
using the R script run_timings.R, available in the aforementioned GitHub repository. To
generate the results, we have also used the packages dplyr version 0.7.8 (Wickham et al. 2018),
meanr version 0.1.1 (Schmidt 2017), microbenchmark version 1.4.6 (Mersmann 2018),
SentimentAnalysis version 1.3.2 (Feuerriegel and Proellochs 2018), syuzhet version 1.0.4 (Jockers
2017), tidytext version 0.2.0 (Silge and Robinson 2016), and tidyr version 0.8.2 (Wickham
and Henry 2018).
R, sentometrics and all other packages are available from CRAN at http://CRAN.R-project.org/.
The version under development is available on our GitHub repository as well.

Acknowledgments
We thank Andres Algaba, Nabil Bouamara, Peter Carl, Leopoldo Catania, Thomas Chuffart,
Dries Cornilly, Serge Darolles, William Doehler, Arnaud Dufays, Matteo Ghilotti, Kurt
Hornik, Siem Jan Koopman, Brian Peterson, Laura Rossetti, Tobias Setz, Majeed Siman,
Stefan Theussl, Wouter Torsin, and participants at the CFE (London, 2017), eRum (Budapest,
2018), R/Finance (Chicago, 2018), SwissText (Winterthur, 2018), SoFiE (Brussels, 2018),
“Data Science in Finance with R” (Vienna, 2018), “New Challenges for Central Bank
Communication” (Brussels, 2018), and (EC)^2 (Rome, 2018) conferences for helpful comments.
We acknowledge Google Summer of Code 2017 for their financial support.


References

Allaire J, Francois R, Ushey K, Vandenbrouck G, Geelnard M, Intel (2018). RcppParallel:


Parallel Programming Tools for Rcpp. R Package Version 4.4.2, URL https://CRAN.
R-project.org/package=RcppParallel.

Antweiler W, Frank M (2004). “Is All That Talk Just Noise? The Information Content of
Internet Stock Message Boards.” Journal of Finance, 59(3), 1259–1294. doi:10.1111/j.
1540-6261.2004.00662.x.

Ardia D, Bluteau K, Boudt K (2018). “Questioning the News about Economic Growth: Sparse
Forecasting Using Thousands of News-Based Sentiment Values.” International Journal of
Forecasting, forthcoming. doi:10.2139/ssrn.2976084.

Arnold T (2017). “A Tidy Data Model for Natural Language Processing Using cleanNLP.”
The R Journal, 9(2), 1–20. URL https://journal.r-project.org/archive/2017/
RJ-2017-035/RJ-2017-035.pdf.

Auguie B (2017). gridExtra: Miscellaneous Functions for ”Grid” Graphics. R package version
2.3, URL https://CRAN.R-project.org/package=gridExtra.

Baccianella S, Esuli A, Sebastiani F (2010). “SentiWordNet 3.0: An Enhanced Lexical Re-


source for Sentiment Analysis and Opinion Mining.” International Conference on Language
Resources and Evaluation. URL http://nmis.isti.cnr.it/sebastiani/Publications/
LREC10.pdf.

Baker S, Bloom N, Davis S (2016). “Measuring Economic Policy Uncertainty.” The Quarterly
Journal of Economics, 131(4), 1593–1636. doi:10.1093/qje/qjw024.

Beckers B, Kholodilin K, Ulbricht D (2017). “Reading Between the Lines: Using Media
to Improve German Inflation Forecasts.” DIW Berlin Discussion Paper No. 1665. doi:
10.2139/ssrn.2970466.

Benoit K, Watanabe K, Wang H, Nulty P, Obeng A, Müller S, Matsuo A (2018). “quanteda:


An R Package for the Quantitative Analysis of Textual Data.” Journal of Open Source
Software, 3(30), 774. doi:10.21105/joss.00774.

Block U (2011). ISOweek: Week of the Year and Weekday According to ISO 8601. R Package
Version 0.6.2, URL https://CRAN.R-project.org/package=ISOweek.

Cambria E, Poria S, Bajpai R, Schuller B (2016). “SenticNet 4: A Semantic Resource for


Sentiment Analysis Based on Conceptual Primitives.” COLING, pp. 2666–2677.

Caporin M, Poli F (2017). “Building News Measures from Textual Data and an Application
to Volatility Forecasting.” Econometrics, 5(3), 35. doi:10.3390/econometrics5030035.

Catania L, Bernardi M (2017). MCS: Model Confidence Set Procedure. R Package Version
0.1.3, URL https://CRAN.R-project.org/package=MCS.


Ceron E, Curini L, Iacus S, Porro G (2014). “Every Tweet Counts? How Sentiment Analysis
of Social Media Can Improve Our Knowledge of Citizens’ Political Preferences with an
Application to Italy and France.” New Media & Society, 16(2), 340–358. doi:10.1177/
1461444813480466.

Dowle M, Srinivasan A (2018). data.table: Extension of ‘data.frame’. R Package Version


1.11.8, URL https://CRAN.R-project.org/package=data.table.

Eddelbuettel D, Francois R (2011). “Rcpp: Seamless R and C++ Integration.” Journal of


Statistical Software, 40(8), 1–18. doi:10.18637/jss.v040.i08.

Eddelbuettel D, Sanderson C (2014). “RcppArmadillo: Accelerating R with high–performance


C++ linear algebra.” Computational Statistics and Data Analysis, 71(March), 1054–1063.
doi:10.1016/j.csda.2013.02.005.

Feinerer I, Hornik K, Meyer D (2008). “Text Mining Infrastructure in R.” Journal of Statistical
Software, 22(5), 1–54. doi:10.18637/jss.v025.i05.

Feuerriegel S, Proellochs N (2018). SentimentAnalysis: Dictionary–Based Sentiment


Analysis. R Package Version 1.3.2, URL https://CRAN.R-project.org/package=
SentimentAnalysis.

Friedman J, Hastie T, Tibshirani R (2010). “Regularization Paths for Generalized Linear


Models via Coordinate Descent.” Journal of Statistical Software, 33(1), 1–22. doi:10.
18637/jss.v033.i01.

Gagolewski M (2018). stringi: Character String Processing Facilities. R Package Version


1.2.4, URL https://CRAN.R-project.org/package=stringi.

Gentzkow M, Kelly B, Taddy M (2018). “Text as Data.” Journal of Economic Literature,


forthcoming. doi:10.2139/ssrn.2934001.

Ghysels E, Sinko A, Valkanov R (2007). “MIDAS Regressions: Further Results and New
Directions.” Econometrics Review, 26, 53–90. doi:10.1080/07474930600972467.

Grolemund G, Wickham H (2011). “Dates and Times Made Easy with lubridate.” Journal of
Statistical Software, 40(3), 1–25. doi:10.18637/jss.v040.i03.

Hansen P, Lunde A, Nason J (2011). “The Model Confidence Set.” Econometrica, 79, 453–497.
doi:10.3982/ECTA5771.

Henry E (2008). “Are Investors Influenced by How Earnings Press Releases are Written?”
Journal of Business Communication, 45(4), 363–407. doi:10.1177/0021943608319388.

Heston S, Sinha N (2017). “News vs. Sentiment: Predicting Stock Returns from News Stories.”
Financial Analysts Journal, 73(3), 67–83. doi:10.2469/faj.v73.n3.3.

Hoerl A, Kennard R (1970). “Ridge Regression: Biased Estimation for Nonorthogonal Prob-
lems.” Technometrics, 12, 55–67. doi:10.1080/00401706.1970.10488634.

Hornik K (2016). openNLP: Apache OpenNLP Tools Interface. R Package Version 0.2.6,
URL https://CRAN.R-project.org/package=openNLP.


Hu M, Liu B (2004). “Mining and Summarizing Customer Reviews.” Proceedings of the


ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. doi:
10.1145/1014052.1014073.

Jegadeesh N, Wu D (2013). “Word Power: A New Approach for Content Analysis.” Journal
of Financial Economics, 110(3), 712–729. doi:10.1016/j.jfineco.2013.08.018.

Jockers M (2017). syuzhet: Extract Sentiment and Plot Arcs from Text. R Package Version
1.0.4, URL https://CRAN.R-project.org/package=syuzhet.

Kuhn M (2018). caret: Classification and Regression Training. R Package Version 6.0-81,
URL https://CRAN.R-project.org/package=caret.

Loughran T, McDonald B (2011). “When is a Liability not a Liability? Textual Analysis,


Dictionaries, and 10-Ks.” Journal of Finance, 66(1), 35–65. doi:10.1111/j.1540-6261.
2010.01625.x.

Manela A, Moreira A (2017). “News Implied Volatility and Disaster Concerns.” Journal of
Financial Economics, 123(1), 137–162. doi:10.1016/j.jfineco.2016.01.032.

Mersmann O (2018). microbenchmark: Accurate Timing Functions. R Package Version


1.4.6, URL https://CRAN.R-project.org/package=microbenchmark.

Mohammad S, Turney P (2010). “Emotions Evoked by Common Words and Phrases: Using
Mechanical Turk to Create an Emotion Lexicon.” Proceeding of Workshop on Computational
Approaches to Analysis and Generation of Emotion in Text, pp. 26–34. URL http://dl.
acm.org/citation.cfm?id=1860631.1860635.

Pástor L, Veronesi P (2013). “Political Uncertainty and Risk Premia.” Journal of Financial
Economics, 110(3), 520–545. doi:10.1016/j.jfineco.2013.08.007.

R Core Team (2018). R: A Language and Environment for Statistical Computing. R


Foundation for Statistical Computing, Vienna, Austria. R version 3.5.1, URL https:
//www.R-project.org/.

Rinker T (2017). qdap: Quantitative Discourse Analysis Package. R Package Version 2.3.0,
URL https://CRAN.R-project.org/package=qdap.

Rinker T (2018a). lexicon: Lexicon Data. R Package Version 1.1.3, URL https://CRAN.
R-project.org/package=lexicon.

Rinker T (2018b). sentimentr: Calculate Text Polarity Sentiment. R Package Version 2.6.1,
URL https://CRAN.R-project.org/package=sentimentr.

Roberts M, Stewart B, Tingley D (2018). “stm: R Package for Structural Topic Models.”
Journal of Statistical Software, forthcoming.

Schmidt D (2017). meanr: Sentiment Analysis Scorer. R Package Version 0.1.1, URL https:
//CRAN.R-project.org/package=meanr.

Silge J, Robinson D (2016). “tidytext: Text Mining and Analysis Using Tidy Data Principles
in R.” Journal of Open Source Software, 1(3). doi:10.21105/joss.00037.


Taboada M, Brooke J, Tofiloski M, Voll K, Stede M (2011). “Lexicon–Based Methods for


Sentiment Analysis.” Computational Linguistics, 37(2), 267–307. doi:10.1162/COLI_a_
00049.

Tetlock P (2007). “Giving Content to Investor Sentiment: The Role of Media in the Stock
Market.” Journal of Finance, 62(3), 1139–1168. doi:10.1111/j.1540-6261.2007.01232.
x.

Tetlock P, Saar-Tsechansky M, Macskassy S (2008). “More Than Words: Quantifying


Language to Measure Firms’ Fundamentals.” Journal of Finance, 63, 1437–1467. doi:
10.1111/j.1540-6261.2008.01362.x.

Tibshirani R (1996). “Regression Shrinkage and Selection via the LASSO.” Journal of
the Royal Statistical Society B, 58(1), 267–288. URL https://www.jstor.org/stable/
2346178.

Tibshirani R, Taylor J (2012). “Degrees of Freedom in LASSO Problems.” Annals of Statistics,


4, 1198–1232. doi:10.1214/12-AOS1003.

Ushey K (2018). RcppRoll: Efficient Rolling / Windowed Operations. R package version


0.3.0, URL https://CRAN.R-project.org/package=RcppRoll.

Weston S (2017). foreach: Provides Foreach Looping Construct for R. R Package Version
1.4.4, URL https://CRAN.R-project.org/package=foreach.

Wickham H (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.
ISBN 978-3-319-24277-4. URL http://ggplot2.org.

Wickham H, François R, Henry L, Müller K (2018). dplyr: A Grammar of Data Manipulation.


R package version 0.7.8, URL https://CRAN.R-project.org/package=dplyr.

Wickham H, Henry L (2018). tidyr: Easily Tidy Data with ‘spread()’ and ‘gather()’ Func-
tions. R package version 0.8.2, URL https://CRAN.R-project.org/package=tidyr.

Zou H, Hastie T (2005). “Regularization and Variable Selection via the Elastic Net.” Journal
of the Royal Statistical Society B, 67, 301–320. doi:10.1111/j.1467-9868.2005.00503.x.


A. Speed comparison of lexicon–based sentiment analysis


This appendix provides an illustrative comparison of the computation time of various lexicon–
based sentiment analysis tools in R. The core of the sentiment computation in sentometrics
is implemented in C++ through Rcpp (Eddelbuettel and Francois 2011). We compare the
speed of our computation with the R packages meanr (Schmidt 2017), SentimentAnalysis
(Feuerriegel and Proellochs 2018), syuzhet (Jockers 2017), quanteda, and tidytext. The first
three of these five packages have dedicated sentiment functions. The quanteda and tidytext
packages have no explicit sentiment computation function; one needs to be constructed first,
based on their respective toolsets. This is an entry barrier for less–experienced programmers.
The SentimentAnalysis package uses the tm package as a backend and internally relies on a
calculation similar to tm's tm_term_score() function. The sentimentr package is not part of the
exercise because it proved to be vastly slower than all others, which was anticipated as it aims
to handle more difficult linguistic edge cases.
We perform two analyses. Sentiment is computed for 1000, 5000, 10000, 25000, 50000, 75000
and 100000 texts, and the average execution time in seconds across five repetitions, measured
with the microbenchmark package (Mersmann 2018), is shown in Table 3.
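The structure of such a timing exercise, for a single corpus size and a single lexicon setup, is sketched below; this is a simplified stand–in for the full run_timings.R script, and the exact calls and packages compared there may differ.

R> library("microbenchmark")
R> txts <- quanteda::texts(uscorpus)[1:1000]  # plain character vector of the first 1000 texts
R> timings <- microbenchmark(
+ sentometrics = compute_sentiment(txts, lex, how = "counts"),
+ meanr = meanr::score(txts),
+ times = 5)
R> summary(timings)[, c("expr", "mean")]  # average execution time per implementation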

Panel A: Average execution time of the sentiment computation for one lexicon

                      sentometrics
Texts     unigrams  bigrams  clusters   meanr  SentimentAnalysis  syuzhet  quanteda  tidytext
1000          0.21     0.19      0.20    0.11               2.03     0.67      0.76      0.31
5000          0.92     0.86      0.89    0.47               7.73     2.57      2.26      0.80
10000         1.54     1.53      1.56    0.84              15.49     5.01      3.03      1.46
25000         3.76     3.92      4.06    2.10              38.46    12.06      6.47      3.61
50000         8.30     8.11      8.39    4.50              79.39    26.60     12.86      7.58
75000        12.71    12.51     12.98    6.70             121.54    37.07     18.32     11.25
100000       17.81    17.95     17.69    8.51                  —    48.52     24.76     13.92

Panel B: Average execution time of the sentiment computation for many lexicons

                               sentometrics                                       tidytext
Texts     unigrams  unigrams, feats.  bigrams  clusters  clusters, parallel  unigrams  bigrams
1000          0.26              0.37     1.14      0.29                0.94      0.26     1.30
5000          0.96              1.15     0.98      1.00                0.77      0.83     3.38
10000         1.86              1.92     1.90      1.91                1.51      1.56     3.38
25000         4.72              4.81     4.78      4.82                3.74      4.08     6.76
50000         9.65              9.91     9.73      9.79                7.63      7.75    33.24
75000        15.37             16.24    16.82     13.20               12.69     12.72    49.82
100000       23.81             24.43    24.92     21.43               17.17     18.02    68.03

Table 3: Average computation time (in seconds) of various lexicon–based sentiment facilities
in R. All implementations consider the Hu & Liu lexicon (Panel A), or the nine lexicons
specified in the lex object (Panel B). Some implementations do not integrate valence shifters
(unigrams), others do from a bigrams perspective (bigrams) or from a clusters perspective
(clusters). A bar ‘—’ indicates the calculation produced an error.


The first analysis (Panel A) benchmarks these implementations against three approaches using
the compute_sentiment() function from sentometrics: one without valence shifters, one
with valence shifters integrated from a bigrams perspective, and one with valence shifters
integrated from a clusters perspective. The number of threads for parallel computation is set
to one where appropriate. All other algorithms are run with a version of the Hu & Liu
lexicon (about 6600 single words). The computations are counts–based and constructed so
as to give the same output across all packages for a binary lexicon, if the tokenization is the
same. For example, the sentometrics and tidytext implementations give identical results.
The meanr implementation comes out fastest because everything is written in the C programming
language. Yet, it offers neither the flexibility to define the input lexicon nor the scale on which
the scores are returned. At the other end of the spectrum, amongst these approaches, the
SentimentAnalysis and syuzhet packages are slowest. The latter package also does not offer the
flexibility of adding sentiment lexicons other than those available in the package. SentimentAnalysis
becomes problematic for the largest corpus due to a memory allocation error.
The quanteda package is fast, but slower than the sentometrics and tidytext implementations.
The tidytext package is faster, particularly for the two largest corpus sizes.
The second analysis (Panel B) compares the computation time with multiple lexicons as input.
The comparison is against the tidytext package, for a unigram and a bigrams implementation.
The lexicons are those defined in lex, nine in total. For the clusters approach, we also look at
its parallelized version, using eight cores (see the ‘clusters, parallel’ column). For the unigrams
approach in sentometrics, we also assess the additional time it takes to spread out sentiment
across features (see the ‘unigrams, feats.’ column).
The tidytext package is, in general, faster for many lexicons as well. The differences are nonetheless
not large, and running any sentometrics computation in parallel would make the speed
differentials disappear. However, the bigrams calculation using tidytext is markedly slower.
With sentometrics, the speed of the computation is comparable across all types of sentiment
calculation. The tidytext framework thus copes less well with added complexity.
Overall, the sentometrics package offers an off–the–shelf yet flexible sentiment calculator that
is computationally efficient: it is fast in itself, and its speed is largely independent of the decision
whether, and how, to integrate valence shifters, as well as of the number of input lexicons.

Affiliation:
David Ardia
Institute of Financial Analysis
University of Neuchâtel, Switzerland
E–mail: david.ardia@unine.ch

Keven Bluteau
Institute of Financial Analysis
University of Neuchâtel, Switzerland
& Solvay Business School
Vrije Universiteit Brussel, Belgium
E–mail: keven.bluteau@unine.ch


Samuel Borms (corresponding author)


Institute of Financial Analysis
University of Neuchâtel, Switzerland
& Solvay Business School
Vrije Universiteit Brussel, Belgium
E–mail: samuel.borms@unine.ch

Kris Boudt
Solvay Business School
Vrije Universiteit Brussel, Belgium
& School of Business and Economics
Vrije Universiteit Amsterdam, The Netherlands
E–mail: kris.boudt@vub.be
