Abstract
We provide a hands–on introduction to optimized textual sentiment indexation using
the R package sentometrics. Driven by the need to unlock the potential of textual data,
sentiment analysis is increasingly used to capture its information value. The sentometrics
package implements an intuitive framework to efficiently compute sentiment scores of
numerous texts, to aggregate the scores into multiple time series, and to use these time
series to predict other variables. The workflow of the package is illustrated with a built–in
corpus of news articles from two major U.S. journals to forecast the VIX index.
1. Introduction
Individuals, firms and governments continuously consume written information from various
sources to improve their decisions. The consensus view in the literature is that exploiting
this qualitative information on top of quantitative data provides valuable insights on many
future dynamics related to firms, the economy, political agendas, and marketing campaigns, to
name a few. Numbers reflect ex–post information about the past, while texts may reflect the
feelings of both author and readers about the past and the future. News media is considered
a good and stable reflection of the concerns of readers, as suggested by Tetlock (2007). The
sentiment transmitted through texts, called textual sentiment, is therefore often regarded as
a prime driver of many variables. Assigning sentiment scores to represent the polarity of
a text is, following data retrieval, the first step in the analysis of relating texts, qualitative
by nature, to quantitative target variables. There has been a steady growth of the amount
of research that takes the approach of transforming texts into quantitative variables using
sentiment analysis. The usage of sentiment in texts to improve inflation predictions (Beckers
et al. 2017), and the application of sentiment analysis to track political preferences (Ceron
et al. 2014) are recent examples in economics and politics. Gentzkow et al. (2018) survey
various applications with textual data, including sentiment analysis.
Textual sentiment is not equally useful across all applications. To exploit the full information
potential of qualitative communications consistently, one needs to decipher when, to what
degree, and which layers of the sentiment add value.
The R package sentometrics proposes a well–defined modelling workflow, specifically applied
to studying the evolution of textual sentiment and its impact on other quantities. It addresses
the present lack of analytical capability to extract time series intelligence about the sentiment
transmitted through a large panel of texts.
The major release of the R text mining infrastructure tm (Feinerer et al. 2008) over a decade
ago can be considered the starting point of the development and popularization of textual
analysis tools in R. A number of successful follow–up attempts at improving the speed and
interface of the comprehensive natural language processing capabilities provided by tm have
been delivered by the packages openNLP (Hornik 2016), cleanNLP (Arnold 2017), quanteda
(Benoit et al. 2018), tidytext (Silge and Robinson 2016) and qdap (Rinker 2017). These efforts
have given rise to the creation of a multitude of packages oriented to deal with specific text
mining problems, including topic modelling and sentiment analysis. Many of such specific
packages (internally) rely on one of the larger above–mentioned infrastructures.
The sentometrics package similarly positions itself as both integrative and supplementary to
the powerful text mining and data science toolboxes in the R universe. It is integrative, as
it combines the strengths of quanteda and stringi (Gagolewski 2018) for corpus construction
and manipulation. It uses data.table (Dowle and Srinivasan 2018) for fast aggregation of
textual sentiment into time series, and glmnet (Friedman et al. 2010) and caret (Kuhn 2018)
for (sparse) model estimation. It is supplementary, given that it easily extends any text
mining workflow by fulfilling (one of) the objectives (i) to compute textual sentiment, (ii)
to aggregate fine–grained textual sentiment into the construction of various sentiment time
series, and (iii) to predict other variables with these sentiment measures. The combination of
these three objectives leads to a flexible and computationally efficient framework to exploit
the information value of sentiment in texts.
The remainder of the paper is structured as follows. Section 2 introduces the methodology
behind the R package sentometrics. Section 3 describes the main control functions and illus-
trates the package’s typical workflow. Section 4 applies the entire framework to forecast the
VIX index. Section 5 concludes.
The typical use cases of the R package sentometrics are the fast computation and aggregation
of textual sentiment, the subsequent time series visualization and manipulation, and the
estimation of a sentiment–based prediction model. The use case of building a prediction model
out of textual data encompasses the previous ones. We propose a workflow that consists of
five main steps, Steps 1–5, from corpus construction to model estimation. All use cases can
be addressed by following (a subset of) this workflow. The R package sentometrics takes care
of all steps in full, apart from the texts’ collection and cleaning. Table 1 overviews the key
functionalities of sentometrics and the associated functions and output objects if specifically
classed into a new S3 object.
Table 1: Taxonomy of the R package sentometrics. This table displays the functionalities of
the sentometrics package along with the associated functions and S3 output objects.
All steps are explained below. We minimize the mathematical details to clarify the exposition,
and stay close to the actual implementation. Section 3 demonstrates how to use the functions.
We assume the user has a corpus of texts at their disposal, of any size and from any source
available. Texts could be scraped from the web or retrieved from news databases. A minimal
requirement is that every text has a timestamp and an identifier. These texts should be
cleaned such that graphical and web–related elements (e.g., HTML tags) are removed. This
results in a set of documents $d_{n,t}$ for $n = 1, \ldots, N_t$ and time points $t = 1, \ldots, T$, where $N_t$ is
the total number of documents at time $t$.
Secondly, features have to be defined and mapped to the documents. Features can come
in many forms: news sources, entities (individuals, companies or countries discussed in the
texts), or text topics. The mapping implicitly permits subdividing the corpus into many
smaller groups with a common interest. Many data providers enrich their textual data with
information that can be used as features. If this is not the case, topic modelling or entity
recognition techniques are valid alternatives. Human classification or manual keyword(s) oc-
currence searches are simpler options. The extraction and inclusion of features is an important
part of the analysis and should be related to the variable that is meant to be predicted.
The texts and features have to be structured in a rectangular fashion. Every row represents
a document that is mapped to the features through numerical values $w_{n,t}^{k} \in [0, 1]$, where the
features are indexed by $k = 1, \ldots, K$. The values are indicative of the relevance of a feature
to a document. If they are binary, they indicate which documents belong to which features.
The rectangular data structure is passed to the sento_corpus() function to create a sento-
corpus object. The reason for this separate corpus structure is twofold: It controls whether
all corpus requirements for further analysis are met (such as dealing with timestamps), and
it allows performing operations on the corpus in a more structured way. If no features are
of interest to the analysis, a dummy feature valued $w_{n,t}^{k} = 1$ throughout is automatically
created. The add_features() function is used to add or generate new features, as will be
shown in the illustration. When the corpus is constructed, it is up to the user to decide which
texts have to be kept for the actual sentiment analysis.
(i) A unigrams approach. The most straightforward method, where computed sentiment
is simply a (weighted) sum of all detected word scores as they appear in the lexicon.
(ii) A valence–shifting bigrams approach. The impact of the word appearing before the
detected word is evaluated as well. A common example is ‘not good’, which under the
default approach would get a score of 1 (‘good’), but now ends up, for example, having
a score of –1 due to the presence of the negator ‘not’.
(iii) A valence–shifting clusters approach. Valence shifters can also appear in positions
other than right before a certain word. We implement this layer of complexity by
searching for valence shifters (and other sentiment–bearing words) in a cluster of at
maximum four words before and two words after a detected polarized word.
In the first two approaches, a document $d_{n,t}$ receives a sentiment score as:

$$ s_{n,t}^{\{l\}} \equiv \sum_{i=1}^{Q_d} \omega_i v_i s_{i,n,t}^{\{l\}}, \qquad (1) $$

for every lexicon $l = 1, \ldots, L$. The total number of unigrams in the document is equal to $Q_d$.
The score $s_{i,n,t}^{\{l\}}$ is the sentiment value attached to unigram $i$ from document $d_{n,t}$, based on
lexicon $l$, and equals zero when the word is not in the lexicon. The impact of a valence shifter
is represented by $v_i$, being the shifting value if unigram $i - 1$ is a valence shifter. No valence
shifter or the simple unigrams approach boils down to $v_i = 1$. When the valence shifter is a
negator, typically $v_i = -1$. The weights $\omega_i$ define the within–document aggregation.1
The third approach differs in the way it calculates the impact of valence shifters. Given a
detected polarized word, say unigram $j$, valence shifters can be identified in a surrounding
cluster of adjacent unigrams $J \equiv \{J^L, J^U\}$ around this word (irrespective of whether they
appear in the same sentence or not), where $J^L \equiv \{j-4, j-3, j-2, j-1\}$ and
$J^U \equiv \{j+1, j+2\}$. The resulting sentiment value concerning unigram $j$ and associated cluster $J$
is $s_{J,n,t}^{\{l\}} \equiv n_N \left(1 + \max\{0.80 \times (n_A - 2 n_A n - n_D), -1\}\right) \omega_j s_{j,n,t}^{\{l\}} + \sum_{m \in J^U} \omega_m s_{m,n,t}^{\{l\}}$. The number
of amplifying valence shifters is $n_A$, those that deamplify are counted by $n_D$, $n = 1$ and
$n_N = -1$ if there is an odd number of negators, else $n = 0$ and $n_N = 1$.2 The unigrams in $J$
are first searched for in the lexicon, and only when there is no match, they are searched for in
the valence shifters word list. Clusters are non–overlapping from one polarized word to the
other; if another polarized word is detected at position $j + 4$, then the cluster consists of the
unigrams $\{j+3, j+5, j+6\}$. The total sentiment score for a document $d_{n,t}$ as expressed by (1)
becomes in this case $s_{n,t}^{\{l\}} \equiv \sum_{J=1}^{C_d} s_{J,n,t}^{\{l\}}$, with $C_d$ the number of clusters. This last approach
borrows from how the R package sentimentr (Rinker 2018b) does its sentiment calculation.
Linguistic intricacies (e.g., sentence boundaries) are better handled in their package, but at
the expense of being significantly slower.
The scores obtained above are subsequently multiplied by the feature weights to spread out
the sentiment into lexicon– and feature–specific sentiment scores, as $s_{n,t}^{\{l,k\}} \equiv s_{n,t}^{\{l\}} \times w_{n,t}^{k}$, with
$k$ the index denoting the feature. If the document does not correspond to the feature, the
value of $s_{n,t}^{\{l,k\}}$ is zero.

1 The values $\omega_i$ and $v_i$ are specific to a document $d_{n,t}$, but we omit the indices $n$ and $t$ for brevity. For now,
the package only considers constant weights per document, but in a future version, weights could be made
unigram–specific.

2 Amplifying valence shifters, such as ‘very’, strengthen a polarized word. Deamplifying valence shifters,
such as ‘hardly’, downtone a polarized word. The strengthening value of 0.80 is fixed and acts, if applicable,
as a modifier of 80% on the polarized word. Negation inverses the polarity. An even number of negators
is assumed to cancel out the negation. Amplifiers are considered (but not double–counted) as deamplifiers
if there is an odd number of negators. Under this approach, for example, the occurrence of ‘not very good’
receives a more realistic score of $-0.20$ ($n = 1$, $n_N = -1$, $n_A = 1$, $n_D = 0$, $s = +1$), instead of $-1.80$, in many
cases too negative.
In sentometrics, the sento_lexicons() function is used to define the lexicons and the valence
shifters. The output is a sentolexicons object. Any provided lexicon is applied to the
corpus. The sentiment calculation is performed with compute_sentiment(). Depending on
the input type, this function outputs a data.table with all values for $s_{n,t}^{\{l,k\}}$. When the output
can be used as the basis for aggregation into time series in the next step (that is, when it
has a "date" column), it is classed as a sentiment object. The to_sentiment() function
transforms any properly structured table with sentiment scores to a sentiment object.
2.3. Aggregate the sentiment into textual sentiment time series (Step 3)
In this step, the purpose is to aggregate the individual sentiment scores and obtain various
representative time series. Two main aggregations are performed. The first, across–document,
collapses all sentiment scores across documents within the same frequency (e.g., day or month,
as defined by t) into one score. The weighted sum that does so is:
$$ s_t^{\{l,k\}} \equiv \sum_{n=1}^{N_t} \theta_n\, s_{n,t}^{\{l,k\}}. \qquad (2) $$
The weights θn define the importance of each document n at time t (for instance, based on
the length of the text). The second, across–time, smooths the newly aggregated sentiment
scores over time, as:
$$ s_u^{\{l,k,b\}} \equiv \sum_{t=\tau}^{u} b_t\, s_t^{\{l,k\}}, \qquad (3) $$
The target variable $y_{u+h}$ is often a variable to forecast, that is, $h > 0$. Let $s_u \equiv (s_u^1, \ldots, s_u^P)^\top$
encapsulate all textual sentiment variables as constructed before, and $\beta \equiv (\beta_1, \ldots, \beta_P)^\top$.
Other variables are denoted by the vector xu at time u and γ is the associated parameter
vector. Logistic regression (binomial and multinomial) is available as a generalization of the
same underlying linear structure.
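Piecing these elements together, and consistent with the elastic net problem in (5) below, the linear specification presumably reads as follows (a reconstruction based on the notation above, with $\delta$ the intercept and $\epsilon_{u+h}$ the error term):

$$ y_{u+h} = \delta + \gamma^\top x_u + \beta^\top s_u + \epsilon_{u+h}. $$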
The typical large dimensionality of P relative to the number of observations, and the potential
multicollinearity, both pose a problem to ordinary least squares (OLS) regression. Instead,
estimation and variable selection through a penalized regression relying on the elastic net reg-
ularization of Zou and Hastie (2005) is more appropriate. Regularization, in short, shrinks the
coefficients of the least informative variables towards zero. It consists of optimizing the least
squares or likelihood function including a penalty component. The elastic net optimization
problem for the specified linear regression is expressed as:
$$ \min_{\tilde{\delta}, \tilde{\gamma}, \tilde{\beta}} \left\{ \frac{1}{N} \sum_{u=\tau}^{T} \left( \tilde{y}_{u+h} - \tilde{\delta} - \tilde{\gamma}^\top \tilde{x}_u - \tilde{\beta}^\top \tilde{s}_u \right)^2 + \lambda \left( \alpha \lVert \tilde{\beta} \rVert_1 + (1 - \alpha) \lVert \tilde{\beta} \rVert_2^2 \right) \right\}. \qquad (5) $$
The tilde denotes standardized variables, and $\lVert \cdot \rVert_p$ is the $\ell_p$–norm. The standardization is
required for the regularization, but the coefficients are rescaled back once estimated. The
rescaled estimates of the model coefficients for the textual sentiment indices are in $\hat{\beta}$, usually
a sparse vector, depending on the severity of the shrinkage. The parameter $0 \le \alpha \le 1$ defines
the trade–off between the Ridge (Hoerl and Kennard 1970), $\ell_2$, and the LASSO (Tibshirani
1996), $\ell_1$, regularization, respectively for $\alpha = 0$ and $\alpha = 1$. The $\lambda \ge 0$ parameter defines
the level of regularization. When $\lambda = 0$, the problem reduces to OLS estimation. The two
parameters are calibrated in a data–driven way, such that they are optimal to the regression
equation at hand. The sentometrics package allows calibration through cross–validation, or
based on an information criterion with the degrees of freedom properly adjusted to the elastic
net context according to Tibshirani and Taylor (2012).
An analysis of typical interest is the sequential estimation of a regression model and out–of–
sample prediction. For a given sample size M < N , a regression model is estimated with the
first M observations and used to predict the one–step ahead observation M + 1 of the target
variable.3 This procedure is repeated, leading to a series of estimates. These are compared
with the realized values to assess the average out–of–sample prediction performance.
The type of model, the calibration approach, and other modelling decisions are defined via
the ctr_model() function. The (iterative) model estimation and calibration is done with
the sento_model() function that relies on the R packages glmnet and caret. The user can
define here additional (sentiment) values for prediction through the x argument. The output
is a sentomodel object (one model) or a sentomodeliter object (a collection of iteratively
estimated sentomodel objects and associated out–of–sample predictions).
A forecaster, however, is not limited to using the models provided through the sentometrics
package; (s)he is free to guide this step to whichever modelling toolbox available, continuing
with the sentiment variables computed in the previous steps.
3
We define one–step ahead as the first and single prediction a forecaster would like to obtain after an
estimation round.
In what follows, several examples show how to put the steps into practice using the sento-
metrics package. The subsequent sections illustrate the main workflow, using built–in data,
focusing on individual aspects of it. Section 3.1 studies corpus management and features
generation. Section 3.2 investigates the sentiment computation. Section 3.3 looks at the
aggregation into time series (including the control function ctr_agg()). Section 3.4 briefly
explains further manipulation of a sentomeasures object. Section 3.5 regards the modelling
setup (including the control function ctr_model()) and attribution.
R> library("sentometrics")
We demonstrate the workflow using the usnews built–in dataset, a collection of news articles
from The Wall Street Journal and The Washington Post between 1995 and 2014.4 It has
a data.frame structure and thus satisfies the requirement that the input texts have to be
structured rectangularly, with every row representing a document. The data is loaded below:

R> data("usnews", package = "sentometrics")
R> class(usnews)

[1] "data.frame"
For conversion to a sentocorpus object, the "id", "date" and "text" columns have to come
in that order. All other columns are reserved for features, of type numeric. For this particular
corpus, there are four original features. The first two indicate the news source, the latter two
the relevance of every document to the U.S. economy. The feature values $w_{n,t}^{k}$ are binary and
complementary (when "wsj" is 1, "wapo" is 0; similarly for "economy" and "noneconomy")
to subdivide the corpus to create separate time series.
To access the texts, one can simply do usnews[["texts"]] (i.e., the third column omitted
above). An example of one text is:
R> usnews[["texts"]][2029]
[1] "Dow Jones Newswires NEW YORK -- Mortgage rates rose in the past week
after Fridays employment report reinforced the perception that the economy
is on solid ground, said Freddie Mac in its weekly survey. The average for
30-year fixed mortgage rates for the week ended yesterday, rose to 5.85
from 5.79 a week earlier and 5.41 a year ago. The average for 15-year
fixed-rate mortgages this week was 5.38, up from 5.33 a week ago and the
year-ago 4.69. The rate for five-year Treasury-indexed hybrid adjustable-rate
mortgages, was 5.22, up from the previous weeks average of 5.17. There is no
historical information for last year since Freddie Mac began tracking this
mortgage rate at the start of 2005."
The built–in corpus is cleaned for non–alphanumeric characters. To put the texts and features
into a corpus structure, call the sento_corpus() function. If you have no features available,
the corpus can still be created without any feature columns in the input data.frame, but
a dummy feature called "dummyFeature" with a score of 1 for all texts is added to the
sentocorpus output object.
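As a minimal sketch (assuming the usnews data.frame loaded above, and corpusdf as the name of the input argument), the corpus construction boils down to:

R> uscorpus <- sento_corpus(corpusdf = usnews)

The uscorpus object created here is the one referred to in the remainder of the examples.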
The sento_corpus() function creates a sentocorpus object on top of the quanteda package's
corpus object. Hence, many functions from quanteda to manipulate corpora can be applied
to a sentocorpus object as well. For instance, quanteda::corpus_subset(uscorpus, date
< "2014-01-01") would limit the corpus to all articles before 2014. The presence of the
date document variable (the "date" column) and all other metadata as numeric features
valued between 0 and 1 are the two distinguishing aspects between a sentocorpus object
and any other corpus–like object in R. Having the date column is a requirement for the later
aggregation into time series. The function to_sentocorpus() transforms a quanteda corpus
object to a sentocorpus object. The quanteda corpus constructor supports conversion to
their format starting from a tm corpus.
To round off Step 1, we add two metadata features using the add_features() function. The
features uncertainty and election give a score of 1 to documents in which respectively
the word "uncertainty" or "distrust" and the specified regular expression regex appears.
Regular expressions provide flexibility to define more complex features, though can be slow
for a large corpus if too complex. Overall, this gives K = 6 features. The add_features()
function is most useful when the corpus starts off with no additional metadata, i.e., the sole
feature present is the automatically created "dummyFeature" feature.5
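A sketch of such a call is given below. The exact regular expression used for the election feature is not reproduced above, so regexElection is a simplified placeholder, and the use of the do.regex argument as a per–keyword logical reflects our reading of the add_features() interface:

R> regexElection <- "\\belection[s]?\\b"  # placeholder regular expression
R> uscorpus <- add_features(uscorpus,
+    keywords = list(uncertainty = c("uncertainty", "distrust"),
+      election = regexElection),
+    do.binary = TRUE, do.regex = c(FALSE, TRUE))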
We pack these lexicons together in a named list, and provide it to the sento_lexicons()
function, together with an English valence word list. The valenceIn argument dictates
the complexity of the sentiment analysis. If valenceIn = NULL (the default), sentiment is
computed based on the simplest unigrams approach. If valenceIn is a table with an "x"
and a "y" column, the valence–shifting bigrams approach is considered for the sentiment
calculation. The values of the "y" column are those used as vi . If the second column is
named "t", it is assumed that this column indicates the type of valence shifter for every
word, and thus it forces employing the valence–shifting clusters approach for the sentiment
calculation. Three types of valence shifters are supported for the latter method: negators
(value of 1, define n and nN ), amplifiers (value of 2, counted in nA ) and deamplifiers (value
of 3, counted in nD ).
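The code below sketches one way to assemble such a list, combining built–in lexicons from sentometrics with lexicons from the lexicon package (the exact selection of the nine lexicons is an assumption on our side), together with the English valence shifters table shipped with the package:

R> data("list_lexicons", package = "sentometrics")
R> data("list_valence_shifters", package = "sentometrics")
R> lexiconsIn <- c(
+    list_lexicons[c("LM_en", "HENRY_en", "GI_en")],
+    list(
+      NRC = lexicon::hash_sentiment_nrc,
+      HULIU = lexicon::hash_sentiment_huliu,
+      SENTIWORD = lexicon::hash_sentiment_sentiword,
+      JOCKERS = lexicon::hash_sentiment_jockers,
+      SENTICNET = lexicon::hash_sentiment_senticnet,
+      SOCAL = lexicon::hash_sentiment_socal_google))
R> lex <- sento_lexicons(lexiconsIn, valenceIn = list_valence_shifters[["en"]])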
The lex output is a sentolexicons object, a specialized list. The individual word lists
themselves are data.tables, as displayed below. Duplicates are removed, all words are put
to lowercase and only unigrams are kept.
R> lex[["HENRY_en"]]
x y
1: above 1
2: accomplish 1
3: accomplished 1
4: accomplishes 1
5: accomplishing 1
---
185: worse -1
186: worsen -1
187: worsening -1
188: worsens -1
189: worst -1
The simplest way forward is to compute sentiment scores for every text in the corpus. This is
handled by the compute_sentiment() function, which works with either a character vector,
a sentocorpus object, or a quanteda corpus object. For a character vector input, the
function returns a word count column and the computed sentiment scores for all lexicons, as
no features are involved. The compute_sentiment() function has, besides the input corpus and the
lexicons, other arguments. The main one is the how argument, to specify the within–document
aggregation. More details on the contents of these arguments are provided in Section 3.3, when
the ctr_agg() function is discussed. See below for a brief usage and output example.
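A brief, hedged usage sketch on a small character vector (the selection of the first ten texts is arbitrary):

R> sentScores <- compute_sentiment(usnews[["texts"]][1:10], lexicons = lex,
+    how = "proportional")
R> head(sentScores, 3)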
• howWithin: This argument defines how sentiment is aggregated within the same doc-
ument, setting the weights $\omega_i$ in (1). The "counts" option sets $\omega_i = 1$. For binary
lexicons and the simple unigrams matching case, this amounts to simply taking the
difference between the number of positive and negative words. One can normalize the
sentiment score by the total number of words in the document ("proportional"), or
by the total number of polarized words in the document ("proportionalPol").
• howDocs: This argument defines how sentiment is aggregated across all documents at
the same date (or frequency), that is, it sets the weights θn in (2). The time frequency
at which the time series have to be aggregated is chosen via the by argument, and
can be set to daily ("day"), weekly ("week"), monthly ("month") or yearly ("year").
The option "equal_weight" gives the same weight to every document, while the option
"proportional" gives higher weights to documents with more words, relative to the
document population at a given date. The do.ignoreZeros argument forces ignoring
documents with zero sentiment in the computation of the across–document weights. By
default these documents are overlooked. This avoids the incorporation of documents
not relevant to a particular feature (as in those cases $s_{n,t}^{\{l,k\}}$ is exactly zero, because
$w_{n,t}^{k} = 0$), which could lead to a bias of sentiment towards zero.7
7
It also ignores the documents which are relevant to a feature, but exhibit zero sentiment. This can occur
if none of the words have a polarity, or the weighted number of positive and negative words offset each other.
• howTime: This argument defines how sentiment is aggregated across dates, to smoothen
the time series and to acknowledge that sentiment at a given point is at least partly
based on sentiment and information from the past. The lag argument has the role
of τ dictating how far to go back. In the implementation, lag = 1 means no time
aggregation and thus bt = 1. The "equal_weight" option is similar to a simple weighted
moving average, "linear" and "exponential" are two options which give weights to
the observations according to a linear or an exponential curve, "almon" does so based on
Almon polynomials, and "beta" based on the Beta weighting curve from Ghysels et al.
(2007). The last three curves have respective arguments to define their shape(s), being
alphasExp, ordersAlm and do.inverseAlm, and aBeta and bBeta.8 These weighting
schemes are always normalized to unity. If desired, user–constructed weights can be
supplied via weights as a named data.frame. All the weighting schemes define the
different bt values in (3). The fill argument is of sizeable importance here. It is used
to add in dates for which not a single document was available. These added, originally
missing, dates are given a value of 0 ("zero") or the most recent value ("latest").
The option "none" accords to not filling up the date sequence at all. Adding in dates
(or not) impacts the time aggregation by respectively combining the latest consecutive
dates, or the latest available dates.
• nCore: The nCore argument can help to speed up the sentiment calculation when dealing
with a large corpus. It expects a positive integer passed on to the setThreadOptions()
function from the RcppParallel (Allaire et al. 2018) package, and parallelizes the senti-
ment computation across texts. By default, nCore = 1, which indicates no paralleliza-
tion. Parallelization is expected to improve the speed of the sentiment computation
only for sufficiently large corpora, or when using many lexicons. A second speedup
strategy is to tokenize the texts separately and input these using the tokens argument.
This way, the tokenization can be tailor-made (e.g., stemmed9 ) and reused for different
sentiment computation function calls, for example to compare the impact of different
normalization or aggregation choices for the same tokenized corpus. Our tokenization is
done with the R package stringi, transforms all to lowercase, strips punctuation marks
and numeric characters.
8 More specifically, for $i = 1, \ldots, \tau$, we have as Almon weighting
$c \left(1 - (1 - \tfrac{i}{\tau})^{b}\right)\left(1 - \tfrac{i}{\tau}\right)^{B - b}$,
where $c$ acts as a normalization constant, $b$ is a specific element from the ordersAlm vector, and $B$ is the maximum
value in the ordersAlm vector. If do.inverseAlm = TRUE, the inverse Almon polynomials are computed
by modifying $\tfrac{i}{\tau}$ to $1 - \tfrac{i}{\tau}$. For Beta weighting, we have
$f(\tfrac{i}{\tau}; a, b) / \sum_{i=1}^{\tau} f(\tfrac{i}{\tau}; a, b)$,
with the Beta density $f(x; a, b) \equiv x^{a-1}(1 - x)^{b-1}\,\Gamma(a + b) / (\Gamma(a)\Gamma(b))$, $\Gamma(\cdot)$ the Gamma function, $a$ as aBeta, and $b$ as bBeta.
The functions weights_almon() and weights_beta() (as well as weights_exponential() for exponential
weighting) allow generating these time weights separately.

9 Make sure in this case to also stem the lexicons before you provide those to the sento_lexicons() function.

In the example code below, we aggregate sentiment at a weekly frequency, choose a counts–
based within–document aggregation, and weight the documents for across–document aggregation
proportionally to the number of words in the document. The resulting time series
are smoothed according to an equally–weighted and an exponential time aggregation scheme
(B = 2), using a lag of 30 weeks. We ignore documents with zero sentiment for across–
document aggregation, and fill missing dates with zero before the across–time aggregation, as
per default.
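A sketch of the corresponding aggregation control and computation follows; the argument values mirror the description above, while the alphasExp value for the exponential scheme is an assumption:

R> ctrAgg <- ctr_agg(howWithin = "counts", howDocs = "proportional",
+    howTime = c("equal_weight", "exponential"), do.ignoreZeros = TRUE,
+    by = "week", lag = 30, fill = "zero", alphasExp = 0.2)
R> sentMeas <- sento_measures(uscorpus, lexicons = lex, ctr = ctrAgg)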
The sento_measures() function performs both the sentiment calculation in Step 2 and the
aggregation in Step 3, and results in a sentomeasures output object. The generic summary()
displays a brief overview of the composition of the sentiment time series. A sentomeasures
object is a classed list with as most important elements "measures" (the textual sentiment
time series), "sentiment" (the original sentiment scores per document) and "stats" (a se-
lection of summary statistics). Alternatively, the same output can be obtained by applying
the aggregate() function on the output of the compute_sentiment() function, if the latter
is computed from a sentocorpus object.
There are 108 initial sentiment measures (9 lexicons × 6 features × 2 time weighting schemes).
An example of two of the created sentiment measures, and the naming of the sentiment
measures (each dimension’s component is separated by "--"), is shown below.
date NRC--noneconomy--equal_weight
1: 1995-07-24 3.931
2: 1995-07-31 4.124
3: 1995-08-07 3.961
4: 1995-08-14 4.059
5: 1995-08-21 3.846
---
1011: 2014-12-01 4.649
1012: 2014-12-08 4.649
1013: 2014-12-15 4.594
1014: 2014-12-22 4.653
1015: 2014-12-29 4.831
A sentomeasures object is easily plotted across each of its dimensions. For example, Figure 1
shows a time series of average sentiment for both time weighting schemes involved.10 A display
of averages across lexicons and features is achieved by altering the group argument from the
plot() method to "lexicons" and "features", respectively.
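A minimal sketch of the call producing Figure 1, assuming "time" is the value of the group argument corresponding to the time weighting dimension:

R> plot(sentMeas, group = "time")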
Figure 1: Textual sentiment time series averaged across time weighting schemes (equal_weight and exponential_0.2).
Subsetting across rows is possible with the measures_subset() function. A typical example
is to subset only a time series range by specifying the dates part of that range, as below,
where 100 dates are kept. One can also condition on specific sentiment measures being above,
below or between certain values.
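A hypothetical call keeping the first 100 dates could look as follows; the exact subsetting syntax may differ slightly from this sketch:

R> dates <- sentMeas[["measures"]][["date"]][1:100]
R> sentMeasSub <- measures_subset(sentMeas, date %in% dates)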
To ex–post fill the time series date–wise, the measures_fill() is to be used. The function
at minimum fills in missing dates between the existing date range at the prevailing frequency.
Dates before the earliest and after the most recent date can be added too. The argument
fill = "zero" sets all added dates to zero, whereas fill = "latest" takes the most recent
known value. This function is applied internally depending on the fill parameter from the
ctr_agg() function. The example below pads the time series with trailing dates, taking the
first value that occurs.
When there are many different lexicons, features, and time weighting schemes, the sentiment
visualized using the plot() function may give a distorted image due to the averaging. To
obtain a more nuanced picture of the differences in one particular dimension, one can ignore
the other two dimensions. For example, corpusPlain has only the dummy feature, and there
is no time aggregation involved (lag = 1). This leaves the lexicons as the sole distinguishing
dimension between the sentiment time series.
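A hedged sketch of how such lexicon–only measures and their average cross–correlation could be obtained; corpusPlain is assumed to be a sentocorpus built from the texts only (so that only the dummy feature is present), and sentMeasLex is a hypothetical object name:

R> corpusPlain <- sento_corpus(corpusdf = usnews[, 1:3])  # id, date and texts only
R> ctrAggLex <- ctr_agg(howWithin = "proportionalPol", howDocs = "equal_weight",
+    howTime = "equal_weight", by = "week", lag = 1)
R> sentMeasLex <- sento_measures(corpusPlain, lexicons = lex, ctr = ctrAggLex)
R> cors <- cor(sentMeasLex[["measures"]][, -1])  # pairwise correlations between lexicons
R> mean(cors[upper.tri(cors)])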
[1] 0.3593
Figure 2 plots the nine time series each belonging to a lexicon. There are differences in levels.
For example, the Loughran & McDonald lexicon’s time series lies almost exclusively in the
negative sentiment area. The overall average correlation is close to 36%.
After the initial aggregation, one can do additional merging of the sentiment time series with
the measures_merge() function. This functionality equips users with a way to either further
diminish or further expand the dimensionality of the sentiment measures. For instance, all
sentiment measures composed of three particular time aggregation schemes can be shrunk
together, by averaging across each fixed combination of the other two dimensions, resulting
in a set of new measures.
In the example, both built–in lexicons, both news sources, both added features, and similar
time weighting schemes are collapsed into their respective new counterparts. The do.keep =
FALSE option indicates that the original measures are not kept after merging, such that the
number of sentiment time series effectively goes down to 7 × 4 × 1 = 28.
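A hedged sketch of the merging call; the component names inside the lists, in particular which two lexicons and which time weighting schemes are collapsed, are placeholders since they are not spelled out above:

R> sentMeasMerged <- measures_merge(sentMeas,
+    features = list(JOUR = c("wsj", "wapo"), NEW = c("uncertainty", "election")),
+    lexicons = list(LEX = c("GI_en", "HENRY_en")),  # placeholder pair of built-in lexicons
+    time = list(W = c("equal_weight", "exponential0.2")),  # placeholder scheme names
+    do.keep = FALSE)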
$features
[1] "economy" "noneconomy" "JOUR" "NEW"
$lexicons
[1] "NRC" "HULIU" "SENTIWORD" "JOCKERS" "SENTICNET" "SOCAL" "LEX"
$time
[1] "W"
With the measures_global() function, the sentiment measures can also be merged into
single dimension–specific time series we refer to as global sentiment. Each of the different
components of the dimensions has to receive a weight that stipulates the importance and
sign in the global sentiment index. The weights need not sum to one, that is, no condition
is imposed. For example, the global lexicon–oriented sentiment measure is computed as
G,L
su ≡ P p=1 spu × wl,p . The weight to lexicon l that appears in p is denoted by wl,p . An
1 PP
The peakdates() function pinpoints the dates with most extreme sentiment (negatively, pos-
itively, or both). The example below extracts the date with the lowest sentiment time series
value across all measures. The peakdocs() function can be applied equivalently to a senti-
ment object to get the document identifiers with most extreme document–level sentiment.
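A sketch of the call extracting that single most negative date; the argument names n and type reflect our reading of the peakdates() interface:

R> peakdates(sentMeas, n = 1, type = "neg")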
[1] "2008-01-28"
• model: The model argument can take "gaussian" (for linear regression), and "binomial"
or "multinomial" (both for logistic regression). The argument do.intercept = TRUE
fits an intercept.
• type: The type specifies the calibration procedure to find the most appropriate α and
λ in (5). The options are "cv" (cross–validation) or one of three information criteria
("BIC", "AIC" or "Cp"). The information criterium approach is available in case of a lin-
ear regression only. The argument alphas can be altered to change the possible values
for alpha, and similarly so for the lambdas argument. If lambdas = NULL, the possible
values for λ are generated internally by the glmnet() function from the R package glm-
net. If lambdas = 0, the regression procedure is OLS. The arguments trainWindow,
testWindow and oos are needed when model calibration is performed through cross–
validation, that is, when type = "cv". The cross–validation implemented is based
on the “rolling forecasting origin” principle, considering we are working with time se-
ries.13 The argument do.progress = TRUE prints calibration progress statements. The
do.shrinkage.x argument is a logical vector to indicate on which external explana-
tory variables to impose regularization. These variables, xu , are added through the x
argument of the sento_model() function.
• h: The integer argument h shifts the response variable up to yu+h and aligns the ex-
planatory variables in accordance with (4).14 If h = 0 (by default), no adjustments are
made. The logical do.difference argument, if TRUE, can be used to difference the tar-
get variable y supplied to the sento_model() function, if it is a continuous variable (i.e.,
model = "gaussian"). The lag taken is the absolute value of the h argument (given
|h| > 0). For example, if h = 2, and assuming the y variable is aligned time–wise with
all explanatory variables, denoted by X here for sake of the illustration, the regression
performed is of yt+2 − yt on Xt . If h = −2, the regression fitted is yt+2 − yt on Xt+2 .
= 0. If nCore > 1, the %dopar% construct from the R package foreach (Weston 2017)
is utilized to speed up the out–of–sample analysis.
R> set.seed(505)
R> y <- rnorm(nobs(sentMeas))
R> ctrInSample <- ctr_model(model = "gaussian",
+ h = 0, type = "BIC", alphas = 0, do.iter = FALSE)
R> fit <- sento_model(sentMeas, y, ctr = ctrInSample)
The attributions() function takes the sentomodel object and the related sentiment mea-
sures object as inputs, and generates by default attributions for all in–sample dates at the
level of individual documents, lags, lexicons, features, and time. The function can be applied
to a sentomodeliter object as well, for any specific dates using the refDates argument. If
do.normalize = TRUE, the values are normalized between –1 and 1 through division by the
`2 –norm of the attributions at a given date. The output is an attributions object.
Attribution decomposes a prediction into the different sentiment components along a given
dimension, for example, lexicons. The sum of the individual sentiment attributions per date,
the constant, and other non–sentiment measures should thus be equal to the prediction.
Indeed, the piece of code below shows that the difference between the prediction and the
summed attributions plus the constant is equal to zero throughout.
[1] TRUE
The target variable $VIX_u$ is the most recent available end–of–month daily VIX index value.
We run the predictive analysis for h = 6 months. The sentiment time series are in su and
define the sentiment–based model (Ms ). As primary benchmark, we exchange the sentiment
variables for EP Uu−1 , the level of the U.S. economic policy uncertainty index in month
u − 1 we know is fully available by month u (Mepu ). We also consider a simple first–order
autoregressive specification (Mar ).
We use the built–in U.S. news corpus of around 4145 documents, in the uscorpus object.
Likewise, we proceed with the nine lexicons and the valence shifters list from the lex object
used in previous examples. To infer textual features from scratch, we use a structural topic
modelling approach as implemented by the R package stm (Roberts et al. 2018). This is
a prime example of integrating a distinct text mining workflow with our textual sentiment
analysis workflow. The stm package works on a quanteda document–term matrix as an input.
The cleaning of the document–term matrix we perform is fairly standard, as well as the default
parameters of the stm() function, and we group into eight features.
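A sketch of this step, assuming standard quanteda preprocessing choices (the trimming threshold is a placeholder):

R> dfmTopics <- quanteda::dfm(uscorpus, tolower = TRUE, remove_punct = TRUE,
+    remove_numbers = TRUE, remove = quanteda::stopwords("en"))
R> dfmTopics <- quanteda::dfm_trim(dfmTopics, min_termfreq = 20)  # placeholder threshold
R> topicModel <- stm::stm(dfmTopics, K = 8, verbose = FALSE)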
We then define the keywords as the five most statistically representative terms for each topic.
They are assembled in keywords as a list.
We employ these keywords to generate the features, based on their occurrences in a document,
after having deleted all current features. The features are scaled between 0 and 1
(do.binary = FALSE). Alternatively, one could use the predicted topics per text as a feature
and set do.binary = TRUE, to avoid documents sharing mutual features, instead of relying
on the generated keywords. We see a relatively even distribution of the corpus across the
generated features.
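A hedged sketch of these two steps, assuming labelTopics() from stm is used to extract the top terms per topic and add_features() is reused with the generated keyword lists (the feature names are assumptions):

R> topTerms <- t(stm::labelTopics(topicModel, n = 5)[["prob"]])  # five terms per topic
R> keywords <- lapply(seq_len(ncol(topTerms)), function(i) topTerms[, i])
R> names(keywords) <- paste0("TOPIC_", seq_along(keywords))
R> # note: the original features are assumed to have been removed from uscorpus beforehand (not shown)
R> uscorpus <- add_features(uscorpus, keywords = keywords,
+    do.binary = FALSE, do.regex = FALSE)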
The frequency of our target variable is monthly, yet we aggregate the sentiment time series
on a daily level and then across time using a lag of about nine months (lag = 270). While
the lag is substantial, we also include six different Beta time weighting schemes to capture
time dynamics other than a slowly moving trend.
In Figure 3, we see sufficiently differing average time series patterns for all topics. The drop in
sentiment during the recent financial crisis is apparent, but the recovery varies along features.
The package has the EPU index as a dataset epu included. We consider the EPU index values
one month before all other variables and use the lubridate package (Grolemund and Wickham
2011) to do this operation. We ensure that the length of the dependent variable is equal to the
number of observations in the sentiment measures by selecting based on the proper monthly
dates. The pre–processed VIX data is represented by the vix variable.15
We apply the iterative rolling forward analysis (do.iter = TRUE) for our 6–month prediction
horizon. The target variable is aligned with the sentiment measures in sentMeasIn, such
that h = 6 in the modelling control means forecasting the monthly averaged VIX value in six
months. The calibration of the sparse linear regression is based on a Bayesian–like information
criterion (type = "BIC") proposed by Tibshirani and Taylor (2012). We configure a sample
size of M = 60 months for a total sample of N = 232 observations. Our out–of–sample setup
is nonoverlapping; oos = h - 1 means that for an in–sample estimation at time u, the last
available explanatory variable dates from time u − h, and the out–of–sample prediction is
performed at time u as well, not at time u − h + 1. We consider a range of alpha values that
allows any of the Ridge, LASSO and pure elastic net regularization objective functions.16
R> h <- 6
R> oos <- h - 1
R> M <- 60
16
When a sentiment measure is a duplicate of another, or when at least 50% of the series observations
are equal to zero, it is automatically discarded from the analysis. Discarded measures are put under the
"discarded" element of a sentomodel object.
Model specification
- - - - - - - - - - - - - - - - - - - -
Out-of-sample performance
- - - - - - - - - - - - - - - - - - - -
The output of the sento_model() call is a sentomodeliter object. Below we replicate the
analysis but for the benchmark regressions, without the sentiment variables. The vector preds
assembles the out–of–sample predictions for model Mepu , the vector predsAR for model Mar .
A more detailed view of the different performance measures, in this case directional accuracy,
root mean squared and absolute errors, is obtained via out[["performance"]]. A list of the
individual sentomodel objects is found under out[["models"]]. A simple plot to visualize
the out–of–sample fit of any sentomodeliter object can be produced using plot(). We
display in Figure 4 the realized values and the different predictions.
R> plot(out) +
+ geom_line(data = melt(data.table(date = names(out$models), "M-epu" =
+ preds, "M-ar" = predsAR, check.names = FALSE), id.vars = "date"))
Figure 4: Realized six–month ahead $VIX_{u+6}$ index values (purple, realized) and out–of–
sample predictions from models Ms (blue, prediction), Mar (red), and Mepu (green).
Table 2: Performance measures. The root mean squared error (RMSE) is computed as
$\sqrt{\sum_{i=1}^{N_{oos}} e_i^2 / N_{oos}}$, and the mean absolute deviation (MAD) as $\sum_{i=1}^{N_{oos}} |e_i| / N_{oos}$, with $N_{oos}$ the
number of out–of–sample predictions, and $e_i$ the prediction error. The sentiment–based model
is Ms, the EPU–based model is Mepu, the autoregressive model is Mar.
The final step is to do the post–modelling attribution analysis. For a sentomodeliter object,
the attributions() function generates attributions for all out–of–sample dates. To study the
evolution of the prediction attribution, the attributions are shown using plot(). This can be
done according to any of the dimensions, except for individual documents. The attributions
are displayed stacked on top of each other, per date. The y–axis represents the attribution
to the prediction of the target variable. Figure 5 shows two types of attributions in separate
panels. The third topic was most impactful during the crisis, and the eighth topic received
the most positive post–crisis weight. Likewise, the lexicons attribution conveys an increasing
influence of the SO–CAL lexicon on the predictions. Finally, it can be concluded that the
predictive role of sentiment is least present before the crisis.
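A sketch of the corresponding attribution and plotting calls; the group argument of the plot() method for an attributions object is assumed to work as it does for the sentiment measures:

R> attrOos <- attributions(out, sentMeasIn)
R> plot(attrOos, group = "features")
R> plot(attrOos, group = "lexicons")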
This illustration shows that the sentometrics package provides useful insights into predicting
variables like the VIX index starting from a corpus of texts. Results could be improved by
expanding the corpus, or by optimizing the features generation. For a larger application of the
entire workflow, we refer to Ardia et al. (2018). They find that the incorporation of textual
sentiment indices results in better prediction of the U.S. industrial production growth rate
compared to using a panel of typical macroeconomic indices only.
The R package sentometrics provides a framework to calculate sentiment for texts, to aggre-
gate textual sentiment scores into many time series at a desired frequency, and to use these in
a flexible prediction modelling setup. It can be deployed to relate a corpus of texts, or qual-
itative data, to a quantitative target variable, and retrieve which type of sentiment is most
informative through visualization and attribution analysis. The main priorities for further de-
velopment are integrating better prediction tools, enhancing the complexity of the sentiment
engine, allowing user–defined weighting schemes, and adding intra–day aggregation.
If you use R or sentometrics, please cite the software in publications. In case of the latter,
use citation("sentometrics").
Computational details
The results in this paper were obtained using R 3.5.1 (R Core Team 2018), sentometrics
version 0.5.6, and underlying or used packages caret version 6.0.81 (Kuhn 2018), data.table
version 1.11.8 (Dowle and Srinivasan 2018), foreach version 1.4.4 (Weston 2017), ggplot2
version 3.1.0 (Wickham 2016), glmnet version 2.0.16 (Friedman et al. 2010), gridExtra ver-
sion 2.3.0 (Auguie 2017), ISOweek version 0.6.2 (Block 2011), lexicon version 1.1.3 (Rinker
2018a), lubridate version 1.7.4 (Grolemund and Wickham 2011), quanteda version 1.13.14
(Benoit et al. 2018), Rcpp version 1.0.0 (Eddelbuettel and Francois 2011), RcppArmadillo
version 0.9.200.5.0 (Eddelbuettel and Sanderson 2014), RcppRoll version 0.3.0 (Ushey 2018),
RcppParallel version 4.4.2 (Allaire et al. 2018), stm version 1.3.3 (Roberts et al. 2018), and
stringi version 1.2.4 (Gagolewski 2018). Computations were performed on a Windows 10
Pro machine, x86 64-w64-mingw32/x64 (64-bit) with Intel(R) Core(TM) i7-3632QM CPU 2x
2.20 GHz. The code used in the main paper is available in the R script run_vignette.R
located in the examples folder on the dedicated sentometrics GitHub repository at https:
//github.com/sborms/sentometrics.
The timings comparison of the various sentiment computation functions can be replicated
using the R script run_timings.R, available at the aforementioned GitHub repository. To
generate the results, we have also used the packages dplyr version 0.7.8 (Wickham et al. 2018),
meanr version 0.1.1 (Schmidt 2017), microbenchmark version 1.4.6 (Mersmann 2018), Senti-
mentAnalysis version 1.3.2 (Feuerriegel and Proellochs 2018), syuzhet version 1.0.4 (Jockers
2017), tidytext version 0.2.0 (Silge and Robinson 2016), and tidyr version 0.8.2 (Wickham
and Henry 2018).
R, sentometrics and all other packages are available from CRAN at http://CRAN.R-project.
org/. The version under development is available on our GitHub repository as well.
Acknowledgments
We thank Andres Algaba, Nabil Bouamara, Peter Carl, Leopoldo Catania, Thomas Chuf-
fart, Dries Cornilly, Serge Darolles, William Doehler, Arnaud Dufays, Matteo Ghilotti, Kurt
Hornik, Siem Jan Koopman, Brian Peterson, Laura Rossetti, Tobias Setz, Majeed Siman, Ste-
fan Theussl, Wouter Torsin, and participants at the CFE (London, 2017), eRum (Budapest,
2018), R/Finance (Chicago, 2018), SwissText (Winterthur, 2018), SoFiE (Brussels, 2018),
“Data Science in Finance with R” (Vienna, 2018), “New Challenges for Central Bank Com-
munication” (Brussels, 2018), and (EC)ˆ2 (Roma, 2018) conferences for helpful comments.
We acknowledge Google Summer of Code 2017 for their financial support.
References
Antweiler W, Frank M (2004). “Is All That Talk Just Noise? The Information Content of
Internet Stock Message Boards.” Journal of Finance, 59(3), 1259–1294. doi:10.1111/j.
1540-6261.2004.00662.x.
Ardia D, Bluteau K, Boudt K (2018). “Questioning the News about Economic Growth: Sparse
Forecasting Using Thousands of News-Based Sentiment Values.” International Journal of
Forecasting, forthcoming. doi:10.2139/ssrn.2976084.
Arnold T (2017). “A Tidy Data Model for Natural Language Processing Using cleanNLP.”
The R Journal, 9(2), 1–20. URL https://journal.r-project.org/archive/2017/
RJ-2017-035/RJ-2017-035.pdf.
Auguie B (2017). gridExtra: Miscellaneous Functions for ”Grid” Graphics. R package version
2.3, URL https://CRAN.R-project.org/package=gridExtra.
Baker S, Bloom N, Davis S (2016). “Measuring Economic Policy Uncertainty.” The Quarterly
Journal of Economics, 131(4), 1593–1636. doi:10.1093/qje/qjw024.
Beckers B, Kholodilin K, Ulbricht D (2017). “Reading Between the Lines: Using Media
to Improve German Inflation Forecasts.” DIW Berlin Discussion Paper No. 1665. doi:
10.2139/ssrn.2970466.
Block U (2011). ISOweek: Week of the Year and Weekday According to ISO 8601. R Package
Version 0.6.2, URL https://CRAN.R-project.org/package=ISOweek.
Caporin M, Poli F (2017). “Building News Measures from Textual Data and an Application
to Volatility Forecasting.” Econometrics, 5(3), 35. doi:10.3390/econometrics5030035.
Catania L, Bernardi M (2017). MCS: Model Confidence Set Procedure. R Package Version
0.1.3, URL https://CRAN.R-project.org/package=MCS.
Ceron A, Curini L, Iacus S, Porro G (2014). “Every Tweet Counts? How Sentiment Analysis
of Social Media Can Improve Our Knowledge of Citizens’ Political Preferences with an
Application to Italy and France.” New Media & Society, 16(2), 340–358. doi:10.1177/
1461444813480466.
Feinerer I, Hornik K, Meyer D (2008). “Text Mining Infrastructure in R.” Journal of Statistical
Software, 25(5), 1–54. doi:10.18637/jss.v025.i05.
Ghysels E, Sinko A, Valkanov R (2007). “MIDAS Regressions: Further Results and New
Directions.” Econometric Reviews, 26, 53–90. doi:10.1080/07474930600972467.
Grolemund G, Wickham H (2011). “Dates and Times Made Easy with lubridate.” Journal of
Statistical Software, 40(3), 1–25. doi:10.18637/jss.v040.i03.
Hansen P, Lunde A, Nason J (2011). “The Model Confidence Set.” Econometrica, 79, 453–497.
doi:10.3982/ECTA5771.
Henry E (2008). “Are Investors Influenced by How Earnings Press Releases are Written?”
Journal of Business Communication, 45(4), 363–407. doi:10.1177/0021943608319388.
Heston S, Sinha N (2017). “News vs. Sentiment: Predicting Stock Returns from News Stories.”
Financial Analysts Journal, 73(3), 67–83. doi:10.2469/faj.v73.n3.3.
Hoerl A, Kennard R (1970). “Ridge Regression: Biased Estimation for Nonorthogonal Prob-
lems.” Technometrics, 12, 55–67. doi:10.1080/00401706.1970.10488634.
Hornik K (2016). openNLP: Apache OpenNLP Tools Interface. R Package Version 0.2.6,
URL https://CRAN.R-project.org/package=openNLP.
Jegadeesh N, Wu D (2013). “Word Power: A New Approach for Content Analysis.” Journal
of Financial Economics, 110(3), 712–729. doi:10.1016/j.jfineco.2013.08.018.
Jockers M (2017). syuzhet: Extract Sentiment and Plot Arcs from Text. R Package Version
1.0.4, URL https://CRAN.R-project.org/package=syuzhet.
Kuhn M (2018). caret: Classification and Regression Training. R Package Version 6.0-81,
URL https://CRAN.R-project.org/package=caret.
Manela A, Moreira A (2017). “News Implied Volatility and Disaster Concerns.” Journal of
Financial Economics, 123(1), 137–162. doi:10.1016/j.jfineco.2016.01.032.
Mohammad S, Turney P (2010). “Emotions Evoked by Common Words and Phrases: Using
Mechanical Turk to Create an Emotion Lexicon.” Proceeding of Workshop on Computational
Approaches to Analysis and Generation of Emotion in Text, pp. 26–34. URL http://dl.
acm.org/citation.cfm?id=1860631.1860635.
Pástor L, Veronesi P (2013). “Political Uncertainty and Risk Premia.” Journal of Financial
Economics, 110(3), 520–545. doi:10.1016/j.jfineco.2013.08.007.
Rinker T (2017). qdap: Quantitative Discourse Analysis Package. R Package Version 2.3.0,
URL https://CRAN.R-project.org/package=qdap.
Rinker T (2018a). lexicon: Lexicon Data. R Package Version 1.1.3, URL https://CRAN.
R-project.org/package=lexicon.
Rinker T (2018b). sentimentr: Calculate Text Polarity Sentiment. R Package Version 2.6.1,
URL https://CRAN.R-project.org/package=sentimentr.
Roberts M, Stewart B, Tingley D (2018). “stm: R Package for Structural Topic Models.”
Journal of Statistical Software, forthcoming.
Schmidt D (2017). meanr: Sentiment Analysis Scorer. R Package Version 0.1.1, URL https:
//CRAN.R-project.org/package=meanr.
Silge J, Robinson D (2016). “tidytext: Text Mining and Analysis Using Tidy Data Principles
in R.” Journal of Open Source Software, 1(3). doi:10.21105/joss.00037.
Tetlock P (2007). “Giving Content to Investor Sentiment: The Role of Media in the Stock
Market.” Journal of Finance, 62(3), 1139–1168. doi:10.1111/j.1540-6261.2007.01232.x.
Tibshirani R (1996). “Regression Shrinkage and Selection via the LASSO.” Journal of
the Royal Statistical Society B, 58(1), 267–288. URL https://www.jstor.org/stable/
2346178.
Weston S (2017). foreach: Provides Foreach Looping Construct for R. R Package Version
1.4.4, URL https://CRAN.R-project.org/package=foreach.
Wickham H (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.
ISBN 978-3-319-24277-4. URL http://ggplot2.org.
Wickham H, Henry L (2018). tidyr: Easily Tidy Data with ‘spread()’ and ‘gather()’ Func-
tions. R package version 0.8.2, URL https://CRAN.R-project.org/package=tidyr.
Zou H, Hastie T (2005). “Regularization and Variable Selection via the Elastic Net.” Journal
of the Royal Statistical Society B, 67, 301–320. doi:10.1111/j.1467-9868.2005.00503.x.
Panel A: Average execution time of the sentiment computation for one lexicon
sentometrics
Texts unigrams bigrams clusters meanr SentimentAnalysis syuzhet quanteda tidytext
1000 0.21 0.19 0.20 0.11 2.03 0.67 0.76 0.31
5000 0.92 0.86 0.89 0.47 7.73 2.57 2.26 0.80
10000 1.54 1.53 1.56 0.84 15.49 5.01 3.03 1.46
25000 3.76 3.92 4.06 2.10 38.46 12.06 6.47 3.61
50000 8.30 8.11 8.39 4.50 79.39 26.60 12.86 7.58
75000 12.71 12.51 12.98 6.70 121.54 37.07 18.32 11.25
100000 17.81 17.95 17.69 8.51 — 48.52 24.76 13.92
Panel B: Average execution time of the sentiment computation for many lexicons
sentometrics tidytext
Texts unigrams unigrams, feats. bigrams clusters clusters, parallel unigrams bigrams
1000 0.26 0.37 1.14 0.29 0.94 0.26 1.30
5000 0.96 1.15 0.98 1.00 0.77 0.83 3.38
10000 1.86 1.92 1.90 1.91 1.51 1.56 3.38
25000 4.72 4.81 4.78 4.82 3.74 4.08 6.76
50000 9.65 9.91 9.73 9.79 7.63 7.75 33.24
75000 15.37 16.24 16.82 13.20 12.69 12.72 49.82
100000 23.81 24.43 24.92 21.43 17.17 18.02 68.03
Table 3: Average computation time (in seconds) of various lexicon–based sentiment facilities
in R. All implementations consider the Hu & Liu lexicon (Panel A), or the nine lexicons
specified in the lex object (Panel B). Some implementations do not integrate valence shifters
(unigrams), others do from a bigrams perspective (bigrams) or from a clusters perspective
(clusters). A bar ‘—’ indicates the calculation produced an error.
The first analysis (Panel A) benchmarks these implementations with three approaches using
the compute_sentiment() function from sentometrics: one without valence shifters, one
with valence shifters integrated from a bigrams perspective, and one with valence shifters
integrated from a clusters perspective. The number of threads for parallel computation is set
to one where appropriate. All other algorithms are run with a version of the Hu & Liu
lexicon (about 6600 single words). The computations are counts–based and constructed so
as to give the same output across all packages for a binary lexicon, if the tokenization is the
same. For example, the sentometrics and tidytext implementations give identical results.
The meanr implementation comes out fastest because everything is written in the C program-
ming language. Yet, it offers no flexibility to define the input lexicon nor the scale on which
the scores are returned. At the other end of the spectrum, amongst these approaches, the
SentimentAnalysis and syuzhet packages are slowest. The latter package further does not offer the
flexibility of adding different sentiment lexicons than those available in their package. Senti-
mentAnalysis becomes problematic for the largest corpus due to a memory allocation error.
The quanteda package is fast, but slower than the sentometrics and tidytext implementations.
The tidytext package is faster, particularly for the two largest corpus sizes.
The second analysis (Panel B) compares the computation time with multiple lexicons as input.
The comparison is against the tidytext package, for a unigram and a bigrams implementation.
The lexicons are those defined in lex, nine in total. For the clusters approach, we also look at
its parallelized version, using eight cores (see the ‘clusters, parallel’ column). For the unigrams
approach in sentometrics, we also assess the additional time it takes to spread out sentiment
across features (see the ‘unigrams, feats.’ column).
The tidytext package is, in general, faster for many lexicons as well. The differences are
nonetheless not large, and running any sentometrics computation in parallel would make the speed
differentials disappear. However, the bigrams calculation using tidytext is markedly slower.
With sentometrics, the speed of the computation is comparable across all types of sentiment
calculation. The tidytext framework thus copes less well with added complexity.
Overall, the sentometrics package brings an off–the–shelf yet flexible sentiment calculator that
is computationally efficient, being fast in itself, and independent as to the decision (how) to
integrate valence shifters as well as the number of input lexicons.
Affiliation:
David Ardia
Institute of Financial Analysis
University of Neuchâtel, Switzerland
E–mail: david.ardia@unine.ch
Keven Bluteau
Institute of Financial Analysis
University of Neuchâtel, Switzerland
& Solvay Business School
Vrije Universiteit Brussel, Belgium
E–mail: keven.bluteau@unine.ch
Kris Boudt
Solvay Business School
Vrije Universiteit Brussel, Belgium
& School of Business and Economics
Vrije Universiteit Amsterdam, The Netherlands
E–mail: kris.boudt@vub.be