Professional Documents
Culture Documents
February 2018
Big Data at BBVA Research
Index
02 Big Data & Big Models : Applying Big Data at BBVA Research
BBVA Data: Monitoring consumption using BBVA
transactional data: the Spanish Retail Sales Index
News Data (GDELT): Tracking Chinese
Vulnerability Sentiment
03 Annex
Big Data at BBVA Research
01
Opportunities in the digital era.
Big Data at BBVA Research
Big Data & BigBig
Models
Data at BBVA Research
5
Big Data & BigBig
Models
Data at BBVA Research
New data may end up changing the way in which economists approach empirical questions
and the tools they use to answer them 6
Big Data & BigBig
Models
Data at BBVA Research
7
Big Data & BigBig
Models
Data at BBVA Research
External
institutions
BBVA
External
BBVA
Institutions
Research
8
Big Data & BigBig
Models
Data at BBVA Research
Clean,
GDELT BigQuery Fuse,
Aggregate
BBVA data and visualize
transform
Google search Amazon & analyze
and model
Web Redshift the data
the data
9
Big Data & BigBig
Models
Data at BBVA Research
Conflict Intensity Map 2017 Refugees Flows Map 2015-17 Barcelona Instability index 2017 Chinese slowdown: country network
7 Barcelona attacks
17 Aug. 2017
6
5
4
3
2
1
0
-1
25
05
10
13
10
12
15
17
20
22
27
31
03
08
15
18
20
10/08/2017
11/08/2017
12/08/2017
14/08/2017
15/08/2017
16/08/2017
17/08/2017
18/08/2017
20/08/2017
21/08/2017
22/08/2017
23/08/2017
25/08/2017
26/08/2017
27/08/2017
28/08/2017
31/08/2017
01/09/2017
03/09/2017
04/09/2017
05/09/2017
06/09/2017
08/09/2017
09/09/2017
10/09/2017
11/09/2017
13/09/2017
14/09/2017
15/09/2017
17/09/2017
18/09/2017
19/09/2017
20/09/2017
21/09/2017
10
A u g u s t 2 0 1 7 S e m t e m b e r 2 0 1 7
Big Data & BigBig
Models
Data at BBVA Research
11
Big Data & BigBig
Models
Data at BBVA Research
Our products
Political, Geopolitical Social Indexes Color Maps NAFTA Topics Politics & Financial Networks Mix Hard data & Sentiment & VAR models
(Political Indexs) (Nafta Project) (Political Netwoks) (CBSI and Turkey Sentiment Indexes)
Geographical Analysis Housing Prices Financial Stability & Macroprudential Measuring Sentiments Monetary & Stability tones by Central
(sentiment on Housing Prices) (ECB & FED FS index by FED Board) (sentiment Analysis on Economy and Society) Banks
12
Big Data at BBVA Research
02
Big Data & Big Models:
Applying Big Data at BBVA Research
Big Data & BigBig
Models
Data at BBVA Research
1 2 3
BBVA International Policy
Data News (GDELT) Reports
12
Big Data & BigBig
Models
Data at BBVA Research
1
BBVA
Data
13
Big Data & BigBig
Models
Data at BBVA Research
6
Having accurate estimations on the
3 evolution of the retail trade sector activity
0
is of main importance given it is a key
indicator about the current economic
-3 situation
-6
-9
-12
Mar-04
Mar-07
Mar-10
Mar-13
Mar-16
Dec-04
Jun-09
Sep-05
Jun-06
Dec-07
Sep-08
Dec-10
Sep-11
Jun-12
Dec-13
Sep-14
Jun-15
Dec-16
Sep-17
14
Source: BBVA Research and BBVA Data & Analytics from INE data
Big Data & BigBig
Models
Data at BBVA Research
15
More information can be found in the annex
Big Data & BigBig
Models
Data at BBVA Research
Select
variables
Automate Query
process Data
Testing Data
data cleaning
Normalize Outlier
data detection
16
Big Data & BigBig
Models
Data at BBVA Research
A “High Definition” Retail Sales Index (RTI) for Spain* (and Mexico)
(BBVA consumption indicator for the optimal allocation of BBVA’s resources and products)
RTI–BBVA Index, in millions of euros and daily basis Comparison Retail Sales (RTI) - INE and BBVA on monthly
basis (standardized monthly growth)
18 3
2
17
1
16 0
-1
15
-2
14 -3
Apr-13
Oct-13
Apr-14
Oct-14
Apr-15
Oct-15
Apr-16
Oct-16
Apr-17
Oct-17
Jan-13
Jan-14
Jan-15
Jan-16
Jan-17
Jan-18
Jul-13
Jul-14
Jul-15
Jul-16
Jul-17
Jul-16
Jul-13
Jul-14
Jul-15
Jul-17
Jan-13
Jan-14
Jan-15
Jan-16
Jan-17
Jan-18
Log(total) Sundays Saturdays
BBVA & CX data merge BBVA-RTI INE-RTI
Going further from national figures, the retail sales at regional level …
20
-20
-40
May-13
May-14
May-15
May-16
May-17
Jan-13
Jan-14
Jan-15
Jan-16
Jan-17
Jan-18
Sep-13
Sep-14
Sep-15
Sep-16
Sep-17
BBVA transactions Retail sales
20 20 20
0 0 0
Jan-14
Jan-15
Jan-16
Jan-17
Jan-18
Jan-13
Jan-14
Jan-15
Jan-16
Jan-17
Jan-18
Jul-13
Jul-14
Jul-15
Jul-16
Jul-17
Jul-13
Jul-14
Jul-15
Jul-16
Jul-17
Jan-13
Jan-14
Jan-15
Jan-16
Jan-17
Jan-18
Jul-13
Jul-14
Jul-15
Jul-16
Jul-17
BBVA transactions BBVA transactions BBVA transact ions
20
Source: BBVA Research and BBVA Data & Analytics
20
40
80
60
0
100
120
Health
Other services
Bars and restaurants
Books, press and magazines
Food
Median ticket
Leisure and entertainment
Care and beauty
Home (median ticket in December 2017)
Transport
25 percentile
75 percentile
Automotive
BBVA retail sales index by sector of activity
-3
-2
-1
0
1
2
3
-3
-2
-1
0
2
3
1
Jan-13
Jan-13
Jul-13
Jul-13
Jan-14
Jan-14
Jul-14
Jul-14
Jan-15
Jan-15
Jul-15
Jul-15
Jan-16
Jan-16
Jul-16
Jul-16
Jan-17
Jan-17
Department stores
Jul-17
Jul-17
Jan-18
Jan-18
(standardized monthly growth)
-3
-2
-1
-3
-2
-1
0
2
3
0
1
2
3
Jan-13 Jan-13
Jul-13 Jul-13
Jan-14 Jan-14
Jul-14 Jul-14
Jan-15 Jan-15
Big Data & BigBig
Jul-15 Jul-15
Models
Jan-16 Jan-16
Jul-16 Jul-16
BBVA retail sales by distribution classes
Jan-17 Jan-17
Jul-17 Jul-17
Large chain stores
Jan-18 Jan-18
21
Data at BBVA Research
Big Data & BigBig
Models
Data at BBVA Research
International 2
News (GDELT)
(Narratives
& Sentiment)
Vulnerability Sentiment Indicator (CSVI)
(China)
20
Big Data & BigBig
Models
Data at BBVA Research
21
Big Data & BigBig
Models
Data at BBVA Research
22
Source: www.gdelt.org & BBVA Research
Big Data & BigBig
Models
Data at BBVA Research
Tota Profits (HD) 19.63 New Construction (HD) 16.37 Wenzhou Index (M) 16.58 Currency Exchange Rate (S) 19.94
Institutional Reform & SOEs (S) 12.37 Mortgages Loans (HD) 14.57 WMP Yields (M) 13.63 Exchange Rate Policy (S) 17.95
Debt & SOE (S) 11.92 Land Reform (S) 12.62 Infrastructure_funds (S) 10.92 Macroprudential_Policy (S) 15.33
Liabilities (HD) 10.62 Housing Price (HD) 11.6 NPL ratio (HD) 9.46 Hinchon Index (M) 13.84
Local Goverment & SOE (S) 9.75 Housing Construction (S) 10.59 State & Finantial Inst (S) 8.95 CNY Currency (M) 11.92
Industry Policy (S) 9.5 Housing_Prices (S) 10.05 Banking_Regulation (S) 7.28 Capital Account (S) 10.05
Resource Mis. & P. Failure (S) 8.18 Housing Policy & Institutions (S) 8.94 Financial Vulnerability (S) 6.82 CNH Currency (M) 8.73
SOE (S) 7.15 Housing Finance (S) 7.83 Asset_Management (S) 5.62 Illicit Financial Flows (S) 1.58
Industry Laws & Regulation (S) 5.28 Housing Markets (S) 6.71 Financial Sector Instability (S) 5.35 Foreign Reserves (S) 0.6
Resource Misaloc. & SOE (S) 5.61 GICS Housing Index (M) 0.36 Bank Capital Adequacy (S) 4.6 Currency Reserves (S) 0.06
Real State Investment (HD) 0.31 Non Bank Financial Inst (S) 4.35
Monetary & F.Stability (S) 3.54
Acceptances (HD) 2.22
TSF Aggregate New (HD) 0.57
Entrusted Loans (HD) 0.13
%of Sentiment in Component (S) 69.76 %of Sentiment in Component (S) 56.74 %of Sentiment in Component (S) 57.43 %of Sentiment in Component (S) 48.27
% of Variance by 1st PC 63.2 % of Variance by 1st PC 65.8 % of Variance by 1st PC 59.5 % of Variance by 1st PC 78.99
in the CSVI 29.18 Weight in the CSVI 26.05 Weight in the CSVI 23.12 Weight in the CSVI 21.64
23
Source: BBVA Research. Ortiz et al. (2017). Tracking Chinese Vulnerability in Real Time Using Big Data
Big Data & BigBig
Models
Data at BBVA Research
24
Source: BBVA Research
Big Data & BigBig
Models
Data at BBVA Research
3
(Lower Vulnerability)
Improving Sentiment
Policy Intervention
Stock
Market stock market crash, RMB enters IMF's
2 Crash SDR basket NPC Accepts
trade halted
Lower Growth
Target
0
S&P’s
Downgrade
(Higher Vulnerability)
Worsening Sentiment
Feb-16
May-15
Apr-15
Jul-15
Mar-16
May-16
Apr-16
Jul-16
Feb-17
Mar-17
May-17
Apr-17
Jul-17
Oct-15
Oct-16
Oct-17
Nov-15
Dec-15
Nov-16
Dec-16
Nov-17
Dec-17
Jun-15
Aug-15
Sep-15
Jan-16
Jun-16
Aug-16
Sep-16
Jan-17
Jun-17
Aug-17
Sep-17
Jan-18
25
Source: www.gdelt.org & BBVA Research
Big Data & BigBig
Models
Data at BBVA Research
26
Big Data & BigBig
Models
Data at BBVA Research
China CSVI: English vs Chinese Turkey Econ. Sentiment: English vs Turkish News
(index) (% YoY and index)
Amplifier
8.0 0 8.0 0
2.5
7.0 7.0
Divergence -0.5
-0.2
6.0 6.0
1.5
5.0 5.0 -1
-0.4
4.0 4.0
0.5 -1.5
3.0 -0.63.0
-2
2.0 2.0
-0.5
-0.8
1.0 1.0 -2.5
0.0 0.0
-1.5 -1 -3
-1.0 -1.0
-2.0 -2.0
-1.2 -3.5
-2.5
mar.15
jun.15
mar.16
jun.16
mar.17
sep.15
dic.15
sep.16
dic.16
sep-15
dic-15
sep-16
dic-16
mar-15
mar-16
mar-17
jun-15
jun-16
Jul-15
Mar-15
Jan-16
Jul-16
Mar-16
Jan-17
Jul-17
Mar-17
Jan-18
Sep-15
Nov-15
Sep-16
Nov-16
Sep-17
Nov-17
May-15
May-16
May-17
Chinese Vulnerability Index (news in Chinese) BBVA-GB Monthly GDP BBVA-GB Monthly GDP
Chinese Vulnerability Index (news in English) Economic Sentiment (Turkish Media) Economic Sentiment (English Media)
27
Source: www.gdelt.org & BBVA Research
Big Data & BigBig
Models
Data at BBVA Research
28
Source: www.gdelt.org & BBVA Research
Big Data & BigBig
Models
Data at BBVA Research
Policy 3
Reports
(Topics , Networks
Monetary Policy Topics & Sentiment
& Sentiment)
(Turkey)
29
Big Data & BigBig
Models
Data at BBVA Research
32
Big Data & BigBig
Models
Data at BBVA Research
Pre-Processing
Information extraction Transformation Text mining and NPL Sentiment analysis
and text parsing
33
More information can be found in the annex
Big Data & BigBig
Models
Data at BBVA Research
First we identify the topics: Word clouds will help us to understand and
identify topics… here there is a big room for the Researcher
Each word cloud represents the probability distribution of words within a given topic. The size
of the word and the color indicates its probability of occurring within that topic 34
Big Data & BigBig
Models
Data at BBVA Research
100%
Monetary Policy maintains its share
but increases in “Stress” periods
90%
80%
Inflation remains stable
70%
60%
30%
Employment issues increased
20%
Relevance since the crisis
10%
0%
Economic Activity discussion
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
33
Source: BBVA Research. Ortiz et al (2017). How do the EM Central Bank talk? A Big Data approach to the Central Bank of Turkey
Big Data & BigBig
Models
Data at BBVA Research
Full Fledge Inflation Target The global financial crisis In search of price stability
2006-09 2010-15 2016-17
The network of the estimated and correlated topics using STM. The nodes in the graph represent the identified topics. Node size is proportional to the number of words in the corpus
devoted to each topic (weight). Node color indicates clusters using a community detection algorithm called modularity developed by Blondel et al (2008). Topics for which labeling is
Unknown are removed from the graph in the interest of visual clarity. Edges represent words that are common to the topics they connect (co-ocurrence of words between topics). 36
Edge width is proportional to the strength of this co-ocurrence between topics.
Big Data & BigBig
Models
Data at BBVA Research
1 0.6
0.9
0.8 0.5
0.7
0.4
0.6
0.5 0.3
0.4
0.2
0.3
0.2
0.1
0.1
0 0
2006
2006
2007
2007
2008
2008
2009
2009
2010
2010
2011
2011
2012
2012
2013
2013
2014
2014
2015
2015
2016
2016
2017
2017
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
37
Source: BBVA Research
Big Data & BigBig
Models
Data at BBVA Research
Deviation from target (reference): Inflation & Credit Standard Monetary & Macroprudencial policies
(FX adj. Loans YoY ninus 15% and inflation minus 5%) (Sentiments)
3 15
10
1 0
5
0
-1
-5
-3 -10 -1
-15 Allows tight Standard & Ease Macro Prudential
-5
-20
-7 -25 -2
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
36
Source: BBVA Research
Big Data & BigBig
Models
Data at BBVA Research
1.5 1.5
0.5 0.5
-0.5 -0.5
-1.5 -1.5
-2.5 -2.5
Easing
-3.5 -3.5
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
Source: Iglesias, J, Ortiz, A & Rodrigo, T (2007)
Source: BBVA Research
We can check whether the sentiment affects analysts (and test the
Machines & Dictionary methods compare with Experts analysis)
3 3
1.00
50
2
2
0.50
1 0 1
0.00 0
-50 0
-1
-0.50
-1
-2
-1.00 -100
-3 -2
-1.50 -4 -150 -3
abr-06
abr-07
abr-08
abr-09
abr-10
ene-06
oct-06
ene-07
oct-07
ene-08
oct-08
ene-09
oct-09
ene-10
jul-06
jul-07
jul-08
jul-09
abr-06
abr-07
abr-08
abr-09
abr-10
oct-06
ene-06
jul-06
ene-07
jul-07
oct-07
ene-08
jul-08
oct-08
ene-09
jul-09
oct-09
ene-10
Monetary Policy Surprise (Demiralp, Kara & Ozlu, EJPE 2011) Surprise Size CBRT (Demiralp, Kara & Ozlu, EJPE 2011)
Monetary Policy Sentiment from Minutes (Normalized) Monetary Policy Sentiment from Minutes (Normalized)
Monetary Policy Sentiment from Statements (Normalized) Monetary Policy Sentiment from Statements (Normalized)
38
Source: BBVA Research
Big Data & BigBig
Models
Data at BBVA Research
Tigthening Easing
39
Source: BBVA Research
Big Data & BigBig
Models
Data at BBVA Research
Monetary Policy
Monetary RateRate
Policy Shock
Shock Monetary
Monetary Policy
Policy “Sentiment
“Sentiment” “Shock
Shock
40
Source: BBVA Research
Big Data at BBVA Research
A NNEX
Big Data & BigBig
Models
Data at BBVA Research
The Retail Trade Index is a business cycle indicator which shows the monthly activity
of the retail sector (turnover)
Average Tone: GDELT uses more than 40 tonal dictionaries to build a score ranging from -100 (extremely
negative) to +100 (extremely positive) for each piece of news, with common values ranging between -10
(negative) and +10 (positive), with 0 indicating neutral tone. A neutral sentiment can be the result of a neutral
language or a balancing of some extreme positive sentiments compensated by negative ones. The sentiment
variable is based on the balance between the percentage of all words in the article having a positive and
negative emotional connotation within an article divided by the total number of words included the article
45
Big Data & BigBig
Models
Data at BBVA Research
𝑤𝑖𝑡ℎ 𝛾𝑖 , 𝛿𝑖 , 𝛽𝑖 , 𝜌𝑖 𝑏𝑒𝑖𝑛𝑔 𝑡ℎ𝑒 𝑤𝑒𝑖𝑔ℎ𝑡 𝑜𝑓 𝑒𝑣𝑒𝑟𝑦 𝑤𝑖𝑡ℎ 𝜇1 , 𝜇2 , 𝜇3 , 𝜇4 𝑏𝑒𝑖𝑛𝑔 𝑡ℎ𝑒 𝑤𝑒𝑖𝑔ℎ𝑡 𝑜𝑓 𝑒𝑣𝑒𝑟𝑦
𝑣𝑎𝑟𝑖𝑎𝑏𝑙𝑒 𝑖𝑛 𝑡ℎ𝑒 𝑓𝑖𝑟𝑠𝑡 𝑝𝑟𝑖𝑛𝑐𝑖𝑝𝑎𝑙 𝑐𝑜𝑚𝑝𝑜𝑛𝑒𝑛𝑡 𝑐𝑜𝑚𝑝𝑜𝑛𝑒𝑛𝑡 𝑖𝑛 𝑡ℎ𝑒 𝑓𝑖𝑟𝑠𝑡 𝑝𝑟𝑖𝑛𝑐𝑖𝑝𝑎𝑙 𝑐𝑜𝑚𝑝𝑜𝑛𝑒𝑛𝑡
𝑜𝑓 𝑡ℎ𝑒 𝑓𝑜𝑢𝑟 components
46
Big Data & BigBig
Models
Data at BBVA Research
Documents with less than 200 characters are excluded (titles, contents sections,…)
Then words are stemmed (reduce a word to their semantic root) to generate tokens
Feature selection is conducted on the tokens: common stopwords and words with length 3 or less are
removed and the remaining words are stemmed. Tokens are filtered out based on a
term-frequency-inverse-document-frequency (tf.idf) index (Manning and Schutze 1999), words
of the lowest quantile are removed. This indexing scheme is combined of a term-frequency index (tf) and a
document frequency index (df). tf is just the count of a given word in a document, mean tf
is used to construct the final index. df is the number of documents that contain a given word
Then, the tf.idf used to filter words out is:
𝑁
𝑡𝑓. 𝑖𝑑𝑓𝑖 = 𝑚𝑒𝑎𝑛 𝑡𝑓𝑖𝑗 ∗ 𝑙𝑜𝑔2
𝑑𝑓𝑖
where i indexes terms and j documents. This index gives high weight to frequent words through
the tf component, but if a word is very prevalent through the corpus; its weight is reduced through
the idf component. The aim of this filtering procedure is to remove very unfrequent as well as very frequent
words, to remove words with low semantic content
47
Big Data & BigBig
Models
Data at BBVA Research
Latent Dirichlet Allocation (LDA) (Blei, Ng, and Jordan 2003) is a Bayesian model with a prior distribution
on the document-specific mixing probabilities where the count of terms within documents are independent
and identically distributed given a Dirichlet prior distribution
To introduce time-series dependencies into the data generating process, we use the dynamic topic model
(DTM), a particularization of the Structural Topic Models (STM) where each time period
has a separate topic model and time periods are linked via smoothly evolving parameters
STM (Roberts et. al. 2016) explicitly introduces covariates into a topic model allowing us to estimate the
impact of document-level covariates on topic content and prevalence as part of the topic model itself,
The process for generating individual words is the same as for plain LDA. However both objects
can depend on potentially different sets of document-level covariates: Topic Prevalence
(each document has P attributes that can affect the likelihood of discussing topic k)
and Topic Content (each document has an A-level categorical attribute that affects the likelihood
of discussing term v overall, and of discussing it within topic k. The generation of the k and d terms
is via multinomial logistic regression
48
Big Data & BigBig
Models
Data at BBVA Research
Words (Tokens): basic unit of discrete data. Represented as an unit vector with a single 1 entry,
and 0 in the remainder, this vector ha as many entries as total words under analysis.
Stop Words: “A”, “the” very frequent but don´t add value in term
Document-Term-Matrix: matrix where each row is the sum of all the words in a given document.
As such we have documents in the rows, words in the columns, and each entry in the matrix
is the number of occurences of a word in a given document
49
Big Data & BigBig
Models
Data at BBVA Research
Latent Dirichlet Allocation (LDA) (Blei et al. 2003) is a generative probabilistic (hierarchical Bayesian)
model of a corpus. Documents are represented as mixtures over latent topics, where each topic is
characterized by a distribution over words.
𝑁 ~ 𝑃𝑜𝑖𝑠𝑠𝑜𝑛(𝜁)
𝜃~ 𝐷𝑖𝑟𝑖𝑐ℎ𝑙𝑒𝑡(𝛼)
topic 𝑧𝑛 ~ 𝑀𝑢𝑙𝑡𝑖𝑛𝑜𝑚𝑖𝑎𝑙(𝜁)
Bag of Words assumption: Order of the words is not importat , only the occurence is relevant.
This assumption is inherited, as LDA is an extension of the Latent Semantic Indexing algorithm
(an SVD on the Document-Term-Matrix)
Words are conditionally Independent and Identically distributed: Needed when working with latent mixture
of distributions, following de Finneti’s theorem (exchangeable observations are conditionally independent
given some latent variable to which an epistemic probability distribution would then be assigned)
50
Big Data & BigBig
Models
Data at BBVA Research
Structural Topic Model (Roberts et al. 2016) extends the LDA algorithm such that metadata (covariates)
can affect the topic distribution. This allows us to introduce time series dependencies, estimating
what is known as a Dynamic Topic Model
51
Big Data & BigBig
Models
Data at BBVA Research
Extending the LDA: Structural Topic Model & The Dynamic Topic Model
Structural Topic Model (Roberts et al. 2016) extends the LDA algorithm such that metadata (covariates)
can affect the topic distribution. This allows us to introduce time series dependencies, estimating what
is known as a Dynamic Topic Model (i.e Topics Change over time). Topics can depend on 2 classes
of covariates:
Topic Prevalence (each document has P attributes that can affect the likelihood of discussing topic k)
Topic Content (each document has an A-level categorical attribute that affects the likelihood
of discussing term v overall, and of discussing it within topic k)
Dynamic Topic Models were Γ is a Px(K – 1) matrix of prevalence coefficients, d indexes documents,
generative process: n indexes words within documents and k indexes the latent topics
𝛾𝑘 ~ 𝑁𝑜𝑟𝑚𝑎𝑙(0, 𝜎𝑘2 𝐼𝑝 )
𝜃𝑑 ~ 𝐿𝑜𝑔𝑖𝑠𝑡𝑖𝑐𝑁𝑜𝑟𝑚𝑎𝑙 (Γ ′ 𝑥𝑑′ , Σ)
52
Source: Roberts et al. 2016
Big Data & BigBig
Models
Data at BBVA Research
We rely on Lexicon methods using the Loughran-McDonald dictionary (Loughran McDonald 2009),
a created dictionary specifically to analyze financial texts and the FED dictionary for financial stability
(Correa et al, 2017)
Using the negative and positive words of this dictionary, the average “tone” of a given document
is computed by:
𝑃𝑜𝑠𝑖𝑡𝑖𝑣𝑒 𝑤𝑜𝑟𝑑𝑠 − 𝑁𝑒𝑔𝑎𝑡𝑖𝑣𝑒 𝑤𝑜𝑟𝑑𝑠
Average tone = 100 ∗
𝑇𝑜𝑡𝑎𝑙 𝑤𝑜𝑟𝑑𝑠
The score ranges from -100 (extremely negative) to +100 (extremely positive) but common values range
between -10 and +10, with 0 indicating neutral
To build the final sentiment indices, we use the topic mixture that combines dictionary methods
with the output of LDA to weight word counts by topic, following the approach proposed
by Hansen and McMahon (2015). This allows generating different sentiment measures from a set of text,
and focusing that sentiment on the topics of interest
53
Big Data at BBVA Research
Big Data Workshop on economics and finance.
Bank of Spain
Alvaro Ortiz, Tomasa Rodrigo and Jorge Sicilia
February 2018