
A High Dimensional State Space Model for

Unstructured Data
Eleni Kalamara ∗ George Kapetanios†
King’s Business School King’s Business School

This version:
March 2022

Abstract
This paper introduces a high dimensional state space model for unstructured large
data sets. The framework is designed to model jointly unstructured data and large
balanced datasets of the type widely used in forecasting. The high dimensional sys-
tem is driven by the same factor structure, which enables the use of the Kalman filter
and maximum likelihood for parameter estimation. We provide coherent guidance
on whether and when the inclusion of granular information improves predictive perfor-
mance, through a thorough Monte Carlo experiment. In an empirical application using
a large dataset of unstructured text data drawn from UK newspaper articles, we find
that the proposed factor model delivers significant forecasting gains on key economic
variables such as GDP growth, inflation and unemployment up to 12 months ahead.
JEL Classification: C23, C13, C53, C55, C38
Keywords: state space model, Kalman filter, maximum likelihood, unstructured data


∗ King’s Business School, King’s College London, UK. E-mail: eleni.kalamara@kcl.ac.uk

† King’s Business School, King’s College London, UK. E-mail: george.kapetanios@kcl.ac.uk

1 Introduction
The technological innovations in information processing and the increased storage capability
have made various data sources available. More granular and comprehensive data can help
to pose new sorts of questions and enable novel research designs that are informative about
the consequences of different economic policies and events (Varian, 2014). To this end,
statisticians and economists are now confronted with such alternative data sources which are
often generated as a by-product of procedures not directly related to conventional statistical
processes. Such data (often referred to as “Big Data”) can range from financial transactions
to customer reviews and newspaper articles and come with many distinct characteristics. A
fundamental one is that such data are available in real time. The ability to capture and process timely
data is crucial especially for policy makers who look for ways to track economic activity and
set the future policy path.
While this is very powerful, it is also challenging. In most cases, this information flows
with less structure and higher dimensionality. How to organize unstructured data, and
whether the way we impose structure matters, remain open questions in empirical research,
and many approaches have been suggested in the econometric literature (see Hastie et al.
(2015); Stock and Watson (2016); Primiceri et al. (2020), among others).
Factor analysis can be considered one of the pioneering approaches in the field of
applied macroeconomics since the early 2000s. It offers an effective way of analyzing and
predicting economic activity by summarizing the information contained in panels of
macroeconomic time series (see e.g. the survey by Stock and Watson (2016), and references
therein)1 . In most empirical experiments using this type of model, the dataset arrives
in “rectangular” form, with T observations and N variables with common movements (N
typically a lot smaller than T ).
1
In fact, factor models first gained their popularity in the early decades of the last century as a dimension
reduction technique used in the field of psychometrics. From there on, they have gradually become a classical
method for the statistical analysis of all sorts of complex datasets in many fields of the natural and social
sciences (see e.g. Lawley and Maxwell (1971)).

To fix ideas, the N variables yit , for i = 1 . . . N and t = 1 . . . T where t refers to the
time index, are each assumed to be the sum of two unobservable orthogonal components:
one component resulting from the factors that are common to the set of variables, χi,t , and
an idiosyncratic component ξit which captures the shocks specific to each of the variables.
The component χi,t is obtained by extracting a small number r ≥ 1 of common factors Fj,t ,
j = 1 . . . r from all of the variables present in the data set. The general decomposition is
described as:

\[
y_{i,t} = \chi_{i,t} + \xi_{i,t}
\]

or:

\[
y_{i,t} = \eta_{i,1} F_{1,t} + \cdots + \eta_{i,r} F_{r,t} + \xi_{i,t}
\]

Of course, many variants of the above specification have been developed over the years, for
example allowing the factors to exhibit time dynamics (dynamic factor model) or imposing
different hypotheses on the variance-covariance matrix of the idiosyncratic components
(e.g. an approximate factor structure).
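To make the decomposition concrete, the short sketch below simulates a small panel with one common factor and recovers it by principal components, the two-step route discussed later in Section 2. The sizes, seeds and noise scale are illustrative assumptions, not values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, r = 200, 50, 1                      # periods, series, number of factors

F = rng.standard_normal((T, r))           # common factor(s)
eta = rng.standard_normal((N, r))         # loadings
xi = 0.5 * rng.standard_normal((T, N))    # idiosyncratic noise
y = F @ eta.T + xi                        # y_{i,t} = eta_i' F_t + xi_{i,t}

# Principal components estimate of the factor space:
y_std = (y - y.mean(0)) / y.std(0)        # standardise, as is common practice
U, S, Vt = np.linalg.svd(y_std, full_matrices=False)
F_hat = U[:, :r] * S[:r]                  # first r principal components

# The estimated factor is identified only up to sign and scale:
print(abs(np.corrcoef(F[:, 0], F_hat[:, 0])[0, 1]))  # close to 1
```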
When data record a sequence of events with no further structure, there are a number
of ways to summarise this rich information and extract value and knowledge. Most of the
studies considered so far make use only of standard high-dimensional time series observables
without including any “big” data type2 . And as Diebold (2003) accurately foresaw,
“although dynamic factor models don’t analyze really Big Data, they certainly represent a
movement of macroeconometrics in that direction”.
In this paper we investigate how auxiliary series derived from big data sources, which
arrive in unstructured form, can be combined with standard macroeconomic time
series in a high dimensional multivariate structural time series model. We propose a method
which allows us to exploit the high-frequency and/or real-time information of the auxiliary se-
ries, adopting a common state space approach where the system is estimated via maximum
2
There is no unified definition of the term “big data”. It usually refers to datasets that are not only
big, but also high in variety and complexity, which makes them difficult to handle with traditional statistical
tools. See Shi (2014) for a relevant discussion.

likelihood. In this sense, our paper sits among those in the literature which try
to use real-time data to continuously update measurements and forecasts of lower frequency
objects.
Additionally, we follow the approach of Cajner et al. (2019) and take the average of the
unstructured series at each time t, to test whether this alternative data composition is a
viable solution to the problem of high dimensionality.
In the empirical application, we consider newspaper article scores derived from UK
newspapers in order to forecast key economic variables such as output growth, inflation and
the unemployment rate. Results show that the estimated factors achieve significant forecast im-
provements up to one year ahead compared to a standard AR benchmark. Taking the average does
not necessarily deliver significant improvements, which implies that the structure of the
data does matter when it comes to economic forecasting. A standard balanced dataset with
macroeconomic and financial variables is also used as another model comparator. The evi-
dence shows that including a number of unstructured text-based datasets provides additional
forecast gains, especially at longer forecast horizons.
Due to the large number of parameters to be estimated, we further develop two different
estimation procedures which incorporate model selection and sparsity. In particular, we estimate
the model with penalised maximum likelihood using two well-known penalties, i.e. the l1 norm
(Tibshirani, 1996) and the l2 penalty (Hoerl and Kennard, 1970). Results exhibit marginal
predictive improvements when incorporating a shrinkage mechanism, though without showing
a clear preference for either penalty.
We further delve into the relationship between the derived factors via Monte Carlo simulations
and conclude that the average can be an adequate alternative when we impose the same
variance on the unstructured data. The Monte Carlo exercise also shows that our
method can yield large improvements in terms of the Mean Squared Forecast Error (MSFE) of
the unobserved components, of up to 70%.
In the remainder of this paper we first present the related literature and background
in Section 2. Section 3 describes the different models considered, while Section 4 presents a
relevant discussion. In Section 5, we examine the relationship between the different data structures
through a Monte Carlo experiment. Section 6 displays the empirical results. Finally, Section
7 concludes.

2 Literature and Background


A number of econometric methods have been proposed in the literature for working in data-
rich environments. In applied macroeconomics, factor models have been the workhorse
of this revolution since Stock and Watson’s (2002) seminal paper. The central idea of these
models is to summarize the information contained in a large number N of predictors in a
small number of factors common to the set of variables. Since then, a vast literature has
developed exploring various model specifications, based on the observation that a few latent
factors can account for much of the dynamic behavior of major economic aggregates (see
Barhoumi et al. (2014); Bai and Wang (2016) for detailed reviews of the methods).
The most common way to represent the factor dynamics is to consider a state space
formulation of the model (restricted factor analysis). Based on this formulation and the use
we want to make of it, two distinct schools have emerged to estimate the factors and the
model parameters: first, two-step estimators based on Principal Component Analysis (PCA)
and a VAR on the estimated factors (Bai and Ng, 2006; Forni and Lippi, 2001), and second,
on-line techniques based on the Kalman Filter (KF) and Maximum Likelihood (Doz et al., 2012,
2011). The latter method is usually preferred when we deal with missing values, in particular
at the end of the sample, often due to the delay in macroeconomic data releases, or synthesize
information from variables observed at different frequencies (Bańbura and Modugno, 2014;
Mariano and Murasawa, 2003).
However, all of the above studies build models which are able to handle small/medium
datasets consisting mainly of three typical sources: hard data, soft data (mostly surveys)
or financial observables. With the recent explosion of the so-called “alternative datasets”
(e.g. text, Google searches, satellite data) and the potential gains from their use (Varian,
2014), econometricians are asked to redefine the existing models to handle the size and the
complexity inherent in these datasets. One advantage of our proposed model over the above
studies is that it integrates traditional time series with unstructured datasets, to improve
inference on unobserved components on the one hand, and to test the usefulness of this new
source of information on the other.
The literature that explores alternative data sources to discover useful macroeconomic
insights is in its infancy. An example can be found in the work of Cajner et al. (2019), who
investigate the usefulness of unstructured payroll micro data as a supplement to standard
employment statistics for tracking the underlying state of the labor market. Their study
is also based on the state space approach using Kalman filter techniques. However, their
application does not consider the granular information of the dataset but a time series
extracted from it. The study of Aparicio and Bertolotto (2019) shows that inflation forecasts
can be improved by extracting an aggregate measure of inflation using online prices. Google
data has also proven useful for improving macroeconomic flash estimates. For instance,
Ferrara and Simoni (2019) suggest that this specific type of data provides useful information
for nowcasting GDP in the euro area.
Recently, a lot of emphasis has been put on the possible gains that forecasters can get from
text-based data. In Thorsrud (2018), high frequency time series indicators are extracted
from Norwegian newspapers and used as inputs to a dynamic factor model, while Ardia
et al. (2019) generate a pool of text-based sentiment values using shrinkage techniques. In
addition, Kalamara et al. (2020) extract timely text proxies of sentiment and uncertainty
from newspaper articles to forecast UK economic activity. Conducting a horse race of
methods for turning text into time series, they find that the derived metrics exhibit high and persistent
correlations with other commonly used survey and financial indicators such as consumer
confidence and volatility. They provide evidence that newspaper text in combination with
machine learning methods can provide timely signals of the economic path, especially during
turbulent times.
In our empirical exercise, the unstructured dataset is based on the analysis of Kalamara
et al. (2020) in the following way: for every time period t, t ∈ {1, . . . , T }, there is a different
number of articles kt and each of them carries a (positive or negative) signal which is
captured by applying a set of different text techniques. That said, each technique provides
a different signal for the same article. This enables us to compare the text analysis methods
used in Kalamara et al. (2020), but in a different setting. Instead of creating a text indicator
from the methods considered, we create a high dimensional vector zt with dimension kt ,
where kt is the total number of articles at time t. Generally speaking, this vector can
represent different types of events that occur in an unstructured way in the economy, such
as financial transactions, payroll data, etc. All the above studies provide evidence that
text information holds real promise for predicting economic activity and point out that text can
act as a complement or alternative to soft information.
While there is evidence that alternative datasets carry invaluable information about the
economic environment, there is no clear answer on the best way to handle this
type of data in practice. None of the previous works has explicitly addressed whether and
when incorporating the whole dataset adds greater accuracy, or whether using an aggregate measure
is enough to provide useful information.
To the best of our knowledge, our paper bridges the gap between these literatures. On the one
hand, we suggest a standard factor model exploiting the whole data set, maintaining the
unstructured nature of the data, and compare its predictive performance with that obtained
using an aggregate measure of the “big” series. In that way we can provide better insight into
the question: “Is more data always better?”

3 The Model
Let an n-dimensional balanced panel of time series, observed over T periods, follow a factor
specification given by the equation:

yi,t = ηi′ Ft + ξi,t i = 1, . . . , n, t = 1, . . . , T (3.1)

where Ft and ηi are r-dimensional latent column vectors of factors and factor loadings,
with r << n. The term ηi′ Ft represents the common component while ξi,t the idiosyncratic
component. Ft and ξt are two independent processes. This is a standard model in the
literature and has been extensively studied by Stock and Watson (2002); Giannone et al.
(2008); Ludvigson and Ng (2009); Bai and Wang (2016), among others. In vector notation,
the factor model is equivalent to:
yt = ηFt + ξt (3.2)

where η is the weighting matrix of dimension n × r for t = 1 . . . T .


We now introduce the unstructured data, denoted as zj,t , j = 1, ..., kt , t = 1, ..., T , where
zt is a kt × 1 vector. Note that kt is time-varying: there is a different number of observed
series at each time period that can carry a signal about the economy. The crucial difference
between yt and zt is that zt represents measurements on multiple events that occur during
time period t. These can be any sort of unstructured data, from financial transactions to
newspaper article scores. We assume that zt follows the same factor representation, of the
form:

zt = βt Ft + εt        (3.3)

where different subsets of the vector of factors Ft can enter yt and zt respectively and each
zj,t has its own loading βj . We define K as the maximum dimension of kt for t = 1 . . . T , i.e.
K ≡ max(k1 , . . . , kT ). It holds that K >> T (fat dataset). Similarly, we assume εt and Ft are
orthogonal.
The r-dimensional vector of factors Ft is unobserved and affects both yt and
zt . In the case of time series, the factors are likely to be autocorrelated and therefore the
dynamic evolution of the factors should also be considered. For example, we assume a simple
autoregressive dynamic structure:

Ft = cFt−1 + vt (3.4)

with c being an r × r matrix and vt an r-dimensional zero-mean white noise. The specification
of the model is completed by assuming that the idiosyncratic components ξt and εt are
stationary processes with zero mean and covariance matrices E(ξt ξt′ ) = Σξ = σξ In and
E(εt ε′t ) = Σε = σε IK respectively.
Further, our proposed methodology can accommodate mixed frequencies (for example, yt
can follow a lower frequency and zt a higher one) and/or deal with ragged edges in yt . The
model enables the extraction of a high frequency factor and, thereby, enables nowcasting
and forecasting at either the higher or the lower frequency.
Common methods for the estimation of the parameters and the factors include principal
components. This method is easy to compute, and is consistent under quite general assump-
tions as long as both the cross-section and time dimensions grow large. It suffers, however,
from a large drawback: the data set must be balanced, with the start and end points of the
sample the same across all observable time series. In practice, data are often available
at different time frequencies. A popular approach is therefore to estimate the factor model
using maximum likelihood and standard Kalman filtering techniques. The Kalman filter
allows unbalanced data sets and offers the possibility to smooth missing values (Bańbura
and Modugno, 2014; Doz et al., 2012).
To deal with the missing values in zt we follow the methodology presented in Harvey
(1990). Meanwhile, as shown by Doz et al. (2011), maximum likelihood can still lead to
consistent estimation of the central parameters of the factor model, given by the common
component even when neglecting the idiosyncratic time series dynamics.

3.1 The State Space Formulation

For the extraction of the common factors, we have to further specify the structure of the
model. Key in this framework is to use a model with a state space representation and
estimate the factors using Kalman Filter (Bańbura and Rünstler, 2011; Doz et al., 2011,
2012; Bańbura and Modugno, 2014). Such models can be written as a system with two
types of equations: measurement equations linking observed series to a latent state process,

and transition equations describing the state process dynamics.
Adjusting this to our case, the high dimensional model can be cast in a state space form
as follows. First, the final system of observables is:

\[
\begin{pmatrix} y_t \\ z_t \end{pmatrix}
=
\begin{pmatrix} \eta \\ \beta_t \end{pmatrix} F_t
+
\begin{pmatrix} \xi_t \\ \varepsilon_t \end{pmatrix}
\tag{3.5}
\]

To ease notation, we rewrite:


xt = Λt Ft + ζt (3.6)

where:
xt = (yt′ , zt′ )′ is an (n + kt ) × 1 vector of observables,
Λt = (η ′ , βt′ )′ is an (n + kt ) × r loading matrix,
ζt = (ξt′ , ε′t )′ is an (n + kt ) × 1 vector of idiosyncratic components, and
ζt ∼ i.i.d.(0, Σζ ) where Σζ = diag(Σξ , Σε ). We assume that the idiosyncratic components are
cross-sectionally independent Gaussian white noise processes (exact factor model).
As already noted in Anderson and Deistler (2008), the factors generically admit a finite-order
AR representation; here we assume an AR(1) and write:

\[
F_t = cF_{t-1} + v_t, \qquad v_t \sim \text{w.n. } N(0, I_r)
\tag{3.7}
\]

Equation 3.6 is the measurement equation, which describes the relationship between the
unobservable state (Eq. 3.7) and the observable variables. Ft represents the state vector
while ζt and vt are vectors of measurement and transition errors respectively. The observed
data vector xt has missing values, reflecting the unstructured panel data contained in zt .
We further assume that the system is invariant over time; however, extensions
can be made to allow the η and c matrices to vary over time. It is also assumed that the
measurement and transition errors are mutually uncorrelated, i.e. E(ζt vs′ ) = 0 for all t and s.
Moreover, the number of factors r is assumed known. In practice, r can be estimated from the
data and there is a large literature addressing its consistent estimation (see Bai and
Ng (2002); Kapetanios (2010), among others).
The specification of the state space system is completed by two further assumptions: the
initial value of the r-dimensional state vector is F0 , with initial variance V ar(F0 ) = P0 . We use an exact
initialization for the initial values of the state variables of the sampling error, and a diffuse
initialization for the other state variables. Prior to estimation, variables are transformed to
achieve stationarity and standardised.
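To fix the mechanics of this step, the sketch below implements a single-factor Kalman filter over the stacked vector xt , treating the time-varying dimension kt of zt as missing values (NaNs) in a fixed K-row block, in the spirit of Harvey (1990). It also accumulates the Gaussian log-likelihood via the prediction error decomposition used in Section 4. This is a minimal illustration under our own naming conventions, not the authors' code.

```python
import numpy as np

def kalman_filter(x, Lam, c, Sigma_zeta, P0=1e6):
    """Single-factor Kalman filter with missing observations.

    x          : (T, m) data, NaN where a series is unobserved at time t
    Lam        : (m,) loading vector (one factor, r = 1)
    c          : scalar AR(1) coefficient of the factor, Var(v_t) = 1
    Sigma_zeta : (m,) idiosyncratic variances
    Returns filtered factor means and the Gaussian log-likelihood
    (prediction error decomposition).
    """
    T, m = x.shape
    f, P = 0.0, P0                                 # diffuse-ish initialisation
    f_filt = np.empty(T)
    loglik = 0.0
    for t in range(T):
        f_pred, P_pred = c * f, c * P * c + 1.0    # prediction step
        obs = ~np.isnan(x[t])                      # keep observed entries only
        L, d = Lam[obs], Sigma_zeta[obs]
        v = x[t, obs] - L * f_pred                 # prediction errors v_t
        G = P_pred * np.outer(L, L) + np.diag(d)   # innovation covariance G_t
        Ginv = np.linalg.inv(G)
        K = (P_pred * L) @ Ginv                    # Kalman gain
        f = f_pred + K @ v                         # update step
        P = P_pred - K @ (P_pred * L)
        loglik += -0.5 * (len(v) * np.log(2 * np.pi)
                          + np.linalg.slogdet(G)[1] + v @ Ginv @ v)
        f_filt[t] = f
    return f_filt, loglik
```

In practice one would wrap this log-likelihood in a numerical optimiser over (Λ, c, Σζ ); Section 4 discusses a penalised variant of exactly this objective.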

3.2 An alternative factor model

So far in our model specification the vector zt is high dimensional and incorporates the
raw unstructured dataset. A plausible alternative is to create a composite measure of zt by
calculating the average at each point in time and then use the derived time series as input
into the state space model. Cajner et al. (2019) follow a similar approach to construct a
monthly index for unemployment using payroll data, while Aparicio and Bertolotto (2019)
aggregate online prices to compose a measure of inflation. This practice has a number
of advantages. An immediate one is that the complexity of the system is reduced, which
facilitates the estimation procedure. As such, we define \( z_t^* = \sum_{k=1}^{k_t} z_{k,t} / k_t \), the average of the
multiple events that occur at each point in time. The model in (3.6) and (3.7) is modified
to:
\[
\begin{pmatrix} y_t \\ z_t^* \end{pmatrix}
=
\begin{pmatrix} \eta \\ \beta^* \end{pmatrix} F_t
+
\begin{pmatrix} \xi_t \\ \varepsilon_t^* \end{pmatrix}
\tag{3.8}
\]

and:

x∗t = ΛFt + ζt        (3.9)

where:
x∗t = (yt′ , zt∗ )′ is an (n + 1) × 1 vector of observables,
Λ = (η ′ , β ∗ )′ is an (n + 1) × r loading matrix, and
ζt = (ξt′ , ε∗t )′ is an (n + 1) × 1 vector of idiosyncratic components, where \( \mathrm{Var}(\varepsilon_t^*) = \bar{\sigma}^2_\varepsilon / k_t \).

The dynamic structure of the factors and all other model assumptions remain unchanged,
as described in Section 3. A question that naturally arises is whether this process
transforms the raw unstructured data into a useful aggregate series. It is not obvious that
following the above strategy will result in improved predictive accuracy for the target variable.
In fact, there are a number of studies that examine the best way to reduce the dimensionality
of large datasets (Primiceri et al., 2020), but there is no clear guidance on when and how including
the raw data or an aggregate measure results in the best model performance.
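As a concrete illustration of this aggregation, the snippet below collapses a ragged collection of per-period scores into the series zt∗ and the variance scaling Var(ε∗t ) = σ̄ 2 /kt used above; the list-of-arrays input layout is a hypothetical choice for exposition.

```python
import numpy as np

# Hypothetical ragged data: k_t scores at each period t, with k_t varying.
rng = np.random.default_rng(1)
z = [rng.normal(size=rng.integers(5, 30)) for _ in range(100)]

k_t = np.array([len(zt) for zt in z])          # number of events per period
z_star = np.array([zt.mean() for zt in z])     # composite series z_t^*

# Under homoskedastic idiosyncratic noise with variance sigma2_bar,
# averaging shrinks the idiosyncratic variance by the factor 1/k_t:
sigma2_bar = 1.0
var_eps_star = sigma2_bar / k_t
```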

4 Discussion
The time series model in Section 3, as well as the model presented in Section 3.2, are fitted with
the Kalman filter after putting the model in state space form and setting the errors to
be normally distributed. Assuming normality of the errors is common in state space models
because the parameters of the model are estimated by maximizing a Gaussian log-likelihood
which is derived directly through the Kalman filter. Moreover, under normality, the Kalman
filter yields the minimum variance unbiased estimator of the state variables; and as long as
the state model is linear, even if the true distribution of the error terms is non-Gaussian,
the Kalman filter still provides the minimum variance linear unbiased estimator of the
state variables.3
Maximum likelihood has a number of advantages over its most usual alternative, principal
components analysis. First, it deals efficiently with missing values. Second, maximum
likelihood can enhance the interpretability of the model by imposing restrictions on the
parameters which are supported by theory (Bańbura and Rünstler, 2011; Bańbura and
Modugno, 2014). Third, Doz et al. (2012) prove consistency of the factors and the maximum
likelihood estimates under two different sources of misspecification: omitted serial and cross-
sectional correlation. In particular, they show that as the time (T ) and cross-sectional
dimensions increase, the misspecification effects diminish.
3
In the latter case, attention is restricted to estimators which are linear combinations of the observations,
and the estimator of Ft is then the one which minimises the mean squared error (MSE). Thus it is the
minimum mean square linear estimator (MMSLE) based on observations up to and including time t. This
estimator is unconditionally unbiased and the unconditional covariance matrix of the estimation error is
again given by the Kalman filter.

Even though we assume an exact factor model as the benchmark specification (i.e. cross-
sectionally independent and homoskedastic errors), the number of parameters to be esti-
mated remains high. Estimation of the parameters by maximum likelihood poses a practical
limitation on the number of parameters that can be estimated (Harvey, 1990).
One possible alternative is to introduce some sort of sparsity on the relevant parameters.
In recent years, penalisation techniques have been explored in various sparse estimation and
modelling tasks (Friedman et al., 2001). With sparsity, variable selection can improve
estimation accuracy by effectively identifying the subset of important predictors/“events”
and can enhance model interpretability through a parsimonious representation. It can also help
reduce the computational cost when sparsity is very high. To fix ideas, we assume that
the log-likelihood takes the form:

\[
L(\Lambda, \Sigma_\zeta) = -\frac{(K+n)T}{2}\log 2\pi \;-\; \frac{1}{2}\sum_{t=1}^{T}\log|G_t| \;-\; \frac{1}{2}\sum_{t=1}^{T} v_t' G_t^{-1} v_t
\]

where \( v_t = x_t - \hat{x}_{t|t-1} \), t = 1 . . . T , and \( G_t = \Lambda P_{t|t-1}\Lambda' + \Sigma_\zeta \) is the innovation covariance
matrix, with \( P_{t|t-1} \) the predicted state covariance. The above expression is easily delivered
through the Kalman filter and is known as the prediction error decomposition form of the likelihood.
Generally, different penalties can be introduced to produce a regularisation effect on
the parameters. We write:

\[
L(\Lambda_i, \Sigma_\zeta) + Q(p; \Lambda_i)
\tag{4.1}
\]

where typically the penalty function takes the form \( Q(p; \Lambda_i) = p\,\|\Psi\Lambda_i\| \), with Ψ a
diagonal matrix consisting of penalty loadings and ∥⋅∥ a generic norm. For example, ∥⋅∥p
denotes the Lp norm, with \( \|\beta\|_p^p = \sum_{j=1}^{N}|\beta_j|^p \). The penalty “shrinks” some
of the parameters in Λi towards zero and hence can cope with the potentially large dimension
of Λi .
There has been an evolving literature in the last decade devoted to the development
of various shrinkage estimators, such as the Ridge, Lasso and generally Lp-norm estimators,
where Ψ is taken to be the (K + n) × (K + n) identity matrix. All of them have been widely
applied for simultaneously selecting important variables and estimating their effects in high
dimensional statistical inference.
In the following empirical application we test the forecasting gains, in terms of RMSE, from
maximising the penalised likelihood. More specifically, we apply two well-known
penalties: the quadratic or l2 penalty (Hoerl and Kennard, 1970) and the l1 norm (Tibshirani, 1996).
The penalised likelihood then takes, respectively, the following forms.

Ridge penalty:

\[
L(\Lambda, \Sigma_\zeta) = -\frac{(K+n)T}{2}\log 2\pi - \frac{1}{2}\sum_{t=1}^{T}\log|G_t| - \frac{1}{2}\sum_{t=1}^{T} v_t' G_t^{-1} v_t + p\sum_{t=1}^{T}\sum_{i=1}^{K+n}\Lambda_{it}^2
\tag{4.2}
\]

Lasso penalty:

\[
L(\Lambda, \Sigma_\zeta) = -\frac{(K+n)T}{2}\log 2\pi - \frac{1}{2}\sum_{t=1}^{T}\log|G_t| - \frac{1}{2}\sum_{t=1}^{T} v_t' G_t^{-1} v_t + p\sum_{t=1}^{T}\sum_{i=1}^{K+n}|\Lambda_{it}|
\]

Fan and Li (2001) show that a good penalty term should satisfy at least three properties:
unbiasedness (the estimator should be nearly unbiased for parameters with large values),
sparsity (the estimator automatically shrinks small estimated parameters to zero), and con-
tinuity (the estimator is continuous in the data, to avoid instability in prediction). Although
the Lasso and Ridge penalties do not satisfy all three properties, they are still widely used
due to their simplicity and computational efficiency.
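A minimal sketch of this penalised estimation follows, reusing the `kalman_filter` routine sketched in Section 3.1; the optimiser, starting values and fixed idiosyncratic variances are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

def penalised_negloglik(theta, x, Sigma_zeta, p=1.0, penalty="l1"):
    """Negative Gaussian log-likelihood plus an l1 or l2 penalty on the loadings."""
    c, Lam = theta[0], theta[1:]
    _, loglik = kalman_filter(x, Lam, c, Sigma_zeta)   # from the Section 3.1 sketch
    pen = p * (np.abs(Lam).sum() if penalty == "l1" else (Lam ** 2).sum())
    return -loglik + pen

# Illustrative call: x is the (T, n + K) stacked panel with NaNs in the z-block.
# theta0 = np.r_[0.5, np.ones(x.shape[1])]
# res = minimize(penalised_negloglik, theta0,
#                args=(x, np.ones(x.shape[1]), 1.0, "l1"), method="L-BFGS-B")
```

Note that L-BFGS-B treats the non-smooth l1 term only approximately; a proximal-gradient or coordinate-descent scheme would be the more careful choice for the Lasso penalty.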

5 When is z̄ a good alternative? Results from a simulation study
We now investigate further the relationship between the derived factor Ft and the different
forms the unstructured dataset zt can take. Since the goal is to model the interrelationships
among different events, we focus primarily on the variance and covariance rather than the
mean. We are particularly interested in examining the behaviour of our estimators under
different specifications of the variance of the unstructured dataset. In what follows, we omit
the subscript t for simplicity. Recall that the unstructured data set is defined as:

\[
z_i = BF + \varepsilon_i, \qquad F = cF_{-1} + u
\]

where F ∼ N (0, 1), εi ∼ N (0, σi2 ) and, for simplicity, B = 1. F and εi are independent of
each other, i.e. Cov(F, εi ) = 0 for all i = 1, . . . , K. Let the average of the unstructured
dataset be \( \bar{z} = \frac{1}{K}\sum_{i=1}^{K} z_i \). We wish to compare the conditional variance of F given
z1 , ..., zK , i.e. V ar(F ∣z1 , ..., zK ), to the conditional variance of F given the average z̄, i.e.
V ar(F ∣z̄). To set ideas, we can write:

\[
\mathrm{Var}(z_i) = 1 + \sigma_i^2, \qquad
\mathrm{Var}(\bar{z}) = 1 + \frac{1}{K^2}\sum_{i=1}^{K}\sigma_i^2, \qquad
\mathrm{Cov}(z_i, F) = 1, \qquad
\mathrm{Cov}(\bar{z}, F) = 1.
\]

Following Harvey (1990), Lemma 3.A2.1, if F and z̄ are jointly multivariate normal, then
the distribution of F conditional on z̄ is also normal, with variance Var(F ∣z̄) =
Var(F ) − Cov(F, z̄)Var(z̄)−1 Cov(z̄, F ). This leads to:

\[
\mathrm{Var}(F \mid \bar{z}) = 1 - \frac{1}{1 + \frac{1}{K^2}\sum_{i=1}^{K}\sigma_i^2}.
\]

Equivalently, the conditional variance of F given the whole dataset z is given by:

\[
\mathrm{Var}(F \mid z_1, ..., z_K) = 1 - (1, ..., 1)
\begin{pmatrix}
1 + \sigma_1^2 & 1 & \cdots & 1 \\
1 & 1 + \sigma_2^2 & \cdots & 1 \\
\vdots & \vdots & \ddots & \vdots \\
1 & \cdots & 1 & 1 + \sigma_K^2
\end{pmatrix}^{-1}
\begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}
\]

We first examine the case where σi ≠ σj for all i, j ∈ {1, . . . , K}. We set K = 2 and obtain:

\[
\mathrm{Var}(F \mid z_1, z_2)
= 1 - \frac{\sigma_1^2 + \sigma_2^2}{(1 + \sigma_1^2)(1 + \sigma_2^2) - 1}
= \frac{(1 + \sigma_1^2)(1 + \sigma_2^2) - 1 - \sigma_1^2 - \sigma_2^2}{(1 + \sigma_1^2)(1 + \sigma_2^2) - 1}
= \frac{\sigma_1^2\sigma_2^2}{\sigma_1^2\sigma_2^2 + \sigma_1^2 + \sigma_2^2}
= \frac{1}{1 + \frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2}}
\]

Equivalently, the conditional variance with respect to the average:

\[
\mathrm{Var}(F \mid \bar{z})
= 1 - \frac{1}{1 + \frac{1}{4}(\sigma_1^2 + \sigma_2^2)}
= \frac{\frac{1}{4}(\sigma_1^2 + \sigma_2^2)}{1 + \frac{1}{4}(\sigma_1^2 + \sigma_2^2)}
= \frac{1}{1 + \frac{4}{\sigma_1^2 + \sigma_2^2}}
\]

Assume now that V ar(F ∣z̄) ≥ V ar(F ∣z1 , z2 ). Then, we have that:

\[
\frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2} \ge \frac{4}{\sigma_1^2 + \sigma_2^2}
\;\Longleftrightarrow\;
\frac{\sigma_2^2(\sigma_1^2 + \sigma_2^2) + \sigma_1^2(\sigma_1^2 + \sigma_2^2) - 4\sigma_1^2\sigma_2^2}{\sigma_1^2\sigma_2^2(\sigma_1^2 + \sigma_2^2)} \ge 0
\]

or

\[
\sigma_2^4 + \sigma_1^4 - 2\sigma_1^2\sigma_2^2 \ge 0
\;\Longleftrightarrow\;
(\sigma_2^2 - \sigma_1^2)^2 \ge 0
\]

which holds as a strict inequality for all σ22 ≠ σ12 . The result generalises for K > 2 using the
Sherman-Morrison formula 4 .
In particular, Var(F ∣z1 . . . zK ) becomes:

\[
\mathrm{Var}(F \mid z_1, ..., z_K) = 1 - (1, ..., 1)
\begin{pmatrix}
1 + \sigma_1^2 & 1 & \cdots & 1 \\
1 & 1 + \sigma_2^2 & \cdots & 1 \\
\vdots & \vdots & \ddots & \vdots \\
1 & \cdots & 1 & 1 + \sigma_K^2
\end{pmatrix}^{-1}
\begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}
\tag{5.1}
\]

where, writing the matrix as \( D + \iota\iota' \) with \( D = \mathrm{diag}(\sigma_1^2, ..., \sigma_K^2) \) and \( \iota \) a vector of ones, the inverse is:

\[
(D + \iota\iota')^{-1} = D^{-1} - \frac{D^{-1}\iota\iota' D^{-1}}{1 + \sum_{i=1}^{K}\frac{1}{\sigma_i^2}},
\qquad
\left(D^{-1}\iota\iota' D^{-1}\right)_{ij} = \frac{1}{\sigma_i^2\sigma_j^2}.
\]

And thus, the second term in (5.1) takes the following form:

\[
(1, ..., 1)\,(D + \iota\iota')^{-1}
\begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}
= \sum_{i=1}^{K}\frac{1}{\sigma_i^2}
- \frac{\sum_{i=1}^{K}\sum_{j=1}^{K}\frac{1}{\sigma_i^2\sigma_j^2}}{1 + \sum_{i=1}^{K}\frac{1}{\sigma_i^2}}
\]

Therefore, the conditional variance of F given z is:

\[
\mathrm{Var}(F \mid z_1, ..., z_K) = 1 - \sum_{i=1}^{K}\frac{1}{\sigma_i^2}
+ \frac{\sum_{i=1}^{K}\sum_{j=1}^{K}\frac{1}{\sigma_i^2\sigma_j^2}}{1 + \sum_{i=1}^{K}\frac{1}{\sigma_i^2}}.
\]

4
The Sherman-Morrison formula is given by:

\[
(A + uv')^{-1} = A^{-1} - \frac{A^{-1}uv'A^{-1}}{1 + v'A^{-1}u}
\]

where A is invertible and u, v are vectors.

If V ar(F ∣z̄) ≥ V ar(F ∣z1 , ..., zK ), then it holds that:

\[
\sum_{i=1}^{K}\frac{1}{\sigma_i^2}
- \frac{\sum_{i=1}^{K}\sum_{j=1}^{K}\frac{1}{\sigma_i^2\sigma_j^2}}{1 + \sum_{i=1}^{K}\frac{1}{\sigma_i^2}}
\;\ge\; \frac{1}{1 + \frac{1}{K^2}\sum_{i=1}^{K}\sigma_i^2}
\]

or equally:

\[
\sum_{i=1}^{K}\frac{1}{\sigma_i^2}
- \frac{\sum_{i=1}^{K}\sum_{j=1}^{K}\frac{1}{\sigma_i^2\sigma_j^2}}{1 + \sum_{i=1}^{K}\frac{1}{\sigma_i^2}}
- \frac{1}{1 + \frac{1}{K^2}\sum_{i=1}^{K}\sigma_i^2} \;\ge\; 0
\]
We now turn to the case where σi2 = σ 2 for all i = 1, . . . , K. The above equation takes the form:

\[
\frac{K}{\sigma^2} - \frac{\frac{K^2}{\sigma^4}}{1 + \frac{K}{\sigma^2}} - \frac{1}{1 + \frac{\sigma^2}{K}} \ge 0
\tag{5.2}
\]

We set \( \alpha = K/\sigma^2 \). Then, equation 5.2 becomes:

\[
\alpha - \frac{\alpha^2}{1 + \alpha} - \frac{1}{1 + \frac{1}{\alpha}} \ge 0
\]

or

\[
\alpha(1 + \alpha)\left(1 + \frac{1}{\alpha}\right) - \alpha^2\left(1 + \frac{1}{\alpha}\right) - (1 + \alpha) \ge 0
\]

or

\[
\alpha + \alpha^2 + 1 + \alpha - \alpha^2 - \alpha - 1 - \alpha \ge 0
\]

But \( \alpha + \alpha^2 + 1 + \alpha - \alpha^2 - \alpha - 1 - \alpha = 0 \). Therefore V ar(F ∣z̄) = V ar(F ∣z1 , ..., zK ), which
completes the proof of our result. As such, using z̄ entails no loss of information about the
factor under the assumption of equal variances among the idiosyncratic components.
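The equality under equal variances, and the strict inequality under heterogeneous ones, can be verified numerically; the sketch below compares the two conditional variances by direct inversion, with illustrative variance draws.

```python
import numpy as np

def var_given_z(sigma2):
    """Var(F | z_1,...,z_K) by direct inversion of Var(z) = 11' + diag(sigma2)."""
    K = len(sigma2)
    ones = np.ones(K)
    V = np.outer(ones, ones) + np.diag(sigma2)
    return 1.0 - ones @ np.linalg.solve(V, ones)

def var_given_zbar(sigma2):
    """Var(F | z-bar), using Var(z-bar) = 1 + (1/K^2) * sum(sigma2)."""
    K = len(sigma2)
    return 1.0 - 1.0 / (1.0 + sigma2.sum() / K**2)

equal = np.full(10, 0.5)                              # homoskedastic case
hetero = np.random.default_rng(2).uniform(1, 3, 10)   # heteroskedastic case

print(var_given_z(equal), var_given_zbar(equal))      # identical values
print(var_given_z(hetero), var_given_zbar(hetero))    # z-bar strictly larger
```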

5.1 Simulation study

We support the above result through the following experiment. More specifically, we conduct
a Monte Carlo simulation study in order to elucidate to what extent the unstructured dataset
zt can provide gains in the predictive accuracy of the unobserved components of interest,
with respect to the structure we impose on the error variance.
First, we consider as yt a balanced dataset of dimension n = 50, and the total set of
observables xt is driven by one common factor (r = 1). Throughout the experiment, we let
K ∈ {50, 100, 500, 1000}, T ∈ {50, 100, 500, 1000}, and we simulate data according to (3.5),
(3.6), (3.7) as follows (a simulation sketch in code follows the list):


• the loadings Λ = (η ′ , βt′ ) , η = 1, βt ∼ N (1, 1)
n×1 K×1

• parameter c = r × I where r = { 0.9}

• ut ∼ N (0, I)

• We consider two different specifications of the idiosyncratic components. First, we


assume homoskedasticity of the errors for the unstructured dataset zt and consider as
a baseline structure, i.e.:

⎛ξt ⎞ ⎛ ⎛1 0 ⎞⎞
1. ⎜ ⎟ ∼ N ⎜0, ⎜ ⎟⎟ where σ = {0.5}
⎝t ⎠ ⎝ ⎝0 σIK ⎠⎠

• Next, we report results where we relax the homoskedasticity assumption as not being
always realistic and get:

⎛ξt ⎞ ⎛ ⎛1 0 ⎞⎞
2. ⎜ ⎟ ∼ N ⎜0, ⎜ ⎟⎟, where Ω ∼ U (1, 3)
⎝t ⎠ ⎝ ⎝0 diag(Ω)⎠⎠

We define as “HSS ” the high dimensional state space model, i.e. the one including the
unstructured information:

\[
x_t = \Lambda F_t + \zeta_t, \qquad F_t = cF_{t-1} + u_t
\]

As comparators, we define as “Model 1” the model which includes only the n-dimensional
balanced dataset yt , and as “Model 2” the process that includes the average of the auxiliary
series zt :

\[
x_t^* = \begin{pmatrix} y_t \\ z_t^* \end{pmatrix}
= \begin{pmatrix} \eta \\ \beta^* \end{pmatrix} F_t
+ \begin{pmatrix} \xi_t \\ \varepsilon_t^* \end{pmatrix}
\]

where \( z_t^* = \sum_{k=1}^{k_t} z_{k,t} / k_t \) and \( \mathrm{Var}(\varepsilon_t^*) = \bar{\sigma}_t^2 / k_t \).

Figure 5.1 shows the ratio of the conditional variance of the factor using z over its
counterpart using z̄, i.e. var(F ∣z1 . . . zkt )/var(F ∣z̄), in the presence of heteroskedastic
idiosyncratic components. Results are averaged across 200 Monte Carlo simulations. The
ratio tends to decrease with the size of the large dataset. The improvement is more pronounced
after K = 200, where the ratio exhibits a sharp decline of up to 30% and then follows a steady
trend. We therefore conclude that our method allows us to accommodate heterogeneity in the
variances of the auxiliary series errors, which boosts its performance with respect to “Model 2”.

[Figure: line plot, ratio declining from about 0.55 at K = 50 to about 0.15 at K = 1000]

Figure 5.1: Ratio of var(F ∣z1 . . . zkt ) over var(F ∣z̄) (y-axis) across different sizes of the
unstructured dataset (x-axis). Results are averaged across 200 Monte Carlo simulations.

Let F̂t now be the estimate of the factor based on the information set at time t. Table
5.1 shows the relative RMSEs over different specifications, averaged over time. Both
comparators, “Model 1” and “Model 2”, are considered as benchmark models. As such, values
less than 1 indicate higher performance for the high dimensional state space methodology
compared to the relevant benchmark. This ensures that the simulation results are directly
comparable and that any deterioration or improvement in the performance of our method can
only be attributed to the structure of the data we choose. The top panel shows results when we
consider homoskedastic idiosyncratic components of the unstructured dataset while the bottom
panel reports findings when we assume heteroskedastic errors.
Looking at the former case, we broadly notice that the RMSE ratios deliver values lower
than one for both benchmark models. Clearly, we achieve higher performance with respect
to the model that does not include any auxiliary series, and the RMSE decrease is more
prominent as we increase the dimensionality of the data in both directions. When we look
at the ratios with the average as a benchmark, we still arrive at significant gains, close to
35%. Notably, in this case changes in the sample size do not seem to play a significant
role. Instead, it is evident that the reduction in the relative RMSEs remains at the same
level no matter the size specification. This implies that the average does hold some power,
but the higher gains in predictive performance arrive only when we opt for the more granular
and possibly more informative content.
When we assume heteroskedasticity of the errors, the results are generally in line with the
previous specification when we consider the balanced factor model as a benchmark. However,
we now report a greater reduction in the relative RMSEs with “Model 2” as benchmark. This
implies that accounting for unequal variances affects the explanatory power of our model and
results in more accurate estimates, as supported by the theoretical evidence in Section 5.
For completeness, in Appendix B, we report and examine additional simulation results,
such as correlations, together with the bias and variance of the Kalman filter estimator F̂t
for the different simulation settings and models.

True Parameters: σ = 0.5

                   |          Model 1           |          Model 2
        max(kt)    |   50    100    500   1000  |   50    100    500   1000
T =   50           | 0.573  0.469  0.265  0.208 | 0.648  0.656  0.650  0.655
     100           | 0.575  0.473  0.268  0.205 | 0.655  0.652  0.657  0.654
     500           | 0.587  0.471  0.268  0.207 | 0.656  0.657  0.655  0.655
    1000           | 0.573  0.470  0.269  0.207 | 0.659  0.657  0.656  0.659

True Parameters: Ω ∼ U(1, 3)

                   |          Model 1           |          Model 2
        max(kt)    |   50    100    500   1000  |   50    100    500   1000
T =   50           | 0.547  0.441  0.259  0.189 | 0.616  0.499  0.293  0.214
     100           | 0.548  0.450  0.260  0.193 | 0.619  0.508  0.295  0.219
     500           | 0.546  0.442  0.257  0.195 | 0.617  0.500  0.291  0.222
    1000           | 0.547  0.447  0.256  0.198 | 0.617  0.505  0.290  0.224

Table 5.1: Average relative RMSE of the high dimensional state space model against the two
benchmark models, by sample size T (rows) and maximum cross-sectional dimension max(kt )
(columns). Model 1 does not include the unstructured dataset (zt ); Model 2 includes the
average of zt .

6 Empirical application
To illustrate the usefulness of our methodology we now present an empirical exercise to
forecast UK real economy variables: gross domestic product (GDP) growth, inflation
and the unemployment rate. We use the text scores created in the study of Kalamara et al.
(2020), which are drawn from three popular UK newspapers, namely the Guardian, the Daily
Mirror and the Daily Mail, and create 15 unstructured datasets corresponding to different
text analysis methods. We divide the discussion of the results into a number of sub-sections,
starting with a brief description of the data and the evaluation design, followed by the results
for the different targets.
6.1 Data

6.1.1 Unstructured Dataset - Text

Text can be seen as another soft type of information which has been found to be extremely
useful especially for macroeconomic forecasting when the official releases of the targets come
with a delay (Thorsrud, 2016; Kalamara et al., 2020; Adriansson and Mattsson, 2015). More-
over, recent literature has shown that text information can provide robust measures of proxies
of economic and financial sentiment (Tetlock et al., 2008; Nyman et al., 2018; Correa et al.,
2017; Nielsen, 2011; Hu et al., 2017; Hu and Liu, 2004) and can act as another type of soft
information along with the more traditional survey data.
We build upon the study of Kalamara et al. (2020) and construct unstructured datasets
following a wide range of text mining techniques. All of them have been created to en-
capsulate economic sentiment and uncertainty using different dictionary approaches. The
complete list of the methods used appears in Table A1. Articles are retrieved through Dow
Jones’s Factiva API 5 .
However, we differ in the way we shape the signals derived from text. In particular,
we consider that for every time period t, t ∈ {1, . . . , T }, there is a different number of articles kt
and each of them carries a (positive or negative) signal which is captured by applying a text
analysis methodology. That said, each text analysis technique provides a different signal for
the same article. Instead of creating a time series text indicator from the methods considered,
as in Kalamara et al. (2020), we create a high dimensional vector zt with dimension kt , where
kt is the total number of articles at time t. Generally speaking, this vector can represent
different types of events that occur in an unstructured way in the economy, such as financial
transactions, payroll data, etc.
5
For a detailed review of the text analysis techniques used, see Kalamara et al. (2020) and references
therein.
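As an illustration of this reshaping, the snippet below stacks ragged per-month article scores into a (T, K) matrix with NaN padding, where K = maxt kt , ready for the missing-value Kalman filter of Section 3; the input layout is hypothetical.

```python
import numpy as np

def build_unstructured_matrix(monthly_scores):
    """Stack ragged per-period article scores into a (T, K) array, NaN-padded.

    monthly_scores : list of 1-d arrays, one per month, of length k_t each
    """
    T = len(monthly_scores)
    K = max(len(s) for s in monthly_scores)    # K = max_t k_t
    z = np.full((T, K), np.nan)                # entries beyond k_t stay missing
    for t, s in enumerate(monthly_scores):
        z[t, :len(s)] = s
    return z
```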

6.1.2 Balanced Dataset - Macro/Fin data

As a benchmark dataset, we use a wide set of macroeconomic and financial variables, all
commonly appearing in the forecasting literature. Specifically, we select “hard” indicators
that are readily available from statistical agencies and meant to capture information about UK
economic activity, such as production, manufacturing, the labour market and the stock market.
It has been empirically shown that such indicators exhibit high correlations with macroeconomic
aggregates and thus are often included in researchers’ toolkits for economic forecasting (Stock
and Watson, 1998; Adriansson and Mattsson, 2015; Giannone et al., 2008).
The variables used in all the exercises are considered at a monthly frequency, starting in
January 2000 until August 2018, purely due to data availability constraints. Although we
do not explicitly take into account the real time release calendar of the input series, in the
design of the out-of-sample forecasting evaluation exercise we only consider the series that
were available at each point in time. All series are seasonally adjusted. Further details on
the extended dataset can be found in the Appendix Table A2.
We compute forecasts for three UK target variables, i.e. the newly released GDP estimate
by the Office for National Statistics, the unemployment rate and the inflation rate.

6.2 Results

6.2.1 Evaluation design

As indicated in Section 6.1.2, the balanced dataset consists of about 50 indicators for
the UK economy, including real variables (such as GDP growth, industrial production and
employment), financial variables, prices, wages, money and credit aggregates from other
sources, and other conjunctural indicators. Our sample period runs from January 1990 to
June 2018. All variables are transformed to induce stationarity in advance6 .
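The exact transformation applied to each series is listed in Appendix Table A2; as a generic illustration, the snippet below shows the two operations such exercises typically rely on, log-differencing for trending series and standardisation prior to factor extraction.

```python
import numpy as np

def log_diff(x, periods=1):
    """Period-on-period log difference (growth rate) of a positive series."""
    x = np.asarray(x, dtype=float)
    return np.log(x[periods:]) - np.log(x[:-periods])

def standardise(x):
    """Zero-mean, unit-variance scaling, as applied before factor extraction."""
    x = np.asarray(x, dtype=float)
    return (x - np.nanmean(x)) / np.nanstd(x)
```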
We can observe “real time” sentiment and uncertainty at each point in time in the econ-
omy which is captured by newspaper articles and described in section 6.1.1. The maximum
6
Details on data transformations for individual series are reported in Appendix Table A2.

number of articles that appeared monthly over the full sample is K = 327 7 .
To examine the performance of our model, we perform three sets of exercises according to
the model we estimate. We first focus on a multivariate linear model using only the balanced
UK indicators, which is a standard model in the literature (Stock and Watson, 1998; Bai and
Wang, 2016).
We then add the unstructured dataset which is the product of a particular text analy-
sis methodology. We assess the additional informational content of the derived text vectors
for forecasting, relative to the flow of information which comes from a standard
balanced factor model. Finally, we compute the averaged time series for each text analysis
technique and incorporate it into the balanced dataset. We check how well the projection on
the common factor tracks the variable of interest when we include just an aggregate measure
instead of the high dimensional system. Under this set up, we can, on the one hand, evaluate
the information that comes from alternative data sources like text, and on the other, provide
some insight into the extent to which the structure of the different datasets plays a significant
role in the forecasting performance.
For the specifications that include some sort of text information (i.e. either including
the raw unstructured dataset or just the average), we are also interested in comparing the
predictive power of each individual text-based factor model. As such, for all exercises and
specifications considered, we compute recursive forecasts for different horizons and use an
AR(p) as a benchmark model8 . The AR(p) benchmark is considered a hard competitor
and allows cross-comparisons across different approaches.
We calculate the relative Root Mean Squared Errors (RMSEs) and use the Diebold and
Mariano (1995) t-statistic, with the adjustment proposed in Harvey et al. (1997), to test for
equal forecasting accuracy. The tables in which we report results are categorised in three
panels: (i) models that include the average of the unstructured datasets (“Average”), (ii)
models that incorporate the raw unstructured dataset (“HSS”) and (iii) models that include
only the balanced macro/fin dataset (“Balanced”).

7
Note that the maximum number is relative to the desired frequency. For example, if the goal was to
estimate a higher frequency factor, the maximum number would be smaller.
8
The order p of the autoregressive process is determined by the Bayesian Information Criterion (BIC).
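A sketch of this evaluation step follows: relative RMSE against the benchmark and the Diebold-Mariano statistic with the Harvey et al. (1997) small-sample adjustment. Squared-error loss is assumed here, and the implementation is ours, not the authors'.

```python
import numpy as np
from scipy import stats

def relative_rmse(e_model, e_bench):
    """RMSE of model forecast errors over RMSE of benchmark forecast errors."""
    return np.sqrt(np.mean(e_model**2)) / np.sqrt(np.mean(e_bench**2))

def dm_test(e_model, e_bench, h=1):
    """Diebold-Mariano test with the Harvey et al. (1997) adjustment.

    e_model, e_bench : forecast errors of the two competing models
    h                : forecast horizon (h-1 autocovariances enter the variance)
    """
    d = e_model**2 - e_bench**2                 # loss differential, squared loss
    T = len(d)
    dbar = d.mean()
    # Long-run variance of dbar using autocovariances up to lag h-1:
    gamma = [np.sum((d[k:] - dbar) * (d[:T - k] - dbar)) / T for k in range(h)]
    var_dbar = (gamma[0] + 2 * sum(gamma[1:])) / T
    dm = dbar / np.sqrt(var_dbar)
    # Harvey, Leybourne and Newbold (1997) small-sample correction:
    adj = np.sqrt((T + 1 - 2 * h + h * (h - 1) / T) / T)
    dm_adj = adj * dm
    pval = 2 * stats.t.sf(abs(dm_adj), df=T - 1)
    return dm_adj, pval
```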
Finally, we repeat the same exercise and report robustness checks and additional results
in the Appendix, using penalised maximum likelihood in the estimation procedure.

6.3 Forecasting the Inflation Rate

We now break down our analysis to examine the performance of our model with respect to
the different targets considered.
Table A3 reports the relative RMSEs for inflation at different forecast horizons, i.e.
h = 1, 3, 6, 9, 12. A number below one indicates that the forecasts are more accurate than
those produced by the AR benchmark.
The results indicate that including some sort of text information succeeds in improving
the forecasting performance compared to the autoregressive benchmark. However, the bal-
anced dataset does not provide significant gains for the inflation forecasts. This suggests
that factors extracted from a large text-based data repository at a monthly frequency are
effective in tracking the inflation rate. Moreover, the longer the forecast horizon, the better
is the relative forecast for some of the text metrics. In particular, the gains in forecasting
performance are evident for datasets that capture uncertainty and sentiment signals, like the
measures of “baker bloom davis” and “vader” respectively.
As is evident, the “Average” model does improve forecasting performance, but only
for short horizons. There are no gains after 9 months, irrespective of the text dataset we
take the average from. This suggests that the average captures some of the timely signal
but offers very little forecastability at longer horizons. Overall, based on our empirical
results, the AR is a hard benchmark to beat for the different factor specifications, a result
which has already been established by many researchers (see Aparicio and Bertolotto (2019);
Eickmeier and Ziegler (2008), among others).

6.4 Forecasting the Unemployment Rate

We now turn to the ability of our model to forecast the unemployment rate using the
unstructured text datasets. As indicated in Table A6, the balanced dataset provides some
predictability, up to 10% compared to the AR benchmark. The gains appear to be higher
(about 30% RMSE reduction against the AR) when we test the high dimensional model with
the unstructured datasets as input variables.
Another interesting feature appears in the same table: although the majority of the
unstructured models that capture sentiment are able to maintain their gains against the
benchmark specification for most of the forecasting horizons, the models that include uncer-
tainty text signals do not seem to add any additional value. This finding is in line with the
study of Kalamara et al. (2020), who report no significant predictive power of the text-based
uncertainty metrics relative to the text-based sentiment series for economic forecasting.
In this exercise, the forecasting improvements using the average of the unstructured
datasets are less obvious compared to the AR benchmark. Notably, only models using the
“loughran”, “tf idf econom”, “punc econom” and “stability” measures seem to provide some
RMSE reduction, but to a limited extent (up to 15% improvements) and only at longer
horizons. Overall, the predictive content of the high dimensional state space model in this
exercise is much stronger compared to the other model specifications.

6.5 Forecasting the GDP growth

We now take a closer look at the GDP growth forecasts. In particular, Table A9 sum-
marises the results for GDP growth according to the different model specifications, as previously
discussed.
Again, an immediate point that arises is that only the high dimensional state space
model (HSS) achieves high gains in terms of forecasting performance with respect to the AR
benchmark. Using the unstructured datasets results in forecast improvements of up to 30%,
whereas the balanced dataset offers limited predictive power for GDP growth (up to 7%).
Similarly, forecasts including only the average do not always outperform the AR(p)
benchmark. This is especially evident at shorter horizons. However, in some cases,
like the “word econom count” model, some small consistent improvements appear over the
benchmark. In this set up, it is not clear whether including the average of the unstructured
dataset provides meaningful gains when compared to the balanced dataset. Generally, the
unstructured auxiliary text metrics seem to enhance the predictive ability for GDP growth
and help the standard macroeconomic time series arrive at bigger forecast error reduc-
tions. The explanatory power over the benchmark is further certified by the DM statistic
for forecast accuracy, especially at longer horizons.

6.6 Robustness results: Maximum Likelihood Estimation (MLE)


vs Penalised Maximum Likelihood

In Section 4, we discussed penalised maximum likelihood as an alternative estimation method
in which the loading coefficients λi may be shrunk to zero. Thus, sparsity and estimation
are handled simultaneously.
The tuning parameter p in (4.1) controls the sparsity of the loading matrix and is therefore
critical in penalisation. Typically, a larger tuning parameter produces a sparser load-
ing matrix and a larger bias, whereas a smaller tuning parameter yields a denser loading
matrix and a lower bias. Tuning parameters are usually selected via AIC, BIC or cross
validation. However, the optimal tuning parameters are “difficult to calibrate in practice”
(Lederer and Müller, 2014) and are “not practically feasible” (Fan and Li, 2001). In fact, in
a general penalized likelihood context, there is still no solid framework for selecting tuning
hyperparameters (Fan and Tang, 2013).
In this application, we are primarily interested in examining how the performance for the
different targets and data specifications changes when introducing some sort of sparsity in
the estimation. As such, we fix the penalty parameter p = 1 for all experiments and do not
proceed to a further exploration of the optimal choice of the penalty. Further, imposing the
same amount of sparsity allows us to conduct direct comparisons across the different model
structures, where the dimensionality of the data may vary significantly.

Overall, the effect of regularisation is more prominent for the high dimensional model
(HSS) and leads to greater RMSE reductions when compared with the standard maximum
likelihood counterpart. Additionally, in some cases results using the Lasso penalty appear
to consistently outperform results applying the Ridge penalty. This implies that eliminating
some coefficients from an unstructured and possibly noisy dataset can prove beneficial
for predictive accuracy and also enhances the model’s interpretability.
The main results for the three target variables considered are reported in the Appendix.
Specifically, Tables A4 and A5 show the results for the inflation rate applying the Lasso and
Ridge penalty, respectively. It is evident that penalisation delivers some RMSE gains for
the “Average” model set, especially at longer horizons. For horizons up to a quarter, even
though the benefit from penalisation is negligible, we notice that we achieve statistically
significant forecast accuracy for at least 5 text averages, which appear to be the same under
both penalty schemes. Penalisation also helps the other two model specifications. Specifically,
for the “HSS”, we achieve a further reduction of the RMSE for most of the different text
datasets, of up to 10% compared to standard MLE, while both Ridge and Lasso offer limited
gains for the “Balanced” dataset.
In Tables A7 and A8, we display the main findings for the unemployment rate. In this
case, penalisation does not always result in greater performance for the “Average” and
the “Balanced” models. In fact, standard MLE delivers marginally higher forecast improve-
ments for the “Balanced” model at all forecast horizons. However, the l1 norm leads to
significant improvements for the “HSS” specification. This becomes particularly evident
after two quarters, where penalisation leads to outperformance of the model compared to the
AR benchmark. The Ridge penalty, however, does not seem to have any strong effect on the
model’s overall performance.
Finally, we report results for GDP growth in Tables A10 and A11 for the two
penalised maximum likelihood methods. Again, in this case, imposing sparsity pays off
only for the “HSS” model specification. Interestingly, the Ridge penalty delivers greater
forecast improvements than the Lasso penalty, especially at shorter horizons, up to two
quarters. Even though we only find marginal gains using the penalised likelihood with the
Lasso penalty, they are statistically significant for several unstructured text datasets,
including “loughran”, “stability” and “tf idf econom”. For “loughran” and “tf idf econom”
we still achieve significant gains compared to the benchmark when we use their averages.
Overall, the empirical evidence shows that penalisation enhances the performance of the
“HSS” for most of the different unstructured text datasets. Forecasts of inflation and GDP
growth benefit the most, while the unemployment rate favours standard MLE. Applying the
penalised likelihood to the “Average” helps especially at longer horizons.

7 Conclusions
This paper establishes a methodology to incorporate a high dimensional panel of unstructured
data in a state space model in order to improve the forecasting performance of key
macroeconomic variables. The method is based on the Kalman filter and maximum likelihood
estimation, which have been widely used in macroeconomic forecasting (Doz et al., 2011;
Bańbura and Rünstler, 2011; Bańbura and Modugno, 2014). The unstructured shape of the
dataset is handled as in Harvey (1990).
We explore to what extent information contained in newspaper articles adds any sig-
nificant power compared to standard economic/financial datasets. The advantage of our
approach lies in its ability to effectively combine traditional time series with auxiliary
series derived from alternative sources, such as newspaper articles. A major benefit of text
data is that they are available at higher frequencies than other forms of soft data and, con-
trary to the latter, they do not suffer from publication delays. This characteristic makes
text an important input to macroeconomic forecasts, as often the only information available
in real time.
A Monte Carlo study shows that our proposed method can improve the MSFE, compared
to a standard state space model, by up to 75%. These results are robust to misspecifications
regarding the distribution of the idiosyncratic components of the unstructured series.
In the empirical application of our method to UK unemployment, inflation and GDP
growth forecasting, we arrive at significant forecast gains when we complement the balanced
dataset with the unstructured dataset. Generally, the results stress the advantage of using
the high-dimensional auxiliary series of textual data despite involving a more complex model
to estimate. We also find that, including the average of the unstructured data over time does
improve forecasts but to a limited extent.
The explanatory power of the monthly textual information is further corroborated by
results from a Diebold and Mariano (1995) test, which indicates significant accuracy, espe-
cially at longer horizons. Results remain robust when we impose different penalties (i.e. l1
and l2 norms) to the maximum likelihood, resulting even to higher predictive performance
to some cases. Hence, our proposed approach provides a concrete framework to analyse the
usefulness of “Big Data” sources.
One limitation of the current paper is that it does not allow for time variation in the
relation between the unobserved component of interest and the auxiliary series. Therefore,
a more structural method is required that extends the current approach by building the
potential for time variation directly into the estimation method, while retaining the ability
to use the full sample. Such extensions are currently under investigation by the authors.

References

Adriansson, N. and I. Mattsson (2015): “Forecasting GDP Growth, or How Can Random Forests Improve Predictions in Economics?”.

Alexopoulos, M., J. Cohen, et al. (2009): “Uncertain times, uncertain measures,” University of Toronto Department of Economics Working Paper, 352.

Anderson, B. D. and M. Deistler (2008): “Generalized linear dynamic factor models – a structure theory,” in 2008 47th IEEE Conference on Decision and Control, IEEE, 1980–1985.

Aparicio, D. and M. I. Bertolotto (2019): “Forecasting inflation with online prices,” International Journal of Forecasting.

Ardia, D., K. Bluteau, and K. Boudt (2019): “Questioning the news about economic growth: Sparse forecasting using thousands of news-based sentiment values,” International Journal of Forecasting.

Bai, J. and S. Ng (2002): “Determining the number of factors in approximate factor models,” Econometrica, 70, 191–221.

——— (2006): “Confidence intervals for diffusion index forecasts and inference for factor-augmented regressions,” Econometrica, 74, 1133–1150.

Bai, J. and P. Wang (2016): “Econometric analysis of large factor models,” Annual Review of Economics, 8, 53–80.

Baker, S. R., N. Bloom, and S. J. Davis (2016): “Measuring economic policy uncertainty,” The Quarterly Journal of Economics, 131, 1593–1636.

Bańbura, M. and M. Modugno (2014): “Maximum likelihood estimation of factor models on datasets with arbitrary pattern of missing data,” Journal of Applied Econometrics, 29, 133–160.

Bańbura, M. and G. Rünstler (2011): “A look into the factor model black box: publication lags and the role of hard and soft data in forecasting GDP,” International Journal of Forecasting, 27, 333–346.

Barhoumi, K., O. Darné, and L. Ferrara (2014): “Dynamic factor models: A review of the literature,” Journal of Business Cycle Research, 2013, 73.

Cajner, T., L. D. Crane, R. A. Decker, A. Hamins-Puertolas, and C. Kurz (2019): “Improving the Accuracy of Economic Measurement with Multiple Data Sources: The Case of Payroll Employment Data,” in Big Data for 21st Century Economic Statistics, University of Chicago Press.

Correa, R., K. Garud, J. M. Londono, N. Mislang, et al. (2017): “Constructing a Dictionary for Financial Stability,” Board of Governors of the Federal Reserve System (US), 6(7), p. 9.

Diebold, F. X. (2003): “Big data dynamic factor models for macroeconomic measurement and forecasting,” in Advances in Economics and Econometrics: Theory and Applications, Eighth World Congress of the Econometric Society (edited by M. Dewatripont, L. P. Hansen and S. Turnovsky), 115–122.

Diebold, F. X. and R. S. Mariano (1995): “Comparing predictive accuracy,” Journal of Business & Economic Statistics, 20, 134–144.

Doz, C., D. Giannone, and L. Reichlin (2011): “A two-step estimator for large approximate dynamic factor models based on Kalman filtering,” Journal of Econometrics, 164, 188–205.

——— (2012): “A quasi-maximum likelihood approach for large, approximate dynamic factor models,” Review of Economics and Statistics, 94, 1014–1024.

Eickmeier, S. and C. Ziegler (2008): “How successful are dynamic factor models at forecasting output and inflation? A meta-analytic approach,” Journal of Forecasting, 27, 237–265.

Fan, J. and R. Li (2001): “Variable selection via nonconcave penalized likelihood and its oracle properties,” Journal of the American Statistical Association, 96, 1348–1360.

Fan, Y. and C. Y. Tang (2013): “Tuning parameter selection in high dimensional penalized likelihood,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 531–552.

Ferrara, L. and A. Simoni (2019): “When are Google data useful to nowcast GDP? An approach via pre-selection and shrinkage”.

Forni, M. and M. Lippi (2001): “The generalised Dynamic Factor Model: Representation Theory,” Econometric Theory, 17, 1113–1141.

Friedman, J., T. Hastie, and R. Tibshirani (2001): The Elements of Statistical Learning, vol. 1, Springer Series in Statistics, New York.

Giannone, D., L. Reichlin, and D. Small (2008): “Nowcasting: The real-time informational content of macroeconomic data,” Journal of Monetary Economics, 55, 665–676.

Gilbert, C. H. E. (2014): “VADER: A parsimonious rule-based model for sentiment analysis of social media text,” in Eighth International Conference on Weblogs and Social Media (ICWSM-14). Available at http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf (accessed 20/04/16).

Harvey, A. (1990): Forecasting, Structural Time Series Models and the Kalman Filter, Cambridge University Press.

Harvey, D., S. Leybourne, and P. Newbold (1997): “Testing the equality of prediction mean squared errors,” International Journal of Forecasting, 13, 281–291.

Hastie, T., R. Tibshirani, and M. Wainwright (2015): Statistical Learning with Sparsity: The Lasso and Generalizations, CRC Press.

Hoerl, A. E. and R. W. Kennard (1970): “Ridge regression: Biased estimation for nonorthogonal problems,” Technometrics, 12, 55–67.

Hu, G., P. Bhargava, S. Fuhrmann, S. Ellinger, and N. Spasojevic (2017): “Analyzing Users’ Sentiment Towards Popular Consumer Industries and Brands on Twitter,” in Data Mining Workshops (ICDMW), 2017 IEEE International Conference on, IEEE, 381–388.

Hu, M. and B. Liu (2004): “Mining and summarizing customer reviews,” in Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 168–177.

Husted, L. F., J. Rogers, and B. Sun (2017): “Monetary Policy Uncertainty,” International Finance Discussion Papers 1215, Board of Governors of the Federal Reserve System (U.S.).

Kalamara, E., A. Turrell, C. Redl, G. Kapetanios, S. Kapadia, et al. (2020): “Making text count: economic forecasting using newspaper text,” Bank of England Staff Working Papers.

Kapetanios, G. (2010): “A Testing Procedure for Determining the Number of Factors in Approximate Factor Models With Large Datasets,” Journal of Business & Economic Statistics, 28, 397–409.

Lawley, D. N. and A. E. Maxwell (1971): Factor Analysis as a Statistical Method, London: Butterworths.

Lederer, J. and C. Müller (2014): “Don’t fall for tuning parameters: tuning-free variable selection in high dimensions with the TREX,” arXiv preprint arXiv:1404.0541.

Loughran, T. and B. McDonald (2013): “IPO first-day returns, offer price revisions, volatility, and form S-1 language,” Journal of Financial Economics, 109, 307–326.

Ludvigson, S. C. and S. Ng (2009): “Macro factors in bond risk premia,” The Review of Financial Studies, 22, 5027–5067.

Mariano, R. S. and Y. Murasawa (2003): “A new coincident index of business cycles based on monthly and quarterly series,” Journal of Applied Econometrics, 18, 427–443.

Nielsen, F. Å. (2011): “A new ANEW: Evaluation of a word list for sentiment analysis in microblogs,” arXiv preprint arXiv:1103.2903.

Nyman, R., S. Kapadia, D. Tuckett, D. Gregory, P. Ormerod, and R. Smith (2018): “News and narratives in financial systems: exploiting big data for systemic risk assessment,” Bank of England Staff Working Papers, 704.

Primiceri, G. E., M. Lenza, D. Giannone, et al. (2020): “Economic Predictions with Big Data: The Illusion of Sparsity,” Tech. rep., forthcoming in Econometrica.

Shi, Y. (2014): “Big data: History, current status, and challenges going forward,” Bridge, 44, 6–11.

Stock, J. H. and M. W. Watson (1998): “A comparison of linear and nonlinear univariate models for forecasting macroeconomic time series,” Tech. rep., National Bureau of Economic Research.

——— (2002): “Forecasting using principal components from a large number of predictors,” Journal of the American Statistical Association, 97, 1167–1179.

——— (2016): “Dynamic factor models, factor-augmented vector autoregressions, and structural vector autoregressions in macroeconomics,” in Handbook of Macroeconomics, Elsevier, vol. 2, 415–525.

Tetlock, P. C. (2007): “Giving content to investor sentiment: The role of media in the stock market,” The Journal of Finance, 62, 1139–1168.

Tetlock, P. C., M. Saar-Tsechansky, and S. Macskassy (2008): “More than words: Quantifying language to measure firms’ fundamentals,” The Journal of Finance, 63, 1437–1467.

Thorsrud, L. A. (2016): “Nowcasting using news topics. Big Data versus big bank”.

——— (2018): “Words are the New Numbers: A Newsy Coincident Index of the Business Cycle,” Journal of Business & Economic Statistics, 1–35.

Tibshirani, R. (1996): “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society, Series B (Methodological), 267–288.

Varian, H. R. (2014): “Big data: New tricks for econometrics,” Journal of Economic Perspectives, 28, 3–28.
Appendix

A Data

A.1 Unstructured data: Text analysis methods

Table A1 summarises the text analytics methods considered to extract information from
newspaper articles and, in turn, construct the unstructured vectors kt . For more details on
the pre-processing of the raw text and the different techniques, see the study of Kalamara
et al. (2020).
Positive and negative dictionary:
 - Financial stability (Correa et al., 2017)
 - Finance oriented (Loughran and McDonald, 2013)
 - Afinn sentiment (Nielsen, 2011)
 - Harvard IV (used in Tetlock (2007))
 - Anxiety-excitement (Nyman et al., 2018)

Boolean:
 - Economic Uncertainty (Alexopoulos et al., 2009)
 - Monetary policy uncertainty (Husted et al., 2017)
 - Economic Policy Uncertainty (Baker et al., 2016)
 - Single word counts of “uncertain”, “econom”, and “sustainab”
 - Term frequency-inverse document frequency (tf-idf) on “uncertain” and “econom”

Computer science-based:
 - VADER sentiment (Gilbert, 2014)
 - ‘Opinion’ sentiment (Hu and Liu, 2004; Hu et al., 2017)
 - Sentence sentiment
 - Punctuation sentiment

Table A1: The three broad categories of text metrics used
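As a stylised illustration of the first category, a dictionary-based metric scores each article by counting matches against positive and negative word lists; the word lists and scoring rule below are illustrative toys, not those of the cited dictionaries:

```python
def dictionary_sentiment(tokens, positive, negative):
    """Net dictionary sentiment: (#positive - #negative) / #tokens."""
    pos = sum(tok in positive for tok in tokens)
    neg = sum(tok in negative for tok in tokens)
    return (pos - neg) / max(len(tokens), 1)

# Toy example with made-up word lists:
# dictionary_sentiment("growth remains stable despite the crisis".split(),
#                      positive={"growth", "stable"}, negative={"crisis"})
```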

A.2 Balanced data: Macro/Fin indicators

Table A2: Balanced dataset with Macro/Fin Data

Code Variable Name Source Transf


1 IoS: Services, Index ONS LD
2 PNDS: Private Non-Distribution Services: Index ONS LD
3 IoS: G: Wholesales, Retail and Motor Trade: Index ONS LD
4 IoS: 47: Retail trade except of motor vehicles and motorcycles: Index ONS LD
5 IoS: 46: Wholesale trade except of motor vehicles and motorcycles: Index ONS LD
6 IoS: 45: Wholesale And Retail Trade And Repair Of Motor Vehicles And Motorcycles: Index ONS LD
7 IoS: O-Q: PAD, Education and Health Index ONS LD
8 IoP:Production ONS LD
9 IoP:Manufacturing ONS LD
10 Energy output (utilities plus extraction): Index ONS LD
11 IoP: SIC07 Output Index D-E: Utilities: Electricity, Gas, Water Supply, Waste Management. ONS LD
12 IoP: B: Mining and Quarrying ONS LD
13 RSI:VolumeAll Retailers inc fuel:All Business Index ONS LD
14 Construction Output: Seasonally Adjusted: Volume: All Work ONS LD
15 BOP Total Exports (Goods) ONS LD
16 BOP:EX:volume index:SA:Total Trade in Goods ONS LD
17 BOP Total Imports (Goods) ONS LD
18 BOP:IM:volume index:SA:Total Trade in Goods ONS LD
19 CPI all items ONS LDD
20 RPI all items ONS LDD
21 RPI ex Mortgages Interest Payments (RPIX) ONS LDD
22 PPI Output ONS LDD
23 PPI Input ONS LDD
24 Nationwide House Price MoM BoE database D
25 RICS House Price Balance BoE database D
26 M4 Money Supply BoE database LD
27 New Mortgage Approvals BoE database LD
28 Bank of England UK Mortgage Approvals BoE database LD
29 Average Weekly Earnings ONS LD
30 LFS Unemployment Rate ONS D
31 LFS Number of Employees (Total) ONS LD
32 Claimant Count Rate ONS D
33 New Cars Registrations BoE database LD
34 Oil Brent BoE database LD
35 UK mortgage base rate BoE database L
36 3m LIBOR BoE database L
37 FTSE all share BoE database LD
38 Sterling exchange rate index BoE database LD
39 FTSE volatility BoE database LD
40 GBP EUR spot BoE database LD
41 GBP USD spot BoE database LD
42 FTSE 250 INDEX BoE database LD
43 FTSE All Share BoE database LD
44 UK focused BoE database LD
45 S&P 500 BoE database LD
46 Euro Stoxx BoE database LD
47 Sterling ERI BoE database LD
48 VIX BoE database LD
49 UK VIX - FTSE 100 VOLATILITY INDEX - PRICE INDEX BoE database LD

Note: Sources are the Office for National Statistics (ONS), the Bank of England database (BoE), IHS
Markit/CIPS, the Confederation of British Industries (CBI), Lloyds Bank, and the European Commission.
Transformation codes: LDD = log double difference, LD = log difference, L = levels, D = first difference.
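The transformation codes map one-to-one onto simple operations; a minimal sketch, assuming x is a one-dimensional numpy array holding one of the series above:

```python
import numpy as np

def transform(x, code):
    """Apply a Table A2 transformation code to a 1-D series x."""
    if code == "L":                        # levels
        return x
    if code == "D":                        # first difference
        return np.diff(x)
    if code == "LD":                       # log difference
        return np.diff(np.log(x))
    if code == "LDD":                      # log double difference
        return np.diff(np.log(x), n=2)
    raise ValueError(f"unknown transformation code: {code}")
```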

Table A3: relative RMSEs for inflation rate with maximum likelihood

RMSE relative to AR(p)
Specification Metric 1 3 6 9 12
Average afinn 0.930 0.994 1.170 1.040 1.106
alexopoulos 0.918 0.986 1.093 1.137 0.999
baker bloom davis 0.912 0.978 1.030 1.047 1.020
harvard 0.909 0.903 1.057 1.042 1.003
husted 0.930 0.984 1.020 1.045 1.027
loughran 0.944 0.974 1.019 1.054 1.001
nyman 0.929 0.935 1.018 1.057 1.043
opinion 0.992 0.909 1.092 1.043 1.111
punc econom 0.929 0.998 0.831 1.014 1.183
stability 0.948 0.967 0.955∗∗∗ 1.074 1.076
tf idf econom 0.988 0.927 0.982 1.077 1.071
tf idf uncertain 0.928 0.990 1.083 1.018 1.009
vader 0.952 0.947 1.080 1.042 1.005
word count econom 0.992 0.918∗ 0.914 1.007 1.070
word count uncertain 0.910 0.961 1.110 1.004 0.986
HSS afinn 1.052 0.966 1.028 0.966 0.889
alexopoulos 1.015 1.14 1.098 0.787 0.870
baker bloom davis 1.010 0.978 1.012 0.922 0.919
harvard 0.998 0.874 0.952 0.953 0.902∗∗
husted 1.028 0.991 1.158 0.850 0.912
loughran 1.048 0.985 1.185 1.007 0.897
nyman 1.059 0.915 1.054 1.044 0.965
opinion 1.039 0.885 0.964 0.976 0.906
punc econom 1.022 0.952∗∗∗ 1.049 1.061 1.070
stability 1.031 0.967 1.010 1.059 1.097
tf idf econom 0.999 0.894 1.010 1.058 1.083
tf idf uncertain 1.025 1.027 1.095 0.813 0.913∗∗
vader 1.056 0.967 1.003 0.965 0.884∗∗
word count econom 1.031 0.978 1.084 1.049 1.068
word count uncertain 1.021 0.969 1.074 0.788 0.830
Balanced - 1.022 1.057 1.038 1.152 1.050
Note: Relative RMSEs of the different factor specifications over the AR(p) benchmark. “Average” includes
the average of the text metric, “HSS” includes the raw unstructured dataset, and “Balanced” includes only
the macro dataset. *, **, *** denote rejection in the Diebold and Mariano (1995) test at the 10%, 5% and
1% level, respectively.

Table A4: relative RMSEs for inflation rate with penalised (l1-norm) maximum likelihood

RMSE relative to AR(p)
Specification Metric 1 3 6 9 12
Average afinn 0.975 0.931 1.102 0.988 0.947
alexopoulos 0.943 0.977 1.127 0.898 0.928
baker bloom davis 0.932 0.963 1.077 0.994 0.982
harvard 0.943∗ 0.869 1.067 0.985 0.941
husted 0.962∗∗ 0.973 1.066 0.971 0.976
loughran 0.981 0.963∗∗ 1.068 1.024 0.942
nyman 0.970 0.913 1.069 1.031 1.019
opinion 0.957 0.870 1.094 0.988 0.960
punc econom 0.992∗∗ 0.942 0.924 0.988 1.008
stability 0.993 0.963 1.019 1.056 1.080
tf idf econom 0.907 0.882 1.047 1.068 1.084
tf idf uncertain 0.963 0.982 1.150 0.990 0.941
vader 1.002 0.951 1.097 0.988 0.947
word count econom 0.970 0.956 0.978 1.061 1.079
word count uncertain 0.933 0.943 1.014 0.904 0.903
HSS afinn 0.952 0.966 1.120 0.878 0.812
alexopoulos 0.985 1.004 1.098 0.787 0.871
baker bloom davis 1.094 0.898 1.124 0.879 0.961
harvard 0.898 0.874∗∗∗ 0.982 0.883 0.907
husted 0.888 0.991 1.158 0.880 0.918
loughran 0.978 0.985 1.085 1.007 0.973
nyman 0.859 0.915 1.054 1.042 0.965
opinion 0.979 0.965 0.911 0.922 0.986
punc econom 0.898 0.952 0.991 1.006 0.972
stability 0.932 0.967 0.759 0.859 0.975
tf idf econom 0.994∗ 0.894∗ 0.966 1.052 0.923
tf idf uncertain 0.925 0.971 0.995 0.813 0.913
vader 0.916 0.867 0.903 0.965 0.884
word count econom 1.011 0.918 1.081 1.019 1.018
word count uncertain 0.921 0.912 0.924 0.921 0.811
Balanced - 1.086 1.070 1.076 1.071 1.091
Note: Relative RMSEs of the different factor specifications over the AR(p) benchmark. “Average” includes
the average of the text metric, “HSS” includes the raw unstructured dataset, and “Balanced” includes only
the macro dataset. *, **, *** denote rejection in the Diebold and Mariano (1995) test at the 10%, 5% and
1% level, respectively.

Table A5: relative RMSEs for inflation rate with penalised (l2-norm) maximum likelihood

RMSE relative to AR(p)
Specification Metric 1 3 6 9 12
Average afinn 0.970 0.935 1.096 0.995 0.954
alexopoulos 0.947 0.977 1.119 0.906 0.928
baker bloom davis 0.938 0.964 1.071 0.948 0.980
harvard 0.946∗∗ 0.885 1.070 0.991 0.946
husted 0.967∗∗ 0.974 1.060 0.974 0.974
loughran 0.975 0.963 1.058 1.026 0.951
nyman 0.971 0.925 1.060 1.030 1.010
opinion 0.958 0.887 1.098 0.992 0.963
punc econom 0.898 0.942 0.923 0.995 1.086
stability 0.990 0.963 1.015 1.052 1.066
tf idf econom 0.908 0.889 1.039 1.064 1.075
tf idf uncertain 0.963∗∗∗ 0.980 1.118 0.927 0.947
vader 0.995 0.951 1.095 0.920 0.949
word count econom 0.970∗∗∗ 0.956 0.978 1.059 1.074
word count uncertain 0.939 0.946 1.132 0.904 0.901
HSS afinn 1.029 0.994 0.995 1.066 1.039
alexopoulos 1.045 1.044 1.047 0.953 0.917
baker bloom davis 1.090 0.989 0.945 0.929 0.859
harvard 0.989 0.949 0.955 0.959 0.956
husted 1.096 0.890 1.050 0.988 0.992
loughran 1.098 0.908 1.055 0.914 0.979
nyman 1.099 0.970 0.884 1.082 1.095
opinion 1.090 0.969 0.964 0.945 0.896
punc econom 0.922 0.852∗∗∗ 0.945 0.958 1.059
stability 0.903 0.973 0.994 0.976 0.954
tf idf econom 0.945 1.064 1.012 1.044 1.064
tf idf uncertain 0.952 0.887 0.955 0.842 0.891
vader 0.987 0.973 1.086 1.094 1.095
word count econom 0.931 0.976∗∗∗ 0.954 1.977 0.986
word count uncertain 0.955 0.964 0.944 0.949 0.929
Balanced - 1.211 1.065 1.046 1.055 1.052
Note: Relative RMSEs of the different factor specifications over the AR(p) benchmark. “Average” includes
the average of the text metric, “HSS” includes the raw unstructured dataset, and “Balanced” includes only
the macro dataset. *, **, *** denote rejection in the Diebold and Mariano (1995) test at the 10%, 5% and
1% level, respectively.

Table A6: relative RMSEs for unemployment rate with maximum likelihood

RMSE relative to AR(p)
Specification Metric 1 3 6 9 12
Average afinn 1.145 1.035 1.043 1.119 1.095
alexopoulos 1.070 1.058 1.045 1.044 1.064
baker bloom davis 1.093 1.073 1.050 1.047 1.048
harvard 1.023 0.987 0.969 1.071 1.098
husted 1.062 1.069 0.850 0.904 0.901
loughran 0.976 0.957 0.931 0.988 1.012
nyman 1.011 1.024 0.959 0.982 1.023
opinion 1.025 1.038 1.011 1.109 1.101
punc econom 1.038 0.874 0.864 0.866 0.950
stability 0.975 0.934 0.879 0.896 0.923
tf idf econom 0.979 0.934 0.901∗∗ 0.935 0.962
tf idf uncertain 1.068 1.047 1.054 1.061 1.079
vader 1.025 1.022 1.029 1.064 1.024
word count econom 0.939∗∗∗ 0.969 0.949 0.966 0.974
word count uncertain 1.082 1.067 1.063 1.068 1.091
HSS afinn 1.233 0.991 0.989 1.080 0.897
alexopoulos 1.211 1.013 1.034 1.318 1.011
baker bloom davis 1.036 1.083 1.011 1.019 1.011
harvard 0.843 0.964 0.995 0.931 0.992
husted 1.116 1.033 1.021 1.099 0.960
loughran 0.861 0.808 0.864 0.985 1.037
nyman 0.924 0.872∗ 0.893∗ 0.967∗∗∗ 0.983
opinion 0.944 0.905 0.934 1.027 1.038
punc econom 0.970 0.781 0.811 0.848 0.928
stability 0.697 0.925 0.892 0.913 0.942
tf idf econom 0.732 0.926 0.907∗∗∗ 0.948∗∗ 0.972
tf idf uncertain 1.082 0.982 1.021 1.039 1.156
vader 0.953 0.960 1.011 1.028 0.981
word count econom 0.987∗ 0.958 0.941 0.726 0.865
word count uncertain 1.019 1.011 1.033 1.057 1.071
Balanced - 0.988 0.948 0.917 0.931∗ 0.953
Note: Relative RMSEs of the different factor specifications over the AR(p) benchmark. “Average” includes
the average of the text metric, “HSS” includes the raw unstructured dataset, and “Balanced” includes only
the macro dataset. *, **, *** denote rejection in the Diebold and Mariano (1995) test at the 10%, 5% and
1% level, respectively.

Table A7: relative RMSEs for unemployment rate with penalised (l1-norm) maximum likelihood

RMSE relative to AR(p)
Specification Metric 1 3 6 9 12
Average afinn 0.978 0.993 1.009 1.103 1.098
alexopoulos 1.057 1.047 1.073 1.054 1.070
baker bloom davis 1.081 1.058 1.059 1.054 1.058
harvard 1.054 0.960 0.926 1.048 1.093
husted 1.080 1.039 0.942 0.965 0.961
loughran 0.911 0.890 0.863 0.947 0.996
nyman 0.976 0.974∗∗ 0.899 0.933 0.991
opinion 0.987 0.988 0.961 1.072 1.083
punc econom 1.034 0.863∗∗ 0.862 0.861 0.944
stability 0.970 0.929 0.867 0.886 0.915
tf idf econom 0.977∗∗∗ 0.932 0.894 0.925 0.958
tf idf uncertain 1.059 1.029 1.054 1.069 1.072
vader 1.012 1.019 1.019 1.067 1.044
word count econom 0.918 0.964 0.941 0.956 0.973
word count uncertain 1.072 1.053 1.064 1.068 1.090
HSS afinn 0.943 0.921 0.982 1.047 1.031
alexopoulos 1.073 1.138 1.031 1.039 0.959
baker bloom davis 1.086 1.068 1.028 1.029 0.880
harvard 0.986 0.914 0.929 0.932 0.952
husted 1.046 1.033 1.021 1.099 0.880
loughran 0.860 0.808∗∗∗ 0.864 0.986 0.837
nyman 0.927 0.872 0.895 0.963 0.987
opinion 0.942 0.907 0.934 0.927 0.888
punc econom 0.970 0.781 0.813 0.849 0.928
stability 0.691 0.928∗∗∗ 0.928 0.919 0.942∗
tf idf econom 0.736 0.926∗ 0.907 0.948 0.972
tf idf uncertain 1.082 0.989 1.026 1.039 0.954
vader 0.953∗ 0.967 1.111 1.028 0.988
word count econom 0.877∗∗ 0.958 0.949∗ 0.724 0.865
word count uncertain 1.019 1.011 1.038 1.052 1.074
Balanced - 0.980 0.972 0.949 0.953 0.966
Note: Relative RMSEs of the different factor specifications over the AR(p) benchmark. “Average” includes
the average of the text metric, “HSS” includes the raw unstructured dataset, and “Balanced” includes only
the macro dataset. *, **, *** denote rejection in the Diebold and Mariano (1995) test at the 10%, 5% and
1% level, respectively.

Table A8: relative RMSEs for unemployment rate with penalised (l2-norm) maximum likelihood

RMSE relative to AR(p)
Specification Metric 1 3 6 9 12
Average afinn 0.969 0.980 1.089 1.103 1.099
alexopoulos 1.051 1.041 1.054 1.051 1.072
baker bloom davis 1.082 1.053 1.050 1.053 1.059
harvard 1.023 0.957 0.918 1.041 1.094
husted 1.853 1.439 0.947 0.960 0.908
loughran 0.907∗∗∗ 0.873 0.831 0.944 0.966
nyman 0.963 0.965 0.804 0.994 0.984
opinion 0.981 0.982 0.911 1.077 1.016
punc econom 1.034 0.858 0.887 0.868 0.949
stability 0.969 0.926 0.864 0.845 0.916
tf idf econom 0.975∗ 0.921 0.892 0.977 0.957
tf idf uncertain 1.057 1.027 1.054 1.062 1.071
vader 1.319 0.978 1.177 1.061 1.042
word count econom 0.916 0.967∗ 0.939∗ 0.956 0.972
word count uncertain 1.072 1.056 1.064 1.061 1.091
HSS afinn 1.064 0.889 0.989 1.077 1.078
alexopoulos 0.973 0.993 0.993 0.039 0.991
baker bloom davis 1.086 1.064 1.028 1.029 1.042
harvard 0.886∗ 0.984 0.928 1.051 1.052
husted 1.065 1.063 1.031 1.059 1.080
loughran 0.960∗∗∗ 0.898∗∗ 0.964∗∗ 0.988 1.085
nyman 0.828 0.882 0.983 0.988 0.988
opinion 0.982 0.901∗∗ 0.989 1.067 1.037
punc econom 0.889 0.776 0.13 0.898 0.920
stability 0.995 0.959 0.892 0.933 0.962
tf idf econom 0.976 0.966 0.967 0.946 0.962
tf idf uncertain 1.064 0.992 1.010 1.089 1.063
vader 0.995 0.960 1.018 1.055 1.091
word count econom 0.890 0.855 0.969 0.942 0.992
word count uncertain 1.054 1.064 1.087 1.065 1.046
Balanced - 0.972∗ 0.969 0.946 0.951 0.964
Note: Relative RMSEs of the different factor specifications over the AR(p) benchmark. “Average” includes
the average of the text metric, “HSS” includes the raw unstructured dataset, and “Balanced” includes only
the macro dataset. *, **, *** denote rejection in the Diebold and Mariano (1995) test at the 10%, 5% and
1% level, respectively.

Table A9: relative RMSEs for GDP growth rate with maximum likelihood

RMSE relative to AR(p)
Specification Metric 1 3 6 9 12
Average afinn 0.989 1.081 1.010 0.990 0.995
alexopoulos 1.011 1.017 0.937 0.914 0.930
baker bloom davis 1.010 1.031 0.990 0.983 0.994
harvard 1.050 1.099 1.035 1.016 1.060
husted 1.014 1.063 1.013 0.996 0.994
loughran 0.953∗ 1.025 0.916 0.985 0.845
nyman 1.051 1.120 1.043 0.988 0.969
opinion 0.986∗ 1.073 1.035 0.974 0.970
punc econom 0.973 0.985 0.981 0.965 0.950
stability 0.986 0.988 0.950 0.928 0.919
tf idf econom 0.983 0.973 0.936 0.926 0.925
tf idf uncertain 1.018 1.055 0.997 0.981 0.985
vader 1.083 1.106 1.045 1.021 1.029
word count econom 0.900 0.988 0.961 0.952 0.950
word count uncertain 1.022 1.074 1.051 0.980 0.986
HSS afinn 0.943 0.991 0.966 0.988 1.034
alexopoulos 1.036 1.051 0.908 0.976 0.977
baker bloom davis 1.026 1.043 0.954 0.987 0.989
harvard 0.974 1.022 0.998 1.022 1.019
husted 1.020 1.082 0.993 0.910 0.989
loughran 0.849 0.842 0.771 0.796 0.802
nyman 0.973 1.079 1.022 0.988 0.971
opinion 0.941∗∗ 0.969 0.942∗ 0.962 0.969∗∗∗
punc econom 0.950 0.985 0.972 0.962 0.946
stability 0.969 0.950 0.903 0.887∗∗ 0.876∗∗∗
tf idf econom 0.965∗∗ 0.927 0.885 0.892 0.893
tf idf uncertain 1.032 1.071 0.981 0.988∗∗ 0.965∗∗
vader 0.990 1.047 1.030 1.030 1.043
word count econom 0.974∗∗ 0.957 0.927 0.933 0.933
word count uncertain 1.043 1.011 1.015 0.922 0.905
Balanced - 0.934 0.968 0.939 0.929 0.931
Note: Relative RMSEs of the different factor specifications over the AR(p) benchmark. “Average” includes
the average of the text metric, “HSS” includes the raw unstructured dataset, and “Balanced” includes only
the macro dataset. *, **, *** denote rejection in the Diebold and Mariano (1995) test at the 10%, 5% and
1% level, respectively.

Table A10: relative RMSEs for GDP growth rate with penalised (l1-norm) maximum likelihood

RMSE relative to AR(p)
Specification Metric 1 3 6 9 12
Average afinn 0.971 1.031 0.965∗ 0.942 0.958
alexopoulos 1.066 1.030 0.961 0.936 0.943
baker bloom davis 1.039 1.059 0.998 0.995 0.991
harvard 0.991 1.052 0.929 0.951 0.964
husted 1.080 1.039 1.013 1.068 1.030
loughran 0.931 0.965 0.984∗∗∗ 0.977∗∗ 0.976
nyman 0.985 1.069 0.966 0.924 0.904
opinion 0.969 1.020 0.951 0.916 0.920
punc econom 0.967 0.931 0.975 0.958 0.944
stability 0.983 0.977 0.933 0.905 0.896
tf idf econom 0.981∗∗ 0.964∗∗ 0.921∗ 0.907 0.910
tf idf uncertain 1.012 1.034 1.013 0.995 0.994
vader 0.997 1.058 1.010 0.987 1.074
word count econom 0.987 0.981 0.950 0.937 0.936
word count uncertain 1.014 1.053 1.045 0.995 0.994
HSS afinn 1.043 0.971 0.996 1.080 1.094
alexopoulos 1.060 1.041 0.890 0.976 0.997
baker bloom davis 1.044 1.040 0.900 0.997 0.995
harvard 0.897 1.052 0.957 0.996 0.889
husted 1.066 1.073 1.071 1.044 1.099
loughran 0.964∗∗ 0.994 0.997 0.979 0.990∗∗
nyman 0.993 1.049 1.012 0.938 0.917
opinion 0.941 0.995 0.992 0.969 0.929
punc econom 0.850 0.985 0.992 0.969 0.906
stability 0.889∗ 0.888 0.883∗ 0.877∗ 1.086
tf idf econom 0.965∗∗∗ 0.929∗∗ 0.995∗∗ 0.994 0.895
tf idf uncertain 1.034 1.074 1.091 1.080 1.088
vader 0.992 0.997 1.033 1.033 1.093
word count econom 0.984∗ 0.887 0.929 0.993 0.937
word count uncertain 1.088 1.081 1.085 0.096 0.993
Balanced - 1.044 0.986 0.964 0.958 0.988
Note: Relative RMSEs of the different factor specifications over the AR(p) benchmark. “Average” includes
the average of the text metric, “HSS” includes the raw unstructured dataset, and “Balanced” includes only
the macro dataset. *, **, *** denote rejection in the Diebold and Mariano (1995) test at the 10%, 5% and
1% level, respectively.

Table A11: relative RMSEs for GDP growth with penalised (l2-norm) maximum likelihood

RMSE relative to AR(p)
Specification Metric 1 3 6 9 12
Average afinn 0.965 1.014 0.943 0.924 0.949
alexopoulos 1.056 0.968 0.946 0.940 0.946
baker bloom davis 1.021 1.025 0.989 0.974 1.039
harvard 0.991 1.045 0.984 0.939 0.955
husted 1.078 1.037 1.013 1.080 1.045
loughran 0.917 0.931 0.979 0.971 0.972
nyman 0.985 1.064 0.987 0.990 0.988
opinion 0.967∗ 1.011 0.938 0.989 0.907
punc econom 0.967 0.906 0.971 0.954 0.939
stability 0.980 0.967 0.915 0.883 0.874
tf idf econom 0.982∗ 0.967 0.926 0.910 0.912
tf idf uncertain 1.012 1.032 1.019 0.984 0.968
vader 0.928 1.052 1.033 0.997 1.028
word count econom 0.985 0.975 0.940 0.926 0.927
word count uncertain 1.014 1.050 1.045 0.998 0.995
HSS afinn 0.823 0.777 0.896 0.889 0.994
alexopoulos 0.996 0.891 0.988 0.996 0.999
baker bloom davis 0.976 0.778 0.977 0.887 1.089
harvard 0.988 0.992 0.997 0.828 0.777
husted 1.094 1.083 0.935 0.903 0.933
loughran 1.090 1.042 1.051 1.046 0.982∗
nyman 0.973 0.889 1.041 0.878 0.979
opinion 0.841 0.869 0.842∗∗∗ 0.969 1.069
punc econom 0.850 0.985 0.982 0.861 1.090
stability 1.077 0.985 0.888 0.887 0.886
tf idf econom 0.985∗∗ 0.987 0.888 0.888 1.085
tf idf uncertain 1.038 1.078 0.881 0.889 0.988
vader 0.980 1.087 1.038 0.836 0.990
word count econom 0.949∗∗∗ 0.974 0.924 0.993 0.939
word count uncertain 0.993 1.011 1.019 0.992 0.911
Balanced - 1.035 0.984 0.962 0.955 0.955
Note: Relative RMSEs of the different factor specifications over the AR(p) benchmark. “Average” includes
the average of the text metric, “HSS” includes the raw unstructured dataset, and “Balanced” includes only
the macro dataset. *, **, *** denote rejection in the Diebold and Mariano (1995) test at the 10%, 5% and
1% level, respectively.

B Simulation study results: homoskedastic vs heteroskedastic idiosyncratic components

True parameters: σ = 0.5

                        HSS                            Model 1                           Model 2
  T \ max(kt)   50     100    500    1000      50     100    500    1000      50     100    500    1000
  50          1.025  1.142  1.182  1.246    0.713  0.757  0.736  0.760    1.250  1.232  1.235  1.306
  100         1.140  1.161  1.261  1.235    0.779  0.775  0.788  0.748    1.303  1.260  1.257  1.312
  500         1.149  1.209  1.294  1.295    0.796  0.794  0.807  0.794    1.303  1.304  1.302  1.314
  1000        1.151  1.209  1.295  1.309    0.801  0.799  0.805  0.798    1.317  1.306  1.308  1.384

True parameters: Ω ∼ U(1, 3)

                        HSS                            Model 1                           Model 2
  T \ max(kt)   50     100    500    1000      50     100    500    1000      50     100    500    1000
  50          0.887  0.872  0.800  0.827    0.920  0.868  0.857  0.867    1.063  1.232  1.179  1.229
  100         0.920  0.868  0.857  0.867    0.882  0.863  0.856  0.869    1.085  1.214  1.239  1.293
  500         0.920  0.868  0.857  0.867    0.949  0.882  0.873  0.869    1.085  1.212  1.257  1.292
  1000        0.920  0.868  0.857  0.867    0.921  0.894  0.883  0.873    1.088  1.241  1.276  1.293

Table A1: Average variance of the high dimensional state space under different specifications.
Rows index the sample size T and columns the cross-sectional dimension max(kt ). Model 1
does not include the unstructured dataset (Zt ); Model 2 includes the average of Zt .

True parameters: σ = 0.5

                        HSS                            Model 1                           Model 2
  T \ max(kt)   50     100    500    1000      50     100    500    1000      50     100    500    1000
  50          0.921  0.951  0.984  0.990    0.742  0.754  0.755  0.759    0.794  0.806  0.806  0.810
  100         0.930  0.952  0.985  0.991    0.771  0.768  0.765  0.758    0.818  0.816  0.815  0.810
  500         0.929  0.955  0.985  0.991    0.775  0.775  0.776  0.773    0.821  0.823  0.824  0.822
  1000        0.932  0.954  0.985  0.991    0.774  0.774  0.776  0.774    0.821  0.822  0.824  0.823

True parameters: Ω ∼ U(1, 3)

                        HSS                            Model 1                           Model 2
  T \ max(kt)   50     100    500    1000      50     100    500    1000      50     100    500    1000
  50          0.969  0.977  0.992  0.996    0.898  0.887  0.888  0.887    0.919  0.911  0.913  0.912
  100         0.975  0.984  0.995  0.997    0.915  0.921  0.920  0.915    0.933  0.938  0.938  0.934
  500         0.982  0.988  0.996  0.998    0.937  0.937  0.936  0.937    0.951  0.951  0.951  0.951
  1000        0.982  0.988  0.996  0.998    0.940  0.939  0.940  0.939    0.953  0.953  0.954  0.953

Table A2: Average correlation of the high dimensional state space under different specifications.
Rows index the sample size T and columns the cross-sectional dimension max(kt ). Model 1
does not include the unstructured dataset (Zt ); Model 2 includes the average of Zt .

True parameters: σ = 0.5

                        HSS                            Model 1                           Model 2
  T \ max(kt)   50     100    500    1000      50     100    500    1000      50     100    500    1000
  50          0.648  0.656  0.650  0.655    1.158  1.239  1.214  1.259    1.154  1.240  1.210  1.256
  100         0.655  0.652  0.657  0.654    1.299  1.273  1.297  1.251    1.297  1.270  1.230  1.249
  500         0.656  0.657  0.655  0.655    1.317  1.330  1.331  1.316    1.316  1.330  1.332  1.316
  1000        0.659  0.657  0.656  0.659    1.329  1.324  1.334  1.333    1.330  1.325  1.334  1.333

True parameters: Ω ∼ U(1, 3)

                        HSS                            Model 1                           Model 2
  T \ max(kt)   50     100    500    1000      50     100    500    1000      50     100    500    1000
  50          0.681  0.677  0.681  0.671    1.241  1.287  1.206  1.240    1.243  1.291  1.212  1.246
  100         0.680  0.682  0.680  0.675    1.296  1.282  1.278  1.311    1.297  1.284  1.280  1.313
  500         0.684  0.684  0.682  0.680    1.298  1.290  1.302  1.314    1.299  1.299  1.303  1.315
  1000        0.685  0.684  0.682  0.681    1.312  1.324  1.322  1.320    1.312  1.324  1.323  1.320

Table A3: Average bias of the high dimensional state space under different specifications.
Rows index the sample size T and columns the cross-sectional dimension max(kt ). Model 1
does not include the unstructured dataset (Zt ); Model 2 includes the average of Zt .
