DOI: 10.1111/fima.12303

ORIGINAL ARTICLE

Machine learning and asset allocation

Bryan R. Routledge
Tepper School of Business, Carnegie Mellon University, Pittsburgh, Pennsylvania

Correspondence
Bryan R. Routledge, Tepper School of Business, Carnegie Mellon University, 4765 Forbes Ave, Pittsburgh, PA 15213.
Email: routledge@cmu.edu

Abstract
Investors have access to a large array of structured and unstructured data. We consider how these data can be incorporated into financial decisions through the lens of the canonical asset allocation decision. We characterize investor preference for simplicity in models of the data used in the asset allocation decision. The simplicity parameters then guide asset allocation along with the usual risk aversion parameter. We use three distinct and diverse macroeconomic data sets to implement the model to forecast equity returns (the equity risk premium). The data sets we use are (a) price-dividend ratios, (b) an array of macroeconomic series, and (c) text data from the Federal Reserve’s Federal Open Market Committee (FOMC) meetings.

1 INTRODUCTION

There is no shortage of data. At your fingertips are 237,000 data series at the St. Louis Federal Reserve Bank’s Eco-
nomic Data (FRED). The Securities and Exchange Commission (SEC) received 304,000 corporate filings (e.g., 10K, 10Q,
8K, Form 4) during the first quarter of 2018. The SEC has over 18 million electronic filings from 1994. Mix in social
media sites like Twitter and data sets with sizes in the billions are common.1 Are any of these data helpful in decision making?
In the finance context we look at here, despite all the data sets and sources available, we still have only about 840 monthly
observations of, say, postwar equity returns. Does that limit the value of the larger data sets? The goal of this paper
is to explain how machine learning—specifically regularized regressions—captures how individuals might use large and
varied data for decision making. Here, we look at the canonical asset allocation problem and characterize individuals’
preferences over “models” for data. Viewed through the lens of a portfolio optimization, we look at how these data
become information.
The economic context for the model is a simple stock-bond asset allocation problem. At the core of this problem is
the level and dynamic properties of the equity premium (the rate of return on a broad portfolio of equities in excess
of the risk-free return). From data, and much finance research, we know the equity premium has substantial variation
in its conditional mean. The unconditional expectation of the equity premium is around 6%. However, the conditional
expectation commonly swings substantially from 0% to 12% (see Cochrane, 2011) or as evidenced in the term struc-
ture, see van Binsbergen, Hueskes, Koijen, and Vrugt (2013). Similar time variation in risk premiums show up in bonds


1 For example, O’Connor, Balasubramanyan, Routledge, and Smith (2010), and, Coppersmith, Dredze, Harman, and Hollingshead (2015).


(Ludvigson & Ng, 2009), oil futures (Baker & Routledge, 2013), and foreign exchange (Koijen, Moskowitz, Pedersen, &
Vrugt, 2013). The implication of the time variation in the mean return is that ex post returns are, to some degree, pre-
dictable. Indeed, the main empirical support for the time variation in expected returns is the regression predicting the
excess return at horizon h, rt+h, with information at date t, Xt:

$$r_{t+h} = \theta' X_t + \epsilon_{t+h}. \qquad (1)$$
Across markets, the set of predictors varies. The aggregate price-dividend ratio is used as a forecaster in equity markets; the slope of the futures curve works in oil markets. In the bond market, Ludvigson and Ng (2009) use predictors extracted as the principal components of 130 economic series. In addition, as you would expect, realized returns are very volatile, so the precision (R2) of the predictive regression is low. More relevant here, not everyone is convinced that the predictability is particularly useful. Welch and Goyal (2008) point out that out-of-sample forecasts are not reliable.2
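To fix ideas, here is a minimal sketch of estimating the predictive regression in Equation (1) by OLS. The data are simulated placeholders (not the returns data used later in the paper), and the horizon and coefficient values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder monthly data: a standardized signal X_t and the excess return
# realized over the following h = 12 months, built to depend (weakly) on X_t.
T, h = 840, 12
X = rng.standard_normal(T)
r_ahead = 0.005 + 0.010 * X + 0.040 * rng.standard_normal(T)   # stand-in for r_{t+h}

# OLS estimate of r_{t+h} = theta_0 + theta_1 * X_t + eps_{t+h}.
Z = np.column_stack([np.ones(T), X])
theta, *_ = np.linalg.lstsq(Z, r_ahead, rcond=None)
resid = r_ahead - Z @ theta
print("theta:", theta.round(4), " R^2:", round(1 - resid.var() / r_ahead.var(), 3))
```

As in the discussion above, even when the slope is "real," the R² of such a regression is small because realized returns are very volatile.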
To investigate this question, we use the framework of Gilboa and Schmeidler (2010), Gilboa and Samuelson (2012),
and, Gilboa and Schmeidler (2003) to represent a decision maker’s preference over “models” of the data. In particu-
lar, we propose a representation and assume a functional form that captures the dual objectives: people like models
that explain the data (e.g., likelihood) and people like models that are simple (e.g., a small number of parameters). Of
course, these two desires are often at odds. A model with more parameters fits the data better (at least in-sample).
Machine learning tackles this trade-off by “regularizing” overparameterized models (see Tibshirani, 1996, 2011; Zou
& Hastie, 2005). These techniques look to exploit patterns in data that perform well out of sample. Here, we interpret
this approach through the axiomatic foundations of Gilboa and Schmeidler (2010). This lets us interpret and control
the transition from data to information used in decision making as preference parameters akin to a coefficient of risk
aversion. When we embed all this in the familiar portfolio problem we can see if we have preference parameters that
generate sensible behavior.3
First, we use the Gilboa and Schmeidler (2010) setting to characterize simplicity or parsimony in data models. Next,
we sketch an example with a two-state equity premium to clarify the model’s preference structure. Then, to see how
this might work in practice, we use multiple sources of economic data. For now, we simplify to the “static” (or single-period) portfolio problem. This is helpful since the simplicity of the decision step lets us focus on the new feature of “model” preference. We implement this using three data sets, each different in character. First, we use the
familiar price-dividend data as in Cochrane (2011). Second, we use an array of several hundred monthly macroeco-
nomic series from the St. Louis Fed’s data (FRED). This is along the lines of the data used in Ludvigson and Ng (2009).
Finally, we use text as data. Specifically, we use the Beige Book reports of the Federal Reserve staff economists that
characterize the economy using informal surveys.

2 DECISION MAKING AND SIMPLICITY

Gilboa and Schmeidler (2010), and related papers Gilboa and Samuelson (2012) and Gilboa and Schmeidler (2003), present an axiomatic foundation for using data and incorporating a preference for simplicity or parsimony. To summarize and adapt this setting to asset allocation, we start by characterizing “data.” Data are of the form (xn, yn): xn ∈ 𝕏 is a (perhaps large) vector of “signals,” and yn ∈ 𝕐 is a scalar “state.” The signal xn is known by the decision maker and may be useful in predicting the unobserved state yn. Data-generating processes and theories, defined below, are joint probability distributions over 𝕏 × 𝕐.

2 Kim and Routledge (2019) look at the corporate finance implications of ignoring (or not) time variation in the risk premium.

3 Our model here is different from the related paper of Gabaix (2014). That paper characterizes a nonmaximizing or “bounded rationality” (Simon, 1959) approach to capture the idea that some data are “ignored.” Here, we will define preferences and an optimization so any data that are “ignored” are an optimal choice.

The idea here is that the signal xn is useful for predicting the unknown state yn . The state yn is relevant for the
decision problem at hand (below). The probabilistic relation between xn and yn is not known but the decision maker has
access to a data set. The data set of length n is denoted hn = ((x0, y0), … , (xn−1, yn−1)) (it does not include observation n). The space of possible data sets of length n is Hn and the set of all such data is H = ∪n≥0 Hn. In general, a data-generating process is a function Δ that maps a history in H to a joint probability distribution over 𝕏 × 𝕐. To make this workable, we restrict the set of possible data-generating processes to ones that are i.i.d. and do not depend on the history.4 So for a process Δ in this restricted set, Δ(x, y) gives the probability of observing (x, y), and we write 𝛿(y; x) for the conditional probability of observing y given signal x.
A “theory” is a possible data-generating process. Note that theories are “just” probability distributions. We are directly interested in the prediction of yn given xn under a theory, denoted 𝜃(y; x). The set of possible theories is the same as the set of possible data-generating processes. This implies the true data-generating process is in the set of theories our decision maker considers. The decision maker has data in hn and ranks possible theories, 𝜃1 ⪰hn 𝜃2. This means, given the data in hn, theory 𝜃1 is at least as preferred as 𝜃2 (or, Gilboa and Schmeidler use the phrase “at least as plausible”). Putting all these preference orderings together, {⪰hn} for hn ∈ Hn, n = 0, 1, 2, … describes how one chooses a theory. For example, the familiar maximum likelihood criterion is one way to construct a preference ordering. Gilboa and Schmeidler (2010) add axioms for the preferences that allow for consideration of both “fit” (likelihood) and “simplicity.”
Using a theory in the familiar Savage Expected Utility framework results in the following objective function. A decision maker chooses, based on data in hn and signal xn, using:

$$E_{\theta^*}\Big[\, E\big[\, u(c) \mid y_n \,\big] \;\Big|\; x_n \Big] \qquad (2)$$

$$\text{s.t.}\quad \theta^* = \operatorname*{argmax}_{\theta}\; v(\theta) + \sum_{(x,y)\in h_n} \ell(y, x, \theta).$$

To describe this decision rule, think of the asset allocation problem we will explore below. The inner expectation
E[u(c)|yn ] governs how one chooses an asset allocation to the risky asset given a specific value for the equity risk pre-
mium (the yn ). The forecast of the equity risk premium, yn , given the current data xn —say the current price-dividend
ratio—depends on the theory 𝜃 ∗ . For example, something like Equation (1). Finally, that theory, 𝜃 ∗ , is chosen based on
the data seen in history hn .
The functions v(𝜃) and 𝓁(y, x, 𝜃) characterize the decision maker’s preferences to choose 𝜃 ∗ . The chosen 𝜃 ∗ deter-
mines how xn is used to forecast yn . The function 𝓁(y, x, 𝜃) uses the data in the history with the sum over hn . In our
implementation below, this will be a likelihood. Preferences over theories, sensibly, embody a preference for fitting the
data. However, the function v(𝜃) does not depend on data. This captures a preference over theories themselves—say, simplicity. In our asset allocation setting below, the v(𝜃) will be a “regularization” of a regression that imposes a preference for fewer parameters in the regression. As with any utility theory, the functional forms we use here are spe-
cific examples (like the Constant Relative Risk Aversion (CRRA) risk preference). We use the asset allocation setting to
evaluate the usefulness of this specification.
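As an illustration of the two-step structure in Equation (2)—an outer choice of 𝜃* that trades off fit ℓ against simplicity v, and an inner expected-utility portfolio choice given the chosen theory’s forecast—here is a schematic Python sketch. The functional forms, starting values, and numbers are placeholders chosen for this illustration, not the paper’s implementation.

```python
import numpy as np
from scipy.optimize import minimize

def log_lik(theta, history):
    """Fit term: sum of l(y, x, theta) over the history (here, a squared-error form)."""
    x, y = history                                  # arrays of signals and states
    return -np.sum((y - (theta[0] + theta[1] * x)) ** 2)

def v(theta, a, b):
    """Simplicity term: an elastic-net-style penalty on the slope coefficient."""
    slope = theta[1:]
    return -b * (a * np.sum(np.abs(slope)) + (1 - a) * np.sqrt(np.sum(slope ** 2)))

def choose_theory(history, a, b):
    """Outer step: theta* = argmax_theta v(theta) + sum of l(y, x, theta)."""
    objective = lambda th: -(v(th, a, b) + log_lik(th, history))
    return minimize(objective, x0=np.zeros(2), method="Nelder-Mead").x

def optimal_weight(y_forecast, alpha=-1.0, sigma=0.20):
    """Inner step: pick w maximizing E[W1^alpha / alpha | y] with lognormal returns."""
    draws = np.random.default_rng(1).normal(y_forecast - 0.5 * sigma**2, sigma, 20_000)
    grid = np.linspace(0.0, 1.5, 151)
    utils = [np.mean((1 + w * (np.exp(draws) - 1)) ** alpha / alpha) for w in grid]
    return grid[int(np.argmax(utils))]

# Example: a short simulated history, then theory choice and the implied allocation.
rng = np.random.default_rng(0)
x_hist = rng.standard_normal(60)
y_hist = 0.06 + 0.02 * x_hist + 0.05 * rng.standard_normal(60)
theta_star = choose_theory((x_hist, y_hist), a=0.5, b=1.0)
print("theta*:", theta_star.round(3),
      " weight given x_n = 1:", optimal_weight(theta_star @ [1.0, 1.0]))
```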

3 ASSET ALLOCATION: TWO-STATE EQUITY RISK PREMIUM

To see how this works, here is an example adapted from Gilboa and Samuelson (2012) to an asset allocation setting.
Imagine a signal xn and a state yn are coin tosses. Consider a one-consumption-date, one-risky asset setting with end-
of-period wealth, W1 where the investor chooses the proportion of wealth invested in the risky asset. Define:
$$W_1 = W_0\big( w(R - R_f) + R_f \big),$$

4 This is stronger than needed. However, we can adapt this setting with the usual time-series stationarity assumptions—for example, a VAR model with i.i.d. innovations.

TABLE 1 Parameters for two-state example

Data-generating process
  Signal                        x ∈ {tail, head}
  State                         y ∈ {tail, head}
  Data-generating process Δ     p(x = head) = 0.50; p(y = head | x = tail) = 0.48; p(y = head | x = head) = 0.80

Assets
  Risk-free asset               log Rf = 0.0
  Risky asset                   log R ∼ N(𝜇(y), 𝜎²) with 𝜇(y = tail) = 0.02, 𝜇(y = head) = 0.10, 𝜎 = 0.20

Preferences
  Risk aversion                 1 − 𝛼 = 2.0
  Theory selection              Norm ‖·‖: L1; A = (0.5, 0.5)′; B = (−1, 1); a, b: see example

Numerical
  Repetitions of each sample to get means, standard deviations of optimal portfolios: 250
  Gaussian quadrature for the lognormal distribution: 5 points

where W0 = 1 is initial wealth and w is the proportion of wealth allocated to the risky asset. Set the risk-free rate log Rf =
0. The risky asset return, R depends on the outcome of state y as log R ∼ N(𝜇(y), 𝜎 2 ). For concreteness, say 𝜇(y = tail) <
𝜇(y = head). Familiar values are 𝜇(y = tail) = .02 and 𝜇(y = head) = .10. The example has a constant volatility of 𝜎 =
.20. All the parameters are listed in Table 1.
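For concreteness, a minimal sketch of simulating the data-generating process in Table 1 (Python; the function name and seed are just for illustration):

```python
import numpy as np

def simulate_history(n, seed=0):
    """Draw n i.i.d. pairs (x, y) from the Table 1 process:
    p(x = head) = 0.50, p(y = head | x = tail) = 0.48, p(y = head | x = head) = 0.80."""
    rng = np.random.default_rng(seed)
    x = rng.random(n) < 0.50                     # True means "head"
    y = rng.random(n) < np.where(x, 0.80, 0.48)
    return x, y

x, y = simulate_history(100_000)
print("p(y=head | x=head) ~", round(y[x].mean(), 3),
      " p(y=head | x=tail) ~", round(y[~x].mean(), 3))
```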
Signals and states are the result of tossing two coins, xn, yn ∈ {head, tail}. Data observations in hn are independent. However, the coin tosses may be correlated, so knowing the outcome of xn may be informative about the outcome of yn. A data-generating process Δ describes the probability of each outcome (xn, yn). This embeds any correlation in the coins. Given the data in hn, the prediction-relevant portion of a theory is a two-dimensional vector, 𝜃 ∈ (0, 1)², that describes p(yn = head|xn = head) and p(yn = head|xn = tail). Risk preferences are the familiar CRRA setting with curvature parameter 𝛼 < 1 (coefficient of relative risk aversion of 1 − 𝛼). Altogether, the portfolio
problem optimization, given data set hn is:

$$\max_{w}\; E_{\theta^*(h_n)}\!\left[\, E\!\left[\, \alpha^{-1} \tilde{W}_1^{\alpha} \,\middle|\, y_n \,\right] \,\middle|\, x_n \right]$$

$$\text{s.t.}\quad \theta^*(h_n) = \operatorname*{argmax}_{\theta}\; v(\theta) + \sum_{(x,y)\in h_n} \ell(y, x, \theta).$$

Here, let us specify the functional form of 𝓁() as a log-likelihood. That is:

$$\sum_{(x,y)\in h_n} \ell(y, x, \theta) = \sum_{(x,y)\in h_n} \log\Big( I[x=\text{head}]\big(I[y=\text{head}]\,\theta_1 + I[y=\text{tail}]\,(1-\theta_1)\big) + I[x=\text{tail}]\big(I[y=\text{head}]\,\theta_2 + I[y=\text{tail}]\,(1-\theta_2)\big) \Big), \qquad (3)$$

where I[⋅] is the indicator function and 𝜃1 = p(y = head |x = head) and 𝜃2 = p(y = head |x = tail) are the two parame-
ters of the theory. This seems reasonable in the sense that given enough data, the preferred theory will coincide with
the true data-generating process.
How can we specify v(𝜃) to capture a preference for simplicity? While “simplicity” and “complexity” are appealing as
ranking criteria, it is hard to find a universal measure. In the specific two-coin example here, the v(𝜃) defines “simple”
or a preferred reference point on the (0, 1)2 square (akin to a Bayesian prior). As an example, consider:

$$v(\theta) = -b\big( a\,\|A - \theta\|_{L1} + (1-a)\,\|B\theta\|_{L1} \big), \qquad (4)$$

where a, b are positive scalars and ‖ ⋅ ‖ is a norm (we will use L1- and L2-norms below). If A = [1/2, 1/2]′, this expresses a
preference for theories where the second coin is both independent of the x-coin and also unbiased. If we define B =
[1, −1], this captures a preference for theories that imply no correlation across the two coins. That is, where p(y =
head|x = head) is close to p(y = head|x = tail). The preference parameters a and b scale things (and the format here
is similar to the Elastic Net regression we consider below). b determines the overall importance of simplicity in the
preferences. The a ∈ [0, 1] parameter weights the relative importance of the two notions of simplicity. Of course, it is
arbitrary and we embed it in an economic model to evaluate if this is a sensible representation of behavior.5
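A sketch of how a theory 𝜃 = (𝜃1, 𝜃2) could be selected by maximizing the Equation (3) log-likelihood plus the Equation (4) simplicity term (with A = (1/2, 1/2)′ and B = (−1, 1) under the L1 norm, as in Table 1). The brute-force grid search below is an illustrative stand-in for the paper’s optimization.

```python
import numpy as np

def penalized_objective(theta1, theta2, x, y, a, b):
    """Equation (3) log-likelihood plus Equation (4): -b(a*||A - theta||_1 + (1-a)*||B theta||_1)."""
    p = np.where(x, theta1, theta2)                  # prob(y = head) given the signal x
    loglik = np.sum(np.where(y, np.log(p), np.log(1.0 - p)))
    penalty = -b * (a * (abs(0.5 - theta1) + abs(0.5 - theta2))
                    + (1.0 - a) * abs(theta2 - theta1))
    return loglik + penalty

def choose_theta(x, y, a, b, grid=np.linspace(0.01, 0.99, 99)):
    """Brute-force grid search for theta* = argmax of the penalized objective."""
    vals = np.array([[penalized_objective(t1, t2, x, y, a, b) for t2 in grid] for t1 in grid])
    i, j = np.unravel_index(np.argmax(vals), vals.shape)
    return grid[i], grid[j]

# A small sample: with b large and a = 1, theta* tends to be pulled toward (0.5, 0.5).
rng = np.random.default_rng(0)
x = rng.random(20) < 0.5
y = rng.random(20) < np.where(x, 0.80, 0.48)
print("b = 0:  ", choose_theta(x, y, a=1.0, b=0.0))
print("b = 10: ", choose_theta(x, y, a=1.0, b=10.0))
```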
Figure 1 shows the optimal portfolio for each possible outcome of the coin (on the vertical axis). The horizontal axis
is the size of the data (log n). Since the optimal portfolio allocation depends on the data, the figure reports the mean allo-
cation across many simulated data sets hn ∈ Hn . The shaded regions represent one standard deviation bounds around
the mean for the portfolio allocation. For reference, if the decision maker knew the true data process her optimal port-
folio would be .66 for x as a tail and .99 for x as a head. These reference lines are shown as the dashed lines in the
figure. In the case where data-generating process is known, the data are not needed. Hence, these reference lines do
not depend on the size of the data.
In Figure 1, b = 0 (and when b = 0, the value for a has no impact). With this calibration, the theory 𝜃 that determines
the optimal portfolio allocation is determined solely by the data set hn . That is the reason for the large variation in
portfolio allocations for smaller data sets. For data sets smaller than about log n = 3 (about 20 paired coin tosses), the one
standard deviation bounds for the optimal allocations overlap. This means that about half the time, more is allocated to
the risky asset on a tail signal than a head. For larger data sets, the optimal allocation both converges and converges to
the value that is optimal given the true data-generating process. Hence, we have characterized preferences consistent
with the Law of Large Numbers.
As a contrast, Figure 2 sets the preference parameter b = 10. In this case, there is a strong preference for “sim-
plicity.” Setting a = 1 characterizes simplicity as a preference for the y-coin as uncorrelated and unbiased. Hence, for
small data sets the optimal theory is that the y-coin is uncorrelated and unbiased. The result is that the optimal portfolio is constant across the two outcomes of the x-signal and not sensitive to the particular sample of data seen. Notice also that the optimal asset weight does not depend on the sample. However, once the data set is large, the optimal portfolios are sensitive to the x-coin signal and also have some dependence on the data. Finally, note that as data sets get large enough, the optimal portfolio does converge to the full-information case.
As a contrast, Figure 3 also sets the preference parameter b = 10. However, here a = 0, characterizing “simplicity” only by p(y = head|x = head) being close to p(y = head|x = tail). Hence, for small data sets, the optimal allocation to the risky asset is constant across the outcomes of the x-coin. But here, note that the allocation itself is dependent on the data set, as seen in the large one-standard-deviation range of the optimal investment.
Finally, to see how the two “simplicity” parameters work together, Figure 5 varies a and b. The rows have b = {0, 3, 6, 9} and the columns have a = {0, 0.5, 1.0}. As b increases, the optimal weights are less reliant on small data sets. The effect is strongest as a increases. For a = 1 and b = 9, the portfolio is

5 This functional form expands naturally to a multidimensional signal such as xn ∈ {head, tail}K and is an analog of a “generalized lasso problem” (see Tibshirani, Saunders, Rosset, Zhu, & Knight, 2005), a common tool in natural language processing, image recognition, and machine learning settings.
FIGURE 1 Coin example—a = n.a. and b = 0


Note. Optimal portfolio allocation given signal x. The solid lines indicate mean of the simulation. The bands indicate
one standard deviation. The dashed line is a reference line indicating optimal allocation if the data-generating process
was known. Data size is the length of history hn observations used in choosing optimal theory 𝜃 ∗ (hn ). Note, for this
example, when b = 0, the value of a has no effect.

constant for small data sets. Finally, in all these examples, as the data set is large, the optimal portfolio converges to the
optimal given the true process.

4 ASSET ALLOCATION: MACRO DATA AND FEDERAL OPEN MARKET COMMITTEE (FOMC) TEXT

Next, we look at an example using the one-period portfolio asset allocation problem but with data drawn from the U.S.
economy.
$$\max_{w}\; E_{\theta^*(h_n)}\!\left[\, E\!\left[\, \alpha^{-1} \tilde{W}_1^{\alpha} \,\middle|\, y_n \,\right] \,\middle|\, x_n \right]$$

$$\text{s.t.}\quad \theta^*(h_n) = \operatorname*{argmax}_{\theta}\; v(\theta) + \sum_{(x,y)\in h_n} \ell(y, x, \theta),$$

$$W_1 = W_0\big( w(R - R_f) + R_f \big).$$
FIGURE 2 Coin example—a = 1.0 and b = 10


Note. Optimal portfolio allocation given signal x. The solid lines indicate mean of the simulation. The bands indicate
one standard deviation. The dashed line is a reference line indicating optimal allocation if the data-generating process
was known. Data size is the length of history hn observations used in choosing optimal theory 𝜃 ∗ (hn ).

The (artificial) one-period setting makes the decision model more transparent. But the use of real data means we
have to be more explicit with the timing of the data. Consider data as monthly and the portfolio horizon as 1 year.
Again, set the beginning of year wealth W0 = 1 with end-of-year wealth W1 . The equity risk premium depends on
the state y as:

$$r = \log R - \log R_f \sim N(y - 0.5\sigma^2,\; \sigma^2).$$

Specifying things this way means we can set log Rf = 0 since in our static example the level of the risk-free rate is not
particularly interesting. We will also maintain a constant volatility assumption with 𝜎 = .20.
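Given a forecast y of the equity premium, the one-period CRRA allocation under the lognormal return assumption above can be computed numerically. The sketch below uses Gauss–Hermite quadrature (Table 1 mentions a 5-point quadrature for the lognormal expectation) and interprets the risk aversion value of −4.0 used later in the paper as the curvature parameter 𝛼; the function name and defaults are illustrative.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def crra_weight(y_forecast, alpha=-4.0, sigma=0.20, n_nodes=5):
    """Maximize E[W1^alpha / alpha] over w, where W1 = w*(R - Rf) + Rf, Rf = 1,
    and log R - log Rf ~ N(y_forecast - 0.5*sigma^2, sigma^2)."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    # Gauss-Hermite change of variables for a N(mu, sigma^2) expectation.
    log_r = (y_forecast - 0.5 * sigma**2) + np.sqrt(2.0) * sigma * nodes
    probs = weights / np.sqrt(np.pi)
    r_excess = np.exp(log_r) - 1.0

    def neg_expected_utility(w):
        W1 = 1.0 + w * r_excess
        return np.inf if np.any(W1 <= 0) else -np.sum(probs * W1 ** alpha / alpha)

    return minimize_scalar(neg_expected_utility, bounds=(0.0, 2.0), method="bounded").x

print(crra_weight(0.06))   # allocation implied by a 6% equity premium forecast
```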
To express (possible) predictability of the equity returns, x is a k-vector of signals observed at the beginning of the (1-year) period such that:

$$y_n = \delta_0 + \delta' x_n + \epsilon_n.$$

The 𝛿 coefficients capture the true data-generating process. The 𝜖n are orthogonal. Of course, in practice, our data set will
not contain y—we cannot “see” the equity premium. Instead we see the realized excess returns, rt,t+12 paired with the
beginning-of-period signals xt. This is the familiar predictive regression of Cochrane (2011) (and others):
FIGURE 3 Coin example—a = 0 and b = 10


Note. Optimal portfolio allocation given signal x. The solid lines indicate mean of the simulation. The bands indicate
one standard deviation. The dashed line is a reference line indicating optimal allocation if the data-generating process
was known. Data size is the length of history hn observations used in choosing optimal theory 𝜃 ∗ (hn ).

$$r_{t,t+12} = \theta_0 + \theta' x_t + \varepsilon_{t+12}. \qquad (5)$$

Signals, xt , are normalized to zero mean and unit variance. This is not necessary but makes it easier to specify prefer-
ences and compare across different input data sets. (It also means the intercept term, 𝜃0, is the estimate of the unconditional equity risk premium.)
To implement our decision model, we use the same CRRA preferences over terminal (end-of-year) wealth as the
previous example. In Equation (2), we need to specify how v and 𝓁 are measured in this context. First, we can encode the
theory 𝜃, which gives the conditional probability of y for a given vector x, as a candidate set of coefficients in Equation (5).
That is, a theory is:

$$y_t = \theta_0 + \theta' x_t.$$

In this familiar regression setting, the log-likelihood is (up to a constant) the negative of the squared error.

$$\ell(y, x, \theta) = -(y - \theta_0 - \theta' x)^2. \qquad (6)$$


FIGURE 4 Coin example—a = 0.5 and b = 10


Note. Optimal portfolio allocation given signal x. The solid lines indicate mean of the simulation. The bands indicate
one standard deviation. The dashed line is a reference line indicating optimal allocation if the data-generating process
was known. Data size is the length of history hn observations used in choosing optimal theory 𝜃∗(hn).

For the simplicity component of preferences over theories, we use the regularizer format of “Elastic Net” regularized
regressions in machine learning. This is similar to our previous example and is:

$$v(\theta) = -b\big( a\,\|\theta\|_{L1} + (1-a)\,\|\theta\|_{L2} \big). \qquad (7)$$

The Elastic Net regularizer (Zou & Hastie, 2005) combines two notions of “simplicity” for model parameters. The L1-
norm component encapsulates a preference for a “simple” model as one with a few nonzero parameters (called a LASSO
regression; see Tibshirani, 1996, 2011).6 The L2-norm expresses the preference for “simple” as a model with parameters that are close to zero (called a Ridge regression; see Hoerl & Kennard, 1988).

6 Recall $\theta \in \mathbb{R}^{K+1}$, $\|\theta\|_{L1} = \sum_{k=1}^{K}|\theta_k|$, and $\|\theta\|_{L2} = (\sum_{k=1}^{K}\theta_k^2)^{1/2}$. Note we leave $\theta_0$ outside of the definition of $v$, so the level of the equity premium will be driven by the sample mean in the data.
FIGURE 5 Coin example—a = 0, 0.5, 1.0 and b = 0, 3, 6, 9


Note. Optimal portfolio allocation given signal x. The solid lines indicate mean of the simulation. The bands indicate
one standard deviation. The dashed line is a reference line indicating optimal allocation if the data-generating process
was known. Data size is the length of history hn observations used in choosing optimal theory 𝜃 ∗ (hn ).

The machine learning rationale for including both is that sparseness (from the absolute-value L1-norm) makes a regression easy to understand and explain since it has few nonzero parameters. However, in models with large-dimensional x-vectors, with many near collinearities, the L2-norm term often makes the estimates more stable (e.g., similar nonzero terms across similar subsamples). The parameter a ∈ [0, 1] drives the degree of sparseness (e.g., a = 1 is the LASSO regularizer). The parameter b scales how much weight is put on simplicity (vs. the data-driven component of Equation 2). Things are linear in the two norms, and this scaling makes for easier interpretation. Of course, in our setting, parameters a and b and the functional forms are just a representation of preferences. Using the “practical wisdom” of machine learning is helpful, perhaps, in calibrating these terms.
We solve for the optimal model using Friedman, Hastie, and Tibshirani (2010). This solves simultaneously for many values of the parameter b from small to large (called the regularization path).

TABLE 2 Phrase extraction tool—example

investment commercial
investments commercial_real_estate
business_investment commercial_construction
capital_investment commercial_banks
inventory_investment commercial_real_estate_markets
investment_plans commercial_loans
investment_tax_credit commercial_lending
investment_spending commercial_real_estate_loans
fixed_investment commercial_real_estate_activity
new_investment commercial_loan_demand
investment_companies commercial_aircraft
investment_activity commercial_construction_activity
capital_investments commercial_builders
security_investments commercial_building
new_investments commercial_space
equipment_investment commercial_mortgages
investment_projects commercial_real_estate_market
capital_investment_plans commercial_markets
business_investment_plans commercial_contractors
investment_management_firms commercial_paper
investment_portfolios commercial_realtors
investment_outlays commercial_side
alternative_investments commercial_projects

Note. A sample of phrases that contain the word “commercial” or “investment” listed in order of frequency.

In practice, this is useful for model selection (“tuning” of the choice of the “hyperparameter,” b). However, in our context, it lets us look at the optimal portfolios across different preference parameters.7
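For intuition on how the preference parameters map to an off-the-shelf fit: the paper uses the glmnet package for R (footnote 7); the sketch below uses scikit-learn’s ElasticNet as a stand-in, with l1_ratio playing the role of a and alpha the role of b. The correspondence to Equation (7) is only approximate (scikit-learn penalizes the squared L2 norm and averages the squared errors), and the data here are simulated placeholders.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)

# Placeholder data: T months of K standardized signals and 1-year-ahead excess returns.
T, K = 600, 200
X = rng.standard_normal((T, K))
beta_true = np.zeros(K)
beta_true[:5] = 0.02                                 # only a few "real" predictors
r = 0.06 + X @ beta_true + 0.15 * rng.standard_normal(T)

# Sweep the simplicity scale b (alpha) for a fixed sparseness preference a (l1_ratio).
a = 0.5
for b in [1e-4, 1e-3, 1e-2, 1e-1]:
    fit = ElasticNet(alpha=b, l1_ratio=a, max_iter=10_000).fit(X, r)
    n_nonzero = int(np.sum(fit.coef_ != 0))
    print(f"b={b:g}: nonzero coefficients = {n_nonzero}, intercept = {fit.intercept_:.4f}")
```

As b grows, the fitted model uses fewer of the signals and the intercept approaches the sample mean—mirroring the behavior of the “constant portfolio” preference parameterizations discussed below.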

4.1 Data sets


We look at three different data sets: (a) the price-dividend ratio; (b) a big collection of macroeconomic data series from
the Federal Reserve Bank of St. Louis data set; and (c) text from the Federal Reserve Board’s Federal Open Market Com-
mittee meetings. The aggregate return is the “market” (the value-weighted portfolio of all NYSE, AMEX, and NASDAQ
firms). Specifically, we use continuously compounded excess returns.8

1. Bob—Price-Dividend Macro Data

The usual workhorse for predicting equity returns is the price-dividend ratio. This is the variable that features
prominently in John Cochrane’s AFA Presidential Address (Cochrane, 2011), and many others such as Campbell and
Shiller (1988a, 1988b). The data come from Robert Shiller’s web page. (Hence, the handle, with apologies for the

7 We use the Friedman et al. (2010) package 𝚐𝚕𝚖𝚗𝚎𝚝 for 𝚁. https://cran.r-project.org/web/packages/glmnet/

8 Returns data are CRSP data via Ken French’s Data Library: http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html
FIGURE 6 Data: Bob—Weights


Note. Forecast data are price-dividend ratio (“Bob”). Shown is the optimal allocation to the risky equity based on the
forecast of the equity risk premium. These are reported for the “train” data to 1952–2005. For these models plotted,
only a=0.0000_logb=-0.8307 has any time-series variation in the optimal portfolio weight.

informality, “Bob” for our investor using this datum.) The data are the dividend-price ratio of the S&P 500 large-cap stocks (and include lags).

2. FRED—Broad Array of Macro data

Ludvigson and Ng (2009) use 130 or so macroeconomic series to forecast bond returns. From those series, they extract six principal components. Along these lines, we have pulled similar data from FRED. Most of the series Ludvigson and Ng use are included along with a few others. There are 224 monthly data series included. Most have the familiar transformations for stationarity (log differences) and all have been standardized to demean and set the variance to one. Note we do not do any dimensionality reduction, such as extracting principal components. The amount of
data and the amount of reliance on the data will be set by the simplicity preference parameters.

TABLE 3 Model coefficients—BOB

a=0.0000_logb=3.7279   a=0.0000_logb=−0.8307   a=0.5000_logb=−2.4867   a=1.0000_logb=−3.1798
nzero 2.00000 2.00000 0.00000 0.00000
Eer.train.sd 0.00000 0.01746 0.00000 0.00000
Eer.test.sd 0.00000 0.00556 0.00000 0.00000
weight-in-Rm.train.sd 0.00000 0.10367 0.00000 0.00000
weight-in-Rm.test.sd 0.00000 0.03336 0.00000 0.00000
(Intercept) 0.05811 0.05553 0.05811 0.05811
PRICEDIVIDENDRATIO_level_inv_z 0.00000 0.00879 0.00000 0.00000
PRICEDIVIDENDRATIO_level_lag_inv_z 0.00000 0.00884 0.00000 0.00000

Note. Forecast data are price-dividend ratio (“Bob”). The summary statistics at top report the number of nonzero coefficients in
the model (excluding the intercept), the standard deviation of the forecast of expected return, and the standard deviation of the
optimal allocation to the risky asset. These are reported for the “train” data to 1952–2005 and the out-of-sample “test” 2006–
2012.

3. Alan—Text of the FOMC Meetings—Beige Book

The regional Federal Reserve Banks have been producing a blog-like commentary on the economy monthly (roughly)
since 1970. This text is our third set of data. Specifically, we use text from the “Beige Book” of the Federal Reserve System that is issued for each of the FOMC meetings. The Fed has a staff of economists who focus on measurement and forecasting. Prior to each of the eight meetings of the FOMC, the staff at the Federal Reserve Board publish several
documents with forecasts and commentary. The two most substantive documents are the “Greenbook” (formally “Cur-
rent Economic and Financial Conditions”) and the “Bluebook” (formally “Monetary Policy Alternatives”). Both of these
documents are available only to the FOMC and insiders at the Fed. They have been released as historical documents
with a rolling 5–6 year black-out window. These quantitative data are used in Romer and Romer (2004) as a measure
of Fed expectations and surprises. Here, we use the “Beige Book” (formally known as the “Summary of Commentary
on Current Economic Conditions by Federal Reserve District.”). Unlike the previous two, this information is released to
the public (approximately) 2 weeks prior to an FOMC meeting.9
The Beige Book data cover the period 1970–2015 and the (roughly) eight meetings per year. For each meeting, there
is a national summary and a report from each of the 12 regional banks. What is particularly interesting about this docu-
ment is that it is decidedly nonquantitative. It is “based on information gathered by Reserve Bank staff over the course
of several weeks. Each Federal Reserve Bank collects anecdotal information on current economic conditions in its Dis-
trict through reports from Bank and Branch directors and interviews with key business contacts, economists, market
experts, and other sources.”10 Alan Blinder, former Fed vice chairman, once characterized this process as the “ask your
uncle” method of gathering information. This data set is denoted “Alan” after long-standing Fed Chair Alan Greenspan.
Here is a sample (from November 1979 National Summary):

Manufacturing activity is particularly spotty. The automobile sector is exhibiting pervasive weakness, but there
exist definite areas of strength. St. Louis finds, with the exception of autos, a relatively high level of manufacturing
and particular strength in several industries including most capital goods, basic metal products, aircraft, and tex-
tiles. Minneapolis reports broadly based strength in industrial production. Dallas characterizes manufacturing as
flat with weakness in consumer durables being about offset by strength in nondurables and construction related

9 The release to the public began in 1983. The precursor to the Beige Book, the Red Book, has been released as historical data. In the period 1970–1983, these

reports were only internal to the Fed. However, we will set that aside and ask how an investor would have used the data had they been available over the sample.
10 See http://www.federalreserve.gov/monetarypolicy/fomc_historical.htm
FIGURE 7 Data: Bob—Wealth


Note. Forecast data are price-dividend ratio (“Bob”). Shown is the accumulated wealth from holding optimal allocation
to the risky equity based on the forecast of the equity risk premium. These are reported for the “train” data to
1952–2005. For these models plotted, only a = 0.0000_logb = −0.8307 has any time-series variation in the optimal
portfolio weight. The other plots shown are effectively “buy and hold.”

areas such as primary metals and stone, clay and glass products. Meanwhile, the Third District is reportedly five
months into a general downturn in manufacturing activity. Chicago reports high operating rates, but declining
order backlogs at most capital goods producers. The capital goods sector in the Cleveland District is still expe-
riencing backlogs amidst some concern that they might soon face reductions in orders. San Francisco sees most
sectors except auto and construction related ones doing well.

This is the sort of information where rich, nuanced, and often ambiguous language may convey more information
than the systematic measurement of predefined quantities. Does it work as a forecaster of equity returns? (And if
you are wondering, the word “spotty” shows up frequently—about 1,500th most frequent of the 200,000 or so unique
words and phrases in the data).

TABLE 4 Model coefficients—FRED

a=0.0000_logb=1.0786   a=0.5000_logb=−2.0659   a=1.0000_logb=−4.7127   a=1.0000_logb=−6.5734
nzero 169.00000 0.00000 17.00000 92.00000
Eer.train.sd 0.02671 0.00000 0.06272 0.08828
Eer.test.sd 0.02116 0.00000 0.04836 0.13070
weight-in-Rm.train.sd 0.15703 0.00000 0.36754 0.49860
weight-in-Rm.test.sd 0.12232 0.00000 0.28374 0.69809
(Intercept) 0.06076 0.05811 0.05839 0.06339
AAA_level_dif_z −0.00082 −0.00347 −0.01201
AWHMAN_level_z 0.00291
BAA_level_dif_z −0.00068 0.00272
CES3000000008_level_real_log_dif_dif_z −0.00026 −0.00514
CES4300000001_level_log_dif_z 0.00013 0.01093
CES4422000001_level_log_dif_z −0.00129 −0.00947 −0.01896
CIVPART_level_log_dif_z 0.00011 0.00156
CMWRBPPRIV_level_log_z −0.00008
CNERBPPRIV_level_log_z −0.00041 −0.01040
CP3M_level_dif_z −0.00057 0.01665
CPF3M_level_dif_z 0.00233 0.00758 0.00730
CPIENGSL_level_log_dif_dif_z −0.00223
CPN3M_level_dif_z −0.00236
CSOUBPPRIV_level_log_z −0.00029 0.00904
CUSR0000SA0L5_level_log_dif_dif_z −0.00010 0.00141
DIFFONE_level_dif_z −0.00404
EXUSUK_level_log_dif_z 0.00080 0.00637 0.00760
HOUSTMW_level_log_z −0.00216 −0.00453 −0.01420
HOUSTNF_level_log_z −0.00048
HOUSTS_level_log_z −0.00132
IPNMAN_level_log_dif_z −0.00056
IPUTIL_level_log_dif_z −0.00052
M0495AUSM346NNBR_level_log_dif_ −0.00080
dif_z
M0523BUSM244NNBR_level_pc_real_ 0.00073
log_dif_z
M0882BUSM350NNBR_level_dif_z −0.00183
M1125AUSM343NNBR_level_log_dif_z 0.00081

M1491BUSM144NNBR_level_pc_real_z −0.00271 −0.01151 −0.04292
NAPMBI_level_z 0.00134
NAPMCI_level_z −0.00294 −0.02315 −0.04765
NAPMEI_level_z −0.00182 −0.01253 −0.03145
NAPMEXI_level_z 0.00215 0.02681 0.05239
NAPMII_level_z −0.00209 −0.01134 −0.01323
NAPMIMP_level_z 0.00126 0.00700 0.03584
NAPMPRI_level_z −0.00172 −0.00525 −0.01716
NAPMSDI_level_z −0.00237 −0.03182 −0.02540
NAPM_level_z −0.00172
NMFEI_level_z 0.00196 0.00899 0.02195
PCEC96_level_pc_log_dif_z −0.00045
PCEDGC96_level_log_dif_dif_z 0.00014
PERMITMW_level_log_z −0.00164
PERMITS_level_log_z −0.00095
PERMIT_level_log_z −0.00147 −0.01147
UMCSENT_level_dif_z −0.00045
USEHS_level_log_dif_z −0.00095 −0.00282 −0.01028
USFIRE_level_log_dif_z −0.00012
USMINE_level_log_dif_z −0.00108 −0.00454 −0.01451

Note. Forecast data are from macroeconomic series of the St. Louis Federal Reserve Bank Economic Database (“FRED”). The
summary statistics at top report the number of nonzero coefficients in the model (excluding the intercept), the standard devi-
ation of the forecast of expected return, and the standard deviation of the optimal allocation to the risky asset. These are
reported for the “train” data to 1952–2005 and the out-of-sample “test” 2006–2012. A selection of parameter coefficients is shown, selected based on absolute size. The series name corresponds to the data key used on FRED. All data are normalized to
mean zero and variance of one.

There are many ways to use text data—machine learning and natural language processing are fast-evolving fields.
Here, a “text regression” to forecast equity returns lines up with the decision framework we have in mind.11 To imple-
ment the text regression (6) with the simplicity preference in Equation (7), we need to define how the text maps to
the quantity “x.” We merge all the Beige Books at a given month and then use a model that simply counts words (“bag
of words”).12 Capturing the nuance in the text can be difficult. In the example above, just what was “spotty”? To cap-
ture words in context we augment word counts with adjective–noun and adverb–verb “phrases.”13 Table 2 shows an
example of phrases. The words “commercial” and “investment” are quite frequent in the data. They are also used in
many contexts. Table 2 shows the output of our phrase extraction tool for phrases that include these words. Notice
that many of the phrases reflect important differences. A commercial bank is a bank (finance) and plays a very different

11 Text regressions were introduced in Kogan, Levin, Routledge, Sagi, and Smith (2009) on data using 10K text to predict return volatility. Other examples,
see Yogatama et al. (2011) and Chahuneau, Gimpel, Routledge, Scherlis, and Smith (2012). The Beige Book data were used in Yogatama, Routledge, and Smith
(2013) and Yogatama, Wang, Routledge, Smith, and Xing (2014), although these are not text regression models. Other finance papers that use text as data, for
example Tetlock (2007), use a similar word count as a representation and then employ a “dictionary” to reduce the dimensionality (e.g., a predefined sentiment
score).
12 We downcase, remove punctuation, and tokenize numbers and dollars (5.0 and $6.0 are replaced by “xNUMBERx” and “xDOLLARx”). In “bag-of-words,” this

is mostly for convenience and has little impact on results.


13 Text is part-of-speech tagged using the Stanford Tagger. See Toutanova, Klein, Manning, and Singer (2003).
FIGURE 8 Data: FRED—Weights


Note. Forecast data are from macroeconomic series of the St. Louis Federal Reserve Bank Economic Database
(“FRED”). Shown is the optimal allocation to the risky equity based on the forecast of the equity risk premium. These
are reported for the “train” data to 1952–2005.

role in the economy than a commercial builder (manufacturing). Similarly, security investments reflect a financing activity, whereas capital investment reflects a physical economic activity.
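The paper’s phrase extraction uses part-of-speech tags from the Stanford Tagger (footnote 13). As a rough, simplified stand-in, the sketch below collects adjective–noun and adverb–verb bigrams using NLTK’s tagger; it is illustrative only and assumes the required NLTK tokenizer and tagger models have been downloaded.

```python
import nltk  # assumes the required NLTK tokenizer and tagger models are downloaded

def extract_phrases(text):
    """Collect adjective-noun and adverb-verb bigrams from part-of-speech tags
    (a simplified stand-in for the paper's phrase extraction tool)."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text.lower()))
    phrases = []
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if (t1.startswith("JJ") and t2.startswith("NN")) or \
           (t1.startswith("RB") and t2.startswith("VB")):
            phrases.append(f"{w1}_{w2}")
    return phrases

print(extract_phrases("Manufacturing activity is particularly spotty and commercial construction remained weak."))
```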
Phrases and words are counted in each document. Count data are skewed (most words are very infrequent) so we
use the transformation:

$$x_{i,t} = \log\big(1 + \mathrm{count}(i, t)\big),$$

where count(i, t) counts the word or phrase i in the Beige Book at date t.14 Note that this conveniently preserves the sparse structure—most of the words do not appear in a given document.

14 This structure means that regression coefficients are informative about the percentage change in counts. That is, if word j has a count of c and we contemplate increasing that count to c(1 + Δ), the change in the forecast is

$$\theta_j\big(\log(1 + c(1+\Delta)) - \log(1+c)\big) = \theta_j \log\!\left(\frac{1 + c(1+\Delta)}{1+c}\right) \approx \theta_j \log(1+\Delta) \approx \theta_j \Delta.$$

The approximation is sensible if c is big and Δ is small.


FIGURE 9 Data: FRED—Wealth


Note. Forecast data are from macroeconomic series of the St. Louis Federal Reserve Bank Economic Database
(“FRED”). Shown is the accumulated wealth from holding optimal allocation to the risky equity based on the forecast
of the equity risk premium. These are reported for the “train” data to 1952–2005.

Similarly, this works well for months with no meetings: simply set xi,t = 0 for all words. Finally, we prune the word counts and only include the 10,000 most frequent words. (You can include them all; it does not change the results much.)
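A sketch of constructing the document–term matrix with the log(1 + count) transformation. CountVectorizer and the vocabulary cap stand in for the paper’s counting pipeline; the two documents are placeholders.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder corpus: one merged Beige Book text per meeting date.
docs = [
    "manufacturing activity is particularly spotty and commercial construction is weak",
    "retail sales rose modestly while energy prices remained subdued",
]

vectorizer = CountVectorizer(max_features=10_000)   # prune to the 10,000 most frequent terms
counts = vectorizer.fit_transform(docs)             # sparse matrix of raw counts
X = counts.log1p()                                  # x_{i,t} = log(1 + count(i, t)), stays sparse

print(X.shape)
print(vectorizer.get_feature_names_out()[:8])
```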

4.2 Results: Portfolios and wealth


So what do people with preference for simplicity and these data sets choose for an allocation to risky assets? For each
of the data sets, we use the period 1952–2005 as “data,” hn. We solve for the optimal theory, given preference parameters a and b, by fitting the regularized regression in Equations (6) and (7). Then, using this model, we can forecast the 1-year-ahead equity return and, for each forecast, solve for the (myopic, one-period, 1-year) optimal wealth allocation to the risky asset. Our decision makers are all myopic. But we can look at their decisions given different inputs. This
generates a “time series” of optimal portfolios. For each of these, we can track performance.

TABLE 5 Model coefficients—Alan

a=1.0000_logb=−6.5732   a=1.0000_logb=−7.0849   a=0.5000_logb=−4.2985   a=0.5000_logb=−2.2517
nzero 107.00000 139.00000 52.00000 0.00000
Eer.train.sd 0.10986 0.11498 0.07729 0.00000
Eer.test.sd 0.02383 0.04368 0.00646 0.00000
weight-in-Rm.train.sd 0.55082 0.56581 0.39623 0.00000
weight-in-Rm.test.sd 0.13057 0.23376 0.03736 0.00000
(Intercept) 0.06928 0.07142 0.06252 0.05811
airline-industry 0.05192 0.05194 0.02529
and-real-estateresidential-construction 0.01847 0.01753
are-weakening −0.01109 −0.01210
arms −0.00575 −0.00815
business-customers 0.00925 0.00046
came-too-late −0.01220 −0.00191
defense-procurement 0.00352 0.00778
did-indicate −0.01607 −0.01141
energy-problem 0.00930 0.00308 0.01074
energy-shortages 0.01369 0.01492 0.00583
energythe −0.00703 −0.01376
fed-funds 0.00398 0.01042
financial-panelthis 0.00851 0.00677 0.00003
fuel-shortage 0.00617 0.00383 0.01278
gulf-war 0.00569 0.00692
health-insurance 0.02415 0.03245
information-technology −0.03723 −0.03852
insurance-costs 0.09914 0.08545 0.01098
interstate-banking 0.01513 0.01281 0.01083
iraq 0.02151 0.01553 0.03912
is-required −0.00858
large-paper −0.01865
leary 0.01209 0.01232 0.01013
loosening 0.01258 0.00561
manufacturingconditions 0.01369 0.01621
manufacturingthe 0.00785 0.00773
md 0.01574 0.00959 0.00478
minneapolis-fed 0.01066 0.01280
morale −0.00643 −0.00433
new-groundbreakings 0.01783 0.00162 0.01413
nonfarm-employment −0.00702 −0.04094
numbertoken-even 0.02103 0.02405 0.01937
only-exception 0.02623 0.03729 0.00980
peak-levels −0.01133 −0.01541

permanent-financing −0.00563
pik-program 0.01116 0.00358 0.01530
pockets-of-strength 0.01118 0.01476
pricesmost 0.02242 0.03294
relying −0.02073 −0.08603
remained-subdued 0.01849 0.01979
rental-market 0.03197 0.01897
resource-related 0.03511 0.03430 0.02563
rose-modestly 0.00075 0.06568
rrb 0.00537 0.00415
same-store-sales 0.00268 0.00170
scrap −0.00722 −0.01245
second-mortgages −0.00258 −0.00181
single-family-permits −0.00700 −0.00981
south-st 0.01396 0.00875 0.01521
strong-dollar 0.01473 0.01098
target-range 0.00504 0.01255
telephone-survey-of-hotels −0.03318 −0.03989
thus-reported 0.00492
tourism-sector 0.01140 0.04983
tourismthe 0.01815 0.02270
weak-prices 0.00764 0.02771
widespread-increases 0.00080

Note. Forecast data are text from the Beige Books of the FOMC. The summary statistics at top report the number of nonzero
coefficients in the model (excluding the intercept), the standard deviation of the forecast of expected return, and the standard
deviation of the optimal allocation to the risky asset. These are reported for the “train” data to 1952–2005 and the out-of-
sample “test” 2006–2012. A selection of parameter coefficients is shown selected based on absolute size. The parameters are
words (phrase) count, n, transformed by log(1 + n).

That is, hn is held constant across experiments—it is the data from the 1952–2005 period. We solve for the model:

$$\theta^*(h_n, a, b) = \operatorname*{argmax}_{\theta}\; -b\big(a\,\|A - \theta\| + (1-a)\,\|B\theta\|\big) + \sum_{(x,y)\in h_n} \ell(y, x, \theta).$$

Then, using 𝜃∗(hn), we solve for the optimal portfolio at various points in time.

$$w_t^*(h_n, a, b) = \operatorname*{argmax}_{w}\; E_{\theta^*(h_n)}\!\left[\, E\!\left[\, \alpha^{-1} \tilde{W}_1^{\alpha} \,\middle|\, y_t \,\right] \,\middle|\, x_t \right].$$

This lets us observe behavior “in-sample,” for years up to 2005, and “out of sample” for years beyond. We can wiggle the preference parameters over simplicity, a and b—recall a drives sparseness (few nonzero parameters) and b drives the magnitude of the parameters. We can also look across data sets hn = {Bob, FRED, Alan}. For all these examples, we hold the curvature parameter 𝛼 constant at −4.0. That level of risk aversion implies an average allocation to the risky asset of about 0.35 (which gives some “room” for allocations to vary and still remain in a plausible range).
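Pulling these pieces together, a sketch of the experiment’s loop: fit the penalized regression on the “train” sample, forecast the premium from each month’s signals, and map the forecast to a weight. For brevity the weight uses the mean-variance approximation to the CRRA problem (w ≈ forecast / ((1 − 𝛼)𝜎²)) rather than the full expected-utility maximization; the data, split sizes, and parameter values are placeholders, not the paper’s implementation.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)

# Placeholder design mimicking a train (1952-2005) / test (2006-2012) split.
T_train, T_test, K = 648, 84, 50
X = rng.standard_normal((T_train + T_test, K))
r = 0.06 + 0.02 * X[:, 0] + 0.15 * rng.standard_normal(T_train + T_test)

fit = ElasticNet(alpha=1e-3, l1_ratio=0.5, max_iter=10_000).fit(X[:T_train], r[:T_train])

# Forecast the equity premium each month and map it to a portfolio weight.
alpha_crra, sigma = -4.0, 0.20
forecast = fit.predict(X)                               # stand-in for the forecast of y_t given x_t
weight = np.clip(forecast / ((1 - alpha_crra) * sigma**2), 0.0, 2.0)

print("weight sd, train:", round(weight[:T_train].std(), 3),
      " test:", round(weight[T_train:].std(), 3))
```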
FIGURE 10 Data: Alan—Weights


Note. Forecast data are text from the Beige Books of the FOMC (“Alan”). Shown is the optimal allocation to the risky
equity based on the forecast of the equity risk premium. These are reported for the “train” data to 1952–2005.

To help summarize things, we use performance and track accumulated wealth associated with the portfolios
wt∗ (hn , a, b). Define:

$$\log W_{t+12} = \log W_t + w_t^*(h_n, a, b)\, r_{t+12},$$

where rt+12 is the equity (excess) return over the subsequent 12 months. (Plotting this annual item at each month
implies some smoothing that helps visualize the results.)
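The wealth accumulation itself is a one-line recursion; a minimal sketch, iterated over non-overlapping years for simplicity (the paper plots the monthly, overlapping version), with placeholder weights and returns:

```python
import numpy as np

def accumulate_log_wealth(weights, returns_12m_ahead):
    """log W_{t+12} = log W_t + w_t * r_{t,t+12}, stepped year by year."""
    log_w = [0.0]                                         # W_0 = 1
    for w_t, r_next in zip(weights[::12], returns_12m_ahead[::12]):
        log_w.append(log_w[-1] + w_t * r_next)
    return np.array(log_w)

rng = np.random.default_rng(0)
r12 = rng.normal(0.06, 0.20, 120)                         # placeholder 12-month excess returns
print(accumulate_log_wealth(np.full(120, 0.35), r12)[-1]) # constant 0.35 (buy-and-hold) weight
```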
We start by looking “in-sample”—that is, looking at behavior when the data the agent is acting on come from the data set used to choose the model. Not surprisingly, performance will be misleadingly impressive. We look first at
“Bob”—using the price-dividend data. Shown in Figure 6 are portfolio weights for some of the people using these data.
In this plot, many of the preference parameters (larger a and b) put zero weight on the price-dividend observation and
hence produce no variation in the forecast of expected return. The coefficients for some of these models are shown in
FIGURE 11 Data: Alan—Wealth


Note. Forecast data are text from the Beige Books of the FOMC (“Alan”). Shown is the accumulated wealth from
holding optimal allocation to the risky equity based on the forecast of the equity risk premium. These are reported for
the “train” data to 1952–2005.

Table 3. In data set “Bob,” the preference for simplicity, in the parameters a and b, effectively results in two models: either the agent runs an (OLS-styled) regression using the price-dividend ratio or holds a constant portfolio. The results in between—where you place a modest weight on the price-dividend ratio—happen, for the parameters we are looking at, to be knife edge; the result is the constant (buy-and-hold) strategy. For the other fitted models that put weight on the price-dividend ratio, the portfolio weights fluctuate (modestly, relative to the data sets below). The resulting wealth profile for these two behaviors is in
Figure 7. As mentioned, the in-sample performance of relying on the predictive variable makes for (overly) impressive
results. Out-of-sample results are presented in Figure 12.
The simplicity parameters have a bigger impact on portfolios using the FRED data. See Table 4 for examples. Here,
the models use none, some, or all of the various macro series available. Among the parameters shown in the table—a selection of the ones the models view as important—some are familiar: interest rates (AAA, BAA, CPN3M), consumption (PCE),
housing starts (HOUSTNF, PERMIT), industrial production (IPUTIL). Notice the variability in the forecast of expected
FIGURE 12 Data: hn = {Bob, FRED, Alan}—Wealth, out of sample
Note. All three data sets are represented here. (a) “Bob”: Price-Dividend data, (b) “FRED”: Economic Data from the St.
Louis Fed Database, and (c) “Alan”: Text from the Beige Books of the FOMC meetings. Shown is the accumulated
wealth from holding optimal allocation to the risky equity based on the forecast of the equity risk premium. The
models were trained on data from 1952 to 2005. Shown are the results for 2005–2012, out of sample. For
comparison, the buy-and-hold constant portfolio holds a 0.35 allocation to the risky asset (the optimal allocation for a
constant 6% equity risk premium given the model parameters).

returns and the resulting weight in the optimal portfolio depend on the a and b simplicity preference (regularization)
parameters. As with Bob, above, a large value of b drives the forecast and portfolio weights to be constant. Figure 8
shows the portfolio weights for “FRED” across a range of preference parameters. Here, the number of nonzero regressors and the size of the regressors vary with the preference parameters. Wealth implications are shown (in-sample) in
Figure 9. Again, the in-sample results are clearly benefiting from being in-sample. Notice the bottom line is for the con-
stant portfolio (large value of b) and is the buy-and-hold portfolio.
Finally, we look at the text-based portfolio using the soft data of the text found in the FOMC Beige Books. The
model parameters, for a selection of simplicity parameters, are shown in Table 5. Again, the preference parame-
ters a and b regulate the number and size of the coefficients and, in turn, the variability of the equity risk premium

forecast and asset allocation weight. The weights are shown in Figure 10. The in-sample impact on wealth from port-
folios derived from these data is shown in Figure 11. Interestingly, the “Alan” results are similar to using the full array
of macro data series. At least in sample, the soft text is generating the same level of information about future equity
returns. Table 5 highlights some of the phrases the model picks out whose word count is salient for the forecast. For
example, “fed-funds,” “manufacturing-conditions,” and “real-estate-residential-construction” are not surprising. The
model also reflects the economic setting of the sample period with “energy-shortages” and “iraq.” Some items like
“telephone-survey-of-hotels” are picking up exactly what the Beige Book information is intended to capture. Finally,
there are phrases that reflect more of a sentiment or modality, like “rose-modestly,” “loosening,” and “remained-
subdued” that are interesting.
Interestingly, the portfolio weights (see Figure 10) are quite different from those for the FRED data. First notice the flat portion pre-1970—that is simply because the text data are not available in that period. The portfolios shown
are both more pronounced and more highly serially correlated (smoother). That is particularly noticeable in 2000 and
onward. Presumably, there is less text variation month-to-month than in the macro series. Given the ambiguity in the-
ory about what is driving the time variation in equity premium, it is not clear which agent is more “right.”
We now turn to the performance and behavior of these agents post-2005—out of the sample they used to build their model. Figure 12 shows the wealth implications of the allocation to the risky asset for one of the agents using each of the three data sets. Several models for each are shown—the combination of data and preference drives the portfolio. There was substantial variation in all the predictive data through the financial crisis. Each of the agents has a portfolio that forecasts variation in the equity risk premium and is, therefore, sensitive to the signal they receive. The three data set types are quite different. The agent using the price-dividend ratio, “Bob,” has strong variation in the portfolio weight—indeed the price-dividend ratio swung wildly in the financial crisis. The preference parameters play a particularly strong role for the FRED macro data. The wealth performance across various models is similar for the “Bob” and “Alan” data. But the models using the “FRED” data are very sensitive to the simplicity coefficients a and b. Does this capture a heavier worry of “overfitting” in the quantitative FRED macroeconomic data? Or does this point to some change in the relationship between macro data and the equity premium through the crisis? Both are interesting questions for future research.

5 CONCLUSION

In this basic setting, there is much more to explore. But the model is promising. It provides preference parameters (and
the functional form) that capture how people incorporate data into their decision making. Looking at the model through
the lens of the canonical asset allocation problem lets us assess if the new preference parameters are delivering behav-
ior that is helpful in understanding asset markets and for developing tools that use data for portfolio construction.
For example, the model can shed light on the debate about how agents might behave in asset models with Bansal and
Yaron (2004) long-run risk (Dew-Becker & Bidder, 2015; Hansen & Sargent, 2010). Alternatively, it will be interesting
to explore equilibrium models of preference heterogeneity where the heterogeneity is in the “simplicity” parameters
(here a and b) (say along the lines of Dumas, Kurshev, & Uppal, 2009; Gallmeyer & Hollifield, 2008; or Osambela, 2015).
There are several extensions that are worth pursuing. One is to extend the specification to multiple assets to look
at richer portfolios. The “regularized” regression we used here is a natural analog to a Black–Litterman shrinkage esti-
mation model (see Wang, 2005). The second extension is to extend the framework to the dynamic setting of Epstein
and Zin (1989). In one sense, this is “easy.” For example, we use data from 1952 to 2005 to build the agent’s preferred
empirical model and then simply take that model as the “date-0” preferences and solve for optimal portfolios in the
Epstein and Zin (1989) framework. What is more interesting is to allow the empirical model to be solved dynamically.
This is harder, of course, since we need to impose time-consistency on the agent (or deal with the consequences of
time-inconsistent decision making).

ACKNOWLEDGMENTS

Thanks for comments and suggestions from seminar participants at Carnegie Mellon University, the University of Utah,
and Pennsylvania State University.

REFERENCES

Baker, S. D., & Routledge, B. R. (2013). The price of oil risk. Carnegie Mellon University Working Paper.
Bansal, R., & Yaron, A. (2004). Risks for the long run: A potential resolution of asset pricing puzzles. Journal of Finance, 59, 1481–
1509.
Campbell, J. Y., & Shiller, R. J. (1988a). The dividend-price ratio and expectations of future dividends and discount factors.
Review of Financial Studies, 1, 195–228.
Campbell J. Y., & Shiller, R. J. (1988b). Stock prices, earnings, and expected dividends. Journal of Finance, 43, 661–676.
Chahuneau, V., Gimpel, K., Routledge, B. R., Scherlis, L., & Smith, N. A. (2012). Word salad: Relating food prices and descriptions.
In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural
Language Learning. Stroudsburg, PA: Association for Computational Linguistics, pp. 1357–1367.
Cochrane, J. H. (2011). Presidential address: Discount rates. Journal of Finance, 66, 1047–1108.
Coppersmith, G., Dredze, M., Harman, C., & Hollingshead, K. (2015). From ADHD to SAD: Analyzing the language of mental
health on twitter through self-reported diagnoses. In Proceedings of the Workshop on Computational Linguistics and Clinical
Psychology: From Linguistic Signal to Clinical Reality. Denver, CO, June: North American Chapter of the Association for Com-
putational Linguistics.
Dew-Becker, I., & Bidder, R. (2015). Long-run risk is the worst-case scenario. Northwestern University Working Paper.
Dumas, B., Kurshev, A., & Uppal, R. (2009). Equilibrium portfolio strategies in the presence of sentiment risk and excess volatil-
ity. Journal of Finance, 64(2), 579–629.
Epstein, L. G., & Zin, S. E. (1989). Substitution, risk aversion, and the temporal behavior of consumption and asset returns: A
theoretical framework. Econometrica, 57(4), 937–969.
Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Jour-
nal of Statistical Software, 33(1), 1–22.
Gabaix, X. (2014). A sparsity-based model of bounded rationality. Quarterly Journal of Economics, 129(4), 1661–1710.
Gallmeyer, M., & Hollifield, B. (2008). An examination of heterogeneous beliefs with a short-sale constraint in a dynamic econ-
omy. Review of Finance, 12(2), 323–364.
Gilboa, I., & Samuelson, L. (2012). Subjectivity in inductive inference. Theoretical Economics, 7(2), 183–215.
Gilboa, I., & Schmeidler, D. (2003). Inductive inference: An axiomatic approach. Econometrica, 71, 1–26.
Gilboa, I., & Schmeidler, D. (2010). Simplicity and likelihood: An axiomatic approach. Journal of Economic Theory, 145(5), 1757–
1775.
Hansen, L. P., & Sargent, T. J. (2010). Fragile beliefs and the price of uncertainty. Quantitative Economics, 1(1), 129–162.
Hoerl, A., & Kennard, R. (1988). Ridge regression. Encyclopedia of Statistical Sciences, 8, 129–136.
Kim, Y., & Routledge, B. R. (2019). Does macro-asset pricing matter for corporate finance? Critical Finance Review. Retrieved
from https://scholars.cityu.edu.hk/en/publications/publication(8ba3622c-0d6c-4da6-9518-453bba30112b).html
Kogan, S., Levin, D., Routledge, B. R., Sagi, J. S., & Smith, N. A. (2009). Predicting risk from financial reports with regression. In
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for
Computational Linguistics. Stroudsburg, PA: Association for Computational Linguistics, pp. 272–280.
Koijen, R. S., Moskowitz, T. J., Pedersen, L. H., & Vrugt, E. B. (2013). Carry. National Bureau of Economic Research Discussion
Paper.
Ludvigson, S. C., & Ng, S. (2009). Macro factors in bond risk premia. Review of Financial Studies, 31, 399–448.
O’Connor, B., Balasubramanyan, R., Routledge, B. R., & Smith, N. A. (2010). From tweets to polls: Linking text sentiment to public
opinion time series. In Fourth International AAAI Conference on Weblogs and Social Media, pp. 122–129.
Osambela, E. (2015). Differences of opinion, endogenous liquidity, and asset prices. Review of Financial Studies, 28, 1914–1959.
Romer, C. D., & Romer, D. H. (2004). A new measure of monetary shocks: Derivation and implications. American Economic
Review, 94(4), 1055–1084.
Simon, H. A. (1959). A behavioral model of rational choice. American Economic Review, 49(3), 253–283.
Tetlock, P. C. (2007). Giving content to investor sentiment: The role of media in the stock market. Journal of Finance, 62(3),
1139–1168.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Method-
ological), 58, 267–288.
Tibshirani, R. (2011). Regression shrinkage and selection via the lasso: A retrospective. Journal of the Royal Statistical Society:
Series B (Statistical Methodology), 73(3), 273–282.

Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., & Knight, K. (2005). Sparsity and smoothness via the fused lasso. Journal of the
Royal Statistical Society: Series B (Statistical Methodology), 67(1), 91–108.
Toutanova, K., Klein, D., Manning, C. D., & Singer, Y. (2003). Feature-rich part-of-speech tagging with a cyclic dependency net-
work. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on
Human Language Technology-Volume 1. Stroudsburg, PA: Association for Computational Linguistics, pp. 173–180.
van Binsbergen, J., Hueskes, W., Koijen, R., & Vrugt, E. (2013). Equity yields. Journal of Financial Economics, 110(3), 503–519.
Wang, Z. (2005). A shrinkage approach to model uncertainty and asset allocation. Review of Financial Studies, 18(2), 673–705.
Welch, I., & Goyal, A. (2008). A comprehensive look at the empirical performance of equity premium prediction. Review of Finan-
cial Studies, 21(4), 1455–1508.
Yogatama, D., Heilman, M., O’Connor, B., Dyer, C., Routledge, B. R., & Smith, N. A. (2011). Predicting a scientific community’s
response to an article. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA:
Association for Computational Linguistics, pp. 594–604.
Yogatama, D., Routledge, B. R., & Smith, N. A. (2013). A sparse and adaptive prior for time-dependent model parameters. arXiv
preprint arXiv:1310.2627.
Yogatama, D., Wang, C., Routledge, B. R., Smith, N. A., & Xing, E. P. (2014). Dynamic language models for streaming text. Trans-
actions of the Association for Computational Linguistics, 2, 181–192.
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series
B (Statistical Methodology), 67(2), 301–320.
