ORIGINAL ARTICLE
Bryan R. Routledge
1 INTRODUCTION
There is no shortage of data. At your fingertips are 237,000 data series at the St. Louis Federal Reserve Bank’s Eco-
nomic Data (FRED). The Securities and Exchange Commission (SEC) received 304,000 corporate filings (e.g., 10K, 10Q,
8K, Form 4) during the first quarter of 2018. The SEC has over 18 million electronic filings from 1994. Mix in social
media sites like Twitter and data sets of size in the billions are common.1 Are any of these data helpful in decision making? In the finance context we look at here, despite all the data sets and sources available, we still have only about 840 monthly observations of, say, postwar equity returns. Does that limit the value of the larger data sets? The goal of this paper is to explain how machine learning—specifically regularized regression—captures how individuals might use large and varied data for decision making. Here, we look at the canonical asset allocation problem and characterize individuals’ preferences over “models” for data. Viewed through the lens of a portfolio optimization, we look at how these data become information.
The economic context for the model is a simple stock-bond asset allocation problem. At the core of this problem is
the level and dynamic properties of the equity premium (the rate of return on a broad portfolio of equities in excess
of the risk-free return). From data, and much finance research, we know the equity premium has substantial variation
in its conditional mean. The unconditional expectation of the equity premium is around 6%. However, the conditional
expectation commonly swings substantially from 0% to 12% (see Cochrane, 2011) or as evidenced in the term struc-
ture, see van Binsbergen, Hueskes, Koijen, and Vrugt (2013). Similar time variation in risk premiums shows up in bonds
c 2019 Financial Management Association International
1 For example, O’Connor, Balasubramanyan, Routledge, and Smith (2010), and, Coppersmith, Dredze, Harman, and Hollingshead (2015).
(Ludvigson & Ng, 2009), oil futures (Baker & Routledge, 2013), and foreign exchange (Koijen, Moskowitz, Pedersen, &
Vrugt, 2013). The implication of the time variation in the mean return is that ex post returns are, to some degree, predictable. Indeed, the main empirical support for the time variation in expected returns is the regression predicting the excess return at horizon h, rt+h, with information at date t, Xt:

rt+h = 𝛿0 + 𝛿′Xt + 𝜖t+h.    (1)
Across markets, the set of predictors varies. The aggregate price-dividend ratio is used as a forecaster in equity markets; the slope of the futures curve works in oil markets. In the bond market, Ludvigson and Ng (2009) use predictors extracted as the principal components of 130 economic series. In addition, as you would expect, realized returns are
very volatile and so the precision (R2 ) of the predictive regression is low. More relevant here, not everyone is convinced
that the predictability is particularly useful. Welch and Goyal (2008) point out that out-of-sample forecasts are not
reliable.2
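The in-sample versus out-of-sample contrast that Welch and Goyal (2008) emphasize is easy to replicate on synthetic data. Below is a minimal Python sketch; the persistence, slope, and noise parameters are illustrative assumptions of ours, not estimates from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic monthly data: a persistent predictor x_t (a stand-in for a
# price-dividend ratio) and a noisy excess return r_{t+1} with a small
# predictable component. All parameter values are illustrative.
T = 840  # roughly the postwar monthly sample mentioned in the text
x = np.zeros(T)
for t in range(1, T):
    x[t] = 0.98 * x[t - 1] + rng.normal(0, 0.1)
r = 0.005 + 0.01 * x[:-1] + rng.normal(0, 0.04, T - 1)  # r_{t+1} on x_t

def ols(xs, ys):
    """Intercept and slope of a univariate OLS regression."""
    b = np.cov(xs, ys)[0, 1] / np.var(xs)
    return ys.mean() - b * xs.mean(), b

# In-sample R^2 over the full sample.
a_hat, b_hat = ols(x[:-1], r)
r2_in = 1 - np.var(r - (a_hat + b_hat * x[:-1])) / np.var(r)

# Out-of-sample R^2: forecast each r_{t+1} using only data through t,
# benchmarked against the recursive historical mean (Welch & Goyal, 2008).
sse_model = sse_mean = 0.0
for t in range(120, T - 1):  # 10-year burn-in
    a_t, b_t = ols(x[:t], r[:t])
    sse_model += (r[t] - (a_t + b_t * x[t])) ** 2
    sse_mean += (r[t] - r[:t].mean()) ** 2
r2_oos = 1 - sse_model / sse_mean

print(f"in-sample R^2: {r2_in:.3f}, out-of-sample R^2: {r2_oos:.3f}")
```

Even with genuine predictability built in, the out-of-sample R² is typically far smaller (and noisier) than the in-sample number—the Welch–Goyal point.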
To investigate this question, we use the framework of Gilboa and Schmeidler (2010), Gilboa and Samuelson (2012),
and Gilboa and Schmeidler (2003) to represent a decision maker’s preference over “models” of the data. In particu-
lar, we propose a representation and assume a functional form that captures the dual objectives: people like models
that explain the data (e.g., likelihood) and people like models that are simple (e.g., a small number of parameters). Of
course, these two desires are often at odds. A model with more parameters fits the data better (at least in-sample).
Machine learning tackles this trade-off by “regularizing” overparameterized models (see Tibshirani, 1996, 2011; Zou
& Hastie, 2005). These techniques look to exploit patterns in data that perform well out of sample. Here, we interpret
this approach through the axiomatic foundations of Gilboa and Schmeidler (2010). This lets us interpret and control
the transition from data to information used in decision making as preference parameters akin to a coefficient of risk
aversion. When we embed all this in the familiar portfolio problem we can see if we have preference parameters that
generate sensible behavior.3
First, we use the Gilboa and Schmeidler (2010) setting to characterize simplicity or parsimony in data models. Next,
we sketch an example with a two-state equity premium to clarify the model’s preference structure. Then, to see how
this might work in practice, we use multiple sources of economic data. For now, we simplify to the “static” (or single
period) portfolio problem. This is helpful since the simplicity of the decision step lets us focus on the new feature of “model” preference. We implement this using three data sets, each different in character. First, we use the
familiar price-dividend data as in Cochrane (2011). Second, we use an array of several hundred monthly macroeco-
nomic series from the St. Louis Fed’s data (FRED). This is along the lines of the data used in Ludvigson and Ng (2009).
Finally, we use text as data. Specifically, we use the Beige Book reports of the Federal Reserve staff economists that
characterize the economy using informal surveys.
Gilboa and Schmeidler (2010), and related papers Gilboa and Samuelson (2012) and Gilboa and Schmeidler (2003),
present an axiomatic foundation for using data and incorporating a preference for simplicity or parsimony. To summarize and adapt this setting to asset allocation, we start by characterizing “data.” Data are of the form (xn, yn): xn ∈ 𝕏 is a (perhaps large) vector of “signals,” and yn ∈ 𝕐 is a scalar “state.” The signal xn is known by the decision maker and may be useful in predicting the unobserved state yn. We will work with the set of joint probability distributions over 𝕏 × 𝕐.
2 Kim and Routledge (2019) look at the corporate finance implications of ignoring (or not) time variation in the risk premium.
3 Our model here is different from the related paper of Gabaix (2014). That paper characterizes a nonmaximizing or “bounded rationality” (Simon, 1959) approach to capture the idea that some data are “ignored.” Here, we will define preferences and an optimization so that any data that are “ignored” are an optimal choice.
ROUTLEDGE 1071
The idea here is that the signal xn is useful for predicting the unknown state yn . The state yn is relevant for the
decision problem at hand (below). The probabilistic relation between xn and yn is not known but the decision maker has
access to a data set. The data set of length n is denoted as hn = ((x0 , y0 ), … , (xn−1 , yn−1 )) (does not include n). The space
of possible data sets of length n is Hn and the set of all such data is H = ∪n≥0 Hn . In general, a data-generating process is
a function Δ from histories H to joint probability distributions over 𝕏 × 𝕐. To make this workable, we restrict the set of possible data-generating processes to ones that are i.i.d. and do not depend on the history.4 So an admissible Δ has Δ(x, y) giving the probability of observing (x, y), and we denote by 𝛿(y; x) the probability of observing y given signal x.
A “theory” is a possible data-generating process. Note that theories are “just” probability distributions. We are
directly interested in the prediction of yn given xn, denoted 𝜃(y; x) for theory 𝜃. The set of possible theories is the same as the set of possible data-generating processes. This implies the true data-generating process is in the set of theories our decision maker considers. The decision maker has data in hn and ranks possible theories: 𝜃1 ⪰hn 𝜃2 means that, given the data in hn, theory 𝜃1 is at least as preferred as 𝜃2 (or, in Gilboa and Schmeidler’s phrase, “at least as plausible”). Putting all these preference orderings together, {⪰hn }hn ∈Hn ,n=0,1,2,… will describe how one chooses a the-
ory. For example, the familiar maximum likelihood criterion is one way to construct a preference ordering. Gilboa
and Schmeidler (2010) add axioms for the preferences that allow for consideration of both “fit” (likelihood) and
“simplicity.”
Using a theory in the familiar Savage Expected Utility framework results in the following objective function. A decision maker chooses, based on the data in hn and the signal xn, using:
E𝜃∗ [E[u(c) | yn] | xn],    (2)

s.t. 𝜃∗ = argmax𝜃 v(𝜃) + ∑(x,y)∈hn 𝓁(y, x, 𝜃).
To describe this decision rule, think of the asset allocation problem we will explore below. The inner expectation
E[u(c)|yn ] governs how one chooses an asset allocation to the risky asset given a specific value for the equity risk pre-
mium (the yn ). The forecast of the equity risk premium, yn , given the current data xn —say the current price-dividend
ratio—depends on the theory 𝜃 ∗ . For example, something like Equation (1). Finally, that theory, 𝜃 ∗ , is chosen based on
the data seen in history hn .
The functions v(𝜃) and 𝓁(y, x, 𝜃) characterize the decision maker’s preferences to choose 𝜃 ∗ . The chosen 𝜃 ∗ deter-
mines how xn is used to forecast yn . The function 𝓁(y, x, 𝜃) uses the data in the history with the sum over hn . In our
implementation below, this will be a likelihood. Preferences over theories, sensibly, embody a preference for fitting the data. However, the function v(𝜃) does not depend on data. This captures a preference over theories themselves—say, simplicity. In our asset allocation setting below, the v(𝜃) will be a “regularization” of a regression that imposes a preference that fewer parameters in a regression is better. As with any utility theory, the functional forms we use here are specific examples (like the Constant Relative Risk Aversion (CRRA) risk preference). We use the asset allocation setting to
evaluate the usefulness of this specification.
To see how this works, here is an example adapted from Gilboa and Samuelson (2012) to an asset allocation setting.
Imagine a signal xn and a state yn that are coin tosses. Consider a one-consumption-date, one-risky-asset setting with end-of-period wealth W1, where the investor chooses the proportion of wealth invested in the risky asset. Define:
W1 = W0 (w(R − Rf ) + Rf ),
4 This is stronger than needed. However, we can adapt this setting to the usual time-series stationarity assumptions—for example, a VAR model with i.i.d. innovations.
TA B L E 1 Parameters

Data-generating process
  Signal: x ∈ {tail, head}
  State: y ∈ {tail, head}
  Process Δ: p(x = head) = 0.50; p(y = head | x = tail) = 0.48; p(y = head | x = head) = 0.80

Assets
  Risk-free asset: log Rf = 0.0
  Risky asset: log R ∼ N(𝜇(y), 𝜎2), with 𝜇(y = tail) = 0.02, 𝜇(y = head) = 0.10, 𝜎 = 0.20

Preferences
  Risk aversion: 1 − 𝛼 = 2.0
  Theory selection norm ‖ ⋅ ‖: L1
  A = (0.5, 0.5)′
  B = (−1, 1)
  a, b: see example

Numerical
  Repetitions of each sample (for means and standard deviations of optimal portfolios): 250
  Gaussian quadrature for the lognormal distribution: 5 points
where W0 = 1 is initial wealth and w is the proportion of wealth allocated to the risky asset. Set the risk-free rate log Rf = 0. The risky asset return, R, depends on the outcome of state y as log R ∼ N(𝜇(y), 𝜎2). For concreteness, say 𝜇(y = tail) < 𝜇(y = head). Familiar values are 𝜇(y = tail) = .02 and 𝜇(y = head) = .10. The example has a constant volatility of 𝜎 = .20. All the parameters are listed in Table 1.
Signals and states are the result of tossing two coins: xn, yn ∈ {head, tail}. Data observations in hn are independent. However, the coin tosses may be correlated, so knowing the outcome of xn may be informative about the outcome of yn. A data-generating process Δ describes the probability of each outcome (xn, yn). This embeds any correlation in the coins. Given the data in hn, the prediction-relevant portion of a theory is a two-dimensional vector, 𝜃 ∈ (0, 1)2, that describes p(yn = head|xn = head) and p(yn = head|xn = tail). Risk preferences are the familiar CRRA setting with curvature parameter 𝛼 < 1 (coefficient of relative risk aversion of 1 − 𝛼). All together, the portfolio problem optimization, given data set hn, is:
maxw E𝜃∗(hn) [E[𝛼−1 W̃1𝛼 | yn] | xn],

s.t. 𝜃∗(hn) = argmax𝜃 v(𝜃) + ∑(x,y)∈hn 𝓁(y, x, 𝜃).
Here, let us specify the functional form of 𝓁() as a log-likelihood. That is:
∑(x,y)∈hn 𝓁(y, x, 𝜃) = ∑(x,y)∈hn log (I[x = head](I[y = head]𝜃1 + I[y = tail](1 − 𝜃1)) + I[x = tail](I[y = head]𝜃2 + I[y = tail](1 − 𝜃2))),    (3)
where I[⋅] is the indicator function and 𝜃1 = p(y = head |x = head) and 𝜃2 = p(y = head |x = tail) are the two parame-
ters of the theory. This seems reasonable in the sense that given enough data, the preferred theory will coincide with
the true data-generating process.
How can we specify v(𝜃) to capture a preference for simplicity? While “simplicity” and “complexity” are appealing as
ranking criteria, it is hard to find a universal measure. In the specific two-coin example here, the v(𝜃) defines “simple”
or a preferred reference point on the (0, 1)2 square (akin to a Bayesian prior). As an example, consider:
v(𝜃) = −b (a‖A − 𝜃‖L1 + (1 − a)‖B𝜃‖L1),    (4)
where a, b are positive scalars and ‖ ⋅ ‖ is a norm (we will use L1- and L2-norms below). If A = [ 12 , 12 ]′ , this expresses a
preference for theories where the second coin is both independent of the x-coin and also unbiased. If we define B =
[1, −1], this captures a preference for theories that imply no correlation across the two coins. That is, where p(y =
head|x = head) is close to p(y = head|x = tail). The preference parameters a and b scale things (and the format here
is similar to the Elastic Net regression we consider below). b determines the overall importance of simplicity in the
preferences. The a ∈ [0, 1] parameter weights the relative importance of the two notions of simplicity. Of course, it is
arbitrary and we embed it in an economic model to evaluate if this is a sensible representation of behavior.5
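The theory-selection step—maximizing the penalized likelihood built from Equations (3) and (4)—can be sketched numerically. Below is a minimal Python sketch assuming the Table 1 calibration; the brute-force grid search and all function names are our own, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

# True data-generating process from Table 1 (two possibly correlated coins).
def simulate(n):
    """Draw a history h_n of n (x, y) coin-toss pairs (True = head)."""
    x = rng.random(n) < 0.50
    y = rng.random(n) < np.where(x, 0.80, 0.48)
    return x, y

def log_lik(theta, x, y):
    """Log-likelihood (3); theta = (p(y=head|x=head), p(y=head|x=tail))."""
    p = np.where(x, theta[0], theta[1])  # p(y = head | x)
    return np.sum(np.where(y, np.log(p), np.log(1 - p)))

def v(theta, a, b):
    """Simplicity preference (4) with A = (0.5, 0.5)' and B = (1, -1)."""
    A, B = np.array([0.5, 0.5]), np.array([1.0, -1.0])
    return -b * (a * np.abs(A - theta).sum() + (1 - a) * abs(B @ theta))

def select_theory(x, y, a, b, grid=99):
    """theta* = argmax_theta v(theta) + log-likelihood, by grid search."""
    ps = np.linspace(0.01, 0.99, grid)
    best, best_val = None, -np.inf
    for t1 in ps:
        for t2 in ps:
            th = np.array([t1, t2])
            val = v(th, a, b) + log_lik(th, x, y)
            if val > best_val:
                best, best_val = th, val
    return best

x, y = simulate(50)  # a smallish data set, so the penalty visibly matters
th_mle = select_theory(x, y, a=1.0, b=0.0)   # pure likelihood
th_reg = select_theory(x, y, a=1.0, b=10.0)  # strong simplicity preference
print("b = 0 :", th_mle)
print("b = 10:", th_reg, "(shrunk toward A = (0.5, 0.5))")
```

With b = 0 the selected theory is the maximum-likelihood estimate (the empirical conditional frequencies); with b = 10 and a = 1 it is pulled toward the "unbiased, uncorrelated" reference point A.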
Figure 1 shows the optimal portfolio for each possible outcome of the signal coin (on the vertical axis). The horizontal axis is the size of the data set (log n). Since the optimal portfolio allocation depends on the data, the figure reports the mean allo-
cation across many simulated data sets hn ∈ Hn . The shaded regions represent one standard deviation bounds around
the mean for the portfolio allocation. For reference, if the decision maker knew the true data process her optimal port-
folio would be .66 for x as a tail and .99 for x as a head. These reference lines are shown as the dashed lines in the
figure. In the case where data-generating process is known, the data are not needed. Hence, these reference lines do
not depend on the size of the data.
In Figure 1, b = 0 (and when b = 0, the value for a has no impact). With this calibration, the theory 𝜃 that determines
the optimal portfolio allocation is determined solely by the data set hn . That is the reason for the large variation in
portfolio allocations for smaller data sets. For data sets smaller than about log n = 3 (about 20 paired coin tosses), the one
standard deviation bounds for the optimal allocations overlap. This means that about half the time, more is allocated to
the risky asset on a tail signal than a head. For larger data sets, the optimal allocation settles down and converges to the value that is optimal given the true data-generating process. Hence, we have characterized preferences consistent
with the Law of Large Numbers.
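The inner portfolio step—choosing w given a theory 𝜃—can be sketched using the 5-point Gaussian quadrature listed in Table 1. This is an illustrative sketch, not the paper's code: the grid search over w and the function names are our own, and small numerical differences from the reference allocations quoted in the text (about .66 and .99) can arise from the discretization:

```python
import numpy as np

alpha = -1.0  # curvature; coefficient of relative risk aversion 1 - alpha = 2
mu = {"head": 0.10, "tail": 0.02}  # mu(y) from Table 1
sigma = 0.20

# 5-point Gauss-Hermite quadrature (probabilists' version) for
# E[f(log R)] when log R ~ N(m, sigma^2).
nodes, weights = np.polynomial.hermite_e.hermegauss(5)
weights = weights / np.sqrt(2 * np.pi)  # normalize to a standard-normal mean

def expected_utility(w, p_head):
    """E[alpha^{-1} W_1^alpha] for allocation w, given p(y = head | x)."""
    eu = 0.0
    for state, p_state in (("head", p_head), ("tail", 1.0 - p_head)):
        R = np.exp(mu[state] + sigma * nodes)  # quadrature nodes for R
        W1 = w * (R - 1.0) + 1.0               # W_0 = 1, log Rf = 0
        eu += p_state * np.sum(weights * W1 ** alpha / alpha)
    return eu

def optimal_w(p_head):
    """Grid search for the allocation maximizing expected utility."""
    ws = np.linspace(0.0, 2.0, 2001)
    return ws[int(np.argmax([expected_utility(w, p_head) for w in ws]))]

w_head = optimal_w(0.80)  # p(y = head | x = head) under the true process
w_tail = optimal_w(0.48)  # p(y = head | x = tail) under the true process
print(f"w*(x = head) = {w_head:.2f}, w*(x = tail) = {w_tail:.2f}")
```

A head signal shifts the conditional mean up, so the optimal allocation to the risky asset is larger on a head than on a tail.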
As a contrast, Figure 2 sets the preference parameter b = 10. In this case, there is a strong preference for “sim-
plicity.” Setting a = 1 characterizes simplicity as a preference for the y-coin as uncorrelated and unbiased. Hence, for
small data sets the optimal theory is that the y-coin is uncorrelated and unbiased. The result is that the optimal portfolio is constant across the two outcomes of the x-signal and is not sensitive to the particular sample of data seen. However, once the data set is large, the optimal portfolios
are sensitive to the signal x-coin and also have some dependence on the data. Finally, note that as data sets get large
enough, the optimal portfolio does converge to the full information case.
As a contrast, Figure 3 also sets the preference parameter b = 10. However, here a = 0, characterizing “simplicity” only by p(y = head|x = head) being close to p(y = head|x = tail). Hence, for small data sets, the optimal allocation to the risky asset is constant across the outcome of the x-coin. But here, note that the allocation itself is dependent on the
data set as seen in the large one-standard deviation range of the optimal investment.
Finally, to see how the two “simplicity” parameters work together, Figure 5 varies a and b. The rows have b = {0, 3, 6, 9} and the three columns are for a = {0, 0.5, 1.0}. Moving down, as b increases, the optimal weights are less reliant on small data sets. The effect is strongest as a increases toward the right column. For a = 1 and b = 9, the portfolio is
5 This functional form expands naturally to a multidimensional signal such as xn ∈ {head, tail}K and is an analog of a “generalized lasso” problem (see Tibshirani, Saunders, Rosset, Zhu, & Knight, 2005), a common tool in natural language processing, image recognition, and machine learning settings.
constant for small data sets. Finally, in all these examples, as the data set grows large, the optimal portfolio converges to the optimum given the true process.
Next, we look at an example using the one-period portfolio asset allocation problem but with data drawn from the U.S.
economy.
maxw E𝜃∗(hn) [E[𝛼−1 W̃1𝛼 | yn] | xn],

s.t. 𝜃∗(hn) = argmax𝜃 v(𝜃) + ∑(x,y)∈hn 𝓁(y, x, 𝜃),

W1 = W0 (w(R − Rf ) + Rf ).
The (artificial) one-period setting makes the decision model more transparent. But the use of real data means we
have to be more explicit with the timing of the data. Consider data as monthly and the portfolio horizon as 1 year.
Again, set the beginning-of-year wealth W0 = 1 with end-of-year wealth W1. The equity risk premium depends on the state y as log R ∼ N(y, 𝜎2). Specifying things this way means we can set log Rf = 0 since, in our static example, the level of the risk-free rate is not particularly interesting. We will also maintain a constant volatility assumption with 𝜎 = .20.
To express (possible) predictability of the equity returns, xn is a k-vector of signals observed at the beginning of the (1-year) period such that:
yn = 𝛿0 + 𝛿 ′ xn + 𝜖n .
The 𝛿 coefficients capture the true data-generating process, and the 𝜖n are orthogonal. Of course, in practice, our data set will not contain y—we cannot “see” the equity premium. Instead, we see the realized excess returns, rt,t+12, paired with the beginning-of-period signals xt. This is the familiar predictive regression of Cochrane (2011) (and others):

rt,t+12 = 𝛿0 + 𝛿′xt + 𝜖t,t+12.    (5)
Signals, xt, are normalized to zero mean and unit variance. This is not necessary but makes it easier to specify preferences and compare across different input data sets. (It also means the intercept term, 𝜃0, is the estimate of the unconditional equity risk premium.)
To implement our decision model, we use the same CRRA preferences over terminal (end-of-year) wealth as the
previous example. In Equation (2), we need to specify how v and 𝓁 are measured in this context. First, we can encode the theory 𝜃, which gives the conditional probability of y for a given vector x, as a candidate set of coefficients in Equation (5).
That is, a theory is:
yt = 𝜃0 + 𝜃′xt.    (6)
In this familiar regression setting, the log-likelihood is (up to constants) the negative of the sum of squared errors.
For the simplicity component of preferences over theories, we use the regularizer format of “Elastic Net” regularized
regressions in machine learning. This is similar to our previous example and is:
v(𝜃) = −b (a‖𝜃‖L1 + (1 − a)‖𝜃‖L2 ).    (7)
The Elastic Net regularizer (Zou & Hastie, 2005) combines two notions of “simplicity” for model parameters. The L1-
norm component encapsulates a preference for a “simple” model as one with a few nonzero parameters (called a LASSO
regression; see Tibshirani 1996, 2011).6 The L2-norm expresses the preference for “simple” as a model with parame-
ters that are close to zero (called a Ridge regression; see Hoerl & Kennard, 1988). The machine learning rationale to
6 Recall 𝜃 ∈ ℝK+1 , ‖𝜃‖L1 = ∑Kk=1 |𝜃k |, and ‖𝜃‖L2 = (∑Kk=1 𝜃k2 )1∕2 . Note we leave 𝜃0 outside of the definition of v, so the level of the equity premium will be driven by the sample mean in the data.
include both is that sparseness (from the absolute-value L1-norm) makes a regression easy to understand and explain since it has few nonzero parameters. However, in models with high-dimensional x-vectors with many near collinearities, the L2-norm term often makes the estimates more stable (e.g., similar nonzero terms across similar subsamples).
The parameter a ∈ [0, 1] drives the degree of sparseness (e.g., a = 1 is the LASSO regularizer). The parameter b scales
how much weight is put on simplicity (vs. the data-driven component of Equation 2). Things are linear in the two norms
and this scaling makes for easier interpretation. Of course, in our setting, parameters a and b and the functional forms
are just a representation of preferences. Using the “practical wisdom” of machine learning is helpful, perhaps, in cali-
brating these terms.
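As a sketch of how (a, b) map into an off-the-shelf Elastic Net, here is a synthetic example using scikit-learn (our substitution—the paper cites the glmnet algorithm of Friedman, Hastie, and Tibshirani, 2010). Note that scikit-learn's penalty squares the L2 term and rescales the loss by 1∕(2n), so the mapping to (a, b) holds only up to normalization:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(2)

# Synthetic "predictive regression": 50 standardized signals, only 3 of which
# actually forecast the (noisy) excess return. All values are illustrative.
n, k = 600, 50
X = rng.normal(size=(n, k))
beta = np.zeros(k)
beta[:3] = [0.5, -0.4, 0.3]
y = X @ beta + rng.normal(0, 1.0, n)

# scikit-learn's ElasticNet minimizes
#   (1/2n)||y - Xb||^2 + alpha * l1_ratio * ||b||_1
#                      + 0.5 * alpha * (1 - l1_ratio) * ||b||_2^2,
# so l1_ratio plays the role of a (sparseness) and alpha the role of b
# (overall weight on simplicity), up to scaling and the squared L2 term.
nnz_by_alpha = {}
for alpha in (0.01, 0.1, 0.5):
    fit = ElasticNet(alpha=alpha, l1_ratio=0.9).fit(X, y)
    nnz_by_alpha[alpha] = int(np.sum(fit.coef_ != 0))
    print(f"alpha = {alpha}: {nnz_by_alpha[alpha]} nonzero coefficients")
```

Increasing alpha (the analog of b) drives more coefficients to exactly zero—the sparseness the text describes.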
We solve for the optimal model using the algorithm of Friedman, Hastie, and Tibshirani (2010). This solves simultaneously for many values of the parameter b from small to large (called the regularization path). In practice, this is useful for model selection
TA B L E 2
investment commercial
investments commercial_real_estate
business_investment commercial_construction
capital_investment commercial_banks
inventory_investment commercial_real_estate_markets
investment_plans commercial_loans
investment_tax_credit commercial_lending
investment_spending commercial_real_estate_loans
fixed_investment commercial_real_estate_activity
new_investment commercial_loan_demand
investment_companies commercial_aircraft
investment_activity commercial_construction_activity
capital_investments commercial_builders
security_investments commercial_building
new_investments commercial_space
equipment_investment commercial_mortgages
investment_projects commercial_real_estate_market
capital_investment_plans commercial_markets
business_investment_plans commercial_contractors
investment_management_firms commercial_paper
investment_portfolios commercial_realtors
investment_outlays commercial_side
alternative_investments commercial_projects
Note. A sample of phrases that contain the word “commercial” or “investment” listed in order of frequency.
(“tuning” of the choice of the “hyperparameter,” b). However, in our context it lets us look at the optimal portfolios across
different preference parameters.7
The usual workhorse for predicting equity returns is the price-dividend ratio. This is the variable that features
prominently in John Cochrane’s AFA Presidential Address (Cochrane, 2011), and many others such as Campbell and
Shiller (1988a, 1988b). The data come from Robert Shiller’s web page. (Hence the handle, with apologies for the informality, “Bob” for our investor using this datum.) The data are the dividend-price ratio of the S&P500 large-cap stocks (and include lags).
8 Returns data are CRSP data via Ken French’s Data Library: http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html
Ludvigson and Ng (2009) use 130 or so macroeconomic series to forecast bond returns. From those series, they extract six principal components. Along these lines, we have pulled similar data from FRED. Most of the series Ludvigson and Ng use are included, along with a few others. There are 224 monthly data series included. Most have the familiar transformations for stationarity (log differences) and all have been standardized to demean and set the variance to one. Note we do not do any dimensionality reduction, such as extracting principal components. The amount of data and the amount of reliance on the data will be set by the simplicity preference parameters.
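The preprocessing described here—log differences for stationarity, then standardization—can be sketched on a synthetic level series. A real implementation would download the series from FRED; the drift and volatility numbers below are arbitrary stand-ins:

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic stand-in for a trending macroeconomic level series (say, an
# industrial production index). A real implementation would pull the series
# from FRED instead; drift and volatility here are arbitrary.
T = 240  # 20 years of monthly observations
level = 100.0 * np.exp(np.cumsum(0.002 + 0.01 * rng.normal(size=T)))

# Familiar stationarity transformation: log first differences (growth rates).
growth = np.diff(np.log(level))

# Standardize to mean zero and variance one, as in the text.
x = (growth - growth.mean()) / growth.std()

print(f"obs = {x.size}, mean = {x.mean():.2e}, std = {x.std():.3f}")
```

After this transformation, each of the 224 series enters the regression on a comparable scale, which is what makes a single (a, b) penalty sensible across all of them.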
TA B L E 3 Model coefficients—BOB
Note. Forecast data are price-dividend ratio (“Bob”). The summary statistics at top report the number of nonzero coefficients in
the model (excluding the intercept), the standard deviation of the forecast of expected return, and the standard deviation of the
optimal allocation to the risky asset. These are reported for the “train” data, 1952–2005, and the out-of-sample “test,” 2006–2012.
The regional Federal Reserve Banks have been producing a blog-like commentary on the economy monthly (roughly) since 1970. This text is our third set of data. Specifically, we use text from the “Beige Book” of the Federal Reserve System that is issued for each of the FOMC meetings. The Fed has a staff of economists who focus on measurement and forecasting. Prior to each of the eight meetings of the FOMC, the staff at the Federal Reserve Board publish several
documents with forecasts and commentary. The two most substantive documents are the “Greenbook” (formally “Cur-
rent Economic and Financial Conditions”) and the “Bluebook” (formally “Monetary Policy Alternatives”). Both of these
documents are available only to the FOMC and insiders at the Fed. They have been released as historical documents
with a rolling 5–6 year black-out window. These quantitative data are used in Romer and Romer (2004) as a measure
of Fed expectations and surprises. Here, we use the “Beige Book” (formally known as the “Summary of Commentary
on Current Economic Conditions by Federal Reserve District.”). Unlike the previous two, this information is released to
the public (approximately) 2 weeks prior to an FOMC meeting.9
The Beige Book data cover the period 1970–2015 and the (roughly) eight meetings per year. For each meeting, there
is a national summary and a report from each of the 12 regional banks. What is particularly interesting about this docu-
ment is that it is decidedly nonquantitative. It is “based on information gathered by Reserve Bank staff over the course
of several weeks. Each Federal Reserve Bank collects anecdotal information on current economic conditions in its Dis-
trict through reports from Bank and Branch directors and interviews with key business contacts, economists, market
experts, and other sources.”10 Alan Blinder, former Fed vice chairman, once characterized this process as the “ask your
uncle” method of gathering information. This data set is denoted “Alan” after long-standing Fed Chair Alan Greenspan.
Here is a sample (from November 1979 National Summary):
Manufacturing activity is particularly spotty. The automobile sector is exhibiting pervasive weakness, but there
exist definite areas of strength. St. Louis finds, with the exception of autos, a relatively high level of manufacturing
and particular strength in several industries including most capital goods, basic metal products, aircraft, and tex-
tiles. Minneapolis reports broadly based strength in industrial production. Dallas characterizes manufacturing as
flat with weakness in consumer durables being about offset by strength in nondurables and construction related
9 The release to the public began in 1983. The precursor to the Beige Book, the Red Book, has been released as historical data. In the period 1970–1983, these reports were internal to the Fed. However, we will set that aside and ask how an investor would have used the data had it been available over the sample.
10 See http://www.federalreserve.gov/monetarypolicy/fomc_historical.htm
areas such as primary metals and stone, clay and glass products. Meanwhile, the Third District is reportedly five
months into a general downturn in manufacturing activity. Chicago reports high operating rates, but declining
order backlogs at most capital goods producers. The capital goods sector in the Cleveland District is still expe-
riencing backlogs amidst some concern that they might soon face reductions in orders. San Francisco sees most
sectors except auto and construction related ones doing well.
This is the sort of information where rich, nuanced, and, often ambiguous, language may convey more information
than the systematic measurement of predefined quantities. Does it work as a forecaster of equity returns? (And if
you are wondering, the word “spotty” shows up frequently—about 1,500th most frequent of the 200,000 or so unique
words and phrases in the data).
TA B L E 4 Model coefficients—FRED
Note. Forecast data are from macroeconomic series of the St. Louis Federal Reserve Bank Economic Database (“FRED”). The
summary statistics at top report the number of nonzero coefficients in the model (excluding the intercept), the standard devi-
ation of the forecast of expected return, and the standard deviation of the optimal allocation to the risky asset. These are
reported for the “train” data, 1952–2005, and the out-of-sample “test,” 2006–2012. A selection of parameter coefficients is shown, selected based on absolute size. The series name corresponds to the data key used on FRED. All data are normalized to
mean zero and variance of one.
There are many ways to use text data—machine learning and natural language processing are fast-evolving fields. Here, a “text regression” to forecast equity returns lines up with the decision framework we have in mind.11 To imple-
ment the text regression (6) with the simplicity preference in Equation (7), we need to define how the text maps to
the quantity “x.” We merge all the Beige Books at a given month and then use a model that simply counts words (“bag
of words”).12 Capturing the nuance in the text can be difficult. In the example above, just what was “spotty”? To cap-
ture words in context we augment word counts with adjective–noun and adverb–verb “phrases.”13 Table 2 shows an
example of phrases. The words “commercial” and “investment” are quite frequent in the data. They are also used in many contexts. Table 2 shows the output of our phrase extraction tool for phrases that include these words. Notice that many of the phrases reflect important differences. A commercial bank is a bank (finance) and plays a very different
11 Text regressions were introduced in Kogan, Levin, Routledge, Sagi, and Smith (2009) on data using 10K text to predict return volatility. Other examples,
see Yogatama et al. (2011) and Chahuneau, Gimpel, Routledge, Scherlis, and Smith (2012). The Beige Book data were used in Yogatama, Routledge, and Smith
(2013) and Yogatama, Wang, Routledge, Smith, and Xing (2014), although these are not text regression models. Other finance papers that use text as data, for
example Tetlock (2007), use a similar word count as a representation and then employ a “dictionary” to reduce the dimensionality (e.g., a predefined sentiment
score).
12 We downcase, remove punctuation, and tokenize numbers and dollars (5.0 and $6.0 are replaced by “xNUMBERx” and “xDOLLARx”). In “bag-of-words,” word order within a document is ignored—only the counts matter.
role in the economy than a commercial builder (manufacturing). Similarly, security investments reflect a financing activity, whereas capital investment reflects a physical economic activity.
Phrases and words are counted in each document. Count data are skewed (most words are very infrequent), so we use the transformation:

xi,t = log(1 + count(i, t)),
where count(i, t) counts the word or phrase i in the Beige Book at date t.14 Note that this conveniently preserves
the sparse structure—most of the words to not appear in a document. Similarly, this works well with months with no
14 This structure means that regression coefficients are informative about the percentage change in counts. That is, if word j has a count of c and we contemplate an increase to c(1 + Δ), the change in the forecast is 𝜃j [log(1 + c(1 + Δ)) − log(1 + c)] ≈ 𝜃j log(1 + Δ) ≈ 𝜃j Δ for large c and small Δ.
meetings, simply have xi,t = 0 for all words. Finally, we prune the word counts and only include the 10,000 most fre-
quent words. (You can include them all; it does not change the results much.)
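The featurization just described can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline; the toy documents are invented, and the tokenization is simplified (the paper also strips punctuation and tokenizes numbers and dollars, and counts phrases as well as words).

```python
import math
from collections import Counter

def featurize(documents, vocab_size=10_000):
    """Build sparse log(1 + count) features for each document.

    Counts words per document, prunes to the `vocab_size` most frequent
    words overall, and applies x = log(1 + count). A month with no
    document simply yields an empty dict (i.e., x = 0 for every word).
    """
    counts = [Counter(doc.lower().split()) for doc in documents]
    # Prune the vocabulary to the most frequent words across all documents.
    total = Counter()
    for c in counts:
        total.update(c)
    vocab = {w for w, _ in total.most_common(vocab_size)}
    # Sparse representation: only store nonzero transformed counts.
    return [{w: math.log(1 + n) for w, n in c.items() if w in vocab}
            for c in counts]

docs = ["commercial bank lending rose modestly",
        "commercial builder activity remained subdued"]
features = featurize(docs)
```

Because only observed words are stored, the sparse structure mentioned above is preserved automatically: absent words never appear in a document's dict.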
TABLE 5 Model coefficients—Alan
Note. Forecast data are text from the Beige Books of the FOMC. The summary statistics at top report the number of nonzero
coefficients in the model (excluding the intercept), the standard deviation of the forecast of expected return, and the standard
deviation of the optimal allocation to the risky asset. These are reported for the “train” data, 1952–2005, and the out-of-
sample “test” data, 2006–2012. A selection of parameter coefficients is shown, selected based on absolute size. The parameters are
word (phrase) counts, n, transformed by log(1 + n).
That is, the data hn are held constant across experiments—they are the data from the 1952–2005 period. We solve for the
model:

𝜃∗(hn, a, b) = argmax𝜃 −b(a‖𝜃‖1 + (1 − a)‖𝜃‖2^2) + ∑(x,y)∈hn 𝓁(y, x, 𝜃).
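With a squared-error loss 𝓁, an objective of this form is an elastic-net regression (Zou & Hastie, 2005). Below is a minimal sketch using scikit-learn; the mapping of the preference parameters (a, b) onto `l1_ratio` and `alpha`, and the synthetic data, are my illustrative assumptions (scikit-learn's penalty scaling also differs by constants), not the paper's code.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(240, 50))        # e.g., 240 months of 50 predictors
theta_true = np.zeros(50)
theta_true[:3] = [0.5, -0.3, 0.2]     # only a few predictors matter
y = X @ theta_true + 0.1 * rng.normal(size=240)

a, b = 0.9, 0.05                      # a: weight on sparseness, b: overall shrinkage
model = ElasticNet(alpha=b, l1_ratio=a).fit(X, y)

# The "simplicity" preference shows up as few, small, nonzero coefficients.
nonzero = int(np.sum(model.coef_ != 0))
```

Larger b shrinks every coefficient toward zero, and a closer to 1 puts more weight on the ℓ1 term and so zeroes out more coefficients entirely.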
Then, using 𝜃∗(hn), we solve for the optimal portfolio at various points in time:

wt∗(hn, a, b) = argmaxw E𝜃∗(hn)[𝛼^−1 W̃^𝛼 | xt],

where W̃ is next-period wealth given the allocation wt to the risky asset.
This lets us observe behavior “in-sample,” for years up to 2005, and “out of sample,” for years beyond. We can wiggle the
preference parameters over simplicity, a and b—recall a drives sparseness (few nonzero parameters) and b drives the
magnitude of the parameters. We can also look across data sets hn = {Bob, FRED, Alan}. For all these examples, we hold
risk aversion constant at −4.0. That level of risk aversion implies an average allocation to the risky asset of about
0.35 (which gives some “room” for allocations to vary and still remain in a plausible range).
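The allocation step can be illustrated with a small numerical sketch of CRRA expected-utility maximization at 𝛼 = −4. The return distribution (a 6% equity premium with 18% volatility), the risk-free rate, and the grid search are illustrative assumptions, not the paper's procedure.

```python
import numpy as np

alpha = -4.0                     # CRRA exponent: u(W) = W**alpha / alpha
rf = 0.02                        # illustrative risk-free rate
mu, sigma = 0.06, 0.18           # illustrative equity premium and volatility

rng = np.random.default_rng(1)
excess = mu + sigma * rng.normal(size=200_000)   # simulated excess equity returns

def expected_utility(w):
    # Next-period wealth from allocating w to equity and 1 - w to the risk-free asset.
    wealth = np.maximum(1 + rf + w * excess, 1e-8)   # guard: CRRA needs positive wealth
    return np.mean(wealth ** alpha / alpha)

grid = np.linspace(0.0, 1.0, 101)
w_star = grid[int(np.argmax([expected_utility(w) for w in grid]))]
```

A Merton-style approximation, w ≈ 𝜇/((1 − 𝛼)𝜎²), puts the optimum near 0.06/(5 × 0.18²) ≈ 0.37 under these assumptions, in the neighborhood of the 0.35 average allocation mentioned above.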
To help summarize things, we track performance via the accumulated wealth associated with the portfolios
wt∗(hn, a, b). Define the performance at t as the realized excess return of the portfolio over the subsequent year,
wt∗(hn, a, b) rt+12, where rt+12 is the equity (excess) return over the subsequent 12 months. (Plotting this annual item at each month
implies some smoothing that helps visualize the results.)
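Accumulated wealth for a path of portfolio weights can be tracked as below; the weight path and monthly returns here are synthetic placeholders, used only to show the compounding.

```python
import numpy as np

def accumulated_wealth(weights, excess_returns, rf=0.0):
    """Compound wealth from holding weight w_t in equity over each period."""
    gross = 1 + rf + weights * excess_returns   # per-period gross portfolio return
    return np.cumprod(gross)

w = np.full(12, 0.35)                    # the constant buy-and-hold comparison
r = np.array([0.01, -0.02, 0.03] * 4)    # illustrative monthly excess returns
wealth = accumulated_wealth(w, r)
```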
We start by looking “in-sample”—that is, looking at behavior when the data the agent is acting on come from
the data set used to choose the model. Not surprisingly, performance will be misleadingly impressive. We look first at
“Bob”—using the price-dividend data. Shown in Figure 6 are portfolio weights for some of the people using these data.
In this plot, many of the preference parameters (larger a and b) put zero weight on the price-dividend observation and
hence produce no variation in the forecast of expected return. The coefficients for some of these models are shown in
Table 3. In data set “Bob,” the preference for simplicity, in the parameters a and b, effectively results in two models. Either
the agent runs an (OLS-styled) regression using the price-dividend ratio or holds a constant portfolio. The results in-between—
where you place a modest weight on the price-dividend ratio—happen, for the parameters we are looking at, to be knife-
edge. The result is the constant (buy-and-hold) strategy. The other fitted models put weight on the price-dividend ratio, and the portfolio
weights fluctuate (modestly, relative to the data sets below). The resulting wealth profile for these two behaviors is in
Figure 7. As mentioned, the in-sample performance of relying on the predictive variable makes for (overly) impressive
results. Out-of-sample results are presented in Figure 12.
The simplicity parameters have a bigger impact on portfolios using the FRED data. See Table 4 for examples. Here,
the models use none, some, or all of the various macro series available. Of the parameters shown in the table (a selection
of those the models view as important), some are familiar: interest rates (AAA, BAA, CPN3M), consumption (PCE),
housing starts (HOUSTNF, PERMIT), and industrial production (IPUTIL). Notice the variability in the forecast of expected
FIGURE 12 Data: hn = {Bob, FRED, Alan}—Wealth, out of sample
Note. All three data sets are represented here. (a) “Bob”: Price-Dividend data, (b) “FRED”: Economic Data from the St.
Louis Fed Database, and (c) “Alan”: Text from the Beige Books of the FOMC meetings. Shown is the accumulated
wealth from holding optimal allocation to the risky equity based on the forecast of the equity risk premium. The
models were trained on data from 1952 to 2005. Shown are the results for 2005–2012, out of sample. For
comparison, the buy-and-hold constant portfolio holds a 0.35 allocation to the risky asset (the optimal allocation for a
constant 6% equity risk premium given the model parameters).
returns and the resulting weight in the optimal portfolio depend on the a and b simplicity preference (regularization)
parameters. As with Bob, above, a large value of b drives the forecast and portfolio weights to be constant. Figure 8
shows the portfolio weights for “FRED” across a range of preference parameters. Here, the number of nonzero regres-
sors and their size vary with the preference parameters. Wealth implications are shown (in-sample) in
Figure 9. Again, the in-sample results are clearly benefiting from being in-sample. Notice that the bottom line, the con-
stant portfolio (large value of b), is the buy-and-hold portfolio.
Finally, we look at the text-based portfolio using the soft data of the text found in the FOMC Beige Books. The
model parameters, for a selection of simplicity parameters, are shown in Table 5. Again, the preference parame-
ters a and b regulate the number and size of the coefficients and, in turn, the variability of the equity risk premium
forecast and asset allocation weight. The weights are shown in Figure 10. The in-sample impact on wealth from port-
folios derived from these data is shown in Figure 11. Interestingly, the “Alan” results are similar to using the full array
of macro data series. At least in sample, the soft text is generating the same level of information about future equity
returns. Table 5 highlights some of the phrases the model picks out whose word count is salient for the forecast. For
example, “fed-funds,” “manufacturing-conditions,” and “real-estate-residential-construction” are not surprising. The
model also reflects the economic setting of the sample period with “energy-shortages” and “iraq.” Some items like
“telephone-survey-of-hotels” are picking up exactly what the Beige Book information is intended to capture. Finally,
there are phrases that reflect more of a sentiment or modality, like “rose-modestly,” “loosening,” and “remained-
subdued” that are interesting.
Interestingly, the portfolio weights (see Figure 10) are quite different from those for the FRED data. First, notice the flat
portion pre-1970—that simply reflects the fact that the text data are not available for that period. The portfolio movements shown
are both more pronounced and more highly serially correlated (smoother). That is particularly noticeable from 2000
onward. Presumably, there is less text variation month-to-month than in the macro series. Given the ambiguity in the-
ory about what is driving the time variation in the equity premium, it is not clear which agent is more “right.”
We now turn to the performance and behavior of these agents post-2005—out of the sample they used to build their
model. Figure 12 shows the wealth implications of the allocation to the risky asset for one of the agents using each of the
three data sets. Several models for each are shown—the combination of data and preference drives the portfolio. There
was substantial variation in all the predictive data through the financial crisis. Each of the agents has a portfolio that
forecasts variation in the equity risk premium and is, therefore, sensitive to the signal they receive. The three data set types
are quite different. The agent using the price-dividend ratio, “Bob,” has strong variation in the portfolio weight—indeed,
the price-dividend ratio swung wildly in the financial crisis. The preference parameters play a particularly strong role
for the FRED macro data. The wealth performance across various models is similar for the “Bob” and “Alan” data. But
the models using the “FRED” data are very sensitive to the simplicity coefficients a and b. Does this capture a heavier
worry of “overfitting” in the quantitative FRED macroeconomic data? Or does this point to some change in the rela-
tionship between macro data and the equity premium through the crisis? Both are interesting questions for future
research.
5 CONCLUSION
In this basic setting, there is much more to explore. But the model is promising. It provides preference parameters (and
the functional form) that capture how people incorporate data into their decision making. Looking at the model through
the lens of the canonical asset allocation problem lets us assess if the new preference parameters are delivering behav-
ior that is helpful in understanding asset markets and for developing tools that use data for portfolio construction.
For example, the model can shed light on the debate about how agents might behave in asset models with Bansal and
Yaron (2004) long-run risk (Dew-Becker & Bidder, 2015; Hansen & Sargent, 2010). Alternatively, it will be interesting
to explore equilibrium models of preference heterogeneity where the heterogeneity is in the “simplicity” parameters
(here a and b) (say along the lines of Dumas, Kurshev, & Uppal, 2009; Gallmeyer & Hollifield, 2008; or Osambela, 2015).
There are several extensions that are worth pursuing. One is to extend the specification to multiple assets to look
at richer portfolios. The “regularized” regression we used here is a natural analog to a Black–Litterman shrinkage esti-
mation model (see Wang, 2005). The second extension is to extend the framework to the dynamic setting of Epstein
and Zin (1989). In one sense, this is “easy.” For example, we use data from 1952 to 2005 to build the agent’s preferred
empirical model and then simply take that model as the “date-0” preferences and solve for optimal portfolios in the
Epstein and Zin (1989) framework. What is more interesting is to allow the empirical model to be solved dynamically.
This is harder, of course, since we need to impose time-consistency on the agent (or deal with the consequences of
time-inconsistent decision making).
ACKNOWLEDGMENTS
Thanks for comments and suggestions from seminar participants at Carnegie Mellon University, the University of Utah,
and Pennsylvania State University.
REFERENCES
Baker, S. D., & Routledge, B. R. (2013). The price of oil risk. Carnegie Mellon University Working Paper.
Bansal, R., & Yaron, A. (2004). Risks for the long run: A potential resolution of asset pricing puzzles. Journal of Finance, 59, 1481–
1509.
Campbell, J. Y., & Shiller, R. J. (1988a). The dividend-price ratio and expectations of future dividends and discount factors.
Review of Financial Studies, 1, 195–228.
Campbell J. Y., & Shiller, R. J. (1988b). Stock prices, earnings, and expected dividends. Journal of Finance, 43, 661–676.
Chahuneau, V., Gimpel, K., Routledge, B. R., Scherlis, L., & Smith, N. A. (2012). Word salad: Relating food prices and descriptions.
In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural
Language Learning. Stroudsburg, PA: Association for Computational Linguistics, pp. 1357–1367.
Cochrane, J. H. (2011). Presidential address: Discount rates. Journal of Finance, 66, 1047–1108.
Coppersmith, G., Dredze, M., Harman, C., & Hollingshead, K. (2015). From ADHD to SAD: Analyzing the language of mental
health on twitter through self-reported diagnoses. In Proceedings of the Workshop on Computational Linguistics and Clinical
Psychology: From Linguistic Signal to Clinical Reality. Denver, CO, June: North American Chapter of the Association for Com-
putational Linguistics.
Dew-Becker, I., & Bidder, R. (2015). Long-run risk is the worst-case scenario. Northwestern University Working Paper.
Dumas, B., Kurshev, A., & Uppal, R. (2009). Equilibrium portfolio strategies in the presence of sentiment risk and excess volatil-
ity. Journal of Finance, 64(2), 579–629.
Epstein, L. G., & Zin, S. E. (1989). Substitution, risk aversion, and the temporal behavior of consumption and asset returns: A
theoretical framework. Econometrica, 57(4), 937–969.
Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Jour-
nal of Statistical Software, 33(1), 1–22.
Gabaix, X. (2014). A sparsity-based model of bounded rationality. Quarterly Journal of Economics, 129(4), 1661–1710.
Gallmeyer, M., & Hollifield, B. (2008). An examination of heterogeneous beliefs with a short-sale constraint in a dynamic econ-
omy. Review of Finance, 12(2), 323–364.
Gilboa, I., & Samuelson, L. (2012). Subjectivity in inductive inference. Theoretical Economics, 7(2), 183–215.
Gilboa, I., & Schmeidler, D. (2003). Inductive inference: An axiomatic approach. Econometrica, 71, 1–26.
Gilboa, I., & Schmeidler, D. (2010). Simplicity and likelihood: An axiomatic approach. Journal of Economic Theory, 145(5), 1757–
1775.
Hansen, L. P., & Sargent, T. J. (2010). Fragile beliefs and the price of uncertainty. Quantitative Economics, 1(1), 129–162.
Hoerl, A., & Kennard, R. (1988). Ridge regression. Encyclopedia of Statistical Sciences, 8, 129–136.
Kim, Y., & Routledge, B. R. (2019). Does macro-asset pricing matter for corporate finance? Critical Finance Review. Retrieved
from https://scholars.cityu.edu.hk/en/publications/publication(8ba3622c-0d6c-4da6-9518-453bba30112b).html
Kogan, S., Levin, D., Routledge, B. R., Sagi, J. S., & Smith, N. A. (2009). Predicting risk from financial reports with regression. In
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for
Computational Linguistics. Stroudsburg, PA: Association for Computational Linguistics, pp. 272–280.
Koijen, R. S., Moskowitz, T. J., Pedersen, L. H., & Vrugt, E. B. (2013). Carry. National Bureau of Economic Research Discussion
Paper.
Ludvigson, S. C., & Ng, S. (2009). Macro factors in bond risk premia. Review of Financial Studies, 22(12), 5027–5067.
O’Connor, B., Balasubramanyan, R., Routledge, B. R., & Smith, N. A. (2010). From tweets to polls: Linking text sentiment to public
opinion time series. In Fourth International AAAI Conference on Weblogs and Social Media, pp. 122–129.
Osambela, E. (2015). Differences of opinion, endogenous liquidity, and asset prices. Review of Financial Studies, 28, 1914–1959.
Romer, C. D., & Romer, D. H. (2004). A new measure of monetary shocks: Derivation and implications. American Economic
Review, 94(4), 1055–1084.
Simon, H. A. (1959). A behavioral model of rational choice. American Economic Review, 49(3), 253–283.
Tetlock, P. C. (2007). Giving content to investor sentiment: The role of media in the stock market. Journal of Finance, 62(3),
1139–1168.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Method-
ological), 58, 267–288.
Tibshirani, R. (2011). Regression shrinkage and selection via the lasso: A retrospective. Journal of the Royal Statistical Society:
Series B (Statistical Methodology), 73(3), 273–282.
Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., & Knight, K. (2005). Sparsity and smoothness via the fused lasso. Journal of the
Royal Statistical Society: Series B (Statistical Methodology), 67(1), 91–108.
Toutanova, K., Klein, D., Manning, C. D., & Singer, Y. (2003). Feature-rich part-of-speech tagging with a cyclic dependency net-
work. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on
Human Language Technology-Volume 1. Stroudsburg, PA: Association for Computational Linguistics, pp. 173–180.
van Binsbergen, J., Hueskes, W., Koijen, R., & Vrugt, E. (2013). Equity yields. Journal of Financial Economics, 110(3), 503–519.
Wang, Z. (2005). A shrinkage approach to model uncertainty and asset allocation. Review of Financial Studies, 18(2), 673–705.
Welch, I., & Goyal, A. (2008). A comprehensive look at the empirical performance of equity premium prediction. Review of Finan-
cial Studies, 21(4), 1455–1508.
Yogatama, D., Heilman, M., O’Connor, B., Dyer, C., Routledge, B. R., & Smith, N. A. (2011). Predicting a scientific community’s
response to an article. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA:
Association for Computational Linguistics, pp. 594–604.
Yogatama, D., Routledge, B. R., & Smith, N. A. (2013). A sparse and adaptive prior for time-dependent model parameters. arXiv
preprint arXiv:1310.2627.
Yogatama, D., Wang, C., Routledge, B. R., Smith, N. A., & Xing, E. P. (2014). Dynamic language models for streaming text. Trans-
actions of the Association for Computational Linguistics, 2, 181–192.
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series
B (Statistical Methodology), 67(2), 301–320.
How to cite this article: Routledge BR. Machine learning and asset allocation. Financial Management. 2019;48:
1069–1094. https://doi.org/10.1111/fima.12303