
Solving Heterogeneous General Equilibrium Economic Models

with Deep Reinforcement Learning


Preprint

Edward Hill 1 Marco Bardoscia 1 2 Arthur Turrell 1 3


arXiv:2103.16977v1 [econ.GN] 31 Mar 2021

1 Bank of England. 2 University College London, Department of Computer Science. 3 Data Science Campus, Office for National Statistics. Correspondence to: Edward Hill <ed.hill@bankofengland.co.uk>.

The views expressed are solely those of the authors, and do not represent the views of the Bank of England (BoE) or Office for National Statistics (ONS) or state the policies of the BoE or ONS.

Abstract

General equilibrium macroeconomic models are a core tool used by policymakers to understand a nation's economy. They represent the economy as a collection of forward-looking actors whose behaviours combine, possibly with stochastic effects, to determine global variables (such as prices) in a dynamic equilibrium. However, standard semi-analytical techniques for solving these models make it difficult to include the important effects of heterogeneous economic actors. The COVID-19 pandemic has further highlighted the importance of heterogeneity, for example in age and sector of employment, in macroeconomic outcomes and the need for models that can more easily incorporate it. We use techniques from reinforcement learning to solve such models incorporating heterogeneous agents in a way that is simple, extensible, and computationally efficient. We demonstrate the method's accuracy and stability on a toy problem for which there is a known analytical solution, its versatility by solving a general equilibrium problem that includes global stochasticity, and its flexibility by solving a combined macroeconomic and epidemiological model to explore the economic and health implications of a pandemic. The latter successfully captures plausible economic behaviours induced by differential health risks by age.

1. Introduction

One of the core problems in macroeconomics is to create models that capture how the self-interested actions of individuals and firms combine to drive the aggregate behaviour of the economy. These models can provide a guide for policymakers as to what actions they should take in any particular circumstance. Historically, macroeconomic models have tended to be simple because of the need for interpretability, but also because of a heavy reliance on solution methods that are semi-analytical. Such methods allow for the solution of a wide range of important macroeconomic problems. However, events such as the Great Financial Crisis and the COVID-19 crisis have shown that the ability to solve more general problems that include multiple, discrete agents and complex state spaces is desirable. We propose a way to use reinforcement learning to extend the frontier of what is possible in macroeconomic modelling, both in terms of the model assumptions that can be used and the ease with which models can be changed.

Specifically, we show that reinforcement learning can solve the 'rational expectations equilibrium' (REE) models that are ubiquitous in macroeconomics, where choice variables are continuous and may have time-dependency, and where there are global constraints that bind agents' collective actions. Importantly, we show how to solve rational expectations equilibrium models with discrete heterogeneous agents (rather than a continuum of agents or a single representative agent).

We apply reinforcement learning to solve three REE models: precautionary saving; the interaction between a pandemic and the macroeconomy (an 'epi-macro' model), with stochasticity in health statuses; and a macroeconomic model which has global stochasticity, i.e. where the background is changing in a way that the agents are unable to predict. With these three models, we show that we can capture a macroeconomy that has rational, forward-looking agents, that is dynamic in time, that is stochastic, and that attains 'general equilibrium' between the supply and demand of goods or services in different markets.

2. Background

Macroeconomic models seek to explain the behaviour of economic variables such as wages, hours worked, prices, investment, interest rates, the consumption of goods and services, and more, depending on the level of complexity. They do this through 'microfoundations', that is, describing the behaviour of individual agents and deriving the system-wide behaviour based on how those atomic behaviours aggregate. An important class of these models is used to describe how variables co-move in time when supply and demand are balanced (in general equilibrium), and when some variables are subject to stochastic noise (aka 'shocks'). A typical macroeconomic rational expectations model with general equilibrium is a representation of an economy populated by households, firms, and public institutions (such as the government). The choices made by these distinct agents are framed as a dynamic programming problem in which households maximise their discounted future utility U = E Σ_{t=1}^{∞} β^t u(s_t, a_t), with u the per-period utility, β a discount factor, s_t ∈ S a vector of state variables, a_t ∈ a(s_t) a vector of choice variables, and s_t evolving as s_{t+1} = h(s_t, a_t). E(·) represents an expectation operator, usually assumed to be 'rational' in the sense of being the households' best possible forecast given the available information (and implying that any deviations from perfect foresight are random). For household agents, u is monotonically increasing in consumption, c_t, and decreasing in hours worked, n_t (both choice variables). Extra conditions are imposed via other equations, for example a budget constraint of the form (1 + r_t) b_{t-1} + w_t n_t ≥ p_t c_t + b_t, with p_t the price, w_t the wage, and r the interest rate. b_t captures savings, typically in the form of a risk-free bond or other investment. If b_t < 0 is permitted (i.e. debt) then b_t usually satisfies a 'no Ponzi' condition that rules out unlimited borrowing and effectively imposes the rule that b_T = 0 (t ∈ 0, . . . , T). Consumers take prices, wages, and the interest rate as given; these are state variables. Firms maximise profits Π_t = p_t Y_t − w_t N_t (possibly including a −r_t K_t term if savings are invested) subject to a 'production constraint', Y_t, that turns labour, N_t, and capital, K_t, into consumption goods. Typically, Y_t = A_t f(K_t, N_t), where f is a monotonically increasing function of its inputs and A_t is either predetermined or follows a log-autoregressive process ln A_t = ρ_A ln A_{t−1} + ε_t; ε_t ∼ N(0, σ_A) is known as a technology 'shock'. Governments perform functions such as the collection and redistribution of taxes. Firms are assumed to be perfectly competitive, meaning that each firm takes prices and wages as given.

Prices, wages, and interest rates are determined by market clearing for goods, labour, and savings respectively, in which supply and demand are balanced in each market. These 'general equilibrium' conditions bind agents and the environment together, and are atypical in reinforcement learning.
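To make the household problem above concrete, the following minimal sketch (ours, not the authors' code; the specific log/quadratic form of u and all function names are illustrative assumptions) writes out the per-period utility, the budget constraint as a savings transition, and a production function of the form Y_t = A_t f(K_t, N_t):

```python
import numpy as np

def utility(c, n, theta=1.0):
    """Per-period household utility u(s_t, a_t): increasing in consumption c,
    decreasing in hours worked n (a common log/quadratic form is assumed)."""
    return np.log(c) - 0.5 * theta * n**2

def next_savings(b_prev, c, n, w, p, r):
    """Budget constraint written as a transition for savings:
    (1 + r) b_{t-1} + w n = p c + b_t  =>  b_t = (1 + r) b_{t-1} + w n - p c."""
    return (1.0 + r) * b_prev + w * n - p * c

def production(K, N, A=1.0, alpha=2/3):
    """Production Y_t = A_t f(K_t, N_t); a Cobb-Douglas f is assumed here."""
    return A * K**(1 - alpha) * N**alpha

def discounted_utility(path, beta=0.97):
    """U = sum_t beta^t u(s_t, a_t) for a realised path of (c, n) choices."""
    return sum(beta**t * utility(c, n) for t, (c, n) in enumerate(path))
```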

The competitive equilibrium is defined by a vector of state variables, and by consumption and production plans for the agents that maximise utility. Often, the optimal policies of all agents are solved analytically by Lagrangian methods: the equilibrium conditions are substituted in and the system of equations simplified, usually by log-linearising the model around an assumed steady state. We now review some macroeconomic models before briefly discussing multi-agent models more generally.

The representative agent with rational expectations is an important class of macroeconomic model, the most well-known example being the representative agent dynamic stochastic general equilibrium (DSGE) model. The canonical model is the representative agent New Keynesian (RANK) model (Smets & Wouters, 2007). Continuum rational expectations models overcome some of the heterogeneity-related shortcomings of those models by replacing the representative household with a continuum of households that are ex ante differentiated by their assets and labour productivity. The canonical example is the heterogeneous agent New Keynesian (HANK) model (Kaplan et al., 2018). Macroeconomic agent-based models differ in that they simulate agents as discrete entities but also typically make very different assumptions to, say, RANK or HANK models; the most important being that they tend not to assume rational expectations/perfect foresight and they may not necessarily have competitive markets. Importantly, they allow for heterogeneity in multiple dimensions simultaneously (Haldane & Turrell, 2019). Agent-based models (ABMs) are also extensively used in epidemiology (Tracy et al., 2018), sometimes under the name 'individual-based models'. At the start of the coronavirus crisis, UK government policy was heavily informed by such models, most notably that of Ferguson et al. (2020), and there are several ABMs modelling the coronavirus pandemic (Hoertel et al., 2020; Kerr et al., 2020). These epi-ABMs do not capture economic effects.

Epi-macro models attempt to combine macroeconomic and epidemiological effects, and their interaction. The canonical examples combining epidemiology and a REE representative agent model are Eichenbaum et al. (2020a) and Eichenbaum et al. (2020b), who link the two by assuming that, in addition to the usual Susceptible, Infected, Recovered (SIR) transmission mechanism as posed by Kermack & McKendrick (1927), a household agent may be infected at work or while engaging in consumption. Market clearing is also assumed. Building on many of the same assumptions as HANK, the canonical continuum agent epi-macro model with REE is by Kaplan et al. (2020). Agents are differentiated by their assets, productivity, occupation, and health status. There are three types of good: regular, social, and home-produced; and three types of work: workplace, remote, and home. The epi-macro link is achieved through a transmissibility of infection that is modified to include terms proportional to hours worked and amount consumed, with avoidance of infection captured through a disutility of death. Market clearing is assumed.

Finally, recent work has seen reinforcement learning applied to multi-agent systems of relevance to economics, in the case of bidding in auctions under constraints (Feng et al., 2018; Dütting et al., 2019), and deciding on behaviours for both agents and a social planner in a gather-and-build economy (Zheng et al., 2020).

In the rest of this paper, we show how to use reinforcement learning to solve typical rational expectations macroeconomic models while also incorporating discrete agent heterogeneity and, potentially, stochasticity; demonstrating that all three can be combined is by far our major contribution and has applications for a wide class of economic problems.

3. Model and Experiments

3.1. Precautionary Savings

A typical rational expectations equilibrium problem is that of precautionary saving, in which agents anticipate a change in circumstances that will adversely affect their utilities, in this case a reduction in wages, and respond in advance in order to smooth their consumption. Such behaviour is typical of the agents in a REE model. The simplest version of this problem has a known analytical solution. We solve this model using reinforcement learning so that we may compare it to the analytical solution, and we also use it as a way to demonstrate many of the challenges of using RL for this class of problems; notably the speed and accuracy of convergence given the sensitivity to the estimate of the value function; the continuous action and state spaces; and the enforcement of the 'no Ponzi' condition.

3.1.1. Model

We assume that there is a single household agent with rational expectations. There are I = 2 firms, with the firms and the good each firm produces indexed by i. The household agent is employed by one of these firms, which we will denote e, and has per-period utility u_t = Σ_{i∈I} ln c_{it} − (θ/2) n_t², with action (choice variables) c_{it} consumption and n_t hours worked. 0 ≤ t < T is the discrete timestep. The price vector is fixed to p_{it} = 1, and the interest rate is fixed to 0. The wage is imposed as w = 1 for t < T/2 and w = 0.5 afterwards, a fall that is anticipated. The household agent is subject to a budget constraint such that b_{t+1} = b_t + w_t n_t − Σ_{i∈I} p_{it} c_{it}. The no Ponzi condition is imposed via b_T = 0, which prevents unlimited borrowing by the household. The agent maximises its discounted utility Σ_{0≤t<T} β^t u_t with β = 0.97 and T = 20.

We use I = 2 firms rather than a single representative firm, allow θ (the weight on the disutility of work) to vary, and introduce four discrete states, indexed by d, that the household can be in at any given time. While making no economic difference, the expansion of the action and state spaces means that the hyperparameters we find are transferable to the more realistic problems we will come to. It also allows us to test that training the parameters of a single network to provide the value function for any agent is a good approximation to training a network for each agent individually, which is significantly more computationally expensive.

The analytical solution for the consumption and hours paths of the household is given by c_t = λ^{-1} β^t / I and n_t = w_t θ^{-1} λ β^{-t}, where λ is a Lagrange multiplier determined from the no Ponzi condition. Computationally, we start from the Bellman equation:

U(t, s, S) = max_a [ u_t(s, S, a) + β E_{s_{t+1}, S_{t+1} | a, s, S} U(t + 1, s_{t+1}, S_{t+1}) ]    (1)

where s is the agent's state, a is the action vector, E is the expectation operator, and S is the global state (which includes t, but we make t explicit for clarity). For the current problem, we drop the global state to obtain U(t, s) = max_a [ u_t(s, a) + β E_{s_{t+1}|a,s} U(t + 1, s_{t+1}) ].

The optimal action vector under local and global constraints, a*_t(s, S), is computed using the method of Lagrange multipliers; this requires accurate values of the gradient of U with respect to the state variables.

We use a deep neural network to approximate U(t, s) = U(t, s = (e, θ, d; b_t)); however, direct approximation is problematic because ∂_s U(t, s) is slow to converge and is highly sensitive to initialisation and hyperparameter choice. To mitigate this, we find D(t, s) = ∂_s U(t, s) explicitly by solving

D(t, s) = ∂_s u_t(s, a*) + β Σ_{s_{t+1}} ∂_s P(a*(s) → s_{t+1} | s, a*) U(t + 1, s_{t+1}) + β E_{s_{t+1}|s,a*} [ (∂_s s_{t+1}) D(t + 1, s_{t+1}) ]    (2)

That a directly learnt estimate of ∂_s U(t, s) aids stability and convergence has been noted in both deterministic (Balduzzi & Ghifary, 2015) and stochastic (Heess et al., 2015) continuous control problems.

In the case of precautionary saving, ∂_s P(s → s_{t+1} | s, a*) = 0 and P(s → s_{t+1} | s, a*) = δ_{s s_{t+1}} (with δ the Kronecker delta), so that D(t, s) = ∂_s u_t(s, a*) + β (∂s_{t+1}/∂s) D(t + 1, a*(s)). The values of ∂_a U(t + 1, s_{t+1}) that are required for finding a* are then written as (∂s_{t+1}/∂a) D(t + 1, s_{t+1}). U and D need not be consistent.
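As an illustration of how a learnt estimate of D feeds into the choice of action, the sketch below (ours; the fixed-point scheme and names are assumptions, and the no Ponzi adjustment at the final timestep is omitted) writes out the interior first-order conditions for the precautionary-saving model of §3.1.1:

```python
import numpy as np

def optimal_action(b, w, D_next, theta=1.0, beta=0.97, I=2, n_iter=50):
    """Illustrative first-order conditions for the precautionary-saving model:
    maximise sum_i ln(c_i) - (theta/2) n^2 + beta * U(t+1, b') with
    b' = b + w*n - sum_i c_i (prices = 1, interest rate = 0).
    D_next(b') approximates dU(t+1, b')/db'. A fixed-point iteration is used
    because D_next is evaluated at the savings level implied by the action."""
    c = np.full(I, 1.0)   # initial guess for consumption of each good
    n = 1.0               # initial guess for hours worked
    for _ in range(n_iter):
        b_next = b + w * n - c.sum()
        d = D_next(b_next)                  # marginal value of an extra unit of savings
        c = np.full(I, 1.0 / (beta * d))    # FOC in c_i:  1/c_i = beta * dU(t+1)/db'
        n = beta * w * d / theta            # FOC in n:    theta*n = beta * w * dU(t+1)/db'
    return c, n, b + w * n - c.sum()
```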

3.1.2. Specification

System. All timings use a laptop with a 4-core CPU (an i7-6700HQ) without GPU acceleration. The code is written in Python 3 using PyTorch (Paszke et al., 2019) for the neural network.

Neural Network and Training. The networks for U and D are identical, with 5 layers of 50 Softsign neurons followed by a single linear layer. The inputs are normalised to the typical scales in the problem, for example t → (2t − 1)/2T, and the outputs of each network are normalised to the scale of the problem by an additive and a multiplicative factor. The networks are wrapped in a caching and linear interpolation function to reduce network evaluations.

The networks are trained using the Adam optimiser (Kingma & Ba, 2014) with a learning rate that decays over epochs E as lr = max(5 × 10^{-3} × 0.8^{E}, 10^{-5}). Each epoch contains 160 experiences of U(t, s) and D(t, s), which are trained with a replay buffer (Mnih et al., 2015). Initially the buffer is emptied after each epoch, but once lr ≤ 10^{-4} it retains a fraction of its contents. This provides swift initial learning followed by good coverage of experiences later in the process (Fedus et al., 2020).

The experiences are created using n = 4-step learning (e.g. Sutton & Barto, 2018), recorded by running an agent forward taking its current optimal actions. These runs are independent and in parallel. A fraction n/(n + 1) of the forward runs apply a perturbation to the state every n steps, allowing for both exploitation and exploration. In early epochs, where the predicted solution is less accurate, Double DQN (Van Hasselt et al., 2016) and target clipping (Schulman et al., 2017) provide stability.
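A minimal PyTorch sketch of value networks of the kind described above (ours, illustrative; the caching and interpolation wrapper, output normalisation, and replay buffer of the full implementation are omitted, and the input layout is an assumption):

```python
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Approximator used for both U(t, s) and D(t, s): 5 hidden layers of
    50 Softsign units followed by a single linear output layer."""
    def __init__(self, n_inputs, n_hidden=50, n_layers=5):
        super().__init__()
        layers = []
        for i in range(n_layers):
            layers += [nn.Linear(n_inputs if i == 0 else n_hidden, n_hidden),
                       nn.Softsign()]
        layers += [nn.Linear(n_hidden, 1)]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

U_net = ValueNet(n_inputs=5)   # e.g. normalised (t, e, theta, d, b_t)
D_net = ValueNet(n_inputs=5)

def make_optimiser(net, epoch):
    # Decaying learning rate: lr = max(5e-3 * 0.8**epoch, 1e-5)
    lr = max(5e-3 * 0.8**epoch, 1e-5)
    return torch.optim.Adam(net.parameters(), lr=lr)
```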

3.1.3. Results

We assess the goodness of the model by defining its error as the mean absolute fractional difference between the analytic consumption from the equations above and the simulated consumption for an agent beginning at t = 0 with b_0 = 0, for a number of values of θ. This is a stringent and appropriate test of the model since the value of consumption at each timestep depends on current savings, which are themselves determined by the time-histories of n and c_i. Note that this means errors in the observable values compound over time, as they will in similar models in later sections. The model successfully converges to the analytical solution with an error of ≈ 0.01 in ≈ 25 epochs, with each epoch containing 160 experiences. This takes ≈ 2 minutes, about 1 GFLOPs-hour.

3.2. A General Equilibrium Rational Expectations Epi-Macro Model with Stochasticity in Agents' States

We now build on the previous example to demonstrate the solution of a rational expectations general equilibrium epi-macro model, with SIRD health states. We find the 'decentralised equilibrium' in which each household agent behaves optimally according to its choice variables. Agents are motivated to change their behaviour due to fear of dying from the disease and, because consumption carries with it a risk of contracting the disease, their patterns of consumption change, in turn altering their risk of infection – this risk therefore connects the macroeconomic and epidemiological aspects of the model. These consumption changes are differentiated by age, as there is an exponentially increasing risk of death with age once infected.

3.2.1. The Model

The model combines features of both agent-based macro models and rational expectations models. Agents are rational, forward-looking, and discrete. Let household agents be indexed by j, while sectors (the analogue of firms), and the goods that each sector produces, are indexed by i. Households are ex ante differentiated by their age and the sector that employs them. Let E_i be the set of households employed by sector i.

The model is a real business cycle (RBC) model (Kydland & Prescott, 1982), with no technology shock. In each period household agents engage in consumption, c, and work, captured by hours worked, n. The time-t utility of household j when susceptible, infected, or recovered is u_j = Σ_i ln c_{ji} − (1/2) θ n_j², where θ = 1 is a disutility of working. Households face a time-t budget constraint balancing their income from work and interest on the capital they have loaned to industry against their consumption and investment in new capital, v_j: w_j n_j + r k_{j,t} = Σ_i p_i c_{ji} + v_{j,t}, with k_{j,t+1} = k_{j,t} + v_{j,t}. Sector i produces a quantity of goods Y_i using household labour such that Y_i = A N_i^α K_i^{1−α}, where N_i are the total hours worked in sector i and K_i = K_e + K̂_i is its total capital. K_e is an initial endowment of capital and K̂_i is that derived from consumers' investments. We set A = 1, α = 2/3. Sectors are profit-maximising, so ∂_{N_i} Π_i = 0 and ∂_{K_i} Π_i = 0, where the profit is Π_i = p_i Y_i − w_i N_i − (r + δ) K_i, with δ the depreciation rate of capital. For our form of Y_i this implies Π_i = 0. All consumers and firms take p_i, w_i and r as given, and these are adjusted to clear the markets for hours worked, Σ_{j∈E_i} g_j n_j = N_i ∀i, and for capital, Σ_j g_j k_j = Σ_i K̂_i, where the agent weights are g_j = J^{−1} ∀j. The utility of death is −200, and the discount rate is β = 0.97. See Appendix I for a more detailed description.

This type of model, which is common, is used only to demonstrate the approach; the solution method is applicable to a wide range of models. Also, note that there is no need for linearisation around a steady state, which is a common solution technique for models of this type.

Household agents have four possible health states: susceptible, infected, recovered, and deceased (SIRD), with the probability of transition between states given by, e.g., P(S → I) for going from susceptible to infected. In what follows, epidemiological parameters are denoted with tildes. At each time-step (a day),

P_j(S → I) = β̃ Σ_i [ c_{ji} / Σ_i c_{ji,t=0} ] ρ̃_i [ Σ_{j∈Infected} c_{ji} / Σ_j c_{ji} ]

so that there is a 'shopping risk' of acquiring the infection if many infected agents are consuming the same goods. ρ̃_i is a vector of relative consumption risks such that Σ_i ρ̃_i = 1. P(I → R) = γ̃, and P_j(I → D) = ξ̃(j) for each household j, where ξ̃(j) is an exponentially increasing function of age, rising from 0.006 at age 40 through 0.024 at 55 to 0.165 for a 70 year old, then flattening at age 80. β̃ = 0.56 and γ̃ = 0.2, from Lin et al. (2020). The relative risk of consuming each sector's product is ρ̃_i = {0.8, 1.2}/2, and we refer to these as 'remote' and 'social' consumption respectively. At time t = 0, k_j = 15 = K_e/2 ∀j, and at time t = 2 we infect the youngest 10% of the population. On death, any investments are redistributed across living agents and deceased agents are not replaced. We use J = 100 agents, with a distribution of ages given by Age(j) ∼ U(20, 95). Households are evenly distributed across employers and cannot change employer.

Aside from using a reasonable distribution of the death rates, this model is entirely uncalibrated.
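A sketch (ours, illustrative only) of the daily SIRD update implied by these transition probabilities; the array layout, helper names, and the precedence of recovery over death within a day are simplifying assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def infection_prob(c, infected, c0, rho=np.array([0.4, 0.6]), beta_t=0.56):
    """P_j(S -> I) for every household j.
    c: (J, I) consumption this period; c0: (J, I) consumption at t = 0;
    infected: boolean mask of currently infected households; rho = {0.8, 1.2}/2."""
    infected_share = c[infected].sum(axis=0) / np.maximum(c.sum(axis=0), 1e-12)  # per good
    return beta_t * (c / c0.sum(axis=1, keepdims=True) * rho * infected_share).sum(axis=1)

def step_health(state, c, c0, age_death_prob, gamma_t=0.2):
    """Advance SIRD states one day. state: array of one-character codes 'S','I','R','D';
    age_death_prob: per-household death probability (an increasing function of age)."""
    p_si = infection_prob(c, state == "I", c0)
    new_state = state.copy()
    s = state == "S"
    new_state[s & (rng.random(len(state)) < p_si)] = "I"
    i = state == "I"
    u = rng.random(len(state))
    new_state[i & (u < gamma_t)] = "R"                                             # recovery
    new_state[i & (u >= gamma_t) & (rng.random(len(state)) < age_death_prob)] = "D"  # death
    return new_state
```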

3.2.2. Solving the Model with Multiple Agents

We run a number of simulations indexed by τ. Solving the model means finding a history H_{τ=∞} = {S_t}_{t∈0,...,T} and agent behaviours U_{τ=∞}(t, k_t, S_t) that are consistent. We begin with H_{τ=0} and U_{τ=0}. We use H̄_τ, an average created from {H_{τ'}}_{τ'≤τ}, to re-train U_τ, obtaining U_{τ+1}. A multi-agent simulation is run with agent behaviours governed by U_{τ+1} to obtain H_{τ+1} in general equilibrium. As is standard in iterative methods, we terminate at a sufficiently large τ_max in order to provide a good approximation to the values at τ = ∞; the results we present use τ_max = 50. In the limit of large J, each household's behaviour makes no difference to the system, so despite the stochastic transitions of health statuses we are able to obtain a unique history of the variables characterising the global system (including the I = 2 sectors). The observed Ĥ_τ can be seen as a noisy observation of that unique history, and H̄_τ as an estimate of it, where the averaging is chosen such that H̄_τ converges to it as τ → ∞ and has significantly less noise than Ĥ_τ. We use the average of {H_{τ'}}_{τ'≤τ} weighted by γ^{τ−τ'} with γ ≤ 1. While in general H_τ is therefore produced solely as a function of U_{τ−1}, we modify this general scheme for this specific case by providing the infection fraction Σ_{j∈Infected} c_{ji} / Σ_j c_{ji} in multi-agent simulation τ from H̄_{τ−1}. This counteracts the propagation of the high levels of noise created by the initiation of the pandemic through to later times of the simulation, and could be avoided by using a larger number of agents.

At each timestep, we iteratively solve to find the values of the prices (p_i, w_i, r) for which the markets for labour and capital clear, by gradient descent using scipy.optimize.least_squares with the default parameters, initialised from the previous timestep for t > 0. We then advance both the capital and epidemiological state to proceed to the next timestep.
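A stripped-down sketch (ours) of the market-clearing step using scipy.optimize.least_squares as described above. To keep it self-contained, the clearing conditions are expressed through the firm first-order conditions of §3.2.1, with goods prices held fixed and household capital split equally between the two sectors; labour_supply and capital_supply are placeholders for the agents' learnt optimal responses, and DELTA and KE are assumed values:

```python
import numpy as np
from scipy.optimize import least_squares

ALPHA, DELTA, A, KE = 2 / 3, 0.1, 1.0, 30.0   # DELTA and KE are assumed for illustration

def clearing_residuals(x, labour_supply, capital_supply, p=(1.0, 1.0)):
    """Residuals that are zero when labour and capital markets clear.
    x = (w0, w1, r). labour_supply(w, r) returns aggregate hours offered to each
    sector and capital_supply(w, r) the aggregate capital offered, both standing in
    for the households' optimal responses at these prices."""
    w, r = np.array(x[:2]), x[2]
    N = labour_supply(w, r)                          # hours worked per sector
    K = np.full(2, KE + 0.5 * capital_supply(w, r))  # capital per sector (equal split)
    Y = A * K ** (1 - ALPHA) * N ** ALPHA
    labour_foc = ALPHA * np.array(p) * Y - w * N                                  # per sector
    capital_foc = np.sum((1 - ALPHA) * np.array(p) * Y) - (r + DELTA) * np.sum(K)  # aggregated
    return np.append(labour_foc, capital_foc)

# Example usage with dummy household responses:
# sol = least_squares(clearing_residuals, x0=[1.0, 1.0, 0.03],
#                     args=(lambda w, r: np.array([0.9, 1.1]),
#                           lambda w, r: 10.0))
```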

3.2.3. Re-training Agent Behaviours

U_{τ+1} is obtained from U_τ by continuing training using H̄_τ. The network parameters are the same as in §3.1; however, we use a gentler decay of the learning rate. No problems with convergence are observed, and the adherence to the no Ponzi condition is a good test of this, since achieving it is sensitive to the entire time history of the simulation.

As in the multi-agent model, consumers take prices (p_i, w_i and r) as given, find their optimal consumptions, hours worked and investments, and then advance their state using the budget constraint and the probabilities of their SIRD state changing.

3.2.4. Results

We examine two cases: a 'heterogeneous' case as described above, and a 'homogeneous' case without age heterogeneity but with the same mean death rate.

Figure 1 shows the percentages of susceptible, infected, recovered and deceased agents as the pandemic progresses. Each line is an average over the results of 3 simulations, and each simulation's result is an average over 20 histories. As can be seen from the 95% confidence intervals in the figure, the behaviour is similar across simulations. In the homogeneous-age case, more people are infected over the course of the pandemic, but there are fewer deaths in total.

Figure 1. Percentage of agents who are susceptible, infected, recovered, or deceased as the simulation progresses.

Figure 2 shows the agents' consumptions. We bin the uniform age distribution into young (< 40), old (> 70) and middle-age groups; we find considerable differences in behaviour between them. After consuming the most before the pandemic, since they anticipate the coming opportunity to save, the old strongly reduce consumption in response to infection risk, and reduce consumption of the riskier 'social' good more. The young, conversely, are unlikely to die and so their consumption is relatively unchanged, governed by the decrease in the size of the economy.

Figure 2. Average consumption of living agents from the 'remote' and 'social' sectors, binned into three age groups.

Figure 3 shows the mean investments per agent in each age group, normalised to the salary (product of wage and hours worked) of an agent in the same system with no pandemic and no saving. The young anticipate the pandemic and save before it in order to spend when infection rates are higher, with the converse behaviour for the old. It also shows that the no Ponzi condition holds with good accuracy for all age groups.

Figure 3. Average investments of living agents binned into three age groups. The vertical line shows the time of peak infections.

Finally, Figure 4 shows the total consumption for both heterogeneous-age and homogeneous-age cases. The inclusion of distributional effects causes significant changes to bulk macroeconomic quantities.

Figure 4. A comparison of total consumption per living agent by sector in two separate simulations with agents who are heterogeneous or homogeneous in age.

Together these results show that in this uncalibrated model the inclusion of age heterogeneity makes a substantial difference to both the epidemiological and economic progress of a pandemic. This model and solution method has been tested with a range of epidemiological and economic parameters and has shown consistent stability and convergence. This exercise has also shown the sensitivity of the model's conclusions to those parameters, emphasising the importance of calibration in all aspects of the model if it were to be used as more than a test case of the methodology.

3.2.5. Specification

The hardware, software, and parameters are identical to §3.1 with the exception of the learning rate decay, which is slower here to allow for the longer time history. Scaling is linear with J, since the number of iterations of the least squares optimisation seems to scale very weakly with J for this problem. Each history calculation followed by an RL update takes ∼30 minutes on the reference machine, so a single simulation takes ∼24 hours, ∼720 GFLOPs-hour.

3.3. Stochasticity of global variables: A toy wage-shock problem

In the previous section, only the agents' state was stochastic and, because we were in the limiting case of large numbers of agents, each individual agent's state did not affect the global state. We now return to the model in §3.1, but instead of a deterministic drop in wages, wages now follow a log-autoregressive stochastic process.

3.3.1. The Model

We return to the Bellman equation, (1). We assume there is no stochasticity in health states, so the E_{s'|s,a}(·) are no longer present; however, the E_{S_{t+1}|S}(·) remain.

U(t, s, S) = max_a [ u_t(s, S, a) + β E_{S_{t+1}|S} U(t + 1, s_{t+1}, S_{t+1}) ]    (3)

where s_{t+1} = a(s) advances deterministically. We change the method to solve for Ũ and D̃, defining

Ũ(t, S, s_{t+1}) = β E_{S_{t+1}|S} U(t + 1, S_{t+1}, s_{t+1})    (4)

and

U(t, S, s) = max_a { u_t(S, s, a) + β Ũ(t, S, s_{t+1}) }    (5)

where the value of a which attains the maximum defines a*. Again, we will find the maximum using Lagrange's method and the auxiliary quantity D̃,

D̃(t, S, s_{t+1}) = E_{S_{t+1}|S} [ D(t + 1, S_{t+1}, s_{t+1}) ]    (6)

and so

∂_a L(t, S, s, a) = ∂_a u_t(S, s, a) + β (∂_a s_{t+1}) D̃(t + 1, S, s_{t+1})    (7)

As is standard in reinforcement learning, the expectations are approximated by using a large number of global state histories to update Ũ and D̃.

Excepting the wage history, the set-up is as in §3.1. The agents are statically heterogeneous in their propensity to work, θ ∈ [0.6, 1.4], and employer, e, and dynamically heterogeneous in savings. Again, the multiple employers and the treatment of heterogeneity in θ are introduced so that demonstrations of convergence and hyperparameter choice are relevant to the problem in the next section.

In this toy problem, we compare two types of agent: one a current-time wage-observing agent, whose future utility is a function of the wages at the current timestep, w_t: U(t, S = (t, w_t), s = (θ, e, b_t)); the other a non-wage-observing agent with U(t, S = (t), s = (θ, e, b_t)). Both are able to optimise by testing how their strategies play out within the histories they have seen; however, they only have partial visibility of the global state. In the previous case, which had a deterministic global state, knowledge of the time determined all other global variables, but here the mapping from observable values to global state is one-to-many.

The wage follows a log-autoregressive process, ln w_t = ρ_w ln w_{t−1} + ε_t; ε_t ∼ N(0, σ_w), where we use ρ_w = 0.97, σ_w = 0.1, and w_{t=0} = 1. Since the wage is autoregressive, knowledge of the current wage adds information about the wage in the future. The agent is trained on #T ∈ {100, 1000} training histories.

We parametrise wage histories by their mean absolute fractional deviation, d_h, from the mean path of the autoregressive process, w_{mean,t} = exp( (σ_w²/2) Σ_{t'<t} ρ_w^{2t'} ). Of the 50 test histories, which have d_h ∈ [0.07, 0.55], we consider the 27 with d_h > 0.2 to represent those with significant deviation from the mean path.
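A short sketch (ours) of how such wage histories, the mean path w_mean,t, and the deviation measure d_h can be generated; parameter names follow the text:

```python
import numpy as np

def wage_history(T=20, rho=0.97, sigma=0.1, w0=1.0, rng=None):
    """Simulate ln w_t = rho * ln w_{t-1} + eps_t, eps_t ~ N(0, sigma), w_0 = 1."""
    rng = rng or np.random.default_rng()
    log_w = np.empty(T)
    log_w[0] = np.log(w0)
    for t in range(1, T):
        log_w[t] = rho * log_w[t - 1] + rng.normal(0.0, sigma)
    return np.exp(log_w)

def mean_path(T=20, rho=0.97, sigma=0.1):
    """w_mean,t = exp( (sigma^2 / 2) * sum_{t'<t} rho^(2 t') )."""
    return np.exp(0.5 * sigma**2 *
                  np.array([np.sum(rho ** (2 * np.arange(t))) for t in range(T)]))

def deviation(history, T=20, rho=0.97, sigma=0.1):
    """Mean absolute fractional deviation d_h of a history from the mean path."""
    wm = mean_path(T, rho, sigma)
    return np.mean(np.abs(history - wm) / wm)

# histories = [wage_history() for _ in range(1000)]   # training histories, #T = 1000
```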

We judge the success of this model by calculating the average total utility an agent with θ = 1 attains over these previously unseen wage histories, {w_{h,t}}_{0≤t<T}, where the average over histories with d_h > x is denoted ū_{d_h>x}. Table 1 compares the average utilities of agents with different training setups and wage visibility to an analytic approximation, found by defining the action at time t in history h to be the values obtained from the formulae in §3.1 for a wage history beginning at time t: c_{h,t} = 1/λ̄ and n_{h,t} = w_{h,t} θ^{−1} λ̄, obtaining λ̄ from the no Ponzi condition as I^{−1} Σ_{t'≥t} β^{t'−t} = λ̄² θ^{−1} Σ_{t'≥t} β^{t−t'} E_p[w_{p,t'}²]. E_p[w_{p,t'}²] is evaluated analytically over all possible wage paths that have w_{p,t} = w_{h,t}, using the second moment of the log-normal distribution.

Table 1. Total utilities achieved by different agents relative to the analytic approximation, for different visibility of the wage, numbers of training histories #T, and including prioritised experience replay (PER).

Agent | #T | ū_{d_h ≥ 0} | ū_{d_h > 0.2}
Wage-observing + PER | 1000 | +0.97 | +1.21
Wage-observing | 1000 | +0.54 | +0.29
Wage-observing + PER | 100 | +0.27 | +0.17
Analytical approximation | – | 0.0 | 0.0
Non-wage-observing + PER | 1000 | −0.39 | −1.43

We implement prioritised experience replay (Schaul et al., 2015) by retaining experiences with a larger error for a larger number of training epochs. This increases the agents' performance relative to a base agent trained as in previous sections, particularly on the d_h > 0.2 histories that deviate significantly from the mean path. This is expected since the training examples are sparser and more varied at higher d_h. The wage-observing agent outperforms the non-wage-observing agent. In addition to the solution becoming stable and the no Ponzi condition being satisfied, the fact that the utilities for the wage-observing agent are consistently higher than those for the analytical approximation gives us confidence that the answers converged to are accurate. We record the utilities after 100 epochs, which equates to 40,000 experiences or 10 minutes on the reference machine. A degradation in performance is seen if an insufficient number of training histories is used.

3.3.2. Specification

The specification is identical to §3.1.2 except that the rate of decay of the learning rate is decreased to allow averaging, lr = e^{−0.01E}/(1 + E) for epoch E, and, as discussed, prioritised experience replay is used.

3.4. Stochasticity of global variables: A general equilibrium with stochastic technology shocks

Finally, we demonstrate a general equilibrium model that has stochastic global variables, specifically a log-autoregressive process in the technology. This is given by ln A_{it} = ρ_A ln A_{i,t−1} + ε_t; ε_t ∼ N(0, σ_A), and affects the production of a sector, given by Y_{it} = A_{it} K_{it}^{1−α} N_{it}^{α}. α = 2/3, ρ_A = 0.97, σ_A = 0.1, and A_{i0} = 1. The agents are as in §3.3, being statically heterogeneous in propensity to work, θ, and employer, e, and dynamically heterogeneous in investment, k_t; however, they now have visibility of all prices, not just wages. The global model that couples the agents is the same RBC model described in §3.2 and Appendix I, but without infections. As the agents' internal states are no longer stochastic, a smaller number (J = 10) of agents can be used.

As in §3.3, we compare two types of agents, one that 'sees' realisations of the prices and another that does not. Since differing values of the technology shock move the general equilibrium, changing the prices, observing them gives the agent information about the state of the underlying stochastic process. Figure 5 shows the paths of hours worked, n_j, for 5 agents with a range of values of θ, all of whom work in the same sector, given by i = 0. As expected, the paths of agents who can observe realisations of prices have smaller fluctuations, which is also true of the paths of other quantities. Averaging over 256 runs, the mean unsigned curvature of the paths drops from κ̄_{non-obs} = 0.44 to κ̄_{obs} = 0.27; this fall is also seen in the curves in the figure, where κ̄_{non-obs} = 0.42 and κ̄_{obs} = 0.19. This difference is because agents who can observe realisations are better able to adjust their behaviour to the current and (since it is an autoregressive process) future values of the technology shock.

Figure 5. Each row represents one of 4 randomly chosen simulations. Left column: the technology shock for sectors 0 (black, solid) and 1 (red, dashed). Right column: paths of hours worked for agents who can (orange, dashed) and cannot (blue, solid) observe the realisation of prices when choosing their action; all agents work for sector 0 and have θ = {0.76, 0.88, 1.0, 1.12, 1.24} from top to bottom.

3.4.1. Specification

The specification of the neural network and learning remains unchanged from the previous section. Calculation of the multiple histories is parallelised; we use 8 threads. For each simulation epoch E, 8(4 + E) histories are found, followed by 20 RL training epochs. In total, there are 12 simulation epochs and a total of ≈1000 histories and 240 RL training epochs, each of which samples from the most recent 50% of the histories. The number of histories and epochs is informed by the convergence properties from the previous section. A total of ≈100,000 experiences are recorded during the whole simulation, which takes ≈6 hours.
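The outer loop described here can be sketched as follows (ours, schematic; run_history and train_networks stand in for the history simulation and the RL update, which are not shown):

```python
from concurrent.futures import ThreadPoolExecutor

def train(run_history, train_networks, n_sim_epochs=12, threads=8):
    """Schematic outer loop: for simulation epoch E, generate 8*(4+E) histories in
    parallel, then run 20 RL training epochs sampling from the most recent 50%
    of all histories generated so far."""
    histories = []
    for E in range(1, n_sim_epochs + 1):
        n_new = 8 * (4 + E)
        with ThreadPoolExecutor(max_workers=threads) as pool:
            histories.extend(pool.map(lambda _: run_history(), range(n_new)))
        recent = histories[len(histories) // 2:]   # most recent 50% of the histories
        for _ in range(20):
            train_networks(recent)
    return histories
```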

4. Discussion and Conclusion

This work shows that reinforcement learning can be used to solve a wide range of important macroeconomic rational expectations models in a way that is simple, flexible, and computationally tractable. Furthermore, these methods can be immediately applied to previously intractable problems with multiple degrees of discrete heterogeneity and stochasticity. Being highly relevant to real-world phenomena, such as climate change and disease transmission, these capabilities are of great value to policymakers and can be developed into serious tools to aid decision making in complex scenarios.

Finally, by linking to reinforcement learning, this work provides the potential to apply its extensive toolkit of techniques, many of which have direct relevance to economic questions: examples include accessing larger state and action spaces (e.g. Lillicrap et al., 2015), including bounded rationality, or applying inverse reinforcement learning to deduce agents' objectives and rewards from observed micro- and macro-economic behaviours. Additionally, we can harness improvements in implementation such as GPU/TPU acceleration (Paszke et al., 2019) and distributed computing (Mnih et al., 2016).

Acknowledgements

We would like to thank Federico Di Pace for useful discussions.

References

Balduzzi, D. and Ghifary, M. Compatible value gradients for reinforcement learning of continuous deep policies. arXiv preprint arXiv:1509.03005, 2015.

Dütting, P., Feng, Z., Narasimhan, H., Parkes, D., and Ravindranath, S. S. Optimal auctions through deep learning. In International Conference on Machine Learning, pp. 1706–1715. PMLR, 2019.

Eichenbaum, M. S., Rebelo, S., and Trabandt, M. The macroeconomics of epidemics. Working Paper 26882, National Bureau of Economic Research, March 2020a.

Eichenbaum, M. S., Rebelo, S., and Trabandt, M. Epidemics in the neoclassical and New Keynesian models. Working Paper 27430, National Bureau of Economic Research, June 2020b.

Fedus, W., Ramachandran, P., Agarwal, R., Bengio, Y., Larochelle, H., Rowland, M., and Dabney, W. Revisiting fundamentals of experience replay. In International Conference on Machine Learning, pp. 3061–3071. PMLR, 2020.

Feng, Z., Narasimhan, H., and Parkes, D. C. Deep learning for revenue-optimal auctions with budgets. In Proceedings of the 17th International Conference on Autonomous Agents and Multiagent Systems, pp. 354–362, 2018.

Ferguson, N., Laydon, D., Nedjati-Gilani, G., Imai, N., Ainslie, K., Baguelin, M., Bhatia, S., Boonyasiri, A., Cucunubá, Z., Cuomo-Dannenburg, G., et al. Report 9: Impact of non-pharmaceutical interventions (NPIs) to reduce COVID-19 mortality and healthcare demand. Imperial College COVID-19 Response Team, 2020.

Haldane, A. G. and Turrell, A. E. Drawing on different disciplines: macroeconomic agent-based models. Journal of Evolutionary Economics, 29(1):39–66, 2019.

Heess, N., Wayne, G., Silver, D., Lillicrap, T., Tassa, Y., and Erez, T. Learning continuous control policies by stochastic value gradients. arXiv preprint arXiv:1510.09142, 2015.

Hoertel, N., Blachier, M., Blanco, C., Olfson, M., Massetti, M., Rico, M. S., Limosin, F., and Leleu, H. A stochastic agent-based model of the SARS-CoV-2 epidemic in France. Nature Medicine, 26(9):1417–1421, 2020.

Kaplan, G., Moll, B., and Violante, G. L. Monetary policy according to HANK. American Economic Review, 108(3):697–743, 2018.

Kaplan, G., Moll, B., and Violante, G. L. The great lockdown and the big stimulus: Tracing the pandemic possibility frontier for the U.S. Working Paper 27794, National Bureau of Economic Research, September 2020.

Kermack, W. O. and McKendrick, A. G. A contribution to the mathematical theory of epidemics. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 115(772):700–721, 1927.

Kerr, C. C., Stuart, R. M., Mistry, D., Abeysuriya, R. G., Hart, G., Rosenfeld, K., Selvaraj, P., Nunez, R. C., Hagedorn, B., George, L., et al. Covasim: an agent-based model of COVID-19 dynamics and interventions. medRxiv, 2020. URL https://doi.org/10.1101/2020.05.10.20097469.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kydland, F. E. and Prescott, E. C. Time to build and aggregate fluctuations. Econometrica: Journal of the Econometric Society, pp. 1345–1370, 1982.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

Lin, Q., Zhao, S., Gao, D., Lou, Y., Yang, S., Musa, S. S., Wang, M. H., Cai, Y., Wang, W., Yang, L., et al. A conceptual model for the outbreak of coronavirus disease 2019 (COVID-19) in Wuhan, China with individual reaction and governmental action. International Journal of Infectious Diseases, 2020.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937. PMLR, 2016.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. PyTorch: An imperative style, high-performance deep learning library. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc., 2019.

Schaul, T., Quan, J., Antonoglou, I., and Silver, D. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Smets, F. and Wouters, R. Shocks and frictions in US business cycles: A Bayesian DSGE approach. American Economic Review, 97(3):586–606, 2007.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, Cambridge, Massachusetts, 2018.

Tracy, M., Cerdá, M., and Keyes, K. M. Agent-based modeling in public health: Current applications and future directions. Annual Review of Public Health, 39(1):77–94, 2018.

Van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 2094–2100, 2016.

Zheng, S., Trott, A., Srinivasa, S., Naik, N., Gruesbeck, M., Parkes, D. C., and Socher, R. The AI Economist: Improving equality and productivity with AI-driven tax policies. arXiv preprint arXiv:2004.13332, 2020.

Appendix I: The Real Business Cycle model

We use a standard real business cycle model; however, we adopt notation and variables common in reinforcement learning, in particular an emphasis on state variables (capital) rather than action variables (consumption, hours worked), and the inclusion of an action-dependent expectation. A baseline RBC model would use Equations 9, 12, 13, 14, 15, 16, and 19. We use Equations 9, 10, 11, 14, 15, 16, and 19, but note that 10 and 12, and 11 and 13, are the same up to algebraic manipulation, expressing the future behaviour in terms of U(k) and U'(k), functions of the state, rather than in terms of consumption or other actions as is usually seen. We work in real quantities, using p_{i=0} = 1 as the numéraire, with other quantities, including p_{i≠0} ≠ 1, defined relative to this.

Notation

j is an index that runs over the J consumer-workers. i runs over the I consumption goods, each of which is produced by a different sector/firm.

For consumers, n_j is hours worked, c_{ji} is consumption, k_j is capital held by the consumer, v_j is their investment in capital in that timestep, and θ_j > 0 is the weight given to hours in the utility function. E_i denotes the set of agents employed at firm i, and e(j) is the index, i, of the employer of j.

For firms, N_i is the number of hours worked at the firm, K_i is its capital, A_i is the firm's technology, Y_i is the production function, and C_i is the consumption of the firm's goods.

r is the real interest rate, w_i are wages, and p_i are the prices of goods.

Consumers

For consumers, the time-t utility is

u_j({c_{ji}}, n_j; θ_j) = Σ_i ln c_{ji} − (1/2) θ_j n_j²

and their total utility from time t onward is

U_{j,t,S}(k_{t,j}) = max_{c_{ji}, n_j} { u_j(c_{ji}, n_j) + β Σ_{S'} P_{S→S'}(c_{ji}, n_j) U_{j,t+1,S'}(k_{t+1,j}) }    (8)

Their budget constraint is

w_{e(j)} n_j + r k_{j,t} = Σ_i p_i c_{ji} + v_j   ∀j    (9)

k_{j,t+1} = k_{j,t} + v_j

Let U'_j = ∂_{k_{t+1,j}} U(k_{t+1,j}); expectations are over a distribution of probabilities P_{S→S'}, so that E ∂_{c_{ji}} ln P_j = Σ_{S→S'} ∂_{c_{ji}} P_j. Consumers take prices (p_i, w_i and r) as given and use their first-order conditions

0 = c_{ji}^{−1} + β E[ U'_j ∂_{c_{ji}} ln P_j ] − β p_i E[ U'_j ]    (10)

0 = −θ_j n_j + w_{e(j)} β E[ U'_j ]    (11)

to find c_{ji} and n_j; this is solved iteratively since U is a function of k_{t+1} and thus of c_{ji} and n_j. In the reinforcement learning training we add an additional reward for individually achieving a no-Ponzi condition at the final time-step.

If the probabilities were independent of the action, ∂_{c_{ji}} P_j = 0, as would be the case in a standard RBC model, then that term can be removed and the E[ U'_j ] eliminated, obtaining

n_j = w_{e(j)} / (θ c_{ji} p_i)   ∀i    (12)

A small amount of work, with care taken as to the maximisation over a in the definition of the utility, shows that E[ U'_j ] = (1 + r_{t+1})(p_{i,t+1} c_{ji,t+1})^{−1}, and so Equation 10 reduces to the Euler equation

c_{ji}^{−1} = β E[ (p_i / p_{i,t+1}) (1 + r_{t+1}) c_{ji,t+1}^{−1} ]    (13)

where the p_i remains due to the multiple goods with p_{i≠0} ≠ 1.

Firms

Firms are profit maximising with production function

Y_i = A_i K_i^{1−α} N_i^{α}    (14)

Since the profit is Π_i = p_i Y_i − w_i N_i − (r + δ) K_i, then, taking prices (p_i, w_i, r) as given,

∂_{N_i} Π_i = 0:   p_i α Y_i − w_i N_i = 0   ∀i    (15)

∂_{K_i} Π_i = 0:   p_i (1 − α) Y_i − (r + δ) K_i = 0   ∀i    (16)

and therefore Π_i = 0.

K_{i,t+1} = K_{i,t}(1 − δ) + (Y_i − C_i)    (17)

We split K_i = K_e + K̂_i, where K_e is an endowment and K̂_i is provided by investment from the consumers.

K̂_{i,t+1} = K̂_{i,t}(1 − δ) + (Y_i − C_i) − (r + δ)K_e    (18)

The technology shock

The technology shock is log-autoregressive:

ln A_{it} = ρ_A ln A_{i,t−1} + ε_t,   ε_t ∼ N(0, σ_A)    (19)

Market clearing conditions

Wages are set by market clearing for hours worked,

N_i = Σ_{j∈E_i} g_j n_j   ∀i    (20)

and the real interest rate is set by market clearing for capital,

Σ_i K̂_i = Σ_j g_j k_j    (21)

where g_j is the weight of each agent, with Σ_j g_j = 1.
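As a final illustration (ours), the firm first-order conditions (15)–(16) can be rearranged to give the wage and rental rate consistent with a sector's labour, capital, and price; this is the relationship the market-clearing solver of §3.2.2 drives to consistency with the households' supplies. The depreciation rate value is an assumption:

```python
def firm_foc_prices(N_i, K_i, p_i, A_i=1.0, alpha=2/3, delta=0.1):
    """Wage and interest rate implied by eqs (15)-(16) for one sector:
    w_i = alpha * p_i * Y_i / N_i  and  r = (1 - alpha) * p_i * Y_i / K_i - delta."""
    Y_i = A_i * K_i ** (1 - alpha) * N_i ** alpha   # eq (14)
    w_i = alpha * p_i * Y_i / N_i
    r = (1 - alpha) * p_i * Y_i / K_i - delta
    return w_i, r
```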
