Numerical Methods
Lecture 5: Machine Learning Methods

Zachary R. Stangebye

University of Notre Dame

Spring 2022

Machine Learning

• Ill-defined but commonly-used term

• Connotes complex “thinking”-like operations done by a computer, e.g.,
  • Identifying/categorizing images
  • Predicting consumer behavior based on patterns and trends in the online sphere
  • Detecting spam in incoming e-mail messages
  • Web search engines
  • ...

• Rapid adoption in “big tech” and other private industries

• Gradual, recent adoption by academics/economists


Machine Learning: Two Kinds

1. Supervised: Maps input data to known output points

2. Unsupervised: No known outputs
  • Goal is to infer natural structures present in input data
  • Pattern recognition, clustering, etc.

• We’ll focus on the former since it’s more useful for economists of all stripes

Supervised Machine Learning: What is it really?

• Approximate solution to a functional equation

• Not so different from what we’ve been doing all semester

• Difference: Bigger scope
  • Large domain/input space (hundreds/thousands of dimensions)
  • Capable of extreme non-linearities/high-order interactions

• Example: Image recognition
  • Each pixel is a dimension/input
  • No restrictions on which inputs matter or how they interact

Machine Learning: Image Recognition Example

• “Is this a picture of Sylvester Stallone? Yes or No?”

• ML finds a function f : Ω → {0, 1}
  • Ω is the high-dimensional pixel space

Key Innovation

• Machine learning is thus an umbrella term for a collection of functional-approximation algorithms that break the curse of dimensionality

• Estimating/parameterizing f is often referred to as “training”

• Several algorithms fall under this category
  1. Neural networks
  2. Random forest models
  3. Gaussian process regressions
  4. ...

• In each case, solution time scales roughly linearly with the number of dimensions
  • Why?... Nobody really understands completely (yet)
  • Likely why academics got on the train more slowly

In Economics...

• Economists finally starting to adopt these methods
  • Actually not difficult to implement
  • Just esoteric to understand
  • Potential absolutely tremendous

1. Bagherpour (2017): Predict mortgage loan defaults better than existing methods
2. Albanesi and Vamossy (2019): Predict consumer default better than FICO scores with the same data
3. Fernandez-Villaverde et al. (2019): Solve a heterogeneous-agents model with no “shortcuts”
4. ...

For us...

• Approximate value/policy functions

• Follow Scheidegger and Bilionis (2017), who suggest a combination of
  • Gaussian Process Regression (GPR)
  • Active Subspace (AS)
  • Value function iteration (VFI)

• Can easily globally solve a DSGE model of 500 dimensions with significant non-linearities
  • Closest competitor is the Smolyak projection algorithm (maxes out around 20 dimensions when pushed really hard)
  • Has other benefits over Smolyak, e.g., for some non-linear models it often works where Smolyak would fail

Scheidegger and Bilionis (2017)

• Basic idea
  1. Use Gaussian Process Regression (GPR) and (possibly) Active Subspace (AS) to approximate the value function
  2. Update approximations with a VFI until convergence

• Hybrid dynamic-programming/projection method
  • The VFI part is easy (we know it already)
  • We’ll need to carefully break down the GPR and AS parts

Big-Picture Idea

• In search of a function V : R^D → R

• Rather than treat V as a deterministic function, form beliefs over the set of possible f ’s and update with Bayes’ rule
  1. Start with prior beliefs over V
  2. Update the prior on V using the Bellman equation + Bayes’ rule
  3. The posterior is the updated guess in a VFI setting
  4. Restart the next iteration without updating the prior

• The Bellman equation itself does all the updating we’ll need

Gaussian Process: Interpolation Strategy

(Figure: GP posterior interpolating a handful of observed points)
• Shaded area = 2× standard deviation
• Black lines: draws from the GP
• Zero posterior volatility at observed points

Gaussian Process Regression I

• Assume we wish to evaluate the function at N input points (“grid points” or “training points”): X = {x_1, . . . , x_N} ⊂ R^D

• Before evaluating equilibrium conditions, encode our knowledge about f(·) by assigning a Gaussian Process (GP) prior

• Let θ be the hyper-parameters of the model...

$$ f \mid X, \theta \sim \mathcal{N}(f \mid m, K) $$

where m ∈ R^N and K ∈ R^{N×N} is positive definite


Gaussian Process Regression II

• Mean, m, given by a function

$$ m = m(X; \theta) = \begin{pmatrix} m(x_1; \theta) \\ \vdots \\ m(x_N; \theta) \end{pmatrix} $$

• Similarly, covariance given by a function: K = K(X, X; θ)

• Comes from the more general cross-covariance matrix

$$ K(X, \hat{X}; \theta) = \begin{pmatrix} k(x_1, \hat{x}_1; \theta) & \cdots & k(x_1, \hat{x}_{\hat{N}}; \theta) \\ \vdots & \ddots & \vdots \\ k(x_N, \hat{x}_1; \theta) & \cdots & k(x_N, \hat{x}_{\hat{N}}; \theta) \end{pmatrix} \in \mathbb{R}^{N \times \hat{N}} $$

for an arbitrary set of N̂ inputs X̂ = {x̂_1, . . . , x̂_N̂} ⊂ R^D


Gaussian Process Regression III

• Parameterize m(·; θ) with our initial pointwise guess

• Parameterize k(·, ·; θ) with a kernel function

• Most common is the squared exponential (SE) covariance function

$$ k_{SE}(x, x'; \theta) = s^2 \exp\left( -\frac{1}{2} \sum_{i=1}^{D} \frac{(x_i - x_i')^2}{l_i^2} \right) $$

• Implies θ = {s, l_1, . . . , l_D}
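
To fix ideas, here is a minimal hand-coded sketch of the SE kernel and the covariance matrix K(X, X̂; θ) in Julia (the function names and toy inputs are illustrative, not from the slides or from GaussianProcesses.jl):

```julia
using LinearAlgebra

# Squared-exponential kernel: s is the signal strength, l a vector of lengthscales
k_se(x, xp, s, l) = s^2 * exp(-0.5 * sum(((x .- xp) ./ l) .^ 2))

# Cross-covariance matrix K(X, X̂; θ) for inputs stored as columns of X (D × N)
# and Xhat (D × N̂)
function cov_matrix(X, Xhat, s, l)
    N, Nhat = size(X, 2), size(Xhat, 2)
    return [k_se(X[:, i], Xhat[:, j], s, l) for i in 1:N, j in 1:Nhat]
end

# Tiny usage example with illustrative values
D, N = 2, 5
X = rand(D, N)
K = cov_matrix(X, X, 1.0, fill(0.5, D))   # N × N prior covariance, diagonal = s²
```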

Gaussian Process Regression IV

$$ k_{SE}(x, x'; \theta) = s^2 \exp\left( -\frac{1}{2} \sum_{i=1}^{D} \frac{(x_i - x_i')^2}{l_i^2} \right) $$

• For the SE covariance function...
  • s > 0 is called the signal strength
  • l_i > 0 is called the lengthscale of the i-th input/dimension

• Notice
  • K has diagonal entries s²
  • l_i governs the contribution of input/dimension i to the correlation across points
  • As x → x′, the correlation approaches 1. Implies continuously differentiable functions

GPR Measurement

• Evaluate the equilibrium conditions at our guess/prior

$$ t_i = f(x_i), \quad \text{for } i = 1, \ldots, N $$

• Helps the approximation to assume that we observe the output with some i.i.d. measurement error with noise variance s_n²

• Gives rise to t

$$ t_i \mid f(x_i), s_n \sim \mathcal{N}(t_i \mid f(x_i), s_n^2) $$

where s_n² is another hyperparameter (we’ll keep it small)

• Independence in measurement noise across observations implies

$$ \text{Likelihood:} \quad t \mid X, \theta, s_n \sim \mathcal{N}(t \mid m, K + s_n^2 I_N) $$


Choosing the Hyperparameters

• Done by simple maximum (log) likelihood (MLE)

$$ \theta^*, s_n^* = \arg\max_{\theta, s_n} \; \log p(t \mid X, \theta, s_n) $$

• Solve with a simple gradient-based method (BFGS) from multiple starting points for robustness
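
For the Gaussian likelihood above, the objective has a closed form. A sketch of how it might be coded (plain Julia, reusing the hypothetical k_se/cov_matrix helpers from the earlier sketch):

```julia
using LinearAlgebra

# Log marginal likelihood log p(t | X, θ, s_n) for the GP with prior mean m at X
function log_marginal_likelihood(t, m, X, s, l, sn)
    N  = length(t)
    Ky = cov_matrix(X, X, s, l) + sn^2 * I     # K + s_n² I_N
    C  = cholesky(Symmetric(Ky))               # stable solve and log-determinant
    r  = t .- m
    return -0.5 * dot(r, C \ r) - 0.5 * logdet(C) - 0.5 * N * log(2π)
end

# In practice this is maximized over (s, l₁, …, l_D, s_n) with BFGS (e.g. via Optim.jl)
# from several starting points, typically optimizing log-parameters to keep them positive.
```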

GPR Posterior

• Combine the prior GP with the likelihood to get the posterior GP

$$ f(\cdot) \mid X, t, \theta^*, s_n^* \sim \mathcal{N}\big( f(\cdot) \mid \tilde{m}(\cdot), \tilde{k}(\cdot, \cdot) \big) $$

where

$$ \tilde{m}(x) = m(x; \theta^*) + \underbrace{K(x, X; \theta^*)}_{1 \times N} \, \underbrace{\big[K + (s_n^*)^2 I_N\big]^{-1}}_{N \times N} \, \underbrace{(t - m)}_{N \times 1} $$

and

$$ \tilde{k}(x, x') = k(x, x'; \theta^*) - \underbrace{K(x, X; \theta^*)}_{1 \times N} \, \underbrace{\big[K + (s_n^*)^2 I_N\big]^{-1}}_{N \times N} \, \underbrace{K(X, x; \theta^*)}_{N \times 1} $$
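
A from-scratch sketch of evaluating the posterior mean and variance at a new point, again reusing the hypothetical helpers above (an illustration of the formulas, not the GaussianProcesses.jl interface):

```julia
using LinearAlgebra

# Posterior mean m̃(x*) and variance k̃(x*, x*) at a new point xstar (length-D vector).
# X is D × N training inputs, t the targets, m the prior mean at X, m0 a prior mean
# function, and (s, l, sn) the fitted hyperparameters; k_se/cov_matrix as before.
function gp_posterior(xstar, X, t, m, m0, s, l, sn)
    Ky = cov_matrix(X, X, s, l) + sn^2 * I               # K + (s_n*)² I_N
    kx = vec(cov_matrix(reshape(xstar, :, 1), X, s, l))  # the 1 × N slice K(x*, X)
    μ  = m0(xstar) + dot(kx, Ky \ (t .- m))              # posterior mean
    σ2 = k_se(xstar, xstar, s, l) - dot(kx, Ky \ kx)     # posterior variance
    return μ, σ2
end
```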

Relation to Projection Methods

• Posterior mean ⇐⇒ linear combination of SE basis functions

$$ \tilde{m}(x) = m(x; \theta^*) + \sum_{i=1}^{N} a_i \, k(x, x_i; \theta^*), \qquad a = \big[K + (s_n^*)^2 I_N\big]^{-1} (t - m) $$

• Typically m(x; θ*) = 0, but it works for any prior


General Description: DSGE Model

• Dynamic, stochastic, discrete-time, infinite-horizon economy with D dimensions
  • Exogenous/endogenous states x ∈ R^D
  • Control variables, c, chosen from C(x)
  • Law of motion: x′ ∼ F(·|x, c)

• Bellman equation

$$ TV(x) = \max_{c \in C(x)} \; \big[ u(c, x) + \beta \, \mathbb{E}_{F|x,c} V(\tilde{x}') \big] $$

• Equilibrium given by V = TV
• Optimal policy function: p(x) ∈ C(x)

Full Algorithm: DSGE Model

• Start from an initial guess V⁰. Then at each iteration step s...

1. Generate (randomly) a relatively small training set (grid points) of size n_s, x^s_{1:n_s}. Evaluate the Bellman operator to get

$$ t_i^s = TV^{s-1}(x_i^s) \quad \text{for } i \in 1, \ldots, n_s $$

2. Apply GPR to “learn” the update, V_surrogate, which is the posterior mean. Set V^s = V_surrogate

3. Repeat until V^s converges to V^{s-1} at 10,000 random, non-training-input points

A schematic of this loop is sketched below.
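
A high-level sketch of the loop in Julia. The helpers draw_training_inputs, bellman_update, and fit_gp are hypothetical placeholders for the model-specific Bellman evaluation and the GPR fit described above, and the domain is assumed scaled so that states live in [0, 1]^D:

```julia
# Schematic GPR-VFI loop (the three helper functions are hypothetical placeholders)
function gpr_vfi(D; ns = 10 * D, tol = 1e-6, maxiter = 500)
    Vs = x -> 0.0                                     # initial guess V⁰ (zero function)
    for s in 1:maxiter
        Xs = draw_training_inputs(ns, D)              # D × ns matrix of grid points
        ts = [bellman_update(Vs, Xs[:, i]) for i in 1:ns]   # tᵢˢ = T V^{s-1}(xᵢˢ)
        Vnew = fit_gp(Xs, ts)                         # surrogate Vˢ = GP posterior mean
        Xtest = rand(D, 10_000)                       # random non-training points
        err = maximum(abs(Vnew(Xtest[:, i]) - Vs(Xtest[:, i])) for i in 1:size(Xtest, 2))
        Vs = Vnew
        err < tol && return Vs                        # converged: Vˢ ≈ V^{s-1}
    end
    return Vs
end
```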

Full Algorithm: Notes

• Note we “reset” the prior variance and mean for each iteration
  • Early priors don’t filter down through later iterations
  • Not a true Bayesian approach, but much more efficient

• Once done (or in each iteration), “learn” an approximation for the policy function too

• While GPR is easy to code yourself, I recommend the GaussianProcesses.jl package
  • Treats Gaussian Processes as their own class of variables
  • Much faster built-in training
  • Less code development time/room for errors


Training Inputs

• In each iteration, select (5 to 10) × D training inputs (grid points)
  • This is where the curse of dimensionality blatantly disappears!

• Typically drawn uniformly from [x, x̄]^D for well-behaved problems
  • Not always efficient for complex/irregular state spaces
  • Here, ergodic distributions are more concentrated


Examples

1. Simple consumption-saving model with uncertainty
  • Endowment economy (y) follows an AR(1)

$$ V(y, b) = \max_{b'} \; u\!\left( y - b + \frac{b'}{1+r} \right) + \beta \, \mathbb{E}_{\tilde{y}'|y} \big[ V(\tilde{y}', b') \big] $$

2. Production economy with adjustment costs (see previous slides)

A sketch of the Bellman evaluation for the first example appears below.
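
A concrete sketch of what the hypothetical bellman_update from the earlier loop might look like for the consumption-saving example. The CRRA utility, the discretized income nodes/weights, and all parameter values are illustrative assumptions; a proper implementation would condition the income quadrature/Tauchen nodes on the current y given the AR(1):

```julia
u(c; γ = 2.0) = c > 0 ? c^(1 - γ) / (1 - γ) : -1e10   # CRRA, large penalty if c ≤ 0

# One Bellman evaluation T V(y, b) at a single training point, maximizing over b′
# by brute-force grid search. Vguess(y, b) is the current surrogate (e.g., GP mean).
function bellman_update(Vguess, y, b; β = 0.95, r = 0.02,
                        ynodes = [0.9, 1.0, 1.1], ywgts = [0.25, 0.5, 0.25],
                        bgrid = range(-1.0, 1.0, length = 201))
    best = -Inf
    for b′ in bgrid
        c  = y - b + b′ / (1 + r)
        EV = sum(ywgts[k] * Vguess(ynodes[k], b′) for k in eachindex(ynodes))
        best = max(best, u(c) + β * EV)
    end
    return best
end
```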

Ergodic Training Inputs

• If the state space is irregular, then begin with uniform draws in early iterations, but in later iterations...

1. Simulate the (non-equilibrium) law of motion for n periods to derive an estimate of the ergodic set

$$ X_{ergodic} = \{ x_i : 1 \le i \le n \} $$

2. Fit a histogram/Gaussian mixture model to the ergodic distribution. Call this density ρ_estimated(x)

3. Draw N training inputs from ρ_estimated(x) rather than [x, x̄]^D

• Gets it right “where it counts” (a minimal sampling sketch follows)
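
One minimal way to implement this is to resample directly from the simulated cloud of states rather than fitting a full mixture; simulate_step is a hypothetical placeholder for the model's (non-equilibrium) law of motion:

```julia
# Simulate the law of motion and draw N training inputs from the resulting cloud,
# which stands in for draws from the estimated ergodic density ρ_estimated(x).
function ergodic_training_inputs(x0, N; nsim = 100_000, burnin = 1_000)
    x = copy(x0)
    cloud = Vector{typeof(x0)}()
    for t in 1:nsim
        x = simulate_step(x)                      # hypothetical model-specific step
        t > burnin && push!(cloud, copy(x))
    end
    return [cloud[rand(1:length(cloud))] for _ in 1:N]
end
```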


Ergodic Training Inputs: My Thoughts/Experience

• Really crucial to get reliable convergence

• Two useful tips
  1. After “enough” iterations drawing from the ergodic distribution, fix the training inputs for later iterations
     • Continue to check convergence with updated ergodic distributions
  2. Draw training inputs from the “typical set” rather than the straight ergodic distribution
     • Ensures we’re getting a ‘disperse’ or ‘representative’ draw
     • Fills the relevant space more efficiently


Typical Sets

• Information-theoretic concept

• Given N i.i.d. draws from a distribution, what does a ‘typical’ sequence look like?
  • Distribution of realized draws somehow matches the underlying distribution
  • Not the most likely sequence, e.g., repeated mode

• Defined for a level of ‘typicality’ 1 − ε
  • Typical set A_N(ε) ⊂ X^N for domain X


Atypical Set: Example

(Figure: a density on [−4, 4] illustrating an atypical set of draws)

Typical Set: Example

(Figure: the same density on [−4, 4], now illustrating a typical set of draws)

Typical Set: Formal Definition

• {x_1, x_2, . . . , x_N} ∈ A_N(ε) iff

$$ \left| -\frac{1}{N} \sum_{i=1}^{N} \log p(x_i) - H(\tilde{x}) \right| < \varepsilon $$

where H(x̃) is the entropy of the random variable x̃, i.e., if the domain of the random variable is X, then

$$ H(\tilde{x}) = -\sum_{x \in X} p(x) \log p(x) $$
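
A small sketch of this check for a candidate batch of training inputs, assuming we can evaluate log p(x) and have a (Monte Carlo) estimate of the entropy H; the names are illustrative:

```julia
using Statistics

# Is the sample {x₁, …, x_N} ε-typical?  logp(x) returns log p(x); H is the entropy.
function is_typical(sample, logp, H; ε = 0.05)
    avg_surprisal = -mean(logp.(sample))      # −(1/N) Σᵢ log p(xᵢ)
    return abs(avg_surprisal - H) < ε
end

# H can itself be estimated as -mean(logp.(big_sample)) from a long simulation.
# To draw training inputs from the typical set, repeatedly draw candidate batches
# from the (estimated) ergodic density and keep the first batch that passes is_typical.
```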

Irregular State Space Example: Arellano (2008)

• Sovereign default model with short-term bonds (full description in earlier lectures)

$$ V_R(y, b) = \max_{b'} \; u\big( y - b + q(y, b')\, b' \big) + \beta \, \mathbb{E}\big[ \max\{ V_R(\tilde{y}', b'), V_D(\tilde{y}') \} \big] $$

$$ V_D(y) = u(y_{def}(y)) + \beta \, \mathbb{E}\big[ \pi V_R(\tilde{y}', 0) + (1 - \pi) V_D(\tilde{y}') \big] $$

$$ q(y, b') = \frac{1}{1+r} \, \mathbb{E}\big[ \mathbf{1}\{ V_R(\tilde{y}', b') \ge V_D(\tilde{y}') \} \big] $$

Irregular State Space Example: Arellano (2008)

(Figure: state-space plot for the Arellano (2008) example; horizontal axis roughly 0–0.3, vertical axis roughly 0.8–1.2)

Really Big State Spaces

• GPR loses accuracy for LARGE dimensionality (D >> 20 or so)
  • Happens because it relies on Euclidean distance to define input-space correlations (less informative as D grows)

• In these cases, rely on active subspaces
  • Translate the action from the large state space to a lower-dimensional one that captures its essential features

• Assume the function can be well-approximated by

$$ f(x) \approx h(W^T x) $$

for some W ∈ R^{D×d} that projects the high-dimensional input space into a lower-dimensional active subspace, R^d
  • h : R^d → R is called the link function
  • Repeat the whole previous algorithm on h

Choosing W

1. Define a D × D matrix

$$ C = \int (\nabla f(x)) (\nabla f(x))^T \rho(x) \, dx $$

where ρ is uniform over the state space

2. Since C is positive definite, decompose

$$ C = V \Lambda V^T $$

where Λ is a diagonal matrix with the eigenvalues of C in decreasing order and V is an orthonormal matrix of corresponding eigenvectors

3. W = V_1, the eigenvectors corresponding to the d largest eigenvalues

In practice

• Set d to something a bit less than 20, e.g., 15

• Cannot construct C analytically. Instead (a code sketch follows this slide)
  1. Draw a large number N of points uniformly from the state space
  2. Estimate the numerical gradient g^i at each point
  3. Define

$$ C_N = \frac{1}{N} \sum_{i=1}^{N} g^i (g^i)^T $$

  4. Perform a singular value decomposition (SVD) on C_N to get Λ and V
  5. Back out W from V as before

• Using this, Scheidegger and Bilionis (2017) accurately and globally solve a DSGE model with 500 continuous states! By far the most powerful tool in your toolbox
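
A compact sketch of the Monte Carlo construction of W in plain Julia; grad_f is a hypothetical placeholder for a numerical-gradient routine (e.g., finite differences) for the function being approximated:

```julia
using LinearAlgebra

# Active-subspace construction: returns the D × d projection matrix W.
# lo and hi are the bounds of the rectangular state space.
function active_subspace(grad_f, lo, hi, d; N = 10_000)
    D = length(lo)
    C = zeros(D, D)
    for _ in 1:N
        x = lo .+ rand(D) .* (hi .- lo)   # uniform draw from the state space
        g = grad_f(x)
        C .+= g * g'                       # accumulate ∇f(x) ∇f(x)ᵀ
    end
    C ./= N                                # C_N = (1/N) Σᵢ gⁱ (gⁱ)ᵀ
    F = svd(C)                             # SVD of the (symmetric) matrix C_N
    return F.U[:, 1:d]                     # eigenvectors of the d largest eigenvalues
end

# The link function h is then fit by GPR, as before, on the reduced inputs Wᵀx.
```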

In practice

• Most interesting structural models have dimensionality significantly less than 20
  • Might not be true of more empirically oriented frameworks

• But you could use this in the following way. If you want to choose parameters to match moments/calibrate:
  1. Add each parameter of interest as a dimension with a deterministic, constant law of motion
  2. Solve the model once
  3. Select parameters as points in the state space. Model dynamics respond immediately
  4. Can calibrate dozens of parameters while solving the model only once

Alternate Algorithms

• Many other machine learning algorithms

• Tend to share the feature that they kill the curse of dimensionality

• Most popular alternative in economics: Neural Networks
  • Approximate the function with layers of simple, non-linear functions (“neurons/nodes”)

Neural Networks Visually

(Figure: feed-forward network diagram with input layer, hidden layer(s), and output node)


Neural Networks in Broad Terms (I)

• A node/neuron is often called a “perceptron” and is a simple (basis) function referred to as an “activation function,” e.g.,

$$ \text{(ReLU)} \quad \phi(x) = \begin{cases} 0, & x \le 0 \\ x, & x > 0 \end{cases} $$

$$ \text{(Sigmoid)} \quad \phi(x) = \frac{1}{1 + e^{-x}} $$

$$ \text{(ArcTan)} \quad \phi(x) = \tan^{-1}(x) $$

...
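
These are one-liners in Julia (a generic sketch; the names are not tied to any package):

```julia
relu(x)       = max(zero(x), x)      # φ(x) = 0 for x ≤ 0, x for x > 0
sigmoid(x)    = 1 / (1 + exp(-x))    # φ(x) = 1 / (1 + e^{-x})
arctan_act(x) = atan(x)              # φ(x) = tan⁻¹(x)
```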

Neural Networks in Broad Terms (II)

• The impulse of a perceptron in a layer l is described by a set of coefficients, θ_l, for that layer

• Denote the number of perceptrons in a layer by N_l

• Every perceptron i in layer l has weights θ^W_{l,i} ∈ R^{N_{l−1}}, a constant bias θ^b_{l,i}, and functional form for its output

$$ o_{l,i}(x) = \phi\left( \theta^b_{l,i} + \sum_{j=1}^{N_{l-1}} \theta^W_{l,i,j} \, o_{l-1,j}(x) \right) $$
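
In matrix-vector form a whole layer can be written in a couple of lines; a from-scratch sketch (names are illustrative):

```julia
# Output of layer l: o_l = φ.(b + W * o_prev), with W of size N_l × N_{l-1}
# and b a length-N_l bias vector; φ is applied elementwise.
layer_output(ϕ, W, b, o_prev) = ϕ.(b .+ W * o_prev)

# Stacking M such layers (a vector of (W, b) tuples) gives the full network,
# with a possibly different activation ϕ_out in the final layer.
function mlp(x, layers, ϕ, ϕ_out)
    o = x
    for (W, b) in layers[1:end-1]
        o = layer_output(ϕ, W, b, o)
    end
    W, b = layers[end]
    return layer_output(ϕ_out, W, b, o)    # final layer, e.g. size 1 for a function
end
```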

Neural Networks in Broad Terms (III)

• “Training” the neural network is the process of estimating {θ_l}, l = 1, . . . , M, for a network of M layers
  • Typically done with a gradient-based algorithm

• Similar in spirit to projection methods
  • Use a non-linear, iterated combination of simple activation functions rather than a simple linear combination of orthogonal basis functions
  • Solved in a very similar way (black-box solver over coefficients)

• Much more difficult to demonstrate results theoretically/derive confidence metrics, but tends to work really well in practice!

Neural Networks in a VFI

• Based on Fernandez-Villaverde et al. (2020), which I’ll call FHN, and my own experience

• Use neural networks to approximate equilibrium objects (value/policy/pricing functions, etc.)

• Simplest working neural network is an MLP (multi-layer perceptron): at least one hidden layer
  1. Initial layer always the same size as the dimension (D)
  2. Middle layer size (N) a choice (FHN set N = 16)
     • Can increase to deal with more non-linearities
  3. Final layer size = 1 (since it’s a function)

• FHN find that a single-layer MLP appears sufficient even for complex problems in structural economics

• Same issues with the choice of training inputs as GPR

• Less sensitive to changing parameter choices than GPR, since neurons do not correspond to training inputs
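
A minimal sketch of such an MLP in Flux.jl; the state dimension and activation choices here are illustrative (tanh hidden layer, sigmoid output for a problem whose range is scaled into [0, 1]):

```julia
using Flux

D = 4                                  # number of state dimensions (illustrative)
# Input of size D, one hidden layer of 16 neurons, scalar output
model = Chain(
    Dense(D => 16, tanh),
    Dense(16 => 1, sigmoid),
)

Vhat(x) = model(Float32.(x))[1]        # surrogate value function at a state vector x
```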

Neural Networks in a VFI II

• Choice of activation function in the final layer depends on the output
  • Bounded problem: sigmoid or tanh
  • Unbounded problem: softmax or ReLU

• I have found that the approximation works best when the domain is scaled to [0, 1]^D and the range is scaled into [0, 1]

Training Neural Networks

• Most routines are written for big data
  1. Start with a parameter guess: θ_i
  2. Randomly sample the domain of training inputs: a “batch”
  3. Update to θ_{i+1} with a gradient-descent step (not a Newton step) with a fixed constant
  4. Repeat with a new batch from the current θ_{i+1}: an “epoch”

• Our data requirements are much smaller and our need for accuracy in iterations is higher. Better to
  1. Use the entire training set as one “batch”
  2. Use a better optimizer than gradient descent, i.e., BFGS, Nelder-Mead, etc.
     • I recommend using the FluxOptTools package, as such algorithms are not built into Flux
