
Mathematical Foundations of Regression Methods

for Approximating the Forward Initial Margin


arXiv:2002.04563v1 [q-fin.RM] 11 Feb 2020

Lucia Cipolina-Kun¹
¹ University of Bristol, School of Computer Science, Electrical and Electronic Engineering, and Engineering Mathematics

February 12, 2020

Abstract

In recent years, forward initial margin has attracted the attention of practitioners and academics. From a computational point of view, it poses a challenging problem, as it requires the implementation of a nested Monte Carlo simulation. Consequently, abundant literature has been published on approximation methods aiming to reduce the dimensionality of the problem, the most popular of which are the family of regression methods.
This paper describes the mathematical foundations on which these regression approximation methods rest. We introduce mathematical rigor to show that, in essence, all the methods propose variations of approximations for the conditional expectation function, which is interpreted as an orthogonal projection on Hilbert spaces. We show that each method simply chooses a different functional form to numerically estimate the conditional expectation.
We cover in particular the most popular methods in the literature so far: polynomial approximation, kernel regressions and neural networks.

1 Introduction
Initial margin (IM) has become a topic of high relevance for the financial industry in recent years. The relevance stems from its wide implications for the way business will be conducted in the financial industry and, in addition, from the need for approximation methods to calculate it.
The exchange of initial margin is not new in the financial industry; however, after regulations passed in 2016 by Financial Regulatory Authorities [5], a wider set of financial institutions are now required to post significant amounts of initial margin to each other.
(The authors would like to thank Eusebio Gardella for his contribution to the final manuscript.)
The initial margin amount is calculated at the netting set level following a set of rules depending on whether the trade has been done bilaterally [13] or facing a Clearing House [7]. The calculation is performed on a daily basis and carried out throughout the life of the trade.
Since each netting set consumes initial margin throughout its entire life, financial institutions need resources to fund this amount in the future, given that the margin received from the counterparty cannot be rehypothecated. To estimate the total need for funds, each counterparty needs to forecast the initial margin amount up to the maturity of the netting set.
Forecasting the total forward initial margin amount is a computationally challenging problem. An exact initial margin calculation requires a full implementation of a nested Monte Carlo simulation, which is computationally onerous given the live demand for IM numbers to conduct business on a daily basis.

1.1 Literature Review

As a response to this computational challenge, practitioners and academics have proposed several approximation methods to reduce the dimensionality of the problem, at the cost of losing a tolerable degree of accuracy. The most popular of these methods have been the regression methods, since they offer simplicity and reusability of the Bank's legacy Monte Carlo engines, paired with good results. Inspired by the early work of Longstaff-Schwartz [16], several banks have implemented some version of the polynomial regression proposed by [1], [8] and [10], or the kernel regressions proposed by [2], [10] and [11]. More recently, estimation by neural networks has been proposed by [12] and [18].
The above-mentioned methods proposed by the industry to tackle the initial margin problem have not been developed exclusively for this particular issue. Considered in a broad sense, these are well-known regression methodologies used for a long time in finance and engineering. As such, the current state of the literature focuses on the implementation and numerical validation of the regression methods from a practical point of view. Their aim is to show, without going deep into mathematical formalities, that the proposed approximations perform well under a certain degree of confidence.
Our paper fills the gap between theory and practice. We present the underlying mathematical context required to analyze the initial margin computation problem from a theoretical probability standpoint.

1.2 Organization
The rest of this paper is organized as follows. The first section below translates the practitioner's definition of initial margin into a formal probability setting. The following section develops the assumptions needed to fit the initial margin problem into a classical problem of orthogonal projections onto Hilbert spaces. The last section concludes by linking the formal theory presented back to the practitioner's suggested approaches.

2 Description of the Initial Margin Problem


2.1 Definition of Forward Initial Margin
Following the definition in [6], the aim for a given Bank is to calculate the total cost associated with posting initial margin to a counterparty over the life of a netting set. As this quantity is calculated forward in time, we need to consider the relevant discounting and funding rates in the equation. The total cost is referred to as the Margin Valuation Adjustment, defined below.

"Z #
T RT
(r(s)+λB (s)+λC (s)ds)
M V At = E ((1 − R)λ(u) − SI (u))e t IM (u)du | Fu
t
(1)
where R is the recovery rate of the counterparty, λ(u) is the Bank's effective financing rate (i.e. borrowing spread), S_I is the spread received on initial margin, r(s) is the risk-free rate, λ_B(s), λ_C(s) are the survival probabilities of the Bank and its counterparty, and IM(u) is the initial margin held by the Bank at time u.
To simplify the above equation, one can consider the spread rates and survival spreads to be deterministic; the expression then simplifies to:

MVA_t = ∫_t^T ((1 − R)λ(u) − S_I(u)) e^{−∫_t^u (r(s)+λ_B(s)+λ_C(s)) ds} E[IM(u) | F_u] du   (2)
Here our quantity of interest is E[IM_t | F_t], the expectation of the initial margin at any point in time t, taken across all random paths ω. Since we are in a stochastic diffusion setting, this expectation depends on (i.e. is conditional on) the particular realization of the risk factors at that time t (i.e. F_t).
We are given a probability space (Ω, F, P), where Ω is the set of all possible sample paths ω, F_t is the sigma-algebra of events at time t (i.e. the natural filtration to which the risk factors are adapted) and P is the probability measure defined on the elements of F_t. The task is to calculate an amount IM(ω, t) for every simulation path ω at any given point in time t. This amount is defined as¹

IM(ω, t) = Q_99(∆(ω, t, δ_IM) | F_t)   (3)


1 This is the simplest definition of initial margin; for alternative definitions see [3].

The 99% confidence level is a BCBS-IOSCO requirement, and ∆(ω, t, δ_IM) is defined (in its simplest form)² as the change in the netting set's value (i.e. PnL) over a time interval (t, t + δ_IM], given that the counterparty has defaulted.

∆(ω, t, δ_IM) = V(ω, t + δ_IM) − V(ω, t)   (4)

In the following sections we describe the computational methods developed in


the current literature to estimate this quantity.

2.2 The Brute Force Approach to Calculate Initial Margin


The most exact method to calculate the initial margin requires a brute-force implementation of a nested Monte Carlo simulation. The figure below exemplifies the situation. Take, for example, a given point in time t_1: we have an outer Monte Carlo simulation for the values of V(ω, t) and an inner simulation of the forward probability density function of V(ω, t + δ_IM) | F_t. The difference in the value of the netting set between these two time steps is ∆(ω, t, δ_IM); given that we have a PDF of V(ω, t + δ_IM) | F_t, we also have a PDF of ∆(ω, t, δ_IM), from which one takes the Q_99 to obtain the initial margin.

Figure 1: A simulation tree with two branches and two time steps

2 This is the simplest definition of netting set PnL; for further definitions see [3].

The initial margin at every time step t is the 99th percentile of the inner PDF scenarios. Both processes (inner and outer) are usually simulated using the same stochastic differential equation with a sufficiently large number of scenarios to ensure convergence. The number of scenarios becomes particularly important for the inner simulation: since the Q_99 is a tail statistic, the number of scenarios in the tail must be sufficiently large.
As in every nested Monte Carlo problem, the brute-force calculation of initial margin is computationally onerous, given that the number of operations increases exponentially. This type of complexity cannot be afforded in real time, when the IM numbers are needed for live trading; hence the development of approximation methods.
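To make the nested simulation concrete, the sketch below implements the brute-force procedure for a single risk factor. The driftless lognormal dynamics, the parameter values and the function name brute_force_im are illustrative assumptions for this sketch, not taken from any of the cited implementations.

```python
import numpy as np

def brute_force_im(n_outer=200, n_inner=1000, n_steps=12, dt=1/12,
                   delta_im=10/252, s0=100.0, sigma=0.2, seed=42):
    """Brute-force nested Monte Carlo estimate of IM(omega, t).

    Assumes the netting set value V follows a driftless lognormal SDE and
    computes IM as the 99th percentile of the PnL over the margin period of
    risk delta_im (equations 3 and 4). Returns an (n_outer, n_steps) array.
    """
    rng = np.random.default_rng(seed)
    im = np.zeros((n_outer, n_steps))
    # Outer simulation of V(omega, t) on the coarse time grid.
    z = rng.standard_normal((n_outer, n_steps))
    v = s0 * np.exp(np.cumsum(-0.5 * sigma**2 * dt + sigma * np.sqrt(dt) * z, axis=1))
    for t in range(n_steps):
        for i in range(n_outer):
            # Inner simulation of V(omega, t + delta_im) conditional on F_t.
            z_in = rng.standard_normal(n_inner)
            v_fwd = v[i, t] * np.exp(-0.5 * sigma**2 * delta_im
                                     + sigma * np.sqrt(delta_im) * z_in)
            pnl = v_fwd - v[i, t]              # Delta(omega, t, delta_im)
            im[i, t] = np.quantile(pnl, 0.99)  # Q99 of the inner PnL distribution
    return im

if __name__ == "__main__":
    print("Expected IM profile:", brute_force_im().mean(axis=0))
```

Even with only 200 outer and 1,000 inner scenarios, this sketch already requires 200 x 12 x 1,000 inner draws, which illustrates why the brute-force approach does not scale to live usage.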

3 Regression Methods for the Approximation of Initial Margin
The technique proposed by Longstaff-Schwartz in [16] has been the framework traditionally used in finance to reduce dimensions in nested Monte Carlo problems. In short, the authors propose to reduce the number of machine operations by simulating only the outer scenarios and fitting a regression to approximate the inner ones.
Their method proposes a least squares regression as the functional choice to approximate the inner scenarios of the Monte Carlo. The use of regressions is justified by the fact that the authors' underlying mathematical problem is the estimation of a conditional expectation, which is usually approximated with a linear regression. While our problem at hand is the approximation of a quantile, we show below how the general Longstaff-Schwartz framework can be used.
The figures below provide a graphical description of the Longstaff-Schwartz framework used for initial margin. The image on the right shows that when only the outer scenarios are available, we need to assume a distribution for the inner ones, as the number of simulated values is not enough to infer a quantile.
From equation 3, the inner scenarios are used to calculate the PDF of the change in the portfolio value, conditional on the information provided by each filtration set F_t. From this PDF we obtain the Q_99 that makes up the initial margin. However, when only the outer scenarios are simulated, we are left with a single point of the empirical distribution, which is not enough to build a PDF. To overcome this limitation, the relevant literature proposes to assume a parametric distribution for the rest of the unknown inner scenarios. To model [V(ω, t + δ_IM) − V(ω, t)], the majority of authors propose a local Normal distribution, since it can be conveniently parameterized by only two parameters. This choice is justified under the assumption that the simulated V(ω, t) come from a Normal SDE and that the time lapse between observations δ_IM is small. Other authors, like [4], propose a fat-tailed t distribution.
If we assume that the inner scenarios follow a local Normal distribution, we are only left with the task of calculating the distribution's expectation and variance to then obtain the initial margin as:

Figure 2: Simplification of a nested Monte Carlo by applying distributional assumptions

IM(ω, t) | F_t = σ(ω, t) · Φ^{−1}(0.99)   (5)

where, for simplicity and without loss of generality, we have assumed that the data is centered and thus the expectation is zero. Under this assumption, the term σ(ω, t) can be written as:

σ(ω, t) | F_t = √( E[∆²(ω, t, δ_IM) | F_t] )   (6)

Equation 6 shows that, when restated in this way, we have transformed a quantile estimation problem into the estimation of a conditional expectation, which puts us in the same setting as the Longstaff-Schwartz problem. In what follows, we study the approximation of E[∆²(ω, t, δ_IM) | F_t], that is, the conditional expectation of the netting set's squared PnL given the information available at time t.
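For concreteness, the sketch below follows this outline: only outer paths are simulated, ∆² is regressed on V(ω, t) with a simple polynomial basis, and the initial margin is recovered from equations 5 and 6 under the local Normal assumption. The dynamics, the basis, the parameters and the function name regression_im are illustrative assumptions rather than the setup of any specific paper.

```python
import numpy as np

def regression_im(n_outer=5000, n_steps=12, dt=1/12, delta_im=10/252,
                  s0=100.0, sigma=0.2, degree=2, seed=0):
    """Longstaff-Schwartz style approximation of the forward IM (eqs. 5-6).

    Only outer paths are simulated; for each path and time step a single
    realization of V(omega, t + delta_im) is drawn, Delta^2 is regressed on
    V(omega, t) with a polynomial basis, and IM is recovered under the local
    Normal assumption as sqrt(E[Delta^2 | F_t]) * Phi^{-1}(0.99).
    """
    z99 = 2.3263  # Phi^{-1}(0.99)
    rng = np.random.default_rng(seed)
    im = np.zeros((n_outer, n_steps))
    v = np.full(n_outer, s0)
    for t in range(n_steps):
        # Advance the outer paths by one coarse time step.
        v *= np.exp(-0.5 * sigma**2 * dt
                    + sigma * np.sqrt(dt) * rng.standard_normal(n_outer))
        # One realization of the value after the margin period of risk.
        v_fwd = v * np.exp(-0.5 * sigma**2 * delta_im
                           + sigma * np.sqrt(delta_im) * rng.standard_normal(n_outer))
        delta_sq = (v_fwd - v) ** 2
        # Regress Delta^2 on V(omega, t): E[Delta^2 | F_t] ~ m(V_t), equation 7.
        coeffs = np.polyfit(v, delta_sq, deg=degree)
        cond_var = np.clip(np.polyval(coeffs, v), 0.0, None)
        im[:, t] = np.sqrt(cond_var) * z99        # equation 5
    return im

print(regression_im().mean(axis=0))  # expected IM profile across paths
```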

4 Conditional Expectations as Regression Functions on Hilbert Spaces

4.1 Conditional Expectations as Regressions
The task is to find the best (i.e. unbiased) estimator, at every time step t, for the conditional expectation in equation 6 above. The proposal is to implement a numerical procedure where the conditional expectation is approximated by performing a regression against a set of basis functions. The use of regressions to approximate conditional expectations needs to be formally justified. In other words, we need to formalize the validity of the expression below³:

E[∆(ω, t, δ_IM)² | F_t] = m(x_t)   (7)

where x_t is an F_t-adapted process and m(x_t) is a stochastic function with certain properties described in the following sections.

3 Without loss of generality, we assume a regression without an intercept term, to ease notation.

4.1.1 Properties of Conditional Expectations in Hilbert Spaces


To characterize conditional expectations as regressions, we first need to study their properties as functions of random variables. In the above equation for m(x_t), the set of regressors has been fixed by conditioning on the filtration; therefore, we are projecting onto a fixed subspace and the conditional expectation has become a function of X_t and not just a random variable, as typically defined in option pricing. By analyzing conditional expectations as functions, we can use the properties inherited from the functional space they belong to, in particular the properties regarding orthogonal projections and decomposition into orthogonal bases.
To set up an appropriate functional space, we consider the space of random measurable functions on the probability space (Ω, H, P), where H is the Hilbert space of random vectors. We narrow the space to bounded random vectors, in the sense that E[y²] < ∞ for every y ∈ H; that is, we only consider square-integrable functions with finite second moment. As such, the inner product of two elements y_1 and y_2 is E[y_1 y_2] and the norm of an element is ‖y‖ = √(E[y²]).
The set of conditional expectation functions meets the conditions above to belong to Hilbert spaces and, as such, inherits their properties. In particular, we focus on the properties of orthogonal projections onto Hilbert subspaces, since these allow us to rewrite conditional expectations as regressions.
The starting point is to characterize conditional expectations as orthogonal projection maps; this is so because the projected vectors meet the orthogonality requirement under the appropriate Hilbert norm (see the Appendix and [15] for a complete proof). Second, the projected vectors belong to the orthogonal complement space, which is a closed Hilbert subspace; therefore, they can be decomposed into combinations of basis functions of that subspace.
If the projected subspace is closed and separable, then we can find a finite number of basis functions. The challenge under a Monte Carlo simulation setting is to find a finite set of suitable basis functions to represent the projected vectors. The following sections expand on the most common empirical methodologies used for this purpose.

5 Approximation Methodologies for the Estimation of Conditional Expectations
5.1 The Choice of Regressors
Under a Monte Carlo simulation, we empirically generate the random vectors belonging to the space H and to the closed subspace F ⊂ H onto which we will be projecting. When simulated, this space becomes a discrete-time filtration F = (F_t)_{t=1,...,T} spanned by the simulated variables, F = span(x_1, ..., x_T). Under this setting, we need to find an estimate for E[Y | span(x_1, ..., x_T)].
The key art of empirical simulations is the choice of the regressors X_t. The output of the Monte Carlo simulation contains several variables that can be used as regressors. As explained in the previous sections, in theory we would want to find an orthogonal basis to decompose the projected vectors; in practice, the choice of regressors is an empirical call. However, some minimal conditions must be met.

5.1.1 Markovian Regressors


The choice of functional form used to represent the conditional expectation depends on the σ-algebra obtained from the filtration. There should be a sound relationship between the projected vector Y and the projection vectors (X_t)_{0≤t≤T} from the filtration. In choosing the relevant modeling variables, not all the history coming from the filtration will be relevant. For Markov chains, only the most recent values of the state variables are necessary, while for non-Markov processes, past realizations of the filtration should be included as regressors.
If the variables from the filtration are Markovian, then the conditional expectation can be estimated as:

E[Yt+1 | Ft ] = E[Yt+1 | Xt ] (8)

The regressors (X_t)_{0≤t≤T} are given by the information produced by the Monte Carlo simulation. If the process (X_t)_{0≤t≤T} has independent increments, then it is a Markov process (see [15]).

5.2 The Choice of the Basis Functions


The choice of basis functions depends on the chosen norm (which defines orthogonality between Y and X), on the projection subspace span(x_1, ..., x_t) and, mostly, on the empirical relationship between the independent and dependent variables X and Y.
The underlying idea behind the different approximation methods is to find an approximation for E[Y_{t+1} | X = X_t] = m(X_t) that solves the minimization problem below:

Let Y be a random vector, Y ∈ H, and (X_t)_{0≤t≤T} a deterministic vector in a Hilbert space such that X ∈ F ⊂ H, where Y and X are independent. Additionally, given a function m(X_t, β̂) of X, solve the minimization problem below to find β̂ ∈ R^n such that:

inf_{β̂} E[(Y_{t+1} − m(X_t, β̂))²]   (9)

where the function m(·) can be decomposed into basis functions as explained before.

5.2.1 Empirical Relationship Between the Dependent and Independent Variables
The choice of functional form to represent the relationship between (Y_t)^n and (X_t)^n is an empirical call mostly based on data analysis. What is the best choice of basis functions such that m(X) : X → Y agrees with our simulation data?
In practice, under a nested Monte Carlo simulation, basis functions are selected arbitrarily so that the resulting functional form of the conditional expectation function appropriately represents the relationship between the vector (Y_t)^n and the filtration vector (X_t)^n, where n is the dimension of the simulated vectors at time t.
From the Monte Carlo simulation engine, one can easily obtain the simulated variables V(ω, t) and V(ω, t + δ_IM), from which ∆²(ω, t, δ_IM) can be obtained at every t. The natural choice in the literature is to use V(ω, t) as the regressor for ∆²(ω, t, δ_IM). This assumes that V(ω, t) is a good predictor for ∆²(ω, t, δ_IM), as in equation 8. In what follows, we will use:

X_t = V(ω, t)^n
Y_{t+1} = ∆²(ω, t, δ_IM)^n = (V(ω, t + δ_IM)^n − V(ω, t)^n)²

where n is the number of simulated scenarios ω.
To choose the functional relationship between V(ω, t) and ∆²(ω, t, δ_IM), one must remember from the previous section that only the simulations of the outer scenarios are available, and from this information we need to infer the relationship between V(ω, t) and ∆²(ω, t, δ_IM) in the inner scenarios. The assumption is that the functional relationship between V(·) and ∆²(·) is the same in the inner and outer scenarios. This is possible because the same SDE is used to simulate the inner and outer scenarios (provided we are at a given t).
From the graphs in [10] we can study the relationship between V(·) and ∆²(·) for different types of regression models. From the cloud of points, the functional form is rather inconclusive.
Lastly, a usual simplification is to assume that the set of basis functions does not change over time, but that the weights do. Hence, after a functional form of the conditional expectation has been chosen, the problem is to estimate the weights at every simulation time step t, conditional on the information given by the filtration up to that time.

Figure 3: A comparison of scatter plots of portfolio value vs simulated initial margin and PnL for initial margin models. Taken from [10].

5.3 Analytical Approximations for the Conditional Expectation Function
Several methods can be used to estimate the conditional expectation function
E[Yt+1 |X = Xt ] = m(Xt ), depending on the properties of the underlying data.
A good summary of the available methodologies for nested Monte Carlo prob-
lems can be found in [9]. Below we discuss the most popular ones used in the
context of initial margin.

5.3.1 Algorithms Based on Linear Maps


In the context of initial margin estimation, linear regression methods are proposed by [1], [8] and [10]. The problem setting is as described in equation 9 above. In the case where the Hilbert subspace F_n is a linear vector space, the estimate can be computed by solving a linear system of equations. Linear maps include ordinary least squares regression and polynomial functions.
The projection is done onto a linearly independent set of random variables (X_t), where m(X_t, β̂) is a linear map of the form:
m(X_t, β̂) = Σ_{i=0}^P β̂_i φ_i(X_t) = β̂^T φ(X_t)   (10)

where φ_i(X_t), i = 0, ..., P, are the arbitrarily chosen basis functions.
A linear problem like this can be solved by inverting matrices. In the appendix we cover the case where both Y and X are jointly Gaussian, which is the setting of the well-known Normal Equations.
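As a minimal sketch of equation 10 solved through the normal equations, the snippet below fits a monomial basis φ_i(x) = x^i by least squares; the toy data and the function name fit_polynomial_basis are assumptions for illustration only.

```python
import numpy as np

def fit_polynomial_basis(x, y, degree=2):
    """Least-squares fit of m(x) = sum_i beta_i * phi_i(x) with monomial basis
    phi_i(x) = x^i (equation 10), solved via the normal equations
    beta = (Phi^T Phi)^{-1} Phi^T y."""
    phi = np.vander(x, degree + 1, increasing=True)     # design matrix Phi
    beta = np.linalg.solve(phi.T @ phi, phi.T @ y)      # normal equations

    def predict(x_new):
        return np.vander(x_new, degree + 1, increasing=True) @ beta

    return beta, predict

# Illustrative usage: regress a toy Delta^2 sample on the portfolio value V_t.
rng = np.random.default_rng(1)
v_t = rng.normal(100.0, 10.0, size=5000)
delta_sq = (0.02 * v_t) ** 2 * rng.standard_normal(5000) ** 2
beta, m = fit_polynomial_basis(v_t, delta_sq, degree=2)
print("fitted coefficients:", beta)
print("E[Delta^2 | V_t = 100] ~", m(np.array([100.0]))[0])   # roughly (0.02*100)^2 = 4
```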

In the non-linear methods below, the vector spaces (and thus the function m(X, β̂)) are non-linear, and thus the ordinary least squares approach does not work.

5.3.2 Algorithms Based on Non-Parametric Regressions


Among the vast number of non-parametric algorithms available, kernel regressions have been the most popular. In particular, the Nadaraya-Watson kernel [17] has been used by [2], [10] and [11] in the context of initial margin estimation.
In principle, Nadaraya-Watson proposed an algorithm to approximate the empirical joint density of two random variables. That is, instead of estimating a conditional expectation directly, the method estimates the entire joint PDF, from which the conditional expectation is derived. The method calculates a locally weighted average of the observations given a choice of kernel function. The weights provide a smoothing feature to the fit, as they decrease smoothly to zero with distance from the given point. The derivation of the Nadaraya-Watson PDF estimator can be found in [14]. Under this method, the conditional expectation function can be approximated as⁴:
E[Y | X = x] = ( Σ_{i=1}^N K_h(x, x_i) y_i ) / ( Σ_{i=1}^N K_h(x, x_i) )   (11)

where h is the bandwidth parameter that controls the amount of smoothing of the data (see [14] for a discussion) and K(·) is a kernel function satisfying⁵ the following properties (a short verification for the Gaussian case is sketched after the list):

1. symmetry: k(−u) = k(u)

2. non-negativity: k(u) ≥ 0 ∀u

3. ∫_{−∞}^{+∞} k(u) du = 1
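As a quick worked check, the lines below verify the three properties for a normalized one-dimensional Gaussian kernel; note that the unnormalized Gaussian kernel introduced later in equation 17 satisfies the first two properties but needs the 1/(h√(2π)) factor to integrate to one.

```latex
% Sketch: the normalized one-dimensional Gaussian kernel
%   k_h(u) = \frac{1}{h\sqrt{2\pi}} e^{-u^2/(2h^2)}
% satisfies the three kernel properties listed above.
\begin{align*}
k_h(-u) &= \tfrac{1}{h\sqrt{2\pi}}\, e^{-(-u)^2/(2h^2)} = k_h(u)
  && \text{(symmetry)} \\
k_h(u) &\geq 0 \quad \forall u \in \mathbb{R}
  && \text{(the exponential is strictly positive)} \\
\int_{-\infty}^{+\infty} k_h(u)\, du
  &= \tfrac{1}{h\sqrt{2\pi}} \int_{-\infty}^{+\infty} e^{-u^2/(2h^2)}\, du
   = \tfrac{h\sqrt{2\pi}}{h\sqrt{2\pi}} = 1
  && \text{(Gaussian integral)}
\end{align*}
```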

While the above expression is non-linear in the regressors, we show that the NW estimator is a particular case of a local estimation problem. That is, the function above can be re-expressed as being locally linear (i.e. piece-wise linear) in a certain set of β parameters. Moreover, we show that the conditional expectation estimator proposed by kernel regressions can be framed as a particular case of the orthogonal projection problem in equation 9, with a modified objective function and a different norm.
4 See [14] for a derivation.
5 "Kernel" functions can be seen as data "filters" or "envelopes", in the sense that they define proximity between points.

The local minimization problem works by setting a small neighborhood around a given centroid x_0 and assuming a parametric function within the selected region. For every time step t of the Monte Carlo simulation, consider a neighborhood of x_0 and assume a local functional relationship m(·) between the Y and X variables. If m(·) is continuous and smooth, it should be approximately constant over a small enough neighborhood (by Taylor expansion). While linear regressions assume a linear relationship between the variables and minimize a global sum of squared errors, kernel regressions assume a linear relationship only on a small neighborhood, focusing on the modeling of piece-wise areas of the observed variables, and repeat the procedure for all centroids. The trick is to determine the size of the neighborhood so as to trade off error bias against model variance. The local minimization problem is now as follows:
Given a random vector Y ∈ H and a deterministic vector X ∈ F ⊂ H, with Y and X independent, and given the functions K_h(·) and m(·), find β̂ such that:

inf_{β̂} E[K_h(x_0, X_t) (Y_{t+1} − m(X_t, β̂))²]   (12)

In the local minimization problem, the model for m(x) is a linear expansion of basis functions as in equation 10:

m(X_t, β̂) = Σ_{i=0}^P β̂_i φ_i(X_t)   (13)

Some examples of m(.) are:

• m(Xt , β̂) = β0 , the constant function (i.e. intercept only). This results in
the NW estimate in equation 11 above.
• m(Xt , β̂) = β0 + β1 (xi − x0 ) gives a local polynomial regression.

The β̂ are solved for by weighted least squares. As in the linear case, the vector W(Y_{t+1} − m(X_t, β̂)) should be orthogonal to all other functions of X, but now the relevant projection is orthogonal with respect to this new inner product:

⟨W(Y_{t+1} − m(X_t, β̂)), m(X_t, β̂)⟩   (14)

where W is the diagonal matrix of kernel weights, W(x_0) = diag(K_h(x_0, x_i))_{N×N}.


The solution is given by:

β̂ = (X^T W X)^{−1} X^T W Y   (15)

We can then obtain the same expression for m(x) as in equation 11:

m(X_t) = β̂_0 = (X^T W X)^{−1} X^T W Y = Σ_{i=1}^N w_i y_i   (16)
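To make equations 15 and 16 concrete, the sketch below performs the weighted least squares fit at a single centroid x_0, with a Gaussian kernel and the local linear model m(x) = β_0 + β_1(x − x_0); the data, the bandwidth and the function name local_linear_fit are illustrative assumptions.

```python
import numpy as np

def local_linear_fit(x0, x_train, y_train, h):
    """Weighted least squares at a single centroid x0 (equations 15-16):
    beta_hat = (X^T W X)^{-1} X^T W Y with W = diag(K_h(x0, x_i)),
    using a Gaussian kernel and the local model m(x) = b0 + b1*(x - x0)."""
    w = np.exp(-((x_train - x0) ** 2) / (2.0 * h**2))   # kernel weights K_h(x0, x_i)
    X = np.column_stack([np.ones_like(x_train), x_train - x0])
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y_train)
    return beta[0]   # the intercept b0 is the local estimate m(x0)

# Illustrative usage: recover a smooth curve from noisy observations.
rng = np.random.default_rng(5)
x = rng.uniform(0.0, 3.0, 1000)
y = np.sin(x) + 0.1 * rng.standard_normal(1000)
print(local_linear_fit(1.5, x, y, h=0.2), np.sin(1.5))   # the two values should be close
```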

The task is now to choose the K_h(·) function. One common choice in the literature is the family of radial basis functions, in which each basis function depends only on the magnitude of the radial distance (often Euclidean) from a centroid x_0 ∈ X. The most cited in the literature is the Gaussian kernel:

k(x, x_0) = exp( −‖x − x_0‖² / (2σ²) )   (17)
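A minimal sketch of the Nadaraya-Watson estimator of equation 11 with the Gaussian kernel of equation 17 is given below; the training data, the bandwidth and the function name nadaraya_watson are illustrative assumptions.

```python
import numpy as np

def nadaraya_watson(x_train, y_train, x_query, h):
    """Nadaraya-Watson estimate of E[Y | X = x] (equation 11) using the
    Gaussian kernel of equation 17, with h playing the role of sigma."""
    # Pairwise squared distances between query points and training points.
    d2 = (x_query[:, None] - x_train[None, :]) ** 2
    k = np.exp(-d2 / (2.0 * h**2))           # kernel weights K_h(x, x_i)
    return (k @ y_train) / k.sum(axis=1)     # locally weighted average

# Illustrative usage: smooth a noisy quadratic relationship.
rng = np.random.default_rng(3)
x = rng.uniform(-2.0, 2.0, size=2000)
y = x**2 + 0.3 * rng.standard_normal(x.size)
x_grid = np.linspace(-2.0, 2.0, 5)
print(nadaraya_watson(x, y, x_grid, h=0.2))  # close to x_grid**2
```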

5.3.3 Deep Neural Networks


Deep neural networks (DNNs) have been proposed to estimate initial margin by [12] and [18]. Deep NN is an umbrella term for several related methodologies; our focus is on the methods proposed in the initial margin literature, in particular the feed-forward neural networks as implemented in [18].
A DNN can be thought of as a nonlinear generalization of a linear model, where the parameters (called weights) have to be "learned" from the data. The idea is to create a nonlinear function from a stack of linear functions composed with nonlinear activations. In a feed-forward DNN, each layer projects onto the next layer of neurons (computational nodes). Each neuron performs a linear regression: it takes the input from the previous layer, combines it according to a rule and applies a function to the result. To find the weights, the method minimizes the total sum of errors as in equation 9 by performing an orthogonal projection of the dependent variables onto a non-linear subspace.
The theoretical basis for the approximation of a function by NNs is given by the Universal Approximation Theorem (UAT) [19]. This theorem states that any continuous function on a compact subset of R^n can be approximated by a feed-forward network with a single hidden layer containing a finite number of neurons. However, in practice we might need more than one layer to improve the approximation.
Following [18], the functional form for the approximation of the conditional expectation function is given by:

m(X_t, w, a, b) = Σ_{i=1}^n w_i φ( Σ_{j=1}^m a_{i,j} x_j + b_i )   (18)

where a_{i,j} is the weight of the connection from the input x_j to the i-th hidden neuron (function), b_i is the bias of the i-th hidden neuron, φ is the activation function, and w_i is the weight of the connection from the i-th hidden neuron to the (linear) output neuron. The authors in [18] use the classic ReLU function as the activation function.
Except for the input nodes, each node is a neuron (function) that performs a simple computation: y = φ(z), z = Σ_i w_i x_i + b, where y is the output and x_i is the i-th input.
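The sketch below implements the single-hidden-layer functional form of equation 18 with a ReLU activation and fits the weights by plain gradient descent on the squared-error objective of equation 9. The architecture, the training loop, the toy data and all parameter choices are illustrative assumptions and not the setup of [18].

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, a, b, w):
    """Network of equation 18: m(x) = sum_i w_i * phi(sum_j a_{i,j} x_j + b_i)."""
    return relu(x @ a.T + b) @ w          # x: (N, m), a: (n, m), b: (n,), w: (n,)

def train(x, y, n_hidden=16, lr=1e-2, epochs=3000, seed=0):
    """Fit the weights by minimizing the sum of squared errors (equation 9)
    with full-batch gradient descent; a sketch, not the training setup of [18]."""
    rng = np.random.default_rng(seed)
    m = x.shape[1]
    a = rng.normal(scale=1.0 / np.sqrt(m), size=(n_hidden, m))
    b = np.zeros(n_hidden)
    w = rng.normal(scale=1.0 / np.sqrt(n_hidden), size=n_hidden)
    for _ in range(epochs):
        z = x @ a.T + b                         # pre-activations
        h = relu(z)                             # hidden-layer outputs
        err = h @ w - y                         # residual of the squared-error objective
        grad_w = h.T @ err / len(y)
        grad_z = np.outer(err, w) * (z > 0)     # backpropagate through the ReLU
        grad_a = grad_z.T @ x / len(y)
        grad_b = grad_z.mean(axis=0)
        w -= lr * grad_w
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b, w

# Illustrative usage: learn a toy E[Delta^2 | V_t] surface from simulated data.
rng = np.random.default_rng(1)
v = rng.normal(100.0, 10.0, size=(4000, 1))
d2 = (0.02 * v[:, 0]) ** 2 * rng.standard_normal(4000) ** 2
a, b, w = train(v / 100.0, d2)                  # simple input scaling
print(forward(np.array([[1.0]]), a, b, w))      # roughly (0.02*100)^2 = 4
```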

A Theorems and Results
We present the main theorems used in this paper along with their particular application to our case. The reader is referred to [15] for a complete reference.
The theorem below guarantees that an orthogonal projection exists and is unique.
Theorem A.1 (Subspace Projection Theorem). Let H be a Hilbert space with y ∈ H and let M ⊂ H be a closed nonempty subspace of H. Then:

1. there exists a unique vector ŷ ∈ M (called the projection of y onto M) that minimizes ‖ŷ − y‖, that is:

‖ŷ − y‖ ≤ ‖x − y‖  ∀x ∈ M.   (A.1)

where ‖·‖ is the norm generated by the inner product associated with H.

2. ŷ can be obtained as the orthogonal projection of y onto M.


That is, ŷ is uniquely characterized by:

(y − ŷ) ∈ M⊥ (A.2)

The theorem below states that, given a Hilbert space and a closed subspace, any vector of the Hilbert space can be conveniently decomposed as the sum of two vectors, one in the subspace and the other in its orthogonal complement.

Theorem A.2 (Direct Sum). Let M be a closed subspace of H. Then every y ∈ H can be written uniquely as y = m + m⊥, where m ∈ M and m⊥ ∈ M⊥. The spaces M⊥ and M are called complements.

The theorem above tells us that if we find m and m⊥ we can recover y. The task now is to find the orthogonal complement m⊥ and to show that this is indeed the conditional expectation. The theorem below builds on A.2 and develops the properties of projection mappings.
Theorem A.3 (Properties of Projection Mappings). Let H be a Hilbert space
and let PM be a projection mapping onto a closed subspace M. Then
1. each y ∈ H has a unique representation as a sum of an element in M and
an element in M⊥ i.e.

y = PM (y) + (I − PM )y (A.3)

2. y ∈ M if and only if PM (y) = y

3. y ∈ M⊥ if and only if PM (y) = 0

where P_M is the projection operator, also called the orthogonal projector.
Combining theorem A.1 with A.3, we know that the minimizing vector has the following form:
Proposition A.1. ŷ ∈ M minimizes ‖ŷ − y‖ if and only if ŷ = P_M y.
The proposition above tells us that the projection onto a subspace is a linear transformation and that the projection is the closest vector in the subspace. Using property (3) above, we can use the orthogonality condition to find P_M and relate it to the conditional expectation.
We need to show that, for a given closed subspace M ⊂ H and a random variable Y ∈ H,

E[Y | M] = P_M(Y)   (A.4)

The definition below introduces the orthogonal complement: the vectors that belong to it are the ones minimizing the distance between the spaces. We use the definition of distance (i.e. the inner product) to define orthogonality.
Definition A.1 (Orthogonal Complement). Let M denote any subset of a Hilbert space H. Then the set of all vectors orthogonal to M, called the orthogonal space or orthogonal complement and denoted by M⊥, is:

M⊥ = {x ∈ H : ⟨y, x⟩ = 0 ∀y ∈ M}   (A.5)

More specifically, the inner product of this space is defined as:

⟨y, x⟩ = E[XY]   (A.6)

for any random variables X, Y belonging to the defined probability space.


Using the definition of orthogonality above, we can characterize conditional expectations as projection operators on Hilbert spaces. The conditional expectation E[Y | F] can be interpreted as an operator on random variables that transforms A-measurable variables into F-measurable ones.

A.1 Projection onto Hilbert Spaces


Definition A.2 (Linear Projection onto Random Spaces). Let L²(P), with the inner product defined as before, be a Hilbert space, let Y ∈ L²(P), let X = (x_1, ..., x_k) be a random vector in L²(P) with E[X²] < ∞, and let M ⊆ L²(P) with M = span(x_1, ..., x_k).

Then:

m(x) = argmin_{ŷ ∈ M} E[(y − ŷ)²]   (A.7)

where ŷ is a function of the given vector X.

The function above is called the best linear projection of y given X, or the linear projection of y onto X.
Below we show that conditional expectations are functions of X and meet the above definition.

Theorem A.4 (Conditional Expectation as Orthogonal Projection onto a Subspace). Let Y ∈ L²(Ω, Y, P) with finite variance E[Y²] < ∞ and let F ⊆ Y be a sub-sigma-algebra. Then E[Y | F] is the orthogonal projection of Y onto L²(Ω, F, P).
This means that for every F-measurable Ŷ with finite variance E[Ŷ²] < ∞,

E[(Y − Ŷ)²] ≥ E[(Y − E[Y | F])²]

where Ŷ is any other function of F, with equality if and only if Ŷ = E[Y | F].

Proof Sketch: Once the subspace of independent variables X ∈ F has been fixed, we no longer have a random variable E[Y | F]; instead we have a function E[Y | X = x] = m(x). The theorem above says that this function minimizes the distance between the projected vector Y and the subspace generated by X. Therefore E[Y | X = x] is the best predictor (in a mean-squared sense) of Y given some partial information modeled as a sub-σ-field F.
From the orthogonal projection theorem A.1, we need to show that the approximation error is orthogonal to the projection subspace. Let e = Y − Ŷ be the difference between the projected vector and its projection, and let F = span(x_1, ..., x_k). If the distance ‖e‖ is minimal, the error e should be orthogonal to the projection space F. Equations A.5 and A.6 give us the appropriate inner product to use.
We need to show ⟨e, X⟩ = 0 or, equivalently, E[Xe] = 0:

E[Xe] = E[X(Y − Ŷ)] = E[X(Y − E[Y | F])] = E[XY] − E[X E[Y | F]] = 0,

since E[X E[Y | F]] = E[E[XY | F]] = E[XY], X being F-measurable.

Now the task is to find the E[Y | F] that minimizes the distance; that is, we would like to find a functional form for this vector. The proposition below states that orthogonal complement spaces are closed, so their vectors can be approximated by basis functions. Moreover, since the Hilbert space is separable, its vectors can be rewritten and approximated by a countable set of basis functions (for example, the first N basis functions). The specific functional form is unknown; in some simple cases the conditional expectation can be constructed explicitly. For example, the regression model is a particular form of E[Y | F], and the basis functions to choose depend on the functional relationship between the independent and the dependent variables.
Proposition A.2. Let M⊥ be the orthogonal complement of a Hilbert subspace M. Then:
1. M⊥ is a subspace of H
2. M⊥ is closed

To choose a functional form for m(x) = E[Y | F], the optimization problem below is implemented. In the case where Y and X are jointly Gaussian, the solution is given by the least-squares estimators.

E[e²] = E[‖Y − m(X)‖²] < ε   (A.8)

An orthogonal projection onto a given subspace can be constructed from any orthonormal basis for that subspace. In a Hilbert space, let M = span(e_1, ..., e_n), where {e_i} is an orthonormal set, i.e. ⟨e_i, e_j⟩ = 0 for i ≠ j and ⟨e_i, e_i⟩ = 1. Then:

P_M(Y) = Σ_{i=1}^n ⟨Y, e_i⟩ e_i   (A.9)

The challenge is to approximate the Hilbert space by a finite-dimensional vector space. In the particular case of linear projections:
• If two random variables X and Y are jointly Gaussian, then the linear projection of Y onto X coincides with the conditional expectation E[Y | X] (see the numerical sketch below).
• If X and Y are not Gaussian, the linear projection of Y onto X is the minimum-variance linear prediction of Y given X.
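As a numerical illustration of the first bullet point, the sketch below compares the least-squares (linear projection) coefficients with the known conditional expectation of a simulated jointly Gaussian pair; the mean vector, covariance matrix and sample size are arbitrary assumptions.

```python
import numpy as np

# For a jointly Gaussian pair (X, Y) the linear projection of Y onto X coincides
# with E[Y | X = x] = mu_Y + (cov_XY / var_X) * (x - mu_X).
rng = np.random.default_rng(7)
mu = np.array([1.0, 2.0])
cov = np.array([[1.0, 0.6],
                [0.6, 2.0]])
x, y = rng.multivariate_normal(mu, cov, size=200_000).T

# Linear projection: least-squares regression of Y on X (with intercept).
slope_hat, intercept_hat = np.polyfit(x, y, 1)

# Theoretical coefficients of the conditional expectation.
slope = cov[0, 1] / cov[0, 0]
intercept = mu[1] - slope * mu[0]
print(f"linear projection: {intercept_hat:.3f} + {slope_hat:.3f} x")
print(f"E[Y | X = x]:      {intercept:.3f} + {slope:.3f} x")
```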

References
[1] Anfuso, Fabrizio and Aziz, Daniel and Giltinan, Paul and Loukopoulos,
Klearchos. A Sound Modelling and Backtesting Framework for Forecast-
ing Initial Margin Requirements. (January 11, 2017). Available at SSRN:
https://ssrn.com/abstract=2716279
[2] Andersen, L and Pykhtin, M and Sokol, A. Credit Exposure in
the Presence of Initial Margin. (July 22, 2016). Available at SSRN:
https://ssrn.com/abstract=2806156
[3] Andersen, Leif and Pykhtin, M., editors. Margin in Derivatives Trading. (2018). Risk Books.
[4] Andersen, Leif B.G. and Dickinson, Andrew Samuel. Funding and Credit
Risk with Locally Elliptical Portfolio Processes: An Application to CCPs.
(April 11, 2018). Available at SSRN: https://ssrn.com/abstract=316115
[5] Basel Committee on Banking Supervision and International Organization
of Securities Commissions (2015). Margin Requirements for Non-Centrally
Cleared Derivatives. D317 March 2015.

[6] Green, Andrew David and Kenyon, Chris. MVA: Initial Margin Valuation
Adjustment by Replication and Regression.(January 12, 2015) Available at
SSRN: https://ssrn.com/abstract=2432281.
[7] Garcia Trillos, Camilo and Henrard, Marc and Macrina, Andrea. Estima-
tion of Future Initial Margins in a Multi-Curve Interest Rate Framework.
(February 3, 2016). Available at SSRN: https://ssrn.com/abstract=2682727
[8] Caspers, Peter and Giltinan, Paul and Lichters, Roland and Nowaczyk, Niko-
lai. Forecasting Initial Margin Requirements - A Model Evaluation. (Febru-
ary 3, 2017). Available at SSRN: https://ssrn.com/abstract=2911167

[9] Carriere, J. F. Valuation of the early-exercise price for options using simulations and nonparametric regression. Insurance: Mathematics and Economics 19, 19-30 (1996).
[10] Chan, Justin and Zhu, Shengyao and Tourtzevitch, Boris. Prac-
tical Approximation Approaches to Forecasting and Backtesting Ini-
tial Margin Requirements. (November 29, 2017). Available at SSRN:
https://ssrn.com/abstract=3079782
[11] Dahlgren, Martin. Forward Valuation of Initial Margin in Exposure and
Funding Calculations.(2018) Margin in Derivatives Trading, ch 9. Risk
Books.

[12] Hernandez, Andres. Estimating Future Initial Margin with Machine Learn-
ing.(March, 2017). PWC.

[13] ISDA SIMM Methodology, version 2.1, Effective Date: December 2018.
[14] Kuhn, Steffen. Kernel Regression by Mode Calculation of the Conditional
Probability Distribution Available at: https://arxiv.org/pdf/0811.3499.pdf
[15] Luenberger, David G. Optimization by Vector Space Methods. (1969). Wiley.

[16] Longstaff, Francis A. and Schwartz, Eduardo S. Valuing American Options by Simulation: A Simple Least-Squares Approach. (October 1998). Available at SSRN: https://ssrn.com/abstract=137399
[17] Nadaraya, E. A. On Estimating Regression. Theory Probab. Appl., 9(1), 141-142.

[18] Ma, Xun and Spinner, Sogee and Venditti, Alex and Li, Zhao and Tang,
Strong. Initial Margin Simulation with Deep Learning.(March 21, 2019).
Available at SSRN: https://ssrn.com/abstract=3357626
[19] Sodhi, Anurag. American Put Option pricing using Least squares Monte
Carlo method under Bakshi, Cao and Chen Model Framework (1997) and
comparison to alternative regression techniques in Monte Carlo.(August,
2018). Available at https://arxiv.org/abs/1808.02791
