
Market Risk Measurement

Lecture 3
Statistical Tools for Market Risk Management

Riccardo Rebonato
1 Plan of the Lecture

Risky stuff is by definition uncertain.

To deal with an uncertain setting we have to make use of

• statistics for describing

• probability for predicting

Therefore in this section we are going to review the fundamental statistical and
probabilistic tools that we will use throughout the course — and that you will
use if you want to measure and manage risk.
2 Random Variables and Stochastic Processes

Suppose that we have a random event, such as the outcome of the next presidential
election. In itself, this event is not a random variable. It becomes a
random variable when we associate a numerical value to the outcome.

So, the outcome of the next US Presidential election is not a random variable.

However, if we define a variable that takes a value +1 for a Democratic victory,


and -1 for a Republican President, this variable is a random variable.
Why would we want to associate numerical values to random events? Because
we can quantify some aspects of the random event.

Take the case of the US presidential election.

If we just sum all the random variables from 1900 to today, the sum immediately
tells us whether there have been more Republican or Democratic presidents.
Or take a market portfolio. The P&L at the end of each trading day is a random
variable.

By taking its average, we can try to tell whether the trader is any good (or is
she just lucky?).

By looking at the variance of the returns we have one measure of how much
risk the trader takes.

By looking at the shape (skewness, kurtosis) of the distribution, we can try to


guess her ‘style’ — more about this later.

You get the drift.


The outcome of a random variable is not known in advance — otherwise it would
not be random.

What do we know about random variables?

First of all, we know which values they can assume — perhaps it’s the whole
real line; perhaps it is just 1 and −1, as in case of the US presidential election.
The range of values is (somewhat confusingly) called the domain.

Next we can try to discover the probability of each possible event occurring.

If we have found this (as we shall see, this information is encoded in the prob-
ability distribution associated with a random variable), then we know all that
there is to know (or, rather, all that it is possible to discover) about the random
variable.
What about a stochastic process?

A stochastic process is a random variable indexed by time.

Let’s go back to the example of the US presidential election, and the random
variable associated with it. Suppose that we stick a time label to each outcome
(+1 or -1). Now we have a stochastic process.

What good is that to us?

Once we have a process we can look at dynamic properties: for instance, we


can see whether the process is mean-reverting: given that the last US President
was a Republican, is it more likely for the next one to be a Republican or a
Democrat?
Or we can look at our trader again.

Does her strategy display ‘momentum’ ?

Is the risk she takes constant over time?

Has she changed trading ‘style’ over time?

Are her returns serially correlated?

Again, you get the drift.


3 The Probability Distribution Function

We said that the probability distribution is the function that ‘contains’ all the
information we can have about a random variable.

What exactly is a probability distribution, then?

To be precise, let’s define the cumulative probability distribution.

For simplicity, I am going to assume in the following that the domain of the
random variable is the real axis (or a portion of it).

Then I will be dealing with a continuous cumulative distribution function, but


I won’t repeat that it is continuous every time unless it is necessary.
The cumulative probability distribution (or Cumulative Distribution Function)
of the random variable X, CDFX , is a function.

As a function, it takes in a number and it returns another number.

The number the cumulative probability distribution takes in is any value, say,
x, in the domain of the random variable.

It returns the probability that a random draw of X will return a value smaller
or equal to x. So
CDFX (x) = Prob [X ≤ x] (1)
As x tends towards lower and lower values (in the domain), it becomes more
and more unlikely that a draw of X will yield a value even lower.

So, as x becomes smaller and smaller, the CDF tends to zero:

lim_{x → inf(dom(X))} CDF_X(x) = 0    (2)

Similarly, as x becomes larger and larger, the CDF tends to 1 (because we can
be virtually certain that the next random draw will be smaller than x):

lim_{x → sup(dom(X))} CDF_X(x) = 1    (3)

This is what a ‘typical’ CDF looks like. See Fig (1).


4 Probability Density Functions

CDFs contain all that there is to know about a random variable, but they are
not very easy to read.

Visually much more appealing are the probability density functions.

When the CDF is continuous, the density is defined as the derivative of the CDF
with respect to the input variable, x: if we denote the density function with
a lower-case Greek symbol and the CDF with an upper-case Greek symbol, we
have

ϕ(x) = dΦ(x)/dx    (4)

The density function tells us how the CDF changes. Fig (2) shows what it looks
like for the same Normal(0, 1) variable that gave rise to Fig (1):
Figure 1: The cumulative distribution function for a normal random variable
(mean = 0, variance = 1).
PDFs are handy because at a glance they tell us the mode and give us a
perception of the skewness and the spread of the distribution.

The correct interpretation of ϕ (x) is the following:

ϕ (x) dx gives the probability to draw a value of the random variable X between
x and x + dx. Remember: you have to multiply the density by dx in order to
make it become a probability — after all, that is why it is called a density.
Figure 2: The probability density function for a normal random variable (mean
= 0, variance = 1).
5 The Inverse of a CDF

The inverse of a CDF is a crucial concept for risk management.

We know that, if a function, y = f (x), is continuous and monotonic, it admits


an inverse, usually called f −1 (·) .

The inverse function takes in each possible value in the range of f (x) (one of
the ‘y’ values), and returns the associated value in the domain (one of the ‘x’
values).

A continuous CDF is continuous and monotonic.

Therefore it admits an inverse: Φ−1 (u), with u ∈ [0, 1].


Now here is a beautiful and extremely useful result

Draw from the distribution Φ (x) many random draws. Call them x1, x2, ...,
xn.

For each random draw, calculate the CDF: u1 = Φ (x1) , u2 = Φ (x2) , ...,
un = Φ (xn).

Then the random variables u1, u2, ..., un (they are random variables, right?)
are uniformly distributed over the interval [0, 1].

In symbols
Φ(xi) ∼ U(0, 1)    (5)
Now take a second distribution, say, Ψ (·).

Feed the uniform random variables, u1, u2, ..., un, into the inverse of Ψ(·),
Ψ^{−1}(u).

Then what I get out is a set of random draws (call them y1, y2, ..., yn) from
the distribution Ψ (·).
So, we have found a way to associate to the random draws from a given
distribution, Φ (x), the ‘associated’ random draws from a different distribution,
Ψ (y).

What does ‘associated’ mean?

It means that if, for instance, x1 was an exceptionally large move for distribution
Φ (x), then the associated value y1 will also be equally exceptionally large for
the distribution Ψ (y).

So, in formulae
y1 = Ψ^{−1}[Φ(x1)]    (6)

This is important.
We will prove the result in a later lecture, but, for the moment, please make
sure you understand what it says well, because we will make use of it repeatedly
when we look at Monte Carlo Simulations.
6 Percentiles

Some inverse values are more special than others.

For instance, given a distribution, Φ (x), we may want to know what is the
value, x, such that the probability of a draw smaller than x is, say, 10%.

But, by definition, this is just the inverse of Φ, evaluated at 0.1:
Φ^{−1}(0.1).
This number is called the 10th percentile.

So, for instance, the 10th, 5th and 1st percentiles of the standard normal
distribution are given by

Perc10 = −1.28155...
Perc05 = −1.64485...
Perc01 = −2.32635...

As we shall see, the famous VaR is just a percentile of the loss distribution.
7 The Empirical Cumulative Distribution (ECDF)

So far we have assumed a continuous CDF.

Suppose that we have lots and lots of data - perhaps the returns from equity
indices (S&P, FTSE100, DAX) and swaps (10y, 5y, 2y US$ swaps).

Can we build an empirical CDF?

Of course we can. Here is how to do it.


Suppose you have N data points.

First you sort all your data from lowest to highest.

Then you begin counting. When you meet your first data point (the lowest),
you associate to it a cumulative probability of 1/N.

Then you move to the second data point (the second lowest). You associate
to it a cumulative probability of 2/N.

Next you go to the ... kth data point. You associate to it a cumulative
probability of k/N.

By the time you have reached the last data point, you associate to it the
cumulative probability of N/N = 1.

Fig (3) shows what you get with this procedure for four of the time series that
we saw in Lecture 2.
Note that, despite looking smooth, if you look very carefully, you see little
‘jumps’ (discontinuities).

The size of each jump is always exactly the same (1/N). What changes is only
the density of the jumps.

Close to the centre of the distribution we have lots of small same-size jumps
next to each other.

In the tails, we can detect more easily ‘by inspection’ the ‘quantum jump’.

Figure 3: The ECDF functions for S&P, DAX, 10-year and 2-year US$ swaps
returns.
So, given some data, we know how to build an empirical CDF.

When we have lots of data, despite the tiny jumps, the curve looks very smooth.

Why bother with ‘named’ CDFs when empirical ones seem to do so well?

Apart from analytical tractability, empirical CDFs have a drawback: according


to the information they ‘contain’ no realization smaller than the smallest in the
record, or larger than the largest, can ever occur.
Especially for risk management purposes, this may not be such a good idea.

If the record from which we extract is very long, and contains seriously excited
periods, the implicit assumption may not be that bad, but the cautious risk
manager should be aware...
Exercise 1 Look carefully at these four ECDFs. Can you guess which ones
have the fattest tails?
8 ‘Famous’ Distributions

For some distributions we can obtain a closed-form expression for the density
(or for other quantities, such as the characteristic function, from which the
distribution of the density can be obtained).

These ‘named’ distributions are useful, because we can often obtain analytical
results or confidence intervals with them, but they are rare.

As we saw, they also assign non-zero probabilities to values beyond what is in


the ‘training set’.
Some of the best known distributions are:

• Normal (Gaussian)

• Student-t

• Poisson

• Gamma

• Chi-Squared
• Stable class (Gaussian, Cauchy, Levy...)

• ....
One distribution is more famous than all the other distributions put together:
it is the Normal or Gaussian distribution.

It is the easiest distribution for obtaining closed-form results, but its popularity
is not just a case of looking for the keys under the street lamp.

There are two properties that make it really special.

The first has to do with the central limit theorem (CLT). The second with
Maximum Entropy (ME). Here they are.
9 The Central Limit Theorem

In its simplest form, and after skipping all the fine print, the CLT states
the following:

The distribution of the sum (or the average) of n independent and
identically distributed random draws, drawn from any distribution,
tends to the Gaussian distribution as n tends to infinity.

Why does it matter for market risk management?


Suppose that the 1-day returns from a portfolio are distributed according to
some really strange distribution (perhaps with really fat tails, horrible skewness,
you name it), but always from the same one.

Also, let’s accept that the draws are independent.

Then, if I look at n-day returns, which are just the sum of 1-day returns,
the Central Limit Theorem tells me that their distribution will approach the
normal as n becomes large enough.
Empirically, if we look at 1-day returns, we can reject normality for returns from
almost any asset class. (We will look at this in the following.)

But if we look at monthly returns, for many financial time series we no longer
can reject normality.

And for yearly returns, most time series generate distributions that are indis-
tinguishable from the normal one.
Exercise 2 What do you think stands in the way of the conditions for the
theorem applying?
10 Maximum Entropy*
Entropy: it characterizes the level of disorganization, or unpredictability, of the information content of a system.

The entropy of a distribution, H(Φ), is defined as

H(Φ) = − ∫ ϕ(x) log[ϕ(x)] dx    (7)

where ϕ(x) is the pdf of the distribution.
It looks like a strange beast: the expectation of the logarithm of the density
function itself.

We won’t go into why the (differential) entropy is defined this way. The
interesting thing is that maximizing the entropy of a distribution, given some
knowledge about the distribution, means finding the ‘least committal’ distribution
compatible with what we know.
What does ‘least committal’ mean? Suppose that I only know that my
random variable is distributed between a and b. I have no clues about its
mean, its variance, or anything else.

Then the ‘least committal’ distribution must be the uniform, U [a, b].

One can prove that this is indeed the case using variational calculus.

The very nice thing is that the Gaussian distribution is the least committal (the
maximum entropy) distribution if we only know the mean and variance of
a distribution!

If you want, we can talk more about this, because I love the topic, and
ME is extremely powerful, but it would take us a bit off course.
11 A Few Important Distributions

11.1 Gaussian (Normal)

It is fully described by two parameters: one location parameter (the mean, µ),
and one shape parameter (the variance, σ 2).

Its pdf is given by

ϕ_Norm(x) = (1/(√(2π) σ)) exp(−½ ((x − µ)/σ)²)    (8)
For many purposes, it is useful to define the variable z,

z ≡ (x − µ)/σ    (9)

Subtracting the mean and dividing by the standard deviation creates standard-
ized normal draws, which, clearly, are drawn from a standard Normal distribu-
tion, ie, from a Normal distribution with mean zero and variance 1.
11.2 Chi Squared

If the variables z_i are standard Normal variates, then the quantity G_k,

G_k ≡ Σ_{i=1,k} z_i²    (10)

is distributed according to the chi-squared distribution with k degrees of
freedom.

Figure (4) shows its shape.

The chi squared distribution comes in useful to ‘estimate’ an unknown variance,


but we won’t go into that.

It is even more relevant for us, because of its links with the Student-t distribu-
tion.
11.3 Student-t Distribution

This is a very useful distribution to model distributions with fat tails, as we


shall see presently.

If

• variable Z has a standard normal distribution,

• variable G has a chi-square distribution with k degrees of freedom, and

• Z and G are independent


Figure 4: The pdf of the chi-squared distribution for different values of the
degrees of freedom.
then the variable X, defined by

X ≡ Z / √(G/k)    (11)

has a Student-t distribution.
The Student-t distribution allows for fatter tails than the Normal distribution.

The lower the number of degrees of freedom (look again at the effect of the
degrees of freedom on the Chi-squared distribution), the more peaked the pdf,
and the fatter the tails.

See Fig (5).


For a number of degrees of freedom greater than, say, 30, the differences between
the Student-t and the Normal distributions become very small.

See Fig (6).


Figure 5: The pdfs of Student-t distributions for different values of the dof.
12 Stable Distributions

The Student-t distribution can fit reasonably fat-tailed empirical distributions,


but cannot handle skewness.

If skewness is important, we must resort to a member of the family of Stable


distributions.

Also, if we need really fat tails, then the Student-t distribution is not going to
take us very far, and Stable disributions can do a better job.
Figure 6: The convergence of the Student-t distribution to the normal, as the
dof increase.
Stable distributions have nice properties, because the sum of independent,
identically-distributed, stable-distributed random variables has the same
distribution (up to location and scale) as the component variables.

So, the CLT can be extended to stable distributions.

Indeed, the Normal distribution is one member of the stable family.


Stable distributions are described by four parameters: two shape parameters (α
and β), one scale parameter (γ), and one location parameter (δ).

More precisely,

• parameter α controls the tails of the distribution; 0 < α ≤ 2. The smaller


the parameter α, the fatter the tails.

• parameter β controls the skewness of the distribution: for β = 0 the


distribution is symmetric; for β < 0 it is left-skewed; for β > 0 it is
right-skewed; −1 ≤ β ≤ 1;

• parameter γ plays a role similar to the standard deviation;


• parameter δ specifies where the distribution is located (pretty much as
the parameter µ does for the Normal distribution).

The two parameters α and β ‘interact’: when α is small, then β can induce a
lot of skewness; when α is high, the effect of β decreases.

See Figs (7) and (8).


The Normal distribution, as mentioned, is a stable distribution, and is associated
with α = 2.

More precisely, a Normal distribution is characterized by

Norm(µ, σ) = Stable(2, 0, σ/√2, µ)    (12)

It is symmetric, with mean, median and mode all equal.


Figure 7: The effect of the tail parameter, α.
Figure 8: The effect of the skew parameter, β.
The Cauchy distribution is also symmetric, but has heavier tails than the Nor-
mal.

It is given by
Cauchy (δ, γ) = Stable (1, 0, γ, δ) (13)

The Cauchy is one of the few stable distributions for which the pdf has an
analytic form:
ϕ_Cauchy(x; δ, γ) = (1/π) · γ / (γ² + (x − δ)²)    (14)
A stable distribution which has both fat tails and skewness is the Levy distribution,
given by

Levy(δ, γ) = Stable(0.5, 1, γ, γ + δ)    (15)

Its pdf is given by

ϕ_Levy(x; δ, γ) = √(γ/(2π)) · exp(−γ/(2(x − δ))) / (x − δ)^{3/2},   δ < x < ∞    (16)

Note the distribution is only defined for values of x greater than δ.

These are examples of the kind of shapes that can be obtained with the Normal,
Cauchy and Levy distributions.

See Fig (9).


13 How Does It Work in Practice?

We are going to use real data (S&P500, FTSE100, DAX, 10y, 5y and 2y
US$ swap rates) from 2005 to 2016, and fit the daily returns to several of the
distributions we have seen above.

This is what we get.


Figure 9: Comparison of Normal, Cauchy and Levy distributions.
13.1 S&P500

Let’s start with the S&P. See Fig (10).


Figure 10:
Looking at parameters of the Student-t distribution, we note the low value for
the degrees of freedom, indicating fat tails.

Looking at the Stable distribution, we note that the leptokurtosis is not ex-
cessive (α = 1.53 , not too far from 2, which corresponds to the normal
distribution), and that there is a left skew.

These are the fits obtained with the Normal, the Student-t and the Stable
distributions:
13.2
Figure 11:
13.3 FTSE 100
Figure 12:
Figure 13:
Figure 14:
In the case of the FTSE index, the distribution appears less fat-tailed than for the
S&P500 (the number of dof for the Student-t is higher, and the α parameter
for the Stable distribution is closer to 2).

There is again left-skewness.


13.4 2-Year Swap Rate
Figure 15:
Figure 16:
Figure 17:
Figure 18:
14 Observations

These are returns from linear positions.

Little skewness.

This doesn’t mean that this is always going to be the case.

Strategies and options may give rise to very skewed return distributions.
Figure 19:
Figure 20:
Figure 21:
15 Multivariate Distributions

So far we have looked at one random variable at a time. What if I had two?
Or three? Or seven?

Most of the reasoning so far carries through with little change.

So, for instance, for two random variables, X and Y , we can define the bivariate
cumulative distribution function, Φ2 (x, y), as the probability that the draw of
X will be less than x and the draw of Y will be less than y:

Φ2 (x, y) = Prob (X ≤ x, Y ≤ y) (17)


If X and Y are defined over the real line then the domain of Φ2 (·) is R2 and
the range is, again, [0, 1].

Fig (22) shows a CDF for a bivariate normal distribution function.


If Φ2(x, y) is continuous everywhere and partially differentiable, we define the
bivariate density, ϕ2(x, y), as the cross partial derivative of the bivariate CDF:

ϕ2(x, y) = ∂²Φ2(x, y) / ∂x∂y    (18)

The density has the property that, when integrated (‘summed’) over all the
possible values of the X and Y variables, it gives 1:

∫∫ ϕ2(x, y) dx dy = 1    (19)

The quantity ϕ2 (x, y) dxdy gives the probability of drawing for variable X a
value between x and x + dx and for variable Y a value between y and y + dy.
Figure 22: The CDF for the bivariate Normal distribution
16 Dependence Among Variables

Let’s go back to the definition of a bivariate CDF:

Φ2 (x, y) = Prob (X ≤ x, Y ≤ y) (20)

Suppose that the variable X is a draw from the S&P returns, and Y is a draw
from the DOW-Jones returns.

If the draw for the S&P was, say, large and positive, the draw for the DOW-
Jones is very unlikely to have been large and negative.

If, instead, the variables X and Y had been the S&P returns and the tem-
perature in Paris, a large/small/middling realization of the S&P return has no
bearing on the draw of the temperature in Paris.
One simple way to describe the strength of this ‘linkage’ between variables is
via the coefficient of correlation, ρ.

For some distributions (the elliptical ones, of which the normal distribution is
a prime example) the coefficient of correlation is all that is needed to describe
the dependence between two (or, for that matter, umpteen) variables. This can
make life very easy.

For the case of bivariate Normal draws with means µx and µy, variances σx²
and σy², and correlation ρ, the density function is given by

ϕ_{2-Norm}(x, y) = 1/(2π σx σy √(1−ρ²)) · exp(−z/(2(1−ρ²)))    (21)

with

z = ((x − µx)/σx)² + ((y − µy)/σy)² − 2ρ (x − µx)(y − µy)/(σx σy)    (22)

This (Fig (23)) is what the 3-d plot looks like for zero correlation:
3-d plots are stunning, but contour plots are more useful to visualize the degree
of dependence.

Figs (24) and (25) show the contour plots for a bivariate Normal with correlation
of 0.6 and -0.6, respectively.
Figure 23: A bivariate Gaussian density for ρ = 0.
17 Marginal Distributions

Take any bivariate density. It is a function of, say, x and y: ϕ2 (x, y).

What do we get if we integrate over all the possible values of, say, y?

We obtain the marginal univariate distribution of the variable x:

ϕx(x) = ∫ ϕ2(x, y) dy    (23)

Conversely,

ϕy(y) = ∫ ϕ2(x, y) dx    (24)
Figure 24: Contour plot for a bivariate Normal density for ρ = 0.6.
Figure 25: Contour plot for a bivariate Normal density for ρ = −0.6.
When we ‘marginalize’ a joint distribution, we lose all information about the
variable(s) we integrate over, and about their dependence with the other vari-
able(s).

For normal variates, we can also go the other way: given two univariate distri-
butions, we can combine them to form an infinity of bivariate normal densities,
one for each value of the correlation, ρ.
We can see this at work in Figs (24) and (25).

Both the bivariate distributions that produced the contour plots for the bivariate
Normal densities in Figs (24) and (25) were obtained from the same ‘marginals’,
ie, from two univariate Normal distributions, with µ1 = µ2 = 0 and σ1 = 0.1,
σ2 = 0.2, but with different correlation coefficients (0.6 and −0.6).
18 Conditional Distributions

Take again any bivariate density.

It is a function of, say, x and y: ϕ2 (x, y).

Now fix one particular value of, say, y: y = ȳ.

Once y is so fixed, the function ϕ2(x, ȳ) becomes a function of the single
variable x; after normalization (dividing by the marginal ϕy(ȳ)) it gives the
conditional density, ϕ_cond(x|ȳ).

This function gives the probability of x, given that we know that y has the
value ȳ.

Similarly, one can define ϕ_cond(y|x).


Now, one important relationship:

if we multiply the probability of x given y, times the probability of y, we clearly
get back the probability of x and y, ie, the joint probability, ϕ2(x, y).

So, we have
ϕ2 (x, y) = ϕcond (x|y) ϕy (y) (25)

The joint probability is given by the product of the conditional times the mar-
ginal.
19 Expectations

Consider two random variables, x and y.

If we know nothing about the values of x, what is the expectation of y?

It is just given by
E[y] = ∫∫ y ϕ2(x, y) dx dy    (26)
But let’s assume that we know that X had the value x. What is the expectation
of y now?

It has become
E[y|x] = ∫ y ϕ_cond(y|x) dy    (27)

Now, if the variables x and y are jointly normally distributed, a beautiful result
follows, namely:
E [y|x] = α + βx (28)
where α and β are the intercept and slope of a linear regression of y on
x.
19.1 Proving the Result*

The proof is very easy.

Suppose that this is true, ie, that the conditional expectation of y given x is a
linear (affine, really) function of x:

E [y|x] = α + βx (29)

Multiply both sides by ϕx (x) and integrate with respect to x.

We get

∫ E[y|x] ϕx(x) dx = α ∫ ϕx(x) dx + β ∫ ϕx(x) x dx    (30)

(recall from Equation (27) that E[y|x] = ∫ y ϕ_cond(y|x) dy)
Substituting from Equation (27) we get

∫ [∫ y ϕ_cond(y|x) dy] ϕx(x) dx = α ∫ ϕx(x) dx + β ∫ ϕx(x) x dx

The term in square brackets is E[y|x], the integral multiplying α equals 1, the
one multiplying β equals E[x], and the left-hand side is ∫∫ y ϕ2(x, y) dx dy = E[y].
Therefore

E[y] = α + β E[x]    (31)
From this we get
α = E [y] − β E [x] (32)

Substituting back gives:

E [y|x] = α + βx = E [y] + β (x − E [x]) (33)


which says that the conditional expectation of y is equal to

• the unconditional expectation of y, E [y], plus

• a term proportional to the difference between x and its expected value


(with the proportionality constant given by β).
We still have to determine β.

Go back to E [y|x] = α + βx, and this time multiply both sides by xϕx (x)
and then integrate with respect to x.

We have

∫ E[y|x] x ϕx(x) dx = α ∫ x ϕx(x) dx + β ∫ ϕx(x) x² dx    (34)

which gives (remembering that E[y|x] = ∫ y ϕ_cond(y|x) dy and that
ϕ_cond(y|x) ϕx(x) = ϕ2(x, y))

∫∫ xy ϕ2(x, y) dx dy = α E[x] + β E[x²], ie,

E[xy] = α E[x] + β E[x²]    (35)


Go back to E [y] = α + β E [x] and multiply both sides by E [x], to obtain

E[x] E[y] = α E[x] + β (E[x])²    (36)

Subtract (36) from (35), to obtain

E[xy] − E[x] E[y] = β (E[x²] − (E[x])²)    (37)

and therefore

β = (E[xy] − E[x] E[y]) / (E[x²] − (E[x])²) = Cov(x, y) / var(x)    (38)
because, for any a and b,

E [ab] = E [a] E [b] + cov (a, b) (39)


QED.
Note: if I impose that the conditional expectation of y given x should be a
linear function of x, E[y|x] = α + βx, then the results obtained so far are
general, and do not require any assumption of normality.

However, why would we make this linearity assumption? Is it reasonable to do


so?

It turns out that, if x and y are jointly normally distributed, then one can show
that, indeed, the conditional expectation of y given x is exactly a linear (affine)
function of x.
20 Conditional Probability

When we ask for a conditional probability, we ask for the probability of some-
thing happening, given that we are told that something else has happened.

The best way to understand conditional probabilities is via Venn diagrams.

Here is one.
Figure 26:

The definition of the conditional probability of A given B, (P (A|B)), is


P(A|B) = P(A ∩ B) / P(B)    (40)
where P (A ∩ B) is the joint probability of A and B.
The probability of A and B both occurring is never larger than the probability
of just B happening.

So we can rest assured that the ratio P(A ∩ B)/P(B) is never larger than 1.
We can also write:
P (A ∩ B) = P (A|B) P (B) (41)
which is a chain rule for probabilities: the joint probability of A and B is equal
to the conditional probability of A given B, times the probability of B.
It works for as many variables as you want.

To be quicker, let’s write

P (A ∩ B) ≡ P (A, B) (42)
and, for many variables,

P (A ∩ B ∩ ... ∩ Z) ≡ P (A, B, ..., Z) (43)


Then we have
P (A, B, C) = P (A|BC) P (B, C) (44)
But
P (B, C) = P (B|C) P (C) (45)
and therefore

P (A, B, C) = P (A|BC) P (B|C) P (C) (46)

So a joint probability is the product of a bunch of conditionals, each of lower


and lower order, times a marginal (P (C)), which is a degenerate conditional
probability anyhow.
21 Copulas

So far we have looked at the correlation coefficient as the measure of depen-


dence between two variables, and at the correlation matrix as the measure of
(pairwise) dependence between a collection of variables.

For a class of distributions — called the ‘elliptical’ distributions — correlations tell
us everything there is to know about codependence.

However, this is not in general true.


Remember:

Given a joint distribution, I can always associate a unique set of marginals and
a correlation matrix.

However, the converse is not true: many different joint bivariate distributions
share the same two marginals and the same correlation matrix, so the joint
distribution cannot, in general, be recovered from them.

How do we move beyond correlations?

By introducing copulas.
So what are copulas?

We know that, given a multivariate (joint) probability distribution, we can


always obtain its marginal (univariate) distributions.

We have seen how to do it above (by integration / summation).

Copulas are special multivariate probability distributions.

They are those multivariate distributions that have for marginals uniform U [0, 1]
distributions.
Why are they important?

Because of Sklar’s theorem, which says that any multivariate distribution can be
expressed as a function of

• its own univariate marginal distributions, and

• a copula (ie, of a special multivariate distribution that has for its own
marginals uniform U [0, 1] distributions.)
More precisely, let F (x1, x2, ..., xn) be the cumulative distribution for variables
{x1, x2, ..., xn}.

This means that F (x1, x2, ..., xn) is the function such that

F (x1, x2, ..., xn) = Prob [X1 ≤ x1, X2 ≤ x2, ..., Xn ≤ xn] (47)

Let Fi (xi) be the n marginals associated with the cumulative distribution


F (x1, x2, ..., xn).
Then Sklar’s Theorem assures me that there exists some copula, C, such that
the original cumulative distribution, F , is given by the application of the copula,
C, to the uniform variates generated by the n marginals, Fi (xi):

F (x1, x2, ..., xn) = C (F1 (x1) , F2 (x2) , ..., Fn (xn))

Here is how this theorem can be used.


Take two variables for simplicity, x and y.

Suppose that we have a random vector of realizations for x ({x1, x2, ..., xn})
and a vector of realizations for y ({y1, y2, ..., yn}).

Let Fx be the cumulative distribution of x, and Fy for y.

We have learnt that, if we put the realizations {x} into Fx and the realizations
{y} into Fy we get back new random variables, ux and uy , drawn from U [0, 1].
So, let’s do it:
{ux1, ux2, ..., uxn} = {Fx(x1), Fx(x2), ..., Fx(xn)}

{uy1, uy2, ..., uyn} = {Fy(y1), Fy(y2), ..., Fy(yn)}

We have created new random variables, ux and uy , whose joint distribution


is the copula.

Why so? Because the marginals of this joint distribution are uniform distribu-
tions — and this is just the definition of a copula.

This decomposition is very useful, because the copula contains all the infor-
mation about the dependence among the variables, and no information about
their marginals.

The marginals have no information about the dependence.


There is more.

Suppose that I have determined the copula, C, ie, the joint distribution function
of the uniform variates, ux and uy :
C (ux, uy ) = Prob [U x ≤ ux, U y ≤ uy ] (48)

Then, if we find a way to generate samples {ux, uy } from the copula joint
distribution, one can generate a sample from the arbitrary original distributions,
Fx and Fy , simply by doing
x = Fx^{−1}(ux)    (49)

y = Fy^{−1}(uy)    (50)

where Fx^{−1} and Fy^{−1} are the inverses of Fx and Fy, respectively.

So, everything boils down to choosing a good copula.


22 An Example

Consider the following bivariate joint density, shown in Fig (27):


Figure 27:
These are the two innocuous-looking associated marginals, shown in Figs (28)
and (29):
If we calculate the covariance and correlation matrices for these two variables,
we get (after rounding)

cov = Σ = [  0.1    −0.05
            −0.05    0.25 ]    (51)

corr = ρ = [  1      −0.1
             −0.1     1   ]    (52)
Figure 28:
Figure 29:
Suppose that we assume that the copula was a Gaussian copula.

This means that

C(ux, uy) = Φnorm(Φx,norm^{−1}(ux), Φy,norm^{−1}(uy))    (53)

It also means that the joint density created by the Gaussian copula will be a
Bivariate Normal density.

This is what we get. See Fig (30).


Clearly there are big differences between the original density and the synthetic
density produced by the Gaussian copula, as Fig (31) clearly shows:
Figure 30:
These big differences are present despite the fact that the marginals are exactly
recovered, and the correlation coefficients (and the covariance elements)
between the synthetic Gaussian variables were exactly the same as in
the original data.

Why did this happen?


Figure 31:
Because we did not choose a good copula.

Look back at Fig (27): look at the contours, which bulge in and out, and
which cannot be turned into circles by rescaling the axes (an indication that
the distribution is not elliptical).

Clearly, for a dependence as complex as this, we cannot hope for the simple
correlation coefficient to do an adequate job.
The really important message is that, even with the marginals and the co-
variance matrix and the correlation matrix all perfectly recovered, the resulting
bivariate density can be very different from the true one.

Fitting to the correct copula is really important.

By the way, this is how the data were created in MATLAB. Not elegant, but
it does the job.

The complex density is obtained in Ftot (the equally-weighted mixture of the
two Gaussian densities F1 and F2). You can play with it yourselves.
% Grid over which the densities are evaluated
x1 = -3:.15:3; x2 = -3:.15:3;
[X1,X2] = meshgrid(x1,x2);

% First Gaussian component (positive correlation)
mu = [0, 0];
Sigma = [.25, 0.3; 0.3, 1];
F1 = mvnpdf([X1(:) X2(:)],mu,Sigma);
F1 = reshape(F1,length(x2),length(x1));
surf(x1,x2,F1);
caxis([min(F1(:))-.5*range(F1(:)),max(F1(:))]);
axis([-3 3 -3 3 0 .4])
xlabel('x1'); ylabel('x2'); zlabel('Probability Density');

% Second Gaussian component (negative correlation)
figure
mu = [0, 0.1];
Sigma = [.25, -0.4; -0.4, 1];
F2 = mvnpdf([X1(:) X2(:)],mu,Sigma);
F2 = reshape(F2,length(x2),length(x1));
surf(x1,x2,F2);
caxis([min(F2(:))-.5*range(F2(:)),max(F2(:))]);
axis([-3 3 -3 3 0 .4])
xlabel('x1'); ylabel('x2'); zlabel('Probability Density');

% A single Gaussian with (roughly) matching moments
figure
mu = [0, 0.05];
Sigma = [.25, -0.05; -0.05, 0.98];
Fs = mvnpdf([X1(:) X2(:)],mu,Sigma);
Fs = reshape(Fs,length(x2),length(x1));
surf(x1,x2,Fs);
caxis([min(Fs(:))-.5*range(Fs(:)),max(Fs(:))]);
axis([-3 3 -3 3 0 .4])
xlabel('x1'); ylabel('x2'); zlabel('Probability Density');

% The 'complex' (non-elliptical) density: an equally-weighted mixture
omega1 = 0.5; omega2 = 0.5;
Ftot = omega1 * F1 + omega2 * F2;

23 Building Intuition with Copulas

Take two variables, x and y.

Let F(x, y) be their CDF, and Fx(x) and Fy(y) their marginals.
We know that

F(x, y) = C(Fx(x), Fy(y)) = C(u, v)    (54)

where u = Fx(x) and v = Fy(y).

What can a plausible copula look like?


Figure 32:

Consider, for instance,

C(u, v) = uv / [1 − θ(1 − u)(1 − v)]    (55)
This is what it looks like:
Does it make sense? Could this be a possible copula? A plausible copula?

First of all, the copula function gives a joint cumulative distribution. So, it
must return values between 0 and 1.

Let’s look at its behaviour.

Suppose that I have chosen a very large value of x. Then the probability that
a draw of X will give a value smaller than x is very high. Let’s say

P [X ≤ x] = Fx (x) = 0.99 (56)


Suppose that I have chosen a very small value of y. Then the probability that
a draw of Y will give a value smaller than y is very low. Call it ǫ.
What is the probability of a joint draw of X and Y such that X ≤ x and
Y ≤ y?

Irrespective of whether X and Y are concordant, the joint probability will also
be very low (because a finite number times a vanishingly small number must
be equal to almost zero).
Take the standard Gaussian distribution, and Gaussian marginals.

Then
P [X ≤ 5, Y ≤ 5] = F (5, 5) ≃ 1 (57)
and
Fx (5) ≃ 1 (58)

Fy (5) ≃ 1 (59)

But we know

F(5, 5) = C(Fx(5), Fy(5)) ≃ C(1, 1) ≃ 1    (60)
So, irrespective of whether X and Y are positively or negatively correlated, as
the cumulative marginals tend to 1 together, the joint distribution must tend
to 1, and so must the copula as its two inputs tend to 1.
24 Building a Copula by Hand (Empirical Copulae)

We have obtained N pairs of random variables.

We have built their empirical cumulative distributions, obtaining variables u and
v (which are individually uniformly distributed).

This is a list of the realizations of u and v (just a snippet):


Note I have sorted by the first random variable.

Now I am going to construct the empirical bivariate cumulative distribution,
remembering the definition,

Φ(u, v) = P(U ≤ u, V ≤ v)    (61)

and approximating it as:

Φ(u, v) = P(U ≤ u, V ≤ v) ≃ #(U ≤ u, V ≤ v) / N    (62)

This is what we get (the number count in the numerator first, then the whole
calculation; N = 45):
Figure 33:
Figure 34:

We have built by hand the empirical copula.

Exercise: write a piece of code that builds the empirical copula.


Figure 35:
Figure 36:
